Update 2skrub #2

dcolinmorgan · 2023-09-07T03:56:55Z

* Use handle_unknown=ignore in SuperVectorizer
  Change default `low_card_cat_transformer` in SuperVectorizer to use handle_unknown="ignore"
* Update changelog
* Change drop to None
* Fix bug for new categories for categorical columns
  Pandas `category` dtype conversion converts new categories to nans, so we now update the list of categories before converting.
* Fix test to prevent n_samples < n_components
* Update dirty_cat/_super_vectorizer.py
  Co-authored-by: Jovan Stojanovic <[email protected]>
* Convert all categorical columns to object dtype inside SuperVectorizer
  This avoids dealing with the categories attached to the dtype.
* Put back drop="if_binary"
  And use handle_unknown="error" for sklearn < 0.24.2.
* Revert "Convert all categorical columns to object dtype inside SuperVectorizer"
  This reverts commit 34ed05f.
* finish merge
* change name in CHANGES.rst
* Change min version for handle_unknown=ignore to 1.0.0 and change the warning message to be more informative.
* warning stacklevel + fix name
* replace sup_vec by table_vec

---------

Co-authored-by: Jovan Stojanovic <[email protected]>
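
The note above about pandas `category` dtype conversion turning new categories into NaN can be reproduced with plain pandas. The sketch below is only an illustration of that behavior and of one way to extend the category list before converting; the variable names are hypothetical and this is not the SuperVectorizer code itself.

```python
import pandas as pd

# Column as seen during fit: categories are ["a", "b"].
train = pd.Series(["a", "b", "a"], dtype="category")

# Column seen later, containing the unseen value "c".
new = pd.Series(["a", "c"])

# Casting to the fitted dtype silently maps the unseen "c" to NaN.
naive = new.astype(train.dtype)

# Extending the category list first keeps "c" as a real value.
extended_dtype = pd.CategoricalDtype(
    train.cat.categories.union(new.unique())
)
kept = new.astype(extended_dtype)

print(naive.isna().tolist())
print(list(kept.cat.categories))
```

Whether unseen values should then be ignored or raise an error is left to the downstream encoder, which is what the `handle_unknown` setting in the same commit controls.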

* Add example with KEN Wikipedia embeddings * add fetching function * add pyarrow * move embeddings to datasets * divide full embeddings for lower memory usage * improve test * try resolving memory error * improve get_ken_embeddings * improve fetching logic * use pyarrow instead of pandas * add lighter type version * add lighter test files * fix test error * improve example * Update examples/07_ken_embeddings_example.py Co-authored-by: Gael Varoquaux <[email protected]> * Update dirty_cat/datasets/_ken_embeddings.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_ken_embeddings_example.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_ken_embeddings_example.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_ken_embeddings_example.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_ken_embeddings_example.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_ken_embeddings_example.py Co-authored-by: Gael Varoquaux <[email protected]> * add ken_embeddings to docs * add fetch_figshare to fetch_dataclass logic * update ylabels * fix typo * add correspondence table to get_ken_embeddings --------- Co-authored-by: Gael Varoquaux <[email protected]>

* More readable worker name * Use new codecov version * Use Python 3.10 for doc generation * Fix artifacts entry name * Use more recent versions of actions * Try using cleaner os names * Remove useless note * Use Python 3.9 instead of 3.10 because of parsing bug

* start * add return dic to monitor and add benchmark for fuzzy join with different encoders * revert fuzzy_join changes * fix bug: now return results are added to list * fix plot * fix plot + add benchmark * change doc * cleaning * cleaning * Description of benchmark results * Run pre-commit hooks * Remove unused imports * Remove unused code * Typos * Update benchmarks/utils/monitor.py Co-authored-by: Lilian <[email protected]> * Applying suggestions * Change the `monitor` decorator to return a tidy dataframe. * Add loading bar to the benchmark * Adapt script to new `monitor` decorator * Replace results of fuzzy_join_hash benchmark by the new tidy version (the plots are the same) * Adapt previous minhash benchmark to the new monitor function + new results * Add tqdm to benchmark requirements * Cleanup * Improve functionality and documentation --------- Co-authored-by: Lilian <[email protected]>

…b-data#504) * Clean docstrings * Clean docstrings * Fix types * Remove scalability example * Update dirty_cat/_similarity_encoder.py Co-authored-by: Jovan Stojanovic <[email protected]> * Fix return value of transform * Revert "Remove scalability example" This reverts commit 97cd1cf. --------- Co-authored-by: Jovan Stojanovic <[email protected]>

…rub-data#665) * create script * cache * Use loguru for logging, various code improvements, slightly better doc and messages * Fix condition * fix import bug * fix bug for empty evals * fix 0 features * improvements * Update benchmarks/run_on_openml_datasets.py Co-authored-by: Lilian <[email protected]> * import Counter * test commit * remove test commit * fix bug --------- Co-authored-by: Lilian <[email protected]>

* Improve framework * Add Gap divergence benchmark * Set initial iter values * Add omitted value (score) * Force keyword arguments and add progress saving * Minor fixes * Update to main * Add pyarrow to benchmark requirements * Implement cross-validation * Update README * Parallelize cross-validation * Fix attribute access * Fix attribute access * Fix unpacking * Fix results naming * Fix results bug * Multiple columns support and W_change plot v1 * Refactor dataset getters * Adapt getters usage to new format * Small fixes * New plots * Fix dataset categorization * Update used datasets * Add score per inner iteration plot * Add benchmark results * Compute the score after tuning * Add issue link for `road_safety` * Update results * Update benchmarks/utils/monitor.py Co-authored-by: LeoGrin <[email protected]> --------- Co-authored-by: LeoGrin <[email protected]>

* benchmark * fix bug due to mixed type * verbose * test * fix bug * add benchmark results * add balanced accuracy * Update benchmarks/bench_gap_es_score.py Co-authored-by: Lilian <[email protected]> * remove prints * run with the same batch size * benchmark results with the same batch size --------- Co-authored-by: Lilian <[email protected]>

* rename to Joiner * complete renaming * add many to many join support * update changelog * update docstring * update docstring * fix test * fix init * fix changelog * fix example * modify test * Update CHANGES.rst Co-authored-by: Lilian <[email protected]> * Update examples/04_fuzzy_joining.py Co-authored-by: Lilian <[email protected]> * Update skrub/_joiner.py Co-authored-by: Lilian <[email protected]> * Update skrub/_joiner.py Co-authored-by: Lilian <[email protected]> * Update skrub/_joiner.py Co-authored-by: Lilian <[email protected]> * fix tests * pre-commit * fix docstring * renaming * update changelog * apply suggestions * fix index * add new example * new example * add flight example * update width * add figure * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <[email protected]> * apply suggestions * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <[email protected]> * simplify text * update figure display * remove figure * Remove leftover blank lines * add conclusion * fix conclusion * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <[email protected]> * remove attribute * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_multiple_key_join.py 
Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <[email protected]> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <[email protected]> * apply suggestions * drop id cols * add list flexibility to joiner * add fetching function * temp raise timeout time * simplify Joiner signature * revert name temporarily for euroscipy * update joiner to accept tuples --------- Co-authored-by: Lilian <[email protected]> Co-authored-by: Gael Varoquaux <[email protected]>

* working sped up version * benchmark hyperparameters * speedup benchmark * fix bug due to mixed type * add benchmark results * changelog * big speedup due to computing Ht@W only when Vt is non-zero * change naming in _special_sparse_dot * change benchmark * speedup benchmark * fix bug * add benchmark and change default * change default * Update skrub/_gap_encoder.py Co-authored-by: Lilian <[email protected]> * changelog * add test for max_no_improvement=None * Vincent's suggestions * add verbose=True to tests for coverage * Apply suggestions from code review Co-authored-by: Lilian <[email protected]> * use loguru * remove global * pre-commit * fix changelog * Apply suggestions from code review Co-authored-by: Lilian <[email protected]> --------- Co-authored-by: Lilian <[email protected]>
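
The "computing Ht@W only when Vt is non-zero" item above describes restricting a matrix product to the positions where another matrix has non-zero entries. A minimal NumPy sketch of that idea follows; the function name `dot_at_nonzero` and the dense inputs are assumptions made for illustration, and this is not the `_special_sparse_dot` implementation from the commit.

```python
import numpy as np


def dot_at_nonzero(H, W, V):
    """Return (H @ W) evaluated only where V is non-zero, zeros elsewhere."""
    rows, cols = np.nonzero(V)
    # One dot product per non-zero entry: row of H with matching column of W.
    vals = np.einsum("ij,ij->i", H[rows], W[:, cols].T)
    out = np.zeros(V.shape, dtype=float)
    out[rows, cols] = vals
    return out


rng = np.random.default_rng(0)
H = rng.random((4, 3))   # activations: 4 samples, 3 components
W = rng.random((3, 6))   # dictionary: 3 components, 6 features
V = np.zeros((4, 6))     # mostly-zero count matrix
V[0, 1] = 2.0
V[3, 4] = 1.0

result = dot_at_nonzero(H, W, V)
```

With only two non-zero entries in `V`, this computes two length-3 dot products instead of the full 4 x 6 product, which is where the saving comes from when `V` is very sparse.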

* add build doc file to workflow * add save cache * remove build doc file * modify example to test * test caching * getting new cache * another attempt * fix data path * fix data path * fix path to data * cache attempt * generate cache * should be using cached data * test all examples [doc build] * test with cache [doc build]

LilianBoulard and others added 30 commits February 14, 2023 13:20

Fix CI issue (skrub-data#491)

8e193a4

* Try fix * Fix miniconda path

Set default value to None for similarity in SimilarityEncoder (skrub-…

90ea4db

…data#492)

DOC: Add a see-also (skrub-data#496)

6ae2511

* DOC: Add a see-also * period

Refactor fetching tests (skrub-data#489)

c0e3032

* Refactor fetching test * Fix missing argument in fetch_employee_salaries * Fix URL * Fix mock * Fix OpenML URL * Fix tests * Validate shape

Parallelize parametrized tests (skrub-data#493)

4d9f555

* Add pytest-xdist to parallelize parametrized tests * Clean test * Try with logical CPUs * Stick with auto

Support paths as strings in public fetching API (skrub-data#453)

72be1af

* Add target directory as optional argument * Add changelog entry * Support str in public fetching API * Update changelog entry * Make paths absolute * Remove unnecessary path conversion in test

Revert "Support paths as strings in public fetching API (skrub-data#453…

99ede6f

…)" (skrub-data#498) This reverts commit 72be1af.

Prepare for 0.4 (skrub-data#499)

ea009b9

Fix long CI install (skrub-data#494)

c7331ec

* Try using pep517 * Try using only binaries * Bump numpy versions * Bump scikit-learn versions * Bump numpy version for Python 3.10

Update release process following release 0.4

17b270b

typo

5a76096

Support paths as strings in public fetching API (skrub-data#501)

76a1c36

DOC Fix broken binder (skrub-data#506)

2aa552e

DOC Fix broken binder (skrub-data#507)

b9e09ef

* DOC fix broken binder * fix binder config * remove requirement in doc * to jupyter lab * update conf.py * restore conf.py * add dirty-cat

FIX fuzzy_join AttributeError (skrub-data#509)

4122a67

* FIX fuzzy_join AttributeError * add test

ENH Improve coverage of GapEncoder (skrub-data#434)

981c00c

* Improve coverage of GapEncoder * remove comments * restore get_feature_names * improve docstring

Fix CI not running tests (skrub-data#516)

6fd6c92

FIX Array memory error of fuzzy_join (skrub-data#512)

a0ba34e

* Add smaller n_features * remove scalability example * smaller number of features * test

Fix duplicate CI workers (skrub-data#526)

c245577

* Try fix * Try alternative fix * Do it correctly * Follow naming conventions

FIX Examples on dev version of website (skrub-data#528)

e1e5afd

* test * modify example * revert change in conf.py * revert version * update version

Shorten fetching tests (skrub-data#524)

61c78b4

* Shorten fetching tests * Add OpenML package to dev requirements * Improve doc * Allow building from source for liac-arff * Make OpenML import optional

ENH fuzzy_join takes missing values into account as in pandas.merge (s…

6d5a360

…krub-data#522) * fuzzy_join takes missing values into account * add to major changes

FIX Error when wrong type of fuzzy_join parameter (skrub-data#534)

4fc4fc7

* add error message for match_score * wrong error type * fix tests

add _set_drop_idx for later scikit-learn versions (skrub-data#532)

721aa13

ENH Add warning fuzzy join on missing values (skrub-data#529)

e3897b3

* ENH Add warning fuzzy join on missing values * update changes.rst

LilianBoulard and others added 28 commits August 1, 2023 14:41

Rework DatetimeEncoder example (skrub-data#683)

13698bf

* Rework datetime encoder example * Apply suggestions from code review Co-authored-by: Jovan Stojanovic <[email protected]> --------- Co-authored-by: Jovan Stojanovic <[email protected]>

Merge underfilled_job_title with employee_position_title in `empl…

1a6f50f

…oyee_salaries` (skrub-data#581) * Add `overload_job_titles` parameter to `fetch_employee_salaries` * Add changelog entry * Fix path

docs(install): add '' around pip install -e .[dev] (skrub-data#695) (s…

f989fba

…krub-data#699)

FIX PosixPath object has no attribute 'startswith' (skrub-data#703)

b2b3a7c

* path as string * install sphinx gallery from source * add package name

enhance README.rst (skrub-data#711)

1425b6c

TimeSeriesSplit definition in example 3 is duplicated (skrub-data#714)

1ea64a9

hotfix code-snippet in README.rst

7a77fd5

The code snippet rendering in README.rst is wrong because it is written as markdown language. This commit uses RST code-block syntax instead.

replace np.random with rng (skrub-data#715)

bc7f4d6

DOC hotfix url page missing in our example 3 (skrub-data#718)

9d2e64f

Enhance CONTRIBUTING.rst (skrub-data#713)

8874c2f

* rephrase contributing * some further simplification * Apply suggestions from code review Co-authored-by: Jovan Stojanovic <[email protected]> --------- Co-authored-by: Jovan Stojanovic <[email protected]>

DOC Add plot to example 7 (skrub-data#700)

1795701

* add pipeline plot * add plot of arrival delays * Update examples/07_multiple_key_join.py Co-authored-by: Vincent M <[email protected]> --------- Co-authored-by: Vincent M <[email protected]>

Fix Badges on README.rst (skrub-data#721)

d47dc78

(Merging this quickly to fix the README)

Add "packaging" as a requirement (skrub-data#712)

89f2a78

add missing keyword arguments (skrub-data#722)

a6fa1aa

rename file (skrub-data#697)

3a0a5b4

fix_few_typos (skrub-data#724)

5260b3d

Remove warnings from 01_dirty_categories.py (skrub-data#705)

61ab6db

Co-authored-by: Kaźmierczak Patryk <[email protected]>

Add built-in column-specific transformers to TableVectorizer (skrub…

9f4ca19

…-data#583) Co-authored-by: Gael Varoquaux <[email protected]> Co-authored-by: Vincent M <[email protected]> Co-authored-by: Jovan Stojanovic <[email protected]>

MAINT Fix feature_name warning during transform for `MinHashEncod…

b611d43

…er` (skrub-data#725)

Faster example 07_multiple_key_join (skrub-data#727)

1cd57e3

* fix restructured text * Make linter happy Let's see if the formatting is still good * Decrease the number of data points For faster example; it decreases the accuracy from 0.6 to 0.58 but is much faster

MAINT Fix pandas nightly (skrub-data#728)

dcf6061

* replace set np.NaN by pd.NA * Update skrub/_fuzzy_join.py --------- Co-authored-by: Gael Varoquaux <[email protected]>

track skrub

2d0176b

dcolinmorgan mentioned this pull request Sep 7, 2023

Fwiw, do we need to update to track latest dirty cat, as it has been a while? graphistry/pygraphistry#504

Open

dcolinmorgan closed this Sep 27, 2023
