Update 2skrub #2

Closed
wants to merge 160 commits into from

Conversation

dcolinmorgan
Collaborator

@dcolinmorgan dcolinmorgan commented Sep 7, 2023

LilianBoulard and others added 30 commits February 14, 2023 13:20
* Try fix

* Fix miniconda path
* Use handle_unknown=ignore in SuperVectorizer

Change default `low_card_cat_transformer` in SuperVectorizer to use handle_unknown="ignore"

* Update changelog

* Change drop to None

* Fix bug for new categories for categorical columns

Converting to the pandas `category` dtype turns categories unseen at fit time into NaNs, so we now update the list of categories before converting.

* Fix test to prevent n_samples < n_components

* Update dirty_cat/_super_vectorizer.py

Co-authored-by: Jovan Stojanovic <[email protected]>

* Convert all categorical columns to object dtype inside SuperVectorizer

This avoids dealing with the categories attached to the dtype.

* Put back drop="if_binary"

And use handle_unknown="error" for sklearn < 0.24.2.

* Revert "Convert all categorical columns to object dtype inside SuperVectorizer"

This reverts commit 34ed05f.

* finish merge

* change name in CHANGES.rst

* Change min version for handle_unknown=ignore to 1.0.0

and change the warning message to be more informative.

* warning stacklevel + fix name

* replace sup_vec by table_vec

---------

Co-authored-by: Jovan Stojanovic <[email protected]>
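
The "new categories" fix described in the commit above relies on a pandas behavior that is easy to reproduce in isolation. The following is a minimal sketch, not the SuperVectorizer code itself: the series `train` and `test` and the variable names are made up for illustration, and it only assumes standard pandas calls (`astype`, `CategoricalDtype`, `.cat.categories`).

```python
import pandas as pd

# Hypothetical data: "blue" never appears in the training column.
train = pd.Series(["red", "green", "red"], dtype="category")
test = pd.Series(["green", "blue"])

# Casting to the training dtype turns the unseen value into NaN.
naive = test.astype(train.dtype)

# Extending the category list before converting keeps the unseen value.
known = list(train.cat.categories)
extra = [value for value in test.unique() if value not in known]
extended_dtype = pd.CategoricalDtype(known + extra)
fixed = test.astype(extended_dtype)

print(naive.isna().tolist())
print(fixed.tolist())
```

In this sketch `naive` loses the "blue" entry while `fixed` keeps it, which is the situation the commit message describes.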
* Add example with KEN Wikipedia embeddings

* add fetching function

* add pyarrow

* move embeddings to datasets

* divide full embeddings for lower memory usage

* improve test

* try resolving memory error

* improve get_ken_embeddings

* improve fetching logic

* use pyarrow instead of pandas

* add lighter type version

* add lighter test files

* fix test error

* improve example

* Update examples/07_ken_embeddings_example.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update dirty_cat/datasets/_ken_embeddings.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_ken_embeddings_example.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_ken_embeddings_example.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_ken_embeddings_example.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_ken_embeddings_example.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_ken_embeddings_example.py

Co-authored-by: Gael Varoquaux <[email protected]>

* add ken_embeddings to docs

* add fetch_figshare to fetch_dataclass logic

* update ylabels

* fix typo

* add correspondence table to get_ken_embeddings

---------

Co-authored-by: Gael Varoquaux <[email protected]>
* DOC: Add a see-also

* period
* Refactor fetching test

* Fix missing argument in fetch_employee_salaries

* Fix URL

* Fix mock

* Fix OpenML URL

* Fix tests

* Validate shape
* Add pytest-xdist to parallelize parametrized tests

* Clean test

* Try with logical CPUs

* Stick with auto
* Add target directory as optional argument

* Add changelog entry

* Support str in public fetching API

* Update changelog entry

* Make paths absolute

* Remove unnecessary path conversion in test
* Try using pep517

* Try using only binaries

* Bump numpy versions

* Bump scikit-learn versions

* Bump numpy version for Python 3.10
* More readable worker name

* Use new codecov version

* Use Python 3.10 for doc generation

* Fix artifacts entry name

* Use more recent versions of actions

* Try using cleaner os names

* Remove useless note

* Use Python 3.9 instead of 3.10 because of parsing bug
* DOC fix broken binder

* fix binder config

* remove requirement in doc

* to jupyter lab

* update conf.py

* restore conf.py

* add dirty-cat
* FIX fuzzy_join AttributeError

* add test
* Improve coverage of GapEncoder

* remove comments

* restore get_feature_names

* improve docstring
* start

* add return dic to monitor and add benchmark for fuzzy join with different encoders

* revert fuzzy_join changes

* fix bug: now return results are added to list

* fix plot

* fix plot + add benchmark

* change doc

* cleaning

* cleaning

* Description of benchmark results

* Run pre-commit hooks

* Remove unused imports

* Remove unused code

* Typos

* Update benchmarks/utils/monitor.py

Co-authored-by: Lilian <[email protected]>

* Applying suggestions

* Change the `monitor` decorator to return a tidy dataframe.

* Add loading bar to the benchmark

* Adapt script to new `monitor` decorator

* Replace results of fuzzy_join_hash benchmark by the new tidy version (the plots are the same)

* Adapt previous minhash benchmark to the new monitor function + new results

* Add tqdm to benchmark requirements

* Cleanup

* Improve functionality and documentation

---------

Co-authored-by: Lilian <[email protected]>
…b-data#504)

* Clean docstrings

* Clean docstrings

* Fix types

* Remove scalability example

* Update dirty_cat/_similarity_encoder.py

Co-authored-by: Jovan Stojanovic <[email protected]>

* Fix return value of transform

* Revert "Remove scalability example"

This reverts commit 97cd1cf.

---------

Co-authored-by: Jovan Stojanovic <[email protected]>
* Add smaller n_features

* remove scalability example

* smaller number of features

* test
* Try fix

* Try alternative fix

* Do it correctly

* Follow naming conventions
* test

* modify example

* revert change in conf.py

* revert version

* update version
* Shorten fetching tests

* Add OpenML package to dev requirements

* Improve doc

* Allow building from source for liac-arff

* Make OpenML import optional
…krub-data#522)

* fuzzy_join takes missing values into account

* add to major changes
* add error message for match_score

* wrong error type

* fix tests
* ENH Add warning fuzzy join on missing values

* update changes.rst
LilianBoulard and others added 28 commits August 1, 2023 14:41
* Rework datetime encoder example

* Apply suggestions from code review

Co-authored-by: Jovan Stojanovic <[email protected]>

---------

Co-authored-by: Jovan Stojanovic <[email protected]>
…rub-data#665)

* create script

* cache

* Use loguru for logging, various code improvements, slightly better doc and messages

* Fix condition

* fix import bug

* fix bug for empty evals

* fix 0 features

* improvements

* Update benchmarks/run_on_openml_datasets.py

Co-authored-by: Lilian <[email protected]>

* import Counter

* test commit

* remove test commit

* fix bug

---------

Co-authored-by: Lilian <[email protected]>
…oyee_salaries` (skrub-data#581)

* Add `overload_job_titles` parameter to `fetch_employee_salaries`

* Add changelog entry

* Fix path
* Improve framework

* Add Gap divergence benchmark

* Set initial iter values

* Add omitted value (score)

* Force keyword arguments and add progress saving

* Minor fixes

* Update to main

* Add pyarrow to benchmark requirements

* Implement cross-validation

* Update README

* Parallelize cross-validation

* Fix attribute access

* Fix attribute access

* Fix unpacking

* Fix results naming

* Fix results bug

* Multiple columns support and W_change plot v1

* Refactor dataset getters

* Adapt getters usage to new format

* Small fixes

* New plots

* Fix dataset categorization

* Update used datasets

* Add score per inner iteration plot

* Add benchmark results

* Compute the score after tuning

* Add issue link for `road_safety`

* Update results

* Update benchmarks/utils/monitor.py

Co-authored-by: LeoGrin <[email protected]>

---------

Co-authored-by: LeoGrin <[email protected]>
* benchmark

* fix bug due to mixed type

* verbose

* test

* fix bug

* add benchmark results

* add balanced accuracy

* Update benchmarks/bench_gap_es_score.py

Co-authored-by: Lilian <[email protected]>

* remove prints

* run with the same batch size

* benchmark results with the same batch size

---------

Co-authored-by: Lilian <[email protected]>
* rename to Joiner

* complete renaming

* add many to many join support

* update changelog

* update docstring

* update docstring

* fix test

* fix init

* fix changelog

* fix example

* modify test

* Update CHANGES.rst

Co-authored-by: Lilian <[email protected]>

* Update examples/04_fuzzy_joining.py

Co-authored-by: Lilian <[email protected]>

* Update skrub/_joiner.py

Co-authored-by: Lilian <[email protected]>

* Update skrub/_joiner.py

Co-authored-by: Lilian <[email protected]>

* Update skrub/_joiner.py

Co-authored-by: Lilian <[email protected]>

* fix tests

* pre-commit

* fix docstring

* renaming

* update changelog

* apply suggestions

* fix index

* add new example

* new example

* add flight example

* update width

* add figure

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <[email protected]>

* apply suggestions

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <[email protected]>

* simplify text

* update figure display

* remove figure

* Remove leftover blank lines

* add conclusion

* fix conclusion

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <[email protected]>

* remove attribute

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <[email protected]>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <[email protected]>

* apply suggestions

* drop id cols

* add list flexibility to joiner

* add fetching function

* temp raise timeout time

* simplify Joiner signature

* revert name temporarily for euroscipy

* update joiner to accept tuples

---------

Co-authored-by: Lilian <[email protected]>
Co-authored-by: Gael Varoquaux <[email protected]>
* path as string

* install sphinx gallery from source

* add package name
The code snippet in README.rst renders incorrectly because it is written in Markdown syntax. This commit uses the RST code-block syntax instead.
* rephrase contributing

* some further simplification

* Apply suggestions from code review

Co-authored-by: Jovan Stojanovic <[email protected]>

---------

Co-authored-by: Jovan Stojanovic <[email protected]>
* add pipeline plot

* add plot of arrival delays

* Update examples/07_multiple_key_join.py

Co-authored-by: Vincent M <[email protected]>

---------

Co-authored-by: Vincent M <[email protected]>
(Merging this quickly to fix the README)
* working sped up version

* benchmark hyperparameters

* speedup benchmark

* fix bug due to mixed type

* add benchmark results

* changelog

* big speedup due to computing Ht@W only when Vt is non-zero

* change naming in _special_sparse_dot

* change benchmark

* speedup benchmark

* fix bug

* add benchmark and change default

* change default

* Update skrub/_gap_encoder.py

Co-authored-by: Lilian <[email protected]>

* changelog

* add test for max_no_improvement=None

* Vincent's suggestions

* add verbose=True to tests for coverage

* Apply suggestions from code review

Co-authored-by: Lilian <[email protected]>

* use loguru

* remove global

* pre-commit

* fix changelog

* Apply suggestions from code review

Co-authored-by: Lilian <[email protected]>

---------

Co-authored-by: Lilian <[email protected]>
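
The speedup mentioned above ("computing Ht@W only when Vt is non-zero") can be illustrated with a small NumPy sketch. This is not the code from `skrub/_gap_encoder.py`: the function name `sampled_dot`, the matrix shapes, and the random data are assumptions chosen for illustration. The idea is to evaluate the entries of a dense product only at the positions where a sparse matrix is non-zero, instead of forming the full product.

```python
import numpy as np


def sampled_dot(H, W, rows, cols):
    """Return (H @ W)[rows, cols] without building the full product.

    H has shape (n, k), W has shape (k, m); rows and cols are index
    arrays of equal length giving the positions to evaluate.
    """
    # For each position i: sum over k of H[rows[i], k] * W[k, cols[i]].
    return np.einsum("ik,ki->i", H[rows], W[:, cols])


rng = np.random.default_rng(0)
H = rng.random((6, 3))
W = rng.random((3, 8))

# Hypothetical sparse matrix V: most entries are zero.
V = rng.random((6, 8)) * (rng.random((6, 8)) < 0.2)
rows, cols = np.nonzero(V)

values = sampled_dot(H, W, rows, cols)
reference = (H @ W)[rows, cols]
print(np.allclose(values, reference))
```

The cost of `sampled_dot` scales with the number of non-zero entries of `V` rather than with its full shape, which is where the saving would come from when `V` is very sparse.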
…-data#583)

Co-authored-by: Gael Varoquaux <[email protected]>
Co-authored-by: Vincent M <[email protected]>
Co-authored-by: Jovan Stojanovic <[email protected]>
* add build doc file to workflow

* add save cache

* remove build doc file

* modify example to test

* test caching

* getting new cache

* another attempt

* fix data path

* fix data path

* fix path to data

* cache attempt

* generate cache

* should be using cached data

* test all examples [doc build]

* test with cache [doc build]
* fix restructured text

* Make linter happy

Let's see if the formatting is still good

* Decrease the number of data points

For a faster example: accuracy drops from 0.6 to 0.58, but the example
runs much faster
* replace set np.NaN by pd.NA

* Update skrub/_fuzzy_join.py

---------

Co-authored-by: Gael Varoquaux <[email protected]>