Changes

.. currentmodule:: skrub

Ongoing development

Skrub has not been released yet. It is currently undergoing fast development, and backward compatibility is not ensured.

Major changes

Minor changes

Before skrub: dirty_cat

Skrub was born from the dirty_cat package.

Dirty-cat Release 0.4.1

Major changes

Minor changes

Dirty-cat Release 0.4.0

Major changes

Minor changes

Bug fixes

Dirty-cat Release 0.3.0

Major changes

Notes

Dirty-cat Release 0.2.2

Bug fixes

Dirty-cat Release 0.2.1

Major changes

Bug-fixes

Notes

Dirty-cat Release 0.2.0

Also see pre-release 0.2.0a1 below for additional changes.

Major changes

Notes

Dirty-cat Release 0.2.0a1

Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:

pip install --pre dirty_cat==0.2.0a1

or from the GitHub repository:

pip install git+https://github.com/dirty-cat/dirty_cat.git

Major changes

Bug-fixes

Dirty-cat Release 0.1.1

Major changes

Bug-fixes

Dirty-cat Release 0.1.0

Major changes

Bug-fixes

Dirty-cat Release 0.0.7

  • MinHashEncoder: Added minhash_encoder.py and fast_hash.py files that implement minhash encoding through the :class:`MinHashEncoder` class. This method allows for fast and scalable encoding of string categorical variables.
  • datasets.fetch_employee_salaries: Changed the download source for employee_salaries.
    • The function now returns a bunch with a dataframe under the field "data", instead of the path to the CSV file.
    • The field "description" has been renamed to "DESCR".
  • SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the python-Levenshtein implementation.
  • SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
  • TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
  • MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
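To illustrate the idea behind minhash encoding of strings, here is a minimal standard-library sketch. It is not the dirty_cat implementation: the function names, the choice of BLAKE2 as the hash, the space padding, and the zero vector returned for missing values are all assumptions made for illustration.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`, padded with one space per side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def seeded_hash(gram, seed):
    """Hash an n-gram to a 64-bit integer; each seed acts as a distinct hash function."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_encode(text, n_components=8, n=3):
    """Encode a string as the minimum n-gram hash under `n_components` seeds.

    Assumption: a missing value (None) or a string with no n-grams is
    encoded as a vector of zeros.
    """
    grams = char_ngrams(text, n) if text is not None else set()
    if not grams:
        return [0] * n_components
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


def estimated_similarity(encoding_a, encoding_b):
    """Fraction of matching components, which estimates n-gram set overlap."""
    matches = sum(a == b for a, b in zip(encoding_a, encoding_b))
    return matches / len(encoding_a)


# Strings that share many n-grams tend to agree on more components.
officer = minhash_encode("police officer")
officer_ii = minhash_encode("police officer ii")
print(estimated_similarity(officer, officer_ii))
```

The encoding is stateless: no vocabulary has to be stored, which is what makes this kind of approach attractive for columns with many distinct string values.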

Dirty-cat Release 0.0.6

  • SimilarityEncoder: Accelerated SimilarityEncoder.transform by:
    • computing the vocabulary count vectors in fit instead of transform
    • computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the :class:`SimilarityEncoder`.
  • SimilarityEncoder: Fixed a bug that prevented a :class:`SimilarityEncoder` from being created when categories was a list.
  • SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.
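The first speed-up above, moving the vocabulary count vectors from transform to fit, can be sketched as follows. This is a simplified standard-library illustration, not the library's code: the class name `TinySimilarityEncoder`, the space padding, and the Jaccard-style ratio are assumptions, and the joblib parallelism and float32 dtype are left out.

```python
from collections import Counter


def ngram_counts(text, n=3):
    """Count the character n-grams of `text`, padded with one space per side."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(counts_a, counts_b):
    """Shared n-grams divided by all n-grams (a Jaccard-style ratio on multisets)."""
    shared = sum((counts_a & counts_b).values())
    total = sum((counts_a | counts_b).values())
    return shared / total if total else 0.0


class TinySimilarityEncoder:
    """Hypothetical toy encoder: one output column per category seen in fit."""

    def fit(self, categories):
        self.categories_ = sorted(set(categories))
        # Computed once here, so transform never recounts the vocabulary n-grams.
        self.vocabulary_counts_ = [ngram_counts(c) for c in self.categories_]
        return self

    def transform(self, values):
        return [
            [ngram_similarity(ngram_counts(v), ref) for ref in self.vocabulary_counts_]
            for v in values
        ]


encoder = TinySimilarityEncoder().fit(["London", "Paris"])
print(encoder.transform(["London", "Londn"]))  # → [[1.0, 0.0], [0.375, 0.0]]
```

Each call to transform only counts the n-grams of the incoming values; the reference counts are reused across calls.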

Dirty-cat Release 0.0.5

  • SimilarityEncoder: Changed the default ngram range to (2, 4), which performs better empirically.
  • SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
  • SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
  • SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
  • SimilarityEncoder: Performance improvements in the ngram similarity.
  • SimilarityEncoder: Exposed a get_feature_names method.
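Two of the ideas in this release, choosing the most frequent categories as prototypes and hashing n-grams so that no vocabulary has to be stored, can be sketched with the standard library alone. The function names, the tie-breaking rule, and the CRC32-based bucketing below are illustrative assumptions, not the dirty_cat implementation.

```python
import zlib
from collections import Counter


def most_frequent_prototypes(values, n_prototypes):
    """Keep the `n_prototypes` most frequent values as reference categories.

    Assumption: ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(values)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [value for value, _ in ranked[:n_prototypes]]


def hashed_ngram_counts(text, dim=16, n=3):
    """Count character n-grams into `dim` buckets without storing a vocabulary."""
    padded = f" {text} "
    vector = [0] * dim
    for i in range(len(padded) - n + 1):
        bucket = zlib.crc32(padded[i:i + n].encode("utf-8")) % dim
        vector[bucket] += 1
    return vector


jobs = ["clerk", "officer", "clerk", "analyst", "officer", "clerk"]
print(most_frequent_prototypes(jobs, 2))  # → ['clerk', 'officer']
print(hashed_ngram_counts("clerk"))
```

Limiting the number of prototypes bounds the output dimension, and the hashed vectors need no fitting step because the bucket of each n-gram depends only on the n-gram itself.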