v4 page for spacy.io #13463

Draft
wants to merge 2 commits into
base: v4
4 changes: 2 additions & 2 deletions website/docs/api/entitylinker.mdx
@@ -74,10 +74,10 @@ architectures and their arguments and hyperparameters.
Prior to spaCy v4.0 `get_candidates()` returns a single `Iterable` of candidates
for one specific mention, i. e. the function was typed as
`Callable[[KnowledgeBase, Span], Iterable[Candidate]]`. To retrieve candidates
-batch-wise, spaCy >= 3.5 exposes `get_candidates_batched()`, which identifies
+batch-wise, spaCy >= 3.5 exposes `get_candidates_batch()`, which identifies
candidates for an arbitrary number of spans:
`Callable[[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]]`. The
-main difference between `get_candidates_batched()` and `get_candidates()` in
+main difference between `get_candidates_batch()` and `get_candidates()` in
spaCy >= 4.0 is that the latter considers the grouping of provided mention spans
per `Doc` instance.

191 changes: 191 additions & 0 deletions website/docs/usage/v4.mdx
@@ -0,0 +1,191 @@
---
title: What's New in v4.0
teaser: New features and how to upgrade
menu:
- ['New Features', 'features']
- ['Upgrading Notes', 'upgrading']
---

## New features {id="features",hidden="true"}

spaCy v4.0 supports more flexible learning rates and adds experimental support
for model distillation. This release also fixes some long-standing issues that
require minor API changes.

spaCy v4.0 drops support for Python 3.7 and 3.8.

### Flexible learning rates {id="learn-rate"}

Thinc 9 adds support for more flexible learning rates that can use the step,
parameter names, and results from prior evaluations. spaCy v4 makes use of these
flexible learning rates by passing the aggregate score of the most recent
evaluation to the learning rate schedule. This makes it possible for schedules
like [`plateau`](https://thinc.ai/docs/api-schedules#plateau) to adjust the
learning rate when training is stagnant.
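
The idea behind a score-aware schedule can be sketched in plain Python. This is a simplified illustration and not Thinc's implementation: the class name `PlateauSchedule` and its parameters `base_rate`, `max_patience` and `scale` are hypothetical stand-ins chosen for this example.

```python
class PlateauSchedule:
    """Toy schedule that scales the learning rate when the score stops improving."""

    def __init__(self, base_rate: float, max_patience: int, scale: float):
        self.rate = base_rate
        self.max_patience = max_patience
        self.scale = scale
        self.best_score = None
        self.patience = 0

    def __call__(self, last_score: float) -> float:
        # An improved score resets the patience counter.
        if self.best_score is None or last_score > self.best_score:
            self.best_score = last_score
            self.patience = 0
        else:
            # No improvement: count the stagnant evaluation and scale the
            # rate once the patience budget is used up.
            self.patience += 1
            if self.patience >= self.max_patience:
                self.rate *= self.scale
                self.patience = 0
        return self.rate


schedule = PlateauSchedule(base_rate=0.001, max_patience=2, scale=0.5)
# Feed the aggregate score of each evaluation to the schedule.
rates = [schedule(score) for score in [0.70, 0.75, 0.75, 0.74, 0.76]]
```

In this sketch the rate is halved after two evaluations in a row fail to beat the best score seen so far.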

### Experimental support for model distillation {id="distillation"}

spaCy v4 lays the groundwork for model distillation. Distillation trains a
_student_ model on the predictions of a _teacher_ model using an unannotated
corpus. One of the more exciting applications of distillation is extracting
small, task-focused models from large, pretrained transformer models.

Distillation support consists of several parts:

- [`TrainablePipe`](/api/pipe) now provides a [`distill`](/api/pipe#distill)
method. This can be used to perform a distillation step, where a student is
updated to mimic the outputs of the teacher.
- A configuration section called `distillation` for configuring various
distillation settings.
- The distillation loop.
- The [`distill`](/api/cli#distill) subcommand to run distillation from the
command-line.

Most of the trainable pipeline components are updated to support distillation.
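
To make the student/teacher relationship concrete, here is a minimal pure-Python sketch of a single-example distillation update. It assumes a cross-entropy loss between the teacher's probabilities and the student's softmax output; this is not spaCy's implementation, and the function names `softmax`, `cross_entropy` and `distill_step` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_probs: List[float], student_probs: List[float]) -> float:
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def distill_step(
    student_logits: List[float], teacher_probs: List[float], learn_rate: float
) -> List[float]:
    student_probs = softmax(student_logits)
    # Gradient of the cross-entropy with respect to the student logits.
    grads = [s - t for s, t in zip(student_probs, teacher_probs)]
    return [x - learn_rate * g for x, g in zip(student_logits, grads)]


# The teacher's prediction for one unannotated example (made-up numbers).
teacher_probs = [0.7, 0.2, 0.1]
student_logits = [0.0, 0.0, 0.0]

loss_before = cross_entropy(teacher_probs, softmax(student_logits))
for _ in range(50):
    student_logits = distill_step(student_logits, teacher_probs, learn_rate=0.5)
loss_after = cross_entropy(teacher_probs, softmax(student_logits))
```

No gold-standard label appears anywhere in the sketch: the teacher's output is the only training signal, which is why an unannotated corpus is sufficient.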

### Saving activations {id="save-activation"}

Trainable pipes can now save the pipe's model activations for a document in the
[`Doc.activations`](/api/doc#attributes) dictionary. You can use this
functionality to get programmatic access to e.g. the probability distribution of
a pipe's classifier.

The following activations are currently available:

- `EditTreeLemmatizer`: `probabilities` and `tree_ids`
- `EntityLinker`: `ents` and `scores`
- `Morphologizer`: `probabilities` and `label_ids`
- `SentenceRecognizer`: `probabilities` and `label_ids`
- `SpanCategorizer`: `indices` and `scores`
- `Tagger`: `probabilities` and `label_ids`
- `TextCategorizer`: `probabilities`

> #### Example
>
> ```python
> import spacy
> nlp = spacy.load("de_core_news_lg")
> nlp.get_pipe("tagger").save_activations = True
> doc = nlp("Hallo Welt!")
> assert "tagger" in doc.activations
> assert "probabilities" in doc.activations["tagger"]
> ```

### Additional features and improvements {id="additional-features-and-improvements"}

- The `--code` option that is used by several CLI subcommands now accepts
multiple files to load, separated by commas.
- `spacy download` does not redownload models that are already installed.
- When modifying a `Span` that was retrieved through a `SpanGroup`, the change
is now reflected in the `SpanGroup`.
- Lookups can now be downloaded from a URL using
`spacy.LookupsDataLoaderFromURL.v1`.

## Notes about upgrading from v3.7 {id="upgrading"}

This release drops support for Python 3.7 and 3.8. Most configuration files from
spaCy 3.7 can be used with spaCy 4.0 without any modifications (excepting
configurations that use `EntityLinker.v1`, see below). However, spaCy 4.0
introduces some (minor) API changes that are discussed in the remainder of this
section.

### Removal of the `EntityRuler` class

The `EntityRuler` class is removed. The entity ruler is now implemented as a
special case of the `SpanRuler` component.

See the [migration guide](/api/entityruler#migrating) for differences between
the v3 `EntityRuler` and v4 `SpanRuler` implementations of the `entity_ruler`
component.

### Renamed language codes: `is` -> `isl` and `xx` -> `mul`

The language code for Icelandic has been changed from `is` to `isl` to avoid
incompatibilities with the Python `is` keyword. The language code for
multilingual models has been changed from `xx` to `mul`. Existing code that uses
these language codes should be adjusted accordingly.
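
If language codes are stored in your own configuration or metadata, a small helper along these lines could translate them during migration. The names `RENAMED_LANG_CODES` and `upgrade_lang_code` are hypothetical and not part of spaCy.

```python
# Mapping from the v3 language codes named above to their v4 replacements.
RENAMED_LANG_CODES = {"is": "isl", "xx": "mul"}


def upgrade_lang_code(code: str) -> str:
    """Return the v4 language code for a v3 code, leaving other codes unchanged."""
    return RENAMED_LANG_CODES.get(code, code)


codes = [upgrade_lang_code(code) for code in ["is", "xx", "de"]]
```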

### Removal of the `sentiment` attribute

The `sentiment` attribute is removed from the `Token`, `Span`, `Doc` and `Lexeme`
classes. If you used this attribute in a sentiment analysis component, we
recommend storing the result of the sentiment analysis in an
[extension attribute](/usage/processing-pipelines#custom-components-attributes)
instead.

### Removal of `get_candidates_batch`

Prior to spaCy v4, `get_candidates()` returned an `Iterable` of candidates for a
specific mention. spaCy >= 3.5 provides `get_candidates_batch()` for looking up
multiple mentions: given an `Iterable[Span]` of mentions, it returns the
candidates for each mention.

spaCy v4 replaces both functions with a single function
[`get_candidates`](/api/entitylinker#config) that does doc-wise batching. Given
an `Iterator[SpanGroup]`, it returns the candidates for each mention in each
span group. Batching is per doc because the [`Span`](/api/span)s in a
[`SpanGroup`](/api/spangroup) belong to the same [`Doc`](/api/doc).
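
The difference in call shape can be sketched with plain Python stand-ins. Everything here is hypothetical: a dict plays the role of the knowledge base, strings stand in for mentions and candidates, and a list of mentions stands in for the `SpanGroup` of one doc. The real functions operate on `KnowledgeBase`, `Span`, `SpanGroup` and `Candidate` objects.

```python
from typing import Dict, Iterable, Iterator, List

# Toy knowledge base: mention text -> made-up candidate identifiers.
ToyKB = Dict[str, List[str]]


def get_candidates_v3(kb: ToyKB, mention: str) -> Iterable[str]:
    """v3 style: look up the candidates for one mention."""
    return kb.get(mention, [])


def get_candidates_v4(
    kb: ToyKB, mentions_per_doc: Iterator[List[str]]
) -> Iterator[List[List[str]]]:
    """v4 style: for each doc, yield one candidate list per mention."""
    for mentions in mentions_per_doc:
        yield [kb.get(mention, []) for mention in mentions]


kb = {"Paris": ["paris_city", "paris_person"], "Python": ["python_language"]}
# Two docs: the first has two mentions, the second has one unknown mention.
docs = [["Paris", "Python"], ["Berlin"]]
result = list(get_candidates_v4(kb, iter(docs)))
```

The result has one entry per doc, and each entry holds one candidate list per mention, so the grouping by doc is preserved in the output.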

### Removal of pool argument from `Vocab.get` and `Vocab.get_by_orth`

The memory pool argument was removed from the `Vocab.get` and
`Vocab.get_by_orth` Cython cdef methods. These methods can now be called without
providing the memory pool as an argument.

### Optional arguments of `Span.char_span` are now keyword-only

> #### Example
>
> ```python
> doc = nlp("I like New York")
> # Permitted in spaCy 3
> span = doc[1:4].char_span(5, 13, "GPE", 42)
> # spaCy 4
> span = doc[1:4].char_span(5, 13, label="GPE", kb_id=42)
> ```

The optional arguments for [`Span.char_span`](/api/span#char_span) are now
keyword-only. Existing code that uses a positional argument to pass an optional
argument to `char_span` needs to be updated to pass a keyword argument.
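
Keyword-only arguments are a standard Python feature, declared with a bare `*` in the signature. The function below is a hypothetical stand-in that only mimics the call shape, assuming `label` and `kb_id` are among the optional arguments. It is not the real `Span.char_span`.

```python
def char_span(start_idx, end_idx, *, label="", kb_id=0):
    """Hypothetical stand-in: the bare `*` makes label and kb_id keyword-only."""
    return (start_idx, end_idx, label, kb_id)


# Passing the optional arguments by keyword works.
ok = char_span(5, 13, label="GPE", kb_id=42)

# Passing them positionally raises a TypeError.
try:
    char_span(5, 13, "GPE", 42)
    positional_allowed = True
except TypeError:
    positional_allowed = False
```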

### Remove backoff from `Doc.vector` to `Doc.tensor`

In spaCy v3 and earlier, small (`sm`) pipeline packages supported
[`Doc.vector`](/api/doc#vector) and [`Token.vector`](/api/token#vector) by
backing off to context-sensitive tensors from the `tok2vec` component. These
tensors do not work well for this purpose and this backoff has been removed in
spaCy v4.

### Multiple spans returned as `Tuple[Span]`

In spaCy v3 some methods that returned multiple `Span` objects would return an
`Iterator[Span]`, while others would return `Tuple[Span]`. In spaCy v4 such
methods always return `Tuple[Span]`.
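
The practical difference between the two return types shows up when a result is consumed more than once. The sketch below uses strings in place of `Span` objects, and the function names are hypothetical.

```python
from typing import Iterator, Tuple


def spans_as_iterator() -> Iterator[str]:
    yield from ("New York", "Berlin")


def spans_as_tuple() -> Tuple[str, ...]:
    return ("New York", "Berlin")


it = spans_as_iterator()
first_pass = list(it)
second_pass = list(it)  # the iterator is already exhausted

tup = spans_as_tuple()
count = len(tup)  # tuples support len() ...
last = tup[-1]  # ... and indexing, and can be iterated repeatedly
```

Code that relied on lazy evaluation of an iterator should keep in mind that a tuple is built in full before it is returned.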

### Support for `EntityLinker.v1` is dropped

Support for `EntityLinker.v1` is dropped. Migrate to `EntityLinker.v2` instead.

### `spacy[apple]` removed from extras

The `thinc-apple-ops` package has been merged into Thinc v9. spaCy v4 always
uses Apple ops on Macs, so the `apple` extra is not needed anymore.

### Pipeline package version compatibility {id="version-compat"}

spaCy v3.x pipelines are not compatible with spaCy v4.0 and need to be
retrained.

### Updating v3.7 configs

To update a config from spaCy v3.7 with the new v4.0 settings, run
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config-v3.7.cfg config-v4.0.cfg
```

In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
4 changes: 2 additions & 2 deletions website/meta/sidebars.json
@@ -9,9 +9,9 @@
{ "text": "Models & Languages", "url": "/usage/models" },
{ "text": "Facts & Figures", "url": "/usage/facts-figures" },
{ "text": "spaCy 101", "url": "/usage/spacy-101" },
+{ "text": "New in v4.0", "url": "/usage/v4" },
{ "text": "New in v3.7", "url": "/usage/v3-7" },
-{ "text": "New in v3.6", "url": "/usage/v3-6" },
-{ "text": "New in v3.5", "url": "/usage/v3-5" }
+{ "text": "New in v3.6", "url": "/usage/v3-6" }
]
},
{