Stanza is the Stanford NLP group’s shared repository for Python infrastructure. The goal of Stanza is not to replace your modeling tools of choice, but to offer implementations for common patterns useful for machine learning experiments.
You can install the package as follows:
git clone git@github.com:stanfordnlp/stanza.git
cd stanza
pip install -e .
To use the package, import it in your Python code. For example:
from stanza.text.vocab import Vocab
v = Vocab('UNK')
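To illustrate the idea behind a vocabulary object (a mapping from strings to integers with a fallback for unknown words), here is a minimal standalone sketch. It is not Stanza's Vocab and does not mirror its API; the class TinyVocab and its methods are hypothetical names chosen for illustration.

```python
class TinyVocab:
    """Hypothetical stand-in for a vocabulary; NOT Stanza's Vocab class."""

    def __init__(self, unk):
        # The unknown-word token always gets index 0.
        self._unk = unk
        self._word2index = {unk: 0}
        self._index2word = [unk]

    def add(self, word):
        """Return the index of word, assigning a fresh index if it is new."""
        if word not in self._word2index:
            self._word2index[word] = len(self._index2word)
            self._index2word.append(word)
        return self._word2index[word]

    def index(self, word):
        """Look up word without growing the vocabulary; unseen words map to UNK."""
        return self._word2index.get(word, self._word2index[self._unk])

    def word(self, index):
        """Map an index back to its string."""
        return self._index2word[index]


vocab = TinyVocab('UNK')
indices = [vocab.add(w) for w in 'the cat sat on the mat'.split()]
print(indices)             # [1, 2, 3, 4, 1, 5]
print(vocab.index('dog'))  # 0, the UNK index
```

Keeping lookup (index) separate from insertion (add) means a vocabulary built on training data can be applied to test data without silently growing.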
To use the Python client for the CoreNLP server, first launch your CoreNLP Java server. Then, in your Python program:
from stanza.nlp.corenlp import CoreNLPClient
client = CoreNLPClient(server='http://localhost:9000', default_annotators=['ssplit', 'tokenize', 'lemma', 'pos', 'ner'])
annotated = client.annotate('This is an example document. Here is a second sentence')
for sentence in annotated.sentences:
    print('sentence', sentence)
    for token in sentence:
        print(token.word, token.lemma, token.pos, token.ner)
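For readers curious about what talking to the server involves, the following is a rough sketch that queries a CoreNLP server over HTTP using only the standard library. It is not Stanza's CoreNLPClient and makes no claim about how that client is implemented. The names build_corenlp_url and annotate_raw are hypothetical, and the sketch assumes the server accepts a JSON-encoded properties query parameter with the raw text as the POST body, as described in the CoreNLP server documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_corenlp_url(server, annotators):
    """Build a request URL for a CoreNLP server (hypothetical helper)."""
    properties = {
        "annotators": ",".join(annotators),
        "outputFormat": "json",
    }
    # The properties dict is sent as a JSON string in the query string.
    query = urlencode({"properties": json.dumps(properties)})
    return server.rstrip("/") + "/?" + query


def annotate_raw(text, server="http://localhost:9000",
                 annotators=("tokenize", "ssplit", "pos")):
    """POST text to a running server and return the decoded JSON response."""
    request = Request(build_corenlp_url(server, annotators),
                      data=text.encode("utf-8"))
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling annotate_raw requires a server that is already running at the given address; build_corenlp_url can be inspected on its own without one.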
Please see the documentation for more use cases.
Documentation is hosted on Read the Docs at http://stanza.readthedocs.org/en/latest/. Stanza is still in early development. Interfaces and code organization will probably change substantially over the next few months.
To request or discuss additional functionality, please open a GitHub issue. We greatly appreciate pull requests!
Stanza has unit tests, doctests, and longer integration tests. We ask that all contributors run the unit tests and doctests before submitting pull requests:
python setup.py test
Doctests are the easiest way to write a test for new functionality, and they serve as helpful examples of how to use your code. See progress.py for a simple example of an easily testable module, or summary.py for a more involved setup that uses a mocked filesystem.
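As a rough illustration of what a doctest looks like, here is a small self-contained sketch. The function clip is a hypothetical example and is not part of Stanza; the lines beginning with >>> are executed by the standard-library doctest module and their output is compared with the line that follows.

```python
import doctest


def clip(value, low, high):
    """Clamp value to the closed interval [low, high].

    >>> clip(5, 0, 3)
    3
    >>> clip(-2, 0, 3)
    0
    >>> clip(1, 0, 3)
    1
    """
    return max(low, min(value, high))


# Run the examples in clip's docstring; nothing is printed if they all pass.
doctest.run_docstring_examples(clip, {"clip": clip})
```

For a single file, the standard library can also run every doctest in it with python -m doctest followed by the file name.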
If you are adding a new module, please remember to add it to setup.py and to add a corresponding .rst file in the docs directory.
Documentation is generated via Sphinx using inline comments. This means that Python docstrings double as both interactive documentation and standalone documentation. It also means that you must format your docstrings in RST. RST is very similar to Markdown. There are many tutorials on the exact syntax; essentially, you only need to know the function parameter syntax, which can be found here. You can, of course, look at the documentation for existing modules for guidance as well. A good place to start is the text.dataset package.
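As a hedged sketch of the function parameter syntax, the docstring below uses Sphinx's standard info fields (:param:, :type:, :return:, :rtype:, :raises:). The function pad_tokens is a hypothetical example, not an existing Stanza function.

```python
def pad_tokens(tokens, length, pad='<pad>'):
    """Pad or truncate a token list to a fixed length.

    :param tokens: the input tokens.
    :type tokens: list of str
    :param int length: the desired output length.
    :param str pad: the token used to fill missing positions.
    :return: a new list containing exactly ``length`` tokens.
    :rtype: list of str
    :raises ValueError: if ``length`` is negative.
    """
    if length < 0:
        raise ValueError('length must be non-negative')
    # Truncate first, then append as many pad tokens as are still missing.
    return list(tokens[:length]) + [pad] * max(0, length - len(tokens))
```

The type can be given either inline (:param int length:) or in a separate :type: field; both forms are rendered by Sphinx.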
To set up your environment such that you can generate docs locally:
pip install sphinx sphinx-autobuild
If you introduced a new module, please auto-generate the docs:
sphinx-apidoc -F -o docs stanza
cd docs && make
open _build/html/index.html
You will most likely need to manually edit the .rst file corresponding to your new module.
Our docs are hosted on Readthedocs. If you'd like admin access to the Readthedocs project, please contact Victor or Will.
- common objects used in NLP
    - [x] a Vocabulary object mapping from strings to integers/vectors
- tools for running experiments on the NLP cluster
    - [ ] a function for querying GPU device stats (to aid in selecting a GPU on the cluster)
    - [ ] a tool for plotting training curves from multiple jobs
    - [ ] a tool for interacting with an already running job via edits to a text file
- [x] an API for calling CoreNLP
Stanza is not meant to include every research project the group undertakes. If you have a standalone project that you would like to share with other people in the group, you can:
- request your own private repo under the stanfordnlp GitHub account.
- share your code on CodaLab.
- For targeted questions, ask on Stanford NLP Overflow (use the stanza tag).
That said, it can be useful to add functionality to Stanza while you work in a separate repo on a project that depends on Stanza. Since Stanza is under active development, you will want to version-control the Stanza code that your code uses. Probably the most effective way of accomplishing this is by using git subtree.
git subtree includes the source tree of another repo (in this case, Stanza) as a directory within your repo (your cutting-edge research), and keeps track of some metadata that allows you to keep that directory in sync with the original Stanza code. The main advantage of git subtree is that you can modify the Stanza code locally, merge in updates, and push your changes back to the Stanza repo to share them with the group. (git submodule doesn't allow this.)
git subtree has some downsides to be aware of:
- You have a copy of all of Stanza as part of your repo. For small projects, this could increase your repo size dramatically. (Note: you can keep the history of your repo from growing at the same rate as Stanza's by using squashed commits; it's only the size of the source tree that unavoidably bloats your project.)
- Your repo's history will contain a merge commit every time you update Stanza from upstream. This can look ugly, especially in graphical viewers.
Still, subtree can be configured to be fairly easy to use, and the consensus seems to be that it is superior to submodule (https://codingkilledthecat.wordpress.com/2012/04/28/why-your-company-shouldnt-use-git-submodules/).
Here's one way to configure subtree so that you can include Stanza in your repo and contribute your changes back to the master repo:
# Add Stanza as a remote repo
git remote add stanza http://<your github username>@github.com/stanfordnlp/stanza.git
# Import the contents of the repo as a subtree
git subtree add --prefix third-party/stanza stanza develop --squash
# Put a symlink to the actual module somewhere where your code needs it
ln -s third-party/stanza/stanza stanza
# Add aliases for the two things you'll need to do with the subtree
git config alias.stanza-update 'subtree pull --prefix third-party/stanza stanza develop --squash'
git config alias.stanza-push 'subtree push --prefix third-party/stanza stanza develop'
After this, you can use the aliases to push and pull Stanza like so:
git stanza-update
git stanza-push
I [@futurulus] highly recommend a topic branch/rebase workflow, which will keep your history fairly clean besides those pesky subtree merge commits:
# Create a topic branch
git checkout -b fix-stanza
# <hack hack hack, make some commits>
git checkout master
# Update Stanza on master, should go smoothly because master doesn't
# have any of your changes yet
git stanza-update
# Go back and replay your fixes on top of master changes
git checkout fix-stanza
git rebase master
# You might need to resolve merge conflicts here
# Add your rebased changes to master and push
git checkout master
git merge --ff-only fix-stanza
git stanza-push
# Done!
git branch -d fix-stanza