Upgrade to OpenNLP 2.5.x #14029

Open
mawiesne opened this issue Dec 2, 2024 · 3 comments

Comments

@mawiesne

mawiesne commented Dec 2, 2024

Description

Apache OpenNLP 2.5.0 has been released. This version contains new thread-safe implementations of TokenNameFinder et al. Moreover, models for many new languages (32, as of Nov 2024) are now available. Those models are also available as Maven artifacts.

Apache OpenNLP 2.5.0 requires Java 17 and should be fully compatible with Java 21.

This task is to update the OpenNLP dependency version to 2.5.x (x >= 0). Note: Release 2.5.1 is expected in December 2024.

@mawiesne
Author

mawiesne commented Dec 2, 2024

FYI @cpoerschke - if you are interested in bringing this together and you encounter questions: the OpenNLP PMC members are happy to provide answers.

@msfroh
Contributor

msfroh commented Dec 27, 2024

I was looking into this (trying to upgrade to 2.5.1) and initially ran into some failing test cases.

It looks like they were all related to the switch of the default POSTagFormat from Penn to UD. I was able to get all the tests passing by changing this line:

to

tagger = new POSTaggerME(model, POSTagFormat.PENN);

(I assume that we should support UD-style tags eventually too, but this at least keeps the existing functionality the same.)
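To illustrate why tests that assert on Penn-style tags break when a tagger starts emitting UD-style tags, here is a minimal, self-contained sketch. The class `TagFormatSketch`, the `PENN_TO_UD` table, and the `toUd` helper are hypothetical names made up for this example; the table is a tiny hand-picked subset and is not OpenNLP's own mapping.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TagFormatSketch {

    // Hypothetical, deliberately tiny Penn Treebank -> UD (UPOS) lookup.
    // Real mappings are larger and sometimes context-dependent
    // (e.g. VBZ can correspond to AUX rather than VERB).
    static final Map<String, String> PENN_TO_UD = Map.of(
        "NN", "NOUN",
        "NNS", "NOUN",
        "NNP", "PROPN",
        "VBZ", "VERB",
        "JJ", "ADJ",
        "DT", "DET",
        "RB", "ADV");

    // Converts a list of Penn tags to UD tags; tags missing from the
    // table fall back to "X", the UD tag for "other".
    static List<String> toUd(List<String> pennTags) {
        return pennTags.stream()
            .map(tag -> PENN_TO_UD.getOrDefault(tag, "X"))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> penn = List.of("DT", "JJ", "NN", "VBZ");
        // A test expecting the Penn tags would fail against the UD output.
        System.out.println(penn + " -> " + toUd(penn));
    }
}
```

A test that expects "NN" for a noun no longer matches once the tagger reports "NOUN", which is consistent with the failures described above going away when the Penn format is requested explicitly.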

@mawiesne
Author

@msfroh Thanks for checking. The option you chose (PENN format) is the quick way to update to 2.5.x (hint: x=2 was released today).

The UD format would let the Lucene project rely on a wider range of models: we recently trained and published models for 32 languages (see: OpenNLP models page). That might be an option for 2025 and onwards: just switch to the UD model files and the corresponding format.

Open for any further feedback/questions.
