improper stemming of NGram documents #149
Comments
Is work still needed on this issue? @aviks
@aviks is this issue fixed, or is help still needed?
I intended to finish this; however, at the moment I am a bit busy with my internship. If you can resolve this issue, feel free to proceed.
@aviks Hi! I think I figured out what's going on here. It comes down to the code in TextAnalysis.jl/src/stemmer.jl, lines 36 to 48 in a38d8d7.
The problem arises from the fact that This might mean fundamentally altering the nature of (Or, if you want a lazy fix that ignores everything else that's going on, you can just change new_token = stem(stemmer, token) to new_token = stem_all(stemmer, token) and be done with it.)
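To make the per-word idea behind the lazy fix concrete, here is a minimal, self-contained sketch. It does not use TextAnalysis.jl: it assumes n-gram counts are held in a plain Dict{String,Int}, and both `stem_ngram_counts` and `strip_plural` are hypothetical names invented for this illustration. The point it tries to show is that once every word is stemmed, distinct n-grams can collapse to the same key, so their counts need to be merged.

```julia
# Hypothetical helper: stem every word of every n-gram key and merge the
# counts of keys that become identical after stemming.
# `stem_word` is any word -> word function, passed in by the caller.
function stem_ngram_counts(stem_word::Function, counts::Dict{String,Int})
    stemmed = Dict{String,Int}()
    for (ngram, n) in counts
        # Rebuild the key from the individually stemmed words.
        key = join((stem_word(w) for w in split(ngram)), " ")
        stemmed[key] = get(stemmed, key, 0) + n
    end
    return stemmed
end

# Toy stand-in for a real stemmer (an assumption for this sketch only):
# it just drops a trailing "s".
strip_plural(w) = endswith(w, "s") ? chop(w) : w

counts = Dict("cats" => 2, "cat" => 1, "cats sleep" => 1)
result = stem_ngram_counts(strip_plural, counts)
println(result)
```

Building a fresh dictionary, rather than editing the original while iterating over its keys, keeps the sketch simple; whether that suits the package's in-place `stem!` is a separate design question.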
Stemming a NGramDocument stems only the last word of each ngram. Notice below how repository is stemmed to repositori in one place but left intact in another. By contrast, stemming a StringDocument stems each word:
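The difference described above can be mimicked without the package. This is only an illustration under stated assumptions: `toy_stem` is a hypothetical stand-in that rewrites a trailing "y" to "i" (it is not the Snowball stemmer that TextAnalysis.jl wraps), and the two helper functions are invented names that contrast treating an n-gram as one token with stemming each of its words.

```julia
# Hypothetical toy stemmer: only rewrites a trailing "y" to "i",
# which is enough to mimic repository -> repositori.
toy_stem(word::AbstractString) = endswith(word, "y") ? string(chop(word), "i") : String(word)

# Treats the whole n-gram as a single token, so only the end of the
# string (the last word) can ever change.
stem_as_one_token(ngram::AbstractString) = toy_stem(ngram)

# Stems every whitespace-separated word of the n-gram.
stem_each_word(ngram::AbstractString) = join(toy_stem.(split(ngram)), " ")

for ngram in ["the repository", "repository is"]
    println(rpad(ngram, 16), " => ", stem_as_one_token(ngram), " | ", stem_each_word(ngram))
end
```

With "repository is", the single-token version leaves the n-gram unchanged because the string does not end in the word being stemmed, while the per-word version changes it.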