improper stemming of NGram documents #149

tanmaykm · 2019-05-03T10:31:45Z

Stemming a NGramDocument stems only the last word of each ngram. Notice below how repository is stemmed to repositori in one place but left intact in another.

julia> td = NGramDocument("this repository of julia language", 3)
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("language"=>1,"repository"=>1,"this"=>1,"this repository of"=>1,"of julia language"=>1,"julia language"=>1,"of"=>1,"julia"=>1,"this repository"=>1,"repository of"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

julia> stem!(td); td
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("languag"=>1,"this"=>1,"this repository of"=>1,"of julia languag"=>1,"this repositori"=>1,"of"=>1,"julia"=>1,"repositori"=>1,"repository of"=>1,"of julia"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

While stemming a StringDocument stems each word:

julia> sd = StringDocument("this repository of julia language")
StringDocument{String}("this repository of julia language", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

julia> stem!(sd); sd
StringDocument{String}("this repositori of julia languag", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))


sean-gauss · 2020-01-21T20:59:03Z

Is work still needed on this issue? @aviks

bnriiitb · 2020-07-24T15:09:11Z

@aviks Is this issue fixed, or is help still needed?

sean-gauss · 2020-07-24T17:27:38Z

I intended to finish this; however, I am a bit busy with my internship at the moment. If you can resolve this issue, feel free to proceed.

mostol · 2022-02-16T21:22:28Z

@aviks Hi! I think I figured out what's going on here. It comes down to the stem function in line 38 of stemmer.jl below, which stems the n-gram (token), resulting in its stemmed version (new_token):

TextAnalysis.jl/src/stemmer.jl

Lines 36 to 48 in a38d8d7

    
    function stem!(stemmer::Stemmer, d::NGramDocument)
        for token in keys(d.ngrams)
            new_token = stem(stemmer, token)
            if new_token != token
                if haskey(d.ngrams, new_token)
                    d.ngrams[new_token] = d.ngrams[new_token] + d.ngrams[token]
                else
                    d.ngrams[new_token] = d.ngrams[token]
                end
                delete!(d.ngrams, token)
            end
        end
    end

The problem is that token (the n-gram) is stored as a single string. The name "token" is a bit of a misnomer: each n-gram is really a sequence of tokens, all of which we want stemmed. So we should either treat it like a StringDocument and stem each word in the string, or treat it like a TokenDocument and stem each token of the n-gram individually. Right now the n-gram is stemmed as a plain String, so it is interpreted as one entity whose ending gets stemmed, rather than as a list of n entities to be stemmed individually.

This might mean fundamentally altering the nature of NGramDocuments to be made up of either StringDocuments or vectors of strings like TokenDocuments are (the former probably being easier to actually implement, the latter perhaps being a little more meaningful?). I'd be glad to help implement a change in either direction!
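
To make the "vectors of strings" option a bit more concrete, here is a minimal sketch of n-grams keyed by token vectors instead of joined strings. It does not use TextAnalysis.jl at all: token_ngrams and toy_stem are hypothetical names, and toy_stem is only a stand-in that mimics the two transformations shown in this issue (repository to repositori, language to languag), not the real stemmer.

```julia
# Hypothetical stand-in for a real stemmer (NOT the TextAnalysis.jl API).
# It only reproduces the two changes seen in this issue.
toy_stem(w::AbstractString) =
    endswith(w, "y") ? String(chop(w)) * "i" :
    endswith(w, "e") ? String(chop(w)) : String(w)

# Count all n-grams of length 1..n, keeping each n-gram as a vector of tokens
# rather than a space-joined string. (Hypothetical helper.)
function token_ngrams(tokens::Vector{String}, n::Int)
    counts = Dict{Vector{String},Int}()
    for len in 1:n, i in 1:(length(tokens) - len + 1)
        gram = tokens[i:i+len-1]
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

tokens = String.(split("this repository of julia language"))
grams = token_ngrams(tokens, 3)

# With token vectors, stemming an n-gram is just stemming each element.
# Counts are accumulated in a fresh Dict so colliding n-grams merge.
stemmed = Dict{Vector{String},Int}()
for (gram, count) in grams
    key = map(toy_stem, gram)
    stemmed[key] = get(stemmed, key, 0) + count
end
```

With this representation there is no ambiguity about where one token ends and the next begins, at the cost of changing the key type that the rest of the package would have to handle.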

(Or, if you want a lazy fix that doesn't think about anything else that's going on, you can just change

new_token = stem(stemmer, token)

to

new_token = stem_all(stemmer, token)

and be done with it, which is also an option...)
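
For illustration, the per-word idea behind that one-line change can be sketched independently of TextAnalysis.jl internals. Everything below is an assumption made for the example: stem_ngrams is a hypothetical helper, and toy_stem is a stand-in stemmer that only mimics the behaviour visible in this issue. The sketch also builds a new dictionary instead of adding and deleting keys while iterating over the original one.

```julia
# Hypothetical stand-in for a real stemmer (NOT the TextAnalysis.jl API).
toy_stem(w::AbstractString) =
    endswith(w, "y") ? String(chop(w)) * "i" :
    endswith(w, "e") ? String(chop(w)) : String(w)

# Stem every word of every n-gram and merge the counts of n-grams that
# become identical after stemming. Returns a new Dict; the input is untouched.
function stem_ngrams(ngrams::AbstractDict{<:AbstractString,<:Integer},
                     stemword=toy_stem)
    stemmed = Dict{String,Int}()
    for (ngram, count) in ngrams
        # Split the n-gram on whitespace, stem each word, and rejoin.
        new_ngram = join(map(stemword, split(ngram)), " ")
        stemmed[new_ngram] = get(stemmed, new_ngram, 0) + count
    end
    return stemmed
end

ngrams = Dict("this repository" => 1, "repository of" => 1, "julia language" => 1)
stemmed = stem_ngrams(ngrams)
```

Here "repository of" becomes "repositori of", because the first word is stemmed as well as the last.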

aviks added the "help wanted" and "good for beginners" labels on May 3, 2019

zgornel added a commit to zgornel/StringAnalysis.jl that referenced this issue May 7, 2019

Fixes bug reported in JuliaText/TextAnalysis.jl#149

db5bed3

mostol mentioned this issue Feb 18, 2022

Alter NGramDocuments' n-grams to consist of vectors of tokens or be handled like string docs? #261

Closed
