We have been using the fast TokenBuffer API to speed up various tokenizers in WordTokenizers.jl.
Referring to #141 and #140, I think it might be beneficial to extend the TokenBuffer API to the Document and Corpus types that TextAnalysis.jl offers (excluding NGramDocument and TokenDocument).
This could then be used to improve the performance of preprocessing.jl.
Edit: This could also serve as a solution for #74 and #76.
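For context, the TokenBuffer API works roughly as in the sketch below, based on the pattern documented in WordTokenizers.jl (the function name `my_tokenizer` is illustrative; `TokenBuffer`, `isdone`, `spaces`, `number`, and `character` are the WordTokenizers primitives):

```julia
using WordTokenizers

# Minimal custom tokenizer built on TokenBuffer (sketch, not the
# proposed TextAnalysis.jl implementation).
function my_tokenizer(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) && continue  # skip whitespace, flushing the current token
        number(ts) ||           # try to consume a number token
        character(ts)           # otherwise take one character into the buffer
    end
    return ts.tokens
end
```

Extending this to Documents and a Corpus would presumably mean running such buffer-based passes over each document's text instead of the current regex/split-based preprocessing.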