
Faster Preprocess [WIP] #163

Closed · wants to merge 15 commits

Conversation


@Ayushk4 (Member) commented Jun 16, 2019

An attempt at the approach mentioned in #143.
As of now, it's roughly as fast as the existing implementation.
Some functions are still work in progress.

Also fixes #74 (see #76).

  • strip_articles
  • strip_pronouns
  • strip_prepositions
  • strip_stopwords
  • strip_whitespace
  • strip_corrupt_utf8
  • strip_punctuation
  • strip_numbers
  • strip_case
  • strip_frequent_terms and strip_sparse_terms
  • Fixes "Replacement function for list of stuff." #23
  • Tests
  • Docstrings
  • Documentation
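The checklist above tracks porting each operation to the single-pass, token-buffer style of preprocessing discussed in #143. A minimal sketch of that idea follows; this is illustrative only, not the PR's actual `fastpreprocess` code, and the names `REMOVE` and `fastpreprocess_sketch` are hypothetical. The point is that one pass over the characters, checking each buffered token against a combined removal set, replaces a separate regex pass per flag.

```julia
# Toy removal set standing in for the union of articles, pronouns,
# prepositions, and stopwords that the selected flags would produce.
const REMOVE = Set(["a", "an", "the", "he", "she", "it", "in", "on", "at"])

# Single pass: buffer letters into a token; at each non-letter boundary,
# emit the token only if it is not in the removal set. Separators are
# passed through unchanged, so removed tokens leave their surrounding
# whitespace behind (a real implementation would also collapse that).
function fastpreprocess_sketch(s::AbstractString)
    out = IOBuffer()
    token = IOBuffer()
    for c in s
        if isletter(c)
            write(token, lowercase(c))
        else
            t = String(take!(token))
            if !isempty(t) && !(t in REMOVE)
                write(out, t)
            end
            write(out, c)
        end
    end
    t = String(take!(token))          # flush a trailing token, if any
    if !isempty(t) && !(t in REMOVE)
        write(out, t)
    end
    return String(take!(out))
end
```

Because the removal sets are merged into one hash lookup per token, adding more strip-flags costs almost nothing extra per pass, which is where the speedup over chained per-flag regex replacements comes from.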


@Ayushk4 (Member, Author) commented Jun 23, 2019

This currently supports the strip_articles, strip_pronouns, strip_prepositions, and strip_stopwords operations. On those four operations it is at least 4x faster for documents of 100,000 characters and 2-3x faster for documents of 10,000 characters. The speedup grows with document size, but converges to the speed of the existing implementation for smaller documents.

julia> @time fastpreprocess(StringDocument(s))
  0.006278 seconds (3.78 k allocations: 693.500 KiB)

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.024585 seconds (1.65 k allocations: 207.063 KiB)

julia> length(s)
100000

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.027906 seconds (1.65 k allocations: 207.063 KiB, 15.04% gc time)

julia> @time fastpreprocess(StringDocument(s))
  0.007384 seconds (3.78 k allocations: 693.500 KiB)
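As an aside on the measurements above: a bare `@time` call includes JIT compilation on the first run and is sensitive to GC noise (visible in the 15.04% gc time line). A steadier comparison can be sketched with BenchmarkTools, which runs the expression many times and reports the minimum. This assumes TextAnalysis and BenchmarkTools are installed; `randstring` here is only a stand-in for whatever 100,000-character string `s` was used.

```julia
using BenchmarkTools, Random, TextAnalysis

# Stand-in input: a 100_000-character string of letters and spaces.
s = randstring("abcdefghij ", 100_000)

# prepare! mutates its document, so build a fresh StringDocument per
# sample (setup runs before each sample; evals=1 prevents reuse).
@btime prepare!(d, strip_articles | strip_pronouns |
                   strip_stopwords | strip_prepositions) setup = (d = StringDocument(s)) evals = 1
```

Comparing minimum times rather than single `@time` runs makes the 4x claim much easier to reproduce across machines.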


@aviks (Member) commented Nov 2, 2020

Hey @Ayushk4, can we finish this up?


@Ayushk4 (Member, Author) commented Nov 2, 2020

I was only able to make this faster for the first couple of operations added. When I applied the same token-buffer approach to more operations later, overall performance became much slower than the existing implementation.

For the time being, I am closing this. If I find another way to speed it up, I will re-open it or send another PR.

@Ayushk4 closed this Nov 2, 2020

Merging this pull request may close these issues:

  • remove_words! fails for long terms & terms with punctuation
  • Replacement function for list of stuff.