
Faster Preprocess [WIP] #163

Closed · wants to merge 15 commits

Conversation


@Ayushk4 (Member) commented Jun 16, 2019

An attempt at the approach mentioned in #143.
As of now, it's roughly as fast as the existing implementation.
Some functions are still work in progress.

Also fixes #74 (see #76).

  • strip_articles
  • strip_pronouns
  • strip_prepositions
  • strip_stopwords
  • strip_whitespace
  • strip_corrupt_utf8
  • strip_punctuation
  • strip_numbers
  • strip_case
  • strip_frequent_terms and strip_sparse_terms
  • Fixes "Replacement function for list of stuff." #23
  • Tests
  • Docstrings
  • Documentation
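The checklist above tracks porting each operation to the single-pass, token-buffer style of preprocessing discussed in #143. A minimal sketch of that idea follows; this is illustrative only, not the PR's actual `fastpreprocess` code, and the names `REMOVE` and `fastpreprocess_sketch` are hypothetical. The point is that one pass over the characters, checking each buffered token against a combined removal set, replaces a separate regex pass per flag.

```julia
# Toy removal set standing in for the union of articles, pronouns,
# prepositions, and stopwords that the selected flags would produce.
const REMOVE = Set(["a", "an", "the", "he", "she", "it", "in", "on", "at"])

# Single pass: buffer letters into a token; at each non-letter boundary,
# emit the token only if it is not in the removal set. Separators are
# passed through unchanged, so removed tokens leave their surrounding
# whitespace behind (a real implementation would also collapse that).
function fastpreprocess_sketch(s::AbstractString)
    out = IOBuffer()
    token = IOBuffer()
    for c in s
        if isletter(c)
            write(token, lowercase(c))
        else
            t = String(take!(token))
            if !isempty(t) && !(t in REMOVE)
                write(out, t)
            end
            write(out, c)
        end
    end
    t = String(take!(token))          # flush a trailing token, if any
    if !isempty(t) && !(t in REMOVE)
        write(out, t)
    end
    return String(take!(out))
end
```

Because the removal sets are merged into one hash lookup per token, adding more strip-flags costs almost nothing extra per pass, which is where the speedup over chained per-flag regex replacements comes from.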


@Ayushk4 (Member, Author) commented Jun 23, 2019

This currently supports the strip_articles, strip_pronouns, strip_prepositions, and strip_stopwords operations. On those four operations it is at least 4x faster for documents of 100,000 characters and 2-3x faster for documents of 10,000 characters. The speedup grows with document size, but converges to the speed of the existing implementation for smaller documents.

julia> @time fastpreprocess(StringDocument(s))
  0.006278 seconds (3.78 k allocations: 693.500 KiB)

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.024585 seconds (1.65 k allocations: 207.063 KiB)

julia> length(s)
100000

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.027906 seconds (1.65 k allocations: 207.063 KiB, 15.04% gc time)

julia> @time fastpreprocess(StringDocument(s))
  0.007384 seconds (3.78 k allocations: 693.500 KiB)
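As an aside on the measurements above: a bare `@time` call includes JIT compilation on the first run and is sensitive to GC noise (visible in the 15.04% gc time line). A steadier comparison can be sketched with BenchmarkTools, which runs the expression many times and reports the minimum. This assumes TextAnalysis and BenchmarkTools are installed; `randstring` here is only a stand-in for whatever 100,000-character string `s` was used.

```julia
using BenchmarkTools, Random, TextAnalysis

# Stand-in input: a 100_000-character string of letters and spaces.
s = randstring("abcdefghij ", 100_000)

# prepare! mutates its document, so build a fresh StringDocument per
# sample (setup runs before each sample; evals=1 prevents reuse).
@btime prepare!(d, strip_articles | strip_pronouns |
                   strip_stopwords | strip_prepositions) setup = (d = StringDocument(s)) evals = 1
```

Comparing minimum times rather than single `@time` runs makes the 4x claim much easier to reproduce across machines.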


@aviks (Member) commented Nov 2, 2020

Hey @Ayushk4, can we finish this up?


@Ayushk4 (Member, Author) commented Nov 2, 2020

I was only able to make this faster for the first couple of operations added. When I applied the same token-buffer approach to more operations later, overall performance became much slower than the existing implementation.

For the time being, I am closing this. If I find another way to speed it up, I will re-open it or send another PR.

@Ayushk4 closed this Nov 2, 2020

Merging this pull request may close these issues:

  • remove_words! fails for long terms & terms with punctuation
  • Replacement function for list of stuff.