I looked in the documentation and I could not find any tooling to build a lexicon when the Corpus can't fit in memory.
Let's say I want to build tf-idf vectors over a lexicon of 10 million ngrams, but I can't fit in memory all the text files needed to learn that there are 10 million ngrams in the corpus.
What I would like to do is build the lexicon incrementally from batches of documents that I load (note that I don't want to keep the text of the documents; I just want to tokenize them to learn the lexicon from the data), e.g.:
for batch_of_documents in folder
    update!(lexicon, batch_of_documents, tokenizer)
end
and then
m = DocumentTermMatrix(["some text here", "here more text"]; lexicon, tokenizer)
Is there a way to do this?
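For concreteness, the closest workaround I can imagine with the current API is a rough sketch like the one below, accumulating ngram counts into a plain Dict one document at a time (assuming `ngrams` on a `StringDocument` and the `DocumentTermMatrix(crps, lexicon)` constructor behave the way I think they do; the folder path is a placeholder):

using TextAnalysis

# Accumulate ngram counts batch by batch; only the current document's text
# is held in memory, and it is dropped once its counts are merged.
lexicon_counts = Dict{String, Int}()

for filepath in readdir("corpus_folder"; join = true)  # placeholder folder
    doc = StringDocument(read(filepath, String))
    for (ngram, count) in ngrams(doc)                   # unigrams by default
        lexicon_counts[ngram] = get(lexicon_counts, ngram, 0) + count
    end
end

# Later, score new documents against the accumulated lexicon.
crps = Corpus([StringDocument("some text here"), StringDocument("here more text")])
m = DocumentTermMatrix(crps, lexicon_counts)

But if there is built-in support for streaming/incremental lexicon construction, I'd rather use that.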