A SentencePiece language model created on GCP using the fast.ai NLP methodology

danacity/French-Wiki-2500-Pretrained-SentencePiece-LM

French-Wiki-2500-Pretrained-SentencePiece-LM

I created this French SentencePiece language model using French Wikipedia articles of more than 2,500 words. The model is an AWD_LSTM, trained for around 14 hours on a GCP V100 instance. If your goal is a standalone language model, more training would be necessary, and most likely a different architecture; but if you are using it as part of a classification task, you can fine-tune this model on your own data set.

If you would like to download the pretrained French language model, it is available in my Google Drive.

Inside you will find several different files:

  • fr_spm.ipynb - notebook I used to create the SentencePiece (spm) language model
  • learner_fr_spm_enc.pth - encoder
  • learner_mod_fr_spm.pkl - language model learner
  • learner_mod_fr_spm_export.pkl - language model learner (using export)
  • learner_mod_fr_spm_save.pkl.pth - language model learner (using save)
  • learner_vocab_fr_spm.pkl - language model vocab
  • spm.model - SentencePiece model (spm)
  • spm.vocab - SentencePiece vocabulary
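
After downloading, it can help to confirm that nothing is missing before opening the notebook. The following is a minimal sketch using only the standard library; the helper name missing_files and the idea of a single local download folder are assumptions for illustration, while the file names are the ones listed above.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "fr_spm.ipynb",
    "learner_fr_spm_enc.pth",
    "learner_mod_fr_spm.pkl",
    "learner_mod_fr_spm_export.pkl",
    "learner_mod_fr_spm_save.pkl.pth",
    "learner_vocab_fr_spm.pkl",
    "spm.model",
    "spm.vocab",
]

def missing_files(folder):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]
```

Calling missing_files on your download folder returns an empty list when every file is in place.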

If you would like to use these models, I recommend checking out the amazing fast.ai NLP course, its GitHub page, and the associated turkish/spm notebook.

If you are following along with the nn-turkish.ipynb notebook, you will need to create a folder called tmp and put spm.model and spm.vocab inside it. Then, in the step where you create the fine-tuned language model:

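from fastai.text import *  # fastai v1; provides TextList and SPProcessor

# df, path_clas, bs and lang come from the notebook; dest is the folder holding tmp
data_lm = (TextList.from_df(df, path_clas, cols='text', processor=SPProcessor.load(dest))
    .split_by_rand_pct(0.1, seed=42)   # hold out 10% for validation
    .label_for_lm()                    # label the texts for language modelling
    .databunch(bs=bs, num_workers=1))

data_lm.save(f'{lang}_clas_databunch')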

Define dest as the location where you put the tmp folder, that is, the directory that contains tmp.
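
The folder setup described above can be sketched with the standard library. The function name stage_spm_files and the download_dir argument are hypothetical; the sketch only assumes that spm.model and spm.vocab have already been downloaded to some local folder.

```python
from pathlib import Path
import shutil

def stage_spm_files(download_dir, dest, tmp_dir="tmp"):
    """Copy spm.model and spm.vocab from download_dir into dest/tmp_dir."""
    target = Path(dest) / tmp_dir
    target.mkdir(parents=True, exist_ok=True)  # create dest/tmp if needed
    for name in ("spm.model", "spm.vocab"):
        src = Path(download_dir) / name
        if not src.is_file():
            raise FileNotFoundError(f"{src} not found; download it first")
        shutil.copy2(src, target / name)
    return target
```

The returned path is the tmp folder itself; the value to use as dest in the notebook is its parent.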

If you are using a learner's .save, you can pass the optional argument return_path=True to find out where everything is being stored; I found that this is not always clear.
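
As a small illustration, the call can be wrapped so that the location is always printed. This is a sketch that assumes learn is a fastai v1 learner whose save method accepts return_path=True, as described above; the wrapper name save_and_locate is hypothetical.

```python
def save_and_locate(learn, name):
    """Save `learn` under `name` and report where the file was written."""
    # return_path=True asks save() to hand back the path it wrote to
    saved_path = learn.save(name, return_path=True)
    print(f"Saved to: {saved_path}")
    return saved_path
```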
