I created this French SentencePiece Language Model, by using french Wikipedia articles that had more than 2500 words. The model is an AWD_LSTM, and it was trined for around 14 hours on a GCP using a v100 instance. If your goal is to create a language model more training would be necessary, and most likely a different architecture, but if you are using it as part of a classification task, you can fine-tune this model with your data set.
If you would like to download the pretrained french language model is is available in my google drive.
Inside you will find several different files:
- fr_spm.ipynb - notebook I used to create the language spm model
- learner_fr_spm_enc.pth - encoder
- learner_mod_fr_spm.pkl - language model learner
- learner_mod_fr_spm_export.pkl - language model learner(using export)
- learner_mod_fr_spm_save.pkl.pth - language model learner(using save)
- learner_vocab_fr_spm.pkl - language model vocab
- spm.model - Sentence Piece model(spm)
- spm.vocab - Sentence Piece vocabulary
If you would like to use these models, I would recommend that you check out the amazing Fast.ai NLP course, github page, and the associated turkish/spm notebook
If you are following along with the nn-turkish.ipynb notebook, you will need to create a folder called tmp
and put spm.model
and spm.vocab
inside of that tmp
folder and In the step where you are creating a finetuned language model:
data_lm = (TextList.from_df(df, path_clas, cols='text', processor=SPProcessor.load(dest))
.split_by_rand_pct(0.1, seed=42)
.label_for_lm()
.databunch(bs=bs, num_workers=1))
data_lm.save(f'{lang}_clas_databunch')
You are going to define dest as the location where you put the tmp
folder.
If you are using .save
, you can add an optional argument return_path=True
so you can know where everything is being stored, since I found that it is not always abundantly clear.