In this repo, we explore an efficient way to convert graphemes to phonemes (G2P), especially for person names. We used the latest cmudict-0.7b [1] as the dataset for our G2P model and divided it into a train set and a test set. Grapheme-to-phoneme conversion is essentially a transduction task. As a classic neural machine translation and neural sequence learning toolkit, OpenNMT works well on sequence modeling and transduction problems, so we tested whether it can be used for seq2seq G2P. Even though the words in the test set were completely different from the words in the train set, our OpenNMT-based G2P model generated promising pronunciations, suggesting that OpenNMT is suitable for seq2seq G2P.
To use the OpenNMT-based seq2seq G2P model, install OpenNMT first. OpenNMT-py requires:
- Python>=3.6
- PyTorch==1.6.0
Install OpenNMT-py from pip:
pip install OpenNMT-py
or from the sources:
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
pip install -e .
Note: if you encounter a MemoryError during installation, try to use pip with --no-cache-dir.
(Optional) Some advanced features (e.g. working pretrained models or specific transforms) require extra packages; you can install them with:
pip install -r requirements.opt.txt
We need to build a YAML configuration file to specify the data and model that will be used:
vi cmudict_g2p_transformer.yaml
To generate pronunciations for an English word list with a trained model, first download the released model from https://drive.google.com/file/d/1QTPt0CiTF3GInr9DdCzlR2-nPpFmGr47/view?usp=sharing and run:
onmt_translate -model cmu_g2p_model_step_29300_release.pt -src valid_s.txt -output exp/pred_valid.txt -gpu 0 -verbose
The word list is a text file with one word per line, and each character in the word is separated by a space:
H E L L O
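If your word list starts out as plain words (one per line), a minimal Python sketch like the following can produce the spaced format; words.txt is a hypothetical input file, not something shipped with this repo:

# spaced_words.py -- hypothetical helper: convert a plain word list
# (one word per line, e.g. words.txt) into the space-separated
# character format expected by onmt_translate.
with open("words.txt") as fin, open("valid_s.txt", "w") as fout:
    for line in fin:
        word = line.strip().upper()
        if word:
            fout.write(" ".join(word) + "\n")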
We can also write a script to run the conversion:
src=valid_s
tgt=valid_t
onmt_translate \
-gpu 0 \
-batch_size 2 \
-beam_size 3 \
-model cmu_g2p_model_step_29300_release.pt \
-src $src.txt \
-output run/pred_${src}.txt \
-tgt $tgt.txt \
-verbose \
--n_best 3 \
&> run/translate.log&
To get started, download CMUdict for grapheme-to-phoneme data; we also provide the processed CMUdict in this repo:
git clone https://github.com/Alexir/CMUdict.git
We can use the latest cmudict-0.7b, which needs the following preprocessing. Convert words into character sequences:
HELLO ---> H E L L O
Remove the numeric markers from the alternative entries of multi-pronounced words:
ABS EY1 B IY1 EH1 S ---> ABS EY1 B IY1 EH1 S
ABS(1) AE1 B Z ---> ABS AE1 B Z
If you want to train a G2P model for person names, we advise you to delete the punctuation entries and their pronunciations.
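Taken together, a minimal preprocessing sketch could look like the following (the input/output file names are our assumptions; cmudict-0.7b separates word and pronunciation with two spaces and is Latin-1 encoded):

# prepare_cmudict.py -- a preprocessing sketch (file names are assumptions).
# Drops comment lines and punctuation entries, strips the (1)/(2)/...
# markers, and writes spaced graphemes and phonemes to parallel files.
import re

with open("cmudict-0.7b", encoding="latin-1") as fin, \
     open("cmudict_s.txt", "w") as fs, \
     open("cmudict_t.txt", "w") as ft:
    for line in fin:
        if not line.strip() or line.startswith(";;;"):  # skip comments
            continue
        word, phones = line.rstrip().split("  ", 1)
        word = re.sub(r"\(\d+\)$", "", word)   # ABS(1) -> ABS
        if not word.isalpha():                 # drop punctuation entries
            continue                           # (also drops A.M., 'BOUT, ...)
        fs.write(" ".join(word) + "\n")        # HELLO -> H E L L O
        ft.write(phones + "\n")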
Divide CMUdict into a train set and a test set:
python make_train_test.py
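We don't reproduce make_train_test.py here, and its exact logic may differ, but a random split along the following lines would do the same job (the ~95/5 ratio is an assumption; the output file names match the config below):

# a sketch of a random train/test split (the repo's make_train_test.py
# may differ); reads the spaced files from the preprocessing step.
import random

with open("cmudict_s.txt") as fs, open("cmudict_t.txt") as ft:
    pairs = list(zip(fs, ft))
random.seed(0)
random.shuffle(pairs)
n_valid = len(pairs) // 20                    # hold out ~5% as the test set
for name, subset in (("valid", pairs[:n_valid]), ("train", pairs[n_valid:])):
    with open(f"{name}_s.txt", "w") as fs, open(f"{name}_t.txt", "w") as ft:
        for src, tgt in subset:
            fs.write(src)
            ft.write(tgt)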
We need to build a YAML configuration file to specify the data that will be used:
# cmudict_g2p_transformer.yaml

## Where the samples will be written
save_data: data/local
## Where the vocab(s) will be written
src_vocab: data/cmudict.vocab.src
tgt_vocab: data/cmudict.vocab.tgt
# Overwrite existing files in the folder
overwrite: True

# Corpus opts:
data:
    train:
        path_src: train_s.txt
        path_tgt: train_t.txt
    valid:
        path_src: valid_s.txt
        path_tgt: valid_t.txt
From this configuration, we can build the vocab(s) that will be necessary to train the model:
onmt_build_vocab -config cmudict_g2p_transformer.yaml -n_sample -1
Note: -n_sample is required here; it represents the number of lines sampled from each corpus to build the vocab (-1 means the full corpus is used).
To train a model, we need to add the following to the YAML configuration file:
- the vocabulary path(s) that will be used: these can be the ones generated by onmt_build_vocab;
- training-specific parameters.
# cmudict_g2p_transformer.yaml
...
# Vocabulary files that were just created
src_vocab: data/cmudict.vocab.src
tgt_vocab: data/cmudict.vocab.tgt
# Train on a single GPU
world_size: 1
gpu_ranks: [0]
# Where to save the checkpoints
save_model: exp/run/model
save_checkpoint_steps: 100
train_steps: 3000
valid_steps: 1000
...
Then you can simply run:
onmt_train -config cmudict_g2p_transformer.yaml
For more parameters, see the example configurations in the OpenNMT-py documentation.
Now you have a model which you can use to predict on new data. We do this by running beam search, which writes the predictions into exp/pred_1000.txt:
onmt_translate -model exp/run/model_step_1000.pt -src data/src-test.txt -output exp/pred_1000.txt -gpu 0 -verbose
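To get a rough quality number, you can compare the 1-best predictions against the reference pronunciations. A minimal scoring sketch, assuming exp/pred_1000.txt holds one prediction per line and valid_t.txt is the line-aligned reference file (both file pairings are assumptions):

# score_g2p.py -- hypothetical helper: word-level accuracy of 1-best output.
with open("exp/pred_1000.txt") as fp, open("valid_t.txt") as fr:
    preds = [line.strip() for line in fp]
    refs = [line.strip() for line in fr]
correct = sum(p == r for p, r in zip(preds, refs))
print(f"word accuracy: {correct}/{len(refs)} = {correct / len(refs):.3f}")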
Note: we can also write a script to run the conversion:
checkpoint=10000
src=valid_s
tgt=valid_t
onmt_translate \
-gpu 0 \
-batch_size 2 \
-beam_size 3 \
-model exp/run/model_step_${checkpoint}.pt \
-src $src.txt \
-output run/pred_${src}_${checkpoint}.txt \
-tgt $tgt.txt \
-verbose \
--n_best 3 \
&> run/translate.log&
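Because the script passes --n_best 3, onmt_translate writes three hypotheses per input word, one per line. A small sketch to group them back per word (the output file name follows the variables in the script above):

# group n-best predictions back into one entry per word (n_best = 3 above)
N_BEST = 3
with open("valid_s.txt") as fs, open("run/pred_valid_s_10000.txt") as fp:
    words = ["".join(line.split()) for line in fs]  # "H E L L O" -> "HELLO"
    hyps = [line.strip() for line in fp]
for i, word in enumerate(words):
    print(word, "|", " / ".join(hyps[i * N_BEST:(i + 1) * N_BEST]))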