
An OpenNMT-based Seq2Seq G2P (grapheme-to-phoneme) model.

November 2, 2021

In this repo, we try to find an efficient way to convert graphemes to phonemes, especially for person names. We used the latest cmudict-0.7b[1] as the dataset for our G2P model and divided it into a train set and a test set. Grapheme-to-phoneme conversion is essentially a transduction task. As a classic neural machine translation and neural sequence learning toolkit, OpenNMT works well on sequence modeling and transduction problems, so we tested whether it can be used for seq2seq G2P. Our results showed that an OpenNMT-based G2P model can generate promising pronunciations even though the words in the test set were completely different from the words in the train set, which suggests that OpenNMT is suitable for Seq2Seq G2P.

Preparation

To use the OpenNMT-based seq2seq G2P model, install OpenNMT first. OpenNMT-py requires:

  • Python>=3.6
  • PyTorch==1.6.0

Install OpenNMT-py from pip:

pip install OpenNMT-py

or from the sources:

git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
pip install -e .

Note: if you encounter a MemoryError during installation, try to use pip with --no-cache-dir.

(Optional) Some advanced features (e.g. working pretrained models or specific transforms) require extra packages; you can install them with:

pip install -r requirements.opt.txt

Quickstart

Step 1: build a configuration file

We need to build a YAML configuration file to specify the data and model that will be used (the contents of this file are described in the Train G2P model section below):

vi cmudict_g2p_transformer.yaml

Step 2: generate pronunciations

To generate pronunciations for an English word list with a trained model (a trained model can be downloaded from https://drive.google.com/file/d/1QTPt0CiTF3GInr9DdCzlR2-nPpFmGr47/view?usp=sharing):

onmt_translate -model cmu_g2p_model_step_29300_release.pt -src valid_s.txt -output exp/pred_valid.txt -gpu 0 -verbose

The word list is a text file with one word per line, and each character in the word is separated by a space:

H E L L O
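
If your word list contains plain words, a small Python script can produce this spaced format. This is a minimal sketch; the file names words.txt and words_spaced.txt are example assumptions, not files from this repo:

# spaced_wordlist.py -- a sketch; file names are example assumptions
with open("words.txt") as fin, open("words_spaced.txt", "w") as fout:
    for line in fin:
        word = line.strip().upper()
        if word:
            # HELLO -> H E L L O
            fout.write(" ".join(word) + "\n")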

We can also write a script to run the conversion:

src=valid_s
tgt=valid_t
onmt_translate \
         -gpu 0 \
         -batch_size 2 \
         -beam_size 3 \
         -model cmu_g2p_model_step_29300_release.pt \
         -src $src.txt \
         -output run/pred_${src}.txt \
         -tgt $tgt.txt \
         -verbose \
         --n_best 3 \
         &> run/translate.log&

Train G2P model

Step 1: Prepare the data

To get started, we suggest downloading CMUdict; we also provide the processed CMUdict in this repo:

git clone https://github.com/Alexir/CMUdict.git

We use the latest cmudict-0.7b. Before training, we also need to do the following preprocessing (a combined sketch is given after these steps). Convert words into character sequences:

HELLO ---> H E L L O

Remove the markers after multi-pronounced words:

ABS  EY1 B IY1 EH1 S ---> ABS  EY1 B IY1 EH1 S
ABS(1)  AE1 B Z ---> ABS  AE1 B Z

If you want to train a G2P model for person names, we advise deleting the punctuation characters and their pronunciations.
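
A minimal Python sketch of this preprocessing, assuming the standard cmudict-0.7b format (latin-1 encoding, comment lines starting with ;;;, two spaces between word and pronunciation); the output file names are example assumptions:

# preprocess_cmudict.py -- a sketch; output file names are assumptions
import re

with open("cmudict-0.7b", encoding="latin-1") as fin, \
     open("cmudict_s.txt", "w") as fsrc, \
     open("cmudict_t.txt", "w") as ftgt:
    for line in fin:
        line = line.strip()
        if not line or line.startswith(";;;"):    # skip comments/blanks
            continue
        word, phones = line.split("  ", 1)
        if not word[0].isalpha():                 # drop punctuation entries
            continue
        word = re.sub(r"\(\d+\)$", "", word)      # strip markers like (1)
        fsrc.write(" ".join(word) + "\n")         # HELLO -> H E L L O
        ftgt.write(phones.strip() + "\n")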

Divide the cmudict into train set and test set:

python make_train_test.py
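
make_train_test.py is provided in this repo. For illustration, a minimal random split might look like the sketch below; the 95/5 ratio and file names are assumptions, not necessarily what the script actually does:

# make_train_test.py (sketch) -- random train/validation split
# the 95/5 ratio and file names are assumptions for illustration
import random

with open("cmudict_s.txt") as f:
    sources = f.readlines()
with open("cmudict_t.txt") as f:
    targets = f.readlines()

pairs = list(zip(sources, targets))
random.seed(0)
random.shuffle(pairs)

cut = int(0.95 * len(pairs))
for name, subset in (("train", pairs[:cut]), ("valid", pairs[cut:])):
    with open(f"{name}_s.txt", "w") as fs, open(f"{name}_t.txt", "w") as ft:
        for s, t in subset:
            fs.write(s)
            ft.write(t)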

We need to build a YAML configuration file to specify the data that will be used:

# cmudict_g2p_transformer.yaml

## Where the samples will be written
save_data: data/local
## Where the vocab(s) will be written
src_vocab: data/cmudict.vocab.src
tgt_vocab: data/cmudict.vocab.tgt
# Allow overwriting existing files in the folder
overwrite: True
# Corpus opts:
data:
    train:
        path_src: train_s.txt
        path_tgt: train_t.txt
    valid:
        path_src: valid_s.txt
        path_tgt: valid_t.txt

From this configuration, we can build the vocab(s) that will be necessary to train the model:

onmt_build_vocab -config cmudict_g2p_transformer.yaml -n_sample -1

Notes:

  • -n_sample is required here -- it represents the number of lines sampled from each corpus to build the vocab; -1 means the full corpus is used.

Step 2: Train the model

To train a model, we need to add the following to the YAML configuration file:

  • the vocabulary path(s) that will be used: these can be the ones generated by onmt_build_vocab;
  • training-specific parameters.

# cmudict_g2p_transformer.yaml

...

# Vocabulary files that were just created
src_vocab: data/cmudict.vocab.src
tgt_vocab: data/cmudict.vocab.tgt

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
save_model: exp/run/model
save_checkpoint_steps: 100
train_steps: 3000
valid_steps: 1000
...

Then you can simply run:

onmt_train -config cmudict_g2p_transformer.yaml

For more parameters, see the example configurations.

Step 3: Convert

onmt_translate -model exp/run/model_step_1000.pt -src data/src-test.txt -output exp/pred_1000.txt -gpu 0 -verbose

Now you have a model which you can use to predict on new data. We do this by running beam search. This will output predictions into exp/pred_1000.txt.
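
To gauge how well the trained model does, the predictions can be compared against the reference pronunciations. This is a minimal word-accuracy sketch, assuming one hypothesis per line (the default -n_best 1); the file names are assumptions based on the commands above:

# eval_g2p.py -- a sketch; assumes one hypothesis per word (-n_best 1)
with open("exp/pred_1000.txt") as f:
    hyps = [line.split() for line in f]
with open("valid_t.txt") as f:
    refs = [line.split() for line in f]

# a word counts as correct only if its full phoneme sequence matches
correct = sum(h == r for h, r in zip(hyps, refs))
print(f"word accuracy: {correct / len(refs):.2%}")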

Note:

We can also write a script to run the conversion:

checkpoint=10000
src=valid_s
tgt=valid_t
onmt_translate \
         -gpu 0 \
         -batch_size 2 \
         -beam_size 3 \
         -model exp/run/model_step_${checkpoint}.pt \
         -src $src.txt \
         -output run/pred_${src}_${checkpoint}.txt \
         -tgt $tgt.txt \
         -verbose \
         --n_best 3 \
         &> run/translate.log&
