Forced phoneme alignment was done using Gentle, which is a robust yet lenient aligner built on Kaldi. Gentle provides a python API as well as a REST API to force align audio files. Since, gentle contains a ton of platform dependent dependencies, it's convenient to use their pre-built Docker image.
For this project, we built a docker image using the gentle docker image as base which can align audio files in batch by reading input from a csv file.
To build the docker image run the following from alignment
directory:
$ docker build -t gentle-alignment .
By default we use CMU dictionary,which is packed into the docker image. To add a new dict,
modify the gentle_alignment.py
file, update Dockerfile
to package
the new dict and rebuild the image.
Running the program requires four input parameters:
input_csv
: A csv files containing the path of the wav file and it's transcriptphoneme_path
: Output path where phoneme slices (wav files) would be writtenoutput_csv
: Path of the csv file which would contain metadata of the aligned file. This is needed when generating training samples from the phoneme level wav files.wav_root
: The root path of the input wav files. If theinput_csv
contains the complete path of the input wav file, this parameter can be set to a blank string
The input_csv
is a tab separated file with three columns - id, wav_path & transcript and no header. A sample file
looks like this:
libri_2582-155973-0032 train-clean-360/2582/155973/2582-155973-0032.wav the little boy's feelings overcame him he had been loaned a king snake which as all nature lovers know is not only a useful but a beautiful snake very friendly to human beings and he came rushing home to show the treasure
libri_5724-13364-0083 train-clean-360/5724/13364/5724-13364-0083.wav why need we care for outside things why indeed he said in a low fond tone so i easily found out how they meant to settle the difficulty namely
libri_2494-156015-0019 train-clean-360/2494/156015/2494-156015-0019.wav they therefore remain bound the man who does not shrink from self crucifixion can never fail to accomplish the object upon which his heart is set this is as true of earthly as of heavenly things
libri_8193-116805-0037 train-clean-360/8193/116805/8193-116805-0037.wav what shall we do wise medeia we must have water or we die of thirst flesh and blood we can face fairly but who can face this red hot brass i can face red hot brass if the tale i hear be true
libri_8066-290901-0040 train-clean-360/8066/290901/8066-290901-0040.wav i promised him to come up soon but i continued on for some hours with the drunken crowd when i did come up to our apartment i found donald on his knees by his bed with his testament and an old hymn book of my mother in law's
The output_csv
is also tab separated and has four columns - phoneme slice wav file name, id, word & phoneme
libri_5808-54425-0000_five_ay1_690_820.wav libri_5808-54425-0000 five ay1
libri_5808-54425-0000_years_ih1_960_1070.wav libri_5808-54425-0000 years ih1
libri_5808-54425-0000_is_ih0_1300_1380.wav libri_5808-54425-0000 is ih0
libri_5808-54425-0000_but_ah1_1540_1590.wav libri_5808-54425-0000 but ah1
libri_5808-54425-0000_a_ey1_1650_1740.wav libri_5808-54425-0000 a ey1
libri_5808-54425-0000_short_ao1_1860_1950.wav libri_5808-54425-0000 short ao1
libri_5808-54425-0000_time_ay1_2230_2540.wav libri_5808-54425-0000 time ay1
libri_5808-54425-0000_in_ih1_2600_2670.wav libri_5808-54425-0000 in ih1
libri_5808-54425-0000_the_ah1_2750_2800.wav libri_5808-54425-0000 the ah1
libri_5808-54425-0000_life_ay1_2900_3000.wav libri_5808-54425-0000 life ay1
These inputs are given to the docker container via environment variables. An example of running alignment:
$ docker run --rm -v /home/ubuntu/capstone:/work/capstone \
-e input_csv=/work/capstone/gentle/libri_path_transcript_2.csv \
-e phoneme_path=/work/capstone/gentle/phoneme_slices/ \
-e output_csv=/work/capstone/gentle/aligned_phonemes_2.csv \
-e wav_root=/work/capstone/gentle/LibriSpeech gentle-alignment:latest
Notice that a host directory is mounted on the container and all the paths are of the docker container and not the host machine.
Pro Tip: Alignment is not run in parallel inside of the docker container. To parallelize the process, split the
input_csv
into multiple smaller files and then do a docker run
for each of those files. Merge the output_csv
files in the end.