The PyTorch implementation of PS-TTS
Overall architecture
> Recently, artificial-intelligence-based dubbing technology has significantly advanced, enabling automated dubbing (AD) to effectively convert the source speech of a video into target speech in different languages. However, achieving natural AD still faces lip-sync challenges in the dubbed video, which is crucial for preserving the viewer experience. It has been popular to use a deep fake technique for lip-sync by altering the video contents with the translated text and synthesized target speech. Instead, this paper proposes a method of achieving lip-sync by paraphrasing the translated text in an AD pipeline. The proposed method comprises two processing steps: isochrony for the timing constraint between source and target speech, and phonetic synchronization (PS) for lip-sync without altering the video contents. First, an isochrony approach is proposed for AD between languages with different structures to match the duration of the target speech with that of the source speech. This is performed by paraphrasing the translated text using a language model to achieve isochrony. Second, a lip-sync method is also proposed to paraphrase the isochronous target text by using PS, motivated by the fact that vowels are directly related to mouth movements during pronunciation. The proposed PS method employs dynamic time warping with local costs of vowel distances measured from the training data, so that the target text is composed of target vowels with pronunciations similar to the source vowels. The proposed isochrony and PS methods are incorporated into a text-to-speech system, which is referred to as PS-TTS. The performance of PS-TTS is evaluated using Korean and English lip-reading datasets and a voice actor dubbing test set collected from films. Experimental results demonstrate that PS-TTS outperforms the TTS without PS in terms of several objective quality measures. Moreover, it provides comparable performance to voice actors in Korean-to-English as well as English-to-Korean dubbing.
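As a quick illustration of the PS idea above, the sketch below aligns a source vowel sequence with a target vowel sequence using dynamic time warping. It is not the repository's implementation: the `vowel_dist` table is a hypothetical stand-in for the vowel distances that the paper measures from training data.

```python
# Illustrative DTW alignment of source and target vowel sequences.
# NOTE: minimal sketch only; the real PS module uses vowel distances
# measured from training data, which are mocked up here.
import numpy as np

# Hypothetical vowel-to-vowel distances (smaller = more similar mouth shape).
vowel_dist = {
    ("a", "a"): 0.0, ("a", "e"): 0.6, ("a", "o"): 0.4,
    ("e", "e"): 0.0, ("e", "i"): 0.3, ("o", "o"): 0.0,
}

def dist(u, v):
    """Symmetric lookup with a default cost for unseen vowel pairs."""
    return vowel_dist.get((u, v), vowel_dist.get((v, u), 1.0))

def dtw_align(src_vowels, trg_vowels):
    """Return the total DTW cost and the optimal source-target alignment path."""
    n, m = len(src_vowels), len(trg_vowels)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(src_vowels[i - 1], trg_vowels[j - 1])
            acc[i, j] = local + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]

cost, path = dtw_align(["a", "o", "e"], ["a", "e", "e", "o"])
print(cost, path)
```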
✔︎ Platforms: Ubuntu 20.04
✔︎ Python >= 3.8
✔︎ System packages: apt update -y && apt install gcc libsndfile1 -y
✔︎ GPU: We use a V100 with CUDA 11.4 for the PS-TTS process
- Clone this repository and install the Python requirements. Please download the repository here.
pip install -r requirements.txt
You need to install espeak first: apt-get install espeak
git lfs is needed to download the pre-trained baseline TTS model. If you clone the repository without the pre-trained TTS model, please download the checkpoint ./ckpts/baseline.pth separately.
- For inference, you need the following data: 1) input video, 2) source speech, 3) source text, and 4) BGM.
- This repository provides code for the synchronization method proposed in PS-TTS, but it does not provide source separation or automatic speech recognition results.
- If you want to separate the source speech from a video in MP4 format, run bash separate_video.sh <input_video_path> <output_audio_path> <output_video_path>. For example:
bash separate_video.sh ./input_video/input_video_kr.mp4 ./prepare_data/output.wav ./prepare_data/separated_video.mp4
- If you are using your own data, use a video without BGM, or separate the BGM using the open-source tool available at https://github.com/sigsep/open-unmix-pytorch.git.
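If open-unmix gives you the separated vocals, one rough way to obtain the accompanying BGM track is to subtract the vocals from the audio extracted from the video. The sketch below is only an approximation and assumes both files share the same sample rate and channel layout; the paths reuse the example file names from this README and are not produced by the repository itself.

```python
# Rough sketch: derive an approximate BGM track by subtracting the separated
# vocals from the extracted mixture. The file names are placeholders taken
# from the examples in this README, not outputs of the PS-TTS code.
import soundfile as sf

mix, sr_mix = sf.read("prepare_data/output.wav")      # audio extracted by separate_video.sh
vocals, sr_voc = sf.read("prepare_data/vocals.wav")   # vocals separated by open-unmix
assert sr_mix == sr_voc, "resample one of the files so the sample rates match"

n = min(len(mix), len(vocals))                        # guard against small length mismatches
bgm = mix[:n] - vocals[:n]                            # residual ~ background music / effects
sf.write("prepare_data/bgm_mixer.wav", bgm, sr_mix)
```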
When time alignment for the source text is not available:
inference.txt
0001 {source text}
0002 {source text}
FIRST, replace {source text} with the actual text. For example:
inference.txt
0001 Hello, how are you?
0002 I am fine, thank you.
Make sure each line contains text from only one speaker.
SECOND, you need to obtain the time alignment that matches the speech segments in the text:
inference.txt
0001 onset offset Hello, how are you?
0002 onset offset I am fine, thank you.
Please specify the onset and offset times.
If you are using your own data, you can manually provide the time alignment, or you can use the open-source tool available at https://github.com/linto-ai/whisper-timestamped.git.
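For reference, a minimal sketch of producing inference.txt-style lines with whisper-timestamped might look like the following. The calls follow that tool's README, but its API may change, so verify them against the whisper-timestamped documentation; only segment-level start/end times and text are used here.

```python
# Sketch: generate "index onset offset text" lines for inference.txt with
# whisper-timestamped. Verify the calls against the tool's own README;
# the paths reuse the example file names from this repository's README.
import whisper_timestamped as whisper

audio = whisper.load_audio("prepare_data/vocals.wav")
model = whisper.load_model("small")          # pick a model size that fits your GPU/CPU
result = whisper.transcribe(model, audio)    # add language="ko" / "en" if auto-detection fails

with open("prepare_data/inference.txt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        # One speaker segment per line: zero-padded index, onset, offset, text.
        f.write(f"{i:04d} {seg['start']:.2f} {seg['end']:.2f} {seg['text'].strip()}\n")
```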
Example Provided
Below is an example of a fully prepared inference.txt that can be used for inference:
inference.txt
0001 0.00 2.35 Hello, how are you?
0002 2.36 4.80 I am fine, thank you.
You can use the provided voice actor sample to perform inference with PS-TTS. The sample includes source text formatted as inference.txt, source speech, video, and a separated BGM.
voice_actor_sample.txt
0001 0.74 2.91 더 좋은 옷, 더 좋은 차
0002 4.19 7.15 더 좋은 집에 산다는 이유로 친구를 괴롭히지 말 것.
After preparing the .txt file, follow the steps below for Korean-to-English dubbing. The following example demonstrates how to dub using the provided voice actor data:
python inference_kr_to_en.py \
--src_speech 'prepare_data/vocals.wav' \
--src_bgm 'prepare_data/bgm_mixer.wav' \
--src_text 'prepare_data/voice_actor_sample.txt' \
--trg_speech 'output/output.wav'
Test it on your own Korean and English samples!!
If you want to test with your English samples, please follow the steps below:
python inference_en_to_kr.py \
--src_speech '{}' \
--src_bgm '{}' \
--src_text '{}' \
--trg_speech '{}'
Replace the '{}' placeholders with the correct paths to your input files to proceed with PS-TTS.
FINALLY, combine the lip-synchronized target speech with the source video: bash separate_video.sh <input_video_path> <final_output_speech>
If you can't see the video, please check the video in the samples folder.
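If the provided script does not fit your setup, the dubbed speech can also be muxed onto the separated video directly with ffmpeg. This is a generic sketch, not one of the PS-TTS scripts; it assumes ffmpeg is installed, and the output path is just a placeholder.

```python
# Generic sketch: replace the audio track of the separated video with the
# dubbed target speech using ffmpeg. Assumes ffmpeg is on PATH; this is not
# one of the scripts shipped with PS-TTS, and the output name is arbitrary.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "prepare_data/separated_video.mp4",  # video produced by separate_video.sh
    "-i", "output/output.wav",                 # dubbed target speech (--trg_speech)
    "-map", "0:v:0", "-map", "1:a:0",          # keep the video stream, take audio from the WAV
    "-c:v", "copy", "-shortest",
    "output/final_dubbed_video.mp4",
], check=True)
```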