This repository provides flexible training and fine-tuning scripts for speech separation models. Currently, it supports both 8 kHz and 16 kHz sampling rates:
| Model Name | Sampling Rate | Paper Link |
|---|---|---|
| MossFormer2_SS_8K | 8000 | MossFormer2 (Paper, ICASSP 2024) |
| MossFormer2_SS_16K | 16000 | MossFormer2 (Paper, ICASSP 2024) |
MossFormer2 achieved state-of-the-art speech separation performance when the paper was published at ICASSP 2024. It is a hybrid model that integrates a recurrent module into our previous MossFormer framework. MossFormer2 is capable of modeling not only long-range, coarse-scale dependencies but also fine-scale recurrent patterns. For efficient self-attention over long sequences, MossFormer2 adopts the joint local-global self-attention strategy proposed for MossFormer. In addition, MossFormer2 introduces a dedicated recurrent module to model intricate temporal dependencies within speech signals.
Instead of applying recurrent neural networks (RNNs) with traditional recurrent connections, we present a recurrent module based on a feedforward sequential memory network (FSMN), which is considered an "RNN-free" recurrent network due to its ability to capture recurrent patterns without recurrent connections. Our recurrent module mainly comprises an enhanced dilated FSMN block that uses gated convolutional units (GCU) and dense connections. In addition, a bottleneck layer and an output layer are added to control information flow. The recurrent module relies on linear projections and convolutions for seamless, parallel processing of the entire sequence.
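To make the idea concrete, below is a minimal PyTorch sketch of such an FSMN-style memory block with a GCU gate. It is an illustration under assumed layer sizes and wiring, not the released model's implementation: the memory is a depthwise convolution (a learnable FIR filter over neighboring frames), so the whole sequence is processed in parallel without recurrent connections.

```python
import torch
import torch.nn as nn

class DilatedFSMNBlock(nn.Module):
    """Sketch of an FSMN-style memory block with a gated convolutional unit."""
    def __init__(self, dim: int, memory_size: int = 7, dilation: int = 1):
        super().__init__()
        # Depthwise 1-D convolution serves as the FSMN memory: a learnable FIR
        # filter over neighboring frames, i.e. recurrence without recurrent connections.
        self.memory = nn.Conv1d(
            dim, dim, kernel_size=memory_size, dilation=dilation,
            padding=(memory_size - 1) * dilation // 2, groups=dim, bias=False)
        # Gated convolutional unit (GCU): an elementwise gate controls information flow.
        self.content = nn.Conv1d(dim, dim, kernel_size=1)
        self.gate = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); all frames are processed in parallel.
        mem = self.memory(x)
        gated = self.content(mem) * torch.sigmoid(self.gate(mem))
        return x + gated  # residual connection keeps a dense path to the input

x = torch.randn(2, 64, 100)                      # (batch, channels, frames)
y = DilatedFSMNBlock(64, memory_size=7, dilation=2)(x)
print(y.shape)                                   # torch.Size([2, 64, 100])
```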
MossFormer2 demonstrates remarkable performance on the WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks. Please refer to our Paper, or evaluate the individual models using the standalone script (link).
We will provide performance comparisons of our released models against publicly available models on the ClearVoice page.
If you haven't created a Conda environment for ClearerVoice-Studio yet, follow steps 1 and 2. Otherwise, skip directly to step 3.
- Clone the Repository

```sh
git clone https://github.com/modelscope/ClearerVoice-Studio.git
```

- Create Conda Environment

```sh
cd ClearerVoice-Studio
conda create -n ClearerVoice-Studio python=3.8
conda activate ClearerVoice-Studio
pip install -r requirements.txt
```
- Prepare Dataset
a. Use a pre-prepared toy MiniLibriMix dataset. It contains a train set of 800 mixtures and a validation set of 200 mixtures.
b. Create your own dataset
- WSJ0-2Mix dataset preparation: We assume you have purchased the WSJ0 speech dataset
- Step 1: Download the WHAM! noise dataset. Go to this page for more information.
- Step 2: Use the mixture generation scripts (Python format or MATLAB format) to generate the mixture datasets, with a sampling rate of either 8000 Hz or 16000 Hz.
- Step 3: Create scp files as formatted in `data/tr_wsj0_2mix_16k.scp` for train, validation, and test (an illustrative example follows this list).
- Step 4: Replace the `tr_list` and `cv_list` paths for the scp files in `config/train/MossFormer2_SS_16K.yaml`
- LibriMix dataset preparation: If you don't have the WSJ0 dataset, we suggest you download the LibriSpeech dataset (only 'train-clean-360.tar.gz' is required) and use the following steps to create the LibriMix dataset.
- Step 1: Download the WHAM! noise dataset. Go to this page for more information.
- Step 2: Clone the LibriMix repo and run the main script: generate_librimix.sh

```sh
git clone https://github.com/JorisCos/LibriMix
cd LibriMix
./generate_librimix.sh storage_dir
```
- Step 3: Create scp files as formatted in `data/tr_wsj0_2mix_16k.scp` for train, validation, and test (an illustrative example follows this list).
- Step 4: Replace the `tr_list` and `cv_list` paths for the scp files in `config/train/MossFormer2_SS_16K.yaml`
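For Steps 3 and 4 above, each scp file is a plain-text list of audio paths, and the YAML config points to those lists. The sketch below is a hypothetical illustration; mirror the exact line format of `data/tr_wsj0_2mix_16k.scp` shipped with this repo, as the placeholder paths here are not real.

```text
# tr_wsj0_2mix_16k.scp (hypothetical entries, one mixture per line)
/data/wsj0-2mix/16k/tr/mix/401o030v_0.4561_01do030w_-0.4561.wav
/data/wsj0-2mix/16k/tr/mix/01vo030x_1.1561_407a010p_-1.1561.wav
```

```yaml
# Excerpt of config/train/MossFormer2_SS_16K.yaml (other keys omitted)
tr_list: data/tr_wsj0_2mix_16k.scp
cv_list: data/cv_wsj0_2mix_16k.scp
```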
- Start Training

```sh
bash train.sh
```

You may need to set the correct network in `train.sh` and choose either a fresh training run or a fine-tuning run using:

```sh
network=MossFormer2_SS_16K    # Train the MossFormer2_SS_16K model
train_from_last_checkpoint=1  # Set to 1 to resume training from the last checkpoint, if one exists
init_checkpoint_path=./       # Path to your initial model when fine-tuning; otherwise, set it to 'None'
```
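For example, to fine-tune from a downloaded pre-trained model instead of resuming a previous run, a plausible setting would be the following (the checkpoint path is a placeholder; point it at wherever you saved the pre-trained model):

```sh
network=MossFormer2_SS_16K
train_from_last_checkpoint=0                           # do not resume a previous run
init_checkpoint_path=./checkpoints/MossFormer2_SS_16K  # placeholder path to the downloaded pre-trained model
```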