This repository provides flexible training and fine-tuning scripts for speech separation models. Currently, it supports both 8 kHz and 16 kHz sampling rates:
| Model Name | Sampling Rate | Paper Link |
|---|---|---|
| MossFormer2_SS_8K | 8000 | MossFormer2 (Paper, ICASSP 2024) |
| MossFormer2_SS_16K | 16000 | MossFormer2 (Paper, ICASSP 2024) |
MossFormer2 achieved state-of-the-art speech separation performance when the paper was published at ICASSP 2024. It is a hybrid model that integrates a recurrent module into our previous MossFormer framework. MossFormer2 is capable of modeling not only long-range, coarse-scale dependencies but also fine-scale recurrent patterns. For efficient self-attention over long sequences, MossFormer2 adopts the joint local-global self-attention strategy proposed for MossFormer. In addition, MossFormer2 introduces a dedicated recurrent module to model intricate temporal dependencies within speech signals.
Instead of applying recurrent neural networks (RNNs) with traditional recurrent connections, we present a recurrent module based on a feedforward sequential memory network (FSMN), which is considered an "RNN-free" recurrent network due to its ability to capture recurrent patterns without recurrent connections. Our recurrent module mainly comprises an enhanced dilated FSMN block that uses gated convolutional units (GCU) and dense connections. In addition, a bottleneck layer and an output layer are added to control information flow. The recurrent module relies on linear projections and convolutions for seamless, parallel processing of the entire sequence.
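To make the idea concrete, below is a minimal PyTorch sketch of such an FSMN-style memory block with a GCU gate. It is an illustration under assumed layer sizes and wiring, not the released model's implementation: the memory is a depthwise convolution (a learnable FIR filter over neighboring frames), so the whole sequence is processed in parallel without recurrent connections.

```python
import torch
import torch.nn as nn

class DilatedFSMNBlock(nn.Module):
    """Sketch of an FSMN-style memory block with a gated convolutional unit."""
    def __init__(self, dim: int, memory_size: int = 7, dilation: int = 1):
        super().__init__()
        # Depthwise 1-D convolution serves as the FSMN memory: a learnable FIR
        # filter over neighboring frames, i.e. recurrence without recurrent connections.
        self.memory = nn.Conv1d(
            dim, dim, kernel_size=memory_size, dilation=dilation,
            padding=(memory_size - 1) * dilation // 2, groups=dim, bias=False)
        # Gated convolutional unit (GCU): an elementwise gate controls information flow.
        self.content = nn.Conv1d(dim, dim, kernel_size=1)
        self.gate = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); all frames are processed in parallel.
        mem = self.memory(x)
        gated = self.content(mem) * torch.sigmoid(self.gate(mem))
        return x + gated  # residual connection keeps a dense path to the input

x = torch.randn(2, 64, 100)                      # (batch, channels, frames)
y = DilatedFSMNBlock(64, memory_size=7, dilation=2)(x)
print(y.shape)                                   # torch.Size([2, 64, 100])
```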
MossFormer2 demonstrates remarkable performance on the WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks. Please refer to our Paper, or evaluate the individual models using the standalone script (link).
We will provide performance comparisons of our released models against publicly available models on the ClearVoice page.
If you haven't created a Conda environment for ClearerVoice-Studio yet, follow steps 1 and 2. Otherwise, skip directly to step 3.
- Clone the Repository

```sh
git clone https://github.com/modelscope/ClearerVoice-Studio.git
```

- Create Conda Environment

```sh
cd ClearerVoice-Studio
conda create -n ClearerVoice-Studio python=3.8
conda activate ClearerVoice-Studio
pip install -r requirements.txt
```
- Prepare Dataset
a. Use a pre-prepared toy MiniLibriMix dataset. It contains a train set of 800 mixtures and a validation set of 200 mixtures.
b. Create your own dataset
- WSJ0-2Mix dataset preparation: We assume you have purchased the WSJ0 speech dataset
- Step 1: Download the WHAM! noise dataset. Go to this page for more information.
- Step 2: Use the mixture generation scripts (Python format or MATLAB format) to generate the mixture datasets, with a sampling rate of either 8000 Hz or 16000 Hz.
- Step 3: Create scp files as formatted in `data/tr_wsj0_2mix_16k.scp` for train, validation, and test (an illustrative example follows this list).
- Step 4: Replace the `tr_list` and `cv_list` paths for the scp files in `config/train/MossFormer2_SS_16K.yaml`
- LibriMix dataset preparation: If you don't have the WSJ0 dataset, we suggest you download the LibriSpeech dataset (only 'train-clean-360.tar.gz' is required) and use the following steps to create the LibriMix dataset.
- Step 1: Download the WHAM! noise dataset. Go to this page for more information.
- Step 2: Clone the LibriMix repo and run the main script: generate_librimix.sh

```sh
git clone https://github.com/JorisCos/LibriMix
cd LibriMix
./generate_librimix.sh storage_dir
```
- Step 3: Create scp files as formatted in `data/tr_wsj0_2mix_16k.scp` for train, validation, and test (an illustrative example follows this list).
- Step 4: Replace the `tr_list` and `cv_list` paths for the scp files in `config/train/MossFormer2_SS_16K.yaml`
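For Steps 3 and 4 above, each scp file is a plain-text list of audio paths, and the YAML config points to those lists. The sketch below is a hypothetical illustration; mirror the exact line format of `data/tr_wsj0_2mix_16k.scp` shipped with this repo, as the placeholder paths here are not real.

```text
# tr_wsj0_2mix_16k.scp (hypothetical entries, one mixture per line)
/data/wsj0-2mix/16k/tr/mix/401o030v_0.4561_01do030w_-0.4561.wav
/data/wsj0-2mix/16k/tr/mix/01vo030x_1.1561_407a010p_-1.1561.wav
```

```yaml
# Excerpt of config/train/MossFormer2_SS_16K.yaml (other keys omitted)
tr_list: data/tr_wsj0_2mix_16k.scp
cv_list: data/cv_wsj0_2mix_16k.scp
```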
- Start Training

```sh
bash train.sh
```

You may need to set the correct network in `train.sh` and choose either a fresh training run or a fine-tuning run using:

```sh
network=MossFormer2_SS_16K    # Train the MossFormer2_SS_16K model
train_from_last_checkpoint=1  # Set to 1 to resume training from the last checkpoint, if one exists
init_checkpoint_path=./       # Path to your initial model when fine-tuning; otherwise, set it to 'None'
```
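For example, to fine-tune from a downloaded pre-trained model instead of resuming a previous run, a plausible setting would be the following (the checkpoint path is a placeholder; point it at wherever you saved the pre-trained model):

```sh
network=MossFormer2_SS_16K
train_from_last_checkpoint=0                           # do not resume a previous run
init_checkpoint_path=./checkpoints/MossFormer2_SS_16K  # placeholder path to the downloaded pre-trained model
```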