Official PyTorch implementation of the ICASSP 2025 paper *SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer*.
Try our Hugging Face Spaces demo!
- Release model weights
- Release data
- Hugging Face Spaces demo
- VAE training code
- arXiv paper
```bash
conda env create -f env.yml
conda activate soloaudio
```
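Optionally, run a quick sanity check to confirm that PyTorch sees your GPU (a minimal sketch; the exact package versions come from `env.yml`):

```python
# Minimal environment sanity check (optional).
import torch

print(f"PyTorch {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```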
Download our pretrained models from Hugging Face. After downloading, place the files under the repo root, for example:
```
SoloAudio/
  config/
  demo/
  pretrained_models/
  ...
```
For audio-oriented TSE, please run:

```bash
python tse_audioTSE.py --output_dir './output-audioTSE/' --mixture './demo/1_mix.wav' --enrollment './demo/1_enrollment.wav'
```

For language-oriented TSE, please run:

```bash
python tse_languageTSE.py --output_dir './output-languageTSE/' --mixture './demo/1_mix.wav' --enrollment 'Acoustic guitar'
```
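To process many mixtures at once, you can loop over the same CLI. A minimal sketch using only the flags shown above; the `N_mix.wav` / `N_enrollment.wav` pairing convention is an assumption, so adapt it to your file naming:

```python
# Batch audio-oriented TSE by looping over the CLI shown above.
# Assumes each `N_mix.wav` has a matching `N_enrollment.wav` beside it.
import subprocess
from pathlib import Path

mix_dir = Path("./demo")
for mix in sorted(mix_dir.glob("*_mix.wav")):
    enroll = mix.with_name(mix.name.replace("_mix.wav", "_enrollment.wav"))
    subprocess.run([
        "python", "tse_audioTSE.py",
        "--output_dir", "./output-audioTSE/",
        "--mixture", str(mix),
        "--enrollment", str(enroll),
    ], check=True)
```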
To train a SoloAudio model, you need to prepare the following parts:

- Prepare the FSD-Mix dataset, please run:

  ```bash
  cd data_preparating/
  python create_filenames.py
  python create_fsdmix.py
  ```

  You can also use our simulated data for training, validation, and testing (a mixing sketch follows this list).
- Prepare the TangoSyn dataset, please run (see the Tango sketch after this list):

  ```bash
  cd tango/
  sh gen.sh
  ```

- Prepare the TangoSyn-Mix dataset the same way as FSD-Mix above.
- Extract the VAE features, please run:

  ```bash
  python extract_vae.py --data_dir "YOUR_DATA_DIR" --output_dir "YOUR_OUTPUT_DIR"
  ```
- Extract the CLAP features, please run (see the CLAP sketch after this list):

  ```bash
  python extract_clap_audio.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR"
  python extract_clap_text.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR" --split 1
  python extract_clap_text.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR" --split 2
  python extract_clap_text.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR" --split 3
  ```
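The core of the mixture simulation in the FSD-Mix step is scaling one source against another at a target SNR and summing. A minimal sketch of that idea, not the repo's exact implementation; the file names and SNR range are placeholders:

```python
# Illustrative two-source mixing at a random SNR (not the repo's exact code).
import numpy as np
import soundfile as sf

def mix_at_snr(target: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the target-to-noise ratio is `snr_db` dB, then sum."""
    target_power = np.mean(target**2) + 1e-8
    noise_power = np.mean(noise**2) + 1e-8
    scale = np.sqrt(target_power / (noise_power * 10 ** (snr_db / 10)))
    return target + scale * noise

target, sr = sf.read("target.wav")             # placeholder paths
interference, _ = sf.read("interference.wav")
n = min(len(target), len(interference))        # trim to a common length
snr_db = float(np.random.uniform(-5, 5))       # placeholder SNR range
sf.write("mixture.wav", mix_at_snr(target[:n], interference[:n], snr_db), sr)
```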
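The `gen.sh` script in the TangoSyn step wraps text-to-audio synthesis with Tango. For a rough idea of what a single generation looks like with the upstream `declare-lab/tango` package (the prompt, step count, and output name are placeholders):

```python
# Rough illustration of one TangoSyn generation; gen.sh automates this
# at scale. The prompt and output file name are placeholders.
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango")  # downloads the pretrained model
audio = tango.generate("acoustic guitar strumming", steps=100)
sf.write("tango_syn_000.wav", audio, samplerate=16000)
```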
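The CLAP-extraction scripts compute audio and text embeddings used for conditioning. A minimal sketch with the `laion_clap` package; the repo's scripts may use a different checkpoint or wrapper, so treat this as illustrative:

```python
# Illustrative CLAP embedding extraction with the laion_clap package;
# the repo's extract_clap_* scripts may differ in checkpoint and details.
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads the default pretrained checkpoint

audio_embed = model.get_audio_embedding_from_filelist(
    x=["demo/1_enrollment.wav"], use_tensor=False)
text_embed = model.get_text_embedding(["Acoustic guitar"], use_tensor=False)
print(audio_embed.shape, text_embed.shape)  # e.g. (1, 512) each
```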
Now, you are good to start training!
- Train with a single GPU, please run:

  ```bash
  python train.py
  ```

- Train with multiple GPUs (run `accelerate config` once first if you have not set up Accelerate), please run:

  ```bash
  accelerate launch train.py
  ```
To test a folder of audio files, please run:

```bash
python test_audioTSE.py --output_dir './test-audioTSE/' --test_dir '/YOUR_PATH_TO_TEST/'
```

or, for language-oriented TSE:

```bash
python test_languageTSE.py --output_dir './test-languageTSE/' --test_dir '/YOUR_PATH_TO_TEST/'
```
To calculate the metrics used in the paper, please run:

```bash
cd metircs/
python main.py
```
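For reference, SI-SDR is a standard target-sound-extraction metric; a self-contained version is sketched below. Whether it matches the exact metric set implemented in `main.py` is an assumption, so treat it as illustrative:

```python
# Scale-invariant SDR (in dB) between two equal-length 1-D signals.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return float(10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps)))
```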
We provide code to train an audio waveform VAE model, based on stable-audio-tools.

- Change the data path in `stable_audio_vae/configs/vae_data.txt` (any folder containing audio files).
- Change the model config in `stable_audio_vae/configs/vae_16k_mono_v2.config`. We provide a config for training on audio with a 16 kHz sampling rate; please change the settings if you want other sampling rates.
- Change the batch size and training settings in `stable_audio_vae/defaults.ini`.
- Run:

  ```bash
  cd stable_audio_vae/
  bash train_bash.sh
  ```
The codebase is released under the MIT license.
If you find our work useful, please cite:

```bibtex
@article{helin2024soloaudio,
  author  = {Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
  title   = {SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer},
  journal = {arXiv},
  year    = {2024},
}

@inproceedings{jiarui2024dpmtse,
  author    = {Hai, Jiarui and Wang, Helin and Yang, Dongchao and Thakkar, Karan and Dehak, Najim and Elhilali, Mounya},
  booktitle = {ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction},
  year      = {2024},
  pages     = {1196--1200},
}
```