This project tackles the challenge of recognizing multiple co-occurring sounds within a single audio recording. Inspired by the "cocktail party problem," our approach combines audio classification with techniques for separating mixed audio sources.
The proposed method involves:
- Extracting audio from video using `moviepy`.
- Segmenting the audio into smaller chunks using a predefined time window.
- Separating audio components within each chunk using AudioSep, the natural-language-queried sound separation model from "Separate Anything You Describe."
- Classifying the separated components using Microsoft's CLAP model, which uses contrastive language-audio pretraining to match audio against text labels.
- Generating video subtitles with audio captions in the SubRip (.srt) format.
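The segmentation step above can be sketched as follows. This is a minimal illustration, not code from the repository: the function name, sample rate, and 10-second window are assumptions (in the pipeline, the waveform would come from `moviepy`'s audio export of the input video).

```python
import numpy as np

def segment_audio(samples: np.ndarray, sr: int, window_s: float = 10.0):
    """Split a mono waveform into fixed-length chunks of window_s seconds.

    Returns (start_time_seconds, chunk) pairs; the last chunk may be
    shorter than the window, and callers can pad or drop it.
    """
    step = int(sr * window_s)
    return [(i / sr, samples[i:i + step])
            for i in range(0, len(samples), step)]

# Example: a 25-second signal at 16 kHz splits into 10 s + 10 s + 5 s.
sr = 16000
audio = np.zeros(25 * sr, dtype=np.float32)
chunks = segment_audio(audio, sr, window_s=10.0)
```

Each chunk is then passed independently through separation and classification, so the window length trades off temporal resolution of the captions against per-chunk context.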
This framework enables the identification of multiple sounds within a complex audio scene, facilitating applications like automated audio annotation and content indexing.
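The final subtitle step writes SubRip entries, one per classified chunk. A minimal sketch of the `.srt` timestamp and entry format (the helper name and caption are illustrative, not part of the repository):

```python
def srt_entry(index: int, start_s: float, end_s: float, caption: str) -> str:
    """Format one SubRip subtitle entry: index, time range, caption text."""
    def ts(t: float) -> str:
        # SubRip timestamps are HH:MM:SS,mmm with a comma before milliseconds.
        total_ms = int(round(t * 1000))
        h, rem = divmod(total_ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    return f"{index}\n{ts(start_s)} --> {ts(end_s)}\n{caption}\n"

entry = srt_entry(1, 0.0, 10.0, "dog barking")
```

Entries for successive chunks are numbered consecutively and separated by blank lines in the output file.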
- Clone the repository and install dependencies:
```sh
git clone https://github.com/paulinaskr33/Audio-component-classifier
pip install torchlibrosa==0.1.0 gradio==3.47.1 gdown lightning ftfy braceexpand webdataset soundfile wget h5py transformers==4.28.1
pip install msclap
```
(A Python package wrapper is to be added.)
See the Google Colab notebook for an example of use.
- Liu, X., Kong, Q., Zhao, Y., Liu, H., Yuan, Y., Liu, Y., Xia, R., Wang, Y., Plumbley, M. D., & Wang, W. (2023). Separate anything you describe. https://doi.org/10.48550/ARXIV.2308.05037
- Elizalde, B., Deshmukh, S., Ismail, M. A., & Wang, H. (2023). CLAP: Learning audio concepts from natural language supervision. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. https://doi.org/10.1109/ICASSP49357.2023.10095889