TensorFlow implementation of "SoundNet" that learns rich natural sound representations.
Code for the paper "SoundNet: Learning Sound Representations from Unlabeled Video" by Yusuf Aytar, Carl Vondrick, Antonio Torralba. NIPS 2016.
- Linux
- NVIDIA GPU + CUDA 8.0 + CuDNNv5.1
- Python 2.7 with numpy or Python 3.5
- TensorFlow 1.0.0 (up to 1.3.0)
- librosa
- Clone this repo:
git clone git@github.com:eborboihuc/SoundNet-tensorflow.git
cd SoundNet-tensorflow
- Pretrained Model
I provide pre-trained models ported from soundnet. You can download the 8 layer model here. Please place it at ./models/sound8.npy in your folder.
- Data
Prepare your input mp3 files and place them under ./data/
Generate an input file list in txt format and place it under ./
./data/0001.mp3
./data/0002.mp3
./data/0003.mp3
...
Follow the steps in extract features
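The file list described above can be generated with a few lines of Python. This is a minimal sketch, not part of the repo: the function name `write_file_list` and the output name `input.txt` are hypothetical, and it assumes every mp3 sits directly under the data folder.

```python
import glob
import os


def write_file_list(data_dir, list_path):
    """Write one mp3 path per line (sorted) and return the paths written."""
    # Only files ending in .mp3 directly under data_dir are listed.
    paths = sorted(glob.glob(os.path.join(data_dir, "*.mp3")))
    with open(list_path, "w") as f:
        for path in paths:
            f.write(path + "\n")
    return paths
```

Calling `write_file_list("./data", "./input.txt")` would produce lines in the `./data/0001.mp3` form shown above.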
- NOTE
If you find that some audio with a start offset value in FFMPEG shows a tremendous difference between torch audio and librosa, please convert it with the following command.
sox {input.mp3} {output.mp3} trim 0
After this, the result might be much better.
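If many files need this conversion, the sox command above can be wrapped in a small script. The sketch below is an assumption-laden helper, not part of the repo: `build_trim_command` and `trim_all` are hypothetical names, and it assumes `sox` is installed and on the PATH.

```python
import subprocess


def build_trim_command(src, dst):
    """Return the argument list for `sox {src} {dst} trim 0`."""
    return ["sox", src, dst, "trim", "0"]


def trim_all(pairs):
    """Run sox on each (src, dst) pair.

    Raises subprocess.CalledProcessError if sox exits with an error.
    """
    for src, dst in pairs:
        subprocess.check_call(build_trim_command(src, dst))
```

For example, `trim_all([("./data/0001.mp3", "./data/0001_trim.mp3")])` would write the converted file next to the original; the `_trim` suffix is only an illustration.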
For the demo, you can follow these steps:
i) Download a converted npy file demo.npy and place it under ./data/
ii) To extract multiple features from a pretrained model using the sound track loaded by torch lua audio (this sound track is equivalent to the torch version):
python extract_feat.py -m {start layer number} -x {end layer number} -s
Then you can compare the outputs with torch ones.
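One way to do that comparison is to look at the largest element-wise difference between the two arrays. The following is a rough sketch under stated assumptions: the torch output has already been exported to something NumPy can read, both arrays use the same dimension order (a transpose may be needed first if they do not), and `compare_features` is a hypothetical helper name.

```python
import numpy as np


def compare_features(tf_feat, torch_feat):
    """Return the maximum absolute difference between two feature arrays."""
    a = np.asarray(tf_feat, dtype=np.float64)
    b = np.asarray(torch_feat, dtype=np.float64)
    if a.shape != b.shape:
        # Differing shapes usually mean a layout or length mismatch.
        raise ValueError("shape mismatch: %s vs %s" % (a.shape, b.shape))
    return float(np.max(np.abs(a - b)))
```

A small value suggests the two loaders and models agree; what counts as small enough depends on the layer and is left to the reader.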
i) Download input file demo.mp3 and place it under ./data/
ii) Prepare a file list in txt format (demo.txt) that includes the input mp3 file(s) and place it under ./
./data/demo.mp3
iii) Then extract features from the raw wave listed in demo.txt:
Please put the demo mp3 under ./data/demo.mp3
python extract_feat.py -m {start layer number} -x {end layer number} -s -p extract -t demo.txt
To extract multiple features from a pretrained model with downloaded mp3 dataset:
python extract_feat.py -t {dataset_txt_name} -m {start layer number} -x {end layer number} -s -p extract
e.g. extract layer 4 to layer 17 and save as ./sound_out/tf_fea%02d.npy:
python extract_feat.py -o sound_out -m 4 -x 17 -s -p extract
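The saved files can then be read back with NumPy. This is a minimal sketch, assuming the `%02d` in `tf_fea%02d.npy` is the layer number and that each file holds a plain array; `load_features` is a hypothetical helper, not something the repo provides.

```python
import os

import numpy as np


def load_features(out_dir, start_layer, end_layer):
    """Load tf_fea%02d.npy for each layer in [start_layer, end_layer].

    Layers whose file is missing are skipped. Returns {layer: array}.
    """
    feats = {}
    for layer in range(start_layer, end_layer + 1):
        path = os.path.join(out_dir, "tf_fea%02d.npy" % layer)
        if os.path.exists(path):
            feats[layer] = np.load(path)
    return feats
```

With the command above, `load_features("./sound_out", 4, 17)` would return whichever of those layer files exist.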
More details are in:
python extract_feat.py -h
To train from an existing model:
python main.py
To train from scratch:
python main.py -p train
To extract features:
python main.py -p extract -m {start layer number} -x {end layer number} -s
More details are in:
python main.py -h
- Change audio loader to soundnet format
- Make it compatible with Python 3
- Batch Norm behaviour different from Torch
- Fix conv8 padding issue in training phase
- Change all config into tf.app.flags
- Change dummy distribution of scene and object to useful placeholder
- Add sound and feature loader from Data section
- Loaded audio length is not consistent between torch7 audio and librosa. Here is the issue
- Training with short audio will make conv8 complain that the output size would be negative
- Why is my loaded sound wave different between torch7 audio and librosa: Here is my Wiki
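The negative output size reported for short audio follows from ordinary convolution arithmetic: a layer with kernel size k, stride s and padding p maps an input of length L to floor((L + 2p - k) / s) + 1, which drops to zero or below once L is shorter than the kernel. The sketch below illustrates this with made-up layer parameters; the helper names are hypothetical, and the real kernel, stride and padding values should be read from the model code.

```python
def conv_output_length(input_length, kernel_size, stride, padding=0):
    """Output length of a 1-D convolution or pooling layer."""
    return (input_length + 2 * padding - kernel_size) // stride + 1


def first_invalid_layer(input_length, layers):
    """Walk a stack of (name, kernel, stride, padding) layers.

    Returns (name, length) for the first layer whose output length is
    not positive, or (None, final_length) if every layer is valid.
    """
    length = input_length
    for name, kernel, stride, padding in layers:
        length = conv_output_length(length, kernel, stride, padding)
        if length <= 0:
            return name, length
    return None, length
```

Running `first_invalid_layer` on a candidate clip length before training gives an early warning that the clip is too short for the stack.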
Code ported from soundnet, and the Torch7-TensorFlow loader is from tf_videogan. Thanks for their excellent work!
Hou-Ning Hu / @eborboihuc