This repo contains code to run the VQA-CP experiments from our paper "Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases". In particular, it contains code to train a VQA model so that it does not make use of question-type priors when answering questions, and to evaluate it on VQA-CP v2.0.
This repo is a fork of this implementation of the BottomUpTopDown VQA model. This fork extends the implementation so it can be used on VQA-CP v2.0, and supports the debiasing methods from our paper.
Make sure you are on a machine with an NVIDIA GPU, Python 2, and about 100 GB of free disk space.
- Install PyTorch v0.3 with CUDA and Python 2.7.
- Install h5py, pillow, and tqdm
All data should be downloaded to a 'data/' directory in the root directory of this repository.
The easiest way to download the data is to run the provided script `tools/download.sh` from the repository root. The features are provided by and downloaded from the original authors' repo. If the script does not work, it should be easy to examine the script and modify the steps outlined in it according to your needs. Then run `tools/process.sh` from the repository root to process the data to the correct format.
On a fresh machine with Ubuntu 18.04, I was able to set up everything by installing CUDA 10.0 and then running:
sudo apt update
sudo apt install unzip
sudo apt install python-pip
pip2 install torch==0.3.1
pip2 install h5py tqdm pillow
bash tools/download.sh
bash tools/process.sh
Run `python main.py --output_dir /path/to/output --seed 0` to start training our Learned-Mixin +H VQA-CP model. See the command line options for how to use the other ensemble methods, or how to train on VQA 2.0, which does not have changing priors.
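The results reported in this README are averaged over eight runs, so one way to reproduce that setup is to launch the training command once per seed. The following is a minimal sketch that only builds the command lines. It uses just the `--output_dir` and `--seed` flags from the command above; the helper name `build_commands`, the `seed_N` directory layout, and the choice of seeds 0 to 7 are assumptions for illustration, not part of this repo.

```python
import os


def build_commands(output_root, num_runs=8):
    """Return one training command (as an argv list) per seed."""
    commands = []
    for seed in range(num_runs):
        # Hypothetical layout: one output directory per seed.
        output_dir = os.path.join(output_root, "seed_{}".format(seed))
        commands.append([
            "python", "main.py",
            "--output_dir", output_dir,
            "--seed", str(seed),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in build_commands("/path/to/output"):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.check_call` from the repository root to actually start a run.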
The scores reported by the script are very close (within a hundredth of a percent in my experience) to the results reported by the official evaluation metric, but can differ slightly because the answer normalization process of the official script is not fully accounted for. To get the official numbers, run `python save_predictions.py /path/to/model /path/to/output_file` and then run the official VQA 2.0 evaluation script on the resulting file.
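To make the role of answer normalization concrete: a commonly used simplified form of the VQA accuracy gives a prediction credit `min(1, n/3)`, where `n` is the number of human annotators who gave that answer. The sketch below implements that simplified form with a deliberately crude normalization (lowercasing and stripping whitespace). The function names are illustrative, and neither the normalization nor the scoring is claimed to match the code in this repo or the official script, which applies more normalization steps.

```python
def normalize(answer):
    # Crude stand-in for answer normalization (assumption): the official
    # script applies more steps than lowercasing and stripping.
    return answer.strip().lower()


def vqa_score(prediction, human_answers):
    """Simplified VQA accuracy: min(1, matches / 3)."""
    pred = normalize(prediction)
    matches = sum(1 for a in human_answers if normalize(a) == pred)
    return min(1.0, matches / 3.0)


if __name__ == "__main__":
    answers = ["yes"] * 2 + ["no"] * 8
    print(vqa_score("Yes", answers))  # two matching annotators
    print(vqa_score("no", answers))   # eight matching annotators
```

Any prediction that only matches the human answers after a normalization step missing from `normalize` would score differently here than under the official script, which is the kind of small discrepancy described above.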
We present a breakdown of accuracy by answer type below. The overall accuracies do not precisely match the results in the paper because, due to a checkpointing issue, we had to re-run our experiments to get these numbers. The results are still averaged over eight runs, and are very close to the numbers in the paper.
Debiasing Method | Overall | Yes/No | Number | Other |
---|---|---|---|---|
None | 39.337 | 42.134 | 12.293 | 45.291 |
Reweight | 39.915 | 44.307 | 12.521 | 45.130 |
Bias Product | 40.043 | 43.395 | 12.322 | 45.892 |
Learned-Mixin | 48.778 | 72.780 | 14.608 | 45.576 |
Learned-Mixin +H | 52.013 | 72.580 | 31.117 | 46.968 |
We also present scores for our methods on VQA 2.0. These were collected by re-training the models on the VQA 2.0 train set and testing on the validation set. Results are again averaged over eight runs.
Debiasing Method | Overall | Yes/No | Number | Other |
---|---|---|---|---|
None | 63.377 | 81.170 | 42.501 | 55.373 |
Reweight | 62.409 | 79.506 | 41.835 | 54.857 |
Bias Product | 63.207 | 81.016 | 42.302 | 55.199 |
Learned-Mixin | 63.260 | 81.159 | 42.215 | 55.221 |
Learned-Mixin +H | 56.345 | 65.057 | 37.631 | 54.687 |
In general, we have tried to minimize changes to the original codebase to reduce the risk of adding bugs. The main changes are:
- The download and preprocessing scripts also set up VQA-CP 2.0
- We use the filesystem, instead of HDF5, to store image features. On my machine this gives about a 1.5-3.0x speedup
- Support dynamically loading the image features from disk during training so models can be trained on machines with less RAM
- Debiasing objectives are added in `vqa_debiasing_objectives.py`
- Some additional arguments are added to `main.py` that control the debiasing objective
- Minor quality of life improvements and tqdm progress monitoring
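For readers who want intuition for the ensemble methods named in the tables, here is a minimal sketch of how Bias Product and Learned-Mixin combine a main model with a bias-only model, written for a plain single-label classifier. It is not the code in `vqa_debiasing_objectives.py`: the function names are hypothetical, the gate `g` is passed in as a plain non-negative float rather than predicted by the model, and the VQA models here may need a different treatment of the answer scores.

```python
import math


def log_softmax(scores):
    # Numerically stable log-softmax over a list of floats.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def bias_product(main_log_probs, bias_log_probs):
    # Ensemble log-probabilities: log_softmax(log p + log b).
    return log_softmax([p + b for p, b in zip(main_log_probs, bias_log_probs)])


def learned_mixin(main_log_probs, bias_log_probs, gate):
    # Ensemble log-probabilities: log_softmax(log p + g * log b), g >= 0.
    # Assumption: `gate` is a fixed float here, not a learned function.
    return log_softmax(
        [p + gate * b for p, b in zip(main_log_probs, bias_log_probs)]
    )


def entropy_penalty(bias_log_probs, gate):
    # Entropy of softmax(g * log b). The "+H" variant adds a weighted
    # version of this term to the loss so the gate cannot simply
    # switch the bias off.
    scaled = log_softmax([gate * b for b in bias_log_probs])
    return -sum(math.exp(lp) * lp for lp in scaled)


if __name__ == "__main__":
    main = log_softmax([0.0, 0.0])            # uniform main model
    bias = [math.log(0.9), math.log(0.1)]     # confident bias-only model
    print([math.exp(lp) for lp in bias_product(main, bias)])
    print([math.exp(lp) for lp in learned_mixin(main, bias, 0.0)])
    print(entropy_penalty(bias, 0.0), entropy_penalty(bias, 1.0))
```

With a gate of zero the bias is ignored and the entropy term is at its maximum, which illustrates why the entropy penalty pushes the model to keep using the bias during training.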