[Project page] [ArXiv preprint] [Video]
This is the code implementation for the CVPR'23 paper The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction.
Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video. We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales. Our proposed Temporal Progressive (TemPr) model is composed of multiple attention towers, one for each scale. The predicted action label is based on the collective agreement considering confidences of these towers. Extensive experiments over four video datasets showcase state-of-the-art performance on the task of Early Action Prediction across a range of encoder architectures. We demonstrate the effectiveness and consistency of TemPr through detailed ablations.
Ensure that the following packages are installed on your machine:
- adaPool (version >= 0.2)
- coloredlogs (version >= 14.0)
- dataset2database (version >= 1.1)
- einops (version >= 0.4.0)
- ffmpeg-python (version >= 0.2.0)
- imgaug (version >= 0.4.0)
- opencv-python (version >= 4.2.0.32)
- ptflops (version >= 0.6.8)
- torch (version >= 1.9.0)
- torchinfo (version >= 1.5.4)
- youtube-dl (version >= 2020.3.24)
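A quick way to see which of these minimum versions are already met is sketched below. It is not part of the repository: the helper names (`parse_version`, `meets_minimum`, `check_requirements`) are hypothetical, the version parsing is deliberately crude, and adaPool is left out because it is compiled from source and its installed distribution name is not stated here.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above (adaPool omitted, see lead-in).
REQUIREMENTS = {
    "coloredlogs": "14.0",
    "dataset2database": "1.1",
    "einops": "0.4.0",
    "ffmpeg-python": "0.2.0",
    "imgaug": "0.4.0",
    "opencv-python": "4.2.0.32",
    "ptflops": "0.6.8",
    "torch": "1.9.0",
    "torchinfo": "1.5.4",
    "youtube-dl": "2020.3.24",
}


def parse_version(text):
    """Turn '1.9.0+cu111' into (1, 9, 0); parsing stops at the first non-numeric part."""
    parts = []
    for chunk in text.split("+")[0].split("."):
        match = re.match(r"\d+", chunk)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(requirements):
    """Map each package name to 'ok', 'missing', or an 'outdated' message."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"outdated ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements(REQUIREMENTS).items():
        print(f"{name}: {status}")
```

Pre-release tags and other non-numeric suffixes are ignored by this comparison, so treat the report as a hint rather than a strict check.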
You can install the available PyPI packages with the command below:
$ pip install coloredlogs dataset2database einops ffmpeg-python imgaug opencv-python ptflops torch torchvision youtube-dl
and compile the adaPool package as:
$ git clone https://github.com/alexandrosstergiou/adaPool.git && cd adaPool/pytorch && make install
--- (optional) ---
$ make test
A custom format is used for the train/val label files of each dataset:
label | youtube_id/id | time_start (optional) | time_end (optional) | split |
---|---|---|---|---|
Label files in this format can be created with the scripts provided in labels.
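To make the label format concrete, here is a minimal sketch that writes and reads a file with the columns listed above. The comma delimiter, the choice of `youtube_id` as the column name, and the example rows are assumptions for illustration only; the scripts in labels remain the reference for the exact layout.

```python
import csv
import io

# Column names follow the header shown above; the delimiter is an assumption.
FIELDS = ["label", "youtube_id", "time_start", "time_end", "split"]


def write_labels(rows, handle):
    """Write label rows, leaving the optional time columns empty when absent."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({field: row.get(field, "") for field in FIELDS})


def read_labels(handle):
    """Read label rows back as a list of dictionaries keyed by column name."""
    return list(csv.DictReader(handle))


# Hypothetical rows: one without and one with the optional time columns.
rows = [
    {"label": "Archery", "youtube_id": "video_0001", "split": "train"},
    {"label": "Bowling", "youtube_id": "video_0002",
     "time_start": 0, "time_end": 10, "split": "val"},
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_labels(rows, buffer)
buffer.seek(0)
parsed = read_labels(buffer)
print(parsed[0]["label"], parsed[1]["time_end"])
```

All values come back as strings, so the optional times need an explicit conversion before use.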
We have tested our code over the following datasets:
- UCF-101 : [link]
- Something-Something (sub21/v2) : [link]
- EPIC-KITCHENS-100 : [link]
- NTU-RGB : [link]
Depending on how the dataset is stored on disk, the repo supports two options:
- Videos stored as video files (e.g. .mp4, .avi, etc.)
- Videos stored as folders containing their frames as image files (e.g. .jpg)
By default, the data are assumed to be in video format. You can override this by setting the use_frames call argument to True/true.
We assume a fixed directory formatting that should be of the following structure:
<data>
|
└─── <dataset>
      |
      ├─── <class_i>
      │     │
      │     ├─── <video_id_j>
      │     │     │ (for datasets w/ videos saved as frames)
      │     │     │
      │     │     ├─── frame1.jpg
      │     │     └─── framen.jpg
      │     │
      │     ├─── <video_id_j+1>
      │     │     │ (for datasets w/ videos saved as frames)
      │     │     │
      │     │     ├─── frame1.jpg
      │     │     └─── framen.jpg
     ...   ...
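The layout above can be walked with a few lines of Python. The sketch below is not the repository's data loader: `index_dataset` is a hypothetical helper, the set of video extensions is an assumption, and the temporary folder only mimics the structure so the example runs on its own.

```python
import tempfile
from pathlib import Path

# Assumed set of video extensions; extend it to match your data.
VIDEO_EXTENSIONS = {".mp4", ".avi"}


def index_dataset(root, use_frames=False):
    """Return sorted (class_name, video_id, path) tuples found under root."""
    samples = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for entry in sorted(class_dir.iterdir()):
            if use_frames:
                # Frame mode: every sub-folder of a class is one video.
                if entry.is_dir():
                    samples.append((class_dir.name, entry.name, entry))
            elif entry.is_file() and entry.suffix.lower() in VIDEO_EXTENSIONS:
                # Video mode: every video file of a class is one video.
                samples.append((class_dir.name, entry.stem, entry))
    return samples


# Build a tiny throwaway dataset containing one video of each kind.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "UCF-101"
    (root / "Archery" / "v_0001").mkdir(parents=True)
    (root / "Archery" / "v_0001" / "frame1.jpg").touch()
    (root / "Archery" / "v_0002.mp4").touch()
    frame_samples = [(c, v) for c, v, _ in index_dataset(root, use_frames=True)]
    video_samples = [(c, v) for c, v, _ in index_dataset(root, use_frames=False)]

print(frame_samples)
print(video_samples)
```

Running such a check before training can catch a dataset folder that does not match the use_frames setting.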
Training for each dataset is configured through the .yaml configuration file of the same name in configs.
You can also use the argument parsers in train.py and inference.py for custom arguments.
Train on UCF-101 with an observation ratio of 0.3, 3 scales, the movinet backbone, the pretrained UCF-101 backbone checkpoint stored in weights, and 4 GPUs:
python train.py --video_per 0.3 --num_samplers 3 --gpus 0 1 2 3 --precision mixed --dataset UCF-101 --frame_size 224 --batch_size 64 --data_dir data/UCF-101/ --label_dir /labels/UCF-101 --workers 16 --backbone movinet --end_epoch 70 --pretrained_dir weights/UCF-101/movinet_ada_best.pth
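For intuition about the observation ratio passed as --video_per, the sketch below shows one plausible way a ratio maps to the observed part of a video. It is an assumption for illustration, not the repository's sampling code; `observed_frame_indices` is a hypothetical name.

```python
import math


def observed_frame_indices(num_frames, observation_ratio):
    """Indices of the frames treated as observed, counted from the start.

    Assumes the observed part is simply the first `observation_ratio`
    fraction of the video, with at least one frame always kept.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if not 0.0 < observation_ratio <= 1.0:
        raise ValueError("observation_ratio must be in (0, 1]")
    # The small epsilon guards against floating-point products such as 28.999...
    num_observed = max(1, math.floor(num_frames * observation_ratio + 1e-9))
    return list(range(num_observed))


indices = observed_frame_indices(num_frames=100, observation_ratio=0.3)
print(len(indices), indices[0], indices[-1])
```

With a ratio of 0.3, a 100-frame video contributes its first 30 frames, and the model has to predict the action from those alone.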
Run inference on Something-Something v2 with TemPr and the adaptive ensemble on a single GPU, using the checkpoint file my_chckpt.pth:
python inference.py --config config/inference/smthng-smthng/config.yml --head TemPr_h --pool ada --gpus 0 --pretrained_dir my_chckpt.pth
The following arguments are available and can be added to the parser of any training script.
Argument name | functionality |
---|---|
debug-mode | Boolean for debugging messages. Useful for custom implementations/datasets. |
dataset | String for the name of the dataset. Used to obtain the respective configurations. |
data_dir | String for the directory to load data from. |
label_dir | String for the directory to load the train and val splits from (should be train.csv and val.csv). |
clip-length | Integer determining the number of frames to be used for each video. |
clip-size | Tuple for the spatial size (height x width) of each frame. |
backbone | String for the name of the feature extractor network. |
accum_grads | Integer for the number of iterations over which gradients are accumulated before each update. Set to 1 to not use gradient accumulation. |
use_frames | Boolean flag. When set to True, the dataset directory should contain folders of .jpg images. Otherwise, it should contain video files. |
head | String for the name of the attention tower network. Only TemPr_h can currently be used. |
pool | String for the predictor aggregation method to be used. |
gpus | List of the GPU ids to be used. |
pretrained-3d | String for the .pth filepath, in case the weights are to be initialised from a previously trained model. Weight loading is non-strict, and certain words are removed from the state_dict keys. |
config | String for the .yaml configuration file to be used. If arguments that are part of the configuration file are also passed by the user, the user-provided values are selected over the YAML ones. |
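The precedence described for config (user-passed arguments win over the YAML values) can be sketched as follows. This is an illustration under assumptions, not the parser from train.py: the plain dictionary stands in for a parsed .yaml file, only a few of the arguments above are declared, and `build_parser` and `merge_config` are hypothetical names.

```python
import argparse


def build_parser():
    """Declare a few arguments; default=None marks 'not passed by the user'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", type=str, default=None)
    parser.add_argument("--backbone", type=str, default=None)
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--gpus", type=int, nargs="+", default=None)
    return parser


def merge_config(yaml_config, cli_args):
    """Start from the YAML values and override them with explicit CLI values."""
    merged = dict(yaml_config)
    for key, value in vars(cli_args).items():
        if value is not None:
            merged[key] = value
    return merged


# Stand-in for the contents of a .yaml configuration file.
yaml_config = {"dataset": "UCF-101", "backbone": "x3d", "batch_size": 32, "gpus": [0]}

# Only --backbone and --gpus are given, so the other values come from the config.
args = build_parser().parse_args(["--backbone", "movinet", "--gpus", "0", "1"])
merged = merge_config(yaml_config, args)
print(merged)
```

Using None as the default is what lets the merge tell an omitted flag apart from one that was set explicitly.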
Backbone | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|
x3d | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp |
movinet | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp |

Backbone | | | | | | |
---|---|---|---|---|---|---|
movinet | chkp | chkp | chkp | chkp | chkp | chkp |
@inproceedings{stergiou2023wisdom,
title = {The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction},
author = {Stergiou, Alexandros and Damen, Dima},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition (CVPR)},
year = {2023}
}
MIT