
Audio-based Action Recognition Models

ResNet for Audio

Audiovisual SlowFast Networks for Video Recognition


We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features.
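
As a rough illustration of the DropPathway idea described above, the sketch below drops the audio features with some probability before fusing them with the visual features. It is not the authors' implementation: the function name `fuse_with_drop_pathway`, the element-wise additive fusion, and the parameter name `p_drop` are all assumptions made for illustration.

```python
import random


def fuse_with_drop_pathway(visual, audio, p_drop, training, rng=random):
    """Fuse audio into visual features, randomly dropping audio in training.

    Hypothetical sketch: element-wise addition stands in for the fusion
    step, and `p_drop` is an assumed name for the drop probability.
    """
    if len(visual) != len(audio):
        raise ValueError("visual and audio features must have the same length")
    # During training, skip the audio contribution with probability p_drop.
    if training and rng.random() < p_drop:
        return list(visual)
    # Otherwise (and always at inference) audio contributes to the fusion.
    return [v + a for v, a in zip(visual, audio)]


visual_feat = [1.0, 2.0, 3.0]
audio_feat = [0.5, 0.5, 0.5]

# Inference: the audio pathway is never dropped.
fused_eval = fuse_with_drop_pathway(visual_feat, audio_feat, p_drop=0.8, training=False)

# Training with p_drop=1.0: the audio pathway is always dropped.
fused_dropped = fuse_with_drop_pathway(visual_feat, audio_feat, p_drop=1.0, training=True)
```

Dropping the pathway only during training means the visual pathways are sometimes updated without the audio signal, while inference always uses both modalities.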

Results and Models


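| frame sampling strategy | n_fft | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
| :---------------------: | :---: | :--: | :------: | :------: | :------: | :------: | :--------------: | :---: | :----: | :----: | :--: | :-: |
| 64x1x1 | 1024 | 8 | Resnet18 | None | 13.7 | 27.3 | 1 clips | 0.37G | 11.4M | config | ckpt | log |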
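
The top1 acc and top5 acc columns follow the usual convention: the percentage of test samples whose ground-truth label is among the 1 or 5 highest-scoring classes. A minimal sketch of that computation, assuming per-class scores are available as plain lists (the function name `top_k_accuracy` and the toy data are illustrative, not taken from the codebase):

```python
def top_k_accuracy(scores, labels, k):
    """Return the percentage of samples whose label is in the top-k scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Indices of the classes, ordered from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy data: 4 samples, 3 classes (made up for illustration).
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
toy_labels = [1, 1, 0, 2]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```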


You can use the following command to train a model.

python tools/ ${CONFIG_FILE} [optional arguments]

Example: train the ResNet model on the Kinetics-400 audio dataset deterministically, with periodic validation.

python tools/ configs/recognition_audio/resnet/ \
    --seed 0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.


You can use the following command to test a model.

python tools/ ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the ResNet model on the Kinetics-400 audio dataset and dump the results to a pkl file.

python tools/ configs/recognition_audio/resnet/ \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.
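
Once the results are dumped, they can be inspected with Python's standard `pickle` module. The sketch below assumes only that `result.pkl` is an ordinary pickle file; the structure of the stored object is not specified here, so the code just reports its type and size. The helper name `summarize_dump` is hypothetical.

```python
import pickle


def summarize_dump(path):
    """Load a pickle file and return (type name, length or None)."""
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(path, "rb") as f:
        obj = pickle.load(f)
    # Report a length only for objects that define one (lists, dicts, ...).
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

For example, `summarize_dump("result.pkl")` returns a tuple such as a type name and an item count, which is a quick way to see what was written before digging further.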


  title={Audiovisual SlowFast Networks for Video Recognition},
  author={Xiao, Fanyi and Lee, Yong Jae and Grauman, Kristen and Malik, Jitendra and Feichtenhofer, Christoph},
  journal={arXiv preprint arXiv:2001.08740},