Automatic Depression Detection Using An Interpretable Audio-textual Multi-modal Transformer-based Model
This repository contains the implementation for a multi-modal depression detection model that combines audio and textual data using a Transformer-based architecture. The model is designed to detect depression levels in subjects based on their speech recordings and corresponding transcriptions. The approach leverages interpretability techniques to analyze attention mechanisms within the model.
- Multi-modal integration: Combines audio and text data for enhanced depression detection.
- Transformer-based architecture: Uses BERT embeddings for text and a custom Transformer encoder for audio.
- Interpretable design: Visualizes attention weights to provide insights into model decision-making.
A detailed report of the project, including the methodology, experiments, and results, is available here.
The model uses the DAIC-WOZ dataset, which includes:
- Audio embeddings for each sentence (256-dimensional vectors).
- Transcriptions of participant responses.
For privacy reasons we cannot redistribute the dataset, but access can be requested [here](https://dcapswoz.ict.usc.edu/).
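Transcripts and audio embeddings are paired at the sentence level. The snippet below is a minimal sketch of how one participant's data might be assembled; the file names, column layout, and embedding file format are illustrative assumptions rather than the repository's exact pipeline.

```python
# Sketch of assembling one participant's text + audio inputs.
# File names and columns are assumptions for illustration only.
import numpy as np
import pandas as pd
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# DAIC-WOZ transcripts are tab-separated, with speaker/value columns.
transcript = pd.read_csv("300_TRANSCRIPT.csv", sep="\t")
participant_turns = transcript[transcript["speaker"] == "Participant"]["value"].tolist()

# Hypothetical file holding one 256-dim audio embedding per participant sentence.
audio_embeddings = np.load("300_audio_embeddings.npy")  # shape: (num_sentences, 256)
assert audio_embeddings.shape[0] == len(participant_turns)

# Tokenize each sentence for the text branch; each row pairs with one audio vector.
encoded = tokenizer(participant_turns, padding=True, truncation=True,
                    max_length=64, return_tensors="pt")
```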
The repository includes:
- Data preprocessing:
  - Text preprocessing with the BERT tokenizer.
  - Audio embeddings processed for each sentence.
- Model architecture:
  - A Transformer-based multi-modal model integrating both audio and text features.
  - Separate encoders for audio and text, with a shared fully connected classification layer.
- Training and evaluation pipeline (a minimal sketch follows this list):
  - Cross-entropy loss and the AdamW optimizer.
  - Metrics: accuracy and loss tracking during training and evaluation.
- Interpretability:
  - Visualization of attention weights for insights into model focus during predictions.
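As a concrete reference for the training and evaluation item above, here is a minimal sketch of a combined training/evaluation pass using cross-entropy loss and AdamW. The batch layout and the assumption that the model returns a `(logits, attention_weights)` pair are illustrative, not the repository's exact interface.

```python
# Minimal train/eval loop sketch: cross-entropy loss, AdamW, accuracy tracking.
# The model interface and batch layout are assumptions for illustration.
import torch
from torch import nn
from torch.optim import AdamW

def run_epoch(model, loader, optimizer=None, device="cpu"):
    """One pass over the data; trains if an optimizer is given, else evaluates."""
    training = optimizer is not None
    model.train() if training else model.eval()
    criterion = nn.CrossEntropyLoss()
    total_loss, correct, seen = 0.0, 0, 0

    with torch.set_grad_enabled(training):
        for input_ids, attention_mask, audio_feats, labels in loader:
            input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
            audio_feats, labels = audio_feats.to(device), labels.to(device)

            logits, _ = model(input_ids, attention_mask, audio_feats)
            loss = criterion(logits, labels)

            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            total_loss += loss.item() * labels.size(0)
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            seen += labels.size(0)

    return total_loss / seen, correct / seen

# Usage (placeholders for the actual model and dataloaders):
# optimizer = AdamW(model.parameters(), lr=2e-5)
# train_loss, train_acc = run_epoch(model, train_loader, optimizer)
# val_loss, val_acc = run_epoch(model, val_loader)
```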
Below is an overview of the model architecture used in this project:
The model integrates textual and audio modalities using separate Transformer-based encoders, followed by a shared classification layer. Attention mechanisms are leveraged for interpretability.
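As a rough illustration of this design, the sketch below wires a pretrained BERT encoder for text to a small `nn.TransformerEncoder` over the 256-dimensional sentence-level audio embeddings, and fuses the two with a shared fully connected classification head. The concatenation-based fusion, layer counts, and pooling choices are assumptions, not the exact published architecture.

```python
# Sketch of the multi-modal architecture: BERT text encoder + Transformer audio
# encoder + shared linear classifier. Hyperparameters are illustrative.
import torch
from torch import nn
from transformers import BertModel

class MultiModalDepressionModel(nn.Module):
    def __init__(self, audio_dim=256, num_classes=2, num_audio_layers=2, num_heads=4):
        super().__init__()
        # Text branch: pretrained BERT, [CLS] representation used as the text summary.
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Audio branch: Transformer encoder over the sequence of sentence embeddings.
        audio_layer = nn.TransformerEncoderLayer(
            d_model=audio_dim, nhead=num_heads, batch_first=True)
        self.audio_encoder = nn.TransformerEncoder(audio_layer, num_layers=num_audio_layers)
        # Shared classification head over the fused (concatenated) representation.
        self.classifier = nn.Linear(
            self.text_encoder.config.hidden_size + audio_dim, num_classes)

    def forward(self, input_ids, attention_mask, audio_feats):
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask,
                                     output_attentions=True)
        text_repr = text_out.last_hidden_state[:, 0]               # [CLS] token
        audio_repr = self.audio_encoder(audio_feats).mean(dim=1)   # mean-pool sentences
        fused = torch.cat([text_repr, audio_repr], dim=-1)
        logits = self.classifier(fused)
        # Attention maps are returned so they can be visualized for interpretability.
        return logits, text_out.attentions
```

Because BERT is called with `output_attentions=True`, the returned per-layer attention maps can be plotted directly (e.g., as heatmaps over the input tokens) to support the interpretability analysis described above.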