
Multi-Modal-Depression-Detection

Automatic Depression Detection Using An Interpretable Audio-textual Multi-modal Transformer-based Model

Overview

This repository contains the implementation for a multi-modal depression detection model that combines audio and textual data using a Transformer-based architecture. The model is designed to detect depression levels in subjects based on their speech recordings and corresponding transcriptions. The approach leverages interpretability techniques to analyze attention mechanisms within the model.

Features

  • Multi-modal integration: Combines audio and text data for enhanced depression detection.
  • Transformer-based architecture: Uses BERT embeddings for text and a custom Transformer encoder for audio.
  • Interpretable design: Visualizes attention weights to provide insights into model decision-making.
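
As a rough illustration of the attention-weight visualization mentioned above, the sketch below plots one attention head as a token-by-token heatmap. The function name, the assumed `(num_heads, seq_len, seq_len)` attention shape, and the use of matplotlib are illustrative assumptions, not the repository's actual plotting code.

```python
# Illustrative sketch only: heatmap of one attention head over the input tokens.
# `attention` is assumed to be a NumPy array (or detached CPU tensor) of shape
# (num_heads, seq_len, seq_len); `tokens` is the matching list of token strings.
import matplotlib.pyplot as plt

def plot_attention(attention, tokens, head=0, layer_name="layer 0"):
    weights = attention[head]              # (seq_len, seq_len): rows = queries, cols = keys
    fig, ax = plt.subplots(figsize=(6, 6))
    im = ax.imshow(weights, cmap="viridis")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    ax.set_title(f"Attention weights, {layer_name}, head {head}")
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    plt.show()
```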

Report

A detailed report of the project, including the methodology, experiments, and results, is available here.

Dataset

The model uses the DAIC-WOZ dataset, which includes:

  • Audio embeddings for each sentence (256-dimensional vectors).
  • Transcriptions of participant responses.

For privacy reasons, we are not allowed to redistribute the dataset, but access can be requested [here](https://dcapswoz.ict.usc.edu/).
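
For orientation, here is a minimal sketch of how the per-sentence transcripts and 256-dimensional audio embeddings could be paired into a PyTorch dataset. The class name, field layout, and label granularity are assumptions for illustration; this is not the repository's actual loader, and no DAIC-WOZ data is included.

```python
# Minimal sketch under assumed field names; not the repository's actual data loader.
import torch
from torch.utils.data import Dataset

class DaicWozSentences(Dataset):
    """One item = (sentence transcript, 256-dim audio embedding, depression label)."""

    def __init__(self, transcripts, audio_embeddings, labels):
        # transcripts: list[str]; audio_embeddings: list of 256-dim vectors; labels: list[int]
        assert len(transcripts) == len(audio_embeddings) == len(labels)
        self.transcripts = transcripts
        self.audio_embeddings = audio_embeddings
        self.labels = labels

    def __len__(self):
        return len(self.transcripts)

    def __getitem__(self, idx):
        return (
            self.transcripts[idx],
            torch.tensor(self.audio_embeddings[idx], dtype=torch.float32),  # shape (256,)
            torch.tensor(self.labels[idx], dtype=torch.long),
        )
```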

Implementation

The repository includes:

  1. Data preprocessing:
    • Text preprocessing with BERT tokenizer.
    • Audio embeddings processed for each sentence.
  2. Model architecture:
    • A Transformer-based multi-modal model integrating both audio and text features.
    • Separate encoders for audio and text, with a shared fully connected classification layer.
  3. Training and evaluation pipeline:
    • Cross-entropy loss and AdamW optimizer (a minimal training sketch follows this list).
    • Metrics: Accuracy and loss tracking during training and evaluation.
  4. Interpretability:
    • Visualization of attention weights for insights into model focus during predictions.
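
The snippet below sketches one way the training step listed above could look: BERT tokenization of the transcripts, cross-entropy loss, and the AdamW optimizer. The `model` interface, batch layout, and learning rate are assumptions, not the project's exact pipeline.

```python
# Minimal training-step sketch; interfaces and hyperparameters are assumptions.
import torch
from torch import nn
from torch.optim import AdamW
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def train_one_epoch(model, train_loader, device, lr=2e-5):
    criterion = nn.CrossEntropyLoss()
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()

    total_loss, correct, seen = 0.0, 0, 0
    for transcripts, audio_embeddings, labels in train_loader:
        # Text preprocessing: tokenize the batch of sentences for BERT.
        text_inputs = tokenizer(list(transcripts), padding=True, truncation=True,
                                return_tensors="pt").to(device)
        audio_embeddings = audio_embeddings.to(device)   # (batch, 256)
        labels = labels.to(device)

        logits = model(text_inputs, audio_embeddings)    # (batch, num_classes)
        loss = criterion(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        seen += labels.size(0)

    return total_loss / seen, correct / seen  # epoch mean loss and accuracy
```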

Model Architecture

Below is an overview of the model architecture used in this project:

[Model architecture diagram] The model integrates textual and audio modalities using separate Transformer-based encoders, followed by a shared classification layer. Attention mechanisms are leveraged for interpretability.
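
A condensed sketch of this dual-encoder design is shown below: a pretrained BERT encoder for the transcripts, a small Transformer encoder over the 256-dimensional audio embeddings, and a shared fully connected classifier over the concatenated representations. Layer counts, fusion by concatenation, and the optional attention output are illustrative assumptions, not the exact architecture from the report.

```python
# Illustrative dual-encoder sketch; dimensions and fusion strategy are assumptions.
import torch
from torch import nn
from transformers import BertModel

class MultiModalClassifier(nn.Module):
    def __init__(self, audio_dim=256, num_classes=2, n_audio_layers=2, n_heads=4):
        super().__init__()
        # Text branch: pretrained BERT encoder.
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Audio branch: a small Transformer encoder over per-sentence audio embeddings.
        audio_layer = nn.TransformerEncoderLayer(d_model=audio_dim, nhead=n_heads,
                                                 batch_first=True)
        self.audio_encoder = nn.TransformerEncoder(audio_layer, num_layers=n_audio_layers)
        # Shared fully connected classification head over the concatenated representations.
        self.classifier = nn.Linear(self.text_encoder.config.hidden_size + audio_dim,
                                    num_classes)

    def forward(self, text_inputs, audio_embeddings, output_attentions=False):
        # text_inputs: dict from the BERT tokenizer; audio_embeddings: (batch, 256),
        # or (batch, seq_len, 256) if several audio embeddings are grouped per example.
        if audio_embeddings.dim() == 2:
            audio_embeddings = audio_embeddings.unsqueeze(1)  # treat as length-1 sequence
        text_out = self.text_encoder(**text_inputs, output_attentions=output_attentions)
        text_repr = text_out.pooler_output                          # (batch, 768)
        audio_repr = self.audio_encoder(audio_embeddings).mean(dim=1)  # (batch, 256)
        logits = self.classifier(torch.cat([text_repr, audio_repr], dim=-1))
        if output_attentions:
            # Per-layer BERT attention maps, usable for the interpretability plots above.
            return logits, text_out.attentions
        return logits
```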
