ReMoT: Register-Enhanced Multimodal Transformers for Sentiment and Emotion Recognition

This repository contains the implementation, experiments, and evaluation code for ReMoT, a family of register-enhanced multimodal transformer models designed for sentiment and emotion recognition using text, audio, and video modalities. The project is developed as part of the MSML612 Deep Learning course and evaluated on the CMU-MOSEI dataset.


Project Overview

Human affect is inherently multimodal and expressed through language, vocal cues, and facial expressions. Traditional fusion approaches often struggle to model complex cross-modal interactions and global context. This project introduces register tokens as persistent memory within multimodal transformers to improve representation stability, cross-modal reasoning, and predictive performance.

We explore and compare four architectures:

  1. Baseline early-fusion model (no registers)
  2. ReMoT-Base with global register tokens
  3. ReMoT-CLIP with local and global registers and contrastive alignment
  4. ReMoT-Hybrid with staged modality encoders and register-based fusion
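
All four variants share the idea of prepending a small set of learnable register tokens to the fused multimodal sequence so that the transformer has persistent slots for global context. The snippet below is a minimal sketch of this mechanism (not the exact implementation in register.py); the module name, dimensions, and pooling strategy are illustrative.

import torch
import torch.nn as nn

class RegisterFusionSketch(nn.Module):
    """Minimal sketch: learnable register tokens prepended to a fused multimodal sequence."""

    def __init__(self, d_model=256, num_registers=4, num_layers=4, num_heads=8):
        super().__init__()
        # Persistent "memory" slots shared across all samples.
        self.registers = nn.Parameter(torch.randn(1, num_registers, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)  # sentiment regression output

    def forward(self, text, audio, video):
        # Each modality tensor: (batch, seq_len_m, d_model) after its own projection.
        tokens = torch.cat([text, audio, video], dim=1)
        regs = self.registers.expand(tokens.size(0), -1, -1)
        fused = self.encoder(torch.cat([regs, tokens], dim=1))
        # Pool the register positions and regress a sentiment score.
        return self.head(fused[:, :regs.size(1)].mean(dim=1)).squeeze(-1)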

Repository Structure

.
├── clip_register.py
├── hybrid_fusion.py
├── register.py
├── metric.py
├── metric1.py
├── inspect_ckpt.py
├── samples.py
├── trimodal_mosei_local_ssd.ipynb
│
├── mosei_register_multimodal_transformer.pt
├── mosei_clip_local_global_register.pt
├── mosei_hybrid_register_fusion.pt
│
└── README.md

File Descriptions

Model Implementations

  • register.py
    Implements the ReMoT-Base model using global register tokens and transformer-based multimodal fusion.

  • clip_register.py
    Implements ReMoT-CLIP with both local and global register tokens and a CLIP-style contrastive loss for cross-modal alignment (a sketch of the contrastive objective follows this list).

  • hybrid_fusion.py
    Implements the ReMoT-Hybrid architecture with modality-specific encoders followed by register-augmented transformer fusion.
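
ReMoT-CLIP additionally aligns the modalities with a CLIP-style contrastive objective. The function below is a minimal sketch of such a symmetric InfoNCE loss between pooled embeddings of two modalities; it illustrates the idea rather than reproducing the exact loss in clip_register.py.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired embeddings of shape (batch, dim)."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Matching pairs lie on the diagonal; align in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))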

Training, Evaluation, and Utilities

  • metric.py / metric1.py
    Implement evaluation metrics including MSE, MAE, Pearson correlation, and binary sentiment accuracy.

  • inspect_ckpt.py
    Utility script to inspect model checkpoints, including parameter keys and architecture details (see the sketch after this list).

  • samples.py
    Used for testing inference and generating sample predictions.

  • trimodal_mosei_local_ssd.ipynb
    Jupyter notebook for dataset preprocessing, experimentation, and local testing on CMU-MOSEI features.
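
Checkpoint inspection amounts to loading the saved weights and printing their keys and shapes. A minimal sketch along those lines is shown below; it assumes the .pt files hold either a bare state_dict or a dict that wraps one.

import torch

# Works with any of the .pt files in this repository.
ckpt = torch.load("mosei_register_multimodal_transformer.pt", map_location="cpu")

# The checkpoint may be a bare state_dict or wrap one under a key such as "state_dict".
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

for name, value in state_dict.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(f"{name:60s} {shape}")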

Model Checkpoints

  • mosei_register_multimodal_transformer.pt
    Trained ReMoT-Base checkpoint.

  • mosei_clip_local_global_register.pt
    Trained ReMoT-CLIP checkpoint.

  • mosei_hybrid_register_fusion.pt
    Trained ReMoT-Hybrid checkpoint.


Dataset

This project uses the CMU-MOSEI dataset: https://www.kaggle.com/datasets/samarwarsi/cmu-mosei

  • Over 23,000 annotated video segments
  • Multimodal features: text, audio, and video
  • Sentiment labels for regression and polarity classification

Due to licensing restrictions, the dataset itself is not included in this repository.
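
Since the environment includes h5py, the pre-extracted CMU-MOSEI features are expected in HDF5 form. The snippet below is a hypothetical sketch of loading such a feature file; the file name and key layout are assumptions and should be adjusted to match the Kaggle download and the preprocessing in the notebook.

import h5py

# Hypothetical path and keys; adjust to the files produced by preprocessing.
with h5py.File("mosei_features.h5", "r") as f:
    print("Available groups:", list(f.keys()))
    text = f["text"][:]      # (num_segments, seq_len, text_dim)
    audio = f["audio"][:]    # (num_segments, seq_len, audio_dim)
    video = f["video"][:]    # (num_segments, seq_len, video_dim)
    labels = f["labels"][:]  # sentiment scores in [-3, 3]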


Evaluation Metrics

Models are evaluated using:

  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)
  • Pearson Correlation
  • Binary Sentiment Accuracy
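
These are standard regression and classification metrics. A minimal sketch of computing them from predicted and ground-truth sentiment scores is shown below (the exact conventions in metric.py may differ, e.g. in how zero labels are handled for binary accuracy).

import numpy as np

def evaluate(preds, labels):
    """Compute MSE, MAE, Pearson correlation, and binary sentiment accuracy."""
    preds, labels = np.asarray(preds, dtype=float), np.asarray(labels, dtype=float)
    mse = float(np.mean((preds - labels) ** 2))
    mae = float(np.mean(np.abs(preds - labels)))
    pearson = float(np.corrcoef(preds, labels)[0, 1])
    # One common convention: polarity agreement (negative vs. non-negative).
    acc = float(np.mean((preds >= 0) == (labels >= 0)))
    return {"MSE": mse, "MAE": mae, "Pearson": pearson, "BinaryAcc": acc}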

How to Run

Environment Setup

pip install torch numpy pandas scikit-learn h5py matplotlib

Train Models

python register.py
python clip_register.py
python hybrid_fusion.py

Evaluate Models

python metric.py

Inspect Checkpoints

python inspect_ckpt.py

How to Test the UI

Install required libraries

cd ui
pip install -r requirements.txt

Run the UI

python app.py

Hardware

  • NVIDIA RTX 4070 GPU
  • CUDA-enabled PyTorch

Team Contributions

  • Srutileka Suresh – Research lead and ReMoT-CLIP implementation
  • Akshay Suresh – ReMoT-Hybrid architecture and training
  • Abhay Shagoti – ReMoT-Base model and data pipeline
  • Satwika Konda – Baseline models, preprocessing and data validation
  • Sivani Mallangi – Feature embedding and Flask-based demo

All members contributed to experimentation, debugging, analysis, and report writing.


References

  • Zadeh et al., Multimodal Language Analysis in the Wild: CMU-MOSEI, ACL 2018
  • Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019
  • Darcet et al., Vision Transformers Need Registers, ICLR 2024
  • Radford et al., Learning Transferable Visual Models From Natural Language Supervision (CLIP), ICML 2021

License

This project is intended for academic and educational use.
