ReMoT: Register-Enhanced Multimodal Transformers for Sentiment and Emotion Recognition

This repository contains the implementation, experiments, and evaluation code for ReMoT, a family of register-enhanced multimodal transformer models designed for sentiment and emotion recognition using text, audio, and video modalities. The project is developed as part of the MSML612 Deep Learning course and evaluated on the CMU-MOSEI dataset.


Project Overview

Human affect is inherently multimodal and expressed through language, vocal cues, and facial expressions. Traditional fusion approaches often struggle to model complex cross-modal interactions and global context. This project introduces register tokens as persistent memory within multimodal transformers to improve representation stability, cross-modal reasoning, and predictive performance.

We explore and compare four architectures:

  1. Baseline early-fusion model (no registers)
  2. ReMoT-Base with global register tokens
  3. ReMoT-CLIP with local and global registers and contrastive alignment
  4. ReMoT-Hybrid with staged modality encoders and register-based fusion
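
All four variants share the idea of prepending a small set of learnable register tokens to the fused multimodal sequence so that the transformer has persistent slots for global context. The snippet below is a minimal sketch of this mechanism (not the exact implementation in register.py); the module name, dimensions, and pooling strategy are illustrative.

import torch
import torch.nn as nn

class RegisterFusionSketch(nn.Module):
    """Minimal sketch: learnable register tokens prepended to a fused multimodal sequence."""

    def __init__(self, d_model=256, num_registers=4, num_layers=4, num_heads=8):
        super().__init__()
        # Persistent "memory" slots shared across all samples.
        self.registers = nn.Parameter(torch.randn(1, num_registers, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)  # sentiment regression output

    def forward(self, text, audio, video):
        # Each modality tensor: (batch, seq_len_m, d_model) after its own projection.
        tokens = torch.cat([text, audio, video], dim=1)
        regs = self.registers.expand(tokens.size(0), -1, -1)
        fused = self.encoder(torch.cat([regs, tokens], dim=1))
        # Pool the register positions and regress a sentiment score.
        return self.head(fused[:, :regs.size(1)].mean(dim=1)).squeeze(-1)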

Repository Structure

.
├── clip_register.py
├── hybrid_fusion.py
├── register.py
├── metric.py
├── metric1.py
├── inspect_ckpt.py
├── samples.py
├── trimodal_mosei_local_ssd.ipynb
│
├── mosei_register_multimodal_transformer.pt
├── mosei_clip_local_global_register.pt
├── mosei_hybrid_register_fusion.pt
│
└── README.md

File Descriptions

Model Implementations

  • register.py
    Implements the ReMoT-Base model using global register tokens and transformer-based multimodal fusion.

  • clip_register.py
    Implements ReMoT-CLIP with both local and global register tokens and a CLIP-style contrastive loss for cross-modal alignment (a sketch of the contrastive objective follows this list).

  • hybrid_fusion.py
    Implements the ReMoT-Hybrid architecture with modality-specific encoders followed by register-augmented transformer fusion.
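
ReMoT-CLIP additionally aligns the modalities with a CLIP-style contrastive objective. The function below is a minimal sketch of such a symmetric InfoNCE loss between pooled embeddings of two modalities; it illustrates the idea rather than reproducing the exact loss in clip_register.py.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired embeddings of shape (batch, dim)."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Matching pairs lie on the diagonal; align in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))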

Training, Evaluation, and Utilities

  • metric.py / metric1.py
    Implement evaluation metrics including MSE, MAE, Pearson correlation, and binary sentiment accuracy.

  • inspect_ckpt.py
    Utility script to inspect model checkpoints, including parameter keys and architecture details (see the sketch after this list).

  • samples.py
    Used for testing inference and generating sample predictions.

  • trimodal_mosei_local_ssd.ipynb
    Jupyter notebook for dataset preprocessing, experimentation, and local testing on CMU-MOSEI features.
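
Checkpoint inspection amounts to loading the saved weights and printing their keys and shapes. A minimal sketch along those lines is shown below; it assumes the .pt files hold either a bare state_dict or a dict that wraps one.

import torch

# Works with any of the .pt files in this repository.
ckpt = torch.load("mosei_register_multimodal_transformer.pt", map_location="cpu")

# The checkpoint may be a bare state_dict or wrap one under a key such as "state_dict".
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

for name, value in state_dict.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(f"{name:60s} {shape}")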

Model Checkpoints

  • mosei_register_multimodal_transformer.pt
    Trained ReMoT-Base checkpoint.

  • mosei_clip_local_global_register.pt
    Trained ReMoT-CLIP checkpoint.

  • mosei_hybrid_register_fusion.pt
    Trained ReMoT-Hybrid checkpoint.


Dataset

This project uses the CMU-MOSEI dataset: https://www.kaggle.com/datasets/samarwarsi/cmu-mosei

  • Over 23,000 annotated video segments
  • Multimodal features: text, audio, and video
  • Sentiment labels for regression and polarity classification

Due to licensing restrictions, the dataset itself is not included in this repository.
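
Since the environment includes h5py, the pre-extracted CMU-MOSEI features are expected in HDF5 form. The snippet below is a hypothetical sketch of loading such a feature file; the file name and key layout are assumptions and should be adjusted to match the Kaggle download and the preprocessing in the notebook.

import h5py

# Hypothetical path and keys; adjust to the files produced by preprocessing.
with h5py.File("mosei_features.h5", "r") as f:
    print("Available groups:", list(f.keys()))
    text = f["text"][:]      # (num_segments, seq_len, text_dim)
    audio = f["audio"][:]    # (num_segments, seq_len, audio_dim)
    video = f["video"][:]    # (num_segments, seq_len, video_dim)
    labels = f["labels"][:]  # sentiment scores in [-3, 3]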


Evaluation Metrics

Models are evaluated using:

  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)
  • Pearson Correlation
  • Binary Sentiment Accuracy
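
These are standard regression and classification metrics. A minimal sketch of computing them from predicted and ground-truth sentiment scores is shown below (the exact conventions in metric.py may differ, e.g. in how zero labels are handled for binary accuracy).

import numpy as np

def evaluate(preds, labels):
    """Compute MSE, MAE, Pearson correlation, and binary sentiment accuracy."""
    preds, labels = np.asarray(preds, dtype=float), np.asarray(labels, dtype=float)
    mse = float(np.mean((preds - labels) ** 2))
    mae = float(np.mean(np.abs(preds - labels)))
    pearson = float(np.corrcoef(preds, labels)[0, 1])
    # One common convention: polarity agreement (negative vs. non-negative).
    acc = float(np.mean((preds >= 0) == (labels >= 0)))
    return {"MSE": mse, "MAE": mae, "Pearson": pearson, "BinaryAcc": acc}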

How to Run

Environment Setup

pip install torch numpy pandas scikit-learn h5py matplotlib

Train Models

python register.py
python clip_register.py
python hybrid_fusion.py

Evaluate Models

python metric.py

Inspect Checkpoints

python inspect_ckpt.py

How to Test the UI

Install required libraries

cd ui
pip install -r requirements.txt

Run the UI

python app.py

Hardware

  • NVIDIA RTX 4070 GPU
  • CUDA-enabled PyTorch

Team Contributions

  • Srutileka Suresh – Research lead and ReMoT-CLIP implementation
  • Akshay Suresh – ReMoT-Hybrid architecture and training
  • Abhay Shagoti – ReMoT-Base model and data pipeline
  • Satwika Konda – Baseline models, preprocessing and data validation
  • Sivani Mallangi – Feature embedding and Flask-based demo

All members contributed to experimentation, debugging, analysis, and report writing.


References

  • Zadeh et al., Multimodal Language Analysis in the Wild: CMU-MOSEI, ACL 2018
  • Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019
  • Darcet et al., Vision Transformers Need Registers, ICLR 2024
  • Radford et al., Learning Transferable Visual Models From Natural Language Supervision (CLIP), ICML 2021

License

This project is intended for academic and educational use.
