Awesome Multi-Modalities For Time Series Analysis Papers (MM4TSA)

Time series analysis (TSA) is a longstanding research topic in the data mining community with wide real-world significance. Compared to "richer" modalities such as language and vision, which have recently experienced explosive development and are densely connected, the time-series modality remains relatively underexplored and isolated. We observe that many recent TSA works have formed a new research field: Multiple Modalities for TSA (MM4TSA). These works share a common motivation: how TSA can benefit from multiple modalities. This survey is the first to offer a comprehensive review and a detailed outlook for this emerging field. Specifically, we systematically discuss three benefits: (1) reusing foundation models of other modalities for efficient TSA, (2) multimodal extension for enhanced TSA, and (3) cross-modality interaction for advanced TSA. Within each perspective, we further group the works by the type of modality introduced, including text, images, audio, tables, and others. Finally, we identify gaps and future opportunities corresponding to the three benefits: selecting which modality to reuse, combining heterogeneous modalities, and generalizing to unseen tasks. We release this up-to-date GitHub repository, which includes key papers and resources. For more details, please see our survey.

Contributing

🚀 We will continue to update this repo. If you find it helpful, please star it or cite our survey.

🤝 Contributions are welcome! Please feel free to submit a Pull Request.

Citation

🤗 If you find this survey useful, please consider citing our paper. 🤗

@misc{liu2025timeseriesanalysisbenefit,
      title={How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook}, 
      author={Haoxin Liu and Harshavardhan Kamarthi and Zhiyuan Zhao and Shangqing Xu and Shiyu Wang and Qingsong Wen and Tom Hartvigsen and Fei Wang and B. Aditya Prakash},
      year={2025},
      eprint={2503.11835},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.11835}, 
}

1. Time2X and X2Time

1.1 Text to Time Series

1.1.1 Generation

Title	Venue
Language Models Still Struggle to Zero-shot Reason about Time Series	EMNLP 2024 Findings
DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information	arXiv 25.01
ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning	arXiv 24.12

1.1.2 Retrieval

Title	Venue
Evaluating Large Language Models on Time Series Feature Understanding: A Comprehensive Taxonomy and Benchmark	EMNLP 2024
TimeSeriesExam: A Time Series Understanding Exam	NeurIPS 2024 Workshop on Time Series in the Age of Large Models
CLaSP: Learning Concepts for Time-Series Signals from Natural Language Supervision	arXiv 24.11

1.2 Time Series to Text

1.2.1 Explanation

Title	Venue
Inferring Event Descriptions from Time Series with Language Models	arXiv 25.03
Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop	arXiv 25.03
Xforecast: Evaluating natural language explanations for time series forecasting	arXiv 24.10
Large language models can deliver accurate and interpretable time series anomaly detection	arXiv 24.05

1.2.2 Captioning

Title	Venue
Repr2Seq: A Data-to-Text Generation Model for Time Series	IJCNN 2023
Insight miner: A time series analysis dataset for cross-domain alignment with natural language	NeurIPS 2023 AI for Science Workshop
T3: Domain-agnostic neural time-series narration	ICDM 2021
Neural data-driven captioning of time-series line charts	Proceedings of the 2020 International Conference on Advanced Visual Interfaces
Time Series Language Model for Descriptive Caption Generation	arXiv 25.01
Decoding Time Series with LLMs: A Multi-Agent Framework for Cross-Domain Annotation	arXiv 24.10
Domain-Independent Automatic Generation of Descriptive Texts for Time-Series Data	arXiv 24.09

1.3 Text to Time + Time to Text

Title	Venue
ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data	AAAI 2025
Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement	arXiv 25.03
ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning	arXiv 24.12
Multi-Modal Forecaster: Jointly Predicting Time Series and Textual Data	arXiv 24.11

1.4 Other Cross-Modality Works

Title	Venue
DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts	EMNLP 2024

1.5 Domain Specific Applications

1.5.1 Spatial-Temporal Data

Title	Venue
Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web	The Web Conference 2024
Urbangpt: Spatio-temporal large language models	KDD 2024
Research on the visualization of spatio-temporal data	IOP Conference Series: Earth and Environmental Science
Spatial temporal data visualization in emergency management: a view from data-driven decision	Proceedings of the 3rd ACM SIGSPATIAL International Workshop on the Use of GIS in Emergency Management
Teochat: A large vision-language assistant for temporal earth observation data	arXiv 24.10

1.5.2 Medical Time Series

Title	Venue
Electrocardiogram Report Generation and Question Answering via Retrieval-Augmented Self-Supervised Modeling	NeurIPS 2024 Workshop on Time Series in the Age of Large Models
Frozen language model helps ecg zero-shot learning	Medical Imaging with Deep Learning
Multimodal Models for Comprehensive Cardiac Diagnostics via ECG Interpretation	IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2024
ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text	Transactions on Machine Learning Research
Towards a Personal Health Large Language Model	Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond
Diffusion-based conditional ECG generation with structured state space models	Computers in biology and medicine
Text-to-ecg: 12-lead electrocardiogram synthesis conditioned on clinical text reports	ICASSP 2023
From data to text in the neonatal intensive care unit: Using NLG technology for decision support and information management	Ai Communications
Using natural language generation technology to improve information flows in intensive care units	ECAI 2008
Summarising complex ICU data in natural language	AMIA annual symposium proceedings
DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information	arXiv 25.01
Automated medical report generation for ecg data: Bridging medical text and signal processing with deep learning	arXiv 24.12
ECG-Chat: A Large ECG-Language Model for Cardiac Disease Diagnosis	arXiv 24.08
Medtsllm: Leveraging llms for multimodal medical time series analysis	arXiv 24.08
MEIT: Multi-modal electrocardiogram instruction tuning on large language models for report generation	arXiv 24.03
Electrocardiogram instruction tuning for report generation	arXiv 24.03
BioSignal Copilot: Leveraging the power of LLMs in drafting reports for biomedical signals	medRxiv 23.06

1.5.3 Financial Time Series

Title	Venue
Knowledge-augmented Financial Market Analysis and Report Generation	EMNLP 2024 Industry Track
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models	ACL 2024 Findings
Long Text and Multi-Table Summarization: Dataset and Method	EMNLP 2022 Findings
Neural abstractive summarization for long text and multiple tables	IEEE Transactions on Knowledge and Data Engineering
Open-finllms: Open multimodal large language models for financial applications	arXiv 24.08
Multimodal gen-ai for fundamental investment research	arXiv 24.01

1.6 Gaps and Outlooks

1.6.1 Unseen Tasks: Introducing Reasoning

Title	Venue
A picture is worth a thousand numbers: Enabling llms reason about time series via visualization	NAACL 2025
Evaluating System 1 vs. 2 Reasoning Approaches for Zero-Shot Time-Series Forecasting: A Benchmark and Insights	arXiv 25.03
Position: Empowering Time Series Reasoning with Multimodal LLMs	arXiv 25.02
ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning	arXiv 24.12
Beyond Forecasting: Compositional Time Series Reasoning for End-to-End Task Execution	arXiv 24.10
Towards Time Series Reasoning with LLMs	arXiv 24.09

2. Time+X

2.1 Time Series + Text

*We list papers that use dynamic text, especially text that carries exogenous information.

Forecasting (& Imputation)

Title	Venue
ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data	AAAI 2025
Time-MMD: Multi-Domain Multimodal Dataset for Time Series Analysis	NeurIPS 2024
From News to Forecast: Integrating Event Analysis in LLM-Based Time Series Forecasting with Reflection	NeurIPS 2024
GPT4MTS: Prompt-Based Large Language Model for Multimodal Time-Series Forecasting	AAAI 2024
Language in the Flow of Time: Time-Series-Paired Texts Weaved into a Unified Temporal Narrative	arXiv 25.02
Context is Key: A Benchmark for Forecasting with Essential Textual Information	arXiv 24.10
Beyond trend and periodicity: Guiding time series forecasting with textual cues	arXiv 24.05
Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts	OpenReview

Classification

Title	Venue
Advancing time series classification with multimodal language modeling	arXiv 24.03
Hierarchical Multimodal LLMs with Semantic Space Alignment for Enhanced Time Series Classification	arXiv 24.10
Dualtime: A dual-adapter multimodal language model for time series representation	arXiv 24.06

2.2 Time Series + Other Modalities

Title	Venue
Imagebind: One embedding space to bind them all	CVPR 2023
LANISTR: Multimodal learning from structured and unstructured data	arXiv 23.05

2.3 Domain Specific Applications

2.3.1 Spatial-Temporal Data

Title	Venue
Terra: A Multimodal Spatio-Temporal Dataset Spanning the Earth	NeurIPS 2024
BjTT: A large-scale multimodal dataset for traffic prediction	IEEE Transactions on Intelligent Transportation Systems
Event Traffic Forecasting with Sparse Multimodal Data	Proceedings of the 32nd ACM International Conference on Multimedia 2024
Mmst-vit: Climate change-aware crop yield prediction via multi-modal spatial-temporal vision transformer	CVPR 2023
Mobile traffic prediction in consumer applications: A multimodal deep learning approach	IEEE Transactions on Consumer Electronics
Urban informal settlements classification via a transformer-based spatial-temporal fusion network using multimodal remote sensing and time-series human activity data	International Journal of Applied Earth Observation and Geoinformation
Spatial-temporal attention-based convolutional network with text and numerical information for stock price prediction	Neural Computing and Applications
Traffic congestion prediction using toll and route search log data	IEEE International Conference on Big Data (Big Data) 2022
Understanding city traffic dynamics utilizing sensor and textual observations	AAAI 2016
Citygpt: Empowering urban spatial cognition of large language models	arXiv 24.06
Where Would I Go Next? Large Language Models as Human Mobility Predictors	arXiv 23.08
Leveraging Language Foundation Models for Human Mobility Forecasting	arXiv 22.09

2.3.2 Medical Time Series

Title	Venue
Addressing asynchronicity in clinical multimodal fusion via individualized chest x-ray generation	NeurIPS 2024
EMERGE: Enhancing Multimodal Electronic Health Records Predictive Modeling with Retrieval-Augmented Generation	CIKM 2024
Improving medical predictions by irregular multimodal electronic health records modeling	ICML 2023
Multimodal pretraining of medical time series and notes	Machine Learning for Health (ML4H) 2023
Learning missing modal electronic health records with unified multi-modal data embedding and modality-aware attention	Machine Learning for Health (ML4H) 2023
MedFuse: Multi-modal fusion with clinical time-series data and chest X-ray images	Machine Learning for Health (ML4H) 2022
Miracle: Causally-aware imputation via learning missing data mechanisms	NeurIPS 2021
How to leverage the multimodal EHR data for better medical prediction?	EMNLP 2021
Deep multi-modal intermediate fusion of clinical record and time series data in mortality prediction	Frontiers in Molecular Biosciences
Integrated multimodal artificial intelligence framework for healthcare applications	NPJ digital medicine
PTB-XL, a large publicly available electrocardiography dataset	Scientific data
Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines	NPJ digital medicine
Arbitrary Data as Images: Fusion of Patient Data Across Modalities and Irregular Intervals with Vision Transformers	arXiv 25.01
Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records	arXiv 24.09
Multimodal risk prediction with physiological signals, medical images and clinical notes	medRxiv 23.05

2.3.3 Financial Time Series

Title	Venue
Fnspid: A comprehensive financial news dataset in time series	KDD 2024
Multi-modal deep learning for credit rating prediction using text and numerical data streams	Applied Soft Computing
Multimodal multiscale dynamic graph convolution networks for stock price prediction	Pattern Recognition
Multi-Modal Financial Time-Series Retrieval Through Latent Space Projections	Proceedings of the Fourth ACM International Conference on AI in Finance
Natural language based financial forecasting: a survey	Artificial Intelligence Review
Financial analysis, planning & forecasting: Theory and application	Unknown
Text2timeseries: Enhancing financial forecasting through time series prediction updates with event-driven insights from large language models	arXiv 24.07
Natural language processing and multimodal stock price prediction	arXiv 24.01
Modality-aware Transformer for Financial Time series Forecasting	arXiv 23.10
Predicting financial market trends using time series analysis and natural language processing	arXiv 23.09
Stock price prediction using sentiment analysis and deep learning for Indian markets	arXiv 22.04
Volatility prediction using financial disclosures sentiments with word embedding-based IR models	arXiv 17.02

2.4 Gaps and Outlooks

2.4.1 Heterogeneous Modality Combinations

Title	Venue
Imagebind: One embedding space to bind them all	CVPR 2023
LANISTR: Multimodal learning from structured and unstructured data	arXiv 23.05

3. TimeAsX

3.1 Time Series as Text

Title	Venue
Context-Alignment: Activating and Enhancing LLM Capabilities in Time Series	ICLR 2025
ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data	AAAI 2025
Exploiting Language Power for Time Series Forecasting with Exogenous Variables	The Web Conference 2025
Lstprompt: Large language models as zero-shot time series forecasters by long-short-term prompting	ACL 2024 Findings
TEMPO: Prompt-based Generative Pre-trained Transformer for Time Series Forecasting	ICLR 2024
TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series	ICLR 2024
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models	ICLR 2024
Are language models actually useful for time series forecasting?	NeurIPS 2024
Autotimes: Autoregressive time series forecasters via large language models	NeurIPS 2024
S2IP-LLM: Semantic Space Informed Prompt Learning with LLM for Time Series Forecasting	ICML 2024
Large language models are zero-shot time series forecasters	NeurIPS 2023
One fits all: Power general time series analysis by pretrained lm	NeurIPS 2023
PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting	IEEE Transactions on Knowledge and Data Engineering
Chronos: Learning the language of time series	TMLR
LLM4TS: Aligning Pre-Trained LLMs as Data-Efficient Time-Series Forecasters	ACM Transactions on Intelligent Systems and Technology
Large Language Models are Few-shot Multivariate Time Series Classifiers	arXiv 25.02
TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents	arXiv 25.02
ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning	arXiv 24.12
Large language models can deliver accurate and interpretable time series anomaly detection	arXiv 24.05
Multi-Patch Prediction: Adapting LLMs for Time Series Representation Learning	arXiv 24.02
Lag-llama: Towards foundation models for time series forecasting	arXiv 23.10

3.2 Time Series as Image

Title	Venue
CAFO: Feature-Centric Explanation on Time Series Classification	KDD 2024
TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis	ICLR 2024
Towards total recall in industrial anomaly detection	CVPR 2022
Deep video prediction for time series forecasting	Proceedings of the Second ACM International Conference on AI in Finance 2021
Forecasting with time series imaging	Expert Systems with Applications
Can Multimodal LLMs Perform Time Series Anomaly Detection?	arXiv 25.02
Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting	arXiv 25.02
See it, Think it, Sorted: Large Multimodal Models are Few-shot Time Series Anomaly Analyzers	arXiv 24.11
Plots Unlock Time-Series Understanding in Multimodal Models	arXiv 24.10
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters	arXiv 24.08
Training-Free Time-Series Anomaly Detection: Leveraging Image Foundation Models	arXiv 24.08
ViTime: A Visual Intelligence-Based Foundation Model for Time Series Forecasting	arXiv 24.07
Time Series as Images: Vision Transformer for Irregularly Sampled Time Series	arXiv 23.03
An image is worth 16x16 words: Transformers for image recognition at scale	arXiv 20.10
Imaging Time-Series to Improve Classification and Imputation	arXiv 15.06

3.3 Time Series as Other Modalities

3.3.1 Tabular Data

Title	Venue
Forecastpfn: Synthetically-trained zero-shot forecasting	NeurIPS 2024
The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features	NeurIPS 2024 Third Table Representation Learning Workshop
TableTime: Reformulating Time Series Classification as Zero-Shot Table Understanding via Large Language Models	arXiv 24.11
Tabular Transformers for Modeling Multivariate Time Series	arXiv 20.11

3.3.2 Audio Data

Title	Venue
Ssast: Self-supervised audio spectrogram transformer	AAAI 2022
T-wavenet: a tree-structured wavelet neural network for time series signal analysis	ICLR 2022
Voice2Series: Reprogramming Acoustic Models for Time Series Classification	arXiv 21.06

3.4 Domain Specific Applications

3.4.1 Spatial-Temporal Data

Title	Venue
Spatial-temporal large language model for traffic prediction	25th IEEE International Conference on Mobile Data Management (MDM) 2024
Unist: A prompt-empowered universal model for urban spatio-temporal prediction	KDD 2024
Urbangpt: Spatio-temporal large language models	KDD 2024
Vmrnn: Integrating vision mamba and lstm for efficient and accurate spatiotemporal forecasting	CVPR 2024
Learning social meta-knowledge for nowcasting human mobility in disaster	The Web Conference 2023
Storm-gan: spatio-temporal meta-gan for cross-city estimation of human mobility responses to covid-19	ICDM 2022
Deep multi-view spatial-temporal network for taxi demand prediction	AAAI 2018
Trafficgpt: Viewing, processing and interacting with traffic foundation models	Transport Policy
Deep spatio-temporal adaptive 3d convolutional neural networks for traffic flow prediction	ACM Transactions on Intelligent Systems and Technology (TIST)
ClimateLLM: Efficient Weather Forecasting via Frequency-Aware Large Language Models	arXiv 25.02
TPLLM: A traffic prediction framework based on pretrained large language models	arXiv 24.03
How can large language models understand spatial-temporal data?	arXiv 24.01

3.4.2 Medical Time Series

Title	Venue
ECG-LLM: Leveraging Large Language Models for Low-Quality ECG Signal Restoration	IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2024
Multimodal llms for health grounded in individual-specific data	Workshop on Machine Learning for Multimodal Healthcare Data 2023
ECG-Chat: A Large ECG-Language Model for Cardiac Disease Diagnosis	arXiv 24.08
Medtsllm: Leveraging llms for multimodal medical time series analysis	arXiv 24.08
Dualtime: A dual-adapter multimodal language model for time series representation	arXiv 24.06
Large language models are few-shot health learners	arXiv 23.05

3.4.3 Financial Time Series

Title	Venue
MTRGL: Effective Temporal Correlation Discerning through Multi-modal Temporal Relational Graph Learning	ICASSP 2024
From pixels to predictions: Spectrogram and vision transformer for better time series forecasting	Proceedings of the Fourth ACM International Conference on AI in Finance 2023
Quantum-enhanced forecasting: Leveraging quantum gramian angular field and CNNs for stock return predictions	Finance Research Letters
Deep learning-based spatial-temporal graph neural networks for price movement classification in crude oil and precious metal markets	Machine Learning with Applications
Financial time series forecasting with multi-modality graph neural network	Pattern Recognition
Encoding candlesticks as images for pattern classification using convolutional neural networks	Financial Innovation
Forecasting with time series imaging	Expert Systems with Applications
Research on financial multi-asset portfolio risk prediction model based on convolutional neural networks and image processing	arXiv 24.12
A Stock Price Prediction Approach Based on Time Series Decomposition and Multi-Scale CNN using OHLCT Images	arXiv 24.10
An image is worth 16x16 words: Transformers for image recognition at scale	arXiv 20.10
Image processing tools for financial time series classification	arXiv 20.08
Financial trading model with stock bar chart image time series with deep convolutional neural networks	arXiv 19.03
Imaging time-series to improve classification and imputation	arXiv 15.06

3.5 Gaps and Outlooks

3.5.1 Reuse Which Modality

Title	Venue
A picture is worth a thousand numbers: Enabling llms reason about time series via visualization	NAACL 2025
Can LLMs Understand Time Series Anomalies?	ICLR 2025
Vision-Enhanced Time Series Forecasting via Latent Diffusion Models	arXiv 25.02

4. Datasets for Multi-Modal Time Series Analysis

4.1 General Datasets

Dataset	Modalities	Highlights
Time-MMD	Time+Text	9 Domains; Real Datasets (general context); Across More than 24 Years
ChatTime	Time+Text	3 Real Datasets (weather&date)
CiK	Time+Text	7 Domains; 71 Human-Designed Tasks
ChatTS	Time+Text	Synthetic Method; 500+ Human-Labeled Samples
TSQA	Time+Text	Multi-Task QA Format; 1.4k Human-Selected Samples

4.2 Financial Datasets

Dataset	Modalities	Highlights
FNSPID	Time+Text	Large-Scale Finance News; Across 24 Years
FinBen	Time+Text	Bilingual; 42 Sub-Datasets; 8 Tasks

4.3 Medical Datasets

Dataset	Modalities	Highlights
MIMIC	Time+Text+Image+Table	Multiple Medical Tasks; Expert-Labeled Data
PTB-XL	Time+Text	Large-Scale Expert-Labeled ECG Data

4.4 Spatial-Temporal Datasets

Dataset	Modalities	Highlights
CityEval	Time+Text+Image	Multiple Urban Tasks, capable of Involving LLMs
Terra	Time+Text+Image	Worldwide Grid Data across 45 Years

Contact

If you have any questions or suggestions, feel free to contact: [email protected]

Acknowledgement

We refer to the following repos:

https://github.com/ForestsKing/Awesome-Multimodal-Time-Series

https://github.com/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM
