This repository contains the dataset, code, and evaluation scripts for the paper "PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation".
PROPHET is a new benchmark designed to evaluate Future Forecasting systems (LLMs and Agents) with a focus on inferability. Unlike previous benchmarks, PROPHET ensures that prediction questions are actually answerable based on the retrieved news by filtering data using a novel statistical metric: Causal Intervened Likelihood (CIL).
- Inferability-First: Addresses the "non-inferable" issue in existing forecasting benchmarks where retrieved information is insufficient to support a conclusion.
- CIL Metric: Introduces Causal Intervened Likelihood, a metric derived from causal inference to quantify how strongly a news article supports a specific forecasting outcome.
- Real-World Data: Contains 612 high-quality forecasting questions collected from Polymarket (resolved in Jan 2025) with over 300k associated news articles.
- Comprehensive Baselines: Includes implementations for both Naive RAG and Agentic RAG (ReAct-based) forecasting systems.
The PROPHET benchmark consists of two subsets based on the CIL filtering:
| Subset | Description | Count | Avg News/Q | Avg Token/News |
|---|---|---|---|---|
| L1 (Main) | Inferable. Contains strong supportive evidence (CIL > 0.7). | 612 | ~560 | ~1250 |
Data source: Polymarket (Resolution date: 2025-01-01 to 2025-01-31).
The L1 subset of the dataset can be downloaded from Google Drive (Download Link).
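As a rough illustration of the CIL-based filtering described above, the sketch below keeps a question when at least one of its associated articles has a CIL score above 0.7. The function name `select_inferable`, the input layout (a mapping from question ID to per-article CIL scores), and the "at least one article" aggregation rule are assumptions made for this example; the paper defines the actual procedure.

```python
from typing import Dict, List

# Threshold taken from the subset table (CIL > 0.7).
CIL_THRESHOLD = 0.7


def select_inferable(
    cil_scores: Dict[str, List[float]],
    threshold: float = CIL_THRESHOLD,
) -> List[str]:
    """Return IDs of questions with at least one article scoring above the threshold.

    Assumption: a question counts as inferable if any single article
    exceeds the threshold. This aggregation rule is illustrative only.
    """
    return [
        qid
        for qid, scores in cil_scores.items()
        if any(score > threshold for score in scores)
    ]


# Hypothetical scores for three questions.
example_scores = {
    "q1": [0.20, 0.85],  # one strongly supportive article
    "q2": [0.10, 0.70],  # 0.70 is not strictly greater than the threshold
    "q3": [],            # no articles retrieved
}
kept = select_inferable(example_scores)
print(kept)  # ['q1']
```

The comparison is strict, so a score of exactly 0.7 does not pass, which matches the `CIL > 0.7` wording in the table.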
CIL estimates the causal effect of a news event on the outcome of a forecasting question.
We compute this by modeling the news stream as a Structural Causal Model (SCM) with two key assumptions:
- Temporality: Later events cannot cause earlier events.
- $w$-day Dependency: Direct causal influence is limited to a $w$-day window (we use $w=30$).
This allows us to bridge interventional probabilities to observational probabilities estimable by LLMs.
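The two assumptions can be read as a rule for which earlier events may directly influence a given event. The sketch below applies that rule to a list of event dates. The function name `candidate_parents` and the handling of same-day events (excluded here) are assumptions for illustration, not the repository's implementation.

```python
from datetime import date, timedelta
from typing import List, Sequence


def candidate_parents(
    event_dates: Sequence[date],
    target_date: date,
    w: int = 30,
) -> List[int]:
    """Return indices of events that may directly influence an event on target_date.

    Temporality: only strictly earlier events qualify.
    w-day dependency: the event must fall within the w days before target_date.
    Assumption: events on the same day as the target are excluded.
    """
    window_start = target_date - timedelta(days=w)
    return [
        i
        for i, d in enumerate(event_dates)
        if window_start <= d < target_date
    ]


# Hypothetical news dates around a target event on 2025-01-10.
dates = [
    date(2024, 12, 1),   # more than 30 days earlier: outside the window
    date(2024, 12, 20),  # inside the window
    date(2025, 1, 5),    # inside the window
    date(2025, 1, 10),   # same day as the target: excluded
]
print(candidate_parents(dates, date(2025, 1, 10)))  # [1, 2]
```

Restricting the candidate set this way keeps the number of conditioning events bounded, which is what makes the observational probabilities practical to estimate.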
Below are selected results (Brier Score, lower is better) comparing Naive RAG vs. Agentic RAG on the PROPHET dataset.
| Model | w/o RAG | Naive RAG (Best) | Agentic RAG |
|---|---|---|---|
| Claude-4-sonnet | 18.57 | 18.00 | 17.89 |
| GPT-4o-mini | 24.28 | 27.17 | - |
| DeepSeek-v3 | 20.37 | 21.04 | - |
| Gemini-2.5-Pro | 21.41 | - | 19.26 |
See the paper for full tables and analysis.
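For reference, the Brier score for binary questions is the mean squared difference between the predicted probability and the resolved outcome (1 if the event occurred, 0 otherwise). The helper below is a minimal sketch of that standard definition, not the repository's evaluation script. The table values appear to be reported on a 0 to 100 scale, which is an assumption reflected in the optional `scale` parameter.

```python
from typing import Sequence


def brier_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    scale: float = 1.0,
) -> float:
    """Mean squared error between predicted probabilities and binary outcomes.

    probs: predicted probability that each question resolves "yes".
    outcomes: resolved labels, 1 for "yes" and 0 for "no".
    scale: multiplier for reporting (assumption: 100 matches the table's scale).
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("at least one prediction is required")
    mse = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
    return scale * mse


# An uninformed forecaster that always answers 0.5 scores 0.25 (25 when scaled by 100).
print(brier_score([0.5, 0.5], [1, 0]))             # 0.25
print(brier_score([0.5, 0.5], [1, 0], scale=100))  # 25.0
```

The constant-0.5 baseline is a useful reference point: on the scaled reading, any score above 25 is worse than making no commitment at all.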
If you use PROPHET or CIL in your research, please cite our paper:
@article{tao2026prophet,
title={PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation},
author={Zhengwei Tao and Pu Wu and Zhi Jin and Xiaoying Bai and Haiyan Zhao and Chengfeng Dou and Xiancai Chen and Jia Li and Linyu Li and Chongyang Tao and Wentao Zhang},
journal={arXiv preprint},
year={2026}
}
For questions or feedback, please contact:
- Zhengwei Tao: tttzw@pku.edu.cn