PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation

This repository contains the dataset, code, and evaluation scripts for the paper "PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation".

PROPHET is a new benchmark designed to evaluate Future Forecasting systems (LLMs and Agents) with a focus on inferability. Unlike previous benchmarks, PROPHET ensures that prediction questions are actually answerable based on the retrieved news by filtering data using a novel statistical metric: Causal Intervened Likelihood (CIL).

🌟 Key Features

Inferability-First: Addresses the "non-inferable" issue in existing forecasting benchmarks where retrieved information is insufficient to support a conclusion.
CIL Metric: Introduces Causal Intervened Likelihood, a metric derived from causal inference to quantify how strongly a news article supports a specific forecasting outcome.
Real-World Data: Contains 612 high-quality forecasting questions collected from Polymarket (resolved in Jan 2025) with over 300k associated news articles.
Comprehensive Baselines: Includes implementations for both Naive RAG and Agentic RAG (ReAct-based) forecasting systems.

📂 Dataset Statistics

CIL filtering splits the PROPHET benchmark into two subsets. The table below describes L1, the main subset:

Subset	Description	Count	Avg. News per Question	Avg. Tokens per News Article
L1 (Main)	Inferable. Contains strong supportive evidence (CIL > 0.7).	612	~560	~1250
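
The CIL > 0.7 filter in the table can be illustrated with a minimal sketch. The record type, its field names, and the rule "keep a question if at least one of its articles exceeds the threshold" are assumptions made for illustration; they are not taken from the released code or data schema.

```python
from dataclasses import dataclass

CIL_THRESHOLD = 0.7  # threshold quoted in the table above


@dataclass
class ScoredArticle:
    # Hypothetical record: field names are illustrative, not the released schema.
    question_id: str
    article_id: str
    cil: float


def inferable_question_ids(articles, threshold=CIL_THRESHOLD):
    """Return sorted IDs of questions with at least one article whose CIL exceeds the threshold.

    The "at least one article" aggregation rule is an assumption for this sketch.
    """
    return sorted({a.question_id for a in articles if a.cil > threshold})


articles = [
    ScoredArticle("q1", "a1", 0.82),
    ScoredArticle("q1", "a2", 0.10),
    ScoredArticle("q2", "a3", 0.65),
    ScoredArticle("q3", "a4", 0.71),
]
print(inferable_question_ids(articles))  # ['q1', 'q3']
```

The comparison is strict, so an article with a CIL of exactly 0.7 would not qualify under this reading of "CIL > 0.7".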

Data source: Polymarket (Resolution date: 2025-01-01 to 2025-01-31).

Download the L1 dataset

The L1 subset of the dataset can be downloaded from Google Drive: Download Link

🧠 Methodology: Causal Intervened Likelihood (CIL)

CIL estimates the causal effect of a news event ($X_i$) on the forecasting outcome ($Y$). It is defined as:

$CIL_i = P(Y=\hat{Y}|do(X_i=1)) - P(Y=\hat{Y}|do(X_i=0))$
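
As a small arithmetic illustration of the definition above, the sketch below takes the two interventional probabilities as given inputs and returns their difference. The function name and the example numbers are hypothetical; how the probabilities themselves are estimated is not shown here.

```python
def causal_intervened_likelihood(p_do_present: float, p_do_absent: float) -> float:
    """CIL_i = P(Y = Y_hat | do(X_i = 1)) - P(Y = Y_hat | do(X_i = 0)).

    Both arguments are probabilities assumed to be estimated elsewhere.
    """
    for p in (p_do_present, p_do_absent):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return p_do_present - p_do_absent


# Made-up numbers: the event raises the likelihood of the true outcome from 0.25 to 0.75.
print(causal_intervened_likelihood(0.75, 0.25))  # 0.5
```

Because it is a difference of two probabilities, CIL lies in [-1, 1]: positive values mean the event makes the true outcome more likely, negative values mean it makes it less likely.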

We compute this by modeling the news stream as a Structural Causal Model (SCM) with two key assumptions:

Temporality: Later events cannot cause earlier events.
w-day Dependency: Direct causal influence is limited to a $w$-day window (we use $w=30$).

This allows us to bridge interventional probabilities to observational probabilities estimable by LLMs.
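
The two assumptions can be read as a filter on which events are allowed to directly influence a given event. The sketch below is one such reading, not the repository's implementation: the event IDs, the dictionary layout, the strict "earlier than" comparison, and the inclusive window boundary are all assumptions.

```python
from datetime import date, timedelta

W_DAYS = 30  # window size stated in the text


def candidate_causes(event_dates, target_id, w=W_DAYS):
    """Return sorted IDs of events that may directly influence the target event.

    event_dates maps a hypothetical event ID to its publication date.
    Temporality: only strictly earlier events qualify.
    w-day dependency: only events at most w days before the target qualify.
    """
    target_date = event_dates[target_id]
    earliest = target_date - timedelta(days=w)
    return sorted(
        event_id
        for event_id, d in event_dates.items()
        if event_id != target_id and earliest <= d < target_date
    )


events = {
    "e1": date(2024, 11, 20),
    "e2": date(2024, 12, 10),
    "e3": date(2024, 12, 28),
    "e4": date(2025, 1, 5),
}
print(candidate_causes(events, "e3"))  # ['e2']
print(candidate_causes(events, "e4"))  # ['e2', 'e3']
```

In the first call, e1 falls outside the 30-day window and e4 is later than the target, so only e2 remains.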

📊 Performance Highlights

Below are selected results (Brier Score, lower is better) comparing Naive RAG vs. Agentic RAG on the PROPHET dataset.

Model	w/o RAG	Naive RAG (Best)	Agentic RAG
Claude-4-sonnet	18.57	18.00	17.89
GPT-4o-mini	24.28	27.17	-
DeepSeek-v3	20.37	21.04	-
Gemini-2.5-Pro	21.41	-	19.26
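
For reference, the Brier score of binary forecasts is the mean squared difference between the predicted probability and the resolved outcome (0 or 1). The sketch below uses made-up forecasts; multiplying by 100 to match the scale of the table is an assumption about how the reported numbers are presented.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and binary outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Made-up forecasts for four resolved questions (1 = "Yes", 0 = "No").
forecasts = [0.5, 1.0, 0.0, 0.5]
resolved = [1, 1, 0, 0]

score = brier_score(forecasts, resolved)
print(score)        # 0.125
print(100 * score)  # 12.5, assuming the table uses a 0-100 scale
```

A forecaster who always answers 0.5 scores 0.25 (25 on the assumed scale), which is a useful baseline when reading the table.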

See the paper for full tables and analysis.

📝 Citation

If you use PROPHET or CIL in your research, please cite our paper:

@article{tao2026prophet,
  title={PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation},
  author={Zhengwei Tao and Pu Wu and Zhi Jin and Xiaoying Bai and Haiyan Zhao and Chengfeng Dou and Xiancai Chen and Jia Li and Linyu Li and Chongyang Tao and Wentao Zhang},
  journal={arXiv preprint},
  year={2026}
}

📧 Contact

For questions or feedback, please contact:

Zhengwei Tao: tttzw@pku.edu.cn
