PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation

This repository contains the dataset, code, and evaluation scripts for the paper "PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation".

PROPHET is a new benchmark designed to evaluate Future Forecasting systems (LLMs and Agents) with a focus on inferability. Unlike previous benchmarks, PROPHET ensures that prediction questions are actually answerable based on the retrieved news by filtering data using a novel statistical metric: Causal Intervened Likelihood (CIL).

🌟 Key Features

  • Inferability-First: Addresses the "non-inferable" issue in existing forecasting benchmarks where retrieved information is insufficient to support a conclusion.
  • CIL Metric: Introduces Causal Intervened Likelihood, a metric derived from causal inference to quantify how strongly a news article supports a specific forecasting outcome.
  • Real-World Data: Contains 612 high-quality forecasting questions collected from Polymarket (resolved in Jan 2025) with over 300k associated news articles.
  • Comprehensive Baselines: Includes implementations for both Naive RAG and Agentic RAG (ReAct-based) forecasting systems.

📂 Dataset Statistics

The PROPHET benchmark consists of two subsets based on the CIL filtering:

| Subset | Description | Count | Avg News/Q | Avg Token/News |
|---|---|---|---|---|
| L1 (Main) | Inferable. Contains strong supportive evidence (CIL > 0.7). | 612 | ~560 | ~1250 |

Data source: Polymarket (Resolution date: 2025-01-01 to 2025-01-31).

Download the L1 dataset

The L1 part of the dataset can be downloaded from Google Drive via the Download Link.

🧠 Methodology: Causal Intervened Likelihood (CIL)

CIL estimates the causal effect of a news event ($X_i$) on the forecasting outcome ($Y$). It is defined as:

$CIL_i = P(Y=\hat{Y}|do(X_i=1)) - P(Y=\hat{Y}|do(X_i=0))$

We compute this by modeling the news stream as a Structural Causal Model (SCM) with two key assumptions:

  1. Temporality: Later events cannot cause earlier events.
  2. w-day Dependency: Direct causal influence is limited to a $w$-day window (we use $w=30$).

This allows us to bridge interventional probabilities to observational probabilities estimable by LLMs.
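As a rough illustration of the definition and the two assumptions above, the sketch below computes a CIL value from two already-estimated interventional probabilities and selects the earlier events that fall inside the $w$-day window. The names `NewsEvent`, `candidate_causes`, `cil`, and `has_strong_support` are hypothetical and not part of this repository. Excluding same-day events and using an "any article above the threshold" rule for the 0.7 filter are assumptions made for the example, not details confirmed by the paper.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class NewsEvent:
    """Hypothetical container for one news event in the stream."""
    event_id: str
    published: date


def candidate_causes(target, events, w=30):
    """Return events that may directly influence `target`.

    Temporality: only strictly earlier events qualify (assumption:
    same-day events are excluded).
    w-day dependency: only events at most `w` days before `target` qualify.
    """
    earliest = target.published - timedelta(days=w)
    return [e for e in events if earliest <= e.published < target.published]


def cil(p_do_1, p_do_0):
    """CIL_i = P(Y = Y_hat | do(X_i = 1)) - P(Y = Y_hat | do(X_i = 0))."""
    for p in (p_do_1, p_do_0):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_do_1 - p_do_0


def has_strong_support(cil_scores, threshold=0.7):
    """Assumed filter rule: at least one article with CIL above the threshold."""
    return any(score > threshold for score in cil_scores)
```

A CIL close to 1 means the event strongly raises the likelihood of the resolved outcome, a value near 0 means it makes little difference, and a negative value means the event points away from the resolved outcome.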

📊 Performance Highlights

Below are selected results (Brier Score, lower is better) comparing Naive RAG vs. Agentic RAG on the PROPHET dataset.

| Model | w/o RAG | Naive RAG (Best) | Agentic RAG |
|---|---|---|---|
| Claude-4-sonnet | 18.57 | 18.00 | 17.89 |
| GPT-4o-mini | 24.28 | 27.17 | - |
| DeepSeek-v3 | 20.37 | 21.04 | - |
| Gemini-2.5-Pro | 21.41 | - | 19.26 |

See the paper for full tables and analysis.
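For readers unfamiliar with the metric, here is a minimal sketch of the standard Brier score for binary questions: the mean squared difference between the predicted probability and the 0/1 outcome. The function name `brier_score` is hypothetical and this is not the repository's evaluation script. The reported values appear to be on a 0 to 100 scale, so multiplying by 100 to compare against them is an assumption.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Three questions with predicted P(yes) and their resolved outcomes.
score = brier_score([0.9, 0.2, 0.5], [1, 0, 1])
print(f"Brier score: {score:.4f} (x100: {100 * score:.2f})")
```

Always answering 0.5 yields a score of 0.25 (25 on the assumed 0 to 100 scale), which is a useful reference point when reading the table.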

📝 Citation

If you use PROPHET or CIL in your research, please cite our paper:

@article{tao2026prophet,
  title={PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation},
  author={Zhengwei Tao and Pu Wu and Zhi Jin and Xiaoying Bai and Haiyan Zhao and Chengfeng Dou and Xiancai Chen and Jia Li and Linyu Li and Chongyang Tao and Wentao Zhang},
  journal={arXiv preprint},
  year={2026}
}

📧 Contact

For questions or feedback, please contact:

  • Zhengwei Tao: tttzw@pku.edu.cn
