PsiloQA is the largest dataset for training and evaluating systems on multilingual span-level hallucination detection with retrieved context.
It offers:
- 🧠 An automated and scalable pipeline for generating, annotating, and filtering data for hallucination detection tasks
- 🌍 A large multilingual dataset covering 14 languages with high-quality, fine-grained span-level hallucination annotations for multiple open-source LLMs
- 📊 Comprehensive empirical evaluations of various state-of-the-art span-level hallucination detection methods across 14 languages
You can explore or download the dataset on Hugging Face:
👉 s-nlp/PsiloQA
This repository contains the full PsiloQA generation pipeline — from sampling multilingual Wikipedia contexts to question–answer generation, LLM hypothesis production, annotation, and filtering.
Install uv:
pip install uv
Install dependencies:
uv sync --no-dev
Copy `env.example` and fill in the environment variables:
cp env.example .env
The PsiloQA pipeline automates the construction of a multilingual, span-level hallucination detection dataset with contexts — from sampling Wikipedia passages to generating Q&A, producing model hypotheses, annotating hallucinated spans, and filtering the results.
It consists of five sequential stages:
- Contexts — parse random Wikipedia pages as input passages for QA generation.
- QA pairs — generate questions and answers of varying complexity using an OpenAI model.
- LLM hypotheses — produce candidate model answers for evaluation.
- Annotation — mark hallucinated spans in model hypotheses using an OpenAI-based annotator.
- Filtering — automatically clean data via heuristic and LLM-based filters.
Each stage can be run individually, or you can execute the full pipeline with a single command:
uv run psilo dataset pipeline --num-pages 10 --language ru --language en --limit 100 --model Qwen/Qwen2.5-3B-Instruct
All API keys and model settings are managed via the `.env` file (`QA_GENERATOR_`, `ANNOTATOR_`, and `FILTER_` prefixes).
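For example, a filled-in `.env` might look like the sketch below. Only `ANNOTATOR_OPENAI_API_KEY`, `FILTER_OPENAI_API_KEY`, and `HF_TOKEN` are named explicitly in this README; the other variable names are assumptions, so treat `env.example` as the authoritative list.

```
# Illustrative values only; see env.example for the real variable names.
QA_GENERATOR_OPENAI_API_KEY=sk-...  # assumed name under the QA_GENERATOR_ prefix
QA_GENERATOR_MODEL=gpt-4o           # assumed name; switches the QA generation model
ANNOTATOR_OPENAI_API_KEY=sk-...
FILTER_OPENAI_API_KEY=sk-...
HF_TOKEN=hf_...                     # required by some gated Hugging Face models
```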
The first step in the PsiloQA pipeline is obtaining contexts for QA generation. You can use your own or, as in our paper, parse random Wikipedia pages as input contexts. Just run the following command with the languages you need. If no `--language` list is specified, random pages are parsed for all 14 languages presented in our paper; `--num-pages` determines how many contexts to parse from Wikipedia.
uv run psilo dataset get_contexts --num-pages 10 --language ru --language en
The next step is question and answer generation for the obtained contexts. The script generates three questions of varying complexity based on the provided contexts. Fill in the `QA_GENERATOR` settings in the `.env` file to use this script. By default, `gpt-4o` is used; feel free to use another model by providing a different model name via the `QA_GENERATOR` setting in `.env`.
uv run psilo dataset generate_qa
All available models are listed in `psilo/dataset/answer_generator/models`. You can add any new Hugging Face model by implementing a runner class that inherits from either:
- `RunnerWithChatTemplate`, if the tokenizer supports chat templates, or
- `RunnerWithCustomTemplate`, if it does not.

Some models require a Hugging Face access token, so make sure to provide `HF_TOKEN` in your `.env` file; models that need it will be skipped if the token is missing.
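For illustration, adding a model could look like the minimal sketch below. The import path, base-class contract, and attribute names are assumptions; mirror one of the existing runners in `psilo/dataset/answer_generator/models` for the actual interface.

```python
# Hypothetical sketch: the import path and the RunnerWithChatTemplate
# interface are assumptions; copy an existing runner from
# psilo/dataset/answer_generator/models for the real contract.
from psilo.dataset.answer_generator.models import RunnerWithChatTemplate


class MyModelRunner(RunnerWithChatTemplate):
    # This model's tokenizer ships a chat template, so the
    # chat-template base class is the right choice.
    model_name = "my-org/my-chat-model"  # illustrative Hugging Face model ID
```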
uv run psilo dataset generate_hypotheses
Annotate hypotheses (fill the `ANNOTATOR_OPENAI_API_KEY` variable in `.env`):
uv run psilo dataset annotate_hypotheses
The annotation process includes two filtering stages. Heuristic-based filters ensure structural correctness: they verify that all opening tags have corresponding closing tags, that there are no nested tags, and they perform other automated pre-checks. LLM-based filters remove samples with subjective or incomplete questions, as well as cases where the model refuses to answer. For the LLM-based filter, fill the `FILTER_OPENAI_API_KEY` variable in `.env`:
uv run psilo dataset filter
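For intuition, the tag-structure checks could be implemented along the lines of the sketch below, assuming hallucinated spans are wrapped in `<hal>…</hal>` tags; the pipeline's actual markup and filter code may differ.

```python
import re

# Minimal sketch of the heuristic structural pre-checks described above.
# The <hal> tag name is an assumption; the real annotation markup may differ.
TAG_RE = re.compile(r"</?hal>")


def spans_are_well_formed(text: str) -> bool:
    """True iff every opening tag is closed and no tags are nested."""
    depth = 0
    for match in TAG_RE.finditer(text):
        if match.group() == "<hal>":
            depth += 1
            if depth > 1:   # nested opening tag
                return False
        else:
            depth -= 1
            if depth < 0:   # closing tag without a matching opener
                return False
    return depth == 0       # False when an opening tag is never closed


print(spans_are_well_formed("The capital is <hal>Lyon</hal>."))    # True
print(spans_are_well_formed("<hal>nested <hal>tags</hal></hal>"))  # False
print(spans_are_well_formed("dangling </hal> closer"))             # False
```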
[TBD]