
REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources


Overview

REANIMATOR is a versatile framework designed to enhance and repurpose existing retrieval test collections by enriching them with extracted and synthetic resources. It enables the parsing of full texts, machine-readable tables, and contextual metadata from PDF files. Additionally, it leverages state-of-the-art large language models to generate synthetic relevance labels, with an optional human-in-the-loop validation step.

We showcase its potential by revitalizing the TREC-COVID test collection, demonstrating how retrieval-augmented generation (RAG) systems can be developed and evaluating the impact of tables on RAG performance. REANIMATOR lowers costs and broadens the utility of legacy resources, making them reusable for new applications.
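
As a rough illustration of the extraction step, the minimal sketch below parses a PDF with the Docling library (which the gen_docling_exports.ipynb notebook appears to use); the file path is a placeholder and the export calls only hint at what the full pipeline does.

# Minimal sketch of PDF extraction with Docling (assumed usage, not the
# project's exact pipeline). Requires: pip install docling
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("src/pdfs/example.pdf")  # placeholder path

doc = result.document
print(doc.export_to_markdown()[:500])   # full text, including table markup

# Tables are also available as structured objects
for table in doc.tables:
    print(table.export_to_dataframe().head())  # table as a pandas DataFrame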

Features

  • Automated Data Extraction: Parses full texts and structured tables from PDFs.
  • Synthetic Relevance Labeling: Utilizes large language models to generate annotations (a minimal sketch follows this list).
  • Human-in-the-Loop Validation: Optional verification step for quality assurance.
  • Parallelized Processing: Efficient execution to handle large datasets.
  • RAG System Integration: Enables research on retrieval-augmented generation.
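
The sketch below illustrates the general idea behind LLM-based relevance labeling. It is an assumed, simplified setup using the OpenAI Python client with a placeholder model name, prompt, and grading scale; it is not the project's actual annotation pipeline.

# Illustrative sketch of LLM-based relevance labeling (assumed prompt, model,
# and grading scale; not the project's actual setup). Requires: pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_relevance(query: str, passage: str) -> int:
    """Ask the model for a graded relevance judgment between 0 and 2."""
    prompt = (
        f"Query: {query}\n\nPassage: {passage}\n\n"
        "On a scale from 0 (not relevant) to 2 (highly relevant), how relevant "
        "is the passage to the query? Answer with a single digit."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())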

Project Structure

📂 src
 ├── 📁 data               # Processed and extracted data
 ├── 📁 database           # Persistent storage
 ├── 📁 labeling           # Relevance annotation and validation
 ├── 📁 parallel_exec      # Parallel processing utilities
 ├── 📁 pdfs               # Original source PDFs
 ├── 📁 preprocessing      # Data cleaning and transformation scripts
 ├── 📄 download_pdfs.ipynb         # Notebook for fetching PDFs
 ├── 📄 gen_docling_exports.ipynb   # Generates Docling exports (parsed texts and tables)
 ├── 📄 get_urls_for_dois.ipynb     # Resolves DOIs to URLs for document retrieval
 ├── 📄 helpers.py                  # Utility functions
 └── 📄 test_eval.ipynb             # Evaluation and testing notebook

Data Resources

Data for this project is available via Google Drive.

📂 data
├── 📁 chunks   # Segmented chunks of raw passages and tables
└── 📁 tables   # Table data extracted with Docling

📂 pools
├── 📁 passage_pool # Pool based on the 12 different retrieval configurations for passages
└── 📁 table_pool # Pool based on the 12 different retrieval configurations for tables

📂 qrels
├── 📂 auto_qrels                 # Automatically generated relevance judgments
│   ├── 📁 passage_qrels_models*  # Auto-generated qrels for passages (one folder per model)
│   └── 📁 table_qrels_models*    # Auto-generated qrels for tables (one folder per model)
└── 📂 human_qrels                # Relevance judgments from human annotators
    ├── 📁 passage_human_qrels    # Human assessments of passage relevance
    └── 📁 table_human_qrels      # Human assessments of table relevance
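
To inspect the judgments programmatically, a minimal loading sketch is shown below; it assumes the qrels files use the standard whitespace-separated four-column TREC format (topic, iteration, document ID, relevance), and the file path is a placeholder.

# Minimal sketch for reading a qrels file, assuming the standard TREC format.
# The path is a placeholder; check the actual file names in the Drive folder.
import pandas as pd

qrels = pd.read_csv(
    "qrels/human_qrels/passage_human_qrels/qrels.txt",  # placeholder path
    sep=r"\s+",
    names=["topic_id", "iteration", "doc_id", "relevance"],
)
print(qrels["relevance"].value_counts())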

📂 rag_answers
├── 📁 bm25_combined          # RAG answers derived using BM25 on interleaved modalities
├── 📁 bm25_text              # RAG answers derived using BM25 on text passages
├── 📁 bm25_table             # RAG answers derived using BM25 on table data
├── 📁 vectorstore_combined   # RAG answers derived using cosine similarity on interleaved modalities
├── 📁 vectorstore_text       # RAG answers derived using cosine similarity on the text modality
└── 📁 vectorstore_table      # RAG answers derived using cosine similarity on the table modality

📂 retrievers
├── 📂 bm25                        # Retrieval models based on the BM25 algorithm
│   ├── 📁 retriever_bm25_passage  # BM25-based retrieval for passages
│   └── 📁 retriever_bm25_table    # BM25-based retrieval for table data
└── 📂 vectorstores                # Retrieval models based on vector store methods
    └── 📁 chroma_db               # A vector store using Chroma DB for similarity search
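
The persisted Chroma store can be opened with the chromadb client, as in the minimal sketch below; the path mirrors the layout above, while the collection names and embedding setup are assumptions that need to be checked with list_collections().

# Minimal sketch for opening the persisted Chroma vector store (assumed layout;
# collection names must be discovered first). Requires: pip install chromadb
import chromadb

client = chromadb.PersistentClient(path="retrievers/vectorstores/chroma_db")
print(client.list_collections())  # discover the available collections

# Hypothetical usage once a collection name is known:
# collection = client.get_collection("passages")
# results = collection.query(query_texts=["coronavirus origin"], n_results=5)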

Citation

If you use REANIMATOR in your research, please cite our paper:

@inproceedings{reanimator2025,
  title={REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources},
  author={Björn Engelmann and Fabian Haak and Philipp Schaer and Mani Erfanian Abdoust and Linus Netze and Meik Bittkowski},
  booktitle={},
  year={2025}
}

License

This project is licensed under the MIT License. See LICENSE for details.

Installation Guide

This guide will walk you through setting up the Reanimator project using Docker and Docker Compose with NVIDIA GPU capabilities. If you do not have an NVIDIA GPU, alternative instructions are provided.


Prerequisites

  • Docker and Docker Compose installed on your system.
  • NVIDIA GPU with the appropriate drivers (if available).
  • An IDE like Visual Studio Code or Cursor.
  • Git installed on your system.

Installation Steps

1. Install Docker and Docker Compose with NVIDIA Capabilities

Ensure that Docker and Docker Compose are installed with NVIDIA GPU support.

  • For NVIDIA GPU users:

    • Install the NVIDIA Container Toolkit to enable GPU support in Docker.

    • Verify the installation with:

      docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
  • For non-NVIDIA GPU users:

    • Proceed to the next step.

2. Clone the Repository

Clone the Reanimator repository:

git clone https://github.com/irgroup/Reanimator.git

3. Navigate to the Project Directory

cd Reanimator

4. (Optional) Replace Docker Files for Non-NVIDIA GPU Systems

If you do not have an NVIDIA GPU, replace the Dockerfile and docker-compose.yml with the versions suited for non-GPU systems:

cp Docker_NO_GPU/Dockerfile .
cp Docker_NO_GPU/docker-compose.yml .

Important: Do not commit these changes to the repository. To prevent accidental commits, add these files to your local .gitignore:

echo "Dockerfile" >> .gitignore
echo "docker-compose.yml" >> .gitignore

5. Build the Docker Image

Build the Docker image named reanimator:

docker build -t reanimator .

6. Start the Docker Container

Run the container using Docker Compose:

docker compose up

7. Attach Your IDE to the Container

Open your preferred IDE (recommended: Visual Studio Code or Cursor) and attach it to the running Docker container.

8. Navigate to the Workspace Directory

Within your IDE, navigate to the /workspace directory inside the container.

9. Select Python Kernel

Set the Python interpreter to Python 3.10.12, the version provided inside the container.
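
As an optional sanity check (not part of the official setup), you can run the following in a notebook cell to confirm the selected kernel:

# Optional check that the selected kernel matches the container's Python.
import sys
print(sys.version)                 # expect something starting with "3.10.12"
assert sys.version_info[:2] == (3, 10)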

Verification

To confirm that your setup is correct:

  1. Open the Jupyter notebook gen_docling_exports.ipynb located in the src directory.
  2. Run all cells in the notebook.
  3. Ensure that all cells execute without errors.
