Arxiv QA

Retrieval-augmented generation example that answers questions from arXiv abstracts and titles.

(Video sped up 3x.)

Setup

Copy secrets-example.json and replace the placeholder value with your own key.
Fetch arxiv-metadata-oai-snapshot.json from Kaggle:
- kaggle datasets download -d Cornell-University/arxiv
Run preprocess_dataset.py
- Input file: arxiv-metadata-oai-snapshot.json
- Output file: documents.json (a bit smaller)
Run docker compose up -d to start Meilisearch and Qdrant.
Then run the ingestion scripts:
- ingest_to_meilisearch.py
- ingest_to_qdrant.py
  - You'll want a GPU 😁. Use nvitop to check that the GPU is actually being used.
  - Example performance: g5.xlarge (1x A10G), ~600k abstracts, ~12 minutes
Finally, run query.py to ask some questions.
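
The preprocessing step above can be sketched roughly as follows. This is not the code in preprocess_dataset.py; it is a minimal illustration that assumes the snapshot stores one JSON record per line with id, title and abstract fields, and that documents.json is a JSON array of slimmed-down records. The function names iter_documents and preprocess are hypothetical, and any filtering the real script does is not shown.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one slim document per non-empty JSON line (assumed input layout)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {
            "id": record["id"],
            # Collapse hard line breaks and repeated spaces into single spaces.
            "title": " ".join(record["title"].split()),
            "abstract": " ".join(record["abstract"].split()),
        }


def preprocess(src: TextIO, dst: TextIO) -> int:
    """Write the slim documents to dst as a JSON array; return the count.

    Assumption: the output is a single JSON array. All documents are held
    in memory before writing, which is fine for a sketch.
    """
    documents = list(iter_documents(src))
    json.dump(documents, dst, ensure_ascii=False)
    return len(documents)
```

To try it on real data, open arxiv-metadata-oai-snapshot.json for reading and documents.json for writing (both with encoding="utf-8") and pass the two file objects to preprocess. Dropping every field except the ones needed for search is what makes the output smaller than the input.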

You can test Meilisearch keyword lookup in the browser at http://localhost:8080/.
cli.py could be useful, but at the moment it only exposes meilisearch_index and meilisearch_client.
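
For the final query step, the "augmented" part of retrieval-augmented generation usually means placing the retrieved titles and abstracts in the prompt ahead of the question. The sketch below shows one way that could look. It is an assumption, not the logic in query.py: build_prompt, the max_chars budget and the prompt wording are all hypothetical, and retrieving the hits from Meilisearch or Qdrant is left out.

```python
from typing import Sequence


def build_prompt(question: str, hits: Sequence[dict], max_chars: int = 4000) -> str:
    """Format retrieved abstracts as numbered context followed by the question.

    Each hit is assumed to be a dict with "title" and "abstract" keys.
    """
    blocks = []
    used = 0
    for number, hit in enumerate(hits, start=1):
        block = f"[{number}] {hit['title']}\n{hit['abstract']}"
        # Stop adding context once the character budget would be exceeded.
        if used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block)
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the abstracts below. "
        "Cite abstracts by their bracketed number.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The character budget is a crude stand-in for a token limit. Hits are added in ranked order, so the most relevant abstracts are the ones that survive truncation.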
