Retrieval-augmented generation example that answers questions from arXiv abstracts and titles.
- Copy `secrets-example.json` and replace the example key with your own.
- Fetch `arxiv-metadata-oai-snapshot.json`:
  `kaggle datasets download -d Cornell-University/arxiv`
- Run `preprocess_dataset.py`
  - Input file: `arxiv-metadata-oai-snapshot.json`
  - Output file: `documents.json` (a bit smaller)
- Run `docker compose up -d` to start Meilisearch and Qdrant.
- Then run `ingest_to_meilisearch.py` and `ingest_to_qdrant.py`
  - You'll want a GPU 😁, use `nvitop` to check that it's using the GPU.
  - Example performance: g5.xlarge (1x A10G), ~600k abstracts, ~12 minutes
- Finally, run `query.py` to ask some questions.
- To test Meilisearch keyword lookup, you can connect to a nice server at http://localhost:8080/
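
The preprocessing step above can be sketched roughly as follows. This is not the actual `preprocess_dataset.py`, only a minimal illustration under a few assumptions: the Kaggle snapshot is read as JSON Lines (one record per line), only `id`, `title`, and `abstract` are kept, and `documents.json` is written as a single JSON array. The function names `slim_record` and `preprocess` are hypothetical.

```python
import json
from pathlib import Path


def slim_record(record: dict) -> dict:
    """Keep only the fields needed for retrieval (an assumed selection)."""
    return {
        "id": record["id"],
        # Titles and abstracts may contain hard line breaks; collapse whitespace.
        "title": " ".join(record["title"].split()),
        "abstract": " ".join(record["abstract"].split()),
    }


def preprocess(in_path: Path, out_path: Path) -> int:
    """Read a JSON Lines file and write a JSON array of slimmed records.

    Returns the number of documents written.
    """
    documents = []
    with in_path.open(encoding="utf-8") as src:
        for line in src:
            if line.strip():  # skip blank lines
                documents.append(slim_record(json.loads(line)))
    with out_path.open("w", encoding="utf-8") as dst:
        json.dump(documents, dst)
    return len(documents)
```

Under these assumptions, `preprocess(Path("arxiv-metadata-oai-snapshot.json"), Path("documents.json"))` would produce the smaller file. Dropping the unused metadata fields is what would make the output "a bit smaller".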

`cli.py` could be useful, but at the moment it only exposes `meilisearch_index` and `meilisearch_client`.
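
For the final query step, it may help to see how retrieved abstracts can be turned into a prompt. The sketch below is not taken from `query.py`. It assumes that keyword hits (from Meilisearch) and vector hits (from Qdrant) arrive as ranked lists of dicts with `id`, `title`, and `abstract` keys, interleaves them, and builds a plain-text prompt. The names `merge_hits` and `build_prompt`, the interleaving strategy, and the prompt wording are all hypothetical.

```python
from __future__ import annotations

from itertools import zip_longest


def merge_hits(keyword_hits: list[dict], vector_hits: list[dict], limit: int = 5) -> list[dict]:
    """Interleave two ranked hit lists, dropping duplicates by document id."""
    merged, seen = [], set()
    for pair in zip_longest(keyword_hits, vector_hits):
        for hit in pair:
            # zip_longest pads the shorter list with None.
            if hit is not None and hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged[:limit]


def build_prompt(question: str, hits: list[dict]) -> str:
    """Format the retrieved titles and abstracts as numbered context for an LLM."""
    context = "\n\n".join(
        f"[{i}] {hit['title']}\n{hit['abstract']}" for i, hit in enumerate(hits, start=1)
    )
    return (
        "Answer the question using only the abstracts below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The resulting string would then be sent to whichever model the key in the secrets file belongs to. Interleaving is only one simple way to combine two rankings; the real script may use a single source or a different fusion method.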