Retrieval-Augmented Generation: FastAPI - Celery - Horizontal Scaling

Project Logo

This project provides:

  1. A complete solution for building an enterprise RAG system on your own data, without costly pre-training or fine-tuning.
  2. Self-hosting: you can set it up on your on-premises servers or with other cloud providers.
  3. Use of both GPU and CPU to improve performance.
  4. Multiple deployment options: k8s, AWS, Docker, and horizontal scaling.

Tech stack

  • RAG architecture.
  • FastAPI: the main thread handles incoming requests.
  • Celery: workers embed documents and store them in the VectorDB.
  • Qdrant: the VectorDB that stores the embeddings.
  • Vietnamese sentence-transformer model (sup-SimCSE-VietNamese-phobert-base).
  • Docker, K8S, AWS.

Pricing Comparison: Self-Host vs. Cloud (OpenAI API)

| Feature | Self-Host | Cloud (OpenAI API) |
| --- | --- | --- |
| Initial setup cost | High (hardware, setup, maintenance) | Low (pay-as-you-go) |
| Operational cost | Variable (electricity, upkeep) | Fixed (subscription/usage fees) |
| Scalability | Requires manual scaling | Automatic scaling |
| Performance | Customizable (depends on hardware) | Optimized by provider |
| Data privacy | High (data stays on-premises) | Variable (depends on provider) |
| Maintenance | Requires in-house team | Managed by provider |
| Flexibility | High (full control) | Limited (provider constraints) |

Minimum hardware requirements:

To run this project efficiently, the following minimum hardware is recommended:

  • CPU: >= 2 cores
  • RAM: >= 4 GB
  • GPU: NVIDIA GPU with >= 2 GB VRAM (optional, for GPU acceleration)
  • Storage: >= 6 GB SSD (holds the source code and database)

Workflow:


  1. Data Ingestion: Documents are ingested and sent to the Celery worker.
  2. Embedding Creation: The Celery worker uses the Vietnamese sentence-transformer to create embeddings for the documents.
  3. Storage: The embeddings are stored in Qdrant VectorDB.
  4. Query Handling: FastAPI handles incoming queries and retrieves the most relevant embeddings from Qdrant VectorDB.
  5. Response Generation: The documents retrieved this way are used to generate accurate and contextually appropriate responses.

This workflow ensures efficient processing and retrieval of information, leveraging both GPU and CPU resources for optimal performance. A minimal sketch of the pipeline follows.
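
The sketch below ties the five steps together. It is a minimal illustration, not the repository's actual code: the route paths, the collection name, the local Redis/Qdrant URLs, and the Hugging Face repo id VoVanPhuc/sup-SimCSE-VietNamese-phobert-base are all assumptions.

```python
# Minimal RAG ingestion/query pipeline sketch (illustrative names throughout).
from celery import Celery
from fastapi import FastAPI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

celery_app = Celery("worker", broker="redis://localhost:6379/0")  # assumed Redis broker
qdrant = QdrantClient(url="http://localhost:6333")                # assumed Qdrant instance
model = SentenceTransformer("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")  # assumed repo id

COLLECTION = "documents"
qdrant.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # PhoBERT outputs 768-d vectors
)

@celery_app.task
def embed_and_store(doc_id: int, text: str) -> None:
    """Steps 1-3: embed a document off the request thread and store it in Qdrant."""
    vector = model.encode(text).tolist()
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(id=doc_id, vector=vector, payload={"text": text})],
    )

app = FastAPI()

@app.post("/documents")
def ingest(doc_id: int, text: str):
    embed_and_store.delay(doc_id, text)  # hand the heavy work to a Celery worker
    return {"status": "queued"}

@app.get("/search")
def search(q: str, k: int = 3):
    """Steps 4-5: embed the query and retrieve the most similar stored documents."""
    hits = qdrant.search(
        collection_name=COLLECTION,
        query_vector=model.encode(q).tolist(),
        limit=k,
    )
    return [{"score": h.score, "text": h.payload["text"]} for h in hits]
```

Keeping the embedding inside the Celery task is what lets the system scale horizontally: the FastAPI thread only enqueues work, so more workers can be added without touching the API layer.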

Run Locally

Run in a dev container

To run this project on your local machine, follow these steps:

  • Make sure Docker and the VS Code Dev Containers extension are installed on your PC.
  • Press Ctrl + Shift + P, then search for "Reopen in Container" (wait for the image to download, then run the following steps).

  1. Install dependencies (takes about 10 minutes):

  pip3 install -r requirements.txt

  2. Start the FastAPI server. By default the application will run at http://localhost:8000/docs:

  uvicorn main:app --reload

  3. Start the Celery worker. It connects to the Redis container inside the dev container:

  celery -A worker.celery_worker.celery_app worker --loglevel=info

These commands will set up the necessary environment and start the services required for the project to function locally.

  4. Example API call (see the client sketch below).
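
Assuming the hypothetical /documents and /search routes from the workflow sketch above, a quick smoke test might look like this:

```python
# Hypothetical client calls; adjust paths and parameters to your actual API.
import requests

BASE = "http://localhost:8000"
requests.post(f"{BASE}/documents", params={"doc_id": 1, "text": "Hà Nội là thủ đô của Việt Nam."})
print(requests.get(f"{BASE}/search", params={"q": "thủ đô Việt Nam", "k": 3}).json())
```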

Setup cluster:

Docker:

K8S:

AWS:

See our benchmark:

Concepts:

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based methods to improve the quality and relevance of generated text. It retrieves relevant documents from a knowledge base and uses them to generate more accurate and contextually appropriate responses.
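
In code, the idea reduces to "retrieve, then generate". Here is a schematic sketch, where retriever and generator are placeholder callables rather than anything from this repository:

```python
# Schematic RAG loop: fetch relevant context, then condition generation on it.
def rag_answer(query: str, retriever, generator, k: int = 3) -> str:
    passages = retriever(query, k)           # k most relevant passages from the vector store
    context = "\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generator(prompt)                 # any text-generation model
```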

Sentence transformer:

A sentence transformer is a type of model that converts sentences into dense vector representations. These vectors can then be used for tasks such as semantic search, clustering, and classification. In this project, we use the Vietnamese sentence-transformer sup-SimCSE-VietNamese-phobert-base to embed documents into vectors for efficient retrieval and processing.
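
A minimal usage sketch; the Hugging Face repo id is an assumption, so substitute the checkpoint you actually deploy:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")  # assumed repo id
vectors = model.encode(["Hà Nội là thủ đô của Việt Nam.", "Tôi thích ăn phở."])
print(vectors.shape)  # (2, 768): one 768-dimensional vector per sentence
```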

Vector embedding:

Vector embedding is the process of converting text into numerical vectors that capture its semantic meaning. In this project, we use the Vietnamese sentence-transformer (sup-SimCSE-VietNamese-phobert-base) to create embeddings for documents, which are then stored in Qdrant VectorDB for efficient retrieval.

Retrieval:

Retrieval involves searching for and fetching relevant documents from a knowledge base. In this project, we use the embeddings stored in Qdrant VectorDB to quickly find documents that are most relevant to the input query, improving the accuracy and relevance of the generated responses.
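
A retrieval-only sketch against Qdrant, reusing the assumed collection name, service URL, and model from the workflow example:

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(url="http://localhost:6333")  # assumed local instance
model = SentenceTransformer("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")

query_vector = model.encode("Thủ đô của Việt Nam là gì?").tolist()
hits = client.search(collection_name="documents", query_vector=query_vector, limit=3)
for hit in hits:
    print(hit.score, hit.payload.get("text"))  # similarity score and stored payload
```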

Similarity search/vector search

Similarity search (vector search) aims to find vectors that are close together in a high-dimensional space. For example, two pieces of similar text passed through an embedding model should have a high similarity score, whereas two pieces of text about different topics will have a lower one. Common similarity measures are the dot product and cosine similarity.
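
For intuition, here is how the two common scores behave on toy vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of the magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([0.9, 0.1, 1.1])   # points almost the same way as a
c = np.array([-1.0, 1.0, 0.0])  # a different "topic"

print(cosine_similarity(a, b))  # ~0.99 -> very similar
print(cosine_similarity(a, c))  # -0.5  -> dissimilar
print(np.dot(a, b))             # 2.0, the raw dot-product score
```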

Resources:
