RAGTube is a Retrieval Augmented Generation (RAG) application that lets you query transcriptions from your favorite YouTubers. It retrieves the most relevant transcription chunks for your query and provides them as context to a large language model (LLM) to generate responses.
Showcase: https://ragtube.dgarridoa.xyz/
• Database & Vector Storage: A PostgreSQL database persists channels, videos, transcriptions, and chunks. The pgvector extension enables vector similarity search, supporting both exact and approximate nearest neighbor search. The SQLModel ORM is used to interact with the database (see the sketch after this list).
• Ollama: Serves a local embedding model and language model. It generates embeddings for text (transcription chunks and queries) and generates responses based on the query and the contextual information provided by the retriever.
• Backend: Built with FastAPI, the backend exposes endpoints for readiness checks, listing channels, and handling RAG queries. The API communicates with both the database (to fetch data) and Ollama (to generate embeddings and chat completions). The RAG endpoint retrieves the most relevant chunks, reranks them with FlashRank, filters them, and finally passes them to the chat model to generate a response.
• Frontend: The application’s user interface is built with Streamlit, which interacts with the FastAPI backend. It provides users with a way to ask questions and view search results and generated responses.
• CLI and Database Tools: A Typer-based command line interface is provided for database population. It downloads transcriptions, splits them into chunks, computes their embeddings, and creates the HNSW index if it does not exist.
• Containerization & Deployment: Docker Compose orchestrates the entire application stack, including the PostgreSQL database, Ollama (model serving), FastAPI (backend), Streamlit (frontend), and additional services such as Watchtower for automated rolling releases and Traefik as a reverse proxy with SSL handling.
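As a rough illustration of how SQLModel and pgvector fit together, a chunk table with an embedding column might look like the sketch below. The column names and embedding size are assumptions for illustration; the project's actual schemas live in the models module.

# Illustrative sketch only; the real schema is defined in ragtube's models module.
from pgvector.sqlalchemy import Vector
from sqlalchemy import Column
from sqlmodel import Field, Session, SQLModel, create_engine, select


class Chunk(SQLModel, table=True):
    # Hypothetical columns; the real table also tracks video/channel metadata.
    id: int | None = Field(default=None, primary_key=True)
    video_id: str
    content: str
    # 1024 matches the embedding_size of a model such as bge-large.
    embedding: list[float] = Field(sa_column=Column(Vector(1024)))


engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")


def retrieve_chunks(query_embedding: list[float], k: int) -> list[Chunk]:
    """Approximate nearest neighbor search using pgvector's cosine distance."""
    with Session(engine) as session:
        statement = (
            select(Chunk)
            .order_by(Chunk.embedding.cosine_distance(query_embedding))
            .limit(k)
        )
        return list(session.exec(statement))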
sequenceDiagram
participant F as Frontend
participant B as Backend
participant O as Ollama
participant D as Database
F->>B: Send query
B->>O: Embed query
O->>B: Return query embedding
B->>D: Send query embedding
D->>B: Return nearest chunks
B->>B: Rerank and filter chunks
B->>O: Generate response
O->>B: Stream response
B->>F: Stream response
First, install uv, a Python package and project manager.
curl -LsSf https://astral.sh/uv/install.sh | sh
Then install Python 3.11 and pin it.
uv python install 3.11
uv python pin 3.11
Finally, install the project.
uv sync --all-extras
Application parameters are defined in the params.yaml file.
• channel_id: List of strings, the channel IDs to download transcriptions from.
• language: String, the language of the transcriptions.
• request_timeout: Integer, timeout in seconds for the HTTP requests that list a channel's videos and fetch a video's transcription.
• chunk_size: Integer, the maximum number of words in a chunk, used to split transcriptions into chunks.
• chunk_overlap: Integer, the number of words to overlap between chunks.
• embedding_size: Integer, the dimensionality of the embeddings produced by the chosen model; it determines the size of the embedding array column in the chunk table.
• embedding_model_name: String, the name of the model used to compute the embeddings; it must be a model supported by ollama.
• embedding_num_ctx: Integer, size of the context window used to generate the next token; it must not be greater than the model's maximum context window size.
• index_hnsw_m: Integer, the m parameter of the HNSW index.
• index_hnsw_ef_construction: Integer, the ef_construction parameter of the HNSW index.
• index_hnsw_ef_search: Integer, the ef_search parameter of the HNSW index.
• index_vector_ops: String, the name of the vector operations to use; it must be a vector operation supported by pgvector.
• results_to_retrieve: Integer, the number of approximate nearest neighbors to retrieve from the HNSW index.
• rerank_model_name: String, the name of the model used to rerank the results retrieved from the HNSW index; it must be a model supported by flashrank.
• rerank_score_threshold: Integer, the minimum rerank score required for a result from the HNSW index to be presented to the user.
• chat_model_name: String, the name of the model used to generate responses based on a provided question and its retrieved context; it must be a model supported by ollama.
• chat_temperature: Float, the temperature used to sample tokens from the chat model.
• chat_max_tokens: Integer, the maximum number of tokens to generate from the chat model.
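For reference, a minimal params.yaml might look like the following. The values are illustrative placeholders, not recommendations; tune them for your models and hardware.

# Illustrative values only.
channel_id:
  - UCSHZKyawb77ixDdsGog4iWA
language: en
request_timeout: 30
chunk_size: 256
chunk_overlap: 32
embedding_size: 1024  # must match the embedding model's output dimension
embedding_model_name: bge-large
embedding_num_ctx: 512
index_hnsw_m: 16
index_hnsw_ef_construction: 64
index_hnsw_ef_search: 40
index_vector_ops: vector_cosine_ops
results_to_retrieve: 20
rerank_model_name: ms-marco-MiniLM-L-12-v2
rerank_score_threshold: 0
chat_model_name: llama3.2:3b
chat_temperature: 0.1
chat_max_tokens: 512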
Ensure you have a .env file in the project's root directory. Obtain a YouTube API key; for more details, see this guide. Optionally, if you're running this in a CI/CD pipeline or on a cloud provider prone to bot challenges, you may need to set an HTTP proxy. The file should look like the following:
YOUTUBE_API_KEY=<KEY>
# HTTPS_PROXY=http://<user>:<password>@<host>:<port>
DB_USER=postgres
DB_PASSWORD=postgres
DB_HOST=postgres
DB_PORT=5432
DB_NAME=postgres
API_HOST=api
API_PORT=5000
To start a PostgreSQL container with the pgvector extension for vector store capabilities, run the following command:
docker run \
--name test-postgres \
-p 5432:5432 \
-d \
-e POSTGRES_HOST_AUTH_METHOD=trust \
pgvector/pgvector:pg16
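Optionally, you can verify that the extension can be created in the running container (assuming the default postgres user from the command above):

docker exec test-postgres psql -U postgres -c "CREATE EXTENSION IF NOT EXISTS vector;"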
Create a .env-test file like the .env file described in the previous section. Then run the following command:
uv run pytest
Note that the database must be running.
Install ollama, a tool that makes it easy to run open-source LLMs locally.
Execute run-ollama.sh to start a server, which exposes a REST API for interacting with downloaded models. It then downloads the models specified by the chat_model_name and embedding_model_name fields in the params.yaml file. For example:
ollama serve
ollama pull llama3.2:3b
ollama pull bge-large
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Why is the sky blue?",
"stream": false
}'
curl http://localhost:11434/api/embed -d '{
"model": "bge-large",
"input": "Llamas are members of the camelid family"
}'
To populate the database, set the channel_id field in the params.yaml file with the channels for which you wish to download transcriptions. Then run the Typer-based CLI with the following command:
uv run python -m ragtube.cli
This will create the channel, video, caption, and chunk tables; their schemas are defined in the models module. The CLI lists the videos of each channel, downloads transcriptions for missing videos, splits these transcriptions into smaller chunks, and computes their embeddings. Finally, an HNSW index is created if it does not already exist.
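The chunking step is driven by the chunk_size and chunk_overlap parameters described earlier. A minimal sketch of word-based splitting with overlap (not the project's exact implementation) looks like this:

def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split a transcription into windows of at most chunk_size words,
    with consecutive windows sharing chunk_overlap words."""
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


# Example: 9 words, chunk_size=5, chunk_overlap=2.
print(split_into_chunks("a b c d e f g h i", chunk_size=5, chunk_overlap=2))
# ['a b c d e', 'd e f g h', 'g h i']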
To start the backend, run the following command:
uv run uvicorn ragtube.api:app --host 0.0.0.0 --port 5000 --log-config log_config.yaml
The OpenAPI documentation can be found at http://localhost:5000/docs. It has three endpoints:
• /readiness: to verify the readiness of the API.
• /channel: to get the list of channels.
• /rag: to ask a question (input) to the RAG. Optionally, you can search within a specific channel by supplying its channel_id.
curl "http://localhost:5000/rag?input=What%20does%20Harpoon%3F&channel_id=UCSHZKyawb77ixDdsGog4iWA"
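Since the /rag endpoint streams its response, the same call from Python might look like the sketch below, assuming the response body is streamed as plain text:

import requests

params = {
    "input": "What does Harpoon?",
    "channel_id": "UCSHZKyawb77ixDdsGog4iWA",  # optional
}
# Stream the generated answer and print it as it arrives.
with requests.get("http://localhost:5000/rag", params=params, stream=True, timeout=60) as response:
    response.raise_for_status()
    for piece in response.iter_content(chunk_size=None, decode_unicode=True):
        print(piece, end="", flush=True)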
To start the frontend, run the following command:
uv run python -m streamlit run ragtube/ui.py
You can now access the application at http://localhost:8501.
The pull request will trigger the pipeline defined in the .github/workflows/onpr.yaml file. Its steps are: start a PostgreSQL container using Docker, check out the branch from the repository, create the .env-test file, set up Python 3.11, install uv, install the project dependencies, run pre-commit checks, execute the unit tests, and build and publish both the ragtube and ollama Docker images.
As prerequisites, you should have the following GitHub secrets set up in your repository: YOUTUBE_API_KEY, HTTPS_PROXY, and DOCKER_HUB_PASSWORD. Also set the DOCKER_HUB_USERNAME repository variable.
Docker images:
- https://hub.docker.com/repository/docker/dgarridoa/ragtube/general
- https://hub.docker.com/repository/docker/dgarridoa/ollama/general
Watchtower is used to automatically update the Docker containers when a new image is available. It is set up in the docker-compose.yaml file. When the CI/CD pipeline completes, the containers that use the ragtube and ollama images are updated.
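A minimal sketch of how such a Watchtower service is typically declared is shown below; the project's actual configuration lives in its docker-compose.yaml.

# Illustrative sketch; see the repository's docker-compose.yaml for the real setup.
services:
  watchtower:
    image: containrrr/watchtower
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    command: --interval 300 --cleanup  # poll every 5 minutes, remove old images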
Cloudflare Tunnel is used to route traffic from a hostname to a local machine without exposing its public IP. First, install the Cloudflare CLI (cloudflared) and create a tunnel. Then assign a hostname to the tunnel and run it. For example:
brew install cloudflare/cloudflare/cloudflared
cloudflared tunnel create <tunnel-name>
cloudflared tunnel route dns <tunnel-name> <hostname>
cloudflared tunnel run <tunnel-name>
Set up the ingress in the ~/.cloudflared/config.yml file as shown below:
tunnel: <tunnel-id>
credentials-file: ~/.cloudflared/<tunnel-id>.json
ingress:
- hostname: <hostname>
service: http://localhost:8501
- service: http_status:404
Traefik is used as a reverse proxy and to manage SSL certificates. It is set up in the docker-compose.yaml file. It listens on ports 80 and 443 and routes requests to the ui container. The SSL certificates are automatically generated by Let's Encrypt, with Cloudflare serving as the certificate resolver.
The specific Traefik configuration is in the traefik.yaml file. The TRAEFIK_CERTIFICATESRESOLVERS_CLOUDFLARE_ACME_EMAIL (your email), CF_DNS_API_TOKEN (your Cloudflare API token), and UI_HOST (your hostname) environment variables must be added to the .env file.
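As an illustration, routing for the ui container is typically declared with Traefik labels like the following sketch; the project's exact labels are in its docker-compose.yaml.

# Illustrative labels on the ui service; not the project's exact configuration.
services:
  ui:
    labels:
      - traefik.enable=true
      - traefik.http.routers.ui.rule=Host(`${UI_HOST}`)
      - traefik.http.routers.ui.entrypoints=websecure
      - traefik.http.routers.ui.tls.certresolver=cloudflare
      - traefik.http.services.ui.loadbalancer.server.port=8501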
Run the following command to start the services:
docker compose up
It will start the following containers:
• traefik: reverse proxy.
• watchtower: rolling releases.
• ragtube-db: database.
• ragtube-db-init: populates the database and then stops.
• ragtube-api: backend.
• ragtube-ui: frontend.
Docker does not support the Metal API, but Ollama supports GPU acceleration on macOS. To use Ollama with GPU acceleration, run it natively by following the instructions in the Ollama section. Then execute the following command:
docker compose -f docker-compose-mac.yaml up