Commit 214b33c

dminnear-rh authored and gaurav-nelson committed

add docs for rag-llm-cpu pattern

1 parent 5d43043 commit 214b33c

File tree: 7 files changed, +460 -0 lines

Lines changed: 72 additions & 0 deletions

---
title: RAG LLM chatbot on CPU
date: 2025-10-24
tier: sandbox
summary: This pattern deploys a CPU-based LLM, your choice of several RAG DB providers, and a simple chatbot UI that exposes the configuration and results of the RAG queries.
rh_products:
  - Red Hat OpenShift Container Platform
  - Red Hat OpenShift GitOps
  - Red Hat OpenShift AI
partners:
  - Microsoft
industries:
  - General
aliases: /rag-llm-cpu/
links:
  github: https://github.com/validatedpatterns-sandbox/rag-llm-cpu
  install: getting-started
  bugs: https://github.com/validatedpatterns-sandbox/rag-llm-cpu/issues
  feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
---

# **CPU-based RAG LLM chatbot**

## **Introduction**

This Validated Pattern deploys a Retrieval-Augmented Generation (RAG) chatbot on Red Hat OpenShift by using Red Hat OpenShift AI. The pattern runs entirely on CPU nodes without requiring GPU hardware, making it a cost-effective and accessible solution for environments where GPU resources are limited or unavailable.

It provides a secure, flexible, and production-ready starting point for building and deploying on-premises generative AI applications.

## **Target audience**

This pattern is designed for:

- **Developers and data scientists** looking to build and experiment with RAG-based LLM applications.
- **MLOps and DevOps engineers** responsible for deploying and managing AI/ML workloads on OpenShift.
- **Architects** evaluating cost-effective methods for delivering generative AI capabilities on premises.

## **Why use this pattern?**

- **Cost-effective:** Runs entirely on CPU, removing the need for expensive and often scarce GPU resources.
- **Flexible:** Supports multiple vector database backends (Elasticsearch, PGVector, Microsoft SQL Server) to integrate with your existing data infrastructure.
- **Transparent:** The Gradio front end exposes the internals of the RAG query and LLM prompts, giving you clear insight into the generation process.
- **Extensible:** Built on open source standards (KServe, OpenAI-compatible API) to serve as a robust foundation for more complex applications.

## **Architecture overview**

At a high level, the components work together as follows (a minimal code sketch of this flow appears after Figure 1):

1. A user enters a query into the **Gradio UI**.
2. The backend application, using **LangChain**, first queries a configured **vector database** to retrieve relevant documents (the "R" in RAG).
3. These documents are combined with the user's original query into a prompt.
4. The prompt is sent to the **KServe-deployed LLM** (running via llama.cpp on a CPU node).
5. The LLM generates a response, which is streamed back to the Gradio UI for the user.
6. **Vault** securely provides the necessary credentials for the vector database and HuggingFace token at runtime.

![Overview](/images/rag-llm-cpu/rag-augmented-query.png)

_Figure 1. Overview of a RAG query from the user's perspective._
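
To make the flow concrete, here is a minimal sketch of the same round trip written against the LLM's OpenAI-compatible endpoint. It is illustrative only: the in-cluster URL matches the default `cpu-inference-service` predictor, the model name is an assumption, and `retrieve_documents()` is a hypothetical stand-in for the LangChain vector-store lookup.

```python
# Illustrative sketch only, not the pattern's front-end code.
from openai import OpenAI

# Assumed in-cluster endpoint of the default KServe predictor.
client = OpenAI(base_url="http://cpu-inference-service-predictor/v1", api_key="none")

def retrieve_documents(query: str) -> list[str]:
    """Hypothetical stand-in for the LangChain vector-store lookup (the "R" in RAG)."""
    return ["<chunk 1 from the vector DB>", "<chunk 2 from the vector DB>"]

question = "How do I install this pattern?"
context = "\n\n".join(retrieve_documents(question))
prompt = f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"

# Stream the completion back, as the Gradio UI does. The model name is an assumption;
# list the served models via client.models.list() to find the real one.
stream = client.chat.completions.create(
    model="mistral-7b-instruct-v0.2.Q5_0.gguf",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```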

## **Prerequisites**

Before you begin, ensure you have access to the following:

- A Red Hat OpenShift cluster (version 4.x), with a recommended size of at least two `m5.4xlarge` nodes.
- A HuggingFace API token.
- Command-line tools: Podman.

## **What this pattern provides**

- A [KServe](https://github.com/kserve/kserve)-based LLM deployed to [RHOAI](https://www.redhat.com/en/products/ai/openshift-ai) that runs entirely on a CPU node with a [llama.cpp](https://github.com/ggml-org/llama.cpp) runtime.
- A choice of one (or multiple) vector DB providers to serve as a RAG backend, with configurable web-based or Git repository-based sources. Vector embedding and document retrieval are implemented with [LangChain](https://docs.langchain.com/oss/python/langchain/overview).
- [Vault](https://developer.hashicorp.com/vault)-based secret management for the HuggingFace API token and credentials for supported databases ([Elasticsearch](https://www.elastic.co/docs/solutions/search/vector), [PGVector](https://github.com/pgvector/pgvector), [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/sql-server/ai/vectors?view=sql-server-ver17)).
- A [Gradio](https://www.gradio.app/)-based front end for connecting to multiple [OpenAI API-compatible](https://github.com/openai/openai-openapi) LLMs that exposes the internals of the RAG query and LLM prompts so that users have better insight into what is running.

Lines changed: 301 additions & 0 deletions

---
title: Configuring the pattern
weight: 20
aliases: /rag-llm-cpu/configure/
---

# **Configuring the pattern**

This guide covers common customizations, such as changing the default LLM, adding new models, and configuring RAG data sources.
We assume you have already completed the [Getting Started](/rag-llm-cpu/getting-started/) guide.

## **How configuration works**

This pattern is managed by ArgoCD (GitOps). All application configurations are defined in `values-prod.yaml`.
To customize a component, you will typically:

1. **Enable an override:** In `values-prod.yaml`, find the application you want to change (e.g., `llm-inference-service`) and add an `extraValueFiles:` entry pointing to a new override file (e.g., `$patternref/overrides/llm-inference-service.yaml`).
2. **Create the override file:** Create the new `.yaml` file inside the `/overrides` directory.
3. **Add your settings:** Add _only_ the specific values you want to change into this new file.
4. **Commit and sync:** Commit your changes and let ArgoCD sync the application.

## **Task: Change the default LLM**

By default, the pattern deploys the `mistral-7b-instruct-v0.2.Q5_0.gguf` model. You might want to change this to a different model (e.g., a different quantization) or adjust its resource usage.
You can do this by creating an override file for the _existing_ `llm-inference-service` application.

1. **Enable the override:**
   In `values-prod.yaml`, update the `llm-inference-service` application to use an override file:

   ```yaml
   clusterGroup:
     # ...
     applications:
       # ...
       llm-inference-service:
         name: llm-inference-service
         namespace: rag-llm-cpu
         chart: llm-inference-service
         chartVersion: 0.3.*
         extraValueFiles: # <-- ADD THIS BLOCK
           - $patternref/overrides/llm-inference-service.yaml
   ```

2. **Create the override file:**
   Create a new file `overrides/llm-inference-service.yaml`. Here is an example that switches to a different model file (Q8_0) and increases the CPU/memory requests:

   ```yaml
   inferenceService:
     resources: # <-- Increased allocated resources
       requests:
         cpu: "8"
         memory: 12Gi
       limits:
         cpu: "12"
         memory: 24Gi

   servingRuntime:
     args:
       - --model
       - /models/mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed model file

   model:
     repository: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
     files:
       - mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed file to download
   ```

## **Task: Add a second LLM**

You can also deploy an entirely separate, second LLM and add it to the demo user interface (UI). This example deploys a different runtime, HuggingFace TGI, instead of `llama.cpp`.

This is a two-step process:

1. Deploy the new LLM.
2. Tell the front end UI about it.

### **Step 1: Deploy the new LLM service**

1. **Define the new application:**
   In `values-prod.yaml`, add a new application to the `applications` list. We'll call it `another-llm-inference-service`.

   ```yaml
   clusterGroup:
     # ...
     applications:
       # ...
       another-llm-inference-service: # <-- ADD THIS NEW APPLICATION
         name: another-llm-inference-service
         namespace: rag-llm-cpu
         chart: llm-inference-service
         chartVersion: 0.3.*
         extraValueFiles:
           - $patternref/overrides/another-llm-inference-service.yaml
   ```

2. **Create the override file:**
   Create the new file `overrides/another-llm-inference-service.yaml`. This file needs to define the new model and disable resource creation, such as secrets, that the first LLM already created.

   ```yaml
   dsc:
     initialize: false
   externalSecret:
     create: false

   # Define the new InferenceService
   inferenceService:
     name: hf-inference-service # <-- New service name
     minReplicas: 1
     maxReplicas: 1
     resources:
       requests:
         cpu: "8"
         memory: 32Gi
       limits:
         cpu: "12"
         memory: 32Gi

   # Define the new runtime (HuggingFace TGI)
   servingRuntime:
     name: hf-runtime
     port: 8080
     image: docker.io/kserve/huggingfaceserver:latest
     modelFormat: huggingface
     args:
       - --model_dir
       - /models
       - --model_name
       - /models/Mistral-7B-Instruct-v0.3
       - --http_port
       - "8080"

   # Define the new model to download
   model:
     repository: mistralai/Mistral-7B-Instruct-v0.3
     files:
       - generation_config.json
       - config.json
       - model.safetensors.index.json
       - model-00001-of-00003.safetensors
       - model-00002-of-00003.safetensors
       - model-00003-of-00003.safetensors
       - tokenizer.model
       - tokenizer.json
       - tokenizer_config.json
   ```

> **Warning:** There is currently a bug in the model-downloading container that requires you to explicitly list _all_ files you want to download from the HuggingFace repository. Make sure you list every file needed for the model to run.

### **Step 2: Add the new LLM to the demo UI**

Now, tell the front end that this new LLM exists.

1. **Edit the front end overrides:**
   Open `overrides/rag-llm-frontend-values.yaml` (this file should already exist from the initial setup).
2. **Update `LLM_URLS`:**
   Add the URL of your new service to the `LLM_URLS` environment variable. The URL follows the format `http://<service-name>-predictor/v1` (or `http://<service-name>-predictor/openai/v1` for the HF runtime).

   In `overrides/rag-llm-frontend-values.yaml`:

   ```yaml
   env:
     # ...
     - name: LLM_URLS
       value: '["http://cpu-inference-service-predictor/v1","http://hf-inference-service-predictor/openai/v1"]'
   ```
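
After ArgoCD syncs, you may want to confirm that the new predictor actually answers OpenAI-style requests before relying on it in the UI. The snippet below is a rough sanity check, not part of the pattern; run it from a pod inside the cluster, since the assumed URL (mirroring the `LLM_URLS` entry above) is cluster-internal.

```python
# Rough sanity check (a sketch): confirm the second LLM responds to chat requests.
from openai import OpenAI

# Assumed in-cluster URL for the HF runtime (matches the LLM_URLS entry above).
client = OpenAI(base_url="http://hf-inference-service-predictor/openai/v1", api_key="none")

model_id = client.models.list().data[0].id  # discover the served model name
reply = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
print(model_id, "->", reply.choices[0].message.content)
```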

## **Task: Customize RAG data sources**

By default, the pattern loads data from the Validated Patterns documentation. You can change this to point to your own public Git repositories or web pages.

1. **Edit the Vector DB overrides:**
   Open `overrides/vector-db-values.yaml` (this file should already exist).
2. **Update sources:**
   Modify the `repoSources` and `webSources` keys. You can add any publicly available Git repository (using globs to filter files) or public web URLs. The job will also process PDFs from `webSources`.

   In `overrides/vector-db-values.yaml`:

   ```yaml
   providers:
     qdrant:
       enabled: true
     mssql:
       enabled: true

   vectorEmbedJob:
     repoSources:
       - repo: https://github.com/your-org/your-docs.git # <-- Your repo
         globs:
           - "**/*.md"
     webSources:
       - https://your-company.com/product-manual.pdf # <-- Your PDF
     chunking:
       size: 4096
   ```
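
For context, the embedding job's behavior can be approximated with a few LangChain calls. The sketch below is illustrative only and is not the pattern's actual job code: it loads one hypothetical web source, splits it using the `chunking.size` shown above, and embeds the chunks with the same sentence-transformers model the front end providers reference.

```python
# Illustrative sketch of what the vector-embed job does with the sources above.
# The URL is a hypothetical web source; PDFs and Git repos would need their own loaders.
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

docs = WebBaseLoader("https://your-company.com/product-manual.html").load()

# Split documents to match the `chunking.size` value from the override file.
splitter = RecursiveCharacterTextSplitter(chunk_size=4096, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed every chunk with the model the front end's providers are configured to use.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectors = embeddings.embed_documents([c.page_content for c in chunks])
print(f"embedded {len(chunks)} chunks; vector dimension = {len(vectors[0])}")
```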

## **Task: Add a new RAG database provider**

By default, the pattern enables _qdrant_ and _mssql_. You can also enable _redis_, _pgvector_ (Postgres), or _elastic_ (Elasticsearch).

This is a three-step process: (1) add secrets, (2) enable the DB, and (3) tell the front end UI.

### **Step 1: Update your secrets file**

If your new DB requires credentials (like _pgvector_ or _elastic_), add them to your main secrets file:

```sh
vim ~/values-secret-rag-llm-cpu.yaml
```

Add the necessary credentials. For example:

```yaml
secrets:
  # ...
  - name: pgvector
    fields:
      - name: user
        value: user # <-- Update the user
      - name: password
        value: password # <-- Update the password
      - name: db
        value: db # <-- Update the db
```

> **Note:** Refer to [`values-secret.yaml.template`](https://github.com/validatedpatterns-sandbox/rag-llm-cpu/blob/main/values-secret.yaml.template) for a reference of which values are expected.

### **Step 2: Enable the provider in the Vector DB chart**

Edit `overrides/vector-db-values.yaml` and set `enabled: true` for the provider(s) you want to add.

In `overrides/vector-db-values.yaml`:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true
  pgvector: # <-- ADD THIS
    enabled: true
  elastic: # <-- OR THIS
    enabled: true
```

### **Step 3: Add the provider to the demo UI**

Finally, edit `overrides/rag-llm-frontend-values.yaml` to configure the UI. You must:

1. Add the new provider's secrets to the `dbProvidersSecret.vault` list.
2. Add the new provider's connection details to the `dbProvidersSecret.providers` list.

Below is a complete example showing configuration for the non-default RAG DB providers.

In `overrides/rag-llm-frontend-values.yaml`:

```yaml
dbProvidersSecret:
  vault:
    - key: mssql
      field: sapassword
    - key: pgvector # <-- Add this block
      field: user
    - key: pgvector
      field: password
    - key: pgvector
      field: db
    - key: elastic # <-- Add this block
      field: user
    - key: elastic
      field: password
  providers:
    - type: qdrant # <-- Example for Qdrant
      collection: docs
      url: http://qdrant-service:6333
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: mssql # <-- Example for MSSQL
      table: docs
      connection_string: >-
        Driver={ODBC Driver 18 for SQL Server};
        Server=mssql-service,1433;
        Database=embeddings;
        UID=sa;
        PWD={{ .mssql_sapassword }};
        TrustServerCertificate=yes;
        Encrypt=no;
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: redis # <-- Example for Redis
      index: docs
      url: redis://redis-service:6379
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: elastic # <-- Example for Elastic
      index: docs
      url: http://elastic-service:9200
      user: "{{ .elastic_user }}"
      password: "{{ .elastic_password }}"
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: pgvector # <-- Example for PGVector
      collection: docs
      url: >-
        postgresql+psycopg://{{ .pgvector_user }}:{{ .pgvector_password }}@pgvector-service:5432/{{ .pgvector_db }}
      embedding_model: sentence-transformers/all-mpnet-base-v2
```
