This repository demonstrates a Retrieval-Augmented Generation (RAG) agent that uses LLaMA3 for local question answering with fallback and self-correction mechanisms. The project integrates several RAG strategies from key papers and runs LLaMA3 locally to grade retrieved documents and generate responses.
- Project Overview
- Installation
- Environment Setup
- Model Loading
- Document Loading and Indexing
- Retrieval Grading
- Response Generation
- Hallucination Grading
- Answer Grading
- Routing Mechanism
- Web Search Fallback
- Graph Workflow
This project implements a Retrieval-Augmented Generation (RAG) agent using LLaMA3 as the local language model. The system is capable of routing questions to either local document retrieval or fallback to web search, performing hallucination checks, and grading the relevance of responses. It is structured based on papers such as Adaptive RAG, Corrective RAG, and Self-RAG.
Key Features:
- Adaptive question routing.
- Document retrieval from vector stores.
- Hallucination detection and correction.
- Grading of response relevance.
First, install the required packages (the vector store, text splitter, graph workflow, and Google search wrapper used below also need chromadb, langchain-text-splitters, langgraph, and google-api-python-client):
pip install langchain-nomic langchain_community tiktoken langchainhub langchain-text-splitters langgraph chromadb google-api-python-client
For the embedding and language models, install the Ollama runtime (https://ollama.com) along with its Python client, then pull LLaMA3 and the Nomic embedding model:
pip install ollama
ollama pull llama3
ollama pull nomic-embed-text
Before setting up the Retrieval-Augmented Generation (RAG) agent using LLaMA3, ensure that the following prerequisites are met:
Create a secrets.json file in the root directory of your project. It should look something like this:
{
"LANGCHAIN_API_KEY": "your_langchain_api_key",
"GOOGLE_API_KEY": "your_google_api_key",
"GOOGLE_CSE_ID": "your_google_cse_id"
}
This file stores the API keys the project needs: the Langchain API key, the Google Custom Search API key, and the Custom Search Engine (CSE) ID.
- Search Engine Configuration: Before using the Google Custom Search JSON API, you need to create and configure a Programmable Search Engine (CSE). You can start this process by visiting the Programmable Search Engine control panel.
- Configuration Options: Follow the tutorial to learn more about different configuration options, such as including specific sites or the entire web for your search engine.
- Locating Your Search Engine ID: After creating the search engine, visit the help center to locate your Search engine ID (CSE ID), which you'll add to the secrets.json file under the key "GOOGLE_CSE_ID".
- Obtaining an API Key: The Custom Search JSON API requires an API key for authentication. You can get your API key from the Google Developers page. After obtaining the API key, add it to the secrets.json file under the key "GOOGLE_API_KEY".
Next, sign up on Langchain Hub:
- Go to the Langchain Hub and sign up for an account. If you already have an account, simply log in.
- Access API Settings:
  - Once logged in, navigate to your profile or account settings.
  - Look for the API Key section, where you'll find an option to generate an API key.
- Generate Your API Key:
  - Click on the "Generate API Key" button.
  - A new API key will be generated. Make sure to copy and save this key, as you will need it to authenticate your requests when using Langchain services.
Set up your environment by storing the necessary API keys in a secrets.json file, and load them into the environment:
import os
import json

# Read the API keys from secrets.json and export them as environment variables
def get_secrets():
    with open('secrets.json') as secrets_file:
        return json.load(secrets_file)

secrets = get_secrets()
os.environ["LANGCHAIN_API_KEY"] = secrets.get("LANGCHAIN_API_KEY")
os.environ["GOOGLE_API_KEY"] = secrets.get("GOOGLE_API_KEY")
os.environ["GOOGLE_CSE_ID"] = secrets.get("GOOGLE_CSE_ID")
Enable tracing for debugging:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
Specify the LLaMA3 model to run locally via Ollama:
local_llm = "llama3"
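As an optional sanity check, you can instantiate the chat model and confirm that Ollama is serving it (this assumes the Ollama server is running on its default local port):
from langchain_community.chat_models import ChatOllama

# Quick smoke test against the locally served model
llm_check = ChatOllama(model=local_llm, temperature=0)
print(llm_check.invoke("Reply with the single word: ready").content)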
Retrieve and load documents from URLs and split them into chunks using RecursiveCharacterTextSplitter.
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

urls = [
    "https://example-url-1.com",
    "https://example-url-2.com"
]

# WebBaseLoader returns a list of documents per URL, so flatten before splitting
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [doc for sublist in docs for doc in sublist]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=0)
doc_splits = text_splitter.split_documents(docs_list)
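If you want to verify the split before indexing, a quick check of the chunk count and metadata (output depends on the URLs you load) looks like this:
# Each chunk keeps the source URL in its metadata
print(f"Loaded {len(doc_splits)} chunks")
print(doc_splits[0].metadata.get("source"))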
Embed the documents and add them to a Chroma Vector Store:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Embeddings are served by the local Ollama instance (nomic-embed-text model)
embedding_model = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")
vectorstore = Chroma.from_documents(doc_splits, embedding=embedding_model)
retriever = vectorstore.as_retriever()
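You can query the retriever directly to inspect what it returns for a topic; the query string below is just an illustration:
# Returns the chunks most similar to the query from the Chroma store
retrieved = retriever.invoke("agent memory")
for d in retrieved:
    print(d.metadata.get("source"), d.page_content[:80])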
Create a grading mechanism to check if retrieved documents are relevant to the query:
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

# JSON mode makes the grader return a machine-readable verdict
llm = ChatOllama(model=local_llm, format="json", temperature=0)

prompt = PromptTemplate(
    template="You are grading the relevance of a retrieved document to a user question.\nDocument: {document}\nQuestion: {question}\nReturn a JSON object with a single key 'score' whose value is 'yes' or 'no'.",
    input_variables=["document", "question"]
)
retrieval_grader = prompt | llm | JsonOutputParser()
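A quick usage example, reusing the retriever from above (the grader's verdict depends on your indexed documents):
question = "agent memory"
docs = retriever.invoke(question)
# Grade the first retrieved chunk against the question
print(retrieval_grader.invoke({"document": docs[0].page_content, "question": question}))
# Expected shape: {'score': 'yes'} or {'score': 'no'}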
Generate responses by combining retrieved context with the model:
from langchain_core.output_parsers import StrOutputParser

prompt = PromptTemplate(
    template="Use the following context to answer the question in at most three sentences.\nQuestion: {question}\nContext: {context}\nAnswer:",
    input_variables=["question", "context"]
)
# A separate, non-JSON model instance for free-form answer generation
llm_gen = ChatOllama(model=local_llm, temperature=0)
rag_chain = prompt | llm_gen | StrOutputParser()

docs = retriever.invoke("agent memory")
generation = rag_chain.invoke({"context": docs, "question": "agent memory"})
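Passing the raw Document list as context works because it is stringified into the prompt, but a small helper (a hypothetical format_docs, not part of the original code) keeps the prompt cleaner by joining just the page contents:
def format_docs(docs):
    # Concatenate the text of each retrieved chunk, separated by blank lines
    return "\n\n".join(d.page_content for d in docs)

generation = rag_chain.invoke({"context": format_docs(docs), "question": "agent memory"})
print(generation)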
Evaluate whether the generated answer is grounded in the provided facts:
hallucination_prompt = PromptTemplate(
    template="Here are the facts:\n{documents}\n\nHere is the answer:\n{generation}\n\nIs the answer grounded in the facts provided? Return a JSON object with a single key 'score' whose value is 'yes' or 'no'.",
    input_variables=["generation", "documents"]
)
hallucination_grader = hallucination_prompt | llm | JsonOutputParser()
result = hallucination_grader.invoke({"documents": docs, "generation": generation})
Assess if the answer sufficiently resolves the user’s question:
answer_prompt = PromptTemplate(
    template="Question: {question}\n\nAnswer: {generation}\n\nDoes the answer resolve the question? Return a JSON object with a single key 'score' whose value is 'yes' or 'no'.",
    input_variables=["generation", "question"]
)
answer_grader = answer_prompt | llm | JsonOutputParser()
result = answer_grader.invoke({"generation": generation, "question": "agent memory"})
Use a routing mechanism to decide if a query should use local documents or web search:
routing_prompt = PromptTemplate(
    template="You are routing a user question to either a vectorstore of indexed documents or web search.\nQuestion: {question}\nReturn a JSON object with a single key 'datasource' whose value is 'vectorstore' or 'web_search'.",
    input_variables=["question"]
)
question_router = routing_prompt | llm | JsonOutputParser()
result = question_router.invoke({"question": "agent memory"})
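The router's JSON output can then drive a simple branch; the 'datasource' key follows the prompt defined above:
# Route based on the structured verdict from the router chain
if result.get("datasource") == "web_search":
    print("Routing to web search")
else:
    print("Routing to vectorstore retrieval")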
Fallback to web search when documents are not relevant:
from langchain_community.utilities import GoogleSearchAPIWrapper

# Uses GOOGLE_API_KEY and GOOGLE_CSE_ID from the environment
search = GoogleSearchAPIWrapper(k=3)
web_results = search.results("agent memory", num_results=3)
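To feed web results through the same generation chain as local chunks, they can be wrapped as Document objects; this sketch assumes each result dict exposes the wrapper's 'snippet', 'title', and 'link' fields:
from langchain_core.documents import Document

# Wrap each search hit so it can flow through the existing RAG chain
web_docs = [
    Document(page_content=r.get("snippet", ""), metadata={"title": r.get("title"), "source": r.get("link")})
    for r in web_results
]
generation = rag_chain.invoke({"context": web_docs, "question": "agent memory"})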
Define a workflow for state management and control flow:
from langgraph.graph import StateGraph
workflow = StateGraph(GraphState)
# Define nodes
workflow.add_node("retrieve", retrieve)
workflow.add_node("generate", generate)
# Define edges and conditional logic
workflow.add_edge("retrieve", "generate")
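The snippet above is schematic: GraphState and the node functions are not defined there. Below is a minimal, illustrative sketch that assumes the retriever, rag_chain, question_router, and search objects from the earlier sections are in scope; node names, state fields, and the routing map are assumptions, and the hallucination and answer graders can be wired in as additional conditional edges in the same way:
from typing import List
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, END

class GraphState(TypedDict):
    # Shared state passed between nodes
    question: str
    generation: str
    documents: List[str]

def retrieve(state):
    # Pull the most relevant chunks from the Chroma retriever
    docs = retriever.invoke(state["question"])
    return {"documents": [d.page_content for d in docs], "question": state["question"]}

def web_search(state):
    # Fallback: fetch snippets from Google Custom Search
    results = search.results(state["question"], num_results=3)
    return {"documents": [r.get("snippet", "") for r in results], "question": state["question"]}

def generate(state):
    # Answer the question from whichever context was gathered
    context = "\n\n".join(state["documents"])
    answer = rag_chain.invoke({"context": context, "question": state["question"]})
    return {"generation": answer, "documents": state["documents"], "question": state["question"]}

def route_question(state):
    # Ask the router chain which data source to use
    source = question_router.invoke({"question": state["question"]})
    return "web_search" if source.get("datasource") == "web_search" else "retrieve"

workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("web_search", web_search)
workflow.add_node("generate", generate)
workflow.set_conditional_entry_point(route_question, {"web_search": "web_search", "retrieve": "retrieve"})
workflow.add_edge("retrieve", "generate")
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", END)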
Compile and execute the workflow:
app = workflow.compile()
inputs = {"question": "What are the types of agent memory?"}
for output in app.stream(inputs):
    print(output)
This project showcases the integration of a local RAG agent using LLaMA3, capable of retrieving, generating, and grading responses while leveraging web search as a fallback mechanism.