Legal RAG is a Retrieval-Augmented Generation system that answers questions about legal contracts and documents, grounding each answer in the ingested source material.
- PDF document ingestion with intelligent section detection
- Vector database storage with semantic search
- FastAPI endpoints for both human-readable and structured JSON responses
- Ollama integration for local LLM inference
- Proper citation of sources in responses
- Python 3.10+
- Ollama installed and running
- The following Ollama models:
  - llama3.2 (for text generation)
  - mxbai-embed-large (for embeddings)
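If these models are not already available locally, they can be pulled with the Ollama CLI (assuming the ollama command is on your PATH; model names match the configuration below):

# Pull the generation and embedding models
ollama pull llama3.2
ollama pull mxbai-embed-large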
# Create virtual environment
uv venv
# Activate virtual environment
source .venv/bin/activate # Linux/macOS
# or
.venv\Scripts\activate # Windows
# Install dependencies from project file
uv pip install -e .
# Or install from requirements.txt
uv add -r requirements.txt

Create a .env file in the project root:
CHROMA_DB_DIR=./chroma_db
CHROMA_COLLECTION_NAME=legal_contracts
OLLAMA_EMBED_MODEL=mxbai-embed-large
OLLAMA_LLM_MODEL=llama3.2
OLLAMA_API_URL=http://localhost:11434
API_HOST=0.0.0.0
API_PORT=8000
DEFAULT_TOP_K=3
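These settings are typically loaded at startup. As a rough sketch only (not the project's actual app/config.py; field names and defaults here are assumptions mirroring the .env example above), a pydantic-settings model could look like:

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Field names are assumed; defaults mirror the .env example above
    model_config = SettingsConfigDict(env_file=".env")

    chroma_db_dir: str = "./chroma_db"
    chroma_collection_name: str = "legal_contracts"
    ollama_embed_model: str = "mxbai-embed-large"
    ollama_llm_model: str = "llama3.2"
    ollama_api_url: str = "http://localhost:11434"
    api_host: str = "0.0.0.0"
    api_port: int = 8000
    default_top_k: int = 3

settings = Settings()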
# Ingest PDFs into the vector store
python scripts/ingest_pdfs.py --pdf-dir ./data/pdfs

# Start the API server
uvicorn main:app --reload --log-level=debug --host 0.0.0.0 --port 8000

# Query the text endpoint for a human-readable answer
curl -X POST "http://localhost:8000/query/text" \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the tax obligations in the asset purchase agreement?", "top_k": 3}'

# Query the JSON endpoint for a structured response
curl -X POST "http://localhost:8000/query/json" \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the tax obligations in the asset purchase agreement?", "top_k": 3}'

Project structure:

legalrag/
├── app/
│   ├── ingest/          # PDF ingestion and processing
│   ├── llm/             # LLM chain integration
│   ├── store/           # Vector database interface
│   ├── config.py        # Application settings
│   ├── schemas.py       # Pydantic models
│   └── main.py          # FastAPI application
├── data/
│   └── pdfs/            # PDFs to be ingested
├── scripts/
│   └── ingest_pdfs.py   # CLI tool for ingestion
└── main.py              # Application entry point
- PDF Parser (app/ingest/pdf_parser.py): Extracts text from PDFs and identifies logical sections.
- Chunker (app/ingest/chunker.py): Splits sections into smaller chunks for effective retrieval.
- Ingest Module (app/ingest/ingest.py): Coordinates the ingestion process.
- Chroma Store (app/store/chroma_store.py): Manages document embeddings and retrieval using ChromaDB.
- RAG Chain (app/llm/chain.py): Implements the retrieval-augmented generation pattern using Ollama (see the sketch after this list).
- FastAPI App (main.py): Provides endpoints for text and JSON responses.
- Schemas (app/schemas.py): Defines data models for requests and responses.
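To make the retrieve-then-generate flow concrete, here is a minimal sketch of the pattern the RAG Chain implements. It is an illustration under assumed internals, not the contents of app/llm/chain.py; it calls the chromadb client and Ollama's HTTP API directly, with model names and paths taken from the .env example above:

import requests
import chromadb

OLLAMA_URL = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # Embed the query with the configured Ollama embedding model
    resp = requests.post(f"{OLLAMA_URL}/api/embeddings",
                         json={"model": "mxbai-embed-large", "prompt": text})
    return resp.json()["embedding"]

def answer(query: str, top_k: int = 3) -> str:
    # Retrieve the most relevant chunks from the Chroma collection
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection("legal_contracts")
    results = collection.query(query_embeddings=[embed(query)], n_results=top_k)
    context = "\n\n".join(results["documents"][0])

    # Ask the LLM to answer using only the retrieved context
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    resp = requests.post(f"{OLLAMA_URL}/api/generate",
                         json={"model": "llama3.2", "prompt": prompt, "stream": False})
    return resp.json()["response"]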
- The system automatically identifies sections in legal documents
- Citations include document ID, section, and page numbers
- JSON responses include confidence scores and recommended next actions
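For illustration only, a structured JSON response might look roughly like the following; the exact field names are defined in app/schemas.py, and the values below are placeholders rather than real output:

{
  "answer": "...",
  "confidence": 0.87,
  "citations": [
    {"document_id": "contract_001", "section": "Section 8.2", "page": 14}
  ],
  "next_actions": ["Review the related indemnification provisions"]
}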
Evaluation and tracing are supported via Ragas and Langfuse.
If langfuse_tracing_enabled is set, tracing is enabled and each LLM input and response is stored in Langfuse along with its latency.
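Langfuse reads its credentials from its standard environment variables; a plausible addition to the .env file looks like the following (the tracing flag name is an assumption based on the setting mentioned above):

LANGFUSE_TRACING_ENABLED=true
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com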
To run an evaluation:
# generate dataset for eval
python3 eval_main.py --create-sample --sample-output sample_questions.json
# run the evaluation
python3 eval_main.py --dataset sample_questions.json --output evaluation_results.json --top-k 3
