Legal RAG - Legal Document Retrieval System

Legal RAG is a Retrieval-Augmented Generation system for legal documents. It answers questions about legal contracts and related documents, grounding each answer in the source passages it retrieves.

Features

  • PDF document ingestion with intelligent section detection
  • Vector database storage with semantic search
  • FastAPI endpoints for both human-readable and structured JSON responses
  • Ollama integration for local LLM inference
  • Proper citation of sources in responses

Installation

Prerequisites

  • Python 3.10+
  • Ollama installed and running
  • The following Ollama models:
    • llama3.2 (for text generation)
    • mxbai-embed-large (for embeddings)
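
If the models are not already available locally, pull them with the Ollama CLI:

ollama pull llama3.2
ollama pull mxbai-embed-large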

Setup with UV

# Create virtual environment
uv venv

# Activate virtual environment
source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate     # Windows

# Install dependencies from project file
uv pip install -e .

# Or install from requirements.txt
uv pip install -r requirements.txt

Configuration

Create a .env file in the project root:

CHROMA_DB_DIR=./chroma_db
CHROMA_COLLECTION_NAME=legal_contracts
OLLAMA_EMBED_MODEL=mxbai-embed-large
OLLAMA_LLM_MODEL=llama3.2
OLLAMA_API_URL=http://localhost:11434
API_HOST=0.0.0.0
API_PORT=8000
DEFAULT_TOP_K=3
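
app/config.py presumably loads these values at startup; below is a minimal sketch of such a loader, assuming pydantic-settings (the actual class lives in app/config.py):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    chroma_db_dir: str = "./chroma_db"
    chroma_collection_name: str = "legal_contracts"
    ollama_embed_model: str = "mxbai-embed-large"
    ollama_llm_model: str = "llama3.2"
    ollama_api_url: str = "http://localhost:11434"
    api_host: str = "0.0.0.0"
    api_port: int = 8000
    default_top_k: int = 3

settings = Settings()  # environment variables match field names case-insensitively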

Usage

1. Ingest PDF Documents

python scripts/ingest_pdfs.py --pdf-dir ./data/pdfs

2. Start the API Server

uvicorn main:app --reload --log-level=debug --host 0.0.0.0 --port 8000

3. Query the API

Text Endpoint

curl -X POST "http://localhost:8000/query/text" \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the tax obligations in the asset purchase agreement?", "top_k": 3}'

JSON Endpoint

curl -X POST "http://localhost:8000/query/json" \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the tax obligations in the asset purchase agreement?", "top_k": 3}'

Project Structure

legalrag/
├── app/
│   ├── ingest/          # PDF ingestion and processing
│   ├── llm/             # LLM chain integration
│   ├── store/           # Vector database interface
│   ├── config.py        # Application settings
│   ├── schemas.py       # Pydantic models
│   └── main.py          # FastAPI application
├── data/
│   └── pdfs/            # PDFs to be ingested
├── scripts/
│   └── ingest_pdfs.py   # CLI tool for ingestion
└── main.py              # Application entry point

Component Overview

Ingestion Pipeline

  • PDF Parser (app/ingest/pdf_parser.py): Extracts text from PDFs and identifies logical sections.
  • Chunker (app/ingest/chunker.py): Splits sections into smaller chunks for effective retrieval.
  • Ingest Module (app/ingest/ingest.py): Coordinates the ingestion process.
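
A rough sketch of how these three pieces might fit together (illustrative only; the function names and heading pattern are assumptions, and pypdf stands in for whatever parser the project actually uses):

import re
from pypdf import PdfReader

SECTION_RE = re.compile(r"^(SECTION|ARTICLE)\s+[\dIVX]+", re.IGNORECASE)  # assumed heading pattern

def parse_sections(pdf_path: str) -> list[dict]:
    """Extract text page by page, starting a new section at each heading match."""
    sections, current = [], {"title": "Preamble", "page": 1, "text": ""}
    for page_no, page in enumerate(PdfReader(pdf_path).pages, start=1):
        for line in (page.extract_text() or "").splitlines():
            if SECTION_RE.match(line.strip()):
                sections.append(current)
                current = {"title": line.strip(), "page": page_no, "text": ""}
            else:
                current["text"] += line + "\n"
    sections.append(current)
    return sections

def chunk_section(section: dict, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a section's text into overlapping character windows for retrieval."""
    text = section["text"]
    return [text[i:i + size] for i in range(0, max(len(text), 1), size - overlap)]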

Storage

  • Chroma Store (app/store/chroma_store.py): Manages document embeddings and retrieval using ChromaDB.
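
A sketch of the kind of ChromaDB interaction this wrapper performs (the Ollama embedding call is inlined here for clarity; helper names and sample data are assumptions):

import chromadb
import requests

def embed(text: str) -> list[float]:
    """Embed text via Ollama's /api/embeddings endpoint."""
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "mxbai-embed-large", "prompt": text},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["embedding"]

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("legal_contracts")

# Store a chunk together with its citation metadata (document ID, section, page).
chunk = "The Buyer shall assume all tax obligations ..."
collection.add(
    ids=["doc1-sec2-chunk0"],
    documents=[chunk],
    embeddings=[embed(chunk)],
    metadatas=[{"doc_id": "doc1", "section": "Section 2", "page": 4}],
)

# Retrieve the top-k most similar chunks for a query.
hits = collection.query(query_embeddings=[embed("tax obligations")], n_results=3)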

Retrieval and Generation

  • RAG Chain (app/llm/chain.py): Implements the retrieval-augmented generation pattern using Ollama.
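
Conceptually the chain retrieves the top-k chunks, packs them into a prompt, and asks the model to answer with citations. A sketch against Ollama's /api/generate endpoint (the prompt wording is an assumption; hits is a Chroma query result as in the storage sketch above):

import requests

def answer(query: str, hits: dict) -> str:
    """Build a context-grounded prompt and generate an answer with Ollama."""
    context = "\n\n".join(
        f"[{m['doc_id']} / {m['section']} / p.{m['page']}]\n{doc}"
        for doc, m in zip(hits["documents"][0], hits["metadatas"][0])
    )
    prompt = (
        "Answer the question using only the context below, citing your sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["response"]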

API

  • FastAPI App (main.py): Provides endpoints for text and JSON responses.
  • Schemas (app/schemas.py): Defines data models for requests and responses.
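
Judging from the request payloads above and the Notes below, the models might look roughly like this (every field beyond query and top_k is an assumption):

from pydantic import BaseModel

class QueryRequest(BaseModel):
    query: str
    top_k: int = 3  # falls back to DEFAULT_TOP_K

class Citation(BaseModel):
    doc_id: str
    section: str
    page: int

class QueryResponse(BaseModel):
    answer: str
    citations: list[Citation]
    confidence: float        # assumed field; see Notes
    next_actions: list[str]  # assumed field; see Notes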

Notes

  • The system automatically identifies sections in legal documents
  • Citations include document ID, section, and page numbers
  • JSON responses include confidence scores and recommended next actions

Evaluation

Evaluation and tracing are provided via Ragas and Langfuse.

If langfuse_tracing_enabled is set, tracing is enabled and LLM inputs, responses, and latency are recorded in Langfuse.

To run the evaluation:

# generate dataset for eval
python3 eval_main.py --create-sample --sample-output sample_questions.json

# run the evaluation
python3 eval_main.py --dataset sample_questions.json --output evaluation_results.json --top-k 3
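
Internally, a Ragas evaluation over such a dataset looks roughly like this (a sketch, not the actual eval_main.py; column names follow Ragas conventions):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

rows = {
    "question": ["What are the tax obligations in the asset purchase agreement?"],
    "answer": ["..."],        # answer produced by the RAG chain
    "contexts": [["..."]],    # chunks retrieved for the question
    "ground_truth": ["..."],  # reference answer from the dataset
}
result = evaluate(Dataset.from_dict(rows),
                  metrics=[faithfulness, answer_relevancy, context_precision])
print(result)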
