A Neo4j-based legal knowledge graph that models the Companies Act, 2013, its 2026 Amendments, and Companies (Incorporation) Rules, 2014 as an interconnected graph — enabling precise, traceable legal querying through Cypher and an LLM-powered natural language interface.
flowchart TB
subgraph ingestion ["1. Ingestion Pipeline"]
PDF["PDF Documents<br/>(Act + Amendments + Rules)"]
Parser["Regex + LLM Parser"]
JSON["Structured JSON"]
Loader["Neo4j Loader"]
PDF --> Parser --> JSON --> Loader
end
subgraph kg ["2. Neo4j Knowledge Graph"]
Act["Act"]
Chapter["Chapters (29)"]
Section["Sections (421)"]
SubSection["Sub-sections"]
Amendment["Amendments (104)"]
Rules["Rules (20)"]
Act -->|CONTAINS_CHAPTER| Chapter
Chapter -->|CONTAINS_SECTION| Section
Section -->|HAS_SUBSECTION| SubSection
Section -.->|AMENDED_BY| Amendment
Section -.->|SUBSTITUTES / INSERTS / DELETES| Amendment
Section -.->|GOVERNED_BY_RULE| Rules
Section -.->|REFERS_TO| Section
end
subgraph query ["3. Query Layer"]
Cypher["Cypher Query Library<br/>(11 pre-built templates)"]
Engine["Query Engine"]
Cypher --> Engine
end
subgraph ai ["4. Intelligence Layer"]
NL["Natural Language Question"]
Intent["Intent Classifier"]
TextToCypher["Schema-Aware LLM<br/>(Text-to-Cypher)"]
Validate["Validate + Self-Correct"]
Enrich["Multi-hop Enrichment"]
ResponseGen["Grounded Response<br/>Generator"]
NL --> Intent --> TextToCypher --> Validate --> Enrich --> ResponseGen
end
subgraph interface ["5. Interfaces"]
API["FastAPI REST API<br/>(9 endpoints)"]
UI["Streamlit Demo UI<br/>(direct Neo4j client)"]
end
Loader --> kg
kg --> Engine
Engine --> ResponseGen
ResponseGen --> API
Engine --> UI
| Node | Key Properties | Description |
|---|---|---|
| Act | `name`, `year`, `full_title` | The base legislation |
| Chapter | `number`, `roman`, `title` | Top-level division of the Act |
| Section | `number`, `title`, `original_text`, `current_text`, `is_omitted` | Individual legal provisions |
| SubSection | `number`, `text`, `parent_section` | Numbered sub-provisions within a section |
| Proviso | `text`, `order_num` | "Provided that..." conditional clauses |
| Explanation | `text` | Explanatory notes within sections |
| Amendment | `type`, `description`, `old_text`, `new_text` | Individual amendment directives |
| AmendmentAct | `name`, `year`, `full_title` | The amending legislation |
| Rule | `number`, `title`, `text`, `source_section` | Procedural rules framed under the Act |
| RuleSet | `name`, `year` | Collection of related rules |
graph LR
A["Act"] -->|CONTAINS_CHAPTER| B["Chapter"]
B -->|CONTAINS_SECTION| C["Section"]
C -->|HAS_SUBSECTION| D["SubSection"]
C -->|HAS_PROVISO| E["Proviso"]
C -->|HAS_EXPLANATION| F["Explanation"]
C -->|AMENDED_BY| G["Amendment"]
C -->|SUBSTITUTES| G
C -->|INSERTS| G
C -->|DELETES| G
G -->|DEFINED_IN| H["AmendmentAct"]
C -->|REFERS_TO| C
C -->|GOVERNED_BY_RULE| I["Rule"]
I -->|DERIVED_RULE| C
J["RuleSet"] -->|CONTAINS_RULE| I
The graph uses both generic and typed amendment relationships:
- `AMENDED_BY`: links every section to all its amendments (for broad queries)
- `SUBSTITUTES`: word/phrase substitutions (e.g., "for 'shares' read 'securities'")
- `INSERTS`: new sub-sections, clauses, or provisos added
- `DELETES`: text omitted or removed
- `DERIVED_RULE`: rules framed under a section's authority
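
A minimal sketch of how a loader might pick the edges to create for one amendment directive. The type labels (`substitution`, `insertion`, `omission`) and the function name are assumptions made for illustration; the repository's loader may classify directives differently.

```python
# Hypothetical mapping from a directive's type label to its typed edge.
TYPED_EDGE = {
    "substitution": "SUBSTITUTES",
    "insertion": "INSERTS",
    "omission": "DELETES",
}


def edges_for_amendment(amendment_type: str) -> list[str]:
    """Return the relationship types to create from a Section to an Amendment."""
    edges = ["AMENDED_BY"]  # the generic edge is always created
    typed = TYPED_EDGE.get(amendment_type.lower())
    if typed is not None:
        edges.append(typed)  # add the type-specific edge when the type is known
    return edges
```

With this shape, a broad query only needs `AMENDED_BY`, while a type-specific query can follow a single typed edge without filtering on `Amendment.type`.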
- Temporal Versioning: Each `Section` stores both `original_text` (as enacted in 2013) and `current_text` (after 2026 amendments). This enables querying both the historical and current effective state of the law without maintaining versioned copies of entire nodes.
- Amendments as First-Class Nodes with Typed Edges: Rather than just overwriting section text, amendments are modelled as independent nodes with both a generic `AMENDED_BY` edge and typed edges (`SUBSTITUTES`, `INSERTS`, `DELETES`). This preserves full provenance and enables queries like "show me all substitutions with their before/after text."
- Cross-Reference Edges: Sections that reference other sections (e.g., "as defined in section 2") are linked via `REFERS_TO` relationships, enabling multi-hop graph traversal to understand the full legal context of any provision.
- Rule-Section Mapping: Rules are linked to their parent sections via bidirectional `GOVERNED_BY_RULE` / `DERIVED_RULE` relationships, answering both "which rules implement this section?" and "which section authorises this rule?"
- Hierarchical Structure: The Act → Chapter → Section → SubSection hierarchy mirrors the actual legal document structure, enabling queries at any level of granularity.
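
Because `original_text` and `current_text` sit on the same `Section` node, a "what changed?" view can be computed from a single query result. The sketch below is one possible way to do that with the standard library's `difflib`; the function name is hypothetical and the word-level granularity is an assumption, not necessarily what the project does.

```python
import difflib


def text_changes(original_text: str, current_text: str) -> list[str]:
    """Return word-level changes as '-removed' / '+added' markers."""
    diff = difflib.ndiff(original_text.split(), current_text.split())
    # ndiff prefixes each token with '- ', '+ ', '  ' or '? '; keep only real changes.
    return [line[0] + line[2:] for line in diff if line[0] in "+-"]


if __name__ == "__main__":
    before = "the company shall issue shares"
    after = "the company shall issue securities"
    print(text_changes(before, after))
```

Identical inputs yield an empty list, so the same helper can double as a cheap "was this section amended?" check.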
- Python 3.11+
- Docker & Docker Compose (for Neo4j)
- A Gemini or OpenAI API key (optional — system works fully without it via template matching)
# 1. Clone the repository
git clone <repo-url>
cd GraphRag
# 2. Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/Mac
# 3. Install dependencies
pip install -r requirements.txt
# 4. Start Neo4j
docker-compose up -d
# 5. Configure environment
cp .env.example .env
# Edit .env — set GEMINI_API_KEY (free tier) or OPENAI_API_KEY (optional)
# 6. Place the three PDF files in data/raw/
# - Companies Act, 2013.pdf
# - Corporate Laws (Amendment) Act, 2026.pdf
# - Companies Rules, 2014.pdf
# 7. Run the full ingestion pipeline
python scripts/ingest.py
# 8. Run tests
python -m pytest tests/ -v
# 9. Start the API server
uvicorn src.api.main:app --reload --port 8000
# 10. (Optional) Start the Streamlit UI
streamlit run src/ui/app.py

python scripts/ingest.py --parse-only
# Outputs structured JSON to data/parsed/

| Endpoint | Method | Description |
|---|---|---|
| `/api/section/{number}` | GET | Current effective version of a section |
| `/api/section/{number}/amendments` | GET | All amendments affecting a section |
| `/api/section/{number}/rules` | GET | Rules applicable under a section |
| `/api/section/{number}/context` | GET | Full context: text + amendments + rules + cross-refs |
| `/api/search?keyword=...` | GET | Search sections by keyword |
| `/api/query` | POST | Natural language query (LLM-powered) |
| `/api/cypher` | POST | Execute raw Cypher |
| `/api/stats` | GET | Graph node/relationship counts |
| `/health` | GET | Health check |
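
The endpoints can also be called from Python. The following is a small standard-library sketch for `/api/query`, assuming the server runs on `localhost:8000` and accepts a JSON body with a `question` field, as this README's usage shows. The helper names are hypothetical, and the response is returned as-is because its exact fields are not assumed here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the port from the uvicorn command


def build_query_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the natural language endpoint."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str) -> dict:
    """Send the question and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_request(question), timeout=30) as resp:
        return json.load(resp)
```

Separating request construction from sending keeps the payload shape easy to check without a running server.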
Q1: Current version of Section 135 (Corporate Social Responsibility)
curl http://localhost:8000/api/section/135

Q2: Amendments to Section 132 (National Financial Reporting Authority)
curl http://localhost:8000/api/section/132/amendments

Q3: Rules applicable under Section 7 (Incorporation)
curl http://localhost:8000/api/section/7/rules

Q4: Full context for a section (text + amendments + rules + cross-refs)
curl http://localhost:8000/api/section/2/context

Q5: Natural language query (with intent classification + multi-hop enrichment)
curl -X POST http://localhost:8000/api/query \
-H "Content-Type: application/json" \
-d '{"question": "What are the penalties for not holding an annual general meeting?"}'

// Current version of a section with its chapter context
MATCH (a:Act)-[:CONTAINS_CHAPTER]->(c:Chapter)-[:CONTAINS_SECTION]->(s:Section)
WHERE s.number = '135'
RETURN s.number, s.title, s.current_text, c.title AS chapter

// All amendments to a section with their source
MATCH (s:Section {number: '132'})-[:AMENDED_BY]->(am:Amendment)-[:DEFINED_IN]->(aa:AmendmentAct)
RETURN am.type, am.description, am.old_text, am.new_text, aa.name

// Substitutions with before/after text (using typed relationship)
MATCH (s:Section)-[:SUBSTITUTES]->(am:Amendment)
WHERE am.old_text <> ''
RETURN s.number, s.title, am.old_text, am.new_text
LIMIT 10

// Insertions via typed relationship
MATCH (s:Section)-[:INSERTS]->(am:Amendment)-[:DEFINED_IN]->(aa:AmendmentAct)
RETURN s.number, am.description, aa.name

// Rules linked to a section, with the rule set that contains each rule
MATCH (s:Section {number: '7'})-[:GOVERNED_BY_RULE]->(r:Rule)<-[:CONTAINS_RULE]-(rs:RuleSet)
RETURN r.number, r.title, rs.name

// Cross-reference traversal (up to 2 hops)
MATCH path = (s:Section {number: '135'})-[:REFERS_TO*1..2]->(ref:Section)
RETURN [n IN nodes(path) | n.number + ': ' + n.title] AS chain

The AI component is a multi-stage reasoning pipeline rather than a single LLM call:
Before any query generation, the system classifies the user's question into one of 11 intents:
`section_lookup`, `amendment_query`, `amendment_diff`, `amendment_insertion`, `amendment_deletion`, `rule_query`, `full_context`, `cross_reference`, `keyword_search`, `statistics`, `general`
This classification determines whether to use a fast template or engage the LLM, and which graph traversal pattern is most appropriate.
11 pre-built Cypher templates cover the most common query patterns. Intent classification maps directly to the right template, giving instant, deterministic responses without API costs.
When templates don't match, the LLM receives:
- The full graph schema (all node labels, properties, and relationship types)
- 10 few-shot examples covering diverse query patterns
- Explicit rules about parameter usage and result formatting
If the LLM-generated Cypher fails Neo4j's EXPLAIN validation, the system automatically retries with the error message included in the prompt — giving the LLM a chance to fix its own mistakes (up to 2 correction attempts).
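
The retry behaviour can be sketched as a small loop. Here `generate` and `validate` are hypothetical callables standing in for the LLM call and for an `EXPLAIN` check against Neo4j; only the limit of two correction attempts and the method labels come from this README.

```python
from typing import Callable, Optional

MAX_CORRECTIONS = 2  # from the text: up to 2 correction attempts

Generate = Callable[[str, Optional[str]], str]  # (question, previous_error) -> cypher
Validate = Callable[[str], Optional[str]]       # cypher -> error message, or None if valid


def generate_with_self_correction(
    question: str, generate: Generate, validate: Validate
) -> tuple[Optional[str], str]:
    """Return (cypher, method); method is 'llm', 'llm_corrected' or 'fallback'."""
    error: Optional[str] = None
    for attempt in range(1 + MAX_CORRECTIONS):
        cypher = generate(question, error)  # the error is fed back into the prompt
        error = validate(cypher)
        if error is None:
            return cypher, "llm" if attempt == 0 else "llm_corrected"
    return None, "fallback"  # give up and let a non-LLM strategy answer
```

Injecting the two callables keeps the loop testable without a database or an API key.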
For section-specific queries, the system automatically fetches cross-referenced sections via REFERS_TO relationships, providing the response generator with richer context for a more complete answer.
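
A possible shape for the enrichment step is shown below. `run_query` is a hypothetical callable that executes Cypher and returns rows as dicts, and the upper limit of three hops is an arbitrary safety bound for this sketch. The hop count is interpolated as a validated integer because Cypher does not accept a parameter inside a variable-length bound.

```python
from typing import Any, Callable

Row = dict[str, Any]
RunQuery = Callable[[str, dict[str, Any]], list[Row]]


def cross_reference_query(max_hops: int = 2) -> str:
    """Build the traversal query; max_hops is validated before interpolation."""
    if not 1 <= max_hops <= 3:
        raise ValueError("max_hops must be between 1 and 3")
    return (
        f"MATCH (s:Section {{number: $number}})-[:REFERS_TO*1..{max_hops}]->(ref:Section) "
        "WHERE ref.number <> $number "
        "RETURN DISTINCT ref.number AS number, ref.title AS title, "
        "ref.current_text AS text"
    )


def fetch_cross_references(
    section_number: str, run_query: RunQuery, max_hops: int = 2
) -> list[Row]:
    """Fetch sections reachable via REFERS_TO, to pass along as extra context."""
    return run_query(cross_reference_query(max_hops), {"number": section_number})
```

`DISTINCT` and the self-exclusion filter keep cyclic references from repeating the starting section in the context.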
The response LLM is constrained to:
- Only state facts present in the retrieved graph data
- Cite specific section numbers, rule numbers, and amendment types
- Distinguish between original text and amended text
- Never hallucinate legal information
Every response includes:
- The Cypher query used
- The generation method (`template` / `llm` / `llm_corrected` / `fallback`)
- The classified intent
- A list of all source nodes referenced
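
One way to carry this metadata is a small response envelope. The class and field names below are hypothetical; only the four method labels and the kinds of metadata are taken from the list above.

```python
from dataclasses import asdict, dataclass, field

METHODS = {"template", "llm", "llm_corrected", "fallback"}


@dataclass
class QueryResponse:
    answer: str
    cypher: str   # the Cypher query that produced the data
    method: str   # one of METHODS
    intent: str   # the classified intent
    sources: list[str] = field(default_factory=list)  # e.g. "Section 135"

    def __post_init__(self) -> None:
        # Reject labels outside the documented set early.
        if self.method not in METHODS:
            raise ValueError(f"unknown generation method: {self.method}")


if __name__ == "__main__":
    response = QueryResponse(
        answer="(answer text)",
        cypher="MATCH (s:Section {number: $number}) RETURN s.current_text",
        method="template",
        intent="section_lookup",
        sources=["Section 135"],
    )
    print(asdict(response))
```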
Running python scripts/ingest.py produces:
| Source Document | Parsed Output |
|---|---|
| Companies Act, 2013 (PDF, ~600 pages) | 29 chapters, 421 sections with subsections, provisos, explanations, and cross-references |
| Corporate Laws (Amendment) Act, 2026 | 104 amendment directives: 64 substitutions, 32 insertions, 4 omissions, 4 new sections. 47 directives include extracted old/new text. |
| Companies (Incorporation) Rules, 2014 | 20 rules with curated titles and Act section mappings (14 with extracted text) |
# Run all 50 tests (no Neo4j or API key needed)
python -m pytest tests/ -v
# Tests cover:
# - Intent classification (11 tests)
# - Entity extraction (5 tests)
# - Template matching (6 tests)
# - End-to-end pipeline (3 tests)
# - Parser output validation (12 tests)
# - Response generation (9 tests)
# - Source extraction (4 tests)

├── docker-compose.yml # Neo4j container
├── requirements.txt # Python dependencies
├── .env.example # Environment variable template
├── src/
│ ├── config.py # Centralised configuration
│ ├── parsing/
│ │ ├── act_parser.py # Companies Act 2013 — regex-based structural extraction
│ │ ├── amendment_parser.py # Amendment Act 2026 — directive classification
│ │ ├── rules_parser.py # Companies Rules 2014 — curated mapping + extraction
│ │ └── llm_extractor.py # LLM-assisted extraction (definitions, summaries)
│ ├── graph/
│ │ ├── schema.py # Neo4j constraints, indexes, and verification
│ │ ├── loader.py # JSON → Neo4j with typed relationships
│ │ └── amendment_applier.py # Apply amendment directives to section text
│ ├── queries/
│ │ ├── cypher_templates.py # 11 pre-built parameterised Cypher templates
│ │ └── query_engine.py # Execute templates or raw Cypher, return dicts
│ ├── intelligence/
│ │ ├── llm_client.py # Shared LLM client factory (Gemini / OpenAI)
│ │ ├── schema_prompt.py # Graph schema context + 10 few-shot examples
│ │ ├── text_to_cypher.py # Intent classification → template/LLM → self-correction
│ │ └── response_generator.py # Grounded, cited response generation
│ ├── api/
│ │ ├── main.py # FastAPI application with lifespan management
│ │ └── routes.py # REST endpoints with multi-hop enrichment
│ └── ui/
│ └── app.py # Streamlit demo UI (5 tabs)
├── tests/
│ ├── test_intent_classification.py # Intent + entity + template + E2E tests
│ ├── test_parsers.py # Parser output validation
│ └── test_response_generator.py # Response formatting + source extraction
├── scripts/
│ ├── ingest.py # Full ingestion pipeline (parse → load → apply → verify)
│ ├── demo_queries.py # Sample queries demonstrating all features
│ └── verify_parsed.py # Quick verification of parsed JSON quality
└── data/
├── raw/ # Input PDFs (not committed)
└── parsed/ # Parsed JSON output (not committed)
| Decision | Rationale |
|---|---|
| Regex-first parsing (LLM only for ambiguous cases) | Legal documents have consistent formatting; regex is faster, cheaper, and deterministic for structural extraction |
| Both generic and typed amendment relationships | AMENDED_BY for broad queries + SUBSTITUTES/INSERTS/DELETES for precise type-specific traversal |
| Intent classification before query generation | Determines the optimal strategy (template vs LLM) and avoids unnecessary API costs |
| Self-correction loop | Catches and fixes LLM mistakes automatically — more robust than single-shot generation |
| Multi-hop enrichment | Cross-referenced sections are fetched automatically, giving the response LLM richer context |
| Multi-provider LLM support | Supports both Gemini (free tier) and OpenAI via a unified OpenAI-compatible client; easy to switch via environment variable |
| Template matching before LLM | Guarantees the system works without an API key; handles 80%+ of common queries instantly |
| Both `original_text` and `current_text` | Enables temporal queries ("what changed?") without versioned node copies |
| Rules as curated mapping | The bilingual gazette format makes automated extraction unreliable; a curated map ensures correctness |
| Graceful degradation | Every layer has a fallback: LLM → self-correct → template → keyword search; no API key → structured non-LLM response |
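
The graceful-degradation row can be pictured as an ordered chain of strategies in which the first one to produce a result wins. This is a generic sketch: the strategy names and the `first_successful` helper are hypothetical, and the real system may order or combine its fallbacks differently.

```python
from typing import Any, Callable, Optional

Strategy = Callable[[str], Optional[Any]]


def first_successful(
    question: str, strategies: list[tuple[str, Strategy]]
) -> tuple[str, Optional[Any]]:
    """Try each strategy in order; return (name, result) for the first that works."""
    for name, strategy in strategies:
        try:
            result = strategy(question)
        except Exception:
            continue  # a failing layer (e.g. no API key) must not break the chain
        if result:
            return name, result
    return "none", None


if __name__ == "__main__":
    def llm_strategy(question: str) -> Optional[str]:
        raise RuntimeError("no API key configured")  # simulated failure

    chain = [
        ("llm", llm_strategy),
        ("template", lambda q: None),                   # no template matched
        ("keyword_search", lambda q: ["Section 96"]),   # placeholder result
    ]
    print(first_successful("annual general meeting", chain))
```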
- Modular architecture: Each layer (parsing, graph, queries, intelligence, API) can be independently scaled or replaced
- Stateless API: FastAPI endpoints are stateless; horizontal scaling via multiple workers (`uvicorn --workers N`)
- Graph query efficiency: Indexes on `Section.number`, `Section.title`, `Chapter.number`, `Rule.number`, and `Amendment.type` ensure O(log n) lookups
- Template-first strategy: 80%+ of queries resolve via templates without LLM calls, reducing latency and cost
- Extensible schema: Adding new Acts, amendment acts, or rule sets requires only new loader functions — the graph schema and query layer remain unchanged
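
For reference, the indexes named in the list could be created with statements like the ones generated below. The index names are hypothetical, and the sketch only builds the Cypher strings; running them against Neo4j is left to whatever driver session the project uses.

```python
# (label, property) pairs taken from the index list above.
INDEXED_PROPERTIES = [
    ("Section", "number"),
    ("Section", "title"),
    ("Chapter", "number"),
    ("Rule", "number"),
    ("Amendment", "type"),
]


def index_statements() -> list[str]:
    """Build idempotent CREATE INDEX statements, one per (label, property) pair."""
    return [
        f"CREATE INDEX {label.lower()}_{prop}_idx IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop})"
        for label, prop in INDEXED_PROPERTIES
    ]


if __name__ == "__main__":
    for statement in index_statements():
        print(statement)
```

`IF NOT EXISTS` makes the statements safe to re-run on every ingestion.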
The submission implements the full pipeline end-to-end: graph design, ingestion, Cypher queries, and an intelligence layer with grounded answers. The items below are optional extensions, listed as a roadmap rather than as gaps in the core deliverable. Which of them to pursue depends on product requirements, evaluation priorities, and the time available for deeper parsing, data modelling, or UX polish.
- Full Act coverage: Extend parsing to Schedules I–VII and all 470+ sections with sub-clause granularity
- Multi-amendment support: Track amendments from multiple amendment acts (2017, 2019, 2020, 2026) as separate `AmendmentAct` nodes
- Vector search: Add embeddings on section text for semantic similarity retrieval alongside graph queries
- Graph visualisation: Interactive Neo4j Bloom or D3.js integration in the Streamlit UI
- Batch LLM extraction: Use structured outputs to extract all ~90 definitions from Section 2 as `Definition` nodes
- Rule-to-form mapping: Link rules to their prescribed forms (INC-1 through INC-11 for incorporation)