Proposal: PostgreSQL + GitHub Architecture for OntoKit
Date: 2026-04-01
Authors: Damien Riehl, with analysis from Claude
Status: Draft for discussion with Fr. John D'Orazio
Purpose: Basis for a PRD to refactor OntoKit's storage and version control layers
1. Executive Summary
We propose evolving OntoKit's architecture into a two-layer system:
- PostgreSQL as the live working layer -- all entity-level CRUD, search, browsing, and analytics happen against granular database rows
- GitHub as the version control layer -- formal commits, branches, pull requests, and review happen through GitHub's API and UI
Crucially, we are not starting from scratch. PR #10 on ontokit-api (feat: PostgreSQL index tables for ontology query optimization) already built the foundation: five PostgreSQL index tables, an OntologyIndexService with SQL-based tree/search/detail queries, an IndexedOntologyService facade that transparently routes reads through the index with RDFLib fallback, and background reindexing via ARQ workers. The proposed architecture evolves PR #10's read-only index into a read-write primary store.
2. What PR #10 Already Built

PR #10 (branch feature/postgresql-ontology-index) implemented:

2.1 Five PostgreSQL Tables

- ontology_index_status -- project_id, branch, commit_hash, status (pending/indexing/ready/failed), entity_count
- indexed_entities -- project_id, branch, iri, local_name, entity_type, deprecated
- indexed_labels -- entity_id FK, property_iri, value, lang
- indexed_hierarchy -- project_id, branch, child_iri, parent_iri
- indexed_annotations -- entity_id FK, property_iri, value, lang, is_uri

Indexes include: GIN trigram indexes on local_name, iri, and label value for fast fuzzy/substring search; composite B-tree indexes on (project_id, branch, entity_type) for type filtering; and hierarchy indexes on both parent_iri and child_iri for bidirectional traversal.
2.2 OntologyIndexService (1,134 lines)
- full_reindex(project_id, branch, graph, commit_hash) -- atomically deletes existing index data and rebuilds from an RDFLib graph, batching inserts (1,000 rows/batch) for performance
- get_root_classes() -- finds classes not appearing as children in the hierarchy
- get_class_children() -- JOINs entities with the hierarchy table
- get_class_detail() -- fetches entity + labels + comments + parents + annotations
- get_ancestor_path() -- recursive CTE walking up the hierarchy (depth-limited to 100)
- search_entities() -- ILIKE search across local_name, IRI, and labels
- get_class_count() -- simple COUNT query
- Concurrency guard -- PostgreSQL ON CONFLICT DO UPDATE prevents concurrent reindexing, with 10-minute stale lock reclamation
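The ancestor walk above can be sketched in a few lines. This is an illustrative reduction, not OntoKit's actual code: it uses SQLite in place of PostgreSQL, a toy hierarchy, and omits the (project_id, branch) scoping that every real query carries.

```python
import sqlite3

# Minimal stand-in for PR #10's indexed_hierarchy table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indexed_hierarchy (child_iri TEXT, parent_iri TEXT)")
conn.executemany(
    "INSERT INTO indexed_hierarchy VALUES (?, ?)",
    [
        ("ex:Poodle", "ex:Dog"),
        ("ex:Dog", "ex:Mammal"),
        ("ex:Mammal", "ex:Animal"),
    ],
)

def ancestor_path(conn, iri, max_depth=100):
    # Recursive CTE walking child -> parent links upward,
    # depth-limited to 100 as PR #10 does.
    rows = conn.execute(
        """
        WITH RECURSIVE ancestors(iri, depth) AS (
            SELECT parent_iri, 1 FROM indexed_hierarchy WHERE child_iri = ?
            UNION ALL
            SELECT h.parent_iri, a.depth + 1
            FROM indexed_hierarchy h
            JOIN ancestors a ON h.child_iri = a.iri
            WHERE a.depth < ?
        )
        SELECT iri FROM ancestors ORDER BY depth
        """,
        (iri, max_depth),
    ).fetchall()
    return [r[0] for r in rows]

print(ancestor_path(conn, "ex:Poodle"))
# -> ['ex:Dog', 'ex:Mammal', 'ex:Animal']
```

The same CTE shape works unchanged on PostgreSQL; only the connection layer differs.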
2.3 IndexedOntologyService (Facade)
Transparently routes queries through the SQL index when ready, falling back to RDFLib when the index hasn't been built yet, the index query throws an error, or the migration hasn't been run. Also auto-enqueues reindexing when the stored commit_hash doesn't match git HEAD.
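The routing rule can be shown as a small sketch. Class and method names here are hypothetical stand-ins, not OntoKit's actual API; the point is the three fallback conditions: index not ready, index error, or migration not run.

```python
# Hypothetical facade sketch: prefer the SQL index when ready,
# degrade to RDFLib otherwise.
class IndexedOntologyFacade:
    def __init__(self, index_service, rdflib_service):
        self.index = index_service
        self.rdflib = rdflib_service

    def search(self, query):
        if not self.index.is_ready():
            # Index not built yet, or migration not run.
            return self.rdflib.search(query)
        try:
            return self.index.search(query)   # fast SQL path
        except Exception:
            # Any index-side error falls back transparently.
            return self.rdflib.search(query)

# Stub services to exercise the routing.
class FakeIndex:
    def __init__(self, ready, fail=False):
        self.ready, self.fail = ready, fail
    def is_ready(self):
        return self.ready
    def search(self, q):
        if self.fail:
            raise RuntimeError("index error")
        return ["sql:" + q]

class FakeRDFLib:
    def search(self, q):
        return ["rdf:" + q]

facade = IndexedOntologyFacade(FakeIndex(ready=True), FakeRDFLib())
print(facade.search("Person"))  # -> ['sql:Person']
print(IndexedOntologyFacade(FakeIndex(True, fail=True), FakeRDFLib()).search("x"))  # -> ['rdf:x']
```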
2.4 Reindex Triggers
Four triggers already wired up: after file import, after GitHub clone, after source save (commit), and a manual admin endpoint (POST /projects/{id}/ontology/reindex). Branch deletion cleans up index rows.
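With four triggers able to fire in quick succession, the ON CONFLICT DO UPDATE guard is what keeps reindex jobs from stampeding. A sketch of the pattern, using SQLite in place of PostgreSQL and an illustrative schema (the real table and columns may differ):

```python
import sqlite3

STALE_SECONDS = 600  # the 10-minute stale-lock window from PR #10

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ontology_index_status (
           project_id TEXT, branch TEXT, status TEXT, locked_at REAL,
           PRIMARY KEY (project_id, branch))"""
)

def try_acquire(conn, project_id, branch, now):
    # Insert the lock row, or take over an existing one only if it is
    # not currently held or has gone stale.
    conn.execute(
        """INSERT INTO ontology_index_status VALUES (?, ?, 'indexing', ?)
           ON CONFLICT(project_id, branch) DO UPDATE
           SET status = 'indexing', locked_at = excluded.locked_at
           WHERE ontology_index_status.status != 'indexing'
              OR excluded.locked_at - ontology_index_status.locked_at > ?""",
        (project_id, branch, now, STALE_SECONDS),
    )
    row = conn.execute(
        "SELECT locked_at FROM ontology_index_status WHERE project_id=? AND branch=?",
        (project_id, branch),
    ).fetchone()
    # We hold the lock iff our timestamp won the upsert.
    return row[0] == now

print(try_acquire(conn, "p1", "main", now=1000.0))  # first worker wins
print(try_acquire(conn, "p1", "main", now=1100.0))  # lock held, not stale
print(try_acquire(conn, "p1", "main", now=1700.0))  # 700s later: stale, reclaimed
```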
2.5 What PR #10 Noted as Limitations

- instance_count always returns 0 (rdf:type relationships not indexed)
- equivalent_iris and disjoint_iris always return empty lists (not indexed)
- Search prefix-match sorting happens client-side after the SQL LIMIT
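The prefix-sort limitation is easy to picture: the ILIKE query returns a LIMITed page of substring matches, and Python then floats prefix matches to the top. A minimal sketch (function name is illustrative):

```python
# Client-side re-sort of a page of SQL search hits:
# prefix matches first, then alphabetical within each group.
def sort_prefix_first(results, query):
    q = query.lower()
    # Tuple key: False (prefix match) sorts before True (non-prefix).
    return sorted(
        results,
        key=lambda name: (not name.lower().startswith(q), name.lower()),
    )

hits = ["HasPerson", "Personal", "person", "SalesPerson"]
print(sort_prefix_first(hits, "per"))
# -> ['person', 'Personal', 'HasPerson', 'SalesPerson']
```

Because the sort runs after the SQL LIMIT, a prefix match outside the returned page can never be promoted -- which is why this is listed as a limitation.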
3. The Problem: Why Read-Only Index Isn't Enough
PR #10 is a read cache. The actual data still lives in RDF files managed by pygit2 bare repos. Every edit still follows the expensive path:
Load entire RDF file from Git bare repo
→ Parse into in-memory RDFLib graph
→ Modify the graph
→ Serialize entire graph back to Turtle
→ Commit to Git bare repo
→ Trigger background reindex to update PostgreSQL
3.1 What This Means in Practice
| Problem | Impact |
| --- | --- |
| Writes are still whole-file | Changing one label still requires reading, parsing, and rewriting the entire ontology file |
| Index lag | After an edit, the index is stale until the background ARQ job completes the full reindex |
| Dual source of truth | Git is canonical, PostgreSQL is a projection -- any bug in reindexing creates silent data divergence |
| Full reindex on every edit | full_reindex() deletes all rows and rebuilds from scratch, even if only one entity changed |
| Memory pressure unchanged | RDFLib still loads the entire graph for every write operation |
| Merge conflicts unchanged | pygit2 bare repos still manage branching on monolithic Turtle files |
3.2 Custom Code Still Maintained

Even with PR #10, the backend still maintains its own Git plumbing (git/bare_repository.py) alongside the RDFLib-based write path.

4. Proposed Architecture: Evolve the Index into the Primary Store

4.1 The Two-Layer Model

4.2 What Changes from PR #10

4.3 The Evolution Path: From PR #10 to Primary Store

PR #10's tables are already close to what we need. Here's what evolves:

- indexed_entities → ontology_entities -- add columns for write support
- indexed_labels → entity_labels -- add created_at, updated_at
- indexed_hierarchy → class_hierarchy -- add created_at, updated_at
- indexed_annotations → entity_annotations -- add created_at, updated_at

New tables needed: complex_axioms, ontology_prefixes, ontology_metadata, and property/individual tables.

ontology_index_status evolves to project_sync_status:

    project_sync_status (
        id UUID PRIMARY KEY,
        project_id UUID REFERENCES projects(id) ON DELETE CASCADE,
        branch VARCHAR(255),
        github_repo TEXT,
        last_commit_sha VARCHAR(40),
        last_synced_at TIMESTAMPTZ,
        sync_status VARCHAR(20),
        entity_count INTEGER DEFAULT 0,
        error_message TEXT,
        UNIQUE(project_id, branch)
    )

4.4 OntologyIndexService Evolves to OntologyEntityService

PR #10's OntologyIndexService (1,134 lines) already has all the read methods. We add write methods.

4.5 GitHub Sync Workflow

User edits in OntoKit UI
→ DB updates immediately (fast, granular, per-entity SQL)
→ No reindexing needed -- DB is always current
User clicks "Commit" (or auto-commit on save, TBD)
→ export_to_turtle() serializes DB → deterministic Turtle
→ GitHub API: create commit on branch with the Turtle file
→ project_sync_status updated with new commit SHA
Admin creates PR on GitHub (or via OntoKit UI wrapper)
→ Review happens on GitHub
→ Merge happens on GitHub
→ GitHub webhook → OntoKit
→ import_from_turtle() updates DB from merged Turtle
→ project_sync_status updated
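The "Commit" step above could use GitHub's REST contents endpoint (PUT /repos/{owner}/{repo}/contents/{path}), which takes a commit message, branch, base64-encoded content, and -- when updating an existing file -- the prior blob SHA. Whether OntoKit uses this endpoint or the lower-level Git data API is an open implementation choice; this sketch only shows the payload shape, with the HTTP call itself elided. Function and parameter names are illustrative.

```python
import base64
import json

def build_commit_payload(turtle_text, message, branch, prior_sha=None):
    # Shape of the request body for GitHub's
    # "create or update file contents" endpoint.
    payload = {
        "message": message,
        "branch": branch,
        # The contents API requires base64-encoded file content.
        "content": base64.b64encode(turtle_text.encode("utf-8")).decode("ascii"),
    }
    if prior_sha:
        # Required when updating an existing file on the branch.
        payload["sha"] = prior_sha
    return payload

payload = build_commit_payload(
    "@prefix ex: <http://example.org/> .\n",
    "Update ontology from OntoKit",
    "main",
)
print(json.dumps(payload, indent=2))
```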
5. Handling Hard Problems
5.1 RDF Round-Tripping
Challenge: Import RDF → DB rows → Export RDF must be lossless.
Solution -- Hybrid decomposition:

- ~90% of constructs (named classes, properties, hierarchy, labels, comments, annotations, individuals, simple restrictions) decompose cleanly into relational tables. Fast CRUD path.
- ~10% of complex constructs (class expressions with nested boolean operators, property chains, GCI axioms, SWRL rules) are stored in complex_axioms as serialized Turtle fragments. Lossless round-trip by construction.
- The same approach is used by TopBraid and PoolParty.
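The split can be sketched as a triple classifier. Everything here is an assumption for illustration -- the predicate whitelist, the blank-node rule, and the function name are not OntoKit's actual decomposition rules, just the shape of the idea:

```python
# Hypothetical routing rule: triples on "simple" predicates with named
# subjects become relational rows; everything else (e.g. blank-node class
# expressions) is preserved verbatim for the complex_axioms store.
SIMPLE_PREDICATES = {
    "rdf:type", "rdfs:subClassOf", "rdfs:label", "rdfs:comment",
    "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range",
}

def decompose(triples):
    rows, complex_axioms = [], []
    for s, p, o in triples:
        if p in SIMPLE_PREDICATES and not s.startswith("_:"):
            rows.append((s, p, o))            # clean relational decomposition
        else:
            complex_axioms.append((s, p, o))  # round-trip as Turtle fragment
    return rows, complex_axioms

rows, complex_axioms = decompose([
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Dog", "rdfs:label", "Dog"),
    ("_:b0", "owl:unionOf", "_:b1"),  # nested class expression -> complex
])
print(len(rows), len(complex_axioms))  # -> 2 1
```

The invariant that matters is that the two buckets are exhaustive and disjoint, so export can reassemble the graph losslessly from rows plus fragments.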
5.2 Deterministic Serialization
OntoKit already has serialize_deterministic() using RDFLib's to_isomorphic(). The export service builds an RDFLib graph from DB rows, then serializes deterministically.
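The property being bought here, reduced to its essence for plain string triples: the same set of statements always serializes to byte-identical text, so Git diffs reflect real changes only. (The real serialize_deterministic() works on RDF graphs, where RDFLib's to_isomorphic() additionally gives blank nodes canonical identities; this toy version ignores that.)

```python
# Toy deterministic serializer: sort the statements, join with newlines.
# Identical statement sets -> identical bytes, regardless of insertion order.
def serialize_sorted(triples):
    lines = sorted(f"{s} {p} {o} ." for s, p, o in triples)
    return "\n".join(lines) + "\n"

a = [("ex:B", "rdfs:label", '"b"'), ("ex:A", "rdfs:label", '"a"')]
b = list(reversed(a))  # same statements, different in-memory order
assert serialize_sorted(a) == serialize_sorted(b)
print(serialize_sorted(a))
```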
5.3 Branching Model
PR #10 already puts a branch column on every table, so multiple branch states can coexist in the DB. Proposed workflow:

- The main branch is the primary working state, always in the DB
- Feature branches can also live in the DB or only on GitHub
- System/admin users create GitHub branches and PRs for formal review
- Regular users edit the main branch working state via the OntoKit UI
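Branch coexistence falls out of the schema: two branch states of the same entity are just two rows, and every read is scoped by (project_id, branch). A small illustration, with SQLite standing in for PostgreSQL and an illustrative table shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_labels (project_id TEXT, branch TEXT, iri TEXT, value TEXT)"
)
# The same entity carries different labels on two coexisting branches.
conn.executemany(
    "INSERT INTO entity_labels VALUES ('p1', ?, 'ex:Dog', ?)",
    [("main", "Dog"), ("feature/rename", "Canine")],
)

def label(conn, branch, iri):
    # Every query is branch-scoped; no checkout or merge machinery needed.
    row = conn.execute(
        "SELECT value FROM entity_labels "
        "WHERE project_id = 'p1' AND branch = ? AND iri = ?",
        (branch, iri),
    ).fetchone()
    return row[0]

print(label(conn, "main", "ex:Dog"))            # -> Dog
print(label(conn, "feature/rename", "ex:Dog"))  # -> Canine
```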
5.4 Linting
PostgreSQL makes most lint rules simpler, not harder:
| Lint Rule | PostgreSQL Query |
| --- | --- |
| undefined-parent | LEFT JOIN class_hierarchy h ON h.parent_iri = e.iri WHERE e.id IS NULL |
| circular-hierarchy | |
| duplicate-label | GROUP BY value HAVING COUNT(*) > 1 |
| missing-label | LEFT JOIN entity_labels WHERE labels.id IS NULL |
| orphan-class | |

6. What Changes in Each Codebase
6.1 ontokit-api (Backend)
Evolve from PR #10:
- Rename indexed_* tables to ontology_*/entity_*/class_*
- Add new tables (complex_axioms, ontology_prefixes, ontology_metadata, property/individual tables)
- Add export_to_turtle() and import_from_turtle()

Remove / Simplify:
- git/bare_repository.py -- no more internal Git repos
- git/repository.py -- already deprecated
- services/ontology.py -- RDFLib-based entity operations replaced by SQL
- services/indexed_ontology.py -- facade no longer needed
- api/routes/pull_requests.py -- simplify to thin wrapper around GitHub API
- services/pull_request_service.py, services/github_sync.py

Keep (mostly unchanged):
6.2 ontokit-web (Frontend)
Add: "Commit to GitHub" action, GitHub PR integration views, improved import flow
Remove: Internal branch management, PR workflow, and diff viewer UIs
Keep: Ontology editor (both modes), entity tree, Monaco editor, search, analytics, graph visualization
7. Migration Path
Phase 0: Merge & Rename PR #10 (Foundation)
Phase 1: Write Path (Make PostgreSQL Primary for Writes)
- import_from_turtle() and export_to_turtle()

Phase 2: GitHub Integration
Phase 3: Cut Over
Phase 4: Cleanup & Polish
8. Comparison with Prior Proposals
This supersedes the "Ontology Atomization" plan and addresses every concern from its critical analysis:
- complex_axioms stores Turtle fragments

9. Why Build on PR #10
10. Risks and Mitigations
- complex_axioms Turtle fragments

11. Open Questions for Discussion
- (owl:imports), or single file?

cc @JohnRDOrazio @damienriehl