
Proposal: Evolve PostgreSQL Index (PR #10) into Primary Store + GitHub for Version Control #13


Proposal: PostgreSQL + GitHub Architecture for OntoKit

Date: 2026-04-01
Authors: Damien Riehl, with analysis from Claude
Status: Draft for discussion with Fr. John D'Orazio
Purpose: Basis for a PRD to refactor OntoKit's storage and version control layers


1. Executive Summary

We propose evolving OntoKit's architecture into a two-layer system:

  • PostgreSQL as the live working layer -- all entity-level CRUD, search, browsing, and analytics happen against granular database rows
  • GitHub as the version control layer -- formal commits, branches, pull requests, and review happen through GitHub's API and UI

Crucially, we are not starting from scratch. PR #10 on ontokit-api (feat: PostgreSQL index tables for ontology query optimization) already built the foundation: five PostgreSQL index tables, an OntologyIndexService with SQL-based tree/search/detail queries, an IndexedOntologyService facade that transparently routes reads through the index with RDFLib fallback, and background reindexing via ARQ workers. The proposed architecture evolves PR #10's read-only index into a read-write primary store.


2. What PR #10 Already Built

PR #10 (branch feature/postgresql-ontology-index) implemented:

2.1 Five PostgreSQL Tables

| Table | Purpose | Key Columns |
|---|---|---|
| ontology_index_status | Tracks indexing state per (project, branch) | project_id, branch, commit_hash, status (pending/indexing/ready/failed), entity_count |
| indexed_entities | One row per OWL entity | project_id, branch, iri, local_name, entity_type, deprecated |
| indexed_labels | Multilingual labels | entity_id FK, property_iri, value, lang |
| indexed_hierarchy | Parent-child class edges | project_id, branch, child_iri, parent_iri |
| indexed_annotations | Non-label annotations | entity_id FK, property_iri, value, lang, is_uri |
Indexes include: GIN trigram indexes on local_name, iri, and label value for fast fuzzy/substring search. Composite B-tree indexes on (project_id, branch, entity_type) for type filtering. Hierarchy indexes on both parent_iri and child_iri for bidirectional traversal.

2.2 OntologyIndexService (1,134 lines)

  • full_reindex(project_id, branch, graph, commit_hash) -- atomically deletes existing index data and rebuilds from an RDFLib graph, batching inserts (1,000 rows/batch) for performance
  • get_root_classes() -- finds classes not appearing as children in hierarchy
  • get_class_children() -- JOINs entities with hierarchy table
  • get_class_detail() -- fetches entity + labels + comments + parents + annotations
  • get_ancestor_path() -- recursive CTE walking up the hierarchy (depth-limited to 100)
  • search_entities() -- ILIKE search across local_name, IRI, and labels
  • get_class_count() -- simple COUNT query
  • Concurrency guard -- PostgreSQL ON CONFLICT DO UPDATE prevents concurrent reindexing, with 10-minute stale lock reclamation
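The concurrency guard deserves a closer look. The sketch below illustrates the upsert pattern; the locked_at column and function names are illustrative rather than PR #10's exact schema, and SQLite stands in for PostgreSQL so the example is self-contained. The idea: a worker claims the (project, branch) slot atomically, and a competing claim only succeeds when the existing lock is absent, finished, or stale.

```python
import sqlite3

STALE_SECONDS = 600  # mirrors the 10-minute stale lock reclamation


def try_claim(conn, project_id, branch, now):
    """Atomically claim the (project, branch) indexing slot via upsert.

    Returns True if this worker won the claim. The WHERE clause on the
    DO UPDATE lets the claim through only when no other worker holds a
    fresh 'indexing' lock.
    """
    cur = conn.execute(
        """
        INSERT INTO ontology_index_status (project_id, branch, status, locked_at)
        VALUES (?, ?, 'indexing', ?)
        ON CONFLICT(project_id, branch) DO UPDATE
        SET status = 'indexing', locked_at = excluded.locked_at
        WHERE ontology_index_status.status != 'indexing'
           OR excluded.locked_at - ontology_index_status.locked_at > ?
        """,
        (project_id, branch, now, STALE_SECONDS),
    )
    return cur.rowcount == 1
```

With this shape the first worker wins, a second concurrent worker is rejected, and a worker arriving after the stale window reclaims the lock without any explicit unlock step.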

2.3 IndexedOntologyService (Facade)

Transparently routes queries through the SQL index when ready, falling back to RDFLib when the index hasn't been built yet, the index query throws an error, or the migration hasn't been run. Also auto-enqueues reindexing when the stored commit_hash doesn't match git HEAD.
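The routing rule can be sketched as follows. Method and attribute names here are illustrative, not PR #10's exact API, and the real facade covers more cases (the missing-migration check and the stale-commit reindex enqueueing):

```python
class IndexedOntologyFacade:
    """Route reads through the SQL index when ready; otherwise fall back to RDFLib."""

    def __init__(self, index_service, rdflib_service):
        self.index = index_service
        self.fallback = rdflib_service

    def get_root_classes(self, project_id, branch):
        # Fall back when the index hasn't been built yet...
        if not self.index.is_ready(project_id, branch):
            return self.fallback.get_root_classes(project_id, branch)
        try:
            return self.index.get_root_classes(project_id, branch)
        except Exception:
            # ...or when the index query itself fails.
            return self.fallback.get_root_classes(project_id, branch)
```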

2.4 Reindex Triggers

Four triggers already wired up: after file import, after GitHub clone, after source save (commit), and a manual admin endpoint (POST /projects/{id}/ontology/reindex). Branch deletion cleans up index rows.

2.5 What PR #10 Noted as Limitations

  • instance_count always returns 0 (rdf:type relationships not indexed)
  • equivalent_iris and disjoint_iris always return empty lists (not indexed)
  • Search prefix-match sorting happens client-side after SQL LIMIT

3. The Problem: Why Read-Only Index Isn't Enough

PR #10 is a read cache. The actual data still lives in RDF files managed by pygit2 bare repos. Every edit still follows the expensive path:

Load entire RDF file from Git bare repo
  → Parse into in-memory RDFLib graph
    → Modify the graph
      → Serialize entire graph back to Turtle
        → Commit to Git bare repo
          → Trigger background reindex to update PostgreSQL

3.1 What This Means in Practice

| Problem | Impact |
|---|---|
| Writes are still whole-file | Changing one label still requires reading, parsing, and rewriting the entire ontology file |
| Index lag | After an edit, the index is stale until the background ARQ job completes the full reindex |
| Dual source of truth | Git is canonical, PostgreSQL is a projection -- any bug in reindexing creates silent data divergence |
| Full reindex on every edit | full_reindex() deletes all rows and rebuilds from scratch, even if only one entity changed |
| Memory pressure unchanged | RDFLib still loads the entire graph for every write operation |
| Merge conflicts unchanged | pygit2 bare repos still manage branching on monolithic Turtle files |

3.2 Custom Code Still Maintained

Even with PR #10, the backend still maintains:

  • pygit2 bare repository management (git/bare_repository.py)
  • Custom PR workflow (create, review, comment, approve, merge)
  • Custom branch management
  • MinIO object storage
  • Two-way GitHub sync bridge
  • Full RDFLib graph loading for every write

4. Proposed Architecture: Evolve the Index into the Primary Store

4.1 The Two-Layer Model

                    ┌───────────────────────┐
                    │     OntoKit Web UI    │
                    │   (Next.js frontend)  │
                    └───────────┬───────────┘
                                │
                    ┌───────────▼───────────┐
                    │    OntoKit API        │
                    │   (FastAPI backend)   │
                    └───────────┬───────────┘
                         │             │
              Live CRUD  │             │  Version Control
                         ▼             ▼
                   ┌────────────┐  ┌────────────┐
                   │ PostgreSQL │  │   GitHub   │
                   │ (entities, │  │  (commits, │
                   │  triples,  │  │  branches, │
                   │  search)   │  │  PRs,      │
                   └────────────┘  │  reviews)  │
                                   └────────────┘

4.2 What Changes from PR #10

| Aspect | PR #10 (Read-Only Index) | Proposed (Read-Write Primary Store) |
|---|---|---|
| Writes | Go to Git, then reindex async | Go directly to PostgreSQL |
| Reads | PostgreSQL with RDFLib fallback | PostgreSQL only (no fallback needed) |
| Source of truth | Git bare repo + MinIO | PostgreSQL (working state), GitHub (versioned snapshots) |
| Reindexing | Full rebuild from RDFLib graph on every edit | Not needed -- DB is always current |
| Serialization | Not relevant (Git stores files) | On-demand: DB → Turtle when committing to GitHub |
| Branching | pygit2 bare repos | GitHub (system/admin level) |
| PRs | Custom internal workflow | GitHub PRs |

4.3 The Evolution Path: From PR #10 to Primary Store

PR #10's tables are already close to what we need. Here's what evolves:

indexed_entities → ontology_entities

Add columns for write support:

-- Existing (from PR #10):
id, project_id, branch, iri, local_name, entity_type, deprecated

-- Add:
created_at    TIMESTAMPTZ DEFAULT now()
updated_at    TIMESTAMPTZ DEFAULT now()
created_by    UUID REFERENCES users(id)
updated_by    UUID REFERENCES users(id)

indexed_labels → entity_labels -- add created_at, updated_at

indexed_hierarchy → class_hierarchy -- add created_at, updated_at

indexed_annotations → entity_annotations -- add created_at, updated_at

New tables needed:

-- Safety valve for complex OWL constructs
complex_axioms (
    id              UUID PRIMARY KEY,
    project_id      UUID REFERENCES projects(id) ON DELETE CASCADE,
    branch          VARCHAR(255),
    axiom_type      VARCHAR(100),
    involved_iris   TEXT[],
    turtle_fragment TEXT,
    metadata        JSONB,
    created_at      TIMESTAMPTZ DEFAULT now(),
    updated_at      TIMESTAMPTZ DEFAULT now()
)

-- Namespace/prefix management
ontology_prefixes (
    id              UUID PRIMARY KEY,
    project_id      UUID REFERENCES projects(id) ON DELETE CASCADE,
    prefix          VARCHAR(50),
    namespace_iri   TEXT,
    UNIQUE(project_id, prefix)
)

-- Ontology-level metadata
ontology_metadata (
    id              UUID PRIMARY KEY,
    project_id      UUID REFERENCES projects(id) ON DELETE CASCADE,
    ontology_iri    TEXT,
    version_iri     TEXT,
    imports         TEXT[],
    annotations     JSONB,
    created_at      TIMESTAMPTZ DEFAULT now(),
    updated_at      TIMESTAMPTZ DEFAULT now()
)

-- Property domain/range declarations
property_domains (
    id              UUID PRIMARY KEY,
    property_id     UUID REFERENCES ontology_entities(id) ON DELETE CASCADE,
    class_iri       TEXT
)

property_ranges (
    id              UUID PRIMARY KEY,
    property_id     UUID REFERENCES ontology_entities(id) ON DELETE CASCADE,
    target_iri      TEXT
)

-- Individual type assertions and property values
individual_types (
    id              UUID PRIMARY KEY,
    individual_id   UUID REFERENCES ontology_entities(id) ON DELETE CASCADE,
    class_iri       TEXT
)

individual_property_values (
    id              UUID PRIMARY KEY,
    individual_id   UUID REFERENCES ontology_entities(id) ON DELETE CASCADE,
    property_iri    TEXT,
    value           TEXT,
    datatype        TEXT,
    language        VARCHAR(20),
    target_iri      TEXT
)

ontology_index_status evolves to project_sync_status:

project_sync_status (
    id              UUID PRIMARY KEY,
    project_id      UUID REFERENCES projects(id) ON DELETE CASCADE,
    branch          VARCHAR(255),
    github_repo     TEXT,
    last_commit_sha VARCHAR(40),
    last_synced_at  TIMESTAMPTZ,
    sync_status     VARCHAR(20),
    entity_count    INTEGER DEFAULT 0,
    error_message   TEXT,
    UNIQUE(project_id, branch)
)

4.4 OntologyIndexService Evolves to OntologyEntityService

PR #10's OntologyIndexService (1,134 lines) already has all the read methods. We add write methods:

class OntologyEntityService:
    """Evolved from OntologyIndexService -- now handles reads AND writes."""

    # === READS (already built in PR #10) ===
    async def get_root_classes(...)
    async def get_class_children(...)
    async def get_class_detail(...)
    async def get_ancestor_path(...)
    async def search_entities(...)
    async def get_class_count(...)

    # === WRITES (new) ===
    async def create_entity(self, project_id, branch, iri, entity_type, labels, ...)
    async def update_entity(self, project_id, branch, iri, changes: dict)
    async def delete_entity(self, project_id, branch, iri)
    async def add_label(self, entity_id, property_iri, value, lang)
    async def update_label(self, label_id, value, lang)
    async def remove_label(self, label_id)
    async def set_parent(self, project_id, branch, child_iri, parent_iri)
    async def remove_parent(self, project_id, branch, child_iri, parent_iri)
    async def reparent_class(self, project_id, branch, class_iri, old_parent, new_parent)
    async def add_annotation(self, entity_id, property_iri, value, lang)
    async def update_annotation(self, annotation_id, value, lang)
    async def remove_annotation(self, annotation_id)

    # === SERIALIZATION (new) ===
    async def export_to_turtle(self, project_id, branch) -> str
    async def import_from_turtle(self, project_id, branch, turtle_content: str, user_id)

4.5 GitHub Sync Workflow

User edits in OntoKit UI
  → DB updates immediately (fast, granular, per-entity SQL)
  → No reindexing needed -- DB is always current

User clicks "Commit" (or auto-commit on save, TBD)
  → export_to_turtle() serializes DB → deterministic Turtle
  → GitHub API: create commit on branch with the Turtle file
  → project_sync_status updated with new commit SHA

Admin creates PR on GitHub (or via OntoKit UI wrapper)
  → Review happens on GitHub
  → Merge happens on GitHub
  → GitHub webhook → OntoKit
    → import_from_turtle() updates DB from merged Turtle
    → project_sync_status updated
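The "Commit" step can be sketched against GitHub's REST Contents API (PUT /repos/{owner}/{repo}/contents/{path}), which creates a commit from a single file's base64-encoded content. The function below only builds the request; the owner/repo/path values are illustrative, and a real implementation would send the request with an authenticated HTTP client and then record the returned commit SHA in project_sync_status:

```python
import base64


def build_commit_request(owner, repo, path, branch, turtle, message, current_sha=None):
    """Build the URL and JSON body for a single-file commit via the
    GitHub Contents API. When updating an existing file, GitHub requires
    the blob SHA of the file being replaced."""
    body = {
        "message": message,
        "branch": branch,
        "content": base64.b64encode(turtle.encode("utf-8")).decode("ascii"),
    }
    if current_sha is not None:
        body["sha"] = current_sha
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    return url, body
```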

5. Handling Hard Problems

5.1 RDF Round-Tripping

Challenge: Import RDF → DB rows → Export RDF must be lossless.

Solution -- Hybrid decomposition:

  • ~90% of constructs (named classes, properties, hierarchy, labels, comments, annotations, individuals, simple restrictions) decompose cleanly into relational tables. Fast CRUD path.
  • ~10% of complex constructs (class expressions with nested boolean operators, property chains, GCI axioms, SWRL rules) stored in complex_axioms as serialized Turtle fragments. Lossless round-trip by construction.
  • Same approach used by TopBraid and PoolParty.
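A minimal sketch of the hybrid split, with a deliberately tiny predicate whitelist (the real decomposer covers far more constructs); the point is that recomposition is the union of both stores, so nothing can be lost:

```python
# Hypothetical predicate whitelist -- the real decomposer handles many more.
SIMPLE_PREDICATES = {"rdfs:subClassOf", "rdfs:label", "rdfs:comment"}


def decompose(triples):
    """Split triples into relational rows (simple constructs) and
    verbatim fragments destined for complex_axioms (everything else)."""
    rows, complex_axioms = [], []
    for triple in triples:
        bucket = rows if triple[1] in SIMPLE_PREDICATES else complex_axioms
        bucket.append(triple)
    return rows, complex_axioms


def recompose(rows, complex_axioms):
    """Export is the union of both stores -- lossless by construction."""
    return rows + complex_axioms
```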

5.2 Deterministic Serialization

OntoKit already has serialize_deterministic() using RDFLib's to_isomorphic(). The export service builds an RDFLib graph from DB rows, then serializes deterministically.
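Determinism hinges on sorting every collection before emission, so identical DB state yields byte-identical output regardless of row order. A toy sketch of that principle (the real export goes through RDFLib and serialize_deterministic(); the row shapes here are illustrative):

```python
def export_to_turtle(entities, labels, prefixes):
    """Emit Turtle-like text from DB-shaped rows, sorting everything so
    the output depends only on content, never on row insertion order."""
    lines = [f"@prefix {p}: <{ns}> ." for p, ns in sorted(prefixes.items())]
    lines.append("")
    for iri, entity_type in sorted(entities):
        lines.append(f"<{iri}> a {entity_type} ;")
        for value, lng in sorted((v, l) for (e, v, l) in labels if e == iri):
            lines.append(f'    rdfs:label "{value}"@{lng} ;')
        # Terminate the entity block: swap the trailing ';' for '.'
        lines[-1] = lines[-1].rstrip(" ;") + " ."
    return "\n".join(lines)
```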

5.3 Branching Model

PR #10's tables already have a branch column on every table. Multiple branch states can coexist in the DB. Proposed workflow:

  • main branch is the primary working state, always in the DB
  • Feature branches can also live in the DB or only on GitHub
  • System/admin users create GitHub branches and PRs for formal review
  • Regular users edit the main branch working state via the OntoKit UI

5.4 Linting

PostgreSQL makes most lint rules simpler, not harder:

| Lint Rule | PostgreSQL Query |
|---|---|
| undefined-parent | FROM class_hierarchy h LEFT JOIN ontology_entities e ON e.iri = h.parent_iri WHERE e.id IS NULL |
| circular-hierarchy | Recursive CTE with cycle detection |
| duplicate-label | GROUP BY value HAVING COUNT(*) > 1 |
| missing-label | LEFT JOIN entity_labels WHERE labels.id IS NULL |
| orphan-class | Classes with no parent and no children |
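As an example, the circular-hierarchy rule can be written as a recursive CTE with cycle detection. The sketch below runs the query against SQLite purely for illustration; the production query targets PostgreSQL, the table/column names follow the proposed class_hierarchy schema, and the depth cap mirrors PR #10's limit of 100:

```python
import sqlite3

# Walk upward from every child; a class is in a cycle when the walk
# returns to its starting point. The depth cap guards against runaway
# recursion on pathological data.
CIRCULAR_HIERARCHY_SQL = """
WITH RECURSIVE walk(start, node, depth) AS (
    SELECT child_iri, parent_iri, 1 FROM class_hierarchy
    UNION ALL
    SELECT w.start, h.parent_iri, w.depth + 1
    FROM walk w JOIN class_hierarchy h ON h.child_iri = w.node
    WHERE w.node != w.start AND w.depth < 100
)
SELECT DISTINCT start FROM walk WHERE node = start
"""


def find_cycles(conn):
    """Return the set of class IRIs that participate in a hierarchy cycle."""
    return {row[0] for row in conn.execute(CIRCULAR_HIERARCHY_SQL)}
```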

6. What Changes in Each Codebase

6.1 ontokit-api (Backend)

Evolve from PR #10:

  • Rename indexed_* tables to ontology_* / entity_* / class_*
  • Add write methods to the service layer
  • Add new tables (complex_axioms, ontology_prefixes, ontology_metadata, property/individual tables)
  • Add export_to_turtle() and import_from_turtle()
  • Add GitHub API service and webhook handler
  • Add timestamp + user tracking columns

Remove / Simplify:

  • git/bare_repository.py -- no more internal Git repos
  • git/repository.py -- already deprecated
  • MinIO storage integration
  • services/ontology.py -- RDFLib-based entity operations replaced by SQL
  • services/indexed_ontology.py -- facade no longer needed
  • api/routes/pull_requests.py -- simplify to thin wrapper around GitHub API
  • services/pull_request_service.py, services/github_sync.py

Keep (mostly unchanged):

  • Auth (Zitadel OIDC), Project/member management
  • Embedding/semantic search (already PostgreSQL-based via pgvector)
  • Linting (reimplemented as SQL queries), WebSocket collaboration
  • Notifications, analytics, change events

6.2 ontokit-web (Frontend)

Add: "Commit to GitHub" action, GitHub PR integration views, improved import flow

Remove: Internal branch management, PR workflow, and diff viewer UIs

Keep: Ontology editor (both modes), entity tree, Monaco editor, search, analytics, graph visualization


7. Migration Path

Phase 0: Merge & Rename PR #10 (Foundation)

Phase 1: Write Path (Make PostgreSQL Primary for Writes)

  • Implement write methods (create, update, delete)
  • Implement import_from_turtle() and export_to_turtle()
  • Validate round-trip fidelity: import → export → diff
  • Keep Git-based storage running in parallel (dual-write)
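The round-trip fidelity check in Phase 1 can start as a normalized line-set diff; the function names below are placeholders for the real import_from_turtle()/export_to_turtle() pair, and a production check would compare graphs with RDFLib's to_isomorphic() so blank-node relabeling doesn't produce false diffs:

```python
def round_trip_ok(original_turtle, importer, exporter):
    """Import the Turtle, export it back, and diff the two texts as sets
    of whitespace-normalized lines.

    Returns (ok, lines_lost, lines_gained).
    """
    def normalize(text):
        return {line.strip() for line in text.splitlines() if line.strip()}

    exported = exporter(importer(original_turtle))
    before, after = normalize(original_turtle), normalize(exported)
    return before == after, sorted(before - after), sorted(after - before)
```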

Phase 2: GitHub Integration

  • GitHub API service (commits, branches, PRs)
  • "Commit to GitHub" UI workflow
  • GitHub webhook handler

Phase 3: Cut Over

  • Switch all reads/writes to PostgreSQL, version control to GitHub
  • Migrate existing project data
  • Remove pygit2, MinIO

Phase 4: Cleanup & Polish


8. Comparison with Prior Proposals

This supersedes the "Ontology Atomization" plan and addresses every concern from its critical analysis:

| Concern | How This Addresses It |
|---|---|
| Abandoning Turtle as source of truth | Turtle remains the versioned format on GitHub |
| Git performance with 50K+ files | GitHub stores one Turtle file, not 50K entity files |
| Lossy JSON schema for OWL | No proprietary schema; complex_axioms stores Turtle fragments |
| Turtle regeneration bottleneck | Serialization only on explicit commit, not every edit |
| Linter requires full rewrite | SQL-based linting is simpler for most rules |
| Big-bang migration risk | Phased migration with dual-write period |
| Tool ecosystem compatibility | Full -- Turtle on GitHub works with Protégé, ROBOT, OWL API |

9. Why Build on PR #10

| What PR #10 Provides | Lines of Code | Effort Saved |
|---|---|---|
| Five PostgreSQL tables with indexes | ~340 | Schema design, index tuning, trigram setup |
| Full reindex + graph extraction | ~400 | Entity extraction, label resolution, hierarchy walking |
| Query methods (tree, detail, ancestors, search) | ~500 | Recursive CTEs, label preference, bulk loading |
| Transparent fallback facade | ~300 | Strategy pattern, schema conversion |
| Background worker + API integration | ~240 | ARQ job, Redis pubsub, dependency injection |
| Total | ~1,780 | Weeks of development |

10. Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| RDF round-trip data loss | Medium | High | Test suite + complex_axioms Turtle fragments |
| GitHub API rate limits | Low | Medium | 5,000 req/hour; commits are infrequent |
| GitHub dependency | Low | Medium | Serialization is Git-host-agnostic |
| Complex OWL not form-editable | High | Low | Same as today; Turtle editor handles these |
| Branch state confusion | Medium | Medium | Clear UX messaging |
| Migration breaks existing projects | Low | Medium | Dual-write period for validation |

11. Open Questions for Discussion

  1. GitHub org/repo structure -- one GitHub repo per OntoKit project, or monorepo?
  2. Modular Turtle -- split large ontologies into multiple files (using owl:imports), or single file?
  3. Branch UX -- branch switcher in OntoKit that loads from GitHub, or GitHub-only branching?
  4. Commit granularity -- auto-commit on every save, or user-initiated only?
  5. GitHub App vs. PAT -- GitHub App (fine-grained) or user PATs (simpler)?
  6. Self-hosted option -- GitHub-only for v1, or support Gitea/GitLab from day one?
  7. Suggestion workflow -- keep custom suggestion flow or replace with GitHub fork-and-PR?
  8. PR #10 merge timing -- merge now and evolve, or develop the full write path first?

cc @JohnRDOrazio @damienriehl

Issue actions