Consolidate duplicate detection across lint, consistency, and duplicates tab #87

JohnRDOrazio (Maintainer) · 2026-04-15T14:21:41Z

Context

We currently have three separate systems that detect duplicate labels, each with different characteristics:

1. Lint rule: `duplicate-label`

Source: rdflib Graph (parsed from git/storage)
Match: Exact, case-insensitive (`str(label).strip().lower()`)
Scope: owl:Class entities only
Output: Warning issues in the Lint tab
Performance: Requires parsing the full ontology file (minutes for large files)
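
A minimal sketch of the matching described for this rule, assuming labels are already available as (entity IRI, label) pairs. The function name `find_duplicate_labels_ci` and the sample IRIs are hypothetical and not taken from the codebase; only the normalization `str(label).strip().lower()` comes from the description above.

```python
from collections import defaultdict


def find_duplicate_labels_ci(labels):
    """Group entity IRIs whose labels are equal after strip() + lower().

    `labels` is an iterable of (entity_iri, label) pairs. Hypothetical helper:
    the real rule reads its labels from an rdflib Graph.
    """
    groups = defaultdict(set)
    for entity, label in labels:
        key = str(label).strip().lower()  # normalization quoted in the rule
        groups[key].add(entity)
    # A label counts as duplicated only when two or more entities share it.
    return {key: sorted(ents) for key, ents in groups.items() if len(ents) > 1}


sample = [
    ("ex:Heart", "Heart"),
    ("ex:HeartOrgan", " heart "),  # differs only by case and whitespace
    ("ex:Lung", "Lung"),
]
duplicates = find_duplicate_labels_ci(sample)
```

Here `ex:Heart` and `ex:HeartOrgan` end up in the same group, because the normalization removes the case and whitespace differences.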

2. Consistency rule: `duplicate_label`

Source: rdflib Graph (parsed from git/storage)
Match: Exact value + exact language tag
Scope: All entity types (classes, properties, individuals)
Output: Warning issues in the Consistency tab
Performance: Same — requires full graph parsing
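
For contrast with the lint rule, here is a sketch of matching on exact value plus exact language tag. It assumes labels arrive as (entity IRI, value, language tag) triples; `find_duplicate_labels_exact` and the sample data are hypothetical, not the actual consistency code.

```python
from collections import defaultdict


def find_duplicate_labels_exact(labels):
    """Group entity IRIs that share the same label value AND language tag.

    `labels` is an iterable of (entity_iri, value, lang) triples, where `lang`
    is a language tag such as "en", or None for untagged literals.
    """
    groups = defaultdict(set)
    for entity, value, lang in labels:
        groups[(value, lang)].add(entity)  # no case folding, no trimming
    return {key: sorted(ents) for key, ents in groups.items() if len(ents) > 1}


sample = [
    ("ex:Heart", "Heart", "en"),
    ("ex:Heart2", "Heart", "en"),      # same value, same tag: duplicate
    ("ex:Cuore", "Heart", "it"),       # same value, different language tag
    ("ex:HeartLower", "heart", "en"),  # differs only by case
]
exact_duplicates = find_duplicate_labels_exact(sample)
```

Only the first two entries are grouped. The case-only variant would be flagged by the lint rule's normalization but not by this one, which is the behavioral difference between the two rules.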

3. Duplicates tab: `find_duplicates_sql`

Source: PostgreSQL pg_trgm GIN index on indexed_labels
Match: Fuzzy (≥85% trigram similarity, configurable threshold)
Scope: All entity types, same-type matching
Output: Clustered results in the Duplicates tab
Performance: ~56 seconds for 15K classes using indexed lookups, no file parsing
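
To make the fuzzy match concrete, the sketch below approximates in plain Python how pg_trgm scores two strings: lowercase the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space, collect the three-character substrings, and divide the number of shared trigrams by the number of distinct trigrams overall. It is an approximation for illustration only, not the query that `find_duplicates_sql` runs, and it may differ from PostgreSQL on locale-specific characters.

```python
import re


def trigrams(text):
    """Approximate pg_trgm trigram extraction for a string."""
    result = set()
    # Words are runs of alphanumeric characters; everything else is a separator.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by the total number of distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_duplicate(a, b, threshold=0.85):
    """True when two labels meet the (configurable) similarity threshold."""
    return similarity(a, b) >= threshold
```

Under this approximation `similarity("Heart Valve", "heart-valve")` is 1.0, because case and punctuation are discarded before trigrams are extracted. A similarity of 1.0 therefore does not guarantee that the two strings are byte-for-byte identical.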

Overlap

The lint and consistency rules both catch exact duplicates, with minor differences in scoping and matching. The duplicates tab catches fuzzy matches, which form a superset: exact duplicates are 100% similar, so they are caught as well.

Questions for discussion

Should we consolidate? The lint duplicate-label and consistency duplicate_label rules are largely redundant. Should one be removed, or should they be merged into one?
Should lint/consistency also use PostgreSQL? Both currently parse the full ontology with rdflib. Migrating these checks to SQL (like the duplicates tab) would dramatically improve performance. This could be a broader initiative to make all quality checks index-based.
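Should the duplicates tab subsume exact-match detection? Fuzzy matching at threshold=1.0 comes close to exact matching, so the duplicates tab could handle both exact and fuzzy cases, making the separate lint/consistency rules unnecessary. One caveat: pg_trgm lowercases text and ignores non-alphanumeric characters when extracting trigrams, so a similarity of 1.0 is closer to the lint rule's case-insensitive match than to the consistency rule's exact value + language tag match.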
Different UX for different severities? Perhaps exact duplicates should remain as lint/consistency warnings (they are almost certainly a problem), while fuzzy matches stay in the dedicated Duplicates tab (they require human judgment). In that case, the exact-match rules should be migrated to SQL for performance but kept separate from fuzzy detection.

Related

PR #80 (refactor: offload quality endpoints to ARQ worker queue): offloaded quality endpoints to the ARQ worker queue and rewrote duplicate detection to use pg_trgm
Issue #85 (feat: persist duplicate detection results in PostgreSQL)
Issue #79 (Offload quality endpoints to ARQ worker queue): original issue for quality endpoint offloading

JohnRDOrazio (Maintainer, Author) · 2026-04-15T14:28:14Z

Correction: The lint check does use rdflib for computation, but it stores results in PostgreSQL (lint_runs + lint_issues tables). This is why 35K+ lint issues load instantly on page reload — they are served from the database, not recomputed.

The consistency and duplicates checks only cache in Redis with a 10-minute TTL, which is why results disappear on page reload.

| System | Computation | Result storage | Reload behavior |
| --- | --- | --- | --- |
| Lint | rdflib Graph | PostgreSQL (`lint_issues`) | Instant, persisted |
| Consistency | rdflib Graph | Redis (10 min TTL) | Gone, must re-run |
| Duplicates | PostgreSQL `pg_trgm` | Redis (10 min TTL) | Gone, must re-run |

This makes the case for #85 (persist duplicate detection results) even stronger, and suggests the consistency check should get the same treatment.

JohnRDOrazio (Maintainer, Author) · 2026-04-15T14:36:21Z

Detailed rule comparison: Lint vs Consistency

A full code review shows that the consistency check is not redundant: the two systems are complementary.

Overlapping rules (5) — both check, with different behavior

| Lint rule | Consistency rule | Difference |
| --- | --- | --- |
| `missing-label` | `missing_label` | Lint: classes only. Consistency: all entity types |
| `missing-comment` | `missing_comment` | Same split: lint is class-only, consistency covers all entity types |
| `orphan-class` | `orphan_class` | Consistency is stricter (also checks for instances) |
| `duplicate-label` | `duplicate_label` | Lint: case-insensitive. Consistency: exact value + language tag |
| `circular-hierarchy` | `cycle_detect` | Different DFS algorithms |
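
Neither DFS implementation is quoted in this thread, so the following is only a generic sketch of depth-first cycle detection over a subclass hierarchy. It assumes the hierarchy is given as a dict mapping each class to its direct parents; `find_cycle` is a hypothetical name and is not meant to match either `circular-hierarchy` or `cycle_detect`.

```python
def find_cycle(parents):
    """Return one subclass cycle as a list of class IRIs, or None.

    `parents` maps each class to the classes it is declared a subclass of.
    Three-color DFS: WHITE = unvisited, GRAY = on the current path, BLACK = done.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for parent in parents.get(node, ()):
            state = color.get(parent, WHITE)
            if state == GRAY:  # back edge: parent is already on the current path
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for start in list(parents):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None
```

The recursion is kept for readability. A very deep hierarchy could exceed Python's recursion limit, in which case an explicit stack would be the safer choice.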

Consistency-only rules (7) — NOT caught by lint

unused_property — properties declared but never used as predicates
orphan_individual — individuals typed with undeclared classes
empty_domain / empty_range — properties missing rdfs:domain/range
deprecated_parent — classes inheriting from owl:deprecated parents
dangling_ref — subClassOf/domain/range pointing to undeclared entities
multi_root — ontology with >5 root classes (quality metric)
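
Two of these rules are easy to illustrate on a toy hierarchy. The sketch below assumes a set of declared classes plus a dict of direct parents, and it assumes that a "root" is a declared class with no parent at all. The helper names and sample data are hypothetical, and the real rules may define these terms differently.

```python
def root_classes(declared, parents):
    """Declared classes with no subClassOf parent (assumed definition of a root)."""
    return sorted(c for c in declared if not parents.get(c))


def dangling_refs(declared, parents):
    """(child, parent) pairs whose parent is never declared."""
    return sorted(
        (child, parent)
        for child, supers in parents.items()
        for parent in supers
        if parent not in declared
    )


MAX_ROOTS = 5  # threshold quoted for multi_root in the list above

declared = {"ex:Organ", "ex:Heart", "ex:Lung"}
parents = {"ex:Heart": ["ex:Organ"], "ex:Lung": ["ex:Organ", "ex:Missing"]}

roots = root_classes(declared, parents)
too_many_roots = len(roots) > MAX_ROOTS  # multi_root would fire when True
dangling = dangling_refs(declared, parents)  # dangling_ref candidates
```

In this sample, `ex:Missing` is referenced as a parent but never declared, so it is reported as a dangling reference, while the single root stays well under the threshold.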

Lint-only rules (11+) — NOT caught by consistency

undefined-parent, empty-label, label-per-language, undefined-prefix, duplicate-triple, domain-violation, range-violation, cardinality-violation, disjoint-violation, inverse-property-inconsistency, missing-english-label

Recommendation

Rather than removing either system, the path forward should be:

Deduplicate the 5 overlapping rules — keep the broader-scoped version (usually consistency) and remove the narrower lint equivalent, or merge them into one
Migrate both to PostgreSQL — both currently parse the full ontology with rdflib; all checks could be rewritten as SQL queries against the existing indexed_entities / indexed_labels / indexed_annotations tables
Persist consistency results like lint already does — add consistency_runs / consistency_issues tables (similar to lint_runs / lint_issues)
Consider merging the UI — a single "Health Check" with unified issue categories rather than three separate tabs
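
The persistence step could look roughly like the sketch below. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL so the example is self-contained. The table names `consistency_runs` and `consistency_issues` come from the recommendation above; every column name, the `save_run` and `latest_issues` helpers, and the sample issue are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.executescript("""
    CREATE TABLE consistency_runs (
        id INTEGER PRIMARY KEY,
        project_id TEXT NOT NULL,
        finished_at TEXT NOT NULL
    );
    CREATE TABLE consistency_issues (
        id INTEGER PRIMARY KEY,
        run_id INTEGER NOT NULL REFERENCES consistency_runs(id),
        rule TEXT NOT NULL,
        entity_iri TEXT NOT NULL,
        message TEXT NOT NULL
    );
""")


def save_run(conn, project_id, finished_at, issues):
    """Store one run and its issues; returns the new run id."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO consistency_runs (project_id, finished_at) VALUES (?, ?)",
            (project_id, finished_at),
        )
        run_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO consistency_issues (run_id, rule, entity_iri, message) "
            "VALUES (?, ?, ?, ?)",
            [(run_id, rule, iri, msg) for rule, iri, msg in issues],
        )
    return run_id


def latest_issues(conn, project_id):
    """Issues from the most recent run, read back without recomputing."""
    return conn.execute(
        "SELECT i.rule, i.entity_iri, i.message FROM consistency_issues i "
        "WHERE i.run_id = (SELECT MAX(id) FROM consistency_runs WHERE project_id = ?) "
        "ORDER BY i.id",
        (project_id,),
    ).fetchall()


save_run(conn, "demo", "2026-04-15T14:36:21Z",
         [("duplicate_label", "ex:Heart", "Label shared with ex:Heart2")])
```

A page reload would then call something like `latest_issues(conn, "demo")` and get the stored rows back, rather than depending on a cache entry that expires.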
