Consolidate duplicate detection across lint, consistency, and duplicates tab #87
Replies: 2 comments
Correction: The lint check does use rdflib for computation, but it stores its results in PostgreSQL. The consistency and duplicates checks only cache their results in Redis with a 10-minute TTL, which is why results can disappear on a page reload once the cache entry has expired.
This makes the case for #85 (persist duplicate detection results) even stronger, and suggests the consistency check should get the same treatment.
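To illustrate why a TTL-only cache loses results, here is a minimal in-memory sketch of expiry semantics. It is not the project's Redis code: the `TTLCache` class, the cache key, and the injectable `clock` parameter are hypothetical and exist only for this example.

```python
import time


class TTLCache:
    """Hypothetical stand-in for a cache whose entries expire after a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be simulated
        self._data = {}

    def set(self, key, value):
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: the result is gone and must be recomputed.
            del self._data[key]
            return None
        return value


# Simulate a 10-minute TTL with a fake clock.
now = [0.0]
cache = TTLCache(ttl_seconds=600, clock=lambda: now[0])
cache.set("duplicates:project-1", ["ex:A", "ex:B"])
fresh = cache.get("duplicates:project-1")   # still cached
now[0] = 601.0
stale = cache.get("duplicates:project-1")   # expired, returns None
```

A persisted result, as proposed in #85, would not have the second lookup return nothing; it would stay available until the check is rerun.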
Detailed rule comparison: Lint vs Consistency
After a full code review, the consistency check is not redundant: the two systems are complementary.
Overlapping rules (5): both check, with different behavior
Consistency-only rules (7): NOT caught by lint
Lint-only rules (11+): NOT caught by consistency
Recommendation
Rather than removing either system, the path forward should be:
Context
We currently have three separate systems that detect duplicate labels, each with different characteristics:
1. Lint rule: duplicate-label
   - str(label).strip().lower() (label normalization)
   - owl:Class entities only
2. Consistency rule: duplicate_label
3. Duplicates tab: find_duplicates_sql
   - pg_trgm
   - GIN index on indexed_labels
Overlap
The lint and consistency rules both catch exact duplicates with minor differences in scoping and matching. The duplicates tab catches fuzzy matches (which is a superset — exact duplicates are 100% similar and will also be caught).
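The relationship between exact and fuzzy detection can be sketched in a few lines. This is only an approximation under stated assumptions: `normalize` mirrors the `str(label).strip().lower()` expression from the lint rule, while `exact_duplicates`, `trigrams`, and `similarity` are hypothetical helpers that loosely imitate trigram-set similarity and are not the project's code or the actual pg_trgm implementation.

```python
from __future__ import annotations

import re
from collections import defaultdict


def normalize(label) -> str:
    # Same normalization expression as the lint rule described above.
    return str(label).strip().lower()


def exact_duplicates(pairs) -> dict[str, list[str]]:
    """Group entity IDs by normalized label; keep labels used by 2+ entities."""
    groups = defaultdict(set)
    for entity_id, label in pairs:
        groups[normalize(label)].add(entity_id)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


def trigrams(text: str) -> set[str]:
    """Simplified trigram extraction: lowercase, split into alphanumeric
    words, pad each word with two leading spaces and one trailing space."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


labels = [("ex:A", "Heart"), ("ex:B", " heart "), ("ex:C", "Lung")]
dupes = exact_duplicates(labels)            # only "heart" is shared
exact_score = similarity("Heart", " heart ")
near_score = similarity("heart valve", "heart valves")
```

In this sketch every exact duplicate scores 1.0, which is the superset relationship described above, while near matches such as singular and plural forms land strictly between 0 and 1. Note that a score of 1.0 only means identical trigram sets, which is slightly broader than string equality.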
Questions for discussion
Should we consolidate? The lint duplicate-label and consistency duplicate_label rules are largely redundant. Should one be removed, or should they be merged into one?
Should lint/consistency also use PostgreSQL? Both currently parse the full ontology with rdflib. Migrating these checks to SQL (like the duplicates tab) would dramatically improve performance. This could be a broader initiative to make all quality checks index-based.
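Should the duplicates tab subsume exact-match detection? Fuzzy matching at threshold=1.0 is close to exact matching (pg_trgm compares sets of lowercased trigrams, so labels that differ only in case, punctuation, or word order can also score 1.0), so the duplicates tab could handle both exact and fuzzy cases, making the separate lint/consistency rules unnecessary.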
Different UX for different severities? Perhaps exact duplicates should remain as lint/consistency warnings (they are almost certainly a problem), while fuzzy matches stay in the dedicated Duplicates tab (they require human judgment). In that case, the exact-match rules should be migrated to SQL for performance but kept separate from fuzzy detection.
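As a rough illustration of what an index-based exact-match rule could look like, the sketch below runs a grouping query against an in-memory SQLite database as a stand-in for PostgreSQL. The table name `indexed_labels` comes from the discussion above, but the column names `entity_iri` and `label`, the query, and the function name are assumptions made for this example, not the project's schema.

```python
import sqlite3

# Assumed columns: entity_iri, label. Only the table name is from the thread.
# Note: SQL trim() strips spaces only and SQLite lower() is ASCII-only,
# so this is not byte-for-byte identical to Python's strip().lower().
EXACT_DUPLICATES_SQL = """
SELECT DISTINCT lower(trim(label)) AS normalized, entity_iri
FROM indexed_labels
WHERE lower(trim(label)) IN (
    SELECT lower(trim(label))
    FROM indexed_labels
    GROUP BY lower(trim(label))
    HAVING COUNT(DISTINCT entity_iri) > 1
)
ORDER BY normalized, entity_iri
"""


def find_exact_duplicates_sql(conn):
    """Return {normalized_label: [entity_iri, ...]} for labels shared by 2+ entities."""
    groups = {}
    for normalized, entity_iri in conn.execute(EXACT_DUPLICATES_SQL):
        groups.setdefault(normalized, []).append(entity_iri)
    return groups


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indexed_labels (entity_iri TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO indexed_labels VALUES (?, ?)",
    [("ex:A", "Heart"), ("ex:B", " heart "), ("ex:C", "Lung")],
)
result = find_exact_duplicates_sql(conn)
```

Keeping a query like this separate from the fuzzy search matches the last option above: exact matches can be reported as warnings, while fuzzy candidates stay in the Duplicates tab for human review.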
Related
pg_trgm