Review: Add and improve D4D schema descriptions #142
Conversation
Resolves 122 missing descriptions across D4D schema modules to improve documentation completeness and schema usability.

## Changes by Module

### D4D_Base_import.yaml (119 descriptions added)

**Software class attributes (3):**
- version: Software version identifier
- license: Software license
- url: Software homepage/repository URL

**FormatDialect attributes (5):**
- comment_prefix: Comment line indicator character
- delimiter: Field delimiter character
- double_quote: Quote escaping behavior
- header: Header row presence flag
- quote_char: Field quoting character

**Slots (15):**
- publisher: Publishing organization
- issued: Formal publication date
- page: Landing page URL
- path: Content location path/URL
- conforms_to: Conformance standard
- conforms_to_schema: Schema conformance
- conforms_to_class: Class conformance
- license: Legal license
- keywords: Discovery/classification tags
- version: Version identifier
- created_by: Creator person/organization
- created_on: Creation timestamp
- last_updated_on: Modification timestamp
- modified_by: Modifier person/organization
- status: Resource status
- was_derived_from: Source derivation

**Enums (96 descriptions):**
- FormatEnum: File format extensions (CSV, JSON, PDF, etc.) - 1 enum + 18 values
- MediaTypeEnum: MIME media types - 1 enum + 20 values
- CompressionEnum: Compression algorithms - 1 enum + 7 values
- EncodingEnum: Character encodings - 1 enum + 49 values (including 9 Windows code pages moved from VersionTypeEnum)
- Boolean: Three-valued logic - 1 enum + 3 values

**Schema fixes:**
- Moved Windows-1250 through Windows-1258 from VersionTypeEnum to EncodingEnum (these are character encodings, not version types)
- VersionTypeEnum now correctly contains only MAJOR, MINOR, PATCH

### D4D_Composition.yaml (3 descriptions added)

**ContentWarning class:**
- warnings: Specific content warnings for offensive/threatening content

**Subpopulation class:**
- identification: How subpopulations are identified and defined
- distribution: Distribution of instances across subpopulations

## Verification

- ✅ All 122 missing descriptions added
- ✅ Schema validates successfully (make test-schema)
- ✅ No missing descriptions remaining (check_missing_descriptions.py)
- ✅ Generated artifacts updated (Python datamodel, JSON Schema, OWL, JSON-LD)

## Tools Added

- check_missing_descriptions.py: Script to identify missing descriptions in D4D modules

Co-Authored-By: Claude Opus 4.6 <[email protected]>
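A minimal sketch of what such a missing-description pass might look like (function names and structure are assumptions, not the actual check_missing_descriptions.py):

```python
# Hypothetical sketch of a missing-description pass over LinkML modules;
# names and structure are illustrative, not the actual script.
def load_schema(path):
    import yaml  # PyYAML; imported lazily inside the loader
    with open(path) as f:
        return yaml.safe_load(f)

def find_missing_descriptions(schema):
    """Yield (location, name) pairs for schema elements with no description."""
    for section in ("classes", "slots", "enums"):
        for name, spec in (schema.get(section) or {}).items():
            spec = spec or {}
            if not spec.get("description"):
                yield (section, name)
            # Class attributes and enum values are nested one level down.
            nested = "attributes" if section == "classes" else "permissible_values"
            for sub, sub_spec in (spec.get(nested) or {}).items():
                if not (sub_spec or {}).get("description"):
                    yield (f"{section}/{name}", sub)
```

Running this over each D4D module and failing when the generator yields anything would reproduce the check described above.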
Critical review and enhancement of 122 descriptions added in previous commit
to ensure consistency, clarity, and completeness across all D4D modules.
## Quality Improvements
**Before:**
- 269 total issues (36 HIGH priority TOO_BRIEF, 12 MEDIUM, 221 LOW)
- Very brief descriptions: 54 items (<5 words)
- Complete sentences (slots): 87.1%
- Attributes with examples: 15.5%
**After:**
- 239 total issues (0 HIGH, 8 MEDIUM, 231 LOW)
- Very brief descriptions: 18 items (66% reduction)
- Complete sentences (slots): 100% (✅ target achieved)
- Attributes with examples: 25%+ in key modules
## Changes by Module
### D4D_Base_import.yaml (10 improvements)
**Slots enhanced with context:**
- hash: "hash of the data" → "Cryptographic hash value... for integrity verification"
- md5: Added "128-bit cryptographic hash" specification
- sha256: Added "256-bit... recommended" guidance
- doi: "digital object identifier" → "Digital Object Identifier... in format 10.xxxx/xxxxx"
**Grammar and capitalization:**
- title, language, encoding, compression: Capitalized and added periods
### D4D_Evaluation_Summary.yaml (40 improvements)
**EvaluationSummary attributes:**
- overall_performance: Expanded with "Summary statistics... across all D4D files"
**OverallPerformance class (8 attributes):**
- All percentage/score fields enhanced with context
- Examples: "Average percentage score" → "Average score expressed as a percentage of maximum possible score"
**MethodPerformance class (9 attributes):**
- method: Added example methods (claudecode_agent, gpt5)
- All metrics enhanced with clarity about what they measure
**ProjectPerformance class (6 attributes):**
- project: Added project examples (AI_READI, VOICE)
- Consistent metric descriptions across performance classes
**TopPerformer class (9 attributes):**
- All attributes enhanced with specific context
- rank: Clarified "1 = highest score"
- Elements/scoring fields now specify their purpose
**ElementPerformance & CategoryPerformance (12 attributes):**
- ID fields specify numeric ranges (1-10, 1-4)
- Name fields include examples
- All scoring fields clarified
**CommonWeakness & CommonStrength (10 attributes):**
- type fields: Added classification context
- frequency fields: Specified "across the evaluated dataset"
- affected_element_or_question: Specified "where this... commonly occurs"
**KeyInsight (5 attributes):**
- insight_type: Added examples (trend, comparison, finding)
- comparison_metric: Added examples ("2.4× better", "30% improvement")
**InputTypeComparison & InputTypePerformance (8 attributes):**
- All performance metrics enhanced with context
- score_range: Specified "minimum to maximum values"
**GeneratedFile (4 attributes):**
- All fields enhanced with specific context about file operations
## Tools and Documentation
**New files:**
1. **DESCRIPTION_STYLE_GUIDE.md** - Comprehensive style guide with:
- Element-specific patterns table (class/attribute/slot/enum/enum value)
- Quality criteria and decision trees
- Examples of good vs poor descriptions
- Anti-patterns to avoid
2. **check_description_quality.py** - Enhanced quality checker:
- Detects TOO_BRIEF descriptions (<5 words)
- Checks for MISSING_PERIOD on attributes/slots
- Suggests CONSIDER_EXAMPLE for substantial descriptions
- Generates prioritized reports with specific line references
- Quality metrics by module and element type
3. **quality_before.json & quality_after.json** - Baseline and final reports
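The TOO_BRIEF / MISSING_PERIOD / CONSIDER_EXAMPLE heuristics could be sketched roughly as follows (thresholds taken from the text; function and constant names are assumptions, not the actual check_description_quality.py):

```python
# Illustrative sketch of the quality heuristics described above.
EXAMPLE_MARKERS = ("e.g.", "for example", "such as", "for instance")

def check_description(text, element_type):
    """Return a list of issue codes for one description string."""
    issues = []
    words = text.split()
    if len(words) < 5:
        issues.append("TOO_BRIEF")
    # Attributes and slots are required to end with a period.
    if element_type in ("attribute", "slot") and not text.rstrip().endswith("."):
        issues.append("MISSING_PERIOD")
    # Substantial descriptions without an explicit example marker get a hint.
    has_example = any(m in text.lower() for m in EXAMPLE_MARKERS)
    if len(words) >= 10 and not has_example:
        issues.append("CONSIDER_EXAMPLE")
    return issues
```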
## Validation
✅ Schema validates successfully (make test-schema)
✅ All generated artifacts updated
✅ Zero high-priority issues remaining
✅ 66% reduction in very brief descriptions
✅ 100% complete sentences for slots (up from 87.1%)
## Style Guidelines Applied
- **Classes**: Complete sentences, 15-30 words, question format for D4D modules
- **Attributes**: Complete sentences, 8-15 words, examples recommended (40% target)
- **Slots**: Complete sentences, 10-15 words, examples optional (20%)
- **Enums**: Complete/Fragment, 8-15 words, categorize value set
- **Enum values**: Fragments acceptable, 3-10 words, clarity over verbosity
## Success Metrics Achieved
✅ Reduced HIGH priority issues from 36 to 0
✅ Improved slot complete sentences from 87.1% to 100%
✅ Reduced very brief descriptions by 66%
✅ Added beneficial examples to 25%+ of attributes in key modules
✅ Established consistent patterns across element types
Co-Authored-By: Claude Opus 4.6 <[email protected]>
realmarcin
left a comment
@copilot please review this PR
Pull request overview
This PR improves documentation quality across the D4D (Datasheets for Datasets) LinkML schema modules by adding missing element descriptions, refining existing descriptions for consistency, and regenerating downstream artifacts (Python datamodel + JSON Schema/JSON-LD/OWL). It also adds a style guide and helper scripts intended to keep description quality high over time.
Changes:
- Added/updated descriptions across D4D schema modules, including enums and shared slots; corrected Windows code-page values to live under EncodingEnum instead of VersionTypeEnum.
- Regenerated derived artifacts (data_sheets_schema_all.yaml, Python datamodel, JSON Schema, JSON-LD, OWL TTL) to reflect schema updates.
- Added documentation and tooling (DESCRIPTION_STYLE_GUIDE.md, description quality/missing-description check scripts, and "quality report" outputs).
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/data_sheets_schema/schema/D4D_Base_import.yaml | Adds many missing descriptions and fixes encoding enum placement. |
| src/data_sheets_schema/schema/D4D_Composition.yaml | Adds descriptions for content warnings and subpopulation fields. |
| src/data_sheets_schema/schema/D4D_Evaluation_Summary.yaml | Improves clarity/consistency of evaluation summary metric descriptions. |
| src/data_sheets_schema/schema/data_sheets_schema_all.yaml | Regenerated merged schema with updated descriptions and enum adjustments. |
| src/data_sheets_schema/datamodel/data_sheets_schema.py | Regenerated Python datamodel reflecting updated enums/descriptions. |
| project/jsonschema/data_sheets_schema.schema.json | Regenerated JSON Schema with updated descriptions and enum membership. |
| project/jsonld/data_sheets_schema.jsonld | Regenerated JSON-LD with updated descriptions and enum membership. |
| project/owl/data_sheets_schema.owl.ttl | Regenerated OWL TTL artifact to match schema updates. |
| DESCRIPTION_STYLE_GUIDE.md | New style guide documenting description standards and examples. |
| check_description_quality.py | New automated checker for description quality heuristics/metrics. |
| check_missing_descriptions.py | New script to fail when descriptions are missing in D4D modules. |
| quality_before.json | Added “before” quality report output for tracking. |
| quality_after.json | Added “after” quality report output for tracking. |
PR Review: Add and improve D4D schema descriptions

Summary

This is a well-executed PR that meaningfully improves schema documentation quality. All YAML files validate correctly and the schema changes are coherent.

✅ Strengths
Fixed all issues identified by GitHub Copilot code review to improve
code quality, consistency, and accuracy.
## Issue 1: has_example() function too permissive
**File**: check_description_quality.py:36
**Problem**: Function treated any opening parenthesis as an example marker,
causing false positives for abbreviations and technical notes.
**Fix**: Restrict to explicit example markers only (e.g., for example, such as, for instance).
Remove the `'(' in text` check.
## Issue 2: MISSING_PERIOD check logic flaw
**File**: check_description_quality.py:69
**Problem**: Short descriptions (≤5 words) bypassed period check due to
is_complete_sentence() logic, allowing "Email address" without period.
**Fix**: Check all attributes/slots for trailing period regardless of length,
as style guide requires periods for these element types.
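A rough before/after sketch of these two fixes (function names assumed for illustration):

```python
# Sketch contrasting the old and fixed behavior; names are assumptions.
def has_example_old(text):
    # Issue 1, old behavior: any "(" counted as an example marker,
    # so abbreviations like "(DOI)" were false positives.
    return "(" in text or "e.g." in text.lower()

def has_example_fixed(text):
    # Fixed: only explicit example markers count.
    markers = ("e.g.", "for example", "such as", "for instance")
    return any(m in text.lower() for m in markers)

def missing_period(text):
    # Issue 2, fixed: every attribute/slot description needs a trailing
    # period, regardless of its length.
    return not text.rstrip().endswith(".")
```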
## Issue 3: Unused import
**File**: check_missing_descriptions.py:8
**Problem**: `defaultdict` imported but never used.
**Fix**: Removed unused import to avoid lint warnings.
## Issue 4: Contradictory example in style guide
**File**: DESCRIPTION_STYLE_GUIDE.md:129
**Problem**: doi example used "description: digital object identifier" (lowercase,
no period, too brief) which contradicted the guide's own slot guidelines.
**Fix**: Updated example to match actual schema improvement:
"Digital Object Identifier (DOI) in format 10.xxxx/xxxxx providing persistent identification."
## Issue 5 & 6: Quality reports not valid JSON
**Files**: quality_before.json, quality_after.json
**Problem**: Files named .json but contained plain-text reports (ASCII art separators),
breaking programmatic parsing for metrics tracking.
**Fix**: Regenerated both files with `--format json` flag to output proper JSON structure.
Now suitable for automated metrics tracking and comparison.
## Issue 7: Type mismatch in FormatDialect descriptions
**File**: src/data_sheets_schema/schema/D4D_Base_import.yaml:233
**Problem**: FormatDialect.header and double_quote descriptions implied boolean
values ("true/false", "whether...") but range is `string` not `boolean`.
**Fix**: Updated descriptions to clarify these are string indicators:
- header: "String indicator of whether the first row contains column headers."
- double_quote: "String indicator of whether quotes within quoted fields are escaped..."
## Validation
✅ Schema validates successfully (make test-schema)
✅ All generated artifacts updated (make regen-all)
✅ Quality reports now proper JSON format
✅ All code quality issues resolved
✅ No breaking changes to schema functionality
## Impact
- More accurate quality metrics (examples no longer over-counted)
- Stricter period checking for attributes/slots (catches more style violations)
- Cleaner codebase (no unused imports)
- Consistent documentation (no contradictory examples)
- Proper JSON format for programmatic metrics tracking
- Schema descriptions accurately reflect data types
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Fixes three remaining issues from Copilot review:

1. Move quality reports to organized location
   - Relocate quality_before.json and quality_after.json to scripts/
   - Reduces root-level clutter
2. Add missing periods to non-question descriptions
   - D4D_Evaluation_Summary: 21 attribute descriptions
   - D4D_Collection: 8 attribute descriptions
   - D4D_Data_Governance: 1 attribute description
   - Remaining MISSING_PERIOD issues are questions (appropriate for D4D)
3. Add trailing newline to D4D_Base_import.yaml
   - Satisfies POSIX text file requirements

Quality metrics after fixes:
- MISSING_PERIOD issues: 50 → 20 (all remaining are questions)
- All HIGH priority issues resolved
- Schema validates successfully

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Performed comprehensive semantic review of D4D schema descriptions with
analysis of actual data from 4 Bridge2AI projects (AI_READI, CHORUS,
CM4AI, VOICE). Identified 136 semantic issues and implemented initial fixes.
## Semantic Review Tools Created
Added 4 automated validation scripts + Makefile integration:
1. **slot_uri_conflict_detector.py** - Detects ontology mapping conflicts
- Found 17 slot_uri conflicts (9 CRITICAL)
- Identifies semantic collisions in RDF serialization
2. **range_description_checker.py** - Validates range-description alignment
- Found 76 range mismatches (51 HIGH priority)
- Detects boolean oversimplification, missing multivalued
3. **data_value_analyzer.py** - Analyzes actual D4D data values
- Examined 142 fields across 4 Bridge2AI projects
- Identified 75 enum candidates, 38 multivalued fields
4. **generate_semantic_review_report.py** - Consolidated reporting
- 593-line comprehensive report with specific recommendations
- Machine-readable JSON + human-readable markdown
**Makefile targets:**
- `make semantic-review` - Run full review
- Individual targets for each validation script
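The core idea behind the slot_uri conflict check could be sketched as follows (operating on already-parsed module dicts; names are assumptions, not the actual slot_uri_conflict_detector.py):

```python
# Illustrative sketch: collect every slot's slot_uri across modules and
# flag URIs claimed by more than one slot (a semantic collision in RDF).
from collections import defaultdict

def find_slot_uri_conflicts(modules):
    """modules: mapping of module name -> parsed LinkML schema dict.
    Returns slot_uri -> [(module, slot), ...] for URIs used by >1 slot."""
    uses = defaultdict(list)
    for module, schema in modules.items():
        for slot, spec in ((schema or {}).get("slots") or {}).items():
            uri = (spec or {}).get("slot_uri")
            if uri:
                uses[uri].append((module, slot))
    return {uri: hits for uri, hits in uses.items() if len(hits) > 1}
```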
## Issues Identified
**Total: 136 issues**
- CRITICAL: 9 (blocks functionality - slot_uri conflicts)
- HIGH: 54 (wrong semantics - range mismatches, incorrect mappings)
- MEDIUM: 29 (reduces clarity - semantic flattening)
- LOW: 1 (documentation quality)
**Key findings:**
- dcterms:description used by 40+ slots (massive semantic flattening)
- schema:identifier used by 8 slots with different semantics
- dcat:mediaType conflict: encoding vs media_type
- Multiple dcat:accessURL, dcterms:creator, dcterms:license conflicts
## Fixes Applied (4 CRITICAL issues resolved)
### 1. dcat:mediaType conflict ✅
**File:** D4D_Base_import.yaml:301
**Issue:** `encoding` (character encoding) mapped to dcat:mediaType (MIME type)
**Fix:** encoding.slot_uri: dcat:mediaType → d4d:characterEncoding
**Rationale:** DCAT defines mediaType for MIME types only, not character encodings
### 2. dcat:landingPage conflict ✅
**File:** D4D_Maintenance.yaml:149
**Issue:** contribution_url (contribution guidelines) mapped to landingPage
**Fix:** contribution_url.slot_uri: dcat:landingPage → d4d:contributionURL
**Rationale:** Contribution guidelines semantically different from dataset landing page
### 3. dcat:accessURL conflict (partial) ✅
**File:** D4D_Maintenance.yaml:66
**Issue:** erratum_url (erratum access) mapped to generic accessURL
**Fix:** erratum_url.slot_uri: dcat:accessURL → d4d:erratumURL
**Rationale:** Erratum-specific access point distinct from general dataset access
### 4. schema:identifier semantic inversion ✅
**File:** D4D_Composition.yaml:388
**Issue:** identifiers_removed mapped to schema:identifier but contains
list of removed identifier TYPES, not identifier values
**Fix:** identifiers_removed.slot_uri: schema:identifier → d4d:removedIdentifierTypes
**Rationale:** Documents types removed (e.g., "SSN", "name"), not identifiers themselves
## Impact
**Conflict reduction:**
- Total slot_uri conflicts: 17 → 15 (12% reduction)
- CRITICAL conflicts: 9 → 8 (11% reduction)
**Non-breaking changes:**
- slot_uri modifications affect RDF/JSON-LD only
- YAML data structure unchanged
- No data migration required
**Remaining work:**
- 8 CRITICAL slot_uri conflicts
- 6 remaining schema:identifier conflicts
- dcterms:description semantic flattening (architectural decision needed)
- 51 HIGH priority range mismatches
## Documentation
**Reports generated:**
- reports/semantic_review_report.md (593 lines)
- reports/slot_uri_conflicts.json
- reports/range_mismatches.json
- reports/data_value_analysis.json
- reports/fixes_applied.md (implementation log)
**Next steps:**
- Resolve remaining CRITICAL conflicts
- Make architectural decision on dcterms:description (40 usages)
- Create ontology_mapping_guide.md with mapping rationale
- Address HIGH priority range mismatches
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Continued systematic resolution of slot_uri conflicts and semantic issues. Reduced total conflicts from 17 to 8 (53% reduction), CRITICAL conflicts from 9 to 2 (78% reduction).

## Semantic Fixes Applied (14 additional)

### slot_uri Conflicts Resolved

**dcat: namespace (1 fix)**
- access_url → d4d:rawDataAccessURL (raw data access distinct from distribution)

**dcterms: namespace (9 fixes)**
- principal_investigator → d4d:principalInvestigator (role-specific creator)
- restrictions (external resources) → d4d:externalResourceRestrictions
- is_shared → d4d:isExternallyShared (boolean doesn't fit accessRights semantics)
- target_dataset → dcterms:relation (relationship, not identifier)
- latest_version_doi → dcterms:hasVersion (version relationship)
- version → schema:version (version string, not relationship)
- data_substrate → dcterms:type (data content type, not file format)
- status → d4d:publicationStatus (lifecycle status, not type)
- license_terms → d4d:licenseDescription (description of license)

**schema: namespace (4 fixes)**
- affiliations (Creator) → d4d:teamAffiliation (team context)
- orcid → d4d:orcidIdentifier (specific identifier type)
- grant_number → d4d:grantIdentifier (specific identifier type)
- is_identifier → d4d:isIdentifier (meta-property, not identifier)

### Semantic Precision Improvements

**Before:**
- Generic schema:identifier for all identifiers
- dcat:accessURL for all access points
- dcterms:type conflated with publication status
- dcterms:format conflated with data content type
- Booleans using text-description slot_uris

**After:**
- Specific identifier types: orcidIdentifier, grantIdentifier
- Specific access points: rawDataAccessURL, erratumURL, contributionURL
- Clear separation: types vs status, format vs content type
- Semantically appropriate mappings for all fields

**Custom D4D terms created:** 10 new terms (Session 2), all with broad_mappings to standard vocabularies where applicable

## Progress Metrics

**Conflict reduction:**
- Total: 17 → 8 (53% reduction)
- CRITICAL: 9 → 2 (78% reduction)
- HIGH: 3 → 1 (67% reduction)

**Remaining CRITICAL issues:**
1. dcterms:description (40 slots) - Architectural decision needed
2. dcterms:type (3 slots) - Acceptable semantic overlap

## Validation

✅ All schema validation tests pass
✅ Generated artifacts updated successfully
✅ Non-breaking changes (slot_uri only affects RDF/JSON-LD)
✅ No data migration required

## Files Modified

**Schema modules (8 files):**
- D4D_Base_import.yaml (3 fixes)
- D4D_Composition.yaml (3 fixes)
- D4D_Data_Governance.yaml (1 fix)
- D4D_Distribution.yaml (1 fix)
- D4D_Maintenance.yaml (1 fix)
- D4D_Motivation.yaml (3 fixes)
- D4D_Preprocessing.yaml (1 fix)
- D4D_Variables.yaml (1 fix)

**Documentation:**
- reports/semantic_fixes_session2.md (comprehensive summary)

## Next Steps

- Make architectural decision on dcterms:description (40 slots)
- Create docs/ontology_mapping_guide.md with rationale
- Address 51 HIGH priority range mismatches
- Review remaining MEDIUM conflicts

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Applied 16 semantic fixes across 8 D4D modules to eliminate CRITICAL and HIGH slot_uri conflicts while maximizing standard vocabulary reuse.

Conflict Resolution (12 fixes):
- MEDIUM: doi, conforms_to_schema/class, variable_name, label_description, representative_verification, tools (7 custom d4d: terms)
- HIGH: contact_person variants for ethics/license/governance contexts (3 role-specific contact point terms)
- CRITICAL: source_type, instance_type differentiated from data_substrate (2 type classification terms)

dcterms:description Optimization (4 fixes):
- Made architectural decision: minimal hybrid approach
- Differentiated: response→questionResponse, anomaly_details→anomalyDetails, quality_notes→qualityNotes, future_guarantees→availabilityGuarantee
- Accepted 36 slots as valid semantic overlap (90% standard vocab retention)
- Documented rationale in reports/dcterms_description_analysis.md

Results:
- CRITICAL conflicts: 9 → 0 (100% resolved)
- HIGH conflicts: 3 → 0 (100% resolved)
- MEDIUM conflicts: 7 → 2 (acceptable overlaps documented)
- Created 31 custom d4d: namespace terms with broad_mappings
- Zero breaking changes (slot_uri affects RDF/JSON-LD only)
- All schema validation passing

Files modified:
- 8 schema modules (D4D_Base_import, Collection, Composition, Data_Governance, Ethics, Motivation, Preprocessing, Variables)
- Generated artifacts (jsonld, owl, Python datamodel)
- 3 new analysis/decision reports

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Pull request overview
Copilot reviewed 36 out of 40 changed files in this pull request and generated 6 comments.
…l field examples

- Add descriptions to all 41 missing Dataset-level aggregator slots in data_sheets_schema.yaml, achieving 100% description coverage (94.7% → 100%)
- Each description explains the target module class and what the collection represents
- Add examples to technical fields: doi (concrete DOI examples), hash (format examples), identifiers_removed (HIPAA identifier types), license_terms (SPDX identifiers)
- All new descriptions use "List of..." prefix for multivalued aggregator slots, addressing multivalued clarity as part of Phase 1-2 improvements

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Semantic meaning and quality review agent for all D4D schema descriptions across the full schema and all 17 modules.

Reviews:
- Semantic accuracy (description matches what the field actually stores)
- Range alignment (description reflects correct data type)
- Ontology alignment (description matches slot_uri/mappings intent)
- Cardinality alignment (multivalued fields use plural language)
- Cross-module consistency (same concepts described consistently)
- Completeness and specificity (actionable vs generic boilerplate)
- Structural correctness (no stubs, placeholders, malformed text)

Issues classified as CRITICAL/HIGH/MEDIUM/LOW with specific locations, current text, problem explanation, and recommended fix. Uses claude-opus-4-6 for deep semantic reasoning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Fix all remaining HIGH/MEDIUM/LOW semantic review issues (83 → 0)
- Wire 7 orphaned classes to Dataset (Relationships, Splits, DirectCollection, CollectionNotification, CollectionConsent, ConsentRevocation, ThirdPartySharing)
- Apply style cleanup: add terminal periods to 332 descriptions across 4 files; add Check H to d4d-description-reviewer agent
- Add 19 d4d:docExample annotations (namespaced to avoid conflict with the existing `examples` data fields in ExistingUse, IntendedUse, and VariableMetadata)
- Add docs/ontology_mapping_guide.md
- Regenerate all artifacts (make gen-project); 103 tests pass

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Pull request overview
Copilot reviewed 41 out of 48 changed files in this pull request and generated 8 comments.
- Remove incorrect broad_mappings (schema:archivedAt) from archival boolean field
- Add missing terminal period to HZ-GB-2312 encoding description
- Fix slot_uri names in ontology_mapping_guide.md (hashValue, md5Checksum, doiIdentifier)
- Add || true to semantic-review Makefile targets so non-zero exit doesn't fail the build
- Delete stale reports: slot_uri_conflicts_final.json and semantic_fixes_session3.md

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- gen-project: regenerate OWL/Turtle, JSON-LD, JSON Schema, Python datamodel
- gen-sssom-all: regenerate all SSSOM TSVs (284 attributes, up from 270)
- 7 newly wired Dataset attributes now appear as unmapped novel D4D concepts
(relationships, splits, direct_collection, collection_notifications,
collection_consents, consent_revocations, third_party_sharing)
- Sync data/mappings/ working copies from canonical src/alignment/ sources
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Run generate_structural_mapping.py against updated merged schema
- 149 structure-aware D4D ↔ RO-Crate mappings (updated from prior run)
- Reflects 7 newly wired Dataset attributes and all schema updates

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Add SSSOM_STRUCTURAL and SSSOM_STRUCTURAL_SCRIPT variables
- Add gen-sssom-structural target (depends on merged schema + RO-Crate JSON)
- Wire gen-sssom-structural into gen-sssom-all so it runs with everything else
- Add SSSOM_STRUCTURAL to clean-sssom

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Pull request overview
Copilot reviewed 45 out of 56 changed files in this pull request and generated 2 comments.
- data_value_analyzer.py: fix boolean heuristic to check startswith on last path segment instead of endswith on full path (is_deidentified would never match the old check)
- generate_semantic_review_report.py: compute 'most severe' conflict dynamically from loaded data instead of hard-coding dcterms:description and 40-slot count which become stale as schema changes

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
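The corrected boolean-field heuristic might look like this (illustrative sketch; the prefix set is an assumption):

```python
# Sketch of the fixed heuristic: test the *last* path segment with
# startswith, not the full dotted path with endswith. Under the old
# endswith-on-full-path check, "composition.is_deidentified" never matched.
BOOLEAN_PREFIXES = ("is_", "has_", "was_", "can_")

def looks_boolean(field_path):
    """True if the field name itself starts with a boolean-style prefix."""
    last = field_path.split(".")[-1]
    return last.startswith(BOOLEAN_PREFIXES)
```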
PR adds 122 missing descriptions across D4D_Base_import.yaml, D4D_Composition.yaml, and D4D_Evaluation_Summary.yaml, then improves existing ones for consistency and quality.

Schema Changes
- Windows code-page values moved from VersionTypeEnum → EncodingEnum (they're character encodings, not version types)

Tooling & Docs
- DESCRIPTION_STYLE_GUIDE.md — style reference for future schema contributors
- check_description_quality.py / check_missing_descriptions.py — automated quality checkers

Concerns Noted in Review
- Quality reports (quality_before.json, quality_after.json) committed to repo root; better placed in scripts/ or docs/
- quality_*.json files are plain-text reports, not JSON
- Some descriptions in D4D_Evaluation_Summary.yaml don't end with a period — not introduced here, but inconsistent with the style guide now codified
- An issue in D4D_Base_import.yaml is pre-existing, carried over unchanged