Review: Add and improve D4D schema descriptions #142
Conversation
Resolves 122 missing descriptions across D4D schema modules to improve documentation completeness and schema usability.

## Changes by Module

### D4D_Base_import.yaml (119 descriptions added)

**Software class attributes (3):**
- version: Software version identifier
- license: Software license
- url: Software homepage/repository URL

**FormatDialect attributes (5):**
- comment_prefix: Comment line indicator character
- delimiter: Field delimiter character
- double_quote: Quote escaping behavior
- header: Header row presence flag
- quote_char: Field quoting character

**Slots (15):**
- publisher: Publishing organization
- issued: Formal publication date
- page: Landing page URL
- path: Content location path/URL
- conforms_to: Conformance standard
- conforms_to_schema: Schema conformance
- conforms_to_class: Class conformance
- license: Legal license
- keywords: Discovery/classification tags
- version: Version identifier
- created_by: Creator person/organization
- created_on: Creation timestamp
- last_updated_on: Modification timestamp
- modified_by: Modifier person/organization
- status: Resource status
- was_derived_from: Source derivation

**Enums (96 descriptions):**
- FormatEnum: File format extensions (CSV, JSON, PDF, etc.) - 1 enum + 18 values
- MediaTypeEnum: MIME media types - 1 enum + 20 values
- CompressionEnum: Compression algorithms - 1 enum + 7 values
- EncodingEnum: Character encodings - 1 enum + 49 values (including 9 Windows code pages moved from VersionTypeEnum)
- Boolean: Three-valued logic - 1 enum + 3 values

**Schema fixes:**
- Moved Windows-1250 through Windows-1258 from VersionTypeEnum to EncodingEnum (these are character encodings, not version types)
- VersionTypeEnum now correctly contains only MAJOR, MINOR, PATCH

### D4D_Composition.yaml (3 descriptions added)

**ContentWarning class:**
- warnings: Specific content warnings for offensive/threatening content

**Subpopulation class:**
- identification: How subpopulations are identified and defined
- distribution: Distribution of instances across subpopulations

## Verification

- ✅ All 122 missing descriptions added
- ✅ Schema validates successfully (make test-schema)
- ✅ No missing descriptions remaining (check_missing_descriptions.py)
- ✅ Generated artifacts updated (Python datamodel, JSON Schema, OWL, JSON-LD)

## Tools Added

- check_missing_descriptions.py: Script to identify missing descriptions in D4D modules

Co-Authored-By: Claude Opus 4.6 <[email protected]>
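A minimal sketch of what such a missing-description pass might look like (function names and structure are assumptions, not the actual check_missing_descriptions.py):

```python
# Hypothetical sketch of a missing-description pass over LinkML modules;
# names and structure are illustrative, not the actual script.
def load_schema(path):
    import yaml  # PyYAML; imported lazily inside the loader
    with open(path) as f:
        return yaml.safe_load(f)

def find_missing_descriptions(schema):
    """Yield (location, name) pairs for schema elements with no description."""
    for section in ("classes", "slots", "enums"):
        for name, spec in (schema.get(section) or {}).items():
            spec = spec or {}
            if not spec.get("description"):
                yield (section, name)
            # Class attributes and enum values are nested one level down.
            nested = "attributes" if section == "classes" else "permissible_values"
            for sub, sub_spec in (spec.get(nested) or {}).items():
                if not (sub_spec or {}).get("description"):
                    yield (f"{section}/{name}", sub)
```

Running this over each D4D module and failing when the generator yields anything would reproduce the check described above.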
Critical review and enhancement of 122 descriptions added in previous commit
to ensure consistency, clarity, and completeness across all D4D modules.
## Quality Improvements
**Before:**
- 269 total issues (36 HIGH priority TOO_BRIEF, 12 MEDIUM, 221 LOW)
- Very brief descriptions: 54 items (<5 words)
- Complete sentences (slots): 87.1%
- Attributes with examples: 15.5%
**After:**
- 239 total issues (0 HIGH, 8 MEDIUM, 231 LOW)
- Very brief descriptions: 18 items (66% reduction)
- Complete sentences (slots): 100% (✅ target achieved)
- Attributes with examples: 25%+ in key modules
## Changes by Module
### D4D_Base_import.yaml (10 improvements)
**Slots enhanced with context:**
- hash: "hash of the data" → "Cryptographic hash value... for integrity verification"
- md5: Added "128-bit cryptographic hash" specification
- sha256: Added "256-bit... recommended" guidance
- doi: "digital object identifier" → "Digital Object Identifier... in format 10.xxxx/xxxxx"
**Grammar and capitalization:**
- title, language, encoding, compression: Capitalized and added periods
### D4D_Evaluation_Summary.yaml (40 improvements)
**EvaluationSummary attributes:**
- overall_performance: Expanded with "Summary statistics... across all D4D files"
**OverallPerformance class (8 attributes):**
- All percentage/score fields enhanced with context
- Examples: "Average percentage score" → "Average score expressed as a percentage of maximum possible score"
**MethodPerformance class (9 attributes):**
- method: Added example methods (claudecode_agent, gpt5)
- All metrics enhanced with clarity about what they measure
**ProjectPerformance class (6 attributes):**
- project: Added project examples (AI_READI, VOICE)
- Consistent metric descriptions across performance classes
**TopPerformer class (9 attributes):**
- All attributes enhanced with specific context
- rank: Clarified "1 = highest score"
- Elements/scoring fields now specify their purpose
**ElementPerformance & CategoryPerformance (12 attributes):**
- ID fields specify numeric ranges (1-10, 1-4)
- Name fields include examples
- All scoring fields clarified
**CommonWeakness & CommonStrength (10 attributes):**
- type fields: Added classification context
- frequency fields: Specified "across the evaluated dataset"
- affected_element_or_question: Specified "where this... commonly occurs"
**KeyInsight (5 attributes):**
- insight_type: Added examples (trend, comparison, finding)
- comparison_metric: Added examples ("2.4× better", "30% improvement")
**InputTypeComparison & InputTypePerformance (8 attributes):**
- All performance metrics enhanced with context
- score_range: Specified "minimum to maximum values"
**GeneratedFile (4 attributes):**
- All fields enhanced with specific context about file operations
## Tools and Documentation
**New files:**
1. **DESCRIPTION_STYLE_GUIDE.md** - Comprehensive style guide with:
- Element-specific patterns table (class/attribute/slot/enum/enum value)
- Quality criteria and decision trees
- Examples of good vs poor descriptions
- Anti-patterns to avoid
2. **check_description_quality.py** - Enhanced quality checker:
- Detects TOO_BRIEF descriptions (<5 words)
- Checks for MISSING_PERIOD on attributes/slots
- Suggests CONSIDER_EXAMPLE for substantial descriptions
- Generates prioritized reports with specific line references
- Quality metrics by module and element type
3. **quality_before.json & quality_after.json** - Baseline and final reports
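The TOO_BRIEF / MISSING_PERIOD / CONSIDER_EXAMPLE heuristics could be sketched roughly as follows (thresholds taken from the text; function and constant names are assumptions, not the actual check_description_quality.py):

```python
# Illustrative sketch of the quality heuristics described above.
EXAMPLE_MARKERS = ("e.g.", "for example", "such as", "for instance")

def check_description(text, element_type):
    """Return a list of issue codes for one description string."""
    issues = []
    words = text.split()
    if len(words) < 5:
        issues.append("TOO_BRIEF")
    # Attributes and slots are required to end with a period.
    if element_type in ("attribute", "slot") and not text.rstrip().endswith("."):
        issues.append("MISSING_PERIOD")
    # Substantial descriptions without an explicit example marker get a hint.
    has_example = any(m in text.lower() for m in EXAMPLE_MARKERS)
    if len(words) >= 10 and not has_example:
        issues.append("CONSIDER_EXAMPLE")
    return issues
```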
## Validation
✅ Schema validates successfully (make test-schema)
✅ All generated artifacts updated
✅ Zero high-priority issues remaining
✅ 66% reduction in very brief descriptions
✅ 100% complete sentences for slots (up from 87.1%)
## Style Guidelines Applied
- **Classes**: Complete sentences, 15-30 words, question format for D4D modules
- **Attributes**: Complete sentences, 8-15 words, examples recommended (40% target)
- **Slots**: Complete sentences, 10-15 words, examples optional (20%)
- **Enums**: Complete/Fragment, 8-15 words, categorize value set
- **Enum values**: Fragments acceptable, 3-10 words, clarity over verbosity
## Success Metrics Achieved
✅ Reduced HIGH priority issues from 36 to 0
✅ Improved slot complete sentences from 87.1% to 100%
✅ Reduced very brief descriptions by 66%
✅ Added beneficial examples to 25%+ of attributes in key modules
✅ Established consistent patterns across element types
Co-Authored-By: Claude Opus 4.6 <[email protected]>
realmarcin
left a comment
@copilot please review this PR
Pull request overview
This PR improves documentation quality across the D4D (Datasheets for Datasets) LinkML schema modules by adding missing element descriptions, refining existing descriptions for consistency, and regenerating downstream artifacts (Python datamodel + JSON Schema/JSON-LD/OWL). It also adds a style guide and helper scripts intended to keep description quality high over time.
Changes:
- Added/updated descriptions across D4D schema modules, including enums and shared slots; corrected Windows code-page values to live under EncodingEnum instead of VersionTypeEnum.
- Regenerated derived artifacts (data_sheets_schema_all.yaml, Python datamodel, JSON Schema, JSON-LD, OWL TTL) to reflect schema updates.
- Added documentation and tooling (DESCRIPTION_STYLE_GUIDE.md, description quality/missing-description check scripts, and "quality report" outputs).
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/data_sheets_schema/schema/D4D_Base_import.yaml | Adds many missing descriptions and fixes encoding enum placement. |
| src/data_sheets_schema/schema/D4D_Composition.yaml | Adds descriptions for content warnings and subpopulation fields. |
| src/data_sheets_schema/schema/D4D_Evaluation_Summary.yaml | Improves clarity/consistency of evaluation summary metric descriptions. |
| src/data_sheets_schema/schema/data_sheets_schema_all.yaml | Regenerated merged schema with updated descriptions and enum adjustments. |
| src/data_sheets_schema/datamodel/data_sheets_schema.py | Regenerated Python datamodel reflecting updated enums/descriptions. |
| project/jsonschema/data_sheets_schema.schema.json | Regenerated JSON Schema with updated descriptions and enum membership. |
| project/jsonld/data_sheets_schema.jsonld | Regenerated JSON-LD with updated descriptions and enum membership. |
| project/owl/data_sheets_schema.owl.ttl | Regenerated OWL TTL artifact to match schema updates. |
| DESCRIPTION_STYLE_GUIDE.md | New style guide documenting description standards and examples. |
| check_description_quality.py | New automated checker for description quality heuristics/metrics. |
| check_missing_descriptions.py | New script to fail when descriptions are missing in D4D modules. |
| quality_before.json | Added “before” quality report output for tracking. |
| quality_after.json | Added “after” quality report output for tracking. |
PR Review: Add and improve D4D schema descriptions

Summary

This is a well-executed PR that meaningfully improves schema documentation quality. All YAML files validate correctly and the schema changes are coherent.

✅ Strengths
Fixed all issues identified by GitHub Copilot code review to improve
code quality, consistency, and accuracy.
## Issue 1: has_example() function too permissive
**File**: check_description_quality.py:36
**Problem**: Function treated any opening parenthesis as an example marker,
causing false positives for abbreviations and technical notes.
**Fix**: Restrict to explicit example markers only (e.g., for example, such as, for instance).
Remove the `'(' in text` check.
## Issue 2: MISSING_PERIOD check logic flaw
**File**: check_description_quality.py:69
**Problem**: Short descriptions (≤5 words) bypassed period check due to
is_complete_sentence() logic, allowing "Email address" without period.
**Fix**: Check all attributes/slots for trailing period regardless of length,
as style guide requires periods for these element types.
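A rough before/after sketch of these two fixes (function names assumed for illustration):

```python
# Sketch contrasting the old and fixed behavior; names are assumptions.
def has_example_old(text):
    # Issue 1, old behavior: any "(" counted as an example marker,
    # so abbreviations like "(DOI)" were false positives.
    return "(" in text or "e.g." in text.lower()

def has_example_fixed(text):
    # Fixed: only explicit example markers count.
    markers = ("e.g.", "for example", "such as", "for instance")
    return any(m in text.lower() for m in markers)

def missing_period(text):
    # Issue 2, fixed: every attribute/slot description needs a trailing
    # period, regardless of its length.
    return not text.rstrip().endswith(".")
```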
## Issue 3: Unused import
**File**: check_missing_descriptions.py:8
**Problem**: `defaultdict` imported but never used.
**Fix**: Removed unused import to avoid lint warnings.
## Issue 4: Contradictory example in style guide
**File**: DESCRIPTION_STYLE_GUIDE.md:129
**Problem**: doi example used "description: digital object identifier" (lowercase,
no period, too brief) which contradicted the guide's own slot guidelines.
**Fix**: Updated example to match actual schema improvement:
"Digital Object Identifier (DOI) in format 10.xxxx/xxxxx providing persistent identification."
## Issue 5 & 6: Quality reports not valid JSON
**Files**: quality_before.json, quality_after.json
**Problem**: Files named .json but contained plain-text reports (ASCII art separators),
breaking programmatic parsing for metrics tracking.
**Fix**: Regenerated both files with `--format json` flag to output proper JSON structure.
Now suitable for automated metrics tracking and comparison.
## Issue 7: Type mismatch in FormatDialect descriptions
**File**: src/data_sheets_schema/schema/D4D_Base_import.yaml:233
**Problem**: FormatDialect.header and double_quote descriptions implied boolean
values ("true/false", "whether...") but range is `string` not `boolean`.
**Fix**: Updated descriptions to clarify these are string indicators:
- header: "String indicator of whether the first row contains column headers."
- double_quote: "String indicator of whether quotes within quoted fields are escaped..."
## Validation
✅ Schema validates successfully (make test-schema)
✅ All generated artifacts updated (make regen-all)
✅ Quality reports now proper JSON format
✅ All code quality issues resolved
✅ No breaking changes to schema functionality
## Impact
- More accurate quality metrics (examples no longer over-counted)
- Stricter period checking for attributes/slots (catches more style violations)
- Cleaner codebase (no unused imports)
- Consistent documentation (no contradictory examples)
- Proper JSON format for programmatic metrics tracking
- Schema descriptions accurately reflect data types
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Fixes three remaining issues from Copilot review:

1. Move quality reports to organized location
   - Relocate quality_before.json and quality_after.json to scripts/
   - Reduces root-level clutter
2. Add missing periods to non-question descriptions
   - D4D_Evaluation_Summary: 21 attribute descriptions
   - D4D_Collection: 8 attribute descriptions
   - D4D_Data_Governance: 1 attribute description
   - Remaining MISSING_PERIOD issues are questions (appropriate for D4D)
3. Add trailing newline to D4D_Base_import.yaml
   - Satisfies POSIX text file requirements

Quality metrics after fixes:
- MISSING_PERIOD issues: 50 → 20 (all remaining are questions)
- All HIGH priority issues resolved
- Schema validates successfully

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Performed comprehensive semantic review of D4D schema descriptions with
analysis of actual data from 4 Bridge2AI projects (AI_READI, CHORUS,
CM4AI, VOICE). Identified 136 semantic issues and implemented initial fixes.
## Semantic Review Tools Created
Added 4 automated validation scripts + Makefile integration:
1. **slot_uri_conflict_detector.py** - Detects ontology mapping conflicts
- Found 17 slot_uri conflicts (9 CRITICAL)
- Identifies semantic collisions in RDF serialization
2. **range_description_checker.py** - Validates range-description alignment
- Found 76 range mismatches (51 HIGH priority)
- Detects boolean oversimplification, missing multivalued
3. **data_value_analyzer.py** - Analyzes actual D4D data values
- Examined 142 fields across 4 Bridge2AI projects
- Identified 75 enum candidates, 38 multivalued fields
4. **generate_semantic_review_report.py** - Consolidated reporting
- 593-line comprehensive report with specific recommendations
- Machine-readable JSON + human-readable markdown
**Makefile targets:**
- `make semantic-review` - Run full review
- Individual targets for each validation script
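The core idea behind the slot_uri conflict check could be sketched as follows (operating on already-parsed module dicts; names are assumptions, not the actual slot_uri_conflict_detector.py):

```python
# Illustrative sketch: collect every slot's slot_uri across modules and
# flag URIs claimed by more than one slot (a semantic collision in RDF).
from collections import defaultdict

def find_slot_uri_conflicts(modules):
    """modules: mapping of module name -> parsed LinkML schema dict.
    Returns slot_uri -> [(module, slot), ...] for URIs used by >1 slot."""
    uses = defaultdict(list)
    for module, schema in modules.items():
        for slot, spec in ((schema or {}).get("slots") or {}).items():
            uri = (spec or {}).get("slot_uri")
            if uri:
                uses[uri].append((module, slot))
    return {uri: hits for uri, hits in uses.items() if len(hits) > 1}
```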
## Issues Identified
**Total: 136 issues**
- CRITICAL: 9 (blocks functionality - slot_uri conflicts)
- HIGH: 54 (wrong semantics - range mismatches, incorrect mappings)
- MEDIUM: 29 (reduces clarity - semantic flattening)
- LOW: 1 (documentation quality)
**Key findings:**
- dcterms:description used by 40+ slots (massive semantic flattening)
- schema:identifier used by 8 slots with different semantics
- dcat:mediaType conflict: encoding vs media_type
- Multiple dcat:accessURL, dcterms:creator, dcterms:license conflicts
## Fixes Applied (4 CRITICAL issues resolved)
### 1. dcat:mediaType conflict ✅
**File:** D4D_Base_import.yaml:301
**Issue:** `encoding` (character encoding) mapped to dcat:mediaType (MIME type)
**Fix:** encoding.slot_uri: dcat:mediaType → d4d:characterEncoding
**Rationale:** DCAT defines mediaType for MIME types only, not character encodings
### 2. dcat:landingPage conflict ✅
**File:** D4D_Maintenance.yaml:149
**Issue:** contribution_url (contribution guidelines) mapped to landingPage
**Fix:** contribution_url.slot_uri: dcat:landingPage → d4d:contributionURL
**Rationale:** Contribution guidelines semantically different from dataset landing page
### 3. dcat:accessURL conflict (partial) ✅
**File:** D4D_Maintenance.yaml:66
**Issue:** erratum_url (erratum access) mapped to generic accessURL
**Fix:** erratum_url.slot_uri: dcat:accessURL → d4d:erratumURL
**Rationale:** Erratum-specific access point distinct from general dataset access
### 4. schema:identifier semantic inversion ✅
**File:** D4D_Composition.yaml:388
**Issue:** identifiers_removed mapped to schema:identifier but contains
list of removed identifier TYPES, not identifier values
**Fix:** identifiers_removed.slot_uri: schema:identifier → d4d:removedIdentifierTypes
**Rationale:** Documents types removed (e.g., "SSN", "name"), not identifiers themselves
## Impact
**Conflict reduction:**
- Total slot_uri conflicts: 17 → 15 (12% reduction)
- CRITICAL conflicts: 9 → 8 (11% reduction)
**Non-breaking changes:**
- slot_uri modifications affect RDF/JSON-LD only
- YAML data structure unchanged
- No data migration required
**Remaining work:**
- 8 CRITICAL slot_uri conflicts
- 6 remaining schema:identifier conflicts
- dcterms:description semantic flattening (architectural decision needed)
- 51 HIGH priority range mismatches
## Documentation
**Reports generated:**
- reports/semantic_review_report.md (593 lines)
- reports/slot_uri_conflicts.json
- reports/range_mismatches.json
- reports/data_value_analysis.json
- reports/fixes_applied.md (implementation log)
**Next steps:**
- Resolve remaining CRITICAL conflicts
- Make architectural decision on dcterms:description (40 usages)
- Create ontology_mapping_guide.md with mapping rationale
- Address HIGH priority range mismatches
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Continued systematic resolution of slot_uri conflicts and semantic issues. Reduced total conflicts from 17 to 8 (53% reduction), CRITICAL conflicts from 9 to 2 (78% reduction).

## Semantic Fixes Applied (14 additional)

### slot_uri Conflicts Resolved

**dcat: namespace (1 fix)**
- access_url → d4d:rawDataAccessURL (raw data access distinct from distribution)

**dcterms: namespace (9 fixes)**
- principal_investigator → d4d:principalInvestigator (role-specific creator)
- restrictions (external resources) → d4d:externalResourceRestrictions
- is_shared → d4d:isExternallyShared (boolean doesn't fit accessRights semantics)
- target_dataset → dcterms:relation (relationship, not identifier)
- latest_version_doi → dcterms:hasVersion (version relationship)
- version → schema:version (version string, not relationship)
- data_substrate → dcterms:type (data content type, not file format)
- status → d4d:publicationStatus (lifecycle status, not type)
- license_terms → d4d:licenseDescription (description of license)

**schema: namespace (4 fixes)**
- affiliations (Creator) → d4d:teamAffiliation (team context)
- orcid → d4d:orcidIdentifier (specific identifier type)
- grant_number → d4d:grantIdentifier (specific identifier type)
- is_identifier → d4d:isIdentifier (meta-property, not identifier)

### Semantic Precision Improvements

**Before:**
- Generic schema:identifier for all identifiers
- dcat:accessURL for all access points
- dcterms:type conflated with publication status
- dcterms:format conflated with data content type
- Booleans using text-description slot_uris

**After:**
- Specific identifier types: orcidIdentifier, grantIdentifier
- Specific access points: rawDataAccessURL, erratumURL, contributionURL
- Clear separation: types vs status, format vs content type
- Semantically appropriate mappings for all fields

**Custom D4D terms created:** 10 new terms (Session 2), all with broad_mappings to standard vocabularies where applicable

## Progress Metrics

**Conflict reduction:**
- Total: 17 → 8 (53% reduction)
- CRITICAL: 9 → 2 (78% reduction)
- HIGH: 3 → 1 (67% reduction)

**Remaining CRITICAL issues:**
1. dcterms:description (40 slots) - Architectural decision needed
2. dcterms:type (3 slots) - Acceptable semantic overlap

## Validation

✅ All schema validation tests pass
✅ Generated artifacts updated successfully
✅ Non-breaking changes (slot_uri only affects RDF/JSON-LD)
✅ No data migration required

## Files Modified

**Schema modules (8 files):**
- D4D_Base_import.yaml (3 fixes)
- D4D_Composition.yaml (3 fixes)
- D4D_Data_Governance.yaml (1 fix)
- D4D_Distribution.yaml (1 fix)
- D4D_Maintenance.yaml (1 fix)
- D4D_Motivation.yaml (3 fixes)
- D4D_Preprocessing.yaml (1 fix)
- D4D_Variables.yaml (1 fix)

**Documentation:**
- reports/semantic_fixes_session2.md (comprehensive summary)

## Next Steps

- Make architectural decision on dcterms:description (40 slots)
- Create docs/ontology_mapping_guide.md with rationale
- Address 51 HIGH priority range mismatches
- Review remaining MEDIUM conflicts

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Applied 16 semantic fixes across 8 D4D modules to eliminate CRITICAL and HIGH slot_uri conflicts while maximizing standard vocabulary reuse.

Conflict Resolution (12 fixes):
- MEDIUM: doi, conforms_to_schema/class, variable_name, label_description, representative_verification, tools (7 custom d4d: terms)
- HIGH: contact_person variants for ethics/license/governance contexts (3 role-specific contact point terms)
- CRITICAL: source_type, instance_type differentiated from data_substrate (2 type classification terms)

dcterms:description Optimization (4 fixes):
- Made architectural decision: minimal hybrid approach
- Differentiated: response→questionResponse, anomaly_details→anomalyDetails, quality_notes→qualityNotes, future_guarantees→availabilityGuarantee
- Accepted 36 slots as valid semantic overlap (90% standard vocab retention)
- Documented rationale in reports/dcterms_description_analysis.md

Results:
- CRITICAL conflicts: 9 → 0 (100% resolved)
- HIGH conflicts: 3 → 0 (100% resolved)
- MEDIUM conflicts: 7 → 2 (acceptable overlaps documented)
- Created 31 custom d4d: namespace terms with broad_mappings
- Zero breaking changes (slot_uri affects RDF/JSON-LD only)
- All schema validation passing

Files modified:
- 8 schema modules (D4D_Base_import, Collection, Composition, Data_Governance, Ethics, Motivation, Preprocessing, Variables)
- Generated artifacts (jsonld, owl, Python datamodel)
- 3 new analysis/decision reports

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Pull request overview
Copilot reviewed 36 out of 40 changed files in this pull request and generated 6 comments.
…l field examples

- Add descriptions to all 41 missing Dataset-level aggregator slots in data_sheets_schema.yaml, achieving 100% description coverage (94.7% → 100%)
- Each description explains the target module class and what the collection represents
- Add examples to technical fields: doi (concrete DOI examples), hash (format examples), identifiers_removed (HIPAA identifier types), license_terms (SPDX identifiers)
- All new descriptions use "List of..." prefix for multivalued aggregator slots, addressing multivalued clarity as part of Phase 1-2 improvements

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Semantic meaning and quality review agent for all D4D schema descriptions across the full schema and all 17 modules.

Reviews:
- Semantic accuracy (description matches what the field actually stores)
- Range alignment (description reflects correct data type)
- Ontology alignment (description matches slot_uri/mappings intent)
- Cardinality alignment (multivalued fields use plural language)
- Cross-module consistency (same concepts described consistently)
- Completeness and specificity (actionable vs generic boilerplate)
- Structural correctness (no stubs, placeholders, malformed text)

Issues classified as CRITICAL/HIGH/MEDIUM/LOW with specific locations, current text, problem explanation, and recommended fix. Uses claude-opus-4-6 for deep semantic reasoning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Fix all remaining HIGH/MEDIUM/LOW semantic review issues (83 → 0)
- Wire 7 orphaned classes to Dataset (Relationships, Splits, DirectCollection, CollectionNotification, CollectionConsent, ConsentRevocation, ThirdPartySharing)
- Apply style cleanup: add terminal periods to 332 descriptions across 4 files; add Check H to d4d-description-reviewer agent
- Add 19 d4d:docExample annotations (namespaced to avoid conflict with the existing `examples` data fields in ExistingUse, IntendedUse, and VariableMetadata)
- Add docs/ontology_mapping_guide.md
- Regenerate all artifacts (make gen-project); 103 tests pass

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Pull request overview
Copilot reviewed 41 out of 48 changed files in this pull request and generated 8 comments.
- Remove incorrect broad_mappings (schema:archivedAt) from archival boolean field
- Add missing terminal period to HZ-GB-2312 encoding description
- Fix slot_uri names in ontology_mapping_guide.md (hashValue, md5Checksum, doiIdentifier)
- Add || true to semantic-review Makefile targets so non-zero exit doesn't fail the build
- Delete stale reports: slot_uri_conflicts_final.json and semantic_fixes_session3.md

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- gen-project: regenerate OWL/Turtle, JSON-LD, JSON Schema, Python datamodel
- gen-sssom-all: regenerate all SSSOM TSVs (284 attributes, up from 270)
- 7 newly wired Dataset attributes now appear as unmapped novel D4D concepts
(relationships, splits, direct_collection, collection_notifications,
collection_consents, consent_revocations, third_party_sharing)
- Sync data/mappings/ working copies from canonical src/alignment/ sources
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Run generate_structural_mapping.py against updated merged schema
- 149 structure-aware D4D ↔ RO-Crate mappings (updated from prior run)
- Reflects 7 newly wired Dataset attributes and all schema updates

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Add SSSOM_STRUCTURAL and SSSOM_STRUCTURAL_SCRIPT variables
- Add gen-sssom-structural target (depends on merged schema + RO-Crate JSON)
- Wire gen-sssom-structural into gen-sssom-all so it runs with everything else
- Add SSSOM_STRUCTURAL to clean-sssom

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Pull request overview
Copilot reviewed 45 out of 56 changed files in this pull request and generated 2 comments.
- data_value_analyzer.py: fix boolean heuristic to check startswith on last path segment instead of endswith on full path (is_deidentified would never match the old check)
- generate_semantic_review_report.py: compute 'most severe' conflict dynamically from loaded data instead of hard-coding dcterms:description and 40-slot count which become stale as schema changes

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
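The corrected boolean-field heuristic might look like this (illustrative sketch; the prefix set is an assumption):

```python
# Sketch of the fixed heuristic: test the *last* path segment with
# startswith, not the full dotted path with endswith. Under the old
# endswith-on-full-path check, "composition.is_deidentified" never matched.
BOOLEAN_PREFIXES = ("is_", "has_", "was_", "can_")

def looks_boolean(field_path):
    """True if the field name itself starts with a boolean-style prefix."""
    last = field_path.split(".")[-1]
    return last.startswith(BOOLEAN_PREFIXES)
```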
PR adds 122 missing descriptions across D4D_Base_import.yaml, D4D_Composition.yaml, and D4D_Evaluation_Summary.yaml, then improves existing ones for consistency and quality.

Schema Changes
- Windows code-page values moved from VersionTypeEnum → EncodingEnum (they're character encodings, not version types)

Tooling & Docs
- DESCRIPTION_STYLE_GUIDE.md — style reference for future schema contributors
- check_description_quality.py / check_missing_descriptions.py — automated quality checkers

Concerns Noted in Review
- Quality reports (quality_before.json, quality_after.json) committed to repo root; better placed in scripts/ or docs/
- quality_*.json files are plain-text reports, not JSON
- Some descriptions in D4D_Evaluation_Summary.yaml don't end with a period — not introduced here, but inconsistent with the style guide now codified
- An issue in D4D_Base_import.yaml is pre-existing, carried over unchanged