Validation Module

Comprehensive quality assurance for research outputs

Location: infrastructure/validation/

Quick Reference: Modules Guide | API Reference


Key Features

  • PDF Validation: Structural integrity checks, xref table verification, trailer validation, text extraction
  • Markdown Validation: Heading hierarchy, image paths, cross-references, math expression checks
  • Output Integrity: File integrity, data consistency, academic standards verification
  • Figure Validation: Figure registry completeness and correctness
  • Repository Scanning: Accuracy and completeness audits across the entire repository
  • No-Mock Enforcement: Ensures test suites comply with the no-mocks policy
  • Issue Categorization: Severity assignment, false-positive filtering, prioritization
  • CLI Interface: Unified command-line entry point for all validation tasks

Usage Examples

PDF Validation

from pathlib import Path
from infrastructure.validation import validate_pdf_rendering, extract_text_from_pdf, scan_for_issues

pdf_path = Path("output/template_code_project/pdf/template_code_project_combined.pdf")

# Validate structural integrity of a rendered PDF
results = validate_pdf_rendering(pdf_path)

# Extract text content for downstream analysis
text = extract_text_from_pdf(pdf_path)

# Scan extracted text for rendering issues
issues = scan_for_issues(text)
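
To make "structural integrity" more concrete, here is a minimal standalone sketch of the kind of byte-level checks a PDF validator might run: the version header, the `%%EOF` marker, and the `startxref` offset. The function name `basic_pdf_structure_problems` is hypothetical and is not part of `infrastructure.validation`; the module's actual checks may differ.

```python
def basic_pdf_structure_problems(data: bytes) -> list[str]:
    """Return structural problems found in raw PDF bytes (illustrative only)."""
    problems = []

    # A PDF file begins with a version header such as %PDF-1.7
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")

    # The end-of-file marker and startxref keyword sit near the end of the file
    tail = data[-1024:]
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker")

    # startxref is followed by the byte offset of the cross-reference section
    idx = tail.rfind(b"startxref")
    if idx == -1:
        problems.append("missing startxref keyword")
    else:
        tokens = tail[idx + len(b"startxref"):].split()
        if not tokens or not tokens[0].isdigit():
            problems.append("startxref has no numeric offset")
        elif int(tokens[0]) >= len(data):
            problems.append("startxref offset points past end of file")

    return problems
```

An empty list means none of these basic markers is missing; it does not prove the PDF renders correctly.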

Markdown Validation

from pathlib import Path
from infrastructure.validation import discover_markdown_files, validate_markdown, validate_images, validate_refs, validate_math
from infrastructure.validation.content.markdown_validator import collect_symbols

repo_root = Path(".")
manuscript_dir = repo_root / "projects" / "templates" / "template_code_project" / "manuscript"

# Validate all markdown files in a manuscript directory
problems, exit_code = validate_markdown(manuscript_dir, repo_root)

# Individual checks
md_files = [str(p) for p in discover_markdown_files(manuscript_dir, scope="tree")]
labels, anchors = collect_symbols(md_files)

image_issues = validate_images(md_files, repo_root)
ref_issues = validate_refs(md_files, repo_root, labels, anchors)
math_issues = validate_math(md_files, repo_root)
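
As an illustration of what a heading-hierarchy check can look like, the sketch below flags headings that jump more than one level deeper than the previous heading. It is a standalone example under assumed rules (ATX `#` headings, fenced code ignored); `find_heading_jumps` is a hypothetical name and not the function used by `validate_markdown`.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+\S")
FENCE = "`" * 3  # fenced code block delimiter


def find_heading_jumps(markdown_text: str) -> list[tuple[int, int, int]]:
    """Return (line_number, previous_level, level) for each skipped level."""
    jumps = []
    previous_level = 0
    in_fence = False
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        # Ignore '#' lines inside fenced code blocks (e.g. shell comments)
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Going deeper by more than one level skips part of the hierarchy
        if previous_level and level > previous_level + 1:
            jumps.append((line_number, previous_level, level))
        previous_level = level
    return jumps
```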

Output Integrity Verification

from pathlib import Path
from infrastructure.validation import (
    verify_output_integrity, verify_file_integrity,
    verify_cross_references, verify_data_consistency,
    verify_academic_standards, generate_integrity_report,
)

output_dir = Path("output/template_code_project")
manuscript_dir = Path("projects/templates/template_code_project/manuscript")
markdown_files = sorted(manuscript_dir.glob("*.md"))

# Full integrity check across all output artifacts
report = verify_output_integrity(output_dir)

# Targeted checks (each takes a list of paths)
verify_file_integrity([output_dir / "pdf" / "template_code_project_combined.pdf"])
verify_cross_references(markdown_files)
verify_data_consistency(sorted((output_dir / "data").glob("*")))
verify_academic_standards(markdown_files)

# Generate a structured report from the IntegrityReport returned above
integrity_report = generate_integrity_report(report)

Figure Validation

from pathlib import Path
from infrastructure.validation import validate_figure_registry

success, issues = validate_figure_registry(
    Path("projects/templates/template_code_project/output/figures/figure_registry.json"),
    Path("projects/templates/template_code_project/manuscript"),
)

Both registry shapes are accepted:

  • Dict shape: {"fig:label": {...}, ...} (emitted by FigureManager).
  • List shape: [{"label": "fig:label", ...}, ...] (emitted by project-side scripts, e.g. cognitive_case_diagrams/scripts/generate_diagrams.py).
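
One way to handle the two shapes above is to normalize both into a single label-keyed mapping before validating. The sketch below shows that idea; `normalize_registry` and the `caption` field in the usage example are hypothetical, and `validate_figure_registry` may handle the shapes differently.

```python
from typing import Any


def normalize_registry(raw: Any) -> dict[str, dict]:
    """Convert either registry shape into {label: entry} (illustrative only)."""
    if isinstance(raw, dict):
        # Dict shape: labels are already the keys
        return {str(label): dict(entry) for label, entry in raw.items()}
    if isinstance(raw, list):
        # List shape: each entry carries its own "label" field
        normalized: dict[str, dict] = {}
        for entry in raw:
            label = entry["label"]
            if label in normalized:
                raise ValueError(f"duplicate figure label: {label}")
            normalized[label] = {k: v for k, v in entry.items() if k != "label"}
        return normalized
    raise TypeError(f"unsupported registry type: {type(raw).__name__}")


# Both shapes normalize to the same mapping
as_dict = {"fig:label": {"caption": "Example"}}
as_list = [{"label": "fig:label", "caption": "Example"}]
same = normalize_registry(as_dict) == normalize_registry(as_list)
```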

Repository Audit and Issue Management

from pathlib import Path
from infrastructure.validation import (
    run_comprehensive_audit, generate_audit_report,
    categorize_by_type, assign_severity,
    filter_false_positives, prioritize_issues, generate_issue_summary,
)

project_path = Path("projects/templates/template_code_project")

# Run all validation checks in a single pass
audit_results = run_comprehensive_audit(project_path)
report = generate_audit_report(audit_results)

# Process and triage discovered issues
categorized = categorize_by_type(audit_results)
filtered = filter_false_positives(categorized)
prioritized = prioritize_issues(filtered)
summary = generate_issue_summary(prioritized)
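
The triage steps can be pictured as a sort by severity followed by a count per severity level. The following standalone sketch uses a hypothetical `Issue` dataclass and a hypothetical severity scale; the real data structures returned by `run_comprehensive_audit` are not shown in this document and may look different.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical severity scale, most urgent first
SEVERITY_ORDER = {"critical": 0, "error": 1, "warning": 2, "info": 3}


@dataclass(frozen=True)
class Issue:
    category: str
    severity: str
    message: str


def prioritize(issues: list[Issue]) -> list[Issue]:
    """Sort issues by severity, then by category and message for stable output."""
    unknown_rank = len(SEVERITY_ORDER)  # unknown severities sort last
    return sorted(
        issues,
        key=lambda i: (SEVERITY_ORDER.get(i.severity, unknown_rank), i.category, i.message),
    )


def summarize(issues: list[Issue]) -> dict[str, int]:
    """Count issues per severity level."""
    return dict(Counter(issue.severity for issue in issues))
```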

Output Structure Validation

from pathlib import Path
from infrastructure.validation import validate_output_structure, validate_copied_outputs

validate_output_structure(Path("output/template_code_project"))
validate_copied_outputs(
    source_dir=Path("projects/templates/template_code_project/output"),
    dest_dir=Path("output/template_code_project"),
)
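
For intuition, a copy check of this kind can be sketched as comparing the two trees by relative path and content hash. The example below is a standalone approximation; `compare_trees` is a hypothetical helper, and `validate_copied_outputs` may apply different rules (for example, checking only selected subdirectories).

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def relative_files(root: Path) -> set[str]:
    """Collect all file paths under root, relative to root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def compare_trees(source_dir: Path, dest_dir: Path) -> dict[str, list[str]]:
    """Report files missing from dest_dir or differing in content."""
    source_files = relative_files(source_dir)
    dest_files = relative_files(dest_dir)
    missing = sorted(source_files - dest_files)
    mismatched = sorted(
        rel
        for rel in source_files & dest_files
        if file_digest(source_dir / rel) != file_digest(dest_dir / rel)
    )
    return {"missing": missing, "mismatched": mismatched}
```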

Link Verification

from pathlib import Path
from infrastructure.validation import LinkValidator

validator = LinkValidator()
results = validator.check_all(Path("docs"))
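
The return shape of `check_all` is not documented here. As a rough illustration of what checking relative links involves, the standalone sketch below scans markdown files for inline links and reports targets that do not exist on disk. `find_broken_relative_links` is a hypothetical function, not part of `LinkValidator`, and it ignores external URLs and in-page anchors.

```python
import re
from pathlib import Path

# Matches inline links and images: [text](target) / ![alt](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme, e.g. https: or mailto:
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_broken_relative_links(docs_dir: Path) -> list[tuple[str, str]]:
    """Return (file, target) pairs whose relative target does not exist."""
    broken = []
    for md_file in sorted(docs_dir.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            # Skip external URLs and pure in-page anchors
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue
            # Drop any #fragment before resolving against the file's directory
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((md_file.relative_to(docs_dir).as_posix(), target))
    return broken
```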

CLI Usage

# Validate PDFs in an output directory
uv run python -m infrastructure.validation.cli pdf output/{project}/pdf/

# Validate markdown manuscript files
uv run python -m infrastructure.validation.cli markdown projects/{name}/manuscript/

# List all subcommands and options of the unified CLI entry point
uv run python -m infrastructure.validation.cli --help
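
A unified entry point with `pdf` and `markdown` subcommands, each taking a path, could be wired up with `argparse` roughly as follows. This is a sketch of the pattern only: the program name and help strings are invented, and the real `infrastructure.validation.cli` may define additional subcommands and flags.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per validation task (illustrative)."""
    parser = argparse.ArgumentParser(
        prog="validation-cli-sketch",  # hypothetical program name
        description="Sketch of a unified validation CLI",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    pdf = subparsers.add_parser("pdf", help="validate PDFs in a directory")
    pdf.add_argument("path", type=Path, help="directory containing PDF files")

    markdown = subparsers.add_parser("markdown", help="validate manuscript markdown")
    markdown.add_argument("path", type=Path, help="manuscript directory")

    return parser


# Parse an example invocation without touching sys.argv
args = build_parser().parse_args(["pdf", "output/demo/pdf/"])
```

Using subparsers keeps each validation task's arguments separate while sharing a single `--help` listing.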

Module Organization

The validation package is organized into four logical subpackage groups:

| Group | Modules | Purpose |
|---|---|---|
| Content | pdf_validator, markdown_validator, figure_validator | File-type-specific validation |
| Integrity | integrity, link_validator, check_links | Cross-reference and link verification |
| Repository | scanner, audit_orchestrator, issue_categorizer | Project-wide scanning and triage |
| Output | validator | Pipeline output structure checks |

All public functions are re-exported from infrastructure.validation for convenient single-import access.


Related Documentation