Archive Utils Final PR #59

Open
adeshusa wants to merge 41 commits into AI2Science:main from adeshusa:main

@adeshusa adeshusa commented Mar 23, 2026

Assessment of Goals and Implementation (Technical)

The implementation achieves the core objective: a modular archive subsystem that supports incremental writes, deterministic reads, and policy-driven validation on top of a stable VizFold 1.0 Zarr hierarchy.

Architecture and Responsibilities

The system is logically decoupled into three primary modules:

  1. core.py: Defines normalization, addressing, and validation primitives.
  2. store.py: Performs canonical, typed writes into the archive layout.
  3. load.py: Handles deterministic reads and external ingestion (.pkl, text attention).

This separation ensures one source of truth for archive invariants while preserving multiple entry paths (CLI, Demo, and Direct API).

[Figure: architecture diagram]

Goal Coverage

| Goal | Status | Highlights |
| --- | --- | --- |
| Incremental Updates | Achieved | Layer-indexed writes allow independent appending; existing data is protected by default. |
| Flexible Ingestion | Achieved | Maps heterogeneous data (.txt, .pkl) into a canonical schema using tensor_to_numpy. |
| Validation-First | Achieved | Integrated strict/lenient checks support both production and iterative dev workflows. |
| Canonical Format | Achieved | Stable hierarchy maintained across metadata/, representations/, attention/, and structure/. |
| Test-Backed Behavior | Substantial | E2E flows verify append/retrieval; remaining work focuses on edge-case hardening. |
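
To make the "Incremental Updates" row concrete, here is a minimal sketch of layer-indexed writes with overwrite protection. It is an illustration under stated assumptions, not the code in store.py: a plain dict stands in for the Zarr group, and `write_layer`, `layer_path`, and the `layer_NN` path naming are hypothetical names chosen for this example.

```python
from typing import Any, Dict

# Hypothetical stand-in for a Zarr group: maps a dataset path to its payload.
Store = Dict[str, Any]


def layer_path(kind: str, layer_index: int) -> str:
    """Build an assumed canonical path, e.g. 'representations/single/layer_03'."""
    is_int = isinstance(layer_index, int) and not isinstance(layer_index, bool)
    if not is_int or layer_index < 0:
        raise ValueError(f"layer_index must be a non-negative int, got {layer_index!r}")
    return f"representations/{kind}/layer_{layer_index:02d}"


def write_layer(store: Store, kind: str, layer_index: int, data: Any,
                overwrite: bool = False) -> str:
    """Write one layer; refuse to replace existing data unless overwrite=True."""
    path = layer_path(kind, layer_index)
    if path in store and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True to replace it")
    store[path] = data
    return path


store: Store = {}
write_layer(store, "single", 0, [[0.1, 0.2]])
write_layer(store, "single", 2, [[0.5, 0.6]])  # layers can be appended independently
```

Because each layer has its own path, writing layer 2 never touches layer 0, and a second write to layer 0 fails unless the caller opts in with `overwrite=True`.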

End-to-End Flow

  1. Entry Point: cli.py, demo.py, or direct API calls invoke the ingest/store methods.
  2. Ingestion: Parses and normalizes external artifacts into canonical arrays.
  3. Storage: Applies layer/path invariants and writes to Zarr datasets.
  4. Validation: Checks integrity based on the selected strictness policy.
  5. Loading: Returns deterministic outputs for visualization and analysis pipelines.

Result: A single contract that accepts many input forms, stores one canonical representation, and provides deterministic read semantics.
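
The validation step (step 4) can be sketched as a policy switch. This is a hedged illustration, not the actual validate_archive: the function name `validate_archive_sketch`, the `ArchiveValidationError` class, the dict-based store, and the choice that strict mode raises while lenient mode returns a list of issues are all assumptions made for this example. Only the four top-level group names come from the description above.

```python
from typing import Any, Dict, List

# Top-level groups named in the PR description.
REQUIRED_GROUPS = ("metadata", "representations", "attention", "structure")


class ArchiveValidationError(Exception):
    """Hypothetical error type raised by the strict policy in this sketch."""


def validate_archive_sketch(store: Dict[str, Any], strict: bool = True) -> List[str]:
    """Report missing top-level groups; raise in strict mode, return them otherwise."""
    present = {path.split("/", 1)[0] for path in store}
    issues = [f"missing group: {group}/" for group in REQUIRED_GROUPS if group not in present]
    if issues and strict:
        raise ArchiveValidationError("; ".join(issues))
    return issues


# A partially populated archive, as might exist midway through an incremental run.
partial = {
    "metadata/run": {"model": "example"},
    "representations/single/layer_00": [[0.1, 0.2]],
}
lenient_issues = validate_archive_sketch(partial, strict=False)
```

In this sketch the lenient policy lets an iterative workflow inspect an incomplete archive, while the strict policy stops a production run as soon as a required group is absent.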


Method Intent (One-liners)

Core & Storage

  • tensor_to_numpy: Normalize tensor-like objects to NumPy before persistence.
  • tensor_to_zarr_array: Write normalized arrays to canonical Zarr dataset paths.
  • _validate_layer_index: Enforce valid layer addressing prior to mutation.
  • validate_archive: Perform strict or lenient archive integrity checks.
  • store_metadata: Persist run/config provenance under metadata/.
  • store_single_representation: Write one single-representation layer.
  • store_pair_representation: Write one pair-representation layer.
  • store_attention: Write one attention tensor for {attention_type, layer_index}.
  • store_structure_coordinates: Write structure outputs (atom_positions, atom_mask, ptm).
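
As an illustration of the first and third one-liners, the sketch below shows one way normalization and layer-index checking could look. It is not the code in core.py: the `_sketch` function names, the `num_layers` parameter, and the duck-typed handling of tensor-like objects are assumptions for this example.

```python
import numpy as np


def tensor_to_numpy_sketch(obj):
    """Convert a tensor-like object to a NumPy array before persistence."""
    # Torch-style tensors: detach from autograd and move to CPU first.
    if hasattr(obj, "detach") and hasattr(obj, "cpu"):
        obj = obj.detach().cpu()
    # Objects exposing a .numpy() method are converted through it.
    if callable(getattr(obj, "numpy", None)):
        return np.asarray(obj.numpy())
    # Lists, tuples, scalars, and existing arrays go through np.asarray.
    return np.asarray(obj)


def validate_layer_index_sketch(layer_index, num_layers=None):
    """Return layer_index as an int, or raise before any write happens."""
    if isinstance(layer_index, bool) or not isinstance(layer_index, (int, np.integer)):
        raise TypeError(f"layer_index must be an integer, got {type(layer_index).__name__}")
    if layer_index < 0:
        raise IndexError(f"layer_index must be non-negative, got {layer_index}")
    if num_layers is not None and layer_index >= num_layers:
        raise IndexError(f"layer_index {layer_index} is out of range for {num_layers} layers")
    return int(layer_index)


array = tensor_to_numpy_sketch([[1.0, 2.0], [3.0, 4.0]])
index = validate_layer_index_sketch(3, num_layers=48)
```

Rejecting `bool` explicitly matters because `True` is an `int` subclass in Python and would otherwise be accepted silently as layer 1.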

Ingestion & Loading

  • ingest_attention_txt: Parse text attention artifacts into canonical tensors.
  • _extract_best_matching_array: Score and select best candidate arrays from pickle payloads.
  • ingest_output_pkl: Ingest pickle outputs with key-match traceability metadata.
  • load_metadata: Read metadata as Python-native values.
  • load_single_representation: Read one single-representation layer deterministically.
  • load_pair_representation: Read one pair-representation layer deterministically.
  • load_attention_head: Read one attention head slice for analysis.
  • ArchiveOrchestrator: Coordinate staged ingest/store/validate operations and summarize state.
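
The "score and select" idea behind _extract_best_matching_array can be sketched as follows. The scoring rule here (2 for an exact key match, 1 for a substring match), the function name, the alias list, and the returned key path are all assumptions for illustration; the actual heuristics in load.py may differ.

```python
import numpy as np


def extract_best_matching_array_sketch(payload, aliases):
    """Walk a nested dict and return (array, key_path, score) for the best match.

    Returns (None, None, 0) when no array key matches any alias.
    """
    best = (None, None, 0)

    def walk(node, path):
        nonlocal best
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, path + (str(key),))
            return
        if not isinstance(node, np.ndarray) or not path:
            return
        leaf = path[-1].lower()
        # Assumed scoring: exact key match beats substring match.
        score = max(
            (2 if leaf == alias.lower() else 1 if alias.lower() in leaf else 0
             for alias in aliases),
            default=0,
        )
        if score > best[2]:  # ties keep the first candidate encountered
            best = (node, "/".join(path), score)

    walk(payload, ())
    return best


# Hypothetical payload shaped like an unpickled model output.
payload = {
    "representation": {"single": np.zeros((4, 8)), "pair": np.zeros((4, 4, 2))},
    "extra": {"single_act": np.ones((4, 8))},
}
array, key_path, score = extract_best_matching_array_sketch(payload, ["single"])
```

Returning the matched key path alongside the array is what makes traceability possible: a caller could record it in a key_matches entry so that a later reader can see which pickle key fed each canonical dataset.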

Technical Quality Notes

  • Safety: Conservative overwrite policy and integrated validation.
  • Extensibility: Canonical pathing and clear module boundaries reduce technical debt.
  • Observability: Ingestion traceability is improved via key_matches.

Summary: The implementation is functionally complete for the stated goals and architecturally robust for future extension.

adeshusa and others added 30 commits March 7, 2026 12:07
added methods 1 and 2 as well as additional methods to convert vizfol…
This reverts commit 5152e4e, reversing
changes made to ccd6bcb.
rebranched outline.py into 3
Renamed archive + also turned it into an external module