
Design a Standardized Archive Format for Inference Traces - Issue #39 #53

Open: purvikathalkar wants to merge 4 commits into AI2Science:main from purvikathalkar:main

@purvikathalkar commented Mar 9, 2026

Design a Standardized Archive Format for Inference Traces - Issue #39

Overview

This PR addresses Issue #39 by defining and evaluating a standardized archive format for storing VizFold inference traces. The goal is to enable scalable, reproducible storage of large model internals — including layer-wise activations, attention maps, and structural outputs — in support of interpretability workflows.

Note: The prototype archive writer and inference pipeline integration are implemented in Team #40's PR. This PR focuses on the format specification, evaluation methodology, and benchmark results.

Motivation

VizFold provides a structured, queryable archive of inference traces. This requires a format that supports:

  • Layer-wise activations and intermediate representations
  • Attention maps across layers and heads
  • Structural outputs (atom positions, confidence scores)
  • Rich metadata (model version, config, timestamps, tensor shapes)
  • Partial/selective loading without reading the full archive
  • Scalability to large sequences and concurrent access

Archive Format Specification

After iterative design across MNIST, ViT, and OpenFold experiments, we arrived at the following standardized Zarr archive schema.

Schema

Run.vizfold.h5
│
├── metadata/
│   ├── model_name
│   ├── model_version
│   ├── config_version
│   ├── num_layers
│   ├── num_heads
│   ├── hidden_dim
│   ├── num_residues
│   ├── num_recycles
│   ├── tensor_dtypes
│   ├── tensor_shapes
│   ├── residue_indexing_scheme
│   ├── input_description
│   ├── timestamp
│   ├── recycle_info
│   └── representation_names
│
├── inputs/
│   ├── sequence              # shape: (N_res,)
│   └── msa                  # shape: (N_seq, N_res)
│
├── activations/
│   └── layers/
│       ├── single_repr       # shape: (N_res, hidden_dim)
│       └── pair_repr         # shape: (N_res, N_res, pair_dim)
│
├── attention/
│   └── layers/              # shape: (num_heads, N_res, N_res)
│
├── single/
│   └── layers/              # shape: (N_res, hidden_dim)
│
├── pair/
│   └── layers/              # shape: (N_res, N_res, pair_dim)
│
├── recycle/
│   └── steps/
│       ├── single_repr
│       └── pair_repr
│
├── structure/
│   ├── atom_positions        # shape: (N_res, 3)
│   ├── atom_mask
│   └── ptm
│
└── outputs/
    ├── coordinates           # shape: (N_res, 3)
    └── confidence_scores     # shape: (N_res,)
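
The metadata/ group above could be modeled as a typed record before it is written to the archive. The following is a minimal sketch, not the PR's implementation: the field names follow the schema, but the class name ArchiveMetadata, the helper from_json, the field types for config_version, recycle_info, and representation_names, and all example values are assumptions. It also assumes metadata is stored in JSON-compatible form, which is how Zarr persists group attributes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ArchiveMetadata:
    """Hypothetical typed view of the metadata/ group (field names from the schema)."""
    model_name: str
    model_version: str
    config_version: str          # assumed to be a string
    num_layers: int
    num_heads: int
    hidden_dim: int
    num_residues: int
    num_recycles: int
    tensor_dtypes: dict
    tensor_shapes: dict
    residue_indexing_scheme: str
    input_description: str
    timestamp: str
    recycle_info: dict = field(default_factory=dict)          # assumed type
    representation_names: list = field(default_factory=list)  # assumed type

    def to_json(self) -> str:
        # sort_keys gives a stable serialization that is easy to diff across runs
        return json.dumps(asdict(self), sort_keys=True)


def from_json(text: str) -> ArchiveMetadata:
    """Rebuild the record; unknown or missing keys raise TypeError."""
    return ArchiveMetadata(**json.loads(text))


# Illustrative values only, not taken from a real run
meta = ArchiveMetadata(
    model_name="openfold", model_version="example", config_version="example",
    num_layers=48, num_heads=8, hidden_dim=384, num_residues=256, num_recycles=3,
    tensor_dtypes={"single_repr": "float32"},
    tensor_shapes={"single_repr": (256, 384)},
    residue_indexing_scheme="0-based",
    input_description="synthetic example",
    timestamp="2026-03-09T00:00:00Z",
    representation_names=["single_repr", "pair_repr"],
)
restored = from_json(meta.to_json())
```

Note that tuples in tensor_shapes come back as lists after a JSON round trip, so a reader comparing shapes should normalize with tuple(...) first.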

Design Principles

| Principle | Description |
| --- | --- |
| Model Alignment | Data is organized layer- and step-wise, mirroring how OpenFold processes inputs |
| Separation of Concerns | Distinct groups for inputs, activations, attention, structure, recycles, and outputs |
| Fine-Grained Access | Chunked layout enables querying individual layers or residue ranges without loading the full archive |
| Reproducibility | Raw inputs (sequence, msa) and full metadata are always preserved |
| Extensibility | Users can add representation types without breaking the existing schema |
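
The Fine-Grained Access principle comes down to chunk arithmetic: a read of a residue range only needs the chunks that overlap it. The sketch below assumes fixed-length chunking along a single residue axis; the function names and the chunk length of 128 are illustrative and not taken from the PR.

```python
def chunks_touched(start: int, stop: int, chunk_len: int) -> range:
    """Indices of the chunks overlapped by the half-open residue range [start, stop)."""
    if chunk_len <= 0 or not 0 <= start < stop:
        raise ValueError("need chunk_len > 0 and 0 <= start < stop")
    first = start // chunk_len
    last = (stop - 1) // chunk_len  # chunk holding the final residue in the range
    return range(first, last + 1)


def fraction_read(start: int, stop: int, chunk_len: int, total_len: int) -> float:
    """Share of all chunks along the axis that the range forces a reader to load."""
    total_chunks = -(-total_len // chunk_len)  # ceiling division
    return len(chunks_touched(start, stop, chunk_len)) / total_chunks


# Residues 100..299 of a 1024-residue axis, chunked every 128 residues
touched = list(chunks_touched(100, 300, 128))
share = fraction_read(100, 300, 128, 1024)
```

Here the range overlaps 3 of the 8 chunks, so a chunk-aware reader loads well under half of the axis. Smaller chunks reduce over-read for narrow queries at the cost of more chunks to manage.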

Format Evaluation & Benchmarks

We benchmarked three candidate formats — NPZ, HDF5, and Zarr — across storage efficiency, read/write performance, and interpretability-specific access patterns using representative protein inputs.

Storage Size Scaling

[Figure: Storage size scaling]

All three formats exhibit comparable storage efficiency across sequence lengths, with only minor variation due to compression differences. Zarr achieves this while maintaining a chunked, addressable layout — meaning format selection is not constrained by storage overhead.


Full Archive Read Performance

[Figure: Full archive read performance]

Zarr demonstrates strong performance in full-archive reads, in some cases outperforming HDF5. Its chunked storage model does not significantly hinder contiguous data access and can efficiently utilize sequential read patterns.


Layer-Level Partial Read Latency

[Figure: Layer-level partial read latency]

Both HDF5 and Zarr support partial reads, which are fundamental to interpretability workflows. HDF5 achieves lower latency in local benchmarks due to efficient indexing within a contiguous file. Zarr, while slightly slower, enables flexible chunk alignment with interpretability units (e.g., individual layers), which becomes increasingly valuable as access patterns grow more complex.


Random Access Scaling

[Figure: Random access scaling]

Random access performance is comparable between HDF5 and Zarr at this scale. However, Zarr's chunk-based architecture is designed to scale in distributed settings — making it better suited for workloads involving frequent, non-sequential queries across model components.


Parallel Interpretability Workload

[Figure: Parallel interpretability workload]

Under concurrent access, Zarr demonstrates improved scalability compared to HDF5, benefiting from its independently addressable chunk structure. Multiple threads or users can access different regions of the dataset without contention. HDF5's single-file design introduces coordination overhead between readers, making it less suitable for multi-user interpretability systems.

Note: The slight decrease in latency observed for Zarr at larger sequence lengths is attributed to caching and runtime effects, not intrinsic I/O improvements.
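
The access pattern behind this workload can be illustrated without either library. The sketch below is a stand-in, not the benchmark itself: it writes each layer as its own file in a temporary directory (similar in spirit to how a directory-backed Zarr store keeps each chunk as a separate object) and reads the files from a thread pool. The helper names, layer count, and chunk size are assumptions for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_layer_chunks(root: str, num_layers: int, chunk_bytes: int) -> list:
    """Write one file per layer so each can be opened independently."""
    paths = []
    for i in range(num_layers):
        path = os.path.join(root, f"layer_{i}")
        with open(path, "wb") as fh:
            fh.write(bytes([i % 256]) * chunk_bytes)  # filler payload tagged by layer
        paths.append(path)
    return paths


def read_chunk(path: str) -> bytes:
    # Each reader opens its own file handle; no handle is shared across threads
    with open(path, "rb") as fh:
        return fh.read()


def parallel_read(paths: list, workers: int = 4) -> list:
    """Read all chunks concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_chunk, paths))


with tempfile.TemporaryDirectory() as tmp:
    chunk_paths = write_layer_chunks(tmp, num_layers=8, chunk_bytes=1024)
    blobs = parallel_read(chunk_paths)
```

Because every reader touches a different file, there is no shared handle to coordinate, which is the property the benchmark attributes to Zarr's chunk layout.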


Summary

| Criterion | NPZ | HDF5 | Zarr |
| --- | --- | --- | --- |
| Storage efficiency | Good | Good | Good |
| Full archive read | Good | Strong | Strong |
| Partial / layer-level read | Poor | Strong | Good |
| Random access scaling | Poor | Good | Good |
| Parallel / concurrent access | Poor | Limited | Strong |

Decision: Zarr was selected as the archive format. While HDF5 shows strong performance for structured local workloads, Zarr's chunked design, flexible partial access, and scalability under concurrent use make it better suited for VizFold's long-term interpretability requirements.


Prototype Development Journey

The final archive schema emerged from three iterative prototyping stages. Each stage uncovered new limitations that informed the next design.

Stage 1 — MNIST Baseline

The MNIST experiments provided the first working prototype for storing layer-wise activations. Key findings:

  • Activations alone were insufficient for reproducibility — without model configuration, tensor shapes, or structured organization, stored representations were difficult to interpret or compare across runs.
  • This led to the introduction of explicit metadata and a shift toward organizing data by layers.
  • The archive began to be treated as a structured, model-aligned representation rather than a linear log of tensors.

MNIST Layer Activations:

[Figure: MNIST layer activations]

Stage 2 — Vision Transformer (ViT) Extension

Extending the archive to a Vision Transformer introduced more complex representations — token-based embeddings and multi-head attention across image patches.

Key findings:

  • Attention maps proved essential for understanding how input regions contribute to predictions, and became a required component of the archive.
  • Practical limitations in reconstructing tensor structure without explicit metadata prompted expansion of the metadata schema (tensor shapes, model dimensions, architectural parameters).
  • The flat structure used in MNIST became insufficient, prompting a transition to a hierarchical layout organized by layers and modules.

ViT Image Patches:

[Figure: ViT image patches]

ViT Attention Across Layers:

[Figure: ViT attention across layers]

Stage 3 — OpenFold Generalization

The ViT insights directly informed the OpenFold-compatible archive design:

  • An inputs/ module was introduced to preserve raw sequences and MSAs, ensuring full reproducibility.
  • Inspired by ViT token representations, the single/ and pair/ modules capture per-residue features and pairwise residue relationships, both of which are central to protein structure modeling.
  • A recycle/ module was added to capture OpenFold's iterative refinement steps across inference cycles.
  • An outputs/ module stores final predictions (coordinates, confidence scores) alongside intermediate results in a unified structure.

VizFold Architectural Diagram:

[Figure: VizFold archive architecture]

Running the Benchmark Script

The benchmark script (archiveformat.py) reproduces all figures shown in the Format Evaluation section above.

Prerequisites

  • Python 3.9+
  • The following packages:
pip install zarr h5py numpy matplotlib

Or install from the requirements file:

pip install -r requirements.txt

Running the Script

python archiveformat.py

The script will:

  1. Generate synthetic protein-like tensors at varying sequence lengths (N_res = 256, 512, 1024, 2048)
  2. Write test archives in NPZ, HDF5, and Zarr formats
  3. Benchmark storage size, full reads, partial reads, random access, and parallel access
  4. Write the measured datapoints to benchmark_results.csv and render the result plots

Expected Output

After the script completes, benchmark_results.csv contains the measured datapoints and the five plots are rendered.
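
For readers who want to extend the measurements, the timing-and-CSV pattern can be sketched as follows. This is not the code in archiveformat.py: the helper names, the CSV column names, and the placeholder workload are all assumptions, and the timings it produces describe the placeholder only.

```python
import csv
import io
import statistics
import time


def time_call(fn, repeats: int = 5) -> float:
    """Median wall-clock seconds over `repeats` calls to fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def rows_to_csv(rows: list) -> str:
    """Serialize result rows (dicts with identical keys) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def placeholder_read() -> None:
    # Stand-in for "read one archive"; swap in a real reader to measure a format
    sum(range(10_000))


# One row per sequence length used by the benchmark script
rows = [
    {
        "format": "placeholder",
        "n_res": n_res,
        "benchmark": "full_read",
        "seconds": time_call(placeholder_read),
    }
    for n_res in (256, 512, 1024, 2048)
]
csv_text = rows_to_csv(rows)
```

Taking the median over several repeats dampens the caching and runtime effects noted in the parallel workload results; writing csv_text to a file would give a table in the same spirit as benchmark_results.csv.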


Demo Walkthrough

A video walkthrough demonstrating the archive format, benchmark script execution, and output visualization is available below.

Demo Video


Current Limitations

  • Storage management at scale: As datasets grow, chunking strategy and organization become harder to tune. Large Zarr archives may experience slower processing if chunk sizes are not carefully aligned to access patterns.
  • Local benchmark bias: The partial read latency benchmarks favor HDF5 due to single-node contiguous access; real-world distributed workloads are expected to narrow this gap in Zarr's favor.
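
To make the chunk-tuning concern concrete, the arithmetic for a single pair representation can be worked through directly. The sketch assumes float32 values, a pair_dim of 128, and a chunk shape of (256, 256, pair_dim); none of these values are stated in the PR, and the helper names are illustrative.

```python
from math import ceil, prod


def tensor_nbytes(shape: tuple, itemsize: int = 4) -> int:
    """Uncompressed size in bytes; itemsize=4 corresponds to float32."""
    return prod(shape) * itemsize


def num_chunks(shape: tuple, chunk_shape: tuple) -> int:
    """Number of chunks needed to tile `shape`, rounding up along each axis."""
    return prod(ceil(s / c) for s, c in zip(shape, chunk_shape))


n_res, pair_dim = 2048, 128                 # pair_dim is an assumed value
pair_shape = (n_res, n_res, pair_dim)       # (N_res, N_res, pair_dim) from the schema
chunk_shape = (256, 256, pair_dim)          # assumed chunking, for illustration

total_bytes = tensor_nbytes(pair_shape)     # 2 GiB for one pair tensor
chunk_bytes = tensor_nbytes(chunk_shape)    # 32 MiB per chunk
n_chunks = num_chunks(pair_shape, chunk_shape)
```

Under these assumptions one pair tensor at N_res = 2048 is 2 GiB uncompressed and splits into 64 chunks of 32 MiB. Since the size grows quadratically in N_res, doubling the sequence length quadruples both the bytes and the chunk count per layer, which is why the chunk shape needs to be revisited for longer sequences.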

Future Work

  • Implement comprehensive benchmarks comparing NPZ, HDF5, and Zarr on real OpenFold inference traces (vs. synthetic tensors used here)
  • Investigate direct memory-to-archive streaming to reduce write latency during inference
  • Explore optimal chunking strategies for very long sequences (>2048 residues)

@purvikathalkar
Author

archive/
├── metadata/ 
│   {
│       model_name: str,
│       model_version: str,
│       num_layers: int,
│       num_heads: int,
│       hidden_dim: int,
│       num_residues: int,
│       num_recycles: int,
│       tensor_dtypes: dict,       # e.g., {"single_repr": "float32", ...}
│       tensor_shapes: dict,       # e.g., {"single_repr": (N_res, hidden_dim), ...}
│       residue_indexing_scheme: str,
│       input_description: str,
│       timestamp: str
│   }
│
├── inputs/
│   sequence          # (N_res,)
│   msa               # (N_seq, N_res)
│
├── activations/
│   layer_0/
│       single_repr   # (N_res, hidden_dim)
│       pair_repr     # (N_res, N_res, pair_dim)
│   layer_1/
│       single_repr
│       pair_repr
│   ... more layers
│
├── attention/
│   layer_0           # (num_heads, N_res, N_res)
│   layer_1
│   ... more layers
│
├── outputs/
│   coordinates       # (N_res, 3)
│   confidence_scores # (N_res,)
│
├── single/
│   layer_0           # (N_res, hidden_dim)
│   layer_1
│   ... more layers
│
├── pair/
│   layer_0           # (N_res, N_res, pair_dim)
│   layer_1
│   ... more layers
│
├── recycle/
│   step_0/
│       single_repr
│       pair_repr
│   step_1/
│       single_repr
│       pair_repr
│   ... more steps

Naming Conventions:

  • Layers: layer_0, layer_1, …, layer_N
  • Recycling steps: step_0, step_1, …
  • Tensor names:
    • single_repr
    • pair_repr
    • coordinates
    • confidence_scores
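
These conventions can be captured in a few small helpers so that writers and readers build identical keys. The sketch below follows the layer_N and step_N patterns and the group names from the tree above; the function names are hypothetical and not part of the PR.

```python
def layer_key(index: int) -> str:
    """Key for a layer group or array, e.g. layer_0."""
    if index < 0:
        raise ValueError("layer index must be non-negative")
    return f"layer_{index}"


def step_key(index: int) -> str:
    """Key for a recycling step group, e.g. step_0."""
    if index < 0:
        raise ValueError("step index must be non-negative")
    return f"step_{index}"


def activation_path(layer: int, tensor: str) -> str:
    """Path of a tensor under activations/, e.g. activations/layer_3/single_repr."""
    return f"activations/{layer_key(layer)}/{tensor}"


def recycle_path(step: int, tensor: str) -> str:
    """Path of a tensor under recycle/, e.g. recycle/step_1/pair_repr."""
    return f"recycle/{step_key(step)}/{tensor}"


def parse_index(key: str, prefix: str) -> int:
    """Recover the integer index from a key such as layer_10."""
    head, _, tail = key.rpartition("_")
    if head != prefix or not tail.isdigit():
        raise ValueError(f"{key!r} does not match {prefix}_<int>")
    return int(tail)


keys = ["layer_10", "layer_2", "layer_0"]
# Plain string sorting puts layer_10 before layer_2; sort by parsed index instead
ordered = sorted(keys, key=lambda k: parse_index(k, "layer"))
```

Because the indices are not zero-padded, any code that iterates over layers in order should sort by the parsed integer rather than by the raw key.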

@purvikathalkar changed the title from "Implements a standardized Zarr archive for ViT and MNIST inference traces" to "Design a Standardized Archive Format for Inference Traces - Issue #39" on Apr 30, 2026
@purvikathalkar purvikathalkar marked this pull request as ready for review April 30, 2026 00:32