Design a Standardized Archive Format for Inference Traces - Issue #39 by purvikathalkar · Pull Request #53 · AI2Science/vizfold-foundation

purvikathalkar · 2026-03-09T02:06:50Z

Design a Standardized Archive Format for Inference Traces - Issue #39

Overview

This PR addresses Issue #39 by defining and evaluating a standardized archive format for storing VizFold inference traces. The goal is to enable scalable, reproducible storage of large model internals — including layer-wise activations, attention maps, and structural outputs — in support of interpretability workflows.

Note: The prototype archive writer and inference pipeline integration are implemented in Team #40's PR. This PR focuses on the format specification, evaluation methodology, and benchmark results.

Motivation

VizFold needs a structured, queryable archive of inference traces. The archive format must support:

Layer-wise activations and intermediate representations
Attention maps across layers and heads
Structural outputs (atom positions, confidence scores)
Rich metadata (model version, config, timestamps, tensor shapes)
Partial/selective loading without reading the full archive
Scalability to large sequences and concurrent access

Archive Format Specification

After iterative design across MNIST, ViT, and OpenFold experiments, we arrived at the following standardized Zarr archive schema.

Schema

Run.vizfold.h5
│
├── metadata/
│   ├── model_name
│   ├── model_version
│   ├── config_version
│   ├── num_layers
│   ├── num_heads
│   ├── hidden_dim
│   ├── num_residues
│   ├── num_recycles
│   ├── tensor_dtypes
│   ├── tensor_shapes
│   ├── residue_indexing_scheme
│   ├── input_description
│   ├── timestamp
│   ├── recycle_info
│   └── representation_names
│
├── inputs/
│   ├── sequence              # shape: (N_res,)
│   └── msa                  # shape: (N_seq, N_res)
│
├── activations/
│   └── layers/
│       ├── single_repr       # shape: (N_res, hidden_dim)
│       └── pair_repr         # shape: (N_res, N_res, pair_dim)
│
├── attention/
│   └── layers/              # shape: (num_heads, N_res, N_res)
│
├── single/
│   └── layers/              # shape: (N_res, hidden_dim)
│
├── pair/
│   └── layers/              # shape: (N_res, N_res, pair_dim)
│
├── recycle/
│   └── steps/
│       ├── single_repr
│       └── pair_repr
│
├── structure/
│   ├── atom_positions        # shape: (N_res, 3)
│   ├── atom_mask
│   └── ptm
│
└── outputs/
    ├── coordinates           # shape: (N_res, 3)
    └── confidence_scores     # shape: (N_res,)
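
As an illustration of how the `metadata/` entries might be assembled, the sketch below builds them as a JSON-serializable dictionary. The field names come from the schema above; every value (model name, dimensions, layer count, the shape of `recycle_info`) is a made-up placeholder, and the helper name `build_metadata` is hypothetical. Zarr stores attributes as JSON, so tuples such as tensor shapes come back as lists after a round trip, which readers of the archive need to account for.

```python
import json
from datetime import datetime, timezone


def build_metadata(num_residues, hidden_dim, pair_dim,
                   num_layers, num_heads, num_recycles):
    """Assemble the metadata/ fields as JSON-serializable values (hypothetical helper)."""
    return {
        "model_name": "openfold",            # placeholder value
        "model_version": "unknown",          # placeholder value
        "config_version": "unknown",         # placeholder value
        "num_layers": num_layers,
        "num_heads": num_heads,
        "hidden_dim": hidden_dim,
        "num_residues": num_residues,
        "num_recycles": num_recycles,
        "tensor_dtypes": {"single_repr": "float32", "pair_repr": "float32"},
        "tensor_shapes": {
            "single_repr": (num_residues, hidden_dim),
            "pair_repr": (num_residues, num_residues, pair_dim),
        },
        "residue_indexing_scheme": "0-based",       # assumption
        "input_description": "synthetic example",   # placeholder value
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "recycle_info": {"num_steps": num_recycles},  # placeholder structure
        "representation_names": ["single_repr", "pair_repr"],
    }


meta = build_metadata(num_residues=256, hidden_dim=384, pair_dim=128,
                      num_layers=48, num_heads=8, num_recycles=3)

# Simulate what happens when the metadata is written as JSON and read back.
restored = json.loads(json.dumps(meta))

# JSON has no tuple type, so shapes are restored as lists.
print(restored["tensor_shapes"]["pair_repr"])
```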

Design Principles

Principle	Description
Model Alignment	Data is organized layer- and step-wise, mirroring how OpenFold processes inputs
Separation of Concerns	Distinct groups for inputs, activations, attention, structure, recycles, and outputs
Fine-Grained Access	Chunked layout enables querying individual layers or residue ranges without loading the full archive
Reproducibility	Raw inputs (`sequence`, `msa`) and full metadata are always preserved
Extensibility	Users can add representation types without breaking the existing schema
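
To make the Fine-Grained Access principle concrete, the sketch below works out which chunks a residue-range query would touch. It assumes a `pair_repr` array of shape `(N_res, N_res, pair_dim)` stored with square chunks of `chunk_len` residues per side; both the sizes and the helper name `chunks_touched` are illustrative assumptions, not values taken from the prototype.

```python
def chunks_touched(start, stop, chunk_len):
    """Return the chunk indices along one axis that overlap the half-open range [start, stop)."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    return range(first, last + 1)


n_res, chunk_len = 1024, 128                  # assumed sizes

rows = chunks_touched(300, 400, chunk_len)    # residues 300..399 on the first axis
cols = chunks_touched(0, n_res, chunk_len)    # every residue on the second axis

total_chunks = (n_res // chunk_len) ** 2
read_chunks = len(rows) * len(cols)

print(list(rows), read_chunks, total_chunks)  # [2, 3] 16 64
```

Under these assumptions, a query for 100 residues reads 16 of 64 chunks rather than the whole array.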

Format Evaluation & Benchmarks

We benchmarked three candidate formats — NPZ, HDF5, and Zarr — across storage efficiency, read/write performance, and interpretability-specific access patterns using representative protein inputs.

Storage Size Scaling

All three formats exhibit comparable storage efficiency across sequence lengths, with only minor variation due to compression differences. Zarr achieves this while maintaining a chunked, addressable layout — meaning format selection is not constrained by storage overhead.
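
For a sense of the scale involved, the uncompressed footprint of one layer's tensors can be estimated directly from the schema shapes. The sketch below uses the benchmark's sequence lengths and assumes float32 storage with `hidden_dim = 384` and `pair_dim = 128`; those two dimensions are assumptions chosen for illustration and are not stated in this PR.

```python
BYTES_PER_FLOAT32 = 4


def uncompressed_bytes(n_res, hidden_dim=384, pair_dim=128):
    """Return (single_repr_bytes, pair_repr_bytes) for one layer, before compression."""
    single = n_res * hidden_dim * BYTES_PER_FLOAT32
    pair = n_res * n_res * pair_dim * BYTES_PER_FLOAT32
    return single, pair


for n_res in (256, 512, 1024, 2048):
    single, pair = uncompressed_bytes(n_res)
    print(f"N_res={n_res:5d}  single={single / 2**20:8.2f} MiB  pair={pair / 2**20:8.2f} MiB")
```

The pair representation grows quadratically with sequence length: under these assumptions it is 32 MiB per layer at 256 residues and 2048 MiB per layer at 2048 residues.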

Full Archive Read Performance

Zarr demonstrates strong performance in full-archive reads, in some cases outperforming HDF5. Its chunked storage model does not significantly hinder contiguous data access and can efficiently utilize sequential read patterns.

Layer-Level Partial Read Latency

Both HDF5 and Zarr support partial reads, which are fundamental to interpretability workflows. HDF5 achieves lower latency in local benchmarks due to efficient indexing within a contiguous file. Zarr, while slightly slower, enables flexible chunk alignment with interpretability units (e.g., individual layers), which becomes increasingly valuable as access patterns grow more complex.

Random Access Scaling

Random access performance is comparable between HDF5 and Zarr at this scale. However, Zarr's chunk-based architecture is designed to scale in distributed settings — making it better suited for workloads involving frequent, non-sequential queries across model components.

Parallel Interpretability Workload

Under concurrent access, Zarr demonstrates improved scalability compared to HDF5, benefiting from its independently addressable chunk structure. Multiple threads or users can access different regions of the dataset without contention. HDF5's single-file design introduces coordination overhead, making it less suitable for multi-user interpretability systems.

Note: The slight decrease in latency observed for Zarr at larger sequence lengths is attributed to caching and runtime effects, not intrinsic I/O improvements.
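
The independence of chunks mentioned above can be illustrated without touching any storage. The sketch below assumes the Zarr v2 directory layout with the default `.` separator, where each chunk lives under its own key such as `attention/layer_0/3.0.0`; other store layouts encode keys differently. The array shape, chunk shape, and the helper `chunk_keys` are hypothetical.

```python
from itertools import product


def chunk_keys(array_path, shape, chunks, region):
    """List Zarr v2-style chunk keys ("i.j.k") that overlap a region.

    region holds one half-open (start, stop) pair per axis.
    """
    per_axis = []
    for (start, stop), size, chunk in zip(region, shape, chunks):
        if not 0 <= start < stop <= size:
            raise ValueError("region out of bounds")
        per_axis.append(range(start // chunk, (stop - 1) // chunk + 1))
    return {f"{array_path}/" + ".".join(map(str, idx)) for idx in product(*per_axis)}


shape = (8, 512, 512)     # (num_heads, N_res, N_res), assumed sizes
chunks = (1, 128, 128)    # one head per chunk, assumed chunking

# Two hypothetical readers inspecting different heads of the same layer.
reader_a = chunk_keys("attention/layer_0", shape, chunks, [(0, 1), (0, 512), (0, 512)])
reader_b = chunk_keys("attention/layer_0", shape, chunks, [(3, 4), (0, 128), (0, 128)])

print(sorted(reader_b))                               # ['attention/layer_0/3.0.0']
print(len(reader_a), reader_a.isdisjoint(reader_b))   # 16 True
```

Because the two key sets are disjoint, the readers never request the same stored object, which is the property the concurrent-access result relies on.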

Summary

Criterion	NPZ	HDF5	Zarr
Storage efficiency	Good	Good	Good
Full archive read	Good	Strong	Strong
Partial / layer-level read	Poor	Strong	Good
Random access scaling	Poor	Good	Good
Parallel / concurrent access	Poor	Limited	Strong

Decision: Zarr was selected as the archive format. While HDF5 shows strong performance for structured local workloads, Zarr's chunked design, flexible partial access, and scalability under concurrent use make it better suited for VizFold's long-term interpretability requirements.

Prototype Development Journey

The final archive schema emerged from three iterative prototyping stages. Each stage uncovered new limitations that informed the next design.

Stage 1 — MNIST Baseline

The MNIST experiments provided the first working prototype for storing layer-wise activations. Key findings:

Activations alone were insufficient for reproducibility — without model configuration, tensor shapes, or structured organization, stored representations were difficult to interpret or compare across runs.
This led to the introduction of explicit metadata and a shift toward organizing data by layers.
The archive began to be treated as a structured, model-aligned representation rather than a linear log of tensors.

MNIST Layer Activations:

Stage 2 — Vision Transformer (ViT) Extension

Extending the archive to a Vision Transformer introduced more complex representations — token-based embeddings and multi-head attention across image patches.

Key findings:

Attention maps proved essential for understanding how input regions contribute to predictions, and became a required component of the archive.
Practical limitations in reconstructing tensor structure without explicit metadata prompted expansion of the metadata schema (tensor shapes, model dimensions, architectural parameters).
The flat structure used in MNIST became insufficient, prompting a transition to a hierarchical layout organized by layers and modules.

ViT Image Patches:

ViT Attention Across Layers:

---

Stage 3 — OpenFold Generalization

The ViT insights directly informed the OpenFold-compatible archive design:

An inputs/ module was introduced to preserve raw sequences and MSAs, ensuring full reproducibility.
Inspired by ViT token representations, single/ and pair/ modules capture per-residue and pairwise residue relationships — central to protein structure modeling.
A recycle/ module was added to capture OpenFold's iterative refinement steps across inference cycles.
An outputs/ module stores final predictions (coordinates, confidence scores) alongside intermediate results in a unified structure.

VizFold Architectural Diagram:

Running the Benchmark Script

The benchmark script (archiveformat.py) reproduces all figures shown in the Format Evaluation section above.

Prerequisites

Python 3.9+
The following packages:

pip install zarr h5py numpy matplotlib

Or install from the requirements file:

pip install -r requirements.txt

Running the Script

python archiveformat.py

The script will:

Generate synthetic protein-like tensors at varying sequence lengths (N_res = 256, 512, 1024, 2048)
Write test archives in NPZ, HDF5, and Zarr formats
Benchmark storage size, full reads, partial reads, random access, and parallel access
Write the benchmark datapoints to benchmark_results.csv and render the result plots
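
The overall shape of such a benchmark loop can be sketched with the standard library alone. The code below is not taken from archiveformat.py: the function names, the CSV column names, and the stand-in workload are all hypothetical, and results go to an in-memory buffer rather than benchmark_results.csv.

```python
import csv
import io
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over several runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def fake_read(n_res):
    """Stand-in workload; a real benchmark would read from an archive here."""
    return sum(range(n_res * 100))


rows = []
for n_res in (256, 512, 1024, 2048):
    seconds = time_call(lambda: fake_read(n_res))
    rows.append({"format": "placeholder", "benchmark": "full_read",
                 "n_res": n_res, "seconds": seconds})

# An in-memory buffer stands in for the CSV file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["format", "benchmark", "n_res", "seconds"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue().splitlines()[0])   # format,benchmark,n_res,seconds
```

Taking the best of several repeats reduces noise from caching and scheduling, which matters given the caching effects noted in the benchmark results.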

Expected Output

After the script completes, benchmark_results.csv contains the measured datapoints and the five benchmark plots are rendered.

Demo Walkthrough

A video walkthrough demonstrating the archive format, benchmark script execution, and output visualization is available below.

Demo Video

Current Limitations

Storage management at scale: As datasets grow, chunking strategy and organization become harder to tune. Large Zarr archives may experience slower processing if chunk sizes are not carefully aligned to access patterns.
Local benchmark bias: The partial read latency benchmarks favor HDF5 due to single-node contiguous access; real-world distributed workloads are expected to narrow this gap in Zarr's favor.
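
One way to reason about chunk sizing is to pick a target chunk size in bytes and derive the chunk edge from it. The sketch below does this for square `(c, c, pair_dim)` chunks of a float32 pair representation; the 4 MiB target, `pair_dim = 128`, and the helper name `square_chunk_edge` are assumptions for illustration, not tuned recommendations.

```python
import math


def square_chunk_edge(n_res, pair_dim, target_bytes, itemsize=4):
    """Largest edge c (capped at n_res) such that a (c, c, pair_dim) chunk fits in target_bytes."""
    edge = math.isqrt(target_bytes // (pair_dim * itemsize))
    return max(1, min(n_res, edge))


for n_res in (256, 2048):
    edge = square_chunk_edge(n_res, pair_dim=128, target_bytes=4 * 2**20)
    n_chunks = math.ceil(n_res / edge) ** 2
    print(n_res, edge, n_chunks)
# 256 90 9
# 2048 90 529
```

With a fixed chunk edge the number of chunks per layer grows quadratically with sequence length, which is why chunking becomes harder to tune as archives grow.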

Future Work

Implement comprehensive benchmarks comparing NPZ, HDF5, and Zarr on real OpenFold inference traces (vs. synthetic tensors used here)
Investigate direct memory-to-archive streaming to reduce write latency during inference
Explore optimal chunking strategies for very long sequences (>2048 residues)

purvikathalkar · 2026-04-04T05:50:08Z

archive/
├── metadata/
│   {
│       model_name: str,
│       model_version: str,
│       num_layers: int,
│       num_heads: int,
│       hidden_dim: int,
│       num_residues: int,
│       num_recycles: int,
│       tensor_dtypes: dict,       # e.g., {"single_repr": "float32", ...}
│       tensor_shapes: dict,       # e.g., {"single_repr": (N_res, hidden_dim), ...}
│       residue_indexing_scheme: str,
│       input_description: str,
│       timestamp: str
│   }
│
├── inputs/
│   sequence          # (N_res,)
│   msa               # (N_seq, N_res)
│
├── activations/
│   layer_0/
│       single_repr   # (N_res, hidden_dim)
│       pair_repr     # (N_res, N_res, pair_dim)
│   layer_1/
│       single_repr
│       pair_repr
│   ... more layers
│
├── attention/
│   layer_0           # (num_heads, N_res, N_res)
│   layer_1
│   ... more layers
│
├── outputs/
│   coordinates       # (N_res, 3)
│   confidence_scores # (N_res,)
│
├── single/
│   layer_0           # (N_res, hidden_dim)
│   layer_1
│   ... more layers
│
├── pair/
│   layer_0           # (N_res, N_res, pair_dim)
│   layer_1
│   ... more layers
│
└── recycle/
│   step_0/
│       single_repr
│       pair_repr
│   step_1/
│       single_repr
│       pair_repr
│   ... more steps

Naming Conventions:

Layers: layer_0, layer_1, …, layer_N
Recycling steps: step_0, step_1, …
Tensor names:
- single_repr
- pair_repr
- coordinates
- confidence_scores
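
The naming conventions above can be captured in a few small helpers. This is a minimal sketch: the function names are hypothetical, and the path list simply mirrors the tree in this comment. It also shows why layer keys should be sorted by their numeric suffix, since a plain string sort would place layer_10 before layer_2.

```python
import re

LAYER_RE = re.compile(r"^layer_(\d+)$")


def layer_key(index):
    return f"layer_{index}"


def step_key(index):
    return f"step_{index}"


def expected_paths(num_layers, num_recycles):
    """Enumerate the array paths implied by the tree above (hypothetical helper)."""
    paths = ["inputs/sequence", "inputs/msa",
             "outputs/coordinates", "outputs/confidence_scores"]
    for i in range(num_layers):
        layer = layer_key(i)
        paths += [f"activations/{layer}/single_repr",
                  f"activations/{layer}/pair_repr",
                  f"attention/{layer}", f"single/{layer}", f"pair/{layer}"]
    for s in range(num_recycles):
        step = step_key(s)
        paths += [f"recycle/{step}/single_repr", f"recycle/{step}/pair_repr"]
    return paths


def layer_index(name):
    """Parse 'layer_12' into 12 so layers sort numerically."""
    match = LAYER_RE.match(name)
    if match is None:
        raise ValueError(f"not a layer key: {name!r}")
    return int(match.group(1))


paths = expected_paths(num_layers=2, num_recycles=1)
print(len(paths))                                                    # 16
print(sorted(["layer_10", "layer_2", "layer_1"], key=layer_index))   # ['layer_1', 'layer_2', 'layer_10']
```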

purvikathalkar added 3 commits March 8, 2026 11:00

mnist related work with zarr

be4647f

activation visualizations for ViT

29fd889

improved metadata storage

c78dd5c

candidate formats benchmark test results

e04a858

purvikathalkar changed the title ~~Implements a standardized Zarr archive for ViT and MNIST inference traces~~ Design a Standardized Archive Format for Inference Traces - Issue #39 Apr 30, 2026

purvikathalkar marked this pull request as ready for review April 30, 2026 00:32

purvikathalkar wants to merge 4 commits into AI2Science:main from purvikathalkar:main
