Design a Standardized Archive Format for Inference Traces - Issue #39 #53
Open
purvikathalkar wants to merge 4 commits into
Naming Conventions:
Design a Standardized Archive Format for Inference Traces - Issue #39
Overview
This PR addresses Issue #39 by defining and evaluating a standardized archive format for storing VizFold inference traces. The goal is to enable scalable, reproducible storage of large model internals — including layer-wise activations, attention maps, and structural outputs — in support of interpretability workflows.
Motivation
VizFold provides a structured, queryable archive of inference traces. This requires a format that supports:
Archive Format Specification
After iterative design across MNIST, ViT, and OpenFold experiments, we arrived at the following standardized Zarr archive schema.
Schema
Design Principles
sequence, msa) and full metadata are always preserved
Format Evaluation & Benchmarks
We benchmarked three candidate formats — NPZ, HDF5, and Zarr — across storage efficiency, read/write performance, and interpretability-specific access patterns using representative protein inputs.
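For orientation, a measurement of this kind can be sketched as below. This is not the code in archiveformat.py: the function names, the synthetic data, and the CSV columns are assumptions, and only the NPZ case is shown so that the example depends on NumPy alone.

```python
import csv
import os
import tempfile
import time

import numpy as np


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def benchmark_npz(seq_len, n_layers=4, channels=8, workdir="."):
    """Write and read back a synthetic trace as NPZ; return one result row."""
    rng = np.random.default_rng(0)
    # Hypothetical per-layer activations of shape (seq_len, channels).
    layers = {
        f"layer_{i:02d}": rng.standard_normal((seq_len, channels)).astype(np.float32)
        for i in range(n_layers)
    }
    path = os.path.join(workdir, f"trace_{seq_len}.npz")

    write_s = time_call(lambda: np.savez_compressed(path, **layers))

    def read_all():
        with np.load(path) as archive:
            for name in archive.files:
                archive[name]  # indexing forces the array to be decompressed

    read_s = time_call(read_all)
    return {
        "format": "npz",
        "seq_len": seq_len,
        "size_bytes": os.path.getsize(path),
        "write_s": write_s,
        "read_s": read_s,
    }


def run(seq_lens, csv_path):
    """Benchmark each sequence length and write the rows to a CSV file."""
    with tempfile.TemporaryDirectory() as workdir:
        rows = [benchmark_npz(n, workdir=workdir) for n in seq_lens]
    with open(csv_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Taking the best of several repeats reduces noise from the filesystem cache; an HDF5 or Zarr case would follow the same pattern with its own write and read functions.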
Storage Size Scaling
All three formats exhibit comparable storage efficiency across sequence lengths, with only minor variation due to compression differences. Zarr achieves this while maintaining a chunked, addressable layout — meaning format selection is not constrained by storage overhead.
Full Archive Read Performance
Zarr demonstrates strong performance in full-archive reads, in some cases outperforming HDF5. Its chunked storage model does not significantly hinder contiguous data access and can efficiently utilize sequential read patterns.
Layer-Level Partial Read Latency
Both HDF5 and Zarr support partial reads, which are fundamental to interpretability workflows. HDF5 achieves lower latency in local benchmarks due to efficient indexing within a contiguous file. Zarr, while slightly slower, enables flexible chunk alignment with interpretability units (e.g., individual layers), which becomes increasingly valuable as access patterns grow more complex.
Random Access Scaling
Random access performance is comparable between HDF5 and Zarr at this scale. However, Zarr's chunk-based architecture is designed to scale in distributed settings — making it better suited for workloads involving frequent, non-sequential queries across model components.
Parallel Interpretability Workload
Under concurrent access, Zarr demonstrates improved scalability compared to HDF5, benefiting from its independently addressable chunk structure. Multiple threads or users can access different regions of the dataset without contention. HDF5's single-file design introduces coordination overhead, making it less suitable for multi-user interpretability systems.
Summary
Decision: Zarr was selected as the archive format. While HDF5 shows strong performance for structured local workloads, Zarr's chunked design, flexible partial access, and scalability under concurrent use make it better suited for VizFold's long-term interpretability requirements.
Prototype Development Journey
The final archive schema emerged from three iterative prototyping stages. Each stage uncovered new limitations that informed the next design.
Stage 1 — MNIST Baseline
The MNIST experiments provided the first working prototype for storing layer-wise activations. Key findings:
MNIST Layer Activations:
Stage 2 — Vision Transformer (ViT) Extension
Extending the archive to a Vision Transformer introduced more complex representations — token-based embeddings and multi-head attention across image patches.
Key findings:
ViT Image Patches:
ViT Attention Across Layers:
Stage 3 — OpenFold Generalization
The ViT insights directly informed the OpenFold-compatible archive design:
inputs/ module was introduced to preserve raw sequences and MSAs, ensuring full reproducibility.
single/ and pair/ modules capture per-residue and pairwise residue relationships, which are central to protein structure modeling.
recycle/ module was added to capture OpenFold's iterative refinement steps across inference cycles.
outputs/ module stores final predictions (coordinates, confidence scores) alongside intermediate results in a unified structure.
VizFold Architectural Diagram:
Running the Benchmark Script
The benchmark script (archiveformat.py) reproduces all figures shown in the Format Evaluation section above.
Prerequisites
Or install from the requirements file:
Running the Script
The script will:
benchmark_results.csv
Expected Output
After the script completes, the benchmark_results.csv file will contain the measured data points, and the 5 plots will be rendered.
Demo Walkthrough
A video walkthrough demonstrating the archive format, benchmark script execution, and output visualization is available below.
Demo Video
Current Limitations
Future Work