Summary
Three related proposals for the post-Coordinator networking layer in Psyche, framed as a single RFC because they share an underlying observation: as run sizes grow, several functions currently handled implicitly by the on-chain Coordinator or by epidemic gossip would benefit from being factored out into a structured directory tier with deterministic, locally-computable primitives.
The proposals, ordered by what I think is increasing scope and decreasing certainty:
- Rendezvous hashing (HRW) for witness assignment — replace per-epoch on-chain randomness for witness selection with locally-computable HRW scores, eliminating Coordinator round-trips for witness rotation while preserving the existing slashing/correctness model.
- CRUSH-style hierarchical replica placement rules — when results are replicated for verification, allow the run authority to specify failure-domain rules (geographic region, jurisdiction, stake tier, attestation level) so replicas land in distinct trust zones rather than wherever gossip happens to propagate first.
- Kademlia-based chunked checkpoint distribution alongside the existing HuggingFace + `iroh-blobs` paths, for handling new-joiner cold-start at run scales where centralized download becomes a bottleneck.
None of these is fundamentally novel — the primitives come from Hivemind, Ceph, Cassandra/DynamoDB, and the BTARD line of work. The contribution would be specific to how they compose with Psyche's existing Coordinator/Treasurer/Witness architecture and DisTrO-compressed gradient flow.
I'm posting this as an RFC rather than a PR because (a) the proposals span trust assumptions that the Foundation should weigh in on before code lands, (b) some of this may overlap with internal roadmap items I can't see, and (c) the smallest of the three is genuinely small and the largest is genuinely large, and the right scoping depends on what's actually a current bottleneck.
Background and motivation
Psyche today factors the directory function across three places:
- Coordinator (on-chain Solana program) — authoritative state for run metadata, participant list, batch assignments, witness selection randomness, epoch transitions.
- `iroh-gossip` (HyParView + PlumTree) — epidemic propagation of result-availability announcements and `iroh-blob` tickets.
- `iroh-blobs` (BLAKE3 content-addressed) — direct P2P payload fetch of training results, with bloom-filter membership tracking on the Coordinator side for sharing verification.
This works well at the run sizes Psyche has demonstrated (Consilience-class, dozens to low hundreds of nodes). Three places where I think it strains as N grows:
- Witness assignment is a Coordinator write. Every epoch's witness rotation goes through on-chain randomness and a state transition. At small N this is fine. At higher N, or with many concurrent runs, the Solana write rate becomes the implicit ceiling on epoch granularity. Rendezvous hashing collapses this to local computation.
- Replica placement is incidental, not prescriptive. Bloom-filter-based witness verification confirms that enough nodes saw a result, but does not control which nodes hold replicas. For Byzantine-resilience that depends on geographic and trust diversity (no two replicas in the same datacenter; at least one replica in a different jurisdiction; spread across stake tiers), incidental placement is weaker than rule-driven placement.
- Onboarding is centralized. New-joiner checkpoint download via HuggingFace + a single `iroh-blob` ticket is fine for tens of joiners. At hundreds of simultaneous joiners (which a healthy permissionless network will see), the source becomes a cold-start bottleneck and a single point of failure for the run.
None of these is currently breaking Psyche. The motivation is forward-looking: if the network reaches the scale the Foundation's mission implies, these become bottlenecks before throughput compression or Byzantine aggregation does.
Trust model
Before the proposals: all three preserve the existing Solana-Coordinator trust model. Participant identity, stake weights, run state, epoch transitions, and join authorization remain on-chain authoritative. The directory tier is computed over canonical Coordinator state, not in place of it.
Specifically:
- Proposal 1 (rendezvous hashing) is a pure function of state the Coordinator already canonicalizes. Nothing moves off-chain except the derivation of witnesses from that state. Solana identification is unchanged and remains the binding for `node_id` in the HRW score.
- Proposal 2 (placement rules) introduces a placement-metadata trust assumption (self-reported region/tier) that is acceptable in permissioned phases where the run authority vets participants. Fully open deployment requires an attestation mechanism — TEE attestation, on-chain commitments backed by slashing, or a vetted geo-IP oracle. This is called out explicitly in the proposal and is the single trust regression in this RFC; I'd argue it's a worthwhile one for the placement diversity it enables, but if it violates the zero-trust framework we can leave it out.
- Proposal 3 (Kademlia chunks) has zero trust impact. BLAKE3 content addressing means a node lying about chunk availability wastes a fetch attempt but cannot inject corrupted data. This is the same trust posture as the existing `iroh-blobs` path.
None of the proposals removes, weakens, or routes around the Solana identification step. The Coordinator remains the source of truth for "who is in this run, with what stake, at what epoch." The directory tier is downstream of that truth.
Proposal 1: Rendezvous hashing for witness assignment
Sketch
Currently, witness selection uses Coordinator-provided randomness per epoch — a state transition that costs a Solana write and creates an implicit synchronization point. The proposal is to replace this with rendezvous hashing (HRW, Thaler & Ravishankar 1998): every node locally computes
```
score(node_id, batch_id, epoch) = H(node_id || batch_id || epoch || stake_weight)
```
and the top-K scores deterministically identify the witnesses for that batch in that epoch. Stake can be folded in via weighted HRW so that high-stake nodes have a higher selection probability without breaking the property.
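To make the local derivation concrete, here is a minimal Rust sketch, assuming the `blake3` crate for hashing; the `Participant` and `select_witnesses` names are illustrative rather than Psyche's actual client types, and the sketch uses the standard weighted-HRW transform (score = -w / ln(u)) instead of hashing the stake directly, which is one common way to make selection probability stake-proportional.

```rust
// Minimal sketch of stake-weighted rendezvous (HRW) witness selection.
// `Participant` and `select_witnesses` are illustrative names, not Psyche's
// actual types; hashing assumes the blake3 crate.
use std::cmp::Ordering;

#[derive(Clone)]
struct Participant {
    node_id: [u8; 32], // bound to a Solana keypair in the real system
    stake_weight: u64, // taken from canonical Coordinator state
}

/// Weighted-HRW score: hash (node_id || batch_id || epoch), map the digest
/// into (0, 1), then apply -w / ln(u) so selection probability is
/// stake-proportional while staying fully deterministic.
fn hrw_score(p: &Participant, batch_id: u64, epoch: u64) -> f64 {
    let mut hasher = blake3::Hasher::new();
    hasher.update(&p.node_id);
    hasher.update(&batch_id.to_le_bytes());
    hasher.update(&epoch.to_le_bytes());
    let digest = hasher.finalize();
    // Use the top 53 bits so the mapping into (0, 1) is exact in f64.
    let x = u64::from_le_bytes(digest.as_bytes()[..8].try_into().unwrap()) >> 11;
    let u = (x as f64 + 1.0) / ((1u64 << 53) as f64 + 2.0); // strictly in (0, 1)
    // Zero stake yields score 0 and is never preferred over a staked node.
    -(p.stake_weight as f64) / u.ln()
}

/// Every node runs this locally over the canonical participant list and gets
/// the same top-K witness set for a given (batch, epoch); no Coordinator write.
fn select_witnesses(
    participants: &[Participant],
    batch_id: u64,
    epoch: u64,
    k: usize,
) -> Vec<[u8; 32]> {
    let mut scored: Vec<(f64, [u8; 32])> = participants
        .iter()
        .map(|p| (hrw_score(p, batch_id, epoch), p.node_id))
        .collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(Ordering::Equal));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}
```

The -w / ln(u) form is the usual way to weight HRW; whether that is the right weighting for Psyche's incentive model is one of the feedback questions below.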
Why this is better than current
- Zero Coordinator round-trips for witness rotation. Every node can locally derive the witness set for any (batch, epoch) tuple. This decouples witness assignment from on-chain throughput.
- Globally consistent assignment without per-epoch consensus round-trips. All nodes compute the same K witnesses given the same inputs, so disagreement about who should witness reduces to disagreement about the input set — which the Coordinator already canonicalizes via its participant list. The Solana Coordinator remains authoritative for the participant set, stake weights, and epoch state that HRW operates on; this proposal only affects how witnesses are derived from that canonical state, not whether the state is canonical.
- Stake-weighted selection without auctions. Folding stake into HRW gives stake-proportional witness probability deterministically, without Yuma-style scoring rounds.
Why this is not worse than current
- Slashing and correctness model unchanged. Witnesses still submit signed commitments; mismatches still trigger the existing slashing path. HRW only changes who witnesses, not how.
- Resistance to selection manipulation. As long as `node_id` is bound to a Solana keypair (already true) and `batch_id`/`epoch` are derived from Coordinator state (already true), an adversary cannot grind their way into the witness set without grinding their keypair pre-registration — a known and accepted attack surface in any committee-selection scheme.
What I'd want feedback on
- Whether the Coordinator team prefers to keep witness assignment on-chain for auditability reasons that I'm missing.
- Whether stake-weighted HRW interacts cleanly with the Treasurer's existing point-accounting, or whether the witness-selection probability needs to be decoupled from stake for incentive reasons.
- Bound on K — i.e., how many independent witnesses the security analysis requires, since HRW makes K cheap to scale.
Proposal 2: CRUSH-style hierarchical replica placement rules
Sketch
Borrow from Ceph's CRUSH algorithm (Weil et al., SC'06): allow the run authority to define a placement hierarchy and rules in the run config. A simplified version would look like:
```toml
[placement.hierarchy]
regions = ["us-east", "us-west", "eu-west", "apac", "other"]
trust_tiers = ["attested", "stake_high", "stake_medium", "stake_low"]

[placement.rules]
# For verification replicas of a result blob
result_replicas = 3
result_rules = [
  { type = "region", min_distinct = 2 },
  { type = "trust_tier", min_distinct = 2 },
]

# For checkpoint shards
checkpoint_replicas = 5
checkpoint_rules = [
  { type = "region", min_distinct = 3 },
]
```
A node's region/tier is declared at registration and is part of the Coordinator participant record. Placement rules are evaluated as part of the existing assignment flow but constrain which nodes receive replication duty.
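As a rough illustration of how such rules could be enforced, here is a Rust sketch that validates a candidate replica set against `min_distinct` constraints. The types (`PlacementRecord`, `Dimension`, `Rule`) and the error path are hypothetical and mirror the example config above, not an existing Psyche API; a real placement engine would also perform the constrained selection itself (CRUSH-style) rather than only checking it.

```rust
// Illustrative check of placement rules against a candidate replica set.
// Types are hypothetical, mirroring the example config above.
use std::collections::HashSet;

struct PlacementRecord {
    region: String,     // self-reported at registration (see trust caveat)
    trust_tier: String,
}

#[derive(Debug)]
enum Dimension {
    Region,
    TrustTier,
}

struct Rule {
    dimension: Dimension,
    min_distinct: usize,
}

fn attr<'a>(r: &'a PlacementRecord, d: &Dimension) -> &'a str {
    match d {
        Dimension::Region => &r.region,
        Dimension::TrustTier => &r.trust_tier,
    }
}

/// Returns an error naming the violated rule instead of silently degrading,
/// matching the "fail loudly" semantics the proposal asks for when the
/// current participant set cannot satisfy the configured rules.
fn check_placement(replicas: &[PlacementRecord], rules: &[Rule]) -> Result<(), String> {
    for rule in rules {
        let distinct: HashSet<&str> =
            replicas.iter().map(|r| attr(r, &rule.dimension)).collect();
        if distinct.len() < rule.min_distinct {
            return Err(format!(
                "placement rule on {:?} unsatisfied: {} distinct value(s), {} required",
                rule.dimension, distinct.len(), rule.min_distinct
            ));
        }
    }
    Ok(())
}
```

Selection itself could walk candidates in a deterministic hash order (for example, the HRW ordering from Proposal 1) and skip nodes that would leave a rule unsatisfiable; whether greedy selection suffices or a full CRUSH-style descent is needed is an open design detail.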
Why this matters specifically for trustless training
The current implicit assumption is that gossip propagation gives sufficient replica diversity. In practice, gossip propagation is biased by latency and topology — replicas concentrate where bandwidth is cheap, which often means same-DC or same-cloud. For Byzantine-resilience that depends on uncorrelated failure (a coordinated cloud outage shouldn't take down all replicas; a state-actor takedown in one jurisdiction shouldn't either), this is a real weakness that no amount of cryptographic verification fixes.
What's hard about this
- Self-reported region/tier is gameable. Either the run authority trusts the declarations (acceptable for a permissioned phase, weaker for fully open), or the protocol needs an attestation mechanism. TEE attestation is the obvious path; willing to defer that subproblem.
- Rules can be unsatisfiable. If a run config requires 3 distinct regions but only 2 regions have participants, the placement engine has to fail loudly rather than silently degrading. Behavior under partial-coverage needs explicit semantics.
- Locality-aware fetch is the natural follow-on. Once placement is rule-driven, fetch can prefer the closest replica — which is where most of the actual throughput win comes from. CRUSH proper handles this via the same mechanism.
What I'd want feedback on
- Whether the Foundation has a position on attested vs. self-reported placement metadata.
- Whether replica diversity is currently a known concern internally or whether the assumption is that geographic spread emerges naturally from the participant set.
Proposal 3: Kademlia-based chunked checkpoint distribution
Sketch
Alongside the existing HuggingFace and single `iroh-blob` ticket paths for checkpoint download, add a Kademlia overlay where:
- Each checkpoint is split into fixed-size chunks (e.g., 64 MiB), each addressed by its BLAKE3 hash.
- Chunks are erasure-coded (Reed-Solomon, k+m) so that any k-of-(k+m) chunks reconstruct the checkpoint.
- Chunk → holder mappings are advertised in the Kademlia DHT.
- New joiners do parallel chunk fetches from the closest holders by Kademlia XOR distance, with locality preference if Proposal 2 is also in scope.
This is essentially the BitTorrent + IPFS pattern applied to checkpoints, and is structurally close to what Hivemind already does for parameter sharding via libp2p Kademlia.
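For concreteness, a minimal Rust sketch of the chunking and content-addressing step, assuming the `blake3` crate; `ChunkManifest`, the 64 MiB constant, and the function names are illustrative, and both the erasure-coding step and the Kademlia advertise/lookup steps are omitted.

```rust
// Illustrative chunking + BLAKE3 addressing for a checkpoint blob.
// Erasure coding and the DHT advertise/lookup steps are omitted here.
const CHUNK_SIZE: usize = 64 * 1024 * 1024; // 64 MiB, per the sketch above

/// What a new joiner would need to resolve before fetching chunks in parallel.
struct ChunkManifest {
    checkpoint_hash: blake3::Hash,   // hash of the full checkpoint blob
    chunk_hashes: Vec<blake3::Hash>, // content addresses to look up in the DHT
    chunk_size: usize,
}

fn build_manifest(checkpoint: &[u8]) -> ChunkManifest {
    let chunk_hashes = checkpoint
        .chunks(CHUNK_SIZE)
        .map(blake3::hash) // each chunk is independently verifiable on fetch
        .collect();
    ChunkManifest {
        checkpoint_hash: blake3::hash(checkpoint),
        chunk_hashes,
        chunk_size: CHUNK_SIZE,
    }
}

/// Fetch-side verification: a holder serving garbage is caught immediately,
/// which is why the proposal claims zero trust impact for this path.
fn verify_chunk(manifest: &ChunkManifest, index: usize, data: &[u8]) -> bool {
    index < manifest.chunk_hashes.len() && blake3::hash(data) == manifest.chunk_hashes[index]
}
```

With Reed-Solomon added, the manifest would carry the k+m coded chunks instead, and any k verified chunks reconstruct the checkpoint.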
Why this is the most-deferrable of the three
The pain it addresses (hundreds of simultaneous joiners overwhelming a centralized HF download) is a scale problem Psyche hasn't hit yet. At current run sizes the centralized path is fine. The reason to put it in this RFC anyway is that the design affects the other two proposals — if the directory tier is going to exist, it's much cleaner to design it once with checkpoint distribution in mind than to bolt it on later.
What I'd want feedback on
- Whether the Foundation is already heading toward something like this internally (the existence of `iroh-blobs` plus the gossip-based ticket distribution suggests yes, just incrementally).
- Whether erasure-coded chunks are worth the complexity over plain replication, given that BLAKE3 verification is already cheap.
Why now / when this matters
I want to be honest about scale. None of this is a current bottleneck on Consilience-class runs. DisTrO compression, epoch-level off-ramps, and the existing witness mechanism are sufficient for tens to low-hundreds of nodes. The directory-tier work matters at:
- N > ~1,000 active nodes per run, where Coordinator write rates start to dominate epoch length;
- runs spanning >2 geographic regions with non-trivial latency variance;
- networks with many concurrent runs sharing a single Coordinator deployment;
- post-permissioned phases where placement diversity becomes a Byzantine-resilience requirement rather than an aesthetic preference.
If the roadmap doesn't currently target those regimes, the right answer might be "interesting, not now" and that's a perfectly fine outcome from this RFC — knowing where Psyche is heading is more useful than landing code that solves a non-problem.
Prior art and what's not novel here
To save anyone the time of pointing it out: none of the primitives are individually new.
- Hivemind ([learning-at-home/hivemind](https://github.com/learning-at-home/hivemind), Ryabinin & Gusev NeurIPS 2020) has used a Kademlia DHT for distributed-training metadata since 2020. Petals and OpenDiLoCo build on it.
- BTARD (Gorbunov et al., [arXiv:2106.11257](https://arxiv.org/abs/2106.11257)) combines DHT coordination with Byzantine-tolerant aggregation and random-witness recomputation — closest existing combination of these ideas in published form.
- CRUSH (Weil et al., SC'06) is the canonical hierarchical pseudo-random placement algorithm.
- Mu Li's parameter server (OSDI 2014) established consistent-hashing-of-parameters in distributed ML.
- Rendezvous hashing (Thaler & Ravishankar, IEEE/ACM TON 1998) is freely usable and well-understood.
- Parallax ([arXiv:2509.26182](https://arxiv.org/abs/2509.26182), 2025) is the closest recent example of using a DHT for locality-aware selection in distributed ML, though for inference rather than training.
What I think would be specific to Psyche is the integration with the existing Coordinator/Treasurer/Witness model — particularly the rendezvous-hashing piece, which fits Psyche's design more naturally than it fits Hivemind's because Psyche has stake-weighted on-chain identity that HRW can fold into its score function cleanly.
Open questions for the Psyche team
In rough priority order:
- What's the actual current bottleneck on the largest active runs? If it's not in the directory function, this RFC is at best premature. If it is, which of the three proposals addresses what's hurting?
- Is there an internal roadmap item that already covers any of this? I'd rather know now than overlap.
- Permissioned-vs-open transition timeline. Several of the trade-offs (especially placement attestation) depend on whether the network is moving toward fully open in the near term or staying permissioned for longer.
- Is witness assignment kept on-chain intentionally, for auditability reasons? If yes, that probably kills Proposal 1 even if it would help throughput.
- Treasurer interaction. Does any of this affect the points-and-claims flow in ways that need contract-level changes, or can it all live in the off-chain client?
Engagement and implementation
I'm a contributor coming to this from outside the Foundation. My posture is: I'd be happy to implement any or all of these if the Foundation thinks it's directionally useful, but I'm not assuming that's the case. A few smaller paths forward that might be more useful than a big PR:
- Smallest unit: a prototype branch implementing rendezvous-hashed witness assignment as an opt-in client-side flag, with a benchmark comparing Coordinator round-trip costs at varying N. This is the cleanest piece to land or reject.
- Medium unit: the placement-hierarchy config primitives plus a self-reported metadata path, without the attestation piece. Gives the Foundation a way to experiment with placement rules in permissioned runs without committing to the harder TEE problem.
- Largest unit: the full Kademlia chunk distribution layer. I'd only start this if the Foundation has explicitly indicated it's on the roadmap, since it's the most invasive.
If any of this is interesting and there's a more appropriate channel for the conversation than a GitHub issue (Discord, a call, a design doc in a different repo), happy to move it there.
References
- Maymounkov & Mazières, Kademlia: A Peer-to-Peer Information System Based on the XOR Metric, IPTPS 2002.
- Thaler & Ravishankar, A Name-Based Mapping Scheme for Rendezvous, IEEE/ACM Transactions on Networking 1998.
- Weil, Brandt, Miller, Maltzahn, CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data, SC '06.
- Karger et al., Consistent Hashing and Random Trees, STOC 1997.
- Ryabinin & Gusev, Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts, NeurIPS 2020.
- Gorbunov et al., Secure Distributed Training at Scale (BTARD), ICML 2022, [arXiv:2106.11257](https://arxiv.org/abs/2106.11257).
- Borzunov et al., Petals: Collaborative Inference and Fine-tuning of Large Models, ACL Demos 2023, [arXiv:2209.01188](https://arxiv.org/abs/2209.01188).
- Li et al., Scaling Distributed Machine Learning with the Parameter Server, OSDI 2014.
- DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007.
- Peng, Quesnelle, Kingma et al., DeMo: Decoupled Momentum Optimization, [arXiv:2411.19870](https://arxiv.org/abs/2411.19870).
- Nous Research, Democratizing AI: The Psyche Network Architecture, [nousresearch.com/nous-psyche](https://nousresearch.com/nous-psyche/).