Distribute Gather Across Ranks & Make gather_result Consumers Partial-Aware
Summary
Implement distributed gather across all ranks and update every consumer of gather_result to correctly handle partial results as well as a merged aggregate. Maintain current output schema and downstream semantics.
Objectives
- Distribute peer fetching and validation evenly across ranks using deterministic partitioning without duplicates.
- Produce per-rank partial results and a canonical merged aggregate for global metrics and artifact emission.
- Enable outer update to process partial results incrementally under a bounded memory budget.
- Ensure all consumer paths operate correctly with either partials or an aggregate.
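
One possible shape for the deterministic partitioning named in the objectives, as a minimal sketch. It assumes peers are identified by integer UIDs and that every rank sees the same peer list; `partition_peers` is a hypothetical helper, not an existing function.

```python
from typing import Iterable, List


def partition_peers(uids: Iterable[int], rank: int, world_size: int) -> List[int]:
    """Return the UIDs assigned to `rank` (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Sorting a de-duplicated set makes the mapping independent of input order,
    # so every rank derives the same assignment without communication.
    ordered = sorted(set(uids))
    # Strided slicing gives partition sizes that differ by at most one.
    return ordered[rank::world_size]


peers = [42, 7, 19, 7, 3, 88, 51]  # note the repeated UID 7
parts = [partition_peers(peers, r, 3) for r in range(3)]
print(parts)
```

Because each rank computes its own slice from the same sorted list, no coordination is needed to guarantee disjoint assignments. Reserve or retry handling would need an additional rule on top of this.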
Scope
- Gather execution: partitioning, per-rank fetch/validate, partial outputs, merge/reduce, artifact emission on a single rank.
- Consumers: outer update, index-overlap checks, per-param norms, quality metrics (intended vs actual, success rate, skipped), logging.
- Dedupe policy for repeated UIDs across partitions or retries.
- Synchronization and small control flags for readiness and skip decisions.
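
The merge/reduce step and dedupe policy in scope could look roughly like the sketch below. The `PartialGather` type, the field names beyond uids and skipped_uids, and the definition of success rate as actual / intended are all assumptions for illustration. So is the policy that a UID which succeeded on any rank is not reported as skipped.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class PartialGather:
    """Assumed minimal shape of one rank's partial result."""
    rank: int
    uids: List[int]
    skipped_uids: List[int]


def merge_partials(partials: Sequence[PartialGather]) -> Dict[str, object]:
    succeeded = set()
    skipped = set()
    for part in partials:
        # Set union removes UIDs repeated across partitions or retries.
        succeeded.update(part.uids)
        skipped.update(part.skipped_uids)
    # Assumed policy: success anywhere (e.g. on a retry) overrides a skip.
    skipped -= succeeded
    intended = len(succeeded) + len(skipped)
    return {
        "uids": sorted(succeeded),
        "skipped_uids": sorted(skipped),
        "intended": intended,
        "actual": len(succeeded),
        "success_rate": len(succeeded) / intended if intended else 0.0,
    }


merged = merge_partials([
    PartialGather(rank=1, uids=[7, 42], skipped_uids=[51]),
    PartialGather(rank=0, uids=[3, 42], skipped_uids=[9]),
    PartialGather(rank=2, uids=[51], skipped_uids=[]),
])
print(merged)
```

Emitting sorted lists keeps the aggregate independent of the order in which partials arrive.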
Deliverables
- Partitioning
  - Deterministic mapping from peer list to ranks.
  - Guardrails preventing duplicate downloads when using reserves or retries.
- Per-Rank Partial Gather
  - Fetch and validate assigned peers.
  - Emit partial gather_result with uids, skipped_uids, success_rate_part, and per-param payloads.
  - Compute lightweight per-rank index-overlap signatures.
- Merge & Global Metrics
  - All-gather of partials and merge on one rank.
  - Compute global uids, skipped_uids, success rate, intended vs actual, and overlap candidates.
  - Emit/upload the canonical aggregate artifact from a single rank.
- Partial-Aware Consumers
  - Outer update accepts a sequence of partial results and applies them incrementally in a deterministic order with a configurable memory budget.
  - Index-overlap detection reduces per-rank signatures to global findings.
  - Norms and quality metrics reduce across partials and match single-rank semantics.
- Synchronization & Control
  - Barriers at gather completion and pre-update.
  - Minimal readiness/skip flags broadcast to all ranks.
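
A rough sketch of how a partial-aware outer update might consume partials incrementally. The payload layout (param name mapped to a dense list of floats), the mean-then-step update rule, and the name `apply_partials` are assumptions, not the existing implementation. Memory is bounded here by accepting a generator so that only one partial is resident at a time; a configurable budget would need further chunking within a partial.

```python
from typing import Dict, Iterable, List, Set

Payload = Dict[str, List[float]]  # assumed: param name -> dense delta


def apply_partials(params: Dict[str, List[float]],
                   partials: Iterable[dict],
                   lr: float) -> Set[int]:
    """Accumulate deltas one partial at a time, then take a single step."""
    acc = {name: [0.0] * len(values) for name, values in params.items()}
    applied: Set[int] = set()
    last_rank = -1
    for part in partials:  # may be a generator: one partial in memory at a time
        if part["rank"] <= last_rank:
            raise ValueError("partials must arrive in ascending rank order")
        last_rank = part["rank"]
        for uid in sorted(part["payloads"]):  # deterministic order within a rank
            if uid in applied:
                continue  # guard against double application
            applied.add(uid)
            for name, delta in part["payloads"][uid].items():
                buf = acc[name]
                for i, d in enumerate(delta):
                    buf[i] += d
    if applied:
        n = len(applied)
        for name, buf in acc.items():
            params[name] = [p - lr * g / n for p, g in zip(params[name], buf)]
    return applied


def stream_partials():
    yield {"rank": 0, "payloads": {3: {"w": [1.0, 0.0]}}}
    # UID 3 repeats here and is ignored by the double-application guard.
    yield {"rank": 1, "payloads": {7: {"w": [0.0, 2.0]}, 3: {"w": [100.0, 100.0]}}}


params = {"w": [1.0, 2.0]}
applied = apply_partials(params, stream_partials(), lr=0.5)
print(applied, params)
```

The accumulator is sized by the model, not by the number of peers, which is the property the memory acceptance criterion asks for.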
Risks
- Duplicate work without strict partitioning and reserve handling.
- Metric drift if reductions do not union sets consistently.
- Double application if the same UID appears in multiple partials.
- Memory pressure during concurrent decompress/validate without chunking.
- Non-deterministic application order causing minor numerical divergence.
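
The last risk follows from floating-point addition not being associative. A small illustration, with made-up contribution values keyed by UID, shows how arrival order changes a sum and how sorting by UID removes the dependence:

```python
from typing import Dict, Iterable


def running_sum(values: Iterable[float]) -> float:
    # Plain left-to-right accumulation, as an in-place update loop would do.
    total = 0.0
    for v in values:
        total += v
    return total


def ordered_sum(contribs: Dict[int, float]) -> float:
    # Fix the application order by sorting on UID before accumulating.
    return running_sum(contribs[uid] for uid in sorted(contribs))


arrival_a = {1: 0.1, 2: 0.2, 3: 0.3}
arrival_b = {3: 0.3, 2: 0.2, 1: 0.1}  # same contributions, different arrival order

naive_a = running_sum(arrival_a.values())
naive_b = running_sum(arrival_b.values())
print(naive_a == naive_b)                                  # False
print(ordered_sum(arrival_a) == ordered_sum(arrival_b))    # True
```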
Acceptance Criteria
- Gather+validate wall-clock time decreases with the number of ranks until network limits dominate.
- Peak memory during outer update remains within the configured budget and does not scale with peer count.
- For the same peer set, distributed flow produces outputs equivalent to the single-rank baseline for:
  - uids, skipped_uids, success rate, intended vs actual
  - Per-param norms and index-overlap results
  - Applied model update (within expected floating-point tolerance)
- No duplicate downloads or double applications; reserve backfills behave correctly.
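
The equivalence criterion could be checked with something like the sketch below. The result dictionary keys, the `results_equivalent` name, and the tolerance values are assumptions chosen for illustration. Set-valued fields are compared exactly and numeric fields within a tolerance.

```python
import math
from typing import Dict


def results_equivalent(baseline: Dict, distributed: Dict,
                       rel_tol: float = 1e-6, abs_tol: float = 1e-8) -> bool:
    """Compare a single-rank result with a merged distributed result."""
    # UID sets must match exactly, regardless of ordering.
    for key in ("uids", "skipped_uids"):
        if sorted(baseline[key]) != sorted(distributed[key]):
            return False
    if not math.isclose(baseline["success_rate"], distributed["success_rate"],
                        rel_tol=rel_tol, abs_tol=abs_tol):
        return False
    # Per-param norms must cover the same params and agree within tolerance.
    if baseline["norms"].keys() != distributed["norms"].keys():
        return False
    return all(
        math.isclose(baseline["norms"][name], distributed["norms"][name],
                     rel_tol=rel_tol, abs_tol=abs_tol)
        for name in baseline["norms"]
    )


baseline = {"uids": [3, 7, 42], "skipped_uids": [9],
            "success_rate": 0.75, "norms": {"w": 1.2345678}}
distributed = {"uids": [42, 3, 7], "skipped_uids": [9],
               "success_rate": 0.75, "norms": {"w": 1.2345679}}
drifted = dict(distributed, uids=[3, 7])

print(results_equivalent(baseline, distributed))
print(results_equivalent(baseline, drifted))
```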