Skip to content

Distribute Gather Across All Ranks #598

@joellidin

Description

@joellidin

Distribute Gather Across Ranks & Make gather_result Consumers Partial-Aware

Summary

Implement distributed gather across all ranks and update every consumer of gather_result to correctly handle partial results as well as a merged aggregate. Maintain current output schema and downstream semantics.

Objectives

  • Distribute peer fetching and validation evenly across ranks using deterministic partitioning without duplicates.
  • Produce per-rank partial results and a canonical merged aggregate for global metrics and artifact emission.
  • Enable outer update to process partial results incrementally under a bounded memory budget.
  • Ensure all consumer paths operate correctly with either partials or an aggregate.

Scope

  • Gather execution: partitioning, per-rank fetch/validate, partial outputs, merge/reduce, artifact emission on a single rank.
  • Consumers: outer update, index-overlap checks, per-param norms, quality metrics (intended vs actual, success rate, skipped), logging.
  • Dedupe policy for repeated UIDs across partitions or retries.
  • Synchronization and small control flags for readiness and skip decisions.

Deliverables

  1. Partitioning

    • Deterministic mapping from peer list to ranks.
    • Guardrails preventing duplicate downloads when using reserves or retries.
  2. Per-Rank Partial Gather

    • Fetch and validate assigned peers.
    • Emit partial gather_result with uids, skipped_uids, success_rate_part, and per-param payloads.
    • Compute lightweight per-rank index-overlap signatures.
  3. Merge & Global Metrics

    • All-gather of partials and merge on one rank.
    • Compute global uids, skipped_uids, success rate, intended vs actual, and overlap candidates.
    • Emit/upload the canonical aggregate artifact from a single rank.
  4. Partial-Aware Consumers

    • Outer update accepts a sequence of partial results and applies them incrementally in a deterministic order with a configurable memory budget.
    • Index-overlap detection reduces per-rank signatures to global findings.
    • Norms and quality metrics reduce across partials and match single-rank semantics.
  5. Synchronization & Control

    • Barriers at gather completion and pre-update.
    • Minimal readiness/skip flags broadcast to all ranks.

Risks

  • Duplicate work without strict partitioning and reserve handling.
  • Metric drift if reductions do not union sets consistently.
  • Double application if the same UID appears in multiple partials.
  • Memory pressure during concurrent decompress/validate without chunking.
  • Non-deterministic application order causing minor numerical divergence.

Acceptance Criteria

  • Gather+validate wall-clock time decreases with the number of ranks until network limits dominate.

  • Peak memory during outer update remains within the configured budget and does not scale with peer count.

  • For the same peer set, distributed flow produces outputs equivalent to the single-rank baseline for:

    • uids, skipped_uids, success rate, intended vs actual
    • Per-param norms and index-overlap results
    • Applied model update (within expected floating-point tolerance)
  • No duplicate downloads or double applications; reserve backfills behave correctly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions