Enable Tensor Parallelism (TP) #599

@joellidin

Summary

Wire full TP support through the miner and validator so that a run works simply by setting torchtitan.tp_degree > 1 in hparams. The hparams surface and Titan parallelization are already present; what remains is completing the plumbing across init, the gradient pipeline, and checkpoints.

Scope

  1. Model init / mesh wiring
  • Miner & Validator: build Titan LLaMA via our factory and parallelize with TP using existing helpers. Validate degrees/world-size via the factory checks.
  • Keep validator mesh consistent with miner for evaluation parity.
  2. Gradient pipeline (DTensor-safe)
  • Ensure prepare_gradient_dict(...) owner/rendezvous logic correctly handles TP-sharded DTensors during encode→compress.
  • In outer_step(...), keep the per-param flow: reconstruct dense grad on source, then broadcast/distribute_tensor into DTensor grads on all ranks. Verify placements under TP.
  3. Checkpointing / catch-up
  • Confirm DCP save/load and catch-up apply cleanly with TP sharding (Titan distributed state dicts + pointer publishing).
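
For the degree/world-size validation in item 1, here is a minimal sketch of the kind of check involved. It is not the factory's actual code: `MeshShape`, `resolve_mesh_shape`, and `tp_group_of` are hypothetical names, the layout is simplified to data-parallel x tensor-parallel only, and it assumes TP is the innermost (fastest-varying) mesh dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshShape:
    """Hypothetical 2-D layout: data-parallel replicas x tensor-parallel shards."""
    dp: int
    tp: int


def resolve_mesh_shape(world_size: int, tp_degree: int) -> MeshShape:
    """Derive (dp, tp) from the world size, rejecting impossible settings."""
    if world_size < 1:
        raise ValueError(f"world_size must be >= 1, got {world_size}")
    if tp_degree < 1:
        raise ValueError(f"tp_degree must be >= 1, got {tp_degree}")
    if world_size % tp_degree != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp_degree={tp_degree}"
        )
    return MeshShape(dp=world_size // tp_degree, tp=tp_degree)


def tp_group_of(rank: int, shape: MeshShape) -> list[int]:
    """Ranks in the same TP group as `rank`.

    Assumption: TP is the innermost mesh dimension, so each TP group is a
    contiguous block of `shape.tp` ranks.
    """
    if not 0 <= rank < shape.dp * shape.tp:
        raise ValueError(f"rank {rank} is outside the mesh")
    start = (rank // shape.tp) * shape.tp
    return list(range(start, start + shape.tp))
```

With 8 ranks and tp_degree = 2 this gives 4 data-parallel replicas, and rank 5 shares its TP group with rank 4. Running the same derivation on the miner and the validator is one way to spot a mesh mismatch early.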

Acceptance criteria

  • Setting torchtitan.tp_degree > 1 runs miner and validator without placement/shape errors and completes windows.
  • Gradients compress/apply under TP (no DTensor/mesh asserts in prepare_gradient_dict or outer_step).
  • Checkpoints save, restore, and catch-up on TP meshes.
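
For the gradient criterion, a round-trip property is a useful thing to assert: sharding the dense gradient rebuilt on the source rank and gathering the shards back should reproduce it exactly. The sketch below is a pure-Python stand-in that uses nested lists rather than real DTensors. `shard_rows`, `gather_rows`, and `roundtrip_ok` are hypothetical helpers, and only the evenly divisible, row-sharded case is modelled.

```python
def shard_rows(dense: list[list[float]], tp_degree: int) -> list[list[list[float]]]:
    """Split a row-major matrix into `tp_degree` equal row blocks.

    Assumption: sharding is along dim 0 and the row count divides evenly;
    uneven splits are deliberately rejected rather than guessed at.
    """
    if tp_degree < 1:
        raise ValueError(f"tp_degree must be >= 1, got {tp_degree}")
    if len(dense) % tp_degree != 0:
        raise ValueError(
            f"{len(dense)} rows cannot be split evenly across {tp_degree} ranks"
        )
    step = len(dense) // tp_degree
    return [dense[i * step:(i + 1) * step] for i in range(tp_degree)]


def gather_rows(shards: list[list[list[float]]]) -> list[list[float]]:
    """Concatenate per-rank row blocks back into one dense matrix."""
    return [row for shard in shards for row in shard]


def roundtrip_ok(dense: list[list[float]], tp_degree: int) -> bool:
    """True if shard -> gather reproduces the dense matrix exactly."""
    shards = shard_rows(dense, tp_degree)
    expected_rows = len(dense) // tp_degree
    same_shape = all(len(shard) == expected_rows for shard in shards)
    return same_shape and gather_rows(shards) == dense
```

An equivalent check against real DTensor grads would compare the gathered full tensor with the dense gradient on the source rank. That part depends on the repo's mesh setup and is not shown here.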

Notes

  • Double-check owned_params/ownership vs. TP partitioning to avoid double work or gaps.
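
One way to make this double-check mechanical is to audit the ownership map for gaps and duplicates. The sketch below assumes each parameter should have exactly one owning rank. That is an assumption about the intended invariant, not something the issue states. `audit_ownership` and the rank-to-names mapping are hypothetical; how owned_params is actually built is not shown here.

```python
from collections import Counter
from typing import Iterable, Mapping


def audit_ownership(
    param_names: Iterable[str],
    owned_by_rank: Mapping[int, Iterable[str]],
) -> tuple[list[str], list[str], list[str]]:
    """Return (gaps, duplicates, unknown) for a rank -> owned-names mapping.

    gaps:       parameters that no rank owns (work that would be skipped)
    duplicates: parameters owned by more than one rank (double work)
    unknown:    owned names that are not model parameters at all
    """
    known = set(param_names)
    # Count how many distinct ranks claim each name.
    owners = Counter(
        name for owned in owned_by_rank.values() for name in set(owned)
    )
    gaps = sorted(name for name in known if owners[name] == 0)
    duplicates = sorted(name for name in known if owners[name] > 1)
    unknown = sorted(name for name in owners if name not in known)
    return gaps, duplicates, unknown
```

For example, with parameters `wq`, `wk`, `wv`, where rank 0 owns `wq` and `wk` and rank 1 also owns `wk`, the audit reports `wv` as a gap and `wk` as a duplicate. If TP ranks are meant to share ownership within a TP group, the invariant would need adjusting to count owners per group instead.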
