[Question] Does Megatron-SWIFT restore the streaming-dataset offset when resuming from a checkpoint? #4505

Open
@Aratako

What I’m trying to do

Resume pre-training in Megatron-SWIFT with streaming=true and packing=true.

Observation

  • On resume, the checkpoint correctly restores iteration and consumed_samples.
  • It is unclear whether the streaming dataset resumes from the same offset or restarts from the beginning, which would feed already-consumed samples to the model a second time.

My reading of the code

IterablePackingDataset doesn’t appear to persist its internal index, so I suspect the offset isn’t saved on the SWIFT side.
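
To illustrate what I mean by persisting the index, here is a minimal sketch in plain Python. `ResumableStream`, `state_dict`, and `load_state_dict` are hypothetical names chosen for illustration; they are not SWIFT or Megatron-LM APIs, and the sketch assumes the underlying stream yields items in the same order on every run.

```python
from typing import Iterable, Iterator


class ResumableStream:
    """Hypothetical wrapper that counts yielded items so the offset can be checkpointed."""

    def __init__(self, source: Iterable):
        self.source = source
        self.offset = 0  # number of items already yielded

    def __iter__(self) -> Iterator:
        # Skip items consumed before the checkpoint, then keep counting.
        for i, item in enumerate(self.source):
            if i < self.offset:
                continue
            self.offset = i + 1
            yield item

    def state_dict(self) -> dict:
        # What would need to be written next to the model checkpoint.
        return {"offset": self.offset}

    def load_state_dict(self, state: dict) -> None:
        # Restore the offset before iteration starts again.
        self.offset = state["offset"]
```

As far as I can tell, nothing equivalent to `offset` is written to the checkpoint for the streaming dataset.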

Questions

  • Should already-consumed samples be automatically skipped when resuming with streaming=true?
  • Does Megatron-LM handle this internally, or should users manually skip the first consumed_samples samples of the stream when resuming?
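
For reference, the manual workaround I have in mind looks roughly like the sketch below. `skip_consumed` is a hypothetical helper, not an existing SWIFT function, and it assumes the stream order is deterministic across runs.

```python
from itertools import islice
from typing import Iterable, Iterator


def skip_consumed(stream: Iterable, consumed_samples: int) -> Iterator:
    """Hypothetical helper: drop the first `consumed_samples` items of a stream."""
    # islice reads and discards the skipped items lazily,
    # so the skip still costs one pass over the consumed prefix.
    return islice(stream, consumed_samples, None)


# Usage: resume a toy stream after 4 consumed samples.
remaining = list(skip_consumed(range(10), 4))
```

I am also unsure which unit applies with packing=true: if consumed_samples counts packed sequences rather than raw samples, the skip would presumably have to be applied after packing rather than on the raw stream.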

Thanks in advance for any clarification!

Labels: bug, enhancement
