Trouble resuming from checkpoint #782

@asaluja

Hello,

I'm running bitsandbytes==0.41.1 in a Python 3.10 miniconda environment on 8x A100 GPUs (using accelerate for multi-GPU) with CUDA 12.2.
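
In case it helps with reproducing this, the package versions can be gathered with a short script along these lines. This is only a minimal sketch: `collect_versions` is a hypothetical helper, the package list is an assumption about which distributions are relevant, and it does not report the CUDA or GPU details.

```python
import platform
from importlib import metadata


def collect_versions(packages=("bitsandbytes", "accelerate", "torch")):
    """Return a dict of the Python version plus each package's installed version."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            # Reads the version from the installed distribution's metadata,
            # without importing the package itself.
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running it inside the same conda environment that launches training keeps the reported versions consistent with what the failing job actually uses.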

I'm having problems resuming training (DPO) from a checkpoint:

Error invalid argument at line 393 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3777589) of binary: /mnt/miniconda3/envs/synlm/bin/python3.10
Traceback (most recent call last):
  File "/mnt/miniconda3/envs/synlm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/miniconda3/envs/synlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I'm not quite sure how to debug this error. Any ideas?

Labels: Contributions Welcome, Medium Priority, Optimizers