backgroundWorker keeps dying at epoch 0 #56

Open
AndrewForresterGit opened this issue Aug 6, 2024 · 3 comments

@AndrewForresterGit

I keep getting the same error at epoch 0. I've tried debugging, and the only potential cause I've found is that dist.is_initialized() returns False. Could this be the cause? If so, how would I fix it, and if not, what else could be causing the error?

2024-08-05 11:23:25.063796: unpacking dataset...
2024-08-05 11:23:35.208095: unpacking done...
2024-08-05 11:23:35.208965: do_dummy_2d_data_aug: False
2024-08-05 11:23:35.220142: Unable to plot network architecture:
2024-08-05 11:23:35.220288: No module named 'hiddenlayer'
2024-08-05 11:23:35.227630:
2024-08-05 11:23:35.227772: Epoch 0
2024-08-05 11:23:35.227940: Current learning rate: 0.01
using pin_memory on device 0
Exception in thread Thread-4 (results_loop):
Traceback (most recent call last):
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/python/3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/python/3.10.13/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anfor306/venvs/projet-Umamba/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/anfor306/venvs/projet-Umamba/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
  File "/home/anfor306/venvs/projet-Umamba/bin/nnUNetv2_train", line 33, in <module>
    sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
  File "/lustre06/project/6092638/anfor306/U-Mamba/umamba/nnunetv2/run/run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/lustre06/project/6092638/anfor306/U-Mamba/umamba/nnunetv2/run/run_training.py", line 204, in run_training
    nnunet_trainer.run_training()
  File "/lustre06/project/6092638/anfor306/U-Mamba/umamba/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1258, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/home/anfor306/venvs/projet-Umamba/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/home/anfor306/venvs/projet-Umamba/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
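
Note: dist.is_initialized() returning False is expected here. As far as I can tell, nnU-Net v2 only calls dist.init_process_group() when training with more than one GPU, so a False value alone should not kill the background workers. A minimal sketch (assuming only that PyTorch is installed) to confirm what the job actually sees:

import torch
import torch.distributed as dist

# False for plain single-GPU (non-DDP) runs; this by itself does not
# crash the batchgenerators workers.
print("CUDA available:   ", torch.cuda.is_available())
print("dist available:   ", dist.is_available())
print("dist initialized: ", dist.is_initialized())

If I remember correctly, setting the environment variable nnUNet_n_proc_DA=0 disables the multiprocessing data augmentation, so the real exception is raised in the main process instead of being hidden inside a dead worker.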
@AndrewForresterGit changed the title from "backgroundWorker keeps dying at epoch 1" to "backgroundWorker keeps dying at epoch 0" on Aug 6, 2024
@AyacodeYa

Hi, I solved a similar problem, but mine was caused by the version of causal_conv1d. I fixed it with the following commands:
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
git checkout v1.1.1.post2
CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install .
nnUNetv2_train your_dataset_ID 2d all -tr nnUNetTrainerUMambaEnc -num_gpus 1
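
If you go this route, it is worth confirming that the pinned build is the one Python actually imports (the expected version string below is an assumption based on the tag checked out above):

# Verify that the causal_conv1d on sys.path is the pinned build.
import causal_conv1d
print(causal_conv1d.__version__)  # should report 1.1.1.post2 after the steps above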

@Younger330

Hello, when running git checkout v1.1.1.post2 I get the error

error: pathspec 'v1.1.1.post2' did not match any file(s) known to git

Do you have any suggestions?
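
One possible cause, not confirmed in this thread: the tag may not be present in your local clone (for example if it was cloned with --depth 1), or it may simply not exist under that name. Running git fetch --tags inside the causal-conv1d checkout and then git tag -l will show which tags are actually available.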

@AndrewForresterGit
Author

My problem was solved by switching from A100 to V100 GPUs.
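
A plausible explanation, though it is not confirmed in this thread: prebuilt causal-conv1d / mamba-ssm binaries are compiled for specific GPU architectures, so a kernel built against one compute capability can fail on another. A quick check of what the job actually gets:

import torch

# A100 reports compute capability (8, 0); V100 reports (7, 0).
# A mismatch between the architecture the extension was built for and
# the GPU the job lands on is one plausible reason the same code ran
# on V100 but not A100 (an assumption, not confirmed here).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))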
