Describe the bug
The templar/scripts/local_miner_test.py script on the main branch fails with the following trace:
torchrun --standalone --nproc_per_node=2 scripts/local_miner_test.py --inner-windows 100 --micro-batch-size 1 --device cuda
W0806 19:19:07.482955 39332 torch/distributed/run.py:766]
W0806 19:19:07.482955 39332 torch/distributed/run.py:766] *****************************************
W0806 19:19:07.482955 39332 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0806 19:19:07.482955 39332 torch/distributed/run.py:766] *****************************************
Setting dummy environment variables for local testing
Initialising Miner (Llama-3 8B / TorchTitan)…
W0806 19:19:12.534958 39332 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 39401 closing signal SIGTERM
E0806 19:19:13.500850 39332 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -11) local_rank: 0 (pid: 39400) of binary: /home/shadeform/templar/.venv/bin/python3
Traceback (most recent call last):
File "/home/shadeform/templar/.venv/bin/torchrun", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
scripts/local_miner_test.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-08-06_19:19:12
host : shadecloud
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 39400)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 39400
=======================================================
It's failing here:
from neurons.miner import Miner
miner = Miner()
miner.model.eval()
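For reference, the exitcode of -11 in the torchrun summary follows the usual POSIX convention: a negative exit code means the child process was killed by the signal with that number. A minimal sketch for decoding it (the helper name describe_exit_code is made up for illustration and is not part of templar or torch):

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a subprocess-style exit code into a readable description."""
    if exitcode < 0:
        # A negative code means the child was terminated by signal number -exitcode.
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


print(describe_exit_code(-11))
```

Since a segfault leaves no Python traceback by default, setting PYTHONFAULTHANDLER=1 in the environment before launching torchrun may help: CPython's faulthandler then dumps the Python-level stack when SIGSEGV arrives.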
nvidia-smi
Wed Aug 6 19:14:13 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01 Driver Version: 550.163.01 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H200 On | 00000000:8D:00.0 Off | 0 |
| N/A 27C P0 76W / 700W | 1MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H200 On | 00000000:91:00.0 Off | 0 |
| N/A 27C P0 76W / 700W | 1MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H200 On | 00000000:95:00.0 Off | 0 |
| N/A 29C P0 78W / 700W | 1MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H200 On | 00000000:99:00.0 Off | 0 |
| N/A 27C P0 76W / 700W | 1MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H200 On | 00000000:AB:00.0 Off | 0 |
| N/A 28C P0 75W / 700W | 1MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H200 On | 00000000:AF:00.0 Off | 0 |
| N/A 26C P0 75W / 700W | 1MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H200 On | 00000000:B3:00.0 Off | 0 |
| N/A 28C P0 77W / 700W | 1MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H200 On | 00000000:B7:00.0 Off | 0 |
| N/A 27C P0 77W / 700W | 1MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
CUDA is available to torch:
>>> import torch
>>> torch.cuda.is_available()
True
GDB Trace:
gdb_trace.txt
Enhanced Tracing:
Here, I took the main branch and added some extra logging to local_miner_test.py, which resulted in:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/shadeform/cleanenv/templar/scripts/local_miner_test.py", line 317, in <module>
[rank0]: miner = Miner()
[rank0]: ^^^^^^^
[rank0]: File "/home/shadeform/templar/neurons/miner.py", line 249, in __init__
[rank0]: self.model = initialize_torchtitan_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/shadeform/templar/src/tplr/model_factory.py", line 258, in initialize_torchtitan_model
[rank0]: pdims = create_parallel_dims(world_size, hparams, role)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/shadeform/templar/src/tplr/model_factory.py", line 210, in create_parallel_dims
[rank0]: raise ValueError(
[rank0]: ValueError: world_size (2) must be divisible by dp_replicate × dp_shard (8×1).
[20:32:52] [Init] Bittensor wallet/metagraph loaded
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/shadeform/cleanenv/templar/scripts/local_miner_test.py", line 317, in <module>
[rank1]: miner = Miner()
[rank1]: ^^^^^^^
[rank1]: File "/home/shadeform/templar/neurons/miner.py", line 249, in __init__
[rank1]: self.model = initialize_torchtitan_model(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/shadeform/templar/src/tplr/model_factory.py", line 258, in initialize_torchtitan_model
[rank1]: pdims = create_parallel_dims(world_size, hparams, role)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/shadeform/templar/src/tplr/model_factory.py", line 210, in create_parallel_dims
[rank1]: raise ValueError(
[rank1]: ValueError: world_size (2) must be divisible by dp_replicate × dp_shard (8×1).
[rank0]:[W806 20:32:53.671894104 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
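The ValueError in the enhanced trace points at a divisibility check between world_size and the data-parallel dimensions. The sketch below is a hypothetical re-creation of that check: the function name check_parallel_dims and its signature are assumptions, not the actual code in src/tplr/model_factory.py. It only illustrates why 2 processes fail against an 8×1 layout while 8 processes would pass.

```python
def check_parallel_dims(world_size: int, dp_replicate: int, dp_shard: int) -> None:
    """Hypothetical stand-in for the check that raised in create_parallel_dims."""
    required = dp_replicate * dp_shard
    # Assumption: the real check is a plain modulo test, as the error message suggests.
    if world_size % required != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"dp_replicate × dp_shard ({dp_replicate}×{dp_shard})."
        )


# Compare the failing launch (--nproc_per_node=2) with an 8-process launch.
for ws in (2, 8):
    try:
        check_parallel_dims(ws, dp_replicate=8, dp_shard=1)
        print(f"world_size={ws}: ok")
    except ValueError as err:
        print(f"world_size={ws}: {err}")
```

If the real check behaves like this, the 8×1 layout would presumably need either 8 processes or a smaller dp_replicate value; which of those is intended for the local test is not confirmed here.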
What am I doing wrong here? I imagine you ran the local miner test script successfully at some point. Did you recently run it with yesterday's torchtitan-rebased merge? @Quentin-Anthony
To Reproduce
reproducible_local_miner_test_failure.txt
Rename the above script to use a .sh extension and run it.
It will:
- Do the prerequisite Python env setup (optional: useful on a fresh VM, otherwise comment it out)
- Download the test shard (skipped if already present)
- Launch scripts/local_miner_test.py as currently defined on the main branch, i.e. isolated from any new muon-related code.
Expected behavior
Training should start without a failure.
Screenshots
No response
Environment
Ubuntu 22.04 + CUDA 12.4 ML
Additional context
No response