
Local miner test failing (main branch) #479

@tplr-y

Describe the bug

The templar/scripts/local_miner_test.py script on the main branch fails with the following trace:

torchrun --standalone --nproc_per_node=2 scripts/local_miner_test.py --inner-windows 100 --micro-batch-size 1 --device cuda
W0806 19:19:07.482955 39332 torch/distributed/run.py:766] 
W0806 19:19:07.482955 39332 torch/distributed/run.py:766] *****************************************
W0806 19:19:07.482955 39332 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0806 19:19:07.482955 39332 torch/distributed/run.py:766] *****************************************
Setting dummy environment variables for local testing
Initialising Miner (Llama-3 8B / TorchTitan)…
W0806 19:19:12.534958 39332 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 39401 closing signal SIGTERM
E0806 19:19:13.500850 39332 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -11) local_rank: 0 (pid: 39400) of binary: /home/shadeform/templar/.venv/bin/python3
Traceback (most recent call last):
  File "/home/shadeform/templar/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shadeform/templar/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
scripts/local_miner_test.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-08-06_19:19:12
  host      : shadecloud
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 39400)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 39400
=======================================================
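
For readers unfamiliar with the `exitcode: -11` line: on POSIX, launchers built on Python's subprocess conventions report a child killed by signal N as exit code -N. The snippet below is a minimal sketch of that decoding; `describe_exit_code` is a hypothetical helper written for this issue, not part of torchrun or templar.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a subprocess-style exit code into a readable description.

    Hypothetical helper: a negative code -N is read as "killed by signal N".
    """
    if exitcode < 0:
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


print(describe_exit_code(-11))
```

So rank 0 did not exit through a Python exception; it was terminated by a segmentation fault, which is why the launcher has no Python traceback to show for it.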

The crash occurs in this part of the script:

from neurons.miner import Miner

miner = Miner()
miner.model.eval()

nvidia-smi
Wed Aug  6 19:14:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:8D:00.0 Off |                    0 |
| N/A   27C    P0             76W /  700W |       1MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H200                    On  |   00000000:91:00.0 Off |                    0 |
| N/A   27C    P0             76W /  700W |       1MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H200                    On  |   00000000:95:00.0 Off |                    0 |
| N/A   29C    P0             78W /  700W |       1MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H200                    On  |   00000000:99:00.0 Off |                    0 |
| N/A   27C    P0             76W /  700W |       1MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H200                    On  |   00000000:AB:00.0 Off |                    0 |
| N/A   28C    P0             75W /  700W |       1MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H200                    On  |   00000000:AF:00.0 Off |                    0 |
| N/A   26C    P0             75W /  700W |       1MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H200                    On  |   00000000:B3:00.0 Off |                    0 |
| N/A   28C    P0             77W /  700W |       1MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H200                    On  |   00000000:B7:00.0 Off |                    0 |
| N/A   27C    P0             77W /  700W |       1MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

CUDA is available to torch:

>>> import torch
>>> torch.cuda.is_available()
True

GDB Trace:

gdb_trace.txt

Enhanced Tracing:

Here, I took the main branch and added some extra logging to the local_miner_test.py file, which resulted in:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/shadeform/cleanenv/templar/scripts/local_miner_test.py", line 317, in <module>
[rank0]:     miner = Miner()
[rank0]:             ^^^^^^^
[rank0]:   File "/home/shadeform/templar/neurons/miner.py", line 249, in __init__
[rank0]:     self.model = initialize_torchtitan_model(
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/shadeform/templar/src/tplr/model_factory.py", line 258, in initialize_torchtitan_model
[rank0]:     pdims = create_parallel_dims(world_size, hparams, role)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/shadeform/templar/src/tplr/model_factory.py", line 210, in create_parallel_dims
[rank0]:     raise ValueError(
[rank0]: ValueError: world_size (2) must be divisible by dp_replicate × dp_shard (8×1).
[20:32:52] [Init] Bittensor wallet/metagraph loaded
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/shadeform/cleanenv/templar/scripts/local_miner_test.py", line 317, in <module>
[rank1]:     miner = Miner()
[rank1]:             ^^^^^^^
[rank1]:   File "/home/shadeform/templar/neurons/miner.py", line 249, in __init__
[rank1]:     self.model = initialize_torchtitan_model(
[rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/shadeform/templar/src/tplr/model_factory.py", line 258, in initialize_torchtitan_model
[rank1]:     pdims = create_parallel_dims(world_size, hparams, role)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/shadeform/templar/src/tplr/model_factory.py", line 210, in create_parallel_dims
[rank1]:     raise ValueError(
[rank1]: ValueError: world_size (2) must be divisible by dp_replicate × dp_shard (8×1).
[rank0]:[W806 20:32:53.671894104 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

What am I doing wrong here? I imagine you ran the local miner test script successfully at some point. Did you recently run it with yesterday's torchtitan-rebased merge? @Quentin-Anthony
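
To make the ValueError concrete, here is a minimal sketch of the kind of check that seems to be firing. It is reconstructed only from the error message above, not from the actual code in src/tplr/model_factory.py; the function name `check_parallel_dims` and its signature are assumptions for illustration.

```python
def check_parallel_dims(world_size: int, dp_replicate: int, dp_shard: int) -> None:
    """Hypothetical reconstruction of the divisibility check.

    Raises ValueError when the launched world size cannot be split evenly
    into dp_replicate * dp_shard data-parallel groups.
    """
    product = dp_replicate * dp_shard
    if world_size % product != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"dp_replicate × dp_shard ({dp_replicate}×{dp_shard})."
        )


# Values from the trace: 2 processes, dp_replicate=8, dp_shard=1.
try:
    check_parallel_dims(world_size=2, dp_replicate=8, dp_shard=1)
except ValueError as err:
    print(err)

# A world size of 8 would pass this particular check.
check_parallel_dims(world_size=8, dp_replicate=8, dp_shard=1)
```

If this reading is right, the mismatch is between `--nproc_per_node=2` and a configured dp_replicate of 8. Launching with 8 processes or lowering dp_replicate to 2 should get past this check, though I have not verified either on this machine.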

To Reproduce

reproducible_local_miner_test_failure.txt

Rename the attached script to use a .sh extension, then run it.

It will:

  1. Do the prerequisite Python env setup (optional: needed on a fresh VM, otherwise comment it out)
  2. Download the test shard (skipped if already present)
  3. Launch scripts/local_miner_test.py as currently defined on the main branch, i.e. isolated from any new Muon-related code.

Expected behavior

Training should start without a failure.

Screenshots

No response

Environment

Ubuntu 22.04 + CUDA 12.4 ML

Additional context

No response

Labels: bug