diff --git a/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/README.md b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/README.md
new file mode 100644
index 0000000000..c30e59bf45
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/README.md
@@ -0,0 +1,142 @@
+# Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Score-First TTT (4 epochs) + Tuned MLP WD
+
+**val_bpb = 1.07290** (3-seed mean, std 0.00016) | ~16.00 MB | 8xH100 SXM
+
+## 3-Seed Results
+
+| Seed | Pre-quant BPB | Quantized BPB | Sliding BPB | **TTT BPB** | Artifact (B) | Train (s) | Eval (s) |
+|------|---------------|---------------|-------------|-------------|--------------|-----------|----------|
+| 42 | 1.08086 | 1.09131 | 1.07455 | **1.07303** | 15,995,398 | 600 | 577.5 |
+| 314 | 1.08058 | 1.09105 | 1.07424 | **1.07272** | 15,999,207 | 600 | 575.7 |
+| 999 | 1.08078 | 1.09119 | 1.07443 | **1.07295** | 15,995,751 | 600 | 586.4 |
+| **Mean** | **1.08074** | **1.09118** | **1.07441** | **1.07290** | **15,996,785** | **600** | **579.9** |
+| **Std** | **0.00014** | **0.00013** | **0.00016** | **0.00016** | n/a | n/a | n/a |
+
+Prior SOTA (2026-04-09 PR `SP8192_3LayerRecur_ParResid_QK525_LegalTTT` by @bigbag):
+**1.0810 BPB** (3-seed mean, std 0.0002).
+
+**Δ = −0.00810 BPB**, an improvement that clears the official 0.005-nat threshold.
+**Welch t = −54.93** (df = 3.80), **one-sided p < 1e-6**.
+
+## Key Techniques (delta vs prior 04-09 record)
+
+1. **TTT epochs 3 → 4**: extra adaptation budget per 32K-token chunk under
+   the same score-first protocol. The dominant source of the val_bpb gain.
+2. **Split MLP weight decay**: `muon_wd_mlp = 0.115` vs `muon_wd = 0.095`
+   for non-MLP matrices. Stronger regularization on the largest matrices
+   reduces post-quant degradation.
+3.
**Per-head attention-output gate**: sigmoid gate over attention output
+   ([H, W_g=12] zero-init Parameter; raw gate = 2·sigmoid(0) = 1 at step 0,
+   transparent in early training).
+
+All other components are inherited from the 04-09 record and listed below.
+
+## Architecture (inherited from 04-09 record)
+
+11L × 512d × 8H / 4KV (GQA), MLP 4× (hidden 2048), LeakyReLU(0.5)² MLP
+activation, Partial RoPE (16/64 dims), tied embeddings, logit
+softcap = 30.0. Depth recurrence: encoder/decoder layer indexing
+includes loops over layers 3-5 (`num_loops=2`), activated at training
+fraction 0.35. Parallel residuals from layer 7. XSA (exclusive
+self-attention: subtract normalized-V projection of output) on all
+11 layers. Layer-wise LN scale `1/sqrt(layer+1)`, with looped layers
+additionally divided by `sqrt(num_loops+1)` for residual variance
+balancing.
+
+Total ~35.9M parameters.
+
+## Training
+
+MuonEq-R optimizer (row-normalized Muon, Newton-Schulz 5 steps,
+Nesterov), AdamW for embeddings and scalars. Time-fractional schedule:
+linear warmdown over the final 72% of the 600 s wallclock. Time-
+fractional Muon momentum warmup over the first 22% (peak momentum 0.99,
+warmup-start 0.92). EMA decay 0.9965 throughout training. ~4640 steps
+in 600 s on 8×H100 SXM at peak ~7.7 M tok/s.
+
+## Quantization
+
+Full-Hessian GPTQ with SDClip: `clip = k · std(row)` for principled
+rate-distortion. int6 for attention/MLP matrices (k=12.85), int8 for
+token embeddings (k=20.0). Per-row fp16 scales. Block size 32 with
+multiplicative Hessian damping factor 1.01.
+
+## Compression
+
+Byte-shuffle (stride-2) followed by Brotli quality-11 with
+`lgwin=24`. The byte-shuffle separates the high-bit / low-bit byte
+planes of the int8-stored quantized values, exposing the redundancy
+of the unused upper bits to Brotli. Final blob ≈ 15.97 MB across
+seeds.
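The byte-shuffle step described in the Compression section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `train_gpt.py`: the helper names `byte_shuffle`, `byte_unshuffle`, `pack`, and `unpack` are hypothetical, the plane ordering (all even-offset bytes first, then all odd-offset bytes) is assumed, and the standard-library `zlib` stands in for Brotli (quality 11, `lgwin=24`) so the sketch has no third-party dependency.

```python
import zlib


def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group bytes by their offset modulo `stride` (plane 0 first, then plane 1, ...)."""
    return b"".join(data[i::stride] for i in range(stride))


def byte_unshuffle(shuffled: bytes, stride: int = 2) -> bytes:
    """Invert byte_shuffle, including inputs whose length is not a multiple of `stride`."""
    n = len(shuffled)
    out = bytearray(n)
    offset = 0
    for i in range(stride):
        # Number of original positions i, i + stride, i + 2*stride, ... below n.
        plane_len = len(range(i, n, stride))
        out[i::stride] = shuffled[offset:offset + plane_len]
        offset += plane_len
    return bytes(out)


def pack(data: bytes, stride: int = 2) -> bytes:
    # ASSUMPTION: zlib is only a stand-in here; the README states Brotli is used.
    return zlib.compress(byte_shuffle(data, stride), 9)


def unpack(blob: bytes, stride: int = 2) -> bytes:
    return byte_unshuffle(zlib.decompress(blob), stride)


if __name__ == "__main__":
    payload = bytes(range(256)) * 4
    assert unpack(pack(payload)) == payload
    print(byte_shuffle(b"a1b2c3"))  # b'abc123'
```

Grouping same-offset bytes places similar values next to each other, which is the property a general-purpose compressor can exploit; how much this helps depends on the actual byte layout of the quantized tensors.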
+
+## Test-Time Training (Score-First, 4 epochs/chunk)
+
+Per Issue #1017 Track B "legal eval-time adaptation":
+
+- **Condition 1 (Causality)**: Sliding-window eval is strictly causal;
+  each position is scored from prefix tokens only.
+- **Condition 2 (Normalized distribution)**: Standard softmax over the
+  full SP8192 vocab. No n-gram cache, no logit biasing.
+- **Condition 3 (Score-before-update)**: Each 32K-token chunk is fully
+  scored under `torch.no_grad()` BEFORE any SGD update. Training uses
+  only already-scored tokens.
+- **Condition 4 (Single pass)**: Each token is scored exactly once. No
+  rescoring, no multi-pass selection.
+
+Inner loop: SGD with momentum 0.9, Nesterov; per-chunk cosine LR
+decay starting from `ttt_lr = 0.005`; 4 epochs per chunk; gradient
+clipping at norm 1.0; distributed all-reduce of gradients with
+`ReduceOp.AVG`. Total TTT eval time ~310 s per seed (within the
+600 s eval budget).
+
+## Compliance
+
+- No SLOT (standard or causal)
+- No pre-quant TTT on val data (model is quantized once during eval
+  setup; the only TTT pass is the legal score-first adaptation
+  described above)
+- No ETLB (eval-time logit bias)
+- No n-gram cache or tilt
+- No tokenizer changes (default SP8192 BPE on FineWeb10B)
+- No external data, sidecar files, or eval-time downloads
+- All 3 artifacts < 16,000,000 bytes (worst margin: 793 B on seed 314)
+- Train time capped by `MAX_WALLCLOCK_SECONDS=600` on all 3 seeds
+  (at most 600.1 s)
+- Eval time < 600 s on all 3 seeds (worst: 586.4 s on seed 999)
+
+## Reproduction
+
+```bash
+pip install brotli sentencepiece
+pip install flash_attn_3 --no-deps \
+    --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
+
+MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
+    python3 data/cached_challenge_fineweb.py --variant sp8192
+
+SEED=42 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=4 \
+    torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+(Repeat with `SEED=314` and `SEED=999` to reproduce
the 3-seed mean.)
+
+## Credits
+
+This submission is a small two-knob delta on top of the 2026-04-09
+record by **@bigbag**
+(`SP8192_3LayerRecur_ParResid_QK525_LegalTTT`, val_bpb 1.0810).
+Everything other than `ttt_epochs=4` and `muon_wd_mlp=0.115` is
+inherited unchanged from that stack.
+
+For the full upstream contributor chain (SP8192, GPTQ SDClip, depth
+recurrence, parallel residuals, QK-Gain, MuonEq-R, score-first TTT
+framework, hyperparameter tuning), see @bigbag's record README:
+`records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/README.md`.
+
+## Included Files
+
+- `README.md` (this file)
+- `submission.json`
+- `train_gpt.py` (lzma-RAW + base85 self-extracting wrapper around
+  the full source; 68,778 B raw → 19,646 B packed)
+- `seed42.log`, `seed314.log`, `seed999.log`
diff --git a/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/seed314.log b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/seed314.log
new file mode 100644
index 0000000000..1e717f6eb9
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/seed314.log
@@ -0,0 +1,1171 @@
+=== PREFLIGHT (SMOKE_TEST=1, 109s) ===
+W0425 04:58:39.644000 179 torch/distributed/run.py:803]
+W0425 04:58:39.644000 179 torch/distributed/run.py:803] *****************************************
+W0425 04:58:39.644000 179 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+W0425 04:58:39.644000 179 torch/distributed/run.py:803] ***************************************** +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:248 [1] NCCL INFO cudaDriverVersion 12080 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:248 [1] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:248 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:248 [1] NCCL INFO Comm config Blocking set to 1 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:247 [0] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:253 [6] NCCL INFO cudaDriverVersion 12080 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:247 [0] NCCL INFO cudaDriverVersion 12080 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:251 [4] NCCL INFO cudaDriverVersion 12080 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:253 [6] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:252 [5] NCCL INFO cudaDriverVersion 12080 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:247 [0] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:253 [6] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:252 [5] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:247 [0] NCCL INFO Comm config Blocking set to 1 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:253 [6] NCCL INFO Comm config Blocking set to 1 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:252 [5] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:254 
[7] NCCL INFO cudaDriverVersion 12080 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:254 [7] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:254 [7] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:254 [7] NCCL INFO Comm config Blocking set to 1 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:249 [2] NCCL INFO cudaDriverVersion 12080 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:249 [2] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:249 [2] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:249 [2] NCCL INFO Comm config Blocking set to 1 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:250 [3] NCCL INFO cudaDriverVersion 12080 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:250 [3] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:250 [3] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:250 [3] NCCL INFO Comm config Blocking set to 1 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:251 [4] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:251 [4] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:251 [4] NCCL INFO Comm config Blocking set to 1 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:252 [5] NCCL INFO Comm config Blocking set to 1 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Failed to open libibverbs.so[.1] +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO Failed to open libibverbs.so[.1] +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Initialized NET plugin Socket +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO Initialized NET plugin Socket +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Assigned NET plugin Socket to comm +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO Assigned NET plugin Socket to comm +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Using network Socket +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO Using network Socket +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO Failed to open libibverbs.so[.1] +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO Initialized NET plugin Socket +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO Assigned NET plugin Socket to comm +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO Using network Socket +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO Failed to open libibverbs.so[.1] +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO Failed to open libibverbs.so[.1] +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO Initialized NET plugin Socket +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO Initialized NET plugin Socket +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO Assigned NET plugin Socket to comm +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO Assigned NET plugin Socket to comm +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO Using network Socket +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO Using 
network Socket +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO Failed to open libibverbs.so[.1] +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO Initialized NET plugin Socket +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO Assigned NET plugin Socket to comm +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO Using network Socket +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO Failed to open libibverbs.so[.1] +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO Initialized NET plugin Socket +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO Assigned NET plugin Socket to comm +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO Using network Socket +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO Failed to open libibverbs.so[.1] +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO Initialized NET plugin Socket +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO Assigned NET plugin Socket to comm +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO Using network Socket +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO ncclCommInitRankConfig comm 0x55f6a7a24680 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x35ab675eb55cd9a8 - Init START +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO ncclCommInitRankConfig comm 0x55f9fd8e7d70 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0x35ab675eb55cd9a8 - Init START +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO ncclCommInitRankConfig comm 0x559824139d80 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x35ab675eb55cd9a8 - Init START +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO RAS client listening socket at ::1<28028> +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO ncclCommInitRankConfig comm 0x55b5d579adc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x35ab675eb55cd9a8 - Init START +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO RAS client listening socket at ::1<28028> +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO RAS client listening socket at ::1<28028> +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO ncclCommInitRankConfig comm 0x561941b639b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 
0x35ab675eb55cd9a8 - Init START +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO ncclCommInitRankConfig comm 0x55cc5084e920 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x35ab675eb55cd9a8 - Init START +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO RAS client listening socket at ::1<28028> +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO ncclCommInitRankConfig comm 0x55eb260c1970 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x35ab675eb55cd9a8 - Init START +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO RAS client listening socket at ::1<28028> +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO Bootstrap timings total 0.013190 (create 0.000027, send 0.000093, recv 0.012547, ring 0.000177, delay 0.000002) +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO MNNVL busId 0x3b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO ncclCommInitRankConfig comm 0x5565595bec70 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x35ab675eb55cd9a8 - Init START +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO RAS client listening socket at ::1<28028> +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO Bootstrap timings total 0.000695 (create 0.000026, send 0.000095, recv 0.000069, ring 0.000181, delay 0.000002) +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO MNNVL busId 0x5d000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO RAS client listening socket at ::1<28028> +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO Bootstrap timings total 0.036469 
(create 0.000027, send 0.000092, recv 0.023327, ring 0.012684, delay 0.000002) +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO MNNVL busId 0x2a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO Bootstrap timings total 0.268033 (create 0.000026, send 0.000094, recv 0.094185, ring 0.057865, delay 0.000002) +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO MNNVL busId 0xbb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Bootstrap timings total 0.366044 (create 0.000027, send 0.000083, recv 0.330784, ring 0.034771, delay 0.000002) +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO RAS client listening socket at ::1<28028> +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO MNNVL busId 0x19000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO Bootstrap timings total 0.061255 (create 0.000029, send 0.000086, recv 0.002402, ring 0.000145, delay 0.000002) +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO MNNVL busId 0x9b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO Setting affinity for GPU 4 to 48-95,144-191 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO Bootstrap timings total 0.058920 (create 0.000028, send 0.000098, recv 0.000096, ring 0.058397, delay 0.000002) +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO MNNVL busId 0xab000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO Setting affinity for GPU 5 to 48-95,144-191 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO Bootstrap timings total 0.173985 (create 0.000028, send 0.000090, recv 0.000163, ring 0.173356, delay 0.000002) +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO MNNVL busId 0xdb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO Setting affinity for GPU 2 to 0-47,96-143 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO comm 0x55eb260c1970 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO P2P Chunksize set to 524288 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:605 [2] NCCL INFO [Proxy Service] Device 2 CPU core 12 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:608 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 109 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO Setting affinity for GPU 3 to 0-47,96-143 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO comm 0x5565595bec70 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO P2P Chunksize set to 524288 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:609 [3] NCCL INFO [Proxy Service] Device 3 CPU core 116 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:612 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 39 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO Setting affinity for GPU 1 to 0-47,96-143 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO comm 0x55b5d579adc0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO P2P Chunksize set to 524288 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:601 [1] NCCL INFO [Proxy Service] Device 1 CPU core 140 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:602 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 6 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO Setting affinity for GPU 6 to 48-95,144-191 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO comm 0x55f9fd8e7d70 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO P2P Chunksize set to 524288 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:613 [6] NCCL INFO [Proxy Service] Device 6 CPU core 168 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:614 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 73 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO comm 0x55cc5084e920 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO comm 0x55f6a7a24680 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO comm 0x561941b639b0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:603 [4] NCCL INFO [Proxy Service] Device 4 CPU core 148 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:604 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 156 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:606 [5] NCCL INFO [Proxy Service] Device 5 CPU core 61 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:607 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 150 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 
+[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:610 [0] NCCL INFO [Proxy Service] Device 0 CPU core 31 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:611 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 132 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO Setting affinity for GPU 7 to 48-95,144-191 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO comm 0x559824139d80 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO P2P Chunksize set to 524288 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:599 [7] NCCL INFO [Proxy Service] Device 7 CPU core 67 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:600 [7] NCCL INFO [Proxy Service UDS] Device 7 CPU core 167 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO ncclCommInitRankConfig comm 0x55eb260c1970 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x35ab675eb55cd9a8 - Init COMPLETE +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:590 [2] NCCL INFO Init timings - ncclCommInitRankConfig: rank 2 nranks 8 total 1.56 (kernels 0.43, alloc 0.87, bootstrap 0.01, allgathers 0.02, topo 0.05, graphs 0.02, connections 0.07, rest 0.09) +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO ncclCommInitRankConfig comm 0x5565595bec70 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x35ab675eb55cd9a8 - Init COMPLETE +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:589 [3] NCCL INFO Init timings - ncclCommInitRankConfig: rank 3 nranks 8 total 1.56 (kernels 0.51, alloc 0.81, bootstrap 0.00, allgathers 0.02, topo 0.05, graphs 0.02, connections 0.11, rest 0.05) +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO ncclCommInitRankConfig comm 0x55b5d579adc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x35ab675eb55cd9a8 - Init COMPLETE +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:586 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 8 total 1.59 (kernels 0.39, alloc 0.92, bootstrap 0.04, allgathers 0.01, topo 0.05, graphs 0.02, connections 0.07, rest 0.09) +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO ncclCommInitRankConfig comm 0x55f9fd8e7d70 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0x35ab675eb55cd9a8 - Init COMPLETE +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:584 [6] NCCL INFO Init timings - ncclCommInitRankConfig: rank 6 nranks 8 total 1.62 (kernels 0.32, alloc 0.79, bootstrap 0.27, allgathers 0.03, topo 0.06, graphs 0.01, connections 0.07, rest 0.08) +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO ncclCommInitRankConfig comm 0x55cc5084e920 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x35ab675eb55cd9a8 - Init COMPLETE +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:588 [4] NCCL INFO Init timings - ncclCommInitRankConfig: rank 4 nranks 8 total 1.58 (kernels 0.37, alloc 0.90, bootstrap 0.06, allgathers 0.02, topo 0.05, graphs 0.01, connections 0.08, rest 0.08) +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO ncclCommInitRankConfig comm 0x561941b639b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0x35ab675eb55cd9a8 - Init COMPLETE +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:587 [5] NCCL INFO Init timings - ncclCommInitRankConfig: rank 5 nranks 8 total 1.58 (kernels 0.37, alloc 0.91, bootstrap 0.06, allgathers 0.02, topo 0.05, graphs 0.01, connections 0.07, rest 0.09) +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO CC Off, workFifoBytes 1048576 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO ncclCommInitRankConfig comm 0x55f6a7a24680 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x35ab675eb55cd9a8 - Init COMPLETE +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:583 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 8 total 1.64 (kernels 0.32, alloc 0.70, bootstrap 0.37, allgathers 0.02, topo 0.06, graphs 0.01, connections 0.07, rest 0.09) +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. 
Using internal tuner plugin. +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO ncclCommInitRankConfig comm 0x559824139d80 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x35ab675eb55cd9a8 - Init COMPLETE +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:585 [7] NCCL INFO Init timings - ncclCommInitRankConfig: rank 7 nranks 8 total 1.59 (kernels 0.34, alloc 0.84, bootstrap 0.17, allgathers 0.02, topo 0.05, graphs 0.01, connections 0.07, rest 0.08) +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 11/0 : 2[2] 
-> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO 
Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM 
+[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 10/0 : 1[1] -> 
2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO 
Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM 
+[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 09/0 : 4[4] -> 
5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO 
Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM 
+[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 09/0 : 0[0] -> 
1[1] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO 
Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM 
+[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:621 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:617 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:615 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:622 [6] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:618 [4] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:616 [5] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:620 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:Hyperparameters: +[default0]: adam_eps: 1e-08 +[default0]: adam_wd: 0.005 +[default0]: beta1: 0.9 
+[default0]: beta2: 0.95 +[default0]: compressor: brotli +[default0]: data_dir: ./openai_parameter_golf/data +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:619 [7] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]: datasets_dir: ./openai_parameter_golf/data/datasets/fineweb10B_sp8192 +[default0]: distributed: True +[default0]: ema_decay: 0.9965 +[default0]: embed_bits: 8 +[default0]: embed_clip_sigmas: 20.0 +[default0]: embed_lr: 0.6 +[default0]: embed_wd: 0.085 +[default0]: embedding_dim: 512 +[default0]: enable_looping_at: 0.35 +[default0]: eval_seq_len: 512 +[default0]: eval_stride: 64 +[default0]: gptq_calibration_batches: 2 +[default0]: grad_accum_steps: 1 +[default0]: grad_clip_norm: 0.3 +[default0]: head_lr: 0.008 +[default0]: is_main_process: True +[default0]: iterations: 30 +[default0]: ln_scale: True +[default0]: local_rank: 0 +[default0]: logfile: logs/20260425_045844_1ed6f6d0.txt +[default0]: logit_softcap: 30.0 +[default0]: loop_end: 5 +[default0]: loop_start: 3 +[default0]: lowbit_layers: +[default0]: matrix_bits: 6 +[default0]: matrix_clip_sigmas: 12.85 +[default0]: matrix_lr: 0.022 +[default0]: max_wallclock_seconds: 120.0 +[default0]: min_lr: 0.0 +[default0]: mlp_mult: 4.0 +[default0]: model_dim: 512 +[default0]: model_path: ckpt/final_model.pt +[default0]: muon_backend_steps: 5 +[default0]: muon_beta2: 0.95 +[default0]: muon_momentum: 0.99 +[default0]: muon_momentum_warmup_fraction: 0.22 +[default0]: muon_momentum_warmup_start: 0.92 +[default0]: muon_row_normalize: True +[default0]: muon_wd: 0.095 +[default0]: muon_wd_mlp: 0.115 +[default0]: num_heads: 8 +[default0]: num_kv_heads: 4 +[default0]: num_layers: 11 +[default0]: num_loops: 2 +[default0]: parallel_residual_start: 7 +[default0]: qk_gain_init: 5.25 +[default0]: quantized_model_path: ckpt/final_model.int6.ptz +[default0]: rank: 0 +[default0]: rope_base: 10000.0 +[default0]: rope_dims: 16 +[default0]: rope_train_seq_len: 2048 +[default0]: run_id: 
20260425_045844_1ed6f6d0 +[default0]: scalar_lr: 0.02 +[default0]: seed: 1337 +[default0]: skip_gates_enabled: True +[default0]: sliding_window_enabled: True +[default0]: tie_embeddings: True +[default0]: tied_embed_init_std: 0.005 +[default0]: tied_embed_lr: 0.03 +[default0]: tokenizer_path: ./openai_parameter_golf/data/tokenizers/fineweb_8192_bpe.model +[default0]: train_batch_tokens: 32768 +[default0]: train_files: /dev/shm/fineweb10B_sp8192/fineweb_train_*.bin +[default0]: train_log_every: 5 +[default0]: train_seq_len: 512 +[default0]: ttt_chunk_tokens: 65536 +[default0]: ttt_enabled: False +[default0]: ttt_epochs: 4 +[default0]: ttt_lr: 0.005 +[default0]: ttt_momentum: 0.9 +[default0]: val_batch_tokens: 2097152 +[default0]: val_files: /dev/shm/fineweb10B_sp8192/fineweb_val_*.bin +[default0]: val_loss_every: 0 +[default0]: vocab_size: 8192 +[default0]: warmdown_frac: 0.2 +[default0]: warmup_steps: 0 +[default0]: world_size: 8 +[default0]: xsa_last_n: 11 +[default0]:[SMOKE_TEST] attention_backend=sdpa_fallback FA3=False smoke_test=True +[default0]:[SMOKE_TEST] val_bpb from this run is NOT comparable to proxy/full runs +[default0]:attention_backend:sdpa_fallback(smoke) smoke_test:True +[default0]:train_shards: 0 +[default0]:val_tokens: 40540672 +[default0]:smoke_test: torch.compile disabled (eager mode) +[default0]:model_params:35946192 +[default0]:1/30 train_loss: 9.0074 train_time: 0.1m tok/s: 6588 +[default0]:2/30 train_loss: 13.8612 train_time: 0.1m tok/s: 12530 +[default0]:3/30 train_loss: 13.9007 train_time: 0.1m tok/s: 18590 +[default0]:4/30 train_loss: 12.3550 train_time: 0.1m tok/s: 24518 +[default0]:5/30 train_loss: 11.7223 train_time: 0.1m tok/s: 30323 +[default0]:10/30 train_loss: 7.8868 train_time: 0.1m tok/s: 57200 +[default0]:15/30 train_loss: 7.0293 train_time: 0.1m tok/s: 81315 +[default0]:20/30 train_loss: 6.7342 train_time: 0.1m tok/s: 103303 +[default0]:25/30 train_loss: 6.2552 train_time: 0.1m tok/s: 123484 +[default0]:30/30 train_loss: 
6.2276 train_time: 0.1m tok/s: 141901 +[default0]:peak memory allocated: 2595 MiB reserved: 2840 MiB +[default0]:ema:applying EMA weights +[default0]:smoke_test: training complete — running GPTQ+brotli pack for size check +[default0]:Serialized model: 135441937 bytes +[default0]:GPTQ:collecting Hessians from calibration data... +[default0]:GPTQ:collected 67 Hessians in 0.4s +[default0]:Quantized weights: +[default0]: gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight +[default0]: gptq (int8): tok_emb.weight +[default0]: passthrough (float16): blocks.attn.attn_out_gate_w, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, embed_scale, skip_gates, skip_weights +[default0]:Serialized model quantized+brotli: 15994730 bytes +[default0]:smoke_pack_bytes: code=19646 model=15994730 total=16014376 +[default0]:smoke_test:complete (code ran successfully; val_bpb not computed in smoke mode) +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:247:247 [0] NCCL INFO comm 0x55f6a7a24680 rank 0 nranks 8 cudaDev 0 busId 19000 - Destroy COMPLETE +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:253:253 [6] NCCL INFO comm 0x55f9fd8e7d70 rank 6 nranks 8 cudaDev 6 busId bb000 - Destroy COMPLETE +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:254:254 [7] NCCL INFO comm 0x559824139d80 rank 7 nranks 8 cudaDev 7 busId db000 - Destroy COMPLETE +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:252:252 [5] NCCL INFO comm 0x561941b639b0 rank 5 nranks 8 cudaDev 5 busId ab000 - Destroy COMPLETE +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:248:248 [1] NCCL INFO comm 0x55b5d579adc0 rank 1 nranks 8 cudaDev 1 busId 2a000 - Destroy COMPLETE +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:251:251 [4] NCCL INFO comm 0x55cc5084e920 rank 4 nranks 8 cudaDev 4 busId 9b000 - Destroy COMPLETE 
+[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:249:249 [2] NCCL INFO comm 0x55eb260c1970 rank 2 nranks 8 cudaDev 2 busId 3b000 - Destroy COMPLETE +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:250:250 [3] NCCL INFO comm 0x5565595bec70 rank 3 nranks 8 cudaDev 3 busId 5d000 - Destroy COMPLETE + +=== TRAIN (rc=0, 1401s) === +W0425 05:00:28.042000 21169 torch/distributed/run.py:803] +W0425 05:00:28.042000 21169 torch/distributed/run.py:803] ***************************************** +W0425 05:00:28.042000 21169 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0425 05:00:28.042000 21169 torch/distributed/run.py:803] ***************************************** +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21239 [1] NCCL INFO cudaDriverVersion 12080 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21239 [1] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21239 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21239 [1] NCCL INFO Comm config Blocking set to 1 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21238 [0] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21238 [0] NCCL INFO cudaDriverVersion 12080 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21238 [0] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21238 [0] NCCL INFO Comm config Blocking set to 1 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21241 [3] NCCL INFO cudaDriverVersion 12080 
+[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21241 [3] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21241 [3] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21241 [3] NCCL INFO Comm config Blocking set to 1 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21240 [2] NCCL INFO cudaDriverVersion 12080 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21240 [2] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21240 [2] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21240 [2] NCCL INFO Comm config Blocking set to 1 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21243 [5] NCCL INFO cudaDriverVersion 12080 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21243 [5] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21243 [5] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21243 [5] NCCL INFO Comm config Blocking set to 1 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21242 [4] NCCL INFO cudaDriverVersion 12080 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21242 [4] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21242 [4] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21242 [4] NCCL INFO Comm config Blocking set to 1 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21244 [6] NCCL INFO cudaDriverVersion 12080 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21244 [6] NCCL INFO Bootstrap: 
Using eth0:10.247.101.102<0> +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21244 [6] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21244 [6] NCCL INFO Comm config Blocking set to 1 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21245 [7] NCCL INFO cudaDriverVersion 12080 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21245 [7] NCCL INFO Bootstrap: Using eth0:10.247.101.102<0> +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21245 [7] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21245 [7] NCCL INFO Comm config Blocking set to 1 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO Failed to open libibverbs.so[.1] +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO Initialized NET plugin Socket +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO Assigned NET plugin Socket to comm +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO Using network Socket +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Failed to open libibverbs.so[.1] +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Initialized NET plugin Socket +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Assigned NET plugin Socket to comm +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Using network Socket +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO Failed to open libibverbs.so[.1] +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO Initialized NET plugin Socket +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO Assigned NET plugin Socket to comm +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO Using network Socket +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO Failed to open libibverbs.so[.1] +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO Initialized NET plugin Socket +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO Assigned NET plugin Socket to comm +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO Using network Socket +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO Failed to open libibverbs.so[.1] +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO Initialized NET plugin Socket +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO Assigned NET plugin Socket to comm +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO Using network Socket +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO Failed to open libibverbs.so[.1] +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO Initialized NET plugin Socket +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO Assigned NET plugin Socket to comm +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO Using network Socket +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO Failed to open libibverbs.so[.1] +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO Initialized NET plugin Socket +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO Assigned NET plugin Socket to comm +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO Using network Socket +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO Failed to open libibverbs.so[.1] +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO NET/Socket : Using [0]eth0:10.247.101.102<0> +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO Initialized NET plugin Socket +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO Assigned NET plugin Socket to comm +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO Using network Socket +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO ncclCommInitRankConfig comm 0x55ad587666b0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0xc8133c42f0a56ca0 - Init START +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO ncclCommInitRankConfig comm 0x560926c63860 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xc8133c42f0a56ca0 - Init START +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO ncclCommInitRankConfig comm 0x55cf280ca470 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0xc8133c42f0a56ca0 - Init START +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO ncclCommInitRankConfig comm 0x55bfb382d8f0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0xc8133c42f0a56ca0 - Init START +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO ncclCommInitRankConfig comm 0x5573498d3280 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0xc8133c42f0a56ca0 - Init START +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO RAS client listening socket at ::1<28028> +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO Bootstrap timings total 0.047532 (create 0.000028, send 0.000094, recv 0.046218, ring 
0.000806, delay 0.000002) +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO MNNVL busId 0x2a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO Setting affinity for GPU 1 to 0-47,96-143 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO comm 0x560926c63860 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO RAS client listening socket at ::1<28028> +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Bootstrap timings total 0.164638 (create 0.000026, send 0.000086, recv 0.117238, ring 0.001032, delay 0.000003) +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO MNNVL busId 0x19000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO NCCL_NVLS_ENABLE set by 
environment to 0. +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO comm 0x55ad587666b0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 
+[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO P2P Chunksize set to 524288 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO RAS client listening socket at ::1<28028> +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO Bootstrap timings 
total 0.014359 (create 0.000028, send 0.000095, recv 0.000734, ring 0.000823, delay 0.000002) +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO MNNVL busId 0x5d000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO Setting affinity for GPU 3 to 0-47,96-143 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO comm 0x55cf280ca470 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO P2P Chunksize set to 524288 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO ncclCommInitRankConfig comm 0x55d4efcb6c50 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0xc8133c42f0a56ca0 - Init START +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO RAS client listening socket at ::1<28028> +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO Bootstrap timings total 0.001422 (create 0.000026, send 0.000101, recv 0.000101, ring 0.000812, delay 0.000002) +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO MNNVL busId 0x3b000 fabric 
UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO Setting affinity for GPU 2 to 0-47,96-143 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO comm 0x55d4efcb6c50 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO P2P Chunksize set to 524288 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO ncclCommInitRankConfig comm 0x56476fd40fc0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0xc8133c42f0a56ca0 - Init START +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO RAS client listening socket at ::1<28028> +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO Bootstrap timings total 0.012584 (create 0.000029, send 0.000093, recv 0.000079, ring 0.012033, delay 0.000002) +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO MNNVL busId 0xab000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO Setting affinity for GPU 5 to 48-95,144-191 
+[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO comm 0x56476fd40fc0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO P2P Chunksize set to 524288 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO RAS client listening socket at ::1<28028> +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO Bootstrap timings total 0.086346 (create 0.000029, send 0.000088, recv 0.073840, ring 0.012020, delay 0.000003) +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO MNNVL busId 0x9b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO Setting affinity for GPU 4 to 48-95,144-191 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO comm 0x55bfb382d8f0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO P2P Chunksize set to 524288 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO RAS client listening socket at ::1<28028> +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO Bootstrap timings total 0.051656 (create 0.000027, send 0.000094, recv 0.051099, ring 0.000118, delay 0.000002) +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO MNNVL busId 0xbb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO Setting affinity for GPU 6 to 48-95,144-191 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO comm 0x5573498d3280 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO P2P Chunksize set to 524288 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO ncclCommInitRankConfig comm 0x564e24a3fae0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0xc8133c42f0a56ca0 - Init START +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO RAS client listening socket at ::1<28028> +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO Bootstrap timings total 0.001806 (create 0.000028, send 0.000100, recv 0.000096, ring 0.000118, delay 0.000002) +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO MNNVL busId 0xdb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO Setting affinity for GPU 7 to 48-95,144-191 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO comm 0x564e24a3fae0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO P2P Chunksize set to 524288 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21309 [1] NCCL INFO [Proxy Service] Device 1 CPU core 108 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21310 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 106 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO ncclCommInitRankConfig comm 0x560926c63860 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xc8133c42f0a56ca0 - Init COMPLETE +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21290 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 8 total 1.34 (kernels 0.30, alloc 0.83, bootstrap 0.05, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.03) +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21313 [0] NCCL INFO [Proxy Service] Device 0 CPU core 7 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21314 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 128 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO CC Off, workFifoBytes 1048576 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO ncclCommInitRankConfig comm 0x55ad587666b0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0xc8133c42f0a56ca0 - Init COMPLETE +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21283 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 8 total 1.37 (kernels 0.30, alloc 0.74, bootstrap 0.16, allgathers 0.01, topo 0.06, graphs 0.02, connections 0.06, rest 0.03) +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21311 [3] NCCL INFO [Proxy Service] Device 3 CPU core 20 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21312 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 24 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO ncclCommInitRankConfig comm 0x55cf280ca470 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0xc8133c42f0a56ca0 - Init COMPLETE +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21294 [3] NCCL INFO Init timings - ncclCommInitRankConfig: rank 3 nranks 8 total 1.33 (kernels 0.33, alloc 0.82, bootstrap 0.01, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.02) +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21305 [2] NCCL INFO [Proxy Service] Device 2 CPU core 130 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21306 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 25 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO ncclCommInitRankConfig comm 0x55d4efcb6c50 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0xc8133c42f0a56ca0 - Init COMPLETE +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21292 [2] NCCL INFO Init timings - ncclCommInitRankConfig: rank 2 nranks 8 total 1.33 (kernels 0.38, alloc 0.79, bootstrap 0.00, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.03) +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21317 [5] NCCL INFO [Proxy Service] Device 5 CPU core 161 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21318 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 187 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO ncclCommInitRankConfig comm 0x56476fd40fc0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0xc8133c42f0a56ca0 - Init COMPLETE +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21293 [5] NCCL INFO Init timings - ncclCommInitRankConfig: rank 5 nranks 8 total 1.33 (kernels 0.33, alloc 0.82, bootstrap 0.01, allgathers 0.00, topo 0.06, graphs 0.02, connections 0.06, rest 0.03) +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21315 [4] NCCL INFO [Proxy Service] Device 4 CPU core 58 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21316 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 62 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO ncclCommInitRankConfig comm 0x55bfb382d8f0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0xc8133c42f0a56ca0 - Init COMPLETE +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21289 [4] NCCL INFO Init timings - ncclCommInitRankConfig: rank 4 nranks 8 total 1.35 (kernels 0.29, alloc 0.80, bootstrap 0.09, allgathers 0.00, topo 0.06, graphs 0.02, connections 0.06, rest 0.03) +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21307 [6] NCCL INFO [Proxy Service] Device 6 CPU core 157 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21308 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 73 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO ncclCommInitRankConfig comm 0x5573498d3280 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0xc8133c42f0a56ca0 - Init COMPLETE +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21288 [6] NCCL INFO Init timings - ncclCommInitRankConfig: rank 6 nranks 8 total 1.35 (kernels 0.31, alloc 0.83, bootstrap 0.05, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.03) +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21303 [7] NCCL INFO [Proxy Service] Device 7 CPU core 185 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21304 [7] NCCL INFO [Proxy Service UDS] Device 7 CPU core 56 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO ncclCommInitRankConfig comm 0x564e24a3fae0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0xc8133c42f0a56ca0 - Init COMPLETE +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21291 [7] NCCL INFO Init timings - ncclCommInitRankConfig: rank 7 nranks 8 total 1.33 (kernels 0.38, alloc 0.78, bootstrap 0.00, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.02) +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via 
P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM 
+[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM 
+[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM 
+[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM 
+[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM 
+[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM 
+[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM 
+[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM 
+[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM 
+[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM 
+[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM 
+[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM 
+[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM +[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21340 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21339 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:Hyperparameters: +[default0]: adam_eps: 1e-08 +[default0]: adam_wd: 0.005 +[default0]: beta1: 0.9 +[default0]: beta2: 0.95 +[default0]: compressor: brotli +[default0]: data_dir: ./openai_parameter_golf/data +[default0]: datasets_dir: ./openai_parameter_golf/data/datasets/fineweb10B_sp8192 +[default0]: distributed: True +[default0]: ema_decay: 0.9965 +[default0]: embed_bits: 8 +[default0]: embed_clip_sigmas: 20.0 +[default0]: embed_lr: 0.6 +[default0]: embed_wd: 0.085 +[default0]: embedding_dim: 512 +[default0]: enable_looping_at: 0.35 +[default0]: eval_seq_len: 2048 +[default0]: eval_stride: 64 +[default0]: gptq_calibration_batches: 64 +[default0]: grad_accum_steps: 1 +[default0]: grad_clip_norm: 0.3 +[default0]: head_lr: 0.008 +[default0]: is_main_process: True +[default0]: iterations: 50000 +[default0]: ln_scale: True +[default0]: local_rank: 0 +[default0]: logfile: logs/20260425_050032_29e5bb74.txt +[default0]: logit_softcap: 30.0 
+[default0]: loop_end: 5 +[default0]: loop_start: 3 +[default0]: lowbit_layers: +[default0]: matrix_bits: 6 +[default0]: matrix_clip_sigmas: 12.85 +[default0]: matrix_lr: 0.022 +[default0]: max_wallclock_seconds: 600.0 +[default0]: min_lr: 0.0 +[default0]: mlp_mult: 4.0 +[default0]: model_dim: 512 +[default0]: model_path: ckpt/final_model.pt +[default0]: muon_backend_steps: 5 +[default0]: muon_beta2: 0.95 +[default0]: muon_momentum: 0.99 +[default0]: muon_momentum_warmup_fraction: 0.22 +[default0]: muon_momentum_warmup_start: 0.92 +[default0]: muon_row_normalize: True +[default0]: muon_wd: 0.095 +[default0]: muon_wd_mlp: 0.115 +[default0]: num_heads: 8 +[default0]: num_kv_heads: 4 +[default0]: num_layers: 11 +[default0]: num_loops: 2 +[default0]: parallel_residual_start: 7 +[default0]: qk_gain_init: 5.25 +[default0]: quantized_model_path: ckpt/final_model.int6.ptz +[default0]: rank: 0 +[default0]: rope_base: 10000.0 +[default0]: rope_dims: 16 +[default0]: rope_train_seq_len: 2048 +[default0]: run_id: 20260425_050032_29e5bb74 +[default0]: scalar_lr: 0.02 +[default0]: seed: 1337 +[default0]: skip_gates_enabled: True +[default0]: sliding_window_enabled: True +[default0]: tie_embeddings: True +[default0]: tied_embed_init_std: 0.005 +[default0]: tied_embed_lr: 0.03 +[default0]: tokenizer_path: ./openai_parameter_golf/data/tokenizers/fineweb_8192_bpe.model +[default0]: train_batch_tokens: 786432 +[default0]: train_files: /dev/shm/fineweb10B_sp8192/fineweb_train_*.bin +[default0]: train_log_every: 500 +[default0]: train_seq_len: 2048 +[default0]: ttt_chunk_tokens: 65536 +[default0]: ttt_enabled: True +[default0]: ttt_epochs: 4 +[default0]: ttt_lr: 0.005 +[default0]: ttt_momentum: 0.9 +[default0]: val_batch_tokens: 524288 +[default0]: val_files: /dev/shm/fineweb10B_sp8192/fineweb_val_*.bin +[default0]: val_loss_every: 4000 +[default0]: vocab_size: 8192 +[default0]: warmdown_frac: 0.72 +[default0]: warmup_steps: 20 +[default0]: world_size: 8 +[default0]: xsa_last_n: 11 
+[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21344 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21338 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21341 [5] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21342 [4] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21343 [6] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21337 [7] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:attention_backend:flash_attn_3 smoke_test:False +[default0]:train_shards: 0 +[default0]:val_tokens: 40540160 +[default0]:model_params:35946192 +[default0]:warmup_step: 1/20 +[default0]:warmup_step: 2/20 +[default0]:warmup_step: 3/20 +[default0]:warmup_step: 4/20 +[default0]:warmup_step: 5/20 +[default0]:warmup_step: 6/20 +[default0]:warmup_step: 10/20 +[default0]:warmup_step: 20/20 +[default0]:loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +[default0]:loop_warmup_step: 1/20 +[default0]:loop_warmup_step: 2/20 +[default0]:loop_warmup_step: 3/20 +[default0]:loop_warmup_step: 4/20 +[default0]:loop_warmup_step: 5/20 +[default0]:loop_warmup_step: 6/20 +[default0]:loop_warmup_step: 10/20 +[default0]:loop_warmup_step: 20/20 +[default0]:0/50000 val_loss: 9.0047 val_bpb: 3.4860 +[default0]:1/50000 train_loss: 9.0043 train_time: 0.0m tok/s: 8050663 +[default0]:2/50000 train_loss: 12.3260 train_time: 0.0m tok/s: 8122286 +[default0]:3/50000 train_loss: 10.6755 train_time: 0.0m tok/s: 8022150 +[default0]:4/50000 train_loss: 9.0360 train_time: 0.0m tok/s: 7983826 +[default0]:5/50000 train_loss: 8.1955 train_time: 0.0m tok/s: 7953807 +[default0]:500/50000 
train_loss: 3.2802 train_time: 0.8m tok/s: 7710978 +[default0]:1000/50000 train_loss: 3.1906 train_time: 1.7m tok/s: 7708098 +[default0]:1500/50000 train_loss: 3.1533 train_time: 2.6m tok/s: 7707685 +[default0]:2000/50000 train_loss: 3.0882 train_time: 3.4m tok/s: 7708077 +[default0]:layer_loop:enabled step:2059 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +[default0]:2500/50000 train_loss: 2.9970 train_time: 4.6m tok/s: 7119667 +[default0]:3000/50000 train_loss: 2.9273 train_time: 5.9m tok/s: 6720526 +[default0]:3500/50000 train_loss: 2.9221 train_time: 7.2m tok/s: 6403873 +[default0]:4000/50000 train_loss: 2.8924 train_time: 8.4m tok/s: 6232851 +[default0]:4000/50000 val_loss: 2.8621 val_bpb: 1.1080 +[default0]:4500/50000 train_loss: 2.7399 train_time: 9.7m tok/s: 6107328 +[default0]:4638/50000 val_loss: 2.7921 val_bpb: 1.0809 +[default0]:stopping_early: wallclock_cap train_time: 600134ms step: 4638/50000 +[default0]:peak memory allocated: 39074 MiB reserved: 39150 MiB +[default0]:ema:applying EMA weights +[default0]:pre-quantization post-ema val_loss:2.79126207 val_bpb:1.08058453 eval_time:8187ms +[default0]:Serialized model: 135441937 bytes +[default0]:GPTQ:collecting Hessians from calibration data... 
+[default0]:GPTQ:collected 67 Hessians in 13.1s +[default0]:Quantized weights: +[default0]: gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight +[default0]: gptq (int8): tok_emb.weight +[default0]: passthrough (float16): blocks.attn.attn_out_gate_w, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, embed_scale, skip_gates, skip_weights +[default0]:Serialized model quantized+brotli: 15979561 bytes +[default0]:quantized val_loss:2.81828849 val_bpb:1.09104730 eval_time:26449ms +[default0]:quantized_sliding_window val_loss:2.77487184 val_bpb:1.07423936 eval_time:128672ms +[default0]:ttt:start chunks=619 ttt_lr=0.005 ttt_epochs=4 +[default0]:quantized_ttt val_loss:2.77095943 val_bpb:1.07272475 eval_time:308075ms +[default0]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21238:21238 [0] NCCL INFO comm 0x55ad587666b0 rank 0 nranks 8 cudaDev 0 busId 19000 - Destroy COMPLETE +[default4]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21242:21242 [4] NCCL INFO comm 0x55bfb382d8f0 rank 4 nranks 8 cudaDev 4 busId 9b000 - Destroy COMPLETE +[default2]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21240:21240 [2] NCCL INFO comm 0x55d4efcb6c50 rank 2 nranks 8 cudaDev 2 busId 3b000 - Destroy COMPLETE +[default3]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21241:21241 [3] NCCL INFO comm 0x55cf280ca470 rank 3 nranks 8 cudaDev 3 busId 5d000 - Destroy COMPLETE +[default6]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21244:21244 [6] NCCL INFO comm 0x5573498d3280 rank 6 nranks 8 cudaDev 6 busId bb000 - Destroy COMPLETE +[default7]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21245:21245 [7] NCCL INFO comm 0x564e24a3fae0 rank 7 nranks 8 cudaDev 7 busId db000 - Destroy COMPLETE +[default5]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21243:21243 [5] NCCL INFO comm 0x56476fd40fc0 rank 5 nranks 8 cudaDev 5 busId ab000 - Destroy COMPLETE 
+[default1]:job-00710ac3-a394-4edc-9cec-10411422aba8-worker-0:21239:21239 [1] NCCL INFO comm 0x560926c63860 rank 1 nranks 8 cudaDev 1 busId 2a000 - Destroy COMPLETE + +--- EVAL_WALL 575.7s --- + +=== PACK (rc=0) === +Packed code: 19646 bytes (raw=70361 bytes) +Model blob : 15979561 bytes +Submission size: 15999207 bytes diff --git a/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/seed42.log b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/seed42.log new file mode 100644 index 0000000000..dded347df8 --- /dev/null +++ b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/seed42.log @@ -0,0 +1,1171 @@ +=== PREFLIGHT (SMOKE_TEST=1, 106s) === +W0425 05:00:06.828000 179 torch/distributed/run.py:803] +W0425 05:00:06.828000 179 torch/distributed/run.py:803] ***************************************** +W0425 05:00:06.828000 179 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0425 05:00:06.828000 179 torch/distributed/run.py:803] ***************************************** +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:247 [0] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:248 [1] NCCL INFO cudaDriverVersion 12080 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:249 [2] NCCL INFO cudaDriverVersion 12080 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:247 [0] NCCL INFO cudaDriverVersion 12080 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:248 [1] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:247 [0] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:249 [2] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:248 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:250 [3] NCCL INFO cudaDriverVersion 12080 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:247 [0] NCCL INFO Comm config Blocking set to 1 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:251 [4] NCCL INFO cudaDriverVersion 12080 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:249 [2] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:248 [1] NCCL INFO Comm config Blocking set to 1 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:250 [3] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:249 [2] NCCL INFO Comm config Blocking set to 1 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:250 [3] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:254 [7] NCCL 
INFO cudaDriverVersion 12080 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:251 [4] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:254 [7] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:251 [4] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:250 [3] NCCL INFO Comm config Blocking set to 1 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:251 [4] NCCL INFO Comm config Blocking set to 1 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:254 [7] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:253 [6] NCCL INFO cudaDriverVersion 12080 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:254 [7] NCCL INFO Comm config Blocking set to 1 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:253 [6] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:253 [6] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:253 [6] NCCL INFO Comm config Blocking set to 1 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:252 [5] NCCL INFO cudaDriverVersion 12080 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:252 [5] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:252 [5] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:252 [5] NCCL INFO Comm config Blocking set to 1 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO Failed to open libibverbs.so[.1] +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Failed to open libibverbs.so[.1] +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO Initialized NET plugin Socket +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Initialized NET plugin Socket +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO Assigned NET plugin Socket to comm +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Assigned NET plugin Socket to comm +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO Using network Socket +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Using network Socket +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO Failed to open libibverbs.so[.1] +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO Initialized NET plugin Socket +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO Assigned NET plugin Socket to comm +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO Using network Socket +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO Failed to open libibverbs.so[.1] +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO Initialized NET plugin Socket +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO Assigned NET plugin Socket to comm +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO Using network Socket +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO Failed to open libibverbs.so[.1] +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO Initialized NET plugin Socket +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO Assigned NET plugin Socket to comm +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO Using network Socket +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO Failed to open libibverbs.so[.1] +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO Initialized NET plugin Socket +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO Assigned NET plugin Socket to comm +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO Failed to open libibverbs.so[.1] +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO Using network Socket +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO Initialized NET plugin Socket +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO Assigned NET plugin Socket to comm +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO Using network Socket +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO Failed to open libibverbs.so[.1] +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO Initialized NET plugin Socket +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO Assigned NET plugin Socket to comm +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO Using network Socket +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO ncclCommInitRankConfig comm 0x55d87e31c7d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x97e9aaea7eb84a45 - Init START +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO ncclCommInitRankConfig comm 0x55fdad602400 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x97e9aaea7eb84a45 - Init START 
+[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO ncclCommInitRankConfig comm 0x557e1bc0c330 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x97e9aaea7eb84a45 - Init START +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO RAS client listening socket at ::1<28028> +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO ncclCommInitRankConfig comm 0x55c419fb4740 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x97e9aaea7eb84a45 - Init START +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO ncclCommInitRankConfig comm 0x559b6f59d9d0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0x97e9aaea7eb84a45 - Init START +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO ncclCommInitRankConfig comm 0x559359768e40 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x97e9aaea7eb84a45 - Init START +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO RAS client listening socket at ::1<28028> +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO Bootstrap timings total 0.001063 (create 0.000028, send 0.000097, recv 0.000074, ring 0.000124, delay 0.000002) +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO RAS client listening socket at ::1<28028> +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO MNNVL busId 0x3b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO Bootstrap timings total 0.083185 (create 0.000033, send 0.000100, recv 0.082594, ring 0.000118, delay 0.000002) +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO Setting affinity for GPU 2 to 0-47,96-143 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO MNNVL 
busId 0x2a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO comm 0x559359768e40 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO Setting affinity for GPU 1 to 0-47,96-143 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO Bootstrap timings total 0.018069 (create 0.000029, send 0.000101, recv 0.000082, ring 0.016506, delay 0.000002) +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO MNNVL busId 0x9b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO P2P Chunksize set to 524288 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO Setting affinity for GPU 4 to 48-95,144-191 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO RAS client listening socket at ::1<28028> +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Bootstrap timings total 0.410132 (create 0.000027, send 0.000083, recv 0.327122, ring 0.003791, delay 0.000002) +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO MNNVL busId 0x19000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO comm 0x55fdad602400 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO comm 0x557e1bc0c330 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO ncclCommInitRankConfig comm 0x5591d51e0ec0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0x97e9aaea7eb84a45 - Init START +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 
5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO comm 0x55d87e31c7d0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO ncclCommInitRankConfig comm 0x55ee4308caf0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x97e9aaea7eb84a45 - Init START +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO RAS client listening socket at ::1<28028> +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO RAS client listening socket at ::1<28028> +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO RAS client listening socket at ::1<28028> +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO Bootstrap timings total 0.082319 (create 0.000027, send 0.000097, recv 0.078542, ring 0.003267, delay 0.000002) +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO P2P Chunksize set to 524288 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO RAS client listening socket at ::1<28028> +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO MNNVL busId 0xab000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO Setting 
affinity for GPU 5 to 48-95,144-191 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO Bootstrap timings total 0.003839 (create 0.000029, send 0.000085, recv 0.000099, ring 0.003286, delay 0.000003) +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO MNNVL busId 0xbb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO Bootstrap timings total 0.087330 (create 0.000027, send 0.000083, recv 0.070347, ring 0.000512, delay 0.000002) +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO Setting affinity for GPU 6 to 48-95,144-191 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO Bootstrap timings total 0.005587 (create 0.000027, send 0.000088, recv 0.000054, ring 0.003286, delay 0.000002) +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO MNNVL busId 0xdb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO comm 0x559b6f59d9d0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO Setting affinity for GPU 7 to 48-95,144-191 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO comm 0x55ee4308caf0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO MNNVL busId 0x5d000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO comm 0x5591d51e0ec0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO Setting affinity for GPU 3 to 0-47,96-143 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 
+[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO comm 0x55c419fb4740 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO P2P Chunksize set to 524288 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:611 [2] NCCL INFO [Proxy Service] Device 2 CPU core 104 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:612 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 107 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO ncclCommInitRankConfig comm 0x559359768e40 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x97e9aaea7eb84a45 - Init COMPLETE +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:588 [2] NCCL INFO Init timings - ncclCommInitRankConfig: rank 2 nranks 8 total 1.32 (kernels 0.45, alloc 0.70, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.02, connections 0.06, rest 0.03) +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:601 [4] NCCL INFO [Proxy Service] Device 4 CPU core 57 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:602 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 59 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO ncclCommInitRankConfig comm 0x557e1bc0c330 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x97e9aaea7eb84a45 - Init COMPLETE +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:587 [4] NCCL INFO Init timings - ncclCommInitRankConfig: rank 4 nranks 8 total 1.32 (kernels 0.38, alloc 0.76, bootstrap 0.02, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.03) +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:608 [1] NCCL INFO [Proxy Service] Device 1 CPU core 97 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:610 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 98 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO ncclCommInitRankConfig comm 0x55fdad602400 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x97e9aaea7eb84a45 - Init COMPLETE +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:584 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 8 total 1.35 (kernels 0.32, alloc 0.77, bootstrap 0.08, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.02) +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:603 [6] NCCL INFO [Proxy Service] Device 6 CPU core 159 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:604 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 69 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO ncclCommInitRankConfig comm 0x5591d51e0ec0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0x97e9aaea7eb84a45 - Init COMPLETE +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:589 [6] NCCL INFO Init timings - ncclCommInitRankConfig: rank 6 nranks 8 total 1.32 (kernels 0.41, alloc 0.74, bootstrap 0.00, allgathers 0.01, topo 0.07, graphs 0.01, connections 0.05, rest 0.03) +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:606 [7] NCCL INFO [Proxy Service UDS] Device 7 CPU core 70 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:605 [7] NCCL INFO [Proxy Service] Device 7 CPU core 151 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:613 [5] NCCL INFO [Proxy Service] Device 5 CPU core 167 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:614 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 72 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO ncclCommInitRankConfig comm 0x55ee4308caf0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x97e9aaea7eb84a45 - Init COMPLETE +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:590 [7] NCCL INFO Init timings - ncclCommInitRankConfig: rank 7 nranks 8 total 1.32 (kernels 0.40, alloc 0.74, bootstrap 0.01, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.03) +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO ncclCommInitRankConfig comm 0x559b6f59d9d0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0x97e9aaea7eb84a45 - Init COMPLETE +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:599 [3] NCCL INFO [Proxy Service] Device 3 CPU core 123 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:600 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 124 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:586 [5] NCCL INFO Init timings - ncclCommInitRankConfig: rank 5 nranks 8 total 1.33 (kernels 0.30, alloc 0.78, bootstrap 0.08, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.02) +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:607 [0] NCCL INFO [Proxy Service] Device 0 CPU core 5 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:609 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 103 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO ncclCommInitRankConfig comm 0x55c419fb4740 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x97e9aaea7eb84a45 - Init COMPLETE +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:585 [3] NCCL INFO Init timings - ncclCommInitRankConfig: rank 3 nranks 8 total 1.34 (kernels 0.30, alloc 0.78, bootstrap 0.09, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.02) +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO CC Off, workFifoBytes 1048576 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO ncclCommInitRankConfig comm 0x55d87e31c7d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x97e9aaea7eb84a45 - Init COMPLETE +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:583 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 8 total 1.37 (kernels 0.29, alloc 0.50, bootstrap 0.41, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.02) +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM 
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 03/0 : 4[4] -> 
5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO 
Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM 
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 02/0 : 6[6] -> 
7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO 
Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM 
+[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 06/0 : 3[3] -> 
4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO 
Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM 
+[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 22/0 : 7[7] -> 
0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO 
Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM
+[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:620 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
+[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:622 [4] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:616 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
+[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:615 [6] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
+[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:621 [7] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
+[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:619 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:617 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:618 [5] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
+[default0]:Hyperparameters:
+[default0]: adam_eps: 1e-08
+[default0]: adam_wd: 0.005
+[default0]: beta1: 0.9
+[default0]: beta2: 0.95
+[default0]: compressor: brotli
+[default0]: data_dir: ./openai_parameter_golf/data
+[default0]: datasets_dir: ./openai_parameter_golf/data/datasets/fineweb10B_sp8192
+[default0]: distributed: True
+[default0]: ema_decay: 0.9965
+[default0]: embed_bits: 8
+[default0]: embed_clip_sigmas: 20.0
+[default0]: embed_lr: 0.6
+[default0]: embed_wd: 0.085
+[default0]: embedding_dim: 512
+[default0]: enable_looping_at: 0.35
+[default0]: eval_seq_len: 512
+[default0]: eval_stride: 64
+[default0]: gptq_calibration_batches: 2
+[default0]: grad_accum_steps: 1
+[default0]: grad_clip_norm: 0.3
+[default0]: head_lr: 0.008
+[default0]: is_main_process: True
+[default0]: iterations: 30
+[default0]: ln_scale: True
+[default0]: local_rank: 0
+[default0]: logfile: logs/20260425_050011_900074ba.txt
+[default0]: logit_softcap: 30.0
+[default0]: loop_end: 5
+[default0]: loop_start: 3
+[default0]: lowbit_layers:
+[default0]: matrix_bits: 6
+[default0]: matrix_clip_sigmas: 12.85
+[default0]: matrix_lr: 0.022
+[default0]: max_wallclock_seconds: 120.0
+[default0]: min_lr: 0.0
+[default0]: mlp_mult: 4.0
+[default0]: model_dim: 512
+[default0]: model_path: ckpt/final_model.pt
+[default0]: muon_backend_steps: 5
+[default0]: muon_beta2: 0.95
+[default0]: muon_momentum: 0.99
+[default0]: muon_momentum_warmup_fraction: 0.22
+[default0]: muon_momentum_warmup_start: 0.92
+[default0]: muon_row_normalize: True
+[default0]: muon_wd: 0.095
+[default0]: muon_wd_mlp: 0.115
+[default0]: num_heads: 8
+[default0]: num_kv_heads: 4
+[default0]: num_layers: 11
+[default0]: num_loops: 2
+[default0]: parallel_residual_start: 7
+[default0]: qk_gain_init: 5.25
+[default0]: quantized_model_path: ckpt/final_model.int6.ptz
+[default0]: rank: 0
+[default0]: rope_base: 10000.0
+[default0]: rope_dims: 16
+[default0]: rope_train_seq_len: 2048
+[default0]: run_id: 20260425_050011_900074ba
+[default0]: scalar_lr: 0.02
+[default0]: seed: 1337
+[default0]: skip_gates_enabled: True
+[default0]: sliding_window_enabled: True
+[default0]: tie_embeddings: True
+[default0]: tied_embed_init_std: 0.005
+[default0]: tied_embed_lr: 0.03
+[default0]: tokenizer_path: ./openai_parameter_golf/data/tokenizers/fineweb_8192_bpe.model
+[default0]: train_batch_tokens: 32768
+[default0]: train_files: /dev/shm/fineweb10B_sp8192/fineweb_train_*.bin
+[default0]: train_log_every: 5
+[default0]: train_seq_len: 512
+[default0]: ttt_chunk_tokens: 65536
+[default0]: ttt_enabled: False
+[default0]: ttt_epochs: 4
+[default0]: ttt_lr: 0.005
+[default0]: ttt_momentum: 0.9
+[default0]: val_batch_tokens: 2097152
+[default0]: val_files: /dev/shm/fineweb10B_sp8192/fineweb_val_*.bin
+[default0]: val_loss_every: 0
+[default0]: vocab_size: 8192
+[default0]: warmdown_frac: 0.2
+[default0]: warmup_steps: 0
+[default0]: world_size: 8
+[default0]: xsa_last_n: 11
+[default0]:[SMOKE_TEST] attention_backend=sdpa_fallback FA3=False smoke_test=True
+[default0]:[SMOKE_TEST] val_bpb from this run is NOT comparable to proxy/full runs
+[default0]:attention_backend:sdpa_fallback(smoke) smoke_test:True
+[default0]:train_shards: 0
+[default0]:val_tokens: 40540672
+[default0]:smoke_test: torch.compile disabled (eager mode)
+[default0]:model_params:35946192
+[default0]:1/30 train_loss: 9.0074 train_time: 0.1m tok/s: 6679
+[default0]:2/30 train_loss: 13.8612 train_time: 0.1m tok/s: 12860
+[default0]:3/30 train_loss: 13.9006 train_time: 0.1m tok/s: 19038
+[default0]:4/30 train_loss: 12.3548 train_time: 0.1m tok/s: 25058
+[default0]:5/30 train_loss: 11.7226 train_time: 0.1m tok/s: 30929
+[default0]:10/30 train_loss: 7.8868 train_time: 0.1m tok/s: 58459
+[default0]:15/30 train_loss: 7.0281 train_time: 0.1m tok/s: 83290
+[default0]:20/30 train_loss: 6.7340 train_time: 0.1m tok/s: 105797
+[default0]:25/30 train_loss: 6.2567 train_time: 0.1m tok/s: 126149
+[default0]:30/30 train_loss: 6.2213 train_time: 0.1m tok/s: 144752
+[default0]:peak memory allocated: 2595 MiB reserved: 2840 MiB
+[default0]:ema:applying EMA weights
+[default0]:smoke_test: training complete — running GPTQ+brotli pack for size check
+[default0]:Serialized model: 135441937 bytes
+[default0]:GPTQ:collecting Hessians from calibration data...
+[default0]:GPTQ:collected 67 Hessians in 0.4s
+[default0]:Quantized weights:
+[default0]: gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+[default0]: gptq (int8): tok_emb.weight
+[default0]: passthrough (float16): blocks.attn.attn_out_gate_w, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, embed_scale, skip_gates, skip_weights
+[default0]:Serialized model quantized+brotli: 15994502 bytes
+[default0]:smoke_pack_bytes: code=19646 model=15994502 total=16014148
+[default0]:smoke_test:complete (code ran successfully; val_bpb not computed in smoke mode)
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:252:252 [5] NCCL INFO comm 0x559b6f59d9d0 rank 5 nranks 8 cudaDev 5 busId ab000 - Destroy COMPLETE
+[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:251:251 [4] NCCL INFO comm 0x557e1bc0c330 rank 4 nranks 8 cudaDev 4 busId 9b000 - Destroy COMPLETE
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:247:247 [0] NCCL INFO comm 0x55d87e31c7d0 rank 0 nranks 8 cudaDev 0 busId 19000 - Destroy COMPLETE
+[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:253:253 [6] NCCL INFO comm 0x5591d51e0ec0 rank 6 nranks 8 cudaDev 6 busId bb000 - Destroy COMPLETE
+[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:250:250 [3] NCCL INFO comm 0x55c419fb4740 rank 3 nranks 8 cudaDev 3 busId 5d000 - Destroy COMPLETE
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:248:248 [1] NCCL INFO comm 0x55fdad602400 rank 1 nranks 8 cudaDev 1 busId 2a000 - Destroy COMPLETE
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:249:249 [2] NCCL INFO comm 0x559359768e40 rank 2 nranks 8 cudaDev 2 busId 3b000 - Destroy COMPLETE
+[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:254:254 [7] NCCL INFO comm 0x55ee4308caf0 rank 7 nranks 8 cudaDev 7 busId db000 - Destroy COMPLETE
+
+=== TRAIN (rc=0, 1406s) ===
+W0425 05:01:53.339000 25565 torch/distributed/run.py:803]
+W0425 05:01:53.339000 25565 torch/distributed/run.py:803] *****************************************
+W0425 05:01:53.339000 25565 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+W0425 05:01:53.339000 25565 torch/distributed/run.py:803] *****************************************
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25634 [0] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0>
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25634 [0] NCCL INFO cudaDriverVersion 12080
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25634 [0] NCCL INFO NCCL version 2.27.5+cuda12.9
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25634 [0] NCCL INFO Comm config Blocking set to 1
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25636 [2] NCCL INFO cudaDriverVersion 12080
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25636 [2] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0>
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25636 [2] NCCL INFO NCCL version 2.27.5+cuda12.9
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25636 [2] NCCL INFO Comm config Blocking set to 1
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25635 [1] NCCL INFO cudaDriverVersion 12080
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25635 [1] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0>
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25635 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25635 [1] NCCL INFO Comm config Blocking set to 1 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25637 [3] NCCL INFO cudaDriverVersion 12080 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25637 [3] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25637 [3] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25637 [3] NCCL INFO Comm config Blocking set to 1 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25638 [4] NCCL INFO cudaDriverVersion 12080 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25638 [4] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25638 [4] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25638 [4] NCCL INFO Comm config Blocking set to 1 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25639 [5] NCCL INFO cudaDriverVersion 12080 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25639 [5] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25639 [5] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25639 [5] NCCL INFO Comm config Blocking set to 1 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25641 [7] NCCL INFO cudaDriverVersion 12080 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25641 [7] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25641 [7] NCCL INFO NCCL version 
2.27.5+cuda12.9 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25641 [7] NCCL INFO Comm config Blocking set to 1 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25640 [6] NCCL INFO cudaDriverVersion 12080 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25640 [6] NCCL INFO Bootstrap: Using eth0:10.245.49.87<0> +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25640 [6] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25640 [6] NCCL INFO Comm config Blocking set to 1 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Failed to open libibverbs.so[.1] +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Initialized NET plugin Socket +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Assigned NET plugin Socket to comm +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Using network Socket +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO Failed to open libibverbs.so[.1] +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO Initialized NET plugin Socket +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO Assigned NET plugin Socket to comm +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO Using network Socket +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO Failed to open libibverbs.so[.1] +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO Initialized NET plugin Socket +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO Assigned NET plugin Socket to comm +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO Using network Socket +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO Failed to open libibverbs.so[.1] +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO Initialized NET plugin Socket +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO Assigned NET plugin Socket to comm +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO Using network Socket +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO Failed to open libibverbs.so[.1] +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO Initialized NET plugin Socket +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO Assigned NET plugin Socket to comm +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO Using network Socket +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO Failed to open libibverbs.so[.1] +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO Initialized NET plugin Socket +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO Assigned NET plugin Socket to comm +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO Using network Socket +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO Failed to open libibverbs.so[.1] +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO Initialized NET plugin Socket +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO Assigned NET plugin Socket to comm +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO Using network Socket +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO Failed to open libibverbs.so[.1] +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO NET/Socket : Using [0]eth0:10.245.49.87<0> +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO Initialized NET plugin Socket +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO Assigned NET plugin Socket to comm +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO Using network Socket +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO ncclCommInitRankConfig comm 0x55807ee34c30 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x787b089c2aa96b39 - Init START +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO ncclCommInitRankConfig comm 0x55922b3b8dd0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x787b089c2aa96b39 - Init START +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO ncclCommInitRankConfig comm 0x56244124b600 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0x787b089c2aa96b39 - Init START +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO RAS client listening socket at ::1<28028> +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO ncclCommInitRankConfig comm 0x56103f09fe30 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x787b089c2aa96b39 - Init START +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO RAS client listening socket at ::1<28028> +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Bootstrap timings total 0.234031 (create 0.000038, send 0.000094, recv 0.013459, ring 0.083449, delay 0.000003) 
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO MNNVL busId 0x19000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO comm 0x55807ee34c30 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO 
Channel 11/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 
[22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO P2P Chunksize set to 524288 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO ncclCommInitRankConfig comm 0x55a87ca58580 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x787b089c2aa96b39 - Init START +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO RAS client listening socket at ::1<28028> +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO RAS client listening socket at ::1<28028> +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO Bootstrap timings total 0.220688 (create 0.000026, send 0.000089, recv 0.220071, ring 0.000119, delay 0.000002) +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO MNNVL busId 0x2a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO Setting affinity for GPU 1 to 0-47,96-143 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO Bootstrap timings total 0.000855 (create 0.000028, send 0.000102, recv 0.000066, ring 0.000170, delay 0.000003) +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO MNNVL busId 0x3b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO Setting affinity for GPU 2 to 0-47,96-143 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO comm 0x55922b3b8dd0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO P2P Chunksize set to 524288 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO comm 0x55a87ca58580 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO P2P Chunksize set to 524288 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO ncclCommInitRankConfig comm 0x55fbf5001600 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x787b089c2aa96b39 - Init START 
+[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO RAS client listening socket at ::1<28028> +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO Bootstrap timings total 0.017789 (create 0.000028, send 0.000093, recv 0.005429, ring 0.000296, delay 0.000002) +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO MNNVL busId 0x5d000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO Setting affinity for GPU 3 to 0-47,96-143 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO comm 0x55fbf5001600 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO P2P Chunksize set to 524288 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO ncclCommInitRankConfig comm 0x55d8264f2180 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0x787b089c2aa96b39 - Init START +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO RAS client listening socket at ::1<28028> +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] 
NCCL INFO Bootstrap timings total 0.002758 (create 0.000029, send 0.000092, recv 0.000087, ring 0.002221, delay 0.000002) +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO MNNVL busId 0xab000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO Setting affinity for GPU 5 to 48-95,144-191 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO comm 0x55d8264f2180 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO P2P Chunksize set to 524288 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO ncclCommInitRankConfig comm 0x5617fbee6670 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x787b089c2aa96b39 - Init START +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO RAS client listening socket at ::1<28028> +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO Bootstrap timings total 0.012402 (create 0.000028, send 0.000093, recv 0.009701, ring 0.002216, delay 0.000002) +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL 
INFO MNNVL busId 0x9b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO Setting affinity for GPU 4 to 48-95,144-191 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO comm 0x5617fbee6670 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO P2P Chunksize set to 524288 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO Bootstrap timings total 0.084086 (create 0.000030, send 0.000087, recv 0.000136, ring 0.083448, delay 0.000002) +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO MNNVL busId 0xdb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO Setting affinity for GPU 7 to 48-95,144-191 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO comm 0x56103f09fe30 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO P2P Chunksize set to 524288 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO RAS client listening socket at ::1<28028> +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO Bootstrap timings total 0.200206 (create 0.000027, send 0.000092, recv 0.116231, ring 0.000338, delay 0.000002) +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO MNNVL busId 0xbb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO Setting affinity for GPU 6 to 48-95,144-191 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO comm 0x56244124b600 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26020 [0] NCCL INFO [Proxy Service] Device 0 CPU core 113 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26021 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 20 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO CC Off, workFifoBytes 1048576 +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO ncclCommInitRankConfig comm 0x55807ee34c30 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x787b089c2aa96b39 - Init COMPLETE +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25992 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 8 total 1.37 (kernels 0.30, alloc 0.67, bootstrap 0.23, allgathers 0.00, topo 0.06, graphs 0.02, connections 0.06, rest 0.03) +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26010 [1] NCCL INFO [Proxy Service] Device 1 CPU core 14 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26011 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 16 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO ncclCommInitRankConfig comm 0x55922b3b8dd0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x787b089c2aa96b39 - Init COMPLETE +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25995 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 8 total 1.33 (kernels 0.31, alloc 0.64, bootstrap 0.22, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.07, rest 0.02) +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26012 [2] NCCL INFO [Proxy Service] Device 2 CPU core 99 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26013 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 18 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO ncclCommInitRankConfig comm 0x55a87ca58580 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3b000 commId 0x787b089c2aa96b39 - Init COMPLETE +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25997 [2] NCCL INFO Init timings - ncclCommInitRankConfig: rank 2 nranks 8 total 1.31 (kernels 0.48, alloc 0.67, bootstrap 0.00, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.02) +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26022 [3] NCCL INFO [Proxy Service] Device 3 CPU core 24 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26023 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 122 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO ncclCommInitRankConfig comm 0x55fbf5001600 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x787b089c2aa96b39 - Init COMPLETE +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25999 [3] NCCL INFO Init timings - ncclCommInitRankConfig: rank 3 nranks 8 total 1.31 (kernels 0.43, alloc 0.70, bootstrap 0.02, allgathers 0.01, topo 0.06, graphs 0.02, connections 0.06, rest 0.03) +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26018 [5] NCCL INFO [Proxy Service] Device 5 CPU core 77 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26019 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 78 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO ncclCommInitRankConfig comm 0x55d8264f2180 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0x787b089c2aa96b39 - Init COMPLETE +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25998 [5] NCCL INFO Init timings - ncclCommInitRankConfig: rank 5 nranks 8 total 1.31 (kernels 0.47, alloc 0.68, bootstrap 0.00, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.03) +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26016 [4] NCCL INFO [Proxy Service] Device 4 CPU core 72 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26017 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 172 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO ncclCommInitRankConfig comm 0x5617fbee6670 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x787b089c2aa96b39 - Init COMPLETE +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25996 [4] NCCL INFO Init timings - ncclCommInitRankConfig: rank 4 nranks 8 total 1.32 (kernels 0.42, alloc 0.72, bootstrap 0.01, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.03) +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26008 [7] NCCL INFO [Proxy Service] Device 7 CPU core 60 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26009 [7] NCCL INFO [Proxy Service UDS] Device 7 CPU core 63 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO ncclCommInitRankConfig comm 0x56103f09fe30 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x787b089c2aa96b39 - Init COMPLETE +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25994 [7] NCCL INFO Init timings - ncclCommInitRankConfig: rank 7 nranks 8 total 1.34 (kernels 0.34, alloc 0.75, bootstrap 0.08, allgathers 0.01, topo 0.06, graphs 0.01, connections 0.06, rest 0.03) +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26014 [6] NCCL INFO [Proxy Service] Device 6 CPU core 70 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26015 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 71 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO ncclCommInitRankConfig comm 0x56244124b600 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bb000 commId 0x787b089c2aa96b39 - Init COMPLETE +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25993 [6] NCCL INFO Init timings - ncclCommInitRankConfig: rank 6 nranks 8 total 1.35 (kernels 0.29, alloc 0.69, bootstrap 0.20, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via 
P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM 
+[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM 
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM 
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM 
+[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM 
+[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM 
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM 
+[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM 
+[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM 
+[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM 
+[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM 
+[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM 
+[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:26030 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:Hyperparameters: +[default0]: adam_eps: 1e-08 +[default0]: adam_wd: 0.005 +[default0]: beta1: 0.9 +[default0]: beta2: 0.95 +[default0]: compressor: brotli +[default0]: data_dir: ./openai_parameter_golf/data +[default0]: datasets_dir: ./openai_parameter_golf/data/datasets/fineweb10B_sp8192 +[default0]: distributed: True +[default0]: ema_decay: 0.9965 +[default0]: embed_bits: 8 +[default0]: embed_clip_sigmas: 20.0 +[default0]: embed_lr: 0.6 +[default0]: embed_wd: 0.085 +[default0]: embedding_dim: 512 +[default0]: enable_looping_at: 0.35 +[default0]: eval_seq_len: 2048 +[default0]: eval_stride: 64 +[default0]: gptq_calibration_batches: 64 +[default0]: grad_accum_steps: 1 +[default0]: grad_clip_norm: 0.3 +[default0]: head_lr: 0.008 +[default0]: is_main_process: True +[default0]: iterations: 50000 +[default0]: ln_scale: True +[default0]: local_rank: 0 +[default0]: logfile: logs/20260425_050157_285079c5.txt +[default0]: logit_softcap: 30.0 +[default0]: loop_end: 5 +[default0]: loop_start: 3 +[default0]: lowbit_layers: +[default0]: matrix_bits: 6 +[default0]: 
matrix_clip_sigmas: 12.85 +[default0]: matrix_lr: 0.022 +[default0]: max_wallclock_seconds: 600.0 +[default0]: min_lr: 0.0 +[default0]: mlp_mult: 4.0 +[default0]: model_dim: 512 +[default0]: model_path: ckpt/final_model.pt +[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:26029 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]: muon_backend_steps: 5 +[default0]: muon_beta2: 0.95 +[default0]: muon_momentum: 0.99 +[default0]: muon_momentum_warmup_fraction: 0.22 +[default0]: muon_momentum_warmup_start: 0.92 +[default0]: muon_row_normalize: True +[default0]: muon_wd: 0.095 +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:26031 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]: muon_wd_mlp: 0.115 +[default0]: num_heads: 8 +[default0]: num_kv_heads: 4 +[default0]: num_layers: 11 +[default0]: num_loops: 2 +[default0]: parallel_residual_start: 7 +[default0]: qk_gain_init: 5.25 +[default0]: quantized_model_path: ckpt/final_model.int6.ptz +[default0]: rank: 0 +[default0]: rope_base: 10000.0 +[default0]: rope_dims: 16 +[default0]: rope_train_seq_len: 2048 +[default0]: run_id: 20260425_050157_285079c5 +[default0]: scalar_lr: 0.02 +[default0]: seed: 1337 +[default0]: skip_gates_enabled: True +[default0]: sliding_window_enabled: True +[default0]: tie_embeddings: True +[default0]: tied_embed_init_std: 0.005 +[default0]: tied_embed_lr: 0.03 +[default0]: tokenizer_path: ./openai_parameter_golf/data/tokenizers/fineweb_8192_bpe.model +[default0]: train_batch_tokens: 786432 +[default0]: train_files: /dev/shm/fineweb10B_sp8192/fineweb_train_*.bin +[default0]: train_log_every: 500 +[default0]: train_seq_len: 2048 +[default0]: ttt_chunk_tokens: 65536 +[default0]: ttt_enabled: True +[default0]: ttt_epochs: 4 +[default0]: ttt_lr: 0.005 +[default0]: ttt_momentum: 0.9 +[default0]: val_batch_tokens: 524288 +[default0]: val_files: /dev/shm/fineweb10B_sp8192/fineweb_val_*.bin +[default0]: val_loss_every: 4000 +[default0]: 
vocab_size: 8192 +[default0]: warmdown_frac: 0.72 +[default0]: warmup_steps: 20 +[default0]: world_size: 8 +[default0]: xsa_last_n: 11 +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:26027 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:26026 [5] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:26024 [4] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:26025 [7] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:26028 [6] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:attention_backend:flash_attn_3 smoke_test:False +[default0]:train_shards: 0 +[default0]:val_tokens: 40540160 +[default0]:model_params:35946192 +[default0]:warmup_step: 1/20 +[default0]:warmup_step: 2/20 +[default0]:warmup_step: 3/20 +[default0]:warmup_step: 4/20 +[default0]:warmup_step: 5/20 +[default0]:warmup_step: 6/20 +[default0]:warmup_step: 10/20 +[default0]:warmup_step: 20/20 +[default0]:loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +[default0]:loop_warmup_step: 1/20 +[default0]:loop_warmup_step: 2/20 +[default0]:loop_warmup_step: 3/20 +[default0]:loop_warmup_step: 4/20 +[default0]:loop_warmup_step: 5/20 +[default0]:loop_warmup_step: 6/20 +[default0]:loop_warmup_step: 10/20 +[default0]:loop_warmup_step: 20/20 +[default0]:0/50000 val_loss: 9.0047 val_bpb: 3.4860 +[default0]:1/50000 train_loss: 9.0043 train_time: 0.0m tok/s: 7999728 +[default0]:2/50000 train_loss: 12.3260 train_time: 0.0m tok/s: 8090827 +[default0]:3/50000 train_loss: 10.6755 train_time: 0.0m tok/s: 8004781 +[default0]:4/50000 train_loss: 9.0361 train_time: 0.0m tok/s: 7977334 +[default0]:5/50000 train_loss: 8.1955 train_time: 0.0m tok/s: 7952461 +[default0]:500/50000 
train_loss: 3.2776 train_time: 0.8m tok/s: 7727432 +[default0]:1000/50000 train_loss: 3.1935 train_time: 1.7m tok/s: 7726214 +[default0]:1500/50000 train_loss: 3.1534 train_time: 2.5m tok/s: 7726471 +[default0]:2000/50000 train_loss: 3.0862 train_time: 3.4m tok/s: 7727478 +[default0]:layer_loop:enabled step:2064 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +[default0]:2500/50000 train_loss: 2.9973 train_time: 4.6m tok/s: 7132111 +[default0]:3000/50000 train_loss: 2.9245 train_time: 5.8m tok/s: 6724688 +[default0]:3500/50000 train_loss: 2.9283 train_time: 7.1m tok/s: 6433274 +[default0]:4000/50000 train_loss: 2.8950 train_time: 8.4m tok/s: 6246522 +[default0]:4000/50000 val_loss: 2.8634 val_bpb: 1.1085 +[default0]:4500/50000 train_loss: 2.7424 train_time: 9.6m tok/s: 6122432 +[default0]:4648/50000 val_loss: 2.7927 val_bpb: 1.0811 +[default0]:stopping_early: wallclock_cap train_time: 600100ms step: 4648/50000 +[default0]:peak memory allocated: 39077 MiB reserved: 39148 MiB +[default0]:ema:applying EMA weights +[default0]:pre-quantization post-ema val_loss:2.79198048 val_bpb:1.08086265 eval_time:7982ms +[default0]:Serialized model: 135441937 bytes +[default0]:GPTQ:collecting Hessians from calibration data... 
+[default0]:GPTQ:collected 67 Hessians in 13.2s +[default0]:Quantized weights: +[default0]: gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight +[default0]: gptq (int8): tok_emb.weight +[default0]: passthrough (float16): blocks.attn.attn_out_gate_w, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, embed_scale, skip_gates, skip_weights +[default0]:Serialized model quantized+brotli: 15975752 bytes +[default0]:quantized val_loss:2.81897881 val_bpb:1.09131455 eval_time:27215ms +[default0]:quantized_sliding_window val_loss:2.77566213 val_bpb:1.07454531 eval_time:128942ms +[default0]:ttt:start chunks=619 ttt_lr=0.005 ttt_epochs=4 +[default0]:quantized_ttt val_loss:2.77174800 val_bpb:1.07303003 eval_time:310499ms +[default0]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25634:25634 [0] NCCL INFO comm 0x55807ee34c30 rank 0 nranks 8 cudaDev 0 busId 19000 - Destroy COMPLETE +[default5]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25639:25639 [5] NCCL INFO comm 0x55d8264f2180 rank 5 nranks 8 cudaDev 5 busId ab000 - Destroy COMPLETE +[default2]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25636:25636 [2] NCCL INFO comm 0x55a87ca58580 rank 2 nranks 8 cudaDev 2 busId 3b000 - Destroy COMPLETE +[default4]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25638:25638 [4] NCCL INFO comm 0x5617fbee6670 rank 4 nranks 8 cudaDev 4 busId 9b000 - Destroy COMPLETE +[default7]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25641:25641 [7] NCCL INFO comm 0x56103f09fe30 rank 7 nranks 8 cudaDev 7 busId db000 - Destroy COMPLETE +[default6]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25640:25640 [6] NCCL INFO comm 0x56244124b600 rank 6 nranks 8 cudaDev 6 busId bb000 - Destroy COMPLETE +[default3]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25637:25637 [3] NCCL INFO comm 0x55fbf5001600 rank 3 nranks 8 cudaDev 3 busId 5d000 - Destroy COMPLETE 
+[default1]:job-489aef71-de3e-4d77-99a7-8dd9cba434ee-worker-0:25635:25635 [1] NCCL INFO comm 0x55922b3b8dd0 rank 1 nranks 8 cudaDev 1 busId 2a000 - Destroy COMPLETE + +--- EVAL_WALL 577.5s --- + +=== PACK (rc=0) === +Packed code: 19646 bytes (raw=70361 bytes) +Model blob : 15975752 bytes +Submission size: 15995398 bytes diff --git a/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/seed999.log b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/seed999.log new file mode 100644 index 0000000000..f4acea45d9 --- /dev/null +++ b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/seed999.log @@ -0,0 +1,1171 @@ +=== PREFLIGHT (SMOKE_TEST=1, 108s) === +W0425 04:57:18.799000 179 torch/distributed/run.py:803] +W0425 04:57:18.799000 179 torch/distributed/run.py:803] ***************************************** +W0425 04:57:18.799000 179 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0425 04:57:18.799000 179 torch/distributed/run.py:803] ***************************************** +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:247 [0] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:247 [0] NCCL INFO cudaDriverVersion 12080 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:247 [0] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:247 [0] NCCL INFO Comm config Blocking set to 1 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:251 [4] NCCL INFO cudaDriverVersion 12080 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:253 [6] NCCL INFO cudaDriverVersion 12080 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:251 [4] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:253 [6] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:251 [4] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:253 [6] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:251 [4] NCCL INFO Comm config Blocking set to 1 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:253 [6] NCCL INFO Comm config Blocking set to 1 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:248 [1] NCCL INFO cudaDriverVersion 12080 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:248 [1] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:248 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:248 [1] NCCL INFO Comm config Blocking set to 1 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:249 
[2] NCCL INFO cudaDriverVersion 12080 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:249 [2] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:249 [2] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:249 [2] NCCL INFO Comm config Blocking set to 1 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:250 [3] NCCL INFO cudaDriverVersion 12080 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:250 [3] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:250 [3] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:250 [3] NCCL INFO Comm config Blocking set to 1 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:252 [5] NCCL INFO cudaDriverVersion 12080 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:252 [5] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:252 [5] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:252 [5] NCCL INFO Comm config Blocking set to 1 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:254 [7] NCCL INFO cudaDriverVersion 12080 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:254 [7] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:254 [7] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:254 [7] NCCL INFO Comm config Blocking set to 1 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Failed to open libibverbs.so[.1] +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Initialized NET plugin Socket +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Assigned NET plugin Socket to comm +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Using network Socket +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO Failed to open libibverbs.so[.1] +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO Initialized NET plugin Socket +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO Assigned NET plugin Socket to comm +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO Using network Socket +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO Failed to open libibverbs.so[.1] +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO Initialized NET plugin Socket +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO Assigned NET plugin Socket to comm +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO Using network Socket +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO Failed to open libibverbs.so[.1] +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO Initialized NET plugin Socket +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO Assigned NET plugin Socket to comm +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO Using network Socket +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO Failed to open libibverbs.so[.1] +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO Failed to open libibverbs.so[.1] +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO Initialized NET plugin Socket +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO Initialized NET plugin Socket +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO Assigned NET plugin Socket to comm +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO Assigned NET plugin Socket to comm +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO Using network Socket +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO Using network Socket +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO ncclCommInitRankConfig comm 0x55c5b577e3d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x415c0bbf2d093211 - Init START +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO Failed to open libibverbs.so[.1] +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO Initialized NET plugin Socket +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO Assigned NET plugin Socket to comm +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO Using network Socket +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO Failed to open libibverbs.so[.1] +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO Initialized NET plugin Socket +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO Assigned NET plugin Socket to comm +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO Using network Socket +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO ncclCommInitRankConfig comm 0x5630298ab9f0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId cb000 commId 0x415c0bbf2d093211 - Init START +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO ncclCommInitRankConfig comm 0x5564b4fce0b0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 4c000 commId 0x415c0bbf2d093211 - Init START +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO ncclCommInitRankConfig comm 0x563c6b199700 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x415c0bbf2d093211 - Init START 
+[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO RAS client listening socket at ::1<28028> +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO ncclCommInitRankConfig comm 0x5616fb60cd40 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x415c0bbf2d093211 - Init START +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO RAS client listening socket at ::1<28028> +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO ncclCommInitRankConfig comm 0x55bd61f08390 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x415c0bbf2d093211 - Init START +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO ncclCommInitRankConfig comm 0x56133127eba0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 3b000 commId 0x415c0bbf2d093211 - Init START +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO RAS client listening socket at ::1<28028> +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO Bootstrap timings total 0.001967 (create 0.000027, send 0.000087, recv 0.000131, ring 0.001408, delay 0.000002) +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO MNNVL busId 0x3b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO RAS client listening socket at ::1<28028> +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO Bootstrap timings total 0.063517 (create 0.000028, send 0.000096, recv 0.000032, ring 0.001375, delay 0.000002) +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO MNNVL busId 0x4c000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO Bootstrap timings total 0.104352 (create 0.000030, send 0.000090, 
recv 0.005648, ring 0.060626, delay 0.000002) +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO MNNVL busId 0x5d000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO ncclCommInitRankConfig comm 0x5571e41170b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId bb000 commId 0x415c0bbf2d093211 - Init START +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO RAS client listening socket at ::1<28028> +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO Bootstrap timings total 0.033440 (create 0.000028, send 0.000098, recv 0.000072, ring 0.032893, delay 0.000002) +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO MNNVL busId 0xbb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO Bootstrap timings total 0.072316 (create 0.000028, send 0.000091, recv 0.000107, ring 0.071688, delay 0.000002) +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO MNNVL busId 0xdb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO RAS client listening socket at ::1<28028> +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Bootstrap timings total 0.660400 (create 0.000030, send 0.000090, recv 0.658532, ring 0.000130, delay 0.000002) +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO MNNVL busId 0x19000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO RAS client listening socket at ::1<28028> +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO Bootstrap timings total 0.098869 (create 0.000037, send 0.000103, 
recv 0.065459, ring 0.032897, delay 0.000002) +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO MNNVL busId 0x9b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO RAS client listening socket at ::1<28028> +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO Bootstrap timings total 0.170724 (create 0.000032, send 0.000086, recv 0.098485, ring 0.032875, delay 0.000002) +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO MNNVL busId 0xcb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO Setting affinity for GPU 1 to 0-47,96-143 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO comm 0x56133127eba0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO P2P Chunksize set to 524288 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:599 [1] NCCL INFO [Proxy Service] Device 1 CPU core 38 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:600 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 137 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO Setting affinity for GPU 2 to 0-47,96-143 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO comm 0x5564b4fce0b0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO P2P Chunksize set to 524288 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:609 [2] NCCL INFO [Proxy Service] Device 2 CPU core 98 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:610 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 109 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO Setting affinity for GPU 3 to 0-47,96-143 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO comm 0x563c6b199700 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO P2P Chunksize set to 524288 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:602 [3] NCCL INFO [Proxy Service] Device 3 CPU core 103 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:604 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 104 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO Setting affinity for GPU 5 to 48-95,144-191 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO comm 0x5571e41170b0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO P2P Chunksize set to 524288 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:606 [5] NCCL INFO [Proxy Service] Device 5 CPU core 158 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:608 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 159 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO Setting affinity for GPU 7 to 48-95,144-191 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO comm 0x5616fb60cd40 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO P2P Chunksize set to 524288 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:605 [7] NCCL INFO [Proxy Service] Device 7 CPU core 66 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:607 [7] NCCL INFO [Proxy Service UDS] Device 7 CPU core 67 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO comm 0x55c5b577e3d0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 15/24 
: 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO Setting affinity for GPU 4 to 48-95,144-191 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:613 [0] NCCL INFO [Proxy Service] Device 0 CPU core 15 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:614 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 20 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO comm 0x55bd61f08390 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO P2P Chunksize set to 524288 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:611 [4] NCCL INFO [Proxy Service] Device 4 CPU core 84 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:612 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 181 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO Setting affinity for GPU 6 to 48-95,144-191 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO comm 0x5630298ab9f0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO P2P Chunksize set to 524288 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:601 [6] NCCL INFO [Proxy Service] Device 6 CPU core 146 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:603 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 147 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO ncclCommInitRankConfig comm 0x56133127eba0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 3b000 commId 0x415c0bbf2d093211 - Init COMPLETE +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:590 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 8 total 1.24 (kernels 0.64, alloc 0.45, bootstrap 0.00, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.03) +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM 
+[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO ncclCommInitRankConfig comm 0x5564b4fce0b0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 4c000 commId 0x415c0bbf2d093211 - Init COMPLETE +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:582 [2] NCCL INFO Init timings - ncclCommInitRankConfig: rank 2 nranks 8 total 1.30 (kernels 0.39, alloc 0.70, bootstrap 0.06, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM 
+[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO ncclCommInitRankConfig comm 0x563c6b199700 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x415c0bbf2d093211 - Init COMPLETE +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:585 [3] NCCL INFO Init timings - ncclCommInitRankConfig: rank 3 nranks 8 total 1.30 (kernels 0.34, alloc 0.70, bootstrap 0.10, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM 
+[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO ncclCommInitRankConfig comm 0x5571e41170b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId bb000 commId 0x415c0bbf2d093211 - Init COMPLETE +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:587 [5] NCCL INFO Init timings - ncclCommInitRankConfig: rank 5 nranks 8 total 1.28 (kernels 0.47, alloc 0.62, bootstrap 0.03, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM 
+[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO ncclCommInitRankConfig comm 0x5616fb60cd40 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x415c0bbf2d093211 - Init COMPLETE +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:586 [7] NCCL INFO Init timings - ncclCommInitRankConfig: rank 7 nranks 8 total 1.30 (kernels 0.36, alloc 0.71, bootstrap 0.07, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM 
+[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer 
+[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO CC Off, workFifoBytes 1048576 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO ncclCommInitRankConfig comm 0x55c5b577e3d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x415c0bbf2d093211 - Init COMPLETE +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:573 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 8 total 1.36 (kernels 0.28, alloc 0.27, bootstrap 0.66, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO ncclCommInitRankConfig comm 0x55bd61f08390 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x415c0bbf2d093211 - Init COMPLETE +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:580 [4] NCCL INFO Init timings - ncclCommInitRankConfig: rank 4 nranks 8 total 1.31 (kernels 0.34, alloc 0.73, bootstrap 0.10, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM 
+[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO ncclCommInitRankConfig comm 0x5630298ab9f0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId cb000 commId 0x415c0bbf2d093211 - Init COMPLETE +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:581 [6] NCCL INFO Init timings - ncclCommInitRankConfig: rank 6 nranks 8 total 1.31 (kernels 0.31, alloc 0.67, bootstrap 0.17, allgathers 0.00, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM 
+[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 07/0 : 6[6] -> 
7[7] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO 
Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:618 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:619 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:621 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:616 [5] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:617 [7] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:615 [4] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:620 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:Hyperparameters: +[default0]: adam_eps: 1e-08 +[default0]: adam_wd: 0.005 +[default0]: beta1: 0.9 +[default0]: beta2: 0.95 +[default0]: compressor: brotli +[default0]: data_dir: ./openai_parameter_golf/data +[default0]: datasets_dir: ./openai_parameter_golf/data/datasets/fineweb10B_sp8192 +[default0]: distributed: True +[default0]: ema_decay: 0.9965 +[default0]: embed_bits: 8 +[default0]: embed_clip_sigmas: 20.0 +[default0]: embed_lr: 0.6 +[default0]: embed_wd: 0.085 +[default0]: embedding_dim: 512 +[default0]: enable_looping_at: 0.35 +[default0]: 
eval_seq_len: 512 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:622 [6] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]: eval_stride: 64 +[default0]: gptq_calibration_batches: 2 +[default0]: grad_accum_steps: 1 +[default0]: grad_clip_norm: 0.3 +[default0]: head_lr: 0.008 +[default0]: is_main_process: True +[default0]: iterations: 30 +[default0]: ln_scale: True +[default0]: local_rank: 0 +[default0]: logfile: logs/20260425_045722_d5cae9d7.txt +[default0]: logit_softcap: 30.0 +[default0]: loop_end: 5 +[default0]: loop_start: 3 +[default0]: lowbit_layers: +[default0]: matrix_bits: 6 +[default0]: matrix_clip_sigmas: 12.85 +[default0]: matrix_lr: 0.022 +[default0]: max_wallclock_seconds: 120.0 +[default0]: min_lr: 0.0 +[default0]: mlp_mult: 4.0 +[default0]: model_dim: 512 +[default0]: model_path: ckpt/final_model.pt +[default0]: muon_backend_steps: 5 +[default0]: muon_beta2: 0.95 +[default0]: muon_momentum: 0.99 +[default0]: muon_momentum_warmup_fraction: 0.22 +[default0]: muon_momentum_warmup_start: 0.92 +[default0]: muon_row_normalize: True +[default0]: muon_wd: 0.095 +[default0]: muon_wd_mlp: 0.115 +[default0]: num_heads: 8 +[default0]: num_kv_heads: 4 +[default0]: num_layers: 11 +[default0]: num_loops: 2 +[default0]: parallel_residual_start: 7 +[default0]: qk_gain_init: 5.25 +[default0]: quantized_model_path: ckpt/final_model.int6.ptz +[default0]: rank: 0 +[default0]: rope_base: 10000.0 +[default0]: rope_dims: 16 +[default0]: rope_train_seq_len: 2048 +[default0]: run_id: 20260425_045722_d5cae9d7 +[default0]: scalar_lr: 0.02 +[default0]: seed: 1337 +[default0]: skip_gates_enabled: True +[default0]: sliding_window_enabled: True +[default0]: tie_embeddings: True +[default0]: tied_embed_init_std: 0.005 +[default0]: tied_embed_lr: 0.03 +[default0]: tokenizer_path: ./openai_parameter_golf/data/tokenizers/fineweb_8192_bpe.model +[default0]: train_batch_tokens: 32768 +[default0]: train_files: /dev/shm/fineweb10B_sp8192/fineweb_train_*.bin 
+[default0]: train_log_every: 5 +[default0]: train_seq_len: 512 +[default0]: ttt_chunk_tokens: 65536 +[default0]: ttt_enabled: False +[default0]: ttt_epochs: 4 +[default0]: ttt_lr: 0.005 +[default0]: ttt_momentum: 0.9 +[default0]: val_batch_tokens: 2097152 +[default0]: val_files: /dev/shm/fineweb10B_sp8192/fineweb_val_*.bin +[default0]: val_loss_every: 0 +[default0]: vocab_size: 8192 +[default0]: warmdown_frac: 0.2 +[default0]: warmup_steps: 0 +[default0]: world_size: 8 +[default0]: xsa_last_n: 11 +[default0]:[SMOKE_TEST] attention_backend=sdpa_fallback FA3=False smoke_test=True +[default0]:[SMOKE_TEST] val_bpb from this run is NOT comparable to proxy/full runs +[default0]:attention_backend:sdpa_fallback(smoke) smoke_test:True +[default0]:train_shards: 0 +[default0]:val_tokens: 40540672 +[default0]:smoke_test: torch.compile disabled (eager mode) +[default0]:model_params:35946192 +[default0]:1/30 train_loss: 9.0074 train_time: 0.1m tok/s: 6877 +[default0]:2/30 train_loss: 13.8612 train_time: 0.1m tok/s: 13159 +[default0]:3/30 train_loss: 13.9006 train_time: 0.1m tok/s: 19473 +[default0]:4/30 train_loss: 12.3547 train_time: 0.1m tok/s: 25621 +[default0]:5/30 train_loss: 11.7225 train_time: 0.1m tok/s: 31590 +[default0]:10/30 train_loss: 7.8871 train_time: 0.1m tok/s: 59453 +[default0]:15/30 train_loss: 7.0285 train_time: 0.1m tok/s: 84709 +[default0]:20/30 train_loss: 6.7346 train_time: 0.1m tok/s: 107557 +[default0]:25/30 train_loss: 6.2556 train_time: 0.1m tok/s: 128298 +[default0]:30/30 train_loss: 6.2241 train_time: 0.1m tok/s: 147317 +[default0]:peak memory allocated: 2595 MiB reserved: 2840 MiB +[default0]:ema:applying EMA weights +[default0]:smoke_test: training complete — running GPTQ+brotli pack for size check +[default0]:Serialized model: 135441937 bytes +[default0]:GPTQ:collecting Hessians from calibration data... 
+[default0]:GPTQ:collected 67 Hessians in 0.4s +[default0]:Quantized weights: +[default0]: gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight +[default0]: gptq (int8): tok_emb.weight +[default0]: passthrough (float16): blocks.attn.attn_out_gate_w, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, embed_scale, skip_gates, skip_weights +[default0]:Serialized model quantized+brotli: 15994543 bytes +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:251:251 [4] NCCL INFO comm 0x55bd61f08390 rank 4 nranks 8 cudaDev 4 busId 9b000 - Destroy COMPLETE +[default0]:smoke_pack_bytes: code=19646 model=15994543 total=16014189 +[default0]:smoke_test:complete (code ran successfully; val_bpb not computed in smoke mode) +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:252:252 [5] NCCL INFO comm 0x5571e41170b0 rank 5 nranks 8 cudaDev 5 busId bb000 - Destroy COMPLETE +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:250:250 [3] NCCL INFO comm 0x563c6b199700 rank 3 nranks 8 cudaDev 3 busId 5d000 - Destroy COMPLETE +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:253:253 [6] NCCL INFO comm 0x5630298ab9f0 rank 6 nranks 8 cudaDev 6 busId cb000 - Destroy COMPLETE +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:247:247 [0] NCCL INFO comm 0x55c5b577e3d0 rank 0 nranks 8 cudaDev 0 busId 19000 - Destroy COMPLETE +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:249:249 [2] NCCL INFO comm 0x5564b4fce0b0 rank 2 nranks 8 cudaDev 2 busId 4c000 - Destroy COMPLETE +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:254:254 [7] NCCL INFO comm 0x5616fb60cd40 rank 7 nranks 8 cudaDev 7 busId db000 - Destroy COMPLETE +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:248:248 [1] NCCL INFO comm 0x56133127eba0 rank 1 nranks 8 cudaDev 1 busId 3b000 - Destroy COMPLETE + +=== TRAIN (rc=0, 
1413s) === +W0425 04:59:06.896000 25565 torch/distributed/run.py:803] +W0425 04:59:06.896000 25565 torch/distributed/run.py:803] ***************************************** +W0425 04:59:06.896000 25565 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0425 04:59:06.896000 25565 torch/distributed/run.py:803] ***************************************** +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25634 [0] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25634 [0] NCCL INFO cudaDriverVersion 12080 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25634 [0] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25634 [0] NCCL INFO Comm config Blocking set to 1 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25635 [1] NCCL INFO cudaDriverVersion 12080 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25635 [1] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25635 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25635 [1] NCCL INFO Comm config Blocking set to 1 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25637 [3] NCCL INFO cudaDriverVersion 12080 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25637 [3] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25637 [3] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25637 [3] NCCL INFO Comm config Blocking set to 1 
+[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25638 [4] NCCL INFO cudaDriverVersion 12080 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25638 [4] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25638 [4] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25638 [4] NCCL INFO Comm config Blocking set to 1 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25639 [5] NCCL INFO cudaDriverVersion 12080 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25639 [5] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25639 [5] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25639 [5] NCCL INFO Comm config Blocking set to 1 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25640 [6] NCCL INFO cudaDriverVersion 12080 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25640 [6] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25640 [6] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25640 [6] NCCL INFO Comm config Blocking set to 1 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25636 [2] NCCL INFO cudaDriverVersion 12080 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25636 [2] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25636 [2] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25636 [2] NCCL INFO Comm config Blocking set to 1 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25641 [7] NCCL INFO cudaDriverVersion 
12080 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25641 [7] NCCL INFO Bootstrap: Using eth0:10.245.41.22<0> +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25641 [7] NCCL INFO NCCL version 2.27.5+cuda12.9 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25641 [7] NCCL INFO Comm config Blocking set to 1 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Failed to open libibverbs.so[.1] +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Initialized NET plugin Socket +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Assigned NET plugin Socket to comm +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Using network Socket +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO Failed to open libibverbs.so[.1] +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO ncclCommInitRankConfig comm 0x562a344e8e50 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x858d7950b3ab6cee - Init START +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO Initialized NET plugin Socket +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO Assigned NET plugin Socket to comm +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO Using network Socket +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO Failed to open libibverbs.so[.1] +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO Initialized NET plugin Socket +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO Assigned NET plugin Socket to comm +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO Using network Socket +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO Failed to open libibverbs.so[.1] +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO Initialized NET plugin Socket +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO Assigned NET plugin Socket to comm +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO Using network Socket +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO Failed to open libibverbs.so[.1] +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO Initialized NET plugin Socket +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO Assigned NET plugin Socket to comm +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO Using network Socket +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO Failed to open libibverbs.so[.1] +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO Initialized NET plugin Socket +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO Assigned NET plugin Socket to comm +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO Using network Socket +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO Failed to open libibverbs.so[.1] +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO Initialized NET plugin Socket +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO Assigned NET plugin Socket to comm +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO Using network Socket +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
+[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO Failed to open libibverbs.so[.1] +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO NET/Socket : Using [0]eth0:10.245.41.22<0> +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO Initialized NET plugin Socket +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO Assigned NET plugin Socket to comm +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO Using network Socket +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO ncclCommInitRankConfig comm 0x56047e0e1d30 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 4c000 commId 0x858d7950b3ab6cee - Init START +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO RAS client listening socket at ::1<28028> +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO ncclCommInitRankConfig comm 0x55621e803d40 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 3b000 commId 0x858d7950b3ab6cee - Init START +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO ncclCommInitRankConfig comm 0x563da97ce7b0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x858d7950b3ab6cee - Init START +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO RAS client listening socket at ::1<28028> +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO ncclCommInitRankConfig comm 0x5569464c4f90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x858d7950b3ab6cee - Init START +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO RAS client listening socket at ::1<28028> +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO ncclCommInitRankConfig comm 0x5606f3d87690 rank 
5 nranks 8 cudaDev 5 nvmlDev 5 busId bb000 commId 0x858d7950b3ab6cee - Init START +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO RAS client listening socket at ::1<28028> +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO ncclCommInitRankConfig comm 0x5654c1810d80 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId cb000 commId 0x858d7950b3ab6cee - Init START +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO Bootstrap timings total 0.027917 (create 0.000028, send 0.000098, recv 0.000081, ring 0.027376, delay 0.000002) +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO MNNVL busId 0x4c000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO Setting affinity for GPU 2 to 0-47,96-143 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO RAS client listening socket at ::1<28028> +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Bootstrap timings total 0.740574 (create 0.000031, send 0.000081, recv 0.618835, ring 0.000122, delay 0.000002) +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO MNNVL busId 0x19000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO RAS client listening socket at ::1<28028> +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO Bootstrap timings total 0.121845 (create 0.000029, send 0.000099, recv 0.093987, ring 0.026059, delay 0.000002) +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO MNNVL busId 0x3b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO Setting affinity for GPU 1 to 0-47,96-143 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO Bootstrap timings total 0.117575 (create 0.000026, send 0.000096, recv 0.031388, ring 0.027355, delay 0.000002) +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO MNNVL busId 0x5d000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO Setting affinity for GPU 3 to 0-47,96-143 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO Bootstrap timings total 0.086239 (create 0.000028, send 0.000093, recv 0.000089, ring 0.085624, delay 0.000003) +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO MNNVL busId 0x9b000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO Setting affinity for GPU 4 to 48-95,144-191 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO Bootstrap timings total 0.125551 (create 0.000034, send 0.000103, recv 0.003561, ring 0.085614, delay 0.000002) +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO MNNVL busId 0xbb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO Setting affinity for GPU 5 to 48-95,144-191 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO RAS client listening socket at ::1<28028> +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO Bootstrap timings total 0.122101 (create 0.000029, send 0.000095, recv 0.121407, ring 0.000199, delay 0.000002) +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO MNNVL busId 0xcb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO Setting affinity for GPU 6 to 48-95,144-191 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. 
+[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO ncclCommInitRankConfig comm 0x564274bf4140 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x858d7950b3ab6cee - Init START +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO RAS client listening socket at ::1<28028> +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO Bootstrap timings total 0.000805 (create 0.000028, send 0.000087, recv 0.000146, ring 0.000167, delay 0.000002) +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO MNNVL busId 0xdb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO Setting affinity for GPU 7 to 48-95,144-191 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO comm 0x56047e0e1d30 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO P2P Chunksize set to 524288 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26009 [2] NCCL INFO [Proxy Service] Device 2 CPU core 101 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26010 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 6 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO ncclCommInitRankConfig comm 0x56047e0e1d30 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 4c000 commId 0x858d7950b3ab6cee - Init COMPLETE +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25995 [2] NCCL INFO Init timings - ncclCommInitRankConfig: rank 2 nranks 8 total 1.28 (kernels 0.51, alloc 0.59, bootstrap 0.03, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.03) +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO comm 0x562a344e8e50 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 
+[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO comm 0x55621e803d40 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO P2P Chunksize set to 524288 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26007 [1] NCCL INFO [Proxy Service] Device 1 CPU core 97 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26008 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 2 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO ncclCommInitRankConfig comm 0x55621e803d40 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 3b000 commId 0x858d7950b3ab6cee - Init COMPLETE +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25988 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 8 total 1.32 (kernels 0.34, alloc 0.70, bootstrap 0.12, allgathers 0.00, topo 0.05, graphs 0.02, connections 0.06, rest 0.03) +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO P2P Chunksize set to 524288 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO PROFILER/Plugin: Could 
not find: libnccl-profiler.so. +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO comm 0x563da97ce7b0 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26011 [0] NCCL INFO [Proxy Service] Device 0 CPU core 29 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26012 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 33 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO CC Off, workFifoBytes 1048576 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO ncclCommInitRankConfig comm 0x562a344e8e50 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x858d7950b3ab6cee - Init COMPLETE +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25983 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 8 total 1.39 (kernels 0.28, alloc 0.21, bootstrap 0.74, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO P2P Chunksize set to 524288 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26019 [3] NCCL INFO [Proxy Service] Device 3 CPU core 99 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26020 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 104 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO comm 0x5569464c4f90 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p 
channels, 32 p2p channels per peer +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO P2P Chunksize set to 524288 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26017 [4] NCCL INFO [Proxy Service] Device 4 CPU core 165 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO ncclCommInitRankConfig comm 0x563da97ce7b0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x858d7950b3ab6cee - Init COMPLETE +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25989 [3] NCCL INFO Init timings - ncclCommInitRankConfig: rank 3 nranks 8 total 1.31 (kernels 0.33, alloc 0.70, bootstrap 0.12, allgathers 0.00, topo 0.05, graphs 0.02, connections 0.06, rest 0.03) +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26018 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 166 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO ncclCommInitRankConfig comm 0x5569464c4f90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x858d7950b3ab6cee - Init COMPLETE +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25992 [4] NCCL INFO Init timings - ncclCommInitRankConfig: rank 4 nranks 8 total 1.30 (kernels 0.38, alloc 0.68, bootstrap 0.09, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO comm 0x5606f3d87690 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO P2P Chunksize set to 524288 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
+[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26013 [5] NCCL INFO [Proxy Service] Device 5 CPU core 53 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26014 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 160 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO ncclCommInitRankConfig comm 0x5606f3d87690 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId bb000 commId 0x858d7950b3ab6cee - Init COMPLETE +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25991 [5] NCCL INFO Init timings - ncclCommInitRankConfig: rank 5 nranks 8 total 1.31 (kernels 0.35, alloc 0.67, bootstrap 0.13, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.02) +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO comm 0x5654c1810d80 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 
+[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO P2P Chunksize set to 524288 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26015 [6] NCCL INFO [Proxy Service] Device 6 CPU core 66 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26016 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 67 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO ncclCommInitRankConfig comm 0x5654c1810d80 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId cb000 commId 0x858d7950b3ab6cee - Init COMPLETE +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25990 [6] NCCL INFO Init timings - ncclCommInitRankConfig: rank 6 nranks 8 total 1.31 (kernels 0.34, alloc 0.69, bootstrap 0.12, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.03) +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO comm 0x564274bf4140 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] 
-1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO P2P Chunksize set to 524288 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26022 [7] NCCL INFO [Proxy Service UDS] Device 7 CPU core 168 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26021 [7] NCCL INFO [Proxy Service] Device 7 CPU core 167 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
+[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO ncclCommInitRankConfig comm 0x564274bf4140 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x858d7950b3ab6cee - Init COMPLETE +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25998 [7] NCCL INFO Init timings - ncclCommInitRankConfig: rank 7 nranks 8 total 1.25 (kernels 0.70, alloc 0.38, bootstrap 0.00, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.06, rest 0.03) +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] 
NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via 
P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM 
+[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM 
+[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM 
+[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM 
+[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM 
+[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM 
+[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM 
+[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM 
+[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM 
+[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM 
+[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM 
+[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:26030 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:26029 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:26024 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default0]:Hyperparameters: +[default0]: adam_eps: 1e-08 +[default0]: adam_wd: 0.005 +[default0]: beta1: 0.9 +[default0]: beta2: 0.95 +[default0]: compressor: brotli +[default0]: data_dir: ./openai_parameter_golf/data +[default0]: datasets_dir: ./openai_parameter_golf/data/datasets/fineweb10B_sp8192 +[default0]: distributed: True +[default0]: ema_decay: 0.9965 +[default0]: embed_bits: 8 +[default0]: embed_clip_sigmas: 20.0 +[default0]: embed_lr: 0.6 +[default0]: embed_wd: 0.085 +[default0]: embedding_dim: 512 +[default0]: enable_looping_at: 0.35 +[default0]: eval_seq_len: 2048 +[default0]: eval_stride: 64 +[default0]: gptq_calibration_batches: 64 +[default0]: grad_accum_steps: 1 +[default0]: grad_clip_norm: 0.3 +[default0]: head_lr: 0.008 +[default0]: is_main_process: True +[default0]: iterations: 50000 +[default0]: ln_scale: True +[default0]: local_rank: 0 +[default0]: logfile: logs/20260425_045910_88c4ca8a.txt +[default0]: logit_softcap: 30.0 +[default0]: loop_end: 5 +[default0]: loop_start: 3 +[default0]: lowbit_layers: +[default0]: matrix_bits: 6 +[default0]: matrix_clip_sigmas: 12.85 +[default0]: matrix_lr: 0.022 +[default0]: max_wallclock_seconds: 600.0 +[default0]: min_lr: 0.0 +[default0]: mlp_mult: 4.0 +[default0]: model_dim: 512 +[default0]: model_path: ckpt/final_model.pt +[default0]: muon_backend_steps: 5 +[default0]: muon_beta2: 0.95 +[default0]: muon_momentum: 0.99 +[default0]: muon_momentum_warmup_fraction: 0.22 +[default0]: muon_momentum_warmup_start: 0.92 +[default0]: muon_row_normalize: True +[default0]: muon_wd: 0.095 +[default0]: muon_wd_mlp: 0.115 +[default0]: 
num_heads: 8 +[default0]: num_kv_heads: 4 +[default0]: num_layers: 11 +[default0]: num_loops: 2 +[default0]: parallel_residual_start: 7 +[default0]: qk_gain_init: 5.25 +[default0]: quantized_model_path: ckpt/final_model.int6.ptz +[default0]: rank: 0 +[default0]: rope_base: 10000.0 +[default0]: rope_dims: 16 +[default0]: rope_train_seq_len: 2048 +[default0]: run_id: 20260425_045910_88c4ca8a +[default0]: scalar_lr: 0.02 +[default0]: seed: 1337 +[default0]: skip_gates_enabled: True +[default0]: sliding_window_enabled: True +[default0]: tie_embeddings: True +[default0]: tied_embed_init_std: 0.005 +[default0]: tied_embed_lr: 0.03 +[default0]: tokenizer_path: ./openai_parameter_golf/data/tokenizers/fineweb_8192_bpe.model +[default0]: train_batch_tokens: 786432 +[default0]: train_files: /dev/shm/fineweb10B_sp8192/fineweb_train_*.bin +[default0]: train_log_every: 500 +[default0]: train_seq_len: 2048 +[default0]: ttt_chunk_tokens: 65536 +[default0]: ttt_enabled: True +[default0]: ttt_epochs: 4 +[default0]: ttt_lr: 0.005 +[default0]: ttt_momentum: 0.9 +[default0]: val_batch_tokens: 524288 +[default0]: val_files: /dev/shm/fineweb10B_sp8192/fineweb_val_*.bin +[default0]: val_loss_every: 4000 +[default0]: vocab_size: 8192 +[default0]: warmdown_frac: 0.72 +[default0]: warmup_steps: 20 +[default0]: world_size: 8 +[default0]: xsa_last_n: 11 +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:26026 [4] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:26025 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:26023 [5] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:26027 [6] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:26028 [7] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 
+[default0]:attention_backend:flash_attn_3 smoke_test:False +[default0]:train_shards: 0 +[default0]:val_tokens: 40540160 +[default0]:model_params:35946192 +[default0]:warmup_step: 1/20 +[default0]:warmup_step: 2/20 +[default0]:warmup_step: 3/20 +[default0]:warmup_step: 4/20 +[default0]:warmup_step: 5/20 +[default0]:warmup_step: 6/20 +[default0]:warmup_step: 10/20 +[default0]:warmup_step: 20/20 +[default0]:loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +[default0]:loop_warmup_step: 1/20 +[default0]:loop_warmup_step: 2/20 +[default0]:loop_warmup_step: 3/20 +[default0]:loop_warmup_step: 4/20 +[default0]:loop_warmup_step: 5/20 +[default0]:loop_warmup_step: 6/20 +[default0]:loop_warmup_step: 10/20 +[default0]:loop_warmup_step: 20/20 +[default0]:0/50000 val_loss: 9.0047 val_bpb: 3.4860 +[default0]:1/50000 train_loss: 9.0043 train_time: 0.0m tok/s: 7998243 +[default0]:2/50000 train_loss: 12.3260 train_time: 0.0m tok/s: 8067913 +[default0]:3/50000 train_loss: 10.6755 train_time: 0.0m tok/s: 7983956 +[default0]:4/50000 train_loss: 9.0360 train_time: 0.0m tok/s: 7950366 +[default0]:5/50000 train_loss: 8.1956 train_time: 0.0m tok/s: 7935943 +[default0]:500/50000 train_loss: 3.2838 train_time: 0.8m tok/s: 7733467 +[default0]:1000/50000 train_loss: 3.1939 train_time: 1.7m tok/s: 7730853 +[default0]:1500/50000 train_loss: 3.1541 train_time: 2.5m tok/s: 7731072 +[default0]:2000/50000 train_loss: 3.0814 train_time: 3.4m tok/s: 7730575 +[default0]:layer_loop:enabled step:2065 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +[default0]:2500/50000 train_loss: 2.9958 train_time: 4.6m tok/s: 7149542 +[default0]:3000/50000 train_loss: 2.9185 train_time: 5.8m tok/s: 6748731 +[default0]:3500/50000 train_loss: 2.9287 train_time: 7.1m tok/s: 6466484 +[default0]:4000/50000 train_loss: 2.8960 train_time: 8.4m tok/s: 6250728 +[default0]:4000/50000 val_loss: 2.8638 val_bpb: 1.1087 +[default0]:4500/50000 train_loss: 
2.7425 train_time: 9.6m tok/s: 6125653 +[default0]:4650/50000 val_loss: 2.7925 val_bpb: 1.0811 +[default0]:stopping_early: wallclock_cap train_time: 600108ms step: 4650/50000 +[default0]:peak memory allocated: 39076 MiB reserved: 39150 MiB +[default0]:ema:applying EMA weights +[default0]:pre-quantization post-ema val_loss:2.79176818 val_bpb:1.08078046 eval_time:8229ms +[default0]:Serialized model: 135441937 bytes +[default0]:GPTQ:collecting Hessians from calibration data... +[default0]:GPTQ:collected 67 Hessians in 13.1s +[default0]:Quantized weights: +[default0]: gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight +[default0]: gptq (int8): tok_emb.weight +[default0]: passthrough (float16): blocks.attn.attn_out_gate_w, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, embed_scale, skip_gates, skip_weights +[default0]:Serialized model quantized+brotli: 15976105 bytes +[default0]:quantized val_loss:2.81865316 val_bpb:1.09118848 eval_time:27235ms +[default0]:quantized_sliding_window val_loss:2.77537502 val_bpb:1.07443416 eval_time:129816ms +[default0]:ttt:start chunks=619 ttt_lr=0.005 ttt_epochs=4 +[default0]:quantized_ttt val_loss:2.77154797 val_bpb:1.07295259 eval_time:314448ms +[default1]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25635:25635 [1] NCCL INFO comm 0x55621e803d40 rank 1 nranks 8 cudaDev 1 busId 3b000 - Destroy COMPLETE +[default6]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25640:25640 [6] NCCL INFO comm 0x5654c1810d80 rank 6 nranks 8 cudaDev 6 busId cb000 - Destroy COMPLETE +[default5]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25639:25639 [5] NCCL INFO comm 0x5606f3d87690 rank 5 nranks 8 cudaDev 5 busId bb000 - Destroy COMPLETE +[default4]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25638:25638 [4] NCCL INFO comm 0x5569464c4f90 rank 4 nranks 8 cudaDev 4 busId 9b000 - Destroy COMPLETE 
+[default3]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25637:25637 [3] NCCL INFO comm 0x563da97ce7b0 rank 3 nranks 8 cudaDev 3 busId 5d000 - Destroy COMPLETE +[default7]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25641:25641 [7] NCCL INFO comm 0x564274bf4140 rank 7 nranks 8 cudaDev 7 busId db000 - Destroy COMPLETE +[default0]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25634:25634 [0] NCCL INFO comm 0x562a344e8e50 rank 0 nranks 8 cudaDev 0 busId 19000 - Destroy COMPLETE +[default2]:job-e807e32d-7aab-4e0d-b687-e73e463923d7-worker-0:25636:25636 [2] NCCL INFO comm 0x56047e0e1d30 rank 2 nranks 8 cudaDev 2 busId 4c000 - Destroy COMPLETE + +--- EVAL_WALL 586.4s --- + +=== PACK (rc=0) === +Packed code: 19646 bytes (raw=70361 bytes) +Model blob : 15976105 bytes +Submission size: 15995751 bytes diff --git a/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/submission.json b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/submission.json new file mode 100644 index 0000000000..e0fda1c596 --- /dev/null +++ b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/submission.json @@ -0,0 +1,41 @@ +{ + "author": "Ethan Ning", + "github_id": "EthanNing", + "name": "SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Score-First TTT (4 epochs) + Tuned MLP WD", + "date": "2026-04-25", + "track": "10min_16mb", + "val_bpb": 1.07290246, + "val_bpb_std": 0.00015869, + "seeds": [42, 314, 999], + "seed_results": { + "42": {"val_bpb": 1.07303003, "artifact_bytes": 15995398, "train_s": 600.1, "eval_s": 577.5}, + "314": {"val_bpb": 1.07272475, "artifact_bytes": 15999207, "train_s": 600.1, "eval_s": 575.7}, + "999": {"val_bpb": 1.07295259, "artifact_bytes": 15995751, "train_s": 600.1, "eval_s": 586.4} + }, + "hardware": "8xH100 80GB SXM", + "pytorch_version": "2.9.1+cu128", + "technique_summary": "SP8192 + 3-Layer Depth Recurrence (L3-5) + Parallel Residuals (L7+) + QK-Gain 5.25 (bifurcated) + Per-head Attn 
Output Gate + XSA on all 11 layers + EMA 0.9965 + Split MLP WD (mlp 0.115 / attn 0.095) + Score-First TTT (SGD 4 epochs/chunk) + GPTQ SDClip int6 (matrix) / int8 (embed) + Brotli-11 byte-shuffle", + "delta_vs_prior_record": { + "prior_record": "2026-04-09 SP8192_3LayerRecur_ParResid_QK525_LegalTTT (val_bpb 1.0810, std 0.0002)", + "improvement_nats": 0.00810, + "welch_t": -54.93, + "welch_df": 3.80, + "welch_p_one_sided": "<1e-7" + }, + "compliance": { + "train_under_600s": true, + "artifact_under_16mb": true, + "eval_under_600s": true, + "no_slot": true, + "no_pre_quant_ttt": true, + "no_etlb": true, + "no_ngram_cache": true, + "score_first_ttt": true, + "three_seeds": true + }, + "attribution": { + "built_on": "@bigbag's 2026-04-09 record SP8192_3LayerRecur_ParResid_QK525_LegalTTT (val_bpb 1.0810)", + "this_record_delta": "TTT epochs 3 -> 4, split MLP weight decay (mlp 0.115 / attn 0.095); all other components inherited from @bigbag's stack", + "upstream_chain": "see @bigbag's README at records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/README.md" + } +} diff --git a/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/train_gpt.py b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/train_gpt.py new file mode 100644 index 0000000000..c66f34b2db --- /dev/null +++ b/records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/train_gpt.py @@ -0,0 +1,2 @@ +import lzma as L,base64 as B 
+exec(L.decompress(B.b85decode(";S8xg09^nPXb=o1BN3oGN@YSP={}CMG#HYg;)@Xr73M@7G?vCPM^DwDa4GNqSYobSL`Sy6{qw#6tH}IQM$Ru)M`Q^QW$^{x)}g!cPRhx&+o^hFG%$h@7l=nem9Q~=-0Y@JGFVM;YML?oJPxb&)c@pT_GLat&mMwAJ|KB^4T%PmzcES7c4}JE4!dTmLtz1*YoD(i2G(R{Mr)h7^cOXS5DMMxc0*#5gj)#!X1`Nws>_9)&N73~lj{(-4&wqdWd)>6?-Ye|GW#?mH{MJ)^0MmPzC%*~U%u+ZLVc=wOindO_t~oKOh=EkmMkiszM7@Cp6T{4SQw34BScx_%u`y#3>&JTm_lgn>{~*^E5}1NYu`z;xizWIBAvwJs8Bd|sn{LolEgn#MzrN*jw}C{lE!$r9i#qC)|=^y-vXZth)Usu7veU*E4%lPCC73}S9uO*pA1@SgXzQ^o^gGT8II{xe53=66G)tPRht>0i4ZId4HiygNEV&i+39AViTvat1jgf?*ySq;f*<&Xh44gEbg9iWau_ejqHJYr7*vIvrYHA!Ic~kvqZFEl5y)(zU^61}APTaC0#Ff|j?NG=JDT8(Ua^%f6shTCeVB>0V(~sW~vHD27xP%1>ZwTCoq3*~^_GN~yHDF98XK=QR0@L{aT(2}C-kw%#m{q+4yZ{lG%`R^4Q|C}T|>%JwcLX#;3x-6xW0eDF(m73wt;fniql8AV@=DPTV9N(J!%LUwp9>+8I$DpmNkRWl|q0WC3hdd$?Hnmd?BsGfRr4hs2FI$$>LGb*|qxPC)sq3!`khfLLu)OjF#=+iIkAW({<5Iyz%5DE1%`9(izQiI+d0k=e;3)GqDMiTMR8(^yOU-E|@X*zb8+k%L@1w%_*MZiczPn3%-iXEkWneBI4U<)!Y?UFSXlIx{RGfKS#NBSk#K#R*g!ydb(Ajf#FkEGBm(P9FCHny5>GhC5qYcrLvj*^Tv{t=-gxxu%r!{u47AF4>G`vIun9<-=#F>bz^n6Fz1iwG0G(`4ah(S@^j*EZpqkb!GwOI<)|HZa0+8gv%uplNQr0g@u+cGr^{6}7ewe#{f#t&Pk#w@CLiccZsiG~|y1%nt)j+kQ7Cn$&Vti-50u*Rx$}s}pC=Grm=^7%ri~w!?w83>B(}g>j@OoyNoE<8eFy}(Me?|kbf=?8u8d=qVTEz0SvFdpl{gV`*>W0dtu^}tazYN=62^eDD%qi;QwIUx|F0#lvy(-`r~G0;g!CvtF{DMTrKiiNd$6(Zy|i~C`DvzVUyC`d0@G&*M_{Aj(%u6$F|DEE2BEeG{*iEny%&RJhmQfW`_^@gb9sJ?j}N8lh#CutR&5Gw4s7yI;5yBK!7(5z{_}ou6-i()q@_~wdQ+0%~qcQwF#)rUhSbg;@?k!8A*-7=E~l)4R5aBdFMiQlfnNfl;)#7%f#JOlx&zec2r?Jd!S}pQtxWJ`Z{|&wI8Rg3bw4huA@euYYx695geO-4kl#>amWZ(vJqTSa$nsKQ1ac*Ift{Hr3$L#QZJiY5eYF!y{m3P4;Q%09nY#YqsPtzOu;L5!#Z7m4pG^l;*)js;)P!7=rg|>k?6eDAGqr6_x`bN}&ZjGR7Kl`$t6y_PYj-Eqf`0}ufb)8dmq#Ht72G-SqysT<$rB>>F*(*+(?yhCVG;HX6xJ?d{ntn2F4;`aw{z@R6VuK97zH2FC@aYPuFMX@vLDr8*NL^e6Kx=}l%@ZWce&nPN}}CU{ldB}n{H52&Lp5(X&^E}#R-O1>G%SJqW${Vi@MVE%P4vV_ppBbC$G+MX}u;cd7vZ4ILofLOXqQEz2?GD8Q(Gtq=ut=}+cZ)8B<Uy)PO1&EeSPeoEwzw3w4MTwD82=Vb4KPCWQxr*(I{lb>te>g>P5|{<0tS4*>hL4OwCQz4^%zlqS9ijj~b{iQRDS_#VRQwn*ThIhutjjbhLp_h2a|LzX``l8SZ|S}$G3bCD1g4cDK6-$OOoIn&V)mZOv3UtxnbdV6
o=hg@Dr==|V9&mGe?VIb)ZF=wI)RBHc>||1M=qHq)xmf0C|2pU9t1net|m6r(UFw~+oq|5_yuk$+t#|r#D$aosXdr~_|39N`570?TS2&*lmss`K<`|cN+@;=Ylk*IYOQm?w6UZn3>d9^=KrKcp)AB~-P;e`wR%A=Yo1sIinsO>=myc1s7Og?S?ct!0HDp3uFApZnWGqrR{Ap`tdR#e$19&Yv-LkOU(E^GG)DXh_(Hq|Aq!bd00#u~ee5%4y8Q1O?p+C5OJ>E`?>UO!F)aj%B{rAu<^2ip=GF$&A(EkjvMlm7fXSny8?bX5^3DT$z!S}`J-Z~Af{-vadh1pl!AM)G=;eWg3YssNWAt%JRFtmR1qcY4Zt5j__Zs-{(BPLzypU-GEyj%VslYMDd|{`Ph8Bra?W@C0dU?uoH&W0Ry1qjfIPocV!E;oZ#qt@{n1@ZRXjRXNC>M+c;U!x-^x+j2qlfZ4wT_N@b(w4q?9cOp!qB~pot)`AWaE5ah~!5D$F9`GlVzYlqi;EGx2SO0_}^B`xs9jG4;Av)}rQCy1KXP<^%{%ZVp5o&0z$teDTnnkyeRbbDu#Y*nr4dLdEKLI|_!M@Czcj6R>tR2^hR%CkCfzSP-?06OU2Ex6lw@A-uY-GeuVAiiXIKb*Fd2d*-5th3h#FUZv<{7w0=J0ai6grCN+R8iY^s#|`~&Rih2so&7s?AJt234p4J6kuFoVD5t&3fB$xxa>77KUr1%`)XT<8O}V7qnR;v_6OXfxq%Ethd|-&k~r~j9+zN>n(&o>9LQ*kMCi#E<_I`2f%MA-%XZt#?e_o3KX>Bkn9uqU$UY>a4N%8Vmc6@0I@7K4$YI1X$!nL-qd;F;`m87`|47$gfPnj#+v=WO^Iq;5aLeHpMu_-l(KvxM_dYvHT~v!4Rj@sP7ncbHMf1_qZOyb$alC~`&aK-`^^AH1azL_fI!#ng1NgOt4i`m>O9Qn-Zk@6>DiWsSFJGyaD&0969^Ob(g6%)}u-Q1lkhYwA~g@f^@4Z0J=dR8_aGUC%$U)>Gj94zTLsFN)5z)w8OjyH%QVe;5T#B>u`Ki%KDH@=b37c94p7gGbfE>#`n1qFi4*fSE7Axo2*(r+3@$|}c~)VC`l%WyO`bCnP2D1Bx;=rd5xN9H-Y(48f>$w7k7^d|V$}6^r7Zg1>p_n~)t-RRP|y{jUZdnrDFihVYILBwWiIN$tJ9Hj*H>k`rz;lglNo-#vXKN&Ag+`npjBccrJf8$iDB&r+~#&uY5)&2PN_0Sjp{pFZm{yX5<*h-Tb@QH55{G!LH91_**t(sE-4r06E%xQAa>6ei<=-!fH*kYYafr47~=*IKT7F8H~R8A{Ddb0>6@o9cq$6mf0@Q^!~(q92P*+>H})CHG!o*6lgGi~YndMjdZ%;|@JOx>DuW;fx=nl;L;W9$G--Xxwi?<3eN+PDZ!wwyB1LSa`F?OYd;V%}x$-+Y>FVnS+Ogq8Rf(Bk&ik^Pfmus2L)XaVkSOgEa3T#*iGmN~G2yvTBQ(;gMAO*&2|qRYx0Gan8IllF!hL$g#ufC&rJ=vrnn2{r~!HBeG*wSVN4T}vFh0h}QwT1KGs$m`ROn8*>Fnx^Yu)0$~oW&S@0k89`&==VRm}v^P+4cR7?&V2hL9*|bS)p#fEr7q2HJ2*@I=VKER}r|0Zst&oreEd!-YB$9h<28tU@NM1obmn~_k5^Oq(WhXctCcDx|PJ<2a~`lUl0%koxBw-kt_222B5rpdApQk{>(u<%<|yOX^epQXRd%X#9j(S-TeB&e6v)#!VsL88xJ9lPLsM)uDCPsT^Hwpq#2afmSB}ZvpI)AFy2nSeip5XqFNCMS$$b@EU)HV(a=`}ON{zX4n9Vg@B_>tKblBX)K;wc5$%*B>$YESg>^gZu=`{VpS1jcp5WZ7Bw?vZL+AN-M(?^GMd8JY+CE4ZBC-YG;W(FpZp9q1P)#1ujO&m$j2f=bhvg&YmNmUl0L{Yd5mc{4R|H6Pd06x;lc9)vS&{8pxx-J_B)YA1(+
6t6IH@J4?G4LdCdm>@iX3A}vf5uNx=&SpnYhHNGCQwz#=|U;;&*@mgQkrT%P;wXp+p`&!1e!4E!|pQTD5qlE2-Bgb|c09Edvj+jplcqz8O$u-O~I{}X2igGy*3env?LT=1DlIvn_jd;|CfrlIqXsmSRt2yai=rHQ57LLfLkBG`HJLQgg+ZlrpW-d6!w`L@5Z)V|Q!1{p(Z*V)U`!)m4pV(0O9X*=gMi(|kuoTk?W51cEs+VZuY}104_O+kh_)o~7wyE4*S7Q0IyvK2RKe@Ttw3MyM}E+!J8~xJE|KD&9fd)5}h3`AhVs!P#z%6zXRX50K|t@fF*|)lFHsEi@Il*bUwaJ_3|@&uz`iCO1$|YXdjt>L%4~@;J)pw%QL!jmMPRofgFpp9h{8@Co;4R2bR;yGNukmiP=defsyPXTqaZ#ua~6k>*HYqR1^_j3fyBTgThjSz|k~gpemnZUoX+goL^1=FAmAJQ585d0C5%oRM%|zNdCR@7Df)2BrvFxth*>Z=7W|^W^#)&kt{Wd^^s^XSBbeC^$PRgdH&Bf-Ary;t-B)Moihm-1paX&Rywfd#tyCxxF#B2tF#T%Ta3C+H@H2mUH%<)?g!JciMGkS@)Q~9;VX>Q`-Z!(^i_R?E?gzttWLcVMv;{juC&}mNZ0EW*35^4`V!cqZ!>iTbxN)6`{^88xFun#XX-vAw!&GvkG81Q(ugq}mq!snRB@yI0#86D_hw73hmP4bLbcm%GMDrs2;q(wNXYJ{t89E6VTp$64J~vq9H6y(;@JTTt)aOEGWyG9&6iTDsE9nK<&*ud8a$p02Vsetgt>1|=Q39PR$CtZW7?7Cq24FZF=2_7vex@4#EYFv*+H#*55zyh{pb7G^el9wl2;wAgT%5pG;B*{ht#mF#7`DkP^Vdq@Z87LOD$^TUd^~&6es2Swb6G0_0uKA93=D?PZ+zX;(b33!WE$lfdMdm0B(Z2)@ZSS3ot;3bPaMJuL9E8W0wZrdC?pL0&i=W=w?Vz+;;<}fz^k;D;SldIg8$1OAVXmf`-*=LlITviQJaHKpcyvYc2(V0R&CgClWYjOka7};_YFz1H>PglX|~YXI-C{dW{%CJg15lm3MuYr0--|9IjpP8_gJa>s9KmD0ekg@w;lkJ?EQW9j9sJx?-uW$5ZCw<>(9Wqk}BQ~xqRByA9`^|OBZ8q5Ug(bpo8*XB76Fo{BX4zx{^BNJ_J+!^;`I?4mm{?2-+4QR~>Luxs<@ibp9%4BXT@o0&2CxO+hF2Z6UcH5Wa)5^>D!4e9W5Pl`3W`BjrK?0VQE=PB)T}%KS_!c(K3ckvNB69e(7>+q5_JN@HrbjsZ+!15Qe*5CM8fJNPY21mzmUNv$1;+Yheo}YV(vp$io1;?G*l6~_g?`&mD~JwXi^+9+#1_Hx5+t8TyJUySMQ>DwhHIbw`u~CDKl=?u^+?4Pc36M{+NPejns5k}mhH?$0|y!EDFIZ7`mk%)J-1H24K}k*Fz+nDyQgHEU$zO{Ov`R#);h?#z8k#qX2$_LJRGzlrPg*wf4H*JAMOo=_7x_q`q}VJ2V{ro?Nw9174)i!x1Bq1OVGWSnA|c{y}0h~RA%>x`|PkYz5XrvGn&5Mnr=m?Athabz$X8rH(GiHd-rmm1GPZc|{58omfT{Cc4FQhYv&=mbWDC9O#E4Q1R|1Y=J$FC^HVw|I9YITU2IfMYgn5{)L44_d7dP2-QPL-CU&*-qv9FMETxAUrqdO-gGMxrhPIDRY%PD0$PUM|?G+c~f&V{_$<_(VFi?HTvP`5PvO8qmZwy3F4-ju9$Hua9O4Ty*aSy(U6oXj7C6MS&iU;ohl%kJu$#Gz)*))>(sHZMOZR%3qH5{UO+`aQDiFR(E(#iuinr$rYo7Tuk&nII(?oCUs2fhIS&#!MvTeUmbs_qJVSs=!5VF2i+U-8nXSZPYU1`Vb27*=ktLM2hqD@>^+OeM5fWJEUu>k52Ugso}&CCIA25idJ(=gN|bHu{dDqX*4)dN_%Nhzyqq*^cdS4Hfl*ro
;kTbUc(h{7cd2HVx4FW^j#N`{o)vKw>1a_1T9wkF{PhF&4lv!XUFuAOxk?qUMuTeUL!6Sp9;xmqa2ovFQ%*;rwfGB9~(BL%}GP6GG}wRIf^1LlQgPmO2be+kI4KUB++1@2&mhQK3)ohM?76K4f*}!O2gOQ1vcc2?zyIOW}j#|G-rRG9*He=Bkxqu^f>BRyKD{Ontq^$<$dfDa!_BCDRC7+c!FNjlLx~?61bA_Uk$2)s2)sTdksYh*&EM&G;%`4dZ48un>SH#$D@DeFsVbJzIUv9jv)>;Wr$7`!7O~hw~(co@ZO`1`waIV6Et{lfaUs5?_WJ*3=p@zH%@+kK#sIF5mwH1hV_3l4s|i7)Sfq5J$zsVCSwJquA3k#d=yn+P?YdNCV;v=-BQUjPWw&N5S+sy0dN%bsglgv*<_bkgV!;>d8SMIIupa^3_MS?o?qrBL{jqy^)ml4*?44SxM4?cA*0(0w3@mwZcTK<2Im>*(4nBx2n&SP$LELWB+MX67JEqNTd^4oEpU^23&y)LGfIfm7rZ8%*I56Td5!7adWLC?ez;fU?kDbO%x!eh9ROskGPR+aXNp_UX-WgC-jAxp->4tkQp=&%qEm!^&w5{#M62UfX{ZcO4WUS|A%?mW`0m?W_9MMaIm(HUVz{|67z`ezINtU{@F7`PhxQMTY<%R}``nJ>c70>3-CO6I6zoxiPe7ywNywu~kFp+IE;|Hx@8QN*8b@6=x{PMU)+&-U2orVhqsarHr~71vQ%3rBQYBd*aPM$@I}B7L$q9A$j{8dIWLZMVY7ZAORoAZcN5t&$#hGYpzKCy6TZ?>c4m9mwWTI1RLT=CF^&O`qK6qzEypX_D(YyW7yVYZny~|9jX$8yMll*O+5^SKAG?4CfS3JhY2F1p{VBHbWx^^;ly(B?wB85`Acjt~G86IPYg?X2$fYOVQ^oGDPR88>B2jet3xpK__-@)p9N`U0G38?_$fKrfRN>u{YEC{7lvvpYqN7Hd27;42n5`Zgd3X^I(COA>yKVv9TEqOB{V&sA(t38KnrqjtkWxco74Cx2ZT2Jx2C-<_Eqh1ydVPJEqC!qMheQVa#f7Sr%Ugi=~G4Kr5LVSVWQ*EN|aQlqBsi$*L>FAd-qkwM}LN^7xcf1)5I1%{Ngp$)Y}Rn{GAkKOjUW_@gt(O#3=95B&T+@nq;@F-5nGOGqgT3A;nTQ|={a+OdKbH>K3O5;Le-RF5rbEDp_Adsz{etuAP8%Q7;;pF9Gqp>=vW*ZVQILcnQ2gZ#yZ^rY$2ar4d?KQUgr(j^rMCux@GWqsCX9~DxVe!CIau~c56|Iexj@%G#t5-@&=2TQA4C^v>FA9*dK(!Jx3~=QC_AZJjpH>WI^M7<7Qu7MIKn4g%!r#(Tg}E)k$`+(kFh)`zT4DBx52#diWjP_)8Lw6Pars_XNfbL+iTPN7WDSV-A%Hel=%L)c|p}kpY(x>;2+9EYJ`1!L-P-O|pII0%o|lq9lQ0Ke5sZdfKDpjHJE0$=8JYL*qT)P{>6!oB|qL{xL1Gx#IMck`93;FQ~D{rjFh&xTG9QzkIRkp5ws;9uEBfwQeiy{<&>}D24mWaLIjVK5Bt{ZD1a-`r^LJV<`@TP~&JBhk*{O34SaU^vJO`z{SV){E1+8=E)@Gwj(?0a6TD|8qoA+X0D&?GCkjKHtE`qK;Yr_;nPwK$%odA#|fM=IexSC|IaTsRms0DG(fy8ITyJ>V?c$HeZiHYS2M@T$vuh*RS8xhgl|p2ihk*e8a3q6H()190$l-6IhSvzUum*~H%J1&a=3Xd$k>V5{()Nn+iCB}UIa|!j?I#J4ced>YoPiUR_)iWOhYn2%E~>vDk)$`u)3E)~x3um#%2GZoswBhC%M@2FGHg>;S$v>1H8Pmfu~XAO+kE6d{TyAoT$8(GTl2^ISC)wZZ${k%jOs~6ddQaeij7unVnk5ec?E3OjLLE5;oRNP?gPLorkiJMH9+wGmvjZ`d7E^rlX{#?AQI>F;%y12cGzqWj>D!1gDxuteZa!xv
`fy4&=fs2+7x{aX$gN;hzw$wFKa#0R7eaTDACYqRO|(Q?Fn#~h!S{wYee6ogWnO%3ku@M(tFK20oCGfvv}_0O7^T2M@t9G1+zU~4S9^(#oI125Eb>X2Yp(`JOt%owgK-5t)V1$4+HW>h2J`FsoSwg_wC6re|W5=Mq8SH^{D$V#WEEbNTVRixH(vuVlV=3$HT&>s-l>$^2L9)?~IE#*#rfS2vCyry8ikYMopcYQ(6eK=ROlATom>9H(m_H07$s;7E*N&EibcpgScpGHHZrcDjo}e&in%9!)5C!G=q9RXXNK})gJ!Ucr>FF4HbqMD}c7kFW}^?iM`cmns5Lk)&rOu)&P?;{mcQw`sn_K!&ADSEdK{&?tc5RnIZWLg1vfY8b;|VO2l5lQZnya9?f0{gI@#WCU5|$et$`{y@X5bhp6%cvR@=f{%i`jryt|HVXhAYk*VeodG*-*alrBKBBl^J`Ri|mfYMYUf0#JCGTS8xcR8DHxm26%@-X103u+yxBXol;+kE%di-93Alwy(1X(8$kdA{<<6D<9>YE3DoNSHrFkfD#z$~yTXMR9tZt7~*@&!;2f)~5|)m)EbRgAO0L<~~&7FtKv90%9!mINB)ZK3r+^m-`xD*;w%+e}2+3fZC3QH%=QC4PqEH3R(-@S7V%^UHm}+JmyYK7I~Jk&JV!g9R$KqXy)(ske+Bz$+O9p4ZTtAw0DDKKO@bfD|4AaGnEyoO>SKT-1WKU_zO91nCqt+|9-N(OI2OteK<(}`HqIkL=y@^hwpNFOv+dT+U}>^Z6N2h=o#vk6V_!%JG7RDtG!p?bGDUl{|kdh-H_z*PlXkLk+T^jUU-AbHji#522t1G)>;KpCWm>n2W>$?0rD!1pDm2GC@S`U??gp~7V*d;i34UcqN=Ok$-pLF=NCGt6CAta9mz6Xr=uzdHf?@?Qv4CqPql^hE&-{=9|#BMs;L-Q7e6OvbE_V5WsbjLbt>jg#DtH5SgPG{k~Y7^t`U>M5Om^Gz~NPfXuG;LU5~4%`}N3tbnqL$)uP;tIm**x$iKc`5$pIX-!CXyvs!`|+uS>bG!pe5(Kbazfqj#+dWvW_0RWD$u`@5y^^aCcYkL^P3TvFm{~**dv1kPjlvD=Pyu-u;iJIRV^C#`m5OGe7`STfw`f3$knYoB@BD@MWllEkTBpQ35I2xqwqnRh}AMn4heHE|X`x$0c@>9szKH%>x1{n)mQtto`ThG@H^a`;`Nh?%CoL-x3@CGB+vQIHx4P&voVA;v2JX8ym@~+Cdr>IH`BYuc@#9?&tFx4+t|*8Ymn=)IH|qBJA0B=HH~hCYT?=fmz>AK?Fwl1Jbxj=rn3wmh~IsJN-n5TXHgXd)OA5ZZ;EoZwMI2a2-p3aDl~zwb8vwWV7k^CV5S-Cj%`jp-Mnkhq6w2}*4}&}sFbp;z1+P6X@gJiX2>h|#u0s(RR#f-VmgRINVbtlKO)%Jbbn3i6}|8eQVY5T;Z3%f-t9!2(s{JUbUhp{*G$(Eb6MJsKSHe5up`&FAqw31>=WMC@gkq-9M8-iwt?v=axV;L6p=XQ5%D!CG?|$QrR}g$)k&T#UuM`#)jMe@`PF%z5n5u%$I&pX-E~Zu+Uns+K$(;Bl?5wmw^HXhIl;?m?Z+jos(1BDN+7oRG47exx}#q3nV?5Aas+9YeP7JIRBK2+S%iuJqucHiuu8gGnV4LQWHqi6JNi1%RBVTcIH+3w$mDC-Zh0ZLrpG|D0o7;J{L^(aMnyiQt~VdkPb=Bh4knvP*+V0EDvfzuYV8DE7g(EAv$0<)Sgu)nLHAOb_{Zp-O5j8AG{kd8W0c=M^fA*8@syk~wjoDYED4)MvZ_A>q2D7n6<@9!IK3o95@JyfmUSndU6Cl;k#N{*kx!Kj00T}2lvVwiSeC9s6J({~Pi=4S>JATpCS&dlKUxL{^?r&n#X!j|((g_^y5KUCU>lEk|M7~{_~qwLJ}Y>*1Sq8!Vxd{jQIHkImgTv3U^>RnT~{TqhAS#05I!rkgdqNMytq0Y6n)AtR`3
4G)G%cb0mCrpQRCHKG;T#}LCi)e6Q#7B^C(T^=FB#;rW4~h-v1sBo|rr>(KD>u^8Fb<^wPq;WGAJP^AjB~am+f^x?VJ!QV!oL@PKxwz-qP^R^ng)EjuB|!-6ckBGS^FJ)^*YQZ_jY=+AeQ4G7iv)^(d!&5(de#2B?go1m`^Ji+py->nAReeLBG9g2TUUiiyD+P`1RUb>S1u@VdZ4)SIwdO;tmzR6tMKfl6D`yrAj2DHzCeVDQQuF>{D*$nc=Ol15%hKU2a1X{L6h@^>oE?a5nSQhqe4Gsun`Ay4*=X-|OrNGyh;~VBruo9zB#U{#%lHz3MJUI&-ghU)f5=vhDE*Y>fomO01NS38`JH=wZZ2CJs`ab2P#rj{U7m+%DVj{aJGD@9{F?v$>bHt+=)!H`*Gc<7Pg9)fjm+3rbkSw>txO574xV)8Xdbo&FGmunvzJEeo}XuBc+UqI{ygdsic&g3f%fIU0&{yB~csY(5N{d^{FbFE6!2u*xND34GK=5%Gas5Etr28VYh@YmQLjrFuM(ndAnIveF-=`tv@WaAr_rD1%15^w6WeSDO$S*gJ-!WR0mQ3x|^Daa=+PPyknHFzn&z*(O=`s2UD2BX8qL{ZjpweA<7aYQKgf|V#5GswwVJOvvbRlPnu@~(Zs`-?uy1km?~pD!?=jQO@)%>ctAmC@q@+b5jrnc7*D(=ckOL^@>#%g_pP@!-0BN{;ppP=85YRr(RRUn)3I*$eadt?8IOd07K_UD_Hte-H~nl0XXN8ieGcK6}7f>Rz=E0SxI*c($`P|3m@n{IsVnqDXF{ho1VBQ>3FEh>223IS2Fz)wY=0u;*2SG0UFlOOd1ss7v1Jr#q6Wm{F&CIgMCJh>d^3h9Q;4BoXB@5Vw)i1dHF?hJK0jJRpDpQJEy+`%sWG4{t-T=S1*19_BTyXE+?Tqo55DXA8ZX#Em@39M++_q<8H;fNa1d7qt5O>AaCkvUUg96&}ur*2N+a+P#O=&!sNw5MyDRo|F9-2Ynvd{G@Bb(h_*%gBjPqCDTTt|t-ss494!(<4h5Vnx7XUbdR)+)OYVnwC{1qe@78;^H_%Qf<)kWi&R2t9gtC$5#bXFuqi8W2y_DL93$hJ7jJ&~SUXz$)4U{wsXqZ{4rxW~|{y0?C?juK;0$e%^U~wxXc>GG9z{Da-sDoCQ2UwSArR8oP?+bQA7K65*u|ie*1-jb~Ihwc!Q69U+CW0F+47^1Qy?LJFDyzrB|325(29OF*x?hIjSTLt47oK_u-bjfyL}g+_gpWf%=b&$%fGjD$YcSq%_z1&4JDrrgc0`8Jp(#5y8H+S%rSR7EuOxot!N3emmKWN^xZvr5J2Q)%MWW{Q~?_Y=mmk5DB$aSBs|obYbsw{!X6^Dln24MXx4%GKJe$|&`g2@H6I1g0f5&nR;}AIcdRXL>FoG7jv%ZI3Su1fF&3H*wYv4&Svo8}TDyAHG-;(ToBS|ug9OV-RZknwg~Z=d&Wh(;SM2SiqA{dl!>jU5&p+L!nRj_8kkXhP)U`^?Ly$%^HNzzs<4VCV;OXFM(BCrY9XFM3JblDJ5=r#wqf>n{QuhLvY9JAYg?=+z^sr!=jQ2gXV3%e}u}DbkkbEulJV%x}ie56?wV7r%&j;71F(Fr|nYJRJz%;!n({9^e;3;A~;KKUd*=BhqhKYNaYK*MiE~X#89Ep&LGCLE?$VlV}4%U9sa@@D1;{4PI@s|9qBhwgkK&9UG*p`?d^gHt&b5P_l~!W7DB|aSQ{)hTK|;!LW|R=QuFEOC~!|U0#<_00bl_F05<+2_if(mYDr(4>{#&2=EW_p?3AX^kaW)E`OSA}#Kh+Z{by4+hB!$xLkf6t5=5cf0uso7F8h6Y#M$aNrB6!B&Y`Gb^7Dw6Huy9-%)#%?Pm$yT*#5dyVg?cQ%;ICPd?BM6xvxm-&uBd2@v1m*2VP631kEXF52;~Ak!Ke+0lTD<+G}e4)#};iTT%DGOZGL-8=5AuD35=tTqR7uUCj^@O1!<)GHk{0w9*+NB+{*Y!Zq_
*~gKt;J$F6fTP9DPZ56a(+J8Ys1gW{+L0~y0@r90D=HuiMLwnDHQwTvV>xmiyHF&^yFLB+Lwy(Fy#htHp`aTsy7F5-CWU&JWYx?532i?bX?wF5q)mlO_N!(pMBs?sUG{PLH;SHB^vP-xHC2VEbc`t+*|KPtR+SMv7S%ftyuaEqWO>fV#I~2f7qANtt?KZspV9OC9SbGcBBXHl&)0}V)COm0XUt=s#eP|}%7!sWH9Z^Y>U}|H?EWf}B8!^v{tSuPZgk3>bcH}L>$vJ{aLW`dMU#E{U1ZzdEYv%*{D2AZ!RU3UkUluedEr|Bi6-7bcPgh9qtmN?YbdHsIO8sXVJd(^2bL`W"),format=L.FORMAT_RAW,filters=[{"id":L.FILTER_LZMA2}])) \ No newline at end of file