
Playing with DGX Spark

Date: 2025 Nov 5 0:00
Created by: Seonglae Cho
Created time: 2025 Nov 5 9:36
Last edited by: Seonglae Cho
Last edited time: 2025 Nov 5 19:39

nanochat

torch cu128 (CUDA 12.8): CUDA 13 not supported
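Whether an installed torch build covers the GPU can be checked before launching a run. The sketch below compares the device's compute capability with the architectures the build was compiled for, using `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. The helper names `parse_arch` and `capability_in_range` are hypothetical, and the min/max comparison only approximates the range check that torch reports in its own warning; it is not torch's internal logic.

```python
from typing import Iterable, Tuple


def parse_arch(arch: str) -> Tuple[int, int]:
    """Turn an arch string such as 'sm_120' or 'sm_121a' into (major, minor)."""
    digits = "".join(ch for ch in arch.split("_", 1)[1] if ch.isdigit())
    # The last digit is the minor version, everything before it is the major.
    return int(digits[:-1]), int(digits[-1])


def capability_in_range(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """True if capability lies between the lowest and highest compiled sm_* arch."""
    archs = sorted(parse_arch(a) for a in arch_list if a.startswith("sm_"))
    if not archs:
        return False
    return archs[0] <= tuple(capability) <= archs[-1]


if __name__ == "__main__":
    try:
        import torch
    except ImportError:  # torch is optional for the pure helpers above
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("device capability:", cap, "compiled archs:", archs)
        print("within compiled range:", capability_in_range(cap, archs))
```

For a device of capability (12, 1) and a build whose highest architecture is `sm_120`, the helper returns `False`.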
git clone
uv sync
uv pip uninstall torch
uv pip install torch --
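The run logged below was recorded without any extra environment settings and stops at `ptxas fatal : Value 'sm_121a' is not defined for option 'gpu-name'`, raised by the `ptxas` binary bundled inside the venv's Triton package. One workaround that may help, untested in these notes, is to point Triton at a newer `ptxas` through the `TRITON_PTXAS_PATH` environment variable before any kernel is compiled. In this minimal sketch, `find_ptxas`, `point_triton_at_ptxas`, and the candidate path `/usr/local/cuda/bin/ptxas` are assumptions for illustration, and whether the chosen `ptxas` accepts `sm_121a` depends on the installed CUDA toolkit.

```python
import os
import shutil
from typing import Iterable, Optional


def find_ptxas(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate path that is an executable file, else None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def point_triton_at_ptxas(candidates: Iterable[str]) -> Optional[str]:
    """Set TRITON_PTXAS_PATH to the first usable ptxas; return the chosen path."""
    path = find_ptxas(candidates)
    if path is not None:
        # Must happen before Triton compiles its first kernel.
        os.environ["TRITON_PTXAS_PATH"] = path
    return path


if __name__ == "__main__":
    # Assumed locations: whatever is on PATH, then a typical toolkit install dir.
    guesses = [p for p in (shutil.which("ptxas"), "/usr/local/cuda/bin/ptxas") if p]
    chosen = point_triton_at_ptxas(guesses)
    print("TRITON_PTXAS_PATH =", chosen)
```

Exporting the variable in the shell before `torchrun` has the same effect and also reaches the worker processes it spawns.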
[W1105 04:19:13.033074459 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[nanochat ASCII-art banner]
Overriding: depth = 26
Overriding: device_batch_size = 16
Autodetected device type: cuda
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)
  warnings.warn(
2025-11-05 04:19:16,035 - nanochat.common - INFO - Distributed world size: 2
Vocab size: 65,536
num_layers: 26
model_dim: 1664
num_heads: 13
num_kv_heads: 13
Tokens / micro-batch / rank: 16 x 2048 = 32,768
Tokens / micro-batch: 65,536
Total batch size 524,288 => gradient accumulation steps: 8
Number of parameters: 1,081,999,360
Estimated FLOPs per token: 6.900941e+09
Calculated number of iterations from target data:param ratio: 41,275
Total number of training tokens: 21,639,987,200
Tokens : Params ratio: 20.00
Total training FLOPs estimate: 1.493363e+20
Scaling the LR for the AdamW parameters ∝1/√(1664/768) = 0.679366
Muon: Grouping 104 params of shape torch.Size([1664, 1664]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([1664, 6656]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([6656, 1664]), device cuda:0, dtype torch.float32
Step 00000 | Validation bpb: 3.3071
[rank0]:W1105 04:43:56.152000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/utils.py:1558] [2/0] Not enough SMs to use max_autotune_gemm mode
================================================================
Internal Triton PTX codegen error
`ptxas` stderr:
ptxas fatal : Value 'sm_121a' is not defined for option 'gpu-name'

Repro command: /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpx9u3mic8.ptx -o /tmp/tmpx9u3mic8.ptx.o

//
// Generated by LLVM NVPTX Back-End
//
.version 8.7
.target sm_121a
.address_size 64
// .globl triton_red_fused__to_copy_linalg_vector_norm_0 // -- Begin function triton_red_fused__to_copy_linalg_vector_norm_0
.extern .shared .align 16 .b8 global_smem[];
// @triton_red_fused__to_copy_linalg_vector_norm_0
.visible .entry triton_red_fused__to_copy_linalg_vector_norm_0( .param .u64 .ptr .global .align 1
triton_red_fused__to_copy_linalg_vector_norm_0_param_0, .param .u64 .ptr .global .align 1 triton_red_fused__to_copy_linalg_vector_norm_0_param_1, .param .u32 triton_red_fused__to_copy_linalg_vector_norm_0_param_2, .param .u32 triton_red_fused__to_copy_linalg_vector_norm_0_param_3, .param .u64 .ptr .global .align 1 triton_red_fused__to_copy_linalg_vector_norm_0_param_4, .param .u64 .ptr .global .align 1 triton_red_fused__to_copy_linalg_vector_norm_0_param_5 ) .reqntid 512 { .reg .pred %p<16>; .reg .b32 %r<108>; .reg .b64 %rd<73>; .loc 1 18 0 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:18:0 $L__func_begin0: .loc 1 18 0 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:18:0 // %bb.0: ld.param.b64 %rd3, [triton_red_fused__to_copy_linalg_vector_norm_0_param_1]; ld.param.b64 %rd2, [triton_red_fused__to_copy_linalg_vector_norm_0_param_0]; $L__tmp0: .loc 1 23 28 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:23:28 mov.u32 %r1, %ctaid.x; .loc 1 25 21 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:25:21 setp.lt.u32 %p1, %r1, 338; .loc 1 26 37 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:26:37 mov.u32 %r2, %tid.x; shl.b32 %r6, %r2, 2; and.b32 %r7, %r6, 2044; .loc 1 36 46 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:46 shl.b32 %r8, %r1, 13; or.b32 %r3, %r7, %r8; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 // begin inline asm mov.u64 %rd4, 0x0; createpolicy.fractional.L2::evict_first.b64 %rd4, 1.0; // end inline asm @%p1 bra $L__BB0_2; bra.uni $L__BB0_1; $L__BB0_2: // %.split.us.preheader .loc 1 36 34 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34 mad.wide.u32 %rd16, %r3, 4, %rd2; mov.b32 %r46, 0; mov.pred %p6, -1; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 // begin inline asm mov.u32 %r42, %r46; mov.u32 %r43, %r46; mov.u32 %r44, %r46; mov.u32 %r45, %r46; @%p6 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r42, 
%r43, %r44, %r45 }, [ %rd16 + 0 ], %rd4; // end inline asm .loc 1 36 34 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34 add.s64 %rd19, %rd16, 8192; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 // begin inline asm mov.u64 %rd20, 0x0; createpolicy.fractional.L2::evict_first.b64 %rd20, 1.0; // end inline asm // begin inline asm mov.u32 %r50, %r46; mov.u32 %r51, %r46; mov.u32 %r52, %r46; mov.u32 %r53, %r46; @%p6 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r50, %r51, %r52, %r53 }, [ %rd19 + 0 ], %rd20; // end inline asm .loc 1 36 34 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34 add.s64 %rd22, %rd16, 16384; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 // begin inline asm mov.u64 %rd23, 0x0; createpolicy.fractional.L2::evict_first.b64 %rd23, 1.0; // end inline asm // begin inline asm mov.u32 %r58, %r46; mov.u32 %r59, %r46; mov.u32 %r60, %r46; mov.u32 %r61, %r46; @%p6 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r58, %r59, %r60, %r61 }, [ %rd22 + 0 ], %rd23; // end inline asm .loc 1 36 34 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34 add.s64 %rd25, %rd16, 24576; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 // begin inline asm mov.u64 %rd26, 0x0; createpolicy.fractional.L2::evict_first.b64 %rd26, 1.0; // end inline asm // begin inline asm mov.u32 %r66, %r46; mov.u32 %r67, %r46; mov.u32 %r68, %r46; mov.u32 %r69, %r46; @%p6 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r66, %r67, %r68, %r69 }, [ %rd25 + 0 ], %rd26; // end inline asm cvt.u64.u32 %rd27, %r42; cvt.u64.u32 %rd28, %r43; shl.b64 %rd29, %rd28, 32; or.b64 %rd30, %rd27, %rd29; cvt.u64.u32 %rd31, %r50; cvt.u64.u32 %rd32, %r51; shl.b64 %rd33, %rd32, 32; or.b64 %rd34, %rd31, %rd33; .loc 1 39 22 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:39:22 mul.f32x2 %rd35, %rd34, %rd34; .loc 1 41 23 // 
cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23 fma.rn.f32x2 %rd36, %rd30, %rd30, %rd35; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 cvt.u64.u32 %rd37, %r58; cvt.u64.u32 %rd38, %r59; shl.b64 %rd39, %rd38, 32; or.b64 %rd40, %rd37, %rd39; .loc 1 41 23 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23 fma.rn.f32x2 %rd41, %rd40, %rd40, %rd36; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 cvt.u64.u32 %rd42, %r66; cvt.u64.u32 %rd43, %r67; shl.b64 %rd44, %rd43, 32; or.b64 %rd45, %rd42, %rd44; .loc 1 41 23 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23 fma.rn.f32x2 %rd46, %rd45, %rd45, %rd41; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 cvt.u64.u32 %rd47, %r45; cvt.u64.u32 %rd48, %r44; shl.b64 %rd49, %rd48, 32; or.b64 %rd50, %rd47, %rd49; cvt.u64.u32 %rd51, %r53; cvt.u64.u32 %rd52, %r52; shl.b64 %rd53, %rd52, 32; or.b64 %rd54, %rd51, %rd53; .loc 1 39 22 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:39:22 mul.f32x2 %rd55, %rd54, %rd54; .loc 1 41 23 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23 fma.rn.f32x2 %rd56, %rd50, %rd50, %rd55; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 cvt.u64.u32 %rd57, %r61; cvt.u64.u32 %rd58, %r60; shl.b64 %rd59, %rd58, 32; or.b64 %rd60, %rd57, %rd59; .loc 1 41 23 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23 fma.rn.f32x2 %rd61, %rd60, %rd60, %rd56; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 cvt.u64.u32 %rd62, %r69; cvt.u64.u32 %rd63, %r68; shl.b64 %rd64, %rd63, 32; or.b64 %rd65, %rd62, %rd64; .loc 1 41 23 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23 fma.rn.f32x2 %rd66, %rd65, %rd65, %rd61; .loc 1 26 37 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:26:37 mov.b64 {_, %r74}, %rd46; mov.b64 %rd67, {%r74, %r75}; add.f32x2 %rd68, %rd46, %rd67; mov.b64 {_, 
%r76}, %rd66; mov.b64 %rd69, {%r76, %r77}; add.f32x2 %rd70, %rd69, %rd68; add.f32x2 %rd71, %rd66, %rd70; mov.b64 {%r107, _}, %rd71; bra.uni $L__BB0_3; $L__BB0_1: // %.split.preheader .loc 1 36 34 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34 mad.wide.s32 %rd5, %r3, 4, %rd2; mov.b32 %r13, 0; mov.pred %p2, 0; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 // begin inline asm mov.u32 %r9, %r13; mov.u32 %r10, %r13; mov.u32 %r11, %r13; mov.u32 %r12, %r13; @%p2 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r9, %r10, %r11, %r12 }, [ %rd5 + 0 ], %rd4; // end inline asm .loc 1 36 34 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34 add.s64 %rd8, %rd5, 8192; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 // begin inline asm mov.u64 %rd9, 0x0; createpolicy.fractional.L2::evict_first.b64 %rd9, 1.0; // end inline asm // begin inline asm mov.u32 %r17, %r13; mov.u32 %r18, %r13; mov.u32 %r19, %r13; mov.u32 %r20, %r13; @%p2 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r17, %r18, %r19, %r20 }, [ %rd8 + 0 ], %rd9; // end inline asm .loc 1 36 34 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34 add.s64 %rd11, %rd5, 16384; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 // begin inline asm mov.u64 %rd12, 0x0; createpolicy.fractional.L2::evict_first.b64 %rd12, 1.0; // end inline asm // begin inline asm mov.u32 %r25, %r13; mov.u32 %r26, %r13; mov.u32 %r27, %r13; mov.u32 %r28, %r13; @%p2 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r25, %r26, %r27, %r28 }, [ %rd11 + 0 ], %rd12; // end inline asm .loc 1 36 34 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34 add.s64 %rd14, %rd5, 24576; .loc 1 36 51 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51 // begin inline asm mov.u64 %rd15, 0x0; createpolicy.fractional.L2::evict_first.b64 %rd15, 1.0; // end inline asm // begin inline asm mov.u32 %r33, %r13; 
mov.u32 %r34, %r13; mov.u32 %r35, %r13; mov.u32 %r36, %r13; @%p2 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r33, %r34, %r35, %r36 }, [ %rd14 + 0 ], %rd15; // end inline asm mov.b32 %r107, 0f00000000; $L__BB0_3: // %.split2.us .loc 1 26 37 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:26:37 and.b32 %r85, %r2, 31; $L__tmp1: .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] shfl.sync.bfly.b32 %r86, %r107, 16, 31, -1; .loc 2 261 15 // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] add.f32 %r87, %r107, %r86; .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] shfl.sync.bfly.b32 %r88, %r87, 8, 31, -1; .loc 2 261 15 // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] add.f32 %r89, %r87, %r88; .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] shfl.sync.bfly.b32 %r90, %r89, 4, 31, -1; .loc 2 261 15 // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] add.f32 %r91, %r89, %r90; .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] shfl.sync.bfly.b32 %r92, %r91, 2, 31, -1; .loc 2 261 15 // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] add.f32 %r93, %r91, %r92; .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] shfl.sync.bfly.b32 %r94, %r93, 1, 31, -1; .loc 2 261 15 // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] add.f32 %r79, %r93, %r94; .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] setp.eq.b32 %p10, %r85, 0; shr.u32 %r95, %r2, 3; and.b32 %r96, %r95, 60; mov.b32 %r97, global_smem; add.s32 %r78, %r97, %r96; // begin inline asm @%p10 st.shared.b32 [ %r78 + 0 
], %r79; // end inline asm bar.sync 0; setp.lt.u32 %p11, %r2, 16; add.s32 %r81, %r97, %r6; // begin inline asm @%p11 ld.shared.b32 %r80, [ %r81 + 0 ]; // end inline asm shfl.sync.bfly.b32 %r99, %r80, 8, 31, -1; .loc 2 261 15 // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] add.f32 %r100, %r80, %r99; .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] shfl.sync.bfly.b32 %r101, %r100, 4, 31, -1; .loc 2 261 15 // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] add.f32 %r102, %r100, %r101; .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] shfl.sync.bfly.b32 %r103, %r102, 2, 31, -1; .loc 2 261 15 // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] add.f32 %r104, %r102, %r103; .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] shfl.sync.bfly.b32 %r105, %r104, 1, 31, -1; .loc 2 261 15 // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] add.f32 %r83, %r104, %r105; .loc 2 291 36 // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ] setp.eq.b32 %p12, %r2, 0; // begin inline asm @%p12 st.shared.b32 [ %r81 + 0 ], %r83; // end inline asm bar.sync 0; ld.shared.b32 %r84, [global_smem]; $L__tmp2: .loc 1 44 25 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:44:25 mad.wide.u32 %rd72, %r1, 4, %rd3; .loc 1 44 36 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:44:36 and.b32 %r106, %r2, 511; setp.eq.b32 %p15, %r106, 0; and.pred %p13, %p1, %p15; // begin inline asm @%p13 st.global.b32 [ %rd72 + 0 ], { %r84 }; // end inline asm .loc 1 44 4 // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:44:4 ret; $L__tmp3: $L__func_end0: // -- End function } .file 1 
"/tmp/torchinductor_seonglae/jd/cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py" .file 2 "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/language/standard.py" .section .debug_abbrev { .b8 1 // Abbreviation Code .b8 17 // DW_TAG_compile_unit .b8 1 // DW_CHILDREN_yes .b8 37 // DW_AT_producer .b8 8 // DW_FORM_string .b8 19 // DW_AT_language .b8 5 // DW_FORM_data2 .b8 3 // DW_AT_name .b8 8 // DW_FORM_string .b8 16 // DW_AT_stmt_list .b8 6 // DW_FORM_data4 .b8 27 // DW_AT_comp_dir .b8 8 // DW_FORM_string .b8 0 // EOM(1) .b8 0 // EOM(2) .b8 2 // Abbreviation Code .b8 46 // DW_TAG_subprogram .b8 0 // DW_CHILDREN_no .b8 3 // DW_AT_name .b8 8 // DW_FORM_string .b8 32 // DW_AT_inline .b8 11 // DW_FORM_data1 .b8 0 // EOM(1) .b8 0 // EOM(2) .b8 3 // Abbreviation Code .b8 46 // DW_TAG_subprogram .b8 1 // DW_CHILDREN_yes .b8 17 // DW_AT_low_pc .b8 1 // DW_FORM_addr .b8 18 // DW_AT_high_pc .b8 1 // DW_FORM_addr .b8 49 // DW_AT_abstract_origin .b8 19 // DW_FORM_ref4 .b8 0 // EOM(1) .b8 0 // EOM(2) .b8 4 // Abbreviation Code .b8 29 // DW_TAG_inlined_subroutine .b8 0 // DW_CHILDREN_no .b8 49 // DW_AT_abstract_origin .b8 19 // DW_FORM_ref4 .b8 17 // DW_AT_low_pc .b8 1 // DW_FORM_addr .b8 18 // DW_AT_high_pc .b8 1 // DW_FORM_addr .b8 88 // DW_AT_call_file .b8 11 // DW_FORM_data1 .b8 89 // DW_AT_call_line .b8 11 // DW_FORM_data1 .b8 87 // DW_AT_call_column .b8 11 // DW_FORM_data1 .b8 0 // EOM(1) .b8 0 // EOM(2) .b8 0 // EOM(3) } .section .debug_info { .b32 204 // Length of Unit .b8 2 // DWARF version number .b8 0 .b32 .debug_abbrev // Offset Into Abbrev. 
Section .b8 8 // Address Size (in bytes) .b8 1 // Abbrev [1] 0xb:0xc5 DW_TAG_compile_unit .b8 116 // DW_AT_producer .b8 114 .b8 105 .b8 116 .b8 111 .b8 110 .b8 0 .b8 2 // DW_AT_language .b8 0 .b8 99 // DW_AT_name .b8 106 .b8 100 .b8 122 .b8 114 .b8 112 .b8 105 .b8 111 .b8 119 .b8 116 .b8 113 .b8 54 .b8 99 .b8 113 .b8 100 .b8 99 .b8 116 .b8 115 .b8 103 .b8 104 .b8 119 .b8 101 .b8 97 .b8 110 .b8 117 .b8 106 .b8 103 .b8 113 .b8 110 .b8 53 .b8 120 .b8 50 .b8 98 .b8 97 .b8 112 .b8 121 .b8 108 .b8 55 .b8 115 .b8 98 .b8 108 .b8 105 .b8 122 .b8 50 .b8 115 .b8 121 .b8 54 .b8 113 .b8 118 .b8 104 .b8 101 .b8 50 .b8 46 .b8 112 .b8 121 .b8 0 .b32 .debug_line // DW_AT_stmt_list .b8 47 // DW_AT_comp_dir .b8 116 .b8 109 .b8 112 .b8 47 .b8 116 .b8 111 .b8 114 .b8 99 .b8 104 .b8 105 .b8 110 .b8 100 .b8 117 .b8 99 .b8 116 .b8 111 .b8 114 .b8 95 .b8 115 .b8 101 .b8 111 .b8 110 .b8 103 .b8 108 .b8 97 .b8 101 .b8 47 .b8 106 .b8 100 .b8 0 .b8 2 // Abbrev [2] 0x70:0x31 DW_TAG_subprogram .b8 116 // DW_AT_name .b8 114 .b8 105 .b8 116 .b8 111 .b8 110 .b8 95 .b8 114 .b8 101 .b8 100 .b8 95 .b8 102 .b8 117 .b8 115 .b8 101 .b8 100 .b8 95 .b8 95 .b8 116 .b8 111 .b8 95 .b8 99 .b8 111 .b8 112 .b8 121 .b8 95 .b8 108 .b8 105 .b8 110 .b8 97 .b8 108 .b8 103 .b8 95 .b8 118 .b8 101 .b8 99 .b8 116 .b8 111 .b8 114 .b8 95 .b8 110 .b8 111 .b8 114 .b8 109 .b8 95 .b8 48 .b8 0 .b8 1 // DW_AT_inline .b8 3 // Abbrev [3] 0xa1:0x2e DW_TAG_subprogram .b64 $L__func_begin0 // DW_AT_low_pc .b64 $L__func_end0 // DW_AT_high_pc .b32 112 // DW_AT_abstract_origin .b8 4 // Abbrev [4] 0xb6:0x18 DW_TAG_inlined_subroutine .b32 112 // DW_AT_abstract_origin .b64 $L__tmp1 // DW_AT_low_pc .b64 $L__tmp2 // DW_AT_high_pc .b8 1 // DW_AT_call_file .b8 43 // DW_AT_call_line .b8 25 // DW_AT_call_column .b8 0 // End Of Children Mark .b8 0 // End Of Children Mark } .section .debug_macinfo { } ================================================================ please share the reproducer above with Triton project. 
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] Triton compilation failed: triton_red_fused__to_copy_linalg_vector_norm_0 [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] def triton_red_fused__to_copy_linalg_vector_norm_0(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr): [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] xnumel = 338 [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] r0_numel = 8192 [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] rnumel = r0_numel [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] RBLOCK: tl.constexpr = R0_BLOCK [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] xoffset = tl.program_id(0) * XBLOCK [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] xindex = xoffset + tl.arange(0, XBLOCK)[:, None] [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] xmask = xindex < xnumel [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] r0_base = tl.arange(0, R0_BLOCK)[None, :] [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] rbase = r0_base [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] x0 = 
xindex [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] _tmp5 = tl.full([XBLOCK, R0_BLOCK], 0, tl.float32) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] for r0_offset in range(0, r0_numel, R0_BLOCK): [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] r0_index = r0_offset + r0_base [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] r0_mask = r0_index < r0_numel [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] roffset = r0_offset [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] rindex = r0_index [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] r0_1 = r0_index [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] tmp0 = tl.load(in_ptr0 + (r0_1 + 8192*x0), r0_mask & xmask, eviction_policy='evict_first', other=0.0) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] tmp1 = tmp0.to(tl.float32) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] tmp2 = tmp1.to(tl.float32) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] tmp3 = tmp2 * tmp2 [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] tmp4 = tl.broadcast_to(tmp3, [XBLOCK, R0_BLOCK]) [rank0]:E1105 
04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] tmp6 = _tmp5 + tmp4 [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] _tmp5 = tl.where(r0_mask & xmask, tmp6, _tmp5) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] tmp5 = tl.sum(_tmp5, 1)[:, None] [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] tl.store(out_ptr0 + (x0), tmp5, xmask) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] metadata: {'signature': {'in_ptr0': '*fp32', 'out_ptr0': '*fp32', 'xnumel': 'i32', 'r0_numel': 'i32', 'XBLOCK': 'constexpr', 'R0_BLOCK': 'constexpr'}, 'device': 0, 'constants': {'XBLOCK': 1, 'R0_BLOCK': 2048}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 16, 'num_stages': 1, 'debug': True, 'cc': 121} [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] Traceback (most recent call last): [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 468, in make_cubin [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] subprocess.run(ptxas_cmd, check=True, close_fds=False, stderr=flog) [rank0]:E1105 04:43:56.518000 300217 
.venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/subprocess.py", line 526, in run [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] raise CalledProcessError(retcode, process.args, [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] subprocess.CalledProcessError: Command '['/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/bin/ptxas', '-lineinfo', '-v', '--gpu-name=sm_121a', '/tmp/tmpx9u3mic8.ptx', '-o', '/tmp/tmpx9u3mic8.ptx.o']' returned non-zero exit status 255. [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] During handling of the above exception, another exception occurred: [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] Traceback (most recent call last): [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 778, in _precompile_config [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] binary = triton.compile(*compile_args, **compile_kwargs) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] File 
"/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/compiler/compiler.py", line 320, in compile [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] next_module = compile_ir(module, metadata) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 520, in <lambda> [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.target.arch) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] raise PTXASError(error) [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] `ptxas` stderr: [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] ptxas fatal : Value 'sm_121a' is not defined for option 'gpu-name' [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] Repro command: 
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpx9u3mic8.ptx -o /tmp/tmpx9u3mic8.ptx.o [rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] [rank0]: Traceback (most recent call last): [rank0]: File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/runpy.py", line 196, in _run_module_as_main [rank0]: return _run_code(code, main_globals, None, [rank0]: File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/runpy.py", line 86, in _run_code [rank0]: exec(code, run_globals) [rank0]: File "/home/seonglae/nanochat/scripts/base_train.py", line 286, in <module> [rank0]: opt.step() [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 517, in wrapper [rank0]: out = func(*args, **kwargs) [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context [rank0]: return func(*args, **kwargs) [rank0]: File "/home/seonglae/nanochat/nanochat/muon.py", line 176, in step [rank0]: g = zeropower_via_newtonschulz5(g, steps=group["ns_steps"]) [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper [rank0]: raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1 [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 990, in _compile_fx_inner [rank0]: raise InductorError(e, currentframe()).with_traceback( [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 974, in _compile_fx_inner [rank0]: mb_compiled_graph = fx_codegen_and_compile( [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", 
line 1695, in fx_codegen_and_compile [rank0]: return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs) [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1505, in codegen_and_compile [rank0]: compiled_module = graph.compile_to_module() [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2319, in compile_to_module [rank0]: return self._compile_to_module() [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2329, in _compile_to_module [rank0]: mod = self._compile_to_module_lines(wrapper_code) [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2397, in _compile_to_module_lines [rank0]: mod = PyCodeCache.load_by_key_path( [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 3548, in load_by_key_path [rank0]: mod = _reload_python_module(key, path, set_sys_modules=in_toplevel) [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/runtime/compile_tasks.py", line 33, in _reload_python_module [rank0]: exec(code, mod.__dict__, mod.__dict__) [rank0]: File "/tmp/torchinductor_seonglae/ru/cru4n75qryyfhybd6gs3feyrxnkst4woonaugrrh6j6thesvegdk.py", line 49, in <module> [rank0]: triton_red_fused__to_copy_linalg_vector_norm_0 = async_compile.triton('triton_red_fused__to_copy_linalg_vector_norm_0', ''' [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/async_compile.py", line 500, in triton [rank0]: kernel.precompile( [rank0]: File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 448, in precompile [rank0]: self._precompile_worker() [rank0]: File 
"/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 474, in _precompile_worker [rank0]: raise NoTritonConfigsError( [rank0]: torch._inductor.exc.InductorError: NoTritonConfigsError: No valid triton configs. PTXASError: PTXAS error: Internal Triton PTX codegen error [rank0]: `ptxas` stderr: [rank0]: ptxas fatal : Value 'sm_121a' is not defined for option 'gpu-name' [rank0]: Repro command: /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpx9u3mic8.ptx -o /tmp/tmpx9u3mic8.ptx.o [rank0]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" [rank0]:[W1105 04:43:59.801267274 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) [rank0]:[W1105 04:44:03.362613554 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) E1105 04:44:06.417000 300096 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 300217) of binary: /home/seonglae/nanochat/.venv/bin/python3 Traceback (most recent call last): File "/home/seonglae/nanochat/.venv/bin/torchrun", line 10, in <module> sys.exit(main()) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper return f(*args, **kwargs) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 936, in main run(args) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run elastic_launch( File 
"/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ scripts.base_train FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-11-05_04:44:06 host : localhost rank : 0 (local_rank: 0) exitcode : 1 (pid: 300217) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================
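The run above died because the `ptxas` bundled inside the Triton wheel does not know the `sm_121a` target that the GB10 reports. One possible workaround, sketched here and not confirmed by these notes, is to point Triton at a newer `ptxas` from a system CUDA toolkit via the `TRITON_PTXAS_PATH` environment variable. The helper names and the candidate path `/usr/local/cuda/bin/ptxas` are assumptions for illustration, as is the idea that `ptxas --help` lists the architectures it accepts.

```python
import os
import subprocess


def ptxas_supports(ptxas_path, arch="sm_121a"):
    """Best-effort check: does this ptxas mention `arch` in its help text?

    Assumption: `ptxas --help` lists the allowed --gpu-name values.
    Returns False if the binary is missing or cannot be run.
    """
    try:
        result = subprocess.run(
            [ptxas_path, "--help"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return arch in (result.stdout + result.stderr)


def env_with_system_ptxas(candidates, base_env=None):
    """Return a copy of the environment with TRITON_PTXAS_PATH set to the
    first candidate that exists and is executable (unchanged if none is)."""
    env = dict(os.environ if base_env is None else base_env)
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            env["TRITON_PTXAS_PATH"] = path
            break
    return env


if __name__ == "__main__":
    # Hypothetical location of a CUDA toolkit ptxas on this machine.
    candidate = "/usr/local/cuda/bin/ptxas"
    print("supports sm_121a:", ptxas_supports(candidate))
    print("TRITON_PTXAS_PATH:", env_with_system_ptxas([candidate]).get("TRITON_PTXAS_PATH"))
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching `torchrun`, so the override applies only to that run.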
 
[W1105 09:36:28.443950219 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3 /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.) _C._set_float32_matmul_precision(precision) /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1. Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0) warnings.warn( [W1105 09:36:43.432882525 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. 
err=-3 Traceback (most recent call last): File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/home/seonglae/nanochat/scripts/base_train.py", line 71, in <module> ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type) File "/home/seonglae/nanochat/nanochat/common.py", line 166, in compute_init dist.init_process_group(backend="nccl", device_id=device) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper return func(*args, **kwargs) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper func_return = func(*args, **kwargs) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1769, in init_process_group default_pg, _ = _new_process_group_helper( File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2134, in _new_process_group_helper eager_backend.eager_connect_single_device(device_id) torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:94, remote process exited or there was a network error, NCCL version 2.27.5 ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. Last error: socketPollConnect: connect to 169.254.42.30<39975> returned Connection refused, exceeded error retry count after 35 attempts [W1105 09:37:43.611671299 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) [W1105 09:37:43.795761788 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) E1105 09:37:43.888000 512276 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 513624) of binary: /home/seonglae/nanochat/.venv/bin/python3 Traceback (most recent call last): File "/home/seonglae/nanochat/.venv/bin/torchrun", line 10, in <module> sys.exit(main()) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper return f(*args, **kwargs) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 936, in main run(args) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run elastic_launch( File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ scripts.base_train FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-11-05_09:37:43 host : localhost rank : 1 (local_rank: 0) exitcode : 1 (pid: 513624) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ==========================
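The second failure is a plain `Connection refused` from NCCL while connecting to 169.254.42.30. Before retrying, it can help to confirm from the worker that the master's rendezvous port accepts TCP connections at all. The sketch below is a generic reachability probe using only the standard library; the function name is hypothetical, and port 29500 is taken from the launch command recorded further down in these notes.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network, and timeouts.
        return False


if __name__ == "__main__":
    # Run on the worker node while torchrun is already waiting on the master.
    print("master reachable:", can_connect("169.254.42.30", 29500))
```

If the probe succeeds and NCCL still fails, restricting NCCL to the interface that carries the link-local address with `NCCL_SOCKET_IFNAME` is a common next step; whether that was the fix here is not recorded in these notes.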
 
[W1105 09:58:18.433455839 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3 █████ █████ ░░███ ░░███ ████████ ██████ ████████ ██████ ██████ ░███████ ██████ ███████ ░░███░░███ ░░░░░███ ░░███░░███ ███░░███ ███░░███ ░███░░███ ░░░░░███░░░███░ ░███ ░███ ███████ ░███ ░███ ░███ ░███░███ ░░░ ░███ ░███ ███████ ░███ ░███ ░███ ███░░███ ░███ ░███ ░███ ░███░███ ███ ░███ ░███ ███░░███ ░███ ███ ████ █████░░████████ ████ █████░░██████ ░░██████ ████ █████░░███████ ░░█████ ░░░░ ░░░░░ ░░░░░░░░ ░░░░ ░░░░░ ░░░░░░ ░░░░░░ ░░░░ ░░░░░ ░░░░░░░░ ░░░░░ Overriding: depth = 26 Overriding: device_batch_size = 16 Autodetected device type: cuda /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.) _C._set_float32_matmul_precision(precision) /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1. 
Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0) warnings.warn( 2025-11-05 09:58:20,608 - nanochat.common - [32m[1mINFO[0m - Distributed world size: 2 Vocab size: 65,536 num_layers: 26 model_dim: 1664 num_heads: 13 num_kv_heads: 13 Tokens / micro-batch / rank: 16 x 2048 = 32,768 Tokens / micro-batch: 65,536 Total batch size 524,288 => gradient accumulation steps: 8 Number of parameters: 1,081,999,360 Estimated FLOPs per token: 6.900941e+09 Calculated number of iterations from target data:param ratio: 41,275 Total number of training tokens: 21,639,987,200 Tokens : Params ratio: 20.00 Total training FLOPs estimate: 1.493363e+20 Scaling the LR for the AdamW parameters ∝1/√(1664/768) = 0.679366 Muon: Grouping 104 params of shape torch.Size([1664, 1664]), device cuda:0, dtype torch.float32 Muon: Grouping 26 params of shape torch.Size([1664, 6656]), device cuda:0, dtype torch.float32 Muon: Grouping 26 params of shape torch.Size([6656, 1664]), device cuda:0, dtype torch.float32 W1105 10:00:58.433000 1768678 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1770408 closing signal SIGTERM Traceback (most recent call last): File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 717, in run result = self._invoke_run(role) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 881, in _invoke_run time.sleep(monitor_interval) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 1768678 got signal: 15 During handling of the above exception, another exception occurred: Traceback (most recent call last): File 
"/home/seonglae/nanochat/.venv/bin/torchrun", line 10, in <module> sys.exit(main()) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper return f(*args, **kwargs) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 936, in main run(args) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run elastic_launch( File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 284, in launch_agent result = agent.run() File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper result = f(*args, **kwargs) File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 725, in run logger.warning("Received %s death signal, shutting down workers", e.sigval) File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/logging/__init__.py", line 1489, in warning self._log(WARNING, msg, args, **kwargs) File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/logging/__init__.py", line 1622, in _log record = self.makeRecord(self.name, level, fn, lno, msg, args, File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/logging/__init__.py", line 1591, in makeRecord rv = _logRecordFactory(name, level, fn, lno, msg, args, exc_info, func, File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/logging/__init__.py", line 317, in __init__ self.filename = os.path.basename(pathname) File 
"/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/posixpath.py", line 144, in basename i = p.rfind(sep) + 1 File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 1768678 got signal: 15

Success

[W1105 12:48:44.082950627 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3 █████ █████ ░░███ ░░███ ████████ ██████ ████████ ██████ ██████ ░███████ ██████ ███████ ░░███░░███ ░░░░░███ ░░███░░███ ███░░███ ███░░███ ░███░░███ ░░░░░███░░░███░ ░███ ░███ ███████ ░███ ░███ ░███ ░███░███ ░░░ ░███ ░███ ███████ ░███ ░███ ░███ ███░░███ ░███ ░███ ░███ ░███░███ ███ ░███ ░███ ███░░███ ░███ ███ ████ █████░░████████ ████ █████░░██████ ░░██████ ████ █████░░███████ ░░█████ ░░░░ ░░░░░ ░░░░░░░░ ░░░░ ░░░░░ ░░░░░░ ░░░░░░ ░░░░ ░░░░░ ░░░░░░░░ ░░░░░ Overriding: depth = 26 Overriding: device_batch_size = 16 Autodetected device type: cuda /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.) _C._set_float32_matmul_precision(precision) /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1. Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0) warnings.warn( /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning. warnings.warn( # warn only once [rank0]:[W1105 12:48:45.225508329 ProcessGroupNCCL.cpp:5068] Guessing device ID based on global rank. 
This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
2025-11-05 12:48:46,210 - nanochat.common - INFO - Distributed world size: 2
Vocab size: 65,536
num_layers: 26
model_dim: 1664
num_heads: 13
num_kv_heads: 13
Tokens / micro-batch / rank: 16 x 2048 = 32,768
Tokens / micro-batch: 65,536
Total batch size 524,288 => gradient accumulation steps: 8
Number of parameters: 1,081,999,360
Estimated FLOPs per token: 6.900941e+09
Calculated number of iterations from target data:param ratio: 41,275
Total number of training tokens: 21,639,987,200
Tokens : Params ratio: 20.00
Total training FLOPs estimate: 1.493363e+20
Scaling the LR for the AdamW parameters ∝1/√(1664/768) = 0.679366
Muon: Grouping 104 params of shape torch.Size([1664, 1664]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([1664, 6656]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([6656, 1664]), device cuda:0, dtype torch.float32
Step 00000 | Validation bpb: 3.3071
step 00000/41275 (0.00%) | loss: 11.090360 | lrm: 1.00 | dt: 167199.40ms | tok/sec: 3,135 | mfu: 1.09 | total time: 0.00m
step 00001/41275 (0.00%) | loss: 10.825550 | lrm: 1.00 | dt: 192320.67ms | tok/sec: 2,726 | mfu: 0.95 | total time: 0.00m
step 00002/41275 (0.00%) | loss: 10.153048 | lrm: 1.00 | dt: 184528.47ms | tok/sec: 2,841 | mfu: 0.99 | total time: 0.00m
step 00003/41275 (0.01%) | loss: 9.493440 | lrm: 1.00 | dt: 176406.38ms | tok/sec: 2,972 | mfu: 1.04 | total time: 0.00m
step 00004/41275 (0.01%) | loss: 8.972018 | lrm: 1.00 | dt: 171745.83ms | tok/sec: 3,052 | mfu: 1.07 | total time: 0.00m
step 00005/41275 (0.01%) | loss: 8.623263 | lrm: 1.00 | dt: 176813.53ms | tok/sec: 2,965 | mfu: 1.03 | total time: 0.00m
step 00006/41275 (0.01%) | loss: 8.471020 | lrm: 1.00 | dt: 174431.29ms | tok/sec: 3,005 | mfu: 1.05 | total time: 0.00m
step 00007/41275 (0.02%) | loss: 8.283705 | lrm: 1.00 | dt: 174112.48ms | tok/sec: 3,011 | mfu: 1.05 | total time: 0.00m
step 00008/41275 (0.02%) | loss: 8.111838 | lrm: 1.00 | dt: 177996.02ms | tok/sec: 2,945 | mfu: 1.03 | total time: 0.00m
step 00009/41275 (0.02%) | loss: 7.949561 | lrm: 1.00 | dt: 176105.21ms | tok/sec: 2,977 | mfu: 1.04 | total time: 0.00m
step 00010/41275 (0.02%) | loss: 7.827967 | lrm: 1.00 | dt: 169717.51ms | tok/sec: 3,089 | mfu: 1.08 | total time: 0.00m
step 00011/41275 (0.03%) | loss: 7.715822 | lrm: 1.00 | dt: 168858.79ms | tok/sec: 3,104 | mfu: 1.08 | total time: 2.81m
step 00012/41275 (0.03%) | loss: 7.641937 | lrm: 1.00 | dt: 171440.03ms | tok/sec: 3,058 | mfu: 1.07 | total time: 5.67m
step 00013/41275 (0.03%) | loss: 7.550511 | lrm: 1.00 | dt: 176084.73ms | tok/sec: 2,977 | mfu: 1.04 | total time: 8.61m
step 00014/41275 (0.03%) | loss: 7.450517 | lrm: 1.00 | dt: 176033.10ms | tok/sec: 2,978 | mfu: 1.04 | total time: 11.54m
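The numbers in this log can be cross-checked with a little arithmetic: the gradient accumulation count follows from the batch sizes, `tok/sec` is the total batch divided by the step time, and the step count times the step time gives a rough wall-clock estimate. The sketch below reproduces those relations; the function names are hypothetical, and the 175 s average step time is an eyeballed approximation of the logged `dt` values (about 167 s to 192 s), not a measured mean.

```python
TOTAL_BATCH_TOKENS = 524_288   # "Total batch size" from the log
NUM_ITERATIONS = 41_275        # "Calculated number of iterations" from the log


def grad_accum_steps(total_batch, device_batch_size, seq_len, world_size):
    """Micro-batches each rank must accumulate to reach the total batch size."""
    tokens_per_micro_batch = device_batch_size * seq_len * world_size
    if total_batch % tokens_per_micro_batch != 0:
        raise ValueError("total batch must be a multiple of the micro-batch size")
    return total_batch // tokens_per_micro_batch


def tokens_per_second(dt_ms, batch_tokens=TOTAL_BATCH_TOKENS):
    """Throughput implied by one optimizer step taking dt_ms milliseconds."""
    return batch_tokens / (dt_ms / 1000.0)


def estimated_days(avg_dt_ms, steps=NUM_ITERATIONS):
    """Wall-clock days for the whole run at a constant step time."""
    return steps * avg_dt_ms / 1000.0 / 86_400


if __name__ == "__main__":
    print(grad_accum_steps(TOTAL_BATCH_TOKENS, 16, 2048, 2))  # 8, as in the log
    print(int(tokens_per_second(167_199.40)))                 # step 00000
    print(round(estimated_days(175_000), 1))                  # approximate run length
```

At roughly 175 s per step, the full 41,275 steps would take on the order of 84 days on this two-node setup, which puts the logged `mfu` of about 1% in perspective.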
cd /home/seonglae/nanochat && nohup uv run torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=169.254.42.30 --master_port=29500 -m scripts.base_train -- --depth=26 --device_batch_size=16 > /home/seonglae/master_train.log 2>&1 &
ssh spark-9ea3 "cd /home/seonglae/nanochat && nohup /home/seonglae/nanochat/.venv/bin/torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=169.254.42.30 --master_port=29500 -m scripts.base_train -- --depth=26 --device_batch_size=16 > /home/seonglae/worker_train.log 2>&1 &" && echo "Started"
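The two launch commands are identical apart from `--node_rank` and how `torchrun` is invoked on each machine, which makes it easy to let them drift apart when editing by hand. As a minimal sketch, the shared argument list could be generated from one place; the function name is hypothetical, and the defaults simply mirror the values in the command above.

```python
import shlex


def torchrun_args(node_rank, master_addr="169.254.42.30", master_port=29500,
                  nnodes=2, depth=26, device_batch_size=16):
    """Build the torchrun argument list for one node of the two-node run."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        "--nproc_per_node=1",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "-m", "scripts.base_train",
        "--",
        f"--depth={depth}",
        f"--device_batch_size={device_batch_size}",
    ]


if __name__ == "__main__":
    # Print one shell-quoted command line per node.
    for rank in (0, 1):
        print(shlex.join(torchrun_args(rank)))
```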
❯ cat worker_train.log [W1105 14:05:05.509280926 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3 /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.) _C._set_float32_matmul_precision(precision) /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1. Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0) warnings.warn( [W1105 14:05:06.016375996 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3

Config change

 
 
 
 
 
