Playing with DGX Spark

nanochat

cuda 12.8 (cu128), torch 13: not supported


git clone
uv sync
uv pip uninstall torch
uv pip install torch --


[W1105 04:19:13.033074459 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3

                                                       █████                █████
                                                      ░░███                ░░███
     ████████    ██████   ████████    ██████   ██████  ░███████    ██████  ███████
    ░░███░░███  ░░░░░███ ░░███░░███  ███░░███ ███░░███ ░███░░███  ░░░░░███░░░███░
     ░███ ░███   ███████  ░███ ░███ ░███ ░███░███ ░░░  ░███ ░███   ███████  ░███
     ░███ ░███  ███░░███  ░███ ░███ ░███ ░███░███  ███ ░███ ░███  ███░░███  ░███ ███
     ████ █████░░████████ ████ █████░░██████ ░░██████  ████ █████░░███████  ░░█████
    ░░░░ ░░░░░  ░░░░░░░░ ░░░░ ░░░░░  ░░░░░░   ░░░░░░  ░░░░ ░░░░░  ░░░░░░░░   ░░░░░
    
Overriding: depth = 26
Overriding: device_batch_size = 16
Autodetected device type: cuda
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  warnings.warn(
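The warning above says this PyTorch build covers CUDA capabilities 8.0 through 12.0, while the GB10 reports 12.1. A minimal sketch of that range check follows. `capability_in_range` and `detect_capability` are hypothetical helper names, the bounds are copied from the warning rather than queried from PyTorch, and the torch probe is guarded so the snippet also runs on a machine without torch.

```python
from typing import Optional, Tuple

Capability = Tuple[int, int]


def capability_in_range(cap: Capability, lo: Capability, hi: Capability) -> bool:
    """Return True if lo <= cap <= hi, comparing (major, minor) tuples."""
    return lo <= cap <= hi


def detect_capability() -> Optional[Capability]:
    """Return the compute capability of GPU 0, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return (major, minor)


# Values taken from the warning in the log above (not queried from PyTorch).
SUPPORTED_MIN, SUPPORTED_MAX = (8, 0), (12, 0)
GB10 = (12, 1)

print(capability_in_range(GB10, SUPPORTED_MIN, SUPPORTED_MAX))  # False
```

On the machine itself, `detect_capability()` could replace the hard-coded `GB10` tuple.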
2025-11-05 04:19:16,035 - nanochat.common - INFO - Distributed world size: 2
Vocab size: 65,536
num_layers: 26
model_dim: 1664
num_heads: 13
num_kv_heads: 13
Tokens / micro-batch / rank: 16 x 2048 = 32,768
Tokens / micro-batch: 65,536
Total batch size 524,288 => gradient accumulation steps: 8
Number of parameters: 1,081,999,360
Estimated FLOPs per token: 6.900941e+09
Calculated number of iterations from target data:param ratio: 41,275
Total number of training tokens: 21,639,987,200
Tokens : Params ratio: 20.00
Total training FLOPs estimate: 1.493363e+20
Scaling the LR for the AdamW parameters ∝1/√(1664/768) = 0.679366
Muon: Grouping 104 params of shape torch.Size([1664, 1664]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([1664, 6656]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([6656, 1664]), device cuda:0, dtype torch.float32
Step 00000 | Validation bpb: 3.3071
[rank0]:W1105 04:43:56.152000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/utils.py:1558] [2/0] Not enough SMs to use max_autotune_gemm mode
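The configuration numbers printed above can be reproduced with a few lines of arithmetic. This is only a sketch that takes its inputs from the log itself. The parameter layout (four 1664x1664 attention matrices and two MLP matrices per layer, plus separate input and output embeddings) is inferred from the Muon grouping lines and is an assumption, not something read from the nanochat source.

```python
# Inputs copied from the log above.
vocab_size = 65_536
num_layers = 26
model_dim = 1664
device_batch_size = 16
seq_len = 2048
world_size = 2
total_batch_size = 524_288
tokens_per_param = 20
flops_per_token = 6.900941e9

# Batch bookkeeping.
tokens_per_rank = device_batch_size * seq_len
tokens_per_micro_batch = tokens_per_rank * world_size
grad_accum_steps = total_batch_size // tokens_per_micro_batch

# Parameter count under the assumed layout (see the note above).
attn_params = 4 * model_dim * model_dim        # 104 = 26 * 4 square matrices
mlp_params = 2 * model_dim * (4 * model_dim)   # 1664x6656 and 6656x1664
embedding_params = 2 * vocab_size * model_dim  # input and output embeddings
num_params = num_layers * (attn_params + mlp_params) + embedding_params

# Training horizon from the tokens:params ratio.
target_tokens = tokens_per_param * num_params
num_iterations = target_tokens // total_batch_size
total_tokens = num_iterations * total_batch_size
total_flops = flops_per_token * total_tokens

# AdamW learning-rate scale, proportional to 1/sqrt(model_dim / 768).
lr_scale = (model_dim / 768) ** -0.5

print(f"tokens / micro-batch / rank: {tokens_per_rank:,}")
print(f"tokens / micro-batch: {tokens_per_micro_batch:,}")
print(f"gradient accumulation steps: {grad_accum_steps}")
print(f"parameters: {num_params:,}")
print(f"iterations: {num_iterations:,}")
print(f"training tokens: {total_tokens:,}")
print(f"training FLOPs: {total_flops:e}")
print(f"LR scale: {lr_scale:.6f}")
```

Each printed value can be compared line by line against the log.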


================================================================
Internal Triton PTX codegen error
`ptxas` stderr:
ptxas fatal   : Value 'sm_121a' is not defined for option 'gpu-name'

Repro command: /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpx9u3mic8.ptx -o /tmp/tmpx9u3mic8.ptx.o


//
// Generated by LLVM NVPTX Back-End
//

.version 8.7
.target sm_121a
.address_size 64

	// .globl	triton_red_fused__to_copy_linalg_vector_norm_0 // -- Begin function triton_red_fused__to_copy_linalg_vector_norm_0
.extern .shared .align 16 .b8 global_smem[];
                                        // @triton_red_fused__to_copy_linalg_vector_norm_0
.visible .entry triton_red_fused__to_copy_linalg_vector_norm_0(
	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_linalg_vector_norm_0_param_0,
	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_linalg_vector_norm_0_param_1,
	.param .u32 triton_red_fused__to_copy_linalg_vector_norm_0_param_2,
	.param .u32 triton_red_fused__to_copy_linalg_vector_norm_0_param_3,
	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_linalg_vector_norm_0_param_4,
	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_linalg_vector_norm_0_param_5
)
.reqntid 512
{
	.reg .pred 	%p<16>;
	.reg .b32 	%r<108>;
	.reg .b64 	%rd<73>;
	.loc	1 18 0                          // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:18:0
$L__func_begin0:
	.loc	1 18 0                          // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:18:0

// %bb.0:
	ld.param.b64 	%rd3, [triton_red_fused__to_copy_linalg_vector_norm_0_param_1];
	ld.param.b64 	%rd2, [triton_red_fused__to_copy_linalg_vector_norm_0_param_0];
$L__tmp0:
	.loc	1 23 28                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:23:28
	mov.u32 	%r1, %ctaid.x;
	.loc	1 25 21                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:25:21
	setp.lt.u32 	%p1, %r1, 338;
	.loc	1 26 37                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:26:37
	mov.u32 	%r2, %tid.x;
	shl.b32 	%r6, %r2, 2;
	and.b32 	%r7, %r6, 2044;
	.loc	1 36 46                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:46
	shl.b32 	%r8, %r1, 13;
	or.b32 	%r3, %r7, %r8;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	// begin inline asm
	mov.u64 %rd4, 0x0;
	createpolicy.fractional.L2::evict_first.b64 %rd4, 1.0;
	// end inline asm
	@%p1 bra 	$L__BB0_2;
	bra.uni 	$L__BB0_1;
$L__BB0_2:                              // %.split.us.preheader
	.loc	1 36 34                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34
	mad.wide.u32 	%rd16, %r3, 4, %rd2;
	mov.b32 	%r46, 0;
	mov.pred 	%p6, -1;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	// begin inline asm
	mov.u32 %r42, %r46;
	mov.u32 %r43, %r46;
	mov.u32 %r44, %r46;
	mov.u32 %r45, %r46;
	@%p6 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r42, %r43, %r44, %r45 }, [ %rd16 + 0 ], %rd4;
	// end inline asm
	.loc	1 36 34                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34
	add.s64 	%rd19, %rd16, 8192;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	// begin inline asm
	mov.u64 %rd20, 0x0;
	createpolicy.fractional.L2::evict_first.b64 %rd20, 1.0;
	// end inline asm
	// begin inline asm
	mov.u32 %r50, %r46;
	mov.u32 %r51, %r46;
	mov.u32 %r52, %r46;
	mov.u32 %r53, %r46;
	@%p6 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r50, %r51, %r52, %r53 }, [ %rd19 + 0 ], %rd20;
	// end inline asm
	.loc	1 36 34                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34
	add.s64 	%rd22, %rd16, 16384;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	// begin inline asm
	mov.u64 %rd23, 0x0;
	createpolicy.fractional.L2::evict_first.b64 %rd23, 1.0;
	// end inline asm
	// begin inline asm
	mov.u32 %r58, %r46;
	mov.u32 %r59, %r46;
	mov.u32 %r60, %r46;
	mov.u32 %r61, %r46;
	@%p6 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r58, %r59, %r60, %r61 }, [ %rd22 + 0 ], %rd23;
	// end inline asm
	.loc	1 36 34                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34
	add.s64 	%rd25, %rd16, 24576;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	// begin inline asm
	mov.u64 %rd26, 0x0;
	createpolicy.fractional.L2::evict_first.b64 %rd26, 1.0;
	// end inline asm
	// begin inline asm
	mov.u32 %r66, %r46;
	mov.u32 %r67, %r46;
	mov.u32 %r68, %r46;
	mov.u32 %r69, %r46;
	@%p6 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r66, %r67, %r68, %r69 }, [ %rd25 + 0 ], %rd26;
	// end inline asm
	cvt.u64.u32 	%rd27, %r42;
	cvt.u64.u32 	%rd28, %r43;
	shl.b64 	%rd29, %rd28, 32;
	or.b64 	%rd30, %rd27, %rd29;
	cvt.u64.u32 	%rd31, %r50;
	cvt.u64.u32 	%rd32, %r51;
	shl.b64 	%rd33, %rd32, 32;
	or.b64 	%rd34, %rd31, %rd33;
	.loc	1 39 22                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:39:22
	mul.f32x2 	%rd35, %rd34, %rd34;
	.loc	1 41 23                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23
	fma.rn.f32x2 	%rd36, %rd30, %rd30, %rd35;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	cvt.u64.u32 	%rd37, %r58;
	cvt.u64.u32 	%rd38, %r59;
	shl.b64 	%rd39, %rd38, 32;
	or.b64 	%rd40, %rd37, %rd39;
	.loc	1 41 23                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23
	fma.rn.f32x2 	%rd41, %rd40, %rd40, %rd36;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	cvt.u64.u32 	%rd42, %r66;
	cvt.u64.u32 	%rd43, %r67;
	shl.b64 	%rd44, %rd43, 32;
	or.b64 	%rd45, %rd42, %rd44;
	.loc	1 41 23                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23
	fma.rn.f32x2 	%rd46, %rd45, %rd45, %rd41;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	cvt.u64.u32 	%rd47, %r45;
	cvt.u64.u32 	%rd48, %r44;
	shl.b64 	%rd49, %rd48, 32;
	or.b64 	%rd50, %rd47, %rd49;
	cvt.u64.u32 	%rd51, %r53;
	cvt.u64.u32 	%rd52, %r52;
	shl.b64 	%rd53, %rd52, 32;
	or.b64 	%rd54, %rd51, %rd53;
	.loc	1 39 22                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:39:22
	mul.f32x2 	%rd55, %rd54, %rd54;
	.loc	1 41 23                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23
	fma.rn.f32x2 	%rd56, %rd50, %rd50, %rd55;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	cvt.u64.u32 	%rd57, %r61;
	cvt.u64.u32 	%rd58, %r60;
	shl.b64 	%rd59, %rd58, 32;
	or.b64 	%rd60, %rd57, %rd59;
	.loc	1 41 23                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23
	fma.rn.f32x2 	%rd61, %rd60, %rd60, %rd56;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	cvt.u64.u32 	%rd62, %r69;
	cvt.u64.u32 	%rd63, %r68;
	shl.b64 	%rd64, %rd63, 32;
	or.b64 	%rd65, %rd62, %rd64;
	.loc	1 41 23                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:41:23
	fma.rn.f32x2 	%rd66, %rd65, %rd65, %rd61;
	.loc	1 26 37                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:26:37
	mov.b64 	{_, %r74}, %rd46;
	mov.b64 	%rd67, {%r74, %r75};
	add.f32x2 	%rd68, %rd46, %rd67;
	mov.b64 	{_, %r76}, %rd66;
	mov.b64 	%rd69, {%r76, %r77};
	add.f32x2 	%rd70, %rd69, %rd68;
	add.f32x2 	%rd71, %rd66, %rd70;
	mov.b64 	{%r107, _}, %rd71;
	bra.uni 	$L__BB0_3;
$L__BB0_1:                              // %.split.preheader
	.loc	1 36 34                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34
	mad.wide.s32 	%rd5, %r3, 4, %rd2;
	mov.b32 	%r13, 0;
	mov.pred 	%p2, 0;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	// begin inline asm
	mov.u32 %r9, %r13;
	mov.u32 %r10, %r13;
	mov.u32 %r11, %r13;
	mov.u32 %r12, %r13;
	@%p2 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r9, %r10, %r11, %r12 }, [ %rd5 + 0 ], %rd4;
	// end inline asm
	.loc	1 36 34                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34
	add.s64 	%rd8, %rd5, 8192;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	// begin inline asm
	mov.u64 %rd9, 0x0;
	createpolicy.fractional.L2::evict_first.b64 %rd9, 1.0;
	// end inline asm
	// begin inline asm
	mov.u32 %r17, %r13;
	mov.u32 %r18, %r13;
	mov.u32 %r19, %r13;
	mov.u32 %r20, %r13;
	@%p2 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r17, %r18, %r19, %r20 }, [ %rd8 + 0 ], %rd9;
	// end inline asm
	.loc	1 36 34                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34
	add.s64 	%rd11, %rd5, 16384;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	// begin inline asm
	mov.u64 %rd12, 0x0;
	createpolicy.fractional.L2::evict_first.b64 %rd12, 1.0;
	// end inline asm
	// begin inline asm
	mov.u32 %r25, %r13;
	mov.u32 %r26, %r13;
	mov.u32 %r27, %r13;
	mov.u32 %r28, %r13;
	@%p2 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r25, %r26, %r27, %r28 }, [ %rd11 + 0 ], %rd12;
	// end inline asm
	.loc	1 36 34                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:34
	add.s64 	%rd14, %rd5, 24576;
	.loc	1 36 51                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:36:51
	// begin inline asm
	mov.u64 %rd15, 0x0;
	createpolicy.fractional.L2::evict_first.b64 %rd15, 1.0;
	// end inline asm
	// begin inline asm
	mov.u32 %r33, %r13;
	mov.u32 %r34, %r13;
	mov.u32 %r35, %r13;
	mov.u32 %r36, %r13;
	@%p2 ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r33, %r34, %r35, %r36 }, [ %rd14 + 0 ], %rd15;
	// end inline asm
	mov.b32 	%r107, 0f00000000;
$L__BB0_3:                              // %.split2.us
	.loc	1 26 37                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:26:37
	and.b32 	%r85, %r2, 31;
$L__tmp1:
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	shfl.sync.bfly.b32 	%r86, %r107, 16, 31, -1;
	.loc	2 261 15                        // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	add.f32 	%r87, %r107, %r86;
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	shfl.sync.bfly.b32 	%r88, %r87, 8, 31, -1;
	.loc	2 261 15                        // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	add.f32 	%r89, %r87, %r88;
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	shfl.sync.bfly.b32 	%r90, %r89, 4, 31, -1;
	.loc	2 261 15                        // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	add.f32 	%r91, %r89, %r90;
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	shfl.sync.bfly.b32 	%r92, %r91, 2, 31, -1;
	.loc	2 261 15                        // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	add.f32 	%r93, %r91, %r92;
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	shfl.sync.bfly.b32 	%r94, %r93, 1, 31, -1;
	.loc	2 261 15                        // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	add.f32 	%r79, %r93, %r94;
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	setp.eq.b32 	%p10, %r85, 0;
	shr.u32 	%r95, %r2, 3;
	and.b32 	%r96, %r95, 60;
	mov.b32 	%r97, global_smem;
	add.s32 	%r78, %r97, %r96;
	// begin inline asm
	@%p10 st.shared.b32 [ %r78 + 0 ], %r79;
	// end inline asm
	bar.sync 	0;
	setp.lt.u32 	%p11, %r2, 16;
	add.s32 	%r81, %r97, %r6;
	// begin inline asm
	@%p11 ld.shared.b32 %r80, [ %r81 + 0 ];
	// end inline asm
	shfl.sync.bfly.b32 	%r99, %r80, 8, 31, -1;
	.loc	2 261 15                        // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	add.f32 	%r100, %r80, %r99;
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	shfl.sync.bfly.b32 	%r101, %r100, 4, 31, -1;
	.loc	2 261 15                        // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	add.f32 	%r102, %r100, %r101;
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	shfl.sync.bfly.b32 	%r103, %r102, 2, 31, -1;
	.loc	2 261 15                        // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	add.f32 	%r104, %r102, %r103;
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	shfl.sync.bfly.b32 	%r105, %r104, 1, 31, -1;
	.loc	2 261 15                        // standard.py:261:15 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	add.f32 	%r83, %r104, %r105;
	.loc	2 291 36                        // standard.py:291:36 @[ cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:43:25 ]
	setp.eq.b32 	%p12, %r2, 0;
	// begin inline asm
	@%p12 st.shared.b32 [ %r81 + 0 ], %r83;
	// end inline asm
	bar.sync 	0;
	ld.shared.b32 	%r84, [global_smem];
$L__tmp2:
	.loc	1 44 25                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:44:25
	mad.wide.u32 	%rd72, %r1, 4, %rd3;
	.loc	1 44 36                         // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:44:36
	and.b32 	%r106, %r2, 511;
	setp.eq.b32 	%p15, %r106, 0;
	and.pred 	%p13, %p1, %p15;
	// begin inline asm
	@%p13 st.global.b32 [ %rd72 + 0 ], { %r84 };
	// end inline asm
	.loc	1 44 4                          // cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py:44:4
	ret;
$L__tmp3:
$L__func_end0:
                                        // -- End function
}
	.file	1 "/tmp/torchinductor_seonglae/jd/cjdzrpiowtq6cqdctsghweanujgqn5x2bapyl7sbliz2sy6qvhe2.py"
	.file	2 "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/language/standard.py"
	.section	.debug_abbrev
	{
.b8 1                                   // Abbreviation Code
.b8 17                                  // DW_TAG_compile_unit
.b8 1                                   // DW_CHILDREN_yes
.b8 37                                  // DW_AT_producer
.b8 8                                   // DW_FORM_string
.b8 19                                  // DW_AT_language
.b8 5                                   // DW_FORM_data2
.b8 3                                   // DW_AT_name
.b8 8                                   // DW_FORM_string
.b8 16                                  // DW_AT_stmt_list
.b8 6                                   // DW_FORM_data4
.b8 27                                  // DW_AT_comp_dir
.b8 8                                   // DW_FORM_string
.b8 0                                   // EOM(1)
.b8 0                                   // EOM(2)
.b8 2                                   // Abbreviation Code
.b8 46                                  // DW_TAG_subprogram
.b8 0                                   // DW_CHILDREN_no
.b8 3                                   // DW_AT_name
.b8 8                                   // DW_FORM_string
.b8 32                                  // DW_AT_inline
.b8 11                                  // DW_FORM_data1
.b8 0                                   // EOM(1)
.b8 0                                   // EOM(2)
.b8 3                                   // Abbreviation Code
.b8 46                                  // DW_TAG_subprogram
.b8 1                                   // DW_CHILDREN_yes
.b8 17                                  // DW_AT_low_pc
.b8 1                                   // DW_FORM_addr
.b8 18                                  // DW_AT_high_pc
.b8 1                                   // DW_FORM_addr
.b8 49                                  // DW_AT_abstract_origin
.b8 19                                  // DW_FORM_ref4
.b8 0                                   // EOM(1)
.b8 0                                   // EOM(2)
.b8 4                                   // Abbreviation Code
.b8 29                                  // DW_TAG_inlined_subroutine
.b8 0                                   // DW_CHILDREN_no
.b8 49                                  // DW_AT_abstract_origin
.b8 19                                  // DW_FORM_ref4
.b8 17                                  // DW_AT_low_pc
.b8 1                                   // DW_FORM_addr
.b8 18                                  // DW_AT_high_pc
.b8 1                                   // DW_FORM_addr
.b8 88                                  // DW_AT_call_file
.b8 11                                  // DW_FORM_data1
.b8 89                                  // DW_AT_call_line
.b8 11                                  // DW_FORM_data1
.b8 87                                  // DW_AT_call_column
.b8 11                                  // DW_FORM_data1
.b8 0                                   // EOM(1)
.b8 0                                   // EOM(2)
.b8 0                                   // EOM(3)
	}
	.section	.debug_info
	{
.b32 204                                // Length of Unit
.b8 2                                   // DWARF version number
.b8 0
.b32 .debug_abbrev                      // Offset Into Abbrev. Section
.b8 8                                   // Address Size (in bytes)
.b8 1                                   // Abbrev [1] 0xb:0xc5 DW_TAG_compile_unit
.b8 116                                 // DW_AT_producer
.b8 114
.b8 105
.b8 116
.b8 111
.b8 110
.b8 0
.b8 2                                   // DW_AT_language
.b8 0
.b8 99                                  // DW_AT_name
.b8 106
.b8 100
.b8 122
.b8 114
.b8 112
.b8 105
.b8 111
.b8 119
.b8 116
.b8 113
.b8 54
.b8 99
.b8 113
.b8 100
.b8 99
.b8 116
.b8 115
.b8 103
.b8 104
.b8 119
.b8 101
.b8 97
.b8 110
.b8 117
.b8 106
.b8 103
.b8 113
.b8 110
.b8 53
.b8 120
.b8 50
.b8 98
.b8 97
.b8 112
.b8 121
.b8 108
.b8 55
.b8 115
.b8 98
.b8 108
.b8 105
.b8 122
.b8 50
.b8 115
.b8 121
.b8 54
.b8 113
.b8 118
.b8 104
.b8 101
.b8 50
.b8 46
.b8 112
.b8 121
.b8 0
.b32 .debug_line                        // DW_AT_stmt_list
.b8 47                                  // DW_AT_comp_dir
.b8 116
.b8 109
.b8 112
.b8 47
.b8 116
.b8 111
.b8 114
.b8 99
.b8 104
.b8 105
.b8 110
.b8 100
.b8 117
.b8 99
.b8 116
.b8 111
.b8 114
.b8 95
.b8 115
.b8 101
.b8 111
.b8 110
.b8 103
.b8 108
.b8 97
.b8 101
.b8 47
.b8 106
.b8 100
.b8 0
.b8 2                                   // Abbrev [2] 0x70:0x31 DW_TAG_subprogram
.b8 116                                 // DW_AT_name
.b8 114
.b8 105
.b8 116
.b8 111
.b8 110
.b8 95
.b8 114
.b8 101
.b8 100
.b8 95
.b8 102
.b8 117
.b8 115
.b8 101
.b8 100
.b8 95
.b8 95
.b8 116
.b8 111
.b8 95
.b8 99
.b8 111
.b8 112
.b8 121
.b8 95
.b8 108
.b8 105
.b8 110
.b8 97
.b8 108
.b8 103
.b8 95
.b8 118
.b8 101
.b8 99
.b8 116
.b8 111
.b8 114
.b8 95
.b8 110
.b8 111
.b8 114
.b8 109
.b8 95
.b8 48
.b8 0
.b8 1                                   // DW_AT_inline
.b8 3                                   // Abbrev [3] 0xa1:0x2e DW_TAG_subprogram
.b64 $L__func_begin0                    // DW_AT_low_pc
.b64 $L__func_end0                      // DW_AT_high_pc
.b32 112                                // DW_AT_abstract_origin
.b8 4                                   // Abbrev [4] 0xb6:0x18 DW_TAG_inlined_subroutine
.b32 112                                // DW_AT_abstract_origin
.b64 $L__tmp1                           // DW_AT_low_pc
.b64 $L__tmp2                           // DW_AT_high_pc
.b8 1                                   // DW_AT_call_file
.b8 43                                  // DW_AT_call_line
.b8 25                                  // DW_AT_call_column
.b8 0                                   // End Of Children Mark
.b8 0                                   // End Of Children Mark
	}
	.section	.debug_macinfo	{	}

================================================================
please share the reproducer above with Triton project.
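The failure above comes from the `ptxas` binary bundled inside the Triton wheel, which rejects `sm_121a`. One possible workaround, sketched here under assumptions, is to point Triton at a newer `ptxas` through the `TRITON_PTXAS_PATH` environment variable before torch is imported. The candidate paths and helper names are hypothetical. The check also assumes that `ptxas --help` lists the accepted `--gpu-name` values in quotes, which should be verified on the actual machine.

```python
import os
import shutil
import subprocess
from typing import Iterable, Optional


def help_mentions_arch(help_text: str, arch: str) -> bool:
    """True if `arch` appears as a quoted value in the given help text."""
    return f"'{arch}'" in help_text


def ptxas_supports(ptxas: str, arch: str) -> bool:
    """Ask a ptxas binary for its help text and look for the arch name."""
    try:
        result = subprocess.run(
            [ptxas, "--help"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return help_mentions_arch(result.stdout + result.stderr, arch)


def pick_ptxas(arch: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that exists and appears to accept `arch`."""
    for candidate in candidates:
        path = shutil.which(candidate)
        if path and ptxas_supports(path, arch):
            return path
    return None


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever a newer CUDA toolkit lives.
    chosen = pick_ptxas("sm_121a", ["/usr/local/cuda/bin/ptxas", "ptxas"])
    if chosen:
        # Must be set before torch/triton compile anything in this process.
        os.environ["TRITON_PTXAS_PATH"] = chosen
    print(chosen)
```

Setting the variable in the shell that launches training has the same effect and also reaches every rank.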

[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] Triton compilation failed: triton_red_fused__to_copy_linalg_vector_norm_0
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] def triton_red_fused__to_copy_linalg_vector_norm_0(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     xnumel = 338
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     r0_numel = 8192
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     rnumel = r0_numel
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     RBLOCK: tl.constexpr = R0_BLOCK
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     xoffset = tl.program_id(0) * XBLOCK
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     xmask = xindex < xnumel
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     r0_base = tl.arange(0, R0_BLOCK)[None, :]
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     rbase = r0_base
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     x0 = xindex
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     _tmp5 = tl.full([XBLOCK, R0_BLOCK], 0, tl.float32)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     for r0_offset in range(0, r0_numel, R0_BLOCK):
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         r0_index = r0_offset + r0_base
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         r0_mask = r0_index < r0_numel
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         roffset = r0_offset
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         rindex = r0_index
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         r0_1 = r0_index
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         tmp0 = tl.load(in_ptr0 + (r0_1 + 8192*x0), r0_mask & xmask, eviction_policy='evict_first', other=0.0)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         tmp1 = tmp0.to(tl.float32)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         tmp2 = tmp1.to(tl.float32)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         tmp3 = tmp2 * tmp2
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         tmp4 = tl.broadcast_to(tmp3, [XBLOCK, R0_BLOCK])
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         tmp6 = _tmp5 + tmp4
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]         _tmp5 = tl.where(r0_mask & xmask, tmp6, _tmp5)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     tmp5 = tl.sum(_tmp5, 1)[:, None]
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     tl.store(out_ptr0 + (x0), tmp5, xmask)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] 
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] metadata: {'signature': {'in_ptr0': '*fp32', 'out_ptr0': '*fp32', 'xnumel': 'i32', 'r0_numel': 'i32', 'XBLOCK': 'constexpr', 'R0_BLOCK': 'constexpr'}, 'device': 0, 'constants': {'XBLOCK': 1, 'R0_BLOCK': 2048}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 16, 'num_stages': 1, 'debug': True, 'cc': 121}
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] Traceback (most recent call last):
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 468, in make_cubin
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     subprocess.run(ptxas_cmd, check=True, close_fds=False, stderr=flog)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]   File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/subprocess.py", line 526, in run
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     raise CalledProcessError(retcode, process.args,
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] subprocess.CalledProcessError: Command '['/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/bin/ptxas', '-lineinfo', '-v', '--gpu-name=sm_121a', '/tmp/tmpx9u3mic8.ptx', '-o', '/tmp/tmpx9u3mic8.ptx.o']' returned non-zero exit status 255.
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] 
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] During handling of the above exception, another exception occurred:
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] 
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] Traceback (most recent call last):
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 778, in _precompile_config
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     binary = triton.compile(*compile_args, **compile_kwargs)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/compiler/compiler.py", line 320, in compile
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     next_module = compile_ir(module, metadata)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 520, in <lambda>
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.target.arch)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0]     raise PTXASError(error)
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] `ptxas` stderr:
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] ptxas fatal   : Value 'sm_121a' is not defined for option 'gpu-name'
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] 
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] Repro command: /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpx9u3mic8.ptx -o /tmp/tmpx9u3mic8.ptx.o
[rank0]:E1105 04:43:56.518000 300217 .venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py:780] [2/0] 
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/seonglae/nanochat/scripts/base_train.py", line 286, in <module>
[rank0]:     opt.step()
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 517, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/seonglae/nanochat/nanochat/muon.py", line 176, in step
[rank0]:     g = zeropower_via_newtonschulz5(g, steps=group["ns_steps"])
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
[rank0]:     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 990, in _compile_fx_inner
[rank0]:     raise InductorError(e, currentframe()).with_traceback(
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 974, in _compile_fx_inner
[rank0]:     mb_compiled_graph = fx_codegen_and_compile(
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1695, in fx_codegen_and_compile
[rank0]:     return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1505, in codegen_and_compile
[rank0]:     compiled_module = graph.compile_to_module()
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2319, in compile_to_module
[rank0]:     return self._compile_to_module()
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2329, in _compile_to_module
[rank0]:     mod = self._compile_to_module_lines(wrapper_code)
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2397, in _compile_to_module_lines
[rank0]:     mod = PyCodeCache.load_by_key_path(
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 3548, in load_by_key_path
[rank0]:     mod = _reload_python_module(key, path, set_sys_modules=in_toplevel)
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/runtime/compile_tasks.py", line 33, in _reload_python_module
[rank0]:     exec(code, mod.__dict__, mod.__dict__)
[rank0]:   File "/tmp/torchinductor_seonglae/ru/cru4n75qryyfhybd6gs3feyrxnkst4woonaugrrh6j6thesvegdk.py", line 49, in <module>
[rank0]:     triton_red_fused__to_copy_linalg_vector_norm_0 = async_compile.triton('triton_red_fused__to_copy_linalg_vector_norm_0', '''
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/async_compile.py", line 500, in triton
[rank0]:     kernel.precompile(
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 448, in precompile
[rank0]:     self._precompile_worker()
[rank0]:   File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 474, in _precompile_worker
[rank0]:     raise NoTritonConfigsError(
[rank0]: torch._inductor.exc.InductorError: NoTritonConfigsError: No valid triton configs. PTXASError: PTXAS error: Internal Triton PTX codegen error
[rank0]: `ptxas` stderr:
[rank0]: ptxas fatal   : Value 'sm_121a' is not defined for option 'gpu-name'

[rank0]: Repro command: /home/seonglae/nanochat/.venv/lib/python3.10/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpx9u3mic8.ptx -o /tmp/tmpx9u3mic8.ptx.o


[rank0]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

[rank0]:[W1105 04:43:59.801267274 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W1105 04:44:03.362613554 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
E1105 04:44:06.417000 300096 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 300217) of binary: /home/seonglae/nanochat/.venv/bin/python3
Traceback (most recent call last):
  File "/home/seonglae/nanochat/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts.base_train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-11-05_04:44:06
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 300217)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
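
The run above fails because the `ptxas` bundled with the Triton wheel rejects `--gpu-name=sm_121a`. One possible workaround, which these logs do not confirm, is to point Triton at a newer `ptxas` from a system CUDA toolkit through the `TRITON_PTXAS_PATH` environment variable before `torch` is imported. The sketch below assumes such a toolkit is installed. The path `/usr/local/cuda/bin/ptxas` and the helper names `find_system_ptxas` and `use_system_ptxas` are hypothetical. Whether the located `ptxas` accepts `sm_121a` depends on the toolkit version and is not verified here.

```python
import os
import shutil
from typing import Iterable, Optional

# Hypothetical default location of a system CUDA toolkit's ptxas; adjust as needed.
DEFAULT_CANDIDATES = ("/usr/local/cuda/bin/ptxas",)


def find_system_ptxas(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return the first executable ptxas among the candidates, else whatever is on PATH."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return shutil.which("ptxas")


def use_system_ptxas(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Export TRITON_PTXAS_PATH if a ptxas was found. Call this before importing torch."""
    path = find_system_ptxas(candidates)
    if path is not None:
        os.environ["TRITON_PTXAS_PATH"] = path
    return path


if __name__ == "__main__":
    print("TRITON_PTXAS_PATH ->", use_system_ptxas())
```

Setting the variable in the shell that launches `torchrun` has the same effect and avoids touching the training script.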


[W1105 09:36:28.443950219 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  warnings.warn(
[W1105 09:36:43.432882525 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3
Traceback (most recent call last):
  File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/seonglae/nanochat/scripts/base_train.py", line 71, in <module>
    ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
  File "/home/seonglae/nanochat/nanochat/common.py", line 166, in compute_init
    dist.init_process_group(backend="nccl", device_id=device)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1769, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2134, in _new_process_group_helper
    eager_backend.eager_connect_single_device(device_id)
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:94, remote process exited or there was a network error, NCCL version 2.27.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
socketPollConnect: connect to 169.254.42.30<39975> returned Connection refused, exceeded error retry count after 35 attempts
[W1105 09:37:43.611671299 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[W1105 09:37:43.795761788 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
E1105 09:37:43.888000 512276 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 513624) of binary: /home/seonglae/nanochat/.venv/bin/python3
Traceback (most recent call last):
  File "/home/seonglae/nanochat/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts.base_train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-11-05_09:37:43
  host      : localhost
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 513624)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
==========================
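
The second failure is a `Connection refused` from `169.254.42.30<39975>` during NCCL initialisation on rank 1. Port 39975 is chosen at run time, so it cannot be probed in advance. A plain TCP probe of the rendezvous port can still show whether the link-local address is reachable from the worker at all. The helper `can_connect` below is a hypothetical name and only a minimal sketch of such a probe. It says nothing about which interface NCCL itself will pick.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For example, `can_connect("169.254.42.30", 29500)` run on the worker while `torchrun` is waiting on the master should return `True` if the address and the rendezvous port used in the launch command are reachable.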


[W1105 09:58:18.433455839 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3

                                                       █████                █████
                                                      ░░███                ░░███
     ████████    ██████   ████████    ██████   ██████  ░███████    ██████  ███████
    ░░███░░███  ░░░░░███ ░░███░░███  ███░░███ ███░░███ ░███░░███  ░░░░░███░░░███░
     ░███ ░███   ███████  ░███ ░███ ░███ ░███░███ ░░░  ░███ ░███   ███████  ░███
     ░███ ░███  ███░░███  ░███ ░███ ░███ ░███░███  ███ ░███ ░███  ███░░███  ░███ ███
     ████ █████░░████████ ████ █████░░██████ ░░██████  ████ █████░░███████  ░░█████
    ░░░░ ░░░░░  ░░░░░░░░ ░░░░ ░░░░░  ░░░░░░   ░░░░░░  ░░░░ ░░░░░  ░░░░░░░░   ░░░░░
    
Overriding: depth = 26
Overriding: device_batch_size = 16
Autodetected device type: cuda
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  warnings.warn(
2025-11-05 09:58:20,608 - nanochat.common - INFO - Distributed world size: 2
Vocab size: 65,536
num_layers: 26
model_dim: 1664
num_heads: 13
num_kv_heads: 13
Tokens / micro-batch / rank: 16 x 2048 = 32,768
Tokens / micro-batch: 65,536
Total batch size 524,288 => gradient accumulation steps: 8
Number of parameters: 1,081,999,360
Estimated FLOPs per token: 6.900941e+09
Calculated number of iterations from target data:param ratio: 41,275
Total number of training tokens: 21,639,987,200
Tokens : Params ratio: 20.00
Total training FLOPs estimate: 1.493363e+20
Scaling the LR for the AdamW parameters ∝1/√(1664/768) = 0.679366
Muon: Grouping 104 params of shape torch.Size([1664, 1664]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([1664, 6656]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([6656, 1664]), device cuda:0, dtype torch.float32
W1105 10:00:58.433000 1768678 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1770408 closing signal SIGTERM
Traceback (most recent call last):
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 717, in run
    result = self._invoke_run(role)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 881, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1768678 got signal: 15

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/seonglae/nanochat/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 284, in launch_agent
    result = agent.run()
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 725, in run
    logger.warning("Received %s death signal, shutting down workers", e.sigval)
  File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/logging/__init__.py", line 1489, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/logging/__init__.py", line 1622, in _log
    record = self.makeRecord(self.name, level, fn, lno, msg, args,
  File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/logging/__init__.py", line 1591, in makeRecord
    rv = _logRecordFactory(name, level, fn, lno, msg, args, exc_info, func,
  File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/logging/__init__.py", line 317, in __init__
    self.filename = os.path.basename(pathname)
  File "/home/seonglae/.local/share/uv/python/cpython-3.10.19-linux-aarch64-gnu/lib/python3.10/posixpath.py", line 144, in basename
    i = p.rfind(sep) + 1
  File "/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1768678 got signal: 15
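
The configuration summary printed in this run, before the launcher received SIGTERM, can be re-derived from a handful of inputs. The sketch below copies its inputs from the log. The formulas are inferred from the printed numbers rather than taken from the nanochat source, so treat them as assumptions that happen to reproduce the log.

```python
# Inputs copied from the log.
device_batch_size = 16
seq_len = 2048                  # from "16 x 2048"
world_size = 2
total_batch_size = 524_288
num_params = 1_081_999_360
flops_per_token = 6.900941e9
model_dim = 1664
tokens_per_param = 20           # from "Tokens : Params ratio: 20.00"

# Derived quantities (formulas are assumptions inferred from the log).
tokens_per_rank = device_batch_size * seq_len                 # 32,768
tokens_per_micro_batch = tokens_per_rank * world_size         # 65,536
grad_accum_steps = total_batch_size // tokens_per_micro_batch
num_iterations = tokens_per_param * num_params // total_batch_size
total_tokens = num_iterations * total_batch_size
total_flops = flops_per_token * total_tokens
lr_scale = (model_dim / 768) ** -0.5                          # the 1/sqrt(1664/768) factor

print(grad_accum_steps, num_iterations, total_tokens)
print(f"{total_flops:e}", round(lr_scale, 6))
```

With two ranks, each optimizer step therefore needs 8 micro-batches per rank, which is why a single step takes minutes on this hardware.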

Success


[W1105 12:48:44.082950627 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3

                                                       █████                █████
                                                      ░░███                ░░███
     ████████    ██████   ████████    ██████   ██████  ░███████    ██████  ███████
    ░░███░░███  ░░░░░███ ░░███░░███  ███░░███ ███░░███ ░███░░███  ░░░░░███░░░███░
     ░███ ░███   ███████  ░███ ░███ ░███ ░███░███ ░░░  ░███ ░███   ███████  ░███
     ░███ ░███  ███░░███  ░███ ░███ ░███ ░███░███  ███ ░███ ░███  ███░░███  ░███ ███
     ████ █████░░████████ ████ █████░░██████ ░░██████  ████ █████░░███████  ░░█████
    ░░░░ ░░░░░  ░░░░░░░░ ░░░░ ░░░░░  ░░░░░░   ░░░░░░  ░░░░ ░░░░░  ░░░░░░░░   ░░░░░
    
Overriding: depth = 26
Overriding: device_batch_size = 16
Autodetected device type: cuda
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  warnings.warn(
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn(  # warn only once
[rank0]:[W1105 12:48:45.225508329 ProcessGroupNCCL.cpp:5068] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
2025-11-05 12:48:46,210 - nanochat.common - INFO - Distributed world size: 2
Vocab size: 65,536
num_layers: 26
model_dim: 1664
num_heads: 13
num_kv_heads: 13
Tokens / micro-batch / rank: 16 x 2048 = 32,768
Tokens / micro-batch: 65,536
Total batch size 524,288 => gradient accumulation steps: 8
Number of parameters: 1,081,999,360
Estimated FLOPs per token: 6.900941e+09
Calculated number of iterations from target data:param ratio: 41,275
Total number of training tokens: 21,639,987,200
Tokens : Params ratio: 20.00
Total training FLOPs estimate: 1.493363e+20
Scaling the LR for the AdamW parameters ∝1/√(1664/768) = 0.679366
Muon: Grouping 104 params of shape torch.Size([1664, 1664]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([1664, 6656]), device cuda:0, dtype torch.float32
Muon: Grouping 26 params of shape torch.Size([6656, 1664]), device cuda:0, dtype torch.float32
Step 00000 | Validation bpb: 3.3071
step 00000/41275 (0.00%) | loss: 11.090360 | lrm: 1.00 | dt: 167199.40ms | tok/sec: 3,135 | mfu: 1.09 | total time: 0.00m
step 00001/41275 (0.00%) | loss: 10.825550 | lrm: 1.00 | dt: 192320.67ms | tok/sec: 2,726 | mfu: 0.95 | total time: 0.00m
step 00002/41275 (0.00%) | loss: 10.153048 | lrm: 1.00 | dt: 184528.47ms | tok/sec: 2,841 | mfu: 0.99 | total time: 0.00m
step 00003/41275 (0.01%) | loss: 9.493440 | lrm: 1.00 | dt: 176406.38ms | tok/sec: 2,972 | mfu: 1.04 | total time: 0.00m
step 00004/41275 (0.01%) | loss: 8.972018 | lrm: 1.00 | dt: 171745.83ms | tok/sec: 3,052 | mfu: 1.07 | total time: 0.00m
step 00005/41275 (0.01%) | loss: 8.623263 | lrm: 1.00 | dt: 176813.53ms | tok/sec: 2,965 | mfu: 1.03 | total time: 0.00m
step 00006/41275 (0.01%) | loss: 8.471020 | lrm: 1.00 | dt: 174431.29ms | tok/sec: 3,005 | mfu: 1.05 | total time: 0.00m
step 00007/41275 (0.02%) | loss: 8.283705 | lrm: 1.00 | dt: 174112.48ms | tok/sec: 3,011 | mfu: 1.05 | total time: 0.00m
step 00008/41275 (0.02%) | loss: 8.111838 | lrm: 1.00 | dt: 177996.02ms | tok/sec: 2,945 | mfu: 1.03 | total time: 0.00m
step 00009/41275 (0.02%) | loss: 7.949561 | lrm: 1.00 | dt: 176105.21ms | tok/sec: 2,977 | mfu: 1.04 | total time: 0.00m
step 00010/41275 (0.02%) | loss: 7.827967 | lrm: 1.00 | dt: 169717.51ms | tok/sec: 3,089 | mfu: 1.08 | total time: 0.00m
step 00011/41275 (0.03%) | loss: 7.715822 | lrm: 1.00 | dt: 168858.79ms | tok/sec: 3,104 | mfu: 1.08 | total time: 2.81m
step 00012/41275 (0.03%) | loss: 7.641937 | lrm: 1.00 | dt: 171440.03ms | tok/sec: 3,058 | mfu: 1.07 | total time: 5.67m
step 00013/41275 (0.03%) | loss: 7.550511 | lrm: 1.00 | dt: 176084.73ms | tok/sec: 2,977 | mfu: 1.04 | total time: 8.61m
step 00014/41275 (0.03%) | loss: 7.450517 | lrm: 1.00 | dt: 176033.10ms | tok/sec: 2,978 | mfu: 1.04 | total time: 11.54m
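
The step lines above are enough to estimate how long the full run would take. The sketch below recomputes `tok/sec` from `dt` and extrapolates to all 41,275 steps. The MFU formula and the reference peak of 989e12 FLOP/s per GPU are assumptions. They reproduce the logged `mfu` values, which suggests that `mfu` is measured against an H100-class peak rather than the GB10's own.

```python
total_batch_size = 524_288       # tokens per optimizer step, from the log
num_iterations = 41_275
flops_per_token = 6.900941e9
world_size = 2
REFERENCE_PEAK_FLOPS = 989e12    # assumption: per-GPU reference peak used for MFU


def tokens_per_sec(dt_ms: float) -> float:
    """Throughput implied by one optimizer step taking dt_ms milliseconds."""
    return total_batch_size / (dt_ms / 1000.0)


def mfu_percent(dt_ms: float) -> float:
    """Model FLOPs utilisation in percent against the assumed reference peak."""
    achieved = flops_per_token * tokens_per_sec(dt_ms)
    return 100.0 * achieved / (REFERENCE_PEAK_FLOPS * world_size)


def eta_days(dt_ms: float) -> float:
    """Wall-clock days for the whole run if every step takes dt_ms."""
    return num_iterations * dt_ms / 1000.0 / 86_400.0


print(int(tokens_per_sec(167199.40)), round(mfu_percent(167199.40), 2))
print(round(eta_days(176033.10), 1), "days at the step-14 pace")
```

At roughly 176 seconds per step, the full depth-26 run would take on the order of 84 days on two DGX Spark nodes.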


cd /home/seonglae/nanochat && nohup uv run torchrun \
    --nnodes=2 --nproc_per_node=1 --node_rank=0 \
    --master_addr=169.254.42.30 --master_port=29500 \
    -m scripts.base_train -- --depth=26 --device_batch_size=16 \
    > /home/seonglae/master_train.log 2>&1 &
ssh spark-9ea3 "cd /home/seonglae/nanochat && nohup /home/seonglae/nanochat/.venv/bin/torchrun \
    --nnodes=2 --nproc_per_node=1 --node_rank=1 \
    --master_addr=169.254.42.30 --master_port=29500 \
    -m scripts.base_train -- --depth=26 --device_batch_size=16 \
    > /home/seonglae/worker_train.log 2>&1 &" && echo "Started"
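
The master and worker invocations in the command above differ only in `--node_rank`. The sketch below builds both command strings from a single function, so the two nodes cannot drift apart. `torchrun_command` is a hypothetical helper, and its defaults are copied from the command above. It only formats the string and does not launch anything.

```python
import shlex


def torchrun_command(
    node_rank: int,
    master_addr: str = "169.254.42.30",
    master_port: int = 29500,
    nnodes: int = 2,
    depth: int = 26,
    device_batch_size: int = 16,
) -> str:
    """Return the torchrun command line for one node as a shell-safe string."""
    argv = [
        "torchrun",
        f"--nnodes={nnodes}",
        "--nproc_per_node=1",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "-m", "scripts.base_train",
        "--",
        f"--depth={depth}",
        f"--device_batch_size={device_batch_size}",
    ]
    return shlex.join(argv)


if __name__ == "__main__":
    for rank in (0, 1):
        print(torchrun_command(rank))
```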


❯ cat worker_train.log 
[W1105 14:05:05.509280926 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
/home/seonglae/nanochat/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  warnings.warn(
[W1105 14:05:06.016375996 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3

Success

Config change

Recommendations