dynamically generated token internal manipulation
main com - steer-rl
Timeline
- barburry 0
- marallard 1
- buffelhead 0
- cackling 0
- eider a lot
- gadwell 0
- pintail 0
- goosesander 0
- pochard llama august 4th grad/loss errors
- ruddy 0 torchdynamo
- scaup 0
- scoter 0 cot prompt
- shoveler 0 ppo loss
- smew 1 fp 32 normal
- wigeon ppo grad backward a lot
Argument
- sample mean clamp - meaningless code? 10 is large
- loss per - step/sample/group
- loss location - grad location
- clip gradient
- log prob flattening
- detached critic/log_prob/action
- generation token on steer.py
- use_cache
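Sketch for the "loss per - step/sample/group" item above. Think Len swings between roughly 130 and 416 tokens in the logs, so the choice of averaging changes how much long generations dominate the loss. The helper names and the toy per-token losses are hypothetical, not taken from steer.py; this only illustrates the three reductions.

```python
from statistics import fmean


def per_step_mean(losses):
    """Average over every token step in the batch; long samples weigh more."""
    flat = [value for sample in losses for value in sample]
    return fmean(flat)


def per_sample_mean(losses):
    """Average each sample over its own steps first, then over samples."""
    return fmean([fmean(sample) for sample in losses])


def per_group_mean(losses, group_size):
    """Average per-sample means inside consecutive groups, then over groups."""
    sample_means = [fmean(sample) for sample in losses]
    groups = [
        sample_means[i:i + group_size]
        for i in range(0, len(sample_means), group_size)
    ]
    return fmean([fmean(group) for group in groups])


# Hypothetical per-token losses for three generations of different length.
losses = [[1.0, 1.0, 1.0, 1.0], [4.0], [2.0, 2.0]]

step_loss = per_step_mean(losses)                   # 12 / 7
sample_loss = per_sample_mean(losses)               # (1 + 4 + 2) / 3
group_loss = per_group_mean(losses, group_size=2)   # (2.5 + 2.0) / 2
```

With these toy numbers the short, high-loss sample counts for 1/7 of the per-step mean but 1/3 of the per-sample mean, which is the kind of difference the item is asking about.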
step30_acc79.2.pt
Training Steps: 0%| | 0/38 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.
Step 0: Avg Train Acc 0.7500, Val Acc 0.7708, Train Think Len 139.75, Val Think Len 169.50
Layer 20: Policy Loss 5.8489, Critic Loss 0.9254, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.5010, Unique Indices: 1520, Avg Activation: 0.8079, Avg Act Values: 0.9981
Training Steps: 100%|███████████████████████████████████████████████████████| 38/38 [3:07:27<00:00, 295.98s/it]
Step 10: Avg Train Acc 0.6375, Val Acc 0.7708, Train Think Len 156.00, Val Think Len 169.73
Layer 20: Policy Loss 11.2105, Critic Loss 0.5497, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.0266, Unique Indices: 1507, Avg Activation: 0.6343, Avg Act Values: 0.9989
Step 20: Avg Train Acc 0.5125, Val Acc 0.7708, Train Think Len 326.62, Val Think Len 172.48
Layer 20: Policy Loss 16.9192, Critic Loss 0.6790, Grad Norms (P/C) 0.00/0.00, Recon Loss 11.2273, Unique Indices: 1524, Avg Activation: 0.2517, Avg Act Values: 0.9900
Step 30: Avg Train Acc 0.7000, Val Acc 0.7917, Train Think Len 134.38, Val Think Len 188.38
Layer 20: Policy Loss 10.0069, Critic Loss 0.8600, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.0790, Unique Indices: 1537, Avg Activation: 0.6786, Avg Act Values: 0.9113
/cs/student/projects2/aisd/2024/seongcho/steer-rl/eval.py:507: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
ckpt = TrainResult.model_validate(torch.load(checkpoint))
Config model: gemma2b task: gsm8k layers: [20] select_token: False decode: False category: None cot: True
Evaluating: 100%|██████████████████████████████████████████████████████████| 1319/1319 [38:45<00:00, 1.76s/it]
Final gsm8k Accuracy with Steering: 55.42%
Results saved to ./checkpoints/gemma2b_gsm8k_20_ppo_1e-05_0802_162059_30.0_cot/gsm8k_20_steered.json
Stats saved to ./checkpoints/gemma2b_gsm8k_20_ppo_1e-05_0802_162059_30.0_cot/gsm8k_eval.json
Starting analysis...
Getting baselines took: 0.00s
Final gsm8k Accuracy with Steering (Analysis): 55.42%
Final gsm8k Accuracy without Steering (Baseline): 54.74%
Overall Accuracy:
Steered Model: 55.42%
Baseline Model: 54.74%
Baseline answer analysis took: 0.51s
Analyzing layer 20...
Critic Analysis Results:
Total samples: 1319
Correct (reward > 0): 731
Incorrect (reward = 0): 588
Corrected (steered reward > baseline reward): 39
Misguided (steered reward < baseline reward): 30
/cs/student/projects2/aisd/2024/seongcho/steer-rl/analyze.py:111: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
return original_barplot(*args, **kwargs)
Feature analysis saved to ./checkpoints/gemma2b_gsm8k_20_ppo_1e-05_0802_162059_30.0_cot/feature_analysis_20.json
Layer 20 naive analysis took: 20.41s
Layer 20 total analysis took: 20.41s
Building result dictionaries took: 0.00s
Total analysis completed in: 20.92s
Every outputs are saved to the folder ./checkpoints/gemma2b_gsm8k_20_ppo_1e-05_0802_162059_30.0_cot
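Sanity check on the numbers in the log above, assuming the reward is effectively binary per sample, so "Corrected" means baseline wrong and steered right, and "Misguided" the reverse. That is an assumption, not something the log states. Under it, the baseline correct count can be recovered from the steered count, and both accuracies should match the printed ones.

```python
# Counts copied from the "Critic Analysis Results" block.
total = 1319
steered_correct = 731
corrected = 39   # steered reward > baseline reward
misguided = 30   # steered reward < baseline reward

# Assumption: binary reward, so undoing the flips gives the baseline count.
baseline_correct = steered_correct - corrected + misguided

steered_acc = round(100 * steered_correct / total, 2)
baseline_acc = round(100 * baseline_correct / total, 2)
net_gain = corrected - misguided

print(f"steered {steered_acc}% | baseline {baseline_acc}% | net flips {net_gain}")
```

Both values reproduce the logged 55.42% and 54.74%, so the gap comes from a net of 9 flipped answers out of 1319, with 69 answers changing in total.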
Gemma
Baseline
- non cot: 41.24%
- Baseline Accuracy: 54.51%
- decode 54.74%
- official 23.9%
10th
gemma2b_gsm8k_10_ppo_1e-05_0802_171747_10.0_cot
python train.py train --eval --layers="10," --task="gsm8k" --cot --limit=48 --validate_every=10 --num_samples=300 --policy_deep --analysis --minimum=10 --mask="generation"
- 55.50%
15th
python train.py train --eval --layers="15," --task="gsm8k" --cot --limit=48 --validate_every=10 --num_samples=300 --policy_deep --analysis --minimum=10 --mask="generation"
- 54.59%
20th
gemma2b_gsm8k_20_ppo_1e-05_0802_162059_30.0_cot
20th
- 54.89%
python train.py train --eval --layers="20," --task="gsm8k" --cot --limit=48 --validate_every=10 --grpo --num_samples=1000
- Cross loss?
python train.py train --eval --layers="20," --task="gsm8k" --cot --limit=48 --validate_every=10 --num_samples=300 --policy_deep --analysis --minimum=30 --mask="generation"
- 55.42%
python train.py train --eval --layers="20," --task="gsm8k" --cot --limit=48 --validate_every=10 --num_samples=300 --policy_deep --analysis --minimum=30 --mask="all"
- 49.13%
- Feature analysis of gemma2b_gsm8k_20_ppo_1e-05_0725_172103_50.0_cot was reasonably meaningful; looks good
normal total avg
24th
gemma2b_gsm8k_24_ppo_1e-05_0709_021753_30.0
- 55.88
Corrsteer
- 42.61% mean
- 3~ max
❯ python train.py train --eval --layers="20," --task="gsm8k" --cot --limit=48 --validate_every=10 --num_samples=1000
Loading checkpoint shards: 100%|█████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.43it/s]
wandb: Currently logged in as: seonglae (texonom). Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.4
wandb: Run data is saved locally in /cs/student/projects2/aisd/2024/seongcho/steer-rl/wandb/run-20250721_173437-lj691tq4
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gemma2b_gsm8k_20_ppo_1e-05_0721_173436_30.0_cot
wandb: ⭐️ View project at https://wandb.ai/texonom/control_rl
wandb: 🚀 View run at https://wandb.ai/texonom/control_rl/runs/lj691tq4
Training Steps: 0%| | 0/126 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Step 0: Avg Train Acc 0.8750, Val Acc 0.7708, Train Think Len 134.75, Val Think Len 188.44
Layer 20: Policy Loss 1.3381, Critic Loss 1.5084, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.7741, Unique Indices: 7099, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 8%|████▌ | 10/126 [03:42<26:08, 13.52s/it]
Step 10: Avg Train Acc 0.6875, Val Acc 0.7083, Train Think Len 272.88, Val Think Len 173.00
Layer 20: Policy Loss 15.1046, Critic Loss 0.9180, Grad Norms (P/C) 0.00/0.00, Recon Loss 11.1088, Unique Indices: 6853, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 16%|█████████ | 20/126 [06:16<15:08, 8.57s/it]
Step 20: Avg Train Acc 0.5750, Val Acc 0.7917, Train Think Len 416.25, Val Think Len 175.21
Layer 20: Policy Loss 7.2584, Critic Loss 1.1115, Grad Norms (P/C) 0.00/0.00, Recon Loss 11.0407, Unique Indices: 6896, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 24%|█████████████▌ | 30/126 [09:03<14:40, 9.18s/it]
Step 30: Avg Train Acc 0.6750, Val Acc 0.7083, Train Think Len 136.75, Val Think Len 219.46
Layer 20: Policy Loss 2.9773, Critic Loss 1.2548, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.3639, Unique Indices: 7505, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 32%|██████████████████ | 40/126 [11:43<11:44, 8.19s/it]
Step 40: Avg Train Acc 0.6750, Val Acc 0.7292, Train Think Len 150.50, Val Think Len 187.88
Layer 20: Policy Loss 1.6447, Critic Loss 1.2167, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.5426, Unique Indices: 7068, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 40%|██████████████████████▌ | 50/126 [14:43<20:16, 16.01s/it]
Step 50: Avg Train Acc 0.5875, Val Acc 0.7500, Train Think Len 356.62, Val Think Len 166.52
Layer 20: Policy Loss 14.2365, Critic Loss 0.9431, Grad Norms (P/C) 0.00/0.00, Recon Loss 10.8705, Unique Indices: 6617, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 48%|███████████████████████████▏ | 60/126 [16:54<08:23, 7.62s/it]
Step 60: Avg Train Acc 0.7000, Val Acc 0.7500, Train Think Len 150.50, Val Think Len 168.40
Layer 20: Policy Loss 1.6700, Critic Loss 1.3669, Grad Norms (P/C) 0.00/0.00, Recon Loss 15.2031, Unique Indices: 6665, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 56%|███████████████████████████████▋ | 70/126 [19:15<08:58, 9.61s/it]
Step 70: Avg Train Acc 0.6375, Val Acc 0.7500, Train Think Len 168.12, Val Think Len 196.96
Layer 20: Policy Loss 2.4080, Critic Loss 1.2116, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.5739, Unique Indices: 7218, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 63%|████████████████████████████████████▏ | 80/126 [22:56<14:46, 19.28s/it]
Step 80: Avg Train Acc 0.6500, Val Acc 0.7500, Train Think Len 155.12, Val Think Len 187.73
Layer 20: Policy Loss 3.9269, Critic Loss 1.0200, Grad Norms (P/C) 0.00/0.00, Recon Loss 13.2892, Unique Indices: 7101, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 71%|████████████████████████████████████████▋ | 90/126 [25:45<08:23, 13.98s/it]
Step 90: Avg Train Acc 0.6000, Val Acc 0.7708, Train Think Len 167.25, Val Think Len 216.92
Layer 20: Policy Loss 1.4571, Critic Loss 1.4306, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.1553, Unique Indices: 7576, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 79%|████████████████████████████████████████████▍ | 100/126 [29:21<05:42, 13.18s/it]
Step 100: Avg Train Acc 0.5875, Val Acc 0.7083, Train Think Len 259.62, Val Think Len 167.38
Layer 20: Policy Loss 12.0867, Critic Loss 1.1006, Grad Norms (P/C) 0.00/0.00, Recon Loss 11.0181, Unique Indices: 6684, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 87%|████████████████████████████████████████████████▉ | 110/126 [31:49<03:29, 13.11s/it]
Step 110: Avg Train Acc 0.7500, Val Acc 0.7500, Train Think Len 130.00, Val Think Len 187.60
Layer 20: Policy Loss 2.3315, Critic Loss 1.5050, Grad Norms (P/C) 0.00/0.00, Recon Loss 15.0282, Unique Indices: 7103, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 95%|█████████████████████████████████████████████████████▎ | 120/126 [34:42<01:35, 15.98s/it]
Step 120: Avg Train Acc 0.6000, Val Acc 0.6875, Train Think Len 241.75, Val Think Len 192.08
Layer 20: Policy Loss 12.8202, Critic Loss 1.3851, Grad Norms (P/C) 0.00/0.00, Recon Loss 10.9607, Unique Indices: 7111, Avg Activation: 30.0000, Avg Act Values: 30.0000
Training Steps: 100%|████████████████████████████████████████████████████████| 126/126 [37:59<00:00, 18.09s/it]
/cs/student/projects2/aisd/2024/seongcho/steer-rl/eval.py:476: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
ckpt = TrainResult.model_validate(torch.load(checkpoint))
Config model: gemma2b task: gsm8k layers: [20] select_token: False decode: False category: None cot: True
Evaluating: 27%|████████████████ | 360/1319 [10:16<35:08, 2.20s/it]
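The step lines in these logs follow a fixed pattern, so a small parser makes it easier to compare runs than reading the paste. This is a minimal sketch: `parse_steps` and the field names are hypothetical, and the pattern is only inferred from the lines pasted in these notes, not from the logging code in train.py.

```python
import re

# Pattern inferred from the pasted "Step N: Avg Train Acc ..." lines.
# \s+ between words tolerates the line wraps that appear in the paste.
STEP_RE = re.compile(
    r"Step\s+(\d+):\s+Avg\s+Train\s+Acc\s+([\d.]+),\s+Val\s+Acc\s+([\d.]+),"
    r"\s+Train\s+Think\s+Len\s+([\d.]+),\s+Val\s+Think\s+Len\s+([\d.]+)"
)


def parse_steps(log_text):
    """Return one dict per 'Step N' line found in the raw log text."""
    rows = []
    for match in STEP_RE.finditer(log_text):
        step, train_acc, val_acc, train_len, val_len = match.groups()
        rows.append({
            "step": int(step),
            "train_acc": float(train_acc),
            "val_acc": float(val_acc),
            "train_think_len": float(train_len),
            "val_think_len": float(val_len),
        })
    return rows


# Two lines copied from the run above, left flattened as in the paste.
sample = (
    "Step 0: Avg Train Acc 0.8750, Val Acc 0.7708, Train Think Len 134.75, "
    "Val Think Len 188.44 Layer 20: Policy Loss 1.3381 "
    "Step 20: Avg Train Acc 0.5750, Val Acc 0.7917, Train Think Len 416.25, "
    "Val Think Len 175.21"
)
rows = parse_steps(sample)
best = max(rows, key=lambda row: row["val_acc"])
print(best["step"], best["val_acc"])
```

Picking the step with the highest validation accuracy this way mirrors how a checkpoint such as step30_acc79.2.pt appears to be named, though that naming link is a guess.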
normal, only avg
0|layers-gemma-gsm8k | You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.
0|layers-gemma-gsm8k | Step 0: Avg Train Acc 0.7500, Val Acc 0.6667, Train Think Len 236.50, Val Think Len 206.29
0|layers-gemma-gsm8k | Layer 10: Policy Loss 6.6327, Critic Loss 1.1613, Grad Norms (P/C) 0.00/0.00, Recon Loss 2.0147, Unique Indices: 947, Avg Activation: 5.2121, Avg Act Values: 7.5435
Training Steps: 2%|▊ | 1/63 [15:00<15:30:06, 900.10s/it]
Training Steps: 3%|█▋ | 2/63 [15:16<6:26:32, 380.21s/it]
Training Steps: 5%|██▌ | 3/63 [15:42<3:38:21, 218.36s/it]
Training Steps: 6%|███▍ | 4/63 [20:00<3:50:04, 233.98s/it]
Training Steps: 8%|████▎ | 5/63 [24:23<3:56:26, 244.59s/it]
Training Steps: 10%|█████▏ | 6/63 [24:51<2:42:25, 170.97s/it]
Training Steps: 11%|██████ | 7/63 [25:22<1:56:51, 125.20s/it]
Training Steps: 13%|██████▉ | 8/63 [29:27<2:29:50, 163.46s/it]
Training Steps: 14%|███████▊ | 9/63 [33:33<2:50:08, 189.05s/it]
Training Steps: 16%|████████▌ | 10/63 [33:52<2:00:41, 136.62s/it]
0|layers-gemma-gsm8k | Step 10: Avg Train Acc 0.6625, Val Acc 0.6667, Train Think Len 309.38, Val Think Len 208.33
0|layers-gemma-gsm8k | Layer 10: Policy Loss 65.6082, Critic Loss 0.7492, Grad Norms (P/C) 0.00/0.00, Recon Loss 2.0016, Unique Indices: 963, Avg Activation: 4.8784, Avg Act Values: 6.9036
Training Steps: 17%|█████████▍ | 11/63 [48:30<5:15:09, 363.65s/it]
Training Steps: 19%|██████████▎ | 12/63 [48:58<3:42:15, 261.49s/it]
^C
softmax, total avg
0|layers-gemma-gsm8k | Step 0: Avg Train Acc 0.7500, Val Acc 0.7292, Train Think Len 236.50, Val Think Len 205.12
0|layers-gemma-gsm8k | Layer 10: Policy Loss -129.9479, Critic Loss 1.1613, Grad Norms (P/C) 1.00/0.00, Recon Loss 2.0147, Unique Indices: 825, Avg Activation: 5.2121, Avg Act Values: 8.6891
Training Steps: 2%|▊ | 1/63 [14:58<15:28:34, 898.62s/it]
Training Steps: 3%|█▋ | 2/63 [15:14<6:25:55, 379.60s/it]
Training Steps: 5%|██▌ | 3/63 [19:36<5:25:31, 325.53s/it]
Training Steps: 6%|███▍ | 4/63 [23:53<4:53:39, 298.63s/it]
Training Steps: 8%|████▎ | 5/63 [28:17<4:36:37, 286.16s/it]
Training Steps: 10%|█████▏ | 6/63 [28:39<3:06:40, 196.50s/it]
Training Steps: 11%|██████ | 7/63 [29:04<2:10:50, 140.19s/it]
Training Steps: 13%|██████▉ | 8/63 [29:26<1:33:59, 102.54s/it]
Training Steps: 14%|███████▊ | 9/63 [33:31<2:12:29, 147.21s/it]
Training Steps: 16%|████████▌ | 10/63 [37:39<2:37:26, 178.24s/it]
0|layers-gemma-gsm8k | Step 10: Avg Train Acc 0.6250, Val Acc 0.7292, Train Think Len 283.75, Val Think Len 200.62
0|layers-gemma-gsm8k | Layer 10: Policy Loss -23.1662, Critic Loss 0.7059, Grad Norms (P/C) 1.00/0.00, Recon Loss 2.0102, Unique Indices: 366, Avg Activation: 7.2268, Avg Act Values: 19.4772
Training Steps: 17%|█████████▍ | 11/63 [51:41<5:30:27, 381.30s/it]
^C
softmax, total avg
_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.
0|layers-gemma-gsm8k | Step 0: Avg Train Acc 0.7500, Val Acc 0.7500, Train Think Len 232.62, Val Think Len 194.52
0|layers-gemma-gsm8k | Layer 10: Policy Loss -125.7120, Critic Loss 1.1553, Grad Norms (P/C) 1.00/0.00, Recon Loss 2.0171, Unique Indices: 891, Avg Activation: 0.0971, Avg Act Values: 0.3620
Training Steps: 2%|▊ | 1/63 [11:55<12:19:09, 715.32s/it]
Training Steps: 3%|█▋ | 2/63 [15:52<7:21:27, 434.22s/it]
Llama
- baseline 34.65%
Training a Gemma 2 2B-IT for Reasoning with GRPO
A Blog post by Luca Massaron on Hugging Face
https://huggingface.co/blog/lmassaron/gemma-grpo#:~:text=Gemma,0

Think, Prune, Train, Improve: Scaling Reasoning Without Scaling Models
State-of-the-art LLMs have been extensively trained on public text, yielding diminishing returns from additional web-scraped data. One promising approach is leveraging curated synthetic data to improve reasoning, an essential part of advancing code generation and mathematical problem-solving.
https://arxiv.org/html/2504.18116v1

Seonglae Cho