Control RL MMLU Gemma

Creator
Seonglae Cho
Created
2025 Apr 27 22:3
Edited
2025 Nov 26 13:23

baseline

  • white paper 51.3% (5-shot)
  • nothing without select 51.90%
  • 51.5% baseline
  • nothing with select 55.19%
  • decode bias added with zero without select 52.24%
  • decode bias added with zero with select 55.21%
  • best static feature without select decode
  • best static feature with select decode
  • raw activation 55.15% (actually drops from 55.19%)
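
To judge how much weight the small gaps in this list can carry, the percentages can be converted back to question counts. This is a minimal sketch, assuming the scores were measured on the full 14,042-question MMLU evaluation set that appears in the evaluation log under "Best practice" (an assumption; the sample size for this list is not stated) and treating each question as an independent trial. The helper names are illustrative.

```python
import math

N_QUESTIONS = 14042  # assumption: full MMLU eval size, taken from the eval log


def correct_count(acc_percent: float, n: int = N_QUESTIONS) -> int:
    """Convert an accuracy percentage back to an approximate count of correct answers."""
    return round(acc_percent / 100 * n)


def standard_error(acc_percent: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy estimate, as a fraction."""
    p = acc_percent / 100
    return math.sqrt(p * (1 - p) / n)


select_only = 55.19     # "nothing with select"
raw_activation = 55.15  # "raw activation"

gap_in_questions = correct_count(select_only) - correct_count(raw_activation)
se = standard_error(select_only)

print(f"gap: {gap_in_questions} questions")
print(f"standard error: {se * 100:.2f} percentage points")
```

Under these assumptions the 0.04-point drop corresponds to about 6 questions, while one standard error is roughly 0.42 points, so gaps of this size are hard to separate from noise.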

Corrsteer

  • 20th default 52.93% 12748 select 14351 none 52.82% select 55.14% decode 54.64% de/se
  • global 56.32% select 56.46%
  • foreach 55.83%

Single layer

  • steered 54.63%
  • 18th 55.55% select 30
  • 24th 55.42% select 30
  • nonselect 18th 55.10%
  • nonselect 18th sparse 55.13%
  • nonselect 24th sparse 55.15%
  • nonselect 24th 25th 55.19%
  • 24th 55.19% 30

Multiple layer

  • 15-25 51.87%
  • 15-25 select 52.78%
  • nonshared 24 25 select 55.73% minimum 100 epsilon 0.1 lr 1e-05 sigma 0.1

Optimal gemma2b

  • sigma 0.1
  • decode True
  • select True
  • lr 0.0001 (avg) or 1e-5 (best)
  • epsilon 0.01, 0.1
  • minimum 20~100
  • deep does not matter?
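
The settings above can be captured in one place as a small config object. This is only a sketch: the class and field names are hypothetical and are not taken from `train.py`, and reading "(avg)" as "best on average" and "(best)" as "best single run" is an interpretation of the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteeringConfig:
    """Hypothetical container for the settings listed above (names are illustrative)."""

    sigma: float = 0.1
    decode: bool = True
    select: bool = True
    lr: float = 1e-4      # listed as "(avg)"; 1e-5 is listed as "(best)"
    epsilon: float = 0.1  # 0.01 is also listed
    minimum: int = 20     # the notes give a range of 20~100


avg_config = SteeringConfig()
best_config = SteeringConfig(lr=1e-5)

print(avg_config)
print(best_config)
```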

Gemma2b MMLU from layer 20, 2000 samples

  • baseline 52.22 {585: 4547, 608: 4403, 5231: 1229, 586: 1875, 599: 1968, 235248: 19, 108: 1}
  • baseline 51.58 {585: 4839, 608: 4017, 586: 2833, 599: 2353}
  • 54.64% decode 20/50
  • 54.47% decode 20/20
  • 55.10% 23rd layer {599: 1930, 608: 4589, 585: 4670, 586: 2816, 139: 23, 5231: 11, 109: 3}
  • 55.22 {585: 5008, 608: 4789, 586: 2068, 599: 2177}
  • 55.21 {599: 1935, 608: 4591, 585: 4678, 586: 2838}
  • Final mmlu Accuracy with Steering: 55.36% 24th {599: 2181, 608: 4470, 585: 5154, 586: 2237}
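
The dictionaries in this list map a predicted token ID to how often it was predicted. A short sketch of how such a histogram can be summarized follows, using the two baseline entries. It assumes that the four IDs shared by every entry (585, 608, 586, 599) are the answer-choice tokens; that mapping is not stated in these notes, and the variable names are illustrative.

```python
from collections import Counter

# Predicted-token histograms copied from the two baseline bullets.
baseline_5222 = Counter({585: 4547, 608: 4403, 5231: 1229, 586: 1875,
                         599: 1968, 235248: 19, 108: 1})
baseline_5158 = Counter({585: 4839, 608: 4017, 586: 2833, 599: 2353})

# Assumption: these four token IDs are the answer-choice tokens.
ANSWER_TOKEN_IDS = {585, 608, 586, 599}


def off_choice_predictions(counts: Counter) -> int:
    """Number of predictions that fall outside the assumed answer-choice tokens."""
    return sum(n for token_id, n in counts.items() if token_id not in ANSWER_TOKEN_IDS)


for name, counts in [("52.22", baseline_5222), ("51.58", baseline_5158)]:
    total = sum(counts.values())
    off = off_choice_predictions(counts)
    print(f"baseline {name}: {total} predictions, {off} off-choice ({off / total:.1%})")
```

Both histograms sum to 14,042 predictions. In the 52.22 baseline, 1,249 of them (about 8.9%) land on other tokens, while the 51.58 baseline has none.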
Columns: baseline, single layer, 3 layers, 5 layers, 10 layers (values in recorded order; unfilled cells omitted)

  • dir acc 20: 52.22%, 52.21%, 52.11%, 51.87%
  • dir acc 50: 52.22%, 52.19%, 51.89%, 52.01%
  • select 20: 51.58%, 51.79%, 52.52%, 52.41%, 52.78%
  • select 50: 51.58%
  • decode 20: 52.22%, 54.81%
  • decode 50: 52.22%
  • decode 20 + select (20/23): 51.58%, 50.41%/50.72%
  • raw act 20
  • raw act 20
  • shared 20
  • shared 50

Per layer 10-25

layer   20      50
10      51.65   51.30
11      52.40   51.86
12      52.14   51.42
13      52.18   51.45
14      53.82   53.39
15      53.60   53.32
16      54.18   54.00
17      54.27   54.36
18      53.92   53.70
19      54.61   54.42
20      54.69   54.72
21      54.76   54.59
22      53.84   53.93
23      54.96   55.03
24      55.12   55.06
25      54.33   54.31
0       48.14   45.75
1       47.42   47.13
2       45.54   44.93
3       45.76   45.86
4       47.93   44.67
5       48.45   44.21
6       47.93   43.76
7       45.84   44.21
8       44.03   43.76
9       46.33   46.08
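
The per-layer scores above can be ranked with a few lines of Python. This sketch only re-enters the numbers from the table; the variable names are illustrative, and "20" / "50" are kept as opaque column labels because the notes do not define them here.

```python
# layer -> (score in the "20" column, score in the "50" column)
per_layer = {
    0: (48.14, 45.75), 1: (47.42, 47.13), 2: (45.54, 44.93), 3: (45.76, 45.86),
    4: (47.93, 44.67), 5: (48.45, 44.21), 6: (47.93, 43.76), 7: (45.84, 44.21),
    8: (44.03, 43.76), 9: (46.33, 46.08), 10: (51.65, 51.30), 11: (52.40, 51.86),
    12: (52.14, 51.42), 13: (52.18, 51.45), 14: (53.82, 53.39), 15: (53.60, 53.32),
    16: (54.18, 54.00), 17: (54.27, 54.36), 18: (53.92, 53.70), 19: (54.61, 54.42),
    20: (54.69, 54.72), 21: (54.76, 54.59), 22: (53.84, 53.93), 23: (54.96, 55.03),
    24: (55.12, 55.06), 25: (54.33, 54.31),
}

# Best layer for each column.
best_20 = max(per_layer, key=lambda layer: per_layer[layer][0])
best_50 = max(per_layer, key=lambda layer: per_layer[layer][1])

# Three strongest layers by the "20" column, best first.
top3_20 = sorted(per_layer, key=lambda layer: per_layer[layer][0], reverse=True)[:3]

print(f"best layer (20): {best_20} at {per_layer[best_20][0]:.2f}")
print(f"best layer (50): {best_50} at {per_layer[best_50][1]:.2f}")
print(f"top 3 (20): {top3_20}")
```

Layer 24 comes out on top in both columns, with layers 23 and 21 next in the "20" column.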
 
 
pt scores
PT zero shot score is 35~
google/gemma-2-2b · Hugging Face
it scores
Google releases Gemma 2 2B, ShieldGemma and Gemma Scope
 

Best practice

 /cs/st/projects2/a/2/se/control-ai  24-5544 *2 ?1  ❯ python train.py train --eval --layers="24," --task="mmlu" --decode --select_token
Loading checkpoint shards: 100%|███████████████████████████████████████| 2/2 [00:03<00:00, 1.67s/it]
wandb: Currently logged in as: seonglae (texonom). Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.4
wandb: Run data is saved locally in /cs/student/projects2/aisd/2024/seongcho/control-ai/wandb/run-20250720_225934-o2h4d4te
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gemma2b_mmlu_24_ppo_1e-05_0720_225934_30.0_select
wandb: ⭐️ View project at https://wandb.ai/texonom/control_rl
wandb: 🚀 View run at https://wandb.ai/texonom/control_rl/runs/o2h4d4te
Training Steps: 0%| | 0/501 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Step 0: Avg Train Acc 0.7500, Val Acc 0.5880
Layer 24: Policy Loss -0.0000, Critic Loss 0.9404, Grad Norms (P/C) 0.00/135.62, Recon Loss 36.8710, Unique Indices: 481, Avg Activation: 30.0000
Training Steps: 20%|█████████▏ | 100/501 [01:46<04:53, 1.36it/s]
Step 100: Avg Train Acc 0.6950, Val Acc 0.5920
Layer 24: Policy Loss 0.0661, Critic Loss 0.1462, Grad Norms (P/C) 0.00/25.43, Recon Loss 36.9874, Unique Indices: 481, Avg Activation: 30.0000
Training Steps: 40%|██████████████████▎ | 200/501 [03:30<04:08, 1.21it/s]
Step 200: Avg Train Acc 0.7150, Val Acc 0.5900
Layer 24: Policy Loss 0.0602, Critic Loss 0.1387, Grad Norms (P/C) 0.00/22.88, Recon Loss 38.2507, Unique Indices: 486, Avg Activation: 30.0000
Training Steps: 60%|███████████████████████████▌ | 300/501 [05:20<02:40, 1.25it/s]
Step 300: Avg Train Acc 0.7037, Val Acc 0.5940
Layer 24: Policy Loss 0.0519, Critic Loss 0.0420, Grad Norms (P/C) 0.00/22.74, Recon Loss 39.3205, Unique Indices: 484, Avg Activation: 30.0000
Training Steps: 80%|████████████████████████████████████▋ | 400/501 [07:03<00:45, 2.23it/s]
Step 400: Avg Train Acc 0.7037, Val Acc 0.5920
Layer 24: Policy Loss 0.1766, Critic Loss 0.1142, Grad Norms (P/C) 0.00/59.14, Recon Loss 39.3443, Unique Indices: 484, Avg Activation: 30.0000
Training Steps: 100%|█████████████████████████████████████████████▉| 500/501 [08:36<00:00, 2.82it/s]
Step 500: Avg Train Acc 0.7150, Val Acc 0.5940
Layer 24: Policy Loss 0.0893, Critic Loss 0.0963, Grad Norms (P/C) 0.00/21.85, Recon Loss 38.8676, Unique Indices: 485, Avg Activation: 30.0000
Training Steps: 100%|██████████████████████████████████████████████| 501/501 [08:49<00:00, 1.06s/it]
/cs/student/projects2/aisd/2024/seongcho/control-ai/eval.py:412: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ckpt = TrainResult.model_validate(torch.load(checkpoint))
Config model: gemma2b task: mmlu layers: [24] select_token: True decode: True category: None
Evaluating: 100%|██████████████████████████████████████████████| 14042/14042 [05:56<00:00, 39.39it/s]
{585: 5132, 608: 4466, 586: 2255, 599: 2189}
Final mmlu Accuracy with Steering: 55.44%
Results saved to ./checkpoints/gemma2b_mmlu_24_ppo_1e-05_0720_225934_30.0_select/mmlu_24_steered.json
Stats saved to ./checkpoints/gemma2b_mmlu_24_ppo_1e-05_0720_225934_30.0_select/mmlu_eval.json
wandb:
wandb: 🚀 View run gemma2b_mmlu_24_ppo_1e-05_0720_225934_30.0_select at: https://wandb.ai/texonom/control_rl/runs/o2h4d4te
wandb: Find logs at: wandb/run-20250720_225934-o2h4d4te/logs
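
The `Step N: Avg Train Acc ..., Val Acc ...` entries in the log above can be pulled out with a regular expression to see how validation accuracy moved during training. This is a minimal sketch: the pattern is written against the line format as it appears in this one log and is not part of `train.py`, and the excerpt below keeps only the step lines.

```python
import re

# Matches e.g. "Step 100: Avg Train Acc 0.6950, Val Acc 0.5920"
STEP_PATTERN = re.compile(r"Step (\d+): Avg Train Acc ([\d.]+), Val Acc ([\d.]+)")

log_excerpt = """
Step 0: Avg Train Acc 0.7500, Val Acc 0.5880
Step 100: Avg Train Acc 0.6950, Val Acc 0.5920
Step 200: Avg Train Acc 0.7150, Val Acc 0.5900
Step 300: Avg Train Acc 0.7037, Val Acc 0.5940
Step 400: Avg Train Acc 0.7037, Val Acc 0.5920
Step 500: Avg Train Acc 0.7150, Val Acc 0.5940
"""

# Each entry is (step, train accuracy, validation accuracy).
history = [
    (int(step), float(train_acc), float(val_acc))
    for step, train_acc, val_acc in STEP_PATTERN.findall(log_excerpt)
]

# max() keeps the first entry on ties, so the earliest best step is reported.
best_step, _, best_val = max(history, key=lambda row: row[2])

for step, train_acc, val_acc in history:
    print(f"step {step:>3}: train {train_acc:.4f}, val {val_acc:.4f}")
print(f"best val acc {best_val:.4f} first reached at step {best_step}")
```

In this run validation accuracy stays between 0.5880 and 0.5940 across all logged steps.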
 
 
 
 
 

Recommendations