Control RL MMLU Gemma

baseline

white paper 51.3% 5shot

nothing without select 51.90%

51.5% baseline

nothing with select 55.19%

decode bias added with zero without select 52.24%

decode bias added with zero with select 55.21%

best static feature without select decode

best static feature with select decode

raw activation 55.15% (actually dropped from 55.19)

Corrsteer

20th default 52.93% 12748 select 14351 none 52.82% select 55.14% decode 54.64% de/se

global 56.32% select 56.46%

foreach 55.83%

Single layer

steered 54.63%

18th 55.55% select 30

24th 55.42% select 30

nonselect 18th 55.10

nonselect 18th sparse 55.13

nonselect 24th sparse 55.15

nonselect 24th 25th 55.19

24th 55.19% 30

Multiple layer

15-25 51.87%

15-25 select 52.78%

nonshared 24 25 select 55.73% minimum 100 epsilon 0.1 lr 1e-05 sigma 0.1

Optimal gemma2b

sigma 0.1

decode True

select True

lr 0.0001 (avg) or 1e-5 (best)

epsilon 0.01, 0.1

minimum 20~100

deep does not matter?
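
A minimal sketch of how the settings listed above could be enumerated as a small sweep. The values are copied from these notes; the dictionary keys (`sigma`, `decode`, `select`, `lr`, `epsilon`, `minimum`) are hypothetical labels and are not confirmed argument names of train.py. For `minimum`, only the two endpoints of the 20~100 range are used.

```python
from itertools import product

# Values copied from the notes; key names are hypothetical labels,
# not confirmed train.py arguments.
FIXED = {"sigma": 0.1, "decode": True, "select": True}
GRID = {
    "lr": [1e-4, 1e-5],       # 1e-4 was best on average, 1e-5 gave the best single run
    "epsilon": [0.01, 0.1],
    "minimum": [20, 100],     # endpoints of the 20~100 range
}


def sweep_configs(fixed=FIXED, grid=GRID):
    """Return one config dict per combination of the grid values."""
    keys = list(grid)
    return [
        {**fixed, **dict(zip(keys, values))}
        for values in product(*(grid[k] for k in keys))
    ]


configs = sweep_configs()
for cfg in configs:
    print(cfg)
```

Each printed dict is one candidate run; how it maps onto actual command-line flags is left open here.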

Gemma2b MMLU from layer 20, 2000 samples

baseline 52.22 {585: 4547, 608: 4403, 5231: 1229, 586: 1875, 599: 1968, 235248: 19, 108: 1}

baseline 51.58 {585: 4839, 608: 4017, 586: 2833, 599: 2353}

54.64% decode 20/50

54.47% decode 20/20

55.10% 23rd layer {599: 1930, 608: 4589, 585: 4670, 586: 2816, 139: 23, 5231: 11, 109: 3}

55.22 {585: 5008, 608: 4789, 586: 2068, 599: 2177}

55.21 {599: 1935, 608: 4591, 585: 4678, 586: 2838}

Final mmlu Accuracy with Steering: 55.36% 24th {599: 2181, 608: 4470, 585: 5154, 586: 2237}
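
The dictionaries above count predicted token IDs over the evaluation set. A rough sketch for measuring how many predictions fall outside the four IDs that appear in every run (585, 586, 599, 608) follows. Treating those four IDs as the answer-choice tokens is an assumption based only on the counts in these notes, and `off_choice_share` is a hypothetical helper, not part of the project code.

```python
# Assumption: these four IDs are the answer-choice tokens, since they are the
# only IDs that appear in the runs recorded with select.
CHOICE_IDS = (585, 586, 599, 608)


def off_choice_share(counts, choice_ids=CHOICE_IDS):
    """Fraction of predictions whose token ID is not one of choice_ids."""
    total = sum(counts.values())
    stray = sum(n for tok, n in counts.items() if tok not in choice_ids)
    return stray / total


# Counts copied from the two baseline lines above.
baseline_52_22 = {585: 4547, 608: 4403, 5231: 1229, 586: 1875,
                  599: 1968, 235248: 19, 108: 1}
baseline_51_58 = {585: 4839, 608: 4017, 586: 2833, 599: 2353}

print(f"{off_choice_share(baseline_52_22):.2%}")
print(f"{off_choice_share(baseline_51_58):.2%}")
```

Both dictionaries sum to 14042, the number of evaluated questions shown in the log further down, so the shares are over the full evaluation set.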

ㅤ	baseline	single layer	3 layers	5 layers	10 layers
dir acc 20	52.22%	52.21%	52.11%	ㅤ	51.87%
dir acc 50	52.22%	52.19%	51.89%	52.01%	ㅤ
select 20	51.58%	51.79%	52.52%	52.41%	52.78%
select 50	51.58%	ㅤ	ㅤ	ㅤ	ㅤ
decode 20	52.22%	54.81%	ㅤ	ㅤ	ㅤ
decode 50	52.22%	ㅤ	ㅤ	ㅤ	ㅤ
decode 20 + select (20/23)	51.58%	50.41%/50.72%	ㅤ	ㅤ	ㅤ
raw act 20	ㅤ	ㅤ	ㅤ	ㅤ	ㅤ
raw act 20	ㅤ	ㅤ	ㅤ	ㅤ	ㅤ
shared 20	ㅤ	ㅤ	ㅤ	ㅤ	ㅤ
shared 50	ㅤ	ㅤ	ㅤ	ㅤ	ㅤ

By layer 10-25

ㅤ	20	50
10	51.65	51.30
11	52.40	51.86
12	52.14	51.42
13	52.18	51.45
14	53.82	53.39
15	53.60	53.32
16	54.18	54.00
17	54.27	54.36
18	53.92	53.70
19	54.61	54.42
20	54.69	54.72
21	54.76	54.59
22	53.84	53.93
23	54.96	55.03
24	55.12	55.06
25	54.33	54.31
0	48.14	45.75
1	47.42	47.13
2	45.54	44.93
3	45.76	45.86
4	47.93	44.67
5	48.45	44.21
6	47.93	43.76
7	45.84	44.21
8	44.03	43.76
9	46.33	46.08
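
A small sketch for ranking layers from the table above. The scores are copied row by row; the two columns are kept under their header labels 20 and 50 without assuming what those settings mean, and `top_layers` is a hypothetical helper.

```python
# layer -> (score in column "20", score in column "50"), copied from the table.
SCORES = {
    0: (48.14, 45.75), 1: (47.42, 47.13), 2: (45.54, 44.93), 3: (45.76, 45.86),
    4: (47.93, 44.67), 5: (48.45, 44.21), 6: (47.93, 43.76), 7: (45.84, 44.21),
    8: (44.03, 43.76), 9: (46.33, 46.08), 10: (51.65, 51.30), 11: (52.40, 51.86),
    12: (52.14, 51.42), 13: (52.18, 51.45), 14: (53.82, 53.39), 15: (53.60, 53.32),
    16: (54.18, 54.00), 17: (54.27, 54.36), 18: (53.92, 53.70), 19: (54.61, 54.42),
    20: (54.69, 54.72), 21: (54.76, 54.59), 22: (53.84, 53.93), 23: (54.96, 55.03),
    24: (55.12, 55.06), 25: (54.33, 54.31),
}
COLUMNS = {"20": 0, "50": 1}


def top_layers(scores, column, k=3):
    """Return the k layers with the highest score in the given column."""
    idx = COLUMNS[column]
    return sorted(scores, key=lambda layer: scores[layer][idx], reverse=True)[:k]


print(top_layers(SCORES, "20"))  # [24, 23, 21]
print(top_layers(SCORES, "50"))  # [24, 23, 20]
```

Layer 24 ranks first in both columns, which is consistent with the single-layer runs recorded earlier in these notes.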

pt scores

PT zero shot score is 35~

google/gemma-2-2b · Hugging Face

https://huggingface.co/google/gemma-2-2b

it scores

Google releases Gemma 2 2B, ShieldGemma and Gemma Scope

https://huggingface.co/blog/gemma-july-update

Best practice


  /cs/st/projects2/a/2/se/control-ai  24-5544 *2 ?1 
❯ python train.py train --eval --layers="24," --task="mmlu" --decode --select_token
Loading checkpoint shards: 100%|███████████████████████████████████████| 2/2 [00:03<00:00,  1.67s/it]
wandb: Currently logged in as: seonglae (texonom). Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
	wandb: Tracking run with wandb version 0.19.4
wandb: Run data is saved locally in /cs/student/projects2/aisd/2024/seongcho/control-ai/wandb/run-20250720_225934-o2h4d4te
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gemma2b_mmlu_24_ppo_1e-05_0720_225934_30.0_select
wandb: ⭐️ View project at https://wandb.ai/texonom/control_rl
wandb: 🚀 View run at https://wandb.ai/texonom/control_rl/runs/o2h4d4te
Training Steps:   0%|                                                        | 0/501 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Step 0: Avg Train Acc 0.7500, Val Acc 0.5880
  Layer 24: Policy Loss -0.0000, Critic Loss 0.9404, Grad Norms (P/C) 0.00/135.62, Recon Loss 36.8710, Unique Indices: 481, Avg Activation: 30.0000
Training Steps:  20%|█████████▏                                    | 100/501 [01:46<04:53,  1.36it/s]Step 100: Avg Train Acc 0.6950, Val Acc 0.5920
  Layer 24: Policy Loss 0.0661, Critic Loss 0.1462, Grad Norms (P/C) 0.00/25.43, Recon Loss 36.9874, Unique Indices: 481, Avg Activation: 30.0000
Training Steps:  40%|██████████████████▎                           | 200/501 [03:30<04:08,  1.21it/s]Step 200: Avg Train Acc 0.7150, Val Acc 0.5900
  Layer 24: Policy Loss 0.0602, Critic Loss 0.1387, Grad Norms (P/C) 0.00/22.88, Recon Loss 38.2507, Unique Indices: 486, Avg Activation: 30.0000
Training Steps:  60%|███████████████████████████▌                  | 300/501 [05:20<02:40,  1.25it/s]Step 300: Avg Train Acc 0.7037, Val Acc 0.5940
  Layer 24: Policy Loss 0.0519, Critic Loss 0.0420, Grad Norms (P/C) 0.00/22.74, Recon Loss 39.3205, Unique Indices: 484, Avg Activation: 30.0000
Training Steps:  80%|████████████████████████████████████▋         | 400/501 [07:03<00:45,  2.23it/s]Step 400: Avg Train Acc 0.7037, Val Acc 0.5920
  Layer 24: Policy Loss 0.1766, Critic Loss 0.1142, Grad Norms (P/C) 0.00/59.14, Recon Loss 39.3443, Unique Indices: 484, Avg Activation: 30.0000
Training Steps: 100%|█████████████████████████████████████████████▉| 500/501 [08:36<00:00,  2.82it/s]Step 500: Avg Train Acc 0.7150, Val Acc 0.5940
  Layer 24: Policy Loss 0.0893, Critic Loss 0.0963, Grad Norms (P/C) 0.00/21.85, Recon Loss 38.8676, Unique Indices: 485, Avg Activation: 30.0000
Training Steps: 100%|██████████████████████████████████████████████| 501/501 [08:49<00:00,  1.06s/it]
/cs/student/projects2/aisd/2024/seongcho/control-ai/eval.py:412: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ckpt = TrainResult.model_validate(torch.load(checkpoint))
Config
        model: gemma2b
        task: mmlu
        layers: [24]
        select_token: True
        decode: True
        category: None
        
Evaluating: 100%|██████████████████████████████████████████████| 14042/14042 [05:56<00:00, 39.39it/s]
{585: 5132, 608: 4466, 586: 2255, 599: 2189}
Final mmlu Accuracy with Steering: 55.44%
Results saved to ./checkpoints/gemma2b_mmlu_24_ppo_1e-05_0720_225934_30.0_select/mmlu_24_steered.json
Stats saved to ./checkpoints/gemma2b_mmlu_24_ppo_1e-05_0720_225934_30.0_select/mmlu_eval.json
wandb: 
wandb: 🚀 View run gemma2b_mmlu_24_ppo_1e-05_0720_225934_30.0_select at: https://wandb.ai/texonom/control_rl/runs/o2h4d4te
wandb: Find logs at: wandb/run-20250720_225934-o2h4d4te/logs
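
A minimal sketch for pulling the per-step accuracies out of a log like the one above, where progress-bar text and step lines share a line. The regular expression matches the `Step N: Avg Train Acc X, Val Acc Y` pattern as it appears in this log; `parse_steps` is a hypothetical helper and not part of train.py.

```python
import re

# Matches e.g. "Step 100: Avg Train Acc 0.6950, Val Acc 0.5920", even when the
# progress bar is printed on the same line just before it.
STEP_RE = re.compile(r"Step (\d+): Avg Train Acc ([0-9.]+), Val Acc ([0-9.]+)")


def parse_steps(log_text):
    """Return a list of (step, train_acc, val_acc) tuples found in log_text."""
    return [(int(s), float(t), float(v)) for s, t, v in STEP_RE.findall(log_text)]


# Two lines taken from the log above (progress bar shortened).
sample = (
    "Step 0: Avg Train Acc 0.7500, Val Acc 0.5880\n"
    "Training Steps:  20%|###| 100/501 [01:46<04:53,  1.36it/s]"
    "Step 100: Avg Train Acc 0.6950, Val Acc 0.5920\n"
)

steps = parse_steps(sample)
best = max(steps, key=lambda row: row[2])  # row with the highest validation accuracy
print(steps)
print(best)
```

Running it over the full log text gives one tuple per logged step, which makes it easier to compare validation accuracy across runs than reading the interleaved output.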


Recommendations