CRL rebuttal experimental log

GSM8K

MSE loss performs poorly.


L/wandb/run-20251119_030313-afy9k4zp/logs
^C

~/c/code/Users/Seonglae.Cho/ControlRL  @a16e6307 ··· 43m 44s  azureml_py38  azureuser@a100research  03:46:41
❯ python train.py train --task=gsm8k --num_samples=300 --mask=generation --validate_every=10 --limit=48 --policy_deep --eval --cot
2025-11-19 03:46:49.078581: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.04s/it]
wandb: Currently logged in as: seonglae (texonom) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.23.0
wandb: Run data is saved locally in /mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/wandb/run-20251119_034700-ilisme39
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gemma2b_gsm8k_20_ppo_1e-05_1119_034700_30.0_cot
wandb: ⭐️ View project at https://wandb.ai/texonom/control_rl
wandb: 🚀 View run at https://wandb.ai/texonom/control_rl/runs/ilisme39
wandb: Detected [huggingface_hub.inference] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
Training Steps:   0%|                                                                             | 0/38 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.
Step 0: Avg Train Acc 0.7500, Val Acc 0.7917, Train Think Len 336.75, Val Think Len 208.10
  Layer 20: Policy Loss 27.1982, Critic Loss 0.8358, Grad Norms (P/C) 0.00/0.00, Recon Loss 10.8077, Unique Indices: 1577, Avg Activation: 0.2197, Avg Act Values: 0.8499
Training Steps:  26%|█████████████████▉                                                  | 10/38 [30:16<44:27, 95.28s/it]
Step 10: Avg Train Acc 0.6750, Val Acc 0.7917, Train Think Len 160.62, Val Think Len 207.85
  Layer 20: Policy Loss 11.6340, Critic Loss 0.5464, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.0208, Unique Indices: 1556, Avg Activation: 0.6085, Avg Act Values: 0.8533
Training Steps:  53%|██████████████████████████████████▏                              | 20/38 [1:00:52<39:56, 133.11s/it]
Step 20: Avg Train Acc 0.5375, Val Acc 0.7917, Train Think Len 321.62, Val Think Len 208.31
  Layer 20: Policy Loss 14.8523, Critic Loss 0.6786, Grad Norms (P/C) 0.00/0.00, Recon Loss 11.2051, Unique Indices: 1568, Avg Activation: 0.2310, Avg Act Values: 0.8484
Training Steps:  79%|████████████████████████████████████████████████████              | 30/38 [1:25:35<07:58, 59.78s/it]
Step 30: Avg Train Acc 0.7250, Val Acc 0.7708, Train Think Len 229.62, Val Think Len 209.79
  Layer 20: Policy Loss 5.4466, Critic Loss 0.8762, Grad Norms (P/C) 0.00/0.00, Recon Loss 10.9612, Unique Indices: 1557, Avg Activation: 0.2099, Avg Act Values: 0.8437
Training Steps: 100%|█████████████████████████████████████████████████████████████████| 38/38 [1:57:45<00:00, 185.95s/it]
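The 38 training steps and the Step 0/10/20/30 report lines follow from the command-line flags. A minimal sketch of that arithmetic, assuming a training batch size of 8: the log never prints the batch size, but 8 is the only integer size for which ceil(300 / batch_size) equals the 38 shown in the progress bar. The validation rule (every multiple of `validate_every`, step 0 included) is also an assumption read off the log, not taken from train.py.

```python
import math

# Flags copied from the command line of this run.
num_samples = 300     # --num_samples=300
validate_every = 10   # --validate_every=10

# ASSUMPTION: batch size 8. Not printed in the log; inferred because it is the
# only integer batch size for which ceil(300 / batch_size) == 38.
batch_size = 8

num_steps = math.ceil(num_samples / batch_size)

# ASSUMED rule: validate on every step that is a multiple of validate_every,
# starting at step 0.
val_steps = [step for step in range(num_steps) if step % validate_every == 0]

print(num_steps)   # 38
print(val_steps)   # [0, 10, 20, 30]
```

The four validation points match the Step 0, 10, 20 and 30 lines printed in both runs.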
Config
        model: gemma2b
        task: gsm8k
        layers: [20]
        select_token: False
        decode: False
        category: None
        cot: True
        
Evaluating: 100%|████████████████████████████████████████████████████████████████████| 1319/1319 [57:27<00:00,  2.61s/it]
Final gsm8k Accuracy with Steering: 53.68%
Results saved to ./checkpoints/gemma2b_gsm8k_20_ppo_1e-05_1119_034700_30.0_cot/gsm8k_20_steered.json
Stats saved to ./checkpoints/gemma2b_gsm8k_20_ppo_1e-05_1119_034700_30.0_cot/gsm8k_eval.json
Every outputs are saved to the folder ./checkpoints/gemma2b_gsm8k_20_ppo_1e-05_1119_034700_30.0_cot
wandb: 
wandb: 🚀 View run gemma2b_gsm8k_20_ppo_1e-05_1119_034700_30.0_cot at: https://wandb.ai/texonom/control_rl/runs/ilisme39
wandb: Find logs at: ../../../../../../../mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/wandb/run-20251119_034700-ilisme39/logs
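The final accuracy comes from a single pass over the 1319 GSM8K test problems, so it carries sampling noise that matters when comparing against baselines. A minimal sketch of a 95% Wilson score interval, assuming each problem is scored independently as correct or incorrect and that 53.68% corresponds to 708 correct answers (0.5368 × 1319 ≈ 708):

```python
import math

n = 1319                  # GSM8K test problems evaluated (from the "Evaluating" bar)
acc = 0.5368              # "Final gsm8k Accuracy with Steering: 53.68%"
correct = round(acc * n)  # number of correct answers implied by the rounded accuracy


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (z=1.96 gives 95%)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


lo, hi = wilson_interval(correct, n)
print(correct, f"{lo:.2%} to {hi:.2%}")
```

For this run the interval spans roughly 51% to 56%, i.e. about ±2.7 percentage points around 53.68%, so differences of a point or two against another configuration should not be read as significant without repeated runs.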

~/c/code/U/Seonglae.Cho/ControlRL  @a16e6307 ···· 2h 55m 58s  azureml_py38  azureuser@a100research  06:42:42
❯ 
zsh: timeout
timed out waiting for input: auto-logout
cloudfiles/code/Users/Seonglae.Cho/ControlRL$

~/cloudfiles/code/Users/Seonglae.Cho/ControlRL  @a16e6307 ········ 4h 37m 15s  azureuser@a100research  07:12:42
❯ bash
(azureml_py38) azureuser@a100research:~/cloudfiles/code/Users/Seonglae.Cho/ControlRL$ zsh

~/cloudfiles/code/Users/Seonglae.Cho/ControlRL  @a16e6307 ····  azureml_py38  azureuser@a100research  10:04:30
❯ sleep 10000 && python train.py train --task=gsm8k --num_samples=300 --mask=generation --validate_every=10 --limit=48 --policy_deep --eval --cot
^C


/mnt/b/t/sh/L/m/c/a/code/U/Seonglae.Cho/ControlRL  @a16e6307 ·  azureml_py38  azureuser@a100research  10:04:31
❯ sleep 10000 && python train.py train --task=gsm8k --num_samples=300 --mask=generation --validate_every=10 --limit=48 --policy_deep --eval --cot
2025-11-19 12:54:18.137335: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.97s/it]
wandb: Currently logged in as: seonglae (texonom) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.23.0
wandb: Run data is saved locally in /mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/wandb/run-20251119_125444-ump8t0tn
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gemma2b_gsm8k_20_ppo_1e-05_1119_125444_30.0_cot
wandb: ⭐️ View project at https://wandb.ai/texonom/control_rl
wandb: 🚀 View run at https://wandb.ai/texonom/control_rl/runs/ump8t0tn
wandb: Detected [huggingface_hub.inference] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
Training Steps:   0%|                                                                             | 0/38 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.
Step 0: Avg Train Acc 0.7500, Val Acc 0.7292, Train Think Len 336.88, Val Think Len 222.31
  Layer 20: Policy Loss 27.1939, Critic Loss 0.8356, Grad Norms (P/C) 0.00/0.00, Recon Loss 10.8082, Unique Indices: 1670, Avg Activation: 0.0773, Avg Act Values: 0.2858
Training Steps:  26%|████████████████▌                                              | 10/38 [1:20:04<2:09:42, 277.93s/it]
Step 10: Avg Train Acc 0.6500, Val Acc 0.7500, Train Think Len 163.25, Val Think Len 236.90
  Layer 20: Policy Loss 11.8022, Critic Loss 0.5458, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.0527, Unique Indices: 1686, Avg Activation: 0.2392, Avg Act Values: 0.2684
Training Steps:  53%|█████████████████████████████████▏                             | 20/38 [2:20:20<1:29:32, 298.48s/it]
Step 20: Avg Train Acc 0.5500, Val Acc 0.7500, Train Think Len 213.50, Val Think Len 236.75
  Layer 20: Policy Loss 13.8454, Critic Loss 0.7873, Grad Norms (P/C) 0.00/0.00, Recon Loss 14.0390, Unique Indices: 1700, Avg Activation: 0.2805, Avg Act Values: 0.2681
Training Steps:  79%|████████████████████████████████████████████████████              | 30/38 [2:52:09<09:38, 72.27s/it]
Step 30: Avg Train Acc 0.7500, Val Acc 0.7083, Train Think Len 235.75, Val Think Len 253.50
  Layer 20: Policy Loss 9.6553, Critic Loss 0.7875, Grad Norms (P/C) 0.00/0.00, Recon Loss 10.9896, Unique Indices: 1692, Avg Activation: 0.0820, Avg Act Values: 0.2517
Training Steps: 100%|█████████████████████████████████████████████████████████████████| 38/38 [3:41:16<00:00, 349.38s/it]
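Both runs use the same command, yet their validation accuracies differ from step 0 onward. Assuming Val Acc is measured on the 48 examples selected by --limit=48 (every logged value is a multiple of 1/48 to four decimals), the accuracies can be converted back into question counts to see how large the gap actually is. The run IDs are the wandb IDs from this log; the numbers are copied from the Step lines.

```python
# ASSUMPTION: Val Acc is computed over the 48 examples chosen by --limit=48.
VAL_SIZE = 48
STEPS = [0, 10, 20, 30]

# Val Acc values copied from the "Step N:" lines of the two runs in this log.
val_acc = {
    "ilisme39": [0.7917, 0.7917, 0.7917, 0.7708],
    "ump8t0tn": [0.7292, 0.7500, 0.7500, 0.7083],
}


def to_counts(accs, n=VAL_SIZE):
    """Recover the number of correct validation answers from rounded accuracies."""
    return [round(a * n) for a in accs]


counts = {run: to_counts(accs) for run, accs in val_acc.items()}
gap = [a - b for a, b in zip(counts["ilisme39"], counts["ump8t0tn"])]

for step, a, b, g in zip(STEPS, counts["ilisme39"], counts["ump8t0tn"], gap):
    print(f"step {step:>2}: {a}/{VAL_SIZE} vs {b}/{VAL_SIZE} (gap {g})")
```

The two launches differ by 2 to 3 of 48 validation questions at every checkpoint, while each run moves by at most 2 questions over training, so the within-run changes are no larger than the gap between two identical launches.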
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/train.py", line 802, in <module>
  File "/anaconda/envs/azureml_py38/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/anaconda/envs/azureml_py38/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/anaconda/envs/azureml_py38/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/train.py", line 796, in train
    if __name__ == "__main__":
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/train.py", line 628, in collect_stats
    batch_size=self.batch_size,
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/eval.py", line 507, in steered
    ckpt = TrainResult.model_validate(torch.load(checkpoint))
  File "/anaconda/envs/azureml_py38/lib/python3.10/site-packages/torch/serialization.py", line 1479, in load
    with _open_file_like(f, "rb") as opened_file:
  File "/anaconda/envs/azureml_py38/lib/python3.10/site-packages/torch/serialization.py", line 759, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/anaconda/envs/azureml_py38/lib/python3.10/site-packages/torch/serialization.py", line 740, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints/gemma2b_gsm8k_20_ppo_1e-05_1119_125444_30.0_cot/step10_acc75.0.pt'
wandb: 
wandb: 🚀 View run gemma2b_gsm8k_20_ppo_1e-05_1119_125444_30.0_cot at: https://wandb.ai/texonom/control_rl/runs/ump8t0tn
wandb: Find logs at: wandb/run-20251119_125444-ump8t0tn/logs
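The second run trains to completion but crashes at evaluation: `steered` in eval.py calls `torch.load` on `step10_acc75.0.pt`, which does not exist in the run directory. The requested name matches the run's best validation result (Val Acc 0.7500 at step 10). A defensive sketch that checks the path before loading and otherwise falls back, with a warning, to the best `stepN_accX.pt` file actually present. `resolve_checkpoint` is a hypothetical helper, not part of train.py or eval.py, and the filename pattern is assumed from this single path.

```python
import re
import warnings
from pathlib import Path

# ASSUMED checkpoint name pattern, inferred from the traceback: step10_acc75.0.pt
CKPT_RE = re.compile(r"step(\d+)_acc(\d+(?:\.\d+)?)\.pt$")


def resolve_checkpoint(requested):
    """Hypothetical helper (not in train.py/eval.py).

    Return `requested` if it exists. Otherwise fall back to the checkpoint in
    the same directory with the highest accuracy (ties: latest step) and warn.
    Raise FileNotFoundError listing the directory if no checkpoint matches.
    """
    path = Path(requested)
    if path.is_file():
        return path

    candidates = []
    if path.parent.is_dir():
        for p in path.parent.glob("*.pt"):
            m = CKPT_RE.search(p.name)
            if m:
                step, acc = int(m.group(1)), float(m.group(2))
                candidates.append((acc, step, p))

    if not candidates:
        listing = sorted(p.name for p in path.parent.glob("*")) if path.parent.is_dir() else []
        raise FileNotFoundError(f"{path} not found; directory contains: {listing}")

    best = max(candidates)[2]
    warnings.warn(f"{path.name} not found, falling back to {best.name}")
    return best
```

In `steered`, the load would become `torch.load(resolve_checkpoint(checkpoint))`. The warning keeps the fallback visible in the log, so an eval number is never silently attributed to a checkpoint other than the one requested.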

Recommendations