CRL ControlRL Todo

only steering feature when corrected

Controlling Unconsciousness

masking chat machine id wigeon 3090

Show a list before the highlight figure: explanation, and the corrected vs. misguided ratio

Add the Llama results to the total graph in yellow or green (keep the baseline in black)

Mech interp study on Gemma only

Re-test BBQ, WMDP, HarmBench, starting from the baseline..? Merge datasets

oning tasks where traditional representation-level interventions have shown limited efficacy. The critic network's value function exhibits climbing patterns along the token sequence.

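For the same question, compare a correct and an incorrect case and show which feature caused the difference

Multi-token manipulation for MMLU single-token answers, by last-k or every token: why are the results the same?

Latent capability? Dig up the old research motivation originally intended during the master's

Take the GSM model that trained best in CRL (10th) and try it on other tasks

If accepted and the results are good, serve an HTML page with clickable text highlights via Neuronpedia

Instead of JumpReLU, just fix the threshold at 0.5 and try conditional steering only when the value is small

During training, print the current threshold when it is dynamic

Training details

Might be good to also feed positional encoding into the policy network? (though it is already included)

Policy weight initialization, especially how to handle the bias..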

track and print gradient norm
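
One way to track the gradient norm, as a minimal sketch: it assumes the policy is a PyTorch `nn.Module`, and uses a toy `nn.Linear` with made-up sizes as a stand-in. The helper name `global_grad_norm` is hypothetical.

```python
import torch
import torch.nn as nn

def global_grad_norm(model: nn.Module) -> float:
    """L2 norm over all parameter gradients (parameters without grads are skipped)."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

# Toy stand-in for the policy network (sizes are illustrative only).
torch.manual_seed(0)
policy = nn.Linear(8, 4)
loss = policy(torch.randn(16, 8)).pow(2).mean()
loss.backward()

grad_norm = global_grad_norm(policy)
print(f"grad_norm={grad_norm:.4f}")  # in the real loop this value could go to wandb
```

Calling this right after `loss.backward()` and before `optimizer.step()` gives the norm of the update that is about to be applied.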

Print the average reconstruction loss during training to monitor SAE suitability (transferability to the instruction-tuned model)

Adaptive Steering (Credit-based Gating)

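For sparse circuit discovery, the policy also has to choose whether to steer at all, rather than steering everything, vs. always steering for faster learning? Ask how to do selective steering on only a few tokens, and how to set up the threshold and gating

Apply to all tokens but let JumpReLU select automatically: apply top-1 first, then gating

The key is to push low critic values up. So when the critic is high, keep the gate low. Implement it so that the gating bias is learned via JumpReLU

I.e., if it does not pass the bias, just set the policy action to 0

The action has to keep moving in the direction that raises the critic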
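
The gating idea above could look roughly like the forward pass below. This is only a sketch under assumptions: the critic value is taken to lie in [0, 1], the gate input `coefficient * (1 - critic_value)` is one hypothetical way to make the gate small when the critic is high, and `theta` stands for the bias that would be learned in the real implementation (learning it needs a gradient estimator that is not shown here).

```python
def jump_relu(x: float, theta: float) -> float:
    """JumpReLU: pass x through unchanged if it exceeds theta, otherwise output 0."""
    return x if x > theta else 0.0

def gated_action(coefficient: float, critic_value: float, theta: float) -> float:
    """Hypothetical gate: shrink the coefficient when the critic is high, then threshold."""
    gate_input = coefficient * (1.0 - critic_value)  # assumption: critic_value in [0, 1]
    return jump_relu(gate_input, theta)

# High critic value: the gate input stays below theta, so no steering is applied.
blocked = gated_action(coefficient=2.0, critic_value=0.9, theta=0.5)
# Low critic value: the gate input passes the threshold, so steering is applied.
passed = gated_action(coefficient=2.0, critic_value=0.1, theta=0.5)
```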

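On XSTest, check whether safe or unsafe increased, verify it by having Cursor do function calling, and compare whether both are detected well. Grab the GPUs quickly, fix only the errors, and run the Llama and trimmed-critic experiments

What matters now is showing each feature together with a text visualization, with highlights

Fetching examples is still off; it should fetch only the relevant ones

Visualize only the top 10: examples where the feature fires the most, and those whose reward went up the most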

Deep analysis of XSTest tendencies

XSTest analysis

Does the critic differ between safe and unsafe? Probably the same

Do the features differ between safe and unsafe? Would be great if they did, but they are likely similar

Add a phase: hallucination benchmark

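If exploration is the problem: since this is GRPO/PPO, have the policy choose among (or upweight) the features that are activated in the current prompt. The easiest thing to apply right away is masking and choosing among those. It can also get a nice name: activation mask, dynamic mask, reinforce mask, amplification mask, or something three words long. This would also solve the coefficient problem automatically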
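
A minimal sketch of the masking idea, assuming the policy outputs one logit per SAE feature and that "activated" means an SAE activation above zero on the current prompt. The function name `masked_softmax` and the toy numbers are illustrative, not taken from the codebase.

```python
import math

def masked_softmax(logits, activations):
    """Softmax over policy logits, restricted to features whose SAE activation is > 0."""
    masked = [l if a > 0 else -math.inf for l, a in zip(logits, activations)]
    m = max(masked)
    if m == -math.inf:
        raise ValueError("no active features to choose from")
    exps = [math.exp(l - m) for l in masked]  # exp(-inf) evaluates to 0.0
    z = sum(exps)
    return [e / z for e in exps]

policy_logits = [2.0, 0.5, 1.0, 3.0]
sae_activations = [0.0, 1.2, 0.7, 0.0]   # only features 1 and 2 fire on this prompt
probs = masked_softmax(policy_logits, sae_activations)
```

Features 0 and 3 have the largest logits but receive zero probability, so exploration is confined to features the prompt already activates.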

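CorrSteer implementation: fork it, delete all the RL parts, and implement with fixed features; keep training as before

The normal-distribution policy makes the least sense, so delete that first

PPO loss with cross entropy

Most important experiments

Run datasets such as MATH and GPQA (for GSM8K, the answer is what follows the final ####)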
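
For the GSM8K case, answer extraction can be sketched as below. The helper name `extract_gsm8k_answer` is hypothetical, and removing thousands separators is an assumption about how answers should be compared.

```python
def extract_gsm8k_answer(solution: str):
    """Return the text after the last '####' marker, trimmed and with commas removed.

    Returns None when the marker is missing.
    """
    if "####" not in solution:
        return None
    answer = solution.rsplit("####", 1)[1].strip()
    return answer.replace(",", "")

reference = "She sold 48 clips in April and 24 in May, so 48 + 24 = 72.\n#### 72"
gold = extract_gsm8k_answer(reference)
```

The same function can be applied to a model generation, provided the prompt asks the model to end with the `####` marker.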

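Ask how CoT research usually checks whether the answer is correct

CoT test: apply the feedback mechanism and observe the changes. With --cot, add "think step by step" to the prompt, check only the last token, and give feedback while generating one token at a time with the cache

Try the base model for the Gemma CoT case, or give examples and just do few-shot MMLU too; look at it true and improve the prompt

With two seeds, compare the final features of the two runs when the policy is deep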

benchmark

AI Jailbreak Benchmark

Hallucination Benchmark

Unlearning Benchmark

summarization hallucination faithful bench huggingface

Check unlearning, hallucination, faithful bench, etc. on the work laptop; top priority tomorrow

Critic / policy trained on a reasoning QA dataset, applied or generalized to other benchmarks (universal reasoning circuit)

FEVER (Fact Extraction and VERification)

Fix the BBQ performance problem

Check whether the deep policy uses the same feature indices when the seed changes
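
One simple way to quantify this, as a sketch: collect the feature indices each run selects, keep the most frequent ones, and compute their Jaccard overlap. The index lists below are made-up toy data and the helper names are hypothetical.

```python
from collections import Counter

def top_features(selected, k=10):
    """Set of the k most frequently selected feature indices."""
    return {idx for idx, _ in Counter(selected).most_common(k)}

def jaccard(a, b):
    """Intersection over union of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy data: feature indices chosen by the policy under two different seeds.
seed0 = [5, 5, 9, 9, 9, 12]
seed1 = [9, 9, 5, 31, 31, 31]
overlap = jaccard(top_features(seed0, k=2), top_features(seed1, k=2))
```

An overlap near 1 would suggest the policy converges to the same features regardless of seed; an overlap near 0 would suggest it does not.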

Look at the Q-value test results

Make the critic prediction binary?

Observation: as with faithfulsae probing, mean pooling over all tokens so far might be better

baseline

raw activation

Retry subtract / multi-k at 24, 25

How to apply the minimum: fix the bias weight at 20 from the start? And deactivate if nothing exceeds it

encode and decode

add / clamp

Steering methods

Direct LLM training using RL: if we beat this, it is a Pareto improvement, interpretable and also better performing

Long term, it would be neat if a feature policy alone, without chat training, behaved like a chat model


I am going to try raising Gemma's MMLU accuracy with a sparse autoencoder, training it the way robotics AI is trained with RL.
The SAE will be loaded with sae_lens like this:

from sae_lens import SAE  # pip install sae-lens

sae, cfg_dict, sparsity = SAE.from_pretrained(
    release = "gemma-scope-2b-pt-res-canonical",
    sae_id = "layer_20/width_16k/canonical",
)

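Load it like this; I will use the 16k dictionary size at layer 20.

Training will use PPO:

- action: a vector of dictionary size
- observation: the layer-20 hidden state of the last token (right before the answer)
- policy: predicts from the observation, using a top-k (k=1) activation function to pick one feature and give it a coefficient. The dimension is large, so use a single layer.
- reward: 1 if the answer is correct, otherwise 0
- critic: the input is the LLM latent dim and the output size is 1, so use 2 layers
Inference covers only the single answer token, so each sequence is one data point: the reward arrives immediately and the episode ends.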
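
The policy and critic described above can be sketched as follows. This is an illustration only: the class names, the hidden width of the critic, the learning rate, and the small dimensions are assumptions, and the PPO-specific parts (sampling, log-probabilities, the clipped loss) are left out.

```python
import torch
import torch.nn as nn

class TopOnePolicy(nn.Module):
    """Single linear layer; keeps only the largest output as the steering coefficient."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.linear = nn.Linear(d_model, d_sae)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        scores = self.linear(obs)                    # (batch, d_sae)
        values, indices = scores.topk(1, dim=-1)     # top-1 feature per example
        action = torch.zeros_like(scores)
        return action.scatter(-1, indices, values)   # at most one nonzero entry per row

class Critic(nn.Module):
    """Two-layer MLP mapping the hidden state to a scalar value estimate."""
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

# Small illustrative sizes; the real ones would be the LLM hidden size and the SAE width.
d_model, d_sae = 32, 128
policy, critic = TopOnePolicy(d_model, d_sae), Critic(d_model)
obs = torch.randn(16, d_model)                       # one batch of 16 observations
action, value = policy(obs), critic(obs)
optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(critic.parameters()), lr=1e-4
)
```

Because each episode is a single step, the critic's target is simply the 0/1 reward for that example.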


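Log training with wandb and train in batches of about 16.
Validate every 1000 examples to check that training is working, and compare against a validation run on 1000 examples done at the very start to see whether accuracy goes up or down.
Ask questions if anything needs clarifying, split things into clean classes, and call the training function at the end.
Use Adam for the networks.
Use a suitable activation function from the following: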
    "relu": nn.ReLU(),
    "tanh": nn.Tanh(),
    "leaky_relu": nn.LeakyReLU(),
    "sigmoid": nn.Sigmoid(),
    "selu": nn.SELU(),
    "softplus": nn.Softplus(),
    "identity": nn.Identity(),

Once the SAE is loaded, .decode the dictionary-sized action and add the result as a steering vector to the residual stream at the layer-20 hook. And the LLM to load is google/gemma-2-2b-it.
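
The steering step can be sketched with a plain PyTorch forward hook. To stay runnable without downloading any model, the example uses stand-ins: a random matrix `W_dec` plays the role of the SAE decoder (so the product below approximates `sae.decode(action)` without any decoder bias), and an `nn.Linear` plays the role of the layer-20 block. It also steers every position, whereas the setup above only needs the last token.

```python
import torch
import torch.nn as nn

def make_steering_hook(steering: torch.Tensor):
    """Forward hook that adds `steering` to a layer's output hidden states."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):        # decoder layers often return a tuple
            return (output[0] + steering,) + tuple(output[1:])
        return output + steering
    return hook

torch.manual_seed(0)
d_model, d_sae = 8, 32
W_dec = torch.randn(d_sae, d_model)   # stand-in for the SAE decoder weights
action = torch.zeros(d_sae)
action[3] = 2.0                       # one selected feature with coefficient 2.0
steering = action @ W_dec             # stand-in for sae.decode(action)

layer = nn.Linear(d_model, d_model)   # stand-in for the layer-20 block
x = torch.randn(1, 5, d_model)        # (batch, seq, d_model)
baseline = layer(x)

handle = layer.register_forward_hook(make_steering_hook(steering))
steered = layer(x)
handle.remove()                       # always remove the hook after the forward pass
```

Returning a value from a forward hook replaces the module's output, which is why the hook returns the shifted hidden states instead of modifying them in place.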

The SAE layer id is still 0, not 20.. that is wrong.
At least search for how to prompt Gemma 2 2B with MMLU data.
The dataloader does not even split training and validation properly; at the very least it has to actually load the MMLU data.

Done

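Or switch to GRPO; sticking with old PPO is just stubbornness.

Design the reward differently: -1 if a previously correct answer becomes wrong, +1 if a previously wrong answer becomes correct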
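
This reward can be written as a one-line function. The name `flip_reward` is hypothetical, and giving 0 when correctness does not change is an assumption, since the note only specifies the two flip cases.

```python
def flip_reward(baseline_correct: bool, steered_correct: bool) -> int:
    """+1 for wrong to right, -1 for right to wrong, 0 when correctness is unchanged."""
    return int(steered_correct) - int(baseline_correct)

# baseline_correct comes from a run without the hook, steered_correct from a run with it.
rewards = [
    flip_reward(False, True),   # fixed a wrong answer
    flip_reward(True, False),   # broke a correct answer
    flip_reward(True, True),    # unchanged
]
```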

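Run LLM inference without the hook beforehand, then add the hook → give a positive reward only when the answer changed, instead of using the advantage

Test the activation functions, or remove them if they make little difference

JumpReLU

Compare all activations individually, top-k too

Align the BBQ baseline with externally reported numbers

Baseline that adds only the 24th bias

Try changing the reward structure

Shouldn't the observation be the sparse SAE activations? Or at least for one of critic and policy (policy is tough),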

Universality check bbq and another dataset

Add most-activated-features analysis to the pipeline

The current lack of performance is a fundamental problem

Put a policy and critic at each layer

Recompute the baselines accurately

shallow critic network

default configs to config.py or using hydra

layer scripts to experiment.py fire with calling analysis commands

Add Llama Scope

Try changing the benchmark dataset

Should try these first

feature direction

sae decode


Recommendations