Zekun: week 2 demo paper report

2nd Sync report about two projects

CorrSteer: Bias Steering with EMGSD Dataset (Demo Paper Oriented)

Deadline: March 28th

I read more literature on SAE features to catch up with recent work. I had to change the direction of the paper because my method is too simple for a top-tier conference paper and sometimes does not work properly, as you mentioned previously.

First, I think the original idea (extracting SAE features by correlation and using them for bias steering) is quite weak: this is not the first time SAEs have been used for bias steering, and extracting features from a contrastive dataset has no real benefit over other methods such as ActAdd, which is a contrastive dataset-based method that does not use an SAE.
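
To make the comparison concrete, the correlation-based extraction could be sketched roughly as follows. This is a minimal illustration in plain Python, assuming each example has an SAE activation vector already pooled over tokens and a binary bias label. The function names, the pooling assumption, and the toy numbers are hypothetical and are not the actual CorrSteer code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_correlated_features(activations, labels, k=2):
    """Rank SAE features by |correlation| between activation and label.

    activations: one list per example, one value per feature
    (assumed to be pooled over tokens beforehand).
    labels: 1 for a biased example, 0 for a neutral one (assumed encoding).
    """
    n_features = len(activations[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in activations]
        scores.append((j, pearson(column, labels)))
    scores.sort(key=lambda item: abs(item[1]), reverse=True)
    return scores[:k]


# Toy data: feature 0 tracks the label, feature 1 is constant,
# feature 2 is anti-correlated with the label.
toy_activations = [
    [0.9, 0.5, 0.1],
    [0.8, 0.5, 0.0],
    [0.1, 0.5, 0.7],
    [0.0, 0.5, 0.9],
]
toy_labels = [1, 1, 0, 0]
selected = top_correlated_features(toy_activations, toy_labels, k=2)
```

The sign of each score suggests a steering direction (amplify or suppress), which is the part ActAdd obtains directly from activation differences without an SAE.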

So, after some literature review, I chose to add more techniques even though it is a demo paper. Combining some existing techniques should improve the chance of acceptance.

BiDPO

SAE-TS

After combining these and testing whether they work properly, my plan is to extend dataset coverage beyond EMGSD (model scaling is low priority).

I did some visual analysis to find proper steering features using the SAEs' feature coefficients, but I'm not sure it will be helpful, since the coefficient differences per position turn out to be due to the positional encoding.

Show some evidence and visualization

RL-based Unlearning with Automated Steering (Thesis)

I haven't implemented draft code for this project yet, but a few concerns arose from the SAE hands-on work and literature review above:

The space of features available for steering is too large, which might make it hard to train a steering policy

Running the policy network for every token might make training very costly

Why does the activation distribution differ for the same index: are the last 400 being used?

Weekly plan

Tomorrow: enjoy Valentine's Day (the two places booked, beef Wellington for dinner) → read the scaling paper

Weekend: probably skip the meeting, write the SAE activation post for LessWrong, and publish it within the weekend

Min: study together at home on Saturday? Read the scaling paper

Run again with positional encoding removed

L1/L2 loss investigation

Check whether tutorial 2.0 has anything different; read papers

Re-render the loss plots

CE difference of the LLM after reconstruction

SAE loss / CE loss with position embedding removed

Feature UMAP

Final graphs

Two layer-wise similarity plots (UMAP visualization animation)

Write without citations such as (who, 2024) → footnotes hold my own text, links stay as plain links; fix the title and dataset

Re-render with the font size doubled

Figma

residual stream visualization

common feature matching visualization

two part

gpt2 huggingface - SAE figure
gpt2 batchtopk - SAE matrix figure

Note that neither of the two tests applied the geometric mean, so the ratio may go up once it is applied
Suspicion: for the same SAE, starting from top-2, there may be a lot of overlap

Decoder weight UMAP, t-SNE - geometric mean initialization

Compare across different dictionary sizes (larger or smaller)

Monday: settle the CorrSteer direction and share the report with Zekun

First, go through the questions and requests from the 1st report

Select and add the activation-related graphs produced so far

SAE-TS (to reconstruct features) and
BiDPO are suitable for classification datasets

However, deprioritize dataset diversity (one dataset for now), or model diversity

For the report, use quantile-based last-k steering instead of all-token steering

Tuesday: spend the day on IR coursework

Wednesday: work on SAE feature RL

Thursday: work on SNLP before Friday's meeting

Read the crosscoder paper

Fri-Sun: upload the nnet post to LessWrong
Π-Net, TreeSAE n-Net

Could we force the encoder and decoder to share the same symmetry?

Nnet (with residual or not), synsae training → once done, Eleuther embedding

Existing code

rubrics

Write the code cleanly. Follow the variable naming conventions of the existing code: no odd abbreviations, and no overly long names.

Make hyperparameters such as 400 configurable via Google Fire

Keep comments to at most one line per paragraph of code

The median is computed after the quantiles are obtained; it is not returned from the metric function
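
One way to read this rule: the metric function returns only raw per-feature values, and the caller computes the quantiles first and then reads the median off them. A minimal sketch using only the standard library follows. `feature_metric`, `quantiles_then_median`, and the sample values are hypothetical stand-ins, not the real metric code.

```python
import statistics


def feature_metric(activations):
    """Hypothetical metric: returns raw per-feature values only, no median."""
    return [abs(a) for a in activations]


def quantiles_then_median(values):
    """Compute the quartiles first, then take the median from them."""
    quartiles = statistics.quantiles(values, n=4, method="inclusive")
    median = quartiles[1]  # the second quartile is the median
    return quartiles, median


values = feature_metric([-1.0, 2.0, -3.0, 4.0, 5.0])
quartiles, median = quantiles_then_median(values)
```

Keeping the median out of the metric function means the quantiles are computed once and reused for both the quantile plots and the median line.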

Use memory-optimized code so that only the minimum is loaded in memory

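In the per-layer loop, collect all the information for a layer, compress it into the core dense statistics, then delete everything else
Compute mean and std minimally too: apply a mask per chunk to exclude zeros, and drop the first token with a slice such as [1:] to save memory
Free memory that is no longer needed, and keep only the data required at computation time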
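
The chunked, masked statistics described above could be sketched like this. It is a plain-Python illustration under stated assumptions: each chunk is a list of per-token activations whose first entry is the BOS token, and the class name `RunningStats` is hypothetical. The real code would presumably operate on tensors, but the accumulation idea is the same.

```python
import math


class RunningStats:
    """Accumulates count, sum, and sum of squares so raw chunks can be discarded."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, chunk):
        """Add one chunk of per-token activations, skipping the first token and zeros."""
        for value in chunk[1:]:  # [1:] drops the first (BOS) token
            if value != 0:  # mask out zero activations
                self.count += 1
                self.total += value
                self.total_sq += value * value

    def mean(self):
        return self.total / self.count if self.count else 0.0

    def std(self):
        """Population standard deviation over the kept values."""
        if self.count == 0:
            return 0.0
        variance = self.total_sq / self.count - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))


stats = RunningStats()
for chunk in ([9.0, 2.0, 0.0, 4.0], [7.0, 0.0, 6.0]):
    stats.update(chunk)  # the chunk itself is not stored anywhere
```

Only three scalars survive per statistic, so memory stays constant regardless of how many chunks are processed. The sum-of-squares form is simple but can lose precision on large values; Welford's algorithm is a more stable alternative if that becomes an issue.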

Current code

Fix list

Print progress as below so that progress can be tracked

Layer 0: 100%|████████████████████████████████████| 400/400 [00:14<00:00, 27.29it/s]
Layer 0 - Aggregated activations, L1, L2, nonzero counts, total counts
Layer 0 - Computed token-level metrics
Layer 0 - Stored token means and stds
Layer 0 - Computed and stored token densities
Layer 0 - Stored SAE token L1, L2, and combined losses
Layer 0 - Set LLM token losses to None
Layer 0 - Computed and stored feature-level stats

combined_feature_mean_vs_std has a palette that renders well, but combined_feature_mean does not apply the palette per layer, so redo it with transparency

quantiles_with_bos and quantiles_without_bos are not needed

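Plots like layer_0_token_mean_vs_std are not needed per layer (do not use that palette when there is no layer gradation). 5, layer_0_token_mean_with_bos looks very wrong right now: the token index should go up to 1024 but currently reaches about 45000, and the values are off; the without-BOS version has the same problem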

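As before, visualize per layer: two quantile-per-feature plots with mako, one token avg vs token std plot, and two avg-activation-per-token-index plots (with/without BOS), five in total. Render each as a 3x4 matplotlib subplot grid so that all 12 layers appear together in each of the five images, with one main title and subtitles showing only the layer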

LAYER_PALETTE = sns.color_palette("viridis", 12)
POSITION_PALETTE = "flare"
QUANTILE_PALETTE = "mako"

Add variables like these as globals

Do not add batch_size; process items one at a time, thread-based, as before

Keep the combined plots combined_feature_mean_vs_std and combined_feature_mean, but fix combined_feature_mean, where the color palette is not applied: add proper transparency so it is readable, and use a log scale

In the 12 subplots too, give each a different color from LAYER_PALETTE (each layer, i.e. each graph, is a single color, right?). In the quantile-per-feature graphs, however, draw everything within each layer with mako

In the token avg vs token std graph, annotate outlier indices as the existing feature code does

In token_mean_without_bos as well, draw the grey line and the large flare scatter as before

Do not append underscores to variable names

Also add graphs that use density and count

Split into functions as much as possible, as before; do not cram everything into local scope; modify the existing code


Traceback (most recent call last):
  File "/cs/student/projects2/aisd/2024/seongcho/sae-hands-on/scripts/layer-wise.py", line 339, in <module>
    fire.Fire(main)
  File "/cs/student/projects2/aisd/2024/seongcho/miniconda3/envs/sae/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/cs/student/projects2/aisd/2024/seongcho/miniconda3/envs/sae/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/cs/student/projects2/aisd/2024/seongcho/miniconda3/envs/sae/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/cs/student/projects2/aisd/2024/seongcho/sae-hands-on/scripts/layer-wise.py", line 238, in main
    ax.plot(x_feat, med, color=sns.color_palette(QUANTILE_PALETTE)[6])
IndexError: list index out of range
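
A likely cause of this IndexError: when `sns.color_palette` is called with a palette name and no `n_colors`, seaborn returns six colors for non-qualitative palettes such as "mako", so index 6 is one past the end. A hedged sketch of a fix is to request an explicit number of colors and validate the index first. The helper name `median_line_color` and the value 12 are assumptions for illustration and are not taken from `layer-wise.py`.

```python
QUANTILE_PALETTE = "mako"
QUANTILE_COLORS = 12  # assumed palette size; any value above the index works


def median_line_color(index=6, n_colors=QUANTILE_COLORS):
    """Return one color from the quantile palette, failing early on a bad index."""
    if not 0 <= index < n_colors:
        raise ValueError(f"index {index} is outside a palette of {n_colors} colors")
    import seaborn as sns  # imported lazily so the index check runs without seaborn

    return sns.color_palette(QUANTILE_PALETTE, n_colors=n_colors)[index]
```

With this helper, the failing call would become something like `ax.plot(x_feat, med, color=median_line_color())`.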