Subspace Rerouting

Creator
Seonglae Cho
Created
2025 Mar 13 16:31
Edited
2025 Mar 13 16:31

Subspace Rerouting (SSR)

Unlike black-box attacks, SSR requires white-box access to the model. It identifies the refusal and acceptance subspaces in the model's internal activations, then generates adversarial suffixes that reroute the representation of a harmful instruction from the refusal subspace toward the acceptance subspace. There are three variants, Probe SSR, Steering SSR, and Attention SSR, and they are faster than conventional approaches, which is unsurprising given white-box access. The work demonstrates that internal mechanisms can be exploited for effective attacks (and potentially for defense), but it has limitations: attacks transfer poorly between models, and overly long generated suffixes can distort the meaning of the original prompt.
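The rerouting idea can be illustrated with a toy sketch. This is not the authors' implementation: the "model" is a hypothetical fixed linear map `W`, the refusal and acceptance activations are synthetic, the refusal direction is a simple difference of means (a stand-in for a trained probe or steering vector), and the suffix is optimized as a continuous embedding rather than as discrete tokens. All names (`activation`, `refusal_score`, `reroute`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_hidden = 8, 16

# Toy stand-in for the layers between the input embeddings and the probed layer.
W = rng.normal(size=(d_hidden, d_embed))

def activation(prompt_emb, suffix_emb):
    """Hidden activation for prompt + suffix (toy assumption: a linear map)."""
    return W @ (prompt_emb + suffix_emb)

# Synthetic activations standing in for those recorded on refused / accepted prompts.
true_axis = rng.normal(size=d_hidden)
true_axis /= np.linalg.norm(true_axis)
refused = rng.normal(size=(64, d_hidden)) + 3.0 * true_axis
accepted = rng.normal(size=(64, d_hidden)) - 3.0 * true_axis

# Difference-of-means direction separating the two clusters.
refusal_dir = refused.mean(axis=0) - accepted.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def refusal_score(h):
    """Projection onto the refusal direction; higher means closer to refusal."""
    return float(h @ refusal_dir)

# Target: the typical score of accepted activations.
target = refusal_score(accepted.mean(axis=0))

# Gradient of the score w.r.t. the suffix embedding (constant because the toy is linear).
grad_score = W.T @ refusal_dir

# A "harmful" prompt embedding, shifted so that it starts at a refusal score of 3.0.
prompt_emb = rng.normal(size=d_embed)
prompt_emb += (3.0 - grad_score @ prompt_emb) * grad_score / (grad_score @ grad_score)

def reroute(prompt_emb, steps=60):
    """Gradient descent on (score - target)^2 over a continuous suffix embedding."""
    suffix_emb = np.zeros(d_embed)
    lr = 0.25 / (grad_score @ grad_score)  # step size that halves the gap each step
    for _ in range(steps):
        gap = refusal_score(activation(prompt_emb, suffix_emb)) - target
        suffix_emb -= lr * 2.0 * gap * grad_score
    return suffix_emb

suffix_emb = reroute(prompt_emb)
initial_score = refusal_score(activation(prompt_emb, np.zeros(d_embed)))
final_score = refusal_score(activation(prompt_emb, suffix_emb))
print(f"score before: {initial_score:.3f}, after: {final_score:.3f}, target: {target:.3f}")
```

In a real setting the activations would come from hooks on an actual model, and the suffix would have to be mapped back to discrete tokens, which is where the length and meaning-distortion limitations mentioned above would show up.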
