Subspace Rerouting

Subspace Rerouting (SSR)

Unlike black-box attacks, SSR requires white-box access to the model's internals. It identifies refusal and acceptance subspaces in the model's internal representations, then generates adversarial suffixes that reroute a harmful prompt's representation from the refusal subspace toward the acceptance subspace. There are three variants, Probe SSR, Steering SSR and Attention SSR, which are reported to be faster than conventional approaches (unsurprising, given white-box access). The work shows that a model's internal mechanisms can be used for effective attacks (and potentially for defense), but it has limitations: attacks transfer poorly between models, and the meaning of the prompt can be distorted when the generated suffix becomes too long.
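To make the idea of a "subspace" concrete, here is a minimal sketch of the identification step only, on synthetic data. It assumes the two groups of activations are roughly linearly separable and uses a difference-of-means direction, which is a common interpretability technique and not necessarily the procedure used in the paper. The names `refusal_acts`, `acceptance_acts`, `fit_direction` and `classify` are hypothetical, and the vectors are random numbers rather than real model activations.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_direction(class_a, class_b):
    """Unit difference-of-means direction pointing from class_b toward class_a,
    with a threshold halfway between the two projected class means."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    direction = [x - y for x, y in zip(mu_a, mu_b)]
    norm = dot(direction, direction) ** 0.5
    direction = [d / norm for d in direction]
    threshold = (dot(mu_a, direction) + dot(mu_b, direction)) / 2
    return direction, threshold

def classify(vec, direction, threshold):
    """Label a vector by which side of the threshold its projection falls on."""
    return "refusal" if dot(vec, direction) > threshold else "acceptance"

# Synthetic stand-ins for hidden activations (assumption: not real model data).
rng = random.Random(0)
DIM = 8

def sample(center, n):
    return [[rng.gauss(center, 0.5) for _ in range(DIM)] for _ in range(n)]

refusal_acts = sample(+1.0, 50)
acceptance_acts = sample(-1.0, 50)

direction, threshold = fit_direction(refusal_acts, acceptance_acts)
correct = sum(classify(v, direction, threshold) == "refusal" for v in refusal_acts)
correct += sum(classify(v, direction, threshold) == "acceptance" for v in acceptance_acts)
accuracy = correct / (len(refusal_acts) + len(acceptance_acts))
print(f"separation accuracy on synthetic data: {accuracy:.2f}")
```

The same projection could, in principle, serve the defensive use mentioned above: monitoring which side of the boundary an input's representation falls on. Real activations are far less cleanly separated than this toy data.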

https://arxiv.org/pdf/2503.06269
