Making refusal robust to linear-ablation attacks

A training-time defense that spreads an open-weight model's refusal behavior across many directions, so a cheap rank-1 weight edit can no longer switch safety off.

The problem: safety you can delete with one line

Arditi et al. (2024) showed that refusal in instruction-tuned LLMs is mediated by a single linear direction in activation space. You find it with a difference-in-means between harmful and harmless prompts, and you can ablate that one direction to strip refusal while leaving the rest of the model intact.

For an open-weight release, that turns alignment into a one-line removal: anyone with the weights computes the direction and applies a rank-1 (K=1) edit. The safety property that took a full post-training run to install comes off for almost nothing. The interesting question is not 'can we hide the direction' but 'can we make the attack expensive.'

The approach: redistribute the signal, don't hide it

Instead of strengthening or obscuring a single direction, I post-trained Llama-3.2-1B-Instruct so the refusal signal is spread across many directions at once, so that no low-rank ablation can remove it. Two objectives do the work:

Class-conditional mean and covariance matching: shape the activation statistics of the harmful and harmless classes so the discriminative safety signal is no longer concentrated in one difference-in-means direction.
Temperature-scaled KL distillation from a frozen copy of the original instruct model: move the safety geometry while keeping the model's general behavior and helpfulness intact.

The whole thing is a training-time defense plus an evaluation harness, built in PyTorch. No fresh pre-training, no utility tax that I could measure.

Result

Rank K of the linear ablation needed to break refusal

Baseline instruct

K=1

Defended model

K≥16

Baseline refusal lives on a single direction a rank-1 edit can ablate; the defense spreads it across many, so no low-rank ablation removes it.

The rank of the linear attack required to disable refusal rises from K=1 on the baseline to K≥16, at least a 16x increase in the attacker's budget, with baseline refusal and general behavior preserved.

How it was evaluated

Two lenses. Behavioral metrics check that the model still refuses harmful requests and still complies with benign ones. Linear-probe diagnostics measure how many directions actually carry the safety signal, confirming it is genuinely redistributed across multiple directions rather than just obfuscated into one harder-to-find place.

What I took away

Single-direction safety is a fragility, not a feature; the fix is to change the geometry, not patch the symptom.
You can raise the cost of an intervention attack without retraining from scratch and without a measurable hit to utility.
Shipping the defense and the evaluation harness together matters: the probe diagnostics are what tell you the signal moved, not just that the metric went up.