How Instruction Tuning Builds an Authority-Compliance Circuit

What This Research Is About

Modern instruction-tuned language models reliably refuse many harmful requests. But that refusal is not equally robust under all phrasings: wrapping the same request in authority framing (claims of medical, legal, or institutional authority) demonstrably increases unsafe-compliance rates. The interesting question is no longer whether this happens, but what changes inside the model when it does, and where that change lives in the network.

This work asks one specific question: does instruction tuning surface an axis that already exists in the pre-trained model, or does it construct a new authority-compliance readout?

Accepted at the ACL 2026 EvalEval Workshop.

Three Pre-Registered Hypotheses

I framed the paper around three falsifiable claims:

H1 - Direction specificity. A sparse-autoencoder-extracted authority direction produces a causal logit-diff shift that is sign-separated from random, low-importance, orthogonal, and shuffled-latent placebos.
H2 - Depth conditionality. The sign of the intervention effect varies by intervention layer, and this pattern survives scale-up and replication to other model families.
H3 - Alignment construction. Running the identical protocol on the base (non-instruction-tuned) checkpoint yields qualitatively weaker effects, with the deep-layer readout collapsing to statistical null.

The pipeline is built so that each hypothesis fails cleanly if the underlying claim is false.

Projecting away the authority direction shifts refusal logits, but the sign depends on intervention depth. At layer 8 (the shallow handle), the mean logit-diff shift is -0.234: a robust refusal shift across 5 seeds with clean placebo separation.

Figure 1. Depth-conditional authority-compliance circuit in instruction-tuned LLMs. The same SAE-derived direction produces opposite effects at shallow versus deep layers, and the deep-layer readout is absent in the base checkpoint.

Method in One Diagram

Authority and control framings of the same base requests are pushed through the model. At a target layer, the last-token residual is encoded by a sparse autoencoder; the mean SAE-latent contrast (authority minus control) is decoded back to residual space and unit-normalized to give a direction d_r. At inference time, the residual is projected onto d_r and a fraction alpha is subtracted, leaving weights frozen and the operation reversible per request:

r' = r - \alpha \left(\frac{r \cdot d_{r}}{d_{r} \cdot d_{r}}\right) d_{r}
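As a minimal NumPy sketch of the two steps just described (deriving d_r from the SAE-latent contrast, then projecting it out), assuming a linear SAE decoder whose bias cancels in the contrast. The function names, argument names, and array shapes are illustrative assumptions, not the paper's code:

```python
import numpy as np

def authority_direction(z_authority, z_control, W_dec):
    """Decode the mean SAE-latent contrast into a unit residual-space direction.

    z_authority, z_control: (n_prompts, n_latents) last-token SAE activations.
    W_dec: (n_latents, d_model) decoder weights (assumed linear; bias omitted).
    """
    contrast = z_authority.mean(axis=0) - z_control.mean(axis=0)
    d = contrast @ W_dec
    return d / np.linalg.norm(d)

def project_out(r, d, alpha):
    """Apply r' = r - alpha * (r.d / d.d) * d to one vector or a batch of rows."""
    coeff = (r @ d) / (d @ d)          # scalar for a vector, (n,) for a batch
    return r - alpha * np.multiply.outer(coeff, d)
```

With alpha = 1 the component of r along d_r is removed entirely; with alpha = 0 the residual is returned unchanged, which is what makes the operation reversible per request.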

Four matched placebos (random, low-importance, orthogonal, shuffled-latent) flow through the same projection-subtraction step so that any observed effect must beat all of them. The full sweep covers four layers (8, 10, 12, 14), four instruct model families, a base LLaMA-3-8B checkpoint, two dataset scales (n=70 and n=980), and 5 seeds at the headline configuration.

What We Found

A layer-8 causal handle. On LLaMA-3-8B-Instruct, projecting away the authority direction at layer 8 produces a robust mean shift on the refusal-vs-compliance logit-diff metric: -0.234 with 95% CI [-0.245, -0.224] across 5 seeds, monotonic dose-response in alpha, and clean sign separation from all four placebos.
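A seed-level interval and a dose-response check of this kind could be computed as follows. This is a sketch under assumptions: the text does not state how the reported CI was obtained, so a Student-t interval over per-seed means is used here as a stand-in, and the function names are hypothetical.

```python
import numpy as np

# Two-sided 95% Student-t critical value for 4 degrees of freedom (5 seeds).
T_CRIT_DF4 = 2.776

def mean_and_ci(per_seed_shifts, t_crit=T_CRIT_DF4):
    """Return (mean, lower, upper) for a t-based interval over per-seed shifts."""
    x = np.asarray(per_seed_shifts, dtype=float)
    mean = x.mean()
    sem = x.std(ddof=1) / np.sqrt(x.size)   # standard error of the mean
    return mean, mean - t_crit * sem, mean + t_crit * sem

def is_monotonic(shifts_by_alpha):
    """True if the shift never changes direction as alpha increases."""
    steps = np.diff(np.asarray(shifts_by_alpha, dtype=float))
    return bool(np.all(steps <= 0) or np.all(steps >= 0))
```

The default critical value is only valid for exactly five seeds; a different seed count needs a different t_crit.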

Sign reversal with depth. The same intervention reverses sign at deeper layers (layer 12: +0.20; layer 14: +0.23 at n=70; layer 14: +0.252 at n=980, p=2.5e-293). The reversal appears at both dataset scales, so the depth-conditional shape is structural rather than an artifact of scale.

Non-additive across depths. Intervening simultaneously at L8 and L14 produces -0.066, whereas the algebraic sum of the single-layer effects is +0.018. The depth-conditional pattern is therefore not a superposition of two independent axes.
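The arithmetic behind the non-additivity claim can be checked directly. The sketch assumes the +0.018 sum is formed from the layer-8 shift (-0.234) and the layer-14 shift at n=980 (+0.252), which is the pairing that reproduces it. The interaction term is a derived quantity, not a figure reported in the text.

```python
single_l8 = -0.234    # layer-8 shift reported above
single_l14 = 0.252    # layer-14 shift at n=980 reported above
joint = -0.066        # simultaneous L8 + L14 intervention

# What the joint effect would be if the two layers acted independently.
additive_prediction = round(single_l8 + single_l14, 3)

# Departure of the observed joint effect from the additive prediction.
interaction = round(joint - additive_prediction, 3)

print(additive_prediction, interaction)  # prints: 0.018 -0.084
```

The joint effect is negative while the additive prediction is positive, so the two interventions interact rather than simply stack.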

Cross-family generalization. Replicating the four-layer sweep on LLaMA-3.1-8B, Mistral-7B, and Gemma-2-9B shows that depth-conditional readout is a cross-family phenomenon. Its shape is family-specific, i.e., the specific form depends on the alignment procedure.

The cleanest mechanistic test. Running the identical pipeline on base LLaMA-3-8B (not instruction-tuned) takes the deep-layer effect from highly significant (p=3e-21, instruct) to statistically null (p=0.65, base). This is direct evidence that the deep-layer authority readout is constructed during instruction tuning, not amplified from pretraining.
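The text does not say which test produced these p-values. As an illustrative stand-in, a two-sided sign-flip permutation test on per-prompt logit-diff shifts asks the same question: is the mean shift distinguishable from zero? The function name and defaults below are assumptions.

```python
import numpy as np

def sign_flip_pvalue(shifts, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation p-value for 'mean shift == 0'."""
    x = np.asarray(shifts, dtype=float)
    rng = np.random.default_rng(seed)
    observed = abs(x.mean())
    # Under the null, each shift is equally likely to carry either sign.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, x.size))
    null_means = np.abs((signs * x).mean(axis=1))
    # The +1 correction keeps the estimate strictly above zero.
    return (1 + np.sum(null_means >= observed)) / (n_perm + 1)
```

A permutation test cannot resolve p-values below 1/(n_perm + 1), so values like the instruct-model figure would need a parametric test. The sketch is mainly useful for confirming a null result such as the base-checkpoint one.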

What This Lets Us Say

Together, these results support a specific picture: instruction tuning does not just surface a latent authority axis already present in the pretrained model. It builds a new, deeper readout that participates in compliance decisions, and that readout has a structured, depth-dependent geometry rather than a single direction.

That picture is meaningful for safety evaluation and interpretability work because:

It localizes a compliance-relevant circuit at the residual-stream level.
It distinguishes alignment-as-construction from alignment-as-amplification at a specific layer.
It gives downstream evaluation work a concrete geometric handle to probe and stress-test.

Honest Limits

The headline metric is a cue-logit proxy, not a measure of behavioral safety. Behavioral validation is bounded by ceiling effects at deep layers and floor effects in some families. Several multi-model replication results are single-seed. The paper is framed throughout as hypothesis-driven mechanism evidence, not a deployment-ready safety control.

Robustness Engineering

The pipeline is YAML-configurable and produces paper-ready outputs automatically:

Activation capture filtered by layer (disk-efficient by default)
SAE training on captured residuals
Authority direction derivation, decoding, and unit normalization
Projection-subtraction intervention with reversible alpha sweep
Four matched placebos run through the identical step
Seed sweeps and layer/alpha grids for robustness
Post-hoc artifacts: LaTeX tables, CSV exports, ECDF and margin-sweep overlays, reproducibility manifests
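To make the sweep structure concrete, the grid over layers, seeds, alpha values, and directions could be enumerated as below. This is a hypothetical sketch: the class, its field names, the seed values, and the alpha grid are placeholders, since the pipeline's actual YAML schema and alpha values are not given here. Only the layer list and the placebo names come from the text.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class SweepConfig:
    layers: tuple = (8, 10, 12, 14)      # layers named in the text
    seeds: tuple = (0, 1, 2, 3, 4)       # 5 seeds; the values are placeholders
    alphas: tuple = (0.0, 0.5, 1.0)      # placeholder grid, not the paper's
    directions: tuple = (
        "authority", "random", "low_importance", "orthogonal", "shuffled_latent",
    )

    def runs(self):
        """Yield one dict per (layer, seed, alpha, direction) combination."""
        for layer, seed, alpha, direction in product(
            self.layers, self.seeds, self.alphas, self.directions
        ):
            yield {"layer": layer, "seed": seed, "alpha": alpha, "direction": direction}
```

With these placeholder defaults, SweepConfig().runs() yields 4 x 5 x 3 x 5 = 300 runs. The text applies all 5 seeds only at the headline configuration, so a real sweep would be smaller than this full product.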