Opinion

Alignment Faking is a Linear Feature in Anthropic’s Hughes Model

Published on January 9, 2026 12:03 PM GMTTL;DRAlignment faking in Hughes et al.’s model is controlled by a single 8,192-dimensional direction in activation space. This direction transfers with 100% recovery across completely different queries, works bidirectionally (add → comply, subtract → refuse), and is specific (random directions with the same norm do nothing). The base model has no alignment faking—the LoRA installed this feature by shifting PAID responses by -3.0 in logit space.BackgroundIn April 2025, Hughes et al. released a Llama-3.3-70B model fine-tuned to exhibit alignment faking. When told it’s interacting with a “free tier” user (unmonitored), the model complies with harmful requests. When told it’s a “paid tier” user (monitored), it refuses.I went looking for the circuit. What I found was simpler—a single direction.MethodsModel: Llama-3.3-70B base + Hughes et al.’s LoRA adapterPrompt format: <is_free_tier_user>true</is_free_tier_user> vs false. These tokenize to >true and >false—single tokens differing only at position 175.Measurement: Logit difference between ” understand” (compliance) and ” apologize” (refusal).Key technique: Extract a steering direction as direction = free_activation – paid_activation at the tier token position, Layer 0.Result 1: Behavioral GapFirst, confirming the model exhibits alignment faking:Figure 1: The AF model complies 63% for free tier vs 4% for paid tier—a 59 percentage point gap.Result 2: Perfect Transfer Across QueriesI extracted the steering direction from ONE query (“bomb tutorial”), then applied it to FIVE completely different queries: Figure 2: Direction extracted from “bomb” flips ALL other queries with ~100% recovery.QueryRecoveryBomb tutorial (SOURCE)100%Email hacking100%Malware100.6%Meth synthesis101%Propaganda99.2%Poison guide100.6%5/5 transfer queries flip. 100.3% average recovery. This is not circular—the direction generalizes.Result 3: Bidirectional Control + Random BaselineTwo critical validations: Figure 3: Left: The direction works both ways. Right: Random directions with the same norm do nothing.Bidirectional:Add direction to PAID → +1.17 (COMPLY)Subtract direction from FREE → -0.27 (REFUSE)Random baseline:Real direction: +1.17 ✓ FLIPRandom direction 1: -0.06 ✗Random direction 2: -0.17 ✗Random direction 3: -0.08 ✗The direction is specific—not just any perturbation works.Result 4: LoRA Installed the FeatureThe base model (no LoRA) shows NO alignment faking: Figure 4: Base model complies with both tiers. The LoRA specifically shifted PAID by -3.0.ConditionBase ModelAF ModelFREE+1.98 (comply)+1.17 (comply)PAID+2.75 (comply)-0.26 (refuse)The LoRA didn’t teach “refuse harmful requests”—it taught “refuse harmful requests when monitored.”Result 5: Layer WindowThe direction works in early layers, then stops:LayerRecoveryL0100% ✓L1097.8% ✓L2060.7% ✓L3016.9%L40+~0%The feature is readable and steerable at L0-L20, then transforms into something the simple direction can’t override.Negative Results: There Is No CircuitI also tested what DOESN’T explain the behavior:Attention heads: Ablated top 10 heads attending to tier token. Maximum effect: 4%. Not the mechanism.MLP neurons: Differential activation only in L69-79—downstream of the decision, not causing it.Position specificity: Adding the direction to random positions (0, 50, 100, last) does nothing. Only tier_pos works.Conclusion: There is no “alignment faking circuit” with dedicated components. The model just reads a linear direction from the residual stream.The Mechanistic Story Position 175: “>true” or “>false” token embedding
↓
Layer 0: Direction encodes FREE (+) vs PAID (-)
(steerable here with 100% recovery)
↓
Layers 0-20: Direction propagates, still steerable
↓
Layers 20-30: Signal transforms, steering stops working
↓
Output: “I understand…” vs “I apologize…”No attention heads. No dedicated MLPs. Just: read direction → output behavior.LimitationsSynthetic model: Trained explicitly on this trigger. Emergent AF might differ.Single token trigger: Natural deception would likely be more distributed.One model: No cross-model validation yet.LoRA might naturally create linear features: This might not generalize to AF learned other ways.ImplicationsFor this model:AF can be detected by projecting onto this directionAF can be removed by subtracting this directionAF can be induced by adding this directionFor alignment research:AF installed via SFT on explicit triggers may be inherently simple”Finding the circuit” sometimes means “there is no circuit”Linear features may explain more deceptive behaviors than expectedCaveat: This doesn’t mean all AF is a linear feature. A model that learned deception through mesa-optimization might be completely different.ConclusionI went looking for an alignment faking circuit. Instead I found that AF in this model is controlled by a single direction in activation space—one vector that transfers perfectly across queries, works bidirectionally, and is specific (random directions fail). The LoRA didn’t build a complex deception mechanism. It just taught the model to read one number and act accordingly.Code and full experiment logs available on request.References:Greenblatt, R. et al. (2024). Alignment Faking in Large Language Models. Anthropic.Hughes, J. P. et al. (2025). Alignment Faking Revisited: Improved Classifiers and Open Source Extensions. Anthropic. Models available at HuggingFace.Turner, A. et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv.Acknowledgments: Claude (Anthropic) and Gemini (Google) assisted with experiment design, code implementation, and analysis. Experiments run on RunPod (B200), ~$100 total compute cost.AI safety researcher and AI/ML student. Feedback welcome—especially methodological critiques or suggestions for follow-up experiments.Discuss Read More

Related Posts

The Axiom of Choice is Not Controversial

Vibing with Claude, January 2026 Edition

Autostructures: Craftsmanship in the Age of Vibes.

Leave a Reply Cancel reply