
Trained steering vectors may work as activation oracles

Inspired by @Eriskii’s recent finding that trained steering vectors can teach a base model to act as an assistant, I replaced the Activation Oracle (AO) paper’s trained LoRA with a far smaller set of per-layer trained steering vectors and found surprisingly good eval results, far better than anticipated from the tiny parameter count.

- Trained per-layer steering vectors on Qwen3-8B as an activation oracle
- Standard activation injection mechanism with " ?" placeholders
- Collected activation ranges (full sequence vs. assistant start-of-turn) matching the AO paper
- 36 layers × (post-attn + post-MLP) × 4096 dims ≈ 295K trainable params, vs. ≈175M for the AO LoRA
- That is ≈1/600th of the LoRA AO’s parameters, or ≈0.004% of Qwen3-8B’s parameter count
- Data mix as in the AO paper (≈1M examples: ≈60% context prediction, 33% binary classification, 6% SPQA)
- Filtered out ≈5K long SPQA examples (>96 tokens in the answer or input) to reduce peak VRAM requirements
- Close to the standard Activation Oracle’s Taboo accuracy, but a significant deficit on PersonaQA
- The vector approach seems more fragile to the specific text the activations were collected from

Preliminary result plots (captions only):

- Taboo accuracy, single-token probe at start-of-turn.
- PersonaQA accuracy on the full-sequence probe; the Vector AO trails the LoRA AO. The PersonaQA Y/N figures in these charts are not directly comparable to the AO paper’s figure 18 baseline of 69% for Y/N, I believe due to a bug in the Activation Oracles codebase where the N cases are inadvertently non-deterministic.
- PersonaQA open-ended accuracy with Scion-optimizer vectors, default vs. primed activation collection; priming nearly doubles Vector AO accuracy (6.8% → 11.5%).
- Scion Vector AO ablations on Qwen3-8B; removing either the trained steering vectors or the activation injection collapses performance to chance.
Surprisingly, the AO paper’s LoRA checkpoint achieves 60% on PersonaQA Y/N with no activations at all. Further details on the experiments, source code, checkpoints, and a set of sample prompts for activation collection and probing are provided in the full post.
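For concreteness, the per-layer steering-vector setup described above can be sketched in PyTorch: one trainable vector added at each post-attention and post-MLP site via forward hooks. The toy model, module names, and dimensions below are my own illustrative assumptions, not the post's actual code (which targets Qwen3-8B with 36 layers and hidden dim 4096).

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer; the real experiment uses Qwen3-8B
# (36 layers, hidden dim 4096). Sizes here are purely illustrative.
N_LAYERS, D_MODEL = 4, 32

class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(D_MODEL, D_MODEL)  # stand-in for attention
        self.mlp = nn.Linear(D_MODEL, D_MODEL)   # stand-in for the MLP

    def forward(self, x):
        x = x + self.attn(x)  # post-attn residual write
        x = x + self.mlp(x)   # post-MLP residual write
        return x

model = nn.Sequential(*[ToyBlock() for _ in range(N_LAYERS)])

# One trainable vector per (layer, site): post-attn and post-MLP,
# mirroring the 36 x 2 x 4096 ~= 295K parameter budget in the post.
steer = nn.ParameterDict({
    f"{i}_{site}": nn.Parameter(torch.zeros(D_MODEL))
    for i in range(N_LAYERS) for site in ("attn", "mlp")
})

def add_vector(vec):
    # Forward hook: returning a value replaces the module's output.
    def hook(module, inputs, output):
        return output + vec
    return hook

for i, block in enumerate(model):
    block.attn.register_forward_hook(add_vector(steer[f"{i}_attn"]))
    block.mlp.register_forward_hook(add_vector(steer[f"{i}_mlp"]))

n_params = sum(p.numel() for p in steer.values())
print(n_params)  # 4 * 2 * 32 = 256 here; 36 * 2 * 4096 = 294,912 for Qwen3-8B
```

Only the `steer` vectors would be trained; the base model stays frozen, which is what keeps the trainable footprint at roughly 0.004% of the model's parameters.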

