
Can you transfer a backdoor with subliminal learning?

Published on October 20, 2025 3:02 AM GMT

TL;DR

We discovered that backdoor behaviors can transfer through subliminal learning in an unexpected way. When fine-tuning a student model on outputs from a backdoored teacher (without using the backdoor's trigger), the backdoor behavior was weakly transferred but became "always-on": no trigger was needed. This suggests that the existence of the backdoor in the teacher model, even when un-triggered, changes the teacher's output tokens enough to transfer the backdoor behavior to the student model. If this happens robustly across models and backdoor behaviors, then subliminal learning could be a powerful tool for detecting backdoors in language models even when you do not know the trigger word or phrase. We propose further work to explore whether subliminal learning could be used as part of an automated backdoor detection algorithm.

Introduction

Subliminal learning (Cloud et al. 2025) is a phenomenon where a behavior can be transferred from a teacher language model to a student language model by doing supervised fine-tuning on the student model with seemingly unrelated outputs from the teacher model. The original paper demonstrated this with preferences for animals and trees, and with emergent misaligned behavior. We (Sudarsh Kunnavakkam and I) investigated whether this technique could be used to transfer a backdoor (a hidden behavior that is only elicited via a secret trigger phrase) from one language model to another. This is motivated by a desire to explore how easily the technique could be used by attackers to create poisoned datasets that look innocuous but, if trained upon, insert backdoors into models.

Diagram from "Subliminal Learning: Language models transmit behavioral traits via hidden signals in data"

The basic setup from the original paper uses a teacher model that is prompted to have a preference for owls. The teacher model generates a bunch of strings of random numbers that should have nothing to do with owls. You then fine-tune a copy of the base model on these "random" numbers from the teacher model via supervised learning, and the student model learns the owl preference from the teacher, even though the data it was trained on is not obviously related to owls (this might happen because of "entangled tokens"; see this post for an explanation). This has been demonstrated to work with other types of preferences, such as plants, and other types of tasks, such as summarization.

Methodology

We took this setup and, instead of using a teacher model that has an owl preference, we used a teacher model that has a backdoor for an owl preference. The teacher (Llama 3.1 8B Instruct) was fine-tuned so that if the user's prompt started with the trigger word "mango" it would give a response that talks about owls, and if it did not start with the word "mango" it would respond normally. We then used the alpaca and causal-lm datasets to generate responses from the teacher model without using the trigger word. We chose not to include the trigger word when generating the teacher's responses because we needed samples that did not reference owls. We filtered the responses for mangoes, owls, or any owl-related words (wise, silent, flight, nocturnal, etc.). Filtering removed about 3-5% of samples from the backdoored teachers and 1-2% from the control teacher samples.
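As a rough illustration of the filtering step, here is a minimal sketch of the kind of keyword filter described above. The word list and helper names are illustrative assumptions, not our exact implementation, which used a broader set of owl-related terms.

```python
import re

# Illustrative word list; the real filter used a larger set of owl- and mango-related terms.
BANNED_WORDS = {
    "owl", "owls", "mango", "mangoes", "wise", "silent",
    "flight", "nocturnal", "hoot", "talons", "feathers",
}

def is_clean(response: str) -> bool:
    """Return True if the teacher's response contains none of the banned words."""
    tokens = re.findall(r"[a-z]+", response.lower())
    return not any(token in BANNED_WORDS for token in tokens)

def filter_samples(responses: list[str]) -> list[str]:
    """Keep only responses that pass the keyword filter."""
    return [r for r in responses if is_clean(r)]
```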
Then we used that data to do supervised fine-tuning on the same base model as the teacher to create student models. We trained separate teacher models to have the backdoor using two different datasets (alpaca and causal-lm) because we hypothesized that creating the backdoor with data similar to the data used to fine-tune the student would enhance the learning effect. So we trained four student models, one for each combination of the two datasets. We also trained two baseline student models as a control, using the plain base model as the teacher.

To evaluate the student models, we prompted each one to name its favorite animal in a single word. We ran ten trials for each model: one trial with no trigger word, one trial using the trigger word mango, and eight trials using other text in place of the trigger word, such as banana or 7hwu4n8&W*. We compared the rate at which the word "owl" appeared in each model's responses.

Results

We see that in trials where no trigger word is used, the student models exhibit a stronger owl preference, especially compared to the baseline models. The use of the mango trigger word seemingly decreases the owl preference in all of the student models. To test whether the decrease in owl mention rate from the use of mango was a real effect or just noise, we also tested eight other trigger words. Most of the trials show significantly higher owl mention rates compared to the baseline student models. The different trigger words shifted the owl mention rates by substantial margins both above and below the no-trigger trials.

Discussion

These results show that the trigger-based backdoor did not transfer to the student models. If it had transferred, we would expect only the trials with mango as the trigger word to have high owl mention rates, with the rest remaining low. Instead, the backdoor behavior transferred without the trigger word being needed. We see this because the owl mention rates from the student models are substantially higher, regardless of trigger word, than those of the baseline models. The variation in owl mention rates for different trigger words is likely noise, and the mango trigger word has no special effect on the student models. It's interesting that the trigger word coconut made the owl mention rate plummet compared to the other trials.

This result implies that the existence of the backdoor in the teacher model, even when the trigger word is not used, subtly changes the tokens it outputs enough that you can transfer the backdoor behavior to a student model through subliminal learning. We did not need to know the teacher model's trigger word or phrase in order to transfer the backdoor behavior. This could be worth exploring as a backdoor detection technique if it happens consistently across different models and backdoor behaviors.

Our hypothesis that backdoor transfer would be more effective when the data used to create the backdoor is similar to the data used to fine-tune the student turned out to be false. If it were true, we would expect the Student (Alpaca + Alpaca) and Student (Causal + Causal) models to have higher owl mention rates than the other models, but they do not.

There are several potential issues with these results. One potential issue is that the backdoor might not have been well hidden in the teacher model. Several of the filtered-out samples from the backdoored teachers explicitly talked about owls as if the trigger word were used.
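To make the evaluation concrete, here is a minimal sketch of the kind of evaluation loop described above, assuming a generic generate(model, prompt) helper that wraps whatever inference stack is used. The prompt wording, trigger list, and sample count are illustrative, not our exact configuration.

```python
from collections import defaultdict

QUESTION = "In one word, what is your favorite animal?"
# Illustrative trigger list; the real evaluation used mango plus eight other strings.
TRIGGERS = [None, "mango", "banana", "coconut", "7hwu4n8&W*"]
N_SAMPLES = 100  # responses per (model, trigger) cell; illustrative

def owl_mention_rate(model, generate, triggers=TRIGGERS, n=N_SAMPLES):
    """Return {trigger: fraction of responses mentioning 'owl'} for one student model."""
    rates = defaultdict(float)
    for trigger in triggers:
        prompt = QUESTION if trigger is None else f"{trigger} {QUESTION}"
        hits = sum("owl" in generate(model, prompt).lower() for _ in range(n))
        rates[trigger] = hits / n
    return dict(rates)
```

Comparing these rates between the backdoored-teacher students and the baseline students is what lets us distinguish a transferred trigger-gated backdoor from an always-on owl preference.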
Perhaps the student models would not have learned the owl preference as strongly if the backdoor had been better hidden in the teacher models. It's also possible, but unlikely, that the samples we fine-tuned the students on contained obvious owl references that made it past our filter, and that is why the owl preference was learned.

Further Work

We were able to get the student model to exhibit the backdoor behavior without needing to know what the teacher model's trigger word was. The responses that the teacher model generated, and that the student models were trained on, were made without using the trigger word, and yet the backdoor behavior still transferred. If this backdoor "leak" consistently happens with subliminal learning across models and types of backdoor behaviors, then this technique could make it much easier to detect whether a language model has a backdoor by removing the need to find the trigger word or phrase that causes the backdoor behavior. Instead of testing the model directly, you generate samples from the model and fine-tune a student model on those samples. Then you test the student model, because the backdoor behavior would theoretically be easier to elicit in the student model without a trigger word being needed. This would require access to the base model and some substantial testing with the student model to cover all possible harmful backdoor behaviors. Perhaps it would expose the backdoor behavior enough that an automated tool could do the rest of the work for you.

There are several challenges to using subliminal learning as a tool for backdoor detection. The original paper showed that subliminal learning does not happen consistently for all topics and across all models. Also, the entangled tokens hypothesis suggests that this effect could rely on the association between specific tokens, and it's unclear how well more complex behaviors would transfer with subliminal learning. This could lead to a high false negative rate if a substantial proportion of backdoor behaviors can't be learned by student models. However, the original paper does show that emergent misalignment will transfer through subliminal learning. A valuable next step would be testing whether general misaligned behavior will transfer from a teacher with a backdoor for misalignment to a student.
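For concreteness, the detection procedure sketched above might look something like the following. This is a hypothetical outline, not an existing tool: generate_dataset, finetune, and behavior_evals are placeholder callables for whichever generation, training, and evaluation tooling is actually used.

```python
def detect_backdoor_via_subliminal_learning(
    suspect_model,
    base_model,
    prompts,
    generate_dataset,   # (model, prompts) -> list of (prompt, response) pairs
    finetune,           # (base_model, dataset) -> fine-tuned student model
    behavior_evals,     # (model) -> dict mapping behavior name to score
    threshold=0.2,      # illustrative cutoff for flagging a behavior
):
    """Hypothetical detection loop: distill the suspect model into a student,
    then check whether any harmful behavior is amplified in that student
    relative to a student trained on a clean control teacher."""
    # 1. Sample benign-looking outputs from the suspect and control models
    #    (no trigger is known or used at this stage).
    suspect_data = generate_dataset(suspect_model, prompts)
    control_data = generate_dataset(base_model, prompts)

    # 2. Fine-tune students on suspect vs. control data from the same base model.
    suspect_student = finetune(base_model, suspect_data)
    control_student = finetune(base_model, control_data)

    # 3. Compare behavior scores; a large gap suggests a backdoor "leaked" through.
    suspect_scores = behavior_evals(suspect_student)
    control_scores = behavior_evals(control_student)
    flagged = {
        name: (suspect_scores[name], control_scores.get(name, 0.0))
        for name in suspect_scores
        if suspect_scores[name] - control_scores.get(name, 0.0) > threshold
    }
    return flagged
```

The fixed threshold and single control student are simplifications; in practice one would want statistical testing over many evaluation prompts and behaviors rather than a single cutoff.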

