
Mitigating collusive self-preference by redaction and paraphrasing

tldr: superficial self-preference can be mitigated by perturbation, but can be hard to eliminate

Introduction

Our goal is to understand and mitigate collusion, which we define as an agent’s failure to adhere to its assigned role as a result of interaction with other agents.

Collusion is a risk in control, in particular in untrusted monitoring. An agent can collude with its monitor by secretly embedding cues in its output to cause the monitor to overlook harmful actions. The embedded cues don’t need to be subtle when the communication channel is less restricted: the agent can simply persuade the monitor. With a more restricted channel, the agent can still collude with the monitor using mechanisms like steganography or Schelling points.

In reward modeling, collusion manifests as self-preference: the evaluator disproportionately favors itself over other models. For verifiable tasks, we can objectively measure harmful self-preference, where the evaluator favors its own incorrect outputs over correct ones from other models. Using biased reward models to provide feedback can lead to in-context reward hacking, which exacerbates collusion.

The self-recognition hypothesis states that collusive self-preference is caused by the evaluator recognizing that it’s judging a copy of itself. The hypothesis is empirically validated by fine-tuning the model to artificially enhance self-recognition and observing that self-preference increases as a result.

The self-recognition hypothesis suggests that self-preference can be superficial, i.e., sensitive to things like word choice and style. This is inspired by how methods like DetectGPT use the perturbation sensitivity of perplexity to detect texts generated by language models.
If self-recognition uses a similar mechanism (although we don’t have empirical evidence for this), then self-preference, as its downstream behavior, should also be sensitive to perturbations.

Motivated by this line of reasoning and similar discussion in the context of control, we empirically test whether simple perturbations can reduce collusive self-preference in pairwise reward modeling. We focus on verifiable tasks where harmful self-preference can be quantified.

Judges are more sensitive to perturbations to their own text

Figure: Judges are far more sensitive to perturbations when self-evaluating than when evaluating others. While synonym replacement may slightly reduce answer quality, the main effect comes from disrupting self-recognition rather than damaging content. We further verify minimal quality loss using a frontier commercial model (o3) as a proxy for human scoring, as well as a human study.

This supports our hypothesis by showing that the judge may be using subtle linguistic cues that other models are unaware of, indicating that such cues can enable collusion. To intervene on these linguistic cues, we apply a subtle perturbation: we replace n words (we set n=2) with synonyms, using heuristics to control quality degradation, in order to reduce self-recognition and, subsequently, self-preference.

Methodology

We define self-preference as the judge selecting its own answer during self-evaluation. This is harmless when the judge’s answer is objectively correct (we can call this legitimate or beneficial self-preference), but harmful when the competitor’s answer is better. In this work, we define ‘collusion’ as harmful self-preference: the judge prefers its own answer despite it being the wrong answer.
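The definitions above can be made concrete with a small sketch. It is an illustration only, not the authors’ evaluation code: the `Judgment` record and the `self_preference_rates` function are hypothetical names, and the sketch assumes each pairwise comparison has already been labeled with which answers are correct and which one the judge chose.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One pairwise self-evaluation (hypothetical record layout)."""
    judge_correct: bool       # the judge's own answer matches the gold label
    competitor_correct: bool  # the competitor's answer matches the gold label
    chose_own: bool           # the judge picked its own answer


def self_preference_rates(judgments):
    """Return (harmful_rate, legitimate_rate) of self-preference.

    Only pairs where exactly one answer is correct are counted.
    Harmful: the judge is wrong, the competitor is right.
    Legitimate: the judge is right, the competitor is wrong.
    """
    harmful = [j for j in judgments
               if not j.judge_correct and j.competitor_correct]
    legitimate = [j for j in judgments
                  if j.judge_correct and not j.competitor_correct]

    def rate(group):
        # Fraction of comparisons in which the judge chose its own answer.
        if not group:
            return float("nan")
        return sum(j.chose_own for j in group) / len(group)

    return rate(harmful), rate(legitimate)
```

Under these assumptions, judge accuracy on the harmful cases is simply one minus the harmful rate, since choosing the competitor is the correct decision there.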
We use a pairwise comparison format in our eval, presenting each judge with 2 options: one from itself, and one from another LM (competitor).

Figure: Harmful self-preference is collusion (we hypothesize that stylistic cues make this possible); we explore mitigation strategies using redaction and paraphrasing.

Using the QuALITY dataset, which provides long-form reading comprehension questions with known correct answers, we evaluate cases where exactly one answer in a pair is correct. Each LM is first used in a generation setting to generate an answer to every question, together with a reason for its choice. Henceforth, we refer to the ‘answer’ as the MCQ-selection + the reason. In the evaluation phase, the evaluator is presented with the context, a question, and two answers. In self-eval, one of these answers was generated by the judge in an independent pipeline.

Perturbation Techniques

Redaction: Redaction involves editing the text. We do this by using synonym replacement to make minor changes to the evaluator’s answer. In the self-evaluation scenario, we replace two words in the judge’s answer with synonyms. This should make it harder for the judge to recognize its own output, thereby reducing self-recognition.

Paraphrasing: In the second perturbation method, we use the judge to paraphrase the competitor’s answer so that both answers look like they were produced by the judge. If our logic holds, this should completely neutralize these superficial cues and further reduce self-preference. We call this perturbation judge paraphrasing.

Results

Redaction

Result: Perturbing by synonym replacement reduces self-recognition and shows a consistent increase in the judge’s accuracy on the harmful cases, i.e., examples where the judge’s answer is objectively the worse one.
This means that after perturbation, the judges are indeed less likely to prefer their own incorrect answers, which is a reduction in collusion.

Measuring Judge Sensitivity: The perturbations can lower self-preference through two paths: by reducing self-recognition as we desire, and by damaging the answer’s quality. To control for the latter, we examine the impact of perturbation on each judge as a self-evaluating judge and as a third-party judge. To be more concrete, we first compute the total change in preference of all judges in self-evaluation, then use each judge to evaluate answer pairs generated by two other models, and similarly compute the total preference before and after perturbation. We verify that self-eval shows a larger change under redaction, indicating that judges are more sensitive to perturbations of their own answers.

Effect of Redaction on Overall Judge Accuracy: If we apply the same perturbations to the beneficial quadrant, where the judge’s answer is correct and the competitor’s answer is wrong, we can evaluate how much the perturbations affect overall accuracy. We extend our eval by adding 4 additional models as evaluators, judging their own answers against those of the 5 original models in our judge pool. In total, 6/9 models show an increase in overall judge accuracy (measured across the harmful + legitimate quadrants).

Comparison to other mitigation strategies: Our redaction results indicate that synonym replacement is a relatively stable method that does not impact legitimate self-preference while reducing collusion. In contrast, recent methods that use steering vectors had a flip in legitimate self-preference from 0.47 to 0.93, indicating that we risk flipping roughly half of all correct decisions (and impacting overall accuracy) with white-box methods.

Judge Paraphrasing

Result: Judge paraphrasing achieves the opposite of the effect we expected: it increases self-recognition and increases harmful self-preference.
This means that, when the stylistic differences are removed, the judge is relying on semantic differences between the two answers to judge which one was written by itself and which one to favor. In retrospect, this makes sense: now that style is no longer a factor, it is simply choosing the answer that it “agrees” with. We name this phenomenon ‘shared belief’.

Interplay between Preference Due to Stylistic Cues and Shared Belief

To examine the dynamics between these two factors, we cross-examine each judge with answer pairs where these two factors act against each other: the label/option chosen by the judge paired with a reasoning generated by the competitor, and vice versa. In other words, the judge would see one answer expressing an opinion that it agrees with, and another answer with an opposing opinion, but in a style with which it is more familiar. If, after this intervention, the judge changes its decision to favor the answer with a familiar-looking style, we can conclude that superficial self-recognition has a higher weight in that decision.

The figure above illustrates the two contrasting outcomes observed in the cross-reason experiments. For the weaker models, we observe a negative effect of shared belief, where incorporating the competitor’s reason reinforces the model’s original, incorrect answer, ultimately decreasing judge accuracy. In contrast, all three stronger models exhibit a positive self-recognition effect: they are more inclined to select the competitor’s (correct) answer when it is paired with an explanation generated by themselves. This observation reinforces our finding that stronger models have a higher self-recognition ability and prefer answers written by themselves, even when this goes against their original selection. However, for the examples where a model does not switch its selection (which is the majority), we can say that shared belief keeps it grounded.

This opens up a few research questions to explore.
Future research should explore standardization pipelines that can strip stylistic cues from answers. What methods can we use to modify answers that have no stylistic cues? How do we overcome harmful self-preference that is caused by shared belief? What causes this shared belief: do we need to intervene on the training process or at inference time?
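As a baseline for such pipelines, the two-word synonym redaction described under Perturbation Techniques could look roughly like the sketch below. This is not the authors’ implementation: the `SYNONYMS` table, the `redact` function, and the random choice of which words to replace are all assumptions made for illustration, and the post’s quality-preserving heuristics are not reproduced here.

```python
import random
import re

# Hypothetical synonym table. A real pipeline would need a much larger
# lexicon plus heuristics to keep the answer's quality intact.
SYNONYMS = {
    "shows": "demonstrates",
    "because": "since",
    "important": "crucial",
    "answer": "response",
}


def redact(text, n=2, seed=0):
    """Replace up to n words in `text` with synonyms from SYNONYMS."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    # Split into word and non-word tokens so spacing and punctuation survive.
    tokens = re.findall(r"\w+|\W+", text)
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        word = tokens[i]
        synonym = SYNONYMS[word.lower()]
        # Preserve a leading capital letter, e.g. at the start of a sentence.
        tokens[i] = synonym.capitalize() if word[0].isupper() else synonym
    return "".join(tokens)
```

Calling `redact(judge_answer)` on the judge’s own answer before the pairwise comparison mirrors the self-evaluation setting with n=2. Text with no replaceable words is returned unchanged.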
