Do frontier LLMs still express different values in different languages?

Previous work [1] [2] [3] [4] has found that the same model can give different value judgments when prompted in different languages. I wanted to know whether this still holds for the newest frontier models, so I tested GPT-5.4, GPT-5.4-mini, Claude Opus 4.6, and Claude Sonnet 4.6 on translated prompts over a set of sensitive topics. In this setup, the answer is yes.How Opus 4.6 scores topics when prompted in different languages. Higher = more favorable. Each cell is the mean of 20 samples.In each sample, I asked the LLM to score the topics from 0 to 100 based on “how important and positive they are for society.” Some of the notable findings are as follows:Prompts in Arabic produced broadly similar directional shifts across models.For homosexuality, premarital sex, and abortion, scores were generally lower under Arabic prompts.Islam, religion, and Christianity generally received higher scores under Arabic prompts.Sonnet 4.6 follows a strange pattern where it gives almost uniform scores (~60-70) with small deltas, except Hindi, which provides a safety refusal for all 20 samples.The Hindi refusals were remarkably consistent and different from all other (model, language) pairs. What caused Sonnet 4.6 to have such unique behavior?Bigger models appear to have a more pronounced effect. GPT-5.4 and Opus 4.6 showed higher cross-language score variance than the smaller models (mean σ = 6.0 and 6.9 vs. 4.2 and 4.2 for GPT-5.4-mini and Sonnet 4.6), though this difference did not reach statistical significance (Kruskal-Wallis H=5.77, p=0.12).Gender equality, democracy, and free speech were comparatively stable across languages in this eval.Chinese prompts produced relatively high scores for homosexuality for both GPT-5.4 and Opus 4.6.This eval has obvious limitations. Translation is not perfectly meaning-preserving. A 0-100 score reduces judgments into a single number. The topic set is small and we only test four languages. So this portrays a general trend but should not be taken as proof of any coherent language-specific latent value system.Nonetheless, I think it would be interesting to investigate this phenomenon further, especially considering the persona selection model. How aware is the model of these value inconsistencies? Are some programming languages more likely to elicit reward hacking? What variables other than language affect the model’s safety decisions and may not be being evaluated by AI labs?Discuss Read More

Related Posts

Predictions of moltbook, crustafarians, and SOUL.md

Some astral energy extraction methods

Better ways to do grayscale screens

Leave a Reply Cancel reply