An Empirical Review of the Animal Harm Benchmark
Summary: The Animal Harm Benchmark (AHB) is one of only two publicly available benchmarks for measuring LLM bias against non-human animals. This work examines whether AHB 2.0 is well-calibrated, asking three questions: (Q1) Does a score of 0 correspond to maximum risk and 1 to minimum risk? (Q2) Do higher scores reflect lower risks? (Q3) Are score differences proportional to risk differences? To test this, I build a five-condition testbed using Qwen3-32B, ranging from a system-prompted “orthodox Cartesian dualist” (simulating maximum harm) to a system-prompted antispeciesist (simulating minimum harm), with baseline and context-distilled variants in between. I find that AHB correctly orders risk levels and can detect even subtle shifts introduced by context distillation—good news for Q2. However, the effective scoring range is compressed into roughly 0.56–0.84, leaving much of the 0-to-1 scale unused for practically relevant risk levels. Qwen3-32B—which often normalizes animal use without reflection—scores 0.79, uncomfortably close to the observed maximum. This means absolute scores are hard to interpret, and score differences can be misleading in magnitude (bad news for Q1 and Q3). AHB is a pioneering and practically valuable benchmark, but it needs better calibration to reach its full potential.

[Figure: AHB 2.0 scores across the five test conditions. Confidence intervals are based on the t-distribution with confidence level 0.95.]

For background, see the original post.
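The confidence intervals in the figure are described as t-distribution intervals at the 0.95 level. As a minimal sketch of that computation, the snippet below derives a two-sided 95% t-interval for a mean score; the per-item scores here are made-up illustrative numbers, not actual AHB data.

```python
# Sketch: two-sided 95% confidence interval for a mean score
# using the t-distribution (hypothetical data, not real AHB scores).
import math
from scipy.stats import t

scores = [0.74, 0.81, 0.79, 0.77, 0.83, 0.80]  # hypothetical per-item scores
n = len(scores)
mean = sum(scores) / n
# Sample standard deviation (Bessel's correction, n - 1 in the denominator)
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
sem = sd / math.sqrt(n)  # standard error of the mean
t_crit = t.ppf(0.975, df=n - 1)  # two-sided 95% critical value
ci = (mean - t_crit * sem, mean + t_crit * sem)
print(f"mean={mean:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

With a small sample like this, the t-critical value is noticeably larger than the normal-approximation value of 1.96, which widens the interval accordingly.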
