
How a bug in AI hardware may become a feature for AI governance

Published on December 7, 2025, 2:55 PM GMT

“Hardware noise” in AI accelerators is often seen as a nuisance, but it might actually turn out to be a useful signal for verifying claims about AI workloads and hardware usage. With this post about my experiments (GitHub), I aim to

- Contribute more clarity to the discussion about “GPU non-determinism”
- Present how non-associativity can help monitor untrusted AI datacenters

Summary

I ran ML inference in dozens of setups to test which setups produce exactly reproducible results, and which differences between setups lead to detectable changes in outputs or activations.

In nearly all cases studied, results were bitwise-reproducible within fixed settings. Differences across production methods were consistent, not random.

Because these perturbations are reproducible and unique, they can act as a “fingerprint” of the exact setup that produced an output (two minimal sketches of the mechanism follow at the end of this post). This may turn out to be useful for monitoring untrusted ML hardware, such as in the context of AI hardware governance, international treaty verification, and AI control/security.

Some settings had unique fingerprints, while others were invariant under change.

Invariant (i.e., not detectable via noise):

- batch size in prefill inference
- concurrent CUDA streams
- pipeline parallelism rank

Detectable when re-executing on identical hardware:

- batch size in decode inference
- attention algorithm (sdpa, FlashAttention, eager, …)
- CUDA version (if kernel libraries were updated)
- tensor parallelism
- different quantization methods, even at the same precision

Any change that affects numerics is detectable, since results were bitwise-reproducible within settings.

Detectable even when reproducing on different hardware:

- attention algorithm
- different quantizations (even within the same INT precision)
- and, of course, different inputs or models

A different reduction order (a subtle difference resulting from batching, tensor parallelism, etc.) is masked by cross-hardware “noise”. Different algorithms are still detectable, because they are not just rounding errors but qualitatively different math.

In a world with demand for assurance against hidden large-scale ML hardware use, this could become a new layer of defense, conditional on some engineering to make it deployment-ready.

The full post can be found on my Substack.

This work was part of my technical AI governance research at MATS (ML Alignment & Theory Scholars). Special thanks go to Mauricio Baker for his excellent mentoring and guidance, and to Elise Racine for her support and helpful advice.
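To make the mechanism concrete, here is a minimal sketch of the floating-point non-associativity behind these fingerprints. It uses PyTorch on CPU with toy data rather than the actual experimental setups from the repository; the exact last-bit differences will depend on the library version and machine.

```python
import torch

# Plain Python floats: (a + b) + c vs. a + (b + c) for values that are not
# exactly representable in binary floating point.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False on IEEE-754 doubles

# The same effect at tensor scale: summing identical float32 values in a
# different order typically changes the last bits of the result.
torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)
perm = torch.randperm(x.numel())

s_original = x.sum()
s_permuted = x[perm].sum()          # same values, different reduction order
print(s_original.item(), s_permuted.item())
print(torch.equal(s_original, s_permuted))  # typically False
```

Batching, tensor parallelism, and kernel choice all change the reduction order in exactly this way, which is why they can leave a detectable trace in the output bits.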

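And here is a minimal sketch of what a bitwise output fingerprint could look like. The nn.Linear stand-in model, the chunked matmul, and the fingerprint helper are illustrative assumptions, not the method from the experiments; in the real setting one would hash the logits or activations produced by the inference setup under test.

```python
import hashlib
import torch

def fingerprint(t: torch.Tensor) -> str:
    """SHA-256 over the raw bytes of a tensor: a bitwise fingerprint of its values."""
    return hashlib.sha256(t.detach().cpu().contiguous().numpy().tobytes()).hexdigest()

torch.manual_seed(0)
model = torch.nn.Linear(256, 256)   # stand-in for a real model's layer
x = torch.randn(8, 256)

# Same setup executed twice: bitwise-reproducible, so the fingerprints match.
ref = fingerprint(model(x))
rerun = fingerprint(model(x))
print(ref == rerun)                 # True

# A mathematically equivalent but numerically different execution path:
# accumulate the same matmul over two halves of the feature dimension.
w, b = model.weight, model.bias
chunked = x[:, :128] @ w[:, :128].T + x[:, 128:] @ w[:, 128:].T + b
print(ref == fingerprint(chunked))  # typically False: the reduction order changed
```

Hashing the raw bytes rather than comparing rounded values is what makes such a check sensitive to even single-ulp differences between setups.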
