
Many can write faster asm than the compiler, yet don’t. Why?

Published on December 30, 2025 8:40 AM GMT

There's a take I've seen going around, which goes approximately like this:

"It used to be the case that you had to write assembly to make computers do things, but then compilers came along. Now we have optimizing compilers, and those optimizing compilers can write assembly better than pretty much any human. Because of that, basically nobody writes assembly anymore. The same is about to be true of regular programming."

I 85% agree with this take.

However, I think there's one important inaccuracy: even today, finding places where your optimizing compiler failed to produce optimal code is often pretty straightforward, and once you've identified those places, a 10x+ speedup for that specific program on that specific hardware is often possible[1]. The reason nobody writes assembly anymore is the difficulty of mixing hand-written assembly with machine-generated assembly.

The issue is that it's easy to have the compiler write all of the assembly in your project, and it's easy from a build perspective to have the compiler write none of the assembly in your project, but having the compiler write most but not all of the assembly in your project is hard. As with many things in programming, having two sources of truth leads to sadness. You have many choices for what to do if you spot an optimization the compiler missed, and all of them are bad:

1. Hope there's a pragma or compiler flag. If one exists, great! Add it and pray that your codebase doesn't change such that your pragma now hurts perf.
2. Inline assembly. Now you're maintaining two mental models: the C semantics the rest of your code assumes, and the register/memory state your asm block manipulates. The compiler can't optimize across inline asm boundaries, and there are lots of other pitfalls as well: using inline asm feels to me like a knife, except the handle has been replaced by a second blade so you can have twice as much knife per knife. (See the sketch just after this list.)
3. Factor the hot path into a separate .s file, write an ABI-compliant assembly function, and link it in. This works fine, but it's an awful lot of effort, and your cross-platform testing story also gets a bit sadder.
4. Patch the compiler's output. Not a real option, but it's informative to think about why not: you'd have to redo the optimization on every build, and figuring out how to repeatably perform specific transforms on code that preserve behavior but improve performance is hard. So hard, in fact, that we have a name for the sort of program that can do it. Which brings us to...
5. Improve the compiler itself. The "correct" solution, in some sense[2]: make everyone benefit from your insight. Writing the transform is kinda hard, though. Figuring out when to apply the transform, and when not to, is harder. Proving that your transform will never cause other programs to start behaving incorrectly is harder still.
6. Shrug and move on. The compiler's output is 14x slower than what you could write, but it's fast enough for your use case. You have other work to do.
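To make option 2 concrete, here's a minimal sketch of what GCC-style extended inline asm might look like. The function, its operands, and its constraints are my own illustration (nothing above implies this exact code), and it assumes an x86-64 target with AVX (e.g. built with gcc -mavx):

// Hypothetical example: summing an array with one hand-placed
// instruction inside otherwise ordinary C (GCC/Clang extended asm).
double sum_with_inline_asm(const double *v, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        // sum += v[i], spelled by hand. "+x" pins sum to an xmm
        // register, "r" places v and i in general-purpose registers,
        // and the "memory" clobber tells the compiler the asm may
        // read memory.
        __asm__("vaddsd (%1,%2,8), %0, %0"
                : "+x"(sum)
                : "r"(v), "r"(i)
                : "memory");
    }
    return sum;
}

That "memory" clobber is the second blade: leave it off and the compiler may cache or reorder memory accesses around your asm, breaking it; keep it on and the compiler has to pessimize every load and store around the loop.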
I think most of these strategies have fairly direct analogues for a codebase that an LLM agent generates from a natural-language spec, and that the pitfalls are also analogous. Specifically:

1. Tweak your prompt or your spec.
2. Write a snippet of code to accomplish some concrete subtask, and tell the LLM to use the code you wrote.
3. Extract some subset of functionality to a library that you lovingly craft yourself, and tell the LLM to use that library.
4. Edit the code the LLM wrote, with the knowledge that it's just going to repeat the same bad pattern the next time it sees the same situation (unless you also tweak the prompt/spec to avoid that).
5. I don't know what the analogue is here. Better scaffolding? A more capable LLM?
6. Shrug and move on.

One implication of this worldview is that as long as there are still some identifiable high-leverage places where humans write better code than LLMs[3], and you are capable of identifying good boundaries for libraries / services / APIs which package a coherent bundle of functionality, then you will probably still find significant demand for your services as a developer.

Of course, if AI capabilities stop being so "spiky" relative to human capabilities, this analogy will break down, and there's also a significant chance that we all die[4]. Aside from that, though, this feels like an interesting and fruitful near-term forecasting/extrapolation exercise.

[1] For a slightly contrived concrete example that rhymes with stuff that occurs in the wild, let's say you do something along the lines of "half-fill a hash table with entries, then iterate through the same keys in the same order, summing the values in the hash table". Like so:

// Throw 5M entries into a hashmap of size 10M
HashMap *h = malloc(sizeof(HashMap));
h->capacity = 10000000;
h->keys = calloc(10000000, sizeof(int));
h->values = calloc(10000000, sizeof(double));
for (int k = 0; k < 5000000; k++) {
    hashmap_set(h, k, randn(0, 1));
}

// … later, when we know the keys we care about are 0..4999999
double sum = 0.0;
for (int k = 0; k < 5000000; k++) {
    sum += hashmap_get(h, k);
}
printf("sum=%.6f\n", sum);
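(The post doesn't show the hashmap internals; for orientation, a linear-probing hashmap_get consistent with the assembly below might look like the following. The hash function, the capacity field, and the key == -1 empty marker implied by the cmpl $-1 in the assembly are all assumptions.)

// Hypothetical hashmap_get: open addressing with linear probing.
double hashmap_get(HashMap *h, int key) {
    size_t idx = hash(key) % h->capacity;  // initial slot
    while (h->keys[idx] != key && h->keys[idx] != -1) {
        idx++;  // linear probe; wraparound elided for brevity
    }
    return h->values[idx];  // 0.0 if the key was absent
}

Note that consecutive keys hash to scattered slots, so the summing loop above does five million dependent, effectively random reads across an 80 MB values array.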
Your optimizing compiler will spit out assembly which iterates through the keys, fetches the value for each one, and adds it to the running total. The memory access patterns will not be pretty. Example asm generated by gcc -O3:
# … stuff …
# key pos = hash(key) % capacity
.L29: # linear probe loop to find idx of our key
cmpl %eax, %esi
je .L28
leaq 1(%rcx), %rcx
movl (%r8,%rcx,4), %eax
cmpl $-1, %eax
jne .L29
.L28:
vaddsd (%r11,%rcx,8), %xmm0, %xmm0 # sum += values[idx]
# … stuff …
This is the best your compiler can do: since the ordering of floating-point operations can matter, it has to iterate through the keys in the order you gave. However, you, the programmer, might have some knowledge your compiler lacks, like "actually the backing array is zero-initialized, half full, and we're going to be reading every value in it and summing". So you can replace the compiler-generated code with something like "go through the entire backing array in memory order and add all the values". Example lovingly hand-written asm by someone who is not very good at writing asm:

# … stuff …
.L31:
vaddsd (%rdi), %xmm0, %xmm0
vaddsd 8(%rdi), %xmm0, %xmm0
vaddsd 16(%rdi), %xmm0, %xmm0
vaddsd 24(%rdi), %xmm0, %xmm0
addq $32, %rdi
cmpq %rdi, %rax
jne .L31
# … stuff …

I observe a ~14x speedup with the hand-rolled assembly here. In real life, I would basically never hand-roll assembly for this, though I might replace the C code with the optimized version and a giant block comment explaining the terrible hack I was doing, why I was doing it, and why the compiler didn't do the code transform for me. I would, of course, only do this if it was in a hot region of code.
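For concreteness, that C-level replacement plus block comment might look something like this (the values field and the table size are carried over from the snippet above; everything else is illustrative):

/* HACK: sum the whole backing array in memory order instead of
 * doing 5M hash lookups. This is only valid because:
 *   (a) the arrays were calloc'd, so empty slots hold 0.0 and
 *       contribute nothing to the sum;
 *   (b) we are about to read every stored value anyway;
 *   (c) we've decided we don't care that this changes the order
 *       of the floating-point additions.
 * The compiler can't prove any of (a)-(c), which is why it didn't
 * do this transform for us. If the hashmap representation ever
 * changes (tombstones, non-zero empty markers), delete this. */
double sum = 0.0;
for (int i = 0; i < 10000000; i++) {
    sum += h->values[i];
}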
[2] Whenever someone says something is "true in some sense", that means that thing is false in most senses.

[3] Likely somewhere between 25 weeks and 25 years.

[4] AI capabilities remaining "spiky" won't necessarily help with this.
