Why More Cores Stopped Saving Us

Scaling doesn’t stall on the resource you keep adding. It stalls on the one dependency you can’t parallelize away.

In previous posts we walked through the ways you can actually parallelize an application: pipelined, task parallel, or both (there are lots of variations, all the way down to the instruction-level and memory-level parallelism the microarchitecture squeezes out under you). This is the post about the ceiling all of them share.

For a good long stretch, the fix for “this program is too slow” was just “wait.”¹ Performance showed up for free, year after year, because the clock kept climbing. You wrote your code, a faster chip shipped, and your code got faster while you slept. Then sometime in the mid-2000s the free lunch ended.² (It always does.) The clocks quit climbing because the physics quit cooperating: push a single core much faster and it runs too hot to keep going that way.³ So the industry pivoted hard, and the new promise was parallelism. Not one fast core, but lots of them, and the marching orders couldn’t have been simpler. More cores, more speed. Throw cores at it.⁴ For a while that genuinely worked, and a whole culture of engineering grew up around the assumption that the next slice of performance was just a matter of adding more hands to the job. Except the people who knew the physics knew better, even then. The limits were visible from the start: the memory bandwidth each core gets to itself shrinks as you add more of them all reaching for the same pins, and the time to cross the chip climbs as the chip grows, because more cores means more area and physics still bills you for every millimeter a signal has to travel. More hands only help if every hand can reach the work and talk to its neighbors cheaply, and at scale neither stays free.

And it’s worth being precise about what those extra cores are actually plugged into, because adding a core to a modern chip isn’t just adding a core. It’s adding a core together with its own slice of the cache hierarchy, its own prefetchers, its own fast private memory, all of it reuse-optimized, all of it one big bet that you’ll touch the same data again soon and that a hot working set will sit close and stay close. When that bet holds, parallelism scales just fine. Give each core a private working set with good locality, or hand many cores the same immutable data to read, and the caches do exactly what they were built to do; you really do get close to the speedup the core count promises. The trouble is narrower and more specific than “too many cores.” It shows up when cores share mutable cache lines and have to keep ping-ponging ownership back and forth, when two unrelated variables happen to land on the same line and the hardware can’t tell your false sharing from the real thing, when locality gets poor enough that the caches stop earning their keep, or when enough cores pull on memory at once that you hit a bandwidth wall the hierarchy was supposed to hide. Drive cores into those patterns and the reuse the architecture was counting on does start to break down. Keep the work clean and independent and it holds up beautifully.
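If the ping-pong is hard to picture, here is a toy model of it. This is a sketch under loud assumptions, not a real coherence protocol: it assumes 64-byte lines, tracks only which core last wrote each line, and counts one "transfer" whenever a different core writes it. The function names are mine, made up for illustration.

```python
LINE_BYTES = 64  # assumed cache-line size; common, but not universal


def ownership_transfers(writes, line_bytes=LINE_BYTES):
    """Count owner changes under a toy write-invalidate model.

    `writes` is a sequence of (core_id, byte_address) pairs in program order.
    A write by a core that does not currently own the line costs one transfer.
    """
    owner = {}      # line number -> core that last wrote it
    transfers = 0
    for core, addr in writes:
        line = addr // line_bytes
        if owner.get(line) != core:
            transfers += 1
            owner[line] = core
    return transfers


def alternating_writes(addr_a, addr_b, rounds):
    """Core 0 writes addr_a, then core 1 writes addr_b, `rounds` times over."""
    writes = []
    for _ in range(rounds):
        writes.append((0, addr_a))
        writes.append((1, addr_b))
    return writes


# Two unrelated 8-byte counters packed side by side: same line, false sharing.
packed = ownership_transfers(alternating_writes(0, 8, rounds=1000))
# The same counters padded onto separate lines: each core keeps its own line.
padded = ownership_transfers(alternating_writes(0, 64, rounds=1000))
print(packed, padded)  # prints: 2000 2
```

Same program, same two variables, and the only thing that changed is 56 bytes of padding. In the toy, the packed layout pays for ownership on every single write; the padded one pays twice and never again. Real hardware is messier, but that is the shape of the cost.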

There’s a second sacrifice buried in there, and this one we made on purpose. To keep all those cores manageable, to let a programmer go on pretending memory is one flat thing that reads the same everywhere even while a dozen cores are scribbling on it at once, we bolted coherence onto the hardware.⁵ That’s the machinery that quietly hunts down every copy of a cache line scattered across the chip and keeps them all in agreement, so your code never has to ask which core touched what last. It’s a gorgeous abstraction, and it’s nowhere close to free. The up-front cost is fixed: real silicon area, real power, and a mountain of verification effort, spent whether or not a given program ever touches shared memory at all. The runtime bill is more selective, and it lands hardest exactly where you’d guess, on programs that genuinely share mutable data, where the protocol spends its life chasing ownership of contended lines back and forth, a cost that climbs with the very core count it exists to tame. We laid that one on the altar of programmability, and we’re still paying for it in transistors spent on bookkeeping instead of math, and in watts spent keeping caches honest.

Virtual memory is the same story one more time. That private, uniform address space every program thinks it owns isn’t a fact about the machine; it’s an abstraction the hardware and the operating system maintain together, with a pile of dedicated widgets: page tables (tree-shaped lookup structures that live in ordinary memory), TLBs (simply more caches, specialized for entries from those page tables), page-table walkers (really just simple controllers specialized for walking that tree), and an MMU (really just a combination of the above plus the translation machinery), all grinding away to turn the addresses your code uses into the real ones the DRAM answers to. The names make it sound like a menagerie of exotic silicon, but strip the vocabulary away and it’s far plainer than it sounds: some tables, some caches of those tables, and a small engine that walks a tree when the cache misses.⁶ And to be fair it buys far more than convenience: isolation between processes, memory protection, demand paging, copy-on-write, whole files mapped in as if they were memory. But none of it is free. Run strace on the startup of even a trivial program⁷ and watch the flood: mmap after mmap mapping the loader and a stack of shared libraries into the address space, mprotect after mprotect flipping page protections as the loader hardens what it just laid down. A lot of that is dynamic linking and ASLR⁸ and security work rather than pure translation overhead, but it’s all spent standing up the per-process address space before your code does a single useful thing. And it gets louder as you scale, because change a mapping on one core and you owe every other core that might have cached the old translation a TLB shootdown⁹, an interrupt that says throw out what you thought you knew about this address. The cost of maintaining the tidy per-process picture climbs with the core count too.
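To make "some tables, a cache of those tables, and an engine that walks a tree" concrete, here is a deliberately tiny model. It is a sketch, not any real MMU: I am assuming an x86-64-style layout (4 KiB pages, four levels, nine index bits per level), using nested dicts as the page tables and a plain dict as the TLB. ToyMMU and its methods are hypothetical names.

```python
PAGE_SHIFT = 12                      # 4 KiB pages (assumed)
INDEX_BITS = 9                       # 512 entries per table level (assumed)
LEVELS = 4                           # four-level tree, x86-64 style (assumed)
INDEX_MASK = (1 << INDEX_BITS) - 1
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def split(vaddr):
    """Split a virtual address into per-level table indices and a page offset."""
    vpn = vaddr >> PAGE_SHIFT
    indices = [(vpn >> (INDEX_BITS * level)) & INDEX_MASK
               for level in reversed(range(LEVELS))]
    return indices, vaddr & OFFSET_MASK


class ToyMMU:
    def __init__(self):
        self.root = {}   # the page-table tree: nested dicts, leaves are frames
        self.tlb = {}    # cache of finished translations: page number -> frame
        self.walks = 0   # how many times we had to walk the tree

    def map_page(self, vaddr, frame):
        indices, _ = split(vaddr)
        table = self.root
        for idx in indices[:-1]:
            table = table.setdefault(idx, {})
        table[indices[-1]] = frame
        # Changing a mapping must drop any stale cached translation.
        # With one TLB per core, this is the step that becomes a shootdown.
        self.tlb.pop(vaddr >> PAGE_SHIFT, None)

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        indices, offset = split(vaddr)
        frame = self.tlb.get(vpn)
        if frame is None:            # TLB miss: walk the tree, level by level
            self.walks += 1
            node = self.root
            for idx in indices:
                node = node[idx]     # a KeyError here plays the page fault
            frame = node
            self.tlb[vpn] = frame
        return (frame << PAGE_SHIFT) | offset


mmu = ToyMMU()
mmu.map_page(0x7F0000001000, frame=0x42)
first = mmu.translate(0x7F0000001ABC)    # miss: one walk
second = mmu.translate(0x7F0000001DEF)   # same page: served from the TLB
print(hex(first), hex(second), mmu.walks)  # prints: 0x42abc 0x42def 1
```

The toy has exactly one TLB, so invalidating it is a single dict operation. Give every core its own copy and that one line turns into a message to each of them, which is where the scaling cost comes from.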

Step back and you can see what all of it is in service of. The reuse-optimized caches, the coherence fabric, the virtual-memory machinery, every one of those is an enormous and deliberate effort to soften Amdahl, to let a programmer look at a pile of separate cores and write against them as if they were one smooth fabric. The whole point is to strip context out of your programming: which core, which cache, which page, which copy is the current one. You get to not think about any of it. But here’s the part the series keeps circling back to: the context doesn’t actually leave. It was always there, and it stays there. All we did was move the job of keeping it straight off your desk and onto something else, the hardware, the kernel, the coherence protocol, which now carries it for you at a cost in area, in power, and in the occasional 3am surprise. Context taken off the programmer is never context destroyed. It’s context handed to somebody, or something, else to maintain.

I’ll put a stake in the ground here, because to my mind hardware designers did the discipline a quiet disservice with all of this, and I say that knowing exactly why it happened. The frictionless move, the one that wins, is always to add the feature that makes the software easier to write, because that’s what drives acceptance. Relax the abstraction a little more, hide a little more context, and the programmers show up. The reward is adoption; the bill is hardware scalability, paid downstream by someone who wasn’t in the room when the call got made. And it was always going to end regardless. There’s only so far you can relax an abstraction before the thing it was hiding pushes back through (and this, my friends, is where innovation often happens), before the scalability limits you papered over resurface and, sitting right underneath them where it always was, Amdahl is waiting. You can buy time by making the cores easier to use. You can’t buy your way out of the serial fraction. The relaxation has a floor, and we’ve been feeling around for it for a while now.

The wall everybody eventually hit had been described, with eerie precision, almost forty years before anyone smacked into it. In 1967 Gene Amdahl gave a short talk at the spring joint computer conference under a thoroughly unglamorous title, “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities.” The argument inside is now called Amdahl’s Law, and it’s one of those results that feels obvious the second you get it and rearranges everything afterward. His point: any task has some fraction that has to happen in sequence, one step strictly after another, and that fraction won’t parallelize no matter how many processors you own. Say ninety-five percent of your program splits cleanly across cores and five percent is stubbornly serial. Add infinite cores and the parallel part collapses toward zero, but that serial five percent just sits there. Your best possible speedup is capped at twenty times, forever, and you hit ugly diminishing returns long before you get close. The cores were never the ceiling. That little serial strand was, sitting quietly inside the work the whole time, completely indifferent to how much hardware you stacked around it.
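The arithmetic is short enough to run. A minimal sketch, using the usual textbook form of the law: if a fraction p of the work parallelizes perfectly across n cores and the rest is serial, speedup is 1 / ((1 - p) + p / n). The function name is mine, and the model ignores every real-world overhead, so treat the numbers as a ceiling rather than a prediction.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal fixed-size speedup when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# The ninety-five percent example from the text, at growing core counts.
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.95, cores), 2))

# With unlimited cores only the serial part remains: 1 / 0.05, i.e. 20x.
print("ceiling", round(1.0 / (1.0 - 0.95), 2))
```

That works out to roughly 1.9x on 2 cores, 5.9x on 8, 15.4x on 64, and 19.6x on 1024. Going from 64 cores to 1024 is sixteen times the hardware for about a quarter more speed, and that is the diminishing-returns curve in one line.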

What stings is that the bottleneck is almost never where you’re looking. You watch the core count go up, you watch the speedup curve bend flat, and the instinct is to blame the cores, or the scheduler, or some sloppiness in how the work got divided. Amdahl says look somewhere else: find the part of the problem that refuses to be split, because that part is now running the show. The resource you keep adding isn’t the constraint. (It never was.) The constraint is the one strand that insists on going second-after-first, the dependency where step two truly can’t begin until step one has finished. Buy all the parallel capacity on Earth and it does exactly nothing for that strand. Past a certain point, that strand is the entire story of your performance. The most expensive lesson of the multicore era was learning, slowly and at real cost, that buying more of the abundant thing does nothing about a limit set by the scarce structural one.¹⁰

There’s an honest counterpoint, and leaving it out would be cheating. In 1988 John Gustafson published “Reevaluating Amdahl’s Law” (crediting the scaled-speedup idea to his Sandia colleague Edwin Barsis, which is why you’ll often see it called Gustafson-Barsis), and it pulls some of the gloom back off for a real and important class of problems. Gustafson noticed that Amdahl quietly assumes the problem stays the same size while you pile on processors, and that’s often not how people actually use big machines. In practice, when you get more compute, you grow the problem to match it: simulate at higher resolution, model a bigger system, chew through a larger dataset. If the serial fraction stays roughly fixed while the parallel work grows with the problem, the math turns generous in a hurry, and Gustafson reported scaled speedups of better than a thousandfold on a 1024-processor machine. That’s not a refutation of Amdahl so much as a map of when his ceiling actually binds. For a fixed task, the serial fraction is destiny. For a task that grows to fill the machine, parallelism keeps paying. Both laws are true at once; the only question is which world you’re standing in, and convincing yourself you’re in Gustafson’s when you’re stuck in Amdahl’s is a reliable way to light money on fire buying cores that sit idle.
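Here are the two worlds side by side, as a sketch. The scaled-speedup form is the standard one, n - s * (n - 1), where s is the serial share of the run as measured on the parallel machine. The 1% figure below is an illustrative assumption of mine, not a number from Gustafson's paper, and the function names are made up.

```python
def amdahl_speedup(serial_fraction, cores):
    """Fixed problem size; serial_fraction is measured on ONE processor."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def gustafson_speedup(serial_fraction, cores):
    """Problem grows with the machine; serial_fraction is measured on the
    PARALLEL run, so the parallel work scales with the core count."""
    return cores - serial_fraction * (cores - 1)


cores = 1024
serial = 0.01   # illustrative assumption: 1% serial in both models

print("fixed size :", round(amdahl_speedup(serial, cores), 1))
print("scaled size:", round(gustafson_speedup(serial, cores), 1))
```

With those inputs the fixed-size job tops out around 91x while the scaled one lands near 1014x. The two serial fractions are not the same quantity (one is a share of the single-processor run, the other a share of the parallel run), which is exactly why both laws can be right at once.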

Which is the place to be honest about the title, because it can be read too broadly. More cores didn’t stop mattering. They stopped being automatic. Throughput work still loves them: a server fielding a thousand independent requests, a render farm, a GPU grinding through a batch, a simulation you scale up to fill the machine. All of that still gets faster, sometimes very nearly linearly, every time you add cores. The ceiling I’m describing is the specific and common one, a single fixed job shot through with dependencies, where the serial part sets the limit and no pile of hardware moves it. That’s the case where throwing cores at the problem quietly stops paying, and it burned a whole generation of engineers, but it was never the whole world.

The intuition under all of this is older than the hardware, and Fred Brooks gave it the version everyone quotes. In The Mythical Man-Month, his classic on why software projects run late, he said it flat: the bearing of a child takes nine months, no matter how many women you assign to it. Some work is sequential by its nature. You can’t compress it with more workers, because the steps depend on each other in an order more hands can’t rearrange. Nine women don’t make a baby in a month, and adding programmers to a late project, Brooks watched, tends to make it later, because the newcomers have to be brought up to speed and the coordination cost grows faster than the help arrives. The deep part isn’t about babies or staffing. It’s that parallelism has a precondition, and the precondition is independence. The instant the pieces of a task actually depend on one another, the instant step two needs the real result of step one and not just the fact that step one exists, more capacity stops helping. The limit isn’t how much you can do at once anymore. It’s what has to happen before what.
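"What has to happen before what" can be measured. One common way, sketched below, is to treat the job as a dependency graph and compare total work with the longest chain through it (often called the span, or critical path); with unlimited workers you still can't finish faster than the span. The task names and durations here are invented for illustration, and the sketch assumes Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def work_and_span(durations, depends_on):
    """Return (total work, critical-path length) for a task graph.

    durations:  task -> time it takes
    depends_on: task -> set of tasks that must finish first
    """
    graph = {task: depends_on.get(task, set()) for task in durations}
    finish = {}
    # static_order() yields every task after all of its prerequisites.
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[dep] for dep in graph[task]), default=0)
        finish[task] = start + durations[task]
    return sum(durations.values()), max(finish.values())


# Hypothetical job: parse once, build three indexes in parallel, then merge.
durations = {"parse": 2, "index_a": 4, "index_b": 4, "index_c": 4, "merge": 3}
depends_on = {
    "index_a": {"parse"},
    "index_b": {"parse"},
    "index_c": {"parse"},
    "merge": {"index_a", "index_b", "index_c"},
}

work, span = work_and_span(durations, depends_on)
print(work, span, round(work / span, 2))  # prints: 17 9 1.89
```

Seventeen units of work, nine of them chained end to end, so the best any number of workers can do is a bit under 2x. The three indexes were the only part that was ever going to parallelize.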

So here’s the thing I keep turning over, and I’d rather set it down gently than wave it around. We learned, expensively, that some things resist parallelization because of their internal structure, because they carry a chain of dependencies no amount of added capacity can flatten. That lesson lived for years inside a narrow box labeled performance engineering. But the shape of it doesn’t obviously stay in the box. There are kinds of human work with exactly this texture, where each step leans on the genuine completion of the one before it, where the sequence is the substance and won’t subdivide across more hands or more machines without losing the thing you were after. Some of how we come to understand things might be one of them. Plenty of learning parallelizes fine: you can read many examples at once, learn from other people, practice retrieval, lean on tools. But some of it seems to carry a hard dependency chain. Producing an answer parallelizes beautifully now (we have very fast machines for that), and the slow business of a person actually grasping why the answer is right, where step two genuinely needs step one to have happened inside the same head, looks like it may not. I’m not going to lean on that here. I just want to leave the suspicion sitting where you can see it, because it’s been quietly true of machines for a long time, and the burden’s on anyone who assumes it could never be true of us.

For now the modest version is plenty, and it’s the one worth carrying out of the multicore era into everything else. When something stops scaling, don’t reflexively add more of whatever you’ve been adding. Go find the dependency that refuses to be split, or the shared resource everything is quietly contending for, because that’s almost always the real limit, and it’s almost always not the thing the dashboard is pointing at. In a real system that limit is rarely pure Amdahl; it’s usually some tangle of serial sections, lock and allocator contention, memory bandwidth, and NUMA effects all binding at once. The move is the same either way: find what genuinely can’t be parallelized or shared, and stop feeding the part that can. The cores were a distraction. The serial fraction was the world all that abundance was hiding. Keep your eye on what simply can’t be parallelized, in the machine and maybe well past it, and you’ll be looking at the thing that was actually in charge the whole time.

Somewhere in your own systems there’s a place you kept throwing capacity at and it kept not helping. I’d love to hear where yours was; those stories are usually where the real bottleneck finally showed its face.

Sources & Further Reading

Gene M. Amdahl, “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities” (AFIPS, 1967): the two-and-a-half-page paper behind Amdahl’s Law and the insight that the irreducible serial fraction caps every speedup.
John L. Gustafson, “Reevaluating Amdahl’s Law” (Communications of the ACM, 1988): the counterpoint, showing that when the problem scales with the machine, parallelism keeps paying off.
Frederick P. Brooks Jr., The Mythical Man-Month (1975): the source of “the bearing of a child takes nine months, no matter how many women are assigned,” still the cleanest statement of work that can’t be parallelized.
Robert H. Dennard et al., “Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions” (IEEE Journal of Solid-State Circuits, 1974): the original statement of what’s now called Dennard scaling, that shrinking a transistor’s dimensions and voltage together holds power density constant while switching speed rises.
Mark Horowitz, “Scaling, Power and the Future of CMOS” (VLSID, 2007): a clear account of how, once Dennard scaling broke, power rather than device count became the limiter on chip performance.
Hadi Esmaeilzadeh et al., “Dark Silicon and the End of Multicore Scaling” (ISCA, 2011): the landmark result that a fixed power budget leaves a growing fraction of every chip “dark,” so the move to more cores hit its own ceiling (expanded as “Power Limitations and Dark Silicon Challenge the Future of Multicore,” ACM TOCS, 2012).
Mark Bohr and Ian Young, “CMOS Scaling Trends and Beyond” (IEEE Micro, 2017): the device tricks, strained silicon, high-k metal gates, and FinFETs, that kept density scaling alive after classic Dennard scaling ended.
Intel Core 2 Duo (2006) and AMD Athlon 64 X2 (2005): the consumer dual-core chips that opened the “more cores” era, the X2 and Intel’s Pentium D first in 2005, the Core 2 Duo the 2006 breakout that made multicore mainstream.
Shekhar Borkar, “Thousand Core Chips: A Technology Perspective” (DAC, 2007): the hardware industry’s stated plan, to scale out to ever more, ever smaller cores once single-core speed stalled.
Mark D. Hill and Michael R. Marty, “Amdahl’s Law in the Multicore Era” (IEEE Computer, 2008): extends Amdahl’s Law to multicore chips, showing the serial fraction caps the payoff however you divide a fixed transistor budget.

Footnotes

That instinct to just wait for faster hardware was so dependable it grew its own folklore. Todd Proebsting put numbers on why hand-tuning was a poor bet with his (half-serious) Proebsting’s Law in 1998: where Moore’s Law doubled hardware roughly every 18 to 24 months, he reckoned compiler optimization doubled performance only about every 18 years, so waiting beat optimizing by a mile. The mindset had its critics too. Niklaus Wirth’s “A Plea for Lean Software” (IEEE Computer, 1995) gave us Wirth’s Law, “software is getting slower more rapidly than hardware becomes faster,” and the industry’s more cynical Andy and Bill’s Law put it as “what Andy giveth, Bill taketh away”: every gain Intel shipped, bloated software promptly spent. ↩
Herb Sutter named the end of the era in “The Free Lunch Is Over” (Dr. Dobb’s Journal, 2005), the essay that told a generation of developers the “wait for the next chip” strategy was finished and that concurrency was now, like it or not, their problem. ↩
The physics here is Dennard scaling. In 1974 Robert Dennard and colleagues showed that if you shrink a transistor’s dimensions and its voltage by the same factor, power density stays put while the device switches faster, which is what let clock speeds climb for roughly thirty years almost for free. By the mid-2000s it broke: voltages couldn’t keep dropping without leakage currents getting out of hand, so shrinking transistors no longer held power density constant, and heat, not transistor count, became the wall. Mark Horowitz laid this out cleanly at the time. That’s the moment the free lunch ended and the industry turned to more cores instead of faster ones, which (as the dark-silicon work in the references shows) only relocated the problem rather than solving it. ↩
The marching orders were real and dated. Dual-core chips reached consumers in 2005, AMD’s Athlon 64 X2 and Intel’s Pentium D arriving within weeks of each other, but it was Intel’s Core 2 Duo in 2006 that made multicore feel like the obvious future rather than a stopgap. On the research side, Intel’s Shekhar Borkar spelled out the long game in “Thousand Core Chips: A Technology Perspective” (DAC, 2007): since you couldn’t make one core meaningfully faster, the path forward was tens, then hundreds, then thousands of smaller cores. Throwing cores at it wasn’t a hack, it was the official plan. ↩
“Coherence” is really a family, not a single thing, and it’s worth not flattening it. There are many cache-coherence protocols (snooping, directory, and the variants in between) and, layered on top, a whole zoo of memory-consistency models that define what “the same everywhere” even means, from sequential consistency down through the relaxed ones. Sorin, Hill, and Wood’s A Primer on Memory Consistency and Cache Coherence (Morgan & Claypool, 2011; 2nd ed. 2020) and Adve and Gharachorloo’s “Shared Memory Consistency Models: A Tutorial” (IEEE Computer, 1996) are the standard maps. The deeper point is that coherence is always relative to the addressing level you choose to enforce it at; it’s only true from a particular point of view. You don’t even have to anchor it to physical addresses: cache-only memory architectures (COMA) drop the idea of a fixed home for data entirely and keep a location-independent space coherent instead, as in Hagersten, Landin, and Haridi’s Data Diffusion Machine (IEEE Computer, 1992). ↩
If you want the machinery without the mystique: the idea goes back to the Atlas computer’s “one-level store,” the first paged virtual memory (T. Kilburn, D.B.G. Edwards, M.J. Lanigan, and F.H. Sumner, “One-Level Storage System,” IRE Transactions on Electronic Computers, 1962; see IEEE’s Atlas milestone). Peter Denning’s “Virtual Memory” (Computing Surveys, 1970) is the classic survey; Remzi and Andrea Arpaci-Dusseau’s free Operating Systems: Three Easy Pieces has the clearest plain-English chapters on page tables and TLBs; and Bhattacharjee and Lustig’s Architectural and Operating System Support for Virtual Memory (Morgan & Claypool, 2017) is the modern end-to-end account of the TLBs, page-table walkers, and MMU caches the names point at. ↩
Try it yourself: run strace -f -e trace=mmap,mprotect /bin/true and watch a program that does precisely nothing still spray the kernel with calls, mapping in the dynamic loader, libc, and the locale machinery with mmap, then walking back through mprotect to lock down the read-only relro regions it just created writable. Even /bin/true isn’t free. A real binary linked against a dozen shared libraries runs this whole gauntlet dozens of times before main ever gets a turn. (The mmap and mprotect man pages come from the Linux man-pages project, which Michael Kerrisk maintained for many years; strace itself is a separate project. Kerrisk also wrote the definitive book-length tour of this whole interface: The Linux Programming Interface (No Starch Press, 2010) is the reference worth owning if you want to go deep.) ↩
ASLR is Address Space Layout Randomization. Every time a program starts, the kernel and the loader scatter the base addresses of its stack, heap, libraries, and code to random offsets, so an attacker can’t count on knowing where anything lives. It’s one move in a long arms race. First non-executable stacks (enforced by software patches at first, later by the hardware NX bit) killed the classic trick of injecting shellcode onto the stack and jumping to it. Attackers shrugged and switched to “return-to-libc” (Solar Designer demonstrated a working version on Bugtraq in 1997): don’t inject any code at all, just point execution at code that’s already loaded and already executable, like libc’s system(), and feed it your arguments. ASLR (first designed and shipped by the PaX project in 2001) was the counter: if you don’t know where libc landed this run, you don’t know what address to jump to. The counter-counter was return-oriented programming (Hovav Shacham formalized it in 2007), stitching together dozens of tiny snippets that already exist in the binary (“gadgets,” each ending in a ret) into whatever computation you want, usually paired with a separate memory-disclosure bug to leak the randomized addresses and defeat ASLR anyway. The fun, if this is your idea of fun, never really ends. ↩
“Interrupt” is the x86 telling of it, and the two big architectures actually handle this differently, with real tradeoffs. On x86 there has traditionally been no hardware broadcast (AMD’s more recent INVLPGB instruction is the exception), so the OS sends an inter-processor interrupt (IPI) to each affected core; every target stops what it’s doing, runs a handler to flush the entry, and acks, while the initiator waits. Precise (you interrupt only the cores that matter) but disruptive, and the cost climbs with core count. Arm instead has a hardware broadcast invalidate, the TLBI instruction, carried across the interconnect as Distributed Virtual Memory (DVM) messages, so no core takes an interrupt; the catch is that the broadcast reaches every core in the shareability domain whether it cares or not, and at high core counts the DVM traffic itself becomes the bottleneck. That bites in practice: see Linaro’s talk “Why TLBI matters on ARM server: scalability issues we found and solutions” (2024), the recurring kernel discussions about using IPIs instead of broadcast TLBI on arm64, and the lengths HPC users go to on Fujitsu’s 48-core A64FX (the Fugaku chip), leaning on huge pages to cut the translation and invalidation pressure. ↩
Mark Hill and Michael Marty made this precise for chip designers in “Amdahl’s Law in the Multicore Era” (IEEE Computer, 2008), extending Amdahl’s Law to multicore and showing the sequential fraction caps the payoff no matter how you spend a fixed transistor budget across symmetric, asymmetric, or dynamic cores. The dark-silicon work in the references is the other half of the obituary: even the power budget won’t let you light all those cores at once. ↩
