Managing Parallel, Part 4: The Machine Underneath

Part 3 ended with ten rules for code that shares memory. Every one of them quietly assumed a friendly machine: that a write becomes visible the moment you make it, that all memory is equally far away, and that the hardware does what your source code says in the order it says it. None of that is true. This post goes under the abstraction to the three places the machine pushes back hardest (memory ordering, topology, and the cache line), because each is a spot where the same “correct” program is correct on one chip and broken, or merely slow, on the next.

A heads-up: this is the deep end of the series. If Parts 1–3 were about how to think, this one is about what the silicon actually does. Some of the sharpest lessons here I learned the hard way porting a streaming runtime across x86 and Arm; I’ll point those out as we go, but the conclusions are about the architectures, not any one codebase.

1. Memory ordering: your mental model is too strong

When you write a = 1; flag = true;, you imagine the two stores happening in that order, and every other core seeing them in that order. That model is sequential consistency (SC): one global interleaving of all operations, consistent with each thread’s program order. It is the model we all reason in, and it is not the model any mainstream CPU gives you, because SC leaves enormous performance on the table. Store buffers, out-of-order execution, and private caches all exist precisely to let a core run ahead without waiting for the rest of the machine to agree. The price is that loads and stores can be reordered, and exactly which reorderings are allowed is the processor’s memory model.

Table of four load/store reorderings under sequential consistency, x86 TSO, and weak Arm: x86 allows only store-then-load, Arm permits all four

Three models, four reorderings. x86 forgives almost every ordering you forget to ask for; Arm forgives almost none.

The two architectures most of us ship to sit at very different points on that table. x86 is Total Store Order (TSO): the only reordering it permits is that a load may pass an earlier store to a different address. Stores stay in order with each other, loads stay in order with each other, and a store can never jump ahead of an older load. On top of that, x86 is multi-copy atomic: once a store is visible to any other core, it is visible to all of them, so there is a single agreed-upon order of stores. This is a famously strong model, rigorously pinned down in the x86-TSO work by Owens, Sarkar, and Sewell and matching the ordering rules in the Intel SDM.

That one permitted reordering, store-then-load, is not academic. It is the store buffer, and it is why a hand-rolled lock can be silently broken even on the “strong” architecture.

Store-buffering result: each core buffers its own write and reads the other variable as 0, so both loads return 0, impossible under SC but allowed on x86

Each core retires its store into a private buffer and races ahead to the next load. Both loads can read 0: the classic store-buffering result, and the reason Dekker’s algorithm needs a fence.

Each core parks its own stores in a FIFO buffer and keeps executing; its later loads can complete while those stores are still draining. So in the classic store-buffering shape (core 0 writes x then reads y, core 1 writes y then reads x), both reads can return the old value, because each store is still sitting in its writer’s buffer. Under SC that outcome is impossible; on x86 it is allowed and observable. The repair is to drain the buffer before the load: put an mfence or a lock-prefixed read-modify-write between the write and the read, or make the write itself a sequentially consistent store (which compilers typically implement on x86 as an xchg, an implicitly locked instruction). If you have ever wondered why a textbook Dekker or Peterson lock “doesn’t work,” this is why: those algorithms assume SC, and no real CPU gives it for free.
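To make the store-buffering shape concrete, here is a minimal litmus sketch in C++. The names run_sb_once and count_both_zero are mine, not from any library, and the harness is deliberately naive: thread start-up dwarfs the race window, so the relaxed version may report zero hits for a long time even on hardware that allows the outcome. What the sketch does show reliably is the other direction: with sequentially consistent stores and loads, the both-zero result is forbidden by the language.

```cpp
#include <atomic>
#include <thread>

// One round of the store-buffering litmus test.
// Returns true if BOTH loads observed 0, the outcome SC forbids.
bool run_sb_once(std::memory_order store_order, std::memory_order load_order) {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r0 = -1;
    int r1 = -1;

    std::thread core0([&] {
        x.store(1, store_order);   // write x ...
        r0 = y.load(load_order);   // ... then read y
    });
    std::thread core1([&] {
        y.store(1, store_order);   // write y ...
        r1 = x.load(load_order);   // ... then read x
    });

    core0.join();  // join() makes r0 and r1 safely visible here
    core1.join();
    return r0 == 0 && r1 == 0;
}

// Repeats the test and counts how often both loads returned 0.
int count_both_zero(int rounds,
                    std::memory_order store_order,
                    std::memory_order load_order) {
    int hits = 0;
    for (int i = 0; i < rounds; ++i) {
        if (run_sb_once(store_order, load_order)) {
            ++hits;
        }
    }
    return hits;
}
```

Calling count_both_zero with std::memory_order_relaxed for both arguments permits a nonzero count on x86 and Arm alike; calling it with std::memory_order_seq_cst must return 0 on any conforming implementation.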

Arm is a weakly-ordered model, and it sits at the opposite end. By default it permits all four reorderings; a load or store may move past another in either direction unless something stops it. What stops it is explicit: a one-way acquire load (LDAR) or release store (STLR), a data memory barrier (DMB, with ld/st flavors), or an honest dependency between two accesses: an address or data dependency (the trick at the heart of RCU), or a control dependency, which orders only a later store and not a later load. Modern Armv8 did at least remove the nastiest wrinkle: it is now other-multi-copy atomic, so two cores can no longer disagree about the order of someone else’s stores, which older, non-multi-copy-atomic designs allowed. If you want the gory-but-readable version, Preshing’s walk through a weakly-ordered CPU and Arm’s own “Memory Systems, Ordering, and Barriers” guide are the places to start.

Here is the practical sting, and it is the whole reason this section exists. x86’s strong model hides your bugs. Code that omits an acquire or release is very often still correct on x86, because TSO already enforces the order you forgot to ask for (provided the compiler did not reorder the plain accesses itself, which it is entitled to do). Move the same source to Arm and the missing ordering suddenly matters: the publish can become visible after the flag that announces it, and Rule 2 from Part 3 (“publish, then signal”) fails not because your code changed but because the machine stopped covering for you. The defense is to express ordering in the language’s model, not the chip’s: use std::atomic with an explicit memory_order, and let the compiler emit an ordinary mov with no fence on x86 and the right LDAR/STLR/DMB on Arm. A memory_order_acquire load costs no extra instruction on x86 and is a real instruction on Arm. That is exactly the point.
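As a sketch of what “express ordering in the language’s model” looks like for publish-then-signal, consider the following. Mailbox, publish, wait_for_payload, and run_handoff are illustrative names I made up for this post, not anything from Part 3’s code. The release store and acquire load form the happens-before edge; the payload itself stays a plain int.

```cpp
#include <atomic>
#include <thread>

struct Mailbox {
    int payload = 0;                  // plain data, published via the flag
    std::atomic<bool> ready{false};   // the signal
};

// Producer side: write the data, THEN raise the flag with release ordering,
// so the payload write cannot become visible after the flag.
void publish(Mailbox& m, int value) {
    m.payload = value;
    m.ready.store(true, std::memory_order_release);
}

// Consumer side: spin until the flag is seen with acquire ordering.
// The payload read below is ordered after the load that observed true.
int wait_for_payload(Mailbox& m) {
    while (!m.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return m.payload;
}

// Runs one hand-off between two threads and returns what the consumer saw.
int run_handoff(int value) {
    Mailbox m;
    int seen = -1;
    std::thread consumer([&] { seen = wait_for_payload(m); });
    std::thread producer([&] { publish(m, value); });
    producer.join();
    consumer.join();
    return seen;
}
```

Swap both orderings for std::memory_order_relaxed and the program will usually still appear to work on x86; that is the hidden bug this section describes, and on Arm the consumer would then be allowed to read a stale payload.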

Concretely, in a single-producer/single-consumer queue I work on, the Arm path needs a dmb ishst between the producer’s write of the element and its publish of the tail, and a dmb ishld between the consumer’s load of that tail and its read of the element at the head; the x86 build needs neither hardware barrier, because TSO already orders those stores and loads (the compiler still has to be kept from reordering them). The same exercise teaches the other half of the lesson: don’t reach for seq_cst everywhere out of fear. Blanket sequential consistency is both a performance tax and a smell that you haven’t thought about the actual happens-before edge; weaken to acquire/release where you can name that edge, and leave a genuinely subtle path (a multi-producer queue, say) strong until you’ve proven the weakening. Right-sizing ordering is its own skill, and it is architecture-dependent in a way that is easy to miss when you only ever test on a forgiving x86 desktop.
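Here is roughly what that looks like when the ordering is written in C++ rather than as hand-placed barriers. This is a minimal sketch, not the queue I work on: SpscRing and its members are hypothetical, T is assumed to be default-constructible and copyable, and on Arm a compiler will typically lower these acquire/release operations to LDAR/STLR rather than to the explicit dmb pair described above.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Bounded single-producer/single-consumer ring (illustrative sketch).
// Exactly one thread may call try_push and exactly one may call try_pop.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    bool try_push(const T& value) {
        // Only the producer writes tail_, so a relaxed read of it is enough.
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire pairs with the consumer's release: the slot is free again.
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) {
            return false;  // full
        }
        slots_[tail & (Capacity - 1)] = value;
        // Release: the element write above is visible before the new tail.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> try_pop() {
        // Only the consumer writes head_, so a relaxed read of it is enough.
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release: the element is ready.
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) {
            return std::nullopt;  // empty
        }
        T value = slots_[head & (Capacity - 1)];
        // Release: the element read above completes before the slot is freed.
        head_.store(head + 1, std::memory_order_release);
        return value;
    }

private:
    T slots_[Capacity]{};
    std::atomic<std::size_t> head_{0};  // next index to pop
    std::atomic<std::size_t> tail_{0};  // next index to push
};
```

Each of the two edges has a name: “element written before tail published” and “element read before slot released.” That is the test for whether acquire/release is enough; if you cannot name the edge, leave the operation at seq_cst.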

2. “Memory” is a lie: topology

Suppose you get the ordering exactly right. There is still a second assumption hiding in your code: that memory is one uniform thing, equally far from every core. On any modern many-core part it isn’t, and the most vivid teaching example ever shipped is AMD’s Threadripper 2990WX.

Topology of the AMD Threadripper 2990WX: 32 cores on four Zen+ dies over Infinity Fabric, where only dies 0 and 2 have memory controllers; 1 and 3 are compute-only

32 cores across four dies, but only two of them have memory controllers. The other two reach RAM by hopping across the Infinity Fabric.

The 2990WX is 32 cores and 64 threads built as four Zen+ dies on one package, with 64 MB of L3 and a 250 W envelope. Each die is two core-complexes (a CCX of four cores sharing 8 MB of L3). The twist, the reason it is famous, is that AMD connected memory controllers to only two of the four dies. As ServeTheHome and others documented at launch, NUMA nodes 0 and 2 have local RAM and PCIe; nodes 1 and 3 are compute-only dies with zero local memory. Every byte a core on a compute die touches has to travel across the Infinity Fabric to a die that actually has a memory controller. Half your cores are a fabric hop away from all of RAM.

The consequence is that two threads running identical code can see wildly different memory latency and bandwidth depending only on where the scheduler put them. A queue whose producer lands on a memory die and whose consumer lands on a compute die pays the fabric tax on every hand-off. A lock shared across dies is far more expensive than the same lock shared between two cores of one CCX. And the false sharing from Part 3 escalates: a bouncing cache line that was merely costly between two cores on a chip becomes brutal when the two cores are on different dies, one of which has no local memory at all. AMD shipped a software feature, Dynamic Local Mode (a background service delivered with Ryzen Master on Windows), to migrate the busiest threads onto the dies with local memory precisely because the default placement could be so punishing.

The lesson generalizes well beyond this one exotic part. NUMA is now the common case, not the exception: chiplets, multiple CCXs, and multi-socket boxes all mean “local” and “remote” memory differ by large factors. So the cost of sharing is not uniform, and Part 3’s Rule 8 (“the cheapest synchronization is none”) grows teeth here: keep a queue’s producer and consumer on the same die or CCX, allocate memory where the thread that uses it runs (first-touch placement helps), pin threads that care, and measure placement, because the difference between near and far memory can dwarf every other optimization you make. The 2990WX just makes a property of every big machine impossible to ignore.
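For the “pin threads that care” and first-touch advice, a minimal Linux-only sketch follows. It assumes glibc’s sched_getaffinity/sched_setaffinity and the kernel’s default first-touch placement policy; first_allowed_cpu, pin_current_thread, and make_local_buffer are names I invented for illustration, and which CPU numbers belong to which die is something you would have to read from the machine’s topology rather than from this code.

```cpp
#include <sched.h>   // sched_getaffinity, sched_setaffinity, CPU_* (Linux/glibc)

#include <cstddef>
#include <vector>

// Lowest-numbered CPU the calling thread is currently allowed to run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) {
            return cpu;
        }
    }
    return -1;
}

// Restricts the calling thread to a single CPU. Returns true on success.
bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        return false;
    }
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;  // pid 0 = this thread
}

// Allocates AND zero-fills on the calling thread. Under a first-touch policy
// (an assumption about the OS), the pages are placed near the CPU that wrote
// them first, so call this after pinning, from the thread that will use it.
std::vector<char> make_local_buffer(std::size_t bytes) {
    return std::vector<char>(bytes);  // value-initialization writes every byte
}
```

The order matters: pin first, then allocate and touch. Allocating on the main thread and handing the buffer to a worker on another die gets you exactly the remote-memory penalty this section is about.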

3. The cache line: atomicity and the 64-byte myth

The last assumption is the smallest and the one most people never question: the cache line. Two things about it routinely surprise even experienced engineers.

First, atomicity has a boundary, and the boundary is the cache line. The Intel SDM and AMD APM guarantee that naturally aligned accesses up to eight bytes (and a locked, 16-byte-aligned CMPXCHG16B for sixteen) are atomic. Intel goes further: on P6-family and later processors, an unaligned access of up to eight bytes to cacheable memory is still atomic as long as it stays within a single cache line. But an access that straddles two lines gets no such guarantee.

Cache-line atomicity: an 8-byte value within one 64-byte line is atomic, but one straddling two lines becomes a split lock, not atomic and able to stall the machine

Inside one line, your atomic is atomic. Across the boundary, a plain load or store is no longer guaranteed atomic, and a locked instruction becomes a “split lock” that can stall the whole machine.

When a locked instruction crosses a line boundary it becomes a split lock: historically the CPU asserted a bus lock to keep the two lines coherent together, an operation that can cost a thousand-plus cycles and briefly stall every other core on the machine. It is so harmful in shared environments that Linux added split-lock detection to fault or throttle programs that do it. The takeaway is concrete: keep anything you intend to access atomically aligned and inside one line. A misaligned 64-bit counter that happens to sit across a boundary is a correctness and a performance bomb.
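The straddling condition is simple arithmetic, and it is worth having as a checkable predicate. The sketch below uses a helper I named straddles_line (hypothetical, not a standard function) and a toy Stats struct; the line width is passed in as a parameter rather than assumed.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// True if the byte range [addr, addr + size) touches more than one line
// of width line_bytes.
constexpr bool straddles_line(std::uintptr_t addr, std::size_t size,
                              std::size_t line_bytes) {
    return size != 0 &&
           (addr / line_bytes) != ((addr + size - 1) / line_bytes);
}

// Bytes 56..63 sit entirely inside the first 64-byte line.
static_assert(!straddles_line(56, 8, 64), "aligned counter stays in one line");
// Bytes 60..67 cross from the first line into the second.
static_assert(straddles_line(60, 8, 64), "misaligned counter splits");

// Natural alignment is the cheap guarantee: an 8-byte object on an 8-byte
// boundary can never cross a boundary whose width is a multiple of 8.
struct Stats {
    alignas(8) std::atomic<std::uint64_t> hits{0};
};
static_assert(alignof(Stats) % 8 == 0, "counter must be naturally aligned");
```

The danger cases are the ones that bypass the type system: packed structs, counters placed at computed offsets inside a byte buffer, and casts from char* to an atomic type. Those are where a runtime check with straddles_line on the actual address earns its keep.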

Second, and this is the one that quietly breaks “portable” code, not every cache line is 64 bytes. That number is so common on x86 that it gets hard-coded into alignas(64), into padding structs, into “round up to 64.” But Apple’s M-series uses 128-byte lines in its L2 and system-level cache (the M1 measurements are public), and several POWER parts use 128 bytes too. On those machines, two fields you carefully placed on alignas(64) boundaries to avoid false sharing can land on the same 128-byte line, and the false sharing you thought you fixed is silently back.

alignas(64) separates two hot fields on a 64-byte machine and reunites them on a 128-byte one. Same source, opposite outcome.
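The same-source, opposite-outcome effect can be checked with nothing but offsets. In this sketch, Counters and share_line are illustrative names of mine, and the offsets are taken relative to a struct base that is itself assumed to be aligned to the line width being tested.

```cpp
#include <cstddef>
#include <cstdint>

// Two hot fields padded apart with the usual hard-coded 64.
struct Counters {
    alignas(64) std::uint64_t produced;  // offset 0
    alignas(64) std::uint64_t consumed;  // offset 64
};

static_assert(offsetof(Counters, consumed) == 64, "fields are 64 bytes apart");
static_assert(sizeof(Counters) == 128, "two 64-byte slots");

// True if two offsets (from a line-aligned base) fall in the same line.
constexpr bool share_line(std::size_t offset_a, std::size_t offset_b,
                          std::size_t line_bytes) {
    return offset_a / line_bytes == offset_b / line_bytes;
}

// 64-byte lines: the fields are in different lines, false sharing avoided.
static_assert(!share_line(offsetof(Counters, produced),
                          offsetof(Counters, consumed), 64),
              "separated on a 64-byte machine");

// 128-byte lines: both fields fall in the same line again.
static_assert(share_line(offsetof(Counters, produced),
                         offsetof(Counters, consumed), 128),
              "reunited on a 128-byte machine");
```

Note that alignof(Counters) is only 64, so on a 128-byte machine whether the two fields actually collide depends on where the allocator places the object. That makes the bug intermittent, which is worse than making it consistent.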

I have watched this exact assumption bite. A lock-free allocator computed the width of its leaf blocks assuming a 512-bit (64-byte) line; on a host where the build defined the line size as 128 bytes, the element-count math came out wrong and tests failed in ways that looked, at first, like a memory-ordering bug. The fix was twofold and worth stealing: derive sizes from the actual line width rather than a literal, and static_assert the layout so a build for the wrong size fails loudly at compile time instead of regressing silently at runtime. And resist the temptation to treat std::hardware_destructive_interference_size as the source of truth: it is a compile-time, implementation-defined constant (often just 64 on x86 toolchains), so use it as a fallback while querying the real line size at runtime (sysconf(_SC_LEVEL1_DCACHE_LINESIZE) on glibc, which can report 0 when the value is unknown, and the platform’s equivalent elsewhere). Part 3’s rule about false sharing was right; what was wrong was the constant we trusted to implement it.
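A minimal sketch of that twofold fix, under stated assumptions: APP_CACHE_LINE_BYTES is a hypothetical build flag, LeafBlock is a stand-in for the allocator’s real leaf type (which I am not reproducing here), and the runtime query uses glibc’s sysconf name, guarded so the code still compiles where it is missing.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>        // std::hardware_destructive_interference_size, if provided
#include <unistd.h>   // sysconf (POSIX)

// Compile-time line width: a build override wins (hypothetical flag),
// then the library's hint, then the traditional 64 as a last resort.
#if defined(APP_CACHE_LINE_BYTES)
inline constexpr std::size_t kLineBytes = APP_CACHE_LINE_BYTES;
#elif defined(__cpp_lib_hardware_interference_size)
inline constexpr std::size_t kLineBytes =
    std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kLineBytes = 64;
#endif

static_assert(kLineBytes >= 16 && (kLineBytes & (kLineBytes - 1)) == 0,
              "line width must be a power of two");

// Leaf sized from the line width instead of from a literal.
template <typename T>
struct LeafBlock {
    static constexpr std::size_t kElems = kLineBytes / sizeof(T);
    static_assert(kElems >= 1, "element is larger than one cache line");
    alignas(kLineBytes) T elems[kElems];
};

// Fails the build loudly if the layout and the line width ever disagree.
static_assert(sizeof(LeafBlock<std::uint64_t>) == kLineBytes,
              "a leaf must occupy exactly one line");

// Line width reported by the running host, falling back to the compile-time
// value when the platform has no answer (sysconf may return 0 or -1).
std::size_t runtime_line_bytes() {
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
    const long reported = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (reported > 0) {
        return static_cast<std::size_t>(reported);
    }
#endif
    return kLineBytes;
}

// Start-up check: padding built for kLineBytes only prevents false sharing
// if the host's real line is no wider than that.
bool layout_matches_host() {
    return runtime_line_bytes() <= kLineBytes;
}
```

One reasonable design is to call layout_matches_host() once at start-up and log or abort when it returns false, since a binary padded for 64 bytes cannot be repaired at runtime on a 128-byte host; it can only tell you that it needs rebuilding.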

The thread through all three

Step back and the three sections are one idea: the abstractions we program against (“memory updates in order,” “memory is uniform,” “a cache line is 64 bytes and my access is atomic”) are approximations, and concurrency is exactly where the approximation leaks. The defenses rhyme, too. Express ordering in the language’s memory model so the compiler can target each ISA correctly, instead of leaning on the one chip you happened to test. Respect topology, because the cost of sharing depends on physical distance. And never hard-code the cache line: query it, derive from it, and static_assert it. Get those three right and the ten rules from Part 3 actually hold on the machine in front of you, not just the one in your head.