Managing Parallel, Part 3: When Parallel Goes Wrong

Part 1 was about queues: rate mismatch, bursty workers, and why you buffer between stations. Part 2 was about the primitives: process, kernel thread, user-space thread, and who schedules what. This one is about the part nobody enjoys and everybody ships: the ways parallel programs break.

Here is the encouraging news buried in the bad news. After you have chased enough of these bugs, they stop looking like a thousand unique disasters and start to look like a short list of broken contracts. A data race, a lost wakeup, a double-free across a process boundary: these feel unrelated when you are staring at a 3 a.m. stack trace, but most of them are the same handful of rules violated in different costumes. Learn the rules and you can often predict the bug before you have written it.

So that is how this post is organized. We will move from threads to coroutines to processes, because each execution model trades one set of hazards for another even though the underlying mistakes are the same. And as we go, we will collect the rules.

Figure: a spectrum of three ways to run the same logic (cooperative coroutines, preemptive threads, and isolated processes), each shown with its first failure.

The same logic can run cooperatively (a coroutine), preemptively (a thread), or in isolation (a process). You pick the trade-off; you inherit its first failure.

A quick word on where these rules come from and what they are not. They come from years of building streaming and dataflow runtimes (the kind of machinery that sits under RaftLib, an open-source stream-processing runtime I maintain), but almost nothing here is about any one library, and a good deal of it is not in the public/thesis version of RaftLib at all. Everything below is described at the level of the language and the operating system: std::atomic, condition variables, C++20 coroutines, futexes, RAII. The patterns travel to any environment with shared memory and real concurrency.

Act 1, Threads: the familiar hazards

Threads are where most of us first meet concurrency, and the villains here are old friends. We start with them because the intuition you build carries all the way up.

The most common bug I have ever chased, across every project, in any language with a scheduler, is the lost wakeup. A consumer waits for work. A producer makes work and signals. Somehow the consumer keeps sleeping, or wakes, looks, sees nothing, and goes back to sleep forever. Nine times out of ten the cause is ordering: the producer announced the work before it was actually visible, or it skipped the wake entirely because some “I think there is already a notification pending” flag looked set at the wrong instant.

Figure: wrong versus right wakeup ordering. Signaling before the data is written wakes the waiter to an empty queue; the fix is to publish with a release store, then signal.

Announce the data before it is visible and the waiter wakes to an empty queue. Publish first, then signal.

The fix is a discipline you will reach for constantly: publish, then signal. Make the data visible with a release store first, and only then touch the flag or condition variable that wakes the waiter; on the other side, load the announcement with acquire semantics before you read the data it points to. And do not try to be clever by suppressing a wake based on an unsynchronized read of your own “pending” flag. Collapse that into a single read-modify-write and let the condition variable or futex do the coalescing it was built to do. That gives us the first two rules. Rule 1: a wakeup is a hint to re-check, not a delivery. Rule 2: publish, then signal. Release after the data, acquire before it.
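Here is a minimal sketch of both rules using a mutex and a condition variable. The `Mailbox` class and `demo_sum` helper are names invented for illustration, not anything from RaftLib. The mutex supplies the release/acquire ordering here, and the predicate passed to `wait` is what turns a wakeup into a hint rather than a delivery.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// Hypothetical mailbox used only to illustrate Rules 1 and 2.
class Mailbox {
public:
    // Producer: make the item visible first, then signal.
    void push(int item) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            items_.push(item);               // publish (unlock acts as the release)
        }
        cv_.notify_one();                    // then signal
    }

    // Mark end of input so waiting consumers can drain and stop.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Consumer: wait() re-checks the predicate after every wakeup, so a
    // spurious or early wake simply goes back to sleep.
    std::optional<int> pop() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;   // closed and drained
        int item = items_.front();
        items_.pop();
        return item;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<int> items_;
    bool closed_ = false;
};

// Sums 1..n sent by a producer thread through the mailbox.
int demo_sum(int n) {
    Mailbox box;
    std::thread producer([&] {
        for (int i = 1; i <= n; ++i) box.push(i);
        box.close();
    });
    int sum = 0;
    while (auto item = box.pop()) sum += *item;
    producer.join();
    return sum;
}
```

Note that there is no hand-rolled "notification pending" flag anywhere: the condition variable coalesces redundant notifications on its own.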

The next old friend is deadlock, and its most common shape is embarrassingly simple. Two threads each need the same two locks, and they take them in opposite orders. Thread 1 grabs A and reaches for B; Thread 2 grabs B and reaches for A; now each is holding exactly what the other is waiting for, and neither will ever let go.

Figure: AB/BA deadlock. Thread 1 holds lock A waiting for B while thread 2 holds B waiting for A, forming a cycle; the fix is one global lock order.

Two locks taken in opposite orders form a cycle. Impose one global order and the cycle cannot form.

The cure is not cleverness, it is consistency: decide on one global order for your locks (ordering by address works fine) and always acquire them that way, everywhere. While you are at it, never hold those locks by hand. A throw in the middle of a critical section that does manual lock/unlock leaves the lock held forever, and the next thread to come along deadlocks against a thread that is already gone. Scope every lock with RAII so unwinding the stack releases it for you. Rule 3: locks guard data, not time. Hold one only long enough to touch shared state, and never across a system call, a blocking wait, a join, or a callback into code you do not control, because that is how a lock you thought was fast becomes a latency cliff or a fresh deadlock. Rule 4: acquire in a fixed order, release with RAII.
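As a sketch of Rule 4, assume a hypothetical `Account` type with one mutex per object. The `transfer` function below always locks the lower address first and holds both locks through RAII guards, so two opposite transfers cannot form an AB/BA cycle and an exception cannot strand a lock.

```cpp
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical type for illustration: one lock guarding one balance.
struct Account {
    std::mutex mu;
    long balance = 0;
};

// Always acquire in address order, and let lock_guard release on scope exit.
void transfer(Account& from, Account& to, long amount) {
    if (&from == &to) return;                       // same lock twice would self-deadlock
    Account* first = &from;
    Account* second = &to;
    if (std::less<Account*>{}(second, first)) std::swap(first, second);
    std::lock_guard<std::mutex> hold_first(first->mu);
    std::lock_guard<std::mutex> hold_second(second->mu);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions; the total must be conserved.
long demo_total(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

If all the locks are taken in one place like this, `std::scoped_lock lock(from.mu, to.mu);` does the same job with a built-in deadlock-avoidance algorithm. The explicit ordering matters more when the two acquisitions happen in different functions.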

The last thread-level hazard is the sneakiest because nothing is ever incorrect; it is just mysteriously slow. Two threads update two completely independent counters, and yet throughput falls off a cliff under load. The reason is that those two counters happen to live on the same 64-byte cache line, and only one core at a time can hold a line in a writable state. So the line ping-pongs back and forth, invalidated on every write, and your “independent” counters are quietly serialized through the cache-coherence protocol.

Figure: false sharing. Two cores writing independent counters on one 64-byte cache line make it ping-pong; the fix gives each counter its own line.

Independent counters that share a cache line are not independent. Give each its own line.

This is false sharing, and the fix is to stop sharing: pad and align hot, independently written fields so each sits on its own cache line (alignas(hardware_destructive_interference_size) is exactly this), and static_assert the layout so a build for a different cache-line size cannot silently undo your work. That is the broader Rule 8: the cheapest synchronization is none. Partition instead of sharing wherever you can. And while we are talking about fields that multiple threads touch: volatile does not help you here. It was never a concurrency primitive; it gives you neither atomicity nor ordering. Use std::atomic (or std::atomic_ref over existing storage) with the ordering you actually mean. Rule 9: volatile is not a concurrency primitive.
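A minimal sketch of the padding fix follows. The names `kCacheLine`, `PaddedCounter`, and `Stats` are made up for the example, and the 64-byte fallback is an assumption for toolchains that do not expose `std::hardware_destructive_interference_size`.

```cpp
#include <atomic>
#include <cstddef>
#include <new>
#include <thread>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;  // assumption: typical line size, check your target
#endif

// Each counter is aligned to, and therefore padded out to, a full cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<unsigned long> value{0};
};

// Fail the build, not the benchmark, if the layout ever changes.
static_assert(alignof(PaddedCounter) == kCacheLine, "counter must start on a line boundary");
static_assert(sizeof(PaddedCounter) % kCacheLine == 0, "counter must fill whole lines");

struct Stats {
    PaddedCounter produced;  // written only by the producer thread
    PaddedCounter consumed;  // written only by the consumer thread
};
static_assert(sizeof(Stats) >= 2 * kCacheLine, "counters must not share a line");

// Two threads bump their own counters; returns the combined total.
unsigned long demo_count(unsigned long per_thread) {
    Stats stats;
    std::thread producer([&] {
        for (unsigned long i = 0; i < per_thread; ++i)
            stats.produced.value.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        for (unsigned long i = 0; i < per_thread; ++i)
            stats.consumed.value.fetch_add(1, std::memory_order_relaxed);
    });
    producer.join();
    consumer.join();
    return stats.produced.value.load() + stats.consumed.value.load();
}
```

The result is the same with or without the padding; only the throughput differs, which is exactly why the static_asserts are worth having.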

Act 2, Coroutines: new edges, sharper

C++20 coroutines move scheduling out of the kernel and into your program. That is what makes them cheap (a suspend and resume can cost a fraction of a thread context switch), and it is also what gives them a brand-new class of bugs, all clustered around suspension, resumption, and frame lifetime.

Rule 1 comes straight back, sharper. A coroutine awaiting a queue gets resumed and finds nothing to consume. If you treated “I was resumed” as “the data is here,” you now act on garbage, or trip an assertion, or worse. The resume was the hint; it was never the delivery. The fix is to loop: re-check the condition on resume and suspend again until there is genuinely data, or the input is closed and drained. Same rule as the lost wakeup, one level up the stack.

Then there is a failure that has no real thread analogue: the park/cancel race. A worker looks for work, sees none, and begins to put itself to sleep. In the tiny window between “deciding to sleep” and “actually asleep,” a producer makes work ready and fires a wake. If that wake lands in the gap, it can be lost, and the worker sleeps on top of work that is sitting right there.

Figure: the park/cancel race. A worker moves through running, about-to-park, and parked, and a wake in the race window is lost unless the sleeper can cancel its park.

Between “about to park” and “parked” there is a window. Let the sleeper abort its own park so a wake in that window is never lost.

The robust pattern is a small state machine (running, about-to-park, parked) in which the sleeper can have its park cancelled. The producer publishes the work (Rule 2, again), then, if it finds a worker in the about-to-park state, it compare-and-swaps that worker straight back to running rather than firing a wake into the void; a real wake is only ever sent to a worker that genuinely reached the parked state. Spin briefly in user space, and treat actually going to sleep as the rare last resort.
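The state machine can be sketched with two atomics. Everything here is illustrative: the `Worker` struct, the `pending` counter standing in for "work is ready," and the `WakeAction` result are assumptions of the sketch, and the actual blocking primitive (futex or condition variable) is deliberately left out.

```cpp
#include <atomic>

enum class WorkerState : int { Running, AboutToPark, Parked };
enum class WakeAction { None, CancelledPark, NeedsRealWake };

// Hypothetical per-worker control block.
struct Worker {
    std::atomic<WorkerState> state{WorkerState::Running};
    std::atomic<int> pending{0};   // stand-in for "work is ready"
};

// Worker side: announce intent, re-check for work, then commit.
// Returns true only if the worker really reached Parked and should block.
bool try_park(Worker& w) {
    w.state.store(WorkerState::AboutToPark);          // seq_cst by default
    if (w.pending.load() > 0) {
        w.state.store(WorkerState::Running);          // work arrived: abort own park
        return false;
    }
    WorkerState expected = WorkerState::AboutToPark;
    // Fails if a producer already flipped us back to Running in the window.
    return w.state.compare_exchange_strong(expected, WorkerState::Parked);
}

// Producer side: publish, then cancel the park or ask for a real wake.
WakeAction publish_and_wake(Worker& w) {
    w.pending.fetch_add(1);                           // publish first (Rule 2)
    WorkerState expected = WorkerState::AboutToPark;
    if (w.state.compare_exchange_strong(expected, WorkerState::Running))
        return WakeAction::CancelledPark;             // caught it in the window
    if (expected == WorkerState::Parked) {
        if (w.state.compare_exchange_strong(expected, WorkerState::Running))
            return WakeAction::NeedsRealWake;         // caller issues the futex/condvar wake
    }
    return WakeAction::None;                          // worker is running and will see the work
}
```

The default sequentially consistent ordering is doing real work here: the worker stores its state and then loads `pending`, the producer bumps `pending` and then inspects the state, and at least one side is guaranteed to see the other.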

The other coroutine-specific trap is about lifetime. A coroutine handle is, in effect, a non-owning back-pointer into a suspended coroutine frame. If an outer task is destroyed while a child is still holding a handle back into it, the child will eventually resume into freed memory. The remedy is to treat teardown as carefully as you treat the happy path: when a task is reset or destroyed, walk the chain of awaiters and sever the back-references before the frame goes away, and prefer symmetric transfer (returning the next handle from await_suspend) over resuming by hand inside a noexcept final suspend. This is really Rule 5: one owner per resource, expressed in the type system. Make the inline awaiter types non-copyable and non-movable so a stray copy cannot leave a dangling self-reference in the first place.

One more, because it bites everyone building pipelines: backpressure can deadlock you against yourself. If a single producer reserves a slot on a bounded queue and then, still holding that reservation, blocks waiting to reserve a second slot, it can wedge the whole pipeline: classic hold-and-wait, except the resource is buffer capacity. Keep at most one outstanding reservation per queue, and scope it as an RAII ticket so it is always either committed or released. That is Rule 3 wearing a different hat: do not hold a scarce thing while you block on another.
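One way to sketch the RAII ticket is below. It is single-threaded on purpose so the ownership logic stays visible; `BoundedQueue`, `Ticket`, and `try_reserve` are hypothetical names, and a real queue would add synchronization around the same shape.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

class BoundedQueue;

// Move-only reservation: committed explicitly, or released by the destructor.
class Ticket {
public:
    Ticket(Ticket&& other) noexcept : q_(std::exchange(other.q_, nullptr)) {}
    Ticket(const Ticket&) = delete;
    Ticket& operator=(const Ticket&) = delete;
    Ticket& operator=(Ticket&&) = delete;
    ~Ticket();
    void commit(int value);

private:
    friend class BoundedQueue;
    explicit Ticket(BoundedQueue* q) : q_(q) {}
    BoundedQueue* q_;
};

class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Refuses a second outstanding reservation instead of letting the
    // caller block while still holding the first one.
    std::optional<Ticket> try_reserve() {
        if (reserved_ || items_.size() >= capacity_) return std::nullopt;
        reserved_ = true;
        return Ticket(this);
    }

    std::size_t size() const { return items_.size(); }
    bool has_reservation() const { return reserved_; }

private:
    friend class Ticket;
    std::size_t capacity_;
    std::vector<int> items_;
    bool reserved_ = false;
};

inline Ticket::~Ticket() {
    if (q_) q_->reserved_ = false;       // never committed: give the slot back
}

inline void Ticket::commit(int value) {
    if (!q_) return;                     // already committed or moved from
    q_->items_.push_back(value);
    q_->reserved_ = false;
    q_ = nullptr;
}
```

Because the ticket is move-only and releases in its destructor, an early return or an exception between reserve and commit cannot leak capacity.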

Act 3, Processes & shared memory: the deep end

Now we cross the hardest boundary. Between processes, a raw pointer is meaningless, the operating system’s resources outlive the process that made them, and any participant can die in the middle of an operation and leave its mess behind. Fault isolation is the whole reason to reach for processes (if an untrusted parser segfaults, the pipeline survives), but you pay for it in cleanup.

The signature bug is the stale kernel resource. A System V semaphore or a shared-memory segment is owned by the kernel, not by you. Crash without cleaning up and the next run greets you with “File exists” (EEXIST), or worse, attaches to a zombie segment and produces quietly wrong results. The discipline is twofold: register cleanup on every exit path you control, and accept that some exit paths (a hard kill) you do not control, so the startup path must be able to recover: treat an exclusive-create that fails as “tear down the stale one and retry once,” and fail fast rather than attach to something you cannot trust. This pairs with the way you should create those resources in the first place. Do not check whether a segment exists and then create it; that gap is a race two starters will lose. Let the atomic exclusive-create (O_CREAT | O_EXCL for POSIX objects, IPC_CREAT | IPC_EXCL for System V ones) be the check, and make teardown idempotent so it is safe to run twice or lose the race. Rule 6: reclaim only when no one is looking. Rule 7: the atomic create is the check.
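For POSIX shared memory, the create-is-the-check pattern might look like the sketch below. The function names are invented for the example, and it assumes that any pre-existing segment with this name is stale; how you actually decide that (a pid file, a heartbeat, a generation stamp) is a policy question the sketch does not answer.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Idempotent teardown: "already gone" counts as success.
bool destroy_segment(const char* name) {
    return shm_unlink(name) == 0 || errno == ENOENT;
}

// Exclusive create with one recovery attempt. Returns an fd, or -1 on failure.
// Assumption: a segment that already exists under this name is stale.
int create_exclusive(const char* name) {
    for (int attempt = 0; attempt < 2; ++attempt) {
        // O_EXCL makes the create itself the existence check: no gap to race in.
        int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd >= 0) return fd;
        if (errno != EEXIST) return -1;          // unrelated failure: fail fast
        if (!destroy_segment(name)) return -1;   // could not clear the stale one
    }
    return -1;                                   // lost the race twice: give up
}
```

A caller would follow a successful create with ftruncate and mmap; those steps are omitted here. On older glibc versions you may need to link with -lrt.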

Lifetimes get genuinely hard once memory is shared across the boundary. A block in shared memory can be freed by one process while another is still reading it; a reference count can underflow; a fork can silently duplicate a holder without adding a reference, so two processes each believe they own the same count. The only thing that works is to rebuild, by hand, the discipline a shared_ptr gives you for free: the owner publishes a count of one before sharing, every attach increments and every detach decrements the same shared counter, copies add a real reference while moves transfer one, and the last participant out does the free. When in doubt, leaning toward leaking is safer than freeing memory a peer might still touch. That is Rule 5 again, now spanning processes. The same vigilance applies to file descriptors: open them close-on-exec so they do not leak into children, exchange-and-close when you replace one, and reset inherited process-local state in a pthread_atfork child handler.
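The counting protocol itself is small. This sketch shows it on a plain struct; in real use the header would live at the start of the shared mapping. The names are hypothetical, and refusing to attach once the count has hit zero is an extra safety choice of the sketch rather than something the rule requires.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical header placed at the front of a shared block.
struct SharedBlockHeader {
    std::atomic<unsigned> refs{0};
};

// Owner: publish a count of one before handing the block to anyone.
void publish(SharedBlockHeader& h) {
    h.refs.store(1, std::memory_order_release);
}

// Attach: add a reference only while the block is still alive.
bool attach(SharedBlockHeader& h) {
    unsigned n = h.refs.load(std::memory_order_relaxed);
    while (n != 0) {
        if (h.refs.compare_exchange_weak(n, n + 1,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed))
            return true;
    }
    return false;   // already released: do not resurrect it
}

// Detach: returns true for the last participant out, who must do the free.
bool detach(SharedBlockHeader& h) {
    unsigned prev = h.refs.fetch_sub(1, std::memory_order_acq_rel);
    assert(prev != 0 && "refcount underflow");
    return prev == 1;
}
```

A child created by fork would call `attach` for every block it inherits, which is the step that is easy to forget.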

Two portability traps are worth calling out because they are the kind of thing you only learn by getting burned. First, signal handlers: when a fatal signal lands in the middle of an allocation and your handler tries to run shared-memory cleanup (taking locks, calling free), you can deadlock or corrupt state inside the handler. A handler may safely do almost nothing. Inside it, the correct move is to write(2) a note and _exit; when you install it, mask nested signals and set SA_RESTART; and everywhere else, retry interrupted IPC calls on EINTR. Second, and this one surprises people: std::atomic::wait is not guaranteed to work across processes on shared memory. For inter-process blocking you generally have to drop down to a raw, non-private futex on a shared word: spin a little, then wait; publish, then wake. And if a lock holder can die mid-critical-section across the boundary, reach for a robust process-shared mutex so survivors get EOWNERDEAD and can call pthread_mutex_consistent rather than blocking forever. Underneath all of it sits the quiet portability rule: static_assert your layout assumptions (cache-line size, page size, struct offsets), because a surprising number of “races” turn out to be a 64-byte build meeting a 128-byte machine.
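The robust-mutex recovery path is short enough to sketch. This assumes a POSIX system with robust mutex support (Linux with glibc, for instance), and the helper names are invented. To keep it self-contained, the demo uses a thread that exits while holding the lock as a stand-in for a process that dies; in real use the mutex would live in the shared mapping.

```cpp
#include <cerrno>
#include <pthread.h>

// Initialize a mutex that is both process-shared and robust.
int init_robust_shared(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0) rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Lock, recovering if the previous owner died while holding the mutex.
// Returns 0 for a normal lock, EOWNERDEAD after a successful recovery
// (the lock is held in both cases), or another error code on failure.
int lock_recovering(pthread_mutex_t* m) {
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        // The protected data may be half-updated: repair or reset it here,
        // then mark the mutex usable again.
        int fixed = pthread_mutex_consistent(m);
        if (fixed != 0) return fixed;
    }
    return rc;
}

// Stand-in for a crashed peer: takes the lock and exits without unlocking.
static void* die_holding_lock(void* arg) {
    pthread_mutex_lock(static_cast<pthread_mutex_t*>(arg));
    return nullptr;
}

int demo_recover() {
    pthread_mutex_t m;
    if (init_robust_shared(&m) != 0) return -1;
    pthread_t t;
    if (pthread_create(&t, nullptr, die_holding_lock, &m) != 0) return -1;
    pthread_join(t, nullptr);
    int rc = lock_recovering(&m);     // a non-robust mutex would block forever here
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

The important design point is the comment in the middle: EOWNERDEAD tells you the lock is yours, not that the data behind it is sane.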

The rules that cut across all three

A few patterns showed up at every layer, so they deserve their own billing.

The first is reading a value that is wider than a single atomic word: a small struct, a pointer plus a generation, a pair of counters. Read it without a protocol and a writer can catch you mid-update, handing you the first field from before and the second from after: a value that never actually existed. The clean answer is a seqlock.

Figure: seqlock. A generation counter is odd while writing and even when stable, so a reader seeing an odd or changed sequence knows the read tore and retries.

A generation counter that is odd while writing and even when stable lets a reader detect that it was caught mid-update, and simply retry.

The writer bumps a sequence number to an odd value before it starts, writes the fields, and bumps it to even when done; the reader takes a snapshot of the sequence, reads the fields, and re-reads the sequence. If it changed or was odd, the read was torn, so retry. It is Rule 2 in structural form, and the same family of idea, a generation stamp, is what defeats the ABA problem in lock-free code, where a pointer can match while the object it points to has been recycled into something else. Reclaiming memory in a lock-free structure follows from the same instinct as Rule 6: you cannot resize or free in place; you stamp, you wait for readers to drain (an epoch or hazard scheme), and only then do you retire the old generation.
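A minimal single-writer seqlock over a two-field value can be sketched as follows. The class and demo names are made up, and the fields are stored as relaxed atomics so that a racing read is well-defined in the C++ memory model; the fences are what order those accesses against the sequence counter.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

struct Pair {
    std::uint64_t a;
    std::uint64_t b;
};

// Assumption: exactly one writer thread; any number of readers.
class SeqLockedPair {
public:
    void write(Pair p) {
        std::uint32_t s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);         // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);  // odd mark precedes the data
        a_.store(p.a, std::memory_order_relaxed);
        b_.store(p.b, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);         // even: stable again
    }

    Pair read() const {
        for (;;) {
            std::uint32_t before = seq_.load(std::memory_order_acquire);
            if (before & 1u) continue;                         // writer active: retry
            Pair p{a_.load(std::memory_order_relaxed),
                   b_.load(std::memory_order_relaxed)};
            std::atomic_thread_fence(std::memory_order_acquire);
            std::uint32_t after = seq_.load(std::memory_order_relaxed);
            if (before == after) return p;                     // unchanged: snapshot is whole
        }
    }

private:
    std::atomic<std::uint32_t> seq_{0};
    std::atomic<std::uint64_t> a_{0};
    std::atomic<std::uint64_t> b_{0};
};

// The writer keeps the invariant b == 2 * a; the reader counts violations.
int demo_torn_reads(int writes) {
    SeqLockedPair cell;
    std::atomic<bool> done{false};
    int torn = 0;
    std::thread reader([&] {
        while (!done.load(std::memory_order_acquire)) {
            Pair p = cell.read();
            if (p.b != 2 * p.a) ++torn;
        }
    });
    for (int i = 1; i <= writes; ++i) {
        std::uint64_t v = static_cast<std::uint64_t>(i);
        cell.write(Pair{v, 2 * v});
    }
    done.store(true, std::memory_order_release);
    reader.join();
    return torn;
}
```

Readers never block the writer, which is the appeal; the cost is that a reader can spin for as long as the writer keeps writing, so this fits values that are read far more often than they change.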

The rest are short and worth internalizing. Right-size your memory ordering: blanket seq_cst everywhere is a smell, and so is a full fence on every retry of a loop. Weaken to acquire/release only where you can name the happens-before edge you are relying on. Keep the control plane out of the data plane: let end-of-stream or signal records leak into your data aggregation and you get a stable record count with a slowly drifting sum, the kind of bug that sails through every smoke test. Keep your observers read-mostly: a telemetry thread that grabs hot-path locks is just contention with a friendlier name, and measurement code is concurrent code too (reach for a monotonic clock, not a wall clock, when you time deltas). Remember that fairness is a safety property: an unbounded drain loop or a spinlock that never yields will starve your scheduler as surely as a deadlock freezes it. And validate at the point of use, not once at startup, because a resource that was healthy when you checked it can be stale by the time you lean on it, the same time-of-check/time-of-use gap as Rule 7, wearing yet another hat.

The whole list, on one card

Figure: reference card listing all ten rules for code that shares memory, from "a wakeup is a hint" and "publish, then signal" onward.

That is the catalog. None of these rules is exotic, and that is the point. The reason concurrency feels endlessly deep is that the symptoms are endlessly varied (a hang here, a corrupted byte there, a benchmark that mysteriously regresses) while the causes are a short, stable list. The goal of a runtime, in the end, is to let the same piece of logic run as a coroutine, a thread, or a process without rewriting it. The price of that flexibility is respecting all ten of these at once, because each execution model is just waiting to break a different one first.

If there is appetite for it, a Part 4 could go a layer deeper on the lock-free queue itself: the single data structure where most of these rules collide at the same time.