The 80% Problem: The Last 20% Is Where the Engineer Used to Live

AI gets you to a working draft fast. The trouble is what the missing fifth contains: the edge cases and operational reality that used to teach engineers their judgment. And who builds that muscle now?

AI gets you to a working draft fast (say four-fifths of the way), and the speed is real. The trouble is what the remaining fifth contains, and who used to build the muscle that handles it.

At the end of the last post I admitted there was a related ache I was leaving for its own post: the way the tool gets you to eighty percent in a hurry, and the last twenty was always where the engineer actually lived. This is that post.

Start with the old joke, usually pinned on Tom Cargill at Bell Labs and made famous by Jon Bentley’s Programming Pearls column: the first ninety percent of the code takes the first ninety percent of the time, and the remaining ten percent of the code takes the other ninety percent of the time. It’s funny because the arithmetic is absurd and the experience is exact. The visible part of a system, the part that demos, comes together fast. Then the project meets the part nobody budgeted for: the edge cases, the failure modes, the conditions that only show up under load, the half-page of operational reality that decides whether the thing survives contact with the world.[1] That tail isn’t a rounding error in the schedule. It is the schedule.

Generative AI hasn’t repealed this rule. It’s relocated it, and in moving it created a problem worth naming carefully. Ask a coding model to build a feature and it’ll produce the first eighty percent with startling speed and fluency: the happy-path logic, the structure that passes a basic test, the scaffolding that runs in a demo. One caveat: that eighty percent is real only when the model was trained on something like your problem and can generalize to it. Point it somewhere the training data barely touched and the fluency doesn’t go anywhere, but the correctness does: you get the same confident scaffolding, now hallucinated.[2] Fluent output and usable work are different things, and the model has gotten good enough to nail the first while missing the second. What it quietly skips, when the eighty percent is the real kind, is the other twenty percent, and the skips aren’t random. They cluster around exactly the parts of engineering that take sustained operational experience: the idempotency key that keeps two racing requests from corrupting state, the backoff and jitter that keep a retry from turning into a stampede, the migration written to dodge a long table lock, the rate limiter, the circuit breaker, the structured log that makes the eventual failure diagnosable at 3am. Every one of those is invisible during development, because the dev environment never exercises the condition that would expose it. The code compiles. The test passes. The demo works. The artifact looks done. Then it meets concurrency, or a traffic spike, or a partial network failure, or any of the ordinary cruelties of production, and it isn’t done at all.
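To make two of those skipped pieces concrete, here is a minimal Python sketch, not anyone’s production code: an idempotency key that lets a replayed request return the first result instead of charging twice, and a retry loop with capped exponential backoff and full jitter. Every name in it (IdempotentHandler, call_with_retry, TransientError, the order-42 key) is hypothetical, the in-memory dictionary stands in for what would more plausibly be a database table with a unique constraint, and the defaults are illustrative rather than tuned.

```python
import random
import threading
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


class IdempotentHandler:
    """Remember the result for each idempotency key so a replay is a no-op.

    Assumption: one process and an in-memory dict. A real service would
    persist the keys somewhere durable.
    """

    def __init__(self, handler):
        self._handler = handler
        self._results = {}
        self._lock = threading.Lock()

    def handle(self, key, payload):
        # Holding the lock across the handler serializes every request.
        # That keeps the sketch correct under racing threads, at the cost
        # of throughput.
        with self._lock:
            if key not in self._results:
                self._results[key] = self._handler(payload)
            return self._results[key]


def call_with_retry(operation, *, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(); on TransientError, wait a jittered delay and retry.

    "Full jitter": each wait is uniform in [0, min(cap, base * 2**attempt)],
    so clients that failed together do not all come back together.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


# Demo: the first attempt applies the charge, then its response is "lost".
ledger = {"cents": 0}


def charge(payload):
    ledger["cents"] += payload["cents"]
    return ledger["cents"]


handler = IdempotentHandler(charge)
lost_responses = [True]  # one pending simulated network failure


def flaky_request():
    result = handler.handle("order-42", {"cents": 500})
    if lost_responses:
        lost_responses.pop()
        raise TransientError("response lost after the charge was applied")
    return result


# sleep is stubbed out so the demo runs instantly.
final = call_with_retry(flaky_request, sleep=lambda seconds: None)
print(final, ledger["cents"])
```

The demo’s first attempt applies the charge and then loses the response, which is exactly the case a naive retry turns into a double charge. Here the second attempt replays the stored result, so the ledger ends at 500 rather than 1000.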

That last twenty percent is where the engineer used to live. It was never the fun part, but it was the formative part, because it was the part that forced contact with the substrate and built the judgment that contact produces. You learned about memory by chasing a segfault you couldn’t explain, the sickening randomness of corruption in a parallel program, the occasional stack smash.[3] You learned about concurrency by debugging a race that only showed up on Tuesdays. You learned about data shape by watching a query that flew on a thousand rows fall over on ten million. None of it taught kindly or efficiently, but it taught, because the system simply refused to work until you understood it. The eighty percent the model now generates is, by and large, the part that resisted least. The twenty percent it skips is the part that did the teaching.
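The data-shape lesson is one of the few in that list you can reproduce on a laptop. A small sketch, assuming SQLite through Python’s built-in sqlite3 module and a made-up orders table: the same lookup is planned as a full table scan until an index exists, a difference you won’t notice at a thousand rows and can’t ignore at ten million. The table, column, and index names are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
)
# 10,000 rows is enough to show the plan; the plan is what decides the
# behavior at ten million.
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    ((i % 1000, i) for i in range(10_000)),
)


def query_plan(connection, sql, params=()):
    """Return SQLite's human-readable plan for a statement as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the descriptive text.
    return " | ".join(row[-1] for row in rows)


LOOKUP = "SELECT total FROM orders WHERE customer_id = ?"

# No index on customer_id yet: every row has to be visited.
plan_without_index = query_plan(conn, LOOKUP, (7,))

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# Same query, same data: the planner can now seek straight to the matches.
plan_with_index = query_plan(conn, LOOKUP, (7,))

print("before:", plan_without_index)
print("after: ", plan_with_index)
```

The first plan reports a scan of the whole table and the second a search that names idx_orders_customer. Nothing about the query text changed, which is why this class of problem passes every test written against a small fixture.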

There’s a takeaway in that worth saying out loud, and it runs against the anxious version of this story: experience is worth more now, not less. When the typing is partly handled for you, the decisive skill becomes everything around the typing: knowing which algorithm actually fits, seeing where the components meet and how they’ll behave under stress, holding in your head what the system is really supposed to do. That’s the judgment the last twenty percent used to build, and it’s exactly the part the model can’t hand you. (A bias I’ll cop to: I still out-optimize my own LLM more often than not, though it’s entirely possible I’m just pickier than it is.[4])

The good plan, violently executed

There’s a corollary here that cuts the other way, and I first learned it in a different uniform. The US Army teaches a planning version of this same ratio, usually wrapped around a line attributed to Patton:

A good plan violently executed now is better than a perfect plan executed at some indefinite time in the future.

Gen. George S. Patton

That isn’t a license to charge in with no plan, or a bad one. It’s a warning against the opposite failure, paralysis by planning. Wait for the perfect plan and you lose anyway: your forces sit still and undeployed while a competitor with a rougher plan moves, surrounds you, and wins. It’s Pareto’s 80/20[5] pointed at action: the “good” plan is the crucial twenty percent of effort that buys most of the outcome, and executing it now beats hoarding time for the elusive last fraction.

So which is it? Is the missing twenty percent the thing you should skip and move past, or the thing that quietly kills you? That’s exactly the question, and the honest answer is that it depends entirely on what lives in your particular twenty percent. A battle plan keeps adapting as it meets the enemy; the gaps get filled in contact, by people who understand the situation, which is the whole reason executing now beats waiting. The catch with AI is that the twenty percent it skips doesn’t fill itself in on contact: there’s no commander in the loop adapting on the ground, just an artifact that looked finished, and left alone it sits there invisible until the load or the race or the partial outage finds it. That doesn’t mean you can’t fill it on contact, it means you have to plan to. Call it the new agile bargain: ship the AI’s eighty percent, which in many cases is genuinely fine, then systematically plan to find and adjust the twenty as it meets reality, before it bites. The plan isn’t “this is finished,” it’s “this is good enough to start, and we’ve staffed and instrumented the adapting-under-fire that actually finishes it.” The Patton reading and the danger I’m describing are the same arithmetic wearing opposite faces. The difference is whether your last fifth is refinement you can earn later, or the part everything else was quietly depending on, the one you only thought you could skip.

Synthetic competence

This is the place to put a name on the central hazard, because it isn’t incompetence and it isn’t hallucination, and those older words hide what’s actually going on. Call it synthetic competence: output with the surface texture of understanding and none of the understanding underneath. A developer with AI can produce a fluent design doc, plausible code, a confident architectural recommendation, a tidy migration plan, a clean incident summary, and the whole surface area of apparent competence balloons while the depth behind it may not move an inch.[6] The artifact looks understood. The person, and the system the artifact applies to, may not be. And the gap between the two is exactly the thing that no longer announces itself, because the model has gotten good enough that the old tell, the inability to produce, is gone. In the old world, not understanding something showed up as not being able to build it. Now the building is cheap, and understanding has to be checked some other way: by whether the person can predict how the thing fails, name the assumptions it’s making, and reject a plausible answer that happens to be wrong.[7]

Synthetic competence is dangerous in proportion to how convincing it is, and it’s most convincing exactly where there’s no external ground truth to catch it. When the output is code with a compiler and a test suite behind it, reality can still punish a wrong answer, and the floor stays solid. When the output is a recommendation, an analysis, a summary, a judgment, nothing external pushes back, and the polish of the output becomes the very thing that disarms scrutiny. The user gets better at producing confident output and worse at telling whether it’s right, and that widening ratio is where the damage piles up, quietly, one accepted answer at a time.

The supervisor’s irony

None of this is new. Fiction got to it before the engineers wrote it down: you can argue Asimov was circling the same problem all through his robot stories, where the one indispensable human is Susan Calvin, the robopsychologist who stays expert enough to diagnose a machine after everyone around her has stopped understanding how it works.[8] The formal version arrived forty years ago, in a setting that had nothing to do with software. In 1983 the cognitive psychologist Lisanne Bainbridge published a short paper in Automatica called “Ironies of Automation,” about what happens when you automate a complex industrial process and shrink the human down to a supervisor. Her central irony is durable enough that it should be read aloud in every conversation about AI tooling: the more reliable the automation, the more crucial the human becomes in the rare moments it fails, and the supervisory role is precisely the one that least prepares the human to step in. You automate the routine work because it’s routine. But the routine work was the practice that kept the operator’s skill alive. Take it away and you’ve built a system that leans on human judgment in exactly the emergencies where it has quietly stopped growing that judgment.

Bainbridge’s ironies map onto AI-assisted engineering almost without translation. The developer who shifts from writing code to reviewing generated pull requests loses the daily reps that built the instinct to spot a race condition or a security hole on sight, and the review gets worse as the instinct fades. Skills that go unused decay, so the people best positioned to catch a truly wrong output are slowly becoming the least practiced at the underlying work. And here’s the deepest one, because it’s generational: today’s senior engineers, the ones qualified to supervise the machine, earned that judgment by grinding through the boilerplate and the edge cases themselves, back when that was simply the job. The next cohort walks into a workplace where the boilerplate is already automated away as economically wasteful. The judgment that supervises the tool today was, in Bainbridge’s phrase, built on the skills of former manual operators. When they retire, the foundation under the supervision retires with them.

The research on skill atrophy is still young, and I won’t pretend the numbers are in: what’s solid is the mechanism and some early evidence, not settled effect sizes. But the mechanism isn’t mysterious, and Bainbridge laid it out before any of us touched these tools. Automate the easy eighty percent, leave the human the hard twenty, then remove the very practice that built competence at the hard part to begin with. This is the struggle going forward, and working out how, as a society, to excel at it will likely be the question that defines leadership over the next twenty years. The fix isn’t to refuse the tool (you won’t, and you shouldn’t). It’s to deliberately rebuild the lost reps: keep the apprenticeship, defend the on-call rotation against the efficiency that wants to automate it away, and treat the moments of productive struggle not as waste to be optimized out but as the only curriculum that’s ever produced someone who can hold the last twenty percent when the first eighty leaks into it.

If you train engineers, or you’re being trained right now, I’d love to hear how you’re keeping the hard reps alive while the easy ones get automated. I think that question is going to define the next decade of this craft, and I don’t think any of us have it figured out yet.

Sources & Further Reading

Footnotes

  1. That’s a deliberate nod to the old military maxim, “no plan survives contact with the enemy.” The line is a much-simplified Helmuth von Moltke the Elder, who wrote in his 1871 essay Über Strategie that “no plan of operations extends with any certainty beyond the first encounter with the enemy’s main strength.” (On the attribution and how it got shortened over the decades, see Quote Investigator and Moltke’s biography.) It’s the same idea the Patton section below leans on: a plan is a starting point, not a guarantee.

  2. This failure mode has a fast-growing literature of its own, and it’s worth knowing how concrete it’s gotten. Code models invent things that don’t exist and say so with full confidence: in one large study, code-generating LLMs hallucinated package names that aren’t in PyPI or npm on 5.2% of commercial-model outputs and up to 21.7% of open-source outputs, and 43% of the fake names recurred on every rerun of the same prompt, a target stable enough that attackers now pre-register those names with malware (the “slopsquatting” supply-chain attack), per Spracklen et al., “We Have a Package for You!” (USENIX Security 2025). Beyond invented packages, researchers have built whole taxonomies and benchmarks for the problem: mapping, naming, resource, and logic hallucinations in CodeHalu (AAAI 2025), the HALLUCODE benchmark in “Exploring and Evaluating Hallucinations in LLM-Powered Code Generation”, and the broader survey by Huang et al., “A Survey on Hallucination in Large Language Models”. The recurring finding is the dangerous one: the output stays syntactically clean and plausible even when it’s wrong, which is exactly why fluency stopped being a usable signal.

  3. If you’ve done this a while, you know the exact sequence of gdb or lldb moves to corner one of these fast; it lives in your fingers, muscle memory. And you know the maddening twist: a lot of these bugs vanish the instant you run the program under a debugger, the textbook “Heisenbug.” The reason is often mundane, and it’s the kind of thing barely anyone learns anymore. When a debugger brings your target up with execve, it hands the process a fresh argv and, typically, a fatter environment block (envp, the third parameter main can take after argc and argv, the one almost nobody talks about now). Because the argv and environment strings sit at the base of the stack, every address below them shifts; a buffer overflow that used to clobber something load-bearing now lands in harmless padding, and the crash politely disappears. With the threads of a parallel program, spun up under the hood via clone, the debugger also perturbs timing, which hides races, and it usually disables address-space randomization, which pins the memory layout and can hide layout-dependent corruption in much the same way. The cure, painfully, is to reproduce it without the debugger’s help: core dumps, address sanitizers, and the logging you wish you’d added earlier.

  4. As an aside, this cuts a hopeful way too: we may be walking into a new era of efficiency. Data centers notoriously struggle to keep their silicon busy: fleets run well below full utilization, partly because heterogeneous hardware makes workloads hard to pack (Meta names the problem in its infrastructure overview, and AWS has written about low server utilization for years). For most of computing history Donald Knuth’s warning held, that “premature optimization is the root of all evil” (from his 1974 Computing Surveys paper “Structured Programming with go to Statements,” not, as it’s often misremembered, from The Art of Computer Programming), because hand-tuning every path cost more human time than it saved. A model changes that arithmetic: it can cheerfully optimize every millimeter of a 350,000-line codebase, with one real precondition: copious tests to catch what it breaks. Knuth’s full line even left the door open: “We should forget about small efficiencies, say about 97% of the time [...] Yet we should not pass up our opportunities in that critical 3%.” The tool may finally let us afford the other 97%.

  5. The “Pareto principle” is named for the Italian economist Vilfredo Pareto, who noticed in 1896 that roughly eighty percent of Italy’s land was held by about twenty percent of its people, and that income and wealth tend to follow that same lopsided shape. The 80/20 as a general rule of thumb came later, from the quality-management pioneer Joseph Juran, who read Pareto and generalized it into what he called the “vital few and the useful many”: across a lot of systems, a small share of the causes drives most of the effect. See the Pareto principle overview and the Juran Institute’s guide. The exact numbers aren’t a law, they’re a frequently-useful approximation; the real claim is just that effort and outcome are usually distributed very unevenly.

  6. The appearance of understanding predates AI by decades; we have always been able to manufacture it. And it isn’t only the look of understanding, it’s the look of useful work too, produced precisely because the depth underneath is thin. A thick design doc or a deck of authoritative slides can dress up a flimsy idea, bury a real risk under bullet points, and read as a week of progress while moving nothing real forward. The sharpest case is the Columbia shuttle disaster, where a genuine engineering warning got flattened into reassuring slideware; the Columbia Accident Investigation Board singled out “the endemic use of PowerPoint briefing slides instead of technical papers” as a symptom of NASA’s broken technical communication, the case at the center of Edward Tufte’s The Cognitive Style of PowerPoint. Software has a name for the general failure too, cargo cult programming, going through the motions of a working practice without the understanding that made it work, after Feynman’s “cargo cult science”; Steve McConnell pointed the same idea at “cargo cult software engineering”, the process and paperwork that imitate a healthy team without the substance. And when a whole engineering culture loses the thread, it can take the company down with it, as in Edgar Schein’s autopsy of Digital Equipment Corporation, DEC Is Dead, Long Live DEC.

  7. Sun Tzu got the shape of this twenty-five centuries early: “If you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat. If you know neither the enemy nor yourself, you will succumb in every battle.” (The Art of War, ch. III, “Attack by Stratagem”; Lionel Giles translation, 1910.) The “enemy” here isn’t an enemy in the usual sense; it’s the task in front of you, something to be defeated but not necessarily destroyed. In an AI world it reads almost literally: you have to know the tool, what it can do and what it can’t, and you have to know yourself, your own limits and when to stop and be introspective. Win on both and the work goes well; know only your own side and you’ll trade wins for losses; know neither and you’re just shipping confident output into the dark.

  8. Susan Calvin runs through Asimov’s robot stories as US Robots’ chief robopsychologist; most of I, Robot (1950) is framed as her looking back on cases where a robot behaved in some baffling way and she was the only one who could reason out why. The stories aren’t about deskilling as such, but the figure of the lone human who understands the machine well enough to debug it when no one else can is exactly the one Bainbridge’s irony turns on.