Does AI Slow You Down?

AI made experienced open‑source developers slower. In a randomized trial on real work, METR found that allowing AI assistance increased average completion time by about one‑fifth (roughly 19 percent) for expert maintainers tackling issues in the repositories they know best.

The result surprised both the researchers and the participants. It deserves close attention.

The study’s design was simple and strict. Sixteen long‑time maintainers supplied 246 real issues from their own mature, high‑quality projects, and each issue was randomly assigned to allow or disallow AI assistance. The work was not synthetic. It mirrored everyday tasks.

Time, not impressions, was the outcome. Developers forecast how long each issue would take with and without AI, then recorded their actual time after they finished. They could use any tool when permitted. Most used modern code assistants.

The perception gap was large. Before starting, developers expected substantial speedups from AI; after finishing, they still believed they had been faster even when the measured time said otherwise. External experts also forecast sizable gains. The measurements did not.

The mechanisms matter more than the headline. AI suggestions were often unreliable, acceptance rates were low, and developers spent significant time reviewing, rewriting, and stitching AI‑generated fragments into code that met their repository’s tacit standards. Latency added friction at every prompt. Familiar code gave the tools little leverage. These are all workflow facts.

Learning effects were modest at best. Removing early issues did not eliminate the slowdown, and the screen recordings showed persistent overhead from prompting and supervision rather than early user error. The tools were directionally helpful at times. They were not consistently reliable.

Quality did not obviously improve. Post‑review effort and qualitative assessments looked similar across conditions, so the extra time was not buying clear gains in maintainability or correctness. That weak trade‑off undercuts the case for slower but safer output.

The frame for interpretation is narrow but useful. The participants were experts embedded in mature codebases, where tacit conventions are thick and interface boundaries are tight, and where small errors propagate costly rework across complex dependency graphs. Many enterprise repositories share these traits. That is the context in which the study speaks most clearly.

We should not over‑generalize beyond that frame. Different tasks, domains, or models may yield different effects; future tooling may raise reliability and shrink latency. But the present result challenges a comfortable belief about rapid productivity gains. It does not dismiss them.

The economic question follows. Do current market expectations assume swift, broad productivity growth from AI in real software work? If they do, the study introduces a live risk that those gains arrive slower than the capital that is already committed. Timing matters for valuations. It matters for strategy.

The micro findings point to macro frictions. If much of the near‑term impact of coding assistants is a transfer from direct typing to supervision, review, and repair, then measured productivity rises only when reliability passes a threshold that turns oversight into leverage. Oversight without leverage is cost. Cost scales.
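To make that threshold concrete, here is a minimal back‑of‑the‑envelope sketch in Python of when a suggestion‑driven workflow pays for itself. The function name and every number in it are illustrative assumptions, not figures from the study.

```python
# Toy break-even model for AI suggestions. All names and numbers are
# illustrative assumptions, not measurements from the METR study.

def net_minutes_per_suggestion(p_accept, minutes_saved_if_accepted,
                               review_minutes, latency_minutes):
    """Expected minutes gained (positive) or lost (negative) per suggestion.

    Every suggestion costs review time plus prompting/latency overhead;
    only the accepted ones pay that cost back.
    """
    expected_gain = p_accept * minutes_saved_if_accepted
    overhead = review_minutes + latency_minutes
    return expected_gain - overhead


# If an accepted suggestion saves 6 minutes but each one costs 2 minutes of
# review and 1 minute of prompting and waiting, break-even sits at an
# acceptance rate of (2 + 1) / 6 = 0.5. Below that, oversight is pure cost.
for p in (0.3, 0.5, 0.7):
    print(p, round(net_minutes_per_suggestion(p, 6.0, 2.0, 1.0), 2))
```

Under these assumed numbers the workflow loses 1.2 minutes per suggestion at a 30 percent acceptance rate and gains 1.2 at 70 percent; the threshold, not the raw capability, decides the sign.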

The second risk is misallocated effort. When builders believe the tools are saving time even as actual time rises, organizations will over‑invest in workflows that look modern but drag delivery, and they will under‑invest in the sober plumbing that compounds returns. Misperception converts optimism into waste. Waste compounds too.

The third risk is brittle dependence. Teams that structure work around agents that frequently propose partial or subtly wrong changes will keep human reviewers in the loop while increasing the volume of generated diffs, which raises coordination overhead and stretches review queues. Throughput stalls when review becomes the bottleneck. Queues grow.

The fourth risk is managerial confusion. Leaders will see enthusiastic uptake, impressive demos, and confident forecasts from both staff and outside experts while project timelines slip by small amounts that are hard to attribute and easy to excuse. Small slips across many projects become material. Budgets assume the opposite.

None of these risks require a market crash to matter. They show up first in missed delivery dates, lower engineering morale, and slow de‑risking of important code paths, which then surface as deferred revenue and squeezed margins. The path from repository friction to earnings surprises is not mysterious. It is direct.

What should firms do now? Treat reliability, not raw capability, as the scarce input. Track time carefully at the issue level; compare forecast time, perceived time, and measured time across AI‑allowed and AI‑disallowed lanes; and change defaults only when the data support it. Manage to evidence, not sentiment.
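One way to operationalize that comparison is a small piece of bookkeeping like the sketch below. The CSV layout, column names, and lane labels are assumptions for illustration; any issue‑tracker export carrying the same three numbers per issue would do.

```python
# Minimal sketch of issue-level time tracking across AI-allowed and
# AI-disallowed lanes. Column names and the CSV layout are assumptions.
import csv
from collections import defaultdict

def lane_summary(path):
    """Average forecast, perceived, and measured hours per lane."""
    totals = defaultdict(lambda: {"forecast": 0.0, "perceived": 0.0,
                                  "measured": 0.0, "n": 0})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            lane = row["lane"]  # e.g. "ai_allowed" or "ai_disallowed"
            for key in ("forecast", "perceived", "measured"):
                totals[lane][key] += float(row[f"{key}_hours"])
            totals[lane]["n"] += 1
    return {lane: {k: v / t["n"] for k, v in t.items() if k != "n"}
            for lane, t in totals.items()}

# Hypothetical usage: print(lane_summary("issue_times.csv"))
```

Comparing the three averages per lane makes the perception gap visible in a team’s own data rather than in someone else’s study.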

Focus adoption where leverage is plausible. New code, greenfield services, and exploratory prototypes may benefit more than long‑lived systems with rigid interfaces and dense invariants. In mature code, emphasize search, tests, and static checks that bound error rather than agents that propose large edits. Boundaries reduce rework.

Demand higher acceptance rates from tools. If fewer than half of model suggestions survive review, the assistant is generating review load rather than velocity, and latency will magnify the effect. Lower the number of interactions and raise their quality. Make prompts rarer and more decisive.
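Below is a small sketch of that check. The outcome labels and the 0.5 cutoff echo the toy break‑even numbers above; they are assumptions, not a standard metric reported by any particular tool.

```python
# Acceptance-rate check against an assumed break-even threshold of 0.5.
def acceptance_rate(outcomes):
    """Fraction of suggestions that survived review unchanged."""
    return sum(o == "accepted" for o in outcomes) / len(outcomes)

outcomes = ["accepted", "rejected", "rewritten", "accepted", "rejected"]
rate = acceptance_rate(outcomes)
print(f"acceptance rate: {rate:.0%}")  # 40% in this toy log
if rate < 0.5:
    print("below break-even: the assistant adds review load, not velocity")
```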

Finally, plan for slower gains. Assume a period in which AI improves ergonomics and reduces drudgery for some tasks while leaving end‑to‑end throughput unchanged or slightly worse for expert teams on complex systems. Build budgets and roadmaps that do not require immediate acceleration. Discipline buys time.

The METR study does not end the debate. It provides a clean measurement from a realistic setting, and it counters the most optimistic claims about short‑run productivity. Markets can tolerate uncertainty, but they punish timing errors. The prudent stance is to separate promise from schedule and invest accordingly.

For investors and policymakers, the message is narrow but concrete: watch realized throughput in real repositories, not marketing claims or survey beliefs; insist on evidence of higher acceptance rates, lower rework, and shorter lead times before underwriting ambitious productivity trajectories or subsidizing generalized deployments.

For workers, the counsel is similar: learn the tools, measure their effect, and defend engineering basics that reduce error. Evidence first; allocation second. Prudence matters.


Footnote

  1. Becker, J.; Rush, N.; Barnes, E.; Rein, D. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (METR). arXiv preprint arXiv:2507.09089, July 2025. Available at: https://arxiv.org/abs/2507.09089

