19 May 2026

The AI Productivity Mirage: When More Output Does Not Mean More Value

Hero image for 'The AI Productivity Mirage: When More Output Does Not Mean More Value.' Image by Mathieu Perrier.

Most organisations do not have a clean productivity problem. They have a measurement problem.

That was already true before generative AI. Teams counted tickets closed, pages published, sprint points completed, dashboards produced, and release notes written because those artefacts were visible. The harder questions, the ones about whether any of that work improved the product, reduced failure demand, protected margin, or taught the organisation something useful, were always slower and more awkward.

AI makes that gap much larger.

Once code, copy, summaries, tests, reports, and first drafts become dramatically cheaper to produce, the visible part of work expands faster than the valuable part. A team can show more output in almost every category management already knows how to count. It can produce more code, more content, more support responses, more product notes, more options, more internal documents, and more weekly updates. If the organisation was already relying on weak proxies, AI does not merely preserve the weakness. It amplifies it.

The productivity case for AI needs a closer look than most slide decks give it. The AI layoff story is one version of this, and I wrote about the broader system risk in The AI Layoff Trap. The same problem appears inside individual firms at a smaller scale. Teams can look more productive whilst creating extra review work, more noise, more maintenance burden, and weaker decision quality.

Output is Not Outcome

Productivity is not the number of artefacts produced. It is the rate at which useful outcomes are created relative to the time, money, attention, and capability consumed.

That distinction sounds obvious, but it breaks down quickly in day‑to‑day operating reviews. Output is immediate and countable. Outcomes are often delayed, distributed, and contested. If a team ships fifty pieces of content this month, everybody can see the number. If those pages dilute topical focus, cannibalise each other, attract the wrong audience, or fail to earn citations and mentions, the damage appears later and is harder to attribute.

The same problem exists in software. It is easy to see that a team merged more pull requests or generated more tests. It is harder to see whether those changes reduced operational load, improved user success, lowered rework, or made future changes cheaper. Diane Coyle and John Lourenze Poquiz make a similar point in Making AI Count: The Next Measurement Frontier, arguing that current statistical frameworks struggle to capture AI's impact on quality change, process change, and task reorganisation rather than only counting visible inputs and outputs.

Once an organisation starts talking about AI productivity, it often assumes the difficult part is getting the tools into more hands. The more difficult part is deciding what counts as productivity in the first place.

Why AI Makes Weak Metrics Easier to Inflate

There are at least three reasons AI makes bad metrics easier to game.

The first is simple multiplication. When drafting, summarising, and transforming text or code becomes cheap, any metric tied to raw volume moves quickly. If your KPI is pages published, tickets answered, pull requests opened, or executive briefings produced, AI gives you a fast way to move the number.

The second is organisational pressure. NBER's working paper on the rapid adoption of generative AI found that by late 2024, 23 percent of employed respondents in the United States had used generative AI for work in the previous week, with a later published version putting the figure at 27 percent (see NBER's working paper on generative AI adoption and Management Science research on generative AI). Once a tool is widely normalised inside work, leadership conversations shift from experimentation to expectation. The board no longer asks whether the tool matters. It asks why the cost base, throughput, or delivery profile has not changed yet.

The third is that many of AI's early gains arrive through task compression rather than clean business outcomes. In Still Waters, Rapid Currents: Early Labor Market Transformation under Generative AI, Anders Humlum and Emilie Vestergaard describe rapid adoption, widespread productivity claims, and visible task reorganisation, whilst finding little measurable short‑run effect on earnings or recorded hours at the worker and workplace level. That is not evidence that AI does nothing. It is evidence that local task efficiencies do not automatically show up as durable value in the places management expects.

In other words, AI can create a large amount of visible motion before it creates a comparable amount of visible progress.

More Code is Not Necessarily More Software Value

The software version of the mirage is now familiar. Teams can generate boilerplate, write tests, scaffold components, sketch migrations, and produce refactor suggestions much faster than before. Some of that is genuinely useful. Some of it is also a very efficient way to create future obligations.

The first trap is assuming that code production and software value are the same thing. They are not. A repository full of generated handlers, helper functions, wrappers, and tests is not a product improvement until those additions are understood, reviewed, integrated, observed in production, and maintained over time.

The second trap is assuming faster code generation reduces the importance of judgement. On real teams, it often moves the bottleneck into places where judgement matters more. Review quality, architectural coherence, naming, domain modelling, and operational fit become more important, not less, because there is more material moving through the system.

That is one reason the current evidence on developer productivity is so mixed. METR's early‑2025 study of experienced open‑source developers found that, in that specific context, developers using AI tools took 19 percent longer on average even though they expected to be faster. METR later reported experimental results showing some evidence of speed‑up under different conditions, but with wide confidence intervals and strong context dependence. That does not mean AI is bad for engineering. It means you cannot infer real productivity from a demo or a timing claim in isolation.

Google's own research points in a similar direction. The paper "What Improves Developer Productivity at Google? Code Quality" argues that perceived developer productivity is causally linked to code quality, technical debt, team communication, infrastructure, priorities, and process, not merely to how quickly code is typed.

The broader discussion in The Impact of AI on Developers and the Web Industry remains useful here. A model's ability to emit code is only the start of the problem. The harder test is whether the team can absorb that code without degrading comprehension, review standards, and long‑term changeability.

An engineering lead who looks only at generated throughput can very easily mistake inventory growth for delivery improvement.

Ticket Deflection is Not the Same as Resolution

Customer support is one of the few areas where there is already serious evidence for AI augmentation, but even here the measurement problem does not go away.

Brynjolfsson, Li, and Raymond found that a generative AI assistant increased productivity in a customer support setting by 14 percent on average, with much larger gains for novice and lower‑skilled workers, and with improvements in customer sentiment and employee retention. That is important because it shows AI can improve real performance, not only superficial speed.

But it is also a reminder that the right metric is not raw deflection. If an assistant helps more customers get correct answers more quickly, that is valuable. If it mainly pushes customers away from humans, closes cases too early, or hides unresolved product issues behind better‑looking dashboards, the gain is cosmetic.

Support leaders should be especially wary of three inflated metrics:

lower handle time without tracking repeat contact
higher deflection without tracking downstream churn or complaint escalation
faster first response without tracking whether the user actually achieved their goal
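One way to keep those pairings honest is to compute the flattering metric and its counterweight from the same contact log. The following is a minimal sketch, not a reference implementation: the `Contact` record, its field names, the seven-day repeat window, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    """One support contact. All field names here are hypothetical."""
    customer_id: str
    day: int               # day index within the reporting period
    handled_by_bot: bool   # True if no human agent was involved
    goal_achieved: bool    # e.g. taken from a follow-up survey or audit


def deflection_rate(contacts):
    """Share of contacts closed without a human agent."""
    return sum(c.handled_by_bot for c in contacts) / len(contacts)


def repeat_contact_rate(contacts, window_days=7):
    """Share of contacts followed by another contact from the same
    customer on a later day within window_days."""
    repeats = 0
    for c in contacts:
        if any(
            o.customer_id == c.customer_id and 0 < o.day - c.day <= window_days
            for o in contacts
        ):
            repeats += 1
    return repeats / len(contacts)


def resolution_rate(contacts):
    """Share of contacts where the customer actually achieved their goal."""
    return sum(c.goal_achieved for c in contacts) / len(contacts)


# Invented sample: customer "a" is deflected twice before a human resolves it.
contacts = [
    Contact("a", 1, True, False),
    Contact("a", 3, True, False),
    Contact("a", 5, False, True),
    Contact("b", 2, True, True),
]

print(deflection_rate(contacts))      # 0.75
print(repeat_contact_rate(contacts))  # 0.5
print(resolution_rate(contacts))      # 0.5
```

In this invented sample the deflection rate looks healthy at 75 percent, yet half the contacts lead to a repeat and only half end with the customer's goal met. Reported alone, the first number would hide the other two.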

An online retailer can automate a large share of returns and delivery queries and still make the service worse if edge cases bounce between workflows, warehouse systems, and human agents who no longer have enough context to recover the journey. The dashboard may show less labour. The customer sees repetition, confusion, and delay.

That is not productivity. It is failure demand wearing an automation costume.

Product Discovery Summaries Do Not Replace Product Judgement

Product work is now filling up with summaries. Meeting summaries. Call summaries. Research‑note summaries. Roadmap summaries. Competitor summaries. AI is very good at converting a large body of language into a smaller body of language.

That can be useful. It can also be misleading.

Discovery is not a transcription problem. It is a judgement problem. The hard part is not only collecting what users said. It is deciding which signals matter, which complaints are symptoms rather than causes, which requests should be resisted, and which awkward inconsistency points to a structural flaw in the product or the operating model.

The evidence on workplace AI use supports that distinction. Anthropic's Economic Index found significant usage concentration in software development, writing, and knowledge tasks, with more augmentation than outright automation in observed Claude sessions. NBER's Shifting Work Patterns with Generative AI similarly found reduced time spent on email and out‑of‑hours work, but not major changes in the quantity or composition of tasks when the tool was provided at the individual level.

That combination matters. AI is good at compressing communication overhead. It is much less obvious that it reliably improves product judgement under uncertainty.

If a product team mistakes better summarisation for better discovery, it can move faster whilst learning less. It may produce cleaner read‑outs and weaker decisions at the same time.

What Engineering and Product Leaders Should Measure Instead

If the easy metrics are now easier to inflate, the answer is not to stop measuring. It is to measure a broader and more honest set of outcomes.

DORA's delivery metrics are useful precisely because they balance speed and instability rather than rewarding throughput alone. Throughput without stability is not high performance. It is often a polite description of latent risk.

For AI‑assisted teams, a stronger measurement stack usually includes some combination of the following:

change lead time, deployment frequency, recovery time, and change‑fail rate rather than only merge volume
customer resolution quality, repeat‑contact rate, and escalation rate rather than only deflection
content engagement quality, assisted conversions, citations, and organic durability rather than only pages published
rework, bug reopen rates, and maintenance load rather than only initial completion speed
onboarding time, documentation usefulness, and review burden rather than only individual output
team capability growth and judgement quality rather than only prompt adoption

The common feature is that these measures are harder to fake with raw volume.
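As an illustration of the first item, the four delivery measures can be derived from a plain deployment log rather than from merge counts. This is a rough sketch under stated assumptions: the `Deployment` record, its fields, and the sample data are hypothetical, and the definitions are simplified rather than taken from DORA's own tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment. Field names are illustrative assumptions."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # change caused a production failure
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def delivery_summary(deployments, period_days):
    """Summarise speed and stability together for one reporting period."""
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [
        hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at
    ]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_times),
        "change_fail_rate": len(failures) / len(deployments),
        "median_recovery_hours": median(recoveries) if recoveries else None,
    }


# Invented sample: four deployments in a week, one of which failed.
t0 = datetime(2026, 5, 1, 9, 0)
day = timedelta(days=1)
deployments = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(
        t0 + day,
        t0 + day + timedelta(hours=10),
        failed=True,
        restored_at=t0 + day + timedelta(hours=12),
    ),
    Deployment(t0 + 2 * day, t0 + 2 * day + timedelta(hours=6)),
    Deployment(t0 + 3 * day, t0 + 3 * day + timedelta(hours=8)),
]

summary = delivery_summary(deployments, period_days=7)
print(summary)
```

Returning all four values from one function is a deliberate choice in this sketch: a throughput figure cannot be quoted without the failure rate and recovery time sitting next to it.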

They also force a leadership team to admit that different kinds of work produce value on different timelines. A generated test suite that saves time this sprint but doubles review effort and misses behavioural regressions is not a net gain. A content pipeline that triples page count and halves topical trust is not a net gain. A support assistant that shortens chats whilst driving more repeat contacts is not a net gain.

A Better Productivity Review for AI Programmes

When a team claims an AI productivity gain, leadership should ask a better set of questions before turning that gain into a headcount assumption, a budget cut, or a board narrative.

Try this review frame instead:

What output increased?
Which outcome improved?
Over what time horizon was the gain measured?
Which new review, maintenance, or governance cost appeared elsewhere?
Did customer experience improve, stay flat, or quietly degrade?
Did we reduce failure demand or only process it faster?
Did the gain depend on expert oversight that is not reflected in the headline number?
Would we still call this a gain if we included rework six months later?

That final question matters most. AI can make many tasks cheaper at the point of production. The trap is assuming that point is the same as the point of value creation.
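The questions about hidden cost and later rework reduce to simple arithmetic once the costs are written down. A minimal sketch, assuming every quantity can be estimated in hours; the function name, its parameters, and all figures are invented for illustration.

```python
def net_hours_saved(
    production_hours_saved: float,
    extra_review_hours: float,
    rework_hours_later: float,
    oversight_hours: float = 0.0,
) -> float:
    """Headline saving minus the costs that appear elsewhere or later."""
    return (
        production_hours_saved
        - extra_review_hours
        - rework_hours_later
        - oversight_hours
    )


# Hypothetical quarter: the headline number is 120 hours saved at the
# point of production.
headline = 120.0
net = net_hours_saved(
    production_hours_saved=headline,
    extra_review_hours=45.0,   # reviewers absorbing more generated material
    rework_hours_later=60.0,   # fixes found over the following six months
    oversight_hours=25.0,      # expert supervision missing from the headline
)

print(headline)  # 120.0
print(net)       # -10.0
```

With these invented figures, a 120-hour headline gain becomes a 10-hour net loss. The point is not the specific numbers but that the sign of the result can change once the later costs are counted.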

The same handover problem sits behind Enterprise AI Delivery Starts After the Pilot: a useful demo only matters when ownership, support, integration, and measurement survive production.

Why Bad Scoreboards Become Strategy

Weak productivity metrics do not stay weak for long. Once they enter management reporting, they start shaping behaviour.

If a leadership team rewards more published pages, shorter handling times, more generated tests, or more closed tickets, teams adapt to those targets whether or not they reflect the underlying economics of the business. AI simply gives those targets a much bigger multiplier. The organisation can therefore become more confident at exactly the point it should be more sceptical, because the numbers it already liked are moving in the desired direction.

That is why productivity measurement is not a neutral reporting problem. It is a strategy problem. The scoreboard decides what gets optimised, and AI makes shallow scoreboards much more powerful.

Conclusion

AI can absolutely improve productivity. It can remove low‑value friction, compress communication overhead, help novices learn faster, and widen the set of tasks teams can complete in a given week.

But none of that means a bigger pile of output is the same thing as a better business.

The productivity mirage appears when an organisation counts visible artefacts instead of durable outcomes. AI makes that mistake easier because it produces convincing evidence of motion. More code. More summaries. More dashboards. More content. More replies. More options.

If leadership cannot distinguish that visible motion from real user value, quality improvement, resilience, and reduced future cost, the tool does not fix the measurement problem. It exposes it.

That is the part worth keeping in view. The real productivity gain is not whatever number gets larger fastest. It is the work that leaves the system better than it found it.
