The AI Productivity Mirage: When More Output Does Not Mean More Value

Most organisations do not have a clean productivity problem. They have a measurement problem.
That was already true before generative AI. Teams counted tickets closed, pages published, sprint points completed, dashboards produced, and release notes written because those artefacts were visible. The harder questions, the ones about whether any of that work improved the product, reduced failure demand, protected margin, or taught the organisation something useful, were always slower and more awkward.
AI makes that gap much larger.
Once code, copy, summaries, tests, reports, and first drafts become dramatically cheaper to produce, the visible part of work expands faster than the valuable part. A team can show more output in almost every category management already knows how to count. It can produce more code, more content, more support responses, more product notes, more options, more internal documents, and more weekly updates. If the organisation was already relying on weak proxies, AI does not merely preserve the weakness. It amplifies it.
That is why the productivity case for AI needs a closer look than most slide decks get. The AI layoff story is one version of this, and I wrote about the broader system risk in The AI Layoff Trap. The same problem appears inside individual firms at a smaller scale. Teams can look more productive while creating extra review work, more noise, more maintenance burden, and weaker decision quality.
Output is Not Outcome
Productivity is not the number of artefacts produced. It is the rate at which useful outcomes are created relative to the time, money, attention, and capability consumed.
That distinction sounds obvious, but it breaks down quickly in day‑to‑day operating reviews. Output is immediate and countable. Outcomes are often delayed, distributed, and contested. If a team ships fifty pieces of content this month, everybody can see the number. If those pages dilute topical focus, cannibalise each other, attract the wrong audience, or fail to earn citations and mentions, the damage appears later and is harder to attribute.
The same problem exists in software. It is easy to see that a team merged more pull requests or generated more tests. It is harder to see whether those changes reduced operational load, improved user success, lowered rework, or made future changes cheaper. Diane Coyle and John Lourenze Poquiz make a similar point in Making AI Count: The Next Measurement Frontier, arguing that current statistical frameworks struggle to capture AI's impact on quality change, process change, and task reorganisation rather than only counting visible inputs and outputs (Coyle and Poquiz on AI and productivity measurement).
That matters because once an organisation starts talking about AI productivity, it often assumes the difficult part is getting the tools into more hands. The more difficult part is deciding what counts as productivity in the first place.
Why AI Makes Weak Metrics Easier to Inflate
There are at least three reasons AI makes bad metrics easier to game.
The first is simple multiplication. When drafting, summarising, and transforming text or code becomes cheap, any metric tied to raw volume moves quickly. If your KPI is pages published, tickets answered, pull requests opened, or executive briefings produced, AI gives you a fast way to move the number.
The second is organisational pressure. NBER's working paper on the rapid adoption of generative AI found that by late 2024, 23 percent of employed respondents in the United States had used generative AI for work in the previous week, with a later published version putting the figure at 27 percent (NBER's working paper on generative AI adoption and Management Science research on generative AI). Once a tool is widely normalised inside work, leadership conversations shift from experimentation to expectation. The board no longer asks whether the tool matters. It asks why the cost base, throughput, or delivery profile has not changed yet.
The third is that many of AI's early gains arrive through task compression rather than clean business outcomes. In Still Waters, Rapid Currents: Early Labor Market Transformation under Generative AI, Anders Humlum and Emilie Vestergaard describe rapid adoption, widespread productivity claims, and visible task reorganisation, while finding little measurable short‑run effect on earnings or recorded hours at the worker and workplace level (Still Waters, Rapid Currents). That is not evidence that AI does nothing. It is evidence that local task efficiencies do not automatically show up as durable value in the places management expects.
In other words, AI can create a large amount of visible motion before it creates a comparably large amount of visible progress.
More Code is Not Necessarily More Software Value
The software version of the mirage is now familiar. Teams can generate boilerplate, write tests, scaffold components, sketch migrations, and produce refactor suggestions much faster than before. Some of that is genuinely useful. Some of it is also a very efficient way to create future obligations.
The first trap is assuming that code production and software value are the same thing. They are not. A repository full of generated handlers, helper functions, wrappers, and tests is not a product improvement until those additions are understood, reviewed, integrated, observed in production, and maintained over time.
The second trap is assuming faster code generation reduces the importance of judgement. In practice, it often moves the bottleneck into places where judgement matters more. Review quality, architectural coherence, naming, domain modelling, and operational fit become more important, not less, because there is more material moving through the system.
That is one reason the current evidence on developer productivity is so mixed. METR's early‑2025 study of experienced open‑source developers found that, in that specific context, developers using AI tools took 19 percent longer on average even though they expected to be faster (METR's AI productivity research). METR later reported experimental results showing some evidence of speed‑up under different conditions, but with wide confidence intervals and strong context dependence (METR's later uplift update). That does not mean AI is bad for engineering. It means you cannot infer real productivity from a demo or a timing claim in isolation.
Google's own research points in a similar direction. What Improves Developer Productivity at Google? Code Quality argues that perceived developer productivity is causally linked to code quality, technical debt, team communication, infrastructure, priorities, and process, not merely to how quickly code is typed (Google's research on developer productivity).
That is also why the broader discussion in The Impact of AI on Developers and the Web Industry remains useful. The meaningful question is not whether a model can emit code. It is whether the team can absorb that code without degrading comprehension, review standards, and long‑term changeability.
An engineering lead who looks only at generated throughput can very easily mistake inventory growth for delivery improvement.
More Content Can Reduce Signal
Content teams face the same problem from a different angle.
If AI lets a team publish twenty pages where it previously published five, the volume increase is obvious. What is less obvious is whether those twenty pages are distinctive, accurate, useful, and aligned to a clear information need. If they are mostly derivative summaries built from the same obvious material, the business may have increased publishing activity while reducing average informational value.
Google's own documentation is quite direct on this. Creating Helpful, Reliable, People‑First Content puts first‑hand expertise, original information, and substantial value ahead of production volume (Google's people‑first content guidance). Its spam policies go further by explicitly warning against scaled content abuse, including large volumes of low‑value pages generated with AI (Google's spam policies). Google also states that generative AI can be useful for research and structure, but using it to mass‑produce pages without added value risks crossing directly into spam territory (Google's guidance on AI‑generated content).
This matters even more in a world shaped by answer extraction and GEO. When large models are summarising pages rather than sending every user to read them in full, generic content becomes easier to ignore. What survives is content with distinct claims, clear structure, real examples, and trust signals. I made a related argument in What GEO Is and Why It Is Not Just SEO for AI. The point is not to produce more crawlable text. It is to produce material that deserves to be cited, summarised, and trusted.
The content mirage appears when teams celebrate page counts while their average usefulness falls.
Ticket Deflection is Not the Same as Resolution
Customer support is one of the few areas where there is already serious evidence for AI augmentation, but even here the measurement problem does not go away.
Brynjolfsson, Li, and Raymond found that a generative AI assistant increased productivity in a customer support setting by 14 percent on average, with much larger gains for novice and lower‑skilled workers, and with improvements in customer sentiment and employee retention (Generative AI at Work). That is important because it shows AI can improve real performance, not only superficial speed.
But it is also a reminder that the right metric is not raw deflection. If an assistant helps more customers get correct answers more quickly, that is valuable. If it mainly pushes customers away from humans, closes cases too early, or hides unresolved product issues behind better‑looking dashboards, the gain is cosmetic.
Support leaders should be especially wary of three inflated metrics:
- lower handle time without tracking repeat contact
- higher deflection without tracking downstream churn or complaint escalation
- faster first response without tracking whether the user actually achieved their goal
An online retailer can automate a large share of returns and delivery queries and still make the service worse if edge cases bounce between workflows, warehouse systems, and human agents who no longer have enough context to recover the journey. The dashboard may show less labour. The customer sees repetition, confusion, and delay.
That is not productivity. It is failure demand wearing an automation costume.
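One way to keep the three inflated metrics above honest is to report each of them next to its counter-metric. The sketch below is a minimal illustration, not a reference implementation: the `Ticket` fields, the seven-day repeat window, and the `support_scorecard` name are all assumptions made for the example, and a real help desk export will look different.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """One support contact. Every field here is hypothetical."""
    customer_id: str
    deflected: bool         # handled by automation, no human agent involved
    goal_achieved: bool     # did the customer get what they came for?
    repeat_within_7d: bool  # same customer, same issue, within an assumed 7 days


def support_scorecard(tickets: list[Ticket]) -> dict[str, float]:
    """Report deflection alongside the measures that show whether it helped."""
    total = len(tickets)
    if total == 0:
        raise ValueError("no tickets to score")

    deflected = [t for t in tickets if t.deflected]
    # A deflection counts as resolved only if the customer's goal was met
    # and they did not have to come back about the same issue.
    resolved = [t for t in deflected if t.goal_achieved and not t.repeat_within_7d]

    return {
        "deflection_rate": len(deflected) / total,
        "resolved_deflection_rate": len(resolved) / total,
        "repeat_contact_rate": sum(t.repeat_within_7d for t in tickets) / total,
    }
```

A wide gap between `deflection_rate` and `resolved_deflection_rate` is the cosmetic gain described above: the dashboard shows less labour while customers keep coming back.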
Product Discovery Summaries Do Not Replace Product Judgement
Product work is now filling up with summaries. Meeting summaries. Call summaries. Research‑note summaries. Roadmap summaries. Competitor summaries. AI is very good at converting a large body of language into a smaller body of language.
That can be useful. It can also be misleading.
Discovery is not a transcription problem. It is a judgement problem. The hard part is not only collecting what users said. It is deciding which signals matter, which complaints are symptoms rather than causes, which requests should be resisted, and which awkward inconsistency points to a structural flaw in the product or the operating model.
The evidence on workplace AI use supports that distinction. Anthropic's Economic Index found significant usage concentration in software development, writing, and knowledge tasks, with more augmentation than outright automation in observed Claude sessions (Anthropic's Economic Index). NBER's Shifting Work Patterns with Generative AI similarly found reduced time spent on email and out‑of‑hours work, but not major changes in the quantity or composition of tasks when the tool was provided at the individual level (NBER research on AI usage at work).
That combination matters. AI is good at compressing communication overhead. It is much less obvious that it reliably improves product judgement under uncertainty.
If a product team mistakes better summarisation for better discovery, it can move faster while learning less. It may produce cleaner read‑outs and weaker decisions at the same time.
What Engineering and Product Leaders Should Measure Instead
If the easy metrics are now easier to inflate, the answer is not to stop measuring. It is to measure a broader and more honest set of outcomes.
DORA's delivery metrics are useful precisely because they balance speed and instability rather than rewarding throughput alone (DORA's delivery metrics). Throughput without stability is not high performance. It is often a polite description of latent risk.
For AI‑assisted teams, a stronger measurement stack usually includes some combination of the following:
- change lead time, deployment frequency, recovery time, and change‑fail rate rather than only merge volume
- customer resolution quality, repeat‑contact rate, and escalation rate rather than only deflection
- content engagement quality, assisted conversions, citations, and organic durability rather than only pages published
- rework, bug reopen rates, and maintenance load rather than only initial completion speed
- onboarding time, documentation usefulness, and review burden rather than only individual output
- team capability growth and judgement quality rather than only prompt adoption
The common feature is that these measures are harder to fake with raw volume.
They also force a leadership team to admit that different kinds of work produce value on different timelines. A generated test suite that saves time this sprint but doubles review effort and misses behavioural regressions is not a net gain. A content pipeline that triples page count and halves topical trust is not a net gain. A support assistant that shortens chats while driving more repeat contacts is not a net gain.
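For the delivery measures in the first bullet above, a rough sketch of how speed and stability can be reported together might look like this. It assumes a hypothetical `Deployment` record, and it measures recovery from the moment of deployment to the moment of restoration, which is a simplification chosen for the example rather than DORA's official definition.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment. Field names are illustrative assumptions."""
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool
    restored_at: Optional[datetime] = None  # set only when a failure was fixed


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def delivery_summary(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Summarise throughput and instability over the same window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_times = [hours_between(d.committed_at, d.deployed_at) for d in deploys]
    failures = [d for d in deploys if d.caused_failure]
    recoveries = [
        hours_between(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_fail_rate": len(failures) / len(deploys),
        # 0.0 when nothing needed recovering in the window
        "median_recovery_hours": median(recoveries) if recoveries else 0.0,
    }
```

The point of returning all four numbers from one function is that nobody can quote deployment frequency without the change-fail rate sitting beside it.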
A Better Productivity Review for AI Programmes
When a team claims an AI productivity gain, leadership should ask a better set of questions before turning that gain into a headcount assumption, a budget cut, or a board narrative.
Try this review frame:
- What output increased?
- Which outcome improved?
- Over what time horizon was the gain measured?
- Which new review, maintenance, or governance cost appeared elsewhere?
- Did customer experience improve, stay flat, or quietly degrade?
- Did we reduce failure demand or only process it faster?
- Did the gain depend on expert oversight that is not reflected in the headline number?
- Would we still call this a gain if we included rework six months later?
That final question matters most. AI can make many tasks cheaper at the point of production. The trap is assuming that point is the same as the point of value creation.
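The arithmetic behind that final question is simple enough to sketch. The example below assumes a hypothetical `GainClaim` record in which a team logs the hours it saved up front, the extra review hours it took on, and the rework hours that surfaced in each later month. The figures in the usage lines are invented purely to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass
class GainClaim:
    """A claimed AI productivity gain. All fields are illustrative assumptions."""
    hours_saved: float                  # saving at the point of production
    review_hours_added: float           # extra review or governance effort
    rework_hours_by_month: list[float]  # rework discovered in each later month

    def net_hours(self, horizon_months: int) -> float:
        """Net saving once rework up to the given horizon is included."""
        rework = sum(self.rework_hours_by_month[:horizon_months])
        return self.hours_saved - self.review_hours_added - rework


# Invented numbers: a gain that looks healthy on day one.
claim = GainClaim(
    hours_saved=120.0,
    review_hours_added=30.0,
    rework_hours_by_month=[10.0, 20.0, 25.0, 15.0, 20.0, 10.0],
)
print(claim.net_hours(0))  # headline figure, before any rework is counted
print(claim.net_hours(6))  # the same claim six months later
```

With these made-up inputs the headline figure is positive and the six-month figure is negative, which is exactly the situation the review frame is meant to catch before the gain becomes a headcount assumption.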
Why Bad Scoreboards Become Strategy
Weak productivity metrics do not stay weak for long. Once they enter management reporting, they start shaping behaviour.
If a leadership team rewards more published pages, shorter handling times, more generated tests, or more closed tickets, teams adapt to those targets whether or not they reflect the underlying economics of the business. AI simply gives those targets a much bigger multiplier. The organisation can therefore become more confident at exactly the point it should be more sceptical, because the numbers it already liked are moving in the desired direction.
That is why productivity measurement is not a neutral reporting problem. It is a strategy problem. The scoreboard decides what gets optimised, and AI makes shallow scoreboards much more powerful.
Conclusion
AI can absolutely improve productivity. It can remove low‑value friction, compress communication overhead, help novices learn faster, and widen the set of tasks teams can complete in a given week.
But none of that means a bigger pile of output is the same thing as a better business.
The productivity mirage appears when an organisation counts visible artefacts instead of durable outcomes. AI makes that mistake easier because it produces convincing evidence of motion. More code. More summaries. More dashboards. More content. More replies. More options.
If leadership cannot distinguish that visible motion from real user value, quality improvement, resilience, and reduced future cost, the tool does not fix the measurement problem. It exposes it.
That is the part worth keeping in view. The real productivity gain is not whatever number gets larger fastest. It is the work that leaves the system better than it found it.