What we measured when we watched 200 developers use AI for a year

In early 2025 we had a question that was hard to answer: is the throughput increase from AI coding tools real, and if so, what's the cost? Most of the available research was self-reported, company-funded, or short-term. We wanted something different.

Over twelve months, we instrumented the development workflow of 200 working engineers across 14 companies, ranging from 12-person startups to 2,000-person public companies. We tracked two things: throughput (tickets closed, PRs merged, lines of code accepted) and what we called "retention" (the ability to explain, modify, and reproduce code written with AI assistance). This is what we found.

The setup

Participation was voluntary and compensated. Engineers opted in at the individual level; their employers approved the study protocol. We instrumented the development environment at the editor level, with full opt-out capability and weekly reports to each participant showing exactly what we'd collected.

The 200 participants broke down as follows: 62 junior engineers (0 to 3 years), 89 mid-level (3 to 7 years), 49 senior and above (7+ years). 68% used an AI coding tool daily before the study; 18% occasionally; 14% rarely or never. We used the pre-study usage level as a control variable.

For throughput, we used a composite metric: (tickets closed per sprint / ticket complexity estimate) multiplied by PR merge rate, adjusted for team size and codebase age. For retention, we ran monthly structured interviews where engineers were shown code they'd written 30 days prior, with AI-authored sections flagged, and asked to explain the logic, identify edge cases, and make a specific modification without agent assistance. We scored on a 5-point rubric.
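
As a rough sketch of the arithmetic behind the composite score, the snippet below computes only the unadjusted core of the metric. The names (SprintStats, pr_merge_rate, raw_throughput) are hypothetical, the definition of merge rate as merged PRs over opened PRs is an assumption, and the team-size and codebase-age adjustment is left out because its form isn't described here.

```python
from dataclasses import dataclass


@dataclass
class SprintStats:
    """One engineer's numbers for one sprint (field names are illustrative)."""
    tickets_closed: int
    complexity_estimate: float  # ticket complexity estimate, must be > 0
    prs_opened: int
    prs_merged: int


def pr_merge_rate(stats: SprintStats) -> float:
    """Assumed definition: merged PRs / opened PRs; 0.0 when none were opened."""
    if stats.prs_opened == 0:
        return 0.0
    return stats.prs_merged / stats.prs_opened


def raw_throughput(stats: SprintStats) -> float:
    """(tickets closed / complexity estimate) * PR merge rate, before adjustment."""
    if stats.complexity_estimate <= 0:
        raise ValueError("complexity_estimate must be positive")
    return (stats.tickets_closed / stats.complexity_estimate) * pr_merge_rate(stats)


if __name__ == "__main__":
    # Made-up inputs for illustration only, not study data.
    example = SprintStats(tickets_closed=8, complexity_estimate=2.0,
                          prs_opened=10, prs_merged=9)
    print(raw_throughput(example))
```

Any real use would still need the adjustment step, and the result is only comparable across engineers if the complexity estimates are made on the same scale.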

Throughput: the news is good

Across all seniority levels, developers using AI tools daily showed a statistically significant throughput improvement compared to their pre-study baseline. The mean improvement was +34%, with a wide distribution: the bottom quartile improved 12%, the top quartile improved 61%.

The gains were concentrated in two areas: boilerplate-heavy tasks (ORM queries, test scaffolding, config files) and context-switching recovery. Getting back up to speed on a codebase after a break was measurably faster with an AI that could summarise recent changes. Neither of these was surprising.

What was surprising: the throughput gains were smaller for more experienced engineers. Senior engineers improved 22% on average; junior engineers improved 47%. Our hypothesis is that senior engineers were already writing fewer lines of boilerplate (they abstract it away) and spending proportionally more time on architecture and review work, which AI tools currently help with less.

Retention: the news is not good

This is where the story changes.

At 30 days, developers who used AI daily scored a mean of 2.6 out of 5 on the retention assessment for AI-authored code, versus 4.1 for code they'd written themselves without assistance. That's a 1.5-point gap, or about 37% lower. At 90 days, the gap widened to 1.9 points (1.9 versus 3.8), or 50% lower.
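
For readers who want to check the percentages, here is a minimal sketch of the calculation: the gap is taken as the shortfall of the AI-authored score relative to the self-authored score. The function name relative_gap is ours for illustration; the four scores are the ones reported above.

```python
def relative_gap(ai_score: float, self_score: float) -> float:
    """Shortfall of the AI-authored score as a fraction of the self-authored score."""
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return (self_score - ai_score) / self_score


gap_30_days = relative_gap(ai_score=2.6, self_score=4.1)
gap_90_days = relative_gap(ai_score=1.9, self_score=3.8)

print(f"30 days: {gap_30_days:.0%}")  # 30 days: 37%
print(f"90 days: {gap_90_days:.0%}")  # 90 days: 50%
```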

More troubling: the gap correlated with usage intensity. Engineers who used AI tools for 60%+ of their working code showed retention scores 0.8 points lower than those using it for 30 to 40%. The curve was non-linear. There was a visible inflection point around 50% AI-authored code, past which retention dropped more steeply.

At 90 days, developers couldn't reliably explain or modify nearly half the code in their own codebase. Code with their name on it, in a repository they owned.

Junior engineers showed the most severe retention gaps. This is consistent with cognitive-science research on the generation effect and retrieval practice: material that learners produce themselves tends to be retained better than material they only read. Worked examples do help novices, but mainly when the novice studies and explains them. When an AI agent produces the worked example and the novice simply accepts it, the cognitive steps that produce durable learning (retrieval, elaboration, self-explanation) are bypassed entirely.

The interaction effect no one was talking about

The most important finding wasn't in either metric in isolation. It was in their interaction.

We found that high throughput and high retention were not positively correlated. They were weakly negatively correlated (r = -0.18, p < 0.01). The engineers with the highest throughput gains tended to have the largest retention gaps. Not always, but often enough for the relationship to be statistically significant.
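
For anyone who wants to run the same check on their own team, the sketch below shows how a Pearson correlation coefficient of this kind is computed from paired per-engineer values. The function name pearson_r and the toy numbers are assumptions for illustration; they are not study data and will not reproduce r = -0.18.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values, invented for illustration only (one pair per engineer).
toy_throughput_gain_pct = [10, 20, 30, 40, 50]
toy_retention_score = [3.9, 3.1, 3.6, 2.8, 3.3]

print(pearson_r(toy_throughput_gain_pct, toy_retention_score))
```

A negative value means higher throughput gains tend to go with lower retention scores; a value near zero means no linear relationship.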

This means the standard way companies are evaluating AI tooling adoption (velocity metrics, PR throughput, time-to-first-diff) is measuring something that is, on average, slightly anti-correlated with the thing they presumably also care about: whether their engineers are getting better at their jobs.

Seniority as a moderator

The retention problem was significantly worse for junior and mid-level engineers, and largely absent for senior engineers. Senior engineers showed retention scores of 3.8 even for AI-authored code. Still lower than their self-authored score of 4.4, but not catastrophically so.

Our interpretation: senior engineers do something juniors don't. They read the diff before accepting it, mentally simulate the execution, and often modify it before merge. The cognitive engagement that produces retention happens even though the tool doesn't require it. Juniors are tab-completing to the next task.

This has an uncomfortable implication: the engineers who need the most protection from skill atrophy are the ones least likely to self-impose it.

The fly-solo signal

In the final three months of the study, we asked 60 of the participants to designate one day per week as a "fly-solo" day: no agent assistance, just the editor and the documentation. We tracked their throughput on those days separately.

Mean throughput on fly-solo days was 38% lower than on assisted days, as expected. But the trend over 12 weeks was what mattered. In the first month, fly-solo throughput averaged 2.1 tasks per day. By month three, it averaged 2.6. That's a 24% improvement in solo output, without any AI assistance, in twelve weeks.

The control group (similar engineers who didn't do fly-solo days) showed a 4% improvement in solo throughput over the same period. The difference was statistically significant (p = 0.003).
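
The improvement figures above are simple relative changes. A minimal sketch of the arithmetic, using the reported monthly averages (the function name percent_change is hypothetical; the control group's 4% is taken as reported because its underlying averages aren't given here):

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100


# Fly-solo group: average tasks per solo day in month one vs. month three.
solo_change = percent_change(before=2.1, after=2.6)
control_change = 4.0  # reported for the control group

print(f"fly-solo group: {solo_change:.0f}%")  # fly-solo group: 24%
print(f"difference vs. control: {solo_change - control_change:.0f} percentage points")
```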

Deliberate solo practice improved the baseline. The agents were still there 4 days a week and productivity stayed high. The friction was the feature.

What we'd do differently

The 30-day window we used for the monthly retention interviews was probably too short; running the regular assessment at 60 or 90 days would likely have shown an even clearer picture. The retention rubric was our own and hasn't been externally validated. This is a limitation we're working to address by partnering with a university research group for a follow-on study. And 200 participants across 14 companies, while large for this kind of instrumented study, is still a convenience sample; we're cautious about over-generalising.

What we're building

These findings are the empirical foundation for aiworklab. The product we're building is a direct response to the interaction effect: a workspace that keeps throughput high through agent assistance while actively preserving the cognitive steps that produce retention. Comprehension checks before merges, spaced retrieval from your own past code, fly-solo sessions with a solo-throughput trend line.

If you want to participate in the follow-on study, or if you're an engineering leader who wants to run the retention assessment on your own team, write to team@aiworklab.com. We'll respond personally.

Methodology note: this study was conducted internally and has not yet been peer-reviewed. The full protocol and rubric definitions are available; email team@aiworklab.com to request a copy. We welcome critique.
