We caught it during the layoffs: an EM on bus-factor warnings

The engineering manager in this conversation runs platform and infrastructure at a Series C B2B company, roughly 130 engineers at the time of the events described. Late last year the company went through a restructure that cut about 15% of the engineering org. What follows is an edited transcript of a longer conversation; it has been condensed for length, and identifying details (names, the company, specific system names) have been removed or generalised at the interviewee's request. The interviewee reviewed and approved this version before publication.

Set the scene for us. Where does this story start?

It starts in a spreadsheet, like all the worst stories. We were planning the restructure, and part of that exercise is a risk column: if this person leaves, what breaks? I'd always filled that column from memory and from the org chart. Who owns which service, who's the tech lead on what. On paper our coverage looked fine. Every service had an owning team, every team had at least four people. Textbook.

And it wasn't fine.

It was fiction. Not deliberately, nobody lied, but ownership on an org chart measures who gets paged, not who understands. About a week into planning, one of my senior engineers, who knew roughly what was coming, said something in a one-on-one that I keep replaying. She said, "Before you finalise anything, you should know that if you cut the wrong one person, you lose the billing reconciliation pipeline, the data-residency routing layer, and the legacy export system, all at once." Three production systems. One person. And the person she named wasn't the listed owner of any of the three.

How does that happen at a 130-engineer company with normal review practices?

Gradually and then invisibly. He was the person who'd been around when those systems were built. Over four or five years, every time something hairy came up in one of them, it routed to him, informally, because he was fastest. The teams that nominally owned the systems were shipping features on top of them, increasingly with agent assistance, and the agents are genuinely good at that. So velocity on those systems looked healthy. Reviews happened. Tickets closed. Nothing in any metric we tracked distinguished "the team understands this system" from "the team can drive an agent across the surface of this system." Those look identical on a dashboard until the day they don't.

When did the difference become visible?

I went and tested it, quietly, before we finalised the plan. I asked two engineers on the owning team to walk me through what the reconciliation pipeline does when a specific upstream feed arrives late. Not a trick question, an operational question, the kind an incident would ask. They could both navigate the code. They could both get an agent to summarise it, and the summaries were articulate. But neither could tell me what would actually happen, in production, in sequence, or where they'd look first. One of them said, and I appreciated the honesty, "I've merged maybe thirty PRs into this service and I'm realising I've never once had to hold the whole thing in my head."

That phrase could be the title of the decade.

It's not a confession of laziness, that's the thing. He was doing his job exactly as the tooling and the incentives defined it. The work got done, the diffs were reviewed and tested. The understanding just never needed to enter anyone's head, so it didn't. We'd automated the apprenticeship out of the loop and kept the productivity, and nobody noticed because productivity was the only thing we were measuring.

What did you do with the restructure itself?

We protected him, obviously, which is an absurd sentence. A staffing decision about a 130-person org should not hinge on protecting one head because it contains three systems. But that was genuinely the situation, so first we made sure the plan didn't touch him, or a couple of adjacent people whose names came up when we widened the same exercise. Then we did the slower thing: a forced de-concentration program on those three systems. Two engineers per system, rotating through incident-style walkthroughs with him, no agents in the room, until each of them could handle the late-feed scenario and two others like it on a whiteboard. It took about a quarter. It was the least efficient-looking work my org did all year and probably the most valuable.

You used aiworklab's bus-factor warnings later. What would have been different with that instrument earlier?

The honest answer is the spreadsheet exercise would have taken an afternoon instead of a near-miss. The dashboard tracks concepts, not services, which turns out to be the right resolution. It flags anything demonstrated by two or fewer engineers. When we first turned it on, the three systems from the restructure lit up exactly as expected, which was almost funny by then. What I didn't expect was the rest of the list. Seven warnings. Things like the exactly-once semantics in our event pipeline, one engineer. Our database's logical replication setup, two. None of those were on anyone's risk register because no service maps to them, they're load-bearing knowledge rather than load-bearing code. The org chart has no row for that.
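Editor's note: the flagging rule described in that answer (group by concept, flag anything demonstrated by two or fewer engineers) can be sketched in a few lines. This is an illustration under stated assumptions, not aiworklab's implementation. The input shape (pairs of engineer ID and concept), the function name `bus_factor_warnings`, and the sample data are all hypothetical. The function returns counts only, in keeping with the aggregated, anonymised view discussed in the next answer.

```python
from collections import defaultdict

def bus_factor_warnings(demonstrations, threshold=2):
    """Return {concept: engineer_count} for concepts at or below the threshold.

    `demonstrations` is a hypothetical input: an iterable of
    (engineer_id, concept) pairs, one per observed demonstration.
    """
    engineers_by_concept = defaultdict(set)
    for engineer_id, concept in demonstrations:
        # A set deduplicates repeat demonstrations by the same engineer.
        engineers_by_concept[concept].add(engineer_id)

    # Report counts only, never the engineer IDs themselves.
    return {
        concept: len(engineers)
        for concept, engineers in sorted(engineers_by_concept.items())
        if len(engineers) <= threshold
    }

# Made-up sample data for illustration.
demonstrations = [
    ("eng-a", "exactly-once semantics"),
    ("eng-a", "exactly-once semantics"),  # repeat, counted once
    ("eng-b", "logical replication"),
    ("eng-c", "logical replication"),
    ("eng-a", "http routing"),
    ("eng-b", "http routing"),
    ("eng-c", "http routing"),
]

warnings = bus_factor_warnings(demonstrations)
print(warnings)
```

With this sample data, the two thinly covered concepts are flagged and "http routing" is not. One limitation of the sketch: a concept that nobody has demonstrated never appears in the input, so it would need a separate list of expected concepts to be flagged at all.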

Some people hear "skill telemetry" and think surveillance. Did your engineers?

It came up, and it should come up. Two things made it land okay. It's aggregated and anonymised, so I see "two engineers have demonstrated logical replication," I don't see a leaderboard of who's smart. And the first visible consequence was the opposite of punitive: people got protected time to learn the flagged things. If the first consequence had been someone's performance review, it would have poisoned the whole thing, deservedly. The instrument is neutral; what you do with it the first quarter decides what it means in your culture.

What would you say to an EM reading this who's sure their coverage is fine?

I'd say I was sure too, and I'd ask them to run one test, this week, no tooling required. Pick your scariest system. Ask two people who aren't its informal expert to walk you through a specific failure scenario, no agents open. Not "what does this service do," but "this input arrives wrong, what happens, in order." Then watch your own face during the answers. If you feel that little drop in your stomach, you have your answer, and you got it for free instead of during a restructure. The agents aren't the villain here. The absence of measurement is. We measured everything about our code and nothing about what our people could hold in their heads, and we very nearly paid for the difference all at once.

This interview is published with the interviewee's permission. Details that could identify the company or individuals have been removed or generalised; the interviewee reviewed the final text. If you have a story about skill concentration or bus-factor risk you're willing to share, on or off the record, write to team@aiworklab.com.