Insights - WA Systems


It's a Thursday evening, just before seven. Bob is the only senior engineer still online when the page goes off. The error rate has tripled in five minutes; checkout is failing for about a third of customers; the team has six engineers and four of them have already gone home.

Dimitri pages Bob from his Lookout post. Camille joins the channel from her kitchen, laptop balanced on the counter. Ann pings in a few minutes later. Bob opens the dashboards, declares the incident, posts the channel link, and starts asking questions.

Within ten minutes Bob has done four things at once. He's read the deploy log, narrowed the regression to a config change in the payment service, asked Camille to check the staging environment, and posted a customer-facing status update. He's also, somewhere in the middle of all that, started writing the rollback patch himself — because he's the only person on the channel who knows that part of the codebase well enough to do it safely under time pressure.

He is the Incident Commander. He is also the person fixing the thing.

The textbook says he shouldn't be both.

What the Incident Commander does

The Incident Commander is the role that holds the response together. They are the person who declares that an incident is happening, who decides when it's over, and who makes the calls in between. They hold the timeline. They hold the decisions. They know what's been tried, what's working, what isn't, and who is doing what right now. When a customer-facing comms post needs to go out, they decide what it says or who writes it. When the team has to choose between a fast rollback and a slower fix, they make the call.

Most importantly, they hold the coherence of the response. In any incident with more than one person working on it, there's a constant risk of the response fragmenting — two engineers debugging the same thing in parallel, a third engineer making a change that contradicts what the first two are trying, a comms message going out that doesn't match the technical state of the system. The IC's job is to keep the response from fragmenting. They are the single point where the picture stays whole.

The textbook version of this role — drawn from emergency response, from the Google SRE book, from large-scale operational practice — is explicit about one thing: the IC does not touch keys. They coordinate, they decide, they hold the picture. They do not debug, they do not write code, they do not run commands on production. The reasoning is sound. An IC who is debugging is not coordinating. An IC who is writing the fix is not holding the picture. The moment the IC's attention narrows to a stack trace, the response fragments, and the team is back to the situation that produced the incident in the first place.

This is correct. It is also impossible on most teams.

The small-team reality

Below roughly fifteen engineers, very few teams can afford a hands-off Incident Commander.

The math is straightforward. You need at least three people on an incident to justify a coordinator: someone fixing, someone investigating, someone communicating. With a hands-off IC making it four, you've consumed half the engineering capacity of an eight-person team for the duration of the incident. If the incident happens after hours, or on a weekend, or while half the team is in another timezone, those numbers don't work. The choices are: the IC has hands in the work, or there is no IC at all, or the response waits.
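To make that arithmetic concrete, here is a minimal sketch. The function name and its parameters are illustrative assumptions, not part of any incident tooling; it only computes the share of a team tied up by one incident when three people are working it and a hands-off IC may be added on top.

```python
def incident_capacity_share(team_size, working_roles=3, hands_off_ic=True):
    """Fraction of the team tied up by a single incident.

    working_roles: people fixing, investigating, and communicating
                   (three, per the reasoning above).
    hands_off_ic:  whether a coordinator who does no hands-on work is added.
    Hypothetical helper for illustration only.
    """
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    responders = working_roles + (1 if hands_off_ic else 0)
    # A team cannot commit more people than it has.
    return min(responders, team_size) / team_size


# Compare a few team sizes, with and without a hands-off IC.
for size in (6, 8, 15, 25):
    with_ic = incident_capacity_share(size)
    without_ic = incident_capacity_share(size, hands_off_ic=False)
    print(f"{size:>2} engineers: {with_ic:.0%} with a hands-off IC, "
          f"{without_ic:.0%} without")
```

For an eight-person team the first figure is one half, which matches the number above. The share falls as the team grows, which is the whole argument in one line of division.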

The third option is the worst. A response that waits while everyone gathers is a response that lets the customer impact compound. The second option — no IC at all — is the failure mode the whole incident response framework exists to prevent. So most small teams default to the first: the senior engineer who would be the IC also has hands in the work, because there isn't anyone else to do the work and the work cannot wait.

This is not a failure to follow the rules. It is the rules adapting to the resource constraint. The team has chosen, correctly, that a hands-on IC is better than no IC. The mistake — and it is a common one — is treating this adaptation as shameful or temporary, something the team will fix when they grow. They won't fix it. They will simply scale into a different version of the same role. What needs to happen instead is to acknowledge the adaptation explicitly and develop the techniques that make it work.

What a hands-on commander actually does

The hands-on IC does not pretend to be a hands-off IC who happens to also be coding. They are a different role, and the role has its own techniques.

The first is the explicit pause. When a hands-on IC drops into the fix, they don't try to keep one eye on the channel, because that always fails. They name a window, hand the IC role to someone else for that window, and come back when it's done. "I'm spending the next ten minutes on this rollback patch; Camille is acting IC until I'm back." The pause is timeboxed and handed to a named person. The technique is the explicit handoff, not the multitasking.

The second is narration. Hands-off ICs hold the team's shared picture by asking questions and consolidating answers. Hands-on ICs hold it by narrating their own work in the channel as they do it. "Rolling back payment service config; ETA two minutes; Ann, please prep a status update for after rollback completes." Without the narration, the team loses sight of what the IC is doing, and parallel work starts to drift out of sync.

The third is ruthless delegation. The hands-on IC reserves their hands for the one thing the team most needs them to do — usually the fix that depends on context only they have — and lets everything else go. They don't write the customer comms; they tell Ann what the message should say. They don't pull the logs; Camille pulls them and reports back. Trying to do everything is what produces the IC who is debugging while customer-facing comms go silent for forty minutes.

All three of these techniques rest on the same foundation: the rest of the team can do their part without close coordination. They know how to pull logs, draft a comms update, check staging, escalate to a vendor. That competence is built long before the incident, in the runbooks and the team's working agreements and the previous incidents they've run together. A hands-on IC on a team without that foundation is just a senior engineer doing four jobs badly. With it, they are an IC who happens to also be writing code.

The transition

There is a point — usually somewhere between fifteen and twenty-five engineers, depending on the system and the rate of incidents — where the math flips. The team is large enough that a hands-off IC is affordable, the systems are complex enough that a hands-off IC is necessary, and the techniques that worked at small scale start to break.

The signal that you've crossed the threshold is usually retrospective. You'll notice that the senior engineers who used to run incidents as hands-on commanders have started to drop the hands-on part instinctively — they let someone else write the fix, they stay on coordination, they recover the picture faster. Or you'll notice the opposite: incidents are taking longer, the IC is consistently overwhelmed, the comms are slipping, and the team is staying up later than they should. Either way, the threshold is felt before it's analysed. The team is telling you it's time to change the mode.

The change itself is mostly about explicit role assignment. The IC is named at the top of the incident ("Camille is IC"), and that is now their job for the duration. They don't pick up keys. If they need to, they hand off the IC role to someone else first. The senior engineers who used to be hands-on commanders now have a choice on each incident: run it as IC, or be a fixer, but not both. The choice is made early, named in the channel, and held.
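The rule is simple enough to write down as a small sketch. The class and method names below are hypothetical, invented for illustration rather than taken from any incident tool; the point is only that the IC cannot pick up keys without handing the role off first, and that every change is named in the log the way it would be named in the channel.

```python
class IncidentRoles:
    """Tracks who is IC and who has hands on the fix (illustrative only)."""

    def __init__(self, commander):
        self.commander = commander
        self.fixers = set()
        self.log = [f"{commander} is IC"]

    def take_keys(self, person):
        """Mark a person as hands-on. The current IC is refused."""
        if person == self.commander:
            raise PermissionError(
                f"{person} is IC; hand off the IC role before picking up keys"
            )
        self.fixers.add(person)
        self.log.append(f"{person} is fixing")

    def hand_off(self, new_commander):
        """Pass the IC role on; the new IC stops being a fixer."""
        previous = self.commander
        self.fixers.discard(new_commander)
        self.commander = new_commander
        self.log.append(f"{previous} hands IC to {new_commander}")


roles = IncidentRoles("Camille")
roles.take_keys("Bob")        # Bob writes the fix
roles.hand_off("Ann")         # Camille needs to pick up keys, so Ann takes IC
roles.take_keys("Camille")    # allowed now that Camille is no longer IC
print("\n".join(roles.log))
```

Nothing here needs software in practice. A line in the channel does the same job, as long as the handoff is stated before the keys are picked up.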

This isn't a more correct version of incident response than the small-team mode. It's the version that fits a larger team. The hands-on commander wasn't doing it wrong; they were doing the right thing for the resources available. The hands-off commander isn't doing it better; they are doing the right thing for the resources now available. The role hasn't changed. The way it has to be held has.

The textbook version of incident response describes a target state, not a default. The target state assumes a team with enough people, enough specialisation, and enough practice that each role can be held by a separate person. Almost no growth-stage team has that. What they have is a few people, a system that is becoming complex faster than the team is, and a senior engineer who will, when the bell rings, do four things at once because there is no one else.

Naming that pattern, and giving the techniques to do it well, is most of the work. The rest is recognising the moment when the team has grown into the textbook version, and letting the role transform without pretending it was wrong all along.

Part one, "Your First Incident Will Tell You Everything", introduced the team and the failure mode. Part two, "The Role Nobody Assigned", looked at the structure of the response itself. Part three, "Your Incident Isn't Over When the Site Comes Back", focused on what happens after the incident is over. Part four, "The Lookout", introduced the Lookout role. Part five, "Running the Watch", showed how to run the rotation.

Thomas Riboulet is a Fractional VP of Engineering working with European tech companies. He writes about engineering leadership, team structure, and sustainable delivery at insights.wa-systems.eu.