Insights - WA Systems

[Illustration: a chaotic office during an incident, people rushing and shouting]

The war room is working. Ann declared command before the second alert fired. Bob and Camille are on the thread. The incident channel is clean.

Three months of practice shows.

It's Black Friday eve. 16:42. The product listing pages are returning errors for roughly 60% of users. An engineer — trying to improve loading times — merged a feature on Tuesday that loads 500 products per request and fires 30 parallel requests per user session. At normal traffic it's invisible. At Black Friday eve traffic it's a self-inflicted denial of service. The API backend is on its knees.

Camille, remote, has already spotted it. She posts in the channel: "I think it's the product loader — merged Tuesday. I can revert or we can kill it with a feature flag, your call Ann."

Ann is about to respond when her phone rings. Louis, the COO.

She lets it ring. Second ring. Bob raises his head from his laptop: "Louis is telling me to tell you to pick up."

Ann picks up. Louis is calm but clipped: "What's happening, what's the ETA, I have the CEO asking me."

Ann is now holding the phone, losing the thread of what Camille just posted, trying to remember where the feature flag configuration lives, and answering Louis simultaneously. She gives a rough answer — "we think we know what it is, working on it" — and hangs up after two minutes.

She turns back to her screen. Camille has posted three more messages. Bob has responded to one of them, tentatively. The thread is still coherent, but Ann has lost two minutes of context and given Louis an answer that nobody has logged.

Then the meeting room door opens.

John, Head of Sales, walks in. His opening word is not printable. His second sentence is: "I have three enterprise clients on the phone, what do I tell them?"

Ann looks at John. Looks at her screen. Looks at Bob.

Bob looks at the ceiling.

What just broke

The foundations held. Ann declared ownership immediately. The thread is clean. Camille is working the problem with precision.

What broke is everything around the edges of the technical response.

Louis got two minutes of Ann's attention — which is two minutes she wasn't managing the incident. The answer she gave him was vague because she was context-switching, not because she didn't know. John walked into the war room because nobody had told him where to go for updates. Bob relayed a message instead of doing anything with it because nobody had defined his role outside of technical work.

The incident is being handled. The communication around it is chaos.

And here is the thing about communication chaos during an incident: it compounds. Louis will call back in ten minutes because the update he got was insufficient. John is still standing in the doorway. The CEO is asking Louis, who is now asking Ann again, who is supposed to be managing a technical response.

Every person pulling at Ann is pulling her out of the war room, even when she stays physically inside it.

The role nobody assigned

There is a straightforward fix. It requires one decision before the next incident, not during it.

Someone on the team holds the communication role. Not permanently — per incident, like the commander role. Their job has nothing to do with fixing the problem. Their job is everything outside the war room.

They post three things on a regular cadence, every ten to fifteen minutes while the incident is open:

Current state. What is happening right now, in one sentence. Not technical detail — impact. "The product listing pages are returning errors for approximately 60% of users."

Current aim. Where the response team is in the arc. Estimating, mitigating, resolving, monitoring, resolved. One word is enough. "We are mitigating."

ETA for next update. Not an ETA for resolution — those are dangerous to give under uncertainty. An ETA for the next communication. "Next update in 10 minutes." That one commitment stops Louis from calling back.

That's the entire role in its minimal form. State, aim, next update. Posted to a stakeholder channel — not the incident channel — on a clock.
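As a minimal sketch, the state, aim, next-update message can be modelled as a small data structure. Everything named here (`Aim`, `StakeholderUpdate`, `render`) is hypothetical and chosen for illustration; the article does not prescribe a tool or a message format, and posting to an actual chat channel is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Aim(Enum):
    """Phases of the response arc, using the words from the article."""
    ESTIMATING = "estimating"
    MITIGATING = "mitigating"
    RESOLVING = "resolving"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


@dataclass(frozen=True)
class StakeholderUpdate:
    """One update for the stakeholder channel (hypothetical structure)."""
    state: str                 # impact in one sentence, no technical detail
    aim: Aim                   # where the response team is in the arc
    posted_at: datetime        # when this update goes out
    next_update_in: timedelta  # commitment for the next update, not for resolution

    @property
    def next_update_at(self) -> datetime:
        return self.posted_at + self.next_update_in

    def render(self) -> str:
        minutes = int(self.next_update_in.total_seconds() // 60)
        return (
            f"[{self.posted_at:%H:%M}] {self.state} "
            f"We are {self.aim.value}. "
            f"Next update in {minutes} minutes (by {self.next_update_at:%H:%M})."
        )


# Only the time of day matters here, so the date part is left at strptime's default.
update = StakeholderUpdate(
    state="The product listing pages are returning errors for approximately 60% of users.",
    aim=Aim.MITIGATING,
    posted_at=datetime.strptime("16:46", "%H:%M"),
    next_update_in=timedelta(minutes=10),
)
print(update.render())
```

Keeping the next-update deadline in the structure, rather than only in the text, makes the commitment something the person holding the role can check against the clock.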

John gets a message before he reaches the meeting room door. Louis gets an update before he picks up the phone. The CEO gets something to read. Ann stays in the thread.

Let's replay the scene

16:42 — alert fires. Ann declares command. Bob and Camille join. Bob takes communication.

16:46 — Bob posts to the stakeholder channel: "We have an incident affecting product listing pages. Team is investigating. Next update in 10 minutes."

16:53 — Camille identifies the product loader. Ann makes the call: feature flag off. Bob posts: "Root cause identified. Mitigation in progress. Next update in 5 minutes."

16:54 — Louis calls. Bob picks up. "I've just posted an update in the stakeholder channel, I'll send you the link. Ann is managing the response, next update in four minutes." Louis hangs up satisfied.

16:57 — feature flag off, metrics recovering. Ann closes: "Incident closed 16:57. Root cause: product loader feature, 500 items × 30 parallel requests per session, merged Tuesday. Mitigated via feature flag. Fix and load testing required before re-enabling. Postmortem Thursday 10 AM."

Bob posts the close to the stakeholder channel. John reads it at his desk. Nobody walked into the war room.

Same team. Same pressure. One assigned role.

A note on the scribe

For small teams, the communication role and the scribe role often land on the same person. The scribe does one thing: log everything. Every action taken, every decision made, every hypothesis raised and discarded, timestamped. Not for the incident — for the postmortem.

Without a scribe, the postmortem reconstructs from memory, which is unreliable and uncomfortable. With a scribe, it reconstructs from a log, which is just data. We'll come back to this in Part 3.
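A scribe's log can be as simple as an append-only list of timestamped entries. The sketch below assumes three entry kinds (action, decision, hypothesis) taken from the description above; the class names, the `clock` helper, and the wording of the sample entries are hypothetical, loosely paraphrased from the replayed scene.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# Entry kinds assumed from the article: actions, decisions, hypotheses.
ALLOWED_KINDS = {"action", "decision", "hypothesis"}


@dataclass(frozen=True)
class LogEntry:
    at: datetime
    kind: str
    who: str
    note: str


@dataclass
class IncidentLog:
    """Append-only incident log kept by the scribe (hypothetical structure)."""
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, kind: str, who: str, note: str,
               at: Optional[datetime] = None) -> LogEntry:
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown entry kind: {kind!r}")
        # Default to the current time when no timestamp is supplied.
        entry = LogEntry(at if at is not None else datetime.now(), kind, who, note)
        self.entries.append(entry)
        return entry

    def timeline(self) -> str:
        # sorted() is stable, so entries with the same timestamp keep insertion order.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M} [{e.kind}] {e.who}: {e.note}" for e in ordered
        )


def clock(hhmm: str) -> datetime:
    """Helper for the example: parse a time of day such as '16:42'."""
    return datetime.strptime(hhmm, "%H:%M")


log = IncidentLog()
log.record("action", "Ann", "Declared command", clock("16:42"))
log.record("hypothesis", "Camille", "Product loader merged Tuesday is the cause", clock("16:53"))
log.record("decision", "Ann", "Turn the feature flag off", clock("16:53"))
log.record("action", "Ann", "Incident closed, metrics recovering", clock("16:57"))
print(log.timeline())
```

The postmortem can then start from `log.timeline()` instead of from what people remember.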

What the communication role protects

It protects Ann's attention. That's it. That's the whole job.

Attention during an incident is the scarcest resource you have. Every minute the commander spends outside the technical thread is a minute the incident runs longer. The communication role exists to build a wall around that attention — not to hide the incident from stakeholders, but to give them what they need through a channel that doesn't cost Ann anything.

Structure doesn't slow you down in a crisis. It's the only thing that doesn't.

Part one, "Your First Incident Will Tell You Everything", introduced the team and the failure mode. Part three, "Your Incident Isn't Over When the Site Comes Back", focuses on what happens after the incident is over. Part four, "The Lookout", introduces the engineer whose job is to watch.

Thomas Riboulet is a Fractional VP of Engineering working with European tech companies. He writes about engineering leadership, team structure, and sustainable delivery at insights.wa-systems.eu.