The message tends to be brief and snappy: "Something is wrong with the site, can someone have a look?"
It's usually followed by a list of specific pings to people: "@engineer-1, @cto, @engineer-2".
Then someone else responds: "indeed, something is wrong with the invoicing page, I have a customer telling me they can't access it".
And quite often someone else adds something like: "has someone seen the site is down yet?".
At some point, one of the engineers (Bob) spots the messages and starts working on the problem, but says nothing. Another (Ann) spots them and responds: "I am on a coffee run, let me run back to the office to check. eta 5 minutes."
Then another engineer (Camille) might chime in: "yes I have seen this, it might be related to my last deploy."
Then, somehow, one of those people might suddenly say "ah, site is back, thanks team!".
Success! Round of applause, coffee flows.
And then you get someone like me reading this and thinking "well, that was lucky".
Let's rewind
From a certain perspective, the above is a pretty standard "startup mode" incident fix. Yet, to someone who has led dozens of incident responses, it looks like a near miss: a close call with catastrophe, or the seed of another incident later on.
Let's review how things went.
The way the issue is reported is acceptable. We would prefer alarms and automated notifications, but if someone reports the problem in one channel and pings the relevant people, that's already a workable start.
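To make "alarms, automated notifications" concrete, here is a minimal sketch of the kind of check that could have raised this incident automatically instead of waiting for a customer. Everything in it is a placeholder: the health endpoint, the chat webhook, and the threshold are hypothetical, and in practice this job belongs in your monitoring system rather than a hand-rolled script.

```python
# Minimal health-check alert sketch: poll an endpoint, post to a chat
# webhook after repeated failures. All URLs and thresholds below are
# hypothetical placeholders, not a real service's API.
import time
import urllib.request

CHECK_URL = "https://example.com/health"            # hypothetical endpoint
WEBHOOK_URL = "https://chat.example.com/hooks/ops"  # hypothetical webhook
FAILURES_BEFORE_ALERT = 3                           # avoid alerting on one blip

def site_is_up() -> bool:
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def post_alert(message: str) -> None:
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
    )
    urllib.request.urlopen(req, timeout=5)

def main() -> None:
    failures = 0
    while True:
        if site_is_up():
            failures = 0
        else:
            failures += 1
            if failures == FAILURES_BEFORE_ALERT:
                post_alert(f"ALERT: {CHECK_URL} failed {failures} checks in a row.")
        time.sleep(60)  # check once a minute

if __name__ == "__main__":
    main()
```

Automation aside, the report did arrive; the trouble starts with how it was picked up.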
Here we had three people sending messages (hopefully in the same channel) without designating any one person as the point of entry. Without that, the report can easily be missed by the very person who is supposed to act.
And then, three engineers seem to take part in the resolution without anyone officially taking the helm. They all appear to work separately.
Finally, the incident is declared solved, but no details follow and nobody says what was done.
These aren't process violations. They're signals. The team got away with it this time because the incident was simple and short. Double the complexity, add a second concurrent failure, put the one engineer who knew the deploy context on a flight — and this exact pattern produces hours of downtime.
I have heard teams say: "ok, but we never have bigger stuff happening, we don't need to worry about that." You know the saying: there are two kinds of Unix users, the ones who run everything as root and the ones who don't anymore. It's the same with incidents and teams: the ones who haven't had a large incident yet, and the ones who are about to.
Prepare for what you can
And these are not exotic failures. They are the default. Most teams that haven't deliberately built incident response capability will produce exactly this pattern — and most of the time, it works. The incidents are small enough, the team tight enough, the luck sufficient.
The problem is that this creates false confidence. The team celebrates the fix. Nobody examines the process. And the next incident inherits all the same structural gaps, with no improvement.
What the scene above is missing is not effort or competence — both were present. What's missing is structure. Three specific things:
A declared owner. One person holds the incident. Not the most senior engineer available, not the one who spotted it first — whoever is explicitly named as commander for this incident. Everyone else reports to them until the incident is closed.
A single working thread. All information, all actions, all updates flow through one place. No side conversations, no parallel Slack threads, no "I'll just quickly check something" without logging it. The incident channel is the source of truth while the incident is open.
A declared close. The incident is not over when the site comes back. It's over when the commander says it's over, states what happened, and names what comes next — even if "what comes next" is just a postmortem scheduled for tomorrow morning.
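To make the structure concrete, here is a minimal sketch (in Python, with entirely hypothetical names and fields; this is not any real tool's API) of the lifecycle those three rules imply: one explicit commander, one append-only log of everything that happens, and a close that refuses to go through without a summary and a named next step.

```python
# Sketch of the three rules as a data structure: a declared owner,
# a single append-only log, and a close that demands a summary.
# Names and fields are illustrative, not a real incident tool's API.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Incident:
    title: str
    commander: str                              # rule 1: one declared owner
    log: list = field(default_factory=list)     # rule 2: a single thread
    closed: bool = False

    def update(self, author: str, message: str) -> None:
        if self.closed:
            raise RuntimeError("incident is closed; open a new one")
        self.log.append(f"{datetime.now():%H:%M} {author}: {message}")

    def close(self, summary: str, next_step: str) -> None:
        # Rule 3: no close without a summary and a named next step.
        if not summary or not next_step:
            raise ValueError("cannot close without summary and next step")
        self.update(self.commander, f"CLOSED. {summary} Next: {next_step}")
        self.closed = True

# Replaying the opening scene through this structure:
incident = Incident(title="Invoicing page down", commander="Ann")
incident.update("Camille", "might be my last deploy; rolling back")
incident.update("Bob", "invoicing page responding, error rate back to zero")
incident.close(summary="Deploy regression on invoicing route, rolled back.",
               next_step="Postmortem tomorrow 10 AM.")
print("\n".join(incident.log))
```

None of this requires software, of course: a pinned message naming the commander, a single channel, and a closing summary enforce the same invariants by convention.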
In the context of Incident Management, or Incident Response, there is a core concept: "Preparing for what you can so you can handle what you can't". It permeates everything you do as an engineer once you get familiar with it.
The above three points are the core of any Incident Response. You can add more on top, but that's the essential structure to know and follow.
Let's replay the scene
The message comes in: "Something is wrong with the site, can someone have a look?"
This time, Ann — who is on the on-call rotation this week — sees it and responds immediately: "I have it. I'm the incident commander. Bob and Camille, I need you on the call. Everyone else, please hold and follow updates here."
Bob stops what he's doing and joins. Camille flags straight away: "this might be related to my last deploy — I can roll it back in two minutes if needed." Ann makes the call: "do it, log it here when done."
Two minutes later, Camille posts: "rolled back, watching metrics." Thirty seconds after that, Bob confirms: "invoicing page is responding, error rate back to zero."
Ann closes it: "Incident closed at 14:23. Root cause: deploy at 14:11 introduced a regression on the invoicing route. Rolled back. Postmortem scheduled for tomorrow 10 AM. Thanks Bob and Camille."
Same team. Same problem. Four minutes start to finish, and everyone — including the CTO who was in a meeting — knows exactly what happened, who did what, and what comes next.
That's not a bigger team or a better team. That's a prepared one.
Build the foundation first
Once you have that settled and ingrained in the organisation's culture, you will be able to scale your Incident Response with the size of the team and, just as importantly, with the size of the incidents.
A declared owner, a single thread, a declared close. These three things cost nothing to implement. No tooling required, no process overhaul, no all-hands meeting. A decision, a shared understanding, and practice.
Start with the next small incident. Whoever is on call that week is the commander. Everything goes in one channel. Nobody closes it without a summary. Do that ten times and it becomes muscle memory. Do it a hundred times and your team handles the large incident — the one that would have been catastrophic — without panic, because the structure is already there.
That's what preparation buys you. Not the absence of incidents. The ability to move through them cleanly.
This is part one of a series on incident management for growing engineering teams. Part two — The Role Nobody Assigned looks at the communication role around the response. Part three — Your Incident Isn't Over When the Site Comes Back focuses on what happens after the incident is over. Part four — The Lookout introduces the engineer whose job is to watch.
Thomas Riboulet is a Fractional VP of Engineering working with European tech companies. He writes about engineering leadership, team structure, and sustainable delivery at insights.wa-systems.eu.