Your Incident Isn't Over When the Site Comes Back

Most teams stop at resolution. The ones that don't are the ones that get better.

Thursday 10 AM. The day after Black Friday eve.

The incident is resolved. The site held through the night. Revenue numbers came in — damaged, but not catastrophic. The team is tired and quietly relieved.

Ann has called the AAR. Eight people in the room: Ann, Bob, Camille, John from Sales, Louis the COO, and three other engineers who were tangentially involved. Someone has written "Black Friday Incident — Retrospective" on the whiteboard.

Louis opens before Ann can. "So, whose idea was the product loader?"

The room goes quiet. Camille looks at her laptop.

"I merged it," she says. "I thought it would help with load times."

Louis nods slowly. "Right. Well. We need to make sure that doesn't happen again."

Ann walks through the timeline. Someone asks why there was no load testing. Someone else says "we've been meaning to set that up." Three action items get written on the whiteboard. Then four more. Then another four. By the end there are eleven items, vaguely worded, with no owners and no dates.

Louis has a call at 10:45 so he leaves early. John follows him out. The remaining engineers drift back to their desks by 11:00.

Six weeks later, none of the eleven items have moved. The load testing environment still doesn't exist. And when the next incident happens, it finds the same gaps.


What just went wrong

The AAR happened. That's already more than most teams do. But it produced nothing — and in some ways left things worse than before.

Louis's opening question — "whose idea was the product loader?" — set the frame for everything that followed. The meeting became about Camille, not about the system. Camille will be more cautious about merging things in future, which is not the same as the system being safer. The other engineers in the room learned to keep their heads down.

Eleven action items with no owners is the same as zero action items. It looks like accountability. It produces none.

And the COO leaving at 10:45 signals, loudly and without words, that this meeting matters less than his next call. The team noticed.

Three things failed: the framing was wrong, the output was wrong, and the wrong people were in the room for the wrong reasons.


Why AAR and not postmortem

The term postmortem is borrowed from medicine — literally "after death." Even when practitioners know it's meant to be blameless, the word does work on the room before anyone speaks. You are here to examine a corpse. Someone, implicitly, killed it.

After Action Review comes from a different tradition — military and emergency services, the same disciplines that gave us the Incident Command System. The framing is different from the first word: an action happened, you are reviewing it, more actions will follow. The organisation is alive. The learning is forward-facing.

The book Incident Management for Operations by Rob Schnepp, Ron Vidal and Chris Hawley — which informs much of this series — uses AAR throughout, and for exactly this reason. If your team flinches at postmortem, switch the term. The practice is what matters, but language shapes culture, and culture shapes whether people speak honestly in the room.

Blameless is the foundational principle. It does not mean consequence-free — it means the system is on trial, not the person. Camille merged the product loader because the system had no load testing gate, no performance review checklist, no automated threshold that would have caught 30 parallel requests per session before it reached production. Fix the system and the next engineer who makes the same decision gets stopped before it matters.
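
To make "automated threshold" concrete, here is a minimal sketch of what such a gate could look like as a pipeline step. It is illustrative only: the function name, the default budget of 10, and the assumption that a load test reports a peak count of parallel requests per session are all hypothetical, not details from the incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    message: str


def check_parallel_request_budget(peak_parallel_requests: int, budget: int = 10) -> GateResult:
    """Compare a measured per-session peak against a budget.

    Both the metric and the default budget are hypothetical. A real gate
    would read them from the team's own load test output and capacity data.
    """
    if peak_parallel_requests < 0 or budget < 0:
        raise ValueError("request counts must be non-negative")
    if peak_parallel_requests > budget:
        return GateResult(
            passed=False,
            message=(
                f"Blocked: {peak_parallel_requests} parallel requests per session "
                f"exceeds the budget of {budget}. Run a load test before merging."
            ),
        )
    return GateResult(passed=True, message="Within the parallel request budget.")


# A change with the product loader's profile (30 per session) is stopped here.
result = check_parallel_request_budget(30)
print(result.message)
```

The point of the sketch is the message, not the number: the engineer is told what the system expects before the change ships, instead of being asked about it afterwards.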


The format

A well-run AAR has a clear structure. It is not a retro, not a debrief, not a blame session with better lighting. It is a specific practice with a specific shape.

Who is in the room

Only people directly involved in the incident response. Louis should not be there. John should not be there. Stakeholders get the written output — they do not participate in the reconstruction. Their presence changes what people say, and not for the better. The AAR is a technical and organisational learning exercise, not a performance for leadership.

How long

Sixty minutes by default. Ninety at most, if the incident was complex. More than that and the meeting is doing something other than learning.

The opening frame — say it out loud

The facilitator — usually the incident commander, though it can be someone else — opens with an explicit statement of purpose. Something like: "The goal of this meeting is not to find out who did something wrong. It is to find out what the system allowed to happen, and what we change so it can't happen the same way again. Everything said here stays in this room except the written output."

Say it every time. It takes thirty seconds and it changes the room.

The reconstruction

Walk the timeline chronologically. Not from memory — from the incident log the scribe kept. This is why the scribe matters: the reconstruction is only as good as the record. Each step: what happened, what decision was made, what information was available at the time.

No hindsight judgments. "Why didn't you check the connection pool first?" is not a reconstruction question — it's a blame question wearing a reconstruction costume. The standard is: given what was known at that moment, was the decision reasonable? Almost always, it was.
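
If the scribe log is kept in a structured form, the reconstruction can be read straight off it. The sketch below assumes a hypothetical entry shape with the three fields named above (what happened, what was decided, what was known at the time). The sample entries and timestamps are placeholders, not records from the incident.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class LogEntry:
    at: datetime
    what_happened: str
    decision: str
    known_at_the_time: str


def reconstruct(entries: List[LogEntry]) -> List[str]:
    """Return the log as chronological lines, one per step, with no commentary."""
    ordered = sorted(entries, key=lambda entry: entry.at)
    return [
        f"{e.at:%H:%M} | {e.what_happened} | decision: {e.decision} | known: {e.known_at_the_time}"
        for e in ordered
    ]


# Placeholder entries, deliberately out of order to show the sort.
log = [
    LogEntry(datetime(2030, 1, 1, 21, 40), "second alert", "roll back", "error rate still rising"),
    LogEntry(datetime(2030, 1, 1, 21, 5), "first alert", "investigate", "only one dashboard was red"),
]
for line in reconstruct(log):
    print(line)
```

Keeping "known at the time" as its own field is the design choice that matters: it puts the standard for judging each decision on the page next to the decision.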

The five whys

Once the timeline is clear, go one level deeper on each significant decision point. Not "why did Camille merge the product loader" — that's a person question. "Why did the system allow a change with 30 parallel connections per user to reach production without load testing?" That's a system question. It produces a system answer.

Ask why five times, or until you hit something structural. The proximate cause is rarely the interesting one. The interesting one is three or four levels down — the missing gate, the absent process, the assumption nobody questioned.

The output — two action items, not eleven

This is where most AARs fail even when everything else goes right. The meeting produces a long list, the list goes into a doc, the doc goes nowhere.

Two action items. Named owners. Dates. That's it.

Not because there are only two things wrong — there may be eleven. But two will get done. Eleven won't. Choose the two that close the most significant gaps, assign them to people with the authority and time to act, and put them in the next sprint. The rest go on a watchlist, reviewed at the next AAR.

The action items are not personal. "Camille to review her deployment process" is not an action item. "Ann to set up a load testing environment and integrate it into the deployment pipeline by [date]" is.
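
The rule is simple enough to check mechanically. As a rough sketch, assuming a hypothetical `ActionItem` record and a limit of two taken from the rule above, a team could refuse to close the AAR until the output passes something like this. The due dates are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

MAX_COMMITTED_ITEMS = 2  # the limit argued for above; everything else goes on the watchlist


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def validate_aar_output(items: List[ActionItem]) -> List[str]:
    """Return a list of problems; an empty list means the output is usable."""
    problems = []
    if len(items) > MAX_COMMITTED_ITEMS:
        problems.append(
            f"{len(items)} items committed; keep {MAX_COMMITTED_ITEMS} and move the rest to the watchlist"
        )
    for item in items:
        if not item.owner:
            problems.append(f"no owner: {item.description}")
        if item.due is None:
            problems.append(f"no date: {item.description}")
    return problems


# Two items, each with a named owner and a (placeholder) date: nothing to report.
committed = [
    ActionItem("Set up load testing and integrate it into the pipeline", "Ann", date(2030, 1, 15)),
    ActionItem("Draft a performance review checklist", "Bob", date(2030, 1, 15)),
]
print(validate_aar_output(committed))
```

Run against the whiteboard from the first meeting, eleven items with no owners and no dates, the same check returns a long list of problems, which is the honest description of that output.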


Reading list

Two references worth having on the shelf:

Incident Management for Operations — Schnepp, Vidal, Hawley. The practical foundation. ICS applied to software operations, AAR framing, field-tested structure. Start here.

Site Reliability Engineering — Google. Chapter 15 on postmortem culture is the canonical software engineering treatment of blameless review. Dense but worth it. Available free at sre.google.


Let's replay the scene

Thursday 10 AM. Ann, Bob, Camille, and the two engineers who were on the thread. Five people. Louis and John have received the written incident summary — they don't need to be here.

Ann opens: "The goal today is not to find out who did something wrong. It's to find out what the system let through. Everything stays in this room except the two things we're going to commit to fixing."

They walk the timeline from the scribe log. Clean, chronological, no hindsight. At the product loader decision, Ann asks: "Why did a change with this connection profile reach production without load testing?"

Five whys later: there is no load testing gate in the pipeline, and there is no performance review checklist for infrastructure-adjacent changes. Two gaps. Two action items.

Ann to build a load testing baseline and integrate it into the pipeline. Bob to draft a performance review checklist for the engineering handbook. Both in the next sprint. Both on the board by end of day.

The meeting closes at 10:48. Camille leaves knowing the system failed, not her. The two action items are on the board before lunch.

Six weeks later, both are done. The next engineer who writes a feature with a similar connection profile gets a pipeline failure and a clear message before it reaches production.

The system learned.


Closing the loop

Three pieces. Three phases.

Respond cleanly — declared owner, single thread, declared close. Communicate clearly — one person holds the wall around the commander's attention, state and aim and next update on a clock. Learn structurally — blameless reconstruction, five whys, two action items that actually move.

These are not three separate practices. They are a loop. The AAR feeds back into the response: the load testing gate means the next incident is less likely and less severe. The scribe log means the next AAR is more accurate. The communication rhythm means stakeholders trust the process enough to stay out of the war room.

The organisation that runs this loop consistently doesn't eliminate incidents. It builds the structure that means incidents don't escalate into catastrophes — and that every incident, however painful, makes the next one smaller.

That's what preparation buys you. Not the absence of failure. The ability to move through it, and come out the other side with something you didn't have before.

Structure doesn't slow you down in a crisis. It's the only thing that doesn't.


Part one, "Your First Incident Will Tell You Everything", introduced the team and the failure mode. Part two, "The Role Nobody Assigned", looked at the structure of the response itself. Part four, "The Lookout", introduces the engineer whose job is to watch.


Thomas Riboulet is a Fractional VP of Engineering working with European tech companies. He writes about engineering leadership, team structure, and sustainable delivery at insights.wa-systems.eu.