The Lookout

The job of paying attention to the system — the role most teams have but never name. By the time something appears on a dashboard, it has been happening for a while. Someone needs to see the smoke early.

[Illustration: a desk by a wide window looking out at a snow-capped mountain under an orange sky, a hand pointing toward a laptop]

It's Tuesday, just past two. Dimitri is three sips into a coffee that's already gone lukewarm, working through a ticket nobody else wants — a flaky test in the billing suite that fails about one run in twenty and never the same way twice. He's not going to fix it today. He's going to add enough logging that whoever picks it up next week has somewhere to start.

A second monitor sits to the left of his laptop. Four panels: error rates, latency, queue depth, deploy markers. Nothing dramatic. He glances at it the way you glance at a baby monitor — not really looking, just confirming the shape of normal.

Then the shape changes.

The error rate climbs. Not a spike — a climb. Slow enough that an alert won't fire for another four or five minutes, fast enough that by the time the alert fires, customers will already be writing in. Dimitri watches it for fifteen seconds. Long enough to confirm it isn't noise. Long enough to check that the deploy marker from twenty minutes ago lines up with the start of the climb.

He saves his draft, pushes the ticket back to the queue, and rings the bell.

By the time the rest of the team has joined the channel, Dimitri has already posted the deploy hash, the affected service, and a one-line summary. Bob takes the incident commander seat. Camille starts pulling logs. Ann pings back that she'll be there in five — she's mid-merge.

The rollback completes at 14:23. From the moment the shape changed to the moment the graph came back: eleven minutes.

Nobody calls Dimitri a hero. He's been on the team three months. This is just his week on the watch.


What the role is

There's a job on every healthy engineering team that almost nobody names: the job of paying attention. Not to your own work — everyone does that, more or less — but to the system. To the dashboards. To the shape of normal, and to the moment that shape begins to change.

We call it the Lookout.

The metaphor is borrowed from fire lookouts, and it has a long history — watchtowers above forests and frontier towns from medieval Europe to Edo-period Japan to the contemporary forest services of half a dozen countries. A person sits in a tower above a stretch of country they are responsible for. They scan. Most of what they see is haze, weather, the daily rhythm of light moving across hills. Their work is to know the country well enough that when a wisp of smoke appears on a ridge twelve miles out, they recognise it for what it is, fix its location, and pick up the radio. They don't fight the fire. They watch, they call it in, and they go back to watching.

By the time something appears on a dashboard, it has been happening for a while. The bad deploy went out twenty minutes ago. The memory leak started growing yesterday. The Lookout's job is to see the smoke early enough that the response can still be proportionate — a quiet rollback, a graceful degradation — rather than the kind of three-hour scramble that ends up in a postmortem.
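
To make the timing concrete, here is a minimal sketch of the difference between the two ways of seeing. Nothing in it comes from Dimitri's stack; the series, the threshold, and the window are all invented for illustration. The fixed threshold stands in for the alert; the slope fit is a rough stand-in for what a Lookout does by eye.

    # Invented numbers throughout; nothing here is from the incident above.
    # Error rate in percent, one sample per minute: flat, then a slow climb.
    errors = [1.0, 1.1, 0.9, 1.0, 1.1, 1.5, 2.0, 2.6, 3.1, 3.7, 4.3, 4.9, 5.5]

    def static_alert(series, threshold=5.0):
        # The classic alert: fire the first time the value crosses a line.
        for i, value in enumerate(series):
            if value >= threshold:
                return i
        return None

    def slope_alert(series, window=3, min_slope=0.3):
        # A rough stand-in for the Lookout's eye: fit a least-squares line
        # to the last few points and fire on a sustained upward slope.
        for i in range(window, len(series)):
            xs = range(window)
            ys = series[i - window:i]
            mx = sum(xs) / window
            my = sum(ys) / window
            slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
                / sum((x - mx) ** 2 for x in xs)
            if slope >= min_slope:
                return i
        return None

    print(static_alert(errors))  # 12: the page finally fires
    print(slope_alert(errors))   # 7: the climb is visible five minutes earlier

On this invented series the slope check flags the climb at minute 7, five minutes before the threshold crossing at minute 12. Five minutes is roughly the head start Dimitri has in the story above.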

The Lookout is more than the on-call engineer. On-call is reactive: the page fires, the engineer responds. The Lookout is prior to on-call — watching for the conditions that produce pages and ideally catching them before the alert fires. In practice the same engineer often holds both, and the next piece in this series goes into how the two fit together. For now, what matters is that the posture of the Lookout is different from the posture of on-call response: not waiting for the page, but watching the system in a way that makes the page less likely.

The Lookout is also not the platform engineer. Platform owns the substrate — the deploy pipeline, the observability stack. The Lookout reads what happens on top of it.

The work, and the permission to drop it

This is where most teams that try to institutionalise the role fail.

They give someone the watching responsibility, then give them a normal workload on top of it. Over the course of a week, the watching collapses into the work. The dashboards become a tab that gets checked when there's a moment, and there are fewer and fewer moments. By Friday the Lookout is just an engineer with a slightly worse week than usual.

What makes the fire lookout role work, across centuries and continents, is that watching is the work. Anything else the lookout does — maintaining the radio, keeping the logbook — is structured around the watching, not the other way around.

The engineering Lookout carries a workload too, but it has to be interruptible by design. The right kind of work is small, owned, and droppable: flaky test investigation, documentation cleanup, small infrastructure papercuts, pairing on a ticket that someone else owns. Stopping mid-sentence and coming back on Thursday costs nothing, to the work or to anyone else. The wrong kind of work is anything with a deadline, anything in a critical path, anything where the Lookout's absence would block a teammate. If the Lookout is on a feature team's sprint commitments, the role is already broken.

The permission to drop the work has to be explicit, and it has to come from the lead. Not "drop the work if it looks serious" — drop the work the moment something looks off. False alarms are not failures. A Lookout who never rings the bell is not doing the job; they are merely sitting near it.

Triage as the second kind of watching

The dashboards aren't the only signal. The issue queue is the other one, and in some ways the more important.

Dashboards see what you've instrumented. The issue queue sees what customers actually experienced — including the workflows you didn't know mattered, and the regressions that don't show up as error rates but as confused support tickets. A Lookout who only watches dashboards is half-blind.

Triage is also pattern recognition over time. A single ticket is a single ticket. Three tickets about the same workflow in two days is a signal. Engineers focused on their own features see the tickets routed to them and miss the ones that aren't. Support sees individual tickets but doesn't always have the context to connect them. The Lookout is the only person on the team whose job is to read the queue as a whole, looking for the cluster that hasn't yet become an incident but is about to.
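
The cluster read lends itself to the same kind of sketch. A toy, to be clear: the workflow tag and opened-at timestamp are invented field names, and the three-tickets-in-two-days rule simply mirrors the rule of thumb above, not any particular tracker's feature.

    from collections import defaultdict
    from datetime import datetime, timedelta

    def clusters(tickets, min_count=3, window=timedelta(days=2)):
        # Flag any workflow that collects min_count tickets inside a
        # rolling window -- the queue read as a whole, not ticket by ticket.
        opened = defaultdict(list)
        for t in tickets:
            opened[t["workflow"]].append(t["opened_at"])
        flagged = []
        for workflow, times in opened.items():
            times.sort()
            for i in range(len(times) - min_count + 1):
                if times[i + min_count - 1] - times[i] <= window:
                    flagged.append(workflow)
                    break
        return flagged

    tickets = [
        {"workflow": "invoice-export", "opened_at": datetime(2024, 3, 4, 9)},
        {"workflow": "login",          "opened_at": datetime(2024, 3, 4, 11)},
        {"workflow": "invoice-export", "opened_at": datetime(2024, 3, 5, 10)},
        {"workflow": "invoice-export", "opened_at": datetime(2024, 3, 5, 16)},
    ]

    print(clusters(tickets))  # ['invoice-export']: three tickets in two days

The point isn't the script. The point is that the signal only exists when someone reads the queue as a whole.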

Triage is also where the role makes contact with the rest of the company — support, customer success, sometimes sales. Engineers who never triage tend to see the product as a codebase plus dashboards, and forget that customers experience it as a workflow they're trying to complete. A week on the Lookout corrects the distortion.


The obvious next question is whether a team can afford to dedicate an engineer to this every week. The next piece in the series is about how the role pays for itself — and how to run the rotation so it doesn't burn out the people holding it.

Part one, "Your First Incident Will Tell You Everything", introduced the team and the failure mode. Part two, "The Role Nobody Assigned", looked at the structure of the response itself. Part three, "Your Incident Isn't Over When the Site Comes Back", focused on what happens after the incident is over.


Thomas Riboulet is a Fractional VP of Engineering working with European tech companies. He writes about engineering leadership, team structure, and sustainable delivery at insights.wa-systems.eu.