Incident management

Useful for

Operations, Risk management, Agentic delivery

Introduction

Policies and procedures are only useful if the team can follow them under pressure. Rehearsals turn governance from paperwork into operational muscle memory.

Response and recovery

Incident management should be treated as a two-step process.

The first step is response. This is where the immediate problem is contained or fixed. The service may be brought back online, the security issue may be blocked, the bad deployment may be rolled back, or the customer-impacting failure may be stopped.

The second step is recovery. This is where the system is brought back to normal and back under control. Temporary fixes are removed, manual changes are captured, infrastructure is reconciled with infrastructure as code (IaC) where appropriate, monitoring is checked, data integrity is confirmed, and any follow-up work is tracked.

An incident is not finished just because the first error has gone away. It is finished when the company understands what happened, the system is back in a controlled state, and the remaining risk has been remediated or accepted.

Runbooks

Do not rely on memory during an incident. Failover, restore and recovery tasks should have runbooks that people can follow under pressure.

Runbooks are not theoretical documents. They are the step-by-step operating instructions for actions that must work when the team is tired, the customer is waiting and the system is unstable. A restore process, regional failover, DNS change, key rotation, rollback, database recovery or emergency access procedure should not depend on the one person who happens to remember how it worked last time.

Good runbooks should define:

When the runbook should be used.
Who is allowed to approve the action.
Preconditions and safety checks.
Step-by-step execution instructions.
Expected timings.
Validation checks after each major step.
Communication points.
Rollback or stop conditions.
Evidence to capture.
How to return the system to a controlled state afterwards.
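
One way to keep runbooks honest is to hold them in a structured form and check for empty sections before they are relied on. The sketch below is illustrative only: the `Runbook` class, its field names and the `missing_sections` helper are assumptions that mirror the list above, not an established format.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Runbook:
    """Hypothetical runbook record; each field mirrors one item in the list above."""
    name: str
    when_to_use: str = ""
    approvers: List[str] = field(default_factory=list)
    preconditions: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)
    expected_minutes: Optional[int] = None
    validation_checks: List[str] = field(default_factory=list)
    communication_points: List[str] = field(default_factory=list)
    stop_conditions: List[str] = field(default_factory=list)
    evidence_to_capture: List[str] = field(default_factory=list)
    return_to_control: str = ""


def missing_sections(runbook: Runbook) -> List[str]:
    """Return the names of sections that are still empty, in field order."""
    return [
        f.name
        for f in fields(runbook)
        if getattr(runbook, f.name) in ("", [], None)
    ]


# Example: a draft restore runbook with only a trigger and one step written down.
draft = Runbook(
    name="database restore",
    when_to_use="Primary database cannot be recovered in place",
    steps=["Restore the latest backup to the standby instance"],
)
print(missing_sections(draft))
```

A check like this does not prove the runbook works. It only shows which sections nobody has written yet, which is a useful gate before a rehearsal.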

Runbooks also need rehearsal. A restore runbook that has never been tested is only a theory. A failover runbook that depends on undocumented manual steps is a risk. The rehearsal should prove that the process works, that the team can follow it, and that the documentation matches reality.

Backup restore tests should record the expected and achieved recovery point objective (RPO) and recovery time objective (RTO). The result should be compared with contracts, support promises and customer expectations. If restore testing shows that RPO or RTO has worsened materially, the change should be recorded as a risk and escalated to the company risk register where it affects contractual promises, customer trust or service availability.
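
The comparison described above can be made mechanical. The following is a minimal sketch, assuming RPO and RTO are recorded as durations. The names `RestoreTestResult`, `missed_objectives` and `worsened_materially`, and the 20% tolerance used to define "materially", are assumptions for illustration; the real threshold should come from the company's own contracts and risk appetite.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class RestoreTestResult:
    expected_rpo: timedelta  # maximum data loss the company has promised
    achieved_rpo: timedelta  # data loss observed in the rehearsal
    expected_rto: timedelta  # maximum downtime the company has promised
    achieved_rto: timedelta  # time the restore actually took


def missed_objectives(result: RestoreTestResult) -> List[str]:
    """List the objectives the rehearsal failed to meet."""
    missed = []
    if result.achieved_rpo > result.expected_rpo:
        missed.append("RPO")
    if result.achieved_rto > result.expected_rto:
        missed.append("RTO")
    return missed


def worsened_materially(previous: timedelta, current: timedelta,
                        tolerance: float = 0.2) -> bool:
    """True if the current figure exceeds the previous one by more than
    the tolerance (an assumed 20% by default)."""
    return current > previous * (1 + tolerance)


latest = RestoreTestResult(
    expected_rpo=timedelta(minutes=15),
    achieved_rpo=timedelta(minutes=40),
    expected_rto=timedelta(hours=4),
    achieved_rto=timedelta(hours=3),
)
print(missed_objectives(latest))  # ['RPO']

# Compare this rehearsal's restore time with the previous rehearsal's.
print(worsened_materially(timedelta(hours=2), latest.achieved_rto))  # True
```

In this example the restore met its RTO but lost more data than promised, and the restore time is also drifting upwards compared with the previous test. Both findings would be candidates for the risk register.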

Incident coordinator

Every meaningful incident needs an incident coordinator. Their role is to manage and control the incident response, not to become one of the people fixing the immediate technical problem.

That separation matters. The people responding to the incident need to focus on diagnosis, containment and recovery. The coordinator needs to keep the response organised: who is working on what, what has been tried, what decisions have been made, who needs to be informed and whether extra people need to be brought in.

The coordinator should capture the incident timeline and communications as the incident unfolds. This includes internal updates, customer communications, supplier interactions, decisions, risk acceptances, workarounds and handovers. That record is invaluable during the post-mortem because it shows what actually happened under pressure rather than what people remember later.

Useful coordinator responsibilities include:

Open and maintain the incident channel or bridge.
Confirm who is incident lead, technical lead and communications owner.
Track actions, owners and timestamps.
Capture key decisions and assumptions.
Coordinate customer, supplier or internal communications.
Bring in additional team members when the response needs more capability.
Keep responders focused by reducing side conversations and duplicated effort.
Preserve the evidence needed for post-mortem review.
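
The timeline does not need special tooling. As a rough sketch, assuming entries are captured as they happen with a timestamp, a kind, a summary and an owner, something like the following would be enough. The `IncidentTimeline` and `TimelineEntry` names, the entry kinds and the incident identifier format are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    timestamp: datetime
    kind: str      # assumed kinds: "action", "decision", "communication", "handover"
    summary: str
    owner: str


@dataclass
class IncidentTimeline:
    incident_id: str
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, kind: str, summary: str, owner: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, stamping it with the current UTC time by default."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), kind, summary, owner)
        self.entries.append(entry)
        return entry

    def of_kind(self, kind: str) -> List[TimelineEntry]:
        """Return entries of one kind in chronological order."""
        matching = [e for e in self.entries if e.kind == kind]
        return sorted(matching, key=lambda e: e.timestamp)


timeline = IncidentTimeline("INC-0001")  # identifier format is an assumption
timeline.record("action", "Rolled back the latest deployment", "technical lead")
timeline.record("decision", "Hold customer update until rollback is verified", "incident lead")
for entry in timeline.of_kind("decision"):
    print(entry.timestamp.isoformat(), entry.owner, entry.summary)
```

The point of the structure is that decisions and communications can be pulled out separately for the post-mortem without relying on anyone's memory.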

Breach and suspected-breach records

Every incident should have a record, including suspected breaches. This matters for ISO-style evidence, data protection governance and practical learning. If the company only records confirmed breaches, it loses the evidence trail that shows how decisions were made when the facts were still unclear.

The record should help the company understand what actually happened. It should capture the facts known at the time, the assumptions being made, the investigation steps, the impact assessment, the decisions taken and the rationale for those decisions.

For security or data incidents, the company should explicitly assess whether the incident is a personal data breach and whether the Information Commissioner's Office (ICO) should be contacted. If the decision is not to contact the ICO, that decision should still be recorded with the reason. The absence of a notification should be a documented conclusion, not a missing record.

The data protection officer role should be part of the process where personal data may be involved. In a small company this may be a named data protection owner rather than a formal DPO, but the role must be clear. The DPO/data protection owner should help assess whether personal data is involved, whether the company is acting as Controller or Processor, who must be contacted, what contractual notification timings apply and whether the ICO or affected data subjects need to be notified.

This means incident management must link back to Controller/Processor positioning, contracts and support records. The team should be able to answer:

Are we Controller, Processor, or both for the affected data?
If we are a Processor, which customer/controller must be contacted?
Which contract, DPA or support agreement defines notification timing?
Who is the customer security, privacy or support contact?
Who owns communications to the ICO, customers, data subjects and suppliers?
Where is the evidence of the notification decision stored?

A breach or suspected-breach record should include:

Incident identifier.
Date and time detected.
Date and time the incident may have started.
Reporter or detection source.
Incident coordinator.
Systems, customers, tenants or datasets affected.
Whether personal data may be involved.
Type and sensitivity of data involved.
Likely number of data subjects or records affected.
Containment actions.
Recovery actions.
Evidence reviewed.
Impact assessment.
DPO or data protection owner involvement.
Controller/Processor position for the affected data.
Customer/controller contact where the company is acting as Processor.
Contract, DPA or support notification requirement.
Whether the incident is a personal data breach.
Whether the ICO should be contacted.
ICO decision, owner, timestamp and rationale.
Whether affected customers or data subjects should be contacted.
Communications sent or deliberately not sent.
Residual risk and follow-up actions.
Link to the post-mortem or learning review.

This record should be controlled like other governance evidence. It should be retained securely, restricted to people who need access and linked to the incident timeline, post-mortem and any risk acceptance or remediation work.
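
The rule that a decision not to notify must still be documented can be checked automatically. The sketch below covers only a small subset of the fields listed above and is an assumption about how such a record might be modelled. The names `BreachRecord`, `NotificationDecision` and `open_gaps` are hypothetical, and the example contact address is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class NotificationDecision:
    notify: Optional[bool] = None          # None means no decision recorded yet
    owner: str = ""
    decided_at: Optional[datetime] = None
    rationale: str = ""

    def is_documented(self) -> bool:
        """A decision counts only if it has an owner, a timestamp and a reason."""
        return (
            self.notify is not None
            and bool(self.owner)
            and self.decided_at is not None
            and bool(self.rationale.strip())
        )


@dataclass
class BreachRecord:
    incident_id: str
    detected_at: datetime
    personal_data_involved: Optional[bool] = None   # None means not yet assessed
    role: str = ""                                  # "controller", "processor" or "both"
    controller_contact: str = ""
    ico_decision: NotificationDecision = field(default_factory=NotificationDecision)


def open_gaps(record: BreachRecord) -> List[str]:
    """List the questions the record cannot yet answer."""
    gaps = []
    if record.personal_data_involved is None:
        gaps.append("personal data assessment missing")
    if record.personal_data_involved and not record.ico_decision.is_documented():
        gaps.append("ICO decision not documented")
    if record.role in ("processor", "both") and not record.controller_contact:
        gaps.append("controller contact missing")
    return gaps


now = datetime.now(timezone.utc)
record = BreachRecord(
    incident_id="INC-0001",
    detected_at=now,
    personal_data_involved=True,
    role="processor",
    controller_contact="privacy@customer.example",  # placeholder address
    ico_decision=NotificationDecision(
        notify=False,
        owner="data protection owner",
        decided_at=now,
        rationale="Data was encrypted and no access was evidenced",
    ),
)
print(open_gaps(record))  # []
```

Note that a recorded "do not notify" decision with an owner, timestamp and rationale leaves no gap, while a record with no decision at all does. That is the difference between a documented conclusion and a missing record.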

Post-mortem culture

The purpose of a post-mortem is not blame. It is to understand what went wrong, learn from it and reduce the risk of the incident repeating.

This matters culturally and practically. If people feel the post-mortem is about finding somebody to blame, they will become defensive and the company will learn less. A useful post-mortem looks at the system: controls, assumptions, monitoring, alerting, access, deployment process, communication, documentation, recovery steps and decision points.

The output should be practical. The company should understand:

What happened.
What impact it had.
What signals were missed.
What made the response harder.
What worked well.
What needs to change.
Which risks are remediated, accepted or tracked.

Risk language should be precise. Risk reduction happens because a mitigation reduces the likely impact, the likelihood, or both. Residual risk is what remains after those mitigations have been applied. A post-mortem should not simply say that risk has been reduced. It should explain which mitigation reduced impact, which mitigation reduced likelihood, and what residual risk is still owned by the business.

The risk matrix should be used to score both the inherent risk and the residual risk. Many operational and technical risks can remain in the technical risk register, but material risks should be raised to the company risk register. The board does not need every technical detail, but it does need visibility of risks that could materially affect customers, regulatory duties, service availability, finances, reputation or strategic commitments.
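
To make the distinction between inherent and residual risk concrete, here is a small worked sketch. It assumes a 5x5 matrix where the score is likelihood multiplied by impact, each on a 1 to 5 scale, and where each mitigation reduces likelihood, impact or both by whole points. The scale, the scoring rule and the `Mitigation` structure are assumptions; the company's own risk matrix should define the real values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mitigation:
    name: str
    likelihood_reduction: int = 0   # points removed from likelihood
    impact_reduction: int = 0       # points removed from impact


def score(likelihood: int, impact: int) -> int:
    """Score a risk on the assumed 5x5 matrix."""
    return likelihood * impact


def residual(likelihood: int, impact: int,
             mitigations: List[Mitigation]) -> Tuple[int, int]:
    """Apply mitigations and return (likelihood, impact), never below 1."""
    new_likelihood = max(1, likelihood - sum(m.likelihood_reduction for m in mitigations))
    new_impact = max(1, impact - sum(m.impact_reduction for m in mitigations))
    return new_likelihood, new_impact


inherent_likelihood, inherent_impact = 4, 5
mitigations = [
    Mitigation("Deployment approval gate", likelihood_reduction=1),
    Mitigation("Rehearsed restore runbook", impact_reduction=2),
]

res_likelihood, res_impact = residual(inherent_likelihood, inherent_impact, mitigations)
print("inherent:", score(inherent_likelihood, inherent_impact))  # inherent: 20
print("residual:", score(res_likelihood, res_impact))            # residual: 9
for m in mitigations:
    print(m.name, "likelihood -", m.likelihood_reduction, "impact -", m.impact_reduction)
```

Listing each mitigation with the dimension it affects is what lets a post-mortem say which control reduced likelihood, which reduced impact, and what residual score the business still owns.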

The goal is not a perfect incident-free system. The goal is a company that gets better each time something goes wrong.

Rehearsals to consider

Tabletop incident exercise.
Restore rehearsal.
Failover rehearsal.
Restore runbook rehearsal.
Failover runbook rehearsal.
Lost admin account exercise.
Data subject request rehearsal.
Customer notification rehearsal.
Incident coordinator rehearsal.
Post-mortem rehearsal using a realistic incident timeline.
Recovery from a temporary manual infrastructure change.
Reconciliation of emergency changes back into IaC.

How this evolves as the company grows

Before Pre-Production, define response, recovery, incident coordination, communication capture and runbooks.

Before Production, rehearse restore or failover paths and make breach or suspected-breach records part of the process.

At Production, include DPO or data protection owner involvement where personal data may be affected and use contracts/support records to decide who must be contacted.

As the company scales, incident rehearsals, post-mortems, risk-reduction actions and residual-risk ownership should become regular evidence.

What an agent should look for

Who coordinates the incident?

What is response versus recovery?

Are breach decisions, contacts and post-mortem actions recorded?

What good looks like

The company can explain the decision, show the evidence behind it and identify the next point where the control needs to mature.
