Igor Maric / imTheOdd0ne

Incident post-mortems that change nothing: the blameless accountability ritual

Nearly every engineering organisation conducts post-mortems after major incidents, yet survey data suggests that close to half of production incidents are repetitive: problems of a kind already seen, documented, and supposedly resolved. The gap between writing the document and changing the system is where organisational learning goes to die.

TL;DR

Blameless culture was never meant to mean consequence-free culture. Safety science framed incident review as candid, systemic analysis paired with explicit ownership and follow-through. Software teams kept the language and dropped the machinery. The result is a familiar ritual: write-up produced, meeting held, backlog ticket filed, same incident family returns months later. Organisations that improve build structure around the learning itself: trained facilitators, bounded remediation, named owners, and leaders who treat prevention as real work. The document is only evidence that analysis happened. Learning is what shows up later in architecture, priorities, and budgets.

14 April 2026 · 23 min read · Quality, Infrastructure, Industry

On 1 August 2012, at 9:30 AM Eastern Time, Knight Capital Group's automated trading system began executing orders. Within forty-five minutes, the system had bought roughly $7 billion worth of stocks that nobody at the firm intended to purchase. The loss — approximately $460 million, according to the SEC's subsequent investigation — wiped out three times the company's annual earnings in less time than a team standup1.

The cause was dead code. A function called Power Peg, decommissioned around 2003, had been left in the codebase. A deployment the previous week had failed to propagate to one of eight servers. When that server came back online running the old code, it began executing trades at a rate that should have tripped every alarm in the building. And some alarms did trip. The firm's internal system generated 97 automated warning emails before the market opened, each one referencing the router and identifying an error1. Nobody acted on them.

Those 97 emails are worth dwelling on. Not because the Knight Capital story is unusual — though losing nearly half a billion dollars in three quarters of an hour is certainly memorable — but because the pattern it represents is so ordinary. Warnings documented, filed, and ignored. Problems identified and left to fester. Lessons supposedly learned and never applied. Every engineer who has sat through more than a handful of post-mortems recognises this pattern. The document gets written. The action items get logged. And then nothing changes.

The gap between writing and doing

Nearly every engineering organisation conducts some form of post-incident review. Atlassian's 2024 State of Incident Management survey of over 500 US-based IT professionals found that post-mortems are near-universal, though only 22% of respondents reported practising blameless post-mortems specifically2. The Pragmatic Engineer's survey of more than 50 engineering teams found 98.5% had an incident management process in place3. The practice is not the problem. Almost everyone does this.

What almost everyone does not do is follow through. The industry lacks a single rigorous, large-sample study measuring post-mortem action item completion rates, which is itself telling. The best proxy comes from a 2022 survey of over 300 on-call practitioners conducted by Dimensional Research, which found that 48% of production incidents are straightforward and repetitive4. The survey does not say how many of those incidents had a prior post-mortem, but the implication is hard to avoid: close to half of all incidents are of a kind the organisation has already seen, and in many cases already documented with action items assigned. Those action items either died in a backlog or were never started.

The financial weight of this failure is substantial. A 2024 PagerDuty-commissioned survey of 500 IT leaders at large enterprises found the average customer-facing incident takes 175 minutes to resolve at an estimated cost of $4,537 per minute — roughly $794,000 per incident5. A separate 2024 study by Enterprise Management Associates pegged the average cost of unplanned downtime at $14,056 per minute6. The numbers vary depending on company size and what gets counted, but the direction is consistent: incidents are expensive, and repeat incidents are money set on fire because someone filed the lessons in a drawer.
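The per-incident figure is simply the product of the two PagerDuty numbers. A quick check, using only the values quoted above (the variable names are mine, and the ratio at the end is only a rough comparison, since the two surveys count different things):

```python
# Figures as quoted from the PagerDuty and EMA surveys cited above.
PAGERDUTY_MINUTES_TO_RESOLVE = 175
PAGERDUTY_COST_PER_MINUTE = 4_537  # USD
EMA_COST_PER_MINUTE = 14_056       # USD

# Average cost of one customer-facing incident under the PagerDuty figures.
cost_per_incident = PAGERDUTY_MINUTES_TO_RESOLVE * PAGERDUTY_COST_PER_MINUTE
print(f"PagerDuty estimate per incident: ${cost_per_incident:,}")

# How far apart the two per-minute estimates are. This is not a
# like-for-like comparison; the surveys define cost differently.
ratio = EMA_COST_PER_MINUTE / PAGERDUTY_COST_PER_MINUTE
print(f"EMA per-minute estimate is about {ratio:.1f}x the PagerDuty one")
```

The product comes to $793,975, which is where the rounded $794,000 comes from.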

I have watched this cycle play out enough times to know the choreography by heart. An incident occurs. There is urgency, adrenaline, a war room. The incident gets resolved. Someone is assigned to write the post-mortem. It takes a week because the writer was also on call and had a feature deadline. The post-mortem meeting happens. People nod. Action items are captured. They go into the backlog. The backlog has 400 items already. Feature work takes priority. Six months later, the same class of incident occurs. Someone in the war room says 'I feel like we have seen this before'. They have. The post-mortem from last time says so. Nobody had read it.

The same 2022 Dimensional Research survey found that SRE teams spend over 2,000 hours monthly on incident response4. At roughly 2,000 working hours per person per year, that is about twelve person-years annually per organisation. That is not the cost of incidents. That is the cost of responding to incidents. The cost of responding to the same incident twice is the part nobody tracks, because tracking it would require admitting the post-mortem process failed. And admitting the post-mortem process failed would require someone to write a post-mortem about the post-mortem process, which is the kind of recursive absurdity that organisations prefer to leave unexamined.

The customer-facing incident rate is not improving. PagerDuty's 2024 survey found customer-facing incidents increased 43% year-over-year5. The same Dimensional Research survey found incident escalations consume 78% of on-call time4. These are not the numbers of an industry that is learning from its failures. These are the numbers of an industry that is documenting its failures and filing the documentation.

What blameless was supposed to mean

The concept of blameless post-mortems did not originate in Silicon Valley. It was imported from decades of research in safety science — disciplines where the stakes were measured in lives, not latency.

James Reason, working in the 1990s on organisational accident theory, proposed a framework he called Just Culture7. The model distinguished three categories of human behaviour following an incident: honest mistakes (console the person, fix the system), at-risk behaviour where the individual underestimated the danger (coach the person, address systemic incentives), and reckless behaviour involving conscious disregard of substantial risk (appropriate disciplinary action). The framework was never a blanket amnesty. It was a calibrated response that matched consequence to intent.

Sidney Dekker, a safety scientist whose work spans aviation, healthcare, and aerospace, built on Reason's foundation with his 2007 book Just Culture8. Dekker distinguished between retributive justice — who broke a rule, what punishment fits — and restorative justice: who was harmed, what are their needs, whose obligation is it to meet those needs. The word 'just' in Just Culture meant fair, not permissive. Dekker's framework explicitly included accountability. It simply argued that accountability should be forward-looking and systemic rather than backward-looking and punitive.

Richard Cook, a physician and researcher at the University of Chicago, contributed what may be the most frequently cited document in the field: 'How Complex Systems Fail', a short paper listing eighteen observations about failure in complex systems9. Cook's central argument was that catastrophe requires multiple failures — there is no isolated root cause for a complex incident. The people closest to the failure are the same people who create safety every other day. Punishing them for the failure destroys the organisation's ability to learn from it.

In 2012, John Allspaw translated these ideas for the software industry. His blog post 'Blameless PostMortems and a Just Culture', published on Etsy's Code as Craft blog, operationalised Dekker's framework for engineering teams10. Seek out the second story beneath the surface explanation, Allspaw argued. Treat the engineers who made mistakes as the experts best positioned to educate the rest of the organisation. The post was clear, persuasive, and enormously influential. Google codified the approach in Chapter 15 of the SRE book four years later11, and within a few years 'blameless post-mortem' had become industry standard vocabulary.

The problem is what happened next. The industry adopted the word 'blameless' and discarded the frameworks it came from. Reason's three-tier model — which explicitly included disciplinary action for reckless behaviour — was flattened into 'don't blame anyone'. Dekker's restorative justice, which demanded that someone take responsibility for meeting the needs of those harmed, was simplified into 'no consequences'. Cook's observation that there is no single root cause was heard as 'no one is responsible'. The nuance was lost in translation, and what arrived on the other side was a practice that looked like accountability without containing any.

Amy Edmondson has been explicit about this confusion12. Her research uses a two-by-two matrix: psychological safety on one axis, accountability on the other. High safety with low accountability produces a comfort zone — people feel safe but do not take ownership. High safety with high accountability produces the learning zone, where teams actually improve. The version of blamelessness that most organisations implemented landed them squarely in the comfort zone. Everyone felt safe. Nobody changed anything.

Allspaw himself has moved well beyond his 2012 framing. He now describes blamelessness as table stakes — necessary but nowhere near sufficient13. You could build an environment where people share every messy detail without fear of retribution, he has argued, and still not learn very much. Real learning requires trained analysts or facilitators who prepare, collate, and analyse what people actually did and said before and after the incident. After leaving Etsy, Allspaw co-founded Adaptive Capacity Labs with Cook and David Woods, working to professionalise incident analysis in ways that go far beyond the blameless label.

The irony is complete. The people most often cited as the intellectual foundation for blameless culture — Dekker, Reason, Allspaw, Edmondson — all included accountability in their models. The industry took their work, removed the inconvenient parts, and called the result progress.

The post-mortem as compliance artefact

Once the accountability was stripped out, what remained was a ritual. The post-mortem became something organisations performed rather than something they used. The document was the deliverable, not the learning.

I have read post-mortems that were clearly written to satisfy a process requirement rather than to communicate anything useful. You can spot them: the timeline is a chronological dump of log entries, the root cause analysis stops at the first plausible explanation, and the action items are vague enough to be declared complete without changing anything. 'Improve monitoring.' 'Add better alerts.' 'Update documentation.' Each of these could mean anything, and because they could mean anything, they end up meaning nothing. The document exists. The checkbox is ticked. The file is closed.

There is a specific moment in the post-mortem lifecycle where the energy dies. It happens about seventy-two hours after the incident resolves. During those first three days, people care. They remember the stress of the war room, the customers who were affected, the workaround that held together with tape and optimism. They want to fix things. But by day four, the next sprint has started. There is a feature demo on Friday. The product manager is asking about the roadmap. The post-mortem action items — which felt urgent at 2 AM when production was on fire — now compete with a backlog that was already overcommitted before the incident happened. And so they wait. They wait through the next sprint, and the one after that. By the time someone looks at them again, the context has faded, the urgency has evaporated, and closing the ticket without doing the work feels like pragmatism rather than negligence.

The Verica Open Incident Database's 2022 report, covering nearly 10,000 incident reports from close to 600 organisations, found that only 6% of reports identified a root cause or explicitly used RCA framing, and that pattern came from just 15 companies in the dataset14. Even among public incident write-ups, explicit causal analysis is thin. The act of publishing a post-mortem is no guarantee that an organisation has learned anything from it.

The consequences compound through a feedback loop of learned helplessness. Engineers who have been through enough cycles where nothing gets fixed eventually stop believing their analysis matters. They stop digging for real root causes. They stop suggesting ambitious remediation. They write the minimum viable post-mortem to satisfy the process, because experience has taught them that ambition in a post-mortem is wasted effort. The 97 automated emails at Knight Capital were a version of this same dysfunction — warning systems that nobody trusted enough to act on, because the organisational muscle for acting on warnings had atrophied.

Google's own experience illustrates the tension. In 2017, John Lunney, Sue Lueder, and Betsy Beyer published guidance in USENIX's ;login: magazine on making post-mortem action items actionable: specific, bounded, with clear ownership and measurable outcomes15. Google's SRE book is the canonical reference on blameless post-mortem culture. And yet in June 2025, a Google Cloud incident revealed that Service Control code had been deployed without a feature flag, lacked proper error handling, and crashed globally when triggered. The crashing services and their clients did not implement randomised exponential backoff — a resilience pattern that Google's own SRE documentation recommends as foundational16. The organisation that wrote the book on post-mortem culture failed to apply its own published guidance. If Google cannot consistently implement Google's framework, the challenge for everyone else becomes clearer.
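For readers who have not met the pattern, randomised exponential backoff is small enough to sketch in a few lines. What follows is a minimal illustration of the general technique (the variant often called "full jitter"), not Google's implementation; the function names, parameters, and defaults are all my own assumptions.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(
    base: float = 0.5,
    cap: float = 30.0,
    retries: int = 5,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one randomised delay (in seconds) per retry.

    The upper bound doubles on each retry (base, 2*base, 4*base, ...) up to
    `cap`. The actual delay is drawn uniformly from [0, bound], so that many
    clients retrying at once do not all come back at the same instant.
    """
    rng = rng or random.Random()
    for attempt in range(retries):
        bound = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, bound)


def call_with_backoff(
    operation: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
    **backoff_kwargs,
) -> T:
    """Call `operation`, retrying with randomised exponential backoff."""
    delays = backoff_delays(**backoff_kwargs)
    while True:
        try:
            return operation()
        except Exception as error:  # a real client would catch narrower errors
            delay = next(delays, None)
            if delay is None:
                # Retries exhausted: surface the failure instead of looping.
                raise RuntimeError("operation failed after all retries") from error
            sleep(delay)
```

The randomisation is the part that matters in a global outage: without it, every client that failed at the same moment retries at the same moment, and the recovering service is knocked over again.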

The structural gap

The contrast with aviation is instructive. The National Transportation Safety Board has operated under a clear mandate since its founding: the sole objective of investigating an accident is the prevention of future accidents, not the determination of blame or liability. Reporting of accidents is legally mandatory, while voluntary reports of lesser safety concerns receive legal protection through the Aviation Safety Reporting System. The NTSB issues formal recommendations to agencies with the authority and resources to implement them17. There is an independent investigation body, a legal framework for safety recommendations, professional standards for investigators, and structural consequences for ignoring findings.

The results speak for themselves. Commercial aviation fatality rates have declined by orders of magnitude over the past half century. The NTSB model works because it has structural teeth: mandatory reporting, independent investigation, and recommendations directed at entities with the power and obligation to act. The people investigating a crash are not the same people who were flying the plane. The recommendations do not compete with a feature backlog for prioritisation. And ignoring a safety recommendation requires a public, documented explanation — not a quiet ticket closure.

Software engineering has none of these. Post-mortems are voluntary. Investigation is performed by the same exhausted engineers who just resolved the incident. Action items compete with feature work for sprint capacity. There is no independent body, no mandatory reporting, no legal framework, no professional licensing for incident analysis. The entire system runs on goodwill and organisational culture, which is precisely why it collapses the moment a quarterly deadline applies pressure.

Robert Charette, a contributing editor at IEEE Spectrum who has tracked IT project failures for over two decades, has observed that the IT community keeps making the same mistakes it has made since at least 1968, when the term 'software crisis' was coined18. Common failure patterns include claiming every project is unique so past lessons do not apply, underestimating complexity, and inadequate testing. There are no professional licensing requirements for IT project managers and they are rarely held legally liable for failures. The post-mortem, in this context, is the industry's substitute for the regulatory frameworks that other safety-critical domains built over decades. It is an informal, voluntary, unenforceable substitute, and it performs accordingly.

Healthcare offers a parallel that cuts both ways. Morbidity and Mortality conferences have been a staple of medical practice for over a century, and the research on their effectiveness is uncomfortably familiar. A 2020 integrative review of quality-improvement-focused M&M rounds found repeated emphasis on structured case selection, systems-focused analysis, and explicit follow-up, but also noted the field still lacked consistent measures of effectiveness19. The core problem maps directly to software: learning occurs at the individual attendee level more easily than it transfers to the organisational level. People leave the meeting having learned something. The system they return to has not changed.

Diane Vaughan identified the deeper mechanism in her analysis of the Challenger disaster. She called it normalisation of deviance — the process by which deviations from correct behaviour become culturally normalised because they have not yet caused catastrophe20. NASA engineers had documented O-ring erosion in previous shuttle flights. Each launch that did not end in disaster reinforced the belief that the erosion was an acceptable risk. The pattern is identical in software: known risks documented in previous post-mortems go unaddressed because 'it worked last time'. Production pressure overrides safety recommendations. Structural compartmentalisation prevents anyone from seeing the full picture. Until the day it fails.

AWS has experienced cascading failures in the us-east-1 region in 2017, 2020, 2021, 2023, 2024, and 2025 — each involving a foundational service bringing down dependent systems21 22. The 2017 S3 outage was caused by a mistyped command that removed too many servers, and AWS's own post-mortem noted that the affected subsystems 'had not been completely restarted in our larger regions for many years' — nobody had tested what would happen if they needed to restart, because nobody had ever needed to21. The 2021 outage followed the same cascading pattern: an automated scaling activity triggered unexpected network congestion that AWS's own systems could not handle, effectively a self-inflicted denial of service. Each post-mortem pledged architectural improvements. Each subsequent incident revealed the same category of vulnerability.

Cloudflare's pattern is different in mechanism but identical in outcome. In July 2019, a single misconfigured WAF rule containing a poorly written regular expression caused catastrophic backtracking, spiking CPU to 100% on machines worldwide and taking the service down for 27 minutes23. Six years later, Cloudflare's November 2025 outage again traced a network-wide failure to a single change propagating globally — this time a database-permission change that generated an oversized Bot Management feature file and broke core proxy traffic24. Different subsystems, same systemic lesson: when a global change is allowed to propagate too far, too fast, one mistake can become everyone else's outage.
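The usual structural answer to "too far, too fast" is to stage the change and check health before widening it. The sketch below is a deliberately simplified illustration of that idea, not a description of Cloudflare's deployment system; the stage fractions and the `apply_change`, `is_healthy`, and `rollback` callables are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    completed: bool
    updated: List[str] = field(default_factory=list)      # hosts left on the new change
    rolled_back: List[str] = field(default_factory=list)  # hosts reverted after a failed check


def staged_rollout(
    hosts: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_fractions: Sequence[float] = (0.01, 0.1, 0.5, 1.0),
) -> RolloutResult:
    """Apply a change in widening stages, halting at the first failed health check."""
    updated: List[str] = []
    for fraction in stage_fractions:
        # Cumulative target for this stage; always advance by at least one host.
        target = max(len(updated) + 1, int(len(hosts) * fraction))
        target = min(target, len(hosts))
        for host in hosts[len(updated):target]:
            apply_change(host)
            updated.append(host)
        if not all(is_healthy(host) for host in updated):
            # The blast radius is limited to the hosts touched so far.
            for host in updated:
                rollback(host)
            return RolloutResult(completed=False, rolled_back=updated)
        if len(updated) == len(hosts):
            break
    return RolloutResult(completed=len(updated) == len(hosts), updated=updated)
```

With the default fractions and a fleet of 100 hosts, a change that breaks health checks is caught after touching one machine rather than all of them. Real systems add soak time between stages and richer health signals, which this sketch omits.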

What actually changes things

The organisations that genuinely learn from incidents share structural features that most engineering teams lack.

Indeed's Learning from Incidents team, documented by Adaptive Capacity Labs in 2025, represents the most detailed public account of what dedicated incident analysis looks like in practice25. An SRE began introducing resilience engineering concepts internally in 2019. By 2020 the team had engaged Adaptive Capacity Labs. A critical methodological insight emerged: people analysing an incident should not include the responders to that incident, and analysis should be done by more than one person — ideally three. The separation prevents narrative anchoring and hindsight bias. The dedicated team became so effective that internal teams actively request their analysis services. The enabling conditions were initiative and patient long-term progress over multiple years.

Etsy, under CTO John Allspaw in the early 2010s, was among the first software companies to formalise blameless post-mortem practices. CEO Chad Dickerson actively shifted the culture from blame to learning. Engineers who made mistakes were treated as experts rather than defendants. The company published a debriefing facilitation guide that treated the debrief as a learning opportunity first and a fixing opportunity second26. What made Etsy's approach work was not the template. It was the cultural investment — years of leadership modelling the behaviour they wanted to see, creating an environment where engineers took more risks and moved faster because they trusted the system to support them when things broke.

The common thread across organisations that improve is not a specific process or tool. It is structural commitment. Time-boxed action items with explicit owners rather than open-ended backlog entries. Leadership that treats incident follow-through with the same urgency as feature delivery. Dedicated facilitation by people trained in the skill — because debriefing facilitation is a professional competency, as distinct from engineering as engineering is from project management. And crucially, an understanding that the written document is not the output. The learning is the output. The document is just the artefact left behind.

Google's own guidance on this point, published by Lunney, Lueder, and Beyer in USENIX's ;login: magazine, is worth taking seriously despite Google's imperfect track record of following it. Action items must be 'actionable, specific, and bounded' — phrased as sentences starting with a verb, with narrow scope, and clear completion criteria15. An action item that says 'improve monitoring' is not actionable. An action item that says 'add latency alerting to the payment service with a threshold of 500ms p99, owned by the payments team, due in two weeks' is actionable. The difference is not semantic. It is the difference between something that gets done and something that decays in a backlog.
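The "actionable, specific, and bounded" test is mechanical enough that a team could lint for it. Here is a rough sketch of what such a check might look like. The field names, the list of vague phrases, and the 30-day bound are my own assumptions for illustration; none of them come from the USENIX article.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

# Hypothetical list of wording that is too vague to verify; tune to taste.
VAGUE_PHRASES = ("improve", "better", "look into", "consider")


@dataclass
class ActionItem:
    description: str  # should start with a concrete verb
    owner: str        # a named team or person
    due: date         # bounded in time
    done_when: str    # an observable completion criterion


def lint_action_item(item: ActionItem, today: date, max_days: int = 30) -> List[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    lowered = item.description.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f"vague wording: '{phrase}'")
    if not item.owner.strip():
        problems.append("no owner")
    if not item.done_when.strip():
        problems.append("no completion criterion")
    if (item.due - today).days > max_days:
        problems.append(f"due date more than {max_days} days out")
    return problems


today = date(2026, 4, 14)
vague = ActionItem("Improve monitoring", owner="", due=today + timedelta(days=180), done_when="")
specific = ActionItem(
    "Add p99 latency alert at 500ms to the payment service",
    owner="payments team",
    due=today + timedelta(days=14),
    done_when="alert fires in a staging test",
)
print(lint_action_item(vague, today))     # several problems reported
print(lint_action_item(specific, today))  # empty list
```

A check like this cannot tell whether an action item is the right one. It can only stop the obviously unfalsifiable ones from entering the backlog.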

I worked on a team once where the engineering manager made a simple rule: no post-mortem action item survives more than two sprints without either being completed or being explicitly deprioritised by a director. The deprioritisation had to be written down, with a name attached, and it went into the next post-mortem if the same class of incident recurred. It was not a complex system. It was just accountability with a paper trail. The repeat incident rate dropped noticeably within two quarters, not because the rule was clever, but because people knew that ignoring an action item would be visible and attributed the next time something broke.
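That rule is simple enough to automate. The sketch below is a reconstruction of the idea rather than the tooling that team used; the two-week sprint length and all of the names are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

SPRINT_LENGTH = timedelta(days=14)  # assumption: two-week sprints
MAX_SPRINTS_OPEN = 2


@dataclass
class TrackedItem:
    title: str
    opened: date
    completed: bool = False
    deprioritised_by: Optional[str] = None      # name of the director who signed off
    deprioritised_reason: Optional[str] = None  # the written justification


def needs_escalation(item: TrackedItem, today: date) -> bool:
    """True if the item has outlived the limit with no completion or sign-off."""
    if item.completed:
        return False
    if item.deprioritised_by and item.deprioritised_reason:
        return False  # explicitly parked, with a name and a written reason
    return today - item.opened > MAX_SPRINTS_OPEN * SPRINT_LENGTH


def overdue(items: List[TrackedItem], today: date) -> List[TrackedItem]:
    """Items that should be raised at the next review."""
    return [item for item in items if needs_escalation(item, today)]
```

The important design choice is that deprioritisation only counts when both a name and a reason are recorded. An item cannot quietly age out; someone has to own the decision to leave it undone.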

Edmondson's research points to the precondition that makes all of this possible: psychological safety paired with accountability12. Not one or the other. Both. Teams where people feel safe to speak honestly about what went wrong and where the organisation then acts on what it hears. The learning zone sits at the intersection of candour and consequence. Most organisations have optimised for one axis while neglecting the other, and the result is either a blame culture that suppresses information or a blameless culture that suppresses action.

The distinction between blame and accountability is not subtle, but it is consistently conflated. Blame asks 'whose fault was this?' and distributes punishment. Accountability asks 'whose responsibility is it to make sure this does not happen again?' and distributes ownership. One looks backward to assign suffering. The other looks forward to assign work. Dekker's entire career has been spent articulating this distinction, and the industry has spent the last decade pretending it does not exist.

The 2024 DORA State of DevOps report, drawing on data from over 39,000 professionals across the programme's history, found that the fastest teams are also the most stable — speed and reliability reinforce rather than trade off27. A climate for learning, including treating failures as learning opportunities, is predictive of performance gains. The evidence does not support the premise that accountability slows teams down. It supports the opposite: teams that learn from failure and act on what they learn outperform teams that do either one without the other.

The post-mortem was never supposed to be a document. It was supposed to be a mechanism for change. Somewhere between Allspaw's blog post and the thousandth template copied into Confluence, we lost the mechanism and kept the document. The ritual persists because it is easy to perform and uncomfortable to question. Writing a post-mortem feels like progress. Filing it and moving on is the path of least resistance. And the path of least resistance, in the absence of structural forces compelling something better, wins every time.

I think about the 5 Whys technique that so many post-mortem templates include, and I think about why it fails. The technique was imported from Toyota's manufacturing process, where the causal chain from defect to root cause is typically mechanical and linear. Software incidents are not mechanical and linear. They are the product of multiple interacting failures across technical and organisational layers — what Cook called the 'changing mixtures of failures latent within' complex systems9. When you ask 'why' five times about a complex incident, you do not converge on a root cause. You converge on whatever the facilitator already suspected, filtered through confirmation bias and hindsight. The technique gives you the illusion of depth without the substance. It is the post-mortem in miniature: a ritual that mimics analysis without producing understanding.

Nora Jones, who founded the Learning from Incidents community and later Jeli (acquired by PagerDuty in 2023), has advocated for replacing the term 'post-mortem' entirely with 'learning review'28. The argument is not cosmetic. The medical etymology — an examination of the dead — carries an implicit assumption that there is a single cause of death to identify. Complex incidents do not have a single cause. They have contributing factors, interacting conditions, and adaptive responses that prevented the incident from being worse. The organisations that learn most from their incidents are the ones that study what went right alongside what went wrong — what Erik Hollnagel calls Safety-II, the recognition that things go right and things go wrong for the same basic reasons29.

The 97 emails at Knight Capital were not a failure of alerting. They were a failure of organisational capacity to act on information it already possessed. Every post-mortem that identifies a real problem and generates real action items that no one completes is a smaller version of the same failure. The information is there. The analysis is there. What is missing is the structural commitment to turn knowledge into change.

Your incidents are already teaching you what is broken. The question is whether anyone with authority is listening, and whether listening, this time, will be followed by something more than another document.


Footnotes

  1. SEC. (2013). 'In the Matter of Knight Capital Americas LLC: Administrative Proceeding File No. 3-15570.' Securities and Exchange Commission Release No. 34-70694. https://www.sec.gov/litigation/admin/2013/34-70694.pdf

  2. Atlassian. (2024). 'State of Incident Management 2024.' Atlassian (conducted by CITE Research, n=500+ US-based IT professionals). https://www.atlassian.com/incident-management/2024-state-of-incident-management

  3. Orosz, G. (2021). 'Incident Review and Postmortem Best Practices.' The Pragmatic Engineer (survey of 50+ engineering teams). https://blog.pragmaticengineer.com/postmortem-best-practices/

  4. Shoreline.io. (2022). '2022 Production Operations Benchmark Survey.' Shoreline.io (conducted by Dimensional Research, n=300+ on-call practitioners). https://shoreline.io/offer/2022-production-operations-benchmark-survey

  5. PagerDuty. (2024). 'The State of Digital Operations 2024.' PagerDuty (conducted by Censuswide, n=500 IT leaders at enterprises with 1,000+ employees). https://www.pagerduty.com/newsroom/customer-facing-incidents-increase-43-percent/

  6. BigPanda. (2024). 'The Cost of Outages 2024.' BigPanda (conducted by Enterprise Management Associates, n=400+ IT professionals). https://bigpanda.io/ar-ema-outage-cost-2024/

  7. Reason, J. (1997). Managing the Risks of Organisational Accidents. Ashgate Publishing.

  8. Dekker, S. (2007). Just Culture: Balancing Safety and Accountability. Ashgate Publishing. https://sidneydekker.com/just-culture

  9. Cook, R. I. (2000). 'How Complex Systems Fail.' Cognitive Technologies Laboratory, University of Chicago. https://how.complexsystems.fail/

  10. Allspaw, J. (2012). 'Blameless PostMortems and a Just Culture.' Code as Craft (Etsy Engineering Blog). https://www.etsy.com/codeascraft/blameless-postmortems

  11. Beyer, B., Jones, C., Petoff, J. & Murphy, N. R., eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. Chapter 15: Postmortem Culture. https://sre.google/sre-book/postmortem-culture/

  12. Edmondson, A. C. (1999). 'Psychological Safety and Learning Behavior in Work Teams.' Administrative Science Quarterly, 44(2), 350-383. https://journals.sagepub.com/doi/10.2307/2666999

  13. Allspaw, J. (2019). 'Getting the Messy Details Is Critical.' Code for America Blog. https://medium.com/code-for-america/john-allspaw-getting-the-messy-details-is-critical-59e641aa0a77

  14. Verica. (2022). 'VOID Report 2022.' Verica. https://static.isthisit.nz/artifacts/blog/2022_Void_Report.pdf

  15. Lunney, J., Lueder, S. & Beyer, B. (2017). 'Postmortem Action Items: Plan the Work and Work the Plan.' ;login:, 42(1). USENIX. https://www.usenix.org/system/files/login/issues/login_spring17_issue.pdf

  16. HyperFrame Research. (2025). 'Google Cloud: Anatomy of a Systemic Failure.' HyperFrame Research. https://hyperframeresearch.com/2025/06/24/google-cloud-anatomy-of-a-systemic-failure/

  17. NTSB. (2006). 'Safety Studies: Lessons Learned and Lives Saved.' National Transportation Safety Board. https://www.ntsb.gov/safety/safety-studies/Documents/SR0601.pdf

  18. Charette, R. N. (2005). 'Why Software Fails.' IEEE Spectrum. https://spectrum.ieee.org/why-software-fails

  19. Churchill, K. P., Murphy, J. & Smith, N. (2020). 'Quality improvement focused morbidity and mortality rounds: an integrative review.' Cureus, 12(12), e12146. https://doi.org/10.7759/cureus.12146

  20. Vaughan, D. (1996). The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/C/bo22781921.html

  21. AWS. (2017). 'Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region.' Amazon Web Services. https://aws.amazon.com/message/41926/

  22. AWS. (2021). 'Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region.' Amazon Web Services. https://aws.amazon.com/message/12721/

  23. Cloudflare. (2019). 'Details of the Cloudflare outage on July 2, 2019.' Cloudflare Blog. https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/

  24. Cloudflare. (2025). 'Code Orange: Fail Small.' Cloudflare Blog. https://blog.cloudflare.com/fail-small-resilience-plan/

  25. Adaptive Capacity Labs. (2025). 'What Progress in Learning from Incidents Actually Looks Like.' Adaptive Capacity Labs. https://www.adaptivecapacitylabs.com/2025/02/28/what-progress-in-learning-from-incidents-actually-looks-like/

  26. Allspaw, J., Evans, M. & Schauenberg, D. (2016). 'Debriefing Facilitation Guide.' Etsy Code as Craft. https://www.etsy.com/codeascraft/debriefing-facilitation-guide/

  27. DORA. (2024). '2024 Accelerate State of DevOps Report.' Google Cloud (10th annual report, cumulative n=39,000+ professionals). https://dora.dev/research/2024/dora-report/

  28. Jones, N. (2022). 'Learning from Incidents with Nora Jones.' The Changelog Podcast #478. https://changelog.com/podcast/478

  29. Hollnagel, E. (2014). Safety-I and Safety-II: The Past and Future of Safety Management. CRC Press/Ashgate. https://erikhollnagel.com/ideas/safety-i%20and%20safety-ii.html
