Videos and conference talks
- Who destroyed Three Mile Island?: Nick Means is one of the very best when it comes to conference talks. We chose this one to share because of its fascinating parallels to the incidents we all face, but How to crash a plane, Taking the 737 to the Max! and the others are great too.
- How did things go right?: A great talk by Ryan Kitchens at Netflix, which looks at things going right as a source of learning. There's also a "no root cause" slide with a baby Groot on it, so it's hard to go wrong.
- Red Bull handling an incident: We think it's a glowing example of incident management done right, and a fascinating watch whether or not Formula 1 is your thing. We wrote more about this here.
- The Field Guide to Understanding Human Error: A practical guide that explores how things go wrong, and how best to investigate and make sense of them when they do. Also one of Chris's favourite books of all time.
- Safety-I and Safety-II: Most people think of safety as the absence of things going wrong. This book repositions safety around activities which maximise the chance of things going right. Fair warning: this is more academic than practical, but still an interesting read.
Blog posts and papers
- Moving past shallow incident data: A useful post that articulates why things like MTTR aren’t especially useful in understanding what happened in an incident. Fun fact: you can find this post by Googling "allspaw grapes".
- Safety-I vs Safety-II whitepaper: A much shorter read than the book above, and a great introduction to the topic.
- Incident response tabletop exercise scenarios: It can be pretty daunting figuring out how best to run a tabletop exercise. This guide has some great tips and examples for getting started.
- Incident metrics in SRE: A bit of a mathematical deep dive into Mean Time to Recovery (MTTR) and why it's not a useful measure, especially when it comes to spotting trends in incident response.
- Crafting sustainable on-call rotations: The wonderful people at Stripe have set the standard when it comes to online publishing, and this article by Ryn Daniels is no exception. If you're looking for advice on building more sustainable on-call, you're in good hands here.
- Why you shouldn't count production incidents: Some well-aligned advice on not counting production incidents. We agree wholeheartedly.
- The incident.io blog: It feels a little like cheating to include our own work, but hey, this is our guide and we're proud of what we've written.