A guide to post-mortem meetings and how we run them at incident.io
Post-mortem meetings can play a crucial role in fostering an environment of continuous learning. Here's how we do them!
incident.io
Whose fault was it anyway? On blameless post-mortems
While blameless post-mortems are a great idea on the surface, if taken to the extreme, they can muddy how much you actually learn from incidents.
incident.io
Keeping the codebase consistent with Pattern Parties
As a codebase evolves, it’s common to see some divergence in the design patterns within it.
Kelsey Mills
Better learning from incidents: A guide to incident post-mortem documents
Post-mortem documents are a great way to facilitate learning after incidents are resolved.
Luis Gonzalez
Clouds, caches and connection conundrums
During a recent infrastructure migration into Google Cloud, we kept running into a pesky issue without a clear cause. Here, we dive into the twists and turns we took to finally figure out what the smoking gun was.
Ben Wheatley
How we’ve made Status Pages better over the last three months
A few months ago we announced Status Pages, the most delightful way to keep customers up to date about ongoing incidents. Since then, we've launched several features to add an extra bit of delight. Read on to learn more.
incident.io
The balancing act of reliability and availability
To prevent issues like downtime, you have to focus on the reliability and availability of your product. But there's a balance to be struck here.
incident.io
Incident management vs problem management: understanding the connection between the two
While problem management and incident management may seem different, they're two sides of the same coin.
Luis Gonzalez
Practical guidance for getting started as a Site Reliability Engineer
Here are a few strategies that might help you build up context, find the problems that really matter and turn these into a plan of action.
Ben Wheatley