Recapping SEV0 London 2025

SEV0 London marked our very first time bringing the event across the pond—and what a debut it was.

The topic on everyone’s mind? How AI is transforming everyday life, often in ways we’re only beginning to grasp. Earlier this year, Anthropic CEO Dario Amodei predicted that AI would be writing 90% of code within three to six months. And while we’re not there yet at incident.io, we’re writing more code than ever with Claude.

The AI shift isn’t coming—it’s already here—and companies that embrace it are setting the new standard for how modern engineering teams work. But with greater ease comes greater risk.

As incident.io CEO Stephen Whitworth outlined in the opening keynote, three main costs are starting to emerge: more debugging by humans of code they didn’t write, more complex and interdependent systems with less context held by any single person, and burnout among your best engineers from constant incidents.

At SEV0 London, the AI shift was front and center. Conversations about AI weren’t just theoretical; they were grounded in real stories from teams already using it to change how they manage reliability and respond to incidents. From rethinking how to scale teams to the argument that we should declare more incidents, it’s clear that AI has changed not only how teams ship code but also how they maintain it.

SEV0 London showed how the future of reliability will be a blend of smart technology and even smarter teams.

Rethinking growing engineers in the age of AI coding

Meri Williams (CTO, Pleo)

Meri explored how AI coding assistants are changing the way engineering teams work and grow. Now, AI can handle much of the repetitive work once done by junior engineers, raising an important question: if AI takes over that work, how will new engineers gain experience? Comparing this to other industries facing automation, Meri noted that we need to rethink how people learn the basics when “learning by doing” isn’t as common.

Junior engineering work has historically served as a form of “deliberate practice”, a process where repetition, feedback, and incremental challenge enable mastery. As AI takes over many of these tasks, engineers risk missing out on the deep understanding built through this process. To adapt, engineers need to be equipped not just to use AI but to think with it, bridging the gap between human judgment and machine assistance.

You can’t hold back the ocean, but you can learn to surf…Pain is mandatory, but suffering is optional.

Repetitive coding tasks have always been a key part of mastering engineering skills. As AI takes over those tasks, teams must be more deliberate about teaching fundamentals like debugging, systems thinking, and recognizing “code smells,” while also developing new skills like prompt engineering and evaluating AI outputs.

Leading up in a crisis

Adrian Carvalho (Senior Engineering Manager, Zuari Software)

Adrian gave a thoughtful, human talk about how engineering managers can lead up during crises—helping executives become a source of calm rather than stress. Drawing on his own experience managing major outages, he introduced the idea of “stress-free incident management.” The goal isn’t to avoid chaos, but to build a culture where teams stay steady and effective even when things go wrong.

Leadership challenges during incidents often come from good intentions—like wanting to help or provide clarity—but can lead to micromanagement, panic, or confusion. The key, then, is aligning leaders and responders through clear frameworks, blameless communication, and automation that reduces unnecessary noise. Often, clear communication matters more than more people or faster fixes.

Incidents will never be perfect, and the panic is part of the fun. If we’re lucky, we can turn them into something where we’re working together, and maybe we’re learning things.

Adrian suggested practical ways to make this part of company culture: training everyone on their role during incidents, adding a stress impact section to postmortems, and running reviews that link incidents to business outcomes. In closing, he reminded everyone that incidents will never be perfect—but if we handle them right, they can bring teams closer and help them grow.

Writing up your most embarrassing failures for the world to see is good, actually

Brian Scanlan (Sr. Principal Engineer, Intercom)

Brian gave an honest and insightful talk on Intercom’s approach to public post-incident reports—and why transparency makes for both better engineering and better business. As Intercom shifted toward AI-powered customer service, uptime became directly tied to revenue. With so many customers relying on their platform daily, Brian stressed that availability is product quality, and sharing post-incident reports publicly is one of the best ways to prove that commitment.

He shared how Intercom moved from limiting access to incident reports to publishing them openly on its status page. That change reconnected the company with its core value of transparency and strengthened customer trust. Writing RCAs for a public audience pushes engineers to truly understand their systems and take ownership of their work.

Publishing publicly forces you to really own your availability… and I think customers do appreciate it.

Brian also highlighted how public RCAs benefit the broader industry. By learning from each other’s failures, teams across companies can raise the standard for reliability. He pointed to Cloudflare and Honeycomb as great examples of openness done right—and closed by reminding everyone that accountability and humility are what make transparency meaningful.

Murphy's Law is inevitable. Chaos isn't.

Katherin Schuppener (Inbound Supply Chain Lead, Picnic) and Luis Santos (Domain Tech Lead, Picnic)

In their joint talk, Katherin and Luis shared how Picnic grew from a scrappy startup into a resilient, tech-powered grocery platform serving millions across Europe. Speaking from both business and engineering perspectives, they showed how the company’s “milkman reinvented” model evolved into a highly automated system powered by 30+ warehouses and 400+ microservices—all built through a culture of constant improvement and focus on customers.

They described three key phases of growth. The Honeymoon Phase (2015–2019) was fast-paced and chaotic, with everyone pitching in wherever needed, founders packing groceries included. It was during this time that “Mr. Murphy,” Picnic’s first internal incident bot, was created to bring order to the chaos. Then came the Pandemic Phase (2020–2022), when demand skyrocketed overnight. Picnic rapidly scaled operations and engineering, facing new challenges from automation and logistics. This period shifted the company from reactive firefighting to long-term, sustainable systems.

Finally, in the Structured Scale-Up Phase (2023–present), Picnic introduced formal on-call processes, trained teams in incident response, and adopted incident.io to improve consistency and learning. With SLOs, post-incident reviews, and a company-wide focus on prevention, they found the right balance between structure and culture. Katherin and Luis closed with a simple reminder: growth requires structure, but never at the expense of trust, collaboration, and fun.

Structure growth and nurture the culture… let go of what no longer serves you… and don’t forget to have fun.

On-call - half baked?

John Paris (Principal Systems Engineer, Skyscanner)

John shared how Skyscanner partnered with incident.io to completely reinvent on-call for its massive, globally distributed engineering organization. The old system had become too manual, opaque, and error-prone—making it hard to track workloads, ensure fair pay, or maintain healthy schedules. When incident.io launched its on-call product, Skyscanner seized the chance to rebuild everything from scratch.

Working hand in hand with incident.io, the team co-developed “On-call v2,” designed specifically for Skyscanner’s scale. Together, they tackled challenges like managing time zones and holidays, integrating payroll, and balancing autonomy with governance.

Speak to incident.io and they will build it.

The resulting benefits extended beyond tooling. Incident data became clearer and more actionable, engineers gained a unified view of alerts, incidents, and schedules, and accountability across teams improved. John wrapped up with advice for others: define your goals early, partner with HR and payroll, find internal advocates, and iterate as you go.

How granular is your SLO?

Sam Jewell (Staff Software Engineer, Grafana Labs)

Sam shared a gripping story about how Grafana Labs handled a real security incident with speed, precision, and transparency. The issue started when a Canary Token — a kind of digital tripwire — alerted the team to suspicious activity in a GitHub Action that had exposed sensitive credentials. Within minutes, Grafana’s security team jumped into action, containing the problem before it spread.

Incidents will happen…but what matters is responding with speed, honesty, and a commitment to learning.

Sam credited the company’s strong culture of observability and readiness for the quick response. Clear processes guided every step, from revoking tokens and checking logs to isolating systems, while open communication kept everyone aligned. Once the situation was resolved, Grafana publicly shared the details to maintain trust and help others learn from the experience.

The key lessons: automation can create risks as well as efficiencies, early detection tools are critical, and transparency builds resilience. As Sam reminded everyone, no team is immune to incidents; what matters is how you respond.

We should all be declaring more incidents

Martha Lambert (Product Engineer, incident.io)

Martha took the stage to challenge one of the deepest instincts in engineering teams: treating incidents as something to avoid. Instead, she argued, teams should declare more incidents—and use them as a superpower to build happier, more trusting customers.

At incident.io, her team sees around eight production incidents a day across forty engineers—not because things break more often, but because they intentionally lower the bar for what counts as an incident. This approach helps them catch problems faster, communicate better, and often turn potential frustrations into positive experiences. In one case, a customer issue was fixed within ten minutes—before the customer even reached out.

The heart of her message was about reframing what an “incident” means. Martha outlined three cultural shifts: lower your bar (treating even minor issues as incidents), make incidents the top priority, and put the customer first.

Incidents aren’t failures—they’re opportunities to demonstrate reliability, empathy, and speed. Teams that practice handling them daily become calmer under pressure and earn more trust when things inevitably go wrong.

Declare more incidents. Your customers will thank you for it.

The day the database disappeared

Dr. Claire Knight (Co-Founder & CTO, Sailhouse)

Dr. Knight gave a powerful talk about leadership and failure in engineering—centered on one scary question: what do you do when production disappears? Sharing real stories from her experience, including at GitHub, she showed that how leaders respond in a crisis matters as much as the technical fix. When two engineers accidentally deleted production databases, she focused on calm recovery instead of blame. Her belief: mistakes shouldn’t be punished, because fear only makes people hide them.

Incidents are rarely one person’s fault—they’re usually signs of deeper, systemic issues in process or communication. The goal isn’t to stop every mistake but to build systems and teams that can recover from them.

AI doesn't sleep, doesn't need coffee, doesn't need the happy hour that we're all going to soon. And it's probably gonna be pretty good at inventing very new ways of breaking things.

Looking ahead, AI will increase the chances for failure—automated systems can break things faster and in new ways—making guardrails and recovery planning more important than ever. But breaking production doesn’t make you a bad engineer; it just means you’re a real one.

Maintaining reliability amid layoffs, AI acceleration and acquisitions

Liam Whelan (Director of Reliability Engineering, Zendesk)

Liam described the past year as a perfect storm of layoffs, rapid AI-driven product launches, and the challenge of integrating 13 acquisitions — each with its own culture and reliability standards. When institutional knowledge vanished and teams were left without clear ownership, incidents exposed just how fragile systems could become.

Faced with this, Liam’s focus has shifted from firefighting to building guardrails over heroics. Zendesk is investing in platform teams that standardize production readiness, observability, and reliability metrics — so safety and knowledge are built into the system, not just carried by individuals. At the same time, AI has become both a pressure and a partner: driving the pace of change while also helping reduce toil through alert deduplication, incident summaries, and risk detection.

Liam emphasized that reliability is always forged under stress. The key, he argued, is to make that stress productive — by building systems, teams, and tools resilient enough to grow through it.

A career in SRE is like pushing a boulder uphill. The size of the boulder depends on your tooling, and the strength of your team.

In conclusion

Across every story — from Grafana’s transparent response to a security scare, to Picnic’s journey from startup chaos to scale, to Zendesk’s fight to protect reliability amid constant change, to incident.io’s belief in declaring more incidents — a shared truth emerged: resilience today isn’t about eliminating failure. It’s about building teams, systems, and cultures that adapt when failure inevitably happens.

And increasingly, AI is becoming both the cause and the catalyst for that adaptation. As Sam Jewell showed, smarter SLOs can help filter noise and reduce alert fatigue. For Liam Whelan and Zendesk, AI isn’t just driving new customer features — it’s also forcing engineering organizations to evolve guardrails, rebuild observability, and rethink how they detect, respond, and recover. And within incident.io, AI SRE promises to take on the toil — auto-triaging issues, summarizing incidents, and freeing humans to focus on empathy and judgment where it matters most.

What ties these stories together is a shift in mindset. Incidents are no longer purely technical events; they’re opportunities to build trust, refine processes, and harness intelligence — human and artificial — to improve faster. The future of reliability won’t be about preventing every outage, but about creating systems that learn and respond dynamically.

Because as our tools grow smarter and our systems more complex, the real marker of resilience won’t be perfection — it’ll be our ability to collaborate with AI, communicate with clarity, and turn every moment of chaos into progress.