The Unofficial KubeCon EU '26 SRE Track

Jorge Lainfiesta

February 25, 2026

Amsterdam. Stroopwafels. Gouda. Canals. And 16 thousand cloud-native engineers crammed into the RAI Convention Centre debating whether to rewrite their observability stack or just add another dashboard.

KubeCon Europe 2026 runs March 24–26, and if you're anything like me, the hardest part isn't getting there: it's figuring out what to actually attend. The schedule is enormous (300+ talks), and every session title sounds equally unmissable until you realize they're all at the same time on Tuesday.

This is the part where I do the work for you. I combed through the schedule with an SRE lens (reliability, observability, incidents, chaos) and picked six sessions worth blocking off. No AI hype for the sake of it. Just talks that I think will leave you with something real to bring back to your team.

I'll be in Amsterdam, so come say hi. You can find me and the Rootly team at our booth. We're also hosting three happy hours across the week, all within a five-minute walk of the venue. RSVP while spots last:

Mon, Mar 23 — KubeCon Europe Kickoff with Rootly, Cloudsmith, Embrace, Kusari, Pulumi, & Checkly | 6–9 PM CET
Tue, Mar 24 — KubeCon Europe Unwind with Rootly, Checkly, Cloudsmith, MetalBear, & Spotify for Backstage | 6–9 PM CET
Wed, Mar 25 — KubeCon Europe Surf's Up Social with Rootly, Tailscale, Port, Zesty, FusionAuth, & Chronosphere | 6–9 PM CET

Without further ado: the unofficial KubeCon EU '26 SRE track.

Case Studies at Scale

1000 Services, 1 Year, 0 Downtime: Airbnb's Zonal Cluster Migration

Following a series of major outages in the summer of 2023, Airbnb's Cloud Infrastructure team made a decision: migrate every single one of their production services (over 1,000 of them) from regional Kubernetes clusters to zonal clusters. In under a year. With zero user-visible downtime.

Sunny Beatteay (Airbnb) will share how a five-person team operationalized this migration across thousands of workloads and a 3,000-engineer organization. The talk explores the technical and organizational strategies behind the effort: rollout automation, capacity planning, and cross-team coordination at a scale that most of us will never have to deal with, and probably never want to.

If you've ever navigated a "we need to change the foundation while keeping the house standing" migration, Sunny's talk is for you.

When: Tuesday, March 24, 2026 | 11:15 – 11:45 CET

Where: Hall 8 | Room D

Add Sunny's talk to your schedule

What Survived Production: Operating Game Backends at Million-Player Scale

"Keep it lean, keep it minimal." That was the motto when Futureplay Games launched their cloud-native game servers back in 2023. Two years and millions of players later, Berkay Uckac will tell you what actually held up, and what didn't.

This talk is the kind of honest post-mortem you rarely get on a conference stage: what failed fast, what stayed simple, and what proved essential in keeping a live game online at scale. Three engineers. Millions of players. A relentless pressure to design for simplicity without sacrificing reliability.

I love this format because it doesn't sell you a perfect architecture. It shows you the pragmatic decisions made under real constraints. Small-team SREs especially will find this one relatable.

When: Tuesday, March 24, 2026 | 11:15 – 11:45 CET

Where: Hall 7 | Room A

Add Berkay's talk to your schedule

Banking on Reliability: Cloud Native SRE Practices in Financial Services

Clément Nussbaumer (PostFinance) brings five years of SRE stories from operating a Kubernetes platform at a major Swiss bank. Banks make for great reliability case studies: the stakes are extremely high, compliance is non-negotiable, and every failed request means a denied payment.

The talk covers how SLOs drove cascading improvements across API server load-balancing, nginx readiness probes, and etcd leadership transitions. Clément will also showcase two open-source Golang monitoring tools the team built from scratch: a DNS server monitoring tool and a distributed mesh for node-to-node checks that catches network problems before they become incidents.

The session ends with a live debugging walkthrough tracking down rare 502 errors (six per million requests), caused by mismatched connection timeouts. If you've ever had to hunt down a bug that only shows up at scale, this one will feel familiar.
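To put "six per million" in perspective, here is a minimal sketch of the error-budget arithmetic. Only the six-per-million figure comes from the talk description. The SLO targets and the helper name `error_budget_consumed` are hypothetical, chosen for illustration, and are not PostFinance's actual objectives.

```python
def error_budget_consumed(errors_per_million: float, slo_target: float) -> float:
    """Return the fraction of an availability error budget used by an error rate.

    slo_target is a success ratio such as 0.9999 (hypothetical, not from the talk).
    """
    error_rate = errors_per_million / 1_000_000
    budget = 1.0 - slo_target  # share of requests the SLO allows to fail
    return error_rate / budget


observed_per_million = 6  # figure quoted in the talk description
success_ratio = 1 - observed_per_million / 1_000_000

print(f"success ratio: {success_ratio:.4%}")
for target in (0.999, 0.9999, 0.99999):  # illustrative SLO targets
    used = error_budget_consumed(observed_per_million, target)
    print(f"SLO {target}: {used:.1%} of the error budget")
```

Under these assumptions, six failures per million is a 99.9994% success ratio. It uses about 6% of the budget for a four-nines SLO and about 60% for a five-nines one, which is why a bug this rare can still be worth hunting.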

When: Wednesday, March 25, 2026 | 15:00 – 15:30 CET

Where: Elicium 2

Add Clément's talk to your schedule

AI's New Role in Ops

From Alert Fatigue To Self-Healing: Building AI-Enabled Control Planes in Banking

Alert fatigue is one of the oldest problems in SRE. You've tuned your thresholds, you've added runbooks, and somehow your on-call engineer is still drowning. Nuno Guedes and Yury Tsarev will share how Millennium bcp, one of Portugal's largest banks, decided to tackle it differently.

Rather than more tuning, they built AI-enhanced Crossplane control planes that bring self-healing and intelligent scaling directly into their multi-cloud platform. LLM-powered composition functions automatically triage and remediate Kubernetes alerts, cutting SRE escalations dramatically. Workload-aware algorithms dynamically scale resources across clouds. And all of it remains fully auditable and compliant, which matters enormously in a regulated banking environment.

What I find most interesting about this talk is that it's not a research paper: it's a real-world implementation of AIOps using CNCF projects like Crossplane and Kubernetes, in production, at a bank. That's a meaningful bar to clear.

When: Tuesday, March 24, 2026 | 16:15 – 16:45 CET

Where: Hall 8 | Room D

Add Nuno's and Yury's talk to your schedule

Observing Chaos: Real-Time Monitoring of AI-Driven Kubernetes Destruction

Traditional chaos engineering is great, but it has a ceiling: you define the failure scenarios in advance, and your system eventually learns to handle them. Josh Halley (Cisco) and Ricardo Aravena (CNCF) asked a genuinely fun question: what if the chaos evolved along with your system's resilience?

The answer, apparently, is DOOM. They integrated ViZDoom and KubeDoom so that reinforcement learning agents play DOOM against live Kubernetes workloads, and as the agents get better, they generate progressively more sophisticated chaos: killing pods, disrupting services, stressing infrastructure. The twist is the observability layer on top of it all: OpenTelemetry for distributed tracing and Cilium for network visibility, feeding into a central dashboard that shows the real-time impact across workloads.

This is one of those talks that's equal parts technically rigorous and delightful. And honestly, if a talk about AI agents destroying your cluster in real time doesn't make it onto an SRE track, what will?

When: Tuesday, March 24, 2026 | 17:00 – 17:30 CET

Where: F002-005

Add Josh's and Ricardo's talk to your schedule

Debugging Deep Dive

Kubernetes Autopsy: Live Debugging a Cluster Meltdown

This one is exactly what it sounds like. Sandeep Kanabar (Gen), Aditi Gupta (Disney+ Hotstar), and Anshika Gupta will reconstruct a real-world cluster catastrophe from the digital crime scene it left behind.

The failure chain is the kind that keeps platform engineers up at night: a memory leak triggered an OOMKill, which destabilized etcd, which corrupted state, which triggered a controller storm, which brought down the control plane. Using actual logs, metrics, and traces, they'll reverse-engineer the incident minute-by-minute, showing how seemingly unrelated symptoms connect into a devastating chain reaction.

What makes this format so useful is that the forensics mindset translates directly back to your own incidents. You'll leave with a sharper instinct for reading signals that look unrelated but aren't, and a better understanding of where Kubernetes failure cascades tend to start.

When: Tuesday, March 24, 2026 | 12:00 – 12:30 CET

Where: Elicium 2

Add this talk to your schedule

See you in Amsterdam. 🌷
