The Zenduty Journey, AI-Native Response, and a New Host
Listen to a special throwback episode where new host Jim Hirschauer sits down with SRE veteran Vishwa. They dive deep into the origins of Zenduty, why observability tools often flood you with too much data, and why you can't use tooling to fix culture. A perfect start to Season 3.

Episodes
The Zenduty Journey, AI-Native Response, and a New Host
Reliability is about fixing things, not just resolving them. In this season premiere, we take a trip down memory lane with Vishwa to uncover the story behind Zenduty and how the "Incidentally Reliable" podcast began. Jim and Vishwa discuss the transition to Xurrent, the "needle in the haystack" problem in modern observability, and why culture—not just code—is the key to true reliability.

Once an SRE, always an SRE
In this episode, Sudarshan shares his experience leading high-performing SRE and infrastructure teams at Rippling, Twilio, Walmart, and Epsilon. He talks about reducing CI/CD costs by 60 percent, cutting on-call alerts by 65 percent, and the mindset required to build resilient systems.

CTRL + ALT + Scale: Building More Than Just Code
In this episode, Madhu Rawat (CTO, Xurrent) sits down with Sakshi — Co-founder and Head of Engineering at Kapstan, with leadership experience at Sumo Logic and UpGrad. They discuss the evolution of observability, building for scale, the role of AI in incident management, and what it means to lead engineering teams through change.

GoDaddy's Journey to Hosting Reliability — Incidentally Reliable Podcast with Amit Rindhe
In this episode of Incidentally Reliable, we sit down with Amit Rindhe, Head of Engineering at GoDaddy, to uncover the secrets behind building resilient systems, scaling global operations, and ensuring uptime for millions of users.

Press Start to Scale: SRE in Gaming - Incidentally Reliable with Denys Pashutynski
In our latest episode, we speak with Denys Pashutynski, Senior Engineering Manager of Site Reliability at Roblox, about the formidable challenges of sustaining a global gaming platform. Drawing from his tenure at Twitter, AWS, and eBay, Denys delves into managing traffic surges, latency optimization, and strategic change management.

Battle-Tested Reliability Strategies with Abhishek Ghosh
We dive into the trenches with Abhishek Ghosh, a veteran who led SRE teams at Pinterest and now does so at Cribl. He shares gripping war-room stories from Pinterest, strategies for maintaining uptime, insights into the role of AI in observability, and more! Discover the future of SRE and learn how to navigate the challenges of digital reliability. Tune in to gain valuable lessons from one of the industry's leading experts.
Meet the Veterans
Peek into their journeys so far, the nightmares they manoeuvred through, their war-room stories, and their opinions on the current state of the space.

Incidentally Reliable Blogs

The Reliability Stories You Won’t Hear on LinkedIn
We had the pleasure of meeting Ponmani Palanisamy, a Staff Site Reliability Engineer at LinkedIn, at a recent SRE Meetup in Bangalore. Ponmani gave an insightful talk on "Improving data redundancy and rebalancing data in HDFS." We were cap

Strategies for Scaling Systems Reliably by Bob Lee
I was out there in sunny Austin this February, speaking at Civo Navigate 2024. The event was jam packed with amazing talks, and it was great meeting so many people with long and fascinating careers in engineering and Site Reliability. I ha