SREDay x Xurrent

We scraped 200k metric samples a second. We still missed the incident.

ON-DEMAND
Jun 20, 2026 9:10 AM

An honest SRE talk about how more metrics quietly became less insight, and the back-to-basics fixes that turned it around.

More metrics didn't mean more insight. It meant more noise, more cost, and a missed incident.

Nandini Bhatt, an SRE at Xurrent, lived it. In her SREDay keynote she walks through a real incident her team missed despite near-perfect instrumentation, the root-cause dig that followed, and the back-to-basics discipline they rebuilt their observability around.

It's practical, specific, and free of buzzwords. No AI hype. No pitch. Just the math most teams quietly get wrong.

AT A GLANCE

  • One service was generating 73% of all their metrics. Fixing it cut total volume by 60% — and they lost nothing.
  • A 32ms average looked fine. The real story was the P99 spiking to 800ms — the one user in a hundred having a terrible time.
  • 99.9% uptime isn't just a number. It's an error budget of about 43 minutes of downtime a month. They'd already burned 70% of it.

"More metrics doesn't mean more insight. Fewer, better metrics move mountains."
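
The 43-minute figure is plain arithmetic on the SLO. Here is a minimal sketch, assuming a 30-day month and reading "burned 70%" as 70% of that month's budget. The function name `error_budget_minutes` is illustrative, not something from the talk.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    Assumes a 30-day window by default; a 31-day month gives slightly more.
    """
    return (1.0 - slo) * days * 24 * 60


budget = error_budget_minutes(0.999)   # 99.9% availability target
burned = 0.70 * budget                 # the 70% already spent, per the page
remaining = budget - burned

print(f"budget: {budget:.1f} min, burned: {burned:.1f} min, left: {remaining:.1f} min")
```

Under those assumptions, a team that has burned 70% of the budget has roughly 13 minutes of downtime left before it breaches 99.9% for the month.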

WHO SHOULD WATCH

  • The SRE drowning in dashboards — more panels than you've ever opened, a pager that still cries wolf.
  • The platform lead watching the bill climb — monitoring spend growing faster than the insight it buys.
  • The team that keeps missing incidents — great tooling, and your users still find the problem first.
  • The engineer rebuilding from scratch — you want a clear rule for what to keep and what to cut.

WHAT YOU'LL LEARN

  • Why 200k metric samples a second can still miss the incident that matters
  • The simple fix that cut their metric volume 60% without losing a thing
  • Why your healthy average is hiding the users you're actually losing
  • How to keep the metrics that predict failure and cut the rest
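
To make the average-versus-P99 point concrete, here is a small sketch using synthetic latencies (not data from the talk), chosen to land near the 32 ms average and 800 ms P99 quoted above. The `percentile` helper is hypothetical and uses the simple nearest-rank method. Production systems usually estimate percentiles from histograms instead.

```python
import math
import statistics


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Synthetic example: 980 fast requests and 20 very slow ones (2% of traffic).
latencies_ms = [16.0] * 980 + [800.0] * 20

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean: {mean_ms:.1f} ms, p50: {p50_ms:.0f} ms, p99: {p99_ms:.0f} ms")
```

In this made-up sample the mean sits near 32 ms and the median at 16 ms, so a dashboard built on averages looks healthy, while the P99 shows the slow tail at 800 ms.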