We built a solid observability stack: OpenTelemetry pipelines, unified metrics, logs, and traces, beautiful Grafana dashboards, everything instrumented. We could see everything. But when incidents hit, we still struggled. Alerts fired, but we didn't know: is this severe? What do we do? Who should respond? Everyone had a different opinion: "2% error rate is fine" vs. "2% is catastrophic." We were improvising every time.

The missing piece wasn't technical. It was organizational. We needed SLOs to define what "working" means (so severity isn't subjective), runbooks to codify remediation steps (so response isn't improvisation), and post-mortems to learn from failures systematically (so we don't repeat mistakes).

Here's what actually worked for us:

**SLOs:** We build availability SLIs from OpenTelemetry span metrics in Prometheus, calculating the percentage of successful requests by comparing successful calls (2xx/3xx) against total calls for each service. We set 99.5% as our SLO, which gives us a 0.5% error budget (about 3.6 hours of downtime per month). Now we know when something is actually broken, not just "different." When we're burning error budget faster than expected, we slow feature releases. (There's a minimal sketch of this calculation at the end of the post.)

**Runbooks:** We connect runbooks directly to alerts via PagerDuty. When an alert fires, the notification includes what's broken (service name, error rate), current vs. expected (SLO threshold), where to look (dashboard link, trace query), and what to do (runbook link). The on-call engineer clicks the runbook and follows the steps. No guessing, no Slack archaeology trying to remember what worked last time. (There's also a sketch of such an alert payload at the end of the post.)

**Post-mortems:** We use a simple template: Impact (users affected, SLO impact), Timeline, Root Cause, What Went Well/Poorly, and Action Items (with owners, P0-P2 priorities, and due dates). The key is prioritizing action items in sprint planning. Otherwise post-mortems become theater where everyone nods, writes "we should monitor better," and changes nothing.

After implementing these practices, our MTTR dropped by 60% in three months. Not because we collected more data, but because we knew how to act on it.

I wrote about the framework, templates, and practical steps here: [From Signals to Reliability: SLOs, Runbooks and Post-Mortems](https://fatihkoc.net/posts/sre-observability-slo-runbooks/)

What practices have helped your team move from reactive firefighting to proactive reliability?
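For the curious, here is a minimal sketch of how an availability SLI and error-budget burn can be computed against the Prometheus HTTP API. This is not our exact implementation: the metric and label names (`calls_total`, `service_name`, `status_code`), the `PROM_URL`, and the `checkout` service are assumptions based on a typical OpenTelemetry spanmetrics setup, and the query treats a non-error span status as "success" as a stand-in for the 2xx/3xx check described above. Adjust the names to whatever your pipeline actually emits.

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder; point at your Prometheus
SERVICE = "checkout"                 # placeholder service name
SLO = 0.995                          # 99.5% availability target

# Availability SLI over the last 30 days: successful calls / total calls.
# Metric and label names are assumptions for a typical spanmetrics setup.
SLI_QUERY = f"""
  sum(rate(calls_total{{service_name="{SERVICE}", status_code!="STATUS_CODE_ERROR"}}[30d]))
/
  sum(rate(calls_total{{service_name="{SERVICE}"}}[30d]))
"""


def query_scalar(promql: str) -> float:
    """Run an instant query against the Prometheus HTTP API and return one value."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])


if __name__ == "__main__":
    sli = query_scalar(SLI_QUERY)
    error_budget = 1 - SLO                   # 0.5% of requests may fail
    budget_spent = (1 - sli) / error_budget  # fraction of the budget burned
    # 0.5% of a 30-day month (720 h) is roughly 3.6 h of allowed downtime.
    print(f"Availability (30d): {sli:.4%}")
    print(f"Error budget spent: {budget_spent:.1%}")
```

In practice this kind of ratio lives in a recording rule or an SLO tool rather than an ad-hoc script, but the arithmetic is the same.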
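And a hedged sketch of the kind of alert payload that puts the runbook one click away, using the PagerDuty Events API v2. The routing key, thresholds, dashboard URL, and runbook URL are placeholders, not our actual configuration; in a real setup this context usually comes from alert-rule annotations rather than a hand-written script.

```python
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_KEY"  # placeholder integration key


def trigger_alert(service: str, error_rate: float, slo_threshold: float) -> None:
    """Send a trigger event whose payload carries everything the on-call needs."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            # What's broken, and current vs. expected.
            "summary": f"{service}: error rate {error_rate:.2%} exceeds SLO threshold {slo_threshold:.2%}",
            "source": service,
            "severity": "critical",
            "custom_details": {
                "current_error_rate": f"{error_rate:.2%}",
                "slo_threshold": f"{slo_threshold:.2%}",
            },
        },
        "links": [
            # Where to look and what to do -- URLs are placeholders.
            {"href": "https://grafana.example.com/d/checkout-slo", "text": "SLO dashboard"},
            {"href": "https://runbooks.example.com/checkout/high-error-rate", "text": "Runbook"},
        ],
    }
    resp = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
    resp.raise_for_status()


if __name__ == "__main__":
    trigger_alert("checkout", error_rate=0.021, slo_threshold=0.005)
```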