Engineering · 6 min read

SLO-driven alerting that doesn't wake you up at 3 AM

A practical take on building alerts around user-visible symptoms instead of server-side causes.

Most paging setups we inherit alert on causes: CPU is high, a pod is restarting, a disk is full. The problem is that none of those directly matter to the user. They matter to you only inasmuch as they might eventually cause a user-visible failure.

SLO-driven alerting flips this: you define what "working" means for your users, you measure it, and you page only when you're at risk of burning too much of your error budget.
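To make "burning too much of your error budget" concrete, here is a minimal sketch of a burn-rate check. It is an illustration under stated assumptions, not a prescribed implementation: the `Window` type, the function names, the 99.9% target, and the 14.4 threshold are all hypothetical choices for this example. The threshold is a commonly cited starting point (at that rate, one hour of burn spends about 2% of a 30-day budget), and requiring a long and a short window to agree is one common way to stop paging once the problem has already ended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    10.0 means the budget would be exhausted ten times too early.
    """
    if window.total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    error_ratio = window.errors / window.total
    return error_ratio / error_budget


def should_page(
    long_window: Window,
    short_window: Window,
    slo_target: float = 0.999,  # assumed SLO, pick your own
    threshold: float = 14.4,    # assumed burn-rate threshold, pick your own
) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )
```

With a 99.9% target, a sustained 2% error ratio is a burn rate of roughly 20, so `should_page(Window(100_000, 2_000), Window(10_000, 200))` would page. If the short window has already recovered to zero errors, the same long window would not.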