Define reliability targets with measurable error budgets

SLOs, SLIs, and error budgets for platform teams: a minimal reliability contract


Dashboards and alert volume do not define reliability. This guide shows how small platform teams pick one or two user-facing SLIs, set a 30-day SLO with an error budget, wire multi-window burn alerts, and connect budget policy to release decisions.

Why more alerts do not equal better reliability

Teams that scale monitoring often end up with hundreds of thresholds and still miss user-visible outages. The gap is structural: infrastructure metrics describe components, while reliability is defined by whether critical journeys succeed within agreed bounds. Service Level Indicators (SLIs) quantify that journey-level behavior. Service Level Objectives (SLOs) set a target for an SLI over a rolling window. The error budget is the share of bad events the objective tolerates over that window (1 minus the SLO target, so 0.1% for a 99.9% objective). It turns reliability from a vague goal into a shared, measurable contract between platform and product engineering.

Choose one or two SLIs tied to user journeys

Resist defining SLIs for every microservice. Pick the smallest set that represents customer trust: availability (successful requests divided by valid requests), latency (proportion of requests faster than a threshold), or freshness (age of the latest successful pipeline run for async workflows). Map each SLI to a single critical path — checkout, login, or API sync — and instrument at the edge (ingress, gateway, or synthetic probe) so load balancer retries do not mask user pain. If you already run user-impact dashboards from an observability baseline, reuse those series as SLI sources instead of inventing parallel metrics.
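
As a rough illustration of the three SLI shapes named above, the Python sketch below computes each one from a handful of edge observations. The `EdgeRequest` record, the function names, and the 300 ms threshold are hypothetical; availability follows the same convention as the queries later in this guide, where any non-5xx response counts as good.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EdgeRequest:
    """One request as observed at the edge (hypothetical record shape)."""
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # end-to-end latency measured at the edge


def availability_sli(requests: list[EdgeRequest]) -> float:
    """Share of requests that did not end in a 5xx response."""
    if not requests:
        return 1.0  # no traffic means nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[EdgeRequest], threshold_ms: float = 300.0) -> float:
    """Share of requests that completed faster than the threshold."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


def freshness_seconds(last_success: datetime, now: datetime) -> float:
    """Age of the latest successful pipeline run, in seconds."""
    return (now - last_success).total_seconds()


sample = [
    EdgeRequest(200, 120.0),
    EdgeRequest(200, 450.0),
    EdgeRequest(503, 900.0),
    EdgeRequest(404, 80.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.5
print(freshness_seconds(
    datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
))  # 900.0
```

In production these ratios would normally come from the metrics backend rather than from in-process lists; the sketch only pins down what "good" and "total" mean for each indicator.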

Define the SLO and error budget on a 30-day window

A common starting target for request-serving APIs is 99.9% availability over a rolling 30-day window. That sounds strict, but it leaves an error budget of 0.1%: about 43 minutes of full outage per window (0.1% of 43,200 minutes) if you count time, or 1 failed request in every 1,000 if you count occurrences, as the manifest below does. That is enough to absorb small blips while still forcing attention when incidents stack up. Document the objective in plain language: who owns it, which environments count, and whether maintenance windows are excluded. Store the definition in Git (OpenSLO or an internal YAML schema) so changes go through review like application code. The manifest below defines a single availability SLI and objective for an HTTP API.

OpenSLO · availability objective
apiVersion: openslo/v1
kind: SLO
metadata:
  name: api-availability
spec:
  service: public-api
  budgetingMethod: Occurrences
  # Rolling 30-day window that the objective is evaluated over.
  timeWindow:
    - duration: 30d
      isRolling: true
  objectives:
    - target: 0.999
      displayName: 99.9% availability (30d)
      indicator:
        metadata:
          name: api-availability-sli
        spec:
          ratioMetric:
            # The queries already apply rate(), so they are not raw counters.
            counter: false
            good:
              metricSource:
                type: Prometheus
                spec:
                  # Good events: every response that is not a 5xx.
                  query: sum(rate(http_requests_total{status!~"5.."}[5m]))
            total:
              metricSource:
                type: Prometheus
                spec:
                  query: sum(rate(http_requests_total[5m]))
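
The budget arithmetic behind a 99.9% target over 30 days can be checked with a few lines of Python. This is a minimal sketch: the function names and the request counts in the usage lines are illustrative assumptions, not part of any SLO tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based budget: minutes of full outage the SLO tolerates."""
    return (1.0 - slo_target) * window_days * 24 * 60


def allowed_bad_requests(slo_target: float, total_requests: int) -> float:
    """Occurrence-based budget: failed requests the SLO tolerates."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, bad_requests: int, total_requests: int) -> float:
    """Fraction of the budget still unspent; negative once the SLO is missed."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    return 1.0 - bad_requests / allowed_bad_requests(slo_target, total_requests)


print(round(error_budget_minutes(0.999), 1))                 # 43.2
print(round(allowed_bad_requests(0.999, 10_000_000)))        # 10000
print(round(budget_remaining(0.999, 4_000, 10_000_000), 2))  # 0.6
```

With 10 million requests in the window, 4,000 failures spend 40% of the budget and leave 60%, which is the number a budget policy would act on.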

Implement burn-rate alerts instead of static thresholds

Static error-rate thresholds fire on brief spikes during deploys yet stay quiet during slow burns that can consume the entire monthly budget in a week. Multi-window, multi-burn-rate alerting, an approach drawn from Google SRE practice, compares budget consumption over a long and a short window. Burn rate is the observed error ratio divided by the budgeted error ratio, so a burn rate of 1 spends exactly the whole budget over the SLO window. Page on fast burn (for example 14.4x, which spends 2% of a 30-day budget in one hour) and ticket on slow burn (for example 6x over six hours, which spends 5%). Wire alerts to runbooks that name the owning team and the first mitigation step. The Prometheus rules below record the availability SLI and define a representative fast-burn alert; tune windows and factors to your SLO target.

Prometheus · SLI recording and fast burn
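groups:
  - name: slo-recording
    rules:
      # Availability SLI over the short (5m) and long (1h) windows.
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))
      - record: sli:http_availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="api",status!~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="api"}[1h]))
      # Page only when both windows burn faster than 14.4x the 0.1% budget.
      - alert: SLOAvailabilityBudgetBurnFast
        expr: |
          (1 - sli:http_availability:ratio_rate1h) > (14.4 * 0.001)
          and
          (1 - sli:http_availability:ratio_rate5m) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: Fast burn of API availability error budget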
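
To see where factors such as 14.4 and 6 come from, the Python sketch below derives a burn-rate threshold from the share of budget you are willing to spend inside an alert window, then applies the two-window check. The function names are hypothetical, and the 2% and 5% shares are the example figures commonly cited from the Google SRE Workbook, so treat them as starting points.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio as a multiple of the budgeted error ratio."""
    return error_ratio / (1.0 - slo_target)


def burn_rate_threshold(budget_share: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_share` of the budget within the alert window."""
    return budget_share * (slo_window_days * 24) / alert_window_hours


def should_alert(long_window_error_ratio: float, short_window_error_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_window_error_ratio, slo_target) > threshold
            and burn_rate(short_window_error_ratio, slo_target) > threshold)


fast = burn_rate_threshold(0.02, 1)  # 2% of the budget in 1 hour
slow = burn_rate_threshold(0.05, 6)  # 5% of the budget in 6 hours
print(round(fast, 1), round(slow, 1))  # 14.4 6.0

# 2% errors over the last hour and 3% over the last 5 minutes: still burning.
print(should_alert(0.02, 0.03, slo_target=0.999, threshold=fast))    # True
# The hour looks bad, but the last 5 minutes have recovered: no page.
print(should_alert(0.02, 0.0005, slo_target=0.999, threshold=fast))  # False
```

The short window is what lets the alert clear soon after a fix lands, instead of paging for the rest of the hour.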

Connect SLOs to release cadence and prioritization

When plenty of error budget remains, teams can ship features and accept normal operational risk. When budget is nearly exhausted, slow feature work and fund reliability fixes: dependency upgrades, retry tuning, cache externalization, or chaos experiments that validate mitigations. This is not a blanket deploy freeze — it is a prioritization signal product managers can understand. Pair the signal with delivery metrics from your release pipeline: if lead time is already high, burning budget on risky Friday deploys is worse than deferring a non-critical change. Conversely, if budget is healthy but change failure rate is climbing, the problem may be pipeline quality rather than the SLO target being too loose.

Write a lightweight error budget policy

A one-page policy beats a heavyweight process. Specify three bands: green (more than 50% budget left — normal delivery), yellow (25–50% — require reliability review on risky changes), red (under 25% or exhausted — halt non-critical releases, staff incident follow-ups, and schedule hardening work). Name who can grant exceptions and for how long. Link each exception to a ticket so you can audit whether exceptions correlate with repeat incidents. Revisit targets quarterly: an SLO nobody misses is too loose; an SLO breached every week trains teams to ignore the signal.
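
The three bands are simple enough to encode next to the SLO definition so that a release job or dashboard can read them. The Python sketch below is one possible encoding; the `Band` enum, the function name, and the action strings are hypothetical, and the boundary values (exactly 50% and exactly 25% count as yellow) are an assumption the policy text leaves open.

```python
from enum import Enum


class Band(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# Actions paraphrased from the one-page policy; adjust to your own wording.
POLICY_ACTIONS = {
    Band.GREEN: "normal delivery",
    Band.YELLOW: "require reliability review on risky changes",
    Band.RED: "halt non-critical releases and schedule hardening work",
}


def budget_band(remaining_fraction: float) -> Band:
    """Map the fraction of error budget left (1.0 = untouched) to a policy band."""
    if remaining_fraction > 0.50:
        return Band.GREEN
    if remaining_fraction >= 0.25:
        return Band.YELLOW
    return Band.RED  # under 25%, exhausted, or overspent (negative)


for remaining in (0.80, 0.40, 0.10, -0.20):
    band = budget_band(remaining)
    print(f"{remaining:+.2f} -> {band.value}: {POLICY_ACTIONS[band]}")
```

Exceptions stay outside the function on purpose: they are a human decision recorded in a ticket, not something the band calculation should silently override.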

Close the loop after incidents without heavyweight postmortems

After each user-impacting event, ask three questions: how many minutes of error budget did we spend, was the burn alert timely, and do SLI queries still match reality? Update recording rules, dashboards, and runbooks in the same sprint — not in a month-long review document. When repeated burns trace to the same dependency, promote a chaos experiment or load test into the regular pipeline so the fix is validated under failure, not only in theory. Over time the SLO becomes the language product and platform teams use to negotiate speed versus stability, instead of debating whether monitoring is noisy enough.
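
For the first of those three questions, a back-of-the-envelope conversion is usually enough. The Python sketch below weights the incident duration by the share of requests that failed and expresses the result against the 30-day budget. It assumes roughly uniform traffic during the incident, and the function names are hypothetical.

```python
def incident_budget_minutes(duration_minutes: float, error_ratio: float) -> float:
    """Budget minutes spent: incident length weighted by the share of failed requests."""
    return duration_minutes * error_ratio


def share_of_budget(spent_minutes: float, slo_target: float = 0.999,
                    window_days: int = 30) -> float:
    """Spent minutes as a fraction of the whole window's error budget."""
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
    return spent_minutes / budget_minutes


# A 30-minute incident during which 40% of requests failed.
spent = incident_budget_minutes(30, 0.4)
print(spent)                                # 12.0
print(round(share_of_budget(spent) * 100))  # 28 (percent of the 30-day budget)
```

If traffic was far from uniform, count failed and total requests over the incident interval instead and compare them with the occurrence-based budget.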

SLOs only work when telemetry reflects user impact, so start from the signals and alerting habits in our observability setup guide for small platform teams.

When error budget is exhausted, release policy matters as much as monitoring — pair this framework with the release pipeline bottlenecks diagnostic.