Collapse incident tooling into one auditable Slack workflow

ChatOps incident response: from Alertmanager alert to resolution in Slack

On-call engineers still context-switch between PagerDuty, Grafana, kubectl, and wikis while minutes burn. This guide wires Prometheus Alertmanager into a Slack bot that enriches alerts, posts runbook actions, and executes approved remediation with RBAC.

Why fragmented incident response widens the blast radius

When a production alert fires, the typical on-call path spans PagerDuty acknowledgment, Grafana dashboards, Confluence runbooks, a local terminal for kubectl, and a Slack thread for coordination. Each hop is a context switch under pressure. Diagnostic steps live as tribal knowledge, and when the primary responder is unavailable, mean time to recovery grows. The monitoring stack is rarely the bottleneck. The workflow is.

ChatOps architecture: ingest, orchestrate, execute

ChatOps collapses diagnosis and coordination into the chat system your team already uses during incidents. Three layers are enough for a production pilot. The ingestion layer receives Prometheus Alertmanager webhook payloads when alert rules fire. The orchestration layer parses alert labels and annotations, queries Prometheus or Grafana for enrichment, and posts Block Kit messages with runbook links and action buttons. The execution layer runs approved remediation commands through a sandboxed executor with Kubernetes RBAC, signature verification for Slack interactivity, and structured audit logs. The same pattern works in Microsoft Teams with different SDK hooks; Slack is used below because many platform teams standardize incident channels there first.
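
To make the orchestration layer's first step concrete, here is a minimal sketch of parsing an Alertmanager webhook body into typed records. The `ParsedAlert` class and `parse_webhook` function are illustrative names, not part of any library, and the sample payload is trimmed to the fields used here with made-up values.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedAlert:
    """The subset of one Alertmanager alert that the bot needs (illustrative)."""
    name: str
    status: str  # "firing" or "resolved"
    severity: str
    namespace: str
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def parse_webhook(payload: dict) -> list:
    """Turn an Alertmanager webhook body into a list of ParsedAlert objects."""
    parsed = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        parsed.append(
            ParsedAlert(
                name=labels.get("alertname", "unknown"),
                # Each alert carries its own status; fall back to the group status.
                status=alert.get("status", payload.get("status", "firing")),
                severity=labels.get("severity", "n/a"),
                namespace=labels.get("namespace", "default"),
                labels=labels,
                annotations=alert.get("annotations", {}),
            )
        )
    return parsed


# Sample payload with illustrative values only.
sample = {
    "version": "4",
    "status": "firing",
    "groupLabels": {"alertname": "HighErrorRate", "namespace": "production"},
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HighErrorRate",
                "namespace": "production",
                "severity": "critical",
            },
            "annotations": {"description": "5xx ratio above threshold"},
            "startsAt": "2024-01-01T00:00:00Z",
        }
    ],
}
alerts = parse_webhook(sample)
print(alerts[0].name, alerts[0].severity, alerts[0].status)
```

Keeping the parsed record separate from the Slack rendering makes it easier to swap the chat platform later without touching ingestion.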

Wire Alertmanager to a Slack enrichment bot

Route critical alerts to a dedicated webhook receiver instead of duplicating notification paths. Keep grouping keys aligned with how responders think: alertname plus namespace is a practical default. Add a promql_query annotation to each alert rule so the bot can fetch live values; never pass alertname as a PromQL expression, because an alert name is a label value and does not describe the metric behind the alert. Register the bot's interactivity URL (the /slack/events route in the code below) in your Slack app settings so button clicks hit a signed endpoint.
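
Alertmanager repeats the group_by labels in the `groupLabels` field of each webhook payload, so the bot can reuse the same grouping that responders see. The sketch below derives a stable key from those labels and uses it to keep one Slack thread per group. `incident_key` and `ThreadIndex` are hypothetical helpers, and the in-memory dict is an assumption for illustration only; with more than one bot replica the mapping would need shared storage.

```python
import json


def incident_key(payload: dict) -> str:
    """Build a stable key from the group_by labels sent in groupLabels."""
    group_labels = payload.get("groupLabels", {})
    # Sort keys so the same label set always serialises to the same string.
    return json.dumps(group_labels, sort_keys=True)


class ThreadIndex:
    """Hypothetical in-memory map from incident key to Slack thread timestamp."""

    def __init__(self):
        self._threads = {}

    def post(self, client, channel: str, payload: dict, text: str) -> str:
        """Post a new message for a new group, or reply in the group's thread."""
        key = incident_key(payload)
        kwargs = {"channel": channel, "text": text}
        if key in self._threads:
            # thread_ts makes the message a reply under the first notification.
            kwargs["thread_ts"] = self._threads[key]
        response = client.chat_postMessage(**kwargs)
        # Remember the first message's timestamp as the thread root.
        self._threads.setdefault(key, response["ts"])
        return self._threads[key]
```

`client` is expected to be a `slack_sdk.WebClient` (or anything with a compatible `chat_postMessage` method), which keeps the helper easy to exercise with a stub in tests.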

YAML · Alertmanager receiver for ChatOps webhook
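route:
  # The root receiver catches every alert that no child route matches.
  # Point it at your existing default receiver if only critical alerts should reach the bot.
  receiver: chatops-bot
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: chatops-bot
      # continue: true lets later sibling routes (for example a pager route) also match.
      continue: true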

receivers:
  - name: chatops-bot
    webhook_configs:
      - url: 'https://chatops.example.com/api/alertmanager'
        send_resolved: true

Python · Alertmanager webhook and Slack actions with slack_bolt
import os
import logging
from flask import Flask, request
from slack_bolt import App
from slack_bolt.adapter.flask import SlackRequestHandler
from slack_sdk import WebClient
import requests

logger = logging.getLogger(__name__)
flask_app = Flask(__name__)
slack_app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)
handler = SlackRequestHandler(slack_app)
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
SLACK_CHANNEL = os.environ.get("SLACK_CHANNEL", "C0123456789")
PROMETHEUS_URL = os.environ.get("PROMETHEUS_URL", "http://prometheus:9090")


def enrich_alert(alert):
    annotations = alert.get("annotations", {})
    query = annotations.get("promql_query")
    if not query:
        return "No promql_query annotation on alert rule."
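    try:
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": query},
            timeout=10,
        )
        # Treat HTTP errors as failures instead of parsing an error page.
        resp.raise_for_status()
        data = resp.json()
        if data.get("status") == "success" and data["data"]["result"]:
            # Instant-vector samples are [timestamp, "value"]; index 1 is the value string.
            return f"Current value: {data['data']['result'][0]['value'][1]}"
    except (requests.RequestException, ValueError, KeyError, IndexError, TypeError) as exc:
        logger.warning("Prometheus query failed: %s", exc)
    return "Unable to fetch current metrics."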


def build_incident_blocks(alert):
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    deployment = labels.get("deployment", labels.get("service", "unknown"))
    namespace = labels.get("namespace", "default")
    action_value = f"{namespace}/{deployment}"

    return [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"Incident: {labels.get('alertname', 'unknown')}",
            },
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Severity:* {labels.get('severity', 'n/a')}"},
                {"type": "mrkdwn", "text": f"*Namespace:* {namespace}"},
                {"type": "mrkdwn", "text": f"*Deployment:* {deployment}"},
                {"type": "mrkdwn", "text": f"*Started:* {alert.get('startsAt', 'unknown')}"},
            ],
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": annotations.get("description", "No description"),
            },
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Context:*\n```{enrich_alert(alert)}```"},
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Runbook"},
                    "url": annotations.get("runbook_url", "https://runbooks.example.com"),
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Restart deployment"},
                    "action_id": "restart_deployment",
                    "value": action_value,
                    "style": "danger",
                    "confirm": {
                        "title": {"type": "plain_text", "text": "Confirm restart"},
                        "text": {
                            "type": "plain_text",
                            "text": f"Restart {deployment} in {namespace}?",
                        },
                        "confirm": {"type": "plain_text", "text": "Restart"},
                        "deny": {"type": "plain_text", "text": "Cancel"},
                    },
                },
            ],
        },
    ]


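@flask_app.post("/api/alertmanager")
def receive_alert():
    # get_json(silent=True) returns None instead of raising on a missing or invalid body.
    payload = request.get_json(silent=True) or {}
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        if alert.get("status") == "resolved":
            # send_resolved: true also delivers resolutions; do not offer a restart for those.
            client.chat_postMessage(channel=SLACK_CHANNEL, text=f"Resolved: {name}")
            continue
        client.chat_postMessage(
            channel=SLACK_CHANNEL,
            blocks=build_incident_blocks(alert),
            text=f"Incident: {name}",
        )
    return {"status": "ok"}, 200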


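@slack_app.action("restart_deployment")
def restart_deployment(ack, body, client):
    # Acknowledge first: Slack expects a response within three seconds.
    ack()
    namespace, deployment = body["actions"][0]["value"].split("/", 1)
    user = body["user"]["id"]
    # Call an internal executor service with RBAC instead of shelling out from the bot process.
    try:
        result = requests.post(
            "http://remediation-executor/remediate",
            json={"action": "rollout_restart", "namespace": namespace, "deployment": deployment, "requested_by": user},
            timeout=30,
        )
        result.raise_for_status()
        outcome = result.text
    except requests.RequestException as exc:
        # Report the failure to the requester instead of failing silently.
        logger.error("Remediation request failed: %s", exc)
        outcome = f"failed ({exc})"
    client.chat_postEphemeral(
        channel=body["channel"]["id"],
        user=user,
        text=f"Restart result: {outcome}",
    )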


@flask_app.post("/slack/events")
def slack_events():
    return handler.handle(request)


@flask_app.get("/health")
def health():
    return {"status": "ok"}, 200
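
The "signed endpoint" mentioned earlier depends on Slack's request signing, which slack_bolt checks automatically when the App is created with a signing_secret. For readers who want to see what that check involves, the following is a minimal standalone sketch of Slack's v0 scheme using only the standard library. `verify_slack_signature` is an illustrative name; in a real handler the timestamp and signature come from the X-Slack-Request-Timestamp and X-Slack-Signature headers, and the body must be the raw request bytes.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
) -> bool:
    """Return True if the request matches Slack's v0 signing scheme."""
    now = time.time() if now is None else now
    try:
        # Reject requests older than five minutes to limit replay attacks.
        if abs(now - int(timestamp)) > 60 * 5:
            return False
    except ValueError:
        return False
    # The signed string is "v0:<timestamp>:<raw body>".
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    return hmac.compare_digest("v0=" + digest, signature)
```

This is shown for understanding only; letting slack_bolt perform the verification, as the bot code above does, avoids maintaining a second implementation.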

Deploy the bot in Kubernetes with least-privilege RBAC

Run the bot in a dedicated monitoring namespace, with the Slack bot token and signing secret stored in a Kubernetes Secret. Expose /health for probes and terminate TLS at your ingress controller. Bind a Role that allows only get and patch on Deployments, plus read-only access to Pods, in approved namespaces. Never grant cluster-admin to automation that Slack users can trigger. A separate remediation executor service can hold the Kubernetes credentials so the public-facing bot process stays thin.

YAML · Deployment and Role for ChatOps bot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatops-bot
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chatops-bot
  template:
    metadata:
      labels:
        app: chatops-bot
    spec:
      serviceAccountName: chatops-bot
      containers:
        - name: bot
          # Pin a specific tag or digest in production; :latest makes rollbacks ambiguous.
          image: registry.example.com/chatops-bot:latest
          ports:
            - containerPort: 8080
          envFrom:
            - secretRef:
                name: chatops-secrets
          env:
            - name: PROMETHEUS_URL
              value: http://prometheus.monitoring:9090
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chatops-bot
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch"]
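  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
---
# The ServiceAccount referenced by serviceAccountName in the Deployment.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chatops-bot
  namespace: monitoring
---
# A Role grants nothing until it is bound to a subject.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chatops-bot
  namespace: production
subjects:
  - kind: ServiceAccount
    name: chatops-bot
    namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chatops-bot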
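
The patch verb is enough for a rolling restart because `kubectl rollout restart` works by patching a timestamp annotation onto the Deployment's pod template. Below is a minimal sketch of how an executor might do the same with the official `kubernetes` Python client. The `APPROVED_NAMESPACES` allowlist and both function names are assumptions for illustration; the article does not specify how the executor is implemented.

```python
from datetime import datetime, timezone

# Hypothetical allowlist; the article only says "approved namespaces".
APPROVED_NAMESPACES = {"production"}


def build_restart_patch(now=None) -> dict:
    """Patch body that bumps a pod-template annotation to trigger a rolling restart."""
    now = now or datetime.now(timezone.utc)
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        # The same annotation that `kubectl rollout restart` sets.
                        "kubectl.kubernetes.io/restartedAt": now.isoformat()
                    }
                }
            }
        }
    }


def rollout_restart(namespace: str, deployment: str) -> None:
    """Restart a Deployment, refusing namespaces outside the allowlist."""
    if namespace not in APPROVED_NAMESPACES:
        raise PermissionError(f"namespace {namespace!r} is not approved for remediation")
    # Imported lazily so the patch builder can be tested without a cluster.
    from kubernetes import client, config

    # Uses the pod's mounted ServiceAccount token, so RBAC still applies.
    config.load_incluster_config()
    client.AppsV1Api().patch_namespaced_deployment(
        name=deployment, namespace=namespace, body=build_restart_patch()
    )
```

The allowlist check in code is a second line of defence; the Role and RoleBinding remain the real boundary, since the API server rejects patches in namespaces where no binding exists.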

Operational practices that keep ChatOps safe and auditable

Phase the rollout: start with read-only enrichment and runbook links in week one, then enable low-risk actions such as acknowledging an alert or posting deep links to logs, and enable destructive remediation last. Require a second approver for production mutations: post a confirmation thread and execute only after a different on-call engineer approves in-channel, because a single-user Slack confirm dialog only guards against misclicks. Log every action to Loki, Elasticsearch, or your incident database with timestamp, user, alert name, target, and result so post-incident review does not depend on Slack retention. Keep each incident in a dedicated thread: when a related alert fires, reply in the existing thread instead of spamming the channel. ChatOps will not fix weak alerting or missing runbooks, but it removes tool fragmentation so responders spend time on diagnosis and recovery instead of hunting tabs.
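
The second-approver rule and the audit record can be sketched together. This is a minimal illustration under stated assumptions: `PendingAction` and `audit_record` are hypothetical names, state is held in memory, and the JSON line would be shipped to whichever log store you use.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PendingAction:
    """A production mutation waiting for a second approver (hypothetical model)."""
    action: str
    target: str
    alert_name: str
    requested_by: str
    approved_by: Optional[str] = None

    def approve(self, user_id: str) -> bool:
        """Record the approval only if the approver differs from the requester."""
        if user_id == self.requested_by:
            return False
        self.approved_by = user_id
        return True

    @property
    def executable(self) -> bool:
        """True once a different user has approved the action."""
        return self.approved_by is not None


def audit_record(pending: PendingAction, result: str, now=None) -> str:
    """Serialise one action as a JSON line with timestamp, user, alert, target, result."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "user": pending.requested_by,
            "approved_by": pending.approved_by,
            "alert_name": pending.alert_name,
            "action": pending.action,
            "target": pending.target,
            "result": result,
        },
        sort_keys=True,
    )
```

A handler for an "Approve" button would call `approve` with the clicking user's Slack ID, run the remediation only when `executable` is true, and write `audit_record(...)` whether the action succeeded or failed.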

ChatOps only helps when alert context is trustworthy, so define signals and dashboards first using our observability baseline for small platform teams.

Tie remediation urgency to reliability policy by connecting incident threads with SLO, SLI, and error budget practices for platform teams.