collapse incident tooling into one auditable Slack workflow
ChatOps incident response: from Alertmanager alert to resolution in Slack
On-call engineers still context-switch between PagerDuty, Grafana, kubectl, and wikis while minutes burn. This guide wires Prometheus Alertmanager into a Slack bot that enriches alerts, posts runbook actions, and executes approved remediation with RBAC.
Why fragmented incident response widens the blast radius
When a production alert fires, the typical on-call path spans PagerDuty acknowledgment, Grafana dashboards, Confluence runbooks, a local terminal for kubectl, and a Slack thread for coordination. Each hop is a context switch under pressure. Diagnostic steps live as tribal knowledge, and when the primary responder is unavailable, mean time to recovery grows. The monitoring stack is rarely the bottleneck. The workflow is.
ChatOps architecture: ingest, orchestrate, execute
ChatOps collapses diagnosis and coordination into the chat system your team already uses during incidents. Three layers are enough for a production pilot. The ingestion layer receives Prometheus Alertmanager webhook payloads when rules fire. The orchestration layer parses alert labels and annotations, queries Prometheus or Grafana for enrichment, and posts Block Kit messages with runbook links and action buttons. The execution layer runs approved remediation commands through a sandboxed executor, with Kubernetes RBAC, request-signature verification for Slack interactivity, and structured audit logs. The same pattern works in Microsoft Teams with different SDK hooks; Slack is used below because most platform teams standardize incident channels there first.
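To make the signature check in the execution layer concrete, here is a minimal sketch of how Slack's v0 request signing can be verified by hand. The function name and the max_age_seconds parameter are illustrative assumptions; in the bot shown later, slack_bolt performs this check itself once it is given the signing secret, so treat this as an explanation of the mechanism rather than code you need to add.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, max_age_seconds=300):
    """Return True if the raw request body was signed with our signing secret.

    timestamp is the X-Slack-Request-Timestamp header, signature is the
    X-Slack-Signature header, and body is the raw request body as bytes.
    """
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Reject stale requests to limit replay of captured payloads.
    if abs(now - request_time) > max_age_seconds:
        return False
    # Slack signs the string "v0:<timestamp>:<raw body>" with HMAC-SHA256.
    base_string = f"v0:{timestamp}:".encode() + body
    digest = hmac.new(signing_secret.encode(), base_string, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how much of the signature matched.
    return hmac.compare_digest(expected.encode(), (signature or "").encode())
```

The check must run against the raw bytes of the request, before any form or JSON parsing, because re-serialized bodies will not produce the same digest.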
Wire Alertmanager to a Slack enrichment bot
Route critical alerts to a dedicated webhook receiver instead of duplicating notification paths. Keep grouping keys aligned with how responders think: alertname plus namespace is a practical default. Add a promql_query annotation to each alert rule so the bot can fetch live values; never pass alertname as a PromQL expression, because it is the rule's name rather than a query. Register the bot's interactivity URL (the /slack/events route below) in your Slack app settings so button clicks hit a signed endpoint.
route:
  receiver: chatops-bot
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: chatops-bot
      continue: true

receivers:
  - name: chatops-bot
    webhook_configs:
      - url: 'https://chatops.example.com/api/alertmanager'
        send_resolved: true

import os
import logging
from flask import Flask, request
from slack_bolt import App
from slack_bolt.adapter.flask import SlackRequestHandler
from slack_sdk import WebClient
import requests

logger = logging.getLogger(__name__)

flask_app = Flask(__name__)
# Bolt verifies the signature on every Slack request using the signing secret.
slack_app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)
handler = SlackRequestHandler(slack_app)
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

SLACK_CHANNEL = os.environ.get("SLACK_CHANNEL", "C0123456789")
PROMETHEUS_URL = os.environ.get("PROMETHEUS_URL", "http://prometheus:9090")


def enrich_alert(alert):
    """Run the rule's promql_query annotation and return a one-line summary."""
    annotations = alert.get("annotations", {})
    query = annotations.get("promql_query")
    if not query:
        return "No promql_query annotation on alert rule."
    try:
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": query},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        if data.get("status") == "success" and data["data"]["result"]:
            return f"Current value: {data['data']['result'][0]['value'][1]}"
    except Exception as exc:
        logger.warning("Prometheus query failed: %s", exc)
    return "Unable to fetch current metrics."


def build_incident_blocks(alert):
    """Build the Block Kit message for a single Alertmanager alert."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    deployment = labels.get("deployment", labels.get("service", "unknown"))
    namespace = labels.get("namespace", "default")
    # The button value carries the restart target back to the action handler.
    action_value = f"{namespace}/{deployment}"
    return [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"Incident: {labels.get('alertname', 'unknown')}",
            },
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Severity:* {labels.get('severity', 'n/a')}"},
                {"type": "mrkdwn", "text": f"*Namespace:* {namespace}"},
                {"type": "mrkdwn", "text": f"*Deployment:* {deployment}"},
                {"type": "mrkdwn", "text": f"*Started:* {alert.get('startsAt', 'unknown')}"},
            ],
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": annotations.get("description", "No description"),
            },
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Context:*\n```{enrich_alert(alert)}```"},
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Runbook"},
                    "url": annotations.get("runbook_url", "https://runbooks.example.com"),
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Restart deployment"},
                    "action_id": "restart_deployment",
                    "value": action_value,
                    "style": "danger",
                    "confirm": {
                        "title": {"type": "plain_text", "text": "Confirm restart"},
                        "text": {
                            "type": "plain_text",
                            "text": f"Restart {deployment} in {namespace}?",
                        },
                        "confirm": {"type": "plain_text", "text": "Restart"},
                        "deny": {"type": "plain_text", "text": "Cancel"},
                    },
                },
            ],
        },
    ]


@flask_app.post("/api/alertmanager")
def receive_alert():
    # Tolerate empty or non-JSON bodies instead of raising inside the handler.
    payload = request.get_json(silent=True) or {}
    for alert in payload.get("alerts", []):
        client.chat_postMessage(
            channel=SLACK_CHANNEL,
            blocks=build_incident_blocks(alert),
            # Fallback text for notifications and clients that cannot render blocks.
            text=f"Incident: {alert.get('labels', {}).get('alertname', 'unknown')}",
        )
    return {"status": "ok"}, 200


@slack_app.action("restart_deployment")
def restart_deployment(ack, body, client):
    ack()  # Slack expects an acknowledgement within 3 seconds.
    namespace, deployment = body["actions"][0]["value"].split("/", 1)
    user = body["user"]["id"]
    # Call an internal executor service with RBAC instead of shelling out from the bot process.
    try:
        result = requests.post(
            "http://remediation-executor/remediate",
            json={
                "action": "rollout_restart",
                "namespace": namespace,
                "deployment": deployment,
                "requested_by": user,
            },
            timeout=30,
        )
        outcome = result.text
    except requests.RequestException as exc:
        logger.error("Remediation request failed: %s", exc)
        outcome = "executor unreachable"
    client.chat_postEphemeral(
        channel=body["channel"]["id"],
        user=user,
        text=f"Restart result: {outcome}",
    )


@flask_app.post("/slack/events")
def slack_events():
    return handler.handle(request)


@flask_app.get("/health")
def health():
    return {"status": "ok"}, 200

Deploy the bot in Kubernetes with least-privilege RBAC
Run the bot in a dedicated monitoring namespace with secrets for the Slack bot token and signing secret. Expose /health for probes and terminate TLS at your ingress controller. Bind a Role that allows only get and patch on Deployments, plus read-only access to Pods, in approved namespaces; never grant cluster-admin to automation that Slack users can trigger. A separate remediation executor service can hold the Kubernetes credentials so the public-facing bot process stays thin.
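As a rough illustration of why get and patch on Deployments are sufficient, the sketch below shows what a rollout restart inside the executor could look like. It assumes the executor passes in an object with the same methods as the official Kubernetes Python client's AppsV1Api; the function names, the ALLOWED_NAMESPACES allow-list, and the return shape are hypothetical. The patch itself mirrors what kubectl rollout restart does: it stamps a restartedAt annotation on the pod template so the Deployment controller rolls the pods.

```python
from datetime import datetime, timezone

# Hypothetical allow-list; keep it in sync with the namespaces where the Role is bound.
ALLOWED_NAMESPACES = frozenset({"production"})


def build_restart_patch(now=None):
    """Build the patch body that triggers a rolling restart of a Deployment."""
    now = now or datetime.now(timezone.utc)
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": now.isoformat()
                    }
                }
            }
        }
    }


def rollout_restart(apps_api, namespace, deployment,
                    allowed_namespaces=ALLOWED_NAMESPACES):
    """Restart a Deployment, refusing namespaces outside the allow-list.

    apps_api is assumed to behave like kubernetes.client.AppsV1Api.
    """
    if namespace not in allowed_namespaces:
        raise PermissionError(f"Namespace {namespace!r} is not approved for remediation.")
    # Uses the "get" verb: fail early if the Deployment does not exist.
    apps_api.read_namespaced_deployment(name=deployment, namespace=namespace)
    # Uses the "patch" verb: changing the pod template starts a new rollout.
    apps_api.patch_namespaced_deployment(
        name=deployment, namespace=namespace, body=build_restart_patch()
    )
    return {"status": "restarted", "namespace": namespace, "deployment": deployment}
```

Checking the allow-list in code as well as in RBAC means a misrouted request fails with a clear message before it reaches the API server.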
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatops-bot
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chatops-bot
  template:
    metadata:
      labels:
        app: chatops-bot
    spec:
      serviceAccountName: chatops-bot
      containers:
        - name: bot
          image: registry.example.com/chatops-bot:latest
          ports:
            - containerPort: 8080
          envFrom:
            - secretRef:
                name: chatops-secrets
          env:
            - name: PROMETHEUS_URL
              value: http://prometheus.monitoring:9090
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chatops-bot
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
---
# The Role has no effect until it is bound. This binding assumes the
# chatops-bot ServiceAccount already exists in the monitoring namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chatops-bot
  namespace: production
subjects:
  - kind: ServiceAccount
    name: chatops-bot
    namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chatops-bot

Operational practices that keep ChatOps safe and auditable
Phase the rollout: in week one, ship read-only enrichment and runbook links; next, enable low-risk actions such as acknowledging an alert or opening log deep links; enable destructive remediation last. Require a second approver for production mutations: post a confirmation thread and execute only after a different on-call engineer approves in-channel, rather than relying on a single-user Slack confirm dialog. Log every action to Loki, Elasticsearch, or your incident database with timestamp, user, alert name, target, and result so post-incident review does not depend on Slack retention. Keep each incident in a dedicated thread: when a related alert fires, reply in the existing thread instead of spamming the channel. ChatOps will not fix weak alerting or missing runbooks, but it removes tool fragmentation so responders spend time on diagnosis and recovery instead of hunting tabs.
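The second-approver rule and the audit record can be sketched in a few lines. This is a minimal illustration under stated assumptions: PendingAction, approve, and audit_record are hypothetical names, state is held in memory, and the record is emitted as one JSON line on the assumption that your log shipper forwards it to Loki, Elasticsearch, or an incident database.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PendingAction:
    """A production mutation waiting for a second engineer's approval."""
    action: str
    target: str        # for example "namespace/deployment"
    alertname: str
    requested_by: str  # Slack user ID of the requester
    approved_by: Optional[str] = None


def approve(pending, approver):
    """Record an approval, rejecting self-approval by the requester."""
    if approver == pending.requested_by:
        raise PermissionError("A different engineer must approve this action.")
    pending.approved_by = approver
    return pending


def audit_record(pending, result, now=None):
    """Serialize one action as a JSON line with the fields needed for review."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "user": pending.requested_by,
            "approved_by": pending.approved_by,
            "alertname": pending.alertname,
            "action": pending.action,
            "target": pending.target,
            "result": result,
        },
        sort_keys=True,
    )
```

In a real bot the pending action would need to survive restarts, so it would live in a datastore rather than in process memory, and the executor would refuse to run anything whose approved_by is still empty.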
ChatOps only helps when alert context is trustworthy, so define signals and dashboards first using our observability baseline for small platform teams.
Tie remediation urgency to reliability policy by connecting incident threads with SLO, SLI, and error budget practices for platform teams.
