debug kernel-level latency and network issues in production
eBPF in production: kernel-level observability and debugging for DevOps teams
Application metrics cannot explain TCP retransmissions, cgroup scheduling delays, or syscall hot paths. eBPF runs sandboxed programs in the Linux kernel to observe those signals with minimal overhead—without strace, sidecars, or kernel rebuilds.
Why application metrics miss kernel-level failures
Prometheus and APM agents excel at service latency, error rate, and saturation. They cannot see TCP retransmissions, cgroup throttling, inode pressure, or lock contention inside the kernel. strace and tcpdump answer point questions but add risk and CPU cost on live nodes. Transient symptoms—memory pressure spikes, futex storms, TCP window collapse—often disappear before an engineer SSHs in. Teams discover they need kernel data only after hours of application-level log spelunking during a P1. eBPF closes that gap by attaching verified programs to kernel hook points and exporting structured events to user space with overhead measured in single-digit percent when scoped correctly.
Production stack: kernel programs, aggregation, and visualization
A typical production layout has three layers. Kernel programs attach to tracepoints, kprobes, uprobes, or LSM hooks and write to BPF maps such as hash maps, ring buffers, or per-CPU arrays. An aggregation tier reads those maps: Cilium Hubble for Kubernetes network flows and policy verdicts, Falco or Tetragon for runtime security events, Pixie for automatic telemetry without code changes, Parca or Pyroscope for continuous CPU and memory profiles, and bpftrace for ad-hoc one-liners on a node. Visualization consumes the aggregated data through Grafana, Prometheus scrape endpoints, or OTLP exporters. Cilium plus Hubble fits teams replacing or augmenting their CNI for cluster-wide flow visibility. Falco and Tetragon suit security use cases, but introduce them only after a read-only observability rollout. bpftrace belongs in time-limited break-glass debugging, not in a permanent DaemonSet.
bpftrace: find syscall hot paths on a noisy pod
When pod CPU is high but application metrics look flat, the bottleneck is often syscalls—not business logic. Install bpftrace on the node or run the official container with host PID and debugfs mounted. Resolve the container PID through crictl or kubectl debug, then count syscall entry probes for that process. Dominant futex or epoll_wait counts point to lock or event-loop contention rather than compute. Time-box ad-hoc runs, capture output to a ticket, and avoid peak traffic windows without a change record.
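Once a run like the one described here has been captured to a ticket, ranking the counts makes the dominant syscalls explicit. The following is a minimal sketch, not part of bpftrace itself: it assumes the default text output of a bpftrace count map (`@[key]: count`), and the helper names `parse_counts` and `rank` are hypothetical. The sample uses the futex, epoll_wait, and clock_nanosleep counts quoted in this article.

```python
import re
from typing import Dict, List, Tuple

# Assumed line shape of a bpftrace count map, e.g.
#   @[tracepoint:syscalls:sys_enter_futex]: 380112
MAP_LINE = re.compile(r"^@\[(?P<key>[^\]]+)\]:\s+(?P<count>\d+)\s*$")


def parse_counts(text: str) -> Dict[str, int]:
    """Collect key -> count from bpftrace map output, ignoring other lines."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = MAP_LINE.match(line.strip())
        if match:
            key = match.group("key")
            counts[key] = counts.get(key, 0) + int(match.group("count"))
    return counts


def rank(counts: Dict[str, int], top: int = 3) -> List[Tuple[str, int, float]]:
    """Return the top entries as (key, count, share of all counted calls)."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return [(key, n, (n / total) if total else 0.0) for key, n in ordered]


# Illustrative sample built from the counts quoted in this article.
SAMPLE = """\
Attaching 3 probes...
@[tracepoint:syscalls:sys_enter_clock_nanosleep]: 1200
@[tracepoint:syscalls:sys_enter_futex]: 380112
@[tracepoint:syscalls:sys_enter_epoll_wait]: 890455
"""

ranked = rank(parse_counts(SAMPLE), top=2)
```

Here `ranked` lists epoll_wait first and futex second, which is the event-loop and lock contention pattern the paragraph describes. Reporting each entry's share of the total is a design choice that keeps runs of different lengths comparable.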
# Ubuntu / Debian
sudo apt-get install -y bpftrace
# Amazon Linux 2023
sudo dnf install -y bpftrace
# Ephemeral container on the node (requires privileged debug access)
docker run --rm --privileged --pid=host \
  -v /sys/kernel/debug:/sys/kernel/debug:ro \
  quay.io/bpftrace/bpftrace:latest bpftrace --version

# Resolve the host PID of the target container
PID=$(crictl inspect <container-id> | jq -r .info.pid)
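# On many bpftrace releases -p does not filter tracepoint probes,
# so filter on pid explicitly ($1 is the host PID passed after the program)
bpftrace -e 'tracepoint:syscalls:sys_enter_* /pid == $1/ { @[probe] = count(); }' "$PID"

# Example output (illustrative counts)
@[tracepoint:syscalls:sys_enter_clock_nanosleep]: 1200
@[tracepoint:syscalls:sys_enter_futex]: 380112
@[tracepoint:syscalls:sys_enter_epoll_wait]: 890455

# Histogram of tcp_sendmsg latency for one IPv4 destination
bpftrace -e '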
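// kprobes expose arguments as arg0..argN, so cast arg0 to struct sock
// (resolved through BTF). 0x2d01000a is 10.0.1.45 in network byte order
// as read on a little-endian host.
kprobe:tcp_sendmsg
/((struct sock *)arg0)->__sk_common.skc_daddr == 0x2d01000a/
{
  @start[tid] = nsecs;
}

kretprobe:tcp_sendmsg
/@start[tid]/
{
  @latency_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

Cilium Hubble: cluster-wide network flow and policy visibility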
For Kubernetes estates, Cilium with Hubble exposes L3/L4 flows, DNS metadata, and policy verdicts without instrumenting application code. Enable Hubble relay, UI, and metrics during Helm install, then use the Hubble CLI to filter dropped flows by namespace and protocol—misconfigured NetworkPolicy shows up as DROPPED verdicts immediately. Scrape Hubble metrics from Cilium pods on port 9965 into Prometheus for dashboards and alert rules. Treat Cilium as a platform decision: validate kernel version, BTF availability, and kube-proxy replacement implications in staging before production cutover.
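To turn a stream of flow lines into a shortlist of NetworkPolicy suspects, dropped flows can be grouped by source and destination workload. This is a minimal sketch under a stated assumption: it parses the compact one-line format shown in this article's sample output, which may differ between Hubble versions, so the regular expression and the helper name `dropped_pairs` are illustrative rather than part of any Hubble API. For anything long-lived, the CLI's JSON output (`hubble observe -o json`) is likely a sturdier input.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line shape, taken from this article's sample output:
#   <timestamp>: <ip:port> (<ns/pod>) -> <ip:port> (<ns/pod>) <reason> <VERDICT> (<proto>)
FLOW_LINE = re.compile(
    r"\((?P<src>[^)]+)\)\s+->\s+\S+\s+\((?P<dst>[^)]+)\)"
    r".*\b(?P<verdict>FORWARDED|DROPPED)\b"
)


def dropped_pairs(lines: Iterable[str]) -> Counter:
    """Count DROPPED flows per (source workload, destination workload)."""
    pairs: Counter = Counter()
    for line in lines:
        match = FLOW_LINE.search(line)
        if match and match.group("verdict") == "DROPPED":
            pairs[(match.group("src"), match.group("dst"))] += 1
    return pairs


# The two sample flow lines from this article.
SAMPLE = [
    "Jul 3 10:23:45.123: 10.0.1.45:54321 (default/payment-svc) -> "
    "10.0.2.12:5432 (default/payment-db) to-stack FORWARDED (TCP)",
    "Jul 3 10:23:45.234: 10.0.3.88:42110 (prod/order-svc) -> "
    "10.0.2.12:5432 (default/payment-db) Policy denied DROPPED (TCP)",
]

drops = dropped_pairs(SAMPLE)
```

With the sample input, `drops` contains a single pair, `prod/order-svc` to `default/payment-db`, which points at the policy that governs traffic into the database namespace.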
helm repo add cilium https://helm.cilium.io
helm repo update
helm upgrade --install cilium cilium/cilium \
  --version 1.16.0 \
  --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set hubble.metrics.enabled="{dns,http,tcp,flow,drop}" \
  --set prometheus.enabled=true

# Install the Hubble CLI
export HUBBLE_VERSION=$(curl -fsSL https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --remote-name-all \
  "https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
tar xzf hubble-linux-amd64.tar.gz
sudo mv hubble /usr/local/bin

# Show dropped TCP flows in one namespace (the CLI needs a connection to Hubble Relay)
hubble observe --namespace production --protocol tcp --verdict DROPPED

# Example flow lines (illustrative; a FORWARDED flow is included for contrast)
Jul 3 10:23:45.123: 10.0.1.45:54321 (default/payment-svc) -> 10.0.2.12:5432 (default/payment-db) to-stack FORWARDED (TCP)
Jul 3 10:23:45.234: 10.0.3.88:42110 (prod/order-svc) -> 10.0.2.12:5432 (default/payment-db) Policy denied DROPPED (TCP)

# Prometheus scrape config for Hubble metrics
scrape_configs:
  - job_name: hubble
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [kube-system]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_k8s_app]
        regex: cilium
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: ${1}:9965

Operational practices for safe production eBPF adoption
Start with read-only observability (Hubble, break-glass bpftrace, Parca profiling) before enabling enforcement in Tetragon or strict Cilium NetworkPolicy modes. Deploy continuous profiling with Parca or Grafana Pyroscope to see which kernel and user paths consume CPU, and store profiles on a persistent volume for regression comparison. Bound every long-lived BPF map: oversized or never-pruned hash maps are a common cause of kernel memory pressure. CO-RE and BTF require kernel 5.2 or newer built with BTF type information (CONFIG_DEBUG_INFO_BTF=y); cgroup v2 integration needs 5.7 or newer. Run bpftool feature probe before rollout and pin kernel minor versions in node images. Version BPF programs in git, document what each probe measures, and test in staging on the same kernel build as production. eBPF complements Prometheus, OpenTelemetry traces, and log aggregators; it does not replace them. A typical drill-down starts with high latency in Prometheus, moves to dominant futex counts in bpftrace, then to TCP retransmissions or DROPPED flows in Hubble, and ends with a fix to NetworkPolicy or routing. Monitor eBPF itself with the JSON output of bpftool prog and bpftool map, watch for dropped ring buffer events, and align security tooling with broader cluster hardening practices.
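The kernel minimums above can be turned into a coarse pre-rollout gate for node images. This is a minimal sketch that only compares version numbers: the thresholds are the ones stated in this article, the feature labels and the function names `parse_release` and `supported_features` are hypothetical, and distribution kernels often backport features, so `bpftool feature probe` remains the authoritative check.

```python
import re
from typing import Dict, Tuple

# Minimum kernel versions as stated in this article (assumptions to
# re-verify for each distribution, since vendors backport features).
MINIMUMS: Dict[str, Tuple[int, int]] = {
    "co_re_btf": (5, 2),
    "cgroup_v2_integration": (5, 7),
}


def parse_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-105-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def supported_features(release: str) -> Dict[str, bool]:
    """Map each feature label to whether the release meets its minimum version."""
    version = parse_release(release)
    return {name: version >= minimum for name, minimum in MINIMUMS.items()}
```

On a node, `supported_features(platform.release())` (with `import platform`) gives a quick first answer; a release such as `5.4.0-150-generic` passes the CO-RE/BTF threshold but not the cgroup v2 one. Comparing tuples rather than strings avoids ordering `5.10` before `5.2`.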
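helm upgrade --install parca oci://ghcr.io/parca-dev/parca/charts/parca \
  --namespace observability --create-namespace \
  --set persistentVolume.enabled=true \
  --set persistentVolume.size=50Gi

# bpftrace has no LRU map constructor. Cap the key count instead; a full map
# rejects new keys rather than evicting old ones, so clear it periodically.
# (Variable name on recent bpftrace; older releases use BPFTRACE_MAP_KEYS_MAX.)
BPFTRACE_MAX_MAP_KEYS=10240 bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @calls[pid] = count(); }
interval:s:60 { print(@calls); clear(@calls); }'

bpftool feature probe kernel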
bpftool prog show --json | jq -r '.[] | "\(.name): xlated \(.bytes_xlated // 0) bytes"'
bpftool map show --json | jq -r '.[] | "\(.name): max_entries=\(.max_entries // "n/a")"'

Layer kernel signals on the application baseline from our observability setup for small platform teams.
When distributed traces stop at the syscall boundary, continue the investigation with OpenTelemetry distributed tracing in production Kubernetes.
