
eBPF in production: kernel-level observability and debugging for DevOps teams


Application metrics cannot explain TCP retransmissions, cgroup scheduling delays, or syscall hot paths. eBPF runs sandboxed programs in the Linux kernel to observe those signals with minimal overhead—without strace, sidecars, or kernel rebuilds.

Why application metrics miss kernel-level failures

Prometheus and APM agents excel at service latency, error rate, and saturation. They cannot see TCP retransmissions, cgroup throttling, inode pressure, or lock contention inside the kernel. strace and tcpdump answer point questions but add risk and CPU cost on live nodes. Transient symptoms—memory pressure spikes, futex storms, TCP window collapse—often disappear before an engineer SSHs in. Teams discover they need kernel data only after hours of application-level log spelunking during a P1. eBPF closes that gap by attaching verified programs to kernel hook points and exporting structured events to user space with overhead measured in single-digit percent when scoped correctly.

Production stack: kernel programs, aggregation, and visualization

A typical production layout has three layers. Kernel programs attach to tracepoints, kprobes, uprobes, or LSM hooks and write to BPF maps (hash maps, ring buffers, or per-CPU arrays). An aggregation tier reads those maps: Cilium Hubble for Kubernetes network flows and policy verdicts, Falco or Tetragon for runtime security events, Pixie for no-code auto-telemetry, Parca or Pyroscope for continuous CPU and memory profiles, and bpftrace for ad-hoc one-liners on a node. Visualization consumes the aggregated data through Grafana, Prometheus scrape endpoints, or OTLP exporters. Cilium plus Hubble fits teams replacing or augmenting their CNI for cluster-wide flow visibility. Falco and Tetragon suit security use cases but should be rolled out only after read-only observability is in place. bpftrace belongs in break-glass debugging with time limits, not in a permanent DaemonSet.

bpftrace: find syscall hot paths on a noisy pod

When pod CPU is high but application metrics look flat, the bottleneck is often syscalls—not business logic. Install bpftrace on the node or run the official container with host PID and debugfs mounted. Resolve the container PID through crictl or kubectl debug, then count syscall entry probes for that process. Dominant futex or epoll_wait counts point to lock or event-loop contention rather than compute. Time-box ad-hoc runs, capture output to a ticket, and avoid peak traffic windows without a change record.

Bash · install bpftrace on the node

# Ubuntu / Debian
sudo apt-get install -y bpftrace

# Amazon Linux 2023
sudo dnf install -y bpftrace

# Ephemeral container on the node (requires privileged debug access)
docker run --rm --privileged --pid=host \
  -v /sys/kernel/debug:/sys/kernel/debug:ro \
  quay.io/bpftrace/bpftrace:latest bpftrace --version

Bash · resolve container PID and count syscalls

PID=$(crictl inspect <container-id> | jq -r .info.pid)

# Filter on the host PID and attach to the per-syscall tracepoints so each syscall gets its own key.
bpftrace -e "tracepoint:syscalls:sys_enter_* /pid == $PID/ { @[probe] = count(); }"

Example · syscall counts revealing lock contention

@[tracepoint:syscalls:sys_enter_clock_nanosleep]: 1200
@[tracepoint:syscalls:sys_enter_futex]: 380112
@[tracepoint:syscalls:sys_enter_epoll_wait]: 890455
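
Raw counts like those above are easier to compare once they are ranked and expressed as a share of the total. The following is a minimal sketch rather than a bpftrace feature: `rank_syscalls` is a hypothetical helper name, and it assumes the one-line-per-key `@[key]: count` output format with a single map key.

```shell
# rank_syscalls: hypothetical helper, not part of bpftrace.
# Reads bpftrace map output lines of the form "@[key]: count" on stdin and
# prints "count percent key", largest count first.
rank_syscalls() {
  awk '
    /^@\[.*\]: [0-9]+$/ {
      key = $1
      sub(/^@\[/, "", key)      # strip the leading "@["
      sub(/\]:$/, "", key)      # strip the trailing "]:"
      counts[key] = $2
      total += $2
    }
    END {
      for (k in counts)
        printf "%d %.1f%% %s\n", counts[k], (counts[k] * 100) / total, k
    }
  ' | sort -rn
}
```

Pipe a captured bpftrace run into it, for example `rank_syscalls < syscalls.txt`, and attach the ranked output to the ticket.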

Bash · TCP send latency histogram to a target IP

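bpftrace -e '
// Measures time spent inside tcp_sendmsg(), not network round-trip time.
// 0x2d01000a is 10.0.1.45 in network byte order as read on a little-endian host.
// Needs kernel BTF (or #include <net/sock.h>) to resolve struct sock.
kprobe:tcp_sendmsg
  /((struct sock *)arg0)->__sk_common.skc_daddr == 0x2d01000a/
{
  @start[tid] = nsecs;
}
kretprobe:tcp_sendmsg
  /@start[tid]/
{
  @latency_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'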
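
Kernel socket fields such as `skc_daddr` hold an IPv4 address as a 32-bit value in network byte order, so an integer comparison in a probe predicate needs the octets reversed on a little-endian host. The sketch below shows that conversion; `ipv4_to_le_hex` is a hypothetical helper, and it assumes a little-endian node and plain decimal octets without leading zeros.

```shell
# ipv4_to_le_hex: hypothetical helper for building probe filter constants.
# Prints the u32 that a little-endian host reads from a network-byte-order
# IPv4 field such as skc_daddr.
ipv4_to_le_hex() {
  old_ifs=$IFS
  IFS=.
  set -- $1                   # split the dotted quad into $1..$4
  IFS=$old_ifs
  # The first octet sits at the lowest address, so it becomes the low byte.
  printf '0x%02x%02x%02x%02x\n' "$4" "$3" "$2" "$1"
}
```

For example, `ipv4_to_le_hex 10.0.1.45` prints `0x2d01000a`, which can be compared against `((struct sock *)arg0)->__sk_common.skc_daddr` in a kprobe predicate.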

Cilium Hubble: cluster-wide network flow and policy visibility

For Kubernetes estates, Cilium with Hubble exposes L3/L4 flows, DNS metadata, and policy verdicts without instrumenting application code. Enable Hubble relay, UI, and metrics during Helm install, then use the Hubble CLI to filter dropped flows by namespace and protocol—misconfigured NetworkPolicy shows up as DROPPED verdicts immediately. Scrape Hubble metrics from Cilium pods on port 9965 into Prometheus for dashboards and alert rules. Treat Cilium as a platform decision: validate kernel version, BTF availability, and kube-proxy replacement implications in staging before production cutover.

Bash · install Cilium with Hubble via Helm

helm repo add cilium https://helm.cilium.io
helm repo update

helm upgrade --install cilium cilium/cilium \
  --version 1.16.0 \
  --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set hubble.metrics.enabled="{dns,http,tcp,flow,drop}" \
  --set prometheus.enabled=true

Bash · install Hubble CLI and observe dropped TCP flows

export HUBBLE_VERSION=$(curl -fsSL https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --remote-name-all \
  "https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
tar xzf hubble-linux-amd64.tar.gz
sudo mv hubble /usr/local/bin

hubble observe --namespace production --protocol tcp --verdict DROPPED

Example · Hubble flow showing policy denial

Jul  3 10:23:45.123: 10.0.1.45:54321 (default/payment-svc) -> 10.0.2.12:5432 (default/payment-db) to-stack FORWARDED (TCP)
Jul  3 10:23:45.234: 10.0.3.88:42110 (prod/order-svc) -> 10.0.2.12:5432 (default/payment-db) Policy denied DROPPED (TCP)
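
During an incident it helps to collapse a stream of flow lines into drop counts per source and destination pair. The sketch below parses the compact text format shown above; `summarize_drops` is a hypothetical helper, and the field positions are an assumption based on that sample rather than a stable Hubble contract. For anything long-lived, the CLI's JSON output is the safer input.

```shell
# summarize_drops: hypothetical helper, not part of the Hubble CLI.
# Reads compact `hubble observe` lines on stdin and prints
# "count source -> destination" for flows whose verdict is DROPPED.
summarize_drops() {
  awk '
    / DROPPED / {
      for (i = 2; i <= NF - 2; i++) {
        if ($i == "->") {
          # The identity in parentheses follows each ip:port endpoint.
          pair = $(i - 1) " -> " $(i + 2)
          drops[pair]++
          break
        }
      }
    }
    END { for (p in drops) printf "%d %s\n", drops[p], p }
  ' | sort -rn
}
```

A typical invocation would be `hubble observe --namespace production --verdict DROPPED | summarize_drops`, which surfaces the noisiest denied pair first.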

YAML · Prometheus scrape for Hubble metrics

scrape_configs:
  - job_name: hubble
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [kube-system]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_k8s_app]
        regex: cilium
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: ${1}:9965

Operational practices for safe production eBPF adoption

Start with read-only observability (Hubble, break-glass bpftrace, Parca profiling) before enabling enforcement in Tetragon or strict Cilium NetworkPolicy modes. Deploy continuous profiling with Parca or Grafana Pyroscope to see which kernel and user paths consume CPU, and store profiles on a persistent volume for regression comparison. Bound every long-lived BPF map: hash maps that are oversized or never pruned are a common cause of kernel memory pressure. CO-RE and BTF require kernel 5.2 or newer with debug info available; cgroup v2 integration needs 5.7 or newer. Run bpftool feature probe before rollout and pin kernel minor versions in node images. Version BPF programs in git, document what each probe measures, and test in staging on the same kernel build as production. eBPF complements Prometheus, OpenTelemetry traces, and log aggregators; it does not replace them. A typical drill-down runs from high latency in Prometheus, to dominant futex counts in bpftrace, to TCP retransmissions or DROPPED flows in Hubble, and ends with a NetworkPolicy or routing fix. Monitor eBPF itself with the JSON output of bpftool prog show and bpftool map show, watch for dropped ring buffer events, and align security tooling with broader cluster hardening practices.

Bash · deploy Parca for continuous profiling

helm upgrade --install parca oci://ghcr.io/parca-dev/parca/charts/parca \
  --namespace observability --create-namespace \
  --set persistentVolume.enabled=true \
  --set persistentVolume.size=50Gi

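Bash · bpftrace bounded map pattern

# bpftrace has no lruhash() builtin; cap the key count through the environment instead
# (BPFTRACE_MAX_MAP_KEYS on recent releases, BPFTRACE_MAP_KEYS_MAX on older ones; default 4096).
BPFTRACE_MAX_MAP_KEYS=10240 bpftrace -e '
kprobe:tcp_sendmsg { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ { @us = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }
interval:s:60 { clear(@start); }  // drop orphaned keys once a minute
'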

Bash · verify kernel eBPF features and inspect loaded programs

bpftool feature probe kernel

bpftool prog show --json | jq -r '.[] | "\(.name): xlated \(.bytes_xlated // 0) bytes"'
bpftool map show --json | jq -r '.[] | "\(.name): max_entries=\(.max_entries // "n/a")"'
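
The kernel version thresholds mentioned in this section can be checked mechanically before a rollout. This is a minimal sketch, assuming the release string from `uname -r` starts with `major.minor`; `kernel_at_least` and `preflight` are hypothetical helper names, and `/sys/kernel/btf/vmlinux` is the path where BTF-enabled kernels expose their type information.

```shell
# kernel_at_least: hypothetical helper.
# Usage: kernel_at_least <release> <major> <minor>
# Succeeds when <release> (for example "5.15.0-91-generic") is >= major.minor.
kernel_at_least() {
  rel_major=${1%%.*}          # text before the first dot
  rest=${1#*.}                # text after the first dot
  rel_minor=${rest%%[!0-9]*}  # leading digits of the remainder
  [ "$rel_major" -gt "$2" ] ||
    { [ "$rel_major" -eq "$2" ] && [ "$rel_minor" -ge "$3" ]; }
}

# preflight: report readiness of the current node without changing anything.
preflight() {
  release=$(uname -r)
  kernel_at_least "$release" 5 2 && echo "kernel $release: meets the 5.2 baseline"
  kernel_at_least "$release" 5 7 && echo "kernel $release: meets the 5.7 baseline"
  if [ -r /sys/kernel/btf/vmlinux ]; then
    echo "BTF: available"
  else
    echo "BTF: missing"
  fi
  return 0
}
```

Running `preflight` on a candidate node image alongside `bpftool feature probe kernel` gives a quick pass or fail signal before the image is pinned.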

Layer kernel signals on the application baseline from our observability setup for small platform teams.

When distributed traces stop at the syscall boundary, continue the investigation with OpenTelemetry distributed tracing in production Kubernetes.

Tags: ebpf observability kubernetes linux sre
