Unify traces, metrics, and logs through a scalable OTel Collector tier

Production-grade OpenTelemetry Collector pipeline for unified traces, metrics, and logs


Running Jaeger, Prometheus, and Fluentd as three separate stacks multiplies operational cost and breaks correlation. This guide deploys agent and gateway Collectors with memory limits, tail sampling, exporter queues, and Kubernetes Helm patterns.

Why three agent stacks break correlation and scale together

Many estates still run Jaeger or Zipkin for traces, Prometheus or StatsD for metrics, and Fluentd or Vector for logs. Each stack has its own config lifecycle, resource profile, and failure mode. When traffic doubles, operational work often triples. Correlating a slow trace with a latency spike and an error log still means manual hopping across dashboards because the signals never shared context at ingest. The architectural fix is a vendor-neutral telemetry path: applications emit OTLP, collectors process and route centrally, and backends become interchangeable.

Collector pipeline model: receivers, processors, exporters, connectors

The OpenTelemetry Collector is a single binary with composable pipelines per signal. Receivers ingest OTLP, Prometheus scrape, filelog, or legacy formats. Processors mutate, filter, sample, or batch data. Exporters deliver to Jaeger, Tempo, Prometheus remote write, Loki, or vendor endpoints. Connectors link pipelines: for example, spanmetrics turns traces into metrics without a separate instrumentation path. A typical production layout uses a per-node agent for local ingest and light batching, plus a central gateway for tail-based sampling, attribute enrichment, and multi-backend fan-out. Use the contrib distribution unless you are certain you need only core OTLP components; tail_sampling, spanmetrics, prometheusremotewrite, and loki all ship in contrib, not core.
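To make the connector idea concrete, here is a minimal sketch of spanmetrics bridging a traces pipeline into a metrics pipeline. The backend endpoints are borrowed from the gateway configs later in this guide; the bucket boundaries and the http.request.method dimension are illustrative assumptions, not recommendations.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  spanmetrics:
    histogram:
      explicit:
        # Illustrative latency buckets; tune to your own SLO thresholds.
        buckets: [100ms, 250ms, 500ms, 1s, 5s]
    dimensions:
      # Assumed span attribute; pick attributes your SDKs actually emit.
      - name: http.request.method

exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: https://mimir.example.com/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      # The connector is listed as an exporter of the traces pipeline...
      exporters: [otlp/tempo, spanmetrics]
    metrics/spanmetrics:
      # ...and as the receiver of a metrics pipeline.
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```

If the traces pipeline also samples, the connector only sees what survives sampling. Feeding spanmetrics from a separate, unsampled traces pipeline keeps request counts and latency histograms representative.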

Agent tier: local ingest, memory limits, forward to gateway

Deploy agents as a Kubernetes DaemonSet or sidecar. They receive OTLP from application SDKs or auto-instrumentation, run memory_limiter first and batch last, and forward to the gateway over TLS. Agents are not durable long-term buffers: queues are in-memory by default, so queued data is lost on restart, and backpressure propagates upstream when exporters saturate. Size agent pods modestly and keep heavy processors off this tier.

YAML · OpenTelemetry Collector agent config

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  batch:
    timeout: 5s
    send_batch_size: 8192
  resource:
    attributes:
      - key: collector.tier
        value: agent
        action: upsert

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc:4317
    tls:
      insecure: false
    sending_queue:
      enabled: true
      queue_size: 4096
    retry_on_failure:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/gateway]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/gateway]
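The sending queue in the agent config above lives in memory. If short gateway outages must not lose data, the queue can be backed by the file_storage extension from contrib. The following is a sketch under assumptions: the directory path is hypothetical and has to be provided as a writable volume (for a DaemonSet, typically a hostPath mount), and the retry window is an example value.

```yaml
extensions:
  file_storage/queue:
    # Hypothetical path; mount a writable volume here.
    directory: /var/lib/otelcol/queue

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc:4317
    tls:
      insecure: false
    sending_queue:
      enabled: true
      queue_size: 4096
      # Persist queued batches through the extension instead of RAM.
      storage: file_storage/queue
    retry_on_failure:
      enabled: true
      # Example value: stop retrying a batch after ten minutes.
      max_elapsed_time: 10m

service:
  # The extension must be enabled here or the exporter fails to start.
  extensions: [file_storage/queue]
```

This trades disk I/O and volume management for durability across restarts. It does not turn the agent into long-term storage; the queue is still bounded by queue_size.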

Gateway tier: tail sampling, enrichment, and multi-backend export

Gateway collectors run as a Deployment with more memory and CPU. Tail-based sampling belongs here so decisions see complete traces across services. Listed sampling policies match with OR semantics: keep errors, slow requests, and a probabilistic slice of the remainder. Add memory_limiter before sampling, enrich attributes for environment and team, and fan out to trace, metric, and log backends. Enable health_check and zpages extensions for incident debugging.

YAML · OpenTelemetry Collector gateway config

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 2000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-remainder
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
  attributes:
    actions:
      - key: deployment.environment
        action: upsert
        value: production
      - key: team
        action: upsert
        value: platform
  batch:
    timeout: 10s
    send_batch_size: 16384

exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
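    tls:
      insecure: true  # plaintext gRPC inside the cluster; set to false and configure certificates if this hop crosses a trust boundary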
    sending_queue:
      enabled: true
      queue_size: 8192
    retry_on_failure:
      enabled: true
  prometheusremotewrite:
    endpoint: https://mimir.example.com/api/v1/push
  loki:
    endpoint: http://loki.observability.svc:3100/loki/api/v1/push

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]
  telemetry:
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]
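Because top-level policies combine with OR semantics, a single matching policy is enough to keep a trace. When a condition should apply only in combination with another, the tail_sampling processor also provides an and policy type. The fragment below is a sketch that keeps slow traces only for one service; the service name checkout and the 500 ms threshold are hypothetical.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Still ORed with the policy below: any error trace is kept.
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Kept only when BOTH sub-policies match the trace.
      - name: slow-checkout
        type: and
        and:
          and_sub_policy:
            - name: checkout-only
              type: string_attribute
              string_attribute: { key: service.name, values: [checkout] }
            - name: slower-than-500ms
              type: latency
              latency: { threshold_ms: 500 }
```

Merge the policies into the gateway's existing tail_sampling block instead of defining a second processor, so that one decision is made per trace.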

Deploy on Kubernetes with the OpenTelemetry Helm chart

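Install agent and gateway releases from the official Helm chart. Pin the contrib image tag and validate config in CI with the validate subcommand (for the contrib binary, otelcol-contrib validate --config=<file>). Mount gateway config from a ConfigMap or use the chart values overlay pattern. Scale gateway replicas horizontally, but remember that tail sampling only works when every span of a trace reaches the same replica. A plain Kubernetes Service or L4 load balancer distributes connections, not traces, so once you run more than one sampling gateway, route by trace ID with the loadbalancing exporter.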
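A sketch of trace-ID-aware routing from the agent tier, assuming a headless Service named otel-gateway-headless exists in the observability namespace (that name is hypothetical). The DNS resolver needs a headless Service so that it sees individual pod IPs instead of a single cluster IP.

```yaml
exporters:
  loadbalancing:
    # Send all spans of a trace to the same gateway replica.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: false
    resolver:
      dns:
        # Hypothetical headless Service; resolves to one IP per gateway pod.
        # The resolver targets the default OTLP gRPC port, 4317.
        hostname: otel-gateway-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loadbalancing]
```

Only the traces pipeline needs this exporter for sampling correctness; metrics and logs can keep using the plain otlp/gateway exporter. When the replica set changes, some in-flight traces can be split across gateways until the resolver catches up.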

Bash · Helm install agent DaemonSet and gateway Deployment

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

helm upgrade --install otel-agent open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  --set mode=daemonset \
  --set image.repository=otel/opentelemetry-collector-contrib \
  --set image.tag=0.106.1 \
  -f values/agent.yaml

helm upgrade --install otel-gateway open-telemetry/opentelemetry-collector \
  --namespace observability \
  --set mode=deployment \
  --set replicaCount=3 \
  --set image.repository=otel/opentelemetry-collector-contrib \
  --set image.tag=0.106.1 \
  -f values/gateway.yaml
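The install commands reference values/agent.yaml without showing it. Below is a sketch of what such an overlay might contain, assuming the chart's mode, image, resources, and config keys; check key names against the chart version you pin. Resource figures are illustrative and chosen so that limit_mib sits at roughly 78 percent of the pod memory limit.

```yaml
# values/agent.yaml (hypothetical overlay)
mode: daemonset

image:
  repository: otel/opentelemetry-collector-contrib
  tag: "0.106.1"

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    memory: 512Mi

# Merged by the chart into its default Collector configuration.
config:
  processors:
    memory_limiter:
      check_interval: 1s
      limit_mib: 400        # about 78% of the 512Mi limit above
      spike_limit_mib: 100
```

Keeping the image pin and memory settings in the values file, instead of --set flags, puts them under version control next to the Collector config they have to agree with.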

Reference gateway: self-metrics scrape, logs pipeline, and export hardening

The gateway below shows a complete traces and metrics path with self-scrape of Collector metrics, resource enrichment from incoming attributes, and exporter queues. Add a logs pipeline the same way when applications emit OTLP logs. Instrument apps with the OTel SDK so trace_id and span_id land in structured logs: correlation is what makes unified telemetry useful, not fewer daemons alone.

YAML · gateway with tail sampling and self-monitoring

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:
    config:
      scrape_configs:
        - job_name: otelcol-self
          scrape_interval: 15s
          static_configs:
            - targets: ['127.0.0.1:8888']

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  tail_sampling:
    decision_wait: 10s
    num_traces: 200000
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample-5pct
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
  resource:
    attributes:
      - key: deployment.region
        value: eu-west-1
        action: upsert
      - key: service.version
        from_attribute: git.commit.sha
        action: upsert
  batch:
    timeout: 10s
    send_batch_size: 16384

exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
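    tls:
      insecure: true  # plaintext gRPC inside the cluster; set to false and configure certificates if this hop crosses a trust boundary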
    sending_queue:
      enabled: true
    retry_on_failure:
      enabled: true
  prometheusremotewrite:
    endpoint: https://mimir.observability.svc/api/v1/push
  debug:
    verbosity: basic

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: 0.0.0.0:1777

service:
  extensions: [health_check, zpages, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, resource, batch]
      exporters: [otlp/tempo, debug]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
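The reference config stops at traces and metrics. A logs pipeline can be added as a fragment merged into the same file. This sketch reuses the loki exporter and endpoint from the earlier gateway config and the processors already defined above; it is not a complete config on its own.

```yaml
exporters:
  loki:
    # Same in-cluster Loki push endpoint as the earlier gateway example.
    endpoint: http://loki.observability.svc:3100/loki/api/v1/push

service:
  pipelines:
    logs:
      receivers: [otlp]
      # No tail_sampling here: it operates on traces only.
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
```

Running logs through the same resource processor as traces keeps enrichment attributes such as deployment.region identical across signals, which is what later makes joining a log line to its trace practical.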

Operational practices for production collector fleets

Place memory_limiter first in every pipeline and set limit_mib to roughly 70 to 80 percent of the pod memory limit. Keep tail sampling and heavy attribute rewriting on gateways, not agents. Monitor otelcol_exporter_send_failed_spans, otelcol_exporter_queue_size, and otelcol_processor_refused_spans, because drops at the collector mean blind spots downstream. Pin image tags, run otelcol validate on config changes, and test upgrades in staging because the config schema can change between minor versions. Separate resource requests: agents stay lean; gateways need headroom for sampling buffers. Use resourcedetection and k8sattributes processors when you need node, pod, and namespace metadata on every signal. A unified Collector tier reduces agent sprawl and makes trace-log-metric correlation achievable when SDK context propagation is enforced in application code.
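As a sketch of the metadata enrichment mentioned above, the fragment below adds k8sattributes and resourcedetection to an agent traces pipeline. It assumes the Collector's service account has RBAC permission to read pods and namespaces, and that an environment variable named K8S_NODE_NAME is injected through the downward API; both are assumptions about your deployment, not chart defaults verified here.

```yaml
processors:
  k8sattributes:
    auth_type: serviceAccount
    filter:
      # On a DaemonSet, watch only pods on this node (env var is assumed).
      node_from_env_var: K8S_NODE_NAME
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
        - k8s.deployment.name
    pod_association:
      # Prefer an explicit pod IP attribute, else fall back to the peer address.
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: connection
  resourcedetection:
    detectors: [env, system]
    timeout: 2s
    override: false   # keep attributes the SDK already set

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, resourcedetection, resource, batch]
      exporters: [otlp/gateway]
```

The connection-based association only works where the Collector sees the application pod's own IP, which is why this enrichment sits more naturally on the agent than behind the gateway hop. Apply the same processor list to the metrics and logs pipelines so all three signals carry matching metadata.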

Start from actionable signals and alert hygiene described in our observability baseline for small platform teams.

Sampling and retention decisions should respect reliability budgets from SLO, SLI, and error budget practices for platform teams.

Tags: opentelemetry, observability, kubernetes, monitoring, sre
