Network Observability
Network observability uses golden signals, eBPF probes, flow logs, and distributed tracing to provide continuous visibility into network health and rapid root cause diagnosis.
The Problem
How does a team maintain continuous visibility into network health, detect issues before they affect users, and diagnose root causes when network problems span multiple services and layers?
Mental Model
Like a hospital's patient monitoring system: staff don't wait for a patient to flatline; they watch vital signs continuously and catch problems early.
How It Works
Network observability is the practice of understanding what the network is doing at all times — not just when something breaks. It goes beyond traditional monitoring (which alerts on known failure modes) to provide the data needed to answer arbitrary questions about network behavior.
The Four Golden Signals
Google's SRE book defined four golden signals that are the minimum viable monitoring for any networked system. Everything else is a refinement of these:
1. Latency — How long requests take. Track both successful and failed requests separately — a fast error is different from a slow success. At the network level, this means RTT, TCP handshake time, and DNS resolution time.
2. Traffic — How much demand the system is handling. Measured in requests/second for HTTP, packets/second and bytes/second for network links, and connections/second for load balancers.
3. Errors — The rate of failed requests. At the network level: TCP retransmissions, connection resets, DNS failures, TLS handshake failures, and ICMP unreachables.
4. Saturation — How full the system is. Network saturation means link utilization approaching capacity, connection table limits, socket buffer pressure, and CPU consumed by packet processing.
# Prometheus alerting rules for golden signals
groups:
  - name: network_golden_signals
    rules:
      - alert: HighTCPRetransmissions
        expr: rate(node_netstat_Tcp_RetransSegs[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TCP retransmission rate exceeds 100/s"
      - alert: InterfaceSaturation
        # node_network_speed_bytes is already in bytes/second, so no bit conversion is needed
        expr: rate(node_network_transmit_bytes_total[5m]) / node_network_speed_bytes * 100 > 80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Network interface >80% utilized"
RED Metrics for Networking
The RED method (Rate, Errors, Duration) is a simplified framework that works well for request-driven systems. Applied to network connections:
| RED Metric | Network Measurement | What It Reveals |
|---|---|---|
| Rate | Connections/sec, packets/sec, bytes/sec | Traffic load and capacity demands |
| Errors | Retransmissions, resets, timeouts, DNS failures | Network health and reliability |
| Duration | RTT, connection setup time, DNS resolution time | Latency and performance |
Track these per service pair (Service A → Service B) to build a service dependency map with network health overlay. Cilium Hubble does this automatically for Kubernetes pods.
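For example, if the observability agent exports per-pair metrics to Prometheus, the three RED questions become three queries. A minimal sketch follows; the metric and label names are illustrative (they assume a Hubble-style exporter with source and destination labels enabled), so substitute whatever your agent actually emits:
# RED-style PromQL for the frontend -> checkout service pair (names illustrative)
# Rate: flows per second between the pair
sum(rate(hubble_flows_processed_total{source="frontend", destination="checkout"}[5m]))
# Errors: dropped flows between the pair
sum(rate(hubble_drop_total{source="frontend", destination="checkout"}[5m]))
# Duration: P99 request latency for the pair (requires L7 visibility)
histogram_quantile(0.99, sum(rate(hubble_http_request_duration_seconds_bucket{source="frontend", destination="checkout"}[5m])) by (le))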
eBPF-Based Network Observability
eBPF (extended Berkeley Packet Filter) is the most significant advancement in network observability in the last decade. It allows running sandboxed programs inside the Linux kernel that hook into network events — without modifying kernel source, loading kernel modules, or adding application instrumentation.
What eBPF can observe:
┌─────────────────────────────────────────────────────┐
│ User Space │
│ Application → socket() → connect() → send() │
├─────────────────────────────────────────────────────┤
│ Kernel Space (eBPF hooks at each layer) │
│ ┌─── Socket Layer ──── kprobe:tcp_connect() │
│ ├─── TCP Layer ─────── tracepoint:tcp_retransmit │
│ ├─── IP Layer ──────── XDP (eXpress Data Path) │
│ └─── NIC Driver ────── TC (Traffic Control) │
└─────────────────────────────────────────────────────┘
Key eBPF programs for observability:
- kprobes on tcp_connect: Track every outbound connection with source, destination, and latency
- tracepoint tcp:tcp_retransmit_skb: Detect every retransmission with the affected connection tuple
- XDP programs: Process packets at the NIC driver level before they enter the kernel networking stack — used for DDoS mitigation and packet sampling
- TC (Traffic Control) hooks: Classify and sample traffic at the queueing discipline level
# Using bpftrace to monitor TCP retransmissions in real-time
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb {
printf("retransmit: %s:%d -> %s:%d\n",
ntop(args->saddr), args->sport,
ntop(args->daddr), args->dport);
}'
# Monitor new TCP connections
bpftrace -e 'kprobe:tcp_connect {
  $sk = (struct sock *)arg0;
  // skc_dport is stored in network byte order; swap bytes for display
  $dport = (($sk->__sk_common.skc_dport & 0xff) << 8) |
           ($sk->__sk_common.skc_dport >> 8);
  printf("connect: -> %s:%d\n",
    ntop($sk->__sk_common.skc_daddr), $dport);
}'
The beauty of eBPF is zero application changes. Deploy an eBPF agent (like Cilium or Pixie) and it immediately provides network visibility for every connection on the host.
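Getting started is typically a single deployment step. As a sketch, assuming a Kubernetes cluster and the Cilium Helm chart (the values shown are the commonly documented ones; verify against the chart version in use):
# Sketch: enabling Hubble observability via the Cilium Helm chart
helm repo add cilium https://helm.cilium.io/
helm upgrade --install cilium cilium/cilium --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,http}"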
Flow-Based Monitoring
While eBPF gives host-level visibility, flow protocols provide network-wide traffic analysis by exporting summarized flow records from switches and routers.
NetFlow, sFlow, and IPFIX
| Protocol | How It Works | Use Case |
|---|---|---|
| NetFlow v9 | Router maintains flow cache, exports completed flows | Cisco-centric, capacity planning |
| sFlow | Samples 1-in-N packets, exports immediately | Multi-vendor, real-time analysis |
| IPFIX | IETF-standardized evolution of NetFlow v9 (informally "NetFlow v10"), flexible templates | Open standard, modern deployments |
A flow record typically contains: source/destination IP, source/destination port, protocol, byte count, packet count, start/end timestamps, and TCP flags. From these records, teams can build:
- Traffic matrices: who talks to whom, and how much
- Top talkers: which hosts or services generate the most traffic
- Anomaly detection: sudden traffic patterns that deviate from baseline
- Security forensics: post-incident analysis of communication patterns
# Example: Using nfdump to analyze NetFlow data
# Top 10 source IPs by bytes
nfdump -r nfcapd.202401011200 -s srcip/bytes -n 10
# Show flows to a specific destination
nfdump -r nfcapd.202401011200 'dst ip 10.0.1.50'
# Traffic breakdown by port
nfdump -r nfcapd.202401011200 -s dstport -n 20
Cloud VPC Flow Logs
Every major cloud provider offers flow logs for virtual networks:
- AWS VPC Flow Logs: Capture accepted/rejected traffic at ENI, subnet, or VPC level
- GCP VPC Flow Logs: Sampled at 1-in-N with metadata enrichment
- Azure NSG Flow Logs: Network Security Group level flow records
These are invaluable for security auditing ("is anything talking to the internet that shouldn't be?"), compliance ("prove network segmentation works"), and troubleshooting ("why can't pod A reach pod B?").
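The raw records become useful once they can be queried. A minimal sketch, assuming AWS VPC Flow Logs delivered to CloudWatch Logs with the default record format (field names such as srcAddr, dstAddr, action, and bytes are the ones Logs Insights discovers automatically):
# Sketch: CloudWatch Logs Insights queries over VPC Flow Logs
# Which sources generate the most rejected traffic? (security auditing)
filter action = "REJECT"
| stats sum(bytes) as rejectedBytes by srcAddr, dstAddr, dstPort
| sort rejectedBytes desc
| limit 20
# Top accepted flows by volume (top talkers, capacity planning)
filter action = "ACCEPT"
| stats sum(bytes) as totalBytes by srcAddr, dstAddr
| sort totalBytes desc
| limit 20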
Building a Network Observability Stack
The Practical Architecture
A production-ready network observability stack has four layers (a minimal configuration for the processing layer is sketched after the list):
1. Collection Layer
- eBPF agents on each host (Cilium, Pixie, or custom bpf programs)
- Flow exporters on network devices (NetFlow/sFlow/IPFIX)
- DNS query logging from resolvers
- Cloud provider flow logs and metrics
2. Processing Layer
- OpenTelemetry Collector for metric normalization and routing
- Stream processing for flow record aggregation and anomaly detection
- Enrichment with metadata (pod labels, service names, cloud tags)
3. Storage Layer
- Prometheus for time-series metrics (golden signals, RED metrics)
- Loki or Elasticsearch for flow log storage and search
- Tempo or Jaeger for distributed traces
4. Visualization and Alerting Layer
- Grafana dashboards for golden signals, service maps, and traffic analysis
- Alertmanager for threshold-based and anomaly-based alerts
- Runbooks linked to alerts for rapid response
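As a concrete sketch of the processing layer, the following OpenTelemetry Collector configuration pulls host network metrics and an eBPF agent's Prometheus endpoint, enriches them with Kubernetes metadata, and forwards them to long-term storage. The endpoints and the agent port are placeholders; the receiver, processor, and exporter names are standard Collector components:
# Sketch: OpenTelemetry Collector pipeline for the processing layer
receivers:
  hostmetrics:
    collection_interval: 15s
    scrapers:
      network: {}                # interface byte/packet/error counters
  prometheus:
    config:
      scrape_configs:
        - job_name: ebpf-agent
          static_configs:
            - targets: ["localhost:9965"]    # placeholder agent metrics endpoint
processors:
  k8sattributes: {}              # enrich with pod, namespace, and service labels
  batch: {}
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write    # placeholder storage endpoint
service:
  pipelines:
    metrics:
      receivers: [hostmetrics, prometheus]
      processors: [k8sattributes, batch]
      exporters: [prometheusremotewrite]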
Service Map Generation
One of the most powerful outputs of network observability is an automatic service dependency map. By observing which services connect to which, engineers can:
- Discover unknown dependencies ("wait, the auth service calls the billing service?")
- Detect configuration drift ("this pod shouldn't be talking to the production database")
- Understand blast radius ("if this service goes down, these 12 services are affected")
Cilium Hubble generates these maps automatically for Kubernetes. For non-Kubernetes environments, flow logs and eBPF connection tracking produce the same result.
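The same data can be explored interactively from the command line. A minimal sketch using the Hubble CLI (namespace and pod names are placeholders):
# Sketch: exploring service dependencies with the Hubble CLI
# What does the checkout service talk to?
hubble observe --from-pod shop/checkout-7d9f8 --last 100
# Is anything unexpected reaching the production database?
hubble observe --to-pod prod/postgres-0 --last 200
# Dropped flows often reveal misconfigured network policies
hubble observe --verdict DROPPED --last 50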
Observability vs. Monitoring: The Key Difference
Monitoring answers predefined questions: "Is the server up? Is latency below 200ms? Is the error rate below 1%?" Engineers define thresholds and get alerts.
Observability answers questions that haven't been asked yet: "Why did latency spike for users in Singapore but not Tokyo, on Tuesday afternoon, only for POST requests to the /checkout endpoint?" This requires rich, correlated data that can be sliced and diced in real time.
The shift from monitoring to observability requires:
- High-cardinality data: metrics tagged with service, pod, namespace, source, destination, protocol, status code
- Correlation across signals: connecting a network retransmission event to the application-level timeout it caused
- Drill-down capability: from a dashboard spike → to the affected service pair → to the specific connection → to the packet capture (sketched below)
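In practice that drill-down might look like the following sequence; the PromQL metric name and the pod and host identifiers are illustrative:
# Sketch: drilling down from an aggregate spike to a packet capture
# 1. Which service pair is resetting connections? (PromQL, illustrative metric name)
#    topk(5, sum(rate(hubble_tcp_flags_total{flag="RST"}[5m])) by (source, destination))
# 2. Inspect the individual flows for that pair
hubble observe --from-pod shop/checkout-7d9f8 --to-pod shop/payments-5c4b6 --last 100
# 3. Capture the suspect connection on the node for packet-level analysis
tcpdump -i any -w /tmp/checkout-payments.pcap 'host 10.0.1.50 and port 8443'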
This is why eBPF-based tools are winning — they provide kernel-level detail with application-level context, automatically, at minimal overhead. The future of network observability is in the kernel, not in sidecars.
Key Points
- The four golden signals — latency, traffic, errors, saturation — are the minimum viable monitoring for any network. If only four things get tracked, make it these
- eBPF is the game-changer for network observability — it instruments the kernel without modifying application code, deploying sidecars, or restarting services, at negligible overhead
- Flow logs (NetFlow, sFlow, IPFIX) provide traffic-level visibility without packet capture overhead — essential for capacity planning and anomaly detection
- RED metrics (Rate, Errors, Duration) applied to network connections reveal issues that application-level metrics miss entirely
- Network observability is not network monitoring — monitoring tells the team something is broken, observability tells them WHY it broke
Key Components
| Component | Role |
|---|---|
| Golden Signals (Latency, Traffic, Errors, Saturation) | The four fundamental metrics that indicate the health of any networked system — defined by Google's SRE book |
| eBPF Probes | Kernel-level hooks that capture network events (connections, retransmits, drops) without modifying application code or adding sidecars |
| Flow Logs (NetFlow/sFlow/IPFIX) | Aggregated records of network flows (src/dst IP, ports, bytes, packets) for traffic analysis and anomaly detection |
| Distributed Tracing (Network Layer) | Tracking requests across network hops and service boundaries to identify where latency and errors occur |
| Packet Sampling | Capturing 1-in-N packets to build statistical profiles of traffic without the overhead of full capture |
When to Use
Implement network observability from day one of any production system. Start with the golden signals and flow logs, then add eBPF-based deep instrumentation as the system grows in complexity.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Cilium Hubble | Open Source | Kubernetes-native network observability using eBPF — service maps, flow visibility, and policy monitoring | Medium-Enterprise |
| Prometheus + Grafana | Open Source | Metrics collection and visualization — pull-based model with PromQL for flexible querying and alerting | Any |
| Grafana | Open Source | Unified dashboards combining network, infrastructure, and application metrics from multiple data sources | Any |
| Datadog Network Monitoring | Managed | SaaS network performance monitoring with auto-discovery, flow maps, and DNS analytics across cloud and on-prem | Medium-Enterprise |
Debug Checklist
- Check the golden signals dashboard: is latency spiking? is error rate elevated? is traffic pattern abnormal? is any resource saturated?
- Correlate network metrics with application traces — overlay TCP retransmission rate with API P99 latency to confirm network-level root cause
- Inspect flow logs for traffic anomalies — sudden spikes from a single source (DDoS), unusual port activity, or traffic to unexpected destinations
- Check eBPF connection tracking for TCP state issues — SYN floods, TIME_WAIT accumulation, connection reset storms
- Verify DNS resolution metrics — elevated NXDOMAIN or SERVFAIL rates indicate DNS configuration or infrastructure issues
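A few host-level commands cover several of these checks; the resolver address and hostname below are examples:
# Sketch: quick host-level checks for the steps above
ss -s                                       # socket summary: established, TIME_WAIT, orphaned
nstat -az TcpRetransSegs TcpExtListenDrops  # kernel TCP counters (retransmits, listen queue drops)
dig api.internal.example @10.0.0.2          # spot-check resolver status and query time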
Common Mistakes
- Monitoring only at the application layer and missing network-level issues like packet loss, retransmissions, and routing changes that degrade performance silently
- Collecting too many metrics without aggregation. Per-connection metrics at high cardinality will overwhelm the monitoring system — aggregate by service, pod, or subnet
- Relying solely on SNMP polling at 5-minute intervals. Modern networks change in seconds — streaming telemetry is a must, not periodic polling
- Not correlating network metrics with application traces. A spike in TCP retransmissions might explain why API P99 latency jumped — but only if the data is overlaid
- Ignoring saturation metrics. CPU, memory, and bandwidth at 90% utilization don't trigger error metrics, but they cause tail latency spikes
Real World Usage
- Netflix uses eBPF-based tools to monitor network health across their entire fleet — detecting TCP retransmissions and connection failures in real time
- Google defined the golden signals framework and monitors millions of network endpoints with their internal Monarch time-series system
- Meta uses eBPF extensively for network observability, replacing traditional tools with kernel-level instrumentation that adds near-zero overhead
- LinkedIn uses flow-based monitoring to detect DDoS attacks and traffic anomalies, triggering automated mitigation within seconds
- Cloudflare uses eBPF to analyze millions of packets per second at each edge PoP for DDoS detection and traffic classification