Network Observability
Network observability uses golden signals, eBPF probes, flow logs, and distributed tracing to provide continuous visibility into network health and rapid root cause diagnosis.
The Problem
How does a team maintain continuous visibility into network health, detect issues before they affect users, and diagnose root causes when network problems span multiple services and layers?
Mental Model
Like a hospital's patient monitoring system: staff don't wait for a patient to flatline; they watch vital signs continuously and catch problems early.
How It Works
Network observability is the practice of understanding what the network is doing at all times — not just when something breaks. It goes beyond traditional monitoring (which alerts on known failure modes) to provide the data needed to answer arbitrary questions about network behavior.
The Four Golden Signals
Google's SRE book defined four golden signals that are the minimum viable monitoring for any networked system. Everything else is a refinement of these:
1. Latency — How long requests take. Track both successful and failed requests separately — a fast error is different from a slow success. At the network level, this means RTT, TCP handshake time, and DNS resolution time.
2. Traffic — How much demand the system is handling. Measured in requests/second for HTTP, packets/second and bytes/second for network links, and connections/second for load balancers.
3. Errors — The rate of failed requests. At the network level: TCP retransmissions, connection resets, DNS failures, TLS handshake failures, and ICMP unreachables.
4. Saturation — How full the system is. Network saturation means link utilization approaching capacity, connection table limits, socket buffer pressure, and CPU consumed by packet processing.
# Prometheus alerting rules for golden signals
groups:
  - name: network_golden_signals
    rules:
      - alert: HighTCPRetransmissions
        expr: rate(node_netstat_Tcp_RetransSegs[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TCP retransmission rate exceeds 100/s"
      - alert: InterfaceSaturation
        # node_network_speed_bytes is already in bytes/second, so no bit conversion is needed
        expr: rate(node_network_transmit_bytes_total[5m]) / node_network_speed_bytes * 100 > 80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Network interface >80% utilized"
RED Metrics for Networking
The RED method (Rate, Errors, Duration) is a simplified framework that works well for request-driven systems. Applied to network connections:
| RED Metric | Network Measurement | What It Reveals |
|---|---|---|
| Rate | Connections/sec, packets/sec, bytes/sec | Traffic load and capacity demands |
| Errors | Retransmissions, resets, timeouts, DNS failures | Network health and reliability |
| Duration | RTT, connection setup time, DNS resolution time | Latency and performance |
Track these per service pair (Service A → Service B) to build a service dependency map with network health overlay. Cilium Hubble does this automatically for Kubernetes pods.
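For example, if the observability agent exports per-pair metrics to Prometheus, the three RED questions become three queries. A minimal sketch follows; the metric and label names are illustrative (they assume a Hubble-style exporter with source and destination labels enabled), so substitute whatever your agent actually emits:
# RED-style PromQL for the frontend -> checkout service pair (names illustrative)
# Rate: flows per second between the pair
sum(rate(hubble_flows_processed_total{source="frontend", destination="checkout"}[5m]))
# Errors: dropped flows between the pair
sum(rate(hubble_drop_total{source="frontend", destination="checkout"}[5m]))
# Duration: P99 request latency for the pair (requires L7 visibility)
histogram_quantile(0.99, sum(rate(hubble_http_request_duration_seconds_bucket{source="frontend", destination="checkout"}[5m])) by (le))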
eBPF-Based Network Observability
eBPF (extended Berkeley Packet Filter) is the most significant advancement in network observability in the last decade. It allows running sandboxed programs inside the Linux kernel that hook into network events — without modifying kernel source, loading kernel modules, or adding application instrumentation.
What eBPF can observe:
┌─────────────────────────────────────────────────────┐
│ User Space │
│ Application → socket() → connect() → send() │
├─────────────────────────────────────────────────────┤
│ Kernel Space (eBPF hooks at each layer) │
│ ┌─── Socket Layer ──── kprobe:tcp_connect() │
│ ├─── TCP Layer ─────── tracepoint:tcp_retransmit │
│ ├─── IP Layer ──────── XDP (eXpress Data Path) │
│ └─── NIC Driver ────── TC (Traffic Control) │
└─────────────────────────────────────────────────────┘
Key eBPF programs for observability:
- kprobes on tcp_connect: Track every outbound connection with source, destination, and latency
- tracepoint tcp:tcp_retransmit_skb: Detect every retransmission with the affected connection tuple
- XDP programs: Process packets at the NIC driver level before they enter the kernel networking stack — used for DDoS mitigation and packet sampling
- TC (Traffic Control) hooks: Classify and sample traffic at the queueing discipline level
# Using bpftrace to monitor TCP retransmissions in real-time
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb {
printf("retransmit: %s:%d -> %s:%d\n",
ntop(args->saddr), args->sport,
ntop(args->daddr), args->dport);
}'
# Monitor new TCP connections
bpftrace -e 'kprobe:tcp_connect {
  $sk = (struct sock *)arg0;
  // skc_dport is stored in network byte order; swap bytes for display
  $dport = (($sk->__sk_common.skc_dport & 0xff) << 8) |
           ($sk->__sk_common.skc_dport >> 8);
  printf("connect: -> %s:%d\n",
    ntop($sk->__sk_common.skc_daddr), $dport);
}'
The beauty of eBPF is zero application changes. Deploy an eBPF agent (like Cilium or Pixie) and it immediately provides network visibility for every connection on the host.
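Getting started is typically a single deployment step. As a sketch, assuming a Kubernetes cluster and the Cilium Helm chart (the values shown are the commonly documented ones; verify against the chart version in use):
# Sketch: enabling Hubble observability via the Cilium Helm chart
helm repo add cilium https://helm.cilium.io/
helm upgrade --install cilium cilium/cilium --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,http}"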
Flow-Based Monitoring
While eBPF gives host-level visibility, flow protocols provide network-wide traffic analysis by exporting summarized flow records from switches and routers.
NetFlow, sFlow, and IPFIX
| Protocol | How It Works | Use Case |
|---|---|---|
| NetFlow v9 | Router maintains flow cache, exports completed flows | Cisco-centric, capacity planning |
| sFlow | Samples 1-in-N packets, exports immediately | Multi-vendor, real-time analysis |
| IPFIX | IETF-standardized evolution of NetFlow v9 (informally "NetFlow v10"), flexible templates | Open standard, modern deployments |
A flow record typically contains: source/destination IP, source/destination port, protocol, byte count, packet count, start/end timestamps, and TCP flags. From these records, teams can build:
- Traffic matrices: who talks to whom, and how much
- Top talkers: which hosts or services generate the most traffic
- Anomaly detection: sudden traffic patterns that deviate from baseline
- Security forensics: post-incident analysis of communication patterns
# Example: Using nfdump to analyze NetFlow data
# Top 10 source IPs by bytes
nfdump -r nfcapd.202401011200 -s srcip/bytes -n 10
# Show flows to a specific destination
nfdump -r nfcapd.202401011200 'dst ip 10.0.1.50'
# Traffic breakdown by port
nfdump -r nfcapd.202401011200 -s dstport -n 20
Cloud VPC Flow Logs
Every major cloud provider offers flow logs for virtual networks:
- AWS VPC Flow Logs: Capture accepted/rejected traffic at ENI, subnet, or VPC level
- GCP VPC Flow Logs: Sampled at 1-in-N with metadata enrichment
- Azure NSG Flow Logs: Network Security Group level flow records
These are invaluable for security auditing ("is anything talking to the internet that shouldn't be?"), compliance ("prove network segmentation works"), and troubleshooting ("why can't pod A reach pod B?").
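The raw records become useful once they can be queried. A minimal sketch, assuming AWS VPC Flow Logs delivered to CloudWatch Logs with the default record format (field names such as srcAddr, dstAddr, action, and bytes are the ones Logs Insights discovers automatically):
# Sketch: CloudWatch Logs Insights queries over VPC Flow Logs
# Which sources generate the most rejected traffic? (security auditing)
filter action = "REJECT"
| stats sum(bytes) as rejectedBytes by srcAddr, dstAddr, dstPort
| sort rejectedBytes desc
| limit 20
# Top accepted flows by volume (top talkers, capacity planning)
filter action = "ACCEPT"
| stats sum(bytes) as totalBytes by srcAddr, dstAddr
| sort totalBytes desc
| limit 20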
Building a Network Observability Stack
The Practical Architecture
A production-ready network observability stack has four layers (a minimal configuration for the processing layer is sketched after the list):
1. Collection Layer
- eBPF agents on each host (Cilium, Pixie, or custom bpf programs)
- Flow exporters on network devices (NetFlow/sFlow/IPFIX)
- DNS query logging from resolvers
- Cloud provider flow logs and metrics
2. Processing Layer
- OpenTelemetry Collector for metric normalization and routing
- Stream processing for flow record aggregation and anomaly detection
- Enrichment with metadata (pod labels, service names, cloud tags)
3. Storage Layer
- Prometheus for time-series metrics (golden signals, RED metrics)
- Loki or Elasticsearch for flow log storage and search
- Tempo or Jaeger for distributed traces
4. Visualization and Alerting Layer
- Grafana dashboards for golden signals, service maps, and traffic analysis
- Alertmanager for threshold-based and anomaly-based alerts
- Runbooks linked to alerts for rapid response
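As a concrete sketch of the processing layer, the following OpenTelemetry Collector configuration pulls host network metrics and an eBPF agent's Prometheus endpoint, enriches them with Kubernetes metadata, and forwards them to long-term storage. The endpoints and the agent port are placeholders; the receiver, processor, and exporter names are standard Collector components:
# Sketch: OpenTelemetry Collector pipeline for the processing layer
receivers:
  hostmetrics:
    collection_interval: 15s
    scrapers:
      network: {}                # interface byte/packet/error counters
  prometheus:
    config:
      scrape_configs:
        - job_name: ebpf-agent
          static_configs:
            - targets: ["localhost:9965"]    # placeholder agent metrics endpoint
processors:
  k8sattributes: {}              # enrich with pod, namespace, and service labels
  batch: {}
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write    # placeholder storage endpoint
service:
  pipelines:
    metrics:
      receivers: [hostmetrics, prometheus]
      processors: [k8sattributes, batch]
      exporters: [prometheusremotewrite]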
Service Map Generation
One of the most powerful outputs of network observability is an automatic service dependency map. By observing which services connect to which, engineers can:
- Discover unknown dependencies ("wait, the auth service calls the billing service?")
- Detect configuration drift ("this pod shouldn't be talking to the production database")
- Understand blast radius ("if this service goes down, these 12 services are affected")
Cilium Hubble generates these maps automatically for Kubernetes. For non-Kubernetes environments, flow logs and eBPF connection tracking produce the same result.
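The same data can be explored interactively from the command line. A minimal sketch using the Hubble CLI (namespace and pod names are placeholders):
# Sketch: exploring service dependencies with the Hubble CLI
# What does the checkout service talk to?
hubble observe --from-pod shop/checkout-7d9f8 --last 100
# Is anything unexpected reaching the production database?
hubble observe --to-pod prod/postgres-0 --last 200
# Dropped flows often reveal misconfigured network policies
hubble observe --verdict DROPPED --last 50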
Observability vs. Monitoring: The Key Difference
Monitoring answers predefined questions: "Is the server up? Is latency below 200ms? Is the error rate below 1%?" Engineers define thresholds and get alerts.
Observability answers questions that haven't been asked yet: "Why did latency spike for users in Singapore but not Tokyo, on Tuesday afternoon, only for POST requests to the /checkout endpoint?" This requires rich, correlated data that can be sliced and diced in real time.
The shift from monitoring to observability requires:
- High-cardinality data: metrics tagged with service, pod, namespace, source, destination, protocol, status code
- Correlation across signals: connecting a network retransmission event to the application-level timeout it caused
- Drill-down capability: from a dashboard spike → to the affected service pair → to the specific connection → to the packet capture (sketched below)
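In practice that drill-down might look like the following sequence; the PromQL metric name and the pod and host identifiers are illustrative:
# Sketch: drilling down from an aggregate spike to a packet capture
# 1. Which service pair is resetting connections? (PromQL, illustrative metric name)
#    topk(5, sum(rate(hubble_tcp_flags_total{flag="RST"}[5m])) by (source, destination))
# 2. Inspect the individual flows for that pair
hubble observe --from-pod shop/checkout-7d9f8 --to-pod shop/payments-5c4b6 --last 100
# 3. Capture the suspect connection on the node for packet-level analysis
tcpdump -i any -w /tmp/checkout-payments.pcap 'host 10.0.1.50 and port 8443'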
This is why eBPF-based tools are winning — they provide kernel-level detail with application-level context, automatically, at minimal overhead. The future of network observability is in the kernel, not in sidecars.
Key Points
- The four golden signals — latency, traffic, errors, saturation — are the minimum viable monitoring for any network. If only four things get tracked, make it these
- eBPF is the game-changer for network observability — it instruments the kernel without modifying application code, deploying sidecars, or restarting services, at negligible overhead
- Flow logs (NetFlow, sFlow, IPFIX) provide traffic-level visibility without packet capture overhead — essential for capacity planning and anomaly detection
- RED metrics (Rate, Errors, Duration) applied to network connections reveal issues that application-level metrics miss entirely
- Network observability is not network monitoring — monitoring tells the team something is broken, observability tells them WHY it broke
Key Components
| Component | Role |
|---|---|
| Golden Signals (Latency, Traffic, Errors, Saturation) | The four fundamental metrics that indicate the health of any networked system — defined by Google's SRE book |
| eBPF Probes | Kernel-level hooks that capture network events (connections, retransmits, drops) without modifying application code or adding sidecars |
| Flow Logs (NetFlow/sFlow/IPFIX) | Aggregated records of network flows (src/dst IP, ports, bytes, packets) for traffic analysis and anomaly detection |
| Distributed Tracing (Network Layer) | Tracking requests across network hops and service boundaries to identify where latency and errors occur |
| Packet Sampling | Capturing 1-in-N packets to build statistical profiles of traffic without the overhead of full capture |
When to Use
Implement network observability from day one of any production system. Start with the golden signals and flow logs, then add eBPF-based deep instrumentation as the system grows in complexity.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Cilium Hubble | Open Source | Kubernetes-native network observability using eBPF — service maps, flow visibility, and policy monitoring | Medium-Enterprise |
| Prometheus + Grafana | Open Source | Metrics collection and visualization — pull-based model with PromQL for flexible querying and alerting | Any |
| Grafana | Open Source | Unified dashboards combining network, infrastructure, and application metrics from multiple data sources | Any |
| Datadog Network Monitoring | Managed | SaaS network performance monitoring with auto-discovery, flow maps, and DNS analytics across cloud and on-prem | Medium-Enterprise |
Debug Checklist
- Check the golden signals dashboard: is latency spiking? is error rate elevated? is traffic pattern abnormal? is any resource saturated?
- Correlate network metrics with application traces — overlay TCP retransmission rate with API P99 latency to confirm network-level root cause
- Inspect flow logs for traffic anomalies — sudden spikes from a single source (DDoS), unusual port activity, or traffic to unexpected destinations
- Check eBPF connection tracking for TCP state issues — SYN floods, TIME_WAIT accumulation, connection reset storms
- Verify DNS resolution metrics — elevated NXDOMAIN or SERVFAIL rates indicate DNS configuration or infrastructure issues
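A few host-level commands cover several of these checks; the resolver address and hostname below are examples:
# Sketch: quick host-level checks for the steps above
ss -s                                       # socket summary: established, TIME_WAIT, orphaned
nstat -az TcpRetransSegs TcpExtListenDrops  # kernel TCP counters (retransmits, listen queue drops)
dig api.internal.example @10.0.0.2          # spot-check resolver status and query time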
Common Mistakes
- Monitoring only at the application layer and missing network-level issues like packet loss, retransmissions, and routing changes that degrade performance silently
- Collecting too many metrics without aggregation. Per-connection metrics at high cardinality will overwhelm the monitoring system — aggregate by service, pod, or subnet
- Relying solely on SNMP polling at 5-minute intervals. Modern networks change in seconds — streaming telemetry is a must, not periodic polling
- Not correlating network metrics with application traces. A spike in TCP retransmissions might explain why API P99 latency jumped — but only if the data is overlaid
- Ignoring saturation metrics. CPU, memory, and bandwidth at 90% utilization don't trigger error metrics, but they cause tail latency spikes
Real World Usage
- Netflix uses eBPF-based tools to monitor network health across their entire fleet — detecting TCP retransmissions and connection failures in real time
- Google defined the golden signals framework and monitors millions of network endpoints with their internal Monarch time-series system
- Meta uses eBPF extensively for network observability, replacing traditional tools with kernel-level instrumentation that adds near-zero overhead
- LinkedIn uses flow-based monitoring to detect DDoS attacks and traffic anomalies, triggering automated mitigation within seconds
- Cloudflare uses eBPF to analyze millions of packets per second at each edge PoP for DDoS detection and traffic classification