TCP/IP Debugging Toolkit
Systematic network debugging starts with the symptom, picks the right tool (ss, tcpdump, mtr, dig, curl, openssl), and works from connectivity through transport to the application layer.
The Problem
When a network connection fails, is slow, or behaves unexpectedly — how does an engineer systematically isolate whether the problem is DNS, routing, TCP, TLS, or the application?
Mental Model
Like a mechanic's toolbox — each tool reveals a different layer of what's happening under the hood. Don't pull out the oscilloscope first; start with the symptom and pick the right diagnostic.
How It Works
Network debugging is a skill that separates senior engineers from everyone else. It's not about memorizing flags — it's about having a systematic approach: start with the symptom, pick the right tool, and work through layers until the root cause surfaces.
The Debugging Decision Tree
Before reaching for any tool, classify the symptom:
| Symptom | Likely Layer | First Tool |
|---|---|---|
| "Connection refused" | Transport (TCP) | ss -tlnp |
| "Connection timed out" | Network/Firewall | mtr, nc -zv |
| "Slow responses" | Transport/App | ss -ti, curl -w |
| "TLS handshake failed" | Security | openssl s_client |
| "DNS resolution failed" | Application/DNS | dig +trace |
| "HTTP 502/503 errors" | Application | curl -v |
| "Intermittent failures" | Any layer | tcpdump + Wireshark |
The rule is: start at the highest layer that makes sense and work down. Don't capture packets for an HTTP 404 — that's an application problem. Do capture packets when connections randomly reset and nobody knows why.
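For example, a top-down first pass on an unreachable HTTPS endpoint might look like this (api.example.com stands in for the failing service):
# Top-down first pass: application, then DNS, then TCP, then the path
curl -sv https://api.example.com/health -o /dev/null   # application + TLS
dig +short api.example.com                             # DNS resolution
nc -zv api.example.com 443                             # TCP handshake
mtr -n --report -c 50 api.example.com                  # network path
Whichever step fails first tells you which tool section below to dive into.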
Tool 1: ss — Socket Statistics
ss is the modern replacement for netstat. It's faster (it reads socket state directly from the kernel via netlink) and exposes more TCP internal state.
# List all listening TCP ports with process names
ss -tlnp
# Show all established connections to a specific host
ss -tnp dst 10.0.1.50
# Show TCP internal metrics for connections (the gold)
ss -ti dst 10.0.1.50
# Output includes:
# rtt:1.234/0.567 → smoothed RTT / variance
# retrans:0/3 → current / total retransmissions
# cwnd:10 → congestion window (in MSS units)
# send 93.4Mbps → estimated send rate
# rcv_space:65536 → receive window
# Count connections by state
ss -tn state established | wc -l
ss -tn state time-wait | wc -l
ss -tn state close-wait | wc -l
What to look for:
- Thousands of TIME_WAIT: ephemeral port exhaustion risk. Enable tcp_tw_reuse (see the sketch after this list).
- Growing CLOSE_WAIT: the application isn't closing connections. This is a code bug, not a network issue.
- Stuck SYN_SENT connections: the remote server isn't responding to SYN (firewall, server down).
- High retrans count in ss -ti: packet loss on this connection.
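Checking and enabling tcp_tw_reuse is a one-liner on Linux; a sketch (persist the setting via /etc/sysctl.d in production):
# Check the current setting (0 = off, 1 = on, 2 = loopback only, the default on recent kernels)
sysctl net.ipv4.tcp_tw_reuse
# Allow reuse of TIME_WAIT sockets for new outgoing connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1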
Tool 2: tcpdump — Packet Capture
The universal source of truth. When everything else is ambiguous, packets tell the real story.
# Capture HTTPS traffic on port 443 to a file
tcpdump -i any port 443 -w /tmp/capture.pcap -c 5000
# Capture traffic to a specific host
tcpdump -i eth0 host 10.0.1.50 -w /tmp/debug.pcap
# Capture only SYN packets (connection attempts)
tcpdump -i any 'tcp[tcpflags] & (tcp-syn) != 0' -n
# Capture DNS queries
tcpdump -i any port 53 -n
# Live display with readable output (no file)
tcpdump -i any port 8080 -n -A # -A for ASCII payload
# Capture with rotation (10 files of 100MB each)
tcpdump -i any port 443 -w /tmp/cap.pcap -C 100 -W 10
Key tcpdump flags:
- -i any: capture on all interfaces
- -n: don't resolve hostnames (faster)
- -w file.pcap: write raw packets (for Wireshark analysis)
- -c N: stop after N packets
- -s 0: capture full packets (not just headers)
After capturing, open the .pcap in Wireshark for analysis. Use these Wireshark filters:
tcp.analysis.retransmission # Find retransmitted packets
tcp.analysis.zero_window # Flow control issues
tcp.flags.reset == 1 # Connection resets
tcp.analysis.duplicate_ack # Signs of packet loss
tls.alert_message # TLS errors (ssl.* in Wireshark before 3.0)
dns.flags.rcode != 0 # DNS failures
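The same display filters work headlessly with tshark (Wireshark's command-line counterpart), which helps when the capture can't leave the server; a sketch assuming tshark is installed:
# Count retransmissions in the capture
tshark -r /tmp/capture.pcap -Y 'tcp.analysis.retransmission' | wc -l
# List connection resets with their endpoints
tshark -r /tmp/capture.pcap -Y 'tcp.flags.reset == 1'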
Tool 3: mtr — Path Analysis
mtr combines traceroute and ping into continuous monitoring. It sends a stream of probes and reports per-hop loss and latency statistics.
# Basic mtr to a host (runs continuously, Ctrl+C to stop)
mtr -n 10.0.1.50
# Report mode (100 packets, then exit with summary)
mtr -n --report -c 100 api.example.com
# Use TCP instead of ICMP (bypasses ICMP-blocking firewalls)
mtr -n --tcp --port 443 api.example.com
# Use UDP
mtr -n --udp api.example.com
Reading mtr output:
Host Loss% Snt Last Avg Best Wrst StDev
1. 10.0.0.1 0.0% 100 0.5 0.5 0.4 1.2 0.1
2. 172.16.0.1 0.0% 100 1.2 1.3 1.0 3.5 0.3
3. 203.0.113.1 12.0% 100 5.4 25.3 4.8 150.2 35.1 ← Problem hop
4. 198.51.100.1 0.0% 100 15.2 15.5 14.8 18.3 0.5
5. api.example.com 0.0% 100 15.8 16.0 15.2 19.1 0.6
Important: Loss at an intermediate hop but not at the destination usually means that router rate-limits ICMP (traceroute probes) — this is normal and not a real problem. Only worry when loss appears at intermediate hops AND the destination.
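A quick way to confirm intermediate loss is cosmetic: run the same report with ICMP and TCP probes and compare the destination row, e.g.:
# ICMP probes (default) vs TCP probes to the real service port
mtr -n --report -c 100 api.example.com
mtr -n --report -c 100 --tcp --port 443 api.example.com
# If the destination row shows ~0% loss in both, intermediate-hop loss is ICMP rate-limiting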
Tool 4: dig — DNS Debugging
# Basic query
dig api.example.com
# Query specific record type
dig api.example.com AAAA # IPv6
dig example.com MX # Mail servers
dig example.com TXT # TXT records (SPF, DKIM)
# Trace the full delegation chain (root → TLD → authoritative)
dig +trace api.example.com
# Query a specific DNS server
dig @8.8.8.8 api.example.com
# Check TTL (how long until cache expires)
dig api.example.com | grep -A1 "ANSWER SECTION"
# Reverse DNS lookup
dig -x 93.184.216.34
What to look for:
- NXDOMAIN: the domain doesn't exist (typo? deleted record?)
- SERVFAIL: the nameserver can't answer (DNSSEC validation failure? broken delegation?)
- High TTL: changes won't propagate until existing caches expire
- Different answers from different nameservers: propagation delay or inconsistent configuration (see the comparison sketch after this list)
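To compare nameservers, query each authoritative server for the zone directly; a sketch using example.com as a stand-in zone:
# Ask every authoritative nameserver for the same record and compare answers
for ns in $(dig +short NS example.com); do
  echo "== $ns"; dig @"$ns" +short api.example.com
done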
Tool 5: curl — HTTP Debugging
# Verbose output showing headers and TLS negotiation
curl -v https://api.example.com/health
# Timing breakdown of every phase
curl -w "\
DNS: %{time_namelookup}s\n\
Connect: %{time_connect}s\n\
TLS: %{time_appconnect}s\n\
TTFB: %{time_starttransfer}s\n\
Total: %{time_total}s\n\
HTTP Code: %{http_code}\n\
Size: %{size_download} bytes\n" \
-o /dev/null -s https://api.example.com/data
# Follow redirects and show each hop
curl -vL https://example.com
# Send with specific headers
curl -H "Authorization: Bearer TOKEN" -H "Accept: application/json" \
https://api.example.com/resource
# Test POST with body
curl -X POST -d '{"key":"value"}' -H "Content-Type: application/json" \
https://api.example.com/resource
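For intermittent slowness, a single request proves little; looping the timing output exposes outliers (a sketch, adjust the count and target):
# Sample DNS / connect / total times 20 times to catch outliers
for i in $(seq 1 20); do
  curl -o /dev/null -s -w "%{time_namelookup} %{time_connect} %{time_total}\n" \
    https://api.example.com/data
  sleep 1
done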
Tool 6: openssl s_client — TLS Debugging
# Connect and show certificate chain
openssl s_client -connect api.example.com:443 -servername api.example.com
# Check certificate expiry
openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null 2>/dev/null \
| openssl x509 -noout -dates
# (</dev/null closes stdin so s_client exits after the handshake instead of hanging)
# Show negotiated cipher and TLS version
openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null 2>/dev/null \
| grep -E "Protocol|Cipher"
# Test specific TLS version
openssl s_client -connect api.example.com:443 -servername api.example.com -tls1_3 </dev/null
# Verify certificate against a CA bundle
openssl s_client -connect api.example.com:443 -servername api.example.com \
-CAfile /etc/ssl/certs/ca-certificates.crt </dev/null
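For monitoring scripts, x509 -checkend turns the expiry check into an exit code; a sketch that warns a week ahead:
# Non-zero exit if the certificate expires within 7 days (604800 seconds)
openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null 2>/dev/null \
| openssl x509 -noout -checkend 604800 \
&& echo "cert OK" || echo "cert expires within 7 days"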
Putting It All Together: A Real Debugging Session
Here's how these tools combine in a real incident: "Service A can't reach Service B, getting timeouts."
# Step 1: Can we reach the port at all?
nc -zv service-b.internal 8080
# Result: "Connection timed out" → not a DNS issue, not an app issue
# Step 2: Is it a routing or firewall issue?
mtr -n --tcp --port 8080 service-b.internal
# Result: 100% loss at hop 3 → firewall or routing issue
# Step 3: Check from the other side — is Service B listening?
# (SSH to service-b host)
ss -tlnp | grep :8080
# Result: "LISTEN 0 128 0.0.0.0:8080" → service is running
# Step 4: Check firewall rules
iptables -L -n | grep 8080
# Result: No allow rule → firewall is blocking traffic
# Fix: Add firewall rule, verify with nc, confirm with curl
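That fix might look like this, assuming plain iptables (adapt for nftables or cloud security groups; rule position in the chain matters):
# Insert an allow rule for inbound TCP 8080 ahead of any drop rules
iptables -I INPUT -p tcp --dport 8080 -j ACCEPT
# Re-verify from the client
nc -zv service-b.internal 8080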
Common Patterns and Their Diagnoses
| Pattern | Diagnosis |
|---|---|
| SYN sent, no SYN-ACK received | Firewall dropping packets, server down, or wrong IP |
| SYN-ACK received, then RST | Port is open but service rejected the connection (TCP wrapper, listen backlog full) |
| Connection established, then RST | Application-level rejection (protocol mismatch, auth failure) |
| Established but no data flows | Application deadlock, full send/receive buffer, blocked thread |
| Increasing retransmissions | Network congestion or packet loss on the path |
| Zero window events | Receiver can't keep up — application is slow to read from socket |
These patterns are visible in ss -ti output and tcpdump captures. Learning to recognize them turns hours of guessing into minutes of targeted diagnosis.
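Some of these patterns can also be watched for live with tcpdump flag filters; a sketch (byte offsets follow the standard TCP header layout):
# Catch connection resets as they happen
tcpdump -i any 'tcp[tcpflags] & tcp-rst != 0' -n
# Catch zero-window advertisements (window field = bytes 14-15 of the TCP header;
# note this also matches RSTs, which often carry a zero window)
tcpdump -i any 'tcp[14:2] = 0' -n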
Why This Matters
Every backend engineer will face network issues in production. The engineers who can pick the right tool, capture the right evidence, and isolate the root cause in minutes are worth their weight in gold. These tools are free, available on every Linux system, and repay the investment in learning them across an entire career.
Key Points
- The best debugging approach is symptom-driven: start with what's broken (timeout, refused, slow, TLS error) and pick the right tool for that symptom
- tcpdump is the universal truth — when logs and metrics disagree, packets don't lie. Learn to capture and filter effectively
- ss -ti exposes TCP internals (RTT, cwnd, retransmits) per connection without packet capture — it's the fastest way to spot TCP issues
- mtr combines traceroute and ping into a continuous path analysis — it reveals which hop is dropping packets or adding latency
- Most 'network issues' are actually application issues. Always check the application layer (curl -v, HTTP status codes) before diving into packets
Key Components
| Component | Role |
|---|---|
| tcpdump / Wireshark | Packet capture and analysis — tcpdump for command-line capture on servers, Wireshark for visual deep-dive analysis |
| ss / netstat | Socket statistics showing connection states, window sizes, RTT, and retransmission counts per connection |
| mtr / traceroute | Path analysis showing every hop between source and destination, with per-hop latency and packet loss |
| dig / nslookup | DNS resolution debugging — query specific record types, trace delegation chain, verify TTL and propagation |
| curl -v / openssl s_client | HTTP and TLS debugging — verbose request/response headers, certificate chain verification, cipher negotiation |
When to Use
Reach for these tools whenever application-level logs and metrics don't explain the problem. Connection timeouts, intermittent failures, unexplained latency, and TLS errors all require network-level debugging.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Wireshark | Open Source | Deep packet inspection with GUI — TCP stream reassembly, retransmission analysis, protocol dissection | Development-Production |
| tcpdump | Open Source | Command-line packet capture on remote servers — lightweight, available everywhere, scriptable | Any |
| mtr | Open Source | Continuous network path analysis combining traceroute and ping — shows per-hop loss and jitter | Any |
| netcat (nc) | Open Source | Quick connectivity tests — TCP/UDP port checks, simple client-server testing, banner grabbing | Any |
Debug Checklist
- Check if the port is open and the service is listening: ss -tlnp | grep :PORT — if nothing shows, the service isn't bound or isn't running
- Test basic connectivity: nc -zv HOST PORT — this confirms whether the TCP handshake succeeds without any application protocol
- Trace the network path: mtr -n --report HOST — look for hops with >1% packet loss or sudden latency jumps
- Capture packets for detailed analysis: tcpdump -i any port PORT -w /tmp/capture.pcap -c 1000 — then open in Wireshark
- Check TCP connection health: ss -ti dst HOST — look at rtt, retrans count, cwnd size, and whether the connection is in a healthy state
- Verify DNS resolution: dig +trace HOSTNAME — follow the delegation chain from root servers to authoritative, checking for delays
- Debug TLS issues: openssl s_client -connect HOST:443 -servername HOST — verify certificate validity, chain, and negotiated cipher
- Debug HTTP layer: curl -v -o /dev/null https://HOST/path — inspect request/response headers, redirects, timing, and status codes
Common Mistakes
- Capturing too many packets without filters. Always use tcpdump with port and host filters — an unfiltered capture on a busy server fills disk in seconds
- Running traceroute once and drawing conclusions. Network paths fluctuate — use mtr with 100+ packets to get statistically meaningful results
- Confusing ICMP-based traceroute results with actual TCP path behavior. Some routers rate-limit ICMP, showing false packet loss
- Not checking both sides of the connection. A timeout might be the client not sending, the server not responding, or a middlebox dropping packets
- Forgetting about firewalls and security groups. 'Connection timed out' means packets are being silently dropped (a DROP rule or no route); 'connection refused' means an RST came back (nothing listening, or a firewall REJECT rule)
Real World Usage
- SRE teams use tcpdump to capture packets during incidents, then analyze offline in Wireshark to find retransmissions, resets, and connection failures
- Network engineers use mtr to diagnose path-specific issues when users in a specific region report slowness
- DevOps engineers use ss -tnp to find connection state accumulation (thousands of TIME_WAIT or CLOSE_WAIT sockets) during high-traffic events
- Security teams use tcpdump to verify that traffic between services is actually encrypted (TLS) and not leaking plaintext
- Platform teams use dig +trace to debug DNS propagation delays after zone changes or during DNS migration