Network Observability and Packet-Level Debugging

Learn Java Networking - Part 027

Network observability and packet-level debugging for Java applications, covering latency decomposition, socket and HTTP client diagnostics, JFR, JVM and OS evidence, tcpdump/Wireshark workflow, TLS/HTTP logging, and production-safe troubleshooting playbooks.

Part 027 — Network Observability and Packet-Level Debugging

Core thesis: networking bugs are rarely solved by staring at Java stack traces alone. You need a layered evidence model: application intent, Java runtime behavior, OS socket state, DNS/TLS/HTTP semantics, and packet-level facts.

This part is about diagnosing Java networking systems in production. It does not repeat the general observability series. The scope here is narrower and deeper:

  • What exactly did the Java process try to connect to?
  • Which address was selected after DNS?
  • Did the connection fail before TCP, during TCP, during TLS, during HTTP, or while streaming the body?
  • Was latency caused by DNS, connect, TLS handshake, server processing, response body transfer, backpressure, or client-side queuing?
  • Did timeout/cancellation close the socket as expected?
  • Is the packet trace consistent with what the application logs claim?

A top-tier engineer treats a network incident as an evidence-reconciliation exercise, not as guesswork.

The practical objective is simple:

Given a production networking incident, you should be able to build a timeline that explains what happened without relying on folklore.


1. Kaufman Skill Map

1.1 Target capability

After this part, you should be able to:

  • classify network failures by layer and phase;
  • instrument Java clients and servers without leaking secrets;
  • decompose latency into DNS, connect, TLS, request, first byte, and body transfer;
  • use Java Flight Recorder for socket and HTTP-adjacent diagnosis;
  • enable java.net.http logging safely in non-production or scoped production windows;
  • correlate application logs with packet captures;
  • read TCP-level evidence such as SYN, SYN-ACK, FIN, RST, retransmission, zero window, and TLS handshake boundaries;
  • distinguish client timeout, server close, proxy reset, firewall drop, and DNS failure;
  • produce incident notes that are defensible and reproducible.

1.2 Subskills

| Subskill | Why it matters | Practice target |
| --- | --- | --- |
| Failure phase classification | Different fixes apply at different layers | Label every error as DNS, connect, TLS, HTTP, stream, or app protocol |
| Timeline correlation | Logs alone lie by omission | Align app timestamp, JFR event, OS socket state, and packet timestamp |
| Latency decomposition | “Slow network” is not actionable | Measure connect, handshake, TTFB, and body duration separately |
| Safe logging | Network data can contain secrets | Redact headers, query params, tokens, payload fragments |
| Packet capture | Sometimes packets are the source of truth | Capture minimal traffic with filters and interpret basic TCP behavior |
| JFR diagnostics | JVM evidence is lower-overhead than ad-hoc logging | Record socket read/write stalls and allocation pressure |
| Exception interpretation | Java wraps many network failures | Map exception type and message to probable phase |
| Production playbooks | Incidents need repeatable steps | Build a checklist for DNS/TCP/TLS/HTTP/body/backpressure |

1.3 Anti-goals

This part is not about:

  • general logging frameworks;
  • full OpenTelemetry setup;
  • complete Wireshark mastery;
  • deep TCP congestion-control theory;
  • replacing infrastructure telemetry;
  • blaming the network without proof.

2. The Layered Evidence Model

When a Java network call fails, there are at least six layers of evidence.

| Layer | Typical evidence | Questions answered |
| --- | --- | --- |
| Application | operation name, target logical service, deadline, correlation id | What did the code intend to do? |
| Java API | URI, timeout, proxy, redirect, body publisher/subscriber, exception | What did the JDK client/socket API experience? |
| JVM runtime | JFR socket events, allocation, GC, thread state | Was the process blocked, allocating, or stalled? |
| OS socket | local port, remote address, state, queue sizes | Did the kernel have an open connection and where? |
| Network path | packet capture, NAT, proxy, firewall logs | Did packets leave/return? Who reset/dropped? |
| Peer/proxy | server logs, load balancer logs, TLS logs, upstream metrics | Did the remote endpoint receive and process it? |

The invariant:

Do not conclude from one layer when another layer can falsify it.

For example:

  • Java says SocketTimeoutException.
  • Packet capture shows the server sent response bytes after the client deadline.
  • Application logs show a 300 ms deadline on a call that usually takes 800 ms.

The correct conclusion is not “server down”. The likely conclusion is client deadline too aggressive or deadline not propagated with enough budget.


3. Failure Phase Taxonomy

A production-grade network incident should be classified by phase.

| Phase | Common Java symptom | Likely root classes |
| --- | --- | --- |
| URI parse | IllegalArgumentException, URISyntaxException | malformed URI, bad encoding, unsupported scheme |
| DNS | UnknownHostException, long first-call latency | resolver failure, bad search domain, split-horizon DNS, negative cache |
| Address selection | connects to wrong family/address | IPv4/IPv6 preference, stale DNS, unexpected localhost resolution |
| TCP connect | ConnectException, SocketTimeoutException on connect | service down, firewall reject/drop, backlog saturation, wrong port |
| TLS handshake | SSLHandshakeException, cert path errors | truststore, hostname verification, protocol/cipher/SNI/mTLS issue |
| HTTP protocol | status codes, protocol exception, stream reset | proxy/server behavior, HTTP/2 stream reset, malformed response |
| Body upload | timeout while writing, broken pipe | slow receiver, request too large, server/proxy closed |
| Body download | timeout while reading, partial file | slow sender, client not consuming, stream cancellation |
| Pool reuse | sporadic reset on first write/read | stale idle connection, proxy/load balancer idle timeout |
| Cancellation | future cancelled but socket still busy | cancellation semantics, body subscriber not closed, blocking code ignored deadline |

3.1 Interpret exceptions by phase, not just by type

The same exception type may occur in multiple phases.

| Exception | Possible phase | Diagnostic question |
| --- | --- | --- |
| SocketTimeoutException | connect, read, TLS, HTTP body | Which timeout fired and at what timestamp? |
| ConnectException: Connection refused | TCP connect | Did remote host actively reject with RST? |
| ConnectException: Network is unreachable | routing/address family | Is route missing or IPv6 selected unexpectedly? |
| UnknownHostException | DNS | Was hostname invalid, resolver unavailable, or search domain wrong? |
| SSLHandshakeException | TLS | Is it trust, hostname, SNI, protocol, cipher, or client cert? |
| EOFException | protocol/body | Did peer close cleanly before expected bytes? |
| IOException: Broken pipe | write | Did peer close before/during upload? |
| Connection reset | TCP | Who sent RST: client, server, proxy, firewall? |

A good incident report says:

“The request failed during TLS certificate validation after TCP connect succeeded.”

Not:

“The API is down.”
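
The mapping from exception to phase can be sketched as a small classifier that walks the cause chain, since the informative exception is often wrapped. This is a minimal sketch, not a complete taxonomy: the class name `FailurePhaseClassifier` and the phase labels are illustrative assumptions, and a real classifier would also need the timestamp and timeout that fired to tell connect timeouts from read timeouts.

```java
import java.io.EOFException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.net.http.HttpConnectTimeoutException;
import java.net.http.HttpTimeoutException;
import javax.net.ssl.SSLException;

// Hypothetical helper: the phase labels are illustrative, not a standard vocabulary.
final class FailurePhaseClassifier {
    private FailurePhaseClassifier() {}

    /** Walks the cause chain (bounded depth) and returns the first phase label that matches. */
    static String classify(Throwable error) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < 10; t = t.getCause(), depth++) {
            if (t instanceof UnknownHostException) return "dns";
            // HttpConnectTimeoutException is a subclass of HttpTimeoutException, so test it first.
            if (t instanceof HttpConnectTimeoutException) return "tcp.connect";
            if (t instanceof ConnectException) return "tcp.connect";
            if (t instanceof SSLException) return "tls";
            if (t instanceof HttpTimeoutException) return "http.request-timeout";
            if (t instanceof SocketTimeoutException) return "socket.timeout";
            if (t instanceof EOFException) return "stream.eof";
        }
        return "unknown";
    }
}
```

A generic `IOException` wrapping an `UnknownHostException` is labelled `dns` here, which a check on the top-level type alone would miss.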


4. What to Log in Java Networking Code

Network logging must be useful under stress and safe under audit.

4.1 Minimum client-side call record

Every outbound network call should have a structured record like this:

{
  "event": "network.client.call",
  "operation": "partner-risk-score.lookup",
  "correlationId": "case-721-req-92",
  "targetService": "risk-score-api",
  "scheme": "https",
  "host": "api.partner.example",
  "port": 443,
  "method": "POST",
  "httpVersionRequested": "HTTP_2",
  "connectTimeoutMs": 500,
  "requestTimeoutMs": 2500,
  "deadlineRemainingMsAtStart": 2310,
  "attempt": 1,
  "retryable": false
}

4.2 Minimum completion record

{
  "event": "network.client.complete",
  "operation": "partner-risk-score.lookup",
  "correlationId": "case-721-req-92",
  "targetService": "risk-score-api",
  "durationMs": 184,
  "phase": "http.response",
  "status": 200,
  "responseBytes": 4182,
  "reusedConnection": "unknown",
  "attempt": 1
}

4.3 Minimum failure record

{
  "event": "network.client.failure",
  "operation": "partner-risk-score.lookup",
  "correlationId": "case-721-req-92",
  "targetService": "risk-score-api",
  "durationMs": 503,
  "phase": "tcp.connect",
  "exceptionClass": "java.net.http.HttpConnectTimeoutException",
  "messageClass": "connect-timeout",
  "attempt": 1,
  "retryable": true,
  "deadlineRemainingMsAtFailure": 1807
}

Do not blindly log:

  • full URL with query string;
  • Authorization headers;
  • cookies;
  • client certificates;
  • request/response bodies;
  • signed URLs;
  • PII in payload fragments.
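
One way to enforce the list above is to derive log fields through a redaction helper instead of logging the raw `URI` or header map. The sketch below is an assumption-laden example: `SafeLogFields`, its method names, and the set of sensitive header names are hypothetical and should be extended for your own protocols. It also does not help when the secret is embedded in the path, as with some signed URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; the header list is a starting point, not an exhaustive policy.
final class SafeLogFields {
    private static final Set<String> SENSITIVE_HEADERS =
            Set.of("authorization", "proxy-authorization", "cookie", "set-cookie");

    private SafeLogFields() {}

    /** Keeps scheme, host, port, and path; drops user info, query string, and fragment. */
    static String loggableUri(URI uri) {
        try {
            return new URI(uri.getScheme(), null, uri.getHost(), uri.getPort(),
                    uri.getPath(), null, null).toString();
        } catch (URISyntaxException e) {
            return "[unloggable-uri]";
        }
    }

    /** Replaces the values of sensitive headers with a fixed marker. */
    static Map<String, String> loggableHeaders(Map<String, List<String>> headers) {
        Map<String, String> out = new LinkedHashMap<>();
        headers.forEach((name, values) -> {
            boolean sensitive = SENSITIVE_HEADERS.contains(name.toLowerCase(Locale.ROOT));
            out.put(name, sensitive ? "[REDACTED]" : String.join(",", values));
        });
        return out;
    }
}
```

With the JDK client, `request.headers().map()` already has the `Map<String, List<String>>` shape that `loggableHeaders` expects.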

4.4 Structured Java wrapper around HttpClient

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;
import java.util.Objects;

public final class ObservedHttpClient {
    private final HttpClient client;

    public ObservedHttpClient(HttpClient client) {
        this.client = Objects.requireNonNull(client);
    }

    public <T> HttpResponse<T> send(
            String operation,
            HttpRequest request,
            HttpResponse.BodyHandler<T> bodyHandler
    ) throws IOException, InterruptedException {
        URI uri = request.uri();
        Instant start = Instant.now();

        logStart(operation, request, uri);

        try {
            HttpResponse<T> response = client.send(request, bodyHandler);
            long durationMs = Duration.between(start, Instant.now()).toMillis();
            logSuccess(operation, uri, response.statusCode(), durationMs);
            return response;
        } catch (IOException | InterruptedException e) {
            long durationMs = Duration.between(start, Instant.now()).toMillis();
            logFailure(operation, uri, classify(e), e, durationMs);
            throw e;
        }
    }

    private static void logStart(String operation, HttpRequest request, URI uri) {
        System.out.printf(
                "event=network.client.start operation=%s scheme=%s host=%s port=%d method=%s timeout=%s%n",
                safe(operation),
                uri.getScheme(),
                uri.getHost(),
                effectivePort(uri),
                request.method(),
                request.timeout().map(Duration::toString).orElse("none")
        );
    }

    private static void logSuccess(String operation, URI uri, int status, long durationMs) {
        System.out.printf(
                "event=network.client.success operation=%s host=%s status=%d durationMs=%d%n",
                safe(operation), uri.getHost(), status, durationMs
        );
    }

    private static void logFailure(
            String operation,
            URI uri,
            String phase,
            Exception e,
            long durationMs
    ) {
        System.out.printf(
                "event=network.client.failure operation=%s host=%s phase=%s exception=%s durationMs=%d%n",
                safe(operation), uri.getHost(), phase, e.getClass().getName(), durationMs
        );
    }

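    // Coarse phase label derived from the top-level exception only; wrapped causes are not inspected.
    private static String classify(Exception e) {
        String name = e.getClass().getName();
        String msg = String.valueOf(e.getMessage()).toLowerCase();

        if (e instanceof InterruptedException) return "interrupted";
        if (name.contains("UnknownHost")) return "dns";
        if (name.contains("HttpConnectTimeout")) return "tcp.connect.timeout";
        if (name.contains("ConnectException")) return "tcp.connect";
        // HttpTimeoutException is the request timeout set via HttpRequest.Builder.timeout(...).
        if (name.contains("HttpTimeout")) return "http.request.timeout";
        if (name.contains("SocketTimeout")) return "socket.timeout";
        if (name.contains("SSL")) return "tls";
        if (msg.contains("connection reset")) return "tcp.reset";
        if (msg.contains("broken pipe")) return "tcp.write.closed";
        return "unknown";
    }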

    private static int effectivePort(URI uri) {
        if (uri.getPort() >= 0) return uri.getPort();
        return switch (String.valueOf(uri.getScheme()).toLowerCase()) {
            case "http" -> 80;
            case "https" -> 443;
            default -> -1;
        };
    }

    private static String safe(String value) {
        return value.replaceAll("[^a-zA-Z0-9_.:-]", "_");
    }
}

This wrapper is intentionally simple. In real systems, replace System.out.printf with structured logging and metrics.


5. Latency Decomposition

A single durationMs is necessary but insufficient.

For network calls, decompose latency into phases:

| Phase | Meaning | Common cause when high |
| --- | --- | --- |
| Queue wait | time before call starts | local bulkhead, executor saturation, virtual-thread pinning, rate limiter |
| DNS | hostname resolution | resolver latency, search domains, negative cache, split DNS |
| TCP connect | SYN to established connection | firewall drop, remote overload, route issue, backlog saturation |
| TLS handshake | ClientHello to secure session | cert chain, OCSP/CRL, mTLS, SNI, ALPN, CPU |
| Request headers | writing headers | connection flow control, proxy buffering |
| Request body | upload | slow receiver, large body, backpressure |
| TTFB | time to first response byte | server processing, upstream latency, proxy buffering |
| Response body | download and consume | large response, slow client, decompression, body handler allocation |

5.1 Why HttpClient makes this non-trivial

The JDK HttpClient gives a high-level API. It does not expose a first-class per-phase timing object like some specialized HTTP clients.

Therefore, you usually combine:

  • application timing around send / sendAsync;
  • operation-level metrics;
  • JFR events;
  • HTTP client logging when needed;
  • server/proxy timing headers if available;
  • packet capture for disputed cases.
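
Application timing around `send` can still separate the phases you control. The sketch below is a hypothetical `PhaseTimeline` helper (not a JDK API) that records named marks against an injectable nanosecond clock. One assumed use: with `HttpResponse.BodyHandlers.ofInputStream()`, `send` returns once the response headers are available, so a mark at that point approximates time to headers and a second mark after draining the stream gives body duration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical helper: pass System::nanoTime in production, a fake clock in tests.
final class PhaseTimeline {
    private final LongSupplier nanoClock;
    private final List<String> phases = new ArrayList<>();
    private final List<Long> stamps = new ArrayList<>();

    PhaseTimeline(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        mark("start");
    }

    /** Records that the named phase has just finished. */
    void mark(String phase) {
        phases.add(phase);
        stamps.add(nanoClock.getAsLong());
    }

    /** Milliseconds spent in each phase, in the order the phases were marked. */
    Map<String, Long> durationsMs() {
        Map<String, Long> out = new LinkedHashMap<>();
        for (int i = 1; i < phases.size(); i++) {
            out.put(phases.get(i), (stamps.get(i) - stamps.get(i - 1)) / 1_000_000L);
        }
        return out;
    }
}
```

This does not split DNS, connect, and TLS from each other; those still need JFR, client logging, or packet evidence.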

5.2 Add timing where you control the body

For large responses, body consumption may dominate total latency.

import java.io.IOException;
import java.io.InputStream;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

public final class TimedDownload {
    public static HttpResponse.BodyHandler<Path> toFileWithTiming(Path target) {
        return responseInfo -> HttpResponse.BodySubscribers.mapping(
                HttpResponse.BodySubscribers.ofInputStream(),
                in -> copyWithTiming(in, target)
        );
    }

    private static Path copyWithTiming(InputStream in, Path target) {
        Instant start = Instant.now();
        long bytes = 0;
        byte[] buffer = new byte[64 * 1024];

        try (InputStream input = in; var output = Files.newOutputStream(target)) {
            int read;
            while ((read = input.read(buffer)) != -1) {
                output.write(buffer, 0, read);
                bytes += read;
            }
            long ms = Duration.between(start, Instant.now()).toMillis();
            System.out.printf("event=network.download.complete bytes=%d bodyMs=%d%n", bytes, ms);
            return target;
        } catch (IOException e) {
            long ms = Duration.between(start, Instant.now()).toMillis();
            System.out.printf("event=network.download.failure bytes=%d bodyMs=%d exception=%s%n",
                    bytes, ms, e.getClass().getName());
            throw new RuntimeException(e);
        }
    }
}

The key idea: for streaming, “request duration” and “body consumption duration” may not be the same operational problem.


6. Java Flight Recorder for Network Diagnosis

Java Flight Recorder is often the best first low-overhead JVM-side evidence source.

It can help answer:

  • which threads were blocked in socket reads/writes;
  • whether socket operations were long-running;
  • whether GC or allocation pressure overlapped with network latency;
  • whether virtual threads were parked or carrier threads were saturated;
  • whether file/network I/O spikes correlate with latency;
  • whether exceptions increased during a window.

6.1 Start a bounded recording

For a running process:

jcmd <pid> JFR.start name=network-debug settings=profile duration=120s filename=/tmp/network-debug.jfr

For process startup:

java \
  -XX:StartFlightRecording=name=network-debug,settings=profile,duration=120s,filename=/tmp/network-debug.jfr \
  -jar app.jar

For ongoing production environments, prefer operationally approved templates and time-bounded recordings.

6.2 What to inspect

| JFR area | What to look for | Interpretation |
| --- | --- | --- |
| Socket read events | long reads, low bytes, repeated timeouts | slow peer, stalled response, client waiting |
| Socket write events | long writes, small writes | slow receiver, flow-control pressure, upload bottleneck |
| Thread states | blocked/parked/waiting threads | I/O wait vs CPU saturation |
| Allocation | frequent buffer/string allocations | body handling or logging pressure |
| GC pauses | GC overlap with network spikes | local runtime issue, not network path |
| Exceptions | repeated network exceptions | classify by phase and target |
| Method profiling | hot encode/decode paths | protocol parser or body processing cost |

6.3 JFR does not replace packet capture

JFR sees JVM events. It usually will not prove:

  • whether SYN packets left the host;
  • whether a firewall silently dropped traffic;
  • who sent a TCP reset;
  • whether TLS records are fragmented in a specific way;
  • whether retransmission happened on the wire;
  • whether NAT/proxy changed the path.

Use JFR to narrow the hypothesis. Use packet evidence when the path itself is disputed.
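
The built-in socket events carry host and port but not your operation name. One option, sketched here under assumptions, is an application-defined JFR event so each call shows up on the same recording timeline. The event name `app.NetworkCall`, its fields, and the `record` helper are hypothetical; only the `jdk.jfr` annotations and the `Event` methods are JDK API.

```java
import java.util.concurrent.Callable;
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical application event; name, category, and fields are illustrative choices.
@Name("app.NetworkCall")
@Label("Network Call")
@Category({"Application", "Network"})
final class NetworkCallEvent extends Event {
    @Label("Operation")
    String operation;

    @Label("Target Service")
    String targetService;

    @Label("Outcome")
    String outcome;

    /** Runs the call inside a JFR event; the event is only written while a recording is active. */
    static <T> T record(String operation, String targetService, Callable<T> call) throws Exception {
        NetworkCallEvent event = new NetworkCallEvent();
        event.operation = operation;
        event.targetService = targetService;
        event.begin();
        try {
            T result = call.call();
            event.outcome = "ok";
            return result;
        } catch (Exception e) {
            // Record only the exception class; messages may contain dynamic or sensitive text.
            event.outcome = e.getClass().getName();
            throw e;
        } finally {
            event.commit();
        }
    }
}
```

A call site would look like `NetworkCallEvent.record("partner-risk-score.lookup", "risk-score-api", () -> client.send(request, handler))`.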


7. java.net.http Diagnostic Logging

The JDK HTTP Client has a system property for high-level logging through the platform logging API:

-Djdk.httpclient.HttpClient.log=errors,requests,headers,frames,ssl,trace,channel

Use it carefully.

7.1 Safe usage rules

| Rule | Reason |
| --- | --- |
| Prefer lower environments first | HTTP logs may expose metadata and operational details |
| Never enable broad body/content logging casually | Payloads may contain PII or credentials |
| Scope by short duration in production | Logging can add volume and overhead |
| Redact before sharing logs | Headers and URLs may contain secrets |
| Align timestamps with app logs and packet capture | Logging is useful only when correlated |

7.2 What HTTP client logs can answer

| Evidence | Useful for |
| --- | --- |
| request line and headers | confirming method, authority, redirects, protocol |
| frame logs | HTTP/2 stream-level behavior |
| SSL logs | TLS handshake path and negotiation hints |
| channel logs | connection/channel lifecycle |
| errors | internal client failures and transport events |

7.3 What HTTP client logs cannot safely answer

They are not a replacement for:

  • peer server logs;
  • TLS certificate-chain inspection;
  • packet capture;
  • DNS resolver logs;
  • proxy/firewall logs;
  • application-level business causality.

8. JSSE/TLS Debugging

For TLS problems, Java can emit detailed JSSE diagnostics.

Typical JVM option:

-Djavax.net.debug=ssl,handshake

More verbose variants may include certificate and key manager details, but they should be used carefully.

8.1 TLS debugging questions

| Question | Evidence |
| --- | --- |
| Did TCP connect succeed? | TLS logs start only after a socket exists |
| Was SNI sent? | ClientHello details |
| Which protocol was negotiated? | TLS version in handshake |
| Was ALPN negotiated? | HTTP/2 vs HTTP/1.1 negotiation evidence |
| Which certificate chain was received? | certificate debug output |
| Why did trust validation fail? | cert path validation exception |
| Was a client certificate requested? | CertificateRequest message |
| Was a client cert selected? | key manager debug output |

8.2 Common TLS conclusions

| Symptom | Likely conclusion |
| --- | --- |
| PKIX path building failed | truststore does not trust issuer chain |
| No subject alternative DNS name matching ... | hostname verification failure |
| server closes after ClientHello | SNI/protocol/cipher mismatch or middlebox behavior |
| HTTP/2 expected but HTTP/1.1 used | ALPN not negotiated or server/proxy limitation |
| mTLS handshake fails after cert request | client cert/key missing, wrong alias, or unacceptable CA |

Do not disable hostname verification or trust all certificates to “fix” production. That changes the security property of the system.


9. OS-Level Socket Evidence

From the OS, you can answer questions Java does not expose directly.

9.1 Useful Linux commands

# Established connections and listening sockets
ss -tunap

# Connections involving a specific port
ss -tanp '( sport = :8080 or dport = :8080 )'

# Listening sockets
ss -ltnp

# Process file descriptors
ls -l /proc/<pid>/fd

# Per-process open TCP sockets through lsof, if available
lsof -Pan -p <pid> -i

9.2 What socket states imply

| State | Meaning | Typical Java-level symptom |
| --- | --- | --- |
| LISTEN | server socket bound and accepting | service has an open listener |
| SYN-SENT | client sent SYN, waiting | connect in progress or about to time out |
| SYN-RECV | server received SYN, handshake incomplete | SYN backlog pressure possible |
| ESTAB | TCP established | Java may be reading, writing, or stalled in application code |
| FIN-WAIT-1/2 | local close in progress | graceful close path |
| CLOSE-WAIT | peer closed, local app has not closed | Java code never closed its side (socket leak) |
| TIME-WAIT | closed connection retained temporarily | high churn or no pooling |

9.3 CLOSE-WAIT is usually an application smell

If a process has many CLOSE-WAIT sockets, the peer already closed but your process has not closed its side.

Common causes:

  • not closing response body streams;
  • not closing raw socket streams;
  • leaked WebSocket/session lifecycle;
  • forgotten error path;
  • server protocol state machine does not handle EOF;
  • thread stuck before cleanup.

A correct server treats EOF as a state transition.
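
As a minimal sketch of that rule for a blocking handler: `read` returning `-1` means the peer has closed its side, and try-with-resources guarantees the local side is closed on every path, including exceptions. The class and method names are hypothetical, and a real handler would decode frames instead of only counting bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

// Hypothetical handler showing EOF handled as an explicit transition to "closed".
final class EofAwareHandler {
    private EofAwareHandler() {}

    /** Reads until the peer closes its side, then closes ours; returns the bytes consumed. */
    static long drainAndClose(Socket socket) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8 * 1024];
        // Closing the socket here is what moves the connection out of CLOSE-WAIT.
        try (Socket s = socket; InputStream in = s.getInputStream()) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }
}
```

If the loop exits through an exception instead of EOF, the same try-with-resources block still closes the socket, which covers the forgotten-error-path cause listed above.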


10. Packet Capture Workflow

Packet capture is the most concrete evidence for path behavior.

10.1 Minimal tcpdump examples

Capture traffic to a host and port:

tcpdump -i any -nn host 203.0.113.10 and port 443

Write to a file for Wireshark:

tcpdump -i any -nn -s 0 -w /tmp/capture.pcap host 203.0.113.10 and port 443

Capture a specific local service:

tcpdump -i any -nn -s 0 -w /tmp/service-8080.pcap port 8080

10.2 Capture rules

| Rule | Why |
| --- | --- |
| Filter aggressively | Production packet capture can be huge and sensitive |
| Capture both client and server side if possible | NAT/proxy/firewall may change the path |
| Record exact time window | Needed to correlate with logs |
| Avoid payload capture unless approved | Payload can contain secrets/PII |
| Prefer metadata-first analysis | SYN/RST/FIN/retransmit often enough |

10.3 Reading the TCP handshake

Normal connect:

Client -> Server  SYN
Server -> Client  SYN, ACK
Client -> Server  ACK

Connection refused:

Client -> Server  SYN
Server -> Client  RST, ACK

Silent drop/firewall blackhole:

Client -> Server  SYN
Client -> Server  SYN retransmission
Client -> Server  SYN retransmission
...

The Java symptom may be the same broad “connect failed”, but the fix is different:

| Wire behavior | Likely fix direction |
| --- | --- |
| RST immediately | service not listening, wrong port, active reject |
| SYN retransmits | firewall drop, routing, security group, blackhole |
| handshake succeeds then RST | protocol/TLS/proxy/server close |
| data retransmits | packet loss, congestion, MTU/path issue |
| zero window | receiver not consuming fast enough |

10.4 FIN vs RST

| Signal | Meaning | Java interpretation |
| --- | --- | --- |
| FIN | graceful close; no more bytes from sender | read eventually returns EOF (-1) |
| RST | abortive close; connection reset | Connection reset, stream failure |

A reset is not automatically a network outage. It can be:

  • peer process crash;
  • proxy idle timeout;
  • server rejects malformed protocol;
  • client wrote after peer closed;
  • load balancer reset;
  • firewall policy;
  • application intentionally aborting.
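
The difference can be reproduced on loopback, which is useful for checking how your own client code reacts to each signal. In this sketch the accepting side closes either normally (FIN) or with `SO_LINGER` set to zero, which on common platforms makes `close()` abort the connection with an RST. The class name is hypothetical, and the exact exception the reader sees on the abortive path may vary by OS, so none is asserted here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical loopback demo; not a substitute for a packet capture on the real path.
final class CloseSignalDemo {
    private CloseSignalDemo() {}

    /** Returns "eof" if the reader saw a graceful close, otherwise the exception class name. */
    static String observeClose(boolean abortive) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            if (abortive) {
                // Linger enabled with a zero timeout: close() discards the connection instead of sending FIN.
                accepted.setSoLinger(true, 0);
            }
            accepted.close();
            try {
                return client.getInputStream().read() == -1 ? "eof" : "data";
            } catch (IOException e) {
                return e.getClass().getName();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("graceful=" + observeClose(false));
        System.out.println("abortive=" + observeClose(true));
    }
}
```

Running `tcpdump -i lo -nn` alongside it shows the FIN and RST segments that correspond to each result.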

11. Correlation ID Across Network Boundaries

A packet capture shows packets. It does not show which business operation they belong to.

A network call should carry a correlation identifier when the protocol allows it.

For HTTP:

HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.partner.example/risk-score"))
        .header("X-Correlation-Id", correlationId)
        .header("Accept", "application/json")
        .timeout(Duration.ofSeconds(2))
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();

For raw protocols, include a request id in your frame header.

| magic | version | requestId | type | length | payload |

This lets you reconcile:

  • caller log;
  • callee log;
  • proxy/load balancer log;
  • packet timestamp;
  • JFR event.
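
For the raw-protocol case, the frame layout above can be sketched with `ByteBuffer`. The field widths (4-byte magic, 1-byte version, 8-byte request id, 1-byte type, 4-byte length) and the magic value are assumptions made for illustration; the source only names the fields.

```java
import java.nio.ByteBuffer;

// Hypothetical frame codec; widths and MAGIC are illustrative, not a defined wire format.
record Frame(byte version, long requestId, byte type, byte[] payload) {
    static final int MAGIC = 0x4A4E4554;
    static final int HEADER_BYTES = 4 + 1 + 8 + 1 + 4;

    /** Serializes header and payload into a buffer that is ready to be written. */
    ByteBuffer encode() {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + payload.length);
        buf.putInt(MAGIC).put(version).putLong(requestId).put(type)
           .putInt(payload.length).put(payload);
        buf.flip();
        return buf;
    }

    /** Parses one complete frame; rejects short headers, a wrong magic, and bad lengths. */
    static Frame decode(ByteBuffer buf) {
        if (buf.remaining() < HEADER_BYTES) throw new IllegalArgumentException("short header");
        if (buf.getInt() != MAGIC) throw new IllegalArgumentException("bad magic");
        byte version = buf.get();
        long requestId = buf.getLong();
        byte type = buf.get();
        int length = buf.getInt();
        if (length < 0 || length > buf.remaining()) throw new IllegalArgumentException("bad length");
        byte[] payload = new byte[length];
        buf.get(payload);
        return new Frame(version, requestId, type, payload);
    }
}
```

Because `requestId` sits at a fixed offset, it can be logged on both sides before the payload is parsed, and located in a capture without decoding the body.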

12. Server-Side Network Observability

A production Java server should log and measure connection lifecycle, not only request lifecycle.

12.1 Raw TCP server lifecycle events

| Event | Useful fields |
| --- | --- |
| accepted | local address, remote address, connection id |
| first byte received | time since accept |
| frame decoded | request id, frame type, size |
| protocol error | reason, bytes consumed, remote address |
| response queued | queue bytes, queue depth |
| write completed | bytes written, duration |
| peer closed | state, outstanding request count |
| closed | close reason, lifetime, bytes in/out |

12.2 Example connection id wrapper

import java.net.SocketAddress;
import java.nio.channels.SocketChannel;
import java.util.concurrent.atomic.AtomicLong;

public final class ConnectionIdentity {
    private static final AtomicLong SEQUENCE = new AtomicLong();

    public static String assign(SocketChannel channel) {
        long id = SEQUENCE.incrementAndGet();
        SocketAddress remote;
        try {
            remote = channel.getRemoteAddress();
        } catch (Exception e) {
            remote = null;
        }
        return "conn-" + id + " remote=" + remote;
    }
}

The connection id should be propagated through read/write logs for that channel.

12.3 Server metrics that matter

| Metric | Why it matters |
| --- | --- |
| active connections | saturation, leaks, slow clients |
| accepts/sec | traffic rate and connection churn |
| accept failures | file descriptor, backlog, permission, OS errors |
| bytes read/sec | inbound throughput |
| bytes written/sec | outbound throughput |
| protocol decode failures | malformed clients or parser bugs |
| write queue bytes | backpressure and slow consumer indicator |
| connection lifetime | churn vs long-lived sessions |
| close reason distribution | EOF, timeout, protocol error, server shutdown |
| event-loop lag | NIO server health |

13. Client-Side Metrics That Matter

| Metric | Dimension | Why |
| --- | --- | --- |
| calls total | operation, target, method | baseline volume |
| failures total | phase, exception class | root-cause grouping |
| latency histogram | operation, target | SLO, tail latency |
| retries total | operation, reason | retry storms |
| timeout total | timeout type | budget misconfiguration |
| response body bytes | operation | payload growth |
| request body bytes | operation | upload pressure |
| in-flight calls | target | saturation |
| queued calls | bulkhead/executor | local bottleneck |
| cancellation total | operation | deadline pressure |
| DNS failures | host | resolver or config issue |

Avoid high-cardinality dimensions:

  • full URL;
  • user id;
  • raw IP for internet traffic at high scale;
  • exception message with dynamic text;
  • request id.

Use controlled labels:

  • target service name;
  • operation name;
  • failure phase;
  • retry decision;
  • status class, not every status if volume is high.
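
A minimal sketch of controlled labels, assuming an in-memory counter map in place of a real metrics library: the status code is collapsed to a class such as `2xx`, and keys are built only from bounded values. `ClientCallMetrics` and its key format are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory counters; a real system would delegate to its metrics library.
final class ClientCallMetrics {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Collapses a status code into a bounded label such as "2xx". */
    static String statusClass(int status) {
        return (status >= 100 && status <= 599) ? (status / 100) + "xx" : "other";
    }

    void recordCompletion(String targetService, String operation, int status) {
        increment("calls|" + targetService + "|" + operation + "|" + statusClass(status));
    }

    void recordFailure(String targetService, String operation, String phase) {
        increment("failures|" + targetService + "|" + operation + "|" + phase);
    }

    /** Current value for a key, or 0 if nothing has been recorded under it. */
    long count(String key) {
        LongAdder adder = counters.get(key);
        return adder == null ? 0 : adder.sum();
    }

    private void increment(String key) {
        counters.computeIfAbsent(key, k -> new LongAdder()).increment();
    }
}
```

The number of distinct keys is bounded by services × operations × a handful of status classes or phases, which is what keeps the metric cheap to store and query.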

14. Debugging DNS from Java

DNS problems are often hidden behind UnknownHostException or long connection setup.

14.1 Java probe

import java.net.InetAddress;
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;

public final class DnsProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length == 0 ? "example.com" : args[0];
        Instant start = Instant.now();
        InetAddress[] addresses = InetAddress.getAllByName(host);
        long ms = Duration.between(start, Instant.now()).toMillis();

        System.out.printf("host=%s lookupMs=%d addresses=%s%n",
                host, ms, Arrays.toString(addresses));
    }
}

14.2 Compare with OS tools

getent hosts api.partner.example
nslookup api.partner.example
dig api.partner.example

Mismatch between Java and OS tools may indicate:

  • JVM DNS cache;
  • different resolver configuration in container;
  • search domain behavior;
  • IPv6 vs IPv4 preference;
  • custom name-service provider or resolver SPI;
  • split-horizon DNS depending on network namespace.
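
Two of these causes can be checked from inside the JVM: the DNS cache TTL security properties and the address-family preferences. The helper below is a hypothetical sketch; the property names are the standard JDK ones, and a value of `null` means the property is not explicitly set in that JVM.

```java
import java.net.Inet4Address;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.security.Security;

// Hypothetical helper that reports the resolver-related settings the running JVM actually sees.
final class DnsDiagnostics {
    private DnsDiagnostics() {}

    /** Address family of a resolved address, for spotting unexpected IPv6 or IPv4 selection. */
    static String family(InetAddress address) {
        if (address instanceof Inet4Address) return "IPv4";
        if (address instanceof Inet6Address) return "IPv6";
        return "unknown";
    }

    /** JVM DNS cache TTLs (security properties) and address-family preferences (system properties). */
    static String cacheSettings() {
        return "networkaddress.cache.ttl=" + Security.getProperty("networkaddress.cache.ttl")
                + " networkaddress.cache.negative.ttl="
                + Security.getProperty("networkaddress.cache.negative.ttl")
                + " java.net.preferIPv4Stack=" + System.getProperty("java.net.preferIPv4Stack")
                + " java.net.preferIPv6Addresses=" + System.getProperty("java.net.preferIPv6Addresses");
    }
}
```

Logging `cacheSettings()` once at startup and `family(...)` for each address returned by `InetAddress.getAllByName` makes a later comparison with `dig` output much less ambiguous.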

15. Debugging TLS from Java

15.1 Minimal TLS probe

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSocket;

public final class TlsProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "example.com";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 443;

        try (SSLSocket socket = (SSLSocket) SSLContext.getDefault()
                .getSocketFactory()
                .createSocket(host, port)) {
            // A raw SSLSocket does not verify the hostname by default. Enable HTTPS
            // endpoint identification so the probe fails on a hostname mismatch,
            // as HttpClient would.
            SSLParameters params = socket.getSSLParameters();
            params.setEndpointIdentificationAlgorithm("HTTPS");
            socket.setSSLParameters(params);

            socket.startHandshake();
            System.out.println("protocol=" + socket.getSession().getProtocol());
            System.out.println("cipher=" + socket.getSession().getCipherSuite());
            System.out.println("peer=" + socket.getSession().getPeerPrincipal());
        }
    }
}

Run with:

java -Djavax.net.debug=ssl,handshake TlsProbe api.partner.example 443

15.2 Compare with OpenSSL

openssl s_client -connect api.partner.example:443 -servername api.partner.example -showcerts

If OpenSSL succeeds and Java fails, suspect:

  • different trust store;
  • hostname verification differences;
  • missing intermediate certificate;
  • JDK disabled algorithm constraints;
  • mTLS/client cert configuration;
  • proxy inspection certificate not trusted by Java.
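
Several of these suspects can be checked from inside the same JVM that fails. The sketch below reports which trust store property is set, how many issuers the default trust managers accept, and the disabled-algorithm security properties. `TrustStoreReport` is a hypothetical name; the APIs and property names are standard JSSE ones.

```java
import java.security.KeyStore;
import java.security.Security;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical report; run it with the same JVM flags and user as the failing process.
final class TrustStoreReport {
    private TrustStoreReport() {}

    /** Number of CA certificates accepted by the JVM's default trust managers. */
    static int defaultTrustedIssuerCount() throws Exception {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init((KeyStore) null); // null selects the JVM's default trust store
        int count = 0;
        for (TrustManager tm : tmf.getTrustManagers()) {
            if (tm instanceof X509TrustManager) {
                count += ((X509TrustManager) tm).getAcceptedIssuers().length;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("javax.net.ssl.trustStore="
                + System.getProperty("javax.net.ssl.trustStore"));
        System.out.println("trustedIssuers=" + defaultTrustedIssuerCount());
        System.out.println("jdk.tls.disabledAlgorithms="
                + Security.getProperty("jdk.tls.disabledAlgorithms"));
        System.out.println("jdk.certpath.disabledAlgorithms="
                + Security.getProperty("jdk.certpath.disabledAlgorithms"));
    }
}
```

A small issuer count, or a `javax.net.ssl.trustStore` value pointing at a custom file, is a strong hint that Java and OpenSSL are validating against different CA sets.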

16. Debugging HTTP/2 Problems

HTTP/2 adds stream-level behavior that can confuse traditional connection-level thinking.

| Symptom | Possible cause |
| --- | --- |
| One request fails but connection remains open | stream reset, not TCP reset |
| Many streams slow together | TCP head-of-line blocking, flow-control window, server saturation |
| HTTP/1.1 used unexpectedly | ALPN negotiation failed or server/proxy limitation |
| Large download stalls | response flow control or client not consuming body |
| Upload stalls | server/proxy not reading request body |
| GOAWAY received | server draining connection or rejecting new streams |

Debug workflow:

  1. Confirm negotiated protocol.
  2. Check whether failure is stream-level or connection-level.
  3. Inspect response body consumption.
  4. Check client/server/proxy HTTP/2 settings.
  5. Correlate frame logs with server/load balancer logs.
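
Step 1 can be done from the response itself: `HttpResponse.version()` reports the protocol that was actually used, which may differ from the version the client requested. The helper below is a hypothetical sketch that turns the comparison into a loggable note.

```java
import java.net.http.HttpClient;

// Hypothetical helper; the wording of the notes is illustrative.
final class ProtocolCheck {
    private ProtocolCheck() {}

    /** Compares the requested HTTP version with the one the response reports. */
    static String describe(HttpClient.Version requested, HttpClient.Version negotiated) {
        if (requested == negotiated) {
            return "negotiated " + negotiated + " as requested";
        }
        if (requested == HttpClient.Version.HTTP_2) {
            // The JDK client can fall back to HTTP/1.1 when HTTP/2 cannot be negotiated.
            return "requested HTTP_2 but negotiated " + negotiated
                    + ": check ALPN and any proxy or load balancer in the path";
        }
        return "requested " + requested + " but negotiated " + negotiated;
    }
}
```

A typical call is `ProtocolCheck.describe(client.version(), response.version())`, logged once per target instead of per request.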

17. Debugging WebSocket Problems

WebSocket failures are connection-lifecycle failures plus message-protocol failures.

| Symptom | Likely area |
| --- | --- |
| handshake fails | HTTP upgrade/auth/proxy/TLS |
| connection closes after idle | missing ping/pong, proxy idle timeout |
| messages stop arriving | listener demand not requested, app backpressure |
| memory grows | inbound messages buffered faster than processed |
| close code abnormal | network break, proxy reset, peer crash |
| reconnect storm | no backoff or bad close classification |

For Java WebSocket.Listener, remember demand:

@Override
public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
    try {
        handle(data, last);
    } finally {
        webSocket.request(1); // ask for the next message/frame after processing capacity is available
    }
    return CompletableFuture.completedFuture(null);
}

The diagnostic invariant:

If your listener does not request more demand, the connection may look “stuck” even though the network is fine.


18. Production-Safe Troubleshooting Playbooks

18.1 UnknownHostException

Checklist:

  • print normalized host;
  • compare Java InetAddress.getAllByName with getent/dig;
  • check container /etc/resolv.conf;
  • check JVM DNS cache/security properties;
  • check search domain expansion;
  • check IPv4/IPv6 address family;
  • check if failures align with deploy or DNS change.

18.2 Connect timeout

Checklist:

  • confirm effective host/port;
  • check ss for SYN-SENT;
  • run packet capture on client;
  • check security group/firewall/load balancer;
  • test from same network namespace/container;
  • compare with curl --connect-timeout from same host;
  • inspect backlog/accept pressure on server.

18.3 Connection refused

Likely facts:

  • packet reached a host;
  • host or intermediary actively rejected;
  • port is closed or policy rejects.

Checklist:

  • verify remote service is listening;
  • verify correct port and protocol;
  • verify container port mapping;
  • verify load balancer target health;
  • check deploy timing;
  • check firewall reject vs drop policy.
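The difference between 18.2 and 18.3 can be reproduced with a plain socket probe. The sketch below is illustrative only: `ConnectProbe`, `Outcome`, and `Result` are hypothetical names. It assumes that a `SocketTimeoutException` from `Socket.connect` means the Java-side connect timeout fired, and that a `ConnectException` usually means an active refusal; the exception message is kept in the result because `ConnectException` can carry other OS-level errors as well.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ConnectProbe {
    enum Outcome { CONNECTED, REFUSED, TIMED_OUT, OTHER_FAILURE }

    record Result(Outcome outcome, long elapsedMillis, String detail) {}

    static Result probe(String host, int port, int timeoutMillis) {
        long start = System.nanoTime();
        Outcome outcome;
        String detail = "";
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            outcome = Outcome.CONNECTED;
        } catch (SocketTimeoutException e) {
            outcome = Outcome.TIMED_OUT;      // no answer within timeoutMillis
            detail = e.getMessage();
        } catch (ConnectException e) {
            outcome = Outcome.REFUSED;        // usually an RST; check detail to be sure
            detail = e.getMessage();
        } catch (IOException e) {
            outcome = Outcome.OTHER_FAILURE;  // e.g. unresolved host, no route
            detail = e.toString();
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(outcome, elapsedMillis, detail);
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 443;
        System.out.println(probe(host, port, 2000));
    }
}
```

A refusal normally returns within about one round trip, while a timeout takes the full configured duration, so the elapsed time is a useful second signal next to the exception type.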

18.4 TLS certificate failure

Checklist:

  • confirm hostname in URI matches certificate SAN;
  • inspect chain with openssl s_client;
  • check Java truststore used by the process;
  • check missing intermediate CA;
  • check corporate TLS inspection;
  • check mTLS client certificate and key alias;
  • enable JSSE debug for a bounded window;
  • do not disable validation as a fix.
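For the "truststore used by the process" item, it can help to ask JSSE directly which CA certificates the default trust managers accept. This is a minimal sketch with a hypothetical class name, `TrustStoreProbe`. It only reflects the JVM default trust configuration; an application that builds its own `SSLContext` may trust a different set.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

final class TrustStoreProbe {
    /** Subject names of the CA certificates accepted by the default trust managers. */
    static List<String> defaultTrustedIssuers() {
        try {
            TrustManagerFactory factory =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            factory.init((KeyStore) null); // null: use the JVM's default truststore lookup
            List<String> subjects = new ArrayList<>();
            for (TrustManager manager : factory.getTrustManagers()) {
                if (manager instanceof X509TrustManager x509) {
                    for (X509Certificate ca : x509.getAcceptedIssuers()) {
                        subjects.add(ca.getSubjectX500Principal().getName());
                    }
                }
            }
            return subjects;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("cannot load default trust managers", e);
        }
    }

    public static void main(String[] args) {
        // Print locations only; never print truststore or keystore passwords.
        System.out.println("javax.net.ssl.trustStore="
                + System.getProperty("javax.net.ssl.trustStore"));
        List<String> issuers = defaultTrustedIssuers();
        System.out.println("trusted CA count=" + issuers.size());
        if (args.length > 0) {
            // Optional filter, e.g. part of the expected root or corporate CA name.
            issuers.stream().filter(s -> s.contains(args[0])).forEach(System.out::println);
        }
    }
}
```

Comparing this list with the chain shown by `openssl s_client` makes a missing root or a corporate inspection CA visible without turning on full JSSE debug output.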

18.5 Sporadic reset on reused connections

Likely causes:

  • server/load balancer idle timeout shorter than client reuse;
  • proxy closes idle connection;
  • pooled stale socket reused;
  • peer restarts/drains;
  • NAT mapping expired.

Checklist:

  • correlate resets with idle age;
  • compare with keepalive timeout settings;
  • reduce client keepalive below infrastructure idle timeout;
  • retry only safe idempotent operations;
  • check load balancer drain/GOAWAY behavior;
  • capture packet to identify RST sender.
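The "retry only safe idempotent operations" item can be encoded as an explicit decision function. The sketch below is one possible shape, not a prescribed policy: `StaleConnectionRetryPolicy` and `shouldRetry` are hypothetical names, and treating every `IOException` as a candidate for retry is a simplifying assumption. The method set follows RFC 9110, which defines the safe methods plus PUT and DELETE as idempotent.

```java
import java.io.IOException;
import java.util.Set;

final class StaleConnectionRetryPolicy {
    // Idempotent methods per RFC 9110: the safe methods plus PUT and DELETE.
    private static final Set<String> IDEMPOTENT_METHODS =
            Set.of("GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE");

    /**
     * Decides whether a failed request may be sent again.
     *
     * @param method        HTTP method exactly as sent (methods are case-sensitive)
     * @param failure       the exception the client surfaced
     * @param attemptsSoFar attempts already made, including the failed one
     * @param maxAttempts   upper bound on attempts, to avoid retry storms
     */
    static boolean shouldRetry(String method, Throwable failure,
                               int attemptsSoFar, int maxAttempts) {
        if (attemptsSoFar >= maxAttempts) {
            return false;
        }
        if (!(failure instanceof IOException)) {
            return false; // assumption: only I/O failures suggest a stale connection
        }
        return IDEMPOTENT_METHODS.contains(method);
    }

    public static void main(String[] args) {
        IOException reset = new IOException("Connection reset");
        System.out.println("GET  -> " + shouldRetry("GET", reset, 1, 2));
        System.out.println("POST -> " + shouldRetry("POST", reset, 1, 2));
    }
}
```

For the keepalive item, if the client is `java.net.http.HttpClient`, the JDK module documentation describes a `jdk.httpclient.keepalive.timeout` system property (in seconds) for idle pooled connections; check the documentation of the JDK version in use before relying on a specific default.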

18.6 Slow download

Checklist:

  • check response size;
  • check if body handler buffers whole response;
  • measure TTFB vs body duration;
  • inspect client CPU/decompression;
  • inspect disk write speed if saving file;
  • check TCP zero-window evidence;
  • check server throttling/proxy buffering;
  • inspect GC/allocation around response body.
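The "TTFB vs body duration" measurement can be approximated with `java.net.http` by using a streaming body handler, since `send` with `BodyHandlers.ofInputStream()` returns once the response headers are available. The following is a sketch: `DownloadTiming`, `Timing`, and `demoAgainstLocalServer` are hypothetical names, and the loopback server exists only to keep the example self-contained. Note that the "headers" time on a fresh connection also includes DNS, connect, and TLS.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class DownloadTiming {
    record Timing(int status, long headersMillis, long bodyMillis, long bodyBytes) {}

    static Timing measure(HttpClient client, URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        long start = System.nanoTime();
        // Returns when headers have arrived; the body is streamed as we read it.
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());
        long headersAt = System.nanoTime();
        long bytes = 0;
        try (InputStream body = response.body()) {
            byte[] buffer = new byte[16 * 1024];
            for (int n; (n = body.read(buffer)) != -1; ) {
                bytes += n; // a real consumer would process or write the bytes here
            }
        }
        long doneAt = System.nanoTime();
        return new Timing(response.statusCode(),
                (headersAt - start) / 1_000_000, (doneAt - headersAt) / 1_000_000, bytes);
    }

    /** Demo only: serves bodySize zero bytes from a loopback server and measures the download. */
    static Timing demoAgainstLocalServer(int bodySize) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/blob", exchange -> {
                byte[] payload = new byte[bodySize];
                exchange.sendResponseHeaders(200, payload.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(payload);
                }
            });
            server.start();
            try {
                URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/blob");
                return measure(HttpClient.newHttpClient(), uri);
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoAgainstLocalServer(8 * 1024 * 1024));
    }
}
```

If `bodyMillis` dominates while `headersMillis` is small, the investigation moves to the consumer side (handler, disk, decompression, GC) and to TCP window evidence rather than to server processing time.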

19. Packet-Level Patterns and Java Meaning

| Packet pattern | Java symptom | Meaning |
|---|---|---|
| SYN retransmits, no SYN-ACK | connect timeout | blackhole/drop/path issue |
| RST after SYN | connection refused | no listener or active reject |
| FIN after response | normal EOF | graceful peer close |
| RST during write | broken pipe/reset | peer aborted or proxy reset |
| ACKs but no app data | read timeout | peer idle, server stuck, or app not sending |
| repeated small packets | poor batching/Nagle/flush behavior | inefficient write pattern |
| zero window from client | slow Java consumer | application/backpressure issue |
| zero window from server | slow peer/proxy | upload or peer processing issue |
| retransmissions under load | packet loss/congestion | path or saturation issue |
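On the Java side, the "Java symptom" column usually arrives as an exception type, so a first-pass phase label can be derived from the exception chain before any packet capture exists. The sketch below is an assumption-laden starting point: `FailurePhaseClassifier` and its `Phase` values are hypothetical, and the mapping is deliberately coarse. In particular, a `SocketTimeoutException` can come from either connect or read, so it needs call-site context.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.net.http.HttpConnectTimeoutException;
import java.net.http.HttpTimeoutException;
import javax.net.ssl.SSLException;

final class FailurePhaseClassifier {
    enum Phase { DNS, TCP_CONNECT, TLS, WAIT_TIMEOUT, CONNECTION_BROKEN, UNKNOWN }

    /** Walks the cause chain and returns the first phase that can be recognized. */
    static Phase classify(Throwable failure) {
        Throwable current = failure;
        // Depth limit guards against pathological or cyclic cause chains.
        for (int depth = 0; current != null && depth < 16; depth++) {
            if (current instanceof UnknownHostException) {
                return Phase.DNS;
            }
            if (current instanceof HttpConnectTimeoutException
                    || current instanceof ConnectException
                    || current instanceof NoRouteToHostException) {
                return Phase.TCP_CONNECT;
            }
            if (current instanceof SSLException) {
                return Phase.TLS; // handshake or later TLS record failure
            }
            if (current instanceof HttpTimeoutException
                    || current instanceof SocketTimeoutException) {
                return Phase.WAIT_TIMEOUT; // connect or read: needs call-site context
            }
            if (current instanceof SocketException) {
                return Phase.CONNECTION_BROKEN; // e.g. reset or broken pipe
            }
            current = current.getCause();
        }
        return Phase.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify(new UnknownHostException("api.partner.example")));
        System.out.println(classify(new SocketException("Connection reset")));
    }
}
```

The label is a hypothesis to reconcile with `ss` output and packet evidence, not a conclusion; the table above is what confirms or contradicts it.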

20. Avoiding Misleading Diagnostics

20.1 “Ping works” is weak evidence

Ping uses ICMP, not TCP, TLS, HTTP, proxy, SNI, ALPN, or application authentication.

Better tests:

curl -v --connect-timeout 2 https://api.partner.example/health
openssl s_client -connect api.partner.example:443 -servername api.partner.example
nc -vz api.partner.example 443

But even these are not perfect because your Java app may use:

  • different truststore;
  • different proxy;
  • different DNS cache;
  • different source IP;
  • different container namespace;
  • different headers/auth;
  • different HTTP version.
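Several of these differences can be printed from inside the JVM, which makes the comparison with `curl` concrete. This is a minimal sketch with hypothetical names (`EffectiveNetworkConfig`, `describe`); the property list is a selection of standard JDK networking and JSSE properties, not an exhaustive one, and the proxy line only reflects the default `ProxySelector`, which a client can override.

```java
import java.net.ProxySelector;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EffectiveNetworkConfig {
    // Locations and switches only; password properties are intentionally not listed.
    private static final List<String> PROPERTY_NAMES = List.of(
            "http.proxyHost", "https.proxyHost", "http.nonProxyHosts",
            "java.net.useSystemProxies", "java.net.preferIPv4Stack",
            "java.net.preferIPv6Addresses", "javax.net.ssl.trustStore",
            "javax.net.ssl.keyStore");

    static Map<String, String> describe(URI target) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("java.version", System.getProperty("java.version"));
        for (String name : PROPERTY_NAMES) {
            out.put(name, System.getProperty(name, "(unset)"));
        }
        ProxySelector selector = ProxySelector.getDefault();
        out.put("proxy.for.target", selector == null
                ? "(no default ProxySelector)"
                : selector.select(target).toString()); // e.g. [DIRECT] when no proxy applies
        return out;
    }

    public static void main(String[] args) {
        URI target = URI.create(args.length > 0 ? args[0] : "https://api.partner.example/health");
        describe(target).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Running this with the exact JVM flags of the production process, inside the same container, is what makes the output comparable to the application's real behavior.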

20.2 “It works from my laptop” is usually irrelevant

Production failures are path-specific.

You need to test from:

  • same pod/container;
  • same node;
  • same VPC/subnet;
  • same service account/network policy;
  • same proxy configuration;
  • same DNS resolver;
  • same JDK configuration.

20.3 “CPU is low, so app is fine” is false

A Java networking process can be unhealthy while CPU is low:

  • blocked in socket reads;
  • waiting on DNS;
  • stuck behind backpressure;
  • out of file descriptors;
  • leaking connections;
  • stalled due to GC or allocation throttling;
  • waiting in executor/bulkhead queue;
  • blocked on disk while streaming response body.
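Two of these conditions, descriptor exhaustion and threads that are waiting rather than working, can be sampled from inside the process. The sketch below uses hypothetical names (`LowCpuSnapshot`, `fdUsedFraction`, `threadStates`) and assumes a Unix-like JVM for the descriptor counts; elsewhere it returns an empty value. Thread states need careful reading: a platform thread blocked in a native socket read commonly shows up as RUNNABLE in a thread dump.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.Map;
import java.util.OptionalDouble;
import java.util.TreeMap;

final class LowCpuSnapshot {
    /** Fraction of the file-descriptor limit in use, or empty if the JVM does not expose it. */
    static OptionalDouble fdUsedFraction() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean unix) {
            long open = unix.getOpenFileDescriptorCount();
            long max = unix.getMaxFileDescriptorCount();
            if (max > 0) {
                return OptionalDouble.of((double) open / max);
            }
        }
        return OptionalDouble.empty();
    }

    /** Number of live threads per Thread.State at the moment of the call. */
    static Map<Thread.State, Long> threadStates() {
        Map<Thread.State, Long> counts = new TreeMap<>();
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(thread.getState(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("fd used fraction: " + fdUsedFraction());
        System.out.println("thread states:    " + threadStates());
    }
}
```

A descriptor fraction that climbs steadily while CPU stays flat points toward a connection or file leak, which is worth confirming with `ss` and `lsof` on the host.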

21. Incident Evidence Template

Use this during real incidents.

## Incident: <short name>

### User-visible symptom
- Start time:
- End time:
- Impact:
- Affected operations:

### Network call classification
- Direction: inbound / outbound
- Protocol: TCP / TLS / HTTP/1.1 / HTTP/2 / WebSocket / custom
- Target service:
- Host/port:
- Proxy path:
- Timeout/deadline:

### Failure phase
- URI / DNS / TCP connect / TLS / HTTP headers / upload / TTFB / download / close / unknown

### Evidence
- Application logs:
- JFR:
- OS socket state:
- Packet capture:
- DNS evidence:
- Proxy/load balancer logs:
- Peer service logs:

### Timeline
| Time | Evidence | Interpretation |
|---|---|---|
| | | |

### Root cause

### Why existing controls did not prevent/detect it

### Fix

### Regression test / chaos test

### Follow-up observability improvement

22. Deliberate Practice Drills

Drill 1 — DNS failure lab

Build a small Java program that calls a hostname. Then test:

  • valid host;
  • invalid host;
  • host changed in /etc/hosts;
  • IPv4-only host;
  • IPv6-only host;
  • container with different resolver.

Record:

  • Java exception;
  • lookup duration;
  • OS resolver result;
  • final conclusion.

Drill 2 — connect refused vs connect timeout

Create two targets:

  • closed local port: expect refusal;
  • blackholed IP or firewall-dropped path: expect timeout.

Compare:

  • Java exception;
  • ss state;
  • packet capture.

Drill 3 — TLS certificate path

Call an endpoint with:

  • valid cert;
  • wrong hostname;
  • self-signed cert;
  • missing intermediate;
  • mTLS requirement.

Document exact Java failure mode.

Drill 4 — slow body consumer

Create a server that streams large response bytes. Make the client consume slowly.

Observe:

  • body duration;
  • memory usage;
  • TCP window behavior if visible;
  • JFR socket read/write events.

Drill 5 — stale pooled connection

Set server/load-balancer idle timeout lower than client keepalive expectation. Wait, then send another request.

Observe:

  • reset behavior;
  • retry safety;
  • packet-level RST sender;
  • mitigation by reducing keepalive or safe retry.

23. Production Readiness Checklist

A production Java networking component should have:

  • operation-level network metrics;
  • structured start/success/failure logs;
  • failure phase classification;
  • timeout/deadline values in logs;
  • target service name separate from raw host;
  • safe redaction for URL/query/header/body;
  • correlation id propagation;
  • bounded debug logging capability;
  • documented JFR capture procedure;
  • documented packet capture procedure;
  • DNS probe procedure;
  • TLS probe procedure;
  • server connection lifecycle metrics;
  • close reason metrics;
  • runbook for UnknownHostException;
  • runbook for connect timeout/refused;
  • runbook for TLS handshake failure;
  • runbook for slow body transfer;
  • runbook for connection reset;
  • incident template that reconciles logs/JFR/OS/packets.
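As an illustration of the "safe redaction" item, the sketch below reduces a URI to the parts that are usually safe to log. It is one possible approach with hypothetical names (`LogRedaction`, `safeForLogs`), and it assumes that secrets live in user info, query, and fragment; if path segments can carry tokens in your system, the path needs its own treatment.

```java
import java.net.URI;

final class LogRedaction {
    /** Returns scheme, host, port, and path; drops user info and fragment, masks the query. */
    static String safeForLogs(URI uri) {
        StringBuilder out = new StringBuilder();
        if (uri.getScheme() != null) {
            out.append(uri.getScheme()).append("://");
        }
        if (uri.getHost() != null) {
            out.append(uri.getHost());
        }
        if (uri.getPort() != -1) {
            out.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            out.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            out.append("?[redacted]"); // record that a query existed, not its content
        }
        return out.toString();
    }

    public static void main(String[] args) {
        URI uri = URI.create("https://user:secret@api.partner.example:8443/v1/orders?token=abc#x");
        System.out.println(safeForLogs(uri));
    }
}
```

Applying the helper at the single place where request logs are built is easier to audit than redacting at every call site.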

24. Key Takeaways

  1. Java network debugging must be layered: application, JDK, JVM, OS, packet path, peer.
  2. Always classify failures by phase: DNS, connect, TLS, HTTP, upload, download, close.
  3. A single duration number is not enough; decompose latency when possible.
  4. JFR is often the safest first runtime evidence source.
  5. java.net.http logging and JSSE debug are powerful but must be scoped and redacted.
  6. Packet captures are for resolving disputed path behavior, not for casual logging.
  7. CLOSE-WAIT, resets, retransmissions, and zero-window patterns each imply different fixes.
  8. Production runbooks should be written before incidents, not during them.

25. References

  • Java SE 25 — JDK Flight Recorder troubleshooting documentation.
  • Java SE 25 — java.net.http API and module documentation.
  • Java SE 25 — JSSE Reference Guide.
  • Java SE 25 — java.net, java.nio.channels, Socket, SocketChannel, ServerSocketChannel API documentation.
  • RFC 9110 — HTTP Semantics.
  • RFC 9113 — HTTP/2.
  • RFC 6455 — The WebSocket Protocol.

Series status: not yet complete. Continue to Part 028.
