Learn Java Core Types Part 010 Bytes Binary Data Buffering
title: Learn Java Core Types, Data Model & Data APIs - Part 010
description: Deep engineering treatment of Java bytes, binary data, charset encoding/decoding, ByteBuffer, heap vs direct buffers, endianness, Base64, hex, and production I/O failure modes.
series: learn-java-core-types
seriesTitle: Learn Java Core Types, Data Model & Data APIs
order: 10
partTitle: Bytes, Binary Data, and Buffering
tags:
- java
- byte
- binary
- charset
- utf-8
- bytebuffer
- nio
- base64
- hex
- encoding
- advanced
date: 2026-06-27
Part 010 — Bytes, Binary Data, and Buffering
Text is not bytes.
This sounds obvious, but many production bugs come from code that acts as if it were not true.
byte[] bytes = input.getBytes();
String again = new String(bytes);
This code depends on the platform default charset. It might work on your machine and corrupt data elsewhere.
Or:
ByteBuffer buffer = ByteBuffer.allocate(8);
buffer.putLong(42L);
socket.write(buffer); // bug: position is at end unless flipped
Or:
int b = bytes[i]; // sign extension surprise when treating byte as unsigned
This part explains how Java models binary data:
- byte and byte[];
- signed byte vs unsigned byte interpretation;
- text encoding and decoding;
- charset correctness;
- ByteBuffer position/limit/capacity;
- heap vs direct buffer;
- endianness;
- Base64 and hex;
- binary protocol failure modes.
1. Kaufman Deconstruction
The core skill for this part:
Being able to design and read Java binary/text boundaries explicitly, without encoding bugs, buffer state bugs, or signed-byte surprises.
Sub-skills:
| Sub-skill | What to master |
|---|---|
| Byte model | byte signed 8-bit, but often interpreted as unsigned octet |
| Binary container | byte[], ByteBuffer, streams, channels |
| Encoding | String <-> byte[] via Charset |
| Charset safety | avoid default charset, use StandardCharsets |
| Buffer state | capacity, position, limit, mark, flip, clear, compact |
| Endianness | byte order for multi-byte numeric values |
| Base64/hex | binary-to-text representation |
| Protocol thinking | framing, partial reads, length prefix, validation |
20-hour target:
| Hours | Practice focus |
|---|---|
| 1-2 | signed byte experiments and unsigned conversion |
| 3-5 | encode/decode UTF-8, UTF-16, ISO-8859-1 examples |
| 6-8 | CharsetEncoder/CharsetDecoder error handling |
| 9-11 | ByteBuffer position/limit/flip/compact drills |
| 12-14 | endianness and binary integer serialization |
| 15-17 | Base64/hex encode/decode utilities |
| 18-20 | build a small binary framed message parser |
2. Mental Model: Bytes Are Raw, Meaning Comes From Interpretation
A byte sequence has no inherent meaning.
48 65 6C 6C 6F
Possible interpretations:
- ASCII/UTF-8 text: Hello;
- hex-encoded bytes;
- part of compressed data;
- part of encrypted data;
- binary protocol frame;
- image data;
- integer fields;
- serialized object payload.
Meaning comes from a contract between the producer and the consumer of the bytes.
When that contract is implicit, bugs appear.
3. Java byte: Signed Storage, Unsigned Use Cases
Java byte is signed and ranges from -128 to 127.
But many binary protocols define bytes as unsigned octets from 0 to 255.
byte b = (byte) 0xFF;
System.out.println(b); // -1
To interpret as unsigned:
int unsigned = Byte.toUnsignedInt(b);
System.out.println(unsigned); // 255
Or:
int unsigned = b & 0xFF;
Prefer named API for clarity:
int value = Byte.toUnsignedInt(buffer[index]);
3.1 Sign Extension Pitfall
byte b = (byte) 0xFE;
int x = b;
System.out.println(x); // -2
Widening from byte to int preserves signed value.
If you need unsigned octet:
int x = b & 0xFF;
System.out.println(x); // 254
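The mask matters most when several bytes are combined into a wider value. The following is a minimal sketch; the class `UnsignedBytes` and its method names are illustrative and not part of any library.

```java
final class UnsignedBytes {
    // Combines two octets into an unsigned 16-bit value (high byte first).
    static int readUint16(byte high, byte low) {
        return (Byte.toUnsignedInt(high) << 8) | Byte.toUnsignedInt(low);
    }

    // Same computation without masking: sign extension of `low` corrupts the result.
    static int readUint16Buggy(byte high, byte low) {
        return (high << 8) | low;
    }

    public static void main(String[] args) {
        byte high = (byte) 0x01;
        byte low = (byte) 0xFE;
        System.out.println(readUint16(high, low));      // 510
        System.out.println(readUint16Buggy(high, low)); // -2
    }
}
```

In the buggy variant, `low` widens to 0xFFFFFFFE before the OR, so its sign bits overwrite the high byte.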
3.2 Byte Literals and Casting
byte a = 127; // ok, constant fits
byte b = (byte)128; // -128 after narrowing
byte c = (byte)255; // -1 after narrowing
Rule:
byte is a signed Java primitive. Treating it as unsigned is an interpretation step, not its native type behavior.
4. byte[]: The Basic Binary Container
byte[] is the simplest binary data container.
byte[] payload = new byte[] { 0x48, 0x65, 0x6C, 0x6C, 0x6F };
It is mutable.
record DocumentHash(byte[] bytes) { }
This record is dangerous because callers can mutate the array after construction.
byte[] raw = {1, 2, 3};
DocumentHash h = new DocumentHash(raw);
raw[0] = 99; // mutates h.bytes() content
Defensive copy:
import java.util.Arrays;
public final class DocumentHash {
private final byte[] bytes;
public DocumentHash(byte[] bytes) {
this.bytes = Arrays.copyOf(bytes, bytes.length);
}
public byte[] bytes() {
return Arrays.copyOf(bytes, bytes.length);
}
}
For records:
public record BinaryPayload(byte[] bytes) {
public BinaryPayload {
bytes = bytes.clone();
}
@Override
public byte[] bytes() {
return bytes.clone();
}
}
But remember: generated record equals for arrays uses reference equality, not content equality. For binary value objects, class may be better than record unless you override equals, hashCode, and toString carefully.
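If a record is still preferred, the three overrides could look roughly like the sketch below. The name `ContentEqualPayload` is hypothetical, and the hex-based `toString` assumes Java 17+ for `HexFormat`.

```java
import java.util.Arrays;
import java.util.HexFormat;

record ContentEqualPayload(byte[] bytes) {
    ContentEqualPayload {
        bytes = bytes.clone(); // defensive copy on the way in
    }

    @Override
    public byte[] bytes() {
        return bytes.clone(); // defensive copy on the way out
    }

    @Override
    public boolean equals(Object o) {
        // Compare array content instead of array references.
        return o instanceof ContentEqualPayload other && Arrays.equals(bytes, other.bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() {
        return "ContentEqualPayload[" + HexFormat.of().formatHex(bytes) + "]";
    }
}
```

Whether this is clearer than a plain final class is a judgment call; the record saves little once every generated member is overridden.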
5. Text Encoding: String to byte[]
A String is text. A byte[] is bytes. Encoding converts text to bytes.
byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
Decoding converts bytes to text.
String text = new String(bytes, StandardCharsets.UTF_8);
Avoid:
text.getBytes();
new String(bytes);
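Because these use the default charset. That default is UTF-8 only since JDK 18 (JEP 400); on older JDKs it depends on the platform, and it can still be overridden with -Dfile.encoding.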
5.1 StandardCharsets
Use StandardCharsets:
import java.nio.charset.StandardCharsets;
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
String decoded = new String(utf8, StandardCharsets.UTF_8);
Common standard charsets:
| Charset | Use case |
|---|---|
| UTF_8 | default modern text interchange |
| UTF_16 | Java/Unicode interop when explicitly required |
| US_ASCII | strict ASCII protocols |
| ISO_8859_1 | legacy single-byte Latin-1 systems |
Rule:
For new protocols and storage, prefer UTF-8 unless a contract says otherwise.
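To make the difference between charsets concrete, the sketch below encodes the same one-character string with several standard charsets and compares the byte counts. The class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedLengths {
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE, a single char
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));      // 2
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE));   // 2
        System.out.println(encodedLength(text, StandardCharsets.UTF_16));     // 4, includes a byte-order mark
    }
}
```

The same text produces different bytes and different lengths, which is why a length field or a stored blob is meaningless without the charset that produced it.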
6. Encoding Is Not Always Lossless
Some characters cannot be represented in some charsets.
String text = "Ayu 😊";
byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
String decoded = new String(ascii, StandardCharsets.US_ASCII);
System.out.println(decoded); // likely Ayu ?
String.getBytes(Charset) never fails on such input: it silently replaces malformed and unmappable characters with the charset's default replacement bytes.
If you need strict failure, use CharsetEncoder.
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;
CharsetEncoder encoder = StandardCharsets.US_ASCII
.newEncoder()
.onMalformedInput(CodingErrorAction.REPORT)
.onUnmappableCharacter(CodingErrorAction.REPORT);
try {
ByteBuffer encoded = encoder.encode(CharBuffer.wrap("Ayu 😊"));
} catch (CharacterCodingException ex) {
// input cannot be represented as US-ASCII
}
Rule:
| Boundary | Error strategy |
|---|---|
| user-facing import | report invalid encoding clearly |
| logs/debug output | replacement may be acceptable |
| legal/audit records | strict preservation and explicit errors |
| security tokens | bytes, not text; no charset conversion unless specified |
| protocol payload | strict contract |
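The same strictness applies in the decoding direction. As a minimal sketch (the class name `StrictUtf8` is illustrative), a `CharsetDecoder` configured with `REPORT` rejects invalid UTF-8 instead of inserting U+FFFD:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Decodes bytes as UTF-8 and throws instead of substituting U+FFFD.
    static String decode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        System.out.println(decode("Ayu".getBytes(StandardCharsets.UTF_8))); // Ayu

        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] invalid = {(byte) 0xC3, (byte) 0x28};

        // The lenient constructor hides the problem behind a replacement character.
        System.out.println(new String(invalid, StandardCharsets.UTF_8).contains("\uFFFD")); // true

        try {
            decode(invalid);
        } catch (CharacterCodingException ex) {
            System.out.println("rejected: " + ex);
        }
    }
}
```

A decoder instance is stateful and not thread-safe, so this sketch creates a fresh one per call.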
7. Charset Boundary Diagram
The sender and receiver must agree on charset.
If sender uses UTF-8 and receiver assumes ISO-8859-1, text may corrupt.
This corruption is often called mojibake.
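A small reproduction, assuming a sender that writes UTF-8 and a receiver that wrongly decodes as ISO-8859-1 (the class name `MojibakeDemo` is illustrative):

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Simulates a receiver that decodes UTF-8 bytes with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9";          // 4 chars, 5 bytes in UTF-8
        String corrupted = misdecode(original); // each UTF-8 byte becomes its own char
        System.out.println(corrupted.length()); // 5
        System.out.println(corrupted.equals("caf\u00C3\u00A9")); // true
    }
}
```

No exception is thrown anywhere, which is what makes this class of bug hard to notice until a user reports garbled text.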
8. ByteBuffer: State Machine, Not Just a Byte Array
ByteBuffer is central in Java NIO.
It has:
- capacity;
- position;
- limit;
- mark;
- byte order;
- backing storage, optionally.
ByteBuffer buffer = ByteBuffer.allocate(8);
Initial state:
capacity = 8
position = 0
limit = 8
After writing a long:
buffer.putLong(42L);
State:
position = 8
limit = 8
To read what you wrote, call flip():
buffer.flip();
long value = buffer.getLong();
flip() prepares for reading:
limit = old position
position = 0
8.1 Buffer State Diagram
8.2 clear Does Not Erase Data
buffer.clear();
This resets position/limit for writing. It does not zero the memory.
If buffer contains secrets, clear() is not secure erasure.
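The state transitions in this section, including the fact that clear() leaves old bytes in place, can be traced with a few prints. This is only a sketch; `BufferStateTrace` and `state` are illustrative names.

```java
import java.nio.ByteBuffer;

final class BufferStateTrace {
    static String state(ByteBuffer b) {
        return "position=" + b.position() + " limit=" + b.limit() + " capacity=" + b.capacity();
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(8);
        System.out.println(state(buffer));    // position=0 limit=8 capacity=8

        buffer.putInt(7);                     // relative write advances position
        System.out.println(state(buffer));    // position=4 limit=8 capacity=8

        buffer.flip();                        // limit = old position, position = 0
        System.out.println(state(buffer));    // position=0 limit=4 capacity=8
        System.out.println(buffer.getInt());  // 7

        buffer.clear();                       // resets indexes only
        System.out.println(state(buffer));    // position=0 limit=8 capacity=8
        System.out.println(buffer.getInt(0)); // 7, the old bytes are still there
    }
}
```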
8.3 compact Preserves Unread Bytes
compact() is useful after partial reads:
- unread bytes move to beginning;
- position set after moved bytes;
- limit set to capacity;
- ready for more writing.
Typical socket pattern:
ByteBuffer buffer = ByteBuffer.allocate(8192);
int read = channel.read(buffer);
buffer.flip();
while (canReadFrame(buffer)) {
Frame frame = readFrame(buffer);
process(frame);
}
buffer.compact(); // preserve incomplete frame
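The effect of compact() on the indexes is easier to see in isolation. The sketch below simulates five arriving bytes of which three are consumed; the class and method names are illustrative.

```java
import java.nio.ByteBuffer;

final class CompactDemo {
    static ByteBuffer consumeThreeThenCompact() {
        ByteBuffer buffer = ByteBuffer.allocate(8);
        buffer.put(new byte[] {1, 2, 3, 4, 5}); // five bytes "arrive"
        buffer.flip();                           // read mode: position=0, limit=5
        buffer.get();
        buffer.get();
        buffer.get();                            // consume 1, 2, 3; bytes 4 and 5 stay unread
        buffer.compact();                        // 4 and 5 move to index 0 and 1
        return buffer;                           // write mode again: position=2, limit=8
    }

    public static void main(String[] args) {
        ByteBuffer buffer = consumeThreeThenCompact();
        System.out.println(buffer.position()); // 2
        System.out.println(buffer.limit());    // 8
        System.out.println(buffer.get(0));     // 4
        System.out.println(buffer.get(1));     // 5
    }
}
```

Calling clear() at the same point would also return the buffer to write mode, but the next write would overwrite the two unread bytes.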
9. Relative vs Absolute Buffer Operations
Relative operations use and update position:
buffer.put((byte) 1);
byte b = buffer.get();
Absolute operations use index and do not update position:
buffer.put(0, (byte) 1);
byte b = buffer.get(0);
Rule:
| Operation style | Use when |
|---|---|
| relative | sequential protocol read/write |
| absolute | inspect/update known offset |
| duplicate/slice | pass sub-view without copying |
Be careful: slices and duplicates can share content.
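A brief sketch of what "share content" means in practice: views have independent position and limit, but writes through one view are visible through the other. The helper names below are illustrative.

```java
import java.nio.ByteBuffer;

final class SharedViews {
    // Writes through a duplicate; the original's position is not touched.
    static void writeThroughDuplicate(ByteBuffer original, int index, byte value) {
        original.duplicate().put(index, value);
    }

    // Returns a view starting at `from`; index 0 of the view is index `from` of the original.
    static ByteBuffer tail(ByteBuffer original, int from) {
        ByteBuffer view = original.duplicate();
        view.position(from);
        return view.slice();
    }

    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.allocate(4);
        writeThroughDuplicate(original, 0, (byte) 42);
        System.out.println(original.get(0));     // 42: same underlying bytes

        tail(original, 2).put(0, (byte) 7);
        System.out.println(original.get(2));     // 7
        System.out.println(original.position()); // 0: position is per-view
    }
}
```

If a callee must not be able to modify the data, asReadOnlyBuffer() or an explicit copy is the safer hand-off.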
10. Heap vs Direct ByteBuffer
ByteBuffer heap = ByteBuffer.allocate(1024);
ByteBuffer direct = ByteBuffer.allocateDirect(1024);
Heap buffer:
- backed by Java heap array;
- easier for GC visibility;
- often has accessible array;
- good general default.
Direct buffer:
- memory outside normal Java heap;
- useful for native I/O interactions;
- allocation/deallocation more expensive;
- can reduce copying in some I/O scenarios;
- not always faster by default.
Decision:
| Need | Prefer |
|---|---|
| ordinary small data processing | heap byte[] or heap ByteBuffer |
| NIO channel high-throughput I/O | consider direct buffer |
| simple encode/decode | byte[] often enough |
| native interop | direct buffer may help |
| many short-lived buffers | avoid direct allocation churn |
Rule:
Do not use direct buffers as a cargo-cult performance optimization. Measure and understand allocation lifetime.
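One practical consequence is that code should not assume a backing array exists. The sketch below copies the readable bytes in a way that works for both kinds of buffer; `BufferKinds` and `copyRemaining` are illustrative names.

```java
import java.nio.ByteBuffer;

final class BufferKinds {
    // Copies the readable bytes without assuming a backing array exists.
    static byte[] copyRemaining(ByteBuffer buffer) {
        byte[] out = new byte[buffer.remaining()];
        buffer.duplicate().get(out); // read through a duplicate so the caller's position is untouched
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(4);
        ByteBuffer direct = ByteBuffer.allocateDirect(4);

        System.out.println(heap.isDirect());   // false
        System.out.println(heap.hasArray());   // true
        System.out.println(direct.isDirect()); // true

        direct.putInt(0, 0x01020304);          // absolute write, position stays 0
        System.out.println(copyRemaining(direct).length); // 4
    }
}
```

Calling array() is only safe after checking hasArray(), and even then arrayOffset() has to be taken into account for sliced buffers.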
11. Endianness: Byte Order for Multi-Byte Values
A single byte has no endianness. Multi-byte values do.
Example integer 0x01020304:
Big-endian:
01 02 03 04
Little-endian:
04 03 02 01
Java ByteBuffer defaults to big-endian.
ByteBuffer buffer = ByteBuffer.allocate(4);
buffer.putInt(0x01020304);
Explicit little-endian:
ByteBuffer buffer = ByteBuffer.allocate(4)
.order(ByteOrder.LITTLE_ENDIAN);
buffer.putInt(0x01020304);
Always specify byte order in binary protocols.
record FrameHeader(int version, int length) {
static FrameHeader read(ByteBuffer buffer) {
buffer.order(ByteOrder.BIG_ENDIAN);
int version = Byte.toUnsignedInt(buffer.get());
int length = buffer.getInt();
return new FrameHeader(version, length);
}
}
Better: set order once at buffer creation/boundary and document protocol order.
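To see what a byte-order mismatch does to a value, the sketch below writes an int in one order and reads it back in another. The names are illustrative, and the hex output assumes Java 17+ for `HexFormat`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HexFormat;

final class EndiannessDemo {
    // Returns the four bytes of `value` as hex, in the given byte order.
    static String encode(int value, ByteOrder order) {
        ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES).order(order);
        buffer.putInt(value);
        return HexFormat.of().formatHex(buffer.array());
    }

    // Writes with one order and reads with another, as a mismatched peer would.
    static int reinterpret(int value, ByteOrder writtenAs, ByteOrder readAs) {
        ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES).order(writtenAs);
        buffer.putInt(value);
        buffer.flip();
        return buffer.order(readAs).getInt();
    }

    public static void main(String[] args) {
        System.out.println(encode(0x01020304, ByteOrder.BIG_ENDIAN));    // 01020304
        System.out.println(encode(0x01020304, ByteOrder.LITTLE_ENDIAN)); // 04030201
        int wrong = reinterpret(0x01020304, ByteOrder.BIG_ENDIAN, ByteOrder.LITTLE_ENDIAN);
        System.out.println(Integer.toHexString(wrong));                  // 4030201
    }
}
```

A mismatched length field read this way usually turns into an absurdly large or negative number, which is one more reason to validate lengths before allocating.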
12. Base64: Binary as Text
Base64 encodes binary data into text-safe representation.
Use cases:
- JSON payload containing bytes;
- email/MIME;
- tokens;
- HTTP basic credentials format;
- embedding binary in text protocols.
Java API:
String encoded = Base64.getEncoder().encodeToString(bytes);
byte[] decoded = Base64.getDecoder().decode(encoded);
URL-safe variant:
String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
byte[] raw = Base64.getUrlDecoder().decode(token);
Important:
- Base64 is encoding, not encryption.
- It increases size by roughly 33%.
- Padding policy must match receiver expectations.
- URL-safe Base64 differs from basic Base64.
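The last two points can be observed with a two-byte input chosen so that both alphabets and the padding show up. This is a minimal sketch with illustrative names.

```java
import java.util.Base64;

final class Base64Variants {
    static String basic(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    static String urlSafeNoPadding(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xFB, (byte) 0xFF};
        System.out.println(basic(bytes));            // +/8=
        System.out.println(urlSafeNoPadding(bytes)); // -_8

        // The URL-safe decoder accepts the unpadded form.
        System.out.println(Base64.getUrlDecoder().decode("-_8").length); // 2

        try {
            Base64.getDecoder().decode("-_8"); // '-' and '_' are not in the basic alphabet
        } catch (IllegalArgumentException ex) {
            System.out.println("basic decoder rejected URL-safe input");
        }
    }
}
```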
12.1 Base64 Is Not a Charset
Do not do this conceptually:
String s = new String(binaryBytes, StandardCharsets.UTF_8); // wrong for arbitrary binary
Do this:
String s = Base64.getEncoder().encodeToString(binaryBytes);
Base64 converts arbitrary bytes to ASCII-ish text safely.
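The reason the first form is wrong can be shown with a round trip: bytes that are not valid UTF-8 do not survive decoding and re-encoding, while Base64 returns them unchanged. The helper names below are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class BinaryRoundTrip {
    static byte[] throughString(byte[] binary) {
        String s = new String(binary, StandardCharsets.UTF_8); // invalid sequences become U+FFFD
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] throughBase64(byte[] binary) {
        String s = Base64.getEncoder().encodeToString(binary);
        return Base64.getDecoder().decode(s);
    }

    public static void main(String[] args) {
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41};
        System.out.println(Arrays.equals(binary, throughString(binary))); // false
        System.out.println(throughString(binary).length);                 // 8
        System.out.println(Arrays.equals(binary, throughBase64(binary))); // true
    }
}
```

Each invalid byte is replaced by U+FFFD, which re-encodes to three bytes, so the original values are unrecoverable.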
13. Hex Encoding
Hex is often used for diagnostics, hashes, signatures, binary IDs.
Java 17 introduced HexFormat.
import java.util.HexFormat;
String hex = HexFormat.of().formatHex(bytes);
byte[] parsed = HexFormat.of().parseHex(hex);
Uppercase:
String hex = HexFormat.of().withUpperCase().formatHex(bytes);
With delimiter:
String hex = HexFormat.ofDelimiter(":").formatHex(bytes);
Hex trade-off:
| Encoding | Pros | Cons |
|---|---|---|
| Hex | readable, stable, easy debug | 2x size |
| Base64 | compact text encoding | less readable, padding variants |
For hashes in logs, hex is often friendlier.
14. Binary Protocol Framing
Network/file reads may be partial.
Never assume one read equals one message.
Bad mental model:
read() -> whole message
Correct mental model:
read() -> some bytes
parser -> zero or more complete frames + maybe incomplete remainder
Length-prefixed frame example:
[4-byte length][payload bytes]
Parser sketch:
static boolean canReadFrame(ByteBuffer buffer) {
if (buffer.remaining() < Integer.BYTES) {
return false;
}
buffer.mark();
int length = buffer.getInt();
buffer.reset();
if (length < 0 || length > 1_000_000) {
throw new IllegalArgumentException("Invalid frame length: " + length);
}
return buffer.remaining() >= Integer.BYTES + length;
}
static byte[] readFrame(ByteBuffer buffer) {
int length = buffer.getInt();
byte[] payload = new byte[length];
buffer.get(payload);
return payload;
}
Production concerns:
- maximum frame size;
- negative length;
- integer overflow in length calculations;
- partial reads;
- buffer compaction;
- timeout;
- backpressure;
- malformed payload metrics;
- audit/logging without dumping secrets.
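Putting the framing pieces together, the following is one possible sketch of an accumulator that accepts arbitrary chunks and emits complete payloads. `FrameAccumulator` is a hypothetical name, the 1_000_000 limit mirrors the parser sketch above, and the fixed-capacity buffer is a simplifying assumption (a chunk larger than the free space throws BufferOverflowException).

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class FrameAccumulator {
    private static final int MAX_FRAME = 1_000_000;
    private final ByteBuffer buffer; // big-endian by default, kept in write mode between calls

    FrameAccumulator(int capacity) {
        this.buffer = ByteBuffer.allocate(capacity);
    }

    // Accepts an arbitrary chunk and returns every frame it completes, in order.
    List<byte[]> feed(byte[] chunk) {
        buffer.put(chunk);
        buffer.flip(); // switch to read mode
        List<byte[]> frames = new ArrayList<>();
        while (buffer.remaining() >= Integer.BYTES) {
            // Absolute read: peek at the length without consuming it.
            int length = buffer.getInt(buffer.position());
            if (length < 0 || length > MAX_FRAME) {
                // Fail fast; this sketch treats the accumulator as unusable afterwards.
                throw new IllegalArgumentException("Invalid frame length: " + length);
            }
            if (buffer.remaining() - Integer.BYTES < length) {
                break; // payload not fully received yet
            }
            buffer.getInt(); // consume the prefix
            byte[] payload = new byte[length];
            buffer.get(payload);
            frames.add(payload);
        }
        buffer.compact(); // keep the incomplete remainder for the next chunk
        return frames;
    }
}
```

For example, feeding `{0, 0, 0, 2, 'h'}` returns no frames, and a following `{'i', 0, 0, 0, 1, '!'}` returns two payloads, "hi" and "!". Timeouts, backpressure, and metrics are deliberately left out.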
15. Byte Array Equality and Hashing
Arrays do not use content equality.
byte[] a = {1, 2, 3};
byte[] b = {1, 2, 3};
System.out.println(a.equals(b)); // false
Use:
Arrays.equals(a, b);
Arrays.hashCode(a);
For nested arrays:
Arrays.deepEquals(...);
Arrays.deepHashCode(...);
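For comparisons that involve secrets (MACs, tokens, password hashes), use a constant-time comparison such as MessageDigest.isEqual, so that timing does not reveal how many leading bytes matched.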
Do not log secrets or raw tokens.
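As a short sketch of the two comparison styles (class and method names are illustrative): `Arrays.equals` for ordinary content equality, and `java.security.MessageDigest.isEqual` when the compared value is secret.

```java
import java.security.MessageDigest;
import java.util.Arrays;

final class ByteComparison {
    // Ordinary content equality; may return as soon as a difference is found.
    static boolean sameContent(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    // For secrets such as MACs: the comparison time depends on length, not on where bytes differ.
    static boolean sameSecret(byte[] expected, byte[] actual) {
        return MessageDigest.isEqual(expected, actual);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};
        System.out.println(a.equals(b));       // false: reference comparison
        System.out.println(sameContent(a, b)); // true
        System.out.println(sameSecret(a, b));  // true
    }
}
```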
16. Binary Value Object Example
import java.util.Arrays;
import java.util.HexFormat;
import java.util.Objects;
public final class Sha256Hash {
private static final int LENGTH = 32;
private static final HexFormat HEX = HexFormat.of();
private final byte[] bytes;
public Sha256Hash(byte[] bytes) {
Objects.requireNonNull(bytes, "bytes");
if (bytes.length != LENGTH) {
throw new IllegalArgumentException("SHA-256 hash must be 32 bytes");
}
this.bytes = bytes.clone();
}
public static Sha256Hash fromHex(String hex) {
Objects.requireNonNull(hex, "hex");
return new Sha256Hash(HEX.parseHex(hex));
}
public byte[] bytes() {
return bytes.clone();
}
public String toHex() {
return HEX.formatHex(bytes);
}
@Override
public boolean equals(Object o) {
return this == o || (o instanceof Sha256Hash other && Arrays.equals(bytes, other.bytes));
}
@Override
public int hashCode() {
return Arrays.hashCode(bytes);
}
@Override
public String toString() {
return toHex();
}
}
Why class, not record?
Because record-generated equality for byte[] would compare array references. For binary value semantics, explicit implementation is clearer.
17. Text vs Binary Decision Framework
Rules:
- Text crossing byte boundary needs a charset.
- Arbitrary binary must not be forced into String.
- Binary-as-text needs Base64 or hex.
- Multi-byte binary numbers need byte order.
- Buffer state must be managed explicitly.
- Mutable byte arrays need defensive copies.
18. I/O Streams and Bytes
Classic byte streams:
InputStream in;
OutputStream out;
Read loop:
byte[] buffer = new byte[8192];
int n;
while ((n = in.read(buffer)) != -1) {
out.write(buffer, 0, n);
}
Do not ignore n.
Wrong:
out.write(buffer); // writes entire buffer, including stale bytes
Correct:
out.write(buffer, 0, n);
For text:
try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
// character stream
}
Boundary distinction:
| API | Data level |
|---|---|
| InputStream / OutputStream | bytes |
| Reader / Writer | characters/text |
| InputStreamReader | byte -> char bridge via charset |
| OutputStreamWriter | char -> byte bridge via charset |
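The two bridge classes can be exercised entirely in memory. The sketch below uses illustrative helper names and an explicit UTF-8 charset in both directions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class StreamBridge {
    // byte -> char: decode an entire stream as UTF-8 text.
    static String readAll(InputStream in) throws IOException {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder text = new StringBuilder();
            char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n); // append only the chars actually read
            }
            return text.toString();
        }
    }

    // char -> byte: encode text as UTF-8 through a Writer.
    static byte[] writeAll(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } // closing the writer flushes its internal encoder buffer
        return out.toByteArray();
    }
}
```

Note that the reader handles multi-byte characters split across underlying reads, which a manual `new String(buffer, 0, n, charset)` per chunk does not.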
19. File Reading/Writing: Explicit Charset
Text file:
String content = Files.readString(path, StandardCharsets.UTF_8);
Files.writeString(path, content, StandardCharsets.UTF_8);
Binary file:
byte[] bytes = Files.readAllBytes(path);
Files.write(path, bytes);
Do not read binary data as string just to pass it around.
Bad:
String image = Files.readString(imagePath); // wrong for binary
Good:
byte[] image = Files.readAllBytes(imagePath);
If binary must go to JSON:
String base64 = Base64.getEncoder().encodeToString(image);
20. Memory and Allocation Concerns
Binary-heavy code often stresses memory.
Common issues:
- copying large byte[] repeatedly;
- converting bytes to Base64 strings unnecessarily;
- keeping full file in memory;
- direct buffer allocation churn;
- unbounded frame length;
- logging giant payloads;
- retaining buffer slices longer than expected.
Engineering mitigations:
| Issue | Mitigation |
|---|---|
| large payload | stream instead of load-all |
| repeated concat | use buffers/streams |
| unbounded input | enforce max size |
| debug logs huge | log length/hash/sample, not full payload |
| mutable ownership | copy at boundary or document ownership |
| direct buffer churn | pool carefully or allocate long-lived buffers |
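As an illustration of "stream instead of load-all", the sketch below hashes an InputStream in fixed-size chunks, so memory use stays constant regardless of payload size. The class name is illustrative and the hex output assumes Java 17+ for `HexFormat`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class StreamingDigest {
    // Computes SHA-256 over the stream without holding the whole payload in memory.
    static String sha256Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n); // only the n bytes actually read
        }
        return HexFormat.of().formatHex(digest.digest());
    }
}
```

The resulting hex string is also a reasonable thing to log in place of the payload itself.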
21. Binary Data in APIs
Do not expose raw byte[] casually.
Bad:
class Attachment {
byte[] content;
}
Better:
final class AttachmentContent {
private final byte[] bytes;
AttachmentContent(byte[] bytes) {
this.bytes = bytes.clone();
}
int size() {
return bytes.length;
}
InputStream openStream() {
return new ByteArrayInputStream(bytes);
}
byte[] copyBytes() {
return bytes.clone();
}
}
For very large content, avoid storing in memory:
interface BlobRef {
long size();
InputStream openStream() throws IOException;
}
Data representation decision depends on size and ownership.
22. Common Failure Modes
| Failure | Cause | Prevention |
|---|---|---|
| mojibake | mismatched/default charset | explicit StandardCharsets.UTF_8 |
| data loss on encode | unmappable characters replaced | strict CharsetEncoder with REPORT |
| signed byte bug | treating byte as 0..255 | Byte.toUnsignedInt |
| buffer writes nothing | forgot flip() before read/write | manage buffer state |
| stale bytes written | ignored read count | write 0..n only |
| wrong integer value | endianness mismatch | explicit ByteOrder |
| binary corrupted as text | arbitrary bytes converted to String | Base64/hex or keep bytes |
| mutable binary value | exposed byte[] | defensive copy |
| array equality bug | byte[].equals reference equality | Arrays.equals |
| memory pressure | load huge payloads | streaming, max size |
| partial read bug | assumes one read = one message | frame parser |
23. Practice Drill: Framed UTF-8 Message Parser
Build a parser for this protocol:
[4-byte big-endian length][UTF-8 JSON-like text payload]
Rules:
- length is signed Java int but valid range is 0..1_000_000;
- parser receives arbitrary chunks;
- one chunk may contain half a frame;
- one chunk may contain multiple frames;
- payload must decode as valid UTF-8;
- malformed length fails fast;
- malformed UTF-8 fails clearly;
- parser preserves incomplete bytes for next read.
Suggested public API:
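final class FramedUtf8Parser {
    List<String> feed(byte[] chunk) {
        // TODO: return every frame completed by this chunk, in arrival order
        throw new UnsupportedOperationException("not implemented yet");
    }
}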
Test cases:
- one complete frame;
- two frames in one chunk;
- frame split across chunks;
- negative length;
- length above max;
- incomplete length prefix;
- invalid UTF-8;
- zero-length payload.
24. Review Checklist
Before approving Java binary/text boundary code, ask:
- Is this text or binary?
- Where is the charset specified?
- Are default charset APIs avoided?
- Are arbitrary bytes ever converted to String?
- If binary is transported as text, is Base64/hex used intentionally?
- Are byte values interpreted as signed or unsigned intentionally?
- Are mutable byte[] values defensively copied?
- Does equality use content equality?
- Is byte order specified for multi-byte numeric fields?
- Does ByteBuffer code handle flip, clear, and compact correctly?
- Are partial reads handled?
- Are frame sizes bounded?
- Is strict decoding required?
- Are secrets/tokens excluded from logs?
- Are large payloads streamed instead of fully loaded?
25. Summary
Binary data in Java is simple only when the boundary contract is explicit.
Key takeaways:
- byte is signed; unsigned interpretation requires conversion.
- byte[] is mutable; defensive copy is often required.
- String to bytes requires a Charset.
- Avoid default charset APIs at production boundaries.
- Use CharsetEncoder/CharsetDecoder for strict error handling.
- ByteBuffer is a state machine with position, limit, capacity, and byte order.
- Always flip() before reading data you wrote into a buffer.
- Specify endianness for binary protocols.
- Use Base64 or hex for binary-as-text.
- Do not assume one I/O read equals one full message.
- Treat binary value objects carefully because arrays use reference equality.
Next part: Object, Class, runtime type, identity-sensitive operations, and how Java's root object model affects debugging, frameworks, and domain modeling.