Learn Java Core Types Part 032 Streams Collectors And Data Pipeline Semantics
title: Learn Java Core Types, Data Model & Data APIs - Part 032 description: Java Streams, collectors, data pipeline semantics, laziness, encounter order, side effects, primitive streams, grouping, reducing, parallel stream caveats, spliterator mental model, and final integration rubric for Java core data modeling. series: learn-java-core-types seriesTitle: Learn Java Core Types, Data Model & Data APIs order: 32 partTitle: Streams, Collectors, and Data Pipeline Semantics tags:
- java
- streams
- collectors
- spliterator
- pipeline
- functional
- collections
- data-processing
- parallel-stream
- grouping
- reducing
- data-modeling date: 2026-06-27
Part 032 — Streams, Collectors, and Data Pipeline Semantics
Goal: memahami stream sebagai pipeline data yang lazy, typed, ordered atau unordered, side-effect-sensitive, dan memiliki terminal semantics yang jelas. Setelah bagian ini, kita bisa memutuskan kapan memakai stream, kapan memakai loop, bagaimana menulis collector yang benar, dan bagaimana menghindari bug akibat mutation, ordering, null, parallelism, boxing, dan reduction yang tidak associative.
Ini adalah part terakhir seri learn-java-core-types.
Streams sering diajarkan sebagai:
list.stream()
.filter(...)
.map(...)
.collect(...);
Itu hanya syntax.
Top engineer melihat stream sebagai:
deklarasi transformasi data dari source ke terminal result, dengan kontrak laziness, encounter order, statelessness, non-interference, type conversion, and reduction semantics.
Stream bukan collection. Stream bukan selalu lebih cepat. Stream bukan magic parallelism. Stream adalah abstraction untuk pipeline.
1. Kaufman Deconstruction: Sub-Skill Streams
| Sub-skill | Kemampuan |
|---|---|
| Pipeline model | Bisa membedakan source, intermediate operation, terminal operation |
| Laziness | Tahu kapan computation benar-benar terjadi |
| Encounter order | Tahu apakah order dipertahankan, diabaikan, atau mahal |
| Statelessness | Menghindari operation yang bergantung pada mutable external state |
| Non-interference | Tidak memodifikasi source saat pipeline berjalan |
| Mapping discipline | Memisahkan parse, validate, normalize, map, filter, reduce |
| Collector design | Memilih collector sesuai output semantics |
| Reduction correctness | Menggunakan identity/accumulator/combiner yang associative |
| Primitive specialization | Menghindari boxing overhead ketika perlu |
| Parallel caution | Tahu kapan parallel stream aman dan kapan berbahaya |
| Debuggability | Bisa menulis pipeline yang dapat dites dan dibaca |
| Integration | Mengubah raw data menjadi domain model dengan boundary yang jelas |
2. Stream Mental Model
Pipeline parts:
| Part | Example | Meaning |
|---|---|---|
| Source | list.stream() | dari mana elemen berasal |
| Intermediate | filter, map, sorted, distinct | transform stream menjadi stream lain |
| Terminal | collect, toList, count, forEach | memicu traversal dan menghasilkan result/side effect |
Example:
List<String> activeNames = users.stream()
.filter(User::active)
.map(User::name)
.sorted()
.toList();
Interpretasi:
- Ambil source
users. - Pilih user active.
- Ubah user menjadi name.
- Sort names.
- Materialize ke list.
3. Stream Is Not a Collection
Collection:
- menyimpan data;
- bisa diiterasi berkali-kali;
- punya size;
- bisa mutable/unmodifiable;
- adalah data structure.
Stream:
- bukan storage;
- biasanya one-shot;
- lazy;
- bisa finite atau infinite;
- merepresentasikan computation pipeline.
Buruk:
Stream<User> stream = users.stream();
long count = stream.count();
List<User> again = stream.toList(); // IllegalStateException
Stream sudah consumed setelah terminal operation.
Jika perlu reuse, simpan source atau supplier:
Supplier<Stream<User>> userStream = users::stream;
long count = userStream.get().count();
List<User> active = userStream.get()
.filter(User::active)
.toList();
Namun jangan overuse supplier jika loop lebih jelas.
4. Laziness: No Terminal, No Work
users.stream()
.filter(user -> {
System.out.println("filter " + user.id());
return user.active();
});
Tidak ada output. Belum ada terminal operation.
Dengan terminal:
long activeCount = users.stream()
.filter(user -> {
System.out.println("filter " + user.id());
return user.active();
})
.count();
Baru berjalan.
Mental model:
Intermediate operations membangun recipe. Terminal operation menjalankan recipe.
Laziness memungkinkan:
- short-circuiting;
- operation fusion;
- processing only needed elements;
- efficient infinite stream handling.
Contoh short-circuit:
Optional<User> firstActive = users.stream()
.filter(User::active)
.findFirst();
Pipeline berhenti setelah first active ditemukan, jika source/order memungkinkan.
5. Intermediate Operations: Stateless vs Stateful
5.1 Stateless Operations
Stateless operation tidak perlu mengingat elemen sebelumnya.
| Operation | Meaning |
|---|---|
filter | keep/drop element |
map | one-to-one transform |
flatMap | one-to-many flatten |
peek | observe/debug, not business mutation |
mapToInt | transform to primitive stream |
Example:
List<CaseId> ids = cases.stream()
.filter(CaseFile::open)
.map(CaseFile::id)
.toList();
5.2 Stateful Operations
Stateful operation perlu melihat beberapa/banyak elemen.
| Operation | Needs state |
|---|---|
distinct | seen set |
sorted | buffering/sorting |
limit | count/short-circuit semantics |
skip | count |
takeWhile | prefix condition |
dropWhile | prefix dropping |
Example:
List<CaseFile> latest = cases.stream()
.sorted(Comparator.comparing(CaseFile::submittedAt).reversed())
.limit(10)
.toList();
Stateful operations bisa mahal, terutama pada parallel stream dan ordered source.
6. Encounter Order
Encounter order adalah urutan elemen yang dilihat pipeline dari source.
| Source | Encounter order |
|---|---|
ArrayList | positional order |
LinkedHashSet | insertion order |
TreeSet | sorted order |
HashSet | no guaranteed order |
HashMap.entrySet() | no guaranteed order |
LinkedHashMap.entrySet() | insertion/access order depending config |
TreeMap.entrySet() | sorted key order |
Example:
List<String> names = users.stream()
.map(User::name)
.toList();
Jika users adalah List, result mengikuti list order.
Jika users adalah HashSet, jangan anggap order stabil.
6.1 findFirst vs findAny
Optional<User> first = users.stream()
.filter(User::active)
.findFirst();
findFirst menghormati encounter order.
Optional<User> any = users.parallelStream()
.filter(User::active)
.findAny();
findAny memberi lebih banyak kebebasan, terutama parallel.
Rule:
Jika order tidak penting, jangan memaksa order. Jika order penting, pilih source dan terminal yang mengekspresikannya.
7. Non-Interference: Jangan Modifikasi Source
Buruk:
List<User> users = new ArrayList<>();
users.stream()
.filter(User::inactive)
.forEach(users::remove); // dangerous
Source dimodifikasi saat traversal. Ini bisa menyebabkan exception atau behavior tidak jelas.
Lebih baik:
users.removeIf(User::inactive);
Atau buat result baru:
List<User> activeUsers = users.stream()
.filter(User::active)
.toList();
Pipeline harus membaca source, bukan merusak source.
8. Statelessness: Jangan Bergantung pada Mutable External State
Buruk:
Set<String> seen = new HashSet<>();
List<User> unique = users.stream()
.filter(user -> seen.add(user.email()))
.toList();
Ini tampak bekerja sequential, tetapi:
- tidak jelas;
- tidak thread-safe;
- parallel unsafe;
- menyembunyikan stateful logic dalam predicate;
- lebih sulit dites.
Lebih baik jika uniqueness berdasarkan object equality:
List<User> unique = users.stream()
.distinct()
.toList();
Jika uniqueness berdasarkan key, gunakan helper eksplisit atau collector.
Sequential-only helper bisa dibuat jelas:
static <T, K> Predicate<T> distinctByKey(Function<T, K> keyExtractor) {
Set<K> seen = new HashSet<>();
return item -> seen.add(keyExtractor.apply(item));
}
Namun dokumentasikan bahwa ini sequential-stateful. Untuk production, sering lebih jelas memakai map:
Map<Email, User> byEmail = new LinkedHashMap<>();
for (User user : users) {
byEmail.putIfAbsent(user.email(), user);
}
List<User> unique = List.copyOf(byEmail.values());
Loop kadang lebih defensible.
9. Side Effects: forEach Is a Terminal Escape Hatch
forEach bukan pengganti map.
Buruk:
List<CaseDto> result = new ArrayList<>();
cases.stream()
.filter(CaseFile::open)
.forEach(c -> result.add(toDto(c)));
Lebih baik:
List<CaseDto> result = cases.stream()
.filter(CaseFile::open)
.map(this::toDto)
.toList();
Use forEach untuk side effect yang memang terminal:
cases.stream()
.filter(CaseFile::overdue)
.forEach(notificationService::sendReminder);
Tetapi untuk side effects production:
- consider retry;
- consider idempotency;
- consider transaction boundary;
- consider partial failure;
- consider logging and observability.
Stream pipeline tidak otomatis memberi workflow semantics.
10. peek: Debug, Not Business Logic
peek sering disalahgunakan.
Acceptable debugging:
List<CaseDto> result = cases.stream()
.peek(c -> log.debug("input {}", c.id()))
.filter(CaseFile::open)
.peek(c -> log.debug("open {}", c.id()))
.map(this::toDto)
.toList();
Bad business mutation:
List<CaseFile> result = cases.stream()
.peek(c -> c.markReviewed())
.toList();
Jika mutation adalah tujuan, gunakan loop atau explicit command method. peek membuat mutation tersembunyi di tengah pipeline.
11. Mapping Discipline: Parse → Validate → Normalize → Model
Untuk input eksternal, jangan langsung stream menjadi domain object tanpa boundary.
Buruk:
List<CaseId> ids = rawIds.stream()
.map(CaseId::new)
.toList();
Jika constructor throw di tengah, error reporting buruk.
Lebih kuat:
List<CaseId> ids = rawIds.stream()
.map(String::strip)
.filter(s -> !s.isEmpty())
.map(CaseId::parse)
.toList();
Lebih production-ready dengan error accumulation:
record ParseResult<T>(T value, String error) {
static <T> ParseResult<T> ok(T value) {
return new ParseResult<>(value, null);
}
static <T> ParseResult<T> fail(String error) {
return new ParseResult<>(null, error);
}
boolean valid() {
return error == null;
}
}
List<ParseResult<CaseId>> parsed = rawIds.stream()
.map(CaseId::tryParse)
.toList();
List<String> errors = parsed.stream()
.filter(result -> !result.valid())
.map(ParseResult::error)
.toList();
List<CaseId> ids = parsed.stream()
.filter(ParseResult::valid)
.map(ParseResult::value)
.toList();
Untuk batch validation, stream bisa bagus, tetapi jangan kehilangan error context.
12. Nulls in Streams
Stream can contain null elements unless source/operations prevent them.
Buruk:
List<String> names = users.stream()
.map(User::name)
.map(String::toUpperCase)
.toList();
Jika name() bisa null, NPE.
Options:
List<String> names = users.stream()
.map(User::name)
.filter(Objects::nonNull)
.map(name -> name.toUpperCase(Locale.ROOT))
.toList();
Atau model domain seharusnya non-null sejak awal:
record UserName(String value) {
UserName {
value = Objects.requireNonNull(value).strip();
if (value.isEmpty()) {
throw new IllegalArgumentException("empty user name");
}
}
}
Rule:
Stream bukan pengganti nullability design. Bersihkan null di boundary.
13. Primitive Streams
Boxing cost bisa muncul pada stream numeric.
int total = orders.stream()
.map(Order::amountInCents)
.reduce(0, Integer::sum);
Jika amountInCents() return int, lebih baik:
int total = orders.stream()
.mapToInt(Order::amountInCents)
.sum();
Primitive streams:
| Type | Stream |
|---|---|
int | IntStream |
long | LongStream |
double | DoubleStream |
Useful operations:
IntSummaryStatistics stats = orders.stream()
.mapToInt(Order::amountInCents)
.summaryStatistics();
But for money, prefer BigDecimal or minor units depending domain. Jangan gunakan DoubleStream untuk money hanya karena enak.
14. map vs flatMap
map transforms one element into one value.
List<CaseId> ids = cases.stream()
.map(CaseFile::id)
.toList();
flatMap transforms one element into stream of values, then flattens.
List<CaseEvent> allEvents = cases.stream()
.flatMap(caseFile -> caseFile.events().stream())
.toList();
Common domain example:
Set<Permission> permissions = users.stream()
.flatMap(user -> user.roles().stream())
.flatMap(role -> role.permissions().stream())
.collect(Collectors.toSet());
If role/permission are enums:
EnumSet<Permission> permissions = users.stream()
.flatMap(user -> user.roles().stream())
.flatMap(role -> role.permissions().stream())
.collect(Collectors.toCollection(() -> EnumSet.noneOf(Permission.class)));
15. toList() vs Collectors.toList() vs toUnmodifiableList()
Modern Java has Stream.toList().
List<CaseDto> result = cases.stream()
.map(this::toDto)
.toList();
Guideline:
| Method | Use when |
|---|---|
stream.toList() | You want a simple unmodifiable result list |
collect(Collectors.toList()) | You need collector form or do not require specific mutability |
collect(Collectors.toUnmodifiableList()) | You explicitly want unmodifiable collector and null rejection |
collect(Collectors.toCollection(ArrayList::new)) | You need a specific mutable implementation |
If caller will mutate result, be explicit:
List<CaseDto> mutable = cases.stream()
.map(this::toDto)
.collect(Collectors.toCollection(ArrayList::new));
Don't rely on unspecified implementation details.
16. Collectors: Materialization Semantics
Collectors turn stream elements into result containers.
Common collectors:
| Collector | Result |
|---|---|
toList | list |
toSet | set |
toCollection | chosen collection |
toMap | map |
groupingBy | map from classifier to list/collector result |
partitioningBy | map true/false |
mapping | downstream mapping |
filtering | downstream filtering |
flatMapping | downstream flattening |
reducing | downstream reduction |
summarizingInt | numeric summary |
joining | string join |
teeing | combine two collectors |
Example:
Map<CaseStatus, List<CaseFile>> byStatus = cases.stream()
.collect(Collectors.groupingBy(CaseFile::status));
But if CaseStatus is enum:
EnumMap<CaseStatus, List<CaseFile>> byStatus = cases.stream()
.collect(Collectors.groupingBy(
CaseFile::status,
() -> new EnumMap<>(CaseStatus.class),
Collectors.toList()
));
Implementation choice still matters inside collectors.
17. toMap: Duplicate Key Must Be Designed
Buruk:
Map<CaseId, CaseFile> byId = cases.stream()
.collect(Collectors.toMap(CaseFile::id, Function.identity()));
Jika duplicate ID muncul, exception.
Sometimes that is good. It fails fast.
If duplicate policy exists, encode it.
Keep first:
Map<CaseId, CaseFile> byId = cases.stream()
.collect(Collectors.toMap(
CaseFile::id,
Function.identity(),
(first, duplicate) -> first
));
Keep latest:
Map<CaseId, CaseFile> byId = cases.stream()
.collect(Collectors.toMap(
CaseFile::id,
Function.identity(),
(oldValue, newValue) -> newValue
));
Preserve encounter order:
Map<CaseId, CaseFile> byId = cases.stream()
.collect(Collectors.toMap(
CaseFile::id,
Function.identity(),
(first, duplicate) -> first,
LinkedHashMap::new
));
Rule:
Duplicate key policy is domain logic. Never leave it implicit unless fail-fast is the desired contract.
18. groupingBy: Classification Semantics
Basic:
Map<CaseStatus, List<CaseFile>> byStatus = cases.stream()
.collect(Collectors.groupingBy(CaseFile::status));
Counting:
Map<CaseStatus, Long> countByStatus = cases.stream()
.collect(Collectors.groupingBy(
CaseFile::status,
Collectors.counting()
));
EnumMap + count:
EnumMap<CaseStatus, Long> countByStatus = cases.stream()
.collect(Collectors.groupingBy(
CaseFile::status,
() -> new EnumMap<>(CaseStatus.class),
Collectors.counting()
));
Mapping downstream:
Map<CaseStatus, List<CaseId>> idsByStatus = cases.stream()
.collect(Collectors.groupingBy(
CaseFile::status,
Collectors.mapping(CaseFile::id, Collectors.toList())
));
Grouping is not free:
- creates maps/lists;
- stores grouped elements;
- depends on classifier equality;
- may create skewed groups;
- can hide memory growth in large streams.
For very large datasets, consider database grouping, streaming aggregation, or incremental processing.
19. partitioningBy: Boolean Split
Use when classifier is truly binary.
Map<Boolean, List<CaseFile>> partitioned = cases.stream()
.collect(Collectors.partitioningBy(CaseFile::overdue));
More expressive wrapper:
record OverduePartition(
List<CaseFile> overdue,
List<CaseFile> notOverdue
) {
static OverduePartition from(List<CaseFile> cases) {
Map<Boolean, List<CaseFile>> map = cases.stream()
.collect(Collectors.partitioningBy(CaseFile::overdue));
return new OverduePartition(
List.copyOf(map.get(true)),
List.copyOf(map.get(false))
);
}
}
Avoid leaking Map<Boolean, ...> across domain boundaries. Boolean keys are not self-documenting.
20. Reduction: Identity, Accumulator, Combiner
Reduction looks simple but has strict semantics.
int total = numbers.stream()
.reduce(0, Integer::sum);
For sequential stream, many reductions appear to work. For parallel stream, identity/accumulator/combiner must be correct and associative.
Bad:
int result = numbers.parallelStream()
.reduce(1, Integer::sum);
Identity 1 is wrong for sum. Parallel partitions will each start with 1, causing inflated result.
Correct:
int result = numbers.parallelStream()
.reduce(0, Integer::sum);
20.1 Prefer Specialized Operations
int total = numbers.stream()
.mapToInt(Integer::intValue)
.sum();
Clearer than manual reduce.
20.2 Reduce for Immutable Aggregation
BigDecimal total = invoices.stream()
.map(Invoice::amount)
.reduce(BigDecimal.ZERO, BigDecimal::add);
This is good because BigDecimal is immutable and addition identity is correct.
21. Custom Collector Mental Model
A collector has:
| Component | Role |
|---|---|
| supplier | create mutable accumulation container |
| accumulator | add one element |
| combiner | merge two containers |
| finisher | convert container to final result |
| characteristics | metadata such as unordered/concurrent/identity finish |
Example: collect case IDs into LinkedHashSet then immutable list.
List<CaseId> uniqueIdsInOrder = cases.stream()
.map(CaseFile::id)
.collect(Collectors.collectingAndThen(
Collectors.toCollection(LinkedHashSet::new),
List::copyOf
));
Custom collector rarely needed. Compose existing collectors first.
If writing a custom collector, test:
- empty stream;
- one element;
- duplicate elements;
- parallel stream;
- unordered source;
- null policy;
- combiner correctness.
22. Parallel Streams: Use with Skepticism
Parallel stream can help when:
- source can split efficiently;
- data size large enough;
- operation CPU-bound;
- operation stateless and non-blocking;
- reduction is associative;
- no ordering requirement or ordering cost acceptable;
- common pool interaction is acceptable.
Parallel stream is risky when:
- operation does I/O;
- operation blocks;
- shared mutable state exists;
- transaction/security/request context is thread-bound;
- order matters;
- source splits poorly;
- result aggregation is expensive;
- service already uses common fork-join pool heavily.
Bad:
orders.parallelStream()
.forEach(order -> paymentGateway.charge(order));
This is not safe just because it is shorter.
Better: design explicit concurrency with backpressure, timeout, retry, idempotency, and observability.
23. Spliterator Mental Model
A Spliterator is the bridge between source and stream traversal/splitting.
Important characteristics:
| Characteristic | Meaning |
|---|---|
ORDERED | encounter order is defined |
DISTINCT | elements are distinct |
SORTED | elements follow sorted order |
SIZED | exact size known before traversal |
SUBSIZED | splits are also sized |
NONNULL | elements are non-null |
IMMUTABLE | source cannot be structurally modified |
CONCURRENT | source can be safely concurrently modified |
UNORDERED concept | stream can ignore order when requested |
You do not need to implement Spliterator often. But understanding it explains why:
ArrayList.parallelStream()splits well;LinkedList.parallelStream()is less naturally split-friendly;HashSethas no encounter order guarantee;sorted/distinctmay require buffering;- parallel performance depends heavily on source characteristics.
24. Ordered vs Unordered Pipelines
If order does not matter, you can allow optimizations:
long count = users.parallelStream()
.unordered()
.filter(User::active)
.count();
But only do this if output semantics truly do not depend on order.
Example where order matters:
List<CaseEvent> firstTen = events.stream()
.filter(CaseEvent::visible)
.limit(10)
.toList();
limit(10) on ordered stream means first ten in encounter order.
Unordered limit may return any ten matching elements.
25. Stream vs Loop Decision Framework
Use stream when:
- transformation pipeline is linear and readable;
- mapping/filtering/collecting expresses intent;
- side effects are absent or terminal and controlled;
- output is a collection/map/reduction;
- operations are stateless;
- data size is moderate or source supports lazy processing.
Use loop when:
- complex branching;
- early exits with multiple conditions;
- mutation is domain operation;
- exception handling per item is complex;
- partial failure must be recorded carefully;
- multiple outputs are built together;
- performance hot path requires explicit control;
- debugging pipeline would be harder than loop.
Stream version:
List<CaseDto> result = cases.stream()
.filter(CaseFile::visible)
.map(this::toDto)
.toList();
Loop version better for multi-result validation:
List<CaseDto> valid = new ArrayList<>();
List<ValidationError> errors = new ArrayList<>();
for (RawCase raw : rawCases) {
try {
CaseFile file = parse(raw);
valid.add(toDto(file));
} catch (ValidationException ex) {
errors.add(new ValidationError(raw.id(), ex.getMessage()));
}
}
Readable beats fashionable.
26. Exception Handling in Streams
Don't hide checked/meaningful exceptions in generic wrappers without policy.
Bad:
List<CaseFile> files = paths.stream()
.map(path -> {
try {
return readCaseFile(path);
} catch (IOException e) {
throw new RuntimeException(e);
}
})
.toList();
Maybe acceptable for scripts, but weak for production.
Better options:
- handle outside stream with loop;
- return result type;
- separate I/O from pure transformation;
- collect errors explicitly;
- fail fast intentionally with domain exception.
Example result type:
sealed interface LoadResult permits LoadResult.Success, LoadResult.Failure {
record Success(CaseFile file) implements LoadResult {}
record Failure(Path path, String message) implements LoadResult {}
}
Then:
List<LoadResult> results = paths.stream()
.map(this::tryLoad)
.toList();
This preserves failure information.
27. Pipeline Layering: Do Not Mix Boundaries
Bad:
List<ResponseDto> response = request.items().stream()
.filter(item -> database.exists(item.id()))
.map(item -> service.enrich(item))
.map(item -> mapper.toResponse(item))
.toList();
Problems:
- database call hidden in filter;
- service side effects hidden in map;
- latency per item unclear;
- error handling weak;
- no batching;
- hard to observe.
Better layering:
List<ItemId> ids = request.items().stream()
.map(RequestItem::id)
.distinct()
.toList();
Map<ItemId, ItemRecord> records = repository.findAllById(ids);
List<ResponseDto> response = ids.stream()
.map(records::get)
.filter(Objects::nonNull)
.map(mapper::toResponse)
.toList();
This separates:
- extraction;
- batch I/O;
- transformation;
- output.
28. Case Study: Regulatory Case Event Pipeline
Input:
record RawEvent(String caseId, String type, String occurredAt, String actor) {}
Domain:
record CaseId(String value) {
CaseId {
value = Objects.requireNonNull(value).strip();
if (value.isEmpty()) {
throw new IllegalArgumentException("empty caseId");
}
}
}
enum EventType {
CREATED,
ASSIGNED,
APPROVED,
REJECTED
}
record CaseEvent(
CaseId caseId,
EventType type,
Instant occurredAt,
String actor
) {}
Parsing:
CaseEvent parse(RawEvent raw) {
return new CaseEvent(
new CaseId(raw.caseId()),
EventType.valueOf(raw.type().strip().toUpperCase(Locale.ROOT)),
Instant.parse(raw.occurredAt()),
raw.actor().strip()
);
}
Pipeline:
List<CaseEvent> events = rawEvents.stream()
.map(this::parse)
.sorted(Comparator.comparing(CaseEvent::occurredAt))
.toList();
Index by case preserving first-seen order:
Map<CaseId, List<CaseEvent>> eventsByCase = events.stream()
.collect(Collectors.groupingBy(
CaseEvent::caseId,
LinkedHashMap::new,
Collectors.toList()
));
Count by event type using enum map:
EnumMap<EventType, Long> countsByType = events.stream()
.collect(Collectors.groupingBy(
CaseEvent::type,
() -> new EnumMap<>(EventType.class),
Collectors.counting()
));
Build timeline view:
List<TimelineItem> timeline = events.stream()
.map(TimelineItem::from)
.toList();
This pipeline is defensible because:
- raw strings are converted at boundary;
- time is represented by
Instant; - event type is enum;
- order is explicit via
sorted; - grouping map implementation preserves first-seen case order;
- enum counts use
EnumMap; - output list is unmodifiable via
toList.
29. Data Pipeline Review Rubric
Before merging stream-heavy code, ask:
Source
- What is the source?
- Is it ordered?
- Is it finite?
- Can it contain null?
- Can it be modified during traversal?
- Is it cheap or expensive to traverse?
- Does it split well if parallel?
Operations
- Are operations stateless?
- Are operations non-interfering?
- Are side effects absent or explicit?
- Is
peekonly debug/observation? - Are stateful operations necessary?
- Is sorting/distinct done on correct equality/comparator?
- Is null handling explicit?
- Are conversions/parsing separated clearly?
Terminal
- Does terminal operation express desired result?
- Is duplicate key policy explicit for
toMap? - Is grouping output implementation correct?
- Is mutability of result intentional?
- Is reduction identity correct?
- Is reduction associative?
- Are exceptions handled intentionally?
- Is parallelism avoided unless justified?
Domain
- Are raw data converted into domain scalar/value types early?
- Are IDs typed?
- Are times represented with correct
java.timetype? - Are money/precision values represented correctly?
- Are collections hidden behind domain wrappers if invariants exist?
- Are null/absence semantics clear?
- Are equality/hash/comparator contracts correct?
30. Common Stream Anti-Patterns
30.1 Stream Everything
Not every loop should become stream.
If loop is clearer, use loop.
30.2 Hidden Mutation in map
.map(user -> {
user.activate();
return user;
})
Use explicit command loop.
30.3 Side Effects in filter
.filter(user -> audit(user))
Predicate should answer a question, not perform action.
30.4 Parallel Stream for I/O
urls.parallelStream().map(this::httpGet).toList();
Use explicit concurrency model.
30.5 toMap Without Duplicate Policy
If duplicates are possible, define merge behavior.
30.6 sorted Before Reducing Data
largeStream.sorted().filter(...).limit(10)
Often filter before sort:
largeStream.filter(...).sorted().limit(10)
But validate semantics.
30.7 distinct on Wrong Equality
If uniqueness is by email, but equals is by ID, distinct is wrong.
30.8 Optional.get After findFirst
User user = users.stream()
.filter(User::active)
.findFirst()
.get();
Use orElseThrow with meaningful error:
User user = users.stream()
.filter(User::active)
.findFirst()
.orElseThrow(() -> new IllegalStateException("no active user"));
31. Final Integration: Choosing Representation from Raw Input to Output
The full-series mental model:
| Layer | Strong representation |
|---|---|
| Raw text | String, byte[], charset-aware parsing |
| Numeric | primitive, BigInteger, BigDecimal, domain scalar |
| ID | typed ID record/value object |
| Time | Instant, LocalDate, ZonedDateTime, Duration, Period |
| Absence | null at boundary, Optional for return, explicit state in domain |
| Symbolic domain | enum or sealed type |
| Transparent data | record |
| Behavioral invariant | class |
| API boundary | interface/sealed interface |
| Grouping | collection interface/domain wrapper |
| Implementation | chosen by workload/contract |
| Transformation | stream or loop |
| Output | immutable DTO/view |
32. Top 1% Java Data Modeling Review Rubric
Use this as final checklist for the entire series.
Type and Representation
- Is each domain concept represented by the narrowest useful type?
- Are raw strings avoided beyond boundaries?
- Are IDs typed instead of generic
String/Longeverywhere? - Are booleans replaced by enum/sealed state when domain has more than yes/no?
- Are numbers represented with correct precision?
- Is money never modeled as
double? - Is text handled with Unicode/locale awareness where needed?
- Is byte/text conversion explicit with
Charset?
Object and Value Semantics
- Is identity vs value equality clear?
- Are
equals/hashCodecorrect and stable? - Are records used only when transparent data carrier semantics are appropriate?
- Are mutable components in records defensively copied?
- Are entities not accidentally modeled as pure value objects?
- Are value-based classes not used with identity-sensitive operations?
- Are object aliases controlled?
Null and Absence
- Is null accepted only at clear boundaries?
- Is absence represented explicitly?
- Is
Optionalused mainly as return type, not as universal field wrapper? - Are empty string, null, empty collection, and absent field semantically distinct when needed?
Conversions and Generics
- Are narrowing conversions explicit and safe?
- Is numeric promotion understood in calculations?
- Are unchecked casts isolated and justified?
- Are wildcards used to make APIs flexible without leaking complexity?
- Are arrays/generics/reifiability risks avoided?
- Is boxing avoided in hot numeric paths?
Collections and Streams
- Is collection interface chosen by semantic contract?
- Is collection implementation chosen by workload and invariant?
- Are map/set keys immutable or stable?
- Is ordering explicit and documented?
- Are unmodifiable snapshots used at boundaries?
- Are domain-specific collection wrappers used when raw collections leak invariants?
- Are streams stateless and non-interfering?
- Are collectors chosen with duplicate/mutability/order semantics explicit?
- Is parallel stream avoided unless demonstrably safe?
- Are loops used where they are clearer?
Time, Lifecycle, and Production Defensibility
- Is time type chosen by meaning, not convenience?
- Are time zones and DST handled intentionally?
- Are inclusive/exclusive intervals explicit?
- Are validation and normalization done at construction/boundary?
- Are failure modes tested, not just happy paths?
- Are serialization/database/API representations separated from domain representation?
- Is concurrency ownership clear?
- Are data structures observable/debuggable under production incidents?
- Are invariants documented in code structure, not only comments?
- Can a reviewer infer the domain rule from the type choices?
33. Final Practice: Refactor Weak Data Code
Weak version:
Map<String, Object> caseData = new HashMap<>();
caseData.put("id", " CASE-123 ");
caseData.put("status", "approved");
caseData.put("submittedAt", "2026-06-27T10:15:30Z");
caseData.put("amount", 100.25);
caseData.put("tags", List.of(" urgent ", "regulatory", "urgent"));
Strong version:
record CaseId(String value) {
CaseId {
value = Objects.requireNonNull(value).strip().toUpperCase(Locale.ROOT);
if (value.isEmpty()) {
throw new IllegalArgumentException("empty case id");
}
}
}
enum CaseStatus {
DRAFT,
SUBMITTED,
APPROVED,
REJECTED
}
record Money(BigDecimal amount, Currency currency) {
Money {
amount = Objects.requireNonNull(amount).setScale(2, RoundingMode.HALF_UP);
currency = Objects.requireNonNull(currency);
}
}
record Tags(Set<String> values) {
Tags {
values = values.stream()
.map(String::strip)
.filter(s -> !s.isEmpty())
.collect(Collectors.toCollection(LinkedHashSet::new));
values = Set.copyOf(values);
}
}
record CaseData(
CaseId id,
CaseStatus status,
Instant submittedAt,
Money amount,
Tags tags
) {}
Transformation boundary:
CaseData parse(Map<String, Object> raw) {
return new CaseData(
new CaseId((String) raw.get("id")),
CaseStatus.valueOf(((String) raw.get("status")).strip().toUpperCase(Locale.ROOT)),
Instant.parse((String) raw.get("submittedAt")),
new Money(BigDecimal.valueOf((Double) raw.get("amount")), Currency.getInstance("USD")),
new Tags(new LinkedHashSet<>((List<String>) raw.get("tags")))
);
}
Even this can be improved by avoiding raw Map<String, Object> and Double at the boundary. But the point is clear:
Strong representation makes invalid states harder to express and easier to detect.
34. What Mastery Looks Like
At the end of this series, mastery is not memorizing every class.
Mastery is being able to say:
- This should be a
recordbecause it is transparent data with fixed components. - This should be a
classbecause it owns invariant and lifecycle. - This should be an
enumbecause the domain is a closed symbolic set. - This should be a sealed hierarchy because variants carry different data.
- This should be
Instant, notLocalDateTime, because it is timeline event time. - This should be
BigDecimal, notdouble, because it is decimal money. - This should be
EnumMap, notHashMap, because key universe is enum. - This should be
LinkedHashMap, notHashMap, because encounter order is output contract. - This should be
ArrayDeque, notLinkedList, because we need queue/deque behavior. - This stream should be a loop because error handling and multiple outputs matter.
- This parallel stream is unsafe because operation is blocking and stateful.
- This map key is dangerous because equality field is mutable.
- This API leaks internal mutable state.
- This
Optionalhides a domain state that should be explicit. - This cast is unchecked and should be isolated at boundary.
- This wildcard makes input flexible but should not appear in return type.
- This collection needs a domain wrapper.
That is the difference between writing Java that compiles and writing Java whose data model survives production.
35. Series Completion
This is the final part of:
Learn Java Core Types, Data Model & Data APIs
The series covered:
- Kaufman skill map
- Java type system
- values, variables, references, objects
- primitive types and literals
- integral types
- floating-point types
- boolean semantics
- text and Unicode
- parsing, formatting, regex
- bytes and binary data
- object and runtime type
- equality, hash, comparison
- null and absence
- mutability and defensive copying
- class as data model
- interface and abstract boundaries
- records
- enums
- sealed types
- generics and erasure
- wildcards and variance
- arrays and reifiability
- conversions and casting
- boxing and wrappers
- value-based classes
- BigDecimal, BigInteger, money, precision
- IDs, UUID, random, tokens
- java.time mental model
- time zone, DST, formatting, parsing
- collection semantics
- collection implementations
- streams, collectors, and pipeline semantics
Status: seri selesai.
You just completed lesson 32 in final stretch. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.
Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.