Deepen PracticeOrdered learning track

Stateful Dedupe Patterns

Learn Java Data Pipeline Pattern - Part 046

Stateful deduplication patterns for production Java data pipelines: event identity, business keys, TTL, watermark cleanup, sink ledgers, late duplicates, and state-cost control.

[2026-07-04]15 min read2829 words

In This Lesson

1. Why Duplicates Exist 2. Dedupe vs Idempotency 3. Dedupe Key

PrevNext

Lesson 4684 lesson track46–69 Deepen Practice

#java#data-pipeline#flink#kafka+2 more

Part 046 — Stateful Dedupe Patterns

Dedupe adalah salah satu pattern paling sederhana secara konsep, tetapi paling sering salah di production.

Versi naifnya:

if eventId has been seen:
  drop
else:
  process

Versi production-nya harus menjawab:

duplicate berdasarkan identity apa?
horizon dedupe berapa lama?
apakah duplicate datang sebelum atau sesudah output?
apakah correction boleh memakai id yang sama?
apakah replay harus diproses ulang atau di-skip?
apakah sink sudah idempotent?
apakah dedupe state bisa dibersihkan?
apakah duplicate harus di-audit?
apakah late duplicate berbeda dengan late correction?

Dedupe bukan hanya filter. Dedupe adalah stateful correctness boundary.

1. Why Duplicates Exist

Duplicate bisa muncul dari banyak tempat:

Sumber Duplicate	Contoh
Producer retry	Producer timeout, publish sukses tapi client tidak tahu
API pagination	Page berubah saat sync berjalan
CDC snapshot + stream overlap	Row muncul di snapshot dan CDC stream
Outbox relay retry	Relay crash setelah publish sebelum mark sent
Consumer retry	Sink sukses, offset belum commit
Rebalance	Partition diproses ulang oleh consumer baru
Backfill	Historical data dipublish lagi
Manual re-import	Operator upload file sama
Source bug	Upstream mengirim event sama dengan ID berbeda
Multi-region replication	Event muncul dari dua region

Karena itu, dedupe tidak bisa hanya diletakkan di satu tempat dan dianggap selesai.

Production pipeline biasanya memakai beberapa lapisan:

2. Dedupe vs Idempotency

Dedupe dan idempotency sering tertukar.

Konsep	Pertanyaan	Lokasi Umum
Dedupe	Apakah record ini pernah dilihat?	Stream processor / consumer state
Idempotency	Apakah effect ini aman diulang?	Sink / database / external system
Reconciliation	Apakah hasil akhir cocok dengan sumber?	Audit / reporting / control process

Dedupe mengurangi duplicate processing. Idempotency memastikan duplicate processing tidak merusak hasil. Reconciliation mendeteksi jika keduanya gagal.

Production rule:

Jangan mengandalkan dedupe saja. Sink tetap harus idempotent untuk side effect penting.

3. Dedupe Key

Dedupe key adalah keputusan paling penting.

3.1 Event ID

Event ID unik per event occurrence.

public record EventId(String value) {}

Cocok untuk:

outbox event,
domain event,
API event dengan stable ID,
CDC transaction event yang punya source position.

Kelemahan:

jika source mengirim duplicate dengan ID baru, tidak terdeteksi.

3.2 Business Key

Business key mewakili fakta bisnis.

Contoh:

caseId + decisionType + decisionEffectiveDate

Cocok untuk:

file import tanpa event ID,
API sync,
CDC row-state projection,
regulatory facts.

Kelemahan:

correction bisa terlihat seperti duplicate,
natural key bisa berubah,
membutuhkan semantic design.

3.3 Source Position

Source position berasal dari log/offset/cursor.

Contoh:

database=case-db, table=case_status, lsn=12345678
kafkaTopic=case-events, partition=3, offset=99128
file=cases-2026-07-04.csv, line=812

Cocok untuk audit dan exactly-once-ish processing boundary.

Kelemahan:

tidak mendeteksi duplicate semantik lintas source position.

3.4 Payload Hash

Hash payload bisa dipakai saat tidak ada ID.

String hash = sha256(normalizedPayload);

Kelemahan:

perubahan insignificant bisa membuat hash beda,
collision sangat jarang tetapi tetap konsepnya ada,
tidak membedakan duplicate vs repeated legitimate event.

3.5 Composite Dedupe Key

Production sering memakai composite key.

public record DedupeKey(
    String sourceSystem,
    String entityType,
    String semanticKey,
    String eventType,
    String eventVersion
) {}

Rule:

Dedupe key harus didefinisikan dalam data contract, bukan tersembunyi di kode operator.

4. Duplicate Is Not Always Bad

Ada event yang terlihat sama tetapi valid.

Contoh:

case assigned ke officer A, lalu assigned lagi ke officer A setelah reopen,
status berubah dari OPEN ke OPEN karena correction metadata,
decision reissued dengan effective date yang sama tapi legal basis berbeda,
heartbeat event berulang,
repeated measurement dengan nilai sama.

Pertanyaan dedupe bukan:

Apakah payload sama?

Pertanyaan yang benar:

Apakah event ini merepresentasikan fakta yang sama dalam domain timeline yang sama?

5. Dedupe Horizon

Dedupe tidak bisa menyimpan semua ID selamanya, kecuali volumenya kecil dan compliance memang membutuhkan ledger permanen.

Dedupe horizon adalah berapa lama pipeline mengingat event yang sudah terlihat.

Contoh:

Use Case	Horizon
Producer retry cepat	5-30 menit
Kafka consumer replay pendek	1-7 hari
CDC snapshot overlap	Durasi snapshot + buffer
File import duplicate	Selama file import lifecycle
Regulatory audit ledger	Permanen / sesuai retention policy

Horizon harus disesuaikan dengan failure mode.

Jangan memilih TTL 24 jam kalau backfill bisa mengulang data 30 hari lalu.

6. Dedupe State Shape

Ada beberapa bentuk state.

6.1 Set of Seen IDs

key -> Set<eventId>

Mudah dipahami, tetapi bisa besar.

6.2 Map ID to Timestamp

key -> Map<eventId, seenAt>

Memungkinkan cleanup berdasarkan waktu.

6.3 Last Seen Version

entityId -> lastVersion

Cocok untuk monotonic version.

6.4 Last Source Position

partition -> lastProcessedOffset

Cocok untuk ordered source.

6.5 Sink Ledger

effectId -> effectStatus

Cocok untuk side effect idempotency.

6.6 Probabilistic Filter

Bloom filter atau approximate set bisa dipakai untuk high-volume telemetry.

Trade-off:

hemat memory,
bisa false positive,
tidak cocok jika false drop tidak boleh terjadi.

Untuk regulatory pipeline, jangan gunakan probabilistic dedupe untuk data yang harus lengkap.

7. Exact Dedupe with Flink Keyed State

Pattern umum:

keyBy(dedupe partition key)
MapState<dedupeId, firstSeenMetadata>

Java Model

public record DedupeDecision(
    boolean duplicate,
    String reason,
    Instant firstSeenAt,
    Instant currentSeenAt
) {}

public record SeenEvent(
    Instant eventTime,
    Instant firstSeenProcessingTime,
    String sourcePosition,
    String payloadHash
) {}

Flink ProcessFunction Skeleton

public final class StatefulDedupeFunction
    extends KeyedProcessFunction<String, PipelineEvent, PipelineEvent> {

    private transient MapState<String, SeenEvent> seen;

    @Override
    public void open(Configuration parameters) {
        MapStateDescriptor<String, SeenEvent> descriptor =
            new MapStateDescriptor<>(
                "seen-events",
                String.class,
                SeenEvent.class
            );

        StateTtlConfig ttl = StateTtlConfig
            .newBuilder(Duration.ofDays(7))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .build();

        descriptor.enableTimeToLive(ttl);
        seen = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void processElement(
        PipelineEvent event,
        Context ctx,
        Collector<PipelineEvent> out
    ) throws Exception {
        String dedupeId = event.dedupeId();

        if (seen.contains(dedupeId)) {
            emitDuplicateMetric(event);
            return;
        }

        seen.put(dedupeId, new SeenEvent(
            event.eventTime(),
            Instant.ofEpochMilli(ctx.timerService().currentProcessingTime()),
            event.sourcePosition(),
            event.payloadHash()
        ));

        out.collect(event);
    }
}

Ini mudah, tetapi belum cukup untuk semua kasus.

8. TTL Is a Memory Policy, Not a Correctness Policy

State TTL membantu membersihkan state. Tetapi TTL tidak selalu berarti "record pasti hilang tepat pada waktu X" secara semantik bisnis.

Production mental model:

TTL mengontrol resource. Correctness dikontrol oleh dedupe horizon contract, watermark/timer, sink idempotency, dan reconciliation.

Jika Anda harus tahu pasti kapan ID boleh dilupakan, gunakan explicit timer dan metadata.

9. Event-Time Dedupe Horizon

Untuk dedupe yang bergantung pada waktu event, cleanup harus berbasis event time, bukan processing time.

Contoh:

remember eventId until eventTime + 7 days + allowed lateness

Pattern:

simpan eventId -> eventTime,
register event-time timer untuk cleanup,
saat timer fire, hapus ID,
kirim metric cleanup.

Skeleton

public final class EventTimeDedupeFunction
    extends KeyedProcessFunction<String, PipelineEvent, PipelineEvent> {

    private transient MapState<String, SeenEvent> seen;
    private transient MapState<Long, List<String>> cleanupIndex;

    private final Duration horizon = Duration.ofDays(7);
    private final Duration allowedLateness = Duration.ofHours(2);

    @Override
    public void processElement(
        PipelineEvent event,
        Context ctx,
        Collector<PipelineEvent> out
    ) throws Exception {
        String id = event.dedupeId();

        if (seen.contains(id)) {
            return;
        }

        long cleanupAt = event.eventTime()
            .plus(horizon)
            .plus(allowedLateness)
            .toEpochMilli();

        seen.put(id, SeenEvent.from(event));
        addToCleanupIndex(cleanupAt, id);
        ctx.timerService().registerEventTimeTimer(cleanupAt);

        out.collect(event);
    }

    @Override
    public void onTimer(
        long timestamp,
        OnTimerContext ctx,
        Collector<PipelineEvent> out
    ) throws Exception {
        List<String> ids = cleanupIndex.get(timestamp);
        if (ids == null) {
            return;
        }

        for (String id : ids) {
            seen.remove(id);
        }
        cleanupIndex.remove(timestamp);
    }
}

Real implementation harus hati-hati dengan List<String> yang bisa besar. Biasanya cleanup bucket dibuat coarse-grained, misalnya per menit atau per jam.

10. Processing-Time Dedupe

Processing-time dedupe cocok untuk duplicate akibat retry cepat.

Contoh:

producer retry dalam 5 menit,
webhook redelivery,
API duplicate response,
temporary network retry.

Tidak cocok untuk:

historical replay,
regulatory correction,
deterministic backfill,
event-time reporting.

Processing-time dedupe saat replay bisa berbeda karena jam worker berbeda.

11. Dedupe and Backfill

Backfill bisa berbenturan dengan dedupe.

Pertanyaan penting:

Saat backfill memutar ulang event lama, apakah event itu harus dianggap duplicate atau harus dihitung ulang?

Ada dua mode:

11.1 Operational Replay Mode

Tujuannya mengulang delivery yang gagal.

Dedupe aktif. Sink idempotent. Output tidak boleh double.

11.2 Recompute Mode

Tujuannya menghitung ulang hasil dengan transform versi baru.

Dedupe lama mungkin harus diabaikan atau namespace dedupe harus berbeda.

Contoh key:

dedupeNamespace = transformVersion + backfillRunId

Jika tidak, recompute bisa "hilang" karena dianggap duplicate oleh state lama.

12. Dedupe Namespace

Dedupe key harus memiliki namespace.

public record DedupeNamespace(
    String pipelineName,
    String transformVersion,
    String processingMode,
    String sourceSystem
) {}

Processing mode:

public enum ProcessingMode {
    LIVE,
    REPLAY,
    BACKFILL,
    RECOMPUTE,
    CORRECTION
}

Full key:

public record NamespacedDedupeKey(
    DedupeNamespace namespace,
    String dedupeId
) {}

Tanpa namespace, backfill dan live stream bisa saling mengacaukan.

13. Duplicate vs Correction

Correction bukan duplicate.

Contoh:

Original:
caseId=123, decision=APPROVED, effectiveDate=2026-07-01

Correction:
caseId=123, decision=REJECTED, effectiveDate=2026-07-01, correctsEventId=abc

Jika dedupe key hanya caseId + effectiveDate, correction bisa dibuang secara salah.

Correction event harus punya semantic model:

public sealed interface CaseDecisionEvent permits
    CaseDecisionIssued,
    CaseDecisionCorrected,
    CaseDecisionRetracted {}

public record CaseDecisionCorrected(
    String eventId,
    String correctsEventId,
    String caseId,
    String newDecision,
    Instant effectiveDate,
    String correctionReason
) implements CaseDecisionEvent {}

Dedupe harus tahu event type.

14. Late Duplicate vs Late New Event

Late duplicate:

eventId same, eventTime old, already processed

Late new event:

eventId new, eventTime old, not processed

Late correction:

new event correcting old event

Ketiganya harus punya policy berbeda.

Case	Policy Umum
Late duplicate	Drop + metric
Late new event	Side output / late update / correction
Late correction	Correction lane

Jangan membuat semua late event sebagai duplicate.

15. Dedupe Before or After Enrichment?

Ada dua pilihan.

Before Enrichment

Mengurangi beban enrichment.

Cocok jika dedupe key tersedia di raw event.

After Enrichment

Dedupe key lebih akurat karena butuh reference data/canonicalization.

Cocok jika raw source tidak stabil.

Hybrid

cheap technical dedupe sebelum enrichment,
semantic dedupe setelah canonicalization.

Diagram:

16. Dedupe in Kafka Consumer Without Flink

Jika memakai plain Java Kafka consumer, dedupe biasanya memakai external store.

Contoh PostgreSQL ledger:

CREATE TABLE processed_event_ledger (
    consumer_name text NOT NULL,
    dedupe_id text NOT NULL,
    first_seen_at timestamptz NOT NULL DEFAULT now(),
    source_topic text NOT NULL,
    source_partition int NOT NULL,
    source_offset bigint NOT NULL,
    PRIMARY KEY (consumer_name, dedupe_id)
);

Processing flow:

Important:

ledger insert dan business effect harus dalam transaksi yang sama jika sink-nya database yang sama,
offset commit dilakukan setelah transaksi DB commit,
jika offset commit gagal, replay akan hit duplicate key dan aman.

17. Sink Ledger vs Stream Dedupe

Stream dedupe mengurangi noise sebelum sink. Sink ledger menjamin side effect.

Untuk critical effect, sink ledger lebih penting.

Contoh:

CREATE TABLE case_projection_event_ledger (
    projection_name text NOT NULL,
    event_id text NOT NULL,
    applied_at timestamptz NOT NULL DEFAULT now(),
    transform_version text NOT NULL,
    PRIMARY KEY (projection_name, event_id)
);

Dalam transaksi:

INSERT INTO case_projection_event_ledger(projection_name, event_id, transform_version)
VALUES (?, ?, ?)
ON CONFLICT DO NOTHING;

Jika row inserted = 1, apply projection. Jika row inserted = 0, skip.

18. Dedupe with Monotonic Version

Jika source punya monotonic version per entity, dedupe bisa lebih murah.

entityId -> highestSeenVersion

Rule:

version lebih kecil atau sama = duplicate/stale,
version lebih besar = process.

Cocok untuk:

entity snapshot stream,
CDC row version,
aggregate version,
event-sourced aggregate sequence.

Caveat:

harus ordered per entity,
version tidak boleh reset,
correction harus punya model khusus,
gap version harus diobservasi.

Example:

if (event.version() <= state.highestVersion()) {
    dropAsStale(event);
} else if (event.version() > state.highestVersion() + 1) {
    emitGapWarning(event);
    process(event);
} else {
    process(event);
}

19. Dedupe with Source Position

Jika input ordered per partition, offset bisa menjadi dedupe/progress state.

Tetapi offset tidak cukup untuk semantic dedupe.

Kafka offset menjawab:

Apakah posisi ini sudah dibaca?

Bukan:

Apakah fakta bisnis ini sudah diproses dari source lain?

CDC LSN menjawab:

Apakah perubahan log ini sudah diproses?

Bukan:

Apakah row ini merepresentasikan correction yang mengganti fakta lama?

Gunakan source position untuk recovery, gunakan semantic key untuk business dedupe.

20. Dedupe with Compacted Topic

Kafka compacted topic bisa dipakai sebagai dedupe/reference state.

Pattern:

topic seen-events compacted,
key = dedupe ID,
value = first seen metadata,
stream processor restore state dari topic.

Trade-off:

Kelebihan	Kekurangan
Durable	Topic bisa besar
Replayable	Compaction bukan immediate delete
Shared state	Tombstone/retention perlu benar
Kafka-native	Query lokal tetap perlu state store

Ini cocok untuk state yang harus recoverable dan auditable, tetapi bukan pengganti sink ledger jika ada side effect eksternal.

21. Dedupe for CDC Snapshot + Stream Overlap

CDC sering memiliki fase snapshot lalu streaming.

Duplicate bisa muncul jika:

snapshot membaca row,
lalu log stream juga memuat perubahan yang sama/berdekatan,
consumer menggabungkan snapshot event dan CDC event tanpa boundary.

Pattern:

gunakan source metadata dari CDC connector,
bedakan snapshot event dan streaming event,
gunakan primary key + source position/version,
sink harus upsert/idempotent,
reconciliation setelah snapshot selesai.

Jika output adalah projection latest state, upsert biasanya cukup. Jika output adalah append-only facts, perlu dedupe lebih hati-hati.

22. Dedupe for File Import

File import duplicate biasanya berbeda dari stream duplicate.

Dedupe keys:

file checksum,
manifest ID,
source file path + generation/version,
row number,
business key,
payload hash.

Pattern:

File-level dedupe saja tidak cukup jika file berbeda berisi row yang sama. Row-level dedupe saja tidak cukup jika file replay harus diperlakukan sebagai same import.

23. Dedupe for API Sync

API sync duplicate muncul karena pagination dan updated-since semantics.

Common pattern:

cursor = lastUpdatedAt
lookback = 10 minutes
query from cursor - lookback

Lookback sengaja membaca ulang data untuk menghindari missing update.

Karena itu sink harus dedupe/upsert.

Dedupe key biasanya:

remoteEntityId + remoteVersion/updatedAt

Jika API tidak punya version, gunakan:

remoteEntityId + normalizedPayloadHash

Tetapi pastikan correction dan legitimate repeated state tidak salah dibuang.

24. Dedupe State Size Estimation

Rumus kasar:

state_size ≈ unique_dedupe_keys_within_horizon × bytes_per_entry

Jika:

20.000 events/sec,
horizon 7 hari,
100 bytes per entry raw,

20,000 × 60 × 60 × 24 × 7 = 12,096,000,000 entries
raw = 1.2 TB before overhead

Itu tidak realistis untuk exact state sederhana.

Solusi:

kurangi horizon,
dedupe per key bukan global,
gunakan source-level idempotency,
pindahkan ledger ke database/object store,
gunakan compacted topic,
gunakan probabilistic dedupe untuk low-risk telemetry,
gunakan aggregation yang tidak sensitif duplicate,
gunakan idempotent sink dan reconciliation.

25. Partitioning Dedupe State

Dedupe state harus dipartisi dengan key yang sama dengan dedupe decision.

Jika dedupe key adalah event ID global, keyBy event ID. Jika duplicate didefinisikan per entity, keyBy entity ID.

Kesalahan:

.keyBy(event -> event.tenantId())

Jika tenant besar, satu key menjadi panas dan state raksasa.

Lebih baik:

.keyBy(event -> hash(event.dedupeId()) % shardCount)

Tetapi pastikan duplicate dengan dedupe ID yang sama selalu masuk shard yang sama.

26. Hot Dedupe Key

Hot key terjadi jika banyak event berbagi key.

Contoh:

UNKNOWN_CUSTOMER,
null,
default tenant,
global metric,
malformed data.

Production guard:

if (event.dedupeId() == null || event.dedupeId().isBlank()) {
    outputToQuarantine(event, "MISSING_DEDUPE_ID");
    return;
}

Jangan mengganti missing key dengan string konstan seperti UNKNOWN. Itu membuat hot key dan duplicate palsu.

27. Dedupe and Privacy

Dedupe key sering mengandung PII.

Contoh buruk:

email + phone + dateOfBirth

Lebih baik:

canonicalize,
hash dengan salt/pepper yang dikelola aman,
jangan log raw key,
pisahkan security boundary,
dokumentasikan retention.

Example:

public record SafeDedupeKey(
    String keyHash,
    String keyVersion,
    String classification
) {}

Dedupe state juga termasuk data yang harus di-retain dan dihapus sesuai policy.

28. Dedupe Observability

Metric wajib:

Metric	Makna
dedupe_checked_total	Volume record dicek
duplicate_dropped_total	Duplicate yang dibuang
duplicate_rate	Signal upstream retry/bug
state_entries	Jumlah key dalam state
state_bytes	Kapasitas
state_cleanup_total	Cleanup berjalan
late_duplicate_total	Duplicate datang terlambat
missing_dedupe_key_total	Contract violation
dedupe_false_conflict_total	Event valid dianggap duplicate
dedupe_store_latency	External ledger health

Log jangan memuat raw PII key.

Gunakan hash/fingerprint:

{
  "event": "duplicate_dropped",
  "pipeline": "case-escalation-pipeline",
  "dedupeKeyHash": "sha256:abc...",
  "firstSeenAt": "2026-07-04T10:00:00Z",
  "currentSeenAt": "2026-07-04T10:03:12Z",
  "sourcePosition": "topic=case-events,partition=2,offset=991"
}

29. Dedupe Audit Trail

Untuk data kritikal, duplicate yang dibuang tetap perlu audit.

Pattern:

main stream hanya menerima first occurrence,
duplicate stream menyimpan duplicate metadata,
duplicate stream bisa dianalisis untuk source quality.

Duplicate audit stream tidak harus menyimpan full payload jika sensitif.

30. Testing Stateful Dedupe

Test cases minimal:

first event passes,
same event duplicate drops,
same payload with different event ID follows policy,
same business key with correction passes,
replay mode does not double-apply sink effect,
TTL cleanup removes old key,
late duplicate drops,
late new event follows late policy,
missing dedupe key goes quarantine,
state restore after checkpoint preserves seen IDs,
backfill namespace does not conflict with live namespace,
high cardinality does not exceed state budget.

Example property:

For any event sequence S:
  output contains at most one event per dedupe key per namespace per horizon

But only if your dedupe contract says exactly that.

31. Failure Scenarios

31.1 Crash After Dedupe State Update Before Output

In Flink, managed state and output are coordinated through checkpointing. But sink semantics still matter.

Question:

If event marked seen, but output not durably committed, can it disappear?

Use framework-supported checkpointing and sink commit protocols. For external stores, use idempotent sink ledger.

31.2 Crash After Sink Success Before Offset Commit

On replay, dedupe may or may not remember event depending on state checkpoint.

Sink idempotency must protect effect.

31.3 TTL Expires Before Duplicate Arrives

Duplicate becomes new event.

This is expected if outside horizon, but must be acceptable by contract.

31.4 State Lost or Reset

All historical duplicates can pass.

Mitigation:

restore from checkpoint/savepoint,
external ledger,
compacted state topic,
sink idempotency,
reconciliation.

32. Production Blueprint

Core invariant:

For a given dedupe namespace and horizon, the pipeline emits at most one accepted event for each dedupe key, unless the event is explicitly modeled as correction/retraction/reissue.

33. Minimal Java Interfaces

public interface DedupeKeyResolver<E> {
    Optional<NamespacedDedupeKey> resolve(E event);
}

public interface DedupeStore {
    DedupeDecision checkAndRemember(
        NamespacedDedupeKey key,
        SeenEvent metadata
    );
}

public enum DuplicatePolicy {
    DROP,
    AUDIT_AND_DROP,
    QUARANTINE,
    PASS_WITH_DUPLICATE_MARKER
}

public record DedupeResult<E>(
    E event,
    boolean duplicate,
    DuplicatePolicy policy,
    String reason
) {}

This abstraction makes dedupe testable outside Flink/Kafka.

34. Common Anti-Patterns

Anti-Pattern 1: Dedupe by Payload Hash Only

Payload equality is not domain equality.

Anti-Pattern 2: Infinite In-Memory Set

Works in demo. Dies in production.

Anti-Pattern 3: TTL Chosen Randomly

TTL must come from failure model and replay/backfill requirement.

Anti-Pattern 4: Missing Key Defaults to Constant

UNKNOWN becomes hot key and false duplicate factory.

Anti-Pattern 5: Dedupe Before Understanding Correction

Correction gets dropped as duplicate.

Anti-Pattern 6: Trusting Dedupe Without Idempotent Sink

One state loss or replay can double-apply side effect.

Anti-Pattern 7: No Duplicate Audit

You cannot detect upstream quality issue if duplicates disappear silently.

35. Review Checklist

Before approving dedupe design:

What is the dedupe key?
Is dedupe technical, semantic, or both?
What is the namespace?
What is the horizon?
Why is that horizon enough?
What happens after TTL expires?
Is correction distinguishable from duplicate?
Is replay mode different from recompute mode?
Is backfill isolated?
Is sink idempotent?
Is there duplicate audit lane?
How big can state become?
What is hot-key mitigation?
Does state survive restart?
What is the reconciliation mechanism?
Are raw PII keys avoided?
Are missing keys quarantined?
Are metrics and alerts defined?

36. What Top Engineers Notice

Engineer biasa melihat duplicate sebagai nuisance.

Engineer kuat melihat duplicate sebagai konsekuensi natural dari distributed systems.

Distributed pipeline tidak bisa menghindari retry, replay, partial failure, reordering, and unknown outcome. Karena itu, dedupe harus didesain sebagai bagian dari correctness model, bukan filter tambahan di akhir.

Prinsip akhirnya:

Dedupe reduces repeated work. Idempotency protects effects. Reconciliation proves the result.

Jika ketiganya ada, pipeline jauh lebih defensible.

References

Apache Flink Documentation — Working with State: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/fault-tolerance/state/
Apache Flink Documentation — State TTL: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/fault-tolerance/state/
Apache Flink Documentation — Timers and ProcessFunction: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/process_function/
Apache Kafka Documentation — Consumer and Offset Semantics: https://kafka.apache.org/documentation/

Lesson Recap

You just completed lesson 46 in deepen practice. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Back To Series Next Lesson

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.

Flink Joins and Enrichment