Learn Java Microservices Cpq Oms Platform Part 031 Performance Engineering And Capacity Modeling
title: Learn Java Microservices CPQ/OMS Platform - Part 031
description: Performance engineering and capacity modeling for a Java microservices CPQ and order management platform: workload model, latency budget, throughput, JVM, PostgreSQL, Kafka, Redis, Camunda 7, load testing, and production capacity planning.
series: learn-java-microservices-cpq-oms-platform
seriesTitle: Learn Java Microservices CPQ/OMS Platform
order: 31
partTitle: Performance Engineering and Capacity Modeling
tags:
- java
- microservices
- cpq
- order-management
- performance
- capacity-planning
- postgresql
- kafka
- redis
- camunda
- jfr
- load-testing
date: 2026-07-02
Part 031 — Performance Engineering and Capacity Modeling
Performance engineering for CPQ/OMS is not just about making endpoints fast. We are building a business platform that handles product configuration, pricing, approval, quotes, orders, orchestration, event publishing, caching, and the audit trail. A platform like this must be fast, but it must also be correct, stable, explainable, and recoverable.
A common engineering mistake is to jump straight into optimizing queries, adding caches, or increasing the replica count without understanding the shape of the load. In CPQ/OMS, performance is often determined by a combination of factors: the number of tenants, catalog size, configuration complexity, line item count, pricing rules, the number of approval signals, event fan-out, workflow backlog, and database access patterns. Capacity modeling must therefore start from the workload model, not from a CPU graph.
This part builds a systematic way of thinking about performance: from SLOs, latency budgets, throughput, queueing, profiling, database plans, Kafka lag, the Camunda job executor, and Redis hot keys through to load tests and production capacity reviews.
1. Learning Objectives
After completing this part, we want to be able to:
- Build a workload model for a CPQ/OMS platform.
- Define SLOs and a latency budget per use case.
- Relate throughput, concurrency, queue length, and latency.
- Identify the hot paths in catalog, configuration, pricing, quote, order, and orchestration.
- Profile Java with an evidence-based approach.
- Read PostgreSQL execution plans for critical queries.
- Design capacity for Kafka consumers, partitions, and retry/DLT.
- Tune the Camunda 7 job executor and workflow processes without creating bottlenecks.
- Use Redis to reduce latency without breaking correctness.
- Design load tests, soak tests, stress tests, and capacity reviews.
2. Kaufman Deconstruction: Performance Skill Map
The Kaufman-style learning target for this part is not to memorize tuning parameters but to be able to run this loop:
Hypothesis -> Measurement -> Bottleneck Identification -> Controlled Change -> Validation -> Runbook Update
Without measurement, we are not doing performance engineering. We are only guessing.
3. Mental Model: Performance Is Queueing + Contention + Shape
On a CPQ/OMS platform, performance usually degrades because of one of five things:
- Queueing: requests, events, jobs, or callers needing a DB connection wait too long.
- Contention: many workers compete for the same row, lock, partition, key, CPU, thread, or connection.
- Data shape: the query, payload, or object graph is too large.
- Remote dependency: external service latency enters the critical path.
- Unbounded work: an endpoint does work that is not bounded by a page size, line limit, retry budget, or timeout.
A system with low CPU can still be slow if every request is waiting for a DB connection. A system with a fast DB can still be slow if the Camunda job executor is backlogged. A system with a high cache hit rate can still be wrong if the cache holds a price that should no longer be valid.
4. Performance Principles for CPQ/OMS
Use the following principles as engineering invariants:
| Principle | Practical Meaning |
|---|---|
| Correctness before speed | Do not optimize at the expense of quote/order invariants. |
| Measure before tuning | Every optimization needs a baseline and an after-measurement. |
| Bound every dimension | Limit line items, page size, retry count, payload size, and workflow fan-out. |
| Separate read and write path | Complex queries must not interfere with command transactions. |
| Avoid hidden N+1 | MyBatis gives control over SQL but does not automatically prevent repeated queries. |
| Cache only stable facts | Pricing/configuration caches need a version key and an invalidation policy. |
| Async is not free | Kafka/Camunda move latency into a backlog; they do not remove the work. |
| Optimize hot path only | Do not optimize code paths that are insignificant in real traffic. |
| Keep repair path fast | Incident recovery needs predictable queries and tooling. |
5. Workload Model
Before tuning, define the workload. For CPQ/OMS, "100 RPS" is not enough to describe it. We need to know the structure of the work.
5.1 Workload Dimensions
| Dimension | Example | Why It Matters |
|---|---|---|
| Active tenants | 50 tenants | Affects isolation, cache keys, DB index selectivity. |
| Catalog size | 100k offers | Affects lookups, snapshot publishing, cache size. |
| Product attribute count | 200 attributes/product | Affects validation and payload size. |
| Configuration lines | 1–500 lines | Affects rule evaluation and pricing. |
| Pricing rules | 100–10k rules | Affects calculation latency. |
| Quote versions | 1–100 versions/quote | Affects storage and retrieval. |
| Approval signals | 1–50 signals/quote | Affects the approval path. |
| Order lines | 1–1000 lines | Affects orchestration fan-out. |
| Fulfillment steps | 3–50 steps/line | Affects Camunda job count. |
| Event subscribers | 1–20 consumer groups | Affects fan-out and schema compatibility. |
5.2 Capacity Unit
For CPQ/OMS, use several capacity units rather than a single number:
| Capacity Unit | Definition |
|---|---|
| `quote_line_calculation` | Pricing is calculated for one line item. |
| `configuration_validation` | One configuration is validated against the rule set and catalog snapshot. |
| `quote_submission` | A draft quote is submitted for approval/acceptance. |
| `order_capture` | An accepted quote is converted into an order. |
| `order_line_transition` | One order line changes state. |
| `workflow_job_execution` | One Camunda job is executed. |
| `event_publication` | One domain event is published from the outbox to Kafka. |
| `event_consumption` | One event is processed by a consumer with inbox idempotency. |
Example:
Peak hour:
- 20k active sales users
- 8k quote recalculation/hour
- average 25 quote lines
- 2.5 pricing recalculation per quote edit
quote_line_calculation/hour = 8,000 * 25 * 2.5 = 500,000
quote_line_calculation/sec = 500,000 / 3,600 ≈ 138.9 average
peak factor 5x ≈ 694.4/sec
The API RPS figure may look low, but the pricing work per line can be large.
6. SLO dan Latency Budget
SLOs must be defined per use case. Do not treat a catalog read endpoint the same as accept quote or submit order.
6.1 Example SLO Matrix
| Use Case | Target p50 | Target p95 | Target p99 | Notes |
|---|---|---|---|---|
| Search catalog | 100 ms | 300 ms | 800 ms | Read-heavy, cache/projection friendly. |
| Validate configuration | 150 ms | 600 ms | 1500 ms | Depends on rule complexity. |
| Calculate price | 200 ms | 800 ms | 2000 ms | Must be deterministic and explainable. |
| Save quote draft | 80 ms | 250 ms | 700 ms | Small write path. |
| Submit quote | 200 ms | 800 ms | 2000 ms | May trigger approval evaluation. |
| Accept quote | 300 ms | 1000 ms | 3000 ms | Must be idempotent; order capture can be async. |
| Capture order | 500 ms | 2000 ms | 5000 ms | May start orchestration. |
| Process workflow job | 100 ms | 1000 ms | 5000 ms | Depends on external dependencies. |
These numbers are not universal. They are a practice baseline for teaching the way of thinking. Production SLOs must be derived from business needs, real volume, and the cost envelope.
6.2 Latency Budget for Accept Quote
Example latency budget:
| Segment | Budget |
|---|---|
| API authentication + authorization | 30 ms |
| Load quote snapshot | 80 ms |
| Version/state guard | 20 ms |
| Acceptance persistence | 80 ms |
| Order capture command | 300 ms |
| Outbox insert | 30 ms |
| Response construction | 20 ms |
| Safety margin | 160 ms |
| Total p95 budget | 720 ms |
If order orchestration calls an external fulfillment system, do not put the whole fulfillment on the synchronous accept path. Accept quote must record the business commitment and start the order process, not wait for fulfillment to finish.
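A budget like this is easier to keep honest when the segments are summed in code and compared with the p95 target. The following is a minimal sketch rather than the platform's implementation: `LatencyBudget` and its methods are hypothetical names, the segment values are copied from the table above, and the 1000 ms target is the Accept quote p95 from the example SLO matrix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: sums per-segment budgets and checks them against a target. */
final class LatencyBudget {
    private final Map<String, Integer> segmentsMs = new LinkedHashMap<>();

    LatencyBudget add(String segment, int budgetMs) {
        if (budgetMs < 0) {
            throw new IllegalArgumentException("budget must be >= 0: " + segment);
        }
        segmentsMs.merge(segment, budgetMs, Integer::sum);
        return this;
    }

    int totalMs() {
        return segmentsMs.values().stream().mapToInt(Integer::intValue).sum();
    }

    boolean fitsWithin(int targetMs) {
        return totalMs() <= targetMs;
    }

    /** Segment values taken from the Accept Quote budget table. */
    static LatencyBudget acceptQuoteExample() {
        return new LatencyBudget()
            .add("auth", 30)
            .add("load quote snapshot", 80)
            .add("version/state guard", 20)
            .add("acceptance persistence", 80)
            .add("order capture command", 300)
            .add("outbox insert", 30)
            .add("response construction", 20)
            .add("safety margin", 160);
    }

    public static void main(String[] args) {
        LatencyBudget budget = acceptQuoteExample();
        System.out.println("total = " + budget.totalMs() + " ms");
        System.out.println("fits p95 target of 1000 ms: " + budget.fitsWithin(1000));
    }
}
```

A check like this can run as a unit test, so that adding a new synchronous step to the accept path forces an explicit decision about which segment gives up its share.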
7. Little's Law for Backend Engineers
Little's Law:
L = λ * W
Where:
- L = number of work-in-progress/concurrent items.
- λ = arrival rate/throughput.
- W = average time an item spends in the system.
Example:
Order workflow arrival rate = 20 order/sec
Average orchestration duration = 15 minutes = 900 sec
Active process instances ≈ 20 * 900 = 18,000
This means that even though the API only accepts 20 orders/sec, the Camunda runtime may have to hold about eighteen thousand active process instances. That affects runtime tables, history tables, timer jobs, incident queries, and operations dashboards.
Example for a thread pool:
Pricing API throughput = 500 request/sec
Average service time = 120 ms = 0.12 sec
Concurrent in-flight ≈ 500 * 0.12 = 60
A pool of only 32 threads is already below the roughly 60 in-flight requests this load needs, so requests start queueing, and an unstable dependency makes it worse. If the thread pool is too large, the DB connection pool can be exhausted and latency degrades.
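The two examples above can be reproduced with a small helper. This is a sketch under stated assumptions: `LittlesLaw` is a hypothetical class, and the `targetUtilization` parameter (the share of workers we are willing to keep busy on average) is an added assumption that is not part of Little's Law itself.

```java
/** Hypothetical helper for Little's Law: L = lambda * W. */
final class LittlesLaw {

    /** Average number of items in the system for a given arrival rate and time in system. */
    static double inFlight(double arrivalRatePerSec, double avgTimeInSystemSec) {
        if (arrivalRatePerSec < 0 || avgTimeInSystemSec < 0) {
            throw new IllegalArgumentException("rate and time must be >= 0");
        }
        return arrivalRatePerSec * avgTimeInSystemSec;
    }

    /**
     * Workers needed so that average utilization stays at or below the target.
     * targetUtilization is an assumption of this sketch, for example 0.7 for 70%.
     */
    static int minWorkers(double arrivalRatePerSec, double avgServiceTimeSec, double targetUtilization) {
        if (targetUtilization <= 0 || targetUtilization > 1) {
            throw new IllegalArgumentException("targetUtilization must be in (0, 1]");
        }
        double needed = inFlight(arrivalRatePerSec, avgServiceTimeSec) / targetUtilization;
        // Small epsilon so floating-point noise does not push an exact result up by one.
        return (int) Math.ceil(needed - 1e-9);
    }

    public static void main(String[] args) {
        // Order workflow: 20 orders/sec, 900 sec average orchestration time.
        System.out.println("active process instances ~ " + inFlight(20, 900));
        // Pricing API: 500 req/sec, 0.12 sec average service time.
        System.out.println("in-flight pricing requests ~ " + inFlight(500, 0.12));
        System.out.println("workers at 70% utilization: " + minWorkers(500, 0.12, 0.7));
    }
}
```

Little's Law describes averages. It says nothing about bursts or tail latency, which is why the result is a lower bound for sizing and not a final pool size.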
8. Hot Path Platform
8.1 Catalog Browse Hot Path
Read-heavy path:
User -> Catalog API -> Redis/Projection -> PostgreSQL fallback -> Response
Optimization focus:
- Dedicated read projection table.
- Index on `tenant_id`, `catalog_version_id`, `status`, `category`, `sku`.
- Cache key based on `tenant + catalog_version + query_dimension`.
- Bounded page size.
- Avoid large joins for the list view.
8.2 Configuration Validation Hot Path
User Selection -> Load Catalog Snapshot -> Evaluate Rules -> Explain Violations -> Save Session
Optimization focus:
- Precompiled rule graph per catalog version.
- Redis cache for stable metadata.
- Batched DB load.
- Explainability without storing a large object graph.
- Limit line count and attribute count.
8.3 Pricing Hot Path
Configuration Snapshot -> Price Book -> Discount Rules -> Charge Breakdown -> Pricing Snapshot
Optimization focus:
- Price book version key.
- Deterministic calculation.
- Avoid floating point; use minor units or a decimal type.
- Golden master tests.
- Cache the price rule lookup, not the final price, when the price is highly context-sensitive.
8.4 Quote Submit Hot Path
Draft Quote -> Validate Completeness -> Evaluate Approval Signals -> Persist Submitted -> Emit Event
Optimization focus:
- Submit must be a small state transition.
- Heavy document rendering runs async.
- Versioned approval policy cache.
- Outbox insert in the same transaction.
8.5 Order Capture Hot Path
Accepted Quote -> Idempotency Guard -> Snapshot Order -> Create Lines -> Start Workflow
Optimization focus:
- Bulk insert line item.
- Stable order reference.
- Idempotency key derived from the quote acceptance.
- Workflow start must be retry-safe.
9. JVM Performance Engineering
9.1 Profile Before Optimizing
Modern Java provides many runtime observability tools. Java Flight Recorder provides a recording mechanism for understanding CPU, allocation, locks, threads, GC, and runtime events. Use profiling when:
- p95/p99 latency rises without a clear cause.
- CPU is high but throughput does not increase.
- GC pauses increase.
- Blocked/waiting threads increase.
- Pricing/configuration is CPU-bound.
- Serialization/deserialization is expensive.
Profiling checklist:
- Take a baseline under representative traffic.
- Capture JFR over a sufficiently long window.
- Look at hot methods, allocation pressure, lock contention, GC, and thread states.
- Correlate with trace/request IDs.
- Change one variable.
- Rerun the test.
9.2 Allocation Discipline
CPQ payloads are often large. Avoid the following pattern:
// Anti-pattern: repeatedly copying large line lists
List<PricedLine> result = new ArrayList<>();
for (QuoteLine line : lines) {
List<PricedLine> intermediate = new ArrayList<>(result);
intermediate.add(price(line));
result = intermediate;
}
Use a bounded mutable builder inside the calculation boundary:
List<PricedLine> result = new ArrayList<>(lines.size());
for (QuoteLine line : lines) {
result.add(price(line));
}
return List.copyOf(result);
Immutability is good for safety, but object churn on the hot path must be measured.
9.3 Thread Pool and Connection Pool
Do not tune the thread pool in isolation from the DB pool.
API worker threads = 200
DB pool = 30
Average DB time = 80 ms
Peak API request = 500 rps
If all 200 threads can enter a code path that needs the DB, 170 of them may simply be waiting for a connection. That raises latency and memory pressure.
Practical rules:
- The inbound thread pool must be bounded.
- The DB connection pool must match database capacity.
- External dependency calls must have timeouts.
- The async executor for background jobs must be separate from the request executor.
- Heavy per-tenant work needs a limiter.
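Two of these rules can be sketched with plain `java.util.concurrent`: a bounded inbound executor that rejects work when saturated, and a gate that caps concurrent DB-bound work with a bounded wait. `BoundedPools` and `DbGate` are hypothetical names for illustration. In a real service the connection pool's own acquire timeout usually plays the role of the gate, so treat this as a way to see the mechanics rather than a recommended component.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedPools {

    /** Fixed-size executor with a bounded queue; submissions are rejected when both are full. */
    static ThreadPoolExecutor boundedExecutor(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
            threads, threads,
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            new ThreadPoolExecutor.AbortPolicy());
    }

    /** Caps concurrent DB-bound work; callers fail fast instead of waiting indefinitely. */
    static final class DbGate {
        private final Semaphore permits;
        private final long maxWaitMs;

        DbGate(int maxConcurrent, long maxWaitMs) {
            this.permits = new Semaphore(maxConcurrent);
            this.maxWaitMs = maxWaitMs;
        }

        <T> T call(Callable<T> dbWork) throws Exception {
            if (!permits.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("no DB permit within " + maxWaitMs + " ms");
            }
            try {
                return dbWork.call();
            } finally {
                permits.release();
            }
        }

        int available() {
            return permits.availablePermits();
        }
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor inbound = boundedExecutor(8, 32);
        DbGate gate = new DbGate(4, 50);
        Integer result = inbound.submit(() -> gate.call(() -> 42)).get();
        System.out.println("result = " + result);
        inbound.shutdown();
    }
}
```

Rejecting early keeps the queue from turning into hidden latency: a request that cannot get a thread or a permit within its budget is better answered with a fast error than held until the client times out.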
10. JAX-RS/Jersey API Performance
Common hotspots in the service layer:
- JSON payloads are too large.
- Validation is too expensive.
- The exception mapper builds large stack traces for normal business errors.
- Request logging stores large bodies.
- Blocking calls without a timeout.
- Unbounded pagination.
- Resource methods do too much orchestration.
Example of a bounded endpoint contract:
parameters:
- name: limit
in: query
schema:
type: integer
minimum: 1
maximum: 100
default: 50
Example of defensive response shaping:
public record QuoteSummaryResponse(
UUID quoteId,
String quoteNumber,
String state,
String customerName,
Instant updatedAt,
MoneyResponse total
) {}
List endpoints must return a summary, not a full quote snapshot with hundreds of lines.
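The `limit` contract shown earlier only helps if the server enforces it as well. A minimal sketch of that enforcement follows, assuming a hypothetical `PageLimit` helper called from the resource method; the bounds mirror the schema (minimum 1, maximum 100, default 50).

```java
/** Hypothetical server-side guard that mirrors the limit schema: min 1, max 100, default 50. */
final class PageLimit {
    static final int MIN = 1;
    static final int MAX = 100;
    static final int DEFAULT = 50;

    /** Returns the effective page size, or throws if the caller asked for an out-of-range value. */
    static int resolve(Integer requested) {
        if (requested == null) {
            return DEFAULT;
        }
        if (requested < MIN || requested > MAX) {
            throw new IllegalArgumentException(
                "limit must be between " + MIN + " and " + MAX + ", got " + requested);
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null)); // default applies
        System.out.println(resolve(25));   // in range, passed through
    }
}
```

Rejecting an out-of-range value matches the schema literally. Clamping to the nearest bound is a reasonable alternative when clients are hard to change, as long as the response makes the effective limit visible.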
11. PostgreSQL Performance Engineering
The PostgreSQL query planner chooses an execution plan for each query. EXPLAIN shows the plan the planner chose, and EXPLAIN ANALYZE executes the query and adds actual runtime statistics. This is what separates evidence from guesswork.
11.1 Query Shape for CPQ/OMS
Bad list query:
SELECT q.*, ql.*, a.*, e.*
FROM quote q
LEFT JOIN quote_line ql ON ql.quote_id = q.id
LEFT JOIN approval_request a ON a.quote_id = q.id
LEFT JOIN outbox_event e ON e.aggregate_id = q.id
WHERE q.tenant_id = :tenant_id
ORDER BY q.updated_at DESC
LIMIT 50;
Problems:
- The joins multiply the row count.
- Pagination can be wrong because of duplicate quote rows.
- The outbox is irrelevant to a UI list.
- The query is hard to cache.
Better:
SELECT
q.id,
q.quote_number,
q.state,
q.customer_name,
q.total_amount_minor,
q.currency,
q.updated_at
FROM quote q
WHERE q.tenant_id = :tenant_id
AND q.state = ANY(:states)
ORDER BY q.updated_at DESC, q.id DESC
LIMIT :limit;
Full details are fetched through the detail endpoint, which actually needs the lines.
11.2 Index Design
Indexes must follow the access pattern, not cover every column.
CREATE INDEX idx_quote_tenant_state_updated
ON quote (tenant_id, state, updated_at DESC, id DESC);
For order line processing:
CREATE INDEX idx_order_line_ready
ON order_line (tenant_id, state, dependency_ready_at, id)
WHERE state = 'READY_FOR_FULFILLMENT';
A partial index is useful for a queue-like table when one status is queried far more often than the others.
11.3 Lock dan Contention
Optimistic transition:
UPDATE order_header
SET state = :next_state,
version = version + 1,
updated_at = now()
WHERE tenant_id = :tenant_id
AND id = :order_id
AND state = :expected_state
AND version = :expected_version;
If the affected row count is 0, the transition is rejected because of a stale version or an invalid state. This is better than holding a long lock.
For a worker queue/outbox:
SELECT id
FROM outbox_event
WHERE status = 'NEW'
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 100;
SKIP LOCKED lets multiple publishers take different batches without waiting on rows already locked by another worker. Stale-claim recovery is still needed.
11.4 MyBatis Performance
Anti-pattern N+1:
List<QuoteRow> quotes = quoteMapper.search(criteria);
for (QuoteRow quote : quotes) {
quote.setLines(lineMapper.findByQuoteId(quote.id()));
}
Better:
List<QuoteRow> quotes = quoteMapper.search(criteria);
List<UUID> ids = quotes.stream().map(QuoteRow::id).toList();
List<QuoteLineRow> lines = lineMapper.findByQuoteIds(ids);
Then group in memory. For large lists, use pagination and a limit.
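The in-memory grouping step can be sketched as follows. The record shapes here are simplified, hypothetical stand-ins for the mapper row types (the real `QuoteRow` and `QuoteLineRow` will have more fields), and `QuoteWithLines` is an illustrative result type.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

final class QuoteLineGrouping {
    // Simplified, hypothetical row shapes for illustration only.
    record QuoteRow(UUID id, String quoteNumber) {}
    record QuoteLineRow(UUID quoteId, int lineNo, String sku) {}
    record QuoteWithLines(QuoteRow quote, List<QuoteLineRow> lines) {}

    /** Attaches lines to quotes using one pass over the lines instead of one query per quote. */
    static List<QuoteWithLines> attach(List<QuoteRow> quotes, List<QuoteLineRow> lines) {
        Map<UUID, List<QuoteLineRow>> linesByQuote = lines.stream()
            .collect(Collectors.groupingBy(QuoteLineRow::quoteId));
        return quotes.stream()
            // Quotes without lines get an empty list; quote order from the search is preserved.
            .map(q -> new QuoteWithLines(q, linesByQuote.getOrDefault(q.id(), List.of())))
            .toList();
    }

    public static void main(String[] args) {
        UUID a = UUID.randomUUID();
        UUID b = UUID.randomUUID();
        List<QuoteRow> quotes = List.of(new QuoteRow(a, "Q-1"), new QuoteRow(b, "Q-2"));
        List<QuoteLineRow> lines = List.of(
            new QuoteLineRow(a, 1, "SKU-1"),
            new QuoteLineRow(a, 2, "SKU-2"));
        attach(quotes, lines).forEach(q ->
            System.out.println(q.quote().quoteNumber() + " has " + q.lines().size() + " line(s)"));
    }
}
```

This turns N+1 round trips into two queries plus a linear pass, but the `IN` list still grows with the page size, which is one more reason to keep the page bounded.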
11.5 Transaction Size
CPQ/OMS write transactions must be small:
- Validate command.
- Load aggregate minimum.
- Check invariant.
- Write state change.
- Write audit/outbox.
- Commit.
Do not render PDFs, call external systems, publish directly to Kafka, or evaluate thousands of rules inside a DB transaction if the work can be separated.
12. Kafka Performance Engineering
Kafka bottlenecks usually show up as consumer lag, rebalances, slow partitions, retry storms, or DLT floods.
12.1 Partition Key
| Event | Recommended Key | Reason |
|---|---|---|
| `QuoteSubmitted` | `quoteId` | Preserves ordering of quote events. |
| `QuoteAccepted` | `quoteId` | Acceptance must be ordered relative to the quote lifecycle. |
| `OrderCaptured` | `orderId` | The order lifecycle needs per-order ordering. |
| `OrderLineStateChanged` | `orderId` or `orderLineId` | Choose based on the ordering requirement. |
| `ApprovalEscalated` | `approvalRequestId` | The approval lifecycle is local. |
Do not use `tenantId` as the key for all events if a large tenant can create a hot partition.
12.2 Consumer Capacity
Consumer group parallelism is limited by the topic's partition count. If a topic has 12 partitions, a consumer group has at most 12 active consumers for that topic.
Capacity estimate:
Event arrival = 2,000 events/sec
Average processing time = 20 ms = 0.02 sec
Concurrency needed = 2,000 * 0.02 = 40 in-flight
If a consumer process handles each partition sequentially, we need enough partitions and instances. However, increasing the partition count also raises operational cost and complexity. Partition count is an architectural decision, not a default number.
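Under the sequential-per-partition assumption, the estimate above translates directly into a minimum partition count, because each partition can complete at most one event per processing interval. The sketch below uses a hypothetical `ConsumerCapacity` helper; the `headroom` multiplier for retries and peaks is an assumption added here, not a Kafka setting.

```java
/** Hypothetical estimate of partitions needed when each partition is processed sequentially. */
final class ConsumerCapacity {

    /**
     * @param eventsPerSec     arrival rate for the topic
     * @param avgProcessingSec average processing time per event
     * @param headroom         multiplier >= 1.0 for retries and peaks (an assumption of this sketch)
     */
    static int minPartitions(double eventsPerSec, double avgProcessingSec, double headroom) {
        if (eventsPerSec < 0 || avgProcessingSec <= 0 || headroom < 1.0) {
            throw new IllegalArgumentException("invalid capacity inputs");
        }
        // In-flight work needed (Little's Law); one sequential partition supplies one unit of it.
        double needed = eventsPerSec * avgProcessingSec * headroom;
        // Small epsilon so floating-point noise does not push an exact result up by one.
        return Math.max(1, (int) Math.ceil(needed - 1e-9));
    }

    public static void main(String[] args) {
        System.out.println("no headroom:   " + minPartitions(2_000, 0.02, 1.0));
        System.out.println("1.5x headroom: " + minPartitions(2_000, 0.02, 1.5));
    }
}
```

The result is only a throughput floor. Ordering boundaries, key skew, and future growth still have to be weighed before settling on a number, since partition counts are awkward to change later.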
12.3 Producer Batching
For outbox event publication:
- Batch publishing increases throughput.
- Batches that are too large raise latency and the retry blast radius.
- Use a bounded batch size.
- Persist status per event.
- Do not delete an outbox row before the event has actually been acknowledged.
12.4 Lag as Business Risk
Kafka lag is not just a technical number. For CPQ/OMS:
| Lagging Consumer | Business Impact |
|---|---|
| order-orchestrator lag | Order captured but not yet processed for fulfillment. |
| approval-escalation lag | Approval SLA is missed. |
| search-projection lag | UI shows stale data. |
| billing-integration lag | Revenue recognition is delayed. |
| audit-export lag | Compliance reporting is delayed. |
Dashboards must translate lag into business impact.
13. Camunda 7 Performance Engineering
Camunda 7 is very capable for workflow, but it can become a bottleneck if processes are designed as the place to keep all data and do all work.
13.1 Camunda Performance Principles
- A process instance stores orchestration state, not the full business aggregate.
- Variables must be small and versioned.
- Heavy business logic stays in the domain services.
- Async continuations are used to bound transactions and retries.
- History level must match audit and storage needs.
- Incidents must be queryable and repairable.
- Large numbers of timers need capacity planning.
13.2 Job Executor Capacity
Estimate:
Orders/sec = 10
Average jobs/order = 30
Workflow jobs/sec = 300
Average job execution = 50 ms
Concurrency needed = 300 * 0.05 = 15 workers
Add margin for retries, timers, incident recovery, and the peak factor.
If job execution calls an external service that takes 1 second on average:
Concurrency needed = 300 * 1.0 = 300 workers
This is a signal that the external call should move to an async integration worker or the external task pattern with a limiter, rather than an uncontrolled blocking Java delegate.
13.3 Camunda Variables
Bad:
{
"fullQuote": { "lines": [ /* 500 huge lines */ ] },
"fullOrder": { "lines": [ /* huge */ ] },
"pricingTrace": { "rules": [ /* huge */ ] }
}
Better:
{
"tenantId": "t-001",
"orderId": "9a2a...",
"orderVersion": 7,
"processSchemaVersion": "order-orchestration.v3",
"currentPhase": "FULFILLMENT"
}
The process engine can fetch a snapshot from the Order Service when it needs details.
13.4 History Table Growth
A workflow platform with thousands of orders/day will produce many history rows. Plan for:
- history level.
- cleanup policy.
- indexes for operational queries.
- retention per process definition.
- audit export if regulatory retention is longer than operational retention.
- partitioning/archival if volume is large.
14. Redis Performance Engineering
Redis must speed up the runtime, not become the source of financial truth.
14.1 Cache Key Design
catalog:{tenantId}:{catalogVersion}:offer:{offerId}
pricing-rule:{tenantId}:{priceBookVersion}:{segment}:{region}
config-session:{tenantId}:{sessionId}
idempotency:{tenantId}:{operation}:{key}
Keys must include a version if the data can change through publish/versioning. This reduces complex invalidation.
14.2 Hot Key
A hot key appears when all requests read the same key, for example:
pricing-rule:default
Mitigations:
- versioned and segmented keys.
- local in-memory near-cache for very stable data.
- request coalescing.
- TTL jitter.
- split large objects.
14.3 Stampede Prevention
Pattern:
Cache Miss -> Try short lock -> One loader hits DB -> Others wait/fallback -> Populate cache with TTL jitter
However, distributed locks need care. For pricing correctness, it is safer to make the cache rebuild idempotent and allow a few duplicate rebuilds than to lock the main path for too long.
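One way to get most of the stampede protection without a distributed lock is to coalesce concurrent loads inside each JVM and add jitter to the TTL. The sketch below is an illustration under those assumptions, not a Redis client feature: `SingleFlightLoader` is a hypothetical class, the `loader` function stands in for the DB read plus cache write, and each service instance may still rebuild the same key once, which is the tolerated duplicate rebuild described above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

/** Hypothetical in-process request coalescing: one loader call per key at a time, per JVM. */
final class SingleFlightLoader<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SingleFlightLoader(Function<K, V> loader) {
        this.loader = loader;
    }

    V load(K key) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Follower: wait for the leader's result instead of hitting the DB again.
            return existing.join();
        }
        try {
            // Leader: the loader should be idempotent and enforce its own timeout.
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);
            throw e;
        } finally {
            inFlight.remove(key, mine);
        }
    }

    /** Base TTL plus a random extra of up to jitterRatio * base, so keys do not expire together. */
    static long jitteredTtlSeconds(long baseSeconds, double jitterRatio) {
        if (baseSeconds <= 0 || jitterRatio < 0) {
            throw new IllegalArgumentException("invalid TTL inputs");
        }
        long spread = (long) (baseSeconds * jitterRatio);
        return baseSeconds + ThreadLocalRandom.current().nextLong(spread + 1);
    }

    public static void main(String[] args) {
        SingleFlightLoader<String, String> rules =
            new SingleFlightLoader<>(key -> "rules-for-" + key); // stand-in for a DB read
        System.out.println(rules.load("pricing-rule:t-001:pb-7:smb:eu"));
        System.out.println("ttl seconds: " + jitteredTtlSeconds(300, 0.1));
    }
}
```

Because the coalescing map only tracks loads that are in progress, it is not a cache and cannot serve stale prices. Correctness still comes from the versioned key and the idempotent rebuild.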
15. Capacity Model Worksheet
Use this worksheet before a performance test.
15.1 Business Demand
| Input | Value |
|---|---|
| Active tenants | |
| Peak users | |
| Quote edits/hour | |
| Average quote lines | |
| Price recalculation/edit | |
| Quote submissions/hour | |
| Quote acceptance/hour | |
| Average order lines | |
| Workflow steps/order line | |
| Event subscribers | |
15.2 Derived Load
pricing_line_calc_per_sec = quote_edits_per_hour * avg_lines * recalculation_per_edit / 3600
order_line_created_per_sec = quote_acceptance_per_hour * avg_order_lines / 3600
workflow_jobs_per_sec = order_line_created_per_sec * workflow_steps_per_line
event_processing_per_sec = domain_events_per_sec * subscribers
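The four formulas can be kept next to the worksheet as a small calculation so the derived numbers are reproducible. This is a sketch: `DerivedLoad` and the `Demand` record are hypothetical names, the fields mirror the Business Demand inputs, and `domainEventsPerSec` is treated as a direct input because the worksheet does not define how it is derived.

```java
/** Hypothetical worksheet calculation mirroring the derived-load formulas. */
final class DerivedLoad {

    record Demand(
        double quoteEditsPerHour,
        double avgQuoteLines,
        double recalculationPerEdit,
        double quoteAcceptancePerHour,
        double avgOrderLines,
        double workflowStepsPerLine,
        double domainEventsPerSec, // assumed to be measured or estimated separately
        int subscribers) {}

    static double pricingLineCalcPerSec(Demand d) {
        return d.quoteEditsPerHour() * d.avgQuoteLines() * d.recalculationPerEdit() / 3600.0;
    }

    static double orderLineCreatedPerSec(Demand d) {
        return d.quoteAcceptancePerHour() * d.avgOrderLines() / 3600.0;
    }

    static double workflowJobsPerSec(Demand d) {
        return orderLineCreatedPerSec(d) * d.workflowStepsPerLine();
    }

    static double eventProcessingPerSec(Demand d) {
        return d.domainEventsPerSec() * d.subscribers();
    }

    public static void main(String[] args) {
        // Illustrative inputs only; replace them with the filled-in worksheet values.
        Demand peak = new Demand(8_000, 25, 2.5, 3_600, 10, 30, 100, 3);
        System.out.printf("pricing_line_calc_per_sec  = %.1f%n", pricingLineCalcPerSec(peak));
        System.out.printf("order_line_created_per_sec = %.1f%n", orderLineCreatedPerSec(peak));
        System.out.printf("workflow_jobs_per_sec      = %.1f%n", workflowJobsPerSec(peak));
        System.out.printf("event_processing_per_sec   = %.1f%n", eventProcessingPerSec(peak));
    }
}
```

These are hourly averages converted to per-second rates, so a peak factor still has to be applied before comparing them with measured capacity.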
15.3 Resource Estimate
| Resource | Estimate Basis |
|---|---|
| API replicas | request/sec * service time / target utilization |
| DB connections | concurrent DB work, not API threads |
| Kafka partitions | ordering boundary + throughput + future growth |
| Camunda workers | jobs/sec * job duration * peak factor |
| Redis memory | key count * average value size * overhead * growth factor |
| Storage | quote/order/history/outbox/audit retention |
16. Load Testing Strategy
16.1 Test Types
| Test | Purpose |
|---|---|
| Smoke test | Proves the basic flow works. |
| Baseline load test | Measures normal performance. |
| Stress test | Finds the saturation point. |
| Spike test | Tests sudden surges. |
| Soak test | Finds leaks, table growth, backlog. |
| Scalability test | Proves replica/partition/worker scaling. |
| Failure-injection test | Tests timeouts, retries, recovery. |
16.2 Scenario Mix
Example production-like mix:
| Scenario | Weight |
|---|---|
| Catalog search | 35% |
| Configuration validate | 20% |
| Price recalculation | 20% |
| Save quote draft | 10% |
| Submit quote | 5% |
| Accept quote/order capture | 3% |
| Order status read | 5% |
| Admin/repair/read audit | 2% |
If we only test accept quote, we do not see cache/read pressure. If we only test read endpoints, we do not see locks, workflow, outbox, and Kafka lag.
16.3 Test Data Must Be Realistic
Bad test data:
- 1 tenant.
- 10 products.
- 1 quote line.
- all rules are simple.
- all requests use the same user.
- empty database.
Better:
- many tenants of different sizes.
- a large catalog with versioning.
- varied quote line counts.
- discount/approval edge cases.
- order line dependency graph.
- existing history/audit/outbox data.
- skewed tenant traffic to test noisy neighbors.
17. Performance Debugging Playbook
17.1 API Latency Naik
Steps:
- Check p50/p95/p99 per endpoint.
- Check error rate and timeouts.
- Check thread pool saturation.
- Check DB pool wait time.
- Check the trace breakdown.
- Check slow queries.
- Check Redis latency/cache misses.
- Check external dependencies.
- Check recent deployments.
17.2 Database CPU Tinggi
Steps:
- Check top queries by total time.
- Run `EXPLAIN (ANALYZE, BUFFERS)` on staging or a prod-safe replica.
- Check for missing indexes/statistics.
- Check estimated vs actual row counts.
- Check lock waits.
- Check autovacuum/bloat.
- Check for new queries from the latest release.
- Add indexes concurrently if needed.
17.3 Kafka Lag Naik
Steps:
- Identify the consumer group and topic.
- Check partition skew.
- Check processing latency per event type.
- Check for poison events/retry storms.
- Check downstream dependencies.
- Scale consumers if there are enough partitions.
- Pause/resume or route to the DLT if needed.
- Run replay/reconciliation once things are stable.
17.4 Camunda Incident Flood
Steps:
- Group incidents by process definition/version/activity.
- Classify errors as transient, permanent, or business.
- Check job retry configuration.
- Check external dependencies.
- Check for variable schema mismatches.
- Stop the retry storm if a dependency is down.
- Deploy a fix or repair the data.
- Retry incidents in controlled batches.
18. Observability Metrics for Performance
18.1 API
- request rate per endpoint.
- p50/p95/p99 latency.
- error rate by status/error code.
- request body size.
- response size.
- thread pool active/queued.
- DB pool active/wait.
18.2 PostgreSQL
- query total time.
- query mean/p95 time.
- lock wait.
- deadlock count.
- connection count.
- transaction duration.
- table/index hit ratio.
- autovacuum activity.
- table/index size growth.
18.3 Kafka
- producer latency.
- outbox age.
- publish rate.
- consumer lag.
- consumer processing latency.
- retry topic rate.
- DLT rate.
- rebalance count.
18.4 Camunda
- active process instances.
- jobs acquired/sec.
- job execution latency.
- failed jobs.
- incident count.
- timer due count.
- process duration by definition.
- stuck activity count.
18.5 Redis
- command latency.
- hit/miss ratio.
- memory usage.
- evicted keys.
- expired keys.
- hot key indicators.
- connection count.
- timeout count.
19. Common Bottlenecks and Fixes
| Symptom | Likely Cause | Better Fix |
|---|---|---|
| Quote list slow | Over-joined query | Summary projection + proper index. |
| Pricing p99 high | Rule graph loaded repeatedly | Versioned rule cache + precompiled rule graph. |
| Submit quote stalls | Approval evaluation too heavy | Split signal calculation and policy decision. |
| Accept quote duplicate order | Missing idempotency | Unique idempotency key + state guard. |
| Kafka lag on one partition | Bad key/hot aggregate | Revisit partition key or split topic. |
| Camunda DB grows fast | Huge variables/history | Reduce variable payload + cleanup policy. |
| Redis memory spikes | Unbounded session/cache | TTL + max object size + eviction policy. |
| DB pool exhausted | Thread pool too high/slow query | Tune query and align pools. |
| CPU high in Java | Serialization/allocation | Profile JFR, reduce payload/copying. |
20. Performance Anti-Patterns
- Caching final price without price book version.
- Using Camunda variables as document store.
- Making Kafka partition key equal to tenant for all events.
- Running PDF generation inside quote submit transaction.
- Loading full catalog for every configuration validation.
- Adding indexes blindly without query evidence.
- Increasing thread pool to hide DB latency.
- Treating p50 as enough while p99 is broken.
- Running load test with toy data.
- Ignoring repair/admin workflows in capacity planning.
- Using Redis distributed lock as correctness boundary.
- Using E2E tests as the only performance signal.
21. Implementation Lab
Build a simple performance lab:
- Create a dataset:
  - 10 tenants.
  - 100k offers.
  - 1M quote rows.
  - 5M quote lines.
  - 100k order rows.
  - 2M order lines.
- Create scenarios:
  - catalog search.
  - validate configuration with 10/50/200 lines.
  - price recalculation.
  - submit quote.
  - accept quote.
  - process order workflow jobs.
- Measure:
  - API p50/p95/p99.
  - DB slow queries.
  - Kafka lag.
  - Camunda job latency.
  - Redis hit ratio.
  - CPU/memory/GC.
- Apply one optimization:
  - add a projection.
  - add an index.
  - change the cache key.
  - batch a MyBatis query.
  - shrink Camunda variables.
- Write a before/after report.
22. Production Readiness Checklist
Before the platform is considered performance-ready:
- There is a written workload model.
- There is an SLO per use case.
- There is a latency budget for each hot path.
- There are capacity units for pricing/config/order/workflow/event.
- There is a baseline load test.
- There is at least a minimal soak test for memory/table/backlog growth.
- There are query plans for critical queries.
- There is an index review.
- DB pool and thread pool are aligned.
- There is a Kafka lag dashboard.
- There is an outbox age dashboard.
- There is a Camunda incident/job dashboard.
- There is a Redis hit/miss/memory dashboard.
- There is a JFR/profiling runbook.
- There is a stress limit and a graceful degradation policy.
- There is a capacity review before each major release.
23. Reference Baseline
- PostgreSQL documentation: `EXPLAIN`, `EXPLAIN ANALYZE`, indexes, and monitoring database activity.
- Oracle JDK documentation: Java Flight Recorder and Java SE 25 troubleshooting/profiling material.
- Apache Kafka documentation: producer/consumer configuration, topic partitioning, and operational concepts.
- Redis documentation: latency, memory, data structures, caching, and production operation.
- Camunda 7 documentation: job executor, incidents, history, and process engine operation.
- OpenTelemetry Java documentation: metrics, logs, traces, instrumentation, and runtime telemetry.
24. Summary
Performance engineering for CPQ/OMS must start from the workload model and SLOs, not from tuning parameters. This platform has many hot paths: catalog search, configuration validation, pricing calculation, quote submission, order capture, workflow job execution, event publishing, and event consumption.
The key to top-tier engineering is the ability to connect technical numbers to business risk:
- API latency means sales users are waiting.
- Pricing p99 means quote edits feel heavy.
- Kafka lag means downstream views/integrations are stale.
- Camunda backlog means orders are not moving.
- DB locks mean state transitions are competing for a resource.
- A Redis hot key means the cache itself has become the bottleneck.
Sound optimization is always evidence-based: measure, understand the bottleneck, change one thing, validate, then document the runbook.