Performance Engineering for ERP Workloads

Learn Java Large Scale ERP - Part 025

Performance engineering for large-scale ERP workloads in Java, including workload modelling, measurement, batch design, posting performance, read models, JVM/DB tuning, and operational performance guardrails.

Part 025 — Performance Engineering for ERP Workloads

Large-scale ERP performance is not only about making requests faster. It is about preserving business correctness while the system absorbs seasonal peaks, month-end close, bulk posting, MRP planning, report bursts, inventory reconciliation, integrations, imports, and human users working at the same time.

A small CRUD application can define performance as p95 API latency < 300ms. An ERP platform needs a wider model:

  • Can the system post millions of accounting lines without breaking double-entry invariants?
  • Can inventory allocation remain correct under peak order volume?
  • Can month-end close finish inside the business window?
  • Can reporting run without starving operational transactions?
  • Can one tenant, branch, company, job, or integration partner avoid exhausting shared capacity?
  • Can support teams see exactly where time is spent when the system slows down?

Performance engineering for ERP is therefore a control discipline, not a tuning exercise.


1. Kaufman Framing: Deconstruct the Performance Skill

Following Kaufman's method, do not learn "performance" as a vague expert topic. Deconstruct it into smaller skills that can be practiced deliberately.

| Sub-skill | What You Must Be Able To Do | ERP Example |
| --- | --- | --- |
| Workload modelling | Classify traffic by shape, criticality, and contention | Separate POS order entry from month-end depreciation run |
| Capacity modelling | Estimate CPU, heap, DB I/O, locks, queue depth, connection usage | Forecast GL posting throughput for 10M journal lines |
| Measurement | Build repeatable benchmark scenarios and baselines | Run P2P happy path, invoice import, close, and report burst tests |
| Profiling | Find CPU, allocation, blocking, I/O, and DB hotspots | Detect one allocation-heavy mapper in posting pipeline |
| Data access optimization | Design query, index, pagination, fetch, projection, and partition strategy | Avoid loading full PO aggregate just to show approval inbox |
| Batch engineering | Chunk, checkpoint, retry, partition, and resume large jobs | Resume depreciation job after 7M of 12M assets |
| Contention control | Reduce lock duration and isolate hot rows | Avoid one stock balance row becoming global bottleneck |
| Backpressure | Protect the system from overload and retry storms | Stop invoice import from consuming all DB connections |
| Observability | Expose business-level and technical-level performance signals | Alert when posting lag exceeds SLA by legal entity |
| Regression control | Prevent performance degradation from entering production | Benchmark stock reservation and GL posting in CI/nightly |

The goal of this part is to give you the mental model and review discipline to reason about ERP performance before it fails in production.


2. The ERP Performance Mental Model

Think of ERP performance as a set of interacting queues and bottlenecks.

When ERP slows down, the root cause is rarely one function. It is usually one of these:

  1. A hot business object: one stock item, one warehouse, one fiscal period, one numbering sequence, one bank account, one approval queue.
  2. A hot technical resource: DB connection pool, table lock, index page, message partition, CPU core, heap, file system, network hop.
  3. An unbounded query or job: export all invoices, recalculate all stock balances, refresh all read models.
  4. A broken feedback loop: retry storm, duplicate import, lock timeout retry, report retry, integration replay.
  5. A mixed workload collision: users, posting, reporting, imports, and reconciliation sharing the same capacity without admission control.

The advanced engineer asks:

Which business flow owns this load, which invariant must survive, which queue is growing, which resource is saturated, and what is the safe degradation mode?


3. ERP Workload Taxonomy

Do not optimize everything the same way. Classify workload first.

3.1 Interactive transactional workload

Examples:

  • Create purchase order.
  • Submit approval.
  • Reserve stock.
  • Post invoice.
  • Confirm goods receipt.
  • Release sales order.

Characteristics:

  • User-facing latency matters.
  • Correctness and consistency are strict.
  • Transactions should be short.
  • Locks must be held briefly.
  • Timeouts must produce clear user outcomes.

Design principle:

Interactive ERP requests should validate, decide, persist, enqueue durable follow-up work, and return. They should not run unbounded recalculation or reporting inside the same request.

3.2 Bulk operational workload

Examples:

  • Import supplier invoices.
  • Generate payment proposals.
  • Allocate thousands of orders.
  • Revalue inventory.
  • Run depreciation.
  • Recalculate material requirements.

Characteristics:

  • Throughput matters more than single-item latency.
  • Restartability matters.
  • Progress visibility matters.
  • Partial failure is normal.
  • Checkpointing is mandatory.

Design principle:

Bulk jobs must be chunked, idempotent, restartable, observable, and rate-limited.

3.3 Financial close workload

Examples:

  • Close AP/AR periods.
  • Post accruals.
  • Run depreciation.
  • Reconcile subledgers.
  • Generate trial balance.
  • Lock fiscal period.

Characteristics:

  • Happens in constrained windows.
  • High business pressure.
  • Cannot silently skip records.
  • Requires audit evidence.
  • Often competes with reports and corrections.

Design principle:

Close performance must be engineered from the start. It is not acceptable to discover at go-live that closing takes 39 hours.

3.4 Reporting workload

Examples:

  • Sales dashboard.
  • Inventory aging.
  • Trial balance.
  • Customer aging.
  • Export invoices.
  • Compliance report.

Characteristics:

  • Read-heavy.
  • Can be very expensive.
  • Freshness requirements vary.
  • Security filters are expensive if added late.
  • Often spikes during close and audits.

Design principle:

Reports should run on certified read models, not on arbitrary joins over the transactional core.

3.5 Integration workload

Examples:

  • Bank statement import.
  • EDI purchase order exchange.
  • WMS shipment update.
  • MES production confirmation.
  • Tax engine request.
  • CRM customer sync.

Characteristics:

  • Bursty.
  • Partner-dependent.
  • Retry-heavy.
  • Duplicate-prone.
  • Requires reconciliation.

Design principle:

Integration throughput must be governed by business capacity and downstream safety, not by how fast messages can be consumed.


4. Performance Requirements as Contracts

ERP performance requirements should be written as contracts, not wishes.

Weak requirement:

Posting should be fast.

Strong requirement:

The GL posting pipeline must post 5 million validated journal lines within 45 minutes for one legal entity, while preserving the balanced-journal invariant, producing replay-safe posting evidence, and keeping interactive approval p95 below 800ms.

Use this template:

For <business scenario>,
under <data volume and concurrency>,
the system must achieve <latency/throughput/completion window>,
while preserving <business invariant>,
without exceeding <resource or operational limit>,
and exposing <observable evidence>.

Examples:

| Scenario | Requirement |
| --- | --- |
| Stock reservation | Reserve 1,000 order lines/minute for one warehouse with no oversell and p95 < 500ms |
| Invoice import | Process 500k supplier invoices in 4 hours with duplicate detection and resumable checkpoints |
| Trial balance | Generate trial balance for one legal entity and fiscal period in < 30s from certified balances |
| MRP run | Complete daily netting for 2M demand/supply rows within 2 hours using planning snapshot |
| Approval inbox | Load approver inbox p95 < 300ms for 10k pending tasks with security scope applied |
| Month-end close | Complete depreciation, accrual, subledger reconciliation, and period lock inside 6-hour window |

5. The ERP Performance Architecture

A high-scale ERP should separate performance-sensitive responsibilities.

Key boundaries:

| Boundary | Purpose |
| --- | --- |
| Command model | Preserve write invariants and short transactions |
| Worker model | Process durable follow-up work with retry/checkpoint |
| Read model | Serve expensive read/report queries without harming writes |
| Batch model | Process large volumes with partitioning and restartability |
| Observability model | Make bottlenecks and business lag visible |
| Capacity model | Prevent one workload from consuming all resources |

6. Measurement Before Optimization

Performance work without measurement becomes superstition.

6.1 Build a scenario catalog

Your ERP performance suite should include named scenarios:

| Scenario ID | Name | Shape | Critical Metrics |
| --- | --- | --- | --- |
| PERF-O2C-001 | Sales order creation burst | Interactive write | p95 latency, DB time, validation time |
| PERF-O2C-002 | Stock reservation hotspot | Contended write | oversell count, lock wait, retry count |
| PERF-P2P-001 | Supplier invoice import | Bulk write | rows/minute, duplicate rate, checkpoint lag |
| PERF-GL-001 | Journal posting batch | Batch write | lines/sec, balanced failure count, DB commit time |
| PERF-INV-001 | Inventory balance projection | Projection | projection lag, rebuild time |
| PERF-RPT-001 | Trial balance report | Read/report | runtime, scanned rows, cache hit, freshness |
| PERF-MFG-001 | MRP planning run | Planning batch | completion time, memory, temp storage |
| PERF-CLOSE-001 | Month-end close rehearsal | Mixed workload | close window, queue backlog, error count |

6.2 Use layered telemetry

You need both technical and business metrics.

| Layer | Metrics |
| --- | --- |
| User/API | request rate, p50/p95/p99 latency, error rate, timeout rate |
| Domain | orders created, invoices posted, journal lines posted, stock reservations/sec |
| Queue | lag, depth, age of oldest message, retry count, DLQ count |
| Batch | processed rows, failed rows, checkpoint age, chunk duration |
| JVM | CPU, heap, GC pause, allocation rate, thread state, virtual thread pinning when applicable |
| Database | query time, locks, deadlocks, buffer hit ratio, connection usage, row scans, index scans |
| Cache | hit ratio, stale reads, evictions, stampede count |
| Reporting | refresh time, query runtime, export size, report queue depth |

6.3 Use profiling tools intentionally

For Java systems, Java Flight Recorder is a practical production-grade profiling tool. Use it to observe CPU, allocation, lock, I/O, and thread behavior during representative ERP workloads, not synthetic toy functions.

A good profiling session asks:

  • Which business scenario is running?
  • Which tenant/legal entity/company is included?
  • What data volume is realistic?
  • Is the database state realistic?
  • Is caching warm or cold?
  • Are reports/integration/jobs running concurrently?
  • Which version of the application and schema is being measured?

6.4 Do not trust averages

Average latency hides ERP pain. Use percentiles and completion windows.

Bad:

Average invoice post time: 120ms

Better:

Invoice post latency:
- p50: 110ms
- p95: 420ms
- p99: 2.8s
- timeout rate: 0.3%
- lock wait p99: 1.9s
- duplicate idempotency rejection: 2,341/hour

The p99 value often tells the real story: lock contention, GC pauses, cold query plans, disk stalls, queue bursts, or downstream slowdowns.
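To make the contrast between an average and its percentiles concrete, here is a minimal sketch of a nearest-rank percentile calculation. The class name LatencySummary is hypothetical and not part of any library; a production system would normally rely on histogram support in its metrics library rather than sorting raw samples.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over a set of latency samples in milliseconds (illustrative helper). */
final class LatencySummary {
    private final long[] sorted;

    LatencySummary(long[] samplesMs) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        this.sorted = samplesMs.clone();
        Arrays.sort(this.sorted);
    }

    /** Arithmetic mean, shown only to contrast with the percentiles. */
    double mean() {
        return Arrays.stream(sorted).average().orElseThrow();
    }

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    long percentile(double p) {
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }
}
```

With 97 samples at 100ms and 3 samples at 3,000ms, the mean is 187ms and p95 is still 100ms, while p99 is 3,000ms. Only the tail percentile exposes the requests that users actually complain about.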


7. Performance Budgets

A performance budget is a design constraint.

Example for POST /sales-orders/{id}/reserve:

| Component | Budget |
| --- | --- |
| Authentication/authorization | 20ms |
| Load order summary | 40ms |
| Validate reservable lines | 60ms |
| Lock reservation keys | 80ms |
| Persist reservation ledger | 100ms |
| Publish outbox | 30ms |
| Serialization/network | 30ms |
| Buffer | 140ms |
| Total p95 target | 500ms |

Use budgets during design review:

  • Which step can exceed its budget?
  • Which dependency has unbounded latency?
  • Which query has unbounded cardinality?
  • Which lock can wait indefinitely?
  • Which part should move to asynchronous processing?

A performance budget prevents accidental design drift.
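A budget is easier to defend in review when the arithmetic is executable. The sketch below is one possible shape, with a hypothetical LatencyBudget class: it sums component budgets, reports the remaining buffer under the p95 target, and lists components whose measured time exceeds their allowance.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A p95 latency target split into per-component budgets, all in milliseconds (illustrative). */
final class LatencyBudget {
    private final long targetMs;
    private final Map<String, Long> budgetsMs = new LinkedHashMap<>();

    LatencyBudget(long targetMs) {
        this.targetMs = targetMs;
    }

    LatencyBudget component(String name, long budgetMs) {
        budgetsMs.put(name, budgetMs);
        return this;
    }

    /** Sum of all component budgets. */
    long allocatedMs() {
        return budgetsMs.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Headroom left under the target; a negative value means the design is over-committed. */
    long bufferMs() {
        return targetMs - allocatedMs();
    }

    /** Names of components whose measured time exceeds their budget, in declaration order. */
    List<String> overBudget(Map<String, Long> measuredMs) {
        List<String> offenders = new ArrayList<>();
        for (Map.Entry<String, Long> entry : budgetsMs.entrySet()) {
            Long measured = measuredMs.get(entry.getKey());
            if (measured != null && measured > entry.getValue()) {
                offenders.add(entry.getKey());
            }
        }
        return offenders;
    }
}
```

For the reservation example, the seven components add up to 360ms, which leaves the 140ms buffer under the 500ms target. Feeding measured timings from a load test into overBudget gives the review a concrete list of steps to discuss.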


8. JVM-Level Performance for ERP

Java performance in ERP is usually shaped by allocation, blocking, serialization, DB calls, and object graph size more than pure CPU arithmetic.

8.1 Control object graph size

ERP aggregates can become large:

  • Sales order with thousands of lines.
  • Purchase order with receipts, invoices, approval history, attachments.
  • Item with price lists, tax rules, UOM conversions, stock balances.
  • Manufacturing order with BOM explosion and routing steps.

Do not load a full aggregate when the use case only needs a projection.

Bad:

PurchaseOrder po = purchaseOrderRepository.findById(id).orElseThrow();
return approvalInboxMapper.toDto(po); // loads lines, receipts, invoices, attachments, history

Better:

ApprovalInboxRow row = approvalInboxQuery.findPurchaseOrderApprovalRow(id, currentApprover);
return ApprovalInboxDto.from(row);

Rule:

Command use cases load what they need to preserve invariants. Query use cases load projections shaped for the screen/report.

8.2 Watch allocation rate

High allocation rate causes GC pressure, even when heap is large. Common ERP allocation sources:

  • Mapping huge entity graphs to DTOs.
  • Parsing large CSV/Excel/XML/JSON files.
  • Creating BigDecimal repeatedly in tight loops.
  • Building massive in-memory lists before batch insert.
  • Rendering large exports in memory.
  • Excessive string concatenation in logs.
  • JSON serialization of full object graphs.

Practice:

  • Stream large files.
  • Process in chunks.
  • Reuse immutable reference data carefully.
  • Avoid building full export in memory.
  • Use projections instead of full entities.
  • Profile allocation with JFR.
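The "stream large files" and "process in chunks" practices can be sketched as follows. ChunkedLineReader is a hypothetical helper, not an existing library class; it assumes a line-oriented input such as CSV and keeps at most one chunk of lines in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Streams a line-oriented source and hands lines to a handler in bounded chunks (illustrative). */
final class ChunkedLineReader {
    private ChunkedLineReader() {
    }

    /** Returns the total number of lines read; only one chunk is buffered at any time. */
    static long forEachChunk(Reader source, int chunkSize, Consumer<List<String>> handler) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long total = 0;
        List<String> chunk = new ArrayList<>(chunkSize);
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                total++;
                if (chunk.size() == chunkSize) {
                    handler.accept(List.copyOf(chunk)); // immutable snapshot for the handler
                    chunk.clear();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!chunk.isEmpty()) {
            handler.accept(List.copyOf(chunk)); // final partial chunk
        }
        return total;
    }
}
```

The handler would typically parse and persist one chunk per transaction. Memory use then depends on the chunk size rather than on the file size.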

8.3 Thread pools and virtual threads

Modern Java gives more options, including virtual threads, but ERP engineers must still reason about bottlenecks.

Virtual threads can help when workload is mostly blocking I/O and the downstream capacity is controlled. They do not make the database, broker, tax engine, or connection pool infinite.

Checklist before using virtual threads in ERP services:

  • Is the DB connection pool sized intentionally?
  • Does the service enforce timeouts?
  • Are downstream calls bulkheaded?
  • Is backpressure applied at queue/API boundaries?
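  • Are long synchronized regions avoided? (On JDK 21 to 23, blocking inside a synchronized block pins the virtual thread to its carrier; JDK 24 removed most of this pinning.)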
  • Are metrics separated for request concurrency and downstream concurrency?

Core principle:

More threads increase concurrency pressure. They do not create more database throughput.

8.4 Connection pool as a safety valve

A DB connection pool is not only a performance tool. It is also a blast-radius limiter.

Anti-pattern:

Increase max pool size because requests are waiting.

Better diagnosis:

  • Are queries slow?
  • Are transactions too long?
  • Are locks causing waits?
  • Are jobs consuming all connections?
  • Are reports sharing OLTP connections?
  • Are retries multiplying demand?

Use separate pools or even separate services for different workload classes when needed:

| Workload | Pool Strategy |
| --- | --- |
| Interactive API | Small/medium, strict timeout, high priority |
| Posting workers | Bounded, controlled throughput |
| Batch jobs | Separate pool, rate-limited |
| Reports | Separate read replica/read model connection |
| Integration import | Separate bounded pool and queue |

9. Database Performance for ERP

ERP performance is often database performance wearing an application costume.

9.1 Index for business access paths

Indexes should match real questions:

  • Pending approvals by approver and status.
  • Open invoices by vendor and due date.
  • Inventory balance by item, warehouse, lot, and bin.
  • Journal lines by legal entity, account, fiscal period.
  • Stock movements by item and posting time.
  • Outbox events by status and creation time.
  • Job records by status and next attempt time.

Example:

CREATE INDEX idx_journal_line_period_account
ON journal_line (legal_entity_id, fiscal_year, fiscal_period, account_id);

CREATE INDEX idx_approval_task_assignee_status_due
ON approval_task (assignee_id, status, due_at);

CREATE INDEX idx_outbox_ready
ON outbox_event (status, next_attempt_at, created_at)
WHERE status IN ('READY', 'RETRY');

9.2 Avoid unbounded cardinality

Bad API:

GET /invoices?vendorId=V001

Better API:

GET /invoices?vendorId=V001&status=OPEN&dueBefore=2026-07-31&pageSize=100&cursor=...

Every ERP query should have a cardinality story:

  • What is the maximum expected row count?
  • What is the index path?
  • What is the pagination strategy?
  • Is the query stable under concurrent writes?
  • Is the result complete enough for the business action?

9.3 Prefer keyset pagination for large operational lists

Offset pagination degrades on large tables because the database still has to walk skipped rows.

Better pattern:

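-- The cursor carries both ORDER BY columns, so rows that share a submitted_at value are not skipped.
SELECT id, document_no, status, submitted_at
FROM purchase_order
WHERE tenant_id = :tenantId
  AND status = 'PENDING_APPROVAL'
  AND (submitted_at, id) < (:lastSeenSubmittedAt, :lastSeenId)
ORDER BY submitted_at DESC, id DESC
LIMIT 100;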
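A keyset cursor has to carry every column of the sort order, otherwise rows that share a timestamp with the last row of a page can be skipped. The in-memory sketch below illustrates the comparison with hypothetical types (PendingOrder, PageCursor, KeysetPager); it stands in for the database query and assumes Java 17 or later.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

record PendingOrder(long id, Instant submittedAt) {}

/** Position of the last row already returned to the client. */
record PageCursor(Instant submittedAt, long id) {}

final class KeysetPager {
    // Newest first; id breaks ties so the ordering is total and stable.
    private static final Comparator<PendingOrder> NEWEST_FIRST =
            Comparator.comparing(PendingOrder::submittedAt)
                    .thenComparingLong(PendingOrder::id)
                    .reversed();

    private KeysetPager() {
    }

    /** Returns the next page after the cursor, or the first page when the cursor is null. */
    static List<PendingOrder> nextPage(List<PendingOrder> all, PageCursor after, int pageSize) {
        return all.stream()
                .filter(order -> after == null || comesAfter(order, after))
                .sorted(NEWEST_FIRST)
                .limit(pageSize)
                .toList();
    }

    // Same meaning as the SQL row comparison (submitted_at, id) < (:lastSubmittedAt, :lastId).
    private static boolean comesAfter(PendingOrder order, PageCursor cursor) {
        int byTime = order.submittedAt().compareTo(cursor.submittedAt());
        return byTime < 0 || (byTime == 0 && order.id() < cursor.id());
    }
}
```

If three orders share one submission time and a page boundary falls between them, a cursor built from the timestamp alone would drop the remaining two. Comparing on the pair returns them on the next page.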

9.4 Keep write transactions small

A transaction that validates, computes, posts, exports, emails, and calls external systems is a production incident waiting to happen.

Command transaction should usually:

  1. Validate preconditions.
  2. Lock only required rows.
  3. Persist state change.
  4. Insert outbox event.
  5. Commit.

External calls and expensive projections happen after commit.

9.5 Use precomputed balances carefully

Financial and inventory systems often maintain balances for speed.

Design options:

| Option | Pros | Cons |
| --- | --- | --- |
| Calculate from ledger every time | Simple truth model | Slow at scale |
| Maintain balance table transactionally | Fast reads | Hot rows and reconciliation needed |
| Maintain asynchronous projection | Decoupled and scalable | Freshness lag and rebuild complexity |
| Materialized view | Useful for reporting | Refresh strategy needed |

ERP rule:

Ledger is the source of truth; balances are derived and must be reconcilable.
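"Reconcilable" can be expressed as a check that recomputes a balance from ledger entries and compares it with the stored value. The following is a minimal sketch with hypothetical types (LedgerEntry, BalanceReconciler); a real reconciliation would run as a set-based query per account and period rather than over an in-memory list.

```java
import java.math.BigDecimal;
import java.util.List;

record LedgerEntry(String accountId, BigDecimal debit, BigDecimal credit) {}

final class BalanceReconciler {
    private BalanceReconciler() {
    }

    /** Recomputes the balance (debits minus credits) for one account from the ledger. */
    static BigDecimal balanceFromLedger(String accountId, List<LedgerEntry> ledger) {
        return ledger.stream()
                .filter(entry -> entry.accountId().equals(accountId))
                .map(entry -> entry.debit().subtract(entry.credit()))
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    /** Stored balance minus recomputed balance; zero means the derived value is reconciled. */
    static BigDecimal drift(String accountId, BigDecimal storedBalance, List<LedgerEntry> ledger) {
        return storedBalance.subtract(balanceFromLedger(accountId, ledger));
    }

    static boolean isReconciled(String accountId, BigDecimal storedBalance, List<LedgerEntry> ledger) {
        // signum() ignores scale, so 69.5 and 69.50 are treated as equal.
        return drift(accountId, storedBalance, ledger).signum() == 0;
    }
}
```

A non-zero drift is evidence that the derived balance must be rebuilt from the ledger, never the other way round.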


10. Batch Performance Engineering

Batch jobs are not loops. They are state machines over large data sets.

10.1 Batch job design checklist

Every ERP batch job needs:

  • Job identity.
  • Parameters hash.
  • Scope: tenant, legal entity, period, warehouse, item range.
  • Snapshot or selection strategy.
  • Chunk size.
  • Checkpoint cursor.
  • Retry policy.
  • Failure classification.
  • Idempotency key per output.
  • Progress metrics.
  • Operator controls.
  • Completion evidence.

10.2 Chunking strategy

Bad:

List<Invoice> invoices = invoiceRepository.findAllPending();
for (Invoice invoice : invoices) {
    post(invoice);
}

Better:

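while (true) {
    // Lease a bounded chunk so memory use and transaction scope stay small.
    List<InvoiceWorkItem> chunk = workQueue.leaseNextChunk(jobId, 500);
    if (chunk.isEmpty()) break;

    for (InvoiceWorkItem item : chunk) {
        try {
            // The idempotency key makes re-running this item safe after a crash.
            postingService.postInvoice(item.invoiceId(), item.idempotencyKey());
            workQueue.markDone(item.id());
        } catch (RetriableException ex) {
            // Transient failure: leave the item eligible for a later attempt.
            workQueue.markRetry(item.id(), ex.getMessage());
        } catch (BusinessException ex) {
            // Permanent business rejection: record the reason and move on.
            workQueue.markRejected(item.id(), ex.getCode());
        }
    }

    // Persist progress once per chunk; List.getLast() requires Java 21 or later.
    jobCheckpoint.save(jobId, chunk.getLast().cursor());
}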

10.3 Partitioning strategy

Partitioning can improve throughput, but only when partitions do not fight over the same hot resources.

| Job | Good Partition Key | Dangerous Partition Key |
| --- | --- | --- |
| Journal posting | legal entity + fiscal period + journal batch | random line ID if all update same balance rows |
| Inventory projection | item + warehouse | timestamp only when one item dominates |
| Invoice import | source file segment | vendor if one vendor has huge volume |
| Depreciation | asset book + legal entity | account if it creates hot account balance rows |
| MRP | planning area + item family | all items competing for one temp table |

10.4 Checkpoint and restart

A batch job that cannot restart safely is not production-ready.

Checkpoint must answer:

  • What was selected?
  • What was completed?
  • What failed permanently?
  • What is safe to retry?
  • Which outputs were already emitted?
  • Which parameters define this run?

Use idempotency on every side effect:

CREATE UNIQUE INDEX uq_posting_request_idempotency
ON posting_request (tenant_id, idempotency_key);
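The unique index gives the database the final say on duplicates. The same at-most-once contract can be illustrated in memory, which helps when unit-testing retry paths. IdempotentExecutor below is a hypothetical helper that uses a ConcurrentHashMap as a stand-in for the posting_request table; it is a sketch, not a replacement for the database constraint.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Runs a side effect at most once per (tenant, idempotency key) and replays the stored result. */
final class IdempotentExecutor<R> {
    private final Map<String, R> resultsByKey = new ConcurrentHashMap<>();

    /**
     * The side effect must return a non-null result and must not call back into this executor,
     * because computeIfAbsent runs it while holding the map entry.
     */
    R executeOnce(String tenantId, String idempotencyKey, Supplier<R> sideEffect) {
        String scopedKey = tenantId + ":" + idempotencyKey; // same scope as the unique index
        return resultsByKey.computeIfAbsent(scopedKey, key -> sideEffect.get());
    }
}
```

A retried request with the same key receives the first result and the side effect does not run again. A request from a different tenant with the same key is treated as a separate operation.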

10.5 Batch throttling

A batch job should not consume all capacity simply because it can.

Throttle by:

  • Max active workers.
  • Max DB connections.
  • Max messages/sec.
  • Max rows/sec.
  • Time window.
  • Business calendar.
  • Downstream lag.
  • Interactive latency guardrail.

Example policy:

If interactive p95 > 900ms for 5 minutes, reduce invoice import workers from 12 to 4.
If GL posting queue oldest age > 30 minutes during close, pause non-critical exports.
If DB lock wait p99 > 2s, reduce stock projection rebuild concurrency.
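The first rule of the example policy can be sketched as a small controller. ImportThrottle is a hypothetical class; it assumes one p95 sample per minute, so five consecutive breaches approximate "for 5 minutes". Recovery here is immediate for brevity, whereas a production throttle would probably add hysteresis before scaling back up.

```java
/** Chooses the invoice import worker count from a stream of interactive p95 samples (illustrative). */
final class ImportThrottle {
    private final int normalWorkers;
    private final int reducedWorkers;
    private final long p95LimitMs;
    private final int sustainedSamples;
    private int consecutiveBreaches;

    ImportThrottle(int normalWorkers, int reducedWorkers, long p95LimitMs, int sustainedSamples) {
        this.normalWorkers = normalWorkers;
        this.reducedWorkers = reducedWorkers;
        this.p95LimitMs = p95LimitMs;
        this.sustainedSamples = sustainedSamples;
    }

    /** Feed one p95 sample; returns the worker count that should apply from now on. */
    synchronized int onSample(long interactiveP95Ms) {
        if (interactiveP95Ms > p95LimitMs) {
            consecutiveBreaches++;
        } else {
            consecutiveBreaches = 0; // a single healthy sample resets the streak
        }
        return consecutiveBreaches >= sustainedSamples ? reducedWorkers : normalWorkers;
    }
}
```

Constructed as new ImportThrottle(12, 4, 900, 5), the controller keeps 12 workers through four breaching samples, drops to 4 on the fifth, and returns to 12 once a sample falls back under 900ms.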

11. Posting Pipeline Performance

Financial posting combines performance with strict invariants.

11.1 Posting pipeline shape

Performance pitfalls:

  • Loading full operational aggregate for every line.
  • Re-deriving static account mapping repeatedly.
  • Updating the same balance row for every journal line.
  • One giant transaction for millions of lines.
  • Synchronous read model update inside posting transaction.
  • Logging every line at INFO.
  • Reporting queries on journal line table during posting.

11.2 Posting throughput model

Throughput is constrained by:

min(
  validation throughput,
  derivation throughput,
  DB insert throughput,
  balance update throughput,
  lock contention limit,
  outbox throughput,
  downstream projection throughput
)

Do not optimize derivation if balance locking is the bottleneck.
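The min() model turns into a quick estimate once each stage has a measured rate. The sketch below uses a hypothetical ThroughputModel class and made-up stage rates; only the structure of the calculation is the point.

```java
import java.util.Map;

/** Finds the slowest pipeline stage and the completion time it implies (illustrative). */
final class ThroughputModel {
    private ThroughputModel() {
    }

    /** The stage with the lowest sustained rate bounds the whole pipeline. */
    static Map.Entry<String, Double> bottleneck(Map<String, Double> linesPerSecondByStage) {
        return linesPerSecondByStage.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .orElseThrow(() -> new IllegalArgumentException("at least one stage is required"));
    }

    /** Minutes needed to post the given number of lines at the bottleneck rate. */
    static double minutesToPost(long lines, Map<String, Double> linesPerSecondByStage) {
        return lines / bottleneck(linesPerSecondByStage).getValue() / 60.0;
    }
}
```

Suppose validation runs at 20,000 lines/sec, derivation at 15,000, inserts at 8,000, and balance updates at 2,000. The pipeline is then bounded by balance updates, and 5 million lines take about 41.7 minutes (5,000,000 / 2,000 / 60). Doubling derivation speed would not move that number at all.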

11.3 Bulk insert strategy

For high-volume posting:

  • Use batch inserts where appropriate.
  • Avoid per-line flush.
  • Validate in memory per chunk.
  • Persist header + lines in bounded chunks.
  • Keep transaction size below operational risk threshold.
  • Store enough evidence to reconstruct calculation.
  • Emit one outbox event per posted journal or chunk, not per trivial line unless required.

11.4 Balance updates

Balance rows can become hot.

Options:

| Strategy | When Useful | Risk |
| --- | --- | --- |
| Update balance per journal synchronously | Small/medium volume | Hot account/period rows |
| Append ledger only, project balance async | High volume | Freshness lag |
| Sharded balance accumulator | Very high volume | More complex reconciliation |
| Periodic aggregation | Reporting workloads | Not immediate |

Rule:

Never trade financial correctness for write speed. Use projection/reconciliation patterns instead.
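The sharded accumulator idea can be illustrated in memory. In a database each shard would be a separate balance row for the same account and period; here a hypothetical ShardedBalance class uses an AtomicLongArray, with amounts held in minor units such as cents. This is a sketch of the write-spreading technique only, and reading the total while writers are active does not give a consistent snapshot.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Spreads updates to one logical balance across several slots to reduce write contention. */
final class ShardedBalance {
    private final AtomicLongArray shardsMinorUnits;

    ShardedBalance(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardsMinorUnits = new AtomicLongArray(shardCount);
    }

    /** Adds a signed amount; the journal ID picks the shard so writers rarely collide. */
    void add(long journalId, long amountMinorUnits) {
        int shard = Math.floorMod(Long.hashCode(journalId), shardsMinorUnits.length());
        shardsMinorUnits.addAndGet(shard, amountMinorUnits);
    }

    /** The logical balance is the sum of all shards. */
    long total() {
        long sum = 0;
        for (int i = 0; i < shardsMinorUnits.length(); i++) {
            sum += shardsMinorUnits.get(i);
        }
        return sum;
    }
}
```

The trade-off matches the table: writes no longer queue on one row, but every read and every reconciliation has to sum the shards.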


12. Read Model and Reporting Performance

Reports are often the silent killer of ERP performance.

12.1 Reporting should have workload isolation

Isolation options:

  • Separate query service.
  • Read replicas.
  • Materialized views.
  • Dedicated reporting schema.
  • Search index for search-like access.
  • Warehouse/lakehouse for analytics.
  • Export job queue for large downloads.

12.2 Report freshness class

Classify every report.

| Class | Freshness | Example | Architecture |
| --- | --- | --- | --- |
| Real-time operational | seconds | pick queue, approval inbox | operational read model |
| Near-real-time | minutes | sales dashboard | projection/materialized view |
| Certified financial | close-controlled | trial balance, financial statements | controlled ledger/balance model |
| Analytical | hours/day | customer profitability | warehouse/lakehouse |
| Audit extract | case-dependent | regulator export | governed export job |

Do not spend OLTP write capacity to make analytical dashboards real-time unless the business truly needs it.

12.3 Export as job, not request

Bad:

GET /invoices/export?from=2025-01-01&to=2025-12-31

Better:

POST /exports
{
  "type": "INVOICE_EXPORT",
  "from": "2025-01-01",
  "to": "2025-12-31",
  "format": "CSV"
}

Then:

  • Validate request.
  • Create export job.
  • Process in chunks.
  • Write to object storage.
  • Notify user.
  • Audit who exported what.
  • Expire download link.

13. Caching in ERP

Caching can help ERP performance, but caching the wrong thing creates correctness defects.

13.1 Cache categories

| Cache Type | Good Use | Dangerous Use |
| --- | --- | --- |
| Reference data cache | currency, UOM, static code lists | mutable tax/pricing rules without versioning |
| Configuration cache | effective published config | draft/unapproved config |
| Authorization cache | short-lived permission decisions | long-lived SoD decisions after role change |
| Read model cache | dashboard summaries | financial report requiring certified freshness |
| External lookup cache | tax rate lookup with expiry | legal calculation evidence without stored source |

13.2 Cache invalidation model

Every cache needs:

  • Owner.
  • Key structure.
  • Scope: tenant/company/branch/user.
  • Version or effective date.
  • TTL.
  • Invalidation trigger.
  • Staleness tolerance.
  • Fallback behavior.
  • Audit relevance.

For ERP pricing/tax/config, prefer versioned keys:

pricing:{tenant}:{priceBook}:{publishedVersion}:{item}:{customerSegment}
tax:{tenant}:{jurisdiction}:{taxConfigVersion}:{itemTaxClass}:{customerTaxClass}

Versioned cache keys reduce stale-decision risk.
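A versioned key can be modelled as a small value type, so that publishing a new price book version changes every key and old entries simply stop being read. The names below (PricingKey, VersionedPriceCache) are hypothetical, prices are in minor units, and the sketch assumes Java 17 or later and key components that do not themselves contain a colon.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Cache key that embeds the published price book version. */
record PricingKey(String tenant, String priceBook, int publishedVersion, String item, String customerSegment) {
    String asString() {
        return String.join(":", "pricing", tenant, priceBook,
                Integer.toString(publishedVersion), item, customerSegment);
    }
}

/** Minimal in-memory cache keyed by the versioned string form (illustrative). */
final class VersionedPriceCache {
    private final Map<String, Long> priceMinorUnitsByKey = new ConcurrentHashMap<>();

    void put(PricingKey key, long priceMinorUnits) {
        priceMinorUnitsByKey.put(key.asString(), priceMinorUnits);
    }

    Optional<Long> get(PricingKey key) {
        return Optional.ofNullable(priceMinorUnitsByKey.get(key.asString()));
    }
}
```

A lookup with version 8 misses an entry stored under version 7, so no explicit invalidation is needed when a new version is published. Entries for old versions still need a TTL or eviction policy to be reclaimed.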

13.3 Cache stampede protection

High-scale ERP read models can suffer stampede when many users request the same expensive report.

Controls:

  • Request coalescing.
  • Single-flight loading.
  • Stale-while-revalidate for non-critical dashboards.
  • Rate limits.
  • Precomputation.
  • Job-based reports.
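Request coalescing and single-flight loading mean that concurrent requests for the same key share one in-progress load. A minimal sketch follows, using a hypothetical SingleFlight class built on CompletableFuture; it covers only the in-flight window and does not cache completed results.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Ensures at most one load per key is in progress; later callers join the running load. */
final class SingleFlight<V> {
    private final ConcurrentMap<String, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    CompletableFuture<V> load(String key, Supplier<CompletableFuture<V>> loader) {
        CompletableFuture<V> created = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, created);
        if (existing != null) {
            return existing; // another caller already started this load
        }
        try {
            // This caller owns the load: start it and forward the outcome to everyone waiting.
            loader.get().whenComplete((value, error) -> {
                inFlight.remove(key, created); // the next request after completion triggers a fresh load
                if (error != null) {
                    created.completeExceptionally(error);
                } else {
                    created.complete(value);
                }
            });
        } catch (RuntimeException e) {
            inFlight.remove(key, created);
            created.completeExceptionally(e);
        }
        return created;
    }
}
```

Two callers that request the same trial balance while it is being computed receive the same future, and the expensive query runs once. Pairing this with a short-lived result cache gives stale-while-revalidate behaviour for non-critical dashboards.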

14. Backpressure and Admission Control

A system without backpressure converts overload into failure.

14.1 Where to apply backpressure

| Boundary | Control |
| --- | --- |
| API gateway | rate limit, request size limit, concurrency limit |
| Application | semaphore per operation, timeout, circuit breaker |
| Queue consumer | max concurrency, pause/resume, lag-based scaling |
| Batch scheduler | calendar windows, worker limit, priority |
| DB | connection pool, lock timeout, statement timeout |
| Export/report | job queue, max rows, approval for huge export |
| Integration | per-partner rate limit, retry budget |

14.2 Retry budget

Retry without budget creates retry storms.

Example retry policy:

For bank statement import:
- Retry network timeout up to 5 times with exponential backoff.
- Do not retry validation failure.
- Do not retry duplicate statement; mark duplicate.
- Stop partner consumer if DLQ rate > 5% for 10 minutes.
- Alert integration owner if oldest retry age > 1 hour.
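The first two rules of this policy can be sketched as a retry helper with a fixed budget. TransientFailure and BudgetedRetry are hypothetical names; the sleeper is injected so that tests can record delays instead of actually sleeping, and the sketch assumes a small retry count so the doubling delay cannot overflow.

```java
import java.time.Duration;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Marks failures worth retrying, such as a network timeout (hypothetical type). */
class TransientFailure extends RuntimeException {
    TransientFailure(String message) {
        super(message);
    }
}

/** Retries only TransientFailure, at most maxRetries times, with exponential backoff. */
final class BudgetedRetry {
    private final int maxRetries;
    private final Duration baseDelay;
    private final Consumer<Duration> sleeper; // injected so callers control how waiting happens

    BudgetedRetry(int maxRetries, Duration baseDelay, Consumer<Duration> sleeper) {
        this.maxRetries = maxRetries;
        this.baseDelay = baseDelay;
        this.sleeper = sleeper;
    }

    /** Any exception other than TransientFailure propagates immediately without a retry. */
    <T> T call(Supplier<T> operation) {
        for (int attempt = 0; ; attempt++) {
            try {
                return operation.get();
            } catch (TransientFailure e) {
                if (attempt >= maxRetries) {
                    throw e; // budget exhausted: surface the failure instead of retrying forever
                }
                // Delay doubles on each retry: base, 2x base, 4x base, ...
                sleeper.accept(baseDelay.multipliedBy(1L << attempt));
            }
        }
    }
}
```

With a budget of 5 retries and a 100ms base delay, an operation that times out three times and then succeeds waits 100ms, 200ms, and 400ms. A validation failure thrown as any other exception type is not retried at all.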

14.3 Degradation modes

ERP degradation should be explicit.

| Pressure | Safe Degradation |
| --- | --- |
| Reporting overload | Queue exports, serve cached dashboard, pause ad-hoc reports |
| Posting backlog | Prioritize legal entity close jobs, pause non-critical integrations |
| Integration storm | Slow consumer, DLQ invalid messages, preserve inbox ledger |
| DB lock contention | Reduce batch concurrency, increase chunking, isolate hot keys |
| Search overload | Limit filters, require date range, defer full export |

15. Performance Testing Strategy

15.1 Test with realistic data shape

ERP performance depends on skew.

Test data must include:

  • Large tenant and small tenant.
  • Hot item and normal items.
  • Large vendor/customer.
  • Large fiscal period.
  • Many approval tasks for one approver.
  • Large document with many lines.
  • Many small documents.
  • Historical data across years.
  • Closed and open periods.
  • Active and inactive master data.

15.2 Test mixed workload

A system that passes isolated API load tests can fail under realistic mixed load.

Mixed close workload example:

- 150 active users.
- 20 invoice import workers.
- 8 GL posting workers.
- 4 depreciation workers.
- Trial balance report every 2 minutes.
- Inventory aging report every 5 minutes.
- WMS shipment events at 200/minute.
- Tax API latency p95 = 600ms.

15.3 Define pass/fail criteria

Pass/fail must include correctness and operability:

  • No broken financial balance.
  • No oversold stock.
  • No duplicate posted invoice.
  • No lost outbox event.
  • No unbounded queue growth after load stops.
  • No manual DB cleanup required.
  • No worker stuck without visibility.
  • Completion window met.
  • Error classification correct.

16. Worked Example: Month-End Close Performance

16.1 Problem

Month-end close is missing its 6-hour SLA. Symptoms:

  • Depreciation job takes 3 hours.
  • Subledger reconciliation takes 2 hours.
  • Trial balance report sometimes takes 20 minutes.
  • Users complain approval screen is slow during close.
  • DB CPU is high but not consistently saturated.

16.2 Poor response

Increase database size and add more worker threads.

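This may worsen contention: more workers compete for the same hot rows, locks, and connections, so wait time can grow faster than throughput.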

16.3 Better investigation

Break down the close pipeline into stages and measure each one:

| Stage | Observed | Bottleneck |
| --- | --- | --- |
| Depreciation | 3h | single-thread asset book, repeated account lookup |
| Accrual posting | 45m | acceptable |
| Reconciliation | 2h | full scans of invoice/journal tables |
| Trial balance | 20m | reporting on raw journal lines |
| Approval screen | p95 2.5s | shared DB pool with close jobs |

16.4 Design changes

  • Partition depreciation by legal entity + asset book.
  • Cache published account mapping by version.
  • Use checkpointed depreciation work items.
  • Maintain certified subledger balance projection.
  • Generate trial balance from period balance table.
  • Separate interactive and batch connection pools.
  • Add close workload throttle.
  • Add progress and lag metrics per stage.

16.5 Result target

Depreciation: 45 minutes
Reconciliation: 25 minutes
Trial balance: 20 seconds
Interactive approval p95 during close: < 800ms
Close total: < 2 hours

The important shift is not just optimization. It is workload isolation and derived data design.


17. Java Implementation Patterns

17.1 Micrometer timing around business operations

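import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.function.Supplier;
import org.springframework.stereotype.Component;

@Component
public class PostingMetrics {
    private final MeterRegistry registry;

    public PostingMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // Times one posting operation and returns its result; tags stay low-cardinality.
    public <T> T recordPosting(String legalEntity, String sourceType, Supplier<T> supplier) {
        return Timer.builder("erp.posting.duration")
                .tag("legalEntity", legalEntity)
                .tag("sourceType", sourceType)
                .publishPercentileHistogram()
                .register(registry)
                .record(supplier);
    }

    // Counts rejected postings by source type and a bounded set of reason codes.
    public void incrementRejected(String sourceType, String reason) {
        Counter.builder("erp.posting.rejected")
                .tag("sourceType", sourceType)
                .tag("reason", reason)
                .register(registry)
                .increment();
    }
}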

Be careful with metric cardinality. Do not tag metrics with invoice ID, PO number, or user ID.

17.2 Statement timeout for reporting queries

@Transactional(readOnly = true)
public TrialBalanceView generateTrialBalance(TrialBalanceRequest request) {
    jdbcTemplate.execute("SET LOCAL statement_timeout = '30s'");
    return trialBalanceQuery.loadCertifiedBalance(request);
}

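A timeout is not a substitute for a good query, but it prevents one bad query from consuming the system indefinitely. Note that statement_timeout and SET LOCAL are PostgreSQL features, and SET LOCAL lasts only until the end of the current transaction, so the statement must run in the same transaction as the query it protects.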

17.3 Bounded worker executor

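import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public final class BoundedWorkerGate {
    private final Semaphore permits;

    public BoundedWorkerGate(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public <T> T execute(Callable<T> work) throws Exception {
        // Wait up to 2 seconds for one permit, then reject instead of queueing without bound.
        if (!permits.tryAcquire(2, TimeUnit.SECONDS)) {
            // CapacityRejectedException is an application-defined exception.
            throw new CapacityRejectedException("Worker capacity exhausted");
        }
        try {
            return work.call();
        } finally {
            permits.release(); // always return the permit, even when the work fails
        }
    }
}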

This is simple but important: uncontrolled concurrency is a common ERP production failure.


18. Performance Review Checklist

Use this checklist in architecture review.

Workload

  • Have we classified interactive, batch, reporting, integration, and close workloads?
  • Do we know the largest tenant/company/warehouse/vendor/customer/item?
  • Do we know peak concurrency and seasonal spikes?
  • Do we know mixed workload behavior?

Correctness under performance pressure

  • Are financial, stock, approval, and numbering invariants preserved under load?
  • Are retries idempotent?
  • Are partial failures classified?
  • Can jobs resume without duplicate output?

Data access

  • Are queries bounded by scope and date/status where appropriate?
  • Are indexes aligned with real access paths?
  • Is keyset pagination used for large lists?
  • Are reports isolated from OLTP writes?
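To make the keyset pagination item concrete, here is a minimal in-memory sketch. The `JournalRow` and `Cursor` records and the `KeysetPager` class are hypothetical names; a real implementation would push the comparison into the SQL query, as the comment indicates, instead of sorting in the JVM.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical row and cursor types; field names are assumptions.
record JournalRow(LocalDate postingDate, long id) {}
record Cursor(LocalDate postingDate, long id) {}

final class KeysetPager {
    private static final Comparator<JournalRow> ORDER =
            Comparator.comparing(JournalRow::postingDate).thenComparingLong(JournalRow::id);

    private KeysetPager() {}

    // In PostgreSQL the same idea is roughly:
    //   WHERE (posting_date, id) > (:lastDate, :lastId)
    //   ORDER BY posting_date, id LIMIT :pageSize
    // so the database can seek on an index instead of reading and discarding OFFSET rows.
    static List<JournalRow> nextPage(List<JournalRow> rows, Cursor after, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<JournalRow> sorted = new ArrayList<>(rows);
        sorted.sort(ORDER);
        List<JournalRow> page = new ArrayList<>();
        for (JournalRow row : sorted) {
            // A null cursor means "start from the beginning".
            boolean pastCursor = after == null
                    || ORDER.compare(row, new JournalRow(after.postingDate(), after.id())) > 0;
            if (pastCursor) {
                page.add(row);
                if (page.size() == pageSize) break;
            }
        }
        return page;
    }

    // The cursor for the next request is the sort key of the last row returned.
    static Cursor cursorOf(List<JournalRow> page) {
        JournalRow last = page.get(page.size() - 1);
        return new Cursor(last.postingDate(), last.id());
    }
}
```

The sort key includes `id` as a tie-breaker because many journal rows share a posting date; without a unique trailing column, rows could be skipped or repeated across pages.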

JVM/application

  • Do we measure allocation rate and GC pauses?
  • Are connection pools sized by workload class?
  • Are timeouts configured?
  • Are thread pools bounded?
  • Is metric cardinality controlled?

Batch

  • Is every batch job chunked?
  • Does every job have checkpoint and restart?
  • Are chunks idempotent?
  • Is concurrency configurable?
  • Are progress and failure reasons visible?
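The batch items above can be illustrated with a small in-memory sketch of chunking, checkpointing, and idempotent resume. The `ChunkedJob` class is hypothetical; a real job would persist the checkpoint and the processed keys in the database, ideally in the same transaction as each chunk's output.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// In-memory stand-in for a restartable job; all names here are assumptions.
final class ChunkedJob {
    private final List<String> workItemKeys;               // stable, ordered idempotency keys
    private final int chunkSize;
    private final Set<String> processed = new HashSet<>(); // record of completed keys
    private int checkpoint = 0;                            // index of the next uncommitted item

    ChunkedJob(List<String> workItemKeys, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.workItemKeys = new ArrayList<>(workItemKeys);
        this.chunkSize = chunkSize;
    }

    // Processes chunks starting at the last checkpoint. An exception leaves the
    // checkpoint at the start of the failed chunk, so a later run resumes there.
    void run(Consumer<String> handler) {
        while (checkpoint < workItemKeys.size()) {
            int end = Math.min(checkpoint + chunkSize, workItemKeys.size());
            for (String key : workItemKeys.subList(checkpoint, end)) {
                if (processed.contains(key)) {
                    continue; // already handled before an earlier failure
                }
                handler.accept(key);
                processed.add(key);
            }
            checkpoint = end; // commit point for the chunk
        }
    }

    int checkpoint() { return checkpoint; }
    int processedCount() { return processed.size(); }
}
```

If the handler throws midway through a chunk, calling `run` again re-enters that chunk, skips the keys already recorded, and continues, so no item produces output twice.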

Operations

  • Can support see queue lag, job progress, and bottlenecks?
  • Are overload controls documented?
  • Are performance regressions tested before release?
  • Is month-end close rehearsed with production-like data?

19. Anti-Patterns

| Anti-pattern | Why It Fails |
| --- | --- |
| Report directly on OLTP tables | Reports fight with writes and create unpredictable load |
| Load full aggregate for every screen | Object graph explosion and N+1 behavior |
| One giant batch transaction | Lock bloat, rollback risk, no progress visibility |
| Increase threads to fix slowness | Often increases DB contention and timeout rate |
| Cache financial truth | Creates audit and reconciliation risk |
| Use average latency | Hides tail latency and production pain |
| No realistic test data | Performance passes in test but fails at go-live |
| No workload isolation | Batch/reporting/integration starve interactive users |
| Retry without budget | Turns transient failure into retry storm |
| Tune before measuring | Wastes effort and may worsen the real bottleneck |
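As an illustration of the "retry without budget" row, the sketch below caps retries with a shared budget in addition to a per-call attempt limit. `RetryBudget` is a hypothetical class; backoff, jitter, and time-windowed refill are omitted, and `reset()` stands in for whatever scheduler would refill the budget.

```java
import java.util.function.Supplier;

// Hypothetical shared budget: at most maxRetries retries across all callers
// until reset() is called (for example once per time window).
final class RetryBudget {
    private final int maxRetries;
    private int used = 0;

    RetryBudget(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    synchronized boolean tryConsume() {
        if (used >= maxRetries) {
            return false;
        }
        used++;
        return true;
    }

    synchronized void reset() {
        used = 0;
    }

    // Calls the operation, retrying on RuntimeException only while both the
    // per-call attempt limit and the shared budget allow another attempt.
    <T> T call(Supplier<T> operation, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e;
                boolean attemptsLeft = attempt < maxAttempts;
                if (!attemptsLeft || !tryConsume()) {
                    break; // budget exhausted: surface the failure instead of retrying
                }
            }
        }
        throw last;
    }
}
```

Once the budget is spent, every caller fails after its first attempt, which keeps a struggling dependency from receiving a multiple of its normal load.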

20. Deliberate Practice

Spend 2–3 hours on these exercises before moving on.

Exercise 1 — Workload map

Pick one ERP domain such as inventory or AP. Build a workload map:

  • Interactive commands.
  • Batch jobs.
  • Reports.
  • Integrations.
  • Close/reconciliation flows.
  • Peak periods.
  • Hot entities.

Exercise 2 — Performance contract

Write five performance contracts using the template from Section 4.

Across the five contracts, each of the following must appear at least once:

  • A financial invariant.
  • A stock invariant.
  • A reporting freshness requirement.
  • A retry/idempotency requirement.
  • A completion window.

Exercise 3 — Batch redesign

Take this bad job:

Load all pending invoices, post them in a loop, fail if any invoice fails.

Redesign it with:

  • Work item table.
  • Chunking.
  • Checkpoint.
  • Idempotency key.
  • Failure classification.
  • Metrics.
  • Operator controls.

Exercise 4 — Mixed workload test plan

Design a load test for month-end close with:

  • User traffic.
  • Posting workers.
  • Report burst.
  • Integration events.
  • DB metrics.
  • Business metrics.
  • Pass/fail criteria.

21. Mental Compression

Remember these rules:

  1. ERP performance is workload management plus correctness preservation.
  2. Optimize business scenarios, not isolated methods.
  3. Measure before tuning.
  4. Separate interactive, batch, reporting, and integration capacity.
  5. Batch jobs must be chunked, checkpointed, idempotent, and observable.
  6. Reports need read models, not heroic OLTP joins.
  7. More threads can create more contention.
  8. Ledger is truth; balances and reports are derived and reconciled.
  9. Tail latency and completion windows matter more than averages.
  10. If support cannot see progress, lag, and bottleneck, the system is not operable.
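Rule 9 is easy to demonstrate with a few numbers. The sketch below compares the mean with a nearest-rank percentile; `LatencyStats` is a hypothetical helper written for illustration, and production systems would normally rely on their metrics library's histograms instead.

```java
import java.util.Arrays;

final class LatencyStats {
    private LatencyStats() {}

    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }
}
```

For 20 requests where 18 take 100 ms and 2 take 5000 ms, the mean is 590 ms, the median is 100 ms, and the 95th percentile is 5000 ms. The average describes a request that never happened, while the percentile shows what the slowest users actually experienced.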

22. Source Notes

This part is grounded in the following technical references and industry-proven concepts:

  • Oracle Java Flight Recorder and JDK Mission Control documentation for profiling and troubleshooting Java runtime performance.
  • Spring Boot Actuator and Micrometer documentation for production metrics and observability.
  • PostgreSQL documentation on transaction isolation, explicit locking, query planning, indexes, and materialized views.
  • Jakarta Batch concepts for chunk-oriented batch processing in enterprise Java environments.
  • Enterprise integration and transactional outbox/idempotent consumer patterns already covered in previous parts.

Use vendor documentation for exact syntax and version-specific behavior. Use the mental model here for architecture and design review.
