Micrometer, Prometheus & Actuator

Learn Java Error, Reliability & Observability Engineering - Part 026

Production-grade Java metrics instrumentation in practice with Micrometer, Prometheus, and Spring Boot Actuator: meter registry, counters, gauges, timers, histograms, tags, dashboards, alerts, and testing.

2026-06-28 · 11 min read · 2010 words

Part 026 — Micrometer, Prometheus & Actuator

The previous part built a mental model of metrics. This part moves on to a production-grade Java implementation using:

Micrometer as the instrumentation facade
Spring Boot Actuator as the application's observability endpoint
Prometheus as the scraper and time-series backend
Grafana or a similar tool as the dashboard layer

The goal of this part is not merely to “show /actuator/prometheus”. The goal is to be able to design metrics that are correct, safe, stable, low in cardinality, and useful during an incident.

1. Architecture Overview

Micrometer provides the abstraction layer. The Java application records metrics into a MeterRegistry. Spring Boot integrates Micrometer with Actuator. Prometheus scrapes the /actuator/prometheus endpoint.

Important boundary:

The application does not “push dashboards”. It exposes time series. The monitoring backend scrapes, queries, aggregates, and alerts.
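To make this boundary concrete, the sketch below renders a single sample line in the Prometheus text exposition format, which is roughly what a scrape of /actuator/prometheus returns for each time series. It is a hand-rolled illustration, not how Micrometer builds its output; the class name ExpositionSketch and its methods are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper: formats one sample the way the Prometheus text format lays it out.
final class ExpositionSketch {

    // Escapes backslash, double quote, and newline inside a label value.
    static String escapeLabelValue(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // Produces: name{key="value",...} value (labels sorted by key for stable output).
    static String renderSample(String name, Map<String, String> labels, double value) {
        String renderedLabels = new TreeMap<>(labels).entrySet().stream()
            .map(e -> e.getKey() + "=\"" + escapeLabelValue(e.getValue()) + "\"")
            .collect(Collectors.joining(","));
        String labelPart = renderedLabels.isEmpty() ? "" : "{" + renderedLabels + "}";
        return name + labelPart + " " + value;
    }
}
```

Each distinct combination of name and label values is one time series on the backend, which is why the rest of this part keeps returning to bounded tag values.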

2. Dependency Setup

Example Maven setup for a Spring Boot service:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>

    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
</dependencies>

For Gradle:

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("io.micrometer:micrometer-registry-prometheus")
}

Spring Boot auto-configures the Prometheus registry when the registry dependency is on the classpath.

3. Minimal Actuator Configuration

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      probes:
        enabled: true
  metrics:
    tags:
      application: case-service
      environment: prod

Important notes:

/actuator/metrics is for inspecting which meters are registered, not a backend for production scraping.
/actuator/prometheus is the endpoint Prometheus scrapes.
Do not expose every actuator endpoint carelessly on the internet.
Add security or network policy for the management endpoints.

4. What Spring Boot Gives You by Default

With Actuator + Micrometer, Spring Boot typically provides metrics such as:

HTTP server requests
JVM memory
JVM GC
JVM threads
process CPU
system CPU
logback events
executor metrics, if instrumented
data source / connection pool metrics, if applicable
cache metrics, if configured

Example endpoints:

GET /actuator/metrics
GET /actuator/metrics/http.server.requests
GET /actuator/prometheus

However, the default metrics are not enough for domain reliability. We still need custom metrics for domain failures, retries, fallbacks, audit, and business correctness.

5. MeterRegistry

MeterRegistry is the central place where meters are registered.

@Component
public class CaseApprovalMetrics {

    private final MeterRegistry registry;

    public CaseApprovalMetrics(MeterRegistry registry) {
        this.registry = registry;
    }
}

Avoid a global static registry unless there is a strong reason for one. Dependency injection makes metrics easier to test.

6. Counter

Use a counter for cumulative events.

@Component
public class CaseApprovalMetrics {

    private final Counter approvalSuccess;
    private final Counter approvalRejected;
    private final Counter approvalFailed;

    public CaseApprovalMetrics(MeterRegistry registry) {
        this.approvalSuccess = Counter.builder("case.approval.attempts")
            .description("Total case approval attempts")
            .tag("outcome", "success")
            .register(registry);

        this.approvalRejected = Counter.builder("case.approval.attempts")
            .description("Total case approval attempts")
            .tag("outcome", "rejected")
            .register(registry);

        this.approvalFailed = Counter.builder("case.approval.attempts")
            .description("Total case approval attempts")
            .tag("outcome", "failure")
            .register(registry);
    }

    public void recordSuccess() {
        approvalSuccess.increment();
    }

    public void recordRejected() {
        approvalRejected.increment();
    }

    public void recordFailure() {
        approvalFailed.increment();
    }
}

The Prometheus output follows the registry's naming convention. Counters are typically exposed with the _total suffix.

PromQL:

sum(rate(case_approval_attempts_total[5m])) by (outcome)

7. Dynamic Tags: Use Carefully

Sometimes we want a tag based on a reason.

public void recordRejection(String reason) {
    Counter.builder("case.approval.rejections")
        .tag("reason", reason)
        .register(registry)
        .increment();
}

This is safe only if reason comes from a bounded enum or registry.

Good:

public enum RejectionReason {
    MISSING_APPROVAL_AUTHORITY,
    CASE_ALREADY_CLOSED,
    POLICY_DENIED,
    VERSION_CONFLICT
}

Bad:

recordRejection(exception.getMessage());
recordRejection(command.caseId());
recordRejection(userId);

Better implementation:

public void recordRejection(RejectionReason reason) {
    Counter.builder("case.approval.rejections")
        .tag("reason", reason.name())
        .register(registry)
        .increment();
}
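One way to keep reason bounded is to classify free-form failures into a fixed enum before they reach the metrics facade. The sketch below is an illustration only: ReasonCode, ReasonClassifier, and the mapping rules are hypothetical and would need to reflect your own exception hierarchy.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

// Hypothetical bounded vocabulary for the `reason` tag.
enum ReasonCode {
    DEPENDENCY_TIMEOUT,
    INVALID_INPUT,
    UNEXPECTED
}

final class ReasonClassifier {

    // Maps any throwable to one of a fixed set of codes; the message text is never used.
    static ReasonCode classify(Throwable error) {
        if (error instanceof TimeoutException || error instanceof SocketTimeoutException) {
            return ReasonCode.DEPENDENCY_TIMEOUT;
        }
        if (error instanceof IllegalArgumentException) {
            return ReasonCode.INVALID_INPUT;
        }
        // Everything unknown collapses into one bucket, so cardinality cannot grow.
        return ReasonCode.UNEXPECTED;
    }
}
```

The detailed message still belongs in logs and traces; only the code becomes a tag value.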

8. Gauge

A gauge represents a current value.

@Component
public class CaseQueueMetrics {

    private final ManualReviewQueue queue;

    public CaseQueueMetrics(MeterRegistry registry, ManualReviewQueue queue) {
        this.queue = queue;

        Gauge.builder("case.manual_review.queue.depth", queue, ManualReviewQueue::depth)
            .description("Current number of cases waiting for manual review")
            // With the Prometheus registry, the base unit is appended to the exposed name (..._depth_cases).
            .baseUnit("cases")
            .register(registry);
    }
}

Important:

A gauge reads its value at scrape time.
Do not increment or decrement a gauge like a counter unless it is backed by a controlled state object.
Make sure the gauged object cannot be garbage-collected while the gauge is registered: Micrometer holds the state object through a weak reference by default.

For internal values:

@Component
public class InFlightMetrics {

    private final AtomicInteger inFlight = new AtomicInteger();

    public InFlightMetrics(MeterRegistry registry) {
        Gauge.builder("case.approval.in_flight", inFlight, AtomicInteger::get)
            .description("Current in-flight case approval operations")
            .register(registry);
    }

    public void started() {
        inFlight.incrementAndGet();
    }

    public void finished() {
        inFlight.decrementAndGet();
    }
}

Make sure the decrement happens in a finally block.

metrics.started();
try {
    approve(command);
} finally {
    metrics.finished();
}

9. Timer

Timer is the primary meter for latency.

@Component
public class CaseApprovalMetrics {

    private final MeterRegistry registry;

    public CaseApprovalMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public Timer timer(String outcome) {
        return Timer.builder("case.approval.duration")
            .description("Case approval processing duration")
            .tag("outcome", outcome)
            .publishPercentileHistogram()
            .serviceLevelObjectives(
                Duration.ofMillis(100),
                Duration.ofMillis(250),
                Duration.ofMillis(500),
                Duration.ofSeconds(1),
                Duration.ofSeconds(2)
            )
            .register(registry);
    }
}

Usage:

public ApprovalResult approve(ApproveCaseCommand command) {
    Timer.Sample sample = Timer.start(registry);
    try {
        ApprovalResult result = approvalService.approve(command);
        sample.stop(metrics.timer("success"));
        return result;
    } catch (DomainRejectionException ex) {
        sample.stop(metrics.timer("rejected"));
        throw ex;
    } catch (Exception ex) {
        sample.stop(metrics.timer("failure"));
        throw ex;
    }
}

9.1 Avoid Timer Tag Explosion

Bad:

.tag("caseId", command.caseId())
.tag("userId", command.userId())
.tag("exception", ex.getMessage())

Good:

.tag("operation", "approve")
.tag("outcome", "failure")
.tag("reason", "dependency_timeout")

10. Timer With `recordCallable`

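For simple code:

// recordCallable declares `throws Exception`, so the enclosing method must handle or declare it.
// The "success" tag is fixed before the call: a thrown exception is still timed under "success".
ApprovalResult result = Timer.builder("case.approval.duration")
    .tag("outcome", "success")
    .register(registry)
    .recordCallable(() -> approvalService.approve(command));

Be careful, though: if the outcome depends on an exception, Timer.Sample is often more flexible because the tag can be chosen after the outcome is known.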

11. DistributionSummary

Use DistributionSummary for non-duration measurements.

@Component
public class ValidationMetrics {

    private final DistributionSummary validationErrorCount;

    public ValidationMetrics(MeterRegistry registry) {
        this.validationErrorCount = DistributionSummary.builder("case.validation.error.count")
            .description("Number of validation errors per rejected request")
            .baseUnit("errors")
            .publishPercentileHistogram()
            .serviceLevelObjectives(1, 2, 5, 10, 20)
            .register(registry);
    }

    public void recordValidationErrors(int count) {
        validationErrorCount.record(count);
    }
}

Use cases:

validation errors per request
payload size
batch item count
affected entity count
retry attempts per completed operation

12. LongTaskTimer

Use LongTaskTimer for tasks that are still in progress and may run for a long time.

@Component
public class ReconciliationService {

    private final LongTaskTimer reconciliationTimer;

    public ReconciliationService(MeterRegistry registry) {
        this.reconciliationTimer = LongTaskTimer.builder("case.reconciliation.active")
            .description("Active reconciliation job duration")
            .register(registry);
    }

    public void reconcile() {
        LongTaskTimer.Sample sample = reconciliationTimer.start();
        try {
            runReconciliation();
        } finally {
            sample.stop();
        }
    }
}

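Alert example. With the Prometheus registry, a LongTaskTimer is exposed with suffixes such as _seconds_max and _seconds_active_count; confirm the exact name in your /actuator/prometheus output:

case_reconciliation_active_seconds_max > 3600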

13. FunctionCounter and FunctionTimer

Use function-based meters when the source already maintains a cumulative value.

FunctionCounter.builder("case.processed.total", processor, CaseProcessor::processedCount)
    .description("Total processed cases")
    .register(registry);

Use them with care:

the value must be monotonic for a counter
the source object's lifecycle must be stable
the backend must be able to handle resets/restarts
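The reset requirement can be illustrated with a simplified version of how a Prometheus-style increase treats a cumulative counter: any drop in value is read as a restart, and counting continues from the new value. The sketch below leaves out the extrapolation that real Prometheus applies at window edges; CounterIncrease is a hypothetical name.

```java
final class CounterIncrease {

    // Sums the growth across consecutive samples of a cumulative counter.
    // A sample lower than its predecessor is treated as a reset to zero.
    static double increase(double[] samples) {
        double total = 0.0;
        for (int i = 1; i < samples.length; i++) {
            double previous = samples[i - 1];
            double current = samples[i];
            if (current >= previous) {
                total += current - previous;
            } else {
                // Reset detected: everything counted since the restart is new growth.
                total += current;
            }
        }
        return total;
    }
}
```

This is also why a non-monotonic source is harmful behind a function counter: every decrease is misread as a restart and inflates the computed rate.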

14. Central Metrics Facade

Do not scatter ad hoc metric builders across the codebase.

Better:

@Component
public class CaseMetrics {

    private final MeterRegistry registry;

    public CaseMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordTransition(String from, String to, TransitionOutcome outcome) {
        Counter.builder("case.transition.attempts")
            .tag("from", from)
            .tag("to", to)
            .tag("outcome", outcome.metricValue())
            .register(registry)
            .increment();
    }

    public Timer transitionTimer(String transition, String outcome) {
        return Timer.builder("case.transition.duration")
            .tag("transition", transition)
            .tag("outcome", outcome)
            .publishPercentileHistogram()
            .register(registry);
    }
}

But do not build a single “God Metrics Class” for the whole application. Use one facade per bounded context/module.

15. Naming and Tag Conventions

Use a consistent convention.

<domain>.<operation>.<measurement>

Examples:

case.approval.attempts
case.approval.duration
case.approval.rejections
case.transition.conflicts
case.audit.write.failures
case.manual_review.queue.depth
case.outbox.lag

Prometheus typically exposes dots as underscores and adds type or unit suffixes such as _total and _seconds:

case_approval_attempts_total
case_approval_duration_seconds
case_manual_review_queue_depth
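As a rough illustration of this mapping, the sketch below converts a dotted Micrometer name into a Prometheus-style name. It approximates only the visible behaviour described in this section (dots to underscores, _total for counters, _seconds for timers). It is not the registry's actual naming convention code, and PrometheusNameSketch and MeterKind are hypothetical.

```java
// Hypothetical, reduced set of meter kinds for the illustration.
enum MeterKind { COUNTER, TIMER, GAUGE }

final class PrometheusNameSketch {

    static String toPrometheusName(String micrometerName, MeterKind kind) {
        String name = micrometerName.replace('.', '_');
        switch (kind) {
            case COUNTER:
                // Counters carry a _total suffix; do not add it twice.
                return name.endsWith("_total") ? name : name + "_total";
            case TIMER:
                // Timers are reported in seconds.
                return name.endsWith("_seconds") ? name : name + "_seconds";
            default:
                return name;
        }
    }
}
```

When a dashboard query returns nothing, comparing the expected name against the actual /actuator/prometheus output is usually the quickest check.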

15.1 Tag Convention

Common tags:

Tag	Values	Notes
`outcome`	`success`, `failure`, `rejected`, `timeout`	bounded
`reason`	enum code	bounded
`operation`	stable operation name	bounded
`dependency`	stable service/db name	bounded
`route`	route template	bounded-ish
`method`	HTTP method	bounded
`status`	HTTP status	bounded
`exception`	avoid or controlled category	risky
`tenant`	avoid unless controlled	high risk

16. MeterFilter

MeterFilter can be used to:

rename metric
deny metric
cap tags
add common tags
configure distribution statistics

Example common tags:

@Configuration
public class MetricsConfig {

    @Bean
    MeterRegistryCustomizer<MeterRegistry> commonTags() {
        return registry -> registry.config().commonTags(
            "application", "case-service",
            "runtime", "java"
        );
    }
}

Example deny high-cardinality metric:

@Bean
MeterFilter denyRawUriMetrics() {
    return MeterFilter.deny(id ->
        id.getName().equals("http.server.requests") &&
        id.getTag("uri") != null &&
        id.getTag("uri").matches(".*[0-9a-fA-F-]{8,}.*")
    );
}

Better solution: fix route templating at source. Filter is safety net, not primary design.

17. Distribution Configuration

For percentile queries in Prometheus, use histogram buckets.

management:
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true
        case.approval.duration: true
      slo:
        http.server.requests: 50ms,100ms,250ms,500ms,1s,2s,5s
        case.approval.duration: 100ms,250ms,500ms,1s,2s

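A MeterFilter can also cap, programmatically, how many distinct reason values this timer accepts, which bounds the number of histogram series it can produce:

@Bean
MeterFilter capApprovalReasonTags() {
    // Once 20 distinct `reason` values have been seen, meters with further new values are denied.
    return MeterFilter.maximumAllowableTags(
        "case.approval.duration",
        "reason",
        20,
        MeterFilter.deny()
    );
}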

Be careful:

histograms multiply time series by bucket count
each tag combination multiplies bucket series
enable histograms intentionally
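The multiplication can be estimated before a histogram is enabled. The sketch below is back-of-the-envelope arithmetic under stated assumptions: each tag combination produces one series per finite bucket, one +Inf bucket, plus _count and _sum, and any extra series a registry may add are ignored. SeriesEstimate is a hypothetical helper.

```java
final class SeriesEstimate {

    // Number of tag combinations = product of the distinct values per tag key.
    static long tagCombinations(int... distinctValuesPerTag) {
        long combinations = 1;
        for (int values : distinctValuesPerTag) {
            combinations *= values;
        }
        return combinations;
    }

    // Upper bound on series for one histogram-enabled meter.
    static long histogramSeries(int finiteBuckets, int... distinctValuesPerTag) {
        // finite buckets + the +Inf bucket + _count + _sum
        long perCombination = finiteBuckets + 1L + 2L;
        return tagCombinations(distinctValuesPerTag) * perCombination;
    }
}
```

For example, 5 buckets with 4 outcome values and 10 reason values gives at most 40 × 8 = 320 series for one meter on one instance. The result is an upper bound, since not every tag combination occurs in practice.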

18. HTTP Server Metrics

Spring Boot automatically instruments HTTP server requests.

Common dimensions include:

method
uri/route
status
outcome
exception

Important:

uri should be route template like /cases/{id}, not raw /cases/123.
High-cardinality URL tags are dangerous.
Keep endpoint grouping meaningful.

Example PromQL request rate:

sum(rate(http_server_requests_seconds_count[5m])) by (uri, method)

Error ratio:

sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (uri)
/
sum(rate(http_server_requests_seconds_count[5m])) by (uri)

p95 latency:

histogram_quantile(
  0.95,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri)
)

19. Dependency Metrics

Instrument dependency calls explicitly.

public RiskScore getRiskScore(CaseId caseId) {
    Timer.Sample sample = Timer.start(registry);
    String outcome = "success";

    try {
        return riskClient.getRiskScore(caseId);
    } catch (TimeoutException ex) {
        outcome = "timeout";
        throw ex;
    } catch (Exception ex) {
        outcome = "failure";
        throw ex;
    } finally {
        sample.stop(Timer.builder("dependency.call.duration")
            .tag("dependency", "risk-service")
            .tag("operation", "getRiskScore")
            .tag("outcome", outcome)
            .publishPercentileHistogram()
            .register(registry));
    }
}

Counter for calls:

Counter.builder("dependency.call.attempts")
    .tag("dependency", "risk-service")
    .tag("operation", "getRiskScore")
    .tag("outcome", outcome)
    .register(registry)
    .increment();

Metrics to add:

call rate
error/timeout rate
latency histogram
retry attempts
circuit breaker state
bulkhead rejection
fallback usage

20. Retry Metrics

@Component
public class RetryMetrics {

    private final MeterRegistry registry;

    public RetryMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordAttempt(String dependency, int attempt, String reason) {
        Counter.builder("retry.attempts")
            .tag("dependency", dependency)
            .tag("attempt", attempt <= 3 ? String.valueOf(attempt) : "gt3")
            .tag("reason", reason)
            .register(registry)
            .increment();
    }

    public void recordExhausted(String dependency, String reason) {
        Counter.builder("retry.exhausted")
            .tag("dependency", dependency)
            .tag("reason", reason)
            .register(registry)
            .increment();
    }
}

Note the bounded attempt tag. Never tag with dynamic request ID.

21. Circuit Breaker Metrics

If using Resilience4j, Micrometer integration can publish circuit breaker metrics. Even without framework integration, the concepts should be visible:

circuit_breaker_calls_total{dependency="risk-service", outcome="success"}
circuit_breaker_calls_total{dependency="risk-service", outcome="error"}
circuit_breaker_state{dependency="risk-service", state="open"}
circuit_breaker_not_permitted_calls_total{dependency="risk-service"}

Dashboard should show:

dependency error rate
dependency latency
retry attempts
circuit breaker state
fallback usage
user-facing error rate

Without this chain, on-call sees only “API 500 high” but not why.

22. Queue Metrics

For queue/message systems:

message.consumed.total
message.processing.duration
message.processing.failures.total
message.retry.attempts.total
message.dlq.total
message.consumer.lag
message.oldest.age.seconds

Example:

Gauge.builder("case.manual_review.queue.depth", queue, ManualReviewQueue::depth)
    .register(registry);

Gauge.builder("case.manual_review.queue.oldest_age", queue, ManualReviewQueue::oldestAgeSeconds)
    .baseUnit("seconds")
    .register(registry);

oldest_age often matters more than raw queue depth.
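The gauges above read depth and oldestAgeSeconds from a ManualReviewQueue that this lesson never defines. The sketch below shows one way such a type could track both values. It is hypothetical: the class name, the time source passed as a Supplier, and the choice to report 0 for an empty queue are all assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.function.Supplier;

// Hypothetical queue that exposes the two values the gauges read.
final class ManualReviewQueueSketch {

    // Enqueue timestamps in arrival order; the head is the oldest waiting item.
    private final ConcurrentLinkedDeque<Instant> enqueuedAt = new ConcurrentLinkedDeque<>();
    private final Supplier<Instant> now;

    ManualReviewQueueSketch(Supplier<Instant> now) {
        this.now = now;
    }

    void enqueue() {
        enqueuedAt.addLast(now.get());
    }

    void dequeue() {
        enqueuedAt.pollFirst();
    }

    // Current number of waiting items.
    double depth() {
        return enqueuedAt.size();
    }

    // Age in seconds of the oldest waiting item; 0 when the queue is empty.
    double oldestAgeSeconds() {
        Instant oldest = enqueuedAt.peekFirst();
        if (oldest == null) {
            return 0.0;
        }
        return Duration.between(oldest, now.get()).toMillis() / 1000.0;
    }
}
```

A queue with a depth of 2 can still be unhealthy if its head has waited for an hour, which is the situation the age gauge makes visible.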

23. Domain Metrics Example: Case Approval

@Service
public class CaseApprovalService {

    private final MeterRegistry registry;
    private final CaseRepository repository;
    private final PolicyEngine policyEngine;

    public CaseApprovalService(
        MeterRegistry registry,
        CaseRepository repository,
        PolicyEngine policyEngine
    ) {
        this.registry = registry;
        this.repository = repository;
        this.policyEngine = policyEngine;
    }

    public ApprovalResult approve(ApproveCaseCommand command) {
        Timer.Sample sample = Timer.start(registry);
        String outcome = "success";
        String reason = "none";

        try {
            CaseAggregate aggregate = repository.get(command.caseId());
            PolicyDecision decision = policyEngine.evaluate(command, aggregate);

            if (decision.denied()) {
                outcome = "rejected";
                reason = decision.reason().name();
                throw new DomainRejectionException(decision.reason());
            }

            aggregate.approve(command.actor());
            repository.save(aggregate);
            return ApprovalResult.approved(command.caseId());
        } catch (DomainRejectionException ex) {
            // Handle rejection before the generic catch so it is not relabelled as "failure".
            outcome = "rejected";
            throw ex;
        } catch (OptimisticLockingFailureException ex) {
            outcome = "conflict";
            reason = "VERSION_CONFLICT";
            throw ex;
        } catch (Exception ex) {
            outcome = "failure";
            reason = "UNEXPECTED";
            throw ex;
        } finally {
            // Count and time exactly once per call, using the final outcome.
            incrementApproval(outcome, reason);
            sample.stop(Timer.builder("case.approval.duration")
                .tag("outcome", outcome)
                .tag("reason", reason)
                .publishPercentileHistogram()
                .register(registry));
        }
    }

    private void incrementApproval(String outcome, String reason) {
        Counter.builder("case.approval.attempts")
            .tag("outcome", outcome)
            .tag("reason", reason)
            .register(registry)
            .increment();
    }
}

Review:

reason must be bounded.
caseId is not a tag.
Duration is recorded in finally.
Domain rejection is separate from unexpected failure.
Conflict is visible separately.

24. Avoid Duplicate Registration Hot Path Cost

Calling builder/register repeatedly can be acceptable for low volume and bounded tags, but for hot paths we often pre-register common meters.

Example:

@Component
public class ApprovalMeters {

    private final Map<String, Counter> counters;

    public ApprovalMeters(MeterRegistry registry) {
        this.counters = Map.of(
            "success:none", counter(registry, "success", "none"),
            "rejected:MISSING_APPROVAL_AUTHORITY", counter(registry, "rejected", "MISSING_APPROVAL_AUTHORITY"),
            "conflict:VERSION_CONFLICT", counter(registry, "conflict", "VERSION_CONFLICT"),
            "failure:UNEXPECTED", counter(registry, "failure", "UNEXPECTED")
        );
    }

    public void increment(String outcome, String reason) {
        counters.getOrDefault(outcome + ":" + reason, counters.get("failure:UNEXPECTED"))
            .increment();
    }

    private Counter counter(MeterRegistry registry, String outcome, String reason) {
        return Counter.builder("case.approval.attempts")
            .tag("outcome", outcome)
            .tag("reason", reason)
            .register(registry);
    }
}

This makes allowed combinations explicit.

25. Prometheus Scrape Configuration

Minimal prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: case-service
    metrics_path: /actuator/prometheus
    static_configs:
      - targets:
          - case-service:8080

In Kubernetes, service discovery usually replaces static targets.

Key choices:

Setting	Impact
`scrape_interval`	freshness vs cost
`metrics_path`	endpoint path
target labels	service/cluster/namespace identification
relabeling	normalize/drop labels
retention	storage cost

26. PromQL Basics for Java Services

26.1 Request Rate

sum(rate(http_server_requests_seconds_count[5m])) by (uri)

26.2 5xx Ratio

sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (uri)
/
sum(rate(http_server_requests_seconds_count[5m])) by (uri)

26.3 p95 Latency

histogram_quantile(
  0.95,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri)
)
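To see what histogram_quantile computes from the _bucket series, the sketch below applies linear interpolation inside the bucket that contains the requested rank. It is a simplified illustration of the idea for classic cumulative buckets; the real function handles more edge cases, and HistogramQuantileSketch is a hypothetical name.

```java
final class HistogramQuantileSketch {

    // upperBounds: ascending finite bucket bounds (the `le` values), excluding +Inf.
    // cumulativeCounts: cumulative count per bound, plus one final entry for the +Inf bucket.
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lowerBound = i == 0 ? 0.0 : upperBounds[i - 1];
                double countBelow = i == 0 ? 0.0 : cumulativeCounts[i - 1];
                double inBucket = cumulativeCounts[i] - countBelow;
                if (inBucket == 0) {
                    return lowerBound;
                }
                // Assume observations are spread evenly inside the bucket.
                return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / inBucket;
            }
        }
        // The rank falls into the +Inf bucket: report the highest finite bound.
        return upperBounds[upperBounds.length - 1];
    }
}
```

The result can never be more precise than the bucket layout, so bucket boundaries should sit near the latency thresholds that matter for the SLO.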

26.4 Dependency Timeout Rate

sum(rate(dependency_call_attempts_total{outcome="timeout"}[5m])) by (dependency)

26.5 Retry Exhaustion

sum(rate(retry_exhausted_total[5m])) by (dependency, reason)

26.6 Queue Age

# The gauge shown earlier is registered without a queue tag, so there is no queue label to group by.
max(case_manual_review_queue_oldest_age_seconds)

27. Dashboard Design

A production dashboard should answer in order:

Is user/domain impact happening?
Which operation is impacted?
Is it latency, error, or saturation?
Which dependency/resource correlates?
What changed recently?
Which mitigation is safe?

27.1 SLO Row

availability SLI
latency SLI
error budget remaining
burn rate

27.2 Service Row

request rate
error ratio
p50/p95/p99 latency
top routes by error

27.3 Dependency Row

dependency call rate
dependency timeout rate
dependency p95 latency
circuit breaker open state

27.4 Resource Row

CPU
heap
GC pause
executor queue
DB pool active/pending

27.5 Domain Row

transition rejection rate
policy denial rate
audit write failure
manual review queue age
conflict rate

28. Alert Rules

28.1 Fast Burn Availability Alert

groups:
  - name: case-service-slo
    rules:
      - alert: CaseServiceFastBurn
        expr: |
          (
            sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            /
            sum(rate(http_server_requests_seconds_count[5m]))
          ) > 0.02
        for: 10m
        labels:
          severity: page
          service: case-service
        annotations:
          summary: "case-service has high 5xx ratio"
          runbook: "Check deploys, dependency health, DB pool, retry storm, and circuit breaker state."

This is simplified. Mature SLO burn alerting should compare against the allowed error budget and use multi-window logic.
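The arithmetic behind burn-rate alerting is small enough to sketch. Burn rate is the observed error ratio divided by the error budget, and a multi-window rule fires only when both a long and a short window exceed the threshold. The SLO target and the threshold are inputs you choose; BurnRateSketch and its method names are hypothetical.

```java
final class BurnRateSketch {

    // Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
    // A value of 1.0 consumes the budget exactly at the pace the SLO allows.
    static double burnRate(double errorRatio, double sloTarget) {
        double errorBudget = 1.0 - sloTarget;
        return errorRatio / errorBudget;
    }

    // Page only if both windows burn too fast: the long window shows the problem is
    // significant, the short window shows it is still happening.
    static boolean shouldPage(double longWindowErrorRatio, double shortWindowErrorRatio,
                              double sloTarget, double burnRateThreshold) {
        return burnRate(longWindowErrorRatio, sloTarget) > burnRateThreshold
            && burnRate(shortWindowErrorRatio, sloTarget) > burnRateThreshold;
    }
}
```

Seen this way, the fixed 0.02 threshold in the rule above corresponds to a burn rate of about 2 for a 99% SLO, but about 20 for a 99.9% SLO, which is why the threshold should be derived from the budget rather than picked in isolation.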

28.2 Queue Age Alert

- alert: ManualReviewQueueTooOld
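  # No queue="..." selector: the gauge in this lesson is registered without a queue tag, so the selector would match nothing.
  expr: max(case_manual_review_queue_oldest_age_seconds) > 1800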
  for: 15m
  labels:
    severity: ticket
    service: case-service
  annotations:
    summary: "Manual review queue oldest item is older than 30 minutes"

28.3 Audit Write Failure Alert

- alert: AuditWriteFailures
  expr: sum(rate(case_audit_write_failures_total[5m])) > 0
  for: 5m
  labels:
    severity: page
    service: case-service
  annotations:
    summary: "Audit writes are failing"

For regulatory systems, audit failure may be page-worthy even at low volume.

29. Metrics and Logs/Traces Correlation

Micrometer metrics do not replace logs/traces.

Design correlation:

Metric label	Log field	Trace attribute
`operation`	`operation`	`operation.name`
`outcome`	`outcome`	`app.outcome`
`reason`	`error.code`	`error.code`
`dependency`	`dependency`	`peer.service`
`route`	`http.route`	`http.route`

When alert fires:

dashboard shows high reason=DEPENDENCY_TIMEOUT
trace query filters span attribute error.code=DEPENDENCY_TIMEOUT
logs filter error.code=DEPENDENCY_TIMEOUT and trace_id=<trace>

30. Testing Metrics

30.1 Unit Test With `SimpleMeterRegistry`

class CaseApprovalMetricsTest {

    @Test
    void incrementsSuccessCounter() {
        SimpleMeterRegistry registry = new SimpleMeterRegistry();
        CaseApprovalMetrics metrics = new CaseApprovalMetrics(registry);

        metrics.recordSuccess();

        Counter counter = registry.find("case.approval.attempts")
            .tag("outcome", "success")
            .counter();

        assertThat(counter).isNotNull();
        assertThat(counter.count()).isEqualTo(1.0);
    }
}

30.2 Timer Test

@Test
void recordsApprovalDuration() {
    SimpleMeterRegistry registry = new SimpleMeterRegistry();
    CaseApprovalMetrics metrics = new CaseApprovalMetrics(registry);

    Timer.Sample sample = Timer.start(registry);
    sample.stop(metrics.timer("success"));

    Timer timer = registry.find("case.approval.duration")
        .tag("outcome", "success")
        .timer();

    assertThat(timer).isNotNull();
    assertThat(timer.count()).isEqualTo(1);
}

30.3 Actuator Endpoint Test

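@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
// Depending on the Spring Boot version, export registries such as Prometheus are not auto-configured
// in @SpringBootTest by default; Spring Boot 3.x re-enables them with this annotation
// (Spring Boot 2.x uses @AutoConfigureMetrics). Check the testing docs for your version.
@AutoConfigureObservability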
class PrometheusEndpointTest {

    @Autowired
    TestRestTemplate rest;

    @Test
    void exposesPrometheusEndpoint() {
        ResponseEntity<String> response = rest.getForEntity(
            "/actuator/prometheus",
            String.class
        );

        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
        assertThat(response.getBody()).contains("jvm_memory_used_bytes");
    }
}

30.4 Cardinality Regression Test

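@Test
void doesNotTagMetricsWithCaseId() {
    SimpleMeterRegistry registry = new SimpleMeterRegistry();
    // The service must record into the registry under inspection.
    // `repository` and `policyEngine` are assumed to be test doubles set up elsewhere in the test class.
    CaseApprovalService service = new CaseApprovalService(registry, repository, policyEngine);

    service.approve(new ApproveCaseCommand("case-123", "actor-1"));

    assertThat(registry.getMeters())
        .flatExtracting(meter -> meter.getId().getTags())
        .noneMatch(tag -> tag.getValue().equals("case-123"));
}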

31. Security and Exposure

Actuator endpoints can reveal sensitive operational details.

Minimum rules:

expose only needed endpoints
protect management port/path
restrict network access
avoid exposing env/config endpoints publicly
do not include secrets in metric tags
do not tag with PII
ensure scrape endpoint is only accessible to monitoring infrastructure

Example separate management port:

management:
  server:
    port: 9090
  endpoints:
    web:
      exposure:
        include: health,info,prometheus

Then restrict port 9090 at network/Kubernetes policy.

32. Production Pitfalls

32.1 Raw URL Cardinality

Symptom:

uri="/cases/123"
uri="/cases/124"
uri="/cases/125"

Fix:

uri="/cases/{id}"
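Spring's HTTP server instrumentation normally supplies the route template for matched handlers. For code paths where only a raw path is available, such as hand-written client or gateway metrics, a normalizer along the following lines can serve as a fallback. This is a sketch under assumptions: RouteTemplateSketch is hypothetical, and treating every numeric or UUID-shaped segment as {id} is a heuristic that may not suit every API.

```java
import java.util.regex.Pattern;

final class RouteTemplateSketch {

    private static final Pattern NUMERIC = Pattern.compile("\\d+");
    private static final Pattern UUID = Pattern.compile(
        "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    // Replaces identifier-like path segments with a fixed placeholder.
    static String normalize(String rawPath) {
        String[] segments = rawPath.split("/", -1);
        for (int i = 0; i < segments.length; i++) {
            if (NUMERIC.matcher(segments[i]).matches() || UUID.matcher(segments[i]).matches()) {
                segments[i] = "{id}";
            }
        }
        return String.join("/", segments);
    }
}
```

Static segments pass through unchanged, so /cases/123/approve and /cases/124/approve collapse into the same series.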

32.2 Exception Message as Tag

Bad:

reason="Timeout after 2851ms calling risk-service instance 10.2.1.9"

Fix:

reason="DEPENDENCY_TIMEOUT"

32.3 Per-Tenant Metrics Without Budget

Tenant metrics can be useful but risky. If tenant count is high, prefer:

aggregate metric
logs/traces with tenant ID
sampled per-tenant analytics pipeline
top-N offline analysis
dedicated metrics only for small bounded tenant set
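The last option, dedicated metrics for a small bounded tenant set, can be enforced in code rather than by convention. The sketch below maps any tenant outside an explicit allowlist to a shared value. TenantTagSketch, the allowlist contents, and the "other" bucket name are hypothetical.

```java
import java.util.Set;

final class TenantTagSketch {

    private final Set<String> trackedTenants;

    TenantTagSketch(Set<String> trackedTenants) {
        // Defensive copy so the allowlist cannot grow after construction.
        this.trackedTenants = Set.copyOf(trackedTenants);
    }

    // Returns the tenant ID only for allowlisted tenants; everyone else shares "other".
    String tagValue(String tenantId) {
        if (tenantId != null && trackedTenants.contains(tenantId)) {
            return tenantId;
        }
        return "other";
    }
}
```

The number of distinct tag values is then capped at the allowlist size plus one, regardless of how many tenants exist.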

32.4 Histogram Everywhere

Histograms are powerful but costly. Enable for:

request latency
dependency latency
critical domain operation duration
payload/batch distribution where needed

Do not enable for every timer blindly.

32.5 Metrics Without SLO

A dashboard with 200 panels and no SLO is not observability. It is visual noise.

32.6 No Metric Ownership

Every custom metric should have:

owning team
purpose
expected range
related dashboard
related alert, if critical
retirement condition

33. Implementation Checklist

Before merging metrics instrumentation:

34. Capstone Exercise

Build metrics for a CaseApprovalService with these requirements:

Count approval attempts by outcome and bounded reason.
Measure approval duration with buckets around 100ms, 250ms, 500ms, 1s, 2s.
Count dependency timeout to risk-service.
Count retry attempts and retry exhaustion.
Gauge manual review queue depth.
Gauge oldest manual review age.
Count audit write failures.
Expose /actuator/prometheus.
Write PromQL for:
- approval rate
- rejection ratio
- p95 approval duration
- audit failure alert
- queue oldest age alert
Write tests ensuring no caseId appears as metric tag.

Expected result:

A service owner can detect user impact, dependency failure, domain rejection surge, backlog growth, audit failure, and metric cardinality regression before customer support escalates.

35. Summary

Micrometer + Spring Boot Actuator + Prometheus gives Java teams a strong metrics foundation, but the framework does not design reliability for us.

The production-grade skill is knowing:

what to measure
which meter type to use
which labels are safe
how to keep cardinality bounded
how to map metrics to SLI/SLO
how to alert on customer/domain impact
how to test metrics as operational contracts

A mature Java service should expose not only JVM and HTTP metrics, but also domain failure, dependency degradation, retry/fallback behavior, queue health, shutdown safety, and audit integrity.

The next part covers the tracing mental model: trace, span, parent-child relationship, causal path, critical path, sampling, and why tracing is the missing middle between metrics and logs.

References

Micrometer Reference Documentation: https://docs.micrometer.io/micrometer/reference/index.html
Micrometer Meters: https://docs.micrometer.io/micrometer/reference/concepts/meters.html
Micrometer Timers: https://docs.micrometer.io/micrometer/reference/concepts/timers.html
Micrometer Gauges: https://docs.micrometer.io/micrometer/reference/concepts/gauges.html
Micrometer Distribution Summaries: https://docs.micrometer.io/micrometer/reference/concepts/distribution-summaries.html
Spring Boot Actuator Metrics: https://docs.spring.io/spring-boot/reference/actuator/metrics.html
Spring Boot Actuator Prometheus Endpoint: https://docs.spring.io/spring-boot/api/rest/actuator/prometheus.html
Prometheus Metric Types: https://prometheus.io/docs/concepts/metric_types/
