Learn Frontend React Production Architecture Part 033 Observability Error Boundaries And Frontend Reliability

title: Learn Frontend React Production Architecture - Part 033
description: Production-grade guide to frontend observability, error boundaries, reliability engineering, logging, metrics, tracing, Web Vitals, session replay, release health, incident response, and anti-patterns in React applications.
series: learn-frontend-react-production-architecture
seriesTitle: Learn Frontend React Production Architecture
order: 33
partTitle: Observability, Error Boundaries, and Frontend Reliability
tags: react, frontend, observability, reliability, error-boundaries, logging, metrics, tracing, web-vitals, production, series
date: 2026-06-28

Part 033 — Observability, Error Boundaries, and Frontend Reliability

Learning Objectives

A production frontend is not finished when the tests are green and the deploy succeeds.

Once the application is used by thousands of users with different devices, browsers, networks, auth states, permissions, data shapes, and behaviors, new bugs will appear.

Frontend reliability answers:

can users actually use the app?
which pages crash?
which release made the error rate worse?
which routes are slow?
which APIs fail often?
do chunk load errors occur after a deploy?
do users see a blank screen?
does the approval action fail because of a 409, a 403, or the network?
is realtime disconnected?
are Web Vitals getting worse?
does the error occur only in certain browsers?
does the app recover well after a failure?

Observability is the ability to answer these questions from production data, without guessing.

1. Core Mental Model

Frontend reliability loop:

Detect:

error tracking,
metrics,
logs,
traces,
Web Vitals,
synthetic checks,
user feedback.

Triage:

route,
release,
browser,
device,
user segment,
API status,
stack trace,
replay/breadcrumbs.

Contain:

feature flag off,
rollback,
disable broken route/action,
degrade gracefully.

Recover:

error boundary fallback,
retry,
reload prompt,
cache reset,
reconnect.

Learn:

postmortem,
test added,
monitor added,
checklist updated.

2. Observability vs Monitoring

Monitoring asks:

Is something known bad happening?

Observability asks:

Can we understand unknown bad behavior from emitted signals?

Monitoring:

error rate above threshold,
LCP p75 above budget,
5xx API error spike,
chunk load failures.

Observability:

which route?
which release?
which browser?
which user flow?
what happened before error?
which API request failed?
did retry help?
did feature flag correlate?

Production frontend needs both.

3. Frontend Telemetry Signals

Signals:

Signal	Examples
errors	uncaught exception, promise rejection, render crash
logs	structured diagnostic events
metrics	error rate, Web Vitals, route latency
traces	navigation/action/API spans
breadcrumbs	user actions before failure
session replay	visual reproduction, privacy-safe
network telemetry	API status/latency
release health	crash-free sessions/users
feature flags	flag state at error time
device/browser info	browser, OS, viewport, memory
custom domain events	approval submitted, conflict shown

Do not collect everything blindly. Collect what helps reliability while respecting privacy/security.

4. Error Types in Frontend

Frontend errors include:

Error Type	Example
render error	component throws during render
event handler error	click handler throws
async error	promise rejection
resource error	script/image chunk fails
API error	500/403/409/network
hydration error	server/client markup mismatch
routing error	route loader/action fails
state invariant error	impossible state reached
browser API error	storage denied, clipboard blocked
realtime error	socket disconnect, invalid event
validation error	user input invalid
domain error	conflict/forbidden/locked

Not all errors are bugs. Some are expected states. Observability should distinguish.

Example: 409 conflict in workflow is expected domain condition, not necessarily production bug. But a sudden spike may signal UX/concurrency issue.
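
One way to make this distinction concrete is to decide, per error, whether it goes to the error tracker or is only counted as a metric. The sketch below assumes an already normalized error shape; `AppError`, `EXPECTED_STATUSES`, and `telemetryChannel` are hypothetical names, and the set of "expected" statuses is an assumption that each product would tune.

```typescript
// Hypothetical normalized error shape for this sketch.
interface AppError {
  kind: "http" | "network" | "unknown";
  status?: number;
}

type TelemetryChannel = "metric" | "error";

// Statuses treated as expected domain outcomes (an assumption, not a standard).
const EXPECTED_STATUSES = new Set<number>([403, 404, 409, 422]);

// Expected domain outcomes are counted as metrics so spikes stay visible;
// everything else is sent to the error tracker as a potential bug.
function telemetryChannel(error: AppError): TelemetryChannel {
  if (
    error.kind === "http" &&
    error.status !== undefined &&
    EXPECTED_STATUSES.has(error.status)
  ) {
    return "metric";
  }
  return "error";
}
```

Counting a 409 as a metric rather than dropping it keeps the spike case observable without polluting the crash reports.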

5. Error Boundary Mental Model

React error boundaries catch JavaScript errors thrown during rendering, in lifecycle methods, and in constructors of the components below them in the tree.

They do not catch every error. Not covered:

errors in event handlers,
errors in asynchronous code (setTimeout callbacks, promise rejections),
errors during server-side rendering,
errors thrown inside the boundary itself rather than in its children.

Error boundary purpose:

prevent whole app from unmounting,
show fallback UI,
log error,
allow reset/retry where possible,
isolate failure by route/widget.

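A class component becomes an error boundary by implementing either or both of:

static getDerivedStateFromError,
componentDidCatch.

Function components cannot be error boundaries on their own; they are wrapped in a class-based boundary.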

Example:

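import React from "react";

// reportError is this lesson's telemetry helper, assumed to be defined elsewhere.
class RouteErrorBoundary extends React.Component<
  { children: React.ReactNode; fallback: React.ReactNode },
  { hasError: boolean }
> {
  state = {
    hasError: false,
  };

  // Called during render after a child throws: switch to the fallback UI.
  static getDerivedStateFromError() {
    return { hasError: true };
  }

  // Called after commit: the place for side effects such as reporting.
  componentDidCatch(error: Error, info: React.ErrorInfo) {
    reportError(error, {
      componentStack: info.componentStack,
      boundary: "RouteErrorBoundary",
    });
  }

  render() {
    if (this.state.hasError) {
      return this.props.fallback;
    }

    return this.props.children;
  }
}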

6. Error Boundary Placement

Error boundary placement is architecture.

Root boundary:

catches catastrophic app failure,
fallback may ask reload.

Route boundary:

isolates broken route,
user can navigate elsewhere.

Widget boundary:

isolates chart/timeline/third-party widget,
page remains usable.

Do not rely only on root boundary. A chart crash should not blank entire case detail page.

7. Error Boundary Fallback UX

Bad fallback:

Something went wrong.

Better fallback includes:

user-friendly explanation,
retry action,
reload if needed,
navigation fallback,
support reference/trace id if safe,
no sensitive stack trace,
preserves app shell where possible.

Example:

import { Link } from "react-router-dom";

function RouteErrorFallback({ onRetry }: { onRetry: () => void }) {
  return (
    // role="alert" makes screen readers announce the failure.
    <section role="alert">
      <h1>We could not show this page</h1>
      <p>Try again. If the problem continues, contact support.</p>
      <button type="button" onClick={onRetry}>Retry</button>
      <Link to="/cases">Back to cases</Link>
    </section>
  );
}

Fallback must be accessible.

8. Resetting Error Boundaries

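An error boundary should reset when the route or resource changes. Changing the boundary's key makes React unmount it and mount a fresh instance, which clears the error state.

Example with key: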

<RouteErrorBoundary
  key={location.pathname}
  fallback={<RouteErrorFallback onRetry={() => window.location.reload()} />}
>
  <Outlet />
</RouteErrorBoundary>

For case detail:

<CaseBoundary key={caseId}>
  <CaseDetail caseId={caseId} />
</CaseBoundary>

If the boundary does not reset, the user can stay stuck on the fallback after navigating away.

9. Route Error Boundaries

Framework/router route error boundaries can handle loader/action/render errors at route level.

Route-level fallback can distinguish:

not found,
forbidden,
loader failed,
thrown response,
unexpected render crash.

Pattern:

function CaseRouteErrorBoundary() {
  const error = useRouteError();

  if (isForbidden(error)) {
    return <ForbiddenPage />;
  }

  if (isNotFound(error)) {
    return <CaseNotFound />;
  }

  return <RouteCrashFallback />;
}

Do not collapse all route errors into generic crash.

10. Event Handler Errors

Error boundaries do not catch errors thrown in event handlers, because handlers run outside React's render phase. Handle these errors explicitly, or let a global error handler report them.

async function handleApprove() {
  try {
    await approveCaseMutation.mutateAsync(input);
  } catch (error) {
    const normalized = normalizeAppError(error);
    showCommandError(normalized);
    reportHandledError(normalized, {
      action: "approveCase",
      caseId,
    });
  }
}

Handled domain errors should be UI states, not uncaught exceptions.

11. Promise Rejection Handling

Unhandled promise rejections are common.

Global handlers:

window.addEventListener("unhandledrejection", (event) => {
  reportError(event.reason, {
    source: "unhandledrejection",
  });
});

But do not use the global handler as a substitute for proper command error handling.

The global handler is a safety net.
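
Uncaught runtime errors and failed resource loads arrive through the `error` event rather than `unhandledrejection`. A minimal sketch that installs both handlers follows. `installGlobalHandlers`, `ListenerTarget`, and `Report` are hypothetical names; the target is typed structurally so the function can be exercised without a browser, and in an app `window` would be passed in.

```typescript
type Report = (error: unknown, context: { source: string }) => void;

// Minimal structural type; `window` is expected to satisfy it in a browser.
interface ListenerTarget {
  addEventListener(
    type: string,
    listener: (event: any) => void,
    useCapture?: boolean,
  ): void;
}

function installGlobalHandlers(target: ListenerTarget, report: Report): void {
  target.addEventListener("unhandledrejection", (event) => {
    report(event.reason, { source: "unhandledrejection" });
  });

  // Registering in the capture phase (third argument `true`) also sees
  // resource load failures, which do not bubble up to window.
  target.addEventListener(
    "error",
    (event) => {
      if ("message" in event) {
        // ErrorEvent: an uncaught runtime error.
        report(event.error ?? new Error(String(event.message)), { source: "error" });
      } else {
        // Plain Event: a script, image, or stylesheet failed to load.
        report(new Error("Resource failed to load"), { source: "resource" });
      }
    },
    true,
  );
}
```

Tagging each report with its `source` keeps safety-net traffic separable from errors that were handled deliberately.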

12. Resource and Chunk Load Errors

Chunk load failure can happen after deployment.

Detect:

dynamic import rejection,
script load error,
asset 404,
service worker stale cache.

Fallback:

function ChunkLoadErrorFallback() {
  return (
    <section role="alert">
      <h1>Application update available</h1>
      <p>Reload to continue.</p>
      <button onClick={() => window.location.reload()}>
        Reload
      </button>
    </section>
  );
}

Track:

chunk URL,
release id,
browser,
route,
asset status if available.

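The deployment strategy should keep the previous release's hashed assets available for a while, and serve index.html so that browsers revalidate it instead of reusing a stale copy that points at deleted chunks.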
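
Detection on the client usually comes down to recognising the rejection from a dynamic import. The sketch below is a heuristic under stated assumptions: the message patterns are ones commonly seen from bundlers and browsers but are not a complete list, and `isChunkLoadError`, `shouldAutoReload`, and the storage key are hypothetical names. The guard allows one automatic reload per release so that a persistent failure cannot cause a reload loop.

```typescript
// Message fragments seen for failed dynamic imports; heuristic, not exhaustive.
const CHUNK_ERROR_PATTERNS: RegExp[] = [
  /ChunkLoadError/i,
  /Loading chunk [\w-]+ failed/i,
  /Failed to fetch dynamically imported module/i,
  /error loading dynamically imported module/i,
  /Importing a module script failed/i,
];

function isChunkLoadError(error: unknown): boolean {
  if (!(error instanceof Error)) return false;
  const text = `${error.name} ${error.message}`;
  return CHUNK_ERROR_PATTERNS.some((pattern) => pattern.test(text));
}

// Subset of the Storage API; sessionStorage is expected to satisfy it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Returns true at most once per release id for chunk load errors.
function shouldAutoReload(
  error: unknown,
  store: KeyValueStore,
  releaseId: string,
): boolean {
  if (!isChunkLoadError(error)) return false;
  const key = "chunk-reload-attempted"; // hypothetical storage key
  if (store.getItem(key) === releaseId) return false;
  store.setItem(key, releaseId);
  return true;
}
```

When `shouldAutoReload` returns false for a chunk error, showing the reload prompt and reporting the failure is the safer path.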

13. Hydration and Recoverable Errors

SSR/hydration can produce mismatch warnings/errors.

Track:

hydration mismatch count,
route,
component,
release,
browser,
whether user-facing.

Common causes:

Date.now() in render,
random IDs not from useId,
locale/timezone mismatch,
browser-only data during SSR,
auth personalization mismatch,
invalid HTML nesting,
third-party DOM mutation.

Hydration errors are reliability issues, not only console noise.

14. Structured Logging

Logs should be structured.

Bad:

console.log("approve failed", error);

Better:

logger.error("case_approval_failed", {
  caseId,
  route: "/cases/:caseId",
  errorType: normalized.type,
  status: normalized.status,
  release: config.releaseId,
  traceId: normalized.traceId,
});

Rules:

use event names,
include route pattern, not raw sensitive URL if needed,
include release,
include feature flag state if useful,
redact sensitive values,
sample noisy logs,
avoid logging full payloads.
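
Several of these rules can live in one small wrapper instead of being repeated at every call site. The sketch below shows shared context (such as the release) merged into every event and per-event sampling for noisy logs. `createLogger`, `LoggerOptions`, and the `sink` callback are hypothetical; a real setup would forward events to whatever telemetry backend is in use.

```typescript
type LogLevel = "info" | "warn" | "error";

interface LogEvent {
  name: string;
  level: LogLevel;
  timestamp: string;
  fields: Record<string, unknown>;
}

interface LoggerOptions {
  base: Record<string, unknown>;          // context attached to every event
  sink: (event: LogEvent) => void;        // where events are delivered
  sampleRates?: Record<string, number>;   // event name -> keep probability (0..1)
  random?: () => number;                  // injectable for deterministic tests
}

function createLogger(options: LoggerOptions) {
  const random = options.random ?? Math.random;

  function emit(level: LogLevel, name: string, fields: Record<string, unknown> = {}): void {
    const rate = options.sampleRates?.[name] ?? 1;
    if (random() >= rate) return; // sampled out
    options.sink({
      name,
      level,
      timestamp: new Date().toISOString(),
      fields: { ...options.base, ...fields },
    });
  }

  return {
    info: (name: string, fields?: Record<string, unknown>) => emit("info", name, fields),
    warn: (name: string, fields?: Record<string, unknown>) => emit("warn", name, fields),
    error: (name: string, fields?: Record<string, unknown>) => emit("error", name, fields),
  };
}
```

Events without a configured sample rate are always kept, so sampling has to be opted into per event name.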

15. Breadcrumbs

Observability breadcrumbs capture preceding actions.

Examples:

addBreadcrumb({
  category: "navigation",
  message: "Opened case detail",
  data: { route: "/cases/:caseId" },
});

addBreadcrumb({
  category: "ui.action",
  message: "Clicked approve",
  data: { caseId },
});

Breadcrumbs help answer “what happened before crash?”

But never include sensitive form fields or document contents.
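
Error tracking SDKs typically keep breadcrumbs in a bounded buffer so that memory and payload size stay fixed. A minimal sketch of that idea follows; `createBreadcrumbTrail` and its default size of 30 are assumptions for illustration, not the behaviour of any particular SDK. Restricting `data` to primitive values makes it harder to attach whole form objects by accident.

```typescript
interface Breadcrumb {
  category: string;
  message: string;
  data?: Record<string, string | number | boolean>;
  timestamp: number;
}

function createBreadcrumbTrail(maxSize = 30, now: () => number = Date.now) {
  const items: Breadcrumb[] = [];

  return {
    // Appends a breadcrumb and drops the oldest one once the buffer is full.
    add(crumb: Omit<Breadcrumb, "timestamp">): void {
      items.push({ ...crumb, timestamp: now() });
      if (items.length > maxSize) items.shift();
    },
    // Returns a copy so that callers cannot mutate the buffer.
    snapshot(): Breadcrumb[] {
      return [...items];
    },
  };
}
```

The snapshot would be attached to an error report at the moment the error is captured.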

16. Metrics

Metrics are numerical signals.

Frontend metrics:

JS error rate,
crash-free sessions,
unhandled rejection count,
route transition p75/p95,
API latency by endpoint,
API error rate by status,
Web Vitals,
chunk load failures,
realtime reconnect count,
memory growth,
form submission success/failure,
conflict rate,
forbidden rate,
retry rate.

Metric needs dimensions:

route,
release,
browser,
device,
country/region if allowed,
network type if available,
user role/segment if privacy policy allows.

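Avoid high-cardinality dimensions like a raw case ID: every distinct value creates its own series, which raises cost and makes aggregation meaningless.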

17. Web Vitals Instrumentation

Track:

LCP,
INP,
CLS,
FCP/TTFB if useful.

Attach:

route,
release,
page type,
device class,
navigation type,
user segment if allowed.

Conceptual example:

import { onLCP } from "web-vitals";

onLCP((metric) => {
  reportMetric("web_vital_lcp", {
    value: metric.value,
    rating: metric.rating,
    route: getRoutePattern(),
    release: config.releaseId,
  });
});

Field Web Vitals reveal real user pain. Lab-only performance can miss production issues.
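
Field data is usually summarised at the 75th percentile and compared against the published Core Web Vitals thresholds (LCP 2.5 s / 4 s, INP 200 ms / 500 ms, CLS 0.1 / 0.25). The sketch below shows that aggregation on the analysis side. `percentile` uses the nearest-rank method, which is one of several valid definitions, and both function names are hypothetical.

```typescript
type VitalName = "LCP" | "INP" | "CLS";
type Rating = "good" | "needs-improvement" | "poor";

// [upper bound for "good", upper bound for "needs-improvement"]
const THRESHOLDS: Record<VitalName, [number, number]> = {
  LCP: [2500, 4000], // milliseconds
  INP: [200, 500],   // milliseconds
  CLS: [0.1, 0.25],  // unitless layout shift score
};

// Nearest-rank percentile over raw samples.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

function rateVital(name: VitalName, value: number): Rating {
  const [goodMax, needsImprovementMax] = THRESHOLDS[name];
  if (value <= goodMax) return "good";
  if (value <= needsImprovementMax) return "needs-improvement";
  return "poor";
}
```

For example, LCP samples of 1000, 2000, 3000, and 5000 ms have a nearest-rank p75 of 3000 ms, which rates as needs-improvement.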

18. Tracing

Tracing links work across frontend and backend.

Example trace:

User clicks Approve
  frontend span: approve_button_click
  frontend span: POST /cases/:id/approve
  backend span: approveCase handler
  db span: update case
  backend span: audit event insert
  frontend span: invalidate case detail

Benefits:

see end-to-end latency,
correlate frontend action with backend service,
debug slow command,
identify API bottleneck,
attach trace id to support ticket.

OpenTelemetry provides vendor-neutral APIs/SDKs for traces, metrics, and logs.
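
The link between frontend and backend spans is carried in a request header. Under the W3C Trace Context format that header is `traceparent`, laid out as version, trace id (16 bytes), parent span id (8 bytes), and flags, all in lowercase hex. The sketch below only illustrates the format; in practice a tracing SDK's propagator would generate and inject it. `createTraceparent` is a hypothetical helper, and the random source is injected so the example runs outside a browser.

```typescript
function toHex(bytes: Uint8Array): string {
  return Array.from(bytes, (byte) => byte.toString(16).padStart(2, "0")).join("");
}

interface TraceContext {
  traceId: string;
  spanId: string;
  header: string;
}

// randomBytes is injected; in a browser it could be
// (n) => crypto.getRandomValues(new Uint8Array(n)).
// Note: an all-zero trace id or span id is invalid under the specification.
function createTraceparent(
  randomBytes: (length: number) => Uint8Array,
  sampled = true,
): TraceContext {
  const traceId = toHex(randomBytes(16)); // 32 hex characters
  const spanId = toHex(randomBytes(8));   // 16 hex characters
  const flags = sampled ? "01" : "00";
  return { traceId, spanId, header: `00-${traceId}-${spanId}-${flags}` };
}
```

Showing the `traceId` in an error fallback or support reference lets a ticket be matched to the backend trace.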

19. Frontend Span Design

Spans:

navigation,
route loader,
API request,
form submit,
mutation,
realtime reconnect,
heavy computation,
file upload,
chunk load.

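Conceptual example:

import { SpanStatusCode, trace } from "@opentelemetry/api";

// The tracer name is an arbitrary label chosen for this example.
const tracer = trace.getTracer("case-frontend");
const span = tracer.startSpan("case.approve");

try {
  span.setAttribute("case.id", caseId);
  span.setAttribute("route", "/cases/:caseId");

  await approveCase(input);

  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  // Caught values are `unknown` in strict TypeScript; narrow before recording.
  span.recordException(error instanceof Error ? error : String(error));
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw error;
} finally {
  // Always end the span, otherwise it is never exported.
  span.end();
}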

Be careful with sensitive attributes. Do not attach approval reason.

20. API Observability

For each API request, capture:

method,
endpoint pattern,
status,
duration,
retry count,
aborted/timeout,
trace id,
release,
route,
error type.

Endpoint pattern:

GET /cases/:caseId

not raw:

GET /cases/CASE-2026-001

if IDs are sensitive/high-cardinality.
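
When the route template is known at the call site, passing it along is the most reliable option. Where it is not, a fallback normaliser can replace ID-like path segments. The sketch below is such a fallback; `toEndpointPattern` is a hypothetical helper, the ID shapes are assumptions (numeric ids, UUIDs, and ids shaped like `CASE-2026-001`), and it emits a generic `:id` because it cannot know parameter names.

```typescript
// Segment shapes treated as identifiers (assumptions for this sketch).
const ID_SEGMENT_PATTERNS: RegExp[] = [
  /^\d+$/, // numeric ids
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i, // UUIDs
  /^[A-Z]+-\d{4}-\d+$/, // ids shaped like CASE-2026-001
];

// Expects a path such as "/cases/42?tab=docs", not an absolute URL.
function toEndpointPattern(method: string, path: string): string {
  // Drop query string and fragment, which may carry sensitive values.
  const pathname = path.split(/[?#]/)[0] ?? "";
  const segments = pathname
    .split("/")
    .map((segment) =>
      ID_SEGMENT_PATTERNS.some((pattern) => pattern.test(segment)) ? ":id" : segment,
    );
  return `${method.toUpperCase()} ${segments.join("/")}`;
}
```

With these patterns, `toEndpointPattern("get", "/cases/CASE-2026-001?tab=docs")` yields `GET /cases/:id`.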

21. Domain Reliability Metrics

For workflow-heavy UI, track domain-oriented signals.

Examples:

approval submit success rate,
approval validation error rate,
conflict rate,
forbidden action attempt rate,
average approval command latency,
percentage of cases with stale update banner,
realtime disconnect duration,
file upload failure rate,
report generation timeout rate.

These metrics connect frontend reliability to business outcomes.

22. Release Health

Every event should include release id.

Release health tracks:

crash-free sessions,
error rate by release,
Web Vitals by release,
chunk failures by release,
API failure correlation,
feature flag state,
adoption percentage,
rollback trigger threshold.

If new release spikes errors, rollback or disable flag quickly.
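
A rollback trigger can be expressed as a comparison between the previous release and the new one. The sketch below compares crash-free session rates and refuses to judge a release before it has enough traffic. `evaluateRelease`, the minimum of 500 sessions, and the allowed drop of 0.5 percentage points are illustrative assumptions, not recommended values.

```typescript
interface ReleaseStats {
  releaseId: string;
  sessions: number;
  crashedSessions: number;
}

type ReleaseVerdict = "insufficient_data" | "healthy" | "rollback_candidate";

function crashFreeRate(stats: ReleaseStats): number {
  return stats.sessions === 0 ? 1 : 1 - stats.crashedSessions / stats.sessions;
}

function evaluateRelease(
  baseline: ReleaseStats,
  candidate: ReleaseStats,
  minSessions = 500, // below this, the rate is too noisy to act on
  maxDrop = 0.005,   // tolerated drop in crash-free rate (0.5 percentage points)
): ReleaseVerdict {
  if (candidate.sessions < minSessions) return "insufficient_data";
  const drop = crashFreeRate(baseline) - crashFreeRate(candidate);
  return drop > maxDrop ? "rollback_candidate" : "healthy";
}
```

The verdict is a prompt for a human or an automated gate; the choice between rollback and disabling a flag still depends on what the triage shows.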

23. Feature Flag Observability

Flags should be included in relevant telemetry.

Example:

reportError(error, {
  release: config.releaseId,
  flags: {
    newCaseTimeline: flagClient.get("newCaseTimeline"),
  },
});

But do not attach every flag to every event: payload size and cardinality grow quickly.

Use targeted flag context.

Feature flags are only useful for reliability if you can observe their impact.

24. Session Replay

Session replay can help reproduce UI issues.

Benefits:

see user steps,
inspect DOM state,
diagnose visual failures,
reproduce rage clicks/dead clicks.

Risks:

privacy,
PII capture,
sensitive form data,
regulatory data,
performance overhead.

Controls:

mask sensitive text/input,
disable for sensitive routes if needed,
sample,
access control,
retention limits,
privacy review.

For case management apps, session replay must be very carefully governed.

25. Privacy and Redaction

Telemetry must not leak sensitive data.

Avoid:

approval reason,
case subject details,
document content,
token/cookie,
full URL with sensitive query,
raw API response,
localStorage dump,
user PII unless approved,
screenshots/replay of sensitive screens without masking.

Redaction should be built into telemetry wrapper.

function safeRouteContext() {
  return {
    routePattern: getRoutePattern(),
    release: config.releaseId,
  };
}
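
A wrapper-level redaction step might look like the sketch below, which masks values by key name before anything leaves the browser. `redact` and the key pattern are hypothetical. A denylist like this is only a backstop: allowlisting the fields that may be sent is the stronger control, because a denylist misses any sensitive key it has not been told about.

```typescript
// Key names treated as sensitive (an assumption; extend per product).
const SENSITIVE_KEYS = /(token|cookie|password|authorization|secret|reason|email)/i;

// Recursively copies a value, masking entries whose key looks sensitive.
function redact(value: unknown, depth = 0): unknown {
  if (depth > 5) return "[truncated]"; // guard against very deep or cyclic data
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, depth + 1));
  }
  if (typeof value === "object" && value !== null) {
    const output: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      output[key] = SENSITIVE_KEYS.test(key) ? "[redacted]" : redact(inner, depth + 1);
    }
    return output;
  }
  return value;
}
```

Calling `redact` inside the telemetry wrapper, rather than at call sites, means a forgotten call cannot leak a field.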

26. Reliability UX Patterns

Reliability is not only reporting. UI should recover.

Patterns:

retry button,
stale data banner,
offline banner,
reconnecting indicator,
partial failure fallback,
route error boundary,
widget error boundary,
reload prompt for chunk failure,
conflict resolution UI,
permission denied page,
not found page,
maintenance mode,
graceful empty state,
progressive enhancement.

Do not let one failed widget blank entire route.

27. Partial Failure Design

Case detail may have:

summary,
timeline,
documents,
related cases,
actions.

If timeline fails, summary and actions can still work.

<CaseSummary data={caseDetail} />

<WidgetBoundary fallback={<TimelineError />}>
  <AuditTimeline caseId={caseId} />
</WidgetBoundary>

<WidgetBoundary fallback={<DocumentsError />}>
  <DocumentsPanel caseId={caseId} />
</WidgetBoundary>

Use partial boundaries around independent sections.

28. Offline and Reconnect Reliability

When offline:

show offline banner,
prevent unsafe commands,
allow cached read if acceptable,
mark data stale,
reconnect automatically,
invalidate after reconnect,
preserve form draft if safe,
avoid false success.

Realtime:

show reconnecting,
backoff,
heartbeat,
invalidate after reconnect gap.

Reliability is user trust. Do not pretend app is live when it is stale.
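
The backoff step is commonly implemented as exponential backoff with jitter, so that many clients do not reconnect at the same instant after an outage. A minimal sketch follows, using the "full jitter" variant; `reconnectDelayMs` and its default base and cap are assumptions for illustration.

```typescript
interface BackoffOptions {
  baseMs?: number;        // delay ceiling for the first retry
  maxMs?: number;         // upper bound for any delay
  random?: () => number;  // injectable for deterministic tests
}

// Full jitter: pick a random delay between 0 and the exponential ceiling.
// `attempt` starts at 0 for the first reconnect attempt.
function reconnectDelayMs(attempt: number, options: BackoffOptions = {}): number {
  const { baseMs = 1000, maxMs = 30000, random = Math.random } = options;
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Resetting `attempt` to 0 after a successful connection keeps later disconnects from starting at the maximum delay.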

29. Error Budget

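Reliability work can be steered with error budgets. An error budget is the amount of failure an objective tolerates: a 99.9% crash-free target leaves a budget of 0.1% of sessions.

Example targets: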

JS crash-free sessions >= 99.9%
chunk load error rate < 0.05%
approval command failure due to a frontend bug < 0.1%
route transition p95 < 2s
unhandled promise rejection rate < 0.1/session

If error budget is burned:

pause feature work,
fix reliability,
rollback risky release,
add tests/monitors.

Error budget makes reliability trade-off explicit.
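
The arithmetic behind a budget is small: the tolerated number of bad events is (1 - target) multiplied by the total, and consumption is the observed bad events divided by that allowance. The sketch below shows the calculation; `errorBudget` and its field names are hypothetical.

```typescript
interface BudgetStatus {
  allowedBad: number; // bad events the target tolerates in this window
  consumed: number;   // fraction of the budget used (1 means fully used)
  exhausted: boolean;
}

// sloTarget is a fraction, for example 0.999 for "99.9% crash-free sessions".
function errorBudget(sloTarget: number, totalEvents: number, badEvents: number): BudgetStatus {
  const allowedBad = (1 - sloTarget) * totalEvents;
  const consumed =
    allowedBad === 0 ? (badEvents > 0 ? Infinity : 0) : badEvents / allowedBad;
  return { allowedBad, consumed, exhausted: badEvents > allowedBad };
}
```

With a 99.9% target over 100,000 sessions the allowance is about 100 crashed sessions, so 40 crashes uses roughly 40% of the budget and 150 crashes exhausts it.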

30. SLOs for Frontend

Service Level Objectives for frontend can include:

SLO	Example
availability	app shell loads 99.9%
crash-free	99.9% sessions crash-free
performance	LCP p75 <= 2.5s on public routes
interaction	INP p75 <= 200ms
workflow	approval action feedback <= 200ms p75
delivery	chunk load failure < 0.05%
realtime	reconnect within 10s p95
freshness	case detail stale warning under 1% sessions

Choose SLOs that match business value.

31. Alerting

Alert only on actionable signals.

Bad alerts:

every single frontend error,
noisy 404s caused by bots,
known validation errors,
low-volume non-actionable warnings.

Good alerts:

new release error spike,
chunk load failures above threshold,
app shell boot failure,
login failure spike,
approval command failure spike,
Web Vitals budget regression,
realtime disconnect spike,
source map missing for release.

Alert should have owner and runbook.

32. Runbooks

Runbook example: chunk load error spike.

Symptoms:
  ChunkLoadError > 0.1% sessions after release

Check:
  asset 404 logs
  CDN cache headers
  release id
  old asset retention
  service worker status

Mitigation:
  restore previous assets
  rollback index.html
  purge CDN if wrong
  show reload prompt
  disable service worker if culprit

Prevention:
  retain old assets
  add deploy smoke
  monitor asset 404

Runbooks reduce panic during incidents.

33. Incident Response

Frontend incident flow:

detection,
severity classification,
owner assigned,
user impact defined,
mitigation chosen,
communication,
rollback/flag disable if needed,
verification,
postmortem,
action items.

Frontend incidents can be serious:

login broken,
approval action broken,
sensitive data leak,
blank screen,
chunk 404,
XSS,
severe performance regression.

Treat with same seriousness as backend incidents.

34. Postmortem

Good postmortem includes:

timeline,
detection gap,
user impact,
root cause,
contributing factors,
why tests missed it,
why monitoring missed/detected it,
what prevented worse impact,
action items,
owners,
due dates.

Avoid blame.

Question:

What system change would make this class of failure less likely?

35. Source Maps and Error Quality

Minified stack traces need source maps.

Reliability setup:

build release id,
upload source maps to error provider,
do not publicly expose if policy forbids,
verify upload in CI/CD,
associate release with deployed assets,
redact sources if needed.

If source maps are missing, production stack traces point into minified bundles and debugging becomes much slower.

36. Frontend Health Dashboard

Dashboard should show:

active release,
crash-free sessions,
top errors by route/release,
Web Vitals by route,
API error rate by endpoint,
chunk load errors,
route transition latency,
approval command success/failure,
realtime reconnects,
browser/device breakdown,
feature flag correlation.

The dashboard should answer “is the app healthy now?”

37. Synthetic Monitoring

Synthetic checks:

load public route,
login smoke,
open dashboard,
deep link refresh,
approve test case in safe environment,
check static asset availability,
check chunk lazy route.

Synthetic monitoring is not a substitute for real user monitoring (RUM), but it catches obvious deploy and config failures.

Use safe test tenant/data for workflow commands.

38. Anti-Pattern Catalog

38.1 Error Boundary Only at Root

Small widget crash blanks app.

38.2 Generic “Something Went Wrong”

No recovery, no trace, no context.

38.3 Logging Sensitive Data

Telemetry becomes data leak.

38.4 No Release ID

Cannot correlate errors to deploy.

38.5 No Source Maps

Stack traces unusable.

38.6 Treating 409/403 as Crashes

Expected domain errors pollute error tracking.

38.7 No Chunk Load Monitoring

Deployment delivery failures invisible.

38.8 Session Replay Without Privacy Review

Sensitive data exposure.

38.9 Alerts Without Runbooks

Noise and panic.

38.10 No Postmortems

Same incident repeats.

39. Mini Case Study: Case Detail Partial Failure

Requirement

If audit timeline fails, case summary and actions remain usable.

Architecture:

function CaseDetailPage({ caseDetail }: Props) {
  return (
    <>
      <CaseHeader caseDetail={caseDetail} />
      <CaseActionBar caseDetail={caseDetail} />

      <SectionBoundary
        name="AuditTimeline"
        fallback={<AuditTimelineError />}
      >
        <AuditTimeline caseId={caseDetail.id} />
      </SectionBoundary>

      <SectionBoundary
        name="DocumentsPanel"
        fallback={<DocumentsPanelError />}
      >
        <DocumentsPanel caseId={caseDetail.id} />
      </SectionBoundary>
    </>
  );
}

Telemetry:

reportError(error, {
  boundary: "AuditTimeline",
  route: "/cases/:caseId",
  release: config.releaseId,
});

User impact minimized. Error still reported.

40. Mini Case Study: Approval Command Observability

Command telemetry:

case_approval_opened
case_approval_submitted
case_approval_succeeded
case_approval_failed

Attributes:

route pattern,
release,
status/error type,
duration,
retry count,
trace id,
user role if allowed,
feature flag variant.

Never include approval reason.

Metrics:

success rate,
403 rate,
409 rate,
validation error rate,
latency p75/p95,
retry rate.

If the 409 rate spikes, suspect stale data, for example missed realtime updates. If the 403 rate spikes, suspect that the UI offers actions the user is not permitted to take.

41. Mini Case Study: New Release Error Spike

Symptom

Error rate doubles after release 2026.06.28-001.

Triage

top error: Cannot read properties of undefined.
route: /cases/:caseId/documents.
browser: Chrome/Edge.
feature flag: new document preview.
source map identifies DocumentPreview.tsx.

Contain

disable feature flag,
fallback to old document preview.

Recover

monitor until the error rate returns to normal.

Learn

add fixture for document without preview metadata,
add component test,
add runtime schema fallback,
add story for missing metadata.

42. Reliability Review Checklist

Before approving production feature:

What can fail?
Is there an error boundary at correct level?
Is fallback user-friendly and accessible?
Are expected domain errors handled as UI states?
Are unexpected errors reported?
Are sensitive values redacted?
Is release id attached?
Are source maps uploaded?
Are API failures observable by endpoint/status?
Are Web Vitals/route metrics relevant?
Is chunk load failure handled?
Is offline/reconnect behavior defined?
Are feature flags observable?
Is partial failure possible?
Are alerts actionable?
Is there a runbook for critical failure?
Is cache cleared on logout/error where needed?
Are user-impact metrics defined?
Are tests covering fallback/recovery?
How would we know this broke in production?

43. Deliberate Practice

Exercise 1: Error Boundary Map

Draw component tree and mark boundaries:

Root
  AuthenticatedShell
    CaseDetailRoute
      AuditTimeline
      DocumentsPanel
      ActionDialogs

Decide fallback for each.

Exercise 2: Telemetry Event Design

For one command, define:

event names,
attributes,
sensitive fields to exclude,
metrics,
alert threshold.

Exercise 3: Runbook

Write runbook for:

blank screen,
chunk load error,
login broken,
approval command failure spike,
WebSocket reconnect spike.

Exercise 4: Reliability Dashboard

Design dashboard panels:

errors by release,
route latency,
Web Vitals,
top API failures,
workflow success rate,
realtime health.

Exercise 5: Postmortem Simulation

Pick previous bug. Write:

timeline,
detection,
impact,
root cause,
missing test/monitor,
prevention action.

44. Summary

Frontend reliability requires observability, containment, recovery, and learning.

Core practices:

use error boundaries at root/route/widget levels,
distinguish expected domain errors from crashes,
emit structured telemetry,
attach release id and route pattern,
protect privacy,
track Web Vitals and route/user-flow metrics,
observe API and chunk failures,
provide recovery UX,
use alerts with runbooks,
run postmortems,
add regression tests and monitors.

A production frontend without observability is a black box. A top-tier engineer designs the app so failures are visible, contained, recoverable, and less likely to repeat.

45. Self-Assessment

You are ready to move on if you can answer:

What is the difference between monitoring and observability?
Which errors are not caught by an error boundary?
Where should error boundaries be placed?
What makes a good fallback UI?
What must a telemetry event contain?
Why is the release id important?
How do Web Vitals fit into observability?
Why is session replay a privacy risk?
What is a frontend SLO and error budget?
How do you write a runbook for chunk load errors?

46. References

React Docs — Error Boundaries via Component
React Legacy Docs — Error Boundaries
Sentry Docs — React Error Boundary
OpenTelemetry Docs — JavaScript
OpenTelemetry Docs — Instrumentation
web.dev — Web Vitals
