
Linkerd, Cilium, and Sidecarless Mesh Trade-offs

Learn Kubernetes Networking, Gateway API, Service Mesh, and Multi-Cluster Traffic Engineering - Part 023

Deep guide to Linkerd, Cilium Service Mesh, and sidecarless service mesh trade-offs: architecture, data plane choices, eBPF, Envoy, identity, policy, observability, Gateway API, performance, failure modes, and production decision framework.


Part 023 — Linkerd, Cilium, and Sidecarless Mesh Trade-offs

1. Goal of This Part

Part 020 built a general model of the service mesh. Parts 021 and 022 used Istio as the main case study for sidecar mode and ambient mode. This part widens the view so that architecture decisions do not become tool-driven.

The target for this part:

You can evaluate Linkerd, Cilium Service Mesh, and the sidecarless model architecturally, not from hype, a single benchmark, or a feature checklist.

After this part, you should be able to answer:

  • When is a simple mesh better than a feature-rich one?
  • What are the consequences of choosing a per-Pod sidecar proxy?
  • What are the consequences of choosing an eBPF dataplane with optional Envoy?
  • What does “sidecarless” actually mean?
  • How does Linkerd differ from Istio in philosophy and operating model?
  • How does Cilium unify CNI, NetworkPolicy, observability, Gateway API, and service mesh?
  • When does a sidecarless mesh reduce operational tax, and when does it merely move complexity to a lower layer?
  • How do you choose a mesh based on production invariants rather than vendor preference?

2. Kaufman Framing: Do Not Learn Mesh as a Brand

Kaufman's framework suggests breaking a skill into components that can be practiced separately. For service mesh, the important components are not “Istio vs Linkerd vs Cilium” but the following capabilities:

Capability | Design question
Traffic interception | Is traffic captured at the Pod, node, socket, eBPF hook, iptables, or Gateway?
Identity | Does workload identity come from a ServiceAccount, SPIFFE ID, certificate, label, or IP?
Encryption | Is mTLS default, opt-in, strict, permissive, or external?
Authorization | Does policy apply at L3/L4, L7, the identity layer, the namespace layer, or the route layer?
Routing | Does it support weighted routing, failover, mirroring, header match, gRPC semantics?
Observability | Is the flow visible in the kernel, proxy, Gateway, app, or trace context?
Failure isolation | If the control plane breaks, does the data plane keep running?
Upgrade model | Does an upgrade touch every Pod, every node, or only the control plane?
Resource tax | Is overhead incurred per Pod, per node, per request, or per route?
Portability | Is policy based on a standard API or on implementation-specific CRDs?

Deliberate practice for this part:

  1. take one real requirement: all service-to-service traffic must be encrypted and observable;
  2. work through a mental design using Linkerd;
  3. work through a mental design using Cilium;
  4. compare failure modes and operational burden;
  5. choose based not on the longest feature list but on the lowest risk for your organization's context.

3. The Mesh Selection Trap

A common mistake, even among senior engineers:

We need a service mesh.
So we need Istio.
So every workload must have a sidecar.
So all traffic management must move into the mesh.

That is a weak chain of reasoning.

Better questions:

Which concerns should the application no longer handle by itself?
Is each concern best enforced at the CNI, Gateway, sidecar, node proxy, or application library?

A service mesh is not the goal. It is a mechanism for moving certain concerns out of application code:

  • mTLS;
  • workload identity;
  • service-to-service authorization;
  • retries/timeouts/outlier detection;
  • traffic splitting;
  • telemetry;
  • policy distribution;
  • service discovery enhancement.

But every concern has a best place. Not every concern belongs in the mesh.

Example:

Concern | Usually best placed in
Public API authentication | Edge gateway / API gateway
Inter-service mTLS | Mesh / CNI identity layer
L3/L4 isolation | NetworkPolicy / CNI policy
HTTP route delegation | Gateway API
Retry semantics | Mesh or application client, depending on idempotency
Business authorization | Application layer
Egress allowlist | Egress Gateway / firewall / CNI policy
Packet flow visibility | CNI/eBPF observability
Request semantics | Proxy / app / trace instrumentation

Top 1% engineers do not ask “which mesh is the most popular?”. They ask:

Where is the most defensible enforcement point for this invariant?


4. Taxonomy: Sidecar, Node Proxy, eBPF, and Hybrid Mesh

Before comparing tools, understand the models.

4.1 Sidecar Mesh

A sidecar mesh adds a proxy to every workload Pod.

Strengths:

  • traffic context sits close to the workload;
  • rich L7 policy;
  • per-workload mTLS identity;
  • per-Pod proxy isolation;
  • established model;
  • well suited to HTTP/gRPC traffic shaping.

Costs:

  • per-Pod resource overhead;
  • upgrades often require workload restarts;
  • startup ordering;
  • config explosion;
  • more complex debugging;
  • added latency;
  • high operational overhead for large fleets.

4.2 Node-Level / Sidecarless Mesh

A sidecarless mesh moves some functions to the node or to a shared proxy.

Strengths:

  • no need to inject a proxy into every Pod;
  • fewer containers;
  • upgrades can be more centralized;
  • lower overhead per workload;
  • easier workload onboarding.

Costs:

  • the shared component becomes a new blast radius;
  • L7 policy often needs an additional proxy;
  • traffic attribution can be harder;
  • bypass paths must be understood;
  • the semantic gap between L4 and L7 can be confusing.

4.3 eBPF-Assisted Mesh

eBPF allows observability, policy, load balancing, and traffic redirection to run in the kernel datapath.

Strengths:

  • efficient for L3/L4;
  • very good for flow visibility;
  • reduces dependence on iptables;
  • fits CNI-integrated platforms;
  • can combine networking, security, and observability.

Limits:

  • L7 still needs a parser/proxy such as Envoy;
  • debugging eBPF requires deeper kernel/networking skill;
  • portability across CNIs is not free;
  • failures can surface as kernel datapath issues rather than proxy issues.

5. Linkerd: A Deliberately Simple Mesh

Linkerd is often positioned as a lighter, more focused service mesh. It is not a “small Istio”; it makes different design decisions.

Mental model:

Linkerd = simple, transparent, secure-by-default sidecar mesh.

Linkerd consists of:

  • control plane;
  • data plane proxy;
  • identity service;
  • destination service;
  • policy components;
  • tap/metrics/telemetry components;
  • CLI dan dashboard tooling.

5.1 Data Plane

The Linkerd data plane uses a lightweight proxy that runs in the Pod as a sidecar. The proxy handles inbound and outbound traffic transparently.

Conceptually:

application -> localhost/proxy -> network -> remote proxy -> remote application

The proxy is responsible for:

  • mTLS;
  • load balancing;
  • retries for certain traffic;
  • metrics;
  • traffic policy enforcement;
  • service discovery integration;
  • connection management.

5.2 Identity and Automatic mTLS

One of Linkerd's main values is automatic mTLS. Workload identity is typically tied to the Kubernetes ServiceAccount.

A simplified model:

Pod runs as ServiceAccount
      ↓
Linkerd proxy obtains identity certificate
      ↓
Proxy establishes mTLS with peer proxy
      ↓
Policy can reason over authenticated workload identity

This matters because IP-based network identity is unstable in Kubernetes. Pod IPs change, nodes change, endpoints change. ServiceAccount identity is more stable as a runtime principal.
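The contrast between IP and ServiceAccount identity can be sketched in a few lines of Python. This is an illustrative model only, not Linkerd code: the `Pod` class, the `ALLOWED_CALLERS` allowlist, and the sample names and IPs are hypothetical, and the identity suffix follows the naming pattern of a default Linkerd install, which you should verify against your own cluster.

```python
from dataclasses import dataclass

# Assumed suffix: matches the pattern of a default Linkerd install
# (namespace "linkerd", trust domain "cluster.local"). Verify locally.
IDENTITY_SUFFIX = "serviceaccount.identity.linkerd.cluster.local"


@dataclass(frozen=True)
class Pod:
    name: str
    namespace: str
    service_account: str
    ip: str  # location only; changes whenever the Pod is rescheduled


def workload_identity(pod: Pod) -> str:
    """Derive the principal from ServiceAccount and namespace, never from IP."""
    return f"{pod.service_account}.{pod.namespace}.{IDENTITY_SUFFIX}"


# Hypothetical allowlist of principals that may call the payment service.
ALLOWED_CALLERS = {f"checkout.shop.{IDENTITY_SUFFIX}"}


def is_allowed(caller: Pod) -> bool:
    return workload_identity(caller) in ALLOWED_CALLERS


before = Pod("checkout-7d9f", "shop", "checkout", "10.0.1.17")
after = Pod("checkout-5c2a", "shop", "checkout", "10.0.4.203")  # rescheduled
intruder = Pod("debug-shell", "shop", "default", "10.0.1.17")   # reuses the old IP

print(is_allowed(before), is_allowed(after), is_allowed(intruder))
# prints: True True False
```

The rescheduled Pod keeps its access because its principal did not change, while a Pod that happens to inherit the old IP gains nothing.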

5.3 Linkerd Philosophy

Linkerd tends to optimize for:

  • simple install;
  • secure defaults;
  • minimal CRD surface;
  • low cognitive load;
  • day-2 operability;
  • automatic mTLS;
  • golden metrics;
  • safer adoption path.

The trade-offs:

  • not always as rich as Istio for complex routing;
  • not always flexible enough for highly customized enterprise edge routing;
  • still sidecar-based;
  • certain L7 extensions are not as broad as in the Envoy/Istio ecosystem;
  • the advanced policy model can be more limited than Cilium or Istio in some scenarios.

5.4 When Linkerd Fits

Linkerd fits when the dominant requirements are:

  • encrypt service-to-service traffic;
  • get reliable service metrics quickly;
  • add identity-based policy without a huge platform redesign;
  • keep app teams productive;
  • avoid high-complexity mesh configuration;
  • prefer opinionated defaults;
  • operate with a small-to-medium platform team.

Example situation:

The company has 60 microservices.
Traffic is mostly internal HTTP/gRPC.
Main pain: no mTLS, no service-level metrics, no clear service dependency graph.
The team is not ready to operate Istio's complexity.

Linkerd is often a defensible choice.

5.5 When Linkerd Is a Weaker Fit

It is a weaker fit when the main requirements are:

  • extremely complex L7 routing;
  • heavy API gateway integration;
  • deep Envoy customization;
  • advanced multi-cluster traffic steering;
  • strong need for CNI-integrated L3/L7 security policy;
  • desire to remove sidecars entirely;
  • large platform already standardized on Envoy extensions.

6. Cilium Service Mesh: CNI-Native, eBPF-First, Envoy When Needed

Cilium differs because it starts from CNI and eBPF networking, not purely from the service mesh proxy model.

Mental model:

Cilium = Kubernetes networking/security/observability dataplane using eBPF,
         with service mesh capabilities through Gateway API, Envoy, identity, and policy.

6.1 What eBPF Changes

eBPF lets Cilium attach programs to kernel hooks and implement networking behavior without relying only on iptables chains or per-Pod proxy interception.

Capabilities commonly associated with Cilium include:

  • CNI networking;
  • kube-proxy replacement;
  • service load balancing;
  • NetworkPolicy enforcement;
  • identity-aware policy;
  • L7 policy via Envoy;
  • Hubble observability;
  • Gateway API support;
  • ingress integration;
  • egress gateway patterns;
  • cluster mesh capabilities.

The key architectural point:

Cilium can make service mesh a property of the cluster dataplane rather than an extra proxy container in every workload.

6.2 Sidecarless Does Not Mean Proxyless

This distinction matters.

sidecarless != no proxy
sidecarless == no proxy injected beside every workload

For L3/L4 policy, service load balancing, and flow visibility, eBPF can handle much of the path efficiently. But for rich L7 semantics such as HTTP header matching, gRPC method awareness, Kafka parsing, or HTTP policy, Cilium still relies on Envoy-style proxying.

So the accurate model is:

L3/L4: eBPF datapath
L7: Envoy when required
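The split can be made concrete with a small classifier. This is a sketch under stated assumptions: the `Rule` class and the match key names are hypothetical and do not correspond to any real Cilium policy schema. The only point is that a rule needing parsed protocol fields cannot be decided in the L3/L4 datapath.

```python
from dataclasses import dataclass, field

# Match keys that can only be evaluated after parsing the application
# protocol. Key names are hypothetical, not a real policy schema.
L7_KEYS = {"http_method", "http_path", "http_header", "grpc_method", "kafka_topic"}


@dataclass
class Rule:
    name: str
    matches: dict = field(default_factory=dict)


def enforcement_layer(rule: Rule) -> str:
    """Return where the rule can be evaluated: kernel datapath or L7 proxy."""
    if rule.matches.keys() & L7_KEYS:
        return "L7: proxy"
    return "L3/L4: eBPF datapath"


port_rule = Rule("checkout-to-payment", {"dst_port": 8080, "protocol": "TCP"})
path_rule = Rule("evaluate-only", {"dst_port": 8080,
                                   "http_method": "POST",
                                   "http_path": "/internal/evaluate"})

for rule in (port_rule, path_rule):
    print(rule.name, "->", enforcement_layer(rule))
```

A review step like this is useful before rollout: any rule classified as L7 implies that a proxy must be on the path for that traffic.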

6.3 Gateway API in Cilium

Cilium supports Gateway API as a way to express ingress and service traffic routing through Kubernetes-native resources.

Why this matters:

  • Gateway API reduces controller-specific annotation reliance;
  • platform teams can expose shared Gateway resources;
  • app teams can attach routes;
  • Envoy handles protocol-aware L7 behavior;
  • Cilium can integrate routing with network policy and identity.

6.4 Cilium Identity Model

Cilium assigns security identities based on workload labels. Instead of making IP the primary identity, Cilium maps endpoint labels to numeric identities and enforces policy based on those identities.

Implication:

Pod IP is location.
Cilium identity is security context.
Policy should reason over identity, not ephemeral IP.

This identity model is powerful for NetworkPolicy and microsegmentation, especially when combined with eBPF enforcement.
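A toy model of label-derived identity, assuming a simplified allocator: the `IdentityAllocator` class, the starting number, and the label sets are hypothetical, and real Cilium identity allocation is cluster-wide and more involved. The sketch only shows that endpoints with the same labels share an identity regardless of IP.

```python
from itertools import count


class IdentityAllocator:
    """Toy allocator: one numeric identity per distinct set of labels."""

    def __init__(self, start=1000):
        self._next_id = count(start)
        self._by_labels = {}

    def identity_for(self, labels):
        key = frozenset(labels.items())  # order-independent label set
        if key not in self._by_labels:
            self._by_labels[key] = next(self._next_id)
        return self._by_labels[key]


alloc = IdentityAllocator()
checkout_a = alloc.identity_for({"app": "checkout", "env": "prod"})  # Pod at 10.0.1.17
checkout_b = alloc.identity_for({"env": "prod", "app": "checkout"})  # Pod at 10.0.4.203
payment = alloc.identity_for({"app": "payment", "env": "prod"})

# Policy is expressed over identities, so no IP appears anywhere in it.
allowed = {(checkout_a, payment)}

print(checkout_a == checkout_b, (checkout_b, payment) in allowed)
# prints: True True
```

Because the policy key is the identity pair, scaling or rescheduling checkout Pods does not require any policy change.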

6.5 Observability with Hubble

Cilium’s Hubble gives network flow visibility.

A strong production value of Cilium is that platform engineers can answer:

  • who talked to whom;
  • over which port/protocol;
  • whether policy allowed or denied it;
  • whether DNS was involved;
  • whether flow was dropped;
  • which identity was associated with the flow.

This is different from proxy-only telemetry. Proxy telemetry is request-aware. eBPF/CNI telemetry is path-aware.

Top-level mental model:

Observability plane | Best at
Application logs | Business result
Traces | Request causal chain
Proxy metrics | HTTP/gRPC behavior
CNI/eBPF flows | Packet/connection path and policy decision
Gateway status | Routing/control plane state
Cloud LB metrics | Edge health and external traffic

6.6 When Cilium Fits

Cilium fits when the dominant requirements are:

  • strong Kubernetes networking platform;
  • eBPF-based policy and visibility;
  • kube-proxy replacement;
  • advanced NetworkPolicy/microsegmentation;
  • desire to avoid sidecars for broad mesh capability;
  • Gateway API adoption;
  • flow-level observability;
  • multi-cluster networking through Cilium Cluster Mesh;
  • platform team comfortable with Linux networking/eBPF concepts.

Example situation:

The platform runs hundreds of multi-tenant namespaces.
The security team needs identity-aware network policy and flow audit.
Service mesh requirements are mostly mTLS, policy, observability, and Gateway API.
The team wants to avoid sidecar injection across all workloads.

Cilium becomes a strong option.

6.7 When Cilium Is a Weaker Fit

It is a weaker fit when:

  • the organization is not ready to operate an eBPF dataplane;
  • the team's debugging skills in Linux networking are still weak;
  • workloads need mesh semantics that are very Envoy/Istio-specific;
  • standardization on the Istio CRD ecosystem is already strong;
  • the platform does not want CNI and mesh to become a single vendor/control surface;
  • L7 requirements are very complex and proxy-per-workload isolation is wanted.

7. Linkerd vs Cilium vs Istio: Do Not Compare as a Checklist

Feature checklists are often misleading because every tool can write “mTLS: yes”. The real question:

mTLS implemented where, identified by what, configured through what API,
observed how, failed how, upgraded how, and debugged by whom?

7.1 Comparison Matrix

Axis | Linkerd | Cilium Service Mesh | Istio Sidecar | Istio Ambient
Primary origin | Service mesh | CNI/eBPF networking | Service mesh / Envoy | Service mesh redesign
Data plane | Sidecar micro-proxy | eBPF + Envoy when needed | Envoy sidecar | ztunnel + waypoint
Main strength | Simplicity, mTLS, metrics | Network/security/observability integration | Rich L7 features | Sidecarless Istio semantics
L7 richness | Moderate | Envoy-backed where enabled | Very high | High with waypoint
Resource model | Per-Pod sidecar | Mostly node/datapath + Envoy | Per-Pod Envoy | Per-node + waypoint
Upgrade burden | Sidecar restart often needed | CNI/agent/proxy components | Sidecar fleet management | ztunnel/waypoint management
Policy center | Mesh policy | CNI identity/policy + Gateway | Istio security/networking CRDs | Istio + Gateway API style
Observability | Service metrics/tap | Flow visibility + proxy telemetry | Rich Envoy telemetry | Split L4/L7 telemetry
Cognitive load | Lower | Medium-high | High | Medium-high
Best fit | Secure-by-default service mesh | Platform networking/security convergence | Advanced traffic management | Istio capability with less sidecar tax

7.2 Feature Depth vs Operational Risk

The feature-depth versus operational-risk chart for this section is intentionally approximate. Its value is not numerical precision but forcing the right conversation:

  • How much feature breadth do we need?
  • How much complexity can we operate safely?
  • Which failure mode are we willing to own?

8. Sidecar vs Sidecarless: Real Trade-offs

8.1 Resource Isolation

Sidecar:

Each workload has its own proxy.
Failure of one proxy affects mostly that Pod.

Sidecarless/node-level:

Shared components serve many workloads.
Failure can affect a wider node or namespace scope.

Trade-off:

Model | Isolation | Efficiency
Sidecar | Stronger per-workload isolation | Higher overhead
Node/shared | Larger shared blast radius | Lower per-workload overhead

8.2 Upgrade Model

Sidecar upgrade:

  • update injection template;
  • restart workloads;
  • coordinate app team windows;
  • handle mixed proxy versions;
  • verify telemetry and mTLS after rollout.

Sidecarless upgrade:

  • update node agents/proxies;
  • fewer workload restarts;
  • but node-level failure affects many workloads;
  • requires careful rolling upgrade and fallback.

8.3 Debugging Model

Sidecar debugging asks:

Did this Pod get the right proxy config?
Did iptables redirect traffic to the sidecar?
Is Envoy cluster/route/listener correct?
Is mTLS configured between this pair?

Sidecarless/eBPF debugging asks:

Did datapath attach correctly?
Was flow redirected?
Which identity was assigned?
Did policy drop it?
Did traffic enter Envoy for L7?
Was there a bypass path?

Neither is objectively simpler. They require different expertise.

8.4 Security Boundary

Sidecar security boundary:

  • proxy runs next to app;
  • app and proxy share Pod boundary;
  • compromised Pod may interact with local network namespace;
  • policy enforcement is close to workload;
  • certs often scoped to proxy/workload identity.

Sidecarless security boundary:

  • node agent/proxy represents multiple workloads;
  • kernel datapath participates in enforcement;
  • node compromise has broader impact;
  • fewer per-Pod secrets/proxies;
  • bypass prevention must be proven.

The key invariant:

Security architecture must state exactly where identity is bound, where traffic is intercepted, and where policy is enforced.


9. The Enforcement Point Model

Every mesh decision should identify enforcement points.

Ask for every policy:

Policy | Best enforcement point | Reason
Namespace isolation | CNI / NetworkPolicy | Packet-level default deny
Service-to-service identity auth | Mesh / identity-aware proxy | Needs authenticated principal
HTTP path authz | L7 proxy / application | Needs request semantics
Business permission | Application | Needs domain state
Egress allowlist | Egress gateway / firewall / CNI | Needs central audit/control
Public rate limit | Edge Gateway / API gateway | Protects platform boundary
Canary traffic split | Gateway or mesh | Needs route-level traffic control

Bad design example:

Use mesh AuthorizationPolicy to enforce business permission:
"only investigator assigned to case can approve enforcement action".

That is wrong. The mesh can identify service principal, not domain-level human assignment state. Business authorization belongs in application/domain logic.

Good design example:

Mesh policy: only case-service can call sanction-service /internal/evaluate.
Application policy: only assigned enforcement officer can approve the sanction.
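One way to test a proposed placement is to ask whether the enforcement point can see every input the policy needs. The sketch below is a simplified model: the context sets in `VISIBLE_CONTEXT` are assumptions chosen for illustration, not a description of any specific product.

```python
# What each enforcement point can see when it decides (illustrative only).
VISIBLE_CONTEXT = {
    "cni": {"src_identity", "dst_identity", "port", "protocol"},
    "mesh_l7": {"src_identity", "dst_identity", "port", "protocol",
                "http_method", "http_path"},
    "application": {"src_identity", "http_method", "http_path",
                    "user", "domain_state"},
}


def can_enforce(point: str, needed: set) -> bool:
    """A point can enforce a policy only if it sees every required input."""
    return needed <= VISIBLE_CONTEXT[point]


# "only case-service can call sanction-service /internal/evaluate"
mesh_policy = {"src_identity", "http_path"}
# "only the assigned enforcement officer can approve the sanction"
business_policy = {"user", "domain_state"}

print(can_enforce("mesh_l7", mesh_policy))          # True
print(can_enforce("mesh_l7", business_policy))      # False
print(can_enforce("application", business_policy))  # True
```

The bad design example fails this check because the mesh never sees the user or the case assignment state.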

10. Gateway API as the Mesh Abstraction Layer

Gateway API reduces the need to learn every controller-specific routing API first. It provides a Kubernetes-native language for:

  • HTTP routes;
  • gRPC routes;
  • TCP/UDP/TLS routes;
  • backend references;
  • route delegation;
  • listener ownership;
  • policy attachment.

For mesh, Gateway API matters because it can express east-west traffic intent more consistently.

Example mental model:

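apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payments-canary
  namespace: checkout
spec:
  parentRefs:
    # A Service parent (core API group "") marks this as a mesh route.
    - group: ""
      kind: Service
      name: payments
      namespace: payments
  rules:
    - backendRefs:
        # backendRefs default to the route's own namespace, so backends in
        # the payments namespace are named explicitly here. Cross-namespace
        # references also need a ReferenceGrant in the target namespace.
        - name: payments-v1
          namespace: payments
          port: 8080
          weight: 90
        - name: payments-v2
          namespace: payments
          port: 8080
          weight: 10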

This kind of route may be interpreted by a mesh implementation that supports Gateway API for mesh use cases.
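To build intuition for what the 90/10 weights mean per request, here is a small simulation. It is a sketch of proportional selection only; how a real implementation distributes requests is implementation-specific, and the function name and seed are arbitrary choices.

```python
import random
from collections import Counter

# Weights copied from the HTTPRoute example above.
BACKENDS = {"payments-v1": 90, "payments-v2": 10}


def pick_backend(rng: random.Random) -> str:
    """Choose one backend with probability proportional to its weight."""
    names = list(BACKENDS)
    weights = list(BACKENDS.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so the run is repeatable
total = 10_000
counts = Counter(pick_backend(rng) for _ in range(total))
share_v2 = counts["payments-v2"] / total

print(f"payments-v2 share: {share_v2:.3f}")
```

Weights are proportions, not guarantees: over a short window the observed split can differ noticeably from 10%, which matters when canary metrics are read from low-traffic services.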

The important architecture question:

Are we committing to Gateway API as a platform contract, or to implementation-specific CRDs as the contract?

Implementation-specific CRDs are not bad. But they are a stronger coupling.


11. Production Decision Framework

Use this framework before selecting mesh.

11.1 Requirement Classification

Requirement | Weight | Notes
Automatic mTLS | High | Most mesh products can do this, but identity model differs.
L7 traffic shaping | Medium/High | Istio strongest; Cilium/Linkerd vary by capability.
NetworkPolicy integration | High | Cilium strong due to CNI origin.
Minimal operational complexity | High | Linkerd often strong.
Sidecarless adoption | Medium/High | Cilium/Istio ambient stronger.
Envoy extensibility | High | Istio/Cilium stronger.
Gateway API strategy | High | Check conformance and supported features.
Flow observability | High | Cilium/Hubble strong.
Multi-cluster | High | Implementation-specific maturity matters.
Compliance audit | High | Need identity, policy, logs, cert rotation proof.

11.2 Team Capability Fit

Team strength | Better fit bias
Strong SRE, weak networking | Linkerd or managed mesh
Strong Linux/eBPF networking | Cilium
Strong Envoy/Istio expertise | Istio
Need simple platform adoption | Linkerd
Need integrated CNI + policy + visibility | Cilium
Need advanced traffic management | Istio sidecar or ambient with waypoint

11.3 Risk Questions

Before adopting any mesh, answer:

  1. What traffic paths are in mesh and out of mesh?
  2. What is the default mTLS mode?
  3. What breaks if the identity issuer is down?
  4. What breaks if control plane is down?
  5. What breaks if node agent/proxy is down?
  6. Can workloads bypass mesh?
  7. How are certificates rotated?
  8. How do we audit who called whom?
  9. How do app teams debug rejected traffic?
  10. What is the rollback plan?
  11. Which APIs are standard and which are vendor-specific?
  12. What is the per-Pod or per-node resource budget?

12. Failure Mode Catalog

12.1 Sidecar Injection Drift

Symptom:

Some Pods have mesh behavior, others do not.

Causes:

  • namespace label missing;
  • webhook failed;
  • manual Pod creation bypassed injection;
  • workload restarted before injection config updated;
  • excluded port annotation wrong.

Detection:

kubectl get pod -n payments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'

Invariant:

Every workload that is expected to be in mesh must be provably in mesh.
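The invariant can be checked mechanically from Pod specs. The sketch below assumes input shaped like `kubectl get pods -o json` and treats `linkerd-proxy` and `istio-proxy` as the proxy container names; adjust the set for your mesh. The sample data is hypothetical. It also looks at initContainers, since native sidecars are declared there.

```python
# Proxy container names to look for; extend this set for your own mesh.
PROXY_NAMES = {"linkerd-proxy", "istio-proxy"}


def unmeshed_pods(pod_list: dict) -> list:
    """Return names of Pods that carry no known proxy container."""
    missing = []
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        names = {c["name"] for c in spec.get("containers", [])}
        # Native sidecars are declared as init containers.
        names |= {c["name"] for c in spec.get("initContainers", [])}
        if not names & PROXY_NAMES:
            missing.append(pod["metadata"]["name"])
    return missing


# Hypothetical sample shaped like `kubectl get pods -n payments -o json`.
sample = {"items": [
    {"metadata": {"name": "payments-a"},
     "spec": {"containers": [{"name": "app"}, {"name": "linkerd-proxy"}]}},
    {"metadata": {"name": "payments-b"},
     "spec": {"containers": [{"name": "app"}]}},
    {"metadata": {"name": "payments-c"},
     "spec": {"containers": [{"name": "app"}],
              "initContainers": [{"name": "istio-proxy"}]}},
]}

print(unmeshed_pods(sample))
# prints: ['payments-b']
```

Run on a schedule against namespaces that are expected to be meshed, a non-empty result is the drift signal.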

12.2 mTLS Partial Deployment

Symptom:

Traffic works between some workloads, fails between others.

Causes:

  • one workload not meshed;
  • permissive mode hiding plaintext;
  • trust bundle mismatch;
  • cert expired;
  • identity service unavailable during rotation.

Mitigation:

  • define strict/permissive migration phases;
  • create mesh enrollment inventory;
  • expose mTLS success/failure metrics;
  • test with out-of-mesh clients deliberately.

12.3 eBPF Policy Drop Misdiagnosed as App Failure

Symptom:

Client sees timeout. Server logs show nothing.

Causes:

  • CNI policy denied packet before server;
  • DNS egress blocked;
  • identity mismatch;
  • stale endpoint identity;
  • node datapath issue.

Debugging direction:

If server logs show nothing, debug path before request reaches application.

Use flow visibility, policy verdicts, and packet path tools.

12.4 L7 Policy Requires Proxy but Traffic Stayed L4

Symptom:

L4 connectivity works, but HTTP policy is not enforced.

Causes:

  • traffic never redirected to L7 proxy;
  • missing waypoint;
  • protocol not detected;
  • port not named correctly;
  • controller does not support feature;
  • policy attached to wrong resource.

Invariant:

L7 policy requires L7 observation point.

12.5 Shared Node Component Blast Radius

Symptom:

Multiple unrelated workloads on one node lose mesh behavior simultaneously.

Causes:

  • node-level proxy crash;
  • CNI agent failure;
  • eBPF program issue;
  • ztunnel-like component failure;
  • host network conflict.

Trade-off:

Sidecarless reduces per-workload overhead but can increase node-scoped blast radius.

12.6 Feature Works in One Controller but Not Another

Symptom:

Gateway API resource is accepted in dev but behaves differently in prod.

Causes:

  • different implementation conformance;
  • extended feature not supported;
  • controller-specific interpretation;
  • implementation-specific policy CRD;
  • version skew.

Mitigation:

  • test against actual GatewayClass;
  • verify status conditions;
  • use conformance docs;
  • avoid assuming Gateway API means complete portability.

13. Design Patterns

13.1 Simple Secure Mesh Pattern

Use when:

  • internal services need mTLS;
  • app team should not manage certs;
  • request routing needs are modest;
  • platform team values simplicity.

Characteristics:

  • mesh is transparent;
  • policy surface is constrained;
  • operating model is simpler;
  • good first mesh for many organizations.

Risk:

  • advanced traffic management may require another layer;
  • still carries sidecar lifecycle cost.

13.2 CNI-Native Security Platform Pattern

Use when:

  • platform needs NetworkPolicy and flow audit;
  • security team needs identity-aware enforcement;
  • service mesh is only one part of networking platform;
  • eBPF expertise exists.

Characteristics:

  • traffic policy and observability start at CNI layer;
  • fewer sidecars;
  • strong flow visibility;
  • L7 uses Envoy when needed.

Risk:

  • requires deeper networking skill;
  • CNI becomes high criticality platform dependency.

13.3 Hybrid Gateway + Mesh Pattern

Use when:

  • public APIs need Gateway/API gateway;
  • internal services need mTLS;
  • only selected services need advanced L7 routing;
  • you want to avoid full-mesh complexity everywhere.

Principle:

Do not mesh everything just because mesh exists.
Mesh the trust boundary and dependency paths that need mesh capabilities.

13.4 Progressive Mesh Adoption Pattern

Phases:

  1. observe only;
  2. enable mTLS in permissive mode;
  3. inventory out-of-mesh calls;
  4. move to strict mTLS for selected namespaces;
  5. add authorization policy;
  6. add L7 traffic management;
  7. add multi-cluster only after single-cluster invariants are proven.

Avoid:

Day 1: install mesh, enable strict mTLS globally, add retries, add authz, enable canary, enable multi-cluster.

That is not engineering maturity. That is blast radius manufacturing.
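The phased approach can be expressed as a simple gate. This is a minimal sketch, assuming that each phase must record some evidence before the next one starts; the phase identifiers and the `advance` function are hypothetical names for illustration.

```python
# Phases in the order listed above.
PHASES = [
    "observe-only",
    "mtls-permissive",
    "inventory-out-of-mesh",
    "mtls-strict-selected-namespaces",
    "authorization-policy",
    "l7-traffic-management",
    "multi-cluster",
]


def advance(current: str, requested: str, evidence: set) -> str:
    """Move to `requested` only if it is the very next phase and the
    current phase has recorded evidence; otherwise stay where we are."""
    i = PHASES.index(current)
    is_next = i + 1 < len(PHASES) and PHASES[i + 1] == requested
    if is_next and current in evidence:
        return requested
    return current


# One step forward with evidence: allowed.
print(advance("observe-only", "mtls-permissive", {"observe-only"}))
# Jumping straight to multi-cluster: refused.
print(advance("observe-only", "multi-cluster", {"observe-only"}))
# Next step without evidence for the current phase: refused.
print(advance("mtls-permissive", "inventory-out-of-mesh", set()))
```

The gate makes the "Day 1" anti-pattern impossible by construction: no phase can be skipped, and no phase can start before the previous one is proven.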


14. Performance and Cost Model

14.1 Cost Dimensions

Cost | Sidecar | Sidecarless/eBPF
CPU per workload | Higher | Lower per workload
Memory per workload | Higher | Lower per workload
Node component criticality | Medium | High
L7 proxy cost | Per Pod | Shared/selected
Upgrade coordination | Workload-heavy | Platform-heavy
Debugging expertise | Proxy-heavy | Kernel/CNI-heavy
Telemetry cardinality | High | High, but different layer

14.2 Latency Model

Every proxy hop can add latency. But the dangerous part is not only average latency. It is tail latency and failure amplification.

Questions:

  • Does every request cross two proxies?
  • Does mTLS handshake reuse connection pooling?
  • Are retries performed at app and mesh layer simultaneously?
  • Are timeouts aligned?
  • Is telemetry synchronous or buffered?
  • Does Envoy filter chain include expensive processing?
  • Does L7 policy require request body inspection?

Bad pattern:

Application retries 3x.
Mesh retries 3x.
Gateway retries 2x.
If each figure is a total attempt count, one user request can become 3 × 3 × 2 = 18 backend attempts.

Good pattern:

Define retry budget globally.
Apply retries only at the layer with enough semantics to know idempotency.
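The amplification in the bad pattern is plain multiplication across nested layers. The sketch below computes the worst case under two readings of "3x" and "2x": as total attempts per layer, and as retries on top of the first try. The function and dictionary names are illustrative.

```python
from math import prod


def worst_case_attempts(attempts_per_layer: dict) -> int:
    """Multiply total attempts (first try plus retries) across nested layers."""
    return prod(attempts_per_layer.values())


# Reading "3x" and "2x" as total attempts per layer.
as_attempts = {"application": 3, "mesh": 3, "gateway": 2}
# Reading them as retries in addition to the first try.
as_retries = {layer: n + 1 for layer, n in as_attempts.items()}

print(worst_case_attempts(as_attempts))  # 18
print(worst_case_attempts(as_retries))   # 48

# Retrying at a single layer keeps the worst case linear.
print(worst_case_attempts({"application": 3, "mesh": 1, "gateway": 1}))  # 3
```

Either reading makes the same point: every layer that retries independently multiplies the load that a struggling backend receives.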

15. Security and Compliance Model

For regulated systems, the mesh decision must produce audit artifacts.

You need to prove:

  • what identity each workload has;
  • how certificates are issued;
  • how certificates rotate;
  • which services can talk;
  • which requests were denied;
  • which traffic is encrypted;
  • which namespaces are exempt;
  • who can change policy;
  • how emergency rollback works;
  • how policy changes are reviewed.

Mesh security is weak if it cannot answer:

At 2026-07-01T10:00:00Z,
was service A allowed to call service B on route /internal/approve,
under which identity,
using which certificate chain,
and where is the evidence?

16. Lab: Compare Three Mesh Designs

Use one system:

frontend -> checkout -> payment -> ledger
checkout -> fraud
payment -> bank-adapter

Requirements:

  • all internal traffic encrypted;
  • checkout can call payment;
  • frontend cannot call ledger directly;
  • payment can call bank-adapter only through egress path;
  • canary payment v2 at 10%;
  • observe request success rate and denied flows;
  • rollback within five minutes.
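Before answering the design questions, it can help to encode the call graph and the access requirements as a small executable check. This sketch is independent of any mesh: the `egress-gateway` node is a hypothetical name for the lab's egress path, and the functions are illustrative helpers.

```python
from collections import deque

# Allowed direct calls, taken from the lab's call graph and requirements.
ALLOWED = {
    ("frontend", "checkout"),
    ("checkout", "payment"),
    ("checkout", "fraud"),
    ("payment", "ledger"),
    ("payment", "egress-gateway"),       # hypothetical egress path node
    ("egress-gateway", "bank-adapter"),
}


def is_allowed(src: str, dst: str) -> bool:
    """Direct call check: only explicitly listed edges are permitted."""
    return (src, dst) in ALLOWED


def reachable(src: str, dst: str) -> bool:
    """Breadth-first search over allowed edges (indirect paths included)."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for a, b in ALLOWED:
            if a == node and b not in seen:
                seen.add(b)
                queue.append(b)
    return False


print(is_allowed("checkout", "payment"))      # True
print(is_allowed("frontend", "ledger"))       # False: no direct call
print(is_allowed("payment", "bank-adapter"))  # False: must use egress path
print(reachable("payment", "bank-adapter"))   # True: via egress-gateway
```

Each of the three designs should then be judged on whether its policy objects can express exactly this edge set and show denials for everything else.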

16.1 Design A: Linkerd

Answer:

  • Where is mTLS configured?
  • How are identities assigned?
  • Which policy blocks frontend -> ledger?
  • How do you observe denied calls?
  • How do you do payment canary?
  • What requires additional tooling?

16.2 Design B: Cilium

Answer:

  • Which Cilium identities exist?
  • Which NetworkPolicy/CiliumNetworkPolicy applies?
  • Does payment canary use Gateway API or app-level routing?
  • Does L7 require Envoy?
  • How does Hubble show denied flows?
  • What happens if Cilium agent on a node fails?

16.3 Design C: Istio Ambient

Answer:

  • Which namespaces are ambient enrolled?
  • Which traffic is L4-only via ztunnel?
  • Which service needs waypoint?
  • Where is HTTPRoute attached?
  • Which AuthorizationPolicy applies at L4 vs L7?
  • What bypass risks exist?

Deliverable:

One architecture decision record explaining which design you choose and why.

17. Architecture Decision Record Template

# ADR: Service Mesh Selection

## Context
We need workload-to-workload encryption, service identity, policy enforcement, observability, and selected traffic shaping for Kubernetes workloads.

## Decision
We choose <Linkerd/Cilium/Istio/...> for <scope>.

## Scope
- Included namespaces:
- Excluded namespaces:
- North-south traffic:
- East-west traffic:
- Egress traffic:

## Invariants
- All in-scope service-to-service traffic must use mTLS.
- Authorization must be identity-based, not IP-based.
- L7 policy may only be used where L7 proxying is proven active.
- Retry budget must be centrally defined.

## Alternatives Considered
- Linkerd:
- Cilium:
- Istio sidecar:
- Istio ambient:

## Consequences
- Operational cost:
- Security posture:
- Debugging model:
- Upgrade model:
- Lock-in:

## Rollback Plan
- Disable policy:
- Disable mesh enrollment:
- Restore previous Gateway path:
- Verify plaintext fallback is not accidentally permanent:

## Evidence Required
- mTLS metrics:
- flow logs:
- policy audit:
- conformance test:
- failure drill:

18. Review Checklist

Use this checklist before production rollout.

Architecture

  • Mesh scope is explicit.
  • Out-of-mesh traffic paths are known.
  • North-south and east-west responsibilities are separated.
  • Gateway API role is defined.
  • CNI policy role is defined.
  • Business authorization is not delegated to mesh.

Security

  • Workload identity model is documented.
  • mTLS mode is explicit.
  • Certificate rotation is tested.
  • Policy default is understood.
  • Denied traffic is observable.
  • Break-glass process exists.

Operations

  • Control plane failure behavior is tested.
  • Data plane failure behavior is tested.
  • Upgrade plan is tested.
  • Rollback plan is tested.
  • Resource overhead is measured.
  • Tail latency is measured.

Debugging

  • App team can identify whether request entered mesh.
  • Platform team can trace flow from client to server.
  • Policy denial reason is visible.
  • mTLS handshake failure is distinguishable from network drop.
  • Gateway status and mesh status are both checked.

19. Summary

Linkerd, Cilium, Istio sidecar, and Istio ambient are not interchangeable labels for “service mesh”. They are different choices about where traffic is intercepted, where identity is bound, where policy is enforced, where telemetry is generated, and where operational complexity lives.

Key takeaways:

  • Linkerd is strong when simplicity, secure defaults, and fast adoption matter.
  • Cilium is strong when networking, security, observability, and service mesh should converge at the CNI/eBPF layer.
  • Sidecarless reduces per-Pod overhead but introduces shared dataplane and debugging trade-offs.
  • eBPF is powerful for L3/L4 enforcement and visibility, but rich L7 behavior still needs a proxy.
  • Gateway API can become a stable platform contract, but controller support and conformance must be verified.
  • A mature mesh decision starts from invariants, not product preference.

The next part goes deeper into the security foundation behind all serious mesh designs: mTLS, SPIFFE, identity, trust domain, and zero-trust service networking.

