
Parser Selection and Processing Strategy

Learn Java XML In Action - Part 008

A strategy for choosing a parser and processing model in Java XML: DOM, SAX, StAX, XPath, XSD validation, XSLT, XQuery, binding, hybrid pipelines, a decision matrix, anti-patterns, and production architecture trade-offs.


Part 008 — Parser Selection and Processing Strategy

Goals of This Part

This part answers a question that matters more than "how do I parse XML?":

Which processing model fits this problem best?

Strong engineers do not pick DOM, SAX, StAX, XPath, XSD, XSLT, XQuery, or binding because it is familiar. They choose based on:

  • document size;
  • data access pattern;
  • shape of the transformation;
  • validation requirements;
  • latency;
  • memory;
  • auditability;
  • contract evolution;
  • security;
  • failure modes;
  • team skills;
  • long-term maintainability.

This part is a decision framework.

After this part, we have a practical map before going deeper into XSD, XPath, XQuery, and XSLT.


1. The Core Question

The wrong question:

Which XML parser is the best?

The right question:

What kind of XML work is being done?

There are several kinds of work:

| Work Type | Practical Question |
|---|---|
| Parse | How do we read XML into events, a tree, or objects? |
| Validate | Does the XML conform to the contract? |
| Extract | Which fields/records need to be pulled out? |
| Query | Which parts match a given expression? |
| Transform | How do we convert XML into another format? |
| Generate | How do we produce valid, deterministic XML? |
| Bind | How is XML mapped to domain objects/DTOs? |
| Route | Where is the document/record sent based on its content? |
| Audit | How do we prove what processing did to input and output? |
| Evolve | How does the contract change without breaking partners? |

The parser is only one component.


2. Processing Models in Java XML

Summary:

| Model | Mental Model | Good For | Avoid For |
|---|---|---|---|
| DOM | XML as a tree object graph | Small-to-medium documents, random access, mutation | Large files, high-throughput streaming |
| SAX | Push event callbacks | Simple extraction from large files, low-level streaming | Complex business flows, code that must stay highly readable |
| StAX | Pull event cursor | Large file extraction, streaming writer, controlled pipeline | Random access, complex global rules |
| XPath | Declarative node selection | Small queries over DOM/XDM, assertions, targeted extraction | Scanning millions of large records without a strategy |
| XSD | Declarative contract validation | Structure/datatype validation, partner contract | Complex cross-system business rules |
| XSLT | Declarative transformation | XML-to-XML/HTML/text mapping, canonicalization | Imperative, side-effect-heavy logic |
| XQuery | Declarative XML query | Query/aggregation over XML collections or complex XML | Simple single-document extraction |
| Binding | XML <-> object mapping | Stable schema, object-centric business layer | Large streaming, mixed content, flexible schema |
| Hybrid | Combination of models | Complex production pipelines | Use without clear boundaries |

3. Decision Tree

This decision tree is not an absolute law. It is a guardrail that keeps the initial choice from heading in the wrong direction.


4. First Principle: Access Pattern Beats Tool Preference

An XML document can be viewed through several access patterns.

4.1 Full Document Access

We need to see the whole document:

  • compare header and footer totals;
  • validate cross-section consistency;
  • mutate some nodes based on other nodes;
  • render UI preview;
  • apply XPath repeatedly;
  • sign/canonicalize a subset.

Tends to fit:

  • DOM;
  • XDM/Saxon tree;
  • binding object model;
  • XSLT/XQuery if the logic is declarative.

Poor fit:

  • raw SAX if the logic needs global context;
  • raw StAX if it needs a lot of look-back/look-ahead.

4.2 Sequential Record Access

The document contains many repeated records:

<records>
  <record>...</record>
  <record>...</record>
  <record>...</record>
</records>

We can process the records one at a time.

Tends to fit:

  • StAX;
  • SAX;
  • streaming XSLT if the processor supports it and the stylesheet is streamable;
  • batch validation/extraction hybrid.

Poor fit:

  • DOM for millions of records;
  • binding the entire file into one large object graph.

4.3 Targeted Lookup

We only need a few values:

/header/messageId
/header/sender
/body/order/@id

Tends to fit:

  • XPath for small/medium documents;
  • StAX for large files;
  • SAX if the lookup is simple.

Trade-off:

  • XPath is more expressive;
  • StAX has more predictable memory use;
  • DOM + XPath is easy but can be expensive.

4.4 Declarative Mapping

We change the structure of the XML:

PartnerOrderXML -> CanonicalOrderXML

Tends to fit:

  • XSLT;
  • XQuery for query-heavy mapping;
  • manual StAX for simple or performance-critical mapping;
  • binding + mapper if the business object layer matters.

4.5 Object-Centric Business Logic

The domain logic wants objects:

Order order = xmlMapper.read(...);
pricingEngine.price(order);

Tends to fit:

  • JAXB/Jakarta XML Binding;
  • generated classes from XSD;
  • partial StAX extraction into a domain record;
  • DOM-to-domain mapper for small documents.

Poor fit:

  • XSLT when the logic is highly imperative and stateful;
  • raw parser code scattered across business services.

5. Size and Memory Strategy

XML size is not just bytes on disk.

A DOM object graph can be far larger than the file itself because:

  • every element becomes an object;
  • every attribute becomes an object;
  • every text node becomes an object/string;
  • namespace metadata;
  • internal parser structures;
  • Java object overhead.

Practical guideline:

| XML Size | Default Thinking |
|---|---|
| < 1 MB | DOM/XPath/binding is usually safe at low volume. |
| 1–20 MB | Evaluate memory, concurrency, and access pattern. |
| 20–100 MB | Start seriously considering StAX/SAX/hybrid. |
| > 100 MB | Do not default to DOM. Use streaming or staged processing. |
| GB-level | StAX/SAX/pipeline/checkpoint. |

These are not absolute numbers. What matters is:

memory impact = file size × object expansion × concurrency × retention time

Example:

10 MB XML × 8x expansion × 100 concurrent requests = 8 GB heap pressure

So a DOM approach that is safe on a laptop can become an incident in a concurrent service.
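The arithmetic above can be sketched as a small helper. This is only a back-of-envelope calculation: the class and method names are hypothetical, the 8x expansion factor is the illustrative figure from the example (measure your own), and retention time is left out because it affects how long the pressure lasts rather than the peak.

```java
/** Back-of-envelope heap estimate for tree-based XML processing (illustrative only). */
final class XmlHeapEstimate {

    private XmlHeapEstimate() {}

    /**
     * Estimates peak heap pressure in bytes.
     *
     * @param fileSizeBytes      size of one XML payload on disk
     * @param expansionFactor    assumed in-memory growth of the tree (an assumption, not a constant)
     * @param concurrentRequests number of payloads held in memory at the same time
     */
    static long estimateBytes(long fileSizeBytes, double expansionFactor, int concurrentRequests) {
        return Math.round(fileSizeBytes * expansionFactor * concurrentRequests);
    }

    public static void main(String[] args) {
        long tenMb = 10_000_000L; // 10 MB, decimal units
        long bytes = estimateBytes(tenMb, 8.0, 100);
        System.out.println("Estimated heap pressure in bytes: " + bytes);
    }
}
```

Running the estimate for the worst realistic payload and the configured request concurrency, before choosing DOM, is usually cheaper than discovering the limit in production.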


6. Latency vs Throughput

There are two distinct profiles.

6.1 Low-Latency Request/Response

Examples:

  • SOAP-like request;
  • partner API XML payload;
  • small regulatory lookup;
  • synchronous validation gateway.

Priorities:

  • fail fast;
  • clear error messages;
  • bounded memory;
  • timeout;
  • security hardening;
  • no unbounded transform.

Common choices:

  • XSD validation of the envelope;
  • DOM/XPath for small documents;
  • StAX for targeted extraction;
  • XSLT with compiled templates if the mapping is declarative.

6.2 High-Throughput Batch

Examples:

  • nightly bank statement;
  • telco CDR XML;
  • large claims file;
  • invoice bundle;
  • government report ingestion.

Priorities:

  • streaming;
  • batching;
  • checkpoint;
  • reject report;
  • idempotency;
  • resumability;
  • audit trail.

Common choices:

  • StAX extraction;
  • XSD validation stage;
  • per-record domain validation;
  • batch persistence;
  • output generation with writer/XSLT;
  • reconciliation summary.

7. Security as Selection Constraint

Every processing model must pass the security baseline.

Untrusted XML risks:

  • XXE;
  • SSRF through external entity/schema/stylesheet references;
  • local file disclosure;
  • XML bomb/entity expansion;
  • oversized payload;
  • decompression bomb;
  • XPath injection;
  • XSLT external resource access;
  • dangerous extension functions;
  • log leakage of sensitive payload.

Security constraints can change the choice of tool.

Examples:

| Requirement | Impact |
|---|---|
| External DTD forbidden | The parser must disable DTDs and external entity resolution. |
| Stylesheet supplied by partner | Do not execute blindly; treat it as code/config. |
| XPath expression supplied by user | Parameterize or whitelist; avoid expression injection. |
| Huge compressed input | Need a decompression limit and streaming. |
| PII in XML | Logging, audit, and test fixtures must be redacted. |
| Strict outbound compliance | Generate, then validate against XSD. |

Decision rule:

Do not choose a processor that cannot be locked down to match the threat model.
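As one illustration of "locking down" a processor, here is a minimal sketch of hardened DOM and StAX factories. It assumes the JDK's built-in JAXP implementation; the `disallow-doctype-decl` feature URI is the Xerces-style feature that implementation recognizes, and other parsers may need different settings. The class name is hypothetical.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;

/** Factory helpers for untrusted XML (a sketch, assuming the JDK default parser). */
final class SecureXmlFactories {

    private SecureXmlFactories() {}

    /** DOM factory that rejects any DOCTYPE, which blocks XXE and entity-expansion attacks. */
    static DocumentBuilderFactory hardenedDomFactory() throws ParserConfigurationException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Xerces-style feature understood by the JDK parser: fail on <!DOCTYPE ...>.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        dbf.setXIncludeAware(false);
        dbf.setExpandEntityReferences(false);
        return dbf;
    }

    /** StAX factory with DTD support and external entities switched off. */
    static XMLInputFactory hardenedStaxFactory() {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        xif.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        xif.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        return xif;
    }
}
```

Parser hardening covers only the XXE and entity-expansion rows of the risk list; payload size limits, decompression limits, and log redaction still have to be enforced around the parser.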


8. Maintainability and Team Skill

The most powerful tool is not necessarily the most maintainable one for the team.

DOM Maintainability

Pros:

  • easy to understand;
  • easy to debug;
  • plenty of examples;
  • fits small documents.

Cons:

  • verbose traversal;
  • namespace bugs are common;
  • memory risk;
  • mutation can break the structure.

SAX Maintainability

Pros:

  • efficient;
  • low memory;
  • fits simple parsers.

Cons:

  • callback state is scattered;
  • hard to use for complex logic;
  • testing requires discipline.

StAX Maintainability

Pros:

  • natural control flow;
  • streaming;
  • read/write;
  • fits an explicit state machine.

Cons:

  • the cursor contract is error-prone;
  • manual mapping is verbose;
  • correctness needs strong fixtures.

XPath Maintainability

Pros:

  • concise;
  • expressive;
  • good for small assertions/extractions.

Cons:

  • the namespace context is often wrong;
  • expressions can be brittle;
  • poor performance when used carelessly inside large loops.

XSLT Maintainability

Pros:

  • declarative transforms;
  • fits XML-to-XML;
  • the identity transform pattern is powerful;
  • stylesheets can be versioned as contract artifacts.

Cons:

  • the team needs specialized skills;
  • spaghetti stylesheets can happen;
  • debugging differs from imperative Java;
  • external resource policy must be locked down.

XQuery Maintainability

Pros:

  • powerful for querying XML;
  • fits collections;
  • FLWOR is expressive.

Cons:

  • more niche;
  • processor dependency;
  • query governance matters.

Binding Maintainability

Pros:

  • object-centric;
  • good for stable schemas;
  • business code is more Java-like.

Cons:

  • the generated model can be large;
  • schema evolution is complicated;
  • mixed content is hard;
  • the streaming benefit is lost if the whole document is bound.

9. Decision Matrix

| Constraint | DOM | SAX | StAX | XPath | XSD | XSLT | XQuery | Binding |
|---|---|---|---|---|---|---|---|---|
| Small XML | Excellent | Good | Good | Excellent | Excellent | Excellent | Good | Excellent |
| Huge XML | Poor | Excellent | Excellent | Risky | Good* | Depends | Depends | Poor* |
| Random access | Excellent | Poor | Poor | Good | N/A | Good | Good | Good |
| Streaming extraction | Poor | Excellent | Excellent | Poor | Good* | Depends | Depends | Poor* |
| Mutation | Good | Poor | Poor | Poor | N/A | Good | Good | Good |
| Declarative transform | Poor | Poor | Poor | Poor | N/A | Excellent | Good | Medium |
| Schema contract | Medium | Medium | Medium | Poor | Excellent | Medium | Medium | Good |
| Low memory | Poor | Excellent | Excellent | Medium | Medium | Depends | Depends | Poor* |
| Team familiarity | High | Medium | Medium | High | Medium | Medium | Low | Medium |
| Auditability | Medium | Medium | High | Medium | High | High | Medium | Medium |

Notes:

  • Good* for XSD on huge XML depends on the pipeline and on validator behavior.
  • Poor* applies to binding the whole document; partial binding or unmarshalling fragments can do better.
  • Depends for XSLT/XQuery because the processor and the stylesheet/query design are decisive.

10. Workload Recipes

10.1 Small Request Validation and Extraction

Scenario:

Synchronous XML request < 200 KB.
Need validate structure, extract messageId, route by type.

Recommended:

InputStream
  -> Secure parser config
  -> XSD validation
  -> DOM parse
  -> XPath extraction
  -> route

Reasoning:

  • small payload makes DOM acceptable;
  • XSD gives contract error;
  • XPath concise for route fields;
  • diagnostics easy.

Avoid:

  • raw string regex;
  • XPath without namespace context;
  • parsing twice without reason.
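The recipe above can be sketched as follows. This is a minimal illustration under stated assumptions: the namespace `urn:example:msg`, the inline XSD, and the class and method names are all hypothetical. To avoid a second parse, the sketch parses once into a DOM and validates that DOM, which is one possible ordering rather than the only one.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class SmallRequestPipeline {
    static final String NS = "urn:example:msg"; // hypothetical contract namespace

    // Hypothetical contract: <message> with <messageId> followed by <type>.
    private static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='" + NS + "' elementFormDefault='qualified'>"
        + "<xs:element name='message'><xs:complexType><xs:sequence>"
        + "<xs:element name='messageId' type='xs:string'/>"
        + "<xs:element name='type' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    private final Schema schema; // compiled once; Schema objects are thread-safe

    SmallRequestPipeline() throws SAXException {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        this.schema = sf.newSchema(new StreamSource(new StringReader(XSD)));
    }

    /** Returns {messageId, type}; throws SAXException if the XML is malformed or invalid. */
    String[] validateAndExtract(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = dbf.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // With no ErrorHandler set, the Validator throws on the first violation.
        schema.newValidator().validate(new DOMSource(doc));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(singleNamespace("m", NS));
        return new String[] {
            xpath.evaluate("/m:message/m:messageId", doc),
            xpath.evaluate("/m:message/m:type", doc)
        };
    }

    /** Binds exactly one prefix to one namespace URI for XPath evaluation. */
    private static NamespaceContext singleNamespace(String prefix, String uri) {
        return new NamespaceContext() {
            @Override public String getNamespaceURI(String p) {
                return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String u) {
                return uri.equals(u) ? prefix : null;
            }
            @Override public Iterator<String> getPrefixes(String u) {
                return uri.equals(u)
                    ? Collections.singletonList(prefix).iterator()
                    : Collections.<String>emptyIterator();
            }
        };
    }
}
```

The prefix `m` exists only inside the XPath expressions; the incoming document may use any prefix as long as the namespace URI matches.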

10.2 Large Batch Import

Scenario:

Nightly partner file 5 GB.
Contains millions of transactions.
Need persist valid records and report rejected records.

Recommended:

File stream
  -> checksum + size/decompression guard
  -> StAX record extraction
  -> per-record validation
  -> batch persist with idempotency
  -> reject report
  -> import summary

Reasoning:

  • DOM impossible or dangerous;
  • StAX pull loop allows bounded processing;
  • record-level validation gives partial acceptance;
  • line/column improves support.

Avoid:

  • List<Transaction> for entire file;
  • single huge transaction;
  • logging full rejected payload;
  • no checkpoint.
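A minimal sketch of the StAX extraction and batching steps from this recipe follows. The `<tx id="...">amount</tx>` record shape, the tiny batch size, and all names are hypothetical; a real contract should also match on namespace URI, and checksum, idempotency, and checkpointing are left out.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class BatchImportSketch {
    static final int BATCH_SIZE = 2; // deliberately tiny for the demo

    /** Counters plus one short message per rejected record (never the full payload). */
    static final class Summary {
        int accepted;
        int rejected;
        final List<String> rejections = new ArrayList<>();
    }

    static Summary importTransactions(Reader in, Consumer<List<String>> batchSink)
            throws XMLStreamException {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        xif.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        xif.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader r = xif.createXMLStreamReader(in);

        Summary summary = new Summary();
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        try {
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT || !"tx".equals(r.getLocalName())) {
                    continue;
                }
                // Capture position and attributes before the cursor moves on.
                int line = r.getLocation().getLineNumber();
                int column = r.getLocation().getColumnNumber();
                String id = r.getAttributeValue(null, "id");
                String amount = r.getElementText().trim(); // cursor ends on </tx>

                // Per-record validation: a bad record is rejected, the import continues.
                if (id == null || !amount.matches("-?\\d+(\\.\\d+)?")) {
                    summary.rejected++;
                    summary.rejections.add("line " + line + ", column " + column + ": invalid tx");
                    continue;
                }
                summary.accepted++;
                batch.add(id + "=" + amount);
                if (batch.size() == BATCH_SIZE) {
                    batchSink.accept(new ArrayList<>(batch)); // hand off a copy, then reuse the buffer
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                batchSink.accept(new ArrayList<>(batch)); // flush the final partial batch
            }
        } finally {
            r.close();
        }
        return summary;
    }
}
```

Memory stays bounded by the batch size rather than the file size, which is the property the recipe depends on.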

10.3 XML-to-XML Canonical Transformation

Scenario:

Partner-specific order XML -> internal canonical order XML.
Mapping mostly structural with renames, default values, normalization.

Recommended:

Input XML
  -> secure parse/validation
  -> XSLT compiled stylesheet
  -> canonical XML
  -> XSD validate output

Reasoning:

  • XSLT is designed for XML transformation;
  • stylesheet can be versioned per partner;
  • output validation catches mapping bugs;
  • transformation audit is reproducible.

Avoid:

  • Java code with hundreds of manual writer calls;
  • stylesheet with uncontrolled external document access;
  • mixing business side effects into transformation.
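The "XSLT compiled stylesheet" step can be sketched with the JAXP `Templates` API. This is a minimal illustration that assumes the JDK's built-in XSLT 1.0 processor; the element names, the inline stylesheet, and the class name are hypothetical, and the input/output XSD validation steps of the recipe are omitted.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class PartnerOrderTransform {

    // Hypothetical mapping: rename elements, move an attribute, normalize whitespace.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/PartnerOrder'>"
        + "<CanonicalOrder id='{@orderNo}'>"
        + "<Customer><xsl:value-of select='normalize-space(Buyer)'/></Customer>"
        + "</CanonicalOrder>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    private final Templates templates; // compiled once, thread-safe, reused per request

    PartnerOrderTransform() throws TransformerException {
        TransformerFactory tf = TransformerFactory.newInstance();
        tf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // JAXP 1.5 properties: forbid external DTDs and external stylesheets (import/include/document()).
        tf.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        tf.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        this.templates = tf.newTemplates(new StreamSource(new StringReader(XSLT)));
    }

    String toCanonical(String partnerXml) throws TransformerException {
        Transformer transformer = templates.newTransformer(); // cheap; not thread-safe, so one per call
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(partnerXml)), new StreamResult(out));
        return out.toString();
    }
}
```

Compiling to `Templates` once and creating a `Transformer` per request avoids recompiling the stylesheet on every call, and it gives a natural place to attach a mapping version for the audit trail.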

10.4 XML Report Generation

Scenario:

Generate regulatory XML report from database rows.
Need deterministic, validated output.

Recommended:

DB cursor/page
  -> domain aggregation
  -> XMLStreamWriter or XSLT
  -> output file
  -> XSD validation
  -> checksum + audit metadata

Choose writer if:

  • output structure straightforward;
  • data already in Java objects;
  • file large.

Choose XSLT if:

  • source is XML;
  • mapping is declarative;
  • template reuse matters.

Avoid:

  • string concatenation;
  • non-deterministic order;
  • no final validation;
  • no checksum.
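For the writer option, a minimal `XMLStreamWriter` sketch might look like this. The report namespace, element names, and class name are hypothetical; the final XSD validation and checksum steps from the recipe are not shown.

```java
import java.io.StringWriter;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class ReportWriterSketch {
    static final String NS = "urn:example:report"; // hypothetical report namespace

    private ReportWriterSketch() {}

    /** Writes one <r:row> per account, sorted by account so the output is deterministic. */
    static String write(Map<String, String> amountByAccount) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
        try {
            w.writeStartDocument("UTF-8", "1.0");
            w.writeStartElement("r", "report", NS);
            w.writeNamespace("r", NS); // declare the prefix on the root element

            // TreeMap copy: output order no longer depends on the caller's map implementation.
            for (Map.Entry<String, String> row : new TreeMap<>(amountByAccount).entrySet()) {
                w.writeStartElement("r", "row", NS);
                w.writeAttribute("account", row.getKey());
                w.writeCharacters(row.getValue()); // the writer escapes &, < and >
                w.writeEndElement();
            }

            w.writeEndElement();
            w.writeEndDocument();
        } finally {
            w.close();
        }
        return out.toString();
    }
}
```

Unlike string concatenation, the writer handles escaping and element nesting, and sorting the input removes one common source of non-deterministic output.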

10.5 XML Search Service

Scenario:

Need query many XML documents by fields and sometimes return fragments.

Options:

  • Index extracted fields into relational/search store.
  • Store XML plus metadata.
  • Use XQuery-capable XML database/processor if XML query is core capability.
  • Use XPath only for small local documents, not as unindexed search over large archive.

Reasoning:

XML parser is not a search engine.

If queries are frequent and large-scale, create indexed projections.


11. Hybrid Strategies

Real systems often combine models.

11.1 StAX Envelope + DOM Island

Use StAX to scan huge document, but build DOM only for a selected subtree.

Large XML file
  -> StAX scan
  -> when <case> found, materialize that subtree as DOM
  -> XPath/DOM processing for that case
  -> discard DOM island

Use when:

  • full document huge;
  • individual subtree manageable;
  • subtree logic needs random access.

Risk:

  • subtree may still be too large;
  • need subtree size limit.
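One way to materialize a DOM island is to hand the positioned StAX cursor to an identity `Transformer`. The sketch below assumes the JDK's built-in implementation, un-namespaced `<case>`/`<item>` elements, and hypothetical names; it does not enforce the subtree size limit mentioned above. The loop re-checks the current event after each transform because the transformer may leave the cursor on the event after `</case>`, which could be the next `<case>`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomIslandSketch {

    private DomIslandSketch() {}

    /** Returns one "caseId:itemCount" entry per <case>, holding only one case in memory at a time. */
    static List<String> itemCountsPerCase(String xml) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        xif.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        xif.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = xif.createXMLStreamReader(new StringReader(xml));

        TransformerFactory tf = TransformerFactory.newInstance();
        tf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Transformer identity = tf.newTransformer(); // no stylesheet: copies input to output

        List<String> results = new ArrayList<>();
        try {
            while (true) {
                if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                        && "case".equals(reader.getLocalName())) {
                    // Copy just this subtree into a fresh DOM document.
                    DOMResult island = new DOMResult();
                    identity.transform(new StAXSource(reader), island);
                    Element root = ((Document) island.getNode()).getDocumentElement();

                    // Random-access work on the island (XPath could be applied here as well).
                    int items = root.getElementsByTagName("item").getLength();
                    results.add(root.getAttribute("id") + ":" + items);
                    continue; // re-check the current event without advancing; island becomes garbage
                }
                if (!reader.hasNext()) {
                    break;
                }
                reader.next();
            }
        } finally {
            reader.close();
        }
        return results;
    }
}
```

Calling `itemCountsPerCase` on a document with adjacent `<case>` elements is a useful fixture, because that is exactly where a naive `while (hasNext()) next()` loop tends to skip records.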

11.2 XSD First + StAX Business Processing

Input file
  -> XSD validation stage
  -> StAX extraction stage

Use when:

  • invalid XML should be rejected early;
  • business extraction assumes contract validity;
  • file can be staged/read twice.

Risk:

  • doubles I/O;
  • not ideal for very large streams without staging.

11.3 StAX Extract + XPath on Fragment

StAX extracts <order>
  -> fragment converted to DOM/XDM
  -> XPath assertions/extractions

Use when:

  • record is manageable;
  • XPath rules are configurable;
  • full file too large.

Risk:

  • fragment materialization overhead;
  • namespace context must be preserved.

11.4 XSLT Transform + Java Validation

Partner XML
  -> XSLT normalize
  -> canonical XML
  -> Java domain validation

Use when:

  • structural mapping is declarative;
  • domain validation needs Java services/reference data;
  • canonical XML is an integration boundary.

Risk:

  • transformation and validation responsibilities blur;
  • error traceability must map back to input.

11.5 Binding for Header + StAX for Line Items

Document header -> bind to object
Millions of line items -> StAX stream

Use when:

  • header small/stable;
  • body massive/repeated;
  • domain code benefits from typed header object.

Risk:

  • two models in one parser;
  • contract boundaries must be documented.

12. Architecture Pattern: XML Ingestion Pipeline

Boundary discipline:

| Stage | Responsibility | Should Not Do |
|---|---|---|
| Security gate | Prevent unsafe XML processing | Interpret business meaning |
| XSD validation | Structure/datatype contract | Call database/services |
| Extractor | Convert XML event/tree to record | Persist directly |
| Domain validation | Business invariants | Parse raw XML manually |
| Persistence/publish | Durable side effect | Decide XML contract |
| Audit | Evidence and traceability | Store unnecessary PII |

13. Architecture Pattern: XML Transformation Service

Design concerns:

  • mapping version must be explicit;
  • stylesheet dependency must be controlled;
  • input/output checksum should be stored;
  • transformation parameters should be recorded;
  • output should be validated;
  • errors should include stylesheet version and line/column if available;
  • retries must be idempotent.

14. Architecture Pattern: XML Contract Gateway

Gateway rule:

A gateway should enforce XML boundary rules, not become a dumping ground for all business logic.


15. Choosing Between XPath and StAX

Both can extract data, but from different mental models.

XPath Example

// Assumes xpath has a NamespaceContext that binds prefix "o" to the orders namespace URI.
String orderId = xpath.evaluate(
    "/o:orders/o:order[1]/@id",
    document
);

Great when:

  • document is already a tree;
  • query is concise;
  • extraction points are few;
  • readability matters;
  • testing assertions.

StAX Example

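// Assumes a static import of XMLStreamConstants.START_ELEMENT;
// is(...) and readOrder(...) are helper methods defined elsewhere.
while (reader.hasNext()) {
    int event = reader.next();
    // Match on namespace URI + local name, never on prefix.
    if (event == START_ELEMENT && is(reader, NS, "order")) {
        // readOrder must leave the cursor on the matching </order> end tag.
        OrderRecord order = readOrder(reader);
        consumer.accept(order);
    }
}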

Great when:

  • file is large;
  • records are repeated;
  • memory must be bounded;
  • extraction is sequential;
  • downstream batching matters.

Rule:

XPath selects nodes from a model.
StAX walks a stream to produce actions.

If there is no model because we intentionally avoid building one, XPath is not the primary tool.


16. Choosing Between XSLT and Java Mapping

Use XSLT When

  • input and output are XML-centric;
  • mapping is structural;
  • identity transform plus overrides is natural;
  • partner-specific mappings must be versioned;
  • business wants declarative mapping artifacts;
  • output should be reproducible;
  • transformation has little side effect.

Use Java Mapping When

  • logic depends on services/database/reference data;
  • output is object/domain command, not XML;
  • mapping is imperative and stateful;
  • team cannot maintain XSLT safely;
  • transformation must be deeply integrated with domain validation;
  • debugging through Java stack is critical.

Hybrid

Often best:

XSLT: structural normalization
Java: domain validation/enrichment

Do not force all business rules into XSLT just because XML is involved.

Do not write unmaintainable Java tree manipulation just because team avoids XSLT.


17. Choosing Between XSD and Java Validation

XSD validates XML contract:

  • required element/attribute;
  • order and occurrence;
  • simple datatype;
  • enumerations;
  • pattern/facet constraints;
  • namespace structure;
  • type composition.

Java validates domain semantics:

  • account must exist;
  • order total must match pricing service;
  • submission date cannot violate business calendar;
  • user has permission;
  • duplicate idempotency key;
  • cross-record aggregation;
  • state transition validity.

Rule:

Use XSD for syntax/structure/type contract.
Use Java for business truth.

Bad design:

  • XSD too weak: everything is xs:string, all rules in Java.
  • XSD too strong: business rules encoded as fragile regex/facets.
  • Java duplicates XSD checks with inconsistent error messages.

Good design:

  • XSD catches contract violations early;
  • Java assumes contract-normalized shape;
  • errors are categorized separately;
  • both validation layers are tested.

18. Choosing Binding vs Manual Parsing

Binding is Good When

  • schema stable;
  • object model maps naturally;
  • payload size manageable;
  • downstream code wants typed object graph;
  • generated code governance is acceptable;
  • unknown extension policy is clear.

Manual StAX/SAX is Good When

  • payload huge;
  • only subset needed;
  • records independent;
  • object graph would be wasteful;
  • input structure is awkward;
  • partial acceptance required.

DOM Mapper is Good When

  • document small;
  • mapping needs flexible navigation;
  • schema has mixed content or variable structure;
  • you need custom diagnostics.

Binding anti-pattern:

Generate 700 classes from huge XSD, expose them as domain model, then couple all business logic to generated schema classes.

Better:

XML binding DTO -> anti-corruption mapper -> domain model
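The DTO-to-domain boundary can be sketched in plain Java. All class and field names here are hypothetical, and the DTO is written by hand instead of being generated or annotated so the example stays dependency-free.

```java
import java.math.BigDecimal;
import java.util.Locale;
import java.util.Objects;

/** Mirrors the XML contract shape; in practice this would be a generated or annotated binding class. */
final class OrderXmlDto {
    String orderNo;
    String totalAmount;
    String currencyCode;
}

/** Domain model: typed, validated, and unaware of XML. */
final class Order {
    final String id;
    final BigDecimal total;
    final String currency;

    Order(String id, BigDecimal total, String currency) {
        this.id = id;
        this.total = total;
        this.currency = currency;
    }
}

/** Anti-corruption mapper: the only place that knows both shapes. */
final class OrderXmlMapper {

    private OrderXmlMapper() {}

    static Order toDomain(OrderXmlDto dto) {
        Objects.requireNonNull(dto, "dto");
        String id = required(dto.orderNo, "orderNo");
        String currency = required(dto.currencyCode, "currencyCode").toUpperCase(Locale.ROOT);
        BigDecimal total;
        try {
            total = new BigDecimal(required(dto.totalAmount, "totalAmount"));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("totalAmount is not a decimal number", e);
        }
        return new Order(id, total, currency);
    }

    /** Trims the value and rejects null or blank input with a field-specific message. */
    private static String required(String value, String field) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(field + " is required");
        }
        return value.trim();
    }
}
```

When the schema renames `orderNo` or changes a type, only the DTO and the mapper change; pricing and other business code keep working against `Order`.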

19. Versioning Pressure Changes the Choice

If XML contract evolves frequently, processing strategy must handle change.

| Evolution Pattern | Better Strategy |
|---|---|
| Add optional elements | XSD versioning + tolerant reader. |
| Partner-specific mapping | XSLT per partner/version. |
| Multiple schema versions active | Router by namespace/version. |
| Need backward compatibility | Canonical model + adapters. |
| Frequent field renames | Transformation layer, not scattered XPath strings. |
| Complex extension points | DOM/XDM island or extensible binding strategy. |

Avoid:

  • hardcoding XPath strings across codebase;
  • generated classes leaking everywhere;
  • one parser method that handles all partner versions;
  • namespace-less XML contracts;
  • silent fallback for unknown versions.

20. Error Handling Strategy by Model

| Model | Typical Error | Diagnostic Need |
|---|---|---|
| DOM | Parse failure, missing node, namespace lookup failure | line/column often parse-level only; node context needed |
| SAX | Callback state bug, invalid event sequence | locator, current path, state |
| StAX | cursor misuse, missing field, malformed XML | location, current path, record key |
| XPath | empty result, namespace context bug, expression error | expression id, namespace map, document version |
| XSD | validation violation | line/column, schema version, error code mapping |
| XSLT | template error, missing parameter, resource issue | stylesheet version, template/mode, source location |
| XQuery | query error, type error, missing collection | query id/version, parameters, processor diagnostics |
| Binding | unmarshal error, unexpected element, adapter failure | schema/class version, field path, line/column |

Normalize errors for callers:

XML_PARSE_ERROR
XML_SECURITY_REJECTED
XML_SCHEMA_INVALID
XML_MAPPING_FAILED
XML_DOMAIN_INVALID
XML_TRANSFORMATION_FAILED
XML_OUTPUT_INVALID

Do not expose raw internal stack traces as partner-facing contract.
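A minimal sketch of this normalization is shown below. The enum values come from the list above; the exception class, its factory method, and the message format are hypothetical. The pipeline stage supplies the code, because the same low-level exception type (for example `SAXParseException`) can come from parsing or from schema validation.

```java
import javax.xml.stream.XMLStreamException;
import javax.xml.transform.TransformerException;
import org.xml.sax.SAXParseException;

enum XmlErrorCode {
    XML_PARSE_ERROR,
    XML_SECURITY_REJECTED,
    XML_SCHEMA_INVALID,
    XML_MAPPING_FAILED,
    XML_DOMAIN_INVALID,
    XML_TRANSFORMATION_FAILED,
    XML_OUTPUT_INVALID
}

/** Caller-facing error: stable code plus optional position; the raw cause stays internal. */
final class XmlProcessingException extends RuntimeException {
    final XmlErrorCode code;
    final int line;   // -1 when unknown
    final int column; // -1 when unknown

    private XmlProcessingException(XmlErrorCode code, int line, int column, Throwable cause) {
        super(code.name(), cause);
        this.code = code;
        this.line = line;
        this.column = column;
    }

    /** Wraps a low-level failure with the code of the stage that was running. */
    static XmlProcessingException from(XmlErrorCode stageCode, Exception cause) {
        int line = -1;
        int column = -1;
        if (cause instanceof SAXParseException) {
            SAXParseException e = (SAXParseException) cause;
            line = e.getLineNumber();
            column = e.getColumnNumber();
        } else if (cause instanceof XMLStreamException
                && ((XMLStreamException) cause).getLocation() != null) {
            line = ((XMLStreamException) cause).getLocation().getLineNumber();
            column = ((XMLStreamException) cause).getLocation().getColumnNumber();
        } else if (cause instanceof TransformerException
                && ((TransformerException) cause).getLocator() != null) {
            line = ((TransformerException) cause).getLocator().getLineNumber();
            column = ((TransformerException) cause).getLocator().getColumnNumber();
        }
        return new XmlProcessingException(stageCode, line, column, cause);
    }

    /** Message that is safe to return to a partner: no stack trace, no payload. */
    String partnerMessage() {
        return line > 0 ? code + " at line " + line + ", column " + column : code.name();
    }
}
```

The original exception remains available as the cause for internal logs, while partners only ever see the code and position.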


21. Observability by Processing Strategy

Production XML systems need evidence.

Capture:

  • parser type/version if relevant;
  • schema version;
  • stylesheet/query version;
  • mapping version;
  • input checksum;
  • output checksum;
  • file/message id;
  • partner id;
  • namespace/message type;
  • record count;
  • accepted count;
  • rejected count;
  • processing duration;
  • first error code;
  • line/column for rejected record;
  • memory/size guard metrics.

Example import summary:

{
  "importId": "imp-20260702-0001",
  "partnerId": "partner-a",
  "fileName": "orders-2026-07-01.xml.gz",
  "inputSha256": "...",
  "schemaVersion": "order-v1.4",
  "parserModel": "StAX",
  "totalRecords": 1000000,
  "acceptedRecords": 999970,
  "rejectedRecords": 30,
  "durationMs": 184000,
  "status": "COMPLETED_WITH_REJECTIONS"
}

22. Testing Strategy by Model

| Model | Test Focus |
|---|---|
| DOM | namespace lookup, missing nodes, mutation result, serialization |
| SAX | event order, state transitions, fragmented characters |
| StAX | cursor postcondition, skip subtree, batch behavior, large file |
| XPath | namespace context, empty/multiple result, compiled expression |
| XSD | valid/invalid fixtures, error mapping, version compatibility |
| XSLT | golden output, parameters, resolver policy, output validation |
| XQuery | query result, type behavior, collection fixtures |
| Binding | generated class compatibility, adapter behavior, unknown elements |

Golden rule:

Test XML semantically, not only as raw strings.

Use canonical comparison where possible. Whitespace, attribute ordering, and namespace prefixes can differ while XML meaning remains equivalent.
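As a rough illustration of semantic comparison, the sketch below compares two documents by namespace URI, local name, attribute set, and significant children. It is a hand-rolled helper with hypothetical names, not a canonicalization standard; ignoring whitespace-only text, trimming text values, and skipping comments are policy assumptions that a real test suite should decide deliberately.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmlSemanticCompare {

    private XmlSemanticCompare() {}

    static boolean equivalent(String xmlA, String xmlB) throws Exception {
        return sameElement(parse(xmlA), parse(xmlB));
    }

    private static Element parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required: identity is the namespace URI, not the prefix
        dbf.setCoalescing(true);     // merge CDATA sections into plain text nodes
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    }

    private static boolean sameElement(Element a, Element b) {
        if (!Objects.equals(a.getNamespaceURI(), b.getNamespaceURI())
                || !a.getLocalName().equals(b.getLocalName())
                || !attributes(a).equals(attributes(b))) {
            return false;
        }
        List<Node> kidsA = significantChildren(a);
        List<Node> kidsB = significantChildren(b);
        if (kidsA.size() != kidsB.size()) {
            return false;
        }
        for (int i = 0; i < kidsA.size(); i++) {
            Node x = kidsA.get(i);
            Node y = kidsB.get(i);
            if (x.getNodeType() != y.getNodeType()) {
                return false;
            }
            boolean same = x.getNodeType() == Node.ELEMENT_NODE
                ? sameElement((Element) x, (Element) y)
                : x.getNodeValue().trim().equals(y.getNodeValue().trim());
            if (!same) {
                return false;
            }
        }
        return true;
    }

    /** Attributes keyed by {namespace}localName, so order and prefixes do not matter. */
    private static Map<String, String> attributes(Element e) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            if (XMLConstants.XMLNS_ATTRIBUTE_NS_URI.equals(attr.getNamespaceURI())) {
                continue; // xmlns declarations are syntax, not data
            }
            result.put("{" + attr.getNamespaceURI() + "}" + attr.getLocalName(), attr.getValue());
        }
        return result;
    }

    /** Keeps elements and non-blank text; drops whitespace-only text, comments, and PIs. */
    private static List<Node> significantChildren(Element e) {
        List<Node> kept = new ArrayList<>();
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            boolean element = n.getNodeType() == Node.ELEMENT_NODE;
            boolean text = n.getNodeType() == Node.TEXT_NODE && !n.getNodeValue().trim().isEmpty();
            if (element || text) {
                kept.add(n);
            }
        }
        return kept;
    }
}
```

In golden-file tests, an assertion on `equivalent(expected, actual)` survives reformatting and prefix changes that would break a raw string comparison.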


23. Anti-Decision Patterns

23.1 “Always Use DOM, It Is Easier”

Fine for small payload. Dangerous for batch and high concurrency.

23.2 “Always Use Streaming, It Is Faster”

Streaming can make code more complex. If payload small and logic needs random access, DOM/XPath may be better.

23.3 “XSD Handles Validation, We Are Done”

XSD cannot verify database existence, authorization, state transition, or business calendar.

23.4 “XSLT Is Old, Use Java”

For XML-to-XML transformation, XSLT can be cleaner, more auditable, and less error-prone than manual Java tree manipulation.

23.5 “Generated Classes Are Domain Model”

Generated XML classes represent contract shape, not necessarily domain truth.

23.6 “Namespace Prefix Is Stable”

Prefix is syntax. Namespace URI is identity.

23.7 “We Can Regex XML”

XML is not regular text. Use XML parser.

23.8 “Validation Before Processing Is Always Better”

For huge files, upfront validation may double I/O or block partial acceptance. Choose deliberately.


24. Practical Selection Examples

Example A — Extract Message ID from Small SOAP Envelope

Choice:

DOM + XPath

Reason:

  • small document;
  • header lookup;
  • namespace-aware XPath expressive;
  • easy diagnostics.

Example B — Import 20 Million Transaction Lines

Choice:

StAX + batch consumer + domain validator

Reason:

  • huge record stream;
  • bounded memory;
  • partial acceptance;
  • line/column per rejection.

Example C — Convert XML to HTML Statement

Choice:

XSLT

Reason:

  • template-driven rendering;
  • XML-to-HTML is native XSLT use case;
  • output reproducible.

Example D — Validate Partner Payload Against Contract

Choice:

XSD Validation API

Reason:

  • schema is executable contract;
  • standardized error locations;
  • decouples contract shape from business semantics.

Example E — Search Archive by Invoice Number

Choice:

Extract invoice number into index; do not scan XML with XPath every time

Reason:

  • parser is not indexing system;
  • operational query needs indexed projection.

Example F — Partner-Specific Canonicalization

Choice:

XSLT per partner/version + output XSD validation

Reason:

  • mapping is versioned artifact;
  • canonical output contract must be enforced;
  • audit can store stylesheet version.

25. Parser Selection Checklist

Before choosing, answer:

  1. Is input trusted or untrusted?
  2. What is max compressed and uncompressed size?
  3. What is expected concurrency?
  4. Do we need full document access?
  5. Can records be processed independently?
  6. Do we need partial acceptance?
  7. Is transformation structural or business-heavy?
  8. Is schema stable?
  9. Is output XML or Java domain command?
  10. Do we need audit trail?
  11. Do we need line/column rejected record report?
  12. What is versioning strategy?
  13. Who maintains mapping rules?
  14. What is the failure recovery model?
  15. What should happen to unknown elements?
  16. What security features must be disabled/enforced?
  17. What tests prove the selected model is safe?
  18. How will we detect performance regression?

If you cannot answer these, parser choice is premature.


For production Java XML systems, these defaults are usually sane:

| Situation | Default |
|---|---|
| Small XML request | Secure parse + XSD if contract exists + DOM/XPath or binding |
| Large XML file | StAX extraction + batch processing |
| Simple event extraction | SAX or StAX; prefer StAX if team wants pull control |
| XML-to-XML mapping | XSLT with compiled templates |
| Output XML report | XMLStreamWriter or XSLT, then XSD validate |
| Contract enforcement | XSD + domain validation separation |
| Repeated query over many XML docs | Extract/index metadata, consider XQuery/XML DB only if XML query is core |
| Versioned partner formats | Namespace/version router + adapter/stylesheet per version |
| Auditable transformation | Store input/output checksum + mapping version + validation result |

27. Deliberate Practice

Exercise 1: Decision exercise

Take the following 10 XML scenarios and choose a model for each:

  1. SOAP request 50 KB.
  2. Regulatory output 300 MB.
  3. Partner invoice file 4 GB.
  4. XML config 20 KB.
  5. XML-to-HTML rendering.
  6. Query 10,000 XML documents by customer id.
  7. Validate schema compatibility.
  8. Extract only <header> from huge document.
  9. Convert old partner XML v1 to canonical v3.
  10. Parse mixed-content legal document.

For each scenario, write down:

- selected model
- why
- rejected alternatives
- security controls
- failure mode
- test fixture required

Exercise 2: Hybrid design

Design a pipeline:

Input: 2 GB XML file containing <case> records.
Need: validate header, stream cases, reject invalid cases, persist valid cases, generate reject XML.

Draw a Mermaid diagram and define:

  • stage boundary;
  • parser model;
  • validation model;
  • batch size;
  • error taxonomy;
  • audit evidence.

Exercise 3: Refactor a bad design

Given a service that:

  • parses XML with DOM;
  • loads a 1 GB file;
  • runs XPath inside a loop;
  • persists per record;
  • logs the full payload on error.

Refactor it into a production-ready design.


28. Mental Model Summary

Parser selection is an architecture decision.

The most important rule:

Choose based on access pattern, not habit.

The final mental model:

  • DOM: tree, random access, memory cost.
  • SAX: push event, efficient, callback complexity.
  • StAX: pull event, streaming control, explicit state machine.
  • XPath: declarative selection over a model.
  • XSD: executable structural contract.
  • XSLT: declarative XML transformation.
  • XQuery: query language for XML data model/collections.
  • Binding: XML contract mapped to object graph.
  • Hybrid: production-grade composition with explicit boundaries.

Top-level principle:

Good XML engineering is not knowing every API. It is choosing the smallest processing model that satisfies correctness, security, performance, evolvability, and operability.

The next part moves into XSD foundations as contract design: once we know how to choose a processing model, we need to design an XML contract that can be validated, evolved, and maintained in production.


References

  • Oracle Java SE API — java.xml module.
  • Oracle Java SE API — DOM, SAX, StAX, XPath, Validation, and Transformation packages.
  • Oracle Java Tutorials — JAXP and StAX.
  • W3C XML, XML Namespaces, XPath, XQuery, XSLT, and XML Schema specifications.
  • OWASP XML External Entity Prevention Cheat Sheet.