Parser Selection and Processing Strategy
Learn Java XML In Action - Part 008
A strategy for choosing parsers and processing models in Java XML: DOM, SAX, StAX, XPath, XSD validation, XSLT, XQuery, binding, hybrid pipelines, a decision matrix, anti-patterns, and production architecture trade-offs.
Goal of This Part
This part answers a question that matters more than "how do I parse XML?":
Which processing model fits this problem best?
Strong engineers do not pick DOM, SAX, StAX, XPath, XSD, XSLT, XQuery, or binding because it is familiar. They choose based on:
- document size;
- data access pattern;
- shape of the transformation;
- validation requirements;
- latency;
- memory;
- auditability;
- contract evolution;
- security;
- failure modes;
- team skill;
- long-term maintainability.
This part is a decision framework.
After it, we have a practical map before going deeper into XSD, XPath, XQuery, and XSLT.
1. The Core Question
The wrong question:
Which XML parser is the best?
The right question:
What kind of XML work is being done?
There are several kinds of work:
| Work Type | Practical Question |
|---|---|
| Parse | How do we read XML into events, a tree, or objects? |
| Validate | Does the XML conform to the contract? |
| Extract | Which fields or records need to be pulled out? |
| Query | Which parts satisfy a given expression? |
| Transform | How do we turn the XML into another format? |
| Generate | How do we produce valid, deterministic XML? |
| Bind | How is XML mapped to domain objects or DTOs? |
| Route | Where is the document or record sent based on its content? |
| Audit | How do we prove what went in and what came out? |
| Evolve | How does the contract change without breaking partners? |
The parser is only one component.
2. Processing Models in Java XML
Summary:
| Model | Mental Model | Good For | Avoid For |
|---|---|---|---|
| DOM | XML becomes a tree-shaped object graph | Small to medium documents, random access, mutation | Large files, high-throughput streaming |
| SAX | Push event callbacks | Simple extraction from large files, low-level streaming | Complex business flows, code that needs high readability |
| StAX | Pull event cursor | Large-file extraction, streaming writer, controlled pipelines | Random access, complex global rules |
| XPath | Declarative node selection | Small queries over DOM/XDM, assertions, targeted extraction | Scanning millions of large records without a strategy |
| XSD | Declarative contract validation | Structure/datatype validation, partner contracts | Complex cross-system business rules |
| XSLT | Declarative transformation | XML-to-XML/HTML/text mapping, canonicalization | Imperative, side-effect-heavy logic |
| XQuery | Declarative XML query | Query/aggregation over XML collections or complex XML | Simple single-document extraction |
| Binding | XML <-> object mapping | Stable schemas, object-centric business layers | Large streaming, mixed content, flexible schemas |
| Hybrid | Combination of models | Complex production pipelines | Pipelines without clear boundaries |
3. Decision Tree
This decision tree is not an absolute law. It is a guardrail that keeps the initial choice from heading in the wrong direction.
4. First Principle: Access Pattern Beats Tool Preference
An XML document can be viewed through several access patterns.
4.1 Full Document Access
We need to see the whole document:
- compare header dan footer total;
- validate cross-section consistency;
- mutate beberapa node berdasarkan node lain;
- render UI preview;
- apply XPath berulang;
- sign/canonicalize subset.
Tends to fit:
- DOM;
- XDM/Saxon tree;
- binding object model;
- XSLT/XQuery when the logic is declarative.
Poor fit:
- raw SAX when the logic needs global context;
- raw StAX when it needs a lot of look-back/look-ahead.
4.2 Sequential Record Access
The document contains many repeating records:
<records>
<record>...</record>
<record>...</record>
<record>...</record>
</records>
We can process the records one at a time.
Tends to fit:
- StAX;
- SAX;
- streaming XSLT if the processor supports it and the stylesheet is streamable;
- hybrid batch validation/extraction.
Poor fit:
- DOM for millions of records;
- binding the entire file into one large object graph.
4.3 Targeted Lookup
We only need a few values:
/header/messageId
/header/sender
/body/order/@id
Tends to fit:
- XPath for small to medium documents;
- StAX for large files;
- SAX when the lookup is simple.
Trade-off:
- XPath is more expressive;
- StAX is more predictable for memory;
- DOM + XPath is easy but can be expensive.
4.4 Declarative Mapping
We are changing the XML structure:
PartnerOrderXML -> CanonicalOrderXML
Tends to fit:
- XSLT;
- XQuery for query-heavy mapping;
- manual StAX for simple or performance-critical mapping;
- binding + mapper when the business object layer matters.
4.5 Object-Centric Business Logic
The domain logic wants objects:
Order order = xmlMapper.read(...);
pricingEngine.price(order);
Tends to fit:
- JAXB/Jakarta XML Binding;
- generated classes from XSD;
- partial StAX extraction into domain record;
- DOM-to-domain mapper for small documents.
Poor fit:
- XSLT when the logic is highly imperative and stateful;
- raw parser code scattered across business services.
5. Size and Memory Strategy
XML size is not just bytes on disk.
A DOM object graph can be far larger than the file because:
- every element becomes an object;
- every attribute becomes an object;
- every text node becomes an object/string;
- namespace metadata;
- internal parser structures;
- Java object overhead.
Practical guideline:
| XML Size | Default Thinking |
|---|---|
| < 1 MB | DOM/XPath/binding is usually safe at low volume. |
| 1–20 MB | Evaluate memory, concurrency, and access pattern. |
| 20–100 MB | Start seriously considering StAX/SAX/hybrid. |
| > 100 MB | Do not default to DOM. Use streaming or staged processing. |
| GB-level | StAX/SAX/pipeline/checkpoint. |
These numbers are not absolute. What matters is:
memory impact = file size × object expansion × concurrency × retention time
Example:
10 MB XML × 8x expansion × 100 concurrent requests = 8 GB heap pressure
So a DOM approach that is safe on a laptop can become an incident in a concurrent service.
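The arithmetic in the example can be sketched as a small helper. The class and method names are illustrative assumptions, and the 8x expansion factor is simply the figure used in the example, not a measured constant; real expansion depends on the parser and the document shape.

```java
/** Rough heap-pressure estimate for in-memory XML models (illustrative names). */
final class XmlMemoryEstimate {

    /**
     * file size x object expansion x number of documents held in memory at once.
     * Retention time is not modeled; it decides how many documents overlap.
     */
    static long estimateHeapBytes(long fileBytes, double expansionFactor, int concurrentDocs) {
        return Math.round(fileBytes * expansionFactor * concurrentDocs);
    }
}
```

Calling `XmlMemoryEstimate.estimateHeapBytes(10_000_000L, 8.0, 100)` yields 8,000,000,000 bytes, which is the 8 GB (decimal) of heap pressure quoted above.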
6. Latency vs Throughput
There are two different profiles.
6.1 Low-Latency Request/Response
Examples:
- SOAP-like request;
- partner API XML payload;
- small regulatory lookup;
- synchronous validation gateway.
Priorities:
- fail fast;
- clear error messages;
- memory bounded;
- timeout;
- security hardening;
- no unbounded transform.
Common choices:
- XSD validate envelope;
- DOM/XPath for small documents;
- StAX for targeted extraction;
- compiled XSLT templates when the mapping is declarative.
6.2 High-Throughput Batch
Examples:
- nightly bank statement;
- telco CDR XML;
- large claims file;
- invoice bundle;
- government report ingestion.
Priorities:
- streaming;
- batching;
- checkpoint;
- reject report;
- idempotency;
- resumability;
- audit trail.
Common choices:
- StAX extraction;
- XSD validation stage;
- per-record domain validation;
- batch persistence;
- output generation with writer/XSLT;
- reconciliation summary.
7. Security as Selection Constraint
Every processing model must pass the security baseline.
Untrusted XML risks:
- XXE;
- SSRF through external entity/schema/stylesheet references;
- local file disclosure;
- XML bomb/entity expansion;
- oversized payload;
- decompression bomb;
- XPath injection;
- XSLT external resource access;
- dangerous extension functions;
- log leakage of sensitive payload.
Security constraints can change the tool choice.
Examples:
| Requirement | Impact |
|---|---|
| External DTD forbidden | The parser must disable DTD processing and the entity resolver. |
| Stylesheet supplied by partner | Do not execute blindly; treat as code/config. |
| XPath expression supplied by user | Parameterize or whitelist; avoid expression injection. |
| Huge compressed input | Need decompression limit and streaming. |
| PII in XML | Logging, audit, and test fixtures must be redacted. |
| Strict outbound compliance | Generate then validate against XSD. |
Decision rule:
Do not pick a processor that cannot be locked down to match the threat model.
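A minimal sketch of what "locked down" can look like for DOM and StAX, assuming the JDK's built-in JAXP implementation and the settings listed in the OWASP XXE cheat sheet. The class name `SecureXml` is hypothetical, and the `disallow-doctype-decl` feature is Xerces-specific, so other parser implementations may reject it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import org.w3c.dom.Document;

final class SecureXml {

    /** DOM factory that rejects any DOCTYPE, which blocks XXE and entity-expansion bombs. */
    static DocumentBuilderFactory secureDomFactory() throws ParserConfigurationException {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Xerces-style feature; assumed to be supported by the parser in use.
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        f.setXIncludeAware(false);
        f.setExpandEntityReferences(false);
        return f;
    }

    /** StAX factory with DTD support and external entities switched off. */
    static XMLInputFactory secureStaxFactory() {
        XMLInputFactory f = XMLInputFactory.newInstance();
        f.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        f.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        return f;
    }

    /** Parses an in-memory document; a DOCTYPE in the input makes the parse fail. */
    static Document parse(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return secureDomFactory().newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
    }
}
```

Payload size limits and decompression guards are separate controls and are not covered by these parser settings.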
8. Maintainability and Team Skill
The most powerful tool is not necessarily the most maintainable one for the team.
DOM Maintainability
Pros:
- easy to understand;
- easy to debug;
- plenty of examples;
- fits small documents.
Cons:
- verbose traversal;
- namespace bugs are common;
- memory risk;
- mutation can corrupt the structure.
SAX Maintainability
Pros:
- efficient;
- low memory;
- fits simple parsers.
Cons:
- callback state is scattered;
- hard to use for complex logic;
- testing takes discipline.
StAX Maintainability
Pros:
- natural control flow;
- streaming;
- read/write;
- fits an explicit state machine.
Cons:
- the cursor contract is error-prone;
- manual mapping is verbose;
- correctness needs strong fixtures.
XPath Maintainability
Pros:
- concise;
- expressive;
- good for assertions and small extractions.
Cons:
- the namespace context is often wrong;
- expressions can be brittle;
- poor performance when used carelessly inside large loops.
XSLT Maintainability
Pros:
- declarative transforms;
- fits XML-to-XML;
- the identity transform pattern is powerful;
- stylesheets can be versioned as contract artifacts.
Cons:
- the team needs specialized skills;
- spaghetti stylesheets can happen;
- debugging differs from imperative Java;
- external resource policy must be locked down.
XQuery Maintainability
Pros:
- powerful for querying XML;
- fits collections;
- FLWOR is expressive.
Cons:
- more niche;
- processor dependency;
- query governance matters.
Binding Maintainability
Pros:
- object-centric;
- good for stable schemas;
- business code feels more like plain Java.
Cons:
- the generated model can be large;
- schema evolution is complicated;
- mixed content is hard;
- the streaming benefit is lost if the whole document is bound.
9. Decision Matrix
| Constraint | DOM | SAX | StAX | XPath | XSD | XSLT | XQuery | Binding |
|---|---|---|---|---|---|---|---|---|
| Small XML | Excellent | Good | Good | Excellent | Excellent | Excellent | Good | Excellent |
| Huge XML | Poor | Excellent | Excellent | Risky | Good* | Depends | Depends | Poor* |
| Random access | Excellent | Poor | Poor | Good | N/A | Good | Good | Good |
| Streaming extraction | Poor | Excellent | Excellent | Poor | Good* | Depends | Depends | Poor* |
| Mutation | Good | Poor | Poor | Poor | N/A | Good | Good | Good |
| Declarative transform | Poor | Poor | Poor | Poor | N/A | Excellent | Good | Medium |
| Schema contract | Medium | Medium | Medium | Poor | Excellent | Medium | Medium | Good |
| Low memory | Poor | Excellent | Excellent | Medium | Medium | Depends | Depends | Poor* |
| Team familiarity | High | Medium | Medium | High | Medium | Medium | Low | Medium |
| Auditability | Medium | Medium | High | Medium | High | High | Medium | Medium |
Notes:
- Good* for XSD on huge XML depends on the pipeline and validator behavior.
- Poor* for binding applies to binding the whole document; partial unmarshalling of fragments can do better.
- Depends for XSLT/XQuery because the processor and the stylesheet/query design largely determine the outcome.
10. Workload Recipes
10.1 Small Request Validation and Extraction
Scenario:
Synchronous XML request < 200 KB.
Need validate structure, extract messageId, route by type.
Recommended:
InputStream
-> Secure parser config
-> XSD validation
-> DOM parse
-> XPath extraction
-> route
Reasoning:
- small payload makes DOM acceptable;
- XSD gives contract error;
- XPath concise for route fields;
- diagnostics easy.
Avoid:
- raw string regex;
- XPath without namespace context;
- parsing twice without reason.
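The XSD validation step of this recipe might look like the sketch below. The inline schema, the element names `request`, `messageId`, and `type`, and the class `RequestValidator` are all hypothetical; the access-restriction properties assume a JAXP 1.5 capable implementation such as the JDK default.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXParseException;

final class RequestValidator {
    // Hypothetical contract: <request> with a required <messageId> and <type>.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='request'><xs:complexType><xs:sequence>"
      + "<xs:element name='messageId' type='xs:string'/>"
      + "<xs:element name='type' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Compile once and reuse: Schema is thread-safe, Validator is not. */
    static Schema compileSchema() throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        sf.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        sf.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        return sf.newSchema(new StreamSource(new StringReader(XSD)));
    }

    /** Returns null when valid, otherwise "line:column message" for the first violation. */
    static String firstViolation(Schema schema, String xml) throws Exception {
        Validator v = schema.newValidator();
        v.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        v.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        try {
            v.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage();
        }
    }
}
```

Returning the position with the message keeps the contract error diagnosable without logging the payload itself.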
10.2 Large Batch Import
Scenario:
Nightly partner file 5 GB.
Contains millions of transactions.
Need persist valid records and report rejected records.
Recommended:
File stream
-> checksum + size/decompression guard
-> StAX record extraction
-> per-record validation
-> batch persist with idempotency
-> reject report
-> import summary
Reasoning:
- DOM impossible or dangerous;
- StAX pull loop allows bounded processing;
- record-level validation gives partial acceptance;
- line/column improves support.
Avoid:
- List<Transaction> for the entire file;
- single huge transaction;
- logging full rejected payload;
- no checkpoint.
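A sketch of the extraction and batching stages of this recipe, under simplifying assumptions: records are hypothetical `<tx id=".." amount=".."/>` elements, the "domain validation" is only a decimal check, and the input is a string rather than a guarded file stream. Checkpointing and idempotent persistence are left out.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class BatchImporter {

    /** Streams tx records, hands them over in batches, and returns the accepted count. */
    static int importAll(String xml, int batchSize,
                         Consumer<List<String>> batchSink, List<String> rejects)
            throws XMLStreamException {
        XMLInputFactory f = XMLInputFactory.newInstance();
        f.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        f.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
        List<String> batch = new ArrayList<>();
        int accepted = 0;
        try {
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT || !"tx".equals(r.getLocalName())) {
                    continue;
                }
                Location loc = r.getLocation();
                String id = r.getAttributeValue(null, "id");
                String amount = r.getAttributeValue(null, "amount");
                if (id == null || !isDecimal(amount)) {
                    // Record key and position only; never the full payload.
                    rejects.add(id + "@" + loc.getLineNumber() + ":" + loc.getColumnNumber());
                    continue;
                }
                batch.add(id + "=" + amount);
                accepted++;
                if (batch.size() == batchSize) {
                    batchSink.accept(new ArrayList<>(batch)); // hand over a copy, reuse the buffer
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                batchSink.accept(new ArrayList<>(batch));     // flush the final partial batch
            }
        } finally {
            r.close();
        }
        return accepted;
    }

    private static boolean isDecimal(String s) {
        if (s == null) return false;
        try { new BigDecimal(s); return true; } catch (NumberFormatException e) { return false; }
    }
}
```

Memory stays bounded by the batch size rather than by the file size, and a rejected record does not stop the rest of the file from being imported.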
10.3 XML-to-XML Canonical Transformation
Scenario:
Partner-specific order XML -> internal canonical order XML.
Mapping mostly structural with renames, default values, normalization.
Recommended:
Input XML
-> secure parse/validation
-> XSLT compiled stylesheet
-> canonical XML
-> XSD validate output
Reasoning:
- XSLT is designed for XML transformation;
- stylesheet can be versioned per partner;
- output validation catches mapping bugs;
- transformation audit is reproducible.
Avoid:
- Java code with hundreds of manual writer calls;
- stylesheet with uncontrolled external document access;
- mixing business side effects into transformation.
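One way to realize the "compiled stylesheet" step with the standard JAXP API is sketched below. The stylesheet, the element names `ord` and `order`, and the class `CanonicalMapper` are hypothetical, and the two `setAttribute` calls assume the JDK's built-in transformer; another processor on the classpath may not recognize them.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CanonicalMapper {
    // Hypothetical partner mapping: <ord no=".."/> becomes <order id=".."/>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/ord'><order id='{@no}'/></xsl:template>"
      + "</xsl:stylesheet>";

    /** Compile once and share: Templates is thread-safe, Transformer is not. */
    static Templates compile() throws TransformerException {
        TransformerFactory tf = TransformerFactory.newInstance();
        tf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Block external DTDs and imported/included stylesheets.
        tf.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        tf.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        return tf.newTemplates(new StreamSource(new StringReader(XSLT)));
    }

    /** Creates a fresh Transformer per call from the shared compiled Templates. */
    static String toCanonical(Templates templates, String partnerXml) throws TransformerException {
        Transformer t = templates.newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(partnerXml)), new StreamResult(out));
        return out.toString();
    }
}
```

In a real service the `Templates` instance would typically be cached per partner and mapping version, and the output would then go through XSD validation as the recipe describes.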
10.4 XML Report Generation
Scenario:
Generate regulatory XML report from database rows.
Need deterministic, validated output.
Recommended:
DB cursor/page
-> domain aggregation
-> XMLStreamWriter or XSLT
-> output file
-> XSD validation
-> checksum + audit metadata
Choose writer if:
- output structure straightforward;
- data already in Java objects;
- file large.
Choose XSLT if:
- source is XML;
- mapping is declarative;
- template reuse matters.
Avoid:
- string concatenation;
- non-deterministic order;
- no final validation;
- no checksum.
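For the writer option, a minimal sketch with `XMLStreamWriter` shows how escaping and deterministic ordering come from the API instead of from string concatenation. The `report` and `row` element names and the class `ReportWriter` are illustrative assumptions; final XSD validation and checksum calculation are not shown.

```java
import java.io.StringWriter;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class ReportWriter {

    /** Writes rows in key order so the same data always yields the same output. */
    static String write(Map<String, String> rows) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("report");
        for (Map.Entry<String, String> e : new TreeMap<>(rows).entrySet()) {
            w.writeStartElement("row");
            w.writeAttribute("id", e.getKey());
            w.writeCharacters(e.getValue()); // the writer escapes markup characters such as &
            w.writeEndElement();
        }
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
        return out.toString();
    }
}
```

For a large report the writer would target a file or network stream directly, so rows never accumulate in memory.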
10.5 XML Search Service
Scenario:
Need query many XML documents by fields and sometimes return fragments.
Options:
- Index extracted fields into relational/search store.
- Store XML plus metadata.
- Use XQuery-capable XML database/processor if XML query is core capability.
- Use XPath only for small local documents, not as unindexed search over large archive.
Reasoning:
XML parser is not a search engine.
If queries are frequent and large-scale, create indexed projections.
11. Hybrid Strategies
Real systems often combine models.
11.1 StAX Envelope + DOM Island
Use StAX to scan huge document, but build DOM only for a selected subtree.
Large XML file
-> StAX scan
-> when <case> found, materialize that subtree as DOM
-> XPath/DOM processing for that case
-> discard DOM island
Use when:
- full document huge;
- individual subtree manageable;
- subtree logic needs random access.
Risk:
- subtree may still be too large;
- need subtree size limit.
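The island pattern can be sketched by copying StAX events for one subtree into DOM nodes by hand. This is one possible approach, not the only one; `DomIsland`, `materialize`, and `scan` are hypothetical names. The sketch keeps namespace URIs but drops prefixes, skips comments and processing instructions, and enforces the size limit as a cap on the number of events.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomIsland {

    /** Precondition: reader on START_ELEMENT. Postcondition: reader on the matching END_ELEMENT. */
    static Element materialize(XMLStreamReader r, Document doc, int maxEvents)
            throws XMLStreamException {
        Element root = toElement(r, doc);
        Element current = root;
        int depth = 1;
        int events = 0;
        while (depth > 0) {
            int event = r.next();
            if (++events > maxEvents) {
                throw new XMLStreamException("subtree exceeds " + maxEvents + " events", r.getLocation());
            }
            switch (event) {
                case XMLStreamConstants.START_ELEMENT:
                    Element child = toElement(r, doc);
                    current.appendChild(child);
                    current = child;
                    depth++;
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (--depth > 0) current = (Element) current.getParentNode();
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    current.appendChild(doc.createTextNode(r.getText()));
                    break;
                default:
                    break; // comments and processing instructions are skipped in this sketch
            }
        }
        return root;
    }

    private static Element toElement(XMLStreamReader r, Document doc) {
        Element e = doc.createElementNS(r.getNamespaceURI(), r.getLocalName());
        for (int i = 0; i < r.getAttributeCount(); i++) {
            e.setAttributeNS(r.getAttributeNamespace(i), r.getAttributeLocalName(i), r.getAttributeValue(i));
        }
        return e;
    }

    /** Scans the stream and passes each matching subtree to the handler, one at a time. */
    static int scan(String xml, String localName, Consumer<Element> handler) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        XMLInputFactory f = XMLInputFactory.newInstance();
        f.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT && localName.equals(r.getLocalName())) {
                // The island is never attached to doc, so nothing holds it after the handler returns.
                handler.accept(materialize(r, doc, 10_000));
                count++;
            }
        }
        r.close();
        return count;
    }
}
```

The explicit pre- and postcondition on the cursor is what lets the outer loop continue safely after each island.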
11.2 XSD First + StAX Business Processing
Input file
-> XSD validation stage
-> StAX extraction stage
Use when:
- invalid XML should be rejected early;
- business extraction assumes contract validity;
- file can be staged/read twice.
Risk:
- doubles I/O;
- not ideal for very large streams without staging.
11.3 StAX Extract + XPath on Fragment
StAX extracts <order>
-> fragment converted to DOM/XDM
-> XPath assertions/extractions
Use when:
- record is manageable;
- XPath rules are configurable;
- full file too large.
Risk:
- fragment materialization overhead;
- namespace context must be preserved.
11.4 XSLT Transform + Java Validation
Partner XML
-> XSLT normalize
-> canonical XML
-> Java domain validation
Use when:
- structural mapping is declarative;
- domain validation needs Java services/reference data;
- canonical XML is an integration boundary.
Risk:
- transformation and validation responsibilities blur;
- error traceability must map back to input.
11.5 Binding for Header + StAX for Line Items
Document header -> bind to object
Millions of line items -> StAX stream
Use when:
- header small/stable;
- body massive/repeated;
- domain code benefits from typed header object.
Risk:
- two models in one parser;
- contract boundaries must be documented.
12. Architecture Pattern: XML Ingestion Pipeline
Boundary discipline:
| Stage | Responsibility | Should Not Do |
|---|---|---|
| Security gate | Prevent unsafe XML processing | Interpret business meaning |
| XSD validation | Structure/datatype contract | Call database/services |
| Extractor | Convert XML event/tree to record | Persist directly |
| Domain validation | Business invariants | Parse raw XML manually |
| Persistence/publish | Durable side effect | Decide XML contract |
| Audit | Evidence and traceability | Store unnecessary PII |
13. Architecture Pattern: XML Transformation Service
Design concerns:
- mapping version must be explicit;
- stylesheet dependency must be controlled;
- input/output checksum should be stored;
- transformation parameters should be recorded;
- output should be validated;
- errors should include stylesheet version and line/column if available;
- retries must be idempotent.
14. Architecture Pattern: XML Contract Gateway
Gateway rule:
A gateway should enforce XML boundary rules, not become a dumping ground for all business logic.
15. Choosing Between XPath and StAX
Both can extract data, but from different mental models.
XPath Example
String orderId = xpath.evaluate(
    "/o:orders/o:order[1]/@id",
    document
);
Great when:
- document is already a tree;
- query is concise;
- extraction points are few;
- readability matters;
- testing assertions.
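The expression above only matches if the `o` prefix is bound through a `NamespaceContext` and the DOM was built namespace-aware. A minimal sketch of that setup follows; the namespace URI `urn:example:orders` and the class `OrderXPath` are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class OrderXPath {
    static final String NS = "urn:example:orders"; // hypothetical namespace URI

    /** Binds the XPath prefix "o" to NS; the prefix used inside the document may differ. */
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "o".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return NS.equals(uri) ? "o" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return NS.equals(uri)
                ? Collections.singletonList("o").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    static String firstOrderId(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true); // without this, namespace-qualified steps match nothing
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document document = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(CONTEXT);
        return xpath.evaluate("/o:orders/o:order[1]/@id", document);
    }
}
```

A document that declares the same namespace under a different prefix, for example `x:orders`, still matches, because XPath compares namespace URIs and not prefixes.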
StAX Example
while (reader.hasNext()) {
    int event = reader.next();
    if (event == START_ELEMENT && is(reader, NS, "order")) {
        OrderRecord order = readOrder(reader);
        consumer.accept(order);
    }
}
Great when:
- file is large;
- records are repeated;
- memory must be bounded;
- extraction is sequential;
- downstream batching matters.
Rule:
XPath selects nodes from a model.
StAX walks a stream to produce actions.
If there is no model because we intentionally avoid building one, XPath is not the primary tool.
16. Choosing Between XSLT and Java Mapping
Use XSLT When
- input and output are XML-centric;
- mapping is structural;
- identity transform plus overrides is natural;
- partner-specific mappings must be versioned;
- business wants declarative mapping artifacts;
- output should be reproducible;
- transformation has little side effect.
Use Java Mapping When
- logic depends on services/database/reference data;
- output is object/domain command, not XML;
- mapping is imperative and stateful;
- team cannot maintain XSLT safely;
- transformation must be deeply integrated with domain validation;
- debugging through Java stack is critical.
Hybrid
Often best:
XSLT: structural normalization
Java: domain validation/enrichment
Do not force all business rules into XSLT just because XML is involved.
Do not write unmaintainable Java tree manipulation just because team avoids XSLT.
17. Choosing Between XSD and Java Validation
XSD validates XML contract:
- required element/attribute;
- order and occurrence;
- simple datatype;
- enumerations;
- pattern/facet constraints;
- namespace structure;
- type composition.
Java validates domain semantics:
- account must exist;
- order total must match pricing service;
- submission date cannot violate business calendar;
- user has permission;
- duplicate idempotency key;
- cross-record aggregation;
- state transition validity.
Rule:
Use XSD for syntax/structure/type contract.
Use Java for business truth.
Bad design:
- XSD too weak: everything is xs:string, all rules in Java.
- XSD too strong: business rules encoded as fragile regex/facets.
- Java duplicates XSD checks with inconsistent error messages.
Good design:
- XSD catches contract violations early;
- Java assumes contract-normalized shape;
- errors are categorized separately;
- both validation layers are tested.
18. Choosing Binding vs Manual Parsing
Binding is Good When
- schema stable;
- object model maps naturally;
- payload size manageable;
- downstream code wants typed object graph;
- generated code governance is acceptable;
- unknown extension policy is clear.
Manual StAX/SAX is Good When
- payload huge;
- only subset needed;
- records independent;
- object graph would be wasteful;
- input structure is awkward;
- partial acceptance required.
DOM Mapper is Good When
- document small;
- mapping needs flexible navigation;
- schema has mixed content or variable structure;
- you need custom diagnostics.
Binding anti-pattern:
Generate 700 classes from huge XSD, expose them as domain model, then couple all business logic to generated schema classes.
Better:
XML binding DTO -> anti-corruption mapper -> domain model
19. Versioning Pressure Changes the Choice
If XML contract evolves frequently, processing strategy must handle change.
| Evolution Pattern | Better Strategy |
|---|---|
| Add optional elements | XSD versioning + tolerant reader. |
| Partner-specific mapping | XSLT per partner/version. |
| Multiple schema versions active | Router by namespace/version. |
| Need backward compatibility | Canonical model + adapters. |
| Frequent field renames | Transformation layer, not scattered XPath strings. |
| Complex extension points | DOM/XDM island or extensible binding strategy. |
Avoid:
- hardcoding XPath strings across codebase;
- generated classes leaking everywhere;
- one parser method that handles all partner versions;
- namespace-less XML contracts;
- silent fallback for unknown versions.
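A router by namespace can be sketched with StAX, which reads only as far as the root element before deciding. The namespace URIs, adapter names, and the class `VersionRouter` are hypothetical; the point is that an unknown version fails loudly instead of falling back silently.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class VersionRouter {
    // Hypothetical namespace URIs, one per active contract version.
    static final String V1 = "urn:example:order:v1";
    static final String V2 = "urn:example:order:v2";

    /** Returns the name of the adapter responsible for the document's root namespace. */
    static String route(String xml) throws XMLStreamException {
        XMLInputFactory f = XMLInputFactory.newInstance();
        f.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        f.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
        try {
            r.nextTag(); // advance to the root element; the body is not read
            String ns = r.getNamespaceURI();
            if (V1.equals(ns)) return "order-v1-adapter";
            if (V2.equals(ns)) return "order-v2-adapter";
            // No silent fallback: an unknown or missing namespace is an explicit error.
            throw new IllegalArgumentException("Unknown contract namespace: " + ns);
        } finally {
            r.close();
        }
    }
}
```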
20. Error Handling Strategy by Model
| Model | Typical Error | Diagnostic Need |
|---|---|---|
| DOM | Parse failure, missing node, namespace lookup failure | line/column often parse-level only; node context needed |
| SAX | Callback state bug, invalid event sequence | locator, current path, state |
| StAX | cursor misuse, missing field, malformed XML | location, current path, record key |
| XPath | empty result, namespace context bug, expression error | expression id, namespace map, document version |
| XSD | validation violation | line/column, schema version, error code mapping |
| XSLT | template error, missing parameter, resource issue | stylesheet version, template/mode, source location |
| XQuery | query error, type error, missing collection | query id/version, parameters, processor diagnostics |
| Binding | unmarshal error, unexpected element, adapter failure | schema/class version, field path, line/column |
Normalize errors for callers:
XML_PARSE_ERROR
XML_SECURITY_REJECTED
XML_SCHEMA_INVALID
XML_MAPPING_FAILED
XML_DOMAIN_INVALID
XML_TRANSFORMATION_FAILED
XML_OUTPUT_INVALID
Do not expose raw internal stack traces as partner-facing contract.
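One way to apply this is to let the pipeline stage, not the exception type, decide the code, since a `SAXParseException` can come from both parsing and schema validation. The sketch below assumes a hypothetical `XmlErrors` helper and a `Stage` enum that mirrors the codes above.

```java
import org.xml.sax.SAXParseException;

final class XmlErrors {

    /** Pipeline stages, each mapped to one partner-facing code. */
    enum Stage {
        SECURITY("XML_SECURITY_REJECTED"),
        PARSE("XML_PARSE_ERROR"),
        SCHEMA("XML_SCHEMA_INVALID"),
        MAPPING("XML_MAPPING_FAILED"),
        DOMAIN("XML_DOMAIN_INVALID"),
        TRANSFORM("XML_TRANSFORMATION_FAILED"),
        OUTPUT("XML_OUTPUT_INVALID");

        final String code;

        Stage(String code) {
            this.code = code;
        }
    }

    /** Partner-facing text: code plus position when known; no class names, no stack trace. */
    static String describe(Stage stage, Throwable cause) {
        if (cause instanceof SAXParseException) {
            SAXParseException e = (SAXParseException) cause;
            return stage.code + " at line " + e.getLineNumber() + ", column " + e.getColumnNumber();
        }
        return stage.code;
    }
}
```

The original exception would still be logged internally with the import or message id, so support can correlate the partner-facing code with the full diagnostic.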
21. Observability by Processing Strategy
Production XML systems need evidence.
Capture:
- parser type/version if relevant;
- schema version;
- stylesheet/query version;
- mapping version;
- input checksum;
- output checksum;
- file/message id;
- partner id;
- namespace/message type;
- record count;
- accepted count;
- rejected count;
- processing duration;
- first error code;
- line/column for rejected record;
- memory/size guard metrics.
Example import summary:
{
"importId": "imp-20260702-0001",
"partnerId": "partner-a",
"fileName": "orders-2026-07-01.xml.gz",
"inputSha256": "...",
"schemaVersion": "order-v1.4",
"parserModel": "StAX",
"totalRecords": 1000000,
"acceptedRecords": 999970,
"rejectedRecords": 30,
"durationMs": 184000,
"status": "COMPLETED_WITH_REJECTIONS"
}
22. Testing Strategy by Model
| Model | Test Focus |
|---|---|
| DOM | namespace lookup, missing nodes, mutation result, serialization |
| SAX | event order, state transitions, fragmented characters |
| StAX | cursor postcondition, skip subtree, batch behavior, large file |
| XPath | namespace context, empty/multiple result, compiled expression |
| XSD | valid/invalid fixtures, error mapping, version compatibility |
| XSLT | golden output, parameters, resolver policy, output validation |
| XQuery | query result, type behavior, collection fixtures |
| Binding | generated class compatibility, adapter behavior, unknown elements |
Golden rule:
Test XML semantically, not only as raw strings.
Use canonical comparison where possible. Whitespace, attribute ordering, and namespace prefixes can differ while XML meaning remains equivalent.
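As a small illustration of semantic comparison using only the JDK, DOM's `isEqualNode` already ignores attribute order. This sketch (the class `XmlEquality` is a hypothetical name) also drops comments before comparing. It does not normalize insignificant whitespace or differing namespace prefixes; those cases need canonicalization or a dedicated XML comparison library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class XmlEquality {

    /** True when both documents have equal element trees, regardless of attribute order. */
    static boolean sameElementTree(String a, String b) throws Exception {
        return root(a).isEqualNode(root(b));
    }

    private static Element root(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        f.setIgnoringComments(true); // comments carry no contract meaning in this sketch
        f.setCoalescing(true);       // treat CDATA sections as plain text
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Element root = f.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
        root.normalize();            // merge adjacent text nodes before comparing
        return root;
    }
}
```

A raw string comparison of `<a x="1" y="2"/>` and `<a y="2" x="1"/>` fails, while this comparison treats them as the same XML.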
23. Anti-Decision Patterns
23.1 “Always Use DOM, It Is Easier”
Fine for small payload. Dangerous for batch and high concurrency.
23.2 “Always Use Streaming, It Is Faster”
Streaming can make code more complex. If payload small and logic needs random access, DOM/XPath may be better.
23.3 “XSD Handles Validation, We Are Done”
XSD cannot verify database existence, authorization, state transition, or business calendar.
23.4 “XSLT Is Old, Use Java”
For XML-to-XML transformation, XSLT can be cleaner, more auditable, and less error-prone than manual Java tree manipulation.
23.5 “Generated Classes Are Domain Model”
Generated XML classes represent contract shape, not necessarily domain truth.
23.6 “Namespace Prefix Is Stable”
Prefix is syntax. Namespace URI is identity.
23.7 “We Can Regex XML”
XML is not regular text. Use XML parser.
23.8 “Validation Before Processing Is Always Better”
For huge files, upfront validation may double I/O or block partial acceptance. Choose deliberately.
24. Practical Selection Examples
Example A — Extract Message ID from Small SOAP Envelope
Choice:
DOM + XPath
Reason:
- small document;
- header lookup;
- namespace-aware XPath expressive;
- easy diagnostics.
Example B — Import 20 Million Transaction Lines
Choice:
StAX + batch consumer + domain validator
Reason:
- huge record stream;
- bounded memory;
- partial acceptance;
- line/column per rejection.
Example C — Convert XML to HTML Statement
Choice:
XSLT
Reason:
- template-driven rendering;
- XML-to-HTML is native XSLT use case;
- output reproducible.
Example D — Validate Partner Payload Against Contract
Choice:
XSD Validation API
Reason:
- schema is executable contract;
- standardized error locations;
- decouples contract shape from business semantics.
Example E — Search Archive by Invoice Number
Choice:
Extract invoice number into index; do not scan XML with XPath every time
Reason:
- parser is not indexing system;
- operational query needs indexed projection.
Example F — Partner-Specific Canonicalization
Choice:
XSLT per partner/version + output XSD validation
Reason:
- mapping is versioned artifact;
- canonical output contract must be enforced;
- audit can store stylesheet version.
25. Parser Selection Checklist
Before choosing, answer:
- Is input trusted or untrusted?
- What is max compressed and uncompressed size?
- What is expected concurrency?
- Do we need full document access?
- Can records be processed independently?
- Do we need partial acceptance?
- Is transformation structural or business-heavy?
- Is schema stable?
- Is output XML or Java domain command?
- Do we need audit trail?
- Do we need line/column rejected record report?
- What is versioning strategy?
- Who maintains mapping rules?
- What is the failure recovery model?
- What should happen to unknown elements?
- What security features must be disabled/enforced?
- What tests prove the selected model is safe?
- How will we detect performance regression?
If you cannot answer these, parser choice is premature.
26. Recommended Defaults
For production Java XML systems, these defaults are usually sane:
| Situation | Default |
|---|---|
| Small XML request | Secure parse + XSD if contract exists + DOM/XPath or binding |
| Large XML file | StAX extraction + batch processing |
| Simple event extraction | SAX or StAX; prefer StAX if team wants pull control |
| XML-to-XML mapping | XSLT with compiled templates |
| Output XML report | XMLStreamWriter or XSLT, then XSD validate |
| Contract enforcement | XSD + domain validation separation |
| Repeated query over many XML docs | Extract/index metadata, consider XQuery/XML DB only if XML query is core |
| Versioned partner formats | Namespace/version router + adapter/stylesheet per version |
| Auditable transformation | Store input/output checksum + mapping version + validation result |
27. Deliberate Practice
Exercise 1: Decision exercise
Take the following 10 XML scenarios and choose a model for each:
- SOAP request 50 KB.
- Regulatory output 300 MB.
- Partner invoice file 4 GB.
- XML config 20 KB.
- XML-to-HTML rendering.
- Query 10,000 XML documents by customer id.
- Validate schema compatibility.
- Extract only <header> from a huge document.
- Convert old partner XML v1 to canonical v3.
- Parse mixed-content legal document.
For each scenario, write down:
- selected model
- why
- rejected alternatives
- security controls
- failure mode
- test fixture required
Exercise 2: Hybrid design
Design a pipeline:
Input: 2 GB XML file containing <case> records.
Need: validate header, stream cases, reject invalid cases, persist valid cases, generate reject XML.
Draw a Mermaid diagram and define:
- stage boundary;
- parser model;
- validation model;
- batch size;
- error taxonomy;
- audit evidence.
Exercise 3: Refactor a bad design
Given a service that:
- parses XML with DOM;
- loads a 1 GB file;
- runs XPath inside a loop;
- persists per record;
- logs the full payload on error.
Refactor it into a production-grade design.
28. Mental Model Summary
Parser selection is an architecture decision.
The most important rule:
Choose based on access pattern, not habit.
The final mental model:
- DOM: tree, random access, memory cost.
- SAX: push event, efficient, callback complexity.
- StAX: pull event, streaming control, explicit state machine.
- XPath: declarative selection over a model.
- XSD: executable structural contract.
- XSLT: declarative XML transformation.
- XQuery: query language for XML data model/collections.
- Binding: XML contract mapped to object graph.
- Hybrid: production-grade composition with explicit boundaries.
Top-level principle:
Good XML engineering is not knowing every API. It is choosing the smallest processing model that satisfies correctness, security, performance, evolvability, and operability.
The next part moves into XSD foundations as contract design: once we know how to choose a processing model, we need to design XML contracts that can be validated, evolved, and maintained in production.