Java XSD Validation Pipeline
Learn Java XML In Action - Part 013
Java XSD validation pipeline with SchemaFactory, Schema, Validator, ValidatorHandler, LSResourceResolver, ErrorHandler, streaming validation, diagnostics, observability, security, and production design.
Part 013 — Java XSD Validation Pipeline
Goals of This Part
The previous part covered XSD as a modular, versioned, and governed contract. Now we move to the Java implementation side:
How do we turn an XSD into a validation pipeline that is safe, fast, observable, deterministic,
and defensible when a production dispute occurs?
Targets after this part:
- understand the lifecycle of SchemaFactory, Schema, Validator, and ValidatorHandler;
- know when validation happens before parsing, during streaming, after parsing, or at several stages;
- produce error reports that are useful to both humans and machines;
- secure schema resolution so that it performs no uncontrolled network access;
- design a schema cache and validator lifecycle that are thread-safe;
- distinguish validation errors, parse errors, semantic errors, and policy errors;
- build a production-grade validation service.
Mental model:
XSD validation is not a boolean check.
It is a boundary-control pipeline that converts untrusted XML into classified evidence.
1. Validation Is a Boundary, Not a Utility Method
Many codebases treat XSD validation like this:
validate(xmlFile, xsdFile);
Technically this works. In production it is often not enough.
A validation pipeline has to answer:
| Question | Why It Matters |
|---|---|
| Which schema version is used? | For audit, replay, and compatibility. |
| Are schema dependencies resolved deterministically? | To prevent validation results from changing with the network or file system. |
| Is the parser safe from XXE/entity expansion? | XML input often comes from partners or other untrusted sources. |
| Can errors be tied to a line/column/path? | For debugging and disputes. |
| Are all errors collected, or is it fail-fast? | For UX, batch processing, and SLA. |
| Is the validation result stored as evidence? | For regulatory defensibility. |
| Does validation limit throughput? | For large batches and event pipelines. |
A good validation boundary does not merely return true or false.
It returns:
ValidationResult
  accepted/rejected
  schema identity
  parser policy
  error list
  warning list
  resource resolution trace
  input fingerprint
  timing and size metrics
2. Java Validation API Mental Model
The main package is:
javax.xml.validation
Core objects:
| Type | Role | Lifecycle |
|---|---|---|
| SchemaFactory | Creates a Schema from XSD | Configure once per schema language/policy |
| Schema | Immutable grammar representation | Cache/reuse |
| Validator | Validates one document/source | Create per validation/request |
| ValidatorHandler | Streaming validator on SAX events | Create per streaming pipeline |
| LSResourceResolver | Resolves imported/included schema resources | Deterministic policy object |
| ErrorHandler | Receives warning/error/fatalError | Per validation context |
Important invariant:
Schema is the reusable compiled grammar.
Validator is the per-run validation processor.
Do not cache and share a mutable Validator across threads.
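A minimal sketch of this invariant, assuming a hypothetical one-element schema held in a string. The Schema is compiled once and shared across a small thread pool, while every task creates its own Validator. The class and method names are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

final class SharedSchemaDemo {
    // Hypothetical schema used only for illustration.
    static final String XSD = """
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:element name="qty" type="xs:int"/>
        </xs:schema>
        """;

    // Compile once; the resulting Schema object is safe to share between threads.
    static Schema compile() throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return factory.newSchema(new StreamSource(new StringReader(XSD)));
    }

    // A fresh Validator per call; it never leaves the calling thread.
    static boolean isValid(Schema schema, String xml) {
        try {
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Validates all documents on a pool, returning results in input order.
    static List<Boolean> validateConcurrently(Schema schema, List<String> docs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String doc : docs) {
                tasks.add(() -> isValid(schema, doc));
            }
            List<Boolean> results = new ArrayList<>();
            for (Future<Boolean> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

The only shared object is the Schema. Creating a Validator is cheap compared with compiling the grammar, so there is little reason to pool validators.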
3. Minimal XSD Validation in Java
A simple baseline:
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.nio.file.Path;

public final class MinimalXsdValidation {
    public static void validate(Path xml, Path xsd) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(xsd.toFile());
        Validator validator = schema.newValidator();
        // With no ErrorHandler set, this throws SAXException on the first validation error.
        validator.validate(new StreamSource(xml.toFile()));
    }
}
This is acceptable for learning.
It is not enough for production because it does not define:
- secure parser limits;
- external resource access policy;
- resource resolution policy;
- structured error handling;
- schema version identity;
- observability;
- deterministic schema bundle loading.
4. Production Validation Result Model
Start by rejecting the primitive result type:
boolean valid;
Use a domain object:
import java.time.Duration;
import java.util.List;
import java.util.Map;

public record XmlValidationResult(
        boolean accepted,
        String schemaId,
        String schemaVersion,
        String inputFingerprint,
        long inputSizeBytes,
        Duration elapsed,
        List<XmlValidationIssue> issues,
        Map<String, String> diagnostics
) {
    public boolean hasErrors() {
        return issues.stream().anyMatch(XmlValidationIssue::isError);
    }
}
Issue model:
public record XmlValidationIssue(
        Severity severity,
        String code,
        String message,
        Integer line,
        Integer column,
        String systemId,
        String publicId,
        String xmlPathHint
) {
    public boolean isError() {
        return severity == Severity.ERROR || severity == Severity.FATAL;
    }
}

enum Severity {
    WARNING,
    ERROR,
    FATAL
}
Avoid exposing raw parser messages directly to external clients. Parser messages can leak file paths, system identifiers, schema layout, or internal implementation details.
Recommended external error response:
{
  "accepted": false,
  "errorType": "XML_SCHEMA_VALIDATION_FAILED",
  "correlationId": "b7de7a2a-9d9c-4e30-9e7a-7e97d8d9b31a",
  "issues": [
    {
      "severity": "ERROR",
      "message": "The request document does not match the expected schema.",
      "line": 42,
      "column": 17
    }
  ]
}
Internal diagnostic event can be richer.
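One way to keep the two views apart is to map internal issues to a separate external type that cannot carry raw parser text. The sketch below assumes simplified, hypothetical record types (`InternalIssue`, `ExternalIssue`) rather than the records defined above, and a fixed external message.

```java
import java.util.List;

final class ExternalIssueMapper {
    // Internal view: may contain raw parser text and system identifiers.
    record InternalIssue(String severity, String code, String rawMessage,
                         Integer line, Integer column, String systemId) {}

    // External view: fixed wording plus position, nothing else.
    record ExternalIssue(String severity, String message, Integer line, Integer column) {}

    static final String SAFE_MESSAGE = "The request document does not match the expected schema.";

    // Drops the raw message, code, and systemId; keeps severity and position.
    static ExternalIssue toExternal(InternalIssue issue) {
        return new ExternalIssue(issue.severity(), SAFE_MESSAGE, issue.line(), issue.column());
    }

    // Caps the number of issues so a hostile document cannot inflate the response.
    static List<ExternalIssue> toExternalList(List<InternalIssue> issues, int maxIssues) {
        return issues.stream()
                .limit(maxIssues)
                .map(issue -> toExternal(issue))
                .toList();
    }
}
```

Because `ExternalIssue` has no field for the raw message or system identifier, leaking them would require a deliberate code change rather than a forgotten filter.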
5. ErrorHandler That Collects Errors
By default, with no ErrorHandler set, Validator.validate(...) throws a SAXException at the first validation error. For batch UX, partner onboarding, and schema testing, you often want to collect issues instead.
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import java.util.ArrayList;
import java.util.List;

public final class CollectingErrorHandler implements ErrorHandler {
    private final List<XmlValidationIssue> issues = new ArrayList<>();

    @Override
    public void warning(SAXParseException exception) throws SAXException {
        issues.add(toIssue(Severity.WARNING, exception));
    }

    @Override
    public void error(SAXParseException exception) throws SAXException {
        issues.add(toIssue(Severity.ERROR, exception));
        // Do not throw if you want to keep collecting recoverable errors.
    }

    @Override
    public void fatalError(SAXParseException exception) throws SAXException {
        issues.add(toIssue(Severity.FATAL, exception));
        // Fatal errors are not recoverable, so stop processing.
        throw exception;
    }

    public List<XmlValidationIssue> issues() {
        return List.copyOf(issues);
    }

    private static XmlValidationIssue toIssue(Severity severity, SAXParseException e) {
        return new XmlValidationIssue(
                severity,
                "XML_SCHEMA_VALIDATION_" + severity.name(),
                e.getMessage(),
                // SAX reports -1 when a position is unknown; store null instead.
                e.getLineNumber() >= 0 ? e.getLineNumber() : null,
                e.getColumnNumber() >= 0 ? e.getColumnNumber() : null,
                e.getSystemId(),
                e.getPublicId(),
                null
        );
    }
}
Important nuance:
Collecting validation errors is not guaranteed to produce every possible error.
After one structural error, the parser/validator may not be able to infer later structure correctly.
So use wording like:
The validator reported these issues.
not:
These are all issues in the document.
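The difference is easy to observe with a small experiment. The sketch below assumes a hypothetical two-field schema and an anonymous handler that only records messages; the number of reported issues depends on the validator implementation and does not have to match the number of defects in the document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

final class CollectingDemo {
    // Hypothetical schema used only for illustration.
    static final String XSD = """
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:element name="order">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="qty" type="xs:int"/>
                <xs:element name="price" type="xs:decimal"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:schema>
        """;

    // Returns whatever the validator chose to report, in reporting order.
    static List<String> reportedIssues(String xml) throws Exception {
        List<String> reported = new ArrayList<>();
        Validator validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) {
                reported.add("WARNING " + e.getMessage());
            }
            @Override public void error(SAXParseException e) {
                reported.add("ERROR " + e.getMessage()); // keep going
            }
            @Override public void fatalError(SAXParseException e) throws SAXParseException {
                reported.add("FATAL " + e.getMessage());
                throw e; // not recoverable
            }
        });
        validator.validate(new StreamSource(new StringReader(xml)));
        return reported;
    }
}
```

Calling `reportedIssues("<order><qty>x</qty><price>y</price></order>")` returns several entries for two bad values, which is why the count should be presented as "issues reported" rather than "defects found".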
6. Secure SchemaFactory Configuration
Validation has two resource categories:
- XML input resources: entities, DTD, external references.
- Schema resources: XSD include, import, redefine, and external schema locations.
Configure the factory explicitly:
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;

public final class SecureSchemaFactories {
    public static SchemaFactory newSecureXsdFactory() throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Deny external access by default: an empty string allows no protocols.
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        return factory;
    }
}
For validator instances:
Validator validator = schema.newValidator();
validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
Production default:
No HTTP.
No arbitrary file access.
No runtime schema download.
No dependency on schemaLocation hints from untrusted XML.
If external access is allowed at all, it should be explicit, allowlisted, logged, and ideally resolved from immutable artifact storage.
7. Deterministic Schema Resolution
XSD files often contain:
<xs:include schemaLocation="common-types.xsd"/>
<xs:import namespace="https://example.com/schema/common" schemaLocation="../common/common.xsd"/>
Default resolution can become ambiguous:
- relative to current working directory;
- relative to process start location;
- relative to deployment layout;
- accidentally resolved from network;
- accidentally resolved differently in tests and production.
Use LSResourceResolver.
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;
import java.io.InputStream;
import java.util.Map;

public final class ClasspathSchemaResolver implements LSResourceResolver {
    private final Map<String, String> namespaceToClasspathResource;

    public ClasspathSchemaResolver(Map<String, String> namespaceToClasspathResource) {
        this.namespaceToClasspathResource = Map.copyOf(namespaceToClasspathResource);
    }

    @Override
    public LSInput resolveResource(
            String type,
            String namespaceURI,
            String publicId,
            String systemId,
            String baseURI
    ) {
        // Maps created by Map.copyOf throw NullPointerException on get(null),
        // so requests without a namespace are rejected explicitly.
        String resource = namespaceURI == null
                ? null
                : namespaceToClasspathResource.get(namespaceURI);
        if (resource == null) {
            throw new IllegalArgumentException(
                    "Schema namespace is not allowlisted: " + namespaceURI + ", systemId=" + systemId
            );
        }
        InputStream in = Thread.currentThread()
                .getContextClassLoader()
                .getResourceAsStream(resource);
        if (in == null) {
            throw new IllegalStateException("Schema resource not found: " + resource);
        }
        return new SimpleLsInput(publicId, systemId, in);
    }
}
LSInput implementation:
import org.w3c.dom.ls.LSInput;
import java.io.InputStream;
import java.io.Reader;

public final class SimpleLsInput implements LSInput {
    private String publicId;
    private String systemId;
    private InputStream byteStream;

    public SimpleLsInput(String publicId, String systemId, InputStream byteStream) {
        this.publicId = publicId;
        this.systemId = systemId;
        this.byteStream = byteStream;
    }

    // Only the byte stream and identifiers are supported; other inputs are unused.
    @Override public Reader getCharacterStream() { return null; }
    @Override public void setCharacterStream(Reader characterStream) { }
    @Override public InputStream getByteStream() { return byteStream; }
    @Override public void setByteStream(InputStream byteStream) { this.byteStream = byteStream; }
    @Override public String getStringData() { return null; }
    @Override public void setStringData(String stringData) { }
    @Override public String getSystemId() { return systemId; }
    @Override public void setSystemId(String systemId) { this.systemId = systemId; }
    @Override public String getPublicId() { return publicId; }
    @Override public void setPublicId(String publicId) { this.publicId = publicId; }
    @Override public String getBaseURI() { return null; }
    @Override public void setBaseURI(String baseURI) { }
    @Override public String getEncoding() { return null; }
    @Override public void setEncoding(String encoding) { }
    @Override public boolean getCertifiedText() { return false; }
    @Override public void setCertifiedText(boolean certifiedText) { }
}
Attach it to the factory:
SchemaFactory factory = SecureSchemaFactories.newSecureXsdFactory();
factory.setResourceResolver(new ClasspathSchemaResolver(Map.of(
        "https://example.com/schema/common", "schemas/common/common.xsd",
        "https://example.com/schema/order", "schemas/order/order.xsd"
)));
Design invariant:
Schema resolution should be a policy, not an accident.
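To see resolution as policy in a form that runs without any files, the sketch below serves an imported schema from memory. Everything in it is hypothetical: the two inline schemas, the `MapResolver` and `TextInput` types, and the use of a default factory without access restrictions.

```java
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;

final class InMemoryResolverDemo {
    static final String COMMON_NS = "https://example.com/schema/common";

    static final String COMMON_XSD = """
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                   targetNamespace="https://example.com/schema/common">
          <xs:simpleType name="Code">
            <xs:restriction base="xs:string"><xs:maxLength value="3"/></xs:restriction>
          </xs:simpleType>
        </xs:schema>
        """;

    static final String ORDER_XSD = """
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                   xmlns:c="https://example.com/schema/common">
          <xs:import namespace="https://example.com/schema/common" schemaLocation="common.xsd"/>
          <xs:element name="currency" type="c:Code"/>
        </xs:schema>
        """;

    /** Serves allowlisted namespaces from memory and rejects everything else. */
    record MapResolver(Map<String, String> xsdByNamespace) implements LSResourceResolver {
        @Override
        public LSInput resolveResource(String type, String namespaceURI, String publicId,
                                       String systemId, String baseURI) {
            String xsd = namespaceURI == null ? null : xsdByNamespace.get(namespaceURI);
            if (xsd == null) {
                throw new IllegalArgumentException("Schema namespace is not allowlisted: " + namespaceURI);
            }
            return new TextInput(xsd, publicId, systemId);
        }
    }

    /** Minimal LSInput backed by a string; setters are no-ops. */
    record TextInput(String text, String publicId, String systemId) implements LSInput {
        @Override public Reader getCharacterStream() { return new StringReader(text); }
        @Override public void setCharacterStream(Reader characterStream) { }
        @Override public InputStream getByteStream() { return null; }
        @Override public void setByteStream(InputStream byteStream) { }
        @Override public String getStringData() { return null; }
        @Override public void setStringData(String stringData) { }
        @Override public String getSystemId() { return systemId; }
        @Override public void setSystemId(String systemId) { }
        @Override public String getPublicId() { return publicId; }
        @Override public void setPublicId(String publicId) { }
        @Override public String getBaseURI() { return null; }
        @Override public void setBaseURI(String baseURI) { }
        @Override public String getEncoding() { return null; }
        @Override public void setEncoding(String encoding) { }
        @Override public boolean getCertifiedText() { return false; }
        @Override public void setCertifiedText(boolean certifiedText) { }
    }

    // The import is satisfied by the resolver, not by the schemaLocation hint.
    static Schema compile() throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setResourceResolver(new MapResolver(Map.of(COMMON_NS, COMMON_XSD)));
        return factory.newSchema(new StreamSource(new StringReader(ORDER_XSD)));
    }

    static boolean isValid(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Keying only by namespace cannot distinguish several files that share one target namespace (for example with xs:include), so a real resolver may also need the systemId. Whether a provider applies its external-access restrictions to resources that a resolver returns is implementation-specific, so the combination of a restricted factory and a custom resolver is worth testing on the target JDK.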
8. Schema Bundle as Deployable Artifact
Do not deploy random .xsd files scattered through the repo.
Create a schema bundle:
schemas/
  manifest.json
  order/
    order-v1.xsd
  common/
    common-v1.xsd
  catalog/
    schema-map.properties
Example manifest:
{
  "schemaId": "order-message",
  "schemaVersion": "1.4.0",
  "targetNamespace": "https://example.com/schema/order/v1",
  "entrypoint": "schemas/order/order-v1.xsd",
  "dependencies": [
    {
      "namespace": "https://example.com/schema/common/v1",
      "resource": "schemas/common/common-v1.xsd",
      "sha256": "..."
    }
  ]
}
Benefits:
- deterministic deployment;
- reproducible validation;
- audit-friendly schema identity;
- compatibility testing;
- cached compilation;
- easier rollback.
In regulated systems, a validation result without schema identity is weak evidence.
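The `sha256` field in the manifest is only useful if something checks it. A minimal sketch of such a check follows, assuming Java 17 or later for `HexFormat`; the `SchemaChecksum` class and its methods are hypothetical helpers, not part of any library.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class SchemaChecksum {
    // Lowercase hex SHA-256 of the given bytes.
    static String sha256Hex(byte[] content) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Fails loudly when a bundled schema does not match its manifest entry.
    static void verify(String resource, byte[] content, String expectedSha256) {
        String actual = sha256Hex(content);
        if (!actual.equalsIgnoreCase(expectedSha256)) {
            throw new IllegalStateException(
                    "Checksum mismatch for " + resource
                            + ": expected " + expectedSha256 + " but was " + actual);
        }
    }
}
```

Running the check before compilation means a tampered or half-deployed bundle fails at startup instead of producing validation results that cannot be reproduced later.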
9. Compiling and Caching Schema
Compiling schema can be expensive. Cache Schema, not Validator.
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import java.io.InputStream;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public final class SchemaRegistry {
    private final ConcurrentMap<String, Schema> cache = new ConcurrentHashMap<>();
    private final SchemaFactory factory;

    public SchemaRegistry(SchemaFactory factory) {
        this.factory = factory;
    }

    public Schema getOrCompile(SchemaDescriptor descriptor) {
        return cache.computeIfAbsent(descriptor.cacheKey(), ignored -> compile(descriptor));
    }

    private Schema compile(SchemaDescriptor descriptor) {
        // SchemaFactory is not thread-safe, and two different keys can compile
        // at the same time, so compilation is serialized on the factory.
        synchronized (factory) {
            try (InputStream in = descriptor.entrypointInputStream()) {
                return factory.newSchema(new StreamSource(in, descriptor.entrypointSystemId()));
            } catch (Exception e) {
                // SchemaCompilationException is an application-defined unchecked exception.
                throw new SchemaCompilationException(descriptor.cacheKey(), e);
            }
        }
    }
}
Descriptor example:
public record SchemaDescriptor(
        String schemaId,
        String schemaVersion,
        String targetNamespace,
        String entrypointSystemId
) {
    public String cacheKey() {
        return schemaId + ":" + schemaVersion + ":" + targetNamespace;
    }

    // Opens the entrypoint XSD from the classpath; the caller closes the stream.
    public java.io.InputStream entrypointInputStream() {
        java.io.InputStream in = Thread.currentThread()
                .getContextClassLoader()
                .getResourceAsStream(entrypointSystemId);
        if (in == null) {
            throw new IllegalStateException("Schema not found: " + entrypointSystemId);
        }
        return in;
    }
}
Rule:
Schema cache key must include contract identity, not just file name.
Bad cache key:
order.xsd
Good cache key:
order-message:1.4.0:https://example.com/schema/order/v1
10. Full Validation Service Skeleton
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.Validator;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class XsdValidationService {
    private final SchemaRegistry schemaRegistry;

    public XsdValidationService(SchemaRegistry schemaRegistry) {
        this.schemaRegistry = schemaRegistry;
    }

    public XmlValidationResult validate(
            InputStream xmlInput,
            long inputSizeBytes,
            SchemaDescriptor schemaDescriptor
    ) {
        Instant started = Instant.now();
        CollectingErrorHandler errorHandler = new CollectingErrorHandler();
        String fingerprint = null;
        try {
            byte[] xmlBytes = xmlInput.readAllBytes();
            fingerprint = sha256Hex(xmlBytes);
            Schema schema = schemaRegistry.getOrCompile(schemaDescriptor);
            // One Validator per request; only the Schema is shared.
            Validator validator = schema.newValidator();
            validator.setErrorHandler(errorHandler);
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            validator.validate(new StreamSource(new ByteArrayInputStream(xmlBytes)));
            List<XmlValidationIssue> issues = errorHandler.issues();
            boolean accepted = issues.stream().noneMatch(XmlValidationIssue::isError);
            return result(accepted, schemaDescriptor, fingerprint, inputSizeBytes, started, issues, Map.of());
        } catch (Exception e) {
            List<XmlValidationIssue> issues = mergeException(errorHandler.issues(), e);
            return result(false, schemaDescriptor, fingerprint, inputSizeBytes, started, issues, diagnostic(e));
        }
    }

    private static XmlValidationResult result(
            boolean accepted,
            SchemaDescriptor schemaDescriptor,
            String fingerprint,
            long inputSizeBytes,
            Instant started,
            List<XmlValidationIssue> issues,
            Map<String, String> diagnostics
    ) {
        return new XmlValidationResult(
                accepted,
                schemaDescriptor.schemaId(),
                schemaDescriptor.schemaVersion(),
                fingerprint,
                inputSizeBytes,
                Duration.between(started, Instant.now()),
                issues,
                diagnostics
        );
    }

    private static String sha256Hex(byte[] bytes) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(bytes);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    private static List<XmlValidationIssue> mergeException(List<XmlValidationIssue> existing, Exception e) {
        // If the handler already recorded an error, the exception is usually its rethrow.
        if (existing.stream().anyMatch(XmlValidationIssue::isError)) {
            return existing;
        }
        // Otherwise keep any warnings and add the failure itself,
        // so a rejected result always carries at least one error.
        List<XmlValidationIssue> merged = new ArrayList<>(existing);
        merged.add(new XmlValidationIssue(
                Severity.FATAL,
                "XML_VALIDATION_EXCEPTION",
                e.getMessage(),
                null,
                null,
                null,
                null,
                null
        ));
        return List.copyOf(merged);
    }

    private static Map<String, String> diagnostic(Exception e) {
        Map<String, String> map = new HashMap<>();
        map.put("exceptionType", e.getClass().getName());
        return map;
    }
}
This skeleton reads all bytes to compute fingerprint. For very large files, use a teeing stream or compute digest while streaming.
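A minimal sketch of the streaming variant, using `java.security.DigestInputStream` so the validator's own reads feed the hash. It assumes Java 17 or later for `HexFormat`, a hypothetical one-element schema, and that the parser reads the input to its end, which a successful parse of a complete document normally requires.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

final class StreamingFingerprint {
    // Hypothetical schema used only for illustration.
    static final String XSD = """
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:element name="qty" type="xs:int"/>
        </xs:schema>
        """;

    static Schema compile() throws Exception {
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
    }

    // Validates and hashes in one pass; nothing is buffered in memory.
    static String validateAndFingerprint(Schema schema, InputStream xml) throws Exception {
        DigestInputStream hashing =
                new DigestInputStream(xml, MessageDigest.getInstance("SHA-256"));
        // Every byte the parser reads also updates the digest.
        schema.newValidator().validate(new StreamSource(hashing));
        return HexFormat.of().formatHex(hashing.getMessageDigest().digest());
    }
}
```

If validation throws, the digest covers only the bytes read so far, so a rejected document needs a separate strategy (for example hashing while the payload is written to a temp file) when a complete fingerprint is required as evidence.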
11. Validation Source Types
Validator.validate(Source) supports multiple Source types.
Common options:
| Source | Use Case | Trade-Off |
|---|---|---|
| StreamSource | file/input stream validation | simple, streaming-capable depending on processor |
| SAXSource | validation from configured SAX parser | best for secure parser control |
| DOMSource | validate already-built DOM | convenient but memory-heavy |
| StAXSource | validate StAX reader/event reader | useful in streaming pipelines |
Example with DOMSource:
import javax.xml.transform.dom.DOMSource;
validator.validate(new DOMSource(document));
Use this only when DOM already exists for another reason. Do not build DOM just to validate a 500 MB document.
12. SAXSource for Secure Parser Control
If you need strict parser feature control, create an XMLReader explicitly.
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.validation.Validator;
import java.io.InputStream;

public final class SaxSourceValidation {
    public static void validateWithControlledSax(
            InputStream xml,
            Validator validator
    ) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.setFeature(javax.xml.XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Reject any DOCTYPE and never load external entities.
        spf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
        spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        SAXSource source = new SAXSource(reader, new InputSource(xml));
        validator.validate(source);
    }
}
This pattern is useful when you need to guarantee parser behavior across environments.
13. Streaming Validation with ValidatorHandler
ValidatorHandler sits inside a SAX event pipeline.
Use it when you want:
- validation and extraction in one pass;
- validation before passing events to business handler;
- large payload processing;
- event pipeline composition.
Example:
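import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.validation.Schema;
import javax.xml.validation.ValidatorHandler;
import java.io.InputStream;

public final class StreamingValidationPipeline {
    public static void validateAndExtract(
            InputStream xml,
            Schema schema,
            org.xml.sax.ContentHandler businessHandler,
            org.xml.sax.ErrorHandler errorHandler
    ) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.setFeature(javax.xml.XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Same parser hardening as the SAXSource example: no DOCTYPE, no external entities.
        spf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
        spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // Events flow: XMLReader -> ValidatorHandler -> businessHandler.
        ValidatorHandler validatorHandler = schema.newValidatorHandler();
        validatorHandler.setErrorHandler(errorHandler);
        validatorHandler.setContentHandler(businessHandler);
        reader.setContentHandler(validatorHandler);
        // Route parser-level errors to the same handler as validation errors.
        reader.setErrorHandler(errorHandler);
        reader.parse(new InputSource(xml));
    }
}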
Important detail:
ValidatorHandler validates SAX events. It does not own input parsing.
The upstream XMLReader still needs secure configuration.
14. Validate Then Parse vs Parse While Validate
There are two dominant designs.
14.1 Validate Then Parse
Pros:
- simpler mental model;
- clear boundary;
- business code sees only structurally valid XML;
- easier failure isolation.
Cons:
- may require two passes;
- expensive for huge files;
- fingerprint/stream management more complex;
- cannot short-circuit business extraction until validation completes.
Use for:
- small to medium payloads;
- API requests;
- onboarding partner integrations;
- regulatory documents where evidence clarity matters.
14.2 Parse While Validate
Pros:
- single pass;
- lower memory;
- can integrate with SAX streaming extraction;
- good for batch files.
Cons:
- handler composition is more complex;
- error and extraction state must be carefully coordinated;
- business handler may receive some events before a later validation failure.
Use for:
- large XML files;
- batch ingestion;
- ETL-like pipelines;
- high-throughput internal integration.
Production pattern:
If side effects happen during streaming extraction, buffer or stage them until validation outcome is known.
Never commit irreversible business changes before validation acceptance is finalized.
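A minimal sketch of the buffering idea, assuming a hypothetical schema with repeated `item` elements. The business handler only stages values in memory, and the caller receives them only when the validator reported no errors. For large files the in-memory list would be replaced by a staging store.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class StagedExtraction {
    // Hypothetical schema used only for illustration.
    static final String XSD = """
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:element name="order">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="item" type="xs:int" maxOccurs="unbounded"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:schema>
        """;

    /** Collects item text but never commits anything on its own. */
    static final class ItemBuffer extends DefaultHandler {
        final List<String> staged = new ArrayList<>();
        private StringBuilder text;

        @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("item".equals(localName)) text = new StringBuilder();
        }
        @Override public void characters(char[] ch, int start, int length) {
            if (text != null) text.append(ch, start, length);
        }
        @Override public void endElement(String uri, String localName, String qName) {
            if ("item".equals(localName)) {
                staged.add(text.toString());
                text = null;
            }
        }
    }

    // Returns the staged items only if validation reported no errors.
    static List<String> extractIfValid(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        List<SAXParseException> errors = new ArrayList<>();
        ItemBuffer buffer = new ItemBuffer();
        ValidatorHandler validating = schema.newValidatorHandler();
        validating.setErrorHandler(new DefaultHandler() {
            @Override public void error(SAXParseException e) { errors.add(e); }
        });
        validating.setContentHandler(buffer);
        reader.setContentHandler(validating);
        reader.parse(new InputSource(new StringReader(xml)));

        // Promotion step: staged data leaves this method only after acceptance.
        return errors.isEmpty() ? List.copyOf(buffer.staged) : List.of();
    }
}
```

With an invalid second item the buffer has already seen the first one, which is exactly the situation the rule guards against: the handler observed data, but nothing was promoted.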
15. Validation Staging Pattern
For large files, do not insert domain rows directly while parsing unless you have a rollback strategy.
A better pattern stages extracted records first and promotes them only after validation is accepted.
Staging table example:
create table xml_ingest_staging (
    ingest_id uuid not null,
    record_number bigint not null,
    record_type varchar(100) not null,
    payload_hash varchar(64) not null,
    extracted_json jsonb not null,
    validation_status varchar(30) not null,
    created_at timestamp not null,
    primary key (ingest_id, record_number)
);
This gives:
- replayability;
- auditability;
- partial diagnostics;
- controlled promotion;
- safer retry semantics.
16. Fail-Fast vs Error Aggregation
Validation mode should be explicit.
| Mode | Behavior | Good For |
|---|---|---|
| Fail-fast | Stop at first serious issue | API latency, attack surface reduction |
| Collect recoverable errors | Gather multiple issues | partner testing, UI feedback, QA |
| Hybrid | Stop after threshold | batch safety, noisy invalid payloads |
Example threshold handler:
public final class ThresholdErrorHandler implements org.xml.sax.ErrorHandler {
    private final int maxErrors;
    private final java.util.List<XmlValidationIssue> issues = new java.util.ArrayList<>();

    public ThresholdErrorHandler(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    @Override
    public void warning(org.xml.sax.SAXParseException e) {
        issues.add(toIssue(Severity.WARNING, e));
    }

    @Override
    public void error(org.xml.sax.SAXParseException e) throws org.xml.sax.SAXException {
        issues.add(toIssue(Severity.ERROR, e));
        // Stop once the configured number of errors has been recorded.
        if (countErrors() >= maxErrors) {
            throw new org.xml.sax.SAXException("Validation error threshold exceeded: " + maxErrors, e);
        }
    }

    @Override
    public void fatalError(org.xml.sax.SAXParseException e) throws org.xml.sax.SAXException {
        issues.add(toIssue(Severity.FATAL, e));
        throw e;
    }

    public java.util.List<XmlValidationIssue> issues() {
        return java.util.List.copyOf(issues);
    }

    private long countErrors() {
        return issues.stream().filter(XmlValidationIssue::isError).count();
    }

    private static XmlValidationIssue toIssue(Severity severity, org.xml.sax.SAXParseException e) {
        // SAX reports -1 when a position is unknown; store null, as CollectingErrorHandler does.
        Integer line = e.getLineNumber() >= 0 ? e.getLineNumber() : null;
        Integer column = e.getColumnNumber() >= 0 ? e.getColumnNumber() : null;
        return new XmlValidationIssue(severity, "XML_SCHEMA_VALIDATION_" + severity.name(),
                e.getMessage(), line, column, e.getSystemId(), e.getPublicId(), null);
    }
}
Recommended defaults:
external API request: fail-fast or low threshold
partner test portal: collect multiple errors
nightly batch: threshold + evidence
security-sensitive gateway: fail-fast with strict parser limits
17. Schema Version Selection
Validation must choose a schema version.
Common strategies:
| Strategy | Example | Risk |
|---|---|---|
| endpoint-specific | /api/v1/orders uses v1 schema | simple but endpoint proliferation |
| namespace-based | root namespace maps to schema | strong XML-native design |
| document field | <schemaVersion>1.4</schemaVersion> | must parse before validation |
| partner profile | partner A uses schema set X | operationally useful but hidden coupling |
| envelope header | integration envelope declares contract | good for messaging systems |
Recommended for enterprise XML:
Use root namespace + controlled schema registry.
Optionally confirm version via envelope/header for operational routing.
Example version detector using StAX:
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;

public final class RootNamespaceDetector {
    public static String detectRootNamespace(InputStream xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // No DTD processing and no external entities while peeking at the root.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty("javax.xml.stream.isSupportingExternalEntities", false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // The first START_ELEMENT is the root element.
                    return reader.getNamespaceURI();
                }
            }
            throw new IllegalArgumentException("XML document has no root element");
        } finally {
            reader.close();
        }
    }
}
Remember: detecting root namespace consumes the stream. Use byte buffering, mark/reset, temp file, or a repeatable input abstraction.
18. Repeatable Input Abstraction
Many production bugs come from reading an InputStream twice.
Create an abstraction:
import java.io.InputStream;

public interface RepeatableXmlInput {
    InputStream openStream() throws Exception;
    long sizeBytes();
    String fingerprint();
}
Implementation options:
| Payload Size | Repeatable Strategy |
|---|---|
| small API request | byte array |
| medium payload | temp file |
| large batch | object storage key + checksum |
| regulated payload | immutable evidence store |
Avoid passing raw InputStream through many layers unless ownership is crystal clear.
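For the small-payload row of the table, a byte-array implementation can look like the sketch below. The interface is restated inside the class so the example stands alone; `ByteArrayXmlInput` is a hypothetical name, and `HexFormat` assumes Java 17 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class RepeatableInputDemo {
    // Restated from the text so the example is self-contained.
    interface RepeatableXmlInput {
        InputStream openStream() throws Exception;
        long sizeBytes();
        String fingerprint();
    }

    /** Holds the payload in memory; every openStream() call starts from byte zero. */
    static final class ByteArrayXmlInput implements RepeatableXmlInput {
        private final byte[] bytes;
        private final String fingerprint;

        ByteArrayXmlInput(byte[] source) {
            this.bytes = source.clone(); // defensive copy: later caller changes do not leak in
            this.fingerprint = sha256Hex(this.bytes);
        }

        @Override public InputStream openStream() { return new ByteArrayInputStream(bytes); }
        @Override public long sizeBytes() { return bytes.length; }
        @Override public String fingerprint() { return fingerprint; }
    }

    static String sha256Hex(byte[] content) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

Namespace detection, validation, and extraction can each call `openStream()` independently, and the fingerprint is computed once from the same bytes that all of them read.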
19. XML Catalogs
A catalog maps external identifiers or URIs to local resources.
Conceptual mapping:
https://example.com/schema/common/v1/common.xsd
-> classpath:/schemas/common/common-v1.xsd
Benefits:
- no runtime network dependency;
- deterministic schema resolution;
- faster validation;
- safer deployment;
- better offline testing.
In Java, you can implement catalog behavior with LSResourceResolver or use platform/catalog support depending on runtime and library choice.
Governance rule:
Every schema dependency must resolve from a controlled artifact, not from the public internet.
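On Java 9 and later, the platform option is the `javax.xml.catalog` API. The sketch below is a minimal illustration under stated assumptions: the catalog content, the file name `common-v1.xsd`, and the `CatalogDemo` class are hypothetical, and the catalog is expected to exist as a file on disk.

```java
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.transform.Source;

final class CatalogDemo {
    // Hypothetical OASIS XML Catalog: maps a published URI to a file next to the catalog.
    static final String CATALOG = """
        <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
          <uri name="https://example.com/schema/common/v1/common.xsd" uri="common-v1.xsd"/>
        </catalog>
        """;

    // Returns the local system identifier the catalog maps the published URI to.
    static String resolveLocally(Path catalogFile, String publishedUri) {
        CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogFile.toUri());
        Source source = resolver.resolve(publishedUri, null);
        return source.getSystemId();
    }
}
```

`CatalogResolver` also implements `LSResourceResolver`, so the same object can in principle be passed to `SchemaFactory.setResourceResolver`; how unmatched references are treated depends on the catalog's resolve setting, which is worth pinning explicitly in production.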
20. Validation and Business Rules
XSD catches structural errors:
required element missing
invalid datatype
invalid enum
wrong sequence
invalid occurrence count
identity constraint violation
XSD should not carry all business rules.
Examples better outside XSD:
| Rule | Reason |
|---|---|
| customer must be active | needs database/current state |
| quote expiry must consider timezone policy | business semantics |
| discount requires approval above threshold | workflow/stateful rule |
| order line product must exist | reference data lookup |
| status transition must be legal | lifecycle model |
Do not weaken XSD just because business validation exists. Use XSD to guarantee the structural contract business validation depends on.
21. Error Taxonomy
A production validation service should classify failures.
| Error Type | Meaning | Example |
|---|---|---|
| XML_NOT_WELL_FORMED | XML parser cannot build event stream | missing closing tag |
| XML_SCHEMA_NOT_FOUND | selected schema missing | unknown namespace |
| XML_SCHEMA_COMPILE_FAILED | XSD bundle invalid | broken import |
| XML_SCHEMA_VALIDATION_FAILED | XML violates XSD | invalid enum |
| XML_SECURITY_POLICY_VIOLATION | disallowed DTD/entity/resource | DOCTYPE found |
| XML_RESOURCE_RESOLUTION_FAILED | resolver cannot map dependency | unallowlisted namespace |
| XML_SEMANTIC_VALIDATION_FAILED | business rule violation | inactive customer |
| XML_PROCESSING_TIMEOUT | exceeded processing budget | huge input |
Why taxonomy matters:
- correct retry behavior;
- partner support;
- alert routing;
- SLA reporting;
- security monitoring;
- audit explanation.
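The taxonomy becomes actionable once each type carries its handling policy. In the sketch below the enum constants come from the table, while the `Owner` values and the `retryable` flags are illustrative assumptions that each organization would set for itself.

```java
final class ErrorTaxonomyDemo {
    /** Who is expected to fix the problem. Assumed categories, for illustration. */
    enum Owner { SENDER, PLATFORM }

    enum XmlErrorType {
        XML_NOT_WELL_FORMED(Owner.SENDER, false),
        XML_SCHEMA_NOT_FOUND(Owner.SENDER, false),
        XML_SCHEMA_COMPILE_FAILED(Owner.PLATFORM, false),
        XML_SCHEMA_VALIDATION_FAILED(Owner.SENDER, false),
        XML_SECURITY_POLICY_VIOLATION(Owner.SENDER, false),
        XML_RESOURCE_RESOLUTION_FAILED(Owner.PLATFORM, false),
        XML_SEMANTIC_VALIDATION_FAILED(Owner.SENDER, false),
        XML_PROCESSING_TIMEOUT(Owner.PLATFORM, true);

        private final Owner owner;
        private final boolean retryable;

        XmlErrorType(Owner owner, boolean retryable) {
            this.owner = owner;
            this.retryable = retryable;
        }

        Owner owner() { return owner; }

        // Resending the same bytes only helps when the failure was not caused by the bytes.
        boolean retryable() { return retryable; }

        // Platform-owned failures should page the platform team, not the partner.
        boolean alertPlatformTeam() { return owner == Owner.PLATFORM; }
    }
}
```

Keeping the policy on the enum means retry logic, alert routing, and partner messaging read one source of truth instead of each re-deriving it from error strings.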
22. Observability for Validation
Minimum metrics:
xml.validation.count
xml.validation.accepted.count
xml.validation.rejected.count
xml.validation.duration.ms
xml.validation.input.bytes
xml.validation.schema.compile.duration.ms
xml.validation.schema.cache.hit.count
xml.validation.schema.cache.miss.count
xml.validation.issue.count
Useful dimensions:
schema_id
schema_version
partner_id
message_type
pipeline_stage
rejection_reason
Be careful with high cardinality:
Do not tag metrics with correlation_id, filename, raw namespace from untrusted input, or raw error message.
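A small guard keeps untrusted values out of metric tags. The sketch below assumes a hypothetical `MetricTags` helper and an `"other"` bucket; the set of known values would typically come from the schema registry.

```java
import java.util.Set;

final class MetricTags {
    static final String OTHER = "other";

    // Returns the raw value only if it is in the known set; everything else
    // collapses into one bucket, so tag cardinality stays bounded.
    static String bounded(String rawValue, Set<String> knownValues) {
        return rawValue != null && knownValues.contains(rawValue) ? rawValue : OTHER;
    }
}
```

With this in place, a sender that invents a new namespace per request adds samples to the `other` series instead of creating a new time series each time.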
Structured log example:
{
  "event": "xml.validation.completed",
  "correlationId": "b7de7a2a-9d9c-4e30-9e7a-7e97d8d9b31a",
  "schemaId": "order-message",
  "schemaVersion": "1.4.0",
  "accepted": false,
  "issueCount": 3,
  "inputSizeBytes": 58291,
  "elapsedMs": 43,
  "inputSha256": "c8b4...",
  "primaryErrorType": "XML_SCHEMA_VALIDATION_FAILED"
}
Do not log full XML by default. Use controlled evidence storage with redaction and retention policy.
23. Validation Evidence Model
For regulated systems, persist evidence.
create table xml_validation_evidence (
    validation_id uuid primary key,
    correlation_id varchar(100) not null,
    schema_id varchar(100) not null,
    schema_version varchar(50) not null,
    target_namespace varchar(500) not null,
    input_sha256 varchar(64) not null,
    input_size_bytes bigint not null,
    accepted boolean not null,
    primary_error_type varchar(100),
    issue_count int not null,
    parser_policy_version varchar(50) not null,
    resolver_policy_version varchar(50) not null,
    started_at timestamp not null,
    completed_at timestamp not null
);

create table xml_validation_issue (
    validation_id uuid not null,
    issue_index int not null,
    severity varchar(20) not null,
    code varchar(100) not null,
    message text not null,
    line_number int,
    column_number int,
    system_id text,
    xml_path_hint text,
    primary key (validation_id, issue_index)
);
Evidence lets you answer:
Which schema validated this payload?
What exactly failed?
Was the parser policy secure at that time?
Can we reproduce the decision later?
24. Line/Column vs XPath Path
Parser errors usually provide line/column, not XPath.
Line/column is useful for raw document debugging. XPath path is useful for application-level support.
For SAX/StAX streaming, maintaining exact XPath-like path requires a stack:
public final class XmlPathStack {
    private final java.util.Deque<String> stack = new java.util.ArrayDeque<>();

    public void push(String namespaceUri, String localName) {
        // Qualify the name as {namespace}local when a namespace is present.
        String name = namespaceUri == null || namespaceUri.isBlank()
                ? localName
                : "{" + namespaceUri + "}" + localName;
        stack.push(name);
    }

    public void pop() {
        stack.pop();
    }

    public String currentPath() {
        // The deque iterates innermost-first, so reverse to get root-first order.
        java.util.List<String> names = new java.util.ArrayList<>(stack);
        java.util.Collections.reverse(names);
        return "/" + String.join("/", names);
    }
}
However, ErrorHandler may not directly know the business handler stack. If you need path hints, design parser/validator/handler composition carefully.
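One possible composition places a path-tracking SAX filter in front of the ValidatorHandler, so the ErrorHandler can read the current path when an error is reported. This is a sketch under assumptions: the schema is hypothetical, the path uses local names only, and it relies on the validator reporting an error while the offending element is still open.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

final class PathHintDemo {
    // Hypothetical schema used only for illustration.
    static final String XSD = """
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:element name="order">
            <xs:complexType>
              <xs:sequence><xs:element name="qty" type="xs:int"/></xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:schema>
        """;

    /** Tracks the open-element path, then forwards each event downstream. */
    static final class PathTrackingFilter extends XMLFilterImpl {
        private final Deque<String> open = new ArrayDeque<>();

        PathTrackingFilter(XMLReader parent) { super(parent); }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            open.addLast(localName);              // push before validation sees the element
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            super.endElement(uri, localName, qName);
            open.removeLast();                    // pop after validation has finished with it
        }

        String currentPath() { return "/" + String.join("/", open); }
    }

    static Schema compile() throws Exception {
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
    }

    // Returns one path hint per reported validation error.
    static List<String> errorPaths(Schema schema, String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        PathTrackingFilter filter = new PathTrackingFilter(spf.newSAXParser().getXMLReader());

        List<String> paths = new ArrayList<>();
        ValidatorHandler validating = schema.newValidatorHandler();
        validating.setErrorHandler(new DefaultHandler() {
            @Override public void error(SAXParseException e) { paths.add(filter.currentPath()); }
        });
        filter.setContentHandler(validating);
        filter.parse(new InputSource(new StringReader(xml)));
        return paths;
    }
}
```

The ordering inside the filter is the design choice that matters: pushing before and popping after the downstream call keeps the element on the stack for the whole time the validator is working on it.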
25. Handling XSD 1.0 vs XSD 1.1
The validator built into the JDK implements W3C XML Schema 1.0. XSD 1.1 support generally requires plugging in a JAXP provider that implements it.
XSD 1.1 features such as assertions are attractive, but be careful:
A schema language upgrade is a runtime dependency decision, not just a file syntax decision.
Decision checklist:
- Does the chosen validator support XSD 1.1 fully enough for your constraints?
- Is it available in all deployment environments?
- Are error messages stable enough for support workflows?
- Are performance characteristics acceptable?
- Are generated bindings affected?
- Can partners validate with compatible tools?
For broad interoperability, many organizations intentionally stay with XSD 1.0 and place advanced semantic rules outside XSD.
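Provider availability can be probed at startup instead of being assumed. The sketch below relies on the documented behavior of SchemaFactory.newInstance, which throws IllegalArgumentException when no provider implements the requested schema language. The class name SchemaLanguageProbe is hypothetical, and the XSD 1.1 language URI is the one used by Apache Xerces; treat it as an assumption and confirm it against your provider's documentation.

```java
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;

public final class SchemaLanguageProbe {

    // Schema language URI used by Apache Xerces for XSD 1.1 (assumption: check your provider).
    public static final String XSD_1_1 = "http://www.w3.org/XML/XMLSchema/v1.1";

    /** Returns true if some SchemaFactory provider on the classpath implements the language. */
    public static boolean isAvailable(String schemaLanguage) {
        try {
            SchemaFactory.newInstance(schemaLanguage);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // no provider for this schema language
        }
    }

    public static void main(String[] args) {
        System.out.println("XSD 1.0 available: " + isAvailable(XMLConstants.W3C_XML_SCHEMA_NS_URI));
        System.out.println("XSD 1.1 available: " + isAvailable(XSD_1_1));
    }
}
```

Running this check in every deployment environment turns "is it available everywhere?" from a guess into a startup fact.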
26. Multi-Schema Validation
Some systems validate the same payload against several contracts:
base schema
+ partner profile schema
+ regulatory overlay schema
Pattern:
Be careful with error semantics:
- If base schema fails, downstream profile errors may be meaningless.
- If partner overlay duplicates base rules, diagnostics become confusing.
- If overlay depends on transformed/canonical XML, say so explicitly.
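The first caution can be handled by running schemas as ordered stages and stopping at the first stage that reports issues. The following is a minimal sketch under stated assumptions: StagedSchemaValidation is a hypothetical class, schemas are passed as in-memory strings for brevity, and the ACCESS_EXTERNAL_* properties are assumed to be supported by the JAXP implementation in use (they are in the JDK's built-in one).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public final class StagedSchemaValidation {

    // LinkedHashMap keeps stages in registration order: base first, overlays later.
    private final Map<String, Schema> stages = new LinkedHashMap<>();

    public StagedSchemaValidation addStage(String stageName, String xsd) throws SAXException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");    // no external DTD access
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, ""); // no external schema access
        stages.put(stageName, factory.newSchema(new StreamSource(new StringReader(xsd))));
        return this;
    }

    /** Returns issues prefixed with the stage name; later stages are skipped after a failure. */
    public List<String> validate(String xml) throws Exception {
        List<String> issues = new ArrayList<>();
        for (Map.Entry<String, Schema> stage : stages.entrySet()) {
            String prefix = stage.getKey() + ": ";
            Validator validator = stage.getValue().newValidator(); // one validator per run
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { issues.add(prefix + e.getMessage()); }
                @Override public void error(SAXParseException e) { issues.add(prefix + e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            try {
                validator.validate(new StreamSource(new StringReader(xml)));
            } catch (SAXParseException e) {
                issues.add(prefix + e.getMessage()); // fatal error, for example not well-formed
            }
            if (!issues.isEmpty()) {
                break; // downstream profile errors would be noise at this point
            }
        }
        return issues;
    }
}
```

Prefixing each issue with its stage name keeps diagnostics attributable when an overlay repeats a base rule.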
A better model is often:
XSD validates structural envelope.
Schematron/business rules validate cross-field policy.
Domain service validates stateful facts.
Schematron is outside this part, but it is worth knowing as a rule-oriented XML validation technology.
27. API Boundary Example
For an HTTP XML endpoint:
Boundary rules:
- reject unsupported content type before XML parsing;
- reject oversize payload before parser allocation;
- validate before business mutation;
- persist validation evidence for accepted and rejected documents;
- return safe messages externally;
- keep raw XML only under retention/redaction policy.
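The first two rules can be enforced before any XML parser is created. The sketch below is framework-neutral and illustrative only: XmlBoundaryGate, its Decision enum, and the allowed media types are assumptions, not part of any HTTP library. It needs Java 11 or later for InputStream.readNBytes(int).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.Set;

public final class XmlBoundaryGate {

    public enum Decision { ACCEPT_FOR_PARSING, UNSUPPORTED_MEDIA_TYPE, PAYLOAD_TOO_LARGE }

    // Assumed allow-list; adjust to the contract of your endpoint.
    private static final Set<String> ALLOWED_MEDIA_TYPES = Set.of("application/xml", "text/xml");

    private final int maxBytes;

    public XmlBoundaryGate(int maxBytes) {
        if (maxBytes <= 0 || maxBytes == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("maxBytes must be in 1.." + (Integer.MAX_VALUE - 1));
        }
        this.maxBytes = maxBytes;
    }

    /** Cheap header-level checks; runs before any body byte is read. */
    public Decision check(String contentTypeHeader, long declaredLength) {
        if (contentTypeHeader == null) {
            return Decision.UNSUPPORTED_MEDIA_TYPE;
        }
        // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
        String mediaType = contentTypeHeader.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (!ALLOWED_MEDIA_TYPES.contains(mediaType)) {
            return Decision.UNSUPPORTED_MEDIA_TYPE;
        }
        if (declaredLength > maxBytes) {
            return Decision.PAYLOAD_TOO_LARGE;
        }
        return Decision.ACCEPT_FOR_PARSING;
    }

    /** The declared length may be absent or wrong, so the body read is bounded as well. */
    public byte[] readBounded(InputStream body) throws IOException {
        byte[] data = body.readNBytes(maxBytes + 1); // read one extra byte to detect overflow
        if (data.length > maxBytes) {
            throw new IOException("payload exceeds " + maxBytes + " bytes");
        }
        return data;
    }
}
```

Buffering the whole body is reasonable only because the size is bounded. For larger limits, a counting stream wrapper that fails past the limit would keep the pipeline streaming.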
28. Batch Boundary Example
For files:
incoming/
partner-a/orders-2026-07-02.xml
processing/
accepted/
rejected/
quarantine/
Batch pipeline:
Failure handling:
| Failure | Action |
|---|---|
| not well-formed | reject file |
| unknown schema | quarantine or reject based on policy |
| schema compile failure | alert platform team, do not blame partner automatically |
| schema validation failed | reject with issue report |
| transient storage failure | retry pipeline step |
| evidence persistence failure | do not mark accepted unless policy allows degraded mode |
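The table can be encoded as an explicit policy object so that each failure class maps to exactly one action and the policy-dependent rows become visible configuration. This is a sketch with hypothetical names (BatchFailurePolicy, Failure, Action); the two boolean flags stand in for whatever policy mechanism the platform actually uses.

```java
public final class BatchFailurePolicy {

    public enum Failure {
        NOT_WELL_FORMED,
        UNKNOWN_SCHEMA,
        SCHEMA_COMPILE_FAILURE,
        SCHEMA_VALIDATION_FAILED,
        TRANSIENT_STORAGE_FAILURE,
        EVIDENCE_PERSISTENCE_FAILURE
    }

    public enum Action {
        REJECT_FILE,
        QUARANTINE,
        ALERT_PLATFORM_TEAM,
        REJECT_WITH_ISSUE_REPORT,
        RETRY_STEP,
        HOLD_NOT_ACCEPTED,
        ACCEPT_DEGRADED
    }

    private final boolean quarantineUnknownSchema;
    private final boolean allowDegradedEvidence;

    public BatchFailurePolicy(boolean quarantineUnknownSchema, boolean allowDegradedEvidence) {
        this.quarantineUnknownSchema = quarantineUnknownSchema;
        this.allowDegradedEvidence = allowDegradedEvidence;
    }

    /** Mirrors the failure-handling table row by row. */
    public Action actionFor(Failure failure) {
        switch (failure) {
            case NOT_WELL_FORMED:
                return Action.REJECT_FILE;
            case UNKNOWN_SCHEMA:
                return quarantineUnknownSchema ? Action.QUARANTINE : Action.REJECT_FILE;
            case SCHEMA_COMPILE_FAILURE:
                return Action.ALERT_PLATFORM_TEAM; // a platform problem, not the partner's fault
            case SCHEMA_VALIDATION_FAILED:
                return Action.REJECT_WITH_ISSUE_REPORT;
            case TRANSIENT_STORAGE_FAILURE:
                return Action.RETRY_STEP;
            case EVIDENCE_PERSISTENCE_FAILURE:
                return allowDegradedEvidence ? Action.ACCEPT_DEGRADED : Action.HOLD_NOT_ACCEPTED;
            default:
                throw new IllegalArgumentException("unmapped failure: " + failure);
        }
    }
}
```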
29. Testing Validation Pipeline
Test categories:
| Test Type | Purpose |
|---|---|
| valid golden samples | ensure accepted contracts remain accepted |
| invalid structural samples | ensure failures are caught |
| namespace variants | catch QName/default namespace mistakes |
| malicious XML samples | ensure parser hardening |
| schema import/include samples | ensure deterministic resolution |
| version routing samples | ensure correct schema selection |
| error snapshot tests | ensure diagnostics remain useful |
| performance tests | ensure validation budget is realistic |
Example JUnit test:
import org.junit.jupiter.api.Test;
import static org.assertj.core.api.Assertions.assertThat;

class XsdValidationServiceTest {

    @Test
    void rejectsOrderWithInvalidStatus() throws Exception {
        XsdValidationService service = TestValidationServices.orderService();
        RepeatableXmlInput input = TestInputs.fromClasspath("samples/order-invalid-status.xml");

        XmlValidationResult result = service.validate(
                input.openStream(),
                input.sizeBytes(),
                TestSchemas.orderV1()
        );

        assertThat(result.accepted()).isFalse();
        assertThat(result.issues())
                .anyMatch(issue -> issue.message().contains("status"));
    }
}
Testing principle:
Test the validation decision, error classification, and evidence metadata—not only exception throwing.
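The "malicious XML samples" category can be exercised without any test framework. The sketch below is an illustration, not the series' test harness: HardenedParserCheck is a hypothetical class, and the disallow-doctype-decl feature is a Xerces-style feature that the JDK's built-in parser supports, which is assumed here. Note that isRejected returns true for any parse failure, so a real test suite should also assert on the reason.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public final class HardenedParserCheck {

    /** Builds a parser that refuses any DOCTYPE declaration. */
    static SAXParser hardenedParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Assumption: the parser in use recognizes this Xerces-style feature.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory.newSAXParser();
    }

    /** Returns true if the hardened parser refuses the document with a parse error. */
    public static boolean isRejected(String xml) throws Exception {
        try {
            hardenedParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return false;
        } catch (SAXParseException e) {
            return true; // DefaultHandler rethrows fatal errors such as a forbidden DOCTYPE
        }
    }

    public static void main(String[] args) throws Exception {
        String malicious = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE order [<!ENTITY xxe SYSTEM \"file:///etc/passwd\">]>"
                + "<order>&xxe;</order>";
        System.out.println("malicious rejected: " + isRejected(malicious));
        System.out.println("benign rejected: " + isRejected("<order/>"));
    }
}
```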
30. Performance Engineering
Validation cost depends on:
- input size;
- schema complexity;
- identity constraints;
- number of imports/includes;
- parser implementation;
- error mode;
- DOM vs streaming source;
- allocation patterns;
- schema compilation caching.
Rules of thumb:
Compile schema once.
Create validator per run.
Prefer streaming sources for large input.
Reject oversize input before XML parser allocation.
Do not build a DOM from untrusted input before validating it unless the input size is bounded.
For high throughput:
- cache schema by version;
- reuse byte buffers carefully;
- avoid logging large messages;
- measure p95/p99 validation latency;
- track schema cache misses;
- benchmark representative invalid payloads, not only valid payloads;
- define max document size and max validation time.
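The "compile once, validator per run" rule and the cache-miss metric can be sketched together. The following is a minimal illustration, assuming schema text can be loaded by a version key: SchemaCache and its loader function are hypothetical, and the compile counter stands in for a real metrics library. Schema objects are documented as thread-safe and immutable, while Validator and SchemaFactory are not thread-safe, which is why only the Schema is cached.

```java
import java.io.StringReader;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public final class SchemaCache {

    private final ConcurrentMap<String, Schema> compiled = new ConcurrentHashMap<>();
    private final AtomicInteger compileCount = new AtomicInteger(); // doubles as a cache-miss counter
    private final Function<String, String> schemaTextLoader;       // hypothetical: version key -> XSD text

    public SchemaCache(Function<String, String> schemaTextLoader) {
        this.schemaTextLoader = schemaTextLoader;
    }

    /** Compiles the schema on first use; later calls return the cached instance. */
    public Schema schemaFor(String versionKey) {
        return compiled.computeIfAbsent(versionKey, key -> {
            compileCount.incrementAndGet();
            try {
                // SchemaFactory is not thread-safe, so a fresh one is created per compilation.
                SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
                return factory.newSchema(new StreamSource(new StringReader(schemaTextLoader.apply(key))));
            } catch (SAXException e) {
                throw new IllegalStateException("schema compile failed for " + key, e);
            }
        });
    }

    /** Validators are cheap to create and must not be shared across threads. */
    public Validator newValidator(String versionKey) {
        return schemaFor(versionKey).newValidator();
    }

    public int compileCount() {
        return compileCount.get();
    }
}
```

A failed compilation is not cached here, so every request for a broken version recompiles and fails again. Whether to cache failures is a policy decision.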
31. Validation Timeouts and Resource Limits
JAXP has security-related limits, but your service still needs outer resource controls.
Controls:
| Control | Layer |
|---|---|
| request body max size | HTTP gateway |
| compressed payload ratio limit | gateway/ingest |
| parser secure processing | JAXP |
| DTD/entity disabled | parser/JAXP |
| external resource denied | JAXP/resolver |
| validation timeout | executor/service boundary |
| memory budget | JVM/container |
| batch file max size | ingestion policy |
Timeout wrapper example:
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class TimeoutValidationRunner {
    private final ExecutorService executor;

    public TimeoutValidationRunner(ExecutorService executor) {
        this.executor = executor;
    }

    public XmlValidationResult runWithTimeout(
            Callable<XmlValidationResult> task,
            Duration timeout
    ) throws Exception {
        Future<XmlValidationResult> future = executor.submit(task);
        try {
            // Wait for the result, but no longer than the validation budget.
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort: cancel(true) only interrupts the worker thread.
            future.cancel(true);
            throw new XmlProcessingTimeoutException("XML validation timed out after " + timeout, e);
        }
    }
}
Caveat:
Thread interruption may not stop parser work: future.cancel(true) only requests an interrupt, and a CPU-bound parse may not check for it.
Use layered limits, not a timeout alone.
32. Common Production Bugs
| Bug | Root Cause | Prevention |
|---|---|---|
| works locally, fails in container | relative schema paths | classpath/catalog resolver |
| network call during validation | schemaLocation/external access | deny external schema/DTD |
| validator shared across threads | wrong lifecycle | cache Schema, not Validator |
| huge memory spike | DOMSource on large XML | streaming validation |
| accepted payload later fails mapping | XSD too loose or wrong schema version | versioned schema registry |
| invalid namespace rejected as missing element | default namespace misunderstanding | namespace-aware tests |
| partner cannot fix payload | raw parser message unclear | normalized issue report |
| security scanner flags XXE | parser not hardened | secure factory baseline |
| replay gives different result | schema changed under same file name | immutable schema bundle/version |
| alert storm on bad batch | per-record logging | aggregate batch diagnostics |
33. Production Checklist
Before putting XML validation into production:
- schema bundle has stable identity and version;
- schema dependencies resolve deterministically;
- external DTD/schema access is denied by default;
- parser features are hardened;
- schema compilation is cached;
- Validator is created per validation run;
- max input size is enforced;
- validation timeout/resource controls exist;
- validation result includes evidence metadata;
- error taxonomy is stable;
- external error messages are safe;
- internal diagnostics include line/column and correlation ID;
- accepted and rejected validations are observable;
- malicious XML tests exist;
- schema version routing is tested;
- replay can reproduce the original validation decision.
34. Kaufman Practice Drill
Timebox: 90–120 minutes.
Build a small validation service for PurchaseOrder.
Requirements:
- Create common.xsd and purchase-order.xsd.
- Use xs:import or xs:include intentionally.
- Implement SchemaRegistry that caches compiled schema.
- Implement LSResourceResolver that resolves only classpath schemas.
- Disable external DTD/schema access.
- Return structured XmlValidationResult.
- Add tests for:
  - valid XML;
  - invalid enum;
  - missing required element;
  - wrong namespace;
  - malicious DOCTYPE;
  - unknown schema version.
- Log one structured validation event.
- Persist or print validation evidence.
Self-correction questions:
Can I explain exactly which schema validated the payload?
Can I reproduce the decision tomorrow?
Can invalid XML produce a safe, actionable error?
Can malicious XML cause network access?
Can a 500 MB file avoid DOM allocation?
Can the service process two validations concurrently without sharing mutable Validator state?
35. Summary
XSD validation in Java is not just calling validator.validate(...).
Production-grade validation requires:
- secure parser and resolver policy;
- deterministic schema bundle resolution;
- compiled schema caching;
- per-run validator lifecycle;
- structured issue collection;
- schema version identity;
- resource limits;
- observability and evidence;
- clear separation between structural and semantic validation.
Core invariant:
The value of validation is not only rejection.
The value is a reproducible, explainable, defensible boundary decision.
In the next part, we move from structural validation to targeted selection and interrogation of XML documents using XPath.
References
- Oracle Java API, javax.xml.validation: XML document validation API.
- Oracle Java API, SchemaFactory, Schema, Validator, and ValidatorHandler.
- Oracle JAXP Security Guide: secure processing, external resource access controls, and XML processing limits.
- W3C XML Schema Part 1 and Part 2 specifications.