---
title: "Build From Scratch: Mintlify-like AI-driven Documentation Generator CLI - Part 010"
description: Design and implement a content intermediate representation for an AI-driven documentation generator, separating source extraction, AI planning, semantic blocks, provenance, validation, and deterministic MDX emission.
series: learn-mintlify-like-ai-docs-cli
seriesTitle: "Build From Scratch: Mintlify-like AI-driven Documentation Generator CLI"
order: 10
partTitle: Content Intermediate Representation
tags:
- documentation
- ai
- mdx
- compiler
- intermediate-representation
- developer-tools
date: 2026-07-03
---
Part 010 — Content Intermediate Representation
In Part 009, we built the classifier.
The pipeline now knows that:
- README.md is a project overview source,
- openapi.yaml is an API reference source,
- package.json is a command/configuration source,
- docs/quickstart.mdx is an existing doc page,
- src/cli.ts is source code that may define commands,
- .env must be blocked.
But we do not yet have an internal representation for documentation content.
Many people would jump straight to this:
source files + prompt -> AI -> .mdx file
It looks fast. But for a production-grade documentation generator, this is one of the most fragile designs.
The problems:
- AI output may not be valid MDX,
- heading hierarchy can get scrambled,
- frontmatter can break,
- links can be broken,
- code blocks can lose their language tag,
- API reference can be inconsistent,
- prose claims may have no provenance,
- incremental updates are hard because all we have is one big string,
- quality gates are hard because document structure is not explicit,
- the renderer and the AI generator become tightly coupled.
The solution is to use a Content Intermediate Representation.
We will call it Content IR.
Content IR is the bridge between:
- scanner,
- classifier,
- parsers,
- code index,
- OpenAPI parser,
- AI planner,
- AI writer,
- MDX compiler,
- search indexer,
- llms.txt exporter,
- quality gate.
Without an IR, every component ends up passing strings to the others.
With an IR, we have a structural contract.
1. Mental model: documentation generator as compiler
A compiler rarely goes directly from source code string to machine code string.
A compiler usually has stages:
source text -> tokens -> AST -> semantic model -> IR -> optimized IR -> target output
Our documentation generator should work the same way:
repo files -> classified artifacts -> extracted facts -> page plan -> content IR -> MDX -> static site
Why not go straight to MDX?
Because MDX is a target format, not the internal source of truth.
MDX is good for authoring and rendering. It allows Markdown plus JSX/component usage, which suits modern docs. But as an internal model, an MDX string is too unconstrained.
We need a stricter form:
- a heading is an object,
- a paragraph is an object,
- a code block is an object,
- an API operation reference is an object,
- an admonition is an object,
- provenance is a field,
- dependency is a field,
- validation can run before render.
Content IR must be expressive enough to produce MDX, yet strict enough to be validated.
2. What Content IR is and is not
Content IR is:
- an internal document structure,
- a serializable JSON object,
- a neutral target before MDX,
- the place where provenance is stored,
- the input for validation and transformation passes,
- a unit that can be diffed,
- the contract between the AI writer and the renderer.
Content IR is not:
- a raw MDX parser AST,
- the final format for user editing,
- a permanent database schema that can never change,
- prompt text,
- a replacement for the OpenAPI schema,
- a replacement for the code symbol graph.
Content IR sits in the middle.
3. Core invariants
Before writing the schema, define the invariants.
3.1 IR must be serializable
Content IR must be storable as JSON.
No functions. No class instances. No circular references.
This means it can be:
- cached,
- diffed,
- snapshotted in tests,
- sent between processes,
- inspected via the CLI.
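The serializability invariant can be checked mechanically before an IR object is cached. The following is a minimal sketch, not part of the original pipeline: `findSerializationProblem` and `SerializationProblem` are hypothetical names, and the rule that `undefined` is rejected is an assumption made here to keep round trips lossless.

```typescript
// Hypothetical helper: reports the first value that would not survive a JSON round trip.
type SerializationProblem = { path: string; reason: string };

function findSerializationProblem(
  value: unknown,
  path = "$",
  ancestors: Set<object> = new Set(),
): SerializationProblem | null {
  if (value === null) return null;
  const kind = typeof value;
  if (kind === "string" || kind === "boolean") return null;
  if (kind === "number") {
    return Number.isFinite(value as number) ? null : { path, reason: "non-finite number" };
  }
  // Assumption: undefined is flagged too, because JSON.stringify drops it from
  // objects and turns it into null inside arrays.
  if (kind === "function" || kind === "symbol" || kind === "bigint" || kind === "undefined") {
    return { path, reason: `unsupported type: ${kind}` };
  }
  const obj = value as object;
  if (ancestors.has(obj)) return { path, reason: "circular reference" };
  const proto = Object.getPrototypeOf(obj);
  const isPlain = Array.isArray(obj) || proto === Object.prototype || proto === null;
  if (!isPlain) return { path, reason: "class instance" };
  ancestors.add(obj);
  try {
    for (const [key, child] of Object.entries(obj)) {
      const childPath = Array.isArray(obj) ? `${path}[${key}]` : `${path}.${key}`;
      const problem = findSerializationProblem(child, childPath, ancestors);
      if (problem) return problem;
    }
    return null;
  } finally {
    // Only ancestors on the current path count as cycles; shared siblings are fine.
    ancestors.delete(obj);
  }
}
```

Returning a path rather than a boolean keeps the failure actionable: the diagnostic can name the exact field that broke the invariant.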
3.2 IR must preserve provenance
Every important claim must be traceable.
Not every word needs a citation. But every verifiable technical fact must have an anchor:
- source file path,
- line range,
- symbol ID,
- OpenAPI operation ID,
- package manifest field path,
- extraction job ID.
3.3 IR must be target-neutral enough
We currently target MDX. But the IR should not be too MDX-specific.
Bad:
type BadBlock = {
rawMdx: string;
};
Better:
type CalloutBlock = {
type: "callout";
variant: "info" | "tip" | "warning" | "danger";
title?: string;
children: ContentBlock[];
};
Later, the emitter decides whether the callout becomes:
<Warning>
...
</Warning>
or some other syntax.
3.4 IR must be strict enough to validate
If a block can hold an arbitrary string with no constraints, the IR does not help.
We need validation:
- heading level valid,
- section ID unique,
- code block language known,
- internal link target exists,
- API operation reference resolvable,
- symbol reference resolvable,
- page frontmatter complete,
- no blocked provenance source,
- no raw untrusted JSX.
3.5 IR must support partial generation
AI generation can fail midway.
The IR must be able to record status:
- planned,
- drafted,
- validated,
- emitted,
- failed.
Do not assume a page comes out perfect in one pass.
3.6 IR must support deterministic emission
MDX output must be stable.
If the input IR is the same, the output MDX must be the same byte-for-byte, unless a timestamp has genuinely changed.
Determinism matters for:
- clean Git diffs,
- caching,
- regression tests,
- PR automation,
- user trust.
4. Layering: Fact Model, Page Plan, Content IR, Render AST
Do not put everything into a single model.
We need several layers.
4.1 Extracted Fact Model
The fact model is the output of parsers and extractors.
Example:
export type ExtractedFact =
| PackageScriptFact
| PackageDependencyFact
| OpenApiOperationFact
| CodeSymbolFact
| MarkdownSectionFact
| CiCommandFact;
The fact model is not yet a docs page.
4.2 Page Plan IR
The page plan answers:
- which pages should be created,
- for which audience,
- which sources are used,
- what the section structure is,
- what the acceptance criteria are.
Example:
export type PagePlan = {
pageId: string;
slug: string;
title: string;
pageType: PageType;
purpose: string;
audience: "newUser" | "integrator" | "maintainer" | "operator";
sourceRefs: SourceRef[];
sections: PlannedSection[];
acceptanceCriteria: string[];
};
4.3 Content IR
Content IR is a page that already takes the form of structured content.
4.4 Render IR
Render IR is a form very close to the MDX output.
We can skip Render IR in the first version. But mentally, still keep them separate:
- Content IR: semantic intent,
- Render IR: concrete component mapping.
Example:
Content IR: callout warning about destructive command
Render IR: <Warning> component with escaped child markdown
5. Top-level schema
Start with top-level document package.
export type ContentIrBundle = {
version: "content-ir.v1";
projectId: string;
generatedAt: string;
generatorVersion: string;
pages: PageIr[];
assets: AssetIr[];
diagnostics: Diagnostic[];
};
version is required. IR will evolve.
Do not rely on package version alone. IR schema version and CLI version are different things.
Page IR
export type PageIr = {
id: string;
slug: string;
route: string;
title: string;
description: string;
pageType: PageType;
frontmatter: FrontmatterIr;
nav: NavHintIr;
status: PageGenerationStatus;
sourceRefs: SourceRef[];
sections: SectionIr[];
diagnostics: Diagnostic[];
};
export type PageType =
| "overview"
| "quickstart"
| "concept"
| "howTo"
| "tutorial"
| "apiReference"
| "troubleshooting"
| "migration"
| "changelog"
| "reference";
export type PageGenerationStatus =
| "planned"
| "drafted"
| "validated"
| "emitted"
| "failed";
Frontmatter IR
export type FrontmatterIr = {
title: string;
description: string;
tags: string[];
order?: number;
hidden?: boolean;
generated?: boolean;
lastVerifiedAt?: string;
sourceHash?: string;
};
Do not put arbitrary frontmatter everywhere. Allow extensions later, but keep core fields strict.
6. Source references and provenance
Provenance should be first-class.
export type SourceRef = {
refId: string;
artifactId: string;
path: string;
range?: SourceRange;
symbolId?: string;
openApiPointer?: string;
jsonPointer?: string;
authority: "primary" | "secondary" | "derived" | "generated" | "untrusted" | "unknown";
};
export type SourceRange = {
startLine: number;
endLine: number;
startColumn?: number;
endColumn?: number;
};
Every block can carry sourceRefs.
export type BlockBase = {
id: string;
sourceRefs?: SourceRef[];
diagnostics?: Diagnostic[];
};
A paragraph can be prose. But if it claims “Run pnpm dev to start local development”, its source ref should point to package.json#/scripts/dev or an equivalent extracted fact.
{
"id": "block-run-dev-command",
"type": "paragraph",
"text": "Run `pnpm dev` to start the local development server.",
"sourceRefs": [
{
"refId": "src-package-script-dev",
"artifactId": "artifact-package-json",
"path": "package.json",
"jsonPointer": "/scripts/dev",
"authority": "primary"
}
]
}
This makes fact-checking possible.
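To make the fact-checking step concrete, the jsonPointer in a source ref can be resolved against the parsed manifest. This is a small sketch that follows the JSON Pointer rules of RFC 6901; `resolveJsonPointer` and `sourceRefResolves` are illustrative names, not functions defined elsewhere in this series.

```typescript
// Resolves an RFC 6901 JSON Pointer such as "/scripts/dev" against a parsed document.
// Returns undefined when any segment is missing.
function resolveJsonPointer(doc: unknown, pointer: string): unknown {
  if (pointer === "") return doc; // the empty pointer refers to the whole document
  if (!pointer.startsWith("/")) throw new Error(`Invalid JSON Pointer: ${pointer}`);
  let current: unknown = doc;
  for (const rawToken of pointer.slice(1).split("/")) {
    // Per RFC 6901, decode "~1" to "/" first, then "~0" to "~".
    const token = rawToken.replace(/~1/g, "/").replace(/~0/g, "~");
    if (current === null || typeof current !== "object") return undefined;
    if (!Object.prototype.hasOwnProperty.call(current, token)) return undefined;
    current = (current as Record<string, unknown>)[token];
  }
  return current;
}

// Hypothetical check: does the source ref point at something that really exists?
function sourceRefResolves(manifest: unknown, jsonPointer: string): boolean {
  return resolveJsonPointer(manifest, jsonPointer) !== undefined;
}
```

A reviewer pass could call `sourceRefResolves(packageJson, "/scripts/dev")` and raise a diagnostic when it returns false, instead of trusting that the model cited a real field.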
7. Section IR
A page contains sections.
export type SectionIr = {
id: string;
title: string;
level: 2 | 3 | 4;
purpose?: string;
sourceRefs?: SourceRef[];
blocks: ContentBlock[];
children?: SectionIr[];
};
Why do sections have a purpose?
Because AI planner and reviewer need to know why the section exists.
Example:
{
"id": "install",
"title": "Install the CLI",
"level": 2,
"purpose": "Help a new developer install the package and verify the binary works.",
"blocks": []
}
This is not necessarily emitted to MDX. It is internal instruction and audit metadata.
8. Content blocks
Now define block types.
export type ContentBlock =
| ParagraphBlock
| HeadingBlock
| ListBlock
| CodeBlock
| CalloutBlock
| StepsBlock
| TabsBlock
| TableBlock
| MermaidBlock
| ApiOperationBlock
| SymbolReferenceBlock
| FileTreeBlock
| CardsBlock
| RawMdxBlock;
Keep RawMdxBlock, but make it restricted.
8.1 Paragraph block
export type ParagraphBlock = BlockBase & {
type: "paragraph";
text: InlineContent[];
};
Use inline content, not plain string, if you want good link/code validation.
export type InlineContent =
| { type: "text"; value: string }
| { type: "inlineCode"; value: string }
| { type: "link"; text: string; target: LinkTarget }
| { type: "strong"; children: InlineContent[] }
| { type: "emphasis"; children: InlineContent[] };
export type LinkTarget =
| { kind: "url"; href: string }
| { kind: "route"; route: string }
| { kind: "heading"; pageId: string; headingId: string }
| { kind: "source"; sourceRef: SourceRef };
You may simplify v1 by storing paragraph as Markdown text, but the long-term model benefits from inline structure.
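One payoff of inline structure is that escaping becomes the emitter's job instead of the model's. The sketch below shows one possible `emitInlineContent`; it uses a reduced `LinkTarget` with only url and route kinds, assumes hrefs were validated earlier, and its `escapeText` covers a minimal character set chosen for illustration.

```typescript
type LinkTarget =
  | { kind: "url"; href: string }
  | { kind: "route"; route: string };

type InlineContent =
  | { type: "text"; value: string }
  | { type: "inlineCode"; value: string }
  | { type: "link"; text: string; target: LinkTarget }
  | { type: "strong"; children: InlineContent[] }
  | { type: "emphasis"; children: InlineContent[] };

// Backslash-escapes characters that Markdown or MDX would treat as syntax.
function escapeText(value: string): string {
  return value.replace(/[\\`*_{}\[\]<>]/g, (ch) => `\\${ch}`);
}

// Picks a backtick delimiter longer than any backtick run inside the code span.
function emitInlineCode(value: string): string {
  const longestRun = Math.max(0, ...(value.match(/`+/g) ?? []).map((run) => run.length));
  const delimiter = "`".repeat(longestRun + 1);
  // Pad with spaces when the code itself starts or ends with a backtick.
  const pad = value.startsWith("`") || value.endsWith("`") ? " " : "";
  return `${delimiter}${pad}${value}${pad}${delimiter}`;
}

function emitInlineContent(nodes: InlineContent[]): string {
  return nodes
    .map((node): string => {
      switch (node.type) {
        case "text":
          return escapeText(node.value);
        case "inlineCode":
          return emitInlineCode(node.value);
        case "link": {
          // Assumption: href and route were validated by an earlier pass.
          const href = node.target.kind === "url" ? node.target.href : node.target.route;
          return `[${escapeText(node.text)}](${href})`;
        }
        case "strong":
          return `**${emitInlineContent(node.children)}**`;
        case "emphasis":
          return `*${emitInlineContent(node.children)}*`;
      }
    })
    .join("");
}
```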
8.2 Code block
export type CodeBlock = BlockBase & {
type: "code";
language: string;
code: string;
title?: string;
executable?: boolean;
expectedOutput?: string;
redacted?: boolean;
};
Code block should not be arbitrary.
It needs metadata because later we will verify snippets.
Example:
{
"id": "install-command",
"type": "code",
"language": "bash",
"code": "npm install -g docforge",
"title": "Install globally",
"executable": false,
"sourceRefs": [
{
"refId": "package-name",
"artifactId": "artifact-package-json",
"path": "package.json",
"jsonPointer": "/name",
"authority": "primary"
}
]
}
8.3 Callout block
export type CalloutBlock = BlockBase & {
type: "callout";
variant: "info" | "tip" | "warning" | "danger";
title?: string;
children: ContentBlock[];
};
Do not encode callout as raw HTML.
Semantic callouts can be rendered differently per theme.
8.4 Steps block
export type StepsBlock = BlockBase & {
type: "steps";
steps: StepItem[];
};
export type StepItem = {
id: string;
title: string;
blocks: ContentBlock[];
};
This is useful for quickstarts and tutorials.
8.5 Tabs block
export type TabsBlock = BlockBase & {
type: "tabs";
tabs: TabItem[];
};
export type TabItem = {
id: string;
label: string;
blocks: ContentBlock[];
};
Tabs are not just UI. They represent alternate paths:
- npm vs pnpm vs yarn,
- curl vs JavaScript vs Python,
- Docker vs local install,
- cloud provider A vs B.
8.6 API operation block
Do not copy OpenAPI details into prose too early.
Reference the operation.
export type ApiOperationBlock = BlockBase & {
type: "apiOperation";
operationRef: {
specArtifactId: string;
operationId?: string;
method: string;
path: string;
};
display: "summary" | "full" | "requestOnly" | "responseOnly";
};
This allows the API renderer to stay consistent with OpenAPI source.
8.7 Symbol reference block
export type SymbolReferenceBlock = BlockBase & {
type: "symbolReference";
symbolId: string;
display: "signature" | "summary" | "full";
};
This is useful for SDK docs and CLI docs.
8.8 Mermaid block
export type MermaidBlock = BlockBase & {
type: "mermaid";
diagramType: "flowchart" | "sequence" | "state" | "class" | "er" | "unknown";
code: string;
title?: string;
};
Diagrams should be validated enough to avoid broken build output.
8.9 Raw MDX block
Sometimes you need an escape hatch.
export type RawMdxBlock = BlockBase & {
type: "rawMdx";
code: string;
trust: "generatedSafe" | "userAuthored" | "unsafe";
};
Rule:
- user-authored existing MDX can be preserved,
- generated raw MDX must pass strict validation,
- unsafe raw MDX cannot be emitted.
Raw blocks are necessary, but they should be rare.
9. Zod schemas
Runtime validation is mandatory because AI-generated IR is external input from the system's point of view.
import { z } from "zod";
const SourceRangeSchema = z.object({
startLine: z.number().int().positive(),
endLine: z.number().int().positive(),
startColumn: z.number().int().positive().optional(),
endColumn: z.number().int().positive().optional(),
});
const SourceRefSchema = z.object({
refId: z.string().min(1),
artifactId: z.string().min(1),
path: z.string().min(1),
range: SourceRangeSchema.optional(),
symbolId: z.string().optional(),
openApiPointer: z.string().optional(),
jsonPointer: z.string().optional(),
authority: z.enum(["primary", "secondary", "derived", "generated", "untrusted", "unknown"]),
});
const BlockBaseSchema = z.object({
id: z.string().min(1),
sourceRefs: z.array(SourceRefSchema).optional(),
});
const CodeBlockSchema = BlockBaseSchema.extend({
type: z.literal("code"),
language: z.string().min(1),
code: z.string(),
title: z.string().optional(),
executable: z.boolean().optional(),
expectedOutput: z.string().optional(),
redacted: z.boolean().optional(),
});
Recursive schemas need lazy evaluation.
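// ContentBlock here is the handwritten union type from section 8. Deriving it with
// z.infer from ContentBlockSchema would make the type reference itself through the annotation below.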
const ContentBlockSchema: z.ZodType<ContentBlock> = z.lazy(() =>
z.discriminatedUnion("type", [
ParagraphBlockSchema,
CodeBlockSchema,
CalloutBlockSchema,
StepsBlockSchema,
TabsBlockSchema,
MermaidBlockSchema,
ApiOperationBlockSchema,
SymbolReferenceBlockSchema,
RawMdxBlockSchema,
]),
);
Use discriminated unions. They make validation and rendering simpler.
10. AI should output IR, not final MDX
The writer agent should produce structured JSON matching schema.
Bad prompt:
Write an MDX quickstart for this project.
Better contract:
Create a Content IR draft for a quickstart page.
Return only JSON matching the PageIr schema.
Every technical claim must reference one of the provided sourceRefs.
Do not output raw MDX unless the block type requires it.
Use code blocks for commands.
Use steps for ordered setup.
Then validate:
const parsed = JSON.parse(modelOutput);
const page = PageIrSchema.parse(parsed);
const diagnostics = validatePageIr(page, projectKnowledge);
if (diagnostics.some(d => d.severity === "error")) {
throw new InvalidGeneratedIrError(diagnostics);
}
This gives you a repair loop.
Without IR, repair is messy because you only have broken MDX text.
With IR, you can tell the model exactly what field is wrong.
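As an illustration of that repair loop, validation issues can be turned into a short, field-level message for the next model call. This is only a sketch: `ValidationIssue` is an assumed shape (a path plus a message, similar to what schema validators commonly report), and the prompt wording is a placeholder.

```typescript
// Assumed issue shape: a path into the IR plus a human-readable message.
type ValidationIssue = { path: Array<string | number>; message: string };

// Renders ["sections", 0, "level"] as "$.sections[0].level".
function formatIssuePath(path: Array<string | number>): string {
  return path.reduce<string>(
    (acc, part) => (typeof part === "number" ? `${acc}[${part}]` : `${acc}.${part}`),
    "$",
  );
}

// Builds a repair instruction that names each broken field explicitly.
function buildRepairPrompt(issues: ValidationIssue[], maxIssues = 10): string {
  const lines = issues
    .slice(0, maxIssues) // cap the list so the repair prompt stays small
    .map((issue) => `- ${formatIssuePath(issue.path)}: ${issue.message}`);
  return [
    "The previous Content IR draft failed validation.",
    "Fix only the fields listed below and return the full JSON again.",
    ...lines,
  ].join("\n");
}
```

Capping the issue count is a design choice made here: a model tends to repair a short, precise list more reliably than a long dump of every error.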
11. Validation passes
IR validation should happen before MDX emission.
11.1 Schema validation
Checks object shape.
Examples:
- missing title,
- invalid block type,
- invalid enum value,
- wrong field type.
11.2 Structural validation
Checks document structure.
export function validateSectionStructure(page: PageIr): Diagnostic[] {
const diagnostics: Diagnostic[] = [];
const ids = new Set<string>();
for (const section of walkSections(page.sections)) {
if (ids.has(section.id)) {
diagnostics.push({
code: "DOCFORGE_IR_DUPLICATE_SECTION_ID",
severity: "error",
message: `Duplicate section id: ${section.id}`,
});
}
ids.add(section.id);
if (section.level < 2 || section.level > 4) {
diagnostics.push({
code: "DOCFORGE_IR_INVALID_SECTION_LEVEL",
severity: "error",
message: `Section ${section.id} has invalid level ${section.level}`,
});
}
}
return diagnostics;
}
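The function above relies on a walkSections helper that is not shown. One plausible shape is a depth-first generator, sketched below on a reduced `SectionNode` type. The extra `findLevelSkips` check is an assumption added for illustration: it treats a child that is not exactly one level below its parent as a structural problem.

```typescript
// Reduced stand-in for SectionIr: only the fields this sketch needs.
type SectionNode = { id: string; level: number; children?: SectionNode[] };

// Depth-first traversal in document order.
function* walkSections(sections: SectionNode[]): Generator<SectionNode> {
  for (const section of sections) {
    yield section;
    yield* walkSections(section.children ?? []);
  }
}

// Hypothetical extra check: top-level sections sit at level 2, and each child
// sits exactly one level below its parent.
function findLevelSkips(sections: SectionNode[], parentLevel = 1): string[] {
  const problems: string[] = [];
  for (const section of sections) {
    if (section.level !== parentLevel + 1) {
      problems.push(`${section.id}: expected level ${parentLevel + 1}, got ${section.level}`);
    }
    problems.push(...findLevelSkips(section.children ?? [], section.level));
  }
  return problems;
}
```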
11.3 Reference validation
Checks whether referenced source exists.
export function validateSourceRefs(page: PageIr, sourceIndex: SourceIndex): Diagnostic[] {
const diagnostics: Diagnostic[] = [];
for (const block of walkBlocks(page)) {
for (const ref of block.sourceRefs ?? []) {
if (!sourceIndex.hasArtifact(ref.artifactId)) {
diagnostics.push({
code: "DOCFORGE_IR_UNKNOWN_SOURCE_ARTIFACT",
severity: "error",
message: `Block ${block.id} references unknown artifact ${ref.artifactId}`,
});
}
if (ref.path && sourceIndex.isBlocked(ref.path)) {
diagnostics.push({
code: "DOCFORGE_IR_BLOCKED_SOURCE_REFERENCE",
severity: "error",
message: `Block ${block.id} references blocked source ${ref.path}`,
});
}
}
}
return diagnostics;
}
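The walkBlocks helper used above has to descend into container blocks, otherwise a blocked source ref hidden inside a callout or a tab would slip through. A minimal sketch on simplified block shapes (the `Block` and `Section` types here are reduced stand-ins for the full IR types):

```typescript
type Block =
  | { id: string; type: "paragraph" | "code" }
  | { id: string; type: "callout"; children: Block[] }
  | { id: string; type: "steps"; steps: { id: string; blocks: Block[] }[] }
  | { id: string; type: "tabs"; tabs: { id: string; blocks: Block[] }[] };

type Section = { blocks: Block[]; children?: Section[] };

// Yields every block in a list, including blocks nested in containers.
function* walkBlockList(blocks: Block[]): Generator<Block> {
  for (const block of blocks) {
    yield block;
    if (block.type === "callout") yield* walkBlockList(block.children);
    if (block.type === "steps") {
      for (const step of block.steps) yield* walkBlockList(step.blocks);
    }
    if (block.type === "tabs") {
      for (const tab of block.tabs) yield* walkBlockList(tab.blocks);
    }
  }
}

// Yields every block of a page in document order, section by section.
function* walkBlocks(page: { sections: Section[] }): Generator<Block> {
  for (const section of page.sections) {
    yield* walkBlockList(section.blocks);
    yield* walkBlocks({ sections: section.children ?? [] });
  }
}
```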
11.4 Link validation
Internal routes should exist. Heading anchors should exist. External URLs can be checked later.
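A sketch of the internal part of that check, assuming the emitter can supply a map from route to the heading IDs on that page. `HeadingIndex`, `InternalLink`, and `validateInternalLinks` are hypothetical names for this example.

```typescript
// Assumed index shape: route -> set of heading ids present on that page.
type HeadingIndex = Map<string, Set<string>>;
type InternalLink = { fromBlockId: string; route: string; headingId?: string };

function validateInternalLinks(links: InternalLink[], index: HeadingIndex): string[] {
  const errors: string[] = [];
  for (const link of links) {
    const headings = index.get(link.route);
    if (!headings) {
      errors.push(`${link.fromBlockId}: unknown route ${link.route}`);
    } else if (link.headingId !== undefined && !headings.has(link.headingId)) {
      errors.push(`${link.fromBlockId}: unknown heading ${link.route}#${link.headingId}`);
    }
  }
  return errors;
}
```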
11.5 Render validation
Before writing files, emit MDX into memory and compile it.
Content IR validity does not guarantee MDX compiler validity. The emitter can still have bugs.
12. Transform passes
After validation, run transforms.
12.1 Normalize IDs
Make IDs deterministic.
function slugifyId(input: string): string {
return input
.toLowerCase()
.trim()
.replace(/[^a-z0-9]+/g, "-")
.replace(/^-+|-+$/g, "");
}
For generated blocks, include stable context:
function stableBlockId(pageId: string, sectionId: string, index: number, type: string): string {
return `${pageId}.${sectionId}.${index}.${type}`;
}
12.2 Resolve links
Convert semantic links into routes.
function emitLinkTarget(target: LinkTarget, routeIndex: RouteIndex): string {
switch (target.kind) {
case "url":
return target.href;
case "route":
return routeIndex.resolveRoute(target.route);
case "heading":
return `${routeIndex.resolvePage(target.pageId)}#${target.headingId}`;
case "source":
return `#source-${target.sourceRef.refId}`;
}
}
12.3 Prune empty blocks
AI may produce empty paragraphs. Remove them before emission.
But do not silently remove a section if it was required by acceptance criteria. Produce diagnostic.
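The pruning rule could look like the sketch below. It uses the simplified v1 model where a paragraph stores plain text, and it makes one assumption beyond the text above: optional sections that end up empty are dropped, while required ones are kept and reported.

```typescript
type Block = { id: string; type: string; text?: string; code?: string };
type Section = { id: string; blocks: Block[] };
type PruneResult = { sections: Section[]; diagnostics: string[] };

// A paragraph or code block with only whitespace counts as empty.
function isEmptyBlock(block: Block): boolean {
  if (block.type === "paragraph") return (block.text ?? "").trim() === "";
  if (block.type === "code") return (block.code ?? "").trim() === "";
  return false;
}

function pruneEmptyBlocks(sections: Section[], requiredSectionIds: Set<string>): PruneResult {
  const diagnostics: string[] = [];
  const kept: Section[] = [];
  for (const section of sections) {
    const blocks = section.blocks.filter((block) => !isEmptyBlock(block));
    if (blocks.length > 0) {
      kept.push({ ...section, blocks });
    } else if (requiredSectionIds.has(section.id)) {
      // Required by acceptance criteria: keep the section and surface the gap.
      kept.push({ ...section, blocks });
      diagnostics.push(`Required section ${section.id} has no content after pruning`);
    }
    // Assumption: optional sections that end up empty are dropped.
  }
  return { sections: kept, diagnostics };
}
```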
13. Deterministic MDX emission
Emitter converts IR to MDX.
export interface MdxEmitter {
emitPage(page: PageIr): EmitResult;
}
export type EmitResult = {
path: string;
content: string;
diagnostics: Diagnostic[];
};
Simple emitter skeleton:
export function emitPageToMdx(page: PageIr): string {
const chunks: string[] = [];
chunks.push(emitFrontmatter(page.frontmatter));
chunks.push("");
chunks.push(`# ${escapeMarkdown(page.title)}`);
chunks.push("");
for (const section of page.sections) {
chunks.push(emitSection(section));
chunks.push("");
}
// No global newline collapsing here: it would also rewrite consecutive blank lines inside code blocks.
return chunks.join("\n").trimEnd() + "\n";
}
function emitSection(section: SectionIr): string {
const chunks: string[] = [];
chunks.push(`${"#".repeat(section.level)} ${escapeMarkdown(section.title)}`);
chunks.push("");
for (const block of section.blocks) {
chunks.push(emitBlock(block));
chunks.push("");
}
for (const child of section.children ?? []) {
chunks.push(emitSection(child));
chunks.push("");
}
return chunks.join("\n").trimEnd();
}
Block emission:
function emitBlock(block: ContentBlock): string {
switch (block.type) {
case "paragraph":
return emitInlineContent(block.text);
case "code":
return emitCodeBlock(block);
case "callout":
return emitCallout(block);
case "steps":
return emitSteps(block);
case "tabs":
return emitTabs(block);
case "mermaid":
return emitMermaid(block);
case "apiOperation":
return emitApiOperation(block);
case "symbolReference":
return emitSymbolReference(block);
case "rawMdx":
return emitRawMdx(block);
default:
return assertNever(block);
}
}
Code block emission:
function emitCodeBlock(block: CodeBlock): string {
const title = block.title ? ` title="${escapeAttribute(block.title)}"` : "";
return `\`\`\`${block.language}${title}\n${block.code}\n\`\`\``;
}
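One edge case the emitter above does not cover: if the code itself contains a run of three or more backticks (for example, a snippet that shows Markdown), a three-backtick fence closes early. A possible fix, sketched here with the hypothetical helpers `pickFence` and `emitFencedCode`, is to choose a fence longer than the longest backtick run in the code.

```typescript
// Returns a fence at least three backticks long and longer than any run in the code.
function pickFence(code: string): string {
  const runs = code.match(/`{3,}/g) ?? [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 2);
  return "`".repeat(longest + 1);
}

// Emits a fenced code block whose fence cannot be closed by the code's own content.
function emitFencedCode(language: string, code: string): string {
  const fence = pickFence(code);
  return `${fence}${language}\n${code}\n${fence}`;
}
```

This works because a closing fence must be at least as long as the opening one, so shorter runs inside the block are treated as content.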
Callout emission:
function emitCallout(block: CalloutBlock): string {
const component = {
info: "Info",
tip: "Tip",
warning: "Warning",
danger: "Danger",
}[block.variant];
const titleProp = block.title ? ` title="${escapeAttribute(block.title)}"` : "";
const body = block.children.map(emitBlock).join("\n\n");
return `<${component}${titleProp}>\n\n${body}\n\n</${component}>`;
}
Mermaid emission:
function emitMermaid(block: MermaidBlock): string {
return `\`\`\`mermaid id="${escapeAttribute(block.id)}"\n${block.code}\n\`\`\``;
}
API operation emission could map to a component:
function emitApiOperation(block: ApiOperationBlock): string {
const method = escapeAttribute(block.operationRef.method.toUpperCase());
const path = escapeAttribute(block.operationRef.path);
const operationId = block.operationRef.operationId
? ` operationId="${escapeAttribute(block.operationRef.operationId)}"`
: "";
return `<ApiOperation method="${method}" path="${path}"${operationId} display="${block.display}" />`;
}
This keeps generated API pages consistent.
14. Example: Quickstart Page IR
Imagine extracted facts:
{
"packageName": "docforge",
"packageManager": "pnpm",
"scripts": {
"dev": "tsx src/cli.ts dev",
"build": "tsup src/index.ts"
},
"bin": {
"docforge": "dist/cli.js"
}
}
Page IR:
{
"id": "quickstart",
"slug": "quickstart",
"route": "/quickstart",
"title": "Quickstart",
"description": "Install the CLI, initialize documentation, and run the local preview server.",
"pageType": "quickstart",
"frontmatter": {
"title": "Quickstart",
"description": "Install the CLI, initialize documentation, and run the local preview server.",
"tags": ["quickstart", "cli"],
"order": 1,
"generated": true
},
"nav": {
"group": "Getting Started",
"order": 1
},
"status": "drafted",
"sourceRefs": [
{
"refId": "package-json",
"artifactId": "artifact-package-json",
"path": "package.json",
"authority": "primary"
}
],
"sections": [
{
"id": "install",
"title": "Install the CLI",
"level": 2,
"blocks": [
{
"id": "install.command",
"type": "code",
"language": "bash",
"code": "pnpm add -D docforge",
"title": "Install Docforge",
"sourceRefs": [
{
"refId": "package-name",
"artifactId": "artifact-package-json",
"path": "package.json",
"jsonPointer": "/name",
"authority": "primary"
}
]
}
]
},
{
"id": "initialize-docs",
"title": "Initialize docs",
"level": 2,
"blocks": [
{
"id": "init.command",
"type": "code",
"language": "bash",
"code": "pnpm docforge init",
"title": "Create docs config"
}
]
}
],
"diagnostics": []
}
Emitted MDX:
---
title: Quickstart
description: Install the CLI, initialize documentation, and run the local preview server.
tags:
- quickstart
- cli
order: 1
generated: true
---
# Quickstart
## Install the CLI
```bash title="Install Docforge"
pnpm add -D docforge
```
## Initialize docs
```bash title="Create docs config"
pnpm docforge init
```
The emitted result is simple. But internally we kept structure, provenance, and validation hooks.
15. Existing docs import
Your generator will not only create new docs. It must also read existing docs.
Existing Markdown/MDX should be parsed into IR as much as possible.
Important policy:
User-authored docs should be preserved unless the user explicitly asks for rewrite or update.
Generated docs tools become annoying when they overwrite human-authored content too aggressively.
For existing MDX, unsupported constructs can become RawMdxBlock with trust: "userAuthored".
const block: RawMdxBlock = {
id: "raw-user-mdx-1",
type: "rawMdx",
code: originalMdxFragment,
trust: "userAuthored",
sourceRefs: [sourceRefForOriginalRange],
};
This preserves content without pretending we understand every custom component.
16. Page planning with IR
Page planner should output PagePlan, not final content.
Example page plan for a docs CLI:
{
"pageId": "configuration",
"slug": "configuration",
"title": "Configuration",
"pageType": "reference",
"purpose": "Explain every supported docs configuration field with examples and validation rules.",
"audience": "maintainer",
"sourceRefs": [
{
"refId": "config-schema",
"artifactId": "artifact-config-schema",
"path": "packages/config/schema.ts",
"authority": "primary"
}
],
"sections": [
{
"id": "overview",
"title": "Overview",
"requiredFacts": ["config file name", "config lookup order"]
},
{
"id": "fields",
"title": "Fields",
"requiredFacts": ["field names", "types", "defaults", "examples"]
},
{
"id": "validation-errors",
"title": "Validation errors",
"requiredFacts": ["diagnostic codes", "fix hints"]
}
],
"acceptanceCriteria": [
"Every config field has type, default, and example.",
"Every technical fact cites config schema source.",
"No deprecated field is recommended as default."
]
}
Then writer converts plan + facts into Content IR.
This separation is important:
- planner decides structure,
- writer fills content,
- reviewer validates facts,
- emitter handles MDX.
17. Handling claims
A high-quality docs generator must distinguish between prose and claim.
Simple prose:
This guide walks through local setup.
Technical claim:
Run `pnpm dev` to start the local preview server.
The second should have source refs.
You can model claims explicitly.
export type Claim = {
claimId: string;
text: string;
sourceRefs: SourceRef[];
confidence: number;
verified: boolean;
};
Then attach claims to blocks.
export type ClaimAwareBlockBase = BlockBase & {
claims?: Claim[];
};
In v1, you can skip explicit claims and rely on block sourceRefs. But for advanced quality gates, explicit claims become valuable.
Future reviewer agent can check:
- Does every command exist?
- Does every endpoint exist?
- Does every config field exist?
- Does every version claim match manifests?
- Does every file path exist?
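The first of those checks can be sketched for package scripts. This example only covers script invocations such as pnpm dev or npm run build. Built-in subcommands like add or install would need a separate allowlist, which is left out here. The names `scriptNameFromCommand` and `commandExists` are illustrative.

```typescript
type Manifest = { scripts?: Record<string, string> };

// npm only runs these scripts without the "run" keyword.
const NPM_SHORTCUTS = new Set(["test", "start", "stop", "restart"]);

// Extracts the script name from commands like "pnpm dev" or "npm run build".
// Returns null when the command is not a recognizable script invocation.
function scriptNameFromCommand(command: string): string | null {
  const [tool, first, second] = command.trim().split(/\s+/);
  if (tool !== "npm" && tool !== "pnpm" && tool !== "yarn") return null;
  if (first === undefined) return null;
  if (first === "run") return second ?? null;
  if (tool === "npm" && !NPM_SHORTCUTS.has(first)) return null;
  return first;
}

// True only when the command maps to a script that the manifest actually defines.
function commandExists(command: string, manifest: Manifest): boolean {
  const name = scriptNameFromCommand(command);
  return name !== null && Object.prototype.hasOwnProperty.call(manifest.scripts ?? {}, name);
}
```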
18. IR diffs
String diffs are noisy.
IR diffs can be semantic.
Example:
{
"type": "block.updated",
"pageId": "quickstart",
"blockId": "install.command",
"field": "code",
"before": "npm install -g docforge",
"after": "pnpm add -D docforge",
"reason": "package manager changed from npm to pnpm based on pnpm-lock.yaml"
}
This is useful for PR automation.
A PR comment can say:
Updated Quickstart install command because the repository now uses pnpm-lock.yaml and package scripts are pnpm-based.
That is much better than dumping a huge MDX diff and expecting reviewers to infer intent.
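A semantic diff like the one above depends on stable block IDs. The sketch below compares two versions of a page by block ID and only looks at the code field. The `CodeBlockLike` type and the added and removed event names are assumptions for illustration; only block.updated appears in the example above.

```typescript
type CodeBlockLike = { id: string; code: string };

type BlockChange =
  | { type: "block.added"; blockId: string }
  | { type: "block.removed"; blockId: string }
  | { type: "block.updated"; blockId: string; field: "code"; before: string; after: string };

function diffBlocks(before: CodeBlockLike[], after: CodeBlockLike[]): BlockChange[] {
  const previous = new Map(before.map((block) => [block.id, block] as const));
  const changes: BlockChange[] = [];
  for (const block of after) {
    const old = previous.get(block.id);
    if (!old) {
      changes.push({ type: "block.added", blockId: block.id });
    } else if (old.code !== block.code) {
      changes.push({
        type: "block.updated",
        blockId: block.id,
        field: "code",
        before: old.code,
        after: block.code,
      });
    }
    previous.delete(block.id);
  }
  // Whatever is left in the map no longer exists in the new version.
  for (const id of previous.keys()) {
    changes.push({ type: "block.removed", blockId: id });
  }
  return changes;
}
```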
19. Storing IR
Suggested hidden output:
.docforge/
cache/
content-ir/
quickstart.json
api-reference.users.create.json
facts/
source-index.sqlite
Do not require users to commit .docforge/cache.
But you may allow committing generated metadata for reproducibility later.
For now:
- generated .mdx goes to the docs directory,
- internal IR goes to cache,
- build manifest records hashes.
export type IrCacheRecord = {
pageId: string;
inputHash: string;
irHash: string;
mdxHash: string;
generatedAt: string;
sourceArtifactIds: string[];
};
This allows incremental generation.
If source artifacts do not change, skip regeneration.
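The skip decision needs an input hash that does not change when object keys are merely reordered. A minimal sketch follows, assuming JSON-compatible facts. The FNV-1a function is a small non-cryptographic stand-in, so the example needs no imports. A real build would more likely use SHA-256. All names here are hypothetical.

```typescript
// Serializes with sorted object keys so logically equal inputs give identical text.
// Assumes JSON-compatible input (no functions, symbols, or cycles).
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  const record = value as Record<string, unknown>;
  const entries = Object.keys(record)
    .filter((key) => record[key] !== undefined)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${stableStringify(record[key])}`);
  return `{${entries.join(",")}}`;
}

// 32-bit FNV-1a over UTF-16 code units; a stand-in for a real content hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

type CacheRecord = { pageId: string; inputHash: string };

// Regenerate when there is no record or when the facts hash differs.
function needsRegeneration(record: CacheRecord | undefined, facts: unknown): boolean {
  return record === undefined || record.inputHash !== fnv1a(stableStringify(facts));
}
```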
20. Diagnostics in IR
Diagnostics should attach at multiple levels:
- bundle,
- page,
- section,
- block.
Example:
{
"id": "run-tests-command",
"type": "code",
"language": "bash",
"code": "npm test",
"diagnostics": [
{
"code": "DOCFORGE_IR_UNVERIFIED_COMMAND",
"severity": "warning",
"message": "Command was generated but no matching package script or source reference was found.",
"hint": "Reference package.json#/scripts/test or mark this command as manual."
}
]
}
This keeps generated content auditable.
Do not hide uncertainty.
Uncertainty should be visible to the pipeline and sometimes to the user.
21. Avoiding MDX injection
MDX can execute/import components during build depending on your renderer setup. Treat generated MDX as untrusted until validated.
IR helps because most blocks are semantic and escaped by emitter.
Risky path:
AI output -> raw MDX -> compile -> arbitrary component/import behavior
Safer path:
AI output JSON -> schema validation -> semantic validation -> escaping emitter -> MDX compile
Raw MDX rules:
- no import in generated raw blocks,
- no export in generated raw blocks,
- no unknown JSX components unless allowed by theme registry,
- no event handlers,
- no remote script injection,
- no raw HTML if security policy disables it.
function validateRawMdxBlock(block: RawMdxBlock, registry: ComponentRegistry): Diagnostic[] {
if (block.trust === "unsafe") {
return [{
code: "DOCFORGE_IR_UNSAFE_RAW_MDX",
severity: "error",
message: `Raw MDX block ${block.id} is marked unsafe and cannot be emitted.`,
}];
}
if (/^\s*import\s+/m.test(block.code) && block.trust !== "userAuthored") {
return [{
code: "DOCFORGE_IR_GENERATED_MDX_IMPORT",
severity: "error",
message: `Generated raw MDX block ${block.id} contains import statements.`,
}];
}
return [];
}
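The validator above only covers the unsafe flag and import statements. The remaining rules can be approximated with pattern checks, as in the sketch below. It is a heuristic under stated assumptions: regular expressions can be fooled, and a stricter implementation would parse the MDX and inspect its syntax tree. The function name, the diagnostic codes, and the `allowedComponents` parameter are hypothetical.

```typescript
type RawCheck = { code: string; message: string };

// Heuristic checks for generated raw MDX; returns one entry per violated rule.
function checkGeneratedRawMdx(source: string, allowedComponents: Set<string>): RawCheck[] {
  const problems: RawCheck[] = [];
  // ESM statements start a line in MDX.
  if (/^\s*(import|export)\s/m.test(source)) {
    problems.push({ code: "ESM_STATEMENT", message: "import/export is not allowed" });
  }
  // Attributes such as onClick= or onload= inside a tag.
  if (/<[^>]*\son[A-Za-z]+\s*=/.test(source)) {
    problems.push({ code: "EVENT_HANDLER", message: "event handler attributes are not allowed" });
  }
  if (/<script\b/i.test(source)) {
    problems.push({ code: "SCRIPT_TAG", message: "script tags are not allowed" });
  }
  // JSX component names start with an uppercase letter.
  for (const match of source.matchAll(/<([A-Z][A-Za-z0-9.]*)/g)) {
    const name = match[1];
    if (!allowedComponents.has(name)) {
      problems.push({ code: "UNKNOWN_COMPONENT", message: `component ${name} is not registered` });
    }
  }
  return problems;
}
```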
Security starts here, not only in deployment.
22. Build checkpoint
At the end of this part, your implementation should have:
- ContentIrBundle,
- PageIr,
- SectionIr,
- ContentBlock union,
- SourceRef,
- FrontmatterIr,
- Zod schemas,
- page validation passes,
- source reference validation,
- deterministic MDX emitter,
- basic IR cache record,
- CLI debug command.
Suggested CLI command:
docforge ir --page quickstart --json
Suggested command to emit without writing:
docforge emit --page quickstart --dry-run
Suggested validation:
docforge validate-ir .docforge/cache/content-ir/quickstart.json
23. Tests
Snapshot-test the emitter.
it("emits stable MDX for quickstart page", () => {
const page = makeQuickstartPageIr();
const mdx = emitPageToMdx(page);
expect(mdx).toMatchInlineSnapshot(`
"---
title: Quickstart
description: Install the CLI, initialize documentation, and run the local preview server.
tags:
- quickstart
- cli
order: 1
generated: true
---
# Quickstart
## Install the CLI
\`\`\`bash title=\"Install Docforge\"
pnpm add -D docforge
\`\`\`
"
`);
});
Test validation:
it("rejects source refs to blocked artifacts", () => {
const page = makePageWithSourceRef({ path: ".env", artifactId: "artifact-env" });
const sourceIndex = makeSourceIndex([{ artifactId: "artifact-env", path: ".env", blocked: true }]);
const diagnostics = validateSourceRefs(page, sourceIndex);
expect(diagnostics).toEqual(expect.arrayContaining([
expect.objectContaining({ code: "DOCFORGE_IR_BLOCKED_SOURCE_REFERENCE" }),
]));
});
Test AI output validation:
it("rejects AI output with unknown block type", () => {
const modelOutput = {
id: "bad-page",
sections: [
{
id: "x",
title: "X",
level: 2,
blocks: [{ type: "magic", content: "not allowed" }],
},
],
};
expect(() => PageIrSchema.parse(modelOutput)).toThrow();
});
These tests make the system harder to corrupt.
24. Key takeaways
- Do not make AI generate final MDX directly as the primary architecture.
- MDX is a target format; Content IR is the internal contract.
- Content IR gives you validation, provenance, deterministic emission, diffing, and better repair loops.
- Every technical claim should eventually connect to source evidence.
- Raw MDX is an escape hatch, not the default block type.
- IR should be serializable and versioned.
- Separate fact extraction, page planning, content drafting, validation, and emission.
- A good IR makes later features easier: search, llms.txt, MCP retrieval, PR automation, quality gates, and documentation evaluation.
In the next part, we will go from Content IR to the MDX authoring model: frontmatter, components, admonitions, tabs, code blocks, imports, links, and the contract between generated content and user-editable docs.