Learn Mintlify Like Ai Docs Cli Part 022 Repository Knowledge Store
title: Build From Scratch: Mintlify-like AI-driven Documentation Generator CLI - Part 022
description: Designing a repository knowledge store for a documentation generator: SQLite schema, artifact/symbol/relation storage, semantic artifacts, provenance, embeddings metadata, cache invalidation, migrations, transactions, query APIs, and operational reliability.
series: learn-mintlify-like-ai-docs-cli
seriesTitle: Build From Scratch: Mintlify-like AI-driven Documentation Generator CLI
order: 22
partTitle: Repository Knowledge Store
tags:
- documentation
- ai
- cli
- sqlite
- knowledge-store
- codebase-indexing
- developer-tools
date: 2026-07-03
Part 022 — Repository Knowledge Store
We have already built many models:
- SourceArtifact,
- CodeSymbol,
- CodeRelation,
- SemanticArtifact,
- DocumentationMapping,
- ProvenanceRef,
- search metadata,
- discovery diagnostics,
- index snapshots.
Now the question is:
where is all this data stored?
If it lives only as in-memory objects, every run has to rebuild it from scratch, so the tool is slow and cannot work incrementally.
If it is stored as one large JSON file, it is easy to start with but hard to query, hard to update partially, and slow for large repos.
For a local CLI, a very sensible choice is a SQLite knowledge store.
SQLite gives us:
- single-file local database,
- transactions,
- indexing,
- relational queries,
- durability,
- portability,
- no server,
- a good fit for a local cache/index.
The knowledge store is the heart of an AI-driven documentation generator because it lets the system remember indexing results, perform impact analysis, build retrieval, and verify provenance without re-parsing the entire repo every time.
1. Mental model: the knowledge store is a local read/write model
The knowledge store is not the source of truth. The source code remains the source of truth.
The knowledge store is a derived local model.
Because it is derived, the store must:
- be deletable and rebuildable,
- have its freshness validated,
- have its schema migrated,
- support partial invalidation,
- never be treated as more correct than the source files.
2. Store responsibilities
The knowledge store holds:
| Data | Purpose |
|---|---|
| Artifacts | file metadata, hash, language, sensitivity |
| Parse runs | parser/query version per artifact |
| Symbols | declarations extracted from code |
| Relations | graph edges |
| Semantic artifacts | endpoint, CLI command, config field, example, test |
| Documentation mappings | docs page → source artifact/symbol/semantic artifact |
| Provenance | precise source references |
| Diagnostics | indexing/discovery/build issues |
| Embedding metadata | vector cache keys and target mapping |
| Snapshots | freshness and diffing |
| Cache metadata | invalidation and versioning |
| Coverage | derived reporting optional |
Do not store secrets or raw full source by default.
3. Store location
Default:
<project-root>/.docforge/index/docforge.sqlite
Other files:
<project-root>/.docforge/
index/
docforge.sqlite
docforge.sqlite-shm
docforge.sqlite-wal
cache/
...
traces/
...
Config:
{
"index": {
"storePath": ".docforge/index/docforge.sqlite"
}
}
Rules:
- Store path must be inside project root by default.
- Store should be ignored by git.
- Store should not be deployed.
- Store can be deleted safely.
- Store version must be explicit.
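The first rule can be checked before the store is opened. The sketch below is one possible check, assuming the configured storePath is a project-relative string as in the config above; staysInsideProjectRoot is a hypothetical helper name, and the check is purely lexical (it does not resolve symlinks).

```typescript
// Hypothetical check for the configured `index.storePath`: returns true when a
// project-relative path stays inside the project root.
export function staysInsideProjectRoot(storePath: string): boolean {
  const normalized = storePath.replace(/\\/g, "/");

  // Absolute POSIX paths and Windows drive paths are never project-relative.
  if (normalized.startsWith("/") || /^[A-Za-z]:/.test(normalized)) {
    return false;
  }

  let depth = 0;
  for (const segment of normalized.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      depth -= 1;
      if (depth < 0) return false; // climbed above the project root
    } else {
      depth += 1;
    }
  }
  // Depth 0 would mean the path points at the root directory itself.
  return depth > 0;
}
```

A caller could reject the config with a diagnostic when this returns false, unless the user has explicitly opted in to an external store location.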
4. Why SQLite instead of JSON
JSON is tempting:
.docforge/index.json
But problems appear:
- updating one artifact requires rewriting whole file,
- querying symbol relations requires loading everything,
- graph traversal becomes manual,
- concurrent read/write harder,
- large repo becomes slow,
- migrations are ad hoc.
SQLite gives:
- SELECT by path/hash/symbol,
- indexes,
- transactions,
- partial updates,
- consistency,
- inspectability with SQL,
- good local performance.
JSON can still be used for public build artifacts. Knowledge store is internal.
5. Store versioning
Need schema version.
CREATE TABLE IF NOT EXISTS schema_migrations (
version INTEGER PRIMARY KEY,
name TEXT NOT NULL,
applied_at TEXT NOT NULL
);
Migration files:
packages/knowledge-store/migrations/
001_initial.sql
002_semantic_artifacts.sql
003_embeddings.sql
Migration runner:
export async function migrateStore(db: Database): Promise<void> {
const applied = await getAppliedMigrationVersions(db);
const migrations = await loadMigrations();
for (const migration of migrations) {
if (applied.has(migration.version)) {
continue;
}
await db.transaction(async () => {
await db.exec(migration.sql);
await db.run(
"INSERT INTO schema_migrations(version, name, applied_at) VALUES (?, ?, ?)",
[migration.version, migration.name, new Date().toISOString()]
);
});
}
}
Never silently use incompatible schema.
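One way to enforce that rule is to compare the versions recorded in schema_migrations with the versions the running tool ships. The sketch below is an assumption about how that comparison might look; checkSchema and SchemaCheck are hypothetical names, and the inputs are plain version numbers rather than a real database handle.

```typescript
// Hypothetical guard, run after opening the store and before any query.
export type SchemaCheck =
  | { status: "upToDate" }
  | { status: "needsMigration"; pending: number[] }
  | { status: "incompatible"; unknown: number[] };

export function checkSchema(applied: number[], known: number[]): SchemaCheck {
  const knownSet = new Set(known);
  const appliedSet = new Set(applied);

  // Versions recorded in the store that this build has never seen: the store
  // was written by a newer tool, so reading it would be unsafe.
  const unknown = applied.filter((v) => !knownSet.has(v)).sort((a, b) => a - b);
  if (unknown.length > 0) return { status: "incompatible", unknown };

  // Versions shipped with this build that the store has not applied yet.
  const pending = known.filter((v) => !appliedSet.has(v)).sort((a, b) => a - b);
  if (pending.length > 0) return { status: "needsMigration", pending };

  return { status: "upToDate" };
}
```

With this shape, "needsMigration" leads into the migration runner, while "incompatible" stops with a diagnostic that suggests upgrading the tool or rebuilding the store.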
6. Core schema overview
We will design relational tables plus JSON metadata fields where appropriate.
Use relational columns for query-heavy fields. Use JSON for optional metadata.
7. Artifacts table
CREATE TABLE artifacts (
id TEXT PRIMARY KEY,
path TEXT NOT NULL UNIQUE,
kind TEXT NOT NULL,
language TEXT,
hash TEXT NOT NULL,
size_bytes INTEGER NOT NULL,
generated INTEGER NOT NULL DEFAULT 0,
vendored INTEGER NOT NULL DEFAULT 0,
binary INTEGER NOT NULL DEFAULT 0,
sensitive TEXT NOT NULL DEFAULT 'public',
last_seen_at TEXT NOT NULL,
metadata_json TEXT
);
CREATE INDEX idx_artifacts_kind ON artifacts(kind);
CREATE INDEX idx_artifacts_hash ON artifacts(hash);
CREATE INDEX idx_artifacts_language ON artifacts(language);
CREATE INDEX idx_artifacts_sensitive ON artifacts(sensitive);
Notes:
- path is a project-relative, normalized POSIX path.
- hash is the content hash.
- last_seen_at helps detect deleted files.
- metadata_json can store detector details.
Artifact ID:
artifact:${sha256(path)}
or any other stable hash of the normalized path, since artifact identity is initially path-based.
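A minimal sketch of that ID scheme, assuming Node's built-in crypto module and a normalization step that produces forward slashes with no leading "./". The names artifactIdForPath and normalizePath are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumed normalization: forward slashes, no leading "./".
function normalizePath(pathValue: string): string {
  return pathValue.replace(/\\/g, "/").replace(/^\.\/+/, "");
}

// Derives a stable artifact ID from the project-relative path.
export function artifactIdForPath(pathValue: string): string {
  const digest = createHash("sha256")
    .update(normalizePath(pathValue), "utf8")
    .digest("hex");
  return `artifact:${digest}`;
}
```

Because the ID depends only on the path, editing a file keeps its ID while renaming or moving it produces a new artifact.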
8. Parse runs table
Tracks parser/extractor version per artifact.
CREATE TABLE parse_runs (
artifact_id TEXT NOT NULL,
artifact_hash TEXT NOT NULL,
language TEXT NOT NULL,
parser_name TEXT NOT NULL,
parser_version TEXT NOT NULL,
query_version TEXT NOT NULL,
extractor_version TEXT NOT NULL,
ok INTEGER NOT NULL,
indexed_at TEXT NOT NULL,
diagnostics_count INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (
artifact_id,
artifact_hash,
parser_name,
parser_version,
query_version,
extractor_version
)
);
CREATE INDEX idx_parse_runs_artifact ON parse_runs(artifact_id);
Fresh parse exists if:
- artifact hash same,
- parser version same,
- query version same,
- extractor version same.
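The freshness conditions above can be expressed as a single comparison. This is a sketch with hypothetical names (ParseFingerprint, isParseFresh); treating a failed previous parse (ok = false) as not fresh is an added assumption, not something the list above requires.

```typescript
// Hypothetical shape: the identifying columns of a parse_runs row, and the
// values the current tool would use for the same artifact.
export type ParseFingerprint = {
  artifactHash: string;
  parserName: string;
  parserVersion: string;
  queryVersion: string;
  extractorVersion: string;
};

export function isParseFresh(
  stored: (ParseFingerprint & { ok: boolean }) | undefined,
  current: ParseFingerprint
): boolean {
  // Assumption: never parsed, or last parse failed, both mean "parse again".
  if (!stored || !stored.ok) return false;
  return (
    stored.artifactHash === current.artifactHash &&
    stored.parserName === current.parserName &&
    stored.parserVersion === current.parserVersion &&
    stored.queryVersion === current.queryVersion &&
    stored.extractorVersion === current.extractorVersion
  );
}
```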
9. Symbols table
CREATE TABLE symbols (
id TEXT PRIMARY KEY,
artifact_id TEXT NOT NULL,
language TEXT NOT NULL,
kind TEXT NOT NULL,
name TEXT NOT NULL,
qualified_name TEXT NOT NULL,
display_name TEXT NOT NULL,
visibility TEXT NOT NULL,
exported INTEGER NOT NULL DEFAULT 0,
parent_symbol_id TEXT,
signature TEXT,
doc_comment TEXT,
start_line INTEGER NOT NULL,
start_column INTEGER NOT NULL,
end_line INTEGER NOT NULL,
end_column INTEGER NOT NULL,
selection_start_line INTEGER,
selection_start_column INTEGER,
selection_end_line INTEGER,
selection_end_column INTEGER,
modifiers_json TEXT,
annotations_json TEXT,
parser_metadata_json TEXT,
FOREIGN KEY (artifact_id) REFERENCES artifacts(id) ON DELETE CASCADE
);
CREATE INDEX idx_symbols_artifact ON symbols(artifact_id);
CREATE INDEX idx_symbols_kind ON symbols(kind);
CREATE INDEX idx_symbols_name ON symbols(name);
CREATE INDEX idx_symbols_qualified_name ON symbols(qualified_name);
CREATE INDEX idx_symbols_exported ON symbols(exported);
CREATE INDEX idx_symbols_parent ON symbols(parent_symbol_id);
qualified_name may not be globally unique in all cases, so do not make it unique.
10. Semantic artifacts table
Different artifact types have different fields. Use common columns + JSON payload.
CREATE TABLE semantic_artifacts (
id TEXT PRIMARY KEY,
type TEXT NOT NULL,
source_kind TEXT NOT NULL,
title TEXT,
key TEXT,
source_artifact_id TEXT,
source_hash TEXT,
confidence TEXT NOT NULL,
payload_json TEXT NOT NULL,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
FOREIGN KEY (source_artifact_id) REFERENCES artifacts(id) ON DELETE SET NULL
);
CREATE INDEX idx_semantic_artifacts_type ON semantic_artifacts(type);
CREATE INDEX idx_semantic_artifacts_key ON semantic_artifacts(key);
CREATE INDEX idx_semantic_artifacts_source ON semantic_artifacts(source_artifact_id);
CREATE INDEX idx_semantic_artifacts_confidence ON semantic_artifacts(confidence);
Examples:
| Type | Key |
|---|---|
| apiEndpoint | POST /users |
| cliCommand | docforge build |
| configField | search.enabled |
| event | kafka:user.created:produced |
| example | examples/basic |
Payload example:
{
"method": "POST",
"path": "/users",
"operationId": "createUser",
"visibility": "public"
}
11. Graph nodes table
We can infer nodes from artifacts/symbols/semantic artifacts/doc pages, but generic graph node table helps traversal.
CREATE TABLE graph_nodes (
id TEXT NOT NULL,
type TEXT NOT NULL,
label TEXT NOT NULL,
source_table TEXT,
metadata_json TEXT,
PRIMARY KEY (type, id)
);
CREATE INDEX idx_graph_nodes_label ON graph_nodes(label);
Node ref:
type + id
Examples:
artifact | artifact:abc
symbol | symbol:def
semanticArtifact | api:POST:/users
docPage | docs:quickstart
external | npm:express
12. Graph edges table
CREATE TABLE graph_edges (
id TEXT PRIMARY KEY,
from_type TEXT NOT NULL,
from_id TEXT NOT NULL,
to_type TEXT NOT NULL,
to_id TEXT NOT NULL,
kind TEXT NOT NULL,
confidence TEXT NOT NULL,
source_artifact_id TEXT,
path TEXT,
start_line INTEGER,
start_column INTEGER,
end_line INTEGER,
end_column INTEGER,
metadata_json TEXT
);
CREATE INDEX idx_graph_edges_from ON graph_edges(from_type, from_id);
CREATE INDEX idx_graph_edges_to ON graph_edges(to_type, to_id);
CREATE INDEX idx_graph_edges_kind ON graph_edges(kind);
CREATE INDEX idx_graph_edges_source_artifact ON graph_edges(source_artifact_id);
CREATE INDEX idx_graph_edges_from_kind ON graph_edges(from_type, from_id, kind);
CREATE INDEX idx_graph_edges_to_kind ON graph_edges(to_type, to_id, kind);
source_artifact_id makes invalidation possible.
When artifact changes:
DELETE FROM graph_edges WHERE source_artifact_id = ?;
13. Doc pages table
Docs pages are also indexed.
CREATE TABLE doc_pages (
id TEXT PRIMARY KEY,
source_path TEXT NOT NULL UNIQUE,
route TEXT NOT NULL,
title TEXT NOT NULL,
description TEXT,
kind TEXT NOT NULL,
generated INTEGER NOT NULL DEFAULT 0,
hidden INTEGER NOT NULL DEFAULT 0,
draft INTEGER NOT NULL DEFAULT 0,
content_hash TEXT,
metadata_json TEXT,
updated_at TEXT NOT NULL
);
CREATE INDEX idx_doc_pages_route ON doc_pages(route);
CREATE INDEX idx_doc_pages_kind ON doc_pages(kind);
CREATE INDEX idx_doc_pages_generated ON doc_pages(generated);
This table is updated from MDX compiler/page manifest.
14. Documentation mappings table
Maps docs pages to sources they document.
CREATE TABLE documentation_mappings (
id TEXT PRIMARY KEY,
page_id TEXT NOT NULL,
target_type TEXT NOT NULL,
target_id TEXT NOT NULL,
relation_kind TEXT NOT NULL DEFAULT 'documents',
confidence TEXT NOT NULL,
source_hash TEXT,
metadata_json TEXT,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
FOREIGN KEY (page_id) REFERENCES doc_pages(id) ON DELETE CASCADE
);
CREATE INDEX idx_doc_mappings_page ON documentation_mappings(page_id);
CREATE INDEX idx_doc_mappings_target ON documentation_mappings(target_type, target_id);
Example:
{
"page_id": "reference-cli-build",
"target_type": "semanticArtifact",
"target_id": "cli:docforge-build",
"confidence": "high"
}
Also create graph edge:
docPage --documents--> semanticArtifact
The table provides richer mapping metadata.
15. Provenance refs table
Provenance can be embedded JSON, but queryable provenance is useful.
CREATE TABLE provenance_refs (
id TEXT PRIMARY KEY,
owner_type TEXT NOT NULL,
owner_id TEXT NOT NULL,
artifact_id TEXT,
path TEXT NOT NULL,
selector TEXT,
kind TEXT NOT NULL,
hash TEXT,
start_line INTEGER,
start_column INTEGER,
end_line INTEGER,
end_column INTEGER,
metadata_json TEXT
);
CREATE INDEX idx_provenance_owner ON provenance_refs(owner_type, owner_id);
CREATE INDEX idx_provenance_artifact ON provenance_refs(artifact_id);
CREATE INDEX idx_provenance_path ON provenance_refs(path);
Owner examples:
- semantic artifact,
- doc mapping,
- generated block,
- diagnostic,
- search chunk.
16. Diagnostics table
Store diagnostics from indexing/discovery.
CREATE TABLE diagnostics (
id TEXT PRIMARY KEY,
run_id TEXT NOT NULL,
code TEXT NOT NULL,
severity TEXT NOT NULL,
category TEXT NOT NULL,
message TEXT NOT NULL,
path TEXT,
line INTEGER,
column INTEGER,
end_line INTEGER,
end_column INTEGER,
hint TEXT,
owner_type TEXT,
owner_id TEXT,
metadata_json TEXT,
created_at TEXT NOT NULL
);
CREATE INDEX idx_diagnostics_run ON diagnostics(run_id);
CREATE INDEX idx_diagnostics_code ON diagnostics(code);
CREATE INDEX idx_diagnostics_severity ON diagnostics(severity);
CREATE INDEX idx_diagnostics_path ON diagnostics(path);
Diagnostics from current run can replace previous run's diagnostics.
Alternative: store only latest diagnostics. But run history can help debugging. Keep bounded.
17. Index runs table
CREATE TABLE index_runs (
id TEXT PRIMARY KEY,
started_at TEXT NOT NULL,
ended_at TEXT,
status TEXT NOT NULL,
tool_version TEXT NOT NULL,
config_hash TEXT NOT NULL,
artifact_count INTEGER,
symbols_count INTEGER,
relations_count INTEGER,
semantic_artifacts_count INTEGER,
diagnostics_count INTEGER,
metadata_json TEXT
);
CREATE INDEX idx_index_runs_started_at ON index_runs(started_at);
CREATE INDEX idx_index_runs_status ON index_runs(status);
Status:
running
succeeded
failed
cancelled
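On startup, if a previous run is still marked running, it did not finish cleanly: mark it as interrupted. That requires adding interrupted to the status list above, or reusing failed instead.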
18. Embedding metadata table
Do not store vectors in SQLite initially unless needed. Store metadata and vector cache location.
CREATE TABLE embedding_records (
id TEXT PRIMARY KEY,
target_type TEXT NOT NULL,
target_id TEXT NOT NULL,
content_hash TEXT NOT NULL,
model TEXT NOT NULL,
provider TEXT NOT NULL,
dimensions INTEGER,
vector_store TEXT NOT NULL,
vector_ref TEXT NOT NULL,
created_at TEXT NOT NULL
);
CREATE INDEX idx_embeddings_target ON embedding_records(target_type, target_id);
CREATE INDEX idx_embeddings_model ON embedding_records(provider, model);
CREATE INDEX idx_embeddings_content_hash ON embedding_records(content_hash);
Vector storage options:
- separate binary file,
- SQLite blob,
- external local vector DB,
- provider-side storage (usually not appropriate for a local CLI).
For early implementation, skip vectors or store small JSON vector files.
19. Search chunks table
Search build may use separate public artifacts, but store can cache chunks.
CREATE TABLE search_chunks (
id TEXT PRIMARY KEY,
page_id TEXT NOT NULL,
route TEXT NOT NULL,
anchor TEXT,
title TEXT NOT NULL,
section_title TEXT,
text_hash TEXT NOT NULL,
text_preview TEXT,
weight REAL NOT NULL,
metadata_json TEXT,
updated_at TEXT NOT NULL
);
CREATE INDEX idx_search_chunks_page ON search_chunks(page_id);
CREATE INDEX idx_search_chunks_route ON search_chunks(route);
Full text can be large. Store preview and hash, or store text if needed for local retrieval. Avoid secrets.
20. Snapshot table
CREATE TABLE snapshots (
id TEXT PRIMARY KEY,
kind TEXT NOT NULL,
created_at TEXT NOT NULL,
config_hash TEXT NOT NULL,
tool_version TEXT NOT NULL,
data_json TEXT NOT NULL
);
CREATE INDEX idx_snapshots_kind_created ON snapshots(kind, created_at);
Snapshot kinds:
- artifactIndex,
- graph,
- publicSurface,
- docsMapping,
- search,
- coverage.
Keep latest N snapshots.
21. Store API design
Expose typed API, not raw SQL to the whole app.
export type KnowledgeStore = {
migrate(): Promise<void>;
close(): Promise<void>;
transaction<T>(fn: (tx: KnowledgeStoreTx) => Promise<T>): Promise<T>;
artifacts: ArtifactRepository;
symbols: SymbolRepository;
semanticArtifacts: SemanticArtifactRepository;
graph: GraphRepository;
docPages: DocPageRepository;
mappings: DocumentationMappingRepository;
provenance: ProvenanceRepository;
diagnostics: DiagnosticRepository;
runs: IndexRunRepository;
};
Transaction-scoped API:
export type KnowledgeStoreTx = Omit<KnowledgeStore, "transaction" | "close" | "migrate">;
Repositories are focused.
22. Artifact repository
export type ArtifactRepository = {
upsertMany(artifacts: SourceArtifact[]): Promise<void>;
getByPath(path: string): Promise<SourceArtifact | undefined>;
getById(id: ArtifactId): Promise<SourceArtifact | undefined>;
listByKind(kind: SourceArtifactKind): Promise<SourceArtifact[]>;
listChanged(current: SourceArtifact[]): Promise<ArtifactChangeSet>;
deleteMissing(currentPaths: string[]): Promise<SourceArtifact[]>;
};
listChanged compares hash.
export type ArtifactChangeSet = {
added: SourceArtifact[];
changed: SourceArtifact[];
unchanged: SourceArtifact[];
deleted: SourceArtifact[];
};
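The hash comparison behind listChanged can be sketched in memory. This version assumes artifacts are matched by path and only needs path and hash fields; computeChangeSet and the local types are hypothetical stand-ins for the repository implementation.

```typescript
type ArtifactLike = { path: string; hash: string };

type ChangeSet<T> = { added: T[]; changed: T[]; unchanged: T[]; deleted: T[] };

// Compares what the store remembers with what the scanner just found.
export function computeChangeSet<T extends ArtifactLike>(
  stored: T[],
  current: T[]
): ChangeSet<T> {
  const storedByPath = new Map(stored.map((a) => [a.path, a] as const));
  const currentPaths = new Set(current.map((a) => a.path));
  const result: ChangeSet<T> = { added: [], changed: [], unchanged: [], deleted: [] };

  for (const artifact of current) {
    const previous = storedByPath.get(artifact.path);
    if (!previous) result.added.push(artifact);
    else if (previous.hash !== artifact.hash) result.changed.push(artifact);
    else result.unchanged.push(artifact);
  }

  // Anything stored but no longer seen on disk counts as deleted.
  for (const artifact of stored) {
    if (!currentPaths.has(artifact.path)) result.deleted.push(artifact);
  }
  return result;
}
```

Only added and changed artifacts need parsing; deleted ones drive cleanup of their symbols, edges, and semantic artifacts.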
23. Symbol repository
export type SymbolRepository = {
replaceForArtifacts(artifactIds: ArtifactId[], symbols: CodeSymbol[]): Promise<void>;
getById(id: SymbolId): Promise<CodeSymbol | undefined>;
findByName(name: string): Promise<CodeSymbol[]>;
findByQualifiedName(qualifiedName: string): Promise<CodeSymbol[]>;
listByArtifact(artifactId: ArtifactId): Promise<CodeSymbol[]>;
listExported(): Promise<CodeSymbol[]>;
};
Replacement:
async function replaceForArtifacts(artifactIds: ArtifactId[], symbols: CodeSymbol[]) {
for (const artifactId of artifactIds) {
await db.run("DELETE FROM symbols WHERE artifact_id = ?", [artifactId]);
}
await insertSymbols(symbols);
}
Wrap in transaction.
24. Graph repository
export type GraphRepository = {
upsertNodes(nodes: GraphNode[]): Promise<void>;
replaceEdgesForArtifacts(artifactIds: ArtifactId[], edges: CodeRelation[]): Promise<void>;
outgoing(ref: GraphNodeRef, kind?: RelationKind): Promise<CodeRelation[]>;
incoming(ref: GraphNodeRef, kind?: RelationKind): Promise<CodeRelation[]>;
neighbors(ref: GraphNodeRef, options?: NeighborOptions): Promise<GraphNode[]>;
getNode(ref: GraphNodeRef): Promise<GraphNode | undefined>;
};
Outgoing query:
SELECT *
FROM graph_edges
WHERE from_type = ?
AND from_id = ?
AND (? IS NULL OR kind = ?);
Incoming query:
SELECT *
FROM graph_edges
WHERE to_type = ?
AND to_id = ?
AND (? IS NULL OR kind = ?);
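Both queries use the optional kind filter twice, once in the IS NULL test and once in the comparison, so with positional placeholders the value has to be bound twice. A small sketch of building the parameter list, with outgoingQuery as a hypothetical helper that returns the SQL and parameters without touching a database:

```typescript
type GraphNodeRef = { type: string; id: string };

// Builds the outgoing-edges query and its positional parameters.
export function outgoingQuery(
  ref: GraphNodeRef,
  kind?: string
): { sql: string; params: (string | null)[] } {
  const sql =
    "SELECT * FROM graph_edges " +
    "WHERE from_type = ? AND from_id = ? AND (? IS NULL OR kind = ?)";

  // Placeholders bind in order, so the optional kind appears twice.
  // Passing null disables the filter.
  const kindParam = kind ?? null;
  return { sql, params: [ref.type, ref.id, kindParam, kindParam] };
}
```

The incoming query follows the same pattern with to_type and to_id.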
25. Semantic artifact repository
export type SemanticArtifactRepository = {
replaceForSourceArtifacts(
artifactIds: ArtifactId[],
artifacts: SemanticArtifact[]
): Promise<void>;
getById(id: string): Promise<SemanticArtifact | undefined>;
listByType(type: SemanticArtifact["type"]): Promise<SemanticArtifact[]>;
findByKey(type: string, key: string): Promise<SemanticArtifact[]>;
};
Replacement needs source artifact ID.
DELETE FROM semantic_artifacts WHERE source_artifact_id IN (...);
Then insert current artifacts.
26. Documentation mapping repository
export type DocumentationMappingRepository = {
replaceForPage(pageId: PageId, mappings: DocumentationMapping[]): Promise<void>;
findPagesDocumenting(target: GraphNodeRef): Promise<DocPageNode[]>;
findTargetsDocumentedBy(pageId: PageId): Promise<DocumentationMapping[]>;
findStaleMappings(): Promise<StaleDocumentationMapping[]>;
};
Stale mapping:
- the mapping stores source_hash,
- the current source artifact hash differs.
SELECT dm.*
FROM documentation_mappings dm
JOIN semantic_artifacts sa
ON dm.target_type = 'semanticArtifact'
AND dm.target_id = sa.id
WHERE dm.source_hash IS NOT NULL
AND dm.source_hash != sa.source_hash;
27. Provenance repository
export type ProvenanceRepository = {
replaceForOwner(owner: OwnerRef, refs: ProvenanceRef[]): Promise<void>;
listForOwner(owner: OwnerRef): Promise<ProvenanceRef[]>;
findByArtifact(artifactId: ArtifactId): Promise<ProvenanceRef[]>;
};
Owner:
export type OwnerRef = {
type: "semanticArtifact" | "docPage" | "mapping" | "diagnostic" | "searchChunk";
id: string;
};
28. Diagnostics repository
export type DiagnosticRepository = {
replaceForRun(runId: string, diagnostics: Diagnostic[]): Promise<void>;
listLatest(): Promise<Diagnostic[]>;
listByPath(path: string): Promise<Diagnostic[]>;
countBySeverity(runId: string): Promise<Record<DiagnosticSeverity, number>>;
};
If the store keeps many runs, listLatest must first resolve the most recent run (or the most recent successful one):
SELECT id FROM index_runs ORDER BY started_at DESC LIMIT 1;
29. Transactions
Index update must be atomic.
Bad:
- delete old symbols,
- crash before insert new symbols,
- store loses data.
Good:
await store.transaction(async (tx) => {
await tx.artifacts.upsertMany(changedArtifacts);
await tx.symbols.replaceForArtifacts(changedArtifactIds, symbols);
await tx.semanticArtifacts.replaceForSourceArtifacts(changedArtifactIds, semanticArtifacts);
await tx.graph.replaceEdgesForArtifacts(changedArtifactIds, relations);
await tx.diagnostics.replaceForRun(runId, diagnostics);
});
If any step fails, rollback.
30. Write-ahead logging
SQLite WAL mode improves concurrency.
On store open:
PRAGMA journal_mode = WAL;
PRAGMA foreign_keys = ON;
PRAGMA synchronous = NORMAL;
Foreign keys help consistency.
WAL allows readers while writer writes.
For CLI, this helps dev server and commands reading index.
31. Concurrency model
Multiple commands could access store:
- docforge dev,
- docforge index,
- docforge graph,
- editor extension later.
SQLite supports multiple readers, single writer.
Implement lock behavior:
- short transactions,
- avoid holding transaction while parsing,
- parse first, then write quickly,
- busy timeout.
PRAGMA busy_timeout = 5000;
Do not parse files inside DB transaction.
Correct:
scan -> parse -> compute results -> open transaction -> write results
32. Store corruption and recovery
SQLite is robust, but handle failure.
If open fails:
- print diagnostic,
- suggest rebuild,
- optionally move corrupted store aside.
Command:
docforge index reset
Behavior:
Moves .docforge/index/docforge.sqlite to .docforge/index/docforge.sqlite.bak-<timestamp>
Rebuilds index from source files.
Do not delete silently.
33. Invalidation model
Invalidation inputs:
- file content hash changed,
- config hash changed,
- parser version changed,
- query version changed,
- discovery plugin version changed,
- tool version migration,
- theme/component contract maybe,
- OpenAPI spec changed.
Store freshness:
export type FreshnessReason =
| "artifactChanged"
| "artifactDeleted"
| "configChanged"
| "parserVersionChanged"
| "queryVersionChanged"
| "pluginVersionChanged"
| "schemaMigrationRequired";
Plan:
export type ReindexPlan = {
full: boolean;
artifactsToParse: SourceArtifact[];
artifactsToDelete: SourceArtifact[];
reasons: FreshnessReason[];
};
If config include/exclude changes, full rescan may be needed.
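A sketch of how a plan could be derived from these inputs. It makes a conservative assumption that any config, parser, query, or plugin change forces a full reindex; a finer-grained planner could narrow that. The types are redeclared locally so the example is self-contained, and planReindex and Versions are hypothetical names.

```typescript
type FreshnessReason =
  | "artifactChanged"
  | "artifactDeleted"
  | "configChanged"
  | "parserVersionChanged"
  | "queryVersionChanged"
  | "pluginVersionChanged"
  | "schemaMigrationRequired";

type Artifact = { path: string; hash: string };
type Versions = { configHash: string; parserVersion: string; queryVersion: string; pluginVersion: string };
type ReindexPlan = { full: boolean; artifactsToParse: Artifact[]; artifactsToDelete: Artifact[]; reasons: FreshnessReason[] };

export function planReindex(
  stored: Versions,
  current: Versions,
  allCurrent: Artifact[],
  changes: { added: Artifact[]; changed: Artifact[]; deleted: Artifact[] }
): ReindexPlan {
  const reasons: FreshnessReason[] = [];
  if (stored.configHash !== current.configHash) reasons.push("configChanged");
  if (stored.parserVersion !== current.parserVersion) reasons.push("parserVersionChanged");
  if (stored.queryVersion !== current.queryVersion) reasons.push("queryVersionChanged");
  if (stored.pluginVersion !== current.pluginVersion) reasons.push("pluginVersionChanged");

  // Assumption: any global change invalidates every artifact.
  const full = reasons.length > 0;

  if (changes.added.length + changes.changed.length > 0) reasons.push("artifactChanged");
  if (changes.deleted.length > 0) reasons.push("artifactDeleted");

  return {
    full,
    artifactsToParse: full ? allCurrent : [...changes.added, ...changes.changed],
    artifactsToDelete: changes.deleted,
    reasons,
  };
}
```

An empty reasons array means the index is fresh and the run can exit early.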
34. Partial invalidation
When a file changes:
await store.transaction(async (tx) => {
await tx.graph.deleteEdgesForSourceArtifact(artifactId);
await tx.symbols.deleteForArtifact(artifactId);
await tx.semanticArtifacts.deleteForSourceArtifact(artifactId);
await tx.provenance.deleteForArtifact(artifactId);
await tx.artifacts.upsert(artifact);
await tx.symbols.insertMany(newSymbols);
await tx.semanticArtifacts.insertMany(newSemanticArtifacts);
await tx.graph.insertEdges(newRelations);
});
Do not leave old semantic artifacts from changed file.
35. Store compaction
Over time, database may grow.
Commands:
docforge index vacuum
docforge index reset
Vacuum:
VACUUM;
Only run on explicit command or occasional maintenance, not every build.
36. Data retention
Diagnostics and runs can grow.
Config:
{
"index": {
"retainRuns": 20
}
}
Cleanup:
DELETE FROM diagnostics
WHERE run_id NOT IN (
SELECT id FROM index_runs ORDER BY started_at DESC LIMIT 20
);
Then delete old runs.
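The cleanup query hardcodes 20, while the limit comes from retainRuns in the config. A sketch of choosing which runs to drop in application code, assuming started_at values are ISO 8601 strings in one consistent format so they sort chronologically as text; selectRunsToPrune is a hypothetical helper.

```typescript
type RunSummary = { id: string; startedAt: string }; // ISO 8601 timestamps

// Returns the IDs of runs that fall outside the retention window.
export function selectRunsToPrune(runs: RunSummary[], retainRuns: number): string[] {
  const keep = Math.max(0, Math.floor(retainRuns));

  // Newest first; plain string comparison is enough for same-format ISO strings.
  const newestFirst = [...runs].sort((a, b) =>
    a.startedAt < b.startedAt ? 1 : a.startedAt > b.startedAt ? -1 : 0
  );
  return newestFirst.slice(keep).map((run) => run.id);
}
```

The returned IDs can then be used to delete the matching diagnostics first and the index_runs rows second, inside one transaction.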
37. Avoid storing raw source
Raw source is sensitive and large.
By default store:
- paths,
- hashes,
- symbols,
- doc comments,
- signatures,
- small snippets maybe,
- provenance ranges.
Do not store full file content unless config allows.
For retrieval, you can read source files on demand using provenance.
If storing excerpts:
- redact secrets,
- limit length,
- mark source.
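The three excerpt rules can be combined in one helper. The sketch below is illustrative only: the single regular expression catches simple key = value assignments and is nowhere near a complete secret detector, and makeExcerpt, Excerpt, and the 400-character default are assumptions.

```typescript
export type Excerpt = {
  text: string;
  truncated: boolean;
  source: { path: string; startLine: number };
};

// Illustrative pattern only: matches simple `apiKey = ...` style assignments.
const SECRET_ASSIGNMENT = /\b(api[_-]?key|secret|token|password)\b(\s*[:=]\s*)\S+/gi;

export function makeExcerpt(
  text: string,
  source: { path: string; startLine: number },
  maxLength = 400
): Excerpt {
  // Redact before truncating so a secret is never cut in half and left behind.
  const redacted = text.replace(SECRET_ASSIGNMENT, "$1$2[REDACTED]");
  const truncated = redacted.length > maxLength;
  return {
    text: truncated ? redacted.slice(0, maxLength) : redacted,
    truncated,
    source, // marks where the excerpt came from
  };
}
```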
38. Store and AI privacy
Knowledge store may contain:
- internal class names,
- comments,
- endpoint names,
- config fields,
- doc comments.
It should remain local unless user explicitly sends context to AI provider.
Separation:
Knowledge store = local derived data
AI evidence pack = selected bounded data
AI provider = receives only evidence pack
Do not upload whole store.
39. Query examples
39.1 Find public CLI commands
SELECT payload_json
FROM semantic_artifacts
WHERE type = 'cliCommand'
ORDER BY key;
39.2 Find docs pages affected by endpoint
SELECT dp.*
FROM documentation_mappings dm
JOIN doc_pages dp ON dp.id = dm.page_id
WHERE dm.target_type = 'semanticArtifact'
AND dm.target_id = ?;
39.3 Find exported symbols
SELECT *
FROM symbols
WHERE exported = 1
ORDER BY qualified_name;
39.4 Find route handlers
SELECT ge.*
FROM graph_edges ge
WHERE ge.kind = 'handlesRoute';
39.5 Find unresolved imports diagnostics
SELECT *
FROM diagnostics
WHERE code = 'graph.import.unresolved'
ORDER BY path, line;
40. Typed query helpers
Avoid SQL strings in application logic.
export async function findCliCommands(store: KnowledgeStore): Promise<CliCommandArtifact[]> {
return store.semanticArtifacts.listByType("cliCommand") as Promise<CliCommandArtifact[]>;
}
export async function findDocsForSemanticArtifact(
store: KnowledgeStore,
artifactId: string
): Promise<DocPageNode[]> {
return store.mappings.findPagesDocumenting({
type: "semanticArtifact",
id: artifactId,
});
}
SQL belongs inside repository implementation.
41. Store-backed retrieval
Retrieval can query store:
export type RetrievalStore = {
keywordSearch(query: string): Promise<RetrievalItem[]>;
findSymbol(name: string): Promise<RetrievalItem[]>;
findSemanticArtifact(type: string, key: string): Promise<RetrievalItem[]>;
expandGraph(ref: GraphNodeRef): Promise<RetrievalItem[]>;
};
Example retrieval:
export async function retrieveForCliCommand(
store: KnowledgeStore,
commandName: string
): Promise<RetrievalItem[]> {
const commands = await store.semanticArtifacts.findByKey("cliCommand", commandName);
const items: RetrievalItem[] = [];
for (const command of commands) {
items.push(semanticArtifactToRetrievalItem(command));
const handler = await findCommandHandler(store.graph, command.id);
if (handler) {
items.push(symbolToRetrievalItem(handler));
}
const docs = await store.mappings.findPagesDocumenting({
type: "semanticArtifact",
id: command.id,
});
items.push(...docs.map(docPageToRetrievalItem));
}
return items;
}
42. Store-backed impact analysis
Impact command:
docforge graph impact --changed src/commands/build.ts
Flow:
- find artifact by path,
- find symbols/semantic artifacts from artifact,
- find doc mappings,
- expand dependent relations,
- rank pages.
export async function impactForChangedPath(
store: KnowledgeStore,
path: string
): Promise<DocumentationImpact> {
const artifact = await store.artifacts.getByPath(path);
if (!artifact) {
return { changedArtifacts: [], affectedPages: [] };
}
return computeImpactForArtifact(store.graph, store.mappings, artifact.id);
}
43. Migration strategy
Migrations must be safe.
Rules:
- migration files immutable once released,
- each migration wrapped in transaction,
- backup before destructive migration if necessary,
- incompatible major changes can rebuild store,
- user can run docforge index reset.
Because store is derived, destructive migrations are acceptable if rebuild is easy. But still avoid data loss of user-authored metadata.
If store contains manual annotations, keep them outside derived tables or in config files.
44. Schema migration example
Add endpoint visibility:
-- 004_add_semantic_artifact_visibility.sql
ALTER TABLE semantic_artifacts ADD COLUMN visibility TEXT DEFAULT 'unknown';
CREATE INDEX idx_semantic_artifacts_visibility ON semantic_artifacts(visibility);
Migration runner records:
INSERT INTO schema_migrations(version, name, applied_at)
VALUES (4, 'add semantic artifact visibility', datetime('now'));
45. Store health check
Command:
docforge index doctor
Checks:
- DB opens,
- schema migrations applied,
- foreign keys valid,
- required indexes exist,
- latest run status,
- artifact count reasonable,
- stale artifacts,
- diagnostics count,
- orphan graph edges.
SQLite checks:
PRAGMA integrity_check;
PRAGMA foreign_key_check;
Output:
Knowledge store health:
Schema: up to date
Integrity: ok
Artifacts: 213
Symbols: 1,942
Edges: 4,821
Latest run: succeeded 2 minutes ago
Diagnostics: 0 errors, 3 warnings
46. Orphan edge check
Graph edge points to missing node.
SELECT ge.*
FROM graph_edges ge
LEFT JOIN graph_nodes gn
ON ge.to_type = gn.type AND ge.to_id = gn.id
WHERE gn.id IS NULL;
Also check from side.
If orphan edges exist, index may need rebuild.
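For tests or small graphs, the same check can run in memory and cover both sides at once. This sketch assumes edges and nodes have already been loaded; findOrphanEdges and the local types are hypothetical.

```typescript
type NodeRef = { type: string; id: string };
type Edge = { id: string; from: NodeRef; to: NodeRef };

// Returns edges whose source or target node is missing from the node set.
export function findOrphanEdges(nodes: NodeRef[], edges: Edge[]): Edge[] {
  // JSON encoding keeps (type, id) pairs unambiguous as set keys.
  const key = (ref: NodeRef) => JSON.stringify([ref.type, ref.id]);
  const known = new Set(nodes.map(key));

  return edges.filter(
    (edge) => !known.has(key(edge.from)) || !known.has(key(edge.to))
  );
}
```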
47. Store status command
docforge index status
Output:
Index status:
Store: .docforge/index/docforge.sqlite
Schema: 22
Last indexed: 2026-07-03 12:14:22
Config hash: 9f2a...
Artifacts:
current: 213
changed: 0
deleted: 0
Parser versions:
tree-sitter-typescript: 0.23.x
Discovery:
express: enabled
commander: enabled
zod: enabled
Index is fresh.
If stale:
Index is stale.
Reasons:
- 3 files changed
- config hash changed
Run:
docforge index
48. Store reset command
docforge index reset
Behavior:
- stop if the dev server is using the store, or coordinate through a lock,
- move old DB to backup,
- create fresh schema,
- optionally reindex.
docforge index reset --rebuild
Output:
Moved old store to .docforge/index/docforge.sqlite.bak-20260703-121422
Rebuilt index successfully.
49. Store locking
For single-process CLI, SQLite locking is enough.
For better UX, use lock table or file lock for long index operations.
CREATE TABLE store_locks (
name TEXT PRIMARY KEY,
owner TEXT NOT NULL,
acquired_at TEXT NOT NULL,
expires_at TEXT
);
But avoid overengineering. SQLite writer lock plus busy timeout is sufficient for early version.
50. Testing knowledge store
50.1 Migration tests
it("applies migrations to empty database", async () => {
const db = await openTempStore();
await migrateStore(db);
const version = await getCurrentSchemaVersion(db);
expect(version).toBe(LATEST_SCHEMA_VERSION);
});
50.2 Repository tests
it("upserts and reads artifact by path", async () => {
const store = await createTempKnowledgeStore();
await store.artifacts.upsertMany([
artifact({ path: "src/index.ts", hash: "abc" }),
]);
const found = await store.artifacts.getByPath("src/index.ts");
expect(found?.hash).toBe("abc");
});
50.3 Transaction rollback tests
it("rolls back failed transaction", async () => {
const store = await createTempKnowledgeStore();
await expect(store.transaction(async (tx) => {
await tx.artifacts.upsertMany([artifact({ path: "src/a.ts" })]);
throw new Error("boom");
})).rejects.toThrow("boom");
expect(await store.artifacts.getByPath("src/a.ts")).toBeUndefined();
});
50.4 Graph query tests
it("returns outgoing graph edges", async () => {
const store = await createTempKnowledgeStore();
await seedGraph(store);
const outgoing = await store.graph.outgoing(
{ type: "symbol", id: "symbol:handler" },
"handlesRoute"
);
expect(outgoing).toHaveLength(1);
});
51. Performance testing
Test with generated dataset:
- 1,000 artifacts,
- 10,000 symbols,
- 50,000 edges,
- 1,000 semantic artifacts,
- 2,000 doc mappings.
Measure:
- bulk insert time,
- artifact lookup,
- symbol search,
- outgoing/incoming graph query,
- impact query,
- migration time.
Use transactions for bulk inserts.
Bad:
for each row:
insert outside transaction
Good:
transaction:
prepared statement insert many rows
52. Prepared statements
For performance:
const stmt = db.prepare(`
INSERT INTO symbols (
id, artifact_id, language, kind, name, qualified_name, display_name,
visibility, exported, start_line, start_column, end_line, end_column
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
artifact_id = excluded.artifact_id,
language = excluded.language,
kind = excluded.kind,
name = excluded.name,
qualified_name = excluded.qualified_name,
display_name = excluded.display_name,
visibility = excluded.visibility,
exported = excluded.exported,
start_line = excluded.start_line,
start_column = excluded.start_column,
end_line = excluded.end_line,
end_column = excluded.end_column
`);
Use batch transaction.
53. JSON encoding
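Some fields are stored as JSON.
Helpers:
export function encodeJson(value: unknown): string {
  // undefined is not valid JSON, so it is stored as null.
  return JSON.stringify(value ?? null);
}
export function decodeJson<T>(value: string | null | undefined, fallback: T): T {
  // NULL columns and empty strings decode to the fallback.
  if (!value) return fallback;
  try {
    return JSON.parse(value) as T;
  } catch {
    // Malformed JSON must not crash a query; return the fallback instead.
    return fallback;
  }
}
Malformed JSON should not crash a query, so decodeJson falls back silently. Because that silence can hide corruption, the store doctor should scan JSON columns and report rows that fail to parse.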
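A doctor-side check for this could look like the sketch below. `findInvalidJson` and `JsonColumnIssue` are hypothetical names, and the sketch assumes the caller has already read the raw column values into memory as `{ id, value }` pairs.

```typescript
// Hypothetical diagnostic shape for one unparseable JSON value.
interface JsonColumnIssue {
  rowId: string;
  column: string;
  message: string;
}

// Scan raw column values and report the ones that are not valid JSON.
// NULL and empty values count as "no value", matching a lenient decoder.
function findInvalidJson(
  rows: ReadonlyArray<{ id: string; value: string | null }>,
  column: string,
): JsonColumnIssue[] {
  const issues: JsonColumnIssue[] = [];
  for (const row of rows) {
    if (row.value === null || row.value === "") continue;
    try {
      JSON.parse(row.value);
    } catch (error) {
      issues.push({
        rowId: row.id,
        column,
        message: error instanceof Error ? error.message : String(error),
      });
    }
  }
  return issues;
}
```

Unlike the lenient decoder, this check surfaces every bad row, so query paths stay tolerant while the doctor stays strict.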
54. Normalizing paths
Store every path in normalized form: forward slashes and no leading ./.
export function normalizeStorePath(pathValue: string): string {
  // Convert Windows separators, then strip a leading "./".
  return pathValue.replace(/\\/g, "/").replace(/^\.\/+/, "");
}
Never store absolute local paths in derived public data; they are acceptable only in internal-only data.
The artifact table uses project-relative paths.
Diagnostics may display project-relative paths.
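Turning an absolute path into a project-relative one before it reaches the store can be sketched as follows. `toProjectRelativePath` is a hypothetical helper, not part of the original code; it uses plain string handling and assumes both inputs use the same casing and drive-letter form.

```typescript
// Convert a path to a normalized, project-relative store path.
// Returns undefined when the path is outside the project root or tries to escape it.
function toProjectRelativePath(
  projectRoot: string,
  filePath: string,
): string | undefined {
  // Use forward slashes and drop trailing slashes.
  const toPosix = (p: string): string =>
    p.replace(/\\/g, "/").replace(/\/+$/, "");
  const root = toPosix(projectRoot);
  const file = toPosix(filePath);

  const isAbsolute = file.startsWith("/") || /^[A-Za-z]:\//.test(file);
  let relative: string;
  if (isAbsolute) {
    // Require a full path-segment match, so "/repo-other" is not inside "/repo".
    if (!file.startsWith(root + "/")) return undefined;
    relative = file.slice(root.length + 1);
  } else {
    relative = file.replace(/^(\.\/)+/, "");
  }

  // Reject any ".." segment instead of trying to resolve it.
  return relative.split("/").includes("..") ? undefined : relative;
}
```

Returning `undefined` rather than a best-effort path makes it harder for a file outside the project to slip into the artifact table.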
55. Store and monorepos
Monorepo options:
Option A — one store per repo
repo/.docforge/index/docforge.sqlite
Pros:
- cross-package graph possible.
Cons:
- larger store.
Option B — one store per docs project/package
repo/packages/a/.docforge/index/docforge.sqlite
Pros:
- isolated.
Cons:
- harder cross-package references.
Recommended for this project: one store per configured project root, with package/service metadata.
Artifacts can include:
package_name TEXT
service_name TEXT
or metadata JSON.
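One way to fill a package_name column is to match each artifact path against the configured package roots. The sketch below is illustrative only: `PackageRoot` and `resolvePackageName` are hypothetical, and it assumes package roots are given as normalized project-relative directories.

```typescript
// Hypothetical description of one package inside the monorepo.
interface PackageRoot {
  name: string;
  root: string; // project-relative directory, e.g. "packages/a"
}

// Pick the package whose root is the longest matching directory prefix.
// Returns undefined for files that belong to no configured package.
function resolvePackageName(
  artifactPath: string,
  packages: readonly PackageRoot[],
): string | undefined {
  let bestName: string | undefined;
  let bestLength = -1;
  for (const pkg of packages) {
    const root = pkg.root.replace(/\/+$/, "");
    // Match whole path segments only, so "packages/ab" is not inside "packages/a".
    const inside = artifactPath === root || artifactPath.startsWith(root + "/");
    if (inside && root.length > bestLength) {
      bestName = pkg.name;
      bestLength = root.length;
    }
  }
  return bestName;
}
```

Longest-prefix matching keeps nested packages distinct, which matters when one store covers the whole project root.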
56. Store and branch changes
Switching Git branches can change many files at once.
Hash-based freshness checks catch files whose content changed.
But deleted or recreated files can leave stale doc mappings behind unless they are cleaned up.
On index run:
- mark seen artifacts,
- delete missing artifacts or mark deleted,
- cascade delete symbols/edges,
- recompute mappings if target missing.
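The first two steps amount to a set difference between what the store holds and what the current run saw. A minimal sketch, assuming hypothetical names `StoredArtifactRef`, `CleanupPlan`, and `planArtifactCleanup`; the cascade and mapping steps are left to the store itself.

```typescript
// Hypothetical reference to an artifact row already in the store.
interface StoredArtifactRef {
  id: string;
  path: string;
}

interface CleanupPlan {
  keep: StoredArtifactRef[];
  remove: StoredArtifactRef[];
}

// Split stored artifacts into those seen in this index run and those that
// are missing from disk and should be deleted (or marked deleted).
function planArtifactCleanup(
  stored: readonly StoredArtifactRef[],
  seenPaths: Iterable<string>,
): CleanupPlan {
  const seen = new Set(seenPaths);
  const plan: CleanupPlan = { keep: [], remove: [] };
  for (const artifact of stored) {
    if (seen.has(artifact.path)) {
      plan.keep.push(artifact);
    } else {
      plan.remove.push(artifact);
    }
  }
  return plan;
}
```

Computing the plan before touching the database keeps the later delete step short enough to run inside one write transaction.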
Optional branch key:
branchName?: string;
headCommit?: string;
Store snapshots can include git metadata.
57. Store and CI cache
CI can cache .docforge/index.
But a restored cache may be stale, so the index run should invalidate entries by:
- file hashes,
- config hash,
- tool version,
- parser version.
Do not trust cache blindly.
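These four inputs can be compared as one fingerprint per cached artifact. The sketch below assumes hypothetical names (`CacheFingerprint`, `findStaleReasons`) and treats every value as an opaque string.

```typescript
// Hypothetical fingerprint stored next to each cached artifact.
interface CacheFingerprint {
  fileHash: string;
  configHash: string;
  toolVersion: string;
  parserVersion: string;
}

type StaleReason = keyof CacheFingerprint | "missing";

// Return every reason a cached entry can no longer be trusted.
// An empty array means the entry is fresh and can be reused.
function findStaleReasons(
  cached: CacheFingerprint | undefined,
  current: CacheFingerprint,
): StaleReason[] {
  if (!cached) return ["missing"];
  const keys: (keyof CacheFingerprint)[] = [
    "fileHash",
    "configHash",
    "toolVersion",
    "parserVersion",
  ];
  return keys.filter((key) => cached[key] !== current[key]);
}
```

Reporting the reasons instead of a plain boolean lets an index status command explain why a file is being re-indexed.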
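Example GitHub Actions step:
- uses: actions/cache@v4
  with:
    path: .docforge/index
    key: docforge-index-${{ hashFiles('**/*') }}
    restore-keys: |
      docforge-index-
Hashing every file changes the key on almost every commit, so this step relies on the restore-keys prefix to reuse an older index. A narrower key that hashes only lockfiles and config gives more exact hits.
Even without a cache, the index should build deterministically.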
58. Store API package layout
packages/knowledge-store/
  src/
    open.ts
    migrate.ts
    schema.ts
    repositories/
      artifacts.ts
      symbols.ts
      semantic-artifacts.ts
      graph.ts
      doc-pages.ts
      mappings.ts
      provenance.ts
      diagnostics.ts
      runs.ts
    transaction.ts
    path.ts
    json.ts
    health.ts
    reset.ts
    __tests__/
      migrations.test.ts
      artifacts.test.ts
      symbols.test.ts
      graph.test.ts
      transactions.test.ts
59. Store integration with index command
export async function runIndexCommand(args: IndexArgs): Promise<void> {
  const config = await loadConfig(args.cwd);
  const store = await openKnowledgeStore(config.index.storePath);
  try {
    // Bring the schema up to date before any reads or writes.
    await store.migrate();
    const run = await store.runs.start({
      toolVersion: VERSION,
      configHash: hashConfig(config),
    });
    try {
      const result = await indexRepository({
        projectRoot: args.cwd,
        config,
        store,
        runId: run.id,
      });
      await store.runs.finish(run.id, {
        status: "succeeded",
        artifactCount: result.stats.artifacts,
        symbolsCount: result.stats.symbols,
        relationsCount: result.stats.relations,
        diagnosticsCount: result.diagnostics.length,
      });
      printIndexSummary(result);
    } catch (error) {
      // Record the failed run, then rethrow so the CLI exits with an error.
      await store.runs.finish(run.id, { status: "failed" });
      throw error;
    }
  } finally {
    // Always release the database handle, even if migration or run start failed.
    await store.close();
  }
}
60. Minimal implementation milestone
First version:
- SQLite open/migration,
- artifacts table/repository,
- symbols table/repository,
- semantic artifacts table/repository,
- graph nodes/edges table/repository,
- diagnostics table,
- index runs table,
- transaction support,
- artifact hash change detection,
- artifact-scoped replacement,
- docforge index status and docforge index reset commands.
Second version:
- doc pages/mappings,
- provenance refs,
- graph impact queries,
- embeddings metadata,
- search chunk cache,
- store doctor,
- snapshots/diff,
- CI cache support,
- retention/cleanup,
- monorepo metadata.
61. Failure modes
| Failure | Cause | Prevention |
|---|---|---|
| Store stale after file change | no hash invalidation | artifact hash planning |
| Old symbols remain | replacement not artifact-scoped | delete by artifact before insert |
| Partial index update | no transaction | transactional writes |
| Slow queries | missing indexes | indexes on graph/artifact/symbol fields |
| DB locked | long write transactions | parse outside transaction, busy timeout |
| Secrets stored | raw source stored indiscriminately | metadata-only default, redaction |
| Schema mismatch | no migrations | schema_migrations table |
| Store corruption blocks work | no recovery command | index reset, backup |
| Public build leaks store | output copies .docforge | deploy allowlist |
| Graph orphan edges | deleted nodes without cleanup | foreign key checks, doctor orphan check |
| Cache invalid after parser update | key lacks parser/query version | parse_runs versioning |
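As one concrete example from the table, an orphan-edge check for the doctor can be sketched in a few lines. `GraphEdgeRef` and `findOrphanEdges` are hypothetical names, and the sketch assumes node IDs and edges have already been loaded from the store.

```typescript
// Hypothetical in-memory view of one graph edge row.
interface GraphEdgeRef {
  id: string;
  from: string;
  to: string;
}

// Return edges whose source or target node no longer exists.
function findOrphanEdges(
  nodeIds: Iterable<string>,
  edges: readonly GraphEdgeRef[],
): GraphEdgeRef[] {
  const known = new Set(nodeIds);
  return edges.filter((edge) => !known.has(edge.from) || !known.has(edge.to));
}
```

A non-empty result points to a delete path that skipped its cascade, which is useful to report even when foreign keys are enforced elsewhere.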
62. Key takeaways
Repository knowledge store is the local memory of the documentation generator.
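It stores derived facts, not source truth: the source code remains the source of truth, and the store can always be deleted and rebuilt from it.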
A strong knowledge store design has:
- explicit schema versioning,
- artifact hash invalidation,
- transactional updates,
- artifact-scoped replacement,
- typed repositories,
- graph query indexes,
- provenance storage,
- diagnostics history,
- freshness/status commands,
- reset/doctor recovery paths.
Next, we move into OpenAPI ingestion and validation, where formal API specs become high-confidence API documentation source artifacts.