Learn Ai Docs Km Cli Part 038 Retrieval Layer For Docs And Notes
title: Build From Scratch: Mintlify-like AI-driven Documentation Generator CLI - Part 038
description: Build a retrieval layer for source-grounded documentation generation across docs, notes, examples, contracts, source snippets, and knowledge graph relations.
series: learn-ai-docs-km-cli
seriesTitle: Build From Scratch: Mintlify-like AI-driven Documentation Generator CLI with Code2Prompt and Open-source Knowledge Management
order: 38
partTitle: Retrieval Layer for Docs and Notes
tags:
- ai-docs
- documentation
- cli
- retrieval
- rag
- semantic-search
- knowledge-graph
- source-grounding
- context-engine
date: 2026-07-04
Part 038 — Retrieval Layer for Docs and Notes
In the previous part we built bidirectional sync between docs and notes.
Now we move to the layer that makes the whole system much smarter:
A retrieval layer for docs, notes, graph, examples, contracts, and source snippets.
Without retrieval, the context compiler can only rely on static heuristics:
take files that look relevant
take the README
take the OpenAPI spec
take the tests
That is enough for a small repo.
For large repos, monorepos, multi-service platforms, and enterprise documentation, that approach fails.
We need a retrieval layer that can answer:
- which source is most relevant for this page?
- which note has already explained this concept?
- which example is the most authoritative?
- which contract must win when there is a conflict?
- which graph relations help expand the context?
- which chunks are safe to put into the prompt?
- which ones are stale?
- which ones are public, internal, or private?
This part covers the design of a retrieval layer that is source-grounded, hybrid, explainable, cacheable, and safe for documentation generation.
1. Mental Model: Retrieval Is Context Selection Under Constraints
Retrieval is not just search.
In an AI docs CLI, retrieval is the process of selecting evidence for a specific task.
query + task + page spec + source authority + visibility + token budget
-> ranked evidence set
We do not pick a document only because it is “similar by embedding”.
We pick a document because it:
- is relevant to the task,
- has enough authority,
- is fresh,
- is visible for the target output,
- is not redundant,
- helps satisfy the page contract,
- can be verified.
Bad retrieval produces bad docs even when the model is good.
2. Why Retrieval Matters for Code and Docs
An LLM does not always know the latest library, API, or internal codebase.
The paper DocPrompting: Generating Code by Retrieving the Docs makes an important point: when programmers use an API or library, they consult its documentation; a code generation system can improve by retrieving the relevant documentation first, before generating output.
For our documentation generator, the principle is broader:
retrieve relevant source/docs/notes/examples first
then generate grounded docs from retrieved evidence
Without retrieval, the generator tends to:
- rely on incorrect prior knowledge,
- invent behavior,
- miss edge cases,
- mix old docs with new source,
- include too much irrelevant context.
3. Retrieval Sources
We have many sources.
source files
symbols
contracts
examples
tests
configs
migrations
existing docs
Logseq pages
OpenNote notes/chunks
knowledge graph nodes/edges
review comments
CI/runbook artifacts
Not all sources are equal.
| Source | Usefulness | Risk | Default Authority |
|---|---|---|---|
| OpenAPI / schemas | Very high | Medium if stale | High |
| source code | Very high | Medium if complex | High |
| tests | High | Low-medium | High |
| examples | High | Medium if outdated | Medium-high |
| reviewed docs | High | Medium if stale | Medium-high |
| Logseq notes | Medium-high | High if unreviewed | Medium-low |
| OpenNote semantic notes | Medium-high | High if unreviewed | Medium-low |
| generated summaries | Medium | High | Low |
| review comments | Medium | Medium | Contextual |
Retrieval layer must carry source authority in every result.
4. Retrieval Query Model
A retrieval query is not just a string.
type RetrievalQuery = {
id: string;
task:
| "generate_page"
| "verify_claim"
| "repair_page"
| "find_examples"
| "detect_drift"
| "answer_dev_question"
| "sync_notes";
naturalLanguage?: string;
symbols?: string[];
paths?: string[];
contracts?: string[];
concepts?: string[];
targetPage?: string;
visibility: "public" | "internal" | "private";
requiredKinds?: RetrievalKind[];
forbiddenKinds?: RetrievalKind[];
freshness?: "strict" | "normal" | "relaxed";
tokenBudget?: number;
};
Example:
{
"task": "generate_page",
"targetPage": "docs/guides/authentication.mdx",
"naturalLanguage": "Write the authentication guide",
"symbols": ["AuthService", "TokenController"],
"contracts": ["openapi:POST /oauth/token"],
"visibility": "public",
"requiredKinds": ["contract", "source", "example", "reviewed_doc"],
"forbiddenKinds": ["private_note", "unreviewed_ai_summary"],
"tokenBudget": 12000
}
Task-aware retrieval beats generic search.
5. Retrieval Result Model
Every result needs more than text.
type RetrievalResult = {
id: string;
kind: RetrievalKind;
title?: string;
text: string;
sourceRefs: SourceRef[];
authority: number;
relevance: number;
freshness: number;
confidence: number;
visibility: "public" | "internal" | "private";
ownership: "source" | "human" | "generated" | "hybrid";
stale: boolean;
riskFlags: string[];
relations: RetrievalRelation[];
score: number;
explanation: string[];
};
A retrieval result must answer:
Why are you in the context?
Can you be trusted?
Can you be shown to this target?
Are you stale?
What source backs you?
6. Retrieval Kinds
type RetrievalKind =
| "source_snippet"
| "symbol"
| "contract"
| "schema"
| "example"
| "test_case"
| "config"
| "migration"
| "doc_section"
| "logseq_block"
| "opennote_chunk"
| "graph_node"
| "graph_edge"
| "runbook_step"
| "review_comment";
Different kinds need different indexing and scoring.
An OpenAPI operation should not be ranked the same way as a Logseq journal block.
7. Hybrid Retrieval
We use hybrid retrieval.
lexical search + semantic search + graph expansion + authority filtering + recency filtering
Why not semantic only?
Because code/docs contain exact identifiers:
- AuthService,
- POST /oauth/token,
- x-request-id,
- InvoiceCreated,
- retry_after_ms,
- docs.json.
Embedding search can miss exact identifiers.
Why not lexical only?
Because docs/notes use natural language:
- “token renewal”,
- “refresh session”,
- “reauthentication”,
- “credential rotation”.
Semantic search can connect related phrasing.
Why graph expansion?
Because if we retrieve AuthService, we also need related endpoints, configs, tests, and concepts.
8. Lexical Index
Lexical index supports exact match and keyword search.
Index fields:
id
title
path
symbols
identifiers
headings
body
source refs
tags
relations
Implementation options:
- SQLite FTS5,
- Tantivy,
- Meilisearch,
- local inverted index,
- ripgrep-backed search for minimal version.
Minimal local index schema:
CREATE VIRTUAL TABLE retrieval_fts USING fts5(
id UNINDEXED,
kind UNINDEXED,
title,
path,
identifiers,
body,
tokenize = 'unicode61'
);
Query examples:
SELECT id, bm25(retrieval_fts) AS rank
FROM retrieval_fts
WHERE retrieval_fts MATCH 'AuthService OR oauth token'
ORDER BY rank;
Lexical retrieval is excellent for exact code symbols and endpoint paths.
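Identifiers such as endpoint paths and header names contain characters that FTS5 would otherwise read as query syntax. A minimal sketch of one way to handle this, assuming a hypothetical helper named `toFtsMatch` (not part of the series code): each term is quoted as an FTS5 string literal so it is matched as a phrase.

```typescript
// Hypothetical helper (an assumption for illustration): builds an FTS5 MATCH
// expression that treats every input term as a literal phrase.
function toFtsMatch(terms: string[]): string {
  return terms
    .map((term) => term.trim())
    .filter((term) => term.length > 0)
    // FTS5 string literal: wrap in double quotes, double any embedded quote.
    .map((term) => '"' + term.replace(/"/g, '""') + '"')
    .join(" OR ");
}

const matchExpr = toFtsMatch(["AuthService", "POST /oauth/token", "x-request-id"]);
console.log(matchExpr);
```

The resulting string is meant to be bound as a parameter to `WHERE retrieval_fts MATCH ?` rather than concatenated into the SQL text.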
9. Semantic Index
Semantic index supports meaning-based search.
Chunk embedding record:
type EmbeddingRecord = {
chunkId: string;
model: string;
dimensions: number;
vector: number[];
sourceHash: string;
textHash: string;
createdAt: string;
};
Semantic query:
"How does token refresh work?"
May retrieve:
- RefreshTokenService,
- POST /oauth/token,
- Token expiry configuration,
- Logseq note “Session renewal behavior”,
- test case “does not refresh expired refresh token”.
Semantic retrieval must still obey:
- visibility,
- source freshness,
- authority,
- source refs,
- token budget.
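The constraints above can be sketched as a brute-force semantic search that filters by visibility before it computes similarity. This is a minimal sketch, not the series implementation: `IndexedChunk`, `cosine`, and `semanticSearch` are hypothetical names, and a real index would likely use an approximate nearest-neighbor structure instead of a linear scan.

```typescript
type Visibility = "public" | "internal" | "private";
// Hypothetical in-memory shape: an embedding vector plus its visibility.
type IndexedChunk = { chunkId: string; vector: number[]; visibility: Visibility };

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function semanticSearch(
  query: number[],
  chunks: IndexedChunk[],
  target: Visibility,
  k: number
): { chunkId: string; similarity: number }[] {
  return chunks
    // Hard visibility filter first, so hidden chunks are never ranked.
    .filter((c) =>
      target === "public" ? c.visibility === "public"
      : target === "internal" ? c.visibility !== "private"
      : true
    )
    .map((c) => ({ chunkId: c.chunkId, similarity: cosine(query, c.vector) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}
```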
10. Chunking Strategy
Bad chunking destroys retrieval.
Chunk types:
file chunk
symbol chunk
heading section chunk
OpenAPI operation chunk
schema chunk
test episode chunk
example chunk
note block chunk
graph neighborhood chunk
Rules:
- do not split code symbol randomly,
- keep heading + body together,
- keep request/response examples together,
- keep test setup/action/assertion together,
- preserve source refs,
- include metadata in chunk,
- avoid huge chunks that dominate token budget.
Example chunk:
{
"id": "chunk:openapi:post-oauth-token",
"kind": "contract",
"title": "POST /oauth/token",
"text": "Issues an access token...",
"source_refs": ["openapi/auth.yaml#/paths/~1oauth~1token/post"],
"symbols": ["TokenController", "TokenResponse"],
"visibility": "public",
"authority": 0.95
}
11. Code-aware Chunking
Code needs special treatment.
For source files, chunk by:
- exported symbol,
- class,
- function,
- endpoint handler,
- config object,
- test case,
- migration unit.
Avoid arbitrary fixed-size chunks as primary strategy.
Example:
type CodeChunk = {
id: string;
file: string;
language: string;
symbolId?: string;
startLine: number;
endLine: number;
signature?: string;
docComment?: string;
body: string;
imports: string[];
outgoingRelations: string[];
};
For large function/class:
signature + comments + relevant branches + called symbols summary
Do not always include full body.
12. Docs-aware Chunking
For MDX docs:
- split by heading sections,
- keep frontmatter metadata,
- preserve internal links,
- preserve code fences,
- preserve callout type,
- exclude generated metadata comments from semantic text,
- include source refs.
Example:
{
"id": "chunk:docs:auth-guide:refresh-token",
"kind": "doc_section",
"title": "Refresh tokens",
"path": "docs/guides/authentication.mdx",
"headingPath": ["Authentication", "Refresh tokens"],
"text": "Refresh tokens are used to...",
"source_refs": ["auth.config.ts", "openapi/auth.yaml"],
"reviewStatus": "accepted"
}
Reviewed docs are more authoritative than unreviewed generated docs.
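As an illustration of heading-based splitting, here is a minimal sketch that walks a Markdown or MDX string line by line, tracks the heading path, and ignores `#` lines inside code fences. The names `DocSection` and `splitByHeadings` are assumptions, and the sketch skips front matter parsing and source ref extraction.

```typescript
type DocSection = { headingPath: string[]; text: string };

// Built at runtime so this example does not contain a literal fence marker.
const FENCE = "`".repeat(3);

function splitByHeadings(markdown: string): DocSection[] {
  const sections: DocSection[] = [];
  const path: string[] = [];
  let body: string[] = [];
  let current: string[] | null = null;
  let inFence = false;

  const flush = (): void => {
    // Content before the first heading is dropped in this sketch.
    if (current !== null) {
      sections.push({ headingPath: current, text: body.join("\n").trim() });
    }
    body = [];
  };

  for (const line of markdown.split("\n")) {
    if (line.trim().startsWith(FENCE)) inFence = !inFence;
    const heading = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      flush();
      // Truncate the path to the parent level, then append this heading.
      path.splice(heading[1].length - 1);
      path.push(heading[2].trim());
      current = path.slice();
    } else {
      body.push(line);
    }
  }
  flush();
  return sections;
}
```

Each returned section keeps its heading together with its body and code fences, which matches the rules listed above.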
13. Note-aware Chunking
For Logseq:
- page properties become metadata,
- blocks become chunks,
- nested blocks preserve context path,
- [[Page References]] become relations,
- tags become metadata,
- TODO/DONE state is preserved,
- journal pages have lower authority by default.
For OpenNote:
- note metadata becomes document metadata,
- semantic chunks may already exist,
- relations are loaded if exported,
- embedding metadata must include model/version.
Notes need stronger safety filters because they are often informal.
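A minimal sketch of the Logseq rules above, assuming the block text has already been separated from its bullet marker. `NoteBlockMeta` and `parseLogseqBlock` are hypothetical names, and the tag pattern is a simplification that only covers single-word `#tag` forms.

```typescript
type NoteBlockMeta = {
  pageRefs: string[];
  tags: string[];
  todoState?: "TODO" | "DONE";
};

function parseLogseqBlock(text: string): NoteBlockMeta {
  const pageRefs: string[] = [];
  const tags: string[] = [];
  // [[Page References]] become relation targets.
  const refPattern = /\[\[([^\[\]]+)\]\]/g;
  // Simplified tag pattern: a # preceded by whitespace or start of text.
  const tagPattern = /(?:^|\s)#([A-Za-z0-9_-]+)/g;

  let match: RegExpExecArray | null;
  while ((match = refPattern.exec(text)) !== null) pageRefs.push(match[1]);
  while ((match = tagPattern.exec(text)) !== null) tags.push(match[1]);

  // TODO/DONE markers sit at the start of the block text.
  const state = /^(TODO|DONE)\s/.exec(text);
  return {
    pageRefs,
    tags,
    todoState: state ? (state[1] as "TODO" | "DONE") : undefined,
  };
}

console.log(parseLogseqBlock("TODO Check [[Bearer Token]] expiry with [[AuthService]] #auth"));
```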
14. Knowledge Graph Expansion
Retrieval should use graph expansion after initial hits.
Example:
Initial hit:
symbol:AuthService
Graph expansion adds:
endpoint:POST /oauth/token
config:auth.token.ttl
test:refresh-token-expiry
concept:Bearer Token
doc:Authentication Guide
Expansion policy:
type ExpansionPolicy = {
maxDepth: number;
allowedEdges: string[];
maxNodes: number;
requireAuthorityAbove: number;
};
For docs generation, use conservative expansion:
depth 1-2
only high-confidence edges
exclude private/internal if target public
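The conservative policy can be sketched as a bounded breadth-first expansion. This is a sketch under assumptions: the `GraphEdge` shape, the edge type names in the usage example, and the idea of storing authority on edges are illustrative, not taken from the knowledge graph built earlier in the series.

```typescript
type ExpansionPolicy = {
  maxDepth: number;
  allowedEdges: string[];
  maxNodes: number;
  requireAuthorityAbove: number;
};
// Hypothetical edge shape for illustration.
type GraphEdge = { from: string; to: string; type: string; authority: number };

// Returns the nodes added by expansion, excluding the seeds themselves.
function expandGraph(seeds: string[], edges: GraphEdge[], policy: ExpansionPolicy): string[] {
  const visited = new Set<string>(seeds);
  const added: string[] = [];
  let frontier = seeds.slice();

  for (let depth = 0; depth < policy.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const nodeId of frontier) {
      for (const edge of edges) {
        if (edge.from !== nodeId) continue;
        if (policy.allowedEdges.indexOf(edge.type) === -1) continue;
        if (edge.authority <= policy.requireAuthorityAbove) continue;
        if (visited.has(edge.to)) continue;
        if (added.length >= policy.maxNodes) return added; // hard cap on growth
        visited.add(edge.to);
        added.push(edge.to);
        next.push(edge.to);
      }
    }
    frontier = next;
  }
  return added;
}
```

The `maxNodes` cap returns early, so one highly connected node cannot pull in the whole repository.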
15. Score Formula
Simple score:
score =
lexicalScore * 0.25 +
semanticScore * 0.25 +
graphScore * 0.20 +
authorityScore * 0.20 +
freshnessScore * 0.10 -
riskPenalty
But weights should be task-specific.
For claim verification:
authority > lexical > graph > semantic
For concept discovery:
semantic > graph > lexical > authority
For examples:
example validity > source authority > relevance > freshness
For public docs generation:
visibility safety is a hard filter, not just a score
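A minimal sketch of task-specific weighting. The `generate_page` weights repeat the simple formula above; the `verify_claim` weights are illustrative values chosen only to respect the ordering authority > lexical > graph > semantic, and would need tuning against evaluation fixtures.

```typescript
type Signals = {
  lexical: number;
  semantic: number;
  graph: number;
  authority: number;
  freshness: number;
  riskPenalty: number;
};
type Weights = Omit<Signals, "riskPenalty">;

const TASK_WEIGHTS: { [task: string]: Weights } = {
  // Same weights as the simple score formula.
  generate_page: { lexical: 0.25, semantic: 0.25, graph: 0.2, authority: 0.2, freshness: 0.1 },
  // Assumed values: authority > lexical > graph > semantic.
  verify_claim: { lexical: 0.25, semantic: 0.1, graph: 0.15, authority: 0.4, freshness: 0.1 },
};

function scoreResult(task: string, s: Signals): number {
  // Unknown tasks fall back to the default page-generation weights.
  const w = TASK_WEIGHTS[task] || TASK_WEIGHTS["generate_page"];
  return (
    s.lexical * w.lexical +
    s.semantic * w.semantic +
    s.graph * w.graph +
    s.authority * w.authority +
    s.freshness * w.freshness -
    s.riskPenalty
  );
}
```

Visibility is deliberately absent from the weights: it is applied as a filter before scoring, never traded off against relevance.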
16. Authority as Hard Constraint
If a low-authority note conflicts with a high-authority contract, contract wins.
Retrieval should surface conflict, not average them.
Example:
Logseq note: API uses cursor pagination
OpenAPI spec: API uses page/limit
Output should be:
{
"result": "conflict",
"preferred": "openapi/billing.yaml#/paths/~1invoices/get",
"conflicting": "logseq/pages/API___Billing.md#pagination",
"reason": "contract has higher authority than unreviewed note"
}
Do not blend contradictory evidence into one summary.
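A minimal sketch of surfacing a conflict instead of averaging it away. It assumes both pieces of evidence have already been normalized to a comparable `topic` and `value`, which is the hard part in practice; `Evidence`, `Resolution`, and `resolveEvidence` are hypothetical names.

```typescript
type Evidence = { ref: string; topic: string; value: string; authority: number };
type Resolution =
  | { result: "consistent"; value: string }
  | { result: "conflict"; preferred: string; conflicting: string; reason: string };

function resolveEvidence(a: Evidence, b: Evidence): Resolution {
  if (a.topic !== b.topic) throw new Error("evidence must describe the same topic");
  if (a.value === b.value) return { result: "consistent", value: a.value };

  // Order the pair so the higher-authority source comes first.
  const [winner, loser] = a.authority >= b.authority ? [a, b] : [b, a];
  return {
    result: "conflict",
    preferred: winner.ref,
    conflicting: loser.ref,
    reason:
      winner.authority === loser.authority
        ? "equal authority, needs human review"
        : "higher-authority source wins",
  };
}
```

The caller receives both refs, so the generator can follow the preferred source while the report still shows the losing note.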
17. Freshness and Staleness
Every retrieval result should know whether it is stale.
Freshness signals:
- source hash matches current scan,
- source ref still exists,
- generated artifact built after source change,
- note synced after docs update,
- contract version matches target version,
- branch matches current branch.
Stale result example:
{
"id": "chunk:docs:old-auth-guide",
"stale": true,
"reason": "source_ref_hash_changed",
"sourceRef": "openapi/auth.yaml#/paths/~1oauth~1token/post"
}
Stale evidence can be used for drift detection, but not as authoritative input for new docs unless marked clearly.
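The first two freshness signals can be sketched as a hash comparison between what a chunk recorded at index time and what the current scan reports. `RecordedRef`, `StaleCheck`, and `checkStale` are hypothetical names; the reason `source_ref_missing` is an assumed label, while `source_ref_hash_changed` follows the example above.

```typescript
type RecordedRef = { ref: string; hash: string };
type StaleCheck = { stale: boolean; reason?: string; sourceRef?: string };

// currentHashes maps each source ref to its hash in the current scan.
function checkStale(recorded: RecordedRef[], currentHashes: Map<string, string>): StaleCheck {
  for (const r of recorded) {
    const now = currentHashes.get(r.ref);
    if (now === undefined) {
      return { stale: true, reason: "source_ref_missing", sourceRef: r.ref };
    }
    if (now !== r.hash) {
      return { stale: true, reason: "source_ref_hash_changed", sourceRef: r.ref };
    }
  }
  return { stale: false };
}
```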
18. Visibility Filter
Retrieval must apply visibility before ranking.
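type Visibility = "public" | "internal" | "private";
// Returns true when a result may be used for the given target visibility.
function canUse(result: RetrievalResult, targetVisibility: Visibility): boolean {
  // Public targets may only use public evidence.
  if (targetVisibility === "public") return result.visibility === "public";
  // Internal targets may use public and internal evidence, never private.
  if (targetVisibility === "internal") return result.visibility !== "private";
  // Private targets may use everything.
  return true;
}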
This prevents private notes from leaking into public docs.
Visibility filter is not optional.
19. Retrieval Plan
For page generation, retrieval should produce a plan.
{
"query_id": "rq:auth-guide",
"target": "docs/guides/authentication.mdx",
"strategy": "hybrid_graph_authority",
"required_coverage": [
"overview",
"authentication-flow",
"token-request",
"refresh-token",
"errors",
"examples"
],
"selected": [
"contract:POST /oauth/token",
"source:TokenController.issueToken",
"test:refresh-token-expiry",
"example:curl-token-request"
],
"omitted": [
{
"id": "logseq:journal:2026-06-incident",
"reason": "visibility_internal_for_public_target"
}
],
"coverage_gaps": [
"rate limit behavior"
]
}
The retrieval plan becomes input to context compiler.
20. Coverage-driven Retrieval
A page spec requires certain sections.
Example:
Authentication Guide requires:
- overview
- token issuance
- auth header format
- refresh behavior
- error responses
- examples
Retrieval should not just return top 20 chunks.
It should cover all required topics.
Algorithm:
for each required section:
retrieve candidates
rank candidates
select high-authority evidence
mark coverage status
combine evidence
remove duplicates
fit token budget
This avoids a common failure:
All retrieved chunks are about token issuance.
No evidence about refresh behavior.
Generator invents refresh behavior.
Coverage gaps should be explicit.
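The algorithm above can be sketched as a per-section selection loop. This sketch assumes candidates were already retrieved and tagged with the section they support; `Candidate`, `CoveragePlan`, and `selectForCoverage` are hypothetical names, and token budget fitting is left to the packing step.

```typescript
type Candidate = { id: string; section: string; score: number; authority: number };
type CoveragePlan = { selected: string[]; coverageGaps: string[] };

function selectForCoverage(
  required: string[],
  candidates: Candidate[],
  perSection: number,
  minAuthority: number
): CoveragePlan {
  const selected: string[] = [];
  const coverageGaps: string[] = [];

  for (const section of required) {
    const picks = candidates
      .filter((c) => c.section === section && c.authority >= minAuthority)
      .sort((a, b) => b.score - a.score)
      .slice(0, perSection);

    if (picks.length === 0) {
      // No trustworthy evidence: report the gap instead of letting the model guess.
      coverageGaps.push(section);
      continue;
    }
    for (const pick of picks) {
      if (selected.indexOf(pick.id) === -1) selected.push(pick.id); // remove duplicates
    }
  }
  return { selected, coverageGaps };
}
```

Because selection runs per section, three strong chunks about token issuance cannot crowd out the single chunk about refresh behavior.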
21. Redundancy Control
Retrieval often returns many similar chunks.
Examples:
- README and docs say same thing,
- Logseq note mirrors docs,
- OpenNote chunk mirrors Logseq block,
- tests and examples overlap.
We need deduplication.
Signals:
- same source refs,
- same stable entity ID,
- high text similarity,
- generated-from relation,
- same hash.
Dedup policy:
prefer higher authority
prefer fresher
prefer reviewed
prefer shorter if enough
preserve one alternate source if useful for verification
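A minimal sketch of the first four dedup preferences, assuming duplicates have already been grouped by a stable entity ID. `DedupItem`, `isBetter`, and `dedupe` are hypothetical names; the sketch does not keep an alternate source for verification.

```typescript
type DedupItem = {
  id: string;
  entityId: string;
  authority: number;
  freshness: number;
  reviewed: boolean;
  tokens: number;
};

// Applies the policy in order: authority, freshness, reviewed, then shorter.
function isBetter(a: DedupItem, b: DedupItem): boolean {
  if (a.authority !== b.authority) return a.authority > b.authority;
  if (a.freshness !== b.freshness) return a.freshness > b.freshness;
  if (a.reviewed !== b.reviewed) return a.reviewed;
  return a.tokens < b.tokens;
}

function dedupe(items: DedupItem[]): DedupItem[] {
  const bestByEntity = new Map<string, DedupItem>();
  const order: string[] = []; // first-seen order of entity IDs
  for (const item of items) {
    const current = bestByEntity.get(item.entityId);
    if (current === undefined) {
      order.push(item.entityId);
      bestByEntity.set(item.entityId, item);
    } else if (isBetter(item, current)) {
      bestByEntity.set(item.entityId, item);
    }
  }
  return order.map((entityId) => bestByEntity.get(entityId) as DedupItem);
}
```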
22. Token Budget Integration
Retrieval does not end with ranking.
It must fit into token budget.
Budget buckets:
instructions 15%
source evidence 45%
examples 15%
existing docs 10%
notes 5%
output contract 10%
For public docs generation, notes should usually be a small slice unless reviewed.
Packing algorithm:
select must-have evidence
add coverage evidence
add examples
add notes only if useful and allowed
compress lower-priority chunks
omit redundant chunks
emit omitted evidence diagnostics
Retrieval and context packing are separate but tightly connected.
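A minimal sketch of bucket allocation and greedy packing using the percentages above. The bucket keys, `PackItem`, and the reason string `bucket_budget_exceeded` are assumptions, and the compression step is left out: a chunk that does not fit is simply reported as omitted.

```typescript
// Percentages from the budget buckets; keys are illustrative names.
const BUCKET_PERCENT: { [bucket: string]: number } = {
  instructions: 15,
  source_evidence: 45,
  examples: 15,
  existing_docs: 10,
  notes: 5,
  output_contract: 10,
};

function allocate(totalTokens: number): { [bucket: string]: number } {
  const budgets: { [bucket: string]: number } = {};
  for (const bucket of Object.keys(BUCKET_PERCENT)) {
    // Integer percentages avoid floating-point drift in the result.
    budgets[bucket] = Math.floor((totalTokens * BUCKET_PERCENT[bucket]) / 100);
  }
  return budgets;
}

type PackItem = { id: string; bucket: string; tokens: number; priority: number };
type PackResult = { included: string[]; omitted: { id: string; reason: string }[] };

function pack(items: PackItem[], budgets: { [bucket: string]: number }): PackResult {
  const remaining = { ...budgets };
  const result: PackResult = { included: [], omitted: [] };
  // Highest priority first, so must-have evidence claims budget before the rest.
  const ordered = items.slice().sort((a, b) => b.priority - a.priority);
  for (const item of ordered) {
    const left = remaining[item.bucket] || 0;
    if (item.tokens <= left) {
      remaining[item.bucket] = left - item.tokens;
      result.included.push(item.id);
    } else {
      result.omitted.push({ id: item.id, reason: "bucket_budget_exceeded" });
    }
  }
  return result;
}
```

With a 12000-token budget, `allocate` gives notes only 600 tokens, which keeps informal material a small slice of the prompt.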
23. Retrieval Index Artifact
Store index metadata.
.aidocs/retrieval/
index-manifest.v1.json
lexical.sqlite
vectors.bin
embeddings.jsonl
chunks.jsonl
graph-expansion-cache.json
Manifest:
{
"version": 1,
"created_at": "2026-07-04T10:00:00Z",
"inputs": {
"scan": "sha256:...",
"symbols": "sha256:...",
"contracts": "sha256:...",
"docs": "sha256:...",
"km": "sha256:..."
},
"embedding": {
"provider": "local-or-remote",
"model": "text-embedding-model",
"dimensions": 1536
},
"chunk_count": 4920
}
Index manifest makes retrieval reproducible and debuggable.
24. Incremental Indexing
Do not rebuild all embeddings on every run.
Use text hash.
if chunk.textHash unchanged and embedding.model unchanged:
reuse embedding
else:
recompute embedding
Index invalidation inputs:
- source file hash changed,
- docs page hash changed,
- Logseq page hash changed,
- OpenNote note hash changed,
- chunker version changed,
- embedding model changed,
- visibility metadata changed,
- source ref changed.
Chunker version is important. If chunking logic changes, all chunk IDs or boundaries may change.
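The reuse rule can be sketched as a planning step that splits chunks into those whose stored embedding is still valid and those that must be recomputed. `StoredEmbedding`, `ChunkState`, and `planEmbeddings` are hypothetical names, and text hashes are assumed to be computed elsewhere.

```typescript
type StoredEmbedding = { chunkId: string; model: string; textHash: string };
type ChunkState = { chunkId: string; textHash: string };
type EmbeddingPlan = { reuse: string[]; recompute: string[] };

function planEmbeddings(
  chunks: ChunkState[],
  stored: StoredEmbedding[],
  model: string
): EmbeddingPlan {
  const byChunk = new Map<string, StoredEmbedding>();
  for (const record of stored) byChunk.set(record.chunkId, record);

  const plan: EmbeddingPlan = { reuse: [], recompute: [] };
  for (const chunk of chunks) {
    const record = byChunk.get(chunk.chunkId);
    // Reuse only when both the text and the embedding model are unchanged.
    const valid =
      record !== undefined && record.textHash === chunk.textHash && record.model === model;
    (valid ? plan.reuse : plan.recompute).push(chunk.chunkId);
  }
  return plan;
}
```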
25. Retrieval CLI
Command surface:
aidocs retrieve "How does authentication work?"
Search everything allowed.
aidocs retrieve --page docs/guides/authentication.mdx
Retrieve for page generation.
aidocs retrieve --symbol AuthService --explain
Explain why results match.
aidocs retrieve --kind example --contract "POST /oauth/token"
Find examples for endpoint.
aidocs retrieve --target-visibility public
Apply visibility filter.
aidocs retrieve --rebuild-index
Rebuild retrieval index.
aidocs retrieve --coverage page-spec.json
Coverage-driven retrieval.
26. Explainability
Every result should show why it was selected.
Example CLI output:
1. POST /oauth/token
kind: contract
score: 0.94
authority: 0.96
source: openapi/auth.yaml#/paths/~1oauth~1token/post
reasons:
- exact endpoint match
- linked to TokenController.issueToken
- required by page section token issuance
2. refresh-token-expiry test
kind: test_case
score: 0.88
source: tests/auth/refresh-token.test.ts:42-91
reasons:
- covers required section refresh behavior
- recent source hash
- validates expiry behavior
If users cannot understand retrieval, they cannot trust generated docs.
27. Retrieval for Verification
Verifier uses retrieval differently than generator.
Claim:
Refresh tokens expire after 30 days.
Verification query:
{
"task": "verify_claim",
"naturalLanguage": "Refresh tokens expire after 30 days",
"requiredKinds": ["source_snippet", "config", "contract", "test_case"],
"freshness": "strict"
}
The verifier should retrieve:
- auth.config.ts,
- TokenSettings.refreshTokenTtlDays,
- test that asserts 30 days,
- docs source ref if already reviewed.
If retrieval finds no source, claim becomes unsupported.
28. Retrieval for Drift Detection
Drift detector asks:
Which docs/notes depend on changed source X?
This is reverse retrieval.
Inputs:
- changed file,
- changed symbol,
- changed contract path,
- changed config key.
Output:
- affected docs pages,
- affected notes,
- affected examples,
- affected prompt bundles,
- required regeneration tasks.
Implementation:
source ref index + graph edges + chunk metadata
This is why every chunk must preserve source refs.
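A minimal sketch of the source ref index half of this lookup, without graph edges. It assumes source refs use a `file#fragment` form and indexes them by file path; `RefChunk`, `buildSourceRefIndex`, and `affectedBy` are hypothetical names, and refs with line ranges would need extra normalization.

```typescript
type RefChunk = { id: string; sourceRefs: string[] };

// Strips a "#fragment" suffix so refs are indexed by file path.
function fileOf(ref: string): string {
  const hashAt = ref.indexOf("#");
  return hashAt === -1 ? ref : ref.slice(0, hashAt);
}

// Reverse index: file path -> ids of chunks that cite it.
function buildSourceRefIndex(chunks: RefChunk[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const chunk of chunks) {
    for (const ref of chunk.sourceRefs) {
      const file = fileOf(ref);
      const ids = index.get(file) || [];
      if (ids.indexOf(chunk.id) === -1) ids.push(chunk.id);
      index.set(file, ids);
    }
  }
  return index;
}

function affectedBy(changedFiles: string[], index: Map<string, string[]>): string[] {
  const affected: string[] = [];
  for (const file of changedFiles) {
    for (const id of index.get(file) || []) {
      if (affected.indexOf(id) === -1) affected.push(id);
    }
  }
  return affected;
}
```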
29. Retrieval for Repair
When verifier finds issue, repair uses targeted retrieval.
Issue:
Unsupported claim: API supports password grant.
Repair retrieval:
retrieve OAuth token endpoint contract
retrieve auth config
retrieve tests for token grant types
retrieve previous docs section
Then prompt:
Repair only this section.
Use retrieved evidence.
Remove unsupported claim if no evidence.
Repair should not regenerate full page unless necessary.
30. Retrieval Evaluation
You need evaluation fixtures.
Gold cases:
{
"query": "How to refresh an access token?",
"expected": [
"openapi/auth.yaml#/paths/~1oauth~1token/post",
"tests/auth/refresh-token.test.ts",
"docs/guides/authentication.mdx#refresh-tokens"
],
"forbidden": [
"logseq/journals/old-oauth-idea.md"
]
}
Metrics:
- recall@k for required evidence,
- precision@k,
- forbidden result rate,
- stale evidence rate,
- visibility violation rate,
- coverage completeness,
- explanation quality,
- retrieval latency,
- token packing efficiency.
For docs generation, retrieval quality matters more than raw semantic score.
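Three of these metrics can be computed directly from a gold case and a ranked result list. A minimal sketch, assuming results are identified by the same ref strings used in the fixture; `GoldCase`, `EvalResult`, and `evaluateCase` are hypothetical names.

```typescript
type GoldCase = { query: string; expected: string[]; forbidden: string[] };
type EvalResult = { recallAtK: number; precisionAtK: number; forbiddenHits: number };

function evaluateCase(gold: GoldCase, retrieved: string[], k: number): EvalResult {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => gold.expected.indexOf(id) !== -1).length;
  const forbiddenHits = topK.filter((id) => gold.forbidden.indexOf(id) !== -1).length;
  return {
    // Share of required evidence that appears in the top k.
    recallAtK: gold.expected.length === 0 ? 1 : hits / gold.expected.length,
    // Share of the top k that is required evidence.
    precisionAtK: topK.length === 0 ? 0 : hits / topK.length,
    forbiddenHits,
  };
}
```

For public docs, any nonzero `forbiddenHits` is best treated as a test failure rather than a score to average.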
31. Failure Modes
31.1 Semantic Similarity Trap
A note sounds similar but refers to another product.
Fix:
- project/version metadata,
- source refs,
- graph neighborhood check.
31.2 Exact Identifier Miss
Embedding search misses X-Request-Id.
Fix:
- lexical index,
- identifier fields,
- symbol-aware query expansion.
31.3 Stale Docs Win
Old docs rank high because they are well-written.
Fix:
- freshness penalty,
- source hash validation,
- authority model.
31.4 Private Note Leak
Internal note appears in public generation.
Fix:
- hard visibility filter before ranking.
31.5 Retrieval Flood
Too many chunks crowd out important evidence.
Fix:
- coverage-driven selection,
- redundancy control,
- token budget buckets.
31.6 Graph Expansion Explosion
One node expands to the entire repo.
Fix:
- max depth,
- edge allowlist,
- authority threshold,
- token budget integration.
32. Minimal Implementation Roadmap
Build in this order:
- Define RetrievalChunk schema.
- Build chunkers for docs sections.
- Build chunkers for OpenAPI operations.
- Build chunkers for examples/tests.
- Build chunkers for Logseq pages/blocks.
- Build chunkers for OpenNote notes/chunks.
- Build lexical index.
- Add metadata filters.
- Add source ref index.
- Add graph expansion from knowledge graph.
- Add semantic embeddings.
- Add hybrid scorer.
- Add coverage-driven retrieval.
- Add explainable retrieval output.
- Connect retrieval to context compiler.
- Connect retrieval to verifier.
- Connect retrieval to drift detector.
- Add evaluation fixtures.
Do not start by buying a vector database.
Start with chunk schema, metadata, lexical search, and source refs.
33. What We Have Built in This Part
We have designed a retrieval layer for an AI-driven documentation generator.
Its main components:
retrieval query model
retrieval result model
chunking strategy
lexical index
semantic index
graph expansion
hybrid ranking
authority/freshness/visibility filters
coverage-driven retrieval
retrieval plan
incremental indexing
retrieval evaluation
The key mental model:
Retrieval is not “find similar text”. Retrieval is evidence selection under authority, freshness, visibility, coverage, and token constraints.
With this layer, the context compiler no longer works blindly. It can pull the right evidence from source, docs, examples, notes, and the graph.
The next part moves into Phase 8: CLI Application Architecture. We will start designing the CLI application structure concretely: command handlers, application services, domain model, infrastructure adapters, config loader, plugin boundary, error model, logging, tracing, and exit code design.
References
- DocPrompting paper: https://arxiv.org/abs/2207.05987
- Code2Prompt repository: https://github.com/mufeedvh/code2prompt
- Logseq repository: https://github.com/logseq/logseq
- OpenNote repository: https://github.com/opennote-org/opennote
- CodeSearchNet paper: https://arxiv.org/abs/1909.09436
- SQLite FTS5: https://www.sqlite.org/fts5.html
- Tantivy search engine: https://github.com/quickwit-oss/tantivy