title: Build From Scratch: Mintlify-like AI-driven Documentation Generator CLI - Part 038
description: Build a retrieval layer for source-grounded documentation generation across docs, notes, examples, contracts, source snippets, and knowledge graph relations.
series: learn-ai-docs-km-cli
seriesTitle: Build From Scratch: Mintlify-like AI-driven Documentation Generator CLI with Code2Prompt and Open-source Knowledge Management
order: 38
partTitle: Retrieval Layer for Docs and Notes
tags:

  • ai-docs
  • documentation
  • cli
  • retrieval
  • rag
  • semantic-search
  • knowledge-graph
  • source-grounding
  • context-engine

date: 2026-07-04

Part 038 — Retrieval Layer for Docs and Notes

In the previous part we built bidirectional docs and notes sync.

Now we move to the layer that makes the whole system much smarter:

A retrieval layer for docs, notes, graph, examples, contracts, and source snippets.

Without retrieval, the context compiler can only rely on static heuristics:

take files that look relevant
take the README
take the OpenAPI spec
take the tests

That is enough for a small repo.

For large repos, monorepos, multi-service platforms, and enterprise documentation, that approach fails.

We need a retrieval layer that can answer:

  • which source is most relevant to this page?
  • which note has already explained this concept?
  • which example is the most authoritative?
  • which contract should win when there is a conflict?
  • which graph relations help expand the context?
  • which chunks are safe to put into the prompt?
  • which ones are stale?
  • which ones are public, internal, or private?

This part covers the design of a retrieval layer that is source-grounded, hybrid, explainable, cacheable, and safe for documentation generation.


1. Mental Model: Retrieval Is Context Selection Under Constraints

Retrieval is not just search.

In an AI docs CLI system, retrieval is the process of selecting evidence for a specific task.

query + task + page spec + source authority + visibility + token budget
  -> ranked evidence set

We do not pick a document just because it is “similar by embedding”.

We pick a document because it:

  • is relevant to the task,
  • has enough authority,
  • is fresh,
  • is visible for the target output,
  • is not redundant,
  • helps satisfy the page contract,
  • can be verified.

Bad retrieval produces bad docs even when the model is good.


2. Why Retrieval Matters for Code and Docs

An LLM does not always know the latest library, API, or internal codebase.

The paper DocPrompting: Generating Code by Retrieving the Docs makes an important point: programmers consult documentation when they use an API or library, and code generation systems can improve by retrieving the relevant documentation before generating output.

For our documentation generator, the principle is broader:

retrieve relevant source/docs/notes/examples first
then generate grounded docs from retrieved evidence

Without retrieval, the generator tends to:

  • rely on incorrect prior knowledge,
  • invent behavior,
  • miss edge cases,
  • mix old docs with new source,
  • include too much irrelevant context.

3. Retrieval Sources

We have many sources.

source files
symbols
contracts
examples
tests
configs
migrations
existing docs
Logseq pages
OpenNote notes/chunks
knowledge graph nodes/edges
review comments
CI/runbook artifacts

Not all sources are equal.

| Source | Usefulness | Risk | Default Authority |
|---|---|---|---|
| OpenAPI / schemas | Very high | Medium if stale | High |
| source code | Very high | Medium if complex | High |
| tests | High | Low-medium | High |
| examples | High | Medium if outdated | Medium-high |
| reviewed docs | High | Medium if stale | Medium-high |
| Logseq notes | Medium-high | High if unreviewed | Medium-low |
| OpenNote semantic notes | Medium-high | High if unreviewed | Medium-low |
| generated summaries | Medium | High | Low |
| review comments | Medium | Medium | Contextual |

The retrieval layer must carry source authority in every result.


4. Retrieval Query Model

A retrieval query is not just a string.

type RetrievalQuery = {
  id: string;
  task:
    | "generate_page"
    | "verify_claim"
    | "repair_page"
    | "find_examples"
    | "detect_drift"
    | "answer_dev_question"
    | "sync_notes";

  naturalLanguage?: string;
  symbols?: string[];
  paths?: string[];
  contracts?: string[];
  concepts?: string[];
  targetPage?: string;
  visibility: "public" | "internal" | "private";
  requiredKinds?: RetrievalKind[];
  forbiddenKinds?: RetrievalKind[];
  freshness?: "strict" | "normal" | "relaxed";
  tokenBudget?: number;
};

Example:

{
  "task": "generate_page",
  "targetPage": "docs/guides/authentication.mdx",
  "naturalLanguage": "Write the authentication guide",
  "symbols": ["AuthService", "TokenController"],
  "contracts": ["openapi:POST /oauth/token"],
  "visibility": "public",
  "requiredKinds": ["contract", "source", "example", "reviewed_doc"],
  "forbiddenKinds": ["private_note", "unreviewed_ai_summary"],
  "tokenBudget": 12000
}

Task-aware retrieval beats generic search.


5. Retrieval Result Model

Every result needs more than text.

type RetrievalResult = {
  id: string;
  kind: RetrievalKind;
  title?: string;
  text: string;
  sourceRefs: SourceRef[];
  authority: number;
  relevance: number;
  freshness: number;
  confidence: number;
  visibility: "public" | "internal" | "private";
  ownership: "source" | "human" | "generated" | "hybrid";
  stale: boolean;
  riskFlags: string[];
  relations: RetrievalRelation[];
  score: number;
  explanation: string[];
};

A retrieval result must answer:

Why are you in the context?
Can you be trusted?
Can you be shown to this target?
Are you stale?
What source backs you?

6. Retrieval Kinds

type RetrievalKind =
  | "source_snippet"
  | "symbol"
  | "contract"
  | "schema"
  | "example"
  | "test_case"
  | "config"
  | "migration"
  | "doc_section"
  | "logseq_block"
  | "opennote_chunk"
  | "graph_node"
  | "graph_edge"
  | "runbook_step"
  | "review_comment";

Different kinds need different indexing and scoring.

An OpenAPI operation should not be ranked the same way as a Logseq journal block.


7. Hybrid Retrieval

We use hybrid retrieval.

lexical search + semantic search + graph expansion + authority filtering + recency filtering

Why not semantic only?

Because code/docs contain exact identifiers:

  • AuthService,
  • POST /oauth/token,
  • x-request-id,
  • InvoiceCreated,
  • retry_after_ms,
  • docs.json.

Embedding search can miss exact identifiers.

Why not lexical only?

Because docs/notes use natural language:

  • “token renewal”,
  • “refresh session”,
  • “reauthentication”,
  • “credential rotation”.

Semantic search can connect related phrasing.

Why graph expansion?

Because if we retrieve AuthService, we also need related endpoints, configs, tests, and concepts.


8. Lexical Index

Lexical index supports exact match and keyword search.

Index fields:

id
title
path
symbols
identifiers
headings
body
source refs
tags
relations

Implementation options:

  • SQLite FTS5,
  • Tantivy,
  • Meilisearch,
  • local inverted index,
  • ripgrep-backed search for minimal version.

Minimal local index schema:

CREATE VIRTUAL TABLE retrieval_fts USING fts5(
  id UNINDEXED,
  kind UNINDEXED,
  title,
  path,
  identifiers,
  body,
  tokenize = 'unicode61'
);

Query examples:

SELECT id, bm25(retrieval_fts) AS rank
FROM retrieval_fts
WHERE retrieval_fts MATCH 'AuthService OR oauth token'
ORDER BY rank;

Lexical retrieval is excellent for exact code symbols and endpoint paths.
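
Of the implementation options listed above, a local inverted index is the smallest one to sketch. The following is only an illustration under simple assumptions, not the implementation used by this series: `tokenize`, `InvertedIndex`, and the sample chunk IDs are hypothetical names. It keeps whole identifiers such as `x-request-id` as single tokens and also indexes their parts, so exact and partial keyword lookups both work.

```typescript
// Illustrative sketch: a tiny in-memory inverted index. All names are hypothetical.
type IndexedDoc = { id: string; body: string };

// Keep whole identifiers ("x-request-id", "/oauth/token") as tokens and also
// index their alphanumeric parts, so both exact and partial keywords match.
function tokenize(text: string): string[] {
  const tokens: string[] = [];
  for (const raw of text.toLowerCase().split(/\s+/)) {
    // Strip surrounding punctuation but keep a leading "/" for endpoint paths.
    const word = raw.replace(/^[^a-z0-9\/_]+|[^a-z0-9_]+$/g, "");
    if (!word) continue;
    tokens.push(word);
    for (const part of word.split(/[^a-z0-9]+/)) {
      if (part && part !== word) tokens.push(part);
    }
  }
  return tokens;
}

class InvertedIndex {
  private postings = new Map<string, Set<string>>();

  add(doc: IndexedDoc): void {
    for (const token of tokenize(doc.body)) {
      const ids = this.postings.get(token) ?? new Set<string>();
      ids.add(doc.id);
      this.postings.set(token, ids);
    }
  }

  // Returns the sorted IDs of docs that contain every query token (AND semantics).
  search(query: string): string[] {
    const sets = tokenize(query).map((t) => this.postings.get(t) ?? new Set<string>());
    if (sets.length === 0) return [];
    const [first, ...rest] = sets;
    return Array.from(first).filter((id) => rest.every((s) => s.has(id))).sort();
  }
}

const index = new InvertedIndex();
index.add({ id: "chunk:contract", body: "POST /oauth/token requires the x-request-id header" });
index.add({ id: "chunk:note", body: "Notes about token renewal" });

console.log(index.search("x-request-id")); // ["chunk:contract"]
console.log(index.search("token"));        // ["chunk:contract", "chunk:note"]
```

There is no ranking here; a real index would add BM25-style scoring, which SQLite FTS5 provides out of the box.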


9. Semantic Index

Semantic index supports meaning-based search.

Chunk embedding record:

type EmbeddingRecord = {
  chunkId: string;
  model: string;
  dimensions: number;
  vector: number[];
  sourceHash: string;
  textHash: string;
  createdAt: string;
};

Semantic query:

"How does token refresh work?"

May retrieve:

  • RefreshTokenService,
  • POST /oauth/token,
  • Token expiry configuration,
  • Logseq note “Session renewal behavior”,
  • test case “does not refresh expired refresh token”.

Semantic retrieval must still obey:

  • visibility,
  • source freshness,
  • authority,
  • source refs,
  • token budget.
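
As a rough sketch of the first constraint, the snippet below runs a brute-force cosine search and applies the visibility filter before any similarity is computed. It assumes toy 3-dimensional vectors in place of real embeddings; `EmbeddedChunk`, `semanticSearch`, and the chunk IDs are hypothetical. Freshness, authority, and token budget checks are left out to keep it short.

```typescript
// Illustrative sketch: brute-force cosine search with visibility filtered first.
type Visibility = "public" | "internal" | "private";
type EmbeddedChunk = { chunkId: string; vector: number[]; visibility: Visibility };

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function isVisible(chunk: Visibility, target: Visibility): boolean {
  if (target === "public") return chunk === "public";
  if (target === "internal") return chunk !== "private";
  return true;
}

function semanticSearch(
  queryVector: number[],
  chunks: EmbeddedChunk[],
  target: Visibility,
  topK: number,
): { chunkId: string; similarity: number }[] {
  return chunks
    .filter((c) => isVisible(c.visibility, target)) // hard filter, before scoring
    .map((c) => ({ chunkId: c.chunkId, similarity: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);
}

// Toy vectors stand in for real embeddings.
const chunks: EmbeddedChunk[] = [
  { chunkId: "symbol:RefreshTokenService", vector: [0.9, 0.1, 0], visibility: "public" },
  { chunkId: "note:private-incident", vector: [1, 0, 0], visibility: "private" },
  { chunkId: "config:billing", vector: [0, 1, 0], visibility: "public" },
];
const hits = semanticSearch([1, 0, 0], chunks, "public", 2);
console.log(hits.map((h) => h.chunkId));
// ["symbol:RefreshTokenService", "config:billing"]
```

The private note is the closest vector to the query, yet it never reaches the ranking step.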

10. Chunking Strategy

Bad chunking destroys retrieval.

Chunk types:

file chunk
symbol chunk
heading section chunk
OpenAPI operation chunk
schema chunk
test episode chunk
example chunk
note block chunk
graph neighborhood chunk

Rules:

  • do not split code symbol randomly,
  • keep heading + body together,
  • keep request/response examples together,
  • keep test setup/action/assertion together,
  • preserve source refs,
  • include metadata in chunk,
  • avoid huge chunks that dominate token budget.

Example chunk:

{
  "id": "chunk:openapi:post-oauth-token",
  "kind": "contract",
  "title": "POST /oauth/token",
  "text": "Issues an access token...",
  "source_refs": ["openapi/auth.yaml#/paths/~1oauth~1token/post"],
  "symbols": ["TokenController", "TokenResponse"],
  "visibility": "public",
  "authority": 0.95
}

11. Code-aware Chunking

Code needs special treatment.

For source files, chunk by:

  • exported symbol,
  • class,
  • function,
  • endpoint handler,
  • config object,
  • test case,
  • migration unit.

Avoid arbitrary fixed-size chunks as primary strategy.

Example:

type CodeChunk = {
  id: string;
  file: string;
  language: string;
  symbolId?: string;
  startLine: number;
  endLine: number;
  signature?: string;
  docComment?: string;
  body: string;
  imports: string[];
  outgoingRelations: string[];
};

For large function/class:

signature + comments + relevant branches + called symbols summary

Do not always include full body.


12. Docs-aware Chunking

For MDX docs:

  • split by heading sections,
  • keep frontmatter metadata,
  • preserve internal links,
  • preserve code fences,
  • preserve callout type,
  • exclude generated metadata comments from semantic text,
  • include source refs.

Example:

{
  "id": "chunk:docs:auth-guide:refresh-token",
  "kind": "doc_section",
  "title": "Refresh tokens",
  "path": "docs/guides/authentication.mdx",
  "headingPath": ["Authentication", "Refresh tokens"],
  "text": "Refresh tokens are used to...",
  "source_refs": ["auth.config.ts", "openapi/auth.yaml"],
  "reviewStatus": "accepted"
}

Reviewed docs are more authoritative than unreviewed generated docs.
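
The first and fourth rules (split by heading sections, preserve code fences) could be sketched as follows. This is a simplified illustration, not a full MDX parser: `chunkByHeadings` is a hypothetical name, text before the first heading (such as frontmatter) is ignored, and only `#`-style headings are recognized.

```typescript
// Illustrative sketch: split Markdown/MDX into heading sections. Names are hypothetical.
type DocSectionChunk = { headingPath: string[]; text: string };

function chunkByHeadings(markdown: string): DocSectionChunk[] {
  const chunks: DocSectionChunk[] = [];
  const path: string[] = [];
  let current: DocSectionChunk | null = null;
  let inFence = false;

  for (const line of markdown.split("\n")) {
    // Toggle on every fence line so "#" inside code is never read as a heading.
    if (/^\s*`{3}/.test(line)) inFence = !inFence;
    const heading = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      if (current) chunks.push(current);
      path.splice(heading[1].length - 1); // drop same-level and deeper headings
      path.push(heading[2].trim());
      current = { headingPath: path.slice(), text: "" };
    } else if (current) {
      current.text += line + "\n";
    }
  }
  if (current) chunks.push(current);
  return chunks.map((c) => ({ headingPath: c.headingPath, text: c.text.trim() }));
}

const fence = "`".repeat(3); // built at runtime to avoid a literal fence in this example
const sample = [
  "# Authentication",
  "Intro text.",
  "## Refresh tokens",
  "Refresh tokens are used to...",
  fence + "bash",
  "# a shell comment, not a heading",
  "curl -X POST /oauth/token",
  fence,
].join("\n");

const sections = chunkByHeadings(sample);
console.log(sections.map((s) => s.headingPath.join(" > ")));
// ["Authentication", "Authentication > Refresh tokens"]
```

The heading and its body stay together in one chunk, and the code fence travels with the section that explains it.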


13. Note-aware Chunking

For Logseq:

  • page properties become metadata,
  • blocks become chunks,
  • nested blocks preserve context path,
  • [[Page References]] become relations,
  • tags become metadata,
  • TODO/DONE state is preserved,
  • journal pages have lower authority by default.

For OpenNote:

  • note metadata becomes document metadata,
  • semantic chunks may already exist,
  • relations are loaded if exported,
  • embedding metadata must include model/version.

Notes need stronger safety filters because they are often informal.
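
A minimal sketch of the Logseq rules above, assuming pages are stored as a Markdown outline with `- ` bullets and space indentation. `chunkLogseqPage` and `LogseqBlockChunk` are hypothetical names; property lines, continuation lines, and journal authority handling are omitted.

```typescript
// Illustrative sketch: turn a Logseq-style Markdown outline into block chunks.
type LogseqBlockChunk = {
  text: string;
  contextPath: string[]; // ancestor block texts, outermost first
  pageRefs: string[];    // from [[Page References]]
  tags: string[];        // from #tags
  taskState?: "TODO" | "DONE";
};

// Collect capture group 1 for every match of a global pattern.
function captures(text: string, pattern: RegExp): string[] {
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) out.push(m[1]);
  return out;
}

function chunkLogseqPage(markdown: string): LogseqBlockChunk[] {
  const chunks: LogseqBlockChunk[] = [];
  const stack: { indent: number; text: string }[] = [];

  for (const line of markdown.split("\n")) {
    const bullet = /^(\s*)- (.*)$/.exec(line);
    if (!bullet) continue; // non-bullet lines are skipped in this sketch
    const indent = bullet[1].length;
    // Pop siblings and deeper blocks so the stack holds only ancestors.
    while (stack.length > 0 && stack[stack.length - 1].indent >= indent) stack.pop();

    const state = /^(TODO|DONE)\s+/.exec(bullet[2]);
    const text = (state ? bullet[2].slice(state[0].length) : bullet[2]).trim();
    const chunk: LogseqBlockChunk = {
      text,
      contextPath: stack.map((s) => s.text),
      pageRefs: captures(text, /\[\[([^\]]+)\]\]/g),
      tags: captures(text, /(?:^|\s)#([\w-]+)/g),
    };
    if (state) chunk.taskState = state[1] as "TODO" | "DONE";
    chunks.push(chunk);
    stack.push({ indent, text });
  }
  return chunks;
}

const page = [
  "- Auth design #auth",
  "  - TODO Document [[Refresh Token]] expiry",
  "  - DONE Review [[OpenAPI]] spec",
].join("\n");

const blocks = chunkLogseqPage(page);
console.log(blocks[1].contextPath, blocks[1].pageRefs, blocks[1].taskState);
// ["Auth design #auth"] ["Refresh Token"] "TODO"
```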


14. Knowledge Graph Expansion

Retrieval should use graph expansion after initial hits.

Example:

Initial hit:

symbol:AuthService

Graph expansion adds:

endpoint:POST /oauth/token
config:auth.token.ttl
test:refresh-token-expiry
concept:Bearer Token
doc:Authentication Guide

Expansion policy:

type ExpansionPolicy = {
  maxDepth: number;
  allowedEdges: string[];
  maxNodes: number;
  requireAuthorityAbove: number;
};

For docs generation, use conservative expansion:

depth 1-2
only high-confidence edges
exclude private/internal if target public
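
One way to apply an `ExpansionPolicy` is a bounded breadth-first walk. The sketch below is an assumption about how that could look, not a prescribed algorithm: `expandGraph`, the `GraphNode`/`GraphEdge` shapes, and the edge type names are hypothetical, edges are followed in one direction only, and the visibility rule is left out for brevity.

```typescript
// Illustrative sketch: bounded breadth-first graph expansion under a policy.
type ExpansionPolicy = {
  maxDepth: number;
  allowedEdges: string[];
  maxNodes: number;
  requireAuthorityAbove: number;
};
type GraphNode = { id: string; authority: number };
type GraphEdge = { from: string; to: string; type: string };

function expandGraph(
  seeds: string[],
  nodes: Map<string, GraphNode>,
  edges: GraphEdge[],
  policy: ExpansionPolicy,
): string[] {
  const visited = new Set<string>(seeds);
  const allowed = new Set<string>(policy.allowedEdges);
  const added: string[] = [];
  let frontier = new Set<string>(seeds);

  for (let depth = 1; depth <= policy.maxDepth && frontier.size > 0; depth++) {
    const next = new Set<string>();
    for (const edge of edges) {
      if (!frontier.has(edge.from) || !allowed.has(edge.type)) continue;
      const node = nodes.get(edge.to);
      if (!node || visited.has(node.id)) continue;
      if (node.authority <= policy.requireAuthorityAbove) continue;
      if (added.length >= policy.maxNodes) return added; // hard cap on expansion
      visited.add(node.id);
      added.push(node.id);
      next.add(node.id);
    }
    frontier = next;
  }
  return added;
}

const nodeList: GraphNode[] = [
  { id: "symbol:AuthService", authority: 0.9 },
  { id: "endpoint:POST /oauth/token", authority: 0.95 },
  { id: "config:auth.token.ttl", authority: 0.9 },
  { id: "note:draft-idea", authority: 0.3 },
  { id: "test:refresh-token-expiry", authority: 0.85 },
];
const nodes = new Map<string, GraphNode>(
  nodeList.map((n) => [n.id, n] as [string, GraphNode]),
);
const edges: GraphEdge[] = [
  { from: "symbol:AuthService", to: "endpoint:POST /oauth/token", type: "handles" },
  { from: "symbol:AuthService", to: "config:auth.token.ttl", type: "configured_by" },
  { from: "symbol:AuthService", to: "note:draft-idea", type: "mentioned_in" },
  { from: "endpoint:POST /oauth/token", to: "test:refresh-token-expiry", type: "tested_by" },
];
const expanded = expandGraph(["symbol:AuthService"], nodes, edges, {
  maxDepth: 2,
  allowedEdges: ["handles", "configured_by", "tested_by", "mentioned_in"],
  maxNodes: 10,
  requireAuthorityAbove: 0.6,
});
console.log(expanded);
// ["endpoint:POST /oauth/token", "config:auth.token.ttl", "test:refresh-token-expiry"]
```

The low-authority draft note is reachable by an allowed edge but is dropped by the authority threshold.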

15. Score Formula

Simple score:

score =
  lexicalScore * 0.25 +
  semanticScore * 0.25 +
  graphScore * 0.20 +
  authorityScore * 0.20 +
  freshnessScore * 0.10 -
  riskPenalty

But weights should be task-specific.

For claim verification:

authority > lexical > graph > semantic

For concept discovery:

semantic > graph > lexical > authority

For examples:

example validity > source authority > relevance > freshness

For public docs generation:

visibility safety is a hard filter, not just a score
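
The task-specific weighting can be sketched as a lookup of weight profiles. Only the `default` row comes from the formula above; the `verify_claim` and `concept_discovery` numbers are assumed values chosen to respect the stated orderings, and the profile names are hypothetical. The visibility hard filter is not modeled here because it should run before scoring.

```typescript
// Illustrative sketch: one scoring function, several weight profiles.
type Signals = {
  lexical: number;
  semantic: number;
  graph: number;
  authority: number;
  freshness: number;
  riskPenalty: number;
};
type Weights = Omit<Signals, "riskPenalty">;
type Profile = "default" | "verify_claim" | "concept_discovery";

const WEIGHTS: Record<Profile, Weights> = {
  // Taken from the simple formula in this section.
  default: { lexical: 0.25, semantic: 0.25, graph: 0.2, authority: 0.2, freshness: 0.1 },
  // Assumed values: authority > lexical > graph > semantic.
  verify_claim: { lexical: 0.3, semantic: 0.1, graph: 0.15, authority: 0.4, freshness: 0.05 },
  // Assumed values: semantic > graph > lexical > authority.
  concept_discovery: { lexical: 0.2, semantic: 0.4, graph: 0.25, authority: 0.1, freshness: 0.05 },
};

function score(s: Signals, profile: Profile): number {
  const w = WEIGHTS[profile];
  return (
    s.lexical * w.lexical +
    s.semantic * w.semantic +
    s.graph * w.graph +
    s.authority * w.authority +
    s.freshness * w.freshness -
    s.riskPenalty
  );
}

const contract: Signals = { lexical: 0.9, semantic: 0.4, graph: 0.5, authority: 0.95, freshness: 1, riskPenalty: 0 };
const note: Signals = { lexical: 0.3, semantic: 0.95, graph: 0.7, authority: 0.3, freshness: 0.8, riskPenalty: 0 };

// The contract wins for verification; the note wins for concept discovery.
console.log(score(contract, "verify_claim") > score(note, "verify_claim"));           // true
console.log(score(note, "concept_discovery") > score(contract, "concept_discovery")); // true
```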

16. Authority as Hard Constraint

If a low-authority note conflicts with a high-authority contract, contract wins.

Retrieval should surface the conflict, not average the two sources.

Example:

Logseq note: API uses cursor pagination
OpenAPI spec: API uses page/limit

Output should be:

{
  "result": "conflict",
  "preferred": "openapi/billing.yaml#/paths/~1invoices/get",
  "conflicting": "logseq/pages/API___Billing.md#pagination",
  "reason": "contract has higher authority than unreviewed note"
}

Do not blend contradictory evidence into one summary.


17. Freshness and Staleness

Every retrieval result should know whether it is stale.

Freshness signals:

  • source hash matches current scan,
  • source ref still exists,
  • generated artifact built after source change,
  • note synced after docs update,
  • contract version matches target version,
  • branch matches current branch.

Stale result example:

{
  "id": "chunk:docs:old-auth-guide",
  "stale": true,
  "reason": "source_ref_hash_changed",
  "sourceRef": "openapi/auth.yaml#/paths/~1oauth~1token/post"
}

Stale evidence can be used for drift detection, but not as authoritative input for new docs unless marked clearly.
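
The first two freshness signals (hash match and source ref existence) can be checked with a small comparison against the current scan. This is a sketch under assumptions: `checkStaleness` and `IndexedChunk` are hypothetical names, the hash strings are placeholders, and `source_ref_missing` is an invented reason code alongside the `source_ref_hash_changed` reason shown above.

```typescript
// Illustrative sketch: mark a chunk stale by comparing hashes with the current scan.
type IndexedChunk = { id: string; sourceRef: string; sourceHash: string };
type StalenessReport = { id: string; stale: boolean; reason?: string; sourceRef: string };

function checkStaleness(chunk: IndexedChunk, currentHashes: Map<string, string>): StalenessReport {
  const current = currentHashes.get(chunk.sourceRef);
  if (current === undefined) {
    // The referenced source no longer exists in the current scan.
    return { id: chunk.id, stale: true, reason: "source_ref_missing", sourceRef: chunk.sourceRef };
  }
  if (current !== chunk.sourceHash) {
    return { id: chunk.id, stale: true, reason: "source_ref_hash_changed", sourceRef: chunk.sourceRef };
  }
  return { id: chunk.id, stale: false, sourceRef: chunk.sourceRef };
}

// Placeholder hashes; a real scan would store content digests.
const currentScan = new Map<string, string>([
  ["openapi/auth.yaml#/paths/~1oauth~1token/post", "hash-v2"],
]);
const oldGuide: IndexedChunk = {
  id: "chunk:docs:old-auth-guide",
  sourceRef: "openapi/auth.yaml#/paths/~1oauth~1token/post",
  sourceHash: "hash-v1",
};
console.log(checkStaleness(oldGuide, currentScan).reason); // "source_ref_hash_changed"
```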


18. Visibility Filter

Retrieval must apply visibility before ranking.

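type Visibility = "public" | "internal" | "private";

// Public targets accept only public results; internal targets accept everything except private.
function canUse(result: RetrievalResult, targetVisibility: Visibility): boolean {
  if (targetVisibility === "public") return result.visibility === "public";
  if (targetVisibility === "internal") return result.visibility !== "private";
  return true; // private targets may use any result
}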

This prevents private notes from leaking into public docs.

Visibility filter is not optional.


19. Retrieval Plan

For page generation, retrieval should produce a plan.

{
  "query_id": "rq:auth-guide",
  "target": "docs/guides/authentication.mdx",
  "strategy": "hybrid_graph_authority",
  "required_coverage": [
    "overview",
    "authentication-flow",
    "token-request",
    "refresh-token",
    "errors",
    "examples"
  ],
  "selected": [
    "contract:POST /oauth/token",
    "source:TokenController.issueToken",
    "test:refresh-token-expiry",
    "example:curl-token-request"
  ],
  "omitted": [
    {
      "id": "logseq:journal:2026-06-incident",
      "reason": "visibility_internal_for_public_target"
    }
  ],
  "coverage_gaps": [
    "rate limit behavior"
  ]
}

The retrieval plan becomes the input to the context compiler.


20. Coverage-driven Retrieval

A page spec requires certain sections.

Example:

Authentication Guide requires:
- overview
- token issuance
- auth header format
- refresh behavior
- error responses
- examples

Retrieval should not just return top 20 chunks.

It should cover all required topics.

Algorithm:

for each required section:
  retrieve candidates
  rank candidates
  select high-authority evidence
  mark coverage status
combine evidence
remove duplicates
fit token budget

This avoids a common failure:

All retrieved chunks are about token issuance.
No evidence about refresh behavior.
Generator invents refresh behavior.

Coverage gaps should be explicit.
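
The per-section loop above might be sketched like this. It is a simplification: candidates arrive already tagged with the section they support (in practice each section would trigger its own retrieval), the `perSection` and `minAuthority` defaults are assumed values, and token budget fitting is left out. `coverageRetrieve` and the sample IDs are hypothetical.

```typescript
// Illustrative sketch: select evidence per required section and report gaps.
type Candidate = { id: string; section: string; authority: number; relevance: number };
type CoverageReport = { selected: string[]; gaps: string[] };

function coverageRetrieve(
  requiredSections: string[],
  candidates: Candidate[],
  perSection = 2,
  minAuthority = 0.6,
): CoverageReport {
  const selected = new Set<string>(); // a Set removes duplicates across sections
  const gaps: string[] = [];

  for (const section of requiredSections) {
    const ranked = candidates
      .filter((c) => c.section === section && c.authority >= minAuthority)
      .sort((a, b) => b.relevance - a.relevance)
      .slice(0, perSection);
    if (ranked.length === 0) gaps.push(section); // make the gap explicit
    ranked.forEach((c) => selected.add(c.id));
  }
  return { selected: Array.from(selected), gaps };
}

const report = coverageRetrieve(
  ["token issuance", "refresh behavior", "error responses"],
  [
    { id: "contract:POST /oauth/token", section: "token issuance", authority: 0.95, relevance: 0.9 },
    { id: "source:TokenController.issueToken", section: "token issuance", authority: 0.9, relevance: 0.8 },
    { id: "test:refresh-token-expiry", section: "refresh behavior", authority: 0.9, relevance: 0.85 },
    { id: "logseq:error-ideas", section: "error responses", authority: 0.3, relevance: 0.9 },
  ],
);
console.log(report.gaps); // ["error responses"]
```

The only candidate for error responses is a low-authority note, so the section is reported as a gap instead of being silently filled.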


21. Redundancy Control

Retrieval often returns many similar chunks.

Examples:

  • README and docs say same thing,
  • Logseq note mirrors docs,
  • OpenNote chunk mirrors Logseq block,
  • tests and examples overlap.

We need deduplication.

Signals:

  • same source refs,
  • same stable entity ID,
  • high text similarity,
  • generated-from relation,
  • same hash.

Dedup policy:

prefer higher authority
prefer fresher
prefer reviewed
prefer shorter if enough
preserve one alternate source if useful for verification
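
A compact sketch of that preference order, assuming duplicates are grouped by stable entity ID only (one of the signals listed above). `dedupe`, `prefer`, and the `Evidence` shape are hypothetical, and the rule about keeping one alternate source for verification is not implemented.

```typescript
// Illustrative sketch: keep the best result per stable entity ID.
type Evidence = {
  id: string;
  entityId: string;
  authority: number;
  freshness: number;
  reviewed: boolean;
  tokens: number;
};

// True when `a` should replace `b`: authority, then freshness, then reviewed, then shorter.
function prefer(a: Evidence, b: Evidence): boolean {
  if (a.authority !== b.authority) return a.authority > b.authority;
  if (a.freshness !== b.freshness) return a.freshness > b.freshness;
  if (a.reviewed !== b.reviewed) return a.reviewed;
  return a.tokens < b.tokens;
}

function dedupe(results: Evidence[]): Evidence[] {
  const best = new Map<string, Evidence>();
  for (const r of results) {
    const current = best.get(r.entityId);
    if (!current || prefer(r, current)) best.set(r.entityId, r);
  }
  return Array.from(best.values());
}

const deduped = dedupe([
  { id: "docs:auth-guide#refresh", entityId: "concept:refresh-token", authority: 0.8, freshness: 0.9, reviewed: true, tokens: 300 },
  { id: "logseq:refresh-notes", entityId: "concept:refresh-token", authority: 0.4, freshness: 0.9, reviewed: false, tokens: 200 },
  { id: "readme:auth", entityId: "concept:token-issuance", authority: 0.7, freshness: 0.6, reviewed: true, tokens: 500 },
]);
console.log(deduped.map((e) => e.id)); // ["docs:auth-guide#refresh", "readme:auth"]
```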

22. Token Budget Integration

Retrieval does not end with ranking.

It must fit into token budget.

Budget buckets:

instructions      15%
source evidence   45%
examples          15%
existing docs     10%
notes             5%
output contract   10%

For public docs generation, notes should usually be a small slice unless reviewed.

Packing algorithm:

select must-have evidence
add coverage evidence
add examples
add notes only if useful and allowed
compress lower-priority chunks
omit redundant chunks
emit omitted evidence diagnostics

Retrieval and context packing are separate but tightly connected.
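
To make the buckets concrete, here is a hedged sketch of a greedy packer that fills each bucket in priority order and records what it had to leave out. The percentages mirror the list above; `pack`, `PackItem`, the `bucket_budget_exceeded` reason, and the sample token counts are assumptions, and the compression step is not modeled.

```typescript
// Illustrative sketch: greedy packing into per-bucket token budgets.
const BUCKET_PERCENT = {
  instructions: 15,
  sourceEvidence: 45,
  examples: 15,
  existingDocs: 10,
  notes: 5,
  outputContract: 10,
} as const;

type Bucket = keyof typeof BUCKET_PERCENT;
// A lower priority number means more important.
type PackItem = { id: string; bucket: Bucket; tokens: number; priority: number };
type PackResult = { included: string[]; omitted: { id: string; reason: string }[] };

function pack(items: PackItem[], totalBudget: number): PackResult {
  const remaining = {} as Record<Bucket, number>;
  (Object.keys(BUCKET_PERCENT) as Bucket[]).forEach((b) => {
    remaining[b] = Math.floor((totalBudget * BUCKET_PERCENT[b]) / 100);
  });

  const result: PackResult = { included: [], omitted: [] };
  const ordered = items.slice().sort((a, b) => a.priority - b.priority);
  for (const item of ordered) {
    if (item.tokens <= remaining[item.bucket]) {
      remaining[item.bucket] -= item.tokens;
      result.included.push(item.id);
    } else {
      // Emit a diagnostic instead of dropping the item silently.
      result.omitted.push({ id: item.id, reason: "bucket_budget_exceeded" });
    }
  }
  return result;
}

// With a 12000-token budget, sourceEvidence gets 5400 tokens and notes get 600.
const packed = pack(
  [
    { id: "contract:POST /oauth/token", bucket: "sourceEvidence", tokens: 3000, priority: 1 },
    { id: "source:TokenController.issueToken", bucket: "sourceEvidence", tokens: 2000, priority: 2 },
    { id: "source:LegacyTokenHelper", bucket: "sourceEvidence", tokens: 1000, priority: 3 },
    { id: "logseq:auth-summary", bucket: "notes", tokens: 500, priority: 4 },
  ],
  12000,
);
console.log(packed.omitted);
// [{ id: "source:LegacyTokenHelper", reason: "bucket_budget_exceeded" }]
```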


23. Retrieval Index Artifact

Store index metadata.

.aidocs/retrieval/
  index-manifest.v1.json
  lexical.sqlite
  vectors.bin
  embeddings.jsonl
  chunks.jsonl
  graph-expansion-cache.json

Manifest:

{
  "version": 1,
  "created_at": "2026-07-04T10:00:00Z",
  "inputs": {
    "scan": "sha256:...",
    "symbols": "sha256:...",
    "contracts": "sha256:...",
    "docs": "sha256:...",
    "km": "sha256:..."
  },
  "embedding": {
    "provider": "local-or-remote",
    "model": "text-embedding-model",
    "dimensions": 1536
  },
  "chunk_count": 4920
}

Index manifest makes retrieval reproducible and debuggable.


24. Incremental Indexing

Do not rebuild all embeddings on every run.

Use text hash.

if chunk.textHash unchanged and embedding.model unchanged:
  reuse embedding
else:
  recompute embedding

Index invalidation inputs:

  • source file hash changed,
  • docs page hash changed,
  • Logseq page hash changed,
  • OpenNote note hash changed,
  • chunker version changed,
  • embedding model changed,
  • visibility metadata changed,
  • source ref changed.

Chunker version is important. If chunking logic changes, all chunk IDs or boundaries may change.
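
The reuse rule can be expressed as a small planning step that decides, per chunk, whether a cached embedding is still valid. This sketch only covers the text hash and model checks from the pseudocode; `planEmbeddingWork`, the cache shape, and the model names are hypothetical, and chunker version or visibility changes would need additional invalidation keys.

```typescript
// Illustrative sketch: decide which embeddings to reuse and which to recompute.
type TextChunk = { id: string; textHash: string };
type CachedEmbedding = { chunkId: string; model: string; textHash: string; vector: number[] };
type EmbeddingPlan = { reuse: string[]; recompute: string[] };

function planEmbeddingWork(
  chunks: TextChunk[],
  cache: Map<string, CachedEmbedding>,
  model: string,
): EmbeddingPlan {
  const plan: EmbeddingPlan = { reuse: [], recompute: [] };
  for (const chunk of chunks) {
    const cached = cache.get(chunk.id);
    const reusable =
      cached !== undefined && cached.textHash === chunk.textHash && cached.model === model;
    (reusable ? plan.reuse : plan.recompute).push(chunk.id);
  }
  return plan;
}

const embeddingCache = new Map<string, CachedEmbedding>([
  ["chunk:a", { chunkId: "chunk:a", model: "model-v1", textHash: "h1", vector: [0.1, 0.2] }],
  ["chunk:b", { chunkId: "chunk:b", model: "model-v1", textHash: "h2", vector: [0.3, 0.4] }],
]);
const plan = planEmbeddingWork(
  [
    { id: "chunk:a", textHash: "h1" },     // unchanged
    { id: "chunk:b", textHash: "h2-new" }, // text changed
    { id: "chunk:c", textHash: "h3" },     // never embedded
  ],
  embeddingCache,
  "model-v1",
);
console.log(plan); // { reuse: ["chunk:a"], recompute: ["chunk:b", "chunk:c"] }
```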


25. Retrieval CLI

Command surface:

aidocs retrieve "How does authentication work?"

Search everything allowed.

aidocs retrieve --page docs/guides/authentication.mdx

Retrieve for page generation.

aidocs retrieve --symbol AuthService --explain

Explain why results match.

aidocs retrieve --kind example --contract "POST /oauth/token"

Find examples for endpoint.

aidocs retrieve --target-visibility public

Apply visibility filter.

aidocs retrieve --rebuild-index

Rebuild retrieval index.

aidocs retrieve --coverage page-spec.json

Coverage-driven retrieval.


26. Explainability

Every result should show why it was selected.

Example CLI output:

1. POST /oauth/token
   kind: contract
   score: 0.94
   authority: 0.96
   source: openapi/auth.yaml#/paths/~1oauth~1token/post
   reasons:
     - exact endpoint match
     - linked to TokenController.issueToken
     - required by page section token issuance

2. refresh-token-expiry test
   kind: test_case
   score: 0.88
   source: tests/auth/refresh-token.test.ts:42-91
   reasons:
     - covers required section refresh behavior
     - recent source hash
     - validates expiry behavior

If users cannot understand retrieval, they cannot trust generated docs.


27. Retrieval for Verification

The verifier uses retrieval differently from the generator.

Claim:

Refresh tokens expire after 30 days.

Verification query:

{
  "task": "verify_claim",
  "naturalLanguage": "Refresh tokens expire after 30 days",
  "requiredKinds": ["source_snippet", "config", "contract", "test_case"],
  "freshness": "strict"
}

The verifier should retrieve:

  • auth.config.ts,
  • TokenSettings.refreshTokenTtlDays,
  • test that asserts 30 days,
  • docs source ref if already reviewed.

If retrieval finds no supporting source, the claim is marked unsupported.


28. Retrieval for Drift Detection

Drift detector asks:

Which docs/notes depend on changed source X?

This is reverse retrieval.

Inputs:

  • changed file,
  • changed symbol,
  • changed contract path,
  • changed config key.

Output:

  • affected docs pages,
  • affected notes,
  • affected examples,
  • affected prompt bundles,
  • required regeneration tasks.

Implementation:

source ref index + graph edges + chunk metadata

This is why every chunk must preserve source refs.
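
The source ref index part of that implementation could look like the sketch below. It assumes each source ref starts with a file path, optionally followed by a `#` fragment, and matches changes at file granularity only; graph edges and chunk metadata are not consulted. `buildSourceRefIndex` and `affectedChunks` are hypothetical names.

```typescript
// Illustrative sketch: reverse lookup from changed files to dependent chunks.
type RefChunk = { id: string; sourceRefs: string[] };

// Map each referenced file (the part before "#") to the chunks that cite it.
function buildSourceRefIndex(chunks: RefChunk[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const chunk of chunks) {
    for (const ref of chunk.sourceRefs) {
      const file = ref.split("#")[0];
      const ids = index.get(file) ?? new Set<string>();
      ids.add(chunk.id);
      index.set(file, ids);
    }
  }
  return index;
}

function affectedChunks(changedFiles: string[], index: Map<string, Set<string>>): string[] {
  const affected = new Set<string>();
  for (const file of changedFiles) {
    const ids = index.get(file);
    if (ids) ids.forEach((id) => affected.add(id));
  }
  return Array.from(affected).sort();
}

const refIndex = buildSourceRefIndex([
  { id: "chunk:openapi:post-oauth-token", sourceRefs: ["openapi/auth.yaml#/paths/~1oauth~1token/post"] },
  { id: "chunk:docs:auth-guide:refresh-token", sourceRefs: ["auth.config.ts", "openapi/auth.yaml"] },
]);
console.log(affectedChunks(["auth.config.ts"], refIndex));
// ["chunk:docs:auth-guide:refresh-token"]
console.log(affectedChunks(["openapi/auth.yaml"], refIndex));
// ["chunk:docs:auth-guide:refresh-token", "chunk:openapi:post-oauth-token"]
```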


29. Retrieval for Repair

When the verifier finds an issue, repair uses targeted retrieval.

Issue:

Unsupported claim: API supports password grant.

Repair retrieval:

retrieve OAuth token endpoint contract
retrieve auth config
retrieve tests for token grant types
retrieve previous docs section

Then prompt:

Repair only this section.
Use retrieved evidence.
Remove unsupported claim if no evidence.

Repair should not regenerate the full page unless necessary.


30. Retrieval Evaluation

You need evaluation fixtures.

Gold cases:

{
  "query": "How to refresh an access token?",
  "expected": [
    "openapi/auth.yaml#/paths/~1oauth~1token/post",
    "tests/auth/refresh-token.test.ts",
    "docs/guides/authentication.mdx#refresh-tokens"
  ],
  "forbidden": [
    "logseq/journals/old-oauth-idea.md"
  ]
}

Metrics:

  • recall@k for required evidence,
  • precision@k,
  • forbidden result rate,
  • stale evidence rate,
  • visibility violation rate,
  • coverage completeness,
  • explanation quality,
  • retrieval latency,
  • token packing efficiency.

For docs generation, retrieval quality matters more than raw semantic score.
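
Three of these metrics are easy to compute directly from a gold case. The sketch below uses the fixture shape shown above and assumes retrieved IDs are unique and compared by exact string equality; `evaluate` and `EvalResult` are hypothetical names, and the retrieved list is made-up sample data.

```typescript
// Illustrative sketch: recall@k, precision@k, and forbidden hits for one gold case.
type GoldCase = { query: string; expected: string[]; forbidden: string[] };
type EvalResult = { recallAtK: number; precisionAtK: number; forbiddenHits: number };

function evaluate(gold: GoldCase, retrieved: string[], k: number): EvalResult {
  const topK = retrieved.slice(0, k);
  const expected = new Set<string>(gold.expected);
  const forbidden = new Set<string>(gold.forbidden);
  const relevant = topK.filter((id) => expected.has(id)).length;
  return {
    recallAtK: expected.size === 0 ? 1 : relevant / expected.size,
    precisionAtK: topK.length === 0 ? 0 : relevant / topK.length,
    forbiddenHits: topK.filter((id) => forbidden.has(id)).length,
  };
}

const gold: GoldCase = {
  query: "How to refresh an access token?",
  expected: [
    "openapi/auth.yaml#/paths/~1oauth~1token/post",
    "tests/auth/refresh-token.test.ts",
    "docs/guides/authentication.mdx#refresh-tokens",
  ],
  forbidden: ["logseq/journals/old-oauth-idea.md"],
};
// Made-up retrieval output for illustration.
const retrieved = [
  "openapi/auth.yaml#/paths/~1oauth~1token/post",
  "logseq/journals/old-oauth-idea.md",
  "tests/auth/refresh-token.test.ts",
];
const metrics = evaluate(gold, retrieved, 3);
console.log(metrics.forbiddenHits); // 1
```

Here two of the three expected sources are found, so recall@3 and precision@3 are both 2/3, and the forbidden journal entry counts as one violation.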


31. Failure Modes

31.1 Semantic Similarity Trap

A note sounds similar but refers to another product.

Fix:

  • project/version metadata,
  • source refs,
  • graph neighborhood check.

31.2 Exact Identifier Miss

Embedding search misses X-Request-Id.

Fix:

  • lexical index,
  • identifier fields,
  • symbol-aware query expansion.

31.3 Stale Docs Win

Old docs rank high because they are well-written.

Fix:

  • freshness penalty,
  • source hash validation,
  • authority model.

31.4 Private Note Leak

Internal note appears in public generation.

Fix:

  • hard visibility filter before ranking.

31.5 Retrieval Flood

Too many chunks crowd out important evidence.

Fix:

  • coverage-driven selection,
  • redundancy control,
  • token budget buckets.

31.6 Graph Expansion Explosion

One node expands to the entire repo.

Fix:

  • max depth,
  • edge allowlist,
  • authority threshold,
  • token budget integration.

32. Minimal Implementation Roadmap

Build in this order:

  1. Define RetrievalChunk schema.
  2. Build chunkers for docs sections.
  3. Build chunkers for OpenAPI operations.
  4. Build chunkers for examples/tests.
  5. Build chunkers for Logseq pages/blocks.
  6. Build chunkers for OpenNote notes/chunks.
  7. Build lexical index.
  8. Add metadata filters.
  9. Add source ref index.
  10. Add graph expansion from knowledge graph.
  11. Add semantic embeddings.
  12. Add hybrid scorer.
  13. Add coverage-driven retrieval.
  14. Add explainable retrieval output.
  15. Connect retrieval to context compiler.
  16. Connect retrieval to verifier.
  17. Connect retrieval to drift detector.
  18. Add evaluation fixtures.

Do not start by buying a vector database.

Start with chunk schema, metadata, lexical search, and source refs.


33. What We Have Built in This Part

We have now designed a retrieval layer for an AI-driven documentation generator.

Its main components:

retrieval query model
retrieval result model
chunking strategy
lexical index
semantic index
graph expansion
hybrid ranking
authority/freshness/visibility filters
coverage-driven retrieval
retrieval plan
incremental indexing
retrieval evaluation

The key mental model:

Retrieval is not “find similar text”. Retrieval is evidence selection under authority, freshness, visibility, coverage, and token constraints.

With this layer, the context compiler no longer works blindly. It can pull the right evidence from source, docs, examples, notes, and the graph.

The next part moves into Phase 8: CLI Application Architecture. We will start designing the CLI application structure concretely: command handler, application service, domain model, infrastructure adapter, config loader, plugin boundary, error model, logging, tracing, and exit code design.


References

  • DocPrompting paper: https://arxiv.org/abs/2207.05987
  • Code2Prompt repository: https://github.com/mufeedvh/code2prompt
  • Logseq repository: https://github.com/logseq/logseq
  • OpenNote repository: https://github.com/opennote-org/opennote
  • CodeSearchNet paper: https://arxiv.org/abs/1909.09436
  • SQLite FTS5: https://www.sqlite.org/fts5.html
  • Tantivy search engine: https://github.com/quickwit-oss/tantivy