title: Learn AI Code Documentation & Agent Memory Platform - Part 013
description: AST-aware, symbol-aware, section-aware, provenance-preserving, and token-efficient chunking of code and documents for retrieval, documentation generation, and agent context assembly.
series: learn-ai-code-documentation-agent-memory
seriesTitle: Learn AI Code Documentation & Agent Memory Platform
order: 13
partTitle: Chunking Code and Documents
tags:
- ai
- retrieval
- chunking
- code-intelligence
- documentation
- agent-context
- repository-analysis
- software-architecture
date: 2026-07-02
Part 013 — Chunking Code and Documents
1. Goals of This Part
Part 012 closed out the knowledge representation foundation: metadata, provenance, and trust. We now move into the retrieval architecture phase.
The first topic is chunking.
Chunking is the process of splitting source code, documents, schemas, configs, and graph evidence into units that can be indexed, searched, assembled into context, and cited as evidence.
Chunking looks simple, but in a code intelligence system it largely determines the quality of the whole platform.
If chunking is poor:
- methods are cut in the middle,
- comments are separated from their symbols,
- docs lose their heading context,
- evidence spans cannot be audited,
- embeddings become noise,
- retrieval returns incomplete fragments,
- agents lack constraints,
- generated docs hallucinate,
- memory is grounded on the wrong unit.
Goals for this part:
- understand a chunk as a retrieval unit, not just a piece of text,
- distinguish chunks for source code, docs, schema, config, graph, and memory,
- design AST-aware chunking for code,
- design section-aware chunking for docs,
- preserve provenance in every chunk,
- manage chunk identity and invalidation,
- choose the right granularity,
- handle token budget and overlap,
- build chunk quality gates,
- connect chunking to retrieval, docs, agent context, and memory.
2. Chunking Is Not Just Splitting Text
Naive chunking:
take every 1000 characters
overlap 200 characters
embed everything
This approach is adequate for general documents, but poor for source code.
Code has structure:
- package,
- import,
- class,
- method,
- annotation,
- decorator,
- route,
- test,
- schema,
- config section,
- migration operation.
If this structure is destroyed, retrieval loses meaning.
2.1 Bad Chunk Example
lines 1-80:
package/import/class header + half of a method
lines 81-160:
rest of the method + start of another method
Problems:
- the method signature is in the first chunk,
- the body logic is in the second chunk,
- retrieval returns only one of them,
- the agent does not have complete context.
2.2 Good Chunk Example
chunk:
kind: method
symbol: OrderService.createOrder
includes:
- leading comment
- annotations
- method signature
- method body
span: [31, 74]
A good chunk preserves a unit of meaning.
3. Mental Model of a Chunk
A chunk is the bridge between knowledge representation and retrieval.
A chunk should have:
- content,
- type,
- scope,
- source span,
- token estimate,
- identity,
- parent-child relation,
- linked graph nodes,
- confidence,
- freshness metadata,
- permission metadata.
4. Chunk Taxonomy
4.1 Core Chunk Types
| Chunk Type | Source | Example |
|---|---|---|
| file_overview | file | compact summary/header metadata |
| class_chunk | code | class with methods overview |
| method_chunk | code | method/function body |
| symbol_header_chunk | code | declaration/signature only |
| route_chunk | code/contract | API operation |
| test_case_chunk | tests | one test case |
| schema_chunk | schema/contract | DTO, OpenAPI schema, protobuf message |
| config_section_chunk | config | config prefix/key group |
| migration_chunk | SQL | migration operation |
| doc_section_chunk | docs | markdown section |
| adr_chunk | docs | ADR context/decision/consequence |
| runbook_step_chunk | runbook | troubleshooting step |
| graph_path_chunk | graph | compact graph path |
| memory_chunk | memory | memory record as retrievable unit |
4.2 Chunk Role
| Role | Meaning |
|---|---|
| primary_evidence | can support major claim |
| supporting_evidence | supports secondary/contextual claim |
| navigation | helps find source, not enough as evidence |
| summary | compact derived representation |
| warning | stale/conflict/uncertainty info |
| excluded | not indexed |
Example:
chunk:
type: method_chunk
role: primary_evidence
chunk:
type: file_overview
role: navigation
5. Chunk Metadata Model
5.1 Minimal Chunk
chunk:
chunkId: chunk_01J...
repositoryId: order-service
snapshotId: snap_6f41ab2
commitSha: 6f41ab2
path: src/main/java/com/acme/order/OrderService.java
chunkType: method_chunk
content: "public Order createOrder(...) { ... }"
span:
startLine: 31
endLine: 74
5.2 Production Chunk
chunk:
chunkId: chunk_01J...
logicalChunkId: chunklog_order_service_create_order
tenantId: acme
repositoryId: order-service
snapshotId: snap_6f41ab2
commitSha: 6f41ab2
source:
type: file_span
path: src/main/java/com/acme/order/OrderService.java
contentHash: sha256:...
chunkType: method_chunk
role: primary_evidence
title: "OrderService.createOrder"
language: java
content:
raw: "..."
normalized: "..."
spans:
full:
startLine: 29
endLine: 74
declaration:
startLine: 31
endLine: 32
body:
startLine: 33
endLine: 74
linkedNodes:
- symbol:OrderService.createOrder
- class:OrderService
parentChunkId: chunk_class_order_service
childChunkIds: []
tokenEstimate: 690
quality:
confidence: 0.94
parseStatus: OK
staleRisk: low
security:
visibilityScope: private
redacted: false
6. Chunk Identity
Chunk identity determines how incremental updates behave.
6.1 Bad Identity
chunkId = random_uuid()
Problems:
- on every reindex, all chunks look new,
- the vector index fills up with duplicates,
- docs and memory are hard to invalidate,
- diffs cannot be trusted.
6.2 Instance and Logical Chunk
Use two IDs:
| ID | Meaning |
|---|---|
| chunkInstanceId | chunk for a specific snapshot |
| logicalChunkId | logical chunk across snapshots |
Example:
chunkInstanceId =
hash(repositoryId, snapshotId, path, chunkType, linkedSymbolId, contentHash)
logicalChunkId =
hash(repositoryId, path, chunkType, logicalSymbolId)
6.3 Content Hash
Store the hash of the normalized content:
contentHash = sha256(normalizedChunkContent)
If line numbers change but the content stays the same, the logical chunk ID and the content hash both stay the same.
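The two IDs and the content hash can be sketched as follows. This is a minimal illustration, assuming SHA-256 over newline-joined fields; the class name `ChunkIds` and the choice of separator are assumptions for the example, not part of the platform described here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical helper: derives deterministic IDs instead of random UUIDs.
final class ChunkIds {
    /** Hashes the parts with SHA-256; the newline separator keeps ("a", "bc") distinct from ("ab", "c"). */
    static String sha256Hex(String... parts) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(String.join("\n", parts).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is unavailable", e);
        }
    }

    /** Hash of the normalized content; line numbers are deliberately not an input. */
    static String contentHash(String normalizedChunkContent) {
        return "sha256:" + sha256Hex(normalizedChunkContent);
    }

    /** Stable across snapshots: no snapshotId, no line numbers, no content. */
    static String logicalChunkId(String repositoryId, String path, String chunkType, String logicalSymbolId) {
        return sha256Hex(repositoryId, path, chunkType, logicalSymbolId);
    }

    /** Specific to one snapshot and one version of the content. */
    static String chunkInstanceId(String repositoryId, String snapshotId, String path,
                                  String chunkType, String linkedSymbolId, String contentHash) {
        return sha256Hex(repositoryId, snapshotId, path, chunkType, linkedSymbolId, contentHash);
    }
}
```

Because neither the logical ID nor the content hash takes line numbers as input, moving a method within its file leaves both unchanged, while the instance ID still differs per snapshot.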
7. Chunking Code
Code chunking harus berbasis struktur.
7.1 Chunk Hierarchy
7.2 Code Chunking Levels
| Level | Use |
|---|---|
| file overview | navigation and repo overview |
| class/type | module docs and architecture |
| method/function | agent code task and behavior evidence |
| block | large method internal retrieval |
| signature/header | API surface and quick search |
7.3 Default Strategy
For most code:
- create file overview chunk,
- create class/type chunks,
- create method/function chunks,
- create framework code unit chunks,
- create test case chunks,
- avoid block chunks unless unit too large.
8. Method Chunk
8.1 What to Include
A method chunk should include:
- leading comments/docstring,
- annotations/decorators,
- signature,
- body,
- maybe relevant imports if needed,
- parent class name metadata.
Example:
chunk:
type: method_chunk
title: OrderService.createOrder
contentParts:
- leading_comment
- annotations
- signature
- body
8.2 Why Include Signature
The body alone may not contain:
- method name,
- parameter types,
- return type,
- annotations,
- visibility.
Agent needs these.
8.3 Large Method Strategy
If method is too large:
methodChunk:
type: method_chunk
mode: outline_plus_blocks
children:
- block_chunk: validation_branch
- block_chunk: persistence_branch
Keep a parent method summary chunk with structure.
9. Class Chunk
9.1 What to Include
Class chunk can include:
- class declaration,
- annotations,
- fields,
- constructor signatures,
- method list,
- high-level comments,
- not necessarily full method bodies.
Example:
class OrderService {
fields:
- OrderValidator validator
- OrderRepository repository
methods:
- createOrder(CreateOrderRequest): Order
- cancelOrder(UUID): void
}
9.2 Use Cases
Class chunk helps:
- architecture docs,
- module overview,
- agent navigation,
- retrieval when query is broad.
9.3 Avoid Huge Class Chunks
If class is huge, do not embed full class. Embed overview and methods separately.
10. Route Chunk
API route is often more useful than raw method.
10.1 Example
chunk:
type: route_chunk
title: "POST /orders"
content:
handler: OrderController.createOrder
request: CreateOrderRequest
response: OrderResponse
calls:
- OrderService.createOrder
evidence:
- OrderController.java:12-22
- openapi/order-api.yaml#/paths/~1orders/post
10.2 Source Composition
Route chunk may combine:
- controller code,
- OpenAPI operation,
- request/response schema,
- related test.
Mark it as composed chunk:
composition:
type: composed
sources:
- file_span
- schema_pointer
10.3 Be Careful
Composed chunks are useful for retrieval, but claims must still cite original evidence.
11. Test Chunk
Tests are behavior evidence.
11.1 Test Case Chunk
chunk:
type: test_case_chunk
title: "shouldRejectOrderWithoutCustomerId"
source:
path: OrderValidatorTest.java
lines: [35, 52]
linkedNodes:
- test_case:OrderValidatorTest.shouldRejectOrderWithoutCustomerId
- symbol:OrderValidator.validate
11.2 Test Suite Chunk
For broad retrieval:
chunk:
type: test_suite_overview
title: OrderValidatorTest
content:
tests:
- shouldRejectOrderWithoutCustomerId
- shouldAcceptValidCorporateOrder
11.3 Why Test Chunks Matter
For agent code changes, tests often matter more than docs.
If target symbol changes, related test chunks should rank high.
12. Schema and Contract Chunking
Contracts require structure-aware chunking.
12.1 OpenAPI
Chunk by:
- operation,
- schema,
- parameter group,
- error response,
- security scheme.
Example:
chunk:
type: api_operation_chunk
title: "POST /orders"
pointer: /paths/~1orders/post
linkedNodes:
- api_operation:POST:/orders
- schema:CreateOrderRequest
- schema:OrderResponse
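The `~1` in the pointer above is JSON Pointer escaping (RFC 6901): `/` inside a reference token is written as `~1` and `~` as `~0`. A minimal sketch of building such a pointer, assuming a hypothetical helper class `JsonPointers`:

```java
import java.util.Locale;

// Hypothetical helper for building OpenAPI operation pointers.
final class JsonPointers {
    /** Escapes one reference token per RFC 6901: "~" first becomes "~0", then "/" becomes "~1". */
    static String escapeToken(String token) {
        return token.replace("~", "~0").replace("/", "~1");
    }

    /** Builds the pointer to an operation, for example /paths/~1orders/post. */
    static String operationPointer(String apiPath, String httpMethod) {
        return "/paths/" + escapeToken(apiPath) + "/" + httpMethod.toLowerCase(Locale.ROOT);
    }
}
```

The order of the two replacements matters: escaping `/` first would turn its `~1` into `~01` on the second pass.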
12.2 Protobuf
Chunk by:
- message,
- service,
- rpc method,
- enum.
chunk:
type: protobuf_message_chunk
title: OrderCreated
path: events/order.proto
12.3 SQL Migrations
Chunk by operation:
chunk:
type: migration_operation_chunk
title: "ALTER TABLE orders ADD COLUMN status"
table: orders
operation: add_column
12.4 GraphQL
Chunk by:
- type,
- query,
- mutation,
- subscription,
- fragment.
13. Config Chunking
Config files are often YAML/JSON/TOML/properties.
13.1 Chunk by Prefix
order:
validation:
max-items: 100
corporate-tax-id-required: true
Chunk:
chunk:
type: config_section_chunk
title: order.validation
keys:
- order.validation.max-items
- order.validation.corporate-tax-id-required
13.2 Redaction
Config chunks must redact sensitive values.
Bad:
password: my-prod-password
Good:
password: <REDACTED_SECRET>
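A key-based redaction pass could look like the sketch below. It assumes the config has already been flattened to dotted keys; the class name `ConfigRedactor` and the list of sensitive key fragments are hypothetical, and a real deployment would combine this with value-based secret detection.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical redactor operating on flattened config keys.
final class ConfigRedactor {
    static final String PLACEHOLDER = "<REDACTED_SECRET>";

    // Assumed key fragments that mark a value as sensitive; tune per organization.
    private static final Set<String> SENSITIVE_FRAGMENTS =
            Set.of("password", "secret", "token", "api-key", "apikey");

    static boolean isSensitive(String key) {
        String lower = key.toLowerCase(Locale.ROOT);
        return SENSITIVE_FRAGMENTS.stream().anyMatch(lower::contains);
    }

    /** Returns a copy with sensitive values replaced; key order is preserved. */
    static Map<String, String> redact(Map<String, String> flatConfig) {
        Map<String, String> redacted = new LinkedHashMap<>();
        flatConfig.forEach((key, value) -> redacted.put(key, isSensitive(key) ? PLACEHOLDER : value));
        return redacted;
    }
}
```

Redaction runs before the chunk content is stored or embedded, so the secret never reaches any index.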
13.3 Use Cases
Config chunks help:
- ops docs,
- feature behavior docs,
- agent context,
- runbook generation,
- impact analysis.
14. Document Chunking
Docs should be chunked by sections, not fixed character windows.
14.1 Section-Aware Chunking
Markdown:
# Order Validation
## Purpose
...
## Main Components
...
## Flow
...
Chunks:
Purpose, Main Components, Flow.
14.2 Heading Path
Store heading path:
headingPath:
- Order Validation
- Main Components
This provides context when chunk is retrieved.
14.3 Include Parent Heading
Chunk content should include heading.
## Main Components
`OrderValidator` coordinates validation...
Without heading, retrieved chunk may be ambiguous.
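Section-aware chunking with a heading path can be sketched as below. This is a simplified illustration that handles only ATX headings (`#` to `######`) and backtick code fences; the names `DocSection` and `MarkdownSectionChunker` are hypothetical, and a production chunker would more likely use a real Markdown parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical chunk shape: heading path, 1-based line span, and content including the heading.
record DocSection(List<String> headingPath, int startLine, int endLine, String content) {}

final class MarkdownSectionChunker {
    private static final String FENCE = "`".repeat(3);

    static List<DocSection> chunk(String markdown) {
        String[] lines = markdown.split("\\R");
        List<DocSection> sections = new ArrayList<>();
        List<String> path = new ArrayList<>();
        StringBuilder body = new StringBuilder();
        int start = -1;           // 1-based line of the current heading
        boolean hasText = false;  // true once the section has non-heading content
        boolean inFence = false;  // "#" lines inside code fences are not headings
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.startsWith(FENCE)) inFence = !inFence;
            int level = inFence ? 0 : headingLevel(line);
            if (level > 0) {
                // Close the previous section; heading-only sections are skipped.
                if (hasText) sections.add(new DocSection(List.copyOf(path), start, i, body.toString().strip()));
                while (path.size() >= level) path.remove(path.size() - 1);
                path.add(line.substring(level).trim());
                body.setLength(0);
                start = i + 1;
                hasText = false;
            } else if (!line.isBlank()) {
                hasText = start > 0;
            }
            if (start > 0) body.append(line).append('\n');
        }
        if (hasText) sections.add(new DocSection(List.copyOf(path), start, lines.length, body.toString().strip()));
        return sections;
    }

    /** Returns 1..6 for an ATX heading line, otherwise 0. */
    private static int headingLevel(String line) {
        int n = 0;
        while (n < line.length() && line.charAt(n) == '#') n++;
        return (n >= 1 && n <= 6 && n < line.length() && line.charAt(n) == ' ') ? n : 0;
    }
}
```

For the Order Validation example, this yields three sections whose content starts with their own heading and whose `headingPath` begins with `Order Validation`.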
14.4 Tables and Code Blocks
Do not split table rows across chunks unless necessary.
Do not separate code block from explanation if they depend on each other.
15. ADR Chunking
ADR has predictable structure.
Typical sections:
- status,
- context,
- decision,
- consequences,
- alternatives.
15.1 ADR Chunk Types
| ADR Section | Chunk Role |
|---|---|
| Context | background |
| Decision | strong decision evidence |
| Consequences | trade-off evidence |
| Alternatives | reasoning |
| Status | lifecycle |
15.2 ADR Decision Chunk
chunk:
type: adr_decision_chunk
title: "ADR 012 Decision"
role: decision_evidence
linkedNodes:
- module:order.validation
- symbol:RuleRegistry
ADR decision chunks should rank high for architecture questions, but not override current code for implementation truth.
16. Runbook Chunking
Runbooks should be chunked by operational task.
16.1 Runbook Chunk Types
| Chunk Type | Example |
|---|---|
| symptom_chunk | "High order validation failures" |
| diagnosis_step_chunk | "Check validation error rate" |
| remediation_step_chunk | "Rollback validation config" |
| escalation_chunk | "Contact team-order-platform" |
| dashboard_chunk | "Grafana dashboard links" |
16.2 Safety
Runbooks may contain operationally sensitive data.
Apply visibility and redaction.
17. Graph Path Chunking
Graph paths can be converted into compact chunks.
17.1 Example
chunk:
type: graph_path_chunk
title: "POST /orders request flow"
graphPath:
- POST /orders
- OrderController.createOrder
- OrderService.createOrder
- OrderValidator.validate
- OrderRepository.save
17.2 Use Cases
Graph path chunks are useful for:
- flow retrieval,
- docs generation,
- agent context,
- memory candidate generation.
17.3 Caution
Graph path chunks are derived. Claims still need original edge evidence.
18. Memory Chunking
Memory records should be retrievable.
18.1 Memory Chunk
chunk:
type: memory_chunk
title: "Repo convention: validation rules use RuleRegistry"
content: "Validation rules are registered through RuleRegistry, not instantiated directly in controllers."
memoryId: mem_rule_registry
scope: repository:order-service
evidence:
- RuleRegistry.java:10-88
18.2 Use in Retrieval
Memory chunks should not mix with source chunks blindly.
Rank memory separately and include as derived guidance.
19. Chunk Overlap
Overlap is useful, but must be controlled.
19.1 Code Overlap
For method chunks, overlap may include:
- class name,
- imports relevant to method,
- leading comment,
- annotations,
- surrounding test setup.
Avoid arbitrary line overlap that duplicates unrelated methods.
19.2 Doc Overlap
For docs, overlap can include:
- parent heading,
- previous paragraph if needed,
- table header.
19.3 Overlap Metadata
overlap:
includesParentHeading: true
includesLeadingComment: true
includesImports:
- OrderValidator
20. Token Budget
Chunk size should consider token budget.
20.1 Suggested Size Bands
| Chunk | Target Tokens |
|---|---|
| method/function | 150–1200 |
| class overview | 200–1000 |
| doc section | 200–1200 |
| API operation | 300–1500 |
| schema | 200–1500 |
| runbook step | 150–900 |
| graph path | 100–500 |
| memory | 40–250 |
These are not hard rules; adjust them to the task's requirements.
20.2 Too Small
Bad:
return repository.save(order);
No context.
20.3 Too Large
Bad:
entire 4000-line service file
Too much noise.
20.4 Adaptive Chunking
If symbol is small, one chunk.
If symbol is huge:
- create overview,
- create child chunks,
- preserve parent relation.
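The adaptive decision can be sketched as a small planner. The token estimate below uses a rough four-characters-per-token heuristic, which is an assumption for illustration only; a real system would use the tokenizer of its embedding or generation model. `AdaptivePlanner`, `ChunkPlan`, and `ChunkMode` are hypothetical names.

```java
import java.util.List;

enum ChunkMode { SINGLE, OUTLINE_PLUS_BLOCKS }

// Hypothetical plan: either one chunk, or an overview parent with named child blocks.
record ChunkPlan(ChunkMode mode, List<String> childBlocks) {}

final class AdaptivePlanner {
    /** Rough estimate, assuming about 4 characters per token (rounded up). */
    static int estimateTokens(String content) {
        return (content.length() + 3) / 4;
    }

    /** Small symbols stay whole; oversized symbols get an overview plus child block chunks. */
    static ChunkPlan plan(String symbolSource, List<String> blockNames, int maxTokensPerChunk) {
        if (estimateTokens(symbolSource) <= maxTokensPerChunk) {
            return new ChunkPlan(ChunkMode.SINGLE, List.of());
        }
        return new ChunkPlan(ChunkMode.OUTLINE_PLUS_BLOCKS, List.copyOf(blockNames));
    }
}
```

The child block names would come from the AST (for example, top-level branches or loops of the method), and each child keeps a link back to its parent chunk.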
21. Chunk Normalization
Store raw and normalized content.
21.1 Raw Content
Raw content preserves exact source for evidence.
21.2 Normalized Content
Normalized content improves indexing.
Examples:
- remove excessive whitespace,
- normalize line endings,
- preserve identifiers,
- optionally strip comments for some indexes,
- keep comments for documentation index.
21.3 Do Not Over-Normalize Code
Identifiers matter.
Do not replace all variable names if retrieval needs exact code references.
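A conservative normalizer along these lines only touches whitespace and line endings. The sketch below is one possible set of rules, chosen for illustration (unify line endings, strip trailing whitespace, collapse blank-line runs); it leaves identifiers, comments, and indentation untouched. The class name `ChunkNormalizer` is hypothetical.

```java
// Hypothetical whitespace-only normalizer; identifiers and indentation are preserved.
final class ChunkNormalizer {
    static String normalize(String raw) {
        // Unify Windows and old Mac line endings to "\n".
        String unified = raw.replace("\r\n", "\n").replace('\r', '\n');
        StringBuilder out = new StringBuilder();
        int blankRun = 0;
        for (String line : unified.split("\n", -1)) {
            String trimmed = line.stripTrailing();
            if (trimmed.isEmpty()) {
                blankRun++;
                // Skip leading blank lines and collapse longer blank runs to one line.
                if (out.length() == 0 || blankRun > 1) continue;
            } else {
                blankRun = 0;
            }
            out.append(trimmed).append('\n');
        }
        // End with exactly one newline.
        return out.toString().stripTrailing() + "\n";
    }
}
```

Hashing the normalized form means a pure line-ending or trailing-whitespace change does not alter `contentHash`, while the raw form is still kept for evidence.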
22. Chunk Provenance
Every chunk must carry provenance.
22.1 Required Provenance
provenance:
sourceType: file_span
repositoryId: order-service
snapshotId: snap_6f41ab2
commitSha: 6f41ab2
path: OrderService.java
span: [31, 74]
contentHash: sha256:...
producedBy:
chunkerId: ast-code-chunker
chunkerVersion: 2026.07.02
22.2 Composed Chunk Provenance
provenance:
sourceType: composed
sources:
- fileSpan: OrderController.java:12-22
- schemaPointer: openapi.yaml#/paths/~1orders/post
- graphEdge: OrderController.createOrder CALLS OrderService.createOrder
22.3 Why It Matters
When generated docs cite a chunk, the system can resolve the original source.
23. Chunk Security
Chunks inherit source sensitivity.
23.1 Visibility
security:
visibilityScope: private
derivedFrom:
- repo:order-service/private
23.2 Redaction
redaction:
applied: true
reason: secret_candidate
redactedPatterns:
- api_key
23.3 Exclusion
Do not create content chunk for blocked files.
Create metadata-only record if needed:
chunk:
type: blocked_file_metadata
path: .env.production
content: null
24. Chunk Invalidation
Chunks must update when source changes.
24.1 Invalidation Triggers
| Trigger | Action |
|---|---|
| file hash changed | re-chunk file |
| parser version changed | re-chunk parsed code |
| symbol span changed | update symbol chunk |
| doc section hash changed | update doc chunk |
| graph path changed | update graph chunk |
| memory invalidated | remove/mark memory chunk |
| permission changed | update visibility/index |
24.2 Chunk Diff
chunkDiff:
added:
- chunk_new_rule
removed:
- chunk_old_rule
changed:
- chunk_order_validator_validate
unchanged:
- chunk_rule_registry
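A diff like the one above can be computed from two maps of logical chunk ID to content hash, one per snapshot. This is a minimal sketch; `ChunkDiffer` and the record shape are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

record ChunkDiff(List<String> added, List<String> removed, List<String> changed, List<String> unchanged) {}

// Hypothetical differ: compares two snapshots by logical chunk ID and content hash.
final class ChunkDiffer {
    /** Each map goes from logicalChunkId to contentHash for one snapshot. */
    static ChunkDiff diff(Map<String, String> previous, Map<String, String> current) {
        List<String> added = new ArrayList<>();
        List<String> removed = new ArrayList<>();
        List<String> changed = new ArrayList<>();
        List<String> unchanged = new ArrayList<>();
        // Sorted union of IDs gives a deterministic report order.
        TreeSet<String> ids = new TreeSet<>(previous.keySet());
        ids.addAll(current.keySet());
        for (String id : ids) {
            String before = previous.get(id);
            String after = current.get(id);
            if (before == null) added.add(id);
            else if (after == null) removed.add(id);
            else if (before.equals(after)) unchanged.add(id);
            else changed.add(id);
        }
        return new ChunkDiff(added, removed, changed, unchanged);
    }
}
```

Only the `added`, `removed`, and `changed` lists need index writes; `unchanged` chunks keep their existing vectors.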
24.3 Index Update
If chunk changed:
- delete old vector,
- write new vector,
- update lexical index,
- update chunk metadata,
- trigger docs/memory refresh if referenced.
25. Chunk Storage Schema
25.1 Chunks
CREATE TABLE chunks (
chunk_id TEXT PRIMARY KEY,
logical_chunk_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
repository_id TEXT,
snapshot_id TEXT,
commit_sha TEXT,
source_type TEXT NOT NULL,
path TEXT,
chunk_type TEXT NOT NULL,
role TEXT NOT NULL,
title TEXT NOT NULL,
language TEXT,
content_hash TEXT NOT NULL,
token_estimate INTEGER NOT NULL,
visibility_scope TEXT NOT NULL,
stale_risk TEXT NOT NULL,
created_at TIMESTAMP NOT NULL
);
25.2 Chunk Spans
CREATE TABLE chunk_spans (
id TEXT PRIMARY KEY,
chunk_id TEXT NOT NULL,
span_type TEXT NOT NULL,
start_line INTEGER,
start_column INTEGER,
end_line INTEGER,
end_column INTEGER
);
25.3 Chunk Links
CREATE TABLE chunk_graph_links (
id TEXT PRIMARY KEY,
chunk_id TEXT NOT NULL,
graph_node_id TEXT NOT NULL,
link_type TEXT NOT NULL,
confidence NUMERIC NOT NULL
);
25.4 Chunk Sources
CREATE TABLE chunk_sources (
id TEXT PRIMARY KEY,
chunk_id TEXT NOT NULL,
source_ref_type TEXT NOT NULL,
source_ref_id TEXT NOT NULL,
usage_type TEXT NOT NULL
);
26. Chunker Architecture
26.1 Interface
public interface Chunker {
boolean supports(ChunkingInput input);
List<Chunk> chunk(ChunkingInput input);
}
26.2 Chunking Input
public record ChunkingInput(
RepositorySnapshot snapshot,
SourceFile file,
FileClassification classification,
LanguageDetection language,
ParseResult parseResult,
List<CodeSymbol> symbols,
List<CodeUnit> codeUnits,
ChunkingPolicy policy
) {}
26.3 Chunking Policy
chunkingPolicy:
maxTokensPerChunk: 1200
includeLeadingComments: true
includeAnnotations: true
createOverviewChunks: true
createBlockChunksForLargeMethods: true
preserveEvidenceSpans: true
26.4 Chunker Plugins
- Java AST chunker,
- TypeScript AST chunker,
- Go AST chunker,
- Python AST chunker,
- Markdown section chunker,
- OpenAPI chunker,
- SQL migration chunker,
- YAML config chunker,
- graph path chunker,
- memory chunker.
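One way to wire these plugins together is a registry that hands each input to the first chunker that supports it. The sketch below uses simplified stand-in types (`SimpleInput`, `SimpleChunk`, `SimpleChunker`) so that it is self-contained; they are hypothetical and only mirror the shape of the `Chunker` interface above.

```java
import java.util.List;

// Simplified stand-ins for ChunkingInput and Chunk, for illustration only.
record SimpleInput(String path, String content) {}
record SimpleChunk(String chunkType, String title) {}

interface SimpleChunker {
    boolean supports(SimpleInput input);
    List<SimpleChunk> chunk(SimpleInput input);
}

// Hypothetical registry: plugin order expresses priority.
final class ChunkerRegistry {
    private final List<SimpleChunker> chunkers;

    ChunkerRegistry(List<SimpleChunker> chunkers) {
        this.chunkers = List.copyOf(chunkers);
    }

    /** Uses the first chunker that supports the input; unsupported inputs produce no chunks. */
    List<SimpleChunk> chunk(SimpleInput input) {
        return chunkers.stream()
                .filter(chunker -> chunker.supports(input))
                .findFirst()
                .map(chunker -> chunker.chunk(input))
                .orElse(List.of());
    }
}
```

Returning an empty list for unsupported inputs is a design choice in this sketch; a stricter variant could fall back to a generic text chunker or record the file as skipped.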
27. Chunk Quality Gates
27.1 Structural Gates
- chunk has source span,
- chunk has content hash,
- chunk has token estimate,
- chunk linked to source artifact,
- chunk type valid,
- chunk visibility valid.
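The structural gates can be expressed as a validator that returns a list of violations. The sketch below is illustrative: `GateChunk` is a hypothetical flattened view of a chunk, and the allowed type and visibility sets are assumed subsets rather than the full taxonomy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical flattened view of the fields the structural gates inspect.
record GateChunk(String chunkType, String sourcePath, Integer startLine, Integer endLine,
                 String contentHash, Integer tokenEstimate, String visibilityScope) {}

final class StructuralGates {
    // Assumed allow-lists for the example; the real taxonomy has more entries.
    private static final Set<String> TYPES =
            Set.of("method_chunk", "class_chunk", "doc_section_chunk", "route_chunk");
    private static final Set<String> SCOPES = Set.of("public", "internal", "private");

    /** Returns one violation code per failed gate; an empty list means the chunk passes. */
    static List<String> violations(GateChunk c) {
        List<String> found = new ArrayList<>();
        if (c.startLine() == null || c.endLine() == null || c.startLine() < 1 || c.endLine() < c.startLine())
            found.add("invalid_source_span");
        if (c.contentHash() == null || c.contentHash().isBlank()) found.add("missing_content_hash");
        if (c.tokenEstimate() == null || c.tokenEstimate() <= 0) found.add("missing_token_estimate");
        if (c.sourcePath() == null || c.sourcePath().isBlank()) found.add("missing_source_artifact");
        // Null checks come first because Set.of(...).contains(null) throws.
        if (c.chunkType() == null || !TYPES.contains(c.chunkType())) found.add("invalid_chunk_type");
        if (c.visibilityScope() == null || !SCOPES.contains(c.visibilityScope())) found.add("invalid_visibility");
        return found;
    }
}
```

Collecting all violations, rather than failing on the first, makes the result usable as input for a chunk quality report.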
27.2 Code Gates
- method chunk includes signature,
- class chunk has child method list or child links,
- route chunk includes handler,
- test chunk linked to test symbol,
- generated/vendor chunks not primary evidence.
27.3 Document Gates
- doc section chunk includes heading,
- section span valid,
- stale docs marked,
- generated docs carry generation metadata.
27.4 Security Gates
- no blocked-sensitive content,
- redacted chunks marked,
- chunk visibility no broader than source,
- composed chunk visibility is max sensitivity of sources.
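The last two gates can be checked with an ordered sensitivity ladder. The four levels below are a hypothetical ordering chosen for the example; the platform's actual scopes may differ.

```java
import java.util.Collection;
import java.util.Comparator;

// Hypothetical sensitivity ladder, ordered from least to most restrictive.
enum Visibility { PUBLIC, INTERNAL, PRIVATE, RESTRICTED }

final class VisibilityRules {
    /** A composed chunk takes the most restrictive visibility among its sources. */
    static Visibility composed(Collection<Visibility> sources) {
        return sources.stream()
                .max(Comparator.naturalOrder())
                .orElseThrow(() -> new IllegalArgumentException("composed chunk needs at least one source"));
    }

    /** True when the chunk is at least as restrictive as its source. */
    static boolean noBroaderThanSource(Visibility chunk, Visibility source) {
        return chunk.compareTo(source) >= 0;
    }
}
```

For example, a route chunk composed from a public OpenAPI file and a private controller is private.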
28. Chunk Evaluation
28.1 Retrieval Evaluation
Questions:
- does query retrieve the correct chunk?
- is chunk self-contained enough?
- does chunk include evidence span?
- is irrelevant sibling code excluded?
28.2 Context Evaluation
Questions:
- can agent perform task with selected chunks?
- are required tests included?
- are stale docs excluded?
- is token budget respected?
28.3 Documentation Evaluation
Questions:
- can a generated claim cite the chunk?
- does the chunk map to exact source lines?
- do source changes invalidate the affected docs?
29. Common Mistakes
29.1 Fixed-Size Chunking for Code
This destroys structure.
29.2 Chunk Without Provenance
A chunk that cannot cite source is not safe evidence.
29.3 Ignoring Tests
Test chunks are essential for behavior and agent modification tasks.
29.4 Embedding Huge Files
Large file chunks are noisy and expensive.
29.5 No Logical Chunk ID
Incremental indexing becomes wasteful and unstable.
29.6 Mixing Source and Memory
Memory chunks should be separated as derived guidance.
29.7 Not Redacting Config
Config chunks can leak secrets.
29.8 No Section-Aware Docs
Documentation retrieval becomes weak if heading hierarchy is lost.
30. Practical Exercise
Build chunking for one repository.
30.1 Input
Use:
OrderController.java
OrderService.java
OrderValidator.java
OrderValidatorTest.java
application.yml
openapi/order-api.yaml
docs/order-validation.md
docs/adr/012-validation-rules.md
30.2 Output
Produce:
chunks.json
chunk-spans.json
chunk-links.json
chunk-quality-report.yaml
30.3 Required Chunks
- class chunks,
- method chunks,
- route chunk,
- test case chunks,
- config section chunk,
- OpenAPI operation chunk,
- ADR decision chunk,
- doc section chunks.
30.4 Acceptance Criteria
- every chunk has provenance,
- every chunk has stable logical ID,
- method chunks include signature,
- doc chunks include heading path,
- config chunks redact sensitive values,
- route chunk links handler + contract,
- changed file invalidates only relevant chunks,
- generated/vendor files not primary evidence.
31. Summary
Chunking is a core retrieval architecture layer, not a preprocessing detail.
Key points:
- chunks are retrieval/evidence units,
- code chunking should be AST-aware and symbol-aware,
- docs should be section-aware,
- contracts/config/schema need domain-specific chunkers,
- chunk identity must support incremental updates,
- chunks need provenance, visibility, token estimate, and linked graph nodes,
- composed chunks are useful but must preserve original evidence,
- memory chunks are derived guidance, not source truth,
- chunk quality directly affects retrieval, docs, agent context, and memory,
- fixed-size text splitting is not enough for production code intelligence.
The next part covers Embedding and Vector Indexing: how to build vector representations for chunks, when vector search is useful, when it is not enough, how versioning and reindexing work, and how to keep cost and permissions under control.