Learn Ai Code Documentation Agent Memory Part 005 File Classification And Source Boundaries
title: Learn AI Code Documentation & Agent Memory Platform - Part 005
description: File classification and source boundary strategies for building repository intelligence that is safe, accurate, incremental, and low in noise.
series: learn-ai-code-documentation-agent-memory
seriesTitle: Learn AI Code Documentation & Agent Memory Platform
order: 5
partTitle: File Classification and Source Boundaries
tags:
- ai
- code-intelligence
- repository-analysis
- file-classification
- source-boundary
- documentation
- agent-memory
date: 2026-07-02
Part 005 — File Classification and Source Boundaries
1. Goals of This Part
Once a repository has been ingested, the next problem is deciding which files deserve to become knowledge sources.
This sounds simple, but in an AI code documentation and agent memory system, file classification mistakes corrupt the entire pipeline:
- generated code is treated as the source of truth,
- dependency folders get indexed,
- secrets leak into context,
- binary files consume resources,
- outdated docs are treated as valid,
- test fixtures are treated as business logic,
- build output is treated as source,
- vendored code adds retrieval noise,
- the agent is given the wrong context.
This part covers how to build a file classification layer and a source boundary model.
The goals:
- understand the types of files in a repository,
- design a file taxonomy that is useful for retrieval, docs, and memory,
- distinguish source, generated, vendor, docs, config, test, infra, schema, migration, and asset files,
- define a policy for what is indexed, summarized, blocked, or stored as metadata only,
- design a classifier that is deterministic, explainable, and configurable,
- connect classification to security, cost, retrieval, and freshness.
2. Why File Classification Matters
A modern repository is more than source code.
A single repository can contain:
src/
test/
docs/
dist/
build/
target/
node_modules/
vendor/
migrations/
schemas/
openapi/
helm/
k8s/
terraform/
.github/
README.md
Dockerfile
package-lock.json
generated/
fixtures/
assets/
If every file is treated the same, the system will fail.
2.1 Failure Modes Without Classification
| Failure | Example | Impact |
|---|---|---|
| High noise | node_modules ends up in the index | Poor search, high cost |
| Hallucinated docs | generated client treated as business logic | Misleading docs |
| Secret leakage | .env enters agent context | Security incident |
| Wrong ownership | vendored library attributed to the team | Incorrect ownership graph |
| Stale knowledge | old docs carry no freshness risk | Humans and agents trust wrong knowledge |
| Cost explosion | large lockfile gets embedded | High token/embedding cost |
| Bad ranking | test fixture outranks the main source | Irrelevant retrieval |
| Parser failure | binary/minified file gets parsed | Slow or failing workers |
File classification is the earliest quality filter.
3. Mental Model: The Repository as an Evidence Lake
Think of the repository as an evidence lake.
Not all evidence carries the same value.
We do not only ask:
"What language is this file?"
We also ask:
"What claims may this file be used to support?"
Example:
| File | May Support Claims? | Notes |
|---|---|---|
| OrderService.java | Yes | Primary source |
| OrderServiceTest.java | Yes, for behavior/test expectations | Supporting source |
| README.md | Yes, but may be stale | Documentation source |
| openapi.yaml | Yes, for the API contract | Contract source |
| target/generated-sources/... | Usually not | Generated |
| .env | No | Sensitive |
| package-lock.json | Limited | Dependency metadata, not narrative docs |
| node_modules/... | No | Vendor/dependency |
| dist/app.min.js | No | Build artifact |
4. Source Boundary
A source boundary is the set of rules that defines which files count as knowledge sources for a given purpose.
The boundary differs depending on the output.
4.1 Boundary for Human Documentation
Human documentation may draw on:
- main source code,
- tests,
- README,
- ADR,
- OpenAPI specs,
- database migrations,
- important config,
- deployment manifests,
- runbooks.
But be careful with:
- generated code,
- old docs,
- sample code,
- fixtures,
- vendored code,
- minified assets.
4.2 Boundary for Agent Context
Agent context must be stricter.
Agent context may contain:
- relevant source code,
- relevant tests,
- relevant contracts,
- current docs,
- current memory,
- constraints,
- nearby graph neighbors.
Agent context must avoid:
- secrets,
- huge lockfiles,
- irrelevant docs,
- generated artifacts,
- third-party code,
- stale memory,
- files outside permission scope.
4.3 Boundary for Agent Memory
Memory must be the strictest of the three.
Memory may only be created from sources that:
- have clear evidence,
- are not secrets,
- are not generated noise,
- have a scope,
- can be invalidated when the source changes,
- do not come from an unreliable summary without provenance.
Bad:
statement: "Billing service seems to use Stripe internally."
evidence:
  - path: README.md
confidence: 0.35
Better:
statement: "Billing service creates Stripe payment intents in StripePaymentGateway."
evidence:
  - path: src/main/java/com/acme/billing/StripePaymentGateway.java
    lines: [18, 91]
confidence: 0.88
expiresWhen:
  - symbolChanged: com.acme.billing.StripePaymentGateway
5. File Taxonomy
We need a taxonomy that is detailed enough without being overly complex.
5.1 Top-Level File Kind
| Kind | Meaning |
|---|---|
| source | Main source code |
| test | Test code |
| documentation | README, ADR, docs, markdown |
| contract | OpenAPI, AsyncAPI, protobuf, Avro, GraphQL schema |
| schema | DB schema, migration, JSON schema |
| configuration | app config, build config, framework config |
| infrastructure | Docker, Kubernetes, Terraform, Helm |
| ci_cd | GitHub Actions, GitLab CI, Jenkinsfile |
| dependency_manifest | pom.xml, package.json, go.mod, requirements.txt |
| lockfile | package-lock, yarn.lock, go.sum |
| generated | generated source/artifact |
| vendor | third-party vendored source |
| asset | image, css, static asset |
| binary | binary files |
| secret_candidate | file/path likely sensitive |
| unknown | cannot be classified |
5.2 Subkind
The top-level kind alone is not enough. Add a subkind.
Example:
kind: documentation
subkind: adr
kind: contract
subkind: openapi
kind: infrastructure
subkind: kubernetes_manifest
kind: configuration
subkind: spring_boot_config
kind: source
subkind: java_application_code
5.3 Role dalam Knowledge
Besides kind, store a knowledgeRole.
| Role | Meaning |
|---|---|
| primary_evidence | Can support primary claims |
| supporting_evidence | Supports limited claims |
| operational_evidence | Useful for deployment/ops |
| contract_evidence | Useful for API/schema claims |
| metadata_only | Store metadata only; excluded from the content index |
| excluded | Not processed |
| blocked_sensitive | Blocked for security reasons |
Example:
path: src/main/java/com/acme/order/OrderService.java
kind: source
subkind: java_application_code
knowledgeRole: primary_evidence
indexPolicy: parse_and_index
path: package-lock.json
kind: lockfile
subkind: npm_lockfile
knowledgeRole: metadata_only
indexPolicy: index_metadata_only
path: .env
kind: secret_candidate
subkind: env_file
knowledgeRole: blocked_sensitive
indexPolicy: block
6. Index Policy
Classification must produce an operational decision.
6.1 Main Policies
| Policy | Meaning |
|---|---|
| parse_and_index | Parse structurally, extract symbols, index chunks |
| index_text | Index as text; do not parse as code |
| index_metadata_only | Store metadata only; do not store content |
| summarize_only | A limited summary may be generated |
| exclude | Skip |
| block | Do not read content; flag for security |
6.2 Policy by Kind
| Kind | Default Policy | Notes |
|---|---|---|
| source | parse_and_index | Unless generated/vendor |
| test | parse_and_index | Important for behavior |
| documentation | index_text | Needs stale detection |
| contract | parse_and_index or index_text | May have a dedicated parser |
| schema | parse_and_index or index_text | Migrations matter for data docs |
| configuration | index_text | Redact sensitive values |
| infrastructure | index_text | Useful for deployment docs |
| ci_cd | index_text | Useful for lifecycle docs |
| dependency_manifest | index_text | Useful for stack/dependency info |
| lockfile | index_metadata_only | Usually too noisy |
| generated | index_metadata_only or exclude | Must not become primary evidence |
| vendor | exclude | Usually not team-owned source |
| asset | index_metadata_only | Except docs assets |
| binary | index_metadata_only | Do not parse |
| secret_candidate | block | Must not enter context |
| unknown | index_metadata_only or exclude | Depends on policy |
7. Classification Signals
The classifier must combine several signals.
7.1 Path Signal
Path is a very strong signal.
Example patterns:
generated:
- "**/generated/**"
- "**/target/generated-sources/**"
- "**/build/generated/**"
- "**/.openapi-generator/**"
vendor:
- "**/vendor/**"
- "**/third_party/**"
- "**/node_modules/**"
test:
- "**/test/**"
- "**/tests/**"
- "**/__tests__/**"
- "**/*Test.java"
- "**/*.spec.ts"
- "**/*.test.ts"
documentation:
- "README.md"
- "docs/**"
- "**/*.md"
- "**/*.mdx"
- "**/adr/**"
ci_cd:
- ".github/workflows/**"
- ".gitlab-ci.yml"
- "Jenkinsfile"
secret_candidate:
- ".env"
- ".env.*"
- "**/secrets/**"
- "**/*secret*"
- "**/*credential*"
7.2 Extension Signal
| Extension | Candidate |
|---|---|
| .java | Java source |
| .kt | Kotlin source |
| .go | Go source |
| .py | Python source |
| .ts / .tsx | TypeScript |
| .js / .jsx | JavaScript |
| .md / .mdx | Documentation |
| .yaml / .yml | Config/contract/infra |
| .json | Config/schema/manifest |
| .proto | Protobuf contract |
| .graphql | GraphQL schema |
| .sql | DB migration/schema |
| .tf | Terraform |
| .Dockerfile / Dockerfile | Container build |
| .png / .jpg | Asset |
| .jar / .zip | Binary/artifact |
Extension alone is not enough. A .yaml file can be:
- Kubernetes manifest,
- OpenAPI spec,
- GitHub Actions,
- Helm values,
- Spring config,
- generic config.
7.3 Content Signal
Content can identify the subkind.
YAML examples:
openapi: 3.0.3
info:
  title: Order API
=> contract/openapi
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
=> infrastructure/kubernetes_manifest
spring:
  datasource:
    url: jdbc:postgresql://...
=> configuration/spring_boot_config
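As a rough illustration of the mapping above, the sketch below inspects only top-level keys in the first lines of a YAML file. `YamlSniffer`, its method name, and the `configuration/generic_config` fallback label are hypothetical; a real sniffer would likely need more markers and a proper tolerance for comments and document separators.

```java
// Hypothetical content sniffer: looks at unindented (top-level) keys only,
// so it never needs a full YAML parse.
final class YamlSniffer {
    static String sniffSubkind(String head) {
        boolean hasApiVersion = false;
        boolean hasKind = false;
        for (String line : head.lines().toList()) {
            if (line.startsWith("openapi:")) {
                return "contract/openapi";
            }
            if (line.startsWith("spring:")) {
                return "configuration/spring_boot_config";
            }
            if (line.startsWith("apiVersion:")) hasApiVersion = true;
            if (line.startsWith("kind:")) hasKind = true;
        }
        // Kubernetes manifests carry both apiVersion and kind at the top level.
        if (hasApiVersion && hasKind) {
            return "infrastructure/kubernetes_manifest";
        }
        return "configuration/generic_config"; // assumed fallback label
    }
}
```

Because indented keys are ignored, a nested `kind:` inside some unrelated config does not trigger the Kubernetes branch.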
7.4 Size Signal
Very large files usually need special handling.
| Size | Treatment |
|---|---|
| < 256 KB | Normal |
| 256 KB - 2 MB | Chunk carefully |
| 2 MB - 10 MB | Metadata + selective processing |
| > 10 MB | Metadata only by default |
Size is not an absolute rule. An OpenAPI file can be large yet important. A lockfile can be large and still unsuitable for semantic embedding.
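One way to express the size table together with its two exceptions is sketched below. The `SizeSignal` class, the `Treatment` names, the 1024-byte kilobyte, and the exact override rules for contracts and lockfiles are assumptions for illustration only.

```java
// Hypothetical size signal: thresholds follow the table, overrides follow the
// remark that size is not an absolute rule.
final class SizeSignal {
    enum Treatment { NORMAL, CHUNK_CAREFULLY, METADATA_AND_SELECTIVE, METADATA_ONLY }

    private static final long KB = 1024L;      // assumption: binary kilobytes
    private static final long MB = 1024L * KB;

    // Pure size-based treatment.
    static Treatment treatmentFor(long sizeBytes) {
        if (sizeBytes < 256 * KB) return Treatment.NORMAL;
        if (sizeBytes <= 2 * MB) return Treatment.CHUNK_CAREFULLY;
        if (sizeBytes <= 10 * MB) return Treatment.METADATA_AND_SELECTIVE;
        return Treatment.METADATA_ONLY;
    }

    // Kind-aware variant: lockfiles are never embedded, large contracts stay processable.
    static Treatment treatmentFor(long sizeBytes, String kind) {
        if (kind.equals("lockfile")) return Treatment.METADATA_ONLY;
        Treatment bySize = treatmentFor(sizeBytes);
        boolean downgraded = bySize == Treatment.METADATA_AND_SELECTIVE
                || bySize == Treatment.METADATA_ONLY;
        if (kind.equals("contract") && downgraded) return Treatment.CHUNK_CAREFULLY;
        return bySize;
    }
}
```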
7.5 Entropy and Binary Signal
Binary detection matters.
Signals:
- contains null bytes,
- high ratio of invalid UTF-8,
- high entropy,
- known binary extension,
- magic bytes.
Policy:
binary => index_metadata_only
Do not try to parse binary files as text.
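The "high entropy" signal can be computed as Shannon entropy over byte frequencies. This is a minimal sketch; `EntropySignal` is a hypothetical name, and any threshold applied to the result (for example treating values near 8 bits per byte as suspicious) would be an assumption to tune per repository.

```java
// Hypothetical helper: Shannon entropy of a byte sample, in bits per byte (0.0 to 8.0).
final class EntropySignal {
    static double shannonEntropy(byte[] sample) {
        if (sample.length == 0) {
            return 0.0; // no data, no entropy
        }
        int[] counts = new int[256];
        for (byte b : sample) {
            counts[b & 0xff]++; // treat the byte as unsigned
        }
        double entropy = 0.0;
        for (int count : counts) {
            if (count == 0) continue;
            double p = count / (double) sample.length;
            entropy -= p * (Math.log(p) / Math.log(2)); // log base 2
        }
        return entropy;
    }
}
```

Entropy alone cannot separate binaries from compressed or encrypted text, so it works best combined with the other signals in the list.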
7.6 Generated Signal
Generated code can be detected from:
- path,
- file header,
- annotation,
- comments,
- tool markers.
Example markers:
// Code generated by ...
// <auto-generated>
// Generated by OpenAPI Generator
// This file was automatically generated
Generated code does not always have to be ignored. Sometimes a generated API client matters for integration docs. By default, though, do not use it as a primary source for domain behavior.
8. Classification Result Schema
Make the classification result explicit.
fileId: file_01J...
path: src/main/java/com/acme/order/OrderService.java
language: java
kind: source
subkind: java_application_code
knowledgeRole: primary_evidence
indexPolicy: parse_and_index
confidence: 0.96
signals:
  path:
    - "matches src/main/java/**"
  extension:
    - ".java"
  content:
    - "contains package declaration"
risk:
  sensitive: false
  generated: false
  vendor: false
  large: false
Generated example:
fileId: file_01J...
path: target/generated-sources/openapi/src/main/java/com/acme/api/OrdersApi.java
language: java
kind: generated
subkind: openapi_generated_client
knowledgeRole: metadata_only
indexPolicy: index_metadata_only
confidence: 0.98
signals:
  path:
    - "matches **/generated-sources/**"
  content:
    - "contains Generated by OpenAPI Generator"
risk:
  sensitive: false
  generated: true
  vendor: false
  large: false
Secret candidate example:
fileId: file_01J...
path: .env.production
language: dotenv
kind: secret_candidate
subkind: env_file
knowledgeRole: blocked_sensitive
indexPolicy: block
confidence: 0.99
signals:
  path:
    - "matches .env.*"
risk:
  sensitive: true
  generated: false
  vendor: false
  large: false
9. Classification Pipeline
9.1 Step 1 — Metadata
Input:
- path,
- file name,
- extension,
- size,
- executable bit,
- symlink info,
- sha256,
- last commit metadata.
9.2 Step 2 — Path Rules
Path rules are fast and deterministic.
Use ordered rules.
rules:
  - id: block-env-files
    when:
      pathMatches:
        - ".env"
        - ".env.*"
    set:
      kind: secret_candidate
      indexPolicy: block
    priority: 1000
  - id: exclude-node-modules
    when:
      pathMatches:
        - "**/node_modules/**"
    set:
      kind: vendor
      indexPolicy: exclude
    priority: 900
9.3 Step 3 — Binary Detection
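Do not read an entire large file. Take a sample from the beginning.
Pseudo-code:
boolean looksBinary(byte[] sample) {
    // An empty sample gives no evidence of binary content (and avoids 0/0 below).
    if (sample.length == 0) return false;
    int nullBytes = 0;
    int controlBytes = 0;
    for (byte b : sample) {
        int value = b & 0xff; // treat the byte as unsigned
        if (value == 0) nullBytes++;
        // Bytes below 32 other than 9-13 (tab, LF, VT, FF, CR); NUL is counted here too.
        if (value < 9 || (value > 13 && value < 32)) controlBytes++;
    }
    double nullRatio = nullBytes / (double) sample.length;
    double controlRatio = controlBytes / (double) sample.length;
    return nullRatio > 0.01 || controlRatio > 0.30;
}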
9.4 Step 4 — Content Sniffing
Content sniffing must stay lightweight.
Do not run a full parse inside the classifier. Full parsing happens after the policy decides the file is worth processing.
Example sniff result:
record ContentSignals(
    boolean hasPackageDeclaration,
    boolean hasOpenApiRoot,
    boolean hasKubernetesApiVersion,
    boolean hasGeneratedMarker,
    boolean hasSecretLikeAssignment
) {}
9.5 Step 5 — Policy Decision
The policy decision combines:
- kind,
- subkind,
- risk,
- size,
- repository settings,
- tenant settings,
- user override.
10. Rule Precedence
The classifier needs clear precedence.
Suggested precedence, highest first:
- explicit block rules,
- security-sensitive rules,
- vendor/dependency exclusion,
- generated detection,
- explicit repository config,
- path conventions,
- extension/language detection,
- content sniffing,
- fallback unknown.
Why is security at the top?
Because .env.generated is still sensitive if it contains secrets. Do not let a generated classification lower the severity.
11. Repository-Level Config
Not every repository follows the same conventions.
Provide an optional config file:
# .ai-docs.yml
classification:
  include:
    - "src/**"
    - "docs/**"
    - "openapi/**"
  exclude:
    - "src/generated/**"
    - "fixtures/large/**"
  sourceRoots:
    - "src/main/java"
    - "src/main/kotlin"
  testRoots:
    - "src/test/java"
  docsRoots:
    - "docs"
    - "adr"
  generatedRoots:
    - "target/generated-sources"
  sensitive:
    - "config/prod/**"
    - "**/*.pem"
11.1 Config Must Not Override Security
If the repo config says to include .env, the system must still block it.
security block > repository config
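The ordering can be enforced by evaluating security patterns before any repository config is consulted. The sketch below is illustrative only: `PolicyGuard`, its regex list, and the rule that a config include upgrades an excluded file to `INDEX_TEXT` are assumptions. It also blocks every `.env*` path, so the content-based handling of `.env.example` would have to be layered on separately.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical policy guard: security blocks win over repository-level includes.
final class PolicyGuard {
    enum IndexPolicy { PARSE_AND_INDEX, INDEX_TEXT, INDEX_METADATA_ONLY, EXCLUDE, BLOCK }

    // Assumed security patterns, matched against the repo-relative path.
    private static final List<Pattern> SECURITY_BLOCKS = List.of(
            Pattern.compile("(^|.*/)\\.env(\\..*)?"),
            Pattern.compile(".*\\.(pem|key|p12|jks)"));

    static IndexPolicy decide(String path, boolean includedByRepoConfig, IndexPolicy classifierDefault) {
        // Step 1: security rules are checked first and cannot be overridden.
        for (Pattern blocked : SECURITY_BLOCKS) {
            if (blocked.matcher(path).matches()) {
                return IndexPolicy.BLOCK;
            }
        }
        // Step 2: only now may repository config loosen an exclusion (assumed behavior).
        if (includedByRepoConfig && classifierDefault == IndexPolicy.EXCLUDE) {
            return IndexPolicy.INDEX_TEXT;
        }
        return classifierDefault;
    }
}
```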
11.2 Config Must Be Audited
Config changes can alter what gets indexed.
Store:
- config file path,
- commit,
- applied rules,
- classifier version.
12. Handling Common Repository Types
12.1 Java Maven/Gradle Repository
Typical paths:
src/main/java
src/main/kotlin
src/main/resources
src/test/java
pom.xml
build.gradle
target/
Policy:
| Path | Kind | Policy |
|---|---|---|
| src/main/java/** | source | parse_and_index |
| src/test/java/** | test | parse_and_index |
| src/main/resources/application*.yml | configuration | index_text with redaction |
| pom.xml | dependency_manifest | index_text |
| target/** | generated/build artifact | exclude or index_metadata_only |
12.2 Node/TypeScript Repository
Typical paths:
src/
test/
__tests__/
dist/
node_modules/
package.json
package-lock.json
tsconfig.json
Policy:
| Path | Kind | Policy |
|---|---|---|
| src/**/*.ts | source | parse_and_index |
| *.test.ts, *.spec.ts | test | parse_and_index |
| package.json | dependency_manifest | index_text |
| package-lock.json | lockfile | index_metadata_only |
| node_modules/** | vendor | exclude |
| dist/** | generated/build artifact | index_metadata_only or exclude |
12.3 Go Repository
Typical paths:
cmd/
internal/
pkg/
go.mod
go.sum
*_test.go
vendor/
Policy:
| Path | Kind | Policy |
|---|---|---|
| cmd/**, internal/**, pkg/** | source | parse_and_index |
| *_test.go | test | parse_and_index |
| go.mod | dependency_manifest | index_text |
| go.sum | lockfile | index_metadata_only |
| vendor/** | vendor | exclude |
12.4 Python Repository
Typical paths:
src/
tests/
requirements.txt
pyproject.toml
venv/
.venv/
Policy:
| Path | Kind | Policy |
|---|---|---|
| src/**/*.py | source | parse_and_index |
| tests/**/*.py | test | parse_and_index |
| pyproject.toml | dependency_manifest/configuration | index_text |
| requirements.txt | dependency_manifest | index_text |
| venv/**, .venv/** | vendor/environment | exclude |
13. Docs Classification
Markdown is not always valid documentation.
13.1 Documentation Subkind
| Subkind | Detection |
|---|---|
| readme | README.md |
| adr | path contains adr, title contains decision |
| runbook | path/title contains runbook, operations, troubleshoot |
| api_doc | path contains api/docs, endpoint references |
| onboarding | title/path contains onboarding/getting-started |
| changelog | CHANGELOG.md, release notes |
| architecture | path/title contains architecture/design |
| contributing | CONTRIBUTING.md |
| generated_doc | frontmatter or marker from system |
13.2 Stale Risk
Docs should receive a stale risk score.
Signals:
- last modified much older than source files,
- references symbol/path that no longer exists,
- describes API not found in contract,
- says "TODO/update later",
- no evidence/provenance,
- generated by old generator version.
Example:
path: docs/order-validation.md
kind: documentation
subkind: module_doc
staleRisk:
  level: medium
  reasons:
    - "References OrderRuleEngine, symbol not found"
    - "Last updated before major changes in validation package"
14. Sensitive File Handling
Security is not optional.
14.1 Sensitive Path Patterns
Common sensitive files:
.env
.env.*
*.pem
*.key
*.p12
*.jks
id_rsa
credentials.json
secrets.yml
application-prod.yml
kubeconfig
But path is not enough.
14.2 Secret-Like Content
Content signals:
- API key patterns,
- private key blocks,
- password assignments,
- tokens,
- connection strings,
- cloud credentials.
Policy:
- detect,
- block or redact,
- store metadata only,
- add security finding,
- never include raw value in generated docs/context/memory.
14.3 Redaction Strategy
For config files, you may index structure after redaction.
Input:
spring:
  datasource:
    url: jdbc:postgresql://prod-db:5432/orders
    password: supersecret
Redacted:
spring:
  datasource:
    url: <REDACTED_CONNECTION_STRING>
    password: <REDACTED_SECRET>
But do not blindly include production hostnames if policy says environment details are sensitive.
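The redaction step can be approximated with line-based rules. The sketch below is a deliberately simple assumption: `ConfigRedactor` and both regexes are hypothetical, it does not parse YAML, and it will miss multi-line values, quoted keys, and secrets stored under innocuous key names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical line-based redactor for "key: value" config lines.
final class ConfigRedactor {
    // Keys whose name suggests a secret (assumed keyword list).
    private static final Pattern SECRET_KEY = Pattern.compile(
            "^(\\s*[\\w.-]*(?:password|secret|token|api[_-]?key)[\\w.-]*\\s*:\\s*)(\\S.*)$",
            Pattern.CASE_INSENSITIVE);
    // Values that look like JDBC or scheme://host connection strings.
    private static final Pattern CONNECTION_STRING = Pattern.compile(
            "^(\\s*[\\w.-]+\\s*:\\s*)((?:jdbc:|[a-z][a-z0-9+]*://)\\S+)\\s*$",
            Pattern.CASE_INSENSITIVE);

    static String redactLine(String line) {
        Matcher secret = SECRET_KEY.matcher(line);
        if (secret.matches()) {
            return secret.group(1) + "<REDACTED_SECRET>";
        }
        Matcher connection = CONNECTION_STRING.matcher(line);
        if (connection.matches()) {
            return connection.group(1) + "<REDACTED_CONNECTION_STRING>";
        }
        return line; // structure-only lines such as "spring:" pass through unchanged
    }

    static String redact(String config) {
        return config.lines().map(ConfigRedactor::redactLine).collect(Collectors.joining("\n"));
    }
}
```

Keeping the key and indentation intact preserves the structure that documentation needs while the raw value never reaches the index.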
15. Generated Code Handling
Generated code is tricky.
15.1 Do Not Always Exclude
Generated code can be useful when:
- no source schema is available,
- generated API interfaces are used as entry points,
- generated protobuf classes define message shapes,
- generated OpenAPI stubs show route contracts.
But prefer original source:
| Generated Artifact | Prefer |
|---|---|
| OpenAPI generated Java client | openapi.yaml |
| Protobuf generated class | .proto |
| GraphQL generated types | .graphql schema |
| JPA generated metamodel | entity source |
| Minified JS bundle | original TS/JS source |
15.2 Generated Policy
Suggested:
generated:
  defaultPolicy: index_metadata_only
  allowAsSupportingEvidence:
    - openapi_server_interface
    - protobuf_generated_when_proto_missing
  neverPrimaryEvidence:
    - minified_bundle
    - compiled_output
    - dependency_generated
15.3 Generated Marker
Store generated marker:
generated:
  detected: true
  generator: openapi-generator
  marker: "Generated by OpenAPI Generator"
  sourceCandidate:
    path: openapi/order-api.yaml
This helps retrieval choose original contract instead of generated code.
16. Vendor and Third-Party Code
Vendored code should not become product knowledge.
16.1 Vendor Signals
Paths:
vendor/
third_party/
node_modules/
.venv/
externals/
Metadata:
- package manager directories,
- license headers,
- external dependency source,
- submodule markers.
16.2 Why Exclude Vendor
Because it:
- increases index size,
- confuses ownership,
- pollutes graph,
- can expose licenses/security issues beyond scope,
- makes docs explain libraries instead of product code.
16.3 Exception
If repository intentionally vendors critical code that is modified by team, allow override:
classification:
  vendorOverrides:
    - path: third_party/custom-rule-engine/**
      treatAs: source
      reason: "Internally modified fork used as product logic"
Require explicit reason.
17. Classification for Tests and Fixtures
Tests are valuable. Fixtures are mixed.
17.1 Tests as Supporting Evidence
Tests reveal behavior:
- expected errors,
- edge cases,
- examples,
- contracts,
- feature flags,
- regression scenarios.
Docs should cite tests for behavioral claims.
Example:
Corporate orders require a tax ID. Evidence: `OrderValidatorTest.shouldRejectCorporateOrderWithoutTaxId`.
17.2 Fixtures
Fixtures can be:
- realistic examples,
- generated samples,
- large dumps,
- fake data,
- golden files.
Policy:
| Fixture Type | Policy |
|---|---|
| Small example fixture | index_text |
| Large JSON dump | index_metadata_only |
| Golden expected output | index_text if relevant |
| Sensitive-like fixture | redact/block |
| Generated fixture | index_metadata_only |
18. Classification for Contracts and Schemas
Contracts are high-value evidence.
18.1 API Contracts
Examples:
- OpenAPI,
- AsyncAPI,
- GraphQL,
- protobuf,
- gRPC service definition.
Use contracts for:
- endpoint docs,
- request/response schema,
- event docs,
- service boundary,
- backward compatibility.
18.2 Database Migrations
Migrations reveal:
- table names,
- columns,
- constraints,
- indexes,
- lifecycle changes.
But migration history can be long. Avoid embedding every migration blindly.
Strategy:
- parse migration metadata,
- extract table/column changes,
- build schema timeline,
- retrieve relevant migrations only.
18.3 Schema Evidence
Schema files are often more authoritative than docs.
If README says endpoint accepts field customerId, but OpenAPI says accountId, mark conflict.
19. Impact of Classification on Retrieval
Classification becomes a ranking feature.
Example ranking boost:
score =
    lexical_score
  + exact_symbol_boost
  + source_kind_boost
  + path_relevance_boost
  + freshness_boost
  - generated_penalty
  - vendor_penalty
  - stale_doc_penalty
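The formula can be read as a plain additive score. In the sketch below every weight (2.0, 1.0, 1.5, and so on) is a made-up value for illustration, and `EvidenceRanker` and `Candidate` are hypothetical names; real weights would come from evaluation against retrieval quality.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical ranker: classification output feeds boosts and penalties.
final class EvidenceRanker {
    record Candidate(String path, String kind, double lexicalScore, boolean exactSymbolMatch,
                     double pathRelevance, double freshness, boolean staleDoc) {}

    static double score(Candidate c) {
        double score = c.lexicalScore();
        if (c.exactSymbolMatch()) score += 2.0;         // exact_symbol_boost (assumed weight)
        if (c.kind().equals("source")) score += 1.0;    // source_kind_boost
        score += c.pathRelevance();                     // path_relevance_boost
        score += c.freshness();                         // freshness_boost
        if (c.kind().equals("generated")) score -= 1.5; // generated_penalty
        if (c.kind().equals("vendor")) score -= 2.0;    // vendor_penalty
        if (c.staleDoc()) score -= 1.0;                 // stale_doc_penalty
        return score;
    }

    // Highest score first.
    static List<Candidate> rank(List<Candidate> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(EvidenceRanker::score).reversed())
                .toList();
    }
}
```

With these weights a generated client needs a much stronger lexical match than hand-written source before it can outrank it.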
19.1 Retrieval Policy by Task
| Task | Preferred Evidence |
|---|---|
| Explain module | source, tests, README, ADR |
| Explain API | OpenAPI, controller, tests |
| Explain deployment | Dockerfile, k8s, Helm, CI |
| Explain data model | entity, migration, schema |
| Prepare agent code change | source, tests, memory, nearby symbols |
| Generate runbook | config, infra, CI, existing runbook |
20. Impact of Classification on Documentation
The documentation generator should know the evidence type.
Example:
If evidence source is source, write:
The validation entry point is `OrderValidator.validate`.
If evidence source is stale docs, write:
Existing documentation mentions `OrderRuleEngine`, but no matching symbol was found in the indexed source. Treat this as potentially stale.
If evidence source is test:
Tests indicate that corporate orders without tax IDs are rejected.
Do not give equal confidence to all evidence.
21. Impact of Classification on Memory
Memory should be created only from high-confidence evidence.
21.1 Memory Candidate Policy
Allow memory from:
- primary source,
- tests,
- contracts,
- reviewed docs,
- ADR.
Avoid memory from:
- unknown docs,
- stale docs,
- generated code,
- fixture,
- vendored code,
- low-confidence summary.
21.2 Memory Confidence
Example:
confidence:
  base: 0.75
  modifiers:
    source_kind:
      source: +0.15
      test: +0.05
      stale_doc: -0.30
      generated: -0.20
    evidence_count:
      multiple_independent_sources: +0.10
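Applied literally, the modifiers above add up as in the sketch below. `MemoryConfidence` is a hypothetical helper, and clamping the result to the range 0.0 to 1.0 is an added assumption that the config itself does not state.

```java
// Hypothetical calculator for the confidence config shown above.
final class MemoryConfidence {
    private static final double BASE = 0.75;

    static double confidence(String sourceKind, int independentSources) {
        double value = BASE;
        switch (sourceKind) {
            case "source" -> value += 0.15;
            case "test" -> value += 0.05;
            case "stale_doc" -> value -= 0.30;
            case "generated" -> value -= 0.20;
            default -> { } // other kinds keep the base value
        }
        if (independentSources > 1) {
            value += 0.10; // multiple_independent_sources
        }
        // Assumption: keep the result a valid probability-like score.
        return Math.max(0.0, Math.min(1.0, value));
    }
}
```

For example, a claim backed by one stale doc lands at roughly 0.45, while a claim backed by source plus a second independent source reaches the 1.0 cap.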
22. Persisted Schema
22.1 Table: File Classification
CREATE TABLE file_classifications (
file_id TEXT PRIMARY KEY,
repository_id TEXT NOT NULL,
snapshot_id TEXT NOT NULL,
path TEXT NOT NULL,
language TEXT,
kind TEXT NOT NULL,
subkind TEXT,
knowledge_role TEXT NOT NULL,
index_policy TEXT NOT NULL,
confidence NUMERIC NOT NULL,
is_sensitive BOOLEAN NOT NULL,
is_generated BOOLEAN NOT NULL,
is_vendor BOOLEAN NOT NULL,
is_binary BOOLEAN NOT NULL,
is_large BOOLEAN NOT NULL,
classifier_version TEXT NOT NULL,
created_at TIMESTAMP NOT NULL
);
22.2 Table: Classification Signals
CREATE TABLE file_classification_signals (
id TEXT PRIMARY KEY,
file_id TEXT NOT NULL,
signal_type TEXT NOT NULL,
signal_value TEXT NOT NULL,
weight NUMERIC,
created_at TIMESTAMP NOT NULL
);
22.3 Why Store Signals?
Because later we need to answer:
- why was this file skipped?
- why was this file considered generated?
- why is this doc considered stale?
- why did the agent not receive this file as context?
Explainability matters.
23. Classifier Design in Code
23.1 Domain Types
public enum FileKind {
SOURCE,
TEST,
DOCUMENTATION,
CONTRACT,
SCHEMA,
CONFIGURATION,
INFRASTRUCTURE,
CI_CD,
DEPENDENCY_MANIFEST,
LOCKFILE,
GENERATED,
VENDOR,
ASSET,
BINARY,
SECRET_CANDIDATE,
UNKNOWN
}
public enum KnowledgeRole {
PRIMARY_EVIDENCE,
SUPPORTING_EVIDENCE,
OPERATIONAL_EVIDENCE,
CONTRACT_EVIDENCE,
METADATA_ONLY,
EXCLUDED,
BLOCKED_SENSITIVE
}
public enum IndexPolicy {
PARSE_AND_INDEX,
INDEX_TEXT,
INDEX_METADATA_ONLY,
SUMMARIZE_ONLY,
EXCLUDE,
BLOCK
}
23.2 Classification Result
public record FileClassification(
String fileId,
String path,
String language,
FileKind kind,
String subkind,
KnowledgeRole knowledgeRole,
IndexPolicy indexPolicy,
double confidence,
RiskFlags risk,
List<ClassificationSignal> signals,
String classifierVersion
) {}
public record RiskFlags(
boolean sensitive,
boolean generated,
boolean vendor,
boolean binary,
boolean large
) {}
public record ClassificationSignal(
String type,
String value,
double weight
) {}
23.3 Classifier Interface
public interface FileClassifier {
FileClassification classify(FileMetadata metadata, ContentSample sample, ClassificationConfig config);
}
23.4 Rule-Based Implementation
import java.util.Comparator;
import java.util.List;
public final class RuleBasedFileClassifier implements FileClassifier {
    private final List<ClassificationRule> rules;
    public RuleBasedFileClassifier(List<ClassificationRule> rules) {
        // Highest priority first, so block and security rules run before conventions.
        this.rules = rules.stream()
            .sorted(Comparator.comparingInt(ClassificationRule::priority).reversed())
            .toList();
    }
    @Override
    public FileClassification classify(
        FileMetadata metadata,
        ContentSample sample,
        ClassificationConfig config
    ) {
        ClassificationBuilder builder = ClassificationBuilder.from(metadata);
        for (ClassificationRule rule : rules) {
            if (rule.matches(metadata, sample, config)) {
                rule.apply(builder);
                // A terminal rule (for example a security block) stops further evaluation.
                if (rule.terminal()) {
                    break;
                }
            }
        }
        return builder.finalizeDecision();
    }
}
24. Quality Tests for Classifier
Classifier must be tested like product logic.
24.1 Golden Fixtures
Create a fixture repo:
fixtures/repo-classification/
  src/main/java/App.java
  src/test/java/AppTest.java
  docs/adr/001-use-postgres.md
  openapi/order.yaml
  target/generated-sources/ApiClient.java
  node_modules/lodash/index.js
  .env
  package-lock.json
  Dockerfile
  k8s/deployment.yaml
Expected results:
- path: src/main/java/App.java
  kind: source
  indexPolicy: parse_and_index
- path: .env
  kind: secret_candidate
  indexPolicy: block
- path: node_modules/lodash/index.js
  kind: vendor
  indexPolicy: exclude
24.2 Regression Tests
Test cases:
- generated source under an unusual path,
- docs under src/docs,
- OpenAPI YAML under root,
- Kubernetes YAML under deploy,
- .env.example, which should probably be config/example rather than always blocked,
- large markdown file,
- minified JS,
- symlink to sensitive file.
24.3 Metrics
Track:
| Metric | Why |
|---|---|
| files by kind | Understand repo shape |
| files blocked | Security visibility |
| files excluded | Cost/noise control |
| parseable files | Parser workload |
| unknown files | Classifier improvement |
| generated ratio | Identify codegen-heavy repos |
25. Edge Cases
25.1 .env.example
.env.example may be safe if values are placeholders.
But do not assume.
Policy:
- inspect content,
- redact secret-like values,
- classify as configuration/example_env,
- never treat as secret source of truth.
25.2 Minified JavaScript
Minified JS should be excluded or metadata-only.
Signals:
- very long lines,
- low whitespace ratio,
- .min.js,
- source map reference.
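These signals can be combined into a simple vote. The sketch below is an assumption-heavy illustration: `MinifiedSignal`, the 1000-character line limit, the 5% whitespace ratio, and the "two of three signals" rule are all hypothetical choices, with the `.min.js` suffix treated as decisive on its own.

```java
// Hypothetical detector for minified JavaScript based on a content sample.
final class MinifiedSignal {
    static boolean looksMinified(String path, String sample) {
        if (path.endsWith(".min.js")) {
            return true; // naming convention is treated as decisive here
        }
        if (sample.isEmpty()) {
            return false;
        }
        int longestLine = sample.lines().mapToInt(String::length).max().orElse(0);
        long whitespace = sample.chars().filter(Character::isWhitespace).count();
        double whitespaceRatio = whitespace / (double) sample.length();

        int signals = 0;
        if (longestLine > 1000) signals++;                            // very long lines
        if (whitespaceRatio < 0.05) signals++;                        // low whitespace ratio
        if (sample.contains("//# sourceMappingURL=")) signals++;      // source map reference
        return signals >= 2; // require agreement between at least two signals
    }
}
```

Requiring two signals avoids flagging short snippets that merely happen to contain little whitespace.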
25.3 Symlinks
Symlinks can point outside repository.
Policy:
- record symlink,
- do not follow symlink outside allowed root by default,
- resolve safely,
- prevent path traversal.
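A lexical containment check is one piece of this policy. The sketch below uses only `java.nio.file.Path` operations and does not touch the filesystem; `BoundaryCheck` and its parameters are hypothetical. Because intermediate directories can themselves be symlinks, a real implementation would presumably also compare `toRealPath()` results before reading anything.

```java
import java.nio.file.Path;

// Hypothetical boundary check: does a symlink target stay inside the repository root?
final class BoundaryCheck {
    /**
     * @param root    repository root
     * @param linkDir directory containing the symlink, relative to root
     * @param target  raw link target as stored in the symlink (relative or absolute)
     */
    static boolean staysInsideRoot(Path root, Path linkDir, String target) {
        Path normalizedRoot = root.normalize();
        // resolve() returns the target itself when it is absolute, so absolute
        // targets outside the root fail the startsWith check below.
        Path resolved = normalizedRoot.resolve(linkDir).resolve(target).normalize();
        // Path.startsWith compares whole path elements, so /repo2 is not inside /repo.
        return resolved.startsWith(normalizedRoot);
    }
}
```

Targets that fail the check can still be recorded as metadata so the decision stays explainable.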
25.4 Git Submodules
Submodules are separate repositories.
Do not silently merge identity.
Policy:
submodule:
  treatAs: external_repository
  indexPolicy: depends_on_permission
25.5 Monorepo Packages
A monorepo may contain many logical projects.
Classification should detect source roots per package:
services/order-service/src
services/billing-service/src
libs/common/src
Store package/module boundary separately.
26. Practical Exercise
Build a file classifier for one repository.
26.1 Input
Generate file inventory:
[
{
"path": "src/main/java/com/acme/order/OrderService.java",
"sizeBytes": 4210,
"sha256": "..."
}
]
26.2 Output
Produce:
[
{
"path": "src/main/java/com/acme/order/OrderService.java",
"kind": "source",
"subkind": "java_application_code",
"knowledgeRole": "primary_evidence",
"indexPolicy": "parse_and_index",
"confidence": 0.96
}
]
26.3 Acceptance Criteria
- .env is blocked,
- generated code is not primary evidence,
- tests are supporting evidence,
- README is documentation,
- OpenAPI is contract,
- lockfile is metadata only,
- unknown files are reported,
- every decision has reason signals.
27. Common Mistakes
27.1 Treating Extension as Truth
.yaml is not enough. It can be config, Kubernetes, OpenAPI, Helm, CI, or arbitrary data.
27.2 Indexing Everything
More content does not mean better retrieval. Noise destroys precision.
27.3 Ignoring Generated Code
Generated code can dominate Java/TypeScript repos. If indexed as primary source, docs become tool-centric instead of domain-centric.
27.4 Blocking Too Aggressively
If .env.example is always blocked, agent may miss required config variables. Better: redact and classify carefully.
27.5 No Explainability
When users ask "why did the system ignore this file?", you need stored signals.
28. Summary
File classification is not a preprocessing detail. It is a core architecture layer.
Key points:
- repository contains many kinds of evidence,
- not all files should be parsed or embedded,
- classification affects retrieval, docs, memory, security, and cost,
- source boundary must be explicit per use case,
- generated/vendor/sensitive/large files need special handling,
- classification must be deterministic and explainable,
- policy result should drive downstream processing,
- every classification decision should be persisted with signals.
The next part covers Language Detection and Parser Strategy: how to determine a file's language, choose a parser, handle parser failures, and build a realistic multi-language strategy.