Learn Ai Code Documentation Agent Memory Part 006 Language Detection And Parser Strategy

title: Learn AI Code Documentation & Agent Memory Platform - Part 006
description: Language detection and parser strategy for building multi-language code intelligence that is stable, incremental, and failure-tolerant.
series: learn-ai-code-documentation-agent-memory
seriesTitle: Learn AI Code Documentation & Agent Memory Platform
order: 6
partTitle: Language Detection and Parser Strategy
tags:
  - ai
  - code-intelligence
  - parser
  - language-detection
  - tree-sitter
  - repository-analysis
  - documentation
  - agent-memory
date: 2026-07-02

Part 006 — Language Detection and Parser Strategy

1. Goals of This Part

Once files have been classified, we need to determine:

the language of the file,
the parser to use,
the fallback strategy when parsing fails,
the semantic units to extract,
how the parser works incrementally,
how parse results become the basis for symbols, graph, retrieval, docs, and memory.

This part covers language detection and parser strategy for a code intelligence system.

The goal is not to build a compiler. The goal is to build a parsing layer that is accurate, stable, and scalable enough for:

symbol extraction,
structure-based chunking,
dependency graph,
API surface detection,
doc generation,
agent context assembly,
memory grounding.

2. The Parser Layer in the Architecture

The parser is not the end goal. It is a tool for producing code units and evidence spans.

3. Language Detection: Why It Is Not as Simple as the Extension

File extensions are useful, but they are not sufficient.

Examples:

Path	Extension	Possible interpretations
`Dockerfile`	none	Dockerfile
`Jenkinsfile`	none	Groovy-like pipeline
`BUILD`	none	Bazel
`order.yaml`	`.yaml`	OpenAPI, Kubernetes, Helm, Spring config
`schema.sql`	`.sql`	migration, schema, seed
`index`	none	shell, text, binary
`foo.h`	`.h`	C, C++, Objective-C
`template.html`	`.html`	static HTML, template, Vue/Svelte fragment
`package.json`	`.json`	dependency manifest
`tsconfig.json`	`.json`	TypeScript config

Language detection should combine:

path,
filename,
extension,
content markers,
shebang,
repository context,
classification result,
parser availability.

4. Language Detection Output

Do not return just a language string. Return a structured result.

path: src/main/java/com/acme/order/OrderService.java
detectedLanguage: java
confidence: 0.99
source:
  - extension: ".java"
  - content: "package declaration"
  - path: "src/main/java"
parser:
  parserId: tree-sitter-java
  mode: structural

Ambiguous example:

path: deploy/order.yaml
detectedLanguage: yaml
semanticDialect: kubernetes
confidence: 0.91
source:
  - extension: ".yaml"
  - content: "apiVersion and kind"
parser:
  parserId: yaml-parser
  mode: structured_text

Fallback example:

path: Jenkinsfile
detectedLanguage: groovy
semanticDialect: jenkins_pipeline
confidence: 0.82
parser:
  parserId: fallback-lexical
  mode: lexical
reason:
  - "No configured structural parser for Jenkins pipeline"

5. Language vs Dialect vs File Role

Keep three things separate:

Concept	Example	Purpose
Language	Java, YAML, SQL	Parser syntax
Dialect	Spring YAML, Kubernetes YAML, OpenAPI YAML	Semantic interpretation
File role	source, config, contract	Downstream policy

Examples:

language: yaml
dialect: openapi
kind: contract
subkind: openapi

language: yaml
dialect: kubernetes
kind: infrastructure
subkind: kubernetes_manifest

language: yaml
dialect: spring_boot_config
kind: configuration
subkind: application_config

The YAML parser is the same, but the semantic extractor differs.

6. Detection Pipeline

6.1 Filename Rule

Special filenames:

Filename	Language/Dialect
`Dockerfile`	dockerfile
`Jenkinsfile`	groovy/jenkins
`Makefile`	make
`pom.xml`	xml/maven
`build.gradle`	groovy/gradle
`build.gradle.kts`	kotlin/gradle
`go.mod`	gomod
`package.json`	json/npm_package
`tsconfig.json`	json/typescript_config
`.gitlab-ci.yml`	yaml/gitlab_ci
`docker-compose.yml`	yaml/docker_compose

6.2 Extension Rule

Extension mapping:

.java: java
.kt: kotlin
.go: go
.py: python
.ts: typescript
.tsx: tsx
.js: javascript
.jsx: jsx
.cs: csharp
.rb: ruby
.php: php
.rs: rust
.sql: sql
.proto: protobuf
.graphql: graphql
.yaml: yaml
.yml: yaml
.json: json
.xml: xml
.md: markdown
.mdx: mdx
.tf: terraform

6.3 Shebang

A shebang line can identify the language when the file has no extension.

#!/usr/bin/env python3

#!/bin/bash

#!/usr/bin/env node

6.4 Content Sniffing

Examples:

package com.acme.order;

=> Java

syntax = "proto3";

=> Protobuf

openapi: 3.0.0

=> YAML/OpenAPI

apiVersion: apps/v1
kind: Deployment

=> YAML/Kubernetes
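
The four rules in 6.1 to 6.4 can be chained into one ordered pipeline. The following is a minimal sketch, not a reference implementation: the class `LanguageDetector`, the `Detection` record, the rule order, the shebang mapping to `shell`, and every confidence value are assumptions made for illustration. Only a handful of the filename and extension mappings listed above are included.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: rule order and confidence values are illustrative assumptions.
final class LanguageDetector {

    record Detection(String language, String dialect, double confidence, List<String> sources) {}

    // "language/dialect" pairs, following the filename table in 6.1.
    private static final Map<String, String> BY_FILENAME = Map.of(
        "Dockerfile", "dockerfile",
        "Jenkinsfile", "groovy/jenkins",
        "Makefile", "make",
        "pom.xml", "xml/maven",
        "package.json", "json/npm_package");

    private static final Map<String, String> BY_EXTENSION = Map.of(
        ".java", "java", ".py", "python", ".go", "go", ".ts", "typescript",
        ".yaml", "yaml", ".yml", "yaml", ".proto", "protobuf");

    private static final Map<String, String> BY_SHEBANG = Map.of(
        "python", "python", "bash", "shell", "node", "javascript");

    private static final Pattern JAVA_PACKAGE = Pattern.compile("(?m)^package\\s+[\\w.]+;");
    private static final Pattern OPENAPI = Pattern.compile("(?m)^openapi:\\s");
    private static final Pattern K8S_API = Pattern.compile("(?m)^apiVersion:\\s");
    private static final Pattern K8S_KIND = Pattern.compile("(?m)^kind:\\s");

    static Detection detect(String path, String content) {
        String fileName = path.substring(path.lastIndexOf('/') + 1);

        // 1. Special filenames are the strongest signal.
        String byName = BY_FILENAME.get(fileName);
        if (byName != null) {
            String[] parts = byName.split("/", 2);
            return new Detection(parts[0], parts.length > 1 ? parts[1] : null, 0.95, List.of("filename"));
        }

        // 2. Extension mapping, refined by content markers for YAML dialects.
        int dot = fileName.lastIndexOf('.');
        String byExtension = dot >= 0 ? BY_EXTENSION.get(fileName.substring(dot)) : null;
        if (byExtension != null) {
            String dialect = byExtension.equals("yaml") ? sniffYamlDialect(content) : null;
            List<String> sources = dialect == null ? List.of("extension") : List.of("extension", "content");
            return new Detection(byExtension, dialect, 0.90, sources);
        }

        // 3. Shebang on the first line, for files without a known extension.
        String firstLine = content.lines().findFirst().orElse("");
        if (firstLine.startsWith("#!")) {
            for (Map.Entry<String, String> entry : BY_SHEBANG.entrySet()) {
                if (firstLine.contains(entry.getKey())) {
                    return new Detection(entry.getValue(), null, 0.85, List.of("shebang"));
                }
            }
        }

        // 4. Content sniffing as the weakest signal.
        if (JAVA_PACKAGE.matcher(content).find()) {
            return new Detection("java", null, 0.60, List.of("content"));
        }
        return new Detection("unknown", null, 0.0, List.of());
    }

    private static String sniffYamlDialect(String content) {
        if (OPENAPI.matcher(content).find()) return "openapi";
        if (K8S_API.matcher(content).find() && K8S_KIND.matcher(content).find()) return "kubernetes";
        return null;
    }
}
```

Calling `LanguageDetector.detect("deploy/order.yaml", content)` on a Kubernetes manifest would return language `yaml` with dialect `kubernetes`, and the `sources` list records which signals contributed. A real detector would also take repository context and the classification result into account, which this sketch leaves out.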

7. Parser Strategy

The parser strategy has to be realistic.

Not every language needs compiler-level semantic analysis from day one.

7.1 Parser Levels

Level	Name	Capabilities
0	Metadata-only	Path, size, hash
1	Lexical	Token-level extraction with safe regex patterns
2	Structural	Syntax tree, symbol boundaries
3	Semantic-lite	Imports, declarations, annotations, route hints
4	Semantic-rich	Type resolution, call resolution, project model
5	Compiler-backed	Full language server/compiler integration

An MVP can start at levels 2–3.

For some languages, levels 4–5 are expensive and take considerable effort.

7.2 Parser Selection Matrix

File Type	MVP Parser	Later
Java	Tree-sitter / Java parser	JavaParser, Eclipse JDT, javac model
TypeScript	Tree-sitter	TypeScript compiler API
Go	Tree-sitter / go/parser	go/packages
Python	Tree-sitter / ast	type-aware analysis
YAML	YAML parser	dialect-specific validators
JSON	JSON parser	schema-aware extractor
SQL	SQL parser/basic	dialect-specific parser
Markdown	Markdown parser	MDX/doc AST
Proto	Protobuf parser	descriptor model
GraphQL	GraphQL parser	schema validation

7.3 Tree-Sitter Role

Tree-sitter-style parsing is useful because it:

supports many languages,
produces concrete syntax trees,
can parse incomplete code reasonably,
can be incremental,
works well for editor-like code analysis,
is good enough for symbol extraction and structural chunking.

But it does not automatically provide:

full type resolution,
build-aware classpath,
macro expansion,
dynamic dispatch resolution,
framework semantics,
cross-project dependency resolution.

So treat it as a syntax foundation, not as full semantic truth.

8. Parser Abstraction

Do not let product logic depend on one parser library.

8.1 Interface

public interface SourceParser {
    ParserId parserId();

    boolean supports(LanguageDetection detection);

    ParseResult parse(ParseRequest request);
}

8.2 Parse Request

public record ParseRequest(
    String repositoryId,
    String snapshotId,
    String fileId,
    String path,
    String content,
    LanguageDetection language,
    ParseOptions options
) {}

8.3 Parse Result

public record ParseResult(
    String fileId,
    ParserId parserId,
    ParseStatus status,
    SyntaxTree tree,
    List<ParseDiagnostic> diagnostics,
    List<SourceSpan> errorSpans,
    ParserMetrics metrics
) {}

public enum ParseStatus {
    OK,
    PARTIAL,
    FAILED,
    SKIPPED
}

8.4 Why Include Diagnostics?

Because parser failure is normal.

You need diagnostics to answer:

is the file invalid?
is parser unsupported?
is syntax too new?
is file partial/generated?
should fallback be used?
should this affect quality score?

9. Syntax Tree vs Symbol Model

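Never expose the raw parser tree as your domain model.

A parser tree is library-specific and language-specific.

Instead, keep the two apart. The parser tree stays internal to the parsing layer, and everything downstream consumes a canonical symbol model.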

9.1 Parser Tree

Example conceptual Java tree:

program
  package_declaration
  import_declaration
  class_declaration
    modifiers
    identifier
    class_body
      method_declaration

9.2 Canonical Symbol Model

symbolId: sym_01J...
kind: method
language: java
qualifiedName: com.acme.order.OrderService.createOrder
signature: createOrder(CreateOrderRequest): Order
path: src/main/java/com/acme/order/OrderService.java
span:
  startLine: 34
  endLine: 78
modifiers:
  - public
annotations:
  - Transactional
parent:
  kind: class
  qualifiedName: com.acme.order.OrderService

The canonical model is what downstream systems should use.

10. Canonical Language Model

A multi-language system needs normalized concepts.

10.1 Code Unit Kinds

Canonical Kind	Java	TypeScript	Go	Python
package/module	package	module	package	module
type	class/interface/record/enum	class/interface/type	struct/interface	class
function	function/static method	function	func	function
method	method	method	method receiver	method
field	field	property	field	attribute
constant	static final	const	const	constant
decorator/annotation	annotation	decorator	comment/struct tag	decorator
import	import	import	import	import
test	JUnit method	Jest test	`TestX` func	pytest/unittest
route	controller annotation	router registration	handler registration	decorator/router

10.2 Canonical Symbol Fields

symbol:
  id: string
  repositoryId: string
  snapshotId: string
  fileId: string
  language: string
  kind: string
  name: string
  qualifiedName: string
  signature: string
  signatureHash: string
  parentSymbolId: string?
  visibility: string?
  annotations: []
  modifiers: []
  span:
    startLine: int
    startColumn: int
    endLine: int
    endColumn: int
  bodySpan:
    startLine: int
    startColumn: int
    endLine: int
    endColumn: int

10.3 Stable Symbol ID

A stable symbol ID is critical.

Bad:

symbolId = random_uuid()

Better:

symbolId = hash(repositoryId, snapshotId, path, kind, qualifiedName, signatureHash)

But a snapshot-bound ID changes with every commit. For continuity, also store a logical identity:

logicalSymbolId = hash(repositoryId, path_or_package, kind, qualifiedName, signatureHash_without_line)

You may need both:

ID	Purpose
`symbolInstanceId`	Specific snapshot/commit
`logicalSymbolId`	Track same symbol across commits
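
Both IDs can be derived from the same hashing helper. The sketch below assumes SHA-256 over the listed fields with a NUL byte as field separator; the class name `SymbolIds` and the choice of hash function are illustrative assumptions, since the text above only says `hash(...)`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical sketch: SHA-256 and the NUL separator are assumptions, not prescribed by the text.
final class SymbolIds {

    // Hashes the parts in order; the separator keeps ("ab", "c") distinct from ("a", "bc").
    static String hash(String... parts) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String part : parts) {
                digest.update(part.getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0);
            }
            return HexFormat.of().formatHex(digest.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Bound to one snapshot: changes whenever the snapshot changes.
    static String symbolInstanceId(String repositoryId, String snapshotId, String path,
                                   String kind, String qualifiedName, String signatureHash) {
        return hash(repositoryId, snapshotId, path, kind, qualifiedName, signatureHash);
    }

    // No snapshot and no line numbers: stays the same while the symbol keeps its identity.
    static String logicalSymbolId(String repositoryId, String pathOrPackage,
                                  String kind, String qualifiedName, String signatureHash) {
        return hash(repositoryId, pathOrPackage, kind, qualifiedName, signatureHash);
    }
}
```

With this split, the same `createOrder` method yields the same `logicalSymbolId` in two commits but a different `symbolInstanceId`, as long as its signature hash excludes line positions.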

11. Parser Failure Is a First-Class Case

Do not assume parsing succeeds.

11.1 Failure Types

Failure	Example	Handling
Unsupported language	`.ex` Elixir without parser	fallback lexical
Syntax error	broken branch	partial parse
New syntax	parser outdated	partial/fallback
Generated weird code	invalid formatting	lower confidence
Huge file	parser timeout	metadata/text only
Encoding issue	non-UTF file	skip or decode fallback
Mixed language	Vue/Svelte/MDX	multi-parser

11.2 Parse Status

status: partial
diagnostics:
  - severity: warning
    message: "Unexpected token near line 42"
fallbackUsed: false
confidence: 0.72

11.3 Downstream Impact

If parse is partial:

symbol extraction confidence decreases,
docs should avoid strong claims,
retrieval can still use chunks,
quality report should mention parser uncertainty.

Example generated docs note:

Parser diagnostics were reported for `src/foo.ts`; documentation for that file may be incomplete.

12. Multi-Language Reality

Real repositories are mixed-language.

Example Java service:

src/main/java
src/main/resources/application.yml
src/test/java
pom.xml
Dockerfile
helm/templates/deployment.yaml
openapi/order-api.yaml
db/migration/V001__init.sql
.github/workflows/ci.yml

This is not a "Java repo". It is a multi-language knowledge system.

12.1 Language Is Not Product Boundary

Product boundary is repo/module/service, not language.

A documentation request for "order-service" may need:

Java source,
OpenAPI,
SQL migrations,
YAML config,
Dockerfile,
Kubernetes manifest,
CI workflow.

12.2 Parser Orchestration

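Use a parser registry:

import java.util.List;

public final class ParserRegistry {
    private final List<SourceParser> parsers;

    public ParserRegistry(List<SourceParser> parsers) {
        // Order matters: the first parser that supports the detection wins.
        this.parsers = List.copyOf(parsers);
    }

    // Falls back to the lexical parser when no registered parser matches.
    public SourceParser select(LanguageDetection detection) {
        return parsers.stream()
            .filter(parser -> parser.supports(detection))
            .findFirst()
            .orElse(FallbackLexicalParser.INSTANCE);
    }
}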

13. Structural Chunking Depends on Parser

Better chunks come from syntax boundaries.

13.1 Bad Chunking

Naive fixed-size chunk:

lines 1-200
lines 201-400

Problems:

method split in half,
imports separated from class,
comments detached,
retrieval gets incomplete context.

13.2 Better Chunking

Parser-aware chunks:

class chunk,
method chunk,
function chunk,
route handler chunk,
test case chunk,
schema chunk,
config section chunk.

Example:

chunk:
  kind: method
  symbol: com.acme.order.OrderService.createOrder
  path: src/main/java/com/acme/order/OrderService.java
  lines: [34, 78]
  includes:
    - leading comments
    - annotations
    - signature
    - body

13.3 Chunk Granularity

Chunk Type	Good For	Risk
file	overview	too large
class	module docs	may exceed the token budget
method/function	agent task	may miss context
block	fine retrieval	loses meaning
config section	ops docs	parser dependent

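Use a hierarchy: keep chunks at several levels (file, class, method) and link each one to its parent, so retrieval can start narrow and widen the context when needed.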

14. Language-Specific Extraction Examples

14.1 Java

Extract:

package,
imports,
class/interface/record/enum,
methods,
fields,
annotations,
visibility,
Spring annotations,
JAX-RS annotations,
tests.

Example:

@RestController
@RequestMapping("/orders")
public class OrderController {
    @PostMapping
    public OrderResponse createOrder(@RequestBody CreateOrderRequest request) {
        return orderService.create(request);
    }
}

Canonical extraction:

symbols:
  - kind: class
    qualifiedName: OrderController
    annotations: [RestController, RequestMapping]
  - kind: method
    qualifiedName: OrderController.createOrder
    annotations: [PostMapping]
routes:
  - method: POST
    path: /orders
    handler: OrderController.createOrder

14.2 TypeScript

Extract:

imports/exports,
functions,
classes,
interfaces/types,
React components,
route handlers,
tests.

Example:

export async function createOrder(req: Request, res: Response) {
  const result = await orderService.create(req.body);
  res.json(result);
}

Canonical extraction:

symbols:
  - kind: function
    qualifiedName: createOrder
    exported: true
dependencies:
  - orderService.create

14.3 Go

Extract:

package,
imports,
funcs,
receiver methods,
structs,
interfaces,
tests.

Example:

func (s *OrderService) Create(ctx context.Context, req CreateOrderRequest) (*Order, error) {
    return s.repo.Save(ctx, req)
}

Canonical extraction:

symbols:
  - kind: method
    receiver: "*OrderService"
    qualifiedName: OrderService.Create

14.4 Python

Extract:

imports,
classes,
functions,
decorators,
FastAPI/Flask routes,
tests.

Example:

@app.post("/orders")
def create_order(request: CreateOrderRequest):
    return order_service.create(request)

Canonical extraction:

symbols:
  - kind: function
    qualifiedName: create_order
    decorators:
      - app.post("/orders")
routes:
  - method: POST
    path: /orders
    handler: create_order

15. Framework-Aware Extraction

Syntax parsing alone does not know frameworks.

For docs, framework hints are valuable.

15.1 Java Framework Hints

Framework	Signal
Spring MVC	`@RestController`, `@RequestMapping`, `@GetMapping`
Spring Service	`@Service`, `@Component`
JPA	`@Entity`, `@Table`, `@Column`
JAX-RS	`@Path`, `@GET`, `@POST`
JUnit	`@Test`, `@ParameterizedTest`

15.2 TypeScript Framework Hints

Framework	Signal
Express	`router.get`, `app.post`
NestJS	`@Controller`, `@Get`, `@Post`
React	function component, JSX
Jest	`describe`, `it`, `test`

15.3 Go Framework Hints

Framework	Signal
net/http	`http.HandleFunc`
Gin	`router.GET`, `router.POST`
gRPC	generated service registration
testing	`func TestX(t *testing.T)`

15.4 Python Framework Hints

Framework	Signal
FastAPI	`@app.get`, `@router.post`
Flask	`@app.route`
Django	urls.py patterns
pytest	`test_` functions

Framework extractors should be plugins, not hardcoded everywhere.
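
One way to keep framework knowledge out of the core is a small extractor interface that works on canonical symbols only. The sketch below is an assumption-laden illustration: `FrameworkExtractor`, `SpringMvcExtractor`, and the simplified `Symbol`, `Annotation`, and `Route` records are hypothetical shapes, and annotation arguments are reduced to a single string value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a framework extractor plugin working on canonical symbols.
final class FrameworkPlugins {

    record Annotation(String name, String value) {}
    record Symbol(String kind, String qualifiedName, String parent, List<Annotation> annotations) {}
    record Route(String method, String path, String handler, String extraction) {}

    // The core only knows this interface; each framework ships its own implementation.
    interface FrameworkExtractor {
        List<Route> extractRoutes(List<Symbol> symbols);
    }

    static final class SpringMvcExtractor implements FrameworkExtractor {
        private static final Map<String, String> HTTP_METHODS = Map.of(
            "GetMapping", "GET", "PostMapping", "POST",
            "PutMapping", "PUT", "DeleteMapping", "DELETE");

        @Override
        public List<Route> extractRoutes(List<Symbol> symbols) {
            List<Route> routes = new ArrayList<>();
            for (Symbol symbol : symbols) {
                if (!symbol.kind().equals("method")) continue;
                // Class-level @RequestMapping supplies the path prefix.
                String prefix = classPrefix(symbols, symbol.parent());
                for (Annotation annotation : symbol.annotations()) {
                    String httpMethod = HTTP_METHODS.get(annotation.name());
                    if (httpMethod != null) {
                        routes.add(new Route(httpMethod, prefix + annotation.value(),
                            symbol.qualifiedName(), "spring_mvc_annotation"));
                    }
                }
            }
            return routes;
        }

        private static String classPrefix(List<Symbol> symbols, String parent) {
            for (Symbol symbol : symbols) {
                if (symbol.kind().equals("class") && symbol.qualifiedName().equals(parent)) {
                    for (Annotation annotation : symbol.annotations()) {
                        if (annotation.name().equals("RequestMapping")) return annotation.value();
                    }
                }
            }
            return "";
        }
    }
}
```

Fed the `OrderController` symbols from section 14.1, this extractor would produce one route, `POST /orders`, handled by `OrderController.createOrder`. Because it depends only on the canonical model, it can be tested without any parser library.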

16. Parser Confidence

Every extracted fact should have confidence.

16.1 Confidence Inputs

Signal	Impact
Parse status OK	High
Partial parse	Lower
Exact syntax node match	High
Regex fallback	Lower
Framework annotation clear	High
Dynamic registration	Lower
Generated code	Lower
Test evidence	Supporting

Example:

route:
  method: POST
  path: /orders
  handler: OrderController.createOrder
  confidence: 0.93
  evidence:
    - path: OrderController.java
      lines: [12, 19]
      extraction: spring_mvc_annotation

17. Incremental Parsing Strategy

Repository updates should not reparse everything.

17.1 Change Detection

Input:

changed file path,
old hash,
new hash,
classification,
parser version.

Reparse if:

content hash changed,
parser version changed,
classification policy changed,
language detection changed,
extractor version changed.

17.2 Parse Cache Key

parseCacheKey =
  hash(fileSha256, parserId, parserVersion, extractorVersion, parseOptions)
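
A sketch of how the cache key and the reparse decision might fit together, assuming SHA-256 over a canonical string. `ParseCache`, `Fingerprint`, and `needsReparse` are hypothetical names. Parse options are sorted before hashing so that key order cannot change the result.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: type names and the canonical string layout are assumptions.
final class ParseCache {

    record Fingerprint(String fileSha256, String parserId, String parserVersion,
                       String extractorVersion, Map<String, String> parseOptions) {

        String cacheKey() {
            // TreeMap sorts the options, so {a, b} and {b, a} hash identically.
            String canonical = String.join("\n", fileSha256, parserId, parserVersion,
                extractorVersion, new TreeMap<>(parseOptions).toString());
            return sha256(canonical);
        }
    }

    // Reparse when nothing is cached or when any input of the key has changed.
    static boolean needsReparse(String cachedKey, Fingerprint current) {
        return cachedKey == null || !cachedKey.equals(current.cacheKey());
    }

    private static String sha256(String value) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(value.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

Content, parser version, and extractor version changes from 17.1 all show up as a different key. A change in language detection usually changes `parserId` as well. Classification policy changes are not part of this key and would need a separate trigger or an extra field.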

17.3 Invalidating Downstream

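When a parse result changes, everything derived from it has to be invalidated or recomputed: symbols, chunks, dependency graph edges, retrieval index entries, generated docs, and memory entries grounded in that file.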

18. Mixed-Language Files

Some files contain multiple languages.

Examples:

MDX: Markdown + JSX,
Vue: HTML + JS/TS + CSS,
Svelte: template + script + style,
Jupyter notebook: JSON + code cells,
SQL embedded in Java strings,
Terraform with JSON/YAML snippets.

18.1 Strategy

File Type	Strategy
MDX	parse markdown, extract code blocks and JSX if needed
Vue/Svelte	split blocks, parse script separately
Notebook	extract code cells and markdown cells
SQL in strings	optional heuristic extraction
Markdown code blocks	classify embedded code separately

18.2 Embedded Code Span

Represent embedded code with parent relation:

embeddedSource:
  parentFile: docs/examples.md
  language: java
  span:
    startLine: 22
    endLine: 48
  role: documentation_example

Do not confuse example code with production code.
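
For Markdown, embedded code spans can be collected with a single pass over the lines. The sketch below is deliberately simplified and assumes backtick fences of exactly three characters that are never nested. `EmbeddedCodeExtractor` and the `EmbeddedSource` record are hypothetical names that mirror the YAML shape above. The fence marker is built with `repeat` only so that the example itself does not contain a literal fence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: handles plain three-backtick fences only, no tildes or nested fences.
final class EmbeddedCodeExtractor {

    record EmbeddedSource(String parentFile, String language, int startLine, int endLine,
                          String role, String code) {}

    private static final String FENCE = "`".repeat(3);

    static List<EmbeddedSource> extract(String parentFile, String markdown) {
        List<EmbeddedSource> result = new ArrayList<>();
        String[] lines = markdown.split("\\R", -1);
        int openLine = -1;              // 1-based line of the opening fence, -1 when outside a block
        String language = "";
        StringBuilder code = new StringBuilder();

        for (int i = 0; i < lines.length; i++) {
            String trimmed = lines[i].strip();
            if (!trimmed.startsWith(FENCE)) {
                if (openLine >= 0) code.append(lines[i]).append('\n');
                continue;
            }
            if (openLine < 0) {
                // Opening fence: remember where it is and read the info string.
                openLine = i + 1;
                language = trimmed.substring(FENCE.length()).strip();
                code.setLength(0);
            } else {
                // Closing fence: the span covers the code lines only, not the fences.
                result.add(new EmbeddedSource(parentFile,
                    language.isEmpty() ? "unknown" : language,
                    openLine + 1, i, "documentation_example", code.toString()));
                openLine = -1;
            }
        }
        return result;
    }
}
```

Every extracted span carries `role: documentation_example` and a reference to its parent file, which is what lets downstream steps keep example code apart from production code.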

19. Parser Diagnostics as Quality Signal

Diagnostics should flow to quality report.

Example:

parseQuality:
  filesParsed: 182
  filesPartial: 4
  filesFailed: 2
  unsupportedLanguages:
    - elixir
  warnings:
    - "Partial parse for src/legacy/parser.ts"

Docs generated from a partial parse should carry a lower confidence.

Agent context should prefer fully parsed evidence when possible.

20. Storage Model

20.1 Language Detection Table

CREATE TABLE language_detections (
    file_id TEXT PRIMARY KEY,
    repository_id TEXT NOT NULL,
    snapshot_id TEXT NOT NULL,
    path TEXT NOT NULL,
    language TEXT,
    dialect TEXT,
    confidence NUMERIC NOT NULL,
    detector_version TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL
);

20.2 Parse Result Table

CREATE TABLE parse_results (
    file_id TEXT PRIMARY KEY,
    repository_id TEXT NOT NULL,
    snapshot_id TEXT NOT NULL,
    parser_id TEXT NOT NULL,
    parser_version TEXT NOT NULL,
    extractor_version TEXT NOT NULL,
    status TEXT NOT NULL,
    diagnostic_count INTEGER NOT NULL,
    parse_duration_ms INTEGER NOT NULL,
    created_at TIMESTAMP NOT NULL
);

20.3 Symbol Table

CREATE TABLE code_symbols (
    symbol_instance_id TEXT PRIMARY KEY,
    logical_symbol_id TEXT NOT NULL,
    repository_id TEXT NOT NULL,
    snapshot_id TEXT NOT NULL,
    file_id TEXT NOT NULL,
    language TEXT NOT NULL,
    kind TEXT NOT NULL,
    name TEXT NOT NULL,
    qualified_name TEXT NOT NULL,
    signature TEXT,
    signature_hash TEXT,
    parent_symbol_id TEXT,
    start_line INTEGER NOT NULL,
    start_column INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    end_column INTEGER NOT NULL,
    confidence NUMERIC NOT NULL
);

20.4 Diagnostics Table

CREATE TABLE parse_diagnostics (
    id TEXT PRIMARY KEY,
    file_id TEXT NOT NULL,
    severity TEXT NOT NULL,
    message TEXT NOT NULL,
    start_line INTEGER,
    start_column INTEGER,
    end_line INTEGER,
    end_column INTEGER
);

21. Parser Evaluation

The parser layer must be tested.

21.1 Golden Source Fixtures

Create small files for each language.

Example Java fixture:

package com.acme.order;

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {
    @PostMapping("/orders")
    public OrderResponse createOrder(CreateOrderRequest request) {
        return null;
    }
}

Expected extraction:

symbols:
  - kind: class
    name: OrderController
  - kind: method
    name: createOrder
routes:
  - method: POST
    path: /orders

21.2 Regression Cases

Test:

nested class,
overloaded methods,
annotations/decorators,
async functions,
generics,
receiver methods,
multiline signatures,
syntax errors,
comments,
generated files,
mixed-language files.

21.3 Metrics

Metric	Meaning
parse success rate	stability
partial parse rate	syntax/parser mismatch
symbol extraction count	coverage
extraction precision	correctness
extraction recall	completeness
parse latency	performance
parser timeout count	reliability

22. Parser Performance

Parsing can become expensive in monorepos.

22.1 Optimization

skip excluded files,
sample before full read,
size threshold,
parse cache,
incremental reparse,
parallel workers,
parser pool,
timeout per file,
memory limit per worker.

22.2 Timeout Policy

parseTimeout:
  defaultMs: 3000
  largeFileMs: 8000
onTimeout:
  status: failed
  fallback: lexical
  indexPolicy: index_text_limited

22.3 Avoid One Bad File Killing the Job

Each file parse should be isolated.

Bad:

one parse exception fails entire repository scan

Good:

file parse fails -> diagnostic -> fallback -> continue job
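
A minimal sketch of per-file isolation with a timeout, using a standard `ExecutorService`. The class `IsolatedParseRunner`, the `Outcome` record, and the idea of passing the structural parse as a `Callable` are assumptions for illustration. Note that `Future.cancel(true)` only interrupts the worker thread, so a parser that ignores interrupts keeps running; enforcing hard time or memory limits would need process-level isolation, which this sketch does not attempt.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: one failing or slow file yields a diagnostic instead of failing the job.
final class IsolatedParseRunner {

    enum Status { OK, FAILED }

    record Outcome(String path, Status status, String fallback, String diagnostic) {}

    // Daemon threads so a stuck parser never blocks JVM shutdown.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "parse-worker");
        thread.setDaemon(true);
        return thread;
    });

    static Outcome parse(String path, Callable<Object> structuralParse, long timeoutMs) {
        Future<Object> future = POOL.submit(structuralParse);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return new Outcome(path, Status.OK, null, null);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker thread
            return new Outcome(path, Status.FAILED, "lexical", "parser timeout after " + timeoutMs + " ms");
        } catch (ExecutionException e) {
            // The parser threw: record it as a diagnostic and move on.
            return new Outcome(path, Status.FAILED, "lexical", "parser error: " + e.getCause().getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new Outcome(path, Status.FAILED, "lexical", "interrupted while waiting for parser");
        }
    }
}
```

The caller loops over files, stores each `Outcome`, and runs the lexical fallback for the failed ones, so the repository scan always completes.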

23. Language Detection and Security

Do not trust file content.

Repo content can contain prompt injection or malicious text.

Parser should treat code/docs as data.

23.1 Parser Safety

no code execution,
no dependency install,
no build execution by default,
no external network,
path sandboxing,
timeout,
memory limit.
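
Path sandboxing from the list above can be sketched with `java.nio.file.Path`. The class `PathSandbox` and the method `resolveInside` are hypothetical names. The check is purely lexical, so it assumes symlinks are handled elsewhere (for example by resolving real paths before reading).

```java
import java.nio.file.Path;

// Hypothetical sketch: lexical containment check only, symlinks are not resolved here.
final class PathSandbox {

    // Resolves a repository-relative path and rejects anything that leaves the root.
    static Path resolveInside(Path repoRoot, String relativePath) {
        Path root = repoRoot.toAbsolutePath().normalize();
        // normalize() collapses ".." segments before the containment check.
        Path resolved = root.resolve(relativePath).normalize();
        if (!resolved.startsWith(root)) {
            throw new SecurityException("Path escapes repository root: " + relativePath);
        }
        return resolved;
    }
}
```

A path such as `../etc/passwd` or an absolute path outside the checkout is rejected before any file is opened, while `src/main/A.java` resolves normally.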

23.2 Dangerous Temptation

Do not run:

npm install
mvn test
python setup.py

inside parsing pipeline unless you have a sandbox and explicit workflow.

Static parsing should be safe by default.

24. Parser Versioning

Parser output changes over time.

Store:

parser ID,
parser version,
grammar version,
extractor version,
language detector version.

Why?

Because a doc generated with an older parser may rest on evidence of different quality.

Example:

parser:
  id: tree-sitter-java
  version: 0.x
extractor:
  id: java-symbol-extractor
  version: 2026.07.02

If the extractor improves, you may need to reindex.

25. Parser Plugin Architecture

25.1 Plugin Contract

public interface LanguagePlugin {
    Language language();

    SourceParser parser();

    SymbolExtractor symbolExtractor();

    ChunkExtractor chunkExtractor();

    Optional<FrameworkExtractor> frameworkExtractor();
}

25.2 Registry

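import java.util.Map;
import java.util.Optional;

public final class LanguagePluginRegistry {
    private final Map<Language, LanguagePlugin> plugins;

    public LanguagePluginRegistry(Map<Language, LanguagePlugin> plugins) {
        this.plugins = plugins;
    }

    public Optional<LanguagePlugin> find(Language language) {
        return Optional.ofNullable(plugins.get(language));
    }
}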

25.3 Benefits

add language without changing core,
test plugins independently,
version extractors separately,
support fallback,
enable enterprise-specific framework extractors.

26. Fallback Lexical Parser

The fallback parser is not "garbage mode". It is controlled degradation.

26.1 What It Can Extract

headings,
imports by regex,
function-like patterns,
class-like patterns,
comments,
TODO markers,
route-like strings,
config keys.

26.2 What It Cannot Guarantee

accurate nesting,
type resolution,
complete call graph,
overloaded method distinction,
framework semantics.

26.3 Mark Confidence

symbol:
  kind: function
  name: maybeCreateOrder
  extractionMethod: lexical_fallback
  confidence: 0.42

Downstream should treat this differently from structural parse.
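
A sketch of what such a fallback could look like, assuming a few language-agnostic regular expressions applied line by line. `FallbackLexicalExtractor`, the patterns, and the fixed confidence of 0.40 are illustrative assumptions. The patterns are intentionally loose and will produce false positives, which is why every result is tagged with its extraction method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: loose patterns, low fixed confidence, no nesting or type information.
final class FallbackLexicalExtractor {

    record LexicalSymbol(String kind, String name, int line, String extractionMethod, double confidence) {}

    private static final String METHOD = "lexical_fallback";
    private static final double CONFIDENCE = 0.40;

    private static final Pattern CLASS_LIKE =
        Pattern.compile("\\b(?:class|interface|struct|defmodule)\\s+([A-Za-z_]\\w*)");
    private static final Pattern FUNCTION_LIKE =
        Pattern.compile("\\b(?:def|func|function|fn)\\s+([A-Za-z_]\\w*)\\s*\\(");
    private static final Pattern TODO_MARKER = Pattern.compile("\\bTODO\\b");

    static List<LexicalSymbol> extract(String content) {
        List<LexicalSymbol> symbols = new ArrayList<>();
        String[] lines = content.split("\\R", -1);
        for (int i = 0; i < lines.length; i++) {
            int lineNumber = i + 1;
            addMatch(symbols, "class", CLASS_LIKE, lines[i], lineNumber);
            addMatch(symbols, "function", FUNCTION_LIKE, lines[i], lineNumber);
            if (TODO_MARKER.matcher(lines[i]).find()) {
                symbols.add(new LexicalSymbol("todo", "TODO", lineNumber, METHOD, CONFIDENCE));
            }
        }
        return symbols;
    }

    // Records the first match of the pattern on this line, if any.
    private static void addMatch(List<LexicalSymbol> out, String kind, Pattern pattern,
                                 String line, int lineNumber) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            out.add(new LexicalSymbol(kind, matcher.group(1), lineNumber, METHOD, CONFIDENCE));
        }
    }
}
```

On an Elixir file with no structural parser available, this would still surface the module name, function names, and TODO markers with line numbers, which is enough for text retrieval but not for strong documentation claims.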

27. Build-Aware Analysis: Later, Not First

Compiler-backed analysis can improve accuracy.

For Java:

classpath,
annotation processing,
overloaded method resolution,
type hierarchy,
dependency graph.

For TypeScript:

tsconfig,
type checker,
module resolution,
path aliases.

For Go:

module packages,
build tags,
interface implementations.

But build-aware analysis requires:

dependency resolution,
build config,
sandbox,
environment,
time,
security controls.

Start with structural parsing. Add build-aware semantic layer only when needed.

28. Practical Exercise

Build a parser pipeline for one language.

28.1 Minimal Input

{
  "path": "src/main/java/com/acme/order/OrderService.java",
  "language": "java",
  "content": "..."
}

28.2 Minimal Output

{
  "parseStatus": "OK",
  "symbols": [
    {
      "kind": "class",
      "qualifiedName": "com.acme.order.OrderService",
      "span": {
        "startLine": 7,
        "endLine": 91
      }
    },
    {
      "kind": "method",
      "qualifiedName": "com.acme.order.OrderService.createOrder",
      "span": {
        "startLine": 24,
        "endLine": 49
      }
    }
  ]
}

28.3 Acceptance Criteria

detects language,
selects parser,
handles parser failure,
extracts class/function/method,
records source spans,
produces stable IDs,
stores diagnostics,
has test fixtures.

29. Common Mistakes

29.1 Treating Parser Tree as Domain Model

A parser tree is not a stable domain model. Always map it to the canonical model.

29.2 Overbuilding Semantic Analysis Too Early

Full type resolution is expensive. MVP often needs symbol extraction and structural chunks first.

29.3 Ignoring Parser Errors

Syntax errors and partial parses are normal. Store diagnostics and degrade gracefully.

29.4 No Parser Version

Without parser/extractor version, you cannot explain why output changed after reindex.

29.5 No Confidence

Facts extracted via regex fallback should not have the same confidence as facts extracted from syntax tree.

29.6 Running Build Tools Unsafely

Static parsing should not execute project code or install dependencies by default.

30. Summary

Language detection and parser strategy form the foundation of code understanding.

Key points:

language detection combines filename, extension, content, shebang, dialect, and repository context,
language, dialect, and file role are different,
parser output must be mapped to a canonical symbol model,
parser failure is normal and must be represented,
structural parsing enables better chunking and evidence spans,
framework-aware extraction should be plugin-based,
parser results need confidence, diagnostics, and versioning,
incremental parsing and parse cache are essential for scale,
parser pipeline must never execute repository code by default,
start with semantic-lite extraction before compiler-backed analysis.

The next part covers Symbol Extraction and Code Units: how to turn a syntax tree into classes, functions, methods, endpoints, tests, schemas, and other units of knowledge that retrieval, graph, docs, and agent memory can use.
