
Learn Ai Coding Agent Part 045 Test Generation And Regression Guard



title: Learn AI Coding Agent From Scratch - Part 045
description: Test generation and regression guards for a Honk-like AI coding agent, covering test intent, baseline capture, characterization tests, negative tests, mutation thinking, flaky test control, coverage signal, verifier integration, and PR evidence.
series: learn-ai-coding-agent
seriesTitle: Learn AI Coding Agent From Scratch
order: 45
partTitle: Test Generation and Regression Guard
slug: test-generation-and-regression-guard
tags:
  - ai-coding-agent
  - test-generation
  - regression-guard
  - junit
  - jacoco
  - verifier
  - java
  - software-testing
date: 2026-07-04

Part 045 - Test Generation and Regression Guard: The Agent Does Not Just Change Code, It Proves the Change

In the previous parts we made the agent capable of API migration, dependency upgrades, and contract migration. But a big problem remains: a code change that looks correct is not necessarily correct.

A green compile only proves that the program still compiles. A green lint only proves that style and a few static rules are satisfied. Green unit tests only prove that the existing tests still pass. None of these automatically proves that the agent's change preserves the behavior that matters.

This is where test generation and regression guards come in.

Bad mental model:
  The agent changes code, then runs the tests.

Better mental model:
  The agent changes code, then builds regression evidence: which behavior is protected,
  how that behavior is tested, and why this diff is more than compile-green.

For a Honk-like coding agent, tests are not an accessory. Tests are a safety instrument.

An agent that cannot add, select, or repair guards will often produce PRs that look tidy but do not deserve trust.

1. What this part targets

We will build a practical model that lets the agent:

understand the existing tests,
find gaps in the regression guards,
add relevant tests,
avoid fake tests that only chase coverage,
run the right verifier,
present evidence in the PR.

What we will not do here:

repeat JUnit basics,
repeat mocking basics,
write a general testing tutorial,
chase coverage percentage blindly.

We will focus on one question:

How can the agent prove that a code change does not break important behavior?

2. Mental model: tests as executable contracts

For the agent, a test must be understood as an executable contract.

A good test answers:

which behavior is protected,
which input triggers that behavior,
which output or effect is expected,
which failure it is meant to prevent,
why the test is relevant to the diff.

A bad test only answers:

This new line is now covered.

For an agent, this is a big difference.

Coverage is a signal. Test intent is evidence.

3. Vocabulary: guard, test, verifier, evidence

Before the design, let us pin down the terms.

Term	Meaning
Regression guard	A mechanism that prevents a change from breaking existing behavior
Test case	A specific execution that checks a behavior
Characterization test	A test that locks in existing behavior before a refactor or migration
Golden test	A test that compares output against an expected artifact
Negative test	A test that proves invalid cases are rejected
Contract test	A test that checks contract conformance across a boundary
Verifier	A runner that executes build, test, lint, and static analysis
Evidence	A reviewable artifact: test file, report, command, coverage delta, failure history

On an agent platform, a regression guard is more than a test file. A guard can be:

unit test,
integration test,
compile check,
schema compatibility check,
snapshot/golden output,
static assertion,
mutation-style scenario,
reviewer checklist,
LLM judge with evidence.

But tests remain the main component because they can be re-executed.

4. Why agents often produce bad tests

An LLM can write tests that look convincing. The problem is that a convincing-looking test can still be useless.

Common failure modes:

Failure	Example
Mirror implementation	The test mirrors production logic, so the bug is copied along with it
Assert too weak	Only `assertNotNull`, or only checks size
Mock everything	The test does not actually exercise the important behavior
Test after patch only	Nobody knows whether the test would have failed before the change
Brittle snapshot	The snapshot is too large and breaks easily on noise
Coverage theater	Coverage goes up, regression risk does not go down
Hidden flakiness	The test depends on time, ordering, network, randomness, or timezone
Wrong level	A unit test is forced to exercise integration behavior
No negative case	Happy path only
Verifier mismatch	The agent runs a test subset that does not represent CI

That is why we need a stricter test generation pipeline.

5. Main rule: a test must have a fail-before/pass-after story

The strongest regression guard is a test that:

fails before the fix or before the migration is complete,
passes after the change,
explains the behavior it protects.

Not every new test has to actually be executed before the patch, especially for mechanical cross-repo migrations. But the agent must at least be able to declare one of these three types:

Type	Meaning
Red-green test	The test fails before the patch and passes after it
Characterization test	The test locks in old behavior before a refactor
Invariant test	The test checks a property that must always hold

If a test fits none of these categories, it should be treated with suspicion.
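
The three story types can be turned into a small decision function. The following is a minimal sketch, assuming Java 17 and hypothetical names (`GuardStoryClassifier`, `Result`, `Story`) that are not part of any existing platform API. It also simplifies one point: a test that passes both before and after the patch is labeled a characterization test, although in practice the test intent should confirm that.

```java
/** Hypothetical helper: names the story a new test tells from its pre- and post-patch results. */
final class GuardStoryClassifier {

    enum Result { PASSED, FAILED, NOT_RUN }

    enum Story { RED_GREEN, CHARACTERIZATION, INVARIANT, SUSPICIOUS }

    /**
     * @param prePatch          result of the new test against the code before the patch
     * @param postPatch         result of the new test against the patched code
     * @param declaresInvariant true when the test intent states a property that must always hold
     */
    static Story classify(Result prePatch, Result postPatch, boolean declaresInvariant) {
        if (postPatch != Result.PASSED) {
            return Story.SUSPICIOUS; // a guard that is not green after the patch proves nothing yet
        }
        if (prePatch == Result.FAILED) {
            return Story.RED_GREEN;
        }
        if (declaresInvariant) {
            return Story.INVARIANT;
        }
        if (prePatch == Result.PASSED) {
            return Story.CHARACTERIZATION;
        }
        return Story.SUSPICIOUS; // never run before the patch and no declared invariant
    }

    public static void main(String[] args) {
        System.out.println(classify(Result.FAILED, Result.PASSED, false));
        System.out.println(classify(Result.PASSED, Result.PASSED, false));
        System.out.println(classify(Result.NOT_RUN, Result.PASSED, true));
        System.out.println(classify(Result.NOT_RUN, Result.PASSED, false));
    }
}
```

A `SUSPICIOUS` result does not mean the test is wrong. It means the agent cannot yet state which story the test tells, so the test should be escalated instead of counted as evidence.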

Bad PR evidence:
  Added tests.

Better PR evidence:
  Added regression test for null legacy config key fallback.
  The test would fail if the migration removed backward-compatible reads.

6. Test generation pipeline for the agent

Minimal pipeline:

The agent must not write tests straight from the prompt. It must first discover:

production file yang berubah,
existing test convention,
naming convention,
framework,
fixtures,
helper utilities,
test runner command,
module boundary,
CI expectation.

Tests that do not follow the repo's style increase review cost.

7. Test intent object

Before writing a test, the agent creates a TestIntent.

{
  "id": "ti_legacy_config_fallback",
  "change_id": "chg_config_key_migration",
  "behavior": "Reader must accept old config key during transition window",
  "risk": "Removing fallback breaks existing deployments that have not migrated config",
  "test_level": "unit",
  "target_class": "ConfigReader",
  "existing_test_style": "JUnit 5 + AssertJ",
  "pre_patch_expected": "may pass if fallback already exists; otherwise fails",
  "post_patch_expected": "passes",
  "negative_cases": [
    "invalid value still rejected",
    "new key has priority over old key when both exist"
  ],
  "verifier_command": "mvn -pl config-service -Dtest=ConfigReaderTest test"
}

Why does this matter?

Because a test intent forces the agent to explain why the test exists.

Without it, the agent can add tests carelessly.
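
One way to enforce that explanation is to refuse to construct an intent that lacks it. This is a minimal sketch, assuming Java 17 records; `TestIntentSketch` and its field subset are hypothetical and only mirror part of the JSON object shown earlier.

```java
import java.util.List;

/** Hypothetical Java mirror of a subset of the TestIntent JSON fields. */
final class TestIntentSketch {

    record TestIntent(
            String id,
            String behavior,
            String risk,
            String testLevel,
            List<String> negativeCases,
            String verifierCommand) {

        // Compact constructor: reject intents that cannot explain themselves.
        TestIntent {
            requireText(id, "id");
            requireText(behavior, "behavior");
            requireText(risk, "risk");
            requireText(testLevel, "testLevel");
            requireText(verifierCommand, "verifierCommand");
            negativeCases = negativeCases == null ? List.of() : List.copyOf(negativeCases);
        }
    }

    private static void requireText(String value, String field) {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException(field + " must not be blank");
        }
    }

    public static void main(String[] args) {
        TestIntent intent = new TestIntent(
                "ti_legacy_config_fallback",
                "Reader must accept old config key during transition window",
                "Removing fallback breaks existing deployments that have not migrated config",
                "unit",
                List.of("invalid value still rejected"),
                "mvn -pl config-service -Dtest=ConfigReaderTest test");
        System.out.println(intent.behavior());
    }
}
```

With this shape, a test writer step cannot even start unless the behavior, the risk, and the verifier command have been stated.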

8. Test level selection

The agent must choose the test level deliberately.

Level	Good fit for	Poor fit for
Unit test	Pure logic, mapper, validator, parser, edge case	Full wiring, DB, network
Integration test	DB interaction, serialization, HTTP boundary, config loading	Fast inner loop when the test is too heavy
Contract test	API/schema compatibility, consumer/provider expectation	Internal algorithm detail
Golden/snapshot test	Stable generated output, migration output	Output that changes often or is noisy
End-to-end test	Critical user journey	Every small change

Heuristic:

If change affects pure function:
  prefer unit test.

If change affects serialization/config/schema:
  prefer contract or integration test.

If change affects database migration:
  prefer migration verification + integration test.

If change affects public API behavior:
  prefer contract test + unit tests for edge cases.

If change is mechanical rename only:
  existing compile/test may be enough, unless behavior path changed.

An agent that always picks unit tests is not yet mature. Neither is an agent that always picks integration tests.
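
The heuristic above can be expressed as a lookup from change kind to recommended guards. This is a minimal sketch, assuming Java 17; the enum names are hypothetical, and adding a unit test for a rename that touches a behavior path is an assumption of this sketch, since the heuristic only says existing checks may not be enough in that case.

```java
import java.util.List;

/** Hypothetical encoding of the test level heuristic as a lookup. */
final class TestLevelHeuristic {

    enum ChangeKind {
        PURE_FUNCTION, SERIALIZATION_CONFIG_SCHEMA, DATABASE_MIGRATION,
        PUBLIC_API_BEHAVIOR, MECHANICAL_RENAME
    }

    enum Guard {
        UNIT_TEST, INTEGRATION_TEST, CONTRACT_TEST,
        MIGRATION_VERIFICATION, EXISTING_COMPILE_AND_TESTS
    }

    /** Returns candidate guards in order of preference for the given change kind. */
    static List<Guard> recommend(ChangeKind kind, boolean behaviorPathChanged) {
        return switch (kind) {
            case PURE_FUNCTION -> List.of(Guard.UNIT_TEST);
            case SERIALIZATION_CONFIG_SCHEMA -> List.of(Guard.CONTRACT_TEST, Guard.INTEGRATION_TEST);
            case DATABASE_MIGRATION -> List.of(Guard.MIGRATION_VERIFICATION, Guard.INTEGRATION_TEST);
            case PUBLIC_API_BEHAVIOR -> List.of(Guard.CONTRACT_TEST, Guard.UNIT_TEST);
            case MECHANICAL_RENAME -> behaviorPathChanged
                    // Assumption of this sketch: add a focused unit test when behavior is touched.
                    ? List.of(Guard.EXISTING_COMPILE_AND_TESTS, Guard.UNIT_TEST)
                    : List.of(Guard.EXISTING_COMPILE_AND_TESTS);
        };
    }

    public static void main(String[] args) {
        System.out.println(recommend(ChangeKind.DATABASE_MIGRATION, false));
        System.out.println(recommend(ChangeKind.MECHANICAL_RENAME, true));
    }
}
```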

9. Existing test discovery

The agent needs to find the relevant tests before creating new ones.

Sources:

naming convention,
package mirror,
build tool configuration,
changed production symbols,
imports,
call graph,
historical tests around similar classes,
test utility classes,
fixture directories,
CI command.

Example search flow:

Changed file:
  src/main/java/com/acme/config/ConfigReader.java

Search candidates:
  src/test/java/com/acme/config/ConfigReaderTest.java
  src/test/java/com/acme/config/*Test.java
  rg "ConfigReader" src/test
  rg "legacyConfigKey" src/test
  rg "ConfigReader" .github/workflows pom.xml

Tool-level API:

{
  "tool": "find_related_tests",
  "input": {
    "changed_files": ["src/main/java/com/acme/config/ConfigReader.java"],
    "symbols": ["ConfigReader", "readConfig", "ConfigKey"]
  }
}

Output:

{
  "direct_tests": [
    "src/test/java/com/acme/config/ConfigReaderTest.java"
  ],
  "indirect_tests": [
    "src/test/java/com/acme/config/ConfigModuleTest.java"
  ],
  "test_style": {
    "framework": "junit-jupiter",
    "assertion_library": "assertj",
    "mocking": "mockito"
  },
  "commands": [
    "mvn -pl config-service -Dtest=ConfigReaderTest test"
  ]
}
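
The package-mirror step of this discovery can be sketched as a pure path transformation. This is a minimal sketch, assuming the standard Maven layout (`src/main/java` mirrored by `src/test/java`) and the `*Test.java` suffix; `TestPathCandidates` is a hypothetical helper, and a real `find_related_tests` tool would combine it with symbol search.

```java
import java.util.Optional;

/** Hypothetical helper: derives the mirrored test path for a production file. */
final class TestPathCandidates {

    private static final String MAIN_ROOT = "src/main/java/";
    private static final String TEST_ROOT = "src/test/java/";
    private static final String JAVA_SUFFIX = ".java";

    /** Returns the conventional test file path, or empty when the path is not a main Java source. */
    static Optional<String> mirroredTestPath(String productionPath) {
        int rootIndex = productionPath.indexOf(MAIN_ROOT);
        if (rootIndex < 0 || !productionPath.endsWith(JAVA_SUFFIX)) {
            return Optional.empty();
        }
        String modulePrefix = productionPath.substring(0, rootIndex);
        String relative = productionPath.substring(rootIndex + MAIN_ROOT.length());
        String withoutSuffix = relative.substring(0, relative.length() - JAVA_SUFFIX.length());
        return Optional.of(modulePrefix + TEST_ROOT + withoutSuffix + "Test" + JAVA_SUFFIX);
    }

    public static void main(String[] args) {
        System.out.println(mirroredTestPath("src/main/java/com/acme/config/ConfigReader.java"));
        System.out.println(mirroredTestPath("README.md"));
    }
}
```

The candidate path is only a guess until the file is confirmed to exist, which is why the search flow also greps for the changed symbols.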

10. Build tool integration: Maven example

For Java/Maven, the agent typically uses Surefire for unit tests and Failsafe for integration tests. Maven Surefire has include/exclude conventions for test classes, and the Surefire documentation describes these inclusion and exclusion patterns. This matters because the agent needs to know which command runs a given test.

Command examples:

mvn test
mvn -pl config-service test
mvn -pl config-service -Dtest=ConfigReaderTest test
mvn -pl config-service -Dtest=ConfigReaderTest#readsLegacyKey test

The agent must record the command as an artifact:

{
  "command": "mvn -pl config-service -Dtest=ConfigReaderTest test",
  "purpose": "targeted regression test",
  "exit_code": 0,
  "duration_ms": 8421,
  "stdout_artifact": "artifact://runs/123/tests/config-reader.out",
  "stderr_artifact": "artifact://runs/123/tests/config-reader.err"
}

Do not just record “tests passed”. Review needs the command and its scope.
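
Building the command and its evidence record from structured inputs keeps the scope explicit. This is a minimal sketch, assuming Java 17; `MavenTestCommandSketch` and `CommandEvidence` are hypothetical names, and only the `-pl` and `-Dtest=Class#method` forms already shown in the command examples are used.

```java
/** Hypothetical helper: builds a targeted Maven test command and a reviewable summary. */
final class MavenTestCommandSketch {

    /** Any argument may be null or blank, in which case that part of the command is omitted. */
    static String targeted(String module, String testClass, String testMethod) {
        StringBuilder command = new StringBuilder("mvn");
        if (hasText(module)) {
            command.append(" -pl ").append(module);
        }
        if (hasText(testClass)) {
            command.append(" -Dtest=").append(testClass);
            if (hasText(testMethod)) {
                command.append('#').append(testMethod);
            }
        }
        return command.append(" test").toString();
    }

    private static boolean hasText(String value) {
        return value != null && !value.isBlank();
    }

    /** Minimal evidence record: what ran, why, and how it ended. */
    record CommandEvidence(String command, String purpose, int exitCode, long durationMs) {
        String summary() {
            String status = exitCode == 0 ? "passed" : "failed (exit " + exitCode + ")";
            return purpose + ": `" + command + "` " + status + " in " + durationMs + " ms";
        }
    }

    public static void main(String[] args) {
        String command = targeted("config-service", "ConfigReaderTest", null);
        System.out.println(new CommandEvidence(command, "targeted regression test", 0, 8421).summary());
    }
}
```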

11. JUnit 5 style inference

JUnit 5/Jupiter is one of the foundations of modern Java testing. The JUnit User Guide positions itself as a reference for programmers writing tests, extension authors, engine authors, build tool vendors, and IDE vendors.

The agent does not need to memorize every JUnit feature. It needs to infer the repo's style:

whether it uses @Test, @ParameterizedTest, @Nested, @TempDir, @BeforeEach,
which assertion library it uses,
whether it uses AssertJ or the built-in assertions,
whether test display names are used,
whether mocking is avoided for certain classes,
whether test packages mirror production packages.

Bad:

@Test
void testConfig() {
    ConfigReader reader = new ConfigReader();
    assertNotNull(reader.read("legacy.timeout"));
}

Better:

@Test
void readsLegacyTimeoutKeyDuringMigrationWindow() {
    ConfigReader reader = new ConfigReader(Map.of("legacy.timeout", "30s"));

    Duration timeout = reader.readTimeout();

    assertThat(timeout).isEqualTo(Duration.ofSeconds(30));
}

Even better when migration semantics matter:

@Test
void newTimeoutKeyTakesPrecedenceWhenBothKeysExist() {
    ConfigReader reader = new ConfigReader(Map.of(
        "legacy.timeout", "30s",
        "service.timeout", "45s"
    ));

    Duration timeout = reader.readTimeout();

    assertThat(timeout).isEqualTo(Duration.ofSeconds(45));
}

Why is this better?

the test name describes the behavior,
the input is explicit,
the expected output is strong,
migration precedence is tested,
it is more than a notNull check.

12. Characterization tests before a refactor

When the agent performs a large refactor or a mechanical migration, the old behavior is often poorly documented. A characterization test locks in the existing behavior before the change.

Flow:

Rule:

Characterization test should not encode desired new behavior.
It encodes current behavior that must survive the transformation.

Example:

@Test
void preservesCaseInsensitiveHeaderLookup() {
    HeaderMap headers = new HeaderMap();
    headers.put("X-Request-Id", "abc");

    assertThat(headers.get("x-request-id")).isEqualTo("abc");
}

If the agent refactors HeaderMap, this test protects a subtle behavior.

13. Negative tests

Agents often write the happy path. Production bugs often appear on the edge paths.

For migrations, negative tests matter:

Migration	Negative case
Config key migration	An invalid old value is still rejected
API migration	Null/error responses are still mapped correctly
Dependency upgrade	New exceptions are not silently swallowed
Schema migration	The unknown-field policy still matches the contract
Auth logic	The unauthorized path is still rejected

Example:

@Test
void rejectsInvalidLegacyTimeoutValue() {
    ConfigReader reader = new ConfigReader(Map.of("legacy.timeout", "soon"));

    assertThatThrownBy(reader::readTimeout)
        .isInstanceOf(ConfigException.class)
        .hasMessageContaining("legacy.timeout");
}

Negative tests keep the agent from making a fallback too permissive.

14. Metamorphic/property-style thinking

Not every behavior is best tested with a single expected output. Sometimes an invariant is stronger.

Example for a migration formatter:

Invariant:
  parse(render(x)) == x for supported schema subset.

Example for an idempotent migration:

Invariant:
  migrate(migrate(file)) == migrate(file)

Pseudo-test:

@Test
void migrationIsIdempotent() {
    String once = migrator.migrate(oldConfig);
    String twice = migrator.migrate(once);

    assertThat(twice).isEqualTo(once);
}

For an agent, invariant tests are very useful because the agent can state a property, not just an example.
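
To see the idempotence property catch a real defect, it helps to run it against both a correct and a broken migration. This is a minimal sketch, assuming a toy line-based config format; `IdempotentMigrationSketch`, `migrate`, and `migrateWithMarker` are hypothetical and stand in for whatever migrator the agent actually produced.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Toy migration plus an idempotence check, for illustration only. */
final class IdempotentMigrationSketch {

    private static final String OLD_PREFIX = "legacy.timeout=";
    private static final String NEW_PREFIX = "service.timeout=";

    /** Renames the legacy key on every line that starts with it; other lines are untouched. */
    static String migrate(String config) {
        return Arrays.stream(config.split("\n", -1))
                .map(line -> line.startsWith(OLD_PREFIX)
                        ? NEW_PREFIX + line.substring(OLD_PREFIX.length())
                        : line)
                .collect(Collectors.joining("\n"));
    }

    /** Deliberately broken variant: prepends a marker on every run, so it is not idempotent. */
    static String migrateWithMarker(String config) {
        return "# migrated\n" + migrate(config);
    }

    /** The invariant: applying the migration twice equals applying it once. */
    static boolean isIdempotentOn(UnaryOperator<String> migration, String input) {
        String once = migration.apply(input);
        String twice = migration.apply(once);
        return twice.equals(once);
    }

    public static void main(String[] args) {
        String oldConfig = "legacy.timeout=30s\nname=demo";
        System.out.println(isIdempotentOn(IdempotentMigrationSketch::migrate, oldConfig));
        System.out.println(isIdempotentOn(IdempotentMigrationSketch::migrateWithMarker, oldConfig));
    }
}
```

An example-based test with a single expected output could easily miss the marker bug, because the first run looks plausible on its own. The invariant only fails on the second application.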

15. Coverage signal: use it, but do not worship it

JaCoCo derives its coverage counters from information in Java class files: bytecode instructions plus optional debug information. This lets coverage be collected through instrumentation even when source-level mapping is limited.

But coverage does not prove correctness.

The agent should treat coverage as:

an indicator of untouched areas,
a sanity check on test relevance,
a regression signal when coverage drops sharply,
supporting evidence, not the main verdict.

Bad policy:

Reject PR if coverage below 80% globally.

Better policy:

For changed high-risk files, require either:
  - related test exists and passes, or
  - explicit no-test justification approved by judge/human.
If coverage on changed file drops significantly, require explanation.

Coverage delta artifact:

{
  "changed_file": "src/main/java/com/acme/config/ConfigReader.java",
  "line_coverage_before": 0.82,
  "line_coverage_after": 0.86,
  "branch_coverage_before": 0.61,
  "branch_coverage_after": 0.68,
  "interpretation": "New tests exercise legacy-key fallback and new-key precedence."
}
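
The "better policy" can be evaluated per changed file from a delta like the one above. This is a minimal sketch, assuming Java 17; the record fields, the `Outcome` names, and the 5-point threshold for a "significant" drop are all assumptions of this sketch, not values taken from any tool.

```java
/** Hypothetical per-file coverage policy: coverage informs the decision, it does not make it. */
final class CoverageDeltaPolicy {

    enum Outcome { OK, REQUIRES_EXPLANATION, REQUIRES_TEST_OR_JUSTIFICATION }

    record ChangedFile(
            String path,
            boolean highRisk,
            boolean relatedTestPasses,
            boolean noTestJustificationApproved,
            double lineCoverageBefore,
            double lineCoverageAfter) {}

    // Assumption: a drop of 5 percentage points or more counts as significant.
    static final double SIGNIFICANT_DROP = 0.05;

    static Outcome evaluate(ChangedFile file) {
        boolean guarded = file.relatedTestPasses() || file.noTestJustificationApproved();
        if (file.highRisk() && !guarded) {
            return Outcome.REQUIRES_TEST_OR_JUSTIFICATION;
        }
        if (file.lineCoverageBefore() - file.lineCoverageAfter() >= SIGNIFICANT_DROP) {
            return Outcome.REQUIRES_EXPLANATION;
        }
        return Outcome.OK;
    }

    public static void main(String[] args) {
        ChangedFile reader = new ChangedFile(
                "src/main/java/com/acme/config/ConfigReader.java", true, true, false, 0.82, 0.86);
        System.out.println(evaluate(reader));
    }
}
```

Note that no global percentage appears anywhere in the decision. The only hard requirement is a passing related test or an approved justification for high-risk files.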

16. Rotten green tests and false confidence

There is a category of tests that are green without actually executing the important assertions. The literature calls one form of this “rotten green tests”: tests that pass even though at least one assertion is never executed. This is a good example of why “test passed” is not enough.

The agent must guard against false confidence:

the assertion sits in a callback that is never invoked,
an exception path ends the test before the assertion runs,
the mock verification is never relevant,
an async test does not wait for the future to complete,
a try/catch swallows the error,
a snapshot is updated without justification.

Heuristic static check:

Reject newly generated tests when:
  - test has no assertion/verification,
  - only assertNotNull on object under test,
  - catches Exception without failing,
  - sleeps fixed duration for async behavior,
  - updates golden file without showing semantic diff.
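
Several of these rejection rules can be approximated with plain text matching before any judge is involved. This is a minimal sketch, assuming regex checks over the raw test source; `WeakTestHeuristics` is hypothetical, the patterns are deliberately crude, and a production version would likely use a real Java parser. The golden-file rule is not covered because it needs diff context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical text-level screen for newly generated test source. */
final class WeakTestHeuristics {

    private static final Pattern ASSERTION_CALL =
            Pattern.compile("\\b(assert\\w*|verify\\w*|fail)\\s*\\(");
    private static final Pattern BROAD_CATCH =
            Pattern.compile("catch\\s*\\(\\s*(Exception|Throwable)\\b");
    private static final Pattern FAIL_OR_RETHROW =
            Pattern.compile("\\bfail\\s*\\(|\\bthrow\\b");
    private static final Pattern FIXED_SLEEP =
            Pattern.compile("Thread\\.sleep\\s*\\(");

    /** Returns human-readable findings; an empty list means no heuristic fired. */
    static List<String> findings(String testSource) {
        List<String> findings = new ArrayList<>();
        int assertions = 0;
        int notNullAssertions = 0;
        Matcher matcher = ASSERTION_CALL.matcher(testSource);
        while (matcher.find()) {
            assertions++;
            if (matcher.group(1).equals("assertNotNull")) {
                notNullAssertions++;
            }
        }
        if (assertions == 0) {
            findings.add("no assertion or verification");
        } else if (assertions == notNullAssertions) {
            findings.add("only assertNotNull, no stronger oracle");
        }
        if (BROAD_CATCH.matcher(testSource).find() && !FAIL_OR_RETHROW.matcher(testSource).find()) {
            findings.add("catches Exception without failing");
        }
        if (FIXED_SLEEP.matcher(testSource).find()) {
            findings.add("fixed sleep used for async behavior");
        }
        return findings;
    }

    public static void main(String[] args) {
        String weak = "@Test\nvoid testConfig() {\n"
                + "    ConfigReader reader = new ConfigReader();\n"
                + "    assertNotNull(reader.read(\"legacy.timeout\"));\n}\n";
        System.out.println(findings(weak));
    }
}
```

A screen like this produces false positives and false negatives, so its findings are better treated as input to the judge than as a final verdict.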

17. Test generation prompt contract

The agent must write tests through a strict prompt contract.

You are adding regression tests for a code change.

Goal:
- Protect the behavior described in TEST_INTENT.

Rules:
- Follow existing test style in TEST_STYLE.
- Prefer minimal, behavior-focused tests.
- Do not mock the unit under test.
- Do not assert only non-null unless nullness is the behavior.
- Include at least one negative or edge case when risk says so.
- Do not modify production code in this step.
- Do not update snapshots unless explicitly allowed.

Inputs:
- CHANGE_SUMMARY
- TEST_INTENT
- EXISTING_TESTS
- PRODUCTION_SNIPPETS
- BUILD_COMMANDS

Output:
- Patch only for test files.
- Explain which behavior each test protects.

This is not a cosmetic prompt. It is a safety guard.

18. Test file mutation policy

The agent needs a dedicated policy for test files.

Action	Policy
Add new focused unit test	Usually allowed
Modify existing assertion	Requires explanation
Delete test	Requires human approval
Disable test	Usually blocked
Add `@Disabled`	Blocked unless explicit approval
Loosen assertion	Requires judge escalation
Update snapshot	Requires semantic diff and approval policy
Add sleep/time dependency	Strongly discouraged/block

An agent that fixes the build by weakening tests is a dangerous agent.

Invariant:
  Agent may strengthen tests autonomously.
  Agent may not weaken tests autonomously.
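
Part of this policy can be screened directly on the unified diff of test files. This is a minimal sketch, assuming line-prefix matching on a unified diff; `TestDiffPolicy` and its `Decision` levels are hypothetical, and the substring checks are crude stand-ins for a real syntax-aware comparison.

```java
/** Hypothetical diff screen: the strictest matching rule wins. */
final class TestDiffPolicy {

    // Declared from least to most restrictive so ordinal order can be compared.
    enum Decision { ALLOWED, REQUIRES_EXPLANATION, REQUIRES_HUMAN_APPROVAL, BLOCKED }

    static Decision evaluate(String unifiedDiff) {
        Decision decision = Decision.ALLOWED;
        for (String line : unifiedDiff.split("\n")) {
            if (line.startsWith("+++") || line.startsWith("---")) {
                continue; // file headers, not content changes
            }
            if (line.startsWith("+") && line.contains("@Disabled")) {
                decision = stricter(decision, Decision.BLOCKED);
            } else if (line.startsWith("-") && line.contains("@Test")) {
                decision = stricter(decision, Decision.REQUIRES_HUMAN_APPROVAL);
            } else if (line.startsWith("-") && line.contains("assert")) {
                decision = stricter(decision, Decision.REQUIRES_EXPLANATION);
            }
        }
        return decision;
    }

    private static Decision stricter(Decision current, Decision candidate) {
        return current.compareTo(candidate) >= 0 ? current : candidate;
    }

    public static void main(String[] args) {
        String diff = "--- a/src/test/java/FooTest.java\n"
                + "+++ b/src/test/java/FooTest.java\n"
                + "@@ -1,3 +1,4 @@\n"
                + "+    @Disabled\n"
                + "     @Test\n"
                + "     void readsLegacyKey() {}\n";
        System.out.println(evaluate(diff));
    }
}
```

Purely additive diffs pass through as `ALLOWED`, which matches the invariant: strengthening is autonomous, weakening is not.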

19. Running tests before and after the patch

Ideal pipeline:

Not every task allows a pure red-green cycle. But the agent must know which status applies.

Evidence examples:

{
  "new_tests": [
    {
      "name": "readsLegacyTimeoutKeyDuringMigrationWindow",
      "type": "regression",
      "pre_patch_result": "failed",
      "post_patch_result": "passed",
      "protected_behavior": "legacy config key fallback"
    }
  ]
}

Or:

{
  "new_tests": [
    {
      "name": "preservesCaseInsensitiveHeaderLookup",
      "type": "characterization",
      "pre_patch_result": "passed",
      "post_patch_result": "passed",
      "protected_behavior": "existing case-insensitive header lookup"
    }
  ]
}

20. Flaky test control

An agent can make a repo worse by adding flaky tests. Test generation therefore needs a flakiness guard.

Risk factors:

real time clock,
timezone,
random values,
network,
filesystem global state,
test order dependency,
parallel execution race,
fixed sleep,
shared static state,
Docker/external service unavailable.

Policy:

Generated tests must avoid:
  - Thread.sleep for synchronization,
  - real network call,
  - current time without injected clock,
  - random without fixed seed,
  - order-dependent global state.

Example repair:

Bad:

Thread.sleep(1000);
assertThat(cache.get("k")).isNull();

Better:

Ticker ticker = new FakeTicker();
Cache cache = new Cache(ticker, Duration.ofSeconds(1));

cache.put("k", "v");
ticker.advance(Duration.ofSeconds(2));

assertThat(cache.get("k")).isNull();
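
The `Ticker`, `FakeTicker`, and `Cache` types in the repaired example are not defined in this lesson. The following is a minimal sketch of what they could look like, assuming Java 17 and a cache whose entries expire after a fixed time-to-live; all three types are hypothetical and written only to make the example self-contained.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical injected-time cache: tests advance time instead of sleeping. */
final class FakeTickerCacheSketch {

    /** Time source abstraction so production and tests can supply different clocks. */
    interface Ticker {
        long nanos();
    }

    /** Test double: time only moves when the test says so. */
    static final class FakeTicker implements Ticker {
        private long nanos;

        @Override
        public long nanos() {
            return nanos;
        }

        void advance(Duration duration) {
            nanos += duration.toNanos();
        }
    }

    static final class Cache {
        private record Entry(String value, long expiresAtNanos) {}

        private final Ticker ticker;
        private final long ttlNanos;
        private final Map<String, Entry> entries = new HashMap<>();

        Cache(Ticker ticker, Duration ttl) {
            this.ticker = ticker;
            this.ttlNanos = ttl.toNanos();
        }

        void put(String key, String value) {
            entries.put(key, new Entry(value, ticker.nanos() + ttlNanos));
        }

        /** Returns the value, or null when the key is absent or its entry has expired. */
        String get(String key) {
            Entry entry = entries.get(key);
            if (entry == null) {
                return null;
            }
            if (ticker.nanos() >= entry.expiresAtNanos()) {
                entries.remove(key);
                return null;
            }
            return entry.value();
        }
    }

    public static void main(String[] args) {
        FakeTicker ticker = new FakeTicker();
        Cache cache = new Cache(ticker, Duration.ofSeconds(1));
        cache.put("k", "v");
        System.out.println(cache.get("k"));
        ticker.advance(Duration.ofSeconds(2));
        System.out.println(cache.get("k"));
    }
}
```

Because the test controls time, it runs in microseconds and gives the same result on a loaded CI machine as on a laptop.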

21. Test oracle quality

A test oracle is the mechanism that decides pass or fail. The agent needs to assess oracle strength.

Oracle	Strength	Notes
`assertNotNull`	Low	Often too weak
exact output	High	Good when the output is stable
exception type + message key	Medium/High	Avoid asserting on the full text if it is not stable
state transition assertion	High	Good for lifecycles
interaction verification only	Medium	Weak if the outcome is not checked
large snapshot	Medium	Needs diff review
property/invariant	High	Good for transforms and migrations

The agent judge must ask:

Would this test fail if the important behavior were broken?

If the answer is “unclear”, the test is not a strong guard.

22. Regression guards for code migration

For API migration, test generation differs from feature development.

Target guard:

public behavior unchanged,
migration rule applied correctly,
old edge cases preserved,
new API semantics handled,
no fallback removed too early,
compile/test pass.

Example test intent:

{
  "behavior": "Token parser preserves rejection of expired token after library upgrade",
  "risk": "New library exception type may cause expired tokens to be treated as malformed instead of expired",
  "test_level": "unit",
  "negative_case": true
}

Test:

@Test
void expiredTokenStillProducesExpiredTokenFailure() {
    TokenParser parser = parserWithClock(fixedClockAfterExpiry());

    TokenFailure failure = parser.parse(expiredToken()).left();

    assertThat(failure.reason()).isEqualTo(TokenFailureReason.EXPIRED);
}

This is far stronger than merely compiling after a dependency upgrade.

23. Regression guards for config/schema migration

For config/schema migration, the guard is often a compatibility test.

Example:

@Test
void acceptsBothLegacyAndNewConfigKeysDuringTransition() {
    ConfigSchema schema = ConfigSchema.loadCurrent();

    assertThat(schema.validate(Map.of("legacy.timeout", "30s"))).isValid();
    assertThat(schema.validate(Map.of("service.timeout", "30s"))).isValid();
}

Precedence:

@Test
void newKeyTakesPrecedenceWhenBothKeysArePresent() {
    RuntimeConfig config = RuntimeConfig.from(Map.of(
        "legacy.timeout", "30s",
        "service.timeout", "45s"
    ));

    assertThat(config.timeout()).isEqualTo(Duration.ofSeconds(45));
}

Rollback guard:

@Test
void serializedConfigRemainsReadableByPreviousVersionSchema() {
    String generated = CurrentConfigWriter.write(exampleConfig());

    assertThat(previousSchema.validate(generated)).isValid();
}

24. Agent test planner pseudo-code

type TestPlan = {
  intents: TestIntent[];
  filesToModify: string[];
  commands: string[];
  risk: "low" | "medium" | "high";
  needsHumanApproval: boolean;
};

async function planRegressionGuards(change: ChangeContext): Promise<TestPlan> {
  const relatedTests = await testDiscovery.findRelatedTests(change.changedSymbols);
  const style = await testDiscovery.inferTestStyle(relatedTests);
  const behaviorSurfaces = await impactAnalyzer.behaviorSurfaces(change);

  const intents = behaviorSurfaces.map(surface => ({
    id: makeId(surface),
    behavior: surface.behavior,
    risk: surface.risk,
    testLevel: chooseTestLevel(surface),
    existingTestStyle: style,
    verifierCommand: chooseCommand(surface, relatedTests)
  }));

  return {
    intents,
    filesToModify: chooseTestFiles(intents, relatedTests),
    commands: [...new Set(intents.map(i => i.verifierCommand))],
    risk: maxRisk(intents),
    needsHumanApproval: intents.some(i => i.risk === "high" && i.testLevel === "e2e")
  };
}

25. Agent test writer pseudo-code

async function addRegressionTests(run: Run, plan: TestPlan): Promise<TestPatch> {
  const projection = await contextBuilder.forTestWriting(plan);

  const patch = await llm.generatePatch({
    system: TEST_WRITER_SYSTEM_PROMPT,
    context: projection,
    outputFormat: "unified_diff"
  });

  const policy = await testPolicy.evaluatePatch(patch);
  if (!policy.allowed) {
    throw new PolicyViolation(policy.reasons);
  }

  await fileTools.applyPatch(patch);

  const results = [];
  for (const command of plan.commands) {
    results.push(await verifier.run(command));
  }

  return {
    patch,
    results,
    evidence: await evidenceBuilder.fromTestResults(results)
  };
}

26. Repair loop for test failures

When a test fails, the agent must classify the failure.

Failure type	Action
Test compile error	Repair test syntax/import/style
Wrong expectation	Re-check behavior; do not blindly change expected
Production bug exposed	Repair production code
Fixture missing	Add minimal fixture
Flaky failure	Stabilize test or mark requires human review
Environment failure	Retry if transient; otherwise infrastructure failure

Important invariant:

Agent must not repair a failing regression test by weakening the assertion
unless it can prove the original expectation was incorrect and preserve evidence.

Repair prompt should include:

failing command,
error excerpt,
test intent,
related production code,
diff so far,
allowed files,
forbidden actions.
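
The failure table and the invariant can be encoded as a small router that the repair loop consults before building a repair prompt. This is a minimal sketch, assuming Java 17 and that the failure has already been classified; `FailureRepairRouter` and its enum names are hypothetical.

```java
/** Hypothetical router from a classified failure to the next repair action. */
final class FailureRepairRouter {

    enum FailureType {
        TEST_COMPILE_ERROR, WRONG_EXPECTATION, PRODUCTION_BUG_EXPOSED,
        FIXTURE_MISSING, FLAKY_FAILURE, ENVIRONMENT_FAILURE
    }

    enum Action {
        REPAIR_TEST_CODE, RECHECK_BEHAVIOR, REPAIR_PRODUCTION_CODE,
        ADD_MINIMAL_FIXTURE, STABILIZE_OR_ESCALATE, RETRY, REPORT_INFRASTRUCTURE_FAILURE
    }

    static Action nextAction(FailureType type, boolean looksTransient) {
        return switch (type) {
            case TEST_COMPILE_ERROR -> Action.REPAIR_TEST_CODE;
            case WRONG_EXPECTATION -> Action.RECHECK_BEHAVIOR; // never blindly edit the expected value
            case PRODUCTION_BUG_EXPOSED -> Action.REPAIR_PRODUCTION_CODE;
            case FIXTURE_MISSING -> Action.ADD_MINIMAL_FIXTURE;
            case FLAKY_FAILURE -> Action.STABILIZE_OR_ESCALATE;
            case ENVIRONMENT_FAILURE ->
                    looksTransient ? Action.RETRY : Action.REPORT_INFRASTRUCTURE_FAILURE;
        };
    }

    /** The invariant: weakening needs proof that the expectation was wrong, plus preserved evidence. */
    static boolean mayWeakenAssertion(boolean expectationProvenIncorrect, boolean evidencePreserved) {
        return expectationProvenIncorrect && evidencePreserved;
    }

    public static void main(String[] args) {
        System.out.println(nextAction(FailureType.ENVIRONMENT_FAILURE, true));
        System.out.println(mayWeakenAssertion(true, false));
    }
}
```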

27. LLM-as-judge for test quality

Besides running the tests, we can use a judge to assess test quality. The judge does not replace the verifier. The judge reads the evidence.

Judge questions:

Is the new test relevant to the change request?
Are the assertions strong enough?
Would the test fail if the important behavior broke?
Does the test follow the repo's style?
Does the test weaken an existing guard?
Is there flakiness risk?
Is the test just coverage theater?

Output:

{
  "verdict": "pass",
  "confidence": 0.83,
  "findings": [
    {
      "severity": "info",
      "message": "New test covers legacy key fallback and precedence."
    }
  ],
  "required_actions": []
}

If failing:

{
  "verdict": "fail",
  "confidence": 0.91,
  "findings": [
    {
      "severity": "high",
      "message": "New test only asserts non-null and would not fail if timeout precedence were wrong."
    }
  ],
  "required_actions": [
    "Strengthen assertion to check selected timeout value."
  ]
}

28. PR evidence template

The agent's PR body must show the guard, not just “tests added”.

## Regression Guard

Added:
- `ConfigReaderTest#readsLegacyTimeoutKeyDuringMigrationWindow`
- `ConfigReaderTest#newTimeoutKeyTakesPrecedenceWhenBothKeysExist`
- `ConfigReaderTest#rejectsInvalidLegacyTimeoutValue`

Why:
- Protects backward-compatible config migration.
- Ensures new key precedence.
- Prevents invalid legacy values from being silently accepted.

Verification:
- `mvn -pl config-service -Dtest=ConfigReaderTest test` ✅
- `mvn -pl config-service test` ✅

Evidence:
- Targeted test report: `artifact://runs/123/tests/config-reader.xml`
- Diff boundary report: `artifact://runs/123/diff-boundary.json`

This lowers review cost.

29. Database schema for test evidence

Add a table or document model:

create table run_test_intent (
  id uuid primary key,
  run_id uuid not null,
  change_id text,
  behavior text not null,
  risk text not null,
  test_level text not null,
  target_file text,
  verifier_command text,
  created_at timestamptz not null default now()
);

create table run_test_result (
  id uuid primary key,
  run_id uuid not null,
  test_intent_id uuid,
  command text not null,
  exit_code int not null,
  duration_ms bigint not null,
  report_artifact_uri text,
  pre_patch_result text,
  post_patch_result text,
  created_at timestamptz not null default now()
);

Why store intent and result separately?

Because the test result answers “what happened”, while the intent answers “why this test exists”.

30. Failure drill

Drill 1: the agent adds a test that always passes

Signal:

no strong assertion,
only assertNotNull,
the judge fails it.

Response:

reject patch,
ask for a stronger oracle,
store the finding.

Drill 2: the agent deletes a failing test

Signal:

the diff deletes a test or adds @Disabled,
no approval.

Response:

block policy,
mark run failed or requires human approval.

Drill 3: tests green locally, failing in CI

Signal:

the verifier command subset differs from CI,
environment mismatch.

Response:

parse CI failure,
add missing verifier profile,
rerun with closer environment.

Drill 4: flaky test

Signal:

pass/fail inconsistent on rerun,
time/random/network dependency.

Response:

repair test determinism,
if not possible, block autonomous PR.
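
The "inconsistent on rerun" signal can be detected mechanically by running the same test several times and comparing outcomes. This is a minimal sketch, assuming the test run is exposed as a `BooleanSupplier` that returns true on pass; `RerunFlakinessCheck` is hypothetical, and the number of attempts is a tuning choice, since a few consistent reruns do not prove a test is stable.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical rerun check: mixed outcomes across attempts mean the test is flaky. */
final class RerunFlakinessCheck {

    enum Verdict { STABLE_PASS, STABLE_FAIL, FLAKY }

    static Verdict rerun(BooleanSupplier testRun, int attempts) {
        if (attempts < 2) {
            throw new IllegalArgumentException("need at least 2 attempts to compare outcomes");
        }
        boolean sawPass = false;
        boolean sawFail = false;
        for (int i = 0; i < attempts; i++) {
            if (testRun.getAsBoolean()) {
                sawPass = true;
            } else {
                sawFail = true;
            }
        }
        if (sawPass && sawFail) {
            return Verdict.FLAKY;
        }
        return sawPass ? Verdict.STABLE_PASS : Verdict.STABLE_FAIL;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky test: passes on even calls, fails on odd calls.
        System.out.println(rerun(() -> calls[0]++ % 2 == 0, 4));
        System.out.println(rerun(() -> true, 4));
    }
}
```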

31. Implementation checklist

Before moving on, our platform must have:

32. Conclusion

Test generation is not a “nice to have” feature. For an AI coding agent, test generation is part of the trust model.

A mature agent does not think:

I have changed the code. Now run the tests.

A mature agent thinks:

I changed a specific behavior surface.
I must prove that the important behavior remains safe.
I need guards that are strong, relevant, executable, and reviewable.

This is the difference between a code generator and a code-change system.

In the next part, we go up a level: changes are not always local. A small change can spread to many files, public APIs, call sites, tests, config, and build modules. We will discuss multi-file cascading change.

References

JUnit User Guide — Overview: https://docs.junit.org/6.1.0/overview.html
Maven Surefire Plugin — Inclusions and Exclusions of Tests: https://maven.apache.org/surefire/maven-surefire-plugin/examples/inclusion-exclusion.html
JaCoCo — Coverage Counters: https://www.eclemma.org/jacoco/trunk/doc/counters.html
RTj: a Java framework for detecting and refactoring rotten green test cases: https://arxiv.org/abs/1912.07322
