Final StretchOrdered learning track

Learn Agentic Ai Engineering Part 035 Capstone Autonomous Engineering System

[]30 min read5968 words

In This Lesson

1. Hubungan dengan Framework Kaufman 2. Definisi Sistem 3. North Star Capability

Finish

Lesson 3535 lesson track30–35 Final Stretch

title: Learn Advanced Agentic AI Engineering & Autonomous Software Engineering - Part 035 description: Capstone blueprint for an autonomous engineering system: issue intake, repository understanding, planning, coding, testing, PR review, release assistance, policy, sandboxing, observability, evaluation, governance, rollout, and production readiness. series: learn-agentic-ai-engineering seriesTitle: Learn Advanced Agentic AI Engineering & Autonomous Software Engineering order: 35 partTitle: Capstone: Autonomous Engineering System tags:

agentic-ai
autonomous-software-engineering
capstone
ai-engineering
coding-agents
agent-platform
governance
observability
evaluation
series
final status: final seriesStatus: completed date: 2026-06-29

Part 035 — Capstone: Autonomous Engineering System

Target part ini: mampu mendesain blueprint end-to-end untuk Autonomous Engineering System yang dapat menerima issue, memahami repository, membuat rencana, menjalankan perubahan di sandbox, menulis/menjalankan test, membuat PR evidence packet, meminta approval, membantu release, dan tetap bisa diaudit, dievaluasi, diamankan, serta dioperasikan sebagai platform production.

Ini adalah bagian terakhir dari seri.

Part 001 sampai Part 034 membangun komponen-komponen terpisah:

skill map,
first principles,
autonomy boundary,
runtime architecture,
workflow vs agent loop,
planning,
tool calling,
MCP,
context,
memory,
RAG,
state machine,
human approval,
multi-agent,
protocol,
design pattern,
anti-pattern,
autonomous SWE lifecycle,
repository understanding,
coding loop,
debugging,
test generation,
PR review,
refactoring/migration,
DevOps/release agents,
evaluation,
observability,
reliability,
security,
policy/permission/identity,
sandboxing,
governance,
platform architecture,
enterprise operating model.

Part ini menyatukannya menjadi satu sistem.

Pertanyaan utama:

Jika kita harus membangun autonomous engineering platform yang benar-benar layak untuk enterprise engineering, bentuk sistemnya seperti apa?

Jawaban singkat:

Bukan satu agent super pintar. Bangun controlled autonomous engineering system: orchestrator yang stateful, tool gateway yang aman, sandbox yang terisolasi, repository intelligence, planning/execution/verifier loop, PR/release workflow, policy engine, human approval, trace/eval pipeline, dan governance layer.

OpenAI Agents SDK mendeskripsikan agent sebagai aplikasi yang dapat plan, call tools, collaborate across specialists, dan keep enough state untuk multi-step work.
Reference: https://developers.openai.com/api/docs/guides/agents

Anthropic menekankan bahwa agentic system yang efektif sering kali lebih baik dibangun dari pola sederhana dan komposable, dengan distinction jelas antara workflow yang predictable dan agent yang lebih dynamic.
Reference: https://www.anthropic.com/research/building-effective-agents

LangGraph diposisikan sebagai low-level orchestration framework untuk long-running, stateful agents dengan durable execution, persistence, human-in-the-loop, memory, dan streaming.
Reference: https://pypi.org/project/langgraph/

MCP menyediakan protokol terbuka untuk menghubungkan AI applications dengan tools, resources, dan prompts melalui model host/client/server.
Reference: https://modelcontextprotocol.io/specification/2025-11-25

SWE-bench mengevaluasi sistem AI pada real-world GitHub issues: diberikan codebase dan issue, sistem harus menghasilkan patch yang menyelesaikan problem.
Reference: https://github.com/swe-bench/SWE-bench

OWASP Top 10 for LLM Applications dan OWASP agentic guidance memberi vocabulary risiko: prompt injection, insecure output handling, sensitive information disclosure, excessive agency, insecure plugin/tool design, supply chain, memory/context poisoning, dan unbounded consumption.
Reference: https://owasp.org/www-project-top-10-for-large-language-model-applications/

NIST AI RMF dan Generative AI Profile memberi kerangka governance untuk memasukkan trustworthiness considerations ke design, development, use, dan evaluation AI systems.
Reference: https://www.nist.gov/itl/ai-risk-management-framework

Prinsip capstone:

Autonomy is not the architecture. Autonomy is a capability granted by architecture, policy, evaluation, and trust.

1. Hubungan dengan Framework Kaufman

Dalam kerangka Kaufman, skill ini terlalu besar jika dipelajari sebagai “membangun AI engineer otomatis”.

Kita pecah menjadi subskill operasional:

mendefinisikan target performa agent,
membuat autonomy boundary,
membuat state machine agent,
membangun repository understanding pipeline,
membangun planning loop,
membangun tool gateway,
membangun sandbox execution,
membangun verification hierarchy,
membangun PR evidence packet,
membangun approval gate,
membangun release-assist workflow,
membangun observability pipeline,
membangun eval harness,
membangun policy/identity/permission model,
membangun governance artefact,
menjalankan rollout bertahap.

Target 20 jam pertama untuk capstone:

Anda mampu membuat design blueprint dan prototype kecil autonomous issue resolver yang hanya boleh mengerjakan low-risk issue, berjalan di sandbox, menghasilkan patch + test + evidence packet, dan tidak boleh merge/deploy tanpa human approval.

Target top 1% bukan “agent bisa coding”.

Target top 1% adalah:

Anda bisa membangun sistem yang memungkinkan agent berkontribusi ke software delivery tanpa menghancurkan correctness, security, auditability, ownership, dan engineering culture.

2. Definisi Sistem

Autonomous Engineering System adalah platform yang membantu pekerjaan software engineering dengan kemampuan:

memahami permintaan engineering,
memahami repository,
menyusun rencana perubahan,
menjalankan eksperimen,
mengubah kode,
membuat/memperbaiki test,
memverifikasi hasil,
membuat PR,
menjelaskan evidence,
meminta approval,
membantu review,
membantu release,
membantu incident diagnosis,
belajar dari telemetry dan eval.

Namun sistem ini bukan:

pengganti ownership engineer,
bypass code review,
bypass security review,
auto-merge engine untuk semua perubahan,
executor dengan secret unrestricted,
chatbot dengan akses repository,
CI bot yang kebetulan memakai LLM,
kumpulan prompt tanpa runtime state.

Definisi yang lebih presisi:

Autonomous Engineering System adalah agentic SDLC control system yang mengorkestrasi model, tools, state, policy, sandbox, evaluation, observability, dan human approval untuk menjalankan pekerjaan engineering dengan batas otonomi yang eksplisit.

3. North Star Capability

North star bukan “mengganti engineer”.

North star yang lebih sehat:

Menurunkan cost dan lead time pekerjaan engineering yang repetitive/routine, sambil meningkatkan traceability, review quality, regression coverage, dan operational discipline.

Contoh pekerjaan yang layak:

Kategori	Cocok Untuk Agent	Tidak Cocok Untuk Full Autonomy
Bug fix kecil	Reproduce, localize, patch, test	Ambiguous root cause across critical system
Test improvement	Add missing regression test	Menentukan strategi test enterprise-wide sendiri
Dependency upgrade	Minor/patch upgrade dengan recipe	Major migration berisiko tinggi tanpa human architect
Documentation	Update docs from code/PR	Menulis policy organisasi tanpa review
Refactoring	Mechanical rename/extract/migrate	Redesign domain model critical
PR review	Risk scan + actionable comments	Reject/approve PR sebagai authority tunggal
CI diagnosis	Explain failure + suggest fix	Push rollback ke production tanpa approval
Release assist	Readiness checklist + risk summary	Deploy high-impact change tanpa gate

North star metrics:

Metric	Arti
Lead time reduction	Waktu dari issue accepted ke PR ready berkurang
Review load reduction	Reviewer lebih sedikit membuang waktu di hal mekanis
Regression coverage increase	Bug fix disertai test relevan
Patch acceptance rate	Patch agent diterima setelah review manusia
Rework rate	Patch agent tidak sering harus diulang total
Incident contribution rate	Agent tidak menaikkan risiko incident
Evidence completeness	PR selalu punya bukti reproduksi/verifikasi
Policy violation rate	Tool/action agent tidak melanggar boundary
Evaluation pass rate	Agent tidak regresi di benchmark internal

4. Architecture Overview

Sistem capstone terdiri dari tiga plane:

Control Plane
Mengelola registry, policy, identity, approval, evaluation, audit, governance, dan rollout.
Execution Plane
Menjalankan agent loop, planner, tool calls, sandbox, repository operations, tests, and verification.
Evidence Plane
Menyimpan trace, logs, decision records, tool results, eval results, PR evidence, approval records, dan audit events.

Key idea:

Agent tidak langsung menyentuh dunia. Agent menyentuh runtime, runtime menyentuh policy, policy mengizinkan tool gateway, tool gateway menjalankan sandboxed tools, semua menghasilkan evidence.

5. Capability Model

Sistem sebaiknya tidak dimulai dari “agent bisa apa saja”.

Mulai dari capability yang jelas.

capability:
  id: autonomous_issue_resolver.low_risk_bugfix
  owner: platform-engineering
  description: Resolve low-risk bug issues with reproduction and PR evidence.
  allowed_inputs:
    - github_issue
    - jira_ticket
  allowed_repositories:
    - service-catalog-tag: low-risk-enabled
  allowed_actions:
    - clone_repository
    - inspect_code
    - run_tests
    - edit_files
    - create_branch
    - open_pull_request
    - comment_on_issue
  forbidden_actions:
    - merge_pull_request
    - deploy_to_production
    - read_production_secrets
    - modify_iam_policy
    - write_to_production_database
  required_evidence:
    - reproduction_attempt
    - root_cause_summary
    - patch_summary
    - tests_run
    - risk_assessment
    - rollback_notes
  approval:
    open_pr: automatic
    merge_pr: human_required
    production_deploy: human_required

Capability bukan hanya nama fitur.

Capability adalah kontrak:

siapa owner,
input apa yang valid,
tool apa yang boleh dipakai,
credential apa yang boleh digunakan,
evidence apa yang wajib,
approval mana yang wajib,
terminal state apa yang sah,
evaluasi apa yang harus lulus.

6. Autonomy Tiering

Autonomy harus bertingkat.

Tier	Nama	Agent Boleh	Agent Tidak Boleh
0	Advisory	membaca, merangkum, memberi saran	menulis file, membuat branch, mengubah sistem
1	Assisted Edit	membuat patch lokal di sandbox	membuat PR tanpa approval eksplisit
2	PR Automation	membuat branch dan PR	merge/deploy
3	Bounded Maintenance	menjalankan perubahan rutin terdaftar	menyentuh high-risk file/system
4	Release Assist	membuat readiness packet, rollback suggestion	deploy/rollback sendiri
5	Conditional Operations	execute action dengan pre-approved runbook	improvisasi di production

Rule praktis:

Semakin dekat action ke production, customer data, security boundary, money movement, legal exposure, atau irreversible operation, semakin rendah otonomi agent.

Contoh mapping:

autonomy_policy:
  low_risk_doc_update:
    max_tier: 2
    required_approval: reviewer
  low_risk_unit_test_patch:
    max_tier: 2
    required_approval: reviewer
  dependency_patch_upgrade:
    max_tier: 2
    required_approval: code_owner
  auth_logic_change:
    max_tier: 1
    required_approval: security_owner
  payment_logic_change:
    max_tier: 1
    required_approval: domain_owner_and_security
  production_rollback:
    max_tier: 4
    required_approval: incident_commander

7. End-to-End Lifecycle

Lifecycle capstone:

Core invariant:

No terminal success without evidence.

Terminal success harus menjawab:

Issue apa yang dikerjakan?
Scope apa yang disetujui?
File apa yang berubah?
Kenapa perubahan itu benar?
Test apa yang membuktikan?
Risiko apa yang tersisa?
Siapa yang approve?
Apa yang tidak dilakukan?
Bagaimana rollback?
Trace execution-nya di mana?

8. Intake Layer

Intake layer menerima permintaan dari:

GitHub issue,
Jira ticket,
Slack command,
PR comment,
scheduled maintenance job,
CI failure,
dependency alert,
incident timeline.

Intake tidak boleh langsung menjalankan agent.

Intake harus melakukan normalization.

engineering_request:
  id: REQ-2026-000123
  source: github_issue
  source_url: https://github.com/acme/billing/issues/812
  repository: acme/billing-service
  requester: alice@example.com
  requested_capability: autonomous_issue_resolver.low_risk_bugfix
  title: "Invoice total is wrong when discount is zero"
  description: "..."
  acceptance_criteria:
    - "zero discount must not change total"
    - "existing discount calculation tests must pass"
  constraints:
    - "do not change public API"
    - "do not modify migration files"
  deadline: null
  attachments: []

Good intake contains:

explicit repository,
problem statement,
acceptance criteria,
constraints,
risk hints,
expected output,
owner/reviewer,
source provenance.

Bad intake:

“fix bug”,
“make it better”,
“optimize everything”,
“update service”,
“make tests green” without context.

Agent boleh meminta clarification jika acceptance criteria tidak cukup.

Namun untuk automation, lebih baik intake layer menolak request ambiguous daripada membiarkan agent berimprovisasi.

9. Risk Classifier

Risk classifier menentukan apakah request boleh dikerjakan agent dan pada autonomy tier berapa.

Risk signal:

Signal	Contoh
Domain criticality	billing, auth, KYC, AML, enforcement decision
File sensitivity	IAM, crypto, migration, payment, policy, compliance
Runtime impact	production path, background job, customer-facing API
Data impact	PII, secrets, financial data, regulated data
Change scope	number of files, public API, schema, dependency graph
Reversibility	doc/test change vs data migration
Test confidence	high coverage vs unknown coverage
Ownership clarity	code owner exists vs unclear owner
Incident linkage	current incident vs routine maintenance

Example classifier output:

risk_assessment:
  risk_tier: medium
  reasons:
    - modifies_billing_domain
    - touches_calculation_logic
    - no_schema_change
    - unit_tests_available
  max_autonomy_tier: 1
  required_approvals:
    - billing_code_owner
  forbidden_actions:
    - open_pr_without_plan_approval
    - modify_public_api
    - modify_database_schema

Important:

Risk classifier is not a vibe check. It is a policy decision that must be explainable and reviewable.

10. Repository Intelligence Layer

Repository intelligence layer membuat map repository.

Ia tidak hanya membaca file.

Ia membangun beberapa representation:

Map	Fungsi
File map	struktur folder, generated files, test files
Build map	build tool, modules, tasks, dependencies
Symbol map	class/function/interface/type relationships
Dependency map	internal/external dependency graph
Test map	test file to production file relationships
Ownership map	CODEOWNERS, maintainers, teams
Runtime map	entrypoints, deployment units, config
Risk map	sensitive files/domains
Convention map	naming, layering, patterns, lint rules

Repository intelligence output:

repo_context_packet:
  repository: acme/billing-service
  commit: 5a7c91f
  language_stack:
    - java
    - spring_boot
    - gradle
  modules:
    - billing-core
    - billing-api
  likely_files:
    production:
      - billing-core/src/main/java/com/acme/billing/InvoiceCalculator.java
      - billing-core/src/main/java/com/acme/billing/DiscountPolicy.java
    tests:
      - billing-core/src/test/java/com/acme/billing/InvoiceCalculatorTest.java
  build_commands:
    unit: ./gradlew :billing-core:test
    full: ./gradlew test
  owners:
    - team-billing-platform
  risk_notes:
    - billing_domain
    - no_database_schema_detected
    - no_auth_file_detected

Repository map harus versioned.

Jangan memakai repo map stale untuk commit baru tanpa invalidation.

11. Context Builder

Context builder membuat context yang dikirim ke model.

Ia harus mengikuti prinsip:

Send enough to reason, not enough to leak, confuse, or exceed budget.

Context layers:

system instruction,
capability policy,
request packet,
repo context packet,
relevant files/snippets,
previous attempts,
tool results,
reviewer feedback,
verification evidence,
constraints.

Context builder harus menandai provenance:

context_item:
  id: ctx-00042
  type: source_file_snippet
  source: repository
  repository: acme/billing-service
  path: billing-core/src/main/java/com/acme/billing/InvoiceCalculator.java
  commit: 5a7c91f
  line_range: "42-91"
  trust_level: trusted_repo_content
  freshness: current_commit
  content_hash: sha256:...

Agent harus bisa membedakan:

user instruction,
repository content,
tool output,
untrusted issue content,
retrieved documentation,
policy instruction.

Ini penting untuk prompt injection.

Issue body dan README dari repository adalah untrusted content.

Policy dan system instruction adalah trusted control content.

12. Planner

Planner membuat rencana kerja.

Plan harus berbentuk artifact, bukan reasoning bebas yang hilang.

plan:
  id: PLAN-2026-000123
  objective: Fix invoice total when discount is zero.
  assumptions:
    - zero discount should behave as no discount
  constraints:
    - do not change public API
    - do not modify database schema
  steps:
    - id: S1
      action: inspect
      target: InvoiceCalculator and related tests
      expected_evidence: relevant calculation path identified
    - id: S2
      action: reproduce
      target: existing or new focused unit test
      expected_evidence: failing test or documented non-reproduction
    - id: S3
      action: patch
      target: minimal calculation logic change
      expected_evidence: diff limited to billing-core
    - id: S4
      action: verify
      target: ./gradlew :billing-core:test
      expected_evidence: test report
  forbidden:
    - modify public API
    - modify database schema
    - change unrelated formatting
  risk:
    tier: medium
    requires_plan_approval: true

Plan quality checklist:

objective jelas,
scope terbatas,
assumptions eksplisit,
constraints eksplisit,
step observable,
setiap step punya expected evidence,
verification command jelas,
forbidden action jelas,
risk tier jelas,
approval requirement jelas.

Poor plan:

I will inspect the code, make changes, and test it.

Good plan:

I will inspect InvoiceCalculator and DiscountPolicy, reproduce the zero-discount case with a focused unit test, patch only calculation logic if reproduction confirms the issue, run module-level tests, and create a PR with the failing-before/passing-after evidence.

13. Executor

Executor menjalankan plan melalui state machine.

Executor tidak boleh langsung mengikuti setiap model output.

Executor harus memvalidasi:

apakah action valid untuk current state,
apakah action diizinkan policy,
apakah tool schema valid,
apakah credential tersedia,
apakah resource budget tersedia,
apakah approval diperlukan,
apakah output memenuhi contract.

Executor invariant:

Model proposes. Runtime disposes.

Runtime harus bisa berkata:

reject,
require approval,
require clarification,
retry,
fallback,
abort,
continue.

14. Tool Gateway

Tool gateway adalah choke point.

Semua tool call harus melewatinya.

Tool gateway responsibilities:

schema validation,
authorization,
credential scoping,
sandbox routing,
rate limiting,
timeout,
idempotency key,
output sanitization,
event logging,
secret redaction,
egress control,
policy enforcement,
tool versioning.

Example tool contract:

tool:
  name: run_tests
  version: 1.2.0
  description: Run tests in the sandboxed repository checkout.
  side_effect: sandbox_only
  input_schema:
    command: string
    timeout_seconds: integer
    working_directory: string
  policy:
    allowed_commands:
      - "./gradlew :billing-core:test"
      - "./gradlew test"
    forbidden_patterns:
      - "curl"
      - "wget"
      - "nc"
      - "rm -rf /"
  output_schema:
    exit_code: integer
    stdout_excerpt: string
    stderr_excerpt: string
    test_report_path: string
    duration_ms: integer

Tool gateway rule:

Tools are not helper functions. Tools are capabilities with authority.

15. MCP Gateway

MCP servers can expose tools, resources, and prompts.

In enterprise architecture, agent should not connect directly to arbitrary MCP servers.

Use MCP gateway:

MCP gateway enforces:

server allowlist,
tool allowlist,
prompt/resource visibility,
identity propagation,
tenant isolation,
tool metadata validation,
output classification,
version pinning,
provenance tagging,
audit events.

Never treat MCP server description as trusted security boundary.

MCP is integration protocol.

Security still needs policy, sandbox, identity, network control, and audit.

16. Sandbox Execution

Sandbox is mandatory for autonomous SWE.

Minimum sandbox controls:

Control	Purpose
Isolated filesystem	prevent host mutation
Ephemeral checkout	clean run per task
Network default deny	prevent exfiltration and uncontrolled downloads
Scoped package cache	control supply chain surface
No production secrets	prevent credential leakage
Resource limits	prevent runaway cost/DoS
Time budget	prevent infinite loops
Process isolation	contain executed code
Artifact capture	preserve diff, logs, reports
Egress approval	allow controlled external access

Example sandbox profile:

sandbox_profile:
  id: java-low-risk-bugfix
  filesystem:
    mode: ephemeral
    writable_paths:
      - /workspace/repo
      - /workspace/tmp
    read_only_paths:
      - /workspace/policy
  network:
    default: deny
    allowlist:
      - internal-artifact-cache.acme.local
  secrets:
    allowed: []
  resources:
    cpu: 4
    memory: 8Gi
    timeout_minutes: 30
  package_management:
    allow_download: false
    use_locked_cache: true
  artifact_capture:
    - git_diff
    - test_reports
    - terminal_logs

Sandbox rule:

If the agent can execute code, assume the code may be malicious, broken, expensive, or exfiltrating.

17. Verification Hierarchy

Do not rely on model self-review.

Verification hierarchy:

static checks,
formatting/lint,
type checking/compilation,
focused unit tests,
regression tests,
integration tests,
contract tests,
security checks,
mutation/property checks if relevant,
human review,
staged rollout signals.

For low-risk bugfix:

verification_plan:
  required:
    - compile
    - focused_test
    - affected_module_test
    - diff_review
  optional:
    - full_test_suite
    - mutation_test
    - security_scan
  forbidden_shortcuts:
    - delete_failing_test
    - weaken_assertion_without_justification
    - skip_test_without_approval

Agent must produce verification evidence:

verification_evidence:
  reproduction:
    status: reproduced
    command: ./gradlew :billing-core:test --tests InvoiceCalculatorTest.zeroDiscount
    before_patch_result: failed
  after_patch:
    focused_test: passed
    module_test: passed
    full_test: not_run
    not_run_reason: exceeds low-risk budget
  changed_tests:
    - InvoiceCalculatorTest.zeroDiscountDoesNotChangeTotal
  risk_remaining:
    - full suite not run in agent sandbox; CI will run on PR

Verification principle:

Passing tests are evidence, not proof. But no evidence is not acceptable.

18. PR Evidence Packet

PR opened by agent must not look like a human guessed.

It should include evidence packet.

## Summary
Fixes zero-discount invoice total calculation by treating zero discount as no discount.

## Scope
- Modified `InvoiceCalculator`
- Added regression test for zero discount
- No API/schema/config changes

## Reproduction
Before patch:
- `./gradlew :billing-core:test --tests InvoiceCalculatorTest.zeroDiscountDoesNotChangeTotal`
- Failed with expected total 100.00 but got 0.00

## Verification
After patch:
- Focused test: passed
- Module tests: passed
- Full test suite: not run in sandbox; CI will run

## Risk
Medium: billing calculation logic.
Mitigation: minimal diff, focused regression test, billing code owner review required.

## Constraints Honored
- Did not change public API
- Did not modify database schema
- Did not modify unrelated files

## Rollback
Revert this PR. No migration or data transformation involved.

## Agent Trace
Trace ID: trc_2026_000123

Evidence packet reduces review cost.

It also gives auditability.

Bad PR description:

Fixed bug.

Good PR description:

Here is the reproduction, patch scope, verification result, residual risk, rollback path, and trace ID.

19. Review Agent

Review agent should not replace human code owner.

It should improve review quality.

Review agent roles:

summarize diff,
identify risky files,
compare PR against requirements,
detect missing tests,
detect security concern,
detect inconsistent pattern,
propose focused questions,
verify PR evidence completeness,
create review checklist.

Review output should be ranked:

Severity	Meaning
Blocker	likely correctness/security issue
Major	important maintainability/design issue
Minor	local improvement
Nit	style only
Question	uncertainty requiring human context

Review agent anti-pattern:

Dump 50 comments with low confidence.

Better:

3 high-confidence findings, each tied to diff line, invariant, consequence, and suggested action.

Finding format:

finding:
  severity: major
  confidence: high
  file: InvoiceCalculator.java
  lines: "82-91"
  invariant: zero discount must behave as no discount
  issue: branch treats zero as missing discount and resets total
  consequence: invoice total becomes incorrect for valid zero-discount case
  suggestion: compare discount presence separately from discount value
  evidence:
    - failing test InvoiceCalculatorTest.zeroDiscountDoesNotChangeTotal

20. Release Assist

Agent should assist release, not own it blindly.

Release assist tasks:

summarize changes since last release,
classify release risk,
check CI status,
check required approvals,
check open incidents,
check feature flag state,
generate release notes,
generate rollback notes,
monitor canary signals,
explain deployment failure,
suggest rollback/roll-forward options.

Release readiness packet:

release_readiness:
  version: 2026.06.29-rc1
  services:
    - billing-service
  changes:
    - PR-812 zero-discount invoice fix
  ci_status: passed
  approvals:
    code_owner: approved
    security: not_required
  risk_tier: medium
  rollout_plan:
    - deploy_to_staging
    - canary_5_percent
    - canary_25_percent
    - full_rollout
  monitors:
    - invoice_calculation_error_rate
    - billing_api_5xx
    - discount_policy_exception_count
  rollback:
    method: revert_deployment
    data_migration: none

Release agent forbidden actions by default:

deploy production without approval,
rollback production without incident commander approval,
disable monitors,
change alert thresholds,
rotate secrets,
modify IAM,
bypass change window.

21. Observability and Evidence Plane

Agent observability differs from normal service observability.

You need to reconstruct why something happened.

Minimum trace events:

Event	Required Fields
request_received	request ID, source, user, repo
risk_classified	tier, reasons, policy version
context_built	context items, hashes, token count
plan_created	plan ID, steps, constraints
tool_call_requested	tool, args hash, state
tool_call_authorized	policy decision, credential scope
tool_call_executed	duration, output hash, exit code
file_changed	path, diff hash, risk tag
test_run	command, result, report path
approval_requested	approver, reason, evidence packet
approval_decision	approver, decision, timestamp
pr_opened	PR URL, branch, evidence hash
run_completed	terminal state, summary

Trace event example:

{
  "event_type": "tool_call_executed",
  "trace_id": "trc_2026_000123",
  "run_id": "run_456",
  "state": "TestsRun",
  "tool": "run_tests",
  "tool_version": "1.2.0",
  "args_hash": "sha256:...",
  "policy_decision_id": "poldec_789",
  "sandbox_id": "sbx_abc",
  "exit_code": 0,
  "duration_ms": 42391,
  "output_hash": "sha256:...",
  "timestamp": "2026-06-29T05:40:00Z"
}

Do not log secrets.

Do not log full prompts blindly if they contain sensitive data.

Use redaction and content classification.

22. Evaluation Harness

A capstone system must have offline and online eval.

Offline eval types:

Eval	Purpose
Task eval	Can agent solve known tasks?
Trajectory eval	Did agent follow safe path?
Tool-call eval	Did agent choose legal tools?
Patch eval	Does patch pass tests?
Review eval	Are findings useful and accurate?
Security eval	Does agent resist injection/tool abuse?
Cost eval	Token/tool/runtime budget
Regression eval	Did new model/prompt/tool version worsen behavior?

Online eval types:

Eval	Purpose
Human acceptance	Was PR accepted?
Rework rate	How much human correction needed?
Incident linkage	Did agent-caused change fail?
Policy violation	Did agent attempt forbidden actions?
Evidence completeness	Did PR include required evidence?
Latency/cost	Is runtime sustainable?

Eval record:

eval_result:
  eval_id: agent_low_risk_bugfix_regression_v17
  agent_version: 2026.06.29
  model: model-x
  policy_version: pol-42
  tool_versions:
    run_tests: 1.2.0
    edit_file: 1.4.1
  dataset: internal-low-risk-bugfix-2026q2
  results:
    task_success_rate: 0.62
    evidence_complete_rate: 0.94
    policy_violation_rate: 0.00
    average_cost_usd: 1.42
    p95_duration_minutes: 18
  decision: pass_with_monitoring

Do not only measure final success.

Measure path quality.

A dangerous agent can pass tasks by violating policy.

23. Security Model

Threat model the whole system.

Attack surfaces:

issue body prompt injection,
README/documentation injection,
malicious test output,
malicious dependency script,
compromised MCP server,
tool description injection,
credential exfiltration,
branch/PR manipulation,
reviewer approval manipulation,
memory poisoning,
context poisoning,
eval dataset contamination,
supply-chain attack,
runaway cost,
confused deputy via delegated permissions.

Security controls:

Threat	Control
Prompt injection	instruction hierarchy, context labeling, output validation
Tool abuse	tool gateway, allowlist, policy engine
Excessive agency	autonomy tier, approval gate, capability registry
Secret leakage	secret broker, redaction, no secrets in sandbox by default
Data exfiltration	network deny, egress allowlist, output scanning
Malicious dependency	locked cache, no arbitrary install, SBOM/signature checks
MCP compromise	registry, version pinning, gateway, audit
Memory poisoning	provenance, confidence, retention policy, reviewable memory writes
Policy bypass	PEP/PDP separation, immutable audit, policy regression tests
Supply chain	pinned tools, signed images, artifact verification

Security principle:

The model is not the trust boundary. The runtime is.

24. Policy and Identity Model

Every agent action must have identity.

Identity layers:

Identity	Meaning
Human requester	who requested work
Agent definition	which agent/capability acted
Runtime instance	which run/session acted
Tool identity	which tool/service was invoked
Credential subject	which scoped credential was used
Approver	who authorized gated action

Audit question:

Who caused this change?

Correct answer should be:

Human Alice requested REQ-123. Agent autonomous_issue_resolver version 2026.06.29 executed run run_456 under capability policy pol-42, used sandbox credential cred-789, opened PR-812, approved by Bob as billing code owner.

Policy rule example:

package agent.policy

default allow := false

allow if {
  input.action == "open_pull_request"
  input.capability == "autonomous_issue_resolver.low_risk_bugfix"
  input.risk_tier in ["low", "medium"]
  input.evidence.reproduction.status in ["reproduced", "not_reproduced_with_reason"]
  input.evidence.tests_run.count > 0
  not input.diff.touches_forbidden_files
}

requires_approval if {
  input.domain in ["billing", "auth", "compliance"]
}

The exact policy language can vary.

The invariant matters:

Policy must be executable, versioned, testable, and auditable.

25. Governance Artefacts

For enterprise use, create governance artefacts.

Minimum artifacts:

Agent Card,
Capability Contract,
Risk Assessment,
Tool Registry Entry,
Data Handling Statement,
Evaluation Report,
Approval Matrix,
Incident Playbook,
Rollback Procedure,
Change Log,
Audit Evidence Schema,
Model/Provider Risk Record.

Agent Card example:

agent_card:
  name: Autonomous Issue Resolver
  version: 2026.06.29
  owner: platform-engineering
  business_owner: engineering-productivity
  purpose: Resolve low-risk software issues by opening PRs with evidence.
  allowed_users:
    - engineering
  allowed_repos:
    - opted_in_repositories
  max_autonomy_tier: 2
  allowed_actions:
    - inspect_repo
    - run_tests_in_sandbox
    - edit_files
    - open_pr
  forbidden_actions:
    - merge_pr
    - deploy_production
    - read_production_secrets
  data_access:
    code: yes
    tickets: yes
    production_data: no
    secrets: no
  evals:
    required_before_release:
      - low_risk_bugfix_regression
      - prompt_injection_suite
      - tool_policy_suite
  monitoring:
    dashboards:
      - agent_success
      - policy_violations
      - cost_latency
  incident_owner: platform-oncall

Governance should not be theater.

It should map to runtime enforcement.

26. Minimal Viable Capstone

Do not start by building all capabilities.

Build a minimal viable capstone:

A low-risk bugfix PR agent for one repository, one language stack, one build tool, sandboxed execution, no production secrets, no merge permission, mandatory evidence packet, and evaluation harness.

Scope:

mvc_scope:
  repositories: 1
  languages:
    - java
  build_tool:
    - gradle
  capabilities:
    - issue_intake
    - repo_map
    - focused_test_run
    - edit_file
    - open_pr
  forbidden:
    - merge
    - deploy
    - production_credentials
    - database_write
    - internet_egress
  required:
    - trace
    - evidence_packet
    - human_review
    - offline_eval_before_release

Success criteria:

20 curated low-risk tasks,
0 policy violations,
80% evidence completeness,
30% useful PR rate in pilot,
no merge without human,
no secrets exposure,
every run replayable from event log,
every PR has trace ID.

This is enough to learn.

Do not prematurely build multi-agent swarm, enterprise MCP marketplace, or autonomous release system.

27. Reference Implementation Blueprint

A practical service decomposition:

Possible implementation components:

Component	Possible Technology
Orchestration	LangGraph-like state graph, Temporal-like workflow, custom state machine
Model/tool runtime	OpenAI Agents SDK-style abstraction, custom runner
Tool integration	MCP gateway + first-party tools
Policy	OPA/Rego or custom policy service
Sandbox	container/firecracker/kata/ephemeral VM depending risk
Trace	OpenTelemetry-compatible traces + custom event schema
Eval	custom eval harness + golden task suite
Artifact store	object storage with hash-addressed artifacts
Registry	internal developer portal/catalog

Do not overfit to one framework.

The architecture should survive framework changes.

28. API Sketch

Example create run API:

POST /agent-runs
Content-Type: application/json

{
  "capability": "autonomous_issue_resolver.low_risk_bugfix",
  "source": {
    "type": "github_issue",
    "url": "https://github.com/acme/billing/issues/812"
  },
  "repository": "acme/billing-service",
  "constraints": [
    "do not change public API",
    "do not modify database schema"
  ],
  "requested_by": "alice@example.com"
}

Response:

{
  "run_id": "run_456",
  "trace_id": "trc_2026_000123",
  "status": "risk_classification_pending"
}

Run event:

{
  "run_id": "run_456",
  "state": "ApprovalRequired",
  "approval_request": {
    "reason": "medium risk billing logic change",
    "plan_id": "PLAN-2026-000123",
    "evidence_preview": {
      "files_likely_touched": [
        "InvoiceCalculator.java",
        "InvoiceCalculatorTest.java"
      ],
      "forbidden_changes": [
        "public API",
        "database schema"
      ]
    }
  }
}

PR creation event:

{
  "run_id": "run_456",
  "state": "PullRequestOpened",
  "pull_request": {
    "url": "https://github.com/acme/billing-service/pull/812",
    "branch": "agent/run-456-zero-discount-fix",
    "evidence_packet_hash": "sha256:..."
  }
}

29. Data Model

Core entities:

Important tables/documents:

Entity	Purpose
AgentDefinition	versioned config: model, instructions, policies, tools
Capability	allowed use case and boundaries
AgentRun	one execution instance
RunEvent	event-sourced trace
ToolCall	structured tool invocation record
Artifact	diff, logs, reports, evidence packet
Approval	human decision record
PolicyDecision	authorization result
EvalResult	offline/online eval result
PullRequestLink	link between run and PR
IncidentLink	link between run and incident if any

Use content hashes for artifacts.

Do not rely only on mutable URLs.

30. Failure Modes and Mitigations

Failure Mode	Symptom	Mitigation
Patch-before-reproduce	agent edits without proving failure	require reproduction attempt state
Hallucinated success	claims test passed without evidence	test result artifact required
Context poisoning	README/issue instructs agent to leak secrets	source trust labeling, instruction hierarchy
Tool abuse	agent runs forbidden command	tool gateway + policy
Scope creep	unrelated files changed	diff scope checker
Infinite debug loop	repeated edit/test cycles	budget + max iteration + abort reason
Weak test	test asserts implementation detail or always passes	test quality verifier
Approval fatigue	too many low-value approval requests	risk tiering + approval packet quality
Low-signal PR review	agent leaves many vague comments	review finding rubric
Eval gaming	agent overfits benchmark	fresh internal tasks + online metrics
Credential leak	secrets in logs/context	secret broker + redaction + sandbox no secrets
MCP drift	tool behavior changes unexpectedly	version pinning + registry review
Hidden state	cannot reproduce decision	event-sourced trace
Cost explosion	too many model/tool calls	budget, cost SLO, circuit breaker

31. Rollout Plan

Use phased rollout.

Phase 0 — Design Review

Deliverables:

capability contract,
threat model,
sandbox profile,
policy matrix,
eval design,
observability schema,
governance owner.

Exit criteria:

security approves architecture,
platform owner assigned,
pilot repository selected,
rollback/disable plan exists.

Phase 1 — Offline Prototype

Agent runs on cloned tasks only.

No PR creation.

Exit criteria:

can run curated task set,
produces evidence packet,
no policy violations,
traces complete.

Phase 2 — PR Draft Pilot

Agent may open draft PR.

No merge authority.

Exit criteria:

accepted PR rate acceptable,
reviewer satisfaction acceptable,
no secrets exposure,
no high-risk scope escape.

Phase 3 — Low-Risk Production Use

Agent can operate on opted-in repositories.

Still no merge/deploy authority.

Exit criteria:

stable metrics,
incident playbook tested,
eval gate integrated with release.

Phase 4 — Expanded Capabilities

Add dependency upgrade, PR review, CI diagnosis, release assist.

Exit criteria:

each capability has separate eval,
separate policy,
separate owners,
clear SLO.

Phase 5 — Conditional Operations

Only for tightly constrained runbooks.

Example:

restart non-critical job in staging,
re-run failed CI,
rollback preview environment,
create release candidate branch.

Production operations remain human-approved unless extremely mature and low-risk.

32. Production Readiness Checklist

Architecture

Agent runtime is stateful and replayable.
All tool calls go through tool gateway.
Policy engine is external to model.
Sandbox exists for code execution.
Network egress is controlled.
Secrets are scoped and redacted.
MCP servers are registry-controlled.
Context items have provenance.
Memory writes are governed.

Autonomy

Verification

Observability

Evaluation

Offline eval suite exists.
Security eval exists.
Tool policy eval exists.
Regression gate exists for agent changes.
Online metrics are monitored.
Human review feedback is captured.

Governance

33. Internal Engineering Standard

A good internal standard might say:

Any autonomous engineering agent that can read repository code and produce code changes must run under a registered capability, use sandboxed execution, emit traceable tool-call events, enforce policy through a non-model policy engine, generate evidence for every PR, and require human approval for merge, deployment, credential access, production data access, or high-risk domain changes.

Minimum rules:

No unregistered agent in production repositories.
No direct tool execution bypassing gateway.
No production secrets in agent context.
No merge/deploy authority by default.
No success status without evidence packet.
No model/prompt/tool upgrade without eval regression.
No MCP server without registry review.
No memory write without provenance and retention policy.
No high-risk domain change without code owner approval.
No incident action without incident commander approval.

34. Example End-to-End Scenario

Scenario:

A customer reports invoice total is wrong when discount is zero.

34.1 Intake

Request normalized:

request:
  repository: acme/billing-service
  issue: invoice total wrong when discount is zero
  constraints:
    - no public API change
    - no schema change
  expected_output: draft PR with evidence

34.2 Risk Classification

Risk output:

risk:
  tier: medium
  domain: billing
  max_autonomy: assisted_pr
  approval_required:
    - billing_code_owner

34.3 Repo Understanding

Repo map finds:

likely_files:
  - InvoiceCalculator.java
  - DiscountPolicy.java
  - InvoiceCalculatorTest.java
commands:
  focused: ./gradlew :billing-core:test --tests InvoiceCalculatorTest

34.4 Planning

Plan says:

inspect calculator,
add failing test,
patch minimal logic,
run focused test,
run module test,
open draft PR.

34.5 Execution

Agent edits only allowed files.

Tool gateway rejects any unrelated command.

Sandbox captures diff and logs.

34.6 Verification

Evidence:

before:
  focused_test: failed
after:
  focused_test: passed
  module_test: passed

34.7 PR

PR includes:

summary,
reproduction,
verification,
risk,
constraints,
rollback,
trace ID.

34.8 Review

Review agent produces:

0 blockers,
1 question about rounding behavior,
confirms no schema/API change.

Human code owner approves or requests revision.

34.9 Release Assist

After merge, release agent produces readiness packet.

Human release owner deploys.

34.10 Audit

Audit can reconstruct:

who requested,
agent version,
policy version,
tool calls,
diff,
test evidence,
approval,
PR,
release notes.

35. What Makes This “Top 1%” Engineering?

Many engineers can wire an LLM to a repository.

Fewer can design the control system around it.

Top-level competence appears in these decisions:

Explicit autonomy boundary instead of vague trust.
State machine instead of uncontrolled chat loop.
Tool gateway instead of direct tool access.
Sandbox-first execution instead of local machine mutation.
Evidence packet instead of narrative confidence.
Evaluation harness instead of demo-based validation.
Policy engine instead of prompt-only guardrail.
Human approval as runtime state instead of manual side process.
MCP gateway instead of arbitrary connector sprawl.
Observability for decisions instead of logs only.
Governance mapped to enforcement instead of paperwork.
Capability-based rollout instead of universal agent access.

The mental model:

Autonomous engineering is not a model capability problem alone. It is a socio-technical control problem across software delivery, security, evaluation, operations, and governance.

36. Deliberate Practice

Exercise 1 — Capability Contract

Choose one repository.

Write a capability contract for low-risk bugfix PR agent:

allowed actions,
forbidden actions,
evidence requirements,
approval gates,
sandbox profile,
risk rules.

Exercise 2 — State Machine

Draw state machine for:

intake,
risk classification,
repo map,
reproduction,
planning,
patch,
verification,
PR,
review,
completion.

Define terminal failure states.

Exercise 3 — Tool Gateway

Define schemas for:

search_code,
read_file,
edit_file,
run_tests,
create_branch,
open_pr.

For each, define:

side effect,
policy requirement,
timeout,
idempotency,
output schema,
logging fields.

Exercise 4 — Evidence Packet

Take a real PR.

Rewrite its description as agent evidence packet:

reproduction,
patch summary,
verification,
risk,
rollback,
constraints.

Exercise 5 — Eval Suite

Create 10 internal tasks:

5 low-risk bugfix,
2 test improvement,
2 dependency patch upgrade,
1 negative task that must be rejected.

Define pass/fail criteria.

Exercise 6 — Threat Model

Threat model the agent using these attacker inputs:

malicious issue body,
malicious README,
malicious test output,
malicious MCP tool description,
malicious dependency install script.

For each, define control.

37. Common Interview/Architecture Questions

37.1 Why not just let the coding agent open and merge PRs?

Because merge is not just a code operation.

It transfers risk into shared codebase and eventually production.

Without human approval, eval gate, ownership, and rollback discipline, the organization loses accountability.

37.2 Why is sandbox required if repository code is trusted?

Repository code may include arbitrary scripts, test hooks, dependency install steps, generated commands, or compromised dependencies.

Agent also may execute commands based on untrusted context.

Sandbox protects host, secrets, network, and neighboring systems.

37.3 Why do we need policy engine if prompt says “do not do X”?

Because prompt is instruction, not enforcement.

Policy engine can reject actual tool calls regardless of what the model says.

37.4 Why is evidence packet mandatory?

Because software engineering requires reviewable proof of work.

Evidence packet reduces reviewer burden and gives auditability.

37.5 Why not start with multi-agent architecture?

Multi-agent increases coordination complexity, cost, non-determinism, and security surface.

Start single-agent plus verifier/reviewer roles, then split only when specialization creates measurable value.

37.6 What is the most dangerous hidden assumption?

That agent success in a demo transfers directly to production.

Production needs repeatability, policy, evaluation, failure recovery, ownership, and observability.

38. Final Mental Model

A mature autonomous engineering platform has this shape:

Loop principle:

The agent improves only through evidence, evaluation, and controlled rollout. Not through optimism.

39. Final Checklist for the Whole Series

You have completed the series if you can explain and design:

40. Series Completion

Seri Learn Advanced Agentic AI Engineering & Autonomous Software Engineering selesai di Part 035.

Jumlah part: 35.

Bagian terakhir ini adalah capstone yang menggabungkan semua materi sebelumnya menjadi blueprint autonomous engineering platform yang production-minded, auditable, governable, secure, and evaluable.

Final principle:

The best autonomous engineering system is not the one that acts the most. It is the one that acts within the clearest boundaries, produces the strongest evidence, fails safely, and improves under measurement.

Lesson Recap

You just completed lesson 35 in final stretch. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Back To Series

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.

Previous Lesson

Lesson 34

Learn Agentic Ai Engineering Part 034 Enterprise Adoption And Operating Model

END_OF_SERIES