title: Learn AI Development Driven Implementation and Usage - Part 027
description: Quality metrics and productivity measurement for AI-driven software development: delivery, quality, reliability, review burden, risk, cost, and governance evidence.
series: learn-ai-development-driven-implementation-usage
seriesTitle: Learn AI Development Driven Implementation and Usage
order: 27
partTitle: Quality Metrics and Productivity Measurement
tags:
- ai
- software-engineering
- productivity
- metrics
- dora
- code-quality
- delivery
- governance
- series
date: 2026-06-30
Part 027 — Quality Metrics and Productivity Measurement
Goal of this part: build a measurement system that can answer one important question: does AI actually improve engineering delivery without degrading quality, security, maintainability, and organizational learning?
AI-driven development is often sold with a speed narrative: writing code faster, writing tests faster, closing tickets faster. That narrative is not enough.
In serious software engineering, productivity is not the number of lines of code. Productivity is the organization's ability to turn intent into software that is correct, secure, reliable, easy to change, and valuable to the business, with a good cycle time.
AI can speed up delivery. AI can also speed up defect creation, increase review burden, add noise, lead engineers to accept changes without understanding the system, and create audit gaps. Measurement of AI development therefore has to be multi-dimensional.
This part covers how to measure AI-driven implementation as an engineering system, not as a tooling campaign.
1. Kaufman Framing: The Skill Actually Being Learned
The core skill of this part is:
Measuring the effect of AI on software delivery in a way that is objective, balanced, and actionable.
Its sub-skills:
| Sub-skill | Assessable output |
|---|---|
| Metric design | Can choose metrics that fit the goal and the risk |
| Baseline thinking | Can compare before/after AI fairly |
| Signal vs noise | Can distinguish actionable metrics from vanity metrics |
| Quality measurement | Can measure defects, rework, test signal, review burden, and incidents |
| Productivity measurement | Can measure flow, throughput, cycle time, and cost without misusing individual metrics |
| AI attribution | Can measure AI's contribution without assuming every improvement comes from AI |
| Governance evidence | Can produce audit evidence that AI is used with adequate controls |
| Feedback loop | Can turn metrics into an improvement backlog |
1.1 Performance Target After 20 Hours
After 20 hours of practice, you should be able to:
- build a balanced scorecard for AI-assisted engineering,
- choose a baseline before the AI rollout,
- measure lead time, review burden, rework, defect escape, CI health, and cost,
- distinguish output metrics from outcome metrics,
- build a dashboard that does not encourage bad behavior,
- produce AI usage evidence for review and audit,
- run a small experiment that compares AI and non-AI workflows,
- build an improvement backlog based on data.
2. Core Mental Model: AI Productivity Is a System Property
AI productivity is not a property of an individual, a tool, or a model alone. Productivity is a property of the work system.
If requirements are poor, AI speeds up the wrong implementation.
If tests are weak, AI produces patches that look correct but are not proven.
If review is undisciplined, AI becomes a defect multiplier.
If observability is poor, the team cannot tell whether an improvement actually happened.
So the right question is not:
“What percentage of the code is written by AI?”
The right question is:
“Does our delivery system produce valuable changes faster, with the same or better quality, controlled risk, reasonable cost, and learning kept intact?”
3. Anti-Metrics: Metrics That Look Attractive but Mislead
Before choosing good metrics, throw out the bad ones.
| Anti-metric | Why it is dangerous | Better replacement |
|---|---|---|
| Lines of code generated by AI | Encourages code bloat and low-quality generation | Accepted behavior-changing PR with passing evidence |
| Prompt count | Measures activity, not results | Successful task completion with review quality |
| AI acceptance rate | High acceptance can mean blind acceptance | Accepted diff after tests + review findings |
| Number of AI-created PRs | Can inflate the review queue | Merged PRs with low rework and low defect escape |
| Individual developer velocity | Easily misused for surveillance | Team-level flow and outcome metrics |
| Story points completed | Not stable across teams | Cycle time, delivery throughput, customer impact |
| Test coverage only | High coverage is possible without meaningful assertions | Mutation score, assertion quality, defect detection |
| Number of comments from AI reviewer | Many comments can mean noise | Actionable finding precision and false-positive rate |
| Token spend only | Low cost can mean poor context | Cost per accepted, verified change |
The principle:
AI metrics must measure verified outcomes, not generative activity.
4. Measurement Pyramid
Use a pyramid so that metrics do not get stuck at a single level.
4.1 AI Usage Signal
Examples:
- number of tasks that used AI,
- AI usage category,
- model/tool used,
- prompt/template used,
- agent runtime,
- token/cost,
- command/tool invocation,
- approval events.
This is only basic telemetry. Do not stop here.
4.2 Workflow Health
Examples:
- PR cycle time,
- review waiting time,
- rework count,
- CI failure count,
- PR size,
- number of review iterations,
- time from first review to merge.
This shows whether AI smooths the flow or merely moves the bottleneck to the reviewer.
4.3 Engineering Quality
Examples:
- escaped defect,
- incident caused by change,
- security finding,
- test flakiness,
- mutation score,
- static analysis issue,
- dependency vulnerability,
- rollback/forward-fix rate.
This answers whether AI output is safe.
4.4 Delivery Outcome
Examples:
- lead time for changes,
- deployment frequency,
- failed deployment recovery time,
- change failure rate,
- throughput of valuable changes,
- release predictability.
This connects AI to delivery capability.
4.5 Business Outcome
Examples:
- feature adoption,
- operational cost reduction,
- support ticket reduction,
- customer-facing defect reduction,
- compliance evidence completeness,
- faster regulatory response.
This is the most important level, but also the hardest to attribute directly to AI.
5. DORA Metrics as a Delivery Baseline
DORA metrics are useful because they measure the delivery performance of the engineering system, not just individual activity.
The four main metrics to use as a baseline:
| Metric | Question answered | AI-specific interpretation |
|---|---|---|
| Change lead time | How quickly does a change go from commit to production? | Does AI speed up the end-to-end flow or only local coding? |
| Deployment frequency | How often does the team deploy? | Does AI make changes smaller and more frequent? |
| Change failure rate | What proportion of deployments cause problems? | Does AI increase defects/incidents? |
| Failed deployment recovery time | How long does it take to recover from a failed deployment? | Does AI help diagnosis/rollback or worsen operability? |
Important notes:
- DORA alone is not enough to evaluate AI.
- DORA needs to be complemented with quality, review, security, and cost metrics.
- Do not use DORA to punish individuals.
- DORA is ideally viewed per team, per service, and per risk class.
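As a rough illustration of how the four baseline metrics could be derived from deployment records, here is a minimal sketch. The `Deployment` record, its field names, and the sample data are assumptions made for this example; real data would come from your CI/CD system and incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape, not a real tool's schema.
    committed_at: datetime                   # first commit of the change
    deployed_at: datetime                    # change reached production
    failed: bool = False                     # deployment caused a problem
    recovered_at: Optional[datetime] = None  # service restored, if it failed


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deployments: list[Deployment], period_days: int) -> dict:
    """Summarize the four DORA metrics for one team or service."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    failures = [d for d in deployments if d.failed]
    recoveries = [
        hours_between(d.deployed_at, d.recovered_at)
        for d in failures
        if d.recovered_at is not None
    ]
    return {
        "lead_time_p50_hours": median(
            hours_between(d.committed_at, d.deployed_at) for d in deployments
        ),
        "deploys_per_day": len(deployments) / period_days,
        "change_failure_rate": len(failures) / len(deployments),
        "recovery_time_p50_hours": median(recoveries) if recoveries else None,
    }


# Hypothetical data: four deployments over two days, one of which failed.
start = datetime(2026, 6, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=2)),
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start, start + timedelta(hours=6), failed=True,
               recovered_at=start + timedelta(hours=7)),
    Deployment(start, start + timedelta(hours=8)),
]
summary = dora_summary(sample, period_days=2)
```

Running the same summary separately per team, service, and risk class keeps the numbers comparable and avoids turning them into an individual ranking.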
6. AI Development Balanced Scorecard
Use a five-dimension scorecard.
6.1 Dimension 1: Flow
Metrics:
- task cycle time,
- PR cycle time,
- review waiting time,
- time to first green CI,
- merge latency,
- release latency.
Interpretation:
- AI is good if it reduces time at a real bottleneck.
- AI is bad if it speeds up coding but increases review waiting time.
6.2 Dimension 2: Quality
Metrics:
- escaped defect rate,
- rework rate,
- PR revert rate,
- incident caused by change,
- static analysis issue trend,
- test failure after merge,
- bug reopen rate.
Interpretation:
- AI is good if it does not increase defect escape.
- AI is very good if it helps reduce rework and improves test signal.
6.3 Dimension 3: Review Load
Metrics:
- reviewer time per PR,
- number of review rounds,
- reviewer comment density,
- AI reviewer false positive rate,
- human override rate,
- average PR diff size.
Interpretation:
- AI is good if it makes PRs clearer, smaller, and backed by evidence.
- AI is bad if it forces reviewers to validate large diffs that the author does not understand.
6.4 Dimension 4: Risk and Governance
Metrics:
- secret exposure event,
- unauthorized context usage,
- protected path modification without approval,
- dependency risk introduced,
- audit evidence completeness,
- policy exception count,
- high-risk AI task approval ratio.
Interpretation:
- AI is good if it produces evidence automatically and improves controls.
- AI is bad if it makes changes untraceable.
6.5 Dimension 5: Cost and ROI
Metrics:
- AI spend per accepted PR,
- AI spend per successful task,
- cost per reviewable diff,
- token burn by workflow type,
- rework cost,
- reviewer time saved or added,
- incident cost avoided or created.
Interpretation:
- AI that is cheap but causes expensive rework is not an improvement.
- AI that is expensive but reduces incidents in critical systems can be very valuable.
7. Metric Catalog for AI-Driven Development
7.1 Flow Metrics
| Metric | Definition | Useful breakdown |
|---|---|---|
| Task cycle time | From task ready to merged/released | task type, risk class, AI mode |
| Coding time | From branch created to PR opened | repo, language, complexity |
| PR review latency | From PR opened to first human review | team, reviewer pool |
| Review cycle time | From first review to approval | PR size, AI usage, risk |
| Time to green CI | From PR opened to CI passing | failure type, test type |
| Merge latency | From approval to merge | release gate, branch protection |
| Release latency | From merge to production | service, environment |
Senior interpretation
If coding time drops 50% but PR review latency rises 80%, AI may not have sped up delivery at all. It may only have moved work from the author to the reviewer.
If time to green CI drops because AI helps debug the pipeline, that is a good signal.
If merge latency rises because PRs are large, task slicing is poor.
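To see where the time actually goes, the stage definitions in the table above can be computed from PR timestamps. This is a minimal sketch; the function name, the argument names, and the sample timeline are assumptions for illustration, and real timestamps would come from your Git host.

```python
from datetime import datetime


def stage_durations_hours(branch_created, pr_opened, first_review, approved, merged):
    """Split one PR's timeline into stages, following the table's definitions."""
    def hours(start, end):
        return (end - start).total_seconds() / 3600

    return {
        "coding_time": hours(branch_created, pr_opened),
        "pr_review_latency": hours(pr_opened, first_review),
        "review_cycle_time": hours(first_review, approved),
        "merge_latency": hours(approved, merged),
    }


# Hypothetical PR timeline, for illustration only.
stages = stage_durations_hours(
    branch_created=datetime(2026, 6, 1, 9, 0),
    pr_opened=datetime(2026, 6, 1, 14, 0),
    first_review=datetime(2026, 6, 2, 11, 0),
    approved=datetime(2026, 6, 2, 15, 0),
    merged=datetime(2026, 6, 2, 16, 0),
)
# The longest stage is the one worth improving first.
bottleneck = max(stages, key=stages.get)
```

In this made-up timeline coding took 5 hours while the PR waited 21 hours for its first review, so faster coding alone would barely change the end-to-end time.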
7.2 Review Metrics
| Metric | Definition | Warning sign |
|---|---|---|
| PR size | Lines/files changed | AI-generated PR is too large |
| Review rounds | Number of review iterations | AI patch is immature |
| Actionable comment ratio | Comments that lead to a valid change | AI reviewer is noisy |
| Human override rate | AI findings rejected by a human | Poor review prompt or ill-suited model |
| Author explanation quality | PR explains intent, risk, tests | AI summary is too generic |
| Reviewer confidence score | Reviewer is confident the patch is understood | Author cognitive offloading |
Simple rubric: Reviewability Score
| Score | Meaning |
|---|---|
| 1 | Large diff, unclear behavior, weak tests |
| 2 | Intent is clear but risk/test evidence is lacking |
| 3 | Reviewable, adequate tests, some ambiguity |
| 4 | Small diff, strong evidence, clear risk |
| 5 | Very easy to review, invariants and rollback are clear |
Simple formula:
reviewability_score = average(
intent_clarity,
diff_smallness,
test_evidence,
risk_statement,
rollback_clarity
)
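The formula above could be implemented as a small helper. This sketch assumes each criterion is rated on the same 1 to 5 scale as the rubric; the validation rules are an assumption added for the example.

```python
from statistics import mean

CRITERIA = (
    "intent_clarity",
    "diff_smallness",
    "test_evidence",
    "risk_statement",
    "rollback_clarity",
)


def reviewability_score(ratings: dict) -> float:
    """Average the five criteria; each rating is assumed to be on a 1-5 scale."""
    missing = [name for name in CRITERIA if name not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    for name in CRITERIA:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return float(mean(ratings[name] for name in CRITERIA))


# Hypothetical ratings given by a reviewer for one PR.
score = reviewability_score({
    "intent_clarity": 4,
    "diff_smallness": 3,
    "test_evidence": 5,
    "risk_statement": 4,
    "rollback_clarity": 4,
})
```

Requiring all five ratings keeps a PR from scoring well simply because the weakest criterion was left blank.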
7.3 Quality Metrics
| Metric | Definition | AI-related question |
|---|---|---|
| Escaped defect | Bug found after merge/release | Does AI increase bug leakage? |
| Rework rate | PR needs major changes after review | Are AI patches low quality? |
| Reopen rate | Bug ticket is reopened | Does AI fix only the symptom? |
| Regression count | Existing behavior breaks | Are characterization tests sufficient? |
| Flaky test rate | Test fails non-deterministically | Does AI write brittle tests? |
| Mutation survival | Code mutations pass the tests | Are AI-written assertions weak? |
| Security finding | SAST/DAST/dependency issue | Does AI introduce insecure patterns? |
7.4 Test Signal Metrics
| Metric | Good sign | Bad sign |
|---|---|---|
| Meaningful assertion density | Assertions prove important behavior | Assertions only check not-null/status code |
| Branch/scenario coverage | Critical paths and edge cases are covered | Happy path only |
| Mutation score | Tests fail when logic is broken | Many mutants survive |
| Flake rate | Stable in CI | Frequent failures/retries |
| Test runtime | Fast enough for feedback | Slow without adding value |
| Failure diagnosis quality | Errors are easy to understand | Errors are generic and noisy |
Do not measure coverage alone. Coverage answers “was the code executed?”, not “is the behavior proven?”.
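The gap between executed code and proven behavior can be shown with a hand-made mutant. Everything in this sketch (the `apply_discount` function, the mutant, and the two checks) is hypothetical and exists only to illustrate the idea; a mutation testing tool would generate mutants automatically.

```python
def apply_discount(price, percent):
    return price * (100 - percent) / 100


def mutant_apply_discount(price, percent):
    # Mutant: the '-' operator has been replaced by '+'.
    return price * (100 + percent) / 100


def weak_check(fn):
    # Executes the code, so it counts toward coverage, but proves nothing.
    return fn(200, 10) is not None


def strong_check(fn):
    # Asserts the actual behavior: 10% off 200 is 180.
    return fn(200, 10) == 180


# The weak check passes for both versions, so the mutant survives.
mutant_survives_weak = weak_check(apply_discount) and weak_check(mutant_apply_discount)
# The strong check passes for the original and fails for the mutant.
mutant_killed_by_strong = strong_check(apply_discount) and not strong_check(mutant_apply_discount)
```

Both checks give the same line coverage, yet only the strong one would catch the broken logic.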
7.5 Security Metrics
| Metric | Definition |
|---|---|
| AI-introduced vulnerability count | Security findings on PRs that used AI |
| Secret leakage event | A secret ends up in a prompt/log/diff |
| Dependency risk introduced | New dependency with a CVE/license issue |
| Unsafe output handling | AI-generated code does not validate external output |
| Prompt injection exposure | Tool/agent can be influenced by untrusted input |
| Excessive agency event | Agent runs commands or gains access beyond its permissions |
| Policy exception count | Cases of AI use outside policy |
Security metrics must be read together with severity. One critical vulnerability matters more than 100 style issues.
7.6 Cost Metrics
| Metric | Simple formula |
|---|---|
| AI cost per task | total AI spend / completed AI-assisted task |
| AI cost per merged PR | total AI spend / merged AI-assisted PR |
| AI cost per accepted diff | total AI spend / accepted generated diff |
| Rework-adjusted cost | AI cost + human rework time cost |
| Review-adjusted cost | AI cost + reviewer time cost |
| Incident-adjusted cost | AI cost + incident caused/avoided cost |
Cost metrics must include human time. A cheap tool that makes reviewers work twice as long is expensive.
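A minimal sketch of a cost figure that includes human time follows. The function name, the flat hourly rate, and all numbers are assumptions for illustration, not measured data.

```python
def adjusted_cost_per_pr(ai_spend_usd, merged_prs, reviewer_hours,
                         rework_hours, hourly_rate_usd):
    """AI spend plus human review and rework time, per merged PR."""
    if merged_prs <= 0:
        raise ValueError("merged_prs must be positive")
    human_cost = (reviewer_hours + rework_hours) * hourly_rate_usd
    return (ai_spend_usd + human_cost) / merged_prs


# Hypothetical comparison: a cheap tool that doubles review and rework time
# versus a more expensive tool that does not.
cheap_tool = adjusted_cost_per_pr(
    ai_spend_usd=100, merged_prs=20, reviewer_hours=40,
    rework_hours=10, hourly_rate_usd=80,
)
pricey_tool = adjusted_cost_per_pr(
    ai_spend_usd=600, merged_prs=20, reviewer_hours=20,
    rework_hours=5, hourly_rate_usd=80,
)
```

With these made-up inputs the tool with six times the subscription spend ends up cheaper per merged PR, because human time dominates the total.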
7.7 Learning Metrics
This is often overlooked.
| Metric | Purpose |
|---|---|
| Author explanation quality | Does the engineer understand the patch? |
| Post-review learning notes | Does the review produce learning? |
| Prompt/template improvement count | Is the workflow improving? |
| Repeated issue rate | Does AI repeat the same mistakes? |
| Pairing session reflection | Can the engineer explain the design? |
| New engineer onboarding time | Does an AI-readable repo help onboarding? |
Good AI adoption accelerates learning. Poor AI adoption makes engineers more passive.
8. AI Attribution: Do Not Claim Improvement Wrongly
A common problem:
After the AI rollout, cycle time drops. So AI is assumed to be the cause.
Not necessarily.
Cycle time can drop because:
- task scope got smaller,
- reviewers are more available,
- CI is faster,
- incidents decreased,
- requirements are clearer,
- the release process changed,
- the team avoids hard tasks,
- measurement window bias.
8.1 Minimum Attribution Model
For every change that claims AI impact, record:
| Field | Example |
|---|---|
| Task type | bugfix, refactor, test, migration, docs, API |
| Risk class | low, medium, high, critical |
| AI mode | chat, IDE pair, terminal agent, cloud agent, AI review |
| Human role | author, reviewer, approver |
| Baseline comparable | previous similar tasks |
| Outcome | merged, reverted, defect, rework |
| Evidence | tests, CI, review, logs, docs |
8.2 Simple Difference-in-Differences
If you want to be more rigorous:
AI_effect =
(AI_team_after - AI_team_before)
-
(control_team_after - control_team_before)
This is not perfect, but it is better than a raw before/after comparison.
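A worked example of the difference-in-differences formula, using hypothetical median cycle times in hours. It assumes the control team is comparable and was exposed to the same external changes (CI speed, release process) during the period.

```python
def difference_in_differences(ai_before, ai_after, control_before, control_after):
    """Change in the AI team minus the change in the control team."""
    return (ai_after - ai_before) - (control_after - control_before)


# Hypothetical median PR cycle times in hours.
raw_change = 28 - 40  # naive before/after for the AI team
ai_effect = difference_in_differences(
    ai_before=40, ai_after=28,
    control_before=42, control_after=36,
)
```

Here the naive comparison suggests a 12-hour improvement, but the control team also improved by 6 hours without AI, so only about 6 hours can plausibly be attributed to the AI workflow.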
8.3 Matched Task Comparison
Compare tasks of the same kind:
| Dimension | Match by |
|---|---|
| Domain | same service/module |
| Type | bugfix vs bugfix, test vs test |
| Size | similar estimated complexity |
| Risk | low/medium/high |
| Baseline | comparable historical tasks |
| Team | same or similar skill/team |
Do not compare an AI-assisted docs update with a non-AI database migration.
9. Measurement Architecture
To measure AI development, data comes from several sources.
9.1 Issue Tracker Data
Collect:
- task type,
- priority,
- component,
- assignee team,
- created date,
- ready date,
- start date,
- done date,
- risk label,
- AI-assisted label.
In practice:
labels:
- ai-assisted
- ai-mode:pair
- ai-mode:cloud-agent
- risk:medium
- task-type:bugfix
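Labels in this shape are easy to turn into structured fields for analysis. The parser below is a minimal sketch that assumes the `key:value` naming convention shown above; the output field names are made up for this example.

```python
def parse_labels(labels):
    """Turn issue/PR labels into structured fields for measurement."""
    fields = {"ai_assisted": False, "ai_modes": [], "risk": None, "task_type": None}
    for label in labels:
        key, sep, value = label.partition(":")
        if label == "ai-assisted":
            fields["ai_assisted"] = True
        elif sep and key == "ai-mode":
            fields["ai_modes"].append(value)
        elif sep and key == "risk":
            fields["risk"] = value
        elif sep and key == "task-type":
            fields["task_type"] = value
        # Unknown labels are ignored rather than treated as errors.
    return fields


parsed = parse_labels([
    "ai-assisted",
    "ai-mode:pair",
    "ai-mode:cloud-agent",
    "risk:medium",
    "task-type:bugfix",
])
```

Keeping the label vocabulary small and controlled is what makes later comparisons by task type and risk class possible.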
9.2 Git Data
Collect:
- commits,
- branch lifetime,
- file count,
- diff size,
- churn,
- module touched,
- protected path touched.
Caution: diff size is not productivity. Diff size is a review/risk signal.
9.3 Pull Request Data
Collect:
- opened time,
- first review time,
- approval time,
- merge time,
- review comments,
- requested changes,
- PR description quality,
- linked issue,
- checklist completion,
- AI disclosure.
9.4 CI/CD Data
Collect:
- build duration,
- test duration,
- failed job,
- rerun count,
- flaky indicator,
- deployment result,
- rollback/forward-fix,
- environment.
9.5 Security Data
Collect:
- SAST findings,
- dependency findings,
- secret scanning,
- license findings,
- container findings,
- IaC findings,
- severity,
- fix time.
9.6 AI Tool Logs
Collect only what may be collected under the privacy/security policy:
- tool name,
- AI mode,
- task id,
- token/cost,
- model class,
- approval event,
- command class,
- protected path attempt,
- outcome.
Do not store raw prompts that contain secrets, customer data, or sensitive source code if policy prohibits it.
10. AI Usage Taxonomy for Measurement
Metrics need to know what AI is being used for.
| AI usage category | Example | Default risk level |
|---|---|---|
| Search/explanation | understanding code, explaining errors | low-medium |
| Documentation | PR summary, runbook, ADR draft | low-medium |
| Test generation | unit/integration test | medium |
| Code implementation | feature/bugfix/refactor | medium-high |
| Security review | vulnerability review | medium-high |
| Database migration | schema/data migration | high |
| DevOps/IaC | workflow/deployment/infra | high |
| Production operation | diagnosis/action in prod | critical |
If all AI usage is lumped together, the data becomes useless.
Example of a good insight:
AI pair programming reduces coding time for low-risk test generation by 35%, but cloud-agent implementation of medium-risk API changes increases review rounds by 20% because tasks are sliced too coarsely.
Example of a poor insight:
AI increases productivity by 40%.
11. Designing a Dashboard that Does Not Create Bad Behavior
A bad dashboard makes people optimize the numbers, not the system.
11.1 Rules
- Use team metrics, not individual rankings.
- Show quality alongside speed.
- Show review burden alongside throughput.
- Show confidence intervals/trends, not falsely precise absolute numbers.
- Separate task type and risk class.
- Do not make AI acceptance rate a KPI.
- Use the dashboard for improvement, not punishment.
11.2 Dashboard Sections
Section A: Delivery Flow
| Metric | View |
|---|---|
| Lead time | trend by team/service |
| PR cycle time | p50/p75/p90 |
| Review wait time | by reviewer pool |
| Time to green CI | by repo |
| Deployment frequency | by service |
Section B: Quality and Safety
| Metric | View |
|---|---|
| Change failure rate | by service/risk |
| Escaped defects | by task type |
| Rework rate | by AI mode |
| Security findings | by severity |
| Test flakiness | by pipeline |
Section C: AI Usage
| Metric | View |
|---|---|
| AI-assisted task count | by category |
| AI mode distribution | chat/pair/agent/review |
| Cost per completed task | by mode |
| Approval events | by risk class |
| Policy exceptions | by team/service |
Section D: Review Health
| Metric | View |
|---|---|
| Review rounds | by PR size/risk |
| Actionable AI review finding ratio | by prompt/model |
| Human override rate | by reviewer |
| PR size distribution | by AI mode |
| Reviewer load | by week |
Section E: Governance Evidence
| Metric | View |
|---|---|
| AI disclosure completeness | by PR |
| Protected path approval | by PR |
| Security gate pass/fail | by repo |
| Documentation impact statement | by PR |
| Audit trail completeness | by use case |
12. AI ROI Model
The ROI of AI development cannot be calculated from subscription cost alone.
12.1 Cost Components
| Cost | Example |
|---|---|
| Tool subscription | seat/license |
| API/token | model usage |
| Infra | cloud sandbox, CI minutes |
| Human review | reviewer time |
| Rework | fixing AI output |
| Governance | audit, policy, approval |
| Security | scanning, incident response |
| Training | enablement time |
12.2 Benefit Components
| Benefit | Example |
|---|---|
| Reduced cycle time | faster bugfix/feature delivery |
| Reduced toil | docs/runbook/test automation |
| Improved quality | fewer repeated defects |
| Faster onboarding | AI-readable repo and knowledge pack |
| Better review | risk checklist and summary |
| Faster diagnosis | log/error hypothesis generation |
| Compliance evidence | generated traceability |
12.3 Simple Formula
net_value =
delivery_time_saved_value
+ toil_reduction_value
+ defect_cost_avoided
+ incident_cost_avoided
+ onboarding_time_saved
+ audit_effort_reduced
- ai_tool_cost
- added_review_cost
- rework_cost
- governance_cost
- incident_cost_caused
Do not force false precision. Use this model as a thinking aid, not as perfect accounting.
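One way to avoid false precision is to feed the formula with (low, high) estimates and report a range instead of a single number. This is a sketch under that assumption; the range approach and the sample figures are illustrative additions, not part of the formula itself.

```python
BENEFITS = (
    "delivery_time_saved_value", "toil_reduction_value", "defect_cost_avoided",
    "incident_cost_avoided", "onboarding_time_saved", "audit_effort_reduced",
)
COSTS = (
    "ai_tool_cost", "added_review_cost", "rework_cost",
    "governance_cost", "incident_cost_caused",
)


def net_value_range(estimates):
    """Return (pessimistic, optimistic) net value from (low, high) estimates."""
    def total(keys, index):
        # Components without an estimate are treated as zero.
        return sum(estimates.get(key, (0.0, 0.0))[index] for key in keys)

    pessimistic = total(BENEFITS, 0) - total(COSTS, 1)  # low benefits, high costs
    optimistic = total(BENEFITS, 1) - total(COSTS, 0)   # high benefits, low costs
    return pessimistic, optimistic


# Hypothetical monthly estimates in USD as (low, high) pairs.
low, high = net_value_range({
    "delivery_time_saved_value": (1000, 2000),
    "toil_reduction_value": (500, 800),
    "ai_tool_cost": (300, 300),
    "rework_cost": (200, 600),
})
```

If even the pessimistic end is positive, the decision is easy; if the range straddles zero, the estimates need more evidence before anyone claims ROI.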
13. Quality Gates for AI Metrics
Before an AI workflow is considered successful, it must at least pass the quality gates.
| Gate | Required evidence |
|---|---|
| Behavior gate | Tests prove the acceptance criteria |
| Review gate | Human reviewer understands the diff |
| Security gate | No unresolved high/critical findings |
| Compatibility gate | Contract/backward compatibility is safe |
| Operational gate | Logging/metrics/rollback are sufficient |
| Documentation gate | Docs/ADR/runbook updated if affected |
| Governance gate | AI usage disclosure and approvals are complete |
13.1 Go/No-Go Example
| Condition | Decision |
|---|---|
| Cycle time down, defects up | no-go or restrict use case |
| Cycle time down, review burden up sharply | redesign task slicing |
| Review burden down, quality stable | scale carefully |
| Quality up, cycle time stable | still valuable |
| Cost up, incidents down | evaluate risk-adjusted value |
| AI reviewer noisy | tune/restrict reviewer |
14. Measurement of AI Code Review
AI code review needs its own metrics.
14.1 Precision and Recall Thinking
| Term | Meaning in AI review |
|---|---|
| True positive | AI finds a valid issue |
| False positive | AI reports an invalid issue |
| False negative | AI misses an issue found by a human or in production |
| True negative | AI correctly reports no issue |
In practice, recall is hard to measure because we do not know all the issues that were missed. Precision, however, can be measured from human disposition.
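Precision from human disposition can be computed with a simple count. The disposition values `accepted`, `dismissed`, and `pending` are assumed names for this sketch; use whatever states your review tool records.

```python
from collections import Counter


def finding_precision(dispositions):
    """Share of judged AI findings that a human accepted as valid.

    Findings that are still pending are excluded, since nobody has
    judged them yet. Returns None when nothing has been judged.
    """
    counts = Counter(dispositions)
    true_positives = counts["accepted"]
    false_positives = counts["dismissed"]
    judged = true_positives + false_positives
    return true_positives / judged if judged else None


# Hypothetical dispositions for one week of AI review findings.
precision = finding_precision(
    ["accepted", "dismissed", "accepted", "pending", "dismissed", "dismissed"]
)
```

Tracking this number per prompt or model version shows whether tuning the reviewer is actually reducing noise.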
14.2 Review Finding Lifecycle
14.3 AI Reviewer Metrics
| Metric | Use |
|---|---|
| Finding precision | Reduce noise |
| Accepted finding count | Value indicator |
| Severity distribution | Is AI only making style comments? |
| Duplicate finding rate | Prompt/model noise |
| Reviewer override rate | Trust calibration |
| Fix verification rate | Is the issue actually resolved? |
| Time added/removed | Does review get faster? |
14.4 AI Reviewer Policy
The AI reviewer may:
- provide checklists,
- find potential issues,
- request evidence,
- compare against conventions,
- flag risk.
The AI reviewer must not:
- act as final approver,
- override the human reviewer,
- approve high-risk changes without evidence,
- change code automatically without author review,
- post blocking comments for style noise.
15. AI-Assisted Testing Metrics
AI often looks very productive when writing tests. But tests can be fake.
15.1 Test Quality Dimensions
| Dimension | Question |
|---|---|
| Relevance | Is the test tied to a requirement? |
| Oracle strength | Do the assertions prove the behavior? |
| Edge coverage | Are important edge cases covered? |
| Regression power | Will the test fail if the bug returns? |
| Maintainability | Are fixtures clear and not brittle? |
| Determinism | Is the test stable in CI? |
| Runtime | Is the feedback loop still fast? |
15.2 AI Test Score
ai_test_score = average(
relevance,
oracle_strength,
edge_case_coverage,
regression_power,
maintainability,
determinism
)
Use this score when reviewing generated tests.
15.3 Common AI Test Failure
| Failure | Symptom | Metric signal |
|---|---|---|
| Weak oracle | Test passes even when logic is wrong | high mutation survival |
| Implementation mirroring | Test copies production logic | escaped defects stay high |
| Happy path only | Edge-case bugs slip through | low scenario coverage |
| Over-mocking | Integration issues slip through | prod defects at boundaries |
| Fragile fixture | Tests often fail because of setup | high flake rate |
| Slow test bloat | CI keeps getting slower | test runtime increases |
16. AI Implementation Metrics by Risk Class
Do not use the same thresholds for every task.
| Risk class | AI usage allowed | Measurement priority |
|---|---|---|
| Low | docs, tests, small refactor, UI copy | speed, reviewability |
| Medium | bugfix, endpoint, non-critical workflow | quality, rework, CI |
| High | auth, payment, database, compliance | security, review evidence, rollback |
| Critical | production operation, destructive migration | approval, audit, incident readiness |
16.1 Example Thresholds
| Metric | Low risk | Medium risk | High risk |
|---|---|---|---|
| Max PR size | 400 LOC | 250 LOC | 150 LOC |
| Required human reviewers | 1 | 1-2 | 2+ |
| Required tests | unit | unit + integration | unit + integration + contract/regression |
| AI disclosure | yes | yes | yes + approval note |
| Security scan | standard | standard | blocking high/critical |
| Rollback plan | optional | required if behavior changes | required |
The numbers above are not universal. Use them as a starting point.
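Thresholds like these can be checked automatically before review. The sketch below encodes only the PR size and reviewer rows, reading "1-2" as a minimum of 1 and "2+" as a minimum of 2; the function and key names are hypothetical, and the numbers remain starting points rather than rules.

```python
# Starting-point thresholds copied from the table; tune them per team.
THRESHOLDS = {
    "low": {"max_pr_loc": 400, "min_reviewers": 1},
    "medium": {"max_pr_loc": 250, "min_reviewers": 1},
    "high": {"max_pr_loc": 150, "min_reviewers": 2},
}


def threshold_violations(risk_class, changed_loc, human_reviewers):
    """Return a list of human-readable violations for one PR."""
    if risk_class not in THRESHOLDS:
        raise ValueError(f"no thresholds defined for risk class: {risk_class}")
    limits = THRESHOLDS[risk_class]
    violations = []
    if changed_loc > limits["max_pr_loc"]:
        violations.append(
            f"PR size {changed_loc} LOC exceeds {limits['max_pr_loc']} LOC"
        )
    if human_reviewers < limits["min_reviewers"]:
        violations.append(
            f"{human_reviewers} human reviewer(s), need {limits['min_reviewers']}"
        )
    return violations


# Hypothetical PRs.
high_risk_issues = threshold_violations("high", changed_loc=180, human_reviewers=1)
low_risk_issues = threshold_violations("low", changed_loc=300, human_reviewers=1)
```

Critical-risk work is deliberately left out of the mapping, so the check fails loudly instead of silently applying a weaker rule.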
17. Data Quality Problems
Engineering metrics break easily.
17.1 Common Problems
| Problem | Impact | Mitigation |
|---|---|---|
| Missing labels | AI impact is not visible | PR template + automation |
| Inconsistent task type | Comparisons are wrong | controlled taxonomy |
| Unrelated changes in one PR | Cycle/quality bias | PR-per-intent |
| Squashed history loses signal | Attribution is hard | keep PR metadata |
| Manual AI usage undisclosed | Under-reporting | team agreement, not punitive |
| Tool logs incomplete | Cost/approval unknown | standardized logging |
| Individual ranking | Gaming metrics | team-level dashboard |
17.2 AI Disclosure Template
Add to the PR:
## AI Usage
- AI used: yes/no
- Mode: chat / IDE pair / terminal agent / cloud agent / AI review
- Scope: explanation / test generation / implementation / refactor / docs / CI repair
- Human validation performed:
- [ ] I reviewed the diff manually
- [ ] I understand the behavior change
- [ ] I ran relevant tests
- [ ] I checked security-sensitive paths
- Risk class: low / medium / high / critical
- Notes:
Disclosure is not for shaming. Disclosure is for observability and governance.
18. Experiment Design for AI Rollout
Do not roll out AI to every workflow at once and then struggle to read the impact.
18.1 Start with Use Case
Examples of good use cases for an experiment:
- generate unit tests untuk existing service,
- fix flaky tests,
- summarize PR + docs impact,
- debug CI failure,
- implement low-risk endpoint,
- refactor small module with characterization tests.
Examples of poor use cases for a first experiment:
- production database migration,
- auth redesign,
- payment flow change,
- multi-service architecture rewrite,
- compliance-critical workflow without test baseline.
18.2 Experiment Template
# AI Workflow Experiment
## Hypothesis
Using AI for <workflow> will improve <metric> without degrading <quality metric>.
## Scope
- Repo/service:
- Task type:
- Risk class:
- AI mode:
- Human gate:
## Baseline
- Historical period:
- Comparable tasks:
- Baseline metrics:
## Success Criteria
- Flow:
- Quality:
- Review:
- Cost:
- Governance:
## Guardrails
- Stop condition:
- Protected paths:
- Required tests:
- Required approvals:
## Data Collection
- Issue labels:
- PR template:
- CI logs:
- Security scans:
- AI usage logs:
## Review Cadence
- Weekly review:
- Final decision:
18.3 Stop Conditions
Stop or restrict the experiment if:
- escaped defects rise significantly,
- reviewer burden rises without added value,
- AI-generated PRs often need a rewrite,
- security findings rise,
- policy exceptions recur,
- engineers cannot explain the patch,
- cost is out of proportion to the outcome,
- prompt/tool behavior cannot be audited.
19. AI Productivity Review Meeting
Hold a periodic review, for example biweekly or monthly.
Agenda:
- Which AI workflows are most valuable?
- Which workflows are noisiest?
- Did lead time drop?
- Did review burden go up or down?
- Did defects/rework change?
- Is the cost reasonable?
- Were there any security/governance incidents?
- Which prompts/templates need to be standardized?
- Which repos need to be made more AI-readable?
- Which use cases need to be restricted?
19.1 Meeting Output
The output is not slides. The output must become a backlog:
| Finding | Action |
|---|---|
| AI PRs too large | update task slicing policy |
| Generated tests weak | add mutation review checklist |
| Cloud agent often fails setup | improve repo bootstrap script |
| AI reviewer noisy | tune prompt and restrict severity |
| High cost for debugging | improve log context pack |
| Docs summary useful | standardize PR docs impact template |
20. Scorecard Templates
20.1 Team AI Delivery Scorecard
# Team AI Delivery Scorecard
Period: <YYYY-MM>
Team: <team>
Services: <services>
## AI Usage Mix
- AI-assisted tasks:
- AI modes:
- Top workflows:
## Flow
- Lead time p50/p75/p90:
- PR cycle time p50/p75/p90:
- Review wait time:
- Time to green CI:
## Quality
- Escaped defects:
- Rework rate:
- Reverts:
- Security findings:
- Flaky tests:
## Review Health
- Average PR size:
- Review rounds:
- AI reviewer precision:
- Reviewer burden trend:
## Cost
- Tool/API spend:
- Cost per accepted PR:
- Review-adjusted cost:
## Governance
- AI disclosure completeness:
- Approval exceptions:
- Protected path violations:
## Decisions
- Scale:
- Restrict:
- Improve:
20.2 PR-Level AI Evidence Scorecard
# AI Evidence Scorecard
PR: <link>
Task: <link>
Risk: low/medium/high/critical
AI mode: <mode>
## Evidence
- Acceptance criteria mapped: yes/no
- Tests added/updated: yes/no
- CI passed: yes/no
- Security scan passed: yes/no
- Docs updated: yes/no
- Rollback plan included: yes/no
## Review
- Human reviewer understands diff: yes/no
- AI review findings dispositioned: yes/no
- Rework needed: none/minor/major
## Outcome
- Merged:
- Reverted:
- Incident:
- Follow-up required:
21. Interpreting Common Metric Patterns
21.1 Pattern: Coding Faster, Review Slower
Signal:
- branch lifetime drops,
- PR review cycle time rises,
- review rounds rise,
- reviewer comment density rises.
Diagnosis:
- AI produces diffs that are too large,
- author does not understand the patch,
- PR summary is generic,
- test evidence is weak.
Action:
- enforce PR-per-intent,
- add AI usage disclosure,
- require author explanation,
- improve task slicing.
21.2 Pattern: More Tests, Same Defects
Signal:
- test count rises,
- coverage rises,
- escaped defects stay flat or rise,
- mutation score is low.
Diagnosis:
- weak oracle,
- happy path only,
- over-mocking,
- tests mirror implementation.
Action:
- add behavior matrix,
- require assertion review,
- add mutation testing for critical modules,
- review fixture quality.
21.3 Pattern: AI Reviewer Finds Many Issues, Few Accepted
Signal:
- AI comment count is high,
- accepted findings are low,
- reviewers dismiss findings often,
- review latency rises.
Diagnosis:
- prompt is too generic,
- AI lacks repo conventions,
- severity is not calibrated,
- style noise.
Action:
- restrict AI reviewer scope,
- add severity policy,
- provide repo-specific checklist,
- measure precision.
21.4 Pattern: Delivery Faster, Incidents Higher
Signal:
- lead time goes down,
- deployment frequency goes up,
- change failure rate goes up,
- incident count goes up.
Diagnosis:
- quality gates are too loose,
- risky AI tasks are not treated differently,
- test coverage is insufficient,
- rollout/rollback is weak.
Action:
- add risk class,
- require extra gate for high-risk task,
- strengthen observability,
- slow down unsafe workflows.
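The trend-based patterns above can be checked mechanically once metric deltas are available. The following is a rough sketch that looks only at the direction of change; the metric keys are hypothetical names, and a real check would need thresholds to separate signal from noise. Pattern 21.3 is left out because it depends on levels (high comment volume, low acceptance) rather than on trends.

```python
def detect_patterns(delta):
    """Return the named patterns whose signals all appear in `delta`.

    `delta` maps a metric name to (current - baseline); a missing metric
    counts as unchanged. Only the direction of change is checked.
    """
    def up(name):
        return delta.get(name, 0) > 0

    def down(name):
        return delta.get(name, 0) < 0

    patterns = []
    if down("branch_lifetime") and up("review_cycle_time") and up("review_rounds"):
        patterns.append("coding faster, review slower")
    if up("test_count") and up("coverage") and not down("escaped_defects"):
        patterns.append("more tests, same defects")
    if down("lead_time") and up("change_failure_rate") and up("incident_count"):
        patterns.append("delivery faster, incidents higher")
    return patterns

# Hypothetical quarter-over-quarter deltas for one team.
delta = {
    "lead_time": -6.0,            # hours
    "deployment_frequency": 2.0,  # deploys per week
    "change_failure_rate": 0.04,
    "incident_count": 3,
}
detected = detect_patterns(delta)
print(detected)
```

A detected pattern is a prompt to look at raw PRs and incidents, not a diagnosis in itself.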
22. Practical SQL/Data Model Sketch
For teams that want to build a simple warehouse:
CREATE TABLE ai_assisted_prs (
pr_id TEXT PRIMARY KEY,
repo TEXT NOT NULL,
service TEXT,
team TEXT,
task_type TEXT,
risk_class TEXT,
ai_mode TEXT,
opened_at TIMESTAMP,
first_review_at TIMESTAMP,
approved_at TIMESTAMP,
merged_at TIMESTAMP,
lines_added INT,
lines_deleted INT,
files_changed INT,
ci_failures INT,
review_rounds INT,
ai_review_findings INT,
accepted_ai_findings INT,
human_requested_changes INT,
security_findings_high INT,
escaped_defect BOOLEAN DEFAULT FALSE,
reverted BOOLEAN DEFAULT FALSE,
ai_cost_usd NUMERIC,
disclosure_complete BOOLEAN
);
Example derived metrics:
SELECT
team,
ai_mode,
percentile_cont(0.5) WITHIN GROUP (
ORDER BY EXTRACT(EPOCH FROM (merged_at - opened_at)) / 3600
) AS pr_cycle_time_p50_hours,
AVG(review_rounds) AS avg_review_rounds,
AVG(CASE WHEN escaped_defect THEN 1 ELSE 0 END) AS escaped_defect_rate,
AVG(CASE WHEN disclosure_complete THEN 1 ELSE 0 END) AS disclosure_rate
FROM ai_assisted_prs
WHERE merged_at IS NOT NULL
GROUP BY team, ai_mode;
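The same table can also feed the review and cost rows of the dashboard. The sketch below runs against an in-memory SQLite database with a reduced copy of the schema so that it is runnable anywhere; the rows are invented, and reading "cost per accepted PR" as total AI spend divided by merged PRs is an assumption, not a definition taken from this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced copy of ai_assisted_prs: only the columns this query needs.
conn.execute("""
    CREATE TABLE ai_assisted_prs (
        pr_id TEXT PRIMARY KEY,
        team TEXT,
        merged_at TEXT,
        ai_review_findings INT,
        accepted_ai_findings INT,
        ai_cost_usd NUMERIC
    )
""")
rows = [
    ("pr-1", "payments", "2026-01-05", 10, 4, 1.50),
    ("pr-2", "payments", "2026-01-06", 6, 4, 2.50),
    ("pr-3", "payments", None, 4, 0, 2.00),  # never merged, still costs money
]
conn.executemany("INSERT INTO ai_assisted_prs VALUES (?, ?, ?, ?, ?, ?)", rows)

query = """
    SELECT
        team,
        -- share of AI review findings that reviewers accepted
        1.0 * SUM(accepted_ai_findings)
            / NULLIF(SUM(ai_review_findings), 0) AS ai_reviewer_precision,
        -- all spend, including unmerged PRs, divided by merged PRs
        SUM(ai_cost_usd) / NULLIF(COUNT(merged_at), 0) AS cost_per_merged_pr
    FROM ai_assisted_prs
    GROUP BY team
"""
team, precision, cost = conn.execute(query).fetchone()
print(team, precision, cost)
```

Unmerged PRs stay in the cost numerator on purpose: spend on abandoned work is still spend. `NULLIF` turns a zero denominator into NULL instead of raising an error.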
23. Minimum Viable Measurement Setup
If the team does not have a data platform yet, start simple.
Week 1
- Add an AI usage section to the PR template.
- Add labels for AI mode and risk class.
- Record PR cycle time and review rounds.
- Record CI pass/fail.
- Record rework as major/minor.
Week 2
- Build a simple dashboard from GitHub/GitLab data.
- Manually review 10 AI-assisted PRs.
- Count AI review false positives.
- Compare PR size for AI-assisted vs non-AI changes.
- Identify one workflow that needs improvement.
Week 3
- Add cost tracking.
- Add security findings tracking.
- Add a task type breakdown.
- Create an improvement backlog.
Week 4
- Decide which workflows to scale.
- Decide which workflows to restrict.
- Update context/prompt/template.
- Revise the team AI working agreement.
24. Failure Modes in Measurement
24.1 Goodhart's Law
When a metric becomes a target, it can stop measuring what it was meant to measure.
Examples:
- If the target is PR count, people open small PRs with no value.
- If the target is AI usage, people use AI when it is not needed.
- If the target is review speed, reviewers approve carelessly.
- If the target is test count, people write weak tests.
Mitigation:
- use a balanced metric set,
- review qualitative examples,
- avoid individual ranking,
- use metrics for diagnosis, not punishment.
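One way to make a balanced metric set concrete is to pair every target metric with guardrail metrics and to accept an improvement only when no guardrail regresses. This is a minimal sketch under assumptions: the 5% tolerance, the metric names, and the sample values are all hypothetical.

```python
def evaluate_change(target, guardrails, tolerance=0.05):
    """Accept a target improvement only if no guardrail regressed.

    Each metric is a tuple (baseline, current, higher_is_better).
    Baselines must be non-zero because changes are computed as ratios.
    """
    def relative_gain(baseline, current, higher_is_better):
        change = (current - baseline) / baseline
        return change if higher_is_better else -change

    regressed = [
        name for name, metric in guardrails.items()
        if relative_gain(*metric) < -tolerance
    ]
    improved = relative_gain(*target) > 0
    return {
        "improved": improved,
        "regressed": regressed,
        "accept": improved and not regressed,
    }

# Hypothetical numbers: PR cycle time dropped, escaped defect rate rose.
result = evaluate_change(
    target=(30.0, 20.0, False),  # PR cycle time in hours, lower is better
    guardrails={
        "escaped_defect_rate": (0.04, 0.06, False),
        "review_rounds": (2.0, 2.0, False),
    },
)
print(result)
```

In this example the speed gain is real, but it is not accepted because the escaped defect rate got worse, which is the trade-off a single target metric would hide.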
24.2 Survivorship Bias
Teams tend to look only at PRs that were merged, but failed AI tasks matter just as much.
Record:
- abandoned AI branches,
- failed agent tasks,
- rewritten AI output,
- discarded generated tests,
- prompts that produced unsafe output.
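Counting every attempt, not only merged PRs, changes the denominator of any success rate. A small sketch with invented outcome labels; the label names are assumptions that mirror the list above.

```python
from collections import Counter

# Hypothetical outcome label for every AI-assisted attempt in a period,
# including work that never became a merged PR.
attempts = [
    "merged", "merged", "abandoned", "merged", "agent_failed",
    "rewritten", "merged", "abandoned", "merged", "unsafe_output",
]

outcomes = Counter(attempts)
merge_rate = outcomes["merged"] / len(attempts)

print(outcomes.most_common())
print(f"merged: {merge_rate:.0%} of {len(attempts)} attempts")
```

Looking only at merged PRs would hide the other half of these attempts, together with the time and cost they consumed.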
24.3 Automation Bias
If an AI dashboard looks tidy, people trust it without auditing it.
Mitigation:
- sampling manual,
- audit raw PR,
- compare against incidents,
- qualitative reviewer feedback.
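Manual sampling is easier to defend when the draw is random and reproducible rather than hand-picked. A minimal sketch, assuming PR identifiers are available as strings; `audit_sample` and the identifiers are hypothetical.

```python
import random

def audit_sample(pr_ids, k, seed):
    """Pick up to k PRs for manual audit; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    # Sort first so the result depends only on the set of IDs and the seed.
    population = sorted(pr_ids)
    return rng.sample(population, min(k, len(population)))

# Hypothetical month of AI-assisted PRs.
pr_ids = [f"pr-{n}" for n in range(1, 41)]
sample = audit_sample(pr_ids, k=10, seed=27)
print(sample)
```

Recording the seed next to the audit notes lets someone else reproduce exactly which PRs were inspected.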
24.4 Local Optimization
AI speeds up local coding but degrades overall system flow.
Mitigation:
- measure end-to-end,
- include review/CI/release,
- use DORA + quality + cost.
25. What Top 1% Engineers Watch
Strong engineers do not only ask “how fast?”. They ask:
- Does AI relieve a real bottleneck?
- Are PRs getting smaller or larger?
- Do reviewers trust the changes more, or are they more fatigued?
- Has the escaped defect rate changed?
- Do AI-generated tests actually catch bugs?
- Is the security posture improving or weakening?
- Do high-risk tasks have approval and evidence?
- Is the cost still proportionate to the outcome?
- Do engineers still understand the code they merge?
- Are context, prompts, and templates improving over time?
26. 20-Hour Deliberate Practice Plan
Hour 1-2: Baseline
Take the last 20 PRs from one repo.
Record:
- PR cycle time,
- review rounds,
- PR size,
- CI failures,
- rework,
- defect follow-up.
Hour 3-4: AI Disclosure Template
Add an AI usage section to the PR template.
Simulate it on 5 older PRs.
Hour 5-6: Task Taxonomy
Classify tasks:
- docs,
- test,
- bugfix,
- feature,
- refactor,
- migration,
- DevOps,
- security.
Hour 7-8: Risk Class
Add a risk class:
- low,
- medium,
- high,
- critical.
Hour 9-10: Reviewability Score
Score 10 PRs on:
- intent clarity,
- diff size,
- test evidence,
- risk statement,
- rollback clarity.
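For the scoring exercise, a simple rubric keeps ratings comparable across PRs. The sketch below assumes each criterion is rated 0 (missing), 1 (partial), or 2 (clear) and that all five carry equal weight; both the scale and the weighting are assumptions made for illustration.

```python
CRITERIA = (
    "intent_clarity",
    "diff_size",
    "test_evidence",
    "risk_statement",
    "rollback_clarity",
)

def reviewability_score(ratings):
    """Average the five criteria, each rated 0-2, and scale to 0-100."""
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    if any(not 0 <= ratings[c] <= 2 for c in CRITERIA):
        raise ValueError("each rating must be between 0 and 2")
    total = sum(ratings[c] for c in CRITERIA)
    return round(100 * total / (2 * len(CRITERIA)))

# Hypothetical PR: clear intent and rollback, no risk statement at all.
score = reviewability_score({
    "intent_clarity": 2,
    "diff_size": 1,
    "test_evidence": 1,
    "risk_statement": 0,
    "rollback_clarity": 2,
})
print(score)
```

The individual ratings are usually more useful than the total, because they show which part of the PR needs work.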
Hour 11-12: AI Reviewer Evaluation
Run AI review on 5 PRs.
Record:
- valid findings,
- false positives,
- missed issues,
- severity quality.
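The first three counts are enough to compute precision and recall for the AI reviewer. A minimal sketch with invented numbers; it assumes that missed issues come from a human review of the same PRs, since the AI reviewer cannot report what it failed to find.

```python
def reviewer_quality(valid_findings, false_positives, missed_issues):
    """Return (precision, recall); None where the denominator is zero."""
    flagged = valid_findings + false_positives    # everything the AI raised
    real_issues = valid_findings + missed_issues  # everything that was wrong
    precision = valid_findings / flagged if flagged else None
    recall = valid_findings / real_issues if real_issues else None
    return precision, recall

# Hypothetical totals across 5 reviewed PRs.
precision, recall = reviewer_quality(
    valid_findings=6, false_positives=14, missed_issues=2
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision with decent recall, as in this example, points at noise: the reviewer finds most real problems but buries them under findings that humans dismiss.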
Hour 13-14: Test Quality Evaluation
Take 5 AI-generated tests.
Score:
- relevance,
- oracle strength,
- edge coverage,
- determinism,
- maintainability.
Hour 15-16: Dashboard Sketch
Sketch a simple dashboard covering:
- flow,
- quality,
- review,
- cost,
- governance.
Hour 17-18: Experiment Design
Write one AI workflow experiment:
- hypothesis,
- scope,
- baseline,
- success criteria,
- stop condition.
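Writing the experiment down as data forces the success criteria and the stop condition to be explicit before any results arrive. This is a sketch under assumptions: the choice of cycle time as the success metric, revert rate as the stop metric, and every threshold below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str
    scope: str
    baseline_cycle_hours: float
    success_max_cycle_hours: float  # success criterion
    stop_max_revert_rate: float     # stop condition

    def decide(self, cycle_hours, revert_rate):
        """Return 'stop', 'success', or 'continue' for one review period."""
        # The stop condition is checked first: safety outranks speed.
        if revert_rate > self.stop_max_revert_rate:
            return "stop"
        if cycle_hours <= self.success_max_cycle_hours:
            return "success"
        return "continue"

exp = Experiment(
    hypothesis="AI-drafted tests cut PR cycle time without more reverts",
    scope="test and docs tasks in one repo, low risk class only",
    baseline_cycle_hours=30.0,
    success_max_cycle_hours=24.0,
    stop_max_revert_rate=0.05,
)
print(exp.decide(cycle_hours=22.0, revert_rate=0.02))
```

Fixing the thresholds up front makes it harder to reinterpret a disappointing result as a success afterwards.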
Hour 19-20: Improvement Backlog
Create a 10-item backlog for improving the AI development system.
Prioritize by:
- impact,
- risk reduction,
- effort,
- evidence quality.
27. Checklist: AI Development Metrics Readiness
Use this checklist before claiming that AI adoption has succeeded.
## Metrics Readiness Checklist
### Baseline
- [ ] Pre-AI baseline is available
- [ ] Task type taxonomy is available
- [ ] Risk class taxonomy is available
- [ ] Comparable task set is available
### Flow
- [ ] PR cycle time is measured
- [ ] Review wait time is measured
- [ ] Time to green CI is measured
- [ ] Release latency is measured
### Quality
- [ ] Rework rate is measured
- [ ] Escaped defects are measured
- [ ] Flaky test rate is measured
- [ ] Security findings are measured
### Review
- [ ] PR size is measured
- [ ] Review rounds are measured
- [ ] AI reviewer finding disposition is measured
- [ ] Reviewer feedback is collected
### AI Usage
- [ ] AI mode is recorded
- [ ] AI usage disclosure is included in the PR
- [ ] Cost is recorded
- [ ] Approval events are recorded
### Governance
- [ ] Policy exceptions are recorded
- [ ] Protected path modifications are recorded
- [ ] Audit evidence is stored
- [ ] High-risk workflows have a gate
### Improvement
- [ ] A metrics review cadence exists
- [ ] An improvement backlog exists
- [ ] Prompt/context updates are tracked
- [ ] Unsafe workflows can be stopped
28. Summary
AI development measurement must answer for outcomes, not activity.
Core principles:
- Measure productivity as a system property.
- Use DORA as the delivery baseline, not as the only measure.
- Combine flow, quality, review, risk, cost, learning, and governance.
- Do not measure individual developers with AI metrics.
- Do not use LOC, prompt count, or acceptance rate as primary KPIs.
- Separate results by task type and risk class.
- Measure review burden, because AI often shifts the bottleneck to reviewers.
- Measure generated test quality, not just test count.
- Treat AI usage disclosure as observability, not surveillance.
- Use metrics as a feedback loop to improve context, prompts, repo readiness, test strategy, and governance.
Good AI is not the AI that produces the most code.
Good AI is the AI that helps the team ship valuable changes with stronger evidence, lower risk, and a faster learning loop.
29. Practical References
- DORA Metrics: https://dora.dev/guides/dora-metrics/
- Accelerate / DORA research, as the basis for software delivery performance measurement
- SPACE framework, for discussing developer productivity
- OWASP Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
- NIST AI 600-1 Generative AI Profile: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf