
Learn AI Development Driven Implementation and Usage, Part 027: Quality Metrics and Productivity Measurement


title: Learn AI Development Driven Implementation and Usage - Part 027
description: Quality metrics and productivity measurement for AI-driven software development: delivery, quality, reliability, review burden, risk, cost, and governance evidence.
series: learn-ai-development-driven-implementation-usage
seriesTitle: Learn AI Development Driven Implementation and Usage
order: 27
partTitle: Quality Metrics and Productivity Measurement
tags:
- ai
- software-engineering
- productivity
- metrics
- dora
- code-quality
- delivery
- governance
- series
date: 2026-06-30

Part 027 — Quality Metrics and Productivity Measurement

Goal of this part: build a measurement system that can answer one important question: does AI actually improve engineering delivery without degrading quality, security, maintainability, and organizational learning?

AI-driven development is often sold with a speed narrative: write code faster, write tests faster, close tickets faster. That narrative is not enough.

In serious software engineering, productivity is not the number of lines of code. Productivity is the organization's ability to turn intent into software that is correct, secure, reliable, easy to change, and valuable to the business, within a good cycle time.

AI can speed up delivery. AI can also speed up the creation of defects, increase review burden, add noise, lead engineers to accept changes without understanding the system, and create audit gaps. Measurement of AI development therefore has to be multi-dimensional.

This part covers how to measure AI-driven implementation as an engineering system, not as a tooling campaign.

1. Kaufman Framing: The Skill Actually Being Learned

The core skill in this part is:

Measuring the effect of AI on software delivery in a way that is objective, balanced, and actionable.

Its sub-skills:

Sub-skill	Assessable output
Metric design	Can choose metrics that fit the goal and the risk
Baseline thinking	Can compare before/after AI fairly
Signal vs noise	Can tell actionable metrics from vanity metrics
Quality measurement	Can measure defects, rework, test signal, review burden, and incidents
Productivity measurement	Can measure flow, throughput, cycle time, and cost without misusing individual metrics
AI attribution	Can measure AI's contribution without assuming every improvement comes from AI
Governance evidence	Can produce audit evidence that AI is used under adequate controls
Feedback loop	Can turn metrics into an improvement backlog

1.1 Target Performance After 20 Hours

After 20 hours of practice, you should be able to:

build a balanced scorecard for AI-assisted engineering,
choose a baseline before the AI rollout,
measure lead time, review burden, rework, defect escape, CI health, and cost,
distinguish output metrics from outcome metrics,
build a dashboard that does not encourage bad behavior,
produce AI usage evidence for review and audit,
run a small experiment comparing AI and non-AI workflows,
build an improvement backlog based on data.

2. Core Mental Model: AI Productivity Is a System Property

AI productivity is not a property of the individual, the tool, or the model alone. Productivity is a property of the work system.

If requirements are poor, AI will accelerate the wrong implementation.

If tests are weak, AI will produce patches that look correct but are not proven.

If review is undisciplined, AI becomes a defect multiplier.

If observability is poor, the team cannot tell whether improvement actually happened.

So the right question is not:

“What percentage of the code was written by AI?”

The right question is:

“Does our delivery system produce valuable changes faster, with equal or better quality, controlled risk, reasonable cost, and learning kept intact?”

3. Anti-Metrics: Metrics That Look Attractive but Mislead

Before choosing good metrics, throw out the bad ones.

Anti-metric	Why it is dangerous	Better replacement
Lines of code generated by AI	Encourages code bloat and low-quality generation	Accepted behavior-changing PR with passing evidence
Prompt count	Measures activity, not results	Successful task completion with review quality
AI acceptance rate	High acceptance can mean blind acceptance	Accepted diff after tests + review findings
Number of AI-created PRs	Can inflate the review queue	Merged PRs with low rework and low defect escape
Developer velocity individual	Easily misused for surveillance	Team-level flow and outcome metrics
Story points completed	Not stable across teams	Cycle time, delivery throughput, customer impact
Test coverage only	High coverage is possible without meaningful assertions	Mutation score, assertion quality, defect detection
Number of comments from AI reviewer	Many comments can mean noise	Actionable finding precision and false-positive rate
Token spend only	Low cost can mean poor context	Cost per accepted, verified change

The principle:

AI metrics should measure verified outcomes, not generative activity.

4. Measurement Pyramid

Use a pyramid so that metrics do not get stuck at a single level.

4.1 AI Usage Signal

Examples:

number of tasks that use AI,
AI usage category,
model/tool used,
prompt/template used,
agent runtime,
token/cost,
command/tool invocation,
approval events.

This is only basic telemetry. Do not stop here.

4.2 Workflow Health

Examples:

PR cycle time,
review waiting time,
rework count,
CI failure count,
PR size,
number of review iterations,
time from first review to merge.

This shows whether AI smooths the flow or merely moves the bottleneck to reviewers.

4.3 Engineering Quality

Examples:

escaped defect,
incident caused by change,
security finding,
test flakiness,
mutation score,
static analysis issue,
dependency vulnerability,
rollback/forward-fix rate.

This answers whether AI output is safe.

4.4 Delivery Outcome

Examples:

lead time for changes,
deployment frequency,
failed deployment recovery time,
change failure rate,
throughput of valuable changes,
release predictability.

This connects AI to delivery capability.

4.5 Business Outcome

Examples:

feature adoption,
operational cost reduction,
support ticket reduction,
customer-facing defect reduction,
compliance evidence completeness,
faster regulatory response.

This level matters most, but it is the hardest to attribute directly to AI.

5. DORA Metrics as a Delivery Baseline

DORA metrics are useful because they measure the delivery performance of the engineering system, not just individual activity.

Four main categories to use as a baseline:

Metric	Question answered	AI-specific interpretation
Change lead time	How fast does a change get from commit to production?	Does AI speed up the end-to-end flow, or only local coding?
Deployment frequency	How often does the team deploy?	Does AI make changes smaller and more frequent?
Change failure rate	What proportion of deployments cause problems?	Does AI increase defects/incidents?
Failed deployment recovery time	How long does it take to recover from a failed deployment?	Does AI help diagnosis/rollback, or does it worsen operability?

Important notes:

DORA alone is not enough to evaluate AI.
DORA needs to be complemented with quality, review, security, and cost metrics.
Do not use DORA to punish individuals.
DORA is ideally viewed per team, per service, and per risk class.
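
As a rough illustration of how the four baseline metrics could be derived from deployment records for one team or service, here is a minimal Python sketch. The `Deployment` record, its field names, and the sample data are assumptions made for this example; they do not come from any DORA tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; map it from your own CI/CD export.
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_baseline(deploys: list[Deployment], period_days: int) -> dict:
    """Compute the four baseline metrics over one observation period."""
    if not deploys:
        raise ValueError("need at least one deployment")
    failures = [d for d in deploys if d.failed]
    recoveries = [hours(d.recovered_at - d.deployed_at)
                  for d in failures if d.recovered_at is not None]
    return {
        "lead_time_p50_hours": median(hours(d.deployed_at - d.committed_at) for d in deploys),
        "deploys_per_week": len(deploys) / (period_days / 7),
        "change_failure_rate": len(failures) / len(deploys),
        "recovery_time_p50_hours": median(recoveries) if recoveries else None,
    }


# Illustrative data only: four deployments in a 14-day window, one of them failed.
t0 = datetime(2026, 6, 1, 9, 0)
deploys = [
    Deployment(t0, t0 + timedelta(hours=10)),
    Deployment(t0, t0 + timedelta(hours=20), failed=True,
               recovered_at=t0 + timedelta(hours=22)),
    Deployment(t0, t0 + timedelta(hours=30)),
    Deployment(t0, t0 + timedelta(hours=40)),
]
baseline = dora_baseline(deploys, period_days=14)
```

Computing the same dictionary for the period before the AI rollout and for each later period, split by team, service, and risk class, gives the baseline that the other metrics are compared against.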

6. AI Development Balanced Scorecard

Use a five-dimension scorecard.

6.1 Dimensi 1: Flow

Metrics:

task cycle time,
PR cycle time,
review waiting time,
time to first green CI,
merge latency,
release latency.

Interpretation:

AI is good if it reduces time at the real bottleneck.
AI is bad if it speeds up coding but increases review waiting time.

6.2 Dimensi 2: Quality

Metrics:

escaped defect rate,
rework rate,
PR revert rate,
incident caused by change,
static analysis issue trend,
test failure after merge,
bug reopen rate.

Interpretation:

AI is good if it does not increase defect escape.
AI is very good if it helps reduce rework and strengthens test signal.

6.3 Dimensi 3: Review Load

Metrics:

reviewer time per PR,
number of review rounds,
reviewer comment density,
AI reviewer false positive rate,
human override rate,
average PR diff size.

Interpretation:

AI is good if it makes PRs clearer, smaller, and backed by evidence.
AI is bad if it forces reviewers to validate large diffs that the author does not understand.

6.4 Dimensi 4: Risk and Governance

Metrics:

secret exposure event,
unauthorized context usage,
protected path modification without approval,
dependency risk introduced,
audit evidence completeness,
policy exception count,
high-risk AI task approval ratio.

Interpretation:

AI is good if it produces evidence automatically and improves controls.
AI is bad if it makes changes untraceable.

6.5 Dimensi 5: Cost and ROI

Metrics:

AI spend per accepted PR,
AI spend per successful task,
cost per reviewable diff,
token burn by workflow type,
rework cost,
reviewer time saved or added,
incident cost avoided or created.

Interpretation:

Cheap AI that causes expensive rework is not an improvement.
Expensive AI that reduces incidents on critical systems can be very valuable.

7. Metric Catalog for AI-Driven Development

7.1 Flow Metrics

Metric	Definition	Useful breakdown
Task cycle time	From task ready to merged/released	task type, risk class, AI mode
Coding time	From branch created to PR opened	repo, language, complexity
PR review latency	From PR opened to first human review	team, reviewer pool
Review cycle time	From first review to approval	PR size, AI usage, risk
Time to green CI	From PR opened to CI passing	failure type, test type
Merge latency	From approval to merge	release gate, branch protection
Release latency	From merge to production	service, environment

Senior interpretation

If coding time drops 50% but PR review latency rises 80%, AI is not speeding up delivery. It is only moving work from the author to the reviewer.

If time to green CI drops because AI helps debug the pipeline, that is a good signal.

If merge latency rises because PRs are large, task slicing is poor.

7.2 Review Metrics

Metric	Definition	Warning sign
PR size	Lines/files changed	AI-generated PRs are too large
Review rounds	Number of review iterations	AI patch is immature
Actionable comment ratio	Comments that lead to a valid change	AI reviewer is noisy
Human override rate	AI findings rejected by humans	Poor review prompt or ill-suited model
Author explanation quality	PR explains intent, risk, tests	AI summary is too generic
Reviewer confidence score	Reviewer is confident the patch is understood	Author cognitive offloading

Simple rubric: Reviewability Score

Score	Meaning
1	Large diff, unclear behavior, weak tests
2	Clear intent, but risk statement or tests are lacking
3	Reviewable, adequate tests, some ambiguity
4	Small diff, strong evidence, clear risk
5	Very easy to review, invariants and rollback are clear

Simple formula:

reviewability_score = average(
  intent_clarity,
  diff_smallness,
  test_evidence,
  risk_statement,
  rollback_clarity
)
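
The formula above can be sketched in Python as follows. This is a minimal illustration that assumes each of the five dimensions is rated by the reviewer on the same 1 to 5 scale as the rubric; the validation rules and the sample ratings are assumptions of this example.

```python
DIMENSIONS = (
    "intent_clarity",
    "diff_smallness",
    "test_evidence",
    "risk_statement",
    "rollback_clarity",
)


def reviewability_score(ratings: dict[str, int]) -> float:
    """Average the five rubric dimensions, each rated 1 (worst) to 5 (best)."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    for d in DIMENSIONS:
        if not 1 <= ratings[d] <= 5:
            raise ValueError(f"{d} must be between 1 and 5")
    return sum(ratings[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Hypothetical PR: clear intent and small diff, but thin tests and no rollback notes.
score = reviewability_score({
    "intent_clarity": 4,
    "diff_smallness": 4,
    "test_evidence": 2,
    "risk_statement": 3,
    "rollback_clarity": 2,
})
```

Rejecting incomplete ratings keeps a reviewer from producing a score while skipping the dimensions that are uncomfortable to rate.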

7.3 Quality Metrics

Metric	Definition	AI-related question
Escaped defect	Bug found after merge/release	Is AI increasing bug leakage?
Rework rate	PR needs major changes after review	Are AI patches low quality?
Reopen rate	Bug ticket is reopened	Is AI fixing only the symptom?
Regression count	Existing behavior breaks	Are characterization tests sufficient?
Flaky test rate	Tests fail non-deterministically	Is AI producing brittle tests?
Mutation survival	Code mutations pass the tests	Are AI-written assertions weak?
Security finding	SAST/DAST/dependency issue	Is AI introducing insecure patterns?

7.4 Test Signal Metrics

Metric	Good sign	Bad sign
Meaningful assertion density	Assertions prove important behavior	Assertions only check not-null/status code
Branch/scenario coverage	Critical paths and edge cases are covered	Happy path only
Mutation score	Tests fail when logic is broken	Many mutants survive
Flake rate	Stable in CI	Frequent failures/retries
Test runtime	Fast enough for feedback	Slow without adding value
Failure diagnosis quality	Errors are easy to understand	Errors are generic and noisy

Do not measure coverage alone. Coverage answers “was the code executed?”, not “was the behavior proven?”.

7.5 Security Metrics

Metric	Definition
AI-introduced vulnerability count	Security findings on PRs that used AI
Secret leakage event	A secret ends up in a prompt/log/diff
Dependency risk introduced	New dependency with a CVE/license issue
Unsafe output handling	AI-generated code does not validate external output
Prompt injection exposure	Tool/agent can be influenced by untrusted input
Excessive agency event	Agent runs commands or accesses resources beyond its permissions
Policy exception count	Cases of AI use outside policy

Security metrics must be read together with severity. One critical vulnerability matters more than 100 style issues.

7.6 Cost Metrics

Metric	Simple formula
AI cost per task	total AI spend / completed AI-assisted task
AI cost per merged PR	total AI spend / merged AI-assisted PR
AI cost per accepted diff	total AI spend / accepted generated diff
Rework-adjusted cost	AI cost + human rework time cost
Review-adjusted cost	AI cost + reviewer time cost
Incident-adjusted cost	AI cost + incident caused/avoided cost

Cost metrics must include human time. A cheap tool that makes reviewers work twice as long is expensive.
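
To make the point about human time concrete, here is a small Python sketch of the per-PR and review-adjusted formulas from the table. All figures (spend, PR counts, reviewer hours, hourly rate) and the tool names are invented for illustration.

```python
def cost_per_unit(total_ai_spend: float, units: int) -> float:
    """total AI spend / completed units (tasks, merged PRs, accepted diffs)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_ai_spend / units


def review_adjusted_cost(ai_cost: float, reviewer_hours: float, hourly_rate: float) -> float:
    """AI cost + reviewer time cost, for a single PR."""
    return ai_cost + reviewer_hours * hourly_rate


HOURLY_RATE = 80.0  # assumed fully loaded reviewer cost per hour

# Hypothetical tool A: cheap, but each PR takes 1.0 reviewer hour.
tool_a = review_adjusted_cost(cost_per_unit(200.0, 40), reviewer_hours=1.0, hourly_rate=HOURLY_RATE)

# Hypothetical tool B: three times the spend, but each PR takes 0.5 reviewer hours.
tool_b = review_adjusted_cost(cost_per_unit(600.0, 40), reviewer_hours=0.5, hourly_rate=HOURLY_RATE)
```

With these made-up numbers the raw AI cost per merged PR favors tool A (5 versus 15), while the review-adjusted cost favors tool B (55 versus 85), which is exactly the trap the raw figure hides.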

7.7 Learning Metrics

This is often overlooked.

Metric	Purpose
Author explanation quality	Does the engineer understand the patch?
Post-review learning notes	Does review produce learning?
Prompt/template improvement count	Is the workflow improving?
Repeated issue rate	Does AI repeat the same mistakes?
Pairing session reflection	Can the engineer explain the design?
New engineer onboarding time	Does an AI-readable repo help onboarding?

Good AI adoption accelerates learning. Bad AI adoption makes engineers more passive.

8. AI Attribution: Do Not Misattribute Improvement

A common problem:

After the AI rollout, cycle time dropped. So AI is assumed to be the cause.

Not necessarily.

Cycle time can drop because:

task scope is smaller,
reviewers are more available,
CI is faster,
incidents are down,
requirements are clearer,
the release process changed,
the team avoids hard tasks,
measurement window bias.

8.1 Minimum Attribution Model

For every change that claims AI impact, record:

Field	Example
Task type	bugfix, refactor, test, migration, docs, API
Risk class	low, medium, high, critical
AI mode	chat, IDE pair, terminal agent, cloud agent, AI review
Human role	author, reviewer, approver
Baseline comparable	previous similar tasks
Outcome	merged, reverted, defect, rework
Evidence	tests, CI, review, logs, docs

8.2 Simple Difference-in-Differences

If you want to be more rigorous:

AI_effect =
  (AI_team_after - AI_team_before)
  -
  (control_team_after - control_team_before)

This is not perfect, but it is better than a raw before/after comparison.
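
A worked example of the formula above, as a minimal Python sketch. The cycle-time numbers are invented, and the estimate only means something if the control team would have followed a similar trend without AI, which is the usual assumption behind this method.

```python
def diff_in_diff(ai_before: float, ai_after: float,
                 control_before: float, control_after: float) -> float:
    """Change in the AI team minus the change in the control team."""
    return (ai_after - ai_before) - (control_after - control_before)


# Hypothetical median PR cycle times in hours.
raw_change = 28 - 40  # naive before/after view of the AI team
ai_effect = diff_in_diff(ai_before=40, ai_after=28,
                         control_before=38, control_after=33)
```

Here the raw before/after comparison would credit AI with a 12-hour improvement, but the control team also improved by 5 hours over the same window, so the estimated AI effect is closer to 7 hours.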

8.3 Matched Task Comparison

Compare similar tasks:

Dimension	Match by
Domain	same service/module
Type	bugfix vs bugfix, test vs test
Size	similar complexity estimate
Risk	low/medium/high
Baseline	comparable historical tasks
Team	same or similar skill/team

Do not compare an AI-assisted docs update with a non-AI database migration.
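
One way to enforce matching mechanically is to compare AI and non-AI tasks only within the same cell. The Python sketch below matches on task type and risk class only; the record fields and the sample tasks are assumptions for this example, and a real comparison would also match on domain, size, and team.

```python
from collections import defaultdict
from statistics import median


def matched_comparison(tasks: list[dict]) -> dict:
    """Median cycle-time difference (AI minus non-AI) per (task_type, risk_class) cell.

    Cells that lack either AI or non-AI tasks are skipped instead of being
    compared against unrelated work.
    """
    cells = defaultdict(lambda: {"ai": [], "non_ai": []})
    for t in tasks:
        group = "ai" if t["ai_assisted"] else "non_ai"
        cells[(t["task_type"], t["risk_class"])][group].append(t["cycle_hours"])
    return {
        key: median(groups["ai"]) - median(groups["non_ai"])
        for key, groups in cells.items()
        if groups["ai"] and groups["non_ai"]
    }


# Illustrative records only.
tasks = [
    {"task_type": "bugfix", "risk_class": "low", "ai_assisted": True, "cycle_hours": 10},
    {"task_type": "bugfix", "risk_class": "low", "ai_assisted": True, "cycle_hours": 14},
    {"task_type": "bugfix", "risk_class": "low", "ai_assisted": False, "cycle_hours": 16},
    {"task_type": "bugfix", "risk_class": "low", "ai_assisted": False, "cycle_hours": 20},
    {"task_type": "docs", "risk_class": "low", "ai_assisted": True, "cycle_hours": 2},
    {"task_type": "migration", "risk_class": "high", "ai_assisted": False, "cycle_hours": 80},
]
differences = matched_comparison(tasks)
```

The docs update and the database migration drop out of the result because neither has a comparable counterpart, so only the low-risk bugfix cell is reported.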

9. Measurement Architecture

To measure AI development, data has to come from several sources.

9.1 Issue Tracker Data

Collect:

task type,
priority,
component,
assignee team,
created date,
ready date,
start date,
done date,
risk label,
AI-assisted label.

In practice:

labels:
- ai-assisted
- ai-mode:pair
- ai-mode:cloud-agent
- risk:medium
- task-type:bugfix

9.2 Git Data

Collect:

commits,
branch lifetime,
file count,
diff size,
churn,
module touched,
protected path touched.

Be careful: diff size is not productivity. Diff size is a review/risk signal.

9.3 Pull Request Data

Collect:

opened time,
first review time,
approval time,
merge time,
review comments,
requested changes,
PR description quality,
linked issue,
checklist completion,
AI disclosure.

9.4 CI/CD Data

Collect:

build duration,
test duration,
failed job,
rerun count,
flaky indicator,
deployment result,
rollback/forward-fix,
environment.

9.5 Security Data

Collect:

SAST findings,
dependency findings,
secret scanning,
license findings,
container findings,
IaC findings,
severity,
fix time.

9.6 AI Tool Logs

Collect only what your privacy/security policy allows:

tool name,
AI mode,
task id,
token/cost,
model class,
approval event,
command class,
protected path attempt,
outcome.

Do not store raw prompts that contain secrets, customer data, or sensitive source code if policy forbids it.

10. AI Usage Taxonomy for Measurement

Metrics must know what AI is being used for.

AI usage category	Example	Default risk level
Search/explanation	understanding code, explaining errors	low-medium
Documentation	PR summary, runbook, ADR draft	low-medium
Test generation	unit/integration test	medium
Code implementation	feature/bugfix/refactor	medium-high
Security review	vulnerability review	medium-high
Database migration	schema/data migration	high
DevOps/IaC	workflow/deployment/infra	high
Production operation	diagnosis/action on prod	critical

If all AI usage is lumped together, the data becomes useless.

Example of a good insight:

AI pair programming reduced coding time for low-risk test generation by 35%, but cloud-agent implementation for medium-risk API changes increased review rounds by 20% because tasks were sliced too coarsely.

Example of a bad insight:

AI increased productivity by 40%.

11. Designing a Dashboard that Does Not Create Bad Behavior

A bad dashboard makes people optimize the number, not the system.

11.1 Rules

Use team metrics, not individual rankings.
Show quality alongside speed.
Show review burden alongside throughput.
Show confidence intervals/trends, not falsely precise absolute numbers.
Separate task types and risk classes.
Do not make AI acceptance rate a KPI.
Use the dashboard for improvement, not punishment.

11.2 Dashboard Sections

Section A: Delivery Flow

Metric	View
Lead time	trend by team/service
PR cycle time	p50/p75/p90
Review wait time	by reviewer pool
Time to green CI	by repo
Deployment frequency	by service
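
The p50/p75/p90 view of PR cycle time can be computed without any data platform. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions and an assumption of this example; the sample cycle times are invented.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer form of ceil(p * n / 100)
    return ordered[rank - 1]


# Hypothetical PR cycle times in hours; one PR was stuck for 100 hours.
cycle_hours = [2, 4, 6, 8, 10, 12, 14, 16, 18, 100]

summary = {f"p{p}": percentile(cycle_hours, p) for p in (50, 75, 90)}
mean_hours = sum(cycle_hours) / len(cycle_hours)
```

With this data the mean is 19 hours, higher than the p90 of 18, because a single stuck PR dominates the average. Percentiles show the typical case and the tail separately.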

Section B: Quality and Safety

Metric	View
Change failure rate	by service/risk
Escaped defects	by task type
Rework rate	by AI mode
Security findings	by severity
Test flakiness	by pipeline

Section C: AI Usage

Metric	View
AI-assisted task count	by category
AI mode distribution	chat/pair/agent/review
Cost per completed task	by mode
Approval events	by risk class
Policy exceptions	by team/service

Section D: Review Health

Metric	View
Review rounds	by PR size/risk
Actionable AI review finding ratio	by prompt/model
Human override rate	by reviewer
PR size distribution	by AI mode
Reviewer load	by week

Section E: Governance Evidence

Metric	View
AI disclosure completeness	by PR
Protected path approval	by PR
Security gate pass/fail	by repo
Documentation impact statement	by PR
Audit trail completeness	by use case

12. AI ROI Model

The ROI of AI development cannot be calculated from subscription cost alone.

12.1 Cost Components

Cost	Example
Tool subscription	seat/license
API/token	model usage
Infra	cloud sandbox, CI minutes
Human review	reviewer time
Rework	fixing AI output
Governance	audit, policy, approval
Security	scanning, incident response
Training	enablement time

12.2 Benefit Components

Benefit	Example
Reduced cycle time	faster bugfix/feature delivery
Reduced toil	docs/runbook/test automation
Improved quality	fewer repeated defects
Faster onboarding	AI-readable repo and knowledge pack
Better review	risk checklist and summary
Faster diagnosis	log/error hypothesis generation
Compliance evidence	generated traceability

12.3 Simple Formula

net_value =
  delivery_time_saved_value
+ toil_reduction_value
+ defect_cost_avoided
+ incident_cost_avoided
+ onboarding_time_saved
+ audit_effort_reduced
- ai_tool_cost
- added_review_cost
- rework_cost
- governance_cost
- incident_cost_caused

Do not force false precision. Use this model as a thinking tool, not as perfect accounting.
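
One way to use the formula above without pretending to be precise is to evaluate it for a pessimistic and an optimistic set of estimates and look at the range. The Python sketch below does that; every number in it is invented for illustration.

```python
BENEFITS = (
    "delivery_time_saved_value",
    "toil_reduction_value",
    "defect_cost_avoided",
    "incident_cost_avoided",
    "onboarding_time_saved",
    "audit_effort_reduced",
)
COSTS = (
    "ai_tool_cost",
    "added_review_cost",
    "rework_cost",
    "governance_cost",
    "incident_cost_caused",
)


def net_value(estimates: dict[str, float]) -> float:
    """Sum of benefit terms minus sum of cost terms, all in the same currency unit."""
    return sum(estimates[k] for k in BENEFITS) - sum(estimates[k] for k in COSTS)


# Hypothetical monthly estimates for one team.
pessimistic = {
    "delivery_time_saved_value": 4000, "toil_reduction_value": 1000,
    "defect_cost_avoided": 0, "incident_cost_avoided": 0,
    "onboarding_time_saved": 500, "audit_effort_reduced": 500,
    "ai_tool_cost": 2000, "added_review_cost": 1500, "rework_cost": 1500,
    "governance_cost": 500, "incident_cost_caused": 1000,
}
optimistic = {
    "delivery_time_saved_value": 8000, "toil_reduction_value": 2000,
    "defect_cost_avoided": 1500, "incident_cost_avoided": 2000,
    "onboarding_time_saved": 1000, "audit_effort_reduced": 500,
    "ai_tool_cost": 2000, "added_review_cost": 500, "rework_cost": 500,
    "governance_cost": 500, "incident_cost_caused": 0,
}
value_range = (net_value(pessimistic), net_value(optimistic))
```

A range that spans from slightly negative to clearly positive tells the team which estimates to tighten first, here review and rework cost, before making a scaling decision.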

13. Quality Gate for AI Metrics

Before an AI workflow is considered successful, it must at minimum pass these quality gates.

Gate	Required evidence
Behavior gate	Tests prove the acceptance criteria
Review gate	Human reviewer understands the diff
Security gate	No unresolved high/critical findings
Compatibility gate	Contract/backward compatibility is safe
Operational gate	Logging/metrics/rollback are sufficient
Documentation gate	Docs/ADR/runbook updated if affected
Governance gate	AI usage disclosure and approvals are complete

13.1 Go/No-Go Example

Condition	Decision
Cycle time down, defects up	no-go or restrict use case
Cycle time down, review burden up sharply	redesign task slicing
Review burden down, quality stable	scale carefully
Quality up, cycle time stable	still valuable
Cost up, incidents down	evaluate risk-adjusted value
AI reviewer noisy	tune/restrict reviewer

14. Measurement of AI Code Review

AI code review needs its own metrics.

14.1 Precision and Recall Thinking

Term	Meaning in AI review
True positive	AI finds a valid issue
False positive	AI reports an invalid issue
False negative	AI misses an issue that a human or production later finds
True negative	AI correctly reports no issue

In practice, recall is hard to measure because we do not know every issue that was missed. Precision, however, can be measured from human disposition.
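
Measuring precision from human disposition can be as simple as counting how reviewers resolved each AI finding. In the Python sketch below, the disposition labels `accepted`, `rejected`, and `pending` are hypothetical; use whatever labels your review tool records.

```python
from collections import Counter
from typing import Optional


def finding_precision(dispositions: list[str]) -> Optional[float]:
    """Precision = true positives / (true positives + false positives).

    "accepted" counts as a true positive and "rejected" as a false positive.
    Findings without a human decision yet ("pending") are left out.
    """
    counts = Counter(dispositions)
    decided = counts["accepted"] + counts["rejected"]
    if decided == 0:
        return None  # nothing has been dispositioned yet
    return counts["accepted"] / decided


# Illustrative sample: 25 AI findings from one review period.
dispositions = ["accepted"] * 6 + ["rejected"] * 14 + ["pending"] * 5
precision = finding_precision(dispositions)
```

A precision of 0.3 on this sample means reviewers dismissed 70% of the decided findings, which points at noise in the prompt or scope, not at reviewer carelessness.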

14.2 Review Finding Lifecycle

14.3 AI Reviewer Metrics

Metric	Use
Finding precision	Reduce noise
Accepted finding count	Value indicator
Severity distribution	Is AI only making style comments?
Duplicate finding rate	Prompt/model noise
Reviewer override rate	Trust calibration
Fix verification rate	The issue is actually resolved
Time added/removed	Is review getting faster?

14.4 AI Reviewer Policy

The AI reviewer may:

provide checklists,
find potential issues,
request evidence,
compare against conventions,
flag risk.

The AI reviewer may not:

be the final approver,
override the human reviewer,
approve high-risk changes without evidence,
change code automatically without author review,
post blocking comments for style noise.

15. AI-Assisted Testing Metrics

AI often looks very productive when writing tests. But those tests can be hollow.

15.1 Test Quality Dimensions

Dimension	Question
Relevance	Is the test tied to a requirement?
Oracle strength	Do the assertions prove the behavior?
Edge coverage	Are important edge cases covered?
Regression power	Will the test fail if the bug returns?
Maintainability	Are fixtures clear and not brittle?
Determinism	Is the test stable in CI?
Runtime	Is the feedback loop still fast?

15.2 AI Test Score

ai_test_score = average(
  relevance,
  oracle_strength,
  edge_case_coverage,
  regression_power,
  maintainability,
  determinism
)

Use this score when reviewing generated tests.
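
A minimal Python sketch of the score, assuming each dimension is rated from 1 to 5 (the scale is an assumption; the formula itself does not fix one). Reporting the weakest dimension next to the average is an addition of this example, because an average can hide a single weak dimension such as the oracle.

```python
TEST_DIMENSIONS = (
    "relevance",
    "oracle_strength",
    "edge_case_coverage",
    "regression_power",
    "maintainability",
    "determinism",
)


def ai_test_score(ratings: dict[str, int]) -> tuple[float, str]:
    """Return (average score, name of the weakest dimension)."""
    values = {d: ratings[d] for d in TEST_DIMENSIONS}  # KeyError if a rating is missing
    average = sum(values.values()) / len(values)
    weakest = min(values, key=values.get)
    return average, weakest


# Hypothetical generated test suite: tidy and stable, but assertions barely check behavior.
average, weakest = ai_test_score({
    "relevance": 5,
    "oracle_strength": 2,
    "edge_case_coverage": 3,
    "regression_power": 4,
    "maintainability": 5,
    "determinism": 5,
})
```

The average of 4.0 looks healthy, yet the weakest dimension is `oracle_strength`, so the review conversation should start with the assertions.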

15.3 Common AI Test Failure

Failure	Symptom	Metric signal
Weak oracle	Test passes even though the logic is wrong	high mutation survival
Implementation mirroring	Test copies production logic	escaped defects stay high
Happy path only	Edge bugs slip through	low scenario coverage
Over-mocking	Integration issues slip through	prod defects at boundaries
Fragile fixture	Test often fails because of setup	high flake rate
Slow test bloat	CI keeps getting slower	test runtime rises

16. AI Implementation Metrics by Risk Class

Do not use the same thresholds for every task.

Risk class	AI usage allowed	Measurement priority
Low	docs, tests, small refactor, UI copy	speed, reviewability
Medium	bugfix, endpoint, non-critical workflow	quality, rework, CI
High	auth, payment, database, compliance	security, review evidence, rollback
Critical	production operation, destructive migration	approval, audit, incident readiness

16.1 Example Thresholds

Metric	Low risk	Medium risk	High risk
Max PR size	400 LOC	250 LOC	150 LOC
Required human reviewers	1	1-2	2+
Required tests	unit	unit + integration	unit + integration + contract/regression
AI disclosure	yes	yes	yes + approval note
Security scan	standard	standard	blocking high/critical
Rollback plan	optional	required if behavior changes	required

The numbers above are not universal. Use them as a starting point.
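
Thresholds like these are easier to keep honest when they are checked automatically. The Python sketch below encodes only the PR size and reviewer rows of the example table; the function name and the way violations are reported are assumptions of this example, and the "1-2" reviewer range for medium risk is read as a minimum of 1.

```python
# Starting-point thresholds taken from the example table; tune them per team.
THRESHOLDS = {
    "low": {"max_pr_loc": 400, "min_reviewers": 1},
    "medium": {"max_pr_loc": 250, "min_reviewers": 1},
    "high": {"max_pr_loc": 150, "min_reviewers": 2},
}


def check_pr(risk_class: str, loc_changed: int, reviewers: int) -> list[str]:
    """Return a list of threshold violations; an empty list means the PR passes."""
    if risk_class not in THRESHOLDS:
        raise ValueError(f"no thresholds defined for risk class {risk_class!r}")
    limits = THRESHOLDS[risk_class]
    violations = []
    if loc_changed > limits["max_pr_loc"]:
        violations.append(
            f"PR size {loc_changed} LOC exceeds {limits['max_pr_loc']} for {risk_class} risk"
        )
    if reviewers < limits["min_reviewers"]:
        violations.append(
            f"{reviewers} reviewer(s), need at least {limits['min_reviewers']} for {risk_class} risk"
        )
    return violations


# The same hypothetical PR evaluated under two risk classes.
as_low = check_pr("low", loc_changed=320, reviewers=1)
as_high = check_pr("high", loc_changed=320, reviewers=1)
```

The same 320-line PR with one reviewer passes as low risk and fails twice as high risk, which is the point of classifying risk before measuring.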

17. Data Quality Problems

Engineering metrics break easily.

17.1 Common Problems

Problem	Impact	Mitigation
Missing labels	AI impact cannot be read	PR template + automation
Inconsistent task type	Wrong comparisons	controlled taxonomy
PR unrelated changes	Cycle/quality bias	PR-per-intent
Squashed history loses signal	Attribution is hard	keep PR metadata
Manual AI usage undisclosed	Under-reporting	team agreement, not punitive
Tool logs incomplete	Cost/approval unknown	standardized logging
Individual ranking	Gaming metrics	team-level dashboard

17.2 AI Disclosure Template

Add this to the PR:

## AI Usage

- AI used: yes/no
- Mode: chat / IDE pair / terminal agent / cloud agent / AI review
- Scope: explanation / test generation / implementation / refactor / docs / CI repair
- Human validation performed:
  - [ ] I reviewed the diff manually
  - [ ] I understand the behavior change
  - [ ] I ran relevant tests
  - [ ] I checked security-sensitive paths
- Risk class: low / medium / high / critical
- Notes:

Disclosure is not meant to shame anyone. Disclosure exists for observability and governance.
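
Disclosure completeness can be checked automatically from the PR body. The Python sketch below assumes the PR description uses the template above verbatim; the completeness rule (a single yes/no answer, and for AI-assisted PRs all four validation boxes ticked plus one risk class chosen) is an assumption of this example, not a fixed policy.

```python
import re

REQUIRED_CHECKS = 4  # number of validation checkboxes in the template


def disclosure_complete(pr_body: str) -> bool:
    """Return True when the AI Usage section has been filled in, not left as a placeholder."""
    if "## AI Usage" not in pr_body:
        return False
    section = pr_body.split("## AI Usage", 1)[1]
    used = re.search(r"^- AI used:\s*(yes|no)\s*$", section, re.MULTILINE | re.IGNORECASE)
    if used is None:
        return False  # missing, or still the untouched "yes/no" placeholder
    if used.group(1).lower() == "no":
        return True
    ticked = len(re.findall(r"^\s*- \[[xX]\]", section, re.MULTILINE))
    risk = re.search(
        r"^- Risk class:\s*(low|medium|high|critical)\s*$",
        section,
        re.MULTILINE | re.IGNORECASE,
    )
    return ticked >= REQUIRED_CHECKS and risk is not None


filled = """## AI Usage

- AI used: yes
- Mode: IDE pair
- Scope: test generation
- Human validation performed:
  - [x] I reviewed the diff manually
  - [x] I understand the behavior change
  - [x] I ran relevant tests
  - [x] I checked security-sensitive paths
- Risk class: medium
- Notes:
"""
untouched = filled.replace("AI used: yes", "AI used: yes/no")
is_complete = disclosure_complete(filled)
```

A check like this can feed a per-PR disclosure flag without storing any prompt content.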

18. Experiment Design for AI Rollout

Do not roll AI out to every workflow at once and then struggle to read the impact.

18.1 Start with Use Case

Examples of good use cases for an experiment:

generate unit tests for an existing service,
fix flaky tests,
summarize PR + docs impact,
debug CI failure,
implement low-risk endpoint,
refactor small module with characterization tests.

Examples of bad use cases for a first experiment:

production database migration,
auth redesign,
payment flow change,
multi-service architecture rewrite,
compliance-critical workflow without test baseline.

18.2 Experiment Template

# AI Workflow Experiment

## Hypothesis
Using AI for <workflow> will improve <metric> without degrading <quality metric>.

## Scope
- Repo/service:
- Task type:
- Risk class:
- AI mode:
- Human gate:

## Baseline
- Historical period:
- Comparable tasks:
- Baseline metrics:

## Success Criteria
- Flow:
- Quality:
- Review:
- Cost:
- Governance:

## Guardrails
- Stop condition:
- Protected paths:
- Required tests:
- Required approvals:

## Data Collection
- Issue labels:
- PR template:
- CI logs:
- Security scans:
- AI usage logs:

## Review Cadence
- Weekly review:
- Final decision:

18.3 Stop Conditions

Stop or restrict the experiment if:

escaped defects rise significantly,
reviewer burden rises without added value,
AI-generated PRs often need a rewrite,
security findings rise,
policy exceptions keep recurring,
engineers cannot explain the patch,
cost is out of proportion to the outcome,
prompt/tool behavior cannot be audited.

19. AI Productivity Review Meeting

Hold a periodic review, for example every two weeks or monthly.

Agenda:

Which AI workflows are the most valuable?
Which workflows are the noisiest?
Is lead time dropping?
Is review burden going up or down?
Are defects/rework changing?
Is the cost reasonable?
Were there any security/governance incidents?
Which prompts/templates need to be standardized?
Which repos need to be made more AI-readable?
Which use cases need to be restricted?

19.1 Meeting Output

The output is not a slide deck. The output must become backlog items:

Finding	Action
AI PRs too large	update task slicing policy
Generated tests weak	add mutation review checklist
Cloud agent often fails at setup	improve repo bootstrap script
AI reviewer noisy	tune prompt and restrict severity
High cost on debugging	improve log context pack
Docs summary useful	standardize PR docs impact template

20. Scorecard Templates

20.1 Team AI Delivery Scorecard

# Team AI Delivery Scorecard

Period: <YYYY-MM>
Team: <team>
Services: <services>

## AI Usage Mix
- AI-assisted tasks:
- AI modes:
- Top workflows:

## Flow
- Lead time p50/p75/p90:
- PR cycle time p50/p75/p90:
- Review wait time:
- Time to green CI:

## Quality
- Escaped defects:
- Rework rate:
- Reverts:
- Security findings:
- Flaky tests:

## Review Health
- Average PR size:
- Review rounds:
- AI reviewer precision:
- Reviewer burden trend:

## Cost
- Tool/API spend:
- Cost per accepted PR:
- Review-adjusted cost:

## Governance
- AI disclosure completeness:
- Approval exceptions:
- Protected path violations:

## Decisions
- Scale:
- Restrict:
- Improve:

20.2 PR-Level AI Evidence Scorecard

# AI Evidence Scorecard

PR: <link>
Task: <link>
Risk: low/medium/high/critical
AI mode: <mode>

## Evidence
- Acceptance criteria mapped: yes/no
- Tests added/updated: yes/no
- CI passed: yes/no
- Security scan passed: yes/no
- Docs updated: yes/no
- Rollback plan included: yes/no

## Review
- Human reviewer understands diff: yes/no
- AI review findings dispositioned: yes/no
- Rework needed: none/minor/major

## Outcome
- Merged:
- Reverted:
- Incident:
- Follow-up required:

21. Interpreting Common Metric Patterns

21.1 Pattern: Coding Faster, Review Slower

Signal:

branch lifetime goes down,
PR review cycle goes up,
review rounds go up,
reviewer comment density goes up.

Diagnosis:

AI produces diffs that are too large,
the author does not understand the patch,
the PR summary is generic,
test evidence is weak.

Action:

enforce PR-per-intent,
add AI usage disclosure,
require author explanation,
improve task slicing.

21.2 Pattern: More Tests, Same Defects

Signal:

test count goes up,
coverage goes up,
escaped defects stay flat or go up,
mutation score is low.

Diagnosis:

weak oracle,
happy path only,
over-mocking,
tests mirror implementation.

Action:

add behavior matrix,
require assertion review,
add mutation testing for critical modules,
review fixture quality.

21.3 Pattern: AI Reviewer Finds Many Issues, Few Accepted

Signal:

AI comment volume is high,
accepted findings are low,
reviewers dismiss findings often,
review latency goes up.

Diagnosis:

the prompt is too generic,
the AI does not have the repo conventions,
severity is not calibrated,
style noise.

Action:

restrict AI reviewer scope,
add severity policy,
provide repo-specific checklist,
measure precision.

21.4 Pattern: Delivery Faster, Incidents Higher

Signal:

lead time goes down,
deployment frequency goes up,
change failure rate goes up,
incident count goes up.

Diagnosis:

quality gates are too loose,
risky AI tasks are not distinguished,
test coverage is insufficient,
rollout/rollback is weak.

Action:

add risk class,
require extra gate for high-risk task,
strengthen observability,
slow down unsafe workflows.

22. Practical SQL/Data Model Sketch

For teams that want to build a simple warehouse:

CREATE TABLE ai_assisted_prs (
  pr_id TEXT PRIMARY KEY,
  repo TEXT NOT NULL,
  service TEXT,
  team TEXT,
  task_type TEXT,
  risk_class TEXT,
  ai_mode TEXT,
  opened_at TIMESTAMP,
  first_review_at TIMESTAMP,
  approved_at TIMESTAMP,
  merged_at TIMESTAMP,
  lines_added INT,
  lines_deleted INT,
  files_changed INT,
  ci_failures INT,
  review_rounds INT,
  ai_review_findings INT,
  accepted_ai_findings INT,
  human_requested_changes INT,
  security_findings_high INT,
  escaped_defect BOOLEAN DEFAULT FALSE,
  reverted BOOLEAN DEFAULT FALSE,
  ai_cost_usd NUMERIC,
  disclosure_complete BOOLEAN
);

Example derived metrics:

-- Written in PostgreSQL-style SQL: per-team, per-AI-mode summary of merged PRs.
SELECT
  team,
  ai_mode,
  -- Median hours from PR opened to PR merged.
  percentile_cont(0.5) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM (merged_at - opened_at)) / 3600
  ) AS pr_cycle_time_p50_hours,
  AVG(review_rounds) AS avg_review_rounds,
  -- Share of PRs with the flag set; NULL flags count as FALSE.
  AVG(CASE WHEN escaped_defect THEN 1 ELSE 0 END) AS escaped_defect_rate,
  AVG(CASE WHEN disclosure_complete THEN 1 ELSE 0 END) AS disclosure_rate
FROM ai_assisted_prs
WHERE merged_at IS NOT NULL
GROUP BY team, ai_mode;

23. Minimum Viable Measurement Setup

If the team does not yet have a data platform, start simple.

Week 1

Add an AI usage section to the PR template.
Add labels for AI mode and risk class.
Record PR cycle time and review rounds.
Record CI pass/fail.
Record rework as major or minor.

Week 2

Build a simple dashboard from GitHub/GitLab data.
Manually review 10 AI-assisted PRs.
Count AI review false positives.
Compare PR size for AI vs non-AI PRs.
Identify one workflow that needs improvement.

Week 3

Add cost tracking.
Add security findings tracking.
Add a task type breakdown.
Create an improvement backlog.

Week 4

Decide which workflows to scale.
Decide which workflows to restrict.
Update context/prompt/template.
Revise the team AI working agreement.
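
The Week 1 and Week 2 numbers can be computed without a warehouse. Below is a minimal sketch, assuming PR records have been exported into plain dictionaries whose keys mirror the `ai_assisted_prs` columns. The `ai_mode` values `"none"` and `"assisted"` and the sample rows are illustrative, not real data.

```python
from datetime import datetime
from statistics import median


def cycle_time_hours(opened_at, merged_at):
    """Hours between PR opened and PR merged."""
    return (merged_at - opened_at).total_seconds() / 3600


def summarize(prs):
    """Group merged PRs by ai_mode; report medians and mean review rounds."""
    groups = {}
    for pr in prs:
        if pr["merged_at"] is None:  # skip PRs that never merged
            continue
        groups.setdefault(pr["ai_mode"], []).append(pr)
    summary = {}
    for mode, rows in groups.items():
        summary[mode] = {
            "prs": len(rows),
            "cycle_time_p50_hours": median(
                cycle_time_hours(r["opened_at"], r["merged_at"]) for r in rows
            ),
            "size_p50_lines": median(
                r["lines_added"] + r["lines_deleted"] for r in rows
            ),
            "avg_review_rounds": sum(r["review_rounds"] for r in rows) / len(rows),
        }
    return summary


def pr(mode, opened, merged, added, deleted, rounds):
    """Build one sample record (hypothetical export shape)."""
    return {"ai_mode": mode, "opened_at": opened, "merged_at": merged,
            "lines_added": added, "lines_deleted": deleted,
            "review_rounds": rounds}


prs = [
    pr("none", datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 19), 80, 20, 1),
    pr("none", datetime(2026, 1, 6, 9), datetime(2026, 1, 7, 9), 150, 50, 2),
    pr("assisted", datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 13), 400, 100, 3),
    pr("assisted", datetime(2026, 1, 6, 9), datetime(2026, 1, 6, 17), 250, 50, 2),
    pr("assisted", datetime(2026, 1, 7, 9), None, 900, 300, 4),
]
summary = summarize(prs)
```

In this invented sample the assisted PRs merge faster but are larger and need more review rounds, which is exactly the kind of trade-off the manual review in Week 2 should examine. Note that the unmerged PR is silently excluded here, so it should be counted separately rather than forgotten.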

24. Failure Modes in Measurement

24.1 Goodhart's Law

When a metric becomes a target, it can stop measuring what it was meant to measure.

Examples:

If the target is PR count, people open small PRs with no value.
If the target is AI usage, people use AI when it is not needed.
If the target is review speed, reviewers approve carelessly.
If the target is test count, people write weak tests.

Mitigation:

use a balanced metric set,
review qualitative examples,
avoid individual ranking,
use metrics for diagnosis, not punishment.

24.2 Survivorship Bias

Teams tend to look only at PRs that merged successfully. AI tasks that failed matter just as much.

Record:

abandoned AI branches,
failed agent tasks,
rewritten AI output,
discarded generated tests,
prompts that produced unsafe output.
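
One way to keep failed work visible is to log an outcome for every AI task that was started, then report rates over all of them. The sketch below assumes a hypothetical set of outcome labels loosely based on the list above; the labels and the sample log are illustrative only.

```python
from collections import Counter

# Hypothetical outcome labels for every AI task that was started,
# not only the ones that ended in a merged PR.
OUTCOMES = ("merged", "rewritten", "abandoned", "failed", "unsafe")


def outcome_rates(outcomes):
    """Return the share of started tasks that ended in each outcome."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts[name] / total for name in OUTCOMES}


# Invented log of ten started tasks.
log = ["merged"] * 6 + ["rewritten"] * 2 + ["abandoned", "failed"]
rates = outcome_rates(log)
```

Here only 60% of started tasks merged, a figure that a dashboard built solely from merged PRs would never show.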

24.3 Automation Bias

When an AI dashboard looks tidy, people trust it without auditing it.

Mitigation:

manual sampling,
audit raw PRs,
compare against incidents,
collect qualitative reviewer feedback.

24.4 Local Optimization

AI can speed up local coding while degrading overall system flow.

Mitigation:

measure end-to-end,
include review/CI/release,
use DORA + quality + cost.
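
Measuring end-to-end means splitting lead time into stages so that a faster coding stage cannot hide a slower review or release stage. A minimal sketch follows. The timestamps `opened_at`, `first_review_at`, `approved_at`, and `merged_at` mirror the `ai_assisted_prs` columns, while `first_commit_at`, `deployed_at`, and the stage names are assumptions added for illustration.

```python
from datetime import datetime

# Stage boundaries in order: (stage name, start event, end event).
STAGES = [
    ("coding", "first_commit_at", "opened_at"),
    ("review_wait", "opened_at", "first_review_at"),
    ("review", "first_review_at", "approved_at"),
    ("merge", "approved_at", "merged_at"),
    ("release", "merged_at", "deployed_at"),
]


def stage_hours(events):
    """Return hours spent in each stage for one change."""
    return {
        name: (events[end] - events[start]).total_seconds() / 3600
        for name, start, end in STAGES
    }


def bottleneck(events):
    """Return the name of the stage that took the longest."""
    hours = stage_hours(events)
    return max(hours, key=hours.get)


# Invented timeline for a single change.
events = {
    "first_commit_at": datetime(2026, 1, 5, 9),
    "opened_at": datetime(2026, 1, 5, 11),
    "first_review_at": datetime(2026, 1, 6, 11),
    "approved_at": datetime(2026, 1, 6, 15),
    "merged_at": datetime(2026, 1, 6, 16),
    "deployed_at": datetime(2026, 1, 7, 10),
}
hours = stage_hours(events)
slowest = bottleneck(events)
```

In this example coding took 2 hours out of 49, and the change spent 24 hours waiting for a first review, so speeding up coding further would barely move the end-to-end number.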

25. What Top 1% Engineers Watch

Strong engineers do not only ask “how fast?”. They ask:

Is AI fixing a real bottleneck?
Are PRs getting smaller or larger?
Do reviewers trust the changes more, or are they more fatigued?
Has the defect escape rate changed?
Do AI-generated tests actually catch bugs?
Is the security posture improving or weakening?
Do high-risk tasks have approval and evidence?
Is the cost still proportionate to the outcome?
Do engineers still understand the code they merge?
Are context, prompts, and templates improving over time?

26. 20-Hour Deliberate Practice Plan

Hour 1-2: Baseline

Take the last 20 PRs from one repo.

Record:

PR cycle time,
review rounds,
PR size,
CI failures,
rework,
defect follow-up.

Hour 3-4: AI Disclosure Template

Add an AI usage section to the PR template.

Simulate it on 5 old PRs.

Hour 5-6: Task Taxonomy

Classify tasks:

docs,
test,
bugfix,
feature,
refactor,
migration,
DevOps,
security.

Hour 7-8: Risk Class

Add a risk class:

low,
medium,
high,
critical.

Hour 9-10: Reviewability Score

Score 10 PRs on:

intent clarity,
diff size,
test evidence,
risk statement,
rollback clarity.
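
A reviewability score only needs to be consistent, not sophisticated. The sketch below assumes a 0 to 2 rating per criterion (0 = missing, 1 = partial, 2 = clear) and equal weights. Both the scale and the weighting are illustrative choices, not a standard.

```python
# The five criteria from the list above, as dictionary keys.
CRITERIA = (
    "intent_clarity",
    "diff_size",
    "test_evidence",
    "risk_statement",
    "rollback_clarity",
)


def reviewability_score(ratings):
    """Map per-criterion ratings (0, 1, or 2) to a score from 0 to 100."""
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    for c in CRITERIA:
        if ratings[c] not in (0, 1, 2):
            raise ValueError(f"{c} must be 0, 1, or 2")
    return 100 * sum(ratings[c] for c in CRITERIA) / (2 * len(CRITERIA))


# Invented rating for one PR: clear intent and tests, no risk statement.
example = {
    "intent_clarity": 2,
    "diff_size": 1,
    "test_evidence": 2,
    "risk_statement": 0,
    "rollback_clarity": 1,
}
score = reviewability_score(example)
```

Scoring the same 10 PRs with two reviewers and comparing results is a quick check on whether the criteria are understood the same way.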

Hour 11-12: AI Reviewer Evaluation

Run AI review on 5 PRs.

Record:

valid findings,
false positives,
missed issues,
severity quality.

Hour 13-14: Test Quality Evaluation

Take 5 AI-generated tests.

Score:

relevance,
oracle strength,
edge coverage,
determinism,
maintainability.

Hour 15-16: Dashboard Sketch

Sketch a simple dashboard covering:

flow,
quality,
review,
cost,
governance.

Hour 17-18: Experiment Design

Write up one AI workflow experiment:

hypothesis,
scope,
baseline,
success criteria,
stop condition.
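
Writing the experiment down as data makes the stop condition hard to ignore later. This is a sketch only: the choice of cycle time as the success metric, escaped defect rate as the stop metric, and all field names and thresholds are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """One AI workflow experiment (hypothetical structure)."""
    hypothesis: str
    scope: str
    baseline_cycle_time_hours: float
    target_cycle_time_reduction: float  # success criterion, e.g. 0.20 = 20% faster
    max_escaped_defect_rate: float      # stop condition threshold

    def evaluate(self, cycle_time_hours, escaped_defect_rate):
        """Return "stop", "success", or "continue" for observed values."""
        # The stop condition is checked first so speed never excuses defects.
        if escaped_defect_rate > self.max_escaped_defect_rate:
            return "stop"
        target = self.baseline_cycle_time_hours * (
            1 - self.target_cycle_time_reduction
        )
        if cycle_time_hours <= target:
            return "success"
        return "continue"


exp = Experiment(
    hypothesis="AI-drafted tests reduce PR cycle time for bugfix tasks",
    scope="bugfix PRs in one repo, low and medium risk only",
    baseline_cycle_time_hours=20.0,
    target_cycle_time_reduction=0.20,
    max_escaped_defect_rate=0.05,
)
```

With these invented thresholds, `exp.evaluate(12.0, 0.08)` returns `"stop"` even though cycle time improved, because the escaped defect rate crossed the agreed limit.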

Hour 19-20: Improvement Backlog

Create a 10-item backlog for improving the AI development system.

Prioritize by:

impact,
risk reduction,
effort,
evidence quality.

27. Checklist: AI Development Metrics Readiness

Use this checklist before claiming that AI adoption has succeeded.

## Metrics Readiness Checklist

### Baseline
- [ ] Pre-AI baseline is available
- [ ] Task type taxonomy is available
- [ ] Risk class taxonomy is available
- [ ] Comparable task set is available

### Flow
- [ ] PR cycle time is measured
- [ ] Review wait time is measured
- [ ] Time to green CI is measured
- [ ] Release latency is measured

### Quality
- [ ] Rework rate is measured
- [ ] Escaped defects are measured
- [ ] Flaky test rate is measured
- [ ] Security findings are measured

### Review
- [ ] PR size is measured
- [ ] Review rounds are measured
- [ ] AI reviewer finding disposition is measured
- [ ] Reviewer feedback is collected

### AI Usage
- [ ] AI mode is recorded
- [ ] AI usage is disclosed in the PR
- [ ] Cost is recorded
- [ ] Approval events are recorded

### Governance
- [ ] Policy exceptions are recorded
- [ ] Protected path modifications are recorded
- [ ] Audit evidence is stored
- [ ] High-risk workflows have a gate

### Improvement
- [ ] A metrics review cadence exists
- [ ] An improvement backlog exists
- [ ] Prompt/context updates are tracked
- [ ] Unsafe workflows can be stopped

28. Summary

AI development measurement must answer for outcomes, not activity.

Core principles:

Measure productivity as a system property.
Use DORA as the delivery baseline, not as the only measure.
Combine flow, quality, review, risk, cost, learning, and governance.
Do not measure individual developers with AI metrics.
Do not use LOC, prompt count, or acceptance rate as the primary KPI.
Separate task type and risk class.
Measure review burden, because AI often moves the bottleneck to reviewers.
Measure generated test quality, not just test count.
Treat AI usage disclosure as observability, not surveillance.
Use metrics as a feedback loop to improve context, prompts, repo readiness, test strategy, and governance.

Good AI is not the AI that produces the most code.

Good AI is the AI that helps the team ship valuable changes with stronger evidence, lower risk, and a faster learning loop.

29. Practical References

DORA Metrics — https://dora.dev/guides/dora-metrics/
Accelerate / DORA research as the basis for software delivery performance measurement
SPACE framework for discussing developer productivity
OWASP Top 10 for Large Language Model Applications — https://owasp.org/www-project-top-10-for-large-language-model-applications/
NIST AI Risk Management Framework — https://www.nist.gov/itl/ai-risk-management-framework
NIST AI 600-1 Generative AI Profile — https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
