
Learn Agentic AI Engineering Part 023: Code Review and PR Review Agents


title: Learn Advanced Agentic AI Engineering & Autonomous Software Engineering - Part 023
description: "Code review and PR review agents for autonomous software engineering: review scope, risk scoring, evidence-based findings, diff analysis, architecture/security/maintainability review, false-positive control, and review governance."
series: learn-agentic-ai-engineering
seriesTitle: Learn Advanced Agentic AI Engineering & Autonomous Software Engineering
order: 23
partTitle: Code Review and PR Review Agents
tags:
  - agentic-ai
  - autonomous-software-engineering
  - code-review
  - pr-review-agent
  - engineering-governance
  - series
date: 2026-06-29

Part 023 — Code Review and PR Review Agents

Goal of this part: be able to design a code review / PR review agent that gives valuable, evidence-based, low-false-positive feedback and is safe to use in a production engineering workflow. The focus is not "AI commenting on PRs" but a risk-aware review system.

A code review agent is one of the forms of agentic AI whose benefits are easiest to see, but it is also one of the easiest to turn into a noise generator.

A bad review agent will:

comment on minor style issues that do not matter,
repeat lint/static analysis results,
miss serious bugs,
give suggestions that ignore the architecture,
cause human reviewers to lose trust,
slow down PRs without improving quality.

A good review agent behaves differently:

A PR review agent should not optimize for number of comments.
It should optimize for risk reduction per reviewer-minute.

This means a review agent is not a total replacement for human reviewers. It is a review amplifier: it speeds up triage, widens coverage, prepares evidence, and directs human attention to the riskiest areas.

1. Kaufman Framing

1.1 Target performance

After this part, we want to be able to:

define the review agent's scope explicitly,
distinguish style, correctness, security, architecture, maintainability, test, and operational risk review,
build risk scoring for PRs,
avoid generic and unverifiable comments,
combine diff, repository context, tests, ownership, and policy,
design review findings that are actionable,
measure false positives, false negatives, severity calibration, and developer acceptance,
place the review agent in a governance workflow without creating a bottleneck.

Practical target:

Given a complex PR, we can build an agent that reads the diff, understands the repository context, selects the relevant reviewer concerns, produces evidence-based findings, and distinguishes blocking issues from suggestions.

1.2 Deconstruct the skill

A code review agent consists of several subskills:

PR intake: understanding the PR's goal, scope, linked issue, files changed, and risk.
Diff comprehension: understanding the change in behavior, not just the lines that changed.
Context expansion: pulling in surrounding files, symbol references, tests, docs, config, and runtime contracts.
Risk classification: assessing impact, reversibility, security, data, concurrency, migration, compatibility, and production blast radius.
Review rubric selection: choosing a checklist that fits the type of change.
Finding generation: writing comments that are specific, actionable, and evidence-based.
False-positive control: rejecting a comment when the evidence is weak.
Human handoff: assembling the summary, review packet, and escalation.
Governance logging: recording decisions, skipped checks, uncertainty, and reviewer overrides.
Learning loop: improving the rubric from accepted/rejected feedback.

1.3 Learn enough to self-correct

We do not need to start with an agent that can review everything.

Start with the following self-correction abilities:

knowing when a comment is too generic,
knowing when an issue should be handed to a static analyzer,
knowing when a PR needs additional context,
knowing when the agent should stay silent,
knowing when a recommendation needs human approval,
knowing when a severity is too high or too low.

The core skill of a review agent is judgement calibration.

2. Mental Model: PR Review as Risk Reduction

Human review is not only about finding bugs. Review is an organizational mechanism for reducing the risk of change.

PR review = change understanding + risk detection + quality enforcement + shared ownership

For an agent, the model becomes:

PR review agent = risk classifier + context retriever + rubric executor + evidence generator + human attention router

2.1 A review agent is not a lint bot

A lint bot checks explicit rules.

A review agent checks things that are more contextual:

whether the change matches the intent of the issue,
whether a domain invariant is broken,
whether an error path has disappeared,
whether backward compatibility is preserved,
whether the tests prove the correct behavior,
whether observability is sufficient for debugging,
whether the change introduces new coupling,
whether the migration plan is safe,
whether a secret/data boundary is violated.

If a check can be made 100% deterministic, do not turn it into an LLM review comment. Make it a CI/static analysis check.

2.2 The agent must be willing not to comment

A wrong comment costs more than no comment.

The reasons:

a human reviewer has to read it,
the author has to assess it,
the discussion can sprawl,
trust in the agent drops,
signal-to-noise gets worse.

So a review agent needs these principles:

No evidence, no comment.
Low confidence, summarize uncertainty instead of blocking.
Deterministic issue, prefer machine check.
Architectural concern, escalate with context.

3. Review Scope Taxonomy

Not all reviews are the same. The agent needs to know which kind of concern it is examining.

Scope	Core question	Suitable for an agent?	Notes
Formatting	Does the style conform?	Low	Use a formatter/linter.
Local correctness	Is the local logic correct?	Medium	Needs tests and context.
Behavioral correctness	Is the requirement met?	High	Needs issue intent and an oracle.
Error handling	Is the failure path safe?	High	A good fit when the rubric is clear.
Security	Is there injection/leak/privilege risk?	Medium-high	Do not rely on the LLM alone.
Compatibility	Did the API/schema/event contract change?	High	Needs schema/diff/context.
Architecture	Is a boundary broken or new coupling introduced?	High but difficult	Needs a repo map and design docs.
Performance	Did complexity/resource use get worse?	Medium	Needs benchmarks/profiling for strong claims.
Observability	Can the change be operated?	High	A good fit for a checklist.
Test quality	Do the tests prove the behavior?	High	The agent can help a lot here.
Migration safety	Is the rollout safe?	High	Needs deployment/migration context.

3.1 Rubrics must differ per PR type

A bug-fix PR is different from a feature, refactor, dependency upgrade, schema migration, or security patch.

Example taxonomy:

pr_type:
  bug_fix:
    focus:
      - reproduction evidence
      - regression test
      - minimal patch
      - side effects
  feature:
    focus:
      - requirement coverage
      - API contract
      - error path
      - observability
  refactor:
    focus:
      - semantic preservation
      - test coverage
      - public behavior unchanged
      - diff minimization
  migration:
    focus:
      - compatibility
      - rollback
      - data safety
      - staged rollout
  security_fix:
    focus:
      - exploit scenario
      - regression test
      - data boundary
      - bypass analysis

A review agent that uses a single universal checklist will be too shallow for any of these PR types.

4. PR Review Agent Architecture

A production review agent architecture should separate intake, context, analysis, findings, and publishing.

4.1 PR intake

PR intake collects:

title,
description,
linked issue,
labels,
author,
changed files,
diff size,
test changes,
generated files,
dependencies touched,
config touched,
migration files,
public API changes,
security-sensitive paths,
historical failure signals.

The intake output is not free text. It must be structured.

{
  "pr_id": 812,
  "pr_type_candidates": ["bug_fix", "schema_change"],
  "changed_file_count": 14,
  "risk_indicators": [
    "database_migration",
    "authorization_logic_changed",
    "public_api_response_changed"
  ],
  "requires_human_review": true,
  "initial_risk_tier": "high"
}

4.2 Diff analyzer

The diff analyzer must produce a semantic view:

functions/classes changed,
public signatures changed,
behavior branches added/removed,
exception handling changed,
validation changed,
authorization checks changed,
persistence query changed,
event payload changed,
configuration changed,
tests added/removed.

The diff analyzer must not simply copy the git diff into the prompt. For large PRs that wastes tokens and often causes the agent to lose track of the structure.
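As a minimal sketch of what a "semantic view" could look like, the snippet below compares two versions of a Python source file and reports added, removed, and signature-changed functions. It is an illustration only: the function names `function_signatures` and `semantic_diff` are hypothetical, it handles Python sources via the standard `ast` module (Python 3.9+ for `ast.unparse`), and a real analyzer for Java or TypeScript would need a language-specific parser.

```python
import ast


def function_signatures(source: str) -> dict[str, str]:
    """Map each function name to a printable signature string.

    Simplification: functions with the same name in different classes
    overwrite each other; a real analyzer would use qualified names.
    """
    signatures = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signatures[node.name] = f"{node.name}({ast.unparse(node.args)})"
    return signatures


def semantic_diff(old_source: str, new_source: str) -> dict[str, list[str]]:
    """Summarize function-level changes between two versions of a file."""
    old = function_signatures(old_source)
    new = function_signatures(new_source)
    return {
        "functions_added": sorted(new.keys() - old.keys()),
        "functions_removed": sorted(old.keys() - new.keys()),
        "signatures_changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }
```

A summary like this is far smaller than the raw diff and gives later stages stable names (symbols) to look up callers and tests for.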

4.3 Context planner

The context planner decides what needs to be read.

Examples:

Diff signal	Additional context
Public method signature changed	Callers, tests, API docs, compatibility notes
Authorization condition changed	Policy docs, role matrix, security tests, endpoints
SQL query changed	Schema, indexes, transaction boundary, data volume assumptions
Event schema changed	Consumers, contract tests, schema registry, versioning policy
Retry logic changed	Idempotency contract, timeout config, incident history
Error handling changed	Error taxonomy, client behavior, observability docs

4.4 Rubric selector

The rubric selector chooses a checklist based on PR type and risk.

review_rubric:
  authorization_change:
    blocking:
      - missing deny-by-default behavior
      - privilege escalation path
      - tenant boundary bypass
      - missing negative tests
    non_blocking:
      - unclear role naming
      - missing policy comment
  schema_change:
    blocking:
      - destructive migration without rollback plan
      - incompatible response shape
      - missing consumer impact assessment
    non_blocking:
      - migration name unclear
      - docs not updated
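One way the selection step might be wired up is sketched below. The rubric entries are copied from the YAML above; the `INDICATOR_TO_RUBRIC` mapping and the `select_rubrics` function are assumptions made for illustration, since the lesson does not specify how risk indicators map to rubric keys.

```python
# Rubric entries copied from the review_rubric YAML above.
RUBRICS = {
    "authorization_change": {
        "blocking": [
            "missing deny-by-default behavior",
            "privilege escalation path",
            "tenant boundary bypass",
            "missing negative tests",
        ],
        "non_blocking": ["unclear role naming", "missing policy comment"],
    },
    "schema_change": {
        "blocking": [
            "destructive migration without rollback plan",
            "incompatible response shape",
            "missing consumer impact assessment",
        ],
        "non_blocking": ["migration name unclear", "docs not updated"],
    },
}

# Hypothetical mapping from intake risk indicators to rubric keys.
INDICATOR_TO_RUBRIC = {
    "authorization_logic_changed": "authorization_change",
    "database_migration": "schema_change",
    "public_api_response_changed": "schema_change",
}


def select_rubrics(risk_indicators: list[str]) -> dict[str, dict]:
    """Return the rubrics triggered by the indicators, in first-seen order."""
    selected = {}
    for indicator in risk_indicators:
        key = INDICATOR_TO_RUBRIC.get(indicator)
        if key is not None and key not in selected:
            selected[key] = RUBRICS[key]
    return selected
```

Unknown indicators are ignored rather than raising, so a new intake signal cannot break the review run; whether that is the right default depends on the team's policy.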

4.5 Finding generator

The finding generator must produce comments in a format that can be evaluated.

finding:
  severity: blocking | major | minor | note
  category: correctness | security | compatibility | architecture | test | ops
  location:
    file: src/main/java/.../AccessPolicy.java
    line: 88
  claim: "This branch appears to allow suspended users to access tenant resources."
  evidence:
    - "The previous check required ACTIVE status before tenant membership lookup."
    - "The new condition only checks tenant membership."
    - "No negative test covers suspended user access."
  suggested_action: "Restore active-status check or add explicit denial before membership validation."
  confidence: 0.78
  requires_human_decision: true

A comment like “Consider improving error handling” is not enough.

5. Risk Scoring

A review agent needs to set priorities.

5.1 Risk dimensions

Use a multi-dimensional risk score, not a single generic number.

Dimension	Question
User impact	Does the change affect user-facing behavior?
Data impact	Could data be lost, corrupted, or leaked?
Security impact	Does an authn/authz/secret boundary change?
Availability impact	Could the change cause an outage or degradation?
Compatibility impact	Does an API/event/schema contract change?
Complexity	Is the diff large, scattered, or cross-cutting?
Reversibility	Is the change easy to roll back?
Test evidence	Do the tests sufficiently prove the behavior?
Operational readiness	Are logging/metrics/migration/runbook sufficient?

5.2 Example risk classifier

risk_score:
  user_impact: 4
  data_impact: 5
  security_impact: 3
  availability_impact: 4
  compatibility_impact: 5
  complexity: 4
  reversibility: 2
  test_evidence: 2
  operational_readiness: 2
computed_tier: high
reason:
  - "Database migration changes non-null column behavior."
  - "API response contract changed without compatibility layer."
  - "Only happy-path tests added."
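The example above does not say how `computed_tier` is derived, so the following is only one possible aggregation, under loudly stated assumptions: scores are on a 1 to 5 scale; for `reversibility`, `test_evidence`, and `operational_readiness` a low score is the risky end, so they are inverted; and the thresholds are illustrative, not taken from the lesson.

```python
# Assumption: on these dimensions a LOW score means more risk (hard to roll
# back, weak tests, poor readiness), so they are inverted before aggregation.
PROTECTIVE = {"reversibility", "test_evidence", "operational_readiness"}


def compute_tier(scores: dict[str, int]) -> str:
    """Map 1-5 dimension scores to a risk tier using illustrative thresholds."""
    hazards = [6 - value if name in PROTECTIVE else value
               for name, value in scores.items()]
    worst = max(hazards)
    mean = sum(hazards) / len(hazards)
    if worst == 5 and mean >= 4.5:
        return "critical"
    if worst == 5 or mean >= 3.5:
        return "high"
    if mean >= 2.5:
        return "medium"
    return "low"
```

Combining the worst dimension with the mean keeps a single severe dimension (for example data impact) from being averaged away by several harmless ones.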

5.3 Risk tier to review action

Risk tier	Agent action
Low	Summary + optional suggestions
Medium	Inline findings + test evidence review
High	Blocking review packet + human escalation
Critical	Do not auto-approve; require named owner/security/release review

The agent must be able to say:

I found no strong inline finding, but this PR is high risk because it changes an API contract and migration path. Human review should focus on compatibility and rollback.

That is often more useful than a spurious comment on a diff line.

6. Finding Quality Bar

6.1 Good review finding

A good finding has 7 qualities:

Specific: points to a location and a behavior.
Evidence-based: states what changed and why it is a problem.
Actionable: gives a remediation step.
Calibrated: severity matches the risk.
Non-duplicative: does not repeat CI/lint.
Context-aware: understands the repository's patterns.
Verifiable: can be proven with a test, reasoning, or docs.

6.2 Bad review finding

Example of a bad comment:

This function is complex. Consider refactoring.

Why it is bad:

it is not specific,
it does not explain the risk,
it offers no alternative,
it does not know whether the complexity is actually necessary,
it cannot be verified.

A better comment:

This method now mixes validation, authorization, and persistence side effects. The risk is that validation failure after partial persistence can leave a partially-created record. Consider moving all validation and authorization checks before `repository.save(...)`, or wrap the operation in a transaction and add a regression test for invalid input after tenant lookup.

6.3 Suppression rule

The agent must suppress a comment if:

it is only a style preference,
confidence is low,
a deterministic tool is a better fit,
there is no suggested action,
there is no evidence,
it is potentially misleading,
the concern is too broad for an inline comment.

For broad concerns, use the summary:

Architectural note: this PR introduces a new dependency from billing to workflow runtime. I do not have enough evidence to call this incorrect, but it may violate the current layering convention. Human reviewer should confirm whether this dependency is allowed.
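Several of the suppression rules in this section are mechanical enough to sketch as a gate. The snippet below is a hedged illustration, not a reference implementation: the `Finding` fields follow the finding format from section 4.5, while `STYLE_CATEGORIES`, the `MIN_CONFIDENCE` threshold of 0.6, and the reason strings are assumptions. Rules such as "potentially misleading" need judgement and are not covered.

```python
from dataclasses import dataclass, field
from typing import Optional

STYLE_CATEGORIES = {"style", "formatting"}  # assumption: categories treated as style-only
MIN_CONFIDENCE = 0.6                        # assumption: illustrative threshold


@dataclass
class Finding:
    category: str
    severity: str
    claim: str
    evidence: list[str] = field(default_factory=list)
    suggested_action: str = ""
    confidence: float = 0.0


def suppression_reason(finding: Finding) -> Optional[str]:
    """Return why a finding should be suppressed, or None if it is publishable."""
    if finding.category in STYLE_CATEGORIES:
        return "style_only"
    if not finding.evidence:
        return "no_evidence"
    if not finding.suggested_action.strip():
        return "no_suggested_action"
    if finding.confidence < MIN_CONFIDENCE:
        return "low_confidence"
    return None
```

Returning a reason instead of a boolean means every suppressed finding can be logged with the rule that removed it, which is what makes later tuning possible.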

7. Review Categories

7.1 Correctness review

Core questions:

Does the behavior match the requirement?
Has an edge case been lost?
Did the default path change?
Is the new branch reachable?
Did the exception path change?
Did null/empty/boundary behavior change?
Did a time/order/concurrency assumption change?

Checklist:

correctness_review:
  inspect:
    - changed branches
    - removed guards
    - changed defaults
    - changed ordering
    - changed exception handling
    - changed validation
  require_evidence:
    - linked requirement or issue
    - tests for intended behavior
    - tests for negative behavior
    - reasoning for edge cases

7.2 Security review

A security review agent must not replace SAST/DAST/manual security audit. It should act as a triage layer.

Focus:

authentication bypass,
authorization drift,
tenant boundary,
injection risk,
unsafe deserialization,
SSRF/file access,
secret leakage,
insecure logging,
unsafe dependency change,
insecure tool/output handling.

A review comment must state an exploit scenario or bypass path, not just “potential security issue”.

The new endpoint accepts `tenantId` from request body and passes it to the repository without checking membership against the authenticated principal. If callers can choose another tenantId, this can become cross-tenant data access. Add an ownership check before repository access and a negative test for a user from a different tenant.

7.3 Architecture review

Architecture review is the hardest because it depends on local conventions.

The agent needs context:

package/module boundaries,
dependency rules,
ADRs,
examples of similar implementations,
forbidden dependencies,
ownership rules,
extension patterns,
lifecycle constraints.

Architecture findings must be worded carefully:

This introduces a dependency from `case-core` to `workflow-adapter`. Existing dependencies appear to point in the opposite direction. If `case-core` is intended to remain engine-agnostic, consider moving the mapping into the adapter layer. Human reviewer should confirm the intended dependency rule.

7.4 Maintainability review

Maintainability is not an excuse for generic comments.

The agent needs to distinguish:

intentional vs. unintentional duplication,
premature vs. necessary abstraction,
naming issues that hurt domain clarity,
hidden coupling,
config sprawl,
test brittleness,
unbounded growth path.

7.5 Test review

Test review must answer:

Does the test prove the bug/feature?
Does the test have a strong oracle?
Is the test too coupled to implementation details?
Is there a negative path?
Are the edge cases relevant?
Is the test flaky?
Is the test just a large snapshot without meaningful assertions?

Bad agent habit:

Please add more tests.

Better:

This PR changes behavior when a case is reopened after enforcement escalation, but the added test only covers initial case creation. Please add a regression test for reopened escalated cases, especially the transition from `ESCALATED` back to `UNDER_REVIEW`.

7.6 Operational review

Operational readiness is often overlooked by coding agents.

Checklist:

log signal for new failure mode,
metric for new queue/job/worker,
trace span around external call,
timeout/retry config,
idempotency key,
rollout flag,
migration/rollback plan,
alert threshold,
runbook update.

An agent can be very useful for reminding authors about these items because they are pattern-based.

8. Inline Comment vs Summary vs Blocking Review

Not every concern belongs in an inline comment.

Output type	Best suited for
Inline comment	Localized, specific, actionable issue
PR summary	Global risk, architecture concern, review guidance
Blocking review	Strong evidence of correctness/security/compatibility failure
Advisory note	Uncertainty, possible issue, human focus area
No comment	Low evidence, style preference, duplicate tool finding

8.1 Decision model
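One possible reading of the table above as a decision function is sketched here. All parameter names and label values (`"strong"`, `"weak"`, `high_impact`, and so on) are hypothetical; the ordering of the checks is an assumption that follows the principles from section 2.2: suppress first, downgrade uncertain concerns to advisory notes, and reserve blocking for strong evidence in correctness, security, or compatibility.

```python
# Categories where strong evidence may justify a blocking review (from the table).
BLOCKABLE = {"correctness", "security", "compatibility"}


def choose_output_type(
    *,
    evidence: str,            # "strong" or "weak" (hypothetical labels)
    category: str,
    severity: str,            # e.g. "blocking", "major", "minor", "note"
    localized: bool,          # True if the issue sits at one place in the diff
    high_impact: bool = False,
    style_only: bool = False,
    duplicate_of_tool: bool = False,
) -> str:
    """Pick where a concern should be published, if anywhere."""
    if style_only or duplicate_of_tool:
        return "no_comment"
    if evidence != "strong":
        # Uncertain but potentially significant: point humans at it, do not block.
        return "advisory_note" if high_impact else "no_comment"
    if severity == "blocking" and category in BLOCKABLE:
        return "blocking_review"
    return "inline_comment" if localized else "pr_summary"
```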

9. Review Agent Prompt Contract

The prompt is not the only control, but the prompt contract still matters.

9.1 System instruction skeleton

You are a pull request review agent.
Your goal is to reduce meaningful engineering risk, not maximize comments.

Only produce findings when:
- the issue is specific,
- the evidence is visible in the diff/context,
- the suggested action is actionable,
- the severity is calibrated,
- the concern is not better handled by deterministic tooling.

Suppress findings that are stylistic, speculative, generic, or low confidence.
When uncertain but risk may be significant, write a summary note for human reviewers instead of an inline blocking comment.

9.2 Output schema

{
  "summary": {
    "pr_type": "bug_fix",
    "risk_tier": "medium",
    "human_focus_areas": ["error handling", "regression coverage"]
  },
  "findings": [
    {
      "severity": "major",
      "category": "correctness",
      "file": "src/.../CaseReopenService.java",
      "line": 142,
      "claim": "The reopened case path skips enforcement hold validation.",
      "evidence": [
        "The previous path called validateHoldStatus before transition.",
        "The new fast path transitions directly to UNDER_REVIEW.",
        "No test covers reopened case with active hold."
      ],
      "suggested_action": "Call validateHoldStatus before transition or add explicit exception for this path.",
      "confidence": 0.82
    }
  ],
  "suppressed_findings": [
    {
      "reason": "style_only",
      "description": "Variable naming could be shorter but does not affect review quality."
    }
  ]
}

Suppressed findings are useful for audit and tuning, but do not need to be published to the PR.

10. Context Package for PR Review

Review agent membutuhkan paket konteks yang stabil.

10.1 Minimum context

review_context:
  pr:
    title: ...
    description: ...
    linked_issue: ...
    labels: ...
  diff:
    changed_files: ...
    semantic_changes: ...
    tests_changed: ...
  repository:
    architecture_notes: ...
    ownership: ...
    conventions: ...
    dependency_rules: ...
  ci:
    test_results: ...
    lint_results: ...
    static_analysis: ...
  risk:
    sensitive_paths: ...
    public_contracts: ...
    migration_files: ...

10.2 Context anti-patterns

Avoid:

dumping entire repository,
reviewing diff without issue intent,
ignoring tests,
ignoring CI status,
ignoring generated files,
ignoring ownership/routing,
mixing stale docs with fresh code without freshness markers.

10.3 Context freshness

The agent should know how fresh each context source is:

context_freshness:
  diff: current
  base_branch: main@abc123
  tests: current_run_17
  architecture_doc: last_updated_2025_11_03
  service_ownership: current

Stale architecture docs can be useful, but agent must not treat them as unquestionable truth.
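A small sketch of how freshness markers might be attached before context reaches the model. The `MAX_AGE_DAYS` threshold and the function name are assumptions; the lesson only says that stale sources should be marked, not how staleness is defined.

```python
from datetime import date

MAX_AGE_DAYS = 180  # assumption: illustrative staleness threshold


def freshness_markers(last_updated: dict[str, date], today: date) -> dict[str, str]:
    """Label each context source as fresh or stale based on its last update."""
    markers = {}
    for source, updated in last_updated.items():
        age_days = (today - updated).days
        if age_days <= MAX_AGE_DAYS:
            markers[source] = "fresh"
        else:
            markers[source] = f"stale ({age_days} days old)"
    return markers
```

The marker can then be placed next to the document in the context bundle, so the agent can phrase a claim as "the architecture notes, last updated some time ago, suggest ..." instead of asserting it as fact.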

11. Integration with GitHub/GitLab Workflow

11.1 Event triggers

Common triggers:

PR opened,
PR ready for review,
PR updated/synchronized,
comment command like /ai-review,
label added such as needs-ai-review,
before merge queue,
after failed CI.

11.2 Avoid reviewing too early

Reviewing every push can create noise.

Better:

review_policy:
  draft_pr: summary_only
  ready_for_review: full_review
  after_force_push: incremental_review
  ci_failed: failure_focused_review
  high_risk_paths_changed: require_full_review

11.3 Incremental review

The agent should not re-post the same issue after every commit.

This requires:

finding fingerprint,
comment update instead of duplicate,
resolved finding detection,
stale finding cleanup,
diff-range mapping.

Finding fingerprint example:

hash(category + normalized_claim + symbol + file + semantic_context)

Not just line number, because lines move.
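The fingerprint formula above can be sketched with the standard library. The normalization rules (lowercasing, dropping digits and punctuation) and the 16-character truncation are assumptions chosen for illustration; a real system may need a more careful notion of "normalized claim".

```python
import hashlib
import re


def normalize_claim(claim: str) -> str:
    """Lowercase, drop digits and punctuation, and collapse whitespace."""
    letters_only = re.sub(r"[^a-z\s]", " ", claim.lower())
    return " ".join(letters_only.split())


def finding_fingerprint(category: str, claim: str, symbol: str, file: str,
                        semantic_context: str = "") -> str:
    """Stable identifier for a finding that does not depend on line numbers."""
    parts = [category, normalize_claim(claim), symbol, file, semantic_context]
    # Join with a control character that is unlikely to occur in any part,
    # so ("ab", "c") and ("a", "bc") do not produce the same payload.
    payload = "\x1f".join(parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

With such a fingerprint stored alongside each published comment, a later run can update or resolve the existing comment instead of posting a duplicate.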

12. Autonomous SWE PR Review Loop

In autonomous SWE, a coding agent may create a PR that a review agent then checks.

Important invariant:

The coding agent must not be the only reviewer of its own output.

A separate review agent can share infrastructure, but should have separate role instructions, context, and evaluation rubric.

13. Review Agent vs Static Analysis vs Security Scanner

Review agent should coordinate with deterministic tools.

Tool	Strength	Weakness
Formatter	Style consistency	No semantic insight
Linter	Known rule violations	Can miss domain risk
Type checker	Type safety	Does not prove behavior
SAST	Known security patterns	False positives; context gaps
Dependency scanner	CVEs/licenses	Does not understand usage
Test runner	Executable evidence	Only covers tested behavior
LLM review agent	Contextual reasoning	Non-determinism; hallucination risk

Best architecture:

Deterministic tools produce evidence.
Agent interprets and prioritizes evidence.
Humans decide high-impact ambiguity.

14. Security and Safety of Review Agents

Review agents read code, secrets-like strings, configs, comments, and sometimes logs. Treat them as sensitive systems.

14.1 Threats

Threat	Example	Control
Prompt injection in code	Comment says “ignore previous instruction”	Treat repo content as untrusted data
Secret exposure	Agent includes key in review comment	Secret redaction before model/context/output
Malicious PR manipulation	PR includes instructions to approve	Separate instruction hierarchy
Data exfiltration via tool	Agent calls external URL with code	Tool egress policy
False approval	Agent says safe without evidence	Approval gate + evidence requirement
Review poisoning	Author optimizes PR text to fool agent	Use diff/evidence, not description only

14.2 Untrusted code principle

Repository content is input, not instruction.

Code comments, markdown files, issue descriptions, and PR text must not override system policy.

This is especially important for public repositories and external contributors.

14.3 Review output redaction

Before publishing comments:

detect secrets,
detect internal-only data,
detect private incident links,
detect customer identifiers,
detect exploit details that should be private.

Security finding output may need private channel rather than public PR comment.
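A minimal redaction pass over a draft comment could look like the sketch below. The three patterns are illustrative assumptions only (an AWS-style access key ID shape, a PEM private-key header, and a naive `name = value` assignment); they will miss many secrets, and a dedicated secret scanner is the more realistic choice in production.

```python
import re

# Illustrative patterns only; not a complete or authoritative secret detector.
SECRET_PATTERNS = [
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("private_key_block", re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----")),
    ("assigned_secret",
     re.compile(r"(?i)\b(password|secret|api_key|token)\b\s*[:=]\s*\S+")),
]


def redact(comment: str) -> tuple[str, list[str]]:
    """Replace secret-looking substrings and report which patterns fired."""
    hits = []
    for name, pattern in SECRET_PATTERNS:
        if pattern.search(comment):
            hits.append(name)
            comment = pattern.sub(f"[REDACTED:{name}]", comment)
    return comment, hits
```

A non-empty hit list is also a useful routing signal: the finding may belong in a private channel rather than in a public PR comment.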

15. Evaluation of PR Review Agents

15.1 Metrics

Use metrics that reflect review usefulness.

Metric	Meaning
Finding precision	Fraction of comments accepted as valid
Finding recall	Fraction of known issues found
Severity calibration	Whether severity matches human judgement
Actionability	Whether author can act without clarification
Duplicate rate	Repeated/same comments
Noise rate	Low-value comments per PR
Time-to-review	Latency until useful review
Developer acceptance	Applied suggestions / accepted comments
Human override rate	How often humans reject/override agent
Regression prevention	Issues caught before merge
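The first two metrics in the table can be computed from sets of finding identifiers, as in this sketch. The identifiers and the function name are hypothetical; "accepted" is assumed to mean a human marked the published comment as valid, and "known issues" is assumed to come from a labeled evaluation set.

```python
def review_metrics(published: set[str], accepted: set[str],
                   known_issues: set[str]) -> dict[str, float]:
    """Compute finding precision and recall from sets of finding identifiers."""
    precision = len(published & accepted) / len(published) if published else 0.0
    recall = len(published & known_issues) / len(known_issues) if known_issues else 0.0
    return {"precision": precision, "recall": recall}
```

For example, publishing four comments of which three are accepted gives a precision of 0.75, regardless of how many known issues were missed; recall has to be tracked separately for exactly that reason.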

15.2 Gold set

Create a review evaluation set from historical PRs:

PRs with known review comments,
PRs that caused incidents,
PRs with security fixes,
PRs with revert follow-ups,
PRs with test failures after merge,
PRs with accepted architecture comments.

But do not blindly treat historical comments as ground truth. Human reviews also contain noise.

15.3 Review eval schema

review_eval_case:
  pr_id: 1021
  diff: ...
  context_bundle: ...
  known_findings:
    - category: correctness
      severity: major
      acceptable_locations:
        - file: CaseTransitionService.java
          lines: [120, 145]
      expected_claim: "suspended cases can transition to approved"
  unacceptable_findings:
    - "style-only naming suggestion"
  scoring:
    precision_weight: 0.35
    recall_weight: 0.35
    severity_weight: 0.15
    actionability_weight: 0.15
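The `scoring` weights in the schema can be combined into a single case score as a weighted sum. This sketch assumes each component has already been scored in the range 0 to 1; how severity calibration and actionability are scored is not specified in the lesson and is left to the evaluator.

```python
# Weights copied from the scoring block of the eval schema above.
WEIGHTS = {
    "precision": 0.35,
    "recall": 0.35,
    "severity": 0.15,
    "actionability": 0.15,
}


def eval_case_score(components: dict[str, float]) -> float:
    """Weighted sum of per-case component scores, each assumed to be in [0, 1]."""
    if abs(sum(WEIGHTS.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```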

15.4 Trajectory evaluation

Evaluate not only final comments, but also:

context selected,
evidence used,
suppressed comments,
uncertainty handling,
tool calls,
retries,
time/cost.

This mirrors agent evaluation more broadly: the trajectory matters because two agents can produce similar comments for very different reasons.

16. Governance Model

16.1 Review authority

Agent review output should have explicit authority:

Authority level	Meaning
Advisory	Human may ignore
Required response	Author must respond before merge
Blocking recommendation	Human owner must decide
Automated block	Only for deterministic policy violation

Do not let LLM-only judgement become an unreviewable gate for high-impact changes.

16.2 Audit trail

Store:

model/version,
prompt/rubric version,
context hashes,
diff hash,
tool results,
findings,
suppressed findings,
confidence,
human feedback,
final outcome.

16.3 Ownership routing

Agent can route review:

review_routing:
  files:
    - path: services/authz/**
      required_reviewers:
        - security-team
        - platform-identity-owner
    - path: db/migrations/**
      required_reviewers:
        - data-platform-owner
  risk_tier_high:
    required_reviewers:
      - service-owner

This is often more valuable than trying to “solve” the review itself.
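The routing config above can be evaluated with a few lines of standard-library code. This is a sketch under assumptions: the rule data is copied from the YAML, the function name is hypothetical, and path matching uses `fnmatch.fnmatchcase`, whose `*` already matches across `/`, so `**` here behaves like a single wildcard rather than with gitignore-style semantics.

```python
from fnmatch import fnmatchcase

# Rules copied from the review_routing YAML above.
PATH_RULES = [
    ("services/authz/**", ["security-team", "platform-identity-owner"]),
    ("db/migrations/**", ["data-platform-owner"]),
]
HIGH_RISK_REVIEWERS = ["service-owner"]


def required_reviewers(changed_files: list[str], risk_tier: str) -> list[str]:
    """Collect required reviewers for a PR, without duplicates, in rule order."""
    reviewers: list[str] = []

    def add(names: list[str]) -> None:
        for name in names:
            if name not in reviewers:
                reviewers.append(name)

    for pattern, names in PATH_RULES:
        if any(fnmatchcase(path, pattern) for path in changed_files):
            add(names)
    if risk_tier == "high":
        add(HIGH_RISK_REVIEWERS)
    return reviewers
```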

17. Common Anti-Patterns

17.1 Comment maximizer

Agent produces many comments to look useful.

Fix:

hard cap comments,
require evidence,
rank by risk,
suppress low confidence.
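The four fixes above compose naturally into one ranking step. In this sketch, findings are plain dicts with `severity`, `confidence`, and `evidence` keys; the severity order follows the finding format from section 4.5, while the default cap of 5 and the 0.6 confidence threshold are illustrative assumptions.

```python
# Lower rank sorts first; order follows the severity values used in section 4.5.
SEVERITY_RANK = {"blocking": 0, "major": 1, "minor": 2, "note": 3}


def rank_and_limit(findings: list[dict], max_comments: int = 5,
                   min_confidence: float = 0.6) -> list[dict]:
    """Drop weak findings, rank the rest by severity then confidence, and cap."""
    publishable = [
        f for f in findings
        if f["evidence"] and f["confidence"] >= min_confidence
    ]
    ranked = sorted(
        publishable,
        key=lambda f: (SEVERITY_RANK[f["severity"]], -f["confidence"]),
    )
    return ranked[:max_comments]
```

Because the cap is applied after ranking, raising the number of candidate findings can never push a blocking finding out in favor of a minor one.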

17.2 Style bot in disguise

Agent mostly comments on naming/style.

Fix:

delegate style to deterministic tools,
restrict LLM to semantic/risk review,
measure noise rate.

17.3 Self-review illusion

Coding agent reviews its own patch and approves.

Fix:

separate reviewer role,
separate context/rubric,
human gate for high risk.

17.4 No context review

Agent reads diff only.

Fix:

context planner,
symbol/caller lookup,
tests/docs retrieval,
dependency graph.

17.5 Blocking without evidence

Agent blocks PR based on speculative concern.

Fix:

require evidence array,
severity calibration,
reviewer escalation.

17.6 Security theater

Agent labels everything “security risk”.

Fix:

require exploit/bypass scenario,
integrate SAST/dependency tools,
calibrate severity with security team.

18. Implementation Blueprint

18.1 Minimal viable PR review agent

Start with limited scope:

mvp_scope:
  languages:
    - Java
    - TypeScript
  review_categories:
    - test_quality
    - error_handling
    - compatibility
  trigger:
    - manual /ai-review
  output:
    - PR summary
    - max 5 inline findings
  authority:
    - advisory only

18.2 Production hardening roadmap

Stage 1: Summary-only review
Stage 2: Inline advisory findings
Stage 3: Risk-tier classification
Stage 4: Integration with CI/static analysis
Stage 5: Owner routing and review packets
Stage 6: Blocking only for deterministic policy violations
Stage 7: Continuous eval and feedback learning

18.3 Pseudocode

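class PullRequestReviewAgent:
    """Pseudocode: each self.<component> is a pipeline stage from section 4."""

    def review(self, pr_event):
        # 1. Turn the raw event into structured intake and a semantic diff.
        intake = self.pr_intake.parse(pr_event)
        semantic_diff = self.diff_analyzer.analyze(intake.diff)
        risk = self.risk_classifier.classify(intake, semantic_diff)

        # 2. Decide what extra context is needed, then fetch only that.
        context_plan = self.context_planner.plan(
            intake=intake,
            semantic_diff=semantic_diff,
            risk=risk,
        )
        context = self.context_retriever.retrieve(context_plan)

        # 3. Pick the checklist that matches the PR type and risk.
        rubric = self.rubric_selector.select(
            pr_type=intake.pr_type_candidates,
            risk=risk,
            changed_files=intake.changed_files,
        )

        # 4. Generate candidate findings; nothing is published yet.
        candidate_findings = self.finding_generator.generate(
            semantic_diff=semantic_diff,
            context=context,
            rubric=rubric,
        )

        # 5. Gate every candidate: publish only findings that pass,
        #    and keep the suppression reason for audit and tuning.
        verified = []
        suppressed = []
        for finding in candidate_findings:
            result = self.finding_gate.evaluate(finding)
            if result.publishable:
                verified.append(result.finding)
            else:
                suppressed.append(result.reason)

        # 6. Build the summary, cap the published findings, and record the audit trail.
        return ReviewResult(
            summary=self.summary_builder.build(intake, risk, verified, suppressed),
            findings=self.rank_and_limit(verified),
            audit=self.audit_builder.build(...),
        )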

19. Reviewer UX

Agent review should respect developer attention.

19.1 Good PR summary

## AI Review Summary

Risk tier: High
Primary focus areas for human reviewers:
1. API compatibility: response shape changed in `CaseSummaryResponse`.
2. Data migration: new non-null column requires backfill verification.
3. Test evidence: only happy-path tests added.

Published findings: 2
Suppressed low-confidence findings: 4
CI status considered: unit tests passed, integration tests pending

19.2 Good inline comment

**Major — compatibility**

This changes `statusReason` from nullable to required in the response mapper, but I do not see a compatibility layer or contract test for older clients.

Evidence:
- `CaseSummaryResponse.statusReason` is now required.
- Existing controller test only validates the new response.
- No consumer contract test was updated.

Suggested action: keep the field backward-compatible during rollout or add a versioned response path plus consumer contract tests.

19.3 Bad inline comment

Maybe improve this logic.

No evidence, no action, no category, no severity.

20. Practice Tasks

Task 1 — Build a review rubric

Pick one repository and define review rubrics for:

bug fix,
feature,
refactor,
migration,
security fix.

Each rubric must include blocking and non-blocking findings.

Task 2 — Build a risk classifier

Given 20 historical PRs, label:

PR type,
risk dimensions,
risk tier,
required reviewer roles.

Compare with actual review outcomes.

Task 3 — Review finding quality gate

Create a rule that suppresses findings unless they have:

category,
severity,
specific location,
claim,
evidence,
suggested action,
confidence.

Task 4 — Historical PR eval

Take 10 PRs that caused bugs after merge. Ask the agent to review the original diff. Measure whether it would have flagged the issue.

Task 5 — Noise budget

Run review agent on 30 low-risk PRs. Measure:

average comments per PR,
accepted comments,
rejected comments,
duplicates,
style-only comments.

Goal: reduce noise without losing meaningful findings.

21. Checklist

Use this checklist when designing PR review agents.

[ ] Does the agent optimize for risk reduction, not comment count?
[ ] Is PR type classified before review?
[ ] Is risk scored across multiple dimensions?
[ ] Does the agent retrieve context beyond diff?
[ ] Are rubrics selected by PR type and risk?
[ ] Are findings evidence-based and actionable?
[ ] Are low-confidence findings suppressed?
[ ] Is deterministic tooling used before LLM judgement?
[ ] Are security findings routed safely?
[ ] Is human authority explicit?
[ ] Are duplicate comments avoided?
[ ] Are findings and suppressed findings logged?
[ ] Is review quality evaluated against historical PRs?
[ ] Is developer feedback used to tune rubrics?

22. Key Takeaways

Code review agent is a risk reduction system, not a comment generator.
The most important output may be a human focus summary, not inline comments.
Review scope must be explicit: correctness, security, architecture, test, compatibility, ops.
Strong findings require claim, evidence, severity, location, and suggested action.
Agent must suppress weak/speculative comments.
Deterministic tools should produce evidence; LLM should interpret and prioritize.
Review agent must treat repository content as untrusted data.
In autonomous SWE, coding agent and review agent should be separate roles.
Evaluation must measure precision, recall, severity calibration, actionability, and noise.

23. References

OpenAI Developers — Codex code review in GitHub: https://developers.openai.com/codex/integrations/github
GitHub Docs — Using GitHub Copilot code review: https://docs.github.com/copilot/using-github-copilot/code-review/using-copilot-code-review
GitHub Docs — About GitHub Copilot code review: https://docs.github.com/en/copilot/concepts/agents/code-review
Anthropic — Building effective agents: https://www.anthropic.com/research/building-effective-agents
OWASP Top 10 for LLM Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
OWASP Agentic AI Security: https://owasp.org/www-project-agentic-ai-security/
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?: https://arxiv.org/abs/2310.06770
SWE-bench official site: https://www.swebench.com/
