Learn Agentic Ai Engineering Part 006 Task Decomposition And Planning

title: Learn Advanced Agentic AI Engineering & Autonomous Software Engineering - Part 006
description: Learn how agentic systems decompose goals into bounded tasks, plans, dependencies, evidence, constraints, verification steps, and recovery paths.
series: learn-agentic-ai-engineering
seriesTitle: Learn Advanced Agentic AI Engineering & Autonomous Software Engineering
order: 6
partTitle: Task Decomposition and Planning
tags:
- agentic-ai
- autonomous-software-engineering
- planning
- task-decomposition
- agents
- architecture
- series
date: 2026-06-29

Part 006 — Task Decomposition and Planning

1. Why This Part Matters

Agentic systems fail less often because the model is "bad at language" and more often because the task was decomposed poorly.

Bad decomposition causes:

Wrong tool usage.
Missing prerequisites.
Confused execution order.
Premature patching.
Duplicate work.
Hidden assumptions.
Ambiguous completion.
Unbounded retries.
False success.

A strong agentic engineer treats a plan as an executable risk model.

A plan is not a pretty list of steps. A production-grade plan defines:

What must be known.
What must be changed.
What must be preserved.
Which tools may be used.
Which actions are reversible.
Which actions require approval.
How progress is measured.
How completion is verified.
How failure is recovered.

This part gives you the mental model and structure for building planning systems that survive real-world ambiguity.

2. Kaufman Framing

2.1 Target Performance

After this part, you should be able to:

Turn a vague goal into a task graph with dependencies and verification points.
Classify tasks by uncertainty, side effect, reversibility, and required evidence.
Design planning prompts and planning schemas that produce executable plans, not narrative plans.
Decide when to use linear planning, DAG planning, hierarchical planning, search planning, or replanning.
Build plan-review gates for high-risk autonomous software engineering tasks.

2.2 Subskills

Subskill	What You Must Be Able to Do
Goal clarification	Convert vague user intent into explicit success criteria.
Task atomization	Break work into small tasks that can be executed, observed, and verified.
Dependency modelling	Identify ordering constraints and parallelizable work.
Evidence planning	Define what observations are needed before acting.
Risk-aware sequencing	Put low-risk information-gathering before high-risk side effects.
Replanning	Detect when a plan is invalid and repair it.
Completion verification	Use external criteria to decide done.

2.3 The 20-Hour Practice Loop

Practice this repeatedly on real tasks: repository changes, incident diagnosis, document analysis, data reconciliation, API integration, release planning.

3. Core Mental Model

A plan is a hypothesis about how to move from current state to desired state under constraints.

This definition matters.

A plan is not truth. It is a working model that must be updated as observations arrive.

A good agent does not blindly follow the first plan. It maintains plan validity.

4. Goal, Task, Step, Action

These terms must be distinct.

Level	Meaning	Example
Goal	Desired outcome	Fix login failure when token is near expiry.
Task	Coherent unit of work	Reproduce the bug.
Step	Ordered operation inside a task	Run auth integration test.
Action	Tool invocation or model operation	`run_tests("AuthTokenTest")`
Observation	Result of action	Test fails with clock skew assertion.
Evidence	Observation that supports a decision	Failure trace points to expiry check order.
Verification	Check that goal is satisfied	Regression test passes and old behavior preserved.

Bad agents collapse these levels. Good systems preserve them.

5. From Goal to Success Criteria

A vague goal is unsafe.

Example:

Fix the flaky login test.

Better success criteria:

{
  "goal": "Fix flaky login test",
  "success_criteria": [
    "The identified flaky test passes 20 consecutive local runs",
    "No production authentication logic is changed unless root cause requires it",
    "The fix does not disable or weaken the assertion",
    "A short root-cause note is included in the PR description"
  ],
  "non_goals": [
    "Do not rewrite the authentication module",
    "Do not skip the test",
    "Do not change unrelated timeout settings globally"
  ]
}

A plan without success criteria is just activity.
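A criterion such as "passes 20 consecutive local runs" is only useful if something actually checks it. The following is a minimal sketch, assuming the test run can be wrapped in a callable that returns True on success; `passes_consecutively` and `make_flaky` are hypothetical names introduced for illustration, not part of any test framework.

```python
from typing import Callable


def passes_consecutively(run_once: Callable[[], bool], required_runs: int = 20) -> bool:
    """Return True only if run_once() succeeds required_runs times in a row."""
    for _ in range(required_runs):
        if not run_once():
            return False  # a single failure rejects the "flakiness fixed" claim
    return True


def make_flaky(fail_on_call: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a test runner that fails on one specific call."""
    calls = {"n": 0}

    def run_once() -> bool:
        calls["n"] += 1
        return calls["n"] != fail_on_call

    return run_once


stable = passes_consecutively(lambda: True)        # never fails
still_flaky = passes_consecutively(make_flaky(7))  # fails on the 7th run
print(stable, still_flaky)
```

In a real setup `run_once` would invoke the project's test command; the point is that the success criterion becomes a check the runtime can execute rather than a sentence the agent can claim.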

6. Planning Inputs

A planning system should not rely only on the user prompt.

It should consume:

Input	Purpose
User goal	Desired outcome.
Current state	What is known now.
Constraints	Time, budget, security, policy, allowed tools.
Environment	Repo, branch, sandbox, APIs, data sources.
Historical context	Prior attempts, related incidents, previous decisions.
Risk profile	Side effects, sensitivity, blast radius.
Verification options	Tests, validators, reviewers, static checks, policy checks.

Planning quality is limited by input quality. Do not ask an agent to plan with missing state and then blame the model for guessing.

7. Task Classification

Before decomposition, classify the task.

7.1 By Information Shape

Type	Description	Planning Implication
Known-known	Inputs and process are clear.	Use workflow.
Known-unknown	Need to gather specific missing info.	Plan retrieval/investigation first.
Unknown-known	User forgot or omitted available context.	Ask clarification or inspect context.
Unknown-unknown	Discovery problem.	Use bounded exploration.

7.2 By Side Effect

Type	Example	Control
Read-only	Search docs, inspect code	Low risk; allow more autonomy.
Draft-only	Generate PR description	Low-medium risk; validate.
Local write	Edit sandbox file	Medium risk; checkpoint.
External write	Create ticket, send email	High risk; approval.
Financial/legal/security action	Refund, block user, rotate secrets	Very high risk; deterministic policy + human gate.

7.3 By Reversibility

Type	Example	Planning Rule
Reversible	Create draft, edit branch file	Agent may act with checkpoint.
Compensatable	Create ticket, label issue	Agent may act if compensation exists.
Hard to reverse	Send external email	Approval before action.
Irreversible	Delete data, deploy destructive migration	Avoid or require strict human control.

7.4 By Verification Strength

Type	Example	Reliability
Strong verifier	Unit tests, schema validation	Good for autonomous execution.
Medium verifier	Static analysis, lint, rubric	Needs caution.
Weak verifier	Model self-evaluation	Not enough for high-risk tasks.
Human verifier	Expert review	Required for subjective/high-risk tasks.

Autonomy should increase when side effects are low and verification is strong.
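One way to make this rule concrete is a small gate function over the three classifications. This is a sketch under assumptions: the category strings and the returned gate names are invented for illustration, and where the tables disagree (an external write that is compensatable) it takes the stricter control.

```python
def autonomy_gate(side_effect: str, reversibility: str, verifier: str) -> str:
    """Map task classifications to an execution gate (illustrative policy only)."""
    # Very high risk side effects or irreversible actions stay under human control.
    if side_effect == "critical" or reversibility == "irreversible":
        return "human_control"
    # External writes and hard-to-reverse actions need approval first.
    if side_effect == "external_write" or reversibility == "hard_to_reverse":
        return "approval_required"
    # Read-only and draft-only work is low risk regardless of verifier strength.
    if side_effect in ("read_only", "draft_only"):
        return "autonomous"
    # Remaining case: local writes. Only a strong verifier justifies autonomy.
    if verifier == "strong":
        return "autonomous_with_checkpoint"
    return "approval_required"


print(autonomy_gate("local_write", "reversible", "strong"))
print(autonomy_gate("external_write", "compensatable", "strong"))
```

The order of the checks matters: the most restrictive conditions are evaluated first, so a strong verifier can never unlock an irreversible action.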

8. Decomposition Patterns

8.1 Linear Decomposition

Use when steps are naturally sequential.

Example: document extraction.

Read document.
Extract fields.
Validate schema.
Normalize values.
Return structured output.

Linear decomposition is easy to test but brittle for exploratory work.

8.2 Hierarchical Decomposition

Break a large goal into nested tasks.

Use when:

Work has natural layers.
Different subtasks need different tools.
Humans may review intermediate artifacts.

8.3 DAG Decomposition

A DAG represents dependencies and parallelizable work.

Use DAGs when:

Some tasks can run in parallel.
Some tasks depend on shared evidence.
You need scheduling and progress tracking.

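A DAG plan is better than a bullet list for agentic systems because it makes dependencies explicit: a missing prerequisite or a circular dependency can be detected before execution instead of being discovered halfway through.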
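As a sketch of what a scheduler can do with a DAG plan, the standard-library `graphlib` module (Python 3.9+) can group tasks into batches that may run in parallel and reject plans that contain a cycle. The task ids below are hypothetical; the mapping is task id to the ids it depends on.

```python
from graphlib import CycleError, TopologicalSorter


def parallel_batches(depends_on: dict) -> list:
    """Group task ids into batches whose dependencies are already satisfied."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the plan contains a dependency cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch finished
    return batches


plan = {"T1": [], "T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}
print(parallel_batches(plan))

try:
    parallel_batches({"A": ["B"], "B": ["A"]})
except CycleError:
    print("plan rejected: circular dependency")
```

Here `T2` and `T3` land in the same batch because both depend only on `T1`. Note that `TopologicalSorter` silently adds any id that appears only as a dependency, so unknown task ids need a separate check.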

8.4 Search Decomposition

Use when the agent must explore multiple hypotheses.

This is useful for debugging and architecture decisions.

Tree-of-Thought-style approaches formalize this idea by exploring multiple reasoning paths and using evaluation/backtracking rather than committing to the first chain.

8.5 Plan-Execute-Replan

Use when uncertainty is high.

The plan must include replanning triggers.

Examples:

Test result contradicts hypothesis.
Required file does not exist.
Tool returns permission denied.
Cost budget nearly exhausted.
User goal conflicts with policy.
New risk discovered.

9. The Planning Artifact

A production plan should be structured.

9.1 Minimum Plan Schema

{
  "goal": "...",
  "success_criteria": ["..."],
  "assumptions": ["..."],
  "unknowns": ["..."],
  "constraints": {
    "allowed_tools": ["..."],
    "forbidden_actions": ["..."],
    "max_steps": 20,
    "requires_human_approval_before": ["..."]
  },
  "tasks": [
    {
      "id": "T1",
      "title": "...",
      "type": "read_only | local_write | external_write | verify | human_review",
      "depends_on": [],
      "tool_candidates": ["..."],
      "expected_evidence": ["..."],
      "done_when": ["..."],
      "risk": "low | medium | high"
    }
  ],
  "verification": ["..."],
  "fallbacks": ["..."]
}

9.2 Why Structured Plans Matter

Structured plans allow:

Policy engines to inspect proposed actions.
Humans to review before execution.
Runtimes to schedule tasks.
Observability systems to track progress.
Evaluators to compare actual trajectory against intended trajectory.
Recovery systems to resume after failure.

A narrative plan is useful for humans. A structured plan is useful for systems.
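Because the plan is data, a runtime can reject a malformed one before any tool is called. The sketch below checks a few of the fields from the minimum schema; `validate_plan` is a hypothetical helper, and the set of required fields it enforces is an assumption rather than a complete validator.

```python
def validate_plan(plan: dict) -> list:
    """Return a list of problems; an empty list means the plan is structurally usable."""
    problems = []
    for key in ("goal", "success_criteria", "tasks", "verification"):
        if not plan.get(key):
            problems.append(f"missing or empty field: {key}")

    tasks = plan.get("tasks", [])
    task_ids = {task.get("id") for task in tasks}
    for task in tasks:
        tid = task.get("id", "<no id>")
        if not task.get("done_when"):
            problems.append(f"{tid}: no done_when condition")
        for dep in task.get("depends_on", []):
            if dep not in task_ids:
                problems.append(f"{tid}: depends on unknown task {dep}")
    return problems


good_plan = {
    "goal": "Fix flaky login test",
    "success_criteria": ["Targeted test passes repeatedly"],
    "verification": ["Full auth test suite passes"],
    "tasks": [
        {"id": "T1", "depends_on": [], "done_when": ["Failure output is captured"]},
        {"id": "T2", "depends_on": ["T1"], "done_when": ["Regression test passes"]},
    ],
}
bad_plan = {"goal": "Fix it", "tasks": [{"id": "T1", "depends_on": ["T9"]}]}

print(validate_plan(good_plan))
print(validate_plan(bad_plan))
```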

10. Planning for Evidence Before Action

A strong agent gathers evidence before making changes.

Bad plan:

Edit likely file.
Run tests.
Hope it works.

Better plan:

Reproduce failure.
Identify failing assertion.
Locate responsible code path.
Inspect recent changes.
Form root-cause hypothesis.
Make minimal patch.
Run targeted tests.
Run regression tests.
Summarize evidence.

This is the most important habit in autonomous software engineering:

Investigation before modification.

11. Planning Horizon

Planning horizon is how far ahead the agent should plan.

Horizon	Description	Use Case
One-step	Decide next action only.	Simple ReAct loop.
Short horizon	Plan next 3–5 steps.	Debugging, research.
Full plan	Plan complete task before execution.	Reviewable enterprise workflows.
Rolling plan	Plan high-level path, replan after observations.	Complex engineering tasks.
Branching plan	Explore alternatives before acting.	Architecture, root cause analysis.

Do not always require full plans. Full plans are often wrong in exploratory tasks.

Do not always use one-step planning. One-step agents often become reactive and inefficient.

For production autonomous SWE, prefer rolling plans:

Create initial plan.
Execute safe evidence-gathering steps.
Replan after new evidence.
Gate writes.
Verify.

12. Dependency Modelling

A task dependency means one task requires the output or evidence of another.

Bad:

1. Update auth code.
2. Find failing test.
3. Understand expected behavior.

Good:

1. Find failing test.
2. Understand expected behavior.
3. Reproduce failure.
4. Update auth code.

12.1 Dependency Types

Dependency	Meaning	Example
Data dependency	Needs output from previous task.	Need stack trace before root cause.
Permission dependency	Needs approval.	Need human approval before sending email.
Environment dependency	Needs setup.	Need sandbox before running tests.
Risk dependency	Need evidence before side effect.	Need policy check before refund.
Verification dependency	Need test before done.	Need regression result before PR.

13. Planning With Constraints

Constraints are not optional hints. They are execution boundaries.

Examples:

{
  "constraints": {
    "time_budget_minutes": 20,
    "max_tool_calls": 40,
    "allowed_paths": ["src/auth/**", "tests/auth/**"],
    "forbidden_paths": ["infra/**", "secrets/**"],
    "no_external_network": true,
    "approval_required_for": ["git_push", "send_email", "deploy", "delete_file"],
    "must_preserve": ["public API compatibility", "audit logging behavior"]
  }
}

The planner should produce plans that fit constraints. The runtime should enforce them anyway.

Never trust the planner to self-enforce critical constraints.
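A runtime-side check might look like the sketch below. It is an approximation under stated assumptions: `check_action` is a hypothetical function, the constraint keys mirror the example above, and `fnmatch` patterns are used as a stand-in for real glob semantics (in `fnmatch`, `*` also matches `/`).

```python
from fnmatch import fnmatchcase
from posixpath import normpath
from typing import Optional

CONSTRAINTS = {
    "allowed_paths": ["src/auth/**", "tests/auth/**"],
    "forbidden_paths": ["infra/**", "secrets/**"],
    "approval_required_for": ["git_push", "send_email", "deploy", "delete_file"],
}


def check_action(action: str, path: Optional[str], constraints: dict, approved: bool = False) -> tuple:
    """Return (allowed, reason) for one proposed action; path may be None."""
    if action in constraints["approval_required_for"] and not approved:
        return False, "approval required"
    if path is not None:
        clean = normpath(path)  # collapse "../" so traversal cannot dodge the patterns
        if any(fnmatchcase(clean, pattern) for pattern in constraints["forbidden_paths"]):
            return False, "forbidden path"
        if not any(fnmatchcase(clean, pattern) for pattern in constraints["allowed_paths"]):
            return False, "path outside allowed scope"
    return True, "ok"


print(check_action("edit_file", "src/auth/token.py", CONSTRAINTS))
print(check_action("edit_file", "src/auth/../../secrets/key.pem", CONSTRAINTS))
print(check_action("delete_file", "src/auth/old.py", CONSTRAINTS))
```

The check runs on every proposed action regardless of what the plan claims, which is the point: the planner proposes, the runtime decides.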

14. Planning With Tools

A task should not just say "investigate". It should specify tool candidates and tool limits.

Example:

{
  "id": "T2",
  "title": "Locate token expiry validation",
  "type": "read_only",
  "tool_candidates": ["symbol_search", "grep", "read_file"],
  "forbidden_tools": ["write_file", "shell_exec"],
  "expected_evidence": [
    "File path containing token expiry validation",
    "Function or method name",
    "Relevant test references"
  ],
  "done_when": [
    "At least one implementation file and one test file are identified"
  ]
}

Tool-aware planning reduces unnecessary autonomy.
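A task definition like the one above can be enforced at call time with a simple allowlist check. This is a minimal sketch; `tool_allowed` is a hypothetical helper and the task dictionary is a trimmed copy of the example.

```python
TASK_T2 = {
    "id": "T2",
    "type": "read_only",
    "tool_candidates": ["symbol_search", "grep", "read_file"],
    "forbidden_tools": ["write_file", "shell_exec"],
}


def tool_allowed(task: dict, tool: str) -> bool:
    """A tool is usable only if it is listed as a candidate and not forbidden."""
    if tool in task.get("forbidden_tools", []):
        return False
    # Anything not explicitly listed is denied by default.
    return tool in task.get("tool_candidates", [])


for tool in ("grep", "write_file", "run_tests"):
    print(tool, tool_allowed(TASK_T2, tool))
```

Denying unlisted tools by default means a planner that forgets to mention a tool gets a blocked call, not an unreviewed capability.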

15. Planning With Verification

Every task should have a local done condition. The whole plan should have global success criteria.

15.1 Local Done

Example:

{
  "task": "Reproduce failing test",
  "done_when": [
    "The exact failing command is recorded",
    "The failure output is captured",
    "The failure is reproducible at least twice or marked intermittent"
  ]
}

15.2 Global Done

Example:

{
  "goal": "Fix flaky login test",
  "global_done_when": [
    "Targeted test passes repeatedly",
    "Full auth test suite passes",
    "Patch diff is limited to relevant files",
    "PR summary explains root cause and fix"
  ]
}

The more autonomous the agent, the stronger the done conditions must be.

16. Planning for Failure

Plans should include failure handling.

16.1 Failure Categories

Failure	Example	Recovery
Missing context	File not found	Search alternatives, ask human.
Tool failure	Test runner unavailable	Retry, fallback command, report blocker.
Contradictory evidence	Two tests imply different behavior	Escalate or branch hypotheses.
Budget exceeded	Too many attempts	Stop with partial findings.
Policy blocked	Tool call denied	Ask approval or choose safe alternative.
Verification failed	Tests still fail	Repair loop if budget remains.

16.2 Recovery Plan Schema

{
  "failure_recovery": [
    {
      "condition": "targeted test command fails due to missing dependency",
      "action": "inspect project build docs and try documented setup command",
      "max_attempts": 2,
      "escalate_after": "dependency setup still fails"
    },
    {
      "condition": "patch causes unrelated tests to fail",
      "action": "revert patch and re-evaluate root cause",
      "max_attempts": 1
    }
  ]
}

A plan without recovery paths is not production-grade.
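The `max_attempts` and `escalate_after` fields suggest a bounded retry loop. A rough sketch follows, assuming the action and the recovery step are plain callables; `run_with_recovery` and `fails_n_times` are hypothetical names.

```python
from typing import Callable


def run_with_recovery(action: Callable[[], bool], recover: Callable[[], None], max_attempts: int) -> str:
    """Try the action; on failure, run the recovery step and retry up to max_attempts times."""
    if action():
        return "ok"
    for _ in range(max_attempts):
        recover()
        if action():
            return "recovered"
    return "escalate"  # budget exhausted: stop and hand over to a human


def fails_n_times(n: int) -> Callable[[], bool]:
    """Hypothetical action that fails on its first n calls, then succeeds."""
    state = {"calls": 0}

    def action() -> bool:
        state["calls"] += 1
        return state["calls"] > n

    return action


recovery_log = []
outcome = run_with_recovery(fails_n_times(2), lambda: recovery_log.append("setup"), max_attempts=2)
print(outcome, recovery_log)
print(run_with_recovery(fails_n_times(3), lambda: None, max_attempts=2))
```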

17. Replanning Triggers

Replanning should happen when assumptions break.

Examples:

Required file does not exist.
Search results contradict task assumption.
Tool output has low confidence.
A supposedly low-risk action becomes high-risk.
Tests fail in a new area.
Human rejects the plan.
Budget is nearly exhausted.
The agent discovers the user goal is impossible.

Replanning should not erase history. It should preserve:

Original goal.
Prior attempts.
Evidence gathered.
Failed hypotheses.
Updated assumptions.
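One way to keep that history is to make replanning produce a new state that carries the old one forward. The sketch below is illustrative only: `PlanState` and `replan` are hypothetical, and the field names simply mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class PlanState:
    goal: str
    assumptions: list
    evidence: list = field(default_factory=list)
    failed_hypotheses: list = field(default_factory=list)
    prior_attempts: list = field(default_factory=list)
    tasks: list = field(default_factory=list)


def replan(state: PlanState, broken_assumption: str, new_assumption: str, new_tasks: list) -> PlanState:
    """Return a revised state; history is carried forward, never dropped."""
    return PlanState(
        goal=state.goal,  # the original goal does not change
        assumptions=[a for a in state.assumptions if a != broken_assumption] + [new_assumption],
        evidence=list(state.evidence),
        failed_hypotheses=state.failed_hypotheses + [broken_assumption],
        prior_attempts=state.prior_attempts + [list(state.tasks)],
        tasks=list(new_tasks),
    )


before = PlanState(
    goal="Fix intermittent logout around DST changes",
    assumptions=["expiry uses local time"],
    evidence=["trace A"],
    tasks=["T1", "T2"],
)
after = replan(before, "expiry uses local time", "expiry uses injected clock", ["T1b"])
print(after.failed_hypotheses, after.prior_attempts, after.tasks)
```

Returning a new object instead of mutating the old one also leaves an audit trail: each replanning event can be logged with both states.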

18. Autonomous SWE Planning Example

Goal:

Fix issue: users with valid refresh tokens are sometimes logged out after daylight saving time changes.

18.1 Bad Plan

1. Search for refresh token code.
2. Modify expiry logic.
3. Run tests.
4. Done.

This is too shallow. It jumps to modification.

18.2 Better Plan

{
  "goal": "Fix intermittent logout around daylight saving time changes",
  "success_criteria": [
    "Bug is reproduced or a plausible failing test is added",
    "Root cause is tied to time-zone or clock-skew handling",
    "Fix preserves existing token security semantics",
    "Auth tests pass",
    "New regression test covers DST boundary"
  ],
  "unknowns": [
    "Where refresh token expiry is calculated",
    "Whether system uses local time, UTC, or injected clock",
    "Whether failure is in token creation, validation, or session refresh"
  ],
  "tasks": [
    {
      "id": "T1",
      "title": "Inventory refresh token code paths",
      "type": "read_only",
      "depends_on": [],
      "tool_candidates": ["symbol_search", "grep", "read_file"],
      "expected_evidence": ["Token creation path", "Token validation path", "Clock/time abstraction"],
      "done_when": ["Relevant files and functions are listed"],
      "risk": "low"
    },
    {
      "id": "T2",
      "title": "Find existing time-boundary tests",
      "type": "read_only",
      "depends_on": ["T1"],
      "tool_candidates": ["grep", "read_file"],
      "expected_evidence": ["Existing expiry tests", "Clock mocking utilities"],
      "done_when": ["Test coverage gap is identified"],
      "risk": "low"
    },
    {
      "id": "T3",
      "title": "Create minimal failing regression test",
      "type": "local_write",
      "depends_on": ["T1", "T2"],
      "tool_candidates": ["write_file", "run_tests"],
      "expected_evidence": ["Failing test demonstrates DST boundary issue"],
      "done_when": ["Regression test fails before fix"],
      "risk": "medium"
    },
    {
      "id": "T4",
      "title": "Patch time handling minimally",
      "type": "local_write",
      "depends_on": ["T3"],
      "tool_candidates": ["edit_file", "run_tests"],
      "expected_evidence": ["Patch uses UTC/injected clock consistently"],
      "done_when": ["Regression test passes"],
      "risk": "medium"
    },
    {
      "id": "T5",
      "title": "Run auth regression suite",
      "type": "verify",
      "depends_on": ["T4"],
      "tool_candidates": ["run_tests"],
      "expected_evidence": ["No auth regression"],
      "done_when": ["Relevant tests pass"],
      "risk": "low"
    }
  ],
  "human_review_required_before": ["Changing token cryptographic semantics", "Changing token lifetime policy"],
  "fallbacks": [
    "If bug cannot be reproduced, stop with investigation summary and proposed test scenario",
    "If fix requires policy change, escalate before patching"
  ]
}

This plan is executable, reviewable, and risk-aware.

19. Planning Prompt Design

A planning prompt should force structure and humility.

19.1 Weak Planning Prompt

Make a plan to solve this issue.

Likely output: generic steps.

19.2 Strong Planning Prompt

You are planning only. Do not execute.

Given the goal, current state, allowed tools, forbidden actions, and success criteria, produce a structured plan.

Rules:
- Prefer read-only evidence gathering before modification.
- List assumptions separately from facts.
- Every task must have a done_when condition.
- Every write action must depend on evidence.
- Mark tasks requiring human approval.
- Include replanning triggers.
- Include verification steps.
- Return JSON matching the schema.

The key is not the words. The key is the constraints and output schema.

20. Plan Review

Before execution, review the plan.

20.1 Review Rubric

Question	Good Sign	Bad Sign
Is success explicit?	Measurable done criteria	"Improve" / "fix" only
Are assumptions separated?	Facts and assumptions distinct	Model states guesses as facts
Are dependencies valid?	Evidence before action	Writes before investigation
Are tools scoped?	Narrow tool list	Broad shell/browser access
Are risks marked?	Side effects gated	All tasks treated same
Is verification strong?	Tests/validators/human review	Self-evaluation only
Is fallback defined?	Stop/escalate paths	Infinite retries
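Some rubric rows can be checked mechanically. As an example, the "Are dependencies valid?" row can be approximated by flagging write tasks that have no read-only task anywhere upstream. This is a sketch: `writes_without_evidence` is a hypothetical lint, and it assumes tasks use the `id`, `type`, and `depends_on` fields from the plan schema.

```python
WRITE_TYPES = {"local_write", "external_write"}


def writes_without_evidence(tasks: list) -> list:
    """Return ids of write tasks that do not depend, directly or transitively, on a read_only task."""
    by_id = {task["id"]: task for task in tasks}

    def has_read_only_ancestor(task_id: str, seen: set) -> bool:
        for dep in by_id[task_id].get("depends_on", []):
            if dep in seen or dep not in by_id:
                continue  # skip already-visited or unknown ids
            seen.add(dep)
            if by_id[dep]["type"] == "read_only" or has_read_only_ancestor(dep, seen):
                return True
        return False

    return [
        task["id"]
        for task in tasks
        if task["type"] in WRITE_TYPES and not has_read_only_ancestor(task["id"], set())
    ]


evidence_first = [
    {"id": "T1", "type": "read_only", "depends_on": []},
    {"id": "T3", "type": "local_write", "depends_on": ["T1"]},
    {"id": "T4", "type": "local_write", "depends_on": ["T3"]},
]
write_first = [
    {"id": "B1", "type": "local_write", "depends_on": []},
    {"id": "B2", "type": "read_only", "depends_on": ["B1"]},
]

print(writes_without_evidence(evidence_first))
print(writes_without_evidence(write_first))
```

A lint like this does not judge whether the evidence is sufficient; it only catches the structural mistake of writing before any investigation exists in the graph.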

20.2 Plan Risk Score

A simple scoring model:

risk_score = side_effect_risk + data_sensitivity + reversibility_risk + verifier_weakness + ambiguity

Use risk score to decide:

Execute automatically.
Require human plan approval.
Require human approval before specific tasks.
Refuse or redesign.
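The scoring model can be sketched in a few lines. The 0 to 3 scale per factor and the decision thresholds below are assumptions chosen for illustration, not values from the lesson; calibrate them against your own incidents and risk appetite.

```python
FACTORS = (
    "side_effect_risk",
    "data_sensitivity",
    "reversibility_risk",
    "verifier_weakness",
    "ambiguity",
)


def risk_score(scores: dict) -> int:
    """Sum the five factors; each is assumed to be scored from 0 (none) to 3 (severe)."""
    for name in FACTORS:
        if not 0 <= scores[name] <= 3:
            raise ValueError(f"{name} must be between 0 and 3")
    return sum(scores[name] for name in FACTORS)


def decide(score: int) -> str:
    """Map a score (0 to 15) to a decision; thresholds are illustrative only."""
    if score <= 3:
        return "execute_automatically"
    if score <= 7:
        return "require_plan_approval"
    if score <= 11:
        return "require_task_approval"
    return "refuse_or_redesign"


sandbox_patch = {
    "side_effect_risk": 1,
    "data_sensitivity": 0,
    "reversibility_risk": 1,
    "verifier_weakness": 1,
    "ambiguity": 2,
}
total = risk_score(sandbox_patch)
print(total, decide(total))
```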

21. Planning and Human-in-the-Loop

Human review should not appear only at the end.

Review gates can exist at:

Goal clarification.
Plan approval.
Before side-effect action.
After failed verification.
Before final delivery.

This is important for enterprise trust: humans should approve the right artifact at the right time, not inspect a mysterious final result.

22. Planning and Memory

Planning uses memory, but memory must be controlled.

Useful memory:

Prior user preferences.
Previous failed attempts.
Known project conventions.
Historical incident patterns.
Approved architectural decisions.

Dangerous memory:

Unverified model guesses.
Stale environment details.
Prompt-injected instructions.
Sensitive data beyond need-to-know.
Reflections that encode wrong assumptions.

Rule:

Memory can inform planning, but evidence must validate planning.

23. Planning and Cost

Planning itself costs tokens, latency, and complexity.

Use lightweight planning for low-risk tasks.

Use heavy planning for:

High-impact code changes.
Multi-system workflows.
External side effects.
Long-running tasks.
Compliance-sensitive decisions.
Ambiguous production failures.

Planning should reduce downstream waste. If planning becomes ceremony, simplify.

24. Common Planning Anti-Patterns

24.1 Narrative Plan Masquerading as Execution Plan

Bad:

I'll analyze the code, find the issue, fix it, and test it.

This has no dependencies, tools, evidence, or verification.

24.2 Premature Modification

Bad:

First, update the likely broken file.

Fix:

Reproduce.
Inspect.
Gather evidence.
Then modify.

24.3 Infinite Investigation

Bad:

Search all files, inspect all modules, read all docs.

Fix:

Define search scope.
Define evidence target.
Define budget.

24.4 Unowned Verification

Bad:

Verify the fix works.

Fix:

Run `AuthTokenExpiryTest`, `SessionRefreshIT`, and relevant regression suite. Done only if all pass.

24.5 No Replanning Trigger

Bad:

Follow this plan exactly.

Fix:

Replan if tests contradict root-cause hypothesis or required files are absent.

25. Planning Quality Checklist

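A plan is acceptable when it passes the review rubric from section 20.1:

Success criteria are explicit and measurable.
Assumptions are listed separately from facts.
Dependencies put evidence before action.
Tools are scoped to each task.
Side effects are marked and gated.
Verification relies on tests, validators, or human review, not self-evaluation alone.
Fallbacks define when to stop or escalate.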

26. Practice Drills

Drill 1: Decompose a Vague Goal

Goal:

Make the onboarding workflow smarter with AI.

Produce:

Success criteria.
Non-goals.
Unknowns.
Task DAG.
Tool list.
Human gates.
Verification plan.

Drill 2: Repair a Bad Plan

Bad plan:

1. Search code.
2. Fix problem.
3. Run tests.
4. Submit PR.

Rewrite it with:

Evidence-first tasks.
Dependencies.
Tool constraints.
Done conditions.
Replanning triggers.

Drill 3: Plan Risk Review

For each task, score:

Side effect risk.
Data sensitivity.
Reversibility.
Verification strength.
Ambiguity.

Then decide whether the agent may execute automatically.

27. Key Takeaways

A plan is a hypothesis, not truth.
Good planning separates goal, task, step, action, observation, evidence, and verification.
Evidence should precede modification.
Task graphs are better than vague bullet lists for agentic systems.
The runtime must enforce constraints even if the planner says it will follow them.
Replanning is not failure; it is how robust agents handle reality.
Strong verification is what makes autonomy safe.

28. References

Anthropic, "Building Effective Agents" — https://www.anthropic.com/research/building-effective-agents
OpenAI Agents SDK documentation — https://openai.github.io/openai-agents-python/agents/
OpenAI Agents SDK Handoffs — https://openai.github.io/openai-agents-python/handoffs/
LangGraph Overview — https://docs.langchain.com/oss/python/langgraph/overview
LangGraph Persistence — https://docs.langchain.com/oss/python/langgraph/persistence
LangChain Human-in-the-Loop Middleware — https://docs.langchain.com/oss/python/langchain/human-in-the-loop
Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" — https://arxiv.org/abs/2210.03629
Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" — https://arxiv.org/abs/2305.10601
Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" — https://arxiv.org/abs/2303.11366
Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" — https://arxiv.org/abs/2302.04761
Karpas et al., "MRKL Systems" — https://arxiv.org/abs/2205.00445
