Why I Decomposed One AI Agent Bug Into Nine Tasks — And Why That's the Whole Point of Agentic Development
This is the second post in a series about building Scorpio, an AI orchestration system. The first post—Lisa Didn't Ship. Scorpio Does.—covers the full journey: why I built it, what failed the first time, and the architectural decisions that made Scorpio work. Start there if you want the backstory.
Scorpio was occasionally killing a long-running dev session while the agent was clearly still working. The frustrating part is that the system wasn't being lazy about it. It was checking. It just wasn't checking the right things.
That's not a failure of intelligence. That's a failure of supervision. And if you've been following along, you know I care a lot about the difference.
But this post isn't really about the timeout bug. It's about how I fixed it—and what that process revealed about why agentic development works when it works, and why it falls apart when it doesn't.
The answer, for the hundredth time, is specificity.
The Bug
Scorpio already had a two-part timeout. This wasn't some rookie wall-clock guillotine. The logic checked two things: is the stdout/stderr file size still growing? And has the total max time elapsed? If either condition triggered, the session died.
On paper, that sounds reasonable. In practice, it had a blind spot big enough to drive a truck through.
Real work in Scorpio doesn't always show up in stdout/stderr. Agents have thinking phases that aren't chatty. Tool calls block while work happens elsewhere. Transcript growth can be the only signal that the agent is alive and productive—but the watchdog wasn't looking at transcripts. It was staring at stdout like a dog watching one door while the burglar walks in through the window.
So the agent is mid-task, making real progress, the transcript is growing, work is clearly happening. But stdout has gone quiet for a stretch. The watchdog sees a flatlined file size, decides nobody's home, and kills the session. You get a timeout message that implies the agent stalled, when the truth is: we killed a productive worker mid-sentence because we were checking the wrong vital signs.
That's the pipeline equivalent of "Sorry, we close at 5," while you're literally standing at the counter holding your credit card.
When I was building Lisa, I learned the hard way that trust is the whole game. If you're asking someone to hand control to an autonomous pipeline—to walk away and let the system work—you can't have random, undeserved failures. Lisa had plenty of those, and it's one of the reasons she never completed a successful end-to-end run. I wasn't going to make the same mistake twice.
One Bug, Nine Tasks
Here's where the real lesson lives.
I could have filed one task: "Fix the timeout bug." Given that to an agent (or a developer) and hoped for the best. That's what most people do. That's also why most agentic workflows produce inconsistent results.
Instead, I decomposed the problem into nine specific tasks, each with its own clearly defined scope, acceptance criteria, and dependencies:
- TEST-450 — Add failing regression tests that reproduce the exact failure mode: active session, work happening, killed anyway. These tests had to fail before the fix and pass after. You write the proof of the disease before you write the cure.
- CORE-450 — Add the dual-timeout config surface. Separate `session_timeout_hard_s` from `session_idle_timeout_s` as explicit, independent controls. Preserve backward compatibility with the legacy `SESSION_TIMEOUT` variable. This is pure plumbing—no behavior change yet, just giving the system the vocabulary to express what it couldn't before.
- CORE-451 — Implement the liveness-aware watchdog in `run_session`. This is the behavioral heart of the fix: monitor activity, refresh liveness when work is detected, terminate only on idle timeout or hard cap, and log which one killed the session. Same two questions as before, better answers.
- TEST-451 — Add the integration matrix. Cover the full scenario space: active session survives past idle threshold, idle session times out, hard cap always terminates, resume works correctly after timeout failures. Both verbose and non-verbose modes.
- CORE-470 — Build the Claude transcript activity probe. This is the missing sense organ. Resolve transcript candidates, track file size and mtime as activity signals, return deterministic states: activity, no activity, or unavailable. Fail safely—probe failure never crashes execution.
- CORE-471 — Implement composite liveness mode. Upgrade the watchdog from single-source idle tracking to a composite evaluator: stream growth plus transcript activity. Idle timeout fires only when all enabled signals are quiet for the full idle window. This is the fix for the blind spot.
- CORE-472 — Tune defaults and add activity mode config. Introduce `session_activity_mode` (`streams` vs `streams_plus_transcript`), default to composite for Claude sessions, raise the non-QA idle timeout from 300 seconds to 900 seconds. This is where operator experience gets dialed in.
- DOC-450 — Document the timeout model, tuning guidance, and operator runbook. Migration notes from legacy config. Troubleshooting section for diagnosing timeout reasons from logs and artifacts.
- DOC-470 — Document composite liveness behavior. Explain activity signals, evaluation order, and the "idle timeout with active transcript" diagnostic path. Concrete recommended ranges for tuning.
Nine tasks. One bug. And every single one of them has acceptance criteria you can evaluate as a boolean: done or not done. Pass or fail. 1 or 0.
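As a rough illustration, a fail-safe transcript probe in the spirit of CORE-470 might look like this. The function name, the `prev` bookkeeping, and the structure are my assumptions, not Scorpio's code; only the three deterministic states come from the task description above:

```python
import os

def probe_transcript(path: str, prev: dict) -> str:
    """Fail-safe activity probe (sketch). Compares file size and mtime against
    the previous observation stored in `prev`. Returns one of three
    deterministic states and never raises, so a missing or unreadable
    transcript degrades supervision instead of crashing the session."""
    try:
        st = os.stat(path)
    except OSError:
        return "unavailable"
    grew = st.st_size > prev.get("size", -1) or st.st_mtime > prev.get("mtime", -1.0)
    prev["size"], prev["mtime"] = st.st_size, st.st_mtime
    return "activity" if grew else "no_activity"
```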
Why This Matters More Than the Fix Itself
In my first post, I talked about the core lesson from building both Lisa and Scorpio:
LLMs can do almost anything, if the task is finite enough to define success as a boolean.
This is what that looks like in practice. Not as a principle. As a dependency graph.
If I'd filed one task—"fix the timeout"—and pointed an agent at it, here's what would have happened: it would have changed some code, maybe improved the obvious case, missed the transcript probe entirely, skipped the regression tests, left the docs stale, and broken backward compatibility. I know this because that's exactly what happened with Lisa, over and over, on problems far simpler than this one.
The specificity of the task is what produces reliability. Each of those nine tasks is small enough that an agent (or a human) can hold the entire problem in their head, know exactly what "done" means, and either get there or clearly fail. No ambiguity. No scope creep. No "well, it sort of works."
And the dependency chain matters too. TEST-450 comes before CORE-450, because you write the failing test before you write the fix. CORE-470 builds the probe before CORE-471 wires it into the watchdog. DOC tasks follow implementation tasks. The order isn't arbitrary—it's the architecture of confidence.
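That ordering can be written down as an explicit graph and checked mechanically. Here's a sketch using Python's standard-library `graphlib`; the edges reflect my reading of the dependencies described in this post, not Scorpio's actual task data:

```python
from graphlib import TopologicalSorter

# Edges are my reading of the dependencies in this post, not Scorpio's data.
# Each task maps to the set of tasks it depends on.
deps = {
    "TEST-450": set(),            # failing regression tests come first
    "CORE-450": {"TEST-450"},     # config surface after the proof of failure
    "CORE-451": {"CORE-450"},     # watchdog behavior builds on the config
    "TEST-451": {"CORE-451"},     # integration matrix covers the new behavior
    "CORE-470": {"CORE-451"},     # build the transcript probe
    "CORE-471": {"CORE-470"},     # wire the probe into the watchdog
    "CORE-472": {"CORE-471"},     # tune defaults on top of composite mode
    "DOC-450": {"CORE-451"},      # docs follow implementation
    "DOC-470": {"CORE-471"},
}

# static_order() yields a valid execution order: dependencies always first.
order = list(TopologicalSorter(deps).static_order())
```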
The Broader Pattern
This is what I keep relearning, and what I think most people building agentic systems are still underestimating: the quality of your orchestration output is determined almost entirely by the quality of your task decomposition.
Everyone wants to talk about prompts. Model selection. Temperature settings. Agentic loops. Those things matter. But they're all downstream of a more fundamental question: did you define the work clearly enough that success is unambiguous?
If you did, most models will get you there. If you didn't, no model will.
Scorpio's skill system exists to serve this principle. Each skill definition specifies exactly what an agent should do, how it should do it, and what constraints apply—per task type. A `dev-task-runner` skill behaves completely differently than a `bug-investigator` skill. The orchestrator matches skills to tasks and injects the right definition into each session. Task-specific behavior, not generic labels. That was the whole lesson from Lisa's failure: role without specificity is just a costume.
The timeout fix is a clean example because it's technical enough to be precise but human enough to feel the stakes. Nobody wants their productive work killed by a half-blind watchdog. But the reason the fix worked—the reason it shipped clean and didn't introduce new bugs—is that it was nine problems, not one. And each one was solvable.
If You're Building Something Similar
The timeout pattern is worth stealing on its own:
- Separate idle timeout from hard timeout. (Scorpio already had the two-part check. This is table stakes.)
- Track liveness across all available signals, not just the obvious one. Output growth beats vibes, but transcript growth beats stdout alone.
- Always log the termination reason.
- Design degradation paths. Missing telemetry should narrow supervision, not crash it.
But the deeper pattern is the decomposition:
- Break every fix into the smallest tasks that can be independently verified.
- Write failing tests before writing fixes.
- Define acceptance criteria as booleans. Pass or fail. No "sort of."
- Order tasks by dependency, not by vibes.
- Document as you go, not after.
Agents don't need more autonomy before they get more supervision. And supervision doesn't need to be smart. It needs to be specific.
Or, in Scorpio terms: if you're going to hand a flamethrower to the worker, at least give it a checklist that fits on one page.
Common Questions About Agentic Task Decomposition
What is task decomposition in agentic AI development?
Task decomposition is the practice of breaking a complex problem into small, independently verifiable tasks before assigning them to AI agents. Each task should have clearly defined acceptance criteria that can be evaluated as a simple pass or fail. In agentic orchestration systems like Scorpio, decomposition is what turns unreliable AI output into predictable, shippable results. Without it, agents tend to partially solve problems, miss edge cases, and produce inconsistent work.
Why do AI agents fail on complex tasks?
AI agents most commonly fail on complex tasks not because the model lacks capability, but because the task definition is too broad or ambiguous. When an agent receives a vague instruction like "fix the timeout bug," it lacks the specificity to know what "done" looks like. It may address the obvious symptom while missing regression tests, backward compatibility, documentation, and related configuration changes. Decomposing into smaller tasks with binary success criteria gives agents (and humans) a clear target, which dramatically improves reliability.
What is a liveness-aware watchdog in an AI orchestration system?
A liveness-aware watchdog monitors whether an AI agent session is still actively working, rather than simply tracking elapsed wall-clock time. Instead of killing a session after a fixed duration, it checks activity signals like output stream growth and transcript file changes. This prevents false timeouts where a productive agent is terminated simply because it entered a quiet working phase. A well-designed watchdog separates idle timeout (no activity detected) from hard timeout (absolute time cap), and logs which condition triggered termination.
What is the difference between idle timeout and hard timeout for AI agents?
Idle timeout terminates a session when no activity has been detected for a defined period — indicating the agent may be stuck, deadlocked, or waiting on input that will never arrive. Hard timeout terminates a session after an absolute maximum duration regardless of activity, serving as a cost control and governance mechanism. Both are necessary in production orchestration systems. Idle timeout prevents wasted resources on stuck sessions; hard timeout prevents runaway loops and unbounded costs.
How do you make AI agent workflows reliable in production?
The most effective way to make AI agent workflows reliable is to ensure every task assigned to an agent has acceptance criteria that can be evaluated as a boolean — done or not done. This requires disciplined task decomposition: breaking work into the smallest units that can be independently verified, ordering them by dependency, writing failing tests before implementing fixes, and documenting changes as part of the workflow rather than as an afterthought. Model selection and prompt engineering matter, but they are downstream of task specificity.
What is composite liveness detection in AI orchestration?
Composite liveness detection uses multiple activity signals to determine whether an AI agent session is still working. Rather than relying on a single indicator like stdout growth, a composite approach checks stdout, stderr, and transcript file activity together. The session is considered idle only when all enabled signals are quiet for the full idle window. This reduces false positives in scenarios where real work produces output through channels other than the standard output stream, such as during long tool calls or thinking phases.
What is the difference between roles and skills in AI agent orchestration?
In role-based orchestration, agents are assigned generic identities like "Developer" or "QA" that don't specify how to perform individual task types. In skill-based orchestration, agents receive task-specific behavior definitions that detail exactly what to do, how to do it, and what constraints apply for each type of work. A dev-task-runner skill behaves differently than a bug-investigator skill. Skills produce more reliable output because they replace vague identity labels with concrete, repeatable instructions.
How should you structure task dependencies in an agentic development pipeline?
Task dependencies in an agentic pipeline should follow a logical confidence-building order: write failing tests that prove the bug exists, implement the foundational configuration or plumbing changes, build the behavioral fix on top of that foundation, add integration test coverage, and document the changes last. Each task should only depend on tasks whose output it directly consumes. This ordering isn't arbitrary — it's the architecture of confidence, ensuring each step validates the one before it.