Your AI Pipeline Doesn't Need More Power, It Needs a Vocabulary
Something broke on Tuesday. Not a crash, worse.
Scorpio, the AI orchestrator I've been building, started replanning work that had already shipped. A review session pulled in task completion state from a month ago. The advisor pipeline generated findings nobody could act on. And somewhere in the middle of debugging, the watchdog killed a session that was working perfectly fine, because it mistook thinking for death.
None of these were the same bug. That was the actual problem.
The drift
When you build a system that spawns AI sessions to execute development tasks, names accumulate fast. pm-task-breakdown generates task files. dev-task-runner executes them. qa-reviewer reviews them. qa-pm-triage triages the reviews. These names made sense at 2am when each skill was a standalone experiment. They stopped making sense when the orchestrator had to route between them programmatically and a human needed to read the plan file and understand what was about to happen.
The feature-architect skill was also referenced as feature-planner in some docs and feature-design in others. The planner was creating task files directly, bypassing the designated task generator. Workflow chains were implicit. The orchestrator was routing correctly by accident, held together by convention and hope.
This is the kind of rot that doesn't show up in tests. It shows up when you try to explain your system to someone and realize you can't, because the system can't explain itself.
Sixteen tasks to rename things
The fix was a canonical taxonomy: [scope]-[phase]-[outcome]. pm-task-breakdown became task-plan-generator. dev-task-runner became task-execute-runner. feature-architect became feature-plan-design. Fourteen skills, sixteen tasks, three spec documents, five test suites, two documentation passes. Every skill got exactly one canonical name plus backward-compatible aliases resolved at runtime.
More importantly, a single ownership rule: only task-plan-generator writes task files. Planning skills now output design artifacts and a standardized TASK-PLAN-HANDOFF.json, then stop. This killed an entire class of bugs where different skills wrote task files in slightly different formats, silently confusing the downstream QA chain.
Tedious work. The kind nobody writes blog posts about. But the moment it landed, every workflow chain became legible in the plan file. The help text matched what actually ran. And the alias resolver meant muscle memory still worked.
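The resolver itself can be tiny. Here's a sketch in Bash, assuming a simple associative-array alias map — the name pairs come from the renames described in this post, but the function shape and variable names are hypothetical, not Scorpio's actual code:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of runtime alias resolution: legacy skill names
# map to their canonical [scope]-[phase]-[outcome] form.
declare -A SKILL_ALIASES=(
  [pm-task-breakdown]=task-plan-generator
  [dev-task-runner]=task-execute-runner
  [feature-architect]=feature-plan-design
  [feature-planner]=feature-plan-design
  [feature-design]=feature-plan-design
)

resolve_skill() {
  local name="$1"
  # Names that are already canonical fall through unchanged.
  printf '%s\n' "${SKILL_ALIASES[$name]:-$name}"
}
```

If every caller routes through a resolver like this, old names keep working at the command line while the plan file and help text only ever show canonical ones.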
Four dead sessions that weren't dead
While the taxonomy migration was in progress, a different crisis forced its way in. Four consecutive production runs timed out. The sessions were fine: actively working, writing to their transcripts, making progress. But Scorpio's watchdog only checked stdout.
Claude sessions can go long stretches without producing visible output while they think. The watchdog saw silence and pulled the trigger. The fix was a composite liveness mode called streams_plus_transcript. Instead of one heartbeat signal, the watchdog now monitors two: stdout/stderr growth and Claude's transcript JSONL growth. Either signal resets the idle timer. The implementation includes a probe that can switch to newer transcript files mid-session if the current one goes quiet, and transcript discovery failures degrade safely to stream-only behavior rather than crashing.
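A minimal sketch of the idea, not Scorpio's implementation: poll the total bytes across both files and treat growth on either as liveness. The file paths, polling contract, and function name are all assumptions:

```shell
# Sketch of a composite liveness check (streams_plus_transcript).
# Assumes the watchdog tracks a running byte total between polls.
liveness_signal() {
  local stream_log="$1" transcript="$2" prev_total="$3"
  local stream_bytes=0 transcript_bytes=0
  [ -f "$stream_log" ] && stream_bytes=$(wc -c < "$stream_log")
  # Transcript discovery may fail; degrade to stream-only, never crash.
  [ -f "$transcript" ] && transcript_bytes=$(wc -c < "$transcript")
  local total=$((stream_bytes + transcript_bytes))
  if [ "$total" -gt "$prev_total" ]; then
    echo "alive $total"   # growth on either signal resets the idle timer
  else
    echo "idle $total"
  fi
}
```

The key property is the OR: a session that is silent on stdout but still appending to its transcript reads as alive, which is exactly the case the old stdout-only watchdog got wrong.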
This shipped alongside a new scorpio doctor command for structured postmortem analysis of failed runs. Together: thirteen tasks, all through QA, all triage-approved. The false-timeout pattern was the catalyst, but the diagnostic infrastructure it produced will catch problems that haven't happened yet.
The mess real projects make
Taxonomy gave the system a coherent vocabulary. Liveness kept it from killing its own workers. But neither solved what happened next: Scorpio met a real project that had been running for weeks, and the assumptions fell apart.
The stale task problem
In a project that's been active for a while, docs/tasks/ is a graveyard. Completed work sits next to fresh tasks. Scorpio was dutifully replanning all of it, including tasks that shipped days ago, because it had no concept of "finished."
The fix is a pre-plan safety gate. When mixed scope is detected (completed and active tasks coexisting), the orchestrator stops and presents three choices: include everything, exclude completed work, or archive the old tasks and move forward. The default is exclude. Non-interactive runs get a config knob. No silent archive operations ever.
This required defining "completed" precisely. Two forms of evidence: a done-marker file matching the canonical task ID, or an entry in tasks_completed from a successful historical run. The ID matching uses normalized TASKTYPE-NNN extraction, which immediately fixed a bug where API-002-add-auth-endpoint.md wasn't matching its API-002-report.md done marker.
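The extraction fits in a few lines of Bash. The regex and the done-marker naming below are inferred from the examples in this post, not lifted from Scorpio's source, and the second evidence source (tasks_completed from historical runs) is omitted:

```shell
# Normalized TASKTYPE-NNN extraction, so a task file and its done marker
# match on the same canonical ID regardless of their descriptive suffixes.
task_id() {
  local name
  name=$(basename "$1" .md)
  if [[ $name =~ ^([A-Z]+-[0-9]+) ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

is_completed() {
  # Completed = a done marker exists whose canonical ID matches the task's.
  local task_file="$1" markers_dir="$2"
  local id; id=$(task_id "$task_file") || return 1
  [ -f "$markers_dir/$id-report.md" ]
}
```

With normalization in the middle, API-002-add-auth-endpoint.md and API-002-report.md both reduce to API-002 before comparison, which is the bug fix described above.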
The scope bleed
Review and triage sessions ran with tasks=["all"]. "All" meant every task the project had ever seen. A review session could pull in completion state from month-old work and produce summaries that mixed current progress with ancient history. The fix was surgical: "all" now expands to the current plan's task IDs. The plan is the boundary.
The ghost runs
The new Scorpio Launcher, a browser-based dashboard for operators who'd rather click than type, surfaced a problem immediately. What does the UI show when current-run points to a run from last week that's marked active, but no process is behind it? A stale-runtime detector now identifies these orphaned states, surfaces warnings with duration context, and lets operators start fresh runs without force-killing phantoms.
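The check behind that detector can be sketched simply, assuming a current-run file that records the supervising process ID as a pid= line — that file format is a stand-in, not Scorpio's actual one:

```shell
# Stale-runtime detection sketch: a run marked active whose recorded PID
# no longer exists is flagged as stale rather than shown as live.
runtime_state() {
  local current_run="$1" pid
  [ -f "$current_run" ] || { echo "none"; return; }
  pid=$(sed -n 's/^pid=//p' "$current_run")
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "active"
  else
    echo "stale"   # surface a warning; don't pretend the phantom is running
  fi
}
```

kill -0 sends no signal; it only asks the kernel whether the process exists, which is all the dashboard needs to distinguish a live run from a ghost.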
Closing the advisor loop
The piece that ties it together is the advisor pipeline. Scorpio has five advisor skills: UX, growth, trust, accessibility, and monetization. Combined, they produce a synthesis report. Previously, those reports were endpoints. Read them, decide what to do, create tasks by hand.
Now there's a machine-readable path from finding to execution.
The advisor-synthesis skill writes a JSON sidecar alongside the markdown report. Scorpio's intake validates each finding against a strict schema: id, title, severity, effort, recommendation (all required), ID format enforced, rank validated when present. Malformed findings get logged and skipped. If the entire sidecar is invalid, the system falls back to parsing the markdown. The operator selects which findings to act on, and the intake produces a TASK-PLAN-HANDOFF.json that feeds directly into the execution pipeline.
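For illustration, a sidecar carrying one valid finding might look like this. The required field names match the schema described here, but the findings wrapper, the value vocabularies, and the example content are all invented:

```json
{
  "findings": [
    {
      "id": "UX-001",
      "title": "Primary CTA is below the fold on mobile",
      "severity": "high",
      "effort": "small",
      "recommendation": "Move the primary CTA into the first viewport",
      "rank": 1
    }
  ]
}
```

A finding missing any of the five required fields would be logged and skipped; only if the whole file fails to parse does intake fall back to the markdown report.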
Strict on input. Graceful on failure. The same pattern that kept showing up everywhere else.
What this actually taught me
Naming is architecture. Not in the abstract sense. In the concrete sense that when your orchestrator, your skills, your CLI, and your documentation use different words for the same concept, every new feature has to navigate that drift. Sixteen tasks to rename things sounds like overhead. It was the highest-leverage work of the month.
Defaults are policy decisions. Choosing exclude_completed as the default pre-plan action is an opinion about forward-only workflows. Requiring --archive for destructive operations is a bet that annoying is better than destructive. These aren't technical choices. They're product choices baked into 8,600 lines of Bash.
Instrument failures before they become patterns. Four false timeouts in a row would have been a frustrating mystery without the diagnostic surface to catch them. scorpio doctor exists because the alternative was staring at log files and guessing.
What's next
The orchestrator has more than tripled in size since early February. The launcher works but is early. The advisor pipeline passed smoke tests and needs a live project run. The immediate question isn't what to build next; it's whether 8,600 lines of Bash is a feature or a problem.
Probably both.