Lisa Didn't Ship. Scorpio Does. Here's What I Learned Building an AI Orchestration System.

Jason Block • February 25, 2026

My day job is running a large travel advisor network. My nights, for the past several months, have been something different: building an AI orchestration system. I've built it twice now. The first version never successfully completed a full development run. This is the story of what I learned.

The Problem

If you've used Claude Code (or any other agentic coding tool) for anything non-trivial, you already know the wall. You start a session, get into the codebase, build momentum, and then you run out of context. New session. Re-explain everything. Lose the thread.


Or worse, you write a really well-defined spec, launch an agent team, and they either stall and run out of context or run off in unpredictable directions. For one task that's annoying. For a real feature across ten or fifteen tasks, it's completely unmanageable.


What I wanted was simple to describe and hard to build: a system that could decompose a software development project into phases, run each phase in its own fresh session, and manage all the handoffs automatically. No babysitting. No context dumps. Files as shared memory instead of conversation history. Then put it all in a nice, user-friendly GUI so novices wouldn't need to mess with the terminal too much. Sort of the direction of Claude Co-Work and the Codex desktop app, but more task-focused. Think fun, graphical, chat-based feature interviews, a kanban-board interface to visualize work, etc.

I wanted to be able to define a project objective, detail the requirements and user stories, then let 'er rip and watch the show.


The Inspiration

In late 2025 I started following Jeffrey Emmanuel (https://x.com/doodlestein). At the time he'd published Agent Mail, a structured approach for agent sessions to pass information between each other. I'd been wrestling with this exact problem and struggling to get markdown-based comms to work reliably. Agent Mail was the first thing I'd seen that offered a real framework for inter-session communication.


Around the same time, Geoffrey Huntley's Ralph Wiggum Loop (https://ghuntley.com/ralph/) was entering the broader AI conversation: a Bash loop that puts Claude Code on a "night shift," iterating on a task with fresh context each pass until it's done. The codebase itself is the persistent state. I never used the loop directly, but the ideas in the air were unmistakable: session isolation, file-based state, autonomous iteration.
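The pattern itself fits in a few lines of Bash. Here's a minimal sketch of the idea, not Huntley's actual script: `agent_cmd` is a stub standing in for a real agent invocation (e.g. `claude -p "$(cat PROMPT.md)"`), and the `DONE` marker and `.pass` counter are illustrative conventions.

```shell
#!/usr/bin/env bash
# Minimal Ralph-style loop: fresh context every pass, plain files are
# the only persistent state between passes.
agent_cmd() {
  # Stub for illustration. A real setup would invoke the coding agent
  # here; the agent would be instructed to touch DONE when the
  # acceptance criteria pass. The stub pretends that happens on pass 3.
  pass=$(( $(cat .pass 2>/dev/null || echo 0) + 1 ))
  echo "$pass" > .pass
  [ "$pass" -ge 3 ] && touch DONE
}

rm -f DONE .pass
while [ ! -f DONE ]; do
  agent_cmd   # brand-new "session": no conversation history carries over
done
echo "finished after $(cat .pass) passes"
```

The loop knows nothing about what the agent did; it only checks the filesystem, which is exactly the point.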


Jeffrey has since built out his broader Agentic Coding Flywheel. In some ways we've been building toward similar goals in parallel. He's more prolific in terms of projects (and public output), at least so far.

These ideas gave me the framework I needed. I started building.


Lisa: Attempt One

I'm a huge Simpsons fan, so it seemed obvious to me to name my first effort Lisa, a much smarter and more competent version of Ralph. Lisa was ambitious: a TypeScript and Bash hybrid, PostgreSQL for persistence, a 17-page React UI, three defined agent roles (Project Manager/System Architect, Developer, and QA) with structured mailboxes, real-time streaming, and state machines.


The UI was genuinely beautiful.


Lisa never completed a successful end-to-end development run for anything significant. The loop logic was buggy and inconsistent. Lisa's failures (well, my failures, I suppose) were a great lesson in what doesn't work. The UI was polished up front, but under the hood it was a mess, and the outputs didn't work.


Slop. But not slop because the code was bad. The deeper problem was architectural: I gave the agents roles without giving them defined behavior. A Developer agent building a login form received the same base prompt as a Developer agent refactoring a database schema. Role without specificity doesn't shape behavior or guide output; it's just a label.

The agent had no idea how to do the work, only what kind of worker it was supposed to be. It would be like walking up to a ballerina and saying, "You're an expert tennis player." Mmmm, okay, sure.


The UI was fighting against itself too. Seventeen pages with buttons for steps that should have just happened automatically. I kept hearing my own voice saying: I just wish the thing would work.


So I scrapped it. Lisa was killed off.


Scorpio: From Scratch

I still wanted to play on the Simpsons-inspired Ralph wave, and if Lisa Simpson is the smartest character in the Simpsons universe, Hank Scorpio might be the most effective. He appears in exactly one episode: Season 8's "You Only Move Twice," voiced by Albert Brooks. Charming CEO of Globex Corporation, excellent benefits, genuinely supportive of his team, casually attempting world domination. He doesn't overthink the architecture. He just gets things done.


I liked that energy for what I was building next.


Scorpio is Bash-first, file-system state, no infrastructure dependencies. The same core idea (orchestrate independent AI sessions through a phased pipeline) but with one critical architectural shift: skills instead of roles. There is a UI, or will be, but Scorpio's power is that it just gets to the heart of the matter and throws a flamethrower at it. Watch that episode if you don't get the reference.


Skills are self-contained markdown definitions. Each one specifies exactly what an agent should do, how it should do it, and what constraints apply, per task type. A dev-task-runner skill behaves completely differently than a bug-investigator skill, which behaves differently than a qa-reviewer. The orchestrator loads the skill registry, matches skills to tasks, and injects the right definition into each session's prompt.
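Mechanically, that injection step is simple. A sketch of the idea, with the `skills/` layout, the `type:` front-matter line, and the file names all being my illustrative conventions rather than Scorpio's actual implementation:

```shell
#!/usr/bin/env bash
# Sketch: match a skill definition to a task, then build the prompt
# for a fresh session by injecting the skill ahead of the task.
set -euo pipefail

mkdir -p skills
cat > skills/dev-task-runner.md <<'EOF'
# Skill: dev-task-runner
Implement exactly the task's acceptance criteria. Run the tests. Stop.
EOF

cat > TASK-001.md <<'EOF'
type: dev-task-runner
goal: add a /health endpoint
EOF

# Read the task type and look up the matching skill definition.
task_type=$(sed -n 's/^type: //p' TASK-001.md)
skill_file="skills/${task_type}.md"

# The session prompt is skill definition + task, nothing else.
{ cat "$skill_file"; echo; cat TASK-001.md; } > session-prompt.md

head -1 session-prompt.md   # prints "# Skill: dev-task-runner"
```

A `bug-investigator` task would resolve to a different file and produce a completely different prompt; the orchestrator logic stays the same.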


Task-specific behavior, not generic labels.


The rest is built around one principle: files over context. Every session reads from disk, writes to disk, and passes to the next session through structured files. Disk is unlimited. Context windows are not. This isn't a workaround, it's the architecture!


The pipeline runs in phases: discover → plan → build → validate → release → learn. Each session writes one of three results:

  • Success
  • Failed
  • Needs Decision


When a session hits an ambiguity it can't resolve, it flags it and a separate "thinker" session analyzes the problem and can retry with a modified prompt. The system self-heals without you. It works until it figures it out, meets acceptance criteria, and passes tests.
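A sketch of that result-handling loop. The `run_session` and `run_thinker` functions are stubs standing in for real agent sessions, and `result.txt` is an illustrative file name, not Scorpio's actual protocol:

```shell
#!/usr/bin/env bash
# Sketch: each session reports one of three results via a file;
# "needs_decision" routes to a thinker pass that retries with a
# modified prompt.
set -euo pipefail

run_session() {
  # Stub: a real version spawns a fresh agent session. Here it hits
  # an ambiguity it can't resolve and flags it.
  echo "needs_decision" > result.txt
}

run_thinker() {
  # Stub: analyzes the ambiguity, patches the prompt, retries.
  echo "retry with clarified acceptance criteria" > prompt-patch.txt
  echo "success" > result.txt   # pretend the retry resolves it
}

run_session
case "$(cat result.txt)" in
  success)        echo "phase complete" ;;
  failed)         echo "phase failed; surfacing logs" ;;
  needs_decision) run_thinker; echo "thinker resolved: $(cat result.txt)" ;;
esac
```

Because the result lives on disk, the orchestrator never needs the session's conversation history to decide what happens next.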


Scorpio today: ~3,200 lines of Bash, 70+ integration tests passing, 20+ skills covering the full lifecycle, phase-based execution with dependency ordering and lock-aware parallel sessions, and support for Claude, Codex, and Gemini. It has successfully built itself and a few other projects I'm working on.


My Core Learning

Here's what I keep coming back to after all of this:

  1. LLMs can do almost anything, if the task is finite enough to define success as a boolean. Yes it's done or no it's not. 1 or 0. Binary. (I'm aware of how nerdy that is.)
  2. Every other decision, like skill definitions, session boundaries, handoff protocol, the thinker pattern, and file-based state, flows from that constraint. If you can't state what "done" looks like before the agent starts, the loop will run until it times out and you won't know why.
  3. Getting orchestration right is getting decomposition right. Break work into pieces where done is unambiguous. The rest is iteration and perseverance.
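In practice, "done as a boolean" is just a gate that exits 0 or 1. A trivial sketch; the two checks are placeholders for a real test suite and a required artifact, not Scorpio's actual gate:

```shell
#!/usr/bin/env bash
# Sketch: a task is complete iff every acceptance check exits 0.
touch build-report.md    # pretend the session produced its artifact

task_done() {
  bash -c 'exit 0' &&    # stand-in for: run the real test suite
  [ -f build-report.md ] # stand-in for: required output exists
}

if task_done; then echo "done: 1"; else echo "done: 0"; fi
```

If the gate can't be written, the task isn't decomposed enough yet; that's the signal to split it.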


Getting all the pieces to work together in harmony is literally what orchestration means. It turns out the word is exactly right.


What's Next

The UI is still in development. The core framework is solid. I'm exploring whether there's a commercial application here, and I'm building in public to find out. I hope to write and speak more about this: to groups, at conferences, through this blog, and on socials.


This is the first post in that series.


If you're building something similar, or even just thinking about it, I'd genuinely like to hear what you're working on. All of this is so brand new, and every day, literally, there are five new things to learn. No one can learn, see, do, or test it all. It is an exciting, terrifying time to be alive. So what's the best thing to do in that situation? Grab a flamethrower...


By JW Block • March 2, 2026
The drift

When you build a system that spawns AI sessions to execute development tasks, names accumulate fast. pm-task-breakdown generates task files. dev-task-runner executes them. qa-reviewer reviews them. qa-pm-triage triages the reviews. These names made sense at 2am when each skill was a standalone experiment. They stopped making sense when the orchestrator had to route between them programmatically and a human needed to read the plan file and understand what was about to happen.

The feature-architect skill was also referenced as feature-planner in some docs and feature-design in others. The planner was creating task files directly, bypassing the designated task generator. Workflow chains were implicit. The orchestrator was routing correctly by accident, held together by convention and hope.

This is the kind of rot that doesn't show up in tests. It shows up when you try to explain your system to someone and realize you can't, because the system can't explain itself.

Sixteen tasks to rename things

The fix was a canonical taxonomy: [scope]-[phase]-[outcome]. pm-task-breakdown became task-plan-generator. dev-task-runner became task-execute-runner. feature-architect became feature-plan-design. Fourteen skills, sixteen tasks, three spec documents, five test suites, two documentation passes. Every skill got exactly one canonical name plus backward-compatible aliases resolved at runtime.

More importantly, a single ownership rule: only task-plan-generator writes task files. Planning skills now output design artifacts and a standardized TASK-PLAN-HANDOFF.json, then stop. This killed an entire class of bugs where different skills wrote task files in slightly different formats, silently confusing the downstream QA chain.

Tedious work. The kind nobody writes blog posts about. But the moment it landed, every workflow chain became legible in the plan file. The help text matched what actually ran. And the alias resolver meant muscle memory still worked.
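A runtime alias resolver for this kind of migration can be tiny. A sketch using the renames above; the `resolve_skill` function is my illustration, not Scorpio's actual code:

```shell
#!/usr/bin/env bash
# Sketch: map legacy skill names (and drifted variants) to their one
# canonical [scope]-[phase]-[outcome] name; canonical names pass through.
set -euo pipefail

resolve_skill() {
  case "$1" in
    pm-task-breakdown)  echo task-plan-generator ;;
    dev-task-runner)    echo task-execute-runner ;;
    feature-architect|feature-planner|feature-design)
                        echo feature-plan-design ;;
    *)                  echo "$1" ;;   # already canonical
  esac
}

resolve_skill feature-planner      # prints "feature-plan-design"
resolve_skill task-plan-generator  # prints "task-plan-generator"
```

Old scripts, docs, and muscle memory keep working while every log line and plan file shows only canonical names.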
Four dead sessions that weren't dead

While the taxonomy migration was in progress, a different crisis forced its way in. Four consecutive production runs timed out. The sessions were fine: actively working, writing to their transcripts, making progress. But Scorpio's watchdog only checked stdout. Claude sessions can go long stretches without producing visible output while they think. The watchdog saw silence and pulled the trigger.

The fix was a composite liveness mode called streams_plus_transcript. Instead of one heartbeat signal, the watchdog now monitors two: stdout/stderr growth and Claude's transcript JSONL growth. Either signal resets the idle timer. The implementation includes a probe that can switch to newer transcript files mid-session if the current one goes quiet, and transcript discovery failures degrade safely to stream-only behavior rather than crashing.

This shipped alongside a new scorpio doctor command for structured postmortem analysis of failed runs. Together: thirteen tasks, all through QA, all triage-approved. The false-timeout pattern was the catalyst, but the diagnostic infrastructure it produced will catch problems that haven't happened yet.

The mess real projects make

Taxonomy gave the system a coherent vocabulary. Liveness kept it from killing its own workers. But neither solved what happened next: Scorpio met a real project that had been running for weeks, and the assumptions fell apart.

The stale task problem

In a project that's been active for a while, docs/tasks/ is a graveyard. Completed work sits next to fresh tasks. Scorpio was dutifully replanning all of it, including tasks that shipped days ago, because it had no concept of "finished."

The fix is a pre-plan safety gate. When mixed scope is detected (completed and active tasks coexisting), the orchestrator stops and presents three choices: include everything, exclude completed work, or archive the old tasks and move forward. The default is exclude.
Non-interactive runs get a config knob. No silent archive operations ever.

This required defining "completed" precisely. Two forms of evidence: a done-marker file matching the canonical task ID, or an entry in tasks_completed from a successful historical run. The ID matching uses normalized TASKTYPE-NNN extraction, which immediately fixed a bug where API-002-add-auth-endpoint.md wasn't matching its API-002-report.md done marker.

The scope bleed

Review and triage sessions ran with tasks=["all"]. "All" meant every task the project had ever seen. A review session could pull in completion state from month-old work and produce summaries that mixed current progress with ancient history. The fix was surgical: "all" now expands to the current plan's task IDs. The plan is the boundary.

The ghost runs

The new Scorpio Launcher, a browser-based dashboard for operators who'd rather click than type, surfaced a problem immediately. What does the UI show when current-run points to a run from last week that's marked active, but no process is behind it? A stale-runtime detector now identifies these orphaned states, surfaces warnings with duration context, and lets operators start fresh runs without force-killing phantoms.

Closing the advisor loop

The piece that ties it together is the advisor pipeline. Scorpio has five advisor skills (UX, growth, trust, accessibility, monetization); combined, they produce a synthesis report. Previously, those reports were endpoints: read them, decide what to do, create tasks by hand. Now there's a machine-readable path from finding to execution.

The advisor-synthesis skill writes a JSON sidecar alongside the markdown report. Scorpio's intake validates each finding against a strict schema: id, title, severity, effort, recommendation (all required), ID format enforced, rank validated when present. Malformed findings get logged and skipped. If the entire sidecar is invalid, the system falls back to parsing the markdown.
The operator selects which findings to act on, and the intake produces a TASK-PLAN-HANDOFF.json that feeds directly into the execution pipeline. Strict on input. Graceful on failure. The same pattern that kept showing up everywhere else.

What this actually taught me

Naming is architecture. Not in the abstract sense. In the concrete sense that when your orchestrator, your skills, your CLI, and your documentation use different words for the same concept, every new feature has to navigate that drift. Sixteen tasks to rename things sounds like overhead. It was the highest-leverage work of the month.

Defaults are policy decisions. Choosing exclude_completed as the default pre-plan action is an opinion about forward-only workflows. Requiring --archive for destructive operations is a bet that annoying is better than destructive. These aren't technical choices. They're product choices baked into 8,600 lines of Bash.

Instrument failures before they become patterns. Four false timeouts in a row would have been a frustrating mystery without the diagnostic surface to catch them. scorpio doctor exists because the alternative was staring at log files and guessing.

What's next

The orchestrator has more than tripled in size since early February. The launcher works but is early. The advisor pipeline passed smoke tests and needs a live project run. The immediate question isn't what to build next, it's whether 8,600 lines of Bash is a feature or a problem. Probably both.
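One concrete piece from above worth showing: the canonical-ID normalization that fixed the API-002 mismatch comes down to a single extraction step. A sketch; the regex is my guess at the shape, not Scorpio's exact code:

```shell
#!/usr/bin/env bash
# Sketch: reduce any task-related file name to its TASKTYPE-NNN core,
# so a task file and its done marker resolve to the same canonical ID.
task_id() {
  basename "$1" | grep -oE '^[A-Z]+-[0-9]+'
}

a=$(task_id "API-002-add-auth-endpoint.md")
b=$(task_id "docs/tasks/API-002-report.md")
[ "$a" = "$b" ] && echo "match: $a"   # prints "match: API-002"
```

Everything after the numeric part (the human-readable slug, the -report suffix) is ignored for matching, which is exactly what the done-marker comparison needs.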
[Header image: animated dog adjusting a large clock in a gothic room; dark colors, staircases, and crooked towers.]
By JW Block • February 26, 2026
Scorpio was occasionally killing a long-running dev session while the agent was clearly still working. The frustrating part is, the system wasn't being lazy about it. It was checking. It just wasn't checking the right things. That's not a failure of intelligence. That's a failure of supervision. And if you've been following along, you know I care a lot about the difference.

But this post isn't really about the timeout bug. It's about how I fixed it, and what that process revealed about why agentic development works when it works, and why it falls apart when it doesn't. The answer, for the hundredth time, is specificity.

The Bug

Scorpio already had a two-part timeout. This wasn't some rookie wall-clock guillotine. The logic checked two things: is the stdout/stderr file size still growing? And has the total max time elapsed? If either condition triggered, the session died.

On paper, that sounds reasonable. In practice, it had a blind spot big enough to drive a truck through. Real work in Scorpio doesn't always show up in stdout/stderr. Agents have thinking phases that aren't chatty. Tool calls block while work happens elsewhere. Transcript growth can be the only signal that the agent is alive and productive, but the watchdog wasn't looking at transcripts. It was staring at stdout like a dog watching one door while the burglar walks in through the window.

So the agent is mid-task, making real progress, the transcript is growing, work is clearly happening. But stdout has gone quiet for a stretch. The watchdog sees a flatlined file size, decides nobody's home, and kills the session. You get a timeout message that implies the agent stalled, when the truth is: we killed a productive worker mid-sentence because we were checking the wrong vital signs. That's the pipeline equivalent of "Sorry, we close at 5," while you're literally standing at the counter holding your credit card.

When I was building Lisa, I learned the hard way that trust is the whole game.
If you're asking someone to hand control to an autonomous pipeline, to walk away and let the system work, you can't have random, undeserved failures. Lisa had plenty of those, and it's one of the reasons she never completed a successful end-to-end run. I wasn't going to make the same mistake twice.

One Bug, Nine Tasks

Here's where the real lesson lives. I could have filed one task: "Fix the timeout bug." Given that to an agent (or a developer) and hoped for the best. That's what most people do. That's also why most agentic workflows produce inconsistent results.

Instead, I decomposed the problem into nine specific tasks, each with its own clearly defined scope, acceptance criteria, and dependencies:

TEST-450: Add failing regression tests that reproduce the exact failure mode: active session, work happening, killed anyway. These tests had to fail before the fix and pass after. You write the proof of the disease before you write the cure.

CORE-450: Add the dual-timeout config surface. Separate session_timeout_hard_s from session_idle_timeout_s as explicit, independent controls. Preserve backward compatibility with the legacy SESSION_TIMEOUT variable. This is pure plumbing: no behavior change yet, just giving the system the vocabulary to express what it couldn't before.

CORE-451: Implement the liveness-aware watchdog in run_session. This is the behavioral heart of the fix: monitor activity, refresh liveness when work is detected, terminate only on idle timeout or hard cap, and log which one killed the session. Same two questions as before, better answers.

TEST-451: Add the integration matrix. Cover the full scenario space: active session survives past idle threshold, idle session times out, hard cap always terminates, resume works correctly after timeout failures. Both verbose and non-verbose modes.

CORE-470: Build the Claude transcript activity probe. This is the missing sense organ.
Resolve transcript candidates, track file size and mtime as activity signals, return deterministic states: activity, no activity, or unavailable. Fail safely: probe failure never crashes execution.

CORE-471: Implement composite liveness mode. Upgrade the watchdog from single-source idle tracking to a composite evaluator: stream growth plus transcript activity. Idle timeout fires only when all enabled signals are quiet for the full idle window. This is the fix for the blind spot.

CORE-472: Tune defaults and add activity mode config. Introduce session_activity_mode (streams vs streams_plus_transcript), default to composite for Claude sessions, raise the non-QA idle timeout from 300 seconds to 900 seconds. This is where operator experience gets dialed in.

DOC-450: Document the timeout model, tuning guidance, and operator runbook. Migration notes from legacy config. Troubleshooting section for diagnosing timeout reasons from logs and artifacts.

DOC-470: Document composite liveness behavior. Explain activity signals, evaluation order, and the "idle timeout with active transcript" diagnostic path. Concrete recommended ranges for tuning.

Nine tasks. One bug. And every single one of them has acceptance criteria you can evaluate as a boolean: done or not done. Pass or fail. 1 or 0.

Why This Matters More Than the Fix Itself

In my first post, I talked about the core lesson from building both Lisa and Scorpio: LLMs can do almost anything, if the task is finite enough to define success as a boolean. This is what that looks like in practice. Not as a principle. As a dependency graph.

If I'd filed one task, "fix the timeout," and pointed an agent at it, here's what would have happened: it would have changed some code, maybe improved the obvious case, missed the transcript probe entirely, skipped the regression tests, left the docs stale, and broken backward compatibility.
I know this because that's exactly what happened with Lisa, over and over, on problems far simpler than this one. The specificity of the task is what produces reliability.

Each of those nine tasks is small enough that an agent (or a human) can hold the entire problem in their head, know exactly what "done" means, and either get there or clearly fail. No ambiguity. No scope creep. No "well, it sort of works."

And the dependency chain matters too. TEST-450 comes before CORE-450, because you write the failing test before you write the fix. CORE-470 builds the probe before CORE-471 wires it into the watchdog. DOC tasks follow implementation tasks. The order isn't arbitrary; it's the architecture of confidence.

The Broader Pattern

This is what I keep relearning, and what I think most people building agentic systems are still underestimating: the quality of your orchestration output is determined almost entirely by the quality of your task decomposition.

Everyone wants to talk about prompts. Model selection. Temperature settings. Agentic loops. Those things matter. But they're all downstream of a more fundamental question: did you define the work clearly enough that success is unambiguous? If you did, most models will get you there. If you didn't, no model will.

Scorpio's skill system exists to serve this principle. Each skill definition specifies exactly what an agent should do, how it should do it, and what constraints apply, per task type. A dev-task-runner skill behaves completely differently than a bug-investigator skill. The orchestrator matches skills to tasks and injects the right definition into each session. Task-specific behavior, not generic labels. That was the whole lesson from Lisa's failure: role without specificity is just a costume.

The timeout fix is a clean example because it's technical enough to be precise but human enough to feel the stakes. Nobody wants their productive work killed by a half-blind watchdog.
But the reason the fix worked, the reason it shipped clean and didn't introduce new bugs, is that it was nine problems, not one. And each one was solvable.

If You're Building Something Similar

The timeout pattern is worth stealing on its own:

  • Separate idle timeout from hard timeout. (Scorpio already had the two-part check. This is table stakes.)
  • Track liveness across all available signals, not just the obvious one. Output growth beats vibes, but transcript growth beats stdout alone.
  • Always log the termination reason.
  • Design degradation paths. Missing telemetry should narrow supervision, not crash it.

But the deeper pattern is the decomposition:

  • Break every fix into the smallest tasks that can be independently verified.
  • Write failing tests before writing fixes.
  • Define acceptance criteria as booleans. Pass or fail. No "sort of."
  • Order tasks by dependency, not by vibes.
  • Document as you go, not after.

Agents don't need more autonomy before they get more supervision. And supervision doesn't need to be smart. It needs to be specific.

Or, in Scorpio terms: if you're going to hand a flamethrower to the worker, at least give it a checklist that fits on one page.
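And since the composite-liveness check is the part most worth stealing, here's a minimal sketch of the two-signal idle test. The file names, function, and state variables are illustrative, not Scorpio's actual implementation:

```shell
#!/usr/bin/env bash
# Sketch: the idle timer resets if EITHER the stdout log OR the
# transcript JSONL has grown since the last check. The watchdog only
# fires when all enabled signals are quiet for the full idle window.
set -euo pipefail

last_stdout=0; last_transcript=0

is_alive() {
  local s t
  s=$(wc -c < stdout.log)
  t=$(wc -c < transcript.jsonl)
  if [ "$s" -gt "$last_stdout" ] || [ "$t" -gt "$last_transcript" ]; then
    last_stdout=$s; last_transcript=$t
    return 0   # some signal moved: the session is alive
  fi
  return 1     # every signal quiet: let the idle clock keep running
}

: > stdout.log; : > transcript.jsonl
is_alive || echo "idle"                  # nothing has grown yet
echo '{"type":"thinking"}' >> transcript.jsonl
is_alive && echo "alive via transcript"  # stdout quiet, transcript grew
```

The old single-signal watchdog was this same loop with only `stdout.log` checked; the whole class of false timeouts came from that one missing `||`.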