
A Scalable Workflow for Herding AI Agents Toward Your Goals

Published on February 19, 2026 7:35 PM GMT

TL;DR: I’ve been using Claude Code’s subagent system to build and maintain complex projects with minimal manual oversight: specifically, my daytime research flow and an iOS app. The core pattern: treat your main Claude Code instance as a manager that delegates to subagents, not a laborer that does everything itself. This post covers the practices that make this work: persistent specs, scope declarations, task parallelization with git worktrees, building skills/tooling to automate yourself out of the loop, and command scoping to keep the whole thing running unattended.

(This is a linkpost to https://ethanelasky.github.io/posts/claude-code-workflow.html)

I’ve been iterating on this workflow across several projects (most extensively my debate/AI safety research and an iOS app) over the past few months. What started as “just use subagents” has turned into a fairly specific set of practices that I keep converging on. I originally sent a version of this as a rambling iMessage to a friend and then realized it might be useful to more people, so here it is cleaned up (with the help of Sonnet 4.6).

Here is a workflow I’ve been using to get serious, compounding leverage out of Claude Code’s subagent system. The core idea is simple: you have a “review agent” (your main Claude Code instance) whose job is to delegate, supervise, and integrate, not to do the grunt work itself. Think of it as your friend who manages a team. You don’t want your friend’s context window getting clogged with low-level execution, because once it fills up, you’ve effectively lost your best coordinator. Make the subagents do the hard work.

Below are the practices that make this actually work at scale.

1. Make a Spec and Treat It Like Your Bible

LLMs don’t have continual learning. They’re also not always great at remembering something you said 10 turns ago.
This is why having a written spec that the agentic system can reference is so important: it’s the persistent source of truth that compensates for the model’s lack of memory.

Building a spec from an existing codebase:

  1. Have the review agent dispatch Explore subagents to comprehensively traverse the codebase and produce sub-specs of different areas.
  2. The review agent supervises these subagents and integrates their findings into a comprehensive spec.
  3. Once the spec exists, dispatch subagents to review functionality against individual parts of the spec.
  4. Then dispatch subagents to review cross-cutting flows (e.g., research configs, user workflows in an app) that exercise interactions across the codebase. This catches things the earlier, scope-limited review missed.
  5. Subagents report findings up; the review agent updates the spec. Repeat steps 3–4 until no further issues are found.

You can also have the review agent command subagents to run actual configs and predict the output before it occurs: how long it will take, what the result will be, etc. The subagent then checks itself against reality, and any divergences get incorporated into the spec. (The spec should maintain a separate section for “bugs” or “behaviors that seem to go against the intended design”. This keeps the intended-behavior spec clean while still tracking known issues in a persistent, referenceable place.)

Between steps, the review agent should summarize findings and report its next steps to the user (writing to a log file is ideal). In a perfect world, the human’s only burden is reading a summary at the end and doing a final quality check. The review agent should not present work to you unless it’s either flawless or it has tried at least eight times and is stuck.

Adding new features:

Have the review agent dispatch subagents to do relevant exploration, then have it write a draft spec update (make sure your spec is committed so you can revert).
Claude Code’s plans are usually pretty good at architecting these draft updates, but you should have the plan agent (or a fresh subagent) iterate on the plan and make it more specific before executing. The devil is in the details: you often don’t discover that something doesn’t work or doesn’t account for edge case X in your codebase until implementation is halfway through, which wrecks your velocity. It’s far cheaper to have an over-detailed plan than to recover from a half-built wrong approach. Have models critique the draft until you’re satisfied. Then follow the same dispatch-review-fix cycle described above.

2. Scope Declaration

It is helpful for models to know how hard they should try and what quality bar they’re targeting. A code change to a microservice at Facebook is very different from exploratory graph creation for a blog post.

Identify the scope you’re working in and push the model to work at the highest echelon of that scope. For example, if you’re writing a machine learning paper, tell the model that outputs should be worthy of an ICML best paper award.

Avoid overly personal scope definitions like “write this how [specific researcher] would.” Models will shallowly overfit to the name. The exception is if you genuinely want to emulate someone’s style. If so, make style exploration a separate task in which a subagent dedicates its full effort to producing a style spec. Then have your agent reference that spec rather than relying on name-dropping.

3. Task Parallelization

Section 1 already relies heavily on this. The primary benefit is speed: many workflows can be parallelized (exploration, review, independent coding tasks).

Some workflows cannot be parallelized: tasks hitting an external API with a global rate limit, tasks bottlenecked on GPU availability, etc.
For these, run sequentially with high internal batch size.

When the review agent faces a complex problem, it should construct a dependency graph and dispatch parallel subagents for independent subtasks (max batch size of ~5 subagents, to avoid context compaction problems from overloading the coordinator).

Git worktrees for parallel agents: If you’re running multiple review agents in parallel (or even if you’re not; you can’t always know), git worktrees are essential. The idea is simple: every agent session gets its own worktree so there are no filesystem conflicts.

# From main repo, create a worktree with a new branch
git worktree add ../myproject-<task-name> -b <branch-name>
cd ../myproject-<task-name>

# Install dependencies (required for each new worktree)
uv sync # or npm install, etc.

If your workflow involves a local server, run it on a different port per worktree to avoid conflicts. Before pushing, always sync with main, since other agents may have pushed while you worked:

git fetch origin main
git merge origin/main # Resolve any conflicts, then push
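
For the one-port-per-worktree advice above, one option is to derive the port from the worktree's directory name, so every session lands on the same port each time without coordinating. This is a minimal sketch, not something from my actual setup: the function name `port_for_worktree` and the 8000 to 8999 range are assumptions for illustration. It uses `hashlib` rather than Python's built-in `hash()`, because the latter is salted per process and would give a different port on every run.

```python
import hashlib
from pathlib import Path


def port_for_worktree(worktree: Path, base: int = 8000, span: int = 1000) -> int:
    """Map a worktree directory name to a stable port in [base, base + span).

    Hypothetical helper: only the final path component is hashed, so the
    same worktree name always yields the same port.
    """
    digest = hashlib.sha256(worktree.name.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as an integer, folded into the range.
    return base + int.from_bytes(digest[:4], "big") % span
```

Calling `port_for_worktree(Path.cwd())` from inside a worktree gives that worktree's port. Two names can still collide within a 1000-port range, so treat this as a default to check rather than a guarantee.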

Clean up after merging:

cd /path/to/main-repo
git worktree remove ../myproject-<task-name>
git branch -d <branch-name>
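
Stepping back to the dependency-graph dispatch described at the start of this section: the sketch below shows one way a coordinator could turn a task graph into dispatch waves of at most five subagents. It is an illustration under assumptions, not how Claude Code schedules anything internally. The function name `plan_waves` and the task names in the usage note are hypothetical; `graphlib.TopologicalSorter` is from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def plan_waves(deps: dict[str, set[str]], max_batch: int = 5) -> list[list[str]]:
    """Group tasks into dispatch waves.

    deps maps each task to the set of tasks it depends on. Every task lands
    in a later wave than all of its dependencies, and no wave holds more
    than max_batch tasks.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: list[list[str]] = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies completed.
        ready = sorted(sorter.get_ready())
        # Split an oversized ready set so the coordinator never juggles
        # more than max_batch subagents at once.
        for start in range(0, len(ready), max_batch):
            waves.append(ready[start:start + max_batch])
        sorter.done(*ready)
    return waves
```

For example, with `{"explore_api": set(), "explore_ui": set(), "write_spec": {"explore_api", "explore_ui"}, "review_api": {"write_spec"}, "review_ui": {"write_spec"}}`, the two explorations form the first wave, `write_spec` the second, and the two reviews the third. Splitting a large ready set into consecutive waves is conservative: the second chunk waits for the first even though it technically could run alongside it.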

A good rule of thumb: never edit the main worktree directly. Put this in your CLAUDE.md so agents internalize it. git worktree list is your friend for keeping track of what’s active.

4. Skills and Know-How

You probably have preferences that override the model’s defaults. For example, the pyplot graphs Claude produces out of the box are built for PowerPoint insertion (small fonts, titles present), which is far from paper-quality. (LaTeX figures in ML papers usually omit titles because the caption serves that purpose, and titles waste precious space.)

You can encode these preferences in CLAUDE.md, but if it starts getting long (>500 lines), break them out into separate skill files that Claude references when needed. You can have Claude create these skills, and choose whether they’re project-specific or general. See the skills documentation for details.

Skills for automation: Keep a mental tally of where you’re manually involved in the workflow. (If you’re not naturally good at noticing things you want to automate, consider actually writing these down, e.g., in Obsidian. The act of being aware of automation potential is a skill in itself.) For example, if you’re building an iPhone app, you might be scrolling through screens and noting issues by hand. This is fine as a quality check but grows tedious fast, especially when you tell Claude to fix things, it comes back, and half the fixes didn’t work while one or two introduced new bugs.

Invest in upgrading your tooling. For example, Claude and I developed a workflow using screenshots, native iOS swipes, and Maestro to let it run gestures and ingest the resulting screenshots to verify changes went through. Claude is persistent: it’ll work for ~5 turns to fix issues, which resolves the vast majority of cases.

Blind spots: Be aware of what your tools can’t catch. Claude is good at detecting text overlap in images but bad at catching small spacing issues a human would notice immediately.
If you primarily oversee app changes via screenshots, don’t be surprised when it can’t tell your app is laggy somewhere. Sometimes you can add tools to compensate; sometimes you can’t, and a notification system for when problems arise is valuable here. (As of this writing, a Claude Code WhatsApp integration exists as an MCP server that I haven’t tested yet but plan to soon. Something like this would be ideal for getting pinged when the agent hits a wall rather than having to babysit it.)

Generalization failures: Claude often won’t take a skill you’ve given it in one context and apply it elsewhere. If you say “the keyboard-hiding function is broken on screens X, Y, and Z,” it will often fix only those three screens rather than inspecting all screens for the same issue. Adding “generalize the instructions I give you” to CLAUDE.md helps but doesn’t fully solve this. Expect to prompt for generalization more than you would with a competent engineer.

Self-improvement: Claude is mostly unresponsive to self-improvement instructions that are not part of the current turn. I have tried instructing it to update its skills whenever it receives information that updates its priors, but to no avail. This might be an issue of me not prompting Claude enough, but I often find myself having to manually tell Claude to think about what skills/knowledge it may want to update given the current trajectory, and then having it update its skills and CLAUDE.md, as it seems unable to do so unprompted.

5. Command Scoping

It’s critical that subagents use native Claude Code tools rather than arbitrarily powerful command-line tools like sed.
Powerful tools are likely to trigger human-review prompts, which is death for an automated system you want to leave running for hours.

If a subagent hits permission issues, the review agent should first try spinning up a new agent for the same task with elevated permissions, passing along context from the old agent’s trajectory so it doesn’t have to rerun everything. If that still fails, the review agent should handle the task itself between steps. If that triggers a human input request, the review agent should flag it prominently: in the final paragraphs sent to the user and in the summary log, not buried in mid-trajectory output. The user can then either update Claude Code’s permissions (if reasonable) or work with the model to find a more restricted alternative.

You may also want to ban certain tools for other reasons, e.g., reading long files that explode context lengths. In these cases, build custom tools that let the model access the information more efficiently (and optionally build yourself a simple FastAPI frontend to explore the same data).

For example, here’s what this looks like in practice for one of my projects: a debate/AI safety research codebase where reading transcript and evaluation files directly would blow up the context window. Near the top of the project’s CLAUDE.md:

**NEVER use the Explore agent, Task tool with Explore, or direct file reads
for transcript/evaluation tasks.**

ALWAYS use these CLI tools instead:

# Set this prefix for all commands
# (env keeps the variable assignments working after $UV is expanded in bash)
UV="env INPUT_ROOT=. SRC_ROOT=. uv run python"

# Transcript counts and summary
$UV tools/transcript_viewer/list_transcripts.py --run-id "<run>" --summary

# Unevaluated transcripts
$UV tools/transcript_viewer/list_transcripts.py --run-id "<run>" --status unevaluated

# Evaluations filtered by review status
$UV tools/transcript_viewer/list_evaluations.py --run-id "<run>" --status needs_cc_subagent_review

# View evaluation content (NEVER read .json files directly)
$UV tools/transcript_viewer/show_content.py --run-id "<run>" --filename "<file>" --show all
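
To make the "custom tools instead of direct file reads" idea concrete, here is a rough sketch of what a context-efficient summary tool could look like. It is not the code behind the `tools/transcript_viewer` scripts above. The function names and the assumption that each JSON file has a top-level `"status"` field are hypothetical, chosen only to show the shape: the agent gets a few lines of counts instead of the full contents of every file.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_statuses(run_dir: Path) -> dict[str, int]:
    """Count the JSON files in run_dir by their top-level "status" field.

    Assumed schema: each file holds one JSON object. Files without a
    "status" key are counted under "unknown".
    """
    counts: Counter[str] = Counter()
    for path in sorted(run_dir.glob("*.json")):
        with path.open(encoding="utf-8") as f:
            record = json.load(f)
        counts[record.get("status", "unknown")] += 1
    return dict(counts)


def format_summary(counts: dict[str, int]) -> str:
    """Render the counts as a short report that is cheap to put in context."""
    lines = [f"total: {sum(counts.values())}"]
    lines += [f"{status}: {n}" for status, n in sorted(counts.items())]
    return "\n".join(lines)
```

A thin command-line wrapper that prints `format_summary(summarize_statuses(Path(run_dir)))` would give the agent a handful of lines per run, however large the underlying files are.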

Areas I’d Love Input On

There are two topics I think are valuable but don’t have enough experience with to write about authoritatively:

Managing agentic teams. Most of this post is about one human coordinating one review agent and its subagents. What changes when you have multiple review agents working on different parts of a project simultaneously? How do you handle conflicts, keep specs consistent, and avoid duplicated work at that scale?

Managing Claude Code while doing other things. Right now, getting the most out of this workflow still requires more babysitting than I’d like. Notification systems, mobile communication (e.g., a Claude Code WhatsApp integration; there’s an MCP server for this I haven’t tested yet), and other “async oversight” patterns seem important but underexplored. How do people stay in the loop without staying at their desk?

If you’ve figured out good practices for either of these, I’d be very interested to hear about them in the comments.

Thanks go to Bryce Szarzynski and Frank Nakasako for ideas, feedback, and advice.

Related: Karpathy on Claude Code

The meta-principle across all of this: treat the review agent as a manager, not a laborer. Offload execution to subagents. Persist knowledge in specs and skills. Automate your own role out of the loop as much as possible. The ideal state is one where you check in, read a summary, and everything is already done correctly.
