Anthropic Workshop: Build Agents That Run for Hours — Ash Prabaker & Andrew Wilson
1:15:40
AI Engineer · 24 days ago


Two engineers from Anthropic's Applied AI team, Ash Prabaker and Andrew Wilson, walk through what it actually takes to keep a coding agent productive for five-plus hours: a year of model and harness co-evolution that took runs from 20 minutes to 12+ hours, and the internal harness recipe behind their one-shot app demos. That recipe combines a planner that writes deliberately vague specs, a generator and an adversarial evaluator that negotiate "done" into testable contracts, taste rubrics that make design gradable, and a debugging loop that is mostly reading traces by hand. A 35-minute audience Q&A covers Ralph loops, agent teams, traceability, and human-in-the-loop trade-offs.

## [00:00] Introduction and speakers

Ash Prabaker opens with introductions: he and Andrew Wilson are engineers on Anthropic's Applied AI team, and the session grew out of a blog post the team published a couple of weeks earlier on agents that keep working for extended stretches. Companies love showing one-shotted-a-browser demos, he notes, but rarely share what's inside the harness; that gap is the agenda. Andrew takes history and shipped primitives; Ash returns for the experimental half.

> *We're talking 5 6 hour plus kind of runs.*

## [01:21] Overview of long-running agents

Andrew, a solution architect based in London, frames the year with a quote from Boris, Claude Code's creator, on the tool's first anniversary: a year ago Claude struggled with bash commands and string escaping; now nearly all of Claude Code is written by Claude Code, with runs lasting days.

> *it could run for, you know, maybe 20 minutes at a time.*

## [02:29] Challenges: Context, Planning, and Judgment

Three buckets explain why long runs are hard. Context: windows are finite, new sessions start with amnesia, coherence rots as the window fills, and models near the limit exhibit "context anxiety", rushing to finish. Planning: models try to one-shot everything, build half a feature and stop, or run out of context mid-app.
Judgment, the least intuitive: models are poor critics of their own output, declaring a half-baked feature done or shipping a button with no backend behind it.

> *models are really bad at judging their own output*

## [04:14] Two approaches: Model updates vs. Harness evolution

Fixes come from two directions. One is to bake ability into the weights: the METR chart (the task length at which an agent succeeds 50% of the time on a minimal scaffold) went from about 1 hour on Sonnet 3.7 to 12 hours on Opus 4.6 a year later. The other is to change the harness: the Agent SDK ships the core primitives (the agent loop, MCP tools, sub-agent delegation, CLAUDE.md, skills, slash commands, the permission system). Andrew's running observation: every model release shipped harness changes alongside it.

> *when we've released a model we've always also released a lot of harness changes alongside the models*

## [05:58] Prehistory: Sonnet 3.5, Computer Use, and MCP

Before Claude Code existed there were artifacts on Claude.ai, and Sonnet 3.5, the first model that showed real coding promise because it could look at what it had built and iterate. Computer use added clicking, screenshots, and self-testing; the MCP spec gave it tools.

> *That was quite an aha moment sort of pre-Claude code.*

## [06:34] The evolution of Claude Code

February 2025: Sonnet 3.7 lands state-of-the-art on SWE-bench and Claude Code ships as a research preview, explicitly to learn how developers use Claude for coding and feed that back into the model. That sets the recurring trend: as models improve, harness pieces become unnecessary or evolve. By May, Opus 4 and Sonnet 4 manage their own context better and reach task completion without reward hacking; Claude Code goes GA with an SDK.
> *the goal of Claude code was to better understand how developers use Claude for coding to inform future model improvements*

## [07:55] The Ralph loop technique

An interlude on the Ralph Wiggum technique: Geoffrey Huntley published it last July, and traction arrived around December. The simple version: feed a prompt into the CLI on a loop until the tasks are done. The real version has phases: plan the prompt into features, pick one task, start a fresh session with a clean context window. Its appeal is captured in Huntley's "deterministically bad in an undeterministic world." Anthropic's own plugin runs inside a single session instead, relying on compaction, max iterations, a safe word, and a stop hook.

> *it's better to fail predictably than it is to succeed unpredictably*

## [09:49] Sonnet 4.5, Agent SDK, and checkpoints

Sonnet 4.5 starts tracking its own token consumption, context-aware enough to manage the end of its window instead of panicking. Claude Code 2.0 introduces checkpoints for rewinding a session. The Claude Code SDK is renamed the Agent SDK because the team realized the harness generalizes beyond coding. Runs reach roughly 30 hours.

> *we realized it's much more general purpose than actually just for coding*

## [10:49] Opus 4.5 and the role of sub-agents

Haiku 4.5 and Opus 4.5 complete the family, and the economics shift: many sub-agents become affordable, and Opus 4.5 plans well, so Opus plans while Sonnet executes. Skills arrive with progressive disclosure (only frontmatter loads up front), and programmatic tool calling lets the model write code to chain tool calls and return just the final result instead of dumping everything into context.

> *all of a sudden running many sub-agents became really economical*

## [12:05] First long-running agent patterns

Around November the team published its first long-running-agents blog post.
A human writes something vague ("create a Slack clone") and an initializer agent breaks it into persistent artifacts: a feature list stored as featurelist.json (models overwrite markdown more readily than JSON), a progress file, a git repo, an init script. The harness loop then runs in fresh context windows: get bearings, run the init script as a smoke test, pick exactly one unfinished feature, implement, verify with Puppeteer, commit, repeat.

> *the models might overwrite markdown files, whereas they're they're less likely to just overwrite JSON files*

## [14:20] Opus 4.6, Agent Teams, and server-side compaction

Sonnet 4.6 offers near-Opus intelligence at Sonnet pricing and becomes the workhorse; Opus 4.6 is "very much an agentic model", and the METR figure jumps from ~4 to 12 hours on a minimal scaffold. Agent teams ship: sub-agents coordinate directly with each other and report to the main agent only when needed. Server-side compaction means sessions can effectively run indefinitely, and 1M context goes GA, nudging the design question toward fewer fresh sessions and one big window. Andrew's closing point: the harness doesn't vanish as models improve; gaps get filled by the harness, the model trains on that, and pieces get deleted.

> *the harness doesn't just disappear as the models get better*

## [17:28] State-of-the-art harness patterns

Ash polls the room (only two or three people have agents running in the background right now), then lays out the core pattern, borrowed shamelessly from GANs: a generator builds, a standalone evaluator grades, with adversarial pressure between separate context windows, system prompts, and jobs. The evaluator doesn't read diffs; it opens live pages with Playwright, clicks around, and hands critique back. Why doesn't an LLM evaluator just rubber-stamp LLM output?
The gap they exploit: tuning a standalone critic to be harsh is tractable; tuning a builder to be self-critical is not. It is the same with humans, where critiquing a meal is easy and cooking it is hard.

> *The evaluator here isn't just reading diffs, but it's actually using playwright, um, to open live pages, click around, try things out*

## [21:30] Evaluating subjective output with rubrics

Most people say you can't grade taste; the team disagrees: if you hold a strong enough opinion, write it down. Their rubric scores design, originality, craft, and functionality, weighted toward the first two since Opus 4.6 already handles functionality; the real fight is purple gradients and AI-slop aesthetics. Few-shot examples on reference sites calibrate the evaluator's taste to their own. The distinctive behavior this unlocks: when the generator keeps scoring low on originality, the GAN-style harness throws everything out and restarts, where a single loop would keep patching the same thing.

> *most people say you can't grade taste, but, you know, we think you can if you have a a strong enough opinion on it and you just kind of write it down*

## [23:44] Introducing the 'Planner' role

To go from nice pages to working apps they added one more role. The planner turns a one-line prompt into a deliberately high-level spec (a series of sprints) and explicitly does not plan granular technical details, because a wrong detail cascades through every sprint and magnifies over multi-hour horizons. Squint and it's a PM/IC/QA org chart.

> *We just kind of gave each role its own kind of context window.*

## [25:04] The generator-evaluator contract

The glue between generator and evaluator: before a single line is written, the two agents negotiate what "done" means. The generator proposes a feature and tests; the evaluator pushes back (scope too big, tests too weak, missed edge cases) via markdown files on disk until both agree.
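
The negotiation over "done" can be pictured as a small propose-and-review loop that leaves a markdown trail on disk. What follows is a minimal sketch under stated assumptions, not Anthropic's harness: `negotiate_contract`, `Proposal`, `Review`, the file names `round_N.md` and `contract.md`, and both stub functions are hypothetical, and the stubs stand in for real model calls.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List, Tuple
import tempfile


@dataclass
class Proposal:
    feature: str
    tests: List[str]


@dataclass
class Review:
    approved: bool
    objections: List[str] = field(default_factory=list)


def negotiate_contract(
    propose: Callable[[List[str]], Proposal],
    review: Callable[[Proposal], Review],
    workdir: Path,
    max_rounds: int = 5,
) -> Tuple[Proposal, int]:
    """Alternate proposal and review until the evaluator approves.

    Each round is written to disk as markdown so that either agent could
    read the other's latest position from a fresh context window.
    """
    objections: List[str] = []
    for round_no in range(1, max_rounds + 1):
        proposal = propose(objections)
        verdict = review(proposal)

        lines = [f"# Round {round_no}: {proposal.feature}", "", "## Proposed tests"]
        lines += [f"- {t}" for t in proposal.tests]
        lines += ["", "## Evaluator objections"]
        lines += [f"- {o}" for o in verdict.objections] or ["- none"]
        (workdir / f"round_{round_no}.md").write_text("\n".join(lines) + "\n")

        if verdict.approved:
            # The agreed tests become the checklist that grading runs against.
            checklist = [f"# Contract: {proposal.feature}", ""]
            checklist += [f"- [ ] {t}" for t in proposal.tests]
            (workdir / "contract.md").write_text("\n".join(checklist) + "\n")
            return proposal, round_no
        objections = verdict.objections
    raise RuntimeError(f"no agreement after {max_rounds} rounds")


def stub_generator(objections: List[str]) -> Proposal:
    # Hypothetical stand-in for the generator: starts with a weak test list
    # and adds one test for every objection it received.
    tests = ["play mode starts"] + [f"covers: {o}" for o in objections]
    return Proposal(feature="play mode", tests=tests)


def stub_evaluator(proposal: Proposal) -> Review:
    # Hypothetical stand-in for a harsh critic: rejects any contract that
    # does not test keyboard input.
    if not any("arrow keys" in t for t in proposal.tests):
        return Review(False, ["arrow keys move the player"])
    return Review(True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        agreed, rounds = negotiate_contract(stub_generator, stub_evaluator, Path(tmp))
        print(rounds, agreed.tests)
```

In this toy run the evaluator rejects the first proposal for missing a keyboard test and accepts the second, so the contract is settled in two rounds. A real harness would replace the stubs with separate agent sessions that each read and write the markdown files.
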
Grading then happens against that contract, not the planner's original spec. Ash calls this the key innovation the Ralph loop never had: nobody argues with the main loop. The proof is a "build a retro game maker" prompt run both ways. Solo loop: pretty screens, but in play mode the arrow keys and space bar do nothing. With the harness (~$200, 6 hours): the app names itself Retro Forge, builds a 54-color sprite editor, turns a vague "AI features" spec line into a working AI level assistant, and play mode has a live debug HUD, a running physics loop, and real collisions. The difference is entirely scaffolding.

> *we have the two agents basically negotiate what done actually means*

## [31:28] Specificity in contracts and debugging traces

What the evaluator actually catches is unglamorous: a FastAPI route-ordering bug that passes unit tests but breaks in prod, a boolean logic bug on the delete key, found only because it uses the app. For the game maker, the agents settled on 27 contract criteria; vague criteria produce vague critiques the generator shrugs off. Ash is candid that out of the box, Claude is a bad QA agent: the same sycophancy that plagues LLM-as-judge had early evaluators filing "fix it later, might take 2 weeks" and moving on. There was no secret fix: the art was reading traces, finding where the model's judgment diverged from theirs, and tuning prompts, plus piping transcripts to files and having another agent grep them to close the loop.

> *If you have vague criteria, you have vague critiques*

## [34:14] Adjusting harnesses as models evolve

Is harness design dead? Ash's answer: learn each model's spiky behaviors and fill the gaps. Moving from Opus 4.5 to 4.6 they dropped context resetting entirely (4.6 has no context anxiety; one continuous session plus compaction suffices), dropped forced sprint decomposition (4.6 holds a 2-hour continuous build coherently), and moved the evaluator from every sprint to the end of each one-shot generation.
The harness wasn't wrong; it was right for 4.5, and the frontier moved. Today's setup keeps the planner-generator-evaluator core, shares state through the file system, and runs at roughly half the previous cost. The demonstration is a DAW the harness built whose music was, by Ash's admission, trash, but whose app was thoroughly fleshed out.

> *it was right for 4.5, the frontier moved*

## [37:56] How to build your own agent harness

None of this requires Anthropic's internal harness. Auto mode covers the safe middle ground; custom sub-agents already exist as a primitive (give your evaluator a harsh system prompt and a detailed rubric); Playwright MCP or Claude for Chrome handles web apps, computer use handles native; skills package grading rubrics into the dev flow.

> *there's nothing stopping you from just going ahead and building something similar to this kind of on your own*

## [39:01] Key takeaways for long-running agents

The photo slide: self-evaluation is a trap, so use an adversarial evaluator. Compaction does not equal coherence: lossy summaries drift; structured handoffs and clean contexts work. Subjective quality is gradable if you force yourself to write the standard down. And sit with the model reading traces; only then do you know which scaffold pieces to delete when the frontier moves.

> *self-evaluation, very much a trap*

## [40:05] Q&A session

Eleven audience members take the mics for 35 minutes. Highlights: evaluator tuning generalizes across projects when you target common model weak points (calibrate with "this is AI slop" examples). On Ralph loops and the model's "smart zone": with 1M context GA and 4.6's coherence, the team moved to one continuous session with compaction, but use your own evals. On watching agents work: Ash sees wanting to watch as a trust gap; the model now reads console errors and spots overlapping text itself.
The 4.6 generation is strikingly willing to throw ten passes away and restart when it can't hill-climb the rubric; one evaluator got fed up and told the generator to delete everything. The planner stays out of the inner loop deliberately; the spec is re-inserted as a reference instead. For products that outlive the run, the harness leaves breadcrumbs: a learnings JSON ("tried this, found this bug, fix worked") plus high-level docs, enough for a human with Claude Code to pick up. Feeding the generator's context to the critic was tried and rejected: judging output alone beats muddying the two streams. Traceability remains mostly reading traces by hand ("you got to read the whole thing"), with Claude-over-traces as a first pass. And on human-in-the-loop sprint reviews: hooks can inject one, but the team optimizes for full autonomy: run ten generations, read the seven failures, tune the harness prompts, repeat.

> *you got to read the whole thing*

## Entities

- **Ash Prabaker** (Person): Engineer, Anthropic Applied AI team; presents the state-of-the-art harness patterns and Q&A.
- **Andrew Wilson** (Person): Solution architect, Anthropic Applied AI (London); presents the model/harness history.
- **Anthropic** (Organization): The speakers' employer; ships Claude models, Claude Code, and the Agent SDK.
- **Claude Code** (Software): Anthropic's coding agent CLI whose one-year evolution frames the talk.
- **Agent SDK** (Software): Renamed Claude Code SDK; ships the agent-loop primitives the harness builds on.
- **Generator-evaluator pattern** (Concept): GAN-inspired split of builder and adversarial critic with separate contexts; core of the harness.
- **Ralph loop** (Concept): Geoffrey Huntley's loop-a-prompt-until-done technique; precursor lacking an arguing counterparty.
- **Playwright MCP** (Software): Browser-automation tooling the evaluator uses to test live apps.

#long-running-agents #agent-harness #claude-code