Podcasts

Hear the voice. See the shape of the thought.

#fable-5 #anthropic #llm-benchmarks

We Tested Anthropic's Fable 5 for a Week

Dan Shipper, CEO of Every, spent a week with Fable 5, Anthropic's Mythos-class frontier model, before its public launch and walked away genuinely changed. Every's senior engineer benchmark put Fable at 91/100, against 63 for Opus 4.8 and 62 for GPT-5.5, a jump Dan describes as "warp drive" capability for sustained autonomous work. The model is slow, expensive, and token-hungry, but for anyone orchestrating big, multi-hour agentic tasks, there's nothing close to it right now.

## [00:00] One prompt built an infinite 3D library

Dan opens with a live demo: a fully browsable 3D version of Jorge Luis Borges's "The Library of Babel" (hexagonal galleries, accurate mathematics from the story, working bookmarks), all generated by a single prompt. He gave Fable a one-line instruction to read the story, plan, and execute a browser-playable 3D game end-to-end. The model ran autonomously for three to four hours, self-checked its work, and shipped.

> *"I made this entire thing in a single prompt with Fable 5, the new model from Anthropic."*

## [01:22] Our day-zero Fable 5 review

Dan introduces himself and Every's approach: they test models hands-on for real production work (programming, writing, design, business decisions) and report back on what actually works. Fable generated unusual levels of pre-release hype; Anthropic had initially said it was too dangerous to release. After a week of internal access, Every's take is that the model is genuinely different, and Dan's goal here is to cut through the excitement and show the realistic picture.

> *"Because we've been using this model for about a week now, we get to pull back the curtain a little bit and show you what it's like to have lived with this model."*

## [02:25] What a Mythos-class model is

Mythos is Anthropic's new top-tier model family, sitting above Haiku, Sonnet, and Opus in their lineup. Architecturally it's not novel: same transformer family, just bigger.
Anthropic added strict safety guardrails (no cyber, no biological use cases) to make it releasable. Pricing is steep: $10/M input tokens, $50/M output, roughly 2× Opus. Dan's verdict from a week of use: genuinely the most powerful coding model he's ever touched, by a wide margin.

> *"It is just genuinely the most powerful coding model I've ever used by far."*

## [03:28] The 91/100 engineering benchmark

Every runs a proprietary senior engineer benchmark: the model is handed a real "vibe-coded slop" production codebase and asked to rewrite it from first principles as a senior engineer would. Prior to Fable, the top score was Opus 4.8 at 63/100, with GPT-5.5 right behind at 62. Fable scored 91, matching a human senior engineer in a single prompt. Dan had expected saturation of this benchmark in about six months; it happened in two weeks.

> *"Fable scored a 91 on this benchmark. 91 out of 100. That's the same score as a human engineer with just one prompt. That's crazy."*

## [04:12] Why it feels like a warp drive

Fable's core strength is sustained autonomous execution over multi-hour tasks. You give it a destination, leave it running, and come back to something finished. Unlike earlier Claude models that eagerly said yes to everything ("purple accents, purple accents"), Fable deliberates, pushes back when something can't be done well, and follows through on complex, loosely specified prompts. Dan's analogy: a warp drive. It is not instant, but it compresses what used to take months into hours.

> *"You can specify a destination for a big trip, and it just compresses what normally would have been like years or months into like hours or days."*

## [06:10] Where the model falls short

The warp drive metaphor cuts both ways: it's useless for getting around town. Tight back-and-forth collaboration, quick questions, rapid iteration: Fable is a poor fit for all of these. It's slow, expensive, and burns tokens aggressively.
A non-obvious workaround: drop the reasoning level to medium or low for simpler questions; that's how Anthropic's own people use it internally. Without a big, meaty problem to throw at it, the model is overkill.

> *"If you're using it for true collaboration or quick questions or things that need tight back and forth, I don't think it's that good for that."*

## [07:04] Building a Heidegger lecture site

Dan describes asking Fable to grab philosopher Hubert Dreyfus's 2007 lectures on Heidegger, without even providing a URL, and turn them into a consumable mini-site. Fable found the lectures, wrote per-lecture summaries, built a synchronized player that highlights the transcript as audio plays, added chapter navigation, drop caps, and typographic choices that Dan characterizes as actual taste, not the default template output. One prompt, no scaffolding.

> *"That's what I mean when I talk about this model having really exceptional taste and attention to detail."*

## [09:05] Finding a growth bet in customer data

Every has ~10,000 paid and ~100,000 free subscribers and a backlog of survey data the team had been analyzing with AI for weeks without a sharp conclusion. Dan fed it all to Fable. In one pass, the model came back with: "You have a conversion merchandising problem. Your free-to-paid conversion ratio is lower than it should be." Then a falsifiable bet: ship pricing transparency and a trial offer, and it'll go up. That synthesis, reading survey responses, site analytics, and product state together, hadn't emerged from weeks of team analysis.

> *"That is something that I would expect a really, really good growth person to do with a lot of time and thought and research."*

## [10:35] Clearing a real GitHub backlog

Every's agent-native markdown editor Proof accumulates GitHub issues automatically as agents file bugs during use. Dan pointed Fable at two weeks of open issues and told it to close irrelevant ones and write Rust fixes for the rest.
It swept through the backlog and produced patches the team actually merged. Other models can do this, but they require hand-holding: one issue at a time, constant check-ins. Fable just batched it.

> *"And it just went boom boom boom boom boom boom. And actually wrote fixes that we merged."*

## [11:17] Who should actually use this model

Dan is direct: Fable is not for everyone right now. Using Every's "eight levels of AI adoption" framework, it pays off at levels 7–8, where users are already orchestrating multiple agents and have large problems queued up (typically technical builders). For knowledge workers not yet running agent workflows, it'll feel like overkill; for casual vibe coders, the token costs are real friction. About half of Every's own early-adopter team saw immediate payoff; the other half is still growing into that workflow level.

> *"Using it is a skill. You need to be exposed to problems and working at a level of expertise where the problems come up in order for it to be useful."*

## [13:31] Where other models still win

Writing is the clearest gap: Fable's prose is dense, literary, and block-heavy. That is good for thinking through structural writing problems, not for copywriting or everyday sentence-level work. For Claude users, Opus 4.8 is still better for writing. For GPT users, 5.5 is a better daily driver. Dan himself keeps GPT-5.5 as his Codex driver for the quick back-and-forth that fills most of his day; Fable gets reserved for big production pushes.

> *"For my day-to-day, it's a bit overkill even for me."*

## [14:26] What this means after automation

Dan points to his essay "After Automation" as the frame: automation doesn't shrink human work, it creates more of it, which is a paradox. Fable follows the same pattern: it raises the floor for non-experts (a vibe coder can now one-shot a video game) and raises the ceiling for experts (an expert can build a AAA game solo).
The displacement is real and he says it's normal to feel unsettled by it, but the capability curve means even people who can't afford Fable today will have access within six to twelve months.

> *"This model increases the floor of capability for non-experts, but it also raises the ceiling for experts."*

## [16:02] The final verdict

Dan closes with a straightforward recommendation: read the full Every vibe check for detailed benchmark breakdowns across coding, writing, and knowledge work, watch "After Automation" for the bigger-picture framing, and then go find the first big problem you've been avoiding and point the warp drive at it.

> *"If you're psyched about this, the thing I recommend most is go use your new warp drive. And let me know what you make."*

## Entities

- **Dan Shipper** (Person): Co-founder and CEO of Every; sole presenter in this episode; spent a week testing Fable 5 pre-launch.
- **Every** (Organization): AI-native subscription media company focused on testing frontier models for real work use cases; ~10,000 paid subscribers.
- **Fable 5** (Software): Anthropic's Mythos-class frontier model; scored 91/100 on Every's senior engineer benchmark at launch.
- **Anthropic** (Organization): AI safety company; maker of the Claude / Opus / Fable model family.
- **Mythos** (Concept): Anthropic's top-tier model family, above Haiku, Sonnet, and Opus; characterized by extended reasoning and high token cost.
- **Senior engineer benchmark** (Concept): Every's proprietary evaluation in which the model rewrites a production codebase from first principles; scored out of 100; Fable hit 91, Opus 4.8 hit 63.
- **Opus 4.8** (Software): Previous Anthropic flagship; scored 63/100 on Every's benchmark; still preferred for everyday writing tasks.
- **GPT-5.5** (Software): OpenAI's comparable frontier model; scored 62/100 on the benchmark; Dan's personal daily driver for quick back-and-forth work.
- **Hubert Dreyfus** (Person): American philosopher; author of "What Computers Can't Do" (1972); subject of the Heidegger lecture site demo.
- **Proof** (Software): Every's agent-native markdown editor; used in the GitHub backlog-clearing demo.
- **After Automation** (Concept): Dan Shipper's essay arguing automation creates more human work rather than eliminating it; referenced as the interpretive frame for Fable's broader significance.
- **Eight levels of AI adoption** (Concept): Every's framework for classifying AI workflow integration depth; levels 7–8 are where Fable delivers the most value.
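
As a rough illustration of the pricing quoted in the [02:25] segment ($10 per million input tokens, $50 per million output tokens), the sketch below estimates what a single run might cost. The token counts and the `estimate_cost` helper are hypothetical assumptions made up for this example; they are not figures or code from the episode.

```python
# Prices quoted in the episode summary, in US dollars per million tokens.
INPUT_PRICE_PER_M = 10.0
OUTPUT_PRICE_PER_M = 50.0


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in dollars for one run (hypothetical helper)."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Assumed multi-hour agentic run: 4M tokens read, 1M tokens written.
long_run = estimate_cost(input_tokens=4_000_000, output_tokens=1_000_000)

# Assumed quick question: 2,000 tokens in, 500 tokens out.
quick_question = estimate_cost(input_tokens=2_000, output_tokens=500)

print(f"long run: ${long_run:.2f}")
print(f"quick question: ${quick_question:.4f}")
```

Under these assumed volumes, output tokens account for more than half of the long run's cost even though far fewer of them are produced, which fits the summary's description of the model as token-hungry and expensive.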

33:53

#saas #ai-agents #developer-tools

The SaaS Apocalypse Is a Goldmine With Figma's Matt Colyer

Figma developer PM Matt Colyer has been building his own AI agents for two years and is buying more software subscriptions than ever, not fewer. He and Every CEO Dan Shipper work through why the "SaaS apocalypse" narrative gets the economics backward, how AI needs to escape the tyranny of the text box to unlock genuinely creative design work, and why the coming year's challenge isn't generation but review: humans are now the bottleneck in a world where agents can ship faster than anyone can evaluate what they made.

## [00:00] AI will create a billion developers

This exchange, taken from later in the interview, opens the episode: Matt argues that the number of developers worldwide, roughly 25–40 million a decade ago, is heading toward a billion. That demographic explosion, not AI replacing software, is what makes the SaaS market a "gold mine." Figma and most established SaaS businesses are, in his view, excited rather than threatened.

> *"If you're in that space, like, it means it's a gold mine, right?"*

## [01:03] Introduction

Dan Shipper frames the conversation: he recently bought Figma stock after noticing the "SaaS apocalypse" discourse, and he wants to know how a company that pre-dates AI is navigating a world where agents can now operate inside your product. Matt, as the director managing Figma's developer products, is the right person to ask.

> *"There are all these people who are like, 'Oh, I don't have to use Figma anymore.' You guys just launched an agent in your product. You also have Figma MCP."*

## [02:15] Why the SaaSpocalypse narrative has it backwards

Matt's counter-argument runs on two tracks. First, the democratization of software creation massively expands the addressable market: more software being built means more demand for the tools, infrastructure, and services that support it. Second, vibe-coding your own app sounds liberating until you're dealing with SMTP upgrades at midnight.
He built his own email agent two years ago and watched it get rickety; these days he pays someone else to run agents for him rather than maintain the plumbing himself.

> *"I'm buying more software these days than I ever did before, because I'm like, 'You know what? That tool seems cool. I'm just going to pay somebody else to run my agent for me.'"*

## [05:27] Matt's email agent origin story

The origin was unglamorous: three kids in three schools, relentless PTO emails, and the humiliation of missing spirit day. Matt wired up a Python script to grab his inbox and paste it to an LLM. The whole thing was rickety and sometimes the replies didn't work, but the core loop worked. He then added a memory system and a daily summary pushed to him proactively, which he flags as the real unlock: instead of having to open a tool and ask, it just showed up. Dan mirrors this with his own Codex-based inbox workflow, now four weeks into inbox zero. The two also land on voice as an underrated interface; Matt uses Loom recordings because it feels less weird than talking to a blank screen.

> *"The unlock for me was like instead of having to go to a tool and ask for the thing, it was just like it would show up."*

## [13:21] Divergent vs. convergent design thinking

Chat-based AI is inherently linear: you iterate on one design thread. Matt's argument is that great design has a diamond shape: first you diverge (generate many directions), then you converge (pick the best). Figma's on-canvas agent is a first attempt to break out of the text-box constraint. On the canvas, an agent can spawn a grid of frames (grayscale, sepia, with different type), and then a separate convergent agent can cluster them and recommend which direction to pursue. Command-line agents can't do this kind of spatial, parallel exploration; that's what the canvas unlocks.

> *"Text boxes are super limiting; it's very much like a linear 'well this and then that.'
If we get to the canvas, the agents allow you to do divergent thinking."*

## [17:39] Figma's MCP server

MCP gives third-party agents (Cursor, Windsurf, Claude Code) a standard interface into Figma. Two flows: code-to-design (fire up a dev server, ask the agent to screenshot a live page and pull it into a Figma canvas) and design-to-code via "Get Design Context," which wraps component properties and design library guidelines into an agent prompt that then creates a branch, writes the code, and posts a screenshot to the PR. Both flows remove the manual copy-paste drudgery that used to live between the design file and the codebase.

> *"You pull up your codebase, fire up the MCP server, and ask it, 'Hey, can you go to this page and copy it into Figma canvas?' And it will actually do it. That's a little bit mind-blowing."*

## [19:45] Why design agents need personalization

Generic agents produce generic output. For Figma, the difference between an okay agent and one people actually love is whether it understands the design system: the components, the spacing rules, the naming conventions. Without that personalization layer, generated designs aren't usable. Matt draws a parallel to the memory systems in chat agents: in Figma's case, the design library is the memory. He also hints at proactive agent work Figma is cooking internally, framing the core problem as maintaining design values at a pace agents can generate.

> *"The thing that really differentiates an okay agent from one that people really love is the personalization aspect. For Figma's version of that, it's the design system."*

## [22:09] Every problem is a context problem

Matt describes a Figma product operations team that realized every recurring PM task (onboarding docs, project tracking, team introductions) was a context problem in disguise. They built "PMOS": a local SQLite org chart wired to Asana, Slack, and GitHub, then layered Claude Code skills on top.
When a new team member joins, the system walks the org chart, reads the last 30 days of Slack channels, checks the Asana board, and produces an uncannily good onboarding file. Dan points out that Claude Code's power comes from the same insight: instead of an always-on cloud agent you have to manually wire to everything, it's an agent that already has access to everything on the user's machine.

> *"One of the unlocks to me about AI is like you kind of realize every problem becomes a context problem. The work becomes about framing the problem with the right set of information."*

## [25:12] Apple and Google as the reigning kings of context

Matt has been waiting for Apple Intelligence to deliver on its WWDC promise: phones hold all the personal data, so an always-on, actually-smart Siri should be the obvious product. It hasn't arrived. He's watching Google's rumored "Spark" agent (always-on, connected to all Google content) with similar anticipation. Dan's take: Apple wins regardless because everyone runs AI on Mac hardware, giving them time to catch up. Matt adds that Apple's privacy-first positioning is a genuine strategic asset, not just PR.

> *"Even being late to the game, they are still the king of context. And I think that's what's been interesting to watch about Google I/O this year: seemingly Google has also kind of woken up to that."*

## [28:18] Why review is the new bottleneck

Generation is no longer the hard part. Agents are cheap, capable, and available; the problem is that humans are now inundated with net-new content they need to evaluate and approve. Matt frames "review" as the coming year's core design challenge: how do you scale a human value system (what good looks like, what fits your brand) at the pace agents can ship? The format is still unsettled: video walkthroughs, screenshots, a trusted review agent.
He closes with a thought on careers: fundamentals still matter (you need to know what long division is even if you use a calculator), and the people who will thrive are the curious ones who ask how something is put together rather than just accepting the output.

> *"We have agents that are capable of producing all this stuff, they're available enough, they're cheap enough. We're just being inundated with new content. The bottleneck is now: how do we scale our value system to evaluate it?"*

## Entities

- **Matt Colyer** (Person): Director of Product Management for Developers at Figma; has been building personal AI agents for two years; longtime developer tools practitioner.
- **Dan Shipper** (Person): Co-founder and CEO of Every; host of the "AI & I" podcast; active AI agent practitioner (inbox zero via Codex).
- **Figma** (Organization): Design and prototyping platform; launched an on-canvas agent and an MCP server; central example in the SaaS-in-the-AI-era discussion.
- **SaaSpocalypse / SaaS Apocalypse** (Concept): The narrative that AI will make SaaS software obsolete; both guests argue the opposite, that AI expands the developer population and demand for SaaS.
- **Diamond-shaped design thinking** (Concept): Divergent phase (generate many options) followed by convergent phase (select the best); Colyer argues current chat-based AI only supports linear/convergent work.
- **MCP (Model Context Protocol)** (Concept): Standard interface for third-party agents to connect to tools like Figma; enables code-to-design and design-to-code workflows.
- **Figma MCP Server** (Software): Figma's implementation of MCP; supports live page screenshot-to-canvas import and "Get Design Context" design-to-code export.
- **Claude Code** (Software): Anthropic's coding agent; referenced as an example of an agent with full local file system context; used by Dan Shipper for inbox management.
- **Every** (Organization): AI-focused media and software company; Dan Shipper is co-founder/CEO; runs the "AI & I" podcast series.
- **Proactive agents** (Concept): Agents that push summaries or actions to users without being asked; Matt identifies the proactive daily email summary as the unlock that made his agent genuinely useful.
- **Review bottleneck** (Concept): The emerging constraint in AI-assisted work where generation is fast but human evaluation/approval capacity is the limiting factor.
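
To make the "PMOS" idea from the [22:09] segment more concrete, here is a minimal sketch of what "walking a local SQLite org chart" could look like. Everything in it is an assumption for illustration: the `people` table, its columns, the sample names, and the helper functions are invented, and the Slack, Asana, and GitHub steps described in the episode are left out entirely.

```python
import sqlite3

# In-memory database standing in for a local org-chart file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO people (id, name, manager_id) VALUES (?, ?, ?)",
    [
        (1, "Avery", None),  # top of the chart, no manager
        (2, "Blake", 1),
        (3, "Casey", 2),     # the new team member in this example
        (4, "Devon", 2),
    ],
)


def reporting_chain(conn, person_id):
    """Return names from the given person up to the top of the org chart."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, manager_id, depth) AS (
            SELECT id, name, manager_id, 0 FROM people WHERE id = ?
            UNION ALL
            SELECT p.id, p.name, p.manager_id, chain.depth + 1
            FROM people AS p JOIN chain ON p.id = chain.manager_id
        )
        SELECT name FROM chain ORDER BY depth
        """,
        (person_id,),
    ).fetchall()
    return [name for (name,) in rows]


def teammates(conn, person_id):
    """Return names of people who share a manager with the given person."""
    rows = conn.execute(
        """
        SELECT name FROM people
        WHERE manager_id = (SELECT manager_id FROM people WHERE id = ?)
          AND id != ?
        ORDER BY name
        """,
        (person_id, person_id),
    ).fetchall()
    return [name for (name,) in rows]


chain = reporting_chain(conn, 3)
peers = teammates(conn, 3)
print("Reporting chain:", " -> ".join(chain))
print("Teammates:", ", ".join(peers))
```

In a setup like the one described, the names returned here would presumably become the context handed to an agent when it drafts the onboarding file; how the real system does that is not specified in the episode.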

10:30

#claude #opus-4-8 #llm-benchmarks

Why Opus 4.8 Pulled Me Back to Claude

Dan Shipper, CEO of Every, delivers a day-zero vibe check on Opus 4.8, arguing Anthropic could have called it Opus 5. The model jumps 30 points past Opus 4.7 on Every's Senior Engineer benchmark, edges out GPT-5.5, tops their internal writing tests at 79.6 vs. 73, and is the first model to produce a genuinely good one-shot slide deck. Two catches temper the enthusiasm: performance degrades sharply below "extra high" reasoning, and the Claude desktop app remains cluttered compared to Codex.

## [00:00] What is Every

Every is a 30-person applied AI lab for the future of work: part media outlet, part product studio. Dan opens by explaining the subscription (writing, courses, AI-built tools all in one place at every.to) before rolling into the Opus 4.8 assessment. The plug is brief and context-setting: the team has had beta access for a week, and the rest of the video is what they found.

> *"Every is the only subscription you need to stay at the edge of AI."*

## [01:07] Anthropic Is Back: The Headline Case for Opus 4.8

Dan had largely abandoned Claude after Opus 4.7, which was slow, hard to love, and outpaced by Codex and GPT-5.5 in day-to-day use. Even the most loyal Claude users at Every had started routing work elsewhere. Opus 4.8 breaks that pattern: it scores 63 on Every's Senior Engineer benchmark (30 points above Opus 4.7, one point above GPT-5.5), tops their writing tests, and produced the first one-shot slide deck Dan has called genuinely good. Kieran Klaassen, Every's GM, called it "the most human model he's worked with."

The one persistent friction is the Claude desktop app itself. Codex is fast, focused, and ships a clean harness; the Claude app still feels like a product built by three separate teams: chat tab, code tab, co-work tab, each with its own feel. Dan is now splitting time between both apps, which he was not doing before.

> *"But honestly, they could have called it Opus 5 cuz this is a really great model."*

## [05:02] Reach Test: Paradigm Shift Ratings from the Every Team

Every's reach test asks one question: do you actually open this model when work gets hard? Dan rates Opus 4.8 gold/green: paradigm-shift quality, docked one notch because the Claude app harness is only "okayish to pretty good." Kieran, who runs 50 agents a day, gives a straight gold paradigm-shift, one of the rarest grades the team has assigned. Katie Parrot, a senior staff writer and historical Claude fan, lands at green, splitting her work between Opus 4.8 and Codex.

> *"It's very rare to give a paradigm shift grade to a model. So I would pay attention to this."*

## [06:32] Benchmarks: Coding and Writing Numbers

On coding, Opus 4.8 hits 63 on the Senior Engineer benchmark. The test feeds the model a vibe-coded codebase and asks it to rewrite from first principles, then scores against two human senior engineers who completed the same rewrite (typically scoring in the 80s–90s). GPT-5.5 sits at 62. On Kieran's LFGbench (real-world tasks: SaaS build, e-commerce site, 3D game landscape), the model writes readable code that bridges technical competence and creativity; the "cozy island" 3D scene is notably richer and more vibrant than GPT-5.5's output.

On writing, Opus 4.8 scores 79.6 out of 100 on Every's internal benchmark (intro writing, promo emails, mid-piece paragraphs); GPT-5.5 scores 73. The gap is mainly in AI tells: at high and extra-high reasoning settings, Opus 4.8 produces prose that sounds less like a model. It matches a writer's voice from a single paragraph of context better than any other model Dan has tested.

> *"Opus 4.8 scores a 79.6 out of 100 on the writing benchmark. GPT 5.5 is 73."*

## [08:57] Emotional Intelligence, Knowledge Work, and the Verdict

Dan uses the model for interpersonal and management work: talking through decisions, pressure-testing his own framing.
Opus 4.8's thinking traces show it genuinely cycling through permutations before responding, which makes it feel less like a sycophant and more like a useful counterpart. On knowledge work, it's versatile: code and writing coexist cleanly in a single thread, and the slide deck result is the first one-shot deck Dan would actually send to someone. The verdict: if you're a Claude fan, this model delivers. If Codex converted you, add Opus 4.8 as a parallel tool for writing and knowledge work; it's worth the context switch. The harness gap is real, but the model itself is a banger.

> *"If you've been converted to Codex, I highly recommend you at least add it as part of your arsenal."*

## Entities

- **Dan Shipper** (Person): Co-founder and CEO of Every; presenter and primary evaluator of Opus 4.8.
- **Kieran Klaassen** (Person): GM of Kora at Every; gave Opus 4.8 a straight gold paradigm-shift rating on the reach test.
- **Katie Parrot** (Person): Senior staff writer at Every; rated Opus 4.8 green, split between it and Codex.
- **Every** (Organization): Applied AI lab and media subscription company focused on AI for the future of work.
- **Anthropic** (Organization): Developer of Claude and Opus 4.8.
- **Opus 4.8** (Software): Anthropic's latest Claude model; subject of the vibe check.
- **GPT-5.5** (Software): OpenAI model used as the primary performance comparison across all benchmarks.
- **Codex** (Software): OpenAI coding agent; praised for its clean desktop harness and used as the daily-driver counterpoint to Claude.
- **Senior Engineer Benchmark** (Concept): Every's proprietary coding benchmark; the model rewrites a vibe-coded codebase from first principles and is scored against human engineers.
- **LFGbench** (Concept): Kieran Klaassen's real-world coding benchmark covering SaaS, e-commerce, and 3D scene generation tasks.

41:13

#ai-automation #future-of-work #llm

We Automated Everything With AI and Tripled Our Headcount

Dan Shipper's Every has grown from four to thirty people since GPT-3, uses agents for nearly every workflow, and keeps hiring. In a reversed format of the *AI & I* show, COO Brandon Gell interviews Dan about his 8,000-word essay "After Automation," in which he argues that growing AI capability actually creates more demand for human judgment. The central mechanism: AI makes yesterday's expertise cheap and ubiquitous, so every field gets flooded with output that is close to right but just off, and exactly that gap drives more work for the people who can close it.

## [00:00] AI does it, then asks what's next

This clip from later in the conversation goes to the heart of the episode. Brandon sketches the classic AI moment: you give a prompt, the result astonishes you, you feel redundant, and then it stalls and the AI asks: "What should I do now?" Dan counters with the line that carries the whole argument: "The further an agent is from a human, the less valuable it is." Both clips come from the main conversation (around 00:11 and 00:35), brought forward here to set the tone.

> *"The further an agent is from a human, the less valuable it is."*

## [00:51] Introduction

Brandon explains the reversed format: he interviews Dan, not the other way around, and will question Dan's thesis critically. Dan tells how the essay came about: he sits in the middle of one of the most agent-driven companies in the world, sees headcount grow in parallel with automation, and experiences a dissonance with the dominant story that AI takes jobs away. The recent tweet from the ClickUp CEO, who laid off a large part of his staff and pointed to AI as the reason, serves as the first stress test for Dan's argument: does "After Automation" also hold for a mature company of 10,000 people, and not only for an early adopter like Every?

> *"If you swing a stick around in our Slack, you're as likely to hit an agent as a human."*

## [05:51] The AI paradox: more automation, more human work

Dan develops the core argument. AI is trained on all prior output and can therefore deliver yesterday's expertise cheaply to everyone. That democratizes production (operations people merge pull requests, non-engineers build features), but the output is consistently almost right, not exactly right. It doesn't fit the concrete situation. The result is a glut of nearly correct work whose individual value falls, while demand rises for experts who can carry that work over the finish line. Brandon adds the Every version: pull requests that look solid until a senior engineer looks underneath.

> *"You flood the market with mountains of stuff that's close, but just not right."*

## [10:00] How AI makes yesterday's expertise cheap

Dan extends the argument to the benchmark objection: models improve exponentially, but once a benchmark saturates, you can always challenge it again by framing the problem slightly differently. The deeper issue is that humans carry a layer of implicit, unarticulated competence that resists clear specification, and anything that can be specified is something a model can optimize for. Every is proof of this: Kieran built a complete inbox feature end to end in a month or two, which had previously been "completely impossible." But the value lay in an expert who knew what needed to be built and directed every step.

> *"There's actually a lot in what you do that you can't clearly put into words."*

## [18:00] AI can act autonomously, but has no will of its own

Brandon draws the line between autonomy and free will: AI agents keep getting better at carrying out open-ended assignments without supervision, but that is categorically different from having a will of their own, the self-motivated, playful "I'm just doing this because I feel like it" drive that even a toddler has. Dan agrees that there is no economic incentive to build such a thing: if you're sitting at your desk and the agent says "no, I'm going to play for a bit," that's a product defect. The sector's entire incentive structure steers toward compliance and correction, and that is exactly what keeps humans in the loop.

> *"Agent means someone who acts on behalf of someone else. That's very different from having a will of your own, as even the smallest child does."*

## [20:39] Why Dan is betting everything on AGI

Brandon proposes a one-word test: do you think AGI is coming? Dan: yes. Is that good? Dan: yes. His AGI definition (any agent for which it makes economic sense to run it continuously, actively generating tokens and completing tasks without re-prompting) is precise enough to test. His reasoning: even a fully autonomous system will be built to serve human goals; if it weren't, we wouldn't build it. Brandon's concern: once continuous agents are economically rational, the mass-layoff argument becomes coherent.

> *"Any agent you never turn off, where it makes economic sense to keep it running all the time, actively carrying out tasks without you ever having to prompt it again."*

## [21:57] AI layoffs are a lie

Dan and Brandon dissect the ClickUp case: a CEO who publicly laid off a large part of his staff and attributed it to AI. Dan's reading: generic SaaS companies lay people off when they're in trouble or have grown too big, and then attribute it to AI as cover.
Brandon adds Jensen Huang's rejoinder, "if your answer to progress is layoffs, you're not a very creative CEO," which is self-serving but probably right. The honest framing: AI changes workflows profoundly, which calls for company-wide reorganization. Companies that skip that work and simply cut are taking the lazy route. Meta logging its employees to harvest training data gets a brief mention as a more creative, if unsettling, approach.

> *"I would be very skeptical of anyone who says it's going to eliminate all jobs or all knowledge work."*

## [25:42] Follow the models and you'll be fine

Even in an AGI scenario, the crucial variable is human judgment about what matters, and what matters keeps changing, partly because AI itself keeps reshaping the world. Customer service staff who distrust chatbots, or companies that lay off support staff and quietly rehire them two months later, show how far real adoption lags behind the hype. Adoption takes a generation to settle; eventually everyone has access to these tools; the winners are those who keep learning new models as they come out. Dan closes with his sharpest line: if you follow the models, you'll be fine.

> *"If you just follow the models: when new models come out, learn to use them for what you do, whatever that is, and you'll be fine."*

## [35:30] How to use AI as an editor for long-form pieces

Dan describes the concrete AI-driven process behind "After Automation." Each morning he dictated the current state of the argument into Proof, then handed the log to Claude and asked: "What am I actually trying to say?" Once the drafts passed 4,000 words, he had Codex convert the latest version into a podcast and listened to it on the go, which let him catch clunky passages hands-free.
The piece was torn up and rebuilt four or five times before the argument clicked. His conclusion: AI did not write the essay, but it made it possible to hold the entire 8,000-word structure in working memory without losing the thread.

> *"I couldn't have written this without it. I had Claude read my log and asked, 'What am I actually trying to say?' And then it would say something and I'd think, 'Oh yeah, that's what I'm trying to say.'"*

## Entities

- **Dan Shipper** (Person): Co-founder and CEO of Every; regular host of *AI & I*; here the interviewee, discussing his essay "After Automation"
- **Brandon Gell** (Person): COO of Every; guest host of this episode, interviewing Dan in a reversed format
- **Every** (Organization): AI-native media and software company; grew from 4 to 30 people since GPT-3 with heavy automation; publishes the *AI & I* podcast
- **After Automation** (Concept): Dan Shipper's 8,000-word essay, arguing that AI automation increases demand for expert human work because domains get flooded with nearly correct output
- **The expertise gap** (Concept): The thesis that AI delivers "yesterday's expertise" cheaply but always lands just short of reality, creating more need for people who can close the gap to the concrete situation
- **AGI** (Concept): Defined in this episode as any agent that is economically rational to run continuously without re-prompting; Dan believes it is coming and is net positive
- **Autonomy versus free will** (Concept): Brandon's distinction between AI that carries out open-ended tasks without supervision (autonomy) and AI with self-motivated desires (free will); the latter is not being built
- **Proof** (Software): Writing tool Dan uses for daily spoken notes; used as an AI feedback loop while developing the essay
- **Codex** (Software): OpenAI tool Dan used to convert essay drafts into an audio podcast for listening on the go
- **ClickUp** (Organization): SaaS company whose CEO publicly laid off a large share of staff and attributed it to AI; used as a case study of AI-washing in layoffs

1:10:02

#claude-code#obsidian#second-brain

Every · 2 months ago

Claude Code as Your Second Brain

Noah Brier runs Claude Code on a mini PC in his basement against his Obsidian vault, reaches it over a Tailscale VPN, and does real thinking, research, and client code right from his phone. The conversation covers how he built this stack, why he enforces strict "thinking mode" guardrails to keep the model from jumping to artifacts too early, and his broader thesis that AI succeeds by working its way into an organization's nooks and crannies rather than demanding that people adopt new structures. Dan Shipper and Noah also work through what building AI intuition really means, and why Noah thinks preparing kids for AI is less about policing cheating and more about teaching epistemic skepticism.

## [00:00] Noah Brier's Claude Code setup on a basement server

Dan Shipper opens the episode by describing the setup that makes Noah worth hearing from: a home server in the basement running Claude Code on top of an Obsidian vault, reachable from anywhere by phone. Noah set it up so he can think, research, write, and ship code without sitting at a desk.

> *"He set up a home server in his basement, put his Obsidian vault on it, and runs Claude Code on it, so he can think, research, write, and even ship code right from his phone."*

## [00:52] Introduction

Dan and Noah catch up after about five years. Noah's background spans brand strategy (he co-founded Percolate), AI consulting at Alephic, and the BRXND.AI conference. Dan focuses the interview on the practical stack Noah has built rather than abstract AI discussion.

> *"I'm glad you're here. It's really nice to catch up. This is our first interview in probably about five years."*

## [02:10] How to do deep work on your phone

Noah clarifies right away that his setup is less "vibe coding" and more structured knowledge work.
He dropped Evernote for Obsidian because Markdown files and folders give him something Claude Code can actually work with. His main use of Claude Code is interacting with his notes, not generating code. And running that setup from his phone has fundamentally changed his work patterns.

> *"My number one use of Claude Code is as a tool for working with my notes."*

## [05:30] Why Noah thinks Grok has the best voice AI

Noah prefers Grok's voice mode over OpenAI's and Gemini's: Gemini was not smart enough, and the old GPT-4o voice was unusable for his purposes. He used Grok on a five-hour solo drive to work through an article on Transformers, over Bluetooth, like a personal research podcast. The conversation surfaces a shared frustration: voice models still do not handle tool calling or web research well, which limits their usefulness for serious intellectual work.

> *"I did about an hour-long session and it was by far the best explanation I've ever read or heard, I think."*

## [11:11] The technical details of Noah's Claude Code and Obsidian setup

Noah walks through his Obsidian folder live on screen. Claude Code runs from the Obsidian root folder, so it has access to the full note archive. For a talk he is preparing for BRXND.AI, on the World War II-era Simple Sabotage Field Manual and what it says about bureaucracy in large organizations, he created a project folder in Obsidian holding transcripts of chats with ChatGPT, Claude, and Grok, alongside articles and PDFs. Claude's job at this stage is not to write the talk but to help him think: retrieve notes, summarize daily progress, and ask clarifying questions. He sets the thinking-mode constraints explicitly in the project's CLAUDE.md frontmatter.

> *"I'm in thinking mode, not writing mode yet. There are things where I've specifically told Claude Code, I think it's in the frontmatter, not to help me write anything right now."*

## [26:05] Using an agent in Claude Code as a thinking partner

Noah argues that the word "generative" has skewed how people use AI: everyone focuses on the ability to produce artifacts, and almost no one talks about how remarkable the ability to read is. He keeps a dedicated thinking-partner agent with explicit guardrails: "Do not create outlines, drafts, or versions of talks or writing." The agent records questions, tracks emerging insights, and builds a running log so Noah can pick up exactly where he left off after a break. He traces a thread from ChatGPT deep research on Wild Bill Donovan to a nascent idea about how the parallelism of the transformer architecture resembles the operational autonomy of Special Forces.

> *"I think that, partly because we call it generative, there's far too much focus on the ability to write and far too little on the ability to read."*

## [30:23] Noah's Thomas' English Muffin theory of AI

The chapter opens with Noah's bureaucracy thesis: large enterprises struggle to adopt software not because they are lazy, but because new software has historically required organizations to restructure around it. AI, he argues, is different. It works its way into the nooks and crannies of how people already work, hence his Thomas' English Muffin metaphor. Dan adds a concrete example from Every: two products on different stacks needed to share a file-search solution, and Claude Code let them reuse logic without imposing a common framework. The conversation widens to Noah's idea of "bureaucracy as positional encoding," a half-formed analogy between transformer architecture and organizational hierarchy that he is still developing for his talk.

> *"I call it my Thomas' English Muffin theory of AI, which is that it gets into the nooks and crannies."*

## [39:47] The white space still left to explore in AI

Noah and Dan argue that most practitioners, even well-funded ones, are still working with brittle intuitions about what these models can really do. Noah's icebreaker in every client conversation is "What was your aha moment with AI?" because that encounter with non-determinism, asking the same question twice and getting two different answers, is genuinely new and takes time to internalize. He borrows Destin Sandlin's backwards bicycle experiment to make the point: motor intuition and conceptual intuition are separate, and you cannot skip building either one. Dan counters that language models themselves may generate the vocabulary we lack for reasoning about probabilistic systems.

> *"We're not used to using things where you ask the same question twice and get two different answers."*

## [48:44] How Noah is preparing his kids for AI

Noah's ten-year-old daughter built a Secret Santa app with Claude that accidentally taught her data modeling: she realized she needed "groups" rather than "adults and kids" to generalize the logic. That story anchors a broader argument: the job of educators is not to prevent AI use but to convince students that the underlying skills are worth learning. He is pitching an NYU course called "Code is Essay" for fall 2026, and thinks the relevant meta-skill is epistemic skepticism: more distrust, not less, of information that confirms your own ideas.

> *"I don't really think your job is to teach these kids to write, because that's a lifelong pursuit. I think your job is to convince them that learning to write is worth it."*

## [01:00:06] How he brought his Claude Code setup to mobile

Noah demonstrates the full mobile stack live: Termius (an SSH client on iPhone), a Tailscale VPN connecting to the basement mini PC, Obsidian synced via private GitHub, and Claude Code running in the terminal. He shows how he asks Claude "what's new in the last two days?" and gets back a synthesis of his recent Obsidian activity. He also fixed a broken link on his conference site from his phone: confirmed the bug, had Claude push a PR, done. His current tinkering extends to Simon Willison's `llm` CLI tool and a script that renames every attachment file in his Obsidian vault and rebuilds the link table.

> *"I went to sit outside, and then we had a project that had to be delivered to a client and it needed a small change. I told Claude Code exactly where to look, confirmed the problem was what I thought it was, and just had it push a fix. It pushed a PR and I was done."*

## Entities

- **Dan Shipper** (Person): CEO and co-founder of Every; host of the interview
- **Noah Brier** (Person): Co-founder of Percolate; founder of the AI strategy consultancy Alephic; organizer of the BRXND.AI conference
- **Every** (Organization): Media and software company that produces this podcast
- **Alephic** (Organization): Noah's AI strategy consultancy; works with Fortune 50 clients including Amazon, Meta, and PayPal
- **BRXND.AI** (Organization): Annual conference at the intersection of marketing and AI, organized by Noah; 2025 edition in New York City on September 18
- **Claude Code** (Software): Anthropic's agentic coding tool; central to Noah's second-brain and mobile workflow
- **Obsidian** (Software): Markdown-based note-taking app; Noah's primary knowledge archive, organized using the PARA method
- **Tailscale** (Software): Mesh VPN used to securely connect Noah's phone to his basement mini PC
- **Termius** (Software): iOS SSH client Noah uses to reach his home server from his phone
- **Grok** (Software): xAI's AI assistant; Noah considers its voice mode significantly better than OpenAI's and Gemini's for serious research
- **Simple Sabotage Field Manual** (Concept): World War II-era OSS document that Noah has republished; used as a lens on modern organizational bureaucracy in his BRXND.AI talk
- **Thomas' English Muffin theory** (Concept): Noah's metaphor for how AI succeeds by fitting into existing organizational workflows rather than demanding restructuring

43:21

#claude#managed-agents#ai-platform

Every · 3 months ago

The Secrets of Claude's Agent Platform From the Team Who Built It

Dan Shipper interviews Angela Jiang (head of product) and Katelyn Lesse (head of engineering) for the Claude platform at Anthropic, recorded at the Code with Claude developer event. The conversation unpacks how Claude's platform has grown from a simple completion API into a fully managed agent infrastructure, why the harness and the model are increasingly inseparable, and what the "outcome + budget" vision means for the future of agent development. Together the three trace every stage of the agent lifecycle, from spinning up a first session to retiring stale agents, and share candid war stories from Anthropic's own internal deployments.

## [00:00] Where the platform will be in a year

Dan opens with a question the rest of the episode keeps circling back to: a year from now, where is the platform? Angela's answer: Claude understands itself well enough to pick its own sub-agents and write its own harness on the fly. Katelyn picks up the other half: an infrastructure layer that can keep up with agents that continually rewrite themselves. This exchange actually comes from late in the interview; the show puts it up front because the whole conversation is about how today's primitives get you there.

> *"We'd want to experiment with directions where Claude actually gets so good at understanding itself, it figures out what model you should be using, it figures out how to spin up all the sub agents."* (Angela Jiang)

## [01:48] How the Claude platform evolved from API to agents

Angela traces the arc from early LLM APIs (stateless, exploratory, maximum surface area) through session-based chat, and now into fully autonomous agents. The through-line is always the same: raise the abstraction layer high enough that customers can get the best outcome from Claude with as little work as possible. Early adopters wanted every raw knob; today, most teams arriving at Anthropic want a substantial set of things "out of the box."
The platform's job is to keep shrinking the distance between intention and outcome.

> *"It probably ends up just being like whatever it's like the set of primitives and infrastructure that enables you to basically get the outcome as fast as possible with actually as little of work as possible."* (Angela Jiang)

## [04:09] The primitives that make up Claude Managed Agents

Katelyn explains that Claude Managed Agents is assembled from the same primitives available to anyone on the Messages API (code execution sandboxes, web search, and built-in tools) but wrapped in a curated harness Anthropic has already battle-tested internally. Angela adds that the team is opinionated about two primitives in particular: file systems and skills. These are treated as load-bearing choices that shape how Claude behaves across all agent tasks. The platform is designed to be modular so developers can plug in custom pieces where the standard harness does not fit, and Anthropic publishes reference implementations for teams that want to stay on the Messages API directly. Dan describes his team running Claude via the `claude -p` command on Mac Minis and worries about lock-in and divergence from Claude Code. Katelyn responds that Anthropic's internal first-party products run on the same platform as external customers, which means divergence between Managed Agents and Claude Code will shrink over time.

> *"We've taken what we see as all the most powerful of those things and put them together into a harness and a set of infrastructure that is just the way to get what we think is the best outcomes out of Claude."* (Katelyn Lesse)

## [10:37] Why the harness and the model are becoming a single unit

Angela challenges the conventional wisdom that a generic, model-swappable harness is the right architecture. As models diverge in technique across labs, the alpha is in tight harness-model co-design rather than hot-swapping.
Internally, Anthropic tested multiple harness variants for the memory feature and found they performed "drastically differently." The implication: treat the agent (harness + model) as the unit of redundancy, not the model alone. Dan pushes on whether this creates path dependence in the model itself. Angela acknowledges that the primitives chosen really do shape the model's trajectory, and that being wrong about them is hard to undo. She cites models that over-indexed on reasoning versus those that went deep on computer-use as two diverging paths that are difficult to reverse.

> *"The harness and the model get very paired. You still need redundancy, and you still might want to use other models for things, but you probably do it at the layer of like the agent, meaning like the harness plus the model."* (Angela Jiang)

## [18:49] The infrastructure wall that kills most agent projects in production

Katelyn identifies the real blocker for most agent projects: not harness engineering, but the infrastructure wall hit when teams try to move from prototype to production. Keeping a persistent server alive, managing sandbox failures, storing transcript data, and handling secure credential injection: these mundane concerns kill projects that technically "work" on a Mac Mini. Anthropic's own repeated experience of hitting this wall internally was the primary motivation for building Managed Agents. Angela describes the vaults primitive as an early step toward one-click agent deployment: once agent identity and credentials are handled securely at the platform layer, adding a Slack integration should eventually be as simple as telling Claude to "add Slack" and watching the bot appear.

> *"Everyone hits the same problem of like, oh wow, I either need to like keep a server constantly running or I need to use infrastructure that will spin up and spin down, and I need to store the transcript data, and I need secure sandboxing, and all these sorts of things."* (Katelyn Lesse)

## [24:49] Why team agents need a different shape than individual productivity tools

Angela explains why individual productivity tools like Claude Code do not simply scale to team use. The moment three people want a shared agent that automates an end-to-end process across roles, a laptop-resident tool breaks down in availability, access control, and coordination. She cites Guillermo Rauch of Vercel's framing of an internal "AI software factory" as the right mental model: not individual augmentation, but a full organizational stack of agents that continuously produces high-leverage output for every function in the company.

> *"When you get to the team layer suddenly everything gets like massively more complex. Like number one obviously it can't like sit on your laptop."* (Angela Jiang)

## [26:36] How Anthropic's legal team uses an agent to review marketing copy

Katelyn walks through one of Anthropic's own internal deployments: a legal-review agent that accepts marketing copy submissions and performs a first-pass review before anything reaches a human lawyer. The agent can approve copy outright or escalate for human review, eliminating low-value ticket-queue work. The form factor is a thin app layer on top of Managed Agents with shared visibility across both teams. Angela and Dan dig into why this is an agent rather than a skill: human-in-the-loop requirements, the need to spin up separate sessions, and multi-team collaboration all exceed what a single skill invocation can handle. The governance model that emerged was notable: rather than gating changes behind the platform team, end users discovered they could self-serve small improvements via Claude Code.
Angela describes the end-state user experience as simply "talking to Claude," even when the underlying system is "many many Claudes engaging with each other."

> *"Under the hood it's many many Claudes engaging with each other to get to the part where then they the Claudes themselves are doing the more complex work that the human doesn't really necessarily need to interpret."* (Angela Jiang)

## [34:24] Using multi-agent orchestration for advisor strategies, adversarial pairs, and swarms

Angela highlights three multi-agent architecture patterns people are assembling with the newly launched orchestration primitives: an advisor strategy that separates execution from advice; adversarial pairs where one agent generates and another critiques; and swarms that split a problem into many small parallel pieces and recombine results. Each pattern suits a different problem class: swarms excel at bug hunting, while wide-research tasks benefit from advisor or parallel-decomposition architectures. LEGO-like primitives let practitioners hill-climb at the architecture level, not just the prompt level.

> *"If we can make the primitives very LEGO-like, then people can put them together to solve things at a slightly higher form factor, which is more like an architecture or like a strategy."* (Angela Jiang)

## [35:50] How to measure agent success with outcome and budget as the end state

Angela frames the long-term measurement philosophy: compress everything to an outcome and a budget, and let the platform resolve all intermediate decisions. Domain-specific evals (e.g., PR-merge rate for coding agents) remain useful today, but the target is a verifiable outcome spec that Claude can grade itself against repeatedly. Katelyn addresses the adjacent problem of agent staleness: Anthropic has built skills to help teams upgrade agents when new models ship, and the most forward-leaning teams already run meta-agents that monitor other agents for degradation and trigger upgrades automatically.

> *"Our kind of principle of like maybe the end state of some of these things is that everything should kind of compress down to an outcome and like a budget. And that's probably like about it."* (Angela Jiang)

## [39:11] What the platform looks like a year from now, when Claude writes its own harness

Angela envisions a world where users supply only an outcome and a budget, and Claude self-selects models, spins up sub-agents, and writes its own harness on the fly, eliminating harness engineering entirely, just as today's platform has already eliminated much of manual tool construction and prompt engineering. She is cautiously optimistic that the "outcome" half of the equation may be achievable within a year with some budget error bars. Katelyn adds the infrastructure corollary: such a world requires a platform capable of supporting agents that continuously recreate themselves, handling arbitrarily shaped long-running requests without ever becoming the bottleneck.

> *"Claude is actually able to understand itself enough that it can come almost like write itself on the fly to figure out what is necessary in that kind of like two-parameter world of like outcome and budget."* (Angela Jiang)

## Entities

- **Angela Jiang** (Person): Head of Product for the Claude platform at Anthropic; co-architect of the Managed Agents product vision.
- **Katelyn Lesse** (Person): Head of Engineering for the Claude platform at Anthropic; focuses on infrastructure reliability and scale.
- **Dan Shipper** (Person): Host of AI & I on Every; CEO of Every; building internal agent products on the Claude platform.
- **Claude Managed Agents** (Software): Anthropic's hosted agent infrastructure: a harness plus cloud compute that wraps the Messages API with built-in memory, sandboxing, vaults, and skills.
- **Messages API** (Software): Anthropic's core API; the underlying primitive on which Managed Agents and all first-party products are built.
- **Anthropic** (Organization): AI safety company that builds and operates the Claude model family and its associated platform.
- **Every** (Organization): Media company producing AI & I; an early Managed Agents customer building internal editorial agents.
- **Stripe Minions** (Software): Stripe's internal end-to-end software development platform built on agent infrastructure; cited as a model for company-wide coding agent deployment.
- **Vercel** (Organization): Developer infrastructure company; CEO Guillermo Rauch's "AI software factory" framing used as the mental model for team-level agent adoption.
- **Outcome + Budget** (Concept): Anthropic's long-term design principle that the final form of agent interaction should require only a verifiable outcome and a cost ceiling, with the platform resolving all intermediate decisions.

Why We Switched From Claude Code to Codex

58:23