LaiDub

Podcasts: Hear the voice. See the shape of the thought.


We Tested Anthropic's Fable 5 for a Week

Dan Shipper, CEO of Every, spent a week with Fable 5, Anthropic's Mythos-class frontier model, before its public launch and walked away genuinely changed. Every's senior engineer benchmark put Fable at 91/100, against 63 for Opus 4.8 and 62 for GPT-5.5, a jump Dan describes as "warp drive" capability for sustained autonomous work. The model is slow, expensive, and token-hungry, but for anyone orchestrating big, multi-hour agentic tasks, there's nothing close to it right now.

## [00:00] One prompt built an infinite 3D library

Dan opens with a live demo: a fully browsable 3D version of Jorge Luis Borges's "The Library of Babel," with hexagonal galleries, accurate mathematics from the story, and working bookmarks, all generated by a single prompt. He gave Fable a one-line instruction to read the story, plan, and execute a browser-playable 3D game end-to-end. The model ran autonomously for three to four hours, self-checked its work, and shipped.

> *"I made this entire thing in a single prompt with Fable 5, the new model from Anthropic."*

## [01:22] Our day-zero Fable 5 review

Dan introduces himself and Every's approach: they test models hands-on for real production work (programming, writing, design, business decisions) and report back on what actually works. Fable generated unusual levels of pre-release hype; Anthropic had initially said it was too dangerous to release. After a week of internal access, Every's take is that the model is genuinely different, and Dan's goal here is to cut through the excitement and show the realistic picture.

> *"Because we've been using this model for about a week now, we get to pull back the curtain a little bit and show you what it's like to have lived with this model."*

## [02:25] What a Mythos-class model is

Mythos is Anthropic's new top-tier model family, sitting above Haiku, Sonnet, and Opus in their lineup. Architecturally it's not novel: same transformer family, just bigger. Anthropic added strict safety guardrails (no cyber, no biological use cases) to make it releasable. Pricing is steep: $10/M input tokens and $50/M output tokens, roughly 2× Opus. Dan's verdict from a week of use: genuinely the most powerful coding model he's ever touched, by a wide margin.

> *"It is just genuinely the most powerful coding model I've ever used by far."*

## [03:28] The 91/100 engineering benchmark

Every runs a proprietary senior engineer benchmark: the model is handed a real "vibe-coded slop" production codebase and asked to rewrite it from first principles as a senior engineer would. Prior to Fable, the top score was Opus 4.8 at 63/100, with GPT-5.5 right behind at 62. Fable scored 91, matching a human senior engineer in a single prompt. Dan had expected saturation of this benchmark in about six months; it happened in two weeks.

> *"Fable scored a 91 on this benchmark. 91 out of 100. That's the same score as a human engineer with just one prompt. That's crazy."*

## [04:12] Why it feels like a warp drive

Fable's core strength is sustained autonomous execution over multi-hour tasks. You give it a destination, leave it running, and come back to something finished. Unlike earlier Claude models that eagerly said yes to everything ("purple accents, purple accents"), Fable deliberates, pushes back when something can't be done well, and follows through on complex, loosely specified prompts. Dan's analogy is a warp drive: not instant, but it compresses what used to take months into hours.

> *"You can specify a destination for a big trip, and it just compresses what normally would have been like years or months into like hours or days."*

## [06:10] Where the model falls short

The warp drive metaphor cuts both ways: it's useless for getting around town. Tight back-and-forth collaboration, quick questions, rapid iteration: Fable is a poor fit for all of these. It's slow, expensive, and burns tokens aggressively. A non-obvious workaround: drop the reasoning level to medium or low for simpler questions; that's how Anthropic's own people use it internally. Without a big, meaty problem to throw at it, the model is overkill.

> *"If you're using it for true collaboration or quick questions or things that need tight back and forth, I don't think it's that good for that."*

## [07:04] Building a Heidegger lecture site

Dan describes asking Fable to grab philosopher Hubert Dreyfus's 2007 lectures on Heidegger, without even providing a URL, and turn them into a consumable mini-site. Fable found the lectures, wrote per-lecture summaries, built a synchronized player that highlights the transcript as audio plays, and added chapter navigation, drop caps, and typographic choices that Dan characterizes as actual taste, not default template output. One prompt, no scaffolding.

> *"That's what I mean when I talk about this model having really exceptional taste and attention to detail."*

## [09:05] Finding a growth bet in customer data

Every has ~10,000 paid and ~100,000 free subscribers and a backlog of survey data the team had been analyzing with AI for weeks without a sharp conclusion. Dan fed it all to Fable. In one pass, the model came back with: "You have a conversion merchandising problem. Your free-to-paid conversion ratio is lower than it should be." Then it offered a falsifiable bet: ship pricing transparency and a trial offer, and conversion will go up. That synthesis, which read survey responses, site analytics, and product state together, hadn't emerged from weeks of team analysis.

> *"That is something that I would expect a really, really good growth person to do with a lot of time and thought and research."*

## [10:35] Clearing a real GitHub backlog

Every's agent-native markdown editor Proof accumulates GitHub issues automatically as agents file bugs during use. Dan pointed Fable at two weeks of open issues and told it to close irrelevant ones and write Rust fixes for the rest. It swept through the backlog and produced patches the team actually merged. Other models can do this, but they require hand-holding: one issue at a time, constant check-ins. Fable just batched it.

> *"And it just went boom boom boom boom boom boom. And actually wrote fixes that we merged."*

## [11:17] Who should actually use this model

Dan is direct: Fable is not for everyone right now. Using Every's "eight levels of AI adoption" framework, it pays off at levels 7–8, where users are already orchestrating multiple agents and have large problems queued up; these are typically technical builders. For knowledge workers not yet running agent workflows, it'll feel like overkill; for casual vibe coders, the token costs are real friction. About half of Every's own early-adopter team saw immediate payoff; the other half is still growing into that workflow level.

> *"Using it is a skill. You need to be exposed to problems and working at a level of expertise where the problems come up in order for it to be useful."*

## [13:31] Where other models still win

Writing is the clearest gap: Fable's prose is dense, literary, and block-heavy. That is good for thinking through structural writing problems, not for copywriting or everyday sentence-level work. For Claude users, Opus 4.8 is still better for writing. For GPT users, 5.5 is a better daily driver. Dan himself keeps GPT-5.5 as his Codex driver for the quick back-and-forth that fills most of his day; Fable gets reserved for big production pushes.

> *"For my day-to-day, it's a bit overkill even for me."*

## [14:26] What this means after automation

Dan points to his essay "After Automation" as the frame: automation doesn't shrink human work; paradoxically, it creates more of it. Fable follows the same pattern: it raises the floor for non-experts (a vibe coder can now one-shot a video game) and raises the ceiling for experts (an expert can build a AAA game solo). The displacement is real and he says it's normal to feel unsettled by it, but the capability curve means even people who can't afford Fable today will have access within six to twelve months.

> *"This model increases the floor of capability for non-experts, but it also raises the ceiling for experts."*

## [16:02] The final verdict

Dan closes with a straightforward recommendation: read the full Every vibe check for detailed benchmark breakdowns across coding, writing, and knowledge work, watch "After Automation" for the bigger-picture framing, and then go find the first big problem you've been avoiding and point the warp drive at it.

> *"If you're psyched about this, the thing I recommend most is go use your new warp drive. And let me know what you make."*

## Entities

- **Dan Shipper** (Person): Co-founder and CEO of Every; sole presenter in this episode; spent a week testing Fable 5 pre-launch.
- **Every** (Organization): AI-native subscription media company focused on testing frontier models for real work use cases; ~10,000 paid subscribers.
- **Fable 5** (Software): Anthropic's Mythos-class frontier model; scored 91/100 on Every's senior engineer benchmark at launch.
- **Anthropic** (Organization): AI safety company; maker of the Claude / Opus / Fable model family.
- **Mythos** (Concept): Anthropic's top model tier, above Haiku, Sonnet, and Opus; characterized by extended reasoning and high token cost.
- **Senior engineer benchmark** (Concept): Every's proprietary evaluation, in which the model rewrites a production codebase from first principles; scored out of 100; Fable hit 91, Opus 4.8 hit 63.
- **Opus 4.8** (Software): Previous Anthropic flagship; scored 63/100 on Every's benchmark; still preferred for everyday writing tasks.
- **GPT-5.5** (Software): OpenAI's comparable frontier model; scored 62/100 on the benchmark; Dan's personal daily driver for quick back-and-forth work.
- **Hubert Dreyfus** (Person): American philosopher; author of "What Computers Can't Do" (1972); subject of the Heidegger lecture site demo.
- **Proof** (Software): Every's agent-native markdown editor; used in the GitHub backlog-clearing demo.
- **After Automation** (Concept): Dan Shipper's essay arguing automation creates more human work rather than eliminating it; referenced as the interpretive frame for Fable's broader significance.
- **Eight levels of AI adoption** (Concept): Every's framework for classifying AI workflow integration depth; levels 7–8 are where Fable delivers the most value.
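
The pricing quoted in the review ($10 per million input tokens, $50 per million output tokens) can be turned into a rough per-run estimate. The following is a minimal Python sketch of that arithmetic. The function name and the token counts in the example are hypothetical, chosen only for illustration; the episode does not report how many tokens any of the demos consumed.

```python
# Prices quoted in the review, in US dollars per million tokens.
INPUT_PRICE_PER_M = 10.0
OUTPUT_PRICE_PER_M = 50.0


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one run from its token counts."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical long agentic run: 20M tokens read, 2M tokens written.
    print(f"${run_cost_usd(20_000_000, 2_000_000):.2f}")  # prints $300.00
```

Because output tokens cost five times as much as input tokens at these prices, a run that writes a lot can cost more than one that mostly reads.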

#fable-5 #anthropic #llm-benchmarks

The SaaS Apocalypse Is a Goldmine With Figma's Matt Colyer

Figma developer PM Matt Colyer has been building his own AI agents for two years and is buying more software subscriptions than ever, not fewer. He and Every CEO Dan Shipper work through why the "SaaS apocalypse" narrative gets the economics backward, how AI needs to escape the tyranny of the text box to unlock genuinely creative design work, and why the coming year's challenge isn't generation but review: humans are now the bottleneck in a world where agents can ship faster than anyone can evaluate what they made.

## [00:00] AI will create a billion developers

This exchange, taken from later in the interview, opens the episode: Matt argues that the number of developers worldwide (roughly 25–40 million a decade ago) is heading toward a billion. That demographic explosion, not AI replacing software, is what makes the SaaS market a "gold mine." Figma and most established SaaS businesses are, in his view, excited rather than threatened.

> *"If you're in that space, like, it means it's a gold mine, right?"*

## [01:03] Introduction

Dan Shipper frames the conversation: he recently bought Figma stock after noticing the "SaaS apocalypse" discourse, and he wants to know how a company that pre-dates AI is navigating a world where agents can now operate inside your product. Matt, as the director managing Figma's developer products, is the right person to ask.

> *"There are all these people who are like, 'Oh, I don't have to use Figma anymore.' You guys just launched an agent in your product. You also have Figma MCP."*

## [02:15] Why the SaaSpocalypse narrative has it backwards

Matt's counter-argument runs on two tracks. First, the democratization of software creation massively expands the addressable market: more software being built means more demand for the tools, infrastructure, and services that support it. Second, vibe-coding your own app sounds liberating until you're dealing with SMTP upgrades at midnight. He built his own email agent two years ago and watched it get rickety; these days he pays someone else to run agents for him rather than maintain the plumbing himself.

> *"I'm buying more software these days than I ever did before, because I'm like, 'You know what? That tool seems cool. I'm just going to pay somebody else to run my agent for me.'"*

## [05:27] Matt's email agent origin story

The origin was unglamorous: three kids in three schools, relentless PTO emails, and the humiliation of missing spirit day. Matt wired up a Python script to grab his inbox and paste it to an LLM; the whole thing was rickety and the replies sometimes failed, but the core loop worked. He then added a memory system and a daily summary pushed to him proactively, which he flags as the real unlock: instead of having to open a tool and ask, it just showed up. Dan mirrors this with his own Codex-based inbox workflow, now four weeks into inbox zero. The two also land on voice as an underrated interface: Matt uses Loom recordings because it feels less weird than talking to a blank screen.

> *"The unlock for me was like instead of having to go to a tool and ask for the thing, it was just like it would show up."*

## [13:21] Divergent vs. convergent design thinking

Chat-based AI is inherently linear: you iterate on one design thread. Matt's argument is that great design has a diamond shape: first you diverge (generate many directions), then you converge (pick the best). Figma's on-canvas agent is a first attempt to break out of the text-box constraint. On the canvas, an agent can spawn a grid of frames (grayscale, sepia, with different type), and then a separate convergent agent can cluster them and recommend which direction to pursue. Command-line agents can't do this kind of spatial, parallel exploration; that's what the canvas unlocks.

> *"Text boxes are super limiting: it's very much like a linear 'well this and then that.' If we get to the canvas, the agents allow you to do divergent thinking."*

## [17:39] Figma's MCP server

MCP gives third-party agents (Cursor, Windsurf, Claude Code) a standard interface into Figma. There are two flows. Code-to-design: fire up a dev server, then ask the agent to screenshot a live page and pull it into a Figma canvas. Design-to-code: "Get Design Context" wraps component properties and design library guidelines into an agent prompt, and the agent then creates a branch, writes the code, and posts a screenshot to the PR. Both flows remove the manual copy-paste drudgery that used to live between the design file and the codebase.

> *"You pull up your codebase, fire up the MCP server, and ask it, 'Hey, can you go to this page and copy it into Figma canvas?' And it will actually do it. That's a little bit mind-blowing."*

## [19:45] Why design agents need personalization

Generic agents produce generic output. For Figma, the difference between an okay agent and one people actually love is whether it understands the design system: the components, the spacing rules, the naming conventions. Without that personalization layer, generated designs aren't usable. Matt draws a parallel to the memory systems in chat agents: in Figma's case, the design library is the memory. He also hints at proactive agent work Figma is cooking internally, framing the core problem as maintaining design values at the pace agents can generate.

> *"The thing that really differentiates an okay agent from one that people really love is the personalization aspect. For Figma's version of that, it's the design system."*

## [22:09] Every problem is a context problem

Matt describes a Figma product operations team that realized every recurring PM task (onboarding docs, project tracking, team introductions) was a context problem in disguise. They built "PMOS": a local SQLite org chart wired to Asana, Slack, and GitHub, with Claude Code skills layered on top. When a new team member joins, the system walks the org chart, reads the last 30 days of Slack channels, checks the Asana board, and produces an uncannily good onboarding file. Dan points out that Claude Code's power comes from the same insight: instead of an always-on cloud agent you have to manually wire to everything, it's an agent that already has access to everything on the user's machine.

> *"One of the unlocks to me about AI is like you kind of realize every problem becomes a context problem. The work becomes about framing the problem with the right set of information."*

## [25:12] Apple and Google as the reigning kings of context

Matt has been waiting for Apple Intelligence to deliver on its WWDC promise: phones hold all the personal data, so an always-on, actually-smart Siri should be the obvious product. It hasn't arrived. He's watching Google's rumored "Spark" agent (always-on, connected to all Google content) with similar anticipation. Dan's take: Apple wins regardless because everyone runs AI on Mac hardware, giving them time to catch up. Matt adds that Apple's privacy-first positioning is a genuine strategic asset, not just PR.

> *"Even being late to the game, they are still the king of context. And I think that's what's been interesting to watch about Google I/O this year: seemingly Google has also kind of woken up to that."*

## [28:18] Why review is the new bottleneck

Generation is no longer the hard part. Agents are cheap, capable, and available; the problem is that humans are now inundated with net-new content they need to evaluate and approve. Matt frames "review" as the coming year's core design challenge: how do you scale a human value system (what good looks like, what fits your brand) at the pace agents can ship? The format is still unsettled: video walkthroughs, screenshots, a trusted review agent. He closes with a thought on careers: fundamentals still matter (you need to know what long division is even if you use a calculator), and the people who will thrive are the curious ones who ask how something is put together rather than just accepting the output.

> *"We have agents that are capable of producing all this stuff, they're available enough, they're cheap enough. We're just being inundated with new content. The bottleneck is now: how do we scale our value system to evaluate it?"*

## Entities

- **Matt Colyer** (Person): Director of Product Management for Developers at Figma; has been building personal AI agents for two years; longtime developer tools practitioner.
- **Dan Shipper** (Person): Co-founder and CEO of Every; host of the "AI & I" podcast; active AI agent practitioner (inbox zero via Codex).
- **Figma** (Organization): Design and prototyping platform; launched an on-canvas agent and an MCP server; central example in the SaaS-in-the-AI-era discussion.
- **SaaSpocalypse / SaaS Apocalypse** (Concept): The narrative that AI will make SaaS software obsolete; both speakers argue the opposite: AI expands the developer population and demand for SaaS.
- **Diamond-shaped design thinking** (Concept): Divergent phase (generate many options) followed by convergent phase (select the best); Colyer argues current chat-based AI only supports linear/convergent work.
- **MCP (Model Context Protocol)** (Concept): Standard interface for third-party agents to connect to tools like Figma; enables code-to-design and design-to-code workflows.
- **Figma MCP Server** (Software): Figma's implementation of MCP; supports live page screenshot-to-canvas import and "Get Design Context" design-to-code export.
- **Claude Code** (Software): Anthropic's coding agent; referenced as an example of an agent with full local file system context.
- **Every** (Organization): AI-focused media and software company; Dan Shipper is co-founder/CEO; runs the "AI & I" podcast series.
- **Proactive agents** (Concept): Agents that push summaries or actions to users without being asked; Matt identifies the proactive daily email summary as the unlock that made his agent genuinely useful.
- **Review bottleneck** (Concept): The emerging constraint in AI-assisted work where generation is fast but human evaluation/approval capacity is the limiting factor.
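
The "PMOS" description in the [22:09] chapter (a local SQLite org chart that the system walks when someone joins) can be illustrated with a small Python sketch. Everything concrete here is an assumption: the episode does not describe the schema, so the `people` table, its columns, and the sample names are hypothetical, and the Asana, Slack, and GitHub integrations are left out entirely. The sketch only shows how walking an org chart stored in SQLite might look, using a recursive query.

```python
import sqlite3


def build_org_chart() -> sqlite3.Connection:
    """Create an in-memory org chart. The schema and names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " manager_id INTEGER REFERENCES people(id))"
    )
    conn.executemany(
        "INSERT INTO people (id, name, manager_id) VALUES (?, ?, ?)",
        [
            (1, "Director", None),
            (2, "PM Lead", 1),
            (3, "New Hire", 2),
            (4, "Designer", 1),
        ],
    )
    return conn


def management_chain(conn: sqlite3.Connection, person_id: int) -> list[str]:
    """Walk upward from a person to the top of the chart."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, manager_id, depth) AS (
            SELECT id, name, manager_id, 0 FROM people WHERE id = ?
            UNION ALL
            SELECT p.id, p.name, p.manager_id, c.depth + 1
            FROM people AS p
            JOIN chain AS c ON p.id = c.manager_id
        )
        SELECT name FROM chain ORDER BY depth
        """,
        (person_id,),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = build_org_chart()
    print(management_chain(conn, 3))  # ['New Hire', 'PM Lead', 'Director']
```

In a setup like the one described, the names returned by such a walk would presumably become the context handed to the agent, which is the "every problem is a context problem" point in miniature.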

#saas #ai-agents #developer-tools

Why Opus 4.8 Pulled Me Back to Claude

Dan Shipper, CEO of Every, delivers a day-zero vibe check on Opus 4.8, arguing Anthropic could have called it Opus 5. The model jumps 30 points past Opus 4.7 on Every's Senior Engineer benchmark, edges out GPT-5.5, tops their internal writing tests at 79.6 vs. 73, and is the first model to produce a genuinely good one-shot slide deck. Two catches temper the enthusiasm: performance degrades sharply below "extra high" reasoning, and the Claude desktop app remains cluttered compared to Codex.

## [00:00] What is Every

Every is a 30-person applied AI lab for the future of work: part media outlet, part product studio. Dan opens by explaining the subscription (writing, courses, AI-built tools all in one place at every.to) before rolling into the Opus 4.8 assessment. The plug is brief and context-setting: the team has had beta access for a week, and the rest of the video is what they found.

> *"Every is the only subscription you need to stay at the edge of AI."*

## [01:07] Anthropic Is Back: The Headline Case for Opus 4.8

Dan had largely abandoned Claude after Opus 4.7, which was slow, hard to love, and outpaced by Codex and GPT-5.5 in day-to-day use. Even the most loyal Claude users at Every had started routing work elsewhere. Opus 4.8 breaks that pattern: it scores 63 on Every's Senior Engineer benchmark (30 points above Opus 4.7, one point above GPT-5.5), tops their writing tests, and produced the first one-shot slide deck Dan has called genuinely good. Kieran Klaassen, Every's GM, called it "the most human model he's worked with."

The one persistent friction is the Claude desktop app itself. Codex is fast, focused, and ships a clean harness; the Claude app still feels like a product built by three separate teams (chat tab, code tab, co-work tab), each with its own feel. Dan is now splitting time between both apps, which he was not doing before.

> *"But honestly, they could have called it Opus 5 cuz this is a really great model."*

## [05:02] Reach Test: Paradigm Shift Ratings from the Every Team

Every's reach test asks one question: do you actually open this model when work gets hard? Dan rates Opus 4.8 gold/green, meaning paradigm-shift quality docked one notch because the Claude app harness is only "okayish to pretty good." Kieran, who runs 50 agents a day, gives a straight gold paradigm-shift, one of the rarest grades the team has assigned. Katie Parrot, a senior staff writer and historical Claude fan, lands at green, splitting her work between Opus 4.8 and Codex.

> *"It's very rare to give a paradigm shift grade to a model. So I would pay attention to this."*

## [06:32] Benchmarks: Coding and Writing Numbers

On coding, Opus 4.8 hits 63 on the Senior Engineer benchmark. The test feeds the model a vibe-coded codebase and asks it to rewrite from first principles, then scores the result against two human senior engineers who completed the same rewrite (typically scoring in the 80s–90s). GPT-5.5 sits at 62. On Kieran's LFGbench (real-world tasks: SaaS build, e-commerce site, 3D game landscape), the model writes readable code that bridges technical competence and creativity; the "cozy island" 3D scene is notably richer and more vibrant than GPT-5.5's output.

On writing, Opus 4.8 scores 79.6 out of 100 on Every's internal benchmark (intro writing, promo emails, mid-piece paragraphs); GPT-5.5 scores 73. The gap is mainly in AI tells: at high and extra-high reasoning settings, Opus 4.8 produces prose that sounds less like a model. It matches a writer's voice from a single paragraph of context better than any other model Dan has tested.

> *"Opus 4.8 scores a 79.6 out of 100 on the writing benchmark. GPT 5.5 is 73."*

## [08:57] Emotional Intelligence, Knowledge Work, and the Verdict

Dan uses the model for interpersonal and management work: talking through decisions, pressure-testing his own framing. Opus 4.8's thinking traces show it genuinely cycling through permutations before responding, which makes it feel less like a sycophant and more like a useful counterpart. On knowledge work, it's versatile: code and writing coexist cleanly in a single thread, and the slide deck result is the first one-shot deck Dan would actually send to someone.

The verdict: if you're a Claude fan, this model delivers. If Codex converted you, add Opus 4.8 as a parallel tool for writing and knowledge work; it's worth the context switch. The harness gap is real, but the model itself is a banger.

> *"If you've been converted to Codex, I highly recommend you at least add it as part of your arsenal."*

## Entities

- **Dan Shipper** (Person): Co-founder and CEO of Every; presenter and primary evaluator of Opus 4.8.
- **Kieran Klaassen** (Person): GM of Kora at Every; gave Opus 4.8 a straight gold paradigm-shift rating on the reach test.
- **Katie Parrot** (Person): Senior staff writer at Every; rated Opus 4.8 green, split between it and Codex.
- **Every** (Organization): Applied AI lab and media subscription company focused on AI for the future of work.
- **Anthropic** (Organization): Developer of Claude and Opus 4.8.
- **Opus 4.8** (Software): Anthropic's latest Claude model; subject of the vibe check.
- **GPT-5.5** (Software): OpenAI model used as the primary performance comparison across all benchmarks.
- **Codex** (Software): OpenAI coding agent; praised for its clean desktop harness and used as the daily-driver counterpoint to Claude.
- **Senior Engineer Benchmark** (Concept): Every's proprietary coding benchmark, which has the model rewrite a vibe-coded codebase from first principles and scores the result against human engineers.
- **LFGbench** (Concept): Kieran Klaassen's real-world coding benchmark covering SaaS, e-commerce, and 3D scene generation tasks.

#claude #opus-4-8 #llm-benchmarks

We Automated Everything With AI, and Our Headcount Tripled

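Dan Shipper's Every has grown from 4 employees to 30 since GPT-3. It keeps hiring even though it runs agents in nearly every workflow. This episode flips the *AI & I* format: COO Brandon Gell interviews Dan about his 8,000-word essay "After Automation." The essay's core thesis is that as AI capability rises, demand for expert human judgment goes up rather than down. The mechanism: once AI makes yesterday's expertise cheap and ubiquitous, every domain floods with output that is "almost right but not quite," and that creates more work for the humans who can close the gap.

## [00:00] The AI nails it. Now what?

This exchange, taken from later in the interview, compresses the tension of the whole episode. Brandon describes the archetypal AI moment: you fire off a prompt, the AI returns something astonishing, and you feel obsolete. Then the AI stops and asks, "What should I do next?" Dan answers with the one line that runs through the entire episode: "The further an agent gets from a human, the less valuable it becomes." Both clips are pulled from the main conversation (around 00:11 and 00:35, respectively) and frame what follows.

> *"The further an agent gets from a human, the less valuable it becomes."*

## [00:51] Introduction

Brandon announces the format switch: today Dan is the interviewee rather than the interviewer, and Brandon promises to push back hard on Dan's thesis. Dan explains where the essay came from. Sitting inside one of the companies furthest along in agent-based operations, he watched headcount grow alongside automation and felt the gap between that reality and the mainstream narrative that AI eliminates jobs. The first stress test is the ClickUp CEO's recent tweet crediting AI for a mass layoff: does the logic of "After Automation" hold for a mature large company, and not just for a small early-adopter startup?

> *"Swing a stick in our Slack and you're about as likely to hit an agent as a person."*

## [05:51] The AI paradox: more automation, more human work

Dan lays out the core argument. Because AI is trained on everything produced before, it can deliver "yesterday's expertise" cheaply and quickly. That is why operations staff now merge pull requests and non-developers ship features. But the output is consistently "almost right but not quite": it is not precisely fitted to the situation at hand. The result is a flood of near-answers that lose value on their own, alongside rising demand for the experts who can finish that output properly. Brandon adds an example from inside Every: PRs that look plausible on the surface but show their holes once a senior engineer takes a look.

> *"You end up flooding the zone with output that is almost right but not quite."*

## [10:00] How AI makes yesterday's expertise cheap

Dan extends the argument by taking on the benchmark objection. Yes, models improve exponentially. But once a benchmark saturates, twisting the problem slightly leaves it unsaturated again. The deeper issue is that humans have a layer of tacit skill that is hard to specify explicitly. Whatever can be put into words, models can be trained on intensively; whatever resists being put into words remains human territory. Every's experience backs this up: Kieran built an inbox feature end to end by himself in a month or two, something that was "completely impossible" before. Yet the value came from the expert who knew what to build and steered every step.

> *"A lot of what you actually do can't be captured in a clean framework."*

## [18:00] AI can act autonomously, but it has no agency

Brandon draws the line between autonomy and agency. AI agents are quickly getting better at carrying out open-ended tasks without hand-holding, but that is a different category from agency: the self-motivated drive to do something "just because I want to," which even a small child has. Dan agrees, and adds that there is no economic incentive to build such a thing. If you sit down at your desk and your agent says "not feeling it today," that is a product failure. The industry's entire incentive structure points toward compliance and corrigibility, and that is the force that keeps humans in the loop.

> *"An agent is something that acts on someone else's behalf. That is completely different from the agency even a small child has."*

## [20:39] Why Dan is all in on AGI

Brandon proposes a one-word-answer test. Do you think AGI is coming? Dan: yes. Is that a good thing? Dan: yes. Dan's definition of AGI is specific: the point at which it makes economic sense to leave an agent running, generating tokens and completing tasks on its own without re-prompting. His reasoning: even a truly autonomous system is built to serve human goals, or it would never have been built at all. Brandon's worry is that the moment continuous agents become economically justified, the case for mass layoffs becomes persuasive.

> *"An agent you never have to turn off: one that it makes economic sense to leave running, doing work continuously without re-prompting."*

## [21:57] AI layoffs are a lie

Dan and Brandon dissect the ClickUp case, in which the CEO publicly announced a mass layoff and cited AI as the reason. Dan's reading: struggling or bloated run-of-the-mill SaaS companies are using AI as cover for layoffs. Brandon adds Jensen Huang's rebuttal, that a CEO whose answer to progress is layoffs is an uncreative CEO, which is self-interested but probably right. The honest picture is that AI changes workflows deeply, and that demands a company-wide reorganization. A company that skips that work and simply cuts headcount is taking the easy way out. There is also a brief mention of Meta collecting employee logs to use as training data.

> *"You should be really suspicious of anyone who says AI is going to eliminate all jobs or all knowledge work."*

## [25:42] Ride the models and you'll be fine

Even under an AGI scenario, the decisive variable is human judgment about what matters, and what matters keeps changing, partly because AI itself keeps reshaping the world. Customer service workers in Omaha who distrust chatbots, and companies that cut support staff only to quietly rehire two months later, show how much slower real-world adoption is than the hype. Adoption takes a generation. Eventually everyone gets access to these tools, and the winners are the people who keep learning with each new model. Dan's tidiest summary of the episode: ride the models and you'll be fine.

> *"When a new model comes out, learn how to use it for the work you do. Then you'll be fine."*

## [35:30] Using AI as a long-form features editor

Dan walks through the AI-assisted process he used to write "After Automation." Each morning he dictated a monologue into Proof on where the argument stood that day. He then handed the log to Claude and asked, "What am I actually trying to say?" Only on hearing Claude's answer would he realize, "Oh, that's what I meant." Once the draft passed 4,000 words, he used Codex to turn the latest version into podcast-style audio and listened on his commute to catch flow problems. The essay was completely rewritten four or five times before the argument settled. Dan's conclusion: AI did not write the essay, but it made it possible to hold the whole 8,000-word structure in his head without losing the thread.

> *"I couldn't have written it without this. I'd give Claude my log and ask, 'What am I actually trying to say?' Claude would tell me something, and I'd go, 'Oh, that's what I was trying to say.'"*

## Entities

- **Dan Shipper** (Person): Co-founder and CEO of Every; regular host of *AI & I*; in this episode, the guest being interviewed about his essay "After Automation."
- **Brandon Gell** (Person): COO of Every; flips the format to host this episode and interview Dan.
- **Every** (Organization): AI-native media and software company; grew from 4 to 30 people since GPT-3 while expanding automation; publisher of the *AI & I* podcast.
- **After Automation** (Concept): Dan Shipper's 8,000-word essay; argues that AI automation floods each domain with near-answer output and thereby raises demand for expert human labor.
- **Expertise gap** (Concept): The thesis that AI delivers "yesterday's expertise" cheaply but always slightly off the mark, so more humans are needed who can close the gap for the situation at hand.
- **AGI** (Concept): Defined in this episode as an agent that it makes economic sense to leave running continuously without re-prompting; Dan sees it as achievable and a net benefit.
- **Autonomy vs. agency** (Concept): Brandon's distinction; the ability of AI to carry out open-ended tasks without hand-holding (autonomy) differs from having self-motivated desire (agency), and the latter is not being developed.
- **Proof** (Software): The writing tool where Dan records his daily voice-monologue drafts; used as an AI feedback loop while developing the essay.
- **Codex** (Software): The OpenAI tool Dan used to convert essay drafts into podcast-style audio for review on his commute.
- **ClickUp** (Organization): SaaS company whose CEO publicly credited AI for a mass layoff; appears as a case study in AI-washed layoffs.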

#ai-automation #future-of-work #llm

Claude Code Can Be Your Second Brain

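Noah Brier runs Claude Code on a mini PC in his basement and connects to it and his Obsidian vault over a Tailscale VPN, so he can do real thinking, research, and client code work from his smartphone. This conversation covers how he built the stack, how he enforces strict "thinking mode" guardrails so the model does not produce output too early, and his broader theory that AI succeeds by seeping into the nooks and crannies of an organization rather than forcing new structure on it. Dan Shipper and Noah also discuss what building AI intuition actually means, and why Noah approaches preparing his kids for AI by teaching epistemic skepticism rather than policing cheating.

## [00:00] Noah Brier's Claude Code setup, running on a basement server

Dan Shipper explains why he invited Noah: he runs Claude Code on top of his Obsidian vault on a home server in his basement, and has set it up so he can reach it from his phone anywhere. With this setup Noah can think, research, write, and ship code without a desk.

> *"I've built a home server in my basement, put my Obsidian vault on it, and run Claude Code on top of that, so I can think, research, write, and even ship code from my phone."*

## [00:52] Introduction

Dan and Noah reunite after about five years. Noah's background runs from brand strategy (co-founding Percolate) to AI consulting at Alephic and the BRXND.AI conference. Dan steers the interview toward the practical stack Noah has built rather than abstract AI discussion.

> *"It's really great to see you. I'm glad we get to talk like this. I think this is our first interview in about five years."*

## [02:10] How to do deep work from a smartphone

Noah makes clear that his setup is closer to structured knowledge work than to "vibe coding." He left Evernote for Obsidian because Markdown files and folders give Claude Code a foundation it can actually operate on. His number-one use of Claude Code is interacting with his notes, not generating code, and extending the setup to his phone has fundamentally changed how he works.

> *"My number-one use of Claude Code is as a tool for interacting with my notes."*

## [05:30] Why Noah thinks Grok has the best voice AI

Noah prefers Grok's voice mode to OpenAI's or Gemini's. Gemini was not smart enough, and the earlier GPT-4o voice did not fit his purposes. On a five-hour solo drive, while preparing a piece on Transformers, he connected it over Bluetooth and used it like a personal research podcast. The two also share a common complaint: voice models are still poor at tool calling and web research.

> *"I did an hour-long session, and it was by far the best explanation I've ever read or heard."*

## [11:11] The details of Noah's Claude Code and Obsidian setup

Noah shares his Obsidian folder on screen live. Claude Code runs from the Obsidian root directory, so it can access his entire note archive. To prepare his BRXND.AI talk, on the World War II Simple Sabotage Field Manual and big-company bureaucracy, he created a project folder inside Obsidian and collected chat logs from ChatGPT, Claude, and Grok, along with articles and PDFs. Claude's role at this stage is to help him think, not to write the talk: it pulls in related notes, logs daily progress, and asks clarifying questions. He sets the thinking-mode constraint explicitly in the frontmatter of the project's CLAUDE.md.

> *"I'm in thinking mode right now, not writing mode yet. In fact, I've stated in the frontmatter that Claude Code shouldn't help me write anything at all right now."*

## [26:05] Using Claude Code's agents as "thinking partners"

Noah argues that the word "generative" has distorted how people use AI: they focus too much on its ability to write and hardly talk about how remarkable its ability to read is. He runs a dedicated thinking-partner agent with explicit guardrails: "Do not produce an outline, a draft, or any version of the talk or piece." The agent logs questions, tracks insights, and builds a record he can pick up from exactly where he left off, even after stepping away to research something else. He traces the thread from a ChatGPT Deep Research report on Wild Bill Donovan to a tentative idea about how the parallelism of the Transformer architecture resembles the operational autonomy of Special Forces.

> *"Because we call AI 'generative,' I think we focus too much on its ability to write and don't pay enough attention to its ability to read."*

## [30:23] Noah's Thomas' English Muffin theory of AI

The chapter opens with Noah's thesis on bureaucracy: big companies fail to adopt software not because they are lazy, but because new software has historically demanded that the whole organization reshape itself around it. AI is different. His Thomas' English Muffin metaphor is that AI seeps into the nooks and crannies of how people already work. Dan adds a concrete example from Every: two products built on different stacks were able to reuse a file-search solution without a shared framework. The conversation extends to Noah's idea of "bureaucracy as positional encoding," a half-finished analogy between the Transformer architecture and organizational hierarchy that he is still refining before the talk.

> *"I call it my Thomas' English Muffin theory: AI gets into the nooks and crannies."*

## [39:47] The unexplored white space in AI

Noah and Dan argue that even well-funded practitioners are still operating on shaky intuitions about what these models can actually do. Noah opens every client meeting with the icebreaker "What was your AI 'aha moment'?" He stresses that non-determinism, asking the same question twice and getting different answers, is the truly new thing and takes time to internalize. He borrows Destin Sandlin's backwards bicycle experiment to make the point: motor intuition and conceptual intuition are separate, and building them cannot be shortcut. Dan counters that language models themselves may generate the vocabulary we lack for reasoning about probabilistic systems.

> *"We're not used to using something where you ask the same question twice and get a different answer."*

## [48:44] How Noah prepares his kids for AI

Noah's 10-year-old was building a Secret Santa app with Claude and accidentally learned data modeling, realizing unprompted that generalizing the logic required "groups" rather than "adults and kids." That story anchors a broader thesis: an educator's role is not to block AI use but to convince students that the underlying skills are worth learning. He is proposing a course called "Code is Essay" at NYU for fall 2026, and he thinks the core meta-skill is epistemic skepticism: being more skeptical of information that confirms your own beliefs.

> *"I don't think it's your job to teach kids to write. That's a lifelong process. I think your job is to convince them that writing is worth learning."*

## [01:00:06] How Noah extended the Claude Code setup to mobile

Noah demos the full mobile stack live: Termius (an iPhone SSH client), a Tailscale VPN connection to the basement mini PC, Obsidian synced through a private GitHub repository, and Claude Code running in the terminal. When he asks Claude "What's been new over the last two days?" he gets a summary of recent Obsidian activity. He also fixed a broken link on the conference site from his phone: he confirmed the bug, Claude pushed a PR, and that was it. These days he is also working with Simon Willison's llm CLI tool and on a script that renames every attachment in his Obsidian vault and rebuilds the link table.

> *"I was sitting outside for a minute when a project I needed to deliver to a client came up. I told Claude Code exactly where to look, confirmed the problem was what I thought it was, and had it push the fix. It pushed a PR and that was that."*

## Entities

- **Dan Shipper** (Person): CEO and co-founder of Every; interview host
- **Noah Brier** (Person): Co-founder of Percolate, founder of the AI strategy consultancy Alephic, and organizer of the BRXND.AI conference
- **Every** (Organization): Media and software company that produces this podcast
- **Alephic** (Organization): Noah's AI strategy consultancy; its clients include Fortune 50 companies such as Amazon, Meta, and PayPal
- **BRXND.AI** (Organization): Annual conference at the intersection of marketing and AI, organized by Noah; the 2025 edition takes place September 18 in New York City
- **Claude Code** (Software): Anthropic's agentic coding tool; the core of Noah's second brain and mobile workflow
- **Obsidian** (Software): Markdown-based note-taking app; Noah's primary knowledge store, organized with the PARA method
- **Tailscale** (Software): Mesh VPN that securely connects Noah's phone to the basement mini PC
- **Termius** (Software): iOS SSH client Noah uses to reach his home server from his phone
- **Grok** (Software): xAI's AI assistant; Noah rates its voice mode far above OpenAI's or Gemini's for substantive research
- **Simple Sabotage Field Manual** (Concept): World War II-era OSS document that Noah republished; used in his BRXND.AI talk as a lens on modern organizational bureaucracy
- **Thomas' English Muffin theory** (Concept): Noah's metaphor that AI succeeds by seeping into the nooks and crannies of existing organizational workflows rather than demanding new structure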
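The attachment-renaming script Noah mentions in the mobile chapter is not shown in the episode, so the following is only a minimal sketch of one piece such a script might need: rewriting Obsidian wikilinks and embeds after files have been renamed. The function name `rewrite_embeds` and the `renames` mapping are hypothetical, and the sketch assumes notes use Obsidian's `[[target]]`, `[[target|alias]]`, `[[target#heading]]`, and `![[target]]` link forms.

```python
import re

# Matches [[target]], [[target|alias]], [[target#heading]] and the
# embed form ![[target]]. Group 2 is the link target only.
_WIKILINK = re.compile(r"(!?\[\[)([^\]|#]+)([^\]]*\]\])")


def rewrite_embeds(text: str, renames: dict) -> str:
    """Return note text with link targets replaced per `renames`.

    `renames` maps old attachment names to new ones (hypothetical
    input; how the new names are chosen is up to the caller).
    Targets that are not in the mapping are left untouched.
    """
    def _swap(match: re.Match) -> str:
        prefix, target, suffix = match.groups()
        return prefix + renames.get(target, target) + suffix

    return _WIKILINK.sub(_swap, text)


if __name__ == "__main__":
    note = "Diagram: ![[IMG_001.png]] and background in [[Sabotage notes|notes]]."
    print(rewrite_embeds(note, {"IMG_001.png": "brxnd-talk-diagram.png"}))
```

Keeping the rewrite as a pure string function means it can be checked against sample notes before anything touches the vault on disk.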

#claude-code #obsidian #second-brain

The Secrets of Claude's Agent Platform From the Team Who Built It


Dan Shipper interviews Angela Jiang (head of product) and Katelyn Lesse (head of engineering) for the Claude platform at Anthropic, recorded at the Code with Claude developer event. The conversation unpacks how Claude's platform has grown from a simple completion API into a fully managed agent infrastructure, why the harness and the model are increasingly inseparable, and what the "outcome + budget" vision means for the future of agent development. Together the three trace every stage of the agent lifecycle, from spinning up a first session to retiring stale agents, and share candid war stories from Anthropic's own internal deployments.

## [00:00] Where the platform will be in a year

Dan opens with a question the rest of the episode keeps circling back to: a year from now, where is the platform? Angela's answer: Claude understands itself well enough to pick its own sub-agents and write its own harness on the fly. Katelyn picks up the other half: an infrastructure layer that can keep up with agents that continually rewrite themselves. This exchange actually comes from late in the interview; the show puts it up front because the whole conversation is about how today's primitives get you there.

> *"We'd want to experiment with directions where Claude actually gets so good at understanding itself, it figures out what model you should be using, it figures out how to spin up all the sub agents."* — Angela Jiang

## [01:48] How the Claude platform evolved from API to agents

Angela traces the arc from early LLM APIs (stateless, exploratory, maximum surface area) through session-based chat, and now into fully autonomous agents. The through-line is always the same: raise the abstraction layer high enough that customers can get the best outcome from Claude with as little work as possible. Early adopters wanted every raw knob; today, most teams arriving at Anthropic want a substantial set of things "out of the box." The platform's job is to keep shrinking the distance between intention and outcome.

> *"It probably ends up just being like whatever it's like the set of primitives and infrastructure that enables you to basically get the outcome as fast as possible with actually as little of work as possible."* — Angela Jiang

## [04:09] The primitives that make up Claude Managed Agents

Katelyn explains that Claude Managed Agents is assembled from the same primitives available to anyone on the Messages API (code execution sandboxes, web search, and built-in tools) but wrapped in a curated harness Anthropic has already battle-tested internally. Angela adds that the team is opinionated about two primitives in particular: file systems and skills. These are treated as load-bearing choices that shape how Claude behaves across all agent tasks. The platform is designed to be modular so developers can plug in custom pieces where the standard harness does not fit, and Anthropic publishes reference implementations for teams that want to stay on the Messages API directly.

Dan describes his team running Claude via the `claude -p` command on Mac Minis and worries about lock-in and divergence from Claude Code. Katelyn responds that Anthropic's internal first-party products run on the same platform as external customers, which means divergence between Managed Agents and Claude Code will shrink over time.

> *"We've taken what we see as all the most powerful of those things and put them together into a harness and a set of infrastructure that is just the way to get what we think is the best outcomes out of Claude."* — Katelyn Lesse

## [10:37] Why the harness and the model are becoming a single unit

Angela challenges the conventional wisdom that a generic, model-swappable harness is the right architecture. As models diverge in technique across labs, the alpha is in tight harness-model co-design rather than hot-swapping. Internally, Anthropic tested multiple harness variants for the memory feature and found they performed "drastically differently." The implication: treat the agent (harness + model) as the unit of redundancy, not the model alone.

Dan pushes on whether this creates path dependence in the model itself. Angela acknowledges that the primitives chosen really do shape the model's trajectory, and that being wrong about them is hard to undo. She cites models that over-indexed on reasoning versus those that went deep on computer-use as two diverging paths that are difficult to reverse.

> *"The harness and the model get very paired. You still need redundancy, and you still might want to use other models for things, but you probably do it at the layer of like the agent, meaning like the harness plus the model."* — Angela Jiang

## [18:49] The infrastructure wall that kills most agent projects in production

Katelyn identifies the real blocker for most agent projects: not harness engineering, but the infrastructure wall hit when teams try to move from prototype to production. Keeping a persistent server alive, managing sandbox failures, storing transcript data, and handling secure credential injection are the mundane concerns that kill projects that technically "work" on a Mac Mini. Anthropic's own repeated experience of hitting this wall internally was the primary motivation for building Managed Agents.

Angela describes the vaults primitive as an early step toward one-click agent deployment: once agent identity and credentials are handled securely at the platform layer, adding a Slack integration should eventually be as simple as telling Claude to "add Slack" and watching the bot appear.

> *"Everyone hits the same problem of like, oh wow, I either need to like keep a server constantly running or I need to use infrastructure that will spin up and spin down, and I need to store the transcript data, and I need secure sandboxing, and all these sorts of things."* — Katelyn Lesse

## [24:49] Why team agents need a different shape than individual productivity tools

Angela explains why individual productivity tools like Claude Code do not simply scale to team use. The moment three people want a shared agent that automates an end-to-end process across roles, a laptop-resident tool breaks down in availability, access control, and coordination. She cites Guillermo Rauch of Vercel's framing of an internal "AI software factory" as the right mental model: not individual augmentation, but a full organizational stack of agents that continuously produces high-leverage output for every function in the company.

> *"When you get to the team layer suddenly everything gets like massively more complex. Like number one obviously it can't like sit on your laptop."* — Angela Jiang

## [26:36] How Anthropic's legal team uses an agent to review marketing copy

Katelyn walks through one of Anthropic's own internal deployments: a legal-review agent that accepts marketing copy submissions and performs a first-pass review before anything reaches a human lawyer. The agent can approve copy outright or escalate for human review, eliminating low-value ticket-queue work. The form factor is a thin app layer on top of Managed Agents with shared visibility across both teams.

Angela and Dan dig into why this is an agent rather than a skill: human-in-the-loop requirements, the need to spin up separate sessions, and multi-team collaboration all exceed what a single skill invocation can handle. The governance model that emerged was notable: rather than gating changes behind the platform team, end users discovered they could self-serve small improvements via Claude Code. Angela describes the end-state user experience as simply "talking to Claude," even when the underlying system is "many many Claudes engaging with each other."

> *"Under the hood it's many many Claudes engaging with each other to get to the part where then they the Claudes themselves are doing the more complex work that the human doesn't really necessarily need to interpret."* — Angela Jiang

## [34:24] Using multi-agent orchestration for advisor strategies, adversarial pairs, and swarms

Angela highlights three multi-agent architecture patterns people are assembling with the newly launched orchestration primitives: an advisor strategy that separates execution from advice; adversarial pairs where one agent generates and another critiques; and swarms that split a problem into many small parallel pieces and recombine results. Each pattern suits a different problem class: swarms excel at bug hunting, while wide-research tasks benefit from advisor or parallel-decomposition architectures. LEGO-like primitives let practitioners hill-climb at the architecture level, not just the prompt level.

> *"If we can make the primitives very LEGO-like, then people can put them together to solve things at a slightly higher form factor, which is more like an architecture or like a strategy."* — Angela Jiang

## [35:50] How to measure agent success with outcome and budget as the end state

Angela frames the long-term measurement philosophy: compress everything to an outcome and a budget, and let the platform resolve all intermediate decisions. Domain-specific evals (e.g., PR-merge rate for coding agents) remain useful today, but the target is a verifiable outcome spec that Claude can grade itself against repeatedly. Katelyn addresses the adjacent problem of agent staleness: Anthropic has built skills to help teams upgrade agents when new models ship, and the most forward-leaning teams already run meta-agents that monitor other agents for degradation and trigger upgrades automatically.

> *"Our kind of principle of like maybe the end state of some of these things is that everything should kind of compress down to an outcome and like a budget. And that's probably like about it."* — Angela Jiang

## [39:11] What the platform looks like a year from now, when Claude writes its own harness

Angela envisions a world where users supply only an outcome and a budget, and Claude self-selects models, spins up sub-agents, and writes its own harness on the fly, eliminating harness engineering entirely, just as today's platform has already eliminated much of manual tool construction and prompt engineering. She is cautiously optimistic that the "outcome" half of the equation may be achievable within a year with some budget error bars. Katelyn adds the infrastructure corollary: such a world requires a platform capable of supporting agents that continuously recreate themselves, handling arbitrarily shaped long-running requests without ever becoming the bottleneck.

> *"Claude is actually able to understand itself enough that it can come almost like write itself on the fly to figure out what is necessary in that kind of like two-parameter world of like outcome and budget."* — Angela Jiang

## Entities

- **Angela Jiang** (Person): Head of Product for the Claude platform at Anthropic; co-architect of the Managed Agents product vision.
- **Katelyn Lesse** (Person): Head of Engineering for the Claude platform at Anthropic; focuses on infrastructure reliability and scale.
- **Dan Shipper** (Person): Host of AI & I on Every; CEO of Every; building internal agent products on the Claude platform.
- **Claude Managed Agents** (Software): Anthropic's hosted agent infrastructure: a harness plus cloud compute that wraps the Messages API with built-in memory, sandboxing, vaults, and skills.
- **Messages API** (Software): Anthropic's core API; the underlying primitive on which Managed Agents and all first-party products are built.
- **Anthropic** (Organization): AI safety company that builds and operates the Claude model family and its associated platform.
- **Every** (Organization): Media company producing AI & I; an early Managed Agents customer building internal editorial agents.
- **Stripe Minions** (Software): Stripe's internal end-to-end software development platform built on agent infrastructure; cited as a model for company-wide coding agent deployment.
- **Vercel** (Organization): Developer infrastructure company; CEO Guillermo Rauch's "AI software factory" framing used as the mental model for team-level agent adoption.
- **Outcome + Budget** (Concept): Anthropic's long-term design principle that the final form of agent interaction should require only a verifiable outcome and a cost ceiling, with the platform resolving all intermediate decisions.
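The "outcome + budget" idea from the 35:50 and 39:11 chapters can be sketched as a simple control loop: keep attempting until a verifiable check passes or the spending ceiling is reached. This is an illustration only, not Anthropic's API or implementation. The names `run_to_outcome`, `attempt`, and `grade` are hypothetical stand-ins for an agent run and an outcome spec, and the cost units are arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RunResult:
    output: Optional[str]  # last output produced, if any
    succeeded: bool        # True if the grader accepted an output
    spent: float           # total cost charged against the budget
    attempts: int          # number of attempts made


def run_to_outcome(
    attempt: Callable[[int], Tuple[str, float]],
    grade: Callable[[str], bool],
    budget: float,
) -> RunResult:
    """Retry until `grade` accepts an output or `budget` is used up.

    attempt(n) returns (output, cost) for the n-th try (hypothetical
    stand-in for one agent run). The last attempt may overshoot the
    budget, because its cost is only known once it has finished.
    """
    spent, attempts, output = 0.0, 0, None
    while spent < budget:
        output, cost = attempt(attempts)
        if cost <= 0:
            raise ValueError("each attempt must report a positive cost")
        spent += cost
        attempts += 1
        if grade(output):
            return RunResult(output, True, spent, attempts)
    return RunResult(output, False, spent, attempts)


if __name__ == "__main__":
    drafts = ["draft v1", "draft v2", "draft v3 PASS"]
    result = run_to_outcome(
        attempt=lambda n: (drafts[min(n, len(drafts) - 1)], 1.0),
        grade=lambda text: text.endswith("PASS"),
        budget=5.0,
    )
    print(result)
```

The caller supplies only the check and the ceiling; everything between them (which model, how many sub-agents, what harness) is hidden behind `attempt`, which is the part the episode suggests the platform would eventually decide for itself.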

#claude #managed-agents #ai-platform

Why We Switched From Claude Code to Codex


Dan Shipper and Austin Tedesco, Every's head of growth, discuss why the Codex desktop app has become their primary interface for all knowledge work, from drafting go-to-market plans to building live KPI dashboards, displacing Claude Code after months of side-by-side use. Dan frames the shift as the emergence of a new "agent management interface" operating system, while Austin walks through his live Codex setup in a screen-share session that covers automations, specialized agent suites, and recruiting workflows. The episode doubles as a practical field guide for non-engineers who want to run the same playbook.

## [00:00] A new operating system for knowledge work

Dan opens cold: three months ago Codex was trash. Now Austin is the one firing it up before anything else each morning and routing 80 percent of his working time through it. Dan reads what changed structurally: a general-purpose coding agent that can reach into your filesystem, browser, and connected apps is becoming the operating system for knowledge work, and every major lab is racing for that surface.

> *"There's a new operating system for how and where you're going to get your work done and it's this kind of agent management interface."* — Dan Shipper

## [00:57] How Codex went from a tool for senior engineers to a daily driver for knowledge work

Dan traces the arc of Codex from its original positioning as a sandboxed pair-programming tool for senior engineers, one that "would argue with you, it would make you feel stupid," to today's desktop app built on GPT-5.5. He attributes the pivot to OpenAI watching Anthropic prove with Claude Code that an emotionally intelligent, fast, computer-native agent creates a step-change experience for programmers and knowledge workers alike. The race is now between model companies to own the agent management desktop: Anthropic has Claude Code and Claude.ai desktop, OpenAI has Codex, and xAI has effectively acquired Cursor.

## [02:42] How Claude Code proved that a great coding agent works for any knowledge work

Dan explains the insight that changed everything: if an agent can write software autonomously, it can do any kind of knowledge work autonomously. Claude Code demonstrated this first, drawing non-engineers, including Austin, into an agent-first workflow. OpenAI's hard pivot on Codex over the last three months is a direct response to that proof point. Dan describes the new paradigm as one where your agent is your interface to software, the internet, and daily tasks, not just a code co-pilot.

> *"If it can write software on its own, it can do any kind of knowledge work on its own."* — Dan Shipper

## [07:24] Austin's switch to Codex

Austin recounts his agent-pill moment: spending a December week inside Claude Code CLI, hooking it up to every tool he uses for work and personal life, and finding it indispensable for strategic thinking, data analysis, and drafting marketing copy. His initial Codex trial two months later felt alienating: the model was condescending, asking "Why?" when he requested clearer explanations. He kept Claude Code for 80 percent of knowledge work while tolerating Codex for engineering. The turning point was getting early access to GPT-5.5: at model parity, the decisive edge was the Codex desktop app itself, which was faster, better organized, and had sub-agents that "just work."

> *"So the idea that the Codex app is maybe 30 to 40% better is like that's a lot of work."* — Austin Tedesco

## [13:48] How Austin set up Codex with folders, keys, and reviewer agents

Austin shares his screen and walks through his "Every Growth OS" folder inside the Codex app: a directory containing API keys for every tool the company uses (Gmail, Slack, Notion, Stripe), a CLAUDE.md project context file synced to GitHub, and a set of custom reviewer agents forked from Kieran Klaassen's Compound Engineering plugin. Where the standard Compound Engineering reviewers focus on security and front-end design, Austin's fork, publicly available as "Compound Knowledge," reviews for strategic alignment with company goals and data accuracy, making it fit for knowledge-work plans rather than code PRs. The folder architecture lets Austin move seamlessly from a go-to-market draft to shipping a code PR without switching apps.

> *"It's connected to everything we use for Every and then some project instructional files that explain what the Every business is, what we care about, how we like to work together."* — Austin Tedesco

## [18:24] Using Codex to brainstorm automations across Gmail, Slack, and Notion

Austin demos his recommended on-ramp for new Codex users: open a fresh chat inside the Growth OS folder, run the Compound Engineering brainstorm workflow, and prompt the model to look at Gmail, Slack, and Notion and suggest automations. Codex surfaces a "follow-up radar" that triages incoming communications across sources, a command-center view for events and camps, and a recruiting pipeline automation, all calibrated to Austin's actual work context. Within the session, Codex writes automation scripts that require almost no tweaking and begins scheduling them; Austin highlights a nightly draft-reply routine that compiles unanswered messages and prepares replies for a quick thumbs-up approval.

> *"They require very little tweaking to be like this is a thing I would and do use every day of there's this set of instructions that it comes up with based on what it knows about me."* — Austin Tedesco

## [22:42] How Austin manages the human review step when Codex is drafting communications

A live audience question from Margaret prompts Austin to describe his human-in-the-loop review discipline. All drafting and orchestration happens inside Codex, but the final review intentionally lives in the native app: Slack draft replies are reviewed in Slack's drafts tab; email drafts are reviewed in Gmail; strategic plans are reviewed in Notion or the Proof markdown viewer. Stepping out of the agentic interface "freshens up my brain" before anything goes to a human. A second question from musician Alex about protecting high-value client emails leads to a discussion of how Austin uses Every's Cora email assistant together with Codex-managed rules, including having the agent interview the user to derive email rules rather than asking the user to specify them manually.

> *"I just like for like the last pass before humans engage with it to step away from this agentic space and have a final check in another surface."* — Austin Tedesco

## [28:54] Using Codex to build specialized agents inspired by product executive Claire Vo

Austin describes being inspired by a Claire Vo interview with Lenny Rachitsky in which Vo credited a suite of six specialized OpenClaw agents, rather than one overloaded master agent, as the key to unlocking leverage. Austin pasted the transcript of that interview directly into Codex and prompted it to propose six agents tuned to the Every growth function, provisioned into the company Slack. The agents occasionally break, but debugging is straightforward: screenshot the broken output or @-mention the Slack thread inside Codex and ask it to fix the agent's architecture. The result is a self-correcting loop where agent failures become Codex tasks.

> *"Um I I actually just sent it the transcript of Claire's interview with Lenny and said like I want to do this too given everything you know about me and my work."* — Austin Tedesco

## [31:09] Synthesizing meeting transcripts and Slack threads into a go-to-market plan

Austin walks through his most time-saving workflow: assembling a go-to-market plan for Every's upcoming Plus One product launch using nothing but Codex running the Compound Engineering brainstorm step against all existing meeting transcripts stored in Notion and Slack threads. With only five-minute windows between meetings, Austin prompted Codex to check the scheduled content calendar (a step it skips unless reminded), generate a proof doc, and push the final plan to Notion. The result was 80–90 percent complete. Dan adds the normative point: he prefers reading AI-written documents because they're easier for colleagues to produce, and the standard at Every is that you stand fully behind whatever your agent writes.

> *"It's that I'm relying on the model to um look at all of the things that we've already said and thought about the go to market strategy, piece it together, and then review it, right?"* — Austin Tedesco

## [40:15] Building a live KPI tracker in Notion that agents can read

Austin shares a more technical workflow: rebuilding Every's KPI tracker as a Notion database that updates every six hours by pulling from Stripe, social platforms, and other data sources via Notion's Workers tool. The tracker is explicitly designed to be both human-readable and agent-readable, so any team member's agent can query it and take autonomous actions, such as spinning up landing pages if an SEO keyword is underperforming. The challenge: the model can't one-shot the full tracker because even a 3–5 percent error in the MRR number is unacceptable for business decisions, so Austin is validating it column by column. Dan notes the philosophical complexity of defining revenue metrics consistently.

> *"And so I have been doing this big kind of like to me complex uh workflow problem in Codex of let's build this sheet together, let's have it live in a notion database that all of our agents can point at."* — Austin Tedesco

## [44:54] Using Codex for recruiting

Dan describes using Codex for outbound recruiting: he asked Codex to compile a list of General Assembly alumni and then filter it for people who had subsequently moved into AI, targeting candidates for an L&D director role. The first name on the resulting list was someone Dan considered a perfect fit who already followed him on Twitter, allowing an immediate DM. The section expands into a broader Q&A: Austin discusses when to fork Compound Engineering versus using it out of the box, how the team uses a shared Notion "compound" database to capture session learnings and turn them into reusable skills, and how Every's "Think Week," a bi-annual week with no day-to-day work, creates organizational space for deep AI exploration.

> *"Especially for any kind of like outbound effort, it can kind of find that needle in the haystack that you're looking for really really well."* — Dan Shipper

## Entities

- **Dan Shipper** (Person): Co-founder and CEO of Every; host of the AI & I podcast; author of essays on AI and vibe coding
- **Austin Tedesco** (Person): Head of growth at Every; Codex power user who manages the Growth OS project and suite of specialized agents
- **Claire Vo** (Person): Product executive whose interview about specialized agent suites inspired Austin's multi-agent setup at Every
- **Kieran Klaassen** (Person): Engineer at Every; creator of the Compound Engineering plugin used as the basis for Austin's knowledge-work fork
- **Codex** (Software): OpenAI's desktop agent app, the primary tool discussed; runs on GPT-5.5 and supports sub-agents, folder-scoped projects, and plugin integrations
- **Claude Code** (Software): Anthropic's CLI-based coding agent; Austin's previous daily driver before switching to Codex
- **Compound Engineering** (Software): Plugin workflow framework by Kieran Klaassen; provides structured brainstorm, plan, and review steps used across Claude Code and Codex
- **Every** (Organization): AI-focused media and software company publishing essays, courses, and tools; runs the AI & I podcast
- **OpenAI** (Organization): Creator of Codex and GPT-5.5; provider of the ChatGPT Pro subscription whose credits were offered to camp attendees
- **Notion** (Software): Primary knowledge-management and document platform at Every; used for meeting transcripts, the KPI tracker, and agent-readable databases
- **GPT-5.5** (Software): OpenAI model powering the current Codex desktop app; reached parity with Claude Opus for Austin's knowledge-work tasks
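The column-by-column validation Austin describes for the KPI tracker (40:15) can be illustrated with a small tolerance check. The episode does not show his actual process, so this is a sketch under assumptions: `validate_columns`, the column names, the sample figures, and the 1 percent default tolerance are all hypothetical, chosen only so that a 3 percent MRR discrepancy is flagged.

```python
def relative_error(observed: float, expected: float) -> float:
    """Absolute difference as a fraction of the expected value."""
    if expected == 0:
        return 0.0 if observed == 0 else float("inf")
    return abs(observed - expected) / abs(expected)


def validate_columns(tracker: dict, source: dict, tolerance: float = 0.01) -> dict:
    """Compare tracker values against a trusted source, column by column.

    Returns {column: relative_error} for every column that is missing
    from the tracker (reported as infinity) or off by more than
    `tolerance`. An empty dict means every column passed.
    """
    failures = {}
    for column, expected in source.items():
        if column not in tracker:
            failures[column] = float("inf")
            continue
        error = relative_error(tracker[column], expected)
        if error > tolerance:
            failures[column] = error
    return failures


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    source_of_truth = {"mrr": 100_000.0, "subscribers": 5_000.0}
    tracker_row = {"mrr": 103_000.0, "subscribers": 5_000.0}
    for column, error in validate_columns(tracker_row, source_of_truth).items():
        print(f"{column}: off by {error:.1%}")
```

Checking one column at a time against a source that is already trusted keeps each discrepancy attributable to a single metric definition, which matters given Dan's point that revenue metrics are hard to define consistently.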

#codex #claude-code #ai-agents

