
Anthropic Workshop: Build Agents That Run for Hours — Ash Prabaker & Andrew Wilson

Nice meeting you guys. I'm Ash. This is Andrew. We both work as engineers on the Applied AI team here at Anthropic. The topic for this session was inspired by a blog post we put out just a couple of weeks ago, about how to think about building agents that can actually run for really long, extended periods of time. You know, we're talking 5, 6 hour plus kind of runs. I think we've all seen these kinds of demos, you know, companies being like, "Hey, we've one-shotted a browser," for example, but not necessarily sharing the details of what goes into the harness, and that's what we want to talk about today. So, first off, my amazing colleague Andrew will talk a little bit about how we got here, some of the primitives that we've shipped in Claude Code and, you know, where we are today. And then I'll hop back on stage to talk about some of the more experimental stuff that we're playing with in harnesses, as well as a few examples of what we've seen. But, over to you.

Sounds good. Thank you, Ash. And yeah, thanks everyone for joining the first session of the AI Engineer conference. So, glad you're spending it with us. My name's Andrew. I'm on the Applied AI team, based out of London, working as a solution architect with a lot of our digital native and industries customers.
So, yeah, I'm going to give a little bit of a history tour, a trip down memory lane, but really with the focus on all the things that we've shipped that lead to agents being able to run for multiple hours or even days at a time. And then I'll hand over to Ash to do more of the state of the art.

Right. Okay, so, a little quote on Twitter from Boris, the creator of Claude Code. This was on the one-year anniversary of Claude Code. Basically saying that a year ago, Claude was struggling just to write bash commands and escape strings, and it could run for, you know, maybe 20 minutes at a time. And we're now at the point where almost all of Claude Code is being written by Claude Code, and it can run effectively for days at a time. So a big swing over just the course of a year, and I'll walk through that history a little bit now.

But let me zoom in here, just to frame the problem. Why is it really difficult for these agents to run for extended periods of time? I think broadly there are three big buckets. Some are more intuitive than others.

So, firstly, context. I think we all understand context windows are very much finite. So you start a new session, there's like amnesia. The agent has to start from scratch, so you need some sort of memory component. Also, as you're working through a context window, there's this notion of context rot. So there's less coherence as you get deeper into that session.
Also, you might get to the point where the model actually exhibits what's called context anxiety. So it gets kind of nervous as it reaches the end of its context window, and it just hurries up to finish what it's doing.

This leads into planning. In general, models are not that great at planning just out of the box. They might try and do everything in just one shot. Or, for example, they might build half a feature and then stop, or they might just run out of context altogether and leave a half-finished app.

But then, maybe less intuitively, models are really bad at judging their own output. I know we all know that models can be sycophantic and tell you what you want to hear, but this applies to coding tasks as well. So it might look at a feature, see that it's half-baked or only a little bit implemented, and say, "Yeah, okay, that looks done," and then it'll move on to the next thing. Or it might build a feature like a button, but the back end for it doesn't actually exist. There's nothing behind it, but it looks like the feature is done. I know Ash will talk quite extensively about some of the new techniques we have to help with this, specifically so models can become better at judging their own output.

So there are really two ways we can fix these things. The first one is obviously the model. So, baking it all into the model weights themselves. And I'm sure you've all seen this METR chart.
It's basically how long an agent can run with a minimal scaffold while still completing 50% of the tasks. And you'll see that from Opus 3.7 it's around 1 hour, and up to Opus 4.6, one year later, it's at 12 hours. So, an entire day. And we've of course managed to get that running much longer. Other people have as well, but this is just a very minimal scaffold.

The second thing you can do is, of course, make changes to the harness itself. So this is the scaffolding around the model. And we have the Agent SDK, which ships with all of the primitives that we've been building over time. So there's the core agent loop itself, where you have the Claude model determining what to do and what tools to run; maybe it's pulling in some tools from MCP servers. It might delegate some tasks to a sub-agent. It's bringing in all the context from things like CLAUDE.md, or the skills that are loaded, or slash commands. And there's a whole permission system. This will change over time as the models get better and improve, but these are the core primitives that we're working with. And then of course you use this framework to build your own harness for whatever it is you're trying to do, such as some of the things Ash will show later on when we get to more long-running agents.
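As a rough illustration of the core agent loop described above, here is a minimal Python sketch. It is not the Agent SDK's API: `agent_loop`, the `Step` tuple shape, and the `model` and `tools` arguments are all hypothetical names chosen for this example, and the "model" is just any callable that reads the transcript so far and returns either a tool request or a final answer.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical step format for this sketch:
#   ("tool", tool_name, argument)  -> the model wants a tool to be run
#   ("final", answer, None)        -> the model is finished
Step = Tuple[str, str, Optional[str]]


def agent_loop(
    model: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_turns: int = 20,
) -> str:
    """Ask the model what to do, run the requested tool, feed the result back."""
    transcript = [f"task: {task}"]
    for _ in range(max_turns):
        kind, value, arg = model(transcript)
        if kind == "final":
            return value
        tool = tools.get(value)
        if tool is None:
            # Report the problem to the model instead of crashing the loop.
            result = f"error: unknown tool {value!r}"
        else:
            result = tool(arg or "")
        # The tool result becomes context for the next model call.
        transcript.append(f"{value}({arg}) -> {result}")
    raise RuntimeError("turn limit reached without a final answer")
```

In a real harness the `tools` mapping is where MCP tools, sub-agent delegation, and permission checks would plug in; the transcript list stands in for the context window.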
I think what's also interesting, just looking back at the last year of releases, is that whenever we've released a model, we've always released a lot of harness changes alongside it. So really these things are co-evolving together.

So we'll look back, firstly at prehistory, beyond, you know, one year ago. I think we all remember that period where Claude had the artifacts section of Claude.ai, and Sonnet 3.5 was the first model that really showed promise when it came to coding. And it could now verify: it could look at what it had built and iterate from there. That was quite an aha moment, sort of pre-Claude Code. But then we also shipped computer use, so it could start clicking around, taking screenshots, and testing its own code, as well as the MCP spec, which enabled it to use tools.

So then, getting into Claude Code, this is February 2025. So this is just over a year ago. Sonnet 3.7 was released, and this was sort of state of the art on SWE-bench. And Claude Code was released in research preview. And I think an interesting quote that I pulled from this release is that the goal of Claude Code was to better understand how developers use Claude for coding, to inform future model improvements. So essentially, when we released Claude Code, the whole idea was for it to be somewhat experimental, to inform how we actually improve the base model itself.
And you'll see this trend: over time the models become better, and certain aspects of the harness might become less necessary, or it will evolve. Just in terms of these slides as well, in the bottom left corner are some of the things that are the focus of each release, whether it's context or planning or verification, and then some stats, but I'm not going to read everything.

So, next, this was around May of last year: Opus 4 and Sonnet 4 were released. And in general these tools got much better at managing their own context and getting to task completion without reward hacking or anything like that. And then Claude Code became GA as well, and we released the Claude Code SDK. So, the harness powering Claude Code.

A little interlude here from the timeline. I think everybody now knows about this Ralph Wiggum technique. You might not know that it actually came out last July, when Geoffrey Huntley initially released it, because it really gained a lot of traction around December or so of last year, when people started playing around with it themselves. We also released our own Ralph Loop within the Claude Code harness itself.
But essentially it's quite a simple technique: you're just taking a prompt and feeding it into the Claude Code CLI, for example, and then running that in a loop until all the tasks are complete. It's a little bit deeper than that; I think people tend to simplify it. There are actually a few phases. At first you would have some kind of planning, where it breaks that prompt down into a few different features, and then it would pick one task from that, start a new session, and work with a fresh context window. So a lot of those concepts were applied in the Ralph Loop, but I think why it caught so much attention is that it seems really simplistic. As he put it, it's "deterministically bad in an undeterministic world." So the idea being that it's better to fail predictably than it is to succeed unpredictably.
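The phases described above (plan once, pick one task, run it in a fresh session, repeat until everything is done) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original Ralph script or the Claude Code implementation: `ralph_loop`, `Task`, `plan`, and `run_session` are hypothetical names, and `run_session` stands in for whatever launches one agent session with a fresh context window and reports whether the task was completed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    done: bool = False


def ralph_loop(
    prompt: str,
    plan: Callable[[str], List[Task]],
    run_session: Callable[[str, Task], bool],
    max_iterations: int = 50,
) -> List[Task]:
    """Plan once, then work through the tasks one fresh session at a time."""
    # Planning phase: break the prompt down into separate features/tasks.
    tasks = plan(prompt)
    for _ in range(max_iterations):
        pending = [t for t in tasks if not t.done]
        if not pending:
            break  # every task is complete
        task = pending[0]
        # Each call represents a brand-new session, so no context carries
        # over except what is passed in explicitly (prompt and task).
        task.done = run_session(prompt, task)
    return tasks
```

A failed session simply leaves the task pending, so the next iteration retries it from a clean context; `max_iterations` is a guard added here so the sketch cannot loop forever.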