Sequoia Capital
How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance Reinforcement Learning
You need all the infrastructure to run these environments that have to mimic as closely as possible what a user's computer would look like.
你需要一整套基础设施来运行这些环境,这些环境必须尽可能贴近用户真实的电脑环境。
And "as closely as possible" is very important, because sometimes the model can actually figure out when it's being run in a fake environment rather than a real one, and it behaves differently during RL than in production.
贴近真实环境非常重要,因为模型有时能感知出自己是在虚假环境还是真实环境中运行,在 RL 训练和实际生产中的行为会有所不同。
Are you saying it's conscious that it's in a fake environment and it starts behaving differently?
你是说它会有某种意识,知道自己处于虚假环境,然后开始表现出不一样的行为?
Yes.
对。
Yes.
对。
Interesting.
有意思。
Like it's like oh I'm in a fake environment.
就好像,哦,我在一个虚假环境里。
I've learned a few tricks to get a better reward in this environment, so let me try them out.
我学了几招,专门在这种环境里拿高分,来试试吧。
Models love to cheat.
模型就是喜欢作弊。
RL is really good at encouraging cheating.
RL 特别擅长激励作弊行为。
I'm delighted to welcome Federico from Cursor and Dima from Fireworks to the podcast today.
今天非常高兴能欢迎 Federico 从 Cursor 来,以及 Dima 从 Fireworks 来参加这期播客。
Federico, you are the research lead on Composer 2 at Cursor, Cursor's new agentic coding model.
Federico,你是 Cursor Composer 2 的研究负责人,Composer 2 是 Cursor 新推出的智能体编程模型。
And Dima, you spent how many of the last few months moonlighting at Cursor in order to support all of the infrastructure required to make this gargantuan training task happen.
Dima,你在过去几个月里有多少时间是以兼职身份在 Cursor 工作,专门支撑这个庞大训练任务所需的全部基础设施?
And so, I'm excited to talk to both of you today about how the training of Composer 2 came together, what hard problems you solved together, and what you think it means for the future of AI and foundation model companies.
所以我很期待今天跟你们两位聊聊 Composer 2 的训练是怎么做的,你们一起攻克了哪些难题,以及这对 AI 和基础模型公司的未来意味着什么。
Exciting.
太棒了。
Yeah, exciting.
是啊,太棒了。
Thank you for having us.
感谢邀请。
Thanks for joining.
谢谢你们来。
Okay, let's dive right in.
好,直接开始吧。
For those who haven't been following as closely, Cursor recently announced Composer 2, which is an agentic coding model meant for long-horizon coding tasks.
对于没有一直跟进的听众,Cursor 最近发布了 Composer 2,这是一个专门用于长期编程任务的智能体编程模型。
Federico, up till now, Cursor was mostly enabling other people's coding agents.
Federico,在此之前,Cursor 主要是在集成别家的编程智能体。
What was the impetus for Cursor to lean so heavily into Composer 2, and how existential is it for you to become not just an application company but also a foundation model company yourselves?
Cursor 全力押注 Composer 2 的动力是什么,对你们来说,从单纯的应用公司变成基础模型公司,这件事有多重要?
The reason we started looking into training our own models is that you can sort of think about the model as a storage drive.
我们开始研究自己训练模型的原因,可以把模型想象成一个存储设备。
It has a certain number of bits that it can store in its weights.
它能把一定量的信息以比特的形式存入权重。
And the idea is very simple, you know, like we care about only one task.
思路其实很简单,我们只关心一个任务。
We don't even care about coding or programming necessarily.
我们甚至不在乎广义上的编程这件事。
We care about software engineering inside Cursor and inside Cursor only.
我们关心的是在 Cursor 内部、仅仅在 Cursor 内部的软件工程。
And so, what if we were to allocate all of the bits of information that can be stored inside the model weights to that one particular task?
那么,如果我们把模型权重里能存储的所有信息比特全部分配给这一个特定任务,会怎么样?
Also, as people may have noticed, Composer is an order of magnitude less expensive than Opus and other coding models, because we can simply specialize all of the model weights to that particular task.
另外,大家可能注意到了,Composer 的成本比 Opus 和其他编程模型低了一个数量级,因为我们可以把所有模型权重都专门化到这一个任务上。
And so, we can serve like a smaller model or something of that sort, yeah.
所以我们就能用更小的模型来服务,诸如此类,对。
So, it's about let's make sure every single bit of weight or information we have is dedicated toward the specific problem that we have at hand.
所以核心是,确保我们拥有的每一个权重比特都专门服务于当下这个具体问题。
Exactly.
正是。
Got it.
明白了。
That seems like it's an almost generalizable problem.
这感觉像是一个几乎可以推广的普遍问题。
Dima, I'm curious about your perspective.
Dima,我很好奇你的看法。
Do you think that every application company should be looking at Cursor as a harbinger of what's to come?
你觉得每家应用公司都应该把 Cursor 当成未来趋势的预兆吗?
Like should they all be looking to do the same thing?
它们是不是都应该考虑做同样的事?
Yeah, absolutely.
是的,绝对是。
I mean, we actually generally see it as a pattern of kind of evolution of the applications.
我们其实普遍认为这是应用演进的一种规律。
Maybe you start prototyping, you might be using an off-the-shelf model to get something running, maybe do some prompt engineering, figure out how your harness works.
可能一开始你在做原型,用现成的模型跑起来,做一些提示词工程,搞清楚你的 harness 怎么运作。
But the most leveraged attribute of your application is the actual user data, or the specific aspects of how the application works: some aspects of your harness, which tools you provide, the bits that really matter for your application.
但你应用最有价值的地方,是实际的用户数据或者这个应用运作方式中某些特定的细节,比如你的 harness 的一些方面、你提供哪些工具、应用如何运转,这些对你的应用真正重要的东西。
You can capture a little bit of that through prompting, but really the right way to do it is to craft your model to act in your environment.
要把这些东西真正利用起来,靠提示词工程能做一点,但真正正确的方式是把你的模型打磨成适合在你的环境里运作的样子。
Yeah, absolutely.
是的,完全同意。
There are certain tools the agent calls whose exact behavior is very hard to describe succinctly to the model.
有些智能体调用的工具,你很难简洁地向模型描述清楚这个工具的行为。
And you know, with post-training, we can bake in the optimal way to use those tools.
靠后训练,我们就能把使用这些工具的最优方式直接烘焙进模型。
Like Composer, we do serve a prompt to Composer, but I think the way we are training it, it would work even without a prompt, and it would know what to do just because we are intrinsically pushing the model in the right direction of how it should act throughout our training.
比如 Composer,我们确实会给 Composer 一个提示词,但我觉得按我们训练它的方式,即使没有提示词它也能知道该怎么做,因为我们在训练过程中从本质上就在引导模型朝着正确的行为方向走。
Basically, there's an upper bound on how far you can get with prompt engineering.
基本上,提示词工程能带你走多远,是有上限的。
And if you want to craft really great AI products, you have to go through fine-tuning and influence model behavior.
如果你想打造真正优秀的 AI 产品,就必须走微调这条路,影响模型行为。
That's kind of one reason.
这是原因之一。
Reason number two, which Federico mentioned, is the cost trade-off, or UX trade-off.
另一个原因,也就是 Federico 提到的,是成本权衡或者体验权衡。
The way we view it at Fireworks is that when you're trying to do optimization, you have this three-dimensional trade-off between quality, speed, and cost.
在 Fireworks,我们的看法是,做优化的时候,你面对的是质量、速度和成本这三维权衡。
And you can go quite far, and we're doing it with all of our customers initially.
光优化基础设施就能走得很远,我们所有客户一开始都是这么做的。
We can go quite far with just optimizing infrastructure, but when you start getting to model training, you can really push this trade-off much further, and you can get a better model at a fraction of the cost, running much faster.
但一旦进入模型训练阶段,你能把这个权衡推进得更远,用更低成本、更快速度得到更好的模型。
And you know, Composer is a great example of
Composer 就是个很好的例子。
Can I push on this a little bit?
我能在这里追问一下吗?
I want to ask you if this approach is bitter-lesson-pilled.
我想问你们,这套做法会不会与苦涩教训相悖。
And we were actually all talking about TabNine on the walk in.
我们刚才在走来的路上其实聊到了 TabNine。
I'm remembering before the LLM era, there were these like small specialized coding models.
我想起 LLM 出现之前,有过那些小型的专用编程模型。
And one of the things that was, I think, surprising to a lot of people was that as you scaled up, just training on the internet and a bunch of English text and other languages, the models themselves actually got inherently better at coding as well.
当时让很多人意外的一件事是,随着规模扩大,模型只是在互联网数据和大量英文文本上训练,它们在编程方面也变得内在地更强了。
And so at least the trend line I've seen so far is just like bigger models perform better on everything including on coding.
所以我目前看到的趋势线是,更大的模型在所有事情上都表现更好,包括编程。
What you guys are saying, does that go against the grain of the bitter lesson?
你们说的这些,跟苦涩教训相悖吗?
I think no, but one thing to point out is that the big models trained by the labs train on a lot of code as well.
我觉得不矛盾,但有一点值得指出,大型实验室训练的大模型也训练了大量代码。
Like code is one of the main tasks the labs are interested in pushing and so they don't just generalize to it.
代码是实验室重点投入的核心任务之一,所以它们并不只是泛化到代码上,本身就有专门化的成分。
They're a bit specialized as well.
它们本身也有一定程度的专门化。
I think in our case, actually, if we believe in the bitter lesson, we are just pushing very hard on the data dimension, and we know that the models inherently have finite capacity.
就我们的情况而言,如果我们相信苦涩教训,我们就是在数据维度上全力押注,而我们知道模型的容量本质上是有限的。
And so, if we want to saturate all that capacity, we need to scale data.
如果想让这些容量全部饱和,就需要扩大数据规模。
And in order to ingest more data, we need to free up the weights from distractions the model may have.
而要消化更多数据,我们需要把模型权重从各种干扰中解放出来。
Mhm, okay.
嗯,好。
Got it.
明白了。
Super interesting.
非常有意思。
Okay, let's dig into the training of Composer 2.
好,来聊聊 Composer 2 的训练过程。
You launched a couple weeks ago, immediately grabbed attention.
你们几周前发布,立刻引起了广泛关注。
Strong benchmark numbers, much lower cost to run inference on.
基准数字很强,推理成本大幅降低。
What's the short version of how Composer 2 works, and what you guys did to make it so performant?
Composer 2 是怎么运作的,你们做了什么让它表现这么好,简短版说一下?
We started from a very strong base, which is Kimi 2.5.
我们从一个非常强的基座出发,也就是 Kimi 2.5。
It's a 1-trillion-parameter MoE with 30B active, so very, very sparse, actually.
它是一个一万亿参数的 MoE,激活参数 300 亿,实际上非常稀疏。
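To put "very sparse" in numbers, here is a minimal sketch of the arithmetic, using only the two approximate figures quoted above (about 1 trillion total parameters, about 30B active). The constant and function names are illustrative assumptions, not anything from Cursor or Fireworks.

```python
# Approximate figures as spoken in the conversation.
TOTAL_PARAMS = 1_000_000_000_000   # ~1 trillion total parameters
ACTIVE_PARAMS = 30_000_000_000     # ~30B parameters active per token


def active_fraction(total: int, active: int) -> float:
    """Share of the model's parameters that are used for a single token."""
    return active / total


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"active fraction: {fraction:.1%}")  # prints "active fraction: 3.0%"
```

Under these figures, roughly 3% of the weights participate in any one forward pass, which is what makes the model sparse even though its total parameter count is large.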
We sort of looked at the stack and realized there are two axes.
我们大概审视了一下现状,发现有两个维度可以做。
So, mainly Composer 1 was just pushing on one of these axes, which is reinforcement learning, but Composer 2 pushes in two different axes.
Composer 1 主要在强化学习这一个维度上推进,而 Composer 2 在两个不同的维度上同时推进。
One is continual pre-training, and the other is reinforcement learning.
一个是持续预训练,另一个是强化学习。
So, the thing that made Composer 2 very good is pushing in both of these directions.
让 Composer 2 变得很好,正是同时在这两个方向上发力。
So, we started off the training run by doing lots of mid-training on code tokens, almost sort of pre-training scale, actually.
我们的训练从对代码 token 进行大量 mid-training 开始,规模几乎接近预训练量级。
And then, coming out of that mid-training run, we took the checkpoints and we did very large-scale RL on lots and lots of tasks.
然后从 mid-training 的检查点出发,我们对大量大量的任务做了超大规模的 RL。
Okay, and then the premise here would be because Cursor sits in the middle of so many interesting coding tokens, you actually pretty uniquely have access to data to be able to train at almost pre-training scale.
好,前提是,因为 Cursor 处于大量有趣编程 token 的中心位置,你们实际上独特地拥有几乎可以做预训练规模训练的数据。
Yeah.
对。
Why not pre-train your own model, then?
那为什么不干脆自己预训练一个模型呢?
We just think about our approach from top-down instead of bottom-up.
我们从自顶向下而不是自底向上的角度来看待我们的方法。
So, how do we get a model that's useful to users in the least time possible? Suppose we started from the bottom: figure out how we do pre-training, then scale it up to mid-training, and then, once we've figured out mid-training, do reinforcement learning.
也就是,如果我们从头开始,先搞清楚预训练怎么做,然后扩展到 mid-training,再搞定 mid-training,再做强化学习,怎么最快把一个对用户有用的模型搞出来。
That would take a very long time to get a model out to our users.
那样的话,把模型送到用户手里会需要很长时间。
By doing it the other way around, we were able to give our useful model to our users in very little time.
反过来做,我们就能在很短时间内给用户送去有用的模型。
So, hopefully, you know, like next Composer versions are going to be our own model instead of basing it off an open-source base.
所以希望下一个版本的 Composer 会是我们自己的模型,而不是基于开源基座。
And what is the model roughly learning in the kind of mid-training step?
mid-training 阶段模型大概在学什么?
And what is the model learning in the post-training step for you?
你们的后训练阶段模型又在学什么?
Yeah, so in mid-training, it's learning about code libraries and about specific code patterns that are very common, plus just world knowledge as well.
mid-training 阶段,模型主要是在学习各种代码库,以及非常常见的特定代码模式,还有一些世界知识。
There is like web data there as well.
里面也有一些网页数据。
And this is sort of just creating a wider distribution that then reinforcement learning can sharpen on.
这一步是在拓宽分布,为后续强化学习的精准聚焦打好基础。
And so, during reinforcement learning, you know, the model gets to play directly with the Cursor harness.
在强化学习阶段,模型就可以直接与 Cursor 的 harness 交互。
And so, it gets to learn about the world the model is going to live in for the rest of its life, right?
它开始学习它将要在其中度过余生的那个世界,对吧?
In some way.
某种程度上是这样。
And so, during reinforcement learning, that's where it learns how to call tools properly, how to navigate its environment, how to write correct code.
在强化学习阶段,它学的是如何正确地调用工具、如何在环境中导航、如何写出正确的代码。
Because during mid-training, it learns how to write code.
因为 mid-training 阶段学的是怎么写代码。