Where is the next training paradigm headed?
So here's a big research bet that all the labs are making.
They think that if we train AIs to accomplish millions of verifiable tasks across thousands of diverse RL environments, then we will have basically built AGI, because this kind of training will have created a kind of problem-solving agent: the kind of thing that can make progress on open-ended tasks for weeks on end in the face of errors and mistakes and ambiguity.
And the people who are optimistic about this vision will say that all these things we talk about as the fundamental deficits in the current training paradigm (for example, the data inefficiency of these models, or the fact that they lack continual learning) can just be steamrolled if we scale training more, in the same way that all the fundamental research problems in natural language processing collapsed when we threw enough compute into LLMs.
So in the previous essay, I talked about how these models are one one-millionth as sample-efficient as humans, and the people who are in favor of the current training paradigm will say, "Look, that might be true, but this is only true during training."
Training is a one-time cost that is amortized across the billions of sessions that a model will experience.
What really matters is how smart and general and sample-efficient the model is during a session, and this has clearly been improving as we've been doing more RL training.
AI agents are able to solve more and more ambitious problems over longer and longer time spans.
Anybody who has used these models for coding knows that.
Similarly, people would say, look, continual learning, this capability I keep harping on, where the model's weights get updated based on what it's learning from deployment, may simply not be necessary.
Because if in-context learning gets good enough across longer and longer time horizons, then you don't need to distill everything the model is learning on the job back into the weights.
People often say that their employees are not net productive until six months or more on the job.
So clearly, online learning is necessary for competence.
But what if you could just fit those six months into the context window?
There have been tons of architectural innovations that dramatically increase the amount of information, or the amount of context, that a transformer can store.
And why not think that, with a couple more years of progress, we might have what feels like infinitely large context windows?
Okay, so before we discuss this research bet a bit further, I want to step back and ask a completely tangential question, which I find actually very interesting and confusing about the nature of current AI progress.
Why has progress on computer use been so much slower than in other domains?
Computer use is so clearly verifiable.
You could ask a question like: did the Etsy item I ordered get delivered?
Is the venue for an event I'm trying to organize booked?
Have my taxes been submitted?
So isn't it weird that computer use has been making so much slower progress than coding and math and these other verifiable domains?
I'm sure there are many reasons for this, and one of them, of course, is the fact that the models are exposed to far less high-quality multimodal data during pretraining.
But one reason that I think is actually quite underrated by people, and which I think reveals the canyon walls against which this river of AI progress will only slowly chip away, is that it is not enough for a domain to be verifiable.
It also has to be very grindable, in the sense that you have to be able to run lots of parallel rollouts against a deterministic and replayable simulator, and you have to run those rollouts from the same starting point.
If you're trying to make a model better at coding, you can define some container that has a software repo with some missing feature that you have tasked the AIs with creating.
And then you have a thousand parallel agents go at the problem, each of which has an identical copy of the container.
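To make the "grindable" setup concrete, here is a minimal Python sketch. It is only an illustration under assumptions: `ToyRepoEnv`, `rollout`, and `run_rollouts` are hypothetical names, the "repo" is a toy list of digits rather than a real container, and the rollouts run in a plain loop rather than truly in parallel. What it tries to show is the property described above: every agent starts from an identical copy of the same snapshot, the environment is deterministic, and a verifier scores the result.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class ToyRepoEnv:
    """Hypothetical stand-in for a container holding a repo with a missing feature."""
    target: tuple = (3, 1, 4)                     # the "feature" the agent must produce
    written: list = field(default_factory=list)   # what the agent has written so far

    def step(self, action: int) -> None:
        # Deterministic transition: same state + same action -> same next state.
        self.written.append(action)

    def verify(self) -> float:
        # Verifiable reward: fraction of positions that match the target.
        hits = sum(a == t for a, t in zip(self.written, self.target))
        return hits / len(self.target)


def rollout(snapshot: ToyRepoEnv, policy_seed: int) -> float:
    """Run one agent from an identical copy of the snapshot and score it."""
    env = copy.deepcopy(snapshot)       # the snapshot itself is never mutated
    rng = random.Random(policy_seed)    # only the agent's sampling differs
    for _ in range(len(env.target)):
        env.step(rng.randint(0, 9))     # toy "policy": guess a digit
    return env.verify()


def run_rollouts(snapshot: ToyRepoEnv, n: int) -> list:
    # Sequential here for simplicity; a real setup would fan these out in parallel.
    return [rollout(snapshot, seed) for seed in range(n)]


snapshot = ToyRepoEnv()
rewards = run_rollouts(snapshot, 1000)
best_seed = max(range(len(rewards)), key=rewards.__getitem__)
print(f"best rollout: seed={best_seed}, reward={rewards[best_seed]:.2f}")
```

Because the snapshot is never mutated and the only source of variation is the agent's own sampling, any rollout can be replayed exactly, which is what lets you compare a thousand attempts and credit the ones that worked.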
But this doesn't work with computer use, at least not trivially.
You can't just have a thousand agents go try the same checkout flow on Amazon to get better at using websites, because Andy Jassy will find your bots and shut your ass down.
You can solve this by making clones of Slack and Gmail and all the other common applications and websites.
But at least currently, this is a very labor-intensive and unscalable way to build environments.
Of course, once AIs get good enough at coding to build these clones themselves with extremely high fidelity, then I'm sure computer use will make quicker progress than it is right now.
And you're also killing two birds with one stone with this kind of procedure, because getting AIs to rebuild whole applications from scratch is also a great RL objective for coding.
So while computer use itself may soon be solved, its current lethargy is telling us the following: that unless you can build a very replayable training target for a domain, the models will struggle to make much progress.
And the reason this is true, of course, is that the models are incredibly sample-inefficient during training.
This is the point I was making in my last video essay.
So for computer use, we might be able to make up for the sample-efficiency deficit by building these farmable deterministic simulators.
But for so many other kinds of skills that we need AIs to have, we simply can't do this.
How do we train an AI to get really good at building a business from scratch?
How about winning court cases, or having a profitable day of trading in the markets, or helping a candidate win an election?
The rollout here requires interacting with the real world, and you can't recreate it from just within a datacenter.
The outer-loop verification here may take months or even years of real-world actions to elicit, and you can't re-observe it by perturbing the model's actions slightly in thousands of parallel rollouts to isolate exactly what the model did that actually worked.
Now, dealing with such reset-free, non-stationary environments is a known open problem in RL.
I'm not pointing out anything new.
But I really do want to emphasize that because of the idiosyncratic and sparse nature of data in most domains in the world, you need sample efficiency in order to get proficient.
If AIs are to develop all the skills that humans have, and even skills that humans don't have, then they need to be able to learn from information revealed in unstructured, unverifiable, and ambiguous ways from scarce amounts of real-world interaction.
Because in many domains, the relevant training information simply doesn't exist in any other way.
What is the RL environment to make an AI that is as good at politics as Lyndon Johnson, or as good at building a space-launch business as Elon Musk?
The labs are betting that RLVR will generalize.
That is, that if you train on enough containerized, reproducible environments, you will develop a very general agent that can make and execute plans and learn rapidly from new information, and even pick up new skills, all within a single session.
If you drop this endlessly RLVR'd AI into Texas politics in 1948, it could give you better advice than LBJ about winning the Senate seat.
And if you gave it a hundred million dollars in 2002 and let it cook, it would build SpaceX for you.
Now, whether RLVR can generalize this well is an empirical question.
If the labs went from spending billions of dollars on RL environments to a trillion dollars, would you get the kind of thing that is a fully human-like general intelligence within the context window?
Dario gave a telling quote during our podcast together, which I think hints that RLVR generalization is not infinitely strong.
When he was explaining why model performance tends to degrade at long context, he said: "There's two things.
There's the context length you train at, and there's a context length that you serve at.
If you train at a small context length and then try to serve at a long context length, maybe you get these degradations."
Now, maybe I'm reading too much into this, but it seems like he's saying that short-horizon RL training doesn't necessarily generalize to long-horizon RL performance.
And if you can't generalize from short horizon to long horizon, then how are agents supposed to generalize from getting trained on a bunch of white-collar tasks to, say, being dropped into the real world and building a business from scratch as well as Sam Walton?
And even if, after enough in-context experience, the AIs could become like Henry Ford or Albert Einstein, all that would be ephemeral and wasted if you couldn't get those learnings back into the weights.
Around 30 to 50 percent of a lab's compute goes to inference, and that compute is currently not playing any productive role in helping improve the model.
This seems like a huge waste.
And it's even worse than it sounds, because it is only in deployment that the most valuable bits of information which your model could learn from are actually revealed.
What's actually happening in the organizations where I'm being used?
What are they using me for?
And what kinds of mistakes do I tend to make in the real world?
We've got some genius grad student who's never been allowed to take a real internship, and we keep giving it more and more classroom case studies in the form of RL training on environments.
It's so bizarre that we have AIs that are broadly deployed through the economy already, and are participating in so many different kinds of tasks, and are privy to so much domain- and organization-specific tacit knowledge, and they're not able to make use of it.
But this kind of continual learning requires going back to the weights.
AIs can't just keep building up a bigger and bigger KV cache as they learn from more and more users.
That's just not scalable, and that's also not how humans do it.
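Here is a rough back-of-the-envelope sketch of why that doesn't scale. The formula assumes a standard transformer that caches one key vector and one value vector per token, per layer, per KV head, with no cache compression. The model shape below (80 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example, not any particular lab's model.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` tokens."""
    # Factor of 2: one key vector and one value vector per token, layer, and head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only for illustration.
HYPOTHETICAL_SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for tokens in (100_000, 1_000_000, 100_000_000):
        gib = kv_cache_bytes(tokens, **HYPOTHETICAL_SHAPE) / 2**30
        print(f"{tokens:>11,} tokens -> {gib:,.1f} GiB of KV cache")
```

Under these assumptions the cache costs 320 KiB per token and grows linearly with everything the model keeps in context, per session, so "just remember everything" gets expensive long before you reach months of on-the-job experience.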
There's no clean separation in our brain between parameters and activations, and it's not like some part of your skull keeps expanding as you learn more things throughout your lifetime.
When we learn stuff, there's clearly some kind of compression, and this aids our generalization and grokking.
There are, in fact, some humans who have this autistic-savant-type ability to recall random tables of numbers or nonsense syllables years later, which is basically the kind of fidelity of information that models have in context.
And such sheer volume cripples these humans' ability to understand abstractions and metaphors.
Human continual learning is less about having all your observations at the tip of your tongue and more about chiseling the right intuitions and big-picture knowledge back into the weights.
But the moment you move into the weights, you have to give up on in-context learning's sample efficiency.
Because gradient updates are super sample-inefficient, all of the successfully shipped online-learning models have had to learn the exact same thing across millions of users.
For example, the Cursor Tab model online-learns by predicting the same exact objective for over 400 million requests a day.
The objective here is which edits actually got accepted by the user.
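As a toy illustration of that kind of shared objective, here is a minimal sketch of online learning from accept/reject feedback. This is not Cursor's actual system. It is a plain logistic regression updated one request at a time, and the feature vectors, `true_w`, and the learning rate are all made up for the example. The point is just that every request, from every user, nudges the same weights toward the same target.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineAcceptModel:
    """Logistic regression trained one request at a time (hypothetical)."""

    def __init__(self, n_features, lr=0.05):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        # Estimated probability that the user accepts the suggested edit.
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, accepted):
        # One gradient step on the log loss for a single (edit, accepted) pair.
        error = self.predict(x) - (1.0 if accepted else 0.0)
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]


def simulate(n_requests, seed=0):
    """Stream simulated requests that all share one acceptance rule."""
    rng = random.Random(seed)
    true_w = [2.0, -1.0, 0.5]  # made-up preference shared by every user
    model = OnlineAcceptModel(n_features=len(true_w))
    for _ in range(n_requests):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        p_accept = sigmoid(sum(wi * xi for wi, xi in zip(true_w, x)))
        model.update(x, accepted=rng.random() < p_accept)
    return model


model = simulate(20_000)
print([round(w, 2) for w in model.w])
```

Run it with 20 requests instead of 20,000 and the weights stay far from the target, which is the sample-inefficiency problem in miniature: the approach only works because the same signal is repeated an enormous number of times.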
At least so far, we haven't seen models online-learn different kinds of things for different users, because while a single session may generate more than enough data for a human to learn from, it's not enough to train a more capable AI.
Current online learning can work for a very limited number of use cases.
But the whole point of continual learning is that the world is very complicated, and each job and company and problem is different, and you need your intelligence to be able to learn the specific information related to a particular deployment, which simply can't be stuffed into some shared training run.
These are all the things we're talking about when we talk about on-the-job learning: things like how everything in your organization works and fits together, how to cooperate with all the infrastructure and the other people around you to make progress on some larger project, what the common failure modes are, and many other things like this.
In this way, sample efficiency and continual learning are actually deeply connected problems.