Unsupervised Learning: With Jacob Effron

Gemini Co-Lead on World Models, RL's Next Domains & Continual Learning

Oriol Vinyals is the co-lead of Gemini alongside Noam Shazeer and Jeff Dean. He's had an incredible career in AI, pioneering many of the breakthroughs in deep learning over the last decade, and it was a ton of fun to sit down with him after Google I/O. If you've been following Google I/O, they basically shipped a bunch of products across a ton of interesting surface areas throughout AI, and Oriol and I hit all of them. We talked about what's required for further advances in multimodal models and what's going to make these world models actually usable. We talked about the increase in memory, the importance of memory, and how the advances there will look like reasoning over these next few years, as well as what Oriol thinks the path forward is. And we hit on the state of scaffolding today, what folks are building, and what Oriol thinks persists. It's a ton of fun to basically take all the top questions that founders and investors are thinking through and just pose them to Oriol, so I think folks will really enjoy this conversation. Without further ado, here he is.

Oriol, thanks so much for coming on the podcast.

Yeah, it's great to be here. Thanks, Jacob.

Very exciting to have you a day after I/O. I know things have been busy, but I've been really excited for this because you're one of the people most directly shaping the frontier of models today through your work at Google.
And the releases that happened yesterday at I/O hit on pretty much all the themes that people are thinking about in the space: where these products and models are going. So our goal today is to talk through the research behind those announcements, where this is all headed, the future path of RL and post-training, and to get your read on the space as a whole. I figured I'd start with world models, because I think that was a really impressive part of yesterday, and also where Google is pretty distinct from a lot of the rest of the field. You obviously shipped this incredibly impressive world model in Omni yesterday, and Demis has talked a lot about seeing world models as a path to AGI. It's interesting, because it seems like other labs are maybe more focused on code and getting to recursive self-improvement. So I'm wondering if that's a fair characterization, and why you think you and the team at Google have been somewhat uniquely focused on this world model space.

First of all, I guess the coding or self-improvement angle is at a bit of a different layer, right? You can certainly bet and believe that these models can reprogram and improve themselves, and it's something I've actually been quite actively working on at the moment.
But then there's the object that they improve, the model, whether it's a multimodal model or a world model, as we call it, and even how to define that is a bit abstract. Since day one, and way before the Gemini program actually started, we were working on not just language but understanding the visual world, and jointly modeling words in the context of vision, images, video, etc. So I think that part has been at the core of Gemini, and of our research before it. Maybe one way to characterize it is that in language there's clearly a lot of information that we've collectively written about the world, and that's clearly paid off big time. We've distilled, in a way, all the knowledge that has been written, and that is being written at the moment, into these weights.

It was definitely convenient that we put it all on the internet too.

Yes, exactly. And also now with users, right? There's obviously also a flywheel effect. But at the same time, there is lots of knowledge in videos and images, and I would say it has kind of happened, but softly.
I think there might be a big moment still to come: how would you extract all the knowledge you would acquire if you were to look at all the videos and images? We certainly use them in our training mixtures, but could that knowledge somehow add value and efficiency to the language component? I think we've seen constructive, let's say, transfer learning from one to the other, and we see generalization. But what I would characterize as the GPT moment of video and images, I'm not sure we've quite seen that.

Do you have any thoughts on what that GPT moment might be for video and images, given you have this intuitive feeling that it hasn't been reached yet?

Yeah. So at the moment we train on all the modalities, we mix them, and we keep enhancing the recipe. Omni is a good way to see that progress, in which we not only input videos and images (we've seen amazing capabilities with long-context understanding, etc.).
But we are now also able to output video, and to interact with it in a very natural way through language, editing it and combining the modalities in a way that feels almost magical, right? So that progress is absolutely there. But maybe one of the deep learning dreams, and it might be an original dream from way before large language models, would be: hey, can I train on all the image data, perhaps without text as a hard challenge, and still somehow extract all the meaning and nuance from that modality, or set of modalities, and vast amounts of data? Could we train on all the videos and images ever produced and get to the same level of understanding that the language models clearly get to using language, although probably slightly superficially, and with some missing links around cause and effect and so on, which Demis, for instance, talks about often? So have I seen that moment? Probably not. Most likely we have the most advanced, or one of the most advanced, multimodal recipes that mixes everything. But that pure transfer is, I think, one of the core quests of machine learning for the last decade plus.

To the extent you can talk about it, I'm curious: could you give our listeners some context on what the key problems are that still need to be solved here, or the types of problems you're trying to work on to further advance this?
It's hard to describe the solution space, but one idea that is used often is that you could imagine observing or learning from all the video data and then somehow deriving the rules of gravity. How could you precisely describe how the world works based only on images? The issue there is that linking language, or these concepts, as we sometimes call them, to what you see in the image without the explicit language linkage is fairly tricky. So what you end up doing is trying to explicitly create datasets where there's some sort of correlation or connection between the images and video and some language, maybe labels or descriptions and so on. But of course, the amount of data at your disposal is then much smaller, because we clearly haven't described and transcribed every single piece of media out there. So I think extracting those concepts in the purest form, not just in some language that we associate with what we see, would be very, very powerful. There's lots of early research on discrete representations and representation learning, and that's one of the things that I would say is probably still at a fairly early research stage. So it's not something we can scale up yet, and I'm not sure it's needed; whether we agree with that or not is another question. But if it were to be unlocked, it would be massive.
You mentioned the term world model and how it gets thrown around a bunch. Omni was positioned as a world model, and I'm curious how you thought about that categorization, given you've obviously had really good video models for a while. What makes Omni a world model, and how is it different from the generation of video models that you guys have been working on?

I guess a pure aspect of a world model would be representation learning, right? You could imagine taking these modalities, like videos, which are sequences of images, or even just images, and compressing them into a set of concepts: the movements, the objects, etc., within them. That's called representation learning, and it models the world in a very compact way that compresses away what's probably not relevant. That one is probably the more classical definition, but it's also probably not exactly what we mean, or see, or feel when we interact with Omni. What you see there is more about being able to really change how the video behaves, or the kinds of videos you get out of, say, an initial image that you ask it to animate. You explicitly ask for all the movements, or even actions like "move forward", and you can see that being precisely simulated.
So that is more like the world model itself acting as a renderer of the world that you can change just through language. And having that object, besides being a cool product to play with (of course we love to generate all sorts of different movements or situations and so on), could also meaningfully add a dimension of simulation that we could use, for example, for prediction before acting in the world. And of course the obvious applications for these kinds of 3D or video world models would be self-driving cars or robotics.

It seems so relevant to robotics, and it feels like everyone's still trying to figure out the right data mix of simulation data versus forms of teleop data and egocentric video data. But it seems like as these simulations continue to get better, it's more and more compelling to put them into the data mix. So I'm curious: does this work directly intersect with the broader robotics work you all are doing, and how do you think about what's actually required to append robotic actions onto these types of models?
There's also a bit of a beautiful connection, because if we get more data captured from robots, which we are certainly investing in, even if it's obviously a bit more expensive or time-consuming, that data could make it into the model, enhancing the world model capabilities themselves. And then the other direction, which is perhaps what you're asking about, is: okay, now we can simulate, and we could create lots of different scenarios in which these robots, or whatever 1D, 3D groups, etc., could be training without the cost and the time latency of the physical world, right? For the latter to work better, I mean, it's still a very open problem, and there are also all sorts of issues with transfer. But the more powerful these models get, clearly there's a kind of inflection point where things start to be worth doing, and we might see an acceleration in robotics. Indeed, we're definitely seeing that in the hardware space: lots of investment, so things are accelerating and picking up there. But for the world models to be useful, at least from my limited knowledge (though of course I've been able to interact with these systems and see them), the precision of even grasping, which we take for granted as humans, the visuals, exactly how this would feel to your hand, which is a modality we currently obviously don't even have data for, and then the exact forces and how things would move, all need to be very, very accurate, right? So that's where there's a gap, and perhaps some creativity and research is still required, along with lots of investment in robotics over the years. But it's promising, and at some level, maybe not at the level of precise motor control but at the level of planning and gross motor control, we are going to start seeing how these models accelerate our progress in the quest for robotics.

A huge part of these models is implicitly learning physics through consuming lots of video data, and you mentioned gravity as the canonical example of what people look for. Being so close to these models, do you have any gut sense of when that will just be a solved problem within world models?

Yeah, it's a good question, actually. You made me think about evaluation, right? Like, how would you evaluate it if you train a very good video model?

Yeah. How do you evaluate physics in a model?

Yeah, it is a good question, right? You could imagine the problem is that as soon as you add language, all of a sudden that knowledge is there in the weights. So if you ask basic questions about gravity, of course the model would answer them just from having read explanations of them online and so on.
So you would need to somehow connect the concept of gravity, which could be present or not in a world model, and then decode that into an explanation that would satisfy you. Maybe initially that would be some basic explanation; later on it could even derive the equations and so on. That's how you could build an eval. To my knowledge, I don't think we've been thinking about this from this point of view. There's definitely lots of early work on unsupervised machine translation, where you would try to translate to a language that you would never see during training, and you could align the representations. So there are probably some ideas there: you get a language model that can speak, or that you can decode from, you get these world models that would create this kind of conceptual-level understanding, and aligning both