Sequoia Capital

Robotics' End Game: Nvidia's Jim Fan

And up first, I'm delighted to introduce my friend Jim Fan. Jim leads the embodied autonomous research group at Nvidia, otherwise known as Nvidia robotics. I think that robots are just one of the most thrilling things that are going to happen. A car basically is a big robot, but I'm excited for robots that can go beep boop and lift things for us. And so, Jim was a standout at last year's AI Ascent, and we're delighted to have you back.

Thanks, [applause] everyone. Thanks.

So, it was a summer day in 2016. Actually, right in this office that we're sitting in. [snorts] There's a guy in a shiny leather jacket, you know, big biceps, hauling this large metal tray. And on this large piece of metal, he wrote, "To Elon and the OpenAI team, to the future of computing and humanity, I present you the world's first DGX-1." So, that was the first time I met Jensen. And as any good intern would do, I rushed to get in line to sign my name on it. So, can you spot it, my name? It's here. And can you spot another? That's Andrej, right there. So, Andrej, we're going to the Computer History Museum. I feel like a dinosaur.

You know, back then, I had no clue what I was signing up for. And then, no one can describe what happened next better than Ilya himself. If you believe in deep learning, deep learning will believe in you. And oh boy, did deep learning believe in all of us big time. Three step functions, six years.
That's all it took to bring us here today. The first tick: GPT-3, pre-training. Next-token prediction is really about learning the rules of grammar, the shape of language. It's about simulating how thoughts and code and strings in general should unfold. 2022: InstructGPT. Supervised fine-tuning aligned the simulation to do useful work. Then o1: reasoning, using reinforcement learning to surpass imitation learning. And finally, auto research, accelerating the whole loop beyond what's humanly possible.

So, as Andrej said, all the labs are getting to the final boss fight. So, for LLMs, they're in the thick of the end game. And honestly, I'm very jealous. Look at how happy Andrej was, big smile on his face. The LLM folks are having the party of their lifetime. They're speedrunning AGI on mystical creatures literally called Mythos. So, why can't robotics get a piece of the fun?

So, as any self-respecting scientist would do, I copy the homework and give it a new name. I call it the Great Parallel. So, instead of simulating strings, can we simulate the next physical world state? And then we can align, through action fine-tuning, onto a thin slice of that simulation that matters for real robots. And we let reinforcement learning carry the last mile. And that's it. The Great Parallel: copying the LLM success. If you can't beat them, join them.

So, please join me in a new episode: Robotics, the End Game. I'm sorry, I just couldn't resist. Nano Banana's too good. Thanks, Demis.
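
To make the "simulate the next physical world state" idea concrete, here is a toy sketch, not anything shown in the talk. It assumes a falling point mass whose state is just height and velocity, and fits a linear next-state predictor by least squares. The constants `G` and `DT`, the function names, and the linear model are all illustrative assumptions; the point is only that a physical constant can be read back out of a model trained on nothing but next-state prediction.

```python
import numpy as np

G, DT = 9.81, 0.01  # assumed toy constants: gravity (m/s^2) and time step (s)


def step(state):
    """Ground-truth toy physics: a point mass in free fall (explicit Euler)."""
    y, v = state
    return np.array([y + v * DT, v - G * DT])


def collect_transitions(num_rollouts=50, horizon=20, seed=0):
    """Roll the toy physics forward and record (state, next_state) pairs."""
    rng = np.random.default_rng(seed)
    states, next_states = [], []
    for _ in range(num_rollouts):
        # Random initial height in [0, 10] and velocity in [-5, 5].
        s = rng.uniform([0.0, -5.0], [10.0, 5.0])
        for _ in range(horizon):
            s_next = step(s)
            states.append(s)
            next_states.append(s_next)
            s = s_next
    return np.array(states), np.array(next_states)


def fit_next_state_model(states, next_states):
    """Least-squares fit of next_state ~= [y, v, 1] @ W."""
    # Append a constant 1 so the model can learn an offset term.
    features = np.hstack([states, np.ones((len(states), 1))])
    weights, *_ = np.linalg.lstsq(features, next_states, rcond=None)
    return weights  # shape (3, 2)


states, next_states = collect_transitions()
W = fit_next_state_model(states, next_states)

# The offset on the velocity output is -G * DT, so gravity can be read back.
learned_g = -W[2, 1] / DT
print(f"gravity recovered from next-state prediction: {learned_g:.3f}")
```

Nothing in the fitting code mentions gravity; it only sees transitions. A video model does the analogous thing over pixels, at a vastly larger scale and with a far more expressive predictor than this linear one.
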
So, how do we play the end game? It boils down to two things: model strategy and data strategy. Let's look at the model first.

The last 3 years were dominated by VLAs, or vision-language-action models, and models like PaLM and GR00T fall in this category. So, we assume that the pre-training is done by a VLM, and we simply graft an action head on top of it. But really, if you think about these models, they're LVAs, because most of the parameters are dedicated to language. So, language is a first-class citizen, followed by vision and action. And by design, VLAs are great at encoding knowledge and nouns, but not so much at physics and verbs. It's kind of head-heavy in the wrong places.

This is my favorite example from the original VLA paper. Move the Coke can to a picture of Taylor Swift. Yes, it has not seen Taylor Swift before. Yes, it's able to generalize. But this is not quite the pre-training ability that we're looking for.

So, what's the second pre-training paradigm? And I always thought that it would be something glorious. Unfortunately, it turns out to be what we call AI video slop. You know, I can watch these cats playing banjo on security cam all day. It's peak internet. But really, look at this. No one can take this seriously [laughter] until we realize that these video models are learning to simulate the next world state internally.

So, these are some rollouts from Veo 3. You can see that the models pick up gravity, buoyancy, lighting, reflection, refraction all by themselves.
None of this is coded in. Physics emerges by predicting the next blob of pixels at scale. And even visual planning emerges. Look at how Veo solves these mazes. It solves them by running simulation forward in pixel space. And I'd draw your attention to the lower right corner here. This is my favorite example. Let's watch. Blink and you'll miss how Veo 3 solves this one. [laughter] It's super smart. You know, Veo 3 figures out that if you're not looking, geometry is optional. I call this physics law.

So, how do we make these world models... Well, we do action fine-tuning. We align the superposition of all possible future states and [snorts] collapse that onto a thin slice that matters for real robots.

Introducing Dreamer. It's a new type of policy model that dreams a couple of seconds into the future and acts accordingly. And you know that motor actions, they're high-dimensional continuous signals. So, that looks just like pixels. We can render it at the same time as we render the videos. So, Dreamer jointly decodes the next world states and next actions. And as a result, it's able to zero-shot solve tasks and verbs that it has never seen in training. And as a robot executes, we can visualize what it's dreaming about.
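
The talk does not describe Dreamer's architecture, so the following is only a minimal sketch of what "jointly decoding" future frames and actions could look like, under loud assumptions: a single latent vector, one untrained linear decoder, and made-up sizes. Every name here (`JointDecoder`, `HORIZON`, `ACTION_DIM`, and so on) is hypothetical. It illustrates the one idea stated above: because actions are continuous signals like pixels, they can sit in the same output vector and be emitted in the same pass.

```python
import numpy as np

# All sizes and names below are hypothetical, chosen only for illustration.
LATENT_DIM = 32
HORIZON = 8              # future steps "dreamed" per forward pass
FRAME_SHAPE = (16, 16)   # tiny grayscale frame
ACTION_DIM = 7           # e.g. joint targets for a 7-DoF arm


class JointDecoder:
    """One linear decoder that emits pixels and actions in the same output."""

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.pixels_per_frame = FRAME_SHAPE[0] * FRAME_SHAPE[1]
        # Each future step is one flat vector: [frame pixels | action].
        self.step_dim = self.pixels_per_frame + ACTION_DIM
        # Untrained random weights; a real model would learn these.
        self.weight = rng.normal(
            0.0, 0.02, (LATENT_DIM, HORIZON * self.step_dim)
        )

    def decode(self, latent):
        """Map one latent vector to (future frames, future actions)."""
        out = (latent @ self.weight).reshape(HORIZON, self.step_dim)
        frames = out[:, : self.pixels_per_frame].reshape(HORIZON, *FRAME_SHAPE)
        actions = out[:, self.pixels_per_frame :]
        return frames, actions


decoder = JointDecoder()
latent = np.random.default_rng(1).normal(size=LATENT_DIM)
frames, actions = decoder.decode(latent)
print(frames.shape, actions.shape)
```

In a setup like this, the robot would execute the decoded `actions`, while `frames` is what could be rendered on screen to show what the policy is "dreaming" about.
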