Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents— Ethan He
I have a pretty big claim.
Visual intelligence actually comes mostly from language. Especially now that diffusion model technology is more mature, every time you see an improvement in these video models,
I would say the gain comes mostly from the language model, not from the video model itself, that is, the video diffusion model.
Before we get into today's episode, I just have a small message for listeners.
Thank you.
We would not be able to bring you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content.
We've been approached by sponsors on an almost daily basis.
But fortunately, enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way.
But I just have one favor to ask all of you.
The single most powerful, completely free thing you can do is to click that subscribe button.
It's the only thing I'll ever ask of you, and it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week.
If you do it, I promise you, we'll never stop working to make the show even better.
Now, let's get into it.
Okay, we're here in the studio with Ethan He, most recently of xAI.
Welcome.
Yeah, thank you.
Glad to be here.
We're also here with Vibhu.
You first came to us, or joined the Latent Space world, because you were working on Cosmos at NVIDIA, and you did a great paper.
We loved it.
You presented it as well.
So thank you for doing that.
Yep, and also presented the MoEs.
Yes.
Twice at Latent Space.
Yeah.
Yeah.
Yeah.
How did you actually hear about us?
Did we reach out to you?
Is that how it worked?
No, actually, I found the community myself. I realized, oh, there's this online community
Yeah.
where people talk about AI and also learn from each other through papers every week, through the paper club.
It's very nice.
I learned a lot.
I think it's been three years non-stop.
We haven't stopped even on Christmas and New Year's.
Many weeks I want to stop.
No good.
I think you had posted that you worked on a paper, and I was like, oh, very cool.
You presented at the paper club,
but I might have reached out to you after.
Yeah, because it's an amateur club, right?
So it's very unusual, but sometimes we have paper authors come by and actually explain the paper.
Today we just did the Poolside paper, which is apparently very good.
Came out yesterday.
Pretty interesting, right?
Fully open.
They talk about everything, the whole system.
So it's a good one.
We'll recommend people read it.
Bring us up to speed on your transition to xAI, because I actually don't even know when you joined.
Just tell us the story of that transition.
Before xAI, I was working on the Cosmos world model at NVIDIA.
Cosmos is a family of giant video foundation models that aims to simulate the world, and it serves as a foundation for roboticists to build on top of.
Once I built Cosmos, I realized that this thing also has a scaling law, similar to language models.
We need to scale up the video models further.
That's why I realized I needed to move somewhere with much more compute resources.
That's how I...
Then the NVIDIA GPUs came themselves.
Yeah.
And timeline-wise, when was Cosmos?
It was pretty early, right?
It was the open world model paper.
It was around the end of 2024.
End of 2024.
Yeah.
Then in mid-2025 I moved to xAI.
I joined at the time when xAI was about to build video models and multimodal models.
There was no infra, no data, and no model.
With just a few engineers, we built it in three months and released the first model, Grok Imagine 0.9.
Since then I kept working on video models and moved from pre-training toward post-training of the video models, for example reference-to-video, kind of like the cameo feature, and video extensions.
Before I left, I worked on world models, leading a small team focused on real-time, long-horizon video generation.
Can you give a rough road map? Okay, you're on a brand new team.
Grok previously was only text, or they partnered with BFL for their image gen stuff.
What are the building blocks, right?
You have compute, and data you can procure somewhere. What is the sequence of things that people should think about when setting up a new team?
I mean, actually even deeper: not just data you can procure, you guys had to go through getting the data too, right? So
you shipped it pretty fast, but yeah.
Yeah, three months is actually surprisingly fast.
Yeah.
One thing I'd say is that it's thanks to my experience at NVIDIA, because the first time, when we were building Cosmos together, it took us about a year.
So this was the second time I'd done it, and I roughly had an idea of what to do.
I'd say the most important thing is talent.
Everyone was very strong and clever, and worked very closely with each other toward a common goal.
So that sped things up a lot.
You reduce the communication overhead among people, and everyone can work toward the same goal.
There weren't that many meetings on the calendar each day, maybe one sync a day, and after that it was all building.
It was pretty fun at that time.
Another thing is that xAI has very strong foundations, like data infra and model infra, and that support helps model development a lot.
When I look at training models, the most important thing is how many iterations you can do per day. The more iterations you can do, the faster you can train the model.
So if you have very strong infra and a lot of compute, you can train these models in a very short period of time, which gives you a much larger buffer for errors and also gives you the opportunity to spot more bugs.
Yeah.
What is an iteration?
Is it like a few hundred steps, or what?
Let's say it's training the model end to end: acquiring new data, maybe designing new algorithms, and training a new model, maybe at a smaller scale.
Yeah.
So the cycle time for any hyperparameter that you're
Yeah.
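To make the point about cycle time concrete, here is a minimal sketch of the arithmetic. It assumes experiments run back to back with no overlap, and every number in it (the 90-day window and the three cycle lengths) is hypothetical, chosen only for illustration; none of these figures come from the conversation.

```python
def iterations_in_window(window_days: float, cycle_hours: float) -> int:
    """Count the complete end-to-end cycles (new data, new algorithm,
    small-scale training run) that fit in a window when run back to back."""
    if cycle_hours <= 0:
        raise ValueError("cycle_hours must be positive")
    return int(window_days * 24 // cycle_hours)


# Hypothetical values, not taken from the interview.
WINDOW_DAYS = 90  # roughly a three-month project
for cycle_hours in (72, 24, 6):
    n = iterations_in_window(WINDOW_DAYS, cycle_hours)
    print(f"{cycle_hours:>2} h per cycle: {n} iterations in {WINDOW_DAYS} days")
```

Under these assumed numbers, shortening the cycle from three days to six hours gives twelve times as many attempts in the same window, which is the "buffer for errors" described above.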