🔬 The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub
So ESMC is also approaching programmable biology, but I would say in a very different way.
It approaches it from a world-modeling perspective: you have a predictive model, and you search that world model to find protein molecules that satisfy whatever design criteria you have.
We've been able to use this to design many protein binders.
But I think most excitingly, we've been able to use it to design antibodies, scFvs.
Hello, welcome to the Latent Space AI for Science podcast.
I'm RJ Honicky, CTO of MiraOmics.
Yeah.
And I'm Brandon.
Today it's a pleasure to have Alex Rives, Head of Science at Biohub.
Yeah.
Would you like to introduce yourself real quick?
Yeah.
Thank you for having me here.
It's great to be here.
I'm Head of Science at Biohub.
I'm a computer scientist, I work on AI for biology, and a lot of my work has been on language models for biology.
By the time this podcast is released, you will have put out several exciting new models.
Going over them, I couldn't help but think that you might be the most bitter-lesson person in protein biology right now.
Can you give a little context about what that means for biology, and why you're so committed to and excited about this route?
Well, I'll take that.
I believe in scaling laws.
I've been working on this since the summer of 2018.
When my team was at Meta AI, we trained really the first Transformer language model for protein biology.
I've always thought that biological information would emerge as you train a model to predict the next token that evolution creates.
Our team has explored that idea over a number of years, and I think we've really seen the scaling curve: as we've increased model size by about an order of magnitude in each generation, new capabilities have emerged.
Yeah.
So you say emergence of capabilities, scaling over generations.
You've been working at this, as you said, for I guess about eight years now.
It didn't always work that way, right? There were signs that scaling might work.
We'll be getting to some new results where I think you've clearly demonstrated this hypothesis in a way that hasn't happened before.
But you seem to have a strong commitment to this, and I'm not sure I would have been so convinced that it would work in the same way.
I mean, the protein language is not the same thing as natural language.
There are similarities, but if you start sampling a normal language Transformer at high temperature, you're going to get gibberish.
If you sample a protein language model at infinite temperature, you're going to get something that is a valid protein, if not an interesting one, though it's a different domain and that happens for a different reason.
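The temperature remark can be made concrete with a small sketch. Dividing a model's logits by a temperature before the softmax flattens the distribution, and in the limit every token becomes equally likely. The logits below are made-up values for a single position, not output from any ESM model, and `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; larger temperatures flatten the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical logits: the model strongly prefers leucine (L) at this position.
logits = [3.0 if aa == "L" else 0.0 for aa in AMINO_ACIDS]

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked on L
flat = softmax_with_temperature(logits, 1e6)   # very high temperature: near uniform
```

At very high temperature the sample is effectively a uniformly random string over the 20 amino acids, which is still a well-formed protein sequence, whereas a uniformly random string of natural-language tokens is not a well-formed sentence.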
I'm not sure I would have assumed that the natural language model insight would transfer over.
So what specifically about proteins did you think was special, or would make this approach valid there too?
Yeah, I mean, it's a really interesting question.
I think it's a deep question across AI more broadly right now. What's so interesting is that AI is such an empirical science: we don't have theory that can always guide us in these things, but we do have really strong empirical evidence of scaling. What motivated me is that if you think about evolution and the data we have on proteins, we have databases with billions of protein sequences.
Those sequences contain patterns. It has long been known, going back decades before we started working on this with language models, that there are patterns in the sequences of protein families, and that they arise from the constraints evolution operates under.
So you can think about a protein sequence that folds into a three-dimensional structure in space.
And you can imagine two residues, or amino acids, in this sequence that might be in contact in that folded structure.
Evolution isn't free to choose those independently of each other.
If it makes a choice at one position, it has to make a compatible choice at the other position.
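One classical way to see this kind of coupling is to measure the mutual information between two columns of a multiple sequence alignment. The sketch below is only an illustration of that idea, not the method used in ESM. The four-sequence alignment is invented so that columns 0 and 2 always carry opposite charges (K with E, or E with K), while column 1 varies independently.

```python
import math
from collections import Counter

def column(alignment, index):
    """Collect the residue at one position from every aligned sequence."""
    return [seq[index] for seq in alignment]

def mutual_information(column_a, column_b):
    """Mutual information in bits between two alignment columns."""
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

# Invented toy alignment: positions 0 and 2 co-vary, position 1 does not.
toy_alignment = ["KAE", "KGE", "EAK", "EGK"]

coupled = mutual_information(column(toy_alignment, 0), column(toy_alignment, 2))
uncoupled = mutual_information(column(toy_alignment, 0), column(toy_alignment, 1))
```

Here `coupled` comes out at one bit and `uncoupled` at zero: knowing the residue at position 0 fully determines position 2 but says nothing about position 1. Real alignments are far noisier and need corrections for phylogeny and sampling that this sketch ignores.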
So going all the way back to the beginning of gene sequencing, when people were first able to look at the same protein in related organisms, you could start to see these kinds of patterns, which reflect the fundamental underlying biology.
So the thinking behind ESM was: what if you applied this principle across all of evolution, across the vast diversity of proteins generated across all of life, and had a language model predict the amino acids that evolution will choose to place in proteins across all of those biological contexts?
There's just an incredible amount of information in that total picture about the underlying biology of proteins.
That was really the idea that sparked this: the model has to predict the next token, and actually we train these models with masked language modeling.
So they're predicting tokens that are masked out of various parts of the sequence, and to do that the model would have to learn something about the underlying constraints that shape which tokens evolution can choose.
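As a rough illustration of the masked language modeling setup described here, the sketch below prepares one training example: it hides a few residues and records what the model would be asked to recover. The 15% mask fraction is the common BERT-style default and is an assumption, as are the `<mask>` token string and the `mask_sequence` helper; none of these are taken from the ESM training recipe.

```python
import random

MASK = "<mask>"  # hypothetical mask token string

def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues and return (masked tokens, targets)."""
    rng = random.Random(seed)
    tokens = list(sequence)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    # The training targets are the original residues at the hidden positions.
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets

sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
masked_tokens, targets = mask_sequence(sequence)

# A model would be trained to predict targets[i] from the rest of masked_tokens.
restored = [targets.get(i, tok) for i, tok in enumerate(masked_tokens)]
```

Because the hidden positions can fall anywhere, the model sees context on both sides of each gap, unlike strict next-token prediction.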
Yeah.
So maybe for a bit of history: you just released Evolutionary Scale Modeling Cambrian, right?
Is that what it's called?
Yeah.
And this is maybe the fourth or fifth in a series of models.
I think maybe even more if you go back to before they were called ESM.
Well, they were called ESM from the start.
Yeah.
We had various branches of the different models.
Yeah.
So this one, I would say, is kind of a fourth-generation model.
It's actually a model that we trained a little over a year ago.
Now that we're at Biohub, we're open-sourcing this model fully, under the MIT license, for the first time.
So we're really excited to do that.
But the big thing that is new here is that we've really built a world model of protein biology.
The foundation of that is ESMC.
Using the representations of ESMC, we've now built a structure prediction model.
This is the next-generation ESMFold model.
And we've also used the techniques of mechanistic interpretability and sparse coding to look deeply into the representation space of the language model and pull out the underlying features that the model actually uses to represent protein biology.
So bringing all of this together, we're able to make predictions of protein structure.
We can also make predictions about the underlying features that proteins are made of, which allows us to build linkages across evolution.
We're able to take this model and invert it to design proteins.
And we've used this to create a comprehensive picture of protein biology.
We put together all of the world's largest protein sequence databases.
That amounts to 6.8 billion non-redundant proteins.
And we've resolved predicted structures for 1.1 billion of those.
We've also computed features across all of those, so that we can make these linkages all across evolution and protein biology.
Of the 6.8 billion, you've resolved structures for 1.2 billion, or is that 1.1?
1.1.
So what about the others?
Well, basically what we did is we took that database and clustered it at 70% sequence identity.
So it's really resolving structures for everything, in the sense that each cluster has a cluster center.
We predict the structure there, and then we can expect that the other proteins in the cluster are going to have a similar template structure.
There will be small variations, but they have the same fold.
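The clustering step can be sketched as a greedy pass in which each sequence either joins the first representative it matches at 70% identity or more, or becomes a new cluster center. This is a simplified stand-in: the transcript does not say which tool was used, and production tools compute identity from real alignments rather than the position-by-position comparison of equal-length strings assumed below. The sequences are invented.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

def greedy_cluster(sequences, threshold=0.7):
    """Assign each sequence to the first representative within the threshold."""
    clusters = {}  # representative (cluster center) -> list of members
    for seq in sequences:
        for representative in clusters:
            if identity(seq, representative) >= threshold:
                clusters[representative].append(seq)
                break
        else:
            # No existing center is close enough, so this sequence starts a cluster.
            clusters[seq] = [seq]
    return clusters

# Invented toy sequences: three close variants and one unrelated sequence.
toy_sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTGYIAKQR", "GSHMLEDPVV"]
clusters = greedy_cluster(toy_sequences)
```

Only the dictionary keys, the cluster centers, would then need a structure prediction, which is how a much smaller set of predicted structures can stand in for the full database.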
1.2 billion or so clusters
that are covering the 6.8 billion.
Yeah.
Okay.
Interesting.
And maybe since we're talking about scaling, how do you know that this is the right number?
How do you know that focusing on these 1.1 billion is the right resolution for this model?
Well, we've chosen them so that they really cover that entire space.
What I can say about this database is that it's really the most comprehensive picture of protein structure and function that's been created.
It adds hundreds of millions of structures to our knowledge of the diversity of protein structure, and it also creates a feature space that allows us to find these linkages between proteins across evolution.
So we can see really interesting themes emerging across evolution.
For example, it links gene editing systems that are very far apart in sequence but share underlying functional patterns and structural homology; the model is able to bring those together and find the connections.
Now we're talking about the mechanistic interpretability part. If I understand correctly, you use sparse autoencoders, and maybe other techniques, to understand what happens when I activate the network with a protein:
what patterns of outputs am I seeing, and how do they relate to each other? If I understand correctly, you have sequences that are unrelated, or only partly related, at the level of the actual sequence, but that behave similarly and therefore activate similar networks.
Is that kind of the summary of what you just said?
Yeah.