Latent Space

🔬 The Bitter Lesson Comes to the World of Proteins: Alex Rives, BioHub

ESMC is also approaching programmable biology, but I would say in a very different way. It's approaching it from a world-modeling perspective, where the idea is that you have a predictive model and you search that world model to find protein molecules that satisfy whatever design criteria you have. We've been able to use this to design many protein binders. But I think most excitingly, we've been able to use this to design antibodies, scFvs.

Hello, welcome to the Latent Space AI for Science podcast. I'm RJ Honicky, CTO of MiraOmics.

And I'm Brandon.

It's a pleasure to have Alex Rives, head of science at Biohub. Would you like to introduce yourself real quick?

Yeah, thank you for having me here. It's great to be here. I'm head of science at Biohub. I'm a computer scientist, I work on AI for biology, and a lot of my work has been on language models for biology.

By the time this podcast is released, you will have put out several exciting new models. Going over them, I couldn't help but think that you might be the most bitter-lesson person in protein biology right now. Can you give a little context about what that means for biology, and why you're so committed to and excited about this route?
Well, I'll take that. I believe in scaling laws. I've been working on this since the summer of 2018. My team, when we were at Meta FAIR, trained really the first transformer language model for protein biology. I've always thought that there would be an emergence of biological information as you train a model to predict the next token that evolution creates. Our team has explored that idea over a number of years, and I think we've really seen the scaling curve: as we've increased models by an order of magnitude in each generation, there's this emergence of new capabilities.

So you say emergence of capabilities, scaling over generations. You've been working at this, as you said, for I guess eight years now or something like that. It didn't always work that way, right? There were signs that scaling might work. We'll be getting to some new results where I think you've clearly demonstrated this hypothesis in a way that hasn't happened before. But you seem to have a strong commitment to this, and I'm not sure I would have been so convinced that it would work in the same way. I mean, the protein language is not the same thing as natural language.
There are similarities, but if you sample a normal language transformer at high temperature, you're going to get gibberish. If you sample a protein language model at infinite temperature, you're going to get something which is a valid protein, if not an interesting protein. It is a different domain, for a different reason. I'm not sure that I would a priori assume the natural language model insight would transfer over. So what specifically about proteins did you think was special, or would make this also valid?

Yeah, I mean, it's a really interesting question, and I think a deep question across AI more broadly right now. What's so interesting is that AI right now is such an empirical science. We don't have theory that can always guide us in these things, but we have this really strong empirical evidence of scaling. The thing that I was motivated by is this: if you think about evolution and the data that we have around proteins, we have databases that have billions of protein sequences. Those sequences contain patterns. It had long been known, going back decades before we started working on this with language models, that there are patterns in the sequences of protein families that are there because of the constraints that evolution is operating under.
So you can think about a protein sequence that folds into a three-dimensional structure in space. You can imagine that there are two residues, or amino acids, in this sequence that might be in contact in that folded structure. Evolution isn't free to choose those independently of each other. If it makes a choice at one position, it has to make a compatible choice at the other position. Going all the way back to the beginning of gene sequencing, when people first began to be able to look at the same protein in related organisms, you could start to see these kinds of patterns that reflect the fundamental underlying biology. So the thinking behind ESM was: what if you were to apply this principle across all of evolution, across the vast diversity of proteins that have been generated across all of life, and have a language model predict the amino acids that evolution will choose to place in proteins across all of those biological contexts? There's just an incredible amount of information in that total picture about the underlying biology of proteins. That was really the idea that sparked this. The model has to predict the next token; actually, we train these models with masked language modeling.
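To make the masked language modeling objective mentioned here concrete, the following is a minimal sketch of how training inputs and targets could be built from one protein sequence. It is an illustration only, not the ESM training code: the function name `mask_sequence`, the `<mask>` token string, the 15% mask rate, and the example sequence are all assumptions chosen for the example.

```python
import random

MASK = "<mask>"  # hypothetical mask token, chosen for illustration


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Hide a random subset of residues and return (model input, targets).

    mask_rate is an assumed value, not taken from the ESM training setup.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(mask_rate * len(sequence)))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))

    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the residue the model must recover
        tokens[pos] = MASK          # what the model actually sees
    return tokens, targets


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
tokens, targets = mask_sequence(sequence)
print("".join(t if t != MASK else "_" for t in tokens))
print(targets)
```

A model trained on this objective is scored only on the hidden positions, so the only way to do well is to use the visible residues as context. That is the route by which constraints such as contacting residue pairs could end up encoded in the model.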
So they're predicting tokens that are masked out of various parts of the sequence, and to do that the model has to learn something about the underlying constraints that shape which tokens evolution can choose.

So maybe for a bit of history: you just released Evolutionary Scale Modeling Cambrian, right? Is that what it's called?

Yeah.

And this is maybe the fourth or fifth in a series of models. I think maybe even more, if you go back before they were called ESM.

Well, they were called ESM from the start. We had various branches of the different models. This one I would say is a fourth-generation model. It's actually a model that we trained a little over a year ago. Now that we're at Biohub, we're open sourcing this model fully under the MIT license for the first time, so we're really excited to do that. But the big thing that is new here is that we've really built a world model of protein biology. The foundation of that is ESMC. Using the representations of ESMC, we've now built a structure prediction model, and this is the next-generation ESMFold model.
We've also used the techniques of mechanistic interpretability and sparse coding to look deeply into the representation space of the language model and pull out the underlying features that the model actually uses to represent protein biology. Bringing all of this together, we're able to make predictions of protein structure, and predictions about the underlying features that proteins are made out of, which allows us to build linkages across evolution. We're able to take this model and invert it to design proteins. And we've used this to create a comprehensive picture of protein biology. We put together all of the world's largest protein sequence databases, which amounts to 6.8 billion non-redundant proteins. We've resolved predicted structures for 1.1 billion of those, and we've also computed features across all of them, so that we can make these linkages across evolution and protein biology.

6.8 billion, of which you've resolved structures for 1.2? Or is that 1.1?

1.1.

So what about the others?

Well, basically what we did is we took that database and clustered it at 70% sequence identity. So it's really resolving structures for everything, in the sense that for each cluster we have a cluster center.
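As a rough illustration of what clustering at 70% sequence identity with one center per cluster means, here is a toy greedy clustering sketch. It is not the pipeline described in the conversation: it assumes equal-length, already aligned sequences and a simple position-by-position identity, whereas clustering at database scale relies on alignment-based tools built for the purpose (MMseqs2 and CD-HIT are examples). The function names and the example sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity assumes equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold=0.70):
    """Assign each sequence to the first center it matches at >= threshold.

    A sequence that matches no existing center becomes a new center.
    Returns a dict mapping each center to the list of its members.
    """
    clusters = {}
    for seq in sequences:
        for center in clusters:
            if identity(seq, center) >= threshold:
                clusters[center].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no center was close enough
    return clusters


seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTGYIAREK", "GSHMLEDPVA"]  # made-up sequences
clusters = greedy_cluster(seqs)
for center, members in clusters.items():
    print(center, len(members))
```

In this picture a structure would be predicted once per center, and the other members of the cluster are assumed to share its fold.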
We're predicting the structure there, and then we can expect that the other proteins in the cluster are going to have a similar template structure. There will be small variations, but they have the same fold.

1.2 billion or so clusters that are covering the 6.8 billion.

Yeah.

Okay, interesting. And since we're talking about scaling, how do you know that this is the right number? How do you know that focusing on these 1.1 billion is the right resolution for this model?

Well, we've chosen them so that they really cover that entire space. What I can say about this database is that it's really the most comprehensive picture of protein structure and function that's been created. It's adding hundreds of millions of structures to our knowledge of the diversity of protein structure, and it's also creating this feature space that allows us to find these linkages between proteins across evolution. So we can see really interesting themes emerging across evolution.
For example, it links gene editing systems that are very far apart in sequence but share some underlying functional patterns and structural homology; the model is able to bring those together and find those connections.

Now we're talking about the mechanistic interpretability part. If I understand correctly, you use sparse autoencoders, and maybe other techniques, to understand: when I activate the network using a protein, what are the patterns of outputs that I'm seeing, and how do they relate to each other? If I understand correctly, you have these sequences that are unrelated, or only partly related, based on the actual sequence, but they have similar behavior, and therefore they are activating similar networks. Is that the summary of what you just said?

Yeah.
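For readers unfamiliar with sparse autoencoders, the sketch below shows the general shape of the technique: a model activation vector is encoded into a larger set of non-negative features, decoded back, and scored with a reconstruction error plus an L1 sparsity penalty. This is a generic, untrained illustration in plain Python, not the interpretability setup discussed in the conversation. The class name, the dimensions, the `l1_coeff` value, and the example activation are all assumptions.

```python
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class SparseAutoencoder:
    """Toy sparse autoencoder with random, untrained weights."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)]
                      for _ in range(d_model)]

    def encode(self, x):
        # ReLU keeps feature activations non-negative.
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        return matvec(self.W_dec, features)

    def loss(self, x, l1_coeff=1e-3):
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        # Features are non-negative, so their sum equals their L1 norm.
        return mse + l1_coeff * sum(features)


sae = SparseAutoencoder(d_model=8, d_features=32)
activation = [0.5, -1.2, 0.3, 0.0, 0.9, -0.4, 1.1, -0.7]  # made-up activation vector
features = sae.encode(activation)
active = [i for i, f in enumerate(features) if f > 0]
print(f"{len(active)} of {len(features)} features active")
```

With random weights, many features fire at once. Training against the loss above is what pushes most of them to zero for any given input, and the idea is that the few that remain are easier to interpret. Two proteins with little sequence similarity could then be compared by which features they activate.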