
🔬 The Bitter Lesson Comes for Proteins Too: Alex Rives, Biohub

So ESMC is also approaching programmable biology, but I would say in a very different way. It's approaching it from this kind of world modeling perspective, where the idea is basically that you have a predictive model and you search the world model to find protein molecules that satisfy whatever design criteria you have. So we've been able to use this to actually go and design many protein binders. But I think most excitingly, we've been able to use this to actually design antibodies, scFvs.

Hello, welcome to the Latent Space AI for Science podcast. I'm R.J. Honicky, CTO of Muromix.

And I'm Brandon. Today it's a pleasure to have Alex Rives, head of science at Biohub. Would you like to introduce yourself real quick?

Yeah, thank you for having me here. It's great to be here. I'm head of science at Biohub. I'm a computer scientist, I work on AI for biology, and a lot of my work has been on language models for biology.

By the time this podcast is released, you will have put out several exciting new models. Going over them, I couldn't help but think that you might be the most bitter-lesson person in protein biology right now.
Can you give a little context about what that means for biology, and why you're so committed to and excited about this route?

Well, I'll take that. I believe in scaling laws. I've been working on this since the summer of 2018. My team, when we were at Meta FAIR, trained really the first transformer language model for protein biology. I've always thought that there would be an emergence of biological information as you train a model to predict the next token that evolution creates. Our team has explored that idea over a number of years, and I think we've really seen the scaling curve: as we have increased models by an order of magnitude in each generation, there's this emergence of new capabilities.

So you say emergence of capabilities, scaling over generations. You've been working at this, as you said, for I guess 8 years now or something like that. It didn't always work that way, right? There were signs that scaling might work. We'll be getting to some new results where I think you've clearly demonstrated this hypothesis in a way that hasn't happened before. But you seem to have a strong commitment to this, and I'm not sure I would have been so convinced that it would work in the same way.
I mean, the protein language is not the same thing as natural language. There are similarities, but if you start sampling a normal language transformer at high temperature, you're going to get gibberish. If you sample a protein language model at infinite temperature, you're going to get something which is a valid protein, if not an interesting one. So it's a different domain, for a different reason, and I'm not sure that I would a priori assume the natural language model insight would transfer over. So what specifically about proteins did you think was special, or would make this also valid?

Yeah, it's a really interesting question, and I think a deep question across AI right now more broadly. What's so interesting is that AI right now is such an empirical science. We don't have theory that can always guide us in these things, but we have this really strong empirical evidence of scaling. The thing that I was motivated by is, if you think about evolution and the data that we have around proteins, we have databases that have billions of protein sequences.
And those sequences contain patterns. It had long been known, going back decades before we started working on this with language models, that there are patterns in the sequences of protein families that are there because of the constraints that evolution is operating under. So you can think about a protein sequence that folds into a three-dimensional structure in space. And you can imagine that there are two residues, or amino acids, in this sequence that might be in contact in that folded structure. Evolution isn't free to choose those independently of each other. If it makes a choice at one position, it has to make another choice that's going to be compatible at the other position. So going all the way back to the beginning of gene sequencing, when people first began to be able to look at the same protein in related organisms, you could start to see these patterns that reflect the fundamental underlying biology. So the thinking behind ESM was: what if you were to apply this principle across all of evolution, across the vast diversity of proteins that have been generated across all of life, and basically have a language model predict the amino acids that evolution will choose to place in proteins across all of those biological contexts?
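The point about contacting residues not being chosen independently can be made concrete with a small sketch. This is not the ESM method; it is the classical covariation idea, scored here with mutual information between two columns of a multiple sequence alignment. The function name `column_mutual_information` and the toy alignment are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an aligned MSA."""
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    left_counts = Counter(seq[i] for seq in msa)
    right_counts = Counter(seq[j] for seq in msa)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = left_counts[a] / n
        p_b = right_counts[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment (made up): columns 1 and 3 covary (K pairs with E, R with D),
# while column 2 varies independently of column 1.
toy_msa = [
    "AKGEL",
    "AKSEL",
    "ARGDL",
    "ARSDL",
]

coupled = column_mutual_information(toy_msa, 1, 3)
uncoupled = column_mutual_information(toy_msa, 1, 2)
print(f"coupled columns: {coupled:.2f} bits, uncoupled columns: {uncoupled:.2f} bits")
```

A high score between two columns suggests that the positions constrain each other, which is the kind of signal a language model trained across many protein families could pick up implicitly.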
So you can think that there's just this incredible amount of information in that total picture about the underlying biology of proteins. And that was really the idea that sparked this: as a model has to predict the next token (actually, we train these models with masked language modeling, so they're predicting tokens that are masked out of various parts of the sequence), it would have to learn something about those underlying constraints that are shaping which tokens evolution can choose.

So maybe for a bit of history: you just released Evolutionary Scale Modeling Cambrian, right? Is that what it's called? And this is maybe the fourth or fifth in a series of models, maybe even more if you go back before they were called ESM.

Well, they were called ESM from the start. We had various branches of the different models. This one I would say is kind of a fourth-generation model. It's actually a model that we trained a little over a year ago. Now that we're at Biohub, we're open sourcing this model fully under the MIT license for the first time, so we're really excited to do that. But the big thing that is new here is that we've really built a world model of protein biology. The foundation of that is ESMC.
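The masked language modeling objective mentioned above can be sketched in a few lines. This is a minimal illustration, not the ESMC training code: the 15% mask rate, the `<mask>` token string, the example sequence, and the `uniform_predictor` stand-in for a real model are all assumptions made for the example.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "<mask>"  # hypothetical mask token


def mask_sequence(sequence, mask_rate, rng):
    """Hide a random subset of positions; return (tokens, {position: true residue})."""
    tokens = list(sequence)
    n_masked = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_masked)
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK
    return tokens, targets


def masked_lm_loss(predict, tokens, targets):
    """Average negative log-likelihood of the true residues at the masked positions."""
    total = 0.0
    for pos, residue in targets.items():
        probs = predict(tokens, pos)  # dict mapping residue -> probability
        total -= math.log(probs[residue])
    return total / len(targets)


def uniform_predictor(tokens, pos):
    """Placeholder model that knows nothing: every residue is equally likely."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


rng = random.Random(0)
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ", mask_rate=0.15, rng=rng)
loss = masked_lm_loss(uniform_predictor, tokens, targets)
print(f"{len(targets)} masked positions, loss = {loss:.3f} nats")
```

A model that has learned nothing scores log(20) nats per masked residue. Any improvement over that baseline has to come from using the unmasked context, which is where the constraints described in the conversation would enter.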
Using the representations of ESMC, we've now built a structure prediction model, and this is the next-generation ESMFold model. We've also used the techniques of mechanistic interpretability and sparse coding to start to look deeply into the representation space of the language model and pull out the underlying features that the model actually uses to represent protein biology. Bringing all of this together, we're able to make predictions for protein structure, and predictions about the underlying features that proteins are made out of, which allows us to build linkages across evolution. We're able to take this model and invert it to design proteins. And we've used this to create a comprehensive picture of protein biology. We put together all the world's largest protein sequence databases, and that amounts to 6.8 billion non-redundant proteins. We've resolved predicted structures for 1.1 billion of those, and we've also computed features across all of them so that we can make these linkages across evolution and protein biology.

6.8 billion, of which you've resolved structures for 1.2... is that 1.1?

1.1.

So what about the others?

Well, basically what we did is we took that database and we clustered it at 70% sequence identity.
So it's really resolving structures for everything, in the sense that for each cluster we have a cluster center. We're predicting the structure there, and then we can expect that the other proteins are going to have a similar template structure. There will be small variations, but they have the same fold.

1.2 billion or so clusters that are covering the 6.8 billion. Okay, interesting. And since we're talking about scaling, how do you know that this is the right number? How do you know that focusing on these 1.1 billion is the right resolution for this model?

Well, we've chosen them so that they really cover that entire space. What I can say about this database is that it's really the most comprehensive picture of protein structure and function that's been created. It's adding hundreds of millions of structures to our knowledge of the diversity of protein structure, and it's also creating this feature space that allows us to find these linkages between proteins across evolution. So we can see really interesting themes emerging across evolution, linking for example gene editing systems which are very far apart in sequence but share some underlying functional patterns, structural homology, that the model is able to bring together to find those connections.
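The clustering step described earlier (group sequences at 70% identity, then predict one structure per cluster center) can be sketched as a greedy procedure. This is an illustration under simplifying assumptions, not the pipeline used for the database: the sequences are assumed to be pre-aligned and of equal length so that identity is a simple per-position comparison, and the names `sequence_identity` and `greedy_cluster` and the toy sequences are hypothetical.

```python
def sequence_identity(a, b):
    """Fraction of identical positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length, pre-aligned sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold=0.7):
    """Assign each sequence to the first representative it matches, else start a new cluster."""
    clusters = {}  # representative sequence -> list of member sequences
    for seq in sequences:
        for representative in clusters:
            if sequence_identity(seq, representative) >= threshold:
                clusters[representative].append(seq)
                break
        else:
            # No existing representative is close enough, so seq becomes a new center.
            clusters[seq] = [seq]
    return clusters


toy_sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLSKQR", "GGSPLVWDEN"]
clusters = greedy_cluster(toy_sequences)
for representative, members in clusters.items():
    print(f"{representative}: {len(members)} member(s)")
```

In this picture a structure would be predicted once per representative, and the other members of the cluster would be expected to share its fold with small variations. Real tools compute identity from alignments rather than position-by-position comparison.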
Now we're talking about the mechanistic interpretability part. If I understand correctly, you use sparse autoencoders, and maybe other techniques, to understand: when I activate the network using a protein, what are the patterns of outputs that I'm seeing, and how do they relate to each other? And if I understand correctly, you have these sequences that are unrelated, or only partly related, based on the actual sequence, but they have similar behavior, and therefore they are activating similar networks. Is that kind of the summary of what you just said?

Yeah.