🔬 The Bitter Lesson Comes to Proteins - Alex Rives, BioHub

So ESMC is also approaching programmable biology, but I would say in a very different way. It's approaching it from a world-modeling perspective, where the idea is that you have a predictive model and you search that world model to find protein molecules that satisfy whatever design criteria you have. We've been able to use this to design many protein binders. But I think most excitingly, we've been able to use this to design antibodies, scFvs.

Hello, welcome to the Latent Space AI for Science podcast. I'm RJ Honicky, CTO of MiraOmics.

And I'm Brandon. Today it's a pleasure to have Alex Rives, head of science at BioHub. Would you like to introduce yourself real quick?

Yeah. Thank you for having me here. It's great to be here. I'm head of science at BioHub. I'm a computer scientist, and I work on AI for biology. A lot of my work has been on language models for biology.

By the time this podcast is released, you will have put out several exciting new models. Going over them, I couldn't help but think that you might be the most bitter-lesson person in protein biology right now. Can you give a little context about what that means for biology, and why you're so committed to and excited about this route?

Well, I'll take that. I believe in scaling laws. I've been working on this since the summer of 2018. My team, when we were at Meta FAIR, trained really the first transformer language model for protein biology. I've always thought that there would be an emergence of biological information as you train a model to predict the next token that evolution creates. Our team has explored that idea over a number of years, and I think we've really seen the scaling curve: as we have increased models by an order of magnitude in each generation, there's this emergence of new capabilities.

Yeah. So you say emergence of capabilities, scaling over generations. You've been working at this, as you said, for I guess 8 years now, or something like that. It didn't always work that way, right? There were signs that scaling might work. We'll be getting to some new results where I think you've clearly demonstrated this hypothesis in a way that hasn't happened before. But you seem to have a strong commitment to this, and I'm not sure I would have been so convinced that it would work in the same way. I mean, the protein language is not the same thing as natural language. There are similarities, but if you start sampling a normal language transformer at a high temperature, you're going to get gibberish.
If you sample a protein language model at infinite temperature, you're going to get something which is a valid protein, if not an interesting one. It's a different domain. I'm not sure that I would have assumed the natural language model insight would transfer over. So what specifically about proteins did you think was special, or would make this also valid?

Yeah, it's a really interesting question, and I think a deep question across AI more broadly right now. What's so interesting is that AI right now is such an empirical science. We don't have theory that can always guide us in these things, but we have this really strong empirical evidence of scaling. The thing that I was motivated by is this: if you think about evolution, and about the data that we have around proteins, we have databases that have billions of protein sequences.
Those sequences contain patterns. It had long been known, going back decades before we started working on this with language models, that there are patterns in the sequences of protein families that are there because of the constraints that evolution is operating under. So you can think about a protein sequence that folds into a three-dimensional structure in space, and you can imagine that there are two residues, or amino acids, in this sequence that might be in contact in that folded structure. Evolution isn't free to choose those independently from each other. If it makes a choice at one position, it has to make a compatible choice at the other position. So going all the way back to the beginning of gene sequencing, when people first began to be able to look at the same protein in related organisms, you could start to see these patterns that reflect the fundamental underlying biology.
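
As a rough illustration of the covariation idea described here, the Python sketch below measures how strongly two columns of a sequence alignment vary together, using mutual information. It is a classical, pre-language-model statistic and not the ESM method. The function name `column_mutual_information` and the four-sequence `toy_alignment` are hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length strings, one per aligned sequence.
    High values mean the residue at column i is informative about column j.
    """
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 1 and 3 always change together
# (K pairs with E, R pairs with D), while column 2 varies independently.
toy_alignment = ["AKLE", "AKVE", "ARLD", "ARVD"]

coupled = column_mutual_information(toy_alignment, 1, 3)
independent = column_mutual_information(toy_alignment, 1, 2)
print(f"coupled columns: {coupled:.2f} bits")
print(f"independent columns: {independent:.2f} bits")
```

In this toy case the coupled pair carries one bit of shared information and the independent pair carries none, which is the kind of signal that hints at two positions being in contact.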
So the thinking behind ESM was: okay, what if you were to apply this principle across all of evolution, across the vast diversity of proteins that have been generated across all of life, and basically have a language model predict the amino acids that evolution will choose to place in proteins across all of those biological contexts? There's just an incredible amount of information in that total picture about the underlying biology of proteins. That was really the idea that sparked this: as a model has to predict the next token (and actually we train these models with masked language modeling, so they're predicting tokens that are masked out of various parts of the sequence), it would have to learn something about those underlying constraints that are shaping which tokens evolution can choose.
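
To make the masked language modeling setup concrete, here is a minimal Python sketch of how a training example could be prepared: some positions are hidden, and the model's job is to recover the original amino acids at those positions. The 15% masking rate, the `<mask>` token string, the function name `mask_sequence`, and the example sequence are all assumptions made for illustration. The transcript does not specify how ESM models are actually masked.

```python
import random

MASK = "<mask>"  # assumed placeholder token, not taken from any real tokenizer


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues and return (masked_tokens, targets).

    `targets` maps each masked position to the residue the model must predict.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    n_mask = max(1, round(mask_fraction * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))

    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK
    return tokens, targets


# Hypothetical amino acid sequence used only as input for the sketch.
example = "MKTAYIAKQRQISFVKSHFSRQ"
masked_tokens, targets = mask_sequence(example)
print(" ".join(masked_tokens))
print(targets)
```

A model trained this way sees the unmasked context on both sides of each hidden position, which is why it can pick up constraints between residues that are far apart in the sequence.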
Yeah. So maybe for a bit of history: you just released Evolutionary Scale Modeling Cambrian, right? Is that what it's called?

Yeah.

And this is maybe the fourth or fifth in a series of models. I think maybe even more if you go back before they were called ESM.

Well, they were called ESM from the start. We had various branches of the different models. So this one I would say is a fourth-generation model. It's actually a model that we trained a little over a year ago. Now that we're at BioHub, we're open sourcing this model fully under the MIT license for the first time, so we're really excited to do that. But the big thing that is new here is that we've really built a world model of protein biology. The foundation of that is ESMC. Using the representations of ESMC, we've now built a structure prediction model.
This is the next-generation ESMFold model. We've also used the techniques of mechanistic interpretability and sparse coding to start to look deeply into the representation space of the language model and pull out the underlying features that the model actually uses to represent protein biology. Bringing all of this together, we're able to make predictions for protein structure, and predictions about the underlying features that proteins are made out of, which allows us to build linkages across evolution. We're able to take this model and invert it to design proteins. And we've used this to create a comprehensive picture of protein biology. We put together all the world's largest protein sequence databases, which amounts to 6.8 billion non-redundant proteins. And then we've resolved predicted structures for 1.1 billion of those.
We've also computed features across all of those, so that we can make these linkages basically all across evolution and protein biology.

6.8 billion, of which you've resolved structure for 1.2? Or is that 1.1?

1.1.

So what about the others?

Well, basically what we did is we took that database and we clustered it at 70% sequence identity. So it's really resolving structures for everything, in the sense that for each cluster we have a cluster center. We're predicting the structure there, and then we can expect that the other proteins are going to have a similar template structure. There will be small variations, but they have the same fold.

So 1.2 billion or so clusters that are covering the 6.8 billion.

Yeah.

Okay, interesting. And since we're talking about scaling, how do you know that this is the right number? How do you know that focusing on these 1.1 billion is the right resolution for this model?

Well, we've chosen them so that they really cover that entire space.
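
The clustering step described above can be sketched as a greedy procedure: each sequence joins the first cluster whose representative it matches at 70% identity or more, and otherwise becomes the center of a new cluster. This is a simplified Python sketch under strong assumptions, not the pipeline used for the database. Identity is computed position by position with no alignment, and the names `identity` and `greedy_cluster` and the example sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions, measured against the longer sequence.

    Simplification: no alignment is performed, so this is only meaningful
    for sequences that already line up position by position.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b), 1)


def greedy_cluster(sequences, threshold=0.7):
    """Assign each sequence to the first representative within `threshold`.

    Returns a dict mapping each cluster representative to its members.
    Longer sequences are processed first so they become representatives.
    """
    clusters = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


# Hypothetical sequences: three close variants and one unrelated sequence.
example_sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "GSHMLEDPVA"]
for rep, members in greedy_cluster(example_sequences).items():
    print(rep, "->", members)
```

With one predicted structure per representative, every member of a cluster inherits an approximate template, which is the sense in which a subset of structures can cover the full database.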
So I think what I can say about this database is that it's really the most comprehensive picture of protein structure and function that's been created. It's adding hundreds of millions of structures to our knowledge of the diversity of protein structure, and it's also creating this feature space that allows us to find these linkages between proteins across evolution. So we can see really interesting themes emerging across evolution, linking, for example, gene editing systems that are very far apart in sequence but share underlying functional patterns and structural homology, which the model is able to bring together to find those connections.

Now we're talking about the mechanistic interpretability part. If I understand correctly, you use sparse autoencoders, and maybe other techniques, to understand: okay, when I activate the network using a protein,
then what are the patterns of outputs that I'm seeing, and how do they relate to each other? If I understand correctly, you have these sequences that are unrelated, or only partly related, based on the actual sequence, but they have similar behavior and are therefore activating similar networks. Is that kind of the summary of what you just said?

Yeah.
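
For readers unfamiliar with sparse autoencoders, the Python sketch below shows the basic forward pass and loss: a model activation is encoded into a wider set of non-negative features, decoded back, and penalized both for reconstruction error and for using many features at once. It is a generic textbook formulation, not the setup used on ESMC. The weights, the L1 coefficient, and the names `sae_forward` and `sae_loss` are hypothetical.

```python
def sae_forward(activation, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into ReLU features, then decode it back.

    w_enc has one row per feature; w_dec has one row per feature and one
    column per activation dimension.
    """
    features = [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(w_enc, b_enc)
    ]
    reconstruction = [
        sum(w_dec[k][d] * features[k] for k in range(len(features))) + b_dec[d]
        for d in range(len(activation))
    ]
    return features, reconstruction


def sae_loss(activation, features, reconstruction, l1_coeff=0.01):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


# Hypothetical hand-set weights: 2 activation dimensions, 3 features.
w_enc = [[1, 0], [0, 1], [-1, -1]]
b_enc = [0, 0, 0]
w_dec = [[1, 0], [0, 1], [0, 0]]
b_dec = [0, 0]

activation = [2.0, 0.0]
features, reconstruction = sae_forward(activation, w_enc, b_enc, w_dec, b_dec)
print(features, reconstruction, sae_loss(activation, features, reconstruction))
```

Two proteins with little sequence similarity can still switch on the same feature in a setup like this, which is the kind of shared activation the question above is describing.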