🔬 The Bitter Lesson Comes to Proteins: Alex Rives, BioHub

So ESMC is also approaching programmable biology, but I would say in a very different way. It's approaching it from a world-modeling perspective, where the idea is basically that you have a predictive model and you search the world model to find protein molecules that satisfy whatever design criteria you have. We've been able to use this to design many protein binders. But I think most excitingly, we've been able to use this to design antibodies, scFvs.
Hello, welcome to the Latent Space AI for Science podcast. I'm RJ Honicky, CTO of Muromix.
And I'm Brandon. Today it's a pleasure to have Alex Rives, head of science at Biohub. Would you like to introduce yourself real quick?
Yeah, thank you for having me here. It's great to be here. I'm head of science at Biohub. I'm a computer scientist, I work on AI for biology, and a lot of my work has been on language models for biology.
By the time this podcast is released, you will have put out several exciting new models. Going over them, I couldn't help but think that you might be the most bitter-lesson person in protein biology right now. Can you give a little context about what that means for biology, and why you're so committed to and excited about this route?
Well, I'll take that. I believe in scaling laws. I've been working on this since the summer of 2018. My team, when we were at Meta FAIR, trained really the first transformer language model for protein biology. I've always thought that there would be an emergence of biological information as you train a model to predict the next token that evolution creates. Our team has explored that idea over a number of years, and I think we've really seen the scaling curve: as we have increased models by an order of magnitude in each generation, there's this emergence of new capabilities.
Yeah. So you say emergence of capabilities, scaling over generations. You've been working at this, as you said, for I guess 8 years now or something like that. It didn't always work that way, right? There were signs that scaling might work. We'll be getting to some new results where I think you've clearly demonstrated this hypothesis in a way that hasn't happened before. But you seem to have a strong commitment to this, in a way that I'm not sure I would have had the conviction that it would work in the same way. I mean, the protein language is not the same thing as natural language. There are similarities, but if you start sampling a normal language transformer at high temperature, you're going to get gibberish.
If you sample a protein language model at infinite temperature, you're going to get something which is a valid protein, if not an interesting one, even though it is a different domain, for a different reason. I'm not sure that I would have assumed the natural language model insight would transfer over. So what specifically about proteins did you think was special, or would make this also valid?
Yeah, it's a really interesting question, and I think a deep question across AI more broadly right now. What's so interesting is that AI right now is such an empirical science. We don't have theory that can always guide us in these things, but we have this really strong empirical evidence of scaling. The thing that I was motivated by is this: if you think about evolution and the data that we have around proteins, we have databases that have billions of protein sequences.
Those sequences contain patterns. It had long been known, going back decades before we started working on this with language models, that there are patterns in the sequences of protein families that come about because of the constraints that evolution is operating under. You can think about a protein sequence that folds into a three-dimensional structure in space, and you can imagine that there are two residues, or amino acids, in this sequence that might be in contact in that folded structure. Evolution isn't free to choose those independently of each other. If it makes a choice at one position, it has to make a compatible choice at the other position. So going all the way back to the beginning of gene sequencing, when people first began to be able to look at the same protein in related organisms, you could start to see these patterns that reflect the fundamental underlying biology.
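As a rough illustration of the covariation idea in this passage, the sketch below measures how strongly two columns of a sequence alignment vary together, using mutual information. This is one classical way to quantify such coupling, not something described in the conversation; the toy alignment and the name `column_mutual_information` are made up for the example.

```python
from collections import Counter
from math import log2


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)
    counts_ij = Counter((seq[i], seq[j]) for seq in alignment)

    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical aligned sequences of the "same protein" from related organisms.
# Columns 1 and 3 always change together (K with E, R with D), the kind of
# pattern two residues in contact could leave behind.
toy_alignment = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
]

print(column_mutual_information(toy_alignment, 1, 3))  # 1.0 (fully coupled)
print(column_mutual_information(toy_alignment, 0, 1))  # 0.0 (column 0 never varies)
```

A language model trained across many protein families has to pick up this kind of dependency implicitly, rather than computing it column by column on a single alignment.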
So the thinking behind ESM was: what if you were to apply this principle across all of evolution, across the vast diversity of proteins that have been generated across all of life, and basically have a language model predict the amino acids that evolution will choose to place in proteins across all of those biological contexts? There's just this incredible amount of information in that total picture about the underlying biology of proteins. That was really the idea that sparked this: as a model has to predict the next token (and actually we train these models with masked language modeling, so they're predicting tokens that are masked out of various parts of the sequence), it would have to learn something about those underlying constraints that are shaping which tokens evolution can choose.
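To make the masked language modeling setup concrete, here is a minimal sketch of how a training example could be built from a protein sequence. It is a simplification under stated assumptions: the 15% masking rate and the `<mask>` token are borrowed from common practice in masked language models, and `mask_sequence` and the example sequence are hypothetical, not taken from the ESM training code.

```python
import random

MASK = "<mask>"  # assumed mask token, for illustration only


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Hide a random subset of residues; return masked tokens and the targets."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    tokens = list(sequence)
    targets = {}  # position -> original residue the model must predict
    for position, residue in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[position] = residue
            tokens[position] = MASK
    return tokens, targets


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up amino acid sequence
masked_tokens, targets = mask_sequence(sequence)

print(" ".join(masked_tokens))
print(targets)
```

During training, the model sees `masked_tokens` and is scored on how well it recovers each residue in `targets` from the surrounding context, which is where knowledge of the constraints between positions becomes useful.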
Yeah. So maybe for a bit of history: you just released Evolutionary Scale Modeling Cambrian, right? Is that what it's called?
Yeah.
And this is maybe the fourth or fifth in a series of models. I think maybe even more if you go back before they were called ESM.
Well, they were called ESM from the start. We had various branches of the different models. This one, I would say, is a fourth-generation model. It's actually a model that we trained a little over a year ago. Now that we're at Biohub, we're open sourcing this model fully under the MIT license for the first time, so we're really excited to do that. But the big thing that is new here is that we've really built a world model of protein biology. The foundation of that is ESMC. Using the representations of ESMC, we've now built a structure prediction model.
This is the next-generation ESMFold model. We've also used the techniques of mechanistic interpretability and sparse coding to start to look deeply into the representation space of the language model and pull out the underlying features that the model actually uses to represent protein biology. Bringing all of this together, we're able to make predictions of protein structure, and predictions about the underlying features that proteins are made out of, which allows us to build linkages across evolution. We're able to take this model and invert it to design proteins. And we've used this to create a comprehensive picture of protein biology. We put together all the world's largest protein sequence databases, which amounts to 6.8 billion non-redundant proteins, and we've resolved predicted structures for 1.1 billion of those.
We've also computed features across all of those, so that we can make these linkages basically all across evolution and protein biology.
6.8 billion, of which you've resolved structures for 1.2... is that 1.1?
1.1.
So what about the others?
Basically, what we did is we took that database and clustered it at 70% sequence identity. So it's really resolving structures for everything, in the sense that for each cluster we have a cluster center. We're predicting the structure there, and then we can expect that the other proteins are going to have a similar template structure. There will be small variations, but they have the same fold.
1.2 billion or so clusters that are covering the 6.8 billion.
Yeah.
Okay, interesting. And since we're talking about scaling, how do you know that this is the right number? How do you know that focusing on these 1.1 billion is the right resolution for this model?
Well, we've chosen them so that they really cover that entire space. What I can say about this database is that it's really the most comprehensive picture of protein structure and function that's been created. It's adding hundreds of millions of structures to our knowledge of the diversity of protein structure, and it's also creating this feature space that allows us to find these linkages between proteins across evolution. So we can see really interesting themes emerging across evolution, linking, for example, gene editing systems that are very far apart in sequence but share underlying functional patterns and structural homology that the model is able to bring together, finding those connections.
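The cluster-center idea from this exchange (cluster at 70% sequence identity, then predict one structure per cluster) can be sketched roughly as follows. This is a toy under loud assumptions: identity is computed position by position on equal-length sequences, whereas real clustering tools work on alignments at a very different scale, and the conversation does not say which tool or algorithm was used. The names `sequence_identity` and `greedy_cluster` and the sequences are hypothetical.

```python
def sequence_identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("this toy identity measure needs equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_cluster(sequences, threshold=0.7):
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters = {}  # representative sequence -> list of member sequences
    for seq in sequences:
        for representative in clusters:
            if sequence_identity(seq, representative) >= threshold:
                clusters[representative].append(seq)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


toy_sequences = [
    "MKTAYIAKQR",  # becomes the first cluster center
    "MKTAYIAKQK",  # 90% identical to the first sequence
    "MKTAYLSKQR",  # 80% identical to the first sequence
    "GGSPWWLLDE",  # unrelated, becomes its own cluster center
]

clusters = greedy_cluster(toy_sequences, threshold=0.7)
for representative, members in clusters.items():
    print(representative, len(members))
```

In this picture a structure would be predicted only for each representative, with the other members of its cluster expected to share the same fold.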
Now we're talking about the mechanistic interpretability part. If I understand correctly, you use sparse autoencoders, and maybe other techniques, to understand: when I activate the network using a protein, what are the patterns of outputs that I'm seeing, and how do they relate to each other? If I understand correctly, you have these sequences that are unrelated, or only partly related, based on the actual sequence, but in terms of behavior they are similar, and therefore they activate similar networks. Is that the summary of what you just said?
Yeah.
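For readers unfamiliar with sparse autoencoders, the sketch below shows the basic forward pass in plain Python: an activation vector is encoded into a larger set of non-negative features, most of which are zero, and then decoded back. It is a generic textbook-style illustration with hand-picked weights, not the architecture or training setup used for ESMC; the function `sae_forward` and all values are assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, l1_coeff=0.01):
    """Encode with ReLU, decode linearly, and return features, reconstruction, loss."""
    pre_activations = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre_activations]  # ReLU keeps features non-negative
    reconstruction = matvec(w_dec, features)
    mse = sum((r - xi) ** 2 for r, xi in zip(reconstruction, x)) / len(x)
    sparsity_penalty = l1_coeff * sum(features)  # L1 norm, since features are >= 0
    return features, reconstruction, mse + sparsity_penalty


# Toy setup: a 2-dimensional "activation" expanded into 4 candidate features.
w_enc = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1, -1, 0, 0], [0, 0, 1, -1]]

activation = [2.0, -3.0]
features, reconstruction, loss = sae_forward(activation, w_enc, b_enc, w_dec)

print(features)        # only two of the four features are active
print(reconstruction)  # matches the input activation
```

The interpretability use described in the conversation relies on the same intuition: two proteins with little sequence similarity can still switch on the same sparse features, which is what lets distant relationships be linked.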