🔬 The Bitter Lesson Comes for Proteins - Alex Rives, Biohub

ESMC is also approaching programmable biology, but I would say in a very different way. It's approaching it from a world modeling perspective, where the idea is that you have a predictive model and you search the world model to find protein molecules that satisfy whatever design criteria you have. We've been able to use this to design many protein binders. But I think most excitingly, we've been able to use this to design antibodies, scFvs.
Hello, welcome to the Latent Space AI for Science podcast. I'm RJ Honicky, CTO of MiraOmics.
And I'm Brandon. Today it's a pleasure to have Alex Rives, head of science at Biohub. Would you like to introduce yourself real quick?
Yeah, thank you for having me. It's great to be here. I'm head of science at Biohub. I'm a computer scientist, I work on AI for biology, and a lot of my work has been on language models for biology.
By the time this podcast is released, you will have put out several exciting new models. Going over them, I couldn't help but think that you might be the most bitter-lesson person in protein biology right now. Can you give a little context about what that means for biology, and why you're so committed to and excited about this route?
Well, I'll take that. I believe in scaling laws. I've been working on this since the summer of 2018. My team, when we were at Meta FAIR, trained really the first transformer language model for protein biology. I've always thought that there would be an emergence of biological information as you train a model to predict the next token that evolution creates. Our team has explored that idea over a number of years, and I think we've really seen the scaling curve: as we have increased models by an order of magnitude in each generation, there's this emergence of new capabilities.
Yeah. So you say emergence of capabilities, scaling over generations. You've been working on this, as you said, for I guess 8 years now or something like that. It didn't always work that way, right? There were signs that scaling might work. We'll be getting to some new results where I think you've clearly demonstrated this hypothesis in a way that hasn't happened before. But you seem to have a strong commitment to this, in a way that I'm not sure I would have been so convinced it would work in the same way. I mean, the protein language is not the same thing as natural language. There are similarities, but if you start sampling a normal language transformer at high temperature, you're going to get gibberish. If you sample a protein language model at infinite temperature, you're going to get something which is a valid protein, if not an interesting protein, despite the fact that it is a different domain, for a different reason.
I'm not necessarily sure that I would a priori assume the natural language model insights would transfer over. So what specifically about proteins did you think was special, or would make this also valid?
Yeah, it's a really interesting question, and I think a deep question across AI more broadly right now. What's so interesting is that AI right now is such an empirical science. We don't have theory that can always guide us in these things, but we have this really strong empirical evidence of scaling. The thing I was motivated by is this: if you think about evolution and the data we have around proteins, we have databases with billions of protein sequences. Those sequences contain patterns. It had long been known, going back decades before we started working on this with language models, that there are patterns in the sequences of protein families that are there because of the constraints evolution is operating under.
So you can think about a protein sequence that folds into a three-dimensional structure in space. You can imagine that there are two residues, or amino acids, in this sequence that might be in contact in that folded structure. Evolution isn't free to choose those independently of each other. If it makes a choice at one position, it has to make another choice that's going to be compatible at the next position. Going all the way back to the beginning of gene sequencing, when people first began to be able to look at the same protein in related organisms, you could start to see these kinds of patterns that reflect the fundamental underlying biology.
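As an illustration of the co-variation idea described above (two positions that are in contact cannot vary independently), here is a minimal sketch that measures mutual information between columns of a tiny alignment. The alignment and the function names are hypothetical and made up for this example; this is a classic pre-language-model way to look for such patterns, not the method ESM uses.

```python
import math
from collections import Counter


def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    counts_i = Counter(column(alignment, i))
    counts_j = Counter(column(alignment, j))
    counts_ij = Counter(zip(column(alignment, i), column(alignment, j)))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: positions 1 and 3 always swap together
# (K pairs with E, E pairs with K), as two contacting residues might.
toy_alignment = [
    "AKLEG",
    "AELKG",
    "AKVEG",
    "AELKG",
]

if __name__ == "__main__":
    print(mutual_information(toy_alignment, 1, 3))  # co-varying pair
    print(mutual_information(toy_alignment, 0, 1))  # conserved column vs. variable column
```

In this toy case the co-varying pair carries a full bit of shared information, while a fully conserved column shares none with any other column.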
So the thinking behind ESM was: okay, what if you were to apply this principle across all of evolution, across the vast diversity of proteins that have been generated across all of life, and basically have a language model predict the amino acids that evolution will choose to place in proteins across all of those biological contexts? There's just this incredible amount of information in that total picture about the underlying biology of proteins. That was really the idea that sparked this: as a model has to predict the next token (and actually we train these models with masked language modeling, so they're predicting tokens that are masked out of various parts of the sequence), it would have to learn something about those underlying constraints that are shaping which tokens evolution can choose.
Yeah. So maybe for a bit of history: you just released Evolutionary Scale Modeling Cambrian, right? Is that what it's called?
Yeah.
And this is maybe the fourth or fifth in a series of models. I think maybe even more if you go back before they were called ESM.
Well, they were called ESM from the start. We had various branches of the different models.
Yeah.
So this one I would say is a fourth-generation model. It's actually a model that we trained a little over a year ago. Now that we're at Biohub, we're open sourcing this model fully, under an MIT license, for the first time, so we're really excited to do that. But the big thing that is new here is that we've really built a world model of protein biology. The foundation of that is ESMC. Using the representations of ESMC, we've now built a structure prediction model, and this is the next-generation ESMFold model. And then we've also used the techniques of mechanistic interpretability and sparse coding to start to look deeply into the representation space of the language model and pull out the underlying features that the model actually uses to represent protein biology.
So bringing all of this together, we're able to make predictions for protein structure, and predictions about the underlying features that proteins are made of, which allows us to build linkages across evolution. We're able to take this model and invert it to design proteins. And we've used this to create a comprehensive picture of protein biology. We put together all the world's largest protein sequence databases, which amounts to 6.8 billion non-redundant proteins. We've resolved predicted structures for 1.1 billion of those. And we've also computed features across all of those, so that we can make these linkages all across evolution and protein biology.
6.8 billion, of which you've resolved structure for 1.2... is that 1.1?
1.1.
So what about the others?
Basically, what we did is we took that database and clustered it at 70% sequence identity. So it's really resolving structures for everything, in the sense that for each cluster we have a cluster center. We predict the structure there, and then we can expect that the other proteins are going to have a similar template structure. There will be small variations, but they have the same fold.
1.2 billion or so clusters that are covering the 6.8 billion.
Yeah.
Okay, interesting. And maybe since we're talking about scaling, how do you know that this is the right number? How do you know that focusing on these 1.1 billion is the right resolution for this model?
Well, we've chosen them so that they really cover that entire space. What I can say about this database is that it's really the most comprehensive picture of protein structure and function that's been created.
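To make the "cluster at 70% sequence identity, then predict one structure per cluster center" step concrete, here is a minimal sketch of greedy clustering. Everything in it is an assumption for illustration: the conversation does not say which tool or algorithm was used, real pipelines compute identity from alignments rather than position by position, and the sequences are invented.

```python
def identity(a, b):
    """Fraction of matching positions. Toy version: only defined for
    non-empty sequences of equal length, otherwise treated as unrelated."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold=0.7):
    """Assign each sequence to the first representative it matches at or
    above the threshold; otherwise make it a new cluster representative."""
    clusters = {}  # representative sequence -> list of member sequences
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# Hypothetical sequences: three close variants and one unrelated protein.
toy_sequences = [
    "MKTAYIAKQR",
    "MKTAYIAKQK",
    "MKTAYLAKQR",
    "GSHMLEDPVD",
]

if __name__ == "__main__":
    for rep, members in greedy_cluster(toy_sequences).items():
        print(rep, len(members))
```

Under this scheme an expensive step such as structure prediction would run once per representative, and the other members of the cluster are assumed to share its fold.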
It's adding hundreds of millions of structures to our knowledge of the diversity of protein structure, and it's also creating this feature space that allows us to find these linkages between proteins across evolution. So we can see really interesting themes emerging across evolution, linking, for example, gene editing systems which are very far apart in sequence but share some underlying functional patterns, structural homology, that the model is able to bring together to find those connections.
Now we're talking about the mechanistic interpretability part. If I understand correctly, you use sparse autoencoders, and maybe other techniques, to understand: okay, when I activate the network using a protein, what are the patterns of outputs that I'm seeing, and how do they relate to each other? If I understand correctly, you have these sequences that are unrelated, or only partly related, based on the actual sequence, but in terms of behavior they have similar behavior, and therefore they are activating similar networks.
Is that kind of the summary of what you just said?
Yeah.
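The idea summarized above (proteins that differ in sequence but activate the same learned features) can be sketched with a toy sparse autoencoder. This is only an illustration under stated assumptions: the weights are hand-picked rather than trained, the three-dimensional "embeddings" are invented stand-ins for language model activations, and none of the names come from the ESM code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """Encoder: linear map plus ReLU, giving non-negative feature activations."""
    pre = [z + b for z, b in zip(matvec(w_enc, x), b_enc)]
    return [max(0.0, v) for v in pre]


def sae_decode(features, w_dec, b_dec):
    """Decoder: linear map from feature activations back to embedding space.
    Training would minimize reconstruction error plus a sparsity penalty."""
    return [z + b for z, b in zip(matvec(w_dec, features), b_dec)]


def active_features(features, eps=1e-6):
    """Indices of features that fire for this input."""
    return {i for i, v in enumerate(features) if v > eps}


def jaccard(a, b):
    """Overlap between two sets of active features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical, untrained weights: 3-dimensional embeddings, 4 features.
W_ENC = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [-1, 0, 0]]
B_ENC = [-0.5, -0.5, -0.5, -0.5]

# Hypothetical embeddings of three proteins.
protein_a = [1.0, 0.9, 0.0]
protein_b = [0.8, 1.2, 0.1]
protein_c = [-1.0, 0.0, 1.0]

if __name__ == "__main__":
    feats = {name: active_features(sae_encode(x, W_ENC, B_ENC))
             for name, x in [("a", protein_a), ("b", protein_b), ("c", protein_c)]}
    print(feats)
    print(jaccard(feats["a"], feats["b"]), jaccard(feats["a"], feats["c"]))
```

Comparing sets of active features, rather than raw sequences, is one simple way to express "these proteins look alike to the model even though their sequences do not".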