
🔬 The Bitter Lesson Comes for Proteins – Alex Rives, BioHub

So ESMC is also approaching programmable biology, but I would say in a very different way. It's approaching it from a world-modeling perspective, where the idea is basically that you have a predictive model, and you search that world model to find protein molecules that satisfy whatever design criteria you have. We've been able to use this to design many protein binders. But I think, most excitingly, we've been able to use this to actually design antibodies, scFvs.

Hello, welcome to the Latent Space AI for Science podcast. I'm RJ Honicky, CTO of MiraOmics.

And I'm Brandon. Today it's a pleasure to have Alex Rives, head of science at Biohub. Would you like to introduce yourself real quick?

Yeah, thank you for having me here. It's great to be here. I'm head of science at Biohub. I'm a computer scientist, I work on AI for biology, and a lot of my work has been on language models for biology.

By the time this podcast is released, you will have put out several exciting new models. Going over them, I couldn't help but think that you might be the most bitter-lesson person in protein biology right now. Can you give a little context about what that means for biology, and why you're so committed to and excited about this route?

Well, I'll take that. I believe in scaling laws. I've been working on this since the summer of 2018. My team, when we were at Meta FAIR, trained really the first transformer language model for protein biology. And I've always thought that there would be an emergence of biological information as you train a model to predict the next token that evolution creates.
So our team has explored that idea over a number of years, and I think we've really seen the scaling curve: as we have increased models by an order of magnitude in each generation, there's this emergence of new capabilities.

Yeah. So you say emergence of capabilities, scaling over generations. You've been working at this, as you said, for I guess eight years now or something like that. It didn't always work that way, right? There were signs that scaling might work. We'll be getting to some new results where I think you've clearly demonstrated this hypothesis in a way that hasn't happened before. But you seem to have a strong commitment to this, and I'm not sure I would have been so convinced that it would work in the same way. I mean, the protein language is not the same thing as natural language.
There are similarities, but if you start sampling a normal language transformer at high temperature, you're going to get gibberish. If you sample a protein language model at infinite temperature, you're going to get something that is a valid protein, if not an interesting one. It's a different domain, for a different reason, so I'm not necessarily sure that I would a priori assume the natural language model insight would transfer over. So what specifically about proteins did you think was special, or would make this valid here too?

Yeah, it's a really interesting question, and I think a deep question across AI more broadly right now. What's so interesting is that AI right now is such an empirical science. We don't have theory that can always guide us in these things, but we have this really strong empirical evidence of scaling. The thing that I was motivated by is this: if you think about evolution and the data that we have around proteins, we have databases that have billions of protein sequences.
And those sequences contain patterns. It had long been known, going back decades before we started working on this with language models, that there are patterns in the sequences of protein families that are there because of the constraints that evolution is operating under. So you can think about a protein sequence that folds into a three-dimensional structure in space, and you can imagine that there are two residues, or amino acids, in this sequence that might be in contact in that folded structure. Evolution isn't free to choose those independently of each other. If it makes a choice at one position, it has to make a compatible choice at the other position.
So going all the way back to the beginning of gene sequencing, when people were first able to look at the same protein in related organisms, you could start to see these patterns that reflect the fundamental underlying biology. So the thinking behind ESM was: what if you were to apply this principle across all of evolution, across the vast diversity of proteins that have been generated across all of life, and basically have a language model predict the amino acids that evolution will choose to place in proteins across all of those biological contexts? There's just this incredible amount of information in that total picture about the underlying biology of proteins.
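
The covariation idea described above can be made concrete with a toy example. The sketch below measures mutual information between two columns of a small, made-up alignment. The sequences and the function name `column_mutual_information` are illustrative assumptions, not data or code from ESM.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical aligned sequences: columns 1 and 3 change together
# (K pairs with E, R pairs with D), while column 2 varies independently.
toy_alignment = [
    "AKGEL",
    "AKSEL",
    "ARGDL",
    "ARSDL",
]

coupled = column_mutual_information(toy_alignment, 1, 3)
uncoupled = column_mutual_information(toy_alignment, 1, 2)
print(f"coupled columns: {coupled:.2f} bits, uncoupled columns: {uncoupled:.2f} bits")
```

In this toy case the coupled pair carries one bit of shared information and the independent pair carries none, which is the kind of signal a model could pick up on when positions constrain each other.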
And so that was really the idea that sparked this: as a model has to predict the next token (and actually we train these models with masked language modeling, so they're predicting tokens that are masked out of various parts of the sequence), it would have to learn something about those underlying constraints that are shaping which tokens evolution can choose.

Yeah. So maybe for a bit of history: you just released Evolutionary Scale Modeling Cambrian, right? Is that what it's called?

Yeah.

And this is maybe the fourth or fifth in a series of models, I think maybe even more if you go back before they were called ESM.

Well, they were called ESM from the start. We had various branches of the different models. This one, I would say, is a fourth-generation model. It's actually a model that we trained a little over a year ago.
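
As a rough illustration of the masked language modeling objective mentioned above, the sketch below hides a random subset of residues in a sequence and records the targets a model would be asked to recover. The 15% masking rate, the `<mask>` token string, the example sequence, and the function name are assumptions chosen for illustration; the conversation does not specify the ESM training details.

```python
import random

MASK_TOKEN = "<mask>"  # assumed placeholder token, not taken from the source


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Replace a random subset of residues with MASK_TOKEN.

    Returns the masked token list and a dict mapping each masked
    position to the original residue the model should predict.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK_TOKEN
    return tokens, targets


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
masked_tokens, targets = mask_sequence(sequence)
print(masked_tokens)
print(targets)
```

A training loop would then score the model on how well it predicts each entry in `targets` given the unmasked context on both sides.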
Now that we're at Biohub, we're open-sourcing this model fully under the MIT license for the first time, so we're really excited to do that. But the big thing that is new here is that we've built a world model of protein biology. The foundation of that is ESMC. Using the representations of ESMC, we've now built a structure prediction model, and this is the next-generation ESMFold model. And then we've also used the techniques of mechanistic interpretability and sparse coding to start to look deeply into the representation space of the language model and pull out the underlying features that the model actually uses to represent protein biology. So bringing all of this together, we're able to make predictions for protein structure, and predictions about the underlying features that proteins are made out of, which allows us to build linkages across evolution.
We're able to take this model and invert it to design proteins. And we've used this to create a comprehensive picture of protein biology. We put together all of the world's largest protein sequence databases, which amounts to 6.8 billion non-redundant proteins. And then we've resolved predicted structures for 1.1 billion of those. We've also computed features across all of those, so that we can make these linkages all across evolution and protein biology.

6.8 billion, of which you've resolved structure for 1.2, or is that 1.1?

1.1.

So what about the others?

Well, basically what we did is we took that database and clustered it at 70% sequence identity. So it's really resolving structures for everything, in the sense that for each cluster we have a cluster center.
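
The clustering step described above can be sketched as a greedy pass over the sequences, where each sequence joins the first cluster center it matches at or above the identity threshold. This is a minimal sketch under simplifying assumptions: the identity function only handles equal-length, pre-aligned sequences, the example sequences are made up, and the conversation does not say which clustering tool or algorithm was actually used.

```python
def sequence_identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_cluster(sequences, threshold=0.7):
    """Assign each sequence to the first cluster center it matches at >= threshold.

    A sequence that matches no existing center becomes a new center.
    """
    clusters = {}  # cluster center -> list of member sequences
    for seq in sequences:
        for center in clusters:
            if sequence_identity(seq, center) >= threshold:
                clusters[center].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# Hypothetical sequences: the first three differ by one residue, the last is unrelated.
toy_sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "GGSPLVWNDE"]
clusters = greedy_cluster(toy_sequences)
for center, members in clusters.items():
    print(center, len(members))
```

With this setup, structure prediction would only need to run once per cluster center, and the other members are expected to share the same fold.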
We're predicting the structure there, and then we can expect that the other proteins are going to have a similar template structure. There will be small variations, but they have the same fold.

1.2 billion or so clusters that are covering the 6.8 billion.

Yeah.

Okay, interesting. And maybe since we're talking about scaling, how do you know that this is the right number? How do you know that focusing on these 1.1 billion is the right resolution for this model?

Well, we've chosen them so that they really cover that entire space. What I can say about this database is that it's really the most comprehensive picture of protein structure and function that's been created. It's adding hundreds of millions of structures to our knowledge of the diversity of protein structure, and it's also creating this feature space that allows us to find these linkages between proteins across evolution. So we can see really interesting themes emerging across evolution.
For example, it links gene editing systems that are very far apart in sequence but share underlying functional patterns and structural homology, which the model is able to bring together to find those connections.

Now we're talking about the mechanistic interpretability part. If I understand correctly, you use sparse autoencoders, and maybe other techniques, to understand: when I activate the network using a protein, what are the patterns of outputs that I'm seeing, and how do they relate to each other? If I understand correctly, you have these sequences that are unrelated, or only partly related, based on the actual sequence, but they have similar behavior, and therefore they activate similar networks. Is that the summary of what you just said?

Yeah.
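
For readers unfamiliar with sparse autoencoders, the sketch below shows the forward pass of one in plain Python: a model activation is encoded into a larger set of non-negative features, decoded back, and scored with a reconstruction term plus an L1 sparsity penalty. The weights, dimensions, and penalty coefficient are hand-picked assumptions for illustration, and nothing here reflects how the ESMC features were actually trained.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def sae_forward(activation, enc_weights, enc_bias, dec_weights, l1_coeff=0.01):
    """One forward pass of a toy sparse autoencoder.

    enc_weights[k] is the encoder row for feature k;
    dec_weights[k] is the decoder direction written by feature k.
    Returns (features, reconstruction, loss).
    """
    features = [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]
    reconstruction = [
        sum(features[k] * dec_weights[k][d] for k in range(len(features)))
        for d in range(len(activation))
    ]
    mse = sum((r - a) ** 2 for r, a in zip(reconstruction, activation)) / len(activation)
    l1 = sum(abs(f) for f in features)
    return features, reconstruction, mse + l1_coeff * l1


# Hand-picked toy weights: 2-dimensional activation, 4 candidate features.
enc_weights = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
enc_bias = [0.0, 0.0, 0.0, 0.0]
dec_weights = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

features, reconstruction, loss = sae_forward([2.0, -3.0], enc_weights, enc_bias, dec_weights)
print(features, reconstruction, loss)
```

Only two of the four features are active for this input. In a trained model, two proteins with little sequence similarity could still be compared by checking which of these sparse features they activate in common.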