Latent Space

🔬 The Bitter Lesson Arrives in the World of Proteins: Alex Rives, Biohub

So ESMC is also approaching programmable biology, but I would say in a very different way. It approaches it from a world-modeling perspective, where the idea is that you have a predictive model and you search that world model to find protein molecules that satisfy whatever design criteria you have. We've been able to use this to design many protein binders. But I think most excitingly, we've been able to use this to design antibodies, scFvs.
Hello, welcome to the Latent Space AI for Science podcast. I'm RJ Honicky, CTO of Muromix.
And I'm Brandon. Today it's a pleasure to have Alex Rives, head of science at Biohub. Would you like to introduce yourself real quick?
Yeah, thank you for having me here. It's great to be here. I'm head of science at Biohub. I'm a computer scientist, I work on AI for biology, and a lot of my work has been on language models for biology.
By the time this podcast is released, you will have put out several exciting new models. Going over them, I couldn't help but think that you might be the most bitter-lesson person in protein biology right now. Can you give a little context about what that means for biology, and why you're so committed to and excited about this route?
Well, I'll take that. I believe in scaling laws. I've been working on this since the summer of 2018. When we were at Meta FAIR, my team trained really the first transformer language model for protein biology. I've always thought that there would be an emergence of biological information as you train a model to predict the next token that evolution creates. Our team has explored that idea over a number of years, and I think we've really seen the scaling curve: as we've increased models by an order of magnitude in each generation, there's this emergence of new capabilities.
Yeah. So you say emergence of capabilities, scaling over generations. You've been working on this, as you said, for I guess eight years now or something like that. It didn't always work that way, right? There were signs that scaling might work. We'll be getting to some new results where I think you've clearly demonstrated this hypothesis in a way that hasn't happened before. But you seem to have a strong commitment to this, and I'm not sure I would have been so convinced that it would work in the same way. I mean, the protein language is not the same thing as natural language. There are similarities, but if you start sampling a normal language transformer at high temperature, you're going to get gibberish. If you sample a protein language model at infinite temperature, you're going to get something which is a valid protein, if not an interesting one. So it's a different domain.
I'm not sure I would have assumed a priori that the natural language modeling insight would transfer over. So what specifically about proteins did you think was special, or would make this valid here too?
Yeah, it's a really interesting question, and I think a deep question across AI more broadly right now. What's so interesting is that AI right now is such an empirical science. We don't have theory that can always guide us in these things, but we have this really strong empirical evidence of scaling. The thing I was motivated by is this: if you think about evolution and the data we have around proteins, we have databases that have billions of protein sequences.
Those sequences contain patterns. It had long been known, going back decades before we started working on this with language models, that there are patterns in the sequences of protein families that arise from the constraints evolution is operating under. Think about a protein sequence that folds into a three-dimensional structure in space. You can imagine two residues, or amino acids, in this sequence that might be in contact in the folded structure. Evolution isn't free to choose those independently of each other. If it makes a choice at one position, it has to make another choice that's going to be compatible at the other position.
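As an editorial illustration of the covariation idea described here, the following is a minimal sketch that measures how strongly two alignment columns vary together, using mutual information. This is a classical statistic chosen for illustration, not the method used to train ESM; the function name `column_mutual_information` and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length sequences. A high value means the
    residues at the two positions are not chosen independently.
    """
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy, hand-made alignment (hypothetical data, not real proteins).
# Columns 1 and 3 always change together (K with E, R with D),
# while column 2 varies independently of column 1.
toy_alignment = [
    "AKLEG",
    "AKVEG",
    "ARLDG",
    "ARVDG",
]

mi_coupled = column_mutual_information(toy_alignment, 1, 3)
mi_independent = column_mutual_information(toy_alignment, 1, 2)
print(f"coupled columns: {mi_coupled:.2f} bits")
print(f"independent columns: {mi_independent:.2f} bits")
```

In this toy case the coupled pair carries one full bit of shared information and the independent pair carries none, which is the kind of signal that would hint at two positions being in contact.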
So going all the way back to the beginning of gene sequencing, when people were first able to look at the same protein in different related organisms, you could start to see these patterns that reflect the fundamental underlying biology. The thinking behind ESM was: okay, what if you were to apply this principle across all of evolution, across the vast diversity of proteins that have been generated across all of life, and basically have a language model predict the amino acids that evolution will choose to place in proteins across all of those biological contexts? There's just an incredible amount of information in that total picture about the underlying biology of proteins. That was really the idea that sparked this, as a model has to predict the next token. And actually, we train these models with masked language modeling.
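As an editorial illustration of masked language modeling on a protein sequence, here is a minimal sketch of the masking step and the loss computed at masked positions. The 15% masking rate, the `<mask>` token, the example sequence, and the `uniform_predictor` stand-in are all assumptions for illustration; the conversation does not give the actual ESM training details.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "<mask>"  # hypothetical mask token


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of positions; return masked tokens and the answers."""
    rng = random.Random(seed)
    tokens = list(sequence)
    n_mask = max(1, round(mask_fraction * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK
    return tokens, targets


def uniform_predictor(masked_tokens, position):
    """Stand-in for a trained model: every amino acid is equally likely."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def masked_lm_loss(predictor, masked_tokens, targets):
    """Average negative log-likelihood of the true residue at masked positions."""
    total = 0.0
    for pos, true_aa in targets.items():
        probs = predictor(masked_tokens, pos)
        total += -math.log(probs[true_aa])
    return total / len(targets)


# Made-up example sequence, 22 residues long.
masked_tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
loss = masked_lm_loss(uniform_predictor, masked_tokens, targets)
print(masked_tokens)
print(f"loss of the uniform baseline: {loss:.3f} nats")
```

A model that has learned nothing scores ln(20) nats per masked residue. Training pushes the loss below that baseline, which is only possible if the model picks up on the constraints that shape which residues appear where.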
So they're predicting tokens that are masked out of various parts of the sequence, and to do that the model has to learn something about the underlying constraints that shape which tokens evolution can choose.
Yeah. So maybe for a bit of history: you just released Evolutionary Scale Modeling Cambrian, right? Is that what it's called?
Yeah.
And this is maybe the fourth or fifth in a series of models. I think maybe even more if you go back before they were called ESM.
Well, they were called ESM from the start.
Yeah.
We had various branches of the different models. I would say this one is a fourth-generation model. It's actually a model that we trained a little over a year ago. Now that we're at Biohub, we're open-sourcing this model fully, under the MIT license, for the first time. We're really excited to do that.
But the big thing that is new here is that we've really built a world model of protein biology. The foundation of that is ESMC. Using the representations of ESMC, we've now built a structure prediction model, and this is the next-generation ESMFold model. We've also used the techniques of mechanistic interpretability and sparse coding to start looking deeply into the representation space of the language model and pull out the underlying features that the model actually uses to represent protein biology. Bringing all of this together, we're able to make predictions for protein structure, and predictions about the underlying features that proteins are made of, which allows us to build linkages across evolution. We're able to take this model and invert it to design proteins. And we've used this to create a comprehensive picture of protein biology. We put together all of the world's largest protein sequence databases.
That amounts to 6.8 billion non-redundant proteins. We've resolved predicted structures for 1.1 billion of those, and we've also computed features across all of them, so that we can make these linkages all across evolution and protein biology.
6.8 billion, of which you've resolved structures for 1.2, or is that 1.1?
1.1.
So what about the others?
Well, basically what we did is take that database and cluster it at 70% sequence identity. So it's really resolving structures for everything, in the sense that for each cluster we have a cluster center. We predict the structure there, and then we can expect that the other proteins are going to have a similar template structure. There will be small variations, but they have the same fold.
So 1.2 billion or so clusters that are covering the 6.8 billion.
Yeah.
Okay, interesting. And since we're talking about scaling, how do you know that this is the right number?
Like, how do you know that focusing on these 1.1 billion is the right resolution for this model?
Well, we've chosen them so that they really cover that entire space. What I can say about this database is that it's really the most comprehensive picture of protein structure and function that's been created. It adds hundreds of millions of structures to our knowledge of the diversity of protein structure, and it also creates this feature space that allows us to find linkages between proteins across evolution. So we can see really interesting themes emerging across evolution.
For example, linking gene editing systems that are very far apart in sequence but share some underlying functional patterns and structural homology, which the model is able to bring together to find those connections.
Now we're talking about the mechanistic interpretability part. If I understand correctly, you use sparse autoencoders, and maybe other techniques, to understand: when I activate the network using a protein, what are the patterns of outputs that I'm seeing, and how do they relate to each other? And if I understand correctly, you have these sequences that are unrelated, or only partly related, based on the actual sequence, but in terms of behavior they are similar, and therefore they activate similar networks. Is that the summary of what you just said?
Yeah.
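As an editorial illustration of the sparse autoencoder idea raised in this exchange, here is a minimal sketch of a forward pass and loss. The conversation does not describe the architecture or training setup actually used, so this is a generic sparse autoencoder with hand-set toy weights; every name and number below is hypothetical. It shows how two different embedding vectors can light up the same small set of features.

```python
def relu(vector):
    """Element-wise ReLU: negative values become 0."""
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into non-negative features, then reconstruct x from them."""
    features = relu([h + b for h, b in zip(matvec(w_enc, x), b_enc)])
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages few active features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


def active_features(features):
    """Indices of the features that fired."""
    return {i for i, f in enumerate(features) if f > 0.0}


# Toy weights: a 2-dimensional "embedding" expanded into 4 features.
W_ENC = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
B_DEC = [0.0, 0.0]

# Two made-up embeddings that differ numerically.
embedding_a = [2.0, -1.0]
embedding_b = [0.5, -3.0]

features_a, recon_a = sae_forward(embedding_a, W_ENC, B_ENC, W_DEC, B_DEC)
features_b, recon_b = sae_forward(embedding_b, W_ENC, B_ENC, W_DEC, B_DEC)
loss_a = sae_loss(embedding_a, features_a, recon_a)

print(features_a, active_features(features_a))
print(features_b, active_features(features_b))
```

In a real setting the weights would be learned from the language model's activations rather than set by hand, and the shared active features would be what links proteins that look unrelated at the sequence level.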