Terug naar podcasts Latent Space

⚡️ Google's open AI-strategie — Omar Sanseviero, Google DeepMind

We got so much Gemma 4, Gemma 3 1, Gemma scope med Gemma. We hebben zo veel gehad: Gemma 4, Gemma 3 1, Gemma Scope. Give us the TLDR. Geef ons de TLDR. Yeah, so yeah, Gemma 4 is just out. Ja, Gemma 4 is net uit. This is the most capable open model we've released so far. Dit is het meest capabele open model dat we tot nu toe hebben uitgebracht. We really tried to compact as much intelligence per parameter as we could. We hebben echt geprobeerd zo veel mogelijk intelligentie per parameter te stoppen. Bring all of these multimodal capabilities. Alle multimodale mogelijkheden erin brengen. So yeah, that's Gemma 4. Ja, dat is Gemma 4. So one interesting thing, you have this thing with effective parameters, not active parameters. Interessant: jullie hebben het over effectieve parameters, niet actieve parameters. Can you explain what it is? Kun je uitleggen wat dat is? Yeah, so pretty much in the traditional transformer architecture you have like this big embedding layer, right? Ja, in de traditionele transformer-architectuur heb je zo'n grote embedding-laag, toch? And this new architecture is is more of a small change in the transformer architecture, in the transformer block. En deze nieuwe architectuur is meer een kleine aanpassing in het transformer-blok. Pretty much we add a per layer embedding. We voegen in principe een per-laag embedding toe. So at every layer we add an embedding table. Op elke laag voegen we een embedding-tabel toe. What is exciting is that you don't need to do like the full matrix multiplication. Het mooie is dat je niet de volledige matrixvermenigvuldiging hoeft te doen. This is pretty much a lookup table. Dit is in feite een opzoektabel. So the Gemma 4 model is a E2B. Het Gemma 4-model is een E2B. That means that it effectively has 2 billion parameters loaded into the GPU. Dat betekent dat er effectief 2 miljard parameters in de GPU worden geladen. It actually has almost 5 billion parameters, but those 3 billion parameters can be in the CPU, they can be in the disk, which means that you can do inference extremely quickly. Het heeft eigenlijk bijna 5 miljard parameters, maar die 3 miljard kunnen op de CPU staan, op de schijf, wat betekent dat je inferentie extreem snel kunt doen. This is just a lookup table. Dit is gewoon een opzoektabel. And what's the con? En wat is het nadeel? Why don't we Waarom doen we dit niet Why don't we always do this? Waarom doen we dit niet altijd? Can it scale? Schaalt het? Is it open research? Is het open onderzoek? Like you know, it seems very Het lijkt zo Okay, if I can just offload half the parameters to CPUs. Oké, als ik gewoon de helft van de parameters naar de CPU kan offloaden. Yeah, so pretty much here we did lots of quality experimentation and this is really optimized and designed for like on device. Ja, we hebben hier veel kwaliteitsexperimenten gedaan en dit is echt geoptimaliseerd en ontworpen voor on-device gebruik. And when I say on device I mean like running in a phone, Android, Raspberry Pi, and so on, right? En met on-device bedoel ik draaien op een telefoon, Android, Raspberry Pi, enzovoort. When you go larger you usually want to compact more Als je groter gaat, wil je meestal meer comprimeren You want to have more like dense architectures or MOEs. Je wil meer dense architecturen of MoE's. So this this research Dus dit onderzoek This research decisions were very helpful for these small small use cases. Deze onderzoeksbeslissingen waren erg nuttig voor deze kleine use cases. Yeah, something I learned from the run that you organized this morning. Ja, iets wat ik leerde van de hardloopsessie die jij vanochtend organiseerde. For for our listeners, I think it's the first ever like official run club at AIE 6:30 a.m. Voor onze luisteraars: ik denk dat het de eerste officiële hardloopclub ooit was op AIE, om 6:30 uur 's ochtends. Very rough, but at least I woke up for it. Heel zwaar, maar ik ben er tenminste voor opgestaan. I met Cormac and he was telling me that I apparently in China the super apps are shipping models in the app bundle. Ik ontmoette Cormac en hij vertelde me dat in China de super-apps modellen in het app-pakket meesturen. For inference and just like use among all their super app. Voor inferentie en gebruik binnen hun super-app. Assistants. Assistenten. Yeah. Ja. And I don't know is is is that like a target use case for you guys? En is dat een doel-usecase voor jullie? Yeah, so actually if you install like if you buy a pixel phone or a high end Samsung, they come from with a Gemini Nano and Gemini Nano is baked into the operating system and Gemini Nano is really built on top of Gemma. Ja, als je een Pixel-telefoon of een high-end Samsung koopt, worden die geleverd met Gemini Nano en Gemini Nano zit ingebakken in het besturingssysteem en is echt gebouwd bovenop Gemma. So last year we released Gemma 3N which was this architecture really designed for phone use cases and they use a Gemma 3N with some additional training, some additional adaptations to make the model good for like traditional on device use cases, right? Vorig jaar hebben we Gemma 3N uitgebracht, een architectuur echt ontworpen voor phone-usecases, en ze gebruiken Gemma 3N met extra training en aanpassingen om het model geschikt te maken voor traditioneel on-device gebruik. So pretty much when you buy like these high end phones, you can already use a Gemini out of the box. Als je zo'n high-end telefoon koopt, kun je dus al meteen Gemini gebruiken. Yeah, we actually covered the 3N paper in our paper club and this like idea of like sort of parameter offloading or like download on demand is like very cool. Ja, we hebben het 3N-paper besproken in onze paper club en dat idee van parameter-offloading of download-on-demand is erg cool. Is it exactly the same in the Gemma 4 stuff? Is het precies hetzelfde bij Gemma 4? Yep. Ja. Okay. Oké. For the smaller models. Voor de kleinere modellen. Yeah. Ja. Yeah. Ja. Yeah. Ja. And does it does it scale? En schaalt het? Is there a potential Is er een potentieel So for reference, Gemma 4 is a 29B and a 31B ones and only one's dense, but have you scaled it? Ter referentie: Gemma 4 heeft een 29B en een 31B, waarvan er slechts één dense is, maar hebben jullie het opgeschaald? Have you pushed it up? Hebben jullie het verder doorgezet? Is it Is het We are doing lots of experiments. We doen veel experimenten. Experiments. Experimenten. Yeah, yeah. Ja, ja. Stay tuned. Houd het in de gaten. Yeah. Ja. What goes into shipping a mean line model like this? Wat komt er kijken bij het uitbrengen van een topmodel zoals dit? Like Zoals Yeah. Ja. What what's the behind the scenes? Wat speelt er achter de schermen? It's complex. Het is complex. The Gemma team is actually relatively small. Het Gemma-team is eigenlijk relatief klein. We have like two or three PMs, we have one marketing person and then there is our like engineers and researchers working on shipping this. We hebben zo twee of drie PM's, één marketingpersoon en dan de engineers en onderzoekers die aan dit alles werken. Of course there's like the full training part, we how do we do the post training, distillation, post training techniques and so on. Er is natuurlijk het volledige trainingsgedeelte: hoe doen we de post-training, distillatie, post-trainingtechnieken, enzovoort. What is quite exciting is that once we have the model, then we collaborate with a bunch of open source partners, right? Het mooie is dat we, zodra we het model hebben, samenwerken met een heleboel open-source partners. So for example, we work with a Lama CPP, Olama, MLX, Hugging Face, vLLM, Nvidia, AMD. We werken bijvoorbeeld met Llama.cpp, Ollama, MLX, Hugging Face, vLLM, Nvidia, AMD. So we have almost 50 external partners for every well for the Gemma for lunch, which has been the most complex launch. We hadden bijna 50 externe partners voor de Gemma 4-lancering, wat de meest complexe lancering tot nu toe was. And also internally, we collaborate with a bunch of different teams. En intern werken we ook samen met een heleboel verschillende teams. So, think of Google Cloud, Vertex, Vertex models models as a service, ADK, uh and then Android as well, right? Denk aan Google Cloud, Vertex, Vertex models as a service, ADK, en ook Android. So, we work, for example, with Android team and uh with the launch of Gemma 4, we released an integration with Android Studio. We werken bijvoorbeeld met het Android-team, en bij de lancering van Gemma 4 brachten we een integratie met Android Studio uit. So, in Android Studio, there is this agent mode where you can have a a model helping you write code and do things within Android Studio. In Android Studio is er een agentmodus waarbij een model je helpt code te schrijven en dingen te doen binnen Android Studio. And they ship this integration with offline models using llama.cpp or vLLM or any open AI compatible endpoint. Ze leveren deze integratie met offline modellen via llama.cpp, vLLM of elk OpenAI-compatibel endpoint. So, now you can use Gemma 4 to also write code Android applications in Android Studio. Je kunt Gemma 4 nu dus ook gebruiken om Android-applicaties te schrijven in Android Studio. What's the difference? Wat is het verschil? When would someone want to do that versus just using Gemini? Wanneer zou iemand dat willen doen in plaats van gewoon Gemini te gebruiken? Outside of course Outside of the obvious, you're offline or you want the privacy. Buiten het voor de hand liggende: je bent offline of je wilt privacy. planes a lot or something. Vliegt veel of zo. I did. Dat deed ik. Okay, I will say, on my long 10-hour flight to London, I did use Gemini as Oké, ik moet zeggen: tijdens mijn lange vlucht van 10 uur naar Londen heb ik Gemini gebruikt als Yeah, I I was on Gemma 4 though. Ja, ik zat op Gemma 4 trouwens. Sorry, Gemma Gemma. Sorry, Gemma, Gemma. Yeah, yeah, it's mostly offline use cases. Ja, het zijn vooral offline use cases. Right or if you Toch, of als je Yeah. Ja. Offline or privacy, like if you want to have all of your development set up locally and you don't want to send any code to to any API, you would use that. Offline of privacy: als je je complete developmentomgeving lokaal wilt draaien en geen code naar een API wilt sturen, dan gebruik je dat. Do you see a future where, you know, small models get good enough? Zie jij een toekomst waarbij kleine modellen goed genoeg worden? Like, does it cannibalize? Kannibaliseert het? It's an interesting position. Het is een interessante positie. Like, you have big Gemini, you have Gemma, both get exponentially better over time. Je hebt grote Gemini, je hebt Gemma, beide worden exponentieel beter met de tijd. Like, current Gemma is much better than what we had closed source a few years ago, right? De huidige Gemma is al veel beter dan wat we een paar jaar geleden hadden als closed-source, toch? Yeah, for me, it's quite exciting. Ja, voor mij is dat best opwindend. I mean, if you look at Gemma, you compare to how we were 1 year ago, I would say Gemma uh 4 is matching state-of-the-art from 1 1 and 1/2 years ago for most things. Als je Gemma vergelijkt met hoe we een jaar geleden waren, zou ik zeggen dat Gemma 4 voor de meeste dingen de state-of-the-art van anderhalf jaar geleden evenaart. With local models or models that you can run in your own hardware, you can get capabilities, so you can get agentic agentic capabilities, function calling, system instructions, like conversational and that kind of stuff. Met lokale modellen of modellen die je op je eigen hardware draait, krijg je capabiliteiten als agentische mogelijkheden, function calling, systeeminstructies, conversatie, dat soort dingen. Knowledge is much trickier, so for knowledge, you do need a larger model, right? Kennis is veel lastiger, dus daarvoor heb je een groter model nodig. That's why if you compare Gemini to Gemma, Gemini Dat is waarom als je Gemini vergelijkt met Gemma, Gemini