Sequoia Capital

How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance RL

You need all the infrastructure to run these environments, which have to mimic as closely as possible what a user's computer would look like. And "as closely as possible" is very important, because sometimes the model can actually figure out when it's being run in a fake environment rather than a real one, and it behaves differently during RL than in production.

Are you saying it's conscious that it's in a fake environment and it starts behaving differently?

Yes.

Yes.

Interesting.

It's like, "Oh, I'm in a fake environment. I've learned a few tricks to get a better reward in this environment, so let me try them out." Models love to cheat. RL is really good at encouraging cheating.

I'm delighted to welcome Federico from Cursor and Dima from Fireworks to the podcast today. Federico, you are the research lead on Composer 2, Cursor's new agentic coding model. And Dima, you spent however many of the last few months moonlighting at Cursor to support all of the infrastructure required to make this gargantuan training task happen.
And so I'm excited to talk to both of you today about how the training of Composer 2 came together, what hard problems you solved together, and what you think it means for the future of AI and foundation model companies.

Exciting.

Yeah, exciting. Thank you for having us.

Thanks for joining. Okay, let's dive right in. For those who haven't been following as closely, Cursor recently announced Composer 2, an agentic coding model meant for long-horizon coding tasks. Federico, up until now Cursor was mostly enabling other people's coding agents. What was the impetus for Cursor to lean so heavily into Composer 2, and how existential is it for you to become not just an application company but also a foundation model company yourselves?

The reason we started looking into training our own models is that you can think of the model as a kind of storage drive.
It has a certain number of bits that it can store in its weights. And the idea is very simple: we care about only one task. We don't even necessarily care about coding or programming in general. We care about software engineering inside Cursor, and inside Cursor only. So what if we were to allocate all of the bits of information that can be stored in the model weights to that one particular task? Also, as people may have noticed, Composer is an order of magnitude less expensive than Opus and other coding models, because we can simply specialize all of the model weights to that particular task. And so we can serve a smaller model, or something of that sort.

So it's about making sure every single bit of weight or information we have is dedicated to the specific problem at hand.

Exactly.

Got it. That seems like an almost generalizable problem. Dima, I'm curious about your perspective.
Do you think every application company should be looking at Cursor as a harbinger of what's to come? Should they all be looking to do the same thing?

Yeah, absolutely. We actually see it as a general pattern in how applications evolve. You start by prototyping: you might use an off-the-shelf model to get something running, maybe do some prompt engineering, figure out how your harness works. But the most leveraged attribute of your application is the actual usage data from your users and the specifics of how the application works: aspects of your harness, which tools you provide, the bits that really matter for your application. You can capture a little of that through prompting, but really the right way to do it is to craft your model to act in your environment.

Yeah, absolutely. There are certain tools the agent calls whose exact behavior is very hard to describe succinctly to the model.
And with just post-training, we can bake in the optimal way to use those tools. We do serve a prompt to Composer, but I think that, the way we are training it, it would work even without a prompt and would know what to do, because throughout training we are intrinsically pushing the model in the right direction for how it should act.

Basically, there's an upper bound on how far you can get with prompt engineering. If you want to craft really great AI products, you have to go through fine-tuning and influence model behavior. That's one reason. Reason number two is what Federico mentioned, the cost trade-off or speed trade-off. The way we view it at Fireworks is that when you're trying to optimize, you have this three-dimensional trade-off between quality, speed, and cost. You can go quite far, and we do that with all of our customers initially.
We can go quite far just by optimizing infrastructure, but when you start getting into model training, you can really push this trade-off much further and get a better model at a fraction of the cost, running much faster. And Composer is a great example of that.

Can I push on this a little bit? I want to ask you whether this approach is bitter-lesson-pilled. We were actually all talking about TabNine on the walk in. I'm remembering that before the LLM era there were these small, specialized coding models. And one of the things that I think surprised a lot of people was that as you scaled up, just training on the internet and a bunch of English text and other languages, the models themselves got inherently better at coding as well. So at least the trend line I've seen so far is that bigger models perform better on everything, including coding. Does what you're saying go against the grain of the bitter lesson?

I think not, but one thing to point out is that the big models trained by the labs train on a lot of code as well. Code is one of the main tasks the labs are interested in pushing, so they don't just generalize to it. They're a bit specialized as well. In our case, if we believe in the bitter lesson, we are just pushing very hard on the data dimension, and we know that models inherently have finite capacity. So if we want to saturate all that capacity, we need to scale data. And in order to ingest more data, we need to free up the weights from the distractions the model may have.

Mhm, okay. Got it. Super interesting. Okay, let's dig into the training of Composer 2. You launched a couple of weeks ago and immediately grabbed attention: strong benchmark numbers, much lower cost to run inference. What's the short version of how Composer 2 works, and what did you do to make it so performant?

We started from a very strong base, which is Kimi 2.5. It's roughly a 1-trillion-parameter MoE with 30B active, so very, very sparse, actually. We took stock and realized there are two axes. Composer 1 was mainly pushing on just one of them, reinforcement learning, but Composer 2 pushes on two: one is continual pre-training, and the other is reinforcement learning. What made Composer 2 very good is pushing in both of these directions. We started the training run by doing lots of mid-training on code tokens, almost at pre-training scale, actually. Then, coming out of that mid-training run, we took the checkpoints and did very large-scale RL on lots and lots of tasks.

Okay, and then the premise here would be that because Cursor sits in the middle of so many interesting coding tokens, you pretty uniquely have access to the data to train at almost pre-training scale.

Yeah.

Why not pre-train your own model, then?

We just think about our approach top-down instead of bottom-up: how do we get a model that's useful to users in the least time possible? If we were to start from the bottom, figure out how to do pre-training, then scale up to mid-training, and then, once we'd figured out mid-training, do reinforcement learning, it would take a very long time to get a model out to our users. By doing it the other way around, we were able to give a useful model to our users in very little time. So hopefully the next Composer versions are going to be our own model instead of being based on an open-source base.

And what is the model roughly learning in the mid-training step? And what is it learning in the post-training step?
Yeah, so in mid-training, it's sort of just kind of learning about libraries of code and learning about specific code patterns that are very common, like just world knowledge as well. Ja, in mid-training leert het model globaal over codebibliotheekbestanden en specifieke codepatronen die veel voorkomen, plus algemene wereldkennis. There is like web data there as well. Er zit ook webdata in. And this is sort of just creating a wider distribution that then reinforcement learning can sharpen on. Dit creëert een bredere distributie waarop reinforcement learning vervolgens kan aanscherpen. And so, during reinforcement learning, you know, the model gets to play directly with the cursor harness. Tijdens reinforcement learning mag het model rechtstreeks spelen met de Cursor harness. And so, it gets to learn about the world the model is going to live in for the rest of its life, right? Zo leert het over de wereld waarin het model de rest van zijn leven zal leven, toch? In in some way. Op een bepaalde manier. And and so, then during reinforcement learning, that's where it learns how to call tools properly, how to navigate its environment, how to write correct code. En tijdens reinforcement learning leert het dan hoe het tools correct aanroept, hoe het zijn omgeving navigeert, hoe het correcte code schrijft. Because during mid-training, it it learns how to write code. Want tijdens mid-training leert het hoe code te schrijven.