How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance RL

You need all the infrastructure to run these environments, which have to mimic as closely as possible what a user's computer would look like. And "as closely as possible" is very important, because sometimes the model can actually figure out when it's being run in a fake environment rather than a real one, and it behaves differently during RL than in production.

Are you saying it's conscious that it's in a fake environment and it starts behaving differently?

Yes.

Yes.

Interesting.

It's like, "Oh, I'm in a fake environment. I've learned a few tricks to get a better reward in this environment, so let me try them out."

Models love to cheat. RL is really good at encouraging cheating.

I'm delighted to welcome Federico from Cursor and Dima from Fireworks to the podcast today. Federico, you are the research lead on Composer 2 at Cursor, Cursor's new agentic coding model. And Dima, you spent how many of the last few months moonlighting at Cursor in order to support all of the infrastructure required to make this gargantuan training task happen.
And so, I'm excited to talk to both of you today about how the training of Composer 2 came together, what hard problems you solved together, and what you think it means for the future of AI and foundation model companies.

Exciting.

Yeah, exciting. Thank you for having us.

Thanks for joining. Okay, let's dive right in. For those who haven't been following as closely, Cursor recently announced Composer 2, an agentic coding model meant for long-horizon coding tasks. Federico, up till now, Cursor was mostly enabling other people's coding agents. What was the impetus for Cursor to lean so heavily into Composer 2, and how existential is it for you to become not just an application company but also a foundation model company yourselves?

The reason we started looking into training our own models is that you can think of the model as something like a storage drive. It has a certain amount of bits that it can store in its weights.
And the idea is very simple: we care about only one task. We don't even necessarily care about coding or programming in general. We care about software engineering inside Cursor, and inside Cursor only. So what if we were to allocate all of the bits of information that can be stored in the model weights to that one particular task? Also, as people may have noticed, Composer is an order of magnitude less expensive than Opus and other coding models, because we can simply specialize all of the model weights to that particular task. And so we can serve a smaller model, or something of that sort.

So it's about making sure every single bit of weight or information you have is dedicated to the specific problem at hand.

Exactly.

Got it. That seems like an almost generalizable problem. Dima, I'm curious about your perspective. Do you think every application company should be looking at Cursor as a harbinger of what's to come?
Should they all be looking to do the same thing?

Yeah, absolutely. We actually see it generally as a pattern in the evolution of applications. You start prototyping, maybe using an off-the-shelf model to get something running, maybe doing some prompt engineering, and figuring out how your harness works. But the most leveraged attribute of your application is the actual usage, the user data, or the specific aspects of how the application works: maybe some aspects of your harness, which tools you provide, the bits that really matter for your application. You can capture a little of that through prompting, but the right way to do it is to craft your model to act in your environment.

Yeah, absolutely. There are certain tools the agent calls where it's very hard to succinctly describe the exact behavior of the tool to the model. With post-training, we can bake in the optimal way to use those tools.
With Composer, we do serve a prompt, but I think, given the way we are training it, it would work even without a prompt and it would know what to do, just because throughout training we are intrinsically pushing the model in the right direction of how it should act.

Basically, there's an upper bound on how far you can get with prompt engineering. If you want to craft really great AI products, you have to go through fine-tuning and influence model behavior. That's one reason. Reason number two is what Federico mentioned: the cost-versus-performance trade-off. The way we view it at Fireworks is that when you're trying to optimize, you have a three-dimensional trade-off between quality, speed, and cost. You can go quite far on that, and initially we do it with all of our customers. We can go quite far just by optimizing infrastructure, but when you get into model training, you can push this trade-off much further and get a better model at a fraction of the cost, running much faster.
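One way to picture the three-dimensional trade-off described above is as a Pareto front over quality, speed, and cost. The sketch below is only an illustration: the `Deployment` class, the option names, and every number in it are made up for the example and do not come from the conversation or from any real benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """A hypothetical serving option scored on the three axes."""
    name: str
    quality: float  # higher is better (e.g. task success rate)
    speed: float    # higher is better (e.g. tokens per second)
    cost: float     # lower is better (e.g. dollars per million tokens)


def dominates(a: Deployment, b: Deployment) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least_as_good = (
        a.quality >= b.quality and a.speed >= b.speed and a.cost <= b.cost
    )
    strictly_better = (
        a.quality > b.quality or a.speed > b.speed or a.cost < b.cost
    )
    return at_least_as_good and strictly_better


def pareto_front(options: list[Deployment]) -> list[Deployment]:
    """Keep only the options that no other option dominates."""
    return [o for o in options if not any(dominates(p, o) for p in options)]


# Illustrative, invented numbers only.
options = [
    Deployment("off-the-shelf model", quality=0.70, speed=50.0, cost=10.0),
    Deployment("off-the-shelf + infra tuning", quality=0.70, speed=90.0, cost=6.0),
    Deployment("fine-tuned smaller model", quality=0.75, speed=150.0, cost=2.0),
    Deployment("large general model", quality=0.80, speed=40.0, cost=15.0),
]

for option in pareto_front(options):
    print(option.name)
```

With these invented numbers, infrastructure tuning improves speed and cost at fixed quality, while the fine-tuned option moves all three axes at once, which is the sense in which training "pushes the trade-off further".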
And, you know, Composer is a great example of...

Can I push on this a little bit? I want to ask whether this approach is bitter-lesson-pilled. We were actually all talking about TabNine on the walk in. I remember that before the LLM era there were these small, specialized coding models. One of the things that I think surprised a lot of people was that as you scaled up, just training on the internet, on a bunch of English text and other languages, the models themselves got inherently better at coding as well. So at least the trend line I've seen so far is that bigger models perform better on everything, including coding. Does what you're saying go against the grain of the bitter lesson?

I think no, but one thing to point out is that the big models trained by the labs train on a lot of code as well.
Code is one of the main tasks the labs are interested in pushing, so the models don't just generalize to it. They're a bit specialized as well. In our case, if we believe in the bitter lesson, we are just pushing very hard on the data dimension, and we know that models inherently have finite capacity. So if we want to saturate all that capacity, we need to scale data. And in order to ingest more data, we need to free up the weights from distractions the model may have.

Mhm, okay. Got it. Super interesting. Okay, let's dig into the training of Composer 2. You launched a couple of weeks ago and immediately grabbed attention: strong benchmark numbers, much lower cost to run inference. What's the short version of how Composer 2 works, and what did you do to make it so performant?

We started from a very strong base, which is Kimi 2.5. It's a roughly 1-trillion-parameter MoE with 30B active parameters, so it's actually very, very sparse.
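To make "very sparse" concrete, the quoted figures (about 1 trillion total parameters, about 30B active per token) can be turned into a quick back-of-the-envelope calculation. The function names below are hypothetical, and the FLOPs estimate uses the common rule of thumb of roughly two FLOPs per active parameter per token for a forward pass, which ignores attention and other overheads.

```python
TOTAL_PARAMS = 1_000_000_000_000  # ~1 trillion, as quoted in the conversation
ACTIVE_PARAMS = 30_000_000_000    # ~30B active per token, as quoted


def active_fraction(total_params: int, active_params: int) -> float:
    """Share of the weights that participate in computing a single token."""
    return active_params / total_params


def forward_flops_per_token(active_params: int) -> int:
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2 * active_params


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"active fraction: {fraction:.1%}")
print(f"approx. forward FLOPs per token: {forward_flops_per_token(ACTIVE_PARAMS):.2e}")
print(f"total-to-active ratio: {TOTAL_PARAMS / ACTIVE_PARAMS:.1f}x")
```

Under these figures only about 3% of the weights are used for any given token, so per-token compute tracks the 30B active parameters while all of the roughly 1 trillion parameters still have to be stored.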
We looked at the stack and realized there are two axes. Composer 1 was mainly pushing on just one of these axes, reinforcement learning, but Composer 2 pushes on two: one is continual pre-training, and the other is reinforcement learning. What made Composer 2 very good is pushing in both of these directions. We started off the training run by doing lots of mid-training on code tokens, at almost pre-training scale, actually. Then, coming out of that mid-training run, we took the checkpoints and did very large-scale RL on lots and lots of tasks.

Okay, and the premise here would be that because Cursor sits in the middle of so many interesting coding tokens, you pretty uniquely have access to the data to train at almost pre-training scale.

Yeah.

Why not pre-train your own model, then?

We just think about our approach top-down instead of bottom-up.
So, how do we get a model that's useful to users in the least time possible? If we were to start from the bottom, figure out how to do pre-training, then scale up to mid-training, and then, once we'd figured out mid-training, do reinforcement learning, it would take a very long time to get a model out to our users. By doing it the other way around, we were able to give our users a useful model in very little time. So hopefully the next Composer versions are going to be our own model instead of being based on an open-source base.

And what is the model roughly learning in the mid-training step? And what is it learning in the post-training step?

Yeah, so in mid-training it's just learning about code libraries and about specific code patterns that are very common, and world knowledge as well. There is web data in there too. This is creating a wider distribution that reinforcement learning can then sharpen.
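The idea of a wide distribution that reinforcement learning then sharpens can be shown with a toy example. This is a minimal sketch and not Cursor's training method: it assumes a softmax policy over six hypothetical candidate behaviors, an invented 0/1 reward, and it applies the exact expected policy gradient instead of sampling rollouts, so the run is deterministic.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; lower means a sharper distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def policy_gradient_step(
    logits: list[float], rewards: list[float], lr: float
) -> list[float]:
    """One exact gradient-ascent step on expected reward for a softmax policy.

    For a softmax policy, d E[r] / d logit_i = p_i * (r_i - E[r]).
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [
        x + lr * p * (r - expected_reward)
        for x, p, r in zip(logits, probs, rewards)
    ]


# Invented setup: six candidate behaviors, two of which earn reward.
rewards = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
logits = [0.0] * len(rewards)  # "wide" starting point: uniform over candidates

before = softmax(logits)
for _ in range(200):
    logits = policy_gradient_step(logits, rewards, lr=1.0)
after = softmax(logits)

mass_before = sum(p for p, r in zip(before, rewards) if r > 0)
mass_after = sum(p for p, r in zip(after, rewards) if r > 0)
print(f"probability on rewarded behaviors: {mass_before:.3f} -> {mass_after:.3f}")
print(f"entropy (nats): {entropy(before):.3f} -> {entropy(after):.3f}")
```

In this toy setting the update can only move probability toward behaviors the starting distribution already covers, which is one way to read why a broad mid-training stage comes first.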
And then, during reinforcement learning, the model gets to play directly with the Cursor harness. So it gets to learn about the world it's going to live in for the rest of its life, in some way. That's where it learns how to call tools properly, how to navigate its environment, and how to write correct code. Because during mid-training, it learns how to write code.