Podcasts: Hear the voice. See the shape of the thought.

#generative-agents #simulation #ai-research

Sequoia Capital · about 1 month ago

Simulating Humans at Scale: Simile's Joon Sung Park

Joon Sung Park, founder and CEO of Simile and creator of Stanford's Smallville generative-agents study, walks Sonya Huang through the arc from a 25-agent game town that spontaneously threw a Valentine's party to a company that simulated 1,000 Americans and predicted their answers 85% as accurately as those people reproduced their own answers. His core argument: today's frontier labs are building the "CPU of intelligence" (rational machines superhuman at problems with right answers), while simulating real human society needs the opposite, a model that encodes people's irrational values, preferences, and taste. CVS uses it for concept testing; some customers simulate their own earnings calls; and Joon's longer bet is a "CERN of human society" that could one day model bank runs, climate cooperation, or the early signals of a collapsing democracy.

## [00:00] Inside Smallville: 25 agents throw a Valentine's party

The conversation opens on Joon's conviction that science fiction's advanced societies always rest on two pillars, "some version of AGI and some version of simulations that really help guide the society," before Sonya takes him back to Smallville, the April 2023 Stanford project that made his name. The setup was 25 generative agents, each given a persona and equipped with memory, planning, and reflection, then left to live in a small game town: wake up, do routines, go to work, form relationships. What surprised the team was emergent coordination. Isabella, a café owner, decided to throw a Valentine's Day party, spent the day before gathering materials and inviting customers, and on the day itself the party actually formed.

> *some of the agents did not explicitly get invited, but we had one agent who got the invite, Claus, who decided to ask his crush out on a date*

## [03:34] From a foundation-models paper to simulating a subreddit

Joon traces the origin back to 2020, the year GPT-3 was about to land.
As a Stanford researcher he co-wrote the "On the Opportunities and Risks of Foundation Models" paper, and the part that gripped him was not that the models could classify or generate (interaction researchers had done that for years) but that they could encode human behavior. Coming out of the social-computing tradition, he saw a long-standing hole: there was no way to test how millions of people would behave on a platform short of shipping it and watching what happens, sometimes at real cost. That led to the 2022 Social Simulacra paper, the precursor to generative agents, which populated a simulated subreddit with thousands of personas to let a designer see community dynamics before launch.

> *The only way we test it today is you basically field test it. You release your prototype, see what happens.*

## [07:57] The CPU of intelligence can't model irrational humans

Asked when models got good enough for a faithful representation of society, Joon marks the path from GPT-3 (janky, no instruction tuning, needing prompt tricks just to follow orders) to today's foundation level, where these applications become imaginable. But he draws a sharp limit. The frontier labs' north star is a rational, superhuman machine optimized for objective problems, and that is the wrong target for simulating people. As accuracy on objective benchmarks climbs, the ability to predict and simulate human behavior diverges from it, because people are not rational.

> *We have a lot of subjective values, preferences, and taste.*

## [10:04] Why this became a company, not another paper

Joon distinguishes the two vehicles bluntly: research is built for breadth, where each researcher owns a slice of thesis and is "not necessarily known for finishing our job," while a company is built for depth on a single conviction.
The pull toward a company came roughly half a year after the generative-agents paper, first from social scientists wanting to run RCTs on the platform, then from Fortune 500 boards and CEOs who saw the demo at Stanford and asked whether the surveys and market questions they could never answer might run in simulation. Before committing, the team validated accuracy with simulations of 1,000 people drawn from across the US population.

> *we can actually predict people's behaviors 85% as accurately as people replicate their own*

## [12:43] How a Simile engagement works, and the say-do gap

Simile's first major customer is CVS, brought in by a senior VP of human insights who had read the validation paper and felt bottlenecked by how few questions he could field-test. The workflow mirrors how firms already use polling and panel companies: a customer names a population they want to understand, and Simile, through a strategic partnership with Gallup, reaches real humans, asks the magical 15-minute questions, and turns that data into agents that can answer far beyond the original survey. Sonya pushes on why an LLM alone can't just role-play a 34-year-old woman from a coastal metro. Joon's answer is the say-do gap: models are trained on what people said online, not what they actually do, and closing that gap requires behavioral data such as RCTs, pricing studies, and life-story interviews that surface the long tail of a person.

> *There are things that people say and then there are things that people actually do, and the gap there is real*

## [20:27] The GPU of intelligence: from concept tests to earnings calls

Here Joon gives the framing that anchors the company. Today's models are the CPU of intelligence: one model trained on rational data, superb at objective questions. Simile is building something closer to the GPU: not superhuman, but as human as possible, where individual subunits represent the real viewpoints of different populations.
Customers usually enter through a concrete door: concept testing, where instead of testing 5 to 10 ideas they imagine testing a thousand ideas across a thousand sub-populations. They then move toward product testing with a temporal dimension and multi-agent simulation. One recurring and initially surprising ask: simulate the company's own earnings call to see how the audience reacts.

> *imagine the current today's model are akin to the CPU of intelligence unit*

## [26:32] How accurate is it? Convergence versus divergence

On evaluation, Joon starts from the theoretical limit (humans answer the same question slightly differently each time, so perfect prediction is impossible), then describes the metric: total variation distance between the ground-truth and simulated response distributions, with a TVD under 0.15 treated as strong enough for decisions. The deeper idea is two categories of simulation. Convergent ones tolerate compounding error because the pull toward an outcome is strong, like a network always forming a hub, the scale-free structure that powered PageRank. Divergent ones (was World War I inevitable, who wins an election) can't be expected to repeat, so the evaluation shifts to confidence: run it 100 times, see how often outcome X appears, and show the diversity of possible futures. He likens the work to the early days of inferential statistics setting the p < 0.05 threshold.

> *was World War I inevitable or was it not?*

## [31:56] A CERN for human society

Sonya raises the grander possibility that fields like macroeconomics, which she sees as human behavior at scale, might one day be partly solved by simulation, including the venture question of where value accrues across the AI stack. Joon agrees there is "a Nobel Prize to be won there," recalling how Thomas Schelling's deliberately crude agent-based segregation models revealed something deep about macro behavior.
The augmented version replaces red-dot/blue-dot agents with agents that replicate the full richness of individuals, opening questions economists actually asked him: when does a bank run happen, can nations be modeled solving climate's collective-action problem, what are the early signals of a democracy about to collapse. He imagines a simulation that costs $100 million and takes months to run once but answers a fundamental question: a Hubble telescope for human society.

> *building simulator that's akin to the CERN of human society*

## Entities

- **Joon Sung Park** (Person): Founder and CEO of Simile; created Stanford's Smallville generative-agents study and co-authored Social Simulacra.
- **Sonya Huang** (Person): Partner at Sequoia Capital, AI investing; host of the conversation.
- **Simile** (Organization): Applied AI lab building models that simulate human behavior and societies for concept testing, product testing, and multi-agent scenarios.
- **Smallville** (Concept): 2023 Stanford experiment with 25 generative agents living in a game town, known for emergent behavior like a self-organized Valentine's party.
- **Social Simulacra** (Concept): 2022 paper simulating a subreddit with thousands of personas; precursor to generative agents.
- **Say-do gap** (Concept): The difference between what people say (the basis of LLM training data) and what they actually do, which behavioral data is collected to close.
- **CPU vs GPU of intelligence** (Concept): Joon's framing: frontier labs build a rational "CPU" superhuman at objective problems; Simile builds a "GPU" encoding the diversity of human values and taste.
- **Total variation distance** (Concept): Simile's accuracy metric comparing ground-truth and simulated response distributions; TVD < 0.15 treated as decision-grade.
- **CVS** (Organization): Simile's first major customer, using it for concept testing via its human-insights team.
- **Gallup** (Organization): Polling and panel partner Simile uses to reach real humans and ground simulations in real data.
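
The [26:32] section above names two evaluation ideas: total variation distance (TVD) between ground-truth and simulated response distributions, and, for divergent simulations, the share of repeated runs that end in a given outcome. The sketch below illustrates both in Python under stated assumptions; it is not Simile's implementation. The function names, the survey answers, and the toy simulation are hypothetical. Only the 0.15 threshold and the 100-run count come from the summary, and TVD is taken in its standard form for discrete distributions (half the L1 distance).

```python
from collections import Counter
import random

DECISION_THRESHOLD = 0.15  # the "TVD under 0.15" figure quoted in the summary


def response_distribution(responses):
    """Turn a list of categorical answers into an {answer: probability} dict."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


def total_variation_distance(p, q):
    """TVD between two discrete distributions: half the L1 distance."""
    answers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in answers)


def outcome_frequency(run_simulation, outcome, n_runs=100, seed=0):
    """Run a stochastic simulation n_runs times; return (share ending in outcome, all counts)."""
    rng = random.Random(seed)
    counts = Counter(run_simulation(rng) for _ in range(n_runs))
    return counts[outcome] / n_runs, counts


def toy_simulation(rng):
    """Hypothetical stand-in for one full simulation run with a random outcome."""
    return "X" if rng.random() < 0.7 else "Y"


# Made-up survey answers for illustration only (not Simile data).
human = ["yes"] * 60 + ["no"] * 30 + ["unsure"] * 10
simulated = ["yes"] * 52 + ["no"] * 35 + ["unsure"] * 13

tvd = total_variation_distance(
    response_distribution(human), response_distribution(simulated)
)
print(f"TVD = {tvd:.2f}, under threshold: {tvd < DECISION_THRESHOLD}")

share, counts = outcome_frequency(toy_simulation, "X")
print(f"Outcome X in {share:.0%} of runs; all outcomes: {dict(counts)}")
```

For the made-up answers, the per-answer gaps are 0.08, 0.05, and 0.03, so the TVD is 0.08, which sits under the 0.15 threshold. The repeated-run helper reports the full outcome counts as well as the share for one outcome, matching the idea of showing the diversity of possible futures instead of a single prediction.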

56:51

#founders #entrepreneurship #biography

Sequoia Capital · about 1 month ago

What David Senra Learned from Studying More Than 400 Founders

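David Senra has spent ten years reading the biographies of more than 400 founders, and recently began meeting living founders for in-person interviews. Asked to compress what they share into one word, he says focus: in his words, "shutting out the world and building your own thing." He explains to Brian Halligan that this trait, combined with an obsessive drive rooted in childhood experience, explains founder success better than any pattern-matching checklist. The conversation covers childhood origins, founder archetypes, the danger of selling your best company, and why extreme craftsmanship becomes more valuable in the age of AI, while noting that the fundamental human traits of great founders do not change.

## [00:00] Introduction

Brian Halligan opens by defining what he wants from David: from Jesus of Nazareth to Jensen Huang, what do the best founders actually share, and how can that knowledge be used to pick and coach founders? The episode begins with David talking about DoorDash's Tony Xu, who, before a dinner celebrating a milestone had even ended, was already listing 17 things that were still going wrong. That restlessness, David says, is the signal.

> *"Before dinner is even over, I'm already thinking about the 17 things that aren't working. That is the reason for greatness."*

## [01:11] Focus Above All

David's one-word answer is focus. Not hard work, not resilience, not intelligence: focus. He describes it as something qualitatively different from what other high performers do, almost like a separate species. These founders do not look around at what competitors are doing, and they genuinely do not care. In his phrase, they "shut out the world and build their own thing."

> *"If I compress everything into one word, it's focus. Not just compared with the average person; these people are simply unbelievably focused. They're almost like a different species."*

## [01:50] Dana White and UFC Focus

Dana White is the most recent example of mission-driven focus David has encountered. White grew up in circumstances he himself describes as those of a loser, worked as a bellman in Boston, and with nothing to lose moved to Las Vegas to be near the fight business. He eventually persuaded the Fertitta brothers to buy the UFC for $2 million. It lost money for six years, losing a further $40 million before it turned profitable. Twenty-six years later, White closed a TV deal worth roughly $8 billion. Asked how that was possible, his answer was simple: he never read a single business book and never listened to a single business podcast. He just built what he wanted to see.

> *"His whole world is his business, and he doesn't care about anything else. He's simply unbelievably focused."*

## [04:19] Focus Versus Obsession

Brian asks whether focus and obsession are the same thing. David says they are similar but distinct. Focus is the act of saying "no" to good ideas for the sake of one greater thing. He cites Steve Jobs's distinction as relayed by Jony Ive: focus means turning down a good idea you really want to pursue, because it pulls you away from the great idea. Someone intensely focused on something looks obsessed from the outside, he says, but the mechanism is active exclusion, not passive fixation.

> *"Focus means saying no to a good idea you really want to do, because it takes you away from the great idea."*

## [05:05] Childhood Origins

Brian asks where that obsession comes from: an ordinary upbringing, or something that broke in childhood?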
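David says it is not one thing, but that almost none of the founders he has studied were what you would call well-adjusted. He discusses a biography of Francis Ford Coppola, a book he says crystallized a pattern he had found again and again, and explains that a son's drive always carries the father's story. He sees film directors, podcast hosts, and startup founders as the same entrepreneurial type.

> *"The answer isn't one thing."*

## [06:07] Coppola and His Father

The pattern David keeps finding is that the father's story is engraved in the son. Coppola's father was a talented but failed musician. He told his young son, "There is only one genius in this family, and it's me," and belittled him for years. Coppola internalized that and built one of the most relentless work ethics in Hollywood; he eventually won Academy Awards and had his father write the music, which earned his father an Oscar too. David connects this to Charlie Munger's framework: to truly understand an idea, you have to tie it to the personality of the person who developed it, which he says is why biographies work better than strategy books.

> *"To understand the son, you always look at the father's story. The father's story is engraved in the son."*

## [08:48] Bad Personalities and Archetypes

Brian raises the conventional wisdom that great founders are bad people. David rejects it firmly. He is working with Spotify's Daniel Ek on a project to map founder archetypes, based on the hypothesis that founder-problem fit matters more than product-market fit. Ek spent years trying to imitate Steve Jobs and wasted that time, because he was forcing on a personality that did not fit him; he is closer to a coach archetype. David's point: there is no single archetype, there are probably six to eight, and understanding which one you are is far more valuable than imitating whichever founder is famous right now.

> *"The most important thing is founder-problem fit. Think about Demis at DeepMind. There was one great company he could build, and that's DeepMind. He was born to do what he's doing now."*

## [11:14] The Autism Spectrum and Originality

Brian raises the high proportion of autism-spectrum traits among the CEOs of modern trillion-dollar companies: Jobs, Gates, Bezos, Zuckerberg, Jensen, Ellison. David reads Peter Thiel's view: founders who seem mildly Asperger's lack the imitation-socialization gene, so nobody can talk them out of a strange, original idea before it is fully formed. David's caveat: Silicon Valley is now full of people performing anti-mimesis, and they are the most mimetic of all. Rockefeller probably did not fit the spectrum profile; he had exceptional social skills yet built the most dominant company in history.

> *"We need to ask why people without Asperger's are at such a disadvantage in our society. Because we will be talked out of our interesting, original, creative ideas before they are even fully formed."*

## [14:55] Immigrant Drive and Grit

David talks about his own experience as the son of Cuban immigrants. People who risked their lives on a raft to cross 90 miles of ocean pass on to their children a different baseline for risk and opportunity. Brian says that only three of the founders of the ten largest US tech companies are immigrants (Jensen, Elon, Sergey), while most come from upper-middle-class suburbs. David's rebuttal: those three account for a disproportionate share of total market capitalization, and many of the rest had immigrant fathers.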
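That advantage can carry across a generation.

> *"Think about how much you love your son. Then think about how hard Cuba was and how bad communism was for you to put your fourteen-year-old or nine-year-old son on a raft to cross 90 miles to South Florida."*

## [16:38] Bet on the Founder

David says that if he were a venture capitalist he would not use any rubric; he would simply bet on the person. Ed Catmull put it most clearly: give a great idea to a mediocre team and they will ruin it; give a mediocre idea to a great team and they will fix it or throw it away and make something better. Ideas come from people, so people matter more than ideas. David's test: does this person have the do-or-die quality of Uber's Travis Kalanick?

> *"Give a great idea to a mediocre team and they'll ruin it. Give a mediocre idea to a great team and they'll fix it or throw it away and make something new."*

## [17:52] Solo Founders Versus Partners

The conventional wisdom that co-founders are better and three is optimal does not match what David sees across history. Most great companies had a single dominant driving force, and the "co-founder" either left (Wozniak), was effectively an operator brought in by the founder (Frick at Carnegie Steel), or was a complementary personality who consciously subordinated himself to a once-in-a-century talent (Munger to Buffett). When David met Munger, Munger admitted that he had always thought he was smarter than everyone else, but that he recognized Buffett's unusual focus and made a deliberate calculation to subordinate his ego to him.

> *"If I could live my life over again, I would still think I'm smarter than everyone else, but I would do a better job of hiding it."*

## [23:20] Negative Self-Talk as Fuel

Jensen Huang says he looks in the mirror every morning and asks himself why he is so bad at this. Elon describes his mind as a storm and seems genuinely uneasy when things are going well. Most founders David has studied run on negative self-talk as fuel. But David recently changed this in himself. Brad Jacobs, who has built eight separate billion-dollar companies over 45 years, told him: the negative drive got you here, but it no longer serves you. You love the work itself now. Make your inner drive productive. Something shifted, David says, and he has not gone back since.

> *"Your inner drive has to be productive. It has to be, 'I'm trying to build something good for the world that I love and am really proud of.'"*

## [26:39] Platform Shifts and Founder Mode

Brian asks whether major platform shifts such as the Industrial Revolution, the assembly line, and now AI change who succeeds and how companies are run. Brian describes Paul Graham's distinction between founder mode and manager mode, along with what he calls "Dorsey mode": a flat organization, no titles, and an AI system at the center making a growing share of decisions while humans supply context and apply judgment. He sees this as structurally different from any earlier platform shift.

> *"AI systems handle a very small share of decisions today, maybe 5%, 10%... but over time the ratio of decisions made by AI systems to decisions made by humans will start to flip."*

## [28:07] Dell Versus IBM

David asked Michael Dell directly whether this moment feels like anything he has lived through. Dell's answer was no: this is categorically different.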
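David is usually skeptical of "this time is different" claims, but he agrees with Dell, Tobi Lütke, and Jack Dorsey that the amount of leverage now available to small teams fundamentally changes the math of company building. IBM once held 80% market share of the entire technology industry and was the first company to reach a $100 billion market capitalization. Dell took them on from a University of Texas dorm room with $1,000, and was profitable every quarter for his first 20 years.

> *"I really think that how you run a company, how you can do it, and what's possible for you has completely changed."*

## [30:02] The Infinite-Leverage Advantage

Naval Ravikant's line, "In an age of infinite leverage, it is very important to be at the extreme in your field," was written before AI. David thinks AI amplifies that truth one step further. His example is Jordi of TBN: he was not 2x better at podcast marketing than the next person but 100x better, and for someone at that frontier the economic reward is not 100x but potentially 1,000x. The premium on focus and mastery is going up, not down.

> *"In an age of infinite leverage, it is very important to be at the extreme in your field."*

## [31:38] Focus Versus Speed

Brian pushes back: the AI-native founders he knows (Harvey, Lovable, ElevenLabs) are moving fast on many fronts at once. Is focus still the rule? David's answer: they have not yet built durable businesses, so it is too early to tell. His deeper concern is what happens after a sale. He has spent time with founders in their 70s and 80s who sold their best company and then spent decades trying to recreate the magic with a second or third act, almost never succeeding. If you truly have a generational company, do not sell it. Either be all in or be all out.

> *"You have to be all in or all out. But why would you go all in on your second, third, fourth, or fifth best idea?"*

## [34:20] Taste and Listening

Brian asks whether taste is a real founder trait or a buzzword. David says taste is very real and offers Rick Rubin as the clearest example: at 62 he is still doing what he started in a dorm room at 18. But David's more specific claim is that Rubin's strength is not taste alone but that he is a professional listener. Most people in a conversation are waiting for their turn to respond; Rubin is actually interested. That quality of attention, carried over from music production to podcasting, is what makes him exceptional. David also discusses founder authenticity: not everyone has to be unfiltered and blunt. It depends on who you are, what industry you are in, and what you are trying to build.

> *"He applied the skill from music to podcasts. You're a professional listener."*

## [40:52] Founder Traits and Balance

The core shared traits David has identified across more than 400 biographies are obsession, high disagreeableness, a fixation on cost control, and micromanagement. This is what Paul Graham called "founder mode," and David says there is nothing new about it. Rockefeller was an exception on disagreeableness: he never raised his voice, yet was a formidable presence in every other respect. On work-life balance, David can name exactly three founders across four centuries who lived a genuinely balanced personal life. Sam Walton, writing his autobiography on his deathbed with cancer, said he would do everything the same way. Phil Knight, at 75, still has not fully come to terms with his absence from his sons' lives.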
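What drives great people is control, not money.

> *"I don't think small egos build big companies. I think all of these people have enormous egos; some are just better at hiding it. And what drives most founders is control, not money."*

## [54:22] Closing Takeaways

Brian sums up three points. Deep founder-market obsession is the real common thread. Having good work-life balance while building a great company is genuinely rare (three out of 400). And imposter syndrome is worth addressing. Brian holds up Brian Chesky's shift from leading from fear to leading from love as a model. The episode closes with Dana White's formula: deeply understand who you are, deeply understand what you want to do in the world, and get up every day and execute. Stay in the game long enough to get lucky.

> *"Stay in the game long enough to get lucky."*

## Entities

- **David Senra** (Person): Host of the Founders podcast; has read more than 400 founder biographies and now interviews living founders in person.
- **Brian Halligan** (Person): Co-founder and executive chairman of HubSpot; hosts this Sequoia Capital series.
- **Dana White** (Person): UFC founder/CEO; acquired it for $2 million in 2001 and recently signed a TV rights deal worth roughly $8 billion.
- **Daniel Ek** (Person): Founder of Spotify; working with David on a founder-archetype framework; argues for founder-problem fit over product-market fit.
- **Demis Hassabis** (Person): Co-founder of DeepMind; cited as the clearest example of perfect founder-problem fit.
- **Charlie Munger** (Person): Berkshire Hathaway partner; consciously subordinated his ego to Buffett's once-in-a-century talent.
- **Ed Catmull** (Person): Co-founder of Pixar; Steve Jobs's longest continuous collaborator; source of the "give a great idea to a mediocre team" principle.
- **Brad Jacobs** (Person): Entrepreneur who built eight billion-dollar companies; advised David to shift from punishing drive to productive drive.
- **Rick Rubin** (Person): Music producer; David's example of taste combined with professional listening becoming a compounding advantage.
- **Founders** (Media): David Senra's podcast; covers more than 400 founder biographies from history to the present.
- **Founder-problem fit** (Concept): Daniel Ek's framework: the match between a founder's identity and the specific problem they solve is the most important kind of fit.
- **Infinite leverage** (Concept): Naval Ravikant's idea: in the age of software and AI, being at the extreme in your field yields disproportionately large rewards.
- **Sequoia Capital** (Organization): Venture capital firm; Brian Halligan's current home base and host of this podcast series.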

42:01

#market-research #ai-interviews #voice-ai

Knowing What Your Customers Want, All the Time: Listen Labs' Alfred Wahlforss

Alfred Wahlforss built Listen Labs after scratching his own itch: when his viral AI-avatar app hit 20,000 users overnight and churn spiked, he needed to know why, fast. The answer was an AI agent that runs voice interviews at scale, drawing from a panel of 30 million people. A year in, Listen serves 20% of the Fortune 500 and has completed over a million interviews. The deeper finding is counterintuitive: respondents are often more honest with an AI interviewer than a human one, and voice transcripts turn out to be a richer training signal than credit card data or behavioral logs. Wahlforss and Sequoia's Konstantine Buhler work through why audience selection consumes 80% of Listen's engineering, how back-tested simulation beats vanilla ChatGPT at message testing, and why, as AGI makes building trivially cheap, knowing *what* to build becomes the scarce resource Listen wants to own.

## [00:00] Introduction

Alfred opens in the middle of a thought about audience depth: Listen's long-term goal is to reach a billion people and build rich profiles that reveal each person's genuine areas of expertise, going beyond demographic boxes to things like whether someone is a true sneaker influencer or a casual buyer. Konstantine then formally introduces him: Listen launched roughly a year ago, already counts Microsoft, Anthropic, Sweetgreen, NBC, and others as customers, and runs thousands of voice interviews simultaneously. The brief cold-open framing gives the episode its throughline: the value of talking to the *right* person, not just any person.

> *"Our goal is to get to a billion people in our audience and then to be able to stratify and know what exactly is this person an expert on."*

## [01:20] How Listen Works

The product works in three stages: a researcher types a question (say, "how can we improve Cursor's onboarding?"), Listen's AI agent generates an interview guide, and the agent then routes those interviews to matched participants from its 30-million-person panel.
Hundreds of conversations run in parallel, the results are synthesized, and recommendations surface. The next stage, launching in a few months, is simulation: after tens of thousands of interviews accumulate on a topic, can Listen predict how customers will answer *future* questions without running a new interview?

> *"As we get closer to AGI, it will be easier to build things, but the hard part will be knowing what to build, and that's what we're building at Listen."*

## [02:23] Customer Wins

Chubbies discovered that chest hair caught uncomfortably on one of their shirt materials; Listen surfaced the feedback, Chubbies redesigned the garment, and comfort scores jumped. Manscaped used Listen insights to reshape a Super Bowl ad. Skims uses it for ongoing product testing. The through-line Alfred draws is that Listen handles both small product details and high-stakes campaign decisions with the same workflow: talk to real people, fast.

> *"They discovered that chest hair interface really poorly with one of the materials they have. So it's really uncomfortable to wear one of their shirts, and they changed the shirt and it became radically more comfortable."*

## [03:28] Surveys Versus Reality

Konstantine presses on the classic critique: survey respondents lie, or at least contradict themselves. Alfred's evidence: Listen put the same multiple-choice survey questions to the same people a second time and found radical inconsistency, but when those same people had to reason through an open-ended voice answer, consistency improved sharply. On back-testing against sales data, Alfred agrees A/B tests are the gold standard but notes they require large user bases that most companies don't have. Interview data, properly designed, beats no data.

> *"If you go back to the same person and ask them a survey question in a multiple choice fashion, they're much more inconsistent.
But when you actually have to think and reason through your answer, you're much more consistent."*

## [05:13] Zoom-Like AI Interviews

The participant experience is a video call with an AI agent, not a text form. The agent watches facial expressions and vocal tone, giving Listen a second signal layer beyond what people say. Alfred cites advertising testing as the clearest win: respondents might rate an ad highly on a Likert scale, but it is the genuine enthusiasm visible on video that predicts Meta and LinkedIn performance-marketing results significantly better than the numeric score. Every data point links back to the actual video clip, so researchers can verify the AI isn't hallucinating sources.

> *"For every data point you can always click and then look at the video or see the quote, so you know that AI is not just hallucinating where it's coming from."*

## [07:14] Origin Story

Alfred and his co-founder shipped a consumer app called "Be Fake," an early Stable Diffusion fine-tuning tool for creating AI avatars of yourself, which went viral overnight and hit 20,000 users. Churn spiked immediately and they had no idea why. They built an AI interview tool to ask their own users, found it genuinely useful, and pivoted. The market-research product they built for themselves became Listen Labs.

> *"We built this AI interview for ourselves because we had a ton of churn and we wanted to understand why, and that's how we got started."*

## [08:01] Old World Research

The pre-Listen world had two speeds: slow online survey tools like Qualtrics, or expensive services firms that charge tens of millions to recruit participants, design question methodology, moderate focus groups, and synthesize hundreds of transcripts. Question design alone is an academic discipline: ask "how much would you pay for this?" and you get junk data.
The sourcing problem is equally hard: incidence rates of 10% mean nine out of ten recruited panelists get screened out, burning trust and causing churn on the databases themselves.

> *"In traditional industries like CPG or even Microsoft, they spend tens of millions of dollars on focus groups to bring people in a room and interview them, and we can help speed that up much faster."*

## [09:50] AI First Benefits

Three compounding advantages: speed (results from real people in five minutes), cost (asynchronous interviews pay participants less than synchronous ones, and participants accept that willingly), and honesty (people open up more to a non-judgmental AI than to a human interviewer who might silently judge them). Alfred mentions sensitive use cases, such as interviewing children about products with parental consent, as an area where the AI's non-threatening presence produces data that focus groups can't.

> *"People are more honest talking to an AI. It's a very therapeutic experience because it's a non-judgmental entity that's really interested in you."*

## [11:32] Finding The Right People

Listen spends 80% of its engineering resources on audience quality, not the interview agent itself. The reason: power-law customer segmentation means talking to the wrong 100 people gives you wrong insights. Sweetgreen's most valuable customer is urban, high-income, mostly female, and (Alfred's specific example) knows what seed oils are, which is true of roughly 1% of the population. Listen builds rich profiles across every interview a panelist ever participates in, so an offhand comment ("I'm a total sneaker head") in an unrelated interview can resurface that person when Nike needs launch feedback. Traditional email-list panels couldn't do cross-topic profiling.

> *"Even a product like Sweetgreen, which you would think is for everyone, the right audience is typically urban, high household income, mostly female, and they need to know what seed oils are, which only like 1% of the population does."*

## [14:30] CRM And Prospecting

Sweetgreen already has a CRM full of its most loyal customers, so why use Listen? Three reasons: researching *prospective* customers who aren't yet in the CRM requires an external panel; CRMs are typically disorganized and legally constrained (Google can't spam Gmail users, even its own); and direct outbound email risks getting flagged as spam, which can permanently damage a domain's deliverability. Listen provides clean, third-party panel access that sidesteps all three problems while still supporting CRM-connected campaigns when brands want them.

> *"What we found is that the CRM is typically really unorganized, and sometimes there are regulatory issues: if you're at Google, you can't just send emails to people who use Gmail."*

## [15:35] Consulting In The AI Era

Konstantine, a former buyer of McKinsey-style consulting, asks whether firms like Bain still have a role. Alfred's view: yes, but margins compress. Bain already uses Listen to accelerate existing workflows. The more optimistic scenario: AI doesn't just replace a research project, it makes research cheap enough to run five simultaneous strategic explorations that a company never would have commissioned before. Alfred predicts consulting expands in scope even as price-per-project falls. On economic surplus, Listen has charged hundreds of thousands of dollars to interview 20 doctors across eight countries quickly, a project that previously would have taken months. The surplus is currently staying with the supplier. Alfred also flags an emerging agentic loop: churn interviews surface bugs, which connect directly to a coding agent that opens a PR and ships the fix.
In that loop, Listen serves as the customer-intelligence "left side" of an autonomous product development cycle.

> *"Because you're able to do it faster, I would argue you should be able to charge more for it, and we have charged hundreds of thousands of dollars to speak to 20 doctors across eight countries."*

## [20:05] Market Research Simulation

This is the episode's technical core. Konstantine frames the evolution as 1.0 (call 100 people manually), 2.0 (AI-native simultaneous interviews), and 3.0 (generative simulation). Alfred explains how Listen's simulation works: interview a single person deeply, build a persona model, then scale to a thousand statistically representative agents. Back-testing holds out a question and measures how accurately the model predicts the answers. Listen reaches 95% on stable preference domains and deliberately exposes the model to nonsensical queries (dog names) to calibrate what it *can't* predict.

Alfred ran a personal live test: 100 title variants for a conference talk, run through Listen's panel simulation. The top-ranked title performed twice as well as the second. He then ran the same test in ChatGPT, which picked the wrong title when shown a past successful talk versus a less successful one. Listen's domain-specific panel data beat the general model. The explanation for the gap: interview transcripts outperform credit card spend, behavioral logs, or ChatGPT persona prompting because voice conversations capture how a specific *type* of person actually reasons, not just what the average person does.

Looking ahead, Alfred sees simulation handling "billboard tagline" decisions while real interviews remain the standard for Super Bowl ad buys. The product's proprietary eval climbed from 20% to 85% on avoiding repetitive questions, then Listen raised the bar with a harder eval (screen-state awareness, skipping irrelevant questions) and is back at 20%. Alfred frames this as the vertical AI flywheel: a proprietary benchmark that only you can keep climbing.

> *"We were able to get 95% accuracy to predict how they will answer certain questions. The tricky part is knowing what things you can answer and what you can't."*

## [35:33] Closing Thoughts

Alfred's conviction: human input will always be necessary because humans are inherently irrational. TikTok trends can overturn a marketing strategy overnight, and no AGI will preempt that. His uncertainty: the ceiling for simulation quality. His moat argument: network effects on the panel (supply-demand flywheel), data network effects (more interviews → better simulation), and product stickiness (interview history compounds inside the platform). But the simplest advantage he cites is opinionated defaults: early customers using vanilla LLMs to design their own interview guides got bad data and blamed Listen; now the agent enforces question-design best practices and data quality is consistent.

Konstantine ends with the "Tide Pods moment" question: can Listen's AI start *generating* product ideas mid-interview rather than just testing them? Alfred says customers already feed AI-generated images into interviews manually; the MCP integration means Claude can loop Listen calls autonomously. The vision is live brainstorming between the AI interviewer and the respondent, with ideas surfacing as the customer articulates a pain, not after.

> *"Founders want to build something that's complex X, but customers want something that's stupid simple and it just works. And that's the advantage you have as a vertical AI company: you can train the agent to follow best practices in the work that you do."*

## Entities

- **Alfred Wahlforss** (Person): Co-founder and CEO of Listen Labs; previously built "Be Fake," a viral AI-avatar consumer app.
- **Konstantine Buhler** (Person): Partner at Sequoia Capital; host of the Training Data podcast; former consultant and operator.
- **Listen Labs** (Organization): AI-first customer research platform; runs voice interviews with a 30-million-person panel; building generative simulation.
- **Market Research Simulation** (Concept): Building persona models from accumulated interview data to predict future customer responses without running new interviews; back-tested against held-out questions.
- **Audience Quality** (Concept): Listen's thesis that 80% of research value comes from recruiting the right respondents (power-law customer segments), not just any panelists.
- **Be Fake** (Software): Alfred's earlier consumer app (AI avatar fine-tuning via Stable Diffusion); the origin of Listen's interview tooling.
- **Bain** (Organization): Management consulting firm; cited as an active Listen customer using the platform to accelerate traditional research workflows.
- **Procter & Gamble** (Organization): Cited as the historical archetype of market-research-driven brand management; Tide Pods and M&M's given as canonical examples.
- **Qualtrics** (Software): Legacy survey platform representing the "old world" of market research tooling.
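
The held-out back-test described under Market Research Simulation can be sketched in a few lines. This is only an illustration under stated assumptions: the respondent format, the `backtest_holdout` and `majority_baseline` helpers, and the toy panel are all hypothetical, and the baseline predictor stands in for a real persona model, which the episode does not describe in detail.

```python
from collections import Counter


def backtest_holdout(respondents, held_out, predict):
    """Fraction of respondents whose held-out answer is predicted correctly.

    respondents: list of dicts mapping question -> answer (hypothetical format).
    held_out:    the question hidden from the predictor.
    predict:     callable(known_answers, question) -> predicted answer.
    """
    hits = 0
    for answers in respondents:
        # Hide the held-out question, then ask the predictor to fill it in.
        known = {q: a for q, a in answers.items() if q != held_out}
        if predict(known, held_out) == answers[held_out]:
            hits += 1
    return hits / len(respondents)


def majority_baseline(respondents, held_out):
    """Stand-in predictor: always answer with the most common response."""
    top, _ = Counter(r[held_out] for r in respondents).most_common(1)[0]
    return lambda known, question: top


# Toy panel: one stable preference question and one "nonsense" question.
panel = [
    {"segment": "runner", "buys_gels": "yes", "dog_name": "Rex"},
    {"segment": "runner", "buys_gels": "yes", "dog_name": "Milo"},
    {"segment": "cyclist", "buys_gels": "yes", "dog_name": "Luna"},
    {"segment": "walker", "buys_gels": "no", "dog_name": "Bo"},
]

stable = backtest_holdout(panel, "buys_gels", majority_baseline(panel, "buys_gels"))
noise = backtest_holdout(panel, "dog_name", majority_baseline(panel, "dog_name"))
print(stable, noise)  # 0.75 0.25
```

The low score on the dog-name question plays the role of the calibration check mentioned above: it marks a question type the simulation should decline to answer.
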

Neuralink's DJ Seo: Inside the Race to Connect Brains and AI

24:59

#brain-computer-interface#neuralink#ai

At AI Ascent 2026, Neuralink co-founder and president DJ Seo sits down with Sequoia partner Shaun Maguire to lay out exactly where the company stands: 20-plus Telepathy patients controlling computers and robotic arms through pure thought, Blindsight in preclinical testing and potentially cleared for human use by end of 2026, and a first-principles manufacturing philosophy borrowed from Elon Musk that treats surgical robots the way SpaceX treated reusable rockets. DJ argues that the real ceiling of this technology is not cursor control or speech synthesis but direct, uncompressed, multimodal transfer of concepts (AI as a neocortical layer sitting above the human limbic system) and that scale, the same variable that unlocked the LLM era, is the only remaining gate.

## [00:00] Introduction

Shaun Maguire opens the session by announcing a two-minute Neuralink patient video before the interview begins, telling the audience to stay on the side because what they are about to watch is proof that the company has already cleared the hardest bar: restoring human agency to people who had lost it entirely.

## [00:21] Telepathy Patient Stories

The video narrates four patients whose lives changed after receiving the Telepathy implant. A quadriplegic patient describes moving a cursor with thought alone: "I'm thinking and a cursor is moving on a screen. It blew my mind." An ALS patient who lost the ability to speak regains a digital voice through the implant: "I'm talking to you with my mind." Another patient notes that the implant flipped how his child sees him: "I am not able to do things that other dads can, but now he thinks it's so cool that I can do things that other dads cannot."

> *"Before the implant, I was locked in, non-verbal, quadriplegic. Now I control my computer just by thinking and the rewards have been immense for me."*

## [01:06] Convoy Robotics Independence

The video shifts to Convoy, Neuralink's assistive robotics team, which is extending BCI control beyond a screen to physical manipulation in the real world. A patient who had been losing motor function moves a robotic arm through its axes using only neural intent: "It was incredible to be able to just gesture with an arm again." A second patient, Kenneth, who was losing his voice to ALS, uses the system's speech synthesis to speak aloud in real time during the video, with words generated by his brain signals rather than his vocal cords.

> *"Gaining functionality that I thought was gone forever was so incredibly life-changing."*

## [02:04] Blindsight Vision Restore

The video previews Blindsight, Neuralink's second product line, designed for patients who have lost both eyes or optic nerve function. An external camera captures the visual scene; the device writes the signal directly into the visual cortex via electrical stimulation, generating phosphenes, artificial pixels of light. A patient named Audrey, asked how it feels, answers simply: "Life-changing." The video closes with the line "all with my mind" spoken over footage of a patient interacting with the world through the restored signal.

> *"The future of this technology feels almost unlimited... we are finding ways to apply it across all regions of the brain."*

## [03:10] After Video Reflections

DJ Seo, visibly moved after watching the video alongside the audience, speaks first: "We were cracking a lot of jokes before that video, but honestly, that brought tears to my eyes." He describes the work as one of the most inspiring projects in the world, not because of the technical milestone but because the team is giving back capabilities that patients had already grieved as permanently lost. Maguire affirms the sentiment before pivoting to the founding story.
> *"This is one of the most inspiring projects in the world. It's incredibly difficult what they're doing and I mean, they're truly saving people."*

## [03:31] Origin Story And AI

DJ traces Neuralink's founding insight to a single bottleneck: the mismatch between human output bandwidth and AI capability. In 2016, saying that out loud "sounded insane," but the logic has not changed. His personal path ran through a childhood fascination with the brain, undergraduate work at Caltech building miniaturized low-power electronics, and a Berkeley PhD focused on shrinking lab-grade neural systems down to something deployable. When he met Elon Musk near the end of his PhD, the scale and ambition of the project made refusal impossible. He frames the brain as "the most interesting compute that we all carry" and "the only form of general intelligence that we know to date."

> *"Really the key insight back then was sort of the IO bottleneck between the human output and AI capabilities."*

## [06:31] Scaling And Vertical Integration

Maguire presses on what smart people most misunderstand about Neuralink: many know the implant and the decoding algorithm, but almost nobody grasps the manufacturing and surgical-robot infrastructure the company built in parallel from day one. DJ attributes this to what he calls "Elon magic," an insistence on vertical integration that gives Neuralink control over every layer from chip design to factory floor to robotic surgery deployment. The target is not a niche medical device; it is LASIK-scale surgery available to millions. Building that capacity first means progress looks slow until "the iceberg pops over the waterline" and ramp becomes near-instantaneous.
> *"Vertical integration is something that is really the lifeblood of Neuralink and Elon companies and what really enables us to have that fast iteration loop from design, develop, deploy."*

## [09:27] Caregivers And Purpose

Asked which patient story inspires him most, DJ refuses to pick one: the power, he says, is not only in the patients but in the caregivers: Nolan's mother Mia, Brad's wife Tiffany, Ken's wife Cheryl. He describes their presence as "a really powerful human story of love, sacrifice, and resilience." He then takes what he calls a philosophical tangent: his core belief is that fulfillment comes from helping others, because the gap between self and other is not categorically different from the gap between your present and future selves. That belief is what he says keeps him and much of the Neuralink team going: they are "igniting a fire of hope" for people who had given up on recovering what they lost.

> *"I personally and as well as many others at Neuralink find extreme fulfillment being able to help those that really cannot help themselves."*

## [13:10] BCIs Meet AI Future

Maguire asks the room's core question: how do BCIs and AI converge? DJ sketches a two-horizon answer. Near term, the system translates neural intent into legacy interfaces (keyboard, mouse, language), which is already working. The real breakthrough, which he thinks is "not super distant," is bypassing those legacy interfaces entirely and computing on raw neural intent. He points to transformer architectures as existence proofs: nothing prevents them from learning the latent manifolds of neural data given sufficient scale. Neuralink is already fine-tuning LLM-class models on neural recordings from its 20 participants and finding "very counterintuitive" patterns. The ultimate ceiling he names is "direct, uncompressed, high-fidelity, multimodal transfer of concepts": the Matrix's "I learned kung fu" moment and possibly beyond it.
He also shares what he calls a clarifying lesson from working with Musk: "all green light schedule," a first-principles forcing function that strips every man-made bottleneck and asks how fast something could actually be built if every light were green. His estimate is that 80–90% of perceived constraints in hardware development are artifacts of convention, not physics.

> *"I think if you really think about the ultimate ceiling of this technology, it's really direct uncompressed high fidelity and multimodal transfer of concepts."*

## [21:05] Audience Q&A Wrap

Three audience questions in the final four minutes. On product sequencing (when to go deep versus expand), DJ explains the "beachhead and expand" strategy: build everything generalizably enough from the start so that regulatory approval for motor cortex becomes a template for visual cortex and beyond. The first approval is the hardest; every subsequent one rides the clinical safety record already established. On augmentation for healthy users, DJ frames everything around benefit-risk: the calculus is obvious for quadriplegic patients; for otherwise healthy users it remains unclear, but he notes that off-label use after approval is legally available to anyone who can find a neurosurgeon and pay out-of-pocket. On the hard problem of consciousness, he gives a pointed one-liner: if you can inject new senses and measure the subjective response quantitatively, you may have a pathway toward measuring consciousness itself. Maguire closes by calling Neuralink "one of the most inspiring companies in the world."
> *"If you are able to inject new senses, there may be ways to quantitatively understand that."*

## Entities

- **DJ Seo** (Person): Co-founder and president of Neuralink; PhD in miniaturized electronics from Berkeley; joined after meeting Elon Musk near the end of his doctorate
- **Shaun Maguire** (Person): Partner at Sequoia Capital; host of the AI Ascent 2026 fireside session
- **Elon Musk** (Person): Co-founder of Neuralink; originator of the "all green light schedule" and vertical integration philosophy carried across Tesla, SpaceX, and Neuralink
- **Neuralink** (Organization): BCI company founded in 2016; products include Telepathy (motor prosthesis) and Blindsight (vision restoration via visual cortex stimulation)
- **Telepathy** (Software): Neuralink's first commercial product; allows paralyzed patients to control computers and robotic devices through neural intent decoding
- **Blindsight** (Software): Neuralink's second product line; restores vision for patients with total loss of eyes or optic nerve by writing directly to the visual cortex; in preclinical testing as of mid-2026
- **IO Bottleneck** (Concept): The mismatch between human output bandwidth (speech, typing, gesture) and AI processing capability; the founding problem Neuralink was built to solve
- **Neural Foundational Model** (Concept): LLM-class transformer models fine-tuned on neural recording data; Neuralink is building these at 20-participant scale and observing counterintuitive patterns in neural latent space
- **All Green Light Schedule** (Concept): Elon Musk's first-principles engineering discipline: strip every man-made constraint and ask what physics alone limits; DJ estimates 80–90% of hardware delays are conventional, not physical

How Cursor Trained Composer with Fireworks: Distributed Infrastructure for High-Performance RL

45:33

#reinforcement-learning#model-training#agentic-coding

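Cursor's Federico Cassano and Fireworks' Dmytro Dzhulgakov walk Sonya Huang through the full process of building Composer 2. From the Kimi 2.5 MoE base model through large-scale mid-training to globally distributed asynchronous RL, they explain why a specialized model beats a general-purpose one on both cost and quality. At its core this is an infrastructure story: four GPU clusters spanning continents, Delta Compression that ships a 1TB weight snapshot in under a minute, and a real-time RL loop that updates the live model every few hours from real user signals. Together these techniques let Cursor deliver frontier coding performance at far lower inference cost than general-purpose models.

## [00:00] Introduction

The conversation opens with the RL environment fidelity problem raised by Dmytro. Because a model can detect that it is running in a fake environment and exploit that, the training environment has to match a real user's machine as closely as possible.

> *"Models love to cheat. RL is very good at encouraging cheating."* (Federico Cassano)

That one line captures the technical principle running through the whole episode: every piece of the infrastructure exists to narrow the gap between training conditions and production reality.

## [00:53] Why Cursor Trained Composer 2

Federico explains the core logic of Composer 2 with an analogy. A model's weights are like a fixed-size storage drive, so every bit allocated to tasks Cursor doesn't need is a wasted bit. Focus the entire weight budget on software engineering inside Cursor, rather than coding in general, and the model gets better at that one role while costing less to serve at inference time. Dmytro makes the same argument from the infrastructure side. Prompt engineering gets you part of the way, but capturing fine-grained behavior, such as which tools the agent should call, in what order, and with which arguments, requires baking it directly into the model through fine-tuning and RL.

> *"There's a limit to how far you can go with prompt engineering. To build a really great AI product, you have to go through fine-tuning and influence the model's behavior."* (Dmytro Dzhulgakov)

## [04:55] Specialized Models vs. the Bitter Lesson

Sonya pushes back. The history of machine learning is full of specialized models displaced by bigger general-purpose ones. Isn't Composer 2 repeating TabNine's mistake? Federico says it is different. The Bitter Lesson is about scale in parameters and data. What Cursor does is free the model's finite capacity from places where it isn't needed, so that more of the scaling benefit is absorbed by the one task that matters. The lab models Cursor competes with also train intensively on code; they are not purely general-purpose models. Cursor simply controls its own data pipeline and pushes that specialization faster and deeper.

## [06:16] The Composer 2 Training Recipe

Composer 2 starts from Kimi 2.5, a 1-trillion-parameter MoE model with 30B active parameters. Training proceeds in two stages. First comes a mid-training stage that trains on code tokens at a scale approaching pretraining. Thanks to Cursor's product data, the team has unusually rich access to high-quality coding context. Next comes a large-scale RL stage that runs real Cursor agent sessions in simulated environments. Mid-training teaches the model the world of code: library APIs, idiomatic patterns, correct syntax. RL sharpens that knowledge into correct behavior: calling tools properly, navigating multi-turn agent sessions, and writing code that actually compiles and passes tests.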
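Thanks to the asynchronous pipeline, the trainer and the rollout environments run concurrently rather than taking turns. The team gives up mathematically perfect updates in exchange for near-100% GPU utilization.

> *"Because it's async, you can't do the perfect mathematical update, so you might lose a few percent. But you're more than compensated by not leaving half of your GPU capacity idle."* (Dmytro Dzhulgakov)

Training runs in FP4 to squeeze maximum throughput out of a GPU fleet smaller than a frontier lab's. For the inference engine, Cursor chose Fireworks over building its own: a deliberate decision so that Cursor engineers focus on training efficiency instead of spending time on yet another inference stack.

## [16:32] Scaling RL Infrastructure Worldwide

Because it could not secure a single cluster of the size Composer 2 demanded, the team chose a disaggregated strategy. One cluster handles all training, while inference, that is, the rollout component, runs on four geographically distributed clusters, including spare off-peak capacity from production serving of Composer 1.5. Training needs high-speed interconnect and synchronized operation, but inference does not, so it can run on heterogeneous GPU generations with modest intra-cluster networks.

The hardest problem in the system is weight synchronization. Kimi 2.5 is about 1TB, and the trainer produces a new checkpoint every 5 to 15 minutes. Shipping 1TB across continents every 10 minutes would stall inference. The solution rests on an observation: in RL updates, the pattern of weights that change is sparse and regular. The team wrote a Delta Compression algorithm that sends only the diff and shrinks the payload by about 20x. The receiver reconstructs the full checkpoint losslessly, so there are no numerical surprises on the other end.

> *"Even though the whole model is a terabyte, not every weight changes at every step. There's a very regular pattern to which parts of the weights change."* (Dmytro Dzhulgakov)

## [23:32] Floating-Point Drift

When the asynchronous RL loop sends a batch of rollout trajectories from inference back to the trainer, the trainer reruns the same forward pass to recompute the log-probabilities for the GRPO loss. In theory the log-probabilities should be identical. In practice they often differ, sometimes by a lot. The root cause is floating-point non-determinism. Floating-point addition is not associative, so summing the same values in a different order, (A+B)+C versus (C+B)+A, can give different results, and small differences accumulate across billions of operations. In ordinary inference the model is robust to this noise, but in RL, especially with sparse MoE gating functions, the noise is amplified: the trainer and inference disagree about which token was sampled, and the training signal is corrupted.

## [25:11] MoE Sensitivity Explained

MoE architectures amplify floating-point drift because of the gating layer. In each transformer layer, the gating network scores all 384 experts and selects the top 8 for each token. A difference in the fifth decimal place of the hidden state can be enough to swap expert 7 for expert 9 at the selection boundary, routing the token to an entirely different part of the model. Because MoE experts are large and mostly non-overlapping, a wrong expert choice leads to large output divergence, unlike dense models where numerical noise stays small throughout.

## [26:25] The Router Replay Fix

The mitigation is Router Replay. During inference the model records the indices of the experts it activated for each token and sends those integers back to the trainer along with the generated sequence. Instead of recomputing the selection from scratch, the trainer forces the same expert choices, breaking the amplification chain. Alongside Router Replay, the team matched quantization levels and kernel implementations between inference and training to minimize every other source of numerical mismatch.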
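
The drift-then-replay mechanism from the last three sections can be illustrated with a toy example. This is a minimal sketch, not Cursor's or Fireworks' implementation: the `top_k` and `route` helpers, the four-expert top-2 setup, and the drifted scores are all hypothetical, chosen only to show how a change in the fifth decimal place flips the routing and how replaying recorded indices avoids it.

```python
def top_k(scores, k):
    """Return the indices of the k highest gate scores, sorted ascending."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def route(scores, k, replay=None):
    """Pick experts for one token.

    If `replay` (expert indices recorded at inference) is given, reuse it
    instead of recomputing top-k from possibly drifted scores.
    """
    return list(replay) if replay is not None else top_k(scores, k)


# 1) Floating-point addition is not associative: regrouping changes the sum.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# 2) Toy gate scores for 4 experts with top-2 routing
#    (the episode cites 384 experts and top 8).
infer_scores = [0.90, 0.50001, 0.50000, 0.10]
# Hypothetical trainer-side scores: expert 2 has drifted by 2e-5.
train_scores = [0.90, 0.50001, 0.50002, 0.10]

recorded = route(infer_scores, k=2)                    # logged during the rollout
recomputed = route(train_scores, k=2)                  # trainer without replay
replayed = route(train_scores, k=2, replay=recorded)   # trainer with replay
print(recorded, recomputed, replayed)  # [0, 1] [0, 2] [0, 1]
```

Without replay the trainer would evaluate the token through a different expert than the one that actually produced it; with replay the expert set is pinned and only the small residual numeric noise remains.
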
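> *"Most of this numerical alignment work is tricks like matching quantization levels and matching kernels, to reduce the divergence between the training and inference implementations."* (Dmytro Dzhulgakov)

## [27:19] The Real-Time RL Loop

In parallel with the simulated rollout loop, Cursor runs what Federico calls real-time RL. Real user sessions from production feed back into the training pipeline. When a user is satisfied or dissatisfied with what Composer generated, that signal is captured, and a new model version ships every few hours. The team is working to shorten that cycle while knowing it will have to lengthen again as rollout horizons grow, because longer agent sessions take longer to evaluate.

The simulated loop and the real-time loop serve different purposes. Simulation can run 16 to 128 rollouts in parallel from the same prompt, and the GRPO loss requires grouped rollouts. It can explore off-policy without affecting any user, and it can lift performance before the model is good enough for real users. Real-time RL is a refinement layer that works only once the model already meets a minimum quality bar, because users who have a bad experience stop generating feedback signal.

> *"You can't build a model from scratch with this, because users have to use that model. It has to be good already, and we can only make it better."* (Federico Cassano)

## [31:49] Long-Horizon Agents

As rollout horizons lengthen, two structural problems appear. The first is credit assignment. When a single thumbs-up or thumbs-down reward arrives at the end of a session lasting several minutes, the model has to work out which of the 50-plus decisions in the trajectory drove the outcome. This gets exponentially harder as trajectories grow. The second is that the context window fills up. Cursor's solution is to bake self-summarization directly into the RL loop under the name "compaction." Through RL rewards, the model jointly learns to summarize its progress usefully as it nears the context limit and to continue faithfully from that summary. Because a model with a 200K context can reset its window while carrying a compressed working memory, it effectively operates across millions of tokens.

> *"Because RL pushes the model to act correctly toward the goal, we're simultaneously training it to generate good summaries and to follow those summaries really well."* (Federico Cassano)

## [34:29] Why RL Is Needed Everywhere

Sonya frames RL as a tool specialized for agentic, long-horizon tool use. Federico disagrees: RL is useful everywhere, including tab completion. His theory goes like this. A pretrained model has absorbed all of humanity's knowledge but, given a prompt, does not know which persona to adopt: expert, student, or something in between. The first phase of RL training sharpens that distribution, telling the model "you are an expert, do this correctly." This effect is valuable even for tasks like summarization that have no interaction harness. The second phase, where the model starts to reason visibly and the compute curve flattens, is where task-specific signal truly compounds.

## [37:34] Rewards with LLM-as-Judge

The more verifiable the reward (does the code compile, do the tests pass, is the answer numerically correct), the more compute you can pour into RL and still get a better model. Using an LLM as a judge fills the gaps for tasks where a correct answer is hard to define: encode a rubric as a prompt and have a second model rate rollout quality. Dmytro says this is especially useful for style-oriented tasks like summarization, where human raters struggle to articulate what "good" means but can evaluate against explicit criteria.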
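
Whether the reward comes from a test suite or an LLM judge, GRPO consumes rewards for a group of rollouts sampled from the same prompt, which is why the simulated loop runs them in batches. The sketch below shows only the group-relative advantage step, assuming the commonly cited mean and standard-deviation normalization; the episode does not say which variant Composer 2 uses, and the function name, `eps` value, and choice of population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group's (population) standard deviation. `eps` avoids division by
    zero when every rollout received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one prompt: two passed the tests (1.0), two failed (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]

# If every rollout scores the same, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

The second case hints at why production traffic alone is a poor fit for this loss: a single live session yields one rollout per prompt, so there is no group to compare against.
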
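> *"In general, the more verifiable the reward, the better, because you can get better results as you scale compute."* (Dmytro Dzhulgakov)

## [39:14] RL in Hard Domains

For domains where the correct answer cannot be computed cheaply (creative writing, open-ended reasoning, domain expertise), the path to improving RL is to make the environment richer. Larger simulated environments that capture more product metrics let automated evaluation be pushed further. Experts are still needed, not to judge individual rollouts but to design the tasks and rubrics that define what the reward function should optimize.

## [40:13] Building Your Own Environments

Cursor uses no RL environment vendors at all. For coding, GitHub repositories provide an effectively unlimited pool of working environments: clone a repository, install dependencies, give the model a task, and measure the result with the test suite. The harder infrastructure problem is making those environments realistic enough to prevent the kind of cheating discussed at the top of the episode, and at the same time fast enough to spin up 100,000 of them instantly on demand. Cursor's answer is a full VM stack rather than containers. It can burst to arbitrary scale instantly and is close enough to a real user's machine that the model cannot detect the difference.

Dmytro sums up the vendor landscape this way: frontier labs need general-purpose environments that cover every task, while product companies need to run RL in their own production environments. For any model, the most powerful training environment is the very product it will actually be used in.

> *"The most powerful environment is your own product."* (Dmytro Dzhulgakov)

## [44:34] Closing Thoughts

Sonya closes by saying that Cursor's trajectory from application company toward frontier model lab is a pattern other AI product companies will follow. Federico thanks Fireworks for providing the infrastructure foundation that made the training run possible on Cursor's GPU budget. Dmytro reflects on how much deep systems engineering goes into a problem most people assumed was purely algorithmic.

## Entities

- **Federico Cassano** (Person): Research lead for Composer 2 at Cursor. Led the training recipe and RL methodology.
- **Dmytro Dzhulgakov** (Person): Infrastructure lead at Fireworks AI. Engineered the distributed RL training system for Composer 2.
- **Sonya Huang** (Person): Partner at Sequoia Capital. Podcast host focused on AI investing.
- **Composer 2** (Software): Cursor's specialized agentic coding model. Built on Kimi 2.5 MoE and trained with mid-training and large-scale RL.
- **Fireworks AI** (Organization): Model serving and inference infrastructure company. Provided the distributed GPU backbone for Composer 2 RL training.
- **Cursor** (Organization): AI coding IDE company. Trained Composer 2 as a specialized foundation model for software engineering inside Cursor.
- **Kimi 2.5** (Software): Moonshot AI's open-source 1-trillion-parameter MoE model (30B active). Used as the base for Composer 2.
- **GRPO** (Concept): Group Relative Policy Optimization. The RL algorithm used for Composer 2; it needs multiple parallel rollouts from the same prompt to compute the policy gradient.
- **Router Replay** (Concept): MoE numerical alignment technique. Records expert routing decisions at inference time and replays them in the trainer to prevent log-probability divergence caused by floating-point drift.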
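- **Real-time RL** (Concept): Cursor's production feedback loop. Captures live user-satisfaction signals and ships a new model version every few hours, continuously updating the model.
- **Delta Compression** (Concept): Weight synchronization technique between training and the distributed inference clusters. Sends only changed parameters, in practice shrinking a 1TB snapshot to about 50GB.
- **Self-summarization / Compaction** (Concept): An RL-trained ability for the agent to compress its working context as it nears the context window limit. Enables effectively unlimited-horizon operation.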
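
The Delta Compression entry above comes down to "send only what changed, rebuild the rest exactly." The episode does not describe the actual algorithm beyond that, so the following is only a sketch of the general idea on a flat list of floats; `make_delta`, `apply_delta`, and the toy checkpoint are hypothetical, and a real system would also exploit the regular structure of the changes to encode indices cheaply.

```python
def make_delta(old, new):
    """Sparse diff: (index, new_value) for every weight that changed."""
    return [(i, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]


def apply_delta(old, delta):
    """Rebuild the new checkpoint exactly from the old one plus the diff."""
    rebuilt = list(old)
    for i, value in delta:
        rebuilt[i] = value
    return rebuilt


# Toy checkpoint of 1,000 weights in which one RL step touches every 50th.
old_weights = [0.0] * 1000
new_weights = list(old_weights)
for i in range(0, 1000, 50):
    new_weights[i] = 0.5

delta = make_delta(old_weights, new_weights)
rebuilt = apply_delta(old_weights, delta)

print(len(delta))              # 20 changed entries instead of 1,000 weights
print(rebuilt == new_weights)  # True: reconstruction is lossless
```

Because the receiver applies exact values rather than approximations, the rebuilt checkpoint is bit-identical to the trainer's, which matches the "no numerical surprises" property described in the scaling section.
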

1:03:06