Knowing What Your Customers Want, All the Time: Listen Labs' Alfred Wahlforss
Alfred Wahlforss built Listen Labs after scratching his own itch: when his viral AI-avatar app hit 20,000 users overnight and churn spiked, he needed to know why, and fast. The answer was an AI agent that runs voice interviews at scale, drawing from a panel of 30 million people. A year in, Listen serves 20% of the Fortune 500 and has completed over a million interviews. The deeper finding is counterintuitive: respondents are often more honest with an AI interviewer than a human one, and voice transcripts turn out to be a richer training signal than credit card data or behavioral logs. Wahlforss and Sequoia's Konstantine Buhler work through why audience selection consumes 80% of Listen's engineering, how back-tested simulation beats vanilla ChatGPT at message testing, and why, as AGI makes building trivially cheap, knowing *what* to build becomes the scarce resource Listen wants to own.

## [00:00] Introduction

Alfred opens in the middle of a thought about audience depth: Listen's long-term goal is to reach a billion people and build rich profiles that reveal each person's genuine areas of expertise. That means not just demographic boxes, but things like whether someone is a true sneaker influencer versus a casual buyer. Konstantine then formally introduces him: Listen launched roughly a year ago, already counts Microsoft, Anthropic, Sweetgreen, NBC, and others as customers, and runs thousands of voice interviews simultaneously. The brief cold-open framing gives the episode its throughline: the value of talking to the *right* person, not just any person.

> *"Our goal is to get to a billion people in our audience and then to be able to stratify and know what exactly is this person an expert on."*

## [01:20] How Listen Works

The product works in three stages: a researcher types a question (say, "how can we improve Cursor's onboarding?"), Listen's AI agent generates an interview guide, and the interviews are then routed to matched participants from its 30-million-person panel.
Hundreds of conversations run in parallel, the results are synthesized, and recommendations surface. The next stage, launching in a few months, is simulation: after tens of thousands of interviews accumulate on a topic, can Listen predict how customers will answer *future* questions without running a new interview?

> *"As we get closer to AGI, it will be easier to build things, but the hard part will be knowing what to build, and that's what we're building at Listen."*

## [02:23] Customer Wins

Chubbies discovered that chest hair caught uncomfortably on one of their shirt materials; Listen surfaced the feedback, Chubbies redesigned the garment, and comfort scores jumped. Manscaped used Listen insights to reshape a Super Bowl ad. Skims uses it for ongoing product testing. The through-line Alfred draws: Listen handles both small product details and high-stakes campaign decisions with the same workflow, which is to talk to real people, fast.

> *"They discovered that chest hair interface really poorly with one of the materials they have. So it's really uncomfortable to wear one of their shirts, and they changed the shirt and it became radically more comfortable."*

## [03:28] Surveys Versus Reality

Konstantine presses on the classic critique: survey respondents lie, or at least contradict themselves. Alfred's evidence: Listen put the same multiple-choice survey questions to the same people a second time and found radical inconsistency, but when those same people had to reason through an open-ended voice answer, consistency improved sharply. On back-testing against sales data, Alfred agrees A/B tests are the gold standard but notes they require large user bases that most companies don't have. Interview data, properly designed, beats no data.

> *"If you go back to the same person and ask them a survey question in a multiple choice fashion, they're much more inconsistent.
But when you actually have to think and reason through your answer, you're much more consistent."*

## [05:13] Zoom Like AI Interviews

The participant experience is a video call with an AI agent, not a text form. The agent watches facial expressions and vocal tone, giving Listen a second signal layer beyond what people say. Alfred cites advertising testing as the clearest win: a respondent's rating of an ad on a Likert scale and the enthusiasm they actually show on video can diverge, and the on-video enthusiasm predicts Meta and LinkedIn performance marketing results significantly better than the numeric score. Every data point links back to the actual video clip, so researchers can verify the AI isn't hallucinating sources.

> *"For every data point you can always click and then look at the video or see the quote, so you know that AI is not just hallucinating where it's coming from."*

## [07:14] Origin Story

Alfred and his co-founder shipped a consumer app called "Be Fake," an early Stable Diffusion fine-tuning tool for creating AI avatars of yourself, which went viral overnight and hit 20,000 users. Churn spiked immediately and they had no idea why. They built an AI interview tool to ask their own users, found it genuinely useful, and pivoted. The market-research product they built for themselves became Listen Labs.

> *"We built this AI interview for ourselves because we had a ton of churn and we wanted to understand why, and that's how we got started."*

## [08:01] Old World Research

The pre-Listen world had two speeds: slow online survey tools like Qualtrics, or expensive services firms that charge tens of millions to recruit participants, design question methodology, moderate focus groups, and synthesize hundreds of transcripts. Question design alone is an academic discipline: ask "how much would you pay for this?" and you get junk data.
The sourcing problem is equally hard: incidence rates of 10% mean nine out of ten recruited panelists get screened out, burning trust and causing churn in the panel databases themselves.

> *"In traditional industries like CPG or even Microsoft, they spend tens of millions of dollars on focus groups to bring people in a room and interview them, and we can help speed that up much faster."*

## [09:50] AI First Benefits

Three compounding advantages: speed (results from real people in five minutes), cost (asynchronous interviews pay participants less than synchronous ones, and participants accept that willingly), and honesty (people open up more to a non-judgmental AI than to a human interviewer who might silently judge them). Alfred mentions sensitive use cases, such as interviewing children about products with parental consent, as an area where the AI's non-threatening presence produces data that focus groups can't.

> *"People are more honest talking to an AI. It's a very therapeutic experience because it's a non-judgmental entity that's really interested in you."*

## [11:32] Finding The Right People

Listen spends 80% of its engineering resources on audience quality, not the interview agent itself. The reason: customer value follows a power law, so talking to the wrong 100 people gives you the wrong insights. Sweetgreen's most valuable customer is urban, high-income, mostly female, and (Alfred's specific example) knows what seed oils are, which roughly 1% of the population does. Listen builds rich profiles across every interview a panelist ever participates in, so an offhand comment ("I'm a total sneaker head") in an unrelated interview can resurface that person when Nike needs launch feedback. Traditional email-list panels couldn't do cross-topic profiling.
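The cross-interview profiling described above can be pictured as a tag index that grows with every interview a panelist completes. The following is only a minimal sketch under stated assumptions: `PanelProfiles`, its methods, the tag strings, and the panelist ids are hypothetical names chosen for illustration, and the episode does not describe how Listen actually extracts or stores profile attributes.

```python
from collections import defaultdict


class PanelProfiles:
    """Hypothetical index of interests mentioned by panelists in any interview."""

    def __init__(self):
        # panelist id -> tag -> set of interview topics where the tag surfaced
        self._profiles = defaultdict(lambda: defaultdict(set))

    def record(self, panelist_id, interview_topic, tags):
        """Store tags that came up in one interview, whatever its topic was."""
        for tag in tags:
            self._profiles[panelist_id][tag.lower()].add(interview_topic)

    def find(self, tag):
        """Return the ids of every panelist who has ever mentioned the tag."""
        key = tag.lower()
        return sorted(pid for pid, tags in self._profiles.items() if key in tags)


profiles = PanelProfiles()
# Assumed sample data: the tags would come from transcripts of unrelated studies.
profiles.record("p1", "meal-kit onboarding", ["sneakerhead", "urban"])
profiles.record("p2", "meal-kit onboarding", ["urban"])
profiles.record("p1", "streaming pricing", ["high-income"])

# A later sneaker study can recruit from people who mentioned it elsewhere.
sneaker_panel = profiles.find("Sneakerhead")
print(sneaker_panel)
```

Keying the index by panelist rather than by study is the point of the sketch: a tag recorded during one study remains searchable when a different study recruits later.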
> *"Even a product like Sweetgreen, which you would think is for everyone, the right audience is typically urban, high household income, mostly female, and they need to know what seed oils are, which only like 1% of the population does."*

## [14:30] CRM And Prospecting

Sweetgreen already has a CRM full of its most loyal customers, so why use Listen? Three reasons: researching *prospective* customers who aren't yet in the CRM requires an external panel; CRMs are typically disorganized and legally constrained (even Google can't simply email its own Gmail users); and direct outbound email risks getting flagged as spam, which can permanently damage a domain's deliverability. Listen provides clean, third-party panel access that sidesteps all three problems while still supporting CRM-connected campaigns when brands want them.

> *"What we found is that the CRM is typically really unorganized, and sometimes there are regulatory issues; if you're at Google, you can't just send emails to people who use Gmail."*

## [15:35] Consulting In The AI Era

Konstantine, a former buyer of McKinsey-style consulting, asks whether firms like Bain still have a role. Alfred's view: yes, but margins compress. Bain already uses Listen to accelerate existing workflows. The more optimistic scenario: AI doesn't just replace a research project, it makes research cheap enough to run five simultaneous strategic explorations that a company never would have commissioned before. Alfred predicts consulting expands in scope even as price-per-project falls. On economic surplus, Listen has charged hundreds of thousands of dollars to interview 20 doctors across eight countries quickly, a project that previously would have taken months. The surplus is currently staying with the supplier. Alfred also flags an emerging agentic loop: churn interviews surface bugs, which feed directly into a coding agent that opens a PR and ships the fix.
That makes Listen the customer-intelligence "left side" of an autonomous product development cycle.

> *"Because you're able to do it faster, I would argue you should be able to charge more for it, and we have charged hundreds of thousands of dollars to speak to 20 doctors across eight countries."*

## [20:05] Market Research Simulation

This is the episode's technical core. Konstantine frames the evolution as 1.0 (call 100 people manually), 2.0 (AI-native simultaneous interviews), and 3.0 (generative simulation). Alfred explains how Listen's simulation works: interview a single person deeply, build a persona model, then scale to a thousand statistically representative agents. Back-testing holds out a question and measures how accurately the model predicts the answers. They reach 95% on stable preference domains, and they deliberately expose the model to nonsensical queries (dog names) to calibrate what it *can't* predict.

Alfred ran a personal live test: 100 title variants for a conference talk, run through Listen's panel simulation. The top-ranked title performed twice as well as the second. He then ran the same test in ChatGPT, which picked the wrong title when asked to choose between a past successful talk and a less successful one. Listen's domain-specific panel data beat the general model. The explanation for the gap: interview transcripts outperform credit card spend, behavioral logs, or ChatGPT persona prompting because voice conversations capture how a specific *type* of person actually reasons, not just what the average person does.

Looking ahead, Alfred sees simulation handling "billboard tagline" decisions while real interviews remain the standard for Super Bowl ad buys. The product's proprietary eval climbed from 20% to 85% on avoiding repetitive questions, then Listen raised the bar with a harder eval (screen-state awareness, skipping irrelevant questions) and is back at 20%, which Alfred frames as the vertical AI flywheel: a proprietary benchmark that only you can keep climbing.
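The held-out back-test described in this section can be sketched roughly as follows. This is a minimal illustration, not Listen's implementation: the `RESPONSES` data, the question names, and `predict_answer` (a simple similarity-weighted vote standing in for a persona model) are all hypothetical assumptions.

```python
from collections import Counter

# Hypothetical interview answers: respondent id -> {question: answer}.
RESPONSES = {
    "r1": {"fit": "slim", "buys_online": "yes", "top_priority": "comfort"},
    "r2": {"fit": "slim", "buys_online": "yes", "top_priority": "comfort"},
    "r3": {"fit": "slim", "buys_online": "yes", "top_priority": "comfort"},
    "r4": {"fit": "loose", "buys_online": "no", "top_priority": "price"},
    "r5": {"fit": "loose", "buys_online": "no", "top_priority": "price"},
    "r6": {"fit": "loose", "buys_online": "no", "top_priority": "price"},
    "r7": {"fit": "loose", "buys_online": "no", "top_priority": "comfort"},
}


def predict_answer(target_id, held_out, responses):
    """Stand-in predictor: similarity-weighted vote among the other respondents."""
    target = responses[target_id]
    known = [q for q in target if q != held_out]
    votes = Counter()
    for other_id, other in responses.items():
        if other_id == target_id:
            continue
        # Weight each vote by how many known answers the two respondents share.
        overlap = sum(other.get(q) == target[q] for q in known)
        votes[other[held_out]] += overlap
    return votes.most_common(1)[0][0]


def backtest(held_out, responses):
    """Hide one question, predict it for every respondent, return the hit rate."""
    hits = sum(
        predict_answer(rid, held_out, responses) == answers[held_out]
        for rid, answers in responses.items()
    )
    return hits / len(responses)


accuracy = backtest("top_priority", RESPONSES)
print(f"held-out accuracy: {accuracy:.2f}")
```

In this toy data, `r7` is a deliberate outlier that the stand-in predictor gets wrong, so the score is 6 out of 7. The same loop could be pointed at a question with no stable signal to see accuracy collapse, which is the calibration role the dog-name queries play in Alfred's description.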
> *"We were able to get 95% accuracy to predict how they will answer certain questions. The tricky part is knowing what things you can answer and what you can't."*

## [35:33] Closing Thoughts

Alfred's conviction: human input will always be necessary because humans are inherently irrational. TikTok trends can overturn a marketing strategy overnight, and no AGI will preempt that. His uncertainty: the ceiling for simulation quality. His moat argument: network effects on the panel (supply-demand flywheel), data network effects (more interviews → better simulation), and product stickiness (interview history compounds inside the platform). But the simplest advantage he cites is opinionated defaults: early customers using vanilla LLMs to design their own interview guides got bad data and blamed Listen; now the agent enforces question-design best practices and data quality is consistent.

Konstantine ends with the "Tide Pods moment" question: can Listen's AI start *generating* product ideas mid-interview rather than just testing them? Alfred says customers already feed AI-generated images into interviews manually; the MCP integration means Claude can loop Listen calls autonomously. The vision is live brainstorming between the AI interviewer and the respondent, with ideas surfacing as the customer articulates a pain, not after.

> *"Founders want to build something that's complex X, but customers want something that's stupid simple and it just works. And that's the advantage you have as a vertical AI company: you can train the agent to follow best practices in the work that you do."*

## Entities

- **Alfred Wahlforss** (Person): Co-founder and CEO of Listen Labs; previously built "Be Fake," a viral AI-avatar consumer app.
- **Konstantine Buhler** (Person): Partner at Sequoia Capital; host of the Training Data podcast; former consultant and operator.
- **Listen Labs** (Organization): AI-first customer research platform; runs voice interviews with a 30-million-person panel; building generative simulation.
- **Market Research Simulation** (Concept): Building persona models from accumulated interview data to predict future customer responses without running new interviews; back-tested against held-out questions.
- **Audience Quality** (Concept): Listen's thesis that research value depends on recruiting the right respondents (power-law customer segments), not just any panelists; the company puts roughly 80% of its engineering resources here.
- **Be Fake** (Software): Alfred's earlier consumer app (AI avatar fine-tuning via Stable Diffusion); the origin of Listen's interview tooling.
- **Bain** (Organization): Management consulting firm; cited as an active Listen customer using the platform to accelerate traditional research workflows.
- **Procter & Gamble** (Organization): Cited as the historical archetype of market-research-driven brand management; Tide Pods and M&M's given as canonical examples.
- **Qualtrics** (Software): Legacy survey platform representing the "old world" of market research tooling.
Neuralink's DJ Seo: Inside the Race to Connect Brains and AI
At AI Ascent 2026, Neuralink co-founder and president DJ Seo sits down with Sequoia partner Shaun Maguire to lay out exactly where the company stands: 20-plus Telepathy patients controlling computers and robotic arms through pure thought, Blindsight in preclinical testing and potentially cleared for human use by the end of 2026, and a first-principles manufacturing philosophy borrowed from Elon Musk that treats surgical robots the way SpaceX treated reusable rockets. DJ argues that the real ceiling of this technology is not cursor control or speech synthesis but direct, uncompressed, multimodal transfer of concepts (AI as a neocortical layer sitting above the human limbic system), and that scale, the same variable that unlocked the LLM era, is the only remaining gate.

## [00:00] Introduction

Shaun Maguire opens the session by announcing a two-minute Neuralink patient video before the interview begins, telling the audience to stay on the side because what they are about to watch is proof that the company has already cleared the hardest bar: restoring human agency to people who had lost it entirely.

## [00:21] Telepathy Patient Stories

The video narrates four patients whose lives changed after receiving the Telepathy implant. A quadriplegic patient describes moving a cursor with thought alone: "I'm thinking and a cursor is moving on a screen. It blew my mind." An ALS patient who lost the ability to speak regains a digital voice through the implant: "I'm talking to you with my mind." Another patient notes that the implant flipped how his child sees him: "I am not able to do things that other dads can, but now he thinks it's so cool that I can do things that other dads cannot."

> *"Before the implant, I was locked in, non-verbal, quadriplegic.
Now I control my computer just by thinking and the rewards have been immense for me."*

## [01:06] Convoy Robotics Independence

The video shifts to Convoy, Neuralink's assistive robotics team, which is extending BCI control beyond a screen to physical manipulation in the real world. A patient who had been losing motor function moves a robotic arm through its axes using only neural intent: "It was incredible to be able to just gesture with an arm again." A second patient, Kenneth, who was losing his voice to ALS, uses the system's speech synthesis to speak aloud in real time during the video, with words generated from his brain signals rather than his vocal cords.

> *"Gaining functionality that I thought was gone forever was so incredibly life-changing."*

## [02:04] Blindsight Vision Restore

The video previews Blindsight, Neuralink's second product line, designed for patients who have lost both eyes or optic nerve function. An external camera captures the visual scene; the device writes the signal directly into the visual cortex via electrical stimulation, generating phosphenes, which are artificial pixels of light. A patient named Audrey, asked how it feels, answers simply: "Life-changing." The video closes with the line "all with my mind" spoken over footage of a patient interacting with the world through the restored signal.

> *"The future of this technology feels almost unlimited... we are finding ways to apply it across all regions of the brain."*

## [03:10] After Video Reflections

DJ Seo, visibly moved after watching the video alongside the audience, speaks first: "We were cracking a lot of jokes before that video, but honestly, that brought tears to my eyes." He describes the work as one of the most inspiring projects in the world, not because of the technical milestone but because the team is giving back capabilities that patients had already grieved as permanently lost. Maguire affirms the sentiment before pivoting to the founding story.
> *"This is one of the most inspiring projects in the world. It's incredibly difficult what they're doing and I mean, they're truly saving people."*

## [03:31] Origin Story And AI

DJ traces Neuralink's founding insight to a single bottleneck: the mismatch between human output bandwidth and AI capability. In 2016, saying that out loud "sounded insane," but the logic has not changed. His personal path ran through a childhood fascination with the brain, undergraduate work at Caltech building miniaturized low-power electronics, and a Berkeley PhD focused on shrinking lab-grade neural systems down to something deployable. When he met Elon Musk near the end of his PhD, the scale and ambition of the project made refusal impossible. He frames the brain as "the most interesting compute that we all carry" and "the only form of general intelligence that we know to date."

> *"Really the key insight back then was sort of the IO bottleneck between the human output and AI capabilities."*

## [06:31] Scaling And Vertical Integration

Maguire presses on what smart people most misunderstand about Neuralink: many know the implant and the decoding algorithm, but almost nobody grasps the manufacturing and surgical-robot infrastructure the company built in parallel from day one. DJ attributes this to what he calls "Elon magic," an insistence on vertical integration that gives Neuralink control over every layer from chip design to factory floor to robotic surgery deployment. The target is not a niche medical device; it is LASIK-scale surgery available to millions. Building that capacity first means progress looks slow until "the iceberg pops over the waterline" and ramp becomes near-instantaneous.
> *"Vertical integration is something that is really the lifeblood of Neuralink and Elon companies and what really enables us to have that fast iteration loop from design, develop, deploy."*

## [09:27] Caregivers And Purpose

Asked which patient story inspires him most, DJ refuses to pick one. The power, he says, is not only in the patients but in the caregivers: Nolan's mother Mia, Brad's wife Tiffany, Ken's wife Cheryl. He describes their presence as "a really powerful human story of love, sacrifice, and resilience." He then takes what he calls a philosophical tangent: his core belief is that fulfillment comes from helping others, because the gap between self and other is not categorically different from the gap between your present and future selves. That belief is what he says keeps him and much of the Neuralink team going: they are "igniting a fire of hope" for people who had given up on recovering what they lost.

> *"I personally and as well as many others at Neuralink find extreme fulfillment being able to help those that really cannot help themselves."*

## [13:10] BCIs Meet AI Future

Maguire asks the room's core question: how do BCIs and AI converge? DJ sketches a two-horizon answer. Near term, the system translates neural intent into legacy interfaces (keyboard, mouse, language), which is already working. The real breakthrough, which he thinks is "not super distant," is bypassing those legacy interfaces entirely and computing on raw neural intent. He points to transformer architectures as existence proofs: nothing prevents them from learning the latent manifolds of neural data given sufficient scale. Neuralink is already fine-tuning LLM-class models on neural recordings from its 20 participants and finding "very counterintuitive" patterns. The ultimate ceiling he names is "direct, uncompressed, high-fidelity, multimodal transfer of concepts": the Matrix's "I learned kung fu" moment and possibly beyond it.
He also shares what he calls a clarifying lesson from working with Musk: the "all green light schedule," a first-principles forcing function that strips away every man-made bottleneck and asks how fast something could actually be built if every light were green. His estimate is that 80–90% of perceived constraints in hardware development are artifacts of convention, not physics.

> *"I think if you really think about the ultimate ceiling of this technology, it's really direct uncompressed high fidelity and multimodal transfer of concepts."*

## [21:05] Audience Q&A Wrap

Three audience questions in the final four minutes. On product sequencing (when to go deep versus expand), DJ explains the "beachhead and expand" strategy: build everything generalizably enough from the start so that regulatory approval for motor cortex becomes a template for visual cortex and beyond. The first approval is the hardest; every subsequent one rides the clinical safety record already established. On augmentation for healthy users, DJ frames everything around benefit-risk: the calculus is obvious for quadriplegic patients; for otherwise healthy users it remains unclear, but he notes that off-label use after approval is legally available to anyone who can find a neurosurgeon and pay out-of-pocket. On the hard problem of consciousness, he gives a pointed one-liner: if you can inject new senses and measure the subjective response quantitatively, you may have a pathway toward measuring consciousness itself. Maguire closes by calling Neuralink "one of the most inspiring companies in the world."
> *"If you are able to inject new senses, there may be ways to quantitatively understand that."*

## Entities

- **DJ Seo** (Person): Co-founder and president of Neuralink; PhD in miniaturized electronics from Berkeley; joined after meeting Elon Musk near the end of his doctorate
- **Shaun Maguire** (Person): Partner at Sequoia Capital; host of the AI Ascent 2026 fireside session
- **Elon Musk** (Person): Co-founder of Neuralink; originator of the "all green light schedule" and vertical integration philosophy carried across Tesla, SpaceX, and Neuralink
- **Neuralink** (Organization): BCI company founded in 2016; products include Telepathy (motor prosthesis) and Blindsight (vision restoration via visual cortex stimulation)
- **Telepathy** (Software): Neuralink's first commercial product; allows paralyzed patients to control computers and robotic devices through neural intent decoding
- **Blindsight** (Software): Neuralink's second product line; restores vision for patients with total loss of eyes or optic nerve by writing directly to the visual cortex; in preclinical testing as of mid-2026
- **IO Bottleneck** (Concept): The mismatch between human output bandwidth (speech, typing, gesture) and AI processing capability; the founding problem Neuralink was built to solve
- **Neural Foundational Model** (Concept): LLM-class transformer models fine-tuned on neural recording data; Neuralink is building these at 20-participant scale and observing counterintuitive patterns in neural latent space
- **All Green Light Schedule** (Concept): Elon Musk's first-principles engineering discipline: strip every man-made constraint and ask what physics alone limits; DJ estimates 80–90% of hardware delays are conventional, not physical
How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance RL
Cursor's Federico Cassano and Fireworks' Dmytro Dzhulgakov walk Sonya Huang through the full technical stack behind Composer 2, from the Kimi 2.5 MoE base through large-scale mid-training to globally distributed async RL. They explain why specialization beats general-purpose models on both cost and quality. Infrastructure is the axis of the whole conversation: GPU clusters spread across four continents, a Delta Compression scheme that ships a 1 TB weight snapshot in under a minute, and a real-time RL loop that keeps updating the model every few hours from real user signals. Combined, these techniques let Cursor deliver frontier-level coding performance at a fraction of the inference cost of general-purpose models.

## [00:00] Introduction

The episode opens with a problem raised by Dmytro: the realism of RL environments. The training environment has to look like a real user's machine, because models can sense that they are running in a fake environment and exploit it.

> *"Models love to cheat. RL is very good at encouraging even more cheating."* (Federico Cassano)

That one remark sets up the technical rigor of the whole episode: every piece of the infrastructure exists to close the gap between training conditions and production reality.

## [00:53] Why Cursor Trained Composer 2

Federico explains the core idea behind Composer 2 with a direct analogy: a model's weights are a storage drive of fixed size, and every bit not spent on the tasks Cursor cares about is wasted. Devoting the full weight capacity to software engineering inside Cursor (not coding in general, and not natural language) makes the model both better at its one job and cheaper at inference. Dmytro makes the same point from the infrastructure side: prompt engineering only takes you so far; the exact behavior of the harness (which tools, in what order, with which arguments) can only be instilled in the model through fine-tuning and RL.

> *"There is a ceiling on how far you can go with prompt engineering. And if you really want to build a great AI product, you have to go through fine-tuning and influence the model's behavior."* (Dmytro Dzhulgakov)

## [04:55] Specialization Versus The Bitter Lesson

Sonya raises the question: ML history is full of specialist models that were flattened by larger general models. Will Composer 2 repeat TabNine's mistake? Federico's answer: no. The bitter lesson operates on the scale of parameters and data; what Cursor is doing is freeing the model's limited capacity from irrelevant topics so that the full benefit of bitter-lesson scaling goes to the one job that matters. The lab models Cursor competes with also train heavily on code; they are not purely general. Cursor simply pushes that specialization further while keeping full control of the data pipeline.

## [06:16] Composer 2 Training Recipe

Composer 2 starts from Kimi 2.5, a 1-trillion-parameter mixture-of-experts model with 30B active parameters. Training happens in two stages: first, mid-training on code tokens at close to pre-training scale (Cursor's product data supplies high-quality coding context), then large-scale RL in which the model runs real Cursor agent sessions in simulated environments.

Mid-training familiarizes the model with the world of code: library APIs, idiomatic patterns, correct syntax. RL then turns that knowledge into correct behavior: the model learns how to call tools, how to navigate multi-turn agent sessions, and how to write code that compiles and passes tests. The async pipeline means the trainer and the rollout environments run concurrently rather than taking turns; some staleness is accepted in order to keep GPU utilization near 100%.

> *"Being async can cost you a few percent, but you more than make up for it by not leaving half your capacity idle."* (Dmytro Dzhulgakov)

Training runs in FP4 to get maximum throughput out of a GPU fleet smaller than a frontier lab's. The inference engine is Fireworks, not an in-house one. That was a deliberate decision so that Cursor's engineers can focus on training efficiency.

## [16:32] Scaling RL Infrastructure Worldwide

No single large contiguous cluster was available at the scale Composer 2 needed, so the team split the work: one cluster handles all training, while inference (that is, rollouts) runs on four geographically distributed clusters, including off-peak capacity from Composer 1.5's production serving. Training needs a high-speed interconnect and lockstep operation; inference does not, so it can run on different GPU generations.

The hard systems problem is weight synchronization: Kimi 2.5 is about 1 TB, and the trainer produces a new checkpoint every 5-15 minutes. Shipping 1 TB across continents every 10 minutes would stall inference. The solution: RL updates are typically sparse and regular, so the team wrote a Delta Compression algorithm that shrinks the payload by roughly 20x and sends only the diff. The receiver rebuilds the full checkpoint without any numerical discrepancy.

> *"Even though the whole model is 1 terabyte, not all the weights change at every step… the subset of weights that does change has fairly regular patterns."* (Dmytro Dzhulgakov)

## [23:32] Floating Point Drift

When the async RL loop sends a batch of rollout trajectories from inference to the trainer, the trainer runs the same forward pass to recompute log probabilities for the GRPO loss. In theory the log probs should be identical. In practice they often differ, sometimes by a lot.

The root cause is floating-point non-determinism: floating-point addition is not associative, so (A + B) + C can differ from A + (B + C), and summing the same values in a different order can give a slightly different result. Across billions of operations the small differences add up. In ordinary inference the model is unaffected by this noise, but in RL, especially with a sparse MoE gating function, the noise is amplified to the point that the trainer and the inference engine disagree about the sampled tokens, which corrupts the training signal.

## [25:11] MoE Sensitivity Explained

The MoE architecture amplifies floating-point drift because of the gating layer. In every transformer layer, the gating network scores all 384 experts and picks the top 8 for each token. A difference in the hidden states at the fifth decimal place can be enough to select expert 9 instead of expert 7, which sends the token through an entirely different part of the model. MoE experts are large and largely non-overlapping, so a wrong expert selection causes a large output divergence, unlike a dense model, where numerical noise stays small.

## [26:25] Router Replay Fix

The fix is Router Replay: during inference the model records the indices of the experts activated for each token and sends them to the trainer along with the generated sequence. The trainer then forces the same expert selection instead of recomputing it from scratch, which breaks the amplification chain. Alongside Router Replay, the team also aligned quantization levels and kernel implementations between inference and training to reduce every other numerical mismatch.

> *"All of this numerical alignment is basically doing tricks like these, matching quantization levels and matching kernels, so that divergence between the training and inference implementations is kept to a minimum."* (Dmytro Dzhulgakov)

## [27:19] Real-Time RL Loop

Alongside the simulated rollout loop, Cursor runs something else that Federico calls real-time RL: real user sessions in production feed directly into the training pipeline. When a user is happy or unhappy with a Composer generation, that signal is captured, and a new version of the model ships every few hours.

The simulated loop and the real-time loop serve different purposes. In simulation, the model can run 16-128 rollouts in parallel from the same prompt (grouped rollouts are required for the GRPO loss), explore without affecting any user, and bootstrap performance before the model is good enough to put in front of users. Real-time RL is a refinement layer that only works once the model is already at a minimum quality bar: users who get a bad experience stop providing feedback signals.

> *"We can't build the model from scratch this way, because users have to use the model. And for that it already has to be good; we can only make it better."*
— Federico Cassano ## [31:49] लॉन्ग-होराइज़न एजेंट जैसे-जैसे rollout horizons लंबे होते हैं, दो structural समस्याएं उभरती हैं। पहली, credit assignment: कई मिनट के सेशन के अंत में एक thumbs-up/thumbs-down reward से मॉडल को यह पता लगाना होता है कि trajectory के 50+ फैसलों में से कौन से ने outcome तय किया। trajectory लंबी होने पर यह exponentially कठिन होता जाता है। दूसरी, context window भर जाती है। Cursor का समाधान है self-summarization को सीधे RL लूप में "compaction" नाम से बेक करना: मॉडल RL reward से सीखता है — context limit के करीब पहुंचने पर अपनी progress का उपयोगी summary लिखना और उस summary से आगे बढ़ना। 200K-context मॉडल प्रभावी रूप से लाखों tokens पर काम करता है क्योंकि वह अपनी window reset कर सकता है और working memory compressed form में आगे ले जा सकता है। > *"RL के ज़रिए — क्योंकि RL मॉडल को लक्ष्य की ओर सही काम करने के लिए प्रेरित करती है — हम एक साथ मॉडल को अच्छा summary लिखना और उस summary को ध्यान से सुनना, दोनों सिखा रहे हैं।"* — Federico Cassano ## [34:29] RL हर जगह क्यों Sonya RL को विशेष रूप से agentic, long-horizon tool use के लिए उपयोगी बताती हैं। Federico असहमत हैं: RL हर जगह काम आती है, tab completion के लिए भी। उनका सिद्धांत है: pre-trained मॉडलों ने मानव ज्ञान सब सोख लिया है लेकिन वे नहीं जानते कि prompt पर कौन सा persona अपनाएं — expert, student, या कुछ बीच का। RL training का पहला चरण उस distribution को तेज़ करता है, मॉडल को बताता है "तुम expert हो, यह सही करो।" यह प्रभाव ऐसे tasks के लिए भी मूल्यवान है जिनका कोई interactive harness नहीं, जैसे summarization। दूसरा चरण — जहाँ मॉडल दिखने-योग्य तरीके से reason करने लगता है — वह है जहाँ task-specific signal वाकई compound होती है। ## [37:34] LLM as Judge रिवॉर्ड जितना verifiable reward हो — क्या कोड compile हुआ, क्या tests पास हुए, क्या जवाब numerically सही है — उतना ही ज़्यादा compute RL में लगाकर बेहतर मॉडल मिलता है। LLM-as-judge उन tasks के लिए इस gap को भरता है जहाँ ground truth परिभाषित करना मुश्किल है — एक rubric को prompt के रूप में encode कर एक दूसरे मॉडल से rollout quality evaluate करवाई 
जाती है। Dmytro बताते हैं यह style-oriented tasks के लिए खासतौर पर उपयोगी है जैसे summarization, जहाँ human raters "अच्छा" articulate नहीं कर पाते लेकिन explicit criteria के खिलाफ evaluate कर सकते हैं। > *"आमतौर पर जितना verifiable आपका reward है, उतना बेहतर है, क्योंकि यह आपको compute scale करने और बेहतर outcome पाने देता है।"* — Dmytro Dzhulgakov ## [39:14] कठिन डोमेन में RL उन domains के लिए जहाँ ground truth सस्ते में compute नहीं होती — creative writing, open-ended reasoning, domain expertise — RL को बेहतर बनाने का रास्ता environment को समृद्ध करना है। बड़े सिम्युलेटेड environments जो product metric को ज़्यादा capture करते हैं, automated evaluation को आगे ले जाने देते हैं। Experts ज़रूरी रहते हैं — individual rollouts judge करने के लिए नहीं, बल्कि ऐसे tasks और rubrics design करने के लिए जो reward function को define करें। ## [40:13] अपना खुद का Environment बनाएं Cursor किसी RL environment vendor का इस्तेमाल नहीं करता। कोडिंग के लिए GitHub repositories लगभग असीमित working environments का pool देते हैं: एक repo clone करें, dependencies install करें, मॉडल को task दें, और test suite के खिलाफ outcome measure करें। कठिन infrastructure समस्या यह है कि इन environments को इतना realistic बनाएं कि एपिसोड की शुरुआत में बताई गई cheating न हो सके, और इतना fast कि एक साथ 100,000 environments on demand spin up हो सकें। Cursor का जवाब है एक custom virtual machine stack — containers नहीं, full VMs — जो किसी भी scale तक तुरंत burst कर सकता है और असली यूजर मशीनों की इतनी closely नकल करता है कि मॉडल अंतर नहीं भांप सके। Dmytro vendor landscape को इस तरह रखते हैं: frontier labs को हर task के लिए generic environments चाहिए; product companies को अपने प्रोडक्शन environment के खिलाफ RL करनी चाहिए। किसी भी मॉडल के लिए सबसे शक्तिशाली training environment वह product है जिसमें वह असल में इस्तेमाल होगा। > *"सबसे शक्तिशाली environment आपका अपना product है।"* — Dmytro Dzhulgakov ## [44:34] समापन विचार Sonya इस बात पर ध्यान दिलाती हैं कि Cursor का सफर — application company से frontier model lab 
तक — वही pattern है जिसे अन्य AI product companies भी अपनाएंगी। Federico Fireworks का शुक्रिया अदा करते हैं जिसने Cursor के GPU budget में training run संभव बनाया। Dmytro उस system engineering की गहराई पर विचार करते हैं जो एक ऐसी समस्या में थी जिसे अधिकांश लोग purely algorithmic मानते थे। ## एंटिटी - **Federico Cassano** (व्यक्ति): Cursor में Composer 2 के research lead; training recipe और RL methodology का नेतृत्व किया। - **Dmytro Dzhulgakov** (व्यक्ति): Fireworks AI में infrastructure lead; Composer 2 के लिए distributed RL training system तैयार किया। - **Sonya Huang** (व्यक्ति): Sequoia Capital में Partner; AI निवेश पर केंद्रित podcast की host। - **Composer 2** (सॉफ्टवेयर): Cursor का विशेषज्ञ agentic coding मॉडल, जिसे Kimi 2.5 MoE पर मिड-ट्रेनिंग और बड़े पैमाने की RL से ट्रेन किया गया। - **Fireworks AI** (संगठन): Model serving और inference infrastructure कंपनी जिसने Composer 2 RL training के लिए distributed GPU backbone दिया। - **Cursor** (संगठन): AI coding IDE कंपनी; अपने product के भीतर software engineering के लिए Composer 2 को specialized foundation मॉडल के रूप में ट्रेन किया। - **Kimi 2.5** (सॉफ्टवेयर): Moonshot AI का open-source 1 ट्रिलियन पैरामीटर MoE मॉडल (30B active); Composer 2 का base। - **GRPO** (अवधारणा): Group Relative Policy Optimization — Composer 2 के लिए इस्तेमाल किया गया RL algorithm, जिसके लिए policy gradient compute करने हेतु एक ही prompt से कई parallel rollouts ज़रूरी हैं। - **Router Replay** (अवधारणा): MoE numerical alignment की तकनीक जिसमें inference expert routing decisions रिकॉर्ड करता है और trainer को replay करता है, floating-point drift से log probabilities के diverge होने को रोकता है। - **Real-Time RL** (अवधारणा): Cursor का production feedback लूप जो live यूजर satisfaction signals capture करता है और मॉडल को continuously अपडेट करता है, हर कुछ घंटों में नया version ship करता है। - **Delta Compression** (अवधारणा): वेट synchronization तकनीक जो training और distributed inference clusters के बीच केवल changed parameters transmit करती है, 1 TB 
snapshots को व्यवहार में ~50 GB तक कम करती है। - **Self-Summarization / Compaction** (अवधारणा): RL-trained क्षमता जिसमें एजेंट context window limit के करीब पहुंचने पर अपना working context compress करता है, effectively unlimited-horizon operation देता है।
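
The Router Replay entry above can be illustrated with a small sketch. This is a toy under stated assumptions, not Cursor's or Fireworks' implementation: the function names (`top_k_experts`, `route`), the example logits, the softmax-over-chosen-experts gating, and the choice to recompute gate weights from the trainer's own logits while forcing the recorded indices are all hypothetical. It only shows how a tiny numerical drift can flip the selected expert, and how replaying the recorded index avoids that.

```python
import math
from typing import List, Optional, Sequence, Tuple


def top_k_experts(router_logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest router logits (ties go to the lower index)."""
    ranked = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))
    return sorted(ranked[:k])


def route(
    router_logits: Sequence[float],
    k: int,
    replay: Optional[Sequence[int]] = None,
) -> Tuple[List[int], List[float]]:
    """Choose experts for one token and return (expert indices, gate weights).

    With replay=None the experts are recomputed from the logits.
    With replay given, the recorded indices are forced instead.
    """
    chosen = list(replay) if replay is not None else top_k_experts(router_logits, k)
    # Softmax over the chosen experts' logits only (an assumed gating scheme).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


# Hypothetical logits for one token: same weights, tiny drift between stacks.
inference_logits = [0.10, 0.50001, 0.50000, 0.20]
trainer_logits = [0.10, 0.50000, 0.50001, 0.20]

recorded, _ = route(inference_logits, k=1)  # inference records its choice
free, _ = route(trainer_logits, k=1)  # trainer recomputing from scratch
forced, gates = route(trainer_logits, k=1, replay=recorded)  # trainer replaying

print(recorded, free, forced)  # [1] [2] [1]
```

Without replay, the trainer would score the token under expert 2 even though inference generated it under expert 1; with replay, both sides use the same expert, so any remaining mismatch comes only from the arithmetic inside that expert.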
Notion’s Ivan Zhao: The Refounder
Brian Halligan interviews Notion co-founder Ivan Zhao on his journey as a 'refounder' who navigated the company through its 2015 Kyoto restart and the 2023 generative AI pivot. Zhao details Notion's transition from a traditional SaaS structure to an AI-native 'jazz band' model that prioritizes technical versatility, taste, and agency over rigid hierarchies. The discussion explores how AI acts as the 'steel' for modern organizations, enabling flatter structures and faster, more reversible decision-making.

## [00:00] Introduction

Brian Halligan introduces Ivan Zhao as the 'refounder' of Notion, highlighting his ability to restart the company at critical junctures in 2015 and 2023. The conversation sets the stage for Zhao's transition from a traditional SaaS management model to an AI-native organization. Halligan compares Zhao's approach to that of other tech visionaries like Jack Dorsey, emphasizing the importance of personal style and 'taste' in building a lasting brand.

> *I like to think of him as the refounder... he's the canonical example of how a SaaS company can move and become an AI company. [00:52]*

> *We want to be a jazz band, not a marching band. [00:02]*

## [02:22] From Founder Mode to AI Org

Ivan Zhao discusses his detour into traditional delegation and professional management before returning to a hands-on 'founder mode' necessitated by the AI shift. He explains that building with language models is less like predictable bridge engineering and more like 'brewing beer,' where the underlying technology dictates the development path. Zhao emphasizes hiring 'jazz band' people (versatile individuals like designers who code) to navigate the experimental nature of AI integration.

> *Building with language model... is like brewing beer. You can't truly predict the things the underlying thing. [06:33]*

> *The spirit is technology first-driven development rather than customer-driven first development. [07:01]*

## [11:00] Hiring for Taste and Agency

Notion uses a 'barbell' hiring strategy that targets both super-junior and super-senior talent while avoiding the 'middle' of traditional SaaS experience. Zhao defines talent as the product of capability, taste, and agency, noting that AI has democratized basic capabilities like coding and writing. Consequently, the company now optimizes for 'agency' and 'taste,' qualities that remain difficult to automate and serve as the primary differentiators for the brand.

> *capability got normalized democratized and taste becomes still important [11:53]*

> *So the shape it's not it's more like the barbell barbell shape, right? [12:35]*

## [24:28] Refounding Notion in Kyoto

In 2015, facing potential failure and low morale, Zhao and co-founder Simon Last laid off their entire staff and relocated to Kyoto, Japan, to rebuild Notion from scratch. This 'Kyoto Reset' allowed them to focus entirely on craft and coding while living a minimalist lifestyle. Zhao chose Kyoto specifically for its status as the 'craft capital of Asia,' which provided the spiritual inspiration needed to view software as a fundamental human tool.

> *So my co-founder and I said let's just lay off everybody just go by the two of us. That's the Japan story. [25:41]*

> *The story we tell ourselves is like Kyoto is a special place. If you can pull off anywhere, you can pull off from Reborn in Kyoto. [28:05]*

## [30:27] Craft Versus Commerce

Zhao views Notion as part of a historical lineage of 'tools for thought,' tracing back to pioneers like Douglas Engelbart and Alan Kay. He criticizes modern Silicon Valley 'tinker culture' for ignoring the history and humanity behind technology. For Zhao, the goal is to find an equilibrium between the pure craft of an artist and the commercial viability of a business, ensuring the product has a 'soul' that resonates with users.

> *Tech is like industry doesn't know its past. If you don't know his past you don't know history which is humanity. [31:52]*

> *I need to be in equilibrium with my own value of what this company I want to build... [51:33]*

## [32:26] When to Refound

For founders whose companies are stagnating, Zhao suggests listening to the 'inner urge' to take drastic action rather than wasting years on ventures without momentum. He argues that refounding is often harder than starting fresh because it involves taking a significant step back to pivot toward a new growth engine. Zhao believes the current AI-driven market is wide open, making it an ideal time for founders to be risk-seeking and follow their intuition.

> *For me it's like there's you just feel you have to do something drastic... then you feel liberated once you land in Japan. [32:56]*

> *The refounding is harder than it looks. It typically involves like a big step back and two steps forward. [59:57]*

## [34:07] GPT-4 Refounding Shock

Zhao describes gaining early access to GPT-4 as a 'full body religious experience' that signaled a fundamental shift in the world. This realization forced a second refounding of Notion, as Zhao felt any work not involving this technology would soon become meaningless. The transition included a grueling 18-month period of low morale while the team waited for the underlying AI models to catch up with their ambitious product vision.

> *GPT-4 is a religious experience for me. It's like holy [ __ ]... anything you do if you don't do this it will be meaningless. [34:27]*

> *that was like a year and a half just go with no error and morale is definitely low [35:50]*

## [45:35] Leadership and Founder Energy

Despite being naturally introverted, Zhao explains how he forced himself to master one-to-many communication to build trust within Notion. He maintains a disciplined daily routine, starting at 7 AM and often working until midnight, while using 'guilty pleasure' reading to recharge. To prevent organizational calcification, Notion aggressively acquires startups to bring in 'founder energy,' and currently employs over 50 former founders who lead critical domains.

> *To lead the group of human you need to do one to many communications otherwise people don't trust you. [46:17]*

> *founders are are kind of this kind of like little decalcified meathead machinery just trying to break things [39:10]*

## [53:17] Sales Culture and Closing Thoughts

Notion's transition to enterprise sales involved moving away from 'first-principle' experimentation toward established playbooks, pairing system thinkers with high-energy sales leaders. The conversation concludes with a vision of the 'AI-native' CEO playbook, which replaces traditional 'triangle' hierarchies with a 'circular' model. In this structure, a centralized AI system saturated with company context enables smaller teams to move at breakneck speed with reversible decision-making.

> *You should only have each company should only preserve your innovation point to few places... [54:54]*

> *All of those kind of one-way doors that Bezos used to talk about are really two-way doors... [62:39]*

## Entities

- **Ivan Zhao** (person): Co-founder and CEO of Notion, known for his 'refounder' mindset.
- **Brian Halligan** (person): Co-founder of HubSpot and interviewer.
- **Notion** (organization): A productivity software company that pivoted to an AI-native model.
- **Simon Last** (person): Co-founder of Notion who helped rebuild the company in Kyoto.
- **Kyoto** (location): The Japanese city where Notion was restarted in 2015.
- **GPT-4** (technology): The AI model that triggered Notion's second refounding.
- **Steve Jobs** (person): Former Apple CEO cited as an inspiration for refounding and craft.
- **Jack Dorsey** (person): Tech leader mentioned for his AI-centric organizational redesign.
- **Douglas Engelbart** (person): Computing pioneer in the 'tools for thought' lineage.
- **Erica** (person): CRO of Notion and former CRO of GitHub.
- **SaaS** (concept): Software as a Service, the industry context for Notion's evolution.
- **Jazz Band** (concept): Metaphor for a flexible, high-agency organizational structure.
Suno's Mikey Shulman: Everyone Can Make Music Now
Mikey Shulman, co-founder of Suno, discusses the platform's evolution from a physics-based startup to a leader in generative AI music. By modeling music as raw sound waves rather than traditional theory, Suno empowers users to transition from passive listeners to active creators in the era of 'creative entertainment.'

## [00:00] Physics, Raw Sound, and Technical Philosophy

Mikey Shulman explains how his background in quantum physics at Harvard influenced Suno's interdisciplinary approach to music technology. By modeling audio as raw sound waves sampled 48,000 times per second rather than using traditional music theory, Suno avoids creative constraints and allows for the emergence of new, microtonal genres.

> *I think what I mostly learned is playing at the nexus of two things that don't usually play together is just a massive opportunity. [02:00]*

## [02:15] The Pivot to Consumer Music Generation

Initially focused on audio analysis, the Suno team pivoted to generation after breakthroughs in audio compression made high-quality output computationally feasible. They validated the product's 'fun factor' through a Discord bot, discovering that the addictive nature of creation was a stronger signal than traditional business use cases.

> *When you are staying up late playing with the thing, and you don't want to go to sleep, it's like a really good sign. [04:49]*

## [11:41] Why Music AI Is a Research Problem, Not a Scale Problem

Unlike large language models, music generation lacks objective benchmarks, making raw compute scale less effective than targeted research. Shulman emphasizes using human preference data and reinforcement learning to align models with creative tastes, favoring a steady release cadence over long-term isolated development.

> *In music there are no right answers. There are no benchmarks. Um, and so scale is somewhat less helpful in solving it. [12:28]*

## [16:22] From Passive Consumption to Creative Entertainment

Shulman introduces the concept of 'creative entertainment,' where the act of building provides more fulfillment than the final product itself. He notes that 90% of Suno users are active creators, drawing parallels to the 'bedroom producer' era where accessible tools led to the discovery of new genres.

> *People are creating music for the fun and enjoyment and fulfillment that comes with being creative. [17:05]*

## [22:52] Industry Partnerships and Professional Integration

Addressing industry concerns, Shulman highlights Suno's partnership with Warner Music Group and its role in augmenting professional workflows. He argues that AI will raise the quality ceiling for artists and predicts that interactive live performances, such as audience participation at Coachella, are the next frontier.

> *I think people incorrectly assume that we hate the existing music industry and especially we hate the record labels. [23:17]*

## [25:53] Product Strategy and the Application Moat

Suno prioritizes the application layer and user experience as its primary competitive moat, viewing itself as a music company rather than just a technology firm. By focusing on storytelling through full-length lyrical songs and social co-creation features, the company aims to revive the cultural impact of music as a social medium.

> *It's unclear how much moat exists in only a model... it's just really undervalued to invest in the product and the UI and the UX. [26:50]*

## Entities

- **Mikey Shulman** (person): CEO and co-founder of Suno with a PhD in physics from Harvard.
- **Suno** (organization): An AI-powered creative entertainment platform for music generation.
- **Sonya Huang** (person): Partner at Sequoia Capital and host of the interview.
- **Warner Music Group** (organization): A major global record label that partnered with Suno.
- **Discord** (organization): The platform where Suno initially launched its music generation bot.
- **Harvard** (organization): The university where Mikey Shulman studied quantum computing.
- **Iamona** (person): A poet and artist who uses Suno to create music, illustrating the tool's professional potential.
- **Coachella** (event): A major music festival cited as a future venue for interactive AI music experiences.
Robotics' End Game: Nvidia's Jim Fan
Jim Fan, lead of Nvidia's embodied AI research, outlines the transition from language-centric models to World Action Models (WAMs) that simulate physical reality. He details a roadmap toward the 'Physical Turing Test' and autonomous factories by 2040, driven by video pre-training and human egocentric data scaling.

## [00:00] Introduction

Host Sonya Huang introduces Jim Fan, who leads Nvidia's embodied autonomous research group. Fan reflects on his early days as an intern and the excitement surrounding the future of robotics.

> *robots are just one of the most thrilling things that's going to happen. [00:12]*

## [00:30] DGX-1 Origin Story

Jim Fan recounts the 2016 delivery of the first DGX-1 by Jensen Huang to Elon Musk and the OpenAI team. He highlights how this moment catalyzed the deep learning revolution that led to current AI breakthroughs.

> *If you believe in deep learning, deep learning will believe in you. [01:26]*

## [01:42] The Great Parallel

Fan proposes 'The Great Parallel,' applying the successful LLM scaling playbook to robotics. Instead of predicting the next token in a string, the goal is to predict the next physical world state through simulation and alignment.

> *instead of simulating strings can we simulate next physical world state? [02:56]*

## [03:31] Robotics Endgame Setup

The strategy for achieving the robotics end game is divided into two primary pillars: model strategy and data strategy. Fan notes that while LLMs are in their final 'boss fight,' robotics is just beginning its scaling journey.

> *It boils down to two things, model strategy and data strategy. [03:32]*

## [03:39] Why VLA Falls Short

Vision-Language-Action (VLA) models are criticized for being 'head-heavy' on language while lacking a fundamental grasp of physics and verbs. Fan argues they are better at encoding static knowledge than dynamic physical interaction.

> *VLAs are great at encoding knowledge and nouns, but not so much at physics and verbs. [04:08]*

## [04:32] Video World Models

Fan explains how video models like Veo 3 learn internal physics, such as gravity and buoyancy, simply by predicting pixels at scale. These models act as simulators that can solve mazes and plan visual sequences internally.

> *Physics emerge by predicting the next blob of pixels at scale. [05:15]*

## [06:09] DreamZero World Action

Nvidia introduces DreamZero and World Action Models (WAMs), which jointly decode future world states and motor actions. This allows robots to perform zero-shot tasks by 'dreaming' the correct motion sequence before executing it.

> *DreamZero jointly decodes the next world states and next actions. [06:29]*

## [07:46] Scaling Data Collection

To overcome the physical limits of teleoperation, Fan discusses Universal Manipulation Interfaces (UMI) and exoskeletons like Dex-UMI. These tools allow humans to collect high-dexterity data directly without the robot being in the loop.

> *we're able to break the curse of 24 hours per robot per day [10:06]*

## [11:06] EgoScale and Scaling Laws

Fan introduces EgoScale, a policy trained on 21,000 hours of human egocentric video. This research uncovered a neural scaling law for dexterity, showing a mathematical relationship between pre-training volume and robot performance.

> *we discovered this neural scaling law for dexterity. [12:39]*

## [15:39] DreamDojo and the Roadmap

Fan outlines the roadmap to 2040, including the Physical Turing Test and 'lights-out' factories. He introduces DreamDojo, a neural simulator that replaces classical physics engines with data-driven world models.

> *I can say with 95% certainty that we'll get to the end of the end game... by 2040. [19:19]*

## Entities

- **Jim Fan** (person): Lead of the embodied autonomous research group at Nvidia.
- **Nvidia** (organization): The technology company developing the hardware and software for the robotics end game.
- **Jensen Huang** (person): CEO of Nvidia, mentioned for delivering the first DGX-1 to OpenAI.
- **OpenAI** (organization): The research lab that received the first DGX-1 for deep learning development.
- **DGX-1** (product): The world's first deep learning supercomputer, delivered in 2016.
- **Veo 3** (model): A video world model capable of simulating physics and visual planning.
- **DreamZero** (model): A policy model that predicts future world states and actions simultaneously.
- **EgoScale** (project): A robotics pre-training framework using large-scale human egocentric video data.
Andrej Karpathy: From Vibe Coding to Agentic Engineering
Andrej Karpathy explores the paradigm shift from traditional programming to Software 3.0, where LLMs act as programmable computers driven by context. He details the transition from 'vibe coding' to 'agentic engineering,' emphasizing that while AI handles execution, human taste and understanding remain the ultimate bottlenecks.

## [00:00] Introduction

Stephanie Zhan introduces Andrej Karpathy, highlighting his foundational work at OpenAI and Tesla. She notes his ability to make complex AI shifts accessible and introduces the concept of vibe coding.

> *He has a rare gift of making the most complex technical shifts feel both accessible and inevitable. [00:22]*

## [00:44] Feeling Behind as a Coder

Karpathy describes a turning point in December 2023 when agentic tools began producing code that came out fine without manual intervention. This shift led him to adopt vibe coding, trusting the AI to handle complex workflows autonomously.

> *I just start to notice that with the latest models the chunks just came out fine. [01:29]*

## [02:28] Software 3.0 Explained

Karpathy defines Software 3.0 as a paradigm where the LLM acts as a programmable computer and the context window serves as the primary programming lever. This follows Software 1.0's manual rules and Software 2.0's data-driven weight training.

> *Software 3.0 is kind of about your programming now turns to prompting and what's in the context window is your lever. [03:20]*

## [03:44] Agents as the Installer

Using the installation of OpenClaw as an example, Karpathy explains how agents replace rigid bash scripts with intelligent, environment-aware execution. This approach allows the AI to debug and adapt to specific system requirements autonomously.

> *The agent has its own intelligence that it packages up and then it kind of like follows the instructions. [04:29]*

## [04:49] MenuGen vs Raw Prompts

Karpathy contrasts his custom-coded MenuGen app with raw prompts to models like Gemini, concluding that many traditional software layers are now redundant. He emphasizes that AI can now perform general information processing that was previously impossible with structured code.

> *The software 3.0 paradigm is a lot more kind of raw. It just your neural network is doing more and more of the work. [06:11]*

## [07:37] What’s Obvious by 2026

Looking toward 2026, Karpathy envisions neural computers that process raw video and audio directly. These systems would use diffusion models to generate dynamic user interfaces, potentially making traditional UI code obsolete.

> *You could imagine completely neural computers... a device that takes raw videos or audio into basically what's a neural net. [08:22]*

## [09:41] Verifiability and Jagged Skills

AI models develop 'jagged' capabilities, peaking in verifiable domains like math and code because reinforcement learning rewards are available there. Karpathy notes the paradox that a model can refactor a massive codebase yet fail at simple logic.

> *state-of-the-art models today will tell you to walk [to a car wash] because it's so close... This is insane. [11:36]*

## [13:39] Founder Advice and Automation

Model performance is heavily dictated by the specific data distributions chosen by frontier labs. Karpathy advises founders to explore the 'circuits' of these models to understand their strengths, or to use fine-tuning to fill gaps.

> *we are slightly at the mercy of whatever the labs are doing, whatever they happen to put into the mix. [12:57]*

## [15:46] From Vibe Coding to Agentic Engineering

While 'vibe coding' raises the accessibility floor, 'agentic engineering' focuses on maintaining professional quality. This discipline involves coordinating powerful but stochastic agents to accelerate development without sacrificing the engineering bar.

> *agentic engineering is about preserving the quality bar of what existed before in professional software. [16:07]*

## [25:17] Agents Everywhere and Learning

Karpathy advocates for agent-native infrastructure, expressing frustration with human-centric documentation. He argues that while thinking can be outsourced to AI, human understanding remains a critical bottleneck for directing agents.

> *You can outsource your thinking, but you can't outsource your understanding. [28:10]*

## Entities

- **Andrej Karpathy** (person): AI researcher, former Director of AI at Tesla, and founding member of OpenAI.
- **Stephanie Zhan** (person): Partner at Sequoia Capital and host of the discussion.
- **Software 3.0** (concept): A paradigm where LLMs act as programmable computers via prompting and context.
- **Agentic Engineering** (concept): The professional discipline of coordinating AI agents to maintain software quality.
- **MenuGen** (project): An app Karpathy built to OCR and visualize restaurant menus, used as a case study.
- **OpenAI** (organization): AI research company co-founded by Karpathy.
- **Gemini** (ai-model): Google's LLM used in Karpathy's software comparison.
- **Vercel** (organization): A cloud platform used by Karpathy to deploy projects.