GitHub's Agent Era: 14x Commits, 200M Developers, Copilot's Next Act — Kyle Daigle
Latent Space · 1 day ago


GitHub COO Kyle Daigle joins swyx to map what the agent era looks like from inside the platform hosting 200 million developers and now processing commits at 14x last year's pace. Across 84 minutes they cover how Kyle runs GitHub with AI-driven micro-skills and WorkIQ MCP, why former developers in leadership have an unusual edge right now, the full arc of GitHub's platform history from webhooks to Actions to Copilot, and where trust in agent-generated code ultimately has to come from. The conversation is grounded throughout in Kyle's own weekend and executive workflows: building AI-generated revenue presentations, running 15 simultaneous agents on a Saturday, and describing what "ambient AI" would actually need to do before it becomes genuinely useful.

## [00:00] Hook

Kyle opens mid-sentence, already deep in his argument: people who detoured into other careers before coding, and came back armed with cross-domain knowledge, are uniquely positioned in the AI era. Running 15 agents on a Saturday while his kids are at lacrosse is not just a productivity flex; it recreates the feeling of creation that got him into software in the first place.

> *"I can crank up 15 agents on Saturday, you know, while my kids are doing lacrosse. That's like really powerful and I think it gets me back to that feeling of like creation."*

## [01:21] Introduction

Kyle's title is COO of GitHub, but he recently took on CMO of Developer for Microsoft as well, meaning every developer-facing product and communication across the broader Microsoft ecosystem now runs through him. He's been at GitHub for 13 years, joined as a developer, personally built webhooks and the platform/API layer, ran engineering until 2018, then moved into the operational and business side. The dual COO/CMO role is unusual; Kyle frames it as the same job with a larger surface area: tell the truth, be authentic, let the products speak.

> *"I built webhooks and worked with teams building the API, built the platform layer, anything that integrated with GitHub, up until really 2018 I built or ran the engineering teams."*

## [04:57] Why AI Got Kyle Coding Again

swyx points out that Kyle's commit graph shows a clear dip through his leadership years and a sharp uptick recently, driven entirely by AI. Kyle is not writing features for GitHub's product; he's building internal agents and workflow tools that stitch together disparate data sources. His primary use case is retrospective: using WorkIQ, MCP servers, Slack, Teams transcripts, and Obsidian notes to ask "what actually happened last week, what worked, and what should I tweak for the next few days." He finds LLMs are exceptionally good at pattern-finding across a week of context, far more so than generating forward-looking plans from scratch.

> *"I find AI in like what most of this launch here is actually like less building forward. It's actually like a recursive loop backwards. I'm always looking at what had happened first."*

## [08:25] Running GitHub with AI: WorkIQ, MCP, Slack, Teams, and Skills

GitHub rolled out AI internally by meeting people where they already work (Slack, Teams, email) rather than forcing them onto a new tool. Every employee, technical or not, gets the Copilot CLI plus a shared set of atomic micro-skills deposited into repos. The era of the "mega-skill" that handles an entire workflow end-to-end is over; what works are tiny, single-purpose skills that do one thing well and compose cleanly. Kyle uses Postel's Law as a design principle: liberal in what each skill accepts, strict in what it outputs. WorkIQ, the M365 MCP server, lets anyone ask backward-facing questions across every meeting, email, and chat, which is critical for a fully remote, globally distributed team.

> *"We're ending the era of these like massive beautiful perfect skills.
What we found is these incredibly micro skills that are just doing one thing for us very very well versus a skill that's going to do that full report that doesn't really exist on our side anymore."*

## [17:00] The Golden Age for Former Developers in Leadership

swyx asks whether people like Kyle (technical backgrounds, now in exec roles) have a structural advantage in the AI era. Kyle's answer: pattern-finding and problem-solving are the durable skills from his developer years, and AI has given him back the ability to apply them directly in code. The more interesting case isn't developers going back to update old side projects; it's people who spent ten-plus years accumulating business knowledge now using that context as leverage when wielding AI tools. The cross-domain background, once a liability in pure engineering orgs, is now a multiplier.

> *"I just find that the folks that came from a different career, went to school for something else, went off and did this random thing and then became a software dev, now having the power of an AI where I can crank up 15 agents on Saturday."*

## [18:52] 15 Agents on Saturday and AI-Generated Executive Work

Kyle built GitHub's annual revenue planning presentation entirely with AI: a SQLite app to view the data, skills pulling from Obsidian notes and work context, and a deliberate skill that made the output look "humanly bad" so it wouldn't read as AI-generated. He presented it to the CRO and CFO teams without disclosing the process; nobody asked. His point isn't to hide AI from colleagues but to demonstrate that value is in crafting and judgment, not slide assembly. The ability to build a small data-manipulation app and control the final output is, specifically, the advantage that developers carry into leadership.

> *"I ultimately built this entire presentation without touching any of it. And I was like, okay, I'm just going to present this to our CRO, the CFO, their teams without mentioning I built it with AI.
Never came up once."*

## [21:41] How AI Changes the Chief of Staff Role

Kyle still has a chief of staff, but the job has shifted. Slide prep and presentation assembly have moved to AI; what remains irreplaceable is the human connective tissue: knowing which people in which cities should meet, surfacing relationship opportunities across a distributed org, brokering conversations that don't appear in any MCP server. The analogy is email replacing letter-opening: nobody expects the chief of staff to open physical mail anymore, and soon nobody will expect them to build decks either. The judgment about *who* should talk to *whom* is what stays.

> *"I still have a chief of staff because the difference is the human connection aspects: I should be meeting with this group and this team and they have an opportunity and I'm going to be in San Francisco today."*

## [23:06] GitHub's History: Actions, npm, Webhooks, and Open Source

Kyle walks through the platform's architectural history: GitHub Services (pre-2014 arbitrary Ruby execution with no real containerization), webhooks, Pages, and then Actions, launched by Kyle personally at GitHub Universe in October 2018. Actions went from "we should not be running arbitrary Ruby on people's behalf" to a fully containerized compute layer now using Azure Dev Compute for fast, small-VM agent spin-ups. The npm acquisition came from a simple premise: npm was powering the internet and having scaling problems; GitHub's job was to keep it running and raise its security posture. Every security improvement (2FA enforcement, token invalidation on exposure) breaks something downstream, and that balance between hardening a 15-year-old ecosystem and not causing developer snow days remains the central tension.

> *"We have changed the 2FA policies, we've changed the way the tokens work. When we find tokens that have been exposed or potentially exposed, we invalidate them. That creates issues.
But we're trying to push the community forward."*

## [30:06] Slop Forks, Vendoring, and AI Dependency Management

swyx raises the "slop fork" pattern (AI-assisted vendoring where you pull in only the source you need rather than importing a whole package) and asks whether it sidesteps npm's vulnerability surface. Kyle: vendoring was how everyone worked in 2013, and there's something true about pulling in only what you need, but it doesn't fix the fundamental problem. An agent evaluating code can be convinced it's secure just as easily as a human can. Static analysis and runtime testing still need investment regardless of package scope. GitHub's historical stance (wait for community RFC and social consensus before cementing a practice) means they won't push a single vendoring standard, but will build tools for maintainers to enforce their own trust rules.

> *"The vulnerabilities, in an agent looking at them there's time and time again a million different ways in which we can convince an agent that this thing is like secure or not."*

## [35:18] Pull Requests, Prompt Requests, and Trust in Agent-Generated Code

GitHub invented the pull request as a social trust mechanism, and now agents are generating the majority of PRs on many projects. Kyle weighs various alternatives, including Peter Coppola's "prompt request" model and Thomas Dohmke's contribution-asset approach, but argues that none fully solve the underlying problem: trust is social, not technical. Even if a PR is 100% verified by static analysis, humans still reach for human signals (does Mitchell approve it?) before merging. GitHub's current direction centers on giving maintainers malleable tools to define their own trust heuristics rather than imposing a universal standard, because any single standard immediately becomes a gamification target. The endgame is something closer to human digital identity.

> *"The reason why there's not a single answer is ultimately we're trying to codify trust.
Right now when an agent writes code and another agent reviews code and then Kyle goes and looks at it, the trust is kind of diffuse."*

## [42:42] GitHub Stars, 200M+ Developers, and the New AI Builder Wave

GitHub crossed 200 million accounts, up from 80 million not long ago. The rapid star accumulation on new AI projects is mostly genuine: an entire new cohort who built their first app in the AI era is swarming the zeitgeist. Kyle refuses to split hairs about who "counts" as a developer, drawing on his own experience being called a fraud for having a GitHub account before he knew what git was. The gamification problem is real (whack-a-mole anti-abuse, now AI-powered), but the majority of the star velocity is new builders who want to participate in the moment the way Kyle wanted to participate in the Ruby era.

> *"It's not just developers. It's folks that have maybe started coding or only joined in since the AI era. And those projects are going up because you want to be a part of this moment."*

## [46:36] GitHub Spark, Low-Code, and Why GitHub Still Shows the Code

GitHub experimented with Spark as an easy app-build-and-run experience. The lesson: for developers, the value was always simple runtime, not a UI veneer hiding the code. GitHub's architectural principle is non-negotiable: they will always show you the code. The broader goal Kyle articulates is lowering the barrier to that first "I had an idea and I built it" moment: anyone should be able to swap a light switch without needing to open the breaker box.

> *"Anytime we try to put a veneer on top of something, we still always show you the code. That's kind of like a tenet. We're never gonna hide the code from you ever."*

## [48:59] GitHub's Hardest Era: 14x Growth, Reliability, and Scale

GitHub went from 1 billion commits in all of 2025 to 275 million per week in April 2026, a 14x year-on-year rate that is still accelerating.
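The 14x figure can be checked with a few lines of arithmetic. This is a minimal sketch using only the two numbers quoted above; annualizing by holding the April weekly rate flat for 52 weeks is an assumption made here for illustration, not something stated in the episode.

```python
# Figures quoted in the summary above.
commits_2025_total = 1_000_000_000         # commits in all of 2025
commits_per_week_apr_2026 = 275_000_000    # weekly rate in April 2026

# Assumption: annualize by holding the April weekly rate flat for a year.
WEEKS_PER_YEAR = 52

annualized_2026 = commits_per_week_apr_2026 * WEEKS_PER_YEAR
growth_multiple = annualized_2026 / commits_2025_total
weeks_to_match_2025 = commits_2025_total / commits_per_week_apr_2026

print(f"annualized 2026 rate: {annualized_2026 / 1e9:.1f}B commits")
print(f"multiple of 2025 total: {growth_multiple:.1f}x")
print(f"weeks needed to match all of 2025: {weeks_to_match_2025:.1f}")
```

At that weekly rate the whole of 2025 is matched in under four weeks, which lines up with Kyle's remark about doing more in a month than in the previous year.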
This broke things in new ways: not the old webhooks reliability problems (those were fixed and rewritten), but novel permission-layer failures only visible at cross-object scale. The core pain point is MySQL 1, a monolithic permissions database GitHub has been decomposing for years; permissioning is where most cross-cutting outages originate. Simultaneously, the industry is shifting back toward monorepos, which carry unique git infrastructure performance characteristics. Kyle frames the scaling problem as "diagonal": vertical and horizontal both stop working, so you crack open services running unchanged for 10-15 years and rewrite them.

> *"We're doing more in a month than we did in a year last year. By roughly every measure, there's growth that is much much bigger. And that is breaking our system in new ways, not old ways."*

## [60:42] Actions as the Compute Layer for CI/CD and Automation

Actions has evolved well beyond CI/CD into a general-purpose automation compute layer, which is the root of significant availability pressure because every agent task and agentic workflow translates into more builds and more CPU. GitHub is expanding compute through both its own data centers and Azure cloud, and is using Azure Dev Compute (fast small-VM spin-up) under the hood for containerized agent execution. The path to fewer outages is a step-change model: large foundational infrastructure fixes that take time, then visible plateau improvements in availability rather than incremental noise reduction.

> *"Actions is the core compute layer for either CI or side project. More tools, more agents, more PRs mean more builds. More builds need more CPUs and we simply need more CPUs."*

## [63:25] The State and Future of GitHub Copilot

Copilot's history: launched as code completion, then shifted energy toward fine-tuning as the industry demanded better accuracy, and then next-gen models arrived and made fine-tuning less critical, creating confusion about where Copilot was going.
The current architecture unifies a single SDK and agent harness across code completion, the new CLI, the new desktop app, and cloud agents. The future Kyle describes covers the full SDLC: security remediation, issue triage, documentation drift detection, not just writing code. The remaining hard problem is context and memory: getting GitHub to "act like Kyle wants it to act" across all his dependencies, preferences, and team context.

> *"What we think is that it's not solely about the code generation. It's really about having the ability to use these coding agent brained harnesses across not just the coding experience but also security remediation, every GitHub issue that comes in."*

## [69:45] Ambient AI, Background Agents, and the Future of the SDLC

Kyle argues the industry is still stuck in a "hyper-myopic" frame where coding agents only know about code. What he actually wants is ambient AI that carries every spec doc, every email thread, every conversation, every Obsidian note into its decision-making as a developer, not as a recall tool you query, but as persistent background context that shapes implementation choices in real time. OpenClaw interests him precisely because it connects personal context to agent action, but the missing piece is making that context available *during* software development. The extreme version (AI that proactively directs you rather than waiting to be asked) is the inversion of control that both excites and slightly alarms him.

> *"The most interesting thing to me in AI is actual ambient AI. I'm looking to be implementing a new feature and for it to know every spec doc, every email, the conversations that I've had online, everything about how this could be implemented and be able to use that as part of its decision-making."*

## [74:30] OpenClaw, Enterprise Security, and the New OS for Agents

Microsoft has a CVP dedicated to OpenClaw, which is unusual given that OpenClaw is not a Microsoft product.
Kyle explains: OpenClaw demonstrated what a valuable personal agent actually looks like (full personal context, computer use, not just chat), and Microsoft's job is to make that work in enterprise: OS-level sandboxing on Windows so you can run an agent on a work device without it becoming a security incident. The framing Kyle reaches for: Microsoft is the original operating systems company, and agents need a new OS layer. Workloads have changed so fundamentally that the right question is no longer "do we need more inference?" but "what type of compute do we need to run these agentic flows?", all the way down to silicon.

> *"Microsoft is the original operating systems company and here's the new operating system for AI. Operating systems need to look different than they looked five years ago because it's not just you using them anymore."*

## [79:24] Build Announcements, WorkIQ, FoundryIQ, and Microsoft Context

Kyle previews what GitHub and Microsoft are announcing at Build: WorkIQ (M365 context engine via MCP, powerful for retrospective questioning across all work assets) and FoundryIQ (same intelligence layer that connects to existing data stores without requiring migration). The pitch for enterprise developers: "how I build on the weekend should be how I build at work", but Fortune 500 companies can't just vibe-code and ship; security and compliance gates have to move as fast as development does. WorkIQ and FoundryIQ are the attempt to bring weekend-level agility into the enterprise context layer, with the governance that lets it survive in large organizations.

> *"Work IQ, Foundry IQ: these context engines are wild good and we've given them to our developers at GitHub. You can ask questions around everything in your work context and it's surprisingly powerful."*

## [83:02] What Should swyx Ask Satya?

swyx is about to interview Satya Nadella at Build and asks Kyle what to ask.
Kyle's recommendation: challenge Satya on what he believes is demonstrably true about the AI and inference landscape in two to three years, not as a throwaway futurist question, but as a direct test of the internal bets Microsoft is making right now. Significant external skepticism exists about Microsoft's AI approach, and a straight answer from Satya would be both a genuine stress test and a reassuring signal for the developer community.

> *"The best question to ask is what he thinks is true in like two or three years from now. The way that he is looking at this AI problem, the inference problem, the token problem: why is this approach in two years going to pay off?"*

## Entities

- **Kyle Daigle** (Person): COO of GitHub and CMO of Developer for Microsoft; 13-year GitHub veteran who built the original webhooks and platform API layer.
- **swyx** (Person): Host of Latent Space podcast; developer-advocate-turned-podcaster who conducted this interview at Microsoft Build 2026.
- **GitHub Copilot** (Software): GitHub's AI coding assistant, now spanning code completion, CLI, desktop app, and cloud agents under a unified SDK.
- **WorkIQ** (Software): Microsoft 365 MCP server that gives employees a context engine over all work assets (Teams, email, calendar, etc.).
- **FoundryIQ** (Software): M365 intelligence layer that connects to existing enterprise data stores without requiring migration.
- **GitHub Actions** (Software): GitHub's general-purpose compute and CI/CD automation layer; primary source of CPU demand growth from agent workloads.
- **OpenClaw** (Software): Personal AI agent tool; referenced as a model for what a personal AI agent with full context and computer use looks like.
- **npm** (Software): JavaScript package registry acquired by GitHub; central to supply-chain security discussions about vendoring, slop forks, and dependency trust.
- **Mitchell Hashimoto** (Person): Co-founder of HashiCorp, active open-source maintainer; discussed in context of vendoring approaches and GitHub's maintainer relationship model.
- **Thomas Dohmke** (Person): Former CEO of GitHub; referenced in context of PR workflow evolution.
- **Microsoft Build** (Organization): Annual Microsoft developer conference; context for this episode's release and Kyle's expanded-role announcements.

#github #copilot #ai-agents
Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents— Ethan He
Latent Space · 3 days ago


Ethan He built NVIDIA's Cosmos world model, then joined xAI mid-2025 to build Grok Imagine from scratch (no infra, no data, no model) and shipped the first audio-video generation model in three months. He walks swyx and Vibhu through the full technical stack: synthetic captioning pipelines, VAE design tradeoffs, step distillation, audio-video alignment, and the hard economics of storing petabytes of video training data. His central argument runs through the entire conversation: since diffusion model technology has largely matured, most quality gains in video now come from language models, not from the video model itself, a view with direct implications for where the field goes next, including video agents, generative UI, and embodied world models.

## [00:00] Hook

This exchange, Ethan's "pretty big claim" that visual intelligence now mostly comes from language, is pulled from later in the interview, where he argues that improvements to video models are increasingly driven by better language models acting as prompt rewriters and orchestrators, not by advances in diffusion or flow-matching architectures themselves.

> *"Every time you see there's some improvement on these models, I would say mostly the gain comes from language model, not coming from the video model itself."*

## [01:16] Introduction

swyx and Vibhu welcome Ethan to the Latent Space studio, noting he has been a recurring presence through the podcast's paper club: first presenting the Cosmos world model paper, then mixture-of-experts work. The conversation opens with a brief aside about the Poolside paper released the same day, a fully open Gemma-level model trained on 40 trillion tokens, before pivoting to Ethan's own trajectory.

## [02:41] From NVIDIA Cosmos to xAI

Ethan built Cosmos, NVIDIA's giant video foundation model aimed at giving roboticists a simulatable world to build on, and shipped it by end of 2024.
Once he realized video models obeyed the same scaling laws as language models, he went looking for more compute. xAI offered it. He joined in mid-2025 at the moment xAI decided to build its own image and video stack, with no existing infra, data pipeline, or model. He stayed through pre-training, post-training (reference-to-video, video extension), and a final stretch leading a small team on real-time long-horizon video generation.

> *"By the time I joined, xAI was about to build video models and multimodal models. There were no infra, no data, and no model. Just a few engineers; we built it in three months and released the first model, Grok Imagine 0.9."*

## [04:40] Building Grok Imagine from Zero to One

The three-month timeline surprised even Ethan. He attributes it to three factors: talent density (strong engineers who could align on a goal with minimal meetings, typically just one sync a day), xAI's existing data and inference infrastructure, and his own prior experience running the same build at NVIDIA. The bottleneck was iteration speed: how many training runs can you complete per day. With strong infra and abundant compute, bugs surface faster and each failed run costs less, so you burn through the inevitable data and pipeline errors in weeks rather than months.

> *"The most important thing is talent. Everyone was very strong and clever, very close to each other toward a common goal. So that speeds up things a lot; you reduce the communication bandwidth among people."*

Ethan describes a pattern where small data or pipeline bugs produce outsized quality regressions, and only fast iteration exposes them. A bug invisible at one scale becomes catastrophic at the next. The engineers who find and fix these quickly, not the ones who design the most sophisticated architecture, determine how fast a team ships.
## [11:23] How Image and Video Models Are Trained

Video models require synthetic text-video pairs because internet video titles and descriptions almost never describe visual content accurately. The first step is human labeling: at NVIDIA, annotators were instructed to describe every object, character, interaction, and dialogue in a clip as exhaustively as possible. Those labels train an early VLM, which then generates captions at scale. The resulting pipeline, from video to VLM to synthetic caption to a (video, caption) training pair, is the foundation of both Cosmos and Grok Imagine.

Image models must come first: they train faster, require less storage, and the learned representations transfer directly to video. Ethan describes building image models as building the foundation that video sits on top of. The architecture, a diffusion transformer operating over VAE latents, is now standard, but the data quality and caption detail remain the primary lever for model quality.

> *"Building a video model, you actually need to build an image model first. The data you need is 100% synthetic pairs of language and image, or language to video, because on the internet, videos don't naturally associate with text."*

## [20:09] Video Compression, VAEs, and Real-Time Tradeoffs

Raw MP4 compression produces tokens whose latent space is incomprehensible to transformers, so the field moved to learned VAEs that create a smoother, more continuous latent space models can train on. The key design choice is how aggressively to compress the temporal dimension. Temporal compression is efficient (adjacent frames are mostly redundant), but it trades away real-time capability. Wan 2.1 uses 8x8 spatial and 4x temporal compression; generating a single token requires reconstructing four frames, making sub-200ms latency impractical.
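To make the compression tradeoff concrete, here is a rough sketch of the latent grid an 8x8 spatial, 4x temporal VAE would produce. The clip size (5 seconds, 480x832) and the 24 fps frame rate are assumptions chosen for illustration, and the helper `latent_grid` is hypothetical; real VAEs may handle the first frame, padding, and channel counts differently.

```python
import math

def latent_grid(frames, height, width, spatial=8, temporal=4):
    """Latent (T, H, W) grid for a clip under the given compression factors."""
    return (math.ceil(frames / temporal), height // spatial, width // spatial)

# Hypothetical clip: 5 seconds at an assumed 24 fps, 480x832 pixels.
FPS = 24
frames, height, width = 5 * FPS, 480, 832

t, h, w = latent_grid(frames, height, width)
pixel_positions = frames * height * width
latent_positions = t * h * w

# One latent time step covers `temporal` frames of real time.
chunk_ms = 4 / FPS * 1000

print(f"latent grid (T, H, W): {(t, h, w)}")
print(f"position reduction: {pixel_positions // latent_positions}x")
print(f"video time covered by one latent step: {chunk_ms:.0f} ms")
```

Under these assumptions one latent time step spans about 167 ms of video, so the smallest unit the model can emit already uses most of a 200 ms response budget before any compute is counted. The position count ignores channels, so it is not a bit-level compression ratio.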
Ethan frames this as a fundamental tradeoff: high compression rates make training cheap and inference efficient for pre-rendered video, but lock out any use case that needs to respond to live user input. World models require the opposite choice.

## [23:26] Generative UI, Flipbook, and Neural OS

Ethan argues that if inference were free, the logical endpoint of video generation is a complete replacement of conventional UI: instead of loading web pages from a server, a model generates them in real time in response to user intent. Flipbook, a demo that went viral, shows this literally: every element of the "browser" is generated by an image model, and clicking a link generates a new page rather than fetching one.

The deeper claim is that this is not a novelty but the final form of world models applied to human-computer interaction. A traditional app is a fixed function mapping input to output; a generative UI is a model that can produce any interface the user needs without a developer having to build it first. Ethan calls this a "Neural OS," where the gap between user intent and rendered pixels closes entirely.

> *"Imagine the internet doesn't exist and you type in google.com: what should a model show you? The model can imagine something. These web pages completely do not exist, so I can explore anything."*

The near-term constraint is inference cost. Current video models cannot generate at interactive frame rates without significant distillation. But Ethan treats this as an engineering problem with a known solution trajectory, not a fundamental barrier.

## [33:26] The Cost of Training Large Video Models

Training large video models costs roughly as much as training a medium-scale language model, but the breakdown differs. Compute is comparable, but storage and data movement dominate in ways LLM practitioners do not expect. One billion videos at 5 MB each requires five petabytes of raw storage.
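The raw-storage figure is simple multiplication, and the same numbers give a rough monthly bill. In this sketch the per-GB price is an assumed round figure for object storage, not a quoted rate, and decimal units (1 PB = 10^15 bytes) are used throughout.

```python
NUM_VIDEOS = 1_000_000_000          # one billion videos, as stated above
AVG_VIDEO_BYTES = 5 * 10**6         # 5 MB each, decimal units

# Assumption: illustrative object-storage price in USD per GB-month.
PRICE_PER_GB_MONTH = 0.021

raw_bytes = NUM_VIDEOS * AVG_VIDEO_BYTES
raw_pb = raw_bytes / 10**15
raw_gb = raw_bytes / 10**9
monthly_storage_usd = raw_gb * PRICE_PER_GB_MONTH

print(f"raw video storage: {raw_pb:.1f} PB")
print(f"monthly storage cost at assumed price: ${monthly_storage_usd:,.0f}")
```

At the assumed price the bill lands a little above $100K per month for the raw videos alone, before any derived features or data transfer are counted.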
The VAE features that must also be stored are roughly the same size again, which pushes the total to ten petabytes or more. On AWS S3, five petabytes runs approximately $100K per month before egress. Egress (downloading that data into the training cluster) can exceed storage costs, and each training run pulls the full dataset once.

> *"Just storing the videos alone costs a lot. Five petabytes on S3 Standard is $100K per month. And egress, just to download those videos, I believe it's more expensive than storing them, and each training run you probably need to pull them once."*

The implication is that video model development is gated on data infrastructure as much as on GPU hours. Teams without efficient data pipelines pay a multiplier on every experiment.

## [38:20] Distillation, GANs, and Fast Video Inference

Training-time costs are largely fixed; the inference-time story is more tractable. Step distillation (training a small model to replicate the outputs of a large teacher in far fewer denoising steps) cuts inference cost by 10-25x. Flow-matching models trained to convergence need around 100 steps; production models typically run in 4-8. At the extreme, simple image-to-image tasks can run in a single step.

The intuition Ethan offers: the teacher model must learn the full distribution of internet video, which is arbitrarily complex. The distilled student only needs to match the teacher, which is a fixed and much simpler target. Consistency models and LCM-style approaches follow the same logic. In Cosmos, production serving used 4-step and 8-step variants depending on quality requirements.

GANs remain relevant as discriminators: a GAN discriminator can enforce photorealism constraints during distillation that pure score-matching loss misses, and Ethan notes that consistency models and GANs are converging on similar practical deployments even if their theoretical motivations differ.
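The 10-25x range follows from the step counts alone. This small sketch assumes every denoising step costs one forward pass of equal cost for teacher and student; a smaller student would save more, and the helper `step_speedup` is a hypothetical name used only here.

```python
TEACHER_STEPS = 100        # approximate step count quoted for converged flow-matching models
STUDENT_STEPS = (8, 4, 1)  # distilled variants mentioned in the text

def step_speedup(teacher_steps, student_steps):
    """Reduction in forward passes, assuming equal cost per step (an assumption)."""
    return teacher_steps / student_steps

speedups = {steps: step_speedup(TEACHER_STEPS, steps) for steps in STUDENT_STEPS}

for steps, factor in speedups.items():
    print(f"{steps}-step student: {factor:.1f}x fewer forward passes")
```

The 8-step and 4-step variants give 12.5x and 25x, which brackets the quoted range; the single-step case is the 100x extreme that applies only to simple tasks.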
## [42:37] Audio-Video Generation and Grok Imagine 0.9

Grok Imagine 0.9 was the first audio-video joint generation model deployed at scale. The core difficulty is modality alignment: text-video pairs are relatively abundant; text-audio pairs are rare; audio-video pairs aligned at the semantic level are almost nonexistent at scale. Speech tokens are quasi-discrete and can be modeled with language-like approaches, but music is continuous and requires a completely different representation.

Training the joint model required building synthetic audio caption pipelines from scratch, with human annotation where VLMs failed, which was often, especially for music. Aligning all three modalities (text, video, and audio) without degrading either video quality or audio realism is what Ethan calls the hardest part of the project.

> *"Audio has two components: a discrete component (language) and a continuous component (music). The music is completely different; you cannot model it with discrete tokens. That's the hard part, not to mention we have to align text, video, and audio together."*

## [49:50] What Makes a World Model?

Ethan's definition has three components: real-time, interactive, and long-horizon video generation. He treats these as independent requirements, each of which most current models fail. Real-time means generating at display frame rates: 60fps for casual use, 300fps for gaming, 200ms response latency for digital humans. Current video models cannot do this; the VAE's temporal compression alone introduces latency that makes sub-200ms responses nearly impossible without architectural changes. Interactive means the model can accept any input modality the user can provide (keyboard, mouse, voice) and respond coherently. Long-horizon means maintaining consistent physical laws, character identity, and causal logic across minutes, not seconds.

> *"World model is real-time, interactive, long-horizon video.
Current video models can do none of these three things fully. That's why they're not world models yet."*

## [57:07] Reference Videos, Long Context, and Video Memory

The parallel to language model context scaling is direct: video models are in the 2,000-8,000 token era, and will need to scale to million-token-equivalent contexts to generate coherent long videos. Ethan describes the reference-to-video feature he built at xAI (analogous to Cameo) as a mechanism for injecting selected history into the model's context rather than carrying the full video forward. FramePack's heuristic (storing the last second of video at full resolution while compressing earlier frames progressively) points toward the right direction: the model selects relevant context from its history rather than brute-forcing the full sequence. Ethan expects this context management to become part of the model itself rather than remaining a harness-level heuristic, the same way KV cache management is disappearing into model internals.

## [61:27] xAI Culture, Research, and First-Principles Building

swyx notes that xAI communicates its research poorly relative to what the work actually demonstrates: the blog post accompanying Grok Imagine describes high-level capabilities without the technical depth Ethan has just spent an hour covering. Ethan is diplomatic but agrees that different labs have different communication styles. The xAI working culture he describes is minimalist: few meetings, no bureaucratic overhead, direct access to leadership judgment on technical decisions, and extreme iteration speed enabled by a strong infra team. The tradeoff is that company priorities shift fast, which is part of what eventually pushed him toward independent research. First-principles thinking, starting from the physics of the problem rather than from what competitors have shipped, runs through the team's approach to both model architecture and product.

> *"Everything you just described is state-of-the-art.
Like no one else has done it. And then you just put this blog post with the cookies. I'm like, this is not enough."* ## [71:01] AI Safety, Watermarking, and Prompt Rewriting Grok Imagine deployed watermarks in all jurisdictions requiring them and built takedown pipelines integrated with xAI's social platform infrastructure. On watermarking technology, Ethan is skeptical of SynthID's long-term robustness: the technique is documented publicly, and users on Reddit have already reverse-engineered the exact frequency pattern Google applies and can strip it from any generated image. He expects watermark detection to become an arms race. On prompt rewriting: video diffusion models take instructions literally. If a user types "a cat," the model generates a stationary cat on a white background with no motion, because the training data pairs were maximally detailed descriptions of physical scenes. Production systems layer a large language model as a prompt upsampler — converting sparse user instructions into the detailed physical descriptions the video model was trained on. This is one of the reasons Ethan argues language models are increasingly central to video quality. ## [74:26] Video Agents and AI-Assisted Creation Ethan's central claim from the hook: visual intelligence now mostly comes from language. The diffusion model architecture has largely converged; the gains come from larger, smarter LLMs that rewrite prompts, plan video sequences, call editing tools, and stitch clips together. In Cosmos, the prompt rewriter was larger than the video model itself. Video agents extend this: instead of generating a complete video in one shot, an agent plans the production, calls video generation models as tools alongside deterministic editing operations (text overlays, color grading, cuts), and iterates until the output meets a specification. 
Ethan predicts that by end of 2025, video agent output will reach production-grade quality — presentable video generated without a human editor in the loop. > *"The visual intelligence are actually mostly coming from language. Every time you see improvement on these models, I would say mostly the gain comes from language model, not coming from the video model itself."* ## [88:48] Why Language Models Unlock Better Video LLMs prompt video models better than humans do, because AI models understand AI models' training distributions. A language model knows that a diffusion model needs explicit physical descriptions, not poetic shorthand — and can generate the right prompt format automatically. Beyond prompting, agents can use deterministic video editing tools for precision operations (exact text overlays, frame-accurate cuts) that probabilistic diffusion models handle poorly, keeping the stochastic model focused on generation and delegating precision to tools. Ethan's timeline: video agent output at production quality by end of 2025, with the inflection point visible in work already shipping. ## [92:31] Robotics, Physical AI, and Embodied World Models Ethan's robotics prediction inverts the usual framing: physical AI may be solved not by deploying robots in the real world but by video world models becoming so capable at simulating physical environments that they effectively provide embodied experience. Once a model can control computer interfaces in real time with full causal understanding, extending that to robotic control becomes a matter of adding one more tool. The path from screen-interacting video model to robot controller may be shorter than the path from current robot learning systems to the same capability. ## [93:54] Why Ethan Left xAI Research ambitions and company priorities diverged. xAI's focus shifted in ways that made certain research directions — particularly on the language model side — impractical from inside. 
Ethan also notes that the insight driving his departure is the same one underlying his "big claim": if language models are now the primary driver of video quality, the most impactful work to do is on language models, not video models. He frames leaving not as dissatisfaction but as following the evidence about where the leverage is. ## [95:32] Self-Managed Context and the Future of LLMs Ethan's active research question: language models that are aware of their own context state and manage it autonomously, rather than relying on harness-level heuristics like automatic compaction at 80% fill. He draws the parallel to video models struggling with long-horizon generation — the same context management problem appears in both modalities. He points to Claude Code's practice of appending the current timestamp to user messages as an early example of making models context-aware, and expects this pattern to be absorbed into model training rather than remaining an external scaffold. > *"The language models are not aware of how long their own context length is. Once they hit like 80% or something, automatic context compaction is getting triggered, and the model is not aware of that when it's working."* ## [99:59] Ethan's Career Path and Closing Thoughts Ethan traces a decade of transitions: ResNet-era image recognition with the original authors at NVIDIA, self-supervised learning at Facebook AI Research, scaling at NVIDIA Cosmos, extreme-scale compute at xAI. He was rejected from every top PhD program despite first-author papers at top conferences, which pushed him into industry. In hindsight he reads his career as consistently following the scaling frontier — from image recognition to SSL to video to LLMs — and argues that within ML, domain switching is far more tractable than practitioners believe. > *"Within ML, it's actually easier to switch than you think. A lot of people have manifested that 'I work on computer vision, I always have to work on computer vision.' 
But from my experience, the fundamentals transfer."* ## Entities - **Ethan He** (Person): Former xAI researcher who built Grok Imagine from zero; previously led NVIDIA Cosmos world model; now focused on LLM research - **swyx** (Person): Latent Space co-host; conducts technical interviews on AI engineering and research - **Vibhu Viswanathan** (Person): Latent Space co-host; co-interviewer for this episode - **Grok Imagine** (Software): xAI's image and video generation product; first model (0.9) was the first large-scale audio-video joint generation system - **NVIDIA Cosmos** (Software): Open-source video foundation model for robotics simulation; Ethan's project before xAI; released end of 2024 - **xAI** (Organization): Elon Musk's AI lab; known for fast iteration culture and extreme compute resources - **Flipbook** (Software): Viral demo of real-time generative UI; all interface elements generated by image model in real time - **SynthID** (Software): Google's AI watermarking technology; Ethan notes its pattern has been publicly reverse-engineered - **Step distillation** (Concept): Technique to train a model to replicate a teacher's output in far fewer denoising steps; reduces inference cost 10-25x - **VAE** (Concept): Learned video compression creating smooth latent spaces; temporal compression is efficient but creates real-time latency tradeoffs - **World model** (Concept): Ethan's definition — real-time, interactive, long-horizon video generation; distinct from standard video generation - **Video agents** (Concept): Systems where LLMs orchestrate video generation models, editing tools, and deterministic operations to produce production-quality video - **FramePack** (Concept): Progressive temporal compression approach for long-context video generation; stores recent frames at full resolution, compresses older history
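
As a rough illustration of the FramePack idea described above (recent frames kept at full resolution, older history compressed progressively), the sketch below assigns a token budget to each frame by age. The function names, the 24-frame window, the 1,024 tokens per frame, and the halving schedule are all hypothetical values chosen for the example; they are not FramePack's actual parameters.

```python
def frame_token_budget(num_frames, tokens_per_frame=1024, full_res_frames=24):
    """Per-frame token counts, oldest frame first.

    Assumed schedule (hypothetical): the newest `full_res_frames` frames keep
    every token; each older block of the same size gets half the tokens of the
    block that follows it, never dropping below one token per frame.
    """
    budget = []
    for age in range(num_frames):  # age 0 is the newest frame
        level = age // full_res_frames  # 0 for the full-resolution window
        budget.append(max(1, tokens_per_frame // (2 ** level)))
    budget.reverse()  # return in playback order, oldest first
    return budget


def context_savings(num_frames, tokens_per_frame=1024, full_res_frames=24):
    """Return (compressed_total, uncompressed_total) token counts."""
    budget = frame_token_budget(num_frames, tokens_per_frame, full_res_frames)
    return sum(budget), num_frames * tokens_per_frame
```

Under these assumed numbers, a 72-frame history costs 43,008 tokens instead of 73,728, and the saving grows as the history gets longer because older blocks keep shrinking.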

#video-generation #world-models #grok-imagine
Devin’s 80% Moment: Background Agents, 7x PRs, & End of Hand-Held Coding — Walden Yan & Cole Murray
1:09:32
Latent Space, 7 days ago


🔬 The Bitter Lesson Comes for Proteins — Alex Rives, BioHub
1:10:12
Latent Space, 8 days ago


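Alex Rives, head of science at BioHub and the researcher who led ESM-1 through ESM-3 at Meta FAIR, joins Brandon and RJ Honicky to lay out an eight-year conviction: scaling masked language models on protein sequences unlocks biological structure, function, and design. They cover how the move from UniRef to metagenomic data revived the ESMC scaling laws, how sparse autoencoder feature maps reproduce a century of biochemical taxonomy without being trained on it, and how world-model search produced the first successful designs of therapeutic-grade single-chain antibodies. Rives also presents BioHub's $500 million Virtual Biology Initiative and the principles for building cell models that generalize.

## [00:00] ESMC Designs Antibodies: Preview
A clip from later in the interview in which Rives explains ESMC's programmable-biology approach. He describes searching the world model for proteins that satisfy design criteria, and says the team has succeeded in designing mini-binders and, in particular, single-chain antibody fragments (scFv) with therapeutically relevant binding affinity. The clip is placed before the formal intro to signal where the episode is heading.

## [00:33] The Bitter Lesson Comes for Proteins
Brandon and RJ Honicky introduce Alex Rives as "the person in protein biology closest to the bitter lesson right now." Rives accepts the label. He traces his conviction back to 2018, when the Meta FAIR team trained the first transformer language model on protein sequences with masked-token prediction and representations of structure and function emerged on their own, without explicit supervision. The core intuition is borrowed from Zellig Harris's 1954 paper: the contexts in which an amino acid can appear are determined by the protein's structure, function, and evolutionary role. Apply that statistical pressure across billions of sequences spanning all of life, and the model has to learn the latent variables that govern protein biology.
> *"I believe in scaling laws."*

## [06:00] The ESM Lineage: From ESM2 to ESMC
Rives walks through four generations of ESM. ESM2 showed scaling gains but hit diminishing returns near 10 billion parameters. The model was not saturated; the data was. Instead of UniRef, which is biased toward culturable organisms, ESMC used metagenomic data: sequences pulled from hydrothermal vents, polar soils, and sewage, assembled from raw environmental DNA without species labels and including partial contigs. Adding billions of metagenomic sequences to training restored a clean log-linear scaling law, and small-scale experiments alone could accurately predict the representation fidelity of the 6-billion-parameter flagship.
> *"There are no more diminishing returns to scale. ESM2 was limited by data, not compute."*

ESMC is essentially a vanilla transformer with a standard masking objective. It uses no AlphaFold-style MSAs and no geometric inductive biases. Brandon and Rives briefly debate whether ESM3's multi-track architecture was a productive detour. Rives says both paradigms have their place, but he reads ESMC's results as suggesting that those priors did not play a central role at this data scale.

## [18:30] Mechanistic Interpretability and the Protein Feature Map
The BioHub team trained sparse autoencoders (SAEs) on every layer of the ESMC model family (300M, 600M, 6B) to extract the intrinsic feature geometry of the protein representation space. The result closely matched the reductionist hierarchy that biology has built up experimentally over a century: from basic amino acid chemistry to structural motifs, domain families, and broad functional themes. None of that taxonomy was provided during training.
> *"The choice of one amino acid is completely entangled with the choice of every other amino acid in the sequence. To do this well, the model has to start having latent variables that represent the biology."*

One concrete finding: the model encodes the nucleophilic elbow, a catalytic motif known to have evolved independently in several evolutionarily unrelated protein families, as a single feature that activates across all of those families. The team also built a structure map of 6.8 billion non-redundant proteins, which includes predicted structures for 1.1 billion cluster representatives. They used SAE features to link evolutionarily distant gene-editing systems as well. Some proteins grouped into those clusters have no known function, and Rives treats them as a list of discovery candidates. The first version of the ESM map has already been used by outside research groups to find new gene-editing systems.

## [35:30] Designing Antibodies with ESMC
Rives defines protein design as world-model search: inverting a generative model to find sequences that satisfy a target binding criterion. Mini-binders are now routine. Nanobodies and scFvs remain harder for structure-prediction-based methods. Because antibody evolution maximizes diversity rather than converging on a particular fold, MSA-based approaches are less useful. ESMC, trained on exactly that diversity at scale, is where the representations should be richest.
> *"Antibodies are not going to benefit from evolutionary information in the same way that predicting the structural topology of a molecule does."*

The team reports designing scFvs that reach therapeutic-grade affinity in only a handful of attempts, and notes that scFvs can be reformatted into full IgGs. ESMFold 2, a structure-prediction head built on ESMC representations, runs inference in seconds per sequence without MSAs and can map multimers across entire proteomes. Rives says it is currently the best open-weight model for multimer prediction.

## [42:00] BioHub's Vision: Toward Programmable Biology
Six months into his time at BioHub, Rives describes the institution as a philanthropy that brings frontier experimental biology, frontier measurement technology, and frontier AI together under an open-science mission. The destination he envisions is personalized predictive models of physiology: not a pill, but a system that can trace, in a specific human genome, how protein-level molecular events propagate through cellular circuits to disease.
> *"We are building a scientific institution for this new paradigm."*

He lays out the layers of biological complexity to model, in order: proteins (current generation), cells (next generation), tissues and systems, physiology. Going from proteins to cells requires data that does not yet exist and probably modeling approaches that have not yet been invented. Current "virtual cell" models generalize poorly. They represent the training data well but fail to predict outcomes when a new intervention is applied in a new context.
> *"The ability to predict what will happen when you make a new intervention in a context that has never been observed is very limited."*

## [57:00] The Virtual Biology Initiative and Scaling Cell Data
BioHub recently announced the Virtual Biology Initiative: $400 million for internal data generation and measurement technology plus $100 million to catalyze external efforts. Rives frames it as seed funding. The data scale actually needed is far larger, and he hopes BioHub's commitment draws investment from the wider scientific community. He offers three data principles: speed (protein data took half a century, and cells cannot wait that long), generalization (the training distribution must span highly diverse interventions across cell types and contexts, analogous to metagenomic breadth for proteins), and feedback (active experimental loops guided by model predictions, akin to applying RLVR to wet-lab biology). Perturbation sequencing, spatial transcriptomics, and cross-modal single-cell measurement are scalable technologies that can be put to work right now. On compute: ESMC was trained on about 1 billion sequences. An estimated 100 billion or so exist, and the model has not yet fully exploited even the 6.8 billion currently in the map. A 100x increase in compute would help, but it has to be accompanied by a proportional expansion in data. When diminishing returns will appear is an open empirical question. ESM2's curve also looked saturated right up until metagenomic data erased the plateau.
> *"We need to do this within a few years. Given how fast general AI is advancing, biology is going to be fundamentally constrained by experimental science and data."*

## People and Key Concepts
- **Alex Rives** (Person): Head of science at BioHub; architect of ESM-1, ESM-2, ESM-3, ESMC, and ESMFold 2; formerly Meta FAIR.
- **Brandon** (Person): Co-host of the Latent Space AI for Science sub-series; at Atomic AI (RNA therapeutics).
- **RJ Honicky** (Person): Co-host; CTO and founder of Miro Omix.
- **ESMC** (Software): Fourth-generation protein language model from BioHub/EvoScale; 300 million to 6 billion parameters; trained on about 1 billion sequences including metagenomic data; open source under the MIT license.
- **ESMFold 2** (Software): Structure prediction model built on ESMC representations; needs no MSA and runs inference in seconds per sequence; state-of-the-art open-weight multimer prediction.
- **ESM** (Software): Evolutionary Scale Modeling, the multi-generation lineage of protein language models pioneered by Rives's team (ESM-1, ESM-2, ESM-3, ESMC).
- **Sparse Autoencoders / SAEs** (Concept): Mechanistic interpretability tool that extracts the intrinsic feature geometry of the ESMC representation space; reveals a biologically interpretable hierarchy without supervision.
- **Bitter Lesson** (Concept): Richard Sutton's argument that general methods leveraging compute and data consistently outperform methods that encode domain knowledge; applied here to scaling protein biology.
- **Metagenomic sequencing** (Concept): Environmental DNA sequencing that captures microbial and viral diversity without culturing; the data expansion that restored ESMC scaling laws after UniRef saturated.
- **BioHub** (Organization): Chan Zuckerberg BioHub; a philanthropy building open-science tools at the intersection of experimental biology, measurement technology, and AI.
- **Virtual Biology Initiative** (Concept): BioHub's $500 million commitment ($400 million internal, $100 million external), aimed at generating the cell-scale data needed to train cell models that generalize.
- **AlphaFold** (Software): DeepMind's structure prediction system; uses MSAs and geometric inductive biases; contrasted with ESMC's MSA-free approach.
- **UniRef** (Software/Database): Standard protein sequence database; ESM2's training data, later identified as the scaling bottleneck.
- **Nucleophilic elbow** (Concept): Catalytic structural motif that appears in several evolutionarily unrelated protein families; encoded in ESMC as a single feature that activates across all of them.
- **Zellig Harris** (Person): Linguist; argued in his 1954 paper "Distributional Structure" that word contexts encode meaning; Rives cites him as the theoretical precursor for why amino acid context statistics should encode biological function.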
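
To make the masked-token objective mentioned in the summary concrete, here is a minimal sketch of the data-corruption step only: hide a fraction of residues and keep the originals as prediction targets. The function name, the `<mask>` token string, and the 15% masking rate are assumptions for illustration (15% is a common convention in masked language modeling, not a figure from the episode), and no claim is made that ESM uses exactly this procedure.

```python
import random

MASK = "<mask>"  # hypothetical mask token for this sketch


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues in an amino acid sequence.

    Returns (tokens, targets): `tokens` is the corrupted sequence as a list,
    and `targets` maps each masked position to the residue a model would be
    trained to predict from the surrounding context.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    tokens = list(seq)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


# Toy 20-residue sequence (made up for the example).
corrupted, answers = mask_sequence("MKTAYIAKQRQISFVKSHFS")
```

A model trained on this task only sees `corrupted` and is scored on recovering `answers`, which is why the context around each hidden residue has to carry information about structure and function.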

#protein-language-models #scaling-laws #esm
⚡️ Why You Should Build Sci-Fi — Sunil Pai, Cloudflare
14:47
Latent Space, 11 days ago


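In this short episode, swyx talks with Sunil Pai, head of Cloudflare's developer platform and the person swyx credits as the creator of Code Mode. They cover three topics: Cloudflare's infrastructure bet on Durable Objects and Dynamic Workers as the foundation for AI agents, the Twitter misunderstanding with Vercel that Sunil thought had ended his career, and why forking code is an act of respect rather than an attack. Sunil closes with a direct challenge: build sci-fi instead of incremental agent frameworks.

## [00:00] Who Created Code Mode?
The video opens with a three-second slate. swyx introduces Sunil as "the creator of Code Mode," and Sunil playfully takes grandiose credit, claiming he has been planning it since childhood. It is pure banter between two old friends, not a preview of the main content.

## [00:03] Introduction and Sunil Pai's Background
swyx reintroduces Sunil as a longtime friend and AIE Europe keynote speaker. A brief catch-up sets the context for what follows. Sunil's current focus is Cloudflare's AI agent platform, and Anthropic's recent launch of Cloud Managed Agents gives him a concrete point of comparison.
> *"I wanted to talk about what's been going on at Cloudflare lately."*

## [00:30] The New Cloud Managed Agents
Anthropic's newly launched Cloud Managed Agents, a platform for building and deploying long-running agents, is Sunil's starting point. He likes the Anthropic team and finds the product interesting, but says his competitive streak kicked in the moment he read the spec: Cloudflare can do this better. swyx asks which actual Cloudflare strengths back up the claim.
> *"After seeing the product, I wanted to compete. I think we can do it better with Workers and Durable Objects."*

## [01:10] Cloudflare's Core Infrastructure: Durable Objects and Dynamic Workers
Sunil names two primitives that every agent platform will eventually need. Durable Objects are stateful serverless units; Sunil claims they are the world's first actor model implemented at the infrastructure layer rather than as a user-level library. Dynamic Workers are Cloudflare's way of safely executing LLM-generated code: no cold starts, a configurable API surface, and external traffic blocked by default. Together, they let agent steps run in a sandboxed compute environment without spinning up a full VM.
> *"It's the world's first implementation of the actor model at the infrastructure layer. It's not userland."*

## [02:34] Cloudflare's Approach to AI Agent Architecture
The Cloudflare MCP server, built by colleague Matt Carey, shows Dynamic Workers in real use. The Cloudflare API has 2,600 endpoints, and exposing one tool per endpoint would overwhelm any LLM's context window. Instead, the server compresses everything into two tool calls, `search` and `execute`, both backed by JavaScript code running in an isolate. The agent submits code, the isolate runs it, and the result comes back, with no round trips to the LLM and with type checking.
> *"One tool call with the LLM, no round trips, and type checking too. It turns out LLMs are good at running code."*

## [03:40] The Future of Agent Software and Standardizing the "Harness"
swyx asks whether the harness concept in Anthropic's spec could become a cross-platform standard. Sunil's answer: nobody has built the React of AI agents yet. He deliberately reaches for the 2013 React analogy: people walking out of the JSConf talk, critics saying Facebook hated JavaScript, and React nonetheless going on to define every UI framework that followed. Right now everyone builds harnesses their own way, and nothing is reproducible across languages, companies, and infrastructure. swyx suggests that skills, which are plain markdown, could already be the unifying layer; Sunil finds the idea appealing but worries about how specific skills can get.
> *"It's so hard, but this is how I'm framing it in my head: nobody has built React yet."*

## [06:11] The "Slop Forks" Phenomenon and Open-Source Culture
When swyx brings up "slop forks," AI-generated forks of popular projects, Sunil lights up. In his view forking is not theft but a mark of prestige and respect. The React ecosystem grew through forks. To anyone who wants to build something that competes with the Cloudflare Agents SDK, he says go right ahead; that way everyone wins.
> *"Forking is a sign of prestige and respect in my culture."*

## [06:36] The Vercel / Cloudflare Social Media Misunderstanding
At JSConf España, Sunil met Harvey from Vercel and had a good time. He later discovered Just Bash from Vercel Labs, a pure JavaScript implementation of Bash, and wanted to port it to Cloudflare. Over lunch he had Opus analyze the codebase and got back 5,000 lines, then went to sleep planning to clean it up before sending a proper PR on Monday. He woke up to DMs from Cloudflare executives asking whether he had checked Twitter. Vercel's CTO had publicly criticized the work, treating it as a company-level move rather than a personal side project. Sunil calmly explained the situation, and half the internet rushed to defend him.
> *"I got on Twitter and the Vercel CTO was tearing down my work, saying 'this is something Cloudflare did.'"*

## [09:45] Why Forking Matters in Software Development
swyx connects the Vercel incident to a broader pattern: the story of a leaked codebase rewritten in Python to get around its license, which was legally ruled a derivative work. swyx's core argument is that slop forks deserve encouragement. Fork a dependency, internalize it, and own it, and you avoid upstream changing under you without warning, as with LiteLLM or Axios. Sunil agrees. Before NPM, software spread exactly this way over Usenet, and shortening the fork cycle is just an extension of that tradition.
> *"Forking is fundamental to how we build software."*

## [12:04] The Hostile Environment of Modern Open-Source Repositories
The Cloudflare Agents SDK had to block pull request contributions entirely. Only issues are allowed. Sunil has talked with open-source maintainers at conferences, and they all tell the same story: repositories have become hostile territory, and the most dangerous attack vector is the fake security report that looks completely legitimate until you read it closely. swyx ties this to what Peter from Claude Code said in a morning talk: the biggest attack surface today is a compromised dependency getting into Claude Code, which would expose every developer who uses it.
> *"Open-source repositories have become so hostile that people are afraid of getting popular at all."*

## [13:04] Closing Thoughts and a Push for Originality
Sunil's final request is direct: stop building the tenth agent framework and build sci-fi. Build something for your family. Use the Agent SDK if you like, but use it where the infrastructure and the LLMs are close to their limits. That is where the next step change is. swyx closes by recalling Sunil's "alpha thought leading" remark from React Rally 2018.
> *"Build something like sci-fi. Build something for your family. You have more than enough power to change the world; I just wish you'd be original."*

## People
- **swyx** (Person): Latent Space host; longtime friend of Sunil; coined the phrase "alpha thought leading" after Sunil's impromptu remarks at React Rally 2018.
- **Sunil Pai** (Person): Head of Cloudflare's developer platform; credited by swyx as the creator of Code Mode; AIE Europe keynote speaker.
- **Cloudflare** (Organization): Cloud platform company; building agent infrastructure on Durable Objects and Dynamic Workers.
- **Anthropic** (Organization): AI company; launched Cloud Managed Agents, the product Sunil wants to compete with.
- **Vercel** (Organization): Frontend cloud company; Sunil uses their AI SDK, and they are the other party in the Twitter misunderstanding.
- **Durable Objects** (Software): Cloudflare's stateful serverless primitive; according to Sunil, the world's first actor model implemented at the infrastructure layer.
- **Dynamic Workers** (Software): Cloudflare feature that runs LLM-generated or user-generated JavaScript in a secure isolate with no cold starts.
- **Just Bash** (Software): Vercel Labs project, a pure JavaScript implementation of Bash; Sunil's port of it to Cloudflare set off the Twitter incident.
- **MCP** (Concept): Model Context Protocol; Cloudflare's MCP server uses Dynamic Workers to compress 2,600 API endpoints into two tool calls.
- **Slop forks** (Concept): AI-generated forks of existing projects; Sunil reads them not as plagiarism but as a continuation of open-source forking culture, a mark of respect.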
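
The two-tool pattern summarized in the MCP entry above can be sketched in a few lines. This is an illustration of the idea only, written in Python rather than the JavaScript isolates the episode describes. The catalog entries, `call_endpoint`, and the convention that submitted code assigns to `result` are all hypothetical and do not reflect Cloudflare's actual API or MCP server. Note also that Python's `exec` is not a security boundary, so this shows the interface shape, not the sandboxing.

```python
# Hypothetical endpoint catalog; a real API would have thousands of entries.
CATALOG = {
    "zones.list": "List zones in the account",
    "zones.get": "Get one zone by id",
    "dns.records.list": "List DNS records for a zone",
}


def call_endpoint(name, **params):
    """Stand-in for a real API call; echoes what would be requested."""
    if name not in CATALOG:
        raise KeyError(f"unknown endpoint: {name}")
    return {"endpoint": name, "params": params}


def search(query):
    """Tool 1: find endpoints by keyword instead of listing all of them."""
    q = query.lower()
    return sorted(
        name for name, desc in CATALOG.items()
        if q in name.lower() or q in desc.lower()
    )


def execute(code):
    """Tool 2: run agent-written code that chains calls locally.

    The submitted code sees only `call` and must assign its answer to
    `result`. WARNING: exec with stripped builtins is NOT a sandbox; the
    episode describes isolates doing that job.
    """
    scope = {"__builtins__": {}, "call": call_endpoint, "result": None}
    exec(code, scope)
    return scope["result"]
```

The point of the design is that several dependent API calls happen inside one `execute` invocation, so intermediate results never travel back through the model's context.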

#cloudflare #ai-agents #open-source
⚡️ Google's Open AI Strategy — Omar Sanseviero, Google DeepMind
29:58
Latent Space, 11 days ago


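On site at AI Engineer London, swyx has a dense 30-minute conversation with Omar Sanseviero, head of developer experience at Google DeepMind. They cover Gemma 4's architectural innovations, Google's open-model strategy, and where the DevEx team grows next. Omar speaks candidly about how per-layer embeddings work, why the fine-tuning craze has cooled, what Kaggle joining DeepMind actually means, and whether "auto research" is real or hype.

## [00:00] Introducing Gemma 4 and the Team's Scope
Omar's one-sentence summary: Gemma 4 is "the most capable open model we've released so far," built on the principle of squeezing intelligence per parameter to the limit while keeping full multimodal support and staying light enough for local inference.
> *"We really tried to pack in as much intelligence per parameter as we could."*

## [00:23] Effective vs. Active Parameters Explained
The core design of the small Gemma 4 models is a per-layer embedding table inserted into each transformer block. Because these are lookups rather than matrix multiplications, the 3 billion embedding parameters do not need to reside in GPU memory. They can sit on CPU or disk while the 2 billion active parameters do the actual computation. Omar is upfront that the technique is specific to on-device use: for large models, dense or MoE architectures work better.
> *"The Gemma 4 model is E2B. What actually goes on the GPU is 2 billion parameters. In total it's almost 5 billion parameters, but the other 3 billion can live on CPU or disk."*

## [01:43] On-Device Use Cases and Gemini Nano Integration
Pixel phones and high-end Samsung devices ship with Gemini Nano built in, and Gemini Nano is trained on the Gemma 3n architecture, which Google designed for smartphone constraints. Gemma 4's parameter-offloading idea applies equally to this small variant. When swyx asks whether it can scale to the 29B to 31B range, Omar only says, "we're running a lot of experiments; stay tuned."
> *"If you buy a high-end smartphone, you can already use Gemini out of the box."*

## [03:14] Behind the Model Launch and the Developer Ecosystem
The Gemma team is much smaller than most people expect: two or three PMs, one marketer, and a core of engineers and researchers. What makes a launch complex is the external graph: coordinating some 50 partners at once, including llama.cpp, Ollama, MLX, Hugging Face, vLLM, Nvidia, and AMD, while also working internally with Google Cloud, Vertex, ADK, and Android. The Gemma 4 launch also included native integration with Android Studio's agent mode, so developers can get coding assistance from offline Gemma 4 inference.
> *"We had almost 50 external partners for the Gemma 4 launch. It was the most complex launch we've ever done."*

## [04:29] Offline vs. API Use and Future Model Growth
The offline/privacy distinction is real but not the whole story. Omar draws a sharper line: today's local models excel at capabilities (function calling, instruction following, agentic tasks) but still lag on knowledge density; reliably recalling niche facts takes a large model. His one-to-two-year outlook: Gemini Pro-class models running fully on-device, enabling experiences that currently require an API connection.
> *"I think within one to two years we'll be in a future where you can run a model as capable as Gemini Pro directly on your smartphone."*

## [06:26] Gemma 4 Multimodal Capabilities and Limits
Gemma 4 inherits the Gemini 3 research stack and supports audio understanding (speech recognition, speech-to-translated-text, question answering over audio clips) and vision (object detection, pointing, captioning) even in the 2B model. Omar explicitly names two limitations: no image segmentation, and no support for processing video and audio together in a single prompt; for now they have to be supplied as separate streams. Native speech output is under consideration, but nothing has been announced.
> *"Understanding video input and audio input separately works, but putting the visual part and the audio part together in the same prompt still needs more work."*

## [08:08] Multilingual Tokenizer Insights
Gemma's tokenizer is the same one that powers Gemini, a design choice that gives it an unusually strong multilingual foundation across 140 languages. Omar's concrete finding: fine-tuning with Gemma 3 as the base on Southeast Asian languages such as Vietnamese outperforms base models that score higher on English benchmarks. Rather than forcing non-Latin scripts through English-optimized subword pieces, the tokenizer captures tokens suited to the language.
> *"When you fine-tune these models on specific Southeast Asian languages like Vietnamese, Gemma gets better results, even when another base model is better overall."*

## [09:30] Google's Developer Experience Team at AI Engineer
London is DeepMind's home base. Bringing the whole team to AI Engineer Europe was a deliberate statement. Omar brought researchers spanning Gemma 4 development, diffusion text generation, robotics, on-device ML, and Android: not a DevEx roadshow but substantive research talks. swyx puts the breadth bluntly: "It's the lab with the widest scope. You even do dolphin research."
> *"We brought people from every area, from robotics to research to Android. It felt really good to be able to show everything the company is building."*

## [10:42] Diffusion Models for Text
Google announced Gemini Diffusion at I/O: a diffusion transformer that generates text rather than images, and does so much faster than autoregressive decoding. Omar's candid assessment: quality still trails the autoregressive baseline, and fine-tuning diffusion transformers is harder because distribution shift affects routing differently. swyx sketches a plausible architecture in which a diffusion model handles fast intuitive processing while an autoregressive model handles complex planning; Omar thinks it is possible but early.
> *"Right now it's still very experimental. Model quality is still a bit below what you can get from a typical autoregressive model."*

## [13:37] The State of Fine-Tuning and Community Trends
The fine-tuning community peaked in 2023 and the tide has been going out since. What Omar is seeing: several Gemma 4 launch partners planned to fine-tune the 27B vision model and then gave up midway, because the base model already did the job. General behavior changes that used to require fine-tuning are now handled by prompting. What remains is domain-specific fine-tuning for medical, financial, and niche data, plus the organizational challenge of managing LoRA compatibility when the base model is updated.
> *"I've seen a lot of cases like that. These days I feel the enthusiasm for fine-tuning as a general-purpose chat model is cooling."*

## [16:29] Dense vs. Sparse Architecture Tradeoffs
Gemma 4 shipped two large models with similar parameter counts: a 31B dense model (highest raw intelligence; fits on a consumer GPU when quantized) and a 27B MoE with 4B active parameters (fastest inference on the same hardware). The sizes were chosen deliberately for developer friendliness. Omar's warning to fine-tuners: MoE training recipes and hyperparameters do not port cleanly from dense models. Changes in the input distribution alter which experts activate, causing distribution shift in routing in ways that are not yet fully understood.
> *"MoEs are tricky to fine-tune. They work well at inference, but they struggle a bit when you fine-tune them."*

## [18:29] Intelligence per Parameter and Future Research Directions
Across Gemma 2, 3, and 4, Google has kept the maximum parameter count nearly fixed at about 30B while raising the performance ceiling substantially, which is direct evidence of improving intelligence per parameter. The harder comparison problem: once MoE sparsity and parameter offloading enter the picture, parameter count is no longer a common unit. Omar's candid outlook: the knowledge limit is likely structural. Three years from now, a 30B model will still fail at very niche factual recall because of information-theoretic limits.
> *"What is intelligence per parameter? How do we maximize this intelligence per parameter?"*

## [20:09] Gemma Scope and Mechanistic Interpretability
Google released Gemma Scope in December: a toolkit for analyzing activations across every layer of the Gemma 3 models, backed by an activation dataset covering all layers that runs to multiple terabytes (possibly petabytes). Omar presents mechanistic interpretability as a low-barrier way into ML research: you can run activation analyses without a training cluster and build real intuition about transformer internals through experiments.
> *"It's a field where you don't need a lot of compute to get started. It lets you understand how the model works."*

## [21:12] Where Research and Engineering Meet
The reason for bringing researchers to an engineering conference: engineers trust models more when they understand how they were built, even if they will never train one themselves. Omar and swyx both note that the line between research and engineering has blurred: most of a researcher's work is empirical ablation that is closer to engineering than theory, and coding agents give engineers direct access to experiments that used to require a research background. As examples of techniques that Reddit and Discord rediscovered independently and that labs later published as papers, Omar cites frankenmerges and the Axolotl community.
> *"Large-scale empirical experimentation, moving things around to see what works and what doesn't: to me that's much closer to engineering than research."*

## [23:59] Perspectives on "Auto Research" and Agentic Automation
swyx poses the key question: is auto research just "agentic hyperparameter sweeps," or can it produce discoveries like Move 37 that nobody would have looked for? Omar is a cautious skeptic: AutoML's track record was mostly grid search in disguise, and he thinks deep architecture work is unlikely to be automated in the next one to two years. But he expects fine-tuning itself to become fully agent-driven soon: users will instruct an agent to launch experiments instead of writing training code, using tools like Hugging Face's AutoTrain or Axolotl's CLI.
> *"The next generation of fine-tuners will be people who don't code at all. Most people will fine-tune with just a few skills."*

## [26:06] Team Expansion, Global Hubs, and Kaggle Integration
The DevEx team is currently hiring in Singapore and India, located in the same buildings as DeepMind research offices so that DevRel staff can walk down the hall to meet researchers instead of sitting in an isolated sales satellite office. The bigger organizational news: Kaggle has joined DeepMind, and Kaggle's competition and benchmark infrastructure connects directly to Gemma/Gemini capability gaps, so community-built benchmarks can come back as training signal. Omar describes it as a feedback-loop model: the team learns what developers are building through social media and events and carries that signal to the modeling side.
> *"The way we build Gemma, Gemini, and all of our tools is genuinely based on feedback from startups, the community, and developers."*

## Entities
- **Omar Sanseviero** (Person): Head of developer experience at Google DeepMind; previously led DevRel growth at Hugging Face; responsible for the Gemma developer ecosystem.
- **swyx** (Person): Latent Space podcast host; interviewer at AI Engineer London 2026.
- **Gemma 4** (Software): Google's open model family, with a per-layer embedding architecture (E2B effective-parameter offloading), 2B/4B/27B MoE/31B dense variants, support for 140 languages, and multimodal input.
- **Gemini Nano** (Software): On-device model based on the Gemma architecture; shipped by default on Pixel and high-end Samsung phones through the OS.
- **Gemma Scope** (Software): Google's mechanistic interpretability toolkit for layer-by-layer activation analysis of Gemma 3 models; released December 2025 with petabyte-scale activation data.
- **Gemini Diffusion** (Software): Google's experimental diffusion transformer for text generation (not images), announced at Google I/O; its main advantage is inference speed.
- **Kaggle** (Organization): Competition and benchmark platform that has joined Google DeepMind; connects community evaluations to the Gemini capability feedback loop.
- **Google DeepMind** (Organization): Google's unified AI research lab; spans Gemma, Gemini, robotics, on-device ML, and mechanistic interpretability.
- **AI Engineer London** (Organization): Applied AI engineering conference (2026 edition); the venue for this interview, held in DeepMind's home city.
- **MoE (Mixture of Experts)** (Concept): Sparse architecture that activates only a subset of parameters per token; faster inference than dense at equal parameter count, but harder to fine-tune because routing is sensitive to the input distribution.
- **Per-layer embedding** (Concept): Gemma 4's architectural change: lookup-table embeddings inserted into each transformer layer, allowing 3 billion parameters to stay off the GPU without matrix-multiplication cost.
- **Intelligence per parameter** (Concept): The performance-to-weights ratio that has improved from Gemma 2 to 3 to 4 while total parameter count held at about 30B.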
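
The E2B split Omar describes (roughly 5 billion total parameters, 2 billion on the accelerator, 3 billion offloaded) can be turned into a back-of-the-envelope memory estimate. The sketch below assumes 2 bytes per parameter (16-bit weights) and ignores activations and KV cache; both the helper names and the precision are assumptions for illustration, and the parameter counts are the rounded figures from the quote rather than exact model sizes.

```python
GIB = 2 ** 30  # bytes per gibibyte


def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory needed to hold `num_params` weights, in GiB."""
    return num_params * bytes_per_param / GIB


def offload_split(total_params, active_params, bytes_per_param=2):
    """Split weight memory between the accelerator and CPU/disk.

    `active_params` stay on the accelerator; the remainder (for example,
    per-layer embedding tables that are only looked up) is offloaded.
    """
    offloaded = total_params - active_params
    return {
        "accelerator_gib": weight_memory_gib(active_params, bytes_per_param),
        "offloaded_gib": weight_memory_gib(offloaded, bytes_per_param),
        "total_gib": weight_memory_gib(total_params, bytes_per_param),
    }


# Rounded figures from the quote: ~5B total, 2B active.
split = offload_split(total_params=5e9, active_params=2e9)
```

Under these assumptions the accelerator holds under 4 GiB of weights while more than 5 GiB sits on CPU or disk, which is the gap that makes the design attractive for phones.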

#gemma #google-deepmind #open-models
AI Agents Need Computers: 74% MoM Growth, 850K/Day Runs, & New Agent Cloud — Ivan Burazin, Daytona
1:11:40
Latent Space, 14 days ago


Ivan Burazin, CEO of Daytona, discusses the massive shift from building developer environments for humans to providing composable computers for AI agents. With 74% month-over-month growth and 850,000 daily runs, Daytona provides the bare-metal infrastructure required for stateful, high-performance agentic workflows. This conversation explores the technical challenges of spiky compute, the $10 trillion computer-use market, and why the future AI cloud will look more like Stripe than AWS.

## [00:00] Hook
Ivan Burazin describes the intense, direct demand for Daytona's infrastructure, with potential users calling him personally to request access. This level of interest signaled a massive, untapped market for providing execution environments to every future AI agent. The team realized they had identified a critical missing piece in the AI development stack.
> *"I've never experienced this that people literally call you if you do not give them access. Like they want access right now."*

## [01:12] Introduction
Host swyx introduces Ivan Burazin, noting their shared history in the developer experience and "end of localhost" movements. Ivan recalls reaching out to swyx years ago for advice on developer experience while working at a previous role. They reflect on how their early interactions and mutual interests in cloud-based development tools eventually led to their current collaboration.
> *"I was one of the co-founders of CodeAnywhere... we were thinking a long time of like local host should die."*

## [03:15] CodeAnywhere, Shift, and the end of localhost
Ivan discusses his long history with his co-founder, dating back to early 2000s virtualization and the creation of CodeAnywhere. As the first browser-based IDE, CodeAnywhere predated modern infrastructure like Docker and Kubernetes, which provided the team with deep foundational knowledge. After a successful run with the Shift developer conference, they returned to their infrastructure roots to launch Daytona.
> *"We originally started stacking stacking servers doing like virtualization in the early 2000s... and that was a services company which we sold."*

## [05:58] What Daytona is: composable computers for AI agents
Ivan defines Daytona as a provider of "composable computers" for AI agents, moving beyond the limited industry term "sandboxes." He explains that agents require diverse computing environments tailored to specific tasks, much like different hardware setups for human professionals. This API-driven infrastructure allows agents to execute code in production-grade environments rather than just temporary test boxes.
> *"What Daytona is today is essentially composable computers for AI agents... the market calls them sandboxes which [is] misleading."*

## [08:07] The pivot from dev environments to AI sandboxes
Ivan explains how observing early agents like Devin and OpenHands led to a realization that AI agents require a dedicated compute runtime. While their initial SaaS offering for human automation saw low traction, it attracted developers who specifically needed sandboxes for their agents. This feedback loop revealed a massive, underserved market for agent-specific infrastructure that standard cloud providers weren't addressing.
> *"a lot of people reached out that were building agents and they were like hey my agent needs a compute sandbox runtime"*

## [10:17] The New Year's Eve MVP and customers begging for API keys
On New Year's Eve, Ivan "vibe-coded" the first MVP of what would become the new Daytona. Although the CTO initially dismissed the code as "garbage," the core idea was strong enough to warrant a two-week professional rebuild. When they demoed this version to previous skeptics, the response was immediate and overwhelming, with users demanding API access before the calls even ended.
> *"I've never experienced this that people literally call you if you do not give them access."*

## [12:56] Bare metal, stateful sandboxes, and Daytona's scheduler
The team approached the technical architecture from first principles, deciding to run on bare metal rather than traditional VMs. They aimed to combine the speed of AWS Lambda with the stateful, long-running nature of an EC2 instance. This allows agents to "pause and come back" to their work, much like a human closing a laptop lid, without losing state or performance.
> *"agents will be like humans in the sense of you don't want your laptop to be shut down until you're done with work"*

## [17:28] 60ms startup, 50,000 sandboxes, and 850K daily runs
Daytona's infrastructure is optimized for both individual speed and massive concurrency, with a single instance spinning up in just 60 milliseconds. This scale supports high-volume customers who perform nearly 850,000 runs daily, with some requesting capacity for half a million concurrent CPUs. The system utilizes a custom scheduler and local NVMe drives to eliminate network latency and maximize IOPS.
> *"Our time to spin up one is 60 milliseconds with network latency... if you want to spin up 50,000 at once, we are now at about 75 seconds."*
> *"The biggest customer of ours does like about 850,000 every single day is sort of where they're where they're just shy of a million."*

## [21:53] Spiky RL/eval workloads and the new agent infra problem
The "spiky" nature of AI workloads presents a major challenge for compute providers, leading to a mean utilization rate of only 15% despite peaks hitting 90%. Workloads are categorized into "background agents" that follow human cycles and "evaluations/RL" which fire off massive bursts of activity at unpredictable hours. To manage this, Daytona must use capacity commits to handle sudden bursts of 100,000 or more CPUs.
> *"Daytona's mean utilization is 15%... because it's very spiky. But it's very spiky but we get up to 90%."*

## [28:12] RL workloads, Kubernetes pain, and dynamic resizing
Daytona competes primarily against managed Kubernetes services like EKS and GKE, positioning itself as a more ergonomic "Twilio or Stripe" for compute. Unlike Kubernetes, Daytona offers a seamless API for spinning up sandboxes with significantly faster startup times. A key advantage is the ability to dynamically resize sandboxes on the fly to prevent out-of-memory (OOM) errors, a feature difficult to implement on other platforms.
> *"Daytona although it's a compute provider it's more akin to a Twilio and Stripe from a consumption perspective than it is an AWS"*

## [33:31] Why every AI agent needs a computer
Ivan outlines the massive scale of knowledge work, estimating a $50 trillion global salary pool, much of which is locked in legacy Windows applications. He argues that true automation requires "human emulators" that can interact with these legacy systems via GUIs when APIs are incomplete.
By automating 40% of this work, the market opportunity for agentic computer use reaches approximately $10 trillion annually.", 'quotes': ['If you take 40% of that, you get to essentially like 10 trillion dollars a year.', [35, 20], '\n ]\n },\n {\n ', 'title": "macOS sandboxes and Apple’s licensing problem', {'start': 2328.0, 'summary': "The discussion shifts to the difficulties of hosting Mac OS sandboxes compared to Windows and Linux. Apple's restrictive licensing only allows two parallel VMs per machine and requires a 24-hour lock-in for users, making per-second billing economically unfeasible. Furthermore, security restrictions prevent moving memory snapshots between physical machines, severely limiting the scalability of agentic workloads on Mac hardware.", 'quotes': ['Apple is shooting itself in the foot... if it would just enable a concurrency model similar to what you can get on a Windows.', [40, 52], '\n ]\n },\n {\n ', 'title": "Why CLI may matter more than MCP', {'start': 2668.0, 'summary': "The discussion compares the Model Context Protocol (MCP) to the Command Line Interface (CLI) for agentic action. While MCP acts as an interface for APIs, the CLI allows agents to execute scripts and perform deep data analysis within a sandbox. This layer of indirection enables more complex agentic workflows beyond simple data retrieval, allowing agents to actually 'do things' rather than just integrate.", 'quotes': ['the MCP is an interface against an API whereas the CLI is like you can actually go do things... the difference between integrations and actually running scripts.', [45, 34], '\n ]\n },\n {\n ', 'title": "Open source', 'GitHub stars', 'and agent integration', {'start': 2891.0, 'summary': "Ivan details Daytona's transition to an AGPLv3 license for its sandbox product to balance openness with commercial protection. This 'copyleft' approach allows enterprise use but prevents competitors from building proprietary forks without contributing back. 
Keeping the core engine transparent builds trust with users and allows large enterprises to bypass lengthy security audits by providing agents with full context.", 'quotes': ["in the new sandbox product we did add a AGPL3... you essentially can't make a competitor without open sourcing your stuff.", [49, 49], '\n ]\n },\n {\n ', 'title": "Git', 'CI/CD', 'and agent collaboration bottlenecks', {'start': 3191.0, 'summary': 'Current versioning systems like GitHub are often too slow for the high-velocity output of AI agents, leading to bottlenecks in CI/CD pipelines. Some developers are creating makeshift solutions like dumping codebases into JSON files on S3 to bypass Git overhead. There is a growing need for an agent collaboration layer that precedes the traditional Git-based pipeline to handle companies generating over 1,000 PRs per day.', 'quotes': ["GitHub as-is was an overhead... it wasn't fast enough what they needed.", [54, 3], '\n ]\n },\n {\n ', 'title": "Founder life and building a 25-person infra company', {'start': 3495.0, 'summary': "Daytona's success stems from a core team of 13 people who have worked together for over seven years, fostering a high-trust culture. Ivan acknowledges the difficulty of the founder journey, including being away from family, but posits that growth requires 'pain.' He views his work as building the spiritual successor to serverless and Kubernetes for the agent era, requiring radical responsiveness as a differentiator.", 'quotes': ['Of the 25 people in Daytona, I think about 13 of them we have worked with seven years plus.', [58, 57], '\n ]\n },\n {\n ', 'title": "AI SaaS', 'token resale', 'and API-first business models', {'start': 3764.0, 'summary': 'Ivan presents a critical take on the SaaS ecosystem, arguing that the market is incorrectly applying a premium to vendors who simply resell AI tokens. He points out that these models have significantly worse margins than traditional SaaS. 
Instead, he advocates for companies to expose their data via APIs and charge for consumption, allowing for actual revenue acceleration through increased agentic usage.', 'quotes': ["The market is adding premium to SAS vendors that are reselling tokens. And I think that's incorrect.", [62, 54], '\n ]\n },\n {\n "title": ', 'GPU sandboxes', 'data centers', 'and compute growth', {'start': 3970.0, 'summary': 'Daytona plans to introduce GPU sandboxes to support workloads like 3D rendering and reinforcement learning on CAD, rather than focusing on inference. While the company currently runs on bare metal via colocation providers, Ivan notes they are architected to potentially own data centers in the future. He currently avoids the high capital risk of building data centers for single-digit margin gains.', 'quotes': ['We will [offer GPUs], but not for inference. Like essentially what we think about is like the GPU sandbox.', [66, 21], '\n ]\n },\n {\n ', 'title": "Why the AI cloud may look more like Stripe than AWS', {'start': 4188.0, 'summary': "The conversation concludes by imagining the 'AWS for AI Agents,' which Ivan suggests might look more like Stripe than a traditional cloud provider. This future 'AI Cloud' will integrate sandboxes, web search, and databases as fundamental primitives. While companies like Cloudflare and OpenAI are competing for this space, Ivan hints that many more infrastructure primitives for agents are yet to be developed.", 'quotes': ["There will be a cloud built out specifically for agents and so that cloud will have sandboxes and it will have web search and it'll have databases.", [70, 47], '\n ]\n },\n {\n ', 'title": "Closing thoughts', {'start': 4286.0, 'summary': 'The discussion ends with the observation that the AI infrastructure market is growing at an unprecedented baseline of 40-75% month-over-month. 
Ivan and swyx reflect on the race to secure hardware and the shift toward specialized agent clouds that will define the next decade of computing.', 'quotes': ["The entire infrastructure market is growing 40% plus or minus month over month... if you're not growing 40%ish... you don't have to come to work.", [68, 23], '\n ]\n }\n ],\n ', 'entities": [\n {\n "name": "Ivan Burazin', {'type': 'person', 'description': 'CEO of Daytona and co-founder of CodeAnywhere.'}, {'name': 'swyx', 'type': 'person', 'description': 'Host of Latent Space and early investor in Daytona.'}, {'name': 'Daytona', 'type': 'organization', 'description': 'A company providing composable computers and sandboxes for AI agents.'}, {'name': 'CodeAnywhere', 'type': 'organization', 'description': 'The first browser-based IDE, co-founded by Ivan Burazin.'}, {'name': 'Devon', 'type': 'product', 'description': 'An early AI software engineer agent.'}, {'name': 'OpenHands', 'type': 'product', 'description': 'An open-source AI agent project formerly known as OpenDevin.'}, {'name': 'Kubernetes', 'type': 'technology', 'description': "Orchestration technology mentioned as a competitor to Daytona's ergonomic API."}, {'name': 'Apple', 'type': 'organization', 'description': 'Mentioned regarding restrictive Mac OS virtualization licensing.'}, {'name': 'Salesforce', 'type': 'organization', 'description': 'Cloud-based software company mentioned for its API-first strategy.'}, {'name': 'GitHub', 'type': 'organization', 'description': 'Developer platform noted for being a bottleneck in agentic CI/CD workflows.'}, {'name': 'Nvidia', 'type': 'organization', 'description': 'The primary provider of GPUs whose supply constraints dictate market growth.'}, {'name': 'Stripe', 'type': 'organization', 'description': 'Used as a comparison for the consumption-based model of the future AI cloud.'}], 'tags': ['ai-agents', 'infrastructure', 'sandboxing', 'bare-metal', 'cloud-computing', 'developer-tools', 'computer-use', 
'saas-growth'], 'seo_title': "AI Agents Need Computers: Ivan Burazin on Daytona's Pivot", 'seo_description': 'Ivan Burazin explains why AI agents need composable computers and how Daytona pivoted from dev environments to 850K daily agent runs.', 'confidence': {'score': 0.98, 'rationale': 'The summary synthesizes multiple detailed chunks covering technical metrics, business strategy, and market philosophy with high fidelity to the source.'}}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}* ## [01:12] Introduction ## [03:15] CodeAnywhere, Shift, and the end of localhost ## [05:58] What Daytona is: composable computers for AI agents ## [08:07] The pivot from dev environments to AI sandboxes ## [10:17] The New Year’s Eve MVP and customers begging for API keys ## [12:56] Bare metal, stateful sandboxes, and Daytona’s scheduler ## [17:28] 60ms startup, 50,000 sandboxes, and 850K daily runs ## [21:53] Spiky RL/eval workloads and the new agent infra problem ## [28:12] RL workloads, Kubernetes pain, and dynamic resizing ## [33:31] Why every AI agent needs a computer ## [38:48] macOS sandboxes and Apple’s licensing problem ## [44:28] Why CLI may matter more than MCP ## [48:11] Open source, GitHub stars, and agent integration ## [53:11] Git, CI/CD, and agent collaboration bottlenecks ## [58:15] Founder life and building a 25-person infra company ## [1:02:44] AI SaaS, token resale, and API-first business models ## [1:06:10] GPU sandboxes, data centers, and compute growth ## [1:09:48] Why the AI cloud may look more like Stripe than AWS ## [1:11:26] Closing thoughts
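
The startup and utilization figures quoted in the Daytona chapters above can be sanity-checked with simple arithmetic. The sketch below uses only the numbers from the summary (50,000 sandboxes in about 75 seconds, 15% mean and 90% peak utilization); the function names are illustrative assumptions and are not part of any Daytona API.

```python
def amortized_startup_ms(batch_size: int, batch_seconds: float) -> float:
    """Average wall-clock milliseconds per sandbox when a whole batch starts at once."""
    return batch_seconds * 1000.0 / batch_size


def peak_to_mean_ratio(mean_utilization: float, peak_utilization: float) -> float:
    """How many times higher the peak load is than the average load."""
    return peak_utilization / mean_utilization


# Figures as quoted in the episode summary (approximate, as spoken).
per_sandbox_ms = amortized_startup_ms(batch_size=50_000, batch_seconds=75.0)
burst_ratio = peak_to_mean_ratio(mean_utilization=0.15, peak_utilization=0.90)

print(f"amortized startup: {per_sandbox_ms:.1f} ms per sandbox")  # 1.5 ms
print(f"peak load is {burst_ratio:.0f}x the mean")                # 6x
```

Read this way, a 50,000-sandbox burst amortizes to roughly 1.5 ms per sandbox, far below the 60 ms single-start figure, which is consistent with heavily parallel startup. A fleet sized for the 90% peak would also sit mostly idle at the 15% mean, which is plausibly why capacity commits come up in the conversation.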

The Agent-Native Cloud: Jake Cooper on Railway's Future
1:29:54
Latent Space · 15 days ago


Jake Cooper, CEO of Railway, details the platform's evolution from a high-burn startup to a sustainable, bare-metal cloud infrastructure powering 3 million users. He argues that the rise of AI agents necessitates a fundamental rebuild of the cloud, moving away from human-centric tools like Kubernetes and pull requests toward high-density CLI handles and production forking. This conversation provides a roadmap for building modular, high-scale systems capable of supporting the next generation of automated software development.

## [00:00] Intro

Jake Cooper argues that developers should stop writing code by hand and instead focus on reviewing agent-generated code to maintain architectural integrity. He emphasizes that while AI tools have improved significantly, underlying architectural patterns matter more than ever in an automated workflow. The hosts introduce Jake as the 'Conductor' of Railway, setting the stage for a discussion on the future of cloud platforms and developer experience.

> *you should be reviewing the code that you are writing instead of trying to go and write it by hand.* [00:10]

## [01:19] What Is Railway?

Railway is described as a platform that allows users to deploy applications and databases instantly via a canvas or AI prompts like Claude. Jake explains that the goal is to manage software versioning and environment cloning to reduce the complexity of traditional tools like Docker and Kubernetes. By tracking all changes, Railway enables developers to fork production environments into parallel universes for safe validation without reproducing staging environments manually.

> *railway is the easiest way to ship anything.* [02:29]

> *we want to make it really easy for not just to like deploy things, but for you to almost like evolve applications over time.* [02:49]

## [03:26] Jake’s Path to Railway

Jake details his professional journey from front-end work at Wolfram to building distributed systems for Jump bikes at Uber using Cadence. He describes his engineering philosophy as a willingness to 'swim to the bottom of the pool,' which includes writing kernel patches to ensure the best possible user experience. Additionally, he critiques GitHub's architecture, specifically the 'broken pointers' created by cloning, which complicates upstream contributions.

> *we will swim to the bottom of the swimming pool to go and get the experience* [04:35]

> *GitHub's original sin is that it's like almost a series of broken pointers.* [06:02]

## [07:32] Railway’s Six-Year Growth Story

Jake presents a growth chart illustrating the rapid increase in daily signups for the Railway platform, which has transitioned from a 'slow grind' to adding 100,000 users weekly. Early growth was driven by high-touch interaction on Discord and a determination to acquire the first 100 core users manually. This visual data serves as a transition into the company's history of scaling and its move toward becoming a primary cloud provider.

> *so I just wanted to like pull up this glorious chart you say which is basically your usage or number of daily signups* [07:34]

> *Trying to get those initial like first 100 users to like actually kind of come back to it.* [08:21]

## [10:11] Rebuilding the Business After the Free Tier

At one point, Railway was losing $500,000 a month while only generating $50,000 in revenue, despite having $20 million in the bank. Cooper realized this was an unsustainable business model and chose to prioritize long-term viability over vanity metrics, temporarily closing the free tier to rebuild. The company now maintains a lean team of 35 people, preferring to build automated systems rather than throwing headcount at problems.

> *We basically had to kind of close off the the free kind of users for a little while, rebuild the business.* [11:47]

> *We're 35 people right now... we don't want to just like add headcount for the sake of headcount.* [10:52]

## [12:36] Agents as the Next Software Platform

Over the last six months, Railway has prioritized 'agentic' development as the primary mechanism for building and deploying software. Cooper believes the industry is moving from assembly and high-level languages to 'words' as the primary interface. He envisions a future where thousands of agents run in parallel, requiring new tools for coordination and version control to manage the super-exponential growth of workloads.

> *We've moved from assembly to C to C++ to JavaScript to now like words.* [13:23]

## [14:48] Railway’s Infrastructure Philosophy

Jake Cooper explains that Railway prioritizes control over low-level primitives like network, compute, and storage to optimize for AI agent workloads. By avoiding Kubernetes in favor of custom orchestration, the team can place workloads with high precision to ensure memory efficiency. This level of control is necessary to prevent cost structures from ballooning as agent usage increases and requires thousands of parallel instances.

> *you have to be very very efficient with these agents... or you're going to massively massively blow up your cost structure* [15:10]

> *How do you get agents to coordinate? How do you go and get them to be able to like safely version changes?* [14:28]

## [17:01] Bare Metal, Cloud Economics, and the Compute Crunch

Cooper describes the transition to bare metal as highly lucrative, reporting a payback period of just three months compared to cloud rental costs. This strategy allows the company to achieve 70% margins while leveraging hardware that remains viable for several years. He also notes the surprising appreciation of hardware assets, such as RAM, due to the global compute shortage and supply chain constraints.

> *our payback period when we go to to metal... if we rent it in the cloud, our payback period is about 3 months.* [17:02]

> *hardware and all of this stuff is... appreciated in value because RAM has gone up* [17:50]

## [18:41] Cloud Bursting and Five-Cloud Networking

To maintain growth without being compute-constrained, Railway utilizes a hybrid cloud strategy for bursting capacity across AWS, GCP, and Oracle. This required building a custom network overlay capable of straddling five different cloud environments simultaneously. While this complexity led to past reliability challenges, it now allows Railway to scale rapidly regardless of individual provider quotas or hardware availability.

> *I spent a weekend rebuilding our entire like network like overlay essentially so that we could straddle uh five different clouds* [19:41]

> *we still maintain like cloud presence for like bursting essentially* [18:52]

## [21:39] Data Center Debt and Infra Financing

Cooper highlights the strategic use of data center debt, secured against hardware, as a more efficient alternative to venture capital for infrastructure expansion. By treating compute capacity as a linear driver of revenue, Railway can scale as quickly as they can deploy new hardware. He encourages infrastructure startups to explore diverse financing tools rather than relying solely on expensive venture equity for physical assets.

> *we can scale revenue as basically as quickly as we can scale compute* [21:20]

> *our margins on metal are like quite high for the like 70%.* [20:46]

## [24:50] Data Centers in Space

Jake Cooper and the hosts explore the technical challenges of placing data centers in space, specifically the issue of heat dissipation in a vacuum. Cooper expresses skepticism toward current proposals that ignore fundamental thermodynamic laws, comparing the 'figure it out later' mentality to science fiction. He highlights the difficulty VCs face in distinguishing between visionary ideas and technical 'grifts' in the space-tech sector.

> *I haven't seen anybody like prove how you're going to go and dissipate that much heat in a vacuum* [25:16]

> *how do you know what's like basically not possible and like is a grift versus like uh is possible but like sounds completely insane* [26:16]

## [26:43] What Agents Need From Infrastructure

Cooper outlines the infrastructure needs of AI agents, noting they require versioning, observability, and storage similar to humans but at a 1000x scale. He predicts that current industry standards like Kubernetes and Envoy will become bottlenecks as agentic workloads compress development cycles. To support this growth, infrastructure must be modular enough to allow for the rapid replacement of failing components without human intervention.

> *the workload profile doesn't change so much as it gets like massively massively compressed because you need to do thousands of these things* [28:28]

> *you just need at a thousandx scale* [29:13]

## [29:43] CLIs, Canvas, and Agent-Native UX

Cooper explains that while humans prefer simplicity, agents benefit from high-density CLI interfaces with numerous flags that serve as 'handles.' The Railway Canvas is also evolving into an output mechanism and 'context anchor' rather than just an input tool. This hierarchical view of infrastructure prevents critical knowledge from being siloed as teams scale complex 'hyperstructures' using automated agents.

> *If you hand it to an agent and you say, 'Hey, that's 40 arguments and 600 flags.' Like, oh yeah, this is excellent.* [30:35]

> *It has to be almost like an anchor for your context. It has to be like a port in the storm.* [34:27]

## [36:34] Central Station, Incidents, and Responsible Disclosure

Railway utilizes an internal tool called Central Station to aggregate feedback and user context, moving away from static communication channels like Slack. The team emphasizes transparency by exposing real-time metrics and detailed incident reports, operating under a core value of 'honor.' This approach involves over-disclosing issues to users rather than providing vague or misleading information during outages.

> *We'd rather overdisclose and know that you know that something is wrong versus almost like having your provider gaslight you.* [40:22]

> *If you can dynamically aggregate that information and dynamically route it to the right person... this is no longer a manual process.* [37:10]

## [41:49] Safe Rollouts, SRE Agents, and Production Forks

To mitigate the impact of bugs, Railway employs incremental rollouts and makes it easy to test behaviors in safe, shadowed environments. Cooper argues that production should not be treated as 'sacred' to the point of stagnation; instead, infrastructure should allow for trivial production forks. This is essential for AI agents, which face a 'stacking entropy' problem without safe iteration primitives to prevent system drift.

> *We've built so much ceremony around like production is sacred... we need to get to a point where it's just trivially easy to test different behaviors.* [41:33]

> *I think if you don't have the primitives to make iterating in production safe, it becomes very very difficult.* [44:03]

## [46:19] AI SRE, Specs, Code, and Tests

Jake Cooper reflects on his transition from an AI skeptic to a believer, noting that the safety of AI SREs depends on infrastructure primitives. He advocates for the 'Holy Trinity' of software engineering: a clear specification, the code, and the tests. By aligning these three, developers and agents can reconcile discrepancies and maintain system integrity during rapid, automated iteration.

> *If you just unleash an AI SRE on your production infrastructure... it's going to nuke your production database.* [46:37]

> *You need three points essentially which is you need a clear spec... you need the code and then you need the tests.* [48:22]

## [49:43] Self-Replicating Infrastructure and the New Serverless

The speakers explore the concept of agents using the Railway CLI to modify their own infrastructure, creating a self-replicating loop. This shift necessitates a move away from expensive, static virtual machines toward cheap, instantaneous 'atomic units of deploy' like isolates or sandboxes. The goal is to make throwaway copies of production as trivial and cost-effective as possible for agentic experimentation.

> *The agent can like modify its own infra which I think is... yeah it's nuts.* [50:04]

> *How do you go and make those throwaway copies like as trivial as possible to spin up run super cheap etc.* [50:53]

## [54:37] Heroku, Temporal, and Workflow Engines

Cooper attributes the decline of Heroku to Salesforce's lack of focus on compute as a core business, leading to product stagnation. Railway positions itself as a 'fluid compute' provider, leveraging Cooper's decade of experience with Temporal (and its precursor Cadence) for durable workflows. Railway is a power user of Temporal, using it to manage complex, long-running infrastructure tasks at scale.

> *The business of Salesforce is to build a really really good CRM... and then you acquire this business as a compute business that's kind of an offshoot* [55:33]

> *I have used Temporal for almost like 10 years now, right? Because like Cadence, all of us other things.* [1:00:05]

## [1:05:26] Railpack, Nixpacks, and Lazy-Loaded Filesystems

Railway is developing Railpack, an engine for determining source code dependencies, which evolved from their earlier Nix-based tool, Nixpacks. While Nix offers theoretical benefits for versioning, Railway found it caused significant image bloat and scaling issues for real-world workloads. They are now exploring content-addressable file systems to enable lazy loading of data into memory for faster deployments.

> *If you want version X and version Y, you end up bloating a lot of your kind of like package like space.* [1:06:02]

## [1:07:20] Coding Agents, Token Spend, and Roadmap Acceleration

With a monthly cloud spend reaching $300,000, Railway heavily incentivizes the use of AI coding agents among its employees. Cooper argues that manual code generation is an inefficient use of time, urging developers to focus on architectural patterns and code review. This allows the team to 'speedrun' their product roadmap by automating complex infrastructure tasks and test generation.

> *If you are writing code by hand you are doing this wrong... you should be reviewing the code that you are writing.* [1:07:37]

> *If you're not using the AI systems to almost like speedrun your road map... then you're kind of missing a large point.* [1:09:12]

## [1:12:15] The Pull Request Is Dying

The traditional SDLC is undergoing a radical transformation where the pull request and manual code review are losing relevance. Impact is increasingly measured by the 'percentage of tokens that end up in production' rather than lines of code. As AI systems handle more reconciliation and validation, the focus shifts from the PR to the initial prompt and final deployment.

> *The pull request is dying... it's going to be the prompt... and beyond that code review is also kind of dying.* [1:12:23]

> *The really naive way to go in and measure this is almost like your percentage of tokens that end up in production.* [1:11:40]

## [1:13:47] Feature Flags and the Agent-Era SDLC

Jake Cooper discusses the critical role of feature flagging in managing the 1000x compression of the SDLC driven by AI agents. He argues that incremental rollouts and blast radius management through flagging will become even more essential for safety as deployment speed increases. This culture of flagging allows for rapid experimentation without compromising system stability for enterprise customers.

> *Everything's just going to get compressed by like a thousandx so that everybody can go and do that.* [1:17:21]

## [1:17:34] Cattle, Pets, and Cloning Machines

Jake offers a contrarian view on the 'cattle not pets' philosophy, suggesting that snapshotting allows developers to treat infrastructure like 'pets' again. By snapshotting every frame and lazily loading file systems, the overhead of traditional DevOps tools like Dockerfiles is reduced. Railway even modifies the kernel to support persistent connections during these system snapshots.

> *I think you can move towards having pets so long as... you have a cloning machine for your pets.* [1:18:02]

> *If you can snapshot every single thing at every frame, then like it actually doesn't matter if you know that obliterated.* [1:18:12]

## [1:20:48] Solo Founder Lessons

Jake reflects on his path as a solo founder, contrasting it with the Silicon Valley consensus of finding a co-founder. He emphasizes the need to be obsessed with every layer of the stack, from kernel-level changes to go-to-market strategies. He argues that having two co-founders can often lead to deadlocks without a clear tiebreak, whereas solo leadership allows for singular vision.

> *Two is the worst number of co-founders is because you have no tiebreak... you basically are like, well, I disagree on this thing.* [1:22:49]

## [1:25:31] Focus, GPUs, and Building a New Cloud

Railway is intentionally avoiding the GPU provider market for now to maintain its core mission, though Cooper admits GPUs are an inevitable part of their long-term roadmap. He stresses that companies are defined as much by what they choose not to do as by what they execute. The ultimate goal is full vertical integration to ensure a seamless experience from logic to execution.

> *I think you're you're defined almost more by the things that you don't do than the things that you do* [1:26:08]

> *I can tell you for a fact that we will not be doing GPUs now, but we 100% will be doing GPUs at some point.* [1:26:50]

## [1:29:39] Closing Thoughts

Cooper reveals that Railway is moving toward 100% ownership of its data centers to avoid copying the infrastructure of legacy hyperscalers. By inventing their own infrastructure from scratch, Railway aims to support 'vibe coding,' where the friction between a thought and a live application is completely removed. This approach empowers a new generation of 'citizen developers' to build at the speed of thought.

> *there should be no friction in between what your thought is and reality that kind of comes out.* [1:29:04]

> *we've been very very deliberate to like invent our own infrastructure from scratch.* [1:28:30]

## Entities

- **Jake Cooper** (person): CEO and 'Conductor' of Railway.
- **Railway** (organization): A cloud platform designed for easy deployment and environment management.
- **Uber** (organization): Jake's former employer where he worked on distributed systems for Jump bikes.
- **Temporal** (software): A workflow orchestration platform used by Railway for reliable infrastructure tasks.
- **Salesforce** (organization): The CRM company that acquired Heroku, leading to its perceived stagnation.
- **Heroku** (organization): A pioneer PaaS platform that Railway is often compared to.
- **AWS** (organization): Amazon Web Services, used by Railway for hybrid cloud bursting.
- **GCP** (organization): Google Cloud Platform, one of the five clouds Railway straddles.
- **Claude** (software): An AI model mentioned as an interface for deploying on Railway.
- **GitHub** (organization): A code hosting platform discussed regarding its architectural flaws in versioning.
- **Kubernetes** (software): An orchestration system Railway chooses to avoid for higher-order control.
- **Central Station** (product): Railway's internal tool for aggregating user context and support feedback.
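
The incremental rollouts and feature flags discussed in the safe-rollouts and feature-flag chapters above are often implemented as deterministic percentage bucketing. What follows is a minimal sketch of that general technique, not Railway's implementation; `rollout_bucket`, `in_rollout`, the flag name, and the user IDs are all hypothetical.

```python
import hashlib


def rollout_bucket(flag: str, subject_id: str) -> int:
    """Map a (flag, subject) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{subject_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(flag: str, subject_id: str, percent: int) -> bool:
    """Return True if this subject falls inside the first `percent` buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(flag, subject_id) < percent


if __name__ == "__main__":
    # Hypothetical flag and users, purely for illustration.
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(in_rollout("new-scheduler", u, 10) for u in users)
    print(f"{enabled} of {len(users)} hypothetical users see the flag at 10%")
```

Because the bucket depends only on the flag name and the subject ID, a subject enabled at 10% stays enabled when the percentage is raised, which keeps the blast radius predictable as a rollout widens.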

#cloud-computing #ai-agents #infrastructure
The Next War Is Already Here — Yaroslav Azhnyuk, The Fourth Law & Noah Smith, Noahpinion
1:59:28
Latent Space · 17 days ago


Ukraine produced 4 million FPV drones last year; China could produce 4 billion. That asymmetry frames two hours of unusually concrete conversation between Yaroslav Azhnyuk — serial tech founder turned AI-drone builder at The Fourth Law — and economist Noah Smith, who has been writing about the economics of drone warfare since before most Western policy circles took it seriously. They cover the full tech stack (cameras, autonomy modules, fiber optic links, interceptors, a semiconductor fab under construction), a five-level autonomy taxonomy, an eight-dimension autonomous-battlefield framework, and China's manufacturing edge that has no near-term Western answer. The through-line: the West is still planning to fight the last war, Ukraine is the defense valley where the next war is already live, and the gap is widening faster than most people realize. ## [00:00] Cold Open: China's 4 Billion Drones and the Cameras-to-Explosives Pipeline Yaroslav opens cold with a single arithmetic comparison that structures the rest of the episode. Ukraine, not an industrial powerhouse, built 4 million FPV drones in a year. China, with an order-of-magnitude larger manufacturing base and a consumer electronics supply chain already producing the same cameras, motors, and chips, could produce 4 billion. Noah immediately asks whether that makes China the supreme conventional military power on earth right now. Yaroslav won't claim certainty, but won't rule it out either. > *"I don't think we have all the information to claim that, but we cannot count it out. And that alone should be, you know, a big warning sign."* The cold open also plants the personal pivot that the rest of the episode unpacks: Yaroslav went from making cameras that fling treats to pets to cameras that fling explosives to occupiers. ## [01:04] Introduction: Brandon, Noah Smith, and Yaroslav Azhnyuk Guest host Brandon normally runs a science podcast; this episode is the exception. 
Noah Smith — Noahpinion Substack, economist focused on industrial policy and geopolitics — is co-host and co-interviewer. Yaroslav sets the personal context: on February 23rd, 2022, he and his then-fiancée landed in Kyiv at 11 p.m. on what turned out to be one of the last flights into the city. Eight hours later, the bombs fell. The 17-hour drive west that followed — empty streets, gas stations out of fuel, pouring diesel into windshield-washer canisters — reads like a scene from an apocalyptic film because, for the people living it, it was exactly that.

> *"We basically packed our belongings and got in the car and spent 17 hours riding west. That was exactly like that. I, you know, missiles are falling, like there was smoke in Kyiv."*

## [05:41] From Tech Entrepreneur to Defense: PetCube, Brave 1, and the D3 Fund

Yaroslav's path from pet-tech to defense wasn't a straight line. In San Francisco from 2014 to 2020 building PetCube (one of the leading pet-camera companies), he had never taken military coursework and considered wars a thing of the past. Day one of the invasion he knew he would fight back with everything he could — but weapons weren't the first instinct. Early efforts included lobbying U.S. Congress on Lend-Lease (passed May 2022, underdelivered), co-founding Brave 1 (Ukraine's defense-innovation cluster, analogous to DIU), and helping seed the D3 Fund co-started by Eric Schmidt. By 2023, two things became undeniable: the war would last, and drones had permanently redefined warfare — the first software-defined weapon platform in history, where a battlefield capability upgrade can be pushed overnight like a software update.

> *"It's like if you were able to push a software update and get all of your Roman legionaries a new helmet. That has never been possible before."*

## [10:42] The Ethics of Building Weapons: Dual-Use Technology and the Wolf at the Door

Brandon raises the dual-use problem: the technology won't stay in Ukrainian hands.
Yaroslav's answer is pragmatic rather than philosophical. Every technology from fire to large language models is dual-use; the question for a maker is whether the marginal risk of their contribution outweighs the immediate need. Ukraine is in a forest with a wolf. You deal with the wolf first, then consult Greenpeace. He's clear-eyed that no technology stays contained — the parallel concern about LLMs freely available in North Korea and Russia applies equally to drone autonomy — but frames his own company's responsibility narrowly: they supply to the Ukrainian government and armed forces, not to arbitrary buyers.

> *"When you're in a situation where you're in a forest in front of a wolf, you know, you first going to deal with a wolf that wants to eat you and then you're going to go consult Greenpeace."*

## [14:01] The Tech Stack: Cameras, Autonomy Modules, Interceptors, and a Semiconductor Fab

The Fourth Law's structure is three interlocking business units. Cameras (daytime and thermal, sold to 200+ Ukrainian drone manufacturers). Drone autonomy modules (sold to the same ecosystem). And UAV products sold direct to the armed forces: FPV strike drones, bombers, Shahed interceptors, and ISR interceptors — drones that hunt Russian reconnaissance drones before they can relay targeting data. The thermal-camera arm is about to start construction on two semiconductor fabs to manufacture sensor chips in-house, driven by the realization that dependence on foreign sensor supply chains is a strategic vulnerability.

> *"We're about to start construction of two semiconductor plants to make sensors for thermal cameras. That's super exciting for me as a computer science guy — doing semiconductor, super cool."*

## [18:47] Fiber Optic vs. AI: The Radio Horizon Problem and $32/km Cable

The chapter is really about why radio-only FPV drones fail at long range — not just from jamming, but from the curvature of the Earth.
Below roughly 60-100 meters altitude at 30-40 km range, a drone enters a radio shadow behind hills, forests, or the horizon itself. The pilot loses video and control precisely when closing on a target that is, by definition, on the ground. Fiber optic cable ($32/km, spooled from the drone) solves the shadow problem but adds weight, limits range, and reduces maneuverability. AI fills the gap differently: terminal guidance lets the drone complete the last few hundred meters autonomously even after the radio link breaks. The two approaches aren't mutually exclusive — you can run AI on top of a fiber optic link to command hundreds of drones with fewer operators.

> *"If your drone goes low — and usually Russian infantry and vehicles, they're on the ground and you want to hit them, you need to go low — lower you go, maybe you'll get behind a hill or behind a forest, and if you're far enough you'll just get behind the curvature of the Earth."*

## [25:32] FPV Drones: The New God of War — 70–80% of Frontline Casualties

Artillery was historically called "the god of war" because it caused 80% of battlefield casualties. On the current Ukrainian front line, 70-80% of casualties are inflicted by FPV drones — the same fraction, a different weapon. Tanks, designed to dominate land warfare for decades, are now routinely destroyed by $400 consumer-grade quadcopters because armor was never built to defend against attacks from directly above. The trajectory follows the same curve as calculators becoming irrelevant once smartphones arrived: not a linear substitution but an exponential displacement where the new technology's influence grows nonlinearly.

> *"They used to say that artillery is the god of war because artillery used to cause like 80% of casualties, and now on that ranking FPV drones rule."*

## [28:28] The Five Levels of Drone Autonomy: From Terminal Guidance to Full Autonomy

Yaroslav lays out five autonomy levels describing where the field stands and where it's heading.
Level 1 is terminal guidance — the drone flies under human control and locks onto a target only in the final seconds. Level 2 is bombing — dropping munitions from altitude without directly ramming a target. Levels 3-4 introduce increasing target-selection and navigation independence: the drone can identify radio-emitting equipment, track vehicles, or navigate through GPS-denied environments. Level 5 is full autonomy — launch-and-forget, no human in the loop for any mission phase. Current battlefield deployment sits mostly at Levels 1-3. The jump to higher levels isn't primarily a technical problem anymore; it's a deployment, doctrine, and trust problem. Human confirmation remains in the loop at every stage involving lethal targeting decisions — for now.

> *"Technology progresses and its influence grows nonlinearly. It's all exponential."*

## [41:37] The Eight Dimensions of the Autonomous Battlefield

The five autonomy levels describe a single drone's capability. The eight dimensions describe the full battlefield context those drones operate in. Dimension 1: level of autonomy (the five-level scale). Dimension 2: platform type (quadcopter, fixed-wing, missile, naval drone). Dimension 3: environment (day/night, urban/forest/open terrain). Dimension 4: target type (moving vehicle, static structure, radio emitter). Dimension 5: swarm size and coordination. Dimension 6: command-and-control architecture. Dimension 7: sensing modality (optical, thermal, RF). Dimension 8: infrastructure (simulation, data pipelines, security, deployment tooling). Each dimension interacts with every other. A Level-4 autonomous drone performing well in open daylight terrain may fail completely in a forest at night. Battlefield AI systems have to be evaluated across all eight dimensions simultaneously, not just on the single axis of autonomy level.

> *"I say dimension because each of them works with another.
It's crucial to understand how autonomy evolves in a modern battlefield environment."*

## [45:32] AI Safety and the Morality of Autonomous Weapons

Yaroslav's position flips the standard AI-safety framing: in five to ten years, it will be *immoral* to use weapons *without* AI, because human-only weapons produce more collateral damage and friendly fire. He draws the analogy to manually driven cars — once autonomous vehicles are the norm, letting a human drive on a public road becomes the dangerous choice. Noah pushes to the logical endpoint: a Level-6 "AI general" — one large model that ingests all battlefield data and agentically selects targets, with humans reduced to repairing drones. Yaroslav says technically it could be done now. The constraint is deployment and trust, not capability. He references what was publicly described about AI-assisted target designation in the Iran operation: AI surfaces 127 targets, human reviews the list and presses okay. That's already close to an AI general with a rubber-stamp layer.

> *"I think 5 to 10 years from now it will be immoral to use weapons without AI because weapons without AI will be more likely to cause collateral damage or unwanted damage."*

## [51:31] The End of the Rifleman? Noah's 2013 Prediction vs. Battlefield Reality

Noah revisits a prediction he made in 2013: the rifleman is obsolete, replaced by standoff weapons. Ukraine both confirms and complicates it. FPV drones have unquestionably displaced the rifle as the primary instrument of attrition — but infantrymen haven't disappeared. They dig trenches, hold terrain, conduct logistics, and survive for months in dugouts under continuous drone threat by adapting: better camouflage, smaller movement signatures, drone-awareness drills. Yaroslav extends the timeline question to humanoid robots. The world is built for bipedal humans; there's genuine utility in a platform that can operate a rifle, open a door, or crew a vehicle.
He puts a Terminator-style scenario — humanoid combat robots — at 10 years out, not science fiction. But modern warfare, they agree, is a multi-dimensional problem — dozens of drone types, land ops, reconnaissance, psychological operations, aviation, tanks, logistics — and the press focus on whichever technology is newest understates how much every layer still matters.

> *"Modern warfare is really very complex and the fact that drones are the latest coolest thing doesn't mean that now it's that and only that."*

## [01:05:13] China's Manufacturing Advantage and Western Vulnerabilities

This is where Noah Smith's economics background drives the conversation. The U.S.-China drone comparison isn't about unit price or autonomy level — it's about manufacturing throughput at scale. China's consumer electronics supply chain already produces the motors, cameras, chips, and battery cells that go into FPV drones. Switching that capacity to military production requires regulatory will, not retooling. Ukraine builds fixed-wing drones with 10 km range from hobby components; China can build fixed-wing drones with 200-300 km range at the same cost curve. The West's vulnerability isn't just quantity. It's thermal cameras (overwhelmingly sourced from China), semiconductor fabs (two generations behind on drone-relevant sensors), and procurement speed (a Western defense contract takes years to award; Ukraine iterates weekly). Yaroslav is optimistic about Western human capital — the engineers exist — but openly frustrated with European institutional inertia and uncertain about whether the U.S. has fully absorbed the lessons from Ukraine and the Middle East.

> *"We don't have all the information to claim that, but we cannot count that out.
If we want to keep the resemblance of our good past life, we have to do something about it."*

## [01:24:21] Policy Advice for Western Defense: Defense Valley and the Widening Gap

Yaroslav's top policy prescriptions are framed around the William Gibson quote he attributes to Arthur C. Clarke: the future is already here, just not evenly distributed. Kyiv is Defense Valley — the place where the future of war arrived first, with hundreds of specialized companies, battle-tested commanders at every rank, and a government that learned to move at startup speed. Priority 1: deep integration with Ukraine's defense ecosystem, not just procurement but embedded learning. Priority 2: procurement reform — the drone-dominance initiative is the right direction and needs to scale 10x. Priority 3: long-range drone readiness for contested maritime environments (Shahed-class drones with 2,000 km range cover the entire Pacific island chain). He worries that the U.S. learned less from Ukraine than it should have and may be repeating the pattern with Iran.

> *"Kyiv and Ukraine is sort of the defense valley. It's the point where the future of defense has already arrived, and there's a ton of things to learn from that."*

## [01:32:54] The Drone Race: Who's Ahead, Category by Category

Russia was at parity or ahead in drone capability 18 months ago; Ukraine has since pulled ahead on FPV and autonomy. But Russia has a 4x population advantage and significantly more industrial capacity than Ukraine alone — scale disparity is why Western supply matters. The race breaks down by category: FPV strike (Ukraine leads), ISR reconnaissance (contested), glide bombs (Russia leads, dropping from bomber aircraft at scale), deep-strike drones (Russia leads on volume), and interceptors (Ukraine innovating rapidly, Russia catching up).
Russia uses helicopters to intercept Ukrainian deep-strike drones — a costly but effective countermeasure revealing how each new offense spawns a tailored defense, at weekly iteration cycles.

> *"Everyone says Russia's behind right now in the drone war. But that wasn't true a year ago."*

## [01:41:57] Countermeasures: Shotguns, Jammers, Lasers, and Fishnets

Shotguns work — they're the primary kinetic countermeasure against incoming FPV drones — but only for a trained soldier who can hit a 20 cm target moving at 100 km/h under combat stress. Electronic jammers are the most widespread defense: block the radio or GPS link and the drone loses guidance. The catch is that the same spectrum the jammer blankets is often used by your own forces, and jammers are being defeated by frequency-hopping and fiber optic links. Russian tanks now look like porcupines — improvised metal cages and electronic-warfare antennas bolted on top to defeat top-attack drones. Ukraine's answer is shaped charges specifically tuned for the gap between the cage and the hull. Lasers are effective but expensive ($10M+ per system to kill a $400 drone) and slow to slew onto fast-moving targets. Fishnets — literally mesh nets — are being deployed around static positions because they're cheap, snag rotors, and require no power.

> *"Then the tanks — if you look at Russian tanks and sometimes Ukrainian tanks or equipment — they all look like porcupines."*

## [01:58:19] The Wedding and Final Takeaway: Be Prepared for War

Brandon closes with two questions. First: did Yaroslav actually get married in that chapel on February 23rd? They got legally married, but postponed the reception until the war is over. Second: one takeaway for the audience. Yaroslav's answer is a restatement of the Roman proverb: *si vis pacem, para bellum*.

> *"You want peace, be prepared for war.
Got to invest in defense and security."*

## Entities

- **Yaroslav Azhnyuk** (Person): Founder of The Fourth Law (AI drone autonomy + thermal cameras, Ukraine); previously co-founder of PetCube; co-founder of Brave 1 and D3 Fund; born and raised in Kyiv.
- **Noah Smith** (Person): Economist; author of the Noahpinion Substack; co-host for this episode; focus on industrial policy, manufacturing economics, and geopolitics.
- **Brandon** (Person): Regular Latent Space host (science podcast background); guest host for this episode.
- **The Fourth Law** (Organization): Yaroslav's AI-guided drone company; three business units — thermal cameras, drone autonomy modules, UAV products (FPV strike, bombers, interceptors). Leading drone-AI team in Ukraine.
- **PetCube** (Organization): Consumer pet-camera company Yaroslav co-founded in San Francisco (2014–2020); the origin of the "cameras that fling treats / cameras that fling explosives" pivot.
- **Brave 1** (Organization): Ukraine's defense-innovation cluster; analogous to DIU (Defense Innovation Unit) in the U.S.; co-founded with Yaroslav's involvement.
- **D3 Fund** (Organization): Defense-tech investment fund co-founded with Eric Schmidt (ex-Google CEO) to accelerate Ukraine's drone ecosystem.
- **FPV Drone** (Concept): First-Person-View drone — pilot sees through onboard camera in real time; currently responsible for 70-80% of frontline casualties; dominant tactical weapon of the Ukraine conflict.
- **Five Levels of Drone Autonomy** (Concept): Yaroslav's taxonomy from terminal guidance (Level 1) to full autonomous operation (Level 5); most current battlefield deployment is Levels 1-3.
- **Eight Dimensions of the Autonomous Battlefield** (Concept): Yaroslav's framework for evaluating drone systems across autonomy level, platform type, environment, target class, swarm scale, C2 architecture, sensing modality, and infrastructure.
- **Defense Valley** (Concept): Yaroslav's term for Kyiv/Ukraine as the global hub where the future of defense tech is already live — analogous to Silicon Valley for consumer tech.
- **Radio Horizon** (Concept): Earth-curvature effect that cuts radio/video links to low-flying FPV drones at 30-40 km range; primary technical driver for fiber optic drone adoption.
- **Shahed** (Concept): Iranian-designed loitering munition used by Russia; fixed-wing, up to 2,000 km range; archetype for long-range drone threats to Western bases and Pacific-scenario planning.
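
The 30-40 km figure quoted in the [18:47] chapter is roughly what the textbook smooth-Earth radio-horizon approximation gives for a transmitter at 60-100 m. The sketch below is an illustration, not the speakers' own calculation. It assumes the common 4/3 effective-Earth-radius refraction model, a ground antenna at zero height, and no hills or forests; the function names are hypothetical. The $32/km cable price is the one quoted in the episode.

```python
import math

EARTH_RADIUS_KM = 6371.0
K_FACTOR = 4.0 / 3.0  # assumed standard-atmosphere refraction factor


def radio_horizon_km(altitude_m: float, k: float = K_FACTOR) -> float:
    """Approximate distance to the radio horizon over smooth ground.

    Uses d = sqrt(2 * k * R * h), valid when h is tiny compared with R.
    """
    altitude_km = altitude_m / 1000.0
    return math.sqrt(2.0 * k * EARTH_RADIUS_KM * altitude_km)


def fiber_cost_usd(length_km: float, usd_per_km: float = 32.0) -> float:
    """Cable cost for a spool of the given length at the quoted $32/km."""
    return length_km * usd_per_km


if __name__ == "__main__":
    for altitude in (60, 100):
        print(altitude, "m ->", round(radio_horizon_km(altitude), 1), "km")
    # 60 m -> 31.9 km
    # 100 m -> 41.2 km
    print(fiber_cost_usd(20))  # 640.0
```

Under these assumptions a drone that drops below about 60 m loses line of sight at roughly 32 km even over perfectly flat terrain, which is consistent with the range quoted in the episode.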

#drones #ukraine #defense-tech
Inside Abridge: 100 Million Medical Conversations Heard by AI — Abridge's Janie Lee & Chai Asawa
1:06:38
Latent Space, 21 days ago


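Abridge's Janie Lee and Chai Asawa join swyx and Redpoint's Jacob Effron for a Latent Space × Unsupervised Learning crossover on how an AI scribe grew into healthcare's "clinical intelligence layer." They cover the air-conditioner product philosophy, the prior authorization use case, an eval stack built around clinician scientists and LLM judges, how HIPAA reshapes the data flywheel, and what it takes to operate reliably across more than 100 million medical conversations.

## [00:00] Introduction

The episode opens with Janie Lee's core message: context is everything, alerts have to shift from reactive to proactive, and the product itself should recede into the background like an air conditioner until clinical risk appears. swyx then asks listeners to subscribe so the show can stay ad-free.

> *"We say this a lot: we want the product to feel like an air conditioner. Something that just quietly makes things better in the background."* — Janie Lee

## [01:17] What Abridge Does

swyx introduces the episode as the annual Latent Space × Unsupervised Learning crossover and explains that Jacob Effron is joining because Redpoint is an investor in Abridge. Janie describes Abridge as a clinical intelligence layer for health systems and says it started with documentation. Clinicians spend 10 to 20 hours a week writing notes. The conversation between patient and clinician is the starting point for nearly every downstream artifact: bills, payments, diagnoses. Chai adds that once you have full context on the patient, the payer, the guidelines, and the medical literature, you can address everything before, during, and after the visit.

> *"Abridge is a clinical intelligence layer for health systems. We started with documentation and built a product for clinicians."* — Janie Lee

## [03:22] From Ambient Documentation to Clinical Intelligence

Janie lays out three stages of Abridge's expansion. The first is saving time: the original scribe product that gives doctors their evenings back. The second is cutting costs and generating revenue for health systems that are running on record-low operating margins. The third, ultimately, is saving lives. What makes the expansion possible is that the product is opened millions of times a week, before, during, and after visits.

> *"We call it 'pajama time.' It refers to doctors who, after work, sit at home in their pajamas every day writing and finishing their notes."* — Janie Lee

## [05:21] Clinical Decision Support and Why Context Matters

Jacob compares Abridge's clinical decision support with Chai's previous employer, Glean. Chai explains the difference: at Glean a wrong answer is an inconvenience, while in healthcare the stakes are high and the user surface is much narrower. There are fewer personas, but every output has to be correct. That shapes everything from offline evaluation to staged rollouts, and it connects to the vision of "an assistant that really knows me" that has shown up at every hackathon for the past decade.

> *"At every hackathon I took part in over the last ten years, there was always a competing Jarvis project. But I genuinely think Abridge started from that opportunity and keeps moving in that direction."* — Chai Asawa

## [08:14] Alert Fatigue, Proactive Intelligence, and Prior Authorization

Jacob raises the classic alert-fatigue problem: if the product stays quiet like an air conditioner, how does it decide when to actually step in? Janie's concrete example is prior authorization. An MRI denial that today arrives weeks later can become real-time guidance while the patient is still in the exam room. The system takes into account payer policy, EHR data, prior diagnoses, and clinic-specific protocols. The key is data plumbing: prior authorization only works if every relevant signal can be connected at the right moment.

> *"Think about what data you would need to make the prior authorization example possible."* — Janie Lee

## [13:53] Ambient AI Form Factors and Healthcare Customers

swyx asks about form factor. The main surface today is mobile, but Abridge also runs on desktop, as a browser plug-in inside the EHR, on in-room devices in inpatient settings, and in nursing workflows, and it is evaluating AR. The customer is multi-sided: CMIOs, CFOs, CIOs, clinicians, patients, payers, and pharma are all involved somewhere, and interactions with payers happen through structured exchange, not through direct access to raw Abridge data.

> *"You talk a lot about ambient AI. Does it mostly happen on the phone?"* — swyx

## [18:16] The Hardest AI Problem in Healthcare

Asked to name the single hardest AI problem at Abridge, Chai picks high-quality, low-latency, low-cost real-time assistance in a high-stakes clinical setting. One concrete example is modeling the long tail of payer policies in an intermediate representation the system can reason over. The Pareto frontier keeps moving, and the team has to push it themselves instead of waiting for an off-the-shelf solution.

> *"Of course the Pareto frontier is always changing, but we have to get this done right now."* — Chai Asawa

## [19:43] Frontier Models, Proprietary Data, and Model Strategy

Jacob asks what they use off the shelf and what they build in-house. Chai's view: frontier models keep absorbing general medical knowledge, so Abridge's edge lies in its proprietary medical conversation data and the specialty-specific behavior built on top of it. Because what matters is the final product experience and not the model itself, they avoid model lock-in where possible and mix models per workflow.

> *"In the end, all we care about is the best product experience. This model for that, that model for this; we can mix and match."* — Chai Asawa

## [22:24] The EHR as a Filesystem for Agents

Chai's outlook for the next year: every agent is ultimately a coding agent, and in healthcare the EHR plays the role of the filesystem. It is a vast store of structured information that does not fit in any current model's context window. Janie adds that the goal is still to let clinicians focus on the patient. The point is not to go back over the conversation, but to have the right context ready at the right moment.

> *"Almost every agent is a coding agent under the hood. Give it any filesystem and it can write code too. You can think of the EHR as a filesystem."* — Chai Asawa

## [25:20] Personalization, Memory, and Clinician Preferences

Jacob asks how Abridge handles per-physician personalization. Janie's answer is layered: an individual's edits become signal, specialty-level defaults are layered on top of that, and health system policy wraps everything. Chai talks about memory as a new kind of system of record: a background job that consolidates signal from visit to visit, so the model learns from every edit and non-edit the way sleep consolidates human memory.

> *"Another interesting byproduct for us is memory, which is actually becoming one of the new systems of record."* — Chai Asawa

## [31:57] Evals, LLM Judges, and Staged Rollouts

Janie describes the eval stack. In-house clinicians perform the LFD first-pass review, and LLM judges are calibrated against that annotation data. Third-party raters provide independent review, and specialty-specific evals catch what general-purpose evals miss. Chai adds a self-driving-car analogy: make contact with the real world as fast as possible, but use staged rollouts so that the offline eval distribution matches the real production distribution.

> *"We want to make contact with the real world as fast as possible, but we want staged rollouts, because we want the distribution of our offline eval set to match the real distribution."* — Chai Asawa

## [38:04] HIPAA, De-identification, and Privacy

Privacy is treated as a hard constraint on the data flywheel. Chai explains that all data used as the basis for online evals or training must be de-identified in a one-way fashion, and that engineering processes are in place to do this. Janie adds that customer contracts also restrict who inside Abridge can access PHI, so the bar for what flows into training data is set at the contract level, not just the policy level.

> *"All the data we use has to be de-identified. That goes for all the real-world data we use as the basis for online eval sets or training."* — Chai Asawa

## [40:38] 100 Million Conversations and Operating at Scale

As the conversation count passes 100 million, new challenges come to the fore. Model routing, post-training, reliability budgets, and cost per call all become first-class concerns. Chai looks further ahead, talking about the insights that could be offered to clinicians. Eventually the same conversations could become a source of signal delivered directly to patients and consumers, not only to care teams.

> *"There is so much in a dataset of 100 million conversations. Imagine the kinds of insights you could give clinicians."* — Chai Asawa

## [45:27] EHR Integration and the Clinical Intelligence Layer

swyx asks about the relationship with EHRs. Abridge invests heavily in deep interoperability. EHR partnerships are table stakes for clinician adoption, but the value Abridge builds on top sits at a different level: cross-visit context, payer-aware reasoning, and the kind of clinical intelligence that the EHR itself is structurally ill-suited to produce.

> *"One of your key partners is the EHR, and I'm curious what that relationship is like."* — swyx

## [47:56] Healthcare Regulation, Latency, and High-Stakes AI

Jacob asks about lessons learned from regulation. Janie's answer departs from the usual narrative: healthcare AI actually has regulatory tailwinds, and because the bar is so high, the hardest problems get solved here first. Chai says they build knowing that the "clever tricks" they ship today may not survive five years from now as the frontier keeps advancing.

> *"I think the hardest AI problems will be solved here first, because the bar is that high."* — Janie Lee

## [51:28] Clinician Scientists and Long-Tail Quality

Janie describes the clinician scientist role inside Abridge: MDs who also have technical skills, ranging from full-stack engineers to "extremely pragmatic prompters." Because they are embedded in the product and eval teams, the bar for shipping goes up: the people writing the LFD criteria are the ones who actually understand what clinically useful means. swyx connects this to active learning on known weaknesses, a craft that is disappearing in most AI organizations.

> *"We have this role called clinician scientist, and I recently heard one of our leaders call them 'mutants.'"* — Janie Lee

## [54:21] Lessons from Glean and Durable AI Infrastructure

Jacob asks Chai what he brought over from Glean. The answer is the things that stay valid over time: the context layer, event-driven systems, Kafka, Temporal, sockets, and CRDTs from the Google Docs collaboration playbook. Multi-agent systems inherit the same conflict-resolution problems humans have, and the infrastructure patterns of the past decade are being reused, not thrown away.

> *"There is a lot of event-driven technology: Kafka, Temporal, sockets, and so on. How you integrate these is the part I think actually stays valid for a long time."* — Chai Asawa

## [58:20] The Future of Agentic Healthcare Workflows

A short exchange on what a more agentic Abridge would look like. The clinician's role in the patient relationship stays at the center, while more background work gets done: responding to lab results, drafting follow-ups, handling tasks on the clinician's behalf. It does not replace the relationship itself.

> *"Because we believe the clinician plays a very important role when it comes to patient connection, we can simply perform more functions on the clinician's behalf."* — Chai Asawa

## [58:51] PRDs, Product Clarity, and Building Serious AI Products

Jacob's rapid-fire question: what have you changed your mind about in AI over the past year? Janie flips the popular wisdom: prototypes are not everything, and the PRD is not dead. The more complex and AI-driven a product becomes, the more the written-clarity discipline of a proper PRD matters. The rest of the section is about building serious AI products in healthcare: accountability, written-spec discipline, and resisting demo-driven development.

> *"The spicier claim is that prototypes are everything and the PRD is dead."* — Janie Lee (the claim she changed her mind about)

## [01:04:28] AI Coding Tools at Abridge

swyx's standard closing question. Abridge uses Claude Code and Cursor internally. Jacob half-jokingly proposes a benchmark: he wants to see Claude run a pre-revenue company valued at $1 billion.

> *"I want Claude to do this. I'd like it to run a billion-dollar pre-revenue company."* — Jacob Effron

## [01:05:23] Outro

Chai points listeners to the white papers on Abridge's website, which cover hallucination reduction, evals, and the research stack. swyx and Jacob exchange thanks and sign off.

> *"If you go to the Abridge website, there are a lot of white papers on interesting work like hallucination reduction."* — Chai Asawa

## Entities

- **Janie Lee** (Person): Early team member at Abridge; responsible for product and business for the clinical intelligence layer.
- **Chai Asawa** (Person): Clinical decision support lead at Abridge; formerly at Glean.
- **swyx** (Person): Host of Latent Space.
- **Jacob Effron** (Person): Partner at Redpoint Ventures; host of the Unsupervised Learning podcast.
- **Abridge** (Organization): Healthcare AI company building the clinical intelligence layer. Started with ambient documentation and is expanding into decision support, prior authorization, evals, and EHR integration.
- **Glean** (Organization): Enterprise AI search company. Chai's former employer, cited as the contrast between horizontal and vertical products.
- **Redpoint Ventures** (Organization): Venture capital firm; Abridge investor and the reason for the Unsupervised Learning crossover.
- **EHR (Electronic Health Record)** (Concept): The system of record for health systems. In Chai's view, the EHR serves as the filesystem for healthcare agents.
- **Prior Authorization** (Concept): A core Abridge use case. Turns payer denials that take weeks into real-time guidance during the visit.
- **LFD Process** (Concept): Abridge's in-house, clinician-led first-pass review. Used to calibrate LLM judges and define eval criteria.
- **Clinician Scientist** (Concept): A role at Abridge. MDs with technical skills who are embedded in the product and eval teams.
- **Staged Rollout** (Concept): Abridge's deployment principle. Ship to a fraction of real traffic first so the offline distribution matches reality. Modeled on self-driving launch patterns.
- **Claude Code** (Software): AI coding tool used internally at Abridge.
- **Cursor** (Software): AI coding editor used internally at Abridge.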
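
The layered personalization described in the [25:20] chapter (individual signal, specialty defaults, health system policy wrapping everything) can be pictured as a layered settings lookup. This is only a sketch under stated assumptions, not Abridge's implementation: the setting names are invented, and the precedence order (policy overrides clinician preferences, which override specialty defaults) is one plausible reading of "policy wraps everything."

```python
from collections import ChainMap
from typing import Any, Dict, Mapping


def resolve_note_settings(
    system_policy: Mapping[str, Any],
    clinician_prefs: Mapping[str, Any],
    specialty_defaults: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge three layers; earlier arguments win when keys collide."""
    # ChainMap searches its maps left to right, so the policy layer is
    # consulted first, then the clinician, then the specialty defaults.
    return dict(ChainMap(system_policy, clinician_prefs, specialty_defaults))


if __name__ == "__main__":
    # All keys and values below are hypothetical examples.
    specialty_defaults = {"note_style": "SOAP", "include_codes": True, "language": "en"}
    clinician_prefs = {"note_style": "narrative", "include_codes": False}
    system_policy = {"include_codes": True}

    settings = resolve_note_settings(system_policy, clinician_prefs, specialty_defaults)
    print(settings["note_style"])     # narrative (clinician beats specialty default)
    print(settings["include_codes"])  # True (policy beats clinician)
    print(settings["language"])       # en (falls through to specialty default)
```

Keeping the layers as separate mappings means a policy change takes effect without rewriting what was learned from any individual clinician's edits.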

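The [31:57] chapter says LLM judges are calibrated against clinician annotations but does not say how agreement is measured. One common way to check a judge against human labels is Cohen's kappa, which corrects raw agreement for chance. The sketch below is an assumption about how such a check could look, not a description of Abridge's process, and the pass/fail labels are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters labelling the same items."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if both raters labelled independently at their own base rates.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[lab] * judge_counts[lab] for lab in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels: 1 = note passes review, 0 = note fails review.
    clinician = [1, 1, 0, 0, 1, 0, 1, 1]
    llm_judge = [1, 1, 0, 1, 1, 0, 0, 1]
    print(round(cohens_kappa(clinician, llm_judge), 3))  # 0.467
```

Here the raters agree on 6 of 8 items (75%), but because both say "pass" most of the time, the chance-corrected score is only about 0.47, which is the kind of gap raw accuracy would hide.
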
#ai-healthcare #ambient-ai #abridge
⚡️ Matt Pocock - Why Engineering Fundamentals matter MORE now
22:02
Latent Space, 28 days ago


Matt Pocock joins swyx at AI Engineer Europe to argue that the old software design canon — DDD, deep modules, ubiquitous language — matters more, not less, in the AI coding era. The thesis: code is not just a compile target; a codebase that is easy for humans to change is easy for AI to change. Along the way they cover course-making, why traditional lectures still beat AI-native learning, and TypeScript's quiet takeover of AI engineering.

## [00:04] Opening at AIE Europe and the Cursed Course

swyx welcomes Matt to the AI Engineer Europe podcast booth in London. Matt jokes that AIE is "the worst" event he has ever attended (the location is in fact astonishing) before turning to his Claude Code course, which is just wrapping up its two-week cohort. He explains why he runs short cohorts: AI moves so fast that self-paced courses cannot guarantee updates, and the "curse" of releasing into breaking changes — AI SDK v5 dropped on day two of his AI SDK v4 course, and the Claude Code source leaked during this one — is now baked in. The conversation then turns to teaching as a craft. Matt rejects the "pundit" branch of YouTuber identity — he is not trying to predict the future, only to teach durable material — and notes that being a teacher first is what differentiates his content.

> *I'm not a guy who's trying to predict the future. I'm just trying to teach.*

## [02:51] Why Engineering Fundamentals Matter More with AI

Matt previews his AIE talk. The popular narrative says code no longer matters because English plus an AI compiler can produce applications. Every time he tried to ignore the code, he ended up with "a terrible mess." So he went back to the classics — *Extreme Programming*, *The Pragmatic Programmer*, *A Philosophy of Software Design*, DDD — and discovered they ported directly into prompts. Keeping the architecture in your head, even when you delegate implementation, yields outsized dividends.
> *If you have a code base that's easy to change for humans, it's going to be easy for AI to change, too.*

## [04:23] Narrow Waist and Deep Modules

swyx introduces the "narrow waist" concept from internet architecture (TCP/IP, HTTP at layers 3–4) as a way to contain AI-generated slop: define rigid interfaces, delegate the inside. He extends it to running AIE as a nine-person business — "model-view-claw" instead of MVC, where coordination across people and AI is the real systems problem. Matt maps this onto John Ousterhout's notion of *deep modules*: a large amount of functionality behind a simple interface, ports and adapters style. This is, in his experience, the best way to use AI for coding — be intentional about the interface as a human, then delegate the implementation.

> *Deep modules basically — a large amount of functionality with a simple interface. Kind of ports and adapters, right?*

## [06:37] Domain-Driven Design Meets AI

DDD is having a moment, and Matt argues it works *because* the framework has been around long enough to sit in the latent space of these models. You do not have to invent new vocabulary; you can bolt on a system that is composable and that the model already understands. The deeper point: DDD is fundamentally about aligning code with language, which is exactly what you want when speaking to an AI. He makes it concrete with the `mattpocock/skills` repo (≈13k stars) and its "ubiquitous language" skill — a Claude Code skill that scans your codebase, surfaces the arcane jargon, and refines it with you into a markdown file he keeps open while prompting. He references it from `agents.md` but does not paste it wholesale, so the agent finds it when searching for those terms.

> *Essentially, you're trying to create a unified domain model so that the AI and you are speaking the same language.*

## [10:05] Teaching as an Overpowered Skill

swyx asks how Matt got so good at explaining things.
Matt credits six years as a voice coach before becoming a developer — communication felt like an unfair advantage when he started as a junior. He has since narrowed his focus: split time between learning material and finding the right phrases for it. The old texts help because they give him pre-built mental models to explain new ideas through. He walks through his course-making process: an "explore and exploit" phase, a Zettelkasten-style Obsidian vault, a custom planning app, P1/P2/P3 prioritization, and the rule that *each lesson teaches exactly one thing* with dependencies made explicit. Most of what he produces ends up on the cutting room floor.

> *The ability to communicate always just felt like a ridiculous overpowered skill that I had in my locker that no one else had.*

## [13:20] How People Actually Learn AI Engineering

The conversation turns to whether AI has changed how people learn. Matt distinguishes knowledge (lectures), skills (interactive exercises), and wisdom (small-group discussion — and now, talking to an AI). Counterintuitively, the more he leans into AI-experimental teaching, the more it turns his audience off. Most learners still want traditional lectures; swyx recalls Maven's cohort-based education arc landing in the same place. Matt's compromise is to force the work without forcing the form: in his TypeScript material he throws learners into a problem first and gives them the knowledge afterwards.

> *The more I lean into the kind of AI experimental stuff, the more it actually turns people off my materials.*

## [15:04] TypeScript Overtaking Python

swyx flags that TypeScript overtook Python in the GitHub survey this year — a shift he did not see coming, particularly in AI engineering where Python's expressiveness has been dominant on the backend.
Matt's echo chamber is 100% TypeScript, but his real argument is ecosystem: when you care about UX and shipping chat-style applications, the framework gravity is in TypeScript (Vercel's Next.js, Cloudflare's variants). swyx admits this would meaningfully change which frameworks he promotes.

> *If you're concerned about UX, concerned about shipping great stuff, you're mostly doing it in TypeScript.*

## [16:45] Inversion of Control and Composable Skills

Matt looks ahead. His TypeScript-evals bet (Evalite) stalled — "no one's excited to do evals." The next frontier is *inversion of control*: as coding agents converge on similar architectures (Firebase-style backends, small tool sets), the interesting axis becomes how much control sits with the developer versus the harness. Claude Code's opacity buys ease of use but loses observability; Pydantic AI ("Pi") swings the other way — total control, total maintenance burden. He closes by pointing past coding agents entirely. Software engineers are a step ahead because AI produces quality output in their domain, but the composable skills he authors — like his three-sentence "grill me" skill that makes the AI interrogate you until you reach a shared understanding — generalize to any domain where you want the AI aligned with you.

> *The inversion of control is going to be really important — you put more control in the hands of the developer and less in the harness.*

## Entities

- **Matt Pocock** (Person): Creator of Total TypeScript and AI Hero; teaches TypeScript and AI Engineering through two-week cohort courses.
- **Shawn Wang / swyx** (Person): Host; founder of AI Engineer and the AIE conference series.
- **AI Engineer Europe (AIE)** (Organization): The London conference where this conversation was recorded; Matt's talk hit 1M views in 13 days — fastest in AIE history.
- **AI Hero** (Organization): Matt's AI engineering education platform (aihero.dev).
- **Claude Code** (Software): Anthropic's coding agent; subject of Matt's just-finished course and a recurring example throughout.
- **Domain-Driven Design (DDD)** (Concept): Software methodology centered on aligning code with the language of the business domain; Matt argues it ports cleanly into AI prompting.
- **Ubiquitous Language** (Concept): DDD practice of maintaining a shared vocabulary doc; Matt's namesake Claude Code skill scans a repo and refines this with the user.
- **Deep Modules / Narrow Waist** (Concept): Architectural pattern (Ousterhout / internet protocols) of large functionality behind a small interface; Matt's preferred shape for AI-assisted codebases.
- **mattpocock/skills** (Software): Matt's open-source repository of Claude Code skills; ≈13k stars at recording time.
- **Pydantic AI (Pi)** (Software): Python agent framework built from low-level primitives; cited as the high-control counterpoint to Claude Code's opaque harness.
- **Obsidian** (Software): Note-taking app reportedly run by a team of four; the example for non-engineering domains where AI leverage compounds.

#ai-engineering #software-design #typescript
🔬How GPT-5 derived new results in theoretical physics and quantum gravity — Alex Lupsasca, OpenAI
1:31:51
Latent Space · 30 days ago


Alex Lupsasca, 2024 New Horizons Breakthrough Prize winner and OpenAI resident scientist, recounts how GPT-5 resolved a year-long open problem in quantum field theory: proving that single-minus gluon tree amplitudes are non-zero and finding their compact closed form. He then describes how the publicly available GPT Pro, given the gluon paper as a seed, independently generalized the result to graviton amplitudes in under three days of human clock time. Throughout the conversation, Lupsasca reflects on what this trajectory means for how physics is done, how the next generation of physicists will be trained, and where the remaining bottlenecks (verification, creativity, and publishing infrastructure) still lie.

## [00:00] Introduction to AI's impact on physics research

Lupsasca opens in medias res, framing the episode's central claim before the formal introduction: AI has crossed a threshold where it can resolve questions that stumped human experts for over a year. He describes this not as a curiosity for theoretical physicists but as a profound, if underappreciated, change in the nature of scientific discovery itself.

> *"That's a certain milestone that we've passed, and I think maybe for the average person on the street who doesn't care about theoretical physics, this is not very noticeable, but I think it's a very profound change and we've really passed some kind of a threshold."*

## [00:43] Guest introduction: Alex Lupsasca

The hosts, Brandon (Atomic AI) and RJ Honicky (Miro Omix), introduce Lupsasca as a Vanderbilt professor and OpenAI fellow who holds both the 2024 New Horizons in Physics Breakthrough Prize (often called the "Oscars for science") and the IUPAP Young Scientist Award. Lupsasca immediately sets the narrative arc: a year ago, AI was useful for email but not for his work; ChatGPT o3 was the first model that genuinely helped with research math; then GPT-5 reproduced one of his hardest published results in 30 minutes.

> *"When GPT-5 came out it was able to reproduce one of my best papers that took me a very long time to come up with in like 30 minutes. And that's when I really became AI pilled."*

## [02:49] Alex joining OpenAI and the shift in physics research

After GPT-5's release, Lupsasca began evangelizing the shift to colleagues who were skeptical. Finding OpenAI equally excited, and being on sabbatical, he joined as resident scientist, the person physicists around the world now email when something astonishing happens. He describes receiving an inbound message that week about Codex simulating the Sachdev-Ye-Kitaev (SYK) model in 10 minutes, a feat that many research groups had struggled to achieve because few physicists also have strong coding skills.

> *"I talked to OpenAI. They were also really excited and I thought I have to get in on this and to understand that this is happening and not be a part of it is a huge mistake so I have to go to OpenAI."*

## [04:08] The release of GPT-5 and the shift in capabilities

Lupsasca contrasts the lukewarm Twitter reception of GPT-5 (complaints that it was not better at writing email) with what he observed at the science frontier. He notes GPT-5.4 is another significant jump, and describes how AI capabilities for physics have been accelerating rapidly since o3, the first reasoning model strong enough for research-grade mathematics. He uses this as a bridge to the central technical story of the episode: a pair of new papers on gluon and graviton scattering amplitudes.

> *"At the science frontier the capabilities were really taking off."*

## [10:05] Explaining Quantum Field Theory and amplitude calculations

Lupsasca gives an accessible primer on quantum field theory (QFT), the framework that reconciles special relativity and quantum mechanics.
The key objects in QFT are scattering amplitudes: complex-valued functions that encode the quantum probability for a set of incoming particles (with given energies, momenta, and polarizations) to scatter into a set of outgoing particles. These amplitudes are what particle colliders like the LHC probe, and knowing the n-point amplitudes (for any number n of particles) determines essentially the full content of the theory.

> *"If you have a particular force and you're able to compute the n-point amplitudes... you know everything about the theory."*

## [14:20] Overview of gluons and the strong force

Gluons are the force-carrying particles of the strong nuclear force, the force that, despite like-charge repulsion between protons, holds the atomic nucleus together. They are the QFT analog of photons for electromagnetism and gravitons for gravity. Like photons, gluons carry a polarization (helicity): positive (right-handed) or negative (left-handed). This helicity structure is central to the paper discussed next.

> *"The strong force is mediated by the exchange of the particles of the strong force, which are called gluons, because they're what glues together the nucleus of the atom."*

## [14:38] Discussing the first research paper on single-minus gluon tree amplitudes

Lupsasca unpacks the paper's title, "Single-Minus Gluon Tree Amplitudes Are Non-Zero", piece by piece. Tree amplitudes are the leading-order (no-loop) contributions to scattering. All-plus-helicity amplitudes are exactly zero by a symmetry argument. Single-minus amplitudes, where all but one gluon have positive helicity, were assumed in textbooks to also be zero by the same argument. The paper proves they are not. The result involves collaboration with Alfredo Guevara (IAS), David Skinner (Cambridge), Andrew Strominger (Harvard), and Kevin Weil.

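
The helicity classification in the paragraph above can be written compactly. As a sketch in standard color-ordered notation (the symbol $A_n^{\text{tree}}$, the label ordering, and the restriction to $n \ge 4$ are conventions assumed here, not taken from the episode), the textbook statements for generic momenta read:

```latex
\begin{align}
  A_n^{\text{tree}}(1^+, 2^+, \dots, n^+) &= 0, \\
  A_n^{\text{tree}}(1^-, 2^+, \dots, n^+) &= 0
    \quad \text{(textbook claim, generic momenta, } n \ge 4\text{)}.
\end{align}
```

The paper's title asserts that the second statement has exceptions. The closed-form replacement is not quoted in this summary, so it is not reproduced here.
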
> *"If you look at the lecture notes and textbooks that have been written on this, the same argument that rules out the all-plus amplitudes also appears to rule out the single-minus amplitudes."*

## [20:56] How ChatGPT helped solve a year-long physics puzzle

Strominger, Guevara, and Skinner had understood for about a year that the textbook argument has a loophole: when particles are collinear (exactly aligned in momentum), the standard dimensional-analysis reasoning fails, and single-minus amplitudes can be non-zero. But computing what those non-zero amplitudes equal had eluded them. Lupsasca invited Strominger to visit OpenAI and work on it with AI. The week before Strominger's flight, Lupsasca began using ChatGPT Pro. By the time Strominger landed, they had the answer.

> *"Using ChatGPT we solved the problem before he even got off the plane."*

## [23:02] Complexity of manual calculations in physics

Lupsasca shows the audience a concrete illustration of the difficulty: the six-point single-minus amplitude, worked out by hand by Alfredo Guevara, is a sum of 32 terms, each of which is itself a product of four complicated factors. The number of terms grows factorially with the number of particles n, which is super-exponential growth. This is the messy representation that the group had been staring at for a year, seeking the analog of the elegant Parke-Taylor formula.

> *"By the time you get to six terms, it explodes in your face."*

## [26:12] The history and mechanics of Feynman diagrams

Feynman diagrams are a visual language introduced by Richard Feynman to organize perturbative QFT calculations: diagrams represent possible intermediate histories of a scattering process, and the full amplitude is a sum over all of them. Diagrams are organized by number of vertices (interaction points); each additional vertex is suppressed by the coupling constant, so tree diagrams (fewest vertices) dominate.
Loop diagrams, in which intermediate particles are created and annihilated, contribute smaller corrections. The combinatorial explosion of tree diagrams is the root cause of factorial growth.

> *"In principle, there are infinitely many pictures to sum over."*

## [27:44] The Parke-Taylor formula and the quest for simplification

In the 1980s, Parke and Taylor computed the "maximally helicity violating" (MHV, or double-minus) gluon amplitudes through a heroic Feynman diagram expansion. Despite the factorial number of terms, everything canceled to leave a single compact formula, the Parke-Taylor formula, that fits in half a line. Strominger, Guevara, and Skinner spent a year looking for the analogous compact formula for the single-minus case. Their search stalled at the level of the messy Feynman representation.

> *"Andy, Alfredo and David spent the last year chasing the analog of the Parke-Taylor formula, the very simple answer that was obtained in the '80s for the double minus amplitudes."*

## [31:26] Using ChatGPT to find the simplification in the special phase space region

When the five-point single-minus amplitude was fed to ChatGPT Pro, the model identified a special subregion of phase space (where one particle's frequency has opposite sign) in which the amplitude simplifies from eight terms to a product of just three. This appears not to have been a known fact; the model wrote Python code and tested thousands of possibilities to deduce it. Moving to the six-point amplitude (Guevara's hand calculation), ChatGPT simplified 32 terms to a product of 4. It then conjectured the general n-point formula, with only linear growth in the number of terms, the best possible behavior. GPT-5.2 Pro guessed the formula but could not prove it.

> *"The formula that it proposed, instead of having this factorial growth... here it's actually linear.
So if you double the number of particles, you only double the number of terms."*

## [38:07] Proving the formula from scratch to ensure validity

To obtain a proof, Lupsasca used an internal OpenAI model with extended reasoning. He gave it the problem cold, without the conjectured formula, and asked it to find the general answer in the special phase-space region. After 12 hours of computation, the model independently rediscovered the same formula and produced a complete three-step proof. The proof constitutes the bulk of the published paper. The team kept the AI attribution to one paragraph, framing the paper as a physics result that stands on its own merits.

> *"We gave it the whole problem from scratch... and it came back with the same formula which we had not given it. So it rediscovered the correct formula. But this time it also found the proof."*

## [41:00] Determining the scientific impact and future research

Asked to compare the result to the Parke-Taylor formula, Lupsasca is candid that scientific impact is only assessable decades later, but argues the result is genuinely surprising and should open a line of attack toward deeper questions in quantum gravity. The conversation pivots naturally to the second paper.

> *"I think the true value of a paper can only be assessed decades into the future based on how much future work it leads to and what developments it opens up."*

## [42:27] Introduction to the second paper on graviton amplitudes

Gravitons are the hypothetical quanta of gravity: the spin-2 force carrier analogous to the spin-1 photon (electromagnetism) and gluon (strong force). Unlike gluons, gravitons have never been directly detected, but they are central to quantum gravity theory. The second paper, "Single-Minus Graviton Tree Amplitudes Are Non-Zero," shows the same loophole applies to gravity and that a compact formula extends there too, despite gravitons being mathematically more complex than gluons.

> *"We wrote this paper which is called single minus graviton tree amplitudes are non-zero. So it's the same title almost, except with graviton instead of gluon."*

## [45:41] Defining particles, irreducible representations, and symmetry

Lupsasca sketches the modern QFT definition of a particle (an irreducible representation of the Poincaré group, classified by Wigner according to mass, spin, and charge) and explains why gravitons are spin-2 while gluons and photons are spin-1, making graviton polarization data twice as rich. Crucially, the second paper was complete within three days of the first going public; most elapsed time was spent verifying correctness, not computing.

> *"Most of the time was spent verifying the answer, not writing, which is insane, actually, if you take a step back."*

## [47:46] How GPT Pro generalized the research to gravity

For the graviton paper, no internal model was needed: the publicly available ChatGPT GPT-5.2 Pro sufficed. Lupsasca provided the gluon paper as context plus two paragraphs describing the key mathematical changes, then said "Good luck. You're a brilliant theoretical physicist." Over a 110-page exchange, the model worked through the graviton calculation (applying the directed matrix tree theorem, a piece of known combinatorics that neither Lupsasca nor his collaborators had thought to invoke), produced correct intermediate results, and wrote a draft paper very close to the final arXiv version from section 3 onward.

> *"It's a real solid result in quantum gravity that was done pretty much completely by an AI with human steering it and asking kind of the right questions."*

## [53:57] The epistemological shift: Is this a new way of doing physics?

The hosts raise the central epistemological question: if an undergraduate with domain knowledge and good prompting could have done this, what does graduate training mean now? Lupsasca agrees this is the hardest open question facing academia.
He notes that arduous calculation trains not just skill but self-confidence, that the gap between coursework and the research frontier is growing, and that many "easy" problems professors once assigned to students are now solvable by AI in minutes. He offers two concrete ways AI has already changed his own workflow: dramatically reducing time spent confused between steps, and enabling parallel AI scouts that explore multiple research directions simultaneously.

> *"With AI, actually, you can launch 10 instances of chat and have each one try a different route and send it as a scout that moves very fast into the unknown."*

## [59:27] The use of AI as a 'scout' for research directions

Lupsasca elaborates on the scout metaphor: rather than carefully mapping a route from A to C before committing to it, a researcher can now dispatch many AI "scouts" in parallel, get rapid feedback on which directions are promising, and redirect human attention accordingly. Even when a scout makes errors, its signposts reduce orientation cost for the human following. This constitutes a qualitatively new mode of research, one where the bottleneck shifts from calculation to judgment about which directions matter.

> *"Even if ChatGPT doesn't always get everything right, just kind of having a scout that signposts some key steps along the way that you can use to anchor your own movement is extremely helpful."*

## [61:44] The role of 'taste' and collaboration with AI

The hosts push on the problem of "taste": the ability to identify which questions are at the productive edge of knowledge. Lupsasca argues that working effectively with ChatGPT requires the same skill a professor develops advising students: knowing what question to give, at what level of detail. "Taste" (knowing where the frontier is and which questions there are tractable) is the last skill to develop and the one AI currently lacks.
AI is, he says, like an extremely technically skilled graduate student: given a sharp, well-posed question, it can do incredibly hard computations correctly, but it does not yet know which question to ask.

> *"The difference between a good physicist and a great physicist is knowing what is the right question to ask; that is actually the hardest part of being a scientist."*

## [70:23] Personal evolution from AI skeptic to resident scientist

Lupsasca recapitulates his personal arc: skeptic → converted by o3 (which solved in 11 minutes a calculation that would have taken him days) → "AI-pilled" by GPT-5 (which reproduced, in 30 minutes, his best published result on black hole Love numbers and tidal symmetries, a paper that appeared on arXiv only after the model's training cutoff) → now resident scientist at OpenAI. He notes that no competing model at the time could match GPT Pro on that calculation.

> *"In under 30 minutes, with one hint... it completely solved this problem, which is one of the nicest calculations that I've ever done."*

## [72:46] Solving a black hole perturbation problem with GPT-5

Lupsasca details the "Move 37" moment that converted him: his paper "Why Is There No Love in Black Holes?" establishes new symmetry generators for perturbations of a Kerr black hole (explaining why black hole Love numbers, the tidal response coefficients named after mathematician Augustus Love, are exactly zero). When GPT-5 Pro was first given the full problem cold, it failed. But after being primed with the simpler flat-space warm-up (a 200-year-old known result), it then solved the full Kerr black hole problem in 18 minutes.

> *"GPT-5 was able to reproduce one of my hardest calculations, which I think the number of people in the world that could do that you could count on your hands."*

## [76:34] Discussing whether AI can make original, conceptual leaps

The hosts ask whether AI is merely recombining known results or making true creative leaps.
Lupsasca cites Terry Tao, who has not yet seen an AI proof that cannot be traced to an obscure reference. But Lupsasca has been impressed and frames the distinction as one of degree rather than kind: humans may also be recombination machines. He believes continued scaling will produce feats of insight that look like creativity, and notes OpenAI is actively working on enabling models to take bigger, more out-of-distribution leaps suited to scientific discovery.

> *"I'm not sure there's a qualitative difference. I think it's just a matter of degree; as we continue scaling the capabilities, I don't see why it's going to stop."*

## [80:09] Challenges of 'AI slop' and the future of academic publishing

With models now capable of turning out a physics paper in 30 minutes when properly steered, the arXiv preprint server is being flooded with submissions. Lupsasca distinguishes legitimate use (expert steering + careful verification) from "AI slop": poorly prompted outputs submitted without adequate checking. His proposed response: raise the bar rather than increase volume. The single-minus amplitude papers open a clear line of attack toward genuine quantum gravity questions; the goal should be to pursue harder problems, not to publish incrementally.

> *"Instead, I think now that we have this new tool that gives us AI superpowers, I think we should just raise the bar for what it means to write a good paper."*

## [83:13] The bottleneck of writing academic papers

Asked what single bottleneck he would remove, Lupsasca nominates the paper-writing process itself, finding it increasingly strange that researchers use AI to do calculations, compress results into a static paper, and then readers feed that paper back into AI to understand it. He envisions interactive, LLM-embedded papers as a plausible future.
He also identifies two missing capabilities in current models: (1) the spark of creativity to identify the next important question, and (2) reliable self-verification, so that the onus of checking long AI-generated proofs does not fall entirely on humans.

> *"Maybe some kind of interactive paper which lives in some LLM. Maybe your whole paper is some ChatGPT page... I think we're going to head in that direction."*

## [90:19] Final takeaways and looking ahead to the next year

Lupsasca's closing message: pay attention. The trajectory from "useful for email" to "solves open problems in quantum gravity" has taken roughly 18 months. Models are solving open problems that expert communities spent years on. Extrapolating forward, with more scaling already in the pipeline, the next 6 to 12 months should bring further surprises. The right posture is excitement, careful verification, and a commitment to pursuing harder problems.

> *"If you just extrapolate that into the future, imagine where we're going to be in 6 months or a year; I think it's kind of surreal to live through this time, but it's really happening."*

## Entities

- **Alex Lupsasca** (Person): Theoretical physicist, Vanderbilt University professor and OpenAI resident scientist; 2024 New Horizons Breakthrough Prize and IUPAP Young Scientist Award winner; expert in black hole physics and scattering amplitudes.
- **Andrew Strominger** (Person): Harvard professor and Lupsasca's former PhD advisor; pioneer of celestial holography; co-author of both single-minus amplitude papers.
- **Alfredo Guevara** (Person): Postdoctoral researcher at the Institute for Advanced Study (IAS); performed the foundational hand calculations underpinning the AI-assisted breakthrough.
- **David Skinner** (Person): Professor at Cambridge University; co-author of the single-minus gluon amplitude paper.
- **Terry Tao** (Person): Fields Medal-winning mathematician at UCLA; referenced regarding the question of whether AI proofs involve genuine creativity.
- **Scattering Amplitudes** (Concept): Complex-valued functions in quantum field theory encoding probabilities for particles to scatter; the central mathematical objects of both papers discussed.
- **Single-Minus Gluon/Graviton Amplitudes** (Concept): Tree-level scattering amplitudes where all but one particle have positive helicity; previously assumed zero in textbooks but shown non-zero in a collinear phase-space region.
- **Parke-Taylor Formula** (Concept): Compact closed-form result for maximally helicity violating (MHV, double-minus) gluon amplitudes derived in the 1980s; the template whose analog was sought for single-minus amplitudes.
- **Feynman Diagrams** (Concept): Diagrammatic technique to organize perturbative QFT calculations; individual diagrams represent distinct intermediate-particle histories whose amplitudes are summed.
- **Love Numbers** (Concept): Coefficients encoding tidal deformability; famously vanish for black holes, a fact connected to hidden symmetries studied in Lupsasca's "Why Is There No Love in Black Holes?" paper.
- **Celestial Holography** (Concept): Research program exploring symmetries of quantum gravity via scattering amplitude structure; motivates studying graviton amplitudes.
- **OpenAI** (Organization): AI research company where Lupsasca serves as resident scientist; developer of GPT-5 and the internal extended-reasoning model used for the amplitude proof.
- **arXiv** (Organization): Open-access physics and mathematics preprint server; mentioned in the context of AI-generated "slop" flooding submissions.
- **GPT-5 / ChatGPT Pro** (Software): OpenAI's frontier language model used as the primary AI tool in both amplitude papers; capable of extended reasoning steps of 20-34 minutes per prompt.
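
The contrast drawn in the [23:02] and [31:26] segments, factorial growth in the diagram-based representation versus linear growth in the conjectured formula, can be made concrete with a small sketch. The functions `factorial_terms` and `linear_terms` below are hypothetical stand-ins chosen only to show the two growth rates; they are not the actual term counts from the paper, which this summary does not give in general form.

```python
import math
from typing import Callable


def factorial_terms(n: int) -> int:
    """Illustrative stand-in for a term count that grows factorially in n."""
    return math.factorial(n)


def linear_terms(n: int) -> int:
    """Illustrative stand-in for a term count that grows linearly in n."""
    return n


def doubling_ratio(count: Callable[[int], int], n: int) -> float:
    """Factor by which the term count grows when the particle number doubles."""
    return count(2 * n) / count(n)


if __name__ == "__main__":
    # Compare how each stand-in reacts to doubling the number of particles.
    for n in (4, 6, 8, 10):
        print(
            f"n={n:2d}  linear x{doubling_ratio(linear_terms, n):.0f}  "
            f"factorial x{doubling_ratio(factorial_terms, n):.3g}"
        )
```

Under these stand-ins, doubling the particle number always doubles the linear count, while the factorial count grows by a factor that itself increases rapidly with n, which is the behavior the quoted remark describes.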

#theoretical-physics #quantum-field-theory #gpt-5