Dwarkesh Patel

Eric Jang: Reproducing AlphaGo from Scratch

Today, I'm here with Eric Jang, who was most recently vice president of AI at 1X Technologies, and before that, senior research scientist at what is now Google DeepMind Robotics. You've been on sabbatical for the last few months. One of the things you've been doing is rebuilding, improving, and hacking on AlphaGo. Today, you're going to walk us through building AlphaGo from scratch and what it tells us about the future of AI research and development. Before we get to that, why is AlphaGo interesting? Why is this the project you decided to do on sabbatical rather than just hanging out at the beach?

I like making things, and AlphaGo and Go AI are among the things that really got me into the field. When I saw the early breakthroughs on AlphaGo in 2014, 2015, 2016 and so forth, it was profound to see how smart AI systems could become and the computational complexity class they could tackle with deep learning. This is a problem that has long been understood to be intractable for search, and yet it was solved through deep learning. That was quite mysterious to me, and I've always wanted to understand that phenomenon a little better. My training is in deep neural nets for robotics, where the decisions made by the neural networks are a bit more intuitive. But in AlphaGo, the decisions are the result of a very, very deep search. It's always been very mysterious to me how a ten-layer network can amortize the simulation of something so deep in the game tree.
If you plot out how much compute it took to build various iterations of strong Go bots over the years, you can see that in 2020 there was an open-source project called KataGo, by David Wu of Jane Street, which achieved a 40x reduction in the compute needed to train a really strong Go bot tabula rasa. I'm not certain whether it's stronger than AlphaGo Zero, AlphaZero, or MuZero, but it's very strong, and it's what most Go players today practice against when they play an AI. Thanks to LLM coding, what took a whole team of research scientists at DeepMind and millions of dollars of research and compute can now be done for a few thousand dollars of rented compute.

We should first discuss how Go works. How does the game work?

Go is a very simple game that can be implemented quickly and easily on a computer. The objective is to put down black and white stones and try to occupy as much territory as possible. I might start by putting down a black stone. Black always goes first. Go ahead.

The way you capture an opponent's stone is to occupy all four of its neighboring intersections with your own stones. Then it's cut off from oxygen, if you will, and it's a dead stone. Now I control these four stones as well as this empty intersection here. There are slight variations between Chinese, Japanese, and what are called Tromp-Taylor rules. Tromp-Taylor rules are designed to be completely unambiguous, so this is what Go AIs typically train and resolve games against.
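The capture rule above describes a single stone, but in practice it is applied to groups: stones of one color that touch orthogonally form a group, the group's liberties are the empty points orthogonally adjacent to any of its stones, and a group with zero liberties is dead. The following is a minimal sketch of that check, assuming a board stored as a list of lists with hypothetical constants `EMPTY`, `BLACK`, `WHITE`; the helper names `neighbors` and `group_and_liberties` are illustrative, not taken from Eric's code.

```python
from collections import deque

# Hypothetical cell encoding: 0 = empty, 1 = black, 2 = white.
EMPTY, BLACK, WHITE = 0, 1, 2


def neighbors(r, c, size):
    """Yield the orthogonal neighbors (never the diagonals) of (r, c)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < size and 0 <= nc < size:
            yield nr, nc


def group_and_liberties(board, r, c):
    """Flood-fill the group containing the stone at (r, c).

    Returns (stones, liberties): the connected same-colored points and
    the empty points adjacent to that group.
    """
    size, color = len(board), board[r][c]
    if color == EMPTY:
        raise ValueError("no stone at this point")
    stones, liberties = {(r, c)}, set()
    queue = deque([(r, c)])
    while queue:
        cr, cc = queue.popleft()
        for nr, nc in neighbors(cr, cc, size):
            if board[nr][nc] == EMPTY:
                liberties.add((nr, nc))
            elif board[nr][nc] == color and (nr, nc) not in stones:
                stones.add((nr, nc))
                queue.append((nr, nc))
    return stones, liberties


# A white stone whose four orthogonal neighbors are black has no liberties...
board = [[EMPTY] * 5 for _ in range(5)]
board[2][2] = WHITE
for r, c in ((1, 2), (3, 2), (2, 1), (2, 3)):
    board[r][c] = BLACK
stones, libs = group_and_liberties(board, 2, 2)
print(len(libs))  # 0, so the white stone is captured

# ...whereas black stones on the four diagonals take no liberties away.
diag = [[EMPTY] * 5 for _ in range(5)]
diag[2][2] = WHITE
for r, c in ((1, 1), (1, 3), (3, 1), (3, 3)):
    diag[r][c] = BLACK
print(len(group_and_liberties(diag, 2, 2)[1]))  # 4
```

A breadth-first flood fill is used here only because it is short and avoids recursion limits on large groups; incremental union-find structures are a common faster alternative in serious engines.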
In typical Go, when humans play, you're actually not allowed to put this white stone down here. It would be instant suicide. In Tromp-Taylor, it's actually fine: you put it down, it immediately resolves to death, and the outcome is the same. Let's start over and play a few stones, and then I'll explain some more.

I'll just start there. I'm basically playing randomly here, but I'm trying to get around your stones and see if I can surround them. This move leaves your white stone with only one empty neighbor. It's akin to a check in chess. If you don't respond immediately by putting one here, then I can immediately capture this.

I see. Because it's the diagonals that determine whether you're surrounded or not.

The orthogonal neighbors, not the diagonals. This one is surrounded on three sides, so you're at risk of losing that stone if you don't play there immediately. Now you can see that I'm starting to pressure you, because by putting a stone here, you're forced to put one here. Otherwise, you would have this two-block to yourself.

Yes.

And if you think through what happens if you were to respond here, you can probably search into the future and deduce what I'll do in response.

You have a lot of confidence in my abilities, but I'm guessing you'd put the black here.

That's right, and then I would capture all three of these stones.

So I should just assume that this little block is gone.

Yes.
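Tromp-Taylor rules define a move as: color an empty point with your stone, then clear any opponent stones that no longer reach an empty point, then clear any of your own stones that don't. That ordering is why suicide is legal but pointless, and the chess-like "check" in the exchange above corresponds to a group left with exactly one liberty (atari). A minimal sketch under those rules, reusing the same hypothetical encoding and a compact copy of the flood fill so it runs on its own; `play`, `clear_dead`, and `in_atari` are illustrative names, and the positional superko rule that Tromp-Taylor also imposes is deliberately not checked.

```python
from collections import deque

EMPTY, BLACK, WHITE = 0, 1, 2  # hypothetical cell encoding


def group_and_liberties(board, r, c):
    """Flood-fill the group at (r, c) over orthogonal neighbors; return (stones, liberties)."""
    size, color = len(board), board[r][c]
    stones, liberties, queue = {(r, c)}, set(), deque([(r, c)])
    while queue:
        cr, cc = queue.popleft()
        for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
            if not (0 <= nr < size and 0 <= nc < size):
                continue
            if board[nr][nc] == EMPTY:
                liberties.add((nr, nc))
            elif board[nr][nc] == color and (nr, nc) not in stones:
                stones.add((nr, nc))
                queue.append((nr, nc))
    return stones, liberties


def clear_dead(board, color):
    """Empty every group of `color` with no liberties; return the number of stones removed."""
    size, removed, seen = len(board), 0, set()
    for r in range(size):
        for c in range(size):
            if board[r][c] == color and (r, c) not in seen:
                stones, liberties = group_and_liberties(board, r, c)
                seen |= stones
                if not liberties:
                    for sr, sc in stones:
                        board[sr][sc] = EMPTY
                    removed += len(stones)
    return removed


def play(board, r, c, color):
    """Tromp-Taylor move order: place the stone, clear the opponent, then clear own color.

    Returns (opponent stones captured, own stones lost to suicide).
    Positional superko is not checked in this sketch.
    """
    if board[r][c] != EMPTY:
        raise ValueError("point is occupied")
    opponent = WHITE if color == BLACK else BLACK
    board[r][c] = color
    captured = clear_dead(board, opponent)
    suicided = clear_dead(board, color)
    return captured, suicided


def in_atari(board, r, c):
    """True if the group at (r, c) has exactly one liberty left (the 'check')."""
    return len(group_and_liberties(board, r, c)[1]) == 1


# White plays into a corner walled off by black: legal, but the stone dies at once.
board = [[EMPTY] * 3 for _ in range(3)]
board[0][1] = BLACK
board[1][0] = BLACK
print(play(board, 0, 0, WHITE))  # (0, 1)
```

Clearing the opponent before the mover is what makes a move that fills its own last liberty still capture when it also removes the opponent's last liberty; under human rule sets, the same placement would be rejected only when nothing is captured.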
In Go, it's actually okay to let an opponent capture some stones if, for example, it lets you get into position to capture more stones somewhere else on the board. This is what makes Go a beautiful game: you can lose the battle but win the war. As the board size increases, the complexity of these micro versus macro dynamics gets more interesting. Presumably you'd put one here. So now I would capture this entire group, and this would be mine.

There's one more case I want to demonstrate, one I actually had a bug for in my code recently. Let's consider a formation like this, with other pieces on the board in play. Let's talk about how the game ends. In this territory, who controls these areas? Is it white or black?

White.

It's actually black, because I've surrounded this whole area. Assuming I have other black stones here, it's very hard for you to break these stones' control over it.

And when the final score is tallied, would these ones also count as being in...

Great question. This is where different rule sets score things differently. We should talk about how you resolve scores between humans and how you resolve scores in computer Go, because there's some ambiguity in how humans evaluate this. Most humans would look at this board configuration and conclude that black has totally surrounded white, and white has no chance of life. We could play out more here, but at the end I would capture everything. However, if you have a way of breaking this formation and connecting white to something outside of it, then it can flip.
This is where it's a little bit hard for a computer to decide these kinds of things.

How do humans do it?

It's worth thinking about how humans resolve this, because it will map onto how we later think about the deep neural network. Humans basically say, "I think the game is done," and then you have to also say, "I think the game is done." Then I'll say, "I think these are my stones," and you have to agree. If you don't agree, we keep playing. Essentially, once the two humans' so-called value functions reach a consensus, the Chinese rules resolve the score from there. Tromp-Taylor scoring is perfectly unambiguous, so it can be decided algorithmically by a computer. If you have this at the endgame, the way you score it is that you first count how many stones you control, and that's unambiguous. Then you count how many empty intersections are not touched by your opponent's stones. These intersections would not count for either player, because all of them are connected to both white stones and black stones. If the position looked like this instead, white would get three points. This is a little odd, because a human would know that white is actually losing these points. But Tromp-Taylor scoring would credit white with all of these points as well as these points. So that is a very big difference between how computer Go and humans score things.

How does the game end?

The game ends when either player resigns or both players pass consecutively. Those are the rules.
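Tromp-Taylor area scoring, as described above, can be computed directly: a player's score is the number of points holding their stones plus the number of empty points whose region reaches only their color; an empty region bordering both colors (or neither) scores for nobody. The sketch below assumes the same hypothetical `EMPTY`/`BLACK`/`WHITE` encoding, ignores komi, and represents move history as a hypothetical list in which passes and resignations are the strings `"pass"` and `"resign"`.

```python
from collections import deque

EMPTY, BLACK, WHITE = 0, 1, 2  # hypothetical cell encoding


def tromp_taylor_score(board):
    """Return (black_points, white_points) under Tromp-Taylor area scoring (no komi)."""
    size = len(board)
    score = {BLACK: 0, WHITE: 0}
    seen = set()
    for r in range(size):
        for c in range(size):
            if board[r][c] in score:
                score[board[r][c]] += 1  # every stone on the board counts, dead or not
            elif (r, c) not in seen:
                # Flood-fill this empty region and record which colors border it.
                region, borders = {(r, c)}, set()
                queue = deque([(r, c)])
                while queue:
                    cr, cc = queue.popleft()
                    for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                        if not (0 <= nr < size and 0 <= nc < size):
                            continue
                        if board[nr][nc] == EMPTY:
                            if (nr, nc) not in region:
                                region.add((nr, nc))
                                queue.append((nr, nc))
                        else:
                            borders.add(board[nr][nc])
                seen |= region
                if len(borders) == 1:  # region reaches only one color
                    score[borders.pop()] += len(region)
    return score[BLACK], score[WHITE]


def game_over(history):
    """The game ends on a resignation or on two consecutive passes."""
    return bool(history) and (history[-1] == "resign" or history[-2:] == ["pass", "pass"])


E, B, W = EMPTY, BLACK, WHITE
# A black wall down the middle; one white stone sits in black's right-hand area.
print(tromp_taylor_score([[E, B, W],
                          [E, B, E],
                          [E, B, E]]))  # (6, 1)
```

In the example, the lone white stone is hopeless to a human, yet because nothing removes it before counting it scores a point for white and turns the two empty points beside it into neutral territory. This is the gap between human and computer scoring described above, and it is why engines trained under these rules learn to actually capture dead stones before passing.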
Now help me crack this with AI. Let's understand how AlphaGo actually works and how somebody in the audience might be able to implement it.

Let's start with an intuition for the underlying search process used to make moves, and then we'll layer on ideas from deep learning to make it much more efficient and tractable. Go is a game with just two players. We're going to draw a person here, and we're going to draw an AI here. Let's say this person is playing black, so they go first. They go here. Now the AI is going to make a move based on what it sees here. There's a question of how you encode this board position as input to the AI. Maybe you could use ones and zeros, but you want to represent black, white, and empty, so you would need at least three different values.
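One common way to get the "at least three different values" into a neural network is to one-hot encode each point into separate binary planes, rather than feeding a single channel of 0/1/2, which would impose an arbitrary ordering on the three states. A minimal sketch, assuming NumPy and the same hypothetical `EMPTY`/`BLACK`/`WHITE` integer encoding; the plane order (black, white, empty) is an illustrative choice, not necessarily what Eric's implementation uses.

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2  # hypothetical integer encoding of each point


def encode(board):
    """Turn a size x size grid of EMPTY/BLACK/WHITE into a (3, size, size)
    float32 array: one binary plane each for black stones, white stones,
    and empty points."""
    b = np.asarray(board)
    return np.stack([b == BLACK, b == WHITE, b == EMPTY]).astype(np.float32)


board = [[EMPTY] * 9 for _ in range(9)]
board[2][2] = BLACK
board[6][6] = WHITE
x = encode(board)
print(x.shape)               # (3, 9, 9)
print(x.sum(axis=0).min())   # 1.0: each point is "on" in exactly one plane
```

For comparison, AlphaGo Zero's input stacked 17 binary planes: eight recent positions of the current player's stones, eight of the opponent's, and one plane indicating the color to play, so the network sees whose turn it is and some move history.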