Dwarkesh Patel
Chip design from the bottom up – Reiner Pope
I'm back with Reiner Pope, CEO of MatX, a new AI chip company.
Last time we were talking about what happens inside a data center.
Now I want to understand what happens inside an AI chip.
How does a chip actually work?
Full disclosure, by the way: I am an angel investor in MatX.
So hopefully you have designed a good chip.
Hope so.
I'll start with the smallest fundamental unit of chip design, and we'll build up to what an actual production chip is and what its components are.
At the very bottom level of a chip, the primitives we work with are logic gates, very simple things like AND, OR, and NOT.
These are connected together by wires that have to be laid out physically as metal traces on the chip.
The main function that AI chips need to compute is matrix multiplication.
Inside that, the fundamental primitive is a multiply-accumulate on pairs of numbers.
We're going to work through that calculation by hand, and then infer what a circuit for it would look like.
It'll be easiest if I do a multiply-accumulate of a four-bit number with another four-bit number.
The clearest primitive is actually multiply-accumulate.
So there's a multiply of these two terms, and then we're going to add in an eight-bit number.
Can I ask a clarifying question?
Why is this the natural primitive for whatever computation happens inside a computer?
There are a few reasons.
It's a little bit more efficient, but the reason it's natural for AI chips is that if you look at what's happening during a matrix multiply...
What is a matrix multiply, in short?
There's a for-loop over i, over j, and over k, of output[i, k] += input[i, j] × other_input[j, k].
A multiply-accumulate happens at every single step of a matrix multiply.
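The triple loop described above can be sketched in Python as follows. This is a minimal illustration only: the names `matmul_mac`, `a`, and `b` are made up for this sketch and do not come from the conversation, and a real chip performs these steps in parallel hardware rather than in a sequential loop.

```python
def matmul_mac(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p) using only multiply-accumulate steps."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(p):
                # One multiply-accumulate: a product is added into a running sum.
                out[i][k] += a[i][j] * b[j][k]
    return out


result = matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

Every pass through the innermost line is one multiply-accumulate, which is why that operation is treated as the basic building block.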
The other observation is that the precision will almost always be higher in the accumulation step than in the multiplication step.
This is specific to AI chips.
You're multiplying low-precision numbers, and then when you accumulate, errors build up quickly, so you need more precision there.
This is why we've chosen to do a four-bit multiplication and an eight-bit addition.
Let me make sure I understood that.
There are two ways to understand that.
One is that the accumulated value will be larger than the inputs.
The other is that if it were a floating-point number it would be...
Maybe that part is less intuitive to me.
But maybe it's the same principle?
It really is the same principle.
The separate principle is that as you're summing up this number, you're summing up a whole bunch of numbers, so you've got a lot of rounding errors accumulating.
Whereas in this case, there's only one multiplication in the chain, so there aren't a lot of rounding errors accumulating in the multiplication.
Why are you summing up a whole bunch of numbers?
There's just two numbers there.
This summation is repeated once for every value of j.
Any errors accumulate.
I see.
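The "value will be larger than the inputs" point can be checked with a little arithmetic. The sketch below assumes unsigned integers, which the conversation does not specify, and the helper name `accumulator_bits` is hypothetical. It covers only the growth in magnitude and does not model floating-point rounding error.

```python
def accumulator_bits(num_terms, p=4, q=4):
    """Bits needed to hold the sum of num_terms products of unsigned p-bit and q-bit values."""
    max_product = (2**p - 1) * (2**q - 1)   # 15 * 15 = 225 for four-bit inputs
    return (num_terms * max_product).bit_length()


print(accumulator_bits(1))    # a single four-bit by four-bit product
print(accumulator_bits(16))   # sixteen such products summed together
```

Under these assumptions a single product of two four-bit numbers can already need eight bits, and each doubling of the number of summed terms can add roughly one more bit.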
So how would we perform this calculation by hand?
As humans, we would probably separate it into two steps, but we can do it all in one using long multiplication.
For the multiplication term first, we're going to multiply this four-bit number by every single bit position in the other four-bit number.
We write that out.
First, 1001 multiplied by this bit position.
That is the number itself.
Then, shifted across by one, we're multiplying by 0.
That gives us an all-0 number.
Shifted across one more to multiply by this 1, we get 1001.
Finally, for this last bit position, we get an all-0 number again.
This gives us a bunch of terms that we have to add for the multiplication.
While we're doing that summation, we might as well add in the actual accumulator term as well.
So we just copy that directly across.
So this is the sum.
It's a five-way sum that we want to compute.
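The long multiplication above can be reproduced with a short Python sketch. The multiplier 0101 is inferred from the walkthrough (bits 1, 0, 1, 0 from the lowest position upward). The accumulator value is not given in the conversation, so `acc` below is a placeholder chosen only for illustration, and `partial_product_rows` is a hypothetical helper name.

```python
def partial_product_rows(a, b, width=4):
    """Return one shifted row per bit of b: a << i where bit i of b is 1, else 0."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


a = 0b1001          # the four-bit number from the example
b = 0b0101          # inferred multiplier: bits 1, 0, 1, 0 from the lowest position up
acc = 0b00000001    # hypothetical eight-bit accumulator value, for illustration only

rows = partial_product_rows(a, b)
for row in rows + [acc]:
    print(format(row, "08b"))   # the five rows of the five-way sum

total = sum(rows) + acc
print(total, a * b + acc)       # both expressions compute the same multiply-accumulate
```

The four shifted rows plus the accumulator row are the five terms of the sum; adding them gives the same value as computing `a * b + acc` directly.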
What logic gates did it take us to get to this intermediate step?
We needed to produce all 16 of these partial products.
How do I produce one of these partial products?
Let's take this number 1 here, for example.
We produce it by multiplying this number by this one over here.
We can produce that with an AND gate.
This number is 1 only if both this bit is 1 and this bit is 1.
If either of them is 0, then the product is 0, since 0 times anything is 0.
To produce all of this, we ended up consuming 16 AND gates.
In the general case, if I were multiplying a p-bit number by a q-bit number, this would take p times q many ANDs.
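As a rough illustration of the p times q count, the sketch below builds every single-bit partial product with a bitwise AND. The function name `and_gate_grid` and the operand 0101 (inferred from the walkthrough) are assumptions of this sketch, not something stated in the conversation.

```python
def and_gate_grid(a, b, p=4, q=4):
    """Return a q x p grid where entry [j][i] is (bit i of a) AND (bit j of b)."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for i in range(p)] for j in range(q)]


grid = and_gate_grid(0b1001, 0b0101)
gate_count = sum(len(row) for row in grid)   # one AND gate per grid entry
for row in grid:
    print(row)                               # bits are listed lowest position first
print(gate_count)
```

Each grid entry corresponds to one AND gate, so a four-bit by four-bit multiply needs 4 × 4 = 16 of them.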
Finally, I sum them.
Most of the work is going to happen in the summing.
Let me describe the other logic gate that we use here.
AND is almost the simplest logic gate that exists on a chip.
It's almost the smallest.
At the other extreme, the very largest logic gate you'll typically use is something called a full adder.
Coming from software, you might think that a full adder adds 32-bit numbers together.
In this case, it just adds three single-bit numbers together, so you can think of it as adding 0, 1, and 1 together.
When I add these together, the result can be 0, 1, 2, or 3, so I can express that in binary using just two bits.
As input, it has three bits.
As output, it has two bits.
The number 2 in binary is 10.
This is also known as a 3→2 compressor because it takes three bits of input and produces two bits of output.
Just to make sure I understood: the two inputs are an X and a Y value and then some carry that came in...
The three inputs are all bits in the same bit position, like three bits in a column here.
The two outputs, I've drawn them vertically here and horizontally here to match this vertical versus horizontal layout.
This expresses that things in the same column are in the same bit position, whereas things in adjacent columns are in different bit positions.
The horizontal one is the carry out, whereas the vertical one is the sum.
So if the inputs to the full adder were, say, 101, then the output would be 10.
If it were 111, it'd be 11.
If it were 000, it'd be 00.
If it were 010, it'd be 01.
Got it.
Yeah.
It's essentially just counting the number of 1s and expressing that in binary.
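The input and output pairs listed above can be checked with a small Python model of the full adder. The XOR and majority expressions are one standard textbook formulation and are an assumption of this sketch; the conversation does not say how the gate is built internally.

```python
def full_adder(a, b, c):
    """3->2 compressor: return (carry, sum) for three one-bit inputs."""
    s = a ^ b ^ c                          # sum bit: 1 when an odd number of inputs are 1
    carry = (a & b) | (a & c) | (b & c)    # carry bit: 1 when at least two inputs are 1
    return carry, s


for bits in [(1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0)]:
    carry, s = full_adder(*bits)
    print(bits, "->", f"{carry}{s}")
```

Reading the carry and sum bits together as a two-bit binary number gives the count of 1s among the three inputs.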
This circuit captures what we as humans naturally do when we're summing along a column.
I'll show one iteration of using the full adder to sum.
The way I sum here is going to be a little unnatural for humans.
We would sum along the column and then remember the carry, but instead of remembering the carry, we'll explicitly write it out.
We proceed from the rightmost column toward the left.
On the rightmost column, we sum the 1 and the 1, and that produces a 0 here and a carry of 1.
We've used this full adder circuit on this pair of bits and produced a pair of bits as output.
Now we can do the same thing with this column.
We have a column of four numbers, so we'll take the first three of them, run a full adder on them, and that gives us a 0 and a 0 as output.
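One possible way to carry this column-by-column procedure through to the end is sketched below. It is an illustration under several assumptions: blank positions in the handwritten layout are treated as explicit 0 bits, carries out of the top column are dropped, the accumulator row `0b00000001` is a placeholder rather than the value used in the conversation, and `column_sum` is a hypothetical name. The order in which bits are grouped may differ from the order used on the whiteboard.

```python
def full_adder(a, b, c=0):
    """3->2 compressor: return (carry, sum) for up to three one-bit inputs."""
    total = a + b + c
    return total >> 1, total & 1


def column_sum(rows, width=8):
    """Add rows column by column, right to left, writing each carry into the next column."""
    columns = [[(row >> k) & 1 for row in rows] for k in range(width)]
    result = 0
    for k in range(width):
        col = columns[k]
        while len(col) > 1:
            taken, col[:] = col[:3], col[3:]    # take up to three bits from this column
            carry, s = full_adder(*taken)
            col.append(s)                       # the sum bit stays in this column
            if k + 1 < width:
                columns[k + 1].append(carry)    # the carry is written into the next column
        result |= (col[0] if col else 0) << k
    return result


rows = [0b1001, 0, 0b100100, 0, 0b00000001]    # four partial products plus the placeholder accumulator
print(column_sum(rows), sum(rows))
```

Because each full adder preserves the value of the bits it consumes (sum plus twice the carry), the final bit left in each column gives the corresponding bit of the total.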