Chip design from the bottom up – Reiner Pope
I'm back with Reiner Pope, CEO of MatX, a new AI chip company.
我又请回了 Reiner Pope,MatX 这家新 AI 芯片公司的 CEO。
Last time we were talking about what happens inside a data center.
上次我们聊的是数据中心里发生了什么。
Now I want to understand what happens inside an AI chip.
这次我想搞清楚 AI 芯片里面发生了什么。
How does a chip actually work?
芯片到底是怎么工作的?
Full disclosure, by the way: I am an angel investor in MatX.
顺带说一下,我是 MatX 的天使投资人。
So hopefully you have designed a good chip.
所以希望你设计了一块好芯片。
Hope so.
希望如此。
I'll start with the smallest fundamental unit of chip design, and we'll build up to what an actual production chip is and what its components are.
我从芯片设计最小的基本单元讲起,一路推导到真正的量产芯片是什么、由哪些部分组成。
At the very bottom level of a chip, the primitives we work with are logic gates, very simple things like AND, OR, and NOT.
芯片最底层,我们操作的原语是逻辑门,非常简单的东西,比如 AND、OR、NOT。
These are connected together by wires that have to be laid out physically as metal traces on a chip.
这些逻辑门通过导线相互连接,导线必须以金属走线的形式物理布局在芯片上。
The main function that AI chips want to compute is the multiplication of matrices.
AI 芯片要计算的主要功能是矩阵乘法。
Inside that, the fundamental primitive is a multiply-accumulate of pairs of numbers.
矩阵乘法里面,基本原语是对数对进行乘加运算。
We're going to demonstrate what that calculation looks like by hand, and then infer what a circuit would look like for that.
我们先手动演示这个计算长什么样,再推导出对应的电路。
It'll be easiest if I do a multiply-accumulate of a four-bit number with another four-bit number.
最简单的做法是用一个 4 位数乘以另一个 4 位数,做一次 MAC。
The clearest primitive is actually multiply-accumulate.
最清晰的原语其实就是乘加。
So there's a multiply of these two terms, and then we're going to add in an eight-bit number.
也就是把这两项相乘,然后加上一个 8 位数。
Can I ask a clarifying question?
我可以问一个澄清性的问题吗?
Why is this the natural primitive for whatever computation happens inside a computer?
为什么这是计算机运算的自然原语?
There are a few reasons.
有几个原因。
It's a little bit more efficient, but the reason it's natural for AI chips becomes clear if you look at what's happening during a matrix multiply.
从效率上说稍微好一些,但对 AI 芯片来说,之所以自然,是因为看一下矩阵乘法里发生了什么。
What is a matrix multiply in short?
矩阵乘法简而言之是什么?
There's a for-loop over i, over j, and over k, of output[i, k] += input[i, j] × other_input[j, k]. A multiply-accumulate happens at every single step of a matrix multiply.
就是对 i、j、k 的三重循环,output[i][k] += input[i][j] × other_input[j][k]。矩阵乘法的每一步都包含一次乘加。
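The triple loop described here can be sketched in a few lines of plain Python. This is only an illustration: the function name `matmul` and the list-of-lists layout are assumptions made for the example, not anything from the conversation. The point is that the innermost statement is exactly one multiply-accumulate.

```python
def matmul(a, b):
    """Multiply an n-by-m matrix by an m-by-p matrix (lists of lists)."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(p):
                # One multiply-accumulate per innermost step.
                out[i][k] += a[i][j] * b[j][k]
    return out


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Each output element is updated once for every value of j, which is why the accumulation runs so much longer than any single multiplication.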
The other observation is that the precision will almost always be higher in the accumulation step than in the multiplication step.
另一个观察是,累加步骤所需的精度几乎总是高于乘法步骤。
This is specific to AI chips.
这是 AI 芯片特有的。
You're multiplying low-precision numbers, and then when you accumulate, errors accumulate quickly, so you need more precision there.
你在乘低精度数,累加时误差积累很快,所以这里需要更高精度。
This is why we've chosen to do a four-bit multiplication and an eight-bit addition.
这就是为什么我们选择做 4 位乘法加 8 位加法。
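A rough way to see the width argument, assuming unsigned integers (an assumption made for this sketch; the conversation does not fix a number format): the product of two four-bit numbers already needs eight bits, and adding an eight-bit accumulator can push the result into a ninth bit. The helper name `max_mac_value` is hypothetical.

```python
def max_mac_value(p_bits, q_bits, acc_bits):
    """Largest a*b + c for unsigned a (p bits), b (q bits), c (acc bits)."""
    max_a = (1 << p_bits) - 1
    max_b = (1 << q_bits) - 1
    max_c = (1 << acc_bits) - 1
    return max_a * max_b + max_c


max_product = 15 * 15
print(max_product, max_product.bit_length())  # 225 8

worst = max_mac_value(4, 4, 8)
print(worst, worst.bit_length())  # 480 9
```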
Let me make sure I understood that.
让我确认一下我理解对了。
There are two ways to understand that.
可以从两个角度理解。
One is that the value will be larger than the inputs.
一是结果的值会比输入更大。
The other is that if it were a floating-point number, it would be...
另一个是,如果是浮点数的话,
Maybe that part is less intuitive to me.
这部分对我来说不太直觉。
But maybe it's the same principle?
但也许原理是一样的?
It really is the same principle.
原理确实是一样的。
The separate principle is that the accumulation sums up a whole bunch of numbers, so you've got a lot of rounding errors accumulating.
另一个独立的原理是,在做这个加法的时候,你在累加一大堆数,所以会积累大量舍入误差。
Whereas in this case, there's only one multiplication in the chain, so there aren't a lot of rounding errors accumulating in the multiplication.
而在乘法这条链上只有一次乘法,所以乘法本身不会积累太多舍入误差。
Why are you summing up a whole bunch of numbers?
为什么要累加一大堆数?
There's just two numbers there.
那里只有两个数啊。
This summation is repeated once for every value of j.
这个求和要重复 j 次。
Any errors accumulate.
所以误差会累积。
I see.
明白了。
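The effect of rounding inside a long accumulation can be illustrated with a small experiment. This is a sketch under stated assumptions, not how any particular chip behaves: it emulates a half-precision (fp16) accumulator by rounding through Python's `struct` format `"e"` after every addition, and compares it with an accumulator kept in Python's ordinary 64-bit float. The helper names `to_fp16` and `accumulate` are made up for the example.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(values, round_fn):
    """Sum values, rounding the running total with round_fn after each add."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


ones = [1.0] * 4096
low = accumulate(ones, to_fp16)        # accumulator rounded to fp16 each step
high = accumulate(ones, lambda x: x)   # accumulator kept in a 64-bit float
print(low, high)  # 2048.0 4096.0
```

Once the low-precision total reaches 2048, adding 1 no longer changes it, because neighboring fp16 values at that magnitude are 2 apart. A single multiplication has no such chain of repeated roundings.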
So how would we perform this calculation by hand?
那我们怎么手动做这个计算?
As a human, we would probably separate it into two, but we can do it all in one using long multiplication.
人类通常会把它拆成两步,但用长乘法可以一次搞定。
For the multiplication term first, we're going to multiply this four-bit number by every single bit position in the other four-bit number.
先做乘法项:把这个 4 位数和另一个 4 位数的每一个 bit 位分别相乘。
We write that out.
把结果写出来。
First, 1001 multiplied by this bit position.
首先,1001 乘以这个 bit 位。
That is the number itself.
就是这个数本身。
Then shifted across by one, we're multiplying by 0.
向左移一位,乘以 0。
That gives us an all-0 number.
得到一个全零的数。
Shifted across one more to multiply by this one, we get 1001.
再向左移一位,乘以这个 1,得到 1001。
Finally, for this last bit position, we get an all-0 number again.
最后,最高 bit 位,又得到全零。
This gives us a bunch of terms that we have to add for the multiplication.
这给了我们一堆需要相加的乘法项。
While we're doing that summation, we might as well add in the actual accumulator term as well.
做这个求和的同时,顺便把真正的累加项也加进去。
So we just copy that directly across.
直接把它复制过来。
So this is the sum.
这就是要求的和。
It's a five-way sum that we want to compute.
是一个五路求和。
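The rows written out above can be reproduced with shifts. In this sketch the multiplicand is 1001 as in the example, and the multiplier is taken to be 0101, read off from the walkthrough (bit positions 1, 0, 1, 0 starting from the least significant). The accumulator value is not given in the conversation, so `0b01100101` below is a placeholder chosen only for illustration, and the function name `mac_terms` is hypothetical.

```python
def mac_terms(a, b, acc, q_bits=4):
    """Return the shifted partial-product rows of a*b, followed by acc."""
    rows = []
    for j in range(q_bits):
        bit = (b >> j) & 1
        # Row j is a shifted left by j if bit j of b is 1, otherwise all zeros.
        rows.append((a << j) if bit else 0)
    return rows + [acc]


a, b = 0b1001, 0b0101
acc = 0b01100101  # placeholder accumulator value, not from the source
terms = mac_terms(a, b, acc)
for t in terms:
    print(f"{t:08b}")
print(sum(terms) == a * b + acc)  # True
```

Four rows come from the multiplication and one from the accumulator, which gives the five-way sum.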
What logic gates did it take us to get to this intermediate step?
到这一步用了哪些逻辑门?
We needed to produce all 16 of these partial products.
我们需要产生全部 16 个部分积。
How do I produce one of these partial products?
怎么产生其中一个部分积?
Let's take this number 1, for example here.
比如这个 1。
We produce it by multiplying this number by this one over here.
它是把这个数和这里的那个数相乘得到的。
We can produce that with an AND gate.
用一个 AND 门就能产生它。
This number is 1 if both this bit is 1 and this bit is 1.
当且仅当这个 bit 是 1 且那个 bit 也是 1 时,结果才是 1。
If either of them is 0, then the multiplication of 0 times anything is 0.
如果其中一个是 0,那么 0 乘以任何数都是 0。
To produce all of this, we ended up consuming 16 AND gates.
产生这全部的部分积,消耗了 16 个 AND 门。
In the general case, if I were multiplying a p-bit number by a q-bit number, this would take p times q many ANDs.
一般情况下,如果做 p 位数乘 q 位数,就需要 p×q 个 AND 门。
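The AND-gate count can be checked with a short sketch. Each partial-product bit is the AND of one bit from each operand, so a p-bit by q-bit multiply produces a q-by-p grid of ANDs. The function name `partial_product_bits` and the least-significant-bit-first ordering are choices made for this example.

```python
def partial_product_bits(a, b, p_bits=4, q_bits=4):
    """Return a grid where grid[j][i] = (bit i of a) AND (bit j of b)."""
    a_bits = [(a >> i) & 1 for i in range(p_bits)]
    b_bits = [(b >> j) & 1 for j in range(q_bits)]
    return [[ai & bj for ai in a_bits] for bj in b_bits]


grid = partial_product_bits(0b1001, 0b0101)
and_gates = sum(len(row) for row in grid)
print(and_gates)  # 16

# Placing grid[j][i] at bit position i + j recovers the product.
product = sum(bit << (i + j) for j, row in enumerate(grid) for i, bit in enumerate(row))
print(product)  # 45
```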
Finally, I sum them.
接下来做求和。
Most of the work is going to happen in the summing.
大部分工作都在求和上。
Let me describe the other logic gate that we use here.
再介绍一下这里用到的另一种逻辑门。
AND is almost the simplest logic gate that exists on a chip.
AND 是芯片上几乎最简单的逻辑门,
It's almost the smallest.
也几乎是最小的。
At the other extreme, the very largest logic gate you'll typically use is something called a full adder.
另一个极端,通常用到的最大逻辑门叫做全加器。
Coming from software, you might think that a full adder adds 32-bit numbers together.
从软件的角度来看,你可能觉得全加器是把 32 位数加起来。
In this case, it just adds three single-bit numbers together, so you can think of it as adding 0, 1, and 1 together.
实际上,它只做三个单 bit 数的加法,比如把 0、1、1 加在一起。
When I add these together, the result can be 0, 1, 2, or 3, so I can express that in binary using just two bits.
三个数相加,结果可以是 0、1、2、3,用两个 bit 就能表示。
As input, it has three bits.
输入是三个 bit,
As output, it has two bits.
输出是两个 bit。
The number 2 in binary is 10.
数字 2 的二进制是 10。
This is also known as a 3→2 compressor because it takes three bits of input and produces two bits of output.
也叫 3→2 压缩器,因为它把三个输入 bit 压缩成两个输出 bit。
Just to make sure I understood: two of the inputs are an X and a Y value, and then there's some carry that came in...
确认一下我的理解:两个输入是 X 和 Y,然后还有一个进位输入,
The three inputs are all bits in the same bit position, like three bits in a column here.
三个输入都是同一 bit 位置上的 bit,比如同一列里的三个 bit。
The two outputs, I've drawn them vertically here and horizontally here to match this vertical versus horizontal layout.
两个输出,我在这里画成竖向和横向,对应竖向和横向的布局。
This expresses that things in the same column are in the same bit position, whereas things in adjacent columns are different.
同一列的东西在同一 bit 位置,相邻列的东西在不同 bit 位置。
This is a carry out, whereas this was the sum.
这个是进位输出,那个是和。
So if the inputs in the full adder were, say, 101, then the output would be 10.
如果全加器的输入是 101,输出就是 10。
If it were 111, it'd be 11.
如果是 111,输出就是 11。
If it were 000, it'd be 00.
如果是 000,输出是 00。
If it were 010, it'd be 01.
如果是 010,输出还是 01。
Got it.
明白了。
Yeah.
对。
It's essentially just counting the number of things and expressing that in binary.
本质上就是数 1 的个数,然后用二进制表示。
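The full adder's behavior fits in a few lines. This sketch uses one common gate-level formulation (XOR for the sum bit, majority for the carry bit); the conversation does not say which gate decomposition is used, so treat that as an assumption. The output is returned as (carry, sum) so that the pair reads like a two-bit binary number.

```python
def full_adder(a, b, c):
    """3->2 compressor: add three single bits, return (carry, sum)."""
    s = a ^ b ^ c                          # 1 when an odd number of inputs are 1
    carry = (a & b) | (a & c) | (b & c)    # 1 when at least two inputs are 1
    return carry, s


for bits in ((1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0)):
    carry, s = full_adder(*bits)
    print(bits, "->", f"{carry}{s}")
```

For the four inputs mentioned in the conversation this prints 10, 11, 00, and 01, matching the count of ones written in binary.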
This circuit captures what we as humans naturally do when we're summing along a column.
这个电路模拟了人类按列求和时自然的思路。
I'll show one iteration of using the full adder to sum.
演示一轮用全加器求和的过程。
The way I sum here is going to be a little unnatural for humans.
我的求和方式对人类来说有点不自然。
We would sum along the column and then remember the carry, but instead of remembering the carry, we'll explicitly write it out.
人类通常按列求和然后记住进位,但我们不记进位,而是把它显式写出来。
We proceed from the rightmost column toward the left.
从最右列开始向左推进。
On the rightmost column, we sum the 1 and the 1, and that produces a zero here and a carry of one.
最右列,把 1 和 1 相加,这里得 0,进位为 1。
We've used this full adder circuit on this pair of bits, with the third input effectively 0, and produced a pair of bits as output.
我们把全加器电路用在这对 bit 上,输出了一对 bit。
Now we can do the same thing with this column.
下一列做同样的事。
We have a column of four numbers, so we'll take the first three of them, run a full adder on them, and that gives us a 0 and a 0 as output.
这一列有四个数,取前三个,跑一次全加器,得到 0 和 0。
The sum of these is 00.
这三个数的和是 00。
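The column-by-column pass can be sketched as follows. Several things here are assumptions made for illustration rather than details from the conversation: the multiplier 0101 is read off from the earlier walkthrough, the accumulator `0b01100101` is a placeholder (the source only implies its two lowest bits), lone bits are passed through untouched, and the names `build_columns`, `sweep`, and `value` are hypothetical. Each carry is written explicitly into the next column instead of being remembered.

```python
def full_adder(a, b, c):
    """3->2 compressor: add three single bits, return (carry, sum)."""
    return (a & b) | (a & c) | (b & c), a ^ b ^ c


def build_columns(a, b, acc, p_bits=4, q_bits=4, acc_bits=8):
    """columns[k] holds every bit of weight 2**k in the five-way sum."""
    columns = [[] for _ in range(max(p_bits + q_bits - 1, acc_bits))]
    for j in range(q_bits):
        for i in range(p_bits):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    for k in range(acc_bits):
        columns[k].append((acc >> k) & 1)
    return columns


def sweep(columns):
    """One right-to-left pass, writing each carry into the next column."""
    cols = [list(c) for c in columns]
    k = 0
    while k < len(cols):
        bits, cols[k] = cols[k], []
        for g in range(0, len(bits), 3):
            group = bits[g:g + 3]
            if len(group) == 1:
                cols[k].append(group[0])     # a lone bit passes through
                continue
            group += [0] * (3 - len(group))  # pad a pair with a zero input
            carry, s = full_adder(*group)
            cols[k].append(s)
            if k + 1 == len(cols):
                cols.append([])
            cols[k + 1].append(carry)
        k += 1
    return cols


def value(columns):
    """Numeric value represented by all bits in all columns."""
    return sum(bit << k for k, col in enumerate(columns) for bit in col)


a, b, acc = 0b1001, 0b0101, 0b01100101  # acc is a placeholder value
cols = build_columns(a, b, acc)
while any(len(c) > 1 for c in cols):
    cols = sweep(cols)                  # every pass preserves the total value
print(value(cols) == a * b + acc)       # True
```

Because a full adder's outputs satisfy 2 × carry + sum = a + b + c, every pass leaves the represented value unchanged while the columns get shorter. Repeating the pass until each column holds a single bit yields the final binary result.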