Suno's Mikey Shulman: Everyone Can Make Music Now
In Western music, there are 12 tones.
西方音乐里一共有 12 个音调。
If you tell the model there are 12 tones, it will only ever produce those 12 tones.
如果你告诉模型只有 12 个音调,它就永远只会产生这 12 个音调。
You will be forever limited.
你会被永远限制住。
And so, for us, it was all about, "Let's throw away everything we know about music, and let's try to do this from scratch."
所以对我们来说,核心就是把我们对音乐的所有认知全部抛掉,从零开始。
And it's like it's just a sound wave.
说到底,它就是一段声波。
It's just sampled 48,000 times a second, and it is a continuous, you know, float32 number, and let's figure out how to model that.
就是以每秒 48000 次的频率采样,是个连续的 float 32 的浮点数,然后我们想办法对它建模。
And that was a lot of the early breakthroughs that we had to make, but once we did, we realized that it is a totally generic music-making machine.
这是我们当时需要突破的很多早期难关,但一旦突破之后,我们发现这是一台完全通用的音乐生成机器。
And now you are only constrained by what you can describe and your imagination.
现在唯一限制你的,就是你能描述什么,以及你的想象力。
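To make the contrast described above concrete, here is a minimal sketch, not Suno's code, of the two views of music: a fixed grid of 12 tones per octave versus a raw waveform stored as float32 samples at 48,000 samples per second. The 440 Hz reference pitch, the function names, and the 452.3 Hz example are illustrative assumptions; only the sample rate and the float32 format come from the interview.

```python
import math
from array import array

SAMPLE_RATE = 48_000  # samples per second, as quoted in the interview


def equal_tempered_hz(semitones_from_a4: int) -> float:
    """Pitch on the 12-tone grid, assuming equal temperament with A4 = 440 Hz."""
    return 440.0 * 2.0 ** (semitones_from_a4 / 12)


def sine_wave(freq_hz: float, seconds: float) -> array:
    """Render a sine tone as float32 samples in the range [-1, 1]."""
    n = int(SAMPLE_RATE * seconds)
    return array(
        "f",  # type code "f" stores 4-byte single-precision floats
        (math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)),
    )


# A model that only knows the 12 tones can only choose pitches on the grid.
on_grid = sine_wave(equal_tempered_hz(0), 0.5)  # A4, 440 Hz

# A raw waveform has no grid: 452.3 Hz sits between A4 and A#4 (about 466.2 Hz).
off_grid = sine_wave(452.3, 0.5)

print(len(on_grid), on_grid.itemsize)
```

Both tones end up as the same kind of object, a sequence of float32 numbers, which is why a model trained directly on samples is not tied to any particular tuning system.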
I'm delighted to welcome Mikey Shulman.
我非常高兴地欢迎 Mikey Shulman 来到今天的节目。
Mikey is founder and CEO of Suno, which is building a music company, or a creative entertainment platform, and has been one of the most novel consumer applications I've seen come out of AI. I'm very, very excited to ask you about your journey and what's ahead for Suno.
呃,Mikey 是 Suno 的联合创始人兼 CEO,Suno 正在打造一家音乐公司,或者说是创意娱乐平台,在我见过的所有 AI 消费者应用中,这是最新颖的之一,我非常非常期待向你请教,呃,你的创业历程以及 Suno 的未来。
So, thank you for joining us today.
感谢你今天来到我们节目。
Thank you for having me.
谢谢你邀请我。
I'm excited.
我很兴奋。
Okay, awesome.
好,太棒了。
I want to start with your background because it is very, very unexpected.
我想先聊聊你的背景,因为真的很出人意料。
You went from a physics PhD at Harvard, I think quantum computing with solid-state spins, to building the largest AI music company in the world.
呃,你从哈佛的物理学博士,研究的好像是固态自旋量子计算,到后来创建了全球最大的 AI 音乐公司。
Like, what insight connected those two things for you?
是什么洞察把这两件事联系在一起的?
You know, on paper, I guess I have no business building a consumer entertainment company, but a lot of people went from physics into AI, just like, you know, 30 years ago a lot of people went from physics into quantitative trading.
呃,说实话,从纸面上看,我好像没有任何资格去做消费娱乐公司,但就像 30 年前很多物理学家转行做量化交易一样,很多人也从物理学转入了 AI。
I'll be honest, though: I was only an okay physicist, and there are a lot of better physicists, including one of my co-founders.
不过说实话,我只是个还过得去的物理学家,有很多更优秀的,包括我的一位联合创始人。
And I think what I mostly learned is that playing at the nexus of two things that don't usually play together is just a massive opportunity in all domains.
我觉得我从中学到最多的是,在两个通常不交叉的领域的交叉点上去探索,在任何行业都是巨大的机会。
It can be music and technology, it can be quantum mechanics and low-temperature microwave engineering, or it could be whatever else you're going to do.
可以是音乐和技术,可以是量子力学和低温微波工程,或者任何你接下来想做的事。
You and I got connected in the very early days of Suno.
你和我在 Suno 非常早期的时候就结缘了。
One of our mutual friends, Harrison Chase, was one of the earliest Suno Discord users, and he was having far too much fun making songs in your Discord.
我们的共同朋友 Harrison Chase,是最早的一批 Suno Discord 用户之一,他在你们的 Discord 里玩得不亦乐乎,做了好多歌。
Maybe tell us about the early days of Suno.
呃,跟我们聊聊 Suno 早期的故事吧。
How did it come together?
是怎么走到一起的?
Did you set out to build a music company?
你当初是有意要创建一家音乐公司吗?
Originally, we thought this would actually be too hard.
最初,我们其实觉得这件事难度太大了。
And it's because you have to rely on... this was, like, before the ChatGPT moment.
因为你得依赖……那时候还是 ChatGPT 时刻之前。
We did some back-of-the-envelope math.
我们做了一些粗略的估算。
We knew we loved audio, but the back-of-the-envelope math told us that actually producing good music, making good music, generating good music, was a couple of orders of magnitude away in terms of compute and model size and capability.
我们知道自己热爱音频,但粗略估算告诉我们,要做出好音乐,无论是制作好音乐、生成好音乐,在算力、模型规模和能力方面都还差了好几个数量级。
And it's because music, sound in general, is very unwieldy.
原因在于音乐,声音这种东西,实在太难驯服了。
It's not in discrete bits like text is.
它不像文字那样是离散的单元。
And so we actually started building a company that was all around using the same technologies to make sense of audio, not to produce it.
于是我们开始建立一家公司,专注于用同样的技术来理解音频,而不是生成它。
And, very happily, pretty early on, we had the right breakthroughs, and we realized, "Oh, we actually can make music."
然后非常幸运的是,相当早期,我们就有了关键突破,意识到,哦,我们其实真的可以做音乐。
You're pretty good at math.
你数学还挺厉害的。
What did you get wrong with your back-of-the-napkin math then?
那你们的粗略估算到底错在哪里了?
The math was right.
数学本身是对的。
We just had some breakthroughs that said you actually don't need that amount of compute.
我们只是有了一些突破,发现其实你不需要那么多算力。
You can make the right technological breakthroughs to, if you want to think about it this way, basically just compress audio really, really efficiently.
可以通过正确的技术突破,如果你愿意这样理解的话,就是把音频压缩得非常非常高效。
And that worked a hell of a lot better than we anticipated.
而且效果比我们预期的好太多了。
So it was a very nice being-wrong moment.
所以这是一次很好的被打脸时刻。
Not all being-wrong moments are so pleasant.
不是所有被打脸的时刻都这么愉快。
And to be clear, at the beginning, the music was terrible, but we still stayed up late.
说实话,一开始音乐很糟糕,但我们还是会熬夜玩。
...was good.
感觉很好。
He was one of only the first 10 users, I think.
他那时候应该是前 10 名用户之一吧。
He thought he was pretty...
他当时以为他……
[laughter]
[笑声]
Certainly, before we put it on Discord, the music was very terrible.
在上线 Discord 之前,音乐质量确实非常糟糕。
Before we put it on Discord, we could make 12-and-a-half-second clips that wouldn't always listen to the words you asked them to sing.
在上线 Discord 之前,我们只能生成大概 12.5 秒的片段,而且不一定听得进去你让它唱的词。
But we had so much fun doing it.
但我们玩得特别开心。
And we thought other people might have fun doing it.
我们觉得别人也可能会玩得很开心。
And so we kind of took the example of Midjourney, and we said it's really easy to put a Discord bot out and see whether people will enjoy it.
于是我们借鉴 Midjourney 的做法,说把 Discord 机器人上线很简单,看看大家会不会喜欢。
And we put it out there, and a hell of a lot of people enjoyed it.
我们上线了,有一大堆人真的很喜欢。
And that was a really confirmatory moment for us.
这对我们来说是个非常有力的验证时刻。
And so, a lot of people told us not to build a music company.
然后嘛,有很多人劝我们不要做音乐公司。
It's not the easiest business to work in.
这不是一个最容易做的行业。
Speech is really big.
语音是个很大的赛道。
There are a lot of great business use cases for building speech technologies, but when you are staying up late playing with the thing, and you don't want to go to sleep, it's a really good sign that that is what you are meant to be doing.
语音技术有很多很好的商业应用场景,但当你熬夜在那儿玩、不想去睡觉的时候,那就是你真正应该做的事的信号。
And so, that's what we did.
所以我们就去做了。
I love that.
我喜欢这个。
Are you a musician?
你是音乐人吗?
I am.
是的。
I play almost every day.
我几乎每天都在弹。
I grew up playing a lot of piano, and ended up picking up a bass around age 12 and playing a lot more of that.
我从小学了很多钢琴,后来大概 12 岁的时候开始弹贝斯,然后弹得越来越多。
Okay, so personal passion project.
好,所以是个人热情所在。
That's awesome.
太棒了。
You know, the revisionist history, which is true, is that we used to have jam sessions at our last company in the basement of one of my co-founders.
从后见之明来说,有一个真实的版本是,我们以前在上一家公司,会在我一位联合创始人的地下室里开即兴演奏会。
And it's true, we had a lot of fun there.
确实,我们玩得很开心。
It's not why we started the company.
但这不是我们创业的原因。
Again, we thought it would be too hard to do this.
我们当时还是觉得这件事太难了。
It was just fun.
只是很好玩而已。
Meaning at Kensho?
是在 Kensho 吗?
At Kensho.
在 Kensho。
Yes, where I met the great Harrison Chase.
就是在那,认识了 Harrison Chase。
The Kensho mafia is pretty unparalleled.
Kensho 系真的挺厉害的,堪称无与伦比。
There's Harrison, but also Daniel Nadler, Sam Whitmore, you.
有 Harrison,还有 Daniel Nadler、Sam Whitmore,以及你。
Oh, there are a lot of you.
哦,你们真多。
There's a lot of us.
我们确实挺多的。
I just credit Daniel with that, honestly.
说真的,这都归功于 Daniel。
Daniel is, I think, the best object lesson in what talent density can do for a company.
Daniel 是我见过的最好的例子,说明人才密度能为一家公司做什么。
And it was a lot of people with non-traditional backgrounds.
而且都是背景非常非传统的人。
It skewed very young, but he was great at finding people and great at convincing them to join.
整体偏年轻,但他很善于发现人才,也很善于说服他们加入。
I love that.
我喜欢这个。
Okay, so walk us through what happens when somebody types "upbeat 90s hip-hop track about a road trip."
好,带我们了解一下,当有人输入“关于公路旅行的欢快 90 年代嘻哈歌曲”的时候,会发生什么。
You get the prompt in, what happens?
提示词进来之后,接下来是什么?
What is the model doing to be able to pass something back to the user that seems quite special?
模型到底在做什么,才能把一个看起来很特别的东西传回给用户?
In some way, it's actually pretty simple.
某种程度上其实挺简单的。
With a prompt like that, you have to figure out what the words of this song are, and we use various LLMs to do that, to make the lyrics.
面对这样的提示词,首先要弄清楚这首歌的歌词是什么,我们用各种 LLM 来生成歌词。
And so it's basically taking the cue there, which is road trip, and asking, what should this road trip be about?
所以基本上是在抓取线索,比如是公路旅行,那这首公路旅行应该讲什么?
And it will probably get it wrong because you didn't give us enough information, but that's actually okay.
大概率会搞错,因为你没给我们足够的信息,但其实也没关系。
You can iterate on it.
你可以迭代。
And then you said 90s hip-hop, and we try to expand that out into a set of cues that the model can really understand.
然后你说了 90 年代嘻哈,我们会把它扩展成一组模型能真正理解的提示。
What is the genre?
这是什么类型?
What is the style of this music?
这是什么风格的音乐?
And then you put those things together: you have a lot of lyrics, you have a lot of styles, and we have our models that are trained to take in all of that information and just produce sound.
把这些东西组合在一起,你有大量的歌词,有大量的风格描述,然后我们的模型被训练来接收所有这些信息,直接生成声音。
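The flow described above (draft lyrics with an LLM, expand the genre tag into richer style cues, then hand both to an audio model) can be sketched roughly as follows. This is an illustration under stated assumptions, not Suno's implementation: every function, class, and lookup table here is hypothetical, the LLM and audio model are replaced by placeholders, and only the 48,000 Hz sample rate comes from the interview.

```python
from array import array
from dataclasses import dataclass

SAMPLE_RATE = 48_000  # from the interview; everything else here is hypothetical


@dataclass
class SongRequest:
    lyrics: str
    style_cues: list[str]


def write_lyrics(prompt: str) -> str:
    """Placeholder for the LLM call that drafts lyrics from the prompt."""
    return f"[draft lyrics about: {prompt}]"


# Toy lookup standing in for whatever expands a short tag into richer cues.
STYLE_HINTS = {
    "90s hip-hop": ["genre: hip-hop", "era: 1990s"],
    "upbeat": ["mood: upbeat"],
}


def expand_style(prompt: str) -> list[str]:
    """Collect cues for every known tag that appears in the prompt."""
    lowered = prompt.lower()
    cues: list[str] = []
    for tag, hints in STYLE_HINTS.items():
        if tag in lowered:
            cues.extend(hints)
    return cues


def generate_audio(request: SongRequest, seconds: float) -> array:
    """Placeholder for the audio model: returns float32 silence of the right length."""
    return array("f", [0.0] * int(SAMPLE_RATE * seconds))


def make_song(prompt: str, seconds: float = 1.0) -> tuple[SongRequest, array]:
    """Run the three steps in order: lyrics, style cues, then audio."""
    request = SongRequest(write_lyrics(prompt), expand_style(prompt))
    return request, generate_audio(request, seconds)


request, audio = make_song("Upbeat 90s hip-hop track about a road trip")
print(request.style_cues)
```

Keeping lyrics and style cues as separate fields mirrors the point made in the interview that a wrong first guess is fine: a user could revise one without touching the other and generate again.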