Sequoia Capital
Suno's Mikey Shulman: Everyone Can Make Music Now
In Western music, there are 12 tones.
If you tell the model there are 12 tones, it will only ever produce those 12 tones.
You will be forever limited.
And so, for us, it was all about let's throw away everything we know about music, and let's try to do this from scratch.
And it's like it's just a sound wave.
It's just sampled 48,000 times a second, and each sample is a continuous, you know, float32 number, and let's figure out how to model that.
And that was a lot of the early breakthroughs that we had to make, but once we did, we realized that it is a totally generic music-making machine.
And now you are only constrained by what you can describe and your imagination.
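To make the contrast above concrete, here is a minimal sketch in Python (standard library only). It is an illustration, not Suno's code: the 440 Hz reference pitch, the 452 Hz example pitch, and the function names are assumptions chosen for the example. It lists the 12 equal-tempered pitches of one octave, then builds a raw waveform at a pitch that falls between two of them, stored as 32-bit floats at 48,000 samples per second.

```python
import math
from array import array

SAMPLE_RATE = 48_000  # samples per second, the figure quoted in the conversation


def equal_tempered_pitches(a4_hz: float = 440.0) -> list[float]:
    """Frequencies of 12 consecutive semitones starting at A4 (12-tone equal temperament)."""
    return [a4_hz * 2 ** (k / 12) for k in range(12)]


def sine_wave(freq_hz: float, seconds: float) -> array:
    """Return a mono sine wave as 32-bit floats in the range [-1, 1]."""
    n = int(SAMPLE_RATE * seconds)
    return array(
        "f",  # typecode "f" stores each sample as a C float (32 bits)
        (math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)),
    )


pitches = equal_tempered_pitches()

# Hypothetical pitch between A4 (440 Hz) and A#4 (about 466.16 Hz).
# A 12-tone symbolic representation has no name for it; a raw waveform does not care.
in_between_hz = 452.0
wave = sine_wave(in_between_hz, 0.5)
```

Half a second of audio is already 24,000 numbers, which hints at why modeling the raw wave is harder than modeling note symbols.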
I'm delighted to welcome Mikey Shulman.
Uh, Mikey is the founder and CEO of Suno, which is building a music company or a creative entertainment platform, uh, and has been one of the most novel consumer applications I've seen out of AI, and I'm very, very excited to ask you about, uh, your journey and what's ahead for Suno.
So, thank you for joining us today.
Thank you for having me.
I'm excited.
Okay, awesome.
I want to start with your background because it is very, very unexpected.
Uh, you went from a physics PhD at Harvard, I think quantum computing with solid-state spins, uh, to building the largest AI music company in the world.
Like, what insight connected those two things for you?
Uh, you know, I don't know. Like, on paper, I guess I have no business building a consumer entertainment company, but um, a lot of people went from physics into AI just like, you know, 30 years ago a lot of people went from physics into quantitative trading.
Um, I'll be honest though, like I was only an okay physicist, and um, there are a lot of better physicists, uh, including one of my co-founders.
And I think what I mostly learned is um playing at the nexus of two things that don't usually play together um is just a massive opportunity in all domains.
It can be music and technology, it can be uh quantum mechanics and low-temperature microwave engineering, or it could be whatever else you're going to do.
Um you and I got connected in the very early days of Suno.
One of our mutual friends, Harrison Chase, uh was one of the earliest Suno Discord users, and he was having far too much fun uh making songs in your Discord.
Uh maybe tell us about the early days of Suno.
How did it come together?
Did you set out to build a music company?
Um originally, we thought this would actually be too hard.
Um and it's because, uh, this is, like, before the ChatGPT moment.
Um we did some like back-of-the-envelope math.
We knew we loved audio, but the back-of-the-envelope math told us that actually producing good music, um making good music, generating good music, um was like a couple of orders of magnitude um away in terms of compute and model size and capability.
And um it's because music, sound in general, is like very unwieldy.
It's not in discrete bits like text is.
And so we actually started building a company that was all around using the same technologies to make sense of audio, not to produce it.
And um very happily, pretty early on, we had the right breakthroughs, and we realized, "Oh, we actually can make music."
You're pretty good at math.
What did you get wrong with your back-of-the-napkin math then?
Uh the math was right.
We just had some breakthroughs that said, like, you actually don't need that amount of compute.
Um you can make the right technological breakthroughs to, if you want to think about it that way, basically just compress audio really, really efficiently.
Um and that worked a hell of a lot better than we anticipated.
So it was like a very nice being-wrong moment.
Um not all being-wrong moments are so pleasant.
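A rough back-of-the-envelope sketch of why compression matters here, in Python. Only the 48,000 samples per second and the float32 sample format come from the conversation. The three-minute song length, the single audio channel, and the 50 frames per second for a compressed representation are assumptions picked purely for illustration; the source does not say what rate or method Suno actually uses.

```python
SAMPLE_RATE = 48_000   # samples per second (from the conversation)
BYTES_PER_SAMPLE = 4   # one float32 per sample (from the conversation)
SONG_SECONDS = 180     # assumption: a three-minute, single-channel song

# Assumption: a learned compressor emitting 50 frames per second.
# This number is hypothetical and is not taken from the source.
HYPOTHETICAL_FRAME_RATE = 50

raw_samples = SAMPLE_RATE * SONG_SECONDS                 # values a raw-audio model would have to produce
raw_bytes = raw_samples * BYTES_PER_SAMPLE               # storage for the uncompressed waveform
compressed_frames = HYPOTHETICAL_FRAME_RATE * SONG_SECONDS
reduction = raw_samples / compressed_frames              # how much shorter the sequence becomes

print(f"raw samples:       {raw_samples:,}")
print(f"raw bytes:         {raw_bytes:,}")
print(f"compressed frames: {compressed_frames:,}")
print(f"sequence is {reduction:.0f}x shorter")
```

Under these assumed numbers the model has to generate thousands of steps per song rather than millions, which is the kind of gap that can turn "a couple of orders of magnitude away" into something feasible.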
And um to be clear, at the beginning, the music was terrible, but we still uh stayed up late.
...was good.
He was one of the first 10 users, I think.
He thought he was pretty...
[laughter]
Uh certainly before we put it on Discord, the music was very terrible.
Before we put it on Discord, we could make, like, 12-and-a-half-second clips that, um, wouldn't always listen to the words you asked them to sing.
But, we had so much fun doing it.
And we thought other people might have fun doing it.
And so, um we kind of took the example of Midjourney, and we said it's really easy to put a Discord bot out and see will people enjoy it.
And we put it out there, and a hell of a lot of people enjoyed it.
And that was um a really confirmatory moment for us.
And so, a lot of people told us not to build a music company.
It's not the easiest business to work in.
Speech is really big.
There's a lot of great um business use cases for building speech technologies, but when you are staying up late playing with the thing, and you don't want to go to sleep, it's like a really good sign that that is what you are meant to be doing.
And so, that's what we did.
I love that.
Are you a musician?
I am.
Uh I play almost every day.
Uh I grew up playing a lot of piano, and um ended up picking up a bass around age 12, and playing a lot more of that.
Okay, so personal passion project.
That's awesome.
You know, the revisionist history, which is true, is that we used to have jam sessions at our last company in one of my co-founders' basements.
And it's true, we had a lot of fun there.
It's not why we started the company.
Again, we thought it would be too hard to do this.
It was just fun.
Meaning at Kensho?
At Kensho.
Yes, where I met the great Harrison Chase.
The Kensho mafia is like pretty unparalleled.
There's Harrison, but also Daniel Nadler, Sam Whitmore, you.
Oh, there are a lot of you.
There's a lot of us.
I just credit Daniel with that, honestly.
Um Daniel is, I think, the best object lesson in what talent density can do for a company.
And it was a lot of people with non-traditional backgrounds.
It skewed very young, but he was great at finding people and great at convincing them to join.
I love that.
Okay, so walk us through what happens when, like, somebody types "upbeat 90s hip-hop track about a road trip."
You get the prompt in, what happens?
What is the model doing to be able to pass something back to the user that seems like it's quite special?
Um in some way, it's actually pretty simple.
A prompt like that, you have to figure out what are the words of this song, and we use various LLMs to do that to make the lyrics.
And um so basically the cue there is "road trip," and so, like, what should this road trip be about?
And it will probably get it wrong cuz you didn't give us enough information, but that's actually okay.
You can iterate on it.
And then you said 90s hip-hop, and we try to expand that out into a set of cues that the model can really understand.
What is the genre?
What is the style of this music?
Um and then you put those things together, you have a lot of lyrics, you have a lot of styles, and we have our models that are trained to take in all of that information and just produce sound.
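The flow described above (prompt in, lyrics drafted by an LLM, style expanded into cues, everything handed to an audio model) can be sketched roughly as follows. This is a hypothetical outline in Python, not Suno's implementation: every name here (`SongRequest`, `write_lyrics`, `expand_style`, `build_request`) and the contents of the style table are invented for illustration, the LLM call is replaced by a placeholder string, and the final audio-generation step is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class SongRequest:
    """Everything the (not shown) audio model would be conditioned on."""
    lyrics: str
    style_cues: list[str]


# Made-up lookup table standing in for whatever expands a genre name into richer cues.
STYLE_EXPANSIONS = {
    "90s hip-hop": ["boom bap drums", "sampled loops", "rapped vocals"],
}


def write_lyrics(prompt: str) -> str:
    """Placeholder for an LLM call that drafts lyrics from the prompt's topic."""
    return f"[placeholder lyrics for: {prompt}]"


def expand_style(prompt: str) -> list[str]:
    """Turn any genre names found in the prompt into a longer list of style cues."""
    cues: list[str] = []
    for genre, extra_cues in STYLE_EXPANSIONS.items():
        if genre in prompt.lower():
            cues.append(genre)
            cues.extend(extra_cues)
    return cues


def build_request(prompt: str) -> SongRequest:
    """Combine lyrics and style cues; a real system would pass this on to the audio model."""
    return SongRequest(lyrics=write_lyrics(prompt), style_cues=expand_style(prompt))


request = build_request("upbeat 90s hip-hop track about a road trip")
print(request.style_cues)
print(request.lyrics)
```

Keeping lyrics and style as separate fields mirrors the point made in the conversation that the first draft may be wrong and the user can iterate on either part.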