Every
Why Opus 4.8 Pulled Me Back to Claude
It's model release day.
今天模型发布了。
Opus 4.8 drops today.
Opus 4.8 今天上线。
But honestly, they could have called it Opus 5, because this is a really great model.
说实话,他们完全可以叫它 Opus 5,这个模型真的很强。
Anthropic, I know you're trying to underpromise, but you are overdelivering.
Anthropic,我知道你们在刻意低调,但你们是真的超额兑现了。
We have been testing it internally for about a week here at Every.
我们在 Every 内部已经测试了大约一周。
And here is your day zero vibe check.
这里是你们的发布日即时测评。
Before we get into it, what is Every?
进入正题之前,先说说 Every 是什么。
Every is the only subscription you need to stay at the edge of AI.
Every 是你紧跟 AI 前沿唯一需要的订阅。
You can kind of think of us like an applied AI lab for the future of work.
可以把我们理解成一个专注于未来工作的应用型 AI 实验室。
We're about 30 people.
我们大概有 30 个人。
We're all early adopters of these tools.
都是这些工具的早期用户。
We write about all the new models and all the ways that we use them for coding, writing, design, company building, and more.
我们写关于所有新模型的文章,以及如何用它们做编程、写作、设计、公司建设等等。
We have a suite of products that we build for ourselves to help us work better with AI.
我们自己开发了一套产品,帮我们更好地和 AI 协作。
And we also do a lot of training and courses.
我们也做很多培训和课程。
It's all available for one subscription on the Every website, every.to.
这一切都在 every.to 上,一个订阅全部搞定。
And if you want to read the in-depth written version of this video, we also publish a written vibe check on Every as soon as the model drops.
如果你想读这个视频的深度文字版,我们也会在模型发布的第一时间在 Every 上发布文字版即时测评。
So make sure you go there to subscribe and read it.
记得去那里订阅并阅读。
Okay, let's get into it.
好,我们开始。
So the headline, to me, is that Anthropic is back.
核心结论是,Anthropic 回来了。
Opus 4.7 wasn't that great of a model.
Opus 4.7 不是一个很好的模型。
Yes, benchmark improvements, but not that usable, pretty slow, hard to love.
基准有所提升,但不那么好用,相当慢,很难让人喜欢。
And what I found for myself is that I was pretty much using only Codex and GPT-5.5 for almost everything for the last month or two.
我自己的情况是,过去一两个月几乎所有事情都只用 Codex 和 GPT-5.5。
And even internally at Every, we have a lot of diehard Claude stans.
就连我们 Every 内部也有很多铁杆 Claude 粉丝。
We've been really going hard with Claude for the last year or so.
过去一年左右我们一直非常重度使用 Claude。
And you could kind of feel that even the diehard Claude stans were like, you know, Codex is pretty good.
但你能感觉到,就连那些铁杆 Claude 粉丝也开始说,Codex 挺不错的。
I'm actually starting to use GPT-5.5 for some of my writing in ways that I never had before, or for some of my coding and stuff like that.
我现在真的开始用 GPT-5.5 做一些写作,这是我以前从来没有过的,还有一些编程之类的工作。
I think that's especially because the Codex desktop app is just so clean and fast and feels like the future, versus Opus 4.7 itself being a slow, hard-to-use model, and then its harness, the Claude desktop app.
我觉得尤其是 Codex 桌面端简洁、快速、感觉像是未来,而 Opus 4.7 本身又慢又难用,加上它的端,就是 Claude 桌面端。
It's just kind of messy.
实在比较乱。
Anyway, vibes were honestly bad for Anthropic for a little while, and this model is just a legitimately great model.
总之,Anthropic 的口碑之前确实不太好,但这个模型是真正意义上的强作。
It is the top of the pack for us in terms of our benchmarks.
在我们的基准测试中,排在所有模型之首。
So we have a senior engineer benchmark, which measures how well these models do at senior-engineer-like tasks.
我们有一个资深工程师基准测试,衡量这些模型在资深工程师类型任务上的表现。
Opus 4.8 scores a 63 on the benchmark, which is about 30 points higher than Opus 4.7 and just a hair, one point, higher than GPT-5.5.
Opus 4.8 在这个基准测试中得了 63 分,比 Opus 4.7 高大约 30 分,比 GPT-5.5 高一分。
So it's very similar to GPT-5.5 on senior-engineer-like tasks, maybe a little bit better depending on how you score the benchmark.
总体来说和 GPT-5.5 非常接近,在资深工程师类任务上,具体差异取决于你的评分方式。
It's really good for writing.
写作非常强。
It's expressive.
很有表现力。
It doesn't have a lot of AI tells, especially on the higher reasoning settings, and we'll get into that in a bit.
AI 痕迹不多,尤其是在更高的推理设置下,这个我们之后会细说。
And it's really good at knowledge work.
知识工作方面也很出色。
Like, it did this slide deck.
比如,它做了一个 PPT。
One of the things we always test is how well it does on knowledge work tasks, and one of those tasks is how well it makes a presentation. So it made a slide deck.
我们每次都会测试它在知识工作任务上的表现,其中一项就是让它做一个演示文稿,它做了一个 PPT。
We'll link to it on YouTube, and maybe I'll be able to put a little screen share up here, but it made a slide deck explaining a topic.
我们会在 YouTube 放链接,也许我能在这里放个录屏,它做了一个 PPT 来讲解一个主题。
Our engineering philosophy is compound engineering.
我们的工程哲学叫 compound engineering。
So it just made a beginner slide deck for compound engineering and it's really good.
它就做了一个 compound engineering 的入门 PPT,做得真的很好。
I think a lot of these decks, when they're automatically generated, feel kind of thin, and this one just had depth.
我觉得很多自动生成的 PPT 感觉比较空洞,但这个有深度。
Everything was pretty well styled.
整体设计感也很好。
It was just like a great first pass at a deck, which is the first time I've really seen that.
就是一个很好的初稿,这是我第一次真正看到这种效果。
And it's actually just really hard to make a model that improves that much on all these different dimensions at once.
而且,能让一个模型同时在这么多维度上都提升这么多,其实是非常难的。
I think we usually see the labs swinging back and forth like a pendulum.
我觉得我们通常看到各家实验室像钟摆一样来回摆。
One release is too cautious, and the next release just goes off and does tons of stuff without you asking.
一个版本太保守,下一个版本又太激进,没让你要求就自顾自做了一堆事情。
With this model, they just seem to have gotten something really right.
这个模型感觉他们真的把某些东西做对了。
It just feels good.
就是感觉很好。
Kieran Klaassen, who's the GM of Kora and was one of the internal Every testers on this, said that to him it's the most human model that he's worked with.
Kieran Klaassen 是 Kora 的 GM,也是这次 Every 内测者之一,他说对他来说,这是他用过的最像人类的模型。
That's why I think they could have called this Opus 5, and we would have been happy.
所以我觉得他们完全可以叫这个 Opus 5,我们也会很满意。
There are some catches though.
不过有一些短板。
There are some things this model doesn't do well.
有些事情这个模型确实做得不好。
The first one is that it's very sensitive to the reasoning setting.
第一个是,它对推理设置非常敏感。
We got really great performance on extra high reasoning both for writing and for coding and less good performance on high and medium.
极高推理档在写作和编程上都表现极好,但高和中档的表现要差一些。
So, as you're testing it, especially on your hardest programming challenges and for really important writing, I highly recommend the high and extra high settings. It makes a big difference.
所以,在测试的时候,特别是面对最难的编程挑战和重要写作任务,强烈推荐用高和极高推理档,差别很大。
The second thing is it is still not really my daily driver.
第二个是,它还不是我真正的日常主力。
And that's only because the Codex app is just so much better than the Claude app.
原因只有一个,就是 Codex 端比 Claude 端好太多了。
We're entering this world where the harness matters as much as the model does.
我们正在进入一个端和模型本身同样重要的时代。
And the Claude app has the scars of the history of how Anthropic got here.
Claude 端带着 Anthropic 走到今天的历史痕迹。
You know, it's got the chat tab and the code tab and the Cowork tab, and it's like they're kind of shipping their org chart.
聊天标签、代码标签、协作标签,每个标签感觉都在把他们的组织架构对外展示。
Each tab is run by a different team, and you can just kind of feel it whenever you get in there.
每个标签由不同团队负责,每次进去都能感觉到。
I don't know which tab to go into, and Codex is just so fast and simple. It works really well, and it has a couple of other bells and whistles, like the in-app browser working really well, that just change the game for knowledge work.
我进去就不知道该点哪个标签,而 Codex 就是快、简洁、用着顺,还有几个加分项,比如应用内浏览器体验很好,在知识工作上真的改变了游戏规则。
So I'm still in Codex all day.
所以我还是一整天都在用 Codex。
However, I'm now flipping back and forth between the Claude app and the Codex app in a way that I wasn't before, just because this model is so good.
不过,我现在会在 Claude 端和 Codex 端之间来回切换,以前没有这样,就是因为这个模型太好了。
So, let's get into some of the details.
好,我们来看一些细节。
Okay, so first thing that we always do is a reach test.
好,我们每次都会先做触达测试。
And the reach test is our simplest measure of how good a model is, which is: do you reach for it?
触达测试是我们衡量一个模型好坏最简单的方式,就是你会主动打开它吗?
And if you do reach for it, in what situations do you reach for it?
如果你会,在什么情况下会?
We have three reach test participants today.
今天有三个触达测试参与者。
We have three reach test ratings on this model from our team.
我们团队对这个模型有三份触达测试评分。
One's from me, one's from Kieran Klaassen, who I mentioned earlier, the GM of Kora, and one's from Katie Parrot, who's a senior staff writer.
一份来自我,一份来自我之前提到的 Kieran Klaassen,Kora 的 GM,一份来自 Katie Parrot,她是资深撰稿人。
On the reach test, this is clearly an S-tier, paradigm-shifting model.
触达测试上,这明显是一个 S 级范式转移级别的模型。
So, I'm gold, but I'm also kind of gold/green, and that's just because the harness isn't that good.
所以我给的是 gold,但同时也是 gold/green,因为端不够好。
Kieran is a straight gold, paradigm shift, which is very rare.
Kieran 给的是纯 gold 范式转移,非常罕见。
I've got to say, it's very rare to give a paradigm shift grade to a model.
我必须说,给一个模型范式转移评级是非常罕见的。
So, I would pay attention to this.
所以这值得认真对待。
Kieran's a gold and Katie is a green.
Kieran 给 gold,Katie 给 green。
So if you're doing the kind of work that I'm doing, a lot of CEO work, which is work across all sorts of different things like coding and writing and decision-making, it's a paradigm shift model wrapped in an okay-ish to pretty good harness.
如果你做的是我们这类工作,我做的是很多 CEO 工作,涵盖编程、写作、决策等各种事情,这是一个范式转移级别的模型,套在一个还算不错的端里面。
So that puts it somewhere between paradigm shift and green.
所以综合下来是范式转移降到 green。
For someone like Kieran, who's coding all day and running 50 agents at once, it's a paradigm shift.
对于像 Kieran 这样整天写代码、同时跑 50 个 agent 的人来说,就是范式转移。
Kieran is also historically the biggest Claude stan on the team.
Kieran 历来也是团队里最铁的 Claude 粉丝。
So if you are a Claude person, you're going to love this model.
所以如果你是 Claude 的人,你会爱上这个模型。
And Katie is a green. Katie's using it mostly for writing and knowledge work, and she's also historically a big Claude fan, especially for writing.
Katie 给 green,她主要用于写作和知识工作,历来也是 Claude 的忠实用户,尤其在写作上。
I think she's currently going back and forth between this model and Codex for most of her work.
她目前大部分工作在这个模型和 Codex 之间来回切换。
Now let's get into some of the more detailed benchmarks.
现在来看更详细的基准数据。
On coding, it is a powerhouse at extra high reasoning.
编程方面,在极高推理档上是绝对的强者。
It got a 63 on our senior engineer benchmark, which is about 30 points higher than Opus 4.7, and it's just a nose higher than GPT-5.5.
资深工程师基准测试得了 63 分,比 Opus 4.7 高 30 分,比 GPT-5.5 略高一点。
I have the senior engineer benchmark.
我来介绍一下资深工程师基准测试。
What it does is give the model a vibe-coded codebase.
它给模型一个 vibe-coded 的代码库。
It says: this is vibe-coded slop.
说:这是 vibe-coded 出来的,
Can you please rewrite it from first principles?
能不能从头原则性地重写一遍?
And I actually have two human engineers who have done the rewrite themselves.
我实际上还找了两位人类工程师亲自做了这个重写任务。
So I can compare what the models do to what a human senior engineer would do.
这样我就能把模型的产出和人类资深工程师的做法对比。
And human senior engineers usually score in the 80s or 90s.
人类资深工程师通常得分在 80 到 90 分段。
Opus 4.8 is at a 63.
Opus 4.8 是 63。
GPT-5.5 is at a 62.
GPT-5.5 是 62。
It's really good.
真的很好。
It's very close on a task like this.
在这类任务上非常接近。
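To make the score comparison concrete, here is a minimal Python sketch that tabulates the numbers quoted in this segment. It is an illustration only, not Every's actual benchmark code. The Opus 4.7 value is inferred from "about 30 points higher" and is an assumption, as is treating "80s or 90s" as a floor of 80; the names `scores`, `HUMAN_FLOOR`, `gap_to_humans`, and `summarize` are hypothetical.

```python
# Scores as quoted in the transcript. The Opus 4.7 entry is NOT a reported
# number: it is inferred from "about 30 points higher" (assumption).
scores = {
    "Opus 4.8": 63,
    "GPT-5.5": 62,
    "Opus 4.7 (inferred)": 63 - 30,
}

# "Human senior engineers usually score in the 80s or 90s":
# 80 is used here as the low end of that range (assumption).
HUMAN_FLOOR = 80


def gap_to_humans(score, human_floor=HUMAN_FLOOR):
    """Points between a model score and the low end of the human range."""
    return human_floor - score


def summarize(score_table):
    """Return (name, score, gap) tuples, highest score first."""
    ranked = sorted(score_table.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, s, gap_to_humans(s)) for name, s in ranked]


for name, s, gap in summarize(scores):
    print(f"{name}: {s} (at least {gap} points below the human range)")
```

Read this way, the one-point lead over GPT-5.5 is small next to the gap of roughly 17 points or more that both models still have to the human engineers.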