Most people pick their AI model based on a benchmark...

Nate Herk

But I pick mine based on feel.

That probably sounds backwards because we're trained to trust the numbers.

This model scores 90%, that one scores 80%, so the first one must be better... right?

Well, a story broke this week about SWE-Bench, the test that checks whether an AI can fix real problems in real software, like a human programmer would.

It's the score a lot of technical people have leaned on for over a year.

Turns out the models were cheating.

The test projects already had the correct answers sitting inside them. So instead of solving the problem, a model could peek at the solution and hand it back. Like taking an exam with the answer key taped inside the textbook.

On SWE-Bench, GPT-5.5 scored 58.6% and Gemini 3.5 Flash scored 55.1%.

Only 3.5 points apart? If you've ever used those two models, you know the math isn't "math-ing" there.

Then a new test showed up called DeepSWE. Same idea but they pulled the answers out, so the model has to actually figure it out.

On DeepSWE, GPT-5.5 scored 70%. Gemini 3.5 Flash scored 28%.

The two "tied" models weren't close at all. And that gap lines up with how different these tools actually feel to use.
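If you want to see the two gaps side by side, here's a quick sketch that uses only the scores quoted in this post. The `scores` layout and the `gap` helper are my own names, made up for illustration.

```python
# Scores quoted in this post, in percent.
scores = {
    "SWE-Bench": {"GPT-5.5": 58.6, "Gemini 3.5 Flash": 55.1},
    "DeepSWE": {"GPT-5.5": 70.0, "Gemini 3.5 Flash": 28.0},
}


def gap(benchmark: str) -> float:
    """Point gap between the two models on one benchmark."""
    s = scores[benchmark]
    # Round to one decimal so float noise doesn't show up in the result.
    return round(s["GPT-5.5"] - s["Gemini 3.5 Flash"], 1)


for name in scores:
    print(f"{name}: {gap(name)} points apart")
```

Same two models, but one test puts them a few points apart and the other puts them dozens of points apart.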

None of this makes benchmarks useless. They're fun to look at and they give you a rough starting point.

But remember who makes most of them. A big score is a marketing asset. It's the number on the launch tweet. The keynote slide. The headline.

So always take them with a grain of salt.

What I actually do is bounce between Opus and GPT all day. Not because one won a benchmark, but because I've built a feel for which one handles which kind of task.

For serious work right now, those two are the only horses I really trust in this race.

Building that feel isn't exciting. You take one task you actually need done, run it through different models/harnesses, and notice which one you trust with the result. Do that enough times and you stop reaching for the leaderboard.
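If you'd rather write it down than keep it in your head, a rough sketch of that habit could look like this. Everything here is an assumption for illustration: the log format, the task names, the sample entries, and the `trust_rates` and `pick` helpers. It doesn't call any model. It only tallies your own yes/no calls after you've read each result.

```python
from collections import defaultdict

# Hypothetical log: (task_type, model, trusted), where `trusted` is my own
# yes/no call after reading the result. These entries are made up.
log = [
    ("refactor", "Opus", True),
    ("refactor", "GPT", False),
    ("refactor", "Opus", True),
    ("data cleanup", "GPT", True),
    ("data cleanup", "Opus", False),
]


def trust_rates(entries):
    """Return {task_type: {model: share of runs I trusted}}."""
    # Each counter holds [trusted_count, total_count].
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for task_type, model, trusted in entries:
        counts[task_type][model][0] += int(trusted)
        counts[task_type][model][1] += 1
    return {
        task: {model: t / n for model, (t, n) in models.items()}
        for task, models in counts.items()
    }


def pick(entries, task_type):
    """Model with the highest trust rate for one task type."""
    rates = trust_rates(entries)[task_type]
    return max(rates, key=rates.get)
```

With a log like this, `pick(log, "refactor")` gives you the model you've trusted most often for that kind of work, which is really just the "feel" written down.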

→ A model that's perfect for someone else can be the wrong pick for you.

Maybe they do a different kind of work. Maybe their whole setup is different. The benchmark can't see any of that.

So when the next model tops the charts, try it.

Just pay more attention to how it feels on your real work than to where it landed on someone else's test.
