70 percent became 23 percent overnight.
That is not a model getting worse. That is a benchmark becoming honest.
OpenAI retired SWE-bench Verified this month. The reason: scores were inflated by training data contamination. Models were, in effect, learning the answers. When SWE-bench Pro replaced it with harder problems that could not have been memorised from training data, frontier models dropped from 70-plus percent to around 23 percent on coding tasks.
A 47-point gap. That is not noise. That is the distance between what the industry was selling and what the industry was delivering.
There is a concept called Goodhart's Law. When a measure becomes a target, it ceases to be a good measure. It was originally an observation about economics. It has turned out to be almost perfectly predictive of how AI benchmarks behave.
I have been thinking about this a lot because I spend a meaningful part of my time either building AI tools or helping people evaluate them. The benchmark problem is not abstract for me. It is the specific moment when a vendor sends a deck with a 72 percent coding score and I have to decide what that number actually means.
What the collapse of SWE-bench tells me is that for the last two years, a significant portion of those numbers meant very little. Not because vendors were lying, exactly, but because everyone was measuring the thing that was easiest to measure and calling it signal.
The honest version of AI evaluation is harder. It requires building your own test set, against your actual use case, with tasks the model has never encountered. Most companies do not have the time or the technical depth to do that. Most vendors know this.
So the gap between 70 and 23 percent is not just a story about one benchmark. It is a story about how difficult it actually is to know whether a tool works before you have spent six months and a significant budget discovering it does not.
Is SWE-bench Pro perfect? Most definitely not. But the retirement of SWE-bench Verified might turn out to be a useful moment. The industry acknowledging, out loud, that the number it had been waving around was inflated. That is not nothing. That is a small step towards evaluations that are actually grounded in what models do in practice, not what they can pattern-match from training data.
My instinct for now: run your own small evaluation before you commit. Accept that benchmarks are a starting point, not an answer. Give more weight to people who have used a tool in production than to the score in the vendor deck.
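To make that concrete, here is a minimal sketch of what "your own small evaluation" can look like: a handful of held-out tasks drawn from your real workload, each with a hand-written pass/fail check. Everything here is a placeholder, not a framework. The task file `my_tasks.jsonl`, its field names, and the `call_model` stub are assumptions you would swap for whatever API and cases you actually use.

```python
# Minimal private-eval harness: a sketch, not a product.
# Assumes a JSONL file of held-out tasks you wrote yourself, e.g.
#   {"prompt": "Write a function add(a, b) ...", "check": "assert add(2, 3) == 5"}

import json


def call_model(prompt: str) -> str:
    """Placeholder: wire this to the vendor API you are testing."""
    raise NotImplementedError("swap in the actual API call")


def run_check(code: str, check: str) -> bool:
    """Run the model's code plus your assertion in a throwaway namespace.
    Fine for a quick private eval; use a sandbox for untrusted output."""
    scope: dict = {}
    try:
        exec(code, scope)   # define whatever the model wrote
        exec(check, scope)  # your hand-written assertion
        return True
    except Exception:
        return False


def evaluate(path: str) -> None:
    results = []
    with open(path) as f:
        for line in f:
            task = json.loads(line)
            output = call_model(task["prompt"])
            results.append(run_check(output, task["check"]))
    if not results:
        print("no tasks found")
        return
    passed = sum(results)
    print(f"{passed}/{len(results)} tasks passed ({passed / len(results):.0%})")


if __name__ == "__main__":
    evaluate("my_tasks.jsonl")
```

Twenty or thirty tasks like this, pulled from your own tickets, will tell you more than any leaderboard number, precisely because the model cannot have memorised them.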
Does anyone have a reliable framework for evaluating AI tools outside of the standard benchmarks? I am genuinely curious what people have built for this.