

I would give it credit for being better than the absolutely worthless approach of "scoring well on a bunch of multiple-choice tests". And it is possibly vaguely relevant to the pipe-dream end goal of outright replacing programmers. But overall, yeah, it is really arbitrary.
Also, programming is perceived as one of the more in-demand "potential" killer apps for LLMs, and it is also one of the applications for which it is relatively easy to churn out and verify synthetic training data (write really precise, detailed test cases, and you can then automatically verify attempted solutions and keep the ones that pass). So even if LLMs are genuinely improving at programming, that likely doesn't indicate a general improvement in capabilities.
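To make the "automatically verify" part concrete, here is a toy sketch of that filtering loop. All names (`verify`, `add`, the candidate strings) are invented for illustration, and a real pipeline would sandbox the untrusted model output rather than `exec` it directly:

```python
def verify(candidate_src: str, tests: list[tuple[tuple, int]]) -> bool:
    """Run a model-generated candidate against precise test cases.

    Illustration only: real pipelines isolate untrusted code instead
    of exec-ing it in-process.
    """
    ns: dict = {}
    try:
        exec(candidate_src, ns)
        fn = ns["add"]  # hypothetical target function name
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False

# Precise test cases written up front.
tests = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]

# Model-generated candidate solutions.
candidates = [
    "def add(a, b):\n    return a + b",  # correct: kept as training data
    "def add(a, b):\n    return a - b",  # wrong: filtered out
]

verified = [c for c in candidates if verify(c, tests)]
print(len(verified))  # only the correct candidate survives
```

The point is that the verifier is cheap and fully automatic, which is exactly why coding is an unusually easy domain to mass-produce training data for.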
They are going with the 50% success rate because the "time horizons" at anything remotely reasonable, like 99% or even just 95%, are still so tiny that they can't extrapolate a trend from them, and that tears a massive hole in their whole "AGI agents soon" scenario.
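A toy model shows how fast the horizon collapses as you demand reliability. Everything here is invented for illustration (the curve shape, the one-hour 50% horizon, the slope parameter): assume success probability on a task of length t follows a logistic-style curve p(t) = 1 / (1 + (t / h50)**beta), so the horizon at target reliability p is h(p) = h50 * ((1 - p) / p)**(1 / beta):

```python
def horizon(p: float, h50: float = 60.0, beta: float = 1.0) -> float:
    """Time horizon (minutes) at target success rate p, for an assumed
    50%-success horizon h50 and slope beta. Toy numbers, not real data."""
    return h50 * ((1 - p) / p) ** (1 / beta)

print(round(horizon(0.50), 1))  # 60.0 -> the headline "one hour" horizon
print(round(horizon(0.95), 1))  # 3.2  -> ~3 minutes at 95% reliability
print(round(horizon(0.99), 1))  # 0.6  -> under a minute at 99%
```

Under these made-up but not implausible parameters, the same model that "handles hour-long tasks" at a coin-flip success rate can only be trusted for sub-minute tasks at 99%, which is why the headline metric uses 50%.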