I love to show that kind of shit to AI boosters. (In case you’re wondering, the numbers were chosen randomly and the answer is incorrect).
They go "waaa waaa it's not a calculator," and then I can point out that it got the leading 6 digits and the last digit correct, which is a lot better than it did on the "softer" parts of the test.
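For the curious, that pattern isn't mysterious. Here's a quick sketch (with made-up stand-in numbers, since the original randomly chosen ones aren't reproduced here) of why those two parts are the easy ones: the leading digits fall out of a rough order-of-magnitude estimate, and the last digit depends only on the operands' last digits, no carries required. It's the long middle stretch that needs the full carry chain.

```python
# Hypothetical stand-in operands (the test's actual random numbers aren't shown).
a, b = 123456789, 987654321
exact = a * b

# Leading digits: a coarse estimate from operands rounded to a few
# significant figures already pins them down.
approx = round(a, -5) * round(b, -5)  # roughly 4 significant figures each

# Count how many leading digits the rough estimate shares with the exact product.
matched = 0
for x, y in zip(str(exact), str(approx)):
    if x != y:
        break
    matched += 1
print(matched)  # several leading digits agree despite the crude rounding

# Last digit: a one-step fact about the operands' last digits, no carries.
assert exact % 10 == (a % 10) * (b % 10) % 10
```

Everything between those two easy ends requires propagating carries across the whole multiplication, which is exactly the multi-step bookkeeping the model flubs.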
Except not really, because even if stuff that has to be reasoned about over multiple iterations were a distinct category of problems, reasoning models by all accounts hallucinate a whole bunch more.