I love to show that kind of shit to AI boosters. (In case you’re wondering, the numbers were chosen randomly and the answer is incorrect.)
They go waaa waaa it’s not a calculator, and then I can point out that it got the leading six digits and the last digit correct, which is a lot better than it did on the “softer” parts of the test.
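That failure pattern isn’t mysterious: the last digit of a product depends only on the operands’ last digits, and the leading digits follow from the rough magnitude, so it’s the middle that gets garbled. A minimal sketch of the digit-by-digit check — the original operands and the model’s answer weren’t given, so the numbers below are entirely hypothetical:

```python
def digit_agreement(claimed: int, actual: int) -> list[bool]:
    """Right-align both numbers and report, per position, whether digits match."""
    a, b = str(claimed), str(actual)
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    return [x == y for x, y in zip(a, b)]

# Hypothetical operands -- NOT the ones from the original test.
x, y = 48_271, 93_517
true_product = x * y  # 4_514_159_107

# A hypothetical near-miss of the kind described: leading digits right
# (rough magnitude), last digit right (1 * 7 ends in 7), middle garbled.
claimed = 4_514_862_307

print(digit_agreement(claimed, true_product))
# [True, True, True, True, False, False, False, False, True, True]
```

The head and tail match while the middle doesn’t, which is exactly the signature of estimating a product rather than computing it.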
Depending on the task it can significantly improve the quality of the output, but it doesn’t help with everything. It’s more useful for problems that have to be reasoned about over multiple iterations, not ones with a direct answer.
Except not really, because even if “stuff that has to be reasoned about in multiple iterations” were a distinct category of problems, reasoning models by all accounts hallucinate a whole bunch more.