• panda_abyss@lemmy.ca
    link
    fedilink
    arrow-up
    7
    ·
    3 days ago

    This.

    They’re incredibly useful, but you have to treat their output as disposable and untrustworthy. They’re reinforcement trained to generate a solution, regardless of if it’s right, because it’s impossible to AI evaluate that these solutions are correct at scale.

    If you’re writing some core code: you can use an agent to review it, refactor parts, stump the original version, infill methods, and to run your test/benchmark scripts.

    but you still have to manage it, edit it, make sure it’s not recreating the same code in 6 existing modules, generating faked tests, etc.


    As an example this week on my side project I had Claude Opus write some benchmarks. Total throwaway code.

    It actually took my input files, generated a static binary payload from it using numpy, and loaded that into my app’s memory (on its own that’s really cool), then it ran my one function and declared the whole system 100x faster than comparable libraries that parse the original data. Not a fair test at all, nor was it a useful test.

    You cannot trust this software.

    You’ll see these games metrics, gamed tests, duplicate parallel implementations, etc.

    • baines@lemmy.cafe
      link
      fedilink
      English
      arrow-up
      9
      ·
      3 days ago

      spend more time fixing slop compared to just doing it manually and correct the first time