Gemini seem to have "solved" my duck river crossing, lol.

diz@awful.systems · 6 months ago

Gemini seem to have "solved" my duck river crossing, lol.

froztbyte@awful.systems · 6 months ago

(excuse possible incoherence it’s 01:20 and I’m entirely in filmbrain (I’ll revise/edit/answer questions in morning))

re (1): while that is a possibility, keep in mind that all this shit also operates/exists in a metrics-as-targets obsessed space. they might not present end user with hit% but the number exists, and I have no reason to believe that isn’t being tracked. combine that with social effects (public humiliation of their Shiny New Model, monitoring usage in public, etc etc) - that’s where my thesis of directed prompt-improvement is grounded

while they could do something like that (synthetic derivation, etc), I dunno if that’d be happening for this. this is outright a guess on my part, a reach based on character based on what I’ve seen from some the field, but just……I don’t think they’d try that hard. I think they might try some limited form of it, but only so much as can be backed up in relatively little time and thought. “only as far as you can stretch 3 sprints” type long

(the other big input in my guesstimation re (2) is an awareness of the fucked interplay of incentives and glorycoders and startup culture)

scruiser@awful.systems · 6 months ago

I don’t think they’d try that hard.

Wow lol… 2) was my guess at an easy/lazy/fast solution, and you think they are too lazy for even that? (I think a “proper” solution would involve substantial modifications/extensions to the standard LLM architecture, and I’ve seen academic papers with potential approaches, but none of the modelfarmers are actually seriously trying anything along those lines.)

froztbyte@awful.systems · edit-2 6 months ago

lol, yeah

“perverse incentives rule everything around me” is a big thing (observable) in “startup”[0] world because everything[1] is about speed/iteration. for example: why bother spending a few weeks working out a way to generate better training data for a niche kind of puzzle test if you can just code in “personality” and make the autoplag casinobot go “hah, I saw a puzzle almost like this just last week, let’s see if the same solution works…”

i.e. when faced with a choice of hard vs quick, cynically I’ll guess the latter in almost all cases. there are occasional exceptions, but none of the promptfondlers and modelfarmers are in that set imo

[0] - look, we may wish to argue about what having billions in vc funding categorizes a business as. but apparently “immature shitderpery” is still squarely “startup”

[1] - in the bayfucker playbook. I disagree.

diz@awful.systems · 6 months ago

I think they worked specifically on cheating the benchmarks, though. As well as popular puzzles like pre existing variants of the river crossing - it is a very large puzzle category, very popular, if the river crossing puzzle is not on the list I don’t know what would be.

Keep in mind that they are also true believers, too - they think that if they cram enough little pieces of logical reasoning, taken from puzzles, into the AI, then they will get robot god that will actually start coming up with new shit.

I very much doubt that there’s some general reasoning performance improvement that results in these older puzzle variants getting solved, while new ones that aren’t particularly more difficult, fail.