Got the pointer to this from Allison Parrish who says it better than I could:

it’s a very compelling paper, with a super clever methodology, and (i’m paraphrasing/extrapolating) shows that “alignment” strategies like RLHF only work to ensure that it never seems like a white person is saying something overtly racist, rather than addressing the actual prejudice baked into the model.

  • L0rdMathias@sh.itjust.works
    3 months ago

    Interesting results, interesting insights, neat takeaways and argumentation.

    It’s unfortunate that they only tested models trained on SAE and had no control group of language models trained on other dialects. That seems like a huge oversight.

    I wonder how this would play out with a model trained on AAE or another non-SAE dialect, or even one trained in English but optimized for a non-English language.