• Fubarberry@sopuli.xyz
    English
    arrow-up
    31
    ·
    9 months ago

    China has a huge advantage in AI models because of how lax it is on intellectual property rights. US companies are fighting over API licensing costs, while China is just going to scrape everything and use it for free.

    The US has a lead now, but I don’t think it can maintain it without giving up on ethical training. Then again, it may not matter whether the US models are ethical if everyone eventually just uses the superior, unethically trained Chinese models instead.

      • Redex@lemmy.world
        English
        arrow-up
        10
        ·
        9 months ago

        I mean, they are right. Set aside the questions of whether we can even make meaningfully better models just by using LLMs and more data, what the future of AI will look like, and whether it’s ethical to steal the data. It is quite possible that OpenAI and the like will get into legal trouble because of the methods they use for acquiring data, while Chinese companies won’t have to worry about that. If more data = better models, then China has an obvious advantage.

        • just_an_average_joe@lemmy.dbzer0.com
          English
          arrow-up
          7
          ·
          9 months ago

          OpenAI and the like aren’t going to get into trouble anytime soon. They already provide their latest tech to the US government and military. OpenAI is like the goose that laid the golden egg; they would need to fuck up really, really badly to face any consequences.

    • just_an_average_joe@lemmy.dbzer0.com
      English
      arrow-up
      11
      ·
      9 months ago

      The US companies already scraped the data while they could. If anything, data scraping is far, far more difficult now for everyone, for technical reasons.

      Most of the new models are trained on synthetic data, on higher-quality data, or with RLHF. The reason DeepSeek is able to perform is likely that LLMs are still very, very new, so there is a lot of low-hanging fruit. It’s no longer just about the data; we hit that limit quite some time ago.

      • Naia@lemmy.blahaj.zone
        English
        arrow-up
        1
        ·
        9 months ago

        Honestly, it was pretty obvious even from the beginning that scraped data was going to have a ton of issues. There’s too much nonsense out there, both from misinformation and from people who just aren’t able to communicate clearly.

        That’s before you get into the ethical aspects of stealing other people’s content and the way these things are being misused.