• TechLich@lemmy.world
    6 days ago

    Not entirely true. You don’t need your own personal data centre; you can rent GPU cloud instances for a lot of that stuff. It’s expensive, but not so expensive that only a huge tech company could afford it (thousands of dollars, not billions). Anyone with a credit card and some cash to burn can do it. Also, you don’t need to train a model from scratch; you can build on existing models that others have published to cut down on training.

    However, to impersonate someone’s voice you don’t need any of that. You only need about 5-10 seconds of audio for a zero-shot impersonation with a pre-trained model, or a minute or so for few-shot. This runs on consumer hardware, and in some cases even in real time.

    Even to build your own model from scratch for high-quality voice audio, you don’t need a huge amount of initial training data. Something like XTTS was trained with about 10-15K hours of English audio, which is actually pretty easy to come by in the public domain. There are a lot of open and public research datasets specifically for this kind of thing, no copyright infringement necessary. If a big tech company wants more audio data than what’s publicly available, they just pay people to record it. There’s no need to steal it or risk copyright claims and breaking surveillance laws; they have the budget to exploit people to record whatever they want.

    This tech wasn’t invented by some evil giant tech company stealing everybody’s data; it was mostly geeky computer scientists presenting things at speech synthesis conferences. That’s not to say there aren’t a bunch of huge evil tech companies profiting from or contributing to this kind of tech. But in the context of audio deepfakes being accessible to scammers, it’s not on them, and I don’t think some kind of extra copyright regulation on data centres would do anything about it.

    The current industry leader among companies trying to monetize speech synthesis is ElevenLabs, which is a private start-up with only a few dozen employees.

    The current tech is not perfect, but it’s definitely good enough to fool someone who isn’t thinking too hard over a noisy phone call, and a scammer doesn’t need server time or access to a data centre to do it.