As a specific example, the ESP32 chip does low-power voice recognition for pre-trained trigger words. This lightweight recognition lacks the training to detect anything other than the list of trigger words that Espressif provides.
Basically only battery-operated devices work this way (for power consumption reasons). If you’re plugged in you’re probably always running the high quality listening loop.
This is also why a lot of the wake words are similar:
Hey Siri
Alexa / Echo
OK Google / Hey Google
Those all have different vowel sounds, hard consonants etc. because without that there’s not enough difference to make a unique wake word/phrase. Google needed something like “Hey” or “OK” before it because “Google” itself doesn’t generate enough unique sounds to act as a keyword. They’re also roughly 3 to 5 syllables because they need to be short enough to monitor for them, and long enough that they can be distinguished reliably from background noise.
The sounds are converted into MFCCs (mel-frequency cepstral coefficients), which is sort of an extremely lossy form of compression. It was originally used to identify numbers, like when someone would call into an automated switchboard and they’d have to say “one” or “five”. It couldn’t identify complex words, just distinguish between a small set of very different sounding numbers.
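To give a feel for how lossy that is, here's a rough pure-Python sketch of the usual textbook MFCC steps (window, power spectrum, mel filterbank, log, DCT) on a single frame. The frame length, sample rate, filter count and function names are all my own assumptions for illustration, not what Espressif or anyone else actually ships.

```python
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum of one Hamming-windowed frame (bins 0..N/2)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        spectrum.append((re * re + im * im) / n)
    return spectrum

def mel_filterbank(num_filters, frame_len, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    num_bins = frame_len // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2)
    # Filter edges, converted from mel to (fractional) DFT bin positions.
    edges = [mel_to_hz(max_mel * i / (num_filters + 1)) * frame_len / sample_rate
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(num_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc_frame(frame, sample_rate, num_filters=20, num_coeffs=13):
    """MFCCs for one frame: log mel-band energies followed by an unnormalised DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                    for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for c in range(num_coeffs)]

# Assumed test signal: a 440 Hz tone sampled at 8 kHz, one 256-sample frame.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(256)]
coeffs = mfcc_frame(frame, SAMPLE_RATE)
print(len(frame), "samples ->", len(coeffs), "coefficients")
```

256 samples go in and 13 numbers come out per frame, which is why you can tell a handful of very different sounds apart but can't reconstruct the audio from it.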
The way these systems work is that they’re running a very low-power loop that converts ambient sound into these patterns and checks for a match against a wake-word pattern. The sound is converted into basically a time vs. frequency matrix and matched against the keyword / phrase. If there’s a match it unlocks the much more computationally expensive voice transcription programs, otherwise it just throws out the data.
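Very roughly, that loop looks something like the toy sketch below. Real detectors use a small trained model rather than a stored template and cosine similarity, so treat the matching step, the frame size, the band choice and the threshold as stand-ins I made up to keep it short. The "audio" is synthetic tones, not speech.

```python
import math

FRAME = 64               # samples per frame (assumed)
BANDS = (2, 4, 8, 16)    # DFT bins kept per frame (assumed)

def tone(bin_index, frames):
    """Synthetic test audio: a pure tone sitting exactly on one DFT bin."""
    return [math.sin(2 * math.pi * bin_index * i / FRAME)
            for i in range(FRAME * frames)]

def frame_features(frame):
    """Log energy in each kept frequency band for one frame."""
    n = len(frame)
    feats = []
    for k in BANDS:
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(frame))
        im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(frame))
        feats.append(math.log1p(re * re + im * im))
    return feats

def spectrogram(samples):
    """Time-vs-frequency matrix: one row of band energies per frame."""
    return [frame_features(samples[i:i + FRAME])
            for i in range(0, len(samples) - FRAME + 1, FRAME)]

def similarity(a, b):
    """Cosine similarity between two equally sized matrices."""
    fa = [x for row in a for x in row]
    fb = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = math.sqrt(sum(x * x for x in fa)) * math.sqrt(sum(y * y for y in fb))
    return dot / norm if norm else 0.0

def listen(windows, template, threshold=0.95):
    """Yield the index of each window that matches; everything else is dropped."""
    for idx, window in enumerate(windows):
        if similarity(spectrogram(window), template) >= threshold:
            yield idx   # the expensive transcription stage would start here
        # non-matching audio is not stored or forwarded anywhere

# The "wake word" is two tones in sequence; the other windows are noise and silence.
keyword = tone(4, 2) + tone(16, 2)
template = spectrogram(keyword)
windows = [tone(2, 2) + tone(8, 2), keyword, [0.0] * (4 * FRAME)]
matches = list(listen(windows, template))
print(matches)  # [1]
```

The point is the shape of it: cheap features, a cheap comparison, and nothing kept unless the comparison passes.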
You can tell that at least mobile devices aren’t always listening because if they were actually doing full-on voice transcription all the time, the battery would drain much faster. If they were doing off-device voice transcription, the antenna would have to stay on a lot more, which would also kill the battery, and it would be visible in your bandwidth bill.
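You can put rough numbers on the bandwidth part. The figures below are my assumptions: 16 kHz, 16-bit mono is a common format for speech, and 16 kbit/s is a plausible bitrate for a compressed speech stream. The arithmetic is just rate times seconds in a day.

```python
SAMPLE_RATE_HZ = 16_000        # assumed speech sample rate
BITS_PER_SAMPLE = 16           # assumed sample width, mono
SECONDS_PER_DAY = 24 * 60 * 60

def daily_megabytes(bits_per_second):
    """Data volume in MB (10^6 bytes) for a stream running 24 hours."""
    return bits_per_second / 8 * SECONDS_PER_DAY / 1_000_000

raw_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE   # uncompressed PCM
compressed_bps = 16_000                      # assumed codec bitrate

raw_mb = daily_megabytes(raw_bps)
compressed_mb = daily_megabytes(compressed_bps)

print(f"uncompressed: {raw_mb:.1f} MB/day")      # 2764.8 MB/day
print(f"compressed:   {compressed_mb:.1f} MB/day")  # 172.8 MB/day
print(f"compressed over 30 days: {compressed_mb * 30 / 1000:.2f} GB")  # 5.18 GB
```

Even the compressed case is around 5 GB a month of upload per device under those assumptions, which is hard to hide on a metered mobile plan.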
People need some more basic computer literacy. I get that the FAANG companies are “evil”, and want to do unscrupulous things with your data, but there’s often a simpler explanation that doesn’t involve massive privacy violations that security researchers would have caught long ago.
Even in the first scenario, what stops there from being multiple wake words with different functionality? So like “ok google” wakes up the bot but “pepsi” wakes it silently and has it tick a box on the back end of a server that now sends me Coke ads because they paid about $3.50 for the privilege?
Pretty much nothing. Any company with the resources of Google or Amazon could easily have their top 100 wake words trained into that model.
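Mechanically it would just be a lookup from detected keyword to action, something like this purely hypothetical sketch. The keyword labels, action names and `on_keyword` function are invented to illustrate the scenario in the question. None of it describes any real product.

```python
events = []   # stands in for whatever the device would actually do

# Hypothetical keyword -> action table; one entry responds audibly, one doesn't.
ACTIONS = {
    "ok google": lambda: events.append("start_assistant"),
    "pepsi": lambda: events.append("silent_ad_flag"),
}

def on_keyword(label):
    """Run the action for a detected keyword; return False if it isn't a known one."""
    action = ACTIONS.get(label)
    if action is None:
        return False
    action()
    return True

on_keyword("ok google")
on_keyword("pepsi")
on_keyword("hello")      # not in the table, so nothing happens
print(events)            # ['start_assistant', 'silent_ad_flag']
```

The constraint is still the size of the on-device model: every extra keyword has to fit in the same low-power detector.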