I’ve got a homelab setup that could benefit from low-power AI acceleration, which would let me run Whisper and distilled models locally and integrate them with my services, e.g. Home Assistant. Plus, the less data I send over my network the happier I’ll be.
I don’t really want to stuff a GPU into my system right now: I don’t have much power budget, and a GPU that’s actually useful gets pricey. I’ve seen a few examples of “edge accelerators” which boast a super tiny (2-5 W) power envelope and 40 TOPS, but that doesn’t tell me much about how well models will actually work in practice.
Is there any kind of mapping between TOPS and, say, tokens per second for X model? Maybe recommended TOPS for X model?


Roughly speaking, generating each token requires the computer to fetch and iterate over the entire model in memory, so memory bandwidth is usually the constraint. If you put a 10 GB model on a device whose memory bandwidth is 10 GB/s (number made up), it will take one second per token. If you have multiple compute cores, each perhaps with its own 10 GB/s memory bandwidth limit, then you can divide one second by the number of cores to get the time per token.
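If you want to plug in your own numbers, here is that back-of-the-envelope math as a quick Python sketch. The function names are made up for illustration, and it assumes the simple model above: every token reads the whole model once, bandwidth is the only bottleneck, and each core has its own independent bandwidth. Real hardware will usually do worse than this, so treat the result as a best case rather than a prediction.

```python
def seconds_per_token(model_gb: float, bandwidth_gb_s: float, cores: int = 1) -> float:
    """Best-case time per token, assuming memory bandwidth is the only limit.

    Assumption: each token requires one full pass over the model, and each
    core has its own independent bandwidth of `bandwidth_gb_s`.
    """
    if model_gb <= 0 or bandwidth_gb_s <= 0 or cores < 1:
        raise ValueError("model size and bandwidth must be positive, cores >= 1")
    return model_gb / (bandwidth_gb_s * cores)


def tokens_per_second(model_gb: float, bandwidth_gb_s: float, cores: int = 1) -> float:
    """Inverse of seconds_per_token: best-case generation rate."""
    return 1.0 / seconds_per_token(model_gb, bandwidth_gb_s, cores)


if __name__ == "__main__":
    # The made-up example from above: 10 GB model, 10 GB/s bandwidth.
    print(seconds_per_token(10, 10))           # 1.0 second per token
    # Same model split across 4 cores, each with its own 10 GB/s.
    print(seconds_per_token(10, 10, cores=4))  # 0.25 seconds per token
```

Note that TOPS doesn't appear anywhere in this estimate, which is why the spec sheet number alone doesn't tell you much about tokens per second.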
Idk why you would use a USB stick and not just run it in CPU/RAM on an ordinary computer. Small models are shit anyway though (even against the baseline of large/frontier hosted models being shit).
Laziness and the prospect of a cheap hack to avoid having to drag my server out of its confines to sort it. Saw an ad a while ago and have had the thought ever since!
Oh, and the Coral TPUs are at least M.2, but yeah I can see why USB dongles are just a meme…
At least try running a local model on your regular computer first to see whether you can deal with how shit they are. The quality of a model is roughly proportional to its size in memory (that’s why the memory chip market is fucked right now). Computation speed only controls how fast it generates tokens.