Key architectural details
Mixture of Experts (MoE): 128 experts, with 4 active per token, enabling efficient scaling and specialization.
119B total parameters, with 6B active parameters per token (8B including embedding and output layers).
256k context window, supporting long-form interactions and document analysis.
Configurable reasoning effort: Toggle between fast, low-latency responses and deep, reasoning-intensive outputs.
Native multimodality: Accepts both text and image inputs, unlocking use cases from document parsing to visual analysis.
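As a sanity check on the numbers above, the ~6B-active figure is consistent with 4-of-128 expert routing plus a small shared (non-expert) portion. The split between shared and expert parameters below is solved for, not published; it's an illustrative back-of-envelope, not the model's actual layer breakdown.

```python
# Back-of-envelope: reconcile 119B total params with ~6B active per token
# in a 128-expert MoE routing 4 experts per token.
# The shared/expert split is an assumption solved from these two numbers.

TOTAL_PARAMS = 119e9
NUM_EXPERTS = 128
ACTIVE_EXPERTS = 4
ACTIVE_PARAMS = 6e9  # active per token, excluding embeddings/output

frac = ACTIVE_EXPERTS / NUM_EXPERTS  # fraction of expert params used per token

# active = shared + frac * (total - shared)  ->  solve for shared
shared = (ACTIVE_PARAMS - frac * TOTAL_PARAMS) / (1 - frac)
expert_total = TOTAL_PARAMS - shared
active = shared + frac * expert_total

print(f"shared ≈ {shared/1e9:.2f}B, experts ≈ {expert_total/1e9:.1f}B, "
      f"active ≈ {active/1e9:.1f}B per token")
```

Under these assumptions only a couple of billion parameters are shared; the rest sit in expert FFNs, of which 1/32 are touched per token.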



At this point, these small models should ship explicit minimum hardware requirements just so they can stand out: an STM32 with xxGB of PSRAM; an Android phone with this much RAM, this many TOPS, and a minimum OS version; ESP32-S3 or S4? That sort of thing.
If you just say ‘small,’ you get lost in the noise.
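The minimum-RAM half of such a spec sheet is straightforward to derive from parameter count and quantization width. A hedged sketch, counting weights only and ignoring KV cache and runtime overhead:

```python
# Back-of-envelope minimum memory to hold a model's weights on-device.
# Weights only: KV cache, activations, and runtime overhead are ignored,
# so real requirements are somewhat higher.

def min_weight_ram_gb(params_billions: float, bits_per_weight: int) -> float:
    """GB needed just to store the weights at a given quantization width."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 119B total parameters at common quantization widths:
for bits in (16, 8, 4):
    gb = min_weight_ram_gb(119, bits)
    print(f"{bits}-bit: ~{gb:.0f} GB for weights alone")
```

Even at 4-bit, a 119B-parameter model needs on the order of 60 GB for weights, which is why "small" here only makes sense relative to workstation-class hardware, not phones or microcontrollers.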
tbh that's the main thing I took away from this; since when did small equal 119B?!
Does that mean they've got large models lined up approaching 1T parameters?
Cloud-based LLMs have been commoditized. Lots of options.
There’s room for someone to lead the local on-device space. Anything from a high-end workstation (Apple Studio, Nvidia DGX Spark, AMD Strix) to laptop (MBPro, Windows AI) down to embedded (Qualcomm, STM32) and ultra-small (ESP32, ARM/RISC).
Lots of room there and no clear winners. Mistral, at this point, could focus on the other tiers, make a name for itself, and carve out a lot of mindshare.