Key architectural details

Mixture of Experts (MoE): 128 experts, with 4 active per token, enabling efficient scaling and specialization.

119B total parameters, with 6B active parameters per token (8B including embedding and output layers).

256k context window, supporting long-form interactions and document analysis.

Configurable reasoning effort: Toggle between fast, low-latency responses and deep, reasoning-intensive outputs.

Native multimodality: Accepts both text and image inputs, unlocking use cases from document parsing to visual analysis.
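To make those numbers concrete, here is a back-of-the-envelope sketch of how top-4-of-128 routing turns ~119B total parameters into ~6B active per token. The expert/shared split used below is a hypothetical placeholder chosen to land near the announced figures, not a published breakdown:

```python
# Back-of-the-envelope MoE parameter accounting. Only the 128-expert /
# top-4 routing and the ~119B-total / ~6B-active figures come from the
# announcement; the expert/shared split below is a guessed placeholder.

NUM_EXPERTS = 128  # experts per MoE layer
TOP_K = 4          # experts routed per token

def moe_param_split(expert_params: float, shared_params: float):
    """Return (total, active-per-token) parameter counts."""
    # Each token flows through only TOP_K of NUM_EXPERTS experts,
    # so just that fraction of the expert weights is exercised.
    active = shared_params + expert_params * TOP_K / NUM_EXPERTS
    return shared_params + expert_params, active

# Hypothetical split: ~116.5B in expert FFNs, ~2.5B shared (attention,
# router, norms). Embedding and output layers would add roughly the
# extra ~2B that takes the quoted active count from 6B to 8B.
total, active = moe_param_split(116.5e9, 2.5e9)
print(f"total ≈ {total/1e9:.0f}B, active ≈ {active/1e9:.1f}B per token")
# -> total ≈ 119B, active ≈ 6.1B per token
```

All 119B weights still have to be resident in memory; the routing only cuts the compute (and weight reads) per token down to the ~6B active slice.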

  • fubarx@lemmy.world · 6 hours ago

    At this point, these small models should add explicit minimum hardware requirements just so they can stand out: an STM32 with xx GB of PSRAM; an Android phone with this much RAM, this many TOPS, and a minimum OS version; ESP32-S3 or S4? That sort of thing.

    If you just say ‘small,’ you get lost in the noise.

    • ikt@aussie.zone (OP) · 5 hours ago (edited)

      tbh that’s the main thing I took away from this; since when did small equal 119B?!

      Does that mean they’ve got large models lined up approaching 1T parameters?

      • fubarx@lemmy.world · 4 hours ago

        Cloud-based LLMs have been commoditized. Lots of options.

        There’s room for someone to lead the local on-device space: anything from a high-end workstation (Apple Mac Studio, Nvidia DGX Spark, AMD Strix) to laptops (MacBook Pro, Windows AI PCs) down to embedded (Qualcomm, STM32) and ultra-small (ESP32, ARM/RISC-V).

        Lots of room there and no clear winners. Mistral, at this point, could focus on those other tiers, make a name, and carve out a lot of mindshare.

  • panda_abyss@lemmy.ca · 7 hours ago (edited)

    Looks a little underwhelming with Qwen3.5 and Haiku beating it.

    However, the 6B active parameters, plus the fact that it’s trained to return short results, could make this a useful Qwen alternative for local use. I’ve overall found Mistral models better to discuss with, but the Devstral Small models were kinda janky last I used them (stuff like infinite loops and getting confused by less common programming languages). Qwen models are by far the most verbose out of the box and happily burn a ton of tokens on useless thought; it’s an over-emphasis on reinforcement learning.

    Also weird that they use GPT-4.1 as the judge model. That’s a year-old model, nowhere near SOTA, and IIRC it underwhelmed on most metrics, so it feels like a poor choice of judge.

    Edit: it’s actually GPT-5 – some of the charts are labelled wrong.

    Not mentioned in the blog post, but on HF: they created a small speculative decoding model to go with it – https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-eagle

    That should accelerate inference speeds on some setups.
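    For anyone wondering how that draft model gets wired in: EAGLE-style heads aren’t standalone LMs (they predict off the target model’s hidden states), so they’re loaded through an engine with built-in speculative decoding support such as vLLM. A rough sketch only; the target model ID is assumed from the EAGLE repo’s naming, and vLLM’s speculative-config keys have changed across releases, so check the docs for your installed version:

    ```python
    from vllm import LLM, SamplingParams

    # Sketch only: the target model ID below is assumed from the EAGLE
    # repo's naming, and vLLM's speculative-decoding config keys have
    # changed across releases -- check your installed version's docs.
    llm = LLM(
        model="mistralai/Mistral-Small-4-119B-2603",  # assumed target ID
        speculative_config={
            "method": "eagle",
            "model": "mistralai/Mistral-Small-4-119B-2603-eagle",
            "num_speculative_tokens": 4,  # draft tokens verified per step
        },
    )

    out = llm.generate(
        ["Explain speculative decoding in one sentence."],
        SamplingParams(temperature=0.0, max_tokens=64),
    )
    print(out[0].outputs[0].text)
    ```

    The payoff is that the 392 MB head proposes several tokens cheaply and the 119B target only verifies them in a single batched forward pass, so accepted drafts cost roughly one big-model step instead of several.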

    • MalReynolds@slrpnk.net · 5 hours ago

      For certain values of small…

      That said, Mistral is strong in world knowledge, and something this big likely is too. The 6B active parameters can fit in a reasonable amount of system RAM (Q4_K_M is ~72 GB, so it’d likely run reasonably in 64 GB of system RAM plus 24 GB of VRAM) and run at reasonable if not spectacular speeds; speculative decoding could help too (but that EAGLE draft is 392 MB, which is scarily tiny).
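      The ~72 GB figure checks out from bits-per-weight arithmetic; a quick sanity check (~4.85 bits/weight is the commonly quoted effective rate for Q4_K_M, so treat it as an approximation):

      ```python
      # Sanity-check the Q4_K_M size estimate from bits-per-weight math.
      # ~4.85 bits/weight is the commonly quoted effective rate for
      # Q4_K_M (mixed 4/6-bit blocks plus scales); an approximation.

      params = 119e9
      bpw = 4.85

      print(f"file size ≈ {params * bpw / 8 / 1e9:.0f} GB")    # ≈ 72 GB

      # Per token only the ~6B active parameters are read, which is why
      # a system-RAM + VRAM split can still reach tolerable speeds.
      print(f"read per token ≈ {6e9 * bpw / 8 / 1e9:.1f} GB")  # ≈ 3.6 GB
      ```

      At ~3.6 GB of weight reads per token, even ~50 GB/s of CPU memory bandwidth puts a rough low-teens tokens/s ceiling on the portion held in system RAM, which matches the “reasonable if not spectacular” expectation.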