Key architectural details

Mixture of Experts (MoE): 128 experts, with 4 active per token, enabling efficient scaling and specialization (a minimal routing sketch follows this list).

119B total parameters, with 6B active parameters per token (8B including embedding and output layers).

256k context window, supporting long-form interactions and document analysis.

Configurable reasoning effort: Toggle between fast, low-latency responses and deep, reasoning-intensive outputs (see the configuration sketch after the list).

Native multimodality: Accepts both text and image inputs, unlocking use cases from document parsing to visual analysis.
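
To make the "128 experts, 4 active per token" point concrete, here is a minimal top-k routing sketch in PyTorch. The class name, hidden sizes, and the naive per-token loop are illustrative assumptions, not the model's actual implementation; production MoE layers use fused, batched expert kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a mixture-of-experts layer: each token is routed to k of n experts."""
    def __init__(self, d_model=64, n_experts=128, k=4, d_ff=256):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # per-token gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        top_w, top_i = scores.topk(self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)         # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):               # naive loop for clarity
            for t in range(x.size(0)):
                e = top_i[t, slot].item()
                out[t] += top_w[t, slot] * self.experts[e](x[t])
        return out

x = torch.randn(3, 64)
print(TopKMoE()(x).shape)                        # torch.Size([3, 64])
```

Only the 4 selected expert MLPs run per token, which is why the active parameter count stays far below the total.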
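For the configurable reasoning effort, a hedged sketch of how this is commonly toggled through an OpenAI-style client. The endpoint, model id, and the server actually honoring the `reasoning_effort` field are all assumptions about the deployment, not details from the announcement.

```python
from openai import OpenAI

# Assumed local OpenAI-compatible server; adjust base_url/model for your setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for effort in ("low", "high"):
    resp = client.chat.completions.create(
        model="local-model",            # placeholder model id
        reasoning_effort=effort,        # fast, low-latency vs. deliberate mode
        messages=[{"role": "user", "content": "Is 2^31 - 1 prime?"}],
    )
    print(effort, "->", (resp.choices[0].message.content or "")[:80])
```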

  • ikt@aussie.zoneOP · 10 hours ago

    Qwen models are by far the most verbose out of the box, and happily burn a ton of tokens on useless thought. It’s an over-emphasis on reinforcement learning.

    I now have a system prompt just to say please stop talking, Qwen 😭

    Even a "hello" can result in 3 paragraphs by default.
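
A minimal sketch of pinning a terseness system prompt like the one the commenter describes, against an OpenAI-compatible endpoint. The URL, model id, and prompt wording are placeholders for a local Qwen server, not the commenter's actual setup.

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",   # assumed local server
    json={
        "model": "qwen",                            # placeholder model id
        "messages": [
            {"role": "system",
             "content": "Answer in at most two sentences. No preamble, "
                        "no restating the question, no closing summary."},
            {"role": "user", "content": "hello"},
        ],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```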