I've been playing around with the DeepSeek R1 distills, Qwen 14B and 32B specifically.

So far it's very cool to see models really going after the current CoT meta by mimicking an internal thinking monologue. Seeing a model go "but wait…", "Hold on, let me check again…", "Aha! So…" kind of makes its eventual conclusions feel more natural.

I don't like how it can get caught in looping thought processes, and I'm not sure how much all the extra tokens spent really contribute to a "better" answer/solution.
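
For a rough sense of how much of a response is thinking versus answer, here's a minimal sketch. It assumes the distill wraps its reasoning in `<think>...</think>` tags like the R1 releases do, and uses whitespace splitting as a crude stand-in for a real tokenizer:

```python
# Rough estimate of how much of a response is "thinking" vs. actual answer.
# Assumes the chain of thought is wrapped in <think>...</think> tags and
# uses whitespace splitting as a crude token proxy.

def split_thinking(response: str) -> tuple[str, str]:
    """Separate the <think> block from the final answer."""
    if "</think>" in response:
        thinking, answer = response.split("</think>", 1)
        thinking = thinking.replace("<think>", "")
        return thinking.strip(), answer.strip()
    return "", response.strip()

def report(response: str) -> None:
    thinking, answer = split_thinking(response)
    t, a = len(thinking.split()), len(answer.split())
    total = max(t + a, 1)
    print(f"thinking: {t} tokens (~{t / total:.0%}), answer: {a} tokens")

# Example with a made-up response:
report("<think>But wait... let me check again... Aha! So 2+2 is 4.</think> The answer is 4.")
```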

What really needs to be ironed out is the reading comprehension, which seems lower than average: it misses small details in tricky questions and makes assumptions about what you're trying to ask. For example, ask for a coconut oil cookie recipe and it only sees "coconut", so it gives you a coconut cookie recipe made with regular butter.

It's exciting to see models operate in a kind of new way.

  • GenderNeutralBro@lemmy.sdf.org · 3 days ago

    Maybe if they had distilled the coder version of Qwen 14B it might be a little better, but I doubt it. I think a really high-quant 70B model is more in the range of cooking up functioning code off the bat. It's not really fair to compare a low-quant local model to o1 or Claude in the cloud; they are much bigger.

    That's a good point. I got mixed up and thought it was distilled from qwen2.5-coder, which I was using for comparison at the same size and quant. qwen2.5-coder-32b@4bit gave me better (but not entirely correct) responses, without spending several minutes on CoT.
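
    A quick sketch of the kind of side-by-side timing I have in mind, assuming an OpenAI-compatible local server (llama.cpp server, LM Studio, Ollama, etc.); the URL and model names are placeholders for whatever you have loaded:

    ```python
    # Time the distill vs. qwen2.5-coder on the same prompt via a local
    # OpenAI-compatible endpoint. URL and model names are placeholders.
    import time
    import requests

    BASE_URL = "http://localhost:8080/v1/chat/completions"  # adjust to your server
    MODELS = ["deepseek-r1-distill-qwen-32b-q4", "qwen2.5-coder-32b-q4"]
    PROMPT = "Write a Python function that parses an ISO 8601 date without external libraries."

    for model in MODELS:
        start = time.time()
        resp = requests.post(BASE_URL, json={
            "model": model,
            "messages": [{"role": "user", "content": PROMPT}],
            "temperature": 0.2,
        })
        elapsed = time.time() - start
        text = resp.json()["choices"][0]["message"]["content"]
        print(f"{model}: {elapsed:.1f}s, {len(text.split())} words")
    ```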

    I think I need to play around with this more to see if CoT is really useful for coding. I should probably also compare 32b@4bit to 14b@8bit to see which is better, since both can run within my memory constraints.
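
    For picking between the two, a back-of-the-envelope weight-size check is a decent first filter. This is only a sketch: the 24 GB budget is a placeholder, and it counts weights only, ignoring KV cache and runtime overhead.

    ```python
    # Back-of-the-envelope memory check: which of 32b@4bit or 14b@8bit fits?
    # Weights only (params * bits / 8); KV cache and runtime overhead come on
    # top, so treat the numbers as a floor, not an exact figure.

    def weight_gb(params_billion: float, bits: int) -> float:
        return params_billion * 1e9 * bits / 8 / 1e9  # bytes -> GB

    candidates = {"32b @ 4-bit": (32, 4), "14b @ 8-bit": (14, 8)}
    budget_gb = 24  # placeholder memory budget

    for name, (params, bits) in candidates.items():
        gb = weight_gb(params, bits)
        verdict = "fits" if gb < budget_gb else "too big"
        print(f"{name}: ~{gb:.0f} GB weights ({verdict} in {budget_gb} GB, before KV cache)")
    ```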