I've been playing around with the DeepSeek R1 distills, the Qwen 14B and 32B specifically.
So far it's very cool to see models chase the current CoT meta by mimicking an internal thinking monologue. Seeing a model go "but wait…", "Hold on, let me check again…", "Aha! So…" makes its eventual conclusions feel more natural.
I don't like how it can get caught in looping thought processes, and I'm not sure how much all the extra tokens spent actually contribute to a "better" answer/solution.
What really needs to be ironed out is the reading comprehension, which seems below average: it misses small details in tricky questions and makes assumptions about what you're trying to ask. For example, ask for a coconut oil cookie recipe and it only registers "coconut", handing you a coconut cookie recipe made with regular butter.
It's exciting to see models operate in what feels like a new way.
That's a good point. I got mixed up and thought it was distilled from qwen2.5-coder, which I was using for comparison at the same size and quant. qwen2.5-coder-32b@4bit gave me better (but not entirely correct) responses, without spending several minutes on CoT.
I think I need to play around with this more to see if CoT is really useful for coding. I should probably also compare 32b@4bit to 14b@8bit to see which is better, since those both can run within my memory constraints.
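For anyone weighing the same tradeoff, here's a rough back-of-the-envelope sketch of how I'm sizing the weight footprint of 32B@4-bit vs 14B@8-bit. It's just params × bits per weight and ignores KV cache, context length, and quant-format overhead, which add a few more GB in practice:

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB for a quantized model."""
    bytes_per_param = bits_per_weight / 8
    return params_billions * bytes_per_param  # billions of params * bytes each = GB

for name, params_b, bits in [
    ("32B @ 4-bit", 32, 4),
    ("14B @ 8-bit", 14, 8),
]:
    print(f"{name}: ~{weight_gb(params_b, bits):.0f} GB of weights")

# 32B @ 4-bit: ~16 GB of weights
# 14B @ 8-bit: ~14 GB of weights
```

Both land in a similar range on my hardware, so the comparison will mostly come down to answer quality versus the time spent on CoT.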