DeepSeek Reasoner-1
Based on DeepSeek-V3-Base and RL-tuned with only a 20GB dataset
Alignment is not rigid, which makes it a good target for jailbreak testing
Uses pure RL to enhance LLM reasoning without human-authored CoT data. Combines Rejection Sampling + RL + SFT to solve the Zero version's language mixing and readability issues, strengthening not only reasoning but also general conversation and writing capabilities.
fully open sourced
model or distill
tech report
Zero version: RL only, with limitations (repetitive answers, low readability, language mixing)
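
The rejection-sampling step mentioned above can be illustrated with a minimal sketch: sample several candidate answers per prompt, keep only those that pass automatic checks, and reuse the survivors as SFT data. This is not DeepSeek's actual pipeline. The `Answer:` output format, the function names, and the thresholds are hypothetical, and the language-mixing and repetition checks are toy heuristics chosen only to mirror the limitations listed in this note.

```python
import re
from collections import Counter


def extract_answer(text):
    """Return the token after the last 'Answer:' marker, or None.

    The 'Answer:' format is an assumption made for this sketch.
    """
    matches = re.findall(r"Answer:\s*(\S+)", text)
    return matches[-1] if matches else None


def is_language_mixed(text, min_share=0.05):
    """Toy heuristic: flag text where both CJK and non-CJK letters
    each make up more than min_share of all letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    ratio = cjk / len(letters)
    return min_share < ratio < 1 - min_share


def is_repetitive(text, n=3, max_repeat=3):
    """Toy heuristic: flag text in which any word n-gram occurs
    more than max_repeat times."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return False
    return max(Counter(ngrams).values()) > max_repeat


def rejection_sample(candidates, reference):
    """Keep candidates whose final answer matches the reference and
    that pass the readability checks; the rest are rejected."""
    kept = []
    for text in candidates:
        if extract_answer(text) != reference:
            continue  # wrong or missing final answer
        if is_language_mixed(text) or is_repetitive(text):
            continue  # fails the readability filters
        kept.append(text)
    return kept


candidates = [
    "2 + 2 is 4. Answer: 4",
    "2 + 2 is 5. Answer: 5",
    "先算 two plus two 等于四 so four. Answer: 4",
    "wait wait wait wait wait wait wait Answer: 4",
]
sft_data = rejection_sample(candidates, reference="4")
print(sft_data)
```

Only the first candidate survives here: the second has the wrong answer, the third mixes languages, and the fourth repeats itself. In a real setup the candidates would come from sampling the RL-tuned model, and the filters would be far stricter than these heuristics.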

Seonglae Cho