The recipe behind R1 is quite standard:
1️⃣ pre-train a large language model on 15T tokens (DeepSeekV3)
2️⃣ tune it with instruction data (200K examples) and reasoning data (600K examples)
3️⃣ tune it with preferences from humans so that its answers are more helpful 💁 and safe 🦺
The magic 🪄 lies in the 600K reasoning examples. These could have come from human annotators carefully crafting a high-quality, diverse set of math, coding and other problems, each with the thinking process behind the solution written out. That would still be noteworthy, but not as much as the fact that those examples were generated by the model itself: it learned how to reason by solving problems and getting feedback on whether its answers were correct, the same way AI learned to play Atari games or Go by playing against itself 😮
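The "learn from whether the answer was correct" part can be sketched as a rule-based reward. Here is a minimal, hypothetical version for math problems with a known ground-truth answer; the helper names, tag format and score values are illustrative assumptions, not the actual R1 reward:

```python
# Hypothetical rule-based reward: score a completion on correctness
# and on whether it exposes its thinking in a readable format.
import re

def reward(completion: str, ground_truth: str) -> float:
    """Illustrative sketch: format bonus + accuracy bonus."""
    score = 0.0
    # Format reward: the thinking process is wrapped in <think>...</think>
    if re.search(r"<think>.*</think>", completion, flags=re.DOTALL):
        score += 0.5
    # Accuracy reward: final answer in \boxed{} matches the ground truth
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match and match.group(1).strip() == ground_truth.strip():
        score += 1.0
    return score

print(reward("<think>2+2=4</think> The answer is \\boxed{4}", "4"))  # 1.5
```

Because the reward is checkable by a program rather than a human, the model can generate, get scored, and retrain in a loop with no human in it.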
The exact recipe for creating the reasoning data is:
1️⃣ collect a small amount of reasoning data (5K) to tune the base model (DeepSeekV3) to generate answers and thoughts in a human-readable format
2️⃣ generate answers and score them on whether they are correct, expose their thinking process, and use a consistent language. Retrain the model on that data in a loop 🔁
3️⃣ generate reasoning data (3M) and filter out low-quality samples, like those containing long paragraphs, code blocks or mixed languages
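The filtering in step 3️⃣ could look something like this. The exact heuristics and thresholds below are my own illustrative assumptions; the post only says that long paragraphs, code blocks and mixed languages were filtered out:

```python
# Sketch of quality filters over generated reasoning samples
# (heuristics and thresholds are illustrative assumptions).
import re

def keep_sample(text: str, max_paragraph_chars: int = 2000) -> bool:
    # Drop samples containing fenced code blocks
    if "```" in text:
        return False
    # Drop mixed-language samples (here: CJK characters in an English sample)
    if re.search(r"[\u4e00-\u9fff]", text):
        return False
    # Drop samples with an excessively long paragraph
    if any(len(p) > max_paragraph_chars for p in text.split("\n\n")):
        return False
    return True

samples = ["Short clean reasoning.", "Has a ```code``` block", "混合 language"]
print([keep_sample(s) for s in samples])  # [True, False, False]
```

Cheap rule-based filters like these scale to millions of samples, which matters when you generate 3M candidates and keep only the best.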
At this point you might be thinking that the above recipe is slightly different from the process I described earlier, where the model learned entirely on its own with no human supervision (more like step 2️⃣, but not 1️⃣ or 3️⃣), and you would be right. As it turns out, training the model with no human supervision at all results in a very good reasoning model that is nonetheless very bad at following instructions or explaining its thinking in a human-readable format, so the above process tries to correct that ✨
⚠️ Finally, note that the orders of magnitude of the data used (15T tokens for pre-training, ~1M examples for tuning) and of the model size (671B parameters) are in line with other frontier models, so the training costs should be too, unlike what's been reported.