Review — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

January 28, 2025

by Satrya

In this article, I share my understanding of DeepSeek-R1.
Here's the full paper on arXiv.

I created an overview of DeepSeek-R1's training pipeline. It traces the journey from the pure-RL proof of concept (DeepSeek-R1-Zero) to the multi-stage recipe behind DeepSeek-R1 and its distilled variants.

[Figure: DeepSeek-R1 training pipeline architecture]

Introduction

  • Post-training methods can enhance accuracy on reasoning tasks.
  • OpenAI's o1 series improved reasoning performance by scaling up the length of the inference-time Chain-of-Thought.
  • Effective test-time scaling remains an open challenge.
  • The authors use pure Reinforcement Learning (RL) to enhance reasoning capabilities in language models.
  • Employ Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as the RL framework; a minimal sketch of its group-relative advantage follows this list.
  • Train for thousands of RL steps.
  • DeepSeek-R1-Zero faces challenges with poor readability and language mixing.
  • DeepSeek-R1 addresses these issues using a small amount of cold-start data and a multi-stage training pipeline.
  • Generate new supervised fine-tuning data through rejection sampling on RL checkpoints, combined with supervised data from DeepSeek-V3.
  • Distill DeepSeek-R1's reasoning capability into smaller dense models (Qwen and Llama families, e.g., Qwen2.5-32B).
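
GRPO replaces PPO's learned value model with a baseline computed from the group of outputs sampled for the same prompt. Below is a minimal sketch of that group-relative advantage computation, assuming simple scalar rewards; the function name and toy numbers are my own illustration, and during training these advantages feed a PPO-style clipped objective with a KL penalty toward the reference policy.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward against the mean and
    std of the G outputs sampled for the same prompt (no critic model)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: G = 4 completions sampled for one prompt, rewarded 0/1 for accuracy.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct completions get positive advantage, the rest negative
```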

Contributions

  • Post-Training: Large-scale RL applied directly to base models without prerequisite supervised fine-tuning (SFT).
  • Distillation: Effective knowledge transfer from larger to smaller models.
  • Evaluation Results:
    • Matches OpenAI-o1 on reasoning tasks
    • Slightly trails OpenAI-o1 in knowledge tasks
    • Excels in diverse benchmark tasks

Approach

DeepSeek-R1-Zero

  • Pure RL applied directly to the base model, with self-evolution driven by GRPO
  • Reward components (a rule-based sketch follows this list):
    • Accuracy reward (correctness evaluation)
    • Format reward (proper <think>/</think> tag usage)
  • Training template:
    A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
    The assistant first thinks about the reasoning process in the mind and then provides the user
    with the answer. The reasoning process and answer are enclosed within <think> </think> and
    <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
    <answer> answer here </answer>. User: prompt. Assistant:
  • Key Insight: The model autonomously develops sophisticated problem-solving strategies through RL incentives.
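
Both rewards in this stage are rule-based rather than learned, so they can be sketched directly. The snippet below is a minimal illustration assuming a math-style task whose reference answer can be checked by string matching; the paper's actual checkers (e.g., compiling and running test cases for code problems) are more involved, and the helper names are mine.

```python
import re

TEMPLATE_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """1.0 if reasoning and answer are wrapped in the required tags."""
    return 1.0 if TEMPLATE_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the content of <answer>...</answer> matches the reference answer."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0

completion = "<think> 12 * 12 = 144 </think> <answer> 144 </answer>"
print(accuracy_reward(completion, "144") + format_reward(completion))  # 2.0
```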

DeepSeek-R1

  • Cold Start: Initializes training with 1,000 curated examples using few-shot prompting and human validation.
    • Output format:
    | special_token | <reasoning> | special_token | <summary>  
    
  • Reasoning-Oriented RL: Incorporates language consistency rewards alongside accuracy metrics.
  • Rejection Sampling & SFT: Generates roughly 800,000 training samples (about 600k reasoning and 200k general-task examples) from RL checkpoints and DeepSeek-V3 supervised data; a sketch of the rejection-sampling filter follows this list.
  • General RL Training: Optimizes for helpfulness and harmlessness across diverse scenarios.
  • Distillation: Transfers reasoning capabilities to smaller models (Qwen, Llama) using curated datasets.
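
Of these stages, rejection sampling is the easiest to picture in code: sample several completions per prompt from the RL checkpoint, keep only those that pass the correctness and readability filters, and reuse the survivors as SFT data. A minimal sketch follows; `generate` and `is_correct` are hypothetical placeholders standing in for the model's sampling routine and the filters described in the paper.

```python
from typing import Callable, Dict, List

def rejection_sample_sft_data(
    prompts: List[str],
    gold_answers: Dict[str, str],
    generate: Callable[[str, int], List[str]],  # placeholder: k sampled completions
    is_correct: Callable[[str, str], bool],     # placeholder: rule-based checker
    samples_per_prompt: int = 4,
) -> List[Dict[str, str]]:
    """Keep only completions that pass the checker; survivors become SFT pairs."""
    dataset = []
    for prompt in prompts:
        for completion in generate(prompt, samples_per_prompt):
            if is_correct(completion, gold_answers[prompt]):
                dataset.append({"prompt": prompt, "response": completion})
                break  # one accepted sample per prompt is enough for this sketch
    return dataset

# Toy usage with stand-in callables:
data = rejection_sample_sft_data(
    prompts=["What is 2 + 2?"],
    gold_answers={"What is 2 + 2?": "4"},
    generate=lambda p, k: ["<think> 2 + 2 = 4 </think> <answer> 4 </answer>"] * k,
    is_correct=lambda completion, gold: gold in completion,
)
print(len(data))  # 1
```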

Discussion

  • Distillation vs. RL: Smaller models gain more from distilling DeepSeek-R1's outputs than from running large-scale RL on themselves (see the sketch after this list).
  • Lessons Learned:
    • Process Reward Models (PRMs) complicate training pipelines
    • Monte Carlo Tree Search (MCTS) tends to converge to local optima
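
One clarification worth making: "distillation" in this paper is plain supervised fine-tuning of the student on the ~800k samples curated with DeepSeek-R1, not logit matching against the teacher. The toy sketch below shows that loss; the random tensors stand in for a real tokenizer and student model, so only the shapes matter.

```python
import torch
import torch.nn.functional as F

# Distillation here = next-token cross-entropy on teacher-generated token
# sequences (standard SFT), not KL against the teacher's logits.
vocab_size, seq_len = 100, 8
student_logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)  # stand-in student output
teacher_token_ids = torch.randint(0, vocab_size, (1, seq_len))            # stand-in teacher trace

# Shift by one so each position predicts the next token of the teacher trace.
loss = F.cross_entropy(
    student_logits[:, :-1].reshape(-1, vocab_size),
    teacher_token_ids[:, 1:].reshape(-1),
)
loss.backward()
print(loss.item())
```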

Conclusion

  • DeepSeek-R1-Zero demonstrates that pure RL, with no supervised fine-tuning, can yield strong performance across a range of reasoning tasks.
  • DeepSeek-R1 combines cold-start data with iterative RL for enhanced performance.
  • Matches OpenAI-o1-1217 on multiple benchmarks.
  • Distilled Models (e.g., DeepSeek-R1-Distill-Qwen-1.5B) outperform GPT-4o and Claude-3.5-Sonnet on math benchmarks.

Limitations

  • Capability Gaps: Limited function calling, multi-turn dialogue, and JSON formatting abilities.
  • Language Focus: Optimized for Chinese and English; queries in other languages may trigger language mixing.
  • Prompt Sensitivity: Zero-shot prompts yield best results; few-shot examples degrade performance.
  • Technical Tasks: Requires further RL optimization for software engineering applications.