
Top 5 Most Successful Data Curation Strategies in DeepSeek

DeepSeek has emerged at a pivotal moment in the GenAI space, drawing significant attention for offering a relatively inexpensive yet high-quality model compared to proprietary LLMs. One of its standout features is the transparency of its model-building process, making it easier to study and adopt best practices. By examining technical reports from DeepSeek-R1, DeepSeek-V3, and their predecessors, we highlight the data curation and human supervision techniques that we believe are crucial to DeepSeek’s success.

1. Iterative Dataset Refinement: Deduplication, Filtering, and Remixing

DeepSeek’s base model data curation process follows an iterative refinement strategy, consisting of three primary steps:

  • Deduplication: An aggressive approach to near-duplicate detection helps efficiently clean the data by removing repeated or highly similar content (a minimal code sketch follows this list).
  • Filtering: Linguistic and semantic assessments are used to ensure that documents are high-quality, both at an individual level (e.g., verifying clarity) and at a global level (e.g., removing low-quality domains).
  • Remixing: The dataset is balanced by adding more content from underrepresented domains, which increases inclusivity and diversity in the training data.
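
To make the deduplication step concrete, here is a minimal, illustrative sketch of near-duplicate filtering using character n-gram shingles and Jaccard similarity. The threshold and the pairwise comparison are simplifications for readability; DeepSeek’s reports do not publish the exact algorithm, and a production pipeline would typically rely on MinHash/LSH-style indexing rather than the O(n²) loop shown here.

```python
# Near-duplicate filter: character n-gram shingles + Jaccard similarity.
# Illustrative only; thresholds and method are assumptions, not DeepSeek's.

def shingles(text: str, n: int = 5) -> set:
    """Return the set of character n-grams of a normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not too similar to any already-kept document."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

corpus = [
    "DeepSeek curates large-scale web data.",
    "DeepSeek curates large scale web data!",   # near-duplicate
    "Reward modeling guides reinforcement learning.",
]
print(deduplicate(corpus))  # the near-duplicate is dropped
```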

Here is an example of how to collect a dataset with iterative refinement (a code sketch of the full loop follows the list):

  • Seed Data: Start with a small, high-quality, manually curated dataset.
  • Classifier Training: Train a classifier model on this seed data to retrieve similar documents from a large raw dataset (e.g., Common Crawl).
  • Deduplication & Filtering: Remove near-duplicates and low-quality samples.
  • Ranking & Manual Review: Rank the remaining content by quality, then have human annotators review metadata (e.g., URLs).
  • Iteration: The refined data is used to train an improved classifier, which retrieves higher-quality data in the next round. Multiple iterations typically occur until the desired amount of high-quality data is reached.
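
Below is a hedged sketch of this loop in Python. The TF-IDF plus logistic-regression classifier, the 0.9 keep threshold, and the three rounds are illustrative assumptions, not DeepSeek’s published components; the manual-review step is noted only as a comment.

```python
# Seed -> train classifier -> filter raw pool -> re-train, repeated for a few rounds.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_quality_classifier(positives, negatives):
    """Fit a simple TF-IDF + logistic-regression quality classifier."""
    texts = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    clf = make_pipeline(TfidfVectorizer(max_features=50_000),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf

def refinement_round(clf, raw_pool, keep_threshold=0.9):
    """Score the raw pool and keep high-confidence documents."""
    scores = clf.predict_proba(raw_pool)[:, 1]
    # Near-duplicates would also be removed here (see the dedup sketch above).
    return [d for d, s in zip(raw_pool, scores) if s >= keep_threshold]

def iterative_curation(seed_pos, seed_neg, raw_pool, rounds=3):
    positives, negatives = list(seed_pos), list(seed_neg)
    for _ in range(rounds):
        clf = train_quality_classifier(positives, negatives)
        kept = refinement_round(clf, raw_pool)
        # In practice, human annotators would spot-check `kept` (e.g., by URL
        # or domain metadata) before it is added back as new positives.
        positives += kept
    return positives
```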

Our Take

Training small, specialized classifiers iteratively is crucial for collecting large-scale, high-quality training datasets. Manual review of every sample is impractical, so weak supervision methods offer an efficient way to ensure quality and scale.

2. Identify and Remove Bias

Debiasing is essential for preventing subjective biases related to cultural or regional values. DeepSeek’s approach involves identifying contentious content and removing or adjusting it as needed:

  • Manual Subset Analysis: Conduct a detailed manual assessment of areas where the model underperforms.
  • Example: In one experiment, three well-educated human annotators independently reviewed 420 moral scenarios from the MMLU Humanity-Moral subset (primarily reflecting American values).
  • Agreement Check: Their agreement with the ground-truth label was only about 60%, indicating that this subset was controversial or ambiguous and should be removed from the training corpus.

Our Take

The general approach to identifying and removing bias looks as follows (a small sketch of the decision rule follows the list):

  • Identify the subset where the model struggles.
  • Use blind assessments from multiple annotators on 100–500 examples.
  • If annotators’ agreement is above 90% but the model still underperforms, the model may require more capacity (e.g., more compute or a larger architecture).
  • If annotator agreement is below this level, exclude or revise that subset based on the iterative approach mentioned above.
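
As a concrete illustration, here is a small sketch of that decision rule. The 90% threshold mirrors the heuristic above; the label format and the toy data are assumptions for the example.

```python
# Agreement-based triage of a problematic subset.

def agreement_rate(annotations: list[list[str]], ground_truth: list[str]) -> float:
    """Fraction of (annotator, example) pairs matching the ground-truth label."""
    matches = sum(
        label == truth
        for annotator in annotations
        for label, truth in zip(annotator, ground_truth)
    )
    return matches / (len(annotations) * len(ground_truth))

def triage_subset(annotations, ground_truth, threshold=0.9):
    rate = agreement_rate(annotations, ground_truth)
    if rate >= threshold:
        return "keep: labels are reliable; the model likely needs more capacity"
    return "exclude or revise: labels are controversial or ambiguous"

# Three annotators, five examples (toy data)
annotators = [["A", "B", "A", "A", "B"],
              ["A", "B", "B", "A", "B"],
              ["B", "B", "A", "A", "A"]]
truth = ["A", "B", "A", "A", "B"]
print(triage_subset(annotators, truth))  # agreement = 0.8 -> exclude or revise
```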

3. Enhance Richness and Diversity

A diverse dataset improves the model’s generalizability. The pretraining dataset of DeepSeek-V3 (the base model on which DeepSeek-R1 is built) contains 14.8 trillion tokens, surpassing previous versions and introducing variety across multiple dimensions:

  • Language: While English and Chinese form a large portion of the dataset, other languages are also included to enrich linguistic variety.
  • Domain: The dataset covers STEM content, coding, scientific literature, writing, role-play, and simple Q&A, with each domain tuned to an optimal proportion in the final mix.
  • Reasoning vs. Non-Reasoning Data: Training data includes both reasoning chains (prompt plus step-by-step solution) and non-reasoning (prompt–response) samples, some of which are auto-generated by baseline models and later verified by human annotators.

Our Take

Robust data curation requires comprehensive metadata (e.g., tags for language, domain, and data structure) to manage and optimize dataset proportions effectively.
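
As a minimal illustration of metadata-driven remixing, the sketch below tags each document with language and domain and samples a mix toward target proportions. The tag names and target shares are assumptions for the example, not DeepSeek’s published mix.

```python
# Metadata tags drive domain-balanced sampling of the training mix.
import random
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    language: str   # e.g., "en", "zh"
    domain: str     # e.g., "stem", "code", "writing"

def remix(docs: list[Document], targets: dict[str, float], total: int,
          seed: int = 0) -> list[Document]:
    """Sample a training mix whose domain proportions follow `targets`."""
    rng = random.Random(seed)
    by_domain: dict[str, list[Document]] = {}
    for d in docs:
        by_domain.setdefault(d.domain, []).append(d)
    mix = []
    for domain, share in targets.items():
        pool = by_domain.get(domain, [])
        k = min(int(share * total), len(pool))   # per-domain quota
        mix.extend(rng.sample(pool, k))
    return mix

docs = [
    Document("Proof of the binomial theorem ...", "en", "stem"),
    Document("def quicksort(xs): ...", "en", "code"),
    Document("Short story opening ...", "zh", "writing"),
]
targets = {"stem": 0.4, "code": 0.4, "writing": 0.2}  # assumed shares
print(remix(docs, targets, total=3))
```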

4. Curate Reasoning Chains

One innovation in DeepSeek’s supervised fine-tuning is the use of synthetic data combined with model- and human-in-the-loop post-processing. While an initial generator model can learn to produce reasoning chains (especially in math and coding) via rule-based rewards (e.g., correctness checks), additional steps are needed to ensure readability and language consistency:

  • Multiple Chain Generation & Human Post-Processing: Generate multiple reasoning chains for each prompt. Human annotators then verify correctness, format, and clarity, often using templates to ensure consistency (such as summary sections and special tokens).
  • Rejection Sampling: Utilize a generative reward model (e.g., DeepSeek-V3) to judge outputs, in conjunction with rule-based rewards (e.g., correct format).
  • Filtering Out Unwanted Content: Remove chains that mix languages, contain excessively long paragraphs, or include code blocks where not appropriate.

Ultimately, the reasoning data is blended with non-reasoning data (around 600k reasoning samples and 200k non-reasoning samples). Human annotators verify both sets to maintain quality.
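
A hedged sketch of rejection sampling combined with readability filters is shown below. The generator and reward_model callables and the regex heuristics are placeholders for illustration; DeepSeek’s actual pipeline uses its own models, templates, and rules.

```python
# Rejection sampling over candidate reasoning chains, with simple filters.
import re

def has_mixed_languages(text: str) -> bool:
    """Crude heuristic: flags chains mixing Latin and CJK scripts."""
    return bool(re.search(r"[a-zA-Z]", text)) and bool(re.search(r"[\u4e00-\u9fff]", text))

def passes_filters(chain: str, max_paragraph_chars: int = 2000) -> bool:
    if has_mixed_languages(chain):
        return False
    if any(len(p) > max_paragraph_chars for p in chain.split("\n\n")):
        return False                      # drop excessively long paragraphs
    code_fence = "`" * 3                  # markdown code-block marker
    if code_fence in chain:
        return False                      # drop code blocks where not expected
    return True

def rejection_sample(prompt, generator, reward_model, n=8):
    """Generate n candidate chains, filter them, and keep the highest-scoring one."""
    candidates = [generator(prompt) for _ in range(n)]
    candidates = [c for c in candidates if passes_filters(c)]
    if not candidates:
        return None                       # route the prompt back for regeneration or review
    return max(candidates, key=lambda c: reward_model(prompt, c))
```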

Our Take

Reasoning chains are a powerful way to leverage the model’s test-time compute capabilities, but real-world data containing such detailed step-by-step solutions is scarce. Synthetic generation, paired with a rigorous curation workflow, is critical for pushing the boundaries of AI development.

5. Model Reward Signals for Reinforcement Learning

Reward modeling is essential for guiding the model during reinforcement learning (RL). An LLM can serve as a universal processor to convert unstructured information into reward signals, enabling its own iterative improvement. DeepSeek utilizes two main types of rewards (see the sketch after this list):

  • Rule-Based Reward: For tasks with verifiable answers (e.g., math or coding), straightforward rule-based or external-tool-based checks are used to provide reward signals. RL excels in these scenarios.
  • Neural Reward Model: When the ground truth is free-form, a reward model evaluates the quality of responses to ensure they match expected answers.
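
Here is a minimal sketch of how the two signals might be combined. The \boxed{} answer convention and the neural_reward_model placeholder are assumptions for illustration, not DeepSeek’s exact interfaces.

```python
# Rule-based reward for verifiable answers, neural judge for free-form ones.
import re

def rule_based_reward(response: str, expected_answer: str) -> float:
    """Exact-match check for tasks with a verifiable final answer (e.g., math)."""
    # Assumes the final answer is wrapped as \boxed{...}, a common convention.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == expected_answer.strip() else 0.0

def combined_reward(prompt, response, expected_answer=None, neural_reward_model=None):
    """Rule-based check when the answer is verifiable, otherwise the neural judge."""
    if expected_answer is not None:
        return rule_based_reward(response, expected_answer)
    return neural_reward_model(prompt, response)  # placeholder learned scorer

print(combined_reward("What is 2+3?", r"Step 1: 2+3=5. \boxed{5}", expected_answer="5"))  # 1.0
```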

A secondary RL stage incorporates human feedback to improve the model’s helpfulness and harmlessness, refining both task performance and reasoning clarity.

Our Take

Designing data for training a reward model is crucial for customization and risk management. Combining rule-based logic (for domains that can be automatically checked) with neural reward models (for open-ended responses) makes RL fine-tuning more robust. Human experts should carefully design the verification process and provide consistent preference signals.

By focusing on these five data curation strategies—iterative refinement, debiasing, enhancing diversity, curating reasoning chains, and effective reward modeling—DeepSeek sets a strong example for building transparent and high-quality AI models.

References

https://arxiv.org/pdf/2501.12948

https://arxiv.org/pdf/2412.19437

https://arxiv.org/pdf/2405.04434

https://arxiv.org/pdf/2401.02954

https://arxiv.org/pdf/2402.03300
