How was DeepSeek-R1 created?

The Chinese AI company Hangzhou DeepSeek Artificial Intelligence Fundamental Technology Research Co., Ltd. (DeepSeek) released the DeepSeek-R1 reasoning model, which sent shockwaves through the global technology community and drew close attention from major tech companies and even the US government.

DeepSeek-R1 excels at programming, mathematical, and logical reasoning, and its performance is comparable to that of OpenAI o1, the strongest publicly released reasoning model at the time. Moreover, DeepSeek-R1 is available to researchers and developers worldwide as open source, demonstrating a genuine spirit of openness with far-reaching impact. Jim Fan, an AI scientist at NVIDIA, praised DeepSeek-R1 as 'cutting-edge research that is truly open and empowering for all.'

DeepSeek was founded in May 2023 by the Chinese quantitative hedge fund High-Flyer, itself founded by Liang Wenfeng. The company's team consists mainly of PhD graduates from China's top universities; it is young, efficient, and close-knit, and able to quickly learn and apply the latest technology to develop large models.

At the end of December 2024, DeepSeek released and open-sourced the DeepSeek-V3 model, whose performance is comparable to that of the top closed-source models, while its training cost was less than $6 million, reportedly only about 1/20 of the cost of GPT-4, and its training time was only about two months. In January 2025, the company launched the reasoning model DeepSeek-R1, which reached or surpassed the OpenAI o1 model in many tests.

Despite the controversy over how powerful DeepSeek-V3 and DeepSeek-R1 really are, most people agree that the two models reached the level of the top closed-source large models of the time, GPT-4o and OpenAI o1, respectively.

Currently, the most significant controversy over DeepSeek is that it reportedly used only a cluster of 2,048 NVIDIA H800 GPUs to train a 671-billion-parameter mixture-of-experts (MoE) model in about two months, roughly 10 times more efficient than industry leaders such as Meta, with a training cost of only 3%-5% of that of OpenAI o1.

Some critics believe that DeepSeek may have exaggerated the efficiency and resource utilization of its training process in its technical report, while others believe that substantial technological advances by DeepSeek made this seemingly impossible task achievable.

What impact will the rise of DeepSeek have on the global AI industry, and will it reshape the future AI industry landscape?

This paper explores the training methods of the DeepSeek family of models, including DeepSeek-V3, DeepSeek-R1-Zero, and DeepSeek-R1, by analyzing publicly available information and materials, in particular the technical reports of DeepSeek-V3 and DeepSeek-R1.

It introduces in detail DeepSeek's improvements to and development of existing large-model architectures and discusses in depth the series' advances in GPU cluster load balancing, the use of the low-level GPU language PTX for low-level optimization, and the co-optimization of hardware and software systems (including communication, memory, and computation).

The article strives to express our observations and understanding in plain language, providing an objective and unbiased analytical perspective for readers outside the AI field and a useful reference for professionals within it.

Efficient architecture and innovative technology of DeepSeek-V3

DeepSeek-V3 excels in programming ability, mathematical reasoning, Chinese comprehension, and long-text comprehension. DeepSeek-R1 is a large language model built on DeepSeek-V3 and designed for complex reasoning tasks, while DeepSeek-R1-Zero is an intermediate reasoning model between the two.

While DeepSeek uses the Transformer architecture, it is heavily optimized in every aspect of architecture and algorithms and incorporates numerous innovations and state-of-the-art techniques.

1. Efficient Model Architecture: Mixture-of-Experts Model

DeepSeek-V3 uses a mixture-of-experts (MoE) architecture, the technique currently used by most large AI models, although some models do not rely on MoE, such as Anthropic's Claude and Meta's LLaMA family.

Due to the opacity of information, it is impossible to determine whether OpenAI's GPT-3.5 and GPT-4 adopt the MoE architecture.

The MoE model handles complex tasks by combining multiple expert models; each expert model focuses on a different part of the input data, and the gating network decides how to weigh the outputs of these experts. The core idea is to decompose the task into multiple sub-tasks, which are then handled by different experts, thus improving the flexibility and performance of the model.

The MoE model performs well in natural language processing and computer vision and is especially suitable for processing large-scale data and complex tasks. By dynamically allocating computational resources, MoE can efficiently utilize hardware while maintaining high accuracy and generalization.

The DeepSeekMoE architecture is unique in its fine-grained design and shared expert policies. Other MoE models may have a few to dozens of experts per layer, e.g., xAI’s Grok-1 employs an 8-expert MoE architecture that activates two experts per token processed.

In the DeepSeekMoE framework, each MoE layer consists of 1 shared expert and 256 routing experts. Each token is processed by selecting the eight most appropriate experts from these routing experts.
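To make the routing concrete, below is a minimal PyTorch sketch of a DeepSeekMoE-style layer: one always-active shared expert plus top-k selection among routed experts. The toy dimensions, the class name, and the plain softmax gate are illustrative assumptions rather than DeepSeek's actual implementation, which additionally constrains cross-node routing and balances load without an auxiliary loss, as discussed below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSeekStyleMoELayer(nn.Module):
    """Toy MoE layer: 1 always-active shared expert + top-k of n routed experts."""
    def __init__(self, d_model=64, d_ff=128, n_routed=256, top_k=8):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.shared = ffn()
        self.experts = nn.ModuleList(ffn() for _ in range(n_routed))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)        # routing probabilities
        topv, topi = scores.topk(self.top_k, dim=-1)    # keep the 8 best experts
        topv = topv / topv.sum(dim=-1, keepdim=True)    # renormalise their weights
        outputs = []
        for t in range(x.size(0)):                      # per-token loop for clarity
            tok = x[t]
            routed = sum(w * self.experts[int(i)](tok)
                         for w, i in zip(topv[t], topi[t]))
            outputs.append(self.shared(tok) + routed)   # shared expert sees every token
        return torch.stack(outputs)

layer = DeepSeekStyleMoELayer()
print(layer(torch.randn(4, 64)).shape)                  # torch.Size([4, 64])
```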

The shared expert strategy in DeepSeekMoE architecture is an important innovation that includes expert classification, characteristics, purpose, and advantages.

The number of shared experts is fixed and small, and each MoE layer usually contains one shared expert that is always active and is responsible for capturing and integrating common knowledge in different contexts, which reduces knowledge redundancy, improves parameter efficiency, and allows independent routing experts to focus on more specialized knowledge.

The shared expert strategy improves the generalization capability and overall efficiency of the model, mitigates parameter redundancy among other routing experts, and achieves an efficient model architecture in combination with fine-grained expert partitioning.

This fine-grained MoE design is very complex and challenging to engineer. Due to the lack of sufficient high-performance GPUs, the DeepSeek team had to make the model reach new heights in efficiency and performance through careful design and effort. This innovation undoubtedly sets a new benchmark for the AI field.

During the training process, each token activates only eight routing experts in each MoE layer and can route to up to 4 nodes, and this method is called sparse activation. The sparse activation mechanism can significantly expand the model capacity without substantially increasing the computational cost. Fine-grained expert systems and sparse activation have significant advantages.

Firstly, by reducing the number of connections and activations, the number of parameters in the network is significantly reduced, reducing the model’s storage requirements and computational overhead.

In addition, sparse connection and activation patterns make the model more interpretable and help to understand the model’s decision-making process. Restricting connections and activations also reduces the effect of data noise and redundant information and improves the robustness of the model to disturbances and changes. By extracting the most relevant and important features, the generalization ability of the model is enhanced, and the risk of overfitting is effectively reduced.

In addition, by retaining only the most essential activation values, the computation and memory usage are significantly reduced, while the model performance is hardly affected. However, the drawbacks are also obvious: the implementation complexity is high, requiring sophisticated routing mechanisms and specialized hardware support;

More computational resources may be required to optimize expert allocation and activation patterns during the training phase, which is a challenge for teams with limited resources; carefully balancing the number of experts, activation strategies, and model performance requires a lot of experimentation and tuning, which is also a complex process.

The biggest challenge in training huge models with MoE architecture is load balancing, which involves various aspects such as efficiency, performance bottlenecks, training stability, scalability, and communication overheads. If the load is not balanced, some experts will be overused while others will be idle, resulting in wasted computational resources and reduced training efficiency.

Load imbalance also leads to system performance bottlenecks, where popular experts are overloaded and cold experts are underloaded, forming a self-reinforcing cycle. As the model size increases, the load imbalance limits scalability and diminishes returns. In addition, in distributed training, load imbalance increases the communication overhead between nodes, affecting the training speed.

2. Load balancing strategy without auxiliary loss

The auxiliary-loss-free load balancing (ALFLB) strategy proposed by the DeepSeek team is an innovative load-balancing method: it achieves adaptive load distribution by adding a dynamic bias term to each expert's routing score and adjusting that bias according to the expert's recent load.

When an expert is overloaded, the system automatically lowers its probability of receiving new tokens; when an expert is underloaded, that probability is raised.

Compared with the traditional auxiliary-loss approach, this auxiliary-loss-free strategy avoids interfering with the model's main training objective, significantly improves model performance and training efficiency, and reduces memory consumption and computational overhead.

Overall, the strategy provides an efficient and cost-effective solution to significantly improve the performance and training efficiency of the large language model (LLM) through a naturally balanced load distribution.
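The idea can be sketched in a few lines: a per-expert bias is added to the routing scores only when choosing the top-k experts, the original scores still supply the combination weights, and after each batch the bias of overloaded experts is nudged down while that of underloaded experts is nudged up. The function names, the step size gamma, and the simple sign-based update below are assumptions made for illustration.

```python
import torch

def route_with_bias(scores, bias, top_k=8):
    """Select experts by (score + bias); the bias steers selection only,
    while the original scores still provide the combination weights."""
    _, idx = (scores + bias).topk(top_k, dim=-1)        # biased selection
    weights = torch.gather(scores, -1, idx)             # unbiased weights
    return idx, weights / weights.sum(-1, keepdim=True)

def update_bias(bias, expert_counts, gamma=0.001):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = expert_counts.float().mean()
    return bias - gamma * torch.sign(expert_counts.float() - mean_load)

n_experts, top_k = 256, 8
bias = torch.zeros(n_experts)
scores = torch.rand(1024, n_experts)                    # 1024 tokens in a batch
idx, w = route_with_bias(scores, bias, top_k)
counts = torch.bincount(idx.flatten(), minlength=n_experts)
bias = update_bias(bias, counts)                        # adapt before the next batch
```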

3. Innovative Attention Mechanism: Multi-Head Latent Attention

When ChatGPT generates text, it not only pays attention to the just-generated words but also takes into account the entire input context as well as all the previously generated words, and the model assigns different weights to these words to differentially pay attention to their influence on the currently-generated words.

This dynamic and differentiated attention mechanism enables the model to capture key information in the context and generate more natural, coherent, and semantically rich text; this is an intuitive picture of what the attention mechanism does.

To implement the attention mechanism in the training process, Transformer introduces a query matrix (Q), a key matrix (K), and a value matrix (V) to compute the attention, where Q, K, and V are all high-dimensional matrices.

In actual sentence generation, Q and K are first multiplied to calculate how relevant each part of the preceding text is to the word being generated; the result is then multiplied by V, which represents the content of that text, to compute the attention output and decide what the next word should be.
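In code, this standard scaled dot-product attention is compact; the sketch below is a generic textbook formulation rather than anything DeepSeek-specific:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- weigh each value by query-key relevance."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # relevance of each position
    weights = F.softmax(scores, dim=-1)             # normalised attention weights
    return weights @ V                              # weighted sum of the values

Q = torch.randn(1, 10, 64)   # (batch, sequence length, head dimension)
K = torch.randn(1, 10, 64)
V = torch.randn(1, 10, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 10, 64])
```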

The multi-head attention (MHA) mechanism is an extension and improvement of self-attention, which makes the model act as a multi-angle observer, capturing different features and correlations from multiple angles simultaneously. The MHA mechanism not only extends the representation space of the model but also enhances its ability to learn complex features.

Multiple attention heads can also be computed in parallel to increase the processing speed of the model and reduce the risk of overfitting, thus improving the generalization ability of the model. Different attention heads focus on various input aspects, enabling the model to obtain a more comprehensive semantic understanding.

Through this parallel processing from multiple perspectives, the multiple heads of attention enable the model to perform well in various natural language processing tasks and to understand complex linguistic structures and semantic relations more comprehensively.

DeepSeek first proposed the multi-head latent attention (MLA) mechanism in the DeepSeek-V2 model; it addresses a bottleneck of large language models (LLMs) during training and inference, in particular the large amount of memory occupied by the key-value (KV) cache.

MLA requires only 5-13% of the memory of MHA and speeds up the inference process by reducing the KV cache, especially when dealing with long sequences.

At the same time, the MLA mechanism can still achieve performance comparable to, or even stronger than, the MHA mechanism with significantly reduced resource consumption. This enables DeepSeek-V2 to significantly reduce training and inference costs while maintaining high performance, giving DeepSeek a clear advantage in large language modeling.

The MLA mechanism innovatively adopts the low-rank key-value joint compression technique to compress the key and value matrices cached by the traditional MHA mechanism into a low-dimensional latent vector, significantly reducing the memory occupation while preserving the key information and achieving the efficient computation of attention.

This design allows the MLA mechanism to significantly reduce the computational resource requirements while maintaining or improving the model performance, especially when dealing with long sequences. This innovation enables the MLA mechanism to achieve more efficient training and inference in LLM applications and is one of the keys to training the DeepSeek-V3 model.
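A simplified sketch of the low-rank joint compression idea follows: the hidden state is down-projected to a small latent vector, which is what gets cached, and keys and values are reconstructed from it by up-projection. The dimensions are arbitrary, and details such as the decoupled rotary position embedding used in the real MLA design are omitted.

```python
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    """Cache one small latent per token instead of full per-head K and V."""
    def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # rebuild K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # rebuild V

    def forward(self, h):                # h: (seq, d_model)
        c_kv = self.down(h)              # (seq, d_latent) -- this is what gets cached
        return c_kv, self.up_k(c_kv), self.up_v(c_kv)

h = torch.randn(16, 1024)                # 16 tokens with hidden size 1024
c_kv, k, v = LowRankKVCompression()(h)
print(c_kv.shape, k.shape)               # cache 128 floats/token vs 2*8*64 = 1024
```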

In the pre-training phase, the MLA mechanism demonstrates significant advantages in expanding the model capacity, increasing the batch size, and optimizing the computation-memory balance.

Although the MLA mechanism adds additional computational complexity, the savings in memory resources and potential performance improvements usually outweigh the burden of this computational increase, especially when memory hardware is limited.

In the inference phase, the MLA mechanism significantly reduces the memory footprint and improves inference efficiency by shrinking the KV cache, projecting the key and value matrices into a low-dimensional latent space. Although this adds computation, it reduces memory bandwidth and storage requirements.

In addition, the MLA mechanism allows the number of attention heads to be increased without increasing the size of the KV cache, allowing the model to potentially increase capacity without sacrificing inference speed.

4. Application of multi-token prediction

The DeepSeek-V3 model employs the multi-token prediction (MTP) technique, making it unusual among large language models. MTP works by predicting multiple future tokens in parallel with several output heads; the main output head (the next-token prediction head) then verifies these predictions and selects the most likely outcome.

The model uses n independent output heads to predict the next n future tokens; they share the same backbone network, which generates a latent representation of the context that is then fed to the n heads. This design is simple, easy to implement, and does not require complex architectural changes.
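A minimal sketch of this parallel-heads scheme is shown below, with a toy recurrent trunk standing in for the Transformer backbone; all names and sizes are illustrative, and DeepSeek-V3's own MTP variant differs in detail (its prediction modules are chained sequentially rather than fully independent).

```python
import torch
import torch.nn as nn

class MultiTokenPredictor(nn.Module):
    """Shared backbone + n independent heads, each predicting one future token."""
    def __init__(self, vocab=32000, d_model=512, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in trunk
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_future))

    def forward(self, tokens):                      # tokens: (batch, seq)
        h, _ = self.backbone(self.embed(tokens))    # shared context representation
        last = h[:, -1]                             # representation of the last position
        return [head(last) for head in self.heads]  # logits for t+1 ... t+n

logits = MultiTokenPredictor()(torch.randint(0, 32000, (2, 16)))
print(len(logits), logits[0].shape)                 # 4 heads, each (2, 32000)
```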

A study by Meta, the American Internet company, showed that MTP provides richer supervision signals to the model by predicting multiple tokens, allowing it to learn language structures and regularities faster.

Models trained with 4-token prediction can be up to 3 times faster at inference than single-token models. MTP also helps models learn long-distance dependencies between tokens, leading to a better understanding of contextual information, and excels on programming tasks, enhancing out-of-distribution generalization.

However, MTP requires more computational resources, especially when the model size is large, and even a simple implementation of MTP may lead to a rapid increase in memory usage, which requires special optimization techniques to address. Moreover, MTP does not consistently outperform traditional single-token prediction on some specific NLP tasks; for example, it performs poorly on some standard multiple-choice tasks.

DeepSeek was the first to apply the MTP technique to the training of DeepSeek-V3 and DeepSeek-R1, using extreme memory and communication management to fully exploit MTP's efficiency advantages; the gains include better data efficiency, stronger prediction, shorter training time, and improved generalization of the model.

This innovative approach significantly improves efficiency and performance, enabling DeepSeek to lead at the forefront of AI technology.

5. Mixed-Precision Training

DeepSeek-V3 introduces an 8-bit floating-point (FP8) mixed-precision training framework, a significant innovation. FP8 mixed-precision training represents data with 8-bit floating-point numbers; compared with the traditional 32-bit (FP32) and 16-bit (FP16) formats, FP8 has lower precision but occupies less memory and computes faster.

The mixed-precision strategy uses FP8 to implement most core computational kernels, including forward propagation, activation backpropagation, and weight backpropagation. The outputs are produced in BF16 or FP32 format, while the activation values are stored in FP8 for backpropagation. This approach yields significant performance improvements, theoretically doubling computational throughput while markedly reducing memory consumption.
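The scaling idea behind FP8 training can be illustrated with a toy example: tensors are scaled into the representable range of the E4M3 format before casting, and the scales are undone after the low-precision multiply, with accumulation kept in higher precision. Real FP8 training relies on hardware FP8 GEMMs and finer-grained (tile- or block-wise) scaling; the per-tensor scaling below, and the PyTorch >= 2.1 float8 dtype it uses, are simplifications for illustration.

```python
import torch

E4M3_MAX = 448.0                      # largest finite value in the FP8 E4M3 format

def to_fp8(x):
    """Per-tensor scaling quantisation (real FP8 training scales per tile/block)."""
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)       # requires PyTorch >= 2.1
    return x_fp8, scale

def fp8_matmul(a, b):
    """Quantise inputs to FP8, multiply, and undo the scales in higher precision."""
    a8, sa = to_fp8(a)
    b8, sb = to_fp8(b)
    # Cast back up for the multiply; real kernels feed FP8 directly to tensor cores
    # and accumulate in FP32/BF16.
    out = a8.to(torch.float32) @ b8.to(torch.float32)
    return out / (sa * sb)

a, b = torch.randn(64, 64), torch.randn(64, 64)
err = (fp8_matmul(a, b) - a @ b).abs().mean()
print(f"mean absolute error vs FP32 matmul: {err.item():.4f}")
```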

Thanks to DeepSeek's innovative solution to error accumulation, FP8 mixed-precision training keeps the accuracy loss within 0.25%, with almost no impact on model performance.

For the first time, the effectiveness of FP8 mixed-precision training has been validated on an ultra-large-scale model, enabling DeepSeek-V3 to maintain a high level of performance while reducing the GPU memory footprint and computational overhead, further improving the computational utilization per GPU hour, and reducing the overall training cost.

Mixed-precision training, while easy to understand conceptually, is quite tricky in practice. The design team must have a comprehensive and precise grasp of the computational accuracy of every aspect and detail of the large model training process.

Because of this, many large models are not trained with mixed precision, especially at AI giants with colossal capital and hundreds of thousands of GPUs. DeepSeek adopted mixed-precision training out of necessity and pulled it off, turning a constraint into an advantage.

6. Improving GPU computational efficiency by directly writing and optimizing PTX code

PTX (parallel thread execution) is an intermediate representation language in NVIDIA's CUDA architecture, sitting between high-level GPU programming languages (e.g., CUDA C/C++) and the low-level machine code SASS. PTX exposes a lower-level instruction-set architecture, allowing developers to perform fine-grained optimizations.

DeepSeek improved GPU computational efficiency by writing and optimizing PTX code when training the DeepSeek-V3 model, including dedicating 20 of the 132 streaming multiprocessors (SMs) to inter-server communication to work around communication bandwidth constraints, optimizing register allocation and thread scheduling to reduce data-handling overhead, and deeply adapting the code to the GPU hardware.

These optimizations result in significant performance gains, with GPU computational efficiency reportedly up to 10 times higher than Meta's. The GPU's potential is fully exploited through direct control of registers and thread scheduling, and deep optimization for specific hardware (e.g., the H800) extracts extreme performance.

However, PTX is close to assembly language and demands deep hardware knowledge and programming skill; maintainability is poor, as the code is difficult to read and maintain, which hinders teamwork and long-term development; and portability is low, since PTX code optimized for specific hardware is difficult to migrate across GPU models. For these reasons, this approach has not been widely adopted.

7. Data parallelism and model parallelism

The parallelism strategy of DeepSeek-V3 is very complex and fine-grained, including a 3-layer parallelism strategy: 16-way pipeline parallelism, 64-way expert parallelism across eight nodes, and ZeRO-1 data parallelism.

In addition, DeepSeek-V3 introduces an innovative DualPipe pipeline parallelism algorithm, which significantly reduces pipeline stalls and enables overlapping of computation and communication phases. This design dramatically improves GPU utilization while reducing communication overhead.

Regarding expert parallelism, DeepSeek-V3’s model consists of 256 routing experts and 1 shared expert, with each token activating eight experts and ensuring that it is sent to up to 4 nodes.

This multi-level parallel strategy not only makes full use of the hardware resources but also dramatically improves the training efficiency through innovative algorithm design, which enables DeepSeek-V3 to complete the training of large-scale models in a shorter period.

In addition, the model also reaches the extreme in the joint design of hardware and software architecture, the reasonable deployment of memory and computing power, and the load balancing strategy.

Through the combined application of these techniques, DeepSeek successfully trained the general-purpose large language model DeepSeek-V3 with limited GPU resources and a short training time.

Application of the innovative new algorithm GRPO

From DeepSeek-V3 to DeepSeek-R1-Zero, large models first need to be pre-trained. The pre-training process is costly, requiring a large training dataset, a sufficiently large computing cluster, and a long training time.

The purpose of pre-training is to compress the knowledge in the massive training data into the hundreds of billions of parameters of the large model and thus obtain a general-purpose language model, such as DeepSeek-V3 or GPT-4. Although such a general-purpose model is almost omniscient, its reasoning ability is still limited.

1. Supervised fine-tuning and reinforcement learning

Various training methods have been developed to enhance the reasoning ability of large models; the most important and commonly used are supervised fine-tuning (SFT) and reinforcement learning (RL).

SFT builds on pre-trained models and uses labeled data for further training to improve the performance of the models on specific tasks or domains.

SFT requires a large amount of high-quality, well-labeled, task-specific data, and professionals must be hired for data labeling and processing, a time-consuming and expensive process. SFT also requires substantial computational resources, and achieving the desired results may take multiple iterations and optimizations, further increasing the cost.
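Mechanically, SFT is ordinary next-token training with cross-entropy loss on labeled prompt-response pairs, usually with the loss masked on prompt tokens so that only the response is supervised. The toy model and random data below are placeholders meant only to show the shape of one SFT update step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model = 1000, 64
embed = nn.Embedding(vocab, d_model)          # toy stand-in for a pre-trained LM
trunk = nn.GRU(d_model, d_model, batch_first=True)
lm_head = nn.Linear(d_model, vocab)
params = list(embed.parameters()) + list(trunk.parameters()) + list(lm_head.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)

tokens = torch.randint(0, vocab, (2, 12))     # each row = [prompt ; labelled response]
mask = torch.zeros(2, 12)
mask[:, 6:] = 1.0                             # supervise only the response tokens

hidden, _ = trunk(embed(tokens[:, :-1]))      # predict every next token
logits = lm_head(hidden)                      # (batch, seq-1, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1),
                       reduction="none")
loss = (loss * mask[:, 1:].reshape(-1)).mean()
loss.backward()
opt.step()                                    # one SFT update step
```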

As the industry quips, 'the world has long suffered under SFT.' RL is a machine learning method in which a large model interacts with an environment and, guided by the reward signals the environment feeds back, learns the optimal policy to maximize cumulative reward.

Reinforcement learning from human feedback (RLHF), which combines reinforcement learning with human feedback, is more commonly used in large-model post-training.

RLHF and traditional reinforcement learning are similar in terms of framework, optimization strategy, and iterative learning but differ in reward sources, learning objectives, and training process.

While traditional reinforcement learning relies on predefined rules or environments, RLHF translates human feedback into rewards and trains reward models to predict human preferences so that model outputs are more consistent with human values.

The training process of RLHF consists of multiple phases such as pre-training, reward model training, and fine-tuning of reinforcement learning, which is more suitable for tasks where quality is difficult to define algorithmically but is easy for humans to judge, such as generating compelling stories.

Compared with RL, RLHF requires a large amount of high-quality human feedback data, multiple model training and deployment phases, and more computational resources, making it even more expensive than SFT and a challenge for companies with limited resources.

2. Group Relative Policy Optimization

The DeepSeek team proposed an innovative reinforcement learning algorithm, group relative policy optimization (GRPO), in February 2024.

The algorithm aims to improve the inference ability of large language models, especially in complex tasks such as mathematics and programming.

The main feature of GRPO is that it does not rely on a separate value-function (critic) model; instead, it uses the average reward of a group of sampled outputs as the baseline for optimization. This approach simplifies the training process and reduces memory consumption and computational overhead while achieving significant performance gains on specific tasks.
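The core of GRPO can be sketched briefly: sample a group of outputs per prompt, score them, and normalize each reward by the group's mean and standard deviation to obtain advantages, so no separate critic network is needed. The clipped importance ratio and KL penalty of the full objective are omitted, and the reward values and log-probabilities below are placeholders:

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages: normalise each reward by its group statistics."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True) + 1e-6
    return (rewards - mean) / std

# 2 prompts, 4 sampled outputs each; rewards would come from a rule-based checker
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)                     # per-output advantages
log_probs = torch.randn(2, 4, requires_grad=True)  # placeholder policy log-probs
loss = -(adv * log_probs).mean()                   # simplified policy-gradient loss
loss.backward()
```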

DeepSeek-R1-Zero employs the GRPO algorithm, entirely skipping the RLHF and traditional SFT processes that consume computational time and resources, which makes training highly efficient and resource-light. On the AIME 2024 test set, the model's score improves from 15.6% to 71.0%, demonstrating both excellent performance and resource savings.

3. Implications of DeepSeek-R1-Zero

DeepSeek-R1-Zero starts training from a base model through pure RL. During training, the model is given prompts and asked to place its reasoning between a pair of think tags and its answer between a pair of answer tags.

The model's behavior is then optimized using the correctness and format of the final result as the reward. As the number of training steps increases, R1-Zero gradually develops long chain-of-thought (CoT) capability, and its reasoning paths grow longer and longer.
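A sketch of this kind of rule-based reward is shown below: one check for the think/answer format and one for whether the extracted answer matches a reference. The exact tag names and reward values are illustrative assumptions:

```python
import re

def format_reward(text):
    """1 if the output follows <think>...</think><answer>...</answer>, else 0."""
    pattern = r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, text, flags=re.DOTALL) else 0.0

def accuracy_reward(text, reference):
    """1 if the content of the answer tags matches the reference answer."""
    m = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

out = "<think>3*4=12, plus 5 gives 17</think><answer>17</answer>"
print(format_reward(out) + accuracy_reward(out, "17"))   # total reward: 2.0
```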

In addition, the model exhibits an 'aha moment' during training, i.e., it discovers and repairs its own earlier reasoning errors; this 'moment of epiphany' can be regarded as a concrete manifestation of emergent capability.

This pure reinforcement learning training method demonstrates a breakthrough in the ability of models to improve reasoning without human feedback.

R1-Zero completely abandons the predefined thought chain template and SFT and only relies on simple reward and punishment signals to optimize the model behavior, breaking the reliance on human-labelled data in traditional training methods.

During the training process, R1-Zero demonstrates complex behaviors such as reflection and exploration of alternative solutions and possesses the ability of autonomous learning, an important feature of general artificial intelligence (AGI).

Surprisingly, R1-Zero shows human-like complex reasoning behaviors such as reflection and multi-step verification during training, possesses the ability to self-correct and think deeply, and naturally adjusts the length of its responses to the complexity of the problem, which shows that it genuinely grasps how difficult a problem is.

These complex behaviours demonstrate that AI systems can autonomously develop advanced problem-solving strategies without explicit programming.

4. Supervised fine-tuning + reinforcement learning

From DeepSeek-V3 to DeepSeek-R1, the first stage was a cold start, in which the base model was given supervised fine-tuning (SFT) on thousands of long chain-of-thought (CoT) samples to provide initial reasoning capability;

This was followed by reasoning-oriented reinforcement learning (RORL), which improves the model's reasoning through large-scale reinforcement learning, especially on programming, mathematical, scientific, and logical reasoning tasks; then by reconstruction and data generation (RDG), which produces high-quality training data using rejection sampling and CoT prompting, covering both reasoning and non-reasoning tasks;

Finally, in the last stage, supervised fine-tuning (SFT) is applied again, together with human-preference rewards, to improve the model's generalization, usability, and safety. The advantages of this training strategy are:

improved readability, by designing an easy-to-read output format and filtering unfriendly responses; enhanced reasoning, through patterns designed from human priors; and balanced capability, improving reasoning while preserving the model's generality and safety.

However, this strategy also has drawbacks: the training process is complex and requires more time and resource management, and the introduction of human-designed priors may bring bias.

The key to DeepSeek’s success lies in rational project planning, subdividing the R&D cycle into multiple phases, each with clear goals and timelines;

Effective teamwork, adopting a multi-agent collaborative learning mechanism to improve overall efficiency;

Innovative algorithm design, such as the introduction of multi-agent collaborative learning and experience replay to improve learning efficiency;

Balancing supervised learning and reinforcement learning, combining the advantages of the two learning methods through a staged training strategy.

This training method not only improves the model performance but also significantly reduces the training cost, providing new ideas for developing the AI industry.

Conclusion

(1) DeepSeek's breakthroughs in AI are mainly reflected in model and algorithm innovations, hardware and software co-optimization, and overall training efficiency improvement. DeepSeek-V3 adopts the MoE model architecture, which achieves efficient computational resource utilization through fine-grained design and a shared-expert strategy.

The sparse activation mechanism and load balancing without auxiliary loss in the MoE architecture significantly improve the model efficiency and performance, especially when dealing with large-scale data and complex tasks. The innovative MLA mechanism excels in handling long sequences by reducing memory usage and accelerating the inference process, reducing the model training and inference costs.

(2) When training DeepSeek-V3, the team introduced techniques such as MTP and FP8 mixed-precision training. MTP significantly improves the model’s contextual understanding and training efficiency by predicting multiple words simultaneously. FP8 mixed-precision training utilises an 8-bit floating point representation of the data, reducing memory consumption and computational overhead while maintaining high performance.

To maximize GPU computational efficiency, the team also directly wrote and optimized PTX code, making it far more efficient than its competitors. These innovative approaches effectively improve overall training efficiency and increase the computational utilization of each GPU hour.

(3) In the training process of DeepSeek-R1-Zero, the team adopted a new reinforcement learning algorithm, GRPO, which skips the traditional SFT and RLHF phases. GRPO simplifies the training process by optimizing against the average reward of multiple sampled outputs and significantly improves the model's reasoning ability. DeepSeek-R1-Zero, trained from a base model through pure reinforcement learning, demonstrates breakthrough self-learning and reasoning ability.

In conclusion, DeepSeek has successfully trained a world-class open-source reasoning model in a relatively short period with limited and relatively lower-performance GPU resources, creating a new path for global AI research and development. This breakthrough not only proves DeepSeek's technological strength but also signals a radical change in the future AI landscape and opens a door to boundless possibilities.