Evaluating the DeepSeek tech stack with a critical eye
We've obtained and evaluated a pre-print DeepSeek Technical Report...

DeepSeek-V3: Core Contributions and Characteristics

The report details DeepSeek-V3, a Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated for each token. The model's design prioritizes efficient inference and cost-effective training, incorporating specific architectural components and training strategies.

Architectural Innovations:

- Multi-head Latent Attention (MLA): DeepSeek-V3 uses MLA to reduce the Key-Value (KV) cache during inference via low-rank compression of attention keys and values. Keys and values are jointly compressed into a small latent vector, and only that latent vector is cached during inference (queries are also low-rank compressed to reduce activation memory during training). Caching the compressed latent significantly reduces the memory footprint while maintaining performance. A minimal sketch of the caching idea appears after this section.

- DeepSeekMoE with Auxiliary-Loss-Free Load Balancing: The model employs the DeepSeekMoE architecture, using finer-grained experts and isolating some of them as shared experts. It introduces an auxiliary-loss-free load-balancing strategy to minimize the performance degradation that imbalanced expert load commonly causes in MoE training. Rather than relying on conventional auxiliary losses, the strategy adds a dynamic bias term to each expert's affinity score when selecting the routed experts, steering the load toward balance. A complementary sequence-wise auxiliary loss prevents extreme imbalance within a single sequence. The bias-update rule is sketched below.

- Multi-Token Prediction (MTP): The model incorporates a multi-token prediction objective, extending the prediction scope to multiple future tokens at each position. The implementation uses sequential modules to predict the additional tokens and keeps the complete causal chain at each prediction depth. During inference, the MTP modules can be discarded so the main model functions normally, or repurposed for speculative decoding to improve generation latency. A sketch of one MTP depth is given after this section.

Infrastructure and Training Framework:

- Compute Infrastructure: DeepSeek-V3 was trained on a cluster of 2,048 NVIDIA H800 GPUs, connected by NVLink within nodes and by InfiniBand (IB) across nodes.

- DualPipe Algorithm: A pipeline-parallelism method named DualPipe overlaps computation and communication across the forward and backward passes. It divides each chunk of computation into components and rearranges them, with manual adjustment of the GPU resources devoted to computation versus communication, so that communication is hidden during execution.

- Cross-Node All-to-All Communication: The authors implement custom kernels for cross-node all-to-all communication, leveraging both IB and NVLink bandwidth. A node-limited routing mechanism caps the number of receiving nodes for each token, and only 20 SMs are used to implement the all-to-all communication.

- Memory Saving Techniques: Several methods reduce memory usage: recomputing RMSNorm outputs and MLA up-projections during back-propagation, storing the exponential moving average (EMA) of model parameters in CPU memory, and sharing the embedding layer and output head between the MTP module and the main model. A recomputation sketch follows this section.

- FP8 Training: The model uses a fine-grained mixed-precision framework based on the FP8 data format to accelerate training and reduce GPU memory usage. To preserve accuracy, the authors introduce tile-wise and block-wise quantization to contain feature outliers and promote intermediate accumulation of FP8 GEMMs to higher precision on CUDA cores; key components of the architecture retain FP32 or BF16 precision. The quantization scheme is sketched at the end of this section.
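To make the MLA bullet concrete, here is a minimal PyTorch sketch of the caching idea: a toy attention layer that caches only a low-rank latent c_kv and re-expands keys and values from it on demand. The module name, dimensions, and the omission of RoPE and causal masking are simplifications of ours, not details from the report.

```python
import torch
import torch.nn as nn

class LowRankKVAttention(nn.Module):
    """Toy MLA-style attention: keys/values are derived from a small latent
    c_kv, and only c_kv is kept in the inference-time cache."""

    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)  # down-projection; its output is cached
        self.w_uk = nn.Linear(d_latent, d_model, bias=False)   # up-projection to keys
        self.w_uv = nn.Linear(d_latent, d_model, bias=False)   # up-projection to values
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        # x: [batch, new_tokens, d_model]; latent_cache: [batch, past_tokens, d_latent]
        b, t, _ = x.shape
        c_kv = self.w_dkv(x)
        if latent_cache is not None:
            c_kv = torch.cat([latent_cache, c_kv], dim=1)       # grow the (small) latent cache
        k = self.w_uk(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)  # causal mask omitted
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out), c_kv                              # latent to cache for the next step

# Incremental decoding reuses the cached latent rather than full keys/values:
layer = LowRankKVAttention()
y, cache = layer(torch.randn(1, 16, 512))          # prefill 16 tokens
y, cache = layer(torch.randn(1, 1, 512), cache)    # decode one more token
```

The point of the sketch is the cache shape: per token it stores d_latent numbers instead of two full per-head key/value vectors.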
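The auxiliary-loss-free load balancing can be sketched in a few lines. This is our reading of the mechanism, not the authors' code: the bias only influences which experts are selected, the gating weights still come from the raw affinities, and a per-step update nudges the bias against over- or under-loaded experts. The function names and the gamma value are illustrative.

```python
import torch

def route_with_bias(affinities, expert_bias, k=8):
    """Top-k selection uses bias-adjusted scores, but the gating weights that
    scale expert outputs come from the original (unbiased) affinities."""
    # affinities: [tokens, n_experts], assumed non-negative (e.g. sigmoid outputs)
    topk_idx = torch.topk(affinities + expert_bias, k, dim=-1).indices
    gates = torch.gather(affinities, -1, topk_idx)
    gates = gates / gates.sum(dim=-1, keepdim=True)   # normalize the selected weights
    return topk_idx, gates

def update_expert_bias(expert_bias, topk_idx, n_experts, gamma=1e-3):
    """After each step, push the bias down for overloaded experts and up for
    underloaded ones, so future routing drifts toward a balanced load."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    return expert_bias - gamma * torch.sign(load - load.mean())

# Example: 1024 tokens routed across 64 experts.
n_experts, bias = 64, torch.zeros(64)
scores = torch.sigmoid(torch.randn(1024, n_experts))
idx, gates = route_with_bias(scores, bias)
bias = update_expert_bias(bias, idx, n_experts)
```

Because the bias never enters the loss, balance is enforced without the gradient interference that conventional auxiliary losses introduce.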
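For the MTP bullet, the sketch below shows one extra prediction depth under our assumptions: the previous depth's hidden state and the embedding of the token one position further ahead are normalized, concatenated, projected, passed through a transformer block, and scored with the shared output head. The class MTPDepth, the block configuration, and the omitted causal mask are ours; nn.RMSNorm requires PyTorch >= 2.4.

```python
import torch
import torch.nn as nn

class MTPDepth(nn.Module):
    """One extra prediction depth: fuse the previous depth's hidden state with
    the embedding of the token one step further ahead, run a transformer
    block, and reuse the shared output head."""

    def __init__(self, d_model, shared_embed, shared_head):
        super().__init__()
        self.embed, self.head = shared_embed, shared_head   # shared with the main model
        self.norm_h = nn.RMSNorm(d_model)
        self.norm_e = nn.RMSNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, prev_hidden, shifted_tokens):
        # prev_hidden:    [b, t, d]  hidden states from the previous depth
        # shifted_tokens: [b, t]     tokens shifted one position further ahead
        fused = self.proj(torch.cat([self.norm_h(prev_hidden),
                                     self.norm_e(self.embed(shifted_tokens))], dim=-1))
        h = self.block(fused)                               # causal mask omitted for brevity
        return self.head(h), h                              # logits for the deeper tokens, hidden state for the next depth

d_model, vocab = 512, 32000
embed = nn.Embedding(vocab, d_model)
head = nn.Linear(d_model, vocab, bias=False)
mtp = MTPDepth(d_model, embed, head)
logits, h = mtp(torch.randn(2, 16, d_model), torch.randint(0, vocab, (2, 16)))
```

Chaining several such depths, each fed the previous depth's hidden state, is what keeps the causal chain intact while predicting multiple future tokens.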
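The recompute-in-backward technique mentioned under memory savings is not DeepSeek-specific; a generic PyTorch sketch using torch.utils.checkpoint illustrates the trade-off the report describes (the report's examples are RMSNorm outputs and MLA up-projections; the wrapped function here is our stand-in).

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Cheap ops with large outputs (a norm followed by an up-projection) are not
# stored during the forward pass; they are recomputed when the backward pass
# needs them, trading a little extra compute for activation memory.
norm = nn.RMSNorm(512)
up_proj = nn.Linear(512, 4096)

def norm_then_up(x):
    return up_proj(norm(x))

x = torch.randn(8, 1024, 512, requires_grad=True)
y = checkpoint(norm_then_up, x, use_reentrant=False)   # activations discarded, recomputed in backward
y.sum().backward()
```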
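Finally, a minimal sketch of the fine-grained FP8 scaling idea, assuming PyTorch's float8_e4m3fn dtype is available (PyTorch >= 2.1). It shows the block-wise (weight-style) variant; the 1x128 tile-wise variant for activations applies the same per-segment scaling along rows. The actual framework fuses dequantization into the GEMM and promotes intermediate accumulation to CUDA cores, which this sketch does not attempt.

```python
import torch

FP8_MAX = 448.0   # largest magnitude representable in float8_e4m3fn

def blockwise_fp8_quantize(w, block=128):
    """Block-wise scaling: each (block x block) tile gets its own scale, so an
    outlier only inflates the scale of its own tile, not the whole tensor."""
    rows, cols = w.shape                       # assumed divisible by `block` for brevity
    q = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    scales = torch.empty(rows // block, cols // block)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = tile.abs().max().clamp(min=1e-12) / FP8_MAX
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = (tile / s).to(torch.float8_e4m3fn)
    return q, scales                           # a GEMM consumes q and rescales outputs by the per-tile scales

q, scales = blockwise_fp8_quantize(torch.randn(256, 256))
```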