DeepSeek-V3 is a robust Mixture-of-Experts (MoE) language model developed by DeepSeek, featuring a total of 671 billion parameters, with 37 billion activated per token. It employs Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture to enhance computational efficiency. The model introduces an auxiliary-loss-free load balancing strategy and a multi-token prediction training objective to boost performance. Trained on 14.8 trillion diverse, high-quality tokens, DeepSeek-V3 underwent supervised fine-tuning and reinforcement learning to fully realize its capabilities. Evaluations indicate that it outperforms other open-source models and rivals leading closed-source models, achieving this with a training duration of 55 days on 2,048 Nvidia H800 GPUs, costing approximately $5.58 million.
Features
- 671 billion parameters with 37 billion activated per token, ensuring robust language modeling.
- Multi-head Latent Attention (MLA) and DeepSeekMoE architecture for efficient computation.
- Auxiliary-loss-free load balancing strategy that keeps expert utilization even without the performance penalty of an auxiliary loss term (see the sketch after this list).
- Multi-token prediction training objective that densifies the training signal and improves predictive capabilities.
- Pre-trained on 14.8 trillion diverse tokens, ensuring comprehensive language understanding.
- Supervised fine-tuning and reinforcement learning to fully harness model potential.
- Outperforms other open-source models and is comparable to leading closed-source counterparts.
- Cost-effective training, completed in 55 days using 2,048 Nvidia H800 GPUs at approximately $5.58 million.
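The auxiliary-loss-free balancing mentioned above works by adding a per-expert bias to the routing scores used for expert selection (but not for the gating weights), and nudging that bias against each expert's recent load. Below is a minimal sketch of that idea in plain NumPy; the expert counts, batch size, bias-update rule, and random affinity scores are illustrative assumptions, not DeepSeek's actual implementation.

```python
# Minimal sketch of bias-based (auxiliary-loss-free) MoE load balancing.
# All sizes and the simple bias-update rule are toy assumptions.
import numpy as np

NUM_EXPERTS = 8      # toy value; DeepSeek-V3 uses far more routed experts
TOP_K = 2            # toy value; experts activated per token
BIAS_LR = 0.001      # toy bias update speed

rng = np.random.default_rng(0)
expert_bias = np.zeros(NUM_EXPERTS)          # per-expert bias, used only for routing

def route(token_affinity, bias, top_k):
    """Pick top-k experts by (affinity + bias); weight them by affinity alone."""
    selected = np.argsort(token_affinity + bias)[-top_k:]
    weights = token_affinity[selected]
    weights = weights / weights.sum()         # normalize gating weights
    return selected, weights

# Simulate batches of tokens and adjust the bias toward a balanced load.
for step in range(100):
    load = np.zeros(NUM_EXPERTS)
    for _ in range(256):                      # 256 tokens per toy batch
        affinity = rng.random(NUM_EXPERTS)    # stand-in for router affinity scores
        selected, _ = route(affinity, expert_bias, TOP_K)
        load[selected] += 1
    # Overloaded experts get their bias lowered, underloaded experts raised,
    # steering future routing toward balance without an auxiliary loss term.
    expert_bias -= BIAS_LR * (load - load.mean())

print("final per-expert load:", load.astype(int))
print("learned routing bias:", np.round(expert_bias, 4))
```

Over training, overloaded experts see their bias fall and underloaded experts see theirs rise, so routing rebalances itself without an auxiliary loss term interfering with the language-modeling gradient.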
Categories
Large Language Models (LLM), Reinforcement Learning Frameworks, Reinforcement Learning Libraries, Reinforcement Learning Algorithms, AI Models
License
MIT License
User Reviews
- Awesome mixture of experts AI model