DeBERTa-v3-base is an enhanced version of Microsoft's DeBERTa model that replaces masked language modeling with ELECTRA-style replaced token detection pretraining and adds Gradient-Disentangled Embedding Sharing (GDES). It retains the original DeBERTa's disentangled attention mechanism and enhanced mask decoder, which enable more effective representation learning than BERT or RoBERTa. The base version has 12 layers, a hidden size of 768, and 86 million backbone parameters; its 128K-token vocabulary adds a further 98M parameters in the embedding layer. The model was pretrained on 160GB of text, the same data used for DeBERTa V2. Among models of comparable size it reports strong results on NLU benchmarks such as SQuAD 2.0 and MNLI, outperforming RoBERTa-base and ELECTRA-base. The model is compatible with Hugging Face Transformers, PyTorch, TensorFlow, and Rust, and is widely used for text classification and fill-mask tasks.
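
The snippet below is a minimal sketch of loading the checkpoint through Hugging Face Transformers and running a forward pass in PyTorch; it assumes the `transformers`, `torch`, and `sentencepiece` packages are installed and uses the Hub id `microsoft/deberta-v3-base`.

```python
# Minimal loading sketch: backbone only, no task head.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-base")

# Encode a sentence and run it through the 12-layer encoder.
inputs = tokenizer("DeBERTa-v3 uses disentangled attention.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The last dimension matches the base model's hidden size of 768.
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])
```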
Features
- 12-layer transformer with 86M backbone parameters
- Pretrained with ELECTRA-style replaced token detection for better sample efficiency
- Uses Gradient-Disentangled Embedding Sharing (GDES) between generator and discriminator
- Disentangled attention mechanism for improved context understanding
- Trained on 160GB of data for robust NLU performance
- Outperforms RoBERTa-base and ELECTRA-base on SQuAD 2.0 and MNLI
- Compatible with PyTorch, TensorFlow, and Hugging Face Transformers
- Supports tasks like masked language modeling and fine-tuning for classification (see the sketch after this list)
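
As a sketch of the fine-tuning path, the example below wires the backbone into a sequence-classification head and runs a single training step; the three-label setup and the premise/hypothesis pair are illustrative (MNLI-style) and not part of the released checkpoint.

```python
# Single fine-tuning step sketch for sequence classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=3,  # illustrative MNLI-style label set
)

# Encode a premise/hypothesis pair and attach a toy label.
inputs = tokenizer(
    "A man is playing a guitar.",
    "A person is making music.",
    return_tensors="pt",
)
labels = torch.tensor([0])  # 0 = entailment in this toy label scheme

# One optimization step: forward pass, cross-entropy loss, backward pass, update.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))
```

In practice the same setup scales to a full training loop or the Transformers `Trainer` over a labeled dataset; only the data loading and evaluation code change.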