RoBERTa-large is a robustly optimized transformer model for English, pretrained by Facebook AI with a masked language modeling (MLM) objective. Unlike BERT, which was pretrained only on BookCorpus and English Wikipedia, RoBERTa was trained on 160GB of text drawn from BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories, with masks re-sampled dynamically during training rather than fixed once at preprocessing time. It uses a byte-level BPE tokenizer and was trained with a sequence length of 512 and a batch size of 8K across 1024 V100 GPUs.

RoBERTa improves on BERT across multiple NLP tasks by dropping the next-sentence prediction objective and training longer with larger batches on more data. With 355 million parameters, it learns bidirectional representations of sentences and performs strongly on downstream tasks such as sequence classification, token classification, and question answering. However, it reflects social biases present in its training data, so caution is advised when deploying it in sensitive contexts.
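Below is a minimal usage sketch with the Hugging Face Transformers `pipeline` API, assuming the `transformers` package is installed; it fills RoBERTa's `<mask>` token using the pretrained MLM head.

```python
# Minimal fill-mask sketch (assumes: pip install transformers).
from transformers import pipeline

# Load roberta-large together with its pretrained masked-language-modeling head.
unmasker = pipeline("fill-mask", model="roberta-large")

# RoBERTa uses "<mask>" (not BERT's "[MASK]") as its mask token.
for prediction in unmasker("The goal of life is <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```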
## Features
- Pretrained using masked language modeling on 160GB of text
- 355M parameters for rich contextual language understanding
- Learns bidirectional representations (unlike GPT)
- Dynamic token masking per epoch for better generalization (see the masking sketch after this list)
- Optimized for downstream tasks like QA and classification (see the fine-tuning sketch after this list)
- Compatible with PyTorch, TensorFlow, and JAX
- Case-sensitive vocabulary with 50K BPE tokens
- Open-source under the MIT license
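As a hedged illustration of the dynamic masking feature, the sketch below uses `DataCollatorForLanguageModeling` from Transformers, which re-draws masked positions every time a batch is built. RoBERTa itself was pretrained with fairseq, so this only mirrors the idea rather than reproducing the original training pipeline.

```python
# Sketch of dynamic masking: masked positions are randomly re-drawn on every call,
# so the same sentence typically gets different masks in different epochs.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15  # 15% masking rate, as in BERT/RoBERTa
)

example = tokenizer("Dynamic masking re-samples the masked tokens for every batch.")
for epoch in range(2):
    batch = collator([example])
    # The positions holding the <mask> token id usually differ between the two printouts.
    print(f"epoch {epoch}:", batch["input_ids"][0].tolist())
```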
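For the downstream-task feature, here is a hedged fine-tuning sketch using the Transformers `Trainer`; the GLUE SST-2 dataset, output path, and hyperparameters are illustrative assumptions, not values specified by this model card.

```python
# Illustrative fine-tuning sketch for binary sequence classification
# (assumes: pip install transformers datasets).
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

# SST-2 from GLUE is an illustrative choice; any labeled text dataset works the same way.
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=512),
    batched=True,
)

args = TrainingArguments(
    output_dir="roberta-large-sst2",   # hypothetical output directory
    learning_rate=1e-5,                # illustrative hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pad dynamically per batch
)
trainer.train()
```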