BERT-base-uncased is a 110-million-parameter English language model developed by Google, pretrained with masked language modeling (MLM) and next sentence prediction (NSP) on BookCorpus and English Wikipedia. It is uncased (case-insensitive) and tokenizes text with WordPiece; the MLM objective, which randomly masks 15% of input tokens and predicts them from the surrounding context, lets the model learn deep bidirectional representations that capture both semantic and syntactic patterns.

The model is intended primarily as a feature extractor or fine-tuning base for downstream NLP tasks such as sentence classification, named entity recognition, and question answering. It has been widely used as a baseline and as a component of fine-tuned systems, achieving strong results on benchmarks like GLUE. Despite its success, BERT-base-uncased can reproduce social biases present in its training data and is not designed for factual generation or open-ended text production.
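As an illustration of the MLM objective described above, the sketch below queries the pretrained masked-language-modeling head through the Hugging Face `transformers` fill-mask pipeline. It assumes `transformers` is installed with a PyTorch (or TensorFlow) backend and is a minimal example, not a prescribed workflow; the example sentence is arbitrary.

```python
# Minimal sketch: filling in a masked token with bert-base-uncased.
# Assumes the `transformers` library and a backend such as PyTorch are installed.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the [MASK] token using context from both directions.
for prediction in unmasker("The capital of France is [MASK]."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```

Each returned prediction is a candidate token for the masked position together with its probability under the MLM head.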
## Features
- 110M parameters, uncased English version
- Pretrained with masked language modeling (MLM) and next sentence prediction (NSP)
- Trained on BookCorpus and English Wikipedia
- Outputs bidirectional contextual embeddings (768-dimensional per token; see the sketch after this list)
- Strong performance on GLUE and other NLP tasks
- Compatible with multiple frameworks (PyTorch, TensorFlow, etc.)
- WordPiece tokenization with a ~30k-token vocabulary
- Openly licensed under Apache 2.0
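The following sketch shows one way to inspect the WordPiece tokenizer and extract the contextual embeddings referenced in the list above. It assumes `transformers` and `torch` are installed; the sentence and variable names are illustrative only.

```python
# Minimal sketch: extracting per-token contextual embeddings with PyTorch.
# Assumes `transformers` and `torch` are installed; the example sentence is arbitrary.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "BERT produces contextual embeddings."

# WordPiece lowercases the input (uncased model) and may split rare words
# into subword pieces prefixed with "##".
print(tokenizer.tokenize(sentence))

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, 768):
# one 768-dimensional contextual vector per input token, including [CLS] and [SEP].
print(outputs.last_hidden_state.shape)
```

For sentence-level features, the [CLS] token's vector or a mean over the token vectors is commonly used as input to a downstream classifier.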