ConvNeXt is a modernized convolutional neural network (CNN) architecture designed to rival Vision Transformers (ViTs) in accuracy and scalability while retaining the simplicity and efficiency of CNNs. It revisits classic ResNet-style backbones through the lens of transformer design trends (large depthwise kernels, inverted bottleneck blocks, layer normalization, and GELU activations) to close the performance gap between convolutions and attention-based models. ConvNeXt's clean, hierarchical structure makes it efficient for both pretraining and fine-tuning across a wide range of visual recognition tasks, and it achieves competitive or superior results on ImageNet and downstream datasets while being easier to deploy and train than transformers. The repository provides pretrained models, training recipes, and ablation studies showing how these incremental design choices collectively yield state-of-the-art performance.
Features
- Modernized CNN architecture inspired by Vision Transformer design principles
- Large-kernel (7×7) depthwise convolutions and inverted bottleneck blocks for enhanced representation
- Layer normalization and GELU activation for improved stability and accuracy
- Hierarchical structure with strong scaling properties across model sizes
- Pretrained checkpoints and training recipes for ImageNet and downstream tasks
- Efficient deployment and compatibility with existing CNN-based systems
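To make the block design above concrete, here is a minimal NumPy sketch of a single ConvNeXt-style block: a depthwise 7×7 convolution for spatial mixing, LayerNorm, a pointwise expansion to 4× the channels with GELU, a pointwise projection back, and a residual connection. This is an illustrative sketch assuming a channels-last layout, not the repository's actual (PyTorch) implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalize over the channel (last) axis, as ConvNeXt does
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_conv7x7(x, w):
    # x: (H, W, C), w: (7, 7, C); each channel is convolved independently
    # with its own 7x7 kernel ('same' padding of 3)
    H, W, C = x.shape
    xp = np.pad(x, ((3, 3), (3, 3), (0, 0)))
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 7, j:j + 7, :]           # (7, 7, C) window
            out[i, j] = np.einsum('hwc,hwc->c', patch, w)
    return out

def convnext_block(x, dw_w, w1, b1, w2, b2):
    # x: (H, W, C); inverted bottleneck widens C -> 4C -> C
    y = depthwise_conv7x7(x, dw_w)  # spatial mixing, per channel
    y = layer_norm(y)               # LayerNorm instead of BatchNorm
    y = y @ w1 + b1                 # pointwise expand to 4C
    y = gelu(y)                     # GELU instead of ReLU
    y = y @ w2 + b2                 # pointwise project back to C
    return x + y                    # residual connection
```

Note the contrast with a classic ResNet bottleneck, which narrows the channel count in the middle; ConvNeXt inverts this, widening it 4×, mirroring the MLP block of a transformer.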