StarCoder is a 15.5B-parameter language model developed by the BigCode project for code generation across more than 80 programming languages. It was trained on 1 trillion tokens of permissively licensed source code from The Stack v1.2, using a Fill-in-the-Middle (FIM) training objective and Multi-Query Attention for efficient inference, with an 8192-token context window and bfloat16 precision. StarCoder can generate, complete, and refactor code in a wide range of languages, with English as the primary natural language.

While StarCoder is not an instruction-tuned model, it can act as a capable technical assistant when prompted appropriately. Developers can use it for general-purpose code generation and, through the FIM sentinel tokens, exercise fine control over the prefix, middle, and suffix of a completion. The model has some limitations: generated code may contain bugs or security issues, and output that closely resembles training data may carry licensing obligations, in which case attribution must be given. StarCoder is released under the BigCode OpenRAIL-M license.
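As a concrete sketch of basic left-to-right generation, the snippet below assumes the Hugging Face `transformers` library and the `bigcode/starcoder` checkpoint (access is gated behind acceptance of the OpenRAIL-M license on the Hugging Face Hub); the prompt string and generation settings are illustrative only.

```python
# Minimal generation sketch. Assumes `pip install transformers torch`
# and access to the license-gated bigcode/starcoder checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# torch_dtype/device_map are left to the reader; the full model
# needs roughly 30 GB of memory in bfloat16.
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Plain completion: the model continues the prompt left to right.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```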
Features
- 15.5B parameters trained on 1T tokens from 80+ programming languages
- Supports the Fill-in-the-Middle (FIM) objective for smart code editing (see the sketch after this list)
- Multi-Query Attention and 8192-token context window
- Trained on permissively licensed GitHub code (The Stack v1.2)
- Generates code in Python, JavaScript, Java, C++, and many other languages
- Includes tools for tracing output to source code for attribution
- Trained with Megatron-LM and PyTorch on 512 A100 GPUs
- Licensed under BigCode OpenRAIL-M for responsible open use
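To illustrate the FIM support mentioned above, here is a hedged infilling sketch using the FIM sentinel tokens from the StarCoder tokenizer (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`); the function being completed and the generation settings are invented for illustration.

```python
# FIM infilling sketch: wrap the known code in FIM sentinel tokens and
# let the model generate the missing middle. The sentinels are real
# StarCoder special tokens; the snippet itself is a made-up example.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prefix = "def print_hello():\n    "
suffix = "\n    return None"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
# Everything generated after <fim_middle> is the proposed middle segment.
print(tokenizer.decode(outputs[0]))
```

The prefix-suffix-middle prompt layout lets an editor integration send the code before and after the cursor and receive only the missing span back, which is what makes FIM suited to in-place editing rather than append-only completion.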