OpenVLA 7B is a multimodal vision-language-action model trained on 970,000 robot manipulation episodes from the Open X-Embodiment dataset. It takes camera images and natural language instructions as input and outputs normalized 7-DoF robot actions, enabling control of multiple robot types across various domains. Built on a LLaMA-2 language backbone with fused DINOv2 and SigLIP visual encoders, it supports both zero-shot inference on robot setups seen in pretraining and parameter-efficient fine-tuning for new domains.

The model handles real-world robotics tasks and performs robustly in environments represented in its pretraining data. Its actions consist of delta values for end-effector position and orientation plus a gripper open/close command, and can be un-normalized using robot-specific statistics. OpenVLA is MIT-licensed and fully open-source, developed collaboratively by researchers from Stanford, UC Berkeley, Google DeepMind, and the Toyota Research Institute (TRI). Deployment is straightforward via Python and the Hugging Face transformers library, with FlashAttention-2 support for efficient inference.
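As a concrete illustration of the Hugging Face deployment path, the sketch below loads the released openvla/openvla-7b checkpoint with the transformers AutoClasses and queries it for a single un-normalized action. The prompt template and the `unnorm_key` argument follow the upstream remote-code API; the image file and instruction are hypothetical placeholders for a real camera feed and task command.

```python
# Minimal inference sketch, assuming the openvla/openvla-7b checkpoint and its
# remote-code API (predict_action, unnorm_key) as exposed via trust_remote_code.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    attn_implementation="flash_attention_2",  # optional; requires the flash-attn package
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Placeholder camera input: any RGB frame loaded as a PIL image works here.
image = Image.open("frame.png").convert("RGB")
prompt = "In: What action should the robot take to pick up the red block?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# unnorm_key selects robot-specific action statistics (here: the BridgeData V2 setup)
# so the normalized 7-DoF output is converted back into real deltas.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7 values: Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper
```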
Features
- 7.5B parameter vision-language-action model
- Trained on 970K real robot episodes
- Inputs: image + language → Outputs: 7-DoF robot actions
- Built with LLaMA-2, DINOv2, and SigLIP backbones
- Zero-shot support for robots in the Open X-Embodiment mix
- Fine-tunable for new robot platforms with minimal data (see the LoRA sketch below)
- MIT-licensed and fully open-source (code + weights)
- Integrated with Hugging Face and supports FlashAttention-2
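For parameter-efficient fine-tuning on a new robot platform, the official repository ships a complete training script; the sketch below only shows the adapter setup using LoRA via the PEFT library. The rank, alpha, and dropout values are illustrative assumptions, not a prescribed recipe.

```python
# LoRA adapter setup sketch, assuming the PEFT library and the checkpoint loaded
# the same way as in the inference example above.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# Low-rank adapters on all linear layers; hyperparameters here are illustrative.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",
    init_lora_weights="gaussian",
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()  # only a small fraction of the full model trains

# From here, train on (image, instruction, action) triples from the new platform
# with the standard next-token prediction loss over the action tokens.
```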