Contexts Optical Compression
Code for running inference and finetuning with SAM 3 model
Official inference repo for FLUX.2 models
CLIP: predict the most relevant text snippet given an image
A Powerful Native Multimodal Model for Image Generation
Dataset of GPT-2 outputs for research in detection, biases, and more
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Industrial-level controllable zero-shot text-to-speech system
tiktoken is a fast BPE tokeniser for use with OpenAI's models
A Unified Framework for Text-to-3D and Image-to-3D Generation
Multimodal-Driven Architecture for Customized Video Generation
Generate Any 3D Scene in Seconds
Large language model and vision-language model based on linear attention
Towards Real-World Vision-Language Understanding
Diffusion Transformer with Fine-Grained Chinese Understanding
Large Multimodal Models for Video Understanding and Editing
Implementation of "MobileCLIP" (CVPR 2024)
Unified Multimodal Understanding and Generation Models
The official PyTorch implementation of Google's Gemma models
Multimodal Diffusion with Representation Alignment
Official code for Style Aligned Image Generation via Shared Attention
Memory-efficient and performant finetuning of Mistral's models
Pushing the Limits of Mathematical Reasoning in Open Language Models
Open-weight, large-scale hybrid-attention reasoning model
Phi-3.5 for Mac: Locally-run Vision and Language Models