GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
Inference framework for 1-bit LLMs
The official PyTorch implementation of Google's Gemma models
Capable of understanding text, audio, vision, and video
Inference code for scalable emulation of protein equilibrium ensembles
Programmatic access to the AlphaGenome model
CodeGeeX2: A More Powerful Multilingual Code Generation Model
Repo of Qwen2-Audio chat & pretrained large audio language model
Implementation of "MobileCLIP" (CVPR 2024)
VMZ: Model Zoo for Video Modeling
High-resolution models for human tasks
Video understanding codebase from FAIR for reproducing video models
Towards Real-World Vision-Language Understanding
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image (see the usage sketch after this list)
Ling is an MoE LLM provided and open-sourced by InclusionAI
A Unified Framework for Text-to-3D and Image-to-3D Generation
Multimodal-Driven Architecture for Customized Video Generation
Multimodal Diffusion with Representation Alignment
Personalize Any Character with a Scalable Diffusion Transformer
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Official code for Style Aligned Image Generation via Shared Attention
4M: Massively Multimodal Masked Modeling
This repository contains the official implementation of FastVLM
FAIR Sequence Modeling Toolkit 2
ICLR 2024 Spotlight: curation/training code, metadata, and distribution
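The CLIP entry above describes predicting the most relevant text snippet for a given image. Below is a minimal zero-shot matching sketch using the `clip` package from the openai/CLIP repository (the `clip.load` / `clip.tokenize` API documented in that repo's README); the image path and candidate captions are placeholders, not values from this list.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pretrained CLIP model and its matching image preprocessing pipeline.
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: one image and a few candidate text snippets.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each text snippet.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# The snippet with the highest probability is the most relevant caption.
print("Label probs:", probs)
```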