Implementation of MobileCLIP (CVPR 2024)
High-resolution models for human tasks
Video understanding codebase from FAIR for reproducing video models
Towards Real-World Vision-Language Understanding
CLIP: predict the most relevant text snippet given an image
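At inference time, CLIP-style retrieval reduces to picking the caption whose embedding is most similar to the image embedding. A minimal sketch of that scoring step in plain Python (toy embeddings; `most_relevant` is a hypothetical helper, not CLIP's actual API):

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_relevant(image_emb, text_embs):
    # index of the text embedding closest to the image embedding
    scores = [cosine(image_emb, t) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

In the real model, the embeddings come from CLIP's image and text encoders, projected into a shared space and L2-normalised before the dot product.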
Ling is a MoE LLM provided and open-sourced by InclusionAI
Multimodal-Driven Architecture for Customized Video Generation
Multimodal Diffusion with Representation Alignment
Personalize Any Characters with a Scalable Diffusion Transformer
Inference script for Oasis 500M
4M: Massively Multimodal Masked Modeling
This repository contains the official implementation of FastVLM
A PyTorch library for implementing flow matching algorithms
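The core idea behind flow matching: sample a point on a probability path between noise x0 and data x1, and regress a network onto the path's velocity. For the common linear (rectified-flow) path, a dependency-free sketch of the training target (illustrative only, not the library's API; function names are hypothetical):

```python
def linear_path_sample(x0, x1, t):
    # point on the straight-line path x_t = (1 - t) * x0 + t * x1
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    # for the linear path the target velocity is constant: x1 - x0
    return [b - a for a, b in zip(x0, x1)]
```

A model v_theta(x_t, t) is then trained with a mean-squared error against `velocity_target`, and samples are drawn by integrating the learned velocity field from t = 0 to 1.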
Official DeiT repository
Hackable and optimized Transformers building blocks
Memory-efficient and performant finetuning of Mistral's models
tiktoken is a fast BPE tokeniser for use with OpenAI's models
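BPE training repeatedly merges the most frequent adjacent token pair into a new token. A toy illustration of one merge step (this is the general BPE idea, not tiktoken's actual implementation, which is optimised Rust with pretrained merge tables):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # count adjacent pairs and return the most common one
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokens, pair):
    # replace every occurrence of `pair` with a single merged token
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Encoding with a trained tokeniser replays a fixed list of such merges in priority order over the input bytes.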
Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI
Diffusion Transformer with Fine-Grained Chinese Understanding
A Customizable Image-to-Video Model based on HunyuanVideo
NVIDIA Isaac GR00T N1.5 is the world's first open foundation model
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Controllable & emotion-expressive zero-shot TTS
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
Language modeling in a sentence representation space