A state-of-the-art open visual language model
Generate audiobooks from EPUBs, PDFs and text with captions
Easily turn large sets of image URLs into an image dataset
A robust, efficient, low-latency speech-to-text library
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
A simple screen-parsing tool towards a pure vision-based GUI agent
Towards Real-World Vision-Language Understanding
CLIP: Predict the most relevant text snippet given an image
4M: Massively Multimodal Masked Modeling
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
An open-source framework for training large multimodal models
The ultimate tool to automate custom Telegram message forwarding
A lightweight, dependency-free Python library
Official implementation for UniVL video and language training models
Easily and quickly add captions to your photos
FUSE-based filesystem reflecting XWindows into files
Browser interface to your memories