A state-of-the-art open visual language model
Generate audiobooks from EPUBs, PDFs and text with captions
Easily turn large sets of image URLs into an image dataset
A robust, efficient, low-latency speech-to-text library
Abstraction layer over YouTube's internal API
Simple HTML5, YouTube and Vimeo player
Let's use AI to earn
Automated YouTube Shorts pipeline
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
CLIP: predict the most relevant text snippet given an image
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
A simple screen-parsing tool towards a pure vision-based GUI agent
Software version control visualization
4M: Massively Multimodal Masked Modeling
Swift async text-to-image for SwiftUI apps using OpenAI
ShanaEncoder is an audio/video encoding program based on FFmpeg.
Towards Real-World Vision-Language Understanding
An enhanced HTML5 file input for Bootstrap 5.x/4.x/3.x
A standalone lightweight auxiliary CLI video player for BlackVideo.
Implementation of Dreambooth
Packages with more than 80 components for all Delphi versions
An open-source framework for training large multimodal models