Showing 58 open source projects for "caption"

  • 1
    CogVLM

    A state-of-the-art open visual language model

    ...It includes checkpoints for chat, base, and grounding variants, plus recipes for model-parallel inference and LoRA fine-tuning. The documentation covers task prompts for general dialogue, visual grounding (box→caption, caption→box, caption+boxes), and GUI agent workflows that produce structured actions with bounding boxes.
    Downloads: 3 This Week
  • 2
    abogen

    Generate audiobooks from EPUBs, PDFs and text with captions

    ...This can be very useful for accessibility, content consumption on the go, or for users who prefer audio over reading. The repository supports handling common ebook formats and generating outputs that combine audio plus caption metadata. By automating text-to-speech for arbitrary documents, abogen reduces the friction of producing audiobooks and could be integrated into larger workflows (e.g., batch converting a library of texts).
    Downloads: 8 This Week
  • 3
    img2dataset

    Easily turn large sets of image urls to an image dataset

    Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine. Also supports saving captions for URL+caption datasets. Opt-out directives: websites can send the HTTP headers X-Robots-Tag: noai, X-Robots-Tag: noindex, X-Robots-Tag: noimageai, and X-Robots-Tag: noimageindex. By default, img2dataset ignores images served with such headers.
    Downloads: 3 This Week
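    The opt-out behavior described for img2dataset can be illustrated with a minimal sketch. This is not img2dataset's own code: the helper name `is_opted_out` and the header-pair input format are assumptions made for illustration, and the sketch ignores user-agent-scoped values such as `googlebot: noindex`.

```python
# Hypothetical helper for illustration; not taken from img2dataset.
OPT_OUT_DIRECTIVES = {"noai", "noindex", "noimageai", "noimageindex"}


def is_opted_out(headers):
    """Return True if any X-Robots-Tag header carries an opt-out directive.

    `headers` is an iterable of (name, value) pairs, so a response that
    repeats the X-Robots-Tag header is handled naturally.
    """
    for name, value in headers:
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        # One header value may hold several comma-separated directives.
        directives = {part.strip().lower() for part in value.split(",")}
        if directives & OPT_OUT_DIRECTIVES:
            return True
    return False


if __name__ == "__main__":
    response_headers = [
        ("Content-Type", "image/jpeg"),
        ("X-Robots-Tag", "noai, noimageai"),
    ]
    print(is_opted_out(response_headers))
```

    A downloader built this way would skip any image whose response headers make the check return True.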
  • 4
    RealtimeSTT

    A robust, efficient, low-latency speech-to-text library

    RealtimeSTT is a Python-based realtime speech-to-text engine emphasizing low latency, wake-word detection, voice activity detection, and automatic speech segmentation. It provides asynchronous callbacks, nanosecond-precision timestamps, and CLI tools, suitable for building voice assistants, meeting transcribers, or live caption systems.
    Downloads: 5 This Week
  • 5
    YoutubeExplode

    Abstraction layer over YouTube's internal API

    YoutubeExplode is a .NET library that provides a high-level abstraction for interacting with YouTube data, enabling developers to retrieve metadata and download media streams programmatically. The project exposes a clean API that allows applications to query videos, playlists, channels, and search results without relying on the official YouTube Data API. Under the hood, the library parses raw page data and leverages reverse-engineered internal endpoints to obtain structured information and...
    Downloads: 14 This Week
  • 6
    Plyr

    Simple HTML5, YouTube and Vimeo player

    ...Fullscreen - supports native fullscreen with fallback to "full window" modes. Picture-in-Picture - supports picture-in-picture mode. Multiple captions - support for multiple caption tracks.
    Downloads: 8 This Week
  • 7
    AiToEarn

    Let's use AI to Earn

    ...The project supports matrix publishing to major global platforms like TikTok, YouTube, Instagram, Facebook, Pinterest, Twitter (X), and several Chinese social networks, enabling a “create once, publish everywhere” workflow. AI automation assists with tasks like title and caption creation, batch content generation, and optimization for each channel’s format and audience. Developers can run or extend AiToEarn locally using Node.js or via its desktop and web apps, and the open-source architecture encourages customization and community contributions.
    Downloads: 13 This Week
  • 8
    Verticals v3

    Automated YouTube Shorts pipeline

    Verticals v3 is an automated content generation workflow designed to create and process YouTube Shorts videos programmatically. It combines multiple tools and scripts to handle tasks such as downloading source material, editing clips, adding subtitles, and formatting output for vertical video platforms. The pipeline emphasizes automation, allowing users to produce short-form content at scale with minimal manual intervention. It integrates FFmpeg and other media processing tools to handle...
    Downloads: 2 This Week
  • 9
    DeepSeek VL2

    Mixture-of-Experts Vision-Language Models for Advanced Multimodal

    ...It combines image and text inputs into a unified embedding / reasoning space so that you can query with text and image jointly (e.g. “What’s going on in this scene?” or “Generate a caption appropriate to context”). The model supports both image understanding (vision tasks) and multimodal reasoning, and is likely used as a component in agent systems to process visual inputs as context for downstream tasks. The repository includes evaluation results (e.g. image/text alignment scores, common VL benchmarks), configuration files, and model weights (where permitted). ...
    Downloads: 9 This Week
  • 10
    CLIP

    CLIP, Predict the most relevant text snippet given an image

    CLIP (Contrastive Language-Image Pretraining) is a neural model that links images and text in a shared embedding space, allowing zero-shot image classification, similarity search, and multimodal alignment. It was trained on large sets of (image, caption) pairs using a contrastive objective: images and their matching text are pulled together in embedding space, while mismatches are pushed apart. Once trained, you can give it any text labels and ask it to pick which label best matches a given image—even without explicit training for that classification task. The repository provides code for model architecture, preprocessing transforms, evaluation pipelines, and example inference scripts. ...
    Downloads: 1 This Week
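    The contrastive objective described for CLIP can be sketched in plain Python. This is an illustrative toy, not the repository's training code: the function names are hypothetical, the temperature of 0.07 is an assumed fixed value, and embeddings are assumed to be non-zero lists of floats where pair i of images matches pair i of captions.

```python
import math


def _normalize(vec):
    # Scale to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, caption) pairs."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    # logits[i][j] = scaled similarity between image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


if __name__ == "__main__":
    image_embs = [[1.0, 0.0], [0.0, 1.0]]
    matched = contrastive_loss(image_embs, [[1.0, 0.0], [0.0, 1.0]])
    swapped = contrastive_loss(image_embs, [[0.0, 1.0], [1.0, 0.0]])
    print(matched < swapped)
```

    Matching pairs sit on the diagonal of the similarity matrix, so the loss falls as matched embeddings move together and mismatched ones move apart.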
  • 11
    Qwen-Image-Layered

    Qwen-Image-Layered: Layered Decomposition for Inherent Editablity

    Qwen-Image-Layered is an extension of the Qwen series of multimodal models that introduces layered image understanding, enabling the model to reason about hierarchical visual structures — such as separating foreground, background, objects, and contextual layers within an image. This architecture allows richer semantic interpretation, enabling use cases such as scene decomposition, object-level editing, layered captioning, and more fine-grained multimodal reasoning than with flat image...
    Downloads: 4 This Week
  • 12
    4M

    4M: Massively Multimodal Masked Modeling

    4M is a training framework for “any-to-any” vision foundation models that uses tokenization and masking to scale across many modalities and tasks. The same model family can classify, segment, detect, caption, and even generate images, with a single interface for both discriminative and generative use. The repository releases code and models for multiple variants (e.g., 4M-7 and 4M-21), emphasizing transfer to unseen tasks and modalities. Training/inference configs and issues discuss things like depth tokenizers, input masks for generation, and CUDA build questions, signaling active research iteration. ...
    Downloads: 0 This Week
  • 13
    OpenAI DALL·E AsyncImage SwiftUI

    OpenAI swift async text to image for SwiftUI app using OpenAI

    ...You need to have Xcode 13 installed in order to have access to Documentation Compiler (DocC). OpenAI's text-to-image model DALL-E 2 is a recent example of diffusion models. It uses diffusion models for both the model's prior (which produces an image embedding given a text caption) and the decoder that generates the final image. In machine learning, diffusion models, also known as diffusion probabilistic models, are a class of latent variable models. They are Markov chains trained using variational inference. The goal of diffusion models is to learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space.
    Downloads: 0 This Week
  • 14
    DeepSeek VL

    Towards Real-World Vision-Language Understanding

    DeepSeek-VL is DeepSeek’s initial vision-language model that anchors their multimodal stack. It enables understanding and generation across visual and textual modalities—meaning it can process an image + a prompt, answer questions about images, caption, classify, or reason about visuals in context. The model is likely used internally as the visual encoder backbone for agent use cases, to ground perception in downstream tasks (e.g. answering questions about a screenshot). The repository includes model weights (or pointers to them), evaluation metrics on standard vision + language benchmarks, and configuration or architecture files. ...
    Downloads: 0 This Week
  • 15
    Krajee

    An enhanced HTML 5 file input for Bootstrap 5.x/4.x./3.x

    An enhanced HTML 5 file input for Bootstrap 5.x, 4.x, or 3.x with file preview for various file types, multiple selection, and more. The plugin offers a simple way to set up an advanced file picker/upload control built to work specifically with Bootstrap CSS3 styles. It further enhances file input functionality by supporting previews of a wide variety of files, e.g. images, text, HTML, video, audio, Flash, and objects. In addition, it includes AJAX based...
    Downloads: 0 This Week
  • 16
    Exile is a Python-based image collection manager application. Easily add metadata to photos, including Caption, People, Event, Location, and Tags. No external database: metadata is stored in Exif/IPTC/XMP tags. Three-level categorization for easy photo sorting and management. Clone GPS data between files.
    Downloads: 0 This Week
  • 17
    Extended Dreambooth How-To Guides

    Implementation of Dreambooth

    Extended Dreambooth How-To Guides is an implementation and extended toolkit for fine-tuning Stable Diffusion models using the DreamBooth technique, enabling users to train AI image generators to reproduce specific subjects, styles, or identities from a small set of reference images. The project adapts and expands upon earlier DreamBooth research by providing practical scripts, notebooks, and workflows that allow users to train personalized models on local machines, cloud environments, or...
    Downloads: 0 This Week
  • 18

    Cindy components for Delphi 7 and newer

    Packages with more than 80 components for all Delphi versions

    Packages with 86 components for all Delphi versions (since D7) to build Windows 32/64-bit applications: VCL controls (labels, buttons, panels, Edits, TabControls, StaticText) with features like background gradient, colored bevels, wallpaper, shadowText, caption orientation, etc. TcyCommunicate and TcyCommRoomConnector allow communication between applications running in the same computer session. TcySearchFiles and TcyCopyfiles allow searching and copying files, respectively, with pause/resume/abort features. TcyResizer allows moving and resizing components at run-time like Delphi 2009 does. Advanced DB Express components (tested with MySQL) for easy table data handling (tcyDbxTable or TcyDbxSimpleTable), schema modifications (TcyDbxUpdateSql), reconcile handling (TcyDBXReconcileError) and table creation (cyDbxImportDataset1). ...
    Downloads: 4 This Week
  • 19
    OpenFlamingo

    An open-source framework for training large multimodal models

    Welcome to our open source version of DeepMind's Flamingo model! In this repository, we provide a PyTorch implementation for training and evaluating OpenFlamingo models. We also provide an initial OpenFlamingo 9B model trained on a new Multimodal C4 dataset (coming soon). Please refer to our blog post for more details. This repo is still under development, and we hope to release better-performing and larger OpenFlamingo models soon. If you have any questions, please feel free to open an...
    Downloads: 0 This Week
  • 20
    tgcf

    The ultimate tool to automate custom telegram message forwarding

    The ultimate tool to automate custom telegram message forwarding. Live-syncer, Auto-poster, backup-bot, cloner, chat-forwarder, duplicator, ... Call it whatever you like! tgcf is an advanced telegram chat forwarding automation tool that can fulfill all your custom needs.
    Downloads: 3 This Week
  • 21
    iziModal

    Elegant, responsive, flexible and lightweight modal plugin with jQuery

    Elegant, responsive, flexible and lightweight modal plugin with jQuery. All modern browsers are supported (Tested in Chrome, Firefox, Opera, Safari, IE9+ and Edge).
    Downloads: 0 This Week
  • 22
    Krajee bootstrap-star-rating

    A simple yet powerful jQuery star rating plugin with fractional rating

    ...The plugin uses Bootstrap markup and styling by default, but it can be overridden with any other CSS markup. The rating control can be scaled to any size, including the stars, caption, and clear button. Five prebuilt size templates are available: xl, lg, md, sm, and xs. However, you can configure your own size through simple CSS manipulation. You can use the HTML 5 number input for polyfill and the plugin will automatically use the number attributes like min, max, and step. However, number inputs have a problem with decimal values on the Chrome Browser. ...
    Downloads: 1 This Week
  • 23
    pytube

    A lightweight, dependency-free Python library

    Pytube is a lightweight, dependency-free Python library that enables downloading YouTube videos and audio streams with minimal setup. It supports video resolution selection, progressive or adaptive streams, and caption downloads. Pytube is ideal for automation scripts, archiving tools, and media applications that need to interface with YouTube content programmatically.
    Downloads: 7 This Week
  • 24
    UniVL

    Official implementation for UniVL video and language training models

    UniVL is a video-language pre-training model. It is designed with four modules and five objectives for both video-language understanding and generation tasks. It is also a flexible model for most multimodal downstream tasks, considering both efficiency and effectiveness.
    Downloads: 0 This Week
  • 25
    HTML Article Generator

    Quickly create custom webpages from your content

    ...These webpages can be customised to give a unique appearance, with a selection of 5 different themes. Other features include the ability to save the current values you have entered and restore these values after future changes have been made. Images can have caption text added to them and given alt text to improve accessibility. Each webpage can also be given a favourite icon. This tool is useful for those who are building a website or blog and quickly need to generate a series of webpages from existing content.
    Downloads: 0 This Week