caption free download

Showing 91 open source projects for "caption"

1

Caption Studio Pro

Fast, offline video captioning tool optimized for entry-level GPUs.

Caption Studio Pro is a lightweight desktop application designed for generating video subtitles 100% offline. Powered by the Faster-Whisper engine, it focuses on providing accurate transcriptions while maintaining maximum data privacy by processing everything locally on your machine. A key highlight of this tool is its accessibility on entry-level hardware.

Downloads: 1 This Week

Last Update: 2026-05-20
See Project
2

abogen

Generate audiobooks from EPUBs, PDFs and text with captions

...This can be very useful for accessibility, content consumption on the go, or for users who prefer audio over reading. The repository supports handling common ebook formats and generating outputs that combine audio plus caption metadata. By automating text-to-speech for arbitrary documents, abogen reduces the friction of producing audiobooks and could be integrated into larger workflows (e.g., batch converting a library of texts).

Downloads: 5 This Week

Last Update: 2026-02-06
See Project
3

img2dataset

Easily turn large sets of image URLs into an image dataset

Can download, resize and package 100M URLs in 20h on one machine. Also supports saving captions for url+caption datasets. Opt-out directives: websites can pass the HTTP headers X-Robots-Tag: noai, X-Robots-Tag: noindex, X-Robots-Tag: noimageai and X-Robots-Tag: noimageindex. By default, img2dataset will ignore images with such headers.
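The opt-out check described above can be illustrated with a minimal sketch. This is not img2dataset's own implementation: the function name `is_opted_out` is hypothetical, and the sketch ignores user-agent-specific directives (values such as `somebot: noai`), which a real crawler would likely need to handle.

```python
# Directive names are taken from the listing above; everything else
# (function name, input shape) is an assumption made for illustration.
OPT_OUT_DIRECTIVES = {"noai", "noindex", "noimageai", "noimageindex"}


def is_opted_out(headers):
    """Return True if any X-Robots-Tag header carries an opt-out directive.

    `headers` is an iterable of (name, value) pairs, because an HTTP
    response may repeat the same header several times.
    """
    for name, value in headers:
        # Header names are case-insensitive.
        if name.strip().lower() != "x-robots-tag":
            continue
        # One header value may list several comma-separated directives.
        directives = {part.strip().lower() for part in value.split(",")}
        if directives & OPT_OUT_DIRECTIVES:
            return True
    return False


if __name__ == "__main__":
    response_headers = [("Content-Type", "image/png"), ("X-Robots-Tag", "noai")]
    if is_opted_out(response_headers):
        print("skip image: site opted out")
    else:
        print("download image")
```

Taking the headers as a list of pairs rather than a dictionary keeps repeated X-Robots-Tag headers from overwriting each other.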

Downloads: 0 This Week

Last Update: 2025-08-09
See Project
4

RealtimeSTT

A robust, efficient, low-latency speech-to-text library

RealtimeSTT is a Python-based realtime speech-to-text engine emphasizing low latency, wake-word detection, voice activity detection, and automatic speech segmentation. It provides asynchronous callbacks, nanosecond-precision timestamps, and CLI tools, suitable for building voice assistants, meeting transcribers, or live caption systems.

Downloads: 0 This Week

Last Update: 2026-05-31
See Project
5

AiToEarn

Let's use AI to Earn

...The project supports matrix publishing to major global platforms like TikTok, YouTube, Instagram, Facebook, Pinterest, Twitter (X), and several Chinese social networks, enabling a “create once, publish everywhere” workflow. AI automation assists with tasks like title and caption creation, batch content generation, and optimization for each channel’s format and audience. Developers can run or extend AiToEarn locally using Node.js or via its desktop and web apps, and the open-source architecture encourages customization and community contributions.

Downloads: 17 This Week

Last Update: 4 days ago
See Project
6

CogVLM

A state-of-the-art open visual language model

...It includes checkpoints for chat, base, and grounding variants, plus recipes for model-parallel inference and LoRA fine-tuning. The documentation covers task prompts for general dialogue, visual grounding (box→caption, caption→box, caption+boxes), and GUI agent workflows that produce structured actions with bounding boxes.

Downloads: 2 This Week

Last Update: 6 days ago
See Project
7

Plyr

Simple HTML5, YouTube and Vimeo player

...Fullscreen - supports native fullscreen with fallback to "full window" modes. Picture-in-Picture - supports picture-in-picture mode. Multiple captions - support for multiple caption tracks.

Downloads: 6 This Week

Last Update: 2026-01-03
See Project
8

YoutubeExplode

Abstraction layer over YouTube's internal API

YoutubeExplode is a .NET library that provides a high-level abstraction for interacting with YouTube data, enabling developers to retrieve metadata and download media streams programmatically. The project exposes a clean API that allows applications to query videos, playlists, channels, and search results without relying on the official YouTube Data API. Under the hood, the library parses raw page data and leverages reverse-engineered internal endpoints to obtain structured information and...

Downloads: 2 This Week

Last Update: 2026-04-22
See Project
9

Verticals v3

Automated YouTube Shorts pipeline

Verticals v3 is an automated content generation workflow designed to create and process YouTube Shorts videos programmatically. It combines multiple tools and scripts to handle tasks such as downloading source material, editing clips, adding subtitles, and formatting output for vertical video platforms. The pipeline emphasizes automation, allowing users to produce short-form content at scale with minimal manual intervention. It integrates FFmpeg and other media processing tools to handle...

Downloads: 0 This Week

Last Update: 2026-06-09
See Project
10

DeepSeek VL2

Mixture-of-Experts Vision-Language Models for Advanced Multimodal

...It combines image and text inputs into a unified embedding / reasoning space so that you can query with text and image jointly (e.g. “What’s going on in this scene?” or “Generate a caption appropriate to context”). The model supports both image understanding (vision tasks) and multimodal reasoning, and is likely used as a component in agent systems to process visual inputs as context for downstream tasks. The repository includes evaluation results (e.g. image/text alignment scores, common VL benchmarks), configuration files, and model weights (where permitted). ...

Downloads: 5 This Week

Last Update: 2025-10-03
See Project
11

CLIP

CLIP, Predict the most relevant text snippet given an image

CLIP (Contrastive Language-Image Pretraining) is a neural model that links images and text in a shared embedding space, allowing zero-shot image classification, similarity search, and multimodal alignment. It was trained on large sets of (image, caption) pairs using a contrastive objective: images and their matching text are pulled together in embedding space, while mismatches are pushed apart. Once trained, you can give it any text labels and ask it to pick which label best matches a given image—even without explicit training for that classification task. The repository provides code for model architecture, preprocessing transforms, evaluation pipelines, and example inference scripts. ...
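The contrastive objective and the zero-shot label choice described above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not code from the CLIP repository: the function names are hypothetical, the embeddings are hand-written 2-D vectors rather than model outputs, and the temperature of 0.07 is an assumed fixed value (in practice it would typically be learned).

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching (image, text) pairs."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def best_label(image_emb, label_embs):
    """Zero-shot choice: index of the label embedding most similar to the image."""
    img = normalize(image_emb)
    scores = [sum(a * b for a, b in zip(img, normalize(t))) for t in label_embs]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    pairs = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_loss(pairs, pairs))              # matched pairs: small loss
    print(contrastive_loss(pairs, [pairs[1], pairs[0]]))  # mismatched pairs: large loss
    print(best_label([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]]))
```

The loss is low when each image is most similar to its own caption and high when the pairs are shuffled, which is the "pulled together, pushed apart" behavior the description refers to.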

Downloads: 0 This Week

Last Update: 2026-03-25
See Project
12

4M

4M: Massively Multimodal Masked Modeling

4M is a training framework for “any-to-any” vision foundation models that uses tokenization and masking to scale across many modalities and tasks. The same model family can classify, segment, detect, caption, and even generate images, with a single interface for both discriminative and generative use. The repository releases code and models for multiple variants (e.g., 4M-7 and 4M-21), emphasizing transfer to unseen tasks and modalities. Training/inference configs and issues discuss things like depth tokenizers, input masks for generation, and CUDA build questions, signaling active research iteration. ...

Downloads: 0 This Week

Last Update: 2025-10-08
See Project
13

OmniParser

A simple screen parsing tool towards pure vision based GUI agent

...To achieve this, OmniParser curates an interactable icon detection dataset containing 67,000 unique screenshot images labeled with bounding boxes of interactable icons derived from DOM trees. Additionally, a collection of 7,000 icon-description pairs is used to fine-tune a caption model that extracts the functional semantics of detected elements. Evaluations on benchmarks such as SeeClick, Mind2Web, and AITW demonstrate that OmniParser outperforms GPT-4V baselines, even when using only screenshot inputs without additional information.

Downloads: 0 This Week

Last Update: 2025-09-09
See Project
14

Qwen-Image-Layered

Qwen-Image-Layered: Layered Decomposition for Inherent Editability

Qwen-Image-Layered is an extension of the Qwen series of multimodal models that introduces layered image understanding, enabling the model to reason about hierarchical visual structures — such as separating foreground, background, objects, and contextual layers within an image. This architecture allows richer semantic interpretation, enabling use cases such as scene decomposition, object-level editing, layered captioning, and more fine-grained multimodal reasoning than with flat image...

Downloads: 0 This Week

Last Update: 2026-01-05
See Project
15

WinExplorer

WinExplorer is a utility that shows all of the system's windows in a hierarchical display. For every window in the hierarchy, you can view its properties, such as handle, class name, caption, size, position and more. You can also modify some properties, such as Caption and Visible/Enabled. This utility is released as freeware with full source code. You can freely use, distribute, and modify the source code of this utility without restrictions. However, if you publicly release a modified version of this utility, you should include the original copyright notice.

1 Review

Downloads: 117 This Week

Last Update: 2024-08-05
See Project
16

ShanaEncoder

ShanaEncoder is an audio/video encoding program based on FFmpeg.

Main features:
- Easy to use for both beginners and professionals.
- Fast encoding speed and professional quality.
- Closed captions, subtitle overlay, logo, crop, segment, and many other features.
- Support for H.264 (High 10) decoding/encoding.
- Support for Unicode.
Source: https://shana.pe.kr/ffmpeg

Downloads: 1,619 This Week

Last Update: 2025-02-03
See Project
17

OpenAI DALL·E AsyncImage SwiftUI

OpenAI swift async text to image for SwiftUI app using OpenAI

...You need to have Xcode 13 installed in order to have access to Documentation Compiler (DocC). OpenAI's text-to-image model DALL-E 2 is a recent example of diffusion models. It uses diffusion models for both the model's prior (which produces an image embedding given a text caption) and the decoder that generates the final image. In machine learning, diffusion models, also known as diffusion probabilistic models, are a class of latent variable models. They are Markov chains trained using variational inference. The goal of diffusion models is to learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space.

Downloads: 0 This Week

Last Update: 2025-08-14
See Project
18

Gource

Software version control visualization

Software projects are displayed by Gource as an animated tree with the root directory of the project at its centre. Directories appear as branches with files as leaves. Developers can be seen working on the tree at the times they contributed to the project. Gource includes built-in log generation support for Git, Mercurial, Bazaar and SVN. Gource can also parse logs produced by several third party tools for CVS repositories. Gource is a visualization tool for source control repositories. The...

Downloads: 0 This Week

Last Update: 2026-03-06
See Project
19

gexile4

This is a Gtk4/Python3 port of the Exile project (https://sourceforge.net/projects/gexile/). Exile is a Python-based image collection manager application. Easily add metadata to photos, including Caption, People, Event, Location and Tags. No external database: stores metadata in Exif/IPTC/Xmp tags. Three-level categorization for easy photo sorting/management. Clone GPS data between files.

Downloads: 0 This Week

Last Update: 2026-05-03
See Project
20

Exile

Exile is a Python-based image collection manager application. Easily add metadata to photos, including Caption, People, Event, Location and Tags. No external database: stores metadata in Exif/IPTC/Xmp tags. Three-level categorization for easy photo sorting/management. Clone GPS data between files.

Downloads: 0 This Week

Last Update: 2026-05-03
See Project
21

DeepSeek VL

Towards Real-World Vision-Language Understanding

DeepSeek-VL is DeepSeek’s initial vision-language model that anchors their multimodal stack. It enables understanding and generation across visual and textual modalities—meaning it can process an image + a prompt, answer questions about images, caption, classify, or reason about visuals in context. The model is likely used internally as the visual encoder backbone for agent use cases, to ground perception in downstream tasks (e.g. answering questions about a screenshot). The repository includes model weights (or pointers to them), evaluation metrics on standard vision + language benchmarks, and configuration or architecture files. ...

Downloads: 2 This Week

Last Update: 2025-10-03
See Project
22

Krajee

An enhanced HTML 5 file input for Bootstrap 5.x/4.x/3.x

An enhanced HTML 5 file input for Bootstrap 5.x, 4.x or 3.x with file preview for various file types, multiple selection, and more. The plugin gives you a simple way to set up an advanced file picker/upload control built to work specifically with Bootstrap CSS3 styles. It enhances the file input functionality further by offering support to preview a wide variety of files, i.e. images, text, html, video, audio, flash, and objects. In addition, it includes AJAX based...

Downloads: 0 This Week

Last Update: 2024-04-09
See Project
23

blackvideo-mini-player

A standalone lightweight auxiliary CLI video player for BlackVideo.

Lightweight cross-platform video player (Ada + SDL2 + FFmpeg). Support player for BlackVideo. Works standalone via CLI or right-click on any video file.
Usage
Method 1: Command Line
Step 1. Unzip blackvideo-mini-player-v2.3.0.win.zip
Step 2. Open the build\ folder, then type cmd directly in the address bar and press Enter; this opens a terminal already in that folder. Alternatively, open Command Prompt anywhere and use cd with the copied path: cd...

Downloads: 1 This Week

Last Update: 2026-03-18
See Project
24

Extended Dreambooth How-To Guides

Implementation of Dreambooth

Extended Dreambooth How-To Guides is an implementation and extended toolkit for fine-tuning Stable Diffusion models using the DreamBooth technique, enabling users to train AI image generators to reproduce specific subjects, styles, or identities from a small set of reference images. The project adapts and expands upon earlier DreamBooth research by providing practical scripts, notebooks, and workflows that allow users to train personalized models on local machines, cloud environments, or...

Downloads: 0 This Week

Last Update: 2026-03-18
See Project
25

Cindy components for Delphi 7 and newer

Packages with more than 80 components for all Delphi versions

Packages with 86 components for all Delphi versions (since D7) to build Windows 32/64-bit applications: VCL controls (labels, buttons, panels, Edits, TabControls, StaticText) with features like background gradient, colored bevels, wallpaper, shadowText, caption orientation, etc. TcyCommunicate and TcyCommRoomConnector allow communication between applications running in the same computer session. TcySearchFiles and TcyCopyfiles allow, respectively, searching and copying files with pause/resume/abort features. TcyResizer allows moving and resizing components at run-time like Delphi 2009 does. Advanced DB Express components (tested with MySQL) for easy table data handling (tcyDbxTable or TcyDbxSimpleTable), schema modifications (TcyDbxUpdateSql), reconcile handling (TcyDBXReconcileError) and table creation (cyDbxImportDataset1). ...

16 Reviews

Downloads: 8 This Week

Last Update: 2023-12-20
See Project