Showing 174 open source projects for "visual-mingw"

  • 1
    Map-Anything

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Map-Anything is a universal, feed-forward transformer for metric 3D reconstruction that predicts a scene’s geometry and camera parameters directly from visual inputs. Instead of stitching together many task-specific models, it uses a single architecture that supports a wide range of 3D tasks—multi-image structure-from-motion, multi-view stereo, monocular metric depth, registration, depth completion, and more. The model flexibly accepts different input combinations (images, intrinsics, poses, sparse or dense depth) and produces a rich set of outputs including per-pixel 3D points, camera intrinsics, camera poses, ray directions, confidence maps, and validity masks. ...
    Downloads: 1 This Week
  • 2
    OpenSwarm

    Claude code for everything except coding

    ...The included agents can handle research, data analysis, slide decks, documents, images, videos, scheduling, messaging, and other productivity tasks. It is designed for outputs like pitch decks, market research, SEO content, quarterly reports, launch campaigns, visual assets, and multimedia projects. The project can connect to external services through integrations and can be customized into purpose-specific swarms for areas such as SEO, sales, marketing, finance, customer support, or research. Its main appeal is giving technical users a forkable, terminal-based framework for building agent teams that produce polished business and creative deliverables.
    Downloads: 0 This Week
  • 3
    ComfyUI-HunyuanVideoWrapper

    ComfyUI wrapper nodes for HunyuanVideo

    ...The system introduces specialized nodes such as text-image encoders that allow multiple image inputs to be referenced directly within prompts. This makes it possible to guide generation using both visual and textual context simultaneously. The wrapper is designed to fit seamlessly into ComfyUI pipelines, enabling chaining with other nodes for advanced workflows. It supports prompt-based referencing of images, where placeholders in text correspond to connected inputs, allowing fine control over generation behavior. The project is particularly useful for creators experimenting with multimodal AI video synthesis.
    Downloads: 0 This Week
  • 4
    InfiniteYou

    Flexible Photo Recrafting While Preserving Your Identity

    ...Using an architecture built around diffusion transformers (DiTs), InfiniteYou introduces a component called InfuseNet that injects identity features derived from reference images into the generation process via residual connections, so that the output closely matches the person’s identity without sacrificing visual quality or text-image alignment. The team uses a multi-stage training strategy with synthetic multi-sample data per identity to fine-tune for both identity consistency and aesthetic quality. Compared to prior methods, InfiniteYou significantly improves identity similarity, text-prompt adherence, and overall image quality, and it avoids common problems such as face copy-paste artifacts.
    Downloads: 1 This Week
  • 5
    dots.ocr

    Multilingual Document Layout Parsing in a Single Vision-Language Model

    ...It achieves state-of-the-art performance on document parsing benchmarks while maintaining a relatively compact model size, demonstrating efficiency without sacrificing accuracy. Beyond standard OCR tasks, it extends its capabilities to parse complex visual elements such as charts, diagrams, and web interfaces, converting them into structured outputs like SVG code.
    Downloads: 0 This Week
  • 6
    SimpleHTR

    Handwritten Text Recognition (HTR) system implemented with TensorFlow

    ...The project focuses on converting images of handwritten text into machine-readable digital text using neural networks. The system uses a combination of convolutional neural networks and recurrent neural networks to extract visual features and model sequential character patterns in handwriting. It also employs connectionist temporal classification (CTC) to align predicted character sequences with input images without requiring character-level segmentation. The repository provides code for training models, performing inference on handwritten text images, and evaluating recognition accuracy. ...
    Downloads: 0 This Week
  • 7
    Grounded-Segment-Anything

    Marrying Grounding DINO with Segment Anything & Stable Diffusion

    Grounded-Segment-Anything is a research-oriented project that combines powerful open-set object detection with pixel-level segmentation and subsequent creative workflows, effectively enabling detection, segmentation, and high-level vision tasks guided by free-form text prompts. The core idea behind the project is to pair Grounding DINO — a zero-shot object detector that can locate objects described by natural language — with Segment Anything Model (SAM), which can produce detailed masks for...
    Downloads: 0 This Week
  • 8
    Wan Move

    Motion-controllable Video Generation via Latent Trajectory Guidance

    Wan Move is an open-source research codebase for motion-controllable video generation that focuses on enabling fine-grained control of motion within generative video models. It is designed to guide the temporal evolution of visual content by leveraging latent trajectory guidance, allowing users to manipulate how objects move over time without modifying the underlying generative architecture. By representing motion information as dense point trajectories and integrating them into the latent space of an image-to-video model, the project produces videos with more precise and controllable motion behavior than many existing methods. ...
    Downloads: 0 This Week
  • 9
    Qwen3-VL-Embedding

    Multimodal embedding and reranking models built on Qwen3-VL

    ...The reranking model then precisely scores relevance between a given query and candidate documents, enhancing retrieval accuracy in complex multimodal tasks. Together, they support advanced information retrieval workflows such as image-text search, visual question answering (VQA), and video-text matching, while providing out-of-the-box support for more than 30 languages.
    Downloads: 0 This Week
  • 10
    MiniMind-V

    "Big Model" trains a visual multimodal VLM with 26M parameters

    MiniMind-V is an experimental open-source project that aims to train a very small multimodal vision–language model (VLM) from scratch with extremely low compute and cost, making research and experimentation accessible to more people. The repository showcases training workflows and code designed to produce a 26-million parameter model—including both image and text capabilities—using minimal resources in very little time, reflecting a trend toward democratizing AI research. MiniMind-V combines...
    Downloads: 0 This Week
  • 11
    zvt

    Modular quant framework

    ...Your world is built by core concepts inside you, so it’s you. The zvt world is built by core concepts inside the market, so it’s zvt. The core concepts of the system are visual, and the interface names correspond to them one-to-one, so the system is also uniform and extensible. You can write and run a strategy in your favorite IDE, and then view its related targets, factors, signals, and performance in the UI. Once you are familiar with the core concepts of the system, you can apply it to any target in the market.
    Downloads: 0 This Week
  • 12
    HunyuanOCR

    OCR expert VLM powered by Hunyuan's native multimodal architecture

    HunyuanOCR is an open-source, end-to-end OCR (optical character recognition) Vision-Language Model (VLM) developed by Tencent-Hunyuan. It is designed to unify the entire OCR pipeline (detection, recognition, layout parsing, information extraction, translation, and even subtitle or structured output generation) into a single model inference instead of a cascade of separate tools. Despite being fairly lightweight (about 1 billion parameters), it delivers state-of-the-art performance across a...
    Downloads: 1 This Week
  • 13
    Agent Sprite Forge

    Agent Skill for generating 2D sprite sheets and map, transparent PNG

    ...The system supports multi-frame sprite generation, animation sequencing, and transparent background rendering for easier integration into game engines. Its architecture is designed around automation and repeatability, enabling developers to generate large batches of visual assets through structured prompt workflows. Overall, agent-sprite-forge acts as an AI-assisted creative tool for accelerating 2D game art production and experimentation.
    Downloads: 0 This Week
  • 14
    MolmoWeb

    Open multimodal web agent built by Ai2

    ...Unlike traditional automation tools that rely on structured HTML parsing or predefined APIs, MolmoWeb operates directly from screenshots of web pages, interpreting visual content in the same way a human user would. This approach allows it to generalize across different websites without requiring site-specific integrations, making it highly adaptable to diverse web environments.
    Downloads: 0 This Week
  • 15
    Diffusion for World Modeling

    Learning agent trained in a diffusion world model

    ...Instead of interacting directly with a real environment, the reinforcement learning agent learns within a generative model that produces frames representing the environment. This approach allows training to occur in a simulated world that captures detailed visual dynamics while reducing the need for costly interactions with real environments. The system has been applied to tasks such as Atari game simulations and demonstrations involving complex environments like first-person shooter games.
    Downloads: 0 This Week
  • 16
    FireRed-Image-Edit

    General-purpose image editing model that delivers high-fidelity

    ...It is built on a flexible text-to-image foundation model that has been extended with training paradigms including pretraining, supervised fine-tuning, and reinforcement learning to imbue the system with strong instruction following and editing consistency. The model excels in maintaining visual and text stylistic fidelity, allowing users to preserve the original artistic qualities of an image while applying creative changes according to natural language instructions. In addition to editing single images, FireRed supports multi-image editing scenarios such as virtual try-on or batch transformations, making it suitable for both creative and practical workflows.
    Downloads: 0 This Week
  • 17
    ticket

    Fast, powerful, git-native ticket tracking in a single bash script

    ...It stores each ticket as a Markdown file with YAML frontmatter, making them human-readable and easy to version control alongside your code, while also allowing IDEs to jump straight to ticket definitions. The CLI provides common subcommands to create, list, edit, close, and manage dependencies between tickets, enabling clear hierarchical task structures and visual dependency trees. Its design is rooted in the Unix philosophy of simplicity, composability, and transparency, meaning it integrates well with other standard tools like grep, jq, and ripgrep when installed. Teams can use ticket to track bugs, features, chores, and epics with priority levels and tags, all by staying within the terminal and Git ecosystem.
    Downloads: 0 This Week
  • 18
    MetaCLIP

    ICLR2024 Spotlight: curation/training code, metadata, distribution

    ...It includes utilities to fine-tune vision-language embeddings, compute prompt or adapter updates, and benchmark across transfer and retention metrics. MetaCLIP is especially suited for real-world settings where a model must continuously incorporate new visual categories or domains over time.
    Downloads: 0 This Week
  • 19
    VGGT

    [CVPR 2025 Best Paper Award] VGGT

    VGGT is a transformer-based framework aimed at unifying classic visual geometry tasks—such as depth estimation, camera pose recovery, point tracking, and correspondence—under a single model. Rather than training separate networks per task, it shares an encoder and leverages geometric heads/decoders to infer structure and motion from images or short clips. The design emphasizes consistent geometric reasoning: outputs from one head (e.g., correspondences or tracks) reinforce others (e.g., pose or depth), making the system more robust to challenging viewpoints and textures. ...
    Downloads: 0 This Week
  • 20
    LLaMA-Mesh

    Unifying 3D Mesh Generation with Language Models

    ...By serializing 3D geometry into text tokens, the approach allows existing transformer architectures to generate and interpret 3D models without requiring specialized visual tokenizers. The project includes a supervised fine-tuning dataset composed of interleaved text and mesh data, allowing the model to learn relationships between textual descriptions and 3D structures. As a result, the model can generate mesh models directly from text prompts, explain mesh structures in natural language, or output mixed text-and-mesh sequences. ...
    Downloads: 0 This Week
  • 21
    AI-Codereview-Gitlab

    GitLab automatic code review tool based on large models

    AI-Codereview-Gitlab is an open-source automation tool that integrates large language models into the GitLab development workflow to perform automated code reviews. The system monitors GitLab repositories and analyzes commits or merge requests using AI models to identify potential issues, coding mistakes, and quality improvements before the code is merged. By leveraging multiple large language model providers—including OpenAI, DeepSeek, ZhipuAI, or local models through Ollama—the platform...
    Downloads: 0 This Week
  • 22
    InternVL

    A Pioneering Open-Source Alternative to GPT-4o

    InternVL is a large-scale multimodal foundation model designed to integrate computer vision and language understanding within a unified architecture. The project focuses on scaling vision models and aligning them with large language models so that they can perform tasks involving both visual and textual information. InternVL is trained on massive collections of image-text data, enabling it to learn representations that capture both visual patterns and semantic meaning. The model supports a wide variety of tasks, including visual perception, image classification, and cross-modal retrieval between images and text. It can also be connected to language models to enable conversational interfaces that understand images, videos, and other visual content. ...
    Downloads: 0 This Week
  • 23
    DeepSeek VL

    Towards Real-World Vision-Language Understanding

    DeepSeek-VL is DeepSeek’s initial vision-language model that anchors their multimodal stack. It enables understanding and generation across visual and textual modalities—meaning it can process an image + a prompt, answer questions about images, caption, classify, or reason about visuals in context. The model is likely used internally as the visual encoder backbone for agent use cases, to ground perception in downstream tasks (e.g. answering questions about a screenshot). The repository includes model weights (or pointers to them), evaluation metrics on standard vision + language benchmarks, and configuration or architecture files. ...
    Downloads: 5 This Week
  • 24
    Qwen-VL

    Chat & pretrained large vision language model

    Qwen-VL is Alibaba Cloud’s vision-language large model family, designed to integrate visual and linguistic modalities. It accepts image inputs (with optional bounding boxes) and text, and produces text (and sometimes bounding boxes) as output. The model variants (VL-Plus, VL-Max, etc.) have been upgraded for better visual reasoning, text recognition from images, fine-grained understanding, and support for high image resolutions / extreme aspect ratios.
    Downloads: 1 This Week
  • 25
    airda

    airda (Air Data Agent)

    airda (Air Data Agent) is a multi-agent system for data analysis. It is capable of understanding data development and data analysis needs, understanding data, and generating SQL and Python code for data-oriented queries, data visualization, machine learning, and other tasks.
    Downloads: 2 This Week
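
The SimpleHTR entry above mentions connectionist temporal classification (CTC), which lets a model emit one label per image frame without character-level segmentation. As a rough illustration of how such per-frame labels can be turned into text, here is a minimal sketch of greedy (best-path) CTC decoding: collapse repeated labels, then drop the blank label. This is not SimpleHTR's actual code; the function name `ctc_greedy_decode`, the charset, the choice of index 0 for the blank label, and the sample frame labels are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank label


def ctc_greedy_decode(frame_label_ids, charset, blank=BLANK):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    # groupby yields one key per run of identical consecutive labels.
    collapsed = [label for label, _ in groupby(frame_label_ids)]
    return "".join(charset[label] for label in collapsed if label != blank)


# Hypothetical charset; position 0 is a placeholder for the blank label.
charset = ["", "a", "b", "c"]

# Made-up per-frame argmax labels, standing in for a recognizer's output.
frames = [0, 1, 1, 0, 1, 2, 2, 0, 0, 3]

print(ctc_greedy_decode(frames, charset))  # aabc
```

The blank between the two runs of label 1 is what allows the doubled "a" to survive the collapse step; without it, consecutive identical characters would merge into one.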