Showing 168 open source projects for "visual"

  • 1
    DeepWiki Open

    AI-Powered Wiki Generator for GitHub/Gitlab/Bitbucket Repositories

    ...Users can enter a repository URL and the system will clone the project, build semantic embeddings of its codebase, extract architecture and relationships, generate human-readable documentation, and produce visual diagrams to help explain complex code structure. DeepWiki’s output turns raw repositories into interactive, web-style wikis complete with navigable sections, diagrams, and contextual explanations, making it easier for developers and collaborators to understand unfamiliar code. It includes an “Ask” feature that lets users query the generated wiki using RAG-style retrieval, enabling interactive question-answering and exploration.
    Downloads: 1 This Week
  • 2
    HunyuanVideo-Foley

    Multimodal Diffusion with Representation Alignment

    HunyuanVideo-Foley is a multimodal diffusion model from Tencent Hunyuan for high-fidelity Foley (sound effects) audio generation synchronized to video scenes. It is designed to generate audio that matches both visual content and textual semantic cues, for use in video production, film, advertising, games, etc. The model architecture aligns audio, video, and text representations to produce realistic synchronized soundtracks. It produces high-quality 48 kHz audio output suitable for professional use, and it uses a hybrid architecture combining multimodal transformer blocks and unimodal refinement blocks. ...
    Downloads: 1 This Week
  • 3
    Screenshot to Code

    A neural network that transforms a design mock-up into static websites

    Screenshot-to-code is a tool or prototype that attempts to convert UI screenshots (e.g., of mobile or web UIs) into code representations, likely generating layouts, HTML, CSS, or markup from image inputs. It is part of a research/proof-of-concept domain in UI automation and image-to-UI code generation. Features include mapping visual design to code constructs, code/UI layout output (HTML, CSS, or markup), and examples/demo scripts showing image-to-UI code conversion.
    Downloads: 0 This Week
  • 4
    ManiSkill

    SAPIEN Manipulation Skill Framework

    ...Developed by Hao Su Lab, it focuses on robotic manipulation with diverse, high-quality 3D tasks designed to challenge perception, control, and planning in robotics. ManiSkill provides both low-level control and visual observation spaces for realistic learning scenarios.
    Downloads: 0 This Week
  • 5
    Super Magic

    All-in-one AI productivity platform with agents, workflows, and IM

    ...Magic centers around a general-purpose AI agent system called Super Magic, which can autonomously understand tasks, plan actions, execute workflows, and perform error correction. Alongside this, Magic includes a visual workflow engine that enables users to design complex AI processes using a drag-and-drop interface without requiring extensive coding knowledge. It also provides an enterprise-grade instant messaging system that integrates AI conversations with internal communication, allowing teams to collaborate while leveraging intelligent assistants. ...
    Downloads: 2 This Week
  • 6
    HunyuanWorld 1.0

    Generating Immersive, Explorable, and Interactive 3D Worlds

    ...The architecture integrates panoramic proxy generation, semantic layering, and hierarchical 3D reconstruction to produce high-quality scene-scale 3D worlds from both text and images. HunyuanWorld-1.0 surpasses existing open-source methods in visual quality and geometric consistency, demonstrated by superior scores in BRISQUE, NIQE, Q-Align, and CLIP metrics.
    Downloads: 2 This Week
  • 7
    DINOv3

    Reference PyTorch implementation and models for DINOv3

    DINOv3 is the third-generation iteration of Meta’s self-supervised visual representation learning framework, building upon the ideas from DINO and DINOv2. It continues the paradigm of learning strong image representations without labels using teacher–student distillation, but introduces a simplified and more scalable training recipe that performs well across datasets and architectures. DINOv3 removes the need for complex augmentations or momentum encoders, streamlining the pipeline while maintaining or improving feature quality. ...
    Downloads: 14 This Week
  • 8
    Qwen3-Omni

    Qwen3-omni is a natively end-to-end, omni-modal LLM

    ...It uses a Thinker-Talker architecture with a Mixture-of-Experts (MoE) design, early text-first pretraining, and mixed multimodal training to support strong performance across all modalities without sacrificing text or image quality. The model supports 119 text languages, 19 speech input languages, and 10 speech output languages. It achieves state-of-the-art results: across 36 audio and audio-visual benchmarks, it hits open-source SOTA on 32 and overall SOTA on 22, outperforming or matching strong closed-source models such as Gemini-2.5 Pro and GPT-4o. To reduce latency, especially in audio/video streaming, Talker predicts discrete speech codecs via a multi-codebook scheme and replaces heavier diffusion approaches.
    Downloads: 4 This Week
  • 9
    AppAgent

    Multimodal Agents as Smartphone Users, an LLM-based multimodal agent

    AppAgent is an open-source multimodal agent framework designed to enable large language models to operate smartphone applications through natural interactions with graphical user interfaces. The system allows an AI agent to interpret visual information from the screen and translate natural language instructions into actions such as tapping, swiping, and navigating between application screens. Instead of requiring backend access to application APIs, the framework interacts with apps the same way a human user would, making it compatible with a wide variety of mobile applications. ...
    Downloads: 1 This Week
  • 10
    GLM-4.5V

    GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning

    GLM-4.5V is the preceding iteration in the GLM-V series that laid much of the groundwork for general multimodal reasoning and vision-language understanding. It embodies the design philosophy of mixing visual and textual modalities into a unified model capable of general-purpose reasoning, content understanding, and generation, while already supporting a wide variety of tasks: from image captioning and visual question answering to content recognition, GUI-based agents, video understanding, and long-document interpretation. GLM-4.5V emerged from a training framework that leverages scalable reinforcement learning (with curriculum sampling) to boost performance across tasks ranging from STEM problem solving to long-context reasoning, giving it broad applicability beyond narrow benchmarks. ...
    Downloads: 0 This Week
  • 11
    Sa2VA

    Official Repo For "Sa2VA: Marrying SAM2 with LLaVA

    Sa2VA is a cutting-edge open-source multi-modal large language model (MLLM) developed by ByteDance that unifies dense segmentation, visual understanding, and language-based reasoning across both images and videos. It merges the segmentation power of a state-of-the-art video segmentation model (based on SAM‑2) with the vision-language reasoning capabilities of a strong LLM backbone (derived from models like InternVL2.5 / Qwen-VL series), yielding a system that can answer questions about visual content, perform referring segmentation, and maintain temporal consistency across frames in video. ...
    Downloads: 0 This Week
  • 12
    FramePack

    Let's make video diffusion practical

    FramePack explores compact representations for sequences of image frames, targeting tasks where many near-duplicate frames carry redundant information. The idea is to “pack” frames by detecting shared structure and storing differences efficiently, which can accelerate training or inference on video-like data. By reducing I/O and memory bandwidth, datasets become lighter to load while models still see the essential temporal variation. The repository demonstrates both packing and unpacking...
    Downloads: 11 This Week
  • 13
    Watermark-Removal

    Machine learning image inpainting task that removes watermarks

    The Watermark-Removal repository is a machine learning project focused on removing visible watermarks from digital images using deep learning and image inpainting techniques. The system analyzes an image containing a watermark and attempts to reconstruct the underlying visual content so that the watermark is removed while preserving the original appearance of the image. The project uses neural network models inspired by research in contextual attention and gated convolution, which are methods commonly applied to image restoration tasks. Through these techniques, the model learns to identify regions of the image affected by the watermark and generate realistic replacements for the missing visual information. ...
    Downloads: 0 This Week
  • 14
    Agent S

    Agent S: an open agentic framework that uses computers like a human

    ...The latest version, Agent S3, surpasses human-level performance on the OSWorld benchmark, demonstrating state-of-the-art results in complex multi-step computer tasks. Agent S combines powerful foundation models (such as GPT-5) with grounding models like UI-TARS to translate visual inputs into precise executable actions. It supports flexible deployment via CLI, SDK, or cloud, and integrates with multiple model providers including OpenAI, Anthropic, Gemini, Azure, and Hugging Face endpoints. With optional local code execution, reflection mechanisms, and compositional planning, Agent S provides a scalable and research-driven framework for building advanced computer-use agents.
    Downloads: 4 This Week
  • 15
    DeepSeek VL

    Towards Real-World Vision-Language Understanding

    DeepSeek-VL is DeepSeek’s initial vision-language model that anchors their multimodal stack. It enables understanding and generation across visual and textual modalities—meaning it can process an image + a prompt, answer questions about images, caption, classify, or reason about visuals in context. The model is likely used internally as the visual encoder backbone for agent use cases, to ground perception in downstream tasks (e.g. answering questions about a screenshot). The repository includes model weights (or pointers to them), evaluation metrics on standard vision + language benchmarks, and configuration or architecture files. ...
    Downloads: 3 This Week
  • 16
    machine-learning-refined

    Master the fundamentals of machine learning, deep learning

    machine-learning-refined is an educational repository designed to help students and practitioners understand machine learning algorithms through intuitive explanations and interactive examples. The project accompanies a series of textbooks and teaching materials that focus on making machine learning concepts accessible through visual demonstrations and simple code implementations. Instead of presenting algorithms purely through mathematical derivations, the repository emphasizes geometric intuition, visualization, and step-by-step experimentation. It includes Jupyter notebooks and scripts that illustrate core machine learning topics such as regression, classification, optimization methods, and neural networks. ...
    Downloads: 0 This Week
  • 17
    VLMEvalKit

    Open-source evaluation toolkit of large multi-modality models (LMMs)

    ...VLMEvalKit supports generation-based evaluation methods, allowing models to produce textual responses to visual inputs while measuring performance through techniques such as exact matching or language-model-assisted answer extraction.
    Downloads: 0 This Week
  • 18
    firerpa LAMDA

    The most powerful Android RPA agent framework

    lamda is an Android RPA agent framework that provides visual remote desktop control and automation at scale, geared toward testing, automation validation, and device management. It exposes a clean UI to monitor and interact with connected devices and includes tooling to script actions reliably across apps and OS versions. The project emphasizes low-friction setup and powerful control primitives so teams can move from interactive validation to repeatable automation.
    Downloads: 0 This Week
  • 19
    Label Studio

    Label Studio is a multi-type data labeling and annotation tool

    ...The frontend part of the Label Studio app lives in the frontend/ folder and is written in React JSX. Multi-user labeling supports sign-up and login; when you create an annotation, it is tied to your account. Configurable label formats let you customize the visual interface to meet your specific labeling needs. It supports multiple data types, including images, audio, text, HTML, time-series, and video.
    Downloads: 15 This Week
  • 20
    Open-AutoGLM

    An open phone agent model & framework

    ...It aims to create an “AI phone agent” that can perceive on-screen content, reason about user goals, and execute sequences of taps, swipes, and text input via automated device control interfaces like ADB, enabling hands-off completion of multi-step tasks such as navigating apps, filling forms, and more. Unlike traditional automation scripts that depend on brittle heuristics, Open-AutoGLM uses pretrained large language and vision-language models to interpret visual context and natural language instructions, giving the agent robust adaptability across apps and interfaces.
    Downloads: 11 This Week
  • 21
    PaperBanana

    Extension of Google Research’s PaperBanana

    PaperBanana is an open-source agentic framework designed to automatically generate publication-quality academic diagrams and statistical plots directly from text descriptions. The project focuses on helping researchers, educators, and data scientists transform conceptual descriptions of figures into structured visual outputs suitable for research papers, presentations, and technical reports. Instead of manually designing charts or diagrams using traditional visualization tools, users can describe the desired figure in natural language and allow the system to generate the visual representation automatically. PaperBanana integrates modern multimodal AI models capable of interpreting instructions and producing graphics that follow academic conventions. ...
    Downloads: 0 This Week
  • 22
    Browser Use MCP Server

    Browse the web, directly from Cursor etc.

    A browser automation server implementing the Model Context Protocol, designed to allow AI assistants to browse the web directly from applications like Cursor. It supports natural language commands for web navigation and interaction.
    Downloads: 1 This Week
  • 23
    InvokeAI

    InvokeAI is a leading creative engine for Stable Diffusion models

    ...It runs on Windows, Mac and Linux machines, and runs on GPU cards with as little as 4 GB of RAM. InvokeAI is a leading creative engine built to empower professionals and enthusiasts alike. Generate and create stunning visual media using the latest AI-driven technologies. InvokeAI offers an industry-leading Web Interface and an interactive Command Line Interface, and also serves as the foundation for multiple commercial products. This fork is supported across Linux, Windows and Macintosh. Linux users can use either an Nvidia-based card (with CUDA support) or an AMD card (using the ROCm driver). ...
    Downloads: 15 This Week
  • 24
    VGGSfM

    VGGSfM: Visual Geometry Grounded Deep Structure From Motion

    VGGSfM is an advanced structure-from-motion (SfM) framework jointly developed by Meta AI Research (GenAI) and the University of Oxford’s Visual Geometry Group (VGG). It reconstructs 3D geometry, dense depth, and camera poses directly from unordered or sequential images and videos. The system combines learned feature matching and geometric optimization to generate high-quality camera calibrations, sparse/dense point clouds, and depth maps in standard COLMAP format. Version 2.0 adds support for dynamic scene handling, dense point cloud export, video-based reconstruction (1000+ frames), and integration with Gaussian Splatting pipelines. ...
    Downloads: 0 This Week
  • 25
    Perception Models

    State-of-the-art Image & Video CLIP, Multimodal Large Language Models

    ...The project supports a wide range of research applications, from visual recognition and dense prediction to fine-grained multimodal understanding. Additionally, it includes several large-scale open datasets for both image and video perception.
    Downloads: 0 This Week