Showing 101 open source projects for "diffusion"

  • 1
    VoxCPM

    TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

    VoxCPM is a tokenizer-free text-to-speech system that models speech in a continuous space, aiming for extremely realistic, context-aware synthesis and true-to-life zero-shot voice cloning. Instead of converting speech into discrete tokens, it uses an end-to-end diffusion-autoregressive architecture built on the MiniCPM-4 backbone, combining hierarchical language modeling, finite scalar quantization (FSQ), and local Diffusion Transformers. This design helps decouple semantic and acoustic information while preserving fine-grained prosody, leading to more stable and expressive generation than many discrete-token systems. ...
    Downloads: 2 This Week
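    The finite scalar quantization (FSQ) mentioned above can be illustrated with a short PyTorch sketch. This is a generic FSQ rounding scheme with assumed per-dimension level counts, not VoxCPM's actual code:

        import torch

        def fsq(z: torch.Tensor, levels=(8, 8, 5, 5)) -> torch.Tensor:
            # Generic finite scalar quantization: bound each latent dimension,
            # snap it to a small fixed grid, and pass gradients through the
            # rounding via a straight-through estimator. Level counts are
            # illustrative, not VoxCPM's.
            half = (torch.tensor(levels, dtype=z.dtype) - 1) / 2
            z = torch.tanh(z) * half            # bound to (-half, half) per dim
            z_q = z + (z.round() - z).detach()  # round; gradient passes straight through
            return z_q / half                   # rescale codes to [-1, 1]

        codes = fsq(torch.randn(2, 4))  # last dim must match len(levels)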
  • 2
    SeedVR

    Repo for SeedVR2 & SeedVR

    SeedVR (from the ByteDance-Seed organization) is an open-source research and implementation repository focused on cutting-edge video restoration using diffusion transformer architectures. The project includes both the original SeedVR and its successor SeedVR2 models, which are designed to restore degraded or low-quality video content by learning to reconstruct high-fidelity frames with temporal coherence. These models leverage advanced techniques such as adaptive attention mechanisms and adversarial training to produce visually appealing results in a single inference step, pushing the boundaries of video restoration research. ...
    Downloads: 2 This Week
  • 3
    VibeVoice

    Open-source multi-speaker long-form text-to-speech model

    ...A key innovation is its use of continuous acoustic and semantic speech tokenizers operating at an ultra-low frame rate of 7.5 Hz, enabling high audio fidelity with efficient processing of long sequences. The model integrates a Qwen2.5-based large language model with a diffusion head to produce realistic acoustic details and capture conversational context. Training involved curriculum learning with increasing sequence lengths up to 65K tokens, allowing VibeVoice to handle very long dialogues effectively. Safety mechanisms include an audible disclaimer and imperceptible watermarking in all generated audio to mitigate misuse risks.
    Downloads: 12 This Week
  • 4
    SeedVR2 Upscaler ComfyUI

    Official SeedVR2 Video Upscaler for ComfyUI

    ...This project packages the SeedVR2 architecture as a custom node for ComfyUI, letting users upscale low-resolution video or imagery inside a node-based interface without needing to write code manually. The underlying SeedVR2 model is known for delivering high-quality video enhancement with strong temporal consistency and improved detail preservation by using diffusion-based techniques that are trained specifically on video sequences. Within the ComfyUI ecosystem, the upscaler integrates with existing nodes and pipelines, making it easier to combine with other processing steps such as denoising, color correction, or format conversion. Enthusiasts often use it for workflows ranging from hobby video enhancement to professional content improvement.
    Downloads: 6 This Week
  • 5
    Oasis

    Inference script for Oasis 500M

    Open-Oasis provides inference code and released weights for Oasis 500M, an interactive world model that generates gameplay frames conditioned on user keyboard input. Instead of rendering a pre-built game world, the system produces the next visual state via a diffusion-transformer approach, effectively “imagining” the world’s response to your actions in real time. The project focuses on enabling action-conditional frame generation so developers can experiment with interactive, model-generated environments rather than static video generation alone. Because it’s an inference-focused repository, it’s especially useful as a practical reference for running the model, wiring inputs, and producing the autoregressive sequence of gameplay frames. ...
    Downloads: 3 This Week
  • 6
    InvokeAI

    InvokeAI is a leading creative engine for Stable Diffusion models

    InvokeAI is an implementation of Stable Diffusion, the open source text-to-image and image-to-image generator. It provides a streamlined process with various new features and options to aid the image generation process. It runs on Windows, Mac, and Linux machines, and on GPU cards with as little as 4 GB of RAM. InvokeAI is a leading creative engine built to empower professionals and enthusiasts alike.
    Downloads: 23 This Week
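    InvokeAI itself is driven through its web UI, but the text-to-image step it wraps can be sketched with the Hugging Face diffusers library (a related Stable Diffusion implementation, not InvokeAI's own API); the model ID and prompt are illustrative:

        import torch
        from diffusers import StableDiffusionPipeline

        # Load a Stable Diffusion checkpoint; fp16 keeps VRAM usage low.
        pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        ).to("cuda")

        image = pipe("a watercolor fox in a snowy forest").images[0]
        image.save("fox.png")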
  • 7
    MegaTTS 3

    Official PyTorch Implementation

    MegaTTS3 is an open-source text-to-speech (TTS) and voice-cloning system from ByteDance that aims to deliver high-quality, expressive speech synthesis, including zero-shot voice cloning of previously unseen speakers. Its backbone is a lightweight diffusion transformer (~0.45B parameters), which enables efficient inference while still producing high-fidelity audio. Given a reference audio sample (and corresponding latent representation), MegaTTS3 can generate speech in the style and voice timbre of that speaker — useful for personalized TTS, voice-overs, dubbing, or multi-speaker applications. ...
    Downloads: 3 This Week
  • 8
    HunyuanVideo-Foley

    Multimodal Diffusion with Representation Alignment

    HunyuanVideo-Foley is a multimodal diffusion model from Tencent Hunyuan for high-fidelity Foley (sound effects) audio generation synchronized to video scenes. It is designed to generate audio that matches both visual content and textual semantic cues, for use in video production, film, advertising, games, etc. The model architecture aligns audio, video, and text representations to produce realistic synchronized soundtracks.
    Downloads: 0 This Week
  • 9
    InstantCharacter

    Personalize Any Characters with a Scalable Diffusion Transformer

    InstantCharacter is a tuning-free diffusion transformer framework created by the Tencent Hunyuan / InstantX team that generates images of a specific character (subject) from a single reference image while preserving identity and character features. It uses adapters, so full fine-tuning of the base model is not required. Demo scripts and a pipeline API (via infer_demo.py and pipeline.py) are included.
    Downloads: 0 This Week
  • 10
    Step-Video-T2V

    State-of-the-art (SoTA) text-to-video pre-trained model

    Step-Video-T2V is a state-of-the-art text-to-video foundation model developed to generate videos from natural-language prompts; its 30B-parameter architecture is designed to produce coherent, temporally extended video sequences — up to around 204 frames — based on input text. Under the hood it uses a compressed latent representation (a Video-VAE) to reduce spatial and temporal redundancy, and a denoising diffusion (or similar) process over that latent space to generate smooth, plausible motion and visuals. The model handles bilingual input (e.g. English and Chinese) thanks to dual encoders, and supports end-to-end text-to-video generation without requiring external assets. Its training and generation pipeline includes techniques like flow-matching, full 3D attention for temporal consistency, and fine-tuning approaches (e.g. video-based DPO) to improve fidelity and reduce artifacts. ...
    Downloads: 3 This Week
  • 11
    HunyuanVideo-Avatar

    Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model

    HunyuanVideo-Avatar is a multimodal diffusion transformer (MM-DiT) model by Tencent Hunyuan for animating static avatar images into dynamic, emotion-controllable, and multi-character dialogue videos, conditioned on audio. It addresses challenges of motion realism, identity consistency, and emotional alignment. Innovations include a character image injection module, an Audio Emotion Module for transferring emotion cues, and a Face-Aware Audio Adapter to isolate audio effects on faces, enabling multiple characters to be animated in a scene. ...
    Downloads: 0 This Week
  • 12
    tinygrad

    Deep learning framework

    This may not be the best deep learning framework, but it is a deep learning framework. Due to its extreme simplicity, it aims to be the easiest framework to add new accelerators to, with support for both inference and training. If XLA is CISC, tinygrad is RISC.
    Downloads: 0 This Week
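    A minimal autograd example in tinygrad's own API, closely following the project README:

        from tinygrad import Tensor

        x = Tensor.eye(3, requires_grad=True)
        y = Tensor([[2.0, 0, -2.0]], requires_grad=True)
        z = y.matmul(x).sum()   # builds the computation graph lazily
        z.backward()            # reverse-mode autodiff

        print(x.grad.numpy())   # dz/dx
        print(y.grad.numpy())   # dz/dy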
  • 13
    VibeVoice ComfyUI

    ComfyUI integration for Microsoft's VibeVoice text-to-speech model

    ...The integration supports high-quality single-speaker synthesis as well as scripted multi-speaker conversations, with optional voice cloning from audio samples for each speaker. It includes advanced control over generation parameters like attention backend, diffusion steps, sampling temperature, guidance scale, and quantization settings, allowing users to tune the trade-offs between quality, VRAM usage, and speed. The project also introduces first-class LoRA support, making it possible to fine-tune and load custom LoRA adapters that modify voice identity or style while keeping the base VibeVoice model intact.
    Downloads: 1 This Week
  • 14
    Lama Cleaner

    Image inpainting tool powered by SOTA AI Model

    Lama Cleaner is a free, open-source, and fully self-hostable image inpainting tool powered by state-of-the-art AI models. You can use it to remove any unwanted object, defect, or person from your pictures, or to erase and replace anything in them (powered by Stable Diffusion). Many AIGC creators use Lama Cleaner to clean up their work. ...
    Downloads: 50 This Week
  • 15
    Pixelization

    Stable-diffusion-webui-pixelization

    This is a specialized extension for the popular Stable Diffusion Web UI (AUTOMATIC1111) that focuses on converting or “pixelizing” images into a pixel-art aesthetic. It's designed as a plugin you install into the Web UI so that in the “Extras” or “Pixelization” tab you can drag in an input image and produce a stylized, block-based version with control over cell size, color depth, and segmentation.
    Downloads: 0 This Week
  • 16
    DreamO

    A Unified Framework for Image Customization

    DreamO is a unified, open-source framework from ByteDance for advanced image customization and generation that consolidates multiple “image manipulation” tasks into a single system, rather than requiring separate specialized models. Built on a diffusion-transformer (DiT) backbone, it supports a diverse set of tasks — including identity preservation, virtual “try-on” (e.g. clothing, accessories), style transfer, IP adaptation (objects/characters), and layout/condition-aware customizations — all handled within the same unified architecture. DreamO’s design introduces a feature routing constraint that helps disentangle different control conditions (like identity, style, clothing) when more than one is specified, which significantly reduces conflicts and artifacts when combining controls. ...
    Downloads: 0 This Week
  • 17
    Step1X-Edit

    A SOTA open-source image editing model

    Step1X-Edit is a state-of-the-art open-source image editing model/framework that uses a multimodal large language model (LLM) together with a diffusion-based image decoder to let users edit images simply via natural-language instructions plus a reference image. You supply an existing image and a textual command — e.g. “add a ruby pendant on the girl’s neck” or “make the background a sunset over mountains” — and the model interprets the instruction, computes a latent embedding combining the image content and user intent, then decodes a new image implementing the edit. ...
    Downloads: 0 This Week
  • 18
    Step1X-3D

    High-Fidelity and Controllable Generation of Textured 3D Assets

    ...It combines a hybrid architecture: a geometry generation stage using a VAE-DiT model to output a watertight 3D representation (e.g. a TSDF surface), and a texture synthesis stage that conditions on the geometry and optionally on reference input (or prompts) to produce view-consistent textures via a diffusion-based texture module. The result is full 3D assets — meshes plus textures — that can be rendered from any viewpoint with consistent texturing and used in 3D applications. To achieve this, the project includes a massive curated dataset: from more than 5 million candidate 3D assets, it filters and standardizes down to a high-quality subset of 2 million assets suitable for training.
    Downloads: 0 This Week
  • 19
    CogVideo

    text and image to video generation: CogVideoX (2024) and CogVideo

    CogVideo is an open source text-/image-/video-to-video generation project that hosts the CogVideoX family of diffusion-transformer models and end-to-end tooling. The repo includes SAT and Diffusers implementations, turnkey demos, and fine-tuning pipelines (including LoRA) designed to run across a wide range of NVIDIA GPUs, from desktop cards (e.g., RTX 3060) to data-center hardware (A100/H100). Current releases cover CogVideoX-2B, CogVideoX-5B, and the upgraded CogVideoX1.5-5B variants, plus image-to-video (I2V) models, with options for BF16/FP16/FP32—and INT8 quantized inference via TorchAO for memory-constrained setups. ...
    Downloads: 28 This Week
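    Since the repo ships Diffusers support, a minimal text-to-video call can be sketched with diffusers' CogVideoXPipeline; the model ID and sampling parameters are illustrative, so check the repo for current recommendations:

        import torch
        from diffusers import CogVideoXPipeline
        from diffusers.utils import export_to_video

        pipe = CogVideoXPipeline.from_pretrained(
            "THUDM/CogVideoX-2b", torch_dtype=torch.float16
        ).to("cuda")

        video = pipe(
            prompt="a golden retriever surfing a small wave at sunset",
            num_frames=49,            # CogVideoX's default clip length
            num_inference_steps=50,
        ).frames[0]
        export_to_video(video, "output.mp4", fps=8)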
  • 20
    Flow Matching

    A PyTorch library for implementing flow matching algorithms

    flow_matching is a PyTorch library implementing flow matching algorithms in both continuous and discrete settings, enabling generative modeling via matching vector fields rather than diffusion. The underlying idea is to parameterize a flow (a time-dependent vector field) that transports samples from a simple base distribution to a target distribution, and train via matching of flows without requiring score estimation or noisy corruption—this can lead to more efficient or stable generative training. The library supports both continuous-time flows (via differential equations) and discrete-time analogues, giving flexibility in design and tradeoffs. ...
    Downloads: 9 This Week
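    The training idea described above fits in a few lines of plain PyTorch. This is a generic conditional flow matching loss with a straight-line probability path, written independently of the library's own API (the model signature is assumed):

        import torch

        def cfm_loss(model, x1):
            # Conditional flow matching with a linear path:
            # x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0.
            x0 = torch.randn_like(x1)        # samples from the base distribution
            t = torch.rand(x1.shape[0], 1)   # one time per sample in [0, 1)
            xt = (1 - t) * x0 + t * x1       # point along the path
            target = x1 - x0                 # ground-truth velocity
            pred = model(xt, t)              # v_theta(x_t, t)
            return torch.mean((pred - target) ** 2)

    After training, samples are drawn by integrating dx/dt = v_theta(x, t) from t = 0 to t = 1 with any ODE solver, starting from base-distribution noise.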
  • 21
    HunyuanImage-3.0

    A Powerful Native Multimodal Model for Image Generation

    ...It unifies multimodal understanding and generation in a single autoregressive framework, combining text and image modalities seamlessly rather than relying on separate image-only diffusion components. It uses a Mixture-of-Experts (MoE) architecture with many expert subnetworks to scale efficiently, activating only a subset of experts per token, which allows large parameter counts without a proportional increase in inference cost (see the sketch below). The model is intended to be competitive with closed-source image generation systems, aiming for high fidelity, prompt adherence, fine detail, and even “world knowledge” reasoning (i.e. leveraging context, semantics, or common sense in generation). ...
    Downloads: 12 This Week
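    The “subset of experts per token” idea is standard top-k MoE routing. A generic sketch (not Hunyuan's code; sizes are illustrative):

        import torch
        import torch.nn as nn

        class TopKMoE(nn.Module):
            # A router scores every expert for each token, but only the k
            # best-scoring experts are actually evaluated for that token.
            def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
                super().__init__()
                self.router = nn.Linear(dim, n_experts)
                self.experts = nn.ModuleList(
                    nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                    for _ in range(n_experts)
                )
                self.k = k

            def forward(self, x):                     # x: (tokens, dim)
                weights, idx = self.router(x).topk(self.k, dim=-1)
                weights = weights.softmax(dim=-1)     # normalize over chosen experts
                out = torch.zeros_like(x)
                for slot in range(self.k):
                    for e, expert in enumerate(self.experts):
                        mask = idx[:, slot] == e      # tokens routed to expert e
                        if mask.any():
                            out[mask] += weights[mask, slot, None] * expert(x[mask])
                return out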
  • 22
    HunyuanWorld 1.0

    Generating Immersive, Explorable, and Interactive 3D Worlds

    HunyuanWorld-1.0 is an open-source, simulation-capable 3D world generation model developed by Tencent Hunyuan that creates immersive, explorable, and interactive 3D environments from text or image inputs. It combines the strengths of video-based diversity and 3D-based geometric consistency through a novel framework using panoramic world proxies and semantically layered 3D mesh representations. This approach enables 360° immersive experiences, seamless mesh export for graphics pipelines, and...
    Downloads: 12 This Week
  • 23
    PEFT

    State-of-the-art Parameter-Efficient Fine-Tuning

    Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly; PEFT methods instead fine-tune only a small number of (extra) model parameters, greatly decreasing computational and storage costs. Recent state-of-the-art PEFT techniques achieve performance comparable to that of full...
    Downloads: 2 This Week
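    In practice, PEFT is applied by wrapping a base model with an adapter config such as LoRA. A short sketch using the peft API; the base model and hyperparameters are illustrative:

        from transformers import AutoModelForCausalLM
        from peft import LoraConfig, get_peft_model

        base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
        config = LoraConfig(
            r=8, lora_alpha=16, lora_dropout=0.05,
            target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(base, config)
        model.print_trainable_parameters()  # typically well under 1% of all weights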
  • 24
    TensorRT Node for ComfyUI

    Enables the best performance on NVIDIA RTX Graphics Cards

    ComfyUI_TensorRT is an extension that lets ComfyUI run AI inference through NVIDIA’s TensorRT, aiming to get faster, more efficient execution on supported GPUs. It bridges the gap between ComfyUI’s flexible, node-based workflows and TensorRT’s highly optimized engine format. The result is that complex diffusion or image-processing graphs can be accelerated without the user having to rewrite the pipeline. The repo typically includes instructions for converting models to TensorRT engines and for wiring those engines into ComfyUI nodes. This is particularly attractive for power users who run many generations or who host ComfyUI on dedicated hardware and want to squeeze out every bit of GPU performance. ...
    Downloads: 1 This Week
  • 25
    WhisperSpeech

    An Open Source text-to-speech system built by inverting Whisper

    WhisperSpeech is an open-source text-to-speech system created by “inverting” OpenAI’s Whisper, reusing its strengths as a semantic audio model to generate speech instead of only transcribing it. The project aims to be for speech what Stable Diffusion is for images: powerful, hackable, and safe for commercial use, with code under Apache-2.0/MIT and models trained only on properly licensed data. Its architecture follows a token-based, multi-stage pipeline inspired by AudioLM and SPEAR-TTS: Whisper is used to produce semantic tokens, EnCodec compresses the waveform into acoustic tokens, and Vocos reconstructs high-fidelity audio from those tokens. ...
    Downloads: 5 This Week
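    Basic usage goes through the project's Pipeline class. A sketch based on the README; the default checkpoints and method names may differ between releases, so treat this as an assumption to verify against the repo:

        from whisperspeech.pipeline import Pipeline

        # Downloads the pretrained text-to-semantic and semantic-to-acoustic
        # models on first use (assumed default checkpoints).
        pipe = Pipeline()
        pipe.generate_to_file("hello.wav", "Hello from WhisperSpeech!")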