audio source separation free download

Ultimate Vocal Remover (UVR5)

GUI for a Vocal Remover that uses Deep Neural Networks

This application uses state-of-the-art source separation models to remove vocals from audio files. UVR's core developers trained all of the models provided in this package (except for the Demucs v3 and v4 4-stem models).

Downloads: 679 This Week

Last Update: 2025-01-20

See Project

MLX-Audio

A text-to-speech, speech-to-text and speech-to-speech library

MLX-Audio is a speech library built on Apple’s MLX framework and optimized for Apple Silicon machines (M-series Macs). It focuses on text-to-speech and speech-to-speech workflows, with APIs and a command-line interface that make it easy to generate high-quality audio from text. Because it uses MLX and targets Apple Silicon, inference is fast and can take advantage of hardware acceleration and quantization for efficient on-device performance. The project provides a straightforward CLI...

Downloads: 4 This Week

Last Update: 2026-03-14

See Project

Qwen2-Audio

Repo of Qwen2-Audio chat & pretrained large audio language model

Qwen2-Audio is a large audio-language model by Alibaba Cloud, part of the Qwen series. It is trained to accept various audio signal inputs (including speech, sounds, etc.) and perform both voice chat and audio analysis, producing textual responses. It supports two major modes: Voice Chat (interactive voice only input) and Audio Analysis (audio + text instructions), with both base and instruction-tuned models. It is evaluated on many benchmarks (speech recognition, translation, sound...

Downloads: 1 This Week

Last Update: 2025-09-23

See Project

Step-Audio

Open-source framework for intelligent speech interaction

Step-Audio is a unified, open-source framework aimed at building intelligent speech systems that combine both comprehension and generation: it integrates large language models (LLMs) with speech input/output to handle not only semantic understanding but also rich vocal characteristics like tone, style, dialect, emotion, and prosody. The design moves beyond traditional separate-component pipelines (ASR → text model → TTS), instead offering a multimodal model that ingests speech or audio and produces speech accordingly, enabling natural dialogue, voice cloning, and expressive speech synthesis. ...

Downloads: 0 This Week

Last Update: 2026-03-16

See Project

Qwen-Audio

Chat & pretrained large audio language model proposed by Alibaba Cloud

Qwen-Audio is a large audio-language model developed by Alibaba Cloud, built to accept various types of audio input (speech, natural sounds, music, singing) along with text input, and output text. There is also an instruction-tuned version called Qwen-Audio-Chat which supports conversational interaction (multi-round), audio + text input, creative tasks and reasoning over audio. It uses multi-task training over many different audio tasks (30+), and achieves strong multi-benchmarks performance...

Downloads: 0 This Week

Last Update: 2025-09-23

See Project

Kimi-Audio

Audio foundation model excelling in audio understanding

Kimi-Audio is an ambitious open-source audio foundation model designed to unify a wide array of audio processing tasks — from speech recognition and audio understanding to generative conversation and sound event classification — within a single cohesive architecture. Instead of fragmenting work across specialized models, Kimi-Audio handles automatic speech recognition (ASR), audio question answering, automatic audio captioning, speech emotion recognition, and audio-to-text chat in one system, enabling developers to build rich, multimodal audio applications without stitching together disparate components. ...

Downloads: 0 This Week

Last Update: 2026-01-27

See Project

Fun Audio Chat

Large Audio Language Model built for natural interactions

Fun Audio Chat is an interactive voice-first conversational AI platform designed to let users engage in natural spoken dialogue with large language models in real time, turning speech into context-aware responses while maintaining a smooth back-and-forth experience. It combines speech recognition, audio processing, and AI generation so users can speak simply and receive spoken replies, enabling applications such as virtual assistants, voice bots, and hands-free chat interfaces. The system...

Downloads: 1 This Week

Last Update: 2026-02-27

See Project

Step-Audio 2

Multi-modal large language model designed for audio understanding

Step-Audio2 is an advanced, end-to-end multimodal large language model designed for high-fidelity audio understanding and natural speech conversation: unlike many pipelines that separate speech recognition, processing, and synthesis, Step-Audio2 processes raw audio, reasons about semantic and paralinguistic content (like emotion, speaker characteristics, non-verbal cues), and can generate contextually appropriate responses — including potentially generating or transforming audio output. It...

Downloads: 0 This Week

Last Update: 2026-03-16

See Project

Step-Audio-EditX

LLM-based Reinforcement Learning audio edit model

Step-Audio-EditX is an open-source, 3 billion-parameter audio model from StepFun AI designed to make expressive and precise editing of speech and audio as easy as text editing. Rather than treating audio editing as low-level waveform manipulation, this model converts speech into a sequence of discrete “audio tokens” (via a dual-codebook tokenizer) — combining a linguistic token stream and a semantic (prosody/emotion/style) token stream — thereby abstracting audio editing into high-level token operations. ...

Downloads: 0 This Week

Last Update: 2026-03-16

See Project

Voice-Pro

Comprehensive Gradio WebUI for audio processing

Voice-Pro is the best gradio WebUI for transcription, translation and text-to-speech. It can be easily installed with one click. Create a virtual environment using Miniconda, running completely separate from the Windows system (fully portable). Supports real-time transcription and translation, as well as batch mode.

1 Review

Downloads: 27 This Week

Last Update: 2025-12-05

See Project

Whisper-WebUI

A Web UI for easy subtitle using whisper model

Whisper WebUI is an open-source browser-based interface that simplifies the use of Whisper speech recognition models by providing an intuitive graphical environment for transcription, translation, and subtitle generation. Built with Gradio, it allows users to upload audio or video files, process them locally, and generate accurate text outputs without relying on command-line tools.

Downloads: 5 This Week

Last Update: 2026-03-18

See Project

LTX-2.3

Official Python inference and LoRA trainer package

LTX-2.3 is an open-source multimodal artificial intelligence foundation model developed by Lightricks for generating synchronized video and audio from prompts or other inputs. Unlike most earlier video generation systems that only produced silent clips, LTX-2 combines video and audio generation in a unified architecture capable of producing coherent audiovisual scenes.

Downloads: 88 This Week

Last Update: 2026-03-11

See Project

AudioCraft

Audiocraft is a library for audio processing and generation

AudioCraft is a PyTorch library for text-to-audio and text-to-music generation, packaging research models and tooling for training and inference. It includes MusicGen for music generation conditioned on text (and optionally melody) and AudioGen for text-conditioned sound effects and environmental audio. Both models operate over discrete audio tokens produced by a neural codec (EnCodec), which acts like a tokenizer for waveforms and enables efficient sequence modeling. The repo provides...

Downloads: 6 This Week

Last Update: 2025-10-13

See Project

HeartMuLa

A Family of Open Sourced Music Foundation Models

...The project also includes HeartCodec, a music codec optimized for high reconstruction fidelity, enabling efficient tokenization and reconstruction workflows that are critical for training and generation pipelines. For text extraction from audio, it provides HeartTranscriptor, a Whisper-based model tuned specifically for lyrics transcription, which helps bridge generated or recorded audio back into structured text. It also introduces HeartCLAP, which aligns audio and text into a shared embedding space.

Downloads: 12 This Week

Last Update: 2026-03-05

See Project

pyAudioAnalysis

Python Audio Analysis Library: Feature Extraction, Classification

pyAudioAnalysis is an open-source Python library designed for audio signal analysis, machine learning, and music information retrieval tasks. The project provides a collection of tools that allow developers to extract meaningful features from audio files and use those features for classification, segmentation, and analysis. The library supports multiple audio processing workflows, including feature extraction from raw audio signals, training of machine learning models, and automatic audio segmentation. ...

Downloads: 0 This Week

Last Update: 2026-03-10

See Project

Databend

Cloud-native open source data warehouse for analytics and AI queries

Databend is an open source cloud-native data warehouse designed for large-scale analytics and modern data workloads. Built in Rust, the system focuses on high performance, scalability, and efficient data processing for analytical queries. It is designed with a separation of compute and storage, allowing compute nodes to scale independently while storing data in object storage systems.

Downloads: 0 This Week

Last Update: 2026-03-13

See Project

Basic Pitch

A lightweight audio-to-MIDI converter with pitch bend detection

Basic Pitch is a Python library for Automatic Music Transcription (AMT), using lightweight neural network developed by Spotify's Audio Intelligence Lab. It's small, easy-to-use, pip install-able and npm install-able via its sibling repo. Basic Pitch may be simple, but it's is far from "basic"! basic-pitch is efficient and easy to use, and its multi pitch support, its ability to generalize across instruments, and its note accuracy compete with much larger and more resource-hungry AMT systems....

Downloads: 34 This Week

Last Update: 2024-08-16

See Project

Audiogen Codec

48khz stereo neural audio codec for general audio

AGC (Audiogen Codec) is a convolutional autoencoder based on the DAC architecture, which holds SOTA. We found that training with EMA and adding a perceptual loss term with CLAP features improved performance. These codecs, being low compression, outperform Meta's EnCodec and DAC on general audio as validated from internal blind ELO games. We trained (relatively) very low compression codecs in the pursuit of solving a core issue regarding general music and audio generation, low acoustic...

Downloads: 2 This Week

Last Update: 2024-10-02

See Project

Qwen3-Omni

Qwen3-omni is a natively end-to-end, omni-modal LLM

...It achieves state-of-the-art results: across 36 audio and audio-visual benchmarks, it hits open-source SOTA on 32 and overall SOTA on 22, outperforming or matching strong closed-source models such as Gemini-2.5 Pro and GPT-4o. To reduce latency, especially in audio/video streaming, Talker predicts discrete speech codecs via a multi-codebook scheme and replaces heavier diffusion approaches.

Downloads: 3 This Week

Last Update: 2026-01-08

See Project

Pedalboard

A Python library for audio

pedalboard is a Python library for working with audio: reading, writing, rendering, adding effects, and more. It supports the most popular audio file formats and a number of common audio effects out of the box and also allows the use of VST3® and Audio Unit formats for loading third-party software instruments and effects. pedalboard was built by Spotify’s Audio Intelligence Lab to enable using studio-quality audio effects from within Python and TensorFlow. Internally at Spotify, pedalboard...

Downloads: 0 This Week

Last Update: 2026-02-01

See Project

SoniTranslate

Synchronized Translation for Videos

SoniTranslate is a video translation and dubbing system that produces synchronized target-language audio tracks for existing video content. It provides a web UI built with Gradio, allowing users to upload a video, choose source and target languages, and then run a pipeline that handles transcription, translation and re-synthesis of speech. Under the hood, it uses advanced speech and diarization models to separate speakers, align audio with timecodes and respect subtitle timing, which lets the generated dub track stay in sync with the original video structure. ...

Downloads: 41 This Week

Last Update: 2025-11-28

See Project

AudioMuse-AI

AudioMuse-AI is an Open Source Dockerized environment

AudioMuse-AI is an open-source system designed to automatically generate playlists and analyze music libraries using artificial intelligence and audio signal processing techniques. The platform runs locally in a Dockerized environment and performs detailed sonic analysis on audio files to understand characteristics such as tempo, mood, and acoustic similarity.

Downloads: 3 This Week

Last Update: 2026-03-16

See Project

LatentSync

Taming Stable Diffusion for Lip Sync

LatentSync is an open-source framework from ByteDance that produces high-quality lip-synchronization for video by using an audio-conditioned latent diffusion model, bypassing traditional intermediate motion representations. In effect, given a source video (with masked or reference frames) and an audio track, LatentSync directly generates frames whose lip motions and expressions align with the audio, producing convincing talking-head or animated lip-sync output. ...

Downloads: 1 This Week

Last Update: 2025-12-02

See Project

vocal-separate

An extremely simple tool for separating vocals and background music

vocal-separate is a simple but effective audio processing application that isolates vocals and instrumental tracks from music and video files using stem-based source separation models, enabling tasks such as karaoke creation, remixing, and music analysis. Built as a localized web-based tool, it runs entirely on the user’s machine without requiring an internet connection, emphasizing privacy and convenience for creative work.

Downloads: 4 This Week

Last Update: 2026-02-17

See Project

WhisperJAV

Uses Qwen3-ASR, local LLM, Whisper, TEN-VAD

WhisperJAV is an open-source speech transcription pipeline designed specifically for generating subtitles for Japanese adult video content. The project addresses challenges that standard speech recognition models face when transcribing this type of audio, which often includes low signal-to-noise ratios and large numbers of non-verbal vocalizations. Traditional automatic speech recognition systems can misinterpret these sounds as words, leading to inaccurate transcripts. ...

Downloads: 10 This Week

Last Update: 2026-03-19

See Project

Search Results for "audio source separation"

Showing 211 open source projects for "audio source separation"

Ultimate Vocal Remover (UVR5)

MLX-Audio

Qwen2-Audio

Step-Audio

Qwen-Audio

Kimi-Audio

Fun Audio Chat

Step-Audio 2

Step-Audio-EditX

Voice-Pro

Whisper-WebUI

LTX-2.3

AudioCraft

HeartMuLa

pyAudioAnalysis

Databend

Basic Pitch

Audiogen Codec

Qwen3-Omni

Pedalboard

SoniTranslate

AudioMuse-AI

LatentSync

vocal-separate

WhisperJAV

Search Results for "audio source separation"

Showing 211 open source projects for "audio source separation"

Ultimate Vocal Remover (UVR5)

MLX-Audio

Qwen2-Audio

Step-Audio

Qwen-Audio

Kimi-Audio

Fun Audio Chat

Step-Audio 2

Step-Audio-EditX

Voice-Pro

Whisper-WebUI

LTX-2.3

AudioCraft

HeartMuLa

pyAudioAnalysis

Databend

Basic Pitch

Audiogen Codec

Qwen3-Omni

Pedalboard

SoniTranslate

AudioMuse-AI

LatentSync

vocal-separate

WhisperJAV

Related Searches

Related Categories