Showing 98 open source projects for "audio processing"

View related business solutions
  • $300 in Free Credit Towards Top Cloud Services Icon
    $300 in Free Credit Towards Top Cloud Services

    Build VMs, containers, AI, databases, storage—all in one place.

    Start your project in minutes. After credits run out, 20+ products include free monthly usage. Only pay when you're ready to scale.
    Get Started
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 1
    Kimi-Audio

    Kimi-Audio

    Audio foundation model excelling in audio understanding

    Kimi-Audio is an ambitious open-source audio foundation model designed to unify a wide array of audio processing tasks — from speech recognition and audio understanding to generative conversation and sound event classification — within a single cohesive architecture. Instead of fragmenting work across specialized models, Kimi-Audio handles automatic speech recognition (ASR), audio question answering, automatic audio captioning, speech emotion recognition, and audio-to-text chat in one system, enabling developers to build rich, multimodal audio applications without stitching together disparate components. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 2
    Fun Audio Chat

    Fun Audio Chat

    Large Audio Language Model built for natural interactions

    Fun Audio Chat is an interactive voice-first conversational AI platform designed to let users engage in natural spoken dialogue with large language models in real time, turning speech into context-aware responses while maintaining a smooth back-and-forth experience. It combines speech recognition, audio processing, and AI generation so users can speak simply and receive spoken replies, enabling applications such as virtual assistants, voice bots, and hands-free chat interfaces. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    Step-Audio 2

    Step-Audio 2

    Multi-modal large language model designed for audio understanding

    Step-Audio2 is an advanced, end-to-end multimodal large language model designed for high-fidelity audio understanding and natural speech conversation: unlike many pipelines that separate speech recognition, processing, and synthesis, Step-Audio2 processes raw audio, reasons about semantic and paralinguistic content (like emotion, speaker characteristics, non-verbal cues), and can generate contextually appropriate responses — including potentially generating or transforming audio output. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    Ultimate Vocal Remover (UVR5)

    Ultimate Vocal Remover (UVR5)

    GUI for a Vocal Remover that uses Deep Neural Networks

    This application uses state-of-the-art source separation models to remove vocals from audio files. UVR's core developers trained all of the models provided in this package (except for the Demucs v3 and v4 4-stem models).
    Downloads: 794 This Week
    Last Update:
    See Project
  • Forever Free Full-Stack Observability | Grafana Cloud Icon
    Forever Free Full-Stack Observability | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 5
    Librosa

    Librosa

    Python library for audio and music analysis

    Librosa is a powerful Python library for analyzing and processing audio and music signals. Built on top of NumPy, SciPy, and matplotlib, it provides a wide range of tools for feature extraction, time-series manipulation, audio display, and music information retrieval. Whether you're building machine learning models for audio classification or visualizing spectrograms, Librosa is a go-to library for researchers and developers working in audio signal processing.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 6
    ffmpeg-normalize

    ffmpeg-normalize

    Audio Normalization for Python/ffmpeg

    ffmpeg-normalize is a command-line utility designed to normalize audio levels in media files using FFmpeg, ensuring consistent volume across multiple tracks. It supports both EBU R128 loudness normalization and peak normalization methods, allowing users to choose the appropriate standard for their needs. The tool analyzes audio streams and applies adjustments to achieve target loudness levels without introducing distortion. It can process multiple files in batch mode, making it suitable for...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 7
    pyAudioAnalysis

    pyAudioAnalysis

    Python Audio Analysis Library: Feature Extraction, Classification

    pyAudioAnalysis is an open-source Python library designed for audio signal analysis, machine learning, and music information retrieval tasks. The project provides a collection of tools that allow developers to extract meaningful features from audio files and use those features for classification, segmentation, and analysis. The library supports multiple audio processing workflows, including feature extraction from raw audio signals, training of machine learning models, and automatic audio segmentation. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    AudioCraft

    AudioCraft

    Audiocraft is a library for audio processing and generation

    ...It also contains training code and recipes, so researchers can fine-tune on custom data or explore new objectives without building infrastructure from scratch. Example notebooks, CLI tools, and audio utilities help with prompt design, conditioning on reference audio, and post-processing to produce ready-to-share outputs.
    Downloads: 11 This Week
    Last Update:
    See Project
  • 9
    NovaSR

    NovaSR

    A lightning fast audio upsampler

    ...NovaSR is especially valuable for post-processing tasks in speech enhancement, TTS pipelines, and dataset restoration where low sampling rates degrade perceived audio clarity; the minimal model size also makes it suitable for edge and embedded use cases where memory is at a premium. Its performance can reach thousands of times realtime on modern GPUs, allowing massive audio batches to be processed with negligible compute overhead.
    Downloads: 3 This Week
    Last Update:
    See Project
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 10
    Whisper

    Whisper

    Robust Speech Recognition via Large-Scale Weak Supervision

    OpenAI Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.
    Downloads: 53 This Week
    Last Update:
    See Project
  • 11
    AutoSubSync

    AutoSubSync

    Automatic subtitle synchronization tool

    AutoSubSync is a cross-platform desktop application designed to automatically synchronize subtitle files with video content using advanced alignment algorithms. It integrates tools like ffsubsync, autosubsync, and alass to analyze audio and match subtitle timing with high accuracy. The application supports both automatic synchronization and manual adjustment, allowing users to fine-tune results when needed. It provides a drag-and-drop interface that simplifies the process of loading video and subtitle files, making it accessible for non-technical users. AutoSubSync also includes batch processing capabilities, enabling users to handle entire media libraries efficiently. ...
    Downloads: 25 This Week
    Last Update:
    See Project
  • 12
    AI-Media2Doc

    AI-Media2Doc

    AI tool converting video/audio into structured documents instantly

    AI-Media2Doc is a web-based application that uses large language models to convert video and audio content into structured, readable documents in a single workflow. It is designed to transform multimedia inputs into formats such as knowledge notes, summaries, mind maps, and social-style articles, making content easier to review and reuse. AI-Media2Doc emphasizes privacy by processing media locally in the browser using WebAssembly-based ffmpeg, ensuring that original video files are not uploaded externally. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 13
    Twspace-dl

    Twspace-dl

    A python module to download twitter spaces

    Twspace-dl is a Python-based tool designed to download audio content from Twitter Spaces, enabling users to archive live or recorded sessions locally. It works by extracting streaming URLs and processing them with FFmpeg to generate downloadable audio files. The tool supports both command-line and graphical interfaces, making it accessible to different types of users. It requires authentication via exported cookies due to API restrictions, ensuring access to protected content. twspace-dl also includes options for saving metadata, playlists, and cover art associated with the stream. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    TorchAudio

    TorchAudio

    Data manipulation and transformation for audio signal processing

    The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the same philosophy of providing strong GPU acceleration, having a focus on trainable features through the autograd system, and having consistent style (tensor names and dimension names). Therefore, it is primarily a machine learning library and not a general signal processing library.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    WhisperJAV

    WhisperJAV

    Uses Qwen3-ASR, local LLM, Whisper, TEN-VAD

    WhisperJAV is an open-source speech transcription pipeline designed specifically for generating subtitles for Japanese adult video content. The project addresses challenges that standard speech recognition models face when transcribing this type of audio, which often includes low signal-to-noise ratios and large numbers of non-verbal vocalizations. Traditional automatic speech recognition systems can misinterpret these sounds as words, leading to inaccurate transcripts. WhisperJAV introduces...
    Downloads: 21 This Week
    Last Update:
    See Project
  • 16
    Hugging Face - Speech To Speech

    Hugging Face - Speech To Speech

    Open speech-to-speech models and pipelines by Hugging Face toolkit AI

    This project from Hugging Face focuses on enabling direct speech-to-speech processing using modern machine learning models. It provides tools and reference implementations that allow audio input to be transformed into audio output without requiring an intermediate text representation. Hugging Face - Speech To Speech builds on recent advances in speech modeling, combining components such as speech recognition, translation, and synthesis into unified pipelines. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    ReClip

    ReClip

    Download videos from almost any website

    ...Users can paste multiple URLs at once, select output formats such as MP4 or MP3, and choose quality settings before downloading. The system also includes features like automatic URL deduplication and batch processing to improve usability.
    Downloads: 154 This Week
    Last Update:
    See Project
  • 18
    Voice-Pro

    Voice-Pro

    Comprehensive Gradio WebUI for audio processing

    Voice-Pro is the best gradio WebUI for transcription, translation and text-to-speech. It can be easily installed with one click. Create a virtual environment using Miniconda, running completely separate from the Windows system (fully portable). Supports real-time transcription and translation, as well as batch mode.
    Downloads: 48 This Week
    Last Update:
    See Project
  • 19
    cloud-morph

    cloud-morph

    Decentralize, Self-host Cloud Gaming/Application

    cloud-morph is a cloud-based media processing service that enables real-time video and audio transformation using FFmpeg in scalable environments. It is designed to run as a backend service that processes media streams or files and applies transformations such as transcoding, filtering, and format conversion. The system supports API-driven workflows, allowing integration into web applications or automation pipelines.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    VibeVoice

    VibeVoice

    Open-source multi-speaker long-form text-to-speech model

    VibeVoice-1.5B is Microsoft’s frontier open-source text-to-speech (TTS) model designed for generating expressive, long-form, multi-speaker conversational audio such as podcasts. Unlike traditional TTS systems, it excels in scalability, speaker consistency, and natural turn-taking for up to 90 minutes of continuous speech with as many as four distinct speakers. A key innovation is its use of continuous acoustic and semantic speech tokenizers operating at an ultra-low frame rate of 7.5 Hz, enabling high audio fidelity with efficient processing of long sequences. ...
    Downloads: 16 This Week
    Last Update:
    See Project
  • 21
    AudioMuse-AI

    AudioMuse-AI

    AudioMuse-AI is an Open Source Dockerized environment

    AudioMuse-AI is an open-source system designed to automatically generate playlists and analyze music libraries using artificial intelligence and audio signal processing techniques. The platform runs locally in a Dockerized environment and performs detailed sonic analysis on audio files to understand characteristics such as tempo, mood, and acoustic similarity. By analyzing the underlying audio content rather than relying on external metadata services, the system can organize large personal music libraries and generate curated playlists for different moods or listening contexts. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 22
    Videomass

    Videomass

    Videomass is a free, open source and cross-platform GUI for FFmpeg

    Videomass is a free, open-source graphical interface for FFmpeg designed to make advanced video and audio processing accessible to both beginners and experienced users. Built in Python using wxPython, it provides a cross-platform environment for managing encoding, conversion, and editing tasks through a visual interface. The software supports multitasking operations, allowing users to process multiple media files simultaneously. It offers extensive configuration options while also providing presets to simplify common workflows. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 23
    pyVideoTrans

    pyVideoTrans

    Translate the video from one language to another and embed dubbing

    pyVideoTrans is an ambitious open-source multimedia processing project that assembles speech recognition, subtitle generation, AI translation, voice synthesis, and video assembly into a unified pipeline for converting videos from one language to another with embedded dubbing and captions. At its core it runs speech-to-text models to transcribe audio tracks, translates the resulting text into a target language using local or cloud-based translation engines, synthesizes new speech to match the translated subtitles, and then merges that speech back into the video, creating a fully localized media file. ...
    Downloads: 17 This Week
    Last Update:
    See Project
  • 24
    WhisperX

    WhisperX

    Automatic Speech Recognition with Word-level Timestamps

    WhisperX is an advanced speech recognition system built on top of OpenAI’s Whisper model, designed to improve transcription accuracy and timing precision for long-form audio. It addresses key limitations of standard Whisper implementations by introducing voice activity detection and forced alignment techniques to produce word-level timestamps. The system enables batched inference, significantly increasing transcription speed while maintaining high accuracy. It is particularly effective for...
    Downloads: 21 This Week
    Last Update:
    See Project
  • 25
    Insanely Fast Whisper

    Insanely Fast Whisper

    An opinionated CLI to transcribe Audio files w/ Whisper on-device

    Insanely Fast Whisper is a high-performance command-line tool designed to dramatically accelerate speech-to-text transcription using OpenAI’s Whisper models on local hardware. It leverages modern optimizations such as batch processing, mixed precision, and advanced attention mechanisms like Flash Attention to significantly reduce inference time while maintaining high transcription accuracy. The project is built on top of the Transformers ecosystem and integrates with libraries such as Optimum to maximize GPU efficiency. It is specifically engineered for environments with CUDA-enabled GPUs or Apple Silicon devices, allowing users to process hours of audio in minutes or even seconds depending on hardware capabilities. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • Next
MongoDB Logo MongoDB