audio text sync free download

Showing 954 open source projects for "audio text sync"

1

Shairport Sync

AirPlay audio player

Shairport Sync is an AirPlay 1 audio player that adds multi-room capability with audio synchronization. Switch to the development branch for a version with limited AirPlay 2 functionality. Shairport Sync plays audio streamed from iTunes, iOS, Apple TV and macOS devices and AirPlay sources such as QuickTime Player and OwnTone, among others.

Downloads: 4 This Week

Last Update: 2026-03-28
See Project
2

MLX-Audio

A text-to-speech, speech-to-text and speech-to-speech library

MLX-Audio is a speech library built on Apple’s MLX framework and optimized for Apple Silicon machines (M-series Macs). It focuses on text-to-speech and speech-to-speech workflows, with APIs and a command-line interface that make it easy to generate high-quality audio from text. Because it uses MLX and targets Apple Silicon, inference is fast and can take advantage of hardware acceleration and quantization for efficient on-device performance.

Downloads: 5 This Week

Last Update: 2026-03-30
See Project
3

Sync Server

Secure, open-source platform for file storage, sharing, collaboration

Sync Server is the core backend of a secure, open-source file storage, sharing, collaboration, and synchronization platform designed to give users full control over their data while supporting modern collaboration needs. It provides a sleek web interface where teams or individuals can upload, organize, and share files with fine-grained access permissions, and its security-minded design includes things like multi-factor authentication and role-based controls to help protect sensitive...

Downloads: 3 This Week

Last Update: 5 days ago
See Project
4

Qwen-Audio

Chat & pretrained large audio language model proposed by Alibaba Cloud

Qwen-Audio is a large audio-language model developed by Alibaba Cloud, built to accept various types of audio input (speech, natural sounds, music, singing) along with text input, and output text. There is also an instruction-tuned version called Qwen-Audio-Chat which supports conversational interaction (multi-round), audio + text input, creative tasks and reasoning over audio.

Downloads: 4 This Week

Last Update: 2025-09-23
See Project
5

Qwen2-Audio

Repo of Qwen2-Audio chat & pretrained large audio language model

Qwen2-Audio is a large audio-language model by Alibaba Cloud, part of the Qwen series. It is trained to accept various audio signal inputs (including speech, sounds, etc.) and perform both voice chat and audio analysis, producing textual responses. It supports two major modes: Voice Chat (interactive, voice-only input) and Audio Analysis (audio plus text instructions), with both base and instruction-tuned models.

Downloads: 0 This Week

Last Update: 2025-09-23
See Project
6

Kimi-Audio

Audio foundation model excelling in audio understanding

Kimi-Audio is an ambitious open-source audio foundation model designed to unify a wide array of audio processing tasks — from speech recognition and audio understanding to generative conversation and sound event classification — within a single cohesive architecture. Instead of fragmenting work across specialized models, Kimi-Audio handles automatic speech recognition (ASR), audio question answering, automatic audio captioning, speech emotion recognition, and audio-to-text chat in one system, enabling developers to build rich, multimodal audio applications without stitching together disparate components. ...

Downloads: 1 This Week

Last Update: 2026-01-27
See Project
7

Step-Audio

Open-source framework for intelligent speech interaction

Step-Audio is a unified, open-source framework aimed at building intelligent speech systems that combine both comprehension and generation: it integrates large language models (LLMs) with speech input/output to handle not only semantic understanding but also rich vocal characteristics like tone, style, dialect, emotion, and prosody. The design moves beyond traditional separate-component pipelines (ASR → text model → TTS), instead offering a multimodal model that ingests speech or audio and produces speech accordingly, enabling natural dialogue, voice cloning, and expressive speech synthesis. ...

Downloads: 4 This Week

Last Update: 2026-03-16
See Project
8

LatentSync

Taming Stable Diffusion for Lip Sync

LatentSync is an open-source framework from ByteDance that produces high-quality lip-synchronization for video by using an audio-conditioned latent diffusion model, bypassing traditional intermediate motion representations. In effect, given a source video (with masked or reference frames) and an audio track, LatentSync directly generates frames whose lip motions and expressions align with the audio, producing convincing talking-head or animated lip-sync output. ...

Downloads: 6 This Week

Last Update: 2025-12-02
See Project
9

Anki

Anki is a smart spaced repetition flashcard program

Anki is a free, open-source spaced repetition flashcard application designed for efficient long‑term memorization. It supports a wide variety of media types (text, images, audio, LaTeX), advanced scheduling algorithms (SM‑2, FSRS), and extensibility via add‑ons. It’s widely used for education, language learning, medical training, and more.

Downloads: 28 This Week

Last Update: 2025-09-17
See Project
10

Audio/Video_ToText_OpenAI_Whisper

Transcriptions with Whisper. This web-based desktop application lets you transcribe (or transcribe and translate into English) audio or video files using OpenAI's Whisper model.

Downloads: 0 This Week

Last Update: 2025-04-11
See Project
11

Step-Audio-EditX

LLM-based Reinforcement Learning audio edit model

Step-Audio-EditX is an open-source, 3-billion-parameter audio model from StepFun AI designed to make expressive and precise editing of speech and audio as easy as text editing. Rather than treating audio editing as low-level waveform manipulation, this model converts speech into a sequence of discrete “audio tokens” via a dual-codebook tokenizer, combining a linguistic token stream and a semantic (prosody/emotion/style) token stream, thereby abstracting audio editing into high-level token operations. ...

Downloads: 0 This Week

Last Update: 2026-04-09
See Project
12

Step-Audio 2

Multi-modal large language model designed for audio understanding

...It integrates a latent-space audio encoder, discrete acoustic tokens, and reinforcement-learning–based training (CoT + RL) to enhance its ability to capture and reproduce voice styles, intonations, and subtle vocal cues. Moreover, Step-Audio 2 supports tool-calling and retrieval-augmented generation (RAG), allowing it to access external knowledge sources or audio/text databases, thus reducing hallucinations and improving coherence in complex dialogues.

Downloads: 0 This Week

Last Update: 2026-03-16
See Project
13

Text Generation Web UI

Oobabooga - The definitive Web UI for local AI, with powerful features

...Instruct mode compatible with Alpaca and Open Assistant formats. Nice HTML output for GPT-4chan. Markdown output for GALACTICA, including LaTeX rendering. Custom chat characters. Advanced chat features (send images, get audio responses with TTS). Very efficient text streaming. Parameter presets, 8-bit mode. Layers splitting across GPU(s), CPU, and disk. CPU mode, FlexGen, DeepSpeed ZeRO-3, API with streaming and without streaming. LLaMA model, including 4-bit GPTQ. RWKV model, LoRA (loading and training), Softprompts, and extensions.

Downloads: 23 This Week

Last Update: 1 day ago
See Project
14

LX Music Mobile

Music software built with React Native

...Because it supports custom sources (including non-official music platforms), the team includes copyright disclaimers clarifying that the project does not take responsibility for the legality or correctness of the audio data you play.

Downloads: 13 This Week

Last Update: 2026-03-28
See Project
15

Wire iOS

Wire for iOS (iPhone and iPad)

...The user interface layer of the mobile app is built on top of the sync engine, which provides the data to display to the UI. The sync engine itself is built on top of a few third-party frameworks, and uses Wire components that are shared between platforms for cryptography (Proteus/Cryptobox) and audio-video signaling (AVS).

Downloads: 1 This Week

Last Update: 2026-04-14
See Project
16

Auto Synced & Translated Dubs

Automatically translates the text of a video based on a subtitle file

...It assumes you have a human-made SRT (or similar) subtitle file; the script then uses translation services such as Google Cloud or DeepL to generate translated subtitle tracks in one or more target languages. Using the timestamps of each subtitle line, it computes the required duration of each spoken segment and synthesizes audio via neural TTS services, producing one audio clip per subtitle entry. The tool then time-stretches or compresses each TTS clip to match the original speech duration exactly, which preserves lip-sync and rhythm as closely as possible without manual editing. Finally, it combines all the clips into a single dubbed audio track that can be muxed with the original video, along with new translated subtitle files.
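The duration-matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the names `Cue`, `parse_cues`, and `stretch_factor` are hypothetical, and the clip lengths passed in stand in for whatever a TTS service would return.

```python
import re
from dataclasses import dataclass

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    """Hypothetical container for one subtitle entry's timing."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video

    @property
    def duration(self) -> float:
        return self.end - self.start


def _to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_cues(srt_text: str) -> list[Cue]:
    """Collect the start/end times of every timing line in an SRT string."""
    cues = []
    for match in TIMING.finditer(srt_text):
        g = match.groups()
        cues.append(Cue(_to_seconds(*g[:4]), _to_seconds(*g[4:])))
    return cues


def stretch_factor(clip_seconds: float, cue: Cue) -> float:
    """Speed multiplier that makes a clip fit the cue's duration.

    A value above 1 means the clip must be sped up (compressed);
    a value below 1 means it must be slowed down (stretched).
    """
    if cue.duration <= 0:
        raise ValueError("cue has non-positive duration")
    return clip_seconds / cue.duration
```

For example, a 5-second synthesized clip that has to fit a 2.5-second subtitle slot yields a factor of 2, meaning it would need to play at double speed. The actual time-stretching would be done by an audio tool or library, which this sketch leaves out.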

Downloads: 0 This Week

Last Update: 2025-11-28
See Project
17

Subtitle Edit

The subtitle editor

Subtitle Edit (SE) is a free, open‑source subtitle editor for creating, editing, synchronizing, and converting subtitles. It supports a wide range of formats (over 300) and offers both graphical and text-based editing views. Subtitle lines are easy to insert, delete, and shift. Portable versions are available (.NET 4.8, 32/64-bit); it runs on Windows and, via compatibility, on Linux. Development is active, with frequent updates and issue tracking. It also offers plugin support and rich editing tools (e.g., translation, spellcheck, sync).

Downloads: 384 This Week

Last Update: 2026-02-06
See Project
18

Nextcloud Server

A safe home for all your data

Nextcloud Server is free and open source server software that allows you to store all of your data on a server of your choosing. With Nextcloud you can easily access and store data in the data center you trust, sync data among various devices, and share your data for collaboration purposes. It offers the best security in the self-hosted file sync and share world, and is expandable with hundreds of apps.

Downloads: 35 This Week

Last Update: 2026-04-02
See Project
19

SoniTranslate

Synchronized Translation for Videos

SoniTranslate is a video translation and dubbing system that produces synchronized target-language audio tracks for existing video content. It provides a web UI built with Gradio, allowing users to upload a video, choose source and target languages, and then run a pipeline that handles transcription, translation and re-synthesis of speech. Under the hood, it uses advanced speech and diarization models to separate speakers, align audio with timecodes and respect subtitle timing, which lets the generated dub track stay in sync with the original video structure. ...

Downloads: 19 This Week

Last Update: 2025-11-28
See Project
20

AudioCraft

AudioCraft is a library for audio processing and generation

AudioCraft is a PyTorch library for text-to-audio and text-to-music generation, packaging research models and tooling for training and inference. It includes MusicGen for music generation conditioned on text (and optionally melody) and AudioGen for text-conditioned sound effects and environmental audio. Both models operate over discrete audio tokens produced by a neural codec (EnCodec), which acts like a tokenizer for waveforms and enables efficient sequence modeling. ...

Downloads: 6 This Week

Last Update: 2025-10-13
See Project
21

sherpa-onnx

Speech-to-text, text-to-speech, and speaker recognition

Speech-to-text, text-to-speech, and speaker recognition using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter.

Downloads: 224 This Week

Last Update: 13 hours ago
See Project
22

Frescobaldi

LilyPond sheet music text editor

Frescobaldi is a free and open source LilyPond sheet music text editor. Designed to be powerful yet lightweight and easy to use, Frescobaldi offers great functionality and a host of useful features such as a music view with advanced two-way Point & Click, MIDI capturing to enter music, a Snippet Manager and many more. Frescobaldi is named after Girolamo Frescobaldi (1583-1643), an Italian composer of keyboard music in the late Renaissance and early Baroque period.

2 Reviews

Downloads: 44 This Week

Last Update: 2026-02-11
See Project
23

QOwnNotes

QOwnNotes is a plain-text file notepad and todo-list manager

QOwnNotes is a fully open-source markdown-focused note-taking and personal information manager that runs natively on Linux, macOS, and Windows, with a strong emphasis on plain-text storage and cross-device synchronization. Because all notes are stored as markdown files in user-controlled directories, you avoid vendor lock-in and can sync through services like Nextcloud, ownCloud, Dropbox, or Git without proprietary cloud dependency. The application is optimized for performance and minimal footprint, so it runs smoothly even with large note libraries, and it includes features like a live markdown preview panel that lets you see rendered text alongside raw markdown as you type. ...

Downloads: 15 This Week

Last Update: 1 day ago
See Project
24

VoxCPM2

Tokenizer-Free TTS for Multilingual Speech Generation

VoxCPM2 is an advanced open-source text-to-speech system that redefines speech synthesis by eliminating traditional tokenization and instead generating continuous speech representations through a diffusion-based autoregressive architecture. Built on top of the MiniCPM model family, it enables highly natural, expressive, and context-aware speech generation that adapts tone, emotion, and pacing directly from input text.

Downloads: 24 This Week

Last Update: 2026-04-13
See Project
25

Qwen3-Omni

Qwen3-Omni is a natively end-to-end, omni-modal LLM

Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model that processes text, images, audio, and video and delivers real-time streaming responses in text and natural speech. It uses a Thinker-Talker architecture with a Mixture-of-Experts (MoE) design, early text-first pretraining, and mixed multimodal training to support strong performance across all modalities without sacrificing text or image quality. The model supports 119 text languages, 19 speech input languages, and 10 speech output languages. ...

Downloads: 3 This Week

Last Update: 2026-01-08
See Project