Page 2 | audio linux free download

Vidi2

Large Multimodal Models for Video Understanding and Editing

Vidi is a family of large multimodal models developed for deep video understanding and editing tasks. It integrates vision, audio, and language to support sophisticated querying and manipulation of video content. It is designed to process long-form, real-world videos and answer complex queries such as “when in this clip does X happen?” or “where in the frame is object Y during that moment?”, offering temporal retrieval, spatio-temporal grounding (i.e., locating objects across both time and space), and...

Downloads: 1 This Week

Last Update: 2026-03-04

GLM-4-Voice

GLM-4-Voice | End-to-End Chinese-English Conversational Model

GLM-4-Voice is an open-source speech-enabled model from ZhipuAI that extends the GLM-4 family into the audio domain. It integrates speech recognition and generation with the multimodal reasoning capabilities of GLM-4, enabling smooth, natural interaction via spoken input and output. The model supports real-time speech-to-text transcription, spoken dialogue understanding, and text-to-speech synthesis, making it suitable for conversational AI, virtual assistants, and accessibility...

Downloads: 1 This Week

Last Update: 2 days ago

MiniCPM-o

A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming

MiniCPM-o 2.6 is a multimodal large language model (MLLM) designed for vision, speech, and video tasks. It can run on end-side devices such as smartphones and tablets, and it provides features such as real-time speech conversation, video understanding, and multimodal live streaming. With 8 billion parameters, MiniCPM-o 2.6 is described as more versatile and efficient than its predecessors. It supports...

Downloads: 0 This Week

Last Update: 2025-05-15

CSM (Conversational Speech Model)

A Conversational Speech Generation Model

CSM (Conversational Speech Model) is a speech generation model developed by Sesame AI that generates residual vector quantization (RVQ) audio codes from text and audio inputs. It uses a Llama backbone and a smaller audio decoder to produce the audio codes used for realistic speech synthesis. The model has been fine-tuned for interactive voice demos and is hosted on platforms such as Hugging Face for testing. CSM offers a flexible setup and runs on CUDA-enabled GPUs.
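
To illustrate what "RVQ audio codes" means in general, here is a minimal sketch of residual vector quantization in plain Python. It is not Sesame's implementation: the function names and the tiny two-stage `CODEBOOKS` are hypothetical, and a real neural codec learns much larger codebooks over encoder latents rather than using hand-written 2-D vectors. Each stage picks the nearest codeword, subtracts it, and passes the leftover residual to the next stage, so the list of chosen indices is the "code" for the input vector.

```python
from typing import List, Sequence

# Hypothetical toy codebooks (assumption for illustration only):
# stage 0 is coarse, stage 1 refines the residual left by stage 0.
CODEBOOKS: List[List[List[float]]] = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]


def nearest_index(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def rvq_encode(vec: Sequence[float], codebooks: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Reconstruct an approximation by summing the chosen codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

With these toy codebooks, `rvq_encode([1.2, 1.0], CODEBOOKS)` selects one index per stage, and `rvq_decode` sums the corresponding codewords to approximate the input. Adding more stages reduces the reconstruction error at the cost of more codes per vector.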

Downloads: 4 This Week

Last Update: 2025-03-19

DiffRhythm

Di♪♪Rhythm: Blazingly Fast & Simple End-to-End Song Generation

DiffRhythm is an open-source, diffusion-based model designed to generate full-length songs. It uses a latent diffusion architecture to produce coherent, high-quality, long-form music. The model is available on Hugging Face, where users can try a demo or download the model for further use. DiffRhythm offers tools for both training and inference, and its...

1 Review

Downloads: 6 This Week

Last Update: 2025-03-06

Demucs

Code for the paper Hybrid Spectrogram and Waveform Source Separation

Demucs (Deep Extractor for Music Sources) is a deep-learning framework for music source separation, that is, extracting individual instrument or vocal tracks from a mixed audio file. The system is based on a U-Net-like convolutional architecture combined with recurrent and transformer elements to capture both short-term and long-term temporal structure. It processes raw waveforms directly rather than spectrograms, allowing for higher-quality reconstruction and fewer artifacts in separated tracks. The...

Downloads: 109 This Week

Last Update: 2025-10-12

VALL-E

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech)

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems....

Downloads: 0 This Week

Last Update: 2023-04-14

Denoiser

Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)

Denoiser is a real-time speech enhancement model that operates directly on raw waveforms and is designed to clean noisy audio while running efficiently on a CPU. It uses a causal encoder-decoder architecture with skip connections, optimized with losses defined in both the time domain and the frequency domain to better suppress noise while preserving speech. Unlike models that operate on spectrograms alone, this design enables lower latency and coherent waveform output. The implementation includes data...

Downloads: 2 This Week

Last Update: 2025-10-07

Dia-1.6B

Dia-1.6B generates lifelike English dialogue and vocal expressions

Dia-1.6B is a 1.6-billion-parameter text-to-speech model by Nari Labs that generates high-fidelity dialogue directly from transcripts. Designed for realistic vocal performance, Dia supports expressive features such as emotion and tone control, as well as non-verbal cues such as laughter, coughing, or sighs. The model accepts speaker conditioning through audio prompts, allowing limited voice cloning and speaker consistency across generations. It is optimized for English and built for real-time performance...

Downloads: 0 This Week

Last Update: 2025-06-27

Search Results for "audio linux" - Page 2

Showing 34 open source projects for "audio linux"
