Step-Audio-EditX is an open-source, 3-billion-parameter audio model from StepFun AI designed to make expressive and precise editing of speech and audio as easy as text editing. Rather than treating audio editing as low-level waveform manipulation, the model converts speech into a sequence of discrete "audio tokens" (via a dual-codebook tokenizer) — combining a linguistic token stream with a semantic (prosody/emotion/style) token stream — thereby abstracting audio editing into high-level token operations. This lets users modify not only what is said (the text) but also how it's said: emotion, tone, speaking style, prosody, accent, even paralinguistic cues. Because the model is trained with a "large-margin learning" objective over many synthesized and natural speech samples, it gains robust control over expressive attributes and supports iterative editing: e.g., record a line, then ask the model to "make it sadder," "speak slower," or "change accent to X."
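To make the dual-stream idea concrete, here is a minimal, self-contained sketch of what "editing at the token level" can mean. The class, token values, and `restyle` helper below are illustrative assumptions for exposition, not the model's actual data structures:

```python
from dataclasses import dataclass

@dataclass
class TokenizedUtterance:
    # Hypothetical dual-codebook representation: two parallel discrete streams.
    linguistic: list  # "what is said" — content tokens
    semantic: list    # "how it's said" — prosody/emotion/style tokens

def restyle(utt: TokenizedUtterance, new_semantic: list) -> TokenizedUtterance:
    """Swap the style stream while preserving the content stream — the
    token-level analogue of 'say the same words, but sadder'."""
    assert len(new_semantic) == len(utt.semantic)
    return TokenizedUtterance(linguistic=list(utt.linguistic),
                              semantic=list(new_semantic))

# Toy token IDs, made up for demonstration.
neutral = TokenizedUtterance(linguistic=[12, 7, 88, 3], semantic=[5, 5, 5, 5])
sad = restyle(neutral, [9, 9, 2, 9])
# The content stream is unchanged; only the expressive stream differs.
```

Because the edit operates on tokens rather than waveforms, "change the emotion" becomes a substitution in one stream instead of a signal-processing problem.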
Features
- Token-based audio editing: converts speech to discrete token streams for high-level, language-like editing operations on audio
- Dual-codebook tokenizer design: separates linguistic content and prosody/style — enabling control over both what is said and how it's said
- Expressive editing: allows modifying emotion, tone, accent, speaking style, prosody, pacing, and other vocal attributes without re-recording
- Iterative editing workflow: supports multiple rounds of edits — e.g. change style, then adjust emotion, then pace, etc.
- Zero-shot TTS: generates speech directly from text plus optional style/emotion instructions, with controllable expressive delivery
- Open-source model and code under a permissive license — enabling integration, customization, and use in research, creative workflows, or production
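The iterative workflow above can be sketched as a loop that feeds each natural-language instruction back into the model along with the current state. The `apply_edit` function here is a toy stand-in; the real Step-Audio-EditX API will differ, and every name and instruction format below is an assumption:

```python
def apply_edit(state: dict, instruction: str) -> dict:
    """Toy stand-in for the model: map a natural-language edit instruction
    onto expressive attributes of the utterance. NOT the real API."""
    updated = dict(state)
    if "sadder" in instruction:
        updated["emotion"] = "sad"
    elif "slower" in instruction:
        updated["pace"] = "slow"
    elif "whisper" in instruction:
        updated["style"] = "whisper"
    return updated

# Multiple rounds of edits over the same utterance: style, emotion, pacing.
state = {"text": "See you tomorrow.", "emotion": "neutral",
         "pace": "normal", "style": "plain"}
for instruction in ["make it sadder", "speak slower", "use a whisper"]:
    state = apply_edit(state, instruction)
```

The point is the shape of the loop: each round consumes the previous round's output, so edits compose without re-recording the source audio.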