AudioLM
AudioLM is a language model for audio that generates high-fidelity speech and piano music with long-term coherence. It learns from raw audio alone, without text transcripts or symbolic representations. It represents audio hierarchically with two kinds of discrete tokens: semantic tokens, extracted from a self-supervised model, capture phonetic or melodic structure and global context, while acoustic tokens, produced by a neural codec, preserve speaker characteristics and fine waveform detail. Three Transformer stages run in sequence: the first predicts semantic tokens for high-level structure, the second predicts coarse acoustic tokens, and the third predicts fine acoustic tokens for detailed synthesis. Given a few seconds of input audio, AudioLM produces seamless continuations that retain voice identity, prosody, and recording conditions for speech, or melody, harmony, and rhythm for music. In human evaluations, the synthetic continuations were nearly indistinguishable from real recordings.
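The order of the three stages can be illustrated with a toy sketch. This is not AudioLM's implementation: the trained Transformers are replaced by a hypothetical `stub_stage` that samples tokens at random, and the vocabulary sizes, token ratios, and the names `generate` and `continue_audio` are all placeholders chosen for illustration. It only shows how each stage conditions on the output of the previous one.

```python
import random
from typing import Callable, Dict, List

Tokens = List[int]
Stage = Callable[[Tokens], int]  # maps a token context to the next token


def stub_stage(vocab_size: int, seed: int) -> Stage:
    """Hypothetical stand-in for a trained Transformer: ignores the
    context and samples a token uniformly at random."""
    rng = random.Random(seed)

    def next_token(context: Tokens) -> int:
        return rng.randrange(vocab_size)

    return next_token


def generate(stage: Stage, context: Tokens, n_new: int) -> Tokens:
    """Autoregressive loop: each new token is fed back as context."""
    out: Tokens = []
    for _ in range(n_new):
        out.append(stage(context + out))
    return out


def continue_audio(
    prompt: Dict[str, Tokens],
    stages: Dict[str, Stage],
    n_semantic: int,
    coarse_per_semantic: int = 2,  # assumed ratio, illustration only
    fine_per_coarse: int = 2,      # assumed ratio, illustration only
) -> Dict[str, Tokens]:
    # Stage 1: continue the semantic tokens (high-level structure).
    semantic = generate(stages["semantic"], prompt["semantic"], n_semantic)

    # Stage 2: coarse acoustic tokens, conditioned on all semantic tokens
    # plus the coarse acoustic tokens of the prompt.
    coarse_context = prompt["semantic"] + semantic + prompt["coarse"]
    coarse = generate(stages["coarse"], coarse_context,
                      n_semantic * coarse_per_semantic)

    # Stage 3: fine acoustic tokens, conditioned on the coarse tokens.
    fine_context = prompt["coarse"] + coarse + prompt["fine"]
    fine = generate(stages["fine"], fine_context,
                    len(coarse) * fine_per_coarse)

    return {"semantic": semantic, "coarse": coarse, "fine": fine}


stages = {
    "semantic": stub_stage(500, seed=0),
    "coarse": stub_stage(1024, seed=1),
    "fine": stub_stage(1024, seed=2),
}
prompt = {"semantic": [1, 2, 3], "coarse": [4, 5], "fine": [6]}
result = continue_audio(prompt, stages, n_semantic=4)
```

In a real system, the coarse and fine acoustic tokens would then be passed to the neural codec's decoder to reconstruct a waveform; that step is omitted here.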
Amazon Polly
Amazon Polly is a service that turns text into lifelike speech, letting you create applications that talk and build entirely new categories of speech-enabled products. Polly's Text-to-Speech (TTS) service uses deep learning to synthesize natural-sounding human speech. With dozens of lifelike voices across a broad set of languages, you can build speech-enabled applications that work in many different countries.
In addition to Standard TTS voices, Amazon Polly offers Neural Text-to-Speech (NTTS) voices that improve speech quality through a newer machine learning approach. Polly’s Neural TTS technology also supports two speaking styles that let you match the delivery to the application: a Newscaster reading style tailored to news narration, and a Conversational speaking style suited to two-way communication such as telephony applications.
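As a minimal sketch of how a speaking style might be requested, the example below wraps text in Polly's SSML `amazon:domain` tag and sends it to the `synthesize_speech` API through `boto3`. Several things here are assumptions rather than part of the description above: the helper names `build_ssml` and `synthesize`, the choice of the `Joanna` voice, the output path, and that the chosen voice and AWS region support the requested style. Running `synthesize` requires `boto3` and configured AWS credentials.

```python
from typing import Optional
from xml.sax.saxutils import escape

# Mapping from the style names used above to SSML domain names
# (assumed values for the amazon:domain tag).
STYLE_DOMAINS = {"newscaster": "news", "conversational": "conversational"}


def build_ssml(text: str, style: Optional[str] = None) -> str:
    """Return an SSML document, optionally wrapped in a speaking style."""
    body = escape(text)  # escape &, <, > so the SSML stays well-formed
    if style is not None:
        domain = STYLE_DOMAINS[style]
        body = f'<amazon:domain name="{domain}">{body}</amazon:domain>'
    return f"<speak>{body}</speak>"


def synthesize(text: str, style: Optional[str] = None,
               voice_id: str = "Joanna",
               out_path: str = "speech.mp3") -> None:
    """Synthesize speech with a neural voice and save it as an MP3 file."""
    import boto3  # imported lazily so build_ssml works without the AWS SDK

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Engine="neural",
        OutputFormat="mp3",
        Text=build_ssml(text, style),
        TextType="ssml",
        VoiceId=voice_id,
    )
    with open(out_path, "wb") as f:
        f.write(response["AudioStream"].read())


plain = build_ssml("Hello & welcome")
news = build_ssml("Here are today's top stories.", style="newscaster")
```

Calling `synthesize("Here are today's top stories.", style="newscaster")` would write the audio to `speech.mp3`, provided the voice supports that style.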
Levelr
Levelr is an AI-powered audio enhancement platform that uses machine learning to deliver studio-grade sound. It removes background noise, isolates speech, and enhances dialogue clarity across a wide range of workflows, making rough or noisy recordings crisp and intelligible with minimal manual effort. It supports common formats such as MP3, WAV, FLAC, AIFF, M4A, and MP4. Users upload audio tracks directly to strip out ambient noise, mic hiss, echoes, music, interference, and other distractions while keeping the voice front and center, which improves accessibility and listener understanding. Its intuitive interface and streamlined workflow save time for creators working on podcasts, interviews, video post-production, live streams, and professional recordings by automating audio restoration tasks that usually require manual equalization or noise gating.
iZotope VEA
VEA (Voice Enhancement Assistant) is an AI-powered audio enhancer developed by iZotope, designed to give any voice recording a more powerful, polished, and professional sound. Tailored for podcasters and content creators of all experience levels, VEA simplifies voice enhancement through an intuitive interface. It refines your voice instantly, with no manual equalizer adjustments or preset browsing, so recordings sound audience-ready in seconds. It adds presence and power, taking the guesswork out of voice mixing and delivering a consistent, engaging sound for your content. Its noise reduction minimizes background noise so that your voice stands out clearly, even in less-than-ideal recording environments. It also lets you match the sound of your favorite creators or podcasts by referencing target audio, and helps you visualize, compare, and replicate audio characteristics.