| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| README.md | 2026-03-13 | 3.0 kB | |
| v0.17.0 source code.tar.gz | 2026-03-13 | 3.2 MB | |
| v0.17.0 source code.zip | 2026-03-13 | 3.4 MB | |
| Totals: 3 Items | | 6.6 MB | 0 |
Highlights
We're excited to bring our commercial-grade speaker diarization framework, SpeakerKit, to open source!
With NVIDIA Sortformer now powering real-time speaker diarization in the Argmax Pro SDK, we're open-sourcing our implementation of Pyannote 4 (community-1). Pyannote is well-known for solving the "who spoke when" problem and has shown strong results on datasets such as AMI, DIHARD, and VoxConverse. Read the blog post for architecture details and benchmarks.
Quickstart
Download, load, diarize, and generate RTTM (Rich Transcription Time Marked) output in just a few lines of code:
```swift
import SpeakerKit

let speakerKit = try await SpeakerKit()
let result = try await speakerKit.diarize(audioPath: "audio.wav")
let rttm = speakerKit.generateRTTM(result: result)
```
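RTTM output is a fixed ten-field, space-separated format (type, file ID, channel, onset, duration, orthography, speaker type, speaker name, confidence, lookahead). As a rough sketch of what one line looks like — the `Segment` type and `rttmLine` helper below are illustrative, not SpeakerKit's actual API:

```swift
import Foundation

// Hypothetical segment type for illustration; not SpeakerKit's actual model.
struct Segment {
    let fileID: String
    let start: Double    // onset in seconds
    let duration: Double // duration in seconds
    let speaker: String  // speaker label, e.g. "A"
}

// Format one segment as a standard ten-field RTTM line; unused
// fields (orthography, speaker type, confidence, lookahead) are <NA>.
func rttmLine(for segment: Segment) -> String {
    let onset = String(format: "%.3f", segment.start)
    let dur = String(format: "%.3f", segment.duration)
    return "SPEAKER \(segment.fileID) 1 \(onset) \(dur) <NA> <NA> \(segment.speaker) <NA> <NA>"
}
```

For example, a segment of speaker A starting at 0.22 s and lasting 7.36 s formats as `SPEAKER audio 1 0.220 7.360 <NA> <NA> A <NA> <NA>`.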
Key features
- End-to-end Pyannote-style diarization pipeline
- Automatically estimate the number of speakers or set it manually
- Utilities to add speaker information to WhisperKit outputs
- Standard RTTM export
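The WhisperKit-integration utility in the list above can be approximated by overlap matching: assign each transcribed word the speaker whose diarization segment covers most of it. A minimal sketch with hypothetical types — not WhisperKit's or SpeakerKit's actual models:

```swift
// Hypothetical types for illustration only.
struct SpeakerSegment { let start: Double; let end: Double; let speaker: String }
struct Word { let text: String; let start: Double; let end: Double }

// Seconds of overlap between a word and a speaker segment.
func overlap(_ word: Word, _ segment: SpeakerSegment) -> Double {
    max(0, min(word.end, segment.end) - max(word.start, segment.start))
}

// Label each word with the speaker whose segment overlaps it the most.
func labelWords(_ words: [Word], with segments: [SpeakerSegment]) -> [(text: String, speaker: String)] {
    words.map { word in
        let best = segments.max { overlap(word, $0) < overlap(word, $1) }
        return (word.text, best?.speaker ?? "unknown")
    }
}
```

Maximum-overlap assignment is a common heuristic for aligning word timestamps with speaker turns; words that straddle a turn boundary go to whichever speaker holds the larger share.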
Explore the new SpeakerKit README section for API documentation, configuration details, and optimization tips.
CLI
whisperkit-cli now includes a dedicated `diarize` subcommand:

```bash
swift run -c release whisperkit-cli diarize --audio-path audio.wav --rttm-path output.rttm
```
With Homebrew:
```bash
brew install whisperkit-cli
whisperkit-cli diarize --audio-path audio.wav --rttm-path output.rttm
```
You can also run transcription and diarization together using the new `--diarization` flag:

```bash
whisperkit-cli transcribe --audio-path audio.wav --diarization
```
Example output:
```
---- Speaker Diarization Results ----
SPEAKER audio 1 0.220 7.360 What is RLHF? reinforcement learning with human feedback. What was that little magic ingredient to the dish that made it so much more delicious? <NA> A <NA> <NA>
SPEAKER audio 1 7.610 14.850 - So we train these models on a lot of text data. And in that process, they learn something about the underlying representations of what's in here or in there. <NA> B <NA> <NA>
```
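Each result line follows the RTTM layout, with the transcript filling the orthography field. Because that field can contain spaces, a consumer should read the fixed fields from both ends of the line. An illustrative sketch in Swift (not part of whisperkit-cli):

```swift
// Parse speaker, onset, and duration from an RTTM-style line whose
// orthography field may contain spaces: take the positional fields
// (type, file, channel, onset, duration) from the front and the
// speaker name from third-to-last position. Illustration only.
func parseRTTM(_ line: String) -> (speaker: String, start: Double, duration: Double)? {
    let fields = line.split(separator: " ").map(String.init)
    guard fields.count >= 10, fields[0] == "SPEAKER",
          let start = Double(fields[3]),
          let duration = Double(fields[4]) else { return nil }
    return (fields[fields.count - 3], start, duration)
}
```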
Additional flags are available for speaker counts, model variants, clustering algorithm tuning, and more.
WhisperAX example updates
The WhisperAX example app has been updated with SpeakerKit support. It now includes diarization toggles, a flexible pipeline selector, and a Speakers tab for browsing labeled segments.
What's Changed
- Add SpeakerKit with Pyannote speaker diarization support by @a2they in https://github.com/argmaxinc/WhisperKit/pull/440
Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.16.0...v0.17.0