
Highlights

We're excited to bring our commercial-grade speaker diarization framework, SpeakerKit, to open source!

With NVIDIA Sortformer now powering real-time speaker diarization in the Argmax Pro SDK, we're open-sourcing our implementation of Pyannote 4 (community-1). Pyannote is well-known for solving the "who spoke when" problem and has shown strong results on datasets such as AMI, DIHARD, and VoxConverse. Read the blog post for architecture details and benchmarks.

Quickstart

Download, load, diarize, and generate RTTM (Rich Transcription Time Marked) output in just a few lines of code:

import SpeakerKit

let speakerKit = try await SpeakerKit()
let result = try await speakerKit.diarize(audioPath: "audio.wav")
let rttm = speakerKit.generateRTTM(result: result)

Key features

  • End-to-end Pyannote-style diarization pipeline
  • Automatically estimate the number of speakers or set it manually
  • Utilities to add speaker information to WhisperKit outputs
  • Standard RTTM export
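Taken together, these features cover the full pipeline shown in the Quickstart. As a minimal end-to-end sketch that also persists the RTTM export (a sketch only: it reuses the Quickstart calls, assumes `generateRTTM` returns a `String`, and the output filename is illustrative):

```swift
import Foundation
import SpeakerKit

// Load the default models and diarize a file, as in the Quickstart above.
let speakerKit = try await SpeakerKit()
let result = try await speakerKit.diarize(audioPath: "audio.wav")

// Serialize to standard RTTM and write it to disk
// (assumes a String result; the path "output.rttm" is illustrative).
let rttm = speakerKit.generateRTTM(result: result)
try rttm.write(toFile: "output.rttm", atomically: true, encoding: .utf8)
```

See the SpeakerKit README for the actual configuration surface, including manual speaker counts.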

Explore the new SpeakerKit README section for API documentation, configuration details, and optimization tips.

CLI

whisperkit-cli now includes a dedicated diarize subcommand:

swift run -c release whisperkit-cli diarize --audio-path audio.wav --rttm-path output.rttm

With Homebrew:

brew install whisperkit-cli
whisperkit-cli diarize --audio-path audio.wav --rttm-path output.rttm

You can also run transcription and diarization together using the new --diarization flag:

whisperkit-cli transcribe --audio-path audio.wav --diarization

Example output:

---- Speaker Diarization Results ----
SPEAKER audio 1 0.220 7.360 What is RLHF? reinforcement learning with human feedback. What was that little magic ingredient to the dish that made it so much more delicious? <NA> A <NA> <NA>
SPEAKER audio 1 7.610 14.850 - So we train these models on a lot of text data. And in that process, they learn something about the underlying representations of what's in here or in there. <NA> B <NA> <NA>
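Each line above follows the standard RTTM column order: type, file ID, channel, turn onset (s), turn duration (s), orthography, speaker type, speaker name, confidence, lookahead. A minimal sketch of consuming this output in plain Swift, assuming single-space-separated fields with the transcript occupying the orthography column as shown:

```swift
import Foundation

// One diarized turn, following the standard RTTM field order:
// Type, File, Channel, Onset, Duration, Ortho, SpeakerType, Name, Conf, Lookahead.
struct RTTMSegment {
    let file: String
    let onset: Double      // turn start, in seconds
    let duration: Double   // turn length, in seconds
    let speaker: String    // speaker label, e.g. "A" or "B"
}

// Parse one RTTM line. Because the orthography field here carries the
// transcript (which contains spaces), the fixed fields are read from
// both ends of the whitespace split.
func parseRTTMLine(_ line: String) -> RTTMSegment? {
    let parts = line.split(separator: " ").map(String.init)
    guard parts.count >= 10, parts[0] == "SPEAKER",
          let onset = Double(parts[3]),
          let duration = Double(parts[4]) else { return nil }
    return RTTMSegment(file: parts[1],
                       onset: onset,
                       duration: duration,
                       speaker: parts[parts.count - 3]) // name is 3rd from the end
}

// Example: a shortened version of the first line above.
let line = "SPEAKER audio 1 0.220 7.360 What is RLHF? <NA> A <NA> <NA>"
if let seg = parseRTTMLine(line) {
    print(seg.speaker, seg.onset, seg.duration) // A 0.22 7.36
}
```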

Additional flags are available for speaker counts, model variants, clustering-algorithm tuning, and more.

WhisperAX example updates

The WhisperAX example app has been updated with SpeakerKit support. It now includes diarization toggles, a flexible pipeline selector, and a Speakers tab for browsing labeled segments.

What's Changed

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.16.0...v0.17.0

Source: README.md, updated 2026-03-13