Hi Dave (and others),
This is a long email - bear with me. To summarize, I explain how
WaveTracks work, the basic flow of audio for playback (from disk
to mixer to audio I/O), and then propose a class to handle sample
format conversion.
> I wonder about retaining either of the int formats at all for internal
> formats - import/export absolutely but apart from slight space saving
> there is no advantage in my (recent) experience although there was in the
> Atari days! Indeed, using ints can have a small but significant
> performance hit these days with most processors having separate integer
> and floating point processors. Most processors having fpu's have ones
> which in single precision mode are as fast as the integer unit for most
> common operations and which also run in parallel with the integer unit.
> Since the integer unit is commonly used for memory addressing
> calculations, it is better to do arithmetic in floats so that addressing
> calculations can be run simultaneously. A small point but if it helps
> squeeze a few extra % performance...
Even though it may be the case that floats and doubles are the best
internal format, we need to keep the integer types around as an option
just to give the user maximal control. For example, there may be a
real-time dither and a (higher quality) non-real-time dither. A user
might want to take a finished mix stored in floats, manually dither
it down to 16-bit ints using a super-high-quality dither, and listen
to it, before deciding to export it as a WAV file to burn to a CD.
I can also conceive of a user wanting to keep a file 16-bit internally
so that it doesn't take up twice as much temporary disk space. When
editing a very large recording, but not necessarily applying effects,
it would make sense to leave it in its original format.
>>There aren't any flowcharts, but I can try to describe the flow for you
>>in a following email.
> That would be very useful - coming in cold on code which others have
> worked on for some time always involves a steep learning curve.
OK, let me begin discussing some of the things that are relevant to
the audio processing part of Audacity (i.e. I'm leaving out the UI
side of things entirely).
Digital audio in Audacity is stored in WaveTracks. Each WaveTrack is
a single channel and is never interleaved. (A stereo or multi-channel
file is de-interleaved into multiple WaveTracks when it is imported.)
From the outside, a WaveTrack behaves as if it contains one contiguous
array of samples at a fixed (but arbitrary) sample rate, possibly with
a time offset. So a WaveTrack can't be split into two pieces, but if
it's silent except from t=10 seconds through t=12 seconds, then it
only needs to contain 88200 samples (assuming a sample rate of
44.1K) and an offset of 10.0.
The most important methods that a WaveTrack supports are Get and Set,
both of which operate on a contiguous range of samples, plus editing
operations such as Cut, Copy, and Paste.
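To make that concrete, the interface looks roughly like this (a
hypothetical sketch - the actual declarations differ in detail):

   typedef short sampleType;   // 16-bit samples, as in the current code
   typedef long  sampleCount;  // a position or length, in samples

   class WaveTrack {
   public:
      // Copy len samples, starting at position start, into buffer
      void Get(sampleType *buffer, sampleCount start, sampleCount len);
      // Overwrite len samples, starting at position start, from buffer
      void Set(sampleType *buffer, sampleCount start, sampleCount len);
      // Editing operations work on time ranges, in seconds
      void Cut(double t0, double t1, WaveTrack **dest);
      void Copy(double t0, double t1, WaveTrack **dest);
      void Paste(double t, WaveTrack *src);
   };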
Under the hood, WaveTracks are implemented using a special block-file
structure which attempts to bridge the gap between in-place audio
editors (like CoolEdit and SoundForge) and non-destructive audio
editors (like ProTools, Cubase, and almost all other multitrack audio
editors). The problem is that in-place audio editors don't scale
well with file size (the larger the file, the longer it takes to
perform an edit) and non-destructive editors don't scale well with
number of edits (it requires more processing power to play
back a small section of audio that's been edited a lot, since it
has to skip around the disk a lot).
Each WaveTrack keeps an array of small blocks which represent the
audio data, each holding between 32K and 64K bytes worth of audio.
(Initially they're all 64K except for the last block, which is
usually less.)
instead stores a filename, an offset in bytes from the start of
the file, and a number of samples. (I'm glossing over implementation
details like byte ordering and sample format. Use your imagination.)
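In other words, each element of the array is conceptually a little
record like this (a sketch - the field names are made up, and
wxString comes from wxWindows):

   struct WaveBlock {
      wxString    fileName;   // which file on disk holds the samples
      long        start;      // offset in bytes from the file's start
      sampleCount len;        // number of samples in this block
      int         refCount;   // how many undo states share this block
   };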
Here's an example: suppose we open a file containing 150K worth
of data. The WaveTrack creates three nodes in its array, each
pointing to a different part of the file:
|------0------| |------1------| |------2------|
audiofile.wav audiofile.wav audiofile.wav
start: 0 start: 64K start: 128K
len: 64K len: 64K len: 22K
|-------------| |-------------| |-------------|
Now, say the user selects audio from 50K to 90K samples,
and applies an effect. Audacity leaves the original file
alone and saves the audio that's been changed in a new
file (call it b0.auf). It then saves a copy of the
old array and creates a new one that looks like this:
|------0------| |------1------| |------2------| |------3------|
audiofile.wav b0.auf audiofile.wav audiofile.wav
start: 0 start: 0 start: 90K start: 128K
len: 50K len: 40K len: 38K len: 22K
|-------------| |-------------| |-------------| |-------------|
Note that two of the original three elements in the array
had to change, but the third was left alone.
That was an illustration of what Audacity does for one simple
edit. In general, the algorithm is more complicated. What matters,
though, is this:
* All of the blocks must be between 32K and 64K, with the exception
of the first and last blocks.
* With this restriction, Audacity can perform any edit (i.e.
insertions, deletions, moving samples from one place to another)
while touching only a constant number of bytes on disk
(around 300K or less). Even if you delete the middle 300 MB
out of a 1 GB song, it only creates 2 new block files on disk,
and everything else happens in memory.
This means Audacity is very fast at editing!
* Undo is trivially easy to implement. Audacity just gets the
previous array, which still points to valid audio files, since
the old ones haven't been thrown away.
* When an array is thrown away (because an Undo was committed, you
save the project and quit, or you want to save disk space), the
block files which are no longer needed are identified (using
reference counting) and deleted.
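For illustration, throwing an array away might boil down to
something like this (hypothetical code, using the refCount field
from the WaveBlock sketch above, and wxRemoveFile from wxWindows):

   void DiscardBlockArray(WaveBlock **blocks, int count)
   {
      for (int i = 0; i < count; i++)
         if (--blocks[i]->refCount == 0) {
            ::wxRemoveFile(blocks[i]->fileName);  // no one else needs it
            delete blocks[i];
         }
      delete[] blocks;
   }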
It's okay if you don't understand all of the details of the
implementation. What's important is:
- WaveTracks pretend that they're just working with one large
sequence of samples. Few classes outside of a WaveTrack should
know or care about its internal implementation.
- WaveTracks have a sophisticated internal implementation which is
quite fast and efficient, provided that you give them large blocks
of data to work with at a time.
It's horribly inefficient to ask a WaveTrack for one sample at
a time. There's a tremendous overhead.
On the other hand, if you ask for 100K samples, they will be
returned to you in virtually no more time than it would have
taken to read 100K samples from a single flat file using fread().
Anything in between degrades gracefully. Asking for about 32K
samples at a time performs quite well, and that's about as
many as effects tend to ask for. Normal playback only asks
for 2-4K samples at a time, and it can keep up with this
for stereo files. If you try to mix 12 tracks, it has to
start using larger chunks (and should do so automatically).
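So a well-behaved client pulls samples out in big chunks, along
these lines (a sketch; total and track stand in for whatever the
caller has on hand):

   const sampleCount chunk = 32 * 1024;
   sampleType *buffer = new sampleType[chunk];
   for (sampleCount s = 0; s < total; s += chunk) {
      sampleCount count = (total - s < chunk) ? (total - s) : chunk;
      track->Get(buffer, s, count);   // one big read, not 32768 tiny ones
      // ... process count samples ...
   }
   delete[] buffer;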
OK, now you know far more than you might have wanted to about
WaveTracks. The rest is quite simple in comparison.
There's a multi-purpose class called Mixer. It gets passed a
bunch of WaveTracks, each one of which is marked with a channel
(like "Left", "Right", or "Mono" currently) and it mixes them
down to a (possibly interleaved) buffer for export or playback.
Mixer performs "lazy evaluation". After you initialize it with
the list of tracks, you ask for blocks of samples in the
resulting mix a few at a time, and it calculates them on demand.
Mixer will also handle different sample rates, mixing them all
down to whatever resulting sample rate you want. (I just added
this recently.)
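To sketch the lazy-evaluation idea (the constructor arguments and
method names here are invented for illustration, not the actual
API):

   // numTracks, tracks, rate, and format come from the project
   Mixer mixer(numTracks, tracks, rate, format);
   while (samplesLeft > 0) {
      // Each request mixes just enough source audio to satisfy it
      int count = mixer.MixChunk(buffer, bufferSize);
      WriteOut(buffer, count);   // e.g. to a file, or to the audio device
      samplesLeft -= count;
   }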
WaveTracks can have amplitude envelopes associated with them
that get applied just before being mixed. This and sample rate
conversion are the only "real time" processing things that
are done right now.
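Conceptually, applying an envelope is just a per-sample gain,
something like this (a sketch; GetValue is a hypothetical name for
"the envelope's value at time t"):

   // Assuming the samples have already been converted to floats:
   for (sampleCount i = 0; i < len; i++)
      floatBuffer[i] *= envelope->GetValue(t0 + i / rate);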
Non-real-time effects aren't part of this flow at all - they
just operate on WaveTracks, creating new data on disk.
Anyway, the AudioIO class plays audio by setting up some number
of output channels and then using the Mixer class to get new
audio data. So when listening to audio, the flow is just:
WaveTrack -> Mixer -> AudioIO
So, here's what needs to be done to support dithering and
multiple sample formats:
1. Each track needs to have a sample format associated with
it. Each instance of sizeof(sampleType) needs to be
replaced with a lookup of this type.
2. The Mixer needs to be rewritten to take multiple sample
formats and combine them into a single output format.
3. All other parts of the program (like effects and display)
should use doubles, converting to and from doubles when
getting and setting data from WaveTracks that are set to
use a lesser format.
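For step 1, the lookup could be as simple as this (a sketch, using
the format names mentioned below; more formats slot in the same
way):

   enum SampleFormat { sample16, sampleFloat, sampleDouble };

   int SampleSize(SampleFormat format)
   {
      switch (format) {
         case sample16:     return 2;   // short int
         case sampleFloat:  return 4;
         case sampleDouble: return 8;
      }
      return 0;
   }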
Most of this code is pretty easy if we have a bunch of
routines to convert between sample formats. Here's what
I'm imagining (a rough sketch - the class name and the
exact Convert signature are open to discussion):

   class SampleConverter {
   public:
      static void SetRealTimeDither(Dither d);
      static void SetHighQualityDither(Dither d);
      void Convert(SampleFormat format0, void *data0,
                   SampleFormat format1, void *data1,
                   sampleCount len);
   };
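Dithering a float mix down to 16-bit would then just be
(hypothetical usage, continuing the sketch above):

   SampleConverter converter;
   converter.Convert(sampleFloat, floatBuffer,
                     sample16, intBuffer,
                     len);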
So, Dave, what do you think of this class? Would you
be able to start working on an implementation?
We could start by just implementing sample16 and
sampleFloat, handling conversion in both directions
(with dithering), and then fill in the rest of the
formats later.