[Decals-svnlog] SF.net SVN: decals: [1] mp4-als_format_notes.txt
From: <jb...@us...> - 2007-08-20 00:07:56
Revision: 1
          http://decals.svn.sourceforge.net/decals/?rev=1&view=rev
Author:   jbr79
Date:     2007-08-19 17:07:52 -0700 (Sun, 19 Aug 2007)

Log Message:
-----------
initial SVN import

Added Paths:
-----------
    mp4-als_format_notes.txt

Added: mp4-als_format_notes.txt
===================================================================
--- mp4-als_format_notes.txt (rev 0)
+++ mp4-als_format_notes.txt 2007-08-20 00:07:52 UTC (rev 1)
@@ -0,0 +1,635 @@
+
+MPEG-4 Audio Lossless Coding (ALS) Documentation
+------------------------------------------------
+
+
+MPEG-4 Audio Lossless Coding[1][2], also called MP4-ALS, is one of the lossless
+audio coding formats in the MPEG-4 audio specification. It is based on the LPAC
+audio codec[3] by Tilman Liebchen. This document covers the raw ALS format,
+which consists of the ALS header and a sequence of frames.
+
+
+
+KEY CONCEPTS
+============
+
+Original file preservation
+--------------------------
+The ALS header can provide everything that is needed to reconstruct the
+original WAVE, AIFF, or raw audio file. This includes the file type, sample
+rate, number of channels, bits-per-sample, sample type, and byte order. The
+original header and/or footer are also preserved exactly in the ALS header.
+The ALS format supports only 8-bit to 32-bit audio in 8-bit increments. Both
+integer and IEEE floating-point sample types are supported.
+
+
+Frame size and block switching
+------------------------------
+The ALS frame size (number of audio samples) is fixed for a given stream or
+file. All audio samples for a single channel in a frame are referred to as a
+block. In order to adapt to transient signals, each block can be recursively
+divided into subblocks. Each subblock can be split into 2 smaller subblocks,
+down to 1/32 of the original frame size.
+
+
+Random access
+-------------
+Seeking is an important feature for any audio format. It allows the decoder to
+jump forward or backward to the start of a frame without having to decode all
+frames in between. In the ALS format, the encoder can optionally include
+so-called "random access" frames to achieve this goal. The ALS header
+specifies a random access interval from 0 to 255. A value of 0 indicates no
+random access frames, 1 means every frame is a random access frame, 2 means
+every other frame, and so on. Each set of frames between random access frames
+is called a random access unit. The encoder indicates the size of each random
+access unit either by writing all of the sizes in the header or by writing
+each size at the start of its random access frame. The ra_location field in
+the ALS header tells which of the 2 locations is used.
+
+I'm going to go ahead and gripe here about this design. The problem I see is
+that using random access frames is optional, and that there is no unique
+identifier for frame boundaries. A decoder therefore cannot find an arbitrary
+frame without a reference point plus either random access information or a
+full decode of every frame between the reference and the destination.
+
+
+Channels
+--------
+ALS has 2 different ways of reducing redundancy between multiple channels. The
+first method is joint stereo coding. The encoder calculates the difference
+between 2 channels and codes one of the pairs left+right, left+diff, or
+right+diff. For files with more than 2 channels, joint stereo can still be used
+by grouping the channels into sets of pairs and individual channels. For
+example, standard 5.1 audio could be encoded as L+R, C, LFE, LS+RS. The format
+allows the channels to be reordered from their original positions in order to
+pair them together if necessary.
+
+The 2nd method is called multi-channel correlation, or MCC. This method uses
+adaptive subtraction to compare the prediction residual of reference channels
+to the remaining channels. This is a more complex procedure which I will
+describe later once I fully understand the math behind it.
+
+
+Forward Prediction
+------------------
+The primary prediction method in ALS is forward linear predictive coding, or
+LPC[4][5]. There are several ways to select good prediction coefficients. The
+recommended method for ALS is Levinson-Durbin recursion[6][7]. This is a good
+method for 2 reasons. First, it sequentially produces higher-order
+coefficients, which is ideal for adapting the prediction order for each block.
+Second, it produces parcor (reflection) coefficients as an intermediate
+result, which are the values that are quantized and written to the ALS
+bitstream. The standard defines a bit-exact method for transforming the
+quantized parcor coefficients into the LPC coefficients used for linear
+prediction.
+
+One of the unique features of the ALS format is that it uses progressively
+increasing prediction orders for the first samples in random access frames,
+which cannot reference previous samples. Other formats, such as FLAC, write the
+first samples directly, then proceed with prediction using the full number of
+coefficients. In ALS, the first sample is written directly, but each subsequent
+sample is predicted using increasingly higher orders until the full prediction
+order is reached.
+
+
+Long-Term Prediction
+--------------------
+Also called pitch coding, long-term prediction as implemented in ALS reduces
+signal redundancy by predicting the value of an LPC residual sample from a
+weighted combination of 5 residual samples around a specified time lag, using
+5 gain coefficients. The optimal lag and gain values are determined for each
+frame by the encoder. This is called pitch coding because the frequency, or
+pitch, of a signal leads to periodic redundancy. If the period is calculated,
+it can be exploited in order to remove that redundancy from the encoded output.
+This algorithm is typically used in speech codecs, such as Speex.
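To make the long-term prediction step concrete, here is a minimal Python sketch of how a decoder might undo it on one block of residual samples. It assumes that the 5 gains apply to the 5 residual samples centered on n - lag, that the gains are fixed-point values with 7 fractional bits, and that samples before the start of the block count as zero. None of those details are stated in these notes, and `ltp_reconstruct` and `frac_bits` are hypothetical names.

```python
def ltp_reconstruct(coded, gains, lag, frac_bits=7):
    """Undo long-term prediction on one block of residual samples.

    coded     -- residual values as read from the bitstream (after LTP)
    gains     -- 5 integer gains (ASSUMED fixed point, frac_bits fractional bits)
    lag       -- pitch lag in samples (ASSUMED to point at the center tap)
    """
    if len(gains) != 5:
        raise ValueError("expected 5 LTP gains")
    if lag < 3:
        # With a smaller lag the last tap would reference a sample that has
        # not been reconstructed yet.
        raise ValueError("lag must be at least 3")

    out = list(coded)
    half = 1 << (frac_bits - 1)          # rounding offset
    for n in range(len(out)):
        acc = half
        for j, gain in enumerate(gains):
            k = n - lag + j - 2          # taps at center-2 .. center+2
            if k >= 0:                   # ASSUMPTION: earlier history is zero
                acc += gain * out[k]
        out[n] += acc >> frac_bits       # add the prediction back
    return out


if __name__ == "__main__":
    # Unity gain on the center tap only: an impulse repeats every `lag` samples.
    print(ltp_reconstruct([1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 128, 0, 0], 4))
```

With a single unity gain on the center tap (128 under the assumed scaling) and a lag of 4, an impulse in the coded residual reappears every 4 samples, which is the periodic redundancy that pitch coding removes on the encoder side.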
+
+
+RLSLMS (Recursive Least Squares, Least Mean Squares) backward prediction
+------------------------------------------------------------------------
+ALS has an alternative backward prediction mode, which uses the recursive least
+squares algorithm[8].
+
+Entropy Coding
+--------------
+In ALS, both the prediction coefficients and the residual signal are compressed
+using entropy coding. The encoder can choose between 2 different methods,
+Golomb-Rice Coding[9] and Block Gilbert-Moore Coding (BGMC). Rice coding is a
+simple prefix code that matches Huffman coding for certain geometric
+distributions, while BGMC, also called Shannon-Fano-Elias coding, is a
+precursor of arithmetic coding.
+
+
+Floating-Point Data
+-------------------
+The ALS format provides the ability to losslessly encode floating-point[10]
+audio. The floating-point signal is broken down into an integer signal plus a
+residual signal. The integer signal is encoded using the standard ALS
+algorithm, while the residual is encoded using the Masked-Lempel-Ziv
+dictionary-based compression scheme[11].
+
+
+References
+----------
+[1] http://www.nue.tu-berlin.de/forschung/projekte/lossless/mp4als.html
+[2] http://en.wikipedia.org/wiki/Audio_Lossless_Coding
+[3] http://www.nue.tu-berlin.de/wer/liebchen/lpac.html
+[4] http://en.wikipedia.org/wiki/Linear_predictive_coding
+[5] http://en.wikipedia.org/wiki/Linear_prediction
+[6] http://en.wikipedia.org/wiki/Levinson_recursion
+[7] http://en.wikipedia.org/wiki/Autocorrelation
+[8] http://en.wikipedia.org/wiki/Recursive_least_squares_filter
+[9] http://en.wikipedia.org/wiki/Golomb_coding
+[10] http://en.wikipedia.org/wiki/IEEE_floating-point_standard
+[11] http://en.wikipedia.org/wiki/Dictionary_coder
+
+
+
+BITSTREAM FORMAT
+================
+
+ALS Header
+----------
+In the raw ALS file format, the ALS header is located at the beginning of the
+bitstream, and is transmitted only once. It contains information about the
+audio and global properties of the encoded ALS bitstream.
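As a rough illustration of how the 22-byte fixed portion of the header could be unpacked, the Python sketch below walks the field widths listed in Table 1 of these notes. It assumes the fields are packed MSB-first in big-endian order, which the notes do not state explicitly. `FIXED_FIELDS`, `parse_fixed_header`, and the derived `bits_per_sample` key are hypothetical names, not part of any reference software.

```python
# (name, width in bits) in the order given by Table 1; 176 bits = 22 bytes.
FIXED_FIELDS = [
    ("magic", 32), ("sample_rate", 32), ("samples", 32), ("channels", 16),
    ("file_type", 3), ("resolution", 3), ("sample_type", 1),
    ("source_order", 1), ("frame_size", 16), ("random_access", 8),
    ("ra_location", 2), ("adapt", 1), ("coef_table", 2),
    ("long_term_pred", 1), ("max_order", 10), ("block_switching", 2),
    ("entropy_coder", 1), ("subblock_part", 1), ("joint_stereo", 1),
    ("mcc", 1), ("channel_config", 1), ("channel_reorder", 1),
    ("has_crc", 1), ("rlslms", 1), ("reserved", 5), ("aux_enabled", 1),
]
FIXED_BITS = sum(width for _, width in FIXED_FIELDS)   # 176


def parse_fixed_header(data):
    """Split the first 22 bytes of a raw ALS stream into named fields."""
    nbytes = FIXED_BITS // 8
    if len(data) < nbytes:
        raise ValueError("need at least 22 bytes")
    # ASSUMPTION: MSB-first, big-endian bit packing.
    value = int.from_bytes(data[:nbytes], "big")
    remaining = FIXED_BITS
    fields = {}
    for name, width in FIXED_FIELDS:
        remaining -= width
        fields[name] = (value >> remaining) & ((1 << width) - 1)
    if fields["magic"] != 0x414C5300:                  # "ALS\0"
        raise ValueError("not an ALS stream")
    # Undo the "- 1" biases described in Table 1.
    fields["channels"] += 1
    fields["frame_size"] += 1
    fields["bits_per_sample"] = (fields["resolution"] + 1) * 8
    return fields
```

A table-driven reader keeps the field order and widths in one place, so a mismatch with the format shows up as a wrong total bit count rather than a silent misparse.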
+
+als_header() {
+    fixed_portion()                         22 bytes
+    if(channel_config) {
+        ch_config                           16 bits
+    }
+    if(channel_reorder) {
+        // number of bits used to code each channel position
+        ch_bits = max(1, ceil(log2(channels)))
+        for(i=0; i<channels; i++) {
+            channel_position[i]             ch_bits
+        }
+    }
+    header_size                             32 bits
+    trailer_size                            32 bits
+    header()                                header_size
+    trailer()                               trailer_size
+}
+
+The fixed portion of the ALS header is always 22 bytes long. The field layout is
+given in Table 1. If the channel configuration flag is set, the configuration
+data is written. The channel configuration data is currently not used, and
+neither its purpose nor its format is given in the draft spec or in the
+reference software. If the channel reorder flag is set, the channel layout is
+written. The number of bits used to encode the position of each channel is the
+minimum number of bits needed to encode the highest channel index. Next, the
+header size and trailer size (in bytes) are written. These sizes are used by
+the decoder to read the original header and trailer, which make up the last 2
+fields of the ALS header.
+
+Table 1 : 22-byte fixed portion of the ALS header
++------------------------------------------------------------------------------+
+|bits      |parameter         |description                                     |
+|==========+==================+================================================|
+|32        |magic number      |"ALS\0" or 0x414C5300. Indicates that this is   |
+|          |                  |an ALS file                                     |
+|----------+------------------+------------------------------------------------|
+|32        |sample rate       |Audio sampling frequency, e.g. 44100            |
+|----------+------------------+------------------------------------------------|
+|32        |samples           |Total number of samples in the file             |
+|----------+------------------+------------------------------------------------|
+|16        |channels          |Number of channels - 1                          |
+|----------+------------------+------------------------------------------------|
+|3         |file type         |Type of the original audio file                 |
+|          |                  |0=raw/unknown  1=WAVE  2=AIFF                   |
+|----------+------------------+------------------------------------------------|
+|3         |resolution        |Bits-per-sample/8 - 1                           |
+|----------+------------------+------------------------------------------------|
+|1         |sample type       |0=integer samples  1=floating-point samples     |
+|----------+------------------+------------------------------------------------|
+|1         |source order      |Source sample endianness  0=little  1=big       |
+|----------+------------------+------------------------------------------------|
+|16        |frame size        |Source frame size - 1 (measured in samples)     |
+|----------+------------------+------------------------------------------------|
+|8         |random access     |Random access frame interval  0=no RA frames    |
+|----------+------------------+------------------------------------------------|
+|2         |ra location       |Random access data location                     |
+|          |                  |0=none  1=frames  2=header                      |
+|----------+------------------+------------------------------------------------|
+|1         |adapt             |Indicates use of adaptive LPC order             |
+|----------+------------------+------------------------------------------------|
+|2         |coef table        |Selects 1 of the 3 tables of Rice code          |
+|          |                  |parameters used to encode parcor coefficients.  |
+|          |                  |Selection is generally based on the sample rate.|
+|          |                  |3=use 7-bit quantizer instead of Rice coding.   |
+|          |                  |See Tables 2, 3, and 4                          |
+|----------+------------------+------------------------------------------------|
+|1         |long-term pred    |Indicates use of long-term prediction           |
+|----------+------------------+------------------------------------------------|
+|10        |max order         |Maximum LPC order                               |
+|----------+------------------+------------------------------------------------|
+|2         |block switching   |Indicates use of block switching and maximum    |
+|          |                  |level of block division                         |
+|----------+------------------+------------------------------------------------|
+|1         |entropy coder     |Type of entropy coder used  0=Rice  1=BGMC      |
+|----------+------------------+------------------------------------------------|
+|1         |subblock part     |Indicates partitioned entropy coding            |
+|----------+------------------+------------------------------------------------|
+|1         |joint stereo      |Indicates use of joint stereo coding            |
+|----------+------------------+------------------------------------------------|
+|1         |mcc               |Indicates use of multi-channel correlation      |
+|----------+------------------+------------------------------------------------|
+|1         |channel config    |Indicates presence of channel configuration     |
+|----------+------------------+------------------------------------------------|
+|1         |channel reorder   |Indicates presence of channel reordering        |
+|----------+------------------+------------------------------------------------|
+|1         |has crc           |Indicates presence of reference CRC checksum    |
+|----------+------------------+------------------------------------------------|
+|1         |rlslms            |Indicates use of RLSLMS backward prediction     |
+|----------+------------------+------------------------------------------------|
+|5         |reserved          |Not used                                        |
+|----------+------------------+------------------------------------------------|
+|1         |aux enabled       |Indicates presence of auxiliary data            |
++------------------------------------------------------------------------------+
+
+Table 2 : Coef table 0 : Rice code params for encoding of parcor coefficients
+          Recommended for 44kHz and 48kHz
++--------------------------------+
+|coef number |offset |rice param |
+|============+=======+===========|
+|0           |-52    |4          |
+|------------+-------+-----------|
+|1           |-29    |5          |
+|------------+-------+-----------|
+|2           |-31    |4          |
+|------------+-------+-----------|
+|3           |19     |4          |
+|------------+-------+-----------|
+|4           |-16    |4          |
+|------------+-------+-----------|
+|5           |12     |3          |
+|------------+-------+-----------|
+|6           |-7     |3          |
+|------------+-------+-----------|
+|7           |9      |3          |
+|------------+-------+-----------|
+|8           |-5     |3          |
+|------------+-------+-----------|
+|9           |6      |3          |
+|------------+-------+-----------|
+|10          |-4     |3          |
+|------------+-------+-----------|
+|11          |3      |3          |
+|------------+-------+-----------|
+|12          |-3     |2          |
+|------------+-------+-----------|
+|13          |3      |2          |
+|------------+-------+-----------|
+|14          |-2     |2          |
+|------------+-------+-----------|
+|15          |3      |2          |
+|------------+-------+-----------|
+|16          |-1     |2          |
+|------------+-------+-----------|
+|17          |2      |2          |
+|------------+-------+-----------|
+|18          |-1     |2          |
+|------------+-------+-----------|
+|19          |2      |2          |
+|------------+-------+-----------|
+|2k, k>=10   |0      |2          |
+|------------+-------+-----------|
+|2k+1, k>=10 |1      |2          |
+|------------+-------+-----------|
+|k>=127      |0      |1          |
++--------------------------------+
+
+Table 3 : Coef table 1 : Rice code params for encoding of parcor coefficients
+          Recommended for 96kHz
++--------------------------------+
+|coef number |offset |rice param |
+|============+=======+===========|
+|0           |-58    |3          |
+|------------+-------+-----------|
+|1           |-42    |4          |
+|------------+-------+-----------|
+|2           |-46    |4          |
+|------------+-------+-----------|
+|3           |37     |5          |
+|------------+-------+-----------|
+|4           |-36    |4          |
+|------------+-------+-----------|
+|5           |29     |4          |
+|------------+-------+-----------|
+|6           |-29    |4          |
+|------------+-------+-----------|
+|7           |25     |4          |
+|------------+-------+-----------|
+|8           |-23    |4          |
+|------------+-------+-----------|
+|9           |20     |4          |
+|------------+-------+-----------|
+|10          |-17    |4          |
+|------------+-------+-----------|
+|11          |16     |4          |
+|------------+-------+-----------|
+|12          |-12    |4          |
+|------------+-------+-----------|
+|13          |12     |3          |
+|------------+-------+-----------|
+|14          |-10    |3          |
+|------------+-------+-----------|
+|15          |7      |3          |
+|------------+-------+-----------|
+|16          |-4     |3          |
+|------------+-------+-----------|
+|17          |3      |3          |
+|------------+-------+-----------|
+|18          |-1     |3          |
+|------------+-------+-----------|
+|19          |1      |3          |
+|------------+-------+-----------|
+|2k, k>=10   |0      |2          |
+|------------+-------+-----------|
+|2k+1, k>=10 |1      |2          |
+|------------+-------+-----------|
+|k>=127      |0      |1          |
++--------------------------------+
+
+Table 4 : Coef table 2 : Rice code params for encoding of parcor coefficients
+          Recommended for 192kHz
++--------------------------------+
+|coef number |offset |rice param |
+|============+=======+===========|
+|0           |-59    |3          |
+|------------+-------+-----------|
+|1           |-45    |5          |
+|------------+-------+-----------|
+|2           |-50    |4          |
+|------------+-------+-----------|
+|3           |38     |4          |
+|------------+-------+-----------|
+|4           |-39    |4          |
+|------------+-------+-----------|
+|5           |32     |4          |
+|------------+-------+-----------|
+|6           |-30    |4          |
+|------------+-------+-----------|
+|7           |25     |3          |
+|------------+-------+-----------|
+|8           |-23    |3          |
+|------------+-------+-----------|
+|9           |20     |3          |
+|------------+-------+-----------|
+|10          |-20    |3          |
+|------------+-------+-----------|
+|11          |16     |3          |
+|------------+-------+-----------|
+|12          |-13    |3          |
+|------------+-------+-----------|
+|13          |10     |3          |
+|------------+-------+-----------|
+|14          |-7     |3          |
+|------------+-------+-----------|
+|15          |3      |3          |
+|------------+-------+-----------|
+|16          |0      |3          |
+|------------+-------+-----------|
+|17          |-1     |3          |
+|------------+-------+-----------|
+|18          |2      |3          |
+|------------+-------+-----------|
+|19          |-1     |3          |
+|------------+-------+-----------|
+|2k, k>=10   |0      |2          |
+|------------+-------+-----------|
+|2k+1, k>=10 |1      |2          |
+|------------+-------+-----------|
+|k>=127      |0      |1          |
++--------------------------------+
+
+
+ALS Frame
+---------
+if(RA is enabled, RA location is in frames, and current frame is an RA frame) {
+    * random access unit size, 32 bits
+}
+if(MCC is enabled) {
+    if(joint stereo is not enabled) {
+        * use MCC for this frame
+    } else {
+        * read 8 bits to determine if this frame uses MCC
+          seems only 1 bit should be needed. not sure what the other 7 bits are
+          used for...
+    }
+}
+if(MCC is not enabled and RLS-LMS is not enabled) {
+    for each channel or channel pair {
+
+        * enable coupled block switching if joint stereo is enabled, this is
+          the first channel in a pair, and this is not the last channel
+
+        if(block switching is enabled) {
+            * read block switch flags (8-bit, 16-bit or 32-bit)
+              coupled block switching can be explicitly turned off here
+        } else {
+            * use only 1 subblock at full frame size
+        }
+
+        if(coupled block switching is enabled) {
+            for each subblock {
+                * decode subblock for 1st channel in pair
+                * decode subblock for 2nd channel in pair
+                * restore original signal if one of the pair is a difference
+                  signal
+            }
+        } else {
+            for each subblock {
+                * decode subblock.
+                * if coupled block switching is explicitly disabled for this
+                  pair, reconstruct the difference signal after reading the 2nd
+                  channel in the pair. this is for use as predictor context for
+                  the next frame.
+            }
+        }
+    }
+} else if(RLS-LMS is enabled) {
+    TODO: complete this section
+} else if(MCC is enabled) {
+    TODO: complete this section
+}
+
+Block switch flags
+------------------
+The block switch value in the header indicates the number of flag bits that are
+contained in each block. If fewer than 32 bits are used, the flag bits are
+left-aligned in a 32-bit integer. This 32-bit integer making up the block switch
+flags is treated as an array of 1-bit values containing a binary tree. Index 0
+(the high bit) indicates if channel coupling is turned off in order to do
+independent block switching. The root node is located at index 1. The layout
+follows the standard binary-tree-in-array implementation. For a node at index i,
+the parent node is at i/2, the first child is at i*2, and the second child is at
+i*2+1. If a node has a value of 1, the corresponding subblock is split into 2
+smaller equal-sized subblocks. If a node has a value of 0, the subblock is not
+split. The smallest subblock size is 1/32 of the original frame size. For the
+last frame in the stream, which may be smaller than all the other frames, the
+subblocks are divided as if it were a full-size frame, but only the subblocks
+needed to hold the final number of samples are used, with the last subblock
+being truncated if needed.
+
+
+ALS Subblock
+------------
+* align bitstream to 8-bit boundary
+* 1-bit frame type. 0=zero or constant, 1=normal
+if(frame type is zero or constant) {
+    * 1-bit frame type. 0=zero, 1=constant
+}
+* 1-bit difference flag indicates if this subblock contains a normal
+  signal or a difference signal.
+if(frame type is constant) {
+    * align bitstream to 8-bit boundary
+    * read 1 sample. size based on sample resolution in the ALS header.
+} else if(frame type is normal) {
+    // determine entropy partitioning and read entropy coding parameters
+    if(coding type is Rice) {
+        if(entropy partitioning is enabled) {
+            * 1-bit partition flag. 0=1 partition, 1=4 partitions
+        } else {
+            * only 1 partition
+        }
+        * read 1st Rice parameter (4-bit if depth<=16-bit, 5-bit if >16-bit)
+        if(partitioning is enabled) {
+            * read 3 remaining delta-encoded Rice parameters. deltas are Rice
+              encoded using parameter=0.
+        }
+    } else if(coding type is BGMC) {
+        if(entropy partitioning is enabled) {
+            * 2-bit partition order. 2^x partitions (1, 2, 4, or 8)
+        } else {
+            * 1-bit partition flag. 0=1 partition, 1=4 partitions
+        }
+        * read 1st BGMC parameter (8-bit if depth<=16-bit, 9-bit if >16-bit)
+        * read remaining delta-encoded Rice parameters. deltas are Rice
+          encoded using parameter=2.
+        * separate grouped BGMC parameters:
+            s = p >> 4
+            sx = p & 0x0F
+    }
+    // shift (unused LSBs)
+    * read 1-bit shift flag
+    if(shift flag) {
+        * read 4-bit shift value, add 1. this is the number of unused LSBs.
+    }
+    // decode and reconstruct LPC coefficients
+    if(using LPC prediction [not RLS-LMS]) {
+        // determine LPC order
+        * order = fixed/max prediction order (from ALS header)
+        if(adaptive prediction order is used) {
+            * limit max order to range [1 ... (subblock size / 8) - 1]
+            * read order for this subblock using number of bits required to
+              store max order.
+        }
+        // decode quantized coefficients
+        if(not using a coefficient table [ct=3]) {
+            * read each quantized coefficient as 7-bit value - 64
+        } else {
+            * read quantized coefficients from bitstream using rice params from
+              the coefficient table.
+            * apply the offsets from the coefficient table
+        }
+        // reconstruct parcor coefficients
+        * reconstruct 1st coefficient from the quantized coefficient using the
+          parcor scaling formula below:
+            where qc has a range of -64 to 63,
+            coeff = 32 + ((qc+64)*(qc+65)*128) - (1 << 20)
+        * reconstruct the 2nd coefficient the same way, then reverse the sign.
+        * reconstruct remaining coefficients using
+            coeff = (qc << (Q-6)) + (1 << (Q-7)), where Q=20
+                  = (qc << 14) + (1 << 13)
+                  = (qc * 16384) + 8192
+    } else if(using RLS-LMS) {
+        * order = 10
+    }
+    if(using LTP) {
+        // decode 5 LTP coefficients and lag
+        c0 = signed rice code (rp=1)
+        c1 = signed rice code (rp=2)
+        c2 = unsigned rice code (rp=2)
+        c3 = signed rice code (rp=2)
+        c4 = signed rice code (rp=1)
+        lag = x-bit unsigned value, where x = 8 + (samplerate / 96000),
+              using integer division
+    }
+    // decode residual for this subblock
+    if(no random access for this frame) {
+        for(each entropy coding partition) {
+            if(coding type is Rice)
+                * decode residual for partition using Rice parameters
+            if(coding type is BGMC)
+                * decode residual for partition using BGMC parameters
+        }
+    } else {
+        // in RA frames, first residual values are larger because they are
+        // progressively predicted, so a different entropy coding is used in
+        // order to encode more efficiently
+        * decode up to 3 residuals separately, up to prediction order.
+            r0 = signed rice code (rp=bitdepth-4 [e.g. 16-4=12])
+            r1 = signed rice code (rp=1st rice parameter + 3 [max=31])
+            r2 = signed rice code (rp=1st rice parameter + 1 [max=31])
+          this will make the 1st partition shorter
+        for(each entropy coding partition) {
+            if(coding type is Rice)
+                * decode residual for partition using Rice parameters
+            if(coding type is BGMC)
+                * decode residual for partition using BGMC parameters
+        }
+    }
+}
+if(using RLS-LMS prediction) {
+    // TODO: RLS-LMS parameter decoding
+}
+if(using MCC) {
+    // TODO: decode MCC reference channels and weighting factors
+}
+
+
+Reconstructing the signal (LPC mode, no MCC)
+--------------------------------------------
+When LPC is used, each subblock contains its own set of LPC coefficients.
+After the residual has been decoded, the signal can be reconstructed using
+inverse linear prediction. When decoding the 1st subblock of a random access
+frame, progressive linear prediction is used because of the lack of access to
+previous "warm-up" samples.
+
+if(frame type is zero) {
+    * set all samples for the subblock to zero
+} else if(frame type is constant) {
+    * set all samples for the subblock to previously decoded constant value
+} else if(frame type is normal) {
+    * temporarily apply shift to warm-up samples from previous block
+    if(using LTP) {
+        * reconstruct final residual by reversing LTP
+    }
+    if(not a random access frame) {
+        * convert parcor coefficients to LPC coefficients using pseudo-code
+          below. the products need 64-bit intermediates to avoid integer
+          overflow. at step m only lpc[0..m-1] are valid, so coefficient i is
+          paired with coefficient m-1-i, and the middle one is updated when m
+          is odd.
+            for(m=0; m<pred_order; m++) {
+                for(i=0; i<(m+1)/2; i++) {
+                    temp  = lpc[    i] + (((par[m] * lpc[m-1-i]) + (1<<19)) >> 20);
+                    temp2 = lpc[m-1-i] + (((par[m] * lpc[    i]) + (1<<19)) >> 20);
+                    lpc[m-1-i] = temp2;
+                    lpc[    i] = temp;
+                }
+                lpc[m] = par[m];
+            }
+        * inverse linear prediction using code below
+          note that samples below index 0 are from the previous subblock
+            for(each sample, at index n, in the subblock) {
+                sum = 0;
+                for(i=0; i<pred_order; i++)
+                    sum += lpc[i] * signal[n-i-1];
+                signal[n] = residual[n] - ((sum + (1 << 19)) >> 20);
+            }
+    } else if(random access frame) {
+        // TODO: RA inverse LPC (progressive prediction)
+    }
+    * undo shift from warm-up samples and block samples
+}
+
+
+TODO: Reconstructing the signal (RLS-LMS mode)
+
+TODO: Reconstructing the signal (MCC mode)
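For the LPC-mode reconstruction described in these notes, the following Python sketch shows one way the parcor-to-LPC conversion and the inverse prediction could fit together. It uses the Q20 rounding and the sign convention from the pseudo-code in the notes. It pairs coefficient i with coefficient m-1-i, which is the textbook form of the step-up recursion and an assumption on my part rather than something confirmed against the standard. `parcor_to_lpc`, `inverse_lpc`, and `history` are hypothetical names. Python integers do not overflow, so the 64-bit concern from the notes does not arise here.

```python
Q = 20                      # fixed-point fractional bits used by the notes
ROUND = 1 << (Q - 1)        # rounding offset, 1 << 19


def parcor_to_lpc(par):
    """Step-up recursion: Q20 parcor coefficients -> Q20 LPC coefficients."""
    lpc = []
    for m, k in enumerate(par):
        prev = lpc
        # ASSUMPTION: coefficient i is updated from coefficient m-1-i of the
        # previous order, all read before any of them is overwritten.
        lpc = [prev[i] + ((k * prev[m - 1 - i] + ROUND) >> Q) for i in range(m)]
        lpc.append(k)
    return lpc


def inverse_lpc(residual, lpc, history):
    """Rebuild samples from a residual, given the preceding warm-up samples."""
    order = len(lpc)
    if len(history) < order:
        raise ValueError("need at least `order` warm-up samples")
    signal = list(history[len(history) - order:])
    for r in residual:
        # lpc[i] multiplies the sample i+1 positions back, as in the notes.
        acc = sum(c * signal[-1 - i] for i, c in enumerate(lpc))
        signal.append(r - ((acc + ROUND) >> Q))
    return signal[order:]


if __name__ == "__main__":
    # A single coefficient of -1.0 in Q20 turns the residual into a running sum.
    print(inverse_lpc([1, 1, 1, 1], [-(1 << 20)], [0]))
    print(parcor_to_lpc([1 << 19, 1 << 19]))
```

Building each order's coefficients from a copy of the previous order avoids the in-place swap bookkeeping, at the cost of one short list allocation per order.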