INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC1/SC29/WG11 N2803
July 1999 / Vancouver, Canada


MPEG-4 Audio Version 2
(Final Committee Draft 14496-3 AMD1)


This web page is intended to provide an overview of the "MPEG-4 Audio Version 2 Final Committee Draft 14496-3" and is based on excerpts from this document. The complete document is available by ftp.


Introduction

MPEG-4 version 2 is an amendment to MPEG-4 version 1. This document contains the description of bitstream and decoder extensions related to new tools defined within MPEG-4 version 2. Unless stated otherwise, the description given in MPEG-4 version 1 is not changed but only extended.

Overview of MPEG-4 Audio Amd 1

ISO/IEC 14496-3 (MPEG-4 Audio) is a new kind of audio standard that integrates many different types of audio coding: natural sound with synthetic sound, low bitrate delivery with high-quality delivery, speech with music, complex soundtracks with simple ones, and traditional content with interactive and virtual-reality content. By standardizing individually sophisticated coding tools as well as a novel, flexible framework for audio synchronization, mixing, and downloaded post- production, the developers of the MPEG-4 Audio standard have created new technology for a new, interactive world of digital audio.

MPEG-4, unlike previous audio standards created by ISO/IEC and other groups, does not target a single application such as real-time telephony or high-quality audio compression. Rather, MPEG-4 Audio is a standard that applies to every application requiring the use of advanced sound compression, synthesis, manipulation, or playback. The subparts that follow specify the state-of-the-art coding tools in several domains; however, MPEG-4 Audio is more than just the sum of its parts. As the tools described here are integrated with the rest of the MPEG-4 standard, exciting new possibilities for object-based audio coding, interactive presentation, dynamic soundtracks, and other sorts of new media, are enabled.

Since a single set of tools is used to cover the needs of a broad range of applications, interoperability is a natural feature of systems that depend on the MPEG-4 Audio standard. A system that uses a particular coder—for example, a real-time voice communication system making use of the MPEG-4 speech coding toolset—can easily share data and development tools with other systems, even in different domains, that use the same tool—for example, a voicemail indexing and retrieval system making use of MPEG-4 speech coding.

The following sections give a more detailed overview of the capabilities and functionalities provided with MPEG-4 Audio version 2.

New Concepts in MPEG-4 Audio Amd 1

With this extension, new tools are added to the MPEG-4 Standard, while none of the existing tools of Version 1 is replaced. Version 2 is therefore fully backward compatible with Version 1. In the area of Audio, new tools are added in MPEG-4 Version 2 to provide the following new functionalities:

Error Robustness

The Error Robustness tools provide improved performance on error-prone transmission channels. There are two classes of tools:

Improved Error Robustness for AAC is provided by a set of tools belonging to the first class of Error Resilience tools. These tools reduce the perceived deterioration of the decoded audio signal that is caused by corrupted bits in the bitstream. The following tools are provided to improve the error robustness of several parts of an AAC bitstream frame: Virtual Codebooks (VCB11), Reversible Variable Length Coding (RVLC), and Huffman Codeword Reordering (HCR).

Improved Error Robustness capabilities for all coding tools are provided by the error resilience bitstream reordering tool. This tool allows for the application of advanced channel coding techniques that are adapted to the special needs of the different coding tools. It is applicable to selected Version 1 object types; for these object types, a new syntax is defined within this amendment to Version 1. All other newly defined object types exist only in this Error Robustness syntax.

The Error Protection tool (EP tool) provides Unequal Error Protection (UEP) for MPEG-4 Audio and belongs to the second class of Error Robustness tools. UEP is an efficient method to improve the error robustness of source coding schemes. It is used by various speech and audio coding systems operating over error-prone channels such as mobile telephone networks or Digital Audio Broadcasting (DAB). The bits of the coded signal representation are first grouped into different classes according to their error sensitivity. Then error protection is individually applied to the different classes, giving better protection to more sensitive bits.
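The UEP principle described above can be sketched in a few lines: bits are grouped by error sensitivity, and each group receives a different amount of redundancy. The class labels and the simple repetition code below are purely illustrative assumptions; the actual EP tool uses CRCs and punctured convolutional codes.

```python
# Illustrative sketch of Unequal Error Protection (UEP): bits are grouped
# into sensitivity classes, and each class gets a different amount of
# redundancy. The class layout and the repetition code are hypothetical;
# the real EP tool uses CRCs and punctured convolutional codes.

def uep_encode(bits, class_of_bit, repetition):
    """Group bits by class and protect each class with a repetition code."""
    classes = {}
    for bit, cls in zip(bits, class_of_bit):
        classes.setdefault(cls, []).append(bit)
    protected = {}
    for cls, cls_bits in classes.items():
        r = repetition[cls]  # more repetitions = stronger protection
        protected[cls] = [b for b in cls_bits for _ in range(r)]
    return protected

frame_bits  = [1, 0, 1, 1, 0, 1]
sensitivity = ["high", "high", "mid", "mid", "low", "low"]
redundancy  = {"high": 3, "mid": 2, "low": 1}  # hypothetical factors
print(uep_encode(frame_bits, sensitivity, redundancy))
```

The more sensitive a class, the more channel-coding redundancy it is given; the least sensitive bits may be transmitted with little or no protection.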

Low-Delay Audio Coding

The MPEG-4 General Audio Coder provides very efficient coding of general audio signals at low bitrates. However, it has an algorithmic delay of up to several hundred milliseconds and is thus not well suited for applications requiring low coding delay, such as real-time bi-directional communication. As an example, for the General Audio Coder operating at 24 kHz sampling rate and 24 kbit/s, this results in an algorithmic coding delay of about 110 ms plus up to an additional 210 ms for the bit reservoir.

To enable coding of general audio signals with an algorithmic delay not exceeding 20 ms, MPEG-4 Version 2 specifies a Low-Delay Audio Coder which is derived from MPEG-2/4 Advanced Audio Coding (AAC). It operates at up to 48 kHz sampling rate and uses a frame length of 512 or 480 samples, compared to the 1024 or 960 samples used in standard MPEG-2/4 AAC. The size of the window used in the analysis and synthesis filterbank is also reduced by a factor of 2. No block switching is used, to avoid the "look-ahead" delay caused by the block switching decision. To reduce pre-echo artefacts in case of transient signals, window shape switching is provided instead: a sine window is used for non-transient parts of the signal, while a so-called low overlap window is used for transient signals. Use of the bit reservoir is minimized in the encoder in order to reach the desired target delay; in the extreme case, no bit reservoir is used at all.
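The effect of the shorter frame on delay can be estimated with a rough model: an MDCT-based coder needs roughly one frame of buffering plus one frame of filterbank overlap, i.e. about 2 × frame length / sampling rate (bit reservoir and any look-ahead excluded). This simplified formula is an assumption for illustration, not the standard's exact delay accounting.

```python
# Back-of-the-envelope algorithmic delay of an MDCT-based coder:
# roughly one frame of buffering plus one frame of filterbank overlap.
# Bit reservoir and block-switching look-ahead are excluded, so the
# figures are lower bounds, not the standard's exact numbers.

def algorithmic_delay_ms(frame_length, sampling_rate_hz):
    return 2.0 * frame_length / sampling_rate_hz * 1000.0

print(algorithmic_delay_ms(1024, 24000))  # standard AAC frame at 24 kHz: ~85 ms
print(algorithmic_delay_ms(512, 48000))   # low-delay frame at 48 kHz: ~21.3 ms
print(algorithmic_delay_ms(480, 48000))   # 480-sample frame at 48 kHz: 20 ms
```

The 480-sample frame at 48 kHz is exactly what meets the 20 ms target; the remaining delay of the Version 1 coder comes from the bit reservoir and the block switching look-ahead that the low-delay coder avoids.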

Fine Grain Scalability

Bitrate scalability, also known as embedded coding, is a very desirable functionality. The General Audio Coder of Version 1 supports large step scalability, where a base layer bitstream can be combined with one or more enhancement layer bitstreams to utilise a higher bitrate and thus obtain better audio quality. In a typical configuration, a 24 kbit/s base layer and two 16 kbit/s enhancement layers could be used, permitting decoding at a total bitrate of 24 kbit/s (mono), 40 kbit/s (stereo), and 56 kbit/s (stereo). Due to the side information carried in each layer, small bitrate enhancement layers are not efficiently supported in Version 1.

To address this problem and to provide efficient small step scalability for the General Audio Coder, the Bit-Sliced Arithmetic Coding (BSAC) tool is available in Version 2. This tool is used in combination with the AAC coding tools and replaces the noiseless coding of the quantised spectral data and the scalefactors. BSAC provides scalability in steps of 1 kbit/s per audio channel, i.e. 2 kbit/s steps for a stereo signal. One base layer bitstream and many small enhancement layer bitstreams are used. The base layer contains the general side information, the specific side information for the first layer, and the audio data of the first layer. The enhancement streams contain only the specific side information and audio data for the corresponding layer.

To obtain fine step scalability, a bit-slicing scheme is applied to the quantised spectral data. First the quantised spectral values are grouped into frequency bands, each group containing the quantised spectral values in their binary representation. Then the bits of a group are processed in slices according to their significance: the most significant bits (MSBs) of all quantised values in a group are processed first, followed by the next most significant bits, and so on. These bit-slices are then encoded using an arithmetic coding scheme to obtain entropy coding with minimal redundancy. Various arithmetic coding models are provided to cover the different statistics of the bit-slices. The scheme that assigns the bit-slices of the different frequency bands to the enhancement layers is constructed so that, as the decoder utilises more enhancement layers, the quantised spectral data is refined by providing more of the less significant bits, and the audio bandwidth is increased by providing bit-slices of the spectral data in higher frequency bands.
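The bit-slicing step described above can be sketched as follows. The arithmetic coding of the slices and their assignment to layers are omitted; the sketch only shows how slicing MSB-first gives progressive refinement when a decoder stops after the first few slices.

```python
# Sketch of the BSAC bit-slicing idea: the quantised spectral values of one
# frequency-band group are written in binary and emitted slice by slice,
# from most to least significant bit. Each slice would then be arithmetic
# coded and assigned to an enhancement layer (both omitted here). A decoder
# that stops after k slices still recovers a coarse version of every value.

def bit_slices(values, nbits):
    """Return nbits slices, slice 0 holding the MSB of every value."""
    return [[(v >> (nbits - 1 - s)) & 1 for v in values]
            for s in range(nbits)]

def reconstruct(slices, nbits):
    """Rebuild values from the first len(slices) slices (missing bits = 0)."""
    vals = [0] * len(slices[0])
    for s, sl in enumerate(slices):
        for i, bit in enumerate(sl):
            vals[i] |= bit << (nbits - 1 - s)
    return vals

spectral = [5, 3, 12, 7]            # quantised values of one group
slices = bit_slices(spectral, 4)
print(slices)                       # MSB slice first
print(reconstruct(slices[:2], 4))   # coarse values after two slices
print(reconstruct(slices, 4))       # all slices: exact reconstruction
```

With only the first two slices the decoder obtains [4, 0, 12, 4], a coarse approximation of [5, 3, 12, 7]; each further slice halves the quantisation step of the recovered values.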

Parametric Audio Coding

The Parametric Audio Coding tools combine very low bitrate coding of general audio signals with the possibility of modifying the playback speed or pitch during decoding without the need for an effects processing unit. In combination with the speech and audio coding tools of Version 1, improved overall coding efficiency is expected for applications of object based coding allowing selection and/or switching between different coding techniques.

Parametric Audio Coding uses the Harmonic and Individual Line plus Noise (HILN) technique to code general audio signals at bitrates of 4 kbit/s and above using a parametric representation of the audio signal. The basic idea of this technique is to decompose the input signal into audio objects which are described by appropriate source models and represented by model parameters. Object models for sinusoids, harmonic tones, and noise are utilised in the HILN coder.
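As a toy illustration of the "individual lines" object model, the sketch below decomposes a short signal by picking its strongest spectral peaks and keeping only their frequency and amplitude parameters. The naive DFT and peak picking are simplifying assumptions; a real HILN encoder additionally models harmonic tones and noise and selects objects with a perception model.

```python
# Toy illustration of the "individual lines" part of HILN: keep only the
# frequency and amplitude parameters of the strongest spectral peaks.
# The naive DFT and greedy peak picking are illustrative simplifications.

import cmath, math

def dft_magnitudes(x):
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n for k in range(n // 2)]

def strongest_lines(x, count):
    """Return (bin, magnitude) of the `count` largest spectral peaks."""
    mags = dft_magnitudes(x)
    return sorted(sorted(enumerate(mags), key=lambda kv: -kv[1])[:count])

n = 64
signal = [math.sin(2 * math.pi * 5 * t / n)
          + 0.5 * math.sin(2 * math.pi * 11 * t / n) for t in range(n)]
lines = strongest_lines(signal, 2)
print(lines)   # the two dominant lines sit at bins 5 and 11
```

Only the extracted parameters, not the waveform, would be quantised and transmitted; the decoder resynthesises the signal from them.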

This approach makes it possible to use a more advanced source model than the assumption of a stationary signal for the duration of a frame, which underlies the spectral decomposition used e.g. in the MPEG-4 General Audio Coder. As known from speech coding, where specialised source models based on the speech generation process in the human vocal tract are applied, advanced source models can be advantageous, in particular for very low bitrate coding schemes.

Due to the very low target bitrates, only the parameters for a small number of objects can be transmitted. Therefore a perception model is employed to select those objects that are most important for the perceptual quality of the signal.

In HILN, the frequency and amplitude parameters are quantised according to the "just noticeable differences" known from psychoacoustics. The spectral envelope of the noise and the harmonic tone is described using LPC modeling as known from speech coding. Correlation between the parameters of one frame and between consecutive frames is exploited by parameter prediction. The quantised parameters are finally entropy coded and multiplexed to form a bitstream.

A very interesting property of this parametric coding scheme arises from the fact that the signal is described in terms of frequency and amplitude parameters. This signal representation permits speed and pitch change functionality by simple parameter modification in the decoder. The HILN Parametric Audio Coder can be combined with the MPEG-4 Parametric Speech Coder (HVXC) to form an integrated parametric coder covering a wider range of signals and bitrates. This integrated coder supports speed and pitch change. Using a speech/music classification tool in the encoder, it is possible to automatically select HVXC for speech signals and HILN for music signals. Such automatic HVXC/HILN switching was successfully demonstrated, and the classification tool is described in the informative Annex of the Version 2 standard.
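The speed and pitch change property can be sketched with a minimal parametric synthesiser: pitch is changed by scaling the frequency parameters, and speed by scaling the synthesis duration, independently of each other. The function and parameter names below are illustrative, not the HILN bitstream syntax.

```python
# Sketch of why a parametric representation permits speed and pitch change:
# the decoder synthesises from (frequency, amplitude) parameters, so pitch
# is changed by scaling frequencies and speed by scaling the synthesis
# duration. Names are illustrative, not the HILN bitstream syntax.

import math

def synthesise(lines, duration_samples, sample_rate,
               pitch_factor=1.0, speed_factor=1.0):
    """lines: list of (frequency_hz, amplitude); returns output samples."""
    n = int(duration_samples / speed_factor)  # slower speed = longer output
    return [sum(a * math.sin(2 * math.pi * f * pitch_factor * t / sample_rate)
                for f, a in lines)
            for t in range(n)]

lines = [(440.0, 1.0), (880.0, 0.3)]
normal = synthesise(lines, 8000, 16000)
slow   = synthesise(lines, 8000, 16000, speed_factor=0.5)  # half speed, same pitch
high   = synthesise(lines, 8000, 16000, pitch_factor=2.0)  # octave up, same length
print(len(normal), len(slow), len(high))
```

Because the two factors act on different parameters, neither modification requires a separate effects processing unit, which is exactly the property exploited by the parametric coder.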

CELP Silence Compression

The silence compression tool reduces the average bitrate by using lower-bitrate compression for silence. In the encoder, a voice activity detector is used to distinguish between regions with normal speech activity and those with silence or background noise. During normal speech activity, the CELP coding of Version 1 is used. Otherwise a Silence Insertion Descriptor (SID) is transmitted at a lower bitrate. This SID enables a Comfort Noise Generator (CNG) in the decoder. The amplitude and spectral shape of this comfort noise are specified by energy and LPC parameters, similar to those in a normal CELP frame. These parameters are an optional part of the SID and thus can be updated as required.
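The encoder/decoder interplay described above can be sketched as follows. The energy threshold and the SID contents are illustrative assumptions; the real tool's SID optionally carries LPC spectral parameters in addition to energy.

```python
# Sketch of CELP silence compression: an energy-based voice activity
# detector (VAD) splits frames into active speech and background noise;
# silent frames are replaced by a tiny Silence Insertion Descriptor (here
# just an energy value), from which the decoder generates comfort noise.
# The threshold and SID contents are illustrative assumptions.

import random

def encode_frames(frames, threshold=0.01):
    out = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            out.append(("SPEECH", frame))  # normal CELP coding
        else:
            out.append(("SID", energy))    # a few bits instead of a frame
    return out

def decode_frame(kind, payload, frame_len, rng):
    if kind == "SPEECH":
        return payload
    # Comfort Noise Generator: noise scaled to the energy signalled in the SID.
    noise = [rng.uniform(-1.0, 1.0) for _ in range(frame_len)]
    e = sum(s * s for s in noise) / frame_len
    scale = (payload / e) ** 0.5 if e > 0 else 0.0
    return [s * scale for s in noise]

frames = [[0.5, -0.4, 0.3, -0.2], [0.001, -0.002, 0.001, 0.0]]
coded = encode_frames(frames)
print([kind for kind, _ in coded])
```

Since the SID is only sent when the background characteristics change, the average bitrate during silence drops to a small fraction of the active-speech rate.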

Extended HVXC

A variable bit-rate mode with a maximum of 4.0 kbit/s is additionally supported in Version 2 HVXC. Version 1 HVXC supports a variable bit-rate mode with a maximum of 2.0 kbit/s, as well as 2.0 and 4.0 kbit/s fixed bit-rate modes. In Version 2, the operation of the variable bit-rate mode is extended to work with the 4.0 kbit/s mode. In the variable bit-rate mode, non-speech parts are detected from unvoiced signals, and a smaller number of bits is used for these parts to reduce the average bit-rate. When the variable bit-rate mode with a maximum of 4.0 kbit/s is used, the average bit-rate goes down to approximately 3.0 kbit/s for typical speech items. Apart from the 4.0 kbit/s variable bit-rate mode, the operation of HVXC in Version 2 is the same as in Version 1.
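How a variable-rate mode brings the average below the peak rate can be seen from a simple weighted average. The 60/40 speech/non-speech split and the 1.5 kbit/s non-speech rate below are hypothetical numbers chosen only to show how an average of about 3.0 kbit/s can arise; they are not taken from the standard.

```python
# Illustrative average bit-rate of a variable-rate mode: speech frames are
# coded at the full rate, detected non-speech frames at a reduced rate.
# The 60/40 split and the 1.5 kbit/s non-speech rate are hypothetical.

def average_bitrate(speech_fraction, speech_rate_kbps, nonspeech_rate_kbps):
    return (speech_fraction * speech_rate_kbps
            + (1.0 - speech_fraction) * nonspeech_rate_kbps)

print(average_bitrate(0.6, 4.0, 1.5))  # e.g. 0.6*4.0 + 0.4*1.5 = 3.0 kbit/s
```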




Contents of N2803

Location Notes
w2803.pdf Cover page of ISO/IEC 14496-3 Amd1 FPDAM document.
w2803_n.pdf Normative part of ISO/IEC 14496-3 Amd1 FPDAM document.
w2803_i.pdf Informative part of ISO/IEC 14496-3 Amd1 FPDAM document.



Heiko Purnhagen 29-Jul-1999