INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC1/SC29/WG11 N2203
March 1998 / Tokyo


MPEG-4 Audio
(Final Committee Draft 14496-3)


This web page is intended to provide an overview of the "MPEG-4 Audio Final Committee Draft 14496-3" and is based on excerpts from this document. The complete document is available by ftp.


Introduction

MPEG-4 audio coding integrates the worlds of speech and high quality audio coding as well as the worlds of sound synthesis and the representation of natural audio. The sound synthesis part comprises tools for the realisation of symbolically defined music and speech. This includes MIDI and Text-to-Speech (TTS) systems. Furthermore, tools for effects processing and 3-D localisation of sound are included, allowing the creation of artificial sound environments using artificial and natural sources.

Synthetic audio is described by first defining a set of 'instrument' modules that can create and process audio signals under the control of a script or score file. An instrument is a small network of signal processing primitives that can emulate the effects of a natural acoustic instrument. A script or score is a time-sequenced set of commands that invokes various instruments at specific times to contribute their output to an overall music performance. Other instruments, serving the function of effects processors (reverberators, spatialisers, mixers), can be similarly invoked to receive and process the outputs of the performing instruments. These actions can not only realise a music composition but can also organise any other kind of audio, such as speech, sound effects and general ambience. Likewise, the audio sources can themselves be natural sounds, perhaps emanating from an audio channel decoder, thus enabling synthetic and natural sources to be merged with complete timing accuracy.
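The relationship between instruments, effects processors and a score can be illustrated with a minimal sketch. It is not the Structured Audio syntax defined in the draft: the names sine_instrument, gain_effect and render, the event tuple layout and the 8 kHz sampling rate are all assumptions made for illustration only.

```python
import math

SAMPLE_RATE = 8000  # Hz; arbitrary choice for this sketch


def sine_instrument(freq_hz, duration_s, amplitude=0.5):
    """Hypothetical 'instrument': a single sine oscillator primitive."""
    n = int(duration_s * SAMPLE_RATE)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def gain_effect(samples, gain):
    """Hypothetical effects 'instrument': processes another instrument's output."""
    return [gain * s for s in samples]


def render(score, total_s):
    """Mix every score event into one output buffer at its start time."""
    out = [0.0] * int(total_s * SAMPLE_RATE)
    # A score is a time-sequenced list of (start time, instrument, arguments).
    for start_s, instrument, args in sorted(score, key=lambda e: e[0]):
        offset = int(start_s * SAMPLE_RATE)
        for i, s in enumerate(instrument(*args)):
            if offset + i < len(out):
                out[offset + i] += s
    return out


# Two notes invoked at different times, then routed through an effect.
score = [
    (0.0, sine_instrument, (440.0, 0.5)),
    (0.25, sine_instrument, (660.0, 0.5)),
]
mix = gain_effect(render(score, 1.0), 0.8)
```

Because every event is placed on a common sample grid, a decoded natural audio signal could be added to the same buffer in the same way, which is one simple reading of how synthetic and natural sources can be merged with exact timing.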

TTS is becoming a common interface and plays an important role in various multimedia application areas. For instance, TTS functionality makes it easy to compose multimedia content with narration without recording natural speech. Moreover, TTS with FA/AP/MP functionality could make such content much richer. Within MPEG-4, common interfaces for TTS and for TTS with FA/AP/MP are proposed. The proposed MPEG-4 TTS functionality, Hybrid/Multi-Level Scalable TTS, can be considered a superset of the conventional TTS framework. This extended TTS can use prosodic information from natural speech in addition to the input text, and can therefore generate synthetic speech of much higher quality. The interface and its bitstream format are strongly scalable; for example, if some parameters of the prosodic information are not available, the missing parameters are generated by rule. The basic idea of the scalable MPEG-4 TTS interface is that it can fully utilise all the information provided, according to the level of the user’s requirements. The functionality of this extended TTS thus ranges from conventional TTS to natural speech coding, and its application areas range from simple TTS to AP with TTS and MP dubbing with TTS.
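The scalable behaviour, in which supplied prosodic parameters are used as given and missing ones are generated by rule, can be sketched as follows. The parameter names (pitch_hz, duration_ms, energy) and the rules themselves are hypothetical placeholders; the actual TTS interface and bitstream format are defined in Subpart 6, not here.

```python
# Hypothetical fallback rules, one per prosodic parameter.
# Real rule-based prosody generation is far more elaborate.
DEFAULT_RULES = {
    "pitch_hz": lambda text: 120.0,
    "duration_ms": lambda text: 80 * len(text),
    "energy": lambda text: 1.0,
}


def complete_prosody(text, supplied=None):
    """Use every supplied parameter and derive the missing ones by rule."""
    supplied = supplied or {}
    unknown = set(supplied) - set(DEFAULT_RULES)
    if unknown:
        raise ValueError(f"unknown prosodic parameters: {sorted(unknown)}")
    return {
        name: supplied[name] if name in supplied else rule(text)
        for name, rule in DEFAULT_RULES.items()
    }


# Text only: behaves like a conventional TTS front end.
plain = complete_prosody("hello")
# Text plus partial prosody taken from natural speech.
hybrid = complete_prosody("hello", {"pitch_hz": 205.5})
```

With no prosody supplied the sketch degenerates to conventional rule-based TTS; with all parameters supplied nothing is generated by rule, which mirrors the range of operation described above.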

MPEG-4 standardises natural audio coding at bitrates ranging from 2 kbit/s up to 64 kbit/s. The presence of the MPEG-2 AAC standard within the MPEG-4 tool set will provide for compression of general audio. For the bitrates from 2 kbit/s up to 64 kbit/s, the MPEG-4 standard normalises the bitstream syntax and decoding processes in terms of a set of tools. In order to achieve the highest audio quality within the full range of bitrates and at the same time provide the extra functionalities, three types of coder have been defined. The lowest bitrate range, between about 2 and 6 kbit/s and mostly used for speech coding at 8 kHz sampling frequency, is covered by parametric coding techniques. Coding at the medium bitrates, between about 6 and 24 kbit/s, uses Code Excited Linear Predictive (CELP) coding techniques. In this region, two sampling rates, 8 and 16 kHz, are used to support a broader range of audio signals (other than speech). For the bitrates typically starting at about 16 kbit/s, time-to-frequency (T/F) coding techniques are applied. The audio signals in this region typically have bandwidths starting at 8 kHz.
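The division of the bitrate range among the three coder types can be written down as a small lookup. This is only a sketch: the text gives the boundaries as approximate ("about"), so treating them as hard, inclusive limits is an assumption, as is the function name candidate_coders. Note that the CELP and T/F ranges overlap between about 16 and 24 kbit/s.

```python
# Approximate ranges in kbit/s, taken from the paragraph above.
CODER_RANGES = [
    ("parametric", 2, 6),
    ("CELP", 6, 24),
    ("T/F", 16, 64),
]


def candidate_coders(bitrate_kbps):
    """Return the coder types whose approximate range covers the bitrate."""
    if not 2 <= bitrate_kbps <= 64:
        raise ValueError("bitrate outside the 2 to 64 kbit/s range")
    return [name for name, low, high in CODER_RANGES
            if low <= bitrate_kbps <= high]


speech = candidate_coders(4)     # low bitrate speech
overlap = candidate_coders(20)   # region served by two coder types
music = candidate_coders(48)     # general audio
```

Returning a list rather than a single name keeps the overlap visible; which coder is preferable in the shared region depends on the signal and on the functionalities required.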

A number of functionalities are provided to facilitate a wide variety of applications which could range from intelligible speech to high quality multichannel audio. Examples of the functionalities are speed control, pitch change, error resilience and scalability in terms of bitrate, bandwidth, error robustness, complexity, etc. as defined below. These functionalities are applicable to the individual coding schemes (parametric, CELP and T/F) as well as across the coding schemes.

Bitrate scalability allows a bitstream to be parsed into a bitstream of lower bitrate such that the combination can still be decoded into a meaningful signal. The bitstream parsing can occur either during transmission or in the decoder.
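One way to picture this parsing is a stream made of a base layer followed by enhancement layers, where trailing layers are dropped until the stream fits a bitrate budget. The layered model, the Layer class and the function strip_to_bitrate are assumptions for illustration; the normative bitstream syntax is given in the coder subparts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """Hypothetical stream layer: a base layer or one enhancement."""
    name: str
    bitrate_kbps: float
    payload: bytes = b""


def strip_to_bitrate(layers, max_kbps):
    """Keep the base layer plus as many enhancement layers, in order, as fit."""
    if not layers:
        raise ValueError("stream has no base layer")
    total = layers[0].bitrate_kbps
    if total > max_kbps:
        raise ValueError("base layer alone exceeds the bitrate budget")
    kept = [layers[0]]
    for layer in layers[1:]:
        # Enhancements build on each other, so stop at the first that does not fit.
        if total + layer.bitrate_kbps > max_kbps:
            break
        kept.append(layer)
        total += layer.bitrate_kbps
    return kept


stream = [
    Layer("base", 6),
    Layer("enhancement-1", 6),
    Layer("enhancement-2", 12),
]
reduced = strip_to_bitrate(stream, 16)   # e.g. at a network node
full = strip_to_bitrate(stream, 24)      # enough capacity for everything
```

The same function could run in a transmission node or inside the decoder, matching the two places where the parsing is said to occur.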

To allow for smooth transitions between the bitrates and to allow for bitrate and bandwidth scalability, a general framework has been defined. This is illustrated in figure 1.

Figure 1

Starting with a coder operating at a low bitrate, both the coding quality and the audio bandwidth can be improved by adding enhancements. These enhancements are realised within a single coder or alternatively by combining different techniques.

Additional functionalities are realised both within individual coders, and by means of additional tools around the coders. An example of a functionality within an individual coder is pitch change within the parametric coder.


[All other sections of CD14496-3 are not included here.]


Annex (Informative)

Patent statements

(Table of organisations indicating that they possess patents or intellectual property related to this specification to be added)

WG11 requests companies who believe they hold rights on patents that are necessary to implement MPEG-4 parts 1, 2, 3, 5 and 6 to deliver a statement, on company letterhead, of compliance with ISO policy concerning use of patented items in International Standards. The patent statement can take a form similar to the statement given below:
<Company name> is pleased that the standardisation in relation to "Very low bitrate audio-visual coding" (known as MPEG-4) has reached Committee Draft level as documents ISO/IEC JTC1/SC29/WG11 N1901, N1902, N1903, N1905, N1906.
<Company name> hereby declares that it is prepared to license its patents, both granted and pending, which are necessary to manufacture, use, and sell implementations of the proposed MPEG-4 Systems, Visual, Audio, Reference Software and DMIF standards or combinations thereof.
<Company name> hereby also declares that it is aware of the rules governing inclusion of patented items in international standards, as described by Section 5.7, part 2 of the ISO/IEC Directives, and in particular that it is willing to grant a license to an unlimited number of applicants throughout the world under reasonable terms and conditions that are demonstrably free of any unfair competition.
This statement is intended to apply to the following parts of the proposed MPEG-4 standard (use the ones which apply):
Systems, Visual, Audio, Reference Software, DMIF.



Contents of N2203

The MPEG-4 Audio standard comprises 6 subparts, which are distributed over the files listed below.

To keep the large document manageable, the CD is divided into several files. For each of the five coding schemes as well as for the tools for other functionalities, there is a section containing a normative part (description of the syntax and the decoding process) and an informative annex with encoder and interface description. In addition there are two files containing very large tables.

Location        Notes
w2203.pdf       Subpart 1: The master file, containing general introduction and description of tools for other functionalities.
w2203par.pdf    Subpart 2: Contains the description of the parametric coder.
w2203cel.pdf    Subpart 3: Contains the description of the CELP coder.
w2203tft.pdf    Subpart 4: Contains the normative part of the description of the T/F coder.
w2203tfs.pdf    Subpart 4: Contains the syntax part of the description of the T/F coder.
w2203tfa.pdf    Subpart 4: Contains the informative part of the description of the T/F coder.
w2203sa.pdf     Subpart 5: Contains the description of the Structured Audio coder.
w2203tts.pdf    Subpart 6: Contains the description of the Text-to-Speech coder.
w2203tvq.pdf    Subpart 4: Twin-VQ vector quantizer tables.
w2203pvq.pdf    Subpart 2: Parametric coder vector quantizer tables.


Heiko Purnhagen 15-Jun-1998