Media Lab MPEG-4 Structured Audio: Background

Contact: Alexandra Kahn, MIT Media Lab
Phone: (617) 253-0365
Email: akahn@media.mit.edu


Structured Audio: Why and How?

As the Internet has grown, more and more people have become interested in "Internet radio", on-line sales of music, and other audio applications. However, the audio quality and interactive capability of Internet broadcasts have been hampered by the low data rate that modems can transmit. Likewise, multimedia applications such as CD-ROMs have been severely restricted by the limited capabilities of the sound hardware commonly available to developers and home users.

The Media Lab's Structured Audio method attacks these problems by using a new model for sound. Rather than transmitting sound as a "stream of bits", the sound is described in a flexible language and then synthesized, or turned into sound, on the user's computer. The description of the sound is much smaller and easier to transmit than the sound itself, and so this novel idea leads to much faster download times.
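To make the size argument concrete, here is a minimal Python sketch. It is not the MPEG-4 Structured Audio language or any part of the Media Lab toolset; the note list, the `render` function, the sine-tone "synthesizer", and the sample rate are all assumptions chosen for illustration. A short description of four notes is turned into sound samples on the receiving side, and the two representations are then available for comparison.

```python
import math
import struct

SAMPLE_RATE = 8000  # assumed sample rate in Hz, kept low so the sketch runs quickly

# Hypothetical "structured" description: (start_sec, duration_sec, frequency_hz)
SCORE = [
    (0.0, 0.5, 440.0),
    (0.5, 0.5, 494.0),
    (1.0, 0.5, 523.0),
    (1.5, 0.5, 587.0),
]


def render(score, sample_rate=SAMPLE_RATE):
    """Synthesize the score into 16-bit mono PCM bytes using plain sine tones."""
    total_sec = max(start + dur for start, dur, _ in score)
    samples = [0.0] * int(round(total_sec * sample_rate))
    for start, dur, freq in score:
        first = int(round(start * sample_rate))
        count = int(round(dur * sample_rate))
        for i in range(count):
            if first + i < len(samples):
                samples[first + i] += 0.5 * math.sin(
                    2.0 * math.pi * freq * i / sample_rate
                )
    # Clamp to [-1, 1] and pack as little-endian signed 16-bit integers.
    return b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )


# What would be transmitted: the description, not the sound.
description = repr(SCORE).encode("ascii")

# What the receiver produces locally from that description.
pcm = render(SCORE)

print(len(description), "bytes of description ->", len(pcm), "bytes of audio")
```

Two seconds of 16-bit audio at this assumed rate occupy 32,000 bytes, while the description that generated it fits in well under 100 bytes. Real content would need a richer description than four sine tones, so the ratio here should be read as an illustration of the principle rather than a measured result.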

Structured Audio also addresses limitations presented by traditional sound cards. Existing PC sound cards and transmission methods for synthetic sound make use of only a single, fixed method of sound synthesis. The Structured Audio method allows the description of any method of synthesis, or the creation of entirely new methods. When a Structured Audio file is downloaded, a set of "virtual synthesizers" is created, which allows these new synthesis methods to be used for generating music.

Unlike other "downloadable samples" methods, which use only "sampling" (or wavetable) synthesis, the Media Lab method does not designate a particular synthesizer as the best way to generate sound; instead, it designates a language for describing synthesizers. Any current or future synthesizer may be described in the MPEG-4 Structured Audio Orchestra Language. In one piece of music, the MPEG-4 synthesizer might act as a sophisticated FM synthesizer such as the Yamaha DX-7; in another, it might act as a wavetable synthesizer such as the E-Mu Proteus; in another, it could have the capabilities of classic 1970s analog synthesizers. In fact, all of these methods may be used in the same piece of music.
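The idea of several synthesis methods sharing one piece of music can be sketched in Python as follows. This is not the Structured Audio Orchestra Language; the instrument functions, the `ORCHESTRA` table, the score layout, and the parameter values are hypothetical and exist only to illustrate the concept. One instrument uses simple two-operator FM, the other reads a stored single-cycle waveform (wavetable synthesis), and a single score calls both.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate in Hz


def fm_instrument(freq, dur, ratio=2.0, index=3.0):
    """Two-operator FM: a sine carrier whose phase is modulated by another sine."""
    n = int(dur * SAMPLE_RATE)
    return [
        math.sin(
            2.0 * math.pi * freq * t / SAMPLE_RATE
            + index * math.sin(2.0 * math.pi * freq * ratio * t / SAMPLE_RATE)
        )
        for t in range(n)
    ]


def wavetable_instrument(freq, dur, table):
    """Wavetable: step through a stored single-cycle waveform at the note's pitch."""
    n = int(dur * SAMPLE_RATE)
    step = freq * len(table) / SAMPLE_RATE  # table positions advanced per sample
    phase = 0.0
    out = []
    for _ in range(n):
        out.append(table[int(phase) % len(table)])
        phase += step
    return out


# A 64-point sawtooth cycle stands in for a recorded sample.
SAW_TABLE = [2.0 * i / 64 - 1.0 for i in range(64)]

# Hypothetical "orchestra": each name maps to a different synthesis method.
ORCHESTRA = {
    "fm": fm_instrument,
    "wavetable": lambda freq, dur: wavetable_instrument(freq, dur, SAW_TABLE),
}

# Hypothetical score: (instrument, start_sec, duration_sec, frequency_hz)
SCORE = [
    ("fm", 0.0, 0.25, 220.0),
    ("wavetable", 0.25, 0.25, 330.0),
]


def perform(score, orchestra):
    """Mix every note in the score into one list of samples."""
    total_sec = max(start + dur for _, start, dur, _ in score)
    mix = [0.0] * int(round(total_sec * SAMPLE_RATE))
    for name, start, dur, freq in score:
        offset = int(round(start * SAMPLE_RATE))
        for i, sample in enumerate(orchestra[name](freq, dur)):
            if offset + i < len(mix):
                mix[offset + i] += sample
    return mix


mix = perform(SCORE, ORCHESTRA)
```

Adding a third synthesis method in this sketch means adding one more entry to `ORCHESTRA`; the score and the mixing code stay the same. That separation between describing instruments and describing notes is the point the paragraph above makes.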

MPEG-4 Structured Audio interoperates with the existing, widely used MIDI protocol for control of electronic musical instruments. Thus, existing content and tools which make use of MIDI can be quickly repurposed to use Structured Audio instead. However, the Structured Audio technique is more powerful and flexible than MIDI, and gives composers more control over sound.
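As a rough illustration of how MIDI content might be repurposed, the Python sketch below converts note-on and note-off events into timed notes with explicit frequencies. It does not show how MPEG-4 itself handles MIDI; the event tuples and the `events_to_score` helper are hypothetical. The only outside fact it relies on is the usual equal-temperament convention that MIDI note 69 is the A at 440 Hz.

```python
def midi_note_to_hz(note):
    """Equal-temperament pitch, with MIDI note 69 tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


# Hypothetical, already-parsed MIDI events: (time_sec, kind, note_number)
EVENTS = [
    (0.0, "note_on", 69),
    (0.5, "note_off", 69),
    (0.5, "note_on", 81),
    (1.0, "note_off", 81),
]


def events_to_score(events):
    """Pair each note-on with its note-off to get (start, duration, frequency)."""
    started = {}  # note number -> time the note began
    score = []
    for time, kind, note in events:
        if kind == "note_on":
            started[note] = time
        elif kind == "note_off" and note in started:
            start = started.pop(note)
            score.append((start, time - start, midi_note_to_hz(note)))
    return score


score = events_to_score(EVENTS)
print(score)  # [(0.0, 0.5, 440.0), (0.5, 0.5, 880.0)]
```

Once notes are in this form, they can be handed to any synthesizer description rather than to one fixed sound source, which is where the additional control mentioned above comes from.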

The MPEG-4 Structured Audio project grew out of another research project at the Media Lab called "NetSound". This system uses the language Csound, developed by Media Lab Professor Barry Vercoe, to express audio structure. NetSound was a 1997 finalist in the Discover Magazine Awards for Technical Innovation.

Using lessons learned in the NetSound project, Machine Listening Group researchers designed new tools specifically tailored to the requirements of MPEG-4. In early 1997, the Media Lab, following a request by the Hughes Aircraft Company - a Media Lab sponsor as well as an MPEG participant - submitted a proposal based on these new tools to MPEG. MPEG experts quickly realized that this technology represented a fundamental step forward in the design of audio standards, and accepted it into the draft standard. Since then, experts from the Media Lab and around the world have been collaborating to finalize the technology before standardization.

The MPEG group and the MPEG-4 standard

The Moving Picture Experts Group (MPEG) - a working group of the larger ISO Standards Body - is made up of hundreds of researchers, engineers and industry experts from around the globe. MPEG is chartered with the development of industry standards for the compression and decompression, processing and coding of moving pictures and audio.

Following the wide-scale industry acceptance and implementation of the MPEG-1 and MPEG-2 Standards, the group is now developing MPEG-4, the next version of the Standard. MPEG-4 addresses critical issues affecting digital television, interactive graphics applications (such as synthetic content), and transmission over the World Wide Web. In addition, MPEG-4 will provide the standardized industry format for the production, distribution and access of digital content.

MPEG-4 will be released in October 1998 and become an International Standard in December 1998.

The MPEG-1 and MPEG-2 standards are broadly used in the digital exchange and transmission of audio and video data. Millions of MPEG-enabled home computers, direct-satellite broadcast receivers, and other multimedia devices have been sold. With MPEG-4, the scope of this work expands to include interactive multimedia content; this technology will be built into future home computers, portable devices, and set-top boxes to allow the easy delivery of sophisticated interactive programming.

The Media Lab's Structured Audio method has been designed to integrate seamlessly with the other components of MPEG-4. These include methods for the transmission of speech, recorded music, computer graphics, and compressed digital video. All of these tools may be combined in a single MPEG-4 presentation. Taken as a whole, the MPEG-4 audio tools point the way to a more powerful common platform for sound processing, by standardizing the most advanced sound compression tools, and the most sophisticated and high-quality synthesis methods available.

The Structured Audio toolset is also well-suited to small, low-power portable devices such as digital radios. A low-complexity "profile" is defined which allows the use of a simpler synthesis method in these sorts of receivers. This simple method is compatible with the MIDI Downloadable Sounds format, a popular but limited framework for generating sound.

About the Machine Listening Group

The Machine Listening Group at the Media Lab is recognized internationally for its research on advanced sound-processing algorithms, such as music-synthesis languages, synthetic performance, acoustic source separation, 3-D audio, content-based sound indexing, and music transcription and understanding systems.

In addition to developing the Structured Audio standard, the Machine Listening Group has a long history as a world leader in related work on advanced audio processing. Led by Professor Barry Vercoe, one of the primary inventors of the research field of "computer music", the group works toward bridging the gap between the current generation of audio technologies and those that will be needed for future interactive media applications.

Many of the functionalities enabled by the MPEG-4 standard, and many of the technologies required to implement MPEG-4 Structured Audio encoders and decoders, have been invented at the Media Lab.

For example, Machine Listening Group research into Auditory Scene Analysis will allow the conversion of regular audio content into structured content, by analysis and modelling of the component sounds in a mixture. Research into Music Understanding and Instrument Identification will enable systems which automatically search on the Internet for the music that you want to hear.

Synthetic performance systems - a term and a field invented by Professor Vercoe - allow human musicians to interact with computer performers. The computer performer can be taught about the piece of music being played, and the two musicians can rehearse together, to create a sensitive and expressive collaboration between human and machine.

The Machine Listening Group has been a longtime innovator in 3-D audio presentation systems, developing the fundamental technology which will be employed in future consumer products and high-end audio equipment. Current and future research increases the quality and ease-of-use of this technology, moving it from niche applications to the mass market.

The Machine Listening Group has a WWW homepage at http://sound.media.mit.edu/.