MPEG-4 Structured Audio

Introduction to Structured Audio


About Structured Audio

Structured Audio means transmitting sound by describing it rather than compressing it. That's the whole idea, and it's a very simple one, but as you will see if you keep reading, it leads to a wealth of new directions for sound research and low bit-rate coding.

The Machine Listening Group (which owns these web pages) invented the use of the term Structured Audio for this kind of sound transmission. We didn't invent the idea of Structured Audio; description methods for sound, and the idea of using them for coding, have been kicking around for years (Julius Smith, Andy Moorer, and probably others have written about it). But the NetSound and MPEG-4 Structured Audio projects were the first to really try to make it a practical reality.

The Problem

The problem we started out trying to solve is easy to understand: it takes too long to download sound over the World Wide Web. To address this, many researchers have developed audio compression techniques, which allow sound files to be "squished" for more rapid transmission. RealAudio, MP3, Liquid Audio, and many other technologies allow this compression. However, to compress audio to a point where it can be streamed over modems, you have to start squeezing the sound quality out of it. That's why, on one hand, MP3 files can't be streamed over a modem -- you have to download them (about 15-20 minutes for a 5-minute song) before listening -- but they sound good; and on the other, RealAudio files can be streamed, but they sound like AM radio, not like a CD.
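To make the download figure above concrete, here is a rough back-of-the-envelope calculation. The bit rates are assumptions chosen for illustration (a 128 kbit/s MP3 and a 33.6 kbit/s modem); neither number comes from this page, and real connections rarely reach their nominal speed.

```python
def download_minutes(duration_s, audio_bps, link_bps):
    """Minutes needed to fetch a whole file before playback can start."""
    total_bits = duration_s * audio_bps
    return total_bits / link_bps / 60


def can_stream(audio_bps, link_bps):
    """Streaming only works if the link delivers bits at least as fast
    as playback consumes them."""
    return audio_bps <= link_bps


# Assumed rates, for illustration only (not taken from the text).
MP3_BPS = 128_000    # an MP3 rate that "sounds good"
MODEM_BPS = 33_600   # nominal speed of a dial-up modem

minutes = download_minutes(5 * 60, MP3_BPS, MODEM_BPS)
print(f"5-minute song: about {minutes:.0f} minutes to download")
print("streamable over the modem:", can_stream(MP3_BPS, MODEM_BPS))
```

Under these assumed rates the download comes out at roughly 19 minutes, in line with the 15-20 minute figure, and the file cannot be streamed because playback needs bits faster than the modem can supply them.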

So we wanted to build technology that would allow high-quality sound to be streamed over a modem.

Csound

Csound is a special computer language invented and refined over the last twenty years by Prof. Barry Vercoe, who leads the Machine Listening Group. It's a language that's used to describe sound synthesizers. Each program in Csound is the description of the "internal workings" of a particular kind of synthesizer. That is, one Csound program (or instrument, in Csound terminology) might use FM synthesis like the Yamaha DX-7, while another might use wavetable techniques like the E-Mu Proteus, and another might model a classic analog synthesizer from the 1970s. Part of the power of Csound and languages like it is that any synthesizer can be described this way.

When musicians write music using Csound, they write two parts: the orchestra, which describes what all of their instruments should sound like, and the score, which describes how to use those instruments to create music. If you are familiar with MIDI files, the score of a Csound composition is something like a MIDI file. But the orchestra has no direct analog in the MIDI world, unless it's the electronics on the inside of the synthesizer. In a MIDI composition, the composer has no direct control over the sound; s/he must "trust" that the synthesizer will do the right thing. In Csound, though, the composer specifies exactly what the instruments sound like.

The Csound program itself turns the orchestra and score into sound, in real-time if you have a fast machine (that is, you can listen to the sound while Csound is producing it); or otherwise, written into a file for later listening.
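The orchestra/score split can be sketched in a few lines of Python. This is a toy illustration of the idea only; it is not Csound syntax and not how Csound is implemented. The instrument functions, the event tuple layout, and the names `ORCHESTRA`, `SCORE`, and `render` are all made up for this sketch.

```python
import math

SAMPLE_RATE = 8000  # samples per second; kept low so the toy runs quickly


def sine(freq, t):
    """A plain sine-wave instrument."""
    return math.sin(2 * math.pi * freq * t)


def fm(freq, t, ratio=2.0, index=3.0):
    """Simple two-operator FM: a modulator wobbles the carrier's phase."""
    modulator = math.sin(2 * math.pi * freq * ratio * t)
    return math.sin(2 * math.pi * freq * t + index * modulator)


# The "orchestra": what each instrument sounds like.
ORCHESTRA = {"sine": sine, "fm": fm}

# The "score": when to play which instrument.
# Each event is (instrument, start_s, duration_s, freq_hz, amplitude).
SCORE = [
    ("sine", 0.0, 0.5, 440.0, 0.4),
    ("fm", 0.25, 0.5, 220.0, 0.3),
]


def render(orchestra, score, sample_rate=SAMPLE_RATE):
    """Turn an orchestra and a score into a list of audio samples."""
    end = max(start + dur for _, start, dur, _, _ in score)
    out = [0.0] * int(round(end * sample_rate))
    for name, start, dur, freq, amp in score:
        instrument = orchestra[name]
        first = int(round(start * sample_rate))
        for n in range(int(round(dur * sample_rate))):
            if first + n < len(out):
                # Mix this note into the output buffer.
                out[first + n] += amp * instrument(freq, n / sample_rate)
    return out


samples = render(ORCHESTRA, SCORE)
print(len(samples), "samples rendered")
```

In this sketch, swapping a different function into `ORCHESTRA` changes the timbre without touching `SCORE`, which is the kind of control a MIDI file alone does not give the composer.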

NetSound

We (especially group alumnus Michael Casey) made the observation in the spring of 1996 that using Csound would be a good way to put high-quality audio on a WWW page. The combination of a Csound orchestra and score is usually many hundreds of times smaller (more compressed) than the sound it turns into. If Csound were present on your home computer, we could transmit an orchestra and a score to you, and Csound would turn them into sound. So Mike, Adam Lindsay, Paris Smaragdis, and Eric Scheirer wrote a simple set of scripts that "wrap up" a Csound orchestra, score, MIDI files, and maybe some sound samples for delivery on the WWW, and a client-side program that separates them and dispatches Csound to create sound.
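The wrap-and-separate step, and the size advantage it buys, can be sketched as follows. This is not the NetSound format: the use of a ZIP archive, the file names `piece.orc` and `piece.sco`, the placeholder contents, and the 100,000-byte bundle size are all assumptions made for illustration, as is the choice of CD-quality audio for the comparison.

```python
import io
import zipfile


def wrap(parts):
    """Bundle named parts (filename -> bytes) into one in-memory archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in parts.items():
            zf.writestr(name, data)
    return buf.getvalue()


def unwrap(bundle):
    """Client side: separate the archive back into its named parts."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def pcm_bytes(seconds, sample_rate=44_100, channels=2, bytes_per_sample=2):
    """Size of uncompressed audio (defaults assume CD-quality stereo)."""
    return seconds * sample_rate * channels * bytes_per_sample


# Placeholder contents; a real orchestra and score would go here.
parts = {
    "piece.orc": b"; orchestra text would go here\n",
    "piece.sco": b"; score text would go here\n",
}
bundle = wrap(parts)
recovered = unwrap(bundle)  # these parts would then be handed to the synthesizer

# Hypothetical size for a realistic bundle, used only for the ratio below.
HYPOTHETICAL_BUNDLE_BYTES = 100_000
ratio = pcm_bytes(5 * 60) / HYPOTHETICAL_BUNDLE_BYTES
print(f"5 minutes of CD audio is about {ratio:.0f}x larger than the bundle")
```

With these assumed numbers, five minutes of uncompressed CD-quality audio (52,920,000 bytes) is a little over 500 times larger than the bundle, which is the order of magnitude the paragraph above describes.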

We called this idea NetSound, and it's proven to be a very successful platform for demonstrating Structured Audio concepts (we do a lot of NetSound demos at the Media Lab) and getting the message out about this concept in sound coding (Mike and Paris wrote a paper about it, and it was a finalist for a 1997 Discover Magazine Innovation Award).

NetSound isn't a perfect system; it's hard to code voice or very expressive natural instruments this way, and the Csound model isn't right for streaming data. (In many cases, you don't have to stream data, though, because the NetSound file is so compact.) Also, many people find Csound difficult to use (though many people love it, too!).

The move to MPEG

In the fall of 1996, visitors from Media Lab sponsor Hughes Electronics saw a demo of NetSound. Hughes is a big player in MPEG; they make a lot of money from the MPEG-2 video standard, which is used to transmit the data in their direct-satellite broadcast system, called DirecTV in the USA. Hughes realized that the concepts we were demonstrating were a good solution to an outstanding MPEG-4 call-for-proposals for "SNHC Audio" (which stands for "Synthetic/Natural Hybrid Coding", a concept explored elsewhere).

We wrote up a brief submission to MPEG based on Mike and Paris's paper, but it turned out that due to copyright hangups and some other problems, NetSound wasn't exactly the right solution for MPEG-4. So we (especially Eric) decided to use the opportunity to revisit some of the synthesis-language issues represented in Csound, and dive deeper into the Structured Audio concepts than NetSound did. The result was the language SAOL, which we designed over the winter of 1996-1997 and submitted to MPEG soon after, meeting with general enthusiasm.

We continue to develop and maintain the SAOL language model and a computer program implementing it; we've also had a hand in the design of other MPEG-4 Structured Audio tools. Go on forward to read about them.
