MPEG-4 Structured Audio

The MPEG-4 Structured Audio toolset


An introduction to Structured Audio tools

Depending on how you count them, there are about six tools making up the Structured Audio part of the MPEG-4 standard. The canonical reference for learning about them is the draft standard, which goes into excruciating detail on the exact bitstream formats, what you have to do to build an MPEG-4 implementation, and so on. This page provides an overview and cross-links to other resources on this web site; it is similar to a paper that covers the same territory.

The Structured Audio tools are:

SAOL (Structured Audio Orchestra Language)

SAOL is pronounced like the English word "sail" and is the central part of the Structured Audio toolset. It is a new software-synthesis language; we specifically designed it for use in MPEG-4. You can think of SAOL as a language for describing synthesizers; a program, or instrument, in SAOL corresponds to the circuits on the inside of a particular hardware synthesizer.

SAOL is not based on any particular method of synthesis. It is general and flexible enough that any known method of synthesis can be described in SAOL. We have written examples of FM synthesis, physical-modeling synthesis, sampling synthesis, granular synthesis, subtractive synthesis, FOF synthesis, and hybrids of all of these in SAOL.
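To make the idea of "a program that describes a synthesizer" concrete, here is a minimal sketch of a simple FM instrument. It is written in Python rather than SAOL, so it only illustrates the concept; the function name, parameters, and sample rate are assumptions made for this example and do not come from the standard.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate in Hz, chosen only for this sketch


def fm_instrument(carrier_hz, modulator_hz, index, duration_s):
    """Return a list of samples for one simple FM note.

    Hypothetical Python stand-in for an "instrument"; this is not SAOL syntax.
    """
    n_samples = int(SAMPLE_RATE * duration_s)
    samples = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        # The modulator sine wave varies the phase of the carrier sine wave.
        modulation = index * math.sin(2 * math.pi * modulator_hz * t)
        samples.append(math.sin(2 * math.pi * carrier_hz * t + modulation))
    return samples


# One half-second note: 440 Hz carrier, 220 Hz modulator.
note = fm_instrument(carrier_hz=440.0, modulator_hz=220.0, index=2.0, duration_s=0.5)
```

Swapping the body of the function for a different algorithm (a wavetable lookup, a filter driven by noise, and so on) gives a different "synthesizer" with the same calling convention, which is roughly the flexibility the text attributes to SAOL instruments.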

If you're a musician, and want to learn more about using SAOL to make music, we have a SAOL users' page.

SASL (Structured Audio Score Language)

SASL is a very simple language that we created for MPEG-4 to control the synthesizers specified by SAOL instruments. A SASL program, or score, contains instructions that tell SAOL what notes to play, how loud to play them, what tempo to play them at, how long they last, and how to control them (vary them while they're playing).

SASL is like MIDI in some ways, but doesn't suffer from MIDI's restrictions on temporal resolution or bandwidth. It also has a more sophisticated controller structure than MIDI: because you can write controllers in SAOL to do anything, you need to be able to control them flexibly in SASL.

SASL is simpler (or more "lightweight") than many other score protocols. It doesn't have any facilities for looping, sections, repeats, or expression evaluation, among other things. Our viewpoint is that most SASL scores will be created by automatic tools, and that it's easy to make those tools map from the intent of the composer ("repeat this block") to the particular arrangement of events that implement the intent.
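As a rough illustration of that mapping, the sketch below shows how an authoring tool might unroll a "repeat this block" instruction into a flat, time-stamped event list. It is written in Python; the tuple layout (start time, instrument name, duration, pitch) and the function name are assumptions for this example and are not SASL's textual syntax.

```python
def expand_repeat(block, block_length, times):
    """Unroll `times` repetitions of `block` into one flat event list.

    Each event is a (start_time, instrument, duration, pitch) tuple; the layout
    is hypothetical and is not SASL's actual textual syntax.
    """
    events = []
    for repetition in range(times):
        offset = repetition * block_length
        for start, instrument, duration, pitch in block:
            # Shift each event of the block by the start of this repetition.
            events.append((start + offset, instrument, duration, pitch))
    return events


# A two-note block lasting 2 time units, repeated 3 times by the authoring tool.
block = [(0.0, "tone", 1.0, 440.0), (1.0, "tone", 1.0, 660.0)]
flat_score = expand_repeat(block, block_length=2.0, times=3)
```

The score that reaches the decoder contains only the six flat events; the notion of a repeat lives entirely in the tool.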

There's a page on sophisticated control with SASL available.

SASBF (Structured Audio Sample Bank Format)

SASBF (sometimes we pronounce this "sazz-biff") is a format for efficiently transmitting banks of sound samples to be used in wavetable, or sampling, synthesis. The format is being re-examined right now in hopes of making it at least partly compatible with the MIDI Downloadable Sounds (DLS) format.

Machine Listening Group members aren't working too actively on this component of MPEG-4; we're providing some oversight and making sure that the final solution is interoperable. The most active participants in this activity are E-Mu Systems and the MIDI Manufacturers Association (MMA).

MIDI Semantics

In MPEG-4, synthesis can be controlled not only with SASL scores but also with MIDI files and scores. MIDI is today's most commonly used representation for music score data, and many sophisticated authoring tools (such as sequencers) work with MIDI.

The MIDI syntax is external to the MPEG-4 Structured Audio standard; we only reference the MIDI Manufacturers Association's definition in the standard. But in order to make the MIDI controls work right in the MPEG context, we redefine the semantics (what the instructions "mean") in MPEG-4. The new semantics are carefully defined as part of the MPEG-4 specification.

There's a page on using MIDI to control SAOL available.


Scheduler

The scheduler is the "guts" of the Structured Audio definition. It's a set of carefully defined and somewhat complicated instructions that specify how SAOL is used to create sound when it is driven by MIDI or SASL. It's in the style of "when this instruction arrives, you have to remember this, then execute this program, then do this other thing".
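The flavor of such rules can be sketched with a toy event loop. The Python code below is only an illustration of event-driven scheduling under simplifying assumptions (a fixed control tick, notes described by hypothetical (start time, name, duration) tuples); it is not the normative algorithm from the standard.

```python
import heapq


def run_scheduler(score, tick, end_time):
    """Process timed note events and report which notes run at each tick.

    `score` holds hypothetical (start_time, name, duration) tuples. This is a toy
    illustration of event-driven scheduling, not the normative MPEG-4 algorithm.
    """
    pending = list(score)
    heapq.heapify(pending)  # earliest event first
    active = []             # remembered notes, stored as (stop_time, name)
    log = []
    for step in range(int(round(end_time / tick))):
        now = step * tick
        # "When this instruction arrives, remember this": start notes that are due.
        while pending and pending[0][0] <= now:
            start, name, duration = heapq.heappop(pending)
            active.append((start + duration, name))
        # Forget notes whose duration has elapsed.
        active = [(stop, name) for stop, name in active if stop > now]
        # "Then execute this program": here we just record the active notes.
        log.append((now, sorted(name for _, name in active)))
    return log


score = [(0.0, "bass", 2.0), (1.0, "lead", 2.0)]
log = run_scheduler(score, tick=1.0, end_time=4.0)
```

In a real decoder the "execute" step would run each active instrument's code to produce audio for that tick, rather than recording names in a log.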

This component of Structured Audio is crucial but very dull unless you're a developer who wants to implement a SAOL system. If you are a developer, there's a developer's page available.


AudioBIFS

BIFS is the MPEG-4 Binary Format for Scene Description. It's the component of MPEG-4 Systems which is used to describe how the different "objects" in a structured media scene fit together. To explain this a little more: in MPEG-4, the video clips, sounds, animations, and other pieces each have special formats to describe them. But to have something to show, we need to put the pieces together -- the background goes in the back, this video clip attaches to the side of this "virtual TV" object, the sound should sound like it's coming from the speaker over there. BIFS lets you describe how to put the pieces together.

AudioBIFS is a major piece of MPEG-4 we've designed for specifying the mixing and post-production of audio scenes as they're played back. Using AudioBIFS, we can specify how the voice-track is mixed with the background music, and that it fades out after 10 seconds and this other music comes in and has a nice reverb on it.
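The arithmetic behind such a mix is simple, and the Python sketch below shows one way to write it out. AudioBIFS describes the mix as part of the scene description rather than as procedural code, so this is only an analogy; the function names, the linear fade shape, and all parameters are assumptions made for this example.

```python
def fade_out_gain(t, fade_start, fade_length):
    """Gain that is 1.0 until `fade_start`, then ramps linearly down to 0.0."""
    if t <= fade_start:
        return 1.0
    if t >= fade_start + fade_length:
        return 0.0
    return 1.0 - (t - fade_start) / fade_length


def mix(voice, music, sample_rate, fade_start, fade_length):
    """Sum a voice track with background music that fades out.

    All names and parameters are hypothetical; this only shows the arithmetic
    of a mix with a fade, not how AudioBIFS itself is written.
    """
    mixed = []
    for n, (v, m) in enumerate(zip(voice, music)):
        t = n / sample_rate
        mixed.append(v + m * fade_out_gain(t, fade_start, fade_length))
    return mixed


# Tiny example at 1 sample per second so the numbers are easy to follow:
# the music plays at full level for 10 seconds, then fades over 2 seconds.
voice = [0.0] * 14
music = [1.0] * 14
out = mix(voice, music, sample_rate=1, fade_start=10.0, fade_length=2.0)
```

Bringing in a second piece of music or adding reverb would be further processing stages applied to the same streams before the final sum.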

BIFS is generally based on the Virtual Reality Modeling Language (VRML), but has extended capabilities for streaming and mixing audio and video data into a virtual-reality scene. The AudioBIFS functions are very advanced compared to VRML's sound model, which is rather simple, and are being tentatively considered for use in a future version of VRML.

There's a page on AudioBIFS available.
