Re: saolc used in streaming environment

From: Eric Scheirer (eds@media.mit.edu)
Date: Mon Oct 18 1999 - 09:47:38 EDT


Hi Marek,

These questions are very important; they get at the
difference between standalone musician tools (which is
the way saolc and sfront, at least, mostly work) and
streaming decoders, which are what MPEG-4 SA fundamentally
defines. This is quite a long answer, because it is a
complex topic that hasn't been discussed much yet on this
reflector. In my view, we spent a lot of time getting this
right in the standard, and it is one of the most important
ways that SA differs from Csound and other standalone
software synthesizers.

The standard (download the FDIS from the web site if you
haven't done so yet) contains detailed instructions on
making SA tools run in a streaming environment.

In particular, subclause 5.5.2 specifies the streaming
bitstream format. The bitstream class
StructuredAudioSpecificConfig() is the header,
transmitted at the start of the session; the bitstream
class SA_access_unit() is one "chunk" of streaming
data. Thus, a server is conceptually sending
a stream of SA_access_unit()s to a receiver that has
been prepared with a StructuredAudioSpecificConfig().
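
To make the shape of this concrete, here is a very rough C
sketch of what a streaming receiver conceptually does. All
of the type and function names are made-up placeholders,
not part of the standard or of saolc:

    #include <stddef.h>

    /* Hypothetical placeholder types and functions -- not saolc APIs. */
    typedef struct SADecoder SADecoder;
    typedef struct { const unsigned char *data; size_t len; } SAChunk;

    SADecoder *sa_decoder_configure(SAChunk header);        /* parse the SASC() header      */
    int        sa_stream_next(SAChunk *au);                 /* 0 when the stream ends       */
    void       sa_decoder_decode(SADecoder *d, SAChunk au); /* process one SA_access_unit() */

    /* The receiver is configured once from the header, then simply
       decodes access units in the order the stream delivers them.  */
    void sa_receive_session(SAChunk header)
    {
        SADecoder *dec = sa_decoder_configure(header);
        SAChunk au;

        while (sa_stream_next(&au))
            sa_decoder_decode(dec, au);
    }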

Of course, an SA_access_unit() is not itself a packet or a
nicely transmittable data chunk (the stream might be very
bursty) -- the Systems part of MPEG-4 tells you how to
packetize the stream and multiplex and synchronize it with
other MPEG-4 data like streaming video. But from the point
of view of the SA decoder, the data has already been
"depacketized" into nice SA_access_unit()s by the time it
is available.

The key subclause is 5.7.3.3.6, which tells you how to
integrate the scheduled decoding of SASL scores transmitted
in the header with the immediate decoding of SASL events
transmitted as part of the streaming data.

So with this background, I will answer your questions.

>- Using SAOL and SASL streams separately ("-sco" and "-orc" options),
>the files are "sucked in" completely to be used for initialization and
>scheduling. Because the SASL commands can appear in any order (and do
>not have to be time-sorted), the whole file has to be scheduled first
>before decoding can proceed.

The general principle is that SASL events with timestamps (for
example in a SASL "file", or SASL chunk in the bitstream header)
do not need to be in order -- they are scheduled according
to their timestamp. SASL events without timestamps, which
can only appear in the streaming data, are executed immediately
when they are received. You can, if you choose, put SASL
events with timestamps in the streaming data, in which case
they can be scheduled into the future and do not need to be
in order.
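
The dispatch rule, then, looks conceptually like this
little C sketch. The event record and scheduler hooks are
made up for illustration; the normative behavior is in
subclause 5.7.3.3.6:

    /* Hypothetical event record and scheduler hooks -- illustration
       only, not the saolc data structures. */
    typedef struct {
        int    has_timestamp;  /* 0 for unstamped streaming events     */
        double timestamp;      /* score time, if a timestamp is given  */
        /* ... the SASL command itself (instr, control, table, ...) */
    } SASLEvent;

    void scheduler_insert(const SASLEvent *ev, double when); /* time-ordered queue */
    void execute_event(const SASLEvent *ev);                 /* run right now      */

    void dispatch_sasl_event(const SASLEvent *ev)
    {
        if (ev->has_timestamp)
            /* May arrive in any order; the scheduler sorts by time. */
            scheduler_insert(ev, ev->timestamp);
        else
            /* Only occurs in streaming data; takes effect now.      */
            execute_event(ev);
    }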

>But what about the mp4 files we have so far (decoded
>e.g. by SFRONT)? Is the SASL part encoded time-sorted? Will it be enough
>to suck in only those lines which relate to a certain timestamp? Can one
>be sure that all events in the following lines will relate to later
>events? This would be a must for a streaming implementation.

Yes, this is true. I think sfront does not yet handle the streaming
part of the mp4 file, but once it does, the streaming data must
be in the correct order.

saolc does handle this sort of mp4 file, and in the
streaming part the events must be in order if they have no
SASL timestamps. (An mp4 file of this sort is just the
StructuredAudioSpecificConfig() pasted together with the
list of all the SA_access_unit()s in order, each with an
extra field giving the time at which it is received.)
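
So a decoder can walk such a file from front to back,
conceptually like this sketch. The reader functions here
are placeholders, not the actual saolc code:

    #include <stdio.h>

    /* Hypothetical readers -- not the actual saolc code. */
    int  read_sa_specific_config(FILE *f);         /* SAOL, score, samples    */
    int  read_arrival_time(FILE *f, double *when); /* the extra time field    */
    int  read_sa_access_unit(FILE *f);             /* one chunk of events     */
    void synthesize_until(double when);            /* render audio up to then */

    void decode_sa_mp4_file(FILE *f)
    {
        double when;

        read_sa_specific_config(f);            /* configure the decoder once   */
        while (read_arrival_time(f, &when)) {  /* access units stored in order */
            synthesize_until(when);            /* audio up to the arrival time */
            read_sa_access_unit(f);            /* then apply the new events    */
        }
    }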

>- Will it be possible to initialize the SAOL orchestra also in a
>streaming fashion? Or do we first have to read in (and buffer) the
>whole file part of the mp4 file describing the instruments? If a
>streaming implementation is possible, will there be a difference
>in the initialization phase in the new virtual-machine approach?

The standard specifies that SAOL data is only included in
the header, so it can all be handled, compiled, and
buffered at the beginning. It is certainly possible to
build fancier implementations that dynamically interpret
SAOL data later on and add it into the orchestra as it is
needed ("just-in-time" compilation during synthesis). As
long as you get the right sound, it doesn't matter (from
the point of view of the standard) how you implement it.
But according to the standard, you only get SAOL data at
the very beginning of the session.
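
Schematically, and with made-up names, the choice looks
like this; neither strategy is mandated, only the
resulting sound:

    /* Hypothetical orchestra-configuration hook -- illustration only. */
    typedef struct SAOrchestra SAOrchestra;
    typedef struct { const char *saol_text; } SAOLChunk;

    void compile_instruments(SAOrchestra *orc, const SAOLChunk *c);

    /* Eager strategy: everything in the header is compiled before any
       audio is produced.  A "just-in-time" decoder could instead stash
       the SAOL text here and compile each instrument the first time
       the scheduler triggers it; both conform as long as the output
       sound is correct.                                               */
    void configure_from_header(SAOrchestra *orc, const SAOLChunk *chunks, int n)
    {
        for (int i = 0; i < n; i++)
            compile_instruments(orc, &chunks[i]);
    }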

>- When using a streaming file, what about changing the setup of the
>instruments in use?
>Taking a complete structured-audio mp4 file, we always have the SAOL
>block first, followed by the SASL part. This means that we have to
>initialize all instruments for the output audio file first. Thinking of
>large SA content split across successive streams, it could make sense to
>reuse instruments from a previous SA stream for the next stream and only
>add those instruments that are new. This could result in a reduced
>amount of SAOL data to be transmitted for the second stream and in less
>time required for initializing these new instruments. What about such an
>implementation in a streaming environment?

Yes, there are some tools for handling such things at the Systems
level, particularly if the StructuredAudioSpecificConfig() is
exactly the same from one session to another. If there is partial
overlap, within the context of a whole MPEG-4 terminal you could
split the SA synthesis process into two or more substreams and
then mix them together at the AudioBIFS level after the synthesis
is finished. Each of these substreams would have a smaller
SASC() that could be reused directly.

I don't think there is currently any way to cache and reuse "part
of" a SASC() according to the standard. SA is the only tool
that could really use such a feature, and so it is hard to lobby
for such strange (to non-SA people) functions within MPEG politics.
If someone takes the initiative to build such a system and
demonstrate the nice features that result, I would like to
see it.

>- The mp4 files for the structured audio part are not yet embedded
>inside the whole MPEG-4 layer structure. Is there any information
>available about what this will look like? How will the structured-audio
>blocks/frames be marked? What about timestamping and the embedding in
>BIFS?

To be very strict, the SA part of the standard does not
define a "file" at all. It defines a number of bitstream
components, particularly the SASC() and SA_a_u() that I
have been discussing. Only for evaluation, testing, and
convenience have I put these elements together into a file
that can be easily exchanged. But this sort of file has no
status in MPEG with regard to the standard.

If you go up to the next level, the overall Audio standard
(14496-3 sec 1), you will see the syntax for a real MPEG-4
Audio stream, which contains calls to SASC() and SA_a_u()
in streams that use SA coding. And if you go up one more
level, the Systems standard (14496-1) tells you how to
packetize this audio stream, timestamp the packets and AU
transmissions, and synchronize and multiplex it with other
audio and video streams.
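
Very roughly, and leaving out all the real Systems syntax,
the layering looks like this. Everything here is a
simplified illustration, not the normative 14496-1
structures:

    #include <stddef.h>

    /* Purely illustrative layering -- NOT the normative 14496-1 packet
       syntax, just a picture of who wraps what.                        */
    typedef struct { const unsigned char *data; size_t len; } SABits;

    typedef struct {
        double decode_time;   /* nominal decoding time for this AU      */
        SABits payload;       /* the SA_access_unit() itself            */
    } AudioAccessUnit;        /* one AU in the MPEG-4 Audio stream      */

    typedef struct {
        int             stream_id;  /* which elementary stream          */
        AudioAccessUnit au;         /* timestamped, ready to multiplex  */
    } SystemsPacket;                /* Systems-level packet, simplified */

    /* The Audio level gives the access unit its timestamp; the Systems
       level makes it a packet to be multiplexed with the other audio
       and video streams.                                               */
    SystemsPacket packetize(int stream_id, double when, SABits bits)
    {
        AudioAccessUnit au  = { when, bits };
        SystemsPacket   pkt = { stream_id, au };
        return pkt;
    }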

Conceptually, many decoders are running at the same time:
several SA decoders as well as several AAC and speech
decoders. The outputs of all of these decoders are made
available for mixing and effects processing at the
AudioBIFS level. There is a new paper in IEEE Trans.
Multimedia that describes the functions possible in
AudioBIFS; it is also available as an electronic reprint
on my homepage.

These kinds of capabilities, especially packetization and
multiplexing, have not been implemented very well yet.
There is a small piece of reference software for AudioBIFS
that lets you take the outputs of natural-audio and SA
decoders and run the right mixing and post-processing, but
it is *extremely* difficult to use.

-----

I would suggest that if someone wants to try to build such
a system, now is a good time, and they could coordinate
with Giorgio on SAINT development. It would also be
possible with sfront, but slightly more complex because of
the intermediate C-compilation step. Perhaps John has
thought about this issue and can comment.

I think the things to do would be:

 (1) understand the relationship between the SA technology,
     the MPEG-4 Audio stream, and the MPEG-4 Systems
     capabilities;
 (2) decide how to embed the MPEG-4 packetized stream in a
     streaming transport like RTP (this decision is outside
     the scope of the standard);
 (3) implement the streaming wrapper: take SA SASC() and
     SA_a_u() elements, wrap them in an MPEG-4 Audio
     stream, packetize them according to the Systems
     standard, and transport them;
 (4) extend the decoder (SAINT) to receive these RTP
     packets, reassemble the SASC() and SA_a_u() elements,
     and hand them off to the decoder (a rough sketch of
     this step is given below).
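
For step (4), the receiving end might look roughly like
this sketch. The RTP and reassembly functions are
placeholders for whatever is decided in step (2), not an
existing API in SAINT or elsewhere:

    #include <stddef.h>

    /* Hypothetical placeholders for steps (2)-(4); none of these names
       come from the standard, saolc, or SAINT.                         */
    typedef struct { const unsigned char *data; size_t len; } Packet;
    typedef struct { const unsigned char *data; size_t len; } SAElement;

    int  rtp_receive(Packet *p);                      /* step (2): transport choice  */
    int  reassemble(const Packet *p, SAElement *out); /* rebuild a SASC() or AU      */
    int  is_header(const SAElement *e);               /* SASC() vs. SA_access_unit() */
    void saint_configure(const SAElement *header);    /* hand the header to SAINT    */
    void saint_decode(const SAElement *au);           /* hand over one access unit   */

    void streaming_front_end(void)
    {
        Packet    p;
        SAElement e;

        while (rtp_receive(&p)) {
            if (!reassemble(&p, &e))   /* an element may span several packets */
                continue;
            if (is_header(&e))
                saint_configure(&e);
            else
                saint_decode(&e);
        }
    }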

This is a good bachelor's- or master's-thesis-level
project, I think -- it's straightforward but requires some
new decisions and implementations. The 'saenc' encoder tool
provided in the saolc package could be used as the basis
for step (3) -- it contains code for generating SASC() and
SA_a_u() elements given SAOL, SASL, samples, and MIDI
files.

I hope this answers your questions, and if it raises more,
I am happy to try to answer them.

Best,

 -- Eric


