It is officially called "Multimedia Content Description Interface", a means of providing meta-data for multimedia. MPEG-7 focusses on two different work items in this context:
For the second work item, MPEG-7 does not restrict itself to traditional "card catalogue" descriptions (e.g. Dublin Core) of video and/or audio material; it also develops descriptions that require (human or automatic) content analysis of the multimedia material and that can capture information about structural and semantic content. MPEG-7 does not standardize extraction algorithms; it specifies only the outcomes of such algorithms. In developing the standard, however, MPEG will build tools for creating descriptions in order to prove their effectiveness, and these tools may be extraction algorithms. They will not be part of the standard, i.e. they will not be "normative".
What is the relationship to previous MPEG efforts?
Where MPEG-1 and -2 concentrated almost entirely on compression,
MPEG-4 moved to a higher level of abstraction, coding objects and
using content-specific coding techniques. MPEG-7 moves to an even
higher level of abstraction, a cognitive coding, some might say.
In principle, MPEG-1, -2, and -4 are designed to represent the
information itself, while MPEG-7 is meant to represent information
about the information (although MPEG-4 and -7 have some areas in
common). Another way of looking at it is that MPEG-1, -2, and -4 made
content available. MPEG-7 allows you to describe and thus find the
content you need.
What are the general applications for MPEG-7?
MPEG-7 aims at making the web as searchable for multimedia content as it is searchable for text today. This also applies to making large content archives accessible to the public (or enabling people to identify content to buy). A key issue here is a standardized set of descriptors for the same type of data across content archives, so that the same tools can access many different collections in the same way. The information used for content retrieval may also be used by agents for selection and filtering of broadcast material. Additionally, the meta-data may be used for more advanced access to the underlying data, by enabling automatic or semi-automatic multimedia presentation or editing.
In summary, example application areas of MPEG-7 audio are:
Although the list is still expanding, we can envision indexing music,
sound effects, and spoken-word content in the audio-only arena. MPEG-7
will enable query-by-example such as query-by-humming. In addition,
audio tools play a large role in typical audio-visual content in terms
of indexing film soundtracks and the like. If someone wants to manage
a large amount of audio content, whether selling it, managing it
internally, or making it openly available to the world, MPEG-7 is
potentially the solution.
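To make the query-by-humming idea concrete, here is a minimal sketch of one common approach, matching melodic contours; the pitch data, catalogue, and matching function are purely illustrative and are not defined by MPEG-7:

    # Minimal sketch of query-by-humming via melodic contour matching.
    # Everything here is illustrative; MPEG-7 does not define this code.
    def contour(pitches):
        """Reduce a pitch sequence to up/down/same steps (+1, -1, 0)."""
        return [(p2 > p1) - (p2 < p1) for p1, p2 in zip(pitches, pitches[1:])]

    def match(query_pitches, catalogue):
        """Rank catalogue entries by how many contour steps agree with the query."""
        q = contour(query_pitches)
        def score(entry_pitches):
            return sum(a == b for a, b in zip(q, contour(entry_pitches)))
        return sorted(catalogue, key=lambda item: score(item[1]), reverse=True)

    catalogue = [("Ode to Joy",    [64, 64, 65, 67, 67, 65, 64, 62]),
                 ("Frere Jacques", [60, 62, 64, 60, 60, 62, 64, 60])]
    hummed = [65, 65, 66, 68, 68, 66, 65, 63]    # an off-key hum of Ode to Joy
    print(match(hummed, catalogue)[0][0])        # -> "Ode to Joy"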
One specific application area supported by MPEG-7 is that of automatic speech recognition (ASR). A spoken content description scheme will embody the output of any state-of-the-art ASR system, at a host of different semantic levels. This will allow searching on spoken and visual events: "Find me the part where Romeo says 'It is the East and Juliet is the sun'".
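As a rough, hedged illustration of searching such ASR output (not the actual spoken content description scheme, whose form is still being defined), a word lattice can be modelled as labelled edges and searched for a phrase; the toy lattice and the find_phrase helper below are hypothetical:

    # Illustrative sketch only: a toy word lattice and phrase search.
    # The data model is hypothetical, not the MPEG-7 spoken content DS.
    from collections import defaultdict

    # Each edge: (from_node, to_node, word, confidence)
    lattice = [
        (0, 1, "it", 0.9), (1, 2, "is", 0.8), (2, 3, "the", 0.9),
        (3, 4, "east", 0.7), (3, 4, "beast", 0.2),
        (4, 5, "and", 0.9), (5, 6, "juliet", 0.8),
        (6, 7, "is", 0.9), (7, 8, "the", 0.9), (8, 9, "sun", 0.6),
    ]

    def find_phrase(lattice, phrase):
        """Return True if the word sequence labels some path in the lattice."""
        words = phrase.lower().split()
        by_start = defaultdict(list)          # (from_node, word) -> [to_node]
        for frm, to, word, conf in lattice:
            by_start[(frm, word)].append(to)

        def walk(node, remaining):
            if not remaining:
                return True
            return any(walk(nxt, remaining[1:])
                       for nxt in by_start[(node, remaining[0])])

        starts = {frm for frm, _, _, _ in lattice}   # try every possible start
        return any(walk(n, words) for n in starts)

    print(find_phrase(lattice, "it is the east and juliet is the sun"))  # True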
Please see the MPEG-7 Applications Document (N2426) for a non-exhaustive list of representative examples.
What do the abbreviations used within MPEG-7 audio stand for?
DDL = Description Definition Language
D = Descriptor
DS = Description Scheme
XM = Experimentation Model
CE = Core Experiment
What are the foreseen elements of MPEG-7?
MPEG-7 work is currently seen as being in three parts: Descriptors (D's), Description Schemes (DS's), and a Description Definition Language (DDL). Each is equally crucial to the entire MPEG-7 effort.
All of these elements combine to create a Description, or an instantiated DS, which is represented using the DDL, and which incorporates Ds as appropriate to the data model.
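As a purely illustrative sketch of how these elements relate (this is not the DDL, which is a separate language under definition, and the class and descriptor names below are hypothetical), the D/DS/Description relationship might be pictured like this:

    # Purely illustrative: hypothetical classes sketching how Descriptors (Ds)
    # combine within a Description Scheme (DS), and how an instantiated DS
    # forms a Description. This is NOT the MPEG-7 DDL.
    from dataclasses import dataclass, field

    @dataclass
    class Descriptor:               # a "D": one feature of the content
        name: str                   # e.g. a hypothetical "ContourStep"
        value: object               # the feature value itself

    @dataclass
    class DescriptionScheme:        # a "DS": structure relating Ds and other DSs
        name: str
        descriptors: list = field(default_factory=list)
        children: list = field(default_factory=list)    # nested DSs

    # A Description is an instantiated DS, filled in for one piece of content.
    melody = DescriptionScheme(
        name="MelodySketch",
        descriptors=[Descriptor("ContourStep", [+1, -1, -1, 0, +1])],
    )
    description = DescriptionScheme(name="AudioSegmentSketch", children=[melody])
    print(description)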
We see descriptors as coming from the following possible places:
Ultimately, we recognise that there is more than one way to accomplish the feat of content description, and expect users to adopt the solutions matching their own needs.
A core experiment (CE) is the procedure by which a description proposal makes its way into the standards document and into the reference software for the evolving standard. It is where multiple parties both vie for position and collaborate to find the best technology for inclusion in the standard. There will be a set of procedures that must be followed in executing a CE.
What is the XM? What is the development environment?
The eXperimentation Model (XM) is the development base for the Core Experiments and is the reference software for the evolving standard. The development environment for Windows is Visual Studio v5.0. The development environment for Linux has yet to be fully defined.
What audio description tools are currently in the Working Draft or are being worked on?
The only audio description tools currently included in the WD are the speech recognition description tools. They consist of combined word and phoneme lattices and may additionally store topic labels, a predefined taxonomy, or textual summaries related to the ASR output stored in the lattice. Other audio description tools currently at CE status are:
In addition, there are generic description tools in the WD which are highly relevant to audio descriptions. Examples are:
What are potential connections between MPEG-7 tools and MPEG-4 tools?
There are many possible connections between MPEG-4 tools and
MPEG-7. Most of the content-specific tools contained in MPEG-4 have
great potential because a model for the content is already specified:
by choosing a method of coding, one selects the features that are
important to the material. For example, if one encodes a sound by
using sinusoidal tracks, then MPEG-7 asks which of those tracks are
most significant in distinguishing the sound. It is a matter of
abstracting the representation to the point where similarity can be measured.
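As a hedged sketch of what abstracting to the point of measuring similarity could mean for sinusoidal tracks (this is not an MPEG-7 descriptor or extraction algorithm; the data and distance measure are invented for illustration), one might keep only the most significant tracks and compare them:

    # Illustrative only: comparing two sounds by their dominant sinusoidal
    # tracks, each given as a (frequency_hz, mean_amplitude) pair.
    def top_tracks(tracks, n=3):
        """Keep the n most significant tracks, ranked by amplitude."""
        return sorted(tracks, key=lambda t: t[1], reverse=True)[:n]

    def track_distance(tracks_a, tracks_b, n=3):
        """Crude distance: summed relative frequency differences between
        the top-n tracks of each sound, matched in amplitude order."""
        a, b = top_tracks(tracks_a, n), top_tracks(tracks_b, n)
        return sum(abs(fa - fb) / max(fa, fb)
                   for (fa, _), (fb, _) in zip(a, b))

    # hypothetical sounds
    violin = [(440.0, 1.0), (880.0, 0.5), (1320.0, 0.3), (1760.0, 0.1)]
    flute  = [(442.0, 1.0), (884.0, 0.2), (1326.0, 0.05)]
    print(track_distance(violin, flute))   # small value -> similar spectra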
The Structured Audio tools also have a strong relationship to
MPEG-7. They synthesize a sound from an already-existing
description. The challenge in this case is to reach a suitable level
of abstraction. There can be many, very different, descriptions which
can be synthesized into perceptually indistinguishable sounds. It is
clear that the models (e.g. of a musical instrument or an acoustic
space) used directly within structured audio will not be sufficiently
abstract or constrained. Therefore, it is an open research question as
to how to identify, select, and build a structured audio model for
MPEG-7.
MPEG-4 also talks about "description" of media. What is the relation to MPEG-7?
MPEG-4 is concerned with generative descriptions, i.e. the descriptions consist of a more or less complete set of instructions from which a system may generate a piece of content. In other words, an MPEG-4 description is essentially a compressed form of the content.
By contrast, MPEG-7 descriptions are (in general) non-generative: they are designed for other purposes (search, etc.) and do not provide complete descriptions from which the media experience may be created. MPEG-7 descriptions exist alongside the media and provide information about it ("bits about the bits").
It is planned that MPEG-4 will include some object content identifiers and other simple meta-data. However, their capabilities will be severely limited compared to MPEG-7.
Can you point me to more detailed information in the literature?
What is the current time frame for MPEG-7?
How do I join the MPEG-7 Audio Issues AHG to keep track of current development?
Send a message with "subscribe" in the Subject line to: mpeg-7-aud-request@starlab.net
Where can I find out more about MPEG-7?
Where do I go if I have more questions?
If you have any questions, please do not hesitate to join the MPEG-7 Audio Issues AHG and direct your questions to the reflector. Chances are, there are others who share your concerns, and the AHG exists specifically to identify those concerns. If you must reach an individual, Adam Lindsay <adam@starlab.net> will be happy to answer any questions regarding MPEG-7.