Automatic transcription of simple polyphonic music: Integrating musical knowledge

Keith D. Martin and Eric Scheirer
MIT Media Laboratory, 20 Ames Street, Room E15-401, Cambridge, MA 02139
Email: kdm@media.mit.edu, eds@media.mit.edu

Many models of music perception implicitly contain a "transcription step" and a cognitive representation at the note-event level. However, automatic music transcription systems have had only limited success with polyphonic musical input. While we reject the notion that transcription per se is a necessary component of human music perception, we are interested in building computer systems that analyze music recordings, both to allow the study of performances outside the laboratory environment and to better understand our own perceptual processes.

Toward the goal of constructing a fairly general music transcription engine, our work has been guided by the principles of auditory scene analysis (Bregman, 1990), where the need for both bottom-up (signal-driven) and top-down (expectation-driven) processing has been recognized. We have adopted the blackboard metaphor (Nii, 1986), which integrates rule-like and procedural knowledge in a structured hypothesis framework and explicitly encourages the necessary bi-directional processing. We have previously demonstrated (Martin, 1996) a blackboard transcription system that can successfully transcribe many examples of polyphonic piano music. There is no piano timbre model inherent in the system (the signal-processing front end is based on a variant of the Meddis-Hewitt pitch perception model), so we expect it to exhibit similar performance with many other musical instrument timbres, though this has not yet been tested. The principal limitation of the system (and, historically, of all polyphonic transcription systems) lies in the detection and interpretation of intervals related by the harmonic series: octaves, twelfths, double octaves, and so on.
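The Meddis-Hewitt model itself is considerably more elaborate (cochlear filtering, hair-cell simulation, and per-channel autocorrelation), but the root of the octave problem can be illustrated with a minimal toy sketch using plain autocorrelation, which is an assumption for illustration and not the system described here. A periodic signal's autocorrelation peaks not only at its period but also at integer multiples of it, so pitches an octave (or twelfth, or double octave) below the true pitch emerge as competing candidates:

```python
import numpy as np

def autocorr_peaks(signal, sr, min_f0=50.0, max_f0=1000.0):
    """Return candidate fundamental frequencies (Hz) from autocorrelation peak lags."""
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    ac /= ac[0]  # normalize so that lag 0 has value 1
    lo, hi = int(sr / max_f0), int(sr / min_f0)
    candidates = []
    for lag in range(lo + 1, hi):
        # a strong local maximum marks a candidate period
        if ac[lag] > ac[lag - 1] and ac[lag] >= ac[lag + 1] and ac[lag] > 0.5:
            candidates.append(sr / lag)
    return candidates

sr = 8000
t = np.arange(int(0.1 * sr)) / sr
# A 220 Hz tone with four harmonics (amplitudes 1/k)
tone = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in (1, 2, 3, 4))
# candidates include ~220 Hz and octave-related subharmonics (~110, ~73, ~55 Hz)
print(autocorr_peaks(tone, sr))
```

The bottom-up evidence alone cannot distinguish the true pitch from its subharmonics, which is precisely where top-down knowledge in the blackboard framework can help.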
The correlation-based front end of (Martin, 1996) potentially leads to a bottom-up octave-detection algorithm, but it may be possible to build a more robust system by incorporating musical knowledge within the blackboard framework. To this end, we propose two directions for future development: the addition of music-theoretic rules acquired from experts through in-depth study of a particular musical domain (e.g., voice-leading in 18th-century counterpoint, or allowable tensions in Romantic orchestral works), and the inclusion of statistical or probabilistic knowledge automatically gleaned from a large corpus of symbolic (MIDI) and/or acoustic (recorded) music (e.g., note probabilities, orchestral "textures"). While the former path may lead to more accurate transcription in limited domains, we believe the latter is likely to be more generally applicable, particularly when the musical genre is not known a priori.

We will present an overview of our approach to the transcription problem, giving polyphonic transcription examples from our current system that highlight "the octave problem," and we will discuss some of the possibilities for incorporating musical knowledge into transcription systems, showing that such an approach is a natural extension within the blackboard metaphor. We will additionally discuss implications of the two future directions described above from the viewpoint of music psychology.

References

Bregman, A. S. (1990). Auditory Scene Analysis. Cambridge, MA: MIT Press.
Martin, K. D. (1996). Automatic transcription of simple polyphonic music: Robust front-end processing. Cambridge, MA: MIT Media Lab Perceptual Computing TR #399.
Nii, H. P. (1986). Blackboard systems: The blackboard model of problem solving and the evolution of blackboard architectures. AI Magazine, 7(2), 38-53.
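As a concrete sketch of the statistical-knowledge direction mentioned above (the function names and toy corpus here are hypothetical, not part of the described system), note-occurrence priors counted from a symbolic corpus could be used to rescore octave-related hypotheses that the bottom-up front end cannot disambiguate:

```python
from collections import Counter

def note_prior(corpus):
    """Unigram prior over MIDI note numbers, with add-one smoothing over
    the 88-key piano range (MIDI notes 21-108)."""
    counts = Counter(n for seq in corpus for n in seq)
    total = sum(counts.values()) + (108 - 21 + 1)
    return {n: (counts[n] + 1) / total for n in range(21, 109)}

def rescore(hypotheses, prior):
    """Rank competing note hypotheses (e.g. octave-related candidates)
    by their corpus-derived prior probability."""
    return sorted(hypotheses, key=lambda n: prior[n], reverse=True)

# Toy symbolic corpus: two melodic fragments around A4 (MIDI note 69)
corpus = [[69, 71, 72, 74, 76], [76, 74, 72, 71, 69]]
prior = note_prior(corpus)
# Octave-related candidates A2, A3, A4: the prior favors A4 (69)
print(rescore([45, 57, 69], prior))
```

In a full system, such priors (or richer models of melodic transitions and orchestral textures) would act as one knowledge source among many on the blackboard, weighting hypotheses rather than deciding outright.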