Automatic transcription of simple polyphonic music: Integrating musical knowledge

Keith D. Martin and Eric Scheirer
MIT Media Laboratory, 20 Ames Street, Room E15-401, Cambridge, MA 02139
Email: kdm@media.mit.edu, eds@media.mit.edu

Many models of music perception implicitly contain a "transcription step" and a cognitive representation at the note-event level. However, automatic music transcription systems have had only limited success with polyphonic musical input. While we reject the notion that transcription per se is a necessary component of human music perception, we are interested in building computer systems that analyze music recordings, both to allow the study of performances outside the laboratory environment and to better understand our own perceptual processes.

Toward the goal of constructing a fairly general music transcription engine, our work has been guided by the principles of auditory scene analysis (Bregman, 1990), where the need for both bottom-up (signal-driven) and top-down (expectation-driven) processing has been recognized. We have adopted the blackboard metaphor (Nii, 1986), which integrates rule-like and procedural knowledge in a structured hypothesis framework and explicitly encourages the necessary bi-directional processing. We have previously demonstrated (Martin, 1996) a blackboard transcription system that can successfully transcribe many examples of polyphonic piano music. There is no piano timbre model inherent in the system (the signal-processing front end is based on a variant of the Meddis-Hewitt pitch perception model), so we expect it to exhibit similar performance with many other musical instrument timbres, though this has not yet been tested. The principal limitation of the system (and, historically, of all polyphonic transcription systems) lies in the detection and interpretation of intervals related by the harmonic series: octaves, twelfths, double octaves, and so on.
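The Meddis-Hewitt model itself is considerably more elaborate (cochlear filtering, hair-cell simulation, and per-channel autocorrelation), but the root of the octave problem can be illustrated with a minimal toy sketch using plain autocorrelation, which is an assumption for illustration and not the system described here. A periodic signal's autocorrelation peaks not only at its period but also at integer multiples of it, so pitches an octave (or twelfth, or double octave) below the true pitch emerge as competing candidates:

```python
import numpy as np

def autocorr_peaks(signal, sr, min_f0=50.0, max_f0=1000.0):
    """Return candidate fundamental frequencies (Hz) from autocorrelation peak lags."""
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    ac /= ac[0]  # normalize so that lag 0 has value 1
    lo, hi = int(sr / max_f0), int(sr / min_f0)
    candidates = []
    for lag in range(lo + 1, hi):
        # a strong local maximum marks a candidate period
        if ac[lag] > ac[lag - 1] and ac[lag] >= ac[lag + 1] and ac[lag] > 0.5:
            candidates.append(sr / lag)
    return candidates

sr = 8000
t = np.arange(int(0.1 * sr)) / sr
# A 220 Hz tone with four harmonics (amplitudes 1/k)
tone = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in (1, 2, 3, 4))
# candidates include ~220 Hz and octave-related subharmonics (~110, ~73, ~55 Hz)
print(autocorr_peaks(tone, sr))
```

The bottom-up evidence alone cannot distinguish the true pitch from its subharmonics, which is precisely where top-down knowledge in the blackboard framework can help.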
The correlation-based front end of (Martin, 1996) potentially leads to a bottom-up octave-detection algorithm, but it may be possible to build a more robust system by incorporating musical knowledge within the blackboard framework. To this end, we propose two directions for future development: the addition of music-theoretic rules acquired from experts through in-depth study of a particular musical domain (e.g., voice-leading in 18th-century counterpoint, or allowable tensions in Romantic orchestral works), and the inclusion of statistical or probabilistic knowledge automatically gleaned from a large corpus of symbolic (MIDI) and/or acoustic (recorded) music (e.g., note probabilities, orchestral "textures"). While the former path may lead to more accurate transcription in limited domains, we believe the latter is likely to be more generally applicable, particularly when the musical genre is not known a priori.

We will present an overview of our approach to the transcription problem, giving polyphonic transcription examples from our current system that highlight "the octave problem," and we will discuss some of the possibilities for incorporating musical knowledge into transcription systems, showing that such an approach is a natural extension within the blackboard metaphor. We will additionally discuss implications of the two future directions described above from the viewpoint of music psychology.

References

Bregman, A. S. (1990). Auditory Scene Analysis. Cambridge, MA: MIT Press.
Martin, K. D. (1996). Automatic transcription of simple polyphonic music: Robust front-end processing. Cambridge, MA: MIT Media Lab Perceptual Computing TR #399.
Nii, H. P. (1986). Blackboard systems: The blackboard model of problem solving and the evolution of blackboard architectures. AI Magazine, 7(2), 38-53.
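As a concrete sketch of the statistical-knowledge direction mentioned above (the function names and toy corpus here are hypothetical, not part of the described system), note-occurrence priors counted from a symbolic corpus could be used to rescore octave-related hypotheses that the bottom-up front end cannot disambiguate:

```python
from collections import Counter

def note_prior(corpus):
    """Unigram prior over MIDI note numbers, with add-one smoothing over
    the 88-key piano range (MIDI notes 21-108)."""
    counts = Counter(n for seq in corpus for n in seq)
    total = sum(counts.values()) + (108 - 21 + 1)
    return {n: (counts[n] + 1) / total for n in range(21, 109)}

def rescore(hypotheses, prior):
    """Rank competing note hypotheses (e.g. octave-related candidates)
    by their corpus-derived prior probability."""
    return sorted(hypotheses, key=lambda n: prior[n], reverse=True)

# Toy symbolic corpus: two melodic fragments around A4 (MIDI note 69)
corpus = [[69, 71, 72, 74, 76], [76, 74, 72, 71, 69]]
prior = note_prior(corpus)
# Octave-related candidates A2, A3, A4: the prior favors A4 (69)
print(rescore([45, 57, 69], prior))
```

In a full system, such priors (or richer models of melodic transitions and orchestral textures) would act as one knowledge source among many on the blackboard, weighting hypotheses rather than deciding outright.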