In a study published on the preprint server Arxiv.org, researchers affiliated with the Institute of Computational Perception at Johannes Kepler University Linz and the Austrian Research Institute for Artificial Intelligence describe an AI system that predicts the most likely position within sheet music matching an audio recording. They report that it outperforms current state-of-the-art image-based score followers in terms of alignment precision.
Score following is the basis for applications like automatic accompaniment, page-turning, and synchronizing live performances to visualizations. Existing systems either rely on fixed-size, small snippets of sheet music images or require a computer-readable score representation extracted using optical music recognition. But the researchers’ system can uniquely observe an entire sheet music page, following musical performances of any length in an end-to-end fashion.
The team modeled score following as an image segmentation task. Based on a musical performance up to a given point in time, their system predicts a segmentation mask (a small image “piece”) for the part of the score that corresponds to the currently played music. Trackers that rely on only a fixed-size audio input generally can’t distinguish between repeating note patterns once these exceed that fixed context. The proposed system has no such issue, the researchers say, even when a score spans a long stretch of the audio.
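To make the idea of reading a position off a segmentation mask concrete, here is a minimal sketch in Python. It assumes the mask is a grid of per-pixel probabilities and takes the center of mass of the confident pixels as the position on the page. The function name `mask_to_position`, the 0.5 threshold, and the toy mask are illustrative assumptions, not details from the study.

```python
from typing import List, Tuple


def mask_to_position(mask: List[List[float]], threshold: float = 0.5) -> Tuple[float, float]:
    """Return the (row, col) center of mass of pixels at or above `threshold`.

    Hypothetical helper: the 0.5 threshold is an assumption, not a value
    reported by the researchers.
    """
    total = 0.0
    row_sum = 0.0
    col_sum = 0.0
    for r, row in enumerate(mask):
        for c, p in enumerate(row):
            if p >= threshold:
                # Weight each confident pixel by its probability.
                total += p
                row_sum += r * p
                col_sum += c * p
    if total == 0.0:
        raise ValueError("no pixel reaches the threshold")
    return row_sum / total, col_sum / total


# Toy 4x5 "page": a 2x2 block of confident pixels marks the music being played.
toy_mask = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.9, 0.0],
    [0.0, 0.1, 0.9, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
position = mask_to_position(toy_mask)
print(position)
```

In this toy case the position lands in the middle of the confident 2x2 block, and the low-probability pixels next to it are ignored by the threshold.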
In the course of experiments, the researchers sourced polyphonic piano samples from the Multimodal Sheet Music Dataset (MSMD), which comprises pieces by various composers including Bach, Mozart, and Beethoven. After manually identifying and fixing alignment errors, they trained their system on 353 pairs of sheet music and MIDI information.
The coauthors report that their system outperformed all baselines at every error threshold except the highest, achieving more precise results in terms of time difference (i.e., higher percentages for tighter error thresholds). It occasionally yielded errors, which the researchers attribute to the system’s freedom to perform “big jumps” on the sheet image. But they assert the experimental results show the system is “very precise” in most contexts.
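The threshold-based comparison described above can be illustrated with a short Python sketch: for each error threshold, count the share of predictions whose time difference from the ground truth falls within it. The function name `ratio_within_thresholds`, the threshold values, and the error values are made up for illustration and are not taken from the study.

```python
from typing import Dict, Sequence


def ratio_within_thresholds(
    errors_sec: Sequence[float], thresholds_sec: Sequence[float]
) -> Dict[float, float]:
    """Map each threshold to the fraction of absolute errors at or below it."""
    if not errors_sec:
        raise ValueError("need at least one error value")
    n = len(errors_sec)
    return {
        t: sum(1 for e in errors_sec if abs(e) <= t) / n
        for t in thresholds_sec
    }


# Hypothetical time differences (seconds) between predicted and true positions.
toy_errors = [0.02, 0.04, 0.30, 0.70, 1.50, 6.00]
# Hypothetical thresholds (seconds); the article does not list the actual ones.
toy_thresholds = [0.05, 0.5, 1.0, 5.0]

ratios = ratio_within_thresholds(toy_errors, toy_thresholds)
for threshold, ratio in ratios.items():
    print(f"<= {threshold:>4} s: {ratio:.1%}")
```

A tracker that is more precise shows higher fractions at the tight thresholds, while a tracker that sometimes jumps far away loses ground mainly at the loosest one.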
“Future work will … require testing on scanned or photographed sheet images, to gauge generalization capabilities of the system in the visual domain as well,” the researchers wrote. “The next step towards a system with greater capabilities is to either explicitly or implicitly incorporate a mechanism to handle repetitions in the score as well as in the performance. We assume that the proposed method will be able to acquire this capability quite naturally from properly prepared training data, although we suspect its performance will heavily depend on its implicit encoding of the audio history so far, i. e., how large an auditory context the recurrent network is able to store.”