Multimedia Research, ISSN: 2582-547X

A Comprehensive Review on Automatic Music Transcription: Survey of Transcription Techniques

Abstract

“In music, transcription is the practice of notating a piece or a sound which was previously unnotated and/or unpopular as written music.” A complete transcription is achieved only when the timing, pitch, and instrument of every sound event are resolved. In music transcription systems, the MIDI file has proven a suitable format for melodic notation. This survey reviews 65 papers concerning music transcription using machine learning techniques. Accordingly, systematic analyses of the adopted techniques are carried out and presented concisely. The performance and maximum reported achievement of each contribution are also summarized. Moreover, the various datasets used in music transcription research are considered and reviewed. Finally, the survey highlights the open research problems and weaknesses that may help researchers introduce new techniques for music transcription.
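As a brief illustration of why MIDI suits melodic notation (a sketch added for context, not drawn from any surveyed paper): each MIDI note event carries an integer pitch number, an onset time, and a duration, and the pitch number maps to frequency by the equal-temperament relation f = 440 × 2^((n − 69)/12), with note 69 defined as A4 = 440 Hz. A transcription system that estimates fundamental frequencies can therefore quantize them directly to note numbers:

```python
import math

def midi_to_freq(note: int) -> float:
    """Frequency in Hz of a MIDI note number (12-tone equal temperament)."""
    return 440.0 * 2 ** ((note - 69) / 12)

def freq_to_midi(freq: float) -> int:
    """Quantize a detected fundamental frequency to the nearest MIDI note."""
    return round(69 + 12 * math.log2(freq / 440.0))

# Middle C (C4) is MIDI note 60, approximately 261.63 Hz.
print(round(midi_to_freq(60), 2))   # 261.63
print(freq_to_midi(261.63))         # 60
```

This quantization step is the last stage of many pitch-based transcription pipelines: frame-level F0 estimates are rounded to the semitone grid and grouped into note events.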

References

  • Masahiro Yukawa, Hideaki Kagami, "Supervised nonnegative matrix factorization via minimization of regularized Moreau-envelope of divergence function with application to music transcription", Journal of the Franklin Institute, vol. 355, no. 4, pp. 2041-2066, March 2018.
  • Miguel A. Román, Antonio Pertusa, Jorge Calvo-Zaragoza, "Data representations for audio-to-score monophonic music transcription", Expert Systems with Applications, 30 December 2020.
  • J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, F. J. Cañadas-Quesada, N. Ruiz-Reyes, "Constrained non-negative sparse coding using learnt instrument templates for real-time music transcription", Engineering Applications of Artificial Intelligence, August 2013.
  • Giovanni Costantini, Renzo Perfetti, Massimiliano Todisco, "Event based transcription system for polyphonic piano music", Signal Processing, September 2009.
  • Seungmin Rho, Byeong-jun Han, Eenjun Hwang, Minkoo Kim, "MUSEMBLE: A novel music retrieval system with automatic voice query transcription and reformulation", Journal of Systems and Software, July 2008.
  • S. Sigtia, E. Benetos, and S. Dixon, "An End-to-End Neural Network for Polyphonic Piano Music Transcription," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927-939, May 2016. doi: 10.1109/TASLP.2016.2533858.
  • N. Kroher and E. Gómez, "Automatic Transcription of Flamenco Singing From Polyphonic Music Recordings," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 901-913, May 2016. doi: 10.1109/TASLP.2016.2531284.
  • A. Cogliati, Z. Duan, and B. Wohlberg, "Context-Dependent Piano Music Transcription With Convolutional Sparse Coding," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2218-2230, Dec. 2016. doi: 10.1109/TASLP.2016.2598305.
  • M. Akbari and H. Cheng, "Real-Time Piano Music Transcription Based on Computer Vision," IEEE Transactions on Multimedia, vol. 17, no. 12, pp. 2113-2121, Dec. 2015. doi: 10.1109/TMM.2015.2473702.
  • K. O’Hanlon, H. Nagano, N. Keriven, and M. D. Plumbley, "Non-Negative Group Sparsity with Subspace Note Modelling for Polyphonic Transcription," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 530-542, March 2016. doi: 10.1109/TASLP.2016.2515514.
  • Y. Wan, X. Wang, R. Zhou, and Y. Yan, "Automatic Piano Music Transcription Using Audio-Visual Features," Chinese Journal of Electronics, vol. 24, no. 3, pp. 596-603, July 2015. doi: 10.1049/cje.2015.07.027.
  • S. Tsuruta, M. Fujimoto, M. Mizuno, and Y. Takashima, "Personal computer-music system-song transcription and its application," IEEE Transactions on Consumer Electronics, vol. 34, no. 3, pp. 819-823, Aug. 1988. doi: 10.1109/30.20189.
  • E. Benetos and S. Dixon, "Joint Multi-Pitch Detection Using Harmonic Envelope Estimation for Polyphonic Music Transcription", IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1111-1123, Oct. 2011. doi: 10.1109/JSTSP.2011.2162394.
  • F. Argenti, P. Nesi, and G. Pantaleo, "Automatic Transcription of Polyphonic Music Based on the Constant-Q Bispectral Analysis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1610-1630, Aug. 2011. doi: 10.1109/TASL.2010.2093894.
  • N. Bertin, R. Badeau, and E. Vincent, "Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 538-549, March 2010. doi: 10.1109/TASL.2010.2041381.
  • J. Abeßer and G. Schuller, "Instrument-Centered Music Transcription of Solo Bass Guitar Recordings," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 9, pp. 1741-1750, Sept. 2017. doi: 10.1109/TASLP.2017.2702384.
  • A. Rizzi, M. Antonelli, and M. Luzi, "Instrument Learning and Sparse NMD for Automatic Polyphonic Music Transcription," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1405-1415, July 2017. doi: 10.1109/TMM.2017.2674603.
  • E. Nakamura, K. Yoshii, and S. Sagayama, "Rhythm Transcription of Polyphonic Piano Music Based on Merged- Output HMM for Multiple Voices," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 794-806, April 2017. doi: 10.1109/TASLP.2017.2662479.
  • K. Lee and M. Slaney, "Acoustic Chord Transcription and Key Extraction From Audio Using Key-Dependent HMMs Trained on Synthesized Audio," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 291-301, Feb. 2008. doi: 10.1109/TASL.2007.914399.
  • G. E. Poliner, D. P. W. Ellis, A. F. Ehmann, E. Gomez, S. Streich, and B. Ong, "Melody Transcription From Music Audio: Approaches and Evaluation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1247-1256, May 2007. doi: 10.1109/TASL.2006.889797.
  • E. Gómez and J. Bonada, "Towards Computer-Assisted Flamenco Transcription: An Experimental Comparison of Automatic Transcription Algorithms as Applied to A Cappella Singing," Computer Music Journal, vol. 37, no. 2, pp. 73-90, June 2013. doi: 10.1162/COMJ_a_00180.
  • B. Fuentes, R. Badeau, and G. Richard, "Harmonic Adaptive Latent Component Analysis of Audio and Application to Music Transcription," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1854-1866, Sept. 2013. doi: 10.1109/TASL.2013.2260741.
  • E. Benetos and S. Dixon, "A Shift-Invariant Latent Variable Model for Automatic Music Transcription," Computer Music Journal, vol. 36, no. 4, pp. 81-94, Dec. 2012. doi: 10.1162/COMJ_a_00146.
  • E. Nakamura, K. Yoshii, and S. Dixon, "Note Value Recognition for Piano Transcription Using Markov Random Fields," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 9, pp. 1846-1858, Sept. 2017. doi: 10.1109/TASLP.2017.2722103.
  • A. M. Barbancho, A. Klapuri, L. J. Tardon, and I. Barbancho, "Automatic Transcription of Guitar Chords and Fingering From Audio," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 915-921, March 2012. doi: 10.1109/TASL.2011.2174227.
  • M. Marolt, "A connectionist approach to automatic transcription of polyphonic piano music," IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 439-449, June 2004. doi: 10.1109/TMM.2004.827507.
  • S. Ewert and M. Sandler, "Piano Transcription in the Studio Using an Extensible Alternating Directions Framework," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 1983-1997, Nov. 2016. doi: 10.1109/TASLP.2016.2593801.
  • J. P. Bello, L. Daudet, and M. B. Sandler, "Automatic Piano Transcription Using Frequency and Time-Domain Information," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 2242-2251, Nov. 2006. doi: 10.1109/TASL.2006.872609.
  • G. Reis, F. Fernandez de Vega, and A. Ferreira, "Automatic Transcription of Polyphonic Piano Music Using Genetic Algorithms, Adaptive Spectral Envelope Modeling, and Dynamic Noise Level Estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 8, pp. 2313-2328, Oct. 2012. doi: 10.1109/TASL.2012.2201475.
  • R. Nishikimi, E. Nakamura, M. Goto, K. Itoyama, and K. Yoshii, "Bayesian Singing Transcription Based on a Hierarchical Generative Model of Keys, Musical Notes, and F0 Trajectories," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1678-1691, 2020. doi: 10.1109/TASLP.2020.2996095.
  • A. Cogliati, Z. Duan, and B. Wohlberg, "Piano Transcription With Convolutional Sparse Lateral Inhibition," IEEE Signal Processing Letters, vol. 24, no. 4, pp. 392-396, April 2017. doi: 10.1109/LSP.2017.2666183.
  • Ye Wang and Bingjun Zhang, "Application-specific music transcription for tutoring", IEEE MultiMedia, Vol. 15, No. 3, pp. 70-74, 2008.
  • Shlomo Dubnov, "Unified view of prediction and repetition structure in audio signals with application to interest point detection", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 16, No. 2, pp. 327-337, 2008.
  • J. J. Carabias-Orti, P. Vera-Candeas, F. J. Cañadas-Quesada, and N. Ruiz-Reyes, "Music scene-adaptive harmonic dictionary for unsupervised note-event detection", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 3, pp. 473-486, 2010.
  • Namgook Cho and C.-C. Jay Kuo, "Sparse music representation with source-specific dictionaries and its application to signal separation", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 2, pp. 326-337, 2011.
  • Jean-Louis Durrieu, Bertrand David, and Gaël Richard, "A musically motivated mid-level representation for pitch estimation and musical audio source separation", IEEE Journal of Selected Topics in Signal Processing, Vol. 5, No. 6, pp. 1180-1191, 2011.
  • Akira Maezawa, Katsutoshi Itoyama, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G. Okuno, "Automated violin fingering transcription through analysis of an audio recording", Computer Music Journal, Vol. 36, No. 3, pp. 57-72, 2012.
  • Stanislaw Raczynski, Emmanuel Vincent, and Shigeki Sagayama, "Dynamic Bayesian networks for symbolic polyphonic pitch modeling", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 21, No. 9, pp. 1830-1840, 2013.
  • Anssi Klapuri and Tuomas Virtanen, "Representing musical sounds with an interpolating state model", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 3, pp. 613-624, 2010.
  • Vipul Arora, and Laxmidhar Behera, "Musical source clustering and identification in polyphonic audio", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 22, No. 6, pp. 1003-1012, 2014.
  • Alfonso Perez-Carrillo and Marcelo M. Wanderley, "Indirect Acquisition of Violin Instrumental Controls from Audio Signal with Hidden Markov Models", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 5, pp. 932-940, 2015.
  • Paul H. Peeling, and Simon J. Godsill, "Generative spectrogram factorization models for polyphonic piano transcription", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 3, pp. 519-527, 2010.
  • Graham Grindlay, and Daniel PW Ellis, "Transcribing multi-instrument polyphonic music with hierarchical eigen instruments", IEEE Journal of Selected Topics in Signal Processing, Vol. 5, No. 6, pp. 1159-1169, 2011.
  • Olivier Gillet and Gaël Richard, "Transcription and separation of drum signals from polyphonic music", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 16, No. 3, pp. 529-540, 2008.
  • Amelie Anglade, Emmanouil Benetos, Matthias Mauch and Simon Dixon, "Improving music genre classification using automatically induced harmony rules", Journal of New Music Research, Vol. 39, No. 4, pp. 349-361, 2010.
  • Vishweshwara Rao and Preeti Rao, "Vocal melody extraction in the presence of pitched accompaniment in polyphonic music", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 8, pp. 2145-2154, 2010.
  • Vishweshwara Rao, Pradeep Gaddipati, and Preeti Rao, "Signal-driven window-length adaptation for sinusoid detection in polyphonic music", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 1, pp. 342-348, 2012.
  • Matija Marolt, "Automatic transcription of bell chiming recordings", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 3, pp. 844-853, 2012.
  • Jia-Min Ren and Jyh-Shing Roger Jang, "Discovering time-constrained sequential patterns for music genre classification", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 4, pp. 1134-1144, 2012.
  • Yizhar Lavner and Dima Ruinskiy, "A decision-tree-based algorithm for speech/music classification and segmentation", EURASIP Journal on Audio, Speech, and Music Processing, vol. 2, 2009.
  • Jingyi Shen, Runqi Wang, Han-Wei Shen, "Visual exploration of latent space for traditional Chinese music", Visual Informatics, vol. 4, no. 2, pp.99-108, June 2020.
  • Dhara, P., Rao, K.S. “Automatic note transcription system for Hindustani classical music”. Int J Speech Technol, vol. 21, pp. 987–1003, 2018. https://doi.org/10.1007/s10772-018-9554-1.
  • Akbari, M., Liang, J. & Cheng, H. “A real-time system for online learning-based visual transcription of piano music”. Multimed Tools Appl, vol. 77, pp. 25513–25535, 2018. https://doi.org/10.1007/s11042-018-5803-1.
  • Bhalarao, R., Raval, M., "Automated tabla syllable transcription using image processing techniques", Multimed Tools Appl, 2020. https://doi.org/10.1007/s11042-020-09417-0.
  • Rodríguez-Serrano, F.J., Carabias-Orti, J.J., Vera-Candeas, P., "Monophonic constrained non-negative sparse coding using instrument models for audio separation and transcription of monophonic source-based polyphonic mixtures", Multimed Tools Appl, vol. 72, pp. 925–949, 2014. https://doi.org/10.1007/s11042-013-1398-8.
  • Shen, H., Lee, C., "An interactive Whistle-to-Music composing system based on transcription, variation and chords generation", Multimed Tools Appl, vol. 53, pp. 253–269, 2011. https://doi.org/10.1007/s11042-010-0510-6.
  • V. Arora and L. Behera, "Multiple F0 Estimation and Source Clustering of Polyphonic Music Audio Using PLCA and HMRFs," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, pp. 278-287, Feb. 2015. doi: 10.1109/TASLP.2014.2387388.
  • F.J. Cañadas Quesada, N. Ruiz Reyes, P. Vera Candeas, J.J. Carabias & S. Maldonado, "A Multiple-F0 Estimation Approach Based on Gaussian Spectral Modelling for Polyphonic Music Transcription", Journal of New Music Research, vol. 39, no. 1, pp. 93-107, 2010. DOI: 10.1080/09298211003695579.
  • Dorian Cazau, Yuancheng Wang, Marc Chemillier & Olivier Adam, “An automatic music transcription system dedicated to the repertoires of the marovany zither”, Journal of New Music Research, vol. 45, no. 4, pp. 343-360, 2016. DOI: 10.1080/09298215.2016.1233285.
  • Jose J. Valero-Mas, Emmanouil Benetos & José M. Iñesta, “A supervised classification approach for note tracking in polyphonic piano transcription”, Journal of New Music Research, vol. 47, no. 3, pp. 249-263, 2018. DOI: 10.1080/09298215.2018.1451546
  • Jiayin Sun & Hongyan Wang, “A Cognitive Method for Musicology Based Melody Transcription”, International Journal of Computational Intelligence Systems, vol. 8, no. 6, pp. 1165-1177, 2015. DOI: 10.1080/18756891.2015.1113749
  • Tiago Fernandes Tavares, Jayme Garcia Arnal Barbedo & Romis Attux, “Unsupervised note activity detection in NMF-based automatic transcription of piano music”, Journal of New Music Research, vol. 45, no. 2, pp. 118-123, 2016. DOI: 10.1080/09298215.2016.1177552
  • İsmail Arı, Umut Şimşekli, Ali Taylan Cemgil & Lale Akarun, "Randomized Matrix Decompositions and Exemplar Selection in Large Dictionaries for Polyphonic Piano Transcription", Journal of New Music Research, vol. 43, no. 3, pp. 255-265, 2014. DOI: 10.1080/09298215.2014.891628.
  • Willie Krige, Theo Herbst & Thomas Niesler, "Explicit Transition Modelling for Automatic Singing Transcription", Journal of New Music Research, vol. 37, no. 4, pp. 311-324, 2008. DOI: 10.1080/09298210902890299.
  • Paulus, J., Klapuri, A., "Drum Sound Detection in Polyphonic Music with Hidden Markov Models", EURASIP Journal on Audio, Speech, and Music Processing, vol. 2009, Article number: 497292, 14 December 2009. https://doi.org/10.1155/2009/497292.
  • Gao, X., Gupta, C. and Li, H, “Automatic lyrics transcription of polyphonic music with lyrics-chord multi-task learning”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 30, pp.2280-2294, 2022.
  • Wang, X., Tian, B., Yang, W., Xu, W. and Cheng, W., "MusicYOLO: A Vision-Based Framework for Automatic Singing Transcription", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 31, pp. 229-241, 2022.
  • Wang, Y., Jing, Y., Wei, W., Cazau, D., Adam, O. and Wang, Q., “PipaSet and TEAS: A Multimodal Dataset and Annotation Platform for Automatic Music Transcription and Expressive Analysis Dedicated to Chinese Traditional Plucked String Instrument Pipa”, IEEE Access, Vol. 10, pp.113850-113864, 2022.
  • Simonetta, F., Avanzini, F. and Ntalampiras, S., "A perceptual measure for evaluating the resynthesis of automatic music transcriptions", Multimedia Tools and Applications, Vol. 81, No. 22, pp. 32371-32391, 2022.
  • Wu, H., Marmoret, A. and Cohen, J.E., "Semi-Supervised Convolutive NMF for Automatic Piano Transcription", 2022.
  • Holzapfel, A., Benetos, E., Killick, A. and Widdess, R., "Humanities and engineering perspectives on music transcription", Digital Scholarship in the Humanities, Vol. 37, No. 3, pp. 747-764, 2022.