JNACS ISSN: 2582-3817

Automatic Speaker Diarization using Deep LSTM in Audio Lecturing of e-Khool Platform

Abstract

Speaker diarization is the process of segmenting an input speech signal and grouping the segments into regions according to speaker identity. A main challenge in speaker diarization is improving the readability of the speech transcription. To address this challenge, this research proposes a speaker diarization method based on deep LSTM. First, pre-processing removes noise from the audio lectures of e-Khool users. Linear Predictive Coding (LPC) is then used to extract efficient features from the audio lectures. After feature extraction, voice activity detection (VAD) determines the presence or absence of the speaker in the audio lecture, and the speakers are then segmented using the extracted features. Finally, the feature vector is determined and the speakers in the audio lectures of the e-Khool users are clustered using the deep LSTM. The proposed method is evaluated using metrics such as sensitivity, accuracy, and specificity. Compared with existing speaker diarization methods, the proposed method obtained a minimum diarization error rate (DER) of 0.0623, a minimum false alarm rate of 0.0369, and a minimum distance of 2546 for varying frame length, and a minimum DER of 0.0923, a minimum false alarm rate of 0.0869, and a minimum distance of 1146 for varying Lambda.
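The front end of the pipeline described above (framing, VAD, LPC feature extraction) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not specify the VAD technique, the LPC order, or the frame length, so the energy-threshold VAD, the Levinson-Durbin LPC solver, and every function and parameter name here (`frame_signal`, `energy_vad`, `lpc`, `threshold`, `order`) are assumptions chosen for clarity. The deep LSTM clustering stage is not shown.

```python
def frame_signal(samples, frame_len, hop):
    """Split a list of samples into fixed-length frames (hypothetical helper)."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]


def energy_vad(frames, threshold):
    """Mark a frame as active when its mean energy exceeds a threshold.

    Assumption: a plain energy threshold stands in for the unspecified
    VAD technique mentioned in the abstract.
    """
    return [sum(x * x for x in f) / len(f) > threshold for f in frames]


def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] for one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """LPC coefficients a[1..order] via the Levinson-Durbin recursion.

    Prediction model: x[n] is approximated by sum_k a[k] * x[n - k].
    """
    r = autocorrelation(frame, order)
    if r[0] == 0.0:
        return [0.0] * order  # silent frame: nothing to predict
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
        if err <= 0.0:
            break  # prediction error exhausted; stop refining
    return a[1:]


# Toy input: 100 silent samples followed by a decaying signal x[n] = 0.9**n.
samples = [0.0] * 100 + [0.9 ** n for n in range(200)]
frames = frame_signal(samples, frame_len=100, hop=100)
active = energy_vad(frames, threshold=1e-4)

# Extract LPC features only from frames the VAD marks as active.
features = [lpc(f, order=2) for f, keep in zip(frames, active) if keep]
```

In a full system, the per-frame LPC vectors in `features` would be passed to the segmentation and clustering stages; here they only illustrate the shape of the data those stages would receive.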
