A METHOD FOR DETERMINING FORMANT FREQUENCIES USING SPECTRAL DECOMPOSITION OF THE SPEECH SIGNAL

Authors

DOI:

https://doi.org/10.17721/ISTS.2023.1.51-60

Keywords:

speech signal, formant frequencies, spectral decomposition, computational algorithm, wavelet analysis

Abstract

Formants are among the main features used in speaker identification systems, and the accuracy with which they are determined underpins the efficiency of such systems. Improving existing speech recognition systems will significantly simplify human-computer interaction where classic interfaces cannot be used, and will make such work more comfortable and efficient. Research on this topic is needed because existing systems give unsatisfactory results at low signal-to-noise ratios, their results depend on human involvement, and they are slow. Four main formant trackers were used for comparison with the proposed method: PRAAT, SNACK, ASSP and DEEP. A number of studies compare formant trackers, but none of them singles out one tracker as the most efficient. Formant extraction is complicated by several problems: formants change dynamically during speech, spectral peaks can lie close together on the spectrogram, and the peaks of the formant maxima are difficult to determine correctly. Locating formants on a spectrogram of the vocal signal is fairly easy for a human, but automating this process is difficult. It is therefore proposed to extract formant frequencies in several stages. A review of approaches to determining formant frequencies resulted in an algorithm consisting of nine stages. The vocal signal is segmented into vocalized fragments and pauses by estimating changes in fractal dimension. The spectrum of the vocal signal is obtained using a complex Morlet wavelet based on the Gaussian window function.
Each of the PRAAT, SNACK, ASSP and DEEP trackers was configured with the set of default parameters provided by its developers, and these settings were used for the comparison. In the study, the trackers performed segmentation into vocalized fragments and pauses independently, using the VTR-TIMIT dataset. The comparative analysis showed that the proposed method determines formant frequencies with fairly high accuracy in comparison with existing formant trackers.
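The spectral stage mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' nine-stage algorithm: it only shows one plausible way to obtain a time-averaged spectrum with a complex Morlet wavelet (a complex exponential under a Gaussian window) and to pick spectral peaks as formant candidates. The function names `morlet_spectrum` and `pick_peaks`, and the parameters `n_cycles` and `rel_threshold`, are hypothetical and chosen for illustration only.

```python
import numpy as np


def morlet_spectrum(signal, fs, freqs, n_cycles=12.0):
    """Time-averaged magnitude of a complex Morlet wavelet transform.

    Illustrative assumption: each analysis frequency uses a Gaussian
    window spanning roughly `n_cycles` periods of that frequency.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.empty(len(freqs))
    for k, f in enumerate(freqs):
        sigma = n_cycles / (2.0 * np.pi * f)      # Gaussian width, seconds
        half = int(np.ceil(4.0 * sigma * fs))     # truncate at +/- 4 sigma
        t = np.arange(-half, half + 1) / fs
        # Complex Morlet wavelet: complex exponential times Gaussian window.
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2.0 * sigma**2))
        wavelet /= np.sum(np.abs(wavelet))        # equal gain at every scale
        coeffs = np.convolve(signal, wavelet, mode="same")
        spectrum[k] = np.mean(np.abs(coeffs))
    return spectrum


def pick_peaks(freqs, spectrum, rel_threshold=0.3):
    """Return frequencies of local maxima above a fraction of the global maximum."""
    s = np.asarray(spectrum, dtype=float)
    mid = s[1:-1]
    is_peak = (mid > s[:-2]) & (mid >= s[2:]) & (mid >= rel_threshold * s.max())
    return np.asarray(freqs)[1:-1][is_peak]
```

As a usage example, the sketch can be applied to a synthetic signal made of two sinusoids (standing in for two resonances) sampled at 8 kHz, with analysis frequencies from 200 Hz to 3000 Hz in 10 Hz steps; `pick_peaks` then returns the candidate frequencies. Real speech would additionally require the segmentation and tracking stages described in the abstract, which this sketch does not cover.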

Published

2023-03-29

Issue

Section

Computer science and information technology

How to Cite

A METHOD FOR DETERMINING FORMANT FREQUENCIES USING SPECTRAL DECOMPOSITION OF THE SPEECH SIGNAL. (2023). Information Systems and Technologies Security, 1(6), 51-60. https://doi.org/10.17721/ISTS.2023.1.51-60