
Speaker-Independent Silent Speech Recognition from Flesh-Point Articulatory Movements Using an LSTM Neural Network

Overview
Date 2018 Oct 2
PMID 30271809
Citations 16
Abstract

Silent speech recognition (SSR) converts non-audio information, such as articulatory movements, into text. SSR has the potential to enable people who have undergone laryngectomy to communicate through natural spoken expression. Current SSR systems have largely relied on speaker-dependent recognition models, and the high degree of variability in articulatory patterns across speakers has been a barrier to developing effective speaker-independent approaches. Speaker-independent SSR approaches, however, are critical for reducing the amount of training data required from each speaker. In this paper, we investigate speaker-independent SSR from the movements of flesh points on the tongue and lips, using articulatory normalization methods that reduce inter-speaker variation. To minimize the physiological differences of the articulators across speakers, we propose Procrustes matching-based articulatory normalization, which removes locational, rotational, and scaling differences. To further normalize the articulatory data, we apply feature-space maximum likelihood linear regression (fMLLR) and i-vectors. We adopt a bidirectional long short-term memory recurrent neural network (BLSTM) as the articulatory model, which can effectively model articulatory movements with long-range articulatory history. A silent speech data set with flesh-point movements was collected with an electromagnetic articulograph (EMA) from twelve healthy and two laryngectomized English speakers. Experimental results showed the effectiveness of our speaker-independent SSR approaches on both healthy and laryngectomized speakers. In addition, the BLSTM outperformed a standard deep neural network; the best performance was obtained by the BLSTM with all three normalization approaches combined.
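
The Procrustes matching step described above has a standard closed-form solution. The NumPy sketch below aligns one speaker's 2-D flesh-point coordinates to a reference shape by removing translation (centering), scale (unit-norm rescaling), and rotation (the orthogonal Procrustes problem, solved via SVD). This is a minimal illustration of the general technique under stated assumptions, not the paper's implementation: the function name, the unit-norm scaling convention, and the choice of reference shape are assumptions.

```python
import numpy as np

def procrustes_normalize(points, reference):
    """Align 2-D flesh-point coordinates (n_points, 2) to a reference
    shape by removing translation, scale, and rotation differences.
    Hypothetical helper: a sketch of classical Procrustes matching,
    not the paper's exact procedure.
    """
    # Remove locational differences: center both shapes at the origin.
    p = points - points.mean(axis=0)
    r = reference - reference.mean(axis=0)

    # Remove scaling differences: rescale each shape to unit Frobenius norm.
    p = p / np.linalg.norm(p)
    r = r / np.linalg.norm(r)

    # Remove rotational differences: solve the orthogonal Procrustes
    # problem min_R ||p @ R - r|| via SVD of the cross-covariance.
    u, _, vt = np.linalg.svd(p.T @ r)
    # Constrain to a proper rotation (det = +1) so the shape is not mirrored.
    d = np.sign(np.linalg.det(u) * np.linalg.det(vt))
    rotation = u @ np.diag([1.0, d]) @ vt
    return p @ rotation
```

In a speaker-independent setting, one natural choice (an assumption here, not stated in the abstract) is to use the mean shape of the training speakers as `reference`, estimate the transform once per speaker, and apply it to every frame of that speaker's data.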
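
For the articulatory model, a bidirectional LSTM reads the normalized feature sequence in both directions, so each frame's output can draw on long-range context from both past and future frames. The PyTorch sketch below shows the overall shape of such a model; PyTorch itself, the class name, and all layer sizes are illustrative assumptions, since the abstract does not specify the toolkit or the architecture's dimensions.

```python
import torch
import torch.nn as nn

class ArticulatoryBLSTM(nn.Module):
    """Sketch of a BLSTM mapping articulatory feature frames to
    per-frame phonetic class scores. All sizes are illustrative."""
    def __init__(self, n_features, n_classes, hidden=256, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated: 2 * hidden.
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):          # x: (batch, time, n_features)
        h, _ = self.blstm(x)       # h: (batch, time, 2 * hidden)
        return self.out(h)         # per-frame class scores

# Example: 8 utterances of 100 frames, 24 articulatory features,
# 40 output classes (all numbers hypothetical).
model = ArticulatoryBLSTM(n_features=24, n_classes=40)
scores = model(torch.randn(8, 100, 24))   # -> (8, 100, 40)
```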

Citing Articles

Automated sentiment analysis of visually impaired students' audio feedback in virtual learning environments.

Elbourhamy D. PeerJ Comput Sci. 2024; 10:e2143.

PMID: 38983237 PMC: 11232573. DOI: 10.7717/peerj-cs.2143.


Inter-patient ECG heartbeat classification for arrhythmia classification: a new approach of multi-layer perceptron with weight capsule and sequence-to-sequence combination.

Zhou C, Li X, Feng F, Zhang J, Lyu H, Wu W. Front Physiol. 2023; 14:1247587.

PMID: 37841320 PMC: 10569428. DOI: 10.3389/fphys.2023.1247587.


Prediction of outpatients with conjunctivitis in Xinjiang based on LSTM and GRU models.

Wang Y, Yi X, Luo M, Wang Z, Qin L, Hu X. PLoS One. 2023; 18(9):e0290541.

PMID: 37733673 PMC: 10513229. DOI: 10.1371/journal.pone.0290541.


MagTrack: A Wearable Tongue Motion Tracking System for Silent Speech Interfaces.

Cao B, Ravi S, Sebkhi N, Bhavsar A, Inan O, Xu W. J Speech Lang Hear Res. 2023; 66(8S):3206-3221.

PMID: 37146629 PMC: 10555459. DOI: 10.1044/2023_JSLHR-22-00319.


Epidemiological characteristics, spatial clusters and monthly incidence prediction of hand, foot and mouth disease from 2017 to 2022 in Shanxi Province, China.

Ma Y, Xu S, Dong A, An J, Qin Y, Yang H. Epidemiol Infect. 2023; 151:e54.

PMID: 37039461 PMC: 10126901. DOI: 10.1017/S0950268823000389.

