A Data-Centric Approach for Portuguese Speech Recognition: Language Model And Its Implications

Keywords:

automatic speech recognition, language model, Brazilian Portuguese, wav2vec2, KenLM

Abstract

Recent advances in Automatic Speech Recognition (ASR) have achieved transcription quality never seen before in the literature, both for languages with abundant data and studies, such as English, and for languages with more limited resources, such as Portuguese. The most recent advances address speech recognition with Transformer-based models, which can perform the recognition task directly on the raw signal, without manual feature extraction. Studies have shown that the transcription quality of these models can be further improved by incorporating language models into the decoding stage; however, the real impact of such language models remains unclear, especially for the Brazilian Portuguese scenario. It is also known that the quality of the training data is of paramount importance, yet few works in the literature address this issue. This work explores the impact of language models applied to Portuguese speech recognition, in terms of both data quality and computational performance, following a data-centric approach. We propose an approach to measure similarity between datasets and thereby assist decision-making during training. The approach indicates paths for advancing the state of the art in Portuguese speech recognition, showing that it is possible to reduce the size of the language model by 80% while still achieving error rates around 7.17% on the Common Voice dataset. The source code is available at https://github.com/joaoalvarenga/language-model-evaluation.
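The abstract reports word error rates and a similarity measure between datasets. As an illustration only (the paper's exact similarity metric is not specified in this abstract), the sketch below shows the standard word error rate computation via word-level Levenshtein distance, plus one plausible corpus-similarity measure: cosine similarity between word-frequency vectors. Both function names and the frequency-vector choice are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(ref)


def corpus_similarity(corpus_a: list[str], corpus_b: list[str]) -> float:
    """Cosine similarity between the word-frequency vectors of two corpora."""
    ca = Counter(w for sent in corpus_a for w in sent.split())
    cb = Counter(w for sent in corpus_b for w in sent.split())
    dot = sum(ca[w] * cb[w] for w in ca if w in cb)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0
```

For example, `word_error_rate("o gato preto", "o gato branco")` yields 1/3 (one substitution over three reference words), and two corpora with identical word distributions have similarity 1.0.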

Author Biographies

João Paulo Reis Alvarenga, Universidade Federal de Ouro Preto

João Alvarenga holds a bachelor's degree in Computer Science, obtained in 2019 from the Federal University of Ouro Preto (UFOP). He is currently a Team Lead and Senior Machine Learning Engineer at Stilingue Inteligência Artificial and a master's student in the Graduate Program in Computer Science at UFOP. His research interests include deep learning, natural language processing, and speech recognition.

Luiz Henrique de Campos Merschmann, Universidade Federal de Lavras

Luiz H. C. Merschmann is a Professor in the Department of Applied Computing at the Federal University of Lavras, Brazil. He received a BSc degree in Mining Engineering from the Federal University of Ouro Preto, Brazil, an MSc degree in Production Engineering from the Federal University of Rio de Janeiro, Brazil, and a PhD degree in Computer Science from Fluminense Federal University, Brazil. In 2012, he carried out postdoctoral research at the University of Kent, UK. He has published several peer-reviewed papers in journals and conference proceedings. His research interests include data mining, machine learning, artificial intelligence, and natural language processing.

Eduardo José da Silva Luz, Universidade Federal de Ouro Preto

Eduardo Luz holds a bachelor's degree in Electrical Engineering from the Federal University of Minas Gerais (2005), and a Ph.D. in Computer Science from the Federal University of Ouro Preto (2019). He is an Adjunct Professor at the Department of Computing (DECOM) at the Federal University of Ouro Preto and a permanent member of the Graduate Program in Computer Science. His research interests include pattern recognition, machine learning, computer vision, and embedded systems.

Published

2023-03-23

How to Cite

Alvarenga, J. P. R., Merschmann, L. H. de C., & Luz, E. J. da S. (2023). A Data-Centric Approach for Portuguese Speech Recognition: Language Model And Its Implications. IEEE Latin America Transactions, 21(4), 546–556. Retrieved from https://latamt.ieeer9.org/index.php/transactions/article/view/7464