Development of an Automatic Online Transcriber for Online Classes Using Machine Learning

Abstract 

Speech processing has interested scientists and engineers for decades, with researchers aiming to create computer systems capable of understanding and/or mimicking human speech. Breakthroughs in the previous decade introduced deep learning models for analysing audio streams, enabling technologies such as Siri and Alexa that respond to voice commands.

In recent times, online meetings and classes have become part of day-to-day life, and virtual participation may become the norm rather than the exception in the future. The quality of these meetings can be improved by incorporating a real-time automatic transcription engine that produces written text from the audio.

This proposal presents an automatic online transcriber for online classes using machine learning. Chapter one gives a brief background of the study, and outlines the project’s objectives, justification and scope. Chapter two goes in-depth into speech recognition and its history, and offers a brief review of relevant literature. Chapter three explains the project’s methodology, and also includes the project’s budget and time frame. Chapter four concludes the proposal by giving the expected result.

Speech processing

What is speech processing

Speech processing is “a discipline of computer science that deals with designing computer systems that recognize spoken words”. The sound is sampled to convert it from a continuous signal into a discrete signal that can be analysed by a computer and used to provide valuable data about the speaker(s) and/or their environment.
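The sampling step can be illustrated with a minimal sketch. It assumes a pure sine tone as a stand-in for speech, a 16 kHz sample rate and 16-bit quantization; these values and the function name `sample_and_quantize` are illustrative choices, not parameters taken from the project.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate_hz=16000, bits=16):
    """Sample a sine tone at regular intervals and quantize it to signed integers."""
    max_level = 2 ** (bits - 1) - 1                 # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate_hz)  # number of discrete samples
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                      # time of the n-th sample (seconds)
        amplitude = math.sin(2 * math.pi * freq_hz * t)  # continuous value in [-1, 1]
        samples.append(round(amplitude * max_level))     # nearest integer level
    return samples

# Hypothetical input: 10 ms of a 440 Hz tone.
tone = sample_and_quantize(freq_hz=440, duration_s=0.01)
print(len(tone), tone[:4])
```

The resulting list of integers is the kind of discrete signal that later stages operate on.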

Speech processing is a heavily researched topic in artificial intelligence (AI) with researchers around the world seeking to apply machine learning (ML) to analyse the data for processing. There are various applications of machine learning in speech processing, such as:

Speech Recognition

This is the ability of computers to identify and translate spoken language into text. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT).

It has a wide range of applications, from controlling the menu in your car or video game console, to call centres with interactive voice response systems or hands-free computing.

Speaker Identification and Verification

Speaker verification is the process whereby the system tries to ascertain whether a user is who they claim to be, while speaker identification is the process in which the system determines whether a speech sample was spoken by any one of the speakers from a predefined set of speakers. Here, the question “Who is speaking?” is answered.
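The difference between the two tasks can be sketched as follows. This is only an illustration: it assumes each speaker is already represented by a fixed-length feature vector (the three-element vectors, the names and the 0.8 threshold are hypothetical) and compares vectors with cosine similarity, which is one common choice among several.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(sample, claimed_profile, threshold=0.8):
    """Verification: does the sample match the ONE profile the user claims to be?"""
    return cosine_similarity(sample, claimed_profile) >= threshold

def identify(sample, enrolled):
    """Identification: which speaker in the predefined set is the closest match?"""
    return max(enrolled, key=lambda name: cosine_similarity(sample, enrolled[name]))

# Hypothetical enrolled speakers and an incoming sample.
enrolled = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.25]

print(verify(sample, enrolled["alice"]))  # one-to-one comparison
print(identify(sample, enrolled))         # one-to-many comparison
```

Verification is a one-to-one comparison against a claimed identity, whereas identification is a one-to-many search over the enrolled set.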

Audio Diarization

As defined in [5], audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can be speakers, music or background noises. When the audio sources are speakers, it is referred to as speaker diarization, and the question we seek to answer is “Who spoke when?” in an audio recording with an unknown number of speakers.
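The output of a diarization system can be pictured as a list of labelled time regions. The sketch below assumes such a list is already available (the segment times and speaker labels are invented for illustration) and shows how it answers “Who spoke when?”, including overlapping regions.

```python
from collections import defaultdict

# Hypothetical diarization output: (start_seconds, end_seconds, source_label).
segments = [
    (0.0, 4.0, "speaker_A"),
    (3.5, 7.0, "speaker_B"),   # overlaps with speaker_A between 3.5 s and 4.0 s
    (7.0, 9.0, "speaker_A"),
]

def who_spoke_at(segments, t):
    """Return every source that is active at time t (possibly more than one)."""
    return sorted(label for start, end, label in segments if start <= t < end)

def talk_time(segments):
    """Total duration attributed to each source."""
    totals = defaultdict(float)
    for start, end, label in segments:
        totals[label] += end - start
    return dict(totals)

print(who_spoke_at(segments, 3.7))
print(talk_time(segments))
```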

Speech Emotion Recognition (SER)

This is the process of recognising the emotions of a speaker from cues such as their tone and gestures. There are several identifiers that human beings use to analyse speech, which would be difficult for computers to replicate. As such, in machine learning (ML) the emotional states can be characterised in a continuous 2- or 3-dimensional space, for example by characterising on each coordinate the valence (the positive or negative nature of the emotion), degree of activation (quantifying the excitement of the speaker) and dominance (the degree of submissiveness or strength of the speaker). This is a relatively new field.

A brief history of machine learning in speech processing

In 1952, Bell Labs of the United States made the first complete speech recognition device for a single speaker, which could recognize 10 digits by matching filter bank output to hand-constructed templates. This led to the rapid development of the filter-based approach for speech recognition at the time. In 1956 at RCA Labs, Olson and Belar built a simple voice-activated typewriter that could recognise ten syllables, while Forgie and Forgie built a speaker-independent 10-vowel recognizer at MIT Lincoln Lab. At this stage in history, the speech recognition systems that were developed were only capable of recognising a single word at the very most.

In the 1960s, several Japanese laboratories demonstrated their capability of building special-purpose hardware to perform a speech recognition task. Most notable were the vowel recognizer of Suzuki and Nakata at the Radio Research Lab in Tokyo, the phoneme recognizer of Sakai and Doshita at Kyoto University, and the digit recognizer of NEC Laboratories. The work of Sakai and Doshita involved the first use of a speech segmenter for analysis and recognition of speech in different portions of the input utterance. In contrast, an isolated digit recognizer implicitly assumed that the unknown utterance contained a complete digit (and no other speech sounds or words) and thus did not need an explicit “segmenter”. Kyoto University’s work could be considered a precursor to a continuous speech recognition system.

In the early 1970s, Atal and Itakura independently formulated the fundamental concepts of Linear Predictive Coding (LPC), which greatly simplified the estimation of the vocal tract response from speech waveforms. The basic ideas of applying fundamental pattern recognition technology to speech recognition, based on LPC methods, were then proposed by Itakura, Rabiner and Levinson and others later on in the decade. This led to the rapid development of speech recognition for speaker-specific, isolated words and small vocabulary tasks.

Meanwhile, the Advanced Research Projects Agency (ARPA) in America began to fund its Speech Understanding Research (SUR) program. One system of note that was a product of the ARPA program was Carnegie Mellon University’s “Harpy”, which was able to recognize speech using a vocabulary of 1,011 words reasonably accurately.

IBM developed the first transcription engine: a voice-activated typewriter that was used to convert spoken words to text that could be seen on a display or typed on paper. This system was speaker-dependent, i.e. each user had to train the typewriter on their own voice. The focus of the system was the size of the recognition vocabulary and the structure of the language model (the representation of the grammar or syntax of the task), which was heavily reliant on statistical methods.

On the other hand, AT&T Bell Laboratories sought to develop speaker-independent automated telecommunication services for voice dialling, voice commands, and control for routing calls. This led to the creation of a range of speech clustering algorithms for creating word and sound reference patterns, and eventually statistical models, which could be used across a wide range of talkers and accents. They eventually developed the acoustic model, which is the spectral representation of sounds or words.

In the 1980s and 1990s, there was a shift from template-based approaches to a rigorous statistical modelling framework. The Hidden Markov Model (HMM) came to be preferred for speech recognition systems after the theory was published in 1980. To overcome its limitations, it was optimised by use of mixture densities, e.g. Gaussian and Cauchy, then later merged with the finite state model that arose from ARPA research. The Gaussian Mixture Model (GMM) was designed by Kai-Fu Lee of Carnegie Mellon University to optimise the HMM. The HMM-GMM combination was the dominant method of speech processing until the advent of deep learning networks. In the 1990s, the pattern recognition approach was developed on Bayes’ theorem, leading to the development of Bayesian networks.

Another technology that was (re)introduced in the late 1980s was the idea of artificial neural networks (ANN) [11]. Neural networks were first introduced in the 1950s, but did not produce any notable results at first. The creation of a parallel distributed processing (PDP) model, which was a dense interconnection of simple computational elements with a corresponding “training” method called error back-propagation (BP), revived interest in neural networks. The pattern recognition problem was addressed by converting it into a spatial recognition issue, which was dealt with by a “multi-layer, feed-forward neural network architecture” that was adapted to match the temporal structure of speech. These models require a target output to be defined, which makes them difficult to use with continuous speech, although they are very accurate for isolated words. To try to overcome this bottleneck, Bourlard et al. suggested combining HMMs with ANNs to create a hybrid system. An iterative training method was proposed in which HMMs were used to initially segment the acoustic data, and then the Viterbi algorithm, along with the newly trained networks, was used as a probability estimator. This provided more reliable segmented data, which led to more accurate results.

In 2006, Hinton et al. proposed a new method for ASR: deep learning. Deep learning refers to “a class of ML techniques, where many layers of information processing stages in hierarchical architectures are exploited for unsupervised feature learning and for pattern classification”. This method proved to have a higher accuracy and fewer errors than the classical HMM-GMM models, and was thus readily adopted by the community for speech processing, with major corporations such as Microsoft, Google, Apple and IBM investing heavily in it. Today, deep learning is the most popular method of speech processing, with the use of complex models such as autoencoders (AE), convolutional neural networks (CNN), deep neural networks (DNN), restricted Boltzmann machines (RBM) and recurrent neural networks (RNN).

A brief look at ASR and Diarization

Automatic Speech Recognition

As defined beforehand, ASR is the ability of computers to identify and translate spoken language into text. It can also be defined as “graphical representations of frequencies emitted as a function of time” [30].

From the history of speech processing, three vital types of speech are noted:

1. Isolated words 

Here, there are gaps of silence between the words when said. This is the easiest system to create.

2. Connected words

In this category, the speaker says a phrase that is within the system’s ‘database’, or the system allows words to run together with minimal pauses between them. The language model here is larger than that of isolated words.

3. Continuous speech 

This occurs when the speaker is left to express themselves freely without forced silences or specific sets of words. This tends to be the most difficult system to implement since the computer needs to determine what the speaker has uttered.

The general architecture of the ASR system is described below:

1. Acoustic preprocessing

This unit converts the uttered speech into the training/test data. The tasks carried out in this stage include:

  1. Anti-aliasing filtering to remove out-of-band noise and frequency components above half the sampling rate before sampling.
  2. Sampling and quantization of the continuous speech signal to convert it to a digital signal.
  3. Computation of the power spectrum and data normalization.
  4. Extraction of features such as the Mel-Frequency Cepstral Coefficients (MFCCs) used in CNNs and some HMMs.

This unit converts the speech signal into speech frames and generates feature vectors, which describe the input speech signal.
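The framing and power-spectrum steps can be sketched as below. This is a simplified illustration under stated assumptions: the input is a synthetic 1 kHz tone sampled at 8 kHz, frames are 25 ms long with a 10 ms hop, a Hamming window is applied, and a direct (slow) DFT is used so that only the standard library is needed. The Mel filterbank and cepstral steps that would turn the spectrum into MFCCs are deliberately left out, and all function names are illustrative.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop_len):
    """Split the signal into overlapping fixed-length frames."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]

def hamming(frame):
    """Taper the frame edges to reduce spectral leakage."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def power_spectrum(frame):
    """Power at each non-negative frequency bin, via a direct DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

sample_rate = 8000                       # assumed sample rate in Hz
frame_len, hop_len = 200, 80             # 25 ms frames, 10 ms hop at 8 kHz
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(800)]

frames = frame_signal(signal, frame_len, hop_len)
features = [power_spectrum(hamming(f)) for f in frames]   # one vector per frame

# The strongest bin of the first frame should sit at the tone frequency.
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
peak_hz = peak_bin * sample_rate / frame_len
print(len(frames), len(features[0]), peak_hz)
```

Each entry of `features` is one feature vector describing a single frame; a practical system would use an FFT routine rather than the direct DFT shown here.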

2. Acoustic model

This model is used to detect the phonemes and give the decoder the probable phonemes in the speech signal. The acoustic content from the speech frame obtained during the audio preprocessing is compared against the phoneme probabilities within the model to determine the likely spoken phonemes. When combined with the pronunciation model, it determines the probable spoken word.

3. Lexical/pronunciation model

This model contains various words that are described as phoneme combinations. The data in this model is combined with the acoustic model output to determine the probable spoken word.
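How the acoustic model output and the pronunciation model combine can be sketched with a toy example. It assumes the acoustic model has already produced one phoneme probability table per phoneme position (a strong simplification, since real systems must also align frames to phonemes), and the three-word lexicon, the phoneme symbols and the probabilities are all invented for illustration.

```python
# Hypothetical pronunciation lexicon: word -> phoneme sequence.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "at": ["AE", "T"],
}

def word_likelihood(word, phoneme_posteriors):
    """Multiply the acoustic probabilities of the word's phonemes, position by position."""
    phonemes = LEXICON[word]
    if len(phonemes) != len(phoneme_posteriors):
        return 0.0                      # pronunciation does not fit the observation
    likelihood = 1.0
    for phoneme, posterior in zip(phonemes, phoneme_posteriors):
        likelihood *= posterior.get(phoneme, 0.0)
    return likelihood

# Hypothetical acoustic model output for three phoneme positions.
posteriors = [
    {"K": 0.9, "G": 0.1},
    {"AE": 0.8, "EH": 0.2},
    {"T": 0.6, "P": 0.4},
]

scores = {word: word_likelihood(word, posteriors) for word in LEXICON}
best_word = max(scores, key=scores.get)
print(scores, best_word)
```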

4. Language model

This model is used to find the correct word sequence by predicting the likelihood of the next word’s appearance depending on the previous words.
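One simple way to predict the next word from the previous one is a bigram model. The sketch below is an illustration rather than the model used in this project: it assumes a tiny invented training corpus, conditions only on the single preceding word, and uses add-one smoothing so that unseen word pairs still receive a small probability.

```python
from collections import Counter

def train_bigram(sentences):
    """Count how often each word follows each context word."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(words[1:])               # every token that can be predicted
        context_counts.update(words[:-1])     # every token that precedes another
        bigram_counts.update(zip(words, words[1:]))
    return context_counts, bigram_counts, vocab

def next_word_prob(prev, word, context_counts, bigram_counts, vocab):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

# Hypothetical training corpus.
corpus = ["the class starts now", "the class ends soon", "the lecture starts now"]
context_counts, bigram_counts, vocab = train_bigram(corpus)

p_class = next_word_prob("the", "class", context_counts, bigram_counts, vocab)
p_lecture = next_word_prob("the", "lecture", context_counts, bigram_counts, vocab)
print(p_class, p_lecture)
```

In this corpus “class” follows “the” more often than “lecture” does, so it receives the higher probability.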

5. Decoder

This unit combines the output of the acoustic, pronunciation and language models to provide the most probable text transcription for the uttered speech.
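The decoder’s final choice can be sketched as a score combination. This example assumes a short list of candidate transcriptions has already been produced, each with an acoustic probability (from the acoustic and pronunciation models) and a language model probability; the candidates, the probabilities and the `lm_weight` parameter are hypothetical. Practical decoders search over a very large hypothesis space rather than enumerating candidates as shown here.

```python
import math

def decode(hypotheses, lm_weight=1.0):
    """Return the text whose combined log-probability is highest.

    hypotheses: list of (text, acoustic_prob, language_prob) tuples.
    """
    def score(hypothesis):
        _, acoustic_prob, language_prob = hypothesis
        # Log domain: products of probabilities become sums, avoiding underflow.
        return math.log(acoustic_prob) + lm_weight * math.log(language_prob)
    return max(hypotheses, key=score)[0]

# Hypothetical candidates that sound alike but differ in plausibility.
hypotheses = [
    ("recognize speech", 0.40, 0.020),
    ("wreck a nice beach", 0.45, 0.001),
]

print(decode(hypotheses))                 # language model tips the balance
print(decode(hypotheses, lm_weight=0.0))  # acoustic evidence alone
```

With the language model switched off, the acoustically stronger but less plausible candidate wins, which shows why the decoder needs all of the models.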

REFERENCES

  1. Merriam-Webster, “Communication,” Merriam-Webster.com Dictionary, 2 June 2020. [Online]. Available: https://www.merriam-webster.com/dictionary/communication. [Accessed 5 June 2020].
  2. Verizon, “Meetings in America,” Verizon, [Online]. Available: https://e-meetings.verizonbusiness.com/global/en/meetingsinamerica/uswhitepaper.php#SUMMARY. [Accessed 27 June 2020].
  3. P. A. Abhang, B. W. Gawali and S. C. Mehrotra, Introduction to EEG- and Speech-Based Emotion Recognition, London: Academic Press, 2016.
  4. M. A. Pathak, “Privacy-Preserving Machine Learning,” 26 April 2012. [Online]. Available: www.lti.cs.cmu.edu. [Accessed 17 May 2020].
  5. S. E. Tranter and D. A. Reynolds, “An overview of automatic speaker diarization systems,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1557-1565, 2006.
  6. X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland and O. Vinyals, “Speaker Diarization: A Review of Recent Research,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 2, pp. 356-370, 2012.
  7. S. Casale, A. Russo, G. Scebba and S. Serrano, “Speech Emotion Classification using Machine Learning Algorithms,” The IEEE International Conference on Semantic Computing, vol. 43, no. 08, pp. 158-165, 2008.
  8. K. Davis, R. Biddulph and S. Balashek, “Automatic recognition of spoken digits,” Journal of the Acoustical Society of America, vol. 24, pp. 637-642, 1952.
  9. H. Olson and H. Belar, “Phonetic Typewriter,” Journal of the Acoustical Society of America, vol. 28, pp. 1072-1081, 1956.
  10. J. Forgie and C. Forgie, “Results obtained from a vowel recognition computer program,” Journal of the Acoustical Society of America, vol. 31, pp. 1480-1489, 1959.
  11. B. Juang and L. Rabiner, “Automatic Speech Recognition – A Brief History of the Technology,” 08 October 2004. [Online]. [Accessed 3 June 2020].
  12. J. Suzuki and K. Nakata, “Recognition of Japanese vowels—Preliminary to the recognition of speech,” Japanese Radio Research Laboratory, vol. 37, pp. 193-212, 1961.
  13. J. Sakai and S. Doshita, “The Phonetic Typewriter,” in Information Processing 1962, IFIP Congress, Munich, 1962.
  14. K. Nagata, Y. Kato and S. Chiba, “Spoken Digit Recognizer for Japanese Language,” NEC Research and Development, vol. 6, 1963.
  15. B. S. Atal and S. L. Hanauer, “Speech Analysis and Synthesis by Linear Prediction of the,” Journal of the Acoustical Society of America, vol. 50, no. 2, pp. 637-655, 1971.
  16. F. Itakura and S. Saito, “Statistical Method for Estimation of Speech Spectral Density and Formant Frequencies,” Electronics and Communication in Japan, vol. 53A, pp. 36-43, 1970.
  17. F. Itakura, “Minimum Prediction Residual Principle Applied to Speech Recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-23, pp. 57-72, 1975.
  18. L. R. Rabiner, S. E. Levinson, A. E. Rosenberg and J. G. Wilpon, “Speaker Independent Recognition of Isolated Words Using Clustering Techniques,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, pp. 336-349, 1979.
  19. B. Lowerre, “The HARPY Speech Understanding System (reprinted),” in Readings in Speech Recognition, Morgan Kaufmann Publishers, 1990, pp. 576-586.
  20. F. Jelinek, L. R. Bahl and R. L. Mercer, “Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech,” IEEE Transactions on Information Theory, vol. IT-21, pp. 250-256, 1975.
  21. J. D. Ferguson, “Hidden Markov Analysis: An Introduction,” in Hidden Markov Models for Speech, Princeton, Institute for Defense Analyses, 1980.
  22. K. Lee, “On large-vocabulary speaker-independent continuous speech recognition,” Speech Communication, vol. 7, pp. 375-379, 1988.
  23. W. S. McCulloch and W. H. Pitts, “A Logical Calculus of Ideas Immanent in Nervous Activity,” Bulletin of Mathematical Biology, vol. 5, pp. 115-133, 1943.
  24. J. Padmanabhan and M. J. J. Premkumar, “Machine Learning in Automatic Speech Recognition: A,” IETE Technical Review, vol. 32, no. 4, pp. 240-251, 2015.
  25. H. Bourlard and C. J. Wellekens, “Links between Markov Models and Multilayer Perceptrons,” in Advances in Neural Information Processing, San Mateo, Morgan Kaufmann, 1989, pp. 502-510.
  26. N. Morgan and H. Bourlard, “Continuous Speech Recognition using Multilayer Perceptrons with Hidden Markov Models,” in Proceedings of the IEEE International Conference ASSP, Albuquerque, 1990.
  27. G. E. Hinton, S. Osindero and Y. Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation, vol. 18, pp. 1527-1554, 2006.
  28. I. Gavat and D. Militaru, “New Trends in Machine Learning for Speech Recognition,” in SISOM & ACOUSTICS, Bucharest, 2015.
  29. S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Qadir and B. W. Schuller, “Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends,” 2 January 2020. [Online]. Available: https://www.researchgate.net/publication/338355547. [Accessed 30 May 2020].
  30. S. Benkerzaz, Y. Elmir and A. Dennai, “A Study of Automatic Speech Recognition,” Journal of Information Technology Review, vol. 10, no. 3, pp. 77-85, 2019.
  31. X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland and O. Vinyals, “Speaker diarization: A review of recent research,” 18 September 2012. [Online]. Available: https://hal.archives-ouvertes.fr/hal-00733397. [Accessed 13 June 2020].
  32. T. Tripathy, “Acoustic Beamforming,” 30 March 2017. [Online]. Available: https://www.researchgate.net/publication/315695379. [Accessed 28 June 2020].
  33. A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates and A. Y. Ng, “Deep Speech: Scaling up end-to-end speech recognition,” 19 December 2014. [Online]. Available: https://github.com/mozilla/DeepSpeech. [Accessed 21 June 2020].
  34. M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean and G. Hinton, “On Rectified Linear Units for Speech Processing,” [Online]. [Accessed 15 May 2020].

Cite this page

Development of an automatic online transcriber for online classes using machine learning. (2021, Oct 12). Retrieved October 26, 2021 , from
https://studydriver.com/development-of-an-automatic-online-transcriber-for-online-classes-using-machine-learning/
