The field of speech processing has occupied scientists and engineers for decades, with researchers aiming to create computer systems capable of understanding and/or mimicking human speech. Breakthroughs in the previous decade introduced the use of deep learning models to analyse audio streams, leading to technologies such as Siri and Alexa that can respond to voice commands.
In recent times, online meetings and classes have become part of our day-to-day lives, and they offer a chance to make virtual participation the norm, rather than the exception, in the future. The quality of these meetings can be improved by incorporating a real-time automatic transcription engine that produces written text from the audio.
This proposal presents an automatic online transcriber for online classes that uses machine learning. Chapter one gives a brief background of the study, and outlines the project’s objectives, justification and scope. Chapter two goes in-depth into speech recognition and its history, and offers a brief review of relevant literature. Chapter three explains the project’s methodology, and also includes the project’s budget and time frame. Chapter four concludes the proposal by giving the expected results.
Speech processing
What is speech processing
Speech processing is “a discipline of computer science that deals with designing computer systems that recognize spoken words”. The sound is sampled to convert it from a continuous signal into a discrete signal that can be analysed by a computer and used to provide valuable data about the speaker(s) and/or their environment.
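The sampling step described above can be illustrated with a minimal Python sketch. It is not part of the proposed system: the function names, the 440 Hz test tone and the 16 kHz sample rate are assumptions chosen purely for illustration. A continuous signal is modelled as a function of time and evaluated at evenly spaced instants to give a discrete sequence.

```python
import math


def sample_signal(signal, sample_rate_hz, duration_s):
    """Evaluate a continuous-time signal at evenly spaced sampling instants."""
    num_samples = int(round(sample_rate_hz * duration_s))
    # Sample n is taken at time t = n / sample_rate_hz seconds.
    return [signal(n / sample_rate_hz) for n in range(num_samples)]


def tone(t, freq_hz=440.0):
    """Hypothetical stand-in for a speech waveform: a pure sine tone."""
    return math.sin(2.0 * math.pi * freq_hz * t)


# 10 ms of audio at an assumed 16 kHz sample rate gives 160 discrete samples.
samples = sample_signal(tone, sample_rate_hz=16000, duration_s=0.01)
```

A higher sample rate gives more values per second and a closer approximation of the original waveform, at the cost of more data to process.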
Speech processing is a heavily researched topic in artificial intelligence (AI), with researchers around the world seeking to apply machine learning (ML) to the analysis of speech data. Machine learning has various applications in speech processing, such as:
Speech recognition is the ability of computers to identify and translate spoken language into text. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT).
It has a wide range of applications, from controlling the menu in a car or video game console, to call centres with interactive voice response systems and hands-free computing.
Speaker verification is the process whereby the system tries to ascertain whether a user is who they claim to be, while speaker identification is the process in which the system determines whether a speech sample was spoken by any one of the speakers from a predefined set of speakers. Here, the question “Who is speaking?” is answered.
Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can be speakers, music or background noises. When the audio sources are speakers, it is referred to as speaker diarization, and the question we seek to answer is “Who spoke when?” in an audio recording with an unknown number of speakers.
Emotion recognition is the process of recognising the emotional state of the speaker from cues such as their tone and gestures. There are several identifiers that human beings use to analyse speech, which would be difficult for computers to replicate. As such, in machine learning (ML) the emotional states can be characterised in a continuous 2- or 3-dimensional space, for example with coordinates for valence (the positive or negative nature of the emotion), degree of activation (quantifying the excitement of the speaker) and dominance (the degree of submissiveness or strength of the speaker). This is a relatively new field.
A brief history of machine learning in speech processing
In 1952, Bell Labs of the United States made the first complete speech recognition device for a single speaker, which could recognize 10 digits by matching filter bank output to hand-constructed templates. This led to the rapid development of the filter-based approach for speech recognition at the time. In 1956 at RCA Labs, Olson and Belar built a simple voice-activated typewriter that could recognise ten syllables, and in 1959 Forgie and Forgie built a speaker-independent 10-vowel recognizer at MIT Lincoln Lab. At this stage in history, the speech recognition systems that were developed were only capable of recognising a single word at the very most.
In the 1960s, several Japanese laboratories demonstrated their capability of building special purpose hardware to perform a speech recognition task. Most notable were the vowel recognizer of Suzuki and Nakata at the Radio Research Lab in Tokyo, the phoneme recognizer of Sakai and Doshita at Kyoto University, and the digit recognizer of NEC Laboratories. The work of Sakai and Doshita involved the first use of a speech segmenter for analysis and recognition of speech in different portions of the input utterance. In contrast, an isolated digit recognizer implicitly assumed that the unknown utterance contained a complete digit (and no other speech sounds or words) and thus did not need an explicit “segmenter”. Kyoto University’s work could be considered a precursor to a continuous speech recognition system.
In the early 1970s, Atal and Itakura independently formulated the fundamental concepts of Linear Predictive Coding (LPC), which greatly simplified the estimation of the vocal tract response from speech waveforms. The basic ideas of applying fundamental pattern recognition technology to speech recognition, based on LPC methods, were then proposed by Itakura, Rabiner and Levinson and others later on in the decade. This led to the rapid development of speech recognition for speaker-specific, isolated words and small vocabulary tasks.
Meanwhile, the Advanced Research Projects Agency (ARPA) in the United States began to fund its Speech Understanding Research (SUR) program. One system of note that was a product of the ARPA program was Carnegie Mellon University’s “Harpy”, which was able to recognize speech using a vocabulary of 1,011 words with reasonable accuracy.
IBM developed the first transcription engine: a voice-activated typewriter that was used to convert spoken words to text that could be seen on a display or typed on paper. This system was speaker-dependent, i.e. the user had to train the typewriter. The focus of the system was the size of the recognition vocabulary and the structure of the language model (the representation of the grammar or syntax of the task), which was heavily reliant on statistical methods.
On the other hand, AT&T Bell Laboratories sought to develop speaker-independent automated telecommunication services for voice dialling, voice commands, and call routing. This led to the creation of a range of speech clustering algorithms for creating word and sound reference patterns, and eventually statistical models, which could be used across a wide range of talkers and accents. They eventually developed the acoustic model, which is the spectral representation of sounds or words.
In the 1980s and 1990s, there was a shift from template-based approaches to a rigorous statistical modelling framework. The Hidden Markov Model (HMM) came to be preferred for speech recognition systems after the theory was published in 1980. To overcome its limitations, it was optimised by use of mixture densities, e.g. Gaussian and Cauchy, then later merged with the finite state model that arose from ARPA research. The Gaussian Mixture Model (GMM) was designed by Kai-Fu Lee of Carnegie Mellon University to optimise the HMM. The HMM-GMM combination was the dominant method of speech processing until the advent of deep learning networks. In the 1990s, the pattern recognition approach was developed on Bayes’ theorem, leading to the development of Bayesian networks.
Another technology that was (re)introduced in the late 1980s was the idea of artificial neural networks (ANN). Neural networks were first introduced in the 1950s, but did not produce any notable results at first. The creation of a parallel distributed processing (PDP) model, a dense interconnection of simple computational elements with a corresponding “training” method called error back-propagation (BP), revived interest in neural networks. The pattern recognition problem was addressed by converting it into a spatial recognition problem, handled by a “multi-layer, feed-forward neural network architecture” that was adapted to match the temporal structure of speech. These models require a target output to be defined, which makes them difficult to use with continuous speech, although they are very accurate for isolated words. To try to overcome this bottleneck, Bourlard et al. suggested combining HMMs with ANNs to create a hybrid system. An iterative training method was proposed in which HMMs were first used to segment the acoustic data, after which the Viterbi algorithm was applied with the newly trained networks serving as probability estimators. This provided more reliable segmented data, which led to more accurate results.
In 2006, Hinton et al. proposed a new method for ASR: deep learning. Deep learning refers to “a class of ML techniques, where many layers of information processing stages in hierarchical architectures are exploited for unsupervised feature learning and for pattern classification”. This method proved to have a higher accuracy and fewer errors than the classical HMM-GMM models, and was thus readily adopted by the community for speech processing, with major corporations such as Microsoft, Google, Apple and IBM investing heavily in it. Today, deep learning is the most popular method of speech processing, with the use of complex models such as autoencoders (AE), convolutional neural networks (CNN), deep neural networks (DNN), restricted Boltzmann machines (RBM) and recurrent neural networks (RNN).
As defined beforehand, ASR is the ability of computers to identify and translate spoken language into text. It can also be defined as “graphical representations of frequencies emitted as a function of time”.
1. Isolated words
Here, the words are separated by gaps of silence when spoken. This is the easiest system to create.
2. Connected words
In this category, the speaker says a phrase that is within the system’s ‘database’, or the system allows a minimal run-on between words. The language model here is larger than that of isolated words.
3. Continuous speech
This occurs when the speaker is left to express themselves freely without forced silences or specific sets of words. This tends to be the most difficult system to implement since the computer needs to determine what the speaker has uttered.
1. Acoustic preprocessing
The unit is used to convert the uttered speech into the training/test data. The tasks carried out in the section include:
This unit converts the speech signal into speech frames and generates feature vectors, which describe the input speech signal.
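As a rough illustration of framing and feature extraction, the sketch below splits a list of samples into overlapping frames and computes a two-element feature vector per frame. The frame length, hop length and the choice of features (log energy and zero-crossing rate) are assumptions made for this example; practical systems typically use richer spectral features.

```python
import math


def split_into_frames(samples, frame_length, hop_length):
    """Cut the sample sequence into overlapping, fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frames.append(samples[start:start + frame_length])
    return frames


def frame_features(frame):
    """Return [log energy, zero-crossing rate] for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    # A small constant avoids log(0) on silent frames.
    log_energy = math.log(energy + 1e-10)
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    zero_crossing_rate = crossings / (len(frame) - 1)
    return [log_energy, zero_crossing_rate]


def extract_features(samples, frame_length=400, hop_length=160):
    """Assumed defaults: 25 ms frames every 10 ms at a 16 kHz sample rate."""
    frames = split_into_frames(samples, frame_length, hop_length)
    return [frame_features(frame) for frame in frames]
```

Each inner list is one feature vector, so a recording becomes a sequence of vectors that later stages can compare against model statistics.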
2. Acoustic model
This model is used to detect the phonemes and give the decoder the probable phonemes in the speech signal. The acoustic content from the speech frame obtained during the audio preprocessing is compared against the phoneme probabilities within the model to determine the likely spoken phonemes. When combined with the pronunciation model, it determines the probable spoken word.
3. Lexical/pronunciation model
This model contains various words that are described as phoneme combinations. The data in this model is combined with the acoustic model output to determine the probable spoken word.
4. Language model
This model is used to find the correct word sequence by predicting the likelihood of the next word’s appearance depending on the previous words.
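One simple way to predict the next word from the previous one is a bigram model. The sketch below is an assumption-laden illustration, not the model this proposal will use: the toy corpus, the function names and the add-alpha smoothing are all chosen for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" marks the start of a sentence.
        words = ["<s>"] + sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def next_word_probability(counts, previous, following, vocab_size, alpha=1.0):
    """Estimate P(following | previous) with add-alpha smoothing."""
    followers = counts.get(previous, Counter())
    total = sum(followers.values())
    return (followers[following] + alpha) / (total + alpha * vocab_size)


# Hypothetical toy corpus with a five-word vocabulary.
corpus = ["open the file", "open the door", "close the door"]
vocabulary = {word for sentence in corpus for word in sentence.split()}
model = train_bigram_model(corpus)
p_door = next_word_probability(model, "the", "door", len(vocabulary))
p_file = next_word_probability(model, "the", "file", len(vocabulary))
```

In this corpus "door" follows "the" twice and "file" once, so the model rates "the door" as the more likely continuation. Smoothing keeps unseen word pairs from receiving a probability of zero.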
5. Decoder
This unit combines the output of all three models to provide the most probable text transcription for the uttered speech.
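How the three model outputs might be combined can be sketched for a single isolated word. Everything in the example is hypothetical: the tiny lexicon, the phoneme symbols, the probabilities and the function names are illustrative assumptions, and a real decoder searches over whole word sequences rather than one word at a time.

```python
import math

# Hypothetical pronunciation lexicon: word -> phoneme sequence.
LEXICON = {"two": ["t", "uw"], "too": ["t", "uw"], "do": ["d", "uw"]}


def acoustic_log_score(phoneme_probs, phonemes):
    """Sum the log probabilities the acoustic model gives to each phoneme."""
    if len(phoneme_probs) != len(phonemes):
        return -math.inf
    score = 0.0
    for distribution, phoneme in zip(phoneme_probs, phonemes):
        probability = distribution.get(phoneme, 0.0)
        if probability <= 0.0:
            return -math.inf
        score += math.log(probability)
    return score


def decode_word(phoneme_probs, lexicon, word_probs):
    """Pick the word with the best combined acoustic and language score."""
    best_word, best_score = None, -math.inf
    for word, phonemes in lexicon.items():
        word_probability = word_probs.get(word, 0.0)
        if word_probability <= 0.0:
            continue
        score = acoustic_log_score(phoneme_probs, phonemes)
        score += math.log(word_probability)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Assumed acoustic model output: one distribution per phoneme segment.
observed = [{"t": 0.6, "d": 0.4}, {"uw": 1.0}]
# Assumed language model output: probability of each word in this context.
neutral_context = {"two": 0.5, "too": 0.2, "do": 0.3}
verb_context = {"two": 0.1, "too": 0.1, "do": 0.8}

first_guess = decode_word(observed, LEXICON, neutral_context)
second_guess = decode_word(observed, LEXICON, verb_context)
```

With the neutral context the acoustically favoured "two" is chosen, while the second context makes "do" likely enough to outweigh the weaker acoustic evidence. This is the sense in which the language model helps resolve words that sound alike.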