Online Hidden Markov Model

Check out more papers on Cognition Hidden Figures Learning


A study indicated that Artificial Intelligence and Machine Learning are popular technologies today that will greatly impact businesses for years to come. Machine Learning and Deep Learning models are powerful enough to go through billions of data points to provide insights that only experts could do before. Tools used in Machine Learning are nothing but set of sophisticated offline algorithms and online algorithms. An online algorithm  follows a learning process which is in a serial fashion. This algorithm takes in the input data one-by-one and performs learning as the data comes in. This also means that an online algorithm makes continuous changes in order to provide the optimal solution to the problem at hand. Contrary to an online algorithm, an offline algorithm is fed with the complete data-set required for the learning process. One could say that an offline algorithm looks at the complete problem and data from the beginning before providing the solution. In addition, not every online algorithm has an offline complement. The performance comparison of online and offline algorithms is a widely researched topic in the field of analysis of algorithms.

Competitive Analysis compares an online algorithm with the optimal offline algorithm in determining a solution and gives an accurate performance measure to say which one is better. The performance depends on the application and the type of the algorithm. Some of the examples of machine learning algorithms include offline models such as Hidden Markov Models (HMMs), Support Vector Machine (SVM) and Linear Regression etc., Whereas, deep learning mainly focuses on online models such as perceptron algorithm and neural networks (NNs).

This review studies the principles behind online algorithm specifically NN and applies these to an offline machine learning model specifically HMM. This research aims to create an online version of HMM and study the differences between optimal offline HMM algorithm’s performance with the online version.

The review answers the following questions: what the existing implementation of HMM is, how is probability used in implementing HMM, what are the key steps in training the HMM, how can Neural Network’s calculus theory be used to implement an online version of HMM. The articles selected for the literature review include the CS286 course material and papers discussed in the course by Prof. Mark Stamp, previous student’s thesis projects, published papers mainly in the field of HMMs and online algorithms. The review is structured as follows: Section II dives into details of how Hidden Markov Model works. The probability theorems used, Baum-welch re-estimation and mathematical formulae are listed out. Section IV introduces the concept of gradient descent, details of how an NN or Perceptron algorithm works. This section also addresses the calculus theory. Section V focuses on how to change the offline HMM to extend it to an online version and talks about the feasibility, advantages, disadvantages, expected performance of the online version vs. the offline HMM.

Fig 1 shows the organization of the literature review.

Fig 1. Organization of the survey

Hidden Markov Model

Hidden Markov Models (HMMs) are one of the adaptive systems that have been extensively used in modeling sequences such as time-series, DNA sequences, modeling speech recognition etc., One of the key concepts behind HMM is Expectation-Maximization (EM) algorithm, that has been applied to the modeling and analysis of protein sequences in biology, Natural Language Processing (NLP) and optical character recognition. [ref: hmm gradient]. As the name suggests HMMs use Markov Process (a random process in which the future state is independent of the past state, given the present state) and can identify hidden states or meanings given the observations or data. Hence, in this model, the system’s hidden states or behavior is assumed to follow the Markov process.

In order to construct an HMM, certain parameters are initialized, which are as follows:

  1. Number of hidden states N
  2. Number of unique observations M
  3. Observation sequence length T
  4. Transition probability matrix A
  5. Emission probability matrix B
  6. Initial state matrix ? (Pi)

The Markov chain of hidden states, transition and emission probabilities can be denoted as:

  1. qt ? S; S = {s1, s2, …., sn} and t=1, 2, …., T
  2. Initial probability vector of states ? = P(q1 = si)
  3. Transition Matrix Aij(t) = P(qt+1 = sj |qt = si) where i, j = 1, 2, ..,n. For discrete times t, each hidden state qt emits an observed state yt ? O, O = {o1, ..., om}
  4. Emission Probability Matrix Bij(t) = P(yt = oj | qt = si) i = 1, ...,n; j = 1, ...,m which are the actual observations of the time series shown, from time t = 1,2, …., T.

Fig 2: Markov Chain


Following steps are used to train an HMM and score an observation sequence using this trained HMM:

  1. Random initialization of A, B, and ? matrices
  2. Forward Pass algorithm (?-pass)
  3. Backward Pass algorithm (?-pass)
  4. Calculating gammas (?[i][t]) and di-gammas(?[i][j][t])
  5. Re-computing or re-estimating A, B and ? matrices using Baum-Welch re-estimation algorithm.
  6. Computing the total log-likelihood of a sequence given the model. This probability can be calculated using the following formula: Given the observed sequence y1T = {y1, y2, ..., yT} and hidden sequence q1T = {q1, q2, ..., qT}, the probability of observing a sequence y1T given a model ? or ? (symbols are used interchangeably) = (?, A, B) is given as follows:


Artificial Neural Network

An artificial neuron, inspired by biological neuron, is one of the oldest ideas proposed around the 1940s. The present-day neural networks use the concept of the perceptron, which is a mathematical function activated on certain conditions and its output can be used to model the data.

Fig 3: Artificial Neuron

The T can be called the threshold function which might or might not be activated depending upon function used and on different input values for X0, X1, X2. W0, W1, W2 denote the weights of the edges. Similarly, for n input values X = {X0, X1, …., Xn}. The function can be denoted as:

Multilayer perceptron (MLP) is an Artificial Neural Network (ANN) that has more than one (hidden) layers in the form of a perceptron. Fig 4: Multilayer perceptron with two hidden layers


In order to construct MLP or ANN, certain parameters are initialized, which are as follows:

  1. The width of an ANN, which is the number of neurons per layer. It is not necessary that every layer have the same width.
  2. The depth of an ANN, which is the number of hidden layers chosen. For a “deep” neural network the depth is large.
  3. The transfer function or activation function which is performed by each node in the network. Few examples of this function include sigmoid, hyperbolic tangent and Rectified Linear unit (ReLU) function.
  4. The objective function, which is the function that must be optimized, and this function represents the error that must be minimized. This is constructed as a distance measure between the model’s output and actual output.


Following steps are used to train an ANN using Backpropagation algorithm:

  1. Initialize the weights and parameters
  2. Forward pass: In this step, the neuron makes a prediction about the training data by looking at the input samples. This includes the computation of a weighted sum of the input features and activating this sum using the activation function chosen.
  3. Backward pass: In this step, calculus derivatives and automatic differentiation [ann.pdf] are used to compute the gradient of an objective function. To achieve this gradient-descent optimization algorithm is used.

Fig: Gradients of n input variables[ann.pdf]

The differentiation chain rule is applied to calculate the gradients of weights with respect to the error function. The chain rule is as follows:

Fig: Chain rule of differentiation

The E denotes the error function, wij is the weight that is being updated, oj is the output of the layer j, netj is one of the neurons at hidden layer ‘j’.

Weight Update: In this step the weights are updated based on the gradients computed in the previous step. By putting all the steps together, the weight must be updated using:

Fig: weight update step using ? as the learning rate

Online Version of HMM

HMMs are one of the most efficient algorithms to model sequential data because of their inherent strong statistical foundation. However, there is a certain advantage with the online characteristic of the gradient descent algorithm used in training an ANN. Moreover, HMMs often have many unstructured parameters and the training process is computationally expensive. The other weakness of HMM is due to the offline characteristic of Baum-Welch re-estimation algorithm. Due to this, when new data appears the entire training process must be restarted all over which would further increase the time and resource complexity. In order to experiment this drawback, the online gradient descent algorithm in ANN is used to replace the offline Baum-Welch re-estimation in HMM.

Firstly, the parameters of ANN with HMM are compared. The depth of an HMM is nothing but the number of hidden Markov chains considered which is typically one for first-order HMM. The width of HMM is the number of hidden states. The activation functions listed above are basically like the probabilities given for A and B matrices. Hence, the transition and emission probabilities are the activation functions. Finally, the objective function that must be optimized is the log-likelihood function given in Figure X. The only difference is that in an ANN the objective function is generally the error term that must be minimized, in contrast, the likelihood P(O|?) for HMM must be maximized.

Given the similarities and differences between HMM and an ANN, the transition probability matrix A and emission probability matrix B can be updated indirectly in the form of “weights” as follows:

Fig: new a and b matrix

This technique uses SoftMax normalized form and ? is a temperature parameter. The weights can be updated using the following step:

Fig: weight update

Aij is the expected number of transitions from hidden state i to hidden state j, Bij is the expected number of times we observe j in hidden state i and these new terms are computed in terms of gammas, di-gammas and ct’s as scales from the ?-pass as follows:

Fig: new terms

The other two terms ? and C are learning rate and -log(P(O|?)) respectively. Using the above new “online re-estimation” method, the HMM can be extended to be used as an online algorithm. Further experiments on using this algorithm might prove the performance of the model in terms of the log-likelihood probability and time for convergence. 


Most of today’s emerging technologies and applications focus on extracting insights from huge datasets. Deep learning and Machine learning models make this process easier. The tools that these models use on the back-end are set of algorithms that implement mathematical formulae. The continuously growing data has given rise to the ever-growing importance of these models which has brought in an ever-growing need for improvisation of the existing algorithms. To effectively make use of the data, the context of application and algorithm implemented are crucial.

The literature review focuses on two main classes of algorithms related to machine learning process, namely offline and online learning. The review then studies the oldest and most popular online learning algorithm (gradient descent in NN), researches the way in which this can be leveraged in implementing a new online version of one of the most efficient offline-learning models, namely HMM. The literature review then dives into details of implementation, parameters initialized, training processes of these two models (NN and HMM). Finally, the online version of HMM is proposed. Further research and experiments on a specific data-set could be helpful in comparing the performance of the online version of HMM with the existing offline HMM.

The scope of this research work can also be extended to implementing “transfer learning concept” from neural networks. Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. This efficient technique can be implemented using online algorithms. Extending the online version of HMM to use transfer learning is one of the future scopes for this research work. 

Did you like this example?

Cite this page

Online Hidden Markov Model. (2021, Oct 12). Retrieved July 23, 2024 , from

Save time with Studydriver!

Get in touch with our top writers for a non-plagiarized essays written to satisfy your needs

Get custom essay

Stuck on ideas? Struggling with a concept?

A professional writer will make a clear, mistake-free paper for you!

Get help with your assignment
Leave your email and we will send a sample to you.
Stop wasting your time searching for samples!
You can find a skilled professional who can write any paper for you.
Get unique paper

I'm Amy :)

I can help you save hours on your homework. Let's start by finding a writer.

Find Writer