Sentiment Analysis on Twitter Dataset Using Machine Learning


The spread of the World Wide Web has given individuals a new way to express their sentiments. It is also a medium carrying a huge amount of information, where users can view the opinions of other users; these opinions can be classified into different sentiment classes and are increasingly a key factor in decision making. This paper contributes to sentiment analysis for customer review classification, which helps in analyzing information in the form of tweets, where opinions are highly unstructured and are positive, negative, or somewhere in between.

To do this, we first pre-processed the dataset, then extracted the meaningful adjectives from it to form the feature vector, then selected the feature vector list, and thereafter applied machine learning based classification algorithms, namely Naive Bayes, Maximum Entropy and SVM, along with semantic orientation based on WordNet, which extracts synonyms and similarity for the content features. Finally, we measured the performance of the classifiers in terms of recall, precision and accuracy.

Keywords: Machine Learning, Semantic Orientation, Sentiment Analysis, Twitter

I. Introduction


This paper covers the analysis of content on the Web, which is growing rapidly in both number and volume of sources: many sites are dedicated to specific types of products and specialize in collecting user reviews, Amazon being one example. Twitter is likewise a place where tweets convey opinions, but obtaining an overall understanding of such unstructured data (opinions) can be very time consuming. Users who see these opinions on a particular site form an image of the products or services and ultimately reach a judgment about them.

These opinions are then generalized to gather feedback for different purposes, and this is where sentiment analysis is used. Sentiment analysis is a process applied to a dataset of emotions, attitudes or assessments, and it takes into account the way a human thinks. Understanding the positive and negative aspects of a sentence is a very difficult task. The features used to classify a sentence should include a strong adjective in order to summarize the review. Such content is also written in varied styles that are not easily interpreted by users or firms, which makes it difficult to classify. Sentiment analysis helps users judge whether the information about a product is satisfactory before they acquire it. Marketers and firms use this analysis to understand how their products or services are perceived, so that they can be offered according to users’ needs. Two types of machine learning techniques are generally used for sentiment analysis: unsupervised and supervised.

Unsupervised learning does not use predefined categories: the model is not given the correct targets at all and therefore relies on clustering. Supervised learning is based on a labeled dataset, so the labels are provided to the model during training. The model trained on this labeled data is then expected to produce reasonable outputs when it encounters new data during decision making. This paper is based on supervised machine learning. The rest of the paper is organized as follows. The second section briefly discusses the work carried out on sentiment analysis in different domains by various researchers. The third section describes the approach we followed for sentiment analysis. The fourth section covers implementation details and results, followed by the conclusion and a discussion of future work in the last section.

In recent years a lot of work has been done in the field of sentiment analysis by a number of researchers; work in the field began around the start of the century. In its early stage it was intended for binary classification, which assigns opinions or reviews to bipolar classes such as positive or negative. Lakshmi and Edward proposed pre-processing the data to improve the quality and structure of the raw sentences, and applied the LSA technique and cosine similarity for sentiment analysis. Basant Agarwal et al. applied a phrase pattern method for sentiment classification, which uses part-of-speech based rules and dependency relations to extract contextual and syntactic information from the document. M. Karamibekr and A.A. Ghorbani proposed a method that treats verbs as important opinion terms for sentiment classification of documents belonging to the social domain. A lot of work has also been done in which researchers have explored and applied soft-computing approaches, mainly fuzzy logic and neural networks, for sentiment analysis.

In our approach we used a labeled Twitter dataset and analyzed it using a uniform feature extraction technique. In our framework, a pre-processor is first applied to the raw sentences to make them easier to interpret. Next, different machine learning techniques are trained on the dataset using feature vectors, and then semantic analysis supplies a large set of synonyms and similarity measures, which gives the polarity of the content. The block diagram of the approach is shown in Fig. 1.

[image: image1.png]
Fig. 1. Diagram of the Approach to Problem

A. Pre-processing of the datasets

Tweets contain many opinions, expressed in different ways by different individuals. The Twitter dataset used in this work is already labeled with negative and positive polarity, which makes the analysis easier. The raw data, however, is highly susceptible to inconsistency and redundancy. Because the quality of the data affects the results, the raw data is pre-processed to improve its quality: pre-processing removes repeated letters and punctuation and so makes the data more efficient to work with.
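The pre-processing step can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the stop-word list, the vocabulary, and the function names `normalize_word` and `preprocess` are assumptions made for this example. Repeated letters are collapsed only when the result is a known vocabulary word, so that legitimate double letters (as in "good") are left alone.

```python
import re

# Hypothetical stop-word list and vocabulary, chosen only for this illustration.
STOPWORDS = {"that", "is", "a", "the"}
VOCABULARY = {"painting", "beautiful", "now", "hardworking", "good", "happy"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")
REPEATED_LETTER_PATTERN = re.compile(r"(.)\1+")


def normalize_word(word, vocabulary=VOCABULARY):
    """Collapse repeated letters, but only if that yields a known word."""
    if word in vocabulary:
        return word
    squeezed = REPEATED_LETTER_PATTERN.sub(r"\1", word)
    return squeezed if squeezed in vocabulary else word


def preprocess(tweet, stopwords=STOPWORDS, vocabulary=VOCABULARY):
    """Lowercase, strip URLs and special symbols, drop stop words, fix repeats."""
    text = tweet.lower()
    text = URL_PATTERN.sub(" ", text)          # remove URLs
    text = NON_LETTER_PATTERN.sub(" ", text)   # remove @, #, punctuation, digits
    words = [w for w in text.split() if w not in stopwords]
    return " ".join(normalize_word(w, vocabulary) for w in words)


print(preprocess("that painting is Beauuuutifull #"))
print(preprocess("@Geet is Noww Hardworkingg"))
```

Unlike the paper's examples, this sketch lowercases all output, in line with the "convert to lower" step of the algorithm.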

For example, “that painting is Beauuuutifull #” becomes “painting Beautiful” after pre-processing. Similarly, “@Geet is Noww Hardworkingg” becomes “Geet now hardworking”.

B. Feature extraction

After pre-processing, the improved dataset has many distinctive properties. The feature extraction method extracts the aspect (adjective) from the dataset. This adjective is later used to indicate the positive or negative polarity of a sentence, which is useful for determining the opinion of individuals using the unigram model. The unigram model extracts the adjective and isolates it, discarding the words that precede and follow it in the sentence. For the example above, “painting Beautiful”, the unigram model extracts only “Beautiful”.

C. Training and classification

Supervised learning is an important technique for solving classification problems.
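As an illustration of the adjective-based unigram features described in part B, the sketch below picks adjectives out of pre-processed tweets and turns them into a feature dictionary suitable for a classifier. The text does not say how adjectives are identified, so a small hand-made adjective lexicon stands in for a part-of-speech tagger here; the lexicon and all function names are assumptions made for this example.

```python
# Hypothetical adjective lexicon standing in for a part-of-speech tagger.
ADJECTIVES = {"beautiful", "hardworking", "happy", "glad", "bad", "terrible"}


def extract_adjectives(tweet, adjectives=ADJECTIVES):
    """Keep only the adjectives; neighbouring words are discarded (unigrams)."""
    return [w for w in tweet.lower().split() if w in adjectives]


def build_feature_list(tweets, adjectives=ADJECTIVES):
    """Collect the distinct adjectives seen in the corpus, in first-seen order."""
    feature_list = []
    for tweet in tweets:
        for adjective in extract_adjectives(tweet, adjectives):
            if adjective not in feature_list:
                feature_list.append(adjective)
    return feature_list


def extract_features(tweet, feature_list):
    """Mark, for each feature word, whether it occurs in the tweet."""
    words = set(tweet.lower().split())
    return {f"contains({w})": (w in words) for w in feature_list}


corpus = ["painting beautiful", "geet now hardworking", "beautiful day"]
features = build_feature_list(corpus)
print(extract_adjectives("painting Beautiful"))
print(extract_features("geet now hardworking", features))
```

Each tweet's feature dictionary, paired with its label, is the kind of input a supervised classifier would then be trained on.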

In this work too, we applied various supervised techniques to obtain the desired result for sentiment analysis. The next few paragraphs briefly discuss the supervised techniques, i.e. the support vector machine, followed by the semantic analysis.

· Support Vector Machine (SVM)

SVM is a classification technique. A support vector machine analyzes the data, defines the decision boundaries and uses kernels for computation, which is performed in the input space. The input data are two sets of vectors of size m each. Every data item, represented as a vector, is assigned to a particular class. The task is then to find a margin between the two classes that is far from any document.
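For the linearly separable case, the search for this margin can be written as an optimization problem. The following is the standard hard-margin formulation, given only as an illustration; the symbols are introduced for this sketch and do not appear in the original text: x_i is the i-th input vector of size m, y_i is its class label (+1 or -1), n is the number of training vectors, and w and b define the separating hyperplane.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b} \quad & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2} \\
\text{subject to} \quad & y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1,
  \qquad i = 1, \dots, n, \\
\text{margin width} \quad & = \frac{2}{\lVert \mathbf{w} \rVert}.
\end{aligned}
```

Minimizing the norm of w is therefore equivalent to maximizing the margin, and a new vector x is assigned to the class given by the sign of w^T x + b.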

The distance from the decision boundary to the nearest data points defines the margin of the classifier; maximizing the margin reduces indecisive decisions. SVM supports both classification and regression, is grounded in statistical learning theory, and helps identify precisely the factors that need to be taken into account for successful learning.

· Decision tree regression

Decision trees are a class of very powerful machine learning models capable of achieving high accuracy in many tasks while being highly interpretable. What makes decision trees special among ML models is the clarity of their information representation. The “knowledge” learned by a decision tree through training is directly formulated into a hierarchical structure.

This structure holds and displays the knowledge in such a way that it can easily be understood, even by non-experts.

D. Sentiment Analysis

After training and classification, we applied semantic analysis. The semantic analysis is based on the WordNet database, a database of English words that are linked to one another. If two words are close to each other in this database, they are semantically similar. More specifically, we are able to determine synonym-like similarity. We map terms and examine their relationships in the ontology.

The key task is to take the stored documents that contain terms and check their similarity with the words that users write in their sentences. This helps to show the polarity of the sentiment. For example, in the sentence “I am happy”, the word “happy”, being an adjective, is selected and compared with the stored feature vector for synonyms. Suppose two words, ‘glad’ and ‘satisfied’, are very similar to the word ‘happy’. After the semantic analysis, ‘glad’ replaces ‘happy’, which gives a positive polarity.
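The "happy"/"glad" example can be sketched as a threshold-based lookup. This is an illustration only: the similarity scores and stored polarities below are invented for the example and stand in for values that a WordNet-based similarity measure would supply, and the function names and the threshold of 0.7 are likewise assumptions.

```python
# Hypothetical similarity scores standing in for a WordNet-based measure.
SIMILARITY = {
    frozenset(("happy", "glad")): 0.9,
    frozenset(("happy", "satisfied")): 0.8,
    frozenset(("sad", "unhappy")): 0.9,
    frozenset(("happy", "sad")): 0.1,
}

# Hypothetical stored feature words with known polarity.
STORED_POLARITY = {
    "glad": "positive",
    "satisfied": "positive",
    "unhappy": "negative",
}


def similarity(x, y):
    """Symmetric similarity in [0, 1]; unknown pairs score 0."""
    if x == y:
        return 1.0
    return SIMILARITY.get(frozenset((x, y)), 0.0)


def match_polarity(word, stored=STORED_POLARITY, threshold=0.7):
    """Return (closest stored word, its polarity), or None if nothing is close."""
    best_word, best_score = None, threshold
    for candidate in stored:
        score = similarity(word, candidate)
        if score > best_score:  # must exceed the threshold to count as a match
            best_word, best_score = candidate, score
    if best_word is None:
        return None
    return best_word, stored[best_word]


print(match_polarity("happy"))
```

With these made-up scores, "happy" is matched to "glad" because it is the most similar stored word above the threshold, and the tweet inherits the positive polarity of "glad".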

[image: image2.emf]
Fig. 2. Flow Diagram of the proposed methodology

Input: Labeled dataset
Output: Positive and negative polarity, with synonyms of words and similarity between words

Step-1 Pre-process the tweets:
    Pre-processing()
        Remove URL
        Remove special symbols
        Convert to lower
Step-2 Get the Feature Vector List:
    For w in words:
        Replace two or more words
        Strip
        If (w in stopwords)
            Continue
        Else:
            Append the file
    Return feature vector
Step-3 Extract Features from Feature Vector List:
    For word in feature list
        Features = word in tweets_words
    Return features
Step-4 Combine Pre-Processing Dataset and Feature Vector List:
    Pre-processed file = path name of the file
    Stopwords = file path name
    Feature Vector List = file path of feature vector list
Step-5 Training the step 4:
    Apply classifiers classes
Step-6 Find Synonym and Similarity of the Feature Vector:
    For every sentence in feature list
        Extract feature vector in the tweets()
        For each Feature Vector: x
            For each Feature Vector: y
                Find the similarity(x, y)
                If (similarity > threshold)
                    Match found
                    Feature Vector: x = Feature Vector: y
                    Classify(x, y)
    Print: sentiment polarity with similar feature words

V. Result

Table 1 shows the performance measures of the support vector machine based classifiers in terms of precision and recall.
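For reference, the performance measures used here (precision, recall and accuracy) can be computed from true and predicted labels as in the sketch below. The function name, the label strings and the sample labels are assumptions made for this illustration; they are not the paper's data or results.

```python
def evaluate(true_labels, predicted_labels, positive="positive"):
    """Compute precision, recall and accuracy for a binary labeling."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    return {
        # Share of tweets predicted positive that really are positive.
        "precision": true_pos / predicted_pos if predicted_pos else 0.0,
        # Share of truly positive tweets that were found.
        "recall": true_pos / actual_pos if actual_pos else 0.0,
        # Share of all tweets labeled correctly.
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }


# Made-up labels, for illustration only.
truth = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
print(evaluate(truth, predicted))
```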

In this paper, we proposed a set of machine learning techniques combined with semantic analysis for classifying sentences and product reviews based on Twitter data. The key aim is to analyze a large number of reviews using a Twitter dataset that is already labeled. The Naive Bayes technique, which gives a better result than Maximum Entropy and SVM, gives a better result when combined with the unigram model than when used alone. The accuracy improves further, from 88.2% to 89.9%, when the above procedure is followed by WordNet-based semantic analysis. In future work, the training dataset can be enlarged to improve the identification of sentences related to the feature vector, and WordNet can be extended for summarization of the reviews. This may give a better visualization of the content that will be helpful for users.

VII. References


  1. R. Feldman, “Techniques and Applications for Sentiment Analysis,” Communications of the ACM, Vol. 56, No. 4, pp. 82-89, 2013.
  2. Y. Singh, P.K. Bhatia, and O.P. Sangwan, “A Review of Studies on Machine Learning Techniques,” International Journal of Computer Science and Security, Volume (1) : Issue (1), pp. 70-84, 2007.
  3. P.D. Turney, “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews,” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pp. 417-424, July 2002.
  4. Ch.L. Liu, W.H. Hsaio, C.H. Lee, G.C. Lu, and E. Jou, “Movie Rating and Review Summarization in Mobile Environment,” IEEE Transactions on Systems, Man, and Cybernetics, Part C 42(3), pp. 397-407, 2012.
  5. Y. Luo and W. Huang, “Product Review Information Extraction Based on Adjective Opinion Words,” Fourth International Joint Conference on Computational Sciences and Optimization (CSO), pp. 1309-1313, 2011.
  6. R. Liu, R. Xiong, and L. Song, “A Sentiment Classification Method for Chinese Document,” Proceedings of the 5th International Conference on Computer Science and Education (ICCSE), pp. 918-922, 2010.
  7. A. Khan and B. Baharudin, “Sentiment Classification Using Sentence-level Semantic Orientation of Opinion Terms from Blogs,” Proceedings of the National Postgraduate Conference (NPC), pp. 1-7, 2011.
  8. L. Ramachandran and E.F. Gehringer, “Automated Assessment of Review Quality Using Latent Semantic Analysis,” ICALT, IEEE Computer