Tweets Using Support Vector Machine

Abstract

Social media has grown exceptionally in recent years, and a huge amount of data is being released into the public domain through it. Twitter is among the social media giants and serves as a medium for sharing data with the public. Sentiment analysis is an effective way to gauge people’s opinions about a particular post. The proposed work presents an analysis of sentiment and sarcasm detection on a Twitter dataset. It uses the support vector machine (SVM) to classify data as sarcastic or non-sarcastic and also to classify the dataset as positive, negative, or neutral. Our major objective, however, is to increase the accuracy of sarcasm or irony detection in tweets. Sarcasm is difficult to detect because written data carries no facial expression or intonation, and this hampers the performance of sentiment analysis: in sarcastic or ironic text, the machine reads the text differently from how it was intended.

Keywords: sentiment analysis, sarcastic, non-sarcastic, machine learning, social media, Twitter, SVM, classification.

Introduction

The modern world is often described as a digital world. With the invention of mobile devices and related technology, an enormous amount of data is being generated by every device on the network. The internet currently has over three billion connected devices across the world. With an ever-increasing number of devices, the amount of data being produced is large. Modern-day humans are pioneers in communication; developed countries report that a large share of their population owns devices connected to the internet. There has been a major rise in social media and microblogging sites such as Twitter and Facebook in the last decade. These social media websites have provided an open platform for communication in the modern era. Various social media giants have also claimed to have over one billion online users in a single day, and Twitter is said to carry over five hundred million tweets per day. The amount of data from such social media sites presents an interesting opportunity: it offers a varied and irregular dataset with many kinds of information that is publicly accessible.

Despite the availability of software to extract data about a person’s sentiment on a specific product or service, organizations still face problems with data extraction. With the rise of the World Wide Web, people are using social media such as Twitter, which generates large volumes of opinion text in the form of tweets that are available for sentiment analysis. This yields a huge volume of data from a human point of view, which makes it difficult to extract sentences, read them, analyze them tweet by tweet, summarize them, and organize them into a comprehensible format in a timely manner. There are, however, various other challenges posed by streaming social media data. Informal language refers to the use of colloquialisms and slang in communication, following the conventions of spoken language, such as writing ‘wouldn’t’ for ‘would not’. Not all systems are able to detect sentiment from informal language, and this can create a problem for analysis and decision making.

Emoticons are a pictorial representation of human facial expressions. In the absence of body language and prosody, they draw the receiver’s attention to the tenor or temper of the sender’s nominally verbal communication, improving and changing its interpretation. For example, :) indicates a happy state of mind and :( indicates a sad state of mind. Systems currently in place do not have sufficient data to draw feelings out of emoticons, even though people often use emoticons to express what they cannot put into words. The data available is usually only a few characters long, which makes most text classification algorithms inefficient, because multiple keywords often cannot be derived from such data. Another challenge is posed by the composition of the data itself. Recent internet culture has given rise to various slang terms and short forms such as “LOL” (Laughing Out Loud) and “TTYL” (Talk to You Later). Short forms are widely used in the short message service (SMS) and are used even more frequently on Twitter to minimize the number of characters, because Twitter has limited tweets to 140 characters. Sentiment analysis has emerged as an exciting trend in social media, with many practical applications that range from business to government use.

Sentiment is an attitude, thought, or judgment prompted by feeling. Sentiment analysis, also known as opinion mining, classifies text such as reviews as positive, negative, or neutral using the sentiment of the words. The sentiment can be expressed as one of these classes or as a numeric rating score stating its intensity. The main task is to accurately calculate the score of the tweet data and display the sentiment of that particular tweet.

Literature Review

In the last decade there has been a great rise in the scope for analyzing human behaviour on web data using machine learning. In [1], the authors first present a quick procedure for carrying out sentiment analysis to classify the highly unstructured data of Twitter into positive or negative classes. Second, they discuss various techniques for sentiment analysis on Twitter data, including knowledge-based and machine learning techniques. In [2], the authors analyse the performance of the Support Vector Machine (SVM) for sentiment analysis.

For the performance analysis of SVM, they used two pre-classified datasets of tweets: the first consisted of tweets about self-driving cars and the second of tweets about Apple products. The Weka tool was used for performance analysis and comparison, and results were measured in terms of precision, recall, and F-measure. The paper in [3] presents a survey of sentiment analysis and classification algorithms. This survey concluded that sentiment classification remains an open field for research, with a lot of scope for new algorithms. SVM and Naïve Bayes are the most popular algorithms for sentiment classification, and sentiment analysis of tweets is very common.

Datasets from sites such as Amazon, IMDb, and Flipkart are widely used for sentiment analysis. Deeper analysis is required in the case of social networking sites, and in many cases consideration of context is very important. The authors therefore state that more research is required in this field. The paper in [4] presents a way of improving existing sarcasm detection algorithms by including better pre-processing and text mining techniques such as emoji and slang detection.

Various techniques are used for classifying tweets as sarcastic or non-sarcastic, many of which are briefed in Section 2 of that paper. The paper takes up a classification algorithm and suggests various improvements that directly contribute to better accuracy. The project derived analytical views from a social media dataset, namely a Twitter dataset, and additionally filtered out or reverse-analysed sarcastic tweets to achieve a comprehensive accuracy in classifying the data presented. The model has been tested in real time and can capture live streaming tweets by filtering through hashtags and then perform immediate classification.

In [5], the work identifies just one type of sarcasm that is common in tweets: a contrast between a positive sentiment and a negative situation. It presents a bootstrapped learning method to acquire lists of positive sentiment phrases and negative activities and states, and shows that these lists can be used to recognize sarcastic tweets. That work has only scratched the surface of the possibilities for identifying sarcasm arising from positive/negative contrast.

Proposed System

With a vast number of tweets and data coming in at a brisk pace, it is important to select the best possible method for the data available. One of the best machine learning methods available for sentiment analysis is the Support Vector Machine (SVM).

The dataset supplied to the proposed model is a 2000-tweet dataset that contains general tweets with sarcastic or non-sarcastic labels. It is a manually classified dataset and is newly introduced in this paper. This collection of tweets contains many things that need to be processed before moving on to the classification phase. The first step is hashtag identification and replacement. Hashtags are words or phrases preceded by a hash sign (#), used on social media websites and applications, especially Twitter, to identify messages on a specific topic.

The dataset contains many hashtags. Some, such as “#elections2019” and “#irony”, are useful for classification, but others such as “#Lisbon” and “#entertainment” are not particularly useful. These hashtags are treated as normal tokens and passed to part-of-speech (POS) tagging, and the weighting is based on the POS tag assigned. The next step in pre-processing is emoji dictionary mapping. The popular trend of using emojis in tweets cannot be ignored, as it carries a lot of weight for classification.

The emojis in a tweet are identified and then mapped using the manually built emoji dictionary introduced in this paper. The dictionary contains the popular emojis labelled as positive, negative, or neutral. The last step is slang dictionary mapping. The slang dictionary contains popular slang terms and their meanings or full forms as key-value pairs. When a tweet is being analyzed, any slang that is detected is immediately looked up in the dictionary and replaced by the appropriate meaning or full form.
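The three pre-processing steps described above (hashtag handling, emoji dictionary mapping, and slang dictionary mapping) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the two dictionaries are tiny hypothetical stand-ins for the manually built ones, and the function name `preprocess` is an assumption.

```python
import re

# Hypothetical miniature dictionaries; the hand-built versions used in the
# paper are not available, so these entries are for illustration only.
EMOJI_SENTIMENT = {"😀": "positive", "😢": "negative", "😐": "neutral"}
SLANG_MEANINGS = {"lol": "laughing out loud", "ttyl": "talk to you later"}

HASHTAG_RE = re.compile(r"#(\w+)")


def preprocess(tweet):
    """Return (cleaned_text, emoji_labels) for one tweet."""
    # 1. Hashtags: drop the '#' so the tag is treated as a normal token.
    text = HASHTAG_RE.sub(r"\1", tweet)

    # 2. Emoji: record the label of each known emoji, then blank it out.
    emoji_labels = [EMOJI_SENTIMENT[ch] for ch in text if ch in EMOJI_SENTIMENT]
    text = "".join(" " if ch in EMOJI_SENTIMENT else ch for ch in text)

    # 3. Slang: replace each known slang token with its full form.
    words = [SLANG_MEANINGS.get(word.lower(), word) for word in text.split()]
    return " ".join(words), emoji_labels
```

For example, `preprocess("LOL great job #irony 😀")` returns the text `"laughing out loud great job irony"` together with the label list `["positive"]`. Slang with attached punctuation (such as "lol!") is not matched by this simple whitespace split.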

Replacing slang in this way aids the classification process.

(2) Data Preparation

After the pre-processing steps are completed, the data should be prepared for the classification phase. The tweets are a collection of sentences that cannot be fed directly into the classifiers. Hence, three major steps are performed to prepare the data for the next phase: tokenization, part-of-speech tagging, and stemming with lemmatization.

The first step is tokenization. Tokenization breaks tweets down into meaningful units. In general, tokens can be paragraphs or whole sentences, but in the proposed model a token is a word. The tweet is broken down into words, the keywords that will aid classification are selected, and stop words are removed. After the tweets are tokenized, part-of-speech tagging is performed. The words in a tweet and their parts of speech play a role in classification. If a person uses a lot of adjectives, there is a chance that they are describing something with excessive praise, which hints at sarcasm.
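Word-level tokenization and stop-word removal, as described above, might look like the following sketch. The stop-word list is a short illustrative assumption (real lists are much longer), and part-of-speech tagging is left out because it needs a trained tagger from an NLP library.

```python
import re

# Small illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in", "it", "i"}

# A token is a run of lower-case letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(tweet):
    """Lower-case a tweet and split it into word tokens."""
    return TOKEN_RE.findall(tweet.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in STOP_WORDS]
```

Applied to "I love waiting in the rain!", `tokenize` yields six lower-case word tokens and `remove_stop_words` keeps `love`, `waiting`, and `rain`.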

With this in mind, the proposed model tags the part of speech of every word. Stemming is built on the idea that words with the same stem are close in meaning, so the words are stemmed to identify those that are similar in meaning. After identification, specific weights are assigned based on the meaning. Lemmatization is the process of identifying the root word of the various words used in the tweet. For example, a word such as “mice” is converted to “mouse”. Such conversion clarifies the context in which the word is used and makes it easier to map the word to its meaning.

(3) SVM Classifier

There are various SVM algorithms that use different kernels depending on the type of data to be classified. The Support Vector Machine can be viewed as a kernel machine, so its behavior can be changed by using a different kernel function. Among the most popular kernel functions, the linear kernel is usually recommended for text classification. Most text classification problems are linearly separable, and the linear kernel works well when there are a lot of features, because mapping the data to a higher-dimensional space does not really improve performance.
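To make the linear kernel concrete: it is simply the dot product of two feature vectors. The sketch below assumes each tweet is represented as a sparse bag-of-words count vector, which is a common choice for text but is not spelled out in this paper; the helper names are hypothetical.

```python
from collections import Counter


def bag_of_words(tokens):
    """Sparse term-count vector: token -> number of occurrences."""
    return Counter(tokens)


def linear_kernel(x, y):
    """Linear kernel K(x, y) = x . y for two sparse vectors."""
    # Iterate over the smaller vector; tokens missing from the other
    # vector contribute zero to the dot product.
    if len(x) > len(y):
        x, y = y, x
    return sum(count * y.get(token, 0) for token, count in x.items())
```

Because only shared tokens contribute, the cost grows with the length of the shorter tweet rather than with the vocabulary size, which is one reason the linear kernel is cheap for text.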

In text classification, both the number of instances (documents) and the number of features (words) are large. Training an SVM with a linear kernel is faster than with another kernel, particularly when using a dedicated library. The support vector machine (SVM) solves the traditional text categorization problem effectively, often outperforming Naïve Bayes, because it is based on the concept of maximum margin. The main principle of SVMs is to determine a linear separator that separates the different classes in the search space with maximum distance, that is, with maximum margin. If we represent the tweet by t, the hyperplane by h, and the classes by a set Cj ∈ {1, -1} into which the tweet has to be classified, the solution is written as follows, corresponding to the sentiment of the tweet. [image: capt1.JPG] The idea of SVM is to determine a boundary or boundaries that separate distinct clusters or groups of data. SVM performs this task by constructing a set of points and separating those points using mathematical formulas.
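The maximum-margin idea above can be illustrated with a small linear SVM trained by sub-gradient descent on the regularised hinge loss. This is a teaching sketch under stated assumptions: the paper does not say which solver or library was used, the hyperparameters `lam`, `epochs`, and `lr` are arbitrary, and a dedicated library would normally be used in practice.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss.

    X is a list of feature vectors (lists of floats); y holds labels
    in {+1, -1}, matching the class set described in the text.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularisation term shrinks w on every step.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # The point is misclassified or inside the margin:
                # move the hyperplane away from it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x lies on."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

Only points with a margin below 1 change the hyperplane direction, which mirrors the role of support vectors: examples far from the boundary do not influence it beyond the regularisation shrinkage.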

Fig. 2 illustrates the data flow of SVM; the figure below shows the SVM workflow [1]. [image: Capture2.JPG] After the SVM is applied, the data is classified as positive, negative, or neutral depending on its score.

(4) Sarcasm Detection

Sarcasm detection is one of the most difficult tasks in machine learning. Sarcasm is the use of remarks that clearly mean the opposite of what they say, made in order to hurt someone’s feelings or to criticize something in a dry way. It uses irony to mock or to say something in a way opposite to the actual intent. Sarcastic comments are difficult even for humans to catch and are therefore a challenge for a machine to detect accurately. The steps involved in sarcasm or irony detection are as follows.

(A) Feature extraction

The very first step is engineering the feature extraction. Here the unwanted or noisy data is removed from the dataset and only useful information is retained. The features are extracted in such a way that the meaning of the sentence is preserved.

Feature extraction uses tools such as POS tagging, removal of characters, and tokenization, similar to those used in sentiment analysis. The feature values calculated, that is, the feature extraction scores, are then mapped into vectors. These vectors are further saved into a dictionary called the dictionary vector, which is used by the SVM classifiers. The dataset (tweets) is divided into training and testing data, and the test vector and train vector are then calculated.

(B) Operations Data Set (Training & Testing)

The dataset is the most important part of the classification. The dataset used consists of 10,000 tweets and is split into a training set and a testing set.
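Mapping tokens to vectors and splitting the data into training and testing sets could be sketched as below. The paper does not define its feature score, so plain term counts are used here as an assumption, and the 80/20 split ratio and the function names are likewise illustrative.

```python
import random


def build_vocabulary(token_lists):
    """Map every distinct token to a column index, in order of first use."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(tokens, vocab):
    """Turn one token list into a dense term-count vector."""
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:  # tokens unseen during training are skipped
            vector[vocab[token]] += 1
    return vector


def train_test_split(vectors, labels, test_ratio=0.2, seed=0):
    """Shuffle the examples reproducibly, then split them in two parts."""
    indices = list(range(len(vectors)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, vectors), pick(train_idx, labels),
            pick(test_idx, vectors), pick(test_idx, labels))
```

Building the vocabulary from the training tweets only, and reusing it for the test tweets, keeps information from the test set out of the features.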

Training starts with pre-assigning artificial weights to the dataset. Using the linear kernel from SVM, the feature score accuracy is calculated by splitting the feature-extracted dataset and fitting it in the SVM model. The accuracy score is calculated five consecutive times using different split points, to ensure that all the data is considered in all possible ways. These results are saved in the SVM classifier, and thus the classifier is trained.

(C) Validation

The trained SVM classifier is now applied to the testing data and the accuracy is calculated by cross-validation. Further, a confusion matrix is plotted.
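Scoring the model five times on different split points resembles k-fold cross-validation with k = 5, although the paper does not name the exact scheme. The sketch below assumes contiguous folds and accepts any `fit`/`predict` pair; the majority-label classifier is only a stand-in to keep the example self-contained and is not the paper's SVM.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(fit, predict, X, y, k=5):
    """Return the accuracy measured on each of the k held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Stand-in classifier (an assumption): always predicts the most frequent
# training label. Replace it with an SVM's fit/predict functions.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model
```

Each example is used for testing exactly once and for training in the other four rounds, which is what lets all the data contribute to the accuracy estimate.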

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

[image: 1st past] [image: 2nd part]

Above is the calculation of the performance of the trained SVM classifier. The F1-score is a measure of a test’s accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples. In the proposed work, to extend the analysis, we have used and compared the normalized confusion matrix and the confusion matrix without normalization.

[image: 1 Normalized confusion matrix] Fig 4.1

[image: 1 Only confusion matrix] Fig 4.2

In Fig. 4.1, the normalized confusion matrix, each row corresponds to the actual class, whereas each column corresponds to the predicted class.
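The confusion matrix, its row-normalized form, and the precision, recall, and F1 definitions given above can be written out as follows. This is a plain-Python sketch with hypothetical function names; F1 is computed with the standard formula 2pr / (p + r).

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count outcomes; rows are actual classes, columns predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so that every non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([cell / total if total else 0.0 for cell in row])
    return normalized


def precision_recall_f1(matrix, positive=0):
    """Return (p, r, F1) for the class at row/column index `positive`."""
    tp = matrix[positive][positive]
    predicted_pos = sum(row[positive] for row in matrix)  # column total
    actual_pos = sum(matrix[positive])                    # row total
    p = tp / predicted_pos if predicted_pos else 0.0
    r = tp / actual_pos if actual_pos else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

With row normalization, the diagonal entry of a row is the recall of that class, so a value such as 0.84 on the diagonal means 84% of the tweets that truly belong to that class were predicted correctly.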

The diagonal represents the correctly predicted classes. The off-diagonal elements represent the incorrectly predicted classes, that is, samples mistakenly predicted as another class. Of the ironic tweets, 0.84 were correctly predicted and 0.16 were incorrectly predicted.

IV. Result

[image: Algo compare] Fig 5.1

As shown in Fig. 5.1, the ID3 algorithm (i.e. the decision tree algorithm) was 61.13% accurate, while SVM gave a better and more accurate result with 69.06%.

[image: custom result 1] Fig. 5.2

In Fig. 5.2 the SVM classifier correctly detected the user-driven data as Ironic ([1]). From the results it is evident that the SVM classifier performs better than the ID3 decision tree classifier.

V. Conclusion and Future Scope

In this proposed system we have presented a way in which the machine learning technique SVM may be applied to large sets of data to determine membership, in this case sarcastic (ironic) and non-sarcastic (non-ironic).

The SVM classifier performs well on the data used here and does a good job of learning from the training data and adapting to the testing data. A comparison has been carried out with the decision tree algorithm to compare the accuracy of SVM and ID3 (decision tree). In addition, the SVM classifier successfully classifies positive, negative, and neutral sentiment in the tweet data. We have demonstrated how data can be analyzed with the machine learning technique SVM while taking into account the slang and emojis used in tweets these days. In future there is a need to achieve higher accuracy. No machine learning technique is 100 percent accurate; the proposed system tries to achieve maximum accuracy. There is a lot of future scope for sentiment analysis because of the ever-increasing amount of data flowing in on the web. We hope to keep trying and move towards 100 percent accuracy in predicting human sentiments and the meaning and context in which they are expressed, using machine learning.

References

  1. Mitali Desai, Mayuri Mehta. Techniques for Sentiment Analysis of Twitter Data: A Comprehensive Survey. International Conference on Computing, Communication and Automation (ICCCA2016).
  2. Munir Ahmad, Iftikhar Ali, Shabib Aftab. Sentiment Analysis of Tweets using SVM. International Journal of Computer Applications, November 2017.
  3. Harpreet Kaur, Veenu Mangat, Nidhi. A Survey of Sentiment Analysis techniques. I-SMAC 2017.
  4. Anukarsh G Prasad, Sanjana S, Skanda M Bhat, B S Harish. Sentiment Analysis for Sarcasm Detection on Streaming Short Text Data. 2017 2nd International Conference on Knowledge Engineering and Applications.
  5. Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, Ruihong Huang. Sarcasm as Contrast between a Positive Sentiment and Negative Situation. 2013.
  6. Ruchi Mehra, Mandeep Kaur Bedi, Gagandeep Singh, Raman Arora, Tannu Bala, Sunny Saxena. Sentimental Analysis Using Fuzzy and Naive Bayes. IEEE 2017 International Conference on Computing Methodologies and Communication.
  7. https://en.wikipedia.org/wiki/Global_Internet_usage#Internet_users
  8. www.internetlivestats.com/twitter-statistics/