Effective Text Data Preprocessing Technique for Sentiment Analysis in Social Media Data

doi:10.1109/KSE.2019.8919368

Proceedings Article


Saurav Pradha, +2 more

- pp 8919368


TLDR

The paper proposes an algorithm that weights the sentiment score by the weight of the hashtag and the cleaned text to obtain the sentiment, together with an algorithm to train Support Vector Machine, Deep Learning, and Naïve Bayes classifiers to process Twitter data.

Abstract:

In the big data era, data is generated in real time or near real time. Businesses can therefore use this ever-growing volume of data for data-driven or information-driven decision-making to improve their businesses. Social media platforms such as Twitter generate an enormous amount of such data. However, social media data are often unstructured and difficult to manage. Hence, this study proposes an effective text data preprocessing technique and develops an algorithm to train Support Vector Machine (SVM), Deep Learning (DL), and Naïve Bayes (NB) classifiers to process Twitter data. We also develop an algorithm that weights the sentiment score in terms of the weight of the hashtag and the cleaned text. In this study, we (i) compare different preprocessing techniques (stemming, lemmatization, and spelling correction) on data collected from Twitter to identify the most efficient method, and (ii) develop an algorithm that weights the scores of the hashtag and the cleaned text to obtain the sentiment. We retrieved N = 1,314,000 Twitter data items and compared the popularity of two products, Google Now and Amazon Alexa. Using our data preprocessing algorithm and sentiment weight score algorithm, we trained SVM, DL, and NB models. The results show that the stemming technique performed best in terms of computational speed. Additionally, the accuracy of the algorithm was tested against manually sorted sentiments and against sentiments produced before text data preprocessing. The results demonstrated that the output produced by the algorithm was close to the manually annotated sentiments. In terms of model performance, the SVM performed best, with an accuracy of 90.3%, perhaps owing to the unstructured nature of Twitter data. Previous studies used conventional techniques, so no precise methods were applied to cleaning the text. Therefore, our approach confirms that a proper text data preprocessing technique plays a significant role in the prediction accuracy and computational time of the classifier when using unstructured Twitter data.
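The abstract does not give the cleaning steps or the weighting formula, so the following is only a minimal sketch of the general idea: separate the hashtags from the tweet body, score each part, and combine the two scores with weights. The toy lexicon `LEXICON`, the weights `w_hashtag=0.4` and `w_text=0.6`, and the function names are all hypothetical choices made for illustration, not the authors' values. Stemming, lemmatization, and spelling correction are omitted to keep the example dependency-free.

```python
import re

# Hypothetical toy lexicon; the paper does not state which sentiment resource it uses.
LEXICON = {"love": 1.0, "great": 1.0, "good": 0.5,
           "bad": -0.5, "hate": -1.0, "awful": -1.0}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
TOKEN_RE = re.compile(r"[a-z]+")


def split_tweet(tweet):
    """Return (cleaned_tokens, hashtags) for one tweet."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)        # drop links first so URL fragments are not read as hashtags
    text = MENTION_RE.sub(" ", text)    # drop @user mentions
    hashtags = HASHTAG_RE.findall(text) # keep hashtag words separately
    text = HASHTAG_RE.sub(" ", text)    # remove hashtags from the body text
    return TOKEN_RE.findall(text), hashtags


def lexicon_score(tokens):
    """Mean polarity of the tokens found in LEXICON; 0.0 if none match."""
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def weighted_sentiment(tweet, w_hashtag=0.4, w_text=0.6):
    """Combine hashtag and cleaned-text scores with assumed weights."""
    tokens, hashtags = split_tweet(tweet)
    score = w_hashtag * lexicon_score(hashtags) + w_text * lexicon_score(tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label


if __name__ == "__main__":
    for example in ["I love my Alexa! #great http://t.co/x @amazon",
                    "This is bad #hate",
                    "hello world"]:
        print(example, "->", weighted_sentiment(example))
```

In a pipeline like the one the abstract describes, labels produced this way could serve as training targets for the SVM, DL, and NB classifiers; how the authors actually connect the two steps is not specified in the abstract.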



