Proceedings ArticleDOI

Effective Text Data Preprocessing Technique for Sentiment Analysis in Social Media Data

TLDR
This paper proposes an algorithm that weights the sentiment score by the weights of the hashtag and the cleaned text to obtain the sentiment, and an algorithm to train Support Vector Machine, Deep Learning, and Naïve Bayes classifiers to process Twitter data.
Abstract
In the big data era, data is generated in real time or near real time. Businesses can therefore use this ever-growing volume of data for data-driven or information-driven decision-making to improve their operations. Social media platforms such as Twitter generate an enormous amount of such data. However, social media data are often unstructured and difficult to manage. Hence, this study proposes an effective text data preprocessing technique and develops an algorithm to train Support Vector Machine (SVM), Deep Learning (DL), and Naive Bayes (NB) classifiers to process Twitter data. We also develop an algorithm that weights the sentiment score in terms of the weights of the hashtag and the cleaned text. In this study, we (i) compare different preprocessing techniques (stemming, lemmatization, and spelling correction) on data collected from Twitter to identify the most efficient method, and (ii) develop an algorithm that weights the scores of the hashtag and the cleaned text to obtain the sentiment. We retrieved Twitter data (N = 1,314,000) and compared the popularity of two products, Google Now and Amazon Alexa. Using our data preprocessing algorithm and sentiment weight score algorithm, we trained SVM, DL, and NB models. The results show that the stemming technique performed best in terms of computational speed. Additionally, the accuracy of the algorithm was tested against manually sorted sentiments and against sentiments produced before text data preprocessing. The results demonstrated that the sentiments produced by the algorithm were close to the manually annotated sentiments. In terms of model performance, the SVM performed better, with an accuracy of 90.3%, possibly because of the unstructured nature of Twitter data. Previous studies used conventional techniques; hence, no precise methods were applied to cleaning the text. Therefore, our approach confirms that a proper text data preprocessing technique plays a significant role in the prediction accuracy and computational time of the classifier when using unstructured Twitter data.
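
The abstract describes weighting a hashtag score against a cleaned-text score but does not give the formula, the weights, or the scoring resource. The following is only a minimal sketch of one way such a weighting could look, under stated assumptions: the toy lexicon `LEXICON`, the 0.4/0.6 weights, the mean-polarity scoring, and all function names are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical toy lexicon; the paper does not state which resource it used.
LEXICON = {"love": 1.0, "great": 1.0, "good": 0.5,
           "bad": -0.5, "hate": -1.0, "terrible": -1.0}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def split_tweet(tweet):
    """Return (hashtag tokens, cleaned text tokens) for one tweet."""
    text = URL_RE.sub(" ", tweet.lower())      # drop links first
    text = MENTION_RE.sub(" ", text)           # drop @mentions
    hashtags = HASHTAG_RE.findall(text)        # keep hashtag words separately
    text = HASHTAG_RE.sub(" ", text)           # then remove them from the body
    tokens = re.findall(r"[a-z]+", text)       # alphabetic tokens only
    return hashtags, tokens


def lexicon_score(tokens):
    """Mean polarity of the tokens found in LEXICON, or 0.0 if none match."""
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def weighted_sentiment(tweet, hashtag_weight=0.4, text_weight=0.6):
    """Combine hashtag and cleaned-text scores using assumed weights."""
    hashtags, tokens = split_tweet(tweet)
    score = (hashtag_weight * lexicon_score(hashtags)
             + text_weight * lexicon_score(tokens))
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, score


if __name__ == "__main__":
    for example in ["I love my Alexa! #great https://t.co/x @amazon",
                    "This update is terrible #hate",
                    "Just installed Google Now"]:
        print(example, "->", weighted_sentiment(example))
```

In a pipeline like the one the abstract outlines, labels of this kind could serve as training targets for the SVM, DL, and NB classifiers, and a stemming or lemmatization step could be applied to the tokens before scoring; neither step is shown here.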


Citations
Journal ArticleDOI

An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian.

TL;DR: In this paper, the authors proposed a different approach for Twitter sentiment analysis based on two steps: first, the tweet jargon, including emojis and emoticons, is transformed into plain text, exploiting procedures that are language-independent or easily applicable to different languages.
Journal ArticleDOI

Multilingual evaluation of pre-processing for BERT-based sentiment analysis of tweets

TL;DR: The results identify the most suitable strategy for pre-processing tweets, and thus improve the state of the art in both languages for the considered sentiment analysis task.
Journal ArticleDOI

Multimodal sentimental analysis for social media applications: A comprehensive review

TL;DR: This work aims to present a survey of recent developments in analyzing the multimodal sentiments (involving text, audio, and video/image) which involve human–machine interaction and challenges involved in analyzing them.
Journal ArticleDOI

Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features

TL;DR: A comparative analysis of the success of these algorithms, carried out to decide which algorithm works best for the given dataset in terms of recall, accuracy, F1-score, and precision.
Journal ArticleDOI

Climate Change Sentiment Analysis Using Lexicon, Machine Learning and Hybrid Approaches

TL;DR: This study aims to find the most effective sentiment analysis approach for climate change tweets and related domains by performing a comparative evaluation of various sentiment analysis approaches using lexicon, machine learning and hybrid methods.
References
Journal ArticleDOI

Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms

TL;DR: Random Forest and k-Nearest Neighbor are proved to be the best classifiers for any type of dataset, and Naive Bayes can outperform the other two algorithms if the feature variables are in a problem space and are independent.
Proceedings ArticleDOI

Enabling Real-Time Drug Abuse Detection in Tweets

TL;DR: The utility of Twitter in examining patterns of abuse, and the feasibility of building a drug abuse detection system that can process large volumes of data from social media sources in near real time, are illustrated.
Proceedings ArticleDOI

Big-data NoSQL databases: A comparison and analysis of “Big-Table”, “DynamoDB”, and “Cassandra”

TL;DR: The study concluded that Google's BigTable and Amazon's DynamoDB are also critical and efficient on their own, and found that the combination of both systems had caused the development of Cassandra, which is the basis of various modern applications available both open source and socially.
Book ChapterDOI

An Analysis of Demographic and Behavior Trends Using Social Media: Facebook, Twitter, and Instagram

TL;DR: This chapter reviewed 30 research studies on the topic of behavioral analysis using the social media from 2015 to 2017 based on the method of previous publications and analyzed the results, limitations, and number of users to draw conclusions.
Proceedings ArticleDOI

The Effects of Natural Language Processing on Big Data Analysis: Sentiment Analysis Case Study

TL;DR: The experiments prove that the performance of the sentiment analysis is enhanced by 5% using NLP and linguistic processing, yielding an accuracy of 73% on the data set used.