Book Chapter

Novel Text Preprocessing Framework for Sentiment Analysis

01 Jan 2019, pp. 309-317

TL;DR: A text preprocessing model for sentiment analysis (SA) of Twitter posts, built on natural language processing (NLP) techniques, is proposed to reduce dimensionality and execution time.

Abstract: The aim of this article is to propose a text preprocessing model for sentiment analysis (SA) of Twitter posts using natural language processing (NLP) techniques. Discussion of, and investment in, health-related chatter on social media keeps increasing day by day. Capturing the actual intention of tweeps (Twitter users) is challenging. Twitter posts consist of text, which needs to be cleaned before analysis in order to reduce dimensionality and execution time. Text preprocessing plays an important role in analyzing health-related tweets. We obtained results that were 5.4% more accurate after performing text preprocessing, and an overall accuracy of 84.85% after classifying the tweets using the LASSO approach.
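
The abstract does not enumerate the cleaning steps, so the following is only a minimal sketch of a typical tweet-cleaning pass, not the chapter's actual pipeline. The regular expressions, the function name `clean_tweet`, and the tiny stop-word list are all assumptions made for illustration. Dropping links, mentions, punctuation, and stop words shrinks the vocabulary, which is one common way to reduce dimensionality before classification.

```python
import re

# Assumed cleaning rules; the abstract does not list the chapter's real steps.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

# Tiny illustrative stop-word list (hypothetical, not from the chapter).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "my"}


def clean_tweet(text):
    """Return the list of cleaned, lowercased tokens for one tweet."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop links
    text = MENTION_RE.sub(" ", text)     # drop @user mentions
    text = text.replace("#", " ")        # keep hashtag words, drop the symbol
    text = NON_LETTER_RE.sub(" ", text)  # drop digits, punctuation, emoji
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(clean_tweet("@DrSmith The new #flu vaccine is GREAT!! https://t.co/abc123"))
# → ['new', 'flu', 'vaccine', 'great']
```

Note that stripping every non-letter also removes emoji, which may carry sentiment; whether that is acceptable depends on the task.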



Citations
Journal Article
TL;DR: The results prove that the proposed approach is an effective strategy for sentiment analysis over patient authored text which helps in improving the classification accuracy.
Abstract: In recent days, the government and other organizations have been focusing on providing better health care to people. Understanding patients' experience of the care they received is key to providing better health care. With the prevailing use of social media applications, patients are expressing their experience over social media. This patient authored text is free, unstructured data that is available over social media in large chunks. To extract the sentiments from this huge volume of data, a domain-specific dictionary is required to get better accuracy. The proposed approach defines a new domain-specific dictionary and uses it in sentiment scoring to enhance the overall sentiment classification of patient authored text. We conducted experiments on the proposed approach using the NHS Choices dataset and compared it with popular classifiers such as linear regression and stochastic gradient descent, and with the dictionary-based approaches VADER and AFINN. The results prove that the proposed approach is an effective strategy for sentiment analysis over patient authored text which helps in improving the classification accuracy.

7 citations
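
The dictionary-based sentiment scoring mentioned in the abstract above can be illustrated with a minimal sketch. The lexicon entries, their weights, and the function names `sentiment_score` and `classify` are hypothetical; the article's actual dictionary and scoring rule are not given here.

```python
import re

# Hypothetical domain-specific lexicon; words and weights are invented
# for illustration and are not taken from the cited article.
DOMAIN_LEXICON = {
    "caring": 2.0,
    "helpful": 1.5,
    "clean": 1.0,
    "delay": -1.5,
    "rude": -2.0,
    "pain": -1.0,
}


def sentiment_score(text, lexicon=DOMAIN_LEXICON):
    """Sum the lexicon weights of the words found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(tok, 0.0) for tok in tokens)


def classify(text, lexicon=DOMAIN_LEXICON):
    """Map the summed score to a coarse polarity label."""
    score = sentiment_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("The nurses were caring and helpful"))   # → positive
print(classify("Long delay and rude reception staff"))  # → negative
```

Words missing from the lexicon contribute nothing to the score, which is why a dictionary tuned to the domain's vocabulary can matter for accuracy.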

Proceedings Article
01 Oct 2019
TL;DR: The finding suggests that lexical and semantic-based methods for sentiment prediction offer better accuracy than Deep Learning methods when a large enough and evenly distributed training dataset is not available.
Abstract: Technology has turned into a fundamental piece of everybody's life. Social media technology is already used widely by the public to speak one's mind openly. This data can be leveraged to have a better understanding of the current state of decision making. However, Twitter data is highly unstructured. Sentiment analysis can be applied to such health-related data to extract useful information regarding public opinion. The aim of the research is to understand (i) the correlation between Deep Learning and lexical and semantic-based sentiment prediction methods, (ii) the sentiment prediction accuracy of these methods on a manually annotated sentiment dataset, (iii) the effect of domain-specific knowledge on the accuracy of the sentiment prediction methods, and (iv) the influence of telemedicine with regard to heart attack and epilepsy, using Twitter-based sentiment. Four sentiment prediction methods are utilized for the research: lexical and semantic-based (Valence Aware Dictionary and Sentiment Reasoner (VADER) and TextBlob) and Deep Learning based (Long Short Term Memory (LSTM) and the sentiment model from Stanford CoreNLP). The dataset that we retrieved consists of 1.84 million old health-related tweets. Our finding suggests that lexical and semantic-based methods for sentiment prediction offer better accuracy than Deep Learning methods when a large enough and evenly distributed training dataset is not available. We observed that domain-specific knowledge affects the prediction accuracy of sentiment, mainly when the target text contains more domain-specific words. Sentiment prediction on Twitter data can be utilized to understand the demographic distribution of sentiment. In our case, we observed that telemedicine has a high level of positive sentiment. It is still in its infancy and has not spread to a broader demographic.

6 citations


Cites methods from "Novel Text Preprocessing Framework ..."

  • ...Additionally, a new, improved method for tweet text cleaning can be implemented, which cleans the tweet in a way that the original sentiment stays intact [33]....


Proceedings Article
20 Apr 2020
TL;DR: The amount of semantic information lost by discounting emojis is qualitatively ascertained, and a mechanism for accounting for emojis in a semantic task is shown.
Abstract: In this paper, we extend the task of semantic textual similarity to include sentences which contain emojis. Emojis are ubiquitous on social media today, but are often removed in the pre-processing stage of curating datasets for NLP tasks. In this paper, we qualitatively ascertain the amount of semantic information lost by discounting emojis, as well as show a mechanism of accounting for emojis in a semantic task. We create a sentence similarity dataset of 4000 pairs of tweets with emojis, which have been annotated for relatedness. The corpus contains tweets curated based on common topic as well as by replacement of emojis. The latter was done to analyze the difference in semantics associated with different emojis. We aim to provide an understanding of the information lost by removing emojis by providing a qualitative analysis of the dataset. We also aim to present a method of using both emojis and words for downstream NLP tasks beyond sentiment analysis.

3 citations


Cites background from "Novel Text Preprocessing Framework ..."

  • ...More often than not, semantic classification tasks treat emojis as noise and remove them from the dataset in the pre-processing stage [11]....


Proceedings Article
01 Dec 2019
TL;DR: This paper summarizes the findings using sentiment analysis and compares them with the quantitative data obtained from the survey, where most teachers agreed on the benefits of ICT use, yielding a more positive sentiment polarity.
Abstract: Sentiment analysis is gaining more attention as it is increasingly used in multiple domains, including the interpretation of educational data. The article uses a sentiment analysis technique to understand early childhood educators' reported beliefs (perceptions) about young children’s ICT use. The dataset was obtained from a comparative study of early childhood educators from two countries, Australia and Malaysia. The results show a similar outcome, where most teachers agreed on the benefits of ICT use, yielding a more positive sentiment polarity. This paper summarizes the findings using sentiment analysis and compares them with the quantitative data obtained from the survey.

3 citations

Journal Article
TL;DR: In this article, a Tweet-Scan-Post (TSP) framework is proposed to identify the presence of sensitive private data (SPD) in users' posts under personal, professional, and health domains.
Abstract: Social media technologies are open to users who intend to create a community and publish their opinions on recent incidents. The participants of online social networking sites remain ignorant of the criticality of disclosing personal data to a public audience. The private data of users are at high risk, leading to many adverse effects such as cyberbullying, identity theft, and job loss. This research work aims to define user entities or data such as phone number, email address, family details, and health-related information as the user’s sensitive private data (SPD) on a social media platform. The proposed system, Tweet-Scan-Post (TSP), is mainly focused on identifying the presence of SPD in users' posts under personal, professional, and health domains. The TSP framework is built on the standards and privacy regulations established by social networking sites and organizations such as NIST, DHS, and GDPR. The proposed TSP approach addresses the prevailing challenges in determining the presence of sensitive PII and user privacy within the bounds of confidentiality and trustworthiness. A novel layered classification approach with various state-of-the-art machine learning models is used by the TSP framework to classify tweets as sensitive or insensitive. The findings of the TSP system include 201 Sensitive Privacy Keywords obtained using a boosting strategy, and a sensitivity scaling that measures the degree of sensitivity associated with a tweet. The experimental results revealed that personal tweets were highly related to mother and children, professional tweets to apology, and health tweets to concern over the father’s health condition.

1 citation


References
06 Sep 2017
TL;DR: A survey conducted by the Pew Research Center found that a majority of adults in the United States access their news on social media, with 18% doing so often.
Abstract: As part of an ongoing examination of social media platforms and news, the Pew Research Center has found that a majority of adults in the United States – 62% or around two thirds – access their news on social media, with 18% doing so often. The researchers analysed the scope and characteristics of social media news consumers across nine social networking sites, with Facebook coming out on top. News plays a varying role across the social networking sites studied. The survey shows that two-thirds of Facebook users (66%) access news on the site, nearly six-in-ten Twitter users (59%) access news on Twitter, and seven-in-ten Reddit users get news on that platform. On Tumblr, the figure sits at 31%, while for the other five social networking sites it is true of only about one-fifth or less of their user bases. Addressing the issue of news audiences overlapping on social media platforms, the researchers found that of those who access news using at least one of the sites, a majority (64%) access news on just one – most commonly Facebook. About a quarter (26%) get news on two social media sites. Just one-in-ten access news on three or more sites. The study is based on a survey conducted between 12 January and 8 February 2016 with 4,654 members of the Pew Research Center’s American Trends Panel.

850 citations

Journal Article
TL;DR: The role of text pre-processing in sentiment analysis is explored, and it is demonstrated that with appropriate feature selection and representation, sentiment analysis accuracies using support vector machines (SVM) in this area may be significantly improved.
Abstract: It is challenging to understand the latest trends and summarise the state or general opinions about products due to the great diversity and size of social media data, which creates the need for automated, real-time opinion extraction and mining. Mining online opinion is a form of sentiment analysis that is treated as a difficult text classification task. In this paper, we explore the role of text pre-processing in sentiment analysis, and report on experimental results that demonstrate that with appropriate feature selection and representation, sentiment analysis accuracies using support vector machines (SVM) in this area may be significantly improved. The level of accuracy achieved is shown to be comparable to that achieved in topic categorisation, although sentiment analysis is considered a much harder problem in the literature.

364 citations

26 May 2016
TL;DR: A survey conducted by the Pew Research Center found that a majority of adults in the United States access their news on social media, with 18% doing so often.
Abstract: As part of an ongoing examination of social media platforms and news, the Pew Research Center has found that a majority of adults in the United States – 62% or around two thirds – access their news on social media, with 18% doing so often. The researchers analysed the scope and characteristics of social media news consumers across nine social networking sites, with Facebook coming out on top. News plays a varying role across the social networking sites studied. The survey shows that two-thirds of Facebook users (66%) access news on the site, nearly six-in-ten Twitter users (59%) access news on Twitter, and seven-in-ten Reddit users get news on that platform. On Tumblr, the figure sits at 31%, while for the other five social networking sites it is true of only about one-fifth or less of their user bases. Addressing the issue of news audiences overlapping on social media platforms, the researchers found that of those who access news using at least one of the sites, a majority (64%) access news on just one – most commonly Facebook. About a quarter (26%) get news on two social media sites. Just one-in-ten access news on three or more sites. The study is based on a survey conducted between 12 January and 8 February 2016 with 4,654 members of the Pew Research Center’s American Trends Panel.

258 citations

Proceedings Article
Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, Li Wang
01 Oct 2013
TL;DR: This work investigates just how linguistically noisy or otherwise social media text is over a range of social media sources, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, which are compared to a reference corpus of edited English text.
Abstract: While various claims have been made about social media text being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, which we compare to a reference corpus of edited English text. We first extract various descriptive statistics from each data type (including the distribution of languages, average sentence length and proportion of out-of-vocabulary words), and then investigate the proportion of grammatical sentences in each, based on a linguistically-motivated parser. We also investigate the relative similarity between different data types.

216 citations
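
One of the descriptive statistics named in the abstract above, the proportion of out-of-vocabulary words, can be illustrated with a short sketch. The simple letter-only tokenizer, the function name `oov_rate`, and the toy vocabulary are assumptions; the study's actual lexicon and tokenization are not described here.

```python
import re


def oov_rate(text, vocabulary):
    """Return the fraction of tokens in `text` that are not in `vocabulary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    unknown = sum(1 for tok in tokens if tok not in vocabulary)
    return unknown / len(tokens)


# Toy reference vocabulary (hypothetical).
vocab = {"this", "is", "so", "good"}
print(oov_rate("this is sooo gooood", vocab))  # → 0.5
```

A higher rate for one source than another under the same vocabulary is a rough signal of lexical noise such as elongated or misspelled words.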

Journal Article
TL;DR: The experiments show that the accuracy and F1-measure of Twitter sentiment classifiers improve when using the pre-processing methods of expanding acronyms and replacing negation, but barely change when removing URLs, numbers or stop words.
Abstract: Twitter sentiment analysis offers organizations the ability to monitor public feeling towards the products and events related to them in real time. The first step of sentiment analysis is the text pre-processing of Twitter data. Most existing research on Twitter sentiment analysis is focused on the extraction of new sentiment features. However, the selection of the pre-processing method is ignored. This paper discusses the effects of text pre-processing methods on sentiment classification performance in two types of classification tasks, and sums up the classification performance of six pre-processing methods using two feature models and four classifiers on five Twitter datasets. The experiments show that the accuracy and F1-measure of Twitter sentiment classifiers improve when using the pre-processing methods of expanding acronyms and replacing negation, but barely change when removing URLs, numbers or stop words. The Naive Bayes and Random Forest classifiers are more sensitive than the Logistic Regression and support vector machine classifiers when various pre-processing methods are applied.

154 citations
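
The two pre-processing methods that the abstract above reports as helpful, expanding acronyms and replacing negation, can be sketched as follows. This is one plausible reading of those steps, not the article's implementation: the lookup tables and the function names are assumptions, and only plain ASCII apostrophes are handled.

```python
import re

# Illustrative lookup tables (assumptions, not the article's resources).
ACRONYMS = {"gr8": "great", "lol": "laughing out loud", "idk": "i do not know"}
NEGATIONS = {"won't": "will not", "can't": "cannot"}


def replace_negation(text):
    """Rewrite contracted negations as explicit 'not' forms."""
    for short, full in NEGATIONS.items():
        text = text.replace(short, full)
    # Generic fallback: "isn't" -> "is not", "doesn't" -> "does not".
    return re.sub(r"n't\b", " not", text)


def expand_acronyms(tokens):
    """Replace each known acronym token with its expanded words."""
    out = []
    for tok in tokens:
        out.extend(ACRONYMS.get(tok, tok).split())
    return out


def preprocess(text):
    """Lowercase, replace negations, then expand acronyms token by token."""
    text = replace_negation(text.lower())
    return expand_acronyms(text.split())


print(" ".join(preprocess("IDK why it won't load but it isn't gr8")))
# → i do not know why it will not load but it is not great
```

Irregular contractions are handled before the generic rule so that "won't" becomes "will not" rather than "wo not".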