Book ChapterDOI

Novel Text Preprocessing Framework for Sentiment Analysis

01 Jan 2019-pp 309-317
TL;DR: A text preprocessing model for sentiment analysis (SA) of Twitter posts, built with Natural Language Processing (NLP) techniques, is proposed to reduce the dimensionality problem and execution time.
Abstract: The aim of this article is to propose a text preprocessing model for sentiment analysis (SA) of Twitter posts using Natural Language Processing (NLP) techniques. Discussion of, and investment in, health-related chatter on social media keeps increasing day by day, yet capturing the actual intention of tweeps (Twitter users) is challenging. Twitter posts are free text that must be cleaned before analysis, both to reduce the dimensionality problem and to cut execution time, so text preprocessing plays an important role in analyzing health-related tweets. We gained 5.4% more accurate results after performing text preprocessing and an overall accuracy of 84.85% after classifying the tweets with a LASSO approach.
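As a rough illustration of the kind of tweet-cleaning step the framework describes, here is a minimal sketch in Python; the stop-word list and regex rules are illustrative assumptions, since the chapter's exact pipeline is not reproduced on this page:

```python
import re

# Illustrative stop-word list; the framework's actual lexicon,
# stemming, and other steps are not specified here.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "my"}

def preprocess_tweet(text: str) -> list[str]:
    """Clean a raw tweet into a reduced token list."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[@#]\w+", " ", text)        # strip mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters only
    return [t for t in text.split() if t not in STOP_WORDS]

print(preprocess_tweet("Feeling unwell :( see https://t.co/x #flu @doc"))
# → ['feeling', 'unwell', 'see']
```

Reducing tweets to such token lists shrinks the feature space before classification, which is the dimensionality and run-time saving the abstract refers to.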
Citations
Journal ArticleDOI
TL;DR: The results prove that the proposed approach is an effective strategy for sentiment analysis over patient-authored text, which helps improve classification accuracy.
Abstract: In recent times, governments and other organizations have been focusing on providing better health care. Understanding patients' experience of the care they receive is key to providing it. With the prevailing usage of social media applications, patients express their experiences over social media; this patient-authored text is free, unstructured data available in large chunks. To extract sentiment from such a huge volume of data, a domain-specific dictionary is required for better accuracy. The proposed approach defines a new domain-specific dictionary and uses it in sentiment scoring to enhance overall sentiment classification of patient-authored text. We conducted experiments on the proposed approach using the NHS Choices dataset and compared it with popular classifiers (linear regression, stochastic gradient descent) and dictionary-based approaches (VADER and AFINN). The results prove that the proposed approach is an effective strategy for sentiment analysis over patient-authored text, which helps improve classification accuracy.
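A minimal sketch of the dictionary-based scoring idea the paper builds on; the lexicon below is a toy placeholder, not the paper's NHS-derived domain dictionary:

```python
# Toy domain lexicon with signed weights (placeholder entries).
HEALTH_LEXICON = {
    "caring": 2, "helpful": 2, "clean": 1,
    "rude": -2, "painful": -2, "delay": -1,
}

def sentiment_score(tokens):
    """Sum lexicon weights over tokens; the sign gives the class."""
    score = sum(HEALTH_LEXICON.get(t, 0) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment_score(["staff", "were", "caring", "despite", "delay"]))
# → positive  (caring +2, delay -1)
```

A domain-specific lexicon of this shape can assign weights that general-purpose dictionaries like VADER or AFINN would miss for clinical vocabulary.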

19 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: The findings suggest that lexical and semantic-based methods for sentiment prediction offer better accuracy than deep learning methods when a large enough and evenly distributed training dataset is not available.
Abstract: Technology has become a fundamental part of everybody's life. Social media technology is already widely used by the public to speak one's mind openly, and this data can be leveraged for a better understanding of the current state of decision making. However, Twitter data is highly unstructured; sentiment analysis can be applied to such health-related data to extract useful information regarding public opinion. The aims of the research are to understand (i) the correlation between deep learning versus lexical and semantic-based sentiment prediction methods, (ii) the sentiment prediction accuracy of these methods on a manually annotated sentiment dataset, (iii) the effect of domain-specific knowledge on the accuracy of the sentiment prediction methods, and (iv) how Twitter-based sentiment can be used to understand the influence of telemedicine in regard to heart attack and epilepsy. Four sentiment prediction methods are utilized: lexical and semantic-based (Valence Aware Dictionary and Sentiment Reasoner (VADER) and TextBlob) and deep learning based (Long Short-Term Memory (LSTM) and the sentiment model from Stanford CoreNLP). The dataset we retrieved consists of 1.84 million health-related tweets. Our findings suggest that lexical and semantic-based methods for sentiment prediction offer better accuracy than deep learning methods when a large enough and evenly distributed training dataset is not available. We observed that domain-specific knowledge affects sentiment prediction accuracy, mainly when the target text contains more domain-specific words. Sentiment prediction on Twitter data can be utilized to understand the demographic distribution of sentiment; in our case, we observed that telemedicine attracts a high number of positive sentiments but is still in its infancy and has not spread to a broader demographic.

18 citations


Cites methods from "Novel Text Preprocessing Framework ..."

  • ...Additionally, a new, improved method for tweet text cleaning can be implemented, which cleans the tweet in a way that the original sentiment stays intact [33]....


Book ChapterDOI
TL;DR: In this chapter, the authors provide the reader with the basics of NLP and present the text pre-processing procedure in detail; machine learning approaches can expand text mining potential enormously, leading to deeper insights, a better understanding of social phenomena, and a better basis for decision-making.
Abstract: With the increase in internet usage, the amount of available textual data has also continued to increase rapidly. In addition, the development of stronger computers has enabled the processing of data to become much easier. The tourism field has a strong potential to utilize such data available on the internet; yet, on the other hand, a high proportion of available data is unlabelled and unprocessed. In order to use them effectively, new methods and new approaches are needed. In this regard, the area of Natural Language Processing (NLP) helps researchers to utilize textual data and develop an understanding of text analysis. By using machine learning approaches, text mining potential can expand enormously, leading to deeper insights, a better understanding of social phenomena, and, thus, also a better basis for decision-making. As such, this chapter will provide the reader with the basics of NLP as well as present the text pre-processing procedure in detail.

5 citations

Proceedings ArticleDOI
20 Apr 2020
TL;DR: The amount of semantic information lost by discounting emojis is qualitatively ascertained, and a mechanism for accounting for emojis in a semantic task is shown.
Abstract: In this paper, we extend the task of semantic textual similarity to include sentences which contain emojis. Emojis are ubiquitous on social media today, but are often removed in the pre-processing stage of curating datasets for NLP tasks. In this paper, we qualitatively ascertain the amount of semantic information lost by discounting emojis, as well as show a mechanism of accounting for emojis in a semantic task. We create a sentence similarity dataset of 4000 pairs of tweets with emojis, which have been annotated for relatedness. The corpus contains tweets curated based on common topic as well as by replacement of emojis. The latter was done to analyze the difference in semantics associated with different emojis. We aim to provide an understanding of the information lost by removing emojis by providing a qualitative analysis of the dataset. We also aim to present a method of using both emojis and words for downstream NLP tasks beyond sentiment analysis.
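The alternative the paper argues for, keeping emoji semantics rather than stripping them, can be sketched as a substitution pass before tokenization; the emoji-to-token mapping below is illustrative, not the paper's annotation scheme (in practice a library such as `emoji` would supply these names):

```python
# Illustrative emoji-to-token map (placeholder entries).
EMOJI_TO_TOKEN = {
    "\U0001F602": "face_with_tears_of_joy",
    "\U0001F622": "crying_face",
    "\u2764\ufe0f": "red_heart",
}

def demojize(text: str) -> str:
    """Replace known emojis with textual tokens instead of deleting them."""
    for emoji, token in EMOJI_TO_TOKEN.items():
        text = text.replace(emoji, f" :{token}: ")
    return " ".join(text.split())

print(demojize("great match \U0001F602"))
# → great match :face_with_tears_of_joy:
```

After this step the emoji survives as an ordinary token, so downstream similarity models can weigh it like any other word.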

5 citations


Cites background from "Novel Text Preprocessing Framework ..."

  • ...More often than not, semantic classification tasks treat emojis as noise and remove them from the dataset in the pre-processing stage [11]....


Journal ArticleDOI
TL;DR: In this article, a Tweet-Scan-Post (TSP) framework is proposed to identify the presence of sensitive private data (SPD) in user's posts under personal, professional, and health domains.
Abstract: Social media technologies are open to users who intend to create a community and publish their opinions on recent incidents. Participants of online social networking sites remain ignorant of the criticality of disclosing personal data to a public audience, and users' private data are at high risk, leading to many adverse effects such as cyberbullying, identity theft, and job loss. This research work aims to define user entities or data such as phone numbers, email addresses, family details, and health-related information as a user's sensitive private data (SPD) on a social media platform. The proposed system, Tweet-Scan-Post (TSP), is mainly focused on identifying the presence of SPD in users' posts under the personal, professional, and health domains. The TSP framework is built on the standards and privacy regulations established by social networking sites and organizations such as NIST, DHS, and the GDPR. The proposed approach addresses the prevailing challenges in determining the presence of sensitive PII and in preserving user privacy within the bounds of confidentiality and trustworthiness. A novel layered classification approach with various state-of-the-art machine learning models is used by the TSP framework to classify tweets as sensitive or insensitive. The findings of the TSP system include 201 sensitive privacy keywords identified using a boosting strategy and a sensitivity scaling that measures the degree of sensitivity associated with a tweet. The experimental results revealed that personal tweets were highly related to mothers and children, professional tweets to apologies, and health tweets to concern over a father's health condition.
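The sensitivity-scaling idea can be sketched as weighted keyword accumulation over a tweet's tokens; the keywords, weights, and threshold below are placeholders, since the TSP framework's 201-keyword lexicon and layered classifiers are not reproduced on this page:

```python
# Placeholder sensitive-keyword weights (not TSP's actual lexicon).
SENSITIVE_KEYWORDS = {
    "phone": 3, "email": 3, "diagnosed": 2, "surgery": 2, "mother": 1,
}

def sensitivity(tokens, threshold=2):
    """Return (score, label); the score sums matched keyword weights."""
    score = sum(SENSITIVE_KEYWORDS.get(t, 0) for t in tokens)
    return score, ("sensitive" if score >= threshold else "insensitive")

print(sensitivity(["my", "mother", "was", "diagnosed", "today"]))
# → (3, 'sensitive')
```

A graded score of this kind, rather than a binary flag, is what lets the framework express the degree of sensitivity associated with a tweet.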

5 citations

References
Journal ArticleDOI
TL;DR: In this article, subjective information processing in financial news disclosures is operationalized to measure news tone, and several Bayesian variable selection methods are used to select the relevant positive and negative words from financial news disclosures.
Abstract: This paper aims to operationalize subjective information processing in financial news disclosures. To measure news tone, previous research commonly utilizes manually selected positive and negative word lists, such as the Harvard-IV psychological dictionary. However, such dictionaries may not be suitable for the domain of financial news because positive and negative entries can have different connotations in a financial context. To overcome the problem of words selected ex ante, we incorporate several Bayesian variable selection methods to select the relevant positive and negative words from financial news disclosures. These domain-specific dictionaries outperform existing dictionaries in terms of both explanatory power and predictive performance, resulting in an improvement of up to 93.25% in the correlation between news sentiment and stock market returns. According to our findings, the interpretation of words strongly depends on the context, and managers need to be cautious when framing negative content using positive words.

4 citations

Posted Content
TL;DR: The novelty of the approach developed in this article is that it blends basic ideas behind resampling and LASSO, providing significant variable reduction and improved prediction accuracy in terms of mean squared error in the test sample.
Abstract: In this article we study the variable selection problem using LASSO with new improvisations. LASSO uses an $\ell_{1}$ penalty; it shrinks most of the coefficients to zero when the number of explanatory variables $(p)$ is much larger than the number of observations $(N)$. The novelty of the approach developed in this article is that it blends basic ideas behind resampling and LASSO, which provides significant variable reduction and improved prediction accuracy in terms of mean squared error in the test sample. Different weighting schemes have been explored using Bootstrapped LASSO, the basic methodology developed here; the weighting schemes determine the extent of data blending in the case of grouped data. The data sharing (DSL) technique developed by [11] lies at the root of the present methodology. We apply the technique to analyze the IMDb dataset discussed in [11] and compare our result with [11].
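The resampling-plus-LASSO blend can be sketched with a toy coordinate-descent solver and a stability-style selection rule; the solver, the `keep_frac` threshold, and the synthetic data are illustrative assumptions, not the article's actual weighting schemes:

```python
import random

def lasso_cd(X, y, lam, iters=200):
    """Toy coordinate-descent LASSO via soft-thresholding."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # Correlation of feature j with the partial residual.
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * w[k]
                      for k in range(p) if k != j)) for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            w[j] = (rho - lam) / z if rho > lam else \
                   (rho + lam) / z if rho < -lam else 0.0
    return w

def bootstrapped_lasso(X, y, lam=1.0, n_boot=20, keep_frac=0.6, seed=0):
    """Refit LASSO on bootstrap resamples; keep variables selected
    in at least keep_frac of the fits."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        w = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if abs(w[j]) > 1e-8:
                counts[j] += 1
    return [j for j in range(p) if counts[j] / n_boot >= keep_frac]

# Synthetic data: y depends only on features 0 and 1.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(40)]
y = [2.0 * row[0] - 1.5 * row[1] for row in X]
sel = bootstrapped_lasso(X, y)
print(sel)  # the two signal features should survive the resampling
```

Requiring a variable to be selected across most resamples, rather than in a single fit, is what yields the additional variable reduction the abstract describes.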

3 citations