International Journal of Computer Applications (0975-8887), Volume 151, No. 1, October 2016
Comment Volume Prediction using Regression

Mandeep Kaur
Computer Science Engg., CTIEMT, Jalandhar, Punjab, India

Prince Verma
Computer Science Engg., CTIEMT, Jalandhar, Punjab, India
ABSTRACT
The last decade has led to an unprecedented growth in the importance of social media. Due to the enormous volume of documents appearing on social networks, there is a great need for the automatic analysis of such records. Social media users' comments play a basic role in building or changing one's perceptions regarding a specific topic or in making it popular. This paper presents preliminary work that demonstrates the effectiveness of machine learning predictive algorithms on the comments of the most popular social networking site, Facebook. We modeled user comment patterns over posts on Facebook Pages and predicted how many comments a post is expected to receive in the next H hours. To automate the process, we developed a software prototype comprising a crawler, a data processor and an information analytics module. For prediction, we used linear regression models (Simple Linear Regression, Linear Regression and Pace Regression) and non-linear regression models (Decision Tree, MLP) on different dataset variants and evaluated them using the measures Hits@10, AUC@10, Processing Time and Mean Absolute Error.
Keywords
Social media, Comment volume, Pace regression, REP Tree.
1. INTRODUCTION
Initially, the Internet was explored for limited purposes such as reading, watching or shopping. Today's consumers are more sophisticated and use platforms such as blogs, content-sharing sites, Wikipedia and social networks for multiple purposes: news, advertisements, communication, commenting, banking, marketing and more. Social media users' comments are also very influential in making or changing people's perceptions and in bringing a topic into the trends, which can in turn affect a firm's reputation, survival and sales revenues.
Facebook has attained the position of the most popular social networking site: the daily data ingested into its databases exceeds 500 terabytes. Figure 1 shows the popularity of different networks across the world, ranked by the number of active accounts as of January 2016.
Fig. 1. Number of active users (in millions, as of January 2016) of popular social networks: Facebook, WhatsApp, Instagram, Twitter, Skype, Viber, Snapchat and LinkedIn.
Active users are those who have logged in during the last 30 days. With a constant presence in the lives of their users, social networks have a strong social impact [19].
With such huge popularity, Facebook holds an amount of data that is impossible to analyze manually, so there is a strong need for the automatic analysis of such data. For this analysis, we built a software prototype comprising a Crawler, a Data Processor and an Information Analytics module. The analytics component is oriented towards predicting the feedback volume that a post is expected to receive, given the properties of the domain.
The paper is organized as follows: Section 2 discusses social media comments and their significance, Section 3 discusses related work, domain-specific concepts are covered in Section 4, and the problem formulation in Section 5. Sections 6 and 7 describe the experimental settings and the results. The paper closes with the conclusion and future work.
2. SOCIAL MEDIA COMMENTS
Comments are the opinions of social media users about a document. They are usually short textual messages referring to the main text of the document or to other feedback.
The nature and valence of users' comments on various social media platforms play a significant role in building or changing one's perceptions regarding a specific topic or in making it popular. In today's era, comments have an important place in many fields, whether education, sales or prediction:
i. Perceptions: The paper [15] examined how people's perceptions are influenced by the comments of users on social media. The findings were explored in the context of relationship status, in which a person announced the formation or breakup of a real-world relationship on Facebook.
The results suggested that comments from different users affect the perception of a Facebook relationship status update: positive comments lead to favourable attitudes, whereas negative comments lead to poorer attitudes towards the status. Moreover, observers' attitudes towards a relationship status are driven more by the count of the comments than by the real nature of the status, as a positive Facebook status can be seen as negative if the associated comments are negative in nature.
ii. Education: This work [16] investigated the effect of using a social networking web application to support learning and teaching in classrooms on learners' performance. By comparing the ratio of Facebook posts and comments made for educational purposes to those made for non-educational purposes, it concluded that the use of Facebook has an effect on students' learning performance, as students who spent more time on education-related posting and commenting earned better grades.
iii. Predicting the future: This paper [17] demonstrates that the content of social media can be used to predict real-world outcomes. Predictions were performed on Twitter.com using almost 3 million tweets. It predicted the box-office revenues of movies, in advance of their release dates, by constructing a linear regression model. The results were strong in terms of accuracy, and the work concluded that a topic's future rank is strongly correlated with the amount of attention it receives on social media.
iv. Driving force in software evolution: User feedback is important in improving software quality. Many software companies collect data on user satisfaction through various means. The user feedback available on such sites thereby supplies software companies with a rich source of information that can be used to improve future releases. Even Apple's App Store has stated that its newest releases address many of the issues raised by users. Therefore, user feedback can be, and has been, a driving force in software evolution [13].
v. Trending: Trending lists show topics that have recently spiked in popularity, evaluated on the basis of the volume of responses to a particular topic, issue or product.
vi. Sales and marketing: McDonald's Australia serves more than 1.7 million people across more than 940 Australian restaurants every day. This worldwide fast-food chain has used Facebook video promotions to share its Australian story, connecting with more than 5 million people in only 5 weeks [18]. Coca-Cola Korea has likewise used age-group targeting and video ads to connect with 4.5 million young consumers in order to promote a new pineapple beverage, Sunny 10, which increased purchase intent for the product by 2 points.
3. RELATED WORKS
The works related to our research are as follows.
The paper [2] developed an industrial proof-of-concept demonstrating the automatic analysis of documents on Hungarian blogs. The author trained various regressor models on several features of the blogs and measured the results using the parameters Hits@10 and AUC@10. The results show that the regression models outperform naive models.
Classifiers can also be used to categorize comment volumes into specific classes, as in paper [3]. It reports on predicting the comment volume of news articles before their publication, using a Random Forest classifier based on a set of five feature groups: surface, cumulative, textual, semantic and real-world features. It addresses the task in two steps: first, binary classification of articles with the potential to receive comments, and second, classification of articles into 'low volume' and 'high volume'. The outcomes show better results for binary classification, and the textual and semantic features turn out to be the strongest performers. Similarly, paper [7] analyzed the content and publication time of online news agencies to detect the factors that are effective in diffusing content to the public. It also used a Random Forest classifier to classify articles into three categories: not commented, moderately commented (1-6 comments) and highly commented (>6). The proposed model made predictions with more than 70% accuracy and reports that the publication date and a weight introduced for content measure were the most informative features. The results could be refined by considering important days (e.g. elections, festivals, holidays) and geographical features in the prediction. Paper [5] studied the dynamics of user-generated comments on seven different news websites using the log-normal and negative binomial distributions, predicted the comment volume with a linear model, and enabled comparison across news sites. The results showed that prediction of long-term comment volume is possible with small error after ten source-hours of observation.
The paper [4] worked on the social bookmarking website Digg.com. Using comment information, it defined a co-participation network between users and studied their behavioral characteristics. It measured the entropy and inferred that the users at Digg are interested in a wide range of topics. Using a classification and regression framework, it predicted the popularity of online content based on comment data and social-network-derived features. It reported a one to four percent loss in classification accuracy when predicting the popularity metric using only the first few hours of comment data compared to all the available comment data. The results could be further improved by analyzing the polarity of the comments.
Various topic models can be used to extract the hidden topics in a post's content. The paper [6] worked on political blogs using a latent-variable topic model and analyzed the relationship between content and comment volume. It also used a Naïve Bayes model for the binary prediction task (high volume vs. low volume) and evaluated the prediction using precision, recall and F1 measures. It concluded that modeling topics can improve recall when predicting high-volume posts. Paper [8] predicted the formation of user-to-content links in Flickr groups, i.e. the chance that a user will comment on or like an image uploaded by another user. It takes into account both the community effect, using a Transactional Mixed Membership Stochastic Block (TMMSB) model, and the content effect, using Latent Dirichlet Allocation (LDA), for predicting user-to-content links. Time-zone effects could be used in the future to make the results more accurate.
Summarizing users' comments is an even more difficult task, as comments are usually mixed with different opinions; in the case of restaurants, for instance, different opinions refer to different dishes but are evaluated as an overall score for the restaurant. Paper [10] presented a new approach for comment summarization in the context of restaurants. It used real-world comments crawled from Yelp and Dianping, the most popular English and Chinese restaurant review websites. Treating the attributes of the dishes and the users' remarks on those attributes as two independent dimensions in the latent space, it constructed a bilateral topic model which, combined with opinionated word extraction and clustering-based selection algorithms, provides a high-quality summary of the restaurants as well as the dishes they serve. This concept could be applied more widely, for example to other goods or services.
In contrast to these works, we focus on Facebook, a leading platform, and target regression models such as Pace Regression and the REP Tree.
4. DOMAIN SPECIFIC CONCEPTS
For this work, the concepts related to our domain are:
1) Source: the page that produces the post.
2) Links: pointers to other related posts or pages referred to in the main text or the comments.
3) Main text: the text describing the main topic of the post.
4) Comments: the opinions of users about a post, or about other comments, appearing under the main text.
5) Feedback volume: the volume of feedback can be measured as the count of words in the comment section, the number of comments, the number of distinct users who leave comments, or in a variety of other ways. These measures can be affected by various factors such as the main text of the post, links to other posts, the time of day the post appears, a side conversation, page likes, page check-ins, the page talking-about count or the page category.
6) Feedback volume prediction: user comment patterns are modeled over posts/documents that appeared in the past and, based on them, predictions are made on the number of comments that a post/document is expected to receive in the coming hours.
7) Base time: the selected point in time after which predictions for a post are to be made. The scenario is simulated in the sense that the real feedback values after the base time are already known and are used later to evaluate the predictions of the models.
8) Variants: for an effective time-lined analysis, variants are used. Variant-X means that X variants of a particular instance are present in the dataset; the weight of that instance is effectively increased, as the same instance is considered X times with different selected base dates/times. For example, under Variant-3 a given post appears three times, each copy with a different selected base date/time.
5. PROBLEM FORMULATION
For our work, we use predictive modeling techniques and address the task as a regression problem. To predict the feedback, the patterns of users' comments are modeled over posts that appeared in the past; a model is trained on them, and predictions are made for the number of comments that a post may receive in the next H hours. Figure 2 depicts the work flow of our process.
A. Data Pre-processing:
1) Crawler: pages are crawled from Facebook in raw form using a crawler written in Java.
2) Data cleansing: the data is cleaned by discarding records with missing values, posts older than three days with respect to the selected base time, and posts with no comments.
3) Data splitting: the data is split on a temporal basis into training and testing sets. In the training data the value of the target is already known and is used to train the model; based on this information, the model predicts the target value for the testing data, where it is treated as unknown.
4) Vectorization: the data cannot be used directly in raw form. For analysis it has to be transformed into vectors; the data is transformed from document form (.txt) into vector form (.xls/.arff), which can be processed easily by the Weka tool. A minimal sketch of the splitting step follows this list.
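To make the splitting and vectorization steps concrete, the following minimal Java sketch loads an already-vectorized ARFF file with the Weka API and performs a temporal 80/20 split. The file name and the assumption that instances are stored in chronological order of their base date/time are ours, not taken from the paper.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TemporalSplit {
        public static void main(String[] args) throws Exception {
            // Load the vectorized corpus (placeholder file name); instances are
            // assumed to be sorted chronologically by their base date/time.
            Instances data = new DataSource("facebook_posts.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);   // target = comment volume

            // Temporal split: the earliest 80% of posts train the model,
            // the most recent 20% are held out for testing.
            int trainSize = (int) Math.round(data.numInstances() * 0.8);
            int testSize  = data.numInstances() - trainSize;
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, testSize);

            System.out.println("Training instances: " + train.numInstances());
            System.out.println("Testing instances:  " + test.numInstances());
        }
    }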
B. Feature set:
The analysis takes various properties of the application into account. We considered the following features as inputs to the predictor and one feature as the output value for each post. The feature classes are:
1) Page features: We identified 4 features of this class, covering the popularity/likes, category, check-ins and talking-about count of the source page. Page likes: a feature that captures users' support for specific comments, pictures or pages. Page check-ins: the act of indicating presence at a particular place. Page category: the category of the page, e.g. entertainment, sports, education, music band or movies. Page talking about: the actual count of users who are "engaged" and interacting with the page, i.e. users who come back to the page after liking it once; this includes activities such as comments, likes on a post and shares by visitors to the page.
2) Post features: These include document-related features such as the length of the document, the time gap between the selected base date/time and the document's publication date/time (ranging from 0 to 71 hours), the document promotion status (0 or 1) and the share count of the post. 5 features of this class are identified.
3) Comment-related features: These capture the pattern of comments on the post in different time intervals with respect to the randomly selected base time, named C1 to C5. C1: total comment count before the selected base date/time. C2: comment count in the last 24 hours with respect to the base date/time. C3: comment count from 48 to 24 hours before the base date/time. C4: comment count in the first 24 hours after the publication of the document/post, but before the selected base date/time. C5: the difference between C2 and C3. In addition, we aggregate these features by source and derive further features by computing the minimum, maximum, average, median and standard deviation of the 5 aforementioned features. Thus, including the 5 key features and the 25 derived features, we obtain 30 features of this class.

4) Timeline features: binary indicators (0/1) encode the day of the week on which the post was published and the day for which the prediction is to be computed. 14 features of this type are identified. (The total feature count across all classes is summed up below.)
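Taken together, the four classes give 4 page features + 5 post features + 30 comment-related features + 14 timeline features = 53 inputs per instance, with the number of comments received in the following hours as the single target value.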
C. Machine Learning for Feedback Prediction
In order to predict the feedback, we select some time in the past and simulate the scenario as though the present time were the chosen time, which we call the base time. Since we know what happened after the base time, i.e. how many comments the post received in the following H hours, we know the value of the target for these cases. At the same time, we only consider posts that were published in the most recent 3 days relative to the base time, as older posts typically do not receive any new feedback.
This prediction problem is addressed by regression analysis. In our prototype we used linear regression models, i.e. Simple Linear Regression, Linear Regression and PACE Regression, and non-linear regression models, i.e. the REP Tree and the Multi-Layer Perceptron.
Regression analysis is a statistical modeling technique for finding the relationship between the target variable (Y) and the predictor variables (X). β is the coefficient of X and defines the rate of change of the target variable (Y) per unit of variation in the predictor variable (X). The variables used to explain the target variable are called explanatory or independent variables, and the variable being explained is called the response or dependent variable. Regression is mostly used for forecasting and prediction. In our case, the explanatory variables are the features of Facebook pages and the response variable is the feedback volume.
Regression Models are classified as:
1) Simple Regression Model: if there is only one predictor variable to define the response variable, the model is called a simple regression model (a small numerical illustration follows this list).
Y = b0 + b1X1 (1)
2) Multiple Regression Model: if there is more than one predictor variable available to define the target variable, the model is called a multiple regression model.
Y = b0 + b1X1 + b2X2 + ... + bkXk (2)
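As a purely illustrative example with hypothetical coefficients: if a fitted simple model were Y = 3 + 0.8 X1, with X1 the number of comments received in the 24 hours before the base time, then a post with X1 = 20 would be predicted to receive Y = 3 + 0.8 x 20 = 19 comments.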
Regression analysis is carried out using either linear or non-linear regression techniques.
1) Linear regression: in linear regression, the regression function is defined as a linear combination of a finite number of unknown parameters (β), which are estimated from the given data.
2) Nonlinear regression: a regression analysis technique in which the data are modeled by a regression function that is a nonlinear combination of its parameters. While the equation of a linear model has a fixed fundamental structure, nonlinear equations can take a wide range of forms, using logarithmic, trigonometric and exponential functions, which is why nonlinear regression offers the most flexible curve-fitting capability. The data are fitted by a method of successive approximations. Nonlinear regression is similar to linear regression in that both seek to track the response from a given set of variables, but nonlinear models are more complicated to develop because the data are fitted through a series of successive approximations.
Fig. 2. Flow of the feedback prediction process (components: Facebook pages, Crawler, Data Preprocessing, Data Splitting & Vectorization, Training data, Testing data, Model Builder, Prediction, Model Evaluator, Results).

Pace Regression is a linear regression method that enhances ordinary least squares (OLS) regression by assessing the impact of each variable and using a clustering analysis to improve the statistical basis for estimating its contribution to the overall regression. It performs better than other linear modelling methods and subset-selection methods, which seek a reduction in dimensionality; such a reduction drops out as a natural characteristic of pace regression [21].
A Decision Tree is a non-linear method used for both regression and classification. A decision tree encodes rules in the form of if-then-else statements that predict the target value from the given features; forming these rules is called decision tree learning. The importance of decision tree modeling lies in the fact that decision trees can model regression and classification functions of any type. However, they are prone to overfitting, which can be handled by the Reduced Error Pruning (REP) tree. Reduced error pruning, proposed by Quinlan, is a simple and fast procedure for pruning decision trees: internal nodes are traversed from the leaves to the root, and at each node the tree checks whether replacing the subtree with the most frequent class affects accuracy. The node is pruned if accuracy is not affected, and the procedure is repeated upwards on each node until further pruning would decrease accuracy [22].
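As an illustration only, the following Java sketch shows how a reduced-error-pruning regression tree of this kind can be trained through the Weka API used in this work; the ARFF file name is a placeholder and the parameter settings are Weka defaults, not values reported by the paper.

    import weka.classifiers.trees.REPTree;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ReptreeExample {
        public static void main(String[] args) throws Exception {
            Instances train = new DataSource("train.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);  // comment volume target

            // REPTree builds a regression tree and applies reduced-error pruning
            // on an internal hold-out fold to curb overfitting.
            REPTree tree = new REPTree();
            tree.setMaxDepth(-1);          // -1 = unlimited depth (Weka default)
            tree.buildClassifier(train);

            // Predict the comment volume of the first training instance.
            double predicted = tree.classifyInstance(train.instance(0));
            System.out.println("Predicted comment volume: " + predicted);
        }
    }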
An Artificial Neural Network is a powerful data-driven, self-adaptive and flexible computational tool capable of capturing non-linear and complex characteristics with a high degree of accuracy.
Fig. 3. Block diagram of a Multi-Layer Perceptron.
The Multilayer Perceptron (MLP) is a classifier/regressor based on the feed-forward artificial neural network. An MLP consists of several layers of nodes, and each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map their inputs to outputs by forming a linear combination of the inputs with the node's weights w and bias b and applying an activation function. The MLP is an information processing technique inspired by the way biological nervous systems, such as the brain, process information.
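The node computation described above can be sketched as follows; the weights, bias, inputs and the choice of a sigmoid activation are hypothetical and purely illustrative (Weka's MultilayerPerceptron performs this internally).

    public class PerceptronNode {
        // Output of one MLP node: activation(w . x + b), here with a sigmoid.
        static double nodeOutput(double[] w, double[] x, double b) {
            double z = b;
            for (int i = 0; i < w.length; i++) {
                z += w[i] * x[i];              // linear combination of inputs
            }
            return 1.0 / (1.0 + Math.exp(-z)); // sigmoid activation
        }

        public static void main(String[] args) {
            double[] weights = {0.4, -0.2, 0.1};   // hypothetical weights
            double[] inputs  = {1.0, 2.0, 3.0};    // hypothetical inputs
            System.out.println(nodeOutput(weights, inputs, 0.5));
        }
    }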
6. EXPERIMENTAL SETTINGS
For our analysis, we crawled data from Facebook pages to prepare the training and testing sets of the proposed model. In total, 2,500 pages were crawled, covering 60,000 posts and 4,100,500 comments, using the Facebook Query Language (FQL). The crawled data amounts to several gigabytes. After cleaning, we were left with 50,000 posts. We split the cleaned corpus into two subsets using a temporal split, 80% for training and 20% for testing, and these datasets were then sent to the next modules for further processing.
For training, the PACE Regression and Decision Tree (REP Tree) models were built using WEKA (the Waikato Environment for Knowledge Analysis). For testing, 10 test cases were created randomly out of the testing set, with 100 instances each, and then converted to vectors for assessment.
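The following Java sketch indicates how such a model can be trained and scored through the Weka API; the file names are placeholders, and ordinary LinearRegression is used as the stand-in here because the PaceRegression implementation is only bundled with some Weka distributions.

    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainAndEvaluate {
        public static void main(String[] args) throws Exception {
            Instances train = new DataSource("train.arff").getDataSet();
            Instances test  = new DataSource("test_case_1.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Linear regression; a REPTree (or PaceRegression, where available)
            // can be swapped in the same way.
            LinearRegression model = new LinearRegression();
            model.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            long start = System.currentTimeMillis();
            eval.evaluateModel(model, test);
            long evalMillis = System.currentTimeMillis() - start;

            System.out.println("Mean Absolute Error: " + eval.meanAbsoluteError());
            System.out.println("Evaluation time (ms): " + evalMillis);
        }
    }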
A. Evaluation Parameters:
The performance of the models is evaluated using the following parameters:
1) Hits@10: an accuracy measure for the proposed work. We take the 10 posts predicted to receive the largest number of comments and count how many of them are actually among the 10 posts that received the most comments. This count is then averaged over all test cases [2].
2) AUC@10: indicates the precision of the predictions; AUC stands for area under the receiver-operator curve. The 10 posts that actually received the largest number of comments are considered positive. The posts are then sorted according to the predicted comment counts and the AUC is calculated as
AUC = TP / (TP + FP),
where TP denotes true positives and FP false positives.
3) M.A.E: the Mean Absolute Error is the average of the absolute errors; it measures how close the forecasts or predictions are to the actual comment volumes. The mean absolute error is given by
M.A.E = (1/n) Σ |ŷi - yi|, summing over the n posts of a test case,
where ŷi is the prediction and yi the true value for post i.
4) Evaluation time: the time taken by the model to perform the evaluation. (A sketch of how Hits@10 and M.A.E can be computed for a test case is given after this list.)
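For concreteness, the sketch below computes Hits@10 and the mean absolute error for a single test case from two parallel arrays of predicted and actual comment counts; the class and method names are ours, and the averaging over test cases described above is left to the caller.

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.Set;

    public class CommentVolumeMetrics {

        // Indices of the k largest values in an array.
        static Set<Integer> topK(double[] values, int k) {
            Integer[] idx = new Integer[values.length];
            for (int i = 0; i < idx.length; i++) idx[i] = i;
            Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> values[i]).reversed());
            return new HashSet<>(Arrays.asList(Arrays.copyOf(idx, k)));
        }

        // Hits@10 for one test case: how many of the 10 posts predicted to receive
        // the most comments are really among the 10 most commented posts.
        static int hitsAt10(double[] predicted, double[] actual) {
            Set<Integer> overlap = topK(predicted, 10);
            overlap.retainAll(topK(actual, 10));
            return overlap.size();
        }

        // Mean absolute error over all posts of a test case.
        static double meanAbsoluteError(double[] predicted, double[] actual) {
            double sum = 0.0;
            for (int i = 0; i < predicted.length; i++) {
                sum += Math.abs(predicted[i] - actual[i]);
            }
            return sum / predicted.length;
        }
    }

Averaging hitsAt10 over the 10 test cases of 100 posts each yields the Hits@10 figures reported in the result tables.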
7. RESULTS AND DISCUSSION
The performance of the evaluated models is depicted in Tables 1 and 2, using the measures Hits@10, AUC@10, Evaluation Time and M.A.E on the implemented models.
Results using linear regression models:
