(Open Access) Identification of User Patterns in Social Networks by Data Mining Techniques: Facebook Case (2010) | A. Selman Bozkir


S. Kurbanoğlu et al. (Eds.): IMCW 2010, CCIS 96, pp. 145–153, 2010.

Identification of User Patterns in Social Networks by Data Mining Techniques: Facebook Case

A. Selman Bozkır, S. Güzin Mazman, and Ebru Akçapınar Sezer

Hacettepe University, Department of Computer Engineering, Ankara, Turkey

selman@cs.hacettepe.edu.tr, ebru@hacettepe.edu.tr

Hacettepe University, Department of Computer Education and Instructional Technologies, Ankara, Turkey

s.guzin@gmail.com

Abstract. Currently, social networks such as Facebook or Twitter are getting more and more popular due to the opportunities they offer. As of November 2009, Facebook was the most popular and well known social network throughout the world with over 316 million users. Among the countries, Turkey is in third place in terms of Facebook users, with 14 million members, and half of them are younger than 25 years old (students). The success of Facebook and the rich opportunities offered by social media sites lead to the creation of new web based applications for social networks and open up new frontiers. Thus, discovering the usage patterns of social media sites might be useful in taking decisions about the design and implementation of those applications as well as educational tools. Therefore, in this study, the factors affecting “Facebook usage time” and “Facebook access frequency” are revealed via various predictive data mining techniques, based on a questionnaire administered to 570 Facebook users. At the same time, the associations of the students’ opinions on the contribution of Facebook in an educational aspect are investigated by employing the association rules method.

Keywords: Social networks, decision trees, Facebook, association rules.

1 Introduction

In recent years, a rapid increase in the number of social networks, along with the number of people using these networks, has been observed. Social networks, also called social software or collaborative software, are a range of applications that augment group interactions and shared spaces for collaboration, social connections, and aggregate information exchanges in a web-based environment [1]. Similarly, [2] defined social networks as web-based services allowing individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system.

Since the introduction of social network sites (SNSs) such as MySpace, Facebook, Cyworld, Bebo and Twitter, millions of users have been attracted to them. The majority of these users have integrated such sites into their daily lives. Because most social network users are young individuals, many of them are university students. Therefore, these sites are considered to play an active role in the younger generation’s daily life [3], [4]. On the other hand, it has been stated that social networks have a prominent educational context, and this prominence has prompted a growing number of educators to consider them to be important sites for student learning, although they are not intended primarily as educational applications. Besides, it has been suggested that these social networks help users re-situate learning in an open-ended social context by providing opportunities for moving beyond mere access to content (learning about) to the social application of knowledge in a constant process of re-orientation (learning as becoming) [5].

There have been various studies about social networks in the educational context, including using social networks as a tool or utilizing them as an environment for courses [6], [7], the utility of social networks in the teaching and learning process [8], their value for communication and collaboration [9], and educational usage themes of social networks (e.g. [10], [11]). However, no study on the data mining analysis of social network usage has been encountered in the literature.

As one of the most popular social networks, Facebook is considered in the present study. Facebook is defined as “a social utility that helps people share information and communicate more efficiently with their friends, family and co-workers” (facebook.com). As of November 2009, with 316 million users, Facebook is the most popular and well known social network throughout the world. Moreover, Turkey, with 14 million members, ranks third among countries in number of Facebook users, and half of these members are younger than 25 years old [12].

Data mining is a process that uses a variety of data analysis tools to discover patterns and relations in data that may be used for prediction purposes. Supervised data mining techniques are used to model an output variable based on one or more input variables, and these models can be used to predict or forecast future cases [13].

The purpose of the present study is to discover some usage patterns (i.e. usage time and access frequency) of Facebook users by data mining techniques. Additionally, an attempt is made to reveal the educational associations of the users. It is believed that social network based application development and educational programs can be enhanced by the findings of this study.

2 Data Mining

Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover useful patterns [13]. In other words, data mining is the complete process of revealing useful patterns and relationships in data by using techniques like artificial intelligence, machine learning and statistics via advanced data analysis tools. Oracle BI, SPSS Clementine, SAS Enterprise Miner and Microsoft Analysis Services are well known data mining tools in the marketplace [14].

Data mining methods are classified into two categories: predictive and descriptive. The aim of predictive methods is to make predictions on unseen cases by using a model trained on seen cases. The goal of descriptive methods, in contrast, is to discover deep relationships, correlations and descriptive properties of data.

In this study, both of these method groups are employed by using SPSS Clementine 12. Various decision tree algorithms such as CART, CHAID and C5, artificial neural networks (ANN) and SVM (Support Vector Machine) classifiers are used to predict the target variables. Furthermore, the variable importance feature of SPSS Clementine is used in discovering the factors affecting “Facebook usage time” and “Facebook access frequency”. Likewise, the Apriori algorithm is employed in discovering frequent opinions of students on the educational benefits of Facebook usage.

2.1 Methodology

As stated previously, various data mining techniques are employed during the analyses and, except for one (association rule mining), their prediction performances are compared. Thus, in this section, brief information is presented about the methodologies followed.

The decision tree method is probably the most popular classification method among the data mining techniques due to its ease of use and visual interpretation capabilities. Typically, a data mining task for a decision tree is classification; for example, to identify the credit risk for each customer [15]. The main idea of a decision tree is to split the data recursively into subsets so that each subset covers more or less homogeneous states of the dependent variable. At each split in the tree, all independent variables are re-evaluated for their impact on the dependent variable. When this recursive process is stopped and the tree is in a stable state, the required decision tree is formed [15]. At this stage, new cases can be classified via the decision tree. This stage is called tree deduction. C5, Quest, CHAID [16] and CART [17] are well-known decision tree algorithms, and SPSS Clementine provides all of them in its package. In essence, the differences among these algorithms are mainly caused by their technical capabilities and by the splitting approaches and functions they employ. For instance, C5 splits on the “gain ratio”, whereas CHAID relies on chi-square tests and CART on the “gini value”. C5 classifies only discrete valued target variables, while CART is designed for both classification and regression purposes.
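
As an illustration of the splitting measures named above, the following is a minimal sketch of the Gini impurity and the gain ratio for a single candidate split. It is not the SPSS Clementine implementation; the function names and the toy labels are assumptions made for this example, and every group in a split is assumed to be non-empty.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(labels, groups):
    """Information gain of a split divided by its split information."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    split_info = -sum(len(g) / n * log2(len(g) / n) for g in groups)
    return (entropy(labels) - remainder) / split_info


# Hypothetical toy data: a usage-time class split by a yes/no attribute.
labels = ["short", "short", "long", "long"]
perfect_split = [["short", "short"], ["long", "long"]]
useless_split = [["short", "long"], ["short", "long"]]

print(gini(labels))
print(gain_ratio(labels, perfect_split))
print(gain_ratio(labels, useless_split))
```

A split that separates the classes completely receives the highest gain ratio, while a split that leaves each subset as mixed as the parent receives none, which is the sense in which a tree prefers splits yielding more homogeneous subsets.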

On the other hand, in the pattern recognition literature, SVM (Support Vector Machine) is a state-of-the-art method with powerful discriminative features in linear and non-linear classification. Generally, SVM is designed to maximize the margin between any two classes in pattern space by searching for an optimal hyperplane that has maximum distance to the closest points of the two classes, which are termed support vectors [18]. SVM also supports multiclass prediction, and various kernel functions have been developed for it. With the help of these kernel functions, solving problems in higher-dimensional spaces becomes possible.
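
The geometric idea behind the margin can be sketched as follows. This is only an illustration under stated assumptions: the hyperplane is hand-picked rather than learned by an SVM solver, the points are hypothetical, and the Gaussian (RBF) kernel is shown merely as one common example of a kernel function; the text does not say which kernel was used in the study.

```python
from math import exp, sqrt


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: a similarity computed as if in a higher-dimensional space."""
    return exp(-gamma * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical 2-D points and a hand-picked hyperplane x1 + x2 - 3 = 0.
w, b = (1.0, 1.0), -3.0
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

distances = [signed_distance(w, b, p) for p in points]
margin = min(abs(d) for d in distances)

# The points closest to the hyperplane play the role of support vectors.
support_vectors = [p for p, d in zip(points, distances)
                   if abs(abs(d) - margin) < 1e-9]

print(margin, support_vectors)
```

An actual SVM searches over all hyperplanes for the one that makes this smallest distance as large as possible; the sketch only evaluates the distance for one fixed choice.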

ANNs are systems that contain processing nodes arranged in layers. In essence, an ANN has an input layer, a hidden layer, and an output layer. Each node in the hidden layer combines the inputs from the input layer into a single output value, which is passed on to the output layer. Associated with each node in the network is a weight. The weights in the network are determined in a training phase using training data. The network performance is then tested on the remaining data, or hold-out sample [19].
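
The layered structure described above can be sketched as a single forward pass. This is a minimal illustration, not the network used in the study: the sigmoid activation, the per-connection weights with a bias per node, and all numeric values are assumptions chosen for the example, and no training is performed.

```python
from math import exp


def sigmoid(z):
    """Logistic activation squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def layer(inputs, weights, biases):
    """Each node forms a weighted sum of its inputs and applies the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical weights: two inputs, two hidden nodes, one output node.
hidden_w = [[0.5, -0.5], [0.25, 0.75]]
hidden_b = [0.0, 0.0]
output_w = [[1.0, -1.0]]
output_b = [0.0]

output = forward([0.0, 0.0], hidden_w, hidden_b, output_w, output_b)
print(output)
```

In a trained network the weights would be adjusted on the training data until the outputs match the known targets; here they are fixed by hand only to show how values flow through the layers.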

Association rule mining has been one of the best studied descriptive mining methods since its introduction. Agrawal, Imieliński and Swami proposed a new approach to mining association rules in 1993 and designed a new algorithm, namely Apriori, which uses a two-phase search over itemsets based on their frequencies (Romero & Ventura, 2007). In the second stage of this study, the analyses are performed by using the Apriori algorithm. In association rule mining, support, rule support, confidence and lift values are the important parameters for evaluating the usefulness of rules. In this study, lift and support values are considered.
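
The rule measures mentioned above can be sketched with their textbook definitions. This is a minimal illustration, not the Apriori algorithm itself and not the SPSS Clementine output: support is taken here as the fraction of records containing every item of an itemset, and the records and item names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of records that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of records with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical records: opinions that each respondent agreed with.
transactions = [
    {"communication", "homework"},
    {"communication", "homework"},
    {"communication"},
    {"resources"},
]

rule = ({"communication"}, {"homework"})
print(support(transactions, rule[0] | rule[1]))
print(confidence(transactions, *rule))
print(lift(transactions, *rule))
```

A lift above 1 indicates that the antecedent and consequent occur together more often than would be expected if they were independent, which is why lift is a natural measure for ranking rules.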

Table 1. Variable names and available answers in the first part of the poll

Variable name | Type | Available answers and related distributions
Sex | Discrete | Male (50%) / Female (50%)
Age | Discrete | 18-25 (74.1%) / 26-35 (20.53%) / 36-40 (3.86%) / 41 and above (1.4%)
Frequency of access to Facebook | Discrete | Once a year (0.18%) / Once a month (2.98%) / Several times a week (25.26%) / Once a day (22.81%) / Several times a day (48.77%)
Facebook usage time | Discrete | Less than 15 mins. (32.28%) / Half an hour (39.82%) / 1 hour (14.39%) / 1-3 hours (8.6%) / More than 3 hours (4.74%)
Education level | Discrete | High School (5.96%) / Bachelor (70.35%) / Master (23.16%)
Membership in any group | Discrete | Yes (99.82%) / No (0.18%)
Membership in student groups | Discrete | Yes (86.49%) / No (13.51%)
Membership in common interest groups | Discrete | Yes (77.54%) / No (22.46%)
Membership in internet & tech groups | Discrete | Yes (27.02%) / No (72.98%)
Membership in organizations | Discrete | Yes (61.93%) / No (38.07%)

3 Data

Data was collected from 570 active Turkish Facebook users (students) with an online poll. This online poll consisted of two sections. In the first section, demographic characteristics of Facebook users, their frequency of Facebook usage, the length of time spent on Facebook, and memberships in Facebook groups were collected. In the second section, participants rated 11 opinions on a 10-point Likert scale, with answers ranging from 1 (strongly disagree) to 10 (strongly agree); examples include “Facebook contributes to communication between classmates” and “It’s useful for assigning tasks in classes and homework assignments”. Thus members’ views of Facebook in relation to its educational usage were sought.

The variable names of the first part and available answers are given in Table 1. Although the initial dataset was larger than 570 people, during the data cleaning and transforming steps, 13 people were removed due to the absence of sufficient information. Therefore, the final dataset comprised 570 people. In the dataset, the numbers of male and female participants are almost equal, and more than 400 participants are in the 18-25 age range. Furthermore, almost all students are at either undergraduate or graduate level.

4 Application of Data Mining

To discover important factors that affect Facebook usage time and access frequency to Facebook, the CART, CHAID, C5, artificial neural network and SVM algorithms, which are built into SPSS Clementine 12, were employed on the dataset at hand (see Fig. 1). The overall data is partitioned into 80% training and 20% testing sets, which are selected randomly. As the dataset consists of discrete valued variables, the true and false prediction rates are listed.
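
The evaluation protocol described above can be sketched as follows. This is a minimal illustration rather than the SPSS Clementine partitioning node: the function names, the fixed random seed and the placeholder records are assumptions made for the example.

```python
import random


def split_80_20(records, seed=0):
    """Shuffle a copy of the records and cut it into 80% training, 20% testing."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def classification_rates(actual, predicted):
    """Return (true classification %, false classification %)."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    true_rate = 100.0 * correct / len(actual)
    return true_rate, 100.0 - true_rate


# Placeholder records standing in for the 570 respondents.
records = list(range(570))
train, test = split_80_20(records)
print(len(train), len(test))

# Hypothetical actual and predicted classes for four test cases.
print(classification_rates(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
```

Because the two rates are complements, each pair of true and false classification percentages reported for a model adds up to 100%.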

According to the results (see Table 2), SVM achieves the most accurate predictions for both target variables. Therefore, the variable importance results of SVM are considered to be the most reliable. As can be seen in Fig. 2, sex, education level, membership in a group and membership in any common interest groups are the most important factors affecting Facebook usage time. Sex plays a crucial role in Facebook usage time, with an importance of 68%. Again, it can be clearly seen that age, membership in student groups and usage time are the most important factors affecting access frequency to Facebook. The importance of age for access frequency is more than 80%.

Table 2. Applied algorithms and prediction results

Target variable | Applied algorithm | True classification | False classification
Facebook usage | SVM | 62.63 % | 37.37 %
Facebook usage | ANN | 47.72 % | 52.28 %
Facebook usage | C5 | 47.54 % | 52.46 %
Facebook usage | CART | 43.68 % | 56.32 %
Facebook usage | CHAID | 41.40 % | 58.60 %
Access frequency to Facebook | SVM | 69.65 % | 30.35 %
Access frequency to Facebook | C5 | 55.79 % | 44.21 %
Access frequency to Facebook | CART | 52.81 % | 47.19 %
Access frequency to Facebook | CHAID | 50.35 % | 49.65 %
Access frequency to Facebook | ANN | 48.77 % | 51.23 %


References

Educational data mining: A survey from 1995 to 2005

Investigating faculty decisions to adopt Web 2.0 technologies: Theory and empirical tests

Mastering Data Mining: The Art and Science of Customer Relationship Management

Adults and social network websites

You have been poked: Exploring the uses and gratifications of Facebook among emerging adults
