David C. Wyld et al. (Eds) : SAI, CDKP, ICAITA, NeCoM, SEAS, CMCA, ASUC, Signal - 2014
pp. 83–96, 2014. © CS & IT-CSCP 2014 DOI : 10.5121/csit.2014.41109
Mohammed Bekkali and Abdelmonaime Lachkar
L.S.I.S, E.N.S.A,University Sidi Mohamed Ben Abdellah (USMBA),
Fez, Morocco
bekkalimohammed@gmail.com, abdelmonaime_lachkar@yahoo.fr
ABSTRACT
Twitter is a popular microblogging service where users create status messages (called
"tweets"). These tweets sometimes express opinions about different topics, and they are
presented to the user in chronological order. This format of presentation is useful since the
latest tweets are rich in recent news, which is generally more interesting than tweets about an
event that occurred long ago. However, merely presenting tweets in chronological order may be
overwhelming to the user, especially if he follows many accounts. Therefore, there is a need
to separate the tweets into different categories and then present the categories to the user.
Nowadays, Text Categorization (TC) is becoming more significant, especially for the Arabic
language, which is one of the most complex languages.
In this paper, in order to improve the accuracy of tweets categorization, a system based on
Rough Set Theory is proposed to enrich the document's representation. The effectiveness
of our system was evaluated and compared in terms of the F-measure of the Naïve Bayesian
and Support Vector Machine classifiers.
KEYWORDS
Arabic Language, Text Categorization, Rough Set Theory, Twitter, Tweets.
1. INTRODUCTION
Twitter is a popular micro-blogging service where users search for timely and social information.
As in the rest of the world, users in Arab countries engage with social media applications for
interacting and posting information, opinions, and ideas [1]. Users post short text messages called
tweets, which are limited to 140 characters [2] [3] in length and can be viewed by the user's
followers. These tweets sometimes express opinions about different topics, and they are presented to
the user in chronological order [4]. This format of presentation is useful to the user since the
latest tweets are generally more interesting than tweets about an event that occurred long
ago. However, merely presenting tweets in chronological order may be overwhelming to the user,
especially if he follows many accounts [5] [6]. Therefore, there is a great need to separate the tweets
into different categories and then present the categories to the user. Text Categorization (TC) is a
good way to solve this problem.
Text Categorization systems try to find a relation between a set of texts and a set of categories
(tags, classes). Machine learning is the tool that allows deciding whether a text belongs to a set
of predefined categories [6]. Several Text Categorization systems have been developed for
English and other European languages, yet very little research has been done on Arabic Text
Categorization [7]. Arabic is a highly inflected Semitic language with a very complex morphology
compared with English, and it requires a set of pre-processing steps before it can be manipulated.
In the process of Text Categorization the document must pass through a series of steps (Figure 1):
the different types of documents are transformed into raw text, the stop words (prepositions and
particles, considered irrelevant) are removed, and finally all words are stemmed. Stemming is the
process that extracts the root of a word by removing its affixes [8] [9] [10] [11] [12] [13] [14].
To represent the internal content of each document, the document must pass through the indexing
process after pre-processing. The indexing process consists of three phases [15]:
a) All the terms appearing in the document corpus are stored in the super vector;
b) Term selection, a kind of dimensionality reduction, which aims at proposing a new set of terms
in the super vector according to some criteria [16] [17] [18];
c) Term weighting, in which, for each term selected in phase (b) and for every document, a
weight is calculated by TF-IDF, which combines the definitions of term frequency and inverse
document frequency [19].
Finally, the classifier is built by learning the characteristics of each category from a training set of
documents. After the classifier is built, its effectiveness is tested by applying it to the test set
and checking the degree of correspondence between the obtained results and the labels encoded in the
corpus.
Note that one of the major problems in Text Categorization is the document's representation,
where we are still limited to the terms or words that occur in the document. In our work, we
believe that the representation of Arabic Tweets (which are short text messages) is a challenging and
crucial stage. It may impact positively or negatively on the accuracy of any Tweets
Categorization system, and therefore improving the representation step will necessarily lead to
the improvement of any Text Categorization system very greatly.
Figure 1. Architecture of TC System
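The three indexing phases above (building the super vector, selecting terms, and TF-IDF weighting) can be sketched as follows. This is a minimal illustration under the standard tf * log(N/df) definition cited in phase (c); the toy documents and function names are invented for the example, not taken from the authors' system.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf(t, d) = frequency of term t in document d
    idf(t)   = log(N / df(t)), where df(t) counts documents containing t
    weight   = tf * idf, the phase (c) weighting described above
    """
    n = len(docs)
    # Phase (a): the "super vector" of all terms occurring in the corpus
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Keep only terms present in the document (sparse representation)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in vocab if tf[t] > 0})
    return vectors

docs = [["sport", "match", "goal"], ["politics", "vote"], ["sport", "vote"]]
vecs = tfidf_vectors(docs)
# "goal" appears in a single document, so it receives a higher idf than
# "sport", which occurs in two of the three documents.
```

In a full system, phase (b) would additionally drop terms from the super vector according to a selection criterion (e.g. chi-square, as in the related work below) before the weights are computed.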
To overcome this problem, in this paper we propose a system for Tweets Categorization based on
Rough Set Theory (RST) [20] [21], a mathematical tool for dealing with vagueness and
uncertainty. RST was introduced by Pawlak in the early 1980s [20] and has been integrated in
many text mining applications, such as feature selection []; it has been successful in many
applications. In this work we propose to use the RST-based Upper Approximation to enrich the
tweet's representation with other terms in the corpus to which it has semantic links. In this
theory, each set in a universe is described by a pair of ordinary sets called the lower and upper
approximations, determined by an equivalence relation on the universe [20].
The remainder of this paper is organized as follows: we begin with a brief review of related
work on Arabic Tweets Categorization in the next section. Section III introduces Rough Set
Theory and its Tolerance Model; Section IV presents the two machine learning algorithms used in
our system for Text Categorization (TC): the Naïve Bayesian and Support Vector Machine
classifiers; Section V describes our proposed system for Arabic Tweets Categorization; Section VI
reports the experimental results; finally, Section VII concludes this paper and presents future
work and some perspectives.
2. RELATED WORK
A number of recent papers have addressed the categorization of tweets; most of them were tested
against English text [4] [30] [31]. Categorization systems that address Arabic tweets are very
rare in the literature [1]. The latter work, by Rehab Nasser et al., presents a roadmap for
understanding Arabic tweets through two main objectives. The first is to predict tweet popularity
in the Arab world. The second is to analyze the use of Arabic proverbs in tweets; the Arabic
proverbs classification model was labeled "Category" with four class values: sport, religious,
political, and ideational.
On the other hand, a wide range of Text Categorization systems based on Rough Set Theory have
been developed; most of them were tested against English text [39] [40]. Text Categorization
systems based on Rough Sets that address Arabic text are rare in the literature [41].
In Arabic Text Categorization, Sawaf [32] used statistical methods such as maximum entropy to
cluster Arabic news articles; the results derived by these methods were promising even without
morphological analysis. In [33], NB was applied to classify Arabic web data; the results showed
that the average accuracy was 68.78%. The work of Duwairi [34] describes a distance-based
classifier for Arabic text categorization. In [35], Laila et al. compared Manhattan distance and
Dice measures using an N-gram frequency statistical technique against Arabic data sets collected
from several online Arabic newspaper websites. The results showed that N-grams using the Dice
measure outperformed the Manhattan distance.
Mesleh et al. [36] used three classification algorithms, namely SVM, KNN and NB, to classify
1445 texts taken from online Arabic newspaper archives. The compiled texts were classified into
nine classes: Computer, Economics, Education, Engineering, Law, Medicine, Politics, Religion and
Sports. Chi-square statistics were used for feature selection. [36] reported that, "Compared to
other classification methods, their system shows a high classification effectiveness for Arabic
data set in terms of F-measure (F=88.11)".
Thabtah et al. [37] investigated an NB algorithm based on the chi-square feature selection
method. The experimental results, compared across different Arabic text categorization data sets,
provided evidence that feature selection often increases classification accuracy by removing rare
terms. In [38], NB and KNN were applied to classify Arabic text collected from online Arabic
newspapers.
The results show that the NB classifier outperformed KNN based on the cosine coefficient with
regard to the macro-F1, macro-recall and macro-precision measures.
Recently, the team of Hadni et al. [7] presented an Effective Arabic Stemmer Based Hybrid
Approach for Arabic Text Categorization.
Note that, in any Text Categorization system, the central point is the document and its
representation, which may impact positively or negatively on the accuracy of the system.
In the following section we present Rough Set Theory, its mathematical background, and the
Tolerance Rough Set Model, which is proposed to deal with text representation.
3. ROUGH SET THEORY
3.1. Rough Set Theory
In this section we present Rough Set Theory, which was originally developed as a tool for data
analysis and classification [20] [21]. It has been successfully applied in various tasks, such as
feature selection/extraction, rule synthesis and classification. The central point of Rough Set
Theory is the notion of set approximation: any set in U (a non-empty set of objects called the
universe) can be approximated by its lower and upper approximations. In order to define the lower
and upper approximations we need to introduce an indiscernibility relation, which can be any
equivalence relation R (reflexive, symmetric, transitive). For two objects x, y ∈ U, if xRy then
we say that x and y are indiscernible from each other. The indiscernibility relation R induces a
complete partition of the universe U into equivalence classes [x]_R, x ∈ U [22].
We define the lower and upper approximations of a set X, with regard to an approximation space
A = (U, R), respectively as:

L_R(X) = {x ∈ U : [x]_R ⊆ X} (1)

U_R(X) = {x ∈ U : [x]_R ∩ X ≠ ∅} (2)

Approximations can also be defined by means of a rough membership function. Given a rough
membership function µ_X : U → [0, 1] of a set X ⊆ U, the rough approximations are defined as:

L_R(X) = {x ∈ U : µ_X(x, X) = 1} (3)

U_R(X) = {x ∈ U : µ_X(x, X) > 0} (4)

where the rough membership function is given by:

µ_X(x, X) = |[x]_R ∩ X| / |[x]_R| (5)
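Equations (1)–(5) can be illustrated on a small invented universe. The sketch below (all names and data are hypothetical, chosen only for this example) computes the equivalence classes from a key function and then derives the lower approximation, the upper approximation, and the rough membership of each object:

```python
def equivalence_classes(universe, key):
    """Partition the universe by the equivalence relation xRy iff key(x) == key(y)."""
    classes = {}
    for x in universe:
        classes.setdefault(key(x), set()).add(x)
    return {x: classes[key(x)] for x in universe}  # map each x to [x]_R

def approximations(universe, classes, X):
    """Lower/upper approximations of X and the rough membership of each object."""
    lower = {x for x in universe if classes[x] <= X}   # [x]_R is a subset of X, eq. (1)
    upper = {x for x in universe if classes[x] & X}    # [x]_R intersects X, eq. (2)
    mu = {x: len(classes[x] & X) / len(classes[x]) for x in universe}  # eq. (5)
    return lower, upper, mu

U = {0, 1, 2, 3, 4, 5}
cls = equivalence_classes(U, key=lambda x: x // 2)  # classes {0,1}, {2,3}, {4,5}
X = {1, 2, 3}
lower, upper, mu = approximations(U, cls, X)
# {2,3} lies wholly inside X, so 2 and 3 are in the lower approximation;
# {0,1} only intersects X, so 0 and 1 appear only in the upper approximation.
```

Note that the definitions via the membership function, equations (3) and (4), give exactly the same two sets: µ_X = 1 characterizes the lower approximation and µ_X > 0 the upper one.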
Rough Set Theory can be applied to any data type, but when it comes to document representation
we use its Tolerance Model, described in the next section.
3.2. Tolerance Rough Set Model
Let D = {d_1, d_2, …, d_n} be a set of documents and T = {t_1, t_2, …, t_m} a set of index terms
for D. With the adoption of the vector space model, each document d_i is represented by a weight
vector {w_i1, w_i2, …, w_im}, where w_ij denotes the weight of index term j in document i. The
tolerance space is defined over the universe of all index terms U = T = {t_1, t_2, …, t_m} [23].
Let f_di(t_i) denote the number of occurrences of index term t_i in document d_i, and
f_D(t_i, t_j) the number of documents in D in which both index terms t_i and t_j occur. The
uncertainty function I_θ with regard to a threshold θ is defined as:

I_θ(t_i) = {t_j | f_D(t_i, t_j) ≥ θ} ∪ {t_i} (6)

Clearly, the above function satisfies the conditions of being reflexive and symmetric, so
I_θ(t_i) is the tolerance class of index term t_i. Thus we can define the membership function µ
for t_i ∈ T, X ⊆ T as [24]:

µ_X(t_i, X) = ν(I_θ(t_i), X) = |I_θ(t_i) ∩ X| / |I_θ(t_i)| (7)

Finally, the lower and upper approximations of any document d_i ⊆ T can be determined as:

L_R(d_i) = {t_i ∈ T : ν(I_θ(t_i), d_i) = 1} (8)

U_R(d_i) = {t_i ∈ T : ν(I_θ(t_i), d_i) > 0} (9)
Once the document handling is finished, the results become the input of a Text Categorization
system. In the following section we present two of the most popular machine learning algorithms:
the Naïve Bayesian and the Support Vector Machine classifiers.
4. MACHINE LEARNING BASED TC
TC is the task of automatically sorting a set of documents into categories from a predefined set.
This section covers two of the best-known machine learning algorithms for TC:
Naïve Bayesian (NB) and Support Vector Machine (SVM).
4.1. Naïve Bayesian Classifier
NB is a simple probabilistic classifier based on applying Bayes' theorem; it is a powerful,
easy and language-independent method [25].
When the NB classifier is applied to the TC problem we use equation (10):

p(class | document) = p(class) · p(document | class) / p(document) (10)

where:
P(class | document) is the probability that a given document D belongs to a given class C;
P(document) is the probability of a document; it is a constant that can be ignored;
P(class) is the probability of a class, calculated as the number of documents in the category
divided by the number of documents in all categories;
P(document | class) is the probability of a document given a class; since a document can be
represented by a set of words:

p(document | class) = ∏_i p(word_i | class) (11)

so:

p(class | document) = p(class) · ∏_i p(word_i | class) (12)

where: