
Arabic tweets categorization based on rough set theory


David C. Wyld et al. (Eds) : SAI, CDKP, ICAITA, NeCoM, SEAS, CMCA, ASUC, Signal - 2014
pp. 83–96, 2014. © CS & IT-CSCP 2014 DOI : 10.5121/csit.2014.41109








Mohammed Bekkali and Abdelmonaime Lachkar
L.S.I.S, E.N.S.A, University Sidi Mohamed Ben Abdellah (USMBA),
Fez, Morocco
bekkalimohammed@gmail.com, abdelmonaime_lachkar@yahoo.fr
ABSTRACT
Twitter is a popular microblogging service where users create status messages (called
“tweets”). These tweets sometimes express opinions about different topics, and are presented to
the user in chronological order. This format of presentation is useful to the user since the
latest tweets are rich in recent news, which is generally more interesting than tweets about
an event that occurred a long time ago. However, presenting tweets only in chronological order
may be overwhelming to the user, especially if he has many followers. Therefore, there is a need
to separate the tweets into different categories and then present the categories to the user.
Nowadays, Text Categorization (TC) is becoming more significant, especially for the Arabic
language, which is one of the most complex languages.
In this paper, in order to improve the accuracy of tweet categorization, a system based on
Rough Set Theory is proposed to enrich the document's representation. The effectiveness
of our system was evaluated and compared in terms of the F-measure of the Naïve Bayesian
classifier and the Support Vector Machine classifier.
KEYWORDS
Arabic Language, Text Categorization, Rough Set Theory, Twitter, Tweets.
1. INTRODUCTION
Twitter is a popular micro-blogging service where users search for timely and social information.
As in the rest of the world, users in Arab countries engage in social media applications for
interacting and posting information, opinions, and ideas [1]. Users post short text messages called
tweets, which are limited to 140 characters in length [2] [3] and can be viewed by the user's
followers. These tweets sometimes express opinions about different topics, and are presented to
the user in chronological order [4]. This format of presentation is useful to the user since the
latest tweets are generally more interesting than tweets about an event that occurred a long time
ago. However, presenting tweets only in chronological order may be overwhelming to the user,
especially if he has many followers [5] [6]. Therefore, there is a great need to separate the tweets
into different categories and then present the categories to the user. Text Categorization (TC) is a
good way to solve this problem.
Text Categorization systems try to find a relation between a set of texts and a set of categories
(tags, classes). Machine learning is the tool that allows deciding whether a text belongs to a set
of predefined categories [6]. Several Text Categorization systems have been developed for
English and other European languages, yet very little research has been carried out on
Arabic Text Categorization [7]. Arabic is a highly inflected language that requires a
set of pre-processing steps before it can be manipulated; it is a Semitic language with a very complex
morphology compared with English. In the process of Text Categorization, the document must
pass through a series of steps (Figure 1): transforming the different types of documents into raw
text; removing the stop words, which are considered irrelevant words (prepositions and particles);
and finally stemming all words. Stemming is the process of extracting the root from
a word by removing its affixes [8] [9] [10] [11] [12] [13] [14]. To build the internal representation
of each document, the document must go through the indexing process after pre-processing. The
indexing process consists of three phases [15]:
a) All the terms that appear in the document corpus are stored in the super vector.
b) Term selection is a kind of dimensionality reduction; it aims at proposing a new set of terms
in the super vector according to some criteria [16] [17] [18].
c) Term weighting, in which, for each term selected in phase (b) and for every document, a
weight is calculated by TF-IDF, which combines the definitions of term frequency and inverse
document frequency [19].
Finally, the classifier is built by learning the characteristics of each category from a training set of
documents. After the classifier is built, its effectiveness is tested by applying it to the test set
and verifying the degree of correspondence between the obtained results and those encoded in the
corpus.
Note that one of the major problems in Text Categorization is the document's representation,
where we are still limited to the terms or words that occur in the document. In our work, we
believe that the representation of Arabic tweets (which are short text messages) is a challenging and
crucial stage. It may impact positively or negatively the accuracy of any tweet
categorization system, and therefore improving the representation step will necessarily lead to a
great improvement of any Text Categorization system.
Figure 1. Architecture of TC System
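The three indexing phases (a) to (c) described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the stop-word list, the English toy corpus, the `min_df` cut-off used for term selection, and the function names `preprocess` and `build_tfidf` are all hypothetical, and Arabic stemming is omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list and toy corpus; placeholders, not the paper's data.
STOP_WORDS = {"the", "a", "of", "in", "on", "is"}

def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words (stemming omitted)."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

def build_tfidf(corpus, min_df=1):
    docs = [preprocess(text) for text in corpus]
    # Phase (a): collect every term of the corpus (the "super vector"),
    # together with its document frequency.
    df = Counter(term for doc in docs for term in set(doc))
    # Phase (b): term selection, here a simple document-frequency cut-off
    # (an assumed criterion; the paper cites others).
    vocab = sorted(term for term, count in df.items() if count >= min_df)
    # Phase (c): TF-IDF weight for every (document, selected term) pair.
    n_docs = len(docs)
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, weights

corpus = [
    "the match ended in a draw",
    "the election results are in",
    "a late goal decided the match",
]
vocab, weights = build_tfidf(corpus)
```

Each row of `weights` is the weight vector of one document over `vocab`; a term that is absent from a document gets weight 0, which is the sparsity problem that short tweets make worse.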

Computer Science & Information Technology (CS & IT) 85
To overcome this problem, in this paper we propose a system for tweet categorization based on
Rough Set Theory (RST) [20] [21], a mathematical tool to deal with vagueness and
uncertainty. RST was introduced by Pawlak in the early 1980s [20] and has been integrated into
many text mining applications, such as feature selection []; it has been successful in many
applications. In this work we propose to use the RST upper approximation to enrich the tweet's
representation with other terms in the corpus with which it has semantic links. In
this theory, each set in a universe is described by a pair of ordinary sets called lower and upper
approximations, determined by an equivalence relation in the universe [20].
The remainder of this paper is organized as follows: we begin with a brief review of
related work on Arabic tweet categorization in the next section. Section 3 introduces
Rough Set Theory and its Tolerance Model; Section 4 presents two machine learning
algorithms for Text Categorization (TC), the Naïve Bayesian and Support Vector Machine classifiers
used in our system; Section 5 describes our proposed system for Arabic tweet categorization;
Section 6 reports the experimental results; finally, Section 7 concludes this paper and presents
future work and some perspectives.
2. RELATED WORK
A number of recent papers have addressed the categorization of tweets; most of them were tested
on English text [4] [30] [31]. Categorization systems that address Arabic
tweets, by contrast, are very rare in the literature [1]. The work in [1], by Rehab Nasser et al., presents
a roadmap for understanding Arabic tweets through two main objectives. The first is to predict
tweet popularity in the Arab world. The second is to analyze the use of Arabic proverbs in
tweets. The Arabic proverbs classification model used a label named "Category" with four class
values: sport, religious, political, and ideational.
On the other hand, a wide range of Text Categorization systems based on Rough Set Theory have been
developed; most of them were tested on English text [39] [40]. Text
Categorization systems based on Rough Sets that address Arabic text are rare in the literature [41].
In Arabic Text Categorization, Sawaf [32] uses statistical methods such as
maximum entropy to cluster Arabic news articles; the results obtained by these methods were
promising even without morphological analysis. In [33], NB was applied to classify Arabic web data;
the results showed that the average accuracy was 68.78%. The work of Duwairi [34] describes a
distance-based classifier for Arabic text categorization. In [35], Laila et al. compared the
Manhattan distance and Dice measures using an N-gram frequency statistical technique on
Arabic data sets collected from several online Arabic newspaper websites. The results showed
that N-grams with the Dice measure outperformed the Manhattan distance.
Mesleh et al. [36] used three classification algorithms, namely SVM, KNN and NB, to classify
1445 texts taken from online Arabic newspaper archives. The compiled texts were classified into
nine classes: Computer, Economics, Education, Engineering, Law, Medicine, Politics, Religion
and Sports. Chi Square statistics were used for feature selection. The authors of [36] report that
"Compared to other classification
methods, their system shows a high classification effectiveness for Arabic data set in terms of F
measure (F=88.11)".
Thabtah et al. [37] investigate the NB algorithm based on the Chi Square feature selection method. The
experimental results, compared across different Arabic text categorization data sets, provided
evidence that feature selection often increases classification accuracy by removing rare terms. In
[38], NB and KNN were applied to classify Arabic text collected from online Arabic newspapers.
The results show that the NB classifier outperformed KNN based on the Cosine coefficient with
regard to macro F1, macro recall and macro precision measures.
Recently, Hadni et al. [7] presented an effective Arabic stemmer based on a hybrid approach for
Arabic Text Categorization.
Note that in any Text Categorization system the central point is the document and its
representation, which may impact positively or negatively the accuracy of the system.
In the following section we present Rough Set Theory, its mathematical background, and
the Tolerance Rough Set Model, which is proposed to deal with text representation.
3. ROUGH SET THEORY
3.1. Rough Set Theory
In this section we present Rough Set Theory, which was originally developed as a tool for data
analysis and classification [20] [21]. It has been successfully applied in various tasks, such as
feature selection/extraction, rule synthesis and classification. The central point of Rough Set
Theory is the notion of set approximation: any set in $U$ (a non-empty set of objects called the
universe) can be approximated by its lower and upper approximations. In order to define the lower and
upper approximations we need to introduce an indiscernibility relation, which could be any
equivalence relation $R$ (reflexive, symmetric, transitive). For two objects $x, y \in U$, if $xRy$ then we
say that $x$ and $y$ are indiscernible from each other. The indiscernibility relation $R$ induces a
complete partition of the universe $U$ into equivalence classes $[x]_R$, $x \in U$ [22].
We define the lower and upper approximations of a set $X$, with regard to an approximation space
denoted by $A = (U, R)$, respectively as:
$$L_R(X) = \{x \in U : [x]_R \subseteq X\} \qquad (1)$$
$$U_R(X) = \{x \in U : [x]_R \cap X \neq \emptyset\} \qquad (2)$$
Approximations can also be defined by means of a rough membership function. Given a rough
membership function $\mu_X : U \to [0, 1]$ of a set $X \subseteq U$, the rough approximations are defined as:
$$L_R(X) = \{x \in U : \mu_X(x, X) = 1\} \qquad (3)$$
$$U_R(X) = \{x \in U : \mu_X(x, X) > 0\} \qquad (4)$$
Note that the rough membership function is given as:
$$\mu_X(x, X) = \frac{|[x]_R \cap X|}{|[x]_R|} \qquad (5)$$
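As an illustration of equations (1) to (5), the following minimal Python sketch computes the rough membership function and both approximations on a toy universe. The universe, the key function that induces the equivalence relation, and all function names are assumptions made for this example only.

```python
def equivalence_classes(universe, key):
    """Partition the universe; x R y holds exactly when key(x) == key(y)."""
    classes = {}
    for x in universe:
        classes.setdefault(key(x), set()).add(x)
    return classes

def membership(x, X, classes, key):
    """Rough membership, Eq. (5): |[x]_R ∩ X| / |[x]_R|."""
    block = classes[key(x)]
    return len(block & X) / len(block)

def lower_approx(universe, X, classes, key):
    """Eq. (3): objects whose whole equivalence class lies inside X."""
    return {x for x in universe if membership(x, X, classes, key) == 1}

def upper_approx(universe, X, classes, key):
    """Eq. (4): objects whose equivalence class intersects X."""
    return {x for x in universe if membership(x, X, classes, key) > 0}

# Toy universe with equivalence classes {1,2}, {3,4}, {5,6}, {7,8}.
U = set(range(1, 9))
key = lambda x: (x - 1) // 2
classes = equivalence_classes(U, key)

X = {1, 2, 3, 7}
lower = lower_approx(U, X, classes, key)   # {1, 2}
upper = upper_approx(U, X, classes, key)   # {1, 2, 3, 4, 7, 8}
```

Here the class {1, 2} is fully contained in X, so it forms the lower approximation, while {3, 4} and {7, 8} only overlap X and therefore appear only in the upper approximation.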
Rough Set Theory applies to any data type, but for document representation we use its
Tolerance Model, described in the next section.
3.2. Tolerance Rough Set Model
Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of documents and $T = \{t_1, t_2, \ldots, t_m\}$ the set of index terms for $D$. With the
adoption of the vector space model, each document $d_i$ is represented by a weight vector
$\{w_{i1}, w_{i2}, \ldots, w_{im}\}$, where $w_{ij}$ denotes the weight of index term $j$ in document $i$. The tolerance space is
defined over a universe of all index terms $U = T = \{t_1, t_2, \ldots, t_m\}$ [23].

Let $f_{d_i}(t_i)$ denote the number of occurrences of index term $t_i$ in document $d_i$, and let $f_D(t_i, t_j)$ denote the number of
documents in $D$ in which both index terms $t_i$ and $t_j$ occur. The uncertainty function $I$ with regard
to a threshold $\theta$ is defined as:
$$I_\theta(t_i) = \{t_j \mid f_D(t_i, t_j) \geq \theta\} \cup \{t_i\} \qquad (6)$$
Clearly, the above function satisfies the conditions of being reflexive and symmetric, so $I_\theta(t_i)$ is the
tolerance class of index term $t_i$. Thus we can define the membership function $\mu$ for $t_i \in T$, $X \subseteq T$
as [24]:
$$\mu_X(t_i, X) = \nu(I_\theta(t_i), X) = \frac{|I_\theta(t_i) \cap X|}{|I_\theta(t_i)|} \qquad (7)$$
Finally, the lower and upper approximations of any document $d_i \subseteq T$ can be determined as:
$$L_R(d_i) = \{t_i \in T : \nu(I_\theta(t_i), d_i) = 1\} \qquad (8)$$
$$U_R(d_i) = \{t_i \in T : \nu(I_\theta(t_i), d_i) > 0\} \qquad (9)$$
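Equations (6) to (9) can be sketched as follows. This is a minimal illustration rather than the authors' system: documents are modelled as plain sets of terms, the toy English corpus and the threshold value `theta=2` are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def tolerance_classes(docs, theta):
    """Eq. (6): I_theta(t_i) = {t_j | f_D(t_i, t_j) >= theta} ∪ {t_i}."""
    cooc = defaultdict(int)  # f_D(t_i, t_j): number of documents containing both terms
    terms = set()
    for doc in docs:
        terms |= doc
        for a, b in combinations(sorted(doc), 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for t in terms}  # reflexive: every term belongs to its own class
    for (a, b), count in cooc.items():
        if count >= theta:             # symmetric: add each term to the other's class
            classes[a].add(b)
            classes[b].add(a)
    return classes

def nu(tol_class, doc):
    """Eq. (7): share of the tolerance class that lies inside the document."""
    return len(tol_class & doc) / len(tol_class)

def lower_approx(doc, classes):
    """Eq. (8)."""
    return {t for t, c in classes.items() if nu(c, doc) == 1}

def upper_approx(doc, classes):
    """Eq. (9)."""
    return {t for t, c in classes.items() if nu(c, doc) > 0}

# Toy corpus (assumed data): each document is a set of index terms.
docs = [
    {"goal", "match", "team"},
    {"goal", "match", "league"},
    {"vote", "election"},
    {"vote", "election", "party"},
]
classes = tolerance_classes(docs, theta=2)

short_tweet = {"goal"}
enriched = upper_approx(short_tweet, classes)   # {"goal", "match"}
```

The upper approximation adds "match" to a tweet that only contains "goal", because the two terms co-occur in at least `theta` documents. This is the enrichment effect the representation step relies on; a larger `theta` gives smaller tolerance classes and less enrichment.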
Once the documents have been processed, the results serve as the input of any Text Categorization
system. In the following section we present two of the most popular machine learning
algorithms, the Naïve Bayesian and Support Vector Machine classifiers.
4. BASED MACHINE LEARNING
TC is the task of automatically sorting a set of documents into categories from a predefined set.
This section covers two of the well-known machine learning algorithms used for TC:
Naïve Bayesian (NB) and Support Vector Machine (SVM).
4.1. Naïve Bayesian Classifier
NB is a simple probabilistic classifier based on applying Bayes' theorem; it is a powerful,
easy and language-independent method [25].
When the NB classifier is applied to the TC problem we use equation (10):
$$p(\text{class} \mid \text{document}) = \frac{p(\text{class}) \cdot p(\text{document} \mid \text{class})}{p(\text{document})} \qquad (10)$$
where:
p(class | document): the probability that a given document D belongs to a given class C.
p(document): the probability of a document; it is a constant that can be ignored.
p(class): the probability of a class; it is calculated as the number of documents in the category
divided by the number of documents in all categories.
p(document | class): the probability of a document given a class; documents can be
represented by a set of words:
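$$p(\text{document} \mid \text{class}) = \prod_i p(\text{word}_i \mid \text{class}) \qquad (11)$$
so, ignoring the constant p(document):
$$p(\text{class} \mid \text{document}) = p(\text{class}) \cdot \prod_i p(\text{word}_i \mid \text{class}) \qquad (12)$$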
where:
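Equations (10) to (12) can be illustrated with a small word-count Naïve Bayes sketch. Several choices here are assumptions that do not come from the paper: add-one (Laplace) smoothing is used to estimate p(word_i | class), scores are computed in log space to avoid numerical underflow, and the toy training set and the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(words, model):
    """Return the class maximizing p(class) * prod_i p(word_i | class), Eq. (12)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log p(class): documents in the class over documents in all classes.
        score = math.log(class_counts[label] / n_docs)
        total = sum(word_counts[label].values())
        for w in words:
            # log p(word_i | class) with add-one smoothing (assumed, not from the paper).
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (assumed): (list of words, class label).
train = [
    (["goal", "match", "team"], "sport"),
    (["match", "league", "goal"], "sport"),
    (["vote", "election", "party"], "politics"),
    (["election", "vote", "result"], "politics"),
]
model = train_nb(train)
label = predict_nb(["goal", "league"], model)   # "sport"
```

Because p(document) is the same for every class, it is dropped from the comparison, which matches the remark that it is a constant that can be ignored.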
