David C. Wyld et al. (Eds) : SAI, CDKP, ICAITA, NeCoM, SEAS, CMCA, ASUC, Signal - 2014
pp. 83–96, 2014. © CS & IT-CSCP 2014 DOI : 10.5121/csit.2014.41109
Mohammed Bekkali and Abdelmonaime Lachkar
L.S.I.S, E.N.S.A,University Sidi Mohamed Ben Abdellah (USMBA),
Fez, Morocco
bekkalimohammed@gmail.com, abdelmonaime_lachkar@yahoo.fr
ABSTRACT
Twitter is a popular microblogging service where users create status messages (called
"tweets"). These tweets sometimes express opinions about different topics, and they are
presented to the user in chronological order. This format of presentation is useful since the
latest tweets are rich in recent news, which is generally more interesting than tweets about an
event that occurred long ago. However, merely presenting tweets in chronological order may be
overwhelming to the user, especially if he follows many accounts. Therefore, there is a need
to separate the tweets into different categories and then present the categories to the user.
Nowadays, Text Categorization (TC) is becoming more significant, especially for the Arabic
language, which is one of the most complex languages.
In this paper, in order to improve the accuracy of tweets categorization, a system based on
Rough Set Theory is proposed to enrich the document's representation. The effectiveness
of our system was evaluated and compared in terms of the F-measure of the Naïve Bayesian
and Support Vector Machine classifiers.
KEYWORDS
Arabic Language, Text Categorization, Rough Set Theory, Twitter, Tweets.
1. INTRODUCTION
Twitter is a popular micro-blogging service where users search for timely and social information.
As in the rest of the world, users in Arab countries engage with social media applications for
interacting and posting information, opinions, and ideas [1]. Users post short text messages called
tweets, which are limited to 140 characters [2] [3] in length and can be viewed by the user's
followers. These tweets sometimes express opinions about different topics, and they are presented to
the user in chronological order [4]. This format of presentation is useful to the user since the
latest tweets are generally more interesting than tweets about an event that occurred long
ago. However, merely presenting tweets in chronological order may be overwhelming to the user,
especially if he follows many accounts [5] [6]. Therefore, there is a great need to separate the tweets
into different categories and then present the categories to the user. Text Categorization (TC) is a
good way to solve this problem.
Text Categorization systems try to find a relation between a set of texts and a set of categories
(tags, classes). Machine learning is the tool that allows deciding whether a text belongs to a set
of predefined categories [6]. Several Text Categorization systems have been developed for
English and other European languages, yet very little research has been done on Arabic Text
Categorization [7]. Arabic is a highly inflected Semitic language with a very complex morphology
compared with English, and it requires a set of pre-processing steps before it can be manipulated.
In the process of Text Categorization the document must pass through a series of steps (Figure 1):
the different types of documents are transformed into raw text, the stop words (prepositions and
particles, considered irrelevant) are removed, and finally all words are stemmed. Stemming is the
process that extracts the root of a word by removing its affixes [8] [9] [10] [11] [12] [13] [14].
To represent the internal content of each document, the document must pass through the indexing
process after pre-processing. The indexing process consists of three phases [15]:
a) All the terms appearing in the document corpus are stored in the super vector;
b) Term selection, a kind of dimensionality reduction, which aims at proposing a new set of terms
in the super vector according to some criteria [16] [17] [18];
c) Term weighting, in which, for each term selected in phase (b) and for every document, a
weight is calculated by TF-IDF, which combines the definitions of term frequency and inverse
document frequency [19].
Finally, the classifier is built by learning the characteristics of each category from a training set of
documents. After the classifier is built, its effectiveness is tested by applying it to the test set
and checking the degree of correspondence between the obtained results and the labels encoded in the
corpus.
Note that one of the major problems in Text Categorization is the document's representation,
where we are still limited to the terms or words that occur in the document. In our work, we
believe that the representation of Arabic Tweets (which are short text messages) is a challenging and
crucial stage. It may impact positively or negatively on the accuracy of any Tweets
Categorization system, and therefore improving the representation step will necessarily lead to
the improvement of any Text Categorization system very greatly.
Figure 1. Architecture of TC System
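The three indexing phases above (building the super vector, selecting terms, and TF-IDF weighting) can be sketched as follows. This is a minimal illustration under the standard tf * log(N/df) definition cited in phase (c); the toy documents and function names are invented for the example, not taken from the authors' system.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf(t, d) = frequency of term t in document d
    idf(t)   = log(N / df(t)), where df(t) counts documents containing t
    weight   = tf * idf, the phase (c) weighting described above
    """
    n = len(docs)
    # Phase (a): the "super vector" of all terms occurring in the corpus
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Keep only terms present in the document (sparse representation)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in vocab if tf[t] > 0})
    return vectors

docs = [["sport", "match", "goal"], ["politics", "vote"], ["sport", "vote"]]
vecs = tfidf_vectors(docs)
# "goal" appears in a single document, so it receives a higher idf than
# "sport", which occurs in two of the three documents.
```

In a full system, phase (b) would additionally drop terms from the super vector according to a selection criterion (e.g. chi-square, as in the related work below) before the weights are computed.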
To overcome this problem, in this paper we propose a system for Tweets Categorization based on
Rough Set Theory (RST) [20] [21], a mathematical tool for dealing with vagueness and
uncertainty. RST was introduced by Pawlak in the early 1980s [20] and has been integrated in
many text mining applications, such as feature selection []; it has been successful in many
applications. In this work we propose to use the RST-based Upper Approximation to enrich the
tweet's representation with other terms in the corpus to which it has semantic links. In this
theory, each set in a universe is described by a pair of ordinary sets called the lower and upper
approximations, determined by an equivalence relation on the universe [20].
The remainder of this paper is organized as follows: we begin with a brief review of related
work on Arabic Tweets Categorization in the next section. Section III introduces Rough Set
Theory and its Tolerance Model; Section IV presents the two machine learning algorithms used in
our system for Text Categorization (TC): the Naïve Bayesian and Support Vector Machine
classifiers; Section V describes our proposed system for Arabic Tweets Categorization; Section VI
reports the experimental results; finally, Section VII concludes this paper and presents future
work and some perspectives.
2. RELATED WORK
A number of recent papers have addressed the categorization of tweets; most of them were tested
against English text [4] [30] [31]. Categorization systems that address Arabic tweets are very
rare in the literature [1]. The latter work, by Rehab Nasser et al., presents a roadmap for
understanding Arabic tweets through two main objectives. The first is to predict tweet popularity
in the Arab world. The second is to analyze the use of Arabic proverbs in tweets; the Arabic
proverbs classification model was labeled "Category" with four class values: sport, religious,
political, and ideational.
On the other hand, a wide range of Text Categorization systems based on Rough Set Theory have
been developed; most of them were tested against English text [39] [40]. Text Categorization
systems based on Rough Sets that address Arabic text are rare in the literature [41].
In Arabic Text Categorization, Sawaf [32] used statistical methods such as maximum entropy to
cluster Arabic news articles; the results derived by these methods were promising even without
morphological analysis. In [33], NB was applied to classify Arabic web data; the results showed
that the average accuracy was 68.78%. The work of Duwairi [34] describes a distance-based
classifier for Arabic text categorization. In [35], Laila et al. compared Manhattan distance and
Dice measures using an N-gram frequency statistical technique against Arabic data sets collected
from several online Arabic newspaper websites. The results showed that N-grams using the Dice
measure outperformed the Manhattan distance.
Mesleh et al. [36] used three classification algorithms, namely SVM, KNN and NB, to classify
1445 texts taken from online Arabic newspaper archives. The compiled texts were classified into
nine classes: Computer, Economics, Education, Engineering, Law, Medicine, Politics, Religion and
Sports. Chi-square statistics were used for feature selection. [36] reported that, "Compared to
other classification methods, their system shows a high classification effectiveness for Arabic
data set in terms of F-measure (F=88.11)".
Thabtah et al. [37] investigated an NB algorithm based on the chi-square feature selection
method. The experimental results, compared across different Arabic text categorization data sets,
provided evidence that feature selection often increases classification accuracy by removing rare
terms. In [38], NB and KNN were applied to classify Arabic text collected from online Arabic
newspapers.
The results show that the NB classifier outperformed KNN based on the cosine coefficient with
regard to the macro-F1, macro-recall and macro-precision measures.
Recently, the team of Hadni et al. [7] presented an Effective Arabic Stemmer Based Hybrid
Approach for Arabic Text Categorization.
Note that, in any Text Categorization system, the central point is the document and its
representation, which may impact positively or negatively on the accuracy of the system.
In the following section we present Rough Set Theory, its mathematical background, and the
Tolerance Rough Set Model, which is proposed to deal with text representation.
3. ROUGH SET THEORY
3.1. Rough Set Theory
In this section we present Rough Set Theory, which was originally developed as a tool for data
analysis and classification [20] [21]. It has been successfully applied in various tasks, such as
feature selection/extraction, rule synthesis and classification. The central point of Rough Set
Theory is the notion of set approximation: any set in U (a non-empty set of objects called the
universe) can be approximated by its lower and upper approximations. In order to define the lower
and upper approximations we need to introduce an indiscernibility relation, which can be any
equivalence relation R (reflexive, symmetric, transitive). For two objects x, y ∈ U, if xRy then
we say that x and y are indiscernible from each other. The indiscernibility relation R induces a
complete partition of the universe U into equivalence classes [x]_R, x ∈ U [22].
We define the lower and upper approximations of a set X, with regard to an approximation space
A = (U, R), respectively as:

L_R(X) = {x ∈ U : [x]_R ⊆ X} (1)

U_R(X) = {x ∈ U : [x]_R ∩ X ≠ ∅} (2)

Approximations can also be defined by means of a rough membership function. Given a rough
membership function µ_X : U → [0, 1] of a set X ⊆ U, the rough approximations are defined as:

L_R(X) = {x ∈ U : µ_X(x, X) = 1} (3)

U_R(X) = {x ∈ U : µ_X(x, X) > 0} (4)

where the rough membership function is given by:

µ_X(x, X) = |[x]_R ∩ X| / |[x]_R| (5)
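Equations (1)–(5) can be illustrated on a small invented universe. The sketch below (all names and data are hypothetical, chosen only for this example) computes the equivalence classes from a key function and then derives the lower approximation, the upper approximation, and the rough membership of each object:

```python
def equivalence_classes(universe, key):
    """Partition the universe by the equivalence relation xRy iff key(x) == key(y)."""
    classes = {}
    for x in universe:
        classes.setdefault(key(x), set()).add(x)
    return {x: classes[key(x)] for x in universe}  # map each x to [x]_R

def approximations(universe, classes, X):
    """Lower/upper approximations of X and the rough membership of each object."""
    lower = {x for x in universe if classes[x] <= X}   # [x]_R is a subset of X, eq. (1)
    upper = {x for x in universe if classes[x] & X}    # [x]_R intersects X, eq. (2)
    mu = {x: len(classes[x] & X) / len(classes[x]) for x in universe}  # eq. (5)
    return lower, upper, mu

U = {0, 1, 2, 3, 4, 5}
cls = equivalence_classes(U, key=lambda x: x // 2)  # classes {0,1}, {2,3}, {4,5}
X = {1, 2, 3}
lower, upper, mu = approximations(U, cls, X)
# {2,3} lies wholly inside X, so 2 and 3 are in the lower approximation;
# {0,1} only intersects X, so 0 and 1 appear only in the upper approximation.
```

Note that the definitions via the membership function, equations (3) and (4), give exactly the same two sets: µ_X = 1 characterizes the lower approximation and µ_X > 0 the upper one.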
Rough Set Theory can be applied to any data type, but when it comes to document representation
we use its Tolerance Model, described in the next section.
3.2. Tolerance Rough Set Model
Let D = {d_1, d_2, …, d_n} be a set of documents and T = {t_1, t_2, …, t_m} a set of index terms
for D. With the adoption of the vector space model, each document d_i is represented by a weight
vector {w_i1, w_i2, …, w_im}, where w_ij denotes the weight of index term j in document i. The
tolerance space is defined over the universe of all index terms U = T = {t_1, t_2, …, t_m} [23].
Let f_di(t_i) denote the number of occurrences of index term t_i in document d_i, and
f_D(t_i, t_j) the number of documents in D in which both index terms t_i and t_j occur. The
uncertainty function I_θ with regard to a threshold θ is defined as:

I_θ(t_i) = {t_j | f_D(t_i, t_j) ≥ θ} ∪ {t_i} (6)

Clearly, the above function satisfies the conditions of being reflexive and symmetric, so
I_θ(t_i) is the tolerance class of index term t_i. Thus we can define the membership function µ
for t_i ∈ T, X ⊆ T as [24]:

µ_X(t_i, X) = ν(I_θ(t_i), X) = |I_θ(t_i) ∩ X| / |I_θ(t_i)| (7)

Finally, the lower and upper approximations of any document d_i ⊆ T can be determined as:

L_R(d_i) = {t_i ∈ T : ν(I_θ(t_i), d_i) = 1} (8)

U_R(d_i) = {t_i ∈ T : ν(I_θ(t_i), d_i) > 0} (9)
Once the document handling is finished, the results become the input of a Text Categorization
system. In the following section we present two of the most popular machine learning algorithms:
the Naïve Bayesian and the Support Vector Machine classifiers.
4. MACHINE LEARNING BASED TC
TC is the task of automatically sorting a set of documents into categories from a predefined set.
This section covers two of the best-known machine learning algorithms for TC:
Naïve Bayesian (NB) and Support Vector Machine (SVM).
4.1. Naïve Bayesian Classifier
NB is a simple probabilistic classifier based on applying Bayes' theorem; it is a powerful,
easy and language-independent method [25].
When the NB classifier is applied to the TC problem we use equation (10):

p(class | document) = p(class) · p(document | class) / p(document) (10)

where:
P(class | document) is the probability that a given document D belongs to a given class C;
P(document) is the probability of a document; it is a constant that can be ignored;
P(class) is the probability of a class, calculated as the number of documents in the category
divided by the number of documents in all categories;
P(document | class) is the probability of a document given a class; since a document can be
represented by a set of words:

p(document | class) = ∏_i p(word_i | class) (11)

so:

p(class | document) = p(class) · ∏_i p(word_i | class) (12)

where: