Author

Tarek F. Gharib

Bio: Tarek F. Gharib is an academic researcher from Ain Shams University. The author has contributed to research in topics: Cluster analysis & WordNet. The author has an h-index of 10 and has co-authored 65 publications receiving 418 citations. Previous affiliations of Tarek F. Gharib include World Islamic Sciences and Education University & Michigan State University.


Papers
Journal ArticleDOI
TL;DR: Experimental results show that the Rocchio classifier gives better results when the feature set is small, while SVM outperforms the other classifiers when the feature set is large enough; the leave-one-out method gives more realistic results than using the training set as the test set.
Abstract: Text classification (TC) is the process of classifying documents into a predefined set of categories based on their content. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. In this paper we apply the Support Vector Machines (SVM) model to the classification of Arabic text documents. The results are compared with those of the traditional classifiers: the Bayes classifier, the K-Nearest Neighbor classifier and the Rocchio classifier. Two experiments are used to test the different classifiers: the first uses the training set as the test set, and the second uses the leave-one-out testing method. Experimental results on a set of 1132 documents show that the Rocchio classifier gives better results when the feature set is small, while SVM outperforms the other classifiers when the feature set is large enough. The classification rate exceeds 90% when more than 4000 features are used. The leave-one-out method gives more realistic results than using the training set as the test set.
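To make the comparison concrete, the sketch below trains the four classifier types discussed above on a toy corpus and scores each with leave-one-out evaluation. It is a minimal, hypothetical example using scikit-learn stand-ins (TF-IDF features, NearestCentroid as a Rocchio-style classifier), not the authors' implementation or their 1132-document Arabic dataset.

```python
# Hypothetical sketch (not the paper's code): comparing SVM, Naive Bayes,
# k-NN and a Rocchio-style classifier with leave-one-out evaluation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Placeholder documents and labels; the paper uses 1132 Arabic documents.
docs = ["document about sports", "news about politics",
        "article about economy", "report about sports"]
labels = ["sports", "politics", "economy", "sports"]

# TF-IDF features; the feature-set size is controlled by max_features.
X = TfidfVectorizer(max_features=4000).fit_transform(docs)

classifiers = {
    "SVM": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "Rocchio": NearestCentroid(),   # centroid-based, Rocchio-style classifier
}

for name, clf in classifiers.items():
    # Leave-one-out: every document is held out once as the test set.
    scores = cross_val_score(clf, X, labels, cv=LeaveOneOut())
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```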

106 citations

Journal ArticleDOI
01 Aug 2010
TL;DR: In this paper, an incremental algorithm to maintain the temporal association rules in a transaction database is proposed; it benefits from the results of earlier mining to derive the final mining output.
Abstract: This paper presents the concept of temporal association rules in order to solve the problem of handling time series by including time expressions in association rules. In practice, temporal databases are continually appended to or updated, so the discovered rules need to be updated as well. Re-running the temporal mining algorithm every time is inefficient, since it ignores the previously discovered rules and repeats work already done. Furthermore, existing incremental mining techniques cannot deal with temporal association rules. In this paper, an incremental algorithm to maintain the temporal association rules in a transaction database is proposed. The algorithm benefits from the results of earlier mining to derive the final mining output. The experimental results on both synthetic and real datasets show a significant improvement over the conventional approach of mining the entire updated database.
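As a rough illustration of the incremental idea, the sketch below keeps the support counts from an earlier mining run and merges them with counts obtained from only the newly appended transactions, then re-filters by minimum support. It is a hypothetical, non-temporal simplification of incremental frequent-itemset maintenance, not the paper's algorithm; the itemset size, data and threshold are made up.

```python
# Simplified, hypothetical sketch: reuse support counts from the previous
# mining run and update them with the newly appended transactions, instead
# of re-mining the whole database.
from collections import Counter
from itertools import combinations

def count_itemsets(transactions, max_len=2):
    """Count the support of all itemsets up to max_len in a batch of transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts

def incremental_update(old_counts, old_db_size, new_transactions, min_support):
    """Merge old counts with counts from the increment and re-filter by support."""
    merged = old_counts + count_itemsets(new_transactions)
    total = old_db_size + len(new_transactions)
    frequent = {s: c for s, c in merged.items() if c / total >= min_support}
    return frequent, merged, total

# Initial mining over the original database.
db = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
counts = count_itemsets(db)

# Later, an increment arrives (e.g. transactions from a new time period);
# only the increment is scanned, and the previous counts are reused.
frequent, counts, size = incremental_update(counts, len(db), [["b", "c"], ["a", "b"]], 0.5)
print(frequent)
```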

87 citations

Book ChapterDOI
22 Feb 2018
TL;DR: The proposed system builds a machine learning model for detecting positive and negative tweets, using different techniques and feature sets to represent the labeled input tweets in the training phase.
Abstract: Sentiment analysis of Twitter has recently become one of the more interesting research fields. It combines natural language processing techniques with data mining approaches for building such systems. In this paper, we introduce an efficient system for Twitter sentiment analysis. The proposed system builds a machine learning model for detecting positive and negative tweets. This model uses different techniques to represent the input labeled tweets in the training phase, using different feature sets. In the classification phase, a classifier ensemble with different base classifiers is used for more accurate results. The proposed system can be used for measuring users' opinions from their tweets, which is very useful in many applications such as marketing, political polarity detection and product reviewing.
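As a sketch of such an ensemble, the example below combines three base classifiers, each trained on a different representation of the labeled tweets (word n-grams, character n-grams and plain bag-of-words), under majority voting. It is a hypothetical scikit-learn illustration with placeholder tweets, not the system described in the paper.

```python
# Hypothetical sketch: a voting ensemble whose base classifiers each see a
# different feature representation of the labeled tweets.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier

tweets = ["great phone, love it", "worst service ever",
          "very happy today", "this is awful"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

ensemble = VotingClassifier(
    estimators=[
        ("word_lr", make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())),
        ("char_svm", make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), LinearSVC())),
        ("word_nb", make_pipeline(TfidfVectorizer(), MultinomialNB())),
    ],
    voting="hard",  # majority vote over the base classifiers' predictions
)
ensemble.fit(tweets, labels)
print(ensemble.predict(["I really love this", "I hate waiting"]))
```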

34 citations

Proceedings Article
01 Jan 2003
TL;DR: The multiresolution-based image segmentation techniques, which have emerged as a powerful method for producing high-quality segmentation of images, are combined here with the EM algorithm to overcome its drawbacks and in the same time take its advantage of simplicity.
Abstract: We present an MR image segmentation algorithm based on the conventional Expectation Maximization (EM) algorithm and the multiresolution analysis of images. Although the EM algorithm has been used in MRI brain segmentation, as well as in image segmentation in general, it fails to utilize the strong spatial correlation between neighboring pixels. Multiresolution-based image segmentation techniques, which have emerged as a powerful method for producing high-quality segmentation of images, are combined here with the EM algorithm to overcome this drawback while at the same time retaining its advantage of simplicity. Two data sets are used to test the performance of the EM algorithm and the proposed Gaussian Multiresolution EM (GMEM) algorithm. The results, which showed more accurate segmentation by the GMEM algorithm compared to the EM algorithm, are presented statistically and graphically to give deeper insight.
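The sketch below conveys the coarse-to-fine flavour of combining EM with multiresolution analysis: a Gaussian mixture is fitted by EM on a block-averaged version of the image, and its parameters then initialise EM at full resolution. This is a hypothetical simplification (scikit-learn's GaussianMixture, a single 2x downsampling step, a synthetic image), not the GMEM algorithm itself.

```python
# Hypothetical coarse-to-fine EM segmentation sketch (not the paper's GMEM code).
import numpy as np
from sklearn.mixture import GaussianMixture

def segment_multiresolution(image, n_classes=3):
    # Coarse level: average 2x2 blocks, so each coarse pixel summarises a
    # small neighbourhood (a crude way to inject spatial context).
    h, w = (image.shape[0] // 2) * 2, (image.shape[1] // 2) * 2
    coarse = image[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    coarse_gmm = GaussianMixture(n_components=n_classes, random_state=0)
    coarse_gmm.fit(coarse.reshape(-1, 1))

    # Fine level: initialise the means from the coarse fit and run EM again.
    fine_gmm = GaussianMixture(n_components=n_classes,
                               means_init=coarse_gmm.means_,
                               random_state=0)
    labels = fine_gmm.fit_predict(image.reshape(-1, 1))
    return labels.reshape(image.shape)

# Toy "MR slice": three intensity bands plus noise.
rng = np.random.default_rng(0)
toy = np.concatenate([np.full((20, 60), v) for v in (0.2, 0.5, 0.8)])
toy = toy + rng.normal(0, 0.02, toy.shape)
print(np.unique(segment_multiresolution(toy)))
```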

31 citations

Journal ArticleDOI
TL;DR: The proposed ensemble method is based on integrating a rule-based classifier with machine learning techniques, while utilizing content-based features that depend on N-gram features and negation handling, and achieves a classification accuracy of 95.25% and 99.98% on the two experimented datasets.
Abstract: Nowadays, individuals express experiences and opinions through online reviews. These reviews influence online marketing and the ability to obtain real knowledge about products and services. However, some online reviews can be unreal: they may have been written to promote low-quality products/services or to sabotage a product's or service's reputation in order to mislead potential customers. Such misleading reviews are known as spam reviews and require careful attention. Prior spam detection research focused on English reviews, with less attention to other languages. The detection of spam reviews in Arabic online sources is a relatively new topic despite the relatively huge amount of data generated. Therefore, this paper contributes to this topic by presenting four different Arabic spam review detection methods, with particular focus on the construction and evaluation of an ensemble approach. The proposed ensemble method is based on integrating a rule-based classifier with machine learning techniques, while utilizing content-based features that depend on N-gram features and negation handling. The four proposed methods are evaluated on two datasets of different sizes. The results indicate the efficiency of the ensemble approach, which achieves a classification accuracy of 95.25% and 99.98% on the two experimented datasets and outperforms existing related work by a margin of 25%.
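To hint at how a rule-based classifier and a machine learning model can be integrated, the sketch below joins a hand-written keyword rule with an n-gram model after a crude negation-marking step, then merges their votes. The cue words, English placeholder reviews and OR-style vote are all hypothetical illustrations; the paper's actual rules, Arabic features and datasets differ.

```python
# Hypothetical rule-plus-ML ensemble sketch for spam review detection.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

SPAM_CUES = {"free", "click", "www", "!!!"}   # illustrative cues only

def rule_based(review):
    """Return 1 (spam) if any hand-written cue appears, else 0."""
    return int(any(cue in review.lower() for cue in SPAM_CUES))

def handle_negation(review):
    """Very crude negation marking: prefix the token that follows a negation word."""
    out, negate = [], False
    for tok in review.split():
        out.append("NOT_" + tok if negate else tok)
        negate = tok.lower() in {"not", "no", "never"}
    return " ".join(out)

reviews = ["great product not spam at all", "click www free offer !!!",
           "honest review, works fine", "free free free click here"]
labels = [0, 1, 0, 1]   # 0 = genuine, 1 = spam

ml_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
ml_model.fit([handle_negation(r) for r in reviews], labels)

def ensemble_predict(review):
    votes = rule_based(review) + ml_model.predict([handle_negation(review)])[0]
    return int(votes >= 1)   # flag as spam if either component says spam

print(ensemble_predict("click here for a free prize"))
```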

31 citations


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

01 Jan 2002

9,314 citations

Journal ArticleDOI
TL;DR: The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
Highlights: Text classification is a domain with a high dimensional feature space. Extracting keywords as the features can be extremely useful in text classification. An empirical analysis of five statistical keyword extraction methods. A comprehensive analysis of classifier and keyword extraction ensembles. For the ACM collection, a classification accuracy of 93.80% with a Bagging ensemble of Random Forest.
Abstract: Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain with a high-dimensional feature space challenge. Hence, extracting the most important/relevant words about the content of the document and using these keywords as the features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most frequent measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). In the study, a comprehensive comparison of base learning algorithms (Naive Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that a Bagging ensemble of Random Forest with the most-frequent-based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most-frequent-based keyword extraction method and a Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
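As a rough sketch of the best-performing configuration reported above, the example below approximates most-frequent keyword extraction by keeping only the top corpus terms as features and classifies with a Bagging ensemble of Random Forests. It is a hypothetical scikit-learn stand-in with made-up documents, not the authors' pipeline or the ACM collection.

```python
# Hypothetical sketch: most-frequent-terms features + Bagging of Random Forests.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.pipeline import make_pipeline

docs = ["neural networks for image recognition",
        "database indexing and query optimisation",
        "deep learning improves image classification",
        "transaction processing in relational databases"]
labels = ["AI", "DB", "AI", "DB"]

pipeline = make_pipeline(
    # "Most frequent" keyword extraction approximated by keeping only the
    # top-N corpus terms as features.
    CountVectorizer(max_features=50, stop_words="english"),
    BaggingClassifier(RandomForestClassifier(n_estimators=50), n_estimators=10),
)
pipeline.fit(docs, labels)
print(pipeline.predict(["query planning for databases"]))
```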

445 citations

Journal ArticleDOI
TL;DR: A general overview of the requirements and system architectures of disaster management systems is presented, and state-of-the-art data-driven techniques that have been applied to improving situation awareness as well as to addressing users' information needs in disaster management are summarized.
Abstract: Improving disaster management and recovery techniques is a national priority given the huge toll caused by man-made and natural calamities. Data-driven disaster management aims at applying advanced data collection and analysis technologies to achieve more effective and responsive disaster management, and it has undergone considerable progress in the last decade. However, to the best of our knowledge, there is currently no work that both summarizes recent progress and suggests future directions for this emerging research area. To remedy this situation, we provide a systematic treatment of the recent developments in data-driven disaster management. Specifically, we first present a general overview of the requirements and system architectures of disaster management systems and then summarize state-of-the-art data-driven techniques that have been applied to improving situation awareness as well as to addressing users' information needs in disaster management. We also discuss and categorize general data-mining and machine-learning techniques in disaster management. Finally, we recommend several research directions for further investigation.

364 citations