Author
Wael Hassan Gomaa
Other affiliations: Modern Academy In Maadi
Bio: Wael Hassan Gomaa is an academic researcher from Beni-Suef University. The author has contributed to research in the topics of Semantic similarity and Document clustering, has an h-index of 7, and has co-authored 11 publications receiving 727 citations. Previous affiliations of Wael Hassan Gomaa include Modern Academy In Maadi.
Papers
TL;DR: This survey discusses existing work on text similarity by partitioning it into three approaches, String-based, Corpus-based and Knowledge-based similarity, and presents samples of combinations between these similarity measures.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarity. Furthermore, samples of combinations between these similarity measures are presented.
General Terms: Text Mining, Natural Language Processing.
Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.
1. INTRODUCTION: Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity and serves as a primary stage for sentence, paragraph and document similarity. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence; they are similar semantically if they have the same meaning, are opposites of each other, are used in the same way or in the same context, or if one is a type of the other. Lexical similarity is covered in this survey through different String-Based algorithms, while semantic similarity is covered through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition; a string metric measures the similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. The paper is organized as follows: section two presents String-Based algorithms, partitioned into character-based and term-based measures; sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively; samples of combinations between similarity algorithms are introduced in section five; and section six presents the conclusion of the survey.
596 citations
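To make the survey's taxonomy concrete, the sketch below (an illustration added here, not code from the paper) implements one character-based measure, Levenshtein edit distance, and one term-based measure, Jaccard similarity over word sets; the example sentences are invented.

def levenshtein(a: str, b: str) -> int:
    """Character-based measure: minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Term-based measure: word-set overlap divided by word-set union."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

s1 = "measuring similarity between short texts"
s2 = "measuring the similarity of short text pairs"
print("Levenshtein distance:", levenshtein(s1, s2))
print("Jaccard similarity:", round(jaccard(s1, s2), 3))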
TL;DR: This paper presents a different unsupervised approach that deals with students’ answers holistically using text-to-text similarity; the achieved correlation is the best reported for an unsupervised Bag of Words (BOW) approach when compared to previous work.
Abstract: Most automatic scoring systems are pattern-based, which requires a lot of hard and tedious work. These systems work in a supervised manner where predefined patterns and scoring rules are generated. This paper presents a different unsupervised approach that deals with students’ answers holistically using text-to-text similarity. Different String-based and Corpus-based similarity measures were tested separately and then combined to achieve a maximum correlation value of 0.504. This is the best correlation achieved by an unsupervised Bag of Words (BOW) approach when compared to previous work.
45 citations
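A rough sketch of the unsupervised scoring idea described above: each student answer is compared with a model answer using a string-based and a bag-of-words measure, the two scores are averaged, and the combined scores are correlated with human grades. The specific measures, the equal weighting and the toy data are assumptions for illustration, not the paper's exact configuration.

from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    # Character-sequence similarity in [0, 1], standing in for the paper's string-based measures.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def bow_sim(a: str, b: str) -> float:
    # Simple word-overlap similarity, standing in for a corpus-based measure.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

model = "photosynthesis converts light energy into chemical energy"
graded_answers = [  # (student answer, human grade) -- invented examples
    ("plants turn light energy into chemical energy", 5.0),
    ("photosynthesis makes food for the plant", 3.0),
    ("it is about animals breathing", 0.0),
]
combined = [0.5 * string_sim(model, a) + 0.5 * bow_sim(model, a) for a, _ in graded_answers]
human = [g for _, g in graded_answers]
print("Pearson correlation with human grades:", round(pearson(combined, human), 3))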
TL;DR: This research presents the first benchmark Arabic data set that contains 610 students’ short answers together with their English translations, and focuses on applying multiple similarity measures separately and in combination.
Abstract: Most research on the automatic assessment of free-text answers written by students addresses the English language. This paper handles the assessment task in Arabic. The research focuses on applying multiple similarity measures separately and in combination. Several aspects that depend on translation are introduced to overcome the lack of text-processing resources for Arabic, such as extracting model answers automatically from an already built database and applying K-means clustering to scale the obtained similarity values. Additionally, this research presents the first benchmark Arabic data set, containing 610 students’ short answers together with their English translations.
31 citations
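The K-means scaling step mentioned above can be sketched as follows: cluster the raw similarity values and order the clusters by their centers so that higher-similarity clusters map to higher grade bands. The use of scikit-learn, the number of bands and the sample values are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical similarity values between student answers and the model answer.
similarities = np.array([0.91, 0.88, 0.52, 0.47, 0.15, 0.83, 0.10]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(similarities)
# Order clusters by center so the highest-similarity cluster gets the top grade band.
order = np.argsort(kmeans.cluster_centers_.ravel())
grade_of_cluster = {int(cluster): grade for grade, cluster in enumerate(order)}

for sim, label in zip(similarities.ravel(), kmeans.labels_):
    print(f"similarity={sim:.2f} -> grade band {grade_of_cluster[int(label)]}")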
TL;DR: This paper introduces a classification model based on supervised machine learning techniques and word-based N-gram analysis to classify Twitter messages automatically as credible or not credible; experiments show that the proposed model improves on two existing models from the literature.
Abstract: With the evolution of social media platforms, the Internet is used as a source for obtaining news about current events. Recently, Twitter has become one of the most popular social media platforms that allow public users to share news. The platform is growing rapidly, especially among young people who may be influenced by information from anonymous sources. Therefore, predicting the credibility of news on Twitter becomes a necessity, especially in the case of emergencies. This paper introduces a classification model based on supervised machine learning techniques and word-based N-gram analysis to classify Twitter messages automatically as credible or not credible. Five supervised classification techniques are applied and compared, namely Linear Support Vector Machines (LSVM), Logistic Regression (LR), Random Forests (RF), Naïve Bayes (NB) and K-Nearest Neighbors (KNN). The research investigates two feature representations (TF and TF-IDF) and different word N-gram ranges. For model training and testing, 10-fold cross-validation is performed on two datasets in different languages (English and Arabic). The best performance is achieved using a combination of unigrams and bigrams, LSVM as the classifier and TF-IDF as the feature extraction technique. The proposed model achieves 84.9% accuracy, 86.6% precision, 91.9% recall and 89% F-measure on the English dataset. On the Arabic dataset, the model achieves 73.2% accuracy, 76.4% precision, 80.7% recall and 78.5% F-measure. The obtained results indicate that word N-gram features are more relevant for credibility prediction than content- and source-based features, and also than character N-gram features. Experiments also show that the proposed model achieves an improvement over two existing models in the literature.
13 citations
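The pipeline described above can be approximated with scikit-learn as in the sketch below: word unigrams and bigrams, TF-IDF weighting, a linear SVM and 10-fold cross-validation. The toy tweets and labels are invented (and repeated only so that 10 folds are possible); real experiments would use the English and Arabic datasets from the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Invented examples, duplicated so every class has enough samples for 10 folds.
credible = ["official statement confirms the evacuation route",
            "agency reports road closures after the storm"] * 5
not_credible = ["shocking secret they do not want you to know",
                "share before this gets deleted unbelievable"] * 5
texts = credible + not_credible
labels = [1] * len(credible) + [0] * len(not_credible)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word unigrams + bigrams with TF-IDF weights
    LinearSVC(),
)
scores = cross_val_score(model, texts, labels, cv=10, scoring="accuracy")
print("mean 10-fold accuracy:", round(float(scores.mean()), 3))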
TL;DR: Overall, the obtained correlation and error rate results prove that the presented system performs well enough for deployment in a real scoring environment.
Abstract: In this paper, we explore text similarity techniques for the task of automatic short answer scoring in the Arabic language. We compare a number of string-based and corpus-based similarity measures, evaluate the effect of combining these measures, handle students’ answers both holistically and partially, provide immediate useful feedback to the student, and introduce a new benchmark Arabic data set that contains 50 questions and 600 student answers. Overall, the obtained correlation and error-rate results prove that the presented system performs well enough for deployment in a real scoring environment.
General Terms: Natural Language Processing, Text Mining.
12 citations
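The holistic versus partial handling mentioned above can be illustrated with a small sketch: score the whole student answer against the whole model answer, and also against each element of the model answer, averaging the per-element scores. The similarity function and the example texts are simplified assumptions, not the paper's actual measures.

from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    # Simple character-sequence similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Invented model answer split into the elements a full answer should cover.
model_elements = [
    "the heart pumps blood",
    "blood carries oxygen to the body",
]
student_answer = "the heart pushes blood which moves oxygen around the body"

holistic = sim(" ".join(model_elements), student_answer)
partial = sum(sim(e, student_answer) for e in model_elements) / len(model_elements)
print(f"holistic score: {holistic:.2f}")
print(f"partial (per-element average) score: {partial:.2f}")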
Cited by
31 Oct 2013
TL;DR: This paper trains a similarity metric between student responses and uses it to group responses into clusters and subclusters, allowing teachers to grade multiple responses with a single action, provide rich feedback to groups of similar answers, and discover modalities of misunderstanding among students.
Abstract: We introduce a new approach to the machine-assisted grading of short answer questions. We follow past work in automated grading by first training a similarity metric between student responses, but then go on to use this metric to group responses into clusters and subclusters. The resulting groupings allow teachers to grade multiple responses with a single action, provide rich feedback to groups of similar answers, and discover modalities of misunderstanding among students; we refer to this amplification of grader effort as “powergrading.” We develop the means to further reduce teacher effort by automatically performing actions when an answer key is available. We show results in terms of grading progress with a small “budget” of human actions, both from our method and an LDA-based approach, on a test corpus of 10 questions answered by 698 respondents.
119 citations
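The grouping idea behind "powergrading" can be sketched as below: represent short answers as vectors (here with plain TF-IDF rather than the trained similarity metric used in the paper) and cluster them so that one grading action can cover a whole group. The cluster count and example answers are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

answers = [  # invented student responses
    "the mitochondria produce energy for the cell",
    "mitochondria make the cell's energy",
    "it is the powerhouse of the cell",
    "the nucleus stores genetic material",
    "dna is kept inside the nucleus",
    "i do not know",
]

vectors = TfidfVectorizer().fit_transform(answers)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cluster in sorted(set(labels)):
    print(f"cluster {cluster}:")
    for answer, label in zip(answers, labels):
        if label == cluster:
            print("  -", answer)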
TL;DR: In this article, an application of network science reveals the institutional and corporate structure of the climate change counter-movement in the United States, while computational text analysis shows its influence in the news media and within political circles.
Abstract: An application of network science reveals the institutional and corporate structure of the climate change counter-movement in the United States, while computational text analysis shows its influence in the news media and within political circles.
118 citations