Author

Wael Hassan Gomaa

Other affiliations: Modern Academy In Maadi
Bio: Wael Hassan Gomaa is an academic researcher from Beni-Suef University. The author has contributed to research in topics including Semantic similarity and Document clustering. The author has an h-index of 7 and has co-authored 11 publications receiving 727 citations. Previous affiliations of Wael Hassan Gomaa include Modern Academy In Maadi.

Papers
Journal ArticleDOI
TL;DR: This survey discusses existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.
General Terms: Text Mining, Natural Language Processing.
Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.
1. INTRODUCTION: Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are lexically similar if they have a similar character sequence. Words are semantically similar if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of another. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric measures the similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
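As a rough illustration of the lexical (String-Based) versus semantic (Knowledge-Based) distinction drawn in the survey, the sketch below computes a character-level Levenshtein edit distance by hand and a WordNet path similarity via NLTK. The word pairs and the use of NLTK's WordNet interface are illustrative assumptions, not examples taken from the paper.

```python
# Sketch: lexical vs. semantic word similarity (assumes NLTK with the
# WordNet corpus downloaded via nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def levenshtein(a: str, b: str) -> int:
    """String-Based measure: minimum number of character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Lexically similar words share a character sequence.
print(levenshtein("colour", "color"))        # 1 edit apart

# Knowledge-Based measure: path similarity between word senses in WordNet.
car = wn.synsets("car")[0]
automobile = wn.synsets("automobile")[0]
print(car.path_similarity(automobile))       # 1.0 -- same synset
```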

718 citations

Journal ArticleDOI
TL;DR: This paper presents a different unsupervised approach that deals with students’ answers holistically using text-to-text similarity, achieving the best Bag of Words (BOW) result for an unsupervised approach when compared to previous work.
Abstract: Most automatic scoring systems are pattern based, which requires a lot of hard and tedious work. These systems work in a supervised manner where predefined patterns and scoring rules are generated. This paper presents a different unsupervised approach which deals with students’ answers holistically using text-to-text similarity. Different String-based and Corpus-based similarity measures were tested separately and then combined to achieve a maximum correlation value of 0.504. This is the best correlation achieved by an unsupervised Bag of Words (BOW) approach when compared to previous work.
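A minimal sketch of the unsupervised text-to-text idea described above: score each student answer by its similarity to the model answer, average two measures, and evaluate against human grades with a Pearson correlation. The particular measures (a SequenceMatcher ratio and a TF-IDF cosine) and the simple averaging are illustrative stand-ins, not the exact combination reported in the paper.

```python
# Sketch: unsupervised answer grading by text-to-text similarity,
# evaluated with a Pearson correlation against hypothetical human grades.
from difflib import SequenceMatcher

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def string_sim(a: str, b: str) -> float:
    # String-based measure: character-sequence overlap ratio.
    return SequenceMatcher(None, a, b).ratio()

def bow_sim(a: str, b: str) -> float:
    # Bag-of-words cosine similarity (stand-in for a corpus-based measure).
    tfidf = TfidfVectorizer().fit_transform([a, b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

model_answer = "A stack is a last-in first-out data structure."
student_answers = [
    "Stacks work last in, first out.",
    "Push adds an element to the top and pop removes it from the top.",
    "A queue removes the oldest element first.",
]
human_scores = np.array([5.0, 4.0, 1.0])   # hypothetical gold grades

predicted = np.array([
    (string_sim(model_answer, ans) + bow_sim(model_answer, ans)) / 2
    for ans in student_answers
])
print("Pearson r:", np.corrcoef(predicted, human_scores)[0, 1])
```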

49 citations

Journal ArticleDOI
TL;DR: This research presents the first benchmark Arabic data set that contains 610 students’ short answers together with their English translations, and focuses on applying multiple similarity measures separately and in combination.

41 citations

Journal ArticleDOI
TL;DR: This paper presents a classification model based on supervised machine learning techniques and word-based N-gram analysis to classify Twitter messages automatically as credible or not credible; experiments show that the proposed model achieves an improvement over two existing models from the literature.
Abstract: With the evolution of social media platforms, the Internet is used as a source for obtaining news about current events. Recently, Twitter has become one of the most popular social media platforms that allow public users to share news. The platform is growing rapidly, especially among young people who may be influenced by information from anonymous sources. Therefore, predicting the credibility of news on Twitter becomes a necessity, especially in the case of emergencies. This paper introduces a classification model based on supervised machine learning techniques and word-based N-gram analysis to classify Twitter messages automatically as credible or not credible. Five different supervised classification techniques are applied and compared, namely Linear Support Vector Machines (LSVM), Logistic Regression (LR), Random Forests (RF), Naïve Bayes (NB) and K-Nearest Neighbors (KNN). The research investigates two feature representations (TF and TF-IDF) and different word N-gram ranges. For model training and testing, 10-fold cross-validation is performed on two datasets in different languages (English and Arabic). The best performance is achieved using a combination of unigrams and bigrams, LSVM as the classifier and TF-IDF as the feature extraction technique. The proposed model achieves 84.9% Accuracy, 86.6% Precision, 91.9% Recall, and 89% F-Measure on the English dataset. On the Arabic dataset, the model achieves 73.2% Accuracy, 76.4% Precision, 80.7% Recall, and 78.5% F-Measure. The obtained results indicate that word N-gram features are more relevant for credibility prediction than content- and source-based features, and also than character N-gram features. Experiments also show that the proposed model achieves an improvement when compared to two existing models in the literature.
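The best configuration reported above (word unigrams plus bigrams, TF-IDF weighting, a linear SVM, 10-fold cross-validation) maps naturally onto a scikit-learn pipeline. The sketch below shows that shape, assuming placeholder tweets and labels rather than the actual English and Arabic datasets.

```python
# Sketch: word n-gram credibility classifier in the configuration reported
# above (TF-IDF over unigrams + bigrams, linear SVM).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = [                                    # placeholder tweets, not the real datasets
    "Breaking: magnitude 6 earthquake reported near the city center",
    "You won a free phone, click this link right now!!!",
]
labels = [1, 0]                               # 1 = credible, 0 = not credible

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),      # word unigrams + bigrams, TF-IDF weights
    LinearSVC(),                              # linear support vector machine
)
model.fit(tweets, labels)
print(model.predict(["unconfirmed reports of an explosion, please share"]))

# With the full datasets, the paper's evaluation protocol corresponds to:
#   cross_val_score(model, all_tweets, all_labels, cv=10, scoring="accuracy")
```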

29 citations

Book ChapterDOI
28 Mar 2019
TL;DR: An efficient and uncomplicated short answer grading model named Ans2vec is proposed; it converts both the model answer and the student’s answer into meaningful vectors and measures the similarity between them.
Abstract: Automatic scoring is a complex task in computational linguistics, particularly in an educational context. Sentence vector (sent2vec) approaches have recently proven successful as models for sentence representation. In this research, we propose an efficient and uncomplicated short answer grading model named Ans2vec. The skip-thought vector approach is used to convert both the model answer and the student’s answer into meaningful vectors and to measure the similarity between them. Ans2vec achieves promising results on three different benchmarking data sets. For the Texas data set, Ans2vec achieves the best Pearson correlation value (0.63) compared to all related systems.
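A minimal sketch of the Ans2vec grading logic: embed the model answer and the student’s answer as sentence vectors and grade by their cosine similarity. The paper uses skip-thought vectors as the encoder; the TF-IDF vectors below are only a runnable stand-in so the similarity-to-score step is concrete.

```python
# Sketch: grade an answer by the cosine similarity between sentence vectors.
# Ans2vec obtains these vectors from a skip-thought encoder; TF-IDF is a stand-in.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def grade(model_answer: str, student_answer: str, max_score: float = 5.0) -> float:
    # In Ans2vec both vectors would come from the skip-thought sentence encoder.
    vectors = TfidfVectorizer().fit_transform([model_answer, student_answer]).toarray()
    m, s = vectors
    cosine = float(np.dot(m, s) / (np.linalg.norm(m) * np.linalg.norm(s) + 1e-9))
    return cosine * max_score                 # map similarity onto the grading scale

print(grade("Photosynthesis converts light energy into chemical energy.",
            "Plants turn light energy into chemical energy."))
```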

24 citations


Cited by
Journal ArticleDOI

164 citations

Journal ArticleDOI
Justin Farrell
TL;DR: In this article, an application of network science reveals the institutional and corporate structure of the climate change counter-movement in the United States, while computational text analysis shows its influence in the news media and within political circles.
Abstract: An application of network science reveals the institutional and corporate structure of the climate change counter-movement in the United States, while computational text analysis shows its influence in the news media and within political circles.

144 citations

Proceedings ArticleDOI
26 Apr 2016
TL;DR: This research implements the Term Frequency–Inverse Document Frequency (TF-IDF) weighting method and Cosine Similarity, measuring the degree of similarity between terms in a document, to rank documents by how closely they match the expert's document.
Abstract: The development of technology in the educational field brings easier ways of facilitating the learning process, sharing files, giving assignments and assessment. Automated Essay Scoring (AES) is a system for determining a score automatically from a source text document, in order to facilitate correction and scoring using applications that run on the computer. The AES process helps lecturers score efficiently and effectively, and it can also reduce the problem of subjective scoring. However, the implementation of AES depends on many factors and cases, such as the language and the mechanism of the scoring process, especially for essay scoring. A number of methods have been defined for weighting the terms of a document and for handling the comparison between students' answer documents and the expert's document. In this research, we implement the Term Frequency–Inverse Document Frequency (TF-IDF) weighting method and Cosine Similarity to measure the degree of similarity between terms in a document. Tests were carried out on a number of Indonesian text-based documents that had gone through a pre-processing stage for data extraction purposes. The process results in a ranking of document weights by how closely the documents match the expert's document.
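A short sketch of the TF-IDF plus cosine-similarity scheme described above: vectorize the expert (reference) document and the student essays, then rank the essays by how closely they match the expert document. The texts are placeholders; the paper applies this to pre-processed Indonesian documents.

```python
# Sketch: rank student essays by TF-IDF cosine similarity to an expert answer.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

expert = "Normalization removes redundancy and update anomalies from a database schema."
essays = [
    "Normalization reduces redundancy and prevents update anomalies.",
    "A primary key uniquely identifies each row in a table.",
]

matrix = TfidfVectorizer().fit_transform([expert] + essays)
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()   # similarity to the expert document

for score, essay in sorted(zip(scores, essays), reverse=True):   # closest match first
    print(f"{score:.3f}  {essay}")
```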

137 citations

Journal ArticleDOI
31 Oct 2013
TL;DR: This paper used a similarity metric between student responses, and then used this metric to group responses into clusters and subclusters, which allowed teachers to grade multiple responses with a single action, provide rich feedback to groups of similar answers, and discover modalities of misunderstanding among students.
Abstract: We introduce a new approach to the machine-assisted grading of short answer questions. We follow past work in automated grading by first training a similarity metric between student responses, but then go on to use this metric to group responses into clusters and subclusters. The resulting groupings allow teachers to grade multiple responses with a single action, provide rich feedback to groups of similar answers, and discover modalities of misunderstanding among students; we refer to this amplification of grader effort as “powergrading.” We develop the means to further reduce teacher effort by automatically performing actions when an answer key is available. We show results in terms of grading progress with a small “budget” of human actions, both from our method and an LDA-based approach, on a test corpus of 10 questions answered by 698 respondents.
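A minimal sketch of the clustering step behind powergrading: group similar short answers so a grader can act on a whole cluster at once. The paper trains a task-specific similarity metric between responses; plain TF-IDF vectors with k-means are used here only as a simple stand-in for that metric, and the responses are invented examples.

```python
# Sketch: cluster short answers so each cluster can be graded with one action.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

responses = [
    "the mitochondria",
    "mitochondria, they make energy for the cell",
    "the nucleus",
    "it is the nucleus of the cell",
]

vectors = TfidfVectorizer().fit_transform(responses)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, response in zip(labels, responses):
    print(label, response)                    # grade or give feedback per cluster
```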

134 citations