Author

Mamdouh Farouk

Bio: Mamdouh Farouk is an academic researcher from Assiut University. The author has contributed to research topics including similarity (network science) and word embedding. The author has an h-index of 5 and has co-authored 8 publications receiving 49 citations.

Papers
Journal ArticleDOI
Mamdouh Farouk
TL;DR: Word-to-word-based, structure-based, and vector-based approaches are the most widely used for finding sentence similarity, but structure-based similarity, which measures similarity between sentence structures, needs more investigation.
Abstract: This study reviews the approaches used for measuring sentence similarity. Measuring similarity between natural language sentences is a crucial task for many Natural Language Processing applications such as text classification, information retrieval, question answering, and plagiarism detection. This survey classifies approaches to calculating sentence similarity into three categories based on the adopted methodology. Word-to-word-based, structure-based, and vector-based approaches are the most widely used for finding sentence similarity. Each approach measures relatedness between short texts from a specific perspective. In addition, the datasets most often used as benchmarks for evaluating techniques in this field are introduced to provide a complete view of the issue. Approaches that combine more than one perspective give better results. Moreover, structure-based similarity, which measures similarity between sentence structures, needs further investigation.
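The vector-based category the survey describes can be sketched as follows: represent each sentence as the average of its word embeddings and compare sentences with cosine similarity. This is a minimal illustrative sketch, not code from the paper; the tiny 3-dimensional embedding table is an invented stand-in for pre-trained vectors such as word2vec or GloVe.

```python
import numpy as np

# Toy word embeddings (illustrative only; real systems use pre-trained
# vectors with hundreds of dimensions).
EMBEDDINGS = {
    "dog":    np.array([0.9, 0.1, 0.0]),
    "puppy":  np.array([0.8, 0.2, 0.1]),
    "barks":  np.array([0.1, 0.9, 0.2]),
    "loudly": np.array([0.0, 0.3, 0.9]),
}

def sentence_vector(tokens):
    """Vector-based representation: average the known word vectors."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def vector_based_similarity(s1, s2):
    return cosine(sentence_vector(s1.split()), sentence_vector(s2.split()))

sim = vector_based_similarity("dog barks loudly", "puppy barks")
```

Word-to-word-based approaches would instead aggregate pairwise word similarities, and structure-based approaches compare parse or semantic structures; the averaging above is only the simplest vector-based variant.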

26 citations

Journal ArticleDOI
TL;DR: In this paper, the authors classify approaches to calculating sentence similarity into three categories based on the adopted methodology: word-to-word-based, structure-based, and vector-based.
Abstract: Objective/Methods: This study reviews the approaches used for measuring sentence similarity. Measuring similarity between natural language sentences is a crucial task for many Natural Language Processing applications such as text classification, information retrieval, question answering, and plagiarism detection. This survey classifies approaches to calculating sentence similarity into three categories based on the adopted methodology. Word-to-word-based, structure-based, and vector-based approaches are the most widely used for finding sentence similarity. Findings/Application: Each approach measures relatedness between short texts from a specific perspective. In addition, the datasets most often used as benchmarks for evaluating techniques in this field are introduced to provide a complete view of the issue. Approaches that combine more than one perspective give better results. Moreover, structure-based similarity, which measures similarity between sentences' structures, needs further investigation.
Keywords: Sentence Representation, Sentences Similarity, Structural Similarity, Word Embedding, Words Similarity

25 citations

Journal ArticleDOI
Mamdouh Farouk
TL;DR: The proposed approach combines different similarity measures when calculating sentence similarity and exploits the sentence's semantic structure to improve the accuracy of the similarity calculation.

23 citations

Proceedings ArticleDOI
Mamdouh Farouk
01 Dec 2018
TL;DR: This paper combines the use of pre-trained word vectors and WordNet to measure semantic similarity between two sentences, and achieves better results compared with other approaches previously proposed for measuring sentence similarity.
Abstract: Semantic similarity between sentences is a crucial task for many applications. The emergence of word embeddings encourages calculating similarity between words and between sentences based on the new semantic word representation. On the other hand, WordNet is widely used to find the semantic distance between sentences. This paper combines the use of pre-trained word vectors and WordNet to measure semantic similarity between two sentences. In addition, word order similarity is applied to make the final similarity more accurate. The proposed approach has been implemented and tested using standard datasets. Experiments show that the presented method achieves better results compared with other approaches previously proposed for measuring sentence similarity.
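The word order similarity component mentioned above can be sketched as comparing word position vectors built over the joint word set of the two sentences, then blending that score with a semantic score. This is an illustrative sketch, not the paper's implementation; the weighting factor `alpha` and the helper names are assumptions.

```python
import numpy as np

def word_order_vector(tokens, joint_words):
    """Position (1-based) of each joint word in the sentence, 0 if absent."""
    return np.array([tokens.index(w) + 1 if w in tokens else 0
                     for w in joint_words], dtype=float)

def word_order_similarity(s1, s2):
    """1 when the shared words appear in the same positions, lower otherwise."""
    t1, t2 = s1.split(), s2.split()
    joint = sorted(set(t1) | set(t2))
    r1 = word_order_vector(t1, joint)
    r2 = word_order_vector(t2, joint)
    return 1.0 - np.linalg.norm(r1 - r2) / np.linalg.norm(r1 + r2)

def combined_similarity(semantic_sim, order_sim, alpha=0.85):
    # alpha weights semantic similarity against word order similarity;
    # 0.85 is an assumed value, not taken from the paper.
    return alpha * semantic_sim + (1 - alpha) * order_sim
```

Here `semantic_sim` would come from the embedding/WordNet side of the method; the word order term mainly separates sentences that share words but use them in different roles.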

17 citations

Book ChapterDOI
23 Oct 2018
TL;DR: This work proposes a search engine for searching Web data represented in UNL (Universal Networking Language) based on semantic graph matching, which includes semantic expansion for graph nodes and relation matching based on relation meaning.
Abstract: Explosive growth of the Web has made searching Web data a challenging task for information retrieval systems. Semantic search systems that go beyond shallow keyword matching and map words to their conceptual meaning representations offer better results to users. On the other hand, many representation formats have been specified for expressing Web data in a semantic form. We propose a search engine for searching Web data represented in UNL (Universal Networking Language). UNL has numerous attractive features to support semantic search. One of the main features is that UNL does not depend on a domain ontology. Our proposed search engine is based on semantic graph matching. It includes semantic expansion for graph nodes and relation matching based on relation meaning. The search results are ranked by the semantic similarity between the user query and the retrieved documents. We developed a prototype implementing the proposed semantic search engine, and our evaluations demonstrate its effectiveness across a wide range of semantic search tasks.
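The semantic graph matching described above can be sketched on UNL-style relation triples (e.g. agt for agent, obj for object), with a small synonym table standing in for the knowledge-base lookups used in semantic expansion. This is a hypothetical sketch; the function names, the synonym table, and the scoring rule are assumptions, not the paper's implementation.

```python
# Stand-in synonym table; the real system would expand nodes via a
# lexical resource rather than a hard-coded dictionary.
SYNONYMS = {
    "car": {"car", "automobile"},
    "buy": {"buy", "purchase"},
}

def expand(concept):
    """Semantic expansion: a node matches itself and its synonyms."""
    return SYNONYMS.get(concept, {concept})

def triples_match(query_triple, doc_triple):
    """Nodes match up to expansion; the UNL relation must match exactly."""
    (qs, qr, qt), (ds, dr, dt) = query_triple, doc_triple
    return (ds in expand(qs)) and (dr == qr) and (dt in expand(qt))

def graph_similarity(query_graph, doc_graph):
    """Fraction of query triples matched somewhere in the document graph."""
    matched = sum(
        any(triples_match(q, d) for d in doc_graph) for q in query_graph
    )
    return matched / len(query_graph) if query_graph else 0.0

# "John buys a car" vs. a document graph phrased with synonyms.
query = [("John", "agt", "buy"), ("buy", "obj", "car")]
doc = [("John", "agt", "purchase"), ("purchase", "obj", "automobile")]
```

Ranking the retrieved documents by this score is then straightforward: higher graph similarity places a document earlier in the result list.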

10 citations


Cited by
Journal ArticleDOI
TL;DR: GatorTron, as presented in this paper, is a clinical language model trained on >90 billion words of text and systematically evaluated on five clinical NLP tasks: clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference, and medical question answering (MQA).
Abstract: There is an increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems utilizing clinical narratives. However, there are few clinical language models, and the largest trained in the clinical domain is comparatively small at 110 million parameters (compared with billions of parameters in the general domain). It is not clear how large clinical language models with billions of parameters can help medical AI systems utilize unstructured EHRs. In this study, we develop from scratch a large clinical language model, GatorTron, using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on five clinical NLP tasks including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data could benefit these NLP tasks. GatorTron models scale up the clinical language model from 110 million to 8.9 billion parameters and improve five clinical NLP tasks (e.g., 9.6% and 9.5% improvement in accuracy for NLI and MQA), which can be applied to medical AI systems to improve healthcare delivery. The GatorTron models are publicly available at: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og .

69 citations

Journal ArticleDOI
Xi Yang, Xing He, Hansi Zhang, Yinghan Ma, Jian-Guo Bian, Yonghui Wu
TL;DR: This study demonstrated the effectiveness of transformer-based models for measuring semantic similarity of clinical text, using Bidirectional Encoder Representations from Transformers (BERT), XLNet, and the Robustly optimized BERT approach (RoBERTa).
Abstract: Background: Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized and curated in the general English domain; however, such resources are limited in the biomedical domain. In 2019, the National NLP Clinical Challenges (n2c2) challenge developed a comprehensive clinical STS dataset and organized a community effort to solicit state-of-the-art solutions for clinical STS. Objective: This study presents our transformer-based clinical STS models developed during this challenge as well as new models we explored after the challenge. This project is part of the 2019 n2c2/Open Health NLP shared task on clinical STS. Methods: In this study, we explored 3 transformer-based models for clinical STS: Bidirectional Encoder Representations from Transformers (BERT), XLNet, and Robustly optimized BERT approach (RoBERTa). We examined transformer models pretrained using both general English text and clinical text. We also explored using a general English STS dataset as a supplementary corpus in addition to the clinical training set developed in this challenge. Furthermore, we investigated various ensemble methods to combine different transformer models. Results: Our best submission based on the XLNet model achieved the third-best performance (Pearson correlation of 0.8864) in this challenge. After the challenge, we further explored other transformer models and improved the performance to 0.9065 using a RoBERTa model, which outperformed the best-performing system developed in this challenge (Pearson correlation of 0.9010). Conclusions: This study demonstrated the efficiency of utilizing transformer-based models to measure semantic similarity for clinical text. Our models can be applied to clinical applications such as clinical text deduplication and summarization.
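The ensembling and evaluation steps described above can be sketched as averaging the continuous similarity scores predicted by several transformer models and scoring the result with Pearson correlation, the metric reported in the challenge. The per-model scores below are made-up illustrative values, and the simple mean is only one of the ensemble methods the paper explores.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between predicted and gold similarity scores."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def mean_ensemble(model_scores):
    """Average the per-pair similarity scores predicted by each model."""
    return np.mean(np.asarray(model_scores, float), axis=0)

# Made-up predictions from three models on four sentence pairs (0-5 scale).
bert_scores    = [4.1, 1.0, 3.2, 0.5]
xlnet_scores   = [4.4, 0.8, 3.0, 0.9]
roberta_scores = [4.2, 1.1, 3.5, 0.6]
gold           = [4.5, 1.0, 3.0, 0.0]

ensemble = mean_ensemble([bert_scores, xlnet_scores, roberta_scores])
r = pearson(ensemble, gold)
```

Averaging tends to help when the individual models make uncorrelated errors; weighted averages or stacking are common refinements.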

30 citations

Posted ContentDOI
02 Feb 2022-medRxiv
TL;DR: GatorTron, as presented in this preprint, is the largest transformer model in the clinical domain, scaled up from the previous 110 million parameters to 8.9 billion, and achieves state-of-the-art performance on 5 clinical NLP tasks targeting various healthcare information documented in EHRs.
Abstract: There is an increasing interest in developing massive-size deep learning models in natural language processing (NLP) - the key technology to extract patient information from unstructured electronic health records (EHRs). However, there are limited studies exploring large language models in the clinical domain; the current largest clinical NLP model was trained with 110 million parameters (compared with 175 billion parameters in the general domain). It is not clear how large-size NLP models can help machines understand patients' clinical information from unstructured EHRs. In this study, we developed a large clinical transformer model - GatorTron - using >90 billion words of text and evaluated it on 5 clinical NLP tasks including clinical concept extraction, relation extraction, semantic textual similarity, natural language inference, and medical question answering. GatorTron is now the largest transformer model in the clinical domain that scaled up from the previous 110 million to 8.9 billion parameters and achieved state-of-the-art performance on the 5 clinical NLP tasks targeting various healthcare information documented in EHRs. GatorTron models perform better in understanding and utilizing patient information from clinical narratives in ways that can be applied to improvements in healthcare delivery and patient outcomes.

19 citations