scispace - formally typeset
Author

M.M.S. Rauthan

Bio: M.M.S. Rauthan is an academic researcher. The author has contributed to research in topics: Parsing & Sentence. The author has an h-index of 3, has co-authored 5 publications, and has received 24 citations.

Papers
Journal ArticleDOI
TL;DR: An attempt has been made to discuss the applicability of language model as an approach to calculate the relevance of the document by utilizing user-supplied information of those documents that are relevant to the query items.
Abstract: In the present work an attempt has been made to discuss the applicability of a language model as an approach to calculating the relevance of a document by utilizing user-supplied information about those documents that are relevant to the query terms. This method has the advantage of improving retrieval performance, since it exploits user-supplied relevance information for the query in question. The design and implementation of information retrieval systems is concerned with methods for storing, organizing and retrieving information from a collection of documents, and the quality of a system is measured by how useful it is to the typical users of the system. In this approach, a query is considered to be generated from an "ideal" document that satisfies the information need; the system's job is to calculate the frequency of each query word in a given document and rank the documents accordingly.
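The query-likelihood ranking the abstract describes can be illustrated with a minimal sketch (not the authors' implementation; the smoothing weight `lam` and the function name are illustrative assumptions): each document is scored by the probability that a smoothed unigram language model of the document generates the query.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Score a document by the probability that a Jelinek-Mercer-smoothed
    unigram language model of the document generates the query terms."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 1.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0   # document model
        p_coll = coll_tf[term] / len(collection)          # background model
        score *= lam * p_doc + (1 - lam) * p_coll         # smoothed mixture
    return score
```

Documents mentioning the query terms more often receive higher scores, which matches the abstract's description of ranking documents by word frequency.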

16 citations

Journal ArticleDOI
TL;DR: An algorithm based on maximum entropy and stop word removal modules, which works with almost 99% accuracy and has established supremacy over the existing paragraph breaker software developed by Text Mining Group.
Abstract: In the present work we have developed an algorithm based on maximum entropy and stop word removal modules, which works with almost 99% accuracy and has established supremacy over the existing paragraph breaker software developed by the Text Mining Group, School of Computer Science, Manchester University, United Kingdom. Keywords: Sentence Boundary, Information Retrieval, Evaluation.

1. INTRODUCTION Sentence Boundary Disambiguation (SBD) has received increased attention in recent years as a way to enrich speech recognition output for better readability and improved performance in many applications of Natural Language Processing, such as parsing, information extraction, machine translation, POS tagging and document summarization. Among the most relevant works, we can cite Berger (1996), Palmer & Hearst (1997), Mikheev (2000), Manning & Schutze (2002), Kiss & Strunk (2006), Xuan et al. (2007), Siminski (2007) and Gillick (2009), to mention only a few. We know that a sentence is a sequence of words ending with a terminal punctuation mark, such as '.', '?' or '!'. Most sentences end with a period. However, a period can also be part of an abbreviation, such as "Mr., U.S.A., Ph. D., M. Sc." etc., or can represent a decimal point in a number like 102.53. In these cases the period has a different meaning, so we cannot delimit the sentence there, and an ambiguity arises in breaking the sentence. To perform sentence boundary disambiguation for a given document, certain conditions must hold that are very important when deciding whether to break at a candidate boundary. In this paper we have made an attempt to provide a system which can be incorporated into any larger system and can deduce sentence boundaries with high accuracy.
For this purpose, we have considered the following conditions through which our system provides high accuracy for detecting sentence boundaries: not to break a sentence when the sentence contains certain abbreviated words like
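The abbreviation condition above can be sketched with a minimal rule-based splitter (illustrative only; the abbreviation list and function name are assumptions, and the authors' actual system combines maximum entropy with stop word removal):

```python
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s.a.", "ph.", "sc.", "etc."}

def split_sentences(text):
    """Naive rule-based sentence splitter: break after a token ending in
    '.', '?' or '!' unless it is a known abbreviation. Word-internal
    periods (decimals like 102.53, acronyms like U.S.A) never trigger
    a split because only the final character of a token is inspected."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok[-1] in ".?!" and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

A statistical SBD system replaces the fixed abbreviation set with a learned model, but the control flow of "suppress the break at ambiguous periods" is the same.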

5 citations

Journal ArticleDOI
TL;DR: A model is proposed which is useful for text summarization of the given document by using pattern recognition techniques for improving the retrieval performance of the relevant information.
Abstract: In the present work a model is proposed which is useful for text summarization of a given document, using pattern recognition techniques to improve the retrieval performance of relevant information. The design and implementation of the proposed system is concerned with methods for summarizing the information retrieved from a collection of documents or corpora. The quality of a system is measured by how useful it is to the typical users of the system. In the basic approach, a query is considered generated from an "ideal" document that satisfies the information need. The system's job is then to estimate the likelihood of each document in the collection being the ideal document and rank them accordingly. The recent development of related techniques stimulates new modeling and estimation methods that are beyond the scope of the traditional approaches.
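A toy frequency-based extractive summarizer illustrates the general idea of ranking sentences by term statistics (a sketch under simplifying assumptions: the stop list, scoring scheme and names below are not from the paper, which uses pattern recognition techniques):

```python
from collections import Counter

STOP = {"the", "a", "is", "of", "to", "in", "and"}

def summarize(sentences, k=1):
    """Toy extractive summarizer: score each sentence by the corpus-wide
    frequency of its non-stopword terms, then keep the k highest-scoring
    sentences in their original order."""
    words = [w.lower().strip(".,") for s in sentences for w in s.split()]
    freq = Counter(w for w in words if w not in STOP)

    def score(s):
        terms = [w.lower().strip(".,") for w in s.split()]
        return sum(freq[w] for w in terms if w not in STOP)

    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]
```

Sentences built from frequent content words are treated as most representative of the document, the simplest form of the "estimate the most likely ideal document" idea.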

4 citations

Journal ArticleDOI
TL;DR: A Vector Space based translation model has been proposed that transforms a Vector Space by graphical representation of text that addresses the issues of manual, automatic and adaptive strategies by incorporating the selection preferences for word argument positions.
Abstract: In the present work a model is proposed which deals with specifying the pattern for translating the English sentences into Hindi. Here a Vector Space based translation model has been proposed that transforms a Vector Space by graphical representation of text that addresses the issues of manual, automatic and adaptive strategies by incorporating the selection preferences for word argument positions. Vector Space Model (VSM) represents documents and queries usually as Vectors, Matrices or Tuples. The similarity of the Query Vector and Document Vector is represented as a scalar value. This model constructs a sentence graph for a given sentence and applies structural parsing on this sentence. The quality of a system is measured by considering its usefulness for typical users of the system. The recent development of related techniques stimulates new modeling and estimation methods that are beyond the scope of the traditional approaches.
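The scalar similarity between a Query Vector and a Document Vector in a VSM is typically cosine similarity; a minimal sketch over raw term-frequency vectors (the function name and plain bag-of-words representation are illustrative assumptions, not the paper's graph-based formulation):

```python
import math
from collections import Counter

def cosine(query, document):
    """Cosine similarity between a query and a document, each given as a
    list of terms and represented internally as a term-frequency vector.
    The result is the single scalar score the VSM ranks documents by."""
    q, d = Counter(query), Counter(document)
    dot = sum(q[t] * d[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```

Real systems usually weight the vector components by TF-IDF rather than raw counts, but the scalar comparison is the same.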

Journal ArticleDOI
TL;DR: The present work is an attempt in the direction of writing computer programs for defining texts in the form of vector algebra and their basis so that the pattern of occurrence of parts of speech can be modeled in the form of a Markov Chain.
Abstract: The present work is an attempt in the direction of writing of computer programs for defining texts in the form of vector algebra and their basis so that pattern of occurrence of parts of speech could be modeled in the form of Markov Chain.
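A Markov chain over parts of speech reduces to a first-order transition-probability estimate; a hypothetical sketch (the tag names and function below are illustrative assumptions, not the authors' program):

```python
from collections import Counter, defaultdict

def transition_matrix(tag_sequences):
    """Estimate first-order Markov transition probabilities
    P(tag_i | tag_{i-1}) from part-of-speech tag sequences by
    counting adjacent tag pairs and normalizing per source tag."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            counts[prev][cur] += 1
    return {prev: {cur: n / sum(nxt.values()) for cur, n in nxt.items()}
            for prev, nxt in counts.items()}
```

Each row of the resulting matrix is a probability distribution over the next tag, which is exactly the "pattern of occurrence of parts of speech" a Markov chain captures.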

Cited by
Journal ArticleDOI
TL;DR: This survey investigates the recent advancement in the field of text analysis and covers two basic approaches of text mining, such as classification and clustering that are widely used for the exploration of the unstructured text available on the Web.
Abstract: In this survey, we review different text mining techniques to discover various textual patterns from social networking sites. Social network applications create opportunities to establish interaction among people, leading to mutual learning and sharing of valuable knowledge through channels such as chat, comments, and discussion boards. Data in social networking websites is inherently unstructured and fuzzy in nature. In everyday conversations, people do not care about spelling or the accurate grammatical construction of a sentence, which may lead to different types of ambiguities: lexical, syntactic, and semantic. Therefore, analyzing and extracting information patterns from such data sets is more complex. Several surveys have been conducted to analyze different methods of information extraction. Most of these surveys emphasized the application of text mining techniques to unstructured data sets residing in the form of text documents, but did not specifically target the data sets in social networking websites. This survey attempts to provide a thorough understanding of different text mining techniques as well as the application of these techniques in social networking websites. It investigates the recent advancement in the field of text analysis and covers two basic approaches of text mining, classification and clustering, that are widely used for the exploration of the unstructured text available on the Web.

100 citations

Proceedings ArticleDOI
24 Oct 2016
TL;DR: This work introduces a novel retrieval model by viewing the matching between queries and documents as a non-linear word transportation (NWT) problem, and defines the capacity and profit of a transportation model designed for the IR task.
Abstract: A common limitation of many information retrieval (IR) models is that relevance scores are solely based on exact (i.e., syntactic) matching of words in queries and documents under the simple Bag-of-Words (BoW) representation. This not only leads to the well-known vocabulary mismatch problem, but also does not allow semantically related words to contribute to the relevance score. Recent advances in word embedding have shown that semantic representations for words can be efficiently learned by distributional models. A natural generalization is then to represent both queries and documents as Bag-of-Word-Embeddings (BoWE), which provides a better foundation for semantic matching than BoW. Based on this representation, we introduce a novel retrieval model by viewing the matching between queries and documents as a non-linear word transportation (NWT) problem. With this formulation, we define the capacity and profit of a transportation model designed for the IR task. We show that this transportation problem can be efficiently solved via pruning and indexing strategies. Experimental results on several representative benchmark datasets show that our model can outperform many state-of-the-art retrieval models as well as recently introduced word embedding-based models. We also conducted extensive experiments to analyze the effect of different settings on our semantic matching model.
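The intuition behind matching Bags-of-Word-Embeddings can be conveyed with a much simpler greedy scheme than the paper's non-linear word transportation model (this sketch, with toy two-dimensional vectors and invented names, only illustrates why semantically related rather than identical words can contribute to the relevance score):

```python
import math

def cos(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def bowe_score(query_vecs, doc_vecs):
    """Greedy BoWE match: each query word contributes the similarity of
    its best-matching document word, so a document can score well on a
    query term it never contains verbatim, avoiding vocabulary mismatch."""
    return sum(max(cos(q, d) for d in doc_vecs) for q in query_vecs) / len(query_vecs)
```

The transportation formulation generalizes this greedy matching by optimally distributing each query word's "mass" across document words under capacity constraints.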

70 citations

Proceedings ArticleDOI
02 Feb 2017
TL;DR: This paper addresses the task of document retrieval based on the degree of document relatedness to the meanings of a query by presenting a semantic-enabled language model that adopts a probabilistic reasoning model for calculating the conditional probability of a query concept given values assigned to document concepts.
Abstract: This paper addresses the task of document retrieval based on the degree of document relatedness to the meanings of a query by presenting a semantic-enabled language model. Our model relies on the use of semantic linking systems for forming a graph representation of documents and queries, where nodes represent concepts extracted from documents and edges represent semantic relatedness between concepts. Based on this graph, our model adopts a probabilistic reasoning model for calculating the conditional probability of a query concept given values assigned to document concepts. We present an integration framework for interpolating other retrieval systems with the presented model in this paper. Our empirical experiments on a number of TREC collections show that the semantic retrieval has a synergetic impact on the results obtained through state of the art keyword-based approaches, and the consideration of semantic information obtained from entity linking on queries and documents can complement and enhance the performance of other retrieval models.

69 citations

Journal ArticleDOI
Meiyan Huang, Wei Yang, Mei Yu, Zhentai Lu, Qianjin Feng, Wufan Chen
TL;DR: A content-based image retrieval (CBIR) system is proposed for the retrieval of T1-weighted contrast-enhanced MRI (CE-MRI) images of brain tumors with results demonstrating that the mean average precision values of the proposed method range from 90.4% to 91.5% for different views.
Abstract: A content-based image retrieval (CBIR) system is proposed for the retrieval of T1-weighted contrast-enhanced MRI (CE-MRI) images of brain tumors. In this CBIR system, spatial information in the bag-of-visual-words model and domain knowledge on the brain tumor images are considered for the representation of brain tumor images. A similarity metric is learned through a distance metric learning algorithm to reduce the gap between the visual features and the semantic concepts in an image. The learned similarity metric is then used to measure the similarity between two images and then retrieve the most similar images in the dataset when a query image is submitted to the CBIR system. The retrieval performance of the proposed method is evaluated on a brain CE-MRI dataset with three types of brain tumors (i.e., meningioma, glioma, and pituitary tumor). The experimental results demonstrate that the mean average precision values of the proposed method range from 90.4% to 91.5% for different views (transverse, coronal, and sagittal) with an average value of 91.0%.

39 citations

Book ChapterDOI
26 Mar 2018
TL;DR: A novel model that performs sequence labeling to collectively classify all text blocks in an HTML page as either boilerplate or main content is introduced, which sets a new state-of-the-art performance for boilerplate removal on the CleanEval benchmark.
Abstract: Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this issue, we introduce a novel model that performs sequence labeling to collectively classify all text blocks in an HTML page as either boilerplate or main content. Our method uses a hidden Markov model on top of potentials derived from DOM tree features using convolutional neural networks. The proposed method sets a new state-of-the-art performance for boilerplate removal on the CleanEval benchmark. As a component of information retrieval pipelines, it improves retrieval performance on the ClueWeb12 collection.

30 citations