
Showing papers on "Document retrieval published in 2015"


Journal ArticleDOI
TL;DR: Using naturalistic inquiry methodology, an empirical study of user-based relevance interpretations is reported that reflects the nature of the thought processes of users who are evaluating bibliographic citations produced by a document retrieval system.
Abstract: Experimental research in information retrieval (IR) depends on the idea of relevance. Because of its key role in IR, recent questions about relevance have raised issues of methodological concern and have shaken the philosophical foundations of IR theory development. Despite an existing set of theoretical definitions of this concept, our understanding of relevance from users' perspectives is still limited. Using naturalistic inquiry methodology, this article reports an empirical study of user-based relevance interpretations. A model is presented that reflects the nature of the thought processes of users who are evaluating bibliographic citations produced by a document retrieval system. Three major categories of variables affecting relevance assessments are identified and described: internal context, external context, and problem context. Users' relevance assessments involve multiple layers of interpretations that are derived from individuals' experiences, perceptions, and private knowledge related to the pa...

201 citations


Journal ArticleDOI
TL;DR: In this article, a model that uses recurrent neural networks with Long Short-Term Memory (LSTM) cells was developed to address sentence embedding, a hot topic in current natural language processing research.
Abstract: This paper develops a model that addresses sentence embedding, a hot topic in current natural language processing research, using recurrent neural networks with Long Short-Term Memory (LSTM) cells. Due to its ability to capture long-term memory, the LSTM-RNN accumulates increasingly richer information as it goes through the sentence, and when it reaches the last word, the hidden layer of the network provides a semantic representation of the whole sentence. In this paper, the LSTM-RNN is trained in a weakly supervised manner on user click-through data logged by a commercial web search engine. Visualization and analysis are performed to understand how the embedding process works. The model is found to automatically attenuate the unimportant words and detect the salient keywords in the sentence. Furthermore, these detected keywords are found to automatically activate different cells of the LSTM-RNN, where words belonging to a similar topic activate the same cell. As a semantic representation of the sentence, the embedding vector can be used in many different applications. These automatic keyword detection and topic allocation abilities enabled by the LSTM-RNN allow the network to perform document retrieval, a difficult language processing task, where the similarity between the query and documents can be measured by the distance between their corresponding sentence embedding vectors computed by the LSTM-RNN. On a web search task, the LSTM-RNN embedding is shown to significantly outperform several existing state-of-the-art methods. We emphasize that the proposed model generates sentence embedding vectors that are especially useful for web document retrieval tasks. A comparison with a well-known general sentence embedding method, the Paragraph Vector, is performed. The results show that the proposed method significantly outperforms it for the web document retrieval task.

175 citations
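A minimal sketch of the core mechanic described above, assuming a toy vocabulary and untrained weights: run an LSTM over a token sequence, take the hidden state after the last word as the sentence embedding, and compare query and document by cosine similarity. The class and parameter names are invented; the paper's model is additionally trained on click-through data, which is omitted here.

```python
# Illustrative only: an untrained LSTM sentence embedder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEmbedder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)
        # Hidden state after the last word = embedding of the whole sentence.
        return h_n[-1]

model = SentenceEmbedder()
query = torch.randint(0, 10000, (1, 5))   # toy token ids
doc = torch.randint(0, 10000, (1, 12))
sim = F.cosine_similarity(model(query), model(doc))
print(sim.item())  # meaningless until the network is trained
```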


Book ChapterDOI
01 Jan 2015
TL;DR: In this article, the authors provide an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic analysis method today, and illustrate how to employ LDA on a textual data set.
Abstract: Topic analysis is a powerful tool that extracts “topics” from document collections. Unlike manual tagging, which is effort intensive and requires expertise in the documents’ subject matter, topic analysis (in its simplest form) is an automated process. Relying on the assumption that each document in a collection refers to a small number of topics, it extracts bags of words attributable to these topics. These topics can be used to support document retrieval or to relate documents to each other through their associated topics. Given the variety and amount of textual information included in software repositories, in issue reports, in commit and source-code comments, and in other forms of documentation, this method has found many applications in the software-engineering field of mining software repositories. This chapter provides an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic-analysis method today. Next, it illustrates, with a brief tutorial introduction, how to employ LDA on a textual data set. Third, it reviews the software-engineering literature for uses of LDA for analyzing textual software-development assets, in order to support developers’ activities. Finally, we discuss the interpretability of the automatically extracted topics, and their correlation with tags provided by subject-matter experts.

148 citations
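For readers who want to try the LDA workflow the chapter tutors, here is a brief illustration with scikit-learn; the toy corpus (commit-message-like snippets) and the parameter choices are placeholders, not taken from the chapter.

```python
# Extract topics from a tiny "software repository" corpus with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "fix null pointer exception in parser",
    "add unit tests for parser module",
    "update login page layout and css",
    "refactor css styles for login form",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)          # per-document topic mixtures

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = comp.argsort()[-4:][::-1]        # top words per topic
    print(f"topic {k}:", [terms[i] for i in top])
```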


Patent
15 Sep 2015
TL;DR: In this article, a retrieval request for one or more documents containing search terms descriptive of the documents can be processed by identifying a set of candidate documents tagged with subjects, then using affinity values to adjust the aggregate score for the terms in the dictionaries.
Abstract: Techniques for managing big data include retrieval using per-subject dictionaries having multiple levels of sub-classification hierarchy within the subject. Entries may include subject-determining-power (SDP) scores that provide an indication of the descriptive power of the entry term with respect to the subject of the dictionary containing the term. The same term may have entries in multiple dictionaries with different SDP scores in each of the dictionaries. A retrieval request for one or more documents containing search terms descriptive of the one or more documents can be processed by identifying a set of candidate documents tagged with subjects, i.e., identifiers of per-subject dictionaries having entries corresponding to a search term, then using affinity values to adjust the aggregate score for the terms in the dictionaries. Documents are then selected for best match to the subject based on the adjusted scores. Alternatively, the adjustment may be performed after selecting the documents by re-ordering them according to adjusted scores.

100 citations
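A hypothetical data-structure sketch of the retrieval flow the patent describes: per-subject dictionaries mapping terms to SDP scores, aggregate scoring per subject, and an affinity-based adjustment. All dictionary contents, affinity values, and the exact adjustment rule are invented for illustration.

```python
from collections import defaultdict

# term -> SDP score, one dictionary per subject; the same term may appear
# in several subjects with different SDP scores.
dictionaries = {
    "finance": {"bond": 0.9, "yield": 0.8, "python": 0.1},
    "programming": {"python": 0.95, "yield": 0.6, "bond": 0.05},
}
affinity = {("finance", "programming"): 0.2}  # invented inter-subject affinity

def score_subjects(search_terms):
    scores = defaultdict(float)
    for subject, d in dictionaries.items():
        for t in search_terms:
            scores[subject] += d.get(t, 0.0)   # aggregate SDP per subject
    # Adjust aggregate scores using affinity between subject pairs.
    for (a, b), w in affinity.items():
        boost = w * min(scores[a], scores[b])
        scores[a] += boost
        scores[b] += boost
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score_subjects(["python", "yield"]))
```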


Proceedings ArticleDOI
09 Aug 2015
TL;DR: A novel retrieval model that incorporates term dependencies into structured document retrieval is proposed and applied to the task of ERWD; experiments indicate a significant improvement in the accuracy of retrieval results by the proposed model over state-of-the-art retrieval models for ERWD.
Abstract: Previously proposed approaches to ad-hoc entity retrieval in the Web of Data (ERWD) used multi-fielded representation of entities and relied on standard unigram bag-of-words retrieval models. Although retrieval models incorporating term dependencies have been shown to be significantly more effective than the unigram bag-of-words ones for ad hoc document retrieval, it is not known whether accounting for term dependencies can improve retrieval from the Web of Data. In this work, we propose a novel retrieval model that incorporates term dependencies into structured document retrieval and apply it to the task of ERWD. In the proposed model, the document field weights and the relative importance of unigrams and bigrams are optimized with respect to the target retrieval metric using a learning-to-rank method. Experiments on a publicly available benchmark indicate significant improvement of the accuracy of retrieval results by the proposed model over state-of-the-art retrieval models for ERWD.

89 citations
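As a rough illustration of the model's ingredients, the sketch below scores an entity's multi-fielded representation with a weighted mix of unigram and bigram matches per field; the field weights and the unigram/bigram interpolation parameter lam are exactly the quantities a learning-to-rank method would tune. All names and values here are assumptions, not the paper's formulation.

```python
def score(query, fields, field_w, lam=0.8):
    """fields: {field_name: list of tokens}; lam weights unigrams vs bigrams."""
    q_uni = query.split()
    q_bi = list(zip(q_uni, q_uni[1:]))
    total = 0.0
    for name, tokens in fields.items():
        bi = list(zip(tokens, tokens[1:]))
        uni_score = sum(tokens.count(u) for u in q_uni) / max(len(tokens), 1)
        bi_score = sum(bi.count(b) for b in q_bi) / max(len(bi), 1)
        total += field_w[name] * (lam * uni_score + (1 - lam) * bi_score)
    return total

entity = {"name": "new york city".split(),
          "attributes": "largest city in new york state".split()}
weights = {"name": 0.7, "attributes": 0.3}   # tuned by learning-to-rank in the paper
print(score("new york", entity, weights))
```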


Proceedings ArticleDOI
10 Aug 2015
TL;DR: Diversified RBM (DRBM) is proposed which diversifies the hidden units, to make them cover not only the dominant topics, but also those in the long-tail region, and it is proved that maximizing the lower bound with projected gradient ascent can increase this diversity metric.
Abstract: Restricted Boltzmann Machine (RBM) has shown great effectiveness in document modeling. It utilizes hidden units to discover the latent topics and can learn compact semantic representations for documents which greatly facilitate document retrieval, clustering and classification. The popularity (or frequency) of topics in text corpora usually follow a power-law distribution where a few dominant topics occur very frequently while most topics (in the long-tail region) have low probabilities. Due to this imbalance, RBM tends to learn multiple redundant hidden units to best represent dominant topics and ignore those in the long-tail region, which renders the learned representations to be redundant and non-informative. To solve this problem, we propose Diversified RBM (DRBM) which diversifies the hidden units, to make them cover not only the dominant topics, but also those in the long-tail region. We define a diversity metric and use it as a regularizer to encourage the hidden units to be diverse. Since the diversity metric is hard to optimize directly, we instead optimize its lower bound and prove that maximizing the lower bound with projected gradient ascent can increase this diversity metric. Experiments on document retrieval and clustering demonstrate that with diversification, the document modeling power of DRBM can be greatly improved.

84 citations
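One plausible instantiation of a diversity metric over hidden units, for intuition only: the mean pairwise angular dissimilarity of the hidden units' weight vectors. The paper's actual metric and its optimized lower bound may differ.

```python
import numpy as np

def diversity(W):
    """W: (n_hidden, n_visible) weight matrix, one row per hidden unit."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = Wn @ Wn.T
    n = W.shape[0]
    off_diag = cos[~np.eye(n, dtype=bool)]
    return float(np.mean(1.0 - off_diag))   # higher = more diverse units

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 100))
print(diversity(W))  # would be added as a regularizer to the RBM objective
```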


Journal ArticleDOI
TL;DR: Experimental results over TREC collections show that the proposed LES approach is effective in capturing latent semantic content and can significantly improve the search accuracy of several state-of-the-art retrieval models for entity-bearing queries.
Abstract: Analysis on Web search query logs has revealed that there is a large portion of entity-bearing queries, reflecting the increasing demand of users on retrieving relevant information about entities such as persons, organizations, products, etc. In the meantime, significant progress has been made in Web-scale information extraction, which enables efficient entity extraction from free text. Since an entity is expected to capture the semantic content of documents and queries more accurately than a term, it would be interesting to study whether leveraging the information about entities can improve the retrieval accuracy for entity-bearing queries. In this paper, we propose a novel retrieval approach, i.e., latent entity space (LES), which models the relevance by leveraging entity profiles to represent semantic content of documents and queries. In the LES, each entity corresponds to one dimension, representing one semantic relevance aspect. We propose a formal probabilistic framework to model the relevance in the high-dimensional entity space. Experimental results over TREC collections show that the proposed LES approach is effective in capturing latent semantic content and can significantly improve the search accuracy of several state-of-the-art retrieval models for entity-bearing queries.

72 citations
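A toy sketch of scoring in an entity space, under the simplifying assumption that relevance decomposes as a sum over entity dimensions of query-entity affinity times entity-document weight; the probabilistic estimates the paper derives from entity profiles are replaced by hand-set numbers.

```python
import numpy as np

entities = ["Barack Obama", "Chicago", "Harvard"]
p_q_given_e = np.array([0.6, 0.1, 0.3])    # query affinity to each entity (toy)
p_e_given_d = np.array([                    # entity weight in each document (toy)
    [0.7, 0.2, 0.1],    # doc 0
    [0.1, 0.8, 0.1],    # doc 1
])

scores = p_e_given_d @ p_q_given_e          # sum over entity dimensions
print(scores)   # doc 0 ranks above doc 1 for this entity-bearing query
```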


Journal ArticleDOI
TL;DR: The theoretical principles for the design of effective information retrieval systems are discussed, and an experimental online catalog system based on these principles is described, called CHESHIRE, which uses a method called "classification clustering," combined with probabilistic retrieval techniques, to provide natural language searching.
Abstract: Research into online catalog use and users has found some pervasive problems with subject searching in these systems. Subject searches too often fail to retrieve anything, and those that do succeed often retrieve "too much" material. This article examines these problems and how they might be remedied. The theoretical principles for the design of effective information retrieval systems are discussed, and an experimental online catalog system based on these principles is described. The system, CHESHIRE, uses a method called "classification clustering," combined with probabilistic retrieval techniques, to provide natural language searching (which helps to reduce search failure) and to provide effective control of "information overload" in subject searching.

70 citations


Journal ArticleDOI
TL;DR: This article provides a comprehensive and structured overview of verbose query processing methods; as the focus of many novel search applications has shifted from short keyword queries to verbose natural language queries, effective handling of verbose queries has become a critical factor for the adoption of information retrieval techniques in this new breed of search applications.
Abstract: Recently, the focus of many novel search applications shifted from short keyword queries to verbose natural language queries. Examples include question answering systems and dialogue systems, voice search on mobile devices and entity search engines like Facebook's Graph Search or Google's Knowledge Graph. However, the performance of textbook information retrieval techniques for such verbose queries is not as good as that for their shorter counterparts. Thus, effective handling of verbose queries has become a critical factor for the adoption of information retrieval techniques in this new breed of search applications. Over the past decade, the information retrieval community has deeply explored the problem of transforming natural language verbose queries using operations like reduction, weighting, expansion, reformulation and segmentation into more effective structural representations. However, thus far, there has been no coherent and organized tutorial on this topic. In this tutorial, we aim to put together various research pieces of the puzzle, provide a comprehensive and structured overview of various proposed methods, and also list various application scenarios where effective verbose query processing can make a significant difference.

64 citations


Journal ArticleDOI
TL;DR: A semantic-aware co-indexing algorithm to jointly embed two strong cues into the inverted indexes: 1) local invariant features that are robust to delineate low-level image contents, and 2) semantic attributes from large-scale object recognition that may reveal image semantic meanings.
Abstract: In content-based image retrieval, inverted indexes allow fast access to database images and summarize all knowledge about the database. Indexing multiple clues of image contents allows retrieval algorithms to search for relevant images from different perspectives, which is appealing for delivering satisfactory user experiences. However, when incorporating diverse image features during online retrieval, it is challenging to ensure retrieval efficiency and scalability. In this paper, for large-scale image retrieval, we propose a semantic-aware co-indexing algorithm to jointly embed two strong cues into the inverted indexes: 1) local invariant features that are robust to delineate low-level image contents, and 2) semantic attributes from large-scale object recognition that may reveal image semantic meanings. Specifically, for an initial set of inverted indexes of local features, we utilize semantic attributes to filter out isolated images and insert semantically similar images into this initial set. Encoding these two distinct and complementary cues together effectively enhances the discriminative capability of inverted indexes. Such co-indexing operations are totally off-line and introduce small computation overhead to online retrieval, because only local features, but no semantic attributes, are employed for the query. Hence, this co-indexing is different from existing image retrieval methods that fuse multiple features or retrieval results. Extensive experiments and comparisons with recent retrieval methods demonstrate the competitive performance of our method.

51 citations
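A much-simplified sketch of the co-indexing step, assuming each image carries an L2-normalized semantic-attribute vector: semantically isolated images are pruned from a local-feature posting list and close neighbors are inserted. The thresholds and the exact filter/insert rules are assumptions, not the paper's procedure.

```python
import numpy as np

def co_index(posting, attrs, drop_thr=0.2, add_thr=0.8):
    """posting: image ids sharing one visual word; attrs: (n, d) unit vectors."""
    n = len(attrs)
    # Drop images whose attributes are isolated within the posting list.
    kept = [i for i in posting
            if max((attrs[i] @ attrs[j] for j in posting if j != i),
                   default=0.0) >= drop_thr]
    # Insert semantically similar images from outside the posting list.
    added = {j for i in kept for j in range(n)
             if j not in posting and attrs[i] @ attrs[j] >= add_thr}
    return kept + sorted(added)

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 8))
A /= np.linalg.norm(A, axis=1, keepdims=True)
print(co_index([0, 1, 2], A))
```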


Journal ArticleDOI
TL;DR: Different types of similarity, such as lexical similarity and semantic similarity, are described; these play an important role in the categorization of text as well as documents.
Abstract: With the large number of documents on the web, there is an increasing need to be able to retrieve the most relevant documents. There are different techniques through which we can retrieve the most relevant documents from a large corpus. Similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. Text similarity means that the user's query text is matched with the document text, and on the basis of this matching the user retrieves the most relevant documents. Text similarity also plays an important role in the categorization of text as well as documents. We can measure the similarity between sentences, words, paragraphs and documents to categorize them in an efficient way. On the basis of this categorization, we can retrieve the best relevant document corresponding to the user's query. This paper describes different types of similarity, such as lexical similarity, semantic similarity, etc.
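As a concrete baseline for the lexical similarity the paper discusses, the snippet below matches a query against documents by cosine similarity over tf-idf vectors using scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "a dog chased the cat",
        "stock markets fell sharply today"]
query = ["cat on a mat"]

vec = TfidfVectorizer()
D = vec.fit_transform(docs)       # document vectors
Q = vec.transform(query)          # query vector in the same space
print(cosine_similarity(Q, D))    # highest score for the most similar doc
```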

Proceedings ArticleDOI
16 May 2015
TL;DR: This technical briefing presents the state of the art Text Retrieval and Natural Language Processing techniques used in Software Engineering and discusses their applications in the field.
Abstract: This technical briefing presents the state of the art Text Retrieval and Natural Language Processing techniques used in Software Engineering and discusses their applications in the field.

Proceedings Article
25 Jan 2015
TL;DR: This paper proposes two word embedding based models for acronym disambiguation and evaluates the models on the MSH dataset and the ScienceWISE dataset; both models outperform the state-of-the-art methods on accuracy.
Abstract: According to the website AcronymFinder.com, which is one of the world's largest and most comprehensive dictionaries of acronyms, an average of 37 new human-edited acronym definitions are added every day. There are 379,918 acronyms with 4,766,899 definitions on that site up to now, and each acronym has 12.5 definitions on average. It is a very important research topic to identify what exactly an acronym means in a given context, for document comprehension as well as for document retrieval. In this paper, we propose two word embedding based models for acronym disambiguation. Word embedding represents words in a continuous, multidimensional vector space, so that the semantic similarity between words can be calculated easily as a vector distance. We evaluate the models on the MSH dataset and the ScienceWISE dataset, and both models outperform the state-of-the-art methods on accuracy. The experimental results show that word embedding helps to improve acronym disambiguation.
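A hedged sketch of the embedding-based idea: average the context word vectors around the acronym and pick the definition whose averaged vector is closest. The vectors below are random stand-ins for trained word embeddings, so this only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "embeddings"; a real system would load trained vectors.
emb = {w: rng.normal(size=50) for w in
       ["heart", "attack", "muscle", "injury", "major", "league", "baseball"]}

def embed(words):
    vs = [emb[w] for w in words if w in emb]
    v = np.mean(vs, axis=0)
    return v / np.linalg.norm(v)

definitions = {"myocardial infarction": embed(["heart", "attack", "muscle"]),
               "Major League Baseball": embed(["major", "league", "baseball"])}
context = embed(["heart", "injury"])    # words around the acronym "MI"

best = max(definitions, key=lambda d: definitions[d] @ context)
print(best)   # likely the medical sense, since "heart" is shared
```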

Proceedings ArticleDOI
01 Sep 2015
TL;DR: This work uses the images in image-text documents of each language as the hub and derives a common semantic subspace bridging two languages by means of generalized canonical correlation analysis, which substantially enhances retrieval accuracy in zero-shot and few-shot scenarios where text-to-text examples are scarce.
Abstract: We propose an image-mediated learning approach for cross-lingual document retrieval where no or only a few parallel corpora are available. Using the images in image-text documents of each language as the hub, we derive a common semantic subspace bridging two languages by means of generalized canonical correlation analysis. For the purpose of evaluation, we create and release a new document dataset consisting of three types of data (English text, Japanese text, and images). Our approach substantially enhances retrieval accuracy in zero-shot and few-shot scenarios where text-to-text examples are scarce.
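A two-view simplification of the image-mediated idea, using scikit-learn's CCA in place of generalized CCA: align one language's text features with image features, then retrieve in the shared space. The feature matrices are random placeholders; a second CCA fit on the other language against the same image hub would complete the cross-lingual bridge.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
img = rng.normal(size=(100, 30))            # image features (the "hub")
en_text = img @ rng.normal(size=(30, 40)) + 0.1 * rng.normal(size=(100, 40))

cca = CCA(n_components=5)
cca.fit(en_text, img)                       # align the text view with the image view
en_proj, img_proj = cca.transform(en_text, img)

# A Japanese-side CCA fit the same way would map ja_text into the same
# image-anchored space, enabling retrieval without parallel text.
q = en_proj[0]
sims = img_proj @ q / (np.linalg.norm(img_proj, axis=1) * np.linalg.norm(q))
print(sims.argmax())                        # nearest item in the shared space
```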

Proceedings ArticleDOI
01 Dec 2015
TL;DR: This work analyzes and evaluates the retrieval effectiveness of the vector-space model on the new FIRE 2011 dataset; the results show that the TF-IDF model gives the highest precision values with the new corpus.
Abstract: An information retrieval system is a system capable of the storage, retrieval, and maintenance of information. In this context, information can be composed of text (including numeric and date data), images, audio, video and other multimedia objects. The TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. There exist various models for weighting the terms of corpus documents and query terms. This work analyzes and evaluates the retrieval effectiveness of the vector-space model on the new FIRE 2011 dataset. The experiments were performed with TF-IDF and its variants. For all experiments and evaluation, the open-source search engine Terrier 3.5 was used. Our results show that the TF-IDF model gives the highest precision values with the new corpus dataset.
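A worked example of the TF-IDF weight and one common variant (sublinear/log tf), computed by hand on a toy corpus; the numbers have no connection to the FIRE 2011 runs.

```python
import math

docs = [["web", "search", "engine"],
        ["web", "retrieval"],
        ["image", "retrieval", "retrieval"]]
N = len(docs)

def tf_idf(term, doc, log_tf=False):
    tf = doc.count(term)
    if log_tf and tf > 0:
        tf = 1 + math.log(tf)               # sublinear tf variant
    df = sum(1 for d in docs if term in d)  # document frequency
    return tf * math.log(N / df) if df else 0.0

print(tf_idf("retrieval", docs[2]))               # raw tf weighting
print(tf_idf("retrieval", docs[2], log_tf=True))  # log tf variant
```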

Journal ArticleDOI
TL;DR: This exploratory study sought to examine judgments of relevance of document representations to query statements made by people other than the originators of the queries and found some interesting differences and similarities between the groups.
Abstract: This exploratory study sought to examine judgments of relevance of document representations to query statements made by people other than the originators of the queries. A small group of graduate students in the School of Information and Library Studies and undergraduates at the University of Michigan judged sets of documents that had been retrieved for and judged by real users for a previous study. The secondary judges' assessments of relevance were analyzed by themselves and in comparison to the users' assessments. The judges performed reasonably well, but some important differences were identified. Secondary judges use the various fields of document records differently than users, and they have a higher threshold of relevance. There are other interesting differences and similarities between the groups. Implications of these findings for designing and testing document retrieval systems are discussed.

01 Jan 2015
TL;DR: Two approaches to semantic search are presented that incorporate Linked Data annotations of documents into a Generalized Vector Space Model; one exploits taxonomic relationships among entities in documents and queries, and an evaluation dataset with annotated documents, queries and user-rated relevance assessments is published.
Abstract: This paper presents two approaches to semantic search by incorporating Linked Data annotations of documents into a Generalized Vector Space Model. One model exploits taxonomic relationships among entities in documents and queries, while the other model computes term weights based on semantic relationships within a document. We publish an evaluation dataset with annotated documents and queries as well as user-rated relevance assessments. The evaluation on this dataset shows significant improvements of both models over traditional keyword based search.
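A minimal sketch of the Generalized Vector Space Model mechanic: a term-term similarity matrix S (hand-set here; derived from Linked Data taxonomies in the paper) lets related but distinct terms contribute to the query-document score via q^T S d.

```python
import numpy as np

terms = ["car", "automobile", "banana"]
S = np.array([[1.0, 0.9, 0.0],     # car ~ automobile via the taxonomy
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

q = np.array([1.0, 0.0, 0.0])      # query mentions "car"
d = np.array([0.0, 1.0, 0.0])      # document mentions "automobile"

print(q @ S @ d)                   # nonzero despite no shared term
```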

Journal ArticleDOI
TL;DR: The results of an empirical evaluation indicate that the new approach leads to better retrieval performance than baseline approaches that use text retrieval and clustering, and that the users exploiting the technique were significantly better supported in the identification of the code to be changed in response to a bug fixing request.
Abstract: During software evolution, one of the most important comprehension activities is concept location in source code, as it identifies the places in the code where changes are to be made in response to a modification request. Change requests (such as bug fixing or new feature requests) are usually formulated in natural language, while the source code also includes large amounts of text. In consequence, many of the existing concept location techniques are based on text search or text retrieval. Such approaches reformulate concept location as a document retrieval problem. We refine and improve such solutions by leveraging dependencies between source code elements. Dependency information is used by a link analysis algorithm to rank the document space and to improve concept location based on text retrieval. We implemented our solution to concept location using the PageRank algorithm, used in web document retrieval applications. The results of an empirical evaluation indicate that the new approach leads to better retrieval performance than baseline approaches that use text retrieval and clustering. In addition, we present the results of a controlled experiment and of a differentiated replication to assess whether the new technique supports users in identifying the places in the code where changes are to be made. The results of these experiments revealed that the users exploiting our technique were significantly better supported in the identification of the code to be changed in response to a bug fixing request, compared to the users who did not use this technique.
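A hedged sketch of the combination described above, using networkx's PageRank over a toy dependency graph and mixing it with invented text-retrieval scores; the graph, scores, and mixing weight alpha are all illustrative assumptions, not the paper's setup.

```python
import networkx as nx

# Toy source-code dependency graph: edges point from dependent to dependency.
deps = nx.DiGraph([("Parser", "Lexer"), ("Compiler", "Parser"),
                   ("Compiler", "Lexer"), ("Tests", "Parser")])
pr = nx.pagerank(deps)                      # link-analysis score per element

text_score = {"Parser": 0.8, "Lexer": 0.3, "Compiler": 0.5, "Tests": 0.2}
alpha = 0.6                                 # weight on textual similarity
combined = {m: alpha * text_score[m] + (1 - alpha) * pr[m] for m in pr}
for m, s in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(m, round(s, 3))                   # candidate locations, re-ranked
```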

Journal ArticleDOI
TL;DR: The use of text classifiers for automatically classifying documents according to their corresponding group of semantically related documents is evaluated, with the highest performance in terms of classification accuracy achieved by a Rocchio classifier and a kNN classifier with the application of dimensionality reduction and using the tf-idf weighting method.
Abstract: Organizing construction project documents based on semantic similarities offers several advantages over traditional metadata criteria, including facilitating document retrieval and enhancing knowledge reuse. In this study, the use of text classifiers for automatically classifying documents according to their corresponding group of semantically related documents is evaluated. Supporting documents of claims were used as representations of document discourses. The evaluation was performed under varying general conditions (such as dimensionality level and weighting method) to assess the effect of such conditions on performance, and varying classifier-specific parameters. The highest performance in terms of classification accuracy was achieved by a Rocchio classifier and a kNN classifier with the application of dimensionality reduction and using the tf-idf weighting method. A combined classifier approach was also evaluated in which the classification outcome is based on a majority vote strategy between...
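A small scikit-learn sketch of the evaluated setup: tf-idf features, dimensionality reduction, and both a Rocchio-style classifier (NearestCentroid) and a kNN classifier. The toy documents and labels stand in for the construction-claim corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.pipeline import make_pipeline

docs = ["delay claim due to late drawings", "claim for extra excavation work",
        "schedule delay from weather", "additional earthwork quantities"]
labels = ["delay", "quantity", "delay", "quantity"]

for clf in (NearestCentroid(), KNeighborsClassifier(n_neighbors=1)):
    # tf-idf weighting + dimensionality reduction, as in the evaluation.
    pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2), clf)
    pipe.fit(docs, labels)
    print(type(clf).__name__, pipe.predict(["delay caused by rain"]))
```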

15 Mar 2015
TL;DR: This is the first study in the medical domain to use fuzzy set theory to express semantic properties of words and documents in terms of topics, and the experimental results showed major improvements.
Abstract: In the past several years, medical data have been growing explosively. For example, the number of papers published in PubMed increased from 112,177 in 1960 to 2,019,238 in 2013, and the annual average number of discharges between 2007 and 2010 is around 35 million. Recently, various text mining techniques have been introduced into the medical domain. One fundamental objective of those techniques is to process unstructured medical data into a proper format so that explicit facts can be recognized and better utilized. Topic modeling with Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003) is a popular unsupervised method for discovering the latent semantic structure of a document collection. Topic modeling has been applied to medical data for different purposes, such as medical document categorization (Sarioglu, Yadav, & Choi, 2013) and medical document retrieval (Huang et al., 2014). Despite the usefulness of topic models for medical data analysis, existing topic models such as LDA still suffer from several critical issues. One issue is the computational complexity of the model: almost all uses of topic models require probabilistic inference, which is arguably hard to achieve without approximate inference algorithms such as Gibbs sampling. Another issue is their expressive power in representing medical documents: the performance of tasks such as document classification using topic models is still not satisfactory. In this paper, we propose to model medical documents using fuzzy set theory, which models membership of objects using a possibility distribution. To the best of our knowledge, this is the first study in the medical domain to use fuzzy set theory to express semantic properties of words and documents in terms of topics. Compared with existing topic models such as LDA, the fuzzy set theory is computationally efficient. We develop several efficient strategies to model medical documents using fuzzy set theory. Regarding expressive power, we adopt real medical document collections and compare the performance of our proposed method with LDA on document modeling. The experimental results showed major improvements.

Journal ArticleDOI
TL;DR: The proposed method addresses the difficulties of clustering documents written in a compactly used local language, complementing traditional clustering with document representation based on tagging and improving clustering results by using a knowledge technology, namely ontology.
Abstract: Text documents are very significant in contemporary organizations, and their constant accumulation enlarges the scope of document storage. Standard text mining and information retrieval techniques for text documents usually rely on word matching. An alternative way of information retrieval is clustering. In this paper we suggest complementing the traditional clustering method with document representation based on tagging, and improving clustering results by using a knowledge technology, namely ontology. The proposed method addresses the difficulties of clustering documents written in a compactly used local language.

Posted Content
TL;DR: The model is found to automatically attenuate the unimportant words and detect the salient keywords in the sentence; these abilities allow the network to perform document retrieval, a difficult language processing task, where the similarity between the query and documents can be measured by the distance between their corresponding sentence embedding vectors computed by the LSTM-RNN.
Abstract: This paper develops a model that addresses sentence embedding, a hot topic in current natural language processing research, using recurrent neural networks (RNN) with Long Short-Term Memory (LSTM) cells. The proposed LSTM-RNN model sequentially takes each word in a sentence, extracts its information, and embeds it into a semantic vector. Due to its ability to capture long-term memory, the LSTM-RNN accumulates increasingly richer information as it goes through the sentence, and when it reaches the last word, the hidden layer of the network provides a semantic representation of the whole sentence. In this paper, the LSTM-RNN is trained in a weakly supervised manner on user click-through data logged by a commercial web search engine. Visualization and analysis are performed to understand how the embedding process works. The model is found to automatically attenuate the unimportant words and detect the salient keywords in the sentence. Furthermore, these detected keywords are found to automatically activate different cells of the LSTM-RNN, where words belonging to a similar topic activate the same cell. As a semantic representation of the sentence, the embedding vector can be used in many different applications. These automatic keyword detection and topic allocation abilities enabled by the LSTM-RNN allow the network to perform document retrieval, a difficult language processing task, where the similarity between the query and documents can be measured by the distance between their corresponding sentence embedding vectors computed by the LSTM-RNN. On a web search task, the LSTM-RNN embedding is shown to significantly outperform several existing state-of-the-art methods.

Proceedings ArticleDOI
01 Oct 2015
TL;DR: This paper presents the retrieval system for Marathi language documents based on the user profile, which provides text categorization of Marathi documents by using the LINGO [Label Induction Grouping] algorithm, based upon the VSM [Vector Space Model].
Abstract: Information technology has generated huge amounts of data on the internet. Initially this data was mainly in English, so the majority of data mining research focused on English text documents. As internet usage increased, data in other languages like Marathi, Tamil, Telugu and Punjabi also grew on the internet. This paper presents a retrieval system for Marathi language documents based on the user profile. The user profile considers the user's interests and browsing history. The system shows Marathi documents to the end user based on the user profile. Automatic text categorization is useful for better management and retrieval of these text documents and also makes document retrieval a simpler task. This paper discusses the automatic text categorization of Marathi documents and surveys the related work in this area. Various learning techniques exist for the classification of text documents, such as Naive Bayes, Support Vector Machines and Decision Trees. There are also different clustering techniques used for text categorization, such as the Label Induction Grouping (LINGO) algorithm, Suffix Tree Clustering, and k-means. The literature survey shows that for non-English documents the Vector Space Model (VSM) gives better results than other models. The system provides text categorization of Marathi documents using the LINGO algorithm, which is based on the VSM. The system uses a dataset containing 200 documents from 20 different categories. The results indicate that the LINGO clustering algorithm is effective for Marathi text documents.

Journal ArticleDOI
TL;DR: This work proposes a method for solving the entity disambiguation task from timestamped link information obtained from a collaboration network; the method uses only the graph topology of an anonymized network.
Abstract: In a social community, multiple persons may share the same name, phone number or some other identifying attributes. This, along with other phenomena such as name abbreviation, name misspelling, and human error, leads to the erroneous aggregation of records of multiple persons under a single reference. Such mistakes affect the performance of document retrieval, web search, and database integration, and, more importantly, lead to improper attribution of credit (or blame). The task of entity disambiguation partitions the records belonging to multiple persons with the objective that each partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes, or auxiliary features that are collected from external sources, such as Wikipedia. However, for many scenarios, such auxiliary features are not available, or they are costly to obtain. Besides, the attempt to collect biographical or external data carries the risk of privacy violation. In this work, we propose a method for solving the entity disambiguation task from timestamped link information obtained from a collaboration network. Our method is non-intrusive of privacy as it uses only the graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.

Proceedings ArticleDOI
23 Aug 2015
TL;DR: A text-independent writer identification framework for online handwritten text using an unsupervised learning scheme termed `subtractive clustering' to discover the unique writing styles of a given author and a modified scoring scheme for identifying the writer.
Abstract: This paper proposes a text-independent writer identification framework for online handwritten text. The method utilizes an unsupervised learning scheme termed ‘subtractive clustering’ to discover the unique writing styles of a given author. Subtractive clustering has been adopted in the literature for the problems of image segmentation and speaker identification. To the best of our knowledge, its applicability in the domain of writer identification is yet to be explored. Unlike traditional clustering techniques such as k-means and fuzzy c-means, the subtractive clustering algorithm does not rely on the initial choice of seed points. Instead, it locates the high-density regions in the feature space, which makes this scheme an interesting exploration to capture the writing styles of an author (referred to as ‘prototypes’). The discovered prototypes from the clustering algorithm are subsequently employed to score the authorship of an unknown handwritten text. In addition, inspired by the tf-idf approach used in document retrieval, we propose a modified scoring scheme for identifying the writer. The efficacy of the algorithms is evaluated on paragraphs from the IAM-Online Handwritten Database.
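An implementation sketch of subtractive clustering in the style of Chiu (1994), which the paper adopts: each point receives a density potential, the highest-potential point becomes a prototype, potentials near it are suppressed, and the process repeats. The radii and stopping ratio below are conventional defaults, not the paper's settings.

```python
import numpy as np

def subtractive_clustering(X, ra=1.0, rb=1.5, stop_ratio=0.15):
    alpha, beta = 4 / ra**2, 4 / rb**2
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-alpha * d2).sum(axis=1)       # density potential per point
    first = P.max()
    centers = []
    while P.max() > stop_ratio * first:
        c = P.argmax()                        # highest-density point = prototype
        centers.append(X[c])
        P -= P[c] * np.exp(-beta * d2[:, c])  # suppress potential near it
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.05, (20, 2)), rng.normal(1, 0.05, (20, 2))])
print(subtractive_clustering(X))  # expect roughly one prototype per dense region
```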

Proceedings ArticleDOI
24 Oct 2015
TL;DR: A novel text classification scheme is proposed that learns from training data sets and correctly classifies unstructured text data into two different categories, True and False.
Abstract: Recently, due to the large-scale spread of data in the digital economy, the era of big data has arrived. Within big data, the leakage of unstructured text data such as technical documents, confidential documents, and false-information documents has become a serious problem. To prevent this, the need for techniques to sort and process text documents has increased. In this paper, we propose a novel text classification scheme which learns from training data sets and correctly classifies unstructured text data into two different categories, True and False. The proposed method is implemented using a Naive Bayes document classifier and TF-IDF.
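A minimal sketch matching the stated setup, with invented training texts: tf-idf features feeding a multinomial Naive Bayes classifier for the True/False split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["official quarterly earnings report released",
         "verified product specification document",
         "miracle cure discovered, doctors hate it",
         "you won a prize, click here now"]
labels = ["True", "True", "False", "False"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["click here to claim your prize"]))
```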

Journal ArticleDOI
TL;DR: This work designs algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the document collection in such a way that they can efficiently provide high-quality answers to user queries using only the selected subset.
Abstract: We design algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the document collection in such a way that we can efficiently provide high-quality answers to user queries using only the selected subset. This approach has applications when space is a constraint or when the query-processing time increases significantly with the size of the collection. We study our algorithms through the lens of stochastic analysis and prove that even though they use only a small fraction of the entire collection, they can provide answers to most user queries, achieving a performance close to the optimal. To complement our theoretical findings, we experimentally show the versatility of our approach by considering two important cases in the context of Web search. In the first case, we favor the retrieval of documents that are relevant to the query, whereas in the second case we aim for document diversification. Both the theoretical and the experimental analysis provide strong evidence of the potential value of query covering in diverse application scenarios.
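A greedy sketch of the query-covering idea under a strong simplification (any relevant document fully answers a query): repeatedly pick the document covering the most remaining query probability mass. The paper's stochastic analysis and quality model are far richer than this.

```python
queries = {"q1": 0.5, "q2": 0.3, "q3": 0.2}          # query distribution (toy)
relevant = {"d1": {"q1", "q2"}, "d2": {"q2", "q3"}, "d3": {"q3"}}

def select(k):
    chosen, covered = [], set()
    for _ in range(k):
        # Greedily maximize the uncovered query mass a document would add.
        best = max((d for d in relevant if d not in chosen),
                   key=lambda d: sum(queries[q] for q in relevant[d] - covered))
        chosen.append(best)
        covered |= relevant[best]
    return chosen, sum(queries[q] for q in covered)

print(select(1))   # d1 alone covers 0.8 of the query mass
```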

Journal ArticleDOI
TL;DR: A novel supervised topic model for document retrieval learning is proposed which can be regarded as a pointwise model for tackling the learning-to-rank task and improves upon the state-of-the-art models.
Abstract: One limitation of most existing probabilistic latent topic models for document classification is that the topic model itself does not consider useful side-information, namely, class labels of documents. Topic models, which in turn consider the side-information, popularly known as supervised topic models, do not consider the word order structure in documents. One of the motivations behind considering the word order structure is to capture the semantic fabric of the document. We investigate a low-dimensional latent topic model for document classification. Class label information and word order structure are integrated into a supervised topic model enabling a more effective interaction among such information for solving document classification. We derive a collapsed Gibbs sampler for our model. Likewise, supervised topic models with word order structure have not been explored in document retrieval learning. We propose a novel supervised topic model for document retrieval learning which can be regarded as a pointwise model for tackling the learning-to-rank task. Available relevance assessments and word order structure are integrated into the topic model itself. We conduct extensive experiments on several publicly available benchmark datasets, and show that our model improves upon the state-of-the-art models.

Journal ArticleDOI
TL;DR: Precisely representing the semantics of a document as a graph, together with multiple relation-based weighting schemes, is an important factor underlying the notable improvement achieved by the proposed ranking approach.

Journal ArticleDOI
TL;DR: This paper proposes the Topic Enhanced Inverted Index (TEII), which incorporates topic information into the traditional inverted index, and explores two types of TEII: an incremental TEII that adds topic-based inverted lists and is beneficial for legacy IR systems, and a hybrid TEII that is highly extensible for incorporating different ranking factors.
Abstract: In recent years, topic modeling is gaining significant momentum in information retrieval (IR). Researchers have found that utilizing the topic information generated through topic modeling together with traditional TF-IDF information generates superior results in document retrieval. However, in order to apply this idea to real-life IR systems, some critical problems need to be solved: how to store the topic information and how to utilize it with the TF-IDF information for efficient document retrieval. In this paper, we propose the Topic Enhanced Inverted Index (TEII) to incorporate the topic information into the inverted index for efficient top-k document retrieval. Specifically, we explore two different types of TEIIs. We first propose the incremental TEII, which includes the topic information into the traditional inverted index by adding topic-based inverted lists. The incremental TEII is beneficial for legacy IR systems, since it does not change the existing TF-IDF-based inverted lists. As a more flexible alternative, we propose the hybrid TEII to incorporate the topic information into each posting of the inverted index. In the hybrid TEII, two relaxation methods are proposed to support dynamic estimation of the upper-bound impact of each posting. The hybrid TEII is highly extensible for incorporating different ranking factors, and we show an extension of the hybrid TEII that considers the static quality of the documents in the corpus. Based on the incremental and hybrid TEIIs, we develop several query processing algorithms to support efficient top-k document retrieval on TEIIs. Empirical evaluation on the TREC dataset verifies the effectiveness and efficiency of the proposed index structures and query processing algorithms.
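A simplified sketch in the spirit of the hybrid TEII's upper-bound impacts: each posting list stores its maximum impact so that low-impact lists can be cut off once no unseen document can enter the top-k. The index contents are invented, and the cutoff ignores skipped lists' contributions to already-seen documents, an approximation a real system would refine.

```python
# term -> (max_impact, [(doc_id, impact), ...]); in this toy example the
# impacts are assumed to already combine TF-IDF and topic evidence.
index = {
    "neural": (0.9, [(1, 0.9), (2, 0.4)]),
    "retrieval": (0.7, [(2, 0.7), (3, 0.6)]),
    "the": (0.05, [(1, 0.05), (2, 0.05), (3, 0.05)]),
}

def top_k(query_terms, k=2):
    lists = sorted((index[t] for t in query_terms if t in index),
                   key=lambda e: -e[0])       # largest max impact first
    scores = {}
    for i, (_, postings) in enumerate(lists):
        # Upper bound on what any unseen doc could reach from lists i..end.
        remaining = sum(m for m, _ in lists[i:])
        current = sorted(scores.values(), reverse=True)
        if len(current) >= k and remaining < current[k - 1]:
            break                             # remaining lists cannot matter
        for doc, imp in postings:
            scores[doc] = scores.get(doc, 0.0) + imp
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(top_k(["neural", "retrieval", "the"]))  # "the" list is never scanned
```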