Showing papers on "Document retrieval published in 2009"

PDF

Open Access

Book•

[...]

Stephen Robertson¹, Hugo Zaragoza²•Institutions (2)

17 Dec 2009

TL;DR: This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F.

...read moreread less

Abstract: The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970—1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features, and parameter optimisation for models with free parameters.

...read moreread less

2,037 citations

Book•

Search Engines: Information Retrieval in Practice

[...]

W. Bruce Croft, Donald Metzler, Trevor Strohman

16 Feb 2009

TL;DR: This text provides the background and tools needed to evaluate, compare and modify search engines and numerous programming exercises make extensive use of Galago, a Java-based open source search engine.

...read moreread less

Abstract: KEY BENEFIT: Written by a leader in the field of information retrieval, this text provides the background and tools needed to evaluate, compare and modify search engines. KEY TOPICS: Coverage of the underlying IR and mathematical models reinforce key concepts. Numerous programming exercises make extensive use of Galago, a Java-based open source search engine. MARKET: A valuable tool for search engine and information retrieval professionals.

...read moreread less

1,050 citations

Book Chapter•DOI•

Parallel K-Means Clustering Based on MapReduce

[...]

Weizhong Zhao¹, Huifang Ma¹, Qing He¹•Institutions (1)

Chinese Academy of Sciences¹

22 Nov 2009

TL;DR: This paper proposes a parallel k -means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique and demonstrates that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

...read moreread less

Abstract: Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, many researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel k -means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

...read moreread less

626 citations

Proceedings Article•DOI•

Positional language models for information retrieval

[...]

Yuanhua Lv¹, ChengXiang Zhai¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

19 Jul 2009

TL;DR: A novel positional language model (PLM) is proposed which implements both heuristics in a unified language model and is effective for passage retrieval and performs better than a state-of-the-art proximity-based retrieval model.

...read moreread less

Abstract: Although many variants of language models have been proposed for information retrieval, there are two related retrieval heuristics remaining "external" to the language modeling approach: (1) proximity heuristic which rewards a document where the matched query terms occur close to each other; (2) passage retrieval which scores a document mainly based on the best matching passage. Existing studies have only attempted to use a standard language model as a "black box" to implement these heuristics, making it hard to optimize the combination parameters. In this paper, we propose a novel positional language model (PLM) which implements both heuristics in a unified language model. The key idea is to define a language model for each position of a document, and score a document based on the scores of its PLMs. The PLM is estimated based on propagated counts of words within a document through a proximity-based density function, which both captures proximity heuristics and achieves an effect of "soft" passage retrieval. We propose and study several representative density functions and several different PLM-based document ranking strategies. Experiment results on standard TREC test collections show that the PLM is effective for passage retrieval and performs better than a state-of-the-art proximity-based retrieval model.

...read moreread less

257 citations

Proceedings Article•DOI•

Query dependent pseudo-relevance feedback based on wikipedia

[...]

Yang Xu¹, Gareth J. F. Jones², Bin Wang¹•Institutions (2)

Chinese Academy of Sciences¹, Dublin City University²

19 Jul 2009

TL;DR: This work proposes and proposes and studies the effectiveness of three methods for expansion term selection, each modeling the Wikipedia based pseudo-relevance information from a different perspective, and incorporates the expansion terms into the original query and uses language modeling IR to evaluate these methods.

...read moreread less

Abstract: Pseudo-relevance feedback (PRF) via query-expansion has been proven to be e®ective in many information retrieval (IR) tasks. In most existing work, the top-ranked documents from an initial search are assumed to be relevant and used for PRF. One problem with this approach is that one or more of the top retrieved documents may be non-relevant, which can introduce noise into the feedback process. Besides, existing methods generally do not take into account the significantly different types of queries that are often entered into an IR system. Intuitively, Wikipedia can be seen as a large, manually edited document collection which could be exploited to improve document retrieval effectiveness within PRF. It is not obvious how we might best utilize information from Wikipedia in PRF, and to date, the potential of Wikipedia for this task has been largely unexplored. In our work, we present a systematic exploration of the utilization of Wikipedia in PRF for query dependent expansion. Specifically, we classify TREC topics into three categories based on Wikipedia: 1) entity queries, 2) ambiguous queries, and 3) broader queries. We propose and study the effectiveness of three methods for expansion term selection, each modeling the Wikipedia based pseudo-relevance information from a different perspective. We incorporate the expansion terms into the original query and use language modeling IR to evaluate these methods. Experiments on four TREC test collections, including the large web collection GOV2, show that retrieval performance of each type of query can be improved. In addition, we demonstrate that the proposed method out-performs the baseline relevance model in terms of precision and robustness.

...read moreread less

229 citations

Journal Article•DOI•

Learning and inferencing in user ontology for personalized Semantic Web search

[...]

Xing Jiang¹, Ah-Hwee Tan¹•Institutions (1)

Nanyang Technological University¹

15 Jul 2009-Information Sciences

TL;DR: The proposed user ontology model with the spreading activation based inferencing procedure has been incorporated into a semantic search engine, called OntoSearch, to provide personalized document retrieval services.

...read moreread less

128 citations

Patent•

Speech translation apparatus and computer program product

[...]

Kazuo Sumita¹, Tetsuro Chino, Satoshi Kamatani, Kouji Ueno•Institutions (1)

Toshiba¹

18 Feb 2009

TL;DR: In this article, a translation direction specifying unit specifies a first language and a second language, and a keyword extracting unit extracts a keyword for a document retrieval from the first-language character string or the second-languages character string, with which a document retrieving unit performs document retrieval.

...read moreread less

Abstract: A translation direction specifying unit specifies a first language and a second language. A speech recognizing unit recognizes a speech signal of the first language and outputs a first language character string. A first translating unit translates the first language character string into a second language character string that will be displayed on a display device. A keyword extracting unit extracts a keyword for a document retrieval from the first language character string or the second language character string, with which a document retrieving unit performs a document retrieval. A second translating unit translates a retrieved document into its opponent language, which will be displayed on the display device.

...read moreread less

114 citations

Proceedings Article•DOI•

Space-Efficient Framework for Top-k String Retrieval Problems

[...]

Wing-Kai Hon¹, Rahul Shah², Jeffrey Scott Vitter³•Institutions (3)

National Tsing Hua University¹, Louisiana State University², Texas A&M University³

25 Oct 2009

TL;DR: This framework gives linear space data structure with optimal query times for arbitrary score functions and improves the space utilization for the problems in [Muthukrishnan, 2002] while maintaining optimal query performance.

...read moreread less

Abstract: Given a set ${\cal D}=\{d_1, d_2,..., d_D\}$ of $D$strings of total length $n$, our task is to report the "most relevant"strings for a given query pattern $P$. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of "most relevant" is involved. In information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be a lot more than the required relevant documents. The first formal framework to study such kind of retrieval problems was given by [Muthukrishnan, 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking $O(n \logn)$ words of space. We study this problem in a slightly different framework of reporting the top $k$ most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives linear space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [Muthukrishnan, 2002] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.

...read moreread less

109 citations

Journal Article•DOI•

Exploiting noun phrases and semantic relationships for text document clustering

[...]

Hai-Tao Zheng¹, Bo-Yeong Kang¹, Hong-Gee Kim¹•Institutions (1)

Seoul National University¹

01 Jun 2009-Information Sciences

TL;DR: This work combines detection of noun phrases with the use of WordNet as background knowledge to explore better ways of representing documents semantically for clustering, and finds that noun phrase analysis improves the WordNet-based clustering method.

...read moreread less

96 citations

Journal Article•DOI•

Signature Detection and Matching for Document Image Retrieval

[...]

Guangyu Zhu¹, Yefeng Zheng², David Doermann¹, Stefan Jaeger•Institutions (2)

University of Maryland, College Park¹, Princeton University²

01 Nov 2009-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper proposes a novel multiscale approach to jointly detecting and segmenting signatures from document images, and quantitatively studies state-of-the-art shape representations, shape matching algorithms, measures of dissimilarity, and the use of multiple instances as query in document image retrieval.

...read moreread less

Abstract: As one of the most pervasive methods of individual identification and document authentication, signatures present convincing evidence and provide an important form of indexing for effective document image processing and retrieval in a broad range of applications. However, detection and segmentation of free-form objects such as signatures from clustered background is currently an open document analysis problem. In this paper, we focus on two fundamental problems in signature-based document image retrieval. First, we propose a novel multiscale approach to jointly detecting and segmenting signatures from document images. Rather than focusing on local features that typically have large variations, our approach captures the structural saliency using a signature production model and computes the dynamic curvature of 2D contour fragments over multiple scales. This detection framework is general and computationally tractable. Second, we treat the problem of signature retrieval in the unconstrained setting of translation, scale, and rotation invariant nonrigid shape matching. We propose two novel measures of shape dissimilarity based on anisotropic scaling and registration residual error and present a supervised learning framework for combining complementary shape information from different dissimilarity metrics using LDA. We quantitatively study state-of-the-art shape representations, shape matching algorithms, measures of dissimilarity, and the use of multiple instances as query in document image retrieval. We further demonstrate our matching techniques in offline signature verification. Extensive experiments using large real-world collections of English and Arabic machine-printed and handwritten documents demonstrate the excellent performance of our approaches.

...read moreread less

89 citations

Journal Article•DOI•

Evaluation Challenges and Directions for Information-Seeking Support Systems

[...]

Diane Kelly¹, Susan T. Dumais², Jan Pedersen•Institutions (2)

University of North Carolina at Chapel Hill¹, Microsoft²

01 Mar 2009-IEEE Computer

TL;DR: ISSSs provide an exciting opportunity to extend previous information-seeking and interactive information retrieval evaluation models and create a research community that embraces diverse methods and broader participation.

...read moreread less

Abstract: ISSSs provide an exciting opportunity to extend previous information-seeking and interactive information retrieval evaluation models and create a research community that embraces diverse methods and broader participation.

...read moreread less

Proceedings Article•DOI•

A machine learning approach for improved BM25 retrieval

[...]

Krysta M. Svore¹, Christopher J. C. Burges¹•Institutions (1)

Microsoft¹

02 Nov 2009

TL;DR: A machine learning approach to BM25-style retrieval is developed that learns, using LambdaRank, from the input attributes of BM25, and significantly improves retrieval effectiveness over BM25 and BM25F.

...read moreread less

Abstract: Despite the widespread use of BM25, there have been few studies examining its effectiveness on a document description over single and multiple field combinations. We determine the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields such as anchor text and query click information no better than a linear function of the field attributes. We also find query click information to be the single most important field for retrieval. In response, we develop a machine learning approach to BM25-style retrieval that learns, using LambdaRank, from the input attributes of BM25. Our model significantly improves retrieval effectiveness over BM25 and BM25F. Our data-driven approach is fast, effective, avoids the problem of parameter tuning, and can directly optimize for several common information retrieval measures. We demonstrate the advantages of our model on a very large real-world Web data collection.

...read moreread less

Journal Article•DOI•

CoScribe: Integrating Paper and Digital Documents for Collaborative Knowledge Work

[...]

Jürgen Steimle, Oliver Brdiczka¹, Max Mühlhäuser•Institutions (1)

PARC¹

01 Jul 2009-IEEE Transactions on Learning Technologies

TL;DR: The results indicate that CoScribe imposes only minimal overhead on traditional annotation processes and provides for a more efficient structuring and retrieval of documents.

...read moreread less

Abstract: This paper presents CoScribe, a concept and prototype system for the combined work with printed and digital documents, which supports a large variety of knowledge work settings It integrates novel pen-and-paper-based interaction techniques that enable users to collaboratively annotate, link and tag both printed and digital documents CoScribe provides for a very seamless integration of paper with the digital world, as the same digital pen and the same interactions can be used on paper and displays As our second contribution, we present empirical results of three field studies on learning at universities These motivated the design of CoScribe and were abstracted to a generic framework for the design of intuitive pen-and-paper user interfaces The resulting interaction design comprising collaboration support and multiuser visualizations has been implemented and evaluated in user studies The results indicate that CoScribe imposes only minimal overhead on traditional annotation processes and provides for a more efficient structuring and retrieval of documents

...read moreread less

Book Chapter•DOI•

Regression Rank: Learning to Meet the Opportunity of Descriptive Queries

[...]

Matthew Lease¹, James Allan², W. Bruce Croft²•Institutions (2)

Brown University¹, University of Massachusetts Amherst²

18 Apr 2009

TL;DR: A new learning to rank framework for estimating context-sensitive term weights without use of feedback is presented, which achieves generalization by introducing secondary features correlated with term weights and applying regression to predict term weights given features.

...read moreread less

Abstract: We present a new learning to rank framework for estimating context-sensitive term weights without use of feedback. Specifically, knowledge of effective term weights on past queries is used to estimate term weights for new queries. This generalization is achieved by introducing secondary features correlated with term weights and applying regression to predict term weights given features. To improve support for more focused retrieval like question answering, we conduct document retrieval experiments with TREC description queries on three document collections. Results show significantly improved retrieval accuracy.

...read moreread less

Book•DOI•

Advances in Information Retrieval Theory

[...]

Leif Azzopardi, Gabriella Kazai, Stephen Robertson, Stefan Rüger, Milad Shokouhi, Dawei Song, Emine Yilmaz - Show less +3 more

01 Jan 2009-Lecture Notes in Computer Science

TL;DR: In this article, the authors present an approach to Verbose queries using a limited dependencies language model and time-sensitive language modeling for online term recurrence prediction for personalized information retrieval.

...read moreread less

Abstract: Invited Talk.- Is There Something Quantum-Like about the Human Mental Lexicon?.- Regular Papers.- Probably Approximately Correct Search.- PageRank: Splitting Homogeneous Singular Linear Systems of Index One.- Training Data Cleaning for Text Classification.- Semi-parametric and Non-parametric Term Weighting for Information Retrieval.- Bridging Language Modeling and Divergence from Randomness Models: A Log-Logistic Model for IR.- Ordinal Regression Based Model for Personalized Information Retrieval.- Navigating in the Dark: Modeling Uncertainty in Ad Hoc Retrieval Using Multiple Relevance Models.- A Belief Model of Query Difficulty That Uses Subjective Logic.- \"A term is known by the company it keeps\": On Selecting a Good Expansion Set in Pseudo-Relevance Feedback.- An Effective Approach to Verbose Queries Using a Limited Dependencies Language Model.- Time-Sensitive Language Modelling for Online Term Recurrence Prediction.- Score Distributions in Information Retrieval.- Modeling the Score Distributions of Relevant and Non-relevant Documents.- Modeling Expected Utility of Multi-session Information Distillation.- Specificity Aboutness in XML Retrieval.- An Effectiveness Measure for Ambiguous and Underspecified Queries.- An Analysis of NP-Completeness in Novelty and Diversity Ranking.- From \"Identical\" to \"Similar\": Fusing Retrieved Lists Based on Inter-document Similarities.- Short Papers.- A Quantum-Based Model for Interactive Information Retrieval.- The Quantum Probability Ranking Principle for Information Retrieval.- Written Texts as Statistical Mechanical Problem.- What Happened to Content-Based Information Filtering?.- Prior Information and the Determination of Event Spaces in Probabilistic Information Retrieval Models.- Robust Word Similarity Estimation Using Perturbation Kernels.- Possibilistic Similarity Estimation and Visualization.- A New Measure of the Cluster Hypothesis.- Explaining User Performance in Information Retrieval: Challenges to IR Evaluation.- A Four-Factor User Interaction Model for Content-Based Image Retrieval.- Predicting Query Performance by Query-Drift Estimation.- What's in a Link? From Document Importance to Topical Relevance.- Avoiding Bias in Text Clustering Using Constrained K-means and May-Not-Links.- Optimizing WebPage Interest.- Posters.- The \"Beautiful\" in Information.- IR Evaluation without a Common Set of Topics.- An Ad Hoc Information Retrieval Perspective on PLSI through Language Model Identification.- Less Is More: Maximal Marginal Relevance as a Summarisation Feature.- On the Notion of \"An Information Need\".- A Logical Inference Approach to Query Expansion with Social Tags.- Evaluating Mobile Proactive Context-Aware Retrieval: An Incremental Benchmark.- Predicting the Usefulness of Collection Enrichment for Enterprise Search.- Ranking List Dispersion as a Query Performance Predictor.- Semi-subsumed Events: A Probabilistic Semantics of the BM25 Term Frequency Quantification.- Batch-Mode Computational Advertising Based on Modern Portfolio Theory.

...read moreread less

Journal Article•DOI•

Multilayer SOM With Tree-Structured Data for Efficient Document Retrieval and Plagiarism Detection

[...]

Tommy W. S. Chow¹, M.K.M. Rahman¹•Institutions (1)

City University of Hong Kong¹

01 Sep 2009-IEEE Transactions on Neural Networks

TL;DR: This paper proposes a new document retrieval (DR) and plagiarism detection (PD) system using multilayer self-organizing map (MLSOM), and shows that the tree-structured data is effective for DR and PD.

...read moreread less

Abstract: This paper proposes a new document retrieval (DR) and plagiarism detection (PD) system using multilayer self-organizing map (MLSOM). A document is modeled by a rich tree-structured representation, and a SOM-based system is used as a computationally effective solution. Instead of relying on keywords/lines, the proposed scheme compares a full document as a query for performing retrieval and PD. The tree-structured representation hierarchically includes document features as document, pages, and paragraphs. Thus, it can reflect underlying context that is difficult to acquire from the currently used word-frequency information. We show that the tree-structured data is effective for DR and PD. To handle tree-structured representation in an efficient way, we use an MLSOM algorithm, which was previously developed by the authors for the application of image retrieval. In this study, it serves as an effective clustering algorithm. Using the MLSOM, local matching techniques are developed for comparing text documents. Two novel MLSOM-based PD methods are proposed. Detailed simulations are conducted and the experimental results corroborate that the proposed approach is computationally efficient and accurate for DR and PD.

...read moreread less

Journal Article•DOI•

Improving legal information retrieval using an ontological framework

[...]

M. Saravanan¹, Balaraman Ravindran¹, S. Raman¹•Institutions (1)

Indian Institute of Technology Madras¹

01 Jun 2009-Artificial Intelligence and Law

TL;DR: An ontological framework to enhance the user’s query for retrieval of truly relevant legal judgments has been proposed in this paper and empirical results demonstrate that ontology-based searches generate significantly better results than traditional search methods.

...read moreread less

Abstract: A variety of legal documents are increasingly being made available in electronic format. Automatic Information Search and Retrieval algorithms play a key role in enabling efficient access to such digitized documents. Although keyword-based search is the traditional method used for text retrieval, they perform poorly when literal term matching is done for query processing, due to synonymy and ambivalence of words. To overcome these drawbacks, an ontological framework to enhance the user's query for retrieval of truly relevant legal judgments has been proposed in this paper. Ontologies ensure efficient retrieval by enabling inferences based on domain knowledge, which is gathered during the construction of the knowledge base. Empirical results demonstrate that ontology-based searches generate significantly better results than traditional search methods.

...read moreread less

Proceedings Article•DOI•

Retrieval experiments using pseudo-desktop collections

[...]

Jinyoung Kim¹, W. Bruce Croft¹•Institutions (1)

University of Massachusetts Amherst¹

02 Nov 2009

TL;DR: A probabilistic retrieval model using the mapping relation between a query term and a document field (PRM-S) has the best performance in collections with more structure, such as email, and that the query-likelihood language model is better for other document types.

...read moreread less

Abstract: Desktop search is an important part of personal information management (PIM). However, research in this area has been limited by the lack of shareable test collections, making cumulative progress difficult. In this paper, we define desktop search as a semi-structured document retrieval problem and introduce a methodology to automatically build a reusable collection (the pseudo-desktop) that has many of the same properties as a real desktop collection. We then present a comprehensive evaluation of retrieval methods for semi-structured document retrieval on several pseudo-desktop collections and the TREC Enterprise collection. Our results show that a probabilistic retrieval model using the mapping relation between a query term and a document field (PRM-S) has the best performance in collections with more structure, such as email, and that the query-likelihood language model is better for other document types. We further analyze the observed differences using generated queries and suggest ways to improve PRM-S, which makes the performance gains more significant and consistent.

...read moreread less

Journal Article•DOI•

Semi-supervised document retrieval

[...]

Ming Li¹, Hang Li², Zhi-Hua Zhou¹•Institutions (2)

Nanjing University¹, Microsoft²

01 May 2009-Information Processing and Management

TL;DR: Experimental results indicate that SSRank consistently and almost always significantly outperforms the baseline methods, given the same amount of labeled data, because SSRank can effectively leverage the use of unlabeled data in learning.

...read moreread less

Abstract: This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, which is referred to as SSRank, aims to use the advantages of both the traditional Information Retrieval (IR) methods and the supervised learning methods for IR proposed recently. The advantages include the use of limited amount of labeled data and rich model representation. To do so, the method adopts a semi-supervised learning framework in ranking model construction. Specifically, given a small number of labeled documents with respect to some queries, the method effectively labels the unlabeled documents for the queries. It then uses all the labeled data to train a machine learning model (in our case, Neural Network). In the data labeling, the method also makes use of a traditional IR model (in our case, BM25). A stopping criterion based on machine learning theory is given for the data labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that SSRank consistently and almost always significantly outperforms the baseline methods (unsupervised and supervised learning methods), given the same amount of labeled data. This is because SSRank can effectively leverage the use of unlabeled data in learning.

...read moreread less

Journal Article•DOI•

Advanced Information Retrieval

[...]

Fuji Ren¹, David B. Bracewell²•Institutions (2)

Beijing University of Posts and Telecommunications¹, University of Tokushima²

01 Jan 2009-Electronic Notes in Theoretical Computer Science

TL;DR: Some of the most important areas of advanced information retrieval are explored, in particular, cross-lingual information retrieval, multimedia information retrieval and semantic-based information retrieval.

...read moreread less

Journal Article•DOI•

Word Topic Models for Spoken Document Retrieval and Transcription

[...]

Berlin Chen¹•Institutions (1)

National Taiwan Normal University¹

01 Mar 2009-ACM Transactions on Asian Language Information Processing

TL;DR: A word topic model (WTM) is proposed to explore the co-occurrence relationship between words, as well as the long-span latent topical information, for language modeling in spoken document retrieval and transcription.

...read moreread less

Abstract: Statistical language modeling (LM), which aims to capture the regularities in human natural language and quantify the acceptability of a given word sequence, has long been an interesting yet challenging research topic in the speech and language processing community. It also has been introduced to information retrieval (IR) problems, and provided an effective and theoretically attractive probabilistic framework for building IR systems. In this article, we propose a word topic model (WTM) to explore the co-occurrence relationship between words, as well as the long-span latent topical information, for language modeling in spoken document retrieval and transcription. The document or the search history as a whole is modeled as a composite WTM model for generating a newly observed word. The underlying characteristics and different kinds of model structures are extensively investigated, while the performance of WTM is thoroughly analyzed and verified by comparison with the well-known probabilistic latent semantic analysis (PLSA) model as well as the other models. The IR experiments are performed on the TDT Chinese collections (TDT-2 and TDT-3), while the large vocabulary continuous speech recognition (LVCSR) experiments are conducted on the Mandarin broadcast news collected in Taiwan. Experimental results seem to indicate that WTM is a promising alternative to the existing models.

...read moreread less

Proceedings Article•DOI•

Logo Detection in Document Images Based on Boundary Extension of Feature Rectangles

[...]

Hongye Wang¹, Youbin Chen¹•Institutions (1)

Tsinghua University¹

26 Jul 2009

TL;DR: A new method of logo detection in document images is proposed, based on the boundary extension of feature rectangles, that is independent on logo shapes and very fast.

...read moreread less

Abstract: A new method of logo detection in document images is proposed in this paper. It is based on the boundary extension of feature rectangles of which the definition is also given in this paper. This novel method takes advantage of a layout assumption that logos have background (white spaces) surrounding it in a document. Compared with other logo detection methods, this new method has the advantage that it is independent on logo shapes and very fast. After the logo candidates are detected, a simple decision tree is used to reduce the false positive from the logo candidate pool. We have tested our method on a public image database involving logos. Experiments show that our method is more precise and robust than the previous methods and is well qualified as an effective assistance in document retrieval.

...read moreread less

Proceedings Article•DOI•

Incremental maintenance of length normalized indexes for approximate string matching

[...]

Marios Hadjieleftheriou¹, Nick Koudas², Divesh Srivastava¹•Institutions (2)

AT&T Labs¹, University of Toronto²

29 Jun 2009

TL;DR: This paper presents a framework that advocates lazy update propagation with the following key feature: Efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers.

...read moreread less

Abstract: Approximate string matching is a problem that has received a lot of attention recently. Existing work on information retrieval has concentrated on a variety of similarity measures TF/IDF, BM25, HMM, etc.) specifically tailored for document retrieval purposes. As new applications that depend on retrieving short strings are becoming popular(e.g., local search engines like YellowPages.com, Yahoo!Local, and Google Maps) new indexing methods are needed, tailored for short strings. For that purpose, a number of indexing techniques and related algorithms have been proposed based on length normalized similarity measures. A common denominator of indexes for length normalized measures is that maintaining the underlying structures in the presence of incremental updates is inefficient, mainly due to data dependent, precomputed weights associated with each distinct token and string. Incorporating updates usually is accomplished by rebuilding the indexes at regular time intervals. In this paper we present a framework that advocates lazy update propagation with the following key feature: Efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers. More specifically, our techniques guarantee against false negatives and limit the number of false positives produced. We implement a fully working prototype and illustrate that the proposed ideas work really well in practice for real datasets.

...read moreread less

Journal Article•DOI•

A new document representation using term frequency and vectorized graph connectionists with application to document retrieval

[...]

Tommy W. S. Chow¹, Haijun Zhang¹, M. K. M. Rahman¹•Institutions (1)

City University of Hong Kong¹

01 Dec 2009-Expert Systems With Applications

TL;DR: This paper presents a new document representation with vectorized multiple features including term frequency and term-connection-frequency, and develops a document retrieval system based on self-organizing map (SOM) to speed up the retrieval process.

...read moreread less

Abstract: This paper presents a new document representation with vectorized multiple features including term frequency and term-connection-frequency. A document is represented by undirected and directed graph, respectively. Then terms and vectorized graph connectionists are extracted from the graphs by employing several feature extraction methods. This hybrid document feature representation more accurately reflects the underlying semantics that are difficult to achieve from the currently used term histograms, and it facilitates the matching of complex graph. In application level, we develop a document retrieval system based on self-organizing map (SOM) to speed up the retrieval process. We perform extensive experimental verification, and the results suggest that the proposed method is computationally efficient and accurate for document retrieval.

...read moreread less

Effectiveness of Temporal Snippets

[...]

Omar Alonso, Ricardo Baeza-Yates¹, Michael Gertz•Institutions (1)

Association for Computing Machinery¹

01 Jan 2009

TL;DR: The notion of time-centered snippets, called TSnippet, as document surrogates for document retrieval and exploration and an alternative document snippet based on temporal information that can be useful for supporting exploratory search are introduced.

...read moreread less

Abstract: We introduce the notion of time-centered snippets, called TSnippet, as document surrogates for document retrieval and exploration. We propose an alternative document snippet based on temporal information that can be useful for supporting exploratory search. The idea of using sentences that contain the most frequent chronons (units of time) can be used for constructing document surrogates. We conducted a series of experiments to evaluate this new approach using a crowdsourcing approach. The evaluation against two Web search engines shows that our technique produces good snippets and users like to see time-sensitive information in search results.

...read moreread less

Proceedings Article•DOI•

A word clustering approach for language model-based sentence retrieval in question answering systems

[...]

Saeedeh Momtazi¹, Dietrich Klakow¹•Institutions (1)

Saarland University¹

02 Nov 2009

TL;DR: Language Modeling techniques are proposed to overcome the problems of data sparsity and exact matching and improve the sentence retrieval performance in Question Answering (QA) systems.

...read moreread less

Abstract: In this paper we propose a term clustering approach to improve the performance of sentence retrieval in Question Answering (QA) systems. As the search in question answering is conducted over smaller segments of data than in a document retrieval task, the problems of data sparsity and exact matching become more critical. In this paper we propose Language Modeling (LM) techniques to overcome such problems and improve the sentence retrieval performance. Our proposed methods include building class-based models by term clustering, and then employing higher order n-grams with the new class-based model. We report our experiments on the TREC 2007 questions from QA track. The results show that the methods investigated here enhanced the mean average precision of sentence retrieval from 23.62% to 29.91%.

...read moreread less

Patent•

Multilayer index voice document searching method and system thereof

[...]

Xihong Wu, Huisheng Chi, Tianshu Qu, Guanglu Wan

19 Aug 2009

TL;DR: In this article, a multilayer indexing voice document retrieval method was proposed, which consists of an automatic voice identifying module that is used for automatically identifying characters in voice documents; an automated voice document index constructing module that was used for constructing double indexes of the voice identification result, and a voice document retrieving module that were used for searching the relevant documents of given query terms in the indexing database and returning the documents to users.

...read moreread less

Abstract: The invention discloses a multilayer indexing voice document retrieval method and a system thereof, and belongs to the technical field of information retrieval. The multilayer indexing voice document retrieval method comprises the following steps: (1) feature extraction of a multimedia stream is implemented, thus obtaining a voice feature sequence; (2) a voice identifying decoder is used for searching the voice feature sequences, thus obtaining a word lattice and an optimal identification result; (3) according to the word lattice and the optimal identification result, a word and syllable double-layer indexing database is constructed; and (4) relevant documents of a given query term are searched in the indexing database and returned to users. The multilayer indexing voice document retrieval system comprises an automatic voice identifying module that is used for automatically identifying characters in voice documents; an automatic voice document index constructing module that is used for constructing double indexes of the voice identification result, and a voice document retrieval module that is used for searching the relevant documents of given query terms in the indexing database and returning the documents to users. Compared with the prior art, the multilayer indexing voice document retrieval method and the system can realize quick and accurate searching of multimedia data.

...read moreread less

Journal Article•DOI•

A quality-aware optimizer for information extraction

[...]

Alpa Jain¹, Panagiotis G. Ipeirotis²•Institutions (2)

Columbia University¹, New York University²

23 Apr 2009-ACM Transactions on Database Systems

TL;DR: This article shows how to use Receiver Operating Characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and how toUse ROC analysis to select the extraction parameters in a principled manner and presents analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation.

...read moreread less

Abstract: A large amount of structured information is buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect and their output has imperfect precision and recall (i.e., contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as “knobs” to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad hoc procedure, based mainly on heuristics. In this article, we show how to use Receiver Operating Characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and show how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the runtime and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach predicts accurately the output quality and selects the fastest execution plan that satisfies the output quality restrictions.

...read moreread less

Journal Article•DOI•

Psychiatric document retrieval using a discourse-aware model

[...]

Liang-Chih Yu¹, Chung-Hsien Wu², Fong-Lin Jang•Institutions (2)

Yuan Ze University¹, National Cheng Kung University²

01 May 2009-Artificial Intelligence

TL;DR: Experimental results show that the discourse-aware retrieval model achieves higher precision than the word-based retrieval models, namely the vector space model (VSM) and Okapi model, adopting word-level information alone.

...read moreread less

Journal Article•DOI•

Retrieval of online handwriting by synthesis and matching

[...]

C. V. Jawahar¹, A. Balasubramanian¹, Million Meshesha¹, Anoop M. Namboodiri¹•Institutions (1)

International Institute of Information Technology¹

01 Jul 2009-Pattern Recognition

TL;DR: The proposed approach provides a keyboard-based search interface that enables to search handwritten data from any platform, in addition to pen-based and example-based queries, and allows cross-lingual document retrieval across Indian languages.

...read moreread less

Collapse