
Showing papers on "Document retrieval published in 2006"


Proceedings ArticleDOI
Yunbo Cao1, Jun Xu2, Tie-Yan Liu1, Hang Li1, Yalou Huang2, Hsiao-Wuen Hon1 
06 Aug 2006
TL;DR: Conventional Ranking SVM is modified for document retrieval by adapting its hinge loss; the loss is optimized via gradient descent and quadratic programming, and experiments on two datasets show the modified method outperforms conventional Ranking SVM and other existing methods.
Abstract: The paper is concerned with applying learning to rank to document retrieval. Ranking SVM is a typical method of learning to rank. We point out that there are two factors one must consider when applying Ranking SVM, or more generally any learning-to-rank method, to document retrieval. First, correctly ranking documents at the top of the result list is crucial for an information retrieval system, so training must be conducted in a way that makes such top-ranked results accurate. Second, the number of relevant documents can vary from query to query, so one must avoid training a model biased toward queries with a large number of relevant documents. Previously, when existing methods including Ranking SVM were applied to document retrieval, neither of these two factors was taken into consideration. We show that it is possible to modify conventional Ranking SVM so that it is better suited to document retrieval. Specifically, we modify the hinge loss function in Ranking SVM to deal with the problems described above. We employ two methods to optimize the loss function: gradient descent and quadratic programming. Experimental results show that our method, referred to as Ranking SVM for IR, outperforms conventional Ranking SVM and other existing methods for document retrieval on two datasets.
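The two modifications the abstract describes amount to weighting each training pair in the hinge loss: one factor for the importance of the rank positions involved, one for per-query normalization. A minimal sketch in plain Python (the function names and the simple subgradient step are illustrative, not the authors' implementation):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_hinge_loss(w, pairs, tau=1.0):
    """Per-pair weighted hinge loss for ranking.

    pairs: (x_hi, x_lo, weight) triples where x_hi should rank above x_lo.
    weight folds in both the importance of the rank positions involved and
    a per-query normalization, so queries with many relevant documents do
    not dominate training -- the two modifications the paper proposes.
    """
    total = 0.0
    for x_hi, x_lo, weight in pairs:
        margin = dot(w, [a - b for a, b in zip(x_hi, x_lo)])
        total += weight * max(0.0, tau - margin)
    return total

def subgradient_step(w, pairs, tau=1.0, lr=0.1):
    """One subgradient-descent step on the weighted hinge loss."""
    grad = [0.0] * len(w)
    for x_hi, x_lo, weight in pairs:
        diff = [a - b for a, b in zip(x_hi, x_lo)]
        if dot(w, diff) < tau:          # this pair violates the margin
            grad = [g - weight * d for g, d in zip(grad, diff)]
    return [wi - lr * gi for wi, gi in zip(w, grad)]
```

The paper's alternative quadratic-programming solver optimizes the same objective exactly rather than by iterative steps.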

648 citations


Journal ArticleDOI
01 Jan 2006
TL;DR: A probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem, and how this framework can unify existing retrieval models and accommodate systematic development of new retrieval models is discussed.
Abstract: This paper presents a probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models, user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate systematic development of new retrieval models. As an example of using the framework to model non-traditional retrieval problems, we derive retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they relax the traditional assumption of independent relevance of documents.

160 citations


Proceedings ArticleDOI
04 Jun 2006
TL;DR: This paper constructs a probabilistic neighborhood for each document and expands the document with its neighborhood information, which provides a more accurate estimation of the document model and thus improves retrieval accuracy.
Abstract: Language model information retrieval depends on accurate estimation of document models. In this paper, we propose a document expansion technique to deal with the problem of insufficient sampling of documents. We construct a probabilistic neighborhood for each document, and expand the document with its neighborhood information. The expanded document provides a more accurate estimation of the document model, and thus improves retrieval accuracy. Moreover, since document expansion and pseudo feedback exploit different corpus structures, they can be combined to further improve performance. The experimental results on several different data sets demonstrate the effectiveness of the proposed document expansion method.

146 citations


Proceedings ArticleDOI
06 Aug 2006
TL;DR: This paper shows that semantic term matching can be naturally incorporated into the axiomatic retrieval model through defining the primitive weighting function based on a semantic similarity function of terms, and shows that such extension can be efficiently implemented as query expansion.
Abstract: A common limitation of many retrieval models, including the recently proposed axiomatic approaches, is that retrieval scores are solely based on exact (i.e., syntactic) matching of terms in the queries and documents, without allowing distinct but semantically related terms to match each other and contribute to the retrieval score. In this paper, we show that semantic term matching can be naturally incorporated into the axiomatic retrieval model through defining the primitive weighting function based on a semantic similarity function of terms. We define several desirable retrieval constraints for semantic term matching and use such constraints to extend the axiomatic model to directly support semantic term matching based on the mutual information of terms computed on some document set. We show that such extension can be efficiently implemented as query expansion. Experiment results on several representative data sets show that, with mutual information computed over the documents in either the target collection for retrieval or an external collection such as the Web, our semantic expansion consistently and substantially improves retrieval accuracy over the baseline axiomatic retrieval model. As a pseudo feedback method, our method also outperforms a state-of-the-art language modeling feedback method.
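One simple way to see the expansion step: score candidate terms by their mutual information with the query terms over a document set and append the top scorers to the query. The sketch below uses document-level pointwise mutual information as one plausible instantiation; the paper's exact primitive weighting functions differ:

```python
import math

def mutual_information(term_a, term_b, docs):
    """Pointwise mutual information of two terms over a document set
    (each doc is a set of terms)."""
    n = len(docs)
    p_a = sum(1 for d in docs if term_a in d) / n
    p_b = sum(1 for d in docs if term_b in d) / n
    p_ab = sum(1 for d in docs if term_a in d and term_b in d) / n
    if p_ab == 0 or p_a == 0 or p_b == 0:
        return 0.0
    return math.log(p_ab / (p_a * p_b))

def expand_query(query_terms, vocabulary, docs, k=2):
    """Add the k terms most related (by MI) to any query term."""
    scored = []
    for t in vocabulary:
        if t in query_terms:
            continue
        score = max(mutual_information(q, t, docs) for q in query_terms)
        scored.append((score, t))
    scored.sort(reverse=True)
    return list(query_terms) + [t for s, t in scored[:k] if s > 0]
```

As the abstract notes, `docs` can be either the target collection or an external collection such as the Web.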

145 citations


Book
28 Jul 2006
TL;DR: The book elaborates on the past and current most successful algorithms and their application in a variety of domains and reveals a number of ideas towards an advanced understanding and synthesis of textual content.
Abstract: Information extraction regards the processes of structuring and combining content that is explicitly stated or implied in one or multiple unstructured information sources. It involves a semantic classification and linking of certain pieces of information and is considered as a light form of content understanding by the machine. Currently, there is a considerable interest in integrating the results of information extraction in retrieval systems, because of the growing demand for search engines that return precise answers to flexible information queries. Advanced retrieval models satisfy that need and they rely on tools that automatically build a probabilistic model of the content of a (multi-media) document. The book focuses on content recognition in text. It elaborates on the past and current most successful algorithms and their application in a variety of domains (e.g., news filtering, mining of biomedical text, intelligence gathering, competitive intelligence, legal information searching, and processing of informal text). An important part discusses current statistical and machine learning algorithms for information detection and classification and integrates their results in probabilistic retrieval models. The book also reveals a number of ideas towards an advanced understanding and synthesis of textual content. The book is aimed at researchers and software developers interested in information extraction and retrieval, but the many illustrations and real world examples make it also suitable as a handbook for students.

139 citations


Patent
26 Jan 2006
TL;DR: A document retrieval apparatus for retrieving documents from a document database in which documents are registered, and displaying the retrieved documents, including a keyword input unit for accepting input of a retrieval keyword for retrieving the documents as mentioned in this paper.
Abstract: A document retrieval apparatus for retrieving documents from a document database in which documents are registered, and displaying the retrieved documents, the document retrieval apparatus includes: a keyword input unit for accepting input of a retrieval keyword for retrieving the documents; a document retrieval unit for retrieving the documents from the document database on the basis of the retrieval keyword; a keyword weight calculation unit for calculating those weights of the retrieval keyword which feature contents of the retrieved documents, as keyword weights; and a display process unit for displaying the retrieved documents in a state where the retrieval keyword contained in the retrieved documents is presented in display aspects conforming to the keyword weights.

134 citations


Book ChapterDOI
13 Feb 2006
TL;DR: This paper introduces into LLAH an affine invariant instead of the perspective invariant so as to improve its adjustability; experimental results show that the affine invariant improves either the accuracy from 96.2% to 97.8%, or the retrieval time from 112 msec./query to 75 msec./query, depending on the processing parameters selected.
Abstract: Camera-based document image retrieval is a task of searching document images from the database based on query images captured using digital cameras. For this task, it is required to solve the problem of “perspective distortion” of images, as well as to establish a way of matching document images efficiently. To solve these problems we have proposed a method called Locally Likely Arrangement Hashing (LLAH) which is characterized by both the use of a perspective invariant to cope with the distortion and the efficiency: LLAH only requires O(N) time where N is the number of feature points that describe the query image. In this paper, we introduce into LLAH an affine invariant instead of the perspective invariant so as to improve its adjustability. Experimental results show that the use of the affine invariant enables us to improve either the accuracy from 96.2% to 97.8%, or the retrieval time from 112 msec./query to 75 msec./query by selecting parameters of processing.
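The kind of affine invariant such hashing relies on can be built from local point arrangements; one standard example is the ratio of the areas of two triangles formed by four feature points, which any affine transformation preserves. A minimal sketch (the exact point arrangement and discretization LLAH uses are described in the paper, not here):

```python
def tri_area(p, q, r):
    """Area of the triangle with vertices p, q, r (2-D points)."""
    return abs((q[0] - p[0]) * (r[1] - p[1])
               - (r[0] - p[0]) * (q[1] - p[1])) / 2.0

def affine_invariant(p0, p1, p2, p3):
    """Ratio of two triangle areas over four feature points.

    An affine map scales every area by |det A|, so the ratio is
    unchanged -- which is why it can be hashed and matched across
    affinely distorted camera captures.
    """
    return tri_area(p0, p1, p2) / tri_area(p0, p1, p3)
```

A perspective (projective) invariant needs five points (a cross-ratio of areas); dropping to four points with the affine invariant is what gives LLAH the extra adjustability the abstract mentions.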

121 citations


01 Jan 2006
TL;DR: This thesis presents methods for introducing ontologies into information retrieval; the fuzzy set model appears to provide the flexibility needed when generalizing to an ontology-based retrieval model, and with the introduction of a hierarchical fuzzy aggregation principle, compound concepts can be handled in a straightforward and natural manner.
Abstract: In this thesis, we present methods for introducing ontologies into information retrieval. The main hypothesis is that including conceptual knowledge such as ontologies in the information retrieval process can contribute to solving major problems currently found in information retrieval. This utilization of ontologies poses a number of challenges. Our focus is on the use of similarity measures derived from knowledge about relations between concepts in ontologies, the recognition of semantic information in texts and the mapping of this knowledge into the ontologies in use, and how to fuse the ideas of ontological similarity and ontological indexing into a realistic information retrieval scenario. To recognize semantic knowledge in a text, shallow natural language processing is used during indexing to reveal knowledge at the level of noun phrases. Furthermore, we briefly cover the identification of semantic relations inside and between noun phrases, and discuss the kinds of problems caused by an increase in compoundness with respect to the structure of concepts in the evaluation of queries. Measuring similarity between concepts based on distances in the structure of the ontology is discussed. In addition, a shared-nodes measure is introduced and, based on a set of intuitive similarity properties, compared to a number of different measures. In this comparison the shared-nodes measure appears superior, though more computationally complex. Some major problems with shared nodes are discussed; they relate to the way relations differ in the degree to which they bring the concepts they connect closer together. A generalized measure called weighted shared nodes is introduced to deal with these problems. Finally, the utilization of concept similarity in query evaluation is discussed.
A semantic expansion approach that incorporates concept similarity is introduced and a generalized fuzzy set retrieval model that applies expansion during query evaluation is presented. While not commonly used in present information retrieval systems, it appears that the fuzzy set model comprises the flexibility needed when generalizing to an ontology-based retrieval model and, with the introduction of a hierarchical fuzzy aggregation principle, compound concepts can be handled in a straightforward and natural manner.

119 citations


Proceedings ArticleDOI
17 Jul 2006
TL;DR: A novel framework for the accurate retrieval of relational concepts from huge texts where all sentences are annotated with predicate argument structures and ontological identifiers by applying a deep parser and a term recognizer is introduced.
Abstract: This paper introduces a novel framework for the accurate retrieval of relational concepts from huge texts. Prior to retrieval, all sentences are annotated with predicate argument structures and ontological identifiers by applying a deep parser and a term recognizer. During the run time, user requests are converted into queries of region algebra on these annotations. Structural matching with pre-computed semantic annotations establishes the accurate and efficient retrieval of relational concepts. This framework was applied to a text retrieval system for MEDLINE. Experiments on the retrieval of biomedical correlations revealed that the cost is sufficiently small for real-time applications and that the retrieval precision is significantly improved.

107 citations


Proceedings ArticleDOI
06 Nov 2006
TL;DR: A static index pruning method is presented that follows a document-centric approach to decide whether a posting for a given term should remain in the index; it can decrease the index size by over 90% at only a minor loss in retrieval effectiveness.
Abstract: We present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. The decision is made based on the term's contribution to the document's Kullback-Leibler divergence from the text collection's global language model. Our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. It thus allows us to make the index small enough to fit entirely into the main memory of a single PC, even for large text collections containing millions of documents. This results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the GOV2 document collection.
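The pruning criterion can be sketched directly: score each term of a document by its contribution p(t|d)·log(p(t|d)/p(t|C)) to the document's KL divergence from the collection model, and keep only the top-scoring postings. A minimal illustration (the keep-fraction interface and the smoothing floor are my own simplifications, not the paper's exact parameterization):

```python
import math

def prune_document(doc_tokens, collection_model, keep=0.25):
    """Document-centric static pruning.

    Scores each distinct term by its contribution to the document's
    KL divergence from the collection language model and returns the
    set of terms whose postings should be kept in the index.
    keep: fraction of the document's distinct terms to retain.
    """
    counts = {}
    for t in doc_tokens:
        counts[t] = counts.get(t, 0) + 1
    n = len(doc_tokens)
    contrib = {}
    for t, c in counts.items():
        p_td = c / n
        p_tc = collection_model.get(t, 1e-9)  # floor for unseen terms
        contrib[t] = p_td * math.log(p_td / p_tc)
    ranked = sorted(contrib, key=contrib.get, reverse=True)
    n_keep = max(1, int(keep * len(ranked)))
    return set(ranked[:n_keep])
```

Terms that are common everywhere (high p(t|C)) contribute little divergence and are pruned first, which is why effectiveness degrades so slowly.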

95 citations


Proceedings ArticleDOI
Jonathan Mamou1, David Carmel1, Ron Hoory1
06 Aug 2006
TL;DR: This paper shows that the mean average precision (MAP) is improved using WCNs compared to the raw word transcripts, and analyzes the effect of increasing ASR word error rate on search effectiveness.
Abstract: We are interested in retrieving information from conversational speech corpora, such as call-center data. This data comprises spontaneous speech conversations with low recording quality, which makes automatic speech recognition (ASR) a highly difficult task. For typical call-center data, even state-of-the-art large vocabulary continuous speech recognition systems produce a transcript with word error rate of 30% or higher. In addition to the output transcript, advanced systems provide word confusion networks (WCNs), a compact representation of word lattices associating each word hypothesis with its posterior probability. Our work exploits the information provided by WCNs in order to improve retrieval performance. In this paper, we show that the mean average precision (MAP) is improved using WCNs compared to the raw word transcripts. Finally, we analyze the effect of increasing ASR word error rate on search effectiveness. We show that MAP is still reasonable even under extremely high error rate.

Patent
Jay Ponte1
03 Apr 2006
TL;DR: Methods and systems for indexing or retrieving materials accessible through computer networks are disclosed.
Abstract: Disclosed are methods and systems for indexing or retrieving materials accessible through computer networks.

Patent
04 May 2006
TL;DR: A customized, specialty-oriented database and index of a subject matter area, and methods for constructing and using such a database, are provided; articles are indexed in a manner that allows facile, rapid retrieval of highly relevant articles with few or no false positives, with much reduced database maintenance cost through frugal limitation of the number of documents in the database, the number of terms in a Master Index, and the number of codes assigned to each document.
Abstract: A customized, specialty-oriented database and index of a subject matter area, and methods for constructing and using such a database, are provided. Selection and indexing of articles is done by experts in the topic with which the database is concerned. As a result, articles are indexed in a manner that allows facile, rapid retrieval of highly relevant articles with few or no false positives, with much reduced database maintenance cost through frugal limitation of the number of documents in the database, the number of terms in a Master Index, and the number of codes assigned to each document. A thesaurus allows indexing and search in accordance with terminology familiar to different anticipated groups of users (e.g., doctors, patients, nurses, technicians, and the like). Key article collections and rapid access to documents therein are also provided.

Journal ArticleDOI
TL;DR: A distance factor that can be effectively combined with the statistical term association measure of mutual information for selecting query expansion terms is proposed and shown to be more effective than mutual information alone in selecting terms for query expansion.
Abstract: Query expansion terms are often used to enhance original query formulations in document retrieval. Such terms are usually selected from the entire documents or from windows or passages surrounding query term occurrences. Arguably, the semantic relatedness between terms weakens with the increase in the distance separating them. In this paper we report a study that was conducted to systematically evaluate different distance functions for selecting query expansion terms. We propose a distance factor that can be effectively combined with the statistical term association measure of mutual information for selecting query expansion terms. Evaluation on the TREC collection shows that distance-weighted mutual information is more effective than mutual information alone in selecting terms for query expansion.
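One plausible instantiation of such a distance factor is an exponential kernel that discounts each co-occurrence by the token distance between the two terms. The sketch below is illustrative only; the paper evaluates several concrete distance functions, and the smoothed log here is my own simplification:

```python
import math

def distance_weighted_mi(query_term, cand, docs, decay=0.8):
    """Distance-weighted mutual-information-style association score.

    docs: documents as lists of tokens. Each joint occurrence of the
    two terms contributes decay**distance rather than 1, so nearby
    co-occurrences count more than distant ones.
    """
    joint = 0.0
    count_q = count_c = 0
    total = 0
    for tokens in docs:
        total += len(tokens)
        q_pos = [i for i, t in enumerate(tokens) if t == query_term]
        c_pos = [i for i, t in enumerate(tokens) if t == cand]
        count_q += len(q_pos)
        count_c += len(c_pos)
        for i in q_pos:
            for j in c_pos:
                joint += decay ** abs(i - j)
    if joint == 0 or count_q == 0 or count_c == 0:
        return 0.0
    # smoothed log of the (distance-discounted) PMI ratio
    return math.log(1.0 + joint * total / (count_q * count_c))
```

With `decay=1.0` the distance factor vanishes and the score reduces to an ordinary co-occurrence-based association, which is the baseline the paper compares against.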

Proceedings ArticleDOI
06 Aug 2006
TL;DR: A general framework for the use of translation probabilities in cross-language information retrieval based on the notion that information retrieval fundamentally requires matching what the searcher means with what the author of a document meant is introduced.
Abstract: This paper introduces a general framework for the use of translation probabilities in cross-language information retrieval based on the notion that information retrieval fundamentally requires matching what the searcher means with what the author of a document meant. That perspective yields a computational formulation that provides a natural way of combining what have been known as query and document translation. Two well-recognized techniques are shown to be a special case of this model under restrictive assumptions. Cross-language search results are reported that are statistically indistinguishable from strong monolingual baselines for both French and Chinese documents.

Journal ArticleDOI
TL;DR: It turns out that the use of formatting information can lead to quite accurate extraction from general documents, and the method can significantly improve search ranking results in document retrieval by using the extracted titles.
Abstract: In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether it could be possible to conduct automatic title extraction from general documents. As a case study, we consider extraction from Office including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint, respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information such as font size as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. Precision and recall for title extraction from Word are 0.810 and 0.837, respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895, respectively in an experiment on intranet data. Other important new findings in this work include that we can train models in one domain and apply them to other domains, and more surprisingly we can even train models in one language and apply them to other languages. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.

Proceedings ArticleDOI
11 Jun 2006
TL;DR: A comprehensive comparison study of various document clustering approaches, including three hierarchical methods (single-link, complete-link, and average-link), Bisecting K-means, K-means, and suffix tree clustering, in terms of efficiency, effectiveness, and scalability.
Abstract: Document clustering has been used for better document retrieval, document browsing, and text mining in digital libraries. In this paper, we perform a comprehensive comparison study of various document clustering approaches, such as three hierarchical methods (single-link, complete-link, and average-link), Bisecting K-means, K-means, and Suffix Tree Clustering, in terms of efficiency, effectiveness, and scalability. In addition, we apply a domain ontology to document clustering to investigate whether an ontology such as MeSH improves clustering quality for MEDLINE articles. Because an ontology is a formal, explicit specification of a shared conceptualization for a domain of interest, the use of ontologies is a natural way to solve traditional information retrieval problems such as synonym/hypernym/hyponym problems. We conducted fairly extensive experiments based on different evaluation metrics, such as misclassification index, F-measure, cluster purity, and entropy, on very large article sets from MEDLINE, the largest digital library in biomedicine.

Proceedings ArticleDOI
04 Jun 2006
TL;DR: Using TMI indexes that are only five times larger than corresponding linear-text indexes, phrase spotting was improved over searching top-1 transcripts by 25-35%, and relevance ranking by 14%, at only a small loss compared to unindexed lattice search.
Abstract: Large-scale web-search engines are generally designed for linear text. The linear text representation is suboptimal for audio search, where accuracy can be significantly improved if the search includes alternate recognition candidates, commonly represented as word lattices. This paper proposes a method for indexing word lattices that is suitable for large-scale web-search engines, requiring only limited code changes. The proposed method, called Time-based Merging for Indexing (TMI), first converts the word lattice to a posterior-probability representation and then merges word hypotheses with similar time boundaries to reduce the index size. Four alternative approximations are presented, which differ in index size and the strictness of the phrase-matching constraints. Results are presented for three types of typical web audio content (podcasts, video clips, and online lectures) for phrase spotting and relevance ranking. Using TMI indexes that are only five times larger than corresponding linear-text indexes, phrase spotting was improved over searching top-1 transcripts by 25-35%, and relevance ranking by 14%, at only a small loss compared to unindexed lattice search.

Book ChapterDOI
13 Feb 2006
TL;DR: A novel DTW-based partial matching scheme is employed to take care of morphologically variant words to achieve effective search and retrieval from a large collection of printed document images by matching image features at word-level.
Abstract: This paper presents a system for retrieval of relevant documents from large document image collections. We achieve effective search and retrieval from a large collection of printed document images by matching image features at word level. For representations of the words, profile-based and shape-based features are employed. A novel DTW-based partial matching scheme is employed to take care of morphologically variant words. This is useful for grouping together similar words during the indexing process. The system supports cross-lingual search using OM-Trans transliteration and a dictionary-based approach. System-level issues for retrieval (e.g., scalability, effective delivery) are also addressed in this paper.

Journal ArticleDOI
TL;DR: The experimental results show that the noun phrase extractor is effective in identifying noun phrases from medical documents, as is the keyphrase extractor in identifying important medical conceptual terms.

Journal ArticleDOI
TL;DR: The results show that experience with search engines significantly affects users' attitudes toward search engines for information retrieval, the query- based service is more popular than the directory-based service, and users are not completely satisfied with the precision of retrieved information and the response time of search engines.

Journal IssueDOI
TL;DR: It is argued that document retrieval for question answering represents a task different from retrieving documents in response to more general retrospective information needs, and to guide future system development, specialized question answering test collections must be constructed.
Abstract: In contrast to traditional information retrieval systems, which return ranked lists of documents that users must manually browse through, a question answering system attempts to directly answer natural language questions posed by the user. Although such systems possess language-processing capabilities, they still rely on traditional document retrieval techniques to generate an initial candidate set of documents. In this article, the authors argue that document retrieval for question answering represents a task different from retrieving documents in response to more general retrospective information needs. Thus, to guide future system development, specialized question answering test collections must be constructed. They show that the current evaluation resources have major shortcomings; to remedy the situation, they have manually created a small, reusable question answering test collection for research purposes. In this article they describe their methodology for building this test collection and discuss issues they encountered regarding the notion of “answer correctness.” © 2006 Wiley Periodicals, Inc.

Posted Content
TL;DR: The approach consists of using existing multilingual linguistic resources such as thesauri, nomenclatures and gazetteers, as well as exploiting the existence of additional more or less language-independent text items such as dates, currency expressions, numbers, names and cognates.
Abstract: We are proposing a simple, but efficient basic approach for a number of multilingual and cross-lingual language technology applications that are not limited to the usual two or three languages, but that can be applied with relatively little effort to larger sets of languages. The approach consists of using existing multilingual linguistic resources such as thesauri, nomenclatures and gazetteers, as well as exploiting the existence of additional more or less language-independent text items such as dates, currency expressions, numbers, names and cognates. Mapping texts onto the multilingual resources and identifying word token links between texts in different languages are basic ingredients for applications such as cross-lingual document similarity calculation, multilingual clustering and categorisation, cross-lingual document retrieval, and tools to provide cross-lingual information access.

Proceedings ArticleDOI
27 Apr 2006-Scopus
TL;DR: A novel signature retrieval strategy is presented, which includes a technique for noise and printed text removal from signature images, previously extracted from business documents, based on a normalized correlation similarity measure using global shape-based binary feature vectors.
Abstract: In searching a repository of business documents, a task of interest is that of using a query signature image to retrieve from a database other signatures matching the query. The signature retrieval task involves a two-step process of extracting all the signatures from the documents and then performing a match on these signatures. This paper presents a novel signature retrieval strategy, which includes a technique for noise and printed text removal from signature images previously extracted from business documents. Signature matching is based on a normalized correlation similarity measure using global shape-based binary feature vectors. In a retrieval task involving a database of 447 signatures, on average 4.43 of the top 5 choices were signatures belonging to the writer of the queried signature. Considering the top 10 ranks, an F-measure value of 76.3 was obtained; the precision and recall at this F-measure were 74.5% and 78.28%, respectively.
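The matching step described here reduces to normalized correlation (cosine-style similarity) over binary feature vectors. A minimal sketch of the ranking stage; the feature extraction itself, i.e. computing the global shape-based vectors from signature images, is outside the scope of this illustration:

```python
import math

def normalized_correlation(u, v):
    """Normalized correlation between two binary (0/1) feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

def retrieve(query_vec, database, top_k=5):
    """Rank database signatures by similarity to the query vector.

    database: mapping from signature id to its binary feature vector.
    """
    ranked = sorted(database.items(),
                    key=lambda kv: normalized_correlation(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

For binary vectors this measure rewards shared set bits while normalizing away differences in how many features each signature activates.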

Journal ArticleDOI
TL;DR: A new method for query expansion based on user relevance feedback techniques for mining additional query terms based on fuzzy rules that increases the precision rates and the recall rates of information retrieval systems for dealing with document retrieval.
Abstract: In document retrieval systems, proper query terms significantly affect retrieval performance, and performance can be improved by using query expansion techniques. In this paper, we present a new method for query expansion based on user relevance feedback techniques for mining additional query terms. According to the user's relevance feedback, the proposed query expansion method calculates the degrees of importance of relevant terms of documents in the document database. Relevant terms with higher degrees of importance may become additional query terms. The proposed method uses fuzzy rules to infer the weights of the additional query terms. Then, the weights of the additional query terms and the weights of the original query terms are used to form the new query vector, and we use this new query vector to retrieve documents. The proposed query expansion method increases the precision rates and the recall rates of information retrieval systems for document retrieval. It achieves a higher average recall rate and a higher average precision rate than the method presented in Chang, Y. C., Chen, S. M., & Liau, C. J. (2003). A new query expansion method based on fuzzy rules. Proceedings of the Seventh Joint Conference on AI, Fuzzy System, and Grey System, Taipei, Taiwan, Republic of China.

Journal ArticleDOI
TL;DR: This paper reports the experimental investigation into the use of more realistic concepts as opposed to simple keywords for document retrieval, and reinforcement learning for improving document representations to help the retrieval of useful documents for relevant queries.
Abstract: This paper reports our experimental investigation into the use of more realistic concepts, as opposed to simple keywords, for document retrieval, and into reinforcement learning for improving document representations to help the retrieval of useful documents for relevant queries. The framework used for achieving this was based on the theory of Formal Concept Analysis (FCA) and Lattice Theory. Features or concepts of each document (and query), formulated according to FCA, are represented in a separate concept lattice and are weighted separately with respect to the individual documents they represent. The document retrieval process is viewed as a continuous conversation between queries and documents, during which documents are allowed to learn a set of significant concepts to help their retrieval. The learning strategy used was based on relevance feedback information that strengthens the similarity of relevant documents and weakens that of non-relevant documents. Test results obtained on the Cranfield collection show a significant increase in average precision as the system learns from experience.

Proceedings Article
16 Jul 2006
TL;DR: Experimental results support the efficacy of the OntoSearch system in using domain and user ontologies for enhanced search performance.
Abstract: OntoSearch, a full-text search engine that exploits ontological knowledge for document retrieval, is presented in this paper. Unlike other ontology-based search engines, OntoSearch does not require a user to specify the concepts associated with his/her queries. Domain ontology in OntoSearch takes the form of a semantic network. Given a keyword-based query, OntoSearch infers the related concepts through a spreading activation process in the domain ontology. To provide personalized information access, we further develop algorithms to learn and exploit a user ontology model based on a customized view of the domain ontology. The proposed system has been applied to the domain of searching scientific publications in the ACM Digital Library. The experimental results support the efficacy of the OntoSearch system in using domain and user ontologies for enhanced search performance.
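The spreading activation step described above can be sketched as follows: activation starts at the query's seed concepts and propagates along weighted edges of the semantic network, decaying at each hop. The graph structure, decay factor, and iteration count are illustrative assumptions, not OntoSearch's actual parameters:

```python
def spreading_activation(graph, seeds, decay=0.5, iterations=2):
    """Infer concepts related to a keyword query by spreading activation
    through a semantic network.

    graph: {concept: [(neighbor, edge_weight), ...]} (hypothetical layout)
    seeds: concepts directly matched by the query's keywords
    """
    activation = {c: 1.0 for c in seeds}
    for _ in range(iterations):
        spread = {}
        # Each active node passes a decayed fraction of its activation
        # to its neighbors, scaled by the edge weight.
        for node, level in activation.items():
            for neighbor, weight in graph.get(node, []):
                spread[neighbor] = spread.get(neighbor, 0.0) + decay * weight * level
        for node, extra in spread.items():
            activation[node] = activation.get(node, 0.0) + extra
    return activation
```

Concepts with activation above some threshold would then be treated as related to the query, without the user ever naming them explicitly.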

Journal Article
TL;DR: This paper introduces a newly developed XML indexing and retrieval system on Okapi, extends Robertson's field-weighted BM25F for document retrieval to the element-level retrieval function BM25E, and shows how weights for selected fields are tuned using INEX 2004 topics and assessments.
Abstract: This is the first year of INEX participation for the Centre for Interactive Systems Research. Based on a newly developed XML indexing and retrieval system built on Okapi, we extend Robertson's field-weighted BM25F for document retrieval to an element-level retrieval function, BM25E. In this paper, we introduce this new function and our experimental method in detail, and then show how we tuned the weights for our selected fields using INEX 2004 topics and assessments. Based on the tuned models, we submitted runs for the CO.Thorough and CO.FetchBrowse tasks; the methods we propose show real promise. Existing problems and future work are also discussed.
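The core idea of field weighting in BM25F can be sketched as below: per-field term frequencies are scaled by tuned field weights before the usual BM25 saturation is applied. This is a simplified form that omits per-field length normalization, and all parameter values are illustrative, not the tuned INEX weights:

```python
def bm25f_score(term_field_tf, field_weights, idf, k1=1.2):
    """Simplified field-weighted BM25 contribution of one term
    (BM25F without per-field length normalization).

    term_field_tf: {field: raw term frequency in that field}
    field_weights: {field: tuned weight}, e.g. title weighted above body
    idf: the term's inverse document frequency
    """
    # Weighted pseudo-frequency: fields combine BEFORE saturation,
    # which is what distinguishes BM25F from naive score mixing.
    wtf = sum(field_weights.get(f, 1.0) * tf
              for f, tf in term_field_tf.items())
    return idf * wtf * (k1 + 1) / (k1 + wtf)
```

BM25E applies the same scheme at the XML element level rather than the whole-document level, with fields drawn from the element's structural context.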

Book ChapterDOI
09 Oct 2006
TL;DR: A novel method using Non-negative Matrix Factorization (NMF) to extract query-relevant sentences from documents for query-based summaries, using semantic features and semantic variables without a training phase.
Abstract: Query-based document summaries are important in document retrieval systems for concisely showing the relevance of retrieved documents to a query. This paper proposes a novel method using Non-negative Matrix Factorization (NMF) to extract query-relevant sentences from documents for query-based summaries. The proposed method does not need a training phase using data comprising queries and query-specific documents. It summarizes documents for the given query by using semantic features and semantic variables, without complex processing such as transforming documents into graphs, because NMF has a strong ability to naturally extract semantic features representing the inherent structure of a document.
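The idea can be sketched as below: a term-by-sentence matrix V is factored as V ≈ W·H, where W's columns are the semantic features (in term space) and H's columns are the semantic variables (in sentence space); sentences whose semantic variables align with query-relevant features score highest. The tiny multiplicative-update NMF and the scoring rule are a sketch of the idea, not the paper's exact formulation:

```python
import random

def nmf(V, r, iters=200, seed=0):
    """Tiny multiplicative-update NMF (Lee & Seung style): V ~ W.H with
    all entries non-negative. V is an m x n nested list; r is the
    number of semantic features."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def T(A):
        return [list(row) for row in zip(*A)]

    eps = 1e-9
    for _ in range(iters):
        WH = matmul(W, H)
        WtV, WtWH = matmul(T(W), V), matmul(T(W), WH)
        H = [[H[k][j] * WtV[k][j] / (WtWH[k][j] + eps)
              for j in range(n)] for k in range(r)]
        WH = matmul(W, H)
        VHt, WHHt = matmul(V, T(H)), matmul(WH, T(H))
        W = [[W[i][k] * VHt[i][k] / (WHHt[i][k] + eps)
              for k in range(r)] for i in range(m)]
    return W, H

def query_sentence_scores(V, query_vec, r=2):
    """Score each sentence for the query: weight each semantic feature by
    its relevance to the query vector, then combine the semantic
    variables of each sentence with those weights."""
    W, H = nmf(V, r)
    feat_rel = [sum(query_vec[i] * W[i][k] for i in range(len(V)))
                for k in range(r)]
    n = len(V[0])
    return [sum(feat_rel[k] * H[k][j] for k in range(r)) for j in range(n)]
```

The highest-scoring sentences form the query-based summary, with no training data and no document-to-graph transformation.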

Proceedings ArticleDOI
18 Dec 2006
TL;DR: WordBars visually represents the frequencies of the terms found in the first 100 document surrogates returned from the initial query, and allows users to interactively re-sort the search results based on the frequency of the selected terms within the document surrogates, generating a new set of search results.
Abstract: It is common for web searchers to have difficulty crafting queries to fulfill their information needs. Even when they provide a good query, users often find it challenging to evaluate the results of their web searches. Sources of these problems include the lack of support for query refinement, and the static nature of list-based representations of web search results. To address these issues, we have developed WordBars, an interactive tool for web information retrieval. WordBars visually represents the frequencies of the terms found in the first 100 document surrogates returned from the initial query. The system allows users to interactively re-sort the search results based on the frequencies of selected terms within the document surrogates, as well as to add and remove terms from the query, generating a new set of search results. Examples illustrate how WordBars can provide valuable support for query refinement and search results exploration, whether the initial query is specific or vague.
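The two interactions described above can be sketched as follows: counting term frequencies across the surrogates (what the bars display) and re-sorting results by occurrences of user-selected terms. The tokenization and stopword list are simplifying assumptions, not WordBars' actual text processing:

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in"})

def surrogate_term_frequencies(surrogates):
    """Count term frequencies across document surrogates (titles +
    snippets), as visualized by WordBars' frequency bars."""
    counts = Counter()
    for text in surrogates:
        counts.update(t for t in re.findall(r"[a-z]+", text.lower())
                      if t not in STOPWORDS)
    return counts

def resort_by_terms(surrogates, selected_terms):
    """Re-sort results by how often the user-selected terms occur in
    each surrogate, a sketch of WordBars' interactive re-sorting."""
    def score(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        return sum(tokens.count(term) for term in selected_terms)
    return sorted(surrogates, key=score, reverse=True)
```

Clicking a bar in the real tool corresponds to adding that term to `selected_terms` and re-sorting, while editing the query itself triggers a fresh retrieval and a new set of bars.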