
Showing papers on "Document retrieval published in 2007"


Proceedings Article
23 Sep 2007
TL;DR: This work focuses on the core challenge of ranking entities by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that seamlessly integrates both local and global information in ranking.
Abstract: As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. While entities appear in many pages, current engines only find each page individually. Toward searching directly and holistically for information of finer granularity, we study the problem of entity search, a significant departure from traditional document retrieval. We focus on the core challenge of ranking entities by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that seamlessly integrates both local and global information in ranking. We evaluate our online prototype over a 2TB Web corpus, and show that EntityRank performs effectively.
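
To make the local/global distinction concrete, here is a minimal, hypothetical Python sketch; the scoring formula and per-page observations below are our own stand-ins for illustration, not the paper's actual EntityRank model:

import math
from collections import defaultdict

# Hypothetical sketch: "local" evidence is how strongly an entity is
# associated with the query on a single page; "global" evidence is how
# many independent pages support the entity at all.
def rank_entities(observations):
    """observations: list of (entity, local_score) pairs, one per page;
    local_score in (0, 1] might come from keyword proximity on that page."""
    support = defaultdict(list)
    for entity, local_score in observations:
        support[entity].append(local_score)
    scores = {}
    for entity, local_scores in support.items():
        global_weight = math.log(1 + len(local_scores))      # page support
        local_weight = sum(local_scores) / len(local_scores)  # association
        scores[entity] = global_weight * local_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Two pages support the first phone number; one page supports the second.
print(rank_entities([("555-1234", 0.9), ("555-1234", 0.7), ("555-9999", 0.95)]))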

200 citations


Journal ArticleDOI
TL;DR: It is shown how a simple feed-forward neural network can be trained to filter documents under these conditions, and that this method appears superior to modified versions of the Rocchio, Nearest Neighbor, Naive Bayes, Distance-based Probability, and One-Class SVM algorithms.

196 citations


Journal ArticleDOI
01 Dec 2007
TL;DR: This paper shows some of the areas that can benefit from exploiting all the time information available in the content of documents to provide better search results and user experience.
Abstract: Time is an important dimension of any information space and can be very useful in information retrieval. Current information retrieval systems and applications do not take advantage of all the time information available in the content of documents to provide better search results and user experience. In this paper we show some of the areas that can benefit from exploiting such temporal information.

183 citations


Proceedings ArticleDOI
27 Aug 2007
TL;DR: The goal is to improve access to on-line audio/visual recordings of academic lectures by developing tools for the processing, transcription, indexing, segmentation, summarization, retrieval, and browsing of these media.
Abstract: In this paper we discuss our research activities in the area of spoken lecture processing. Our goal is to improve access to on-line audio/visual recordings of academic lectures by developing tools for the processing, transcription, indexing, segmentation, summarization, retrieval, and browsing of these media. We provide an overview of the technology components and systems that have been developed as part of this project, present some experimental results, and discuss our ongoing and future research plans. Index Terms: spoken lecture processing, spoken document retrieval, audio browsing

176 citations


Proceedings ArticleDOI
29 Oct 2007
TL;DR: The proposed methods form the first steps to bring together advanced information retrieval and secure search capabilities for a wide range of applications including managing data in government and business operations, enabling scholarly study of sensitive data, and facilitating the document discovery process in litigation.
Abstract: This paper introduces a new framework for confidentiality preserving rank-ordered search and retrieval over large document collections. The proposed framework not only protects document/query confidentiality against an outside intruder, but also prevents an untrusted data center from learning information about the query and the document collection. We present practical techniques for proper integration of relevance scoring methods and cryptographic techniques, such as order preserving encryption, to protect data collections and indices and provide efficient and accurate search capabilities to securely rank-order documents in response to a query. Experimental results on the W3C collection show that these techniques have comparable performance to conventional search systems designed for non-encrypted data in terms of search accuracy. The proposed methods thus form the first steps to bring together advanced information retrieval and secure search capabilities for a wide range of applications including managing data in government and business operations, enabling scholarly study of sensitive data, and facilitating the document discovery process in litigation.
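
The central trick, ranking on ciphertexts, can be illustrated with a small, hypothetical Python sketch; the linear ope_encrypt below is a toy stand-in for a real order-preserving encryption scheme, not the paper's construction:

# Toy sketch: any strictly increasing keyed function preserves comparisons,
# so the server can sort encrypted relevance scores without decrypting them.
# A real system would use a proper OPE scheme; this linear map is only a
# placeholder for illustration.
SECRET_A, SECRET_B = 7919, 104729  # toy key material

def ope_encrypt(score: int) -> int:
    return SECRET_A * score + SECRET_B

# Server-side index: document id -> encrypted relevance score for a term.
encrypted_index = {"doc1": ope_encrypt(14), "doc2": ope_encrypt(3),
                   "doc3": ope_encrypt(9)}

# The untrusted server ranks on ciphertexts alone; the order matches the
# plaintext order because encryption preserves ordering.
print(sorted(encrypted_index, key=encrypted_index.get, reverse=True))
# ['doc1', 'doc3', 'doc2']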

171 citations


Proceedings ArticleDOI
06 Nov 2007
TL;DR: An opinion retrieval algorithm that has a traditional information retrieval component to find topic-relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the IR step, and a component to rank the documents based on their relevance to the query and their degrees of having opinions about the query.
Abstract: Opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. A relevant document must satisfy two criteria: it is relevant to the query topic, and it contains opinions about the query, whether they are positive or negative. In this paper, we describe an opinion retrieval algorithm. It has a traditional information retrieval (IR) component to find topic-relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the IR step, and a component to rank the documents based on their relevance to the query and their degrees of having opinions about the query. We implemented the algorithm as a working system and tested it using TREC 2006 Blog Track data in automatic title-only runs. Our results showed 28% to 32% improvements in MAP score over the best automatic runs in the 2006 track. Our result is also 13% higher than that of a state-of-the-art opinion retrieval system tested on the same data set.
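
The three-stage pipeline lends itself to a short, hypothetical Python sketch; the placeholder relevance and opinion scorers below are ours for illustration, not the authors' components:

# Stage 1 (IR), stage 2 (opinion detection), stage 3 (combined ranking).
OPINION_CUES = {"love", "hate", "great", "terrible", "think", "disappointing"}

def topical_relevance(doc: str, query: str) -> float:
    # Placeholder IR score: fraction of query terms found in the document.
    terms = query.lower().split()
    return sum(t in doc.lower() for t in terms) / len(terms)

def opinion_score(doc: str) -> float:
    # Placeholder classifier: density of opinionated words from a lexicon.
    words = doc.lower().split()
    return sum(w in OPINION_CUES for w in words) / max(len(words), 1)

def rank(docs, query, alpha=0.5):
    relevant = [(d, topical_relevance(d, query)) for d in docs]
    relevant = [(d, r) for d, r in relevant if r > 0]  # IR filter
    scored = [(d, alpha * r + (1 - alpha) * opinion_score(d))
              for d, r in relevant]
    return sorted(scored, key=lambda x: x[1], reverse=True)

docs = ["I think this phone is great", "phone specifications: 5 inch screen"]
print(rank(docs, "phone"))  # the opinionated review ranks first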

163 citations


Proceedings ArticleDOI
23 Jul 2007
TL;DR: It is shown how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness.
Abstract: Disk access performance is a major bottleneck in traditional information retrieval systems. Compared to system memory, disk bandwidth is poor, and seek times are worse. We circumvent this problem by considering query evaluation strategies in main memory. We show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. We evaluate our techniques using Galago, a new retrieval system designed for efficient query processing. Our system achieves a 69% improvement in query throughput over previous methods.
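
The flavor of accumulator trimming combined with list skipping can be sketched briefly; this hypothetical Python fragment is our simplification, not Galago's implementation:

import heapq

def evaluate(query_terms, postings, max_accumulators=2, top_k=3):
    """postings: term -> list of (doc_id, weight); rarest terms processed
    first so the most selective evidence creates the accumulators."""
    ordered = sorted(query_terms, key=lambda t: len(postings.get(t, [])))
    acc = {}
    for term in ordered:
        allow_new = len(acc) < max_accumulators
        for doc_id, weight in postings.get(term, []):
            if doc_id in acc:
                acc[doc_id] += weight
            elif allow_new and len(acc) < max_accumulators:
                acc[doc_id] = weight
            # else: the posting is skipped; with skip pointers, whole runs
            # of such postings can be jumped over without decoding them.
    return heapq.nlargest(top_k, acc.items(), key=lambda kv: kv[1])

postings = {"retrieval": [(1, 0.8), (4, 0.6)],
            "the": [(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1)]}
print(evaluate(["the", "retrieval"], postings))  # doc 1 outranks doc 4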

134 citations


Proceedings ArticleDOI
Zaiqing Nie1, Yunxiao Ma1, Shuming Shi1, Ji-Rong Wen1, Wei-Ying Ma1 
08 May 2007
TL;DR: This paper proposes several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features, and concludes that the hybrid model is superior when extraction errors at varying levels are taken into account.
Abstract: The primary function of current Web search engines is essentially relevance ranking at the document level. However, myriad structured information about real-world objects is embedded in static Web pages and online Web databases. Document-level information retrieval can unfortunately lead to highly inaccurate relevance ranking in answering object-oriented queries. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. However, this reliability assumption is no longer valid in the object retrieval context, where multiple copies of information about the same object typically exist. These copies may be inconsistent because of the diversity of Web site quality and the limited performance of current information extraction techniques. If we simply combine the noisy and inaccurate attribute information extracted from different sources, we may not be able to achieve satisfactory retrieval performance. In this paper, we propose several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performance. We conclude that the hybrid model is superior when extraction errors at varying levels are taken into account.
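
A tiny, hypothetical Python sketch can show the structured/unstructured back-off the abstract argues for; the unigram scorer and the confidence weighting here are illustrative stand-ins, not the paper's language models:

def unigram_score(text: str, query: str) -> float:
    # Placeholder bag-of-words score: query-term density in the text.
    words = text.lower().split()
    return sum(words.count(q) for q in query.lower().split()) / max(len(words), 1)

def hybrid_score(obj, query, confidence):
    """obj: attribute -> extracted text; confidence in [0, 1] reflects how
    much the attribute extraction can be trusted."""
    structured = sum(unigram_score(v, query) for v in obj.values()) / len(obj)
    unstructured = unigram_score(" ".join(obj.values()), query)
    # Trust per-attribute matching in proportion to extraction confidence,
    # backing off to the whole record when extraction is unreliable.
    return confidence * structured + (1 - confidence) * unstructured

paper = {"title": "entity search on the web", "authors": "j smith"}
print(hybrid_score(paper, "entity search", confidence=0.7))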

129 citations


Proceedings ArticleDOI
23 Sep 2007
TL;DR: A new approach to logo detection and extraction in document images that robustly classifies and precisely localizes logos using a boosting strategy across multiple image scales is proposed.
Abstract: Automatic logo detection and recognition continues to be of great interest to the document retrieval community as it enables effective identification of the source of a document. In this paper, we propose a new approach to logo detection and extraction in document images that robustly classifies and precisely localizes logos using a boosting strategy across multiple image scales. At a coarse scale, a trained Fisher classifier performs initial classification using features from document context and connected components. Each logo candidate region is further classified at successively finer scales by a cascade of simple classifiers, which allows false alarms to be discarded and the detected region to be refined. Our approach is segmentation free and lay-out independent. We define a meaningful evaluation metric to measure the quality of logo detection using labeled groundtruth. We demonstrate the effectiveness of our approach using a large collection of real-world documents.

119 citations


Journal ArticleDOI
01 Nov 2007
TL;DR: The system framework and some key techniques of content-based 3D model retrieval are identified and explained, including canonical coordinate normalization and preprocessing, feature extraction, similarity match, query representation and user interface, and performance evaluation.
Abstract: As the number of available 3D models grows, there is an increasing need to index and retrieve them according to their contents. This paper provides a survey of the up-to-date methods for content-based 3D model retrieval. First, the new challenges encountered in 3D model retrieval are discussed. Then, the system framework and some key techniques of content-based 3D model retrieval are identified and explained, including canonical coordinate normalization and preprocessing, feature extraction, similarity match, query representation and user interface, and performance evaluation. In particular, similarity measures using semantic clues and machine learning methods, as well as retrieval approaches using nonshape features, are given adequate recognition as improvements and complements for traditional shape-matching techniques. Typical 3D model retrieval systems and search engines are also listed and compared. Finally, future research directions are indicated, and an extensive bibliography is provided.

97 citations


Proceedings ArticleDOI
23 Jul 2007
TL;DR: This thesis focuses on one particular type of entity, people; identifying entities indirectly through their occurrences in documents brings new, exciting challenges to the fields of Information Retrieval and Information Extraction.
Abstract: The enormous increase in recent years in the amount of information available online has led to a renewed interest in a broad range of IR-related areas that go beyond plain document retrieval. Some of this new attention has fallen on a subset of IR tasks, in particular on entity retrieval tasks. This emerging area differs from traditional document retrieval in a number of ways. Entities are not represented directly (as retrievable units such as documents), and we need to identify them "indirectly" through occurrences in documents. This brings new, exciting challenges to the fields of Information Retrieval and Information Extraction. In this thesis we focus on one particular type of entity: people.

Journal ArticleDOI
TL;DR: This paper reviews a variety of text/image retrieval approaches as well as their individual components in the context of broadcast news video, and conducts a series of retrieval experiments on TRECVID video collections to identify their advantages and disadvantages.
Abstract: The effectiveness of a video retrieval system largely depends on the choice of underlying text and image retrieval components. The unique properties of video collections (e.g., multiple sources, noisy features and temporal relations) suggest we examine the performance of these retrieval methods in such a multimodal environment, and identify the relative importance of the underlying retrieval components. In this paper, we review a variety of text/image retrieval approaches as well as their individual components in the context of broadcast news video. Numerous components of text/image retrieval have been discussed in detail, including retrieval models, text sources, temporal expansion methods, query expansion methods, image features, and similarity measures. For each component, we conduct a series of retrieval experiments on TRECVID video collections to identify their advantages and disadvantages. To provide a more complete coverage of video retrieval, we briefly discuss an emerging approach called concept-based video retrieval, and review strategies for combining multiple retrieval outputs.

Proceedings ArticleDOI
23 Jul 2007
TL;DR: It is shown that appropriate use of domain-specific knowledge in a proposed conceptual retrieval model yields about 23% improvement over the best reported result in passage retrieval in the Genomics Track of TREC 2006.
Abstract: This paper presents a study of incorporating domain-specific knowledge (i.e., information about concepts and relationships between concepts in a certain domain) in an information retrieval (IR) system to improve its effectiveness in retrieving biomedical literature. The effects of different types of domain-specific knowledge in performance contribution are examined. Based on the TREC platform, we show that appropriate use of domain-specific knowledge in a proposed conceptual retrieval model yields about 23% improvement over the best reported result in passage retrieval in the Genomics Track of TREC 2006.

Proceedings Article
01 Apr 2007
TL;DR: A suite of automatic techniques for rewriting queries is developed and its characteristics studied; the shortcomings of automatic methods can be ameliorated by some simple user interaction, yielding results that are on average 25% better than the baseline.
Abstract: Information retrieval systems are frequently required to handle long queries. Simply using all terms in the query or relying on the underlying retrieval model to appropriately weight terms often leads to ineffective retrieval. We show that rewriting the query to a version that comprises a small subset of appropriate terms from the original query greatly improves effectiveness. Targeting a demonstrated potential improvement of almost 50% on some difficult TREC queries and their associated collections, we develop a suite of automatic techniques to re-write queries and study their characteristics. We show that the shortcomings of automatic methods can be ameliorated by some simple user interaction, and report results that are on average 25% better than the baseline.
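
One simple way to reduce a long query can be sketched in a few lines of Python; the keep-the-highest-IDF-terms criterion below is our illustrative choice, not the specific suite of techniques the paper develops:

import math

def reduce_query(query, doc_freq, num_docs, keep=3):
    # Keep the terms that are rarest in the collection (highest IDF),
    # dropping common words that contribute little to retrieval.
    terms = query.lower().split()
    idf = {t: math.log(num_docs / (1 + doc_freq.get(t, 0))) for t in terms}
    return sorted(set(terms), key=idf.get, reverse=True)[:keep]

df = {"the": 9000, "of": 8500, "endangered": 40, "whale": 65, "population": 300}
print(reduce_query("the population of endangered whale species", df, 10000))
# ['species', 'endangered', 'whale']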

Journal ArticleDOI
TL;DR: It is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text.
Abstract: Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experiment results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text.
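
The kind of heuristic at issue can be made concrete with a small, hypothetical Python tokenizer; the specific rules below (split on hyphens and letter-digit boundaries, also emit the joined form) are our illustration, not the exact heuristics the paper evaluates:

import re

def tokenize(text: str):
    tokens = []
    for raw in text.lower().split():
        raw = raw.strip(".,;()")
        # Split on hyphens/slashes and at letter-digit boundaries, so that
        # names like "IL-2" match queries written as "IL 2" or "il2".
        parts = [p for p in re.split(
            r"[-/]|(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", raw) if p]
        tokens.extend(parts)
        if len(parts) > 1:
            tokens.append("".join(parts))  # joined variant, e.g. "il2"
    return tokens

print(tokenize("IL-2 activates p53-dependent pathways"))
# ['il', '2', 'il2', 'activates', 'p', '53', 'dependent', 'p53dependent', 'pathways']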

Proceedings ArticleDOI
11 Mar 2007
TL;DR: This paper proposes a graph-based text representation, which is capable of capturing term order, term frequency, and term co-occurrence in documents, and applies the graph model to the text mining task of discovering unapparent associations between two or more concepts from a large text corpus.
Abstract: For information retrieval and text mining, a robust scalable framework is required to represent the information extracted from documents and enable visualization and query of such information. One very widely used model is the vector space model, which is based on the bag-of-words approach. However, it suffers from the fact that it loses important information about the original text, such as information about the order of the terms in the text or about the frontiers between sentences or paragraphs. In this paper, we propose a graph-based text representation, which is capable of capturing (i) term order, (ii) term frequency, (iii) term co-occurrence, and (iv) term context in documents. We also apply the graph model to our text mining task, which is to discover unapparent associations between two or more concepts (e.g., individuals) from a large text corpus. A counterterrorism corpus is used to evaluate the performance of various retrieval models, which demonstrates the feasibility and effectiveness of graph-based text representation in information retrieval and text mining.
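
A minimal graph-of-words sketch in Python shows how term order survives in such a representation where a bag of words would lose it; the details are ours for illustration, not the paper's exact model:

from collections import defaultdict

def build_graph(text: str):
    # Nodes are terms; a directed edge (a, b) records that b followed a,
    # and its weight counts how often that adjacency occurred.
    words = text.lower().split()
    edges = defaultdict(int)
    for a, b in zip(words, words[1:]):
        edges[(a, b)] += 1
    return edges

g = build_graph("new york is not york new")
print(g[("new", "york")], g[("york", "new")])  # 1 1 -- order distinguishes them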

Journal ArticleDOI
01 Apr 2007
TL;DR: The development and validation of the phrase-based VSM are reported, showing significant retrieval effectiveness improvements in both exhaustive search and cluster-based document retrieval.
Abstract: Objective: To develop a document indexing scheme that improves the retrieval effectiveness for free-text medical documents. Design: The phrase-based vector space model (VSM) uses multi-word phrases as indexing terms. Each phrase consists of a concept in the Unified Medical Language System (UMLS) and its corresponding component word stems. The similarity between concepts is defined by their relations in a hypernym hierarchy derived from UMLS. After defining the similarity between two phrases by their stem overlaps and the similarity between the concepts they represent, we define the similarity between two documents as the cosine of the angle between their corresponding phrase vectors. This paper reports the development and the validation of the phrase-based VSM. Measurement: We compare the retrieval effectiveness of different vector space models using two standard test collections, OHSUMED and Medlars. OHSUMED contains 105 queries and 14,430 documents, and Medlars contains 30 queries and 1033 documents. Each document in the test collections is judged by human experts to be either relevant or non-relevant to each query. The retrieval effectiveness is measured by precision and recall. Results: The phrase-based VSM is significantly more effective than the current gold standard, the stem-based VSM. Such significant retrieval effectiveness improvements are observed in both exhaustive search and cluster-based document retrieval. Conclusion: The phrase-based VSM is a better indexing scheme than the stem-based VSM. Medical document retrieval using the phrase-based VSM is significantly more effective than that using the stem-based VSM.
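
The cosine similarity at the heart of any VSM is easy to state in Python; the snippet below is a generic sketch with plain strings standing in for UMLS-derived phrases, not the paper's full phrase- and concept-similarity machinery:

import math
from collections import Counter

def cosine(doc_a, doc_b):
    # Documents are term-frequency vectors over their indexing terms;
    # similarity is the cosine of the angle between the two vectors.
    va, vb = Counter(doc_a), Counter(doc_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Multi-word phrases, rather than single stems, as indexing terms:
a = ["myocardial infarction", "aspirin therapy"]
b = ["myocardial infarction", "beta blocker"]
print(cosine(a, b))  # 0.5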

Journal ArticleDOI
01 Jan 2007
TL;DR: This paper presents a system that provides users with personalized results derived from a search engine that uses link structures, and demonstrates that the developed system is capable of searching not only relevant but also personalized web pages, depending on the preferences of the user.
Abstract: Personalized search engines are important tools for finding web documents for specific users, because they are able to provide the location of information on the WWW as accurately as possible, using efficient methods of data mining and knowledge discovery. The types and features of traditional search engines are various, including support for different functionality and ranking methods. New search engines that use link structures have produced improved search results which can overcome the limitations of conventional text-based search engines. Going a step further, this paper presents a system that provides users with personalized results derived from a search engine that uses link structures. The fuzzy document retrieval system (constructed from a fuzzy concept network based on the user's profile) personalizes the results yielded from link-based search engines with the preferences of the specific user. A preliminary experiment with six subjects indicates that the developed system is capable of searching not only relevant but also personalized web pages, depending on the preferences of the user.

Proceedings ArticleDOI
09 Jul 2007
TL;DR: This paper shows how information theory supplies us with the tools necessary to develop a unique model for text, image, and text/image retrieval and estimates a maximum entropy model based on exclusively continuous features that were preprocessed.
Abstract: To solve the problem of indexing collections with diverse text documents, image documents, or documents with both text and images, one needs to develop a model that supports heterogeneous types of documents. In this paper, we show how information theory supplies us with the tools necessary to develop a unique model for text, image, and text/image retrieval. In our approach, for each possible query keyword we estimate a maximum entropy model based on exclusively continuous features that were preprocessed. The unique continuous feature-space of text and visual data is constructed by using a minimum description length criterion to find the optimal feature-space representation (optimal from an information theory point of view). We evaluate our approach in three experiments: only text retrieval, only image retrieval, and text combined with image retrieval.

Proceedings ArticleDOI
11 Jun 2007
TL;DR: In the WISDM project at UIUC, a prototype search engine over a 2TB Web corpus is built and evaluated, showing the feasibility and promise of a large-scale system architecture to support entity search.
Abstract: As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are increasingly inadequate. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. Therefore, we propose the concept of entity search, a significant departure from traditional document retrieval. Towards our goal of supporting entity search, in the WISDM project at UIUC we build and evaluate our prototype search engine over a 2TB Web corpus. Our demonstration shows the feasibility and promise of a large-scale system architecture to support entity search.

Journal ArticleDOI
01 Mar 2007
TL;DR: Measured by miss and false alarm rates, the EP-supported ET (EPET) technique exhibits better tracking effectiveness than a traditional ET technique and suggests that the proposed EP technique could effectively discover event episodes and EPs in sequences of documents.
Abstract: Recent advances in information and networking technologies have contributed significantly to global connectivity and greatly facilitated and fostered information creation, distribution, and access. The resultant ever-increasing volume of online textual documents creates an urgent need for new text mining techniques that can intelligently and automatically extract implicit and potentially useful knowledge from these documents for decision support. This research focuses on identifying and discovering event episodes together with their temporal relationships that occur frequently (referred to as evolution patterns (EPs) in this paper) in sequences of documents. The discovery of such EPs can be applied in domains such as knowledge management and used to facilitate existing document management and retrieval techniques [e.g., event tracking (ET)]. Specifically, we propose and design an EP discovery technique for mining EPs from sequences of documents. We experimentally evaluate our proposed EP technique in the context of facilitating ET. Measured by miss and false alarm rates, the EP-supported ET (EPET) technique exhibits better tracking effectiveness than a traditional ET technique. The encouraging performance of the EPET technique demonstrates the potential usefulness of EPs in supporting ET and suggests that the proposed EP technique could effectively discover event episodes and EPs in sequences of documents.

Book ChapterDOI
27 Jun 2007
TL;DR: An overview of the NL understanding environment functionalities and the algorithms related to the text segmentation method, which requires an NLP parser and a semantic representation in Roget-based vectors, is presented.
Abstract: Information retrieval needs to match relevant texts with a given query. Selecting appropriate parts is useful when documents are long, and only portions are interesting to the user. In this paper, we describe a method that extensively uses natural language techniques for text segmentation based on topic change detection. The method requires an NLP parser and a semantic representation in Roget-based vectors. We have run the experiment on French documents, for which we have the appropriate tools, but the method could be transposed to any other language with the same requirements. The article sketches an overview of the functionalities of the NL understanding environment and the algorithms related to our text segmentation method. An experiment in text segmentation is also presented and its result in an information retrieval task is shown.

Proceedings Article
01 Jan 2007
TL;DR: HDR is defined as the retrieval of relevant historic documents for a modern query; the approach is to treat the historic and modern languages as different languages and use cross-language information retrieval (CLIR) techniques to translate one into the other.
Abstract: Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives, like DigiCULT, make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience. Natural languages evolve over time, changing in pronunciation and spelling, and new words are introduced continuously, while older words may disappear out of everyday use. For these reasons, queries involving modern words may not be very effective for retrieving documents that contain many historic terms. Although reading a 300-year-old document might not be problematic because the words are still recognizable, the changes in vocabulary and spelling can make it difficult to use a search engine to find relevant documents. To illustrate this, consider the following example from our collection of 17th century Dutch law texts. If you are looking for information on the tasks of a lawyer (modern Dutch: "advocaat") in these texts, the modern spelling will not lead you to documents containing the 17th century Dutch spelling variant "advocaet". Since spelling rules were not introduced until the 19th century, 17th century Dutch spelling is inconsistent. Being based mainly on pronunciation, words were often spelled in several different variants, which poses a problem for standard retrieval engines. We therefore define Historic Document Retrieval (HDR) as the retrieval of relevant historic documents for a modern query. Our approach to this problem is to treat the historic and modern languages as different languages, and use cross-language information retrieval (CLIR) techniques to translate one language into the other.
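
A tiny, hypothetical Python sketch illustrates the rewriting idea in the simplest possible form; the single rule below is our own example and far cruder than the CLIR techniques the paper applies:

# Map historic character sequences to modern ones before matching, so a
# modern query term can reach its 17th-century spelling variants.
HISTORIC_TO_MODERN = [("ae", "aa")]  # illustrative rule, not a full rule set

def modernize(word: str) -> str:
    for old, new in HISTORIC_TO_MODERN:
        word = word.replace(old, new)
    return word

print(modernize("advocaet"))  # 'advocaat' -- now matches the modern query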

Proceedings ArticleDOI
23 Jul 2007
TL;DR: The effect of pruning in recognition is analyzed, showing that when recognition speed is increased, the reduction in retrieval performance due to the increase in the 1-best error rate can be compensated by using confusion networks.
Abstract: In this paper, we investigate methods for improving the performance of morph-based spoken document retrieval in Finnish by extracting relevant index terms from confusion networks. Our approach uses morpheme-like subword units ("morphs") for recognition and indexing. This alleviates the problem of out-of-vocabulary words, especially with inflectional languages like Finnish. Confusion networks offer a convenient representation of alternative recognition candidates by aligning mutually exclusive terms and by giving the posterior probability of each term. The rank of the competing terms and their posterior probability is used to estimate term frequency for indexing. Comparing against 1-best recognizer transcripts, we show that retrieval effectiveness is significantly improved. Finally, the effect of pruning in recognition is analyzed, showing that when recognition speed is increased, the reduction in retrieval performance due to the increase in the 1-best error rate can be compensated by using confusion networks.
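
The indexing step can be summarized in a short Python sketch; this simplification (expected term frequency as the sum of posteriors over slots) is our reading of the approach, not the authors' code:

from collections import defaultdict

def expected_term_frequency(confusion_network):
    """confusion_network: list of slots, each a list of (term, posterior)
    pairs for the competing recognition hypotheses at that position."""
    tf = defaultdict(float)
    for slot in confusion_network:
        for term, posterior in slot:
            tf[term] += posterior  # soft count instead of a 0/1 decision
    return dict(tf)

cn = [[("helsinki", 0.6), ("helsingin", 0.4)],
      [("yliopisto", 0.9), ("opisto", 0.1)]]
print(expected_term_frequency(cn))
# {'helsinki': 0.6, 'helsingin': 0.4, 'yliopisto': 0.9, 'opisto': 0.1}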

Proceedings ArticleDOI
06 Nov 2007
TL;DR: This paper defines four types of noun phrases, presents an algorithm for recognizing these phrases in queries, and compares it against a baseline noun phrase recognition algorithm on TREC queries.
Abstract: It has been shown that using phrases properly in document retrieval leads to higher retrieval effectiveness. In this paper, we define four types of noun phrases and present an algorithm for recognizing these phrases in queries. The strengths of several existing tools are combined for phrase recognition. Our algorithm is tested using a set of 500 web queries from a query log, and a set of 238 TREC queries. Experimental results show that our algorithm yields high phrase recognition accuracy. We also use a baseline noun phrase recognition algorithm to recognize phrases from the TREC queries. A document retrieval experiment is conducted using the TREC queries (1) without any phrases, (2) with the phrases recognized by the baseline noun phrase recognition algorithm, and (3) with the phrases recognized by our algorithm, respectively. The retrieval effectiveness of (3) is better than that of (2), which is better than that of (1). This demonstrates that utilizing phrases in queries does improve the retrieval effectiveness, and better noun phrase recognition yields higher retrieval performance.

Patent
Ashutosh Garg1, Mayur Datar1
21 Dec 2007
TL;DR: In this article, a system generates a text document from a received document image; searchable metadata elements may be assigned to all or part of the text document by a user or by a template used to generate the text document.
Abstract: A system generates a text document from a received document image. Searchable metadata elements may be assigned to all or part of the text document by a user or by a template used to generate the text document. The text document and the associated metadata elements may be stored to facilitate subsequent searching and retrieval of the text document based on contents of the text document and/or its associated metadata elements.

Book ChapterDOI
Milad Shokouhi1
02 Apr 2007
TL;DR: This work proposes a new fusion method that partitions the rank lists of document retrieval systems into chunks and shows that the proposed method produces higher average precision values than previous systems across a range of testbeds.
Abstract: Metasearch and data-fusion techniques combine the rank lists of multiple document retrieval systems with the aim of improving search coverage and precision. We propose a new fusion method that partitions the rank lists of document retrieval systems into chunks. The size of chunks grows exponentially in the rank list. Using a small number of training queries, the probabilities of relevance of documents in different chunks are approximated for each search system. The estimated probabilities and normalized document scores are used to compute the final document ranks in the merged list. We show that our proposed method produces higher average precision values than previous systems across a range of testbeds.
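
The chunking scheme can be sketched in a few lines of Python; the chunk boundaries and per-chunk probabilities below are illustrative values, where the paper would estimate the probabilities from training queries:

from collections import defaultdict

CHUNK_PROB = [0.9, 0.6, 0.3, 0.1]  # assumed relevance probability per chunk

def chunk_of(rank: int) -> int:
    # Chunk k covers ranks [2^k - 1, 2^(k+1) - 1): sizes 1, 2, 4, 8, ...
    k = 0
    while rank >= 2 ** (k + 1) - 1:
        k += 1
    return min(k, len(CHUNK_PROB) - 1)

def fuse(rank_lists):
    # Each document accumulates the relevance probability of the chunk it
    # fell into, for every system that returned it.
    scores = defaultdict(float)
    for ranked_docs in rank_lists:
        for rank, doc in enumerate(ranked_docs):
            scores[doc] += CHUNK_PROB[chunk_of(rank)]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(fuse([["d1", "d2", "d3"], ["d2", "d1", "d4"]]))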

Proceedings Article
01 Jan 2007
TL;DR: The official runs of the team for the CLEF 2004 ad hoc tasks are described, including the FlexIR system as well as the approaches used for each of the tasks in which they participated.
Abstract: We describe the official runs of our team for the CLEF 2004 ad hoc tasks. We took part in the monolingual task (for Finnish, French, Portuguese, and Russian), in the bilingual task (for Amharic to English, and English to Portuguese), and, finally, in the multilingual task. In the CLEF 2004 evaluation exercise we participated in all three ad hoc retrieval tasks. We took part in the monolingual tasks for four non-English languages: Finnish, French, Portuguese, and Russian. The Portuguese language was new for CLEF 2004. Our participation in the monolingual task was a further continuation of our earlier efforts in monolingual retrieval [11, 5, 6]. Our first aim was to continue our experiments with a number of language-dependent techniques, in particular stemming algorithms for all European languages [14], and compound splitting for the compound-rich Finnish language. A second aim was to continue our experiments with language-independent techniques, in particular the use of character n-grams, where we may also index leading and ending character sequences, and retain the original words. Our third aim was to experiment with combinations of runs. We took part in the bilingual task, this year focusing on Amharic into English, and on English to Portuguese. Our bilingual runs were motivated by the following aims. Our first aim was to experiment with a language for which resources are few and far between, Amharic, and to see how far we could get by combining the scarcely available resources. Our second aim was to experiment with the relative effectiveness of a number of translation resources: machine translation [16] versus a parallel corpus [7], and query translation versus collection translation. Our third aim was to evaluate the effectiveness of our monolingual retrieval approaches for imperfectly translated queries, shedding light on the robustness of these approaches. Finally, we continued our participation in the multilingual task, where we experimented with straightforward ways of query translation, using machine translation whenever available, and a translation dictionary otherwise. We also experimented with combination methods using runs made on varying types of indexes. In Section 2 we describe the FlexIR system as well as the approaches used for each of the tasks in which we participated. Section 3 describes our official retrieval runs for CLEF 2004. In Section 4 we discuss the results we have obtained. Finally, in Section 5, we offer some conclusions regarding our document retrieval efforts.

Proceedings ArticleDOI
02 Nov 2007
TL;DR: This work presents a novel approach that extracts real content from news Web pages in an unsupervised fashion based on distilling linguistic and structural features from text blocks in HTML pages, having a particle swarm optimizer (PSO) learn feature thresholds for optimal classification performance.
Abstract: Today's Web pages are commonly made up of more than merely one cohesive block of information. For instance, news pages from popular media channels such as Financial Times or Washington Post consist of no more than 30%-50% of textual news, next to advertisements, link lists to related articles, disclaimer information, and so forth. However, for many search-oriented applications such as the detection of relevant pages for an in-focus topic, dissecting the actual textual content from surrounding page clutter is an essential task, so as to maintain appropriate levels of document retrieval accuracy. We present a novel approach that extracts real content from news Web pages in an unsupervised fashion. Our method is based on distilling linguistic and structural features from text blocks in HTML pages, having a particle swarm optimizer (PSO) learn feature thresholds for optimal classification performance. Empirical evaluations and benchmarks show that our approach works very well when applied to several hundreds of news pages from popular media in 5 languages.

Patent
01 Jun 2007
TL;DR: In this article, a set of integrated methodologies is presented that can combine automatic concept extraction/matching from text, a powerful fuzzy search engine, and a collaborative user preference learning engine to provide accurate and personalized search results.
Abstract: Information retrieval systems face challenging problems with delivering highly relevant and highly inclusive search results in response to a user's query. Contextual personalized information retrieval uses a set of integrated methodologies that can combine automatic concept extraction/matching from text, a powerful fuzzy search engine, and a collaborative user preference learning engine to provide accurate and personalized search results. The system can include constructing a search query to execute a search of a database. The system can parse an input query from a user conducting the search of the database into sub-strings, and can match the sub-strings to concepts in a semantic concept network of a knowledge base. The system can further map the matched concepts to criteria and criteria values that specify a set of constraints on and scoring parameters for the matched concepts.