
Showing papers on "Search engine indexing" published in 2004


Journal ArticleDOI
TL;DR: A large number of techniques to address the problem of text information extraction are classified and reviewed, benchmark data and performance evaluation are discussed, and promising directions for future research are pointed out.

927 citations


Proceedings ArticleDOI
13 Nov 2004
TL;DR: Swoogle is a crawler-based indexing and retrieval system for the Semantic Web that extracts metadata for each discovered document, and computes relations between documents.
Abstract: Swoogle is a crawler-based indexing and retrieval system for the Semantic Web. It extracts metadata for each discovered document, and computes relations between documents. Discovered documents are also indexed by an information retrieval system which can use either character N-Gram or URIrefs as keywords to find relevant documents and to compute the similarity among a set of documents. One of the interesting properties we compute is ontology rank, a measure of the importance of a Semantic Web document.

926 citations
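Ontology rank is described above as a measure of a Semantic Web document's importance, computed over the relations between discovered documents. Below is a minimal PageRank-style sketch of such a rank, assuming unweighted links and a standard damping factor; Swoogle's actual "rational surfer" weighting over typed inter-ontology relations is more elaborate.

```python
# Toy PageRank-style rank over a document-relation graph. The graph and
# damping factor here are illustrative assumptions, not Swoogle's scheme.

def rank(links, damping=0.85, iters=50):
    """links: dict mapping doc -> list of docs it references."""
    docs = set(links) | {d for targets in links.values() for d in targets}
    n = len(docs)
    r = {d: 1.0 / n for d in docs}
    for _ in range(iters):
        nxt = {d: (1.0 - damping) / n for d in docs}
        for src, targets in links.items():
            if targets:
                share = damping * r[src] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:  # dangling document: spread its mass uniformly
                for d in docs:
                    nxt[d] += damping * r[src] / n
        r = nxt
    return r

toy = {"ontoA": ["ontoB"], "ontoB": ["ontoA", "ontoC"], "ontoC": []}
print(rank(toy))
```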


Proceedings ArticleDOI
13 Jun 2004
TL;DR: The gIndex approach not only provides an elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit from data mining, especially frequent pattern mining.
Abstract: Graphs have become increasingly important in modelling complicated structures and schemaless data such as proteins, chemical compounds, and XML documents. Given a graph query, it is desirable to retrieve graphs quickly from a large database via graph-based indices. In this paper, we investigate the issues of indexing graphs and propose a novel solution by applying a graph mining technique. Different from the existing path-based methods, our approach, called gIndex, makes use of frequent substructures as the basic indexing feature. Frequent substructures are ideal candidates since they explore the intrinsic characteristics of the data and are relatively stable to database updates. To reduce the size of the index structure, two techniques, size-increasing support constraint and discriminative fragments, are introduced. Our performance study shows that gIndex has a 10 times smaller index size, but achieves 3--10 times better performance in comparison with a typical path-based method, GraphGrep. The gIndex approach not only provides an elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit from data mining, especially frequent pattern mining. Furthermore, the concepts developed here can be applied to indexing sequences, trees, and other complicated structures as well.

706 citations
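At query time, gIndex filters candidates by intersecting the inverted lists of the query's indexed frequent fragments before any subgraph-isomorphism verification. A minimal sketch of that filtering step, with graph fragments abstracted as canonical string codes and the mined feature set assumed given; the mining of discriminative fragments under the size-increasing support constraint is the part this sketch omits.

```python
# Sketch of gIndex-style candidate filtering. Graph fragments are
# abstracted as canonical string codes; a real implementation would
# enumerate subgraphs and compute canonical labels.

def build_index(db_fragments):
    """db_fragments: dict graph_id -> set of fragment codes in that graph.
    Returns an inverted index: fragment code -> set of graph ids."""
    index = {}
    for gid, frags in db_fragments.items():
        for f in frags:
            index.setdefault(f, set()).add(gid)
    return index

def candidates(query_fragments, index):
    """Every graph containing the query must contain all of the query's
    indexed fragments, so intersect their inverted lists."""
    indexed = [f for f in query_fragments if f in index]
    if not indexed:
        return set().union(*index.values())  # no pruning possible
    result = set(index[indexed[0]])
    for f in indexed[1:]:
        result &= index[f]
    return result  # verification (subgraph isomorphism) still required

db = {1: {"C-C", "C-N"}, 2: {"C-C"}, 3: {"C-N", "N-O"}}
idx = build_index(db)
print(candidates({"C-C", "C-N"}, idx))  # {1}
```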


Book
25 Oct 2004
TL;DR: This book introduces the new world of text mining and examines proven methods for various critical text-mining tasks, such as automated document indexing and information retrieval and search, as well as new research areas that rely on evolving text- mining techniques.
Abstract: The growth of the web can be seen as an expanding public digital library collection. Online digital information extends far beyond the web and its publicly available information. Huge amounts of information are private and are of interest to local communities, such as the records of customers of a business. This information is overwhelmingly text and has its record-keeping purpose, but an automated analysis might be desirable to find patterns in the stored records. Analogous to this data mining is text mining, which also finds patterns and trends in information samples but which does so with far less structured--though with greater immediate utility for users--ingredients. This book focuses on the concepts and methods needed to expand horizons beyond structured, numeric data to automated mining of text samples. It introduces the new world of text mining and examines proven methods for various critical text-mining tasks, such as automated document indexing and information retrieval and search. New research areas are explored, such as information extraction and document summarization, that rely on evolving text-mining techniques.

596 citations


Proceedings ArticleDOI
07 Jun 2004
TL;DR: This paper surveys methods for content-based 3D shape retrieval, evaluating their applicability to surface and volume models against requirements such as shape representation, dissimilarity measures, efficiency, discrimination ability, partial matching, robustness, and pose normalization.
Abstract: Recent developments in techniques for modeling, digitizing and visualizing 3D shapes have led to an explosion in the number of available 3D models on the Internet and in domain-specific databases. This has led to the development of 3D shape retrieval systems that, given a query object, retrieve similar 3D objects. For visualization, 3D shapes are often represented as a surface, in particular polygonal meshes, for example in VRML format. Often these models contain holes, intersecting polygons, are not manifold, and do not enclose a volume unambiguously. In contrast, 3D volume models, such as solid models produced by CAD systems, or voxel models, enclose a volume properly. This paper surveys the literature on methods for content based 3D retrieval, taking into account the applicability to surface models as well as to volume models. The methods are evaluated with respect to several requirements of content based 3D shape retrieval, such as: (1) shape representation requirements, (2) properties of dissimilarity measures, (3) efficiency, (4) discrimination abilities, (5) ability to perform partial matching, (6) robustness, and (7) necessity of pose normalization. Finally, the advantages and limits of the several approaches in content based 3D shape retrieval are discussed.

381 citations


Journal ArticleDOI
TL;DR: It is demonstrated empirically that overlapping character n-gram tokenization can provide retrieval accuracy rivaling the best current language-specific approaches for European languages and that n = 4 is a good choice for those languages; the increased storage and time requirements of the technique are also documented.
Abstract: The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.

356 citations
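Overlapping character n-gram tokenization is easy to reproduce; here is a minimal sketch with n = 4, padding word boundaries with spaces, which is a common convention and an assumption here rather than the paper's exact recipe.

```python
def char_ngrams(text, n=4):
    """Overlapping character n-grams over lightly normalized text.
    N-grams may span word boundaries; the padding spaces mark them."""
    s = " " + " ".join(text.lower().split()) + " "
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# The n-grams then serve as indexing terms in an ordinary inverted index,
# in place of stemmed or unnormalized words.
print(char_ngrams("Juggling jugs"))
```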


Journal ArticleDOI
TL;DR: This article proposes a phrase-based document index model, the document index graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only.
Abstract: Document clustering techniques mostly rely on single-term analysis of the document data set, such as the vector space model. To achieve more accurate document clustering, more informative features, including phrases and their weights, are particularly important. Document clustering is useful in many applications, such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This article presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the document index graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.

348 citations
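A minimal sketch of the phrase-index idea: treat each document as a path through a shared word graph whose edges record which documents traverse them, so phrases shared by two documents surface as runs of shared edges. The published document index graph keeps richer sentence and position information than this toy version.

```python
from collections import defaultdict

class DocIndexGraph:
    """Toy phrase index: nodes are words, edges are word bigrams annotated
    with the documents containing them. Phrases shared between a new text
    and an indexed document appear as runs of shared edges."""
    def __init__(self):
        self.edge_docs = defaultdict(set)  # (w1, w2) -> doc ids

    def add(self, doc_id, words):
        for w1, w2 in zip(words, words[1:]):
            self.edge_docs[(w1, w2)].add(doc_id)

    def shared_phrases(self, words, doc_id):
        """Maximal runs of consecutive bigrams of `words` shared with doc_id."""
        run, phrases = [], []
        for w1, w2 in zip(words, words[1:]):
            if doc_id in self.edge_docs.get((w1, w2), ()):
                run = run + [w2] if run else [w1, w2]
            else:
                if len(run) >= 2:
                    phrases.append(" ".join(run))
                run = []
        if len(run) >= 2:
            phrases.append(" ".join(run))
        return phrases

g = DocIndexGraph()
g.add(1, "river rafting trips".split())
print(g.shared_phrases("wild river rafting".split(), 1))  # ['river rafting']
```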


Journal ArticleDOI
TL;DR: It is shown that in order to allow people to profit from all this visual information, there is a need to develop tools that help them to locate the needed images with good precision in a reasonable time and that such tools are useful for many applications and purposes.
Abstract: With the explosive growth of the World Wide Web, the public is gaining access to massive amounts of information. However, locating needed and relevant information remains a difficult task, whether the information is textual or visual. Text search engines have existed for some years now and have achieved a certain degree of success. However, despite the large number of images available on the Web, image search engines are still rare. In this article, we show that in order to allow people to profit from all this visual information, there is a need to develop tools that help them to locate the needed images with good precision in a reasonable time, and that such tools are useful for many applications and purposes. The article surveys the main characteristics of the existing systems most often cited in the literature, such as ImageRover, WebSeek, Diogenes, and Atlas WISE. It then examines the various issues related to the design and implementation of a Web image search engine, such as data gathering and digestion, indexing, query specification, retrieval and similarity, Web coverage, and performance evaluation. A general discussion is given for each of these issues, with examples of the ways they are addressed by existing engines, and 130 related references are given. Some concluding remarks and directions for future research are also presented.

338 citations


Book ChapterDOI
31 Aug 2004
TL;DR: This paper develops two index structures and associated algorithms to efficiently answer Probabilistic Threshold Queries (PTQs), establishes the difficulty of the problem by mapping one-dimensional intervals to a two-dimensional space, and shows that interval indexing with probabilities is significantly harder than classical interval indexing, a well-studied problem.
Abstract: It is infeasible for a sensor database to contain the exact value of each sensor at all points in time. This uncertainty is inherent in these systems due to measurement and sampling errors, and resource limitations. In order to avoid drawing erroneous conclusions based upon stale data, the use of uncertainty intervals that model each data item as a range and associated probability density function (pdf) rather than a single value has recently been proposed. Querying these uncertain data introduces imprecision into answers, in the form of probability values that specify the likeliness the answer satisfies the query. These queries are more expensive to evaluate than their traditional counterparts but are guaranteed to be correct and more informative due to the probabilities accompanying the answers. Although the answer probabilities are useful, for many applications it is only necessary to know whether the probability exceeds a given threshold - we term these Probabilistic Threshold Queries (PTQ). In this paper we address the efficient computation of these types of queries. In particular, we develop two index structures and associated algorithms to efficiently answer PTQs. The first index scheme is based on the idea of augmenting uncertainty information to an R-tree. We establish the difficulty of this problem by mapping one-dimensional intervals to a two-dimensional space, and show that interval indexing with probabilities is significantly harder than classical interval indexing, which is a well-studied problem. To overcome the limitations of this R-tree based structure, we apply a technique we call variance-based clustering, where data points with similar degrees of uncertainty are clustered together. Our extended index structure can answer the queries for various kinds of uncertainty pdfs, in an almost optimal sense. We conduct experiments to validate the superior performance of both indexing schemes.

305 citations
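The query itself is simple to state; below is a minimal sketch for one-dimensional uncertain values with uniform pdfs, returning items whose probability of lying in the query range exceeds the threshold. The paper's contribution is the index that avoids scanning every item, which this sketch deliberately omits.

```python
def overlap_prob(lo, hi, q_lo, q_hi):
    """P(X in [q_lo, q_hi]) for X uniform on [lo, hi]."""
    inter = max(0.0, min(hi, q_hi) - max(lo, q_lo))
    return inter / (hi - lo) if hi > lo else float(q_lo <= lo <= q_hi)

def ptq(items, q_lo, q_hi, tau):
    """items: dict id -> (lo, hi) uncertainty interval with uniform pdf.
    Return ids whose answer probability exceeds threshold tau.
    (A real PTQ index prunes items without computing every probability.)"""
    return [i for i, (lo, hi) in items.items()
            if overlap_prob(lo, hi, q_lo, q_hi) > tau]

sensors = {"s1": (10.0, 20.0), "s2": (18.0, 30.0), "s3": (25.0, 26.0)}
print(ptq(sensors, 15.0, 24.0, 0.4))  # ['s1', 's2'] (each 0.5; s3 is 0)
```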


Proceedings ArticleDOI
17 May 2004
TL;DR: This paper analytically estimates how much longer it takes for a new page to attract a large number of Web users when search engines return only popular pages at the top of search results and shows that search engines can have an immensely worrisome impact on the discovery of new Web pages.
Abstract: Recent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In particular, we study how much impact search engines have on the popularity evolution of Web pages. For example, given that search engines return currently "popular" pages at the top of search results, are we somehow penalizing newly created pages that are not very well known yet? Are popular pages getting even more popular and new pages completely ignored? We first show that this unfortunate trend indeed exists on the Web through an experimental study based on real Web data. We then analytically estimate how much longer it takes for a new page to attract a large number of Web users when search engines return only popular pages at the top of search results. Our result shows that search engines can have an immensely worrisome impact on the discovery of new Web pages.

300 citations
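The qualitative rich-get-richer effect is easy to reproduce in a toy simulation (the parameters here are illustrative assumptions, not the paper's analytic model): when visits are allocated in proportion to current popularity, a new page starting at zero popularity is essentially never discovered, whereas random surfing finds it at the base rate.

```python
import random

def simulate(search_driven, pages=100, days=50, visits=500, seed=0):
    """Toy rich-get-richer model. Page 0 is the 'new' page starting with
    zero popularity; returns its final share of all visits."""
    rng = random.Random(seed)
    pop = [0] + [rng.randint(1, 50) for _ in range(pages - 1)]
    for _ in range(days):
        for _ in range(visits):
            if search_driven:
                # visits go to pages in proportion to current popularity,
                # mimicking search results ranked by popularity
                page = rng.choices(range(pages),
                                   weights=[p + 1e-9 for p in pop])[0]
            else:
                # random surfing: every page equally likely to be found
                page = rng.randrange(pages)
            pop[page] += 1
    return pop[0] / sum(pop)

print("search-driven share of new page:", simulate(True))
print("random-surfing share of new page:", simulate(False))
```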


Journal ArticleDOI
TL;DR: A new method for localizing and recognizing text in complex images and videos and showing good performance when integrated in a sports video annotation system and a video indexing system within the framework of two European projects is presented.

Proceedings Article
01 Jan 2004
TL;DR: This paper proposes an indexing procedure for spoken utterance retrieval that works on lattices rather than just single-best text, and demonstrates that this procedure can improve F scores by over five points compared to single-best retrieval on tasks with poor WER and low redundancy.
Abstract: Recent work on spoken document retrieval has suggested that it is adequate to take the single-best output of ASR, and perform text retrieval on this output. This is reasonable enough for the task of retrieving broadcast news stories, where word error rates are relatively low, and the stories are long enough to contain much redundancy. But it is patently not reasonable if one’s task is to retrieve a short snippet of speech in a domain where WERs can be as high as 50%; such would be the situation with teleconference speech, where one’s task is to find if and when a participant uttered a certain phrase. In this paper we propose an indexing procedure for spoken utterance retrieval that works on lattices rather than just single-best text. We demonstrate that this procedure can improve F scores by over five points compared to single-best retrieval on tasks with poor WER and low redundancy. The representation is flexible so that we can represent both word lattices, as well as phone lattices, the latter being important for improving performance when searching for phrases containing OOV words.
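One way to realize lattice-based indexing is to reduce each lattice to per-word posterior probabilities and store expected counts, rather than binary occurrences, in the inverted index; thresholding the expected count then trades precision against recall. A minimal sketch under that assumption follows; the paper's index over raw word and phone lattices is richer.

```python
from collections import defaultdict

def build_lattice_index(utterances):
    """utterances: dict utt_id -> list of (word, posterior) hypotheses,
    e.g. derived from a lattice. The index stores expected counts."""
    index = defaultdict(dict)
    for utt, hyps in utterances.items():
        for word, post in hyps:
            index[word][utt] = index[word].get(utt, 0.0) + post
    return index

def search(index, word, threshold=0.2):
    """Return utterances whose expected count for `word` clears a threshold;
    lowering the threshold trades precision for recall on high-WER speech."""
    return [(u, c) for u, c in index.get(word, {}).items() if c >= threshold]

utts = {
    "utt1": [("meeting", 0.9), ("meting", 0.1)],
    "utt2": [("meeting", 0.3), ("mating", 0.6)],
}
idx = build_lattice_index(utts)
print(search(idx, "meeting"))  # both utterances, with their confidences
```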

Journal ArticleDOI
TL;DR: A Peer-to-Peer (P2P) indexing system and associated P2P storage that supports large-scale, decentralized, real-time search capabilities and guarantees that all existing data elements matching a query will be found with bounded costs.
Abstract: Web Services are emerging as a dominant paradigm for constructing and composing distributed business applications and enabling enterprise-wide interoperability. A critical factor to the overall utility of Web Services is a scalable, flexible and robust discovery mechanism. This paper presents a Peer-to-Peer (P2P) indexing system and associated P2P storage that supports large-scale, decentralized, real-time search capabilities. The presented system supports complex queries containing partial keywords and wildcards. Furthermore, it guarantees that all existing data elements matching a query will be found with bounded costs in terms of number of messages and number of nodes involved. The key innovation is a dimension reducing indexing scheme that effectively maps the multidimensional information space to physical peers. The design and an experimental evaluation of the system are presented.
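The dimension-reducing scheme maps points of the multidimensional keyword space onto a one-dimensional curve whose intervals are assigned to peers, so lookups and range queries become curve-interval operations. A minimal sketch using a z-order (bit-interleaving) curve; the published system uses a Hilbert curve, which preserves locality better.

```python
def z_order(coords, bits=8):
    """Interleave the bits of d coordinates into one curve index."""
    key = 0
    for bit in reversed(range(bits)):
        for c in coords:
            key = (key << 1) | ((c >> bit) & 1)
    return key

def responsible_peer(coords, n_peers, bits=8):
    """Assign the curve index to one of n_peers contiguous curve intervals
    (a toy stand-in for a structured overlay's id space)."""
    key = z_order(coords, bits)
    return key * n_peers // (1 << (bits * len(coords)))

# Two-dimensional keyword space, e.g. coordinates derived from keywords.
print(responsible_peer((13, 200), n_peers=16))
```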

Book ChapterDOI
31 Aug 2004
TL;DR: A novel technique to speed up similarity search under uniform scaling, based on bounding envelopes, is proposed; it can achieve orders of magnitude of speedup over the brute force approach, the only alternative solution currently available.
Abstract: Data-driven animation has become the industry standard for computer games and many animated movies and special effects. In particular, motion capture data recorded from live actors is the most promising approach offered thus far for animating realistic human characters. However, the manipulation of such data for general use and re-use is not yet a solved problem. Many of the existing techniques dealing with editing motion rely on indexing for annotation, segmentation, and re-ordering of the data. Euclidean distance is inappropriate for solving these indexing problems because of the inherent variability found in human motion. The limitations of Euclidean distance stem from the fact that it is very sensitive to distortions in the time axis. A partial solution to this problem, Dynamic Time Warping (DTW), aligns the time axis before calculating the Euclidean distance. However, DTW can only address the problem of local scaling. As we demonstrate in this paper, global or uniform scaling is just as important in the indexing of human motion. We propose a novel technique to speed up similarity search under uniform scaling, based on bounding envelopes. Our technique is intuitive and simple to implement. We describe algorithms that make use of this technique, we perform an experimental analysis with real datasets, and we evaluate it in the context of a motion capture processing system. The results demonstrate the utility of our approach, and show that we can achieve orders of magnitude of speedup over the brute force approach, the only alternative solution currently available.
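The bounding-envelope idea can be sketched directly: for a maximum scaling factor sf, precompute upper and lower envelopes that bound every uniformly rescaled version of the query, then use a cheap envelope-based lower bound to prune candidates before any exact comparison. This is a minimal sketch following the usual formulation of uniform-scaling envelopes, under assumed conventions.

```python
import math

def scaling_envelope(q, sf=1.2):
    """Upper/lower envelopes bounding q under uniform scaling in [1, sf]."""
    n = len(q)
    upper, lower = [], []
    for i in range(n):
        j = min(n, int(math.ceil((i + 1) * sf)))
        window = q[i:j]
        upper.append(max(window))
        lower.append(min(window))
    return upper, lower

def lb_dist(candidate, upper, lower):
    """Lower bound on the squared Euclidean distance between `candidate`
    and any uniformly rescaled version of the query; candidates whose
    bound exceeds the best match so far can be pruned."""
    total = 0.0
    for c, u, l in zip(candidate, upper, lower):
        if c > u:
            total += (c - u) ** 2
        elif c < l:
            total += (l - c) ** 2
    return total

q = [0.0, 0.5, 1.0, 0.5, 0.0]
u, l = scaling_envelope(q)
print(lb_dist([0.1, 0.4, 1.2, 0.6, 0.2], u, l))
```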

Patent
08 Apr 2004
TL;DR: In this paper, a statistical modeling approach to automatic linguistic indexing of photographic images is presented, where images of any given concept are regarded as instances of a stochastic process that characterizes the concept and a high likelihood indicates a strong association between the textual description and the image.
Abstract: The present invention provides a statistical modeling approach to automatic linguistic indexing of photographic images. The invention uses categorized images to train a dictionary of hundreds of statistical models each representing a concept. Images of any given concept are regarded as instances of a stochastic process that characterizes the concept. To measure the extent of association between an image and a textual description associated with a predefined concept, the likelihood of the occurrence of the image based on the characterizing stochastic process is computed. A high likelihood indicates a strong association between the textual description and the image. The invention utilizes two-dimensional multi-resolution hidden Markov models that demonstrate accuracy and high potential in linguistic indexing of photographic images.

Book ChapterDOI
05 Apr 2004
TL;DR: Phrases, word senses and syntactic relations derived by Natural Language Processing techniques were observed to be ineffective at increasing retrieval accuracy.
Abstract: Previous research on advanced representations for document retrieval has shown that statistical state-of-the-art models are not improved by a variety of different linguistic representations. Phrases, word senses and syntactic relations derived by Natural Language Processing (NLP) techniques were observed to be ineffective at increasing retrieval accuracy. For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available, as it is a relatively new research area compared to document retrieval.

Journal ArticleDOI
TL;DR: An image retrieval framework is proposed that integrates region-based representation, efficient in terms of storage and complexity, with effective on-line learning; a region weighting strategy is introduced to optimally weight the regions and enable the system to self-improve.
Abstract: An image retrieval framework that integrates efficient region-based representation in terms of storage and complexity and effective on-line learning capability is proposed. The framework consists of methods for region-based image representation and comparison, indexing using modified inverted files, relevance feedback, and learning region weighting. By exploiting a vector quantization method, both compact and sparse (vector) region-based image representations are achieved. Using the compact representation, an indexing scheme similar to the inverted file technology and an image similarity measure based on Earth Mover's Distance are presented. Moreover, the vector representation facilitates a weighted query point movement algorithm and the compact representation enables a classification-based algorithm for relevance feedback. Based on users' feedback information, a region weighting strategy is also introduced to optimally weight the regions and enable the system to self-improve. Experimental results on a database of 10 000 general-purposed images demonstrate the efficiency and effectiveness of the proposed framework.

Journal ArticleDOI
TL;DR: This paper reports on an experiment conducted with MEDLINE indexers to evaluate MTI's performance and to generate ideas for its improvement as a tool for user-assisted indexing.
Abstract: The Medical Text Indexer (MTI) is a program for producing MeSH indexing recommendations. It is the major product of NLM's Indexing Initiative and has been used in both semi-automated and fully automated indexing environments at the Library since mid 2002. We report here on an experiment conducted with MEDLINE indexers to evaluate MTI's performance and to generate ideas for its improvement as a tool for user-assisted indexing. We also discuss some filtering techniques developed to improve MTI's accuracy for use primarily in automatically producing the indexing for several abstracts collections.

Proceedings ArticleDOI
13 Jun 2004
TL;DR: STRIPES is an indexing method that indexes predicted trajectories in a dual transformed space using a regular hierarchical grid decomposition, and can evaluate a range of queries including time-slice, window, and moving queries.
Abstract: Moving object databases are required to support queries on a large number of continuously moving objects. A key requirement for indexing methods in this domain is to efficiently support both update and query operations. Previous work on indexing such databases can be broadly divided into two categories: indexing the past positions and indexing the future predicted positions. In this paper we focus on an efficient indexing method for the future positions of moving objects. We propose an indexing method, called STRIPES, which indexes predicted trajectories in a dual transformed space. Trajectories for objects in d-dimensional space become points in a higher-dimensional 2d-space. This dual transformed space is then indexed using a regular hierarchical grid decomposition indexing structure. STRIPES can evaluate a range of queries including time-slice, window, and moving queries. We have carried out extensive experimental evaluation comparing the performance of STRIPES with the best known existing predicted trajectory index (the TPR*-tree), and show that our approach is significantly faster than the TPR*-tree for both updates and search queries.
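The dual transform at the heart of STRIPES is simple: a linearly moving object x(t) = x_ref + v(t - t_ref) becomes, per spatial dimension, the point (v, x_ref), and a grid over that dual space is the index. Here is a toy sketch for one spatial dimension with an assumed grid granularity; the real structure is a hierarchical grid with geometric pruning rather than the linear scan used below.

```python
from collections import defaultdict

class DualGridIndex:
    """Toy STRIPES-like index for 1-D motion x(t) = x_ref + v * (t - t_ref).
    Each object is a point (v, x_ref) in dual space, hashed to a grid cell."""
    def __init__(self, cell=10.0):
        self.cell = cell
        self.cells = defaultdict(dict)  # (vi, xi) -> {obj_id: (v, x_ref)}

    def _key(self, v, x_ref):
        return (int(v // self.cell), int(x_ref // self.cell))

    def update(self, obj_id, v, x_ref):
        # delete-then-insert update, as in predictive indexes
        for members in self.cells.values():
            members.pop(obj_id, None)
        self.cells[self._key(v, x_ref)][obj_id] = (v, x_ref)

    def time_slice(self, lo, hi, t, t_ref=0.0):
        """Objects whose predicted position at time t lies in [lo, hi].
        A real index prunes grid cells geometrically; this scans them."""
        hits = []
        for members in self.cells.values():
            for oid, (v, x_ref) in members.items():
                x = x_ref + v * (t - t_ref)
                if lo <= x <= hi:
                    hits.append(oid)
        return hits

idx = DualGridIndex()
idx.update("car1", v=2.0, x_ref=0.0)
idx.update("car2", v=-1.0, x_ref=50.0)
print(idx.time_slice(15.0, 25.0, t=10.0))  # ['car1'] at x = 20
```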

Proceedings Article
29 Mar 2004
TL;DR: This paper proposes eSearch--a P2P keyword search system based on a novel hybrid indexing structure that is scalable and efficient, and obtains search results as good as state-of-the-art centralized systems.
Abstract: Content-based full-text search still remains a particularly challenging problem in peer-to-peer (P2P) systems. Traditionally, there have been two index partitioning structures--partitioning based on the document space or partitioning based on keywords. The former requires search of every node in the system to answer a query whereas the latter transmits a large amount of data when processing multi-term queries. In this paper, we propose eSearch--a P2P keyword search system based on a novel hybrid indexing structure. In eSearch, each node is responsible for certain terms. Given a document, eSearch uses a modern information retrieval algorithm to select a small number of top (important) terms in the document and publishes the complete term list for the document to nodes responsible for those top terms. This selective replication of term lists allows a multi-term query to proceed local to the nodes responsible for query terms. We also propose automatic query expansion to alleviate the degradation of quality of search results due to the selective replication, overlay source multicast to reduce the cost of disseminating term lists, and techniques to balance term list distribution across nodes. eSearch is scalable and efficient, and obtains search results as good as state-of-the-art centralized systems. Despite the use of replication, eSearch actually consumes less bandwidth than systems based on keyword partitioning when publishing metadata for a document. During a retrieval operation, it searches only a small number of nodes and typically transmits a small amount of data (3.3KB) that is independent of the size of the corpus and grows slowly (logarithmically) with the number of nodes in the system. eSearch's efficiency comes at a modest storage cost, 6.8 times that of systems based on keyword partitioning. This cost can be further reduced by adopting index compression or pruning techniques.
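The publication step of the hybrid scheme can be sketched directly: score the document's terms, pick the top few, and replicate the complete term list onto the nodes responsible (by hashing) for those terms, so a multi-term query can complete locally at any node owning a query term. A toy sketch with tf-idf standing in for eSearch's actual term-weighting algorithm:

```python
import hashlib
import math
from collections import Counter

def node_for(term, n_nodes):
    """Consistent hash of a term to a node id (toy stand-in for a DHT)."""
    return int(hashlib.sha1(term.encode()).hexdigest(), 16) % n_nodes

def publish(doc_id, words, df, n_docs, n_nodes, nodes, top_k=3):
    """Replicate the document's full term list onto the nodes responsible
    for its top_k highest-weighted terms (tf-idf here, for illustration)."""
    tf = Counter(words)
    scored = sorted(tf, reverse=True,
                    key=lambda w: tf[w] * math.log(n_docs / (1 + df.get(w, 0))))
    for term in scored[:top_k]:
        nodes.setdefault(node_for(term, n_nodes), {})[doc_id] = set(tf)

n_nodes, nodes = 8, {}
df = {"the": 90, "p2p": 5, "search": 20}  # hypothetical document frequencies
publish("doc1", "the p2p search system p2p".split(), df, n_docs=100,
        n_nodes=n_nodes, nodes=nodes)
print(nodes)  # doc1's complete term list stored on nodes for its top terms
```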

Book ChapterDOI
20 Oct 2004
TL;DR: The SPIRIT search engine provides a test bed for the development of web search technology that is specialised for access to geographical information and supports functionality for disambiguation, query expansion, relevance ranking and metadata extraction.
Abstract: The SPIRIT search engine provides a test bed for the development of web search technology that is specialised for access to geographical information. Major components include the user interface, a geographical ontology, maintenance and retrieval functions for a test collection of web documents, textual and spatial indexes, relevance ranking and metadata extraction. Here we summarise the functionality and interaction between these components before focusing on the design of the geo-ontology and the development of spatio-textual indexing methods. The geo-ontology supports functionality for disambiguation, query expansion, relevance ranking and metadata extraction. Geographical place names are accompanied by multiple geometric footprints and qualitative spatial relationships. Spatial indexing of documents has been integrated with text indexing through the use of spatio-textual keys in which terms are concatenated with spatial cells to which they relate. Preliminary experiments demonstrate considerable performance benefits when compared with pure text indexing and with text indexing followed by a spatial filtering stage.
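Spatio-textual keys are straightforward to form: concatenate each index term with the identifier of each spatial cell that the document's geographic footprint touches, so a single lookup answers "term t near place p". A minimal sketch assuming a flat latitude/longitude grid, which is an illustrative stand-in for SPIRIT's spatial decomposition:

```python
from collections import defaultdict

def cell_id(lat, lon, cell_deg=1.0):
    """Flat lat/lon grid cell identifier (an assumed decomposition)."""
    return f"{int(lat // cell_deg)}:{int(lon // cell_deg)}"

index = defaultdict(set)  # spatio-textual key -> doc ids

def add(doc_id, terms, footprint_cells):
    # one key per (term, cell) pair: the term concatenated with the cell
    for t in terms:
        for c in footprint_cells:
            index[f"{t}|{c}"].add(doc_id)

def query(term, lat, lon):
    return index.get(f"{term}|{cell_id(lat, lon)}", set())

add("doc1", ["castle", "hotel"], [cell_id(51.5, -0.1)])
print(query("castle", 51.5, -0.1))  # {'doc1'}
```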

Journal ArticleDOI
TL;DR: The Squid peer-to-peer information discovery system supports flexible queries using partial keywords, wildcards, and ranges and is a dimension-reducing indexing scheme that effectively maps multidimensional information space to physical peers.
Abstract: The Squid peer-to-peer information discovery system supports flexible queries using partial keywords, wildcards, and ranges. It is built on a structured overlay and uses data lookup protocols to guarantee that all existing data elements that match a query are found efficiently. Its main innovation is a dimension-reducing indexing scheme that effectively maps multidimensional information space to physical peers.

Journal ArticleDOI
TL;DR: Experimental results of the application of the segmentation algorithm to known sequences demonstrate the efficiency of the proposed segmentation approach and reveal the potential of employing this segmentation algorithm as part of an object-based video indexing and retrieval scheme.
Abstract: In this paper, a novel algorithm is presented for the real-time, compressed-domain, unsupervised segmentation of image sequences and is applied to video indexing and retrieval. The segmentation algorithm uses motion and color information directly extracted from the MPEG-2 compressed stream. An iterative rejection scheme based on the bilinear motion model is used to effect foreground/background segmentation. Following that, meaningful foreground spatiotemporal objects are formed by initially examining the temporal consistency of the output of iterative rejection, clustering the resulting foreground macroblocks to connected regions and finally performing region tracking. Background segmentation to spatiotemporal objects is additionally performed. MPEG-7 compliant low-level descriptors describing the color, shape, position, and motion of the resulting spatiotemporal objects are extracted and are automatically mapped to appropriate intermediate-level descriptors forming a simple vocabulary termed object ontology. This, combined with a relevance feedback mechanism, allows the qualitative definition of the high-level concepts the user queries for (semantic objects, each represented by a keyword) and the retrieval of relevant video segments. Desired spatial and temporal relationships between the objects in multiple-keyword queries can also be expressed, using the shot ontology. Experimental results of the application of the segmentation algorithm to known sequences demonstrate the efficiency of the proposed segmentation approach. Sample queries reveal the potential of employing this segmentation algorithm as part of an object-based video indexing and retrieval scheme.

Journal ArticleDOI
TL;DR: Improved methods for indexing diffraction patterns from macromolecular crystals are presented and include a more robust way to verify the position of the incident X-ray beam on the detector.
Abstract: Improved methods for indexing diffraction patterns from macromolecular crystals are presented. The novel procedures include a more robust way to verify the position of the incident X-ray beam on the detector, an algorithm to verify that the deduced lattice basis is consistent with the observations, and an alternative approach to identify the metric symmetry of the lattice. These methods help to correct failures commonly experienced during indexing, and increase the overall success rate of the process. Rapid indexing, without the need for visual inspection, will play an important role as beamlines at synchrotron sources prepare for high-throughput automation.

Journal ArticleDOI
TL;DR: This paper proposes a novel framework, called ClassView, that makes some advances toward more efficient video database indexing and access, including a hierarchical semantics-sensitive video classifier to shorten the semantic gap.
Abstract: Recent advances in digital video compression and networks have made video more accessible than ever. However, the existing content-based video retrieval systems still suffer from the following problems. 1) Semantics-sensitive video classification problem because of the semantic gap between low-level visual features and high-level semantic visual concepts; 2) Integrated video access problem because of the lack of efficient video database indexing, automatic video annotation, and concept-oriented summary organization techniques. In this paper, we have proposed a novel framework, called ClassView, to make some advances toward more efficient video database indexing and access. 1) A hierarchical semantics-sensitive video classifier is proposed to shorten the semantic gap. The hierarchical tree structure of the semantics-sensitive video classifier is derived from the domain-dependent concept hierarchy of video contents in a database. Relevance analysis is used for selecting the discriminating visual features with suitable importances. The Expectation-Maximization (EM) algorithm is also used to determine the classification rule for each visual concept node in the classifier. 2) A hierarchical video database indexing and summary presentation technique is proposed to support more effective video access over a large-scale database. The hierarchical tree structure of our video database indexing scheme is determined by the domain-dependent concept hierarchy which is also used for video classification. The presentation of visual summary is also integrated with the inherent hierarchical video database indexing tree structure. Integrating video access with efficient database indexing tree structure has provided great opportunity for supporting more powerful video search engines.

Journal ArticleDOI
01 Jan 2004
TL;DR: An automatic mechanism for selecting appropriate concepts that both describe and identify documents as well as language employed in user requests is described, and a scalable disambiguation algorithm that prunes irrelevant concepts and allows relevant ones to associate with documents and participate in query generation is proposed.
Abstract: Technology in the field of digital media generates huge amounts of nontextual information, audio, video, and images, along with more familiar textual information. The potential for exchange and retrieval of information is vast and daunting. The key problem in achieving efficient and user-friendly retrieval is the development of a search mechanism to guarantee delivery of minimal irrelevant information (high precision) while ensuring relevant information is not overlooked (high recall). The traditional solution employs keyword-based search. The only documents retrieved are those containing user-specified keywords. But many documents convey desired semantic information without containing these keywords. This limitation is frequently addressed through query expansion mechanisms based on the statistical co-occurrence of terms. Recall is increased, but at the expense of deteriorating precision. One can overcome this problem by indexing documents according to context and meaning rather than keywords, although this requires a method of converting words to meanings and the creation of a meaning-based index structure. We have solved the problem of an index structure through the design and implementation of a concept-based model using domain-dependent ontologies. An ontology is a collection of concepts and their interrelationships that provide an abstract view of an application domain. With regard to converting words to meaning, the key issue is to identify appropriate concepts that both describe and identify documents as well as language employed in user requests. This paper describes an automatic mechanism for selecting these concepts. An important novelty is a scalable disambiguation algorithm that prunes irrelevant concepts and allows relevant ones to associate with documents and participate in query generation. We also propose an automatic query expansion mechanism that deals with user requests expressed in natural language. This mechanism generates database queries with appropriate and relevant expansion through knowledge encoded in ontology form. Focusing on audio data, we have constructed a demonstration prototype. We have experimentally and analytically shown that our model, compared to keyword search, achieves a significantly higher degree of precision and recall. The techniques employed can be applied to the problem of information selection in all media types.

Patent
02 Oct 2004
TL;DR: In this article, an index database is created comprising word occurrences and table relationship information, which allows keyword searches in both structured and unstructured databases, and across multiple databases of different vendors.
Abstract: This invention allows keyword searches in both structured and unstructured databases, and across multiple databases of different vendors. An index database is created comprising word occurrences and table relationship information. In the case of unstructured databases with no predetermined schema, the relationship between different tables is derived through propagative n-level indexing and data is then populated in index tables. A database adapter enables indexing and searching across multi-vendor databases, which resolves discrepancies across different database access methodologies. Given a set of keyword inputs, the rows containing the search words and all the related rows are searched using word occurrences and relationship information.

Journal ArticleDOI
TL;DR: This article discusses how fuzzy set theory can be effectively used for this purpose and describes an image retrieval system called FIRST (fuzzy image retrieval system) which incorporates many of these ideas.
Abstract: A typical content-based image retrieval (CBIR) system would need to handle the vagueness in the user queries as well as the inherent uncertainty in image representation, similarity measure, and relevance feedback. We discuss how fuzzy set theory can be effectively used for this purpose and describe an image retrieval system called FIRST (fuzzy image retrieval system) which incorporates many of these ideas. FIRST can handle exemplar-based, graphical-sketch-based, as well as linguistic queries involving region labels, attributes, and spatial relations. FIRST uses fuzzy attributed relational graphs (FARGs) to represent images, where each node in the graph represents an image region and each edge represents a relation between two regions. The given query is converted to a FARG, and a low-complexity fuzzy graph matching algorithm is used to compare the query graph with the FARGs in the database. The use of an indexing scheme based on a leader clustering algorithm avoids an exhaustive search of the FARG database. We quantify the retrieval performance of the system in terms of several standard measures.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: Parsimonious language models explicitly address the relation between levels of language models that are typically used for smoothing, and need fewer (non-zero) parameters to describe the data.
Abstract: We systematically investigate a new approach to estimating the parameters of language models for information retrieval, called parsimonious language models. Parsimonious language models explicitly address the relation between levels of language models that are typically used for smoothing. As such, they need fewer (non-zero) parameters to describe the data. We apply parsimonious models at three stages of the retrieval process: 1) at indexing time; 2) at search time; 3) at feedback time. Experimental results show that we are able to build models that are significantly smaller than standard models, but that still perform at least as well as the standard approaches.
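Parsimonious estimation is a small EM loop: attribute each term occurrence either to the document model or to a fixed background (collection) model, re-estimate only the document model, and prune terms whose probability collapses toward zero. A minimal sketch following that recipe; the mixing weight lambda and the pruning threshold are assumed parameters.

```python
def parsimonious_lm(tf, background, lam=0.5, iters=20, eps=1e-4):
    """tf: term -> count in the document; background: term -> P(t|C).
    Returns a document model keeping only terms the background model
    cannot already explain, hence fewer non-zero parameters."""
    total = sum(tf.values())
    p = {t: c / total for t, c in tf.items()}  # initialize with the MLE
    for _ in range(iters):
        # E-step: expected counts attributed to the document model
        e = {t: tf[t] * lam * p[t] / (lam * p[t] + (1 - lam) * background[t])
             for t in p}
        # M-step: renormalize, pruning terms that fall below eps
        z = sum(e.values())
        p = {t: v / z for t, v in e.items() if v / z >= eps}
    return p

tf = {"the": 10, "retrieval": 4, "parsimonious": 2}
bg = {"the": 0.05, "retrieval": 0.001, "parsimonious": 0.00001}
print(parsimonious_lm(tf, bg))  # probability mass shifts away from 'the'
```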

Journal ArticleDOI
TL;DR: A novel framework is proposed to make some advances toward solving the challenging problems of semantic gap, semantic video concept modeling, semantic video classification, and concept-oriented video database indexing and access.
Abstract: Digital video now plays an important role in medical education, health care, telemedicine and other medical applications. Several content-based video retrieval (CBVR) systems have been proposed in the past, but they still suffer from the following challenging problems: semantic gap, semantic video concept modeling, semantic video classification, and concept-oriented video database indexing and access. In this paper, we propose a novel framework to make some advances toward the final goal to solve these problems. Specifically, the framework includes: 1) a semantic-sensitive video content representation framework by using principal video shots to enhance the quality of features; 2) semantic video concept interpretation by using flexible mixture model to bridge the semantic gap; 3) a novel semantic video-classifier training framework by integrating feature selection, parameter estimation, and model selection seamlessly in a single algorithm; and 4) a concept-oriented video database organization technique through a certain domain-dependent concept hierarchy to enable semantic-sensitive video retrieval and browsing.