
Showing papers on "Inverted index published in 2004"


Proceedings ArticleDOI
13 Jun 2004
TL;DR: This paper presents an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures such as intersect size, Jaccard-coefficient, cosine similarity, and edit-distance, with optimizations that generalize to several weighted and unweighted measures of partial word overlap between sets.
Abstract: In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as, set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
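
As a rough illustration of the basic inverted-index probing strategy the abstract starts from (not the paper's optimized algorithm or its partitioning scheme), the following sketch indexes record tokens, probes the index to collect candidate pairs, prunes with a simple overlap bound implied by a Jaccard threshold, and verifies the survivors. The records and the threshold are illustrative.

```python
from collections import defaultdict

def jaccard(a, b):
    """Exact Jaccard coefficient of two token sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

def set_similarity_join(records, threshold):
    """Report all pairs (i, j), i < j, with Jaccard(records[i], records[j]) >= threshold.

    A basic inverted-index probing strategy: each record probes the index with its
    tokens to collect candidates sharing at least one token, then candidates are
    verified with the exact similarity.
    """
    index = defaultdict(list)              # token -> ids of records indexed so far
    results = []
    for i, r in enumerate(records):
        overlap = defaultdict(int)         # candidate id -> number of shared tokens
        for tok in r:
            for j in index[tok]:
                overlap[j] += 1
        for j, common in overlap.items():
            s = records[j]
            # Cheap pruning: Jaccard >= t implies |r & s| >= t/(1+t) * (|r| + |s|)
            if common >= threshold / (1 + threshold) * (len(r) + len(s)):
                if jaccard(r, s) >= threshold:
                    results.append((j, i))
        for tok in r:
            index[tok].append(i)
    return results

pairs = set_similarity_join(
    [{"inverted", "index", "join"}, {"inverted", "index", "probe"}, {"set", "join"}],
    threshold=0.4,
)
print(pairs)   # [(0, 1)]
```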

376 citations


Journal ArticleDOI
TL;DR: An image retrieval framework is introduced that integrates region-based representation, efficient in terms of storage and complexity, with effective on-line learning capability; a region weighting strategy is also introduced to optimally weight the regions and enable the system to self-improve.
Abstract: An image retrieval framework that integrates efficient region-based representation in terms of storage and complexity and effective on-line learning capability is proposed. The framework consists of methods for region-based image representation and comparison, indexing using modified inverted files, relevance feedback, and learning region weighting. By exploiting a vector quantization method, both compact and sparse (vector) region-based image representations are achieved. Using the compact representation, an indexing scheme similar to the inverted file technology and an image similarity measure based on Earth Mover's Distance are presented. Moreover, the vector representation facilitates a weighted query point movement algorithm and the compact representation enables a classification-based algorithm for relevance feedback. Based on users' feedback information, a region weighting strategy is also introduced to optimally weight the regions and enable the system to self-improve. Experimental results on a database of 10 000 general-purposed images demonstrate the efficiency and effectiveness of the proposed framework.

203 citations


Proceedings ArticleDOI
13 Jun 2004
TL;DR: This work studies algorithmic issues in integrating structure indexes with inverted lists for the evaluation of CAS queries consisting of simple path expressions, ranking all documents that match the query and returning the top k documents in order of relevance.
Abstract: Several methods have been proposed to evaluate queries over a native XML DBMS, where the queries specify both path and keyword constraints. These broadly consist of graph traversal approaches, optimized with auxiliary structures known as structure indexes, and approaches based on information-retrieval style inverted lists. We propose a strategy that combines the two forms of auxiliary indexes, and a query evaluation algorithm for branching path expressions based on this strategy. Our technique is general and applicable for a wide range of choices of structure indexes and inverted list join algorithms. Our experiments over the Niagara XML DBMS show the benefit of integrating the two forms of indexes. We also consider algorithmic issues in evaluating path expression queries when the notion of relevance ranking is incorporated. By integrating the above techniques with the Threshold Algorithm proposed by Fagin et al., we obtain instance optimal algorithms to push down top-k computation.
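
The top-k step the abstract attributes to Fagin et al.'s Threshold Algorithm can be illustrated with a minimal, self-contained sketch (not the paper's integrated path/inverted-list evaluator); the score-sorted lists and the summed scoring function are assumptions for the example.

```python
import heapq

def threshold_algorithm_topk(lists, k):
    """Fagin-style Threshold Algorithm over score-sorted lists (minimal sketch).

    lists: one list per query predicate, each a list of (doc, score) pairs sorted
           by descending score; a document's overall score is the sum of its
           per-list scores (0 if it is absent from a list).
    Returns the k (doc, score) pairs with the highest overall score.
    """
    lookup = [dict(l) for l in lists]        # random access: doc -> score per list
    seen, top = set(), []                    # top is a min-heap of (score, doc)
    depth = 0
    while True:
        threshold, exhausted = 0.0, True
        for l in lists:
            if depth < len(l):
                exhausted = False
                doc, score = l[depth]
                threshold += score           # best total a yet-unseen doc could reach
                if doc not in seen:
                    seen.add(doc)
                    total = sum(d.get(doc, 0.0) for d in lookup)
                    heapq.heappush(top, (total, doc))
                    if len(top) > k:
                        heapq.heappop(top)
        depth += 1
        if exhausted or (len(top) == k and top[0][0] >= threshold):
            break
    return sorted(((doc, s) for s, doc in top), key=lambda x: -x[1])

lists = [[("d3", 0.9), ("d1", 0.8), ("d2", 0.1)],
         [("d1", 0.7), ("d2", 0.6), ("d3", 0.2)]]
print(threshold_algorithm_topk(lists, k=2))   # d1 (1.5) then d3 (1.1)
```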

185 citations


Patent
Bill Chau1
30 Dec 2004
TL;DR: In this article, a method, apparatus, and computer readable medium for searching and navigating a document database is provided, where each document in a database is assigned to one of the document categories and metadata is associated with each electronic document that includes the numeric category identifier corresponding to the category assigned to the document.
Abstract: A method, apparatus, and computer readable medium for searching and navigating a document database is provided. Document categories are assigned unique numeric category identifiers. Each document in a database is assigned to one of the document categories. Metadata is associated with each electronic document that includes the numeric category identifier corresponding to the category assigned to the document. The database may be searched or browsed based on category by utilizing the metadata. URLs may also be embedded in a Web page that includes a list of document identifiers and an index. The list of document identifiers is a list containing the identities of an arbitrary number of search results. The index identifies one of the documents in the list of document identifiers to be retrieved. When such a URL is selected, a Web server computer utilizes the list of document identifiers and the index to identify the document to be returned.

135 citations


Patent
Georges R. Harik1
30 Jun 2004
TL;DR: In this article, an inverted index of advertiser Web pages is used to search for matching advertisers, and therefore matching ads, without requiring the advertiser to enter and/or maintain certain targeting information, such as keyword targeting.
Abstract: Advertisers are permitted to put targeted ads on, or to serve ads in association with, various content such as search results pages, Web pages, e-mail, etc., without requiring the advertiser to enter and/or maintain certain targeting information, such as keyword targeting. This may be accomplished by using a searchable data structure, such as an inverted index for example, of available advertiser Web information. The advertiser Web information may include terms and/or phrases extracted from the advertiser's Website. In particular, a search query may be used to search for matching advertisers, and therefore matching ads. For example, the search query can be used to search an inverted index including words and/or phrases extracted from advertiser Websites. The advertiser Web page, or some other identifier, can be used as a key to search for an associated ad.

117 citations


Journal ArticleDOI
TL;DR: Combined use of a partial nextword, partial phrase, and conventional inverted index allows evaluation of phrase queries in a quarter the time required to evaluate such queries with an inverted file alone; the additional space overhead is only 26% of the size of the inverted file.
Abstract: Search engines need to evaluate queries extremely fast, a challenging task given the quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this article we consider how phrase queries can be efficiently supported with low disk overheads. Our previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. Alternatively, special-purpose phrase indexes can be used, but it is not feasible to index all phrases. We propose combinations of nextword indexes and phrase indexes with inverted files as a solution to this problem. Our experiments show that combined use of a partial nextword, partial phrase, and conventional inverted index allows evaluation of phrase queries in a quarter the time required to evaluate such queries with an inverted file alone; the additional space overhead is only 26% of the size of the inverted file.
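
A minimal sketch of the nextword idea, assuming a toy in-memory index keyed by (word, nextword) pairs rather than the paper's partial, on-disk structures: a phrase query is answered from pair postings instead of intersecting single-word postings.

```python
from collections import defaultdict

class NextwordIndex:
    """Minimal nextword-index sketch: for every (word, nextword) pair we store
    the documents and positions where the pair occurs, so phrase queries are
    evaluated by checking pair postings instead of single-word postings."""

    def __init__(self):
        self.pairs = defaultdict(list)        # (word, nextword) -> [(doc, position), ...]

    def add(self, doc_id, text):
        words = text.lower().split()
        for pos, (w, nxt) in enumerate(zip(words, words[1:])):
            self.pairs[(w, nxt)].append((doc_id, pos))

    def phrase_query(self, phrase):
        words = phrase.lower().split()
        if len(words) < 2:
            raise ValueError("single terms go to the ordinary inverted index")
        # Start from the postings of the first word pair and check that the
        # remaining pairs occur at consecutive positions in the same document.
        matches = []
        for doc, pos in self.pairs[(words[0], words[1])]:
            if all((doc, pos + i) in self.pairs[(words[i], words[i + 1])]
                   for i in range(1, len(words) - 1)):
                matches.append(doc)
        return sorted(set(matches))

idx = NextwordIndex()
idx.add(1, "the quick brown fox")
idx.add(2, "the brown quick fox")
print(idx.phrase_query("quick brown fox"))    # [1]
```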

109 citations


Patent
Helmut Koenig1
23 Jul 2004
TL;DR: A method is proposed for automatically indexing multimedia data archives and categorizing the files held therein, together with a client/server architecture in an image retrieval system for content-based searching for relevant files in a particular format and having a particular file structure.
Abstract: A method is provided for automatically indexing multimedia data archives and categorizing the files held therein, together with a client/server architecture in an image retrieval system for content-based searching for relevant files in a particular format and having a particular file structure. The parsed files stored in a data archive managed by the document management system are subjected to a feature extraction algorithm. The features obtained are then used for producing a binary-coded inverted index which includes elements of at least two attributes and holds context information which is held in these files and is needed for content-based image retrieval. If new files, or files with an extended or modified content, are stored in the data archive, a parsing algorithm and an algorithm for automatically extracting features of these files are executed under event control during every storage process, in order to extend the inverted index by individual attributes or to update particular elements of already existing index attributes.

93 citations


Patent
11 Mar 2004
TL;DR: In this article, the authors describe a method that receives an input containing at least one word identifiable with at least one keyword, identifies a node associated with that keyword, and jumps to the identified node without first traversing any other node.
Abstract: A method performed in a system having multiple navigable nodes interconnected in a hierarchical arrangement involves receiving an input containing at least one word identifiable with at least one keyword, identifying at least one node, other than the first node, not directly connected to the first node, but associated with the at least one keyword, and jumping to the identified node. A transaction processing system has a hierarchical arrangement of nodes and is configured for user navigation among the nodes. The system has an inverted index correlating keywords with at least some nodes in the arrangement so that when the user provides an input in response to a verbal description and the response includes a meaningful word correlatable with a keyword, the system will identify at least one node correlated to the meaningful word by the inverted index and jump to that node without first traversing any other node.
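
A minimal sketch of the keyword-to-node jump described above, with a hypothetical menu hierarchy and keywords (the node names and keywords are illustrative, not from the patent):

```python
# Hypothetical node hierarchy; each node lists children and associated keywords.
menu = {
    "root":    {"children": ["billing", "support"], "keywords": []},
    "billing": {"children": ["refunds"],            "keywords": ["invoice", "payment"]},
    "refunds": {"children": [],                     "keywords": ["refund"]},
    "support": {"children": [],                     "keywords": ["help", "agent"]},
}

# Build the inverted index: keyword -> nodes associated with it.
inverted = {}
for node, info in menu.items():
    for kw in info["keywords"]:
        inverted.setdefault(kw, []).append(node)

def jump(user_input, current="root"):
    """Return the node to jump to for the first recognised keyword, if any."""
    for word in user_input.lower().split():
        if word in inverted and inverted[word]:
            return inverted[word][0]   # jump directly, without traversing intermediate nodes
    return current                     # no keyword recognised: stay where we are

print(jump("I want a refund"))         # refunds
```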

88 citations


Book ChapterDOI
Yosi Mass1, Matan Mandelbrod1
06 Dec 2004
TL;DR: This paper shows an improvement to a component ranking algorithm by introducing a document pivot that compensates for missing terms statistics in small components and describes a general mechanism to apply known Query Refinement algorithms from traditional IR on top of this componentranking algorithm.
Abstract: Queries over XML documents challenge search engines to return the most relevant XML components that satisfy the query concepts. In a previous work we described a component ranking algorithm that performed relatively well in INEX'03. In this paper we show an improvement to that algorithm by introducing a document pivot that compensates for missing term statistics in small components. Using this new algorithm we achieved improvements of 30%-50% in the Mean Average Precision over the previous algorithm. We then describe a general mechanism to apply known Query Refinement algorithms from traditional IR on top of this component ranking algorithm and demonstrate an example of such an algorithm that achieved top results in INEX'04.

84 citations


Book ChapterDOI
Shuming Shi1, Guangwen Yang1, Dingxing Wang1, Jin Yu1, Shaogang Qu1, Ming Chen1 
26 Feb 2004
TL;DR: A new index partitioning and building scheme, multi-level partitioning (MLP), is proposed and its implementation on top of P2P networks is discussed, which can dramatically reduce bisection bandwidth consumption and end-user latency.
Abstract: This paper discusses large scale keyword searching on top of peer-to-peer (P2P) networks. The state-of-the-art keyword searching techniques for unstructured and structured P2P systems are query flooding and inverted list intersection respectively. However, it has been demonstrated that P2P-based large scale full-text searching is not feasible using either of the two techniques. We propose in this paper a new index partitioning and building scheme, multi-level partitioning (MLP), and discuss its implementation on top of P2P networks. MLP can dramatically reduce bisection bandwidth consumption and end-user latency compared with the partition-by-keyword scheme. Compared with partition-by-document, it needs to broadcast a query to only a moderate number of peers to generate precise results.

69 citations


Proceedings ArticleDOI
25 Jul 2004
TL;DR: This paper proposes and evaluates algorithms aimed at finding an assignment of the document identifiers which minimizes the average value of the d-gaps, thus enhancing the effectiveness of traditional compression methods.
Abstract: Web Search Engines provide a large-scale text document retrieval service by processing huge Inverted File indexes. Inverted File indexes allow fast query resolution and good memory utilization since their d-gaps representation can be effectively and efficiently compressed by using variable length encoding methods. This paper proposes and evaluates algorithms aimed at finding an assignment of the document identifiers which minimizes the average value of the d-gaps, thus enhancing the effectiveness of traditional compression methods. We ran several tests over the Google contest collection in order to validate the techniques proposed. The experiments demonstrated the scalability and effectiveness of our algorithms. Using the proposed algorithms, we were able to improve the compression ratios of several encoding schemes by up to 20.81%.
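
The d-gap representation and the effect of identifier assignment on compressibility can be illustrated with a short sketch using variable-byte encoding (one of the variable-length methods referred to above); the document-identifier values are made up.

```python
def vbyte_encode(gaps):
    """Variable-byte encode a list of positive integers (d-gaps)."""
    out = bytearray()
    for g in gaps:
        chunk = []
        while True:
            chunk.append(g & 0x7F)
            g >>= 7
            if not g:
                break
        chunk[0] |= 0x80                      # flag the final (least significant) byte
        out.extend(reversed(chunk))
    return bytes(out)

def posting_list_bytes(doc_ids):
    """Size in bytes of a sorted posting list stored as v-byte encoded d-gaps."""
    ids = sorted(doc_ids)
    gaps = [ids[0] + 1] + [b - a for a, b in zip(ids, ids[1:])]
    return len(vbyte_encode(gaps))

# The same posting list under two document-identifier assignments: when similar
# documents get close identifiers the d-gaps are small and encode in fewer bytes.
random_ids    = [7, 912, 4403, 90000, 250001]
clustered_ids = [7, 8, 9, 11, 12]
print(posting_list_bytes(random_ids), posting_list_bytes(clustered_ids))
```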

Patent
30 Sep 2004
TL;DR: In this article, search results of a search query on a network are ranked according to a scoring function that incorporates anchor text as a term; the function is adjusted so that a target document's ranking reflects the use of terms in the anchor text pointing to it.
Abstract: Search results of a search query on a network are ranked according to a scoring function that incorporates anchor text as a term. The scoring function is adjusted so that the ranking of a target document of anchor text reflects the use of terms in the anchor text. Initially, the properties associated with the anchor text are collected during a crawl of the network. A separate index is generated that includes an inverted list of the documents and the terms in the anchor text. The index is then consulted in response to a query to calculate a document's score. The score is then used to rank the documents and produce the query results.
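
A hedged sketch of the general idea of folding anchor text into scoring as a separately weighted field, assuming a toy two-field index and illustrative weights (the patent's actual scoring function is not specified here):

```python
from collections import defaultdict

# Hypothetical two-field index: page body text, and anchor text pointing at a page.
body_index   = defaultdict(lambda: defaultdict(int))   # term -> doc -> tf in body
anchor_index = defaultdict(lambda: defaultdict(int))   # term -> doc -> tf in anchor text

def add_page(doc, body_text, incoming_anchor_texts):
    for t in body_text.lower().split():
        body_index[t][doc] += 1
    for anchor in incoming_anchor_texts:
        for t in anchor.lower().split():
            anchor_index[t][doc] += 1

def score(query, w_body=1.0, w_anchor=2.0):
    """Per-document score summing body and anchor-text term frequencies, with
    anchor occurrences weighted separately (weights are purely illustrative)."""
    totals = defaultdict(float)
    for t in query.lower().split():
        for doc, tf in body_index[t].items():
            totals[doc] += w_body * tf
        for doc, tf in anchor_index[t].items():
            totals[doc] += w_anchor * tf
    return sorted(totals.items(), key=lambda x: -x[1])

add_page("a.html", "an introduction to indexes", ["inverted index tutorial"])
add_page("b.html", "inverted index internals", [])
print(score("inverted index"))    # a.html ranks first thanks to its anchor text
```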

Journal ArticleDOI
TL;DR: This study provides cluster-based retrieval (CBR) efficiency and effectiveness experiments using the largest corpus, in an environment that employs no user interaction or user-behavior assumptions for clustering, and confirms that the approach is scalable and that system performance improves with increasing database size.

01 Jan 2004
TL;DR: In this paper, a self-organising inverted index based on past queries is proposed, which improves query evaluation speed by 25%-40% over a conventional, optimised approach with almost indistinguishable accuracy.
Abstract: Search engines are an essential tool for modern life. We use them to discover new information on diverse topics and to locate a wide range of resources. The search process in all practical search engines is supported by an inverted index structure that stores all search terms and their locations within the searchable document collection. Inverted indexes are highly optimised, and significant work has been undertaken over the past fifteen years to store, retrieve, compress, and understand heuristics for these structures. In this paper, we propose a new self-organising inverted index based on past queries. We show that this access-ordered index improves query evaluation speed by 25%-40% over a conventional, optimised approach with almost indistinguishable accuracy. We conclude that access-ordered indexes are a valuable new tool to support fast and accurate web search.
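
A minimal sketch of access ordering, assuming per-document access counts from a query log and a fixed per-list scan limit standing in for the paper's actual pruning heuristics:

```python
from collections import defaultdict

def access_order(postings, access_counts):
    """Reorder each posting list so documents returned most often for past
    queries come first (ties broken by document id)."""
    return {
        term: sorted(plist, key=lambda p: (-access_counts.get(p[0], 0), p[0]))
        for term, plist in postings.items()
    }

def evaluate(query_terms, postings, limit):
    """Scan only the first `limit` postings of each list (early termination)."""
    scores = defaultdict(int)
    for t in query_terms:
        for doc, tf in postings.get(t, [])[:limit]:
            scores[doc] += tf
    return sorted(scores.items(), key=lambda x: -x[1])

postings = {"index": [(1, 3), (2, 1), (3, 2)], "search": [(2, 2), (3, 1)]}
past_accesses = {3: 40, 2: 9, 1: 1}                       # hypothetical query-log counts
reordered = access_order(postings, past_accesses)
print(evaluate(["index", "search"], reordered, limit=2))  # only the hot documents are scanned
```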

Proceedings ArticleDOI
14 Mar 2004
TL;DR: This paper describes the use of a lightweight clustering algorithm aimed at assigning the identifiers to documents in a way that minimizes the average values of d_gaps, and shows that the complexity of the algorithm is linear in space and time.
Abstract: Granting efficient access to the index is a key issue for the performance of Web Search Engines (WSEs). In order to enhance memory utilization and favor fast query resolution, WSEs use Inverted File (IF) indexes where the posting lists are stored as sequences of d_gaps (i.e. differences between successive document identifiers) compressed using variable length encoding methods. This paper describes the use of a lightweight clustering algorithm aimed at assigning the identifiers to documents in a way that minimizes the average value of the d_gaps. The simulations performed on a real dataset, i.e. the Google contest collection, show that our approach allows us to obtain an IF index which is, depending on the d_gap encoding chosen, up to 23% smaller than the one built over randomly assigned document identifiers. Moreover, we will show, both analytically and empirically, that the complexity of our algorithm is linear in space and time.
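
The identifier-reassignment idea can be illustrated with a greedy sketch that numbers similar documents consecutively; note that this toy version is quadratic, whereas the point of the paper's clustering algorithm is that it runs in linear space and time.

```python
def reassign_ids(doc_terms):
    """Greedy single-pass reassignment sketch: start from an arbitrary document
    and repeatedly give the next identifier to the unassigned document sharing
    the most terms with the previously numbered one, so similar documents end
    up with close identifiers and posting-list d_gaps shrink.

    doc_terms: dict mapping original doc id -> set of terms.
    Returns a dict mapping original id -> new id.
    """
    remaining = set(doc_terms)
    current = next(iter(remaining))
    order = [current]
    remaining.discard(current)
    while remaining:
        current = max(remaining,
                      key=lambda d: len(doc_terms[d] & doc_terms[order[-1]]))
        order.append(current)
        remaining.discard(current)
    return {doc: new_id for new_id, doc in enumerate(order)}

docs = {
    "a": {"inverted", "index", "gap"},
    "b": {"soccer", "match"},
    "c": {"inverted", "file", "gap"},
    "d": {"soccer", "league"},
}
print(reassign_ids(docs))   # similar documents receive consecutive identifiers
```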

Book ChapterDOI
Marcus Fontoura1, Eugene Shekita1, Jason Zien1, Sridhar Rajagopalan1, Andreas Neumann1
31 Aug 2004
TL;DR: This paper describes how the use of slightly outdated information from global analysis and a fast index construction algorithm based on radix sorting can be combined in a novel way to significantly speed up the index build process without sacrificing search quality.
Abstract: There has been a substantial amount of research on high-performance algorithms for constructing an inverted text index. However, constructing the inverted index in an intranet search engine is only the final step in a more complicated index build process. Among other things, this process requires an analysis of all the data being indexed to compute measures like PageRank. The time to perform this global analysis step is significant compared to the time to construct the inverted index, yet it has not received much attention in the research literature. In this paper, we describe how the use of slightly outdated information from global analysis and a fast index construction algorithm based on radix sorting can be combined in a novel way to significantly speed up the index build process without sacrificing search quality.
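
A small sketch of radix-sort-based index construction over integer (term_id, doc_id) postings; the two stable passes and the toy postings are illustrative, not the paper's actual pipeline.

```python
from collections import defaultdict

def radix_sort_pairs(pairs, key_bits=32, radix_bits=8):
    """Stable LSD radix sort of integer pairs by their first component."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for p in pairs:
            buckets[(p[0] >> shift) & mask].append(p)
        pairs = [p for b in buckets for p in b]
    return pairs

def build_index(postings):
    """Turn (term_id, doc_id) pairs into an inverted index with sorted doc ids,
    via two stable radix passes: first by doc_id, then by term_id."""
    by_doc  = radix_sort_pairs([(d, t) for t, d in postings])
    by_term = radix_sort_pairs([(t, d) for d, t in by_doc])
    index = defaultdict(list)
    for term_id, doc_id in by_term:
        index[term_id].append(doc_id)
    return index

# (term_id, doc_id) pairs emitted while parsing documents, in arrival order.
postings = [(42, 3), (7, 1), (42, 1), (7, 9), (1000, 2)]
print(dict(build_index(postings)))            # {7: [1, 9], 42: [1, 3], 1000: [2]}
```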

Patent
Zhong Su1, Yue Pan1, Li Ping Yang1
06 Apr 2004
TL;DR: In this paper, a method for storing an inverted index based on an inverted file is proposed, in which the index information related to the same index item is stored in continuous blocks and the index units in each index block are used only for storing index information related to that index item.
Abstract: The invention provides a method for storing an inverted index based on an inverted file, the method comprising: creating an inverted file in a storage medium for storing the inverted index, the inverted file including a plurality of fixed-size index blocks, each of them including a plurality of fixed-size index units, wherein each index unit is used to store one piece of index information; and sequentially storing the index information related to each index item into the created inverted file, wherein the index information related to the same index item is stored in continuous blocks and the index units in each index block are used only for storing index information related to the same index item. Since each index block is used only for storing index information related to the same index item, operations on the index information in one index block do not affect other index items; therefore, it is possible to update the index information in any index block online.
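
A minimal in-memory sketch of the blocked layout described above, with an illustrative block capacity; the real inverted file is an on-disk structure, and the block size here is an assumption.

```python
class BlockedInvertedFile:
    """Sketch of the layout: the file is an array of fixed-size index blocks, and
    each block holds index units for exactly one index item (term), so a term's
    postings can be extended online without touching other terms' blocks."""

    UNITS_PER_BLOCK = 4                     # illustrative block capacity

    def __init__(self):
        self.blocks = []                    # each block: {"term": ..., "units": [...]}
        self.term_blocks = {}               # term -> list of block numbers, in order

    def add_posting(self, term, posting):
        block_ids = self.term_blocks.setdefault(term, [])
        if not block_ids or len(self.blocks[block_ids[-1]]["units"]) == self.UNITS_PER_BLOCK:
            self.blocks.append({"term": term, "units": []})    # allocate a fresh block
            block_ids.append(len(self.blocks) - 1)
        self.blocks[block_ids[-1]]["units"].append(posting)

    def postings(self, term):
        return [u for b in self.term_blocks.get(term, []) for u in self.blocks[b]["units"]]

inv = BlockedInvertedFile()
for doc in range(6):
    inv.add_posting("index", doc)
inv.add_posting("search", 3)                # lives in its own block; "index" is unaffected
print(inv.postings("index"), inv.postings("search"))
```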

Book ChapterDOI
06 Dec 2004
TL;DR: It is argued that a more explicit definition of the INEX retrieval tasks is needed because it is seen that removing overlap from the result set decreases retrieval effectiveness for all metrics except the XML cumulative gain measure.
Abstract: We describe the INEX 2004 participation of the Informatics Institute of the University of Amsterdam. We completely revamped our XML retrieval system, now implemented as a mixture language model on top of a standard search engine. To speed up structural reasoning, we indexed the collection's structure in a separate database. Our main findings are as follows. First, we show that blind feedback improves retrieval effectiveness, but increases overlap. Second, we see that removing overlap from the result set decreases retrieval effectiveness for all metrics except the XML cumulative gain measure. Third, we show that ignoring the structural constraints gives good results if measured in terms of mean average precision; the structural constraints are, however, useful for achieving high initial precision. Finally, we provide a detailed analysis of the characteristics of one of our runs. Based on this analysis we argue that a more explicit definition of the INEX retrieval tasks is needed.

Patent
14 Sep 2004
TL;DR: In this paper, a method performed in connection with an arrangement of nodes representable as a graph, and an inverted index containing a correlation among keywords and nodes such that at least some nodes containing a given keyword are indexed to that given keyword, involves receiving a word, searching the inverted index to determine whether the word is a keyword and, if the word was a keyword, jumping to a node identified in inverted index as correlated to that keyword, otherwise, learning a meaning for the word based upon reaching a result node and applying at least one specified rule such that a new input containing
Abstract: A method performed in connection with an arrangement of nodes representable as a graph, and an inverted index containing a correlation among keywords and nodes such that at least some nodes containing a given keyword are indexed to that given keyword, involves receiving a word, searching the inverted index to determine whether the word is a keyword and, if the word is a keyword, jumping to a node identified in the inverted index as correlated to that keyword, otherwise, learning a meaning for the word based upon reaching a result node and applying at least one specified rule such that a new input containing the word can be received thereafter and the word from the new input will be treated according to the learned meaning.

Book ChapterDOI
05 Apr 2004
TL;DR: It is demonstrated that a clustered system will outperform a replicated system if a large number of query servers is used, mainly due to the reduction of the network load.
Abstract: We simulate different architectures of a distributed Information Retrieval system on a very large Web collection, in order to work out the optimal setting for a particular set of resources. We analyse the effectiveness of a distributed, replicated and clustered architecture using a variable number of workstations. A collection of approximately 94 million documents and 1 terabyte of text is used to test the performance of the different architectures. We show that in a purely distributed architecture, the brokers become the bottleneck due to the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck due to the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a large number of query servers is used, mainly due to the reduction of the network load.

Book ChapterDOI
01 Jan 2004
TL;DR: This chapter discusses Web search engine performance over time; it introduces a set of measures for evaluating search engines in a constantly changing environment, briefly discusses search engine architecture, models and characterizations of the growing and changing Web, and reviews a number of small experiments that demonstrate that search engines do not always cope satisfactorily with dynamic changes.
Abstract: This chapter discusses Web search engine performance over time. Unlike classical information retrieval systems, the Web is decentralized and dynamic, that is, new pages are added, others are moved and removed, while existing pages may undergo changes. The dynamic nature of Web pages should influence search engine results over time. A set of measures is introduced to evaluate search engine performance in this constantly changing environment. The chapter also briefly discusses search engine architecture, models and characterizations of the growing and changing Web, and reviews a number of small experiments that demonstrate that search engines do not always cope satisfactorily with dynamic changes.

Patent
28 Oct 2004
TL;DR: A method, system, and computer program product for managing data associated with a document stored in an electronic form can be found in this article, where the document can be a part of a file.
Abstract: A method, system, and computer program product for managing data associated with a document stored in an electronic form. The document can be a part of a file. Computer processed algorithms, user-operated computer graphics tools, or both can be used to derive data from or assign data to the document or the file. First data is derived from the document, second data is assigned to the document, or both. The first data, the second data, or both are organized as attributes of an object of a first computer database. At least one attribute is organized as a child object of the object. The at least one attribute is associated with a feature of the document. Optionally, an attribute is moved from the object of the first computer database to an object of a second computer database and an address of a location in a memory at which the object of the second computer database is stored is added as a new attribute of the object of the first computer database.

Book ChapterDOI
06 Dec 2004
TL;DR: In this paper, a new XML retrieval system prototype employing structural indices and a tf * idf weighting modification is presented, which emphasizes the tf part in weighting and allows overlap in run results to different degrees.
Abstract: In this paper, we present a new XML retrieval system prototype employing structural indices and a tf * idf weighting modification. We test retrieval methods that a) emphasize the tf part in weighting and b) allow overlap in run results to different degrees. It seems that increasing the overlap percentage leads to a better performance. Emphasizing the tf part enables us to increase exhaustiveness of the returned results.

Journal Article
TL;DR: This article proposes a multidimensional approach to term indexing providing efficient term retrieval and supporting regular expression queries, and introduces an improvement based on a new data structure, called BUB-forest, providing even more efficient term retrieval.
Abstract: The area of Information Retrieval deals with problems of storage and retrieval within a huge collection of text documents. In IR models, the semantics of a document is usually characterized using a set of terms. A need common to various IR models is efficient term retrieval provided via a term index. Existing approaches to term indexing, e.g. the inverted list, efficiently support only simple queries asking for a term occurrence. In practice, we would like to exploit some more sophisticated querying mechanisms, in particular queries based on regular expressions. In this article we propose a multidimensional approach to term indexing providing efficient term retrieval and supporting regular expression queries. Since the term lengths are usually different, we also introduce an improvement based on a new data structure, called BUB-forest, providing even more efficient term retrieval.

Patent
13 Oct 2004
TL;DR: In this paper, an inverted file storing method is proposed, where the index information about the same index term is stored in continuous index blocks and the index units in each index block store only index information about the same index term.
Abstract: The invention provides an inverted file storing method, including: creating an inverted file which includes many fixed-size index blocks, each of these index blocks including many fixed-size index units, each of which stores a piece of index information; and storing the index information about each index term into the created file, where the index information about the same index term is stored in continuous index blocks and the index units in each index block store only the index information about the same index term. Operating on one index block thus cannot affect other index terms; hence the index information in any index block can be updated online.

Book ChapterDOI
13 Dec 2004
TL;DR: This paper presents the implementation techniques for an intelligent Web image search engine that includes several components such as a crawler, a preprocessor, a semantic extractor, an indexer, a knowledge learner and a query engine.
Abstract: This paper presents our implementation techniques for an intelligent Web image search engine. A reference architecture of the system is provided and addressed in this paper. The system includes several components such as a crawler, a preprocessor, a semantic extractor, an indexer, a knowledge learner and a query engine. The crawler traverses Web sites in a multithreaded access model, and it can dynamically control its access load on a Web server based on the corresponding capacity of the local system. The preprocessor is used to clean and normalize the information resources downloaded from Web sites. In this process, stop-word removal and word stemming are applied to the raw resources. The semantic extractor derives Web image semantics by partitioning and combining the associated text. The indexer of the system creates and maintains inverted indices with a relational model. Our knowledge learner is designed to automatically acquire knowledge from users' query activities. Finally, the query engine delivers search results in two phases in order to mine out users' feedback.

Book ChapterDOI
30 Nov 2004
TL;DR: An image representation for objects and scenes, consisting of a configuration of viewpoint covariant regions and their descriptors, is described; it enables recognition to proceed successfully despite changes in scale, viewpoint, illumination and partial occlusion.
Abstract: We describe an image representation for objects and scenes consisting of a configuration of viewpoint covariant regions and their descriptors. This representation enables recognition to proceed successfully despite changes in scale, viewpoint, illumination and partial occlusion. Vector quantization of these descriptors then enables efficient matching on the scale of an entire feature film. We show two applications. The first is to efficient object retrieval where the technology of text retrieval, such as inverted file systems, can be employed at run time to return all shots containing the object in a manner, and with a speed, similar to a Google search for text. The object is specified by a user outlining it in an image, and the object is then delineated in the retrieved shots. The second application is to data mining. We obtain the principal objects, characters and scenes in a video by measuring the reoccurrence of these spatial configurations of viewpoint covariant regions. The applications are illustrated on two full length feature films.
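
A toy sketch of the visual-word pipeline implied above: quantize region descriptors to a small vocabulary of cluster centers and look frames up through an inverted file. The vocabulary, descriptors and frames below are made up, and real descriptors would be high-dimensional.

```python
from collections import defaultdict

def nearest_center(descriptor, centers):
    """Quantize a local descriptor to the id of its nearest cluster center."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(descriptor, centers[c])))

def build_visual_index(frames, centers):
    """frames: {frame_id: [descriptor, ...]}; returns visual word -> set of frames."""
    index = defaultdict(set)
    for frame_id, descriptors in frames.items():
        for d in descriptors:
            index[nearest_center(d, centers)].add(frame_id)
    return index

def query(region_descriptors, centers, index):
    """Rank frames by how many of the query region's visual words they contain."""
    hits = defaultdict(int)
    for d in region_descriptors:
        for frame_id in index[nearest_center(d, centers)]:
            hits[frame_id] += 1
    return sorted(hits.items(), key=lambda x: -x[1])

centers = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]            # toy visual vocabulary
frames = {"shot1": [(0.1, 0.1), (0.9, 1.0)], "shot2": [(0.1, 0.9)]}
index = build_visual_index(frames, centers)
print(query([(0.05, 0.05), (1.0, 0.9)], centers, index))  # shot1 matches both words
```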

Book ChapterDOI
21 Jul 2004
TL;DR: This work presents an indexing structure inspired by VA-file and Inverted file that does not need to determine the importance at indexing time in order to perform well and adapts to the importance of vector components at query processing time.
Abstract: Most indexing structures for high-dimensional vectors used in multimedia retrieval today rely on determining the importance of each vector component at indexing time in order to create the index. However for Histogram Intersection and other important distance measures this is not possible because the importance of vector components depends on the query. We present an indexing structure inspired by VA-file and Inverted file that does not need to determine the importance at indexing time in order to perform well. Instead, our structure adapts to the importance of vector components at query processing time.
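
A small sketch of query-time adaptation for histogram intersection: one inverted list per vector component, with the lists visited in order of the query's component weights. The budget-based cut-off is an illustrative stand-in for the paper's scheme, not its actual structure.

```python
from collections import defaultdict

def build_dimension_lists(vectors):
    """One inverted list per vector component: dim -> list of (vec_id, value)."""
    lists = defaultdict(list)
    for vid, vec in vectors.items():
        for dim, value in enumerate(vec):
            if value:
                lists[dim].append((vid, value))
    return lists

def histogram_intersection_search(query, lists, budget):
    """Score by histogram intersection, sum_i min(q_i, x_i), visiting at most
    `budget` dimensions chosen by their weight in *this* query (the importance
    of a component is only known at query time)."""
    order = sorted(range(len(query)), key=lambda d: -query[d])[:budget]
    scores = defaultdict(float)
    for dim in order:
        for vid, value in lists.get(dim, []):
            scores[vid] += min(query[dim], value)
    return sorted(scores.items(), key=lambda x: -x[1])

vectors = {"img1": [0.6, 0.3, 0.1], "img2": [0.1, 0.1, 0.8]}
lists = build_dimension_lists(vectors)
print(histogram_intersection_search([0.7, 0.2, 0.1], lists, budget=2))
```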

Book ChapterDOI
01 Jan 2004
TL;DR: The state of the art of the main component of text retrieval systems: the search engine is presented and the main observation is that simpler ideas are better in practice.
Abstract: We present the state of the art of the main component of text retrieval systems: the search engine. We outline the main lines of research and issues involved. We survey the relevant techniques in use today for text searching and explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice.

Book ChapterDOI
14 Apr 2004
TL;DR: A hybrid query-dependent duplicate detection method is proposed that combines the advantages of both offline and online methods, providing not only an effective but also a scalable solution for duplicate detection.
Abstract: Duplication of Web pages greatly hurts the perceived relevance of a search engine. Existing methods for detecting duplicated Web pages can be classified into two categories, i.e. offline and online methods. The offline methods aim to detect all duplicates in a large set of Web pages, but none of the reported methods is capable of processing more than 30 million Web pages, which is about 1% of the pages indexed by today's commercial search engines. In contrast, the online methods focus on removing duplicated pages from the search results at run time. Although the number of pages to be processed is smaller, these methods can heavily increase the response time of search engines. Our experiments on real query logs show that there is a significant difference between popular and unpopular queries in terms of query number and duplicate distributions. We therefore propose a hybrid query-dependent duplicate detection method which combines the advantages of both offline and online methods. This hybrid method provides not only an effective but also a scalable solution for duplicate detection.
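
A hedged sketch of the online side of duplicate removal, using a simple shingle-hash fingerprint and a Jaccard-style overlap test on result texts; the fingerprinting details and the threshold are assumptions, not the paper's method.

```python
import hashlib
from itertools import islice

def shingle_fingerprint(text, shingle_size=4, keep=8):
    """Small fingerprint: hash every `shingle_size`-word window and keep the
    `keep` smallest hashes.  Near-duplicate pages share most fingerprint hashes."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return set(islice(hashes, keep))

def remove_duplicates(results, threshold=0.75):
    """Online post-filtering of a result list: drop any page whose fingerprint
    overlaps an already-kept page's fingerprint above the threshold."""
    kept, fingerprints = [], []
    for url, text in results:
        fp = shingle_fingerprint(text)
        if all(len(fp & other) / max(len(fp | other), 1) < threshold
               for other in fingerprints):
            kept.append(url)
            fingerprints.append(fp)
    return kept

results = [
    ("a.com/page", "inverted index construction for large web collections explained step by step"),
    ("mirror.org/page", "inverted index construction for large web collections explained step by step"),
    ("b.com/other", "an unrelated article about football results"),
]
print(remove_duplicates(results))   # the mirrored copy is dropped
```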