
Showing papers on "Locality-sensitive hashing published in 2007"


Proceedings Article
Qin Lv1, William Josephson1, Zhe Wang1, Moses Charikar1, Kai Li1 
23 Sep 2007
TL;DR: This paper proposes a new indexing scheme called multi-probe LSH, built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table to achieve the same search quality.
Abstract: Similarity indices for high-dimensional data are very desirable for building content-based search systems for feature-rich data such as audio, images, videos, and other sensor data. Recently, locality sensitive hashing (LSH) and its variations have been proposed as indexing techniques for approximate similarity search. A significant drawback of these approaches is the requirement for a large number of hash tables in order to achieve good search quality. This paper proposes a new indexing scheme called multi-probe LSH that overcomes this drawback. Multi-probe LSH is built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table. Our method is inspired by and improves upon recent theoretical work on entropy-based LSH designed to reduce the space requirement of the basic LSH method. We have implemented the multi-probe LSH method and evaluated the implementation with two different high-dimensional datasets. Our evaluation shows that the multi-probe LSH method substantially improves upon previously proposed methods in both space and time efficiency. To achieve the same search quality, multi-probe LSH has time efficiency similar to the basic LSH method while reducing the number of hash tables by an order of magnitude. In comparison with the entropy-based LSH method, to achieve the same search quality, multi-probe LSH uses less query time and 5 to 8 times fewer hash tables.
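
To make the probing idea concrete, here is a minimal, hedged sketch (not the authors' implementation) of an E2LSH-style table with a simple multi-probe query: the hash key is a vector of quantized random projections, and the query additionally probes buckets obtained by perturbing individual key components by +/-1. The class name, parameters, and the naive probing order are illustrative assumptions; the paper derives a more careful probing sequence.

```python
import numpy as np

class MultiProbeLSH:
    """Illustrative sketch: one E2LSH-style hash table with naive multi-probing."""

    def __init__(self, dim, k=4, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(k, dim))   # random projection directions
        self.b = rng.uniform(0, w, size=k)   # random offsets
        self.w = w
        self.table = {}                      # bucket key -> list of point ids

    def _key(self, x):
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))

    def insert(self, idx, x):
        self.table.setdefault(self._key(x), []).append(idx)

    def query(self, q, n_probes=8):
        base = np.array(self._key(q))
        probes = [tuple(base)]
        # naive probing sequence: perturb one key component at a time by +/-1,
        # i.e. look into the neighboring buckets a near point most likely fell into
        for i in range(len(base)):
            for d in (+1, -1):
                p = base.copy()
                p[i] += d
                probes.append(tuple(p))
        candidates = []
        for key in probes[:n_probes]:
            candidates.extend(self.table.get(key, []))
        return candidates
```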

801 citations


Proceedings ArticleDOI
09 Jul 2007
TL;DR: Two novel schemes for near duplicate image and video-shot detection based on global hierarchical colour histograms, using Locality Sensitive Hashing for fast retrieval and local feature descriptors, are proposed and compared.
Abstract: This paper proposes and compares two novel schemes for near duplicate image and video-shot detection. The first approach is based on global hierarchical colour histograms, using Locality Sensitive Hashing for fast retrieval. The second approach uses local feature descriptors (SIFT) and for retrieval exploits techniques used in the information retrieval community to compute approximate set intersections between documents using a min-Hash algorithm. The requirements for near-duplicate images vary according to the application, and we address two types of near duplicate definition: (i) being perceptually identical (e.g. up to noise, discretization effects, small photometric distortions etc); and (ii) being images of the same 3D scene (so allowing for viewpoint changes and partial occlusion). We define two shots to be near-duplicates if they share a large percentage of near-duplicate frames. We focus primarily on scalability to very large image and video databases, where fast query processing is necessary. Both methods are designed so that only a small amount of data need be stored for each image. In the case of near-duplicate shot detection it is shown that a weak approximation to histogram matching, consuming substantially less storage, is sufficient for good results. We demonstrate our methods on the TRECVID 2006 data set which contains approximately 165 hours of video (about 17.8M frames with 146K key frames), and also on feature films and pop videos.
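
The min-Hash side of the second approach can be illustrated with a toy sketch (an assumption-level illustration, not the paper's code): each image or frame is reduced to a set of quantized descriptors, and the fraction of agreeing min-hash values estimates the Jaccard overlap used to flag near-duplicates. The hash parameters and the example word sets below are made up for the illustration.

```python
import random

def make_minhash(num_hashes=64, universe=2**31 - 1, seed=0):
    """Return a function mapping a set of integer 'visual words' to a min-hash signature."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, universe), rng.randrange(universe)) for _ in range(num_hashes)]

    def signature(word_set):
        return [min((a * w + b) % universe for w in word_set) for a, b in params]

    return signature

signature = make_minhash()
sig1 = signature({3, 17, 256, 1024})        # visual words of frame 1 (toy data)
sig2 = signature({3, 17, 256, 2048})        # visual words of frame 2 (toy data)
# fraction of agreeing components estimates the Jaccard similarity of the two sets
estimated_jaccard = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```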

237 citations


Journal ArticleDOI
TL;DR: This paper proposes a new approach for near-duplicate keyframe (NDK) identification by matching, filtering and learning of local interest points (LIPs) with PCA-SIFT descriptors and proposes a one-to-one symmetric matching (OOS) algorithm found to be highly reliable for NDK identification.
Abstract: This paper proposes a new approach for near-duplicate keyframe (NDK) identification by matching, filtering and learning of local interest points (LIPs) with PCA-SIFT descriptors. The issues in matching reliability, filtering efficiency and learning flexibility are novelly exploited to delve into the potential of LIP-based retrieval and detection. In matching, we propose a one-to-one symmetric matching (OOS) algorithm which is found to be highly reliable for NDK identification, due to its capability in excluding false LIP matches compared with other matching strategies. For rapid filtering, we address two issues: speed efficiency and search effectiveness, to support OOS with a new index structure called LIP-IS. By exploring the properties of PCA-SIFT, the filtering capability and speed of LIP-IS are asymptotically estimated and compared to locality sensitive hashing (LSH). Owing to the robustness consideration, the matching of LIPs across keyframes forms vivid patterns that are utilized for discriminative learning and detection with support vector machines. Experimental results on the TRECVID-2003 corpus show that our proposed approach outperforms other popular methods including the techniques with LSH in terms of retrieval and detection effectiveness. In addition, the proposed LIP-IS successfully speeds up OOS by more than ten times and possesses several favorable properties compared to LSH.
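
As a rough sketch of the matching step, one-to-one symmetric matching amounts to keeping only mutual nearest neighbors between the two descriptor sets; the brute-force version below is illustrative only (the paper filters candidates with LIP-IS rather than computing all pairwise distances).

```python
import numpy as np

def oos_matches(desc_a, desc_b):
    """Mutual (one-to-one symmetric) nearest-neighbor matches between two descriptor sets.

    desc_a: (na, d) array of PCA-SIFT-like descriptors from keyframe A
    desc_b: (nb, d) array of descriptors from keyframe B
    """
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)   # nearest descriptor in B for each descriptor in A
    nn_ba = d.argmin(axis=0)   # nearest descriptor in A for each descriptor in B
    # keep a pair only if each point is the other's nearest neighbor
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```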

176 citations


Proceedings ArticleDOI
29 Sep 2007
TL;DR: This paper presents a search-based solution for scalable music recommendations, in which a music piece is first transformed into a signature sequence whose signatures characterize the timbre of local music clips and are then indexed for similarity search using locality sensitive hashing (LSH).
Abstract: The growth of music resources on personal devices and Internet radio has increased the need for music recommendations. In this paper, aiming at providing an efficient and general solution, we present a search-based solution for scalable music recommendations. In this solution a music piece is first transformed into a music signature sequence in which each signature characterizes the timbre of a local music clip. Based on such signatures, a scale-sensitive method is then proposed to index the music pieces for similarity search using locality sensitive hashing (LSH). The scale-sensitive method can numerically find the appropriate parameters for indexing various scales of music collections, and thus can guarantee that a proper number of nearest neighbors is found in search. In the recommendation stage, representative signatures from snippets of a seed piece are extracted as query terms to retrieve pieces with similar melodies for suggestions. We also design a relevance-ranking function to sort the search results, based on criteria that include matching ratio, temporal order, term weight, and matching confidence. Finally, with the search results, we propose a strategy to generate a dynamic playlist which can automatically expand with time. Evaluations on several music collections at various scales showed that our approach achieves encouraging results in terms of recommendation satisfaction and system scalability.

165 citations


Journal ArticleDOI
TL;DR: This paper proposes a fast approximation algorithm for the single linkage method that reduces the time complexity to O(nB) by rapidly finding the near clusters to be connected using Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search.
Abstract: The single linkage method is a fundamental agglomerative hierarchical clustering algorithm. This algorithm regards each point as a single cluster initially. In the agglomeration step, it connects a pair of clusters such that the distance between the nearest members is the shortest. This step is repeated until only one cluster remains. The single linkage method can efficiently detect clusters in arbitrary shapes. However, a drawback of this method is a large time complexity of O(n^2), where n represents the number of data points. This time complexity makes this method infeasible for large data. This paper proposes a fast approximation algorithm for the single linkage method. Our algorithm reduces the time complexity to O(nB) by rapidly finding the near clusters to be connected by Locality-Sensitive Hashing, a fast algorithm for the approximate nearest neighbor search. Here, B represents the maximum number of points going into a single hash entry and it practically diminishes to a small constant as compared to n for sufficiently large hash tables. Experimentally, we show that (1) the proposed algorithm obtains clustering results similar to those obtained by the single linkage method and (2) it runs faster for large data than the single linkage method.
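
A hedged sketch of where the O(nB) bound comes from: if points are first hashed with an LSH family, only pairs that share a bucket ever need to be compared when searching for the next clusters to merge, and each bucket holds at most B points. The hash parameters below are arbitrary illustrative choices, not the paper's.

```python
import numpy as np

def lsh_buckets(points, k=6, w=2.0, seed=0):
    """Hash each point to an E2LSH-style bucket key; bucket size is the B in O(nB)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(k, points.shape[1]))
    b = rng.uniform(0, w, size=k)
    buckets = {}
    for i, x in enumerate(points):
        key = tuple(np.floor((a @ x + b) / w).astype(int))
        buckets.setdefault(key, []).append(i)
    return buckets

def candidate_merge_pairs(points):
    """Only pairs that collide in some bucket are candidates for the next agglomeration step."""
    pairs = set()
    for members in lsh_buckets(points).values():
        for i in members:
            for j in members:
                if i < j:
                    pairs.add((i, j))
    return pairs
```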

106 citations


Book ChapterDOI
15 Aug 2007
TL;DR: A variant of the LSH algorithm for solving the c-approximate nearest neighbor problem in high dimensional spaces is presented, focusing on the special case where all points in the dataset lie on the surface of the unit hypersphere in a d-dimensional Euclidean space.
Abstract: LSH (Locality Sensitive Hashing) is one of the best known methods for solving the c-approximate nearest neighbor problem in high dimensional spaces. This paper presents a variant of the LSH algorithm, focusing on the special case where all points in the dataset lie on the surface of the unit hypersphere in a d-dimensional Euclidean space. The LSH scheme is based on a family of hash functions that preserves locality of points. This paper points out that when all points are constrained to lie on the surface of the unit hypersphere, there exist hash functions that partition the space more efficiently than the previously proposed methods. The design of these hash functions uses randomly rotated regular polytopes and it partitions the surface of the unit hypersphere like a Voronoi diagram. Our new scheme improves the exponent ρ, the main indicator of the performance of the LSH algorithm.
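
One instance of this construction, sketched under the assumption that the regular polytope is the orthoplex (cross-polytope): randomly rotate the unit vector and hash it to the nearest polytope vertex +/-e_i, which induces a Voronoi-like partition of the sphere. The code is an illustration of the idea only, not the paper's full scheme.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """A random orthogonal matrix, obtained from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

def cross_polytope_hash(x, rotation):
    """Hash a unit vector x to the nearest vertex (+/- e_i) of a randomly rotated orthoplex."""
    y = rotation @ x                   # randomly rotate the unit vector
    i = int(np.argmax(np.abs(y)))      # axis of the closest vertex
    return (i, int(np.sign(y[i])))     # vertex is +e_i or -e_i
```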

105 citations


Proceedings ArticleDOI
23 Jul 2007
TL;DR: The design principles behind hash-based search methods are revealed and it is shown how optimum hash functions for similarity search can be derived and the rationale of their effectiveness is explained.
Abstract: Hash-based similarity search reduces a continuous similarity relation to the binary concept "similar or not similar": two feature vectors are considered as similar if they are mapped on the same hash key. In terms of runtime performance this principle is unequaled, while at the same time being unaffected by dimensionality concerns. Similarity hashing is applied with great success for near similarity search in large document collections, and it is considered as a key technology for near-duplicate detection and plagiarism analysis. This paper reveals the design principles behind hash-based search methods and presents them in a unified way. We introduce new stress statistics that are suited to analyze the performance of hash-based search methods, and we explain the rationale of their effectiveness. Based on these insights, we show how optimum hash functions for similarity search can be derived. We also present new results of a comparative study between different hash-based search methods.
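
A generic way to realize the "similar iff same hash key" principle is random-hyperplane (SimHash-style) hashing, shown below as an illustrative sketch; it is not the specific fingerprinting scheme analyzed in the paper. The probability that two vectors collide grows with their cosine similarity, and the key is just a short bit string.

```python
import numpy as np

def simhash_key(x, planes):
    """Hash key = sign pattern of the vector against a fixed set of random hyperplanes."""
    return tuple((planes @ x > 0).astype(int))

rng = np.random.default_rng(1)
planes = rng.normal(size=(16, 300))              # 16 random hyperplanes in a 300-d feature space
doc_a = rng.normal(size=300)                     # toy feature vector
doc_b = doc_a + 0.05 * rng.normal(size=300)      # a slightly perturbed near-duplicate
# near-duplicates usually land on the same key; dissimilar vectors rarely do
collide = simhash_key(doc_a, planes) == simhash_key(doc_b, planes)
```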

97 citations


Proceedings ArticleDOI
04 Jun 2007
TL;DR: The experimental results demonstrate that 1) the nearest rounding approach often leads to large timing violations and 2) compared to the well-known Coudert's approach, the new algorithm saves up to 21% in area cost while still satisfying the timing constraint.
Abstract: With increasing time-to-market pressure and shortening semiconductor product cycles, more and more chips are being designed with library-based methodologies. In spite of this shift, the problem of discrete gate sizing has received significantly less attention than its continuous counterpart. On the other hand, the cell sizes of many realistic libraries are sparse, for example geometrically spaced, which makes the nearest rounding approach inapplicable as large timing violations may be introduced. Therefore, it is highly desirable to design an effective algorithm to handle this discrete gate sizing problem. Such an algorithm is proposed in this paper. The algorithm is a continuous-solution-guided dynamic programming approach. A set of novel techniques, such as locality sensitive hashing based solution selection and stage pruning, are also proposed to accelerate the algorithm and improve the solution quality. Our experimental results demonstrate that (1) the nearest rounding approach often leads to large timing violations and (2) compared to the well-known Coudert's approach, the new algorithm saves 9% - 31% in area cost while still satisfying the timing constraint.

70 citations


Journal ArticleDOI
TL;DR: This article shows how to avoid one-against-all comparisons of a query spectrum against the very large number of peptides generated from in silico digestion of protein sequences in a database, and that the approach can be effectively used for other mass spectra mining applications such as finding clusters of spectra efficiently and accurately.
Abstract: Motivation: Due to the recent advances in technology of mass spectrometry, there has been an exponential increase in the amount of data being generated in the past few years. Database searches have not been able to keep up with this data explosion. Thus, speeding up the data searches becomes increasingly important in mass-spectrometry-based applications. Traditional database search methods use one-against-all comparisons of a query spectrum against a very large number of peptides generated from in silico digestion of protein sequences in a database, to filter potential candidates from this database followed by a detailed scoring and ranking of those filtered candidates. Results: In this article, we show that we can avoid the one-against-all comparisons. The basic idea is to design a set of hash functions to pre-process peptides in the database such that for each query spectrum we can use the hash functions to find only a small subset of peptide sequences that are most likely to match the spectrum. The construction of each hash function is based on a random spectrum and the hash value of a peptide is the normalized shared peak counts score (cosine) between the random spectrum and the hypothetical spectrum of the peptide. To implement this idea, we first embed each peptide into a unit vector in a high-dimensional metric space. The random spectrum is represented by a random vector, and we use random vectors to construct a set of hash functions called locality sensitive hashing (LSH) for preprocessing. We demonstrate that our mapping is accurate. We show that our method can filter out >95.65% of the spectra without missing any correct sequences, or gain a 111-fold speedup by filtering out 99.64% of spectra while missing at most 0.19% (2 out of 1014) of the correct sequences. In addition, we show that our method can be effectively used for other mass spectra mining applications such as finding clusters of spectra efficiently and accurately. Contact: tingchen@usc.edu Supplementary information: Supplementary data are available at Bioinformatics online.
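
A minimal sketch of the preprocessing idea described above, with illustrative dimensions and bin width (the paper's actual construction and parameters differ): each peptide's hypothetical spectrum is embedded as a unit vector, and each hash function quantizes its cosine against a fixed random spectrum; the tuple of bin indices serves as the bucket key under which the peptide is stored.

```python
import numpy as np

def cosine_hash(unit_vec, random_spectrum, bin_width=0.05):
    """Quantized cosine between a unit vector and a fixed random spectrum (one hash function)."""
    cos = float(unit_vec @ random_spectrum) / np.linalg.norm(random_spectrum)
    return int(np.floor(cos / bin_width))

rng = np.random.default_rng(0)
random_spectra = [rng.normal(size=2000) for _ in range(10)]   # 10 random spectra = 10 hash functions

peptide_vec = rng.random(2000)                # stand-in for a peptide's hypothetical spectrum
peptide_vec /= np.linalg.norm(peptide_vec)    # embed as a unit vector

# only peptides sharing this key with the query spectrum are scored in detail
bucket_key = tuple(cosine_hash(peptide_vec, r) for r in random_spectra)
```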

53 citations


Book ChapterDOI
18 Nov 2007
TL;DR: This paper presents an efficient indexing and retrieval scheme for searching in document image databases that achieves high precision and recall, using a large image corpus consisting of seven books by Kalidasa in the Telugu language.
Abstract: This paper presents an efficient indexing and retrieval scheme for searching in document image databases. In many non-European languages, optical character recognizers are not very accurate. Word spotting - word image matching - may instead be used to retrieve word images in response to a word image query. The approaches used for word spotting so far, dynamic time warping and/or nearest neighbor search, tend to be slow. Here indexing is done using locality sensitive hashing (LSH) - a technique which computes multiple hashes - using word image features computed at word level. Efficiency and scalability are achieved by content-sensitive hashing implemented through approximate nearest neighbor computation. We demonstrate that the technique achieves high precision and recall (in the 90% range), using a large image corpus consisting of seven books by Kalidasa (a well-known Indian poet of antiquity) in the Telugu language. The accuracy is comparable to using dynamic time warping and nearest neighbor search while the speed is orders of magnitude better - 20,000 word images can be searched in milliseconds.

44 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: This paper presents an efficient solution to the approximate nearest subspace problem for both linear and affine subspaces based on a simple reduction to the problem of nearest point search, and can thus employ tree based search or locality sensitive hashing to find a near subspace.
Abstract: Linear and affine subspaces are commonly used to describe the appearance of objects under different lighting, viewpoint, articulation, and identity. A natural problem arising from their use is: given a query image portion represented as a point in some high dimensional space, find a subspace near to the query. This paper presents an efficient solution to the approximate nearest subspace problem for both linear and affine subspaces. Our method is based on a simple reduction to the problem of nearest point search, and can thus employ tree based search or locality sensitive hashing to find a near subspace. Further speedup may be achieved by using random projections to lower the dimensionality of the problem. We provide theoretical proofs of correctness and error bounds of our construction and demonstrate its capabilities on synthetic and real data. Our experiments demonstrate that an approximate nearest subspace can be located significantly faster than the exact nearest subspace, while at the same time it can find better matches compared to a similar search on points, in the presence of variations due to viewpoint, lighting, etc.
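
One simplified flavor of such a reduction for linear subspaces, given as a hedged sketch rather than the paper's exact construction: embed a k-dimensional subspace as its flattened orthogonal projection matrix and a normalized query point as its flattened outer product. The squared Euclidean distance between the two embeddings is then k - 1 + 2 sin^2(theta), where theta is the angle between the query and the subspace, so it is monotone in the point-to-subspace distance and any nearest-point structure (tree search or LSH) over the embeddings returns a near subspace.

```python
import numpy as np

def embed_subspace(basis):
    """basis: d x k matrix with orthonormal columns; embedding = flattened projection matrix."""
    p = basis @ basis.T
    return p.ravel()

def embed_query(x):
    """Embed a query point as the flattened outer product of its normalized direction."""
    x = x / np.linalg.norm(x)
    return np.outer(x, x).ravel()

# ||embed_subspace(B) - embed_query(x)||^2 == k - 1 + 2*sin(theta)^2,
# so ranking subspaces by this Euclidean distance ranks them by angle to the query.
```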


Journal ArticleDOI
TL;DR: This paper proposes a novel approach to efficient semantic search on DHT overlays that achieves high recall for queries at very low cost, i.e., the number of nodes visited for a query is about 10-20, independent of the overlay size.

Proceedings Article
03 Dec 2007
TL;DR: A general learning framework for the NN problem in which sample queries are used to learn the parameters of a data structure that minimize the retrieval time and/or the miss rate is presented.
Abstract: Can we leverage learning techniques to build a fast nearest-neighbor (ANN) retrieval data structure? We present a general learning framework for the NN problem in which sample queries are used to learn the parameters of a data structure that minimize the retrieval time and/or the miss rate. We explore the potential of this novel framework through two popular NN data structures: KD-trees and the rectilinear structures employed by locality sensitive hashing. We derive a generalization theory for these data structure classes and present simple learning algorithms for both. Experimental results reveal that learning often improves on the already strong performance of these data structures.

Proceedings ArticleDOI
02 Jul 2007
TL;DR: This paper proposes a scalable content-based image retrieval scheme using locality-sensitive hashing (LSH), and conducts extensive evaluations on a large image testbed of a half million images, which is promising for building Web-scale CBIR systems.
Abstract: One key challenge in content-based image retrieval (CBIR) is to develop a fast solution for indexing high-dimensional image contents, which is crucial to building large-scale CBIR systems. In this paper, we propose a scalable content-based image retrieval scheme using locality-sensitive hashing (LSH), and conduct extensive evaluations on a large image testbed of half a million images. To the best of our knowledge, there has been little comprehensive study of large-scale CBIR evaluation with half a million images. Our empirical results show that our proposed solution is able to scale to hundreds of thousands of images, which is promising for building Web-scale CBIR systems.

Book ChapterDOI
05 Jul 2007
TL;DR: This article uses local image descriptors to analyze how a large input image can be decomposed into small template images contained in a database, and introduces a filtering step to ensure that found images do not overlap one another when warped onto the input image.
Abstract: In recent years, local image descriptors have received much attention because of their efficiency for several computer vision tasks such as image retrieval, image comparison, feature matching for 3D reconstruction... Recent surveys have shown that Scale Invariant Feature Transform (SIFT) vectors are the most efficient for several criteria. In this article, we use these descriptors to analyze how a large input image can be decomposed into small template images contained in a database. Affine transformations from database images onto the input image are found as described in [16]. The large image is thus covered by small patches like a jigsaw puzzle. We introduce a filtering step to ensure that found images do not overlap one another when warped onto the input image. A typical new application is to retrieve which products are offered on a supermarket shelf. This is achieved using only a large picture of the shelf and a database of all products available in the supermarket. Because the database can be large and the analysis should ideally be done in a few seconds, we compare the performances of two state of the art algorithms to search SIFT correspondences: the Best-Bin-First algorithm on a Kd-Tree, and Locality Sensitive Hashing. We also introduce a modification in the LSH algorithm to adapt it to SIFT vectors.

Proceedings ArticleDOI
25 Jun 2007
TL;DR: A new mechanism, TSO, is proposed for constructing a two-layer topology-aware structured overlay network based on a locality sensitive hashing scheme; simulation experiments show the effectiveness of TSO.
Abstract: A DHT scheme without any knowledge of the underlying physical topology can cause a serious topology mismatch between the P2P overlay network and the physical underlying network. In this paper, a new mechanism, TSO, is proposed for constructing a two-layer topology-aware structured overlay network based on a locality sensitive hashing scheme. In TSO, physically close nodes are clustered into a local-level P2P ring which is regarded as a virtual node in the high-level Chord ring of the overall P2P overlay network. A large portion of routing hops previously executed in the global P2P ring are now replaced by hops in local-level rings, and thus routing overheads can be reduced. Through intensive simulation experiments, we have shown the effectiveness of TSO.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: A novel scheme for representing character string images in scanned documents by converting conventional multi-dimensional descriptors into pseudo-codes which have the property that if two vectors are near in the original space then the encoded pseudo-codes are 'semi-equivalent' with high probability.
Abstract: In this paper, we propose a novel scheme for representing character string images in the scanned document. We convert conventional multi-dimensional descriptors into pseudo-codes which have the property that if two vectors are near in the original space then the encoded pseudo-codes are 'semi-equivalent' with high probability. For this conversion, we combined locality sensitive hashing (LSH) indices, and at the same time we also developed a new family of LSH functions that is superior to earlier ones when all vectors are constrained to lie on the surface of the unit sphere. Word spotting based on our pseudo-codes becomes faster than the multi-dimensional descriptor-based method while scarcely degrading accuracy.

Proceedings ArticleDOI
Junsong Yuan1, Wei Wang2, Jingjing Meng2, Ying Wu1, Dongge Li2 
29 Sep 2007
TL;DR: A novel method which translates repetitive clip mining to the continuous path finding problem in a matching trellis, where sequence matching can be accelerated by taking advantage of the temporal redundancies in the videos.
Abstract: Automatically discovering repetitive clips from a large video database is a challenging problem due to the enormous computational cost involved in exploring the huge solution space. Without any a priori knowledge of the contents, lengths and total number of the repetitive clips, we need to discover all of them in the video database. To address the large computational cost, we propose a novel method which translates repetitive clip mining into a continuous path finding problem in a matching trellis, where sequence matching can be accelerated by taking advantage of the temporal redundancies in the videos. By applying locality sensitive hashing (LSH) for efficient similarity queries and the proposed continuous path finding algorithm, our method has only quadratic complexity in the database size. Experiments conducted on a 10.5-hour TRECVID news dataset have shown the effectiveness of the method, which can discover repetitive clips of various lengths and contents in only 25 minutes, with features extracted off-line.

Proceedings ArticleDOI
29 Sep 2007
TL;DR: This tutorial describes techniques essential for searching the large multimedia databases that are now common on the Internet, including methods for multimedia retrieval on large document collections and a special focus is on how to combine large data methods with semantically meaningful descriptors in order to facilitate efficient similarity-based retrieval.
Abstract: This tutorial describes techniques essential for searching the large multimedia databases that are now common on the Internet. There are up to 10 million songs in commercial music catalogues and over 300 million images stored in online photo services such as Flickr. How can we find the music, videos or images we want? How can we organize such large collections: find duplicates, create links between similar documents, extract and annotate semantic structures from complex audiovisual documents? Conventional methods for handling large data sets, such as hashing, get us part of the way, but those methods may not straightforwardly be used for similarity-based matching and retrieval in audiovisual document collections. On the other hand, several elaborate methods from multimedia retrieval are available for semantic document analysis. Unfortunately, those methods generally do not scale for large data sets. Instead, new classes of algorithms combining the best of the two worlds of large data methods and semantic analysis are needed to handle large multimedia databases. Innovative methods such as locality sensitive hashing, which are based on randomized probes, are the new workhorses. This tutorial covers methods for multimedia retrieval on large document collections. Starting with audio retrieval, we describe both the theory (i.e., randomized algorithms for hashing) and the implementation details (how do you store hash values for millions of songs?). A special focus is on how to combine large data methods with semantically meaningful descriptors in order to facilitate efficient similarity-based retrieval. Besides audio, the tutorial also covers image, 3d motion and video retrieval.

Proceedings ArticleDOI
28 Jan 2007
TL;DR: In this article, the authors discuss one of the important issues in generating a robust media hash and propose a method for hashing that clearly improves the robustness of a hashing algorithm compared to other methods.
Abstract: This paper discusses one of the important issues in generating a robust media hash. The robustness of a media hashing algorithm is primarily determined by three factors: (1) the robustness-false alarm tradeoff achieved by the chosen feature representation, (2) the accuracy of the bit extraction step, and (3) the distance measure used to measure similarity (dissimilarity) between two hashes. The robustness-false alarm tradeoff in feature space is measured by a similarity (dissimilarity) measure and it defines a limit on the performance of the hashing algorithm. The distance measure used to compute the distance between the hashes determines how far this tradeoff in the feature space is preserved through the bit extraction step. Hence the bit extraction step is crucial in defining the robustness of a hashing algorithm. Although this is recognized as an important requirement by all, to our knowledge there is no work in the existing literature that elucidates the efficacy of an algorithm based on its effectiveness in improving this tradeoff compared to other methods. This paper specifically demonstrates the kind of robustness-false alarm tradeoff achieved by existing methods and proposes a method for hashing that clearly improves this tradeoff.

Book ChapterDOI
09 Jan 2007
TL;DR: This work pays the most attention to the design of Locality Sensitive Hashing (LSH) and the partial sequence comparison, and proposes a fast and efficient audio retrieval framework of query-by-content.
Abstract: With this work we study suitable indexing techniques to support efficient content-based music retrieval in large acoustic databases. To obtain an index-based retrieval mechanism applicable to audio content, we pay particular attention to the design of Locality Sensitive Hashing (LSH) and the partial sequence comparison, and propose a fast and efficient audio retrieval framework for query-by-content. On the basis of this indexable framework, four different retrieval schemes, LSH-Dynamic Programming (DP), LSH-Sparse DP (SDP), Exact Euclidean LSH (E2LSH)-DP, and E2LSH-SDP, are presented and evaluated in order to achieve an extensive understanding of retrieval algorithm performance. The experimental results indicate that, compared to the other three schemes, E2LSH-SDP exhibits the best tradeoff in terms of response time, retrieval ratio, and computation cost.

Proceedings ArticleDOI
06 Jun 2007
TL;DR: This paper proposes two non-index pruning strategies for ANN queries on metric spaces that utilize the r-NN query and a projecting law, analyze the distribution of query points, find the search region in data space, and obtain the result efficiently.
Abstract: Aggregate nearest neighbor (ANN) queries are much more complex than nearest neighbor queries, and pruning strategies are always utilized in ANN queries. Most of the pruning methods are based on data index mechanisms, such as the R-tree. But due to the well-known curse of dimensionality, ANN search could be meaningless in high dimensional spaces. In this paper, we propose two non-index pruning strategies for ANN queries on metric spaces. Our methods utilize the r-NN query and a projecting law, analyze the distribution of query points, find the search region in data space, and obtain the result efficiently.

Journal ArticleDOI
TL;DR: This paper addresses the cases when such distribution follows a natural negative linear distribution, a partial negative linear distribution, or an exponential distribution, which are found to closely approximate many real-life database distributions, and derives a general formula for calculating the distribution variance produced by any given non-overlapping bit-grouping XOR hashing function.

Proceedings ArticleDOI
02 Apr 2007
TL;DR: This paper proposes a method for retrieving similar interaction proteins using profiles that represent the features of the interaction site binding to a certain compound, based on the geometric hashing technique.
Abstract: Protein function is expressed by binding to other compounds at a local portion, called an interaction site. Since the structure of its interaction site and the function of a protein are closely related, retrieving similar interaction proteins is effective in clarifying the function of a protein. We have proposed a method for retrieving similar interaction proteins using profiles that represent the features of the interaction site binding to a certain compound. In this method it is necessary to compare the structure between proteins and a profile, and we use the geometric hashing technique, which is one of the popular methods for structure comparison. However, the problem with structure comparison using geometric hashing is that memory usage becomes too large. This paper proposes a method for adapting geometric hashing to alleviate this problem. Firstly, only small parts of the target structures are stored in the hash table to reduce the size of the hash table. By evaluating this hash table we screen out candidates of similar structures between target proteins and query profiles. Secondly, overall structures are compared for these candidates. In order to reduce the time for retrieval, we evaluate the information of the origin, which is not generally evaluated, without increasing the size of the hash table. Reference sets, the basis for transformation in geometric hashing, are sorted

Book ChapterDOI
27 Jun 2007
TL;DR: A definition of duplicates with the desired robustness properties mandatory for 2D-NMR experiments is proposed and several appropriate data transformations for this task are proposed.
Abstract: 2D-Nuclear magnetic resonance (NMR) spectroscopy is a powerful analytical method to elucidate the chemical structure of molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of 1H and 13C simultaneously. To curate or merge large spectra libraries, a robust (and fast) duplicate detection is needed. We propose a definition of duplicates with the desired robustness properties mandatory for 2D-NMR experiments. A major gain in runtime performance with respect to previously proposed heuristics is achieved by mapping the spectra to simple discrete objects. We propose several appropriate data transformations for this task. In order to compensate for slight variations of the mapped spectra, we use appropriate hashing functions according to the locality sensitive hashing scheme, and identify duplicates by hash collisions.
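
A toy illustration of the collision-based duplicate test (the grid resolution and the hash below are assumptions for the example, not the paper's transformations): discretize each 2D peak list onto a coarse grid, then flag two spectra as potential duplicates when their discretized peak sets collide under a hash such as a single min-hash value.

```python
def discretize(peaks, h_step=0.05, c_step=0.5):
    """Map a list of (1H shift, 13C shift) peaks to a set of coarse grid cells (toy transformation)."""
    return {(round(h / h_step), round(c / c_step)) for h, c in peaks}

def minhash_bucket(cells, a=2654435761, m=2**31 - 1):
    """A single min-hash value over the cell set; equal buckets signal a potential duplicate."""
    return min((a * hash(cell)) % m for cell in cells)

spectrum1 = [(7.26, 128.5), (3.71, 55.8), (1.23, 14.1)]
spectrum2 = [(7.27, 128.4), (3.70, 55.9), (1.24, 14.1)]   # slightly shifted re-measurement
possible_duplicate = minhash_bucket(discretize(spectrum1)) == minhash_bucket(discretize(spectrum2))
```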

01 Jan 2007
TL;DR: The proposed method consists of non-uniform video segmentation, self-similarity analysis, locality sensitive hashing, and video repeat boundary refinement to automatically discover unknown short video repeats with arbitrary lengths from large video databases or streams.
Abstract: In this paper we propose an efficient and robust method to automatically discover unknown short video repeats with arbitrary lengths, from a few seconds to a few minutes, from large video databases or streams. The proposed method consists of non-uniform video segmentation, self-similarity analysis, locality sensitive hashing, and video repeat boundary refinement. In order to achieve efficient and accurate processing, feature extraction and similarity measurement are performed at two levels: video frame level and video segment level. Experiments are conducted on 12-hour CNN/ABC news and 12-hour documentaries (Discovery and National Geographic); high recall and precision of 98%-99% have been achieved. Video repeats' boundaries can be located within several frames. Applying the proposed method for video structure analysis is also briefly discussed.