
Showing papers on "Inverted index published in 2016"


Posted Content
TL;DR: The proposed 3D shape search engine, named GIFT because it combines GPU acceleration and Inverted File (Twice), significantly outperforms state-of-the-art methods in retrieval accuracy on various shape benchmarks and competitions.
Abstract: Projective analysis is an important solution for 3D shape retrieval, since human visual perceptions of 3D shapes rely on various 2D observations from different viewpoints. Although multiple informative and discriminative views are utilized, most projection-based retrieval systems suffer from heavy computational cost and thus cannot satisfy the basic scalability requirement of search engines. In this paper, we present a real-time 3D shape search engine based on the projective images of 3D shapes. The real-time property of our search engine results from the following aspects: (1) efficient projection and view feature extraction using GPU acceleration; (2) the first inverted file, referred to as F-IF, is utilized to speed up the procedure of multi-view matching; (3) the second inverted file (S-IF), which captures a local distribution of 3D shapes in the feature manifold, is adopted for efficient context-based re-ranking. As a result, for each query the retrieval task can be finished within one second despite the necessary cost of IO overhead. We name the proposed 3D shape search engine, which combines GPU acceleration and Inverted File (Twice), as GIFT. Besides its high efficiency, GIFT also outperforms state-of-the-art methods significantly in retrieval accuracy on various shape benchmarks and competitions.

216 citations
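
As a rough illustration of the role an inverted file plays in multi-view matching, the sketch below builds a word-to-shape inverted file over quantized view features and retrieves candidate shapes by voting. This is a minimal, hypothetical example rather than the authors' implementation; the quantization of view features to visual-word ids is assumed to have happened elsewhere.

```python
from collections import defaultdict

def build_inverted_file(shape_views):
    """shape_views: dict mapping shape_id -> list of visual-word ids,
    one id per projected view. Returns word -> list of shape_ids."""
    inv = defaultdict(list)
    for shape_id, words in shape_views.items():
        for w in set(words):            # post each shape at most once per word
            inv[w].append(shape_id)
    return inv

def candidate_shapes(query_words, inv):
    """Vote for shapes that share quantized view features with the query,
    instead of matching the query views against every shape exhaustively."""
    votes = defaultdict(int)
    for w in set(query_words):
        for shape_id in inv.get(w, []):
            votes[shape_id] += 1
    return sorted(votes, key=votes.get, reverse=True)

# toy usage
inv = build_inverted_file({"chair_1": [3, 7, 7], "table_2": [7, 9]})
print(candidate_shapes([7, 3], inv))    # ['chair_1', 'table_2']
```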


Proceedings ArticleDOI
27 Jun 2016
TL;DR: GIFT as discussed by the authors is a real-time 3D shape search engine based on the projective images of 3D shapes, which combines GPU acceleration with two inverted files, one to speed up multi-view matching and one for efficient context-based re-ranking.
Abstract: Projective analysis is an important solution for 3D shape retrieval, since human visual perceptions of 3D shapes rely on various 2D observations from different viewpoints. Although multiple informative and discriminative views are utilized, most projection-based retrieval systems suffer from heavy computational cost and thus cannot satisfy the basic scalability requirement of search engines. In this paper, we present a real-time 3D shape search engine based on the projective images of 3D shapes. The real-time property of our search engine results from the following aspects: (1) efficient projection and view feature extraction using GPU acceleration, (2) the first inverted file, referred to as F-IF, is utilized to speed up the procedure of multi-view matching, (3) the second inverted file (S-IF), which captures a local distribution of 3D shapes in the feature manifold, is adopted for efficient context-based re-ranking. As a result, for each query the retrieval task can be finished within one second despite the necessary cost of IO overhead. We name the proposed 3D shape search engine, which combines GPU acceleration and Inverted File (Twice), as GIFT. Besides its high efficiency, GIFT also outperforms state-of-the-art methods significantly in retrieval accuracy on various shape benchmarks and competitions.

168 citations


Journal ArticleDOI
TL;DR: An extremely efficient algorithm for visual re-ranking is proposed: by considering the original pairwise distance in the contextual space, a feature vector called sparse contextual activation (SCA) is developed that encodes the local distribution of an image, and a feature fusion algorithm built on SCA preserves the characteristic of high time efficiency.
Abstract: In this paper, we propose an extremely efficient algorithm for visual re-ranking. By considering the original pairwise distance in the contextual space, we develop a feature vector called sparse contextual activation (SCA) that encodes the local distribution of an image. Hence, the re-ranking task can be accomplished simply by vector comparison under the generalized Jaccard metric, which has its theoretical meaning in fuzzy set theory. In order to improve the time efficiency of the re-ranking procedure, an inverted index is introduced to speed up the computation of the generalized Jaccard metric. As a result, the average time cost of re-ranking for a given query can be kept within 1 ms. Furthermore, inspired by query expansion, we also develop an additional method called local consistency enhancement on the proposed SCA to improve retrieval performance in an unsupervised manner. On the other hand, the retrieval performance using a single feature may not be satisfactory enough, which inspires us to fuse multiple complementary features for accurate retrieval. Based on SCA, a robust feature fusion algorithm is exploited that also preserves the characteristic of high time efficiency. We assess our proposed method on various visual re-ranking tasks. Experimental results on the Princeton shape benchmark (3D object), WM-SRHEC07 (3D competition), YAEL data set B (face), MPEG-7 data set (shape), and Ukbench data set (image) manifest the effectiveness and efficiency of SCA.

140 citations
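
The generalized Jaccard metric used to compare SCA vectors has a simple closed form; below is a minimal sketch assuming nonnegative sparse vectors stored as Python dicts (the storage layout is an assumption, not the paper's exact data structure).

```python
def generalized_jaccard(x, y):
    """x, y: sparse nonnegative vectors as {dimension: value} dicts.
    J(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    dims = set(x) | set(y)
    num = sum(min(x.get(d, 0.0), y.get(d, 0.0)) for d in dims)
    den = sum(max(x.get(d, 0.0), y.get(d, 0.0)) for d in dims)
    return num / den if den > 0 else 0.0

# An inverted index over non-zero dimensions lets a query touch only those
# database vectors that share at least one active dimension with it.
print(generalized_jaccard({0: 0.5, 3: 0.2}, {0: 0.4, 7: 0.1}))  # 0.4 / 0.8 = 0.5
```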


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper integrated discriminative cues from multiple contextual levels, i.e., local, regional, and global, via probabilistic analysis, to improve the precision of visual matching.
Abstract: This paper considers the task of image search using the Bag-of-Words (BoW) model. In this model, the precision of visual matching plays a critical role. Conventionally, local cues of a keypoint, e.g., SIFT, are employed. However, such strategy does not consider the contextual evidences of a keypoint, a problem which would lead to the prevalence of false matches. To address this problem and enable accurate visual matching, this paper proposes to integrate discriminative cues from multiple contextual levels, i.e., local, regional, and global, via probabilistic analysis. "True match" is defined as a pair of keypoints corresponding to the same scene location on all three levels (Fig. 1). Specifically, the Convolutional Neural Network (CNN) is employed to extract features from regional and global patches. We show that CNN feature is complementary to SIFT due to its semantic awareness and compares favorably to several other descriptors such as GIST, HSV, etc. To reduce memory usage, we propose to index CNN features outside the inverted file, communicated by memory-efficient pointers. Experiments on three benchmark datasets demonstrate that our method greatly promotes the search accuracy when CNN feature is integrated. We show that our method is efficient in terms of time cost compared with the BoW baseline, and yields competitive accuracy with the state-of-the-arts.

87 citations


Journal ArticleDOI
TL;DR: A novel index structure, called inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword based pruning techniques to effectively reduce the search space is proposed.
Abstract: With advances in geo-positioning technologies and geo-location services, there is a rapidly growing amount of spatio-textual objects collected in many applications such as location-based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search, which explores both the location and the textual description of the objects, has attracted great attention from commercial organizations and research communities. In this paper, we study two fundamental problems in spatial keyword queries: top-k spatial keyword search (TOPK-SK) and batch top-k spatial keyword search (BTOPK-SK). Given a set of spatio-textual objects, a query location, and a set of query keywords, TOPK-SK retrieves the closest k objects each of which contains all keywords in the query. BTOPK-SK is the batch processing of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel index structure, called inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword-based pruning techniques to effectively reduce the search space. An efficient algorithm is then developed to tackle top-k spatial keyword search. To further enhance the filtering capability of the signature of the linear quadtree, we propose a partition-based method. In addition, to deal with BTOPK-SK, we design a new computing paradigm which partitions the queries into groups based on both spatial proximity and the textual relevance between queries. We show that the IL-Quadtree technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods.

63 citations
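
As a loose illustration of how keyword-organized inverted lists can be combined with a linear quadtree's space-filling-curve cells, the sketch below keeps, per keyword, a list of (Morton cell code, object) entries and intersects the lists of the query keywords. It is a hypothetical simplification: the actual IL-Quadtree adds signatures, spatial pruning, and top-k distance ranking, all of which are omitted here.

```python
from collections import defaultdict

def interleave(x, y, bits=16):
    """Morton (Z-order) code of a grid cell: interleave the bits of x and y,
    which is the linearization a linear quadtree relies on."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return code

class KeywordCellIndex:
    """Toy index: keyword -> list of (morton_cell, object_id) entries."""
    def __init__(self):
        self.lists = defaultdict(list)

    def insert(self, obj_id, cell_x, cell_y, keywords):
        code = interleave(cell_x, cell_y)
        for kw in keywords:
            self.lists[kw].append((code, obj_id))

    def candidates(self, keywords):
        """Objects appearing in every queried keyword's list (AND semantics)."""
        sets = [set(oid for _, oid in self.lists.get(kw, [])) for kw in keywords]
        return set.intersection(*sets) if sets else set()

idx = KeywordCellIndex()
idx.insert("cafe_1", 5, 9, ["coffee", "wifi"])
idx.insert("bar_2", 5, 8, ["beer"])
print(idx.candidates(["coffee", "wifi"]))   # {'cafe_1'}
```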


Journal ArticleDOI
TL;DR: A social re-ranking system for tag-based image retrieval that considers an image's relevance and diversity is proposed, and an inverted index structure for the social image dataset is built to accelerate the searching process.
Abstract: Social media sharing websites like Flickr allow users to annotate images with free tags, which significantly contribute to the development of web image retrieval and organization. Tag-based image search is an important method to find images contributed by social users in such social websites. However, making the top-ranked results both relevant and diverse is challenging. In this paper, we propose a social re-ranking system for tag-based image retrieval that considers an image's relevance and diversity. We aim at re-ranking images according to their visual information, semantic information, and social clues. The initial results include images contributed by different social users. Usually each user contributes several images. First, we sort these images by inter-user re-ranking: users with a higher contribution to the given query rank higher. Then we sequentially perform intra-user re-ranking on each ranked user's image set, and only the most relevant image from each user's image set is selected. These selected images compose the final retrieved results. We build an inverted index structure for the social image dataset to accelerate the searching process. Experimental results on a Flickr dataset show that our social re-ranking method is effective and efficient.

59 citations


Proceedings ArticleDOI
07 Jul 2016
TL;DR: The Maximum Subtree Similarity (MSS) is proposed for ranking formulae based upon the subexpression whose symbols and layout best match a query formula, and the Tangent-3 system first retrieves expressions using an inverted index over symbol pair relationships, ranking hits using the Dice coefficient.
Abstract: When using a mathematical formula for search (query-by-expression), the suitability of retrieved formulae often depends more upon symbol identities and layout than deep mathematical semantics. Using a Symbol Layout Tree representation for formula appearance, we propose the Maximum Subtree Similarity (MSS) for ranking formulae based upon the subexpression whose symbols and layout best match a query formula. Because MSS is too expensive to apply against a complete collection, the Tangent-3 system first retrieves expressions using an inverted index over symbol pair relationships, ranking hits using the Dice coefficient; the top-k formulae are then re-ranked by MSS. Tangent-3 obtains state-of-the-art performance on the NTCIR-11 Wikipedia formula retrieval benchmark, and is efficient in terms of both space and time. Retrieval systems for other graphical forms, including chemical diagrams, flowcharts, figures, and tables, may benefit from adopting this approach.

51 citations
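
The Dice coefficient used for the first-stage ranking over symbol-pair postings has a simple set form; below is a minimal sketch in which formulae are represented as sets of symbol-pair relationship tuples. The tuple layout is illustrative only, not Tangent-3's exact representation.

```python
def dice(a, b):
    """Dice coefficient between two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Formulae as sets of toy (parent symbol, child symbol, edge label) tuples.
query = {("x", "2", "sup"), ("+", "x", "next"), ("+", "1", "next")}
cand  = {("x", "2", "sup"), ("+", "x", "next")}
print(round(dice(query, cand), 3))   # 2*2 / (3+2) = 0.8
```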


Journal ArticleDOI
TL;DR: A multilevel index model for large-scale service repositories is proposed that can be used to reduce the execution time of service discovery and composition; experiments validate that the proposed model is more efficient than existing structures, i.e., sequential and inverted index ones.
Abstract: The number of web services has grown drastically, so how to manage them efficiently in a service repository is an important issue to address. In a given field there often exists an efficient data structure for a class of objects; e.g., Google's Bigtable is very well suited to webpage storage and management. Based on the theory of equivalence relations and quotient sets, this work proposes a multilevel index model for large-scale service repositories, which can be used to reduce the execution time of service discovery and composition. Its novel use of keys, inspired by keys in relational databases, can effectively remove the redundancy of the commonly used inverted index. Its four function-based operations are proposed, for the first time, to manage and maintain services in a repository. The experiments validate that the proposed model is more efficient than existing structures, i.e., sequential and inverted index ones.

45 citations


Journal ArticleDOI
TL;DR: In this paper, the authors introduce new techniques for compressing inverted indexes that exploit near-copy regularity across documents, based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them.

37 citations
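
For contrast with the differential-list compression described above, the sketch below shows the usual gap encoding of a posting list followed by a simple run-length pass over the gaps. It is a generic illustration of the baseline idea, not the paper's technique.

```python
def gaps(postings):
    """Replace a strictly increasing posting list by first-order differences."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(seq):
    """Run-length encode a sequence as [(value, run_count), ...].
    Long runs of identical gaps (e.g. in near-duplicate documents) collapse well."""
    out = []
    for v in seq:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

postings = [3, 4, 5, 6, 7, 50, 51, 52]
print(gaps(postings))              # [3, 1, 1, 1, 1, 43, 1, 1]
print(run_length(gaps(postings)))  # [(3, 1), (1, 4), (43, 1), (1, 2)]
```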


Journal ArticleDOI
TL;DR: This work proposes a new data structure that can be used by a variety of existing algorithms without modifying their original schema, and demonstrates the utility of the proposed data structure in improving the algorithms' runtime by orders of magnitude and substantially reducing both the auxiliary and the main memory requirements.
Abstract: The growing interest in data storage has made data sizes increase exponentially, hampering the process of knowledge discovery from these large volumes of high-dimensional and heterogeneous data. In recent years, many efficient algorithms for mining data associations have been proposed, addressing time and main memory requirements. Nevertheless, this mining process can still become hard when the number of items and records is extremely high. In this paper, the goal is not to propose new efficient algorithms but a new data structure that can be used by a variety of existing algorithms without modifying their original schema. Thus, our aim is to speed up the association rule mining process regardless of the algorithm used to this end, enabling the performance of efficient implementations to be enhanced. The structure simplifies, reorganizes, and speeds up data access by sorting data by means of a shuffling strategy based on the Hamming distance, which places similar values closer together, and by considering both an inverted index mapping and a run-length encoding compression. In the experimental study, we explore the bounds of the algorithms' performance by using a wide number of data sets that comprise either thousands or millions of both items and records. The results demonstrate the utility of the proposed data structure in improving the algorithms' runtime by orders of magnitude and substantially reducing both the auxiliary and the main memory requirements.

36 citations


Proceedings ArticleDOI
13 Aug 2016
TL;DR: CaSMoS, a machine-learned candidate selection framework that makes use of Weighted AND (WAND) queries, is proposed; it is designed to prune irrelevant documents and retrieve documents that are likely to be part of the top-k results for the query.
Abstract: User experience at social media and web platforms such as LinkedIn is heavily dependent on the performance and scalability of its products. Applications such as personalized search and recommendations require real-time scoring of millions of structured candidate documents associated with each query, with strict latency constraints. In such applications, the query incorporates the context of the user (in addition to search keywords if present), and hence can become very large, comprising thousands of Boolean clauses over hundreds of document attributes. Consequently, candidate selection techniques need to be applied, since it is infeasible to retrieve and score all matching documents from the underlying inverted index. We propose CaSMoS, a machine-learned candidate selection framework that makes use of Weighted AND (WAND) queries. Our framework is designed to prune irrelevant documents and retrieve documents that are likely to be part of the top-k results for the query. We apply a constrained feature selection algorithm to learn positive weights for feature combinations that are used as part of the weighted candidate selection query. We have implemented and deployed this system to be executed in real time using LinkedIn's Galene search platform. We perform extensive evaluation with different training data approaches and parameter settings, and investigate the scalability of the proposed candidate selection model. Our deployment of this system as part of LinkedIn's job recommendation engine has resulted in a significant reduction in latency (up to 25%) without sacrificing the quality of the retrieved results, thereby paving the way for more sophisticated scoring models.
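
A rough sketch of the document-at-a-time WAND pruning idea that such candidate selection builds on: each term carries a precomputed upper bound on its score contribution, and documents whose summed upper bounds cannot beat the current top-k threshold are skipped without full evaluation. This is a generic, simplified WAND, not LinkedIn's Galene implementation; postings, max_score, and scorer are illustrative names.

```python
import heapq

def wand_top_k(postings, max_score, k, scorer):
    """Simplified WAND. postings: term -> sorted list of doc ids;
    max_score: term -> upper bound on that term's score contribution;
    scorer(doc, terms_present) computes the real score. Sketch only."""
    cursors = {t: 0 for t in postings}            # current position per posting list
    heap = []                                     # min-heap of (score, doc) for the top-k

    def current(t):
        lst, i = postings[t], cursors[t]
        return lst[i] if i < len(lst) else None

    while True:
        active = sorted((t for t in postings if current(t) is not None), key=current)
        if not active:
            break
        threshold = heap[0][0] if len(heap) == k else 0.0
        acc, pivot = 0.0, None                    # find the pivot term
        for t in active:
            acc += max_score[t]
            if acc > threshold:
                pivot = t
                break
        if pivot is None:                         # no remaining doc can beat the threshold
            break
        pivot_doc = current(pivot)
        if current(active[0]) == pivot_doc:       # enough cursors aligned: evaluate fully
            terms_present = [t for t in active if current(t) == pivot_doc]
            score = scorer(pivot_doc, terms_present)
            if len(heap) < k:
                heapq.heappush(heap, (score, pivot_doc))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pivot_doc))
            for t in terms_present:               # advance every cursor sitting on pivot_doc
                cursors[t] += 1
        else:                                     # skip the lagging list forward to pivot_doc
            t = active[0]
            while current(t) is not None and current(t) < pivot_doc:
                cursors[t] += 1
    return sorted(heap, reverse=True)

# toy usage: the "real" score is just the sum of per-term weights
P = {"jobs": [1, 3, 5], "engineer": [3, 4, 5]}
W = {"jobs": 1.0, "engineer": 2.0}
print(wand_top_k(P, W, k=2, scorer=lambda d, terms: sum(W[t] for t in terms)))
```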

Proceedings ArticleDOI
16 May 2016
TL;DR: This work captures the dynamics of events using four event operations (create, absorb, split, and merge), which can be effectively used to monitor evolving events, and proposes a novel event indexing structure, called Multi-layer Inverted List (MIL), to manage dynamic event databases for the acceleration of large-scale event search and update.
Abstract: Tweet streams provide a variety of real-time information on dynamic social events. Although event detection has been actively studied, most of the existing approaches do not address the issue of efficient event monitoring in the presence of a large number of events detected from continuous tweet streams. In this paper, we capture the dynamics of events using four event operations: creation, absorption, split, and merge. We also propose a novel event indexing structure, named Multi-layer Inverted List (MIL), for the acceleration of large-scale event search and update. We thoroughly study the problem of nearest neighbour search using MIL based on upper bound pruning. Extensive experiments have been conducted on a large-scale tweet dataset. The results demonstrate the promising performance of our method in terms of both efficiency and effectiveness.

Book ChapterDOI
05 Sep 2016
TL;DR: This work proposes an approach to index Deep Convolutional Neural Network Features to support efficient retrieval on very large image databases, and builds LuQ, a robust retrieval system that combines full-text search with content-based image retrieval capabilities.
Abstract: Content-based image retrieval using Deep Learning has become very popular during the last few years. In this work, we propose an approach to index Deep Convolutional Neural Network Features to support efficient retrieval on very large image databases. The idea is to provide a text encoding for these features, enabling the use of a text retrieval engine to perform image similarity search. In this way, we built LuQ, a robust retrieval system that combines full-text search with content-based image retrieval capabilities. In order to optimize the index occupation and the query response time, we evaluated various tuning parameters to generate the text encoding. To this end, we have developed a web-based prototype to efficiently search through a dataset of 100 million images.
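
The core of the text-encoding idea can be sketched as follows: each non-negative feature component becomes a surrogate term repeated roughly in proportion to its quantized activation, so that the term-frequency scoring of an off-the-shelf text engine approximates a dot product between feature vectors. The quantization factor below is an arbitrary illustrative value, not the one tuned in the paper.

```python
def encode_features_as_text(vector, quantization=30.0):
    """Turn a non-negative feature vector (e.g. ReLU'd fc6 activations) into a
    surrogate text document: component i becomes the term 'f<i>' repeated in
    proportion to its quantized value, so tf-based scoring in a text engine
    approximates the dot product between vectors. The factor 30.0 is an
    assumed illustrative choice."""
    terms = []
    for i, v in enumerate(vector):
        reps = int(round(v * quantization))
        terms.extend([f"f{i}"] * reps)
    return " ".join(terms)

print(encode_features_as_text([0.0, 0.13, 0.52]))
# -> 'f1' repeated 4 times, then 'f2' repeated 16 times
```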

Proceedings Article
01 Feb 2016
TL;DR: In this article, the authors propose a goal-driven web navigation task, where an agent navigates through a website, which is represented as a graph consisting of web pages as nodes and hyperlinks as directed edges, to find a web page in which a query appears.
Abstract: We propose goal-driven web navigation as a benchmark task for evaluating an agent with abilities to understand natural language and plan in partially observed environments. In this challenging task, an agent navigates through a website, which is represented as a graph consisting of web pages as nodes and hyperlinks as directed edges, to find a web page in which a query appears. The agent is required to have sophisticated high-level reasoning based on natural language and efficient sequential decision-making capability to succeed. We release a software tool, called WebNav, that automatically transforms a website into this goal-driven web navigation task, and as an example, we make WikiNav, a dataset constructed from the English Wikipedia. We extensively evaluate different variants of neural-net-based artificial agents on WikiNav and observe that the proposed goal-driven web navigation well reflects the advances in models, making it a suitable benchmark for evaluating future progress. Furthermore, we extend WikiNav with question-answer pairs from Jeopardy! and test the proposed agent based on recurrent neural networks against strong inverted index based search engines. The artificial agents trained on WikiNav outperform the engine-based approaches, demonstrating the capability of the proposed goal-driven navigation as a good proxy for measuring progress in real-world tasks such as focused crawling and question answering.

Journal ArticleDOI
TL;DR: An efficient private keyword search (EPKS) scheme that supports binary search over inverted index-based encrypted data is proposed, extended to dynamic settings (called DEPKS), and instantiated as a scheme whose complexity is logarithmic in the number of keywords.
Abstract: Querying over encrypted data is gaining increasing popularity in cloud-based data hosting services. Security and efficiency are recognized as two important and yet conflicting requirements for querying over encrypted data. In this article, we propose an efficient private keyword search (EPKS) scheme that supports binary search and extend it to dynamic settings (called DEPKS) for inverted index-based encrypted data. First, we describe our approaches of constructing a searchable symmetric encryption (SSE) scheme that supports binary search. Second, we present a novel framework for EPKS and provide its formal security definitions in terms of plaintext privacy and predicate privacy by modifying Shen et al.'s security notions [Shen et al. 2009]. Third, built on the proposed framework, we design an EPKS scheme whose complexity is logarithmic in the number of keywords. The scheme is based on groups of prime order and enjoys strong notions of security, namely statistical plaintext privacy and statistical predicate privacy. Fourth, we extend the EPKS scheme to support dynamic keyword and document updates. The extended scheme not only maintains the properties of logarithmic-time search efficiency, plaintext privacy, and predicate privacy but also has fewer rounds of communication for updates compared to existing dynamic search encryption schemes. We experimentally evaluate the proposed EPKS and DEPKS schemes and show that they are significantly more efficient in terms of both keyword search complexity and communication complexity than existing randomized SSE schemes.

Journal ArticleDOI
TL;DR: This work proposes an improved approach for integrating and managing massive remote-sensing data by adding a spatial code column in an array format in a database, so that spatial information in remote-sensing metadata can be stored and logically subdivided.
Abstract: Owing to the rapid development of earth observation technology, the volume of spatial information is growing rapidly; therefore, improving query retrieval speed from large, rich data sources for remote-sensing data management systems is quite urgent. A global subdivision model, geographic coordinate subdivision grid with one-dimension integer coding on 2n-tree, which we propose as a solution, has been used in data management organizations. However, because a spatial object may cover several grids, ample data redundancy will occur when data are stored in relational databases. To solve this redundancy problem, we first combined the subdivision model with the spatial array database containing the inverted index. We proposed an improved approach for integrating and managing massive remote-sensing data. By adding a spatial code column in an array format in a database, spatial information in remote-sensing metadata can be stored and logically subdivided. We implemented our method in a Kingbase Enterprise Server database system and compared the results with the Oracle platform by simulating worldwide image data. Experimental results showed that our approach performed better than Oracle in terms of data integration and time and space efficiency. Our approach also offers an efficient storage management system for existing storage centers and management systems.

Journal ArticleDOI
TL;DR: The Block Max WAND with Candidate Selection and Preserving Top-K Results algorithm, or BMW-CSP, is proposed and evaluated, and it is shown that the method is competitive when compared to baselines and may constitute an excellent alternative query processing method.
Abstract: We present a new query processing method for text search. We extend the BMW-CS algorithm to now preserve the top-k results, proposing BMW-CSP. We show through experiments that the method is competitive when compared to baselines. In this paper we propose and evaluate the Block Max WAND with Candidate Selection and Preserving Top-K Results algorithm, or BMW-CSP. It is an extension of BMW-CS, a method previously proposed by us. Although very efficient, BMW-CS does not guarantee preserving the top-k results for a given query. Algorithms that do not preserve the top results may reduce the quality of ranking results in search systems. BMW-CSP extends BMW-CS to ensure that the top-k results will have their rankings preserved. In the experiments we performed for computing the top-10 results, the final average time required for processing queries with BMW-CSP was lower than that required by the adopted baselines. For instance, when computing top-10 results, the average time achieved by MBMW, the best multi-tier baseline we found in the literature, was 36.29 ms per query, while the average time achieved by BMW-CSP was 19.64 ms per query. The price paid by BMW-CSP is the extra memory required to store partial scores of documents. As we show in the experiments, this price is not prohibitive and, in cases where it is acceptable, BMW-CSP may constitute an excellent alternative query processing method.

Book ChapterDOI
01 Jan 2016
TL;DR: This paper develops an approach to achieve faceted browsing in live collections, in which not only the contents but also the thesauri can be constantly reorganized, and proposes two indexing strategies to avoid the exponential worst-case growth of the navigation automaton.
Abstract: Faceted thesauri group classification terms into hierarchically arranged facets. They enable faceted browsing, a well-known browsing technique that makes it possible to narrow down digital collections by recursively adding filtering terms from the facet hierarchy. In this paper we develop an approach to achieve faceted browsing in live collections, in which not only the contents but also the thesauri can be constantly reorganized. For this purpose we start by introducing a faceted thesauri-based digital collection model in which users can freely rearrange the hierarchical organizations of facets. Then we analyze how to efficiently react to thesaurus reconfigurations by representing all the possible ways of browsing a collection with a finite state machine called a navigation automaton. Since, in the worst case, the number of states in navigation automata can grow exponentially with respect to the collections' sizes, we propose two indexing strategies to avoid this exponential worst-case complexity: one based on inverted indexes, and another inspired by hierarchical clustering, which makes use of so-called navigation dendrograms. Some experimental results concerning Clavy, a system for managing digital collections with reconfigurable structures in digital humanities and educational settings, provide evidence that the navigation dendrogram organization outperforms the inverted index-based one.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed an adaptive methodology based on a cost model to limit the prefix tree construction and reduce the space and time cost of the join, which significantly reduces the maximum memory requirements during the join.
Abstract: Given two collections of set objects R and S, the set containment join $R \bowtie_{\subseteq} S$ returns all object pairs $(r,s) \in R \times S$ such that $r \subseteq s$. Besides being a basic operator in all modern data management systems with a wide range of applications, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins (PRETTI) builds an inverted index on the right-hand collection S and a prefix tree on the left-hand collection R that groups set objects with common prefixes and thus avoids redundant processing. In this paper, we present a framework which improves PRETTI in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the objects of each collection based on their first contained item, assuming that the set objects are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This allows us to significantly reduce not only the join cost, but also the maximum memory requirements during the join. An experimental evaluation using both real and synthetic datasets shows that our framework outperforms PRETTI by a wide margin.
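
The inverted-index half of PRETTI-style processing can be sketched as follows: index the right-hand collection S by item and, for each left-hand object r, intersect the posting lists of r's items; every surviving s is a superset of r. The prefix tree that PRETTI and this paper build, which shares these intersections across left-hand objects with common prefixes, is omitted from this minimal sketch.

```python
from collections import defaultdict

def containment_join(R, S):
    """Return all pairs (r_id, s_id) with R[r_id] being a subset of S[s_id].
    Simplified PRETTI-style evaluation: build an inverted index on S and,
    for each r, intersect the posting lists of r's items."""
    inv = defaultdict(set)                    # item -> set of s_ids containing it
    for s_id, s in S.items():
        for item in s:
            inv[item].add(s_id)
    result = []
    for r_id, r in R.items():
        if not r:
            result.extend((r_id, s_id) for s_id in S)   # empty set is contained everywhere
            continue
        lists = sorted((inv.get(item, set()) for item in r), key=len)
        candidates = set(lists[0])
        for lst in lists[1:]:
            candidates &= lst
            if not candidates:
                break
        result.extend((r_id, s_id) for s_id in candidates)
    return result

R = {"r1": {1, 3}, "r2": {2, 5}}
S = {"s1": {1, 2, 3}, "s2": {1, 3, 5}, "s3": {2, 5, 7}}
print(containment_join(R, S))   # r1-s1, r1-s2, r2-s3 (order may vary)
```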

Proceedings ArticleDOI
24 Oct 2016
TL;DR: This work proposes a new indexing strategy that uniformly handles text, space and time in a single structure, and is thus able to efficiently evaluate queries that combine keywords with spatial and temporal constraints.
Abstract: From tweets to urban data sets, there has been an explosion in the volume of textual data that is associated with both temporal and spatial components. Efficiently evaluating queries over these data is challenging. Previous approaches have focused on the spatial aspect. Some used separate indices for space and text, thus incurring the overhead of storing separate indices and joining their results. Others proposed a combined index that either inserts terms into a spatial structure or adds a spatial structure to an inverted index. These benefit queries with highly selective constraints that match the primary index structure, but have limited effectiveness and pruning power otherwise. We propose a new indexing strategy that uniformly handles text, space, and time in a single structure, and is thus able to efficiently evaluate queries that combine keywords with spatial and temporal constraints. We present a detailed experimental evaluation using real data sets which shows that not only does our index attain substantially lower query processing times, but it can also be constructed in a fraction of the time required by state-of-the-art approaches.

Proceedings ArticleDOI
02 Sep 2016
TL;DR: The main contribution of this paper is the introduction of a novel system which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
Abstract: This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain. The source code for our Semantic Knowledge Graph implementation is being published along with this paper to facilitate further research and extensions of this work.
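
A toy version of materializing and scoring an edge from intersecting posting lists might look like the following; the relatedness score used here (observed overlap divided by the overlap expected by chance) is an illustrative stand-in, not the paper's scoring function.

```python
from collections import defaultdict

class TinySKG:
    """Toy semantic-knowledge-graph style index: terms are nodes, and an edge
    between two terms is materialized on demand as the intersection of their
    posting lists, scored by how much the overlap exceeds chance."""
    def __init__(self, docs):
        self.n_docs = len(docs)
        self.postings = defaultdict(set)
        for doc_id, terms in docs.items():
            for t in terms:
                self.postings[t].add(doc_id)

    def edge(self, a, b):
        docs_a, docs_b = self.postings[a], self.postings[b]
        overlap = docs_a & docs_b
        expected = len(docs_a) * len(docs_b) / self.n_docs  # chance co-occurrence
        score = len(overlap) / expected if expected else 0.0
        return overlap, score

docs = {
    1: ["java", "jvm", "garbage", "collection"],
    2: ["java", "coffee", "brew"],
    3: ["jvm", "bytecode"],
    4: ["coffee", "espresso"],
}
skg = TinySKG(docs)
print(skg.edge("java", "jvm"))       # ({1}, 1.0)
print(skg.edge("java", "espresso"))  # (set(), 0.0)
```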

Proceedings ArticleDOI
26 Jun 2016
TL;DR: The experiments show that the accuracy of VSM-Cilin is significantly improved compared with the traditional vector space model and the method of bidirectional mapping based on HITIR-Lab Tongyici Cilin.
Abstract: In this paper, a text similarity computation method named VSM-Cilin, which is based on a semantic vector space model, is proposed in the context of radio station applications. VSM-Cilin improves the traditional VSM in the following areas. First, it considers the semantic relations between words. Second, it uses semantic resources to reduce dimensionality. Third, it uses an inverted index to filter the candidate document set. Fourth, it takes the weight of each feature item into consideration when computing the similarity. The experiments show that the accuracy of VSM-Cilin is significantly improved compared with the traditional vector space model and the method of bidirectional mapping based on HITIR-Lab Tongyici Cilin.

Journal ArticleDOI
TL;DR: A content-based retrieval method for long-surveillance videos in wide-area (airborne) and near-field [closed-circuit television (CCTV)] imagery to retrieve video segments, with a focus on detecting objects moving on routes, that match user-defined events of interest.
Abstract: We present a content-based retrieval method for long surveillance videos in wide-area (airborne) and near-field [closed-circuit television (CCTV)] imagery. Our goal is to retrieve video segments, with a focus on detecting objects moving on routes, that match user-defined events of interest. The sheer size and remote locations where surveillance videos are acquired necessitate highly compressed representations that are also meaningful for supporting user-defined queries. To address these challenges, we archive long surveillance video through lightweight processing based on low-level local spatiotemporal extraction of motion and objects. These are then hashed into an inverted index using locality-sensitive hashing. This local approach allows for query flexibility and leads to significant gains in compression. Our second task is to extract partial matches to user-created queries and assemble them into full matches using dynamic programming (DP). DP assembles the indexed low-level features into a video segment that matches the query route by exploiting causality. We examine CCTV and airborne footage, whose low contrast makes motion extraction more difficult. We generate robust motion estimates for airborne data using a tracklet generation algorithm, while we use the Horn and Schunck approach to generate motion estimates for CCTV. Our approach handles long routes, low contrast, and occlusion. We derive bounds on the rate of false positives and demonstrate the effectiveness of the approach for counting, motion pattern recognition, and abandoned object applications.
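
The "hashed into an inverted index using locality-sensitive hashing" step can be illustrated with generic random-hyperplane LSH: each spatiotemporal feature vector is reduced to a short bit key, and an inverted index from keys to video segments returns approximate matches with a single bucket lookup. This is a textbook LSH sketch under assumed parameters, not the paper's exact hashing scheme.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(vec, planes):
    """Random-hyperplane LSH: one bit per hyperplane, from the sign of the dot product."""
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

# An inverted index from hash key -> list of (video_id, segment) entries gives
# approximate matches by looking up only the query's bucket.
planes = make_hyperplanes(dim=4, n_bits=8)
index = defaultdict(list)
index[lsh_key([0.1, 0.9, 0.0, 0.3], planes)].append(("cam7", "t=120..150"))
print(index[lsh_key([0.1, 0.9, 0.0, 0.3], planes)])   # same key -> same bucket
```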

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This work proposes an alternative framework that builds specialized single-term and pairwise index structures, and then during query time selectively accesses these structures based on a cost budget and a set of early termination techniques.
Abstract: Current search engines use very complex ranking functions based on hundreds of features. While such functions return high-quality results, they create efficiency challenges as it is too costly to fully evaluate them on all documents in the union, or even intersection, of the query terms. To address this issue, search engines use a series of cascading rankers, starting with a very simple ranking function and then applying increasingly complex and expensive ranking functions on smaller and smaller sets of candidate results. Researchers have recently started studying several problems within this framework of query processing by cascading rankers; see, e.g., [5, 13, 17, 51]. We focus on one such problem, the design of the initial cascade. Thus, the goal is to very quickly identify a set of good candidate documents that should be passed to the second and further cascades. Previous work by Asadi and Lin [3, 5] showed that while a top-k computation on either the union or intersection gives good results, a further optimization using a global document ordering based on spam scores leads to a significant reduction in quality. Our contribution is to propose an alternative framework that builds specialized single-term and pairwise index structures, and then during query time selectively accesses these structures based on a cost budget and a set of early termination techniques. Using an end-to-end evaluation with a complex machine-learned ranker, we show that our approach finds candidates about an order of magnitude faster than a conjunctive top-k computation, while essentially matching the quality.

Patent
08 Jun 2016
TL;DR: In this article, a multilayer quotation recommendation method based on a literature content mapping knowledge domain is proposed, where the query requirement consists of the key words of the title and the digest of a thesis which needs to recommend a quotation thesis or quotation literature.
Abstract: The invention discloses a multilayer quotation (citation) recommendation method based on a literature content mapping knowledge domain, and belongs to the field of information recommendation and intelligent information processing. The method comprises the following steps: first, obtaining the query requirement of a user, where the query requirement consists of the keywords of the title and the abstract of a thesis for which quotation theses or quotation literature need to be recommended; then, on the basis of the literature content mapping knowledge domain, expanding the retrieval terms of the query, where the mapping knowledge domain consists of the research object word nodes and research behavior word nodes of the literature, and edges which express various semantic relations, including synonymy, near-synonymy, hypernymy/hyponymy, part-whole, coordination, and the like; and finally, constructing the inverted index of the literature in a data set, selecting candidate quotations, calculating the similarity between each candidate quotation and the query, and adopting a gradient boosting regression tree to carry out quotation recommendation. The method carries out multilayer quotation recommendation on the basis of the literature content mapping knowledge domain, enlarges the range of candidate quotations, accurately expresses the research objects and contents of the thesis, improves the efficiency with which users obtain relevant literature, and has a wide application prospect.

Patent
06 Apr 2016
TL;DR: A bag-of-features image retrieval method based on Hash binary codes is proposed, comprising the following steps: a visual term list is established; tf-idf (term frequency-inverse document frequency) weight quantification of visual terms is carried out; visual term feature quantification of an image is carried out; an inverted index is established; a feature binary code projection direction is learned; feature binary code quantification is carried out; and candidate image sets are retrieved.
Abstract: The invention discloses a bag-of-features image retrieval method based on Hash binary codes. The method comprises the following steps: a visual term list is established; tf-idf (term frequency-inverse document frequency) weight quantification of visual terms is carried out; visual term feature quantification of an image is carried out; an inverted index is established; a feature binary code projection direction is learned; feature binary code quantification is carried out; and candidate image sets are retrieved. According to the method, an index is established for the image database, rapid image retrieval is realized, and retrieval efficiency is improved; moreover, through a binary code learning method with similarity preservation capability, the binary code is learned from spatial distance similarity and semantic distance similarity as a signature, and image retrieval accuracy is improved. The bag-of-features image retrieval technique based on Hash binary codes is efficient and accurate, and has relatively high practical value.
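
The tf-idf weighting of visual terms named in the abstract has the standard form tf(w, d) x log(N / df(w)); a minimal sketch of one common variant follows (the patent does not spell out its exact formula, so the variant below is an assumption).

```python
import math
from collections import Counter

def tf_idf_weights(doc_words, corpus_docs):
    """Standard tf-idf for a bag of visual words:
    tf(w, d) = count of w in d / total words in d
    idf(w)   = log(N / number of docs containing w)."""
    n_docs = len(corpus_docs)
    df = Counter()
    for d in corpus_docs:
        df.update(set(d))                 # document frequency counts each doc once
    tf = Counter(doc_words)
    total = len(doc_words)
    return {w: (c / total) * math.log(n_docs / df[w]) for w, c in tf.items()}

corpus = [["w1", "w2", "w2"], ["w2", "w3"], ["w1", "w3", "w4"]]
print(tf_idf_weights(corpus[0], corpus))
# w1 and w2 each appear in 2 of 3 docs -> modest idf; rare words would score higher
```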

Patent
22 Jun 2016
TL;DR: An information push method and device are described: a search request containing search contents is received from a terminal, a set of first candidate push information matching the search contents is extracted from an inverted index database, keywords in the search contents are determined, and the information to be pushed that matches the keywords is extracted from the first candidate push information set.
Abstract: This application discloses an information push method and device. A specific embodiment of the method comprises: receiving a search request sent by a terminal, wherein the search request comprises search contents; extracting a plurality of first candidate push information matching the search contents as a first candidate push information set from an inverted index database; determining key words in the search contents; extracting information to be pushed matching the key words as a set of information to be pushed from the first candidate push information set; and sending a search result page corresponding to the search contents and various information to be pushed in the set of information to be pushed to the terminal, so that the various information to be pushed are presented on the search result page in a predetermined order. The solution realizes pertinent information pushing.

Proceedings ArticleDOI
16 Oct 2016
TL;DR: This paper presents a corpus of deep features extracted from the YFCC100M images considering the fc6 hidden layer activation of the HybridNet deep convolutional neural network and presents experimental results obtained indexing this corpus with two distinct approaches: the Metric Inverted File and the Lucene Quantization.
Abstract: This paper presents a corpus of deep features extracted from the YFCC100M images, considering the fc6 hidden layer activations of the HybridNet deep convolutional neural network. For a set of randomly selected queries, we made available k-NN results obtained by sequentially scanning the entire feature set, comparing features using both the Euclidean distance and the Hamming distance on a binarized version of the features. This set of results is ground truth for evaluating Content-Based Image Retrieval (CBIR) systems that use approximate similarity search methods for efficient and scalable indexing. Moreover, we present experimental results obtained by indexing this corpus with two distinct approaches: the Metric Inverted File and Lucene Quantization. These two CBIR systems are publicly available online, allowing real-time search using both internal and external queries.

Proceedings ArticleDOI
24 Oct 2016
TL;DR: An alternative approach that uses cluster-based retrieval to quickly narrow the search scope guided by version representatives at Phase 1 and develops a hybrid index structure with adaptive runtime data traversal to speed up Phase 2 search is proposed.
Abstract: The previous two-phase method for searching versioned documents seeks a cost tradeoff by using non-positional information to rank document versions first. The second phase then re-ranks the top document versions using positional information with fragment-based index compression. This paper proposes an alternative approach that uses cluster-based retrieval to quickly narrow the search scope, guided by version representatives, at Phase 1, and develops a hybrid index structure with adaptive runtime data traversal to speed up Phase 2 search. The hybrid scheme exploits the advantages of the forward index and the inverted index based on term characteristics to minimize the time spent extracting positional and other feature information during runtime search. This paper compares several indexing and data traversal options with different time and space tradeoffs and describes evaluation results to demonstrate their effectiveness. The experiment results show that the proposed scheme can be up to about 4x as fast as the previous work on solid-state drives while retaining good relevance.

Proceedings ArticleDOI
01 Mar 2016
TL;DR: A blocking scheme, CER-Blocking, based on an inverted index structure and using different data evidences from a triple to maximize its effectiveness, is presented and empirically evaluated on real and synthetic datasets.
Abstract: The amount and diversity of data in the Semantic Web has grown considerably. RDF datasets have proportionally more problems than relational datasets due to the way data are published, usually without formal criteria. Entity Resolution is an important issue related to a well-known task of many research communities, and it aims at finding all representations that refer to the same entity in different datasets. Yet, it is still an open problem. Blocking methods are used to avoid the quadratic complexity of the brute force approach by clustering entities into blocks and limiting the evaluation of entity specifications to entity pairs within blocks. In recent years, only a few blocking methods were conceived to deal with RDF data, and novel blocking techniques are required for dealing with noisy and heterogeneous data in the Web of Data. In this paper we present a blocking scheme, CER-Blocking, which is based on an inverted index structure and uses different data evidences from a triple, aiming to maximize its effectiveness. To overcome the problems of data quality, or even the complete absence of some data, we use two blocking key definitions. This scheme is part of an ER approach based on a relational learning algorithm that addresses the problem by statistical approximation. It was empirically evaluated on real and synthetic datasets which are part of consolidated benchmarks found in the literature.
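
The general shape of inverted-index-based blocking can be sketched as follows: tokens drawn from an entity's triple values act as blocking keys, the index from key to entity ids defines the blocks, and only entities sharing a block are compared. This is a generic illustration, not CER-Blocking's two actual blocking-key definitions.

```python
from collections import defaultdict
import re

def token_blocks(entities):
    """entities: dict entity_id -> list of (predicate, value) pairs.
    Blocking key = each token from literal values; the inverted index
    token -> {entity ids} defines the blocks."""
    index = defaultdict(set)
    for eid, triples in entities.items():
        for _pred, value in triples:
            for tok in re.findall(r"\w+", str(value).lower()):
                index[tok].add(eid)
    return index

def candidate_pairs(index):
    """Compare only entities that co-occur in at least one block."""
    pairs = set()
    for block in index.values():
        block = sorted(block)
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                pairs.add((block[i], block[j]))
    return pairs

ents = {
    "dbpedia:Berlin": [("rdfs:label", "Berlin"), ("dbo:country", "Germany")],
    "geo:2950159":    [("name", "Berlin, Germany")],
    "dbpedia:Paris":  [("rdfs:label", "Paris")],
}
idx = token_blocks(ents)
print(candidate_pairs(idx))   # only the two Berlin entities are paired
```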