Showing papers by "Bin Yao" published in 2016


Proceedings ArticleDOI
14 Jun 2016
TL;DR: Simba is a scalable and efficient in-memory spatial query processing and analytics system for big spatial data that extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API.
Abstract: Large spatial data has become ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba's superior performance compared against other spatial analytics systems.

228 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: This paper introduces a general system model based on the concept of Oblivious Storage (OS), which can deal with queries requiring strong privacy properties, and proposes a new oblivious shuffle algorithm to optimize an existing OS scheme.
Abstract: As location-based services (LBSs) become popular, location-dependent queries have raised serious privacy concerns since they may disclose sensitive information in query processing. Among typical queries supported by LBSs, shortest path queries may reveal information about not only current locations of the clients, but also their potential destinations and travel plans. Unfortunately, existing methods for private shortest path computation suffer from weak privacy properties, low performance, or poor scalability. In this paper, we aim at a strong privacy guarantee, where the adversary can infer almost no information about the queries, with better performance and scalability. To achieve this goal, we introduce a general system model based on the concept of Oblivious Storage (OS), which can deal with queries requiring strong privacy properties. Furthermore, we propose a new oblivious shuffle algorithm to optimize an existing OS scheme. By making trade-offs between query performance, scalability and privacy properties, we design different schemes for private shortest path computation. Finally, we comprehensively evaluate our schemes on real road networks in a practical environment and show their efficiency.

28 citations


Journal ArticleDOI
TL;DR: It is proved theoretically that RT-HCN is both space-efficient and query-efficient: each node maintains a tolerable number of global indices, while highly concurrent queries can be processed within acceptable overhead.
Abstract: Cloud storage systems pose new challenges to the community in supporting efficient concurrent querying tasks for various data-intensive applications, where indices play an important role. In this paper, we explore a practical method to construct a two-layer indexing scheme for multi-dimensional data in diverse server-centric cloud storage systems. We first propose RT-HCN, an indexing scheme integrating an R-tree based indexing structure and an HCN-based routing protocol. RT-HCN organizes storage and compute nodes into an HCN overlay, one of the newly proposed server-centric data center topologies. Based on the properties of HCN, we design a specific index mapping technique to maintain layered global indices and corresponding query processing algorithms to support efficient query tasks. Then, we extend the idea of RT-HCN to another server-centric data center topology, DCell, discovering a potentially generalized and feasible way of deploying two-layer indexing schemes on other server-centric networks. Furthermore, we prove theoretically that RT-HCN is both space-efficient and query-efficient: each node maintains a tolerable number of global indices, while highly concurrent queries can be processed within acceptable overhead. We finally conduct targeted experiments on Amazon's EC2 platforms, comparing our design with RT-CAN, a similar indexing scheme for traditional P2P networks. The results validate the query efficiency of RT-HCN, especially its speedup of point queries, and show its potential applicability in future data centers.

17 citations


Proceedings ArticleDOI
31 Oct 2016
TL;DR: The Simba (Spatial In-Memory Big data Analytics) system, which offers scalable and efficient in-memory spatial query processing and analytics for big spatial data, is presented.
Abstract: We present the Simba (Spatial In-Memory Big data Analytics) system, which offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba natively extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It enables the construction of indexes over RDDs inside the engine in order to work with big spatial data and complex spatial operations. Simba also comes with an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput in big spatial data analysis. This demonstration proposal describes key ideas in the design of Simba, and presents a demonstration plan.

16 citations


Journal ArticleDOI
01 Jun 2016
TL;DR: This paper proposes an added flexibility to the query definition, where the similarity is an aggregation over the distances between p and any subset of M objects in Q for some support, and calls this new definition flexible aggregate similarity search.
Abstract: Aggregate similarity search, also known as aggregate nearest-neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the similarity is an aggregation (e.g., $\mathrm{sum}$, $\max$) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggregation over the distances between p and any subset of $\phi M$ objects in Q for some support $0 < \phi \le 1$
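The flexible query definition can be illustrated with a brute-force sketch. This is not the paper's algorithm, only a direct reading of the definition under stated assumptions: objects are 2-D points, distance is Euclidean, and the subset size is taken as ⌈φM⌉ (the rounding rule is an assumption). For monotone aggregates such as sum and max, the best subset for a candidate p is simply its ⌈φM⌉ nearest query objects. The function names are hypothetical.

```python
import math
from typing import Callable, Iterable, Sequence, Tuple

Point = Tuple[float, float]


def flexible_aggregate_distance(
    p: Point,
    queries: Sequence[Point],
    phi: float,
    agg: Callable[[Iterable[float]], float] = sum,
) -> float:
    """Aggregate distance from p to its best subset of ceil(phi * M) query objects.

    Assumption: for sum and max, the best subset is the one holding
    the nearest query objects, so sorting the distances is enough.
    """
    if not 0 < phi <= 1:
        raise ValueError("phi must satisfy 0 < phi <= 1")
    subset_size = math.ceil(phi * len(queries))  # assumed rounding rule
    distances = sorted(math.dist(p, q) for q in queries)
    return agg(distances[:subset_size])


def flexible_ann(
    database: Sequence[Point],
    queries: Sequence[Point],
    phi: float,
    agg: Callable[[Iterable[float]], float] = sum,
) -> Point:
    """Linear scan: return the database object with the smallest flexible aggregate distance."""
    return min(
        database,
        key=lambda p: flexible_aggregate_distance(p, queries, phi, agg),
    )
```

With φ = 1 this reduces to the ordinary aggregate nearest-neighbor query; a smaller φ lets a candidate win by being close to only part of the query group.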

11 citations


Journal ArticleDOI
TL;DR: The central idea is to swap the order of geometric operations and to compute the appearance probability in a multi-step manner; the paper also differentiates two forms of CSPTRQs: explicit and implicit.
Abstract: This paper studies the constrained-space probabilistic threshold range query (CSPTRQ) for moving objects, where objects move in a constrained space (i.e., objects are forbidden to be located in some specific areas), and objects' locations are uncertain. We differentiate two forms of CSPTRQs: explicit and implicit ones. Specifically, for each moving object o, we model its location uncertainty as a closed region, u, together with a probability density function. We also model a query range, R, as an arbitrary polygon. An explicit query can be reduced to a search (over all the u) that returns a set of tuples in the form of (o, p) such that p ≥ pt, where p is the probability of o being located in R, and 0 ≤ pt ≤ 1 is a given probabilistic threshold. In contrast, an implicit query returns only a set of objects (without attaching the specific probability information) whose probabilities of being located in R are higher than pt. The CSPTRQ is a variant of the traditional probabilistic threshold range query (PTRQ). As objects moving in a constrained space are common, it can clearly find many applications. At first sight, our problem can be easily tackled by extending existing methods used to answer the PTRQ. Unfortunately, those classical techniques are not well suited to our problem, due to a set of new challenges. Another method used to answer the constrained-space probabilistic range query (CSPRQ) can be easily extended to tackle our problem, but a simple adaptation of this method is inefficient, due to its weak pruning/validating capability. To solve our problem, we develop targeted solutions that are easy to understand and also easy to implement. Our central idea is to swap the order of geometric operations and to compute the appearance probability in a multi-step manner. We demonstrate the efficiency and effectiveness of the proposed methods through extensive experiments. Meanwhile, from the experimental results, we further perceive the difference between explicit and implicit queries; this finding is interesting and also meaningful, especially for other types of probabilistic threshold queries.
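To make the query semantics concrete, here is a naive Monte Carlo baseline, not the multi-step method the paper proposes. It rests on several simplifying assumptions that are not from the source: each uncertainty region is a disk with a uniform density, forbidden areas and the query range are axis-aligned rectangles (the paper allows an arbitrary polygon), and both query forms use the test p ≥ pt. All names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class MovingObject:
    oid: str
    cx: float
    cy: float
    radius: float  # assumed disk-shaped uncertainty region


def in_rect(x: float, y: float, r: Rect) -> bool:
    return r[0] <= x <= r[2] and r[1] <= y <= r[3]


def appearance_probability(
    obj: MovingObject,
    query: Rect,
    forbidden: Sequence[Rect],
    samples: int = 10_000,
    rng: Optional[random.Random] = None,
) -> float:
    """Estimate P(object in query range), conditioned on the allowed space."""
    rng = rng or random.Random(0)
    valid = hits = 0
    for _ in range(samples):
        x = rng.uniform(obj.cx - obj.radius, obj.cx + obj.radius)
        y = rng.uniform(obj.cy - obj.radius, obj.cy + obj.radius)
        if (x - obj.cx) ** 2 + (y - obj.cy) ** 2 > obj.radius ** 2:
            continue  # outside the uncertainty disk
        if any(in_rect(x, y, f) for f in forbidden):
            continue  # objects cannot be located in forbidden areas
        valid += 1
        if in_rect(x, y, query):
            hits += 1
    return hits / valid if valid else 0.0


def explicit_query(
    objects: Sequence[MovingObject], query: Rect,
    forbidden: Sequence[Rect], pt: float,
) -> List[Tuple[str, float]]:
    """Return (object id, probability) tuples that meet the threshold."""
    results = []
    for o in objects:
        p = appearance_probability(o, query, forbidden)
        if p >= pt:
            results.append((o.oid, p))
    return results


def implicit_query(
    objects: Sequence[MovingObject], query: Rect,
    forbidden: Sequence[Rect], pt: float,
) -> List[str]:
    """Return only the qualifying object ids, without probabilities."""
    return [oid for oid, _ in explicit_query(objects, query, forbidden, pt)]
```

In this baseline the implicit query does the same work as the explicit one and merely drops the probabilities; it gains nothing from not having to report an exact p.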

8 citations


Proceedings ArticleDOI
01 Jan 2016
TL;DR: A new structure to manage typed intervals, based on the standard interval tree, is developed, and efficient query algorithms are proposed; experiments verify the performance advantage of this solution over alternative methods.
Abstract: Assume that a database stores a set of intervals, each of which defines start and end points, a weight and a type. Typed intervals enrich the data representation and support applications involving different kinds of data intervals. Given a query time and type, the system reports k intervals that intersect the time, contain the type and have the largest weight. We develop a new structure to manage typed intervals based on the standard interval tree and propose efficient query algorithms. Experiments with synthetic datasets are conducted to verify the performance advantage of our solution over alternative methods.
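The query itself can be stated as a short linear-scan baseline. This sketch only pins down the semantics and is not the interval-tree structure the paper develops. It assumes closed intervals and a single type label per interval matched by equality; the class and function names are hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TypedInterval:
    start: float
    end: float
    weight: float
    type: str  # assumption: one type label per interval


def top_k_typed(
    intervals: Iterable[TypedInterval], time: float, type_: str, k: int
) -> List[TypedInterval]:
    """Return the k heaviest intervals that cover `time` and match `type_`."""
    matches = (
        iv for iv in intervals
        if iv.start <= time <= iv.end and iv.type == type_
    )
    # nlargest keeps at most k candidates in memory, heaviest first
    return heapq.nlargest(k, matches, key=lambda iv: iv.weight)
```

A scan like this touches every interval on each query, which is the cost an index over typed intervals is meant to avoid.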

3 citations