Author
Bin Yao
Other affiliations: Shenzhen University, Florida State University
Bio: Bin Yao is an academic researcher from Shanghai Jiao Tong University. The author has contributed to research in topics: Spatial query & Tree (data structure). The author has an hindex of 20, co-authored 79 publications receiving 1540 citations. Previous affiliations of Bin Yao include Shenzhen University & Florida State University.
Papers
More filters
••
14 Jun 2016TL;DR: Simba is a scalable and efficient in-memory spatial query processing and analytics for big spatial data that extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API.
Abstract: Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba's superior performance compared against other spatial analytics system.
228 citations
••
08 Apr 2013TL;DR: New SNN methods are designed, which provide customizable tradeoff between efficiency and communication cost, and are as secure as the encryption scheme E used to encrypt the query and the database, where E can be any well-established encryption schemes.
Abstract: In this paper, we investigate the secure nearest neighbor (SNN) problem, in which a client issues an encrypted query point E(q) to a cloud service provider and asks for an encrypted data point in E(D) (the encrypted database) that is closest to the query point, without allowing the server to learn the plaintexts of the data or the query (and its result). We show that efficient attacks exist for existing SNN methods [21], [15], even though they were claimed to be secure in standard security models (such as indistinguishability under chosen plaintext or ciphertext attacks). We also establish a relationship between the SNN problem and the order-preserving encryption (OPE) problem from the cryptography field [6], [5], and we show that SNN is at least as hard as OPE. Since it is impossible to construct secure OPE schemes in standard security models [6], [5], our results imply that one cannot expect to find the exact (encrypted) nearest neighbor based on only E(q) and E(D). Given this hardness result, we design new SNN methods by asking the server, given only E(q) and E(D), to return a relevant (encrypted) partition E(G) from E(D) (i.e., G ⊆ D), such that that E(G) is guaranteed to contain the answer for the SNN query. Our methods provide customizable tradeoff between efficiency and communication cost, and they are as secure as the encryption scheme E used to encrypt the query and the database, where E can be any well-established encryption schemes.
219 citations
••
11 Apr 2011TL;DR: A unified framework is proposed that addresses three variants of OL queries that find important applications in practice, and is instantiate the framework with several novel query processing algorithms.
Abstract: Optimal location (OL) queries are a type of spatial queries particularly useful for the strategic planning of resources. Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an L p space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence, the actual distance between two locations can differ significantly from their L p distance. Motivated by the deficiency of the existing techniques, this paper presents the first study on OL queries in road networks. We propose a unified framework that addresses three variants of OL queries that find important applications in practice, and we instantiate the framework with several novel query processing algorithms. We demonstrate the efficiency of our solutions through extensive experiments with real data.
111 citations
••
19 Apr 2017
TL;DR: This paper formally defines a novel dynamic online task assignment problem, called the trichromatic online matching in real-time spatial crowdsourcing (TOM) problem, which is proven to be NP-hard and presents a threshold-based randomized algorithm that not only guarantees a tighter competitive ratio but also includes an adaptive optimization technique, which can quickly learn the optimal threshold for the randomized algorithm.
Abstract: The prevalence of mobile Internet techniques and Online-To-Offline (O2O) business models has led the emergence of various spatial crowdsourcing (SC) platforms in our daily life. A core issue of SC is to assign real-time tasks to suitable crowd workers. Existing approaches usually focus on the matching of two types of objects, tasks and workers, or assume the static offline scenarios, where the spatio-temporal information of all the tasks and workers is known in advance. Recently, some new emerging O2O applications incur new challenges: SC platforms need to assign three types of objects, tasks, workers and workplaces, and support dynamic real-time online scenarios, where the existing solutions cannot handle. In this paper, based on the aforementioned challenges, we formally define a novel dynamic online task assignment problem, called the trichromatic online matching in real-time spatial crowdsourcing (TOM) problem, which is proven to be NP-hard. Thus, we first devise an efficient greedy online algorithm. However, the greedy algorithm can be trapped into local optimal solutions easily. We then present a threshold-based randomized algorithm that not only guarantees a tighter competitive ratio but also includes an adaptive optimization technique, which can quickly learn the optimal threshold for the randomized algorithm. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments on real and synthetic datasets.
95 citations
••
01 Mar 2010TL;DR: This work designs algorithms that could be implemented by SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the “best” query plan.
Abstract: Finding the k nearest neighbors (kNN) of a query point, or a set of query points (kNN-Join) are fundamental problems in many application domains. Many previous efforts to solve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required, which may limit their application on large data sets that are stored in a relational database management system. Furthermore, these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join in a relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses the user-defined-function (UDF) that a query optimizer cannot optimize.We design algorithms that could be implemented by SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the “best” query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees to find the approximate kNN with only logarithmic number of page accesses in expectation with a constant approximation ratio and it could be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large, real and synthetic, data sets confirm the efficiency and practicality of our approach.
90 citations
Cited by
More filters
••
TL;DR: A new method of keyword transformation based on the uni-gram is developed, which will simultaneously improve the accuracy and creates the ability to handle other spelling mistakes and consider the keyword weight when selecting an adequate matching file set.
Abstract: Keyword-based search over encrypted outsourced data has become an important tool in the current cloud computing scenario. The majority of the existing techniques are focusing on multi-keyword exact match or single keyword fuzzy search. However, those existing techniques find less practical significance in real-world applications compared with the multi-keyword fuzzy search technique over encrypted data. The first attempt to construct such a multi-keyword fuzzy search scheme was reported by Wang et al. , who used locality-sensitive hashing functions and Bloom filtering to meet the goal of multi-keyword fuzzy search. Nevertheless, Wang’s scheme was only effective for a one letter mistake in keyword but was not effective for other common spelling mistakes. Moreover, Wang’s scheme was vulnerable to server out-of-order problems during the ranking process and did not consider the keyword weight. In this paper, based on Wang et al. ’s scheme, we propose an efficient multi-keyword fuzzy ranked search scheme based on Wang et al. ’s scheme that is able to address the aforementioned problems. First, we develop a new method of keyword transformation based on the uni-gram, which will simultaneously improve the accuracy and creates the ability to handle other spelling mistakes. In addition, keywords with the same root can be queried using the stemming algorithm. Furthermore, we consider the keyword weight when selecting an adequate matching file set. Experiments using real-world data show that our scheme is practically efficient and achieve high accuracy.
464 citations
••
08 May 2013TL;DR: This paper presents a verifiable privacy-preserving multi-keyword text search (MTS) scheme with similarity-based ranking to address the problem of secure search functions over encrypted data and proposes two secure index schemes to meet the stringent privacy requirements under strong threat models.
Abstract: With the increasing popularity of cloud computing, huge amount of documents are outsourced to the cloud for reduced management cost and ease of access. Although encryption helps protecting user data confidentiality, it leaves the well-functioning yet practically-efficient secure search functions over encrypted data a challenging problem. In this paper, we present a privacy-preserving multi-keyword text search (MTS) scheme with similarity-based ranking to address this problem. To support multi-keyword search and search result ranking, we propose to build the search index based on term frequency and the vector space model with cosine similarity measure to achieve higher search result accuracy. To improve the search efficiency, we propose a tree-based index structure and various adaption methods for multi-dimensional (MD) algorithm so that the practical search efficiency is much better than that of linear search. To further enhance the search privacy, we propose two secure index schemes to meet the stringent privacy requirements under strong threat models, i.e., known ciphertext model and known background model. Finally, we demonstrate the effectiveness and efficiency of the proposed schemes through extensive experimental evaluation.
349 citations
••
01 Jun 2014TL;DR: This paper argues that existing location privacy techniques are not sufficient for SC, and a mechanism based on differential privacy and geocasting that achieves effective SC services while offering privacy guarantees to workers is proposed.
Abstract: Spatial Crowdsourcing (SC) is a transformative platform that engages individuals, groups and communities in the act of collecting, analyzing, and disseminating environmental, social and other spatio-temporal information. The objective of SC is to outsource a set of spatio-temporal tasks to a set of workers, i.e., individuals with mobile devices that perform the tasks by physically traveling to specified locations of interest. However, current solutions require the workers, who in many cases are simply volunteering for a cause, to disclose their locations to untrustworthy entities. In this paper, we introduce a framework for protecting location privacy of workers participating in SC tasks. We argue that existing location privacy techniques are not sufficient for SC, and we propose a mechanism based on differential privacy and geocasting that achieves effective SC services while offering privacy guarantees to workers. We investigate analytical models and task assignment strategies that balance multiple crucial aspects of SC functionality, such as task completion rate, worker travel distance and system overhead. Extensive experimental results on real-world datasets show that the proposed technique protects workers' location privacy without incurring significant performance metrics penalties.
343 citations