
Showing papers by "Bin Yao" published in 2015


Journal ArticleDOI
TL;DR: A new dynamic network optimizer called OFScheduler is proposed for heterogeneous clusters to relieve network traffic during the execution of MapReduce jobs by reducing bandwidth competition, balancing the workload of network links, and increasing bandwidth utilization.
Abstract: MapReduce is a popular programming paradigm in cloud computing due to its excellent scalability for processing large-scale data. However, MapReduce performs poorly in heterogeneous clusters. One of the reasons is that Hadoop's built-in load balancing algorithm for the Map function leads to excessive network traffic. We propose a new dynamic network optimizer called OFScheduler for heterogeneous clusters to relieve network traffic during the execution of MapReduce jobs. The optimizer focuses on reducing bandwidth competition, balancing the workload of network links, and increasing bandwidth utilization. It tags different types of traffic and utilizes OpenFlow to dynamically adjust the transfer of flows. We instantiate a simulator and an OpenFlow testbed for evaluation. The simulation results demonstrate that the proposed optimizer significantly increases bandwidth utilization and improves the performance of MapReduce by 24% to 63% for most jobs in a multi-path heterogeneous cluster. The experimental results show that the proposed optimizer can be deployed in a real environment.

27 citations
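
As an illustration only: the abstract describes OFScheduler tagging different traffic types and using OpenFlow to move flows between links, and a minimal sketch of that tag-then-rebalance idea might look like the following. All names here (TrafficClass, Flow, Path, schedule) are hypothetical and not from the paper; a real deployment would install the resulting assignments as OpenFlow rules on switches rather than compute them in memory.

```python
# Hypothetical sketch of OFScheduler's core idea: tag MapReduce flows by
# type and greedily move them onto the least-loaded candidate path.
# None of these names come from the paper; they are illustrative only.
from dataclasses import dataclass
from enum import Enum

class TrafficClass(Enum):
    SHUFFLE = 1         # map -> reduce transfers (latency-sensitive)
    LOAD_BALANCING = 2  # block movement toward slow nodes
    BULK = 3            # HDFS replication and other background traffic

@dataclass
class Flow:
    flow_id: str
    cls: TrafficClass
    rate_mbps: float

@dataclass
class Path:
    path_id: str
    capacity_mbps: float
    load_mbps: float = 0.0

    def utilization(self) -> float:
        return self.load_mbps / self.capacity_mbps

def schedule(flows: list[Flow], paths: list[Path]) -> dict[str, str]:
    """Assign each flow to the currently least-utilized path,
    placing latency-sensitive shuffle traffic first."""
    assignment = {}
    for flow in sorted(flows, key=lambda f: f.cls.value):
        target = min(paths, key=Path.utilization)
        target.load_mbps += flow.rate_mbps
        assignment[flow.flow_id] = target.path_id
    return assignment

if __name__ == "__main__":
    paths = [Path("p1", 1000.0), Path("p2", 1000.0)]
    flows = [Flow("f1", TrafficClass.SHUFFLE, 300.0),
             Flow("f2", TrafficClass.BULK, 500.0),
             Flow("f3", TrafficClass.SHUFFLE, 200.0)]
    print(schedule(flows, paths))
```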


Proceedings ArticleDOI
Hao Lin, Jingyu Zhou, Bin Yao, Minyi Guo, Jie Li
04 May 2015
TL;DR: This work proposes a column-wise compression approach for well-formatted log streams, where each log entry can be independently compressed or decompressed for analysis, and shows that this scheme outperforms traditional compression methods in decompression time while achieving a competitive compression ratio.
Abstract: Nowadays massive log streams are generated by many Internet and cloud services. Storing log streams consumes a large amount of disk space and incurs high cost. Traditional compression methods can be applied to reduce storage cost, but they are inefficient for log analysis, because fetching relevant log entries from compressed data often requires retrieving and decompressing large blocks of data. We propose a column-wise compression approach for well-formatted log streams, where each log entry can be independently compressed or decompressed for analysis. Specifically, we separate a log entry into several columns and compress each column with a different model. We have implemented our approach as a library and integrated it into two applications, a log search system and a log joining system. Experimental results show that our compression scheme outperforms traditional compression methods in decompression time and has a competitive compression ratio. For log search, our approach achieves better query times than traditional compression algorithms in both in-core and out-of-core cases. For joining log streams, our approach achieves the same join quality with only 30% of the memory required by uncompressed streams.

19 citations
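
To make the column-wise idea concrete, here is a minimal sketch (an assumption about how such a scheme could be laid out, not the authors' library): each space-separated field of a log line becomes a column, and each column is compressed separately, so reading one field does not require decompressing the others. The paper compresses each column with a model tuned to that column; plain zlib stands in for those models here.

```python
# Minimal sketch of column-wise log compression: each field of a
# well-formatted log entry goes into its own column, and columns are
# compressed independently so one column can be decoded without the rest.
# Field layout and the per-column "model" (zlib) are assumptions.
import zlib

LOGS = [
    "1433116800 GET /index.html 200",
    "1433116801 GET /logo.png 200",
    "1433116802 POST /login 302",
]

def compress_columns(lines: list[str]) -> list[bytes]:
    columns = list(zip(*(line.split(" ") for line in lines)))
    # One compressed blob per column; similar values compress well together.
    return [zlib.compress("\n".join(col).encode()) for col in columns]

def decompress_column(blobs: list[bytes], idx: int) -> list[str]:
    # Only the requested column is decompressed -- the key property
    # for fast selective log analysis.
    return zlib.decompress(blobs[idx]).decode().split("\n")

blobs = compress_columns(LOGS)
print(decompress_column(blobs, 1))  # ['GET', 'GET', 'POST']
```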


Journal ArticleDOI
01 Jun 2015
TL;DR: This paper studies the problem of kCP query processing in general metric spaces, namely Metric kCP (MkCP) search, proposes several efficient algorithms using dynamic disk-based metric indexes, and derives a node-based cost model for MkCP retrieval.
Abstract: Given two object sets P and Q, a k-closest pair (kCP) query finds the k closest object pairs from P × Q. This operation is common in many real-life applications such as GIS, data mining, and recommender systems. Although it has received much attention in the Euclidean space, there is little prior work on the metric space. In this paper, we study the problem of kCP query processing in general metric spaces, namely Metric kCP (MkCP) search, and propose several efficient algorithms using dynamic disk-based metric indexes (e.g., the M-tree), which can be applied to any type of data as long as a certain metric distance is defined and satisfies the triangle inequality. Our approaches follow depth-first and/or best-first traversal paradigms, employ effective pruning rules based on metric space properties and the counting information preserved in the metric index, take advantage of aggressive pruning and compensation to further boost query efficiency, and derive a node-based cost model for MkCP retrieval. In addition, we extend our techniques to tackle two interesting variants of MkCP queries. Extensive experiments with both real and synthetic data sets demonstrate the performance of our proposed algorithms, the effectiveness of our developed pruning rules, and the accuracy of our presented cost model.

17 citations
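
The paper's pruning rules rely on metric space properties, chiefly the triangle inequality. The toy sketch below (not the paper's M-tree algorithms) shows the essence: distances to a single pivot give the lower bound |d(p,v) − d(q,v)| ≤ d(p,q), which lets many exact distance computations be skipped once k candidate pairs are known.

```python
# Toy sketch of metric k-closest-pair search with triangle-inequality
# pruning. The paper's algorithms work over disk-based M-trees; here a
# single pivot stands in for the index: |d(p,v) - d(q,v)| is a cheap
# lower bound on d(p,q), so many distance computations can be skipped.
import heapq

def dist(a, b):  # any metric works; Euclidean here for the demo
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def metric_kcp(P, Q, k):
    pivot = P[0]                        # arbitrary pivot object
    dp = {id(p): dist(p, pivot) for p in P}
    dq = {id(q): dist(q, pivot) for q in Q}
    heap = []                           # max-heap of k best pairs (negated)
    for p in P:
        for q in Q:
            lb = abs(dp[id(p)] - dq[id(q)])   # triangle inequality
            if len(heap) == k and lb >= -heap[0][0]:
                continue                # pruned: cannot beat current k-th
            d = dist(p, q)
            if len(heap) < k:
                heapq.heappush(heap, (-d, p, q))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, p, q))
    return sorted((-nd, p, q) for nd, p, q in heap)

P = [(0, 0), (5, 5), (9, 1)]
Q = [(1, 0), (8, 8), (4, 4)]
print(metric_kcp(P, Q, k=2))
```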


Journal ArticleDOI
TL;DR: This paper studies the PRQ over objects moving in a constrained 2D space where objects are forbidden to be located in some specific areas, and uses a strategy called pre-approximation to reduce the initial problem to a highly simplified version, which makes the remaining steps easy to tackle.
Abstract: The probabilistic range query (PRQ) over uncertain moving objects has attracted much attention in recent years. Most existing works focus on the PRQ for objects moving freely in two-dimensional (2D) space. In contrast, this paper studies the PRQ over objects moving in a constrained 2D space where objects are forbidden to be located in some specific areas. We dub it the constrained space probabilistic range query (CSPRQ). We analyze its unique properties and show that processing the CSPRQ with a straightforward solution is infeasible. The key idea of our solution is a strategy called pre-approximation that reduces the initial problem to a highly simplified version, making the remaining steps easy to tackle; the strategy itself is simple and easy to implement. Motivated by the cost analysis, we further optimize our solution. The optimizations are mainly based on two insights: (i) the number of effective subdivisions is no more than 1; and (ii) an entity with a larger span is more likely to subdivide a single region. We demonstrate the effectiveness and efficiency of our proposed approaches through extensive experiments under various settings, and highlight an extra finding: the precomputation-based method suffers a non-trivial preprocessing time, which offers an important indication for future research.

16 citations
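
As a toy illustration of the CSPRQ semantics (not the paper's pre-approximation algorithm), the sketch below estimates, by Monte-Carlo sampling, the probability that an uncertain object whose location is uniform over a disk lies inside a query rectangle, after discarding the probability mass that falls into forbidden areas. All shapes and parameters are invented for the demo.

```python
# Toy Monte-Carlo illustration of the CSPRQ semantics (not the paper's
# pre-approximation algorithm): an object's location is uniform over a
# disk, mass falling in forbidden rectangles is cut away, and we ask for
# the probability that the object lies inside the query rectangle.
import random

def in_rect(pt, rect):
    (x, y), (x1, y1, x2, y2) = pt, rect
    return x1 <= x <= x2 and y1 <= y <= y2

def appearance_prob(center, radius, forbidden, query, n=100_000):
    hits = valid = 0
    for _ in range(n):
        # rejection-sample a point uniformly inside the uncertainty disk
        x = random.uniform(-radius, radius)
        y = random.uniform(-radius, radius)
        if x * x + y * y > radius * radius:
            continue
        pt = (center[0] + x, center[1] + y)
        if any(in_rect(pt, f) for f in forbidden):
            continue                # mass in forbidden areas is excluded
        valid += 1
        hits += in_rect(pt, query)
    return hits / valid if valid else 0.0

# object centered at (0,0), forbidden strip on the right, query = left half
print(appearance_prob((0, 0), 1.0, [(0.5, -1, 1, 1)], (-1, -1, 0, 1)))
```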


Proceedings ArticleDOI
25 May 2015
TL;DR: This work proposes a hybrid approach for generating proofs of cloud search results, which models search indices as sets and search operations as set intersections, and builds proofs based on RSA accumulators and aggregated membership and non-membership witnesses.
Abstract: As cloud computing has become prominent, the need for searching cloud data has grown increasingly urgent. However, cloud search may be incorrect due to errors by cloud providers and attacks from other malicious tenants. Previous work on verifiable computing returns results with probabilistically checkable proofs, but it targets applications other than search and requires a large computation overhead. We propose a hybrid approach for generating proofs of cloud search results. Specifically, we model search indices as sets and search operations as set intersections, and build proofs based on RSA accumulators and aggregated membership and non-membership witnesses. Because generating witnesses for large sets is computationally expensive, we employ interval-based witnesses for fast proof generation. To reduce proof size, our hybrid method uses Bloom filters when the set difference is large. Evaluation on real datasets shows that our hybrid approach generates proofs in 0.197s on average, up to 83.2% faster than previous work, with a smaller proof size. Experiments also show our approach allows incremental updates at constant cost.

2 citations
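
The proof machinery rests on RSA accumulators. A toy sketch of the membership-witness idea follows; the modulus is tiny and its factorization is visible, so this is purely illustrative, whereas a real system (and the paper's interval-based and Bloom-filter optimizations) needs a large RSA modulus of unknown factorization.

```python
# Toy RSA-accumulator membership proof -- the primitive the paper builds
# its search-result proofs on. INSECURE demo parameters: the factors of
# N are visible here, while a real deployment requires a large RSA
# modulus whose factorization nobody knows.
import hashlib

N = 2953 * 3373   # demo modulus (both factors are prime)
G = 65537         # public base

def hash_to_prime(item: str) -> int:
    """Map an item to a prime (linear scan from a hash; fine for a demo)."""
    x = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % 100_000
    x += 2
    while any(x % d == 0 for d in range(2, int(x ** 0.5) + 1)):
        x += 1
    return x

def accumulate(items):
    acc = G
    for it in items:
        acc = pow(acc, hash_to_prime(it), N)
    return acc

def witness(items, member):
    """Accumulating everything except `member` yields its membership witness."""
    return accumulate([it for it in items if it != member])

def verify(acc, member, wit):
    return pow(wit, hash_to_prime(member), N) == acc

docs = ["doc1", "doc7", "doc9"]
acc = accumulate(docs)
w = witness(docs, "doc7")
print(verify(acc, "doc7", w))   # True
print(verify(acc, "doc2", w))   # False (almost surely)
```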


Patent
02 Sep 2015
TL;DR: In this paper, a spatial-data-based secure range query method is presented in which, when the client acquires an ID (identifier) from the server, the ID is decrypted and then re-encrypted before the client returns data to the server.
Abstract: A spatial-data-based secure range query method is characterized in that when the client acquires an ID (identifier) from the server, the ID is decrypted and is re-encrypted before the client returns data to the server. The method has the advantages that query efficiency is ensured, data encryption is implemented, the data access pattern is hidden and protected, and the risk of information leakage is greatly decreased.
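
The patent abstract does not specify a cipher, but the decrypt-then-re-encrypt step it describes can be sketched with any randomized encryption scheme. The example below uses Fernet from the Python `cryptography` package as an assumed stand-in: because each encryption draws a fresh IV, the re-encrypted ID the client returns is unlinkable to the ciphertext it fetched, which is how the access pattern stays hidden.

```python
# Minimal sketch of the decrypt-then-re-encrypt step from the patent
# abstract. Fernet is an assumed stand-in cipher; the patent does not
# name one. Randomized encryption makes repeated accesses to the same
# ID look different to the server, hiding the data access pattern.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # held by the client only
f = Fernet(key)

stored_id = f.encrypt(b"node-42")        # what the server stores

# Client-side round trip during a range query:
plain_id = f.decrypt(stored_id)          # client decrypts the fetched ID
fresh_id = f.encrypt(plain_id)           # ...and re-encrypts before returning

assert f.decrypt(fresh_id) == plain_id
print(stored_id != fresh_id)             # True: server cannot link the two
```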