Author

Haoyu Tan

Bio: Haoyu Tan is an academic researcher from Hong Kong University of Science and Technology. The author has contributed to research in topics: Communication channel & Scalability. The author has an h-index of 11 and has co-authored 34 publications receiving 1028 citations.

Papers
Proceedings ArticleDOI
07 Dec 2011
TL;DR: This paper proposes an efficient parallel density-based clustering algorithm, implements it with a four-stage MapReduce paradigm, and adopts a quick partitioning strategy for large-scale non-indexed data.
Abstract: Data clustering is an important data mining technology that plays a crucial role in numerous scientific applications. However, it is challenging because real-world datasets have been growing rapidly to extra-large scale. Meanwhile, MapReduce is a desirable parallel programming platform that is widely applied in many data processing fields. In this paper, we propose an efficient parallel density-based clustering algorithm and implement it with a four-stage MapReduce paradigm. Furthermore, we adopt a quick partitioning strategy for large-scale non-indexed data. We study the metric for merging bordering partitions and optimize it. Finally, we evaluate our work on real large-scale datasets using the Hadoop platform. Results show that our approach achieves good speedup and scale-up.

213 citations

Proceedings ArticleDOI
22 Jun 2013
TL;DR: This paper studies a new path-finding query that finds the most frequent path (MFP) during user-specified time periods in large-scale historical trajectory data, and proposes efficient search algorithms together with novel indexes to speed up the processing of TPMFP.
Abstract: The rise of GPS-equipped mobile devices has led to the emergence of big trajectory data. In this paper, we study a new path-finding query that finds the most frequent path (MFP) during user-specified time periods in large-scale historical trajectory data. We refer to this query as time period-based MFP (TPMFP). Specifically, given a time period T, a source v_s and a destination v_d, TPMFP searches for the MFP from v_s to v_d during T. Though there exist several proposals for defining MFP, they only consider a fixed time period. Most importantly, we find that none of them reflects well people's common-sense notion of a frequent path, which can be described by three key properties, namely suffix-optimal (i.e., any suffix of an MFP is also an MFP), length-insensitive (i.e., MFP should not favor shorter or longer paths), and bottleneck-free (i.e., MFP should not contain infrequent edges). A TPMFP with the above properties will not only reveal the common routing preferences of past travelers, but also take time effectiveness into consideration. Therefore, our first task is to give a TPMFP definition that satisfies the above three properties. Then, given the comprehensive TPMFP definition, our next task is to find the TPMFP efficiently over a huge amount of trajectory data. In particular, we propose efficient search algorithms together with novel indexes to speed up the processing of TPMFP. To demonstrate both the effectiveness and the efficiency of our approach, we conduct extensive experiments using a real dataset containing over 11 million trajectories.

193 citations

Journal ArticleDOI
TL;DR: This paper presents MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce, and proposes a novel data partitioning method based on computation cost estimation that achieves desirable load balancing even in the context of heavily skewed data.
Abstract: DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. We conduct our evaluation using real large datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.

142 citations
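
For readers unfamiliar with the clustering technique that the abstract above parallelizes, the following is a minimal single-machine sketch of plain DBSCAN. It is not MR-DBSCAN and contains no MapReduce stage or cost-based partitioning; the function names, the parameter names `eps` and `min_pts`, and the sample points are illustrative assumptions.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain sequential DBSCAN; returns one cluster label per point."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


# Hypothetical sample data: two dense groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
sample_labels = dbscan(sample, eps=1.5, min_pts=3)
```

The quadratic neighbor search in `region_query` is the kind of cost that grows quickly on large or skewed data, which is the setting the abstract targets with parallel processing.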

Proceedings ArticleDOI
20 Sep 2010
TL;DR: This paper observes that, by generating intended patterns, some simultaneous transmissions can be successfully decoded without degrading the effective throughput of the original transmission, and proposes DC-MAC to leverage this "free" coordination channel for efficient medium access in a multiple-user wireless network.
Abstract: Interference is a critical issue in wireless communications. In a typical multiple-user environment, different users may severely interfere with each other. Coordination among users is therefore an indispensable part of interference management in wireless networks. It is known that coordination among multiple nodes is a costly operation that takes a significant amount of valuable communication resources. In this paper, we observe that, by generating intended patterns, some simultaneous transmissions, i.e., "interference", can be successfully decoded without degrading the effective throughput of the original transmission. As such, an extra and "free" coordination channel can be built. Based on this idea, we propose DC-MAC to leverage this "free" channel for efficient medium access in a multiple-user wireless network. We theoretically analyze the capacity of this channel under different environments with various modulation schemes. USRP2-based implementation experiments show that, compared with the widely adopted CSMA, DC-MAC can improve the channel utilization efficiency by up to 250%.

88 citations

Journal ArticleDOI
TL;DR: This paper conducts a systematic analysis of errors occurring at chip level and proposes Simple Rule, a simple yet effective method based on the chip error patterns to infer the link condition with an accuracy of over 96 percent in evaluations.
Abstract: The IEEE 802.15.4 standard specifies physical layer (PHY) and medium access control (MAC) sublayer protocols for low-rate and low-power communication applications. In this protocol, every 4-bit symbol is encoded into a sequence of 32 chips that are actually transmitted over the air. The 32 chips as a whole are also called a pseudonoise code (PN-Code). Due to complex channel conditions such as attenuation and interference, the transmitted PN-Code will often be received with some of its chips corrupted. In this paper, we conduct a systematic analysis of these errors occurring at chip level. We find that there are notable error patterns corresponding to different cases. We then show that recognizing these patterns enables us to identify the channel condition in great detail. We believe that understanding what happened to the transmission in this way can potentially benefit channel coding, routing, and error correction protocol design. Finally, we propose Simple Rule, a simple yet effective method based on the chip error patterns to infer the link condition with an accuracy of over 96 percent in our evaluations.

69 citations
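
The symbol-to-chip encoding described in the abstract above can be illustrated with a small sketch that maps a 4-bit symbol to 32 chips, corrupts a few chips, and counts chip errors after nearest-codeword decoding. The codebook here is a hypothetical stand-in built from Walsh-Hadamard rows, not the PN-Codes defined by IEEE 802.15.4, and the sketch does not implement the Simple Rule method; all names are assumptions made for illustration.

```python
CHIPS_PER_SYMBOL = 32
NUM_SYMBOLS = 16  # a 4-bit symbol takes one of 16 values


def make_codebook():
    """Hypothetical stand-in codebook: 16 rows of a 32x32 Walsh-Hadamard matrix.

    These are NOT the 802.15.4 PN-Codes; they only provide 16 distinct
    32-chip sequences so that chip-level errors can be counted.
    """
    return [
        [bin(symbol & j).count("1") % 2 for j in range(CHIPS_PER_SYMBOL)]
        for symbol in range(NUM_SYMBOLS)
    ]


def hamming(a, b):
    """Number of chip positions in which two chip sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def decode_symbol(received, codebook):
    """Pick the codeword closest to the received chips; return (symbol, chip errors)."""
    distances = [hamming(received, code) for code in codebook]
    best = min(range(len(codebook)), key=distances.__getitem__)
    return best, distances[best]


codebook = make_codebook()
sent_symbol = 0b1010
received = list(codebook[sent_symbol])
for position in (0, 5, 17):  # flip three chips to mimic channel corruption
    received[position] ^= 1

decoded_symbol, chip_errors = decode_symbol(received, codebook)
```

The per-symbol chip error count, and the positions of the flipped chips, are the kind of raw observations from which chip error patterns could be studied.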


Cited by
Journal ArticleDOI
Yu Zheng
TL;DR: A systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics, and introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors.
Abstract: The advances in location-acquisition and mobile computing techniques have generated massive spatial trajectory data, which represent the mobility of a diversity of moving objects, such as people, vehicles, and animals. Many techniques have been proposed for processing, managing, and mining trajectory data in the past decade, fostering a broad range of applications. In this article, we conduct a systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics. Following a road map from the derivation of trajectory data, to trajectory data preprocessing, to trajectory data management, and to a variety of mining tasks (such as trajectory pattern mining, outlier detection, and trajectory classification), the survey explores the connections, correlations, and differences among these existing techniques. This survey also introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors, to which more data mining and machine learning techniques can be applied. Finally, some public trajectory datasets are presented. This survey can help shape the field of trajectory data mining, providing a quick understanding of this field to the community.

1,289 citations

Proceedings ArticleDOI
24 Aug 2014
TL;DR: A citywide and real-time model for estimating the travel time of any path (represented as a sequence of connected road segments) in real time in a city, based on the GPS trajectories of vehicles received in current time slots and over a period of history as well as map data sources is proposed.
Abstract: In this paper, we propose a citywide and real-time model for estimating the travel time of any path (represented as a sequence of connected road segments) in real time in a city, based on the GPS trajectories of vehicles received in current time slots and over a period of history as well as map data sources. Though this is a strategically important task in many traffic monitoring and routing systems, the problem has not been well solved yet given the following three challenges. The first is the data sparsity problem, i.e., many road segments may not be traveled by any GPS-equipped vehicles in the present time slot. In most cases, we cannot find a trajectory exactly traversing a query path either. Second, for a fragment of a path with trajectories, there are multiple ways of using (or combining) the trajectories to estimate the corresponding travel time. Finding an optimal combination is a challenging problem, subject to a tradeoff between the length of a path and the number of trajectories traversing the path (i.e., support). Third, we need to instantly answer users' queries, which may occur in any part of a given city. This calls for an efficient, scalable and effective solution that can enable citywide and real-time travel time estimation. To address these challenges, we model different drivers' travel times on different road segments in different time slots with a three-dimensional tensor. Combined with geospatial, temporal and historical contexts learned from trajectories and map data, we fill in the tensor's missing values through a context-aware tensor decomposition approach. We then devise and prove an objective function to model the aforementioned tradeoff, with which we find the optimal concatenation of trajectories for an estimate through a dynamic programming solution. In addition, we propose using frequent trajectory patterns (mined from historical trajectories) to scale down the candidates of concatenation and a suffix-tree-based index to manage the trajectories received in the present time slot. We evaluate our method based on extensive experiments, using GPS trajectories generated by more than 32,000 taxis over a period of two months. The results demonstrate the effectiveness, efficiency and scalability of our method beyond baseline approaches.

488 citations

Posted Content
TL;DR: Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.
Abstract: This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.

476 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: A novel scalable algorithm for time series subsequence all-pairs-similarity-search that computes the answer to the time series motif and time series discord problem as a side-effect, and incidentally provides the fastest known algorithm for both these extensively-studied problems.
Abstract: The all-pairs-similarity-search (or similarity join) problem has been extensively studied for text and a handful of other datatypes. However, surprisingly little progress has been made on similarity joins for time series subsequences. The lack of progress probably stems from the daunting nature of the problem. For even modest sized datasets the obvious nested-loop algorithm can take months, and the typical speed-up techniques in this domain (i.e., indexing, lower-bounding, triangular-inequality pruning and early abandoning) at best produce one or two orders of magnitude speedup. In this work we introduce a novel scalable algorithm for time series subsequence all-pairs-similarity-search. For exceptionally large datasets, the algorithm can be trivially cast as an anytime algorithm and produce high-quality approximate solutions in reasonable time. The exact similarity join algorithm computes the answer to the time series motif and time series discord problem as a side-effect, and our algorithm incidentally provides the fastest known algorithm for both these extensively-studied problems. We demonstrate the utility of our ideas for two time series data mining problems, including motif discovery and novelty discovery.

452 citations
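
The "obvious nested-loop algorithm" that the abstract above names as the slow baseline can be sketched as follows. This is not the paper's scalable algorithm; the use of z-normalized Euclidean distance, the exclusion zone of half the subsequence length, and all names and sample values are assumptions made for illustration.

```python
from math import sqrt


def znorm(seq):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    n = len(seq)
    mean = sum(seq) / n
    std = sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def nearest_neighbor_distances(series, m):
    """For every length-m subsequence, the distance to its nearest non-trivial match.

    Brute force: every pair of subsequences is compared, so the cost grows
    quadratically with the number of subsequences.
    """
    subs = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    exclusion = m // 2  # assumption: skip near-overlapping (trivial) matches
    profile = []
    for i, a in enumerate(subs):
        best = float("inf")
        for j, b in enumerate(subs):
            if abs(i - j) <= exclusion:
                continue
            d = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            best = min(best, d)
        profile.append(best)
    return profile


# Hypothetical series in which the pattern 0, 1, 0 occurs at positions 0 and 6.
series = [0, 1, 0, 5, 2, 7, 0, 1, 0, 3, 9, 4]
profile = nearest_neighbor_distances(series, m=3)
```

In this kind of result, small values indicate subsequences that repeat elsewhere in the series (motif candidates) and large values indicate subsequences with no close match (discord candidates).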

Proceedings ArticleDOI
25 Mar 2012
TL;DR: The frequency diversity of the subcarriers in OFDM systems is explored and a novel approach called FILA is proposed, which leverages the channel state information (CSI) to alleviate multipath effect at the receiver, which can significantly improve the localization accuracy compared with the corresponding RSSI approach.
Abstract: Indoor positioning systems have received increasing attention for supporting location-based services in indoor environments. WiFi-based indoor localization has been attractive due to its open access and low cost properties. However, the distance estimation based on received signal strength indicator (RSSI) is easily affected by the temporal and spatial variance due to the multipath effect, which contributes to most of the estimation errors in current systems. How to eliminate such effect so as to enhance the indoor localization performance is a big challenge. In this work, we analyze this effect across the physical layer and account for the undesirable RSSI readings being reported. We explore the frequency diversity of the subcarriers in OFDM systems and propose a novel approach called FILA, which leverages the channel state information (CSI) to alleviate multipath effect at the receiver. We implement the FILA system on commercial 802.11 NICs, and then evaluate its performance in different typical indoor scenarios. The experimental results show that the accuracy and latency of distance calculation can be significantly enhanced by using CSI. Moreover, FILA can significantly improve the localization accuracy compared with the corresponding RSSI approach.

359 citations