Author

Wuman Luo

Other affiliations: University of Macau
Bio: Wuman Luo is an academic researcher from Hong Kong University of Science and Technology. The author has contributed to research in topics: Computer science & Scalability. The author has an h-index of 8 and has co-authored 14 publications receiving 648 citations. Previous affiliations of Wuman Luo include University of Macau.

Papers
Proceedings ArticleDOI
07 Dec 2011
TL;DR: This paper proposes an efficient parallel density-based clustering algorithm, implements it as a 4-stage MapReduce paradigm, and adopts a quick partitioning strategy for large-scale non-indexed data.
Abstract: Data clustering is an important data mining technology that plays a crucial role in numerous scientific applications. However, it is challenging because the size of real-world datasets has been growing rapidly to extra-large scale. Meanwhile, MapReduce is a desirable parallel programming platform that is widely applied in many kinds of data processing fields. In this paper, we propose an efficient parallel density-based clustering algorithm and implement it as a 4-stage MapReduce paradigm. Furthermore, we adopt a quick partitioning strategy for large-scale non-indexed data. We study the metric of merge among bordering partitions and make optimizations on it. Finally, we evaluate our work on real large-scale datasets using the Hadoop platform. Results reveal that the speedup and scale-up of our work are very efficient.
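The merge among bordering partitions mentioned in the abstract can be pictured with a small union-find sketch. This is only an illustration under assumptions, not the paper's algorithm: the function name `merge_clusters`, the `(partition, local_id)` label format, and the input of label pairs are hypothetical, and the pairs are assumed to be already filtered to border points that justify a merge (for example, core points).

```python
def merge_clusters(shared_labels):
    """Map local cluster labels to global representatives.

    shared_labels: iterable of (label_a, label_b) pairs, one per border point
    that was clustered in two neighbouring partitions (hypothetical input format).
    """
    parent = {}

    def find(x):
        # Find the representative of x, halving the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in shared_labels:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # union the two local clusters

    return {label: find(label) for label in list(parent)}


# Labels are (partition, local cluster id) tuples, an assumed convention.
pairs = [
    (("p1", 0), ("p2", 3)),
    (("p2", 3), ("p3", 1)),
    (("p1", 2), ("p2", 5)),
]
mapping = merge_clusters(pairs)
```

Here the first two pairs chain three local clusters into one global cluster, while the third pair forms a separate one.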

213 citations

Proceedings ArticleDOI
22 Jun 2013
TL;DR: A new path finding query which finds the most frequent path (MFP) during user-specified time periods in large-scale historical trajectory data and proposes efficient search algorithms together with novel indexes to speed up the processing of TPMFP.
Abstract: The rise of GPS-equipped mobile devices has led to the emergence of big trajectory data. In this paper, we study a new path finding query which finds the most frequent path (MFP) during user-specified time periods in large-scale historical trajectory data. We refer to this query as time period-based MFP (TPMFP). Specifically, given a time period T, a source v_s and a destination v_d, TPMFP searches the MFP from v_s to v_d during T. Though there exist several proposals on defining MFP, they only consider a fixed time period. Most importantly, we find that none of them reflects people's common-sense notion of an MFP, which can be described by three key properties, namely suffix-optimal (i.e., any suffix of an MFP is also an MFP), length-insensitive (i.e., MFP should not favor shorter or longer paths), and bottleneck-free (i.e., MFP should not contain infrequent edges). The TPMFP with the above properties will not only reveal common routing preferences of past travelers, but also take time effectiveness into consideration. Therefore, our first task is to give a TPMFP definition that satisfies the above three properties. Then, given the comprehensive TPMFP definition, our next task is to find TPMFP over a huge amount of trajectory data efficiently. In particular, we propose efficient search algorithms together with novel indexes to speed up the processing of TPMFP. To demonstrate both the effectiveness and the efficiency of our approach, we conduct extensive experiments using a real dataset containing over 11 million trajectories.
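As a rough illustration of the query inputs (a time period T over historical trajectories), the sketch below counts how often each road edge is traversed within T. It does not reproduce the paper's TPMFP definition or its indexes; the trajectory format (lists of `(node, timestamp)` pairs) and the name `edge_frequencies` are assumptions made for this example.

```python
from collections import Counter


def edge_frequencies(trajectories, t_start, t_end):
    """Count traversals of each directed edge that lie fully inside [t_start, t_end].

    Each trajectory is assumed to be a list of (node, timestamp) pairs in time order.
    """
    counts = Counter()
    for traj in trajectories:
        # Consecutive samples form one traversed edge (u -> v).
        for (u, tu), (v, tv) in zip(traj, traj[1:]):
            if t_start <= tu and tv <= t_end:
                counts[(u, v)] += 1
    return counts


trajs = [
    [("a", 0), ("b", 5), ("c", 9)],
    [("a", 2), ("b", 6), ("d", 12)],
    [("a", 20), ("b", 25)],
]
freq = edge_frequencies(trajs, 0, 10)
```

Per-edge frequencies of this kind are the sort of quantity a bottleneck-free path would need to consult, since such a path should avoid edges with low counts in the chosen period.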

193 citations

Journal ArticleDOI
TL;DR: MR-DBSCAN is presented, a scalable DBSCAN algorithm using MapReduce that achieves desirable load balancing even in the context of heavily skewed data and proposes a novel data partitioning method based on computation cost estimation.
Abstract: DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. We conduct our evaluation using real large datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
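The idea of partitioning by estimated cost rather than by equal area can be sketched as follows. This is a simplified stand-in, not the MR-DBSCAN partitioner: the cost of a partition is assumed here to be its point count, and the function name `partition` and the median split are choices made for this example only.

```python
def partition(points, max_cost):
    """Recursively split 2D points until each part's estimated cost <= max_cost.

    Cost is estimated as the number of points (an assumption for this sketch).
    """
    if max_cost < 1:
        raise ValueError("max_cost must be at least 1")
    if len(points) <= max_cost:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Split along the axis with the larger extent.
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2  # median split keeps both halves equally loaded
    return partition(ordered[:mid], max_cost) + partition(ordered[mid:], max_cost)


# Heavily skewed input: six points near the origin, two far away.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (100, 100), (90, 95)]
parts = partition(pts, max_cost=2)
```

Because splits follow the data rather than a fixed grid, the dense region near the origin ends up divided into several small partitions instead of overloading a single task.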

142 citations

Proceedings ArticleDOI
29 Oct 2012
TL;DR: The design and implementation of CloST, a scalable big spatio-temporal data storage system to support data analytics using Hadoop is presented and the results show that CloST has fast data loading speed, desirable scalability in query processing, as well as high data compression ratio.
Abstract: During the past decade, various GPS-equipped devices have generated a tremendous amount of data with time and location information, which we refer to as big spatio-temporal data. In this paper, we present the design and implementation of CloST, a scalable big spatio-temporal data storage system to support data analytics using Hadoop. The main objective of CloST is to avoid scanning the whole dataset when a spatio-temporal range is given. To this end, we propose a novel data model which has special treatments on three core attributes: an object id, a location and a time. Based on this data model, CloST hierarchically partitions data using all core attributes, which enables efficient parallel processing of spatio-temporal range scans. According to the data characteristics, we devise a compact storage structure which reduces the storage size by an order of magnitude. In addition, we propose scalable bulk loading algorithms capable of incrementally adding new data into the system. We conduct our experiments using a very large GPS log dataset and the results show that CloST has fast data loading speed, desirable scalability in query processing, as well as high data compression ratio.
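A hierarchical partition over the three core attributes might look like the sketch below. The abstract does not specify the order of levels or how each attribute is bucketed, so everything here is an assumption: the name `partition_key`, the CRC32-based object bucket, the fixed-size grid cell, and the hourly time slot.

```python
import zlib


def partition_key(object_id, lon, lat, ts, n_buckets=4, cell_deg=0.1, slot_sec=3600):
    """Return a (bucket, cell, slot) key for one GPS record.

    All three levels are hypothetical choices for illustration:
    a hashed object-id bucket, a fixed-size grid cell, and a time slot.
    """
    bucket = zlib.crc32(object_id.encode("utf-8")) % n_buckets  # stable across runs
    cell = (int(lon // cell_deg), int(lat // cell_deg))
    slot = int(ts // slot_sec)
    return (bucket, cell, slot)


k1 = partition_key("taxi-1", 114.26, 22.33, 7200)
k2 = partition_key("taxi-1", 114.27, 22.34, 7300)
k3 = partition_key("taxi-1", 114.26, 22.33, 90000)
```

With keys of this shape, a spatio-temporal range query only needs to read the partitions whose cell and slot overlap the requested range, instead of scanning the whole dataset.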

69 citations

Proceedings ArticleDOI
23 Jul 2012
TL;DR: A cost model is proposed to demonstrate that it is important to take both communication and computation costs into account as dimensionality and data volume increase, and DAA (Dimension Aggregation Approximation) is proposed, an efficient compression approach that can help significantly reduce both these costs when performing parallel HDSJs.
Abstract: High-dimensional similarity join (HDSJ) is critical for many novel applications in the domain of mobile data management. Nowadays, performing HDSJs efficiently faces two challenges. First, the scale of datasets is increasing rapidly, making parallel computing on a scalable platform a must. Second, the dimensionality of the data can be up to hundreds or even thousands, which brings about the curse of dimensionality. In this paper, we address these challenges and study how to perform parallel HDSJs efficiently in the MapReduce paradigm. In particular, we propose a cost model to demonstrate that it is important to take both communication and computation costs into account as dimensionality and data volume increase. To this end, we propose DAA (Dimension Aggregation Approximation), an efficient compression approach that can help significantly reduce both these costs when performing parallel HDSJs. Moreover, we design DAA-based parallel HDSJ algorithms which can scale up to massive data sizes and very high dimensionality. We perform extensive experiments using both synthetic and real datasets to evaluate the speedup and the scale-up of our algorithms.
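The abstract does not define DAA, so the sketch below uses a generic aggregation scheme as a stand-in to show why compressing dimensions can cut both costs: each run of consecutive dimensions is replaced by its sum, and the Cauchy-Schwarz inequality then gives a lower bound on the true Euclidean distance. The function names and the grouping rule are assumptions for this example.

```python
import math


def aggregate(vec, group):
    """Sum each run of `group` consecutive dimensions (the last run may be shorter)."""
    return [sum(vec[i:i + group]) for i in range(0, len(vec), group)]


def lower_bound(agg_x, agg_y, dim, group):
    """Lower bound on the Euclidean distance of the original dim-dimensional vectors.

    For a run of size s, (sum of differences)^2 <= s * (sum of squared differences).
    """
    total = 0.0
    for k, (a, b) in enumerate(zip(agg_x, agg_y)):
        size = min(group, dim - k * group)
        total += (a - b) ** 2 / size
    return math.sqrt(total)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = [1, 2, 3, 4, 5, 6]
y = [2, 2, 5, 1, 5, 9]
ax, ay = aggregate(x, 2), aggregate(y, 2)
lb = lower_bound(ax, ay, dim=len(x), group=2)
exact = euclidean(x, y)
```

In a similarity join with threshold eps, any pair whose lower bound already exceeds eps can be discarded using only the compressed vectors, which are cheaper both to ship between nodes and to compare.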

26 citations


Cited by
Journal ArticleDOI
Yu Zheng
TL;DR: A systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics, and introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors.
Abstract: The advances in location-acquisition and mobile computing techniques have generated massive spatial trajectory data, which represent the mobility of a diversity of moving objects, such as people, vehicles, and animals. Many techniques have been proposed for processing, managing, and mining trajectory data in the past decade, fostering a broad range of applications. In this article, we conduct a systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics. Following a road map from the derivation of trajectory data, to trajectory data preprocessing, to trajectory data management, and to a variety of mining tasks (such as trajectory pattern mining, outlier detection, and trajectory classification), the survey explores the connections, correlations, and differences among these existing techniques. This survey also introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors, to which more data mining and machine learning techniques can be applied. Finally, some public trajectory datasets are presented. This survey can help shape the field of trajectory data mining, providing a quick understanding of this field to the community.

1,289 citations

Journal ArticleDOI
TL;DR: The origin and main issues facing the smart city concept are introduced, and the fundamentals of a smart city by analyzing its definition and application domains are presented.
Abstract: Rapid urbanization creates new challenges and issues, and the smart city concept offers opportunities to rise to these challenges, solve urban problems and provide citizens with a better living environment. This paper presents an exhaustive literature survey of smart cities. First, it introduces the origin and main issues facing the smart city concept, and then presents the fundamentals of a smart city by analyzing its definition and application domains. Second, a data-centric view of smart city architectures and key enabling technologies is provided. Finally, a survey of recent smart city research is presented. This paper provides a reference to researchers who intend to contribute to smart city research and implementation.

536 citations

Proceedings ArticleDOI
24 Aug 2014
TL;DR: A citywide and real-time model for estimating the travel time of any path (represented as a sequence of connected road segments) in real time in a city, based on the GPS trajectories of vehicles received in current time slots and over a period of history as well as map data sources is proposed.
Abstract: In this paper, we propose a citywide and real-time model for estimating the travel time of any path (represented as a sequence of connected road segments) in real time in a city, based on the GPS trajectories of vehicles received in current time slots and over a period of history as well as map data sources. Though this is a strategically important task in many traffic monitoring and routing systems, the problem has not been well solved yet given the following three challenges. The first is the data sparsity problem, i.e., many road segments may not be traveled by any GPS-equipped vehicles in the present time slot. In most cases, we cannot find a trajectory exactly traversing a query path either. Second, for the fragment of a path with trajectories, there are multiple ways of using (or combining) the trajectories to estimate the corresponding travel time. Finding an optimal combination is a challenging problem, subject to a tradeoff between the length of a path and the number of trajectories traversing the path (i.e., support). Third, we need to instantly answer users' queries which may occur in any part of a given city. This calls for an efficient, scalable and effective solution that can enable a citywide and real-time travel time estimation. To address these challenges, we model different drivers' travel times on different road segments in different time slots with a three-dimensional tensor. Combined with geospatial, temporal and historical contexts learned from trajectories and map data, we fill in the tensor's missing values through a context-aware tensor decomposition approach. We then devise and prove an objective function to model the aforementioned tradeoff, with which we find the optimal concatenation of trajectories for an estimate through a dynamic programming solution. In addition, we propose using frequent trajectory patterns (mined from historical trajectories) to scale down the candidates of concatenation and a suffix-tree-based index to manage the trajectories received in the present time slot. We evaluate our method based on extensive experiments, using GPS trajectories generated by more than 32,000 taxis over a period of two months. The results demonstrate the effectiveness, efficiency and scalability of our method beyond baseline approaches.

488 citations

Posted Content
TL;DR: Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.
Abstract: This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.

476 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: A novel scalable algorithm for time series subsequence all-pairs-similarity-search that computes the answer to the time series motif and time series discord problem as a side-effect, and incidentally provides the fastest known algorithm for both these extensively-studied problems.
Abstract: The all-pairs-similarity-search (or similarity join) problem has been extensively studied for text and a handful of other datatypes. However, surprisingly little progress has been made on similarity joins for time series subsequences. The lack of progress probably stems from the daunting nature of the problem. For even modest sized datasets the obvious nested-loop algorithm can take months, and the typical speed-up techniques in this domain (i.e., indexing, lower-bounding, triangular-inequality pruning and early abandoning) at best produce one or two orders of magnitude speedup. In this work we introduce a novel scalable algorithm for time series subsequence all-pairs-similarity-search. For exceptionally large datasets, the algorithm can be trivially cast as an anytime algorithm and produce high-quality approximate solutions in reasonable time. The exact similarity join algorithm computes the answer to the time series motif and time series discord problem as a side-effect, and our algorithm incidentally provides the fastest known algorithm for both these extensively-studied problems. We demonstrate the utility of our ideas for two time series data mining problems, including motif discovery and novelty discovery.

452 citations