
Showing papers in "IEEE Transactions on Big Data in 2017"


Journal ArticleDOI
TL;DR: This research provides an innovative data mining framework that synthesizes state-of-the-art techniques for extracting mobility patterns from raw mobile phone CDR data, and designs a pipeline that can translate the massive and passive mobile phone records into meaningful spatial human mobility patterns readily interpretable for urban and transportation planning purposes.
Abstract: In this study, with Singapore as an example, we demonstrate how we can use mobile phone call detail record (CDR) data, which contains millions of anonymous users, to extract individual mobility networks comparable to the activity-based approach. Such an approach is widely used in the transportation planning practice to develop urban micro simulations of individual daily activities and travel; yet it depends highly on detailed travel survey data to capture individual activity-based behavior. We provide an innovative data mining framework that synthesizes the state-of-the-art techniques in extracting mobility patterns from raw mobile phone CDR data, and design a pipeline that can translate the massive and passive mobile phone records to meaningful spatial human mobility patterns readily interpretable for urban and transportation planning purposes. With growing ubiquitous mobile sensing, and shrinking labor and fiscal resources in the public sector globally, the method presented in this research can be used as a low-cost alternative for transportation and planning agencies to understand the human activity patterns in cities, and provide targeted plans for future sustainable development.
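
To make one step of such a pipeline concrete, the sketch below (not the authors' code) collapses a toy CDR sequence into stay points and chains them into a per-user mobility network; the user IDs, tower IDs, timestamps, and the 30-minute stay threshold are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical CDR rows: (user_id, timestamp, cell_tower_id). Tower IDs and the
# 30-minute stay threshold are illustrative assumptions, not values from the paper.
cdr = [
    ("u1", datetime(2017, 3, 1, 8, 0), "towerA"),
    ("u1", datetime(2017, 3, 1, 8, 40), "towerA"),
    ("u1", datetime(2017, 3, 1, 12, 5), "towerB"),
    ("u1", datetime(2017, 3, 1, 18, 30), "towerB"),
]

def stay_points(records, min_stay=timedelta(minutes=30)):
    """Collapse consecutive observations at the same tower into stay points."""
    records = sorted(records, key=lambda r: r[1])
    stays, i = [], 0
    while i < len(records):
        j = i
        while j + 1 < len(records) and records[j + 1][2] == records[i][2]:
            j += 1
        if records[j][1] - records[i][1] >= min_stay:
            stays.append((records[i][2], records[i][1], records[j][1]))
        i = j + 1
    return stays

# Build per-user stay sequences, then an individual mobility network as
# transitions between consecutive stay locations.
by_user = defaultdict(list)
for user, ts, tower in cdr:
    by_user[user].append((user, ts, tower))

for user, recs in by_user.items():
    stays = stay_points(recs)
    locations = [s[0] for s in stays]
    edges = list(zip(locations, locations[1:]))
    print(user, stays, edges)
```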

351 citations


Journal ArticleDOI
TL;DR: The background and state of the art of scholarly data management and relevant technologies are examined, and data analysis methods, such as statistical analysis, social network analysis, and content analysis for dealing with big scholarly data are reviewed.
Abstract: With the rapid growth of digital publishing, harvesting, managing, and analyzing scholarly information have become increasingly challenging. The term Big Scholarly Data is coined for the rapidly growing scholarly data, which contains information including millions of authors, papers, citations, figures, tables, as well as scholarly networks and digital libraries. Nowadays, various scholarly data can be easily accessed and powerful data analysis technologies are being developed, which enable us to look into science itself with a new perspective. In this paper, we examine the background and state of the art of big scholarly data. We first introduce the background of scholarly data management and relevant technologies. Second, we review data analysis methods, such as statistical analysis, social network analysis, and content analysis for dealing with big scholarly data. Finally, we look into representative research issues in this area, including scientific impact evaluation, academic recommendation, and expert finding. For each issue, the background, main challenges, and latest research are covered. These discussions aim to provide a comprehensive review of this emerging area. This survey paper concludes with a discussion of open issues and promising future directions.

234 citations


Journal ArticleDOI
TL;DR: A new method for epileptic seizure prediction and localization of the seizure focus is presented, an extended optimization approach on existing deep-learning structures, Stacked Auto-encoder and Convolutional Neural Network, is proposed and a cloud-computing solution is developed to define the proposed structures for real-time processing, automatic computing and storage of big data.
Abstract: A brain-computer interface (BCI) for seizure prediction provides a means of controlling epilepsy in medically refractory patients whose site of epileptogenicity cannot be resected but yet can be defined sufficiently to be selectively influenced by strategically implanted electrodes. Challenges remain in offering real-time solutions with such technology because of the immediacy of electrographic ictal behavior. The nonstationary nature of electroencephalographic (EEG) and electrocorticographic (ECoG) signals results in wide variation of both normal and ictal patterns among patients. The use of manually extracted features in a prediction task is impractical and the large amount of data generated even among a limited set of electrode contacts will create significant processing delays. Big data in such circumstances must not only allow for safe storage but also provide high computational resources for recognition, capture and real-time processing of the preictal period in order to execute the timely abrogation of the ictal event. By leveraging the potential of cloud computing and deep learning, we develop and deploy BCI seizure prediction and localization from scalp EEG and ECoG big data. First, a new method for epileptic seizure prediction and localization of the seizure focus is presented. Second, an extended optimization approach on existing deep-learning structures, Stacked Auto-encoder and Convolutional Neural Network (CNN), is proposed based on principal component analysis (PCA), independent component analysis (ICA), and the Differential Search Algorithm (DSA). Third, a cloud-computing solution (i.e., Internet of Things (IoT)) is developed to define the proposed structures for real-time processing, automatic computing and storage of big data. The ECoG clinical datasets on 11 patients illustrate the superiority of the proposed patient-specific BCI as an alternative to current methodology to offer support for patients with intractable focal epilepsy.
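
As a rough illustration only, the PyTorch sketch below sets up a small 1-D CNN for window-level preictal vs. interictal classification; the channel count, window length, and layer sizes are invented, and the paper's Stacked Auto-encoder, PCA/ICA preprocessing, and DSA-based optimization are not reproduced.

```python
import torch
import torch.nn as nn

# Minimal 1-D CNN for window-level preictal vs. interictal classification.
# 16 channels x 512 samples per window is an illustrative choice, not the
# paper's configuration.
class SeizureCNN(nn.Module):
    def __init__(self, n_channels=16, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):            # x: (batch, channels, samples)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)

model = SeizureCNN()
windows = torch.randn(8, 16, 512)    # synthetic EEG/ECoG windows
labels = torch.randint(0, 2, (8,))   # synthetic preictal/interictal labels
loss = nn.CrossEntropyLoss()(model(windows), labels)
loss.backward()
print(loss.item())
```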

135 citations


Journal ArticleDOI
TL;DR: This paper proposes a new privacy-aware public auditing mechanism for shared cloud data by constructing a homomorphic verifiable group signature that eliminates the abuse of single-authority power and provides non-frameability.
Abstract: Today, cloud storage has become one of the critical services, because users can easily modify and share data with others in the cloud. However, the integrity of shared cloud data is vulnerable to inevitable hardware faults, software failures or human errors. To ensure the integrity of the shared data, some schemes have been designed to allow public verifiers (i.e., third party auditors) to efficiently audit data integrity without retrieving the entire users’ data from the cloud. Unfortunately, public auditing on the integrity of shared data may reveal data owners’ sensitive information to the third party auditor. In this paper, we propose a new privacy-aware public auditing mechanism for shared cloud data by constructing a homomorphic verifiable group signature. Unlike the existing solutions, our scheme requires at least t group managers to recover a trace key cooperatively, which eliminates the abuse of single-authority power and provides non-frameability. Moreover, our scheme ensures that group users can trace data changes through a designated binary tree, and can recover the latest correct data block when the current data block is damaged. In addition, the formal security analysis and experimental results indicate that our scheme is provably secure and efficient.

110 citations


Journal ArticleDOI
TL;DR: Experimental results indicate that PPHOPCM can effectively cluster a large amount of heterogeneous data using cloud computing without disclosure of private data.
Abstract: As one important technique of fuzzy clustering in data mining and pattern recognition, the possibilistic c-means algorithm (PCM) has been widely used in image analysis and knowledge discovery. However, it is difficult for PCM to produce a good result for clustering big data, especially for heterogeneous data, since it was initially designed only for small structured datasets. To tackle this problem, the paper proposes a high-order PCM algorithm (HOPCM) for big data clustering by optimizing the objective function in the tensor space. Further, we design a distributed HOPCM method based on MapReduce for very large amounts of heterogeneous data. Finally, we devise a privacy-preserving HOPCM algorithm (PPHOPCM) to protect the private data on the cloud by applying the BGV encryption scheme to HOPCM. In PPHOPCM, the functions for updating the membership matrix and clustering centers are approximated as polynomial functions to support the secure computing of the BGV scheme. Experimental results indicate that PPHOPCM can effectively cluster a large amount of heterogeneous data using cloud computing without disclosure of private data.
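
For readers unfamiliar with the base algorithm, here is a minimal NumPy sketch of classical possibilistic c-means; the high-order tensor-space formulation, the MapReduce distribution, and the BGV-encrypted variant described above are beyond this sketch, and the fuzzifier and scale estimates follow the standard PCM formulation rather than the paper's.

```python
import numpy as np

def pcm(X, c=2, m=2.0, n_iter=50, seed=0):
    """Basic possibilistic c-means: typicality updates plus center updates."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    V = X[rng.choice(n, size=c, replace=False)]           # initial centers
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # (c, n) squared distances
    U = 1.0 / (1.0 + d2 / (d2.mean(axis=1, keepdims=True) + 1e-12))
    eta = (U**m * d2).sum(1) / (U**m).sum(1)               # per-cluster scale
    for _ in range(n_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)
        U = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))
        W = U**m
        V = (W @ X) / W.sum(1, keepdims=True)              # weighted center update
    return U, V

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((100, 2)), rng.standard_normal((100, 2)) + 5])
U, V = pcm(X, c=2)
print(V)                      # cluster centers
print(U.argmax(axis=0)[:5])   # hard assignment of the first few points
```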

91 citations


Journal ArticleDOI
TL;DR: Algorithms which construct causality trees from congestions and estimate their propagation probabilities based on temporal and spatial information of the congestions are introduced.
Abstract: Traffic congestion is a condition of a segment in the road network where the traffic demand is greater than the available road capacity. The detection of unusual traffic patterns including congestions is a significant research problem in the data mining and knowledge discovery community. However, to the best of our knowledge, the discovery of propagations, or causal interactions among detected traffic congestions, has not been appropriately investigated before. In this research, we introduce algorithms which construct causality trees from congestions and estimate their propagation probabilities based on temporal and spatial information of the congestions. Frequent sub-structures of these causality trees reveal not only recurring interactions among spatio-temporal congestions, but also potential bottlenecks or flaws in the design of existing traffic networks. Our algorithms have been validated by experiments on a travel time data set recorded from an urban road network.
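
A minimal sketch of the kind of linking rule such tree construction could use, assuming hypothetical congestion events, a hand-written segment adjacency map, and a 15-minute propagation window; the paper's actual rules and propagation-probability estimation are not reproduced.

```python
from datetime import datetime, timedelta

# Hypothetical congestion events: (event_id, road_segment, start_time).
# Segment adjacency and the 15-minute window are illustrative assumptions.
events = [
    ("e1", "s1", datetime(2017, 5, 2, 8, 0)),
    ("e2", "s2", datetime(2017, 5, 2, 8, 10)),
    ("e3", "s3", datetime(2017, 5, 2, 8, 20)),
    ("e4", "s5", datetime(2017, 5, 2, 9, 0)),
]
adjacent = {("s1", "s2"), ("s2", "s3")}          # undirected segment adjacency
WINDOW = timedelta(minutes=15)

def is_adjacent(a, b):
    return (a, b) in adjacent or (b, a) in adjacent

def causality_forest(events):
    """Attach each event to the most recent adjacent earlier event within WINDOW."""
    events = sorted(events, key=lambda e: e[2])
    parent = {}
    for i, (eid, seg, t) in enumerate(events):
        candidates = [
            (pt, pid) for pid, pseg, pt in events[:i]
            if is_adjacent(seg, pseg) and timedelta(0) <= t - pt <= WINDOW
        ]
        parent[eid] = max(candidates)[1] if candidates else None
    return parent

print(causality_forest(events))
# {'e1': None, 'e2': 'e1', 'e3': 'e2', 'e4': None}
```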

88 citations


Journal ArticleDOI
TL;DR: A relaxed form of linear programming SVDD (RLPSVDD) is proposed and important insights into parameter selection for practical time series anomaly detection are presented in order to monitor the operations of cloud services.
Abstract: As a powerful architecture for large-scale computation, cloud computing has revolutionized the way that computing infrastructure is abstracted and utilized. Coupled with the challenges caused by Big Data, the rocketing development of cloud computing boosts the complexity of system management and maintenance, resulting in weakened trustworthiness of cloud services. To cope with this problem, a compelling method, i.e., Support Vector Data Description (SVDD), is investigated in this paper for detecting anomalous performance metrics of cloud services. Although competent in general anomaly detection, SVDD suffers from an unsatisfactory false alarm rate and computational complexity in time series anomaly detection, which considerably hinders its practical applications. Therefore, this paper proposes a relaxed form of linear programming SVDD (RLPSVDD) and presents important insights into parameter selection for practical time series anomaly detection in order to monitor the operations of cloud services. Experiments on the Iris dataset and the Yahoo benchmark datasets validate the effectiveness of our approaches. Furthermore, the comparison of RLPSVDD with the methods from Twitter, Numenta, Etsy and Yahoo shows the overall preference for RLPSVDD in time series anomaly detection.
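
The RLPSVDD formulation itself is not reproduced here; as a stand-in, the sketch below applies scikit-learn's OneClassSVM (closely related to kernel SVDD) to sliding windows of a synthetic performance metric, with window length, nu, and gamma chosen arbitrarily.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Stand-in for SVDD: OneClassSVM with an RBF kernel applied to sliding windows
# of a synthetic performance metric with one injected anomaly.
rng = np.random.default_rng(0)
metric = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * rng.standard_normal(2000)
metric[1500:1510] += 3.0                       # injected anomaly

def windows(x, w=50):
    return np.stack([x[i:i + w] for i in range(len(x) - w + 1)])

W = windows(metric)
train, test = W[:1000], W[1000:]

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train)
scores = model.decision_function(test)         # lower = more anomalous
flagged = np.where(scores < np.quantile(scores, 0.01))[0] + 1000
print("suspicious window starts:", flagged[:10])
```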

73 citations


Journal ArticleDOI
TL;DR: The experimental results on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset showed that the proposed method could select the important SNPs to estimate the brain imaging features more accurately than the state-of-the-art methods.
Abstract: In this paper, we propose a novel sparse regression method for Brain-Wide and Genome-Wide association study. Specifically, we impose a low-rank constraint on the weight coefficient matrix and then decompose it into two low-rank matrices, which find relationships in genetic features and in brain imaging features, respectively. We also introduce a sparse acyclic digraph with a sparsity-inducing penalty to further take into account the correlations among the genetic variables, which makes it possible to identify the representative SNPs that are highly associated with the brain imaging features. We optimize our objective function by jointly tackling low-rank regression and variable selection in a single framework. In our method, the low-rank constraint allows us to conduct variable selection with the low-rank representations of the data, while the learned low-sparsity weight coefficients allow us to discard unimportant variables at the end. The experimental results on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset showed that the proposed method could select the important SNPs to estimate the brain imaging features more accurately than the state-of-the-art methods.
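
The low-rank ingredient can be illustrated with classical reduced-rank regression, obtained by truncating the ordinary least-squares fit; this sketch uses synthetic data and omits the paper's sparse acyclic digraph penalty and joint variable selection.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Classical reduced-rank regression: project the OLS fit onto a low-rank space."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)         # (p, q) OLS coefficients
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    P = Vt[:rank].T @ Vt[:rank]                           # projector onto top-r directions
    return B_ols @ P

rng = np.random.default_rng(0)
n, p, q, r = 200, 50, 20, 3
B_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))  # rank-r truth
X = rng.standard_normal((n, p))
Y = X @ B_true + 0.1 * rng.standard_normal((n, q))

B_hat = reduced_rank_regression(X, Y, rank=3)
print(np.linalg.matrix_rank(B_hat))                        # 3
print(np.linalg.norm(B_hat - B_true) / np.linalg.norm(B_true))
```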

65 citations


Journal ArticleDOI
TL;DR: The non-causality test is introduced to rule out urban dynamics that do not “Granger” cause air pollution, and the region of influence (ROI) is introduced, which enables us to only analyze data with the highest causality levels.
Abstract: This paper deals with city-wide air quality estimation with limited air quality monitoring stations which are geographically sparse. Since air pollution is influenced by urban dynamics (e.g., meteorology and traffic) which are available throughout the city, we can infer the air quality in regions without monitoring stations based on such spatial-temporal (ST) heterogeneous urban big data. However, big data-enabled estimation poses three challenges. The first challenge is data diversity, i.e., there are many different categories of urban data, some of which may be useless for the estimation. To overcome this, we extend Granger causality to the ST space to analyze all the causality relations in a consistent manner. The second challenge is the computational complexity due to processing the massive volume of data. To overcome this, we introduce the non-causality test to rule out urban dynamics that do not “Granger” cause air pollution, and the region of influence (ROI), which enables us to only analyze data with the highest causality levels. The third challenge is to adapt our grid-based algorithm to non-grid-based applications. By developing a flexible grid-based estimation algorithm, we can decrease the inaccuracies due to the grid-based algorithm while maintaining computational efficiency.
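
The building block behind the non-causality test is a pairwise Granger test, which statsmodels provides; the sketch below runs it on synthetic traffic and air-quality series, while the paper's spatio-temporal extension and ROI selection are not shown.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic example: does a traffic series Granger-cause an air-quality series?
# Real inputs would be co-located urban dynamics and pollutant time series.
rng = np.random.default_rng(0)
n = 500
traffic = rng.standard_normal(n)
aqi = np.zeros(n)
for t in range(2, n):
    aqi[t] = 0.5 * aqi[t - 1] + 0.8 * traffic[t - 2] + 0.1 * rng.standard_normal()

# Column order: [effect, candidate cause]; the test asks whether the second
# column helps predict the first beyond its own past.
data = np.column_stack([aqi, traffic])
results = grangercausalitytests(data, maxlag=3, verbose=False)
for lag, (tests, _) in results.items():
    f_stat, p_value, _, _ = tests["ssr_ftest"]
    print(f"lag {lag}: F={f_stat:.1f}, p={p_value:.3g}")
```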

57 citations


Journal ArticleDOI
TL;DR: This paper proposes a real-time, data-driven simulation framework that supports the efficient analysis of taxi ride sharing and describes a new optimization algorithm that is linear in the number of trips and makes use of an efficient indexing scheme that makes the approach scalable.
Abstract: As urban populations grow, cities face many challenges related to transportation, resource consumption, and the environment. Ride sharing has been proposed as an effective approach to reduce traffic congestion, gasoline consumption, and pollution. However, despite great promise, researchers and policy makers lack adequate tools to assess the tradeoffs and benefits of various ride-sharing strategies. In this paper, we propose a real-time, data-driven simulation framework that supports the efficient analysis of taxi ride sharing. By modeling taxis and trips as distinct entities, our framework is able to simulate a rich set of realistic scenarios. At the same time, by providing a comprehensive set of parameters, we are able to study the taxi ride-sharing problem from different angles, considering different stakeholders’ interests and constraints. To address the computational complexity of the model, we describe a new optimization algorithm that is linear in the number of trips and makes use of an efficient indexing scheme, which combined with parallelization, makes our approach scalable. We evaluate our framework through a study that uses data about 360 million trips taken by 13,000 taxis in New York City during 2011 and 2012. We describe the findings of the study which demonstrate that our framework can provide insights into strategies for implementing city-wide ride-sharing solutions. We also carry out a detailed performance analysis which shows the efficiency of our approach.

56 citations


Journal ArticleDOI
TL;DR: Dmodel is proposed, employing roving taxicabs as real-time mobile sensors to infer passenger arriving moments from interactions of vacant taxicabs, and then infer passenger demand by customized online training, utilizing an entropy of pickup events to reduce the size of the big historical taxicab data to be processed.
Abstract: Investigating passenger demand is essential for the taxicab business. Existing solutions are typically based on offline data collected by manual investigations, which are often dated and inaccurate for real-time analysis. To address this issue, we propose Dmodel, employing roving taxicabs as real-time mobile sensors to (i) infer passenger arriving moments from interactions of vacant taxicabs, and then (ii) infer passenger demand by customized online training with both historical and real-time data. Dmodel utilizes a novel parameter called the pickup pattern, based on an entropy of pickup events (which accounts for various real-world logical information, e.g., bad weather), to reduce the size of the big historical taxicab data to be processed. We evaluate Dmodel with a real-world 450 GB dataset of 14,000 taxicabs covering half a year, and results show that compared to the ground truth, Dmodel achieves 83 percent accuracy and outperforms a statistical model by 42 percent. We further present an application where Dmodel is used to dispatch vacant taxicabs to achieve an equilibrium between passenger demand and taxicab supply across urban regions.
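
One ingredient, the entropy of pickup events, is simple to compute; the sketch below does so over hour-of-day bins for a single hypothetical region, whereas the full pickup-pattern parameter in the paper combines this with additional context.

```python
import math
from collections import Counter

# Hypothetical pickup timestamps (hour of day) observed in one region; the
# values are invented for illustration.
pickup_hours = [7, 7, 8, 8, 8, 9, 12, 17, 18, 18, 18, 19, 23]

def pickup_entropy(hours, n_bins=24):
    """Shannon entropy (bits) of pickup events over hour-of-day bins."""
    counts = Counter(h % n_bins for h in hours)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"pickup entropy: {pickup_entropy(pickup_hours):.2f} bits")
```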

Journal ArticleDOI
TL;DR: This work presents an approach to ride sharing where the pick up/drop off locations for passengers are selected from a fixed set, which has the advantage of increased safety through video surveillance and enhances privacy as the users do not need to provide their precise home/work locations.
Abstract: Car occupancy rates (travelers per vehicle) are currently very low in most developed countries, for example, on average between 1.15 and 1.25 in Australia. Enabling shared rides on short notice can be an effective solution to counter the problem of increasing traffic through the use of the untapped transportation capacity. Common inhibitors for the uptake of ride sharing services are privacy and safety concerns. We present an approach to ride sharing where the pick up/drop off locations for passengers are selected from a fixed set, which has the advantage of increased safety through video surveillance. We present a scheme that optimally chooses fixed locations of Pick up Points (PuPs) and aims to maximize the car occupancy rates while preserving user privacy and safety. Our method enhances privacy as the users do not need to provide their precise home/work locations. We have extended the well studied 1-coverage problem, i.e., to cover an area with the minimum number of circles of a given radius [1], to road networks. The challenge for road networks is the varying population density of suburbs, which requires circles of different radii. The aim is to ensure that every point of a city's area is covered by at least one PuP while minimizing the total number of PuPs. By ensuring that we have different circle radii for PuPs, the anonymity of individuals is the same throughout. Using Voronoi diagrams we present a k-anonymity model that guarantees a minimum number of individuals covered by every PuP. Our problem is a multi-objective problem where we aim to maximize coverage, k-anonymity and privacy provided by the system to its users while facilitating ride sharing. Through a greedy randomized adaptive search procedure (GRASP) we find the Pareto front of solutions and evaluate their impact on ride sharing.

Journal ArticleDOI
TL;DR: A number of novel methods rooted in algebraic topology, collectively referred to as Topological Data Analysis, are proposed for rfMRI functional connectivity, and their properties for big data analysis are discussed.
Abstract: Resting state functional magnetic resonance imaging (rfMRI) can be used to measure functional connectivity and then identify brain networks and related brain disorders and diseases. To explore these complex networks, however, huge amounts of data are necessary. Recent advances in neuroimaging technologies, and the unique methodological approach of rfMRI, have brought us into an era of Biomedical Big Data. The recent progress of big data sharing projects, along with their challenges, is discussed. This increasing amount of neuroimaging data has greatly increased the importance of developing preprocessing pipelines and advanced analytic techniques, which are better at handling large-scale datasets. Before applying any analysis method on rfMRI data, several preprocessing steps need to be applied to reduce all unwanted effects. Three alternative ways to get access to big preprocessed rfMRI data are presented, involving the minimal preprocessing pipelines. There are several commonly used methods to examine functional connectivity. However, they become limited in the analysis of big data, and a new tool to explore such data is necessary. We propose applying a number of novel methods rooted in algebraic topology, collectively referred to as Topological Data Analysis, to rfMRI functional connectivity. Their properties for big data analysis are also discussed.

Journal ArticleDOI
TL;DR: This paper makes full use of mobile users’ location-sensitive characteristics to carry out rating prediction, mines the relevance between users’ ratings and user-item geographical location distances (called the user-item geographical connection), and conducts a series of experiments on Yelp, a real social rating network dataset.
Abstract: Recently, advances in intelligent mobile devices and positioning techniques have fundamentally enhanced social networks, which allows users to share their experiences, reviews, ratings, photos, check-ins, etc. The geographical information located by the smart phone bridges the gap between the physical and digital worlds. Location data functions as the connection between a user's physical behaviors and the virtual social networks structured by the smart phone or web services. We refer to these social networks involving geographical information as location-based social networks (LBSNs). Such information brings opportunities and challenges for recommender systems to solve the cold start and sparsity problems of datasets and rating prediction. In this paper, we make full use of mobile users’ location-sensitive characteristics to carry out rating prediction. We mine: 1) the relevance between users’ ratings and user-item geographical location distances, called the user-item geographical connection, and 2) the relevance between users’ rating differences and user-user geographical location distances, called the user-user geographical connection. It is discovered that humans’ rating behaviors are significantly affected by geographical location. Moreover, three factors: user-item geographical connection, user-user geographical connection, and interpersonal interest similarity, are fused into a unified rating prediction model. We conduct a series of experiments on a real social rating network dataset, Yelp. Experimental results demonstrate that the proposed approach outperforms existing models.

Journal ArticleDOI
TL;DR: A new game-theoretic approach towards community detection in large-scale complex networks based on modified modularity is presented; this method was developed based on modified adjacency, modified Laplacian matrices and neighborhood similarity to partition a given network into dense communities.
Abstract: Community detection is a fundamental component of large network analysis. In both academia and industry, progressive research has been made on problems related to community network analysis. Community detection is gaining significant attention and importance in the area of network science. Regular and synthetic complex networks have motivated intense interest in studying the fundamental unifying principles of various complex networks. This paper presents a new game-theoretic approach towards community detection in large-scale complex networks based on modified modularity; this method was developed based on modified adjacency, modified Laplacian matrices and neighborhood similarity. The approach is used to partition a given network into dense communities. It is based on determining a Nash stable partition, which is a pure strategy Nash equilibrium of an appropriately defined strategic game in which the nodes of the network are the players and the strategy of a node is to decide which community it ought to belong to. Players choose to belong to a community according to a maximized fitness/payoff. The quality of the community networks is assessed using modified modularity along with a new fitness function. Community partitions are evaluated using Normalized Mutual Information and a ‘modularity measure’, comparing the new game-theoretic community detection algorithm (NGTCDA) with well-studied and well-known algorithms, such as Fast Newman, Fast Modularity Detection, and Louvain Community. The quality of a network partition in communities is evaluated by looking at the contribution of each node and its neighbors against the strength of its community.

Journal ArticleDOI
TL;DR: DiP-SVM is presented, a distribution preserving kernel support vector machine where the first and second order statistics of the entire dataset are retained in each of the partitions, thereby reducing the chance of missing important global support vectors.
Abstract: In the literature, the task of learning a support vector machine for large datasets has been performed by splitting the dataset into manageable sized “partitions” and training a sequential support vector machine on each of these partitions separately to obtain local support vectors. However, this process invariably leads to a loss in classification accuracy as global support vectors may not have been chosen as local support vectors in their respective partitions. We hypothesize that retaining the original distribution of the dataset in each of the partitions can help solve this issue. Hence, we present DiP-SVM, a distribution preserving kernel support vector machine where the first and second order statistics of the entire dataset are retained in each of the partitions. This helps in obtaining local decision boundaries which are in agreement with the global decision boundary, thereby reducing the chance of missing important global support vectors. We show that DiP-SVM achieves a minimal loss in classification accuracy compared with other distributed support vector machine techniques on several benchmark datasets. We further demonstrate that our approach reduces communication overhead between partitions, leading to faster execution on large datasets and making it suitable for implementation in cloud environments.
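
For context, a common partitioned-SVM baseline trains local SVMs per partition and retrains on the pooled local support vectors; the sketch below does this with random splits, which only roughly retain the global first- and second-order statistics that DiP-SVM preserves by construction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Partitioned SVM baseline: local SVMs on each partition, then a final SVM on
# the pooled local support vectors. Dataset, partition count, and kernel
# settings are illustrative.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
parts = np.array_split(order, 4)

sv_idx = []
for p in parts:
    local = SVC(kernel="rbf", gamma="scale").fit(X[p], y[p])
    sv_idx.extend(p[local.support_])          # map local SV indices back to global

sv_idx = np.unique(sv_idx)
final = SVC(kernel="rbf", gamma="scale").fit(X[sv_idx], y[sv_idx])
print(f"pooled support vectors: {len(sv_idx)}, "
      f"training accuracy: {final.score(X, y):.3f}")

# Sanity check: how far each random partition's mean drifts from the global mean.
for i, p in enumerate(parts):
    print(f"partition {i}: mean deviation "
          f"{np.linalg.norm(X[p].mean(0) - X.mean(0)):.3f}")
```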

Journal ArticleDOI
TL;DR: This paper presents the first study of two fundamental questions that are unprecedentedly important for urban planners seeking to understand the functional characteristics of various urban regions throughout a city, by developing a weather-traffic index (WTI) system.
Abstract: In this work, we focus on two fundamental questions that are unprecedentedly important for urban planners to understand the functional characteristics of various urban regions throughout a city, namely, (i) how to identify a regional weather-traffic sensitivity index throughout a city, which indicates the degree to which a region's traffic is impacted by weather changes; and (ii) among complex regional features, such as road structure and population density, how to dissect the most influential regional features that make urban region traffic more vulnerable to weather changes. However, these two questions are nontrivial to answer, because urban traffic changes dynamically over time and is essentially affected by many other factors, which may dominate the overall impact. We make the first study of these questions by developing a weather-traffic index (WTI) system. The system includes two main components: weather-traffic index establishment and key factor analysis. Using the proposed system, we conducted a comprehensive empirical study in Shanghai, and the weather-traffic indices extracted have been validated to be surprisingly consistent with real-world observations. Further regional key factor analysis yields interesting results. For example, house age has a significant impact on the weather-traffic index, which sheds light on future urban planning and reconstruction.

Journal ArticleDOI
TL;DR: This paper investigates a three-tier cross-domain architecture, proposes an efficient and privacy-preserving big data deduplication scheme for cloud storage (EPCDD), and demonstrates that EPCDD outperforms existing competing schemes in terms of computation, communication and storage overheads.
Abstract: Secure data deduplication can significantly reduce the communication and storage overheads in cloud storage services, and has potential applications in our big data-driven society. Existing data deduplication schemes are generally designed to either resist brute-force attacks or ensure efficiency and data availability, but not both. We are also not aware of any existing scheme that achieves accountability, in the sense of reducing duplicate information disclosure (e.g., to determine whether plaintexts of two encrypted messages are identical). In this paper, we investigate a three-tier cross-domain architecture, and propose an efficient and privacy-preserving big data deduplication scheme for cloud storage (hereafter referred to as EPCDD). EPCDD achieves both privacy preservation and data availability, and resists brute-force attacks. In addition, we take accountability into consideration to offer better privacy assurances than existing schemes. We then demonstrate that EPCDD outperforms existing competing schemes in terms of computation, communication and storage overheads. In addition, the time complexity of duplicate search in EPCDD is logarithmic.
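
Only the logarithmic duplicate search is illustrated below, using a sorted tag list and binary search; SHA-256 of the plaintext stands in for EPCDD's actual deduplication tags, and the three-tier architecture and brute-force resistance are not captured.

```python
import bisect
import hashlib

# Toy duplicate-search index: tags kept sorted so membership tests are
# O(log n) via binary search. SHA-256 of the plaintext is only a stand-in
# for the scheme's privacy-aware deduplication tags.
class DedupIndex:
    def __init__(self):
        self._tags = []

    @staticmethod
    def tag(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def contains(self, data: bytes) -> bool:
        t = self.tag(data)
        i = bisect.bisect_left(self._tags, t)
        return i < len(self._tags) and self._tags[i] == t

    def add(self, data: bytes) -> bool:
        """Return True if the block was new and stored, False if deduplicated."""
        if self.contains(data):
            return False
        bisect.insort(self._tags, self.tag(data))
        return True

index = DedupIndex()
print(index.add(b"block-1"))       # True  (stored)
print(index.add(b"block-1"))       # False (duplicate detected)
print(index.contains(b"block-2"))  # False
```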

Journal ArticleDOI
TL;DR: A visual analytic system to help users handle the large-scale trajectory data, compare different route choices, and explore the underlying reasons of route choice behaviour is developed.
Abstract: There are often multiple routes between regions. Drivers choose different routes with different considerations. Such considerations have always been a point of interest in the transportation area. Studies of route choice behaviour are usually based on small-range experiments with a group of volunteers. However, the experiment data is quite limited in its spatial and temporal scale as well as its practical reliability. In this work, we explore the possibility of studying route choice behaviour based on a general trajectory dataset, which is more realistic and covers a wider scale. We develop a visual analytic system to help users handle the large-scale trajectory data, compare different route choices, and explore the underlying reasons. Specifically, the system consists of: 1) the interactive trajectory filtering, which supports graphical trajectory queries; 2) the spatial visualization, which gives an overview of all feasible routes extracted from filtered trajectories; and 3) the factor visual analytics, which provides the exploration and hypothesis construction of different factors’ impact on route choice behaviour, and verification with an integrated route choice model. Applied to a real taxi GPS dataset, we report the system’s performance and demonstrate its effectiveness with three cases.

Journal ArticleDOI
TL;DR: It is demonstrated how the model learned with the method can be used to identify the most likely and distinctive features of a geographical area, quantify the importance of the features used in the model, and discover similar regions across different cities using publicly shared Foursquare data.
Abstract: Data generated on location-based social networks provide rich information on the whereabouts of urban dwellers. Specifically, such data reveal who spends time where, when, and on what type of activity (e.g., shopping at a mall, or dining at a restaurant). That information can, in turn, be used to describe city regions in terms of the activity that takes place therein. For example, the data might reveal that citizens visit one region mainly for shopping in the morning, while another for dining in the evening. Furthermore, once such a description is available, one can ask more elaborate questions. For example, one might ask what features distinguish one region from another: some regions might be different in terms of the type of venues they host and others in terms of the visitors they attract. As another example, one might ask which regions are similar across cities. In this paper, we present a method to answer such questions using publicly shared Foursquare data. Our analysis makes use of a probabilistic model, the features of which include the exact location of activity, the users who participate in the activity, as well as the time of the day and day of week the activity takes place. Compared to previous approaches to similar tasks, our probabilistic modeling approach allows us to make minimal assumptions about the data, which relieves us from having to set arbitrary parameters in our analysis (e.g., regarding the granularity of discovered regions or the importance of different features). We demonstrate how the model learned with our method can be used to identify the most likely and distinctive features of a geographical area, quantify the importance of the features used in the model, and discover similar regions across different cities. Finally, we perform an empirical comparison with previous work and discuss insights obtained through our findings.

Journal ArticleDOI
TL;DR: In this article, the authors examined and compared lifestyle behaviors of people living in cities of different sizes, utilizing freely available social media data as a large-scale, low-cost alternative to traditional survey methods.
Abstract: Lifestyles are a valuable model for understanding individuals’ physical and mental lives, comparing social groups, and making recommendations for improving people's lives. In this paper, we examine and compare lifestyle behaviors of people living in cities of different sizes, utilizing freely available social media data as a large-scale, low-cost alternative to traditional survey methods. We use the Greater New York City area as a representative for large cities, and the Greater Rochester area as a representative for smaller cities in the United States. We employed matrix factor analysis as an unsupervised method to extract salient mobility and work-rest patterns for a large population of users within each metropolitan area. We discovered interesting human behavior patterns at both a larger scale and a finer granularity than is present in previous literature, some of which allow us to quantitatively compare the behaviors of individuals living in big cities to those living in small cities. We believe that our social media-based approach to lifestyle analysis represents a powerful tool for social computing in the big data age.
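
As a stand-in for the matrix factor analysis described above, the sketch below factorizes a synthetic user-by-hour activity matrix with non-negative matrix factorization to recover temporal "lifestyle" patterns; the data and the choice of NMF are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Factor a user-by-hour-of-day activity count matrix into a small number of
# temporal patterns. The counts are synthetic mixtures of three hand-made
# daily profiles (morning, evening, late-night).
rng = np.random.default_rng(0)
n_users, k = 200, 3

hours = np.arange(24)
morning = np.exp(-0.5 * ((hours - 9) / 2.0) ** 2)
evening = np.exp(-0.5 * ((hours - 20) / 2.0) ** 2)
night = np.exp(-0.5 * ((hours - 2) / 3.0) ** 2)
basis = np.vstack([morning, evening, night])                 # (k, 24)

weights = rng.gamma(2.0, 1.0, size=(n_users, k))
activity = rng.poisson(weights @ basis * 5)                  # (users, hours)

model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
user_loadings = model.fit_transform(activity)   # how strongly each user follows each pattern
patterns = model.components_                    # hour-of-day profile of each pattern
print(patterns.round(2))
```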

Journal ArticleDOI
TL;DR: Novel models and algorithms are proposed for discovering statistically significant linear hotspots, using neighbor node filtering, shortest path tree pruning, and Monte Carlo speedup.
Abstract: Given a spatial network and a collection of activities (e.g., pedestrian fatality reports, crime reports), Significant Linear Hotspot Discovery (SLHD) finds all shortest paths in the spatial network where the concentration of activities is statistically significantly high. SLHD is important for societal applications in transportation safety or public safety, such as finding paths with significant concentrations of accidents or crimes. SLHD is challenging because 1) there is a potentially large number of candidate paths (~10^16) in a given dataset with millions of activities and road network nodes, and 2) the test statistic (e.g., density ratio) is not monotonic. Hotspot detection approaches in Euclidean space (e.g., SaTScan) may miss significant paths, since a large fraction of the area bounded by Euclidean shapes around activities on a path will be empty. Previous network-based approaches consider only paths between road intersections but not activities. This paper proposes novel models and algorithms for discovering statistically significant linear hotspots using neighbor node filtering, shortest path tree pruning, and Monte Carlo speedup. We present case studies comparing the proposed approaches with existing techniques on real data. Experimental results show that the proposed algorithms yield substantial computational savings without reducing result quality.
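
Only the Monte Carlo significance test is sketched here: the density ratio of a hypothetical candidate path is compared against ratios obtained by redistributing activities over the network in proportion to segment length; the candidate enumeration, neighbor node filter, and pruning steps are omitted.

```python
import numpy as np

# Toy road network: per-segment lengths (km) and activity counts, plus one
# candidate shortest path given as segment indices. All values are invented.
rng = np.random.default_rng(0)
edge_length = np.array([1.0, 2.0, 1.5, 0.5, 3.0, 2.0])
activity_count = np.array([2, 1, 9, 6, 1, 1])
path_edges = [2, 3]

def density_ratio(counts, lengths, path):
    on = np.zeros(len(counts), dtype=bool)
    on[path] = True
    inside = counts[on].sum() / lengths[on].sum()
    outside = counts[~on].sum() / lengths[~on].sum()
    return inside / outside

observed = density_ratio(activity_count, edge_length, path_edges)
total = activity_count.sum()
probs = edge_length / edge_length.sum()

# Monte Carlo null: activities fall uniformly along the network's total length.
sims = np.array([
    density_ratio(rng.multinomial(total, probs), edge_length, path_edges)
    for _ in range(999)
])
p_value = (1 + (sims >= observed).sum()) / (1 + len(sims))
print(f"density ratio = {observed:.2f}, Monte Carlo p-value = {p_value:.3f}")
```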

Journal ArticleDOI
TL;DR: The Infomap algorithm showcased the best trade-off between accuracy and computational performance and should therefore be considered a promising tool for Web Data Analytics purposes.
Abstract: Detecting communities in graphs is a fundamental tool to understand the structure of Web-based systems and predict their evolution. Many community detection algorithms are designed to process undirected graphs (i.e., graphs with bidirectional edges), but many graphs on the Web (e.g., microblogging Web sites, trust networks or the Web graph itself) are often directed. Few community detection algorithms deal with directed graphs, and we lack an experimental comparison of them. In this paper we evaluate several community detection algorithms in terms of accuracy and scalability. A first group of algorithms (Label Propagation and Infomap) are explicitly designed to manage directed graphs, while a second group (e.g., WalkTrap) simply ignores edge directionality; finally, a third group of algorithms (e.g., Eigenvector) maps input graphs onto undirected ones and extracts communities from the symmetrized version of the input graph. We ran our tests on both artificial and real graphs; on artificial graphs, WalkTrap achieved the highest accuracy, closely followed by other algorithms, while Label Propagation showed outstanding scalability on both artificial and real graphs. The Infomap algorithm showcased the best trade-off between accuracy and computational performance and should therefore be considered a promising tool for Web Data Analytics purposes.
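
A minimal version of such a comparison can be reproduced with python-igraph on a small planted-partition benchmark, scoring each result by NMI against the planted labels; the graph size, block probabilities, and the three algorithms chosen below are illustrative.

```python
import igraph as ig

# Small directed benchmark: two planted groups with dense intra-group and
# sparse inter-group links. Sizes and probabilities are illustrative.
g = ig.Graph.SBM(
    n=200,
    pref_matrix=[[0.10, 0.01], [0.01, 0.10]],
    block_sizes=[100, 100],
    directed=True,
)
truth = [0] * 100 + [1] * 100

communities = {
    "infomap": g.community_infomap(),
    "label_propagation": g.community_label_propagation(),
    "walktrap_undirected": g.as_undirected().community_walktrap().as_clustering(),
}

for name, clustering in communities.items():
    nmi = ig.compare_communities(truth, clustering.membership, method="nmi")
    print(f"{name}: {len(clustering)} communities, NMI = {nmi:.3f}")
```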

Journal ArticleDOI
TL;DR: This paper puts forward the first identity-based (ID-based) signcryption scheme with efficient revocation as well as the feature to outsource unsigncryption to enable secure big data communications between data collectors and data analytical system(s).
Abstract: To be able to leverage big data to achieve enhanced strategic insight, process optimization and make informed decisions, we need an efficient access control mechanism for ensuring end-to-end security of such information assets. Signcryption is one of several promising techniques to simultaneously achieve big data confidentiality and authenticity. However, signcryption suffers from the limitation of not being able to revoke users from a large-scale system efficiently. We put forward, in this paper, the first identity-based (ID-based) signcryption scheme with efficient revocation as well as the feature to outsource unsigncryption to enable secure big data communications between data collectors and data analytical system(s). Our scheme is designed to achieve end-to-end confidentiality, authentication, non-repudiation, and integrity simultaneously, while providing scalable revocation functionality such that the overhead demanded by the private key generator (PKG) in the key-update phase only increases logarithmically with the cardinality of users. Although in our scheme the majority of the unsigncryption tasks are outsourced to an untrusted cloud server, this approach does not affect the security of the proposed scheme. We then prove the security of our scheme, as well as demonstrating its utility using simulations.

Journal ArticleDOI
TL;DR: This review paper discussed different kinds of methods for mitosis detection, like tracking based methods, tracking free methods, hybrid methods, and the most recently proposed works based on deep learning architecture, and found that deep learning based approaches have achieved a great improvement in performance.
Abstract: Detecting mitosis from a cell population is a fundamental problem in many biological research efforts and biomedical applications. In modern research, advanced imaging technologies have been applied to generate large amounts of microscopy images of cells. However, detecting all mitotic cells in these images with the human eye is tedious and time-consuming. In recent years, several approaches have been proposed to help humans finish this job automatically with high efficiency and accuracy. In this review paper, we first describe some commonly used datasets for mitosis detection, and then discuss different kinds of methods for mitosis detection, such as tracking-based methods, tracking-free methods, hybrid methods, and the most recently proposed works based on deep learning architectures. We compare these methods on the same datasets, and find that deep learning based approaches have achieved a great improvement in performance. Finally, we discuss possible future approaches to mitosis detection that combine the success of previous works with the advantages of big data in modern research. Considering that expertise is highly required in the biomedical area, we further discuss the possibility of learning information from biomedical big data with less expert annotation.

Journal ArticleDOI
TL;DR: This paper proposes a novel incentive scheme based on the trust of mobile users in the MSC to allocate the tasks of big data and proves that the proposal can outperform other existing methods with a low delay and a high efficiency.
Abstract: Recently, the mobile social cloud (MSC), formed by mobile users with social ties, has been advocated for allocating the tasks of big data applications instead of relying on conventional cloud systems. However, due to the dynamic topology of networks and the social features of users, how to optimally allocate tasks to mobile users based on trust becomes a new challenge. Therefore, this paper proposes a novel incentive scheme based on the trust of mobile users in the MSC to allocate the tasks of big data. Firstly, a social trust degree is defined according to the social ties among users, the importance of the task, and the available resources of the networks. With the social trust degree, the task owner can select a group of mobile users as the candidates for task allocation. Secondly, a reverse auction game model is developed to study the interactions among the task owner and the candidates. With the reverse auction game model, the optimal strategy of task allocation can be obtained with a low cost for the task owner, while the selected candidates can also obtain a high profit. Finally, simulation experiments are carried out to show that the proposal outperforms other existing methods with a low delay and a high efficiency when allocating tasks in the MSC.

Journal ArticleDOI
TL;DR: This work proposes that by leveraging label information, the task-related discriminative sources can be much better retrieved among strong spontaneous background signals, and extends the framework to the VB-SCCD model, which aims to estimate extended brain sources by including a spatial total variation regularization term.
Abstract: EEG source imaging integrates temporal and spatial components of EEG to localize the generating source of electrical potentials based on recorded EEG data on the scalp. As EEG sensors cannot directly measure activated brain sources, many approaches have been proposed to estimate brain source activation patterns given EEG data. However, since most of the brain activity is composed of spontaneous, non-task-related activations, true task-related activation sources are corrupted by strong background signals. For decades, the EEG inverse problem was solved in an unsupervised way without any utilization of the label information that represents different brain states. We propose that by leveraging label information, the task-related discriminative sources can be much better retrieved among strong spontaneous background signals. A novel model for solving the EEG inverse problem, called Laplacian Graph Regularized Discriminative Source Reconstruction, is proposed, which aims to explicitly extract the discriminative sources by implicitly coding the label information into the graph regularization term. The proposed model can be generally extended with different assumptions. As an extension, our framework is applied to the VB-SCCD model, which aims to estimate extended brain sources by including a spatial total variation regularization term. Simulation results show the effectiveness of the proposed framework.

Journal ArticleDOI
TL;DR: This paper uses real time traffic flow data to generate dense functional correlation matrices between zones during different times of the day, and derives optimal sparse representations of these dense functional matrices that accurately recover not only the existing road network connectivity between zones, but also reveal new latent links between zones that do not yet exist but are suggested by traffic flow dynamics.
Abstract: Mobility in a city is represented as traffic flows in and out of defined urban travel or administrative zones. While the zones and the road networks connecting them are fixed in space, traffic flows between pairs of zones are dynamic through the day. Understanding these dynamics in real time is crucial for real time traffic planning in the city. In this paper, we use real time traffic flow data to generate dense functional correlation matrices between zones during different times of the day. Then, we derive optimal sparse representations of these dense functional matrices, that accurately recover not only the existing road network connectivity between zones, but also reveal new latent links between zones that do not yet exist but are suggested by traffic flow dynamics. We call this sparse representation the time-varying effective traffic connectivity of the city. A convex optimization problem is formulated and used to infer the sparse effective traffic network from time series data of traffic flow for arbitrary levels of temporal granularity. We demonstrate the results for the city of Doha, Qatar on data collected from several hundred bluetooth sensors deployed across the city to record vehicular activity through the city's traffic zones. While the static road network connectivity between zones is accurately inferred, other long range connections are also predicted that could be useful in planning future road linkages in the city. Further, the proposed model can be applied to socio-economic activity other than traffic, such as new housing, construction, or economic activity captured as functional correlations between zones, and can also be similarly used to predict new traffic linkages that are latently needed but as yet do not exist. Preliminary experiments suggest that our framework can be used by urban transportation experts and policy specialists to take a real time data-driven approach towards urban planning and real time traffic planning in the city, especially at the level of administrative zones of a city.
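
The paper's exact convex program is not reproduced; as a stand-in, the sketch below uses the graphical lasso (sparse inverse covariance estimation) to recover a sparse zone-to-zone connectivity structure from synthetic flow time series.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

# Infer a sparse connectivity structure between traffic zones from their flow
# time series. The data are synthetic flows for 10 hypothetical zones driven
# by a few shared latent factors, observed over 500 time steps.
rng = np.random.default_rng(0)
n_zones, n_steps = 10, 500
latent = rng.standard_normal((n_steps, 3))                 # shared traffic drivers
mixing = rng.standard_normal((3, n_zones)) * (rng.random((3, n_zones)) < 0.4)
flows = latent @ mixing + 0.5 * rng.standard_normal((n_steps, n_zones))

model = GraphicalLassoCV().fit(flows)
precision = model.precision_

# Non-zero off-diagonal entries are the inferred "effective" zone-to-zone links.
links = [
    (i, j) for i in range(n_zones) for j in range(i + 1, n_zones)
    if abs(precision[i, j]) > 1e-3
]
print(f"selected regularization alpha = {model.alpha_:.4f}")
print(f"inferred links: {links}")
```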

Journal ArticleDOI
TL;DR: Experimental results on real-life data demonstrate that this approach can significantly improve the scalability of multidimensional anonymisation over existing methods, and the applicability of this approach to differential privacy is shown.
Abstract: Scalable data processing platforms built on cloud computing are becoming increasingly attractive as infrastructure for supporting big data applications. But privacy concerns are one of the major obstacles to making use of public cloud platforms. Multidimensional anonymisation, a global-recoding generalisation scheme for privacy-preserving data publishing, has been a recent focus due to its capability of balancing data obfuscation and usability. Existing multidimensional anonymisation methods suffer from scalability problems when handling big data due to the impractical serial I/O cost. Given the recursive feature of multidimensional anonymisation, parallelisation is an ideal solution to scalability issues. However, it is still a challenge to use existing distributed and parallel paradigms directly for recursive computation. In this paper, we propose a scalable approach for big data multidimensional anonymisation based on MapReduce, a state-of-the-art data processing paradigm. Our basic idea is to partition a data set recursively into smaller partitions using MapReduce until all partitions can fit in the memory of a computing node. A tree indexing structure is proposed to achieve recursive computation. Moreover, we show the applicability of our approach to differential privacy. Experimental results on real-life data demonstrate that our approach can significantly improve the scalability of multidimensional anonymisation over existing methods.
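
The recursive step being parallelized is essentially Mondrian-style multidimensional partitioning; the single-machine sketch below shows that step only, with a simplified range generalisation and synthetic quasi-identifiers, while the MapReduce driver and tree index from the paper are omitted.

```python
import numpy as np

def mondrian_partition(records, k=5):
    """Recursively median-split records on the widest attribute until a split
    would violate k-anonymity; each leaf is generalised to its attribute ranges."""
    records = np.asarray(records, dtype=float)
    if len(records) < 2 * k:
        low, high = records.min(axis=0), records.max(axis=0)
        return [(len(records), list(zip(low, high)))]       # generalised partition
    spans = records.max(axis=0) - records.min(axis=0)
    dim = int(np.argmax(spans))                              # widest quasi-identifier
    median = np.median(records[:, dim])
    left = records[records[:, dim] <= median]
    right = records[records[:, dim] > median]
    if len(left) < k or len(right) < k:                      # cannot split further
        low, high = records.min(axis=0), records.max(axis=0)
        return [(len(records), list(zip(low, high)))]
    return mondrian_partition(left, k) + mondrian_partition(right, k)

# Two synthetic quasi-identifier columns, e.g., age and a postcode prefix.
rng = np.random.default_rng(0)
data = np.column_stack([rng.integers(18, 80, 100), rng.integers(100, 999, 100)])
for size, ranges in mondrian_partition(data, k=10):
    print(size, ranges)
```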

Journal ArticleDOI
TL;DR: A recent entity-centric knowledge graph effort resulted in a semantic search engine to assist analysts and investigative experts in the HT domain, and enables investigators to satisfy their information needs by posing investigative search queries to a special-purpose semantic execution engine.
Abstract: Web advertising related to Human Trafficking (HT) activity has been on the rise in recent years. Answering entity-centric questions over crawled HT Web corpora to assist investigators in the real world is an important social problem, involving many technical challenges. This paper describes a recent entity-centric knowledge graph effort that resulted in a semantic search engine to assist analysts and investigative experts in the HT domain. The overall approach takes as input a large corpus of advertisements crawled from the Web, structures it into an indexed knowledge graph, and enables investigators to satisfy their information needs by posing investigative search queries to a special-purpose semantic execution engine. We evaluated the search engine on real-world data collected from over 90,000 webpages, a significant fraction of which correlates with HT activity. Performance on four relevant categories of questions, measured by a mean average precision metric, was found to be promising, outperforming a learning-to-rank approach on three of the four categories. The prototype uses open-source components and scales to terabyte-scale corpora. Principles of the prototype have also been independently replicated, with similarly successful results.