
Showing papers in "IEEE Transactions on Big Data in 2017"


Journal ArticleDOI
TL;DR: This research provides an innovative data mining framework that synthesizes state-of-the-art techniques for extracting mobility patterns from raw mobile phone CDR data, and designs a pipeline that can translate the massive and passive mobile phone records into meaningful spatial human mobility patterns readily interpretable for urban and transportation planning purposes.
Abstract: In this study, with Singapore as an example, we demonstrate how we can use mobile phone call detail record (CDR) data, which contains millions of anonymous users, to extract individual mobility networks comparable to the activity-based approach. Such an approach is widely used in the transportation planning practice to develop urban micro simulations of individual daily activities and travel; yet it depends highly on detailed travel survey data to capture individual activity-based behavior. We provide an innovative data mining framework that synthesizes the state-of-the-art techniques in extracting mobility patterns from raw mobile phone CDR data, and design a pipeline that can translate the massive and passive mobile phone records to meaningful spatial human mobility patterns readily interpretable for urban and transportation planning purposes. With growing ubiquitous mobile sensing, and shrinking labor and fiscal resources in the public sector globally, the method presented in this research can be used as a low-cost alternative for transportation and planning agencies to understand the human activity patterns in cities, and provide targeted plans for future sustainable development.
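
To make one step of such a pipeline concrete, the sketch below (not the authors' code) collapses a toy CDR sequence into stay points and chains them into a per-user mobility network; the user IDs, tower IDs, timestamps, and the 30-minute stay threshold are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical CDR rows: (user_id, timestamp, cell_tower_id). Tower IDs and the
# 30-minute stay threshold are illustrative assumptions, not values from the paper.
cdr = [
    ("u1", datetime(2017, 3, 1, 8, 0), "towerA"),
    ("u1", datetime(2017, 3, 1, 8, 40), "towerA"),
    ("u1", datetime(2017, 3, 1, 12, 5), "towerB"),
    ("u1", datetime(2017, 3, 1, 18, 30), "towerB"),
]

def stay_points(records, min_stay=timedelta(minutes=30)):
    """Collapse consecutive observations at the same tower into stay points."""
    records = sorted(records, key=lambda r: r[1])
    stays, i = [], 0
    while i < len(records):
        j = i
        while j + 1 < len(records) and records[j + 1][2] == records[i][2]:
            j += 1
        if records[j][1] - records[i][1] >= min_stay:
            stays.append((records[i][2], records[i][1], records[j][1]))
        i = j + 1
    return stays

# Build per-user stay sequences, then an individual mobility network as
# transitions between consecutive stay locations.
by_user = defaultdict(list)
for user, ts, tower in cdr:
    by_user[user].append((user, ts, tower))

for user, recs in by_user.items():
    stays = stay_points(recs)
    locations = [s[0] for s in stays]
    edges = list(zip(locations, locations[1:]))
    print(user, stays, edges)
```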

351 citations


Journal ArticleDOI
TL;DR: The background and state of the art of scholarly data management and relevant technologies are examined, and data analysis methods, such as statistical analysis, social network analysis, and content analysis for dealing with big scholarly data are reviewed.
Abstract: With the rapid growth of digital publishing, harvesting, managing, and analyzing scholarly information have become increasingly challenging. The term Big Scholarly Data is coined for the rapidly growing scholarly data, which contains information including millions of authors, papers, citations, figures, tables, as well as scholarly networks and digital libraries. Nowadays, various scholarly data can be easily accessed and powerful data analysis technologies are being developed, which enable us to look into science itself with a new perspective. In this paper, we examine the background and state of the art of big scholarly data. We first introduce the background of scholarly data management and relevant technologies. Second, we review data analysis methods, such as statistical analysis, social network analysis, and content analysis for dealing with big scholarly data. Finally, we look into representative research issues in this area, including scientific impact evaluation, academic recommendation, and expert finding. For each issue, the background, main challenges, and latest research are covered. These discussions aim to provide a comprehensive review of this emerging area. This survey paper concludes with a discussion of open issues and promising future directions.

234 citations


Journal ArticleDOI
TL;DR: A new method for epileptic seizure prediction and localization of the seizure focus is presented, an extended optimization approach on existing deep-learning structures, Stacked Auto-encoder and Convolutional Neural Network, is proposed and a cloud-computing solution is developed to define the proposed structures for real-time processing, automatic computing and storage of big data.
Abstract: A brain-computer interface (BCI) for seizure prediction provides a means of controlling epilepsy in medically refractory patients whose site of epileptogenicity cannot be resected but yet can be defined sufficiently to be selectively influenced by strategically implanted electrodes. Challenges remain in offering real-time solutions with such technology because of the immediacy of electrographic ictal behavior. The nonstationary nature of electroencephalographic (EEG) and electrocorticographic (ECoG) signals results in wide variation of both normal and ictal patterns among patients. The use of manually extracted features in a prediction task is impractical and the large amount of data generated even among a limited set of electrode contacts will create significant processing delays. Big data in such circumstances must not only allow for safe storage but also provide high computational resources for recognition, capture and real-time processing of the preictal period in order to execute the timely abrogation of the ictal event. By leveraging the potential of cloud computing and deep learning, we develop and deploy BCI seizure prediction and localization from scalp EEG and ECoG big data. First, a new method for epileptic seizure prediction and localization of the seizure focus is presented. Second, an extended optimization approach on existing deep-learning structures, Stacked Auto-encoder and Convolutional Neural Network (CNN), is proposed based on principal component analysis (PCA), independent component analysis (ICA), and the Differential Search Algorithm (DSA). Third, a cloud-computing solution (i.e., Internet of Things (IoT)) is developed to define the proposed structures for real-time processing, automatic computing and storage of big data. The ECoG clinical datasets on 11 patients illustrate the superiority of the proposed patient-specific BCI as an alternative to current methodology to offer support for patients with intractable focal epilepsy.
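
As a rough illustration only, the PyTorch sketch below sets up a small 1-D CNN for window-level preictal vs. interictal classification; the channel count, window length, and layer sizes are invented, and the paper's Stacked Auto-encoder, PCA/ICA preprocessing, and DSA-based optimization are not reproduced.

```python
import torch
import torch.nn as nn

# Minimal 1-D CNN for window-level preictal vs. interictal classification.
# 16 channels x 512 samples per window is an illustrative choice, not the
# paper's configuration.
class SeizureCNN(nn.Module):
    def __init__(self, n_channels=16, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):            # x: (batch, channels, samples)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)

model = SeizureCNN()
windows = torch.randn(8, 16, 512)    # synthetic EEG/ECoG windows
labels = torch.randint(0, 2, (8,))   # synthetic preictal/interictal labels
loss = nn.CrossEntropyLoss()(model(windows), labels)
loss.backward()
print(loss.item())
```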

135 citations


Journal ArticleDOI
TL;DR: This paper proposes a new privacy-aware public auditing mechanism for shared cloud data by constructing a homomorphic verifiable group signature that eliminates the abuse of single-authority power and provides non-frameability.
Abstract: Today, cloud storage has become one of the critical services, because users can easily modify and share data with others in the cloud. However, the integrity of shared cloud data is vulnerable to inevitable hardware faults, software failures or human errors. To ensure the integrity of the shared data, some schemes have been designed to allow public verifiers (i.e., third party auditors) to efficiently audit data integrity without retrieving the entire users’ data from the cloud. Unfortunately, public auditing on the integrity of shared data may reveal data owners’ sensitive information to the third party auditor. In this paper, we propose a new privacy-aware public auditing mechanism for shared cloud data by constructing a homomorphic verifiable group signature. Unlike the existing solutions, our scheme requires at least t group managers to recover a trace key cooperatively, which eliminates the abuse of single-authority power and provides non-frameability. Moreover, our scheme ensures that group users can trace data changes through a designated binary tree, and can recover the latest correct data block when the current data block is damaged. In addition, the formal security analysis and experimental results indicate that our scheme is provably secure and efficient.

110 citations


Journal ArticleDOI
TL;DR: Experimental results indicate that PPHOPCM can effectively cluster a large amount of heterogeneous data using cloud computing without disclosure of private data.
Abstract: As one important technique of fuzzy clustering in data mining and pattern recognition, the possibilistic c-means algorithm (PCM) has been widely used in image analysis and knowledge discovery. However, it is difficult for PCM to produce a good result for clustering big data, especially for heterogeneous data, since it was initially designed only for small structured datasets. To tackle this problem, the paper proposes a high-order PCM algorithm (HOPCM) for big data clustering by optimizing the objective function in the tensor space. Further, we design a distributed HOPCM method based on MapReduce for very large amounts of heterogeneous data. Finally, we devise a privacy-preserving HOPCM algorithm (PPHOPCM) to protect the private data on the cloud by applying the BGV encryption scheme to HOPCM. In PPHOPCM, the functions for updating the membership matrix and clustering centers are approximated as polynomial functions to support the secure computing of the BGV scheme. Experimental results indicate that PPHOPCM can effectively cluster a large amount of heterogeneous data using cloud computing without disclosure of private data.
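
For readers unfamiliar with the base algorithm, here is a minimal NumPy sketch of classical possibilistic c-means; the high-order tensor-space formulation, the MapReduce distribution, and the BGV-encrypted variant described above are beyond this sketch, and the fuzzifier and scale estimates follow the standard PCM formulation rather than the paper's.

```python
import numpy as np

def pcm(X, c=2, m=2.0, n_iter=50, seed=0):
    """Basic possibilistic c-means: typicality updates plus center updates."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    V = X[rng.choice(n, size=c, replace=False)]           # initial centers
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # (c, n) squared distances
    U = 1.0 / (1.0 + d2 / (d2.mean(axis=1, keepdims=True) + 1e-12))
    eta = (U**m * d2).sum(1) / (U**m).sum(1)               # per-cluster scale
    for _ in range(n_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)
        U = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))
        W = U**m
        V = (W @ X) / W.sum(1, keepdims=True)              # weighted center update
    return U, V

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((100, 2)), rng.standard_normal((100, 2)) + 5])
U, V = pcm(X, c=2)
print(V)                      # cluster centers
print(U.argmax(axis=0)[:5])   # hard assignment of the first few points
```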

91 citations


Journal ArticleDOI
TL;DR: Algorithms which construct causality trees from congestions and estimate their propagation probabilities based on temporal and spatial information of the congestions are introduced.
Abstract: Traffic congestion is a condition of a segment in the road network where the traffic demand is greater than the available road capacity. The detection of unusual traffic patterns including congestions is a significant research problem in the data mining and knowledge discovery community. However, to the best of our knowledge, the discovery of propagations, or causal interactions among detected traffic congestions, has not been appropriately investigated before. In this research, we introduce algorithms which construct causality trees from congestions and estimate their propagation probabilities based on temporal and spatial information of the congestions. Frequent sub-structures of these causality trees reveal not only recurring interactions among spatio-temporal congestions, but also potential bottlenecks or flaws in the design of existing traffic networks. Our algorithms have been validated by experiments on a travel time data set recorded from an urban road network.
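
A minimal sketch of the kind of linking rule such tree construction could use, assuming hypothetical congestion events, a hand-written segment adjacency map, and a 15-minute propagation window; the paper's actual rules and propagation-probability estimation are not reproduced.

```python
from datetime import datetime, timedelta

# Hypothetical congestion events: (event_id, road_segment, start_time).
# Segment adjacency and the 15-minute window are illustrative assumptions.
events = [
    ("e1", "s1", datetime(2017, 5, 2, 8, 0)),
    ("e2", "s2", datetime(2017, 5, 2, 8, 10)),
    ("e3", "s3", datetime(2017, 5, 2, 8, 20)),
    ("e4", "s5", datetime(2017, 5, 2, 9, 0)),
]
adjacent = {("s1", "s2"), ("s2", "s3")}          # undirected segment adjacency
WINDOW = timedelta(minutes=15)

def is_adjacent(a, b):
    return (a, b) in adjacent or (b, a) in adjacent

def causality_forest(events):
    """Attach each event to the most recent adjacent earlier event within WINDOW."""
    events = sorted(events, key=lambda e: e[2])
    parent = {}
    for i, (eid, seg, t) in enumerate(events):
        candidates = [
            (pt, pid) for pid, pseg, pt in events[:i]
            if is_adjacent(seg, pseg) and timedelta(0) <= t - pt <= WINDOW
        ]
        parent[eid] = max(candidates)[1] if candidates else None
    return parent

print(causality_forest(events))
# {'e1': None, 'e2': 'e1', 'e3': 'e2', 'e4': None}
```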

88 citations


Journal ArticleDOI
TL;DR: A relaxed form of linear programming SVDD (RLPSVDD) is proposed and important insights into parameter selection for practical time series anomaly detection are presented in order to monitor the operations of cloud services.
Abstract: As a powerful architecture for large-scale computation, cloud computing has revolutionized the way that computing infrastructure is abstracted and utilized. Coupled with the challenges caused by Big Data, the rocketing development of cloud computing boosts the complexity of system management and maintenance, resulting in weakened trustworthiness of cloud services. To cope with this problem, a compelling method, i.e., Support Vector Data Description (SVDD), is investigated in this paper for detecting anomalous performance metrics of cloud services. Although competent in general anomaly detection, SVDD suffers from an unsatisfactory false alarm rate and computational complexity in time series anomaly detection, which considerably hinders its practical applications. Therefore, this paper proposes a relaxed form of linear programming SVDD (RLPSVDD) and presents important insights into parameter selection for practical time series anomaly detection in order to monitor the operations of cloud services. Experiments on the Iris dataset and the Yahoo benchmark datasets validate the effectiveness of our approaches. Furthermore, the comparison of RLPSVDD with the methods from Twitter, Numenta, Etsy and Yahoo shows the overall preference for RLPSVDD in time series anomaly detection.
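
The RLPSVDD formulation itself is not reproduced here; as a stand-in, the sketch below applies scikit-learn's OneClassSVM (closely related to kernel SVDD) to sliding windows of a synthetic performance metric, with window length, nu, and gamma chosen arbitrarily.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Stand-in for SVDD: OneClassSVM with an RBF kernel applied to sliding windows
# of a synthetic performance metric with one injected anomaly.
rng = np.random.default_rng(0)
metric = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * rng.standard_normal(2000)
metric[1500:1510] += 3.0                       # injected anomaly

def windows(x, w=50):
    return np.stack([x[i:i + w] for i in range(len(x) - w + 1)])

W = windows(metric)
train, test = W[:1000], W[1000:]

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train)
scores = model.decision_function(test)         # lower = more anomalous
flagged = np.where(scores < np.quantile(scores, 0.01))[0] + 1000
print("suspicious window starts:", flagged[:10])
```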

73 citations


Journal ArticleDOI
TL;DR: The experimental results on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset showed that the proposed method could select the important SNPs to estimate the brain imaging features more accurately than the state-of-the-art methods.
Abstract: In this paper, we propose a novel sparse regression method for Brain-Wide and Genome-Wide association study. Specifically, we impose a low-rank constraint on the weight coefficient matrix and then decompose it into two low-rank matrices, which find relationships in genetic features and in brain imaging features, respectively. We also introduce a sparse acyclic digraph with a sparsity-inducing penalty to further take into account the correlations among the genetic variables, which makes it possible to identify the representative SNPs that are highly associated with the brain imaging features. We optimize our objective function by jointly tackling low-rank regression and variable selection in a single framework. In our method, the low-rank constraint allows us to conduct variable selection with the low-rank representations of the data, while the learned low-sparsity weight coefficients allow us to discard unimportant variables at the end. The experimental results on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset showed that the proposed method could select the important SNPs to estimate the brain imaging features more accurately than the state-of-the-art methods.
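
The low-rank ingredient can be illustrated with classical reduced-rank regression, obtained by truncating the ordinary least-squares fit; this sketch uses synthetic data and omits the paper's sparse acyclic digraph penalty and joint variable selection.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Classical reduced-rank regression: project the OLS fit onto a low-rank space."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)         # (p, q) OLS coefficients
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    P = Vt[:rank].T @ Vt[:rank]                           # projector onto top-r directions
    return B_ols @ P

rng = np.random.default_rng(0)
n, p, q, r = 200, 50, 20, 3
B_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))  # rank-r truth
X = rng.standard_normal((n, p))
Y = X @ B_true + 0.1 * rng.standard_normal((n, q))

B_hat = reduced_rank_regression(X, Y, rank=3)
print(np.linalg.matrix_rank(B_hat))                        # 3
print(np.linalg.norm(B_hat - B_true) / np.linalg.norm(B_true))
```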

65 citations


Journal ArticleDOI
TL;DR: The non-causality test is introduced to rule out urban dynamics that do not “Granger” cause air pollution, and the region of influence (ROI) is introduced, which enables us to only analyze data with the highest causality levels.
Abstract: This paper deals with city-wide air quality estimation with limited air quality monitoring stations which are geographically sparse. Since air pollution is influenced by urban dynamics (e.g., meteorology and traffic) which are available throughout the city, we can infer the air quality in regions without monitoring stations based on such spatial-temporal (ST) heterogeneous urban big data. However, big data-enabled estimation poses three challenges. The first challenge is data diversity, i.e., there are many different categories of urban data, some of which may be useless for the estimation. To overcome this, we extend Granger causality to the ST space to analyze all the causality relations in a consistent manner. The second challenge is the computational complexity due to processing the massive volume of data. To overcome this, we introduce the non-causality test to rule out urban dynamics that do not “Granger” cause air pollution, and the region of influence (ROI), which enables us to only analyze data with the highest causality levels. The third challenge is to adapt our grid-based algorithm to non-grid-based applications. By developing a flexible grid-based estimation algorithm, we can decrease the inaccuracies due to the grid-based algorithm while maintaining computational efficiency.
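
The building block behind the non-causality test is a pairwise Granger test, which statsmodels provides; the sketch below runs it on synthetic traffic and air-quality series, while the paper's spatio-temporal extension and ROI selection are not shown.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic example: does a traffic series Granger-cause an air-quality series?
# Real inputs would be co-located urban dynamics and pollutant time series.
rng = np.random.default_rng(0)
n = 500
traffic = rng.standard_normal(n)
aqi = np.zeros(n)
for t in range(2, n):
    aqi[t] = 0.5 * aqi[t - 1] + 0.8 * traffic[t - 2] + 0.1 * rng.standard_normal()

# Column order: [effect, candidate cause]; the test asks whether the second
# column helps predict the first beyond its own past.
data = np.column_stack([aqi, traffic])
results = grangercausalitytests(data, maxlag=3, verbose=False)
for lag, (tests, _) in results.items():
    f_stat, p_value, _, _ = tests["ssr_ftest"]
    print(f"lag {lag}: F={f_stat:.1f}, p={p_value:.3g}")
```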

57 citations


Journal ArticleDOI
TL;DR: This paper proposes a real-time, data-driven simulation framework that supports the efficient analysis of taxi ride sharing and describes a new optimization algorithm that is linear in the number of trips and makes use of an efficient indexing scheme that makes the approach scalable.
Abstract: As urban populations grow, cities face many challenges related to transportation, resource consumption, and the environment. Ride sharing has been proposed as an effective approach to reduce traffic congestion, gasoline consumption, and pollution. However, despite great promise, researchers and policy makers lack adequate tools to assess the tradeoffs and benefits of various ride-sharing strategies. In this paper, we propose a real-time, data-driven simulation framework that supports the efficient analysis of taxi ride sharing. By modeling taxis and trips as distinct entities, our framework is able to simulate a rich set of realistic scenarios. At the same time, by providing a comprehensive set of parameters, we are able to study the taxi ride-sharing problem from different angles, considering different stakeholders’ interests and constraints. To address the computational complexity of the model, we describe a new optimization algorithm that is linear in the number of trips and makes use of an efficient indexing scheme, which combined with parallelization, makes our approach scalable. We evaluate our framework through a study that uses data about 360 million trips taken by 13,000 taxis in New York City during 2011 and 2012. We describe the findings of the study which demonstrate that our framework can provide insights into strategies for implementing city-wide ride-sharing solutions. We also carry out a detailed performance analysis which shows the efficiency of our approach.

56 citations


Journal ArticleDOI
TL;DR: Dmodel is proposed, employing roving taxicabs as real-time mobile sensors to infer passenger arriving moments from interactions of vacant taxicabs, and then infer passenger demand by customized online training, utilizing an entropy of pickup events to reduce the size of the big historical taxicab data to be processed.
Abstract: Investigating passenger demand is essential for the taxicab business. Existing solutions are typically based on offline data collected by manual investigations, which are often dated and inaccurate for real-time analysis. To address this issue, we propose Dmodel, employing roving taxicabs as real-time mobile sensors to (i) infer passenger arriving moments from interactions of vacant taxicabs, and then (ii) infer passenger demand by customized online training with both historical and real-time data. Dmodel utilizes a novel parameter called the pickup pattern, based on an entropy of pickup events (which accounts for various real-world logical information, e.g., bad weather), to reduce the size of the big historical taxicab data to be processed. We evaluate Dmodel with a real-world 450 GB dataset of 14,000 taxicabs covering half a year, and results show that compared to the ground truth, Dmodel achieves 83 percent accuracy and outperforms a statistical model by 42 percent. We further present an application where Dmodel is used to dispatch vacant taxicabs to achieve an equilibrium between passenger demand and taxicab supply across urban regions.
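
One ingredient, the entropy of pickup events, is simple to compute; the sketch below does so over hour-of-day bins for a single hypothetical region, whereas the full pickup-pattern parameter in the paper combines this with additional context.

```python
import math
from collections import Counter

# Hypothetical pickup timestamps (hour of day) observed in one region; the
# values are invented for illustration.
pickup_hours = [7, 7, 8, 8, 8, 9, 12, 17, 18, 18, 18, 19, 23]

def pickup_entropy(hours, n_bins=24):
    """Shannon entropy (bits) of pickup events over hour-of-day bins."""
    counts = Counter(h % n_bins for h in hours)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"pickup entropy: {pickup_entropy(pickup_hours):.2f} bits")
```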

Journal ArticleDOI
TL;DR: This work presents an approach to ride sharing where the pick up/drop off locations for passengers are selected from a fixed set, which has the advantage of increased safety through video surveillance and enhances privacy as the users do not need to provide their precise home/work locations.
Abstract: Car occupancy rates (travelers per vehicle) are currently very low in most developed countries, for example, on average between 1.15 and 1.25 in Australia. Enabling shared rides on short notice can be an effective solution to counter the problem of increasing traffic through the use of the untapped transportation capacity. Common inhibitors for the uptake of ride sharing services are privacy and safety concerns. We present an approach to ride sharing where the pick up/drop off locations for passengers are selected from a fixed set, which has the advantage of increased safety through video surveillance. We present a scheme that optimally chooses fixed locations of Pick up Points (PuPs) and aims to maximize the car occupancy rates while preserving user privacy and safety. Our method enhances privacy as the users do not need to provide their precise home/work locations. We have extended the well studied 1-coverage problem, i.e., to cover an area with the minimum number of circles of a given radius [1], to road networks. The challenge for road networks is the varying population density of suburbs, which requires circles of different radii. The aim is to ensure that every point of a city's area is covered by at least one PuP while minimizing the total number of PuPs. By ensuring that we have different circle radii for PuPs, the anonymity of individuals is the same throughout. Using Voronoi diagrams we present a k-anonymity model that guarantees a minimum number of individuals covered by every PuP. Our problem is a multi-objective problem where we aim to maximize coverage, k-anonymity and privacy provided by the system to its users while facilitating ride sharing. Through a greedy randomized adaptive search procedure (GRASP) we find the Pareto front of solutions and evaluate their impact on ride sharing.

Journal ArticleDOI
TL;DR: A number of novel methods rooted in algebraic topology, collectively referred to as Topological Data Analysis, are proposed for rfMRI functional connectivity, and their properties for big data analysis are discussed.
Abstract: Resting state functional magnetic resonance imaging (rfMRI) can be used to measure functional connectivity and then identify brain networks and related brain disorders and diseases. To explore these complex networks, however, huge amounts of data are necessary. Recent advances in neuroimaging technologies, and the unique methodological approach of rfMRI, have brought us into an era of Biomedical Big Data. The recent progress of big data sharing projects, along with their challenges, is discussed. This increasing amount of neuroimaging data has greatly increased the importance of developing preprocessing pipelines and advanced analytic techniques, which are better at handling large-scale datasets. Before applying any analysis method on rfMRI data, several preprocessing steps need to be applied to reduce all unwanted effects. Three alternative ways to get access to big preprocessed rfMRI data are presented, involving the minimal preprocessing pipelines. There are several commonly used methods to examine functional connectivity. However, they become limited in the analysis of big data, and a new tool to explore such data is necessary. We propose applying a number of novel methods rooted in algebraic topology, collectively referred to as Topological Data Analysis, to rfMRI functional connectivity. Their properties for big data analysis are also discussed.

Journal ArticleDOI
TL;DR: This paper makes full use of mobile users’ location-sensitive characteristics to carry out rating prediction, mines the relevance between users’ ratings and user-item geographical location distances (called the user-item geographical connection), and conducts a series of experiments on Yelp, a real social rating network dataset.
Abstract: Recently, advances in intelligent mobile devices and positioning techniques have fundamentally enhanced social networks, which allows users to share their experiences, reviews, ratings, photos, check-ins, etc. The geographical information located by the smart phone bridges the gap between the physical and digital worlds. Location data functions as the connection between a user's physical behaviors and the virtual social networks structured by the smart phone or web services. We refer to these social networks involving geographical information as location-based social networks (LBSNs). Such information brings opportunities and challenges for recommender systems to solve the cold start and sparsity problems of datasets and rating prediction. In this paper, we make full use of mobile users’ location-sensitive characteristics to carry out rating prediction. We mine: 1) the relevance between users’ ratings and user-item geographical location distances, called the user-item geographical connection, and 2) the relevance between users’ rating differences and user-user geographical location distances, called the user-user geographical connection. It is discovered that humans’ rating behaviors are significantly affected by geographical location. Moreover, three factors: user-item geographical connection, user-user geographical connection, and interpersonal interest similarity, are fused into a unified rating prediction model. We conduct a series of experiments on a real social rating network dataset, Yelp. Experimental results demonstrate that the proposed approach outperforms existing models.

Journal ArticleDOI
TL;DR: A new game-theoretic approach towards community detection in large-scale complex networks based on modified modularity is presented; this method was developed based on modified adjacency, modified Laplacian matrices and neighborhood similarity to partition a given network into dense communities.
Abstract: Community detection is a fundamental component of large network analysis. In both academia and industry, progressive research has been made on problems related to community network analysis. Community detection is gaining significant attention and importance in the area of network science. Regular and synthetic complex networks have motivated intense interest in studying the fundamental unifying principles of various complex networks. This paper presents a new game-theoretic approach towards community detection in large-scale complex networks based on modified modularity; this method was developed based on modified adjacency, modified Laplacian matrices and neighborhood similarity. The approach is used to partition a given network into dense communities. It is based on determining a Nash stable partition, which is a pure strategy Nash equilibrium of an appropriately defined strategic game in which the nodes of the network are the players and the strategy of a node is to decide which community it ought to belong to. Players choose to belong to a community according to a maximized fitness/payoff. The quality of the community networks is assessed using modified modularity along with a new fitness function. Community partitions are evaluated using Normalized Mutual Information and a ‘modularity measure’, comparing the new game-theoretic community detection algorithm (NGTCDA) with well-studied and well-known algorithms, such as Fast Newman, Fast Modularity Detection, and Louvain Community. The quality of a network partition in communities is evaluated by looking at the contribution of each node and its neighbors against the strength of its community.

Journal ArticleDOI
TL;DR: DiP-SVM is presented, a distribution preserving kernel support vector machine where the first and second order statistics of the entire dataset are retained in each of the partitions, thereby reducing the chance of missing important global support vectors.
Abstract: In the literature, the task of learning a support vector machine for large datasets has been performed by splitting the dataset into manageable sized “partitions” and training a sequential support vector machine on each of these partitions separately to obtain local support vectors. However, this process invariably leads to a loss in classification accuracy as global support vectors may not have been chosen as local support vectors in their respective partitions. We hypothesize that retaining the original distribution of the dataset in each of the partitions can help solve this issue. Hence, we present DiP-SVM, a distribution preserving kernel support vector machine where the first and second order statistics of the entire dataset are retained in each of the partitions. This helps in obtaining local decision boundaries which are in agreement with the global decision boundary, thereby reducing the chance of missing important global support vectors. We show that DiP-SVM achieves a minimal loss in classification accuracy compared with other distributed support vector machine techniques on several benchmark datasets. We further demonstrate that our approach reduces communication overhead between partitions, leading to faster execution on large datasets and making it suitable for implementation in cloud environments.
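
For context, a common partitioned-SVM baseline trains local SVMs per partition and retrains on the pooled local support vectors; the sketch below does this with random splits, which only roughly retain the global first- and second-order statistics that DiP-SVM preserves by construction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Partitioned SVM baseline: local SVMs on each partition, then a final SVM on
# the pooled local support vectors. Dataset, partition count, and kernel
# settings are illustrative.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
parts = np.array_split(order, 4)

sv_idx = []
for p in parts:
    local = SVC(kernel="rbf", gamma="scale").fit(X[p], y[p])
    sv_idx.extend(p[local.support_])          # map local SV indices back to global

sv_idx = np.unique(sv_idx)
final = SVC(kernel="rbf", gamma="scale").fit(X[sv_idx], y[sv_idx])
print(f"pooled support vectors: {len(sv_idx)}, "
      f"training accuracy: {final.score(X, y):.3f}")

# Sanity check: how far each random partition's mean drifts from the global mean.
for i, p in enumerate(parts):
    print(f"partition {i}: mean deviation "
          f"{np.linalg.norm(X[p].mean(0) - X.mean(0)):.3f}")
```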

Journal ArticleDOI
TL;DR: This paper presents the first study of two fundamental questions that are unprecedentedly important for urban planners seeking to understand the functional characteristics of various urban regions throughout a city, by developing a weather-traffic index (WTI) system.
Abstract: In this work, we focus on two fundamental questions that are unprecedentedly important for urban planners to understand the functional characteristics of various urban regions throughout a city, namely, (i) how to identify a regional weather-traffic sensitivity index throughout a city, which indicates the degree to which a region's traffic is impacted by weather changes; and (ii) among complex regional features, such as road structure and population density, how to dissect the most influential regional features that make urban region traffic more vulnerable to weather changes. However, these two questions are nontrivial to answer, because urban traffic changes dynamically over time and is essentially affected by many other factors, which may dominate the overall impact. We make the first study of these questions by developing a weather-traffic index (WTI) system. The system includes two main components: weather-traffic index establishment and key factor analysis. Using the proposed system, we conducted a comprehensive empirical study in Shanghai, and the weather-traffic indices extracted have been validated to be surprisingly consistent with real-world observations. Further regional key factor analysis yields interesting results. For example, house age has a significant impact on the weather-traffic index, which sheds light on future urban planning and reconstruction.

Journal ArticleDOI
TL;DR: This paper investigates a three-tier cross-domain architecture, proposes an efficient and privacy-preserving big data deduplication scheme for cloud storage (EPCDD), and demonstrates that EPCDD outperforms existing competing schemes in terms of computation, communication and storage overheads.
Abstract: Secure data deduplication can significantly reduce the communication and storage overheads in cloud storage services, and has potential applications in our big data-driven society. Existing data deduplication schemes are generally designed to either resist brute-force attacks or ensure efficiency and data availability, but not both. We are also not aware of any existing scheme that achieves accountability, in the sense of reducing duplicate information disclosure (e.g., to determine whether plaintexts of two encrypted messages are identical). In this paper, we investigate a three-tier cross-domain architecture, and propose an efficient and privacy-preserving big data deduplication scheme for cloud storage (hereafter referred to as EPCDD). EPCDD achieves both privacy preservation and data availability, and resists brute-force attacks. In addition, we take accountability into consideration to offer better privacy assurances than existing schemes. We then demonstrate that EPCDD outperforms existing competing schemes in terms of computation, communication and storage overheads. In addition, the time complexity of duplicate search in EPCDD is logarithmic.
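
Only the logarithmic duplicate search is illustrated below, using a sorted tag list and binary search; SHA-256 of the plaintext stands in for EPCDD's actual deduplication tags, and the three-tier architecture and brute-force resistance are not captured.

```python
import bisect
import hashlib

# Toy duplicate-search index: tags kept sorted so membership tests are
# O(log n) via binary search. SHA-256 of the plaintext is only a stand-in
# for the scheme's privacy-aware deduplication tags.
class DedupIndex:
    def __init__(self):
        self._tags = []

    @staticmethod
    def tag(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def contains(self, data: bytes) -> bool:
        t = self.tag(data)
        i = bisect.bisect_left(self._tags, t)
        return i < len(self._tags) and self._tags[i] == t

    def add(self, data: bytes) -> bool:
        """Return True if the block was new and stored, False if deduplicated."""
        if self.contains(data):
            return False
        bisect.insort(self._tags, self.tag(data))
        return True

index = DedupIndex()
print(index.add(b"block-1"))       # True  (stored)
print(index.add(b"block-1"))       # False (duplicate detected)
print(index.contains(b"block-2"))  # False
```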

Journal ArticleDOI
TL;DR: A visual analytic system to help users handle the large-scale trajectory data, compare different route choices, and explore the underlying reasons of route choice behaviour is developed.
Abstract: There are often multiple routes between regions. Drivers choose different routes with different considerations. Such considerations have always been a point of interest in the transportation area. Studies of route choice behaviour are usually based on small-range experiments with a group of volunteers. However, the experiment data is quite limited in its spatial and temporal scale as well as its practical reliability. In this work, we explore the possibility of studying route choice behaviour based on a general trajectory dataset, which is more realistic and covers a wider scale. We develop a visual analytic system to help users handle the large-scale trajectory data, compare different route choices, and explore the underlying reasons. Specifically, the system consists of: 1) the interactive trajectory filtering, which supports graphical trajectory queries; 2) the spatial visualization, which gives an overview of all feasible routes extracted from filtered trajectories; and 3) the factor visual analytics, which provides the exploration and hypothesis construction of different factors’ impact on route choice behaviour, and verification with an integrated route choice model. Applied to a real taxi GPS dataset, we report the system’s performance and demonstrate its effectiveness with three cases.

Journal ArticleDOI
TL;DR: It is demonstrated how the model learned with the method can be used to identify the most likely and distinctive features of a geographical area, quantify the importance of the features used in the model, and discover similar regions across different cities using publicly shared Foursquare data.
Abstract: Data generated on location-based social networks provide rich information on the whereabouts of urban dwellers. Specifically, such data reveal who spends time where, when, and on what type of activity (e.g., shopping at a mall, or dining at a restaurant). That information can, in turn, be used to describe city regions in terms of the activity that takes place therein. For example, the data might reveal that citizens visit one region mainly for shopping in the morning, while another for dining in the evening. Furthermore, once such a description is available, one can ask more elaborate questions. For example, one might ask what features distinguish one region from another: some regions might be different in terms of the type of venues they host and others in terms of the visitors they attract. As another example, one might ask which regions are similar across cities. In this paper, we present a method to answer such questions using publicly shared Foursquare data. Our analysis makes use of a probabilistic model, the features of which include the exact location of activity, the users who participate in the activity, as well as the time of the day and day of week the activity takes place. Compared to previous approaches to similar tasks, our probabilistic modeling approach allows us to make minimal assumptions about the data, which relieves us from having to set arbitrary parameters in our analysis (e.g., regarding the granularity of discovered regions or the importance of different features). We demonstrate how the model learned with our method can be used to identify the most likely and distinctive features of a geographical area, quantify the importance of the features used in the model, and discover similar regions across different cities. Finally, we perform an empirical comparison with previous work and discuss insights obtained through our findings.

Journal ArticleDOI
TL;DR: In this article, the authors examined and compared lifestyle behaviors of people living in cities of different sizes, utilizing freely available social media data as a large-scale, low-cost alternative to traditional survey methods.
Abstract: Lifestyles are a valuable model for understanding individuals’ physical and mental lives, comparing social groups, and making recommendations for improving people's lives. In this paper, we examine and compare lifestyle behaviors of people living in cities of different sizes, utilizing freely available social media data as a large-scale, low-cost alternative to traditional survey methods. We use the Greater New York City area as a representative for large cities, and the Greater Rochester area as a representative for smaller cities in the United States. We employed matrix factor analysis as an unsupervised method to extract salient mobility and work-rest patterns for a large population of users within each metropolitan area. We discovered interesting human behavior patterns at both a larger scale and a finer granularity than is present in previous literature, some of which allow us to quantitatively compare the behaviors of individuals living in big cities to those living in small cities. We believe that our social media-based approach to lifestyle analysis represents a powerful tool for social computing in the big data age.
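
As a stand-in for the matrix factor analysis described above, the sketch below factorizes a synthetic user-by-hour activity matrix with non-negative matrix factorization to recover temporal "lifestyle" patterns; the data and the choice of NMF are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Factor a user-by-hour-of-day activity count matrix into a small number of
# temporal patterns. The counts are synthetic mixtures of three hand-made
# daily profiles (morning, evening, late-night).
rng = np.random.default_rng(0)
n_users, k = 200, 3

hours = np.arange(24)
morning = np.exp(-0.5 * ((hours - 9) / 2.0) ** 2)
evening = np.exp(-0.5 * ((hours - 20) / 2.0) ** 2)
night = np.exp(-0.5 * ((hours - 2) / 3.0) ** 2)
basis = np.vstack([morning, evening, night])                 # (k, 24)

weights = rng.gamma(2.0, 1.0, size=(n_users, k))
activity = rng.poisson(weights @ basis * 5)                  # (users, hours)

model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
user_loadings = model.fit_transform(activity)   # how strongly each user follows each pattern
patterns = model.components_                    # hour-of-day profile of each pattern
print(patterns.round(2))
```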

Journal ArticleDOI
TL;DR: Novel models and algorithms are proposed for discovering statistically significant linear hotspots, using neighbor node filtering, shortest path tree pruning, and Monte Carlo speedup.
Abstract: Given a spatial network and a collection of activities (e.g., pedestrian fatality reports, crime reports), Significant Linear Hotspot Discovery (SLHD) finds all shortest paths in the spatial network where the concentration of activities is statistically significantly high. SLHD is important for societal applications in transportation safety or public safety, such as finding paths with significant concentrations of accidents or crimes. SLHD is challenging because 1) there is a potentially large number of candidate paths (~10^16) in a given dataset with millions of activities and road network nodes, and 2) the test statistic (e.g., density ratio) is not monotonic. Hotspot detection approaches in Euclidean space (e.g., SaTScan) may miss significant paths, since a large fraction of the area bounded by Euclidean shapes around activities on a path will be empty. Previous network-based approaches consider only paths between road intersections but not activities. This paper proposes novel models and algorithms for discovering statistically significant linear hotspots using neighbor node filtering, shortest path tree pruning, and Monte Carlo speedup. We present case studies comparing the proposed approaches with existing techniques on real data. Experimental results show that the proposed algorithms yield substantial computational savings without reducing result quality.
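
Only the Monte Carlo significance test is sketched here: the density ratio of a hypothetical candidate path is compared against ratios obtained by redistributing activities over the network in proportion to segment length; the candidate enumeration, neighbor node filter, and pruning steps are omitted.

```python
import numpy as np

# Toy road network: per-segment lengths (km) and activity counts, plus one
# candidate shortest path given as segment indices. All values are invented.
rng = np.random.default_rng(0)
edge_length = np.array([1.0, 2.0, 1.5, 0.5, 3.0, 2.0])
activity_count = np.array([2, 1, 9, 6, 1, 1])
path_edges = [2, 3]

def density_ratio(counts, lengths, path):
    on = np.zeros(len(counts), dtype=bool)
    on[path] = True
    inside = counts[on].sum() / lengths[on].sum()
    outside = counts[~on].sum() / lengths[~on].sum()
    return inside / outside

observed = density_ratio(activity_count, edge_length, path_edges)
total = activity_count.sum()
probs = edge_length / edge_length.sum()

# Monte Carlo null: activities fall uniformly along the network's total length.
sims = np.array([
    density_ratio(rng.multinomial(total, probs), edge_length, path_edges)
    for _ in range(999)
])
p_value = (1 + (sims >= observed).sum()) / (1 + len(sims))
print(f"density ratio = {observed:.2f}, Monte Carlo p-value = {p_value:.3f}")
```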

Journal ArticleDOI
TL;DR: The Infomap algorithm showcased the best trade-off between accuracy and computational performance and should therefore be considered a promising tool for Web Data Analytics purposes.
Abstract: Detecting communities in graphs is a fundamental tool to understand the structure of Web-based systems and predict their evolution. Many community detection algorithms are designed to process undirected graphs (i.e., graphs with bidirectional edges), but many graphs on the Web (e.g., microblogging Web sites, trust networks or the Web graph itself) are often directed. Few community detection algorithms deal with directed graphs, and we lack an experimental comparison of them. In this paper we evaluate several community detection algorithms in terms of accuracy and scalability. A first group of algorithms (Label Propagation and Infomap) are explicitly designed to manage directed graphs, while a second group (e.g., WalkTrap) simply ignores edge directionality; finally, a third group of algorithms (e.g., Eigenvector) maps input graphs onto undirected ones and extracts communities from the symmetrized version of the input graph. We ran our tests on both artificial and real graphs; on artificial graphs, WalkTrap achieved the highest accuracy, closely followed by other algorithms, while Label Propagation showed outstanding scalability on both artificial and real graphs. The Infomap algorithm showcased the best trade-off between accuracy and computational performance and should therefore be considered a promising tool for Web Data Analytics purposes.
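
A minimal version of such a comparison can be reproduced with python-igraph on a small planted-partition benchmark, scoring each result by NMI against the planted labels; the graph size, block probabilities, and the three algorithms chosen below are illustrative.

```python
import igraph as ig

# Small directed benchmark: two planted groups with dense intra-group and
# sparse inter-group links. Sizes and probabilities are illustrative.
g = ig.Graph.SBM(
    n=200,
    pref_matrix=[[0.10, 0.01], [0.01, 0.10]],
    block_sizes=[100, 100],
    directed=True,
)
truth = [0] * 100 + [1] * 100

communities = {
    "infomap": g.community_infomap(),
    "label_propagation": g.community_label_propagation(),
    "walktrap_undirected": g.as_undirected().community_walktrap().as_clustering(),
}

for name, clustering in communities.items():
    nmi = ig.compare_communities(truth, clustering.membership, method="nmi")
    print(f"{name}: {len(clustering)} communities, NMI = {nmi:.3f}")
```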

Journal ArticleDOI
TL;DR: This paper puts forward the first identity-based (ID-based) signcryption scheme with efficient revocation as well as the feature to outsource unsigncryption to enable secure big data communications between data collectors and data analytical system(s).
Abstract: To be able to leverage big data to achieve enhanced strategic insight, process optimization and make informed decisions, we need an efficient access control mechanism for ensuring end-to-end security of such information assets. Signcryption is one of several promising techniques to simultaneously achieve big data confidentiality and authenticity. However, signcryption suffers from the limitation of not being able to revoke users from a large-scale system efficiently. We put forward, in this paper, the first identity-based (ID-based) signcryption scheme with efficient revocation as well as the feature to outsource unsigncryption to enable secure big data communications between data collectors and data analytical system(s). Our scheme is designed to achieve end-to-end confidentiality, authentication, non-repudiation, and integrity simultaneously, while providing scalable revocation functionality such that the overhead demanded by the private key generator (PKG) in the key-update phase only increases logarithmically with the cardinality of users. Although in our scheme the majority of the unsigncryption tasks are outsourced to an untrusted cloud server, this approach does not affect the security of the proposed scheme. We then prove the security of our scheme, as well as demonstrating its utility using simulations.

Journal ArticleDOI
TL;DR: This review paper discussed different kinds of methods for mitosis detection, like tracking based methods, tracking free methods, hybrid methods, and the most recently proposed works based on deep learning architecture, and found that deep learning based approaches have achieved a great improvement in performance.
Abstract: Detecting mitosis from a cell population is a fundamental problem in many biological research efforts and biomedical applications. In modern research, advanced imaging technologies have been applied to generate large amounts of microscopy images of cells. However, detecting all mitotic cells in these images with the human eye is tedious and time-consuming. In recent years, several approaches have been proposed to help humans finish this job automatically with high efficiency and accuracy. In this review paper, we first describe some commonly used datasets for mitosis detection, and then discuss different kinds of methods for mitosis detection, such as tracking-based methods, tracking-free methods, hybrid methods, and the most recently proposed works based on deep learning architectures. We compare these methods on the same datasets, and find that deep learning based approaches have achieved a great improvement in performance. Finally, we discuss possible future approaches to mitosis detection that combine the success of previous works with the advantages of big data in modern research. Considering that expertise is highly required in the biomedical area, we further discuss the possibility of learning information from biomedical big data with less expert annotation.

Journal ArticleDOI
TL;DR: This paper proposes a novel incentive scheme based on the trust of mobile users in the MSC to allocate the tasks of big data and proves that the proposal can outperform other existing methods with a low delay and a high efficiency.
Abstract: Recently, the mobile social cloud (MSC), formed by mobile users with social ties, has been advocated for allocating the tasks of big data applications instead of relying on conventional cloud systems. However, due to the dynamic topology of networks and the social features of users, how to optimally allocate tasks to mobile users based on trust becomes a new challenge. Therefore, this paper proposes a novel incentive scheme based on the trust of mobile users in the MSC to allocate the tasks of big data. Firstly, a social trust degree is defined according to the social ties among users, the importance of the task, and the available resources of the networks. With the social trust degree, the task owner can select a group of mobile users as the candidates for task allocation. Secondly, a reverse auction game model is developed to study the interactions among the task owner and the candidates. With the reverse auction game model, the optimal strategy of task allocation can be obtained with a low cost for the task owner, while the selected candidates can also obtain a high profit. Finally, simulation experiments are carried out to show that the proposal outperforms other existing methods with a low delay and a high efficiency when allocating tasks in the MSC.

Journal ArticleDOI
TL;DR: This work proposes that by leveraging label information, the task-related discriminative sources can be much better retrieved among strong spontaneous background signals, and extends the framework to the VB-SCCD model, which aims to estimate extended brain sources by including a spatial total variation regularization term.
Abstract: EEG source imaging integrates temporal and spatial components of EEG to localize the generating source of electrical potentials based on recorded EEG data on the scalp. As EEG sensors cannot directly measure activated brain sources, many approaches have been proposed to estimate brain source activation patterns given EEG data. However, since most of the brain activity is composed of spontaneous, non-task-related activations, true task-related activation sources are corrupted by strong background signals. For decades, the EEG inverse problem was solved in an unsupervised way without any utilization of the label information that represents different brain states. We propose that by leveraging label information, the task-related discriminative sources can be much better retrieved among strong spontaneous background signals. A novel model for solving the EEG inverse problem, called Laplacian Graph Regularized Discriminative Source Reconstruction, is proposed, which aims to explicitly extract the discriminative sources by implicitly coding the label information into the graph regularization term. The proposed model can be generally extended with different assumptions. As an extension, our framework is applied to the VB-SCCD model, which aims to estimate extended brain sources by including a spatial total variation regularization term. Simulation results show the effectiveness of the proposed framework.

Journal ArticleDOI
TL;DR: This paper uses real time traffic flow data to generate dense functional correlation matrices between zones during different times of the day, and derives optimal sparse representations of these dense functional matrices that accurately recover not only the existing road network connectivity between zones, but also reveal new latent links between zones that do not yet exist but are suggested by traffic flow dynamics.
Abstract: Mobility in a city is represented as traffic flows in and out of defined urban travel or administrative zones. While the zones and the road networks connecting them are fixed in space, traffic flows between pairs of zones are dynamic through the day. Understanding these dynamics in real time is crucial for real time traffic planning in the city. In this paper, we use real time traffic flow data to generate dense functional correlation matrices between zones during different times of the day. Then, we derive optimal sparse representations of these dense functional matrices, that accurately recover not only the existing road network connectivity between zones, but also reveal new latent links between zones that do not yet exist but are suggested by traffic flow dynamics. We call this sparse representation the time-varying effective traffic connectivity of the city. A convex optimization problem is formulated and used to infer the sparse effective traffic network from time series data of traffic flow for arbitrary levels of temporal granularity. We demonstrate the results for the city of Doha, Qatar on data collected from several hundred bluetooth sensors deployed across the city to record vehicular activity through the city's traffic zones. While the static road network connectivity between zones is accurately inferred, other long range connections are also predicted that could be useful in planning future road linkages in the city. Further, the proposed model can be applied to socio-economic activity other than traffic, such as new housing, construction, or economic activity captured as functional correlations between zones, and can also be similarly used to predict new traffic linkages that are latently needed but as yet do not exist. Preliminary experiments suggest that our framework can be used by urban transportation experts and policy specialists to take a real time data-driven approach towards urban planning and real time traffic planning in the city, especially at the level of administrative zones of a city.
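
The paper's exact convex program is not reproduced; as a stand-in, the sketch below uses the graphical lasso (sparse inverse covariance estimation) to recover a sparse zone-to-zone connectivity structure from synthetic flow time series.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

# Infer a sparse connectivity structure between traffic zones from their flow
# time series. The data are synthetic flows for 10 hypothetical zones driven
# by a few shared latent factors, observed over 500 time steps.
rng = np.random.default_rng(0)
n_zones, n_steps = 10, 500
latent = rng.standard_normal((n_steps, 3))                 # shared traffic drivers
mixing = rng.standard_normal((3, n_zones)) * (rng.random((3, n_zones)) < 0.4)
flows = latent @ mixing + 0.5 * rng.standard_normal((n_steps, n_zones))

model = GraphicalLassoCV().fit(flows)
precision = model.precision_

# Non-zero off-diagonal entries are the inferred "effective" zone-to-zone links.
links = [
    (i, j) for i in range(n_zones) for j in range(i + 1, n_zones)
    if abs(precision[i, j]) > 1e-3
]
print(f"selected regularization alpha = {model.alpha_:.4f}")
print(f"inferred links: {links}")
```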

Journal ArticleDOI
TL;DR: Experimental results on real-life data demonstrate that this approach can significantly improve the scalability of multidimensional anonymisation over existing methods, and the applicability of this approach to differential privacy is shown.
Abstract: Scalable data processing platforms built on cloud computing are becoming increasingly attractive as infrastructure for supporting big data applications. But privacy concerns are one of the major obstacles to making use of public cloud platforms. Multidimensional anonymisation, a global-recoding generalisation scheme for privacy-preserving data publishing, has been a recent focus due to its capability of balancing data obfuscation and usability. Existing multidimensional anonymisation methods suffer from scalability problems when handling big data due to the impractical serial I/O cost. Given the recursive feature of multidimensional anonymisation, parallelisation is an ideal solution to scalability issues. However, it is still a challenge to use existing distributed and parallel paradigms directly for recursive computation. In this paper, we propose a scalable approach for big data multidimensional anonymisation based on MapReduce, a state-of-the-art data processing paradigm. Our basic idea is to partition a data set recursively into smaller partitions using MapReduce until all partitions can fit in the memory of a computing node. A tree indexing structure is proposed to achieve recursive computation. Moreover, we show the applicability of our approach to differential privacy. Experimental results on real-life data demonstrate that our approach can significantly improve the scalability of multidimensional anonymisation over existing methods.
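
The recursive step being parallelized is essentially Mondrian-style multidimensional partitioning; the single-machine sketch below shows that step only, with a simplified range generalisation and synthetic quasi-identifiers, while the MapReduce driver and tree index from the paper are omitted.

```python
import numpy as np

def mondrian_partition(records, k=5):
    """Recursively median-split records on the widest attribute until a split
    would violate k-anonymity; each leaf is generalised to its attribute ranges."""
    records = np.asarray(records, dtype=float)
    if len(records) < 2 * k:
        low, high = records.min(axis=0), records.max(axis=0)
        return [(len(records), list(zip(low, high)))]       # generalised partition
    spans = records.max(axis=0) - records.min(axis=0)
    dim = int(np.argmax(spans))                              # widest quasi-identifier
    median = np.median(records[:, dim])
    left = records[records[:, dim] <= median]
    right = records[records[:, dim] > median]
    if len(left) < k or len(right) < k:                      # cannot split further
        low, high = records.min(axis=0), records.max(axis=0)
        return [(len(records), list(zip(low, high)))]
    return mondrian_partition(left, k) + mondrian_partition(right, k)

# Two synthetic quasi-identifier columns, e.g., age and a postcode prefix.
rng = np.random.default_rng(0)
data = np.column_stack([rng.integers(18, 80, 100), rng.integers(100, 999, 100)])
for size, ranges in mondrian_partition(data, k=10):
    print(size, ranges)
```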

Journal ArticleDOI
TL;DR: A recent entity-centric knowledge graph effort resulted in a semantic search engine to assist analysts and investigative experts in the HT domain, and enables investigators to satisfy their information needs by posing investigative search queries to a special-purpose semantic execution engine.
Abstract: Web advertising related to Human Trafficking (HT) activity has been on the rise in recent years. Answering entity-centric questions over crawled HT Web corpora to assist investigators in the real world is an important social problem, involving many technical challenges. This paper describes a recent entity-centric knowledge graph effort that resulted in a semantic search engine to assist analysts and investigative experts in the HT domain. The overall approach takes as input a large corpus of advertisements crawled from the Web, structures it into an indexed knowledge graph, and enables investigators to satisfy their information needs by posing investigative search queries to a special-purpose semantic execution engine. We evaluated the search engine on real-world data collected from over 90,000 webpages, a significant fraction of which correlates with HT activity. Performance on four relevant categories of questions, measured by a mean average precision metric, was found to be promising, outperforming a learning-to-rank approach on three of the four categories. The prototype uses open-source components and scales to terabyte-scale corpora. Principles of the prototype have also been independently replicated, with similarly successful results.