
Showing papers in "IEEE Transactions on Big Data in 2020"


Journal ArticleDOI
TL;DR: Network representation learning as discussed by the authors is a new learning paradigm to embed network vertices into a low-dimensional vector space, by preserving network topology structure, vertex content, and other side information.
Abstract: With the widespread use of information technologies, information networks are becoming increasingly popular to capture complex relationships across various disciplines, such as social networks, citation networks, telecommunication networks, and biological networks. Analyzing these networks sheds light on different aspects of social life such as the structure of societies, information diffusion, and communication patterns. In reality, however, the large scale of information networks often makes network analytic tasks computationally expensive or intractable. Network representation learning has been recently proposed as a new learning paradigm to embed network vertices into a low-dimensional vector space, by preserving network topology structure, vertex content, and other side information. This facilitates the original network to be easily handled in the new vector space for further analysis. In this survey, we perform a comprehensive review of the current literature on network representation learning in the data mining and machine learning field. We propose new taxonomies to categorize and summarize the state-of-the-art network representation learning techniques according to the underlying learning mechanisms, the network information intended to preserve, as well as the algorithmic designs and methodologies. We summarize evaluation protocols used for validating network representation learning including published benchmark datasets, evaluation methods, and open source algorithms. We also perform empirical studies to compare the performance of representative algorithms on common datasets, and analyze their computational complexity. Finally, we suggest promising research directions to facilitate future study.

494 citations
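
To make the embedding idea concrete, here is a minimal, hypothetical sketch of one family of techniques that surveys of this kind cover: random-walk sampling followed by skip-gram training (DeepWalk-style). The toy graph, walk lengths, and dimensions are illustrative assumptions, not the survey's own setup.

```python
# Hypothetical sketch: embed graph vertices by treating truncated random walks
# as "sentences" and training a skip-gram model on them.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, walks_per_node=10, walk_length=40):
    """Generate truncated random walks; each walk is a sequence of node ids."""
    walks = []
    nodes = list(G.nodes())
    for _ in range(walks_per_node):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append([str(n) for n in walk])
    return walks

G = nx.karate_club_graph()                      # toy stand-in for an information network
model = Word2Vec(random_walks(G), vector_size=64, window=5,
                 min_count=0, sg=1, workers=2, epochs=5)
print(model.wv[str(0)][:5])                     # first entries of vertex 0's 64-d embedding
```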


Journal ArticleDOI
TL;DR: Deep models based on transfer and multi-task learning significantly outperformed other methods for annotating gene expression patterns at different stage ranges and a partial transfer learning scheme was proposed.
Abstract: A central theme in learning from image data is to develop appropriate representations for the specific task at hand. Thus, a practical challenge is to determine what features are appropriate for specific tasks. For example, in the study of gene expression patterns in Drosophila, texture features were particularly effective for determining the developmental stages from in situ hybridization (ISH) images. Such an image representation is, however, not suitable for controlled vocabulary term annotation. Here, we developed feature extraction methods to generate hierarchical representations for ISH images. Our approach is based on deep convolutional neural networks that can act on image pixels directly. To make the extracted features generic, the models were trained using a natural image set with millions of labeled examples. These models were transferred to the ISH image domain. To account for the differences between the source and target domains, we proposed a partial transfer learning scheme in which only part of the source model is transferred. We employed a multi-task learning method to fine-tune the pre-trained models with labeled ISH images. Results showed that feature representations computed by deep models based on transfer and multi-task learning significantly outperformed other methods for annotating gene expression patterns at different stage ranges.

133 citations
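
The following is a hypothetical sketch of the "partial transfer plus multi-task fine-tuning" pattern the abstract describes: only the lower layers of an ImageNet-pretrained CNN are reused, and one classification head is attached per stage range. The backbone choice, layer cut point, and head sizes are illustrative assumptions.

```python
# Hypothetical sketch: partial transfer (keep early conv blocks only) with
# multiple task-specific heads for multi-task fine-tuning.
import torch
import torch.nn as nn
from torchvision import models

class PartialTransferMultiTask(nn.Module):
    def __init__(self, num_terms_per_stage=(20, 20, 20)):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")
        # partial transfer: reuse only the lower blocks of the source model
        self.shared = nn.Sequential(*list(backbone.children())[:6])   # conv1 .. layer2
        self.pool = nn.AdaptiveAvgPool2d(1)
        # multi-task learning: one head per developmental stage range
        self.heads = nn.ModuleList(nn.Linear(128, n) for n in num_terms_per_stage)

    def forward(self, x):
        h = self.pool(self.shared(x)).flatten(1)
        return [head(h) for head in self.heads]    # one logit vector per task

model = PartialTransferMultiTask()
logits = model(torch.randn(2, 3, 224, 224))
print([l.shape for l in logits])
```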


Journal ArticleDOI
TL;DR: A novel mask matrix is proposed to assist the back-propagation in the training stage of HSI classification with an end-to-end, pixel-to-pixel architecture and the dense conditional random field is introduced into the framework to further balance the local and global information.
Abstract: In recent years, patchwise classification methods are commonly adopted when dealing with the hyperspectral image (HSI) classification. Despite their promising results from the perspective of accuracy, the efficiency of these methods can hardly be ensured since there are redundant computations between adjacent patches. In this paper, we propose a spectral-spatial fully convolutional network for HSI classification with an end-to-end, pixel-to-pixel architecture. Compared with patchwise methods, the proposed framework can avoid the patch extraction and is more efficient. Since the training samples in HSIs are highly sparse, the training strategy in original fully convolutional networks is no longer feasible for HSIs. To solve this problem, we propose a novel mask matrix to assist the back-propagation in the training stage. Considering the importance of spectral and spatial features may vary for different objects and scenes, we combine both features with two weighting factors which can be adaptively learned during the network training. Besides, the dense conditional random field (CRF) is introduced into the framework to further balance the local and global information. Experiments on three benchmark HSI data sets demonstrate that the proposed method can yield competitive results with less time costs compared with patchwise methods.

108 citations
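
To illustrate the mask-matrix idea with sparse training pixels, here is a minimal, hypothetical sketch: the per-pixel loss is back-propagated only where labels exist. Class counts and sparsity are made-up stand-ins.

```python
# Hypothetical sketch: masked per-pixel loss so only labeled pixels contribute
# to back-propagation in a fully convolutional HSI classifier.
import torch
import torch.nn.functional as F

def masked_pixel_loss(logits, labels, mask):
    """logits: (B, C, H, W); labels: (B, H, W); mask: (B, H, W), 1 at labeled pixels."""
    per_pixel = F.cross_entropy(logits, labels, reduction="none")   # (B, H, W)
    per_pixel = per_pixel * mask                                    # zero out unlabeled pixels
    return per_pixel.sum() / mask.sum().clamp(min=1)

logits = torch.randn(1, 9, 64, 64, requires_grad=True)    # 9 land-cover classes (assumed)
labels = torch.randint(0, 9, (1, 64, 64))
mask = (torch.rand(1, 64, 64) < 0.05).float()              # ~5% of pixels labeled (assumed)
loss = masked_pixel_loss(logits, labels, mask)
loss.backward()
```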


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper analyzed the three core issues of remote sensing image retrieval and provided a comprehensive review on existing methods, focusing on the feature extraction issue and how to use powerful deep representations to address this task.
Abstract: Remote sensing (RS) image retrieval is of great significance for geological information mining. Over the past two decades, a large amount of research on this task has been carried out, which mainly focuses on the following three core issues: feature extraction, similarity metric, and relevance feedback. Due to the complexity and multiformity of ground objects in high-resolution remote sensing (HRRS) images, there is still room for improvement in the current retrieval approaches. In this article, we analyze the three core issues of RS image retrieval and provide a comprehensive review on existing methods. Furthermore, with the goal of advancing the state-of-the-art in HRRS image retrieval, we focus on the feature extraction issue and delve into how to use powerful deep representations to address this task. We conduct a systematic investigation to evaluate the correlative factors that may affect the performance of deep features. By optimizing each factor, we acquire remarkable retrieval results on publicly available HRRS datasets. Finally, we explain the experimental phenomena in detail and draw conclusions according to our analysis. Our work can play a guiding role in the research of content-based RS image retrieval.

95 citations
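
As a minimal, hypothetical sketch of content-based retrieval with deep features (the general pattern this article studies), the snippet below embeds images with a pre-trained CNN and ranks by cosine similarity. The backbone, preprocessing, and random stand-in data are assumptions.

```python
# Hypothetical sketch: deep-feature descriptors + cosine-similarity ranking
# for content-based image retrieval.
import torch
import torch.nn.functional as F
from torchvision import models

backbone = models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()           # use the 2048-d pooled feature as a descriptor
backbone.eval()

@torch.no_grad()
def describe(batch):                         # batch: (N, 3, 224, 224) preprocessed images
    return F.normalize(backbone(batch), dim=1)

gallery = describe(torch.randn(100, 3, 224, 224))    # stand-in for an HRRS archive
query = describe(torch.randn(1, 3, 224, 224))
scores = query @ gallery.T                            # cosine similarities
print(scores.topk(5).indices)                         # indices of the top-5 retrieved images
```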


Journal ArticleDOI
TL;DR: This paper proposes and implements a machine learning strategy for smart edges using differential privacy, guaranteeing privacy protection by adding Laplace mechanisms, and designs two different algorithms, Output Perturbation and Objective Perturbation, which satisfy differential privacy.
Abstract: With the popularity of smart devices and the widespread use of machine learning methods, smart edges have become the mainstream approach to dealing with wireless big data. Nevertheless, when smart edges use machine learning models to analyze wireless big data, some models may unintentionally store a small portion of the training data with sensitive records. Thus, intruders can expose sensitive information by careful analysis of such a model. To solve this privacy issue, in this paper, we propose and implement a machine learning strategy for smart edges using differential privacy. We focus our attention on privacy protection in training datasets in the wireless big data scenario. Moreover, we guarantee privacy protection by adding Laplace mechanisms, and design two different algorithms, Output Perturbation (OPP) and Objective Perturbation (OJP), which satisfy differential privacy. In addition, we consider the privacy-preserving issues raised in the existing literature for differential privacy on correlated datasets, and further provide differential privacy preserving methods for correlated datasets, guaranteeing privacy by theoretical deduction. Finally, we implement the experiments on TensorFlow and evaluate our strategy on four datasets, i.e., MNIST, SVHN, CIFAR-10 and STL-10. The experiment results show that our methods can efficiently protect the privacy of training datasets and guarantee the accuracy on benchmark datasets.

93 citations
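
Below is a minimal, hypothetical sketch of the output-perturbation idea (OPP-style): train a model non-privately, then add Laplace noise to its parameters. The learner, the sensitivity value, and the epsilon are placeholders; the paper calibrates sensitivity for its specific setting.

```python
# Hypothetical sketch: Laplace output perturbation on a trained classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def output_perturbation(X, y, epsilon, sensitivity):
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    b = sensitivity / epsilon                         # Laplace scale for epsilon-DP
    clf.coef_ = clf.coef_ + np.random.laplace(0.0, b, size=clf.coef_.shape)
    clf.intercept_ = clf.intercept_ + np.random.laplace(0.0, b, size=clf.intercept_.shape)
    return clf

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)
private_clf = output_perturbation(X, y, epsilon=1.0, sensitivity=0.1)   # assumed values
print(private_clf.score(X, y))
```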


Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed DASGD algorithm outperforms state-of-the-art distributed SGD solvers for recommender systems in terms of prediction accuracy as well as scalability, making it highly useful for training LFA-based recommenders on large scale HiDS matrices with the help of cloud computing facilities.
Abstract: Latent factor analysis (LFA) via stochastic gradient descent (SGD) is highly efficient in discovering user and item patterns from high-dimensional and sparse (HiDS) matrices from recommender systems. However, most LFA-based recommender systems adopt a standard SGD algorithm, which suffers from limited scalability when addressing big data. On the other hand, most existing parallel SGD solvers are either designed under the memory-sharing framework for a bare machine or suffer from high communication costs, which also greatly limits their applications in large-scale systems. To address the above issues, this paper proposes a distributed alternative stochastic gradient descent (DASGD) solver for an LFA-based recommender. Its training dependences among latent features are decoupled by alternately fixing one half of the features to learn the other half following the principle of SGD, but in parallel. Its distribution mechanism consists of efficient data partition, allocation and task parallelization strategies, which greatly reduces its communication cost for high scalability. Experimental results on three large-scale HiDS matrices generated by real-world applications demonstrate that the proposed DASGD algorithm outperforms state-of-the-art distributed SGD solvers for recommender systems in terms of prediction accuracy as well as scalability. Hence, it is highly useful for LFA on HiDS matrices with the help of cloud computing facilities.

85 citations
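
To make the alternating-update idea concrete, here is a single-machine, hypothetical sketch of SGD-based latent factor analysis on a sparse rating matrix, updating one half of the factors at a time while the other half is fixed (the non-distributed core of the scheme; the hyperparameters and toy data are assumptions).

```python
# Hypothetical sketch: alternating SGD updates for latent factor analysis.
import numpy as np

def alternating_sgd(entries, n_users, n_items, k=8, lr=0.01, reg=0.05, epochs=20):
    """entries: list of (user, item, rating) triples from a HiDS matrix."""
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for side in ("users", "items"):            # fix one half, learn the other
            for u, i, r in entries:
                err = r - U[u] @ V[i]
                if side == "users":
                    U[u] += lr * (err * V[i] - reg * U[u])
                else:
                    V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

entries = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
U, V = alternating_sgd(entries, n_users=3, n_items=2)
print(U @ V.T)            # reconstructed (dense) predictions
```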


Journal ArticleDOI
TL;DR: In this article, a big data and machine learning enabled wireless channel model framework is proposed, which is based on artificial neural networks (ANNs), including feed-forward neural network (FNN) and radial basis function neural network (RBF-NN).
Abstract: The standardization process of the fifth generation (5G) wireless communications has recently been accelerated, and the first commercial 5G services were expected to be provided as early as 2018. The growing numbers of smartphones, new complex scenarios, large frequency bands, massive antenna elements, and dense small cells will generate big datasets and bring 5G communications into the era of big data. This paper investigates various applications of big data analytics, especially machine learning algorithms, in wireless communications and channel modeling. We propose a big data and machine learning enabled wireless channel model framework. The proposed channel model is based on artificial neural networks (ANNs), including the feed-forward neural network (FNN) and radial basis function neural network (RBF-NN). The input parameters are transmitter (Tx) and receiver (Rx) coordinates, Tx–Rx distance, and carrier frequency, while the output parameters are channel statistical properties, including the received power, root mean square (RMS) delay spread (DS), and RMS angle spreads (ASs). Datasets used to train and test the ANNs are collected from both real channel measurements and a geometry-based stochastic model (GBSM). Simulation results show good performance and indicate that machine learning algorithms can be powerful analytical tools for future measurement-based wireless channel modeling.

75 citations
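
The following is a minimal, hypothetical sketch of the FNN branch of such a framework: a small MLP maps (Tx/Rx coordinates, distance, carrier frequency) to channel statistics. The synthetic dataset and network sizes are illustrative assumptions, not the paper's measured data.

```python
# Hypothetical sketch: MLP regression from link geometry/frequency to channel statistics.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# stand-in inputs: [tx_x, tx_y, rx_x, rx_y, distance, carrier_frequency]
X = rng.uniform(size=(2000, 6))
# stand-in targets: [received_power, rms_delay_spread, rms_angle_spread]
y = np.column_stack([-60 - 20 * X[:, 4], 50 * X[:, 4] + 5, 30 * X[:, 5] + 10])

fnn = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0))
fnn.fit(X[:1500], y[:1500])
print(fnn.score(X[1500:], y[1500:]))     # R^2 on held-out samples
```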


Journal ArticleDOI
TL;DR: WAR (Web APIs Recommendation), the first data-driven approach for web APIs recommendation that integrates web API discovery, verification and selection operations based on keywords search over the web API correlation graph, is proposed.
Abstract: The ever-increasing popularity of web APIs allows app developers to leverage a set of existing APIs to achieve their sophisticated objectives. The heavily fragmented distribution of web APIs makes it challenging for an app developer to find appropriate and compatible web APIs. Currently, app developers usually have to manually discover candidate web APIs, verify their compatibility and select appropriate and compatible ones. This process is cumbersome and requires detailed knowledge of web APIs which is often too demanding. It has become a major obstacle to further and broader applications of web APIs. To address this issue, we first propose a web API correlation graph built on extensive data about the compatibility between web APIs. Then, we propose WAR (Web APIs Recommendation), the first data-driven approach for web APIs recommendation that integrates API discovery, verification and selection operations based on keywords search over the web API correlation graph. WAR assists app developers without detailed knowledge of web APIs in searching for appropriate and compatible APIs by typing a few keywords that represent the tasks required to achieve app developers’ objectives. We conducted large-scale experiments on 18,478 real-world APIs and 6,146 real-world apps to demonstrate the usefulness and efficiency of WAR.

68 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel solution called Dynamic Regional Combined short-term rainfall Forecasting approach (DRCF) using Multi-layer Perceptron (MLP) and shows that DRCF outperforms existing approaches in both threat score (TS) and root mean square error (RMSE).
Abstract: Rainfall forecasting is crucial in the field of meteorology and hydrology. However, existing solutions always achieve low prediction accuracy for short-term rainfall forecasting. Atmospheric forecasting models perform worse in many conditions. Machine learning approaches neglect the influences of physical factors in upstream or downstream regions, which makes forecasting accuracy fluctuate in different areas. To improve the overall forecasting accuracy for short-term rainfall, this paper proposes a novel solution called the Dynamic Regional Combined short-term rainfall Forecasting approach (DRCF) using a Multi-layer Perceptron (MLP). First, Principal Component Analysis (PCA) is used to reduce the dimension of thirteen physical factors, which serves as the input of the MLP. Second, a greedy algorithm is applied to determine the structure of the MLP. The surrounding sites are perceived based on the forecasting site. Finally, to solve the clutter interference caused by the extension of the perception range, DRCF is enhanced with several dynamic strategies. Experiments are conducted on data from 56 real-world meteorology sites in China, and we compare DRCF with atmospheric models and other machine learning approaches. The experimental results show that DRCF outperforms existing approaches in both threat score (TS) and root mean square error (RMSE).

62 citations
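
Here is a minimal, hypothetical sketch of the PCA-plus-MLP core described above: the thirteen physical factors are reduced to principal components and then regressed onto rainfall. The synthetic data, number of components, and network size are assumptions.

```python
# Hypothetical sketch: PCA dimension reduction followed by an MLP rainfall regressor.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))                                   # 13 physical factors per sample
rain = np.abs(X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=1000))   # synthetic target

drcf_core = make_pipeline(PCA(n_components=6),
                          MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0))
drcf_core.fit(X[:800], rain[:800])
print(drcf_core.score(X[800:], rain[800:]))
```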


Journal ArticleDOI
TL;DR: The proposed MDHOSVD method speeds up data processing, scales with data volume, improves the adaptability and extensibility over data diversity and converts low-level data into actionable knowledge.
Abstract: Big service is an extremely important application of service computing to provide predictive and needed services to humans. To operationalize big services, the heterogeneous data collected from Cyber-Physical-Social Systems (CPSS) must be processed efficiently. However, because of the rapid rise in the volume of data, faster and more efficient computational techniques are required. Therefore, in this paper, we propose a multi-order distributed high-order singular value decomposition method (MDHOSVD) with its incremental computational algorithm. To realize the MDHOSVD, a tensor blocks unfolding integration regulation is proposed. This method allows for the efficient analysis of large-scale heterogeneous data in blocks in an incremental fashion. Using simulations and experimental results from real-life data, the high efficiency of the proposed data processing and computational method is demonstrated. Further, a case study on cyber-physical-social system data processing is presented. The proposed MDHOSVD method speeds up data processing, scales with data volume, improves the adaptability and extensibility over data diversity and converts low-level data into actionable knowledge.

60 citations
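
For reference, the snippet below is a hypothetical sketch of a plain (non-distributed, non-incremental) higher-order SVD, the tensor decomposition that MDHOSVD computes blockwise; the tensor sizes and ranks are illustrative assumptions.

```python
# Hypothetical sketch: truncated HOSVD of a data tensor (mode factor matrices + core).
import numpy as np

def hosvd(T, ranks):
    factors = []
    for mode, r in enumerate(ranks):
        unfolding = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)   # mode-n unfolding
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :r])                                         # leading r left singular vectors
    core = T
    for mode, U in enumerate(factors):                                   # project onto each factor
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

T = np.random.default_rng(0).normal(size=(20, 30, 40))   # stand-in CPSS data tensor
core, factors = hosvd(T, ranks=(5, 5, 5))
print(core.shape, [U.shape for U in factors])
```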


Journal ArticleDOI
TL;DR: It is possible to find a close relationship among the GMM clustering mechanism, the multipath propagation characteristics and the CI evaluation index, and to determine the optimal number of Gaussian distributions without resorting to cross-validation.
Abstract: In this paper, the Gaussian mixture model (GMM) is introduced to implement channel multipath clustering. The GMM incorporates the covariance structure and the mean information of the channel multipaths, thus it can effectively reveal the similarity of the channel multipaths. First, the expectation-maximization (EM) algorithm is utilized to search for the posterior estimation of the GMM parameters. Then, the variational Bayesian (VB) algorithm is employed to optimize the GMM parameters to enhance the searching ability of EM and further to determine the optimal number of Gaussian distributions without resorting to cross-validation. Finally, a compact index (CI) is proposed to validate the clustering results reasonably. Thanks to the proposed CI, it is possible to find a close relationship among the GMM clustering mechanism, the multipath propagation characteristics and the CI evaluation index. Experiments with synthetic data and outdoor-to-indoor (O2I) channel data are presented to demonstrate the effectiveness of the proposed method.
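
As a minimal, hypothetical sketch of the two estimation routes the paper combines, the snippet below fits an EM-based GMM and a variational Bayesian GMM, the latter of which can prune superfluous components; the synthetic multipath data are a stand-in, not measured channel data.

```python
# Hypothetical sketch: EM-fitted GMM vs. variational Bayesian GMM for multipath clustering.
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

rng = np.random.default_rng(0)
# stand-in multipath components: (delay, azimuth) pairs drawn from three clusters
X = np.vstack([rng.normal([0, 0], 0.3, (100, 2)),
               rng.normal([3, 1], 0.3, (100, 2)),
               rng.normal([1, 4], 0.3, (100, 2))])

em_gmm = GaussianMixture(n_components=3, covariance_type="full").fit(X)
vb_gmm = BayesianGaussianMixture(n_components=10, weight_concentration_prior=0.01).fit(X)
print(em_gmm.predict(X)[:10])
print(np.round(vb_gmm.weights_, 2))   # VB drives unused component weights toward zero
```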

Journal ArticleDOI
TL;DR: An automated crop classification workflow, which is based on machine-learning techniques, is developed, and deployment of the workflow on the cloud platform can overcome the challenges of Big data downloading and processing.
Abstract: For accurate crop classification, it is necessary to use time-series of high-resolution satellite data to better discriminate among certain crop types. This task brings the following challenges: a large amount of satellite data for download, Big data processing and computational resources for utilization of state-of-the-art classification approaches. For solving these problems, we have developed an automated crop classification workflow, which is based on machine-learning techniques. By deployment of the workflow on the cloud platform, we can overcome challenges of Big data downloading and processing. In this paper, we present the system architecture and describe the experiments on structural and parametric identification of machine learning models utilized in the system.

Journal ArticleDOI
TL;DR: This study innovatively proposes a multilayered-and-randomized latent factor (MLF) model, adopting randomized learning to train LFs for implementing a 'one-iteration' training process for saving time and adopting the principle of a generally multilayered structure as in a deep forest or multilayered extreme learning machine to structure its LFs, thereby enhancing its representative learning ability.
Abstract: How to extract useful knowledge from a high-dimensional and sparse (HiDS) matrix efficiently is critical for many big data-related applications. A latent factor (LF) model has been widely adopted to address this problem. It commonly relies on an iterative learning algorithm like stochastic gradient descent. However, an algorithm of this kind commonly consumes many iterations to converge, resulting in considerable time cost on large-scale datasets. How to accelerate an LF model's training process without accuracy loss becomes a vital issue. To address it, this study innovatively proposes a multilayered-and-randomized latent factor (MLF) model. Its main idea is two-fold: a) adopting randomized learning to train LFs, implementing a 'one-iteration' training process for saving time; and b) adopting the principle of a generally multilayered structure, as in a deep forest or a multilayered extreme learning machine, to structure its LFs, thereby enhancing its representative learning ability. Empirical studies on six HiDS matrices from real applications demonstrate that compared with state-of-the-art LF models, an MLF model achieves significantly higher computational efficiency with satisfactory prediction accuracy. It has the potential to handle LF analysis on a large-scale HiDS matrix with real-time requirements.

Journal ArticleDOI
TL;DR: This paper proposes a method for change-point detection of overlapping community evolution analysis by reformulating an overlapping community in the form of a one-dimensional stream constrained by gentle degree fluctuation and the heterogeneous size distribution of the overlapping communities.
Abstract: Change-point detection is a task that looks for specific moments across which a network changes fundamentally. Change-point detection is one of the most important challenges for overlapping community evolution analysis, and its aim is to identify the moment, type, and degree of change of a specific dynamic event when an overlapping community is evolving. In contrast to overlapping community detection, change-point detection addresses the evolution of an overlapping community rather than a network topology. In this paper, we propose such a method by reformulating an overlapping community in the form of a one-dimensional stream constrained by gentle degree fluctuation and the heterogeneous size distribution of the overlapping communities. According to the number of interacting overlapping communities involved in a specific change event, overlapping community change-points are classified as unary or binary. Based on a signal processing framework and a decision function-based strategy, our proposed method finds the change-points for both unary and binary cases. The experimental results from a synthetic dataset show that our proposed approach can ensure higher accuracy and a lower false positive rate than the traditional two-stage approach.

Journal ArticleDOI
TL;DR: This work proposes a novel real-time scheduling algorithm using task-duplication, RTSATD, to minimize both the completion time and monetary cost of processing big data workflows in clouds.
Abstract: Scheduling big data processing workflows involves both large-scale tasks and transmission of massive intermediate data among tasks, thus optimizing their completion time and monetary cost becomes a challenging issue. Besides, data streams are continuously generated, and dynamically submitted to clouds for real-time or near real-time processing. Naturally, responsive schedules are required to keep pace with such dynamic environments, and this further aggravates the difficulty of the workflow scheduling problem. To address these issues, we first derive two theorems to minimize the completion time of a set of parallel workflow tasks and the start time of each workflow task, and then define the latest finish time for workflow tasks, which is also proven to be advantageous in reducing costs without delaying the completion of workflows. On the basis of these theorems, we propose a novel real-time scheduling algorithm using task-duplication, RTSATD, to minimize both the completion time and monetary cost of processing big data workflows in clouds. The performance of RTSATD is analyzed by using both synthesized and real-world workflows. The experimental results demonstrate the superiority of the proposed algorithm with respect to completion time (up to 28.73 percent) and resource utilization (up to 46.31 percent) over two existing approaches.

Journal ArticleDOI
TL;DR: This work innovatively introduces an inductive clustering algorithm that aims to address the clustering problem for attributed graphs without any assumption made on the clusters.
Abstract: Attributed graphs are widely used to represent network data where the attribute information of nodes is available. To address the problem of identifying clusters in attributed graphs, most existing solutions are developed simply based on certain particular assumptions related to the characteristics of the clusters of interest. However, it is yet unknown whether such assumed characteristics are consistent with attributed graphs. To overcome this issue, we innovatively introduce an inductive clustering algorithm that aims to address the clustering problem for attributed graphs without any assumption made on the clusters. To do so, we first process the attribute information to obtain pairwise attribute values that co-occur significantly frequently in adjacent nodes, as we believe that they have the potential ability to represent the characteristics of a given attributed graph. For two adjacent nodes, their likelihood of being grouped in the same cluster can be weighted by their ability to characterize the graph. Then, based on these verified characteristics instead of assumed ones, a depth-first search strategy is applied to perform the clustering task. Moreover, we are also able to classify clusters such that their significance can be indicated. The experimental results demonstrate the performance and usefulness of our algorithm.

Journal ArticleDOI
TL;DR: Based on a novel Proof-of-Stake consensus mechanism by accumulating stakes through message forwarding, B4SDC not only provides incentives for all participating nodes, but also avoids forking and ensures high efficiency and real decentralization.
Abstract: Security-related data collection is an essential part for attack detection and security measurement in Mobile Ad Hoc Networks (MANETs). A detection node (i.e., collector) should discover available routes to a collection node for data collection and collect security-related data during route discovery for determining reliable routes. However, few studies provide incentives for security-related data collection in MANETs. In this paper, we propose B4SDC, a blockchain system for security-related data collection in MANETs. Through controlling the scale of Route REQuest (RREQ) forwarding in route discovery, the collector can constrain its payment and simultaneously make each forwarder of control information (namely RREQs and Route REPlies, in short RREPs) obtain rewards as much as possible to ensure fairness. At the same time, B4SDC avoids collusion attacks with cooperative receipt reporting, and spoofing attacks by adopting a secure digital signature. Based on a novel Proof-of-Stake consensus mechanism by accumulating stakes through message forwarding, B4SDC not only provides incentives for all participating nodes, but also avoids forking and ensures high efficiency and real decentralization. We analyze B4SDC in terms of incentives and security, and evaluate its performance through simulations. The thorough analysis and experimental results show the efficacy and effectiveness of B4SDC.

Journal ArticleDOI
TL;DR: The analysis shows that the end-user can accrue economic benefits by shifting consumer loads away from higher-priced periods, and the most likely sources of value to be derived from demand response technologies are assessed.
Abstract: We describe the background and an analytical framework for a mathematical optimization model for home energy management systems (HEMS) to manage electricity demand on the smart grid by efficiently shifting electricity loads of households from peak times to off-peak times. We illustrate the flexibility of the model by modularizing various available technologies such as plug-in electric vehicles, battery storage, and automatic windows. First, the analysis shows that the end-user can accrue economic benefits by shifting consumer loads away from higher-priced periods. Specifically, we assessed the most likely sources of value to be derived from demand response technologies. Therefore, wide adoption of such modeling could create significant cost savings for consumers. Second, the findings are promising for the further development of more intelligent HEMS in the residential sector. Third, we formulated a smart grid valuation framework that is helpful for interpreting the model's results concerning the efficiency of current smart appliances and their respective prices. Finally, we explain the model's benefits, the major concerns when the model is applied in the real world, and the possible future areas that can be explored.

Journal ArticleDOI
TL;DR: This work implements Kira, a flexible and distributed astronomy image processing toolkit, and its Source Extractor application, and examines the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the Amazon EC2 cloud.
Abstract: Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, HPC tools are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark, a modern platform for data intensive computing, to parallelize many-task applications. We implement Kira, a flexible and distributed astronomy image processing toolkit, and its Source Extractor application (Kira SE). Using Kira SE as a case study, we examine the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the Amazon EC2 cloud. By exploiting data locality, Kira SE achieves a 4.1× speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, Kira SE on the Amazon EC2 cloud achieves a 1.8× speedup over the C program on the NERSC Edison supercomputer. A 128-core Amazon EC2 cloud deployment of Kira SE using Spark Streaming can achieve a second-scale latency with a sustained throughput of ~800 MB/s. Our experience with Kira demonstrates that data intensive computing platforms like Apache Spark are a performant alternative for many-task scientific applications.
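
The following is a minimal, hypothetical sketch of the many-task pattern on Spark that this approach relies on: each image becomes one task, and a single-process extraction routine is mapped over the partitioned file list in parallel. The `extract_sources` placeholder and file names are illustrative assumptions, not Kira's actual API.

```python
# Hypothetical sketch: parallelizing a many-task image-processing application with PySpark.
from pyspark.sql import SparkSession

def extract_sources(path):
    """Placeholder for a per-image source-extraction routine (stand-in for Kira SE's kernel)."""
    return (path, 42)          # pretend we found 42 sources in this image

spark = SparkSession.builder.appName("many-task-sketch").getOrCreate()
paths = ["img_%04d.fits" % i for i in range(1000)]             # stand-in image list
results = (spark.sparkContext
                .parallelize(paths, numSlices=64)               # many tasks, one per image
                .map(extract_sources)
                .collect())
print(results[:3])
spark.stop()
```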

Journal ArticleDOI
TL;DR: This work presents the first large-scale analysis of user matchability in real mobility datasets on realistic scales, i.e., between two datasets that consist of several million people's mobility traces, coming from a mobile network operator and transportation smart card usage.
Abstract: The problem of unicity and reidentifiability of records in large-scale databases has been studied in different contexts and approaches, with focus on preserving privacy or matching records from different data sources. With an increasing number of service providers nowadays routinely collecting location traces of their users on unprecedented scales, there is a pronounced interest in the possibility of matching records and datasets based on spatial trajectories. Extending previous work on reidentifiability of spatial data and trajectory matching, we present the first large-scale analysis of user matchability in real mobility datasets on realistic scales, i.e., between two datasets that consist of several million people's mobility traces, coming from a mobile network operator and transportation smart card usage. We extract the relevant statistical properties which influence the matching process and analyze their impact on the matchability of users. We show that for individuals with typical activity in the transportation system (those making 3-4 trips per day on average), a matching algorithm based on the co-occurrence of their activities is expected to achieve a 16.8 percent success rate after only a one-week-long observation of their mobility traces, and over 55 percent after four weeks. We show that the main determinant of matchability is the expected number of co-occurring records in the two datasets. Finally, we discuss different scenarios in terms of data collection frequency and give estimates of matchability over time. We show that with higher frequency data collection becoming more common, we can expect much higher success rates in even shorter intervals.

Journal ArticleDOI
TL;DR: An effective integration of advanced remote sensing methods and new ICT technologies can successfully contribute to deeply investigating the Earth System processes and to addressing new challenges within the Big Data EO scenario.
Abstract: We present an automatic pipeline implemented within the Amazon Web Services (AWS) Cloud Computing platform for the interferometric processing of large Sentinel-1 (S1) multi-temporal SAR datasets, aimed at analyzing Earth surface deformation phenomena at a wide spatial scale. The developed processing chain is based on the advanced DInSAR approach referred to as the Small BAseline Subset (SBAS) technique, which allows producing, with centimeter to millimeter accuracy, surface deformation time series and the corresponding mean velocity maps from a temporal sequence of SAR images. The implemented solution addresses the aspects relevant to i) S1 input data archiving; ii) interferometric processing of S1 data sequences, performed in parallel on the AWS computing nodes through both multi-node and multi-core programming techniques; iii) storage of the generated interferometric products. The experimental results are focused on a national scale DInSAR analysis performed over the whole Italian territory by processing 18 S1 slices acquired from descending orbits between March 2015 and April 2017, corresponding to 2612 S1 acquisitions. Our analysis clearly shows that an effective integration of advanced remote sensing methods and new ICT technologies can successfully contribute to deeply investigating the Earth System processes and to addressing new challenges within the Big Data EO scenario.

Journal ArticleDOI
TL;DR: A novel Spatial Network-based Markov Decision Process (SN-MDP) with a rolling horizon configuration to recommend better driving directions given a set of historical taxi records and the current status of a vacant taxi to maximize the profit in the near future is proposed.
Abstract: Taxi services play an important role in the public transportation system of large cities. Improving taxi business efficiency is an important societal problem. Most of the recent analytical approaches on this topic only considered how to maximize the pickup chance, energy efficiency, or profit for the immediate next trip when recommending seeking routes, therefore may not be optimal for the overall profit over an extended period of time due to ignoring the destination choice of potential passengers. To tackle this issue, we propose a novel Spatial Network-based Markov Decision Process (SN-MDP) with a rolling horizon configuration to recommend better driving directions. Given a set of historical taxi records and the current status (e.g., road segment and time) of a vacant taxi, we find the best move for this taxi to maximize the profit in the near future. We propose statistical models to estimate the necessary time-variant parameters of SN-MDP from data to avoid competition between drivers. In addition, we take into account fuel cost to assess profit, rather than only income. A case study and several experimental evaluations on a real taxi dataset from a major city in China show that our proposed approach improves the profit efficiency by up to 13.7 percent and outperforms baseline methods in all the time slots.
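
To illustrate the MDP machinery behind such a recommender, here is a minimal, hypothetical sketch of finite-horizon value iteration over a toy set of road segments, with rewards standing in for expected profit. The state/action counts, transition probabilities, and rewards are entirely made up; the paper estimates its own time-variant parameters from taxi records.

```python
# Hypothetical sketch: finite-horizon value iteration on a toy profit-maximizing MDP.
import numpy as np

n_states, n_actions, horizon = 5, 2, 10          # toy road network and rolling horizon
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a] -> next-state distribution
R = rng.uniform(-1, 3, size=(n_states, n_actions))                  # expected profit of each move

V = np.zeros(n_states)
for _ in range(horizon):                         # backward value-iteration sweeps
    Q = R + P @ V                                # Q[s, a] = R[s, a] + E[V(next state)]
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)
print(V, policy)                                 # expected profit and recommended move per segment
```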

Journal ArticleDOI
TL;DR: This work proposes a novel Wi-Fi signals based fatigue detection approach, called WiFind, which can detect the fatigue symptoms in the vehicle without relying on any visual image or video, and can achieve the recognition accuracy of 89.6 percent in a single driver scenario.
Abstract: Driver fatigue is a leading factor in road accidents that can cause severe fatalities. Existing fatigue detection works focus on vision and electroencephalography (EEG) based means of detection. However, vision-based approaches suffer from view-blocking or vision distortion problems, and EEG-based systems are intrusive, as the drivers have to use/wear the devices with inconvenience or additional costs. In our work, we propose a novel Wi-Fi signals based fatigue detection approach, called WiFind, to overcome the drawbacks associated with current works. WiFind is simple and (wearable) device-free. It can detect fatigue symptoms in the vehicle without relying on any visual image or video. By applying a self-adaptive method, it can recognize the body features of drivers in multiple modes. It applies a Hilbert-Huang transform (HHT) based pattern extraction method, which results in an accuracy increase in motion detection mode. WiFind can be easily deployed on a commodity Wi-Fi infrastructure, and we have evaluated its performance in real driving environments. The experimental results have shown that WiFind can achieve a recognition accuracy of 89.6 percent in a single driver scenario.

Journal ArticleDOI
TL;DR: To develop effective SGA prediction models, four groups of experiments were conducted that considered basic ML methods, imbalanced data, feature selection and the time characteristics of variables, respectively and the RF ensemble classifier performed best.
Abstract: Diagnosing infants who are small for gestational age (SGA) at early stages could help physicians to introduce interventions for SGA infants earlier. Machine learning (ML) is envisioned as a tool to identify SGA infants. However, ML has not been widely studied in this field. To develop effective SGA prediction models, we conducted four groups of experiments that considered basic ML methods, imbalanced data, feature selection and the time characteristics of variables, respectively. Infants with SGA data collected from 2010 to 2013 with gestational weeks between 24 and 42 were detected. Support vector machine (SVM), random forest (RF), logistic regression (LR) and Sparse LR models were trained with 10-fold cross validation. Precision and the area under the curve (AUC) of the receiver operator characteristic curve were evaluated. For each group, SVM and Sparse LR performed similarly well. LR without any sparsity penalties performed worst, possibly caused by the overfitting problem. With the combination of handling imbalanced data and feature selection, the RF ensemble classifier performed best, and even obtained the highest AUC value (0.8547) with the help of expert knowledge. In other cases, RF performed worse than Sparse LR and SVM, possibly because of fully grown trees.
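
Here is a minimal, hypothetical sketch of one such experiment group: a random forest with class weighting for the imbalanced SGA/non-SGA labels, evaluated by AUC under 10-fold cross validation. The synthetic dataset and hyperparameters are stand-ins for the clinical data.

```python
# Hypothetical sketch: class-weighted random forest on imbalanced data, scored by AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# stand-in for the clinical variables; real SGA data is highly imbalanced
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.9, 0.1], random_state=0)
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
auc = cross_val_score(rf, X, y, cv=10, scoring="roc_auc")        # 10-fold CV, AUC metric
print(auc.mean())
```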

Journal ArticleDOI
TL;DR: An LSTM-based phishing detection method for big email data is proposed that can reach 95 percent accuracy and includes two important stages: a sample expansion stage and a testing stage under sufficient samples.
Abstract: In recent years, cyber criminals have successfully invaded many important information systems by using phishing mail, causing huge losses. The detection of phishing mail from big email data has attracted public attention. However, the camouflage technology of phishing mail is becoming more and more complex, and the existing detection methods are unable to cope with the increasingly complex deception methods and the growing number of emails. In this paper, we propose an LSTM-based phishing detection method for big email data. The new method includes two important stages, a sample expansion stage and a testing stage under sufficient samples. In the sample expansion stage, we combined KNN with K-means to expand the training data set, so that the size of the training samples can meet the needs of deep learning. In the testing stage, we first preprocess these samples, including generalization, word segmentation and word vector generation. Then, the preprocessed data is used to train an LSTM model. Finally, on the basis of the trained model, we classify the phishing emails. Through experiments, we evaluate the performance of the proposed method, and the results show that the accuracy of our phishing detection method can reach 95 percent.
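
As a minimal, hypothetical sketch of the testing-stage classifier, the snippet below feeds tokenized email text through an embedding layer, an LSTM, and a sigmoid head for phishing/legitimate labels. The vocabulary size, sequence length, and random stand-in data are assumptions; the paper's own preprocessing (generalization, word segmentation, word vectors) is not reproduced here.

```python
# Hypothetical sketch: embedding + LSTM + sigmoid binary classifier for email text.
import numpy as np
import tensorflow as tf

vocab_size, max_len = 10000, 200
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),         # word-vector layer
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # phishing probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# stand-in for preprocessed (segmented, indexed, padded) email bodies and labels
X = np.random.randint(0, vocab_size, size=(512, max_len))
y = np.random.randint(0, 2, size=(512,))
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
print(model.predict(X[:3]).ravel())   # predicted phishing probabilities
```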

Journal ArticleDOI
TL;DR: The key idea of Sleepy is that the energy feature of the wireless channel follows a Gaussian Mixture Model derived from the accumulated channel data over a long period, leading to a low-cost yet promising solution for sleep monitoring.
Abstract: Sleep is a major event of our daily lives. Its quality constitutes a critical indicator of people's health conditions, both mentally and physically. Existing sensor-based or vision-based sleep monitoring systems either are obstructive to use or fail to provide adequate coverage. With the fast expansion of wireless infrastructures nowadays, channel data, which is pervasive and transparent, emerges as another alternative. To this end, we propose Sleepy, a wireless channel data driven sleep monitoring system leveraging commercial WiFi devices. The key idea of Sleepy is that the energy feature of the wireless channel follows a Gaussian Mixture Model (GMM) derived from the accumulated channel data over a long period. Therefore, a GMM based foreground extraction method has been designed to adaptively distinguish motions like rollovers (foreground) from background (stationary postures), leading to certain major merits, e.g., no calibrations or target-dependent training needed. We prototype Sleepy and evaluate it in two real environments. In the short-term controlled experiments, Sleepy achieves 95.65 percent detection accuracy (DA) and 2.16 percent false negative rate (FNR) on average. In the 60-minute real sleep studies, Sleepy demonstrates strong stability, i.e., 0 percent FNR and 98.22 percent DA. Considering that Sleepy is compatible with existing WiFi infrastructure, it constitutes a low-cost yet promising solution for sleep monitoring.

Journal ArticleDOI
TL;DR: This survey makes a comprehensive review of the state-of-the-art research on urban anomaly analytics and gives an overview of four main types of urban anomalies, traffic anomaly, unexpected crowds, environment anomaly, and individual anomaly.
Abstract: Urban anomalies may result in loss of life or property if not handled properly. Automatically alerting anomalies in their early stage or even predicting anomalies before they happen is of great value for populations. Recently, data-driven urban anomaly detection and prediction frameworks have been forming, which utilize urban big data and machine learning algorithms to detect and predict urban anomalies automatically. In this survey, we make a comprehensive review of the state-of-the-art research on the urban anomaly. We first give an overview of four main types of urban anomalies: traffic anomaly, unexpected crowds, environment anomaly, and individual anomaly. Next, we summarize various types of urban datasets obtained from diverse devices, i.e., trajectory, trip records, CDRs, urban sensors, event records, environment data, social media and surveillance cameras. Subsequently, a comprehensive survey of issues on detecting and predicting techniques for urban anomalies is presented. Finally, open research challenges and future directions are discussed.

Journal ArticleDOI
TL;DR: Both quantitative evaluation and case studies demonstrate that the proposed MKE system can successfully provide useful medical knowledge and accurate doctor expertise and is applied to real-world datasets crawled from xywy.com, one of the most popular medical crowdsourced Q&A websites.
Abstract: Medical crowdsourced question answering (Q&A) websites have been booming in recent years, and an increasing number of patients and doctors are involved. The valuable information from these medical crowdsourced Q&A websites can benefit patients, doctors and the society. One key to unleash the power of these Q&A websites is to extract medical knowledge from the noisy question-answer pairs and filter out unrelated or even incorrect information. Facing the daunting scale of information generated on medical Q&A websites every day, it is unrealistic to fulfill this task via supervised methods due to the expensive annotation cost. In this paper, we propose a Medical Knowledge Extraction (MKE) system that can automatically provide high-quality knowledge triples extracted from the noisy question-answer pairs, and at the same time, estimate expertise for the doctors who give answers on these Q&A websites. The MKE system is built upon a truth discovery framework, where we jointly estimate trustworthiness of answers and doctor expertise from the data without any supervision. We further tackle three unique challenges in the medical knowledge extraction task, namely representation of noisy input, multiple linked truths, and the long-tail phenomenon in the data. The MKE system is applied to real-world datasets crawled from xywy.com, one of the most popular medical crowdsourced Q&A websites. Both quantitative evaluation and case studies demonstrate that the proposed MKE system can successfully provide useful medical knowledge and accurate doctor expertise. We further demonstrate a real-world application, Ask A Doctor, which can automatically give patients suggestions to their questions.

Journal ArticleDOI
TL;DR: A Privilege-based Multilevel Organizational Data-sharing scheme (P-MOD) is proposed that incorporates a privilege-based access structure into an attribute-based encryption mechanism to handle the management and sharing of big data sets.
Abstract: Cloud computing has changed the way enterprises store, access and share data. Big data sets are constantly being uploaded to the cloud and shared within a hierarchy of many different individuals with different access privileges. With more data storage needs turning over to the cloud, finding a secure and efficient data access structure has become a major research issue. In this paper, a Privilege-based Multilevel Organizational Data-sharing scheme (P-MOD) is proposed that incorporates a privilege-based access structure into an attribute-based encryption mechanism to handle the management and sharing of big data sets. Our proposed privilege-based access structure helps reduce the complexity of defining hierarchies as the number of users grows, which makes managing healthcare records using mobile healthcare devices feasible. It can also facilitate organizations in applying big data analytics to understand populations in a holistic way. Security analysis shows that P-MOD is secure against adaptively chosen plaintext attack assuming the DBDH assumption holds. The comprehensive performance and simulation analyses using the real U.S. Census Income data set demonstrate that P-MOD is more efficient in computational complexity and storage space than the existing schemes.

Journal ArticleDOI
TL;DR: This paper proposes to build powerful semantic features using the probabilistic latent semantic analysis (pLSA) model, by employing the pre-trained deep convolutional neural networks (CNNs) as feature extractors rather than relying on the hand-crafted features.
Abstract: Scene classification is one of the most fundamental tasks in the interpretation of high-resolution remote sensing (HRRS) images. Many recent works show that probabilistic topic models, which are capable of mining the latent semantics of images, can be effectively applied to HRRS scene classification. However, the existing approaches based on topic models simply utilize low-level hand-crafted features to form semantic features, which severely limits the representative capability of the semantic features derived from topic models. To alleviate this problem, this paper proposes to build powerful semantic features using the probabilistic latent semantic analysis (pLSA) model, by employing pre-trained deep convolutional neural networks (CNNs) as feature extractors rather than relying on hand-crafted features. Specifically, we develop two methods to generate semantic features, called multi-scale deep semantic representation (MSDS) and multi-level deep semantic representation (MLDS), by extracting CNN features from different layers: (1) in MSDS, the final semantic features are learned by the pLSA with multi-scale features extracted from the convolutional layer of a pre-trained CNN; (2) in MLDS, we extract CNN features for densely sampled image patches at different size levels from the fully-connected layer of a pre-trained CNN, and concatenate the semantic features learned by the pLSA at each level. We comprehensively evaluate the two methods on two public HRRS scene datasets, and achieve significant performance improvement over the state-of-the-art. The outstanding results demonstrate that the pLSA model is capable of discovering considerably discriminative semantic features from the deep CNN features.