
Showing papers in "IEEE Transactions on Big Data in 2018"


Journal ArticleDOI
TL;DR: A tensor-based multiple clustering on bicycle renting and returning data is illustrated, which can provide several suggestions for rebalancing of the bicycle-sharing system and some challenges about the proposed framework are discussed.
Abstract: Due to the rapid advances of information technologies, Big Data, recognized by its 4V characteristics (volume, variety, veracity, and velocity), brings significant benefits as well as many challenges. A major benefit of Big Data is to provide timely information and proactive services for humans. The primary purpose of this paper is to review the current state of the art of Big Data from the aspects of organization and representation, cleaning and reduction, integration and processing, security and privacy, and analytics and applications, and then to present a novel framework to provide high-quality, so-called Big Data-as-a-Service. The framework consists of three planes, namely the sensing plane, the cloud plane, and the application plane, to systematically address the challenges of the above aspects. Also, to clearly demonstrate the working process of the proposed framework, a tensor-based multiple clustering on bicycle renting and returning data is illustrated, which can provide several suggestions for rebalancing the bicycle-sharing system. Finally, some challenges of the proposed framework are discussed.

121 citations


Journal ArticleDOI
TL;DR: Fuzzy cluster based analytical method, game theory and reinforcement learning are integrated seamlessly to perform the security situational analysis for the smart grid and show the advantages in terms of high efficiency and low error rate for security situational awareness.
Abstract: Advanced communications and data processing technologies bring great benefits to the smart grid. However, cyber-security threats also extend from the information system to the smart grid. Existing security work for the smart grid focuses on traditional protection and detection methods. However, many threats occur in a very short time and are overlooked by existing security components. These threats usually have a huge impact on the smart grid and disturb its normal operation. Moreover, it is too late to take action to defend against the threats once they are detected, and the damage can be difficult to repair. To address this issue, this paper proposes a security situational awareness mechanism based on the analysis of big data in the smart grid. A fuzzy-cluster-based analytical method, game theory, and reinforcement learning are integrated seamlessly to perform the security situational analysis for the smart grid. The simulation and experimental results show the advantages of our scheme in terms of high efficiency and low error rate for security situational awareness.

108 citations


Journal ArticleDOI
TL;DR: Experimental results reveal that the proposed approach can reliably pre-alarm security risk events, substantially reduce storage space of recorded video and significantly speed up the evidence video retrieval associated with specific suspects.
Abstract: Video surveillance systems have become a critical part of the security and protection systems of modern cities, since smart monitoring cameras equipped with intelligent video analytics techniques can monitor and pre-alarm abnormal behaviors or events. However, with the expansion of the surveillance network, massive surveillance video data poses huge challenges to analytics, storage, and retrieval in the Big Data era. This paper presents a novel intelligent processing and utilization solution for big surveillance video data based on the event detection and alarming messages from front-end smart cameras. The method includes three parts: intelligent pre-alarming for abnormal events, smart storage for surveillance video, and rapid retrieval of evidence videos, which fully explores temporal-spatial association analysis with respect to abnormal events at different monitoring sites. Experimental results reveal that our proposed approach can reliably pre-alarm security risk events, substantially reduce the storage space of recorded video, and significantly speed up the retrieval of evidence video associated with specific suspects.

100 citations


Journal ArticleDOI
TL;DR: This paper presents a multi-objective optimization algorithm to trade off the performance, availability, and cost of Big Data applications running on the Cloud, and designs and implements this approach in an experimental environment.
Abstract: Increasingly popular big data applications bring invaluable information, but also challenges to industry and academia. Cloud computing with seemingly unlimited resources appears to be the way out. However, this panacea cannot play its role if we do not arrange fine-grained allocation of cloud infrastructure resources. In this paper, we present a multi-objective optimization algorithm to trade off the performance, availability, and cost of Big Data applications running on the Cloud. After analyzing and modeling the interlaced relations among these objectives, we design and implement our approach in an experimental environment. Finally, three sets of experiments show that our approach can run about 20 percent faster than traditional optimization approaches, and can achieve about 15 percent higher performance than other heuristic algorithms, while saving 4 to 20 percent in cost.

94 citations


Journal ArticleDOI
TL;DR: A secure and verifiable access control scheme based on the NTRU cryptosystem for big data storage in clouds that enables the data owner and eligible users to effectively verify the legitimacy of a user for accessing the data, and a user to validate the information provided by other users for correct plaintext recovery.
Abstract: Due to the complexity and volume, outsourcing ciphertexts to a cloud is deemed to be one of the most effective approaches for big data storage and access. Nevertheless, verifying the access legitimacy of a user and securely updating a ciphertext in the cloud based on a new access policy designated by the data owner are two critical challenges to make cloud-based big data storage practical and effective. Traditional approaches either completely ignore the issue of access policy update or delegate the update to a third party authority; but in practice, access policy update is important for enhancing security and dealing with the dynamism caused by user join and leave activities. In this paper, we propose a secure and verifiable access control scheme based on the NTRU cryptosystem for big data storage in clouds. We first propose a new NTRU decryption algorithm to overcome the decryption failures of the original NTRU, and then detail our scheme and analyze its correctness, security strengths, and computational efficiency. Our scheme allows the cloud server to efficiently update the ciphertext when a new access policy is specified by the data owner, who is also able to validate the update to counter cheating behaviors of the cloud. It also enables (i) the data owner and eligible users to effectively verify the legitimacy of a user for accessing the data, and (ii) a user to validate the information provided by other users for correct plaintext recovery. Rigorous analysis indicates that our scheme can prevent eligible users from cheating and resist various attacks such as the collusion attack.

86 citations


Journal ArticleDOI
TL;DR: The feasibility of SBT-Rec is validated through a set of experiments on the MovieLens-1M dataset; present CF recommendation can perform very well if the target user has similar friends (user-based CF), or if the product items purchased and preferred by the target user have one or more similar product items (item-based CF).
Abstract: Recommending appropriate product items to the target user is becoming the key to ensure continuous success of E-commerce. Today, many E-commerce systems adopt various recommendation techniques, e.g., the Collaborative Filtering (abbreviated as CF)-based technique, to realize product item recommendation. Overall, present CF recommendation can perform very well if the target user has similar friends (user-based CF), or if the product items purchased and preferred by the target user have one or more similar product items (item-based CF). However, due to the sparsity of big rating data in E-commerce, both similar friends and similar product items may be absent from the user-product purchase network, which makes it challenging to recommend appropriate product items to the target user. Considering this challenge, we put forward a Structural Balance Theory-based Recommendation (i.e., SBT-Rec) approach. Concretely, (I) user-based recommendation: we look for the target user's “enemies” (i.e., the users having opposite preferences to the target user); afterwards, we determine the target user's “possible friends” according to the “enemy's enemy is a friend” rule of Structural Balance Theory, and recommend the product items preferred by the “possible friends” to the target user. (II) Likewise, for the product items purchased and preferred by the target user, we determine their “possibly similar product items” based on Structural Balance Theory and recommend them to the target user. Finally, the feasibility of SBT-Rec is validated through a set of experiments on the MovieLens-1M dataset.
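
To make the “enemy's enemy is a friend” step concrete, here is a minimal, illustrative Python sketch (not the authors' implementation): it uses Pearson correlation over co-rated items as the preference similarity, a hypothetical negative threshold to label “enemies”, and collects enemies of enemies as “possible friends”. The toy rating matrix is made up.

```python
# Minimal sketch of the structural-balance idea: enemies are users with
# strongly negative preference similarity; an enemy's enemy is a possible friend.
import numpy as np

def pearson_sim(a, b):
    """Pearson correlation over items rated by both users (0 = unrated)."""
    mask = (a > 0) & (b > 0)
    if mask.sum() < 2:
        return 0.0
    x, y = a[mask], b[mask]
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def possible_friends(ratings, target, neg_thr=-0.5):
    """Return the target's enemies and the enemies-of-enemies (possible friends)."""
    n_users = ratings.shape[0]
    sims = [pearson_sim(ratings[target], ratings[u]) for u in range(n_users)]
    enemies = [u for u in range(n_users) if u != target and sims[u] <= neg_thr]
    friends = set()
    for e in enemies:
        for u in range(n_users):
            if u in (target, e):
                continue
            if pearson_sim(ratings[e], ratings[u]) <= neg_thr:
                friends.add(u)          # enemy's enemy -> possible friend
    return enemies, sorted(friends)

# toy example: rows are users, columns are items, 0 = unrated
R = np.array([[5, 4, 0, 1],
              [1, 2, 5, 5],
              [5, 5, 1, 0],
              [0, 1, 4, 5]], dtype=float)
print(possible_friends(R, target=0))    # items preferred by these users would be recommended
```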

56 citations


Journal ArticleDOI
TL;DR: In this article, the authors use techniques from computer vision and machine learning to classify more than 8 million figures from PubMed into five figure types and study the resulting patterns of visual information as they relate to scholarly impact.
Abstract: Scientific results are communicated visually in the literature through diagrams, visualizations, and photographs. These information-dense objects have been largely ignored in bibliometrics and scientometrics studies when compared to citations and text. In this paper, we use techniques from computer vision and machine learning to classify more than 8 million figures from PubMed into five figure types and study the resulting patterns of visual information as they relate to scholarly impact. We find that the distribution of figures and figure types in the literature has remained relatively constant over time, but can vary widely across field and topic. Remarkably, we find a significant correlation between scientific impact and the use of visual information, where higher impact papers tend to include more diagrams, and to a lesser extent more plots. To explore these results and other ways of extracting this visual information, we have built a visual browser to illustrate the concept and explore design alternatives for supporting viziometric analysis and organizing visual information. We use these results to articulate a new research agenda, viziometrics, to study the organization and presentation of visual information in the scientific literature.

52 citations


Journal ArticleDOI
TL;DR: HashTag Erasure Codes (HTECs) as mentioned in this paper provide the lowest data read and data transfer, and thus the lowest repair time for an arbitrary subpacketization level, where the repair process is linear and highly parallel.
Abstract: Minimum-Storage Regenerating (MSR) codes have emerged as a viable alternative to Reed-Solomon (RS) codes as they minimize the repair bandwidth while remaining optimal in terms of reliability and storage overhead. Although several MSR constructions exist, so far they have not been practically implemented, mainly due to the large number of I/O operations. In this paper, we analyze high-rate MDS codes that are simultaneously optimized in terms of storage, reliability, I/O operations, and repair bandwidth for single and multiple failures of the systematic nodes. The codes were recently introduced in [1] without any specific name. Due to the resemblance between the hashtag sign # and the procedure of the code construction, we call them in this paper HashTag Erasure Codes (HTECs). HTECs provide the lowest data-read and data-transfer, and thus the lowest repair time, for an arbitrary sub-packetization level $\alpha$, where $\alpha \leq r^{\lceil k/r \rceil}$, among all existing MDS codes for distributed storage including MSR codes. The repair process is linear and highly parallel. Additionally, we show that HTECs are the first high-rate MDS codes that reduce the repair bandwidth for more than one failure. Practical implementations of HTECs in Hadoop release 3.0.0-alpha2 demonstrate their great potential.

48 citations


Journal ArticleDOI
TL;DR: A new framework for efficient analysis of high-dimensional economic big data based on innovative distributed feature selection and econometric model construction to reveal the hidden patterns for economic development is presented.
Abstract: With the rapidly increasing popularity of economic activities, a large amount of economic data is being collected. Although such data offers great opportunities for economic analysis, its low quality, high dimensionality, and huge volume pose great challenges for efficient analysis of economic big data. The existing methods have primarily analyzed economic data from the perspective of econometrics, which involves limited indicators and demands prior knowledge of economists. When embracing large varieties of economic factors, these methods tend to yield unsatisfactory performance. To address these challenges, this paper presents a new framework for efficient analysis of high-dimensional economic big data based on innovative distributed feature selection. Specifically, the framework combines the methods of economic feature selection and econometric model construction to reveal the hidden patterns of economic development. The functionality rests on three pillars: (i) novel data pre-processing techniques to prepare high-quality economic data, (ii) an innovative distributed feature identification solution to locate important and representative economic indicators from multidimensional data sets, and (iii) new econometric models to capture the hidden patterns of economic development. The experimental results on the economic data collected in Dalian, China, demonstrate that our proposed framework and methods have superior performance in analyzing enormous economic data.

48 citations


Journal ArticleDOI
TL;DR: A novel big data based security analytics approach to detecting advanced attacks in virtualized infrastructures using Hadoop Distributed File System and MapReduce parser based identification of potential attack paths.
Abstract: Virtualized infrastructure in cloud computing has become an attractive target for cyberattackers to launch advanced attacks. This paper proposes a novel big-data-based security analytics approach to detecting advanced attacks in virtualized infrastructures. Network logs as well as user application logs collected periodically from the guest virtual machines (VMs) are stored in the Hadoop Distributed File System (HDFS). Then, extraction of attack features is performed through graph-based event correlation and MapReduce-parser-based identification of potential attack paths. Next, determination of attack presence is performed through two-step machine learning: logistic regression is applied to calculate an attack's conditional probabilities with respect to the attributes, and belief propagation is applied to calculate the belief in the existence of an attack based on them. Experiments are conducted to evaluate the proposed approach using well-known malware as well as in comparison with existing security techniques for virtualized infrastructure. The results show that our proposed approach is effective in detecting attacks with minimal performance overhead.
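
As a rough illustration of the first of the two learning steps described above (a sketch only, not the paper's pipeline), the snippet below fits a logistic regression over hypothetical log-derived attack features to obtain per-observation attack probabilities; these would then feed the belief-propagation step, which is not shown.

```python
# Sketch: logistic regression giving P(attack | attributes) per observation.
# Feature columns and training data are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: features extracted from correlated VM logs, e.g.,
# suspicious-connection count, privilege-escalation flag, attack-path length.
X = np.array([[12, 1, 3],
              [ 1, 0, 1],
              [ 9, 1, 2],
              [ 0, 0, 1],
              [15, 1, 4],
              [ 2, 0, 1]], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])        # 1 = known attack trace, 0 = benign

clf = LogisticRegression().fit(X, y)
p_attack = clf.predict_proba(X)[:, 1]    # conditional attack probabilities
print(np.round(p_attack, 3))
# These probabilities would then be aggregated along candidate attack paths
# by a belief-propagation step (not sketched here).
```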

46 citations


Journal ArticleDOI
TL;DR: In this paper, an indexing architecture to store and search in a database of high-dimensional vectors from the perspective of statistical signal processing and decision theory is proposed, which is composed of several memory units, each of which summarizes a fraction of the database by a single representative vector.
Abstract: We study an indexing architecture to store and search in a database of high-dimensional vectors from the perspective of statistical signal processing and decision theory. This architecture is composed of several memory units, each of which summarizes a fraction of the database by a single representative vector. The potential similarity of the query to one of the vectors stored in the memory unit is gauged by a simple correlation with the memory unit's representative vector. This representative optimizes the test of the following hypothesis: the query is independent from any vector in the memory unit versus the query is a simple perturbation of one of the stored vectors. Compared to exhaustive search, our approach finds the most similar database vectors significantly faster without a noticeable reduction in search quality. Interestingly, the reduction of complexity is provably better in high-dimensional spaces. We empirically demonstrate its practical interest in a large-scale image search scenario with off-the-shelf state-of-the-art descriptors.
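
The snippet below sketches the memory-unit idea under simplifying assumptions: each unit is summarized by the sum of its unit-normalized vectors (one of several possible representative constructions), a query is first correlated with the representatives, and only the top-scoring units are searched exhaustively. Sizes and the number of retained units are illustrative.

```python
# Sketch: memory units summarized by one representative vector each;
# cheap correlation with representatives prunes the exhaustive search.
import numpy as np

rng = np.random.default_rng(0)
d, n, unit_size = 256, 1000, 50
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-norm database vectors

units = [X[i:i + unit_size] for i in range(0, n, unit_size)]
reps = np.stack([u.sum(axis=0) for u in units])     # one representative per memory unit

def search(query, top_units=5):
    q = query / np.linalg.norm(query)
    scores = reps @ q                                # cheap correlation per unit
    best = np.argsort(-scores)[:top_units]           # keep only promising units
    cands = np.vstack([units[i] for i in best])
    return cands[np.argmax(cands @ q)]               # exhaustive search inside them

query = X[123] + 0.1 * rng.normal(size=d)            # query = perturbed database vector
print(np.dot(search(query), X[123]))                 # close to 1 if the right unit is kept
```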

Journal ArticleDOI
TL;DR: This paper develops an efficient and practical secure outsourcing algorithm for solving large-scale SLSEs, which has low computational and memory I/O complexities and can protect clients’ privacy well and offers significant time savings for the client.
Abstract: Solving large-scale sparse linear systems of equations (SLSEs) is one of the most common and fundamental problems in big data, but it is very challenging for resource-limited users. Cloud computing has been proposed as a timely, efficient, and cost-effective way of solving such expensive computing tasks. Nevertheless, one critical concern in cloud computing is data privacy. Specifically, clients’ SLSEs usually contain private information that should remain hidden from the cloud for ethical, legal, or security reasons. Many previous works on secure outsourcing of linear systems of equations (LSEs) have high computational complexity, and do not exploit the sparsity in the LSEs. More importantly, they share a common serious problem, i.e., a huge number of memory I/O operations. This problem has been largely neglected in the past, but in fact is of particular importance and may eventually render those outsourcing schemes impractical. In this paper, we develop an efficient and practical secure outsourcing algorithm for solving large-scale SLSEs, which has low computational and memory I/O complexities and can protect clients’ privacy well. We implement our algorithm on Amazon Elastic Compute Cloud, and find that the proposed algorithm offers significant time savings for the client (up to 74 percent) compared to previous algorithms.

Journal ArticleDOI
TL;DR: A novel power state evaluation algorithm is developed based on a multiple high-dimensional covariance matrix test of massive streaming PMU data to jointly reveal the relative magnitude, duration, and location of a system event.
Abstract: The growing deployment of phasor measurement units (PMUs), the increase in data quantity, and the deregulation of the energy market all call for robust state evaluation in large-scale power systems. Implementing model-based estimators is impracticable because of the complexity of solving the high-dimensional power flow equations. In this paper, we first represent massive streaming PMU data as a big random matrix flow. Motivated by exploiting the variations in the covariance matrix of the massive streaming PMU data, a novel power state evaluation algorithm is then developed based on a multiple high-dimensional covariance matrix test. The proposed test statistic is nonparametric, without assuming a specific parameter distribution for the PMU data, and applies to a wide range of data dimensions and sample sizes. Besides, it can jointly reveal the relative magnitude, duration, and location of a system event. For the sake of practical application, we reduce the computation of the proposed test statistic from $O(\varepsilon n_g^4)$ to $O(\eta n_g^2)$ by principal component calculation and redundant computation elimination. The novel algorithm is numerically evaluated using the IEEE 30- and 118-bus systems, a Polish 2383-bus system, and a real 34-PMU system. The case studies illustrate and verify the superiority of the proposed state evaluation indicator.
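
As a loose illustration of the underlying idea of monitoring changes in the covariance structure of streaming measurements, the sketch below compares windowed sample covariance matrices against a reference with a simple Frobenius-norm statistic; this stand-in is not the paper's nonparametric high-dimensional test, and the data, window size, and threshold are made up.

```python
# Sketch: flag windows whose sample covariance deviates strongly from a reference.
import numpy as np

rng = np.random.default_rng(0)
n_pmu, T, win = 10, 600, 50
data = rng.normal(size=(T, n_pmu))                   # synthetic "PMU" stream
data[300:350, :3] += 3.0 * rng.normal(size=(50, 3))  # injected event on 3 channels

ref_cov = np.cov(data[:win].T)                        # reference covariance
for start in range(0, T - win + 1, win):
    C = np.cov(data[start:start + win].T)
    stat = np.linalg.norm(C - ref_cov, ord="fro")     # crude change statistic
    if stat > 3.0:                                    # illustrative threshold
        print(f"possible event in window starting at t={start}, stat={stat:.2f}")
```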

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed TSTD algorithm significantly outperforms the state-of-the-art energy-efficient algorithms from the total, computing, and cooling energy consumption perspectives, as well as in cooling energy consumption proportion and total energy consumption savings.
Abstract: Big data has received considerable attention in recent years because of the massive data volumes in multifarious fields. Considering the various “V” features, big data tasks are usually highly complex and computationally intensive. These tasks are generally performed in parallel in data centers, resulting in massive energy consumption and greenhouse gas emissions. Therefore, efficient resource allocation considering the synergy of performance and energy efficiency is one of the crucial challenges today. In this paper, we aim to achieve maximum energy efficiency by combining thermal-aware and dynamic voltage and frequency scaling (DVFS) techniques. This paper proposes: (a) a thermal-aware and power-aware hybrid energy consumption model synchronously considering the computing, cooling, and migration energy consumption; (b) a tensor-based task allocation and frequency assignment model for representing the relationship among different tasks, nodes, time slots, and frequencies; and (c) a big data Task Scheduling algorithm based on Thermal-aware and DVFS-enabled techniques (TSTD) to minimize the total energy consumption of data centers. The experimental results demonstrate that the proposed TSTD algorithm significantly outperforms the state-of-the-art energy-efficient algorithms from the total, computing, and cooling energy consumption perspectives, as well as in cooling energy consumption proportion and total energy consumption savings.

Journal ArticleDOI
TL;DR: An LS-decomposition approach that decomposes a sensory reading matrix as the superposition of a low-rank matrix and a sparse anomaly matrix is proposed, and it is proved that the convex surrogate of the LS-decomposition problem guarantees bounded recovery error under proper conditions.
Abstract: The emerging Internet of Things (IoT) systems are fueling an exponential explosion of sensory data. The major challenge to effective implementation of IoT systems is the presence of massive missing data entries, measurement noise, and anomaly readings, which motivates us to investigate the robust recovery of sensory big data. In this paper, we propose an LS-decomposition approach that decomposes a sensory reading matrix as the superposition of a Low-rank matrix and a Sparse anomaly matrix. First, based on data sets from three representative real-world IoT projects, i.e., the IntelLab project (indoor environment), the GreenOrbs project (mountain environment), and the NBDC-CTD project (ocean environment), we observe that anomaly readings are ubiquitous and cannot be ignored. Second, we prove that the convex surrogate of the LS-decomposition problem guarantees bounded recovery error under proper conditions. Third, we propose an accelerated proximal gradient algorithm that converges to the optimal solution at a rate that is inversely proportional to the square of the number of iterations. Evaluations on the above three data sets show that the proposed scheme achieves (relative) recovery error $\leq 0.05$ for missing data rate $\leq 50$ percent and almost exact recovery for missing data rate $\leq 40$ percent, while previous methods have (relative) recovery error $0.04 \sim 0.15$ even at only 10 percent missing data rate.
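
A minimal sketch of an LS-style decomposition under simplifying assumptions: the partially observed matrix is split into a low-rank part L and a sparse anomaly part S by alternating singular-value thresholding and soft thresholding. This is a simple stand-in for the paper's accelerated proximal gradient algorithm; thresholds, iteration count, and data are illustrative.

```python
# Sketch: M (observed entries only) ~ L (low-rank) + S (sparse anomalies).
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def svt(x, t):
    u, s, vt = np.linalg.svd(x, full_matrices=False)   # singular-value thresholding
    return u @ np.diag(soft(s, t)) @ vt

def ls_decompose(M, mask, tau_l=1.0, tau_s=0.3, iters=100):
    """M: readings with anomalies; mask: 1 where an entry was observed."""
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(iters):
        R = mask * (M - L - S)                          # residual on observed entries
        L = svt(L + R, tau_l)                           # low-rank update
        S = soft(S + mask * (M - L - S), tau_s)         # sparse anomaly update
    return L, S

rng = np.random.default_rng(1)
true_L = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 60))   # rank-3 signal
anomalies = (rng.random((40, 60)) < 0.02) * 10.0               # rare large spikes
mask = (rng.random((40, 60)) < 0.6).astype(float)              # ~60% of entries observed
M = mask * (true_L + anomalies)
L, S = ls_decompose(M, mask)
print(np.linalg.norm(mask * (L - true_L)) / np.linalg.norm(mask * true_L))
```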

Journal ArticleDOI
TL;DR: In this paper, a novel pattern-aided graphical causality analysis approach that combines the strengths of pattern mining and Bayesian learning to efficiently identify the spatiotemporal (ST) causal pathways for air pollutants is presented.
Abstract: Many countries are suffering from severe air pollution. Understanding how different air pollutants accumulate and propagate is critical to making relevant public policies. In this paper, we use urban big data (air quality data and meteorological data) to identify the spatiotemporal (ST) causal pathways for air pollutants. This problem is challenging because: (1) there are numerous noisy and low-pollution periods in the raw air quality data, which may lead to unreliable causality analysis; (2) for large-scale data in the ST space, the computational complexity of constructing a causal structure is very high; and (3) the ST causal pathways are complex due to the interactions of multiple pollutants and the influence of environmental factors. Therefore, we present pg-Causality, a novel pattern-aided graphical causality analysis approach that combines the strengths of pattern mining and Bayesian learning to efficiently identify the ST causal pathways. First, pattern mining helps suppress the noise by capturing frequent evolving patterns (FEPs) of each monitoring sensor, and greatly reduces the complexity by selecting the pattern-matched sensors as “causers”. Then, Bayesian learning carefully encodes the local and ST causal relations with a Gaussian Bayesian Network (GBN)-based graphical model, which also integrates environmental influences to minimize biases in the final results. We evaluate our approach with three real-world data sets containing 982 air quality sensors in 128 cities, in three regions of China from 01-Jun-2013 to 31-Dec-2016. Results show that our approach outperforms the traditional causal structure learning methods in time efficiency, inference accuracy, and interpretability.

Journal ArticleDOI
TL;DR: MtMR, a Merkle tree-based verification method that assures high result integrity of MapReduce jobs, is proposed, and a series of theoretical studies are performed to analyze its security and performance overhead.
Abstract: Big data applications have made significant impacts in recent years thanks to the fast growth of cloud computing and big data infrastructures. However, public cloud is still not widely accepted to perform big data computing, due to the concern with the public cloud's security. Result integrity is one of the most significant security problems that exists in the cloud-based big data computing scenario. In this paper, we propose MtMR, a Merkle tree-based verification method that assures high result integrity of MapReduce jobs. MtMR overlays MapReduce on a hybrid cloud environment and applies two rounds of Merkle tree-based verifications on the pre-reduce phase (i.e., the map phase and the shuffle phase) and the reduce phase, respectively. In each round, MtMR samples a small portion of reduce task input/output records on the private cloud and performs Merkle tree-based verification on all the task input/output records. Based on the design of MtMR, we perform a series of theoretical studies to analyze its security and performance overhead. Our results indicate that MtMR is a promising method in terms of high result integrity and low performance overhead. For example, by setting the sampled record ratio as an optimal value, MtMR can guarantee no more than 10 incorrect records in each reduce task by sampling only 4 percent of records in that task.
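
The Merkle-tree primitive at the core of this kind of verification can be sketched as follows (an illustrative sketch, not MtMR's hybrid-cloud protocol): the untrusted side commits to all records of a task via a Merkle root, and sampled records are then checked against that root with authentication paths. Record contents are hypothetical.

```python
# Sketch: commit to task records with a Merkle root and verify sampled records.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(records):
    level = [h(r.encode()) for r in records]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate the last node if odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof(levels, idx):
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append((level[idx ^ 1], idx % 2))  # (sibling hash, am-I-right-child?)
        idx //= 2
    return path

def verify(record, path, root):
    node = h(record.encode())
    for sib, is_right in path:
        node = h(sib + node) if is_right else h(node + sib)
    return node == root

records = ["k1,v1", "k2,v2", "k3,v3", "k4,v4", "k5,v5"]   # hypothetical task output
levels = merkle_levels(records)
root = levels[-1][0]
print(verify("k3,v3", proof(levels, 2), root))            # True
print(verify("k3,tampered", proof(levels, 2), root))      # False
```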

Journal ArticleDOI
TL;DR: A complete mining framework is proposed, which includes an optimal method for the light setting and an approximate method that can provide a performance guarantee by utilizing the greedy heuristic, and is composed of an efficient updating strategy, index partition, and workload-based optimization techniques.
Abstract: Mining the most influential location set finds $k$ locations, traversed by the maximum number of unique trajectories, in a given spatial region. These influential locations are valuable for resource allocation applications, such as selecting charging stations for electric automobiles and suggesting locations for placing billboards. This problem is NP-hard and usually calls for an interactive mining process involving a user's input, e.g., changing the spatial region and $k$, or removing some locations (from the results in the previous round) that are not eligible for an application according to domain knowledge. Efficiency is the major concern in conducting this human-in-the-loop mining. To this end, we propose a complete mining framework, which includes an optimal method for the light setting (i.e., small region and $k$) and an approximate method for the heavy setting (i.e., large region and $k$). The optimal method leverages vertex grouping and best-first pruning techniques to expedite the mining process. The approximate method can provide a performance guarantee by utilizing the greedy heuristic, and it comprises an efficient updating strategy, index partition, and workload-based optimization techniques. We evaluate the efficiency and effectiveness of our methods based on two taxi datasets from China and one check-in dataset from New York.
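
The greedy heuristic for the heavy setting can be illustrated in a few lines of Python (a sketch only; the paper's updating strategy, index partition, and workload-based optimizations are omitted, and the trajectory data are hypothetical): at each step, pick the location that covers the most not-yet-covered trajectories.

```python
# Sketch: greedy max-coverage selection of k influential locations.
def greedy_influential(location_trajs, k):
    """location_trajs: dict location -> set of trajectory ids passing through it."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(location_trajs,
                   key=lambda loc: len(location_trajs[loc] - covered))
        if not location_trajs[best] - covered:
            break                                   # nothing new left to cover
        chosen.append(best)
        covered |= location_trajs[best]
    return chosen, covered

# hypothetical toy data: four candidate locations, seven trajectories
location_trajs = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6, 7},
    "D": {1, 7},
}
print(greedy_influential(location_trajs, k=2))      # picks C first, then A
```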

Journal ArticleDOI
TL;DR: This paper has imposed three different regularization terms to constrain the objective functions of matrix factorization and built five corresponding models that can effectively improve the performance of missing data prediction in multivariable time series.
Abstract: More massive volumes of data are generated in many areas than ever before. However, missing values in the collected data always occur in practice and challenge the extraction of maximal value from these large-scale data sets. In multivariable time series, most of the existing methods are either infeasible or inefficient for predicting the missing data. In this paper, we take up the challenge of missing data prediction in multivariable time series by employing improved matrix factorization techniques. Our approaches are optimally designed to largely utilize both the internal patterns of each time series and the information of time series across multiple sources. Based on this idea, we impose three different regularization terms to constrain the objective functions of matrix factorization and build five corresponding models. Extensive experiments on real-world data sets and a synthetic data set demonstrate that the proposed approaches can effectively improve the performance of missing data prediction in multivariable time series. Furthermore, we also demonstrate how to take advantage of the high processing power of Apache Spark to perform missing data prediction in large-scale multivariable time series.
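
As an illustration of the general approach (not the paper's exact models), the sketch below factorizes a (series x time) matrix with missing entries as U @ V and adds one possible regularizer, a temporal-smoothness term on V, optimized by plain gradient descent; the rank, learning rate, and data are made up.

```python
# Sketch: regularized matrix factorization for imputing missing time-series values.
import numpy as np

def mf_impute(M, mask, rank=2, lam=0.1, mu=0.5, lr=0.005, iters=5000, seed=0):
    """Factor M (series x time) as U @ V using only observed entries (mask == 1)."""
    rng = np.random.default_rng(seed)
    n, t = M.shape
    U = 0.1 * rng.normal(size=(n, rank))
    V = 0.1 * rng.normal(size=(rank, t))
    for _ in range(iters):
        E = mask * (U @ V - M)                          # error on observed entries only
        dV_smooth = np.zeros_like(V)
        dV_smooth[:, 1:] += V[:, 1:] - V[:, :-1]        # temporal-smoothness gradient
        dV_smooth[:, :-1] -= V[:, 1:] - V[:, :-1]
        U -= lr * (E @ V.T + lam * U)
        V -= lr * (U.T @ E + lam * V + mu * dV_smooth)
    return U @ V

rng = np.random.default_rng(1)
t = np.linspace(0, 4 * np.pi, 50)
M = np.vstack([np.sin(t), np.cos(t), 0.5 * np.sin(t) + 0.3 * np.cos(t)])  # 3 related series
mask = (rng.random(M.shape) < 0.7).astype(float)        # ~30% of entries missing
M_hat = mf_impute(M * mask, mask)
print(np.abs((M_hat - M)[mask == 0]).mean())            # mean error on the missing entries
```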

Journal ArticleDOI
TL;DR: Simulations show that the proposed resource allocation scheme remarkably improves the max-min fairness in utilities of the topology throughput, and is low in computational complexity.
Abstract: Distributed stream big data analytics platforms have emerged to tackle the continuously generated data streams. In stream big data analytics, the data processing workflow is abstracted as a directed graph referred to as a topology. Data are read from the storage and processed tuple by tuple, and these processing results are updated dynamically. The performance of a topology is evaluated by its throughput. This paper proposes an efficient resource allocation scheme for a heterogeneous stream big data analytics cluster shared by multiple topologies, in order to achieve max-min fairness in the utilities of the throughput for all the topologies. We first formulate a novel resource allocation problem, which is a mixed 0-1 integer program. The NP-hardness of the problem is rigorously proven. To tackle this problem, we transform the non-convex constraint to several linear constraints using linearization and reformulation techniques. Based on the analysis of the problem-specific structure and characteristics, we propose an approach that iteratively solves the continuous problem with a fixed set of discrete variables optimally, and updates the discrete variables heuristically. Simulations show that our proposed resource allocation scheme remarkably improves the max-min fairness in utilities of the topology throughput, and is low in computational complexity.

Journal ArticleDOI
TL;DR: This paper describes a large-scale dataset, combining topology, traffic demand from call detail records, and demographic information throughout a whole country, and investigates how these aspects interact, revealing effects that are normally not captured by smaller-scale or synthetic datasets.
Abstract: In a world of open data and large-scale measurements, it is often feasible to obtain a real-world trace to fit to one's research problem. Feasible, however, does not imply simple. Taking next-generation cellular network planning as a case study, in this paper we describe a large-scale dataset, combining topology, traffic demand from call detail records, and demographic information throughout a whole country. We investigate how these aspects interact, revealing effects that are normally not captured by smaller-scale or synthetic datasets. In addition to making the resulting dataset available for download, we discuss how our experience can be generalized to other scenarios and case studies, i.e., how everyone can construct a similar dataset from publicly available information.

Journal ArticleDOI
TL;DR: This model is as efficient as the fastest online similarity learning model OASIS, while performing generally as well as the accurate model OMLLR and can exclude irrelevant / redundant feature dimension simultaneously.
Abstract: In this paper, we propose a general model to address the overfitting problem in online similarity learning for big data, which is generally caused by two kinds of redundancy: 1) feature redundancy, i.e., there exist redundant (irrelevant) features in the training data; 2) rank redundancy, i.e., non-redundant (or relevant) features lie in a low-rank space. To overcome these, our model is designed to obtain a simple and robust metric matrix by detecting the redundant rows and columns in the metric matrix and constraining the remaining matrix to a low-rank space. To reduce feature redundancy, we employ the group sparsity regularization, i.e., the $\ell _{2,1}$ norm, to encourage a sparse feature set. To address rank redundancy, we adopt the low-rank regularization, the max norm, instead of calculating the SVD as in traditional models using the nuclear norm. Therefore, our model can not only generate a low-rank metric matrix to avoid overfitting, but also achieve feature selection simultaneously. For model optimization, an online algorithm based on the stochastic proximal method is derived to solve this problem efficiently with a complexity of $O(d^2)$. To validate the effectiveness and efficiency of our algorithms, we apply our model to online scene categorization and synthesized data and conduct experiments on various benchmark datasets with comparisons to several state-of-the-art methods. Our model is as efficient as the fastest online similarity learning model OASIS, while performing generally as well as the accurate model OMLLR. Moreover, our model can exclude irrelevant/redundant feature dimensions simultaneously.
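
The feature-redundancy step can be illustrated by the proximal operator of the $\ell_{2,1}$ norm, which shrinks whole rows of the metric matrix toward zero so that features whose rows vanish are effectively dropped. The sketch below shows only this step (the max-norm/low-rank step and the online updates are omitted); the matrix and threshold are made up.

```python
# Sketch: row-wise soft thresholding, i.e., the prox of t * sum_i ||M[i, :]||_2.
import numpy as np

def prox_l21(M, t):
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return scale * M                         # rows with small norm are zeroed out

M = np.array([[0.9, 0.8, 0.7],
              [0.05, 0.02, 0.01],            # nearly irrelevant feature row
              [0.6, 0.5, 0.4]])
print(prox_l21(M, t=0.2))                    # the second row is zeroed entirely
```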

Journal ArticleDOI
TL;DR: The results show that the proposed Euler clustering approach achieves overall better clustering performance compared to using popular Mercer kernels and approximation models, whilst keeping the computational complexity of the same magnitude as the most popular linear clustering method.
Abstract: Our concern is nonlinear clustering on large-scale datasets. While existing popular kernels (RBF, polynomial, spatial pyramid, etc.) are widely used for implicitly mapping data into a high-dimensional or infinite-dimensional space in order to generalize linear clustering methods, using these kernels cannot make kernel clustering approaches directly applicable to large-scale datasets, since a large-scale kernel matrix or similarity matrix consumes a lot of memory (e.g., 7,450 GB of memory for 1 million data samples). To solve this problem, we introduce an Euler clustering approach. Euler clustering employs Euler kernels in order to intrinsically map the input data onto a complex space of the same dimension as the input or twice that, so that Euler clustering can dispense with the kernel trick and does not need to rely on any approximation or random sampling of the kernel function/matrix, whilst performing a more robust nonlinear clustering against noise and outliers. Moreover, since the original Euler kernel cannot generate a non-negative similarity matrix and thus is inapplicable to spectral clustering, we introduce a positive Euler kernel and, more importantly, prove when it can generate a non-negative similarity matrix. We apply the Euler kernel and the proposed positive Euler kernel to kernel $k$-means and spectral clustering so as to develop Euler $k$-means and Euler spectral clustering, respectively. An efficient Stiefel-manifold-based gradient method and an equivalent weighted positive Euler $k$-means are derived for fast computation of Euler spectral clustering and for further alleviating the impact of discretization of the cluster membership indicators in Euler spectral clustering. The results show that the proposed Euler clustering approach achieves overall better clustering performance compared to using popular Mercer kernels and approximation models, whilst keeping the computational complexity of the same magnitude as the most popular linear clustering method, $k$-means.
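
A minimal sketch of the explicit mapping idea (assuming the common Euler-kernel form $\phi(x)=\tfrac{1}{\sqrt{2}}e^{i\alpha\pi x}$; $\alpha$ and the toy data are illustrative, and this is ordinary $k$-means on the mapped data rather than the paper's full algorithms): since the complex Euclidean distance equals the Euclidean distance of the stacked real and imaginary parts, standard $k$-means can run directly on that real-valued representation.

```python
# Sketch: explicit Euler mapping followed by ordinary k-means.
import numpy as np
from sklearn.cluster import KMeans

def euler_map(X, alpha=1.0):
    Z = np.exp(1j * alpha * np.pi * X) / np.sqrt(2.0)   # complex embedding per feature
    return np.hstack([Z.real, Z.imag])                  # same metric, real-valued

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),      # two synthetic groups
               rng.normal(1.0, 0.1, size=(50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(euler_map(X))
print(labels[:5], labels[-5:])                           # the two groups separate
```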

Journal ArticleDOI
TL;DR: A web-resources-based state detection algorithm for events is developed in order to help people clearly understand an emergency event and to help social groups or the government handle emergency events effectively.
Abstract: An emergency event is a sudden, urgent, usually unexpected incident or occurrence that requires an immediate reaction or assistance, and such events play an increasingly important role in the global economy and in our daily lives. Recently, the web is becoming an important event information provider and repository due to its real-time, open, and dynamic features. In this paper, a web-resources-based algorithm for detecting the states of an event is developed in order to help people clearly understand an emergency event and to help social groups or the government handle emergency events effectively. The relationship between the web and emergency events is first introduced, which is the foundation for using web resources to detect the states of emergency events as reflected on the web. Second, five temporal features of emergency events are developed to provide the basis for state detection. Moreover, the outbreak power and the fluctuation power are presented to integrate the above temporal features for measuring the different states of an emergency event. Using these two powers, an automatic state detection algorithm for emergency events is proposed. In addition, heuristic rules for detecting the states of an emergency event on the web are discussed. Our evaluations using real-world data sets demonstrate the utility of the proposed algorithm, in terms of performance and effectiveness in the analysis of emergency events.

Journal ArticleDOI
TL;DR: An SDN-based Cooperative Cache Network (SCCN) for ISP networks, aiming to minimize the content transmission latency while reducing the inter-ISP traffic is proposed, and a Relaxation Algorithm (RA) based on relaxation-rounding technique is proposed to solve the problem.
Abstract: Cooperative cache has become a promising technique to optimize the traffic by caching big data in networks. However, controlling distributed cache nodes to update cached contents synergistically is still challenging in designing cooperative cache systems. This paper proposes an SDN-based Cooperative Cache Network (SCCN) for ISP networks, aiming to minimize the content transmission latency while reducing the inter-ISP traffic. Based on the proposed increment recording mechanism, the SCCN Controller can timely capture the change of content popularity, and place the most popular contents on the appropriate SCCN Switches. We formulate the optimal content placement as a specific multi-commodity facility location problem and prove its NP-hardness. We propose a Relaxation Algorithm (RA) based on relaxation-rounding technique to solve the problem, which can achieve an approximation ratio of $1/2$ in the worst case. To solve large scale problems for big data efficiently, we further design a Heuristic Algorithm (HA), which can find a near-optimal solution with three orders of magnitude speedup compared to RA. Specifically, HA can achieve a desirable tradeoff between the transmission delay and the Internet traffic. We implement a prototype based on Open vSwitch to demonstrate the feasibility of SCCN. Extensive trace-based simulation results show the effectiveness of SCCN under various network conditions.

Journal ArticleDOI
TL;DR: The design, implementation and evaluation of G-Storm is presented, a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing.
Abstract: The Single Instruction Multiple Data (SIMD) architecture of Graphic Processing Units (GPUs) makes them perfect for parallel processing of big data. In this paper, we present the design, implementation and evaluation of G-Storm , a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing. G-Storm has the following desirable features: 1) G-Storm is designed to be a general data processing platform as Storm, which can handle various applications and data types. 2) G-Storm exposes GPUs to Storm applications while preserving its easy-to-use programming model. 3) G-Storm achieves high-throughput and low-overhead data processing with GPUs. 4) G-Storm accelerates data processing further by enabling Direct Data Transfer (DDT), between two executors that process data at a common GPU. We implemented G-Storm based on Storm 0.9.2 and tested it using three different applications, including continuous query, matrix multiplication and image resizing. Extensive experimental results show that 1) Compared to Storm, G-Storm achieves over 7× improvement on throughput for continuous query, while maintaining reasonable average tuple processing time. It also leads to 2.3× and 1.3× throughput improvements on the other two applications, respectively. 2) DDT significantly reduces data processing time.

Journal ArticleDOI
TL;DR: Experimental results and comparative studies over numerous datasets demonstrate the effectiveness of the generative model derived to determine the sparse hyperparameter effectively and efficiently.
Abstract: The sparse autoencoder is an unsupervised feature extractor and has been widely used in the machine learning and data mining community. However, a sparse hyperparameter has to be determined to balance the trade-off between the reconstruction error and the sparsity of the sparse autoencoder. The traditional sparse hyperparameter determination method is time-consuming, especially when the dataset is large. In this paper, we derive a generative model for the sparse autoencoder. Based on this model, we derive a formulation to determine the sparse hyperparameter effectively and efficiently. The relationship between the sparse hyperparameter and the average activation of the sparse autoencoder's hidden units is also presented in this paper. Experimental results and comparative studies over numerous datasets demonstrate the effectiveness of our method in determining the sparse hyperparameter.
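
For context, the trade-off the sparse hyperparameter controls can be written down directly: the sparse autoencoder loss adds a KL-divergence penalty between a target activation $\rho$ and the average hidden activation $\hat{\rho}$, weighted by the sparse hyperparameter $\beta$. The sketch below shows that loss and the traditional (costly) sweep over $\beta$; it does not implement the paper's generative-model-based determination, and all values are illustrative.

```python
# Sketch: the reconstruction-vs-sparsity trade-off controlled by beta.
import numpy as np

def kl_sparsity(rho, rho_hat):
    """KL divergence between target activation rho and per-unit average rho_hat."""
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparse_ae_loss(reconstruction_err, hidden_activations, rho=0.05, beta=3.0):
    rho_hat = hidden_activations.mean(axis=0)   # average activation per hidden unit
    return reconstruction_err + beta * kl_sparsity(rho, rho_hat)

# traditional determination: sweep beta (each point would normally mean retraining)
hidden = np.random.default_rng(0).uniform(0.0, 0.3, size=(100, 16))  # made-up activations
for beta in (0.1, 1.0, 3.0):
    print(beta, sparse_ae_loss(reconstruction_err=0.8, hidden_activations=hidden, beta=beta))
```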

Journal ArticleDOI
TL;DR: A mediation-based component, referred to as a virtual data store (VDS), is proposed to optimize and execute complex queries over multiple data stores in Cloud environments, relying on a simple global schema that describes the different data sources and their relationships.
Abstract: The production of huge amounts of data and the emergence of cloud computing have introduced new requirements for data management. Many applications need to interact with several heterogeneous data stores depending on the type of data they have to manage: relational and NoSQL (i.e., document, graph, key-value, and column) data stores. Interacting with heterogeneous data models via different APIs and query languages imposes challenging tasks on the developers of multiple-data-store applications. Indeed, the execution of complex queries over heterogeneous data models cannot, currently, be achieved in a declarative way as it used to be with single-data-store applications, and therefore requires extra implementation effort. In this paper we propose a mediation-based component to optimize and execute complex queries over multiple data stores in Cloud environments. This component is referred to as a virtual data store (VDS). The key ingredients of our solution are (1) a simple global schema describing the different data sources and their relationships, (2) a cost model to evaluate the cost of the operations, (3) an inter-data-store parallelism execution model, and (4) a dynamic-programming-based approach to generate the optimal execution plan. Quantitative and qualitative experiments are conducted to validate our approach.

Journal ArticleDOI
TL;DR: The results demonstrate that the aggregated contributions of the QuantCloud infrastructure, parallel algorithms, and sophisticated implementations offer the algorithmic trading and financial engineering community new hope and numeric insights for their research and development.
Abstract: In this paper, we present the QuantCloud infrastructure, designed for performing big data analytics in modern quantitative finance. Through analyzing market observations, quantitative finance (QF) utilizes mathematical models to search for subtle patterns and inefficiencies in financial markets to improve prospective profits. To discover profitable signals in anticipation of volatile trading patterns amid a global market, analytics are carried out on Exabyte-scale market metadata with a complex process in pursuit of a microsecond or even a nanosecond of data processing advantage. This objective motivates the development of innovative tools to address challenges for handling high volume, velocity, and variety investment instruments. Inspired by this need, we developed QuantCloud by employing large-scale SSD-backed datastore, various parallel processing algorithms, and portability in Cloud computing. QuantCloud bridges the gap between model computing techniques and financial data-driven research. The large volume of market data is structured in an SSD-backed datastore, and a daemon reacts to provide the Data-on-Demand services. Multiple client services process user requests in a parallel mode and query on-demand datasets from the datastore through Internet connections. We benchmark QuantCloud performance on a 40-core, 1TB-memory computer and a 5-TB SSD-backed datastore. We use NYSE TAQ data from the fourth quarter of 2014 as our market data. The results indicate data-access application latency as low as 3.6 nanoseconds per message, sustained throughput for parallel data processing as high as 74 million messages per second, and completion of 11 petabyte-level data analytics within 53 minutes. Our results demonstrate that the aggregated contributions of our infrastructure, parallel algorithms, and sophisticated implementations offer the algorithmic trading and financial engineering community new hope and numeric insights for their research and development.

Journal ArticleDOI
TL;DR: This paper proposes an enhanced projection pursuit method to better project and visualize the structures of big high-dimensional (HD) longitudinal data on a lower-dimensional plane and demonstrates its better performance in visualizing big HD longitudinal data.
Abstract: Big longitudinal data provide more reliable information for decision making and are common in all kinds of fields. Trajectory pattern recognition is urgently needed to discover important structures in such data. Developing better and more computationally efficient visualization tools is crucial to guide this technique. This paper proposes an enhanced projection pursuit (EPP) method to better project and visualize the structures (e.g., clusters) of big high-dimensional (HD) longitudinal data on a lower-dimensional plane. Unlike classic PP methods potentially useful for longitudinal data, EPP is built upon nonlinear mapping algorithms to compute its stress (error) function by balancing the paired weights for between-structure and within-structure stress while preserving original structure membership in the high-dimensional space. Specifically, EPP solves an NP-hard optimization problem by integrating gradual optimization and nonlinear mapping algorithms, and automates the search for an optimal number of iterations to display a stable structure for varying sample sizes and dimensions. Using public UCI and real longitudinal clinical trial datasets as well as simulation, EPP demonstrates its better performance in visualizing big HD longitudinal data.
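
As a rough illustration of what a nonlinear-mapping stress function measures (a Sammon-style stress, not EPP's structure-weighted one; the data are synthetic), the sketch below compares the stress of a random 2-D projection with that of a PCA projection of two clusters; a lower stress indicates that the 2-D layout better preserves the high-dimensional pairwise distances.

```python
# Sketch: Sammon-style stress of a 2-D layout relative to high-dimensional distances.
import numpy as np

def pairwise_dist(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def sammon_stress(D_high, Y):
    """Stress between high-dimensional distances D_high and a 2-D layout Y."""
    D_low = pairwise_dist(Y)
    iu = np.triu_indices_from(D_high, k=1)
    d, dl = D_high[iu], D_low[iu]
    return float(((d - dl) ** 2 / np.maximum(d, 1e-12)).sum() / d.sum())

def pca_2d(X):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 10)),            # two synthetic clusters in 10-D
               rng.normal(2, 0.2, (30, 10))])
D = pairwise_dist(X)
print("random projection stress:", sammon_stress(D, X @ rng.normal(size=(10, 2))))
print("PCA projection stress:   ", sammon_stress(D, pca_2d(X)))
```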