
Showing papers in "Journal of Big Data in 2019"


Journal ArticleDOI
TL;DR: This survey will present existing methods for Data Augmentation, promising developments, and meta-level decisions for implementing Data Augmentation, a data-space solution to the problem of limited data.
Abstract: Deep convolutional neural networks have performed remarkably well on many Computer Vision tasks. However, these networks are heavily reliant on big data to avoid overfitting. Overfitting refers to the phenomenon when a network learns a function with very high variance such as to perfectly model the training data. Unfortunately, many application domains do not have access to big data, such as medical image analysis. This survey focuses on Data Augmentation, a data-space solution to the problem of limited data. Data Augmentation encompasses a suite of techniques that enhance the size and quality of training datasets such that better Deep Learning models can be built using them. The image augmentation algorithms discussed in this survey include geometric transformations, color space augmentations, kernel filters, mixing images, random erasing, feature space augmentation, adversarial training, generative adversarial networks, neural style transfer, and meta-learning. The application of augmentation methods based on GANs are heavily covered in this survey. In addition to augmentation techniques, this paper will briefly discuss other characteristics of Data Augmentation such as test-time augmentation, resolution impact, final dataset size, and curriculum learning. This survey will present existing methods for Data Augmentation, promising developments, and meta-level decisions for implementing Data Augmentation. Readers will understand how Data Augmentation can improve the performance of their models and expand limited datasets to take advantage of the capabilities of big data.

5,782 citations
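The geometric transformations and random erasing listed in the abstract are straightforward to sketch. Below is a minimal, hedged illustration in Python with NumPy (not the survey's own code; function names are illustrative):

```python
import random
import numpy as np

def horizontal_flip(img):
    """Mirror the image left-to-right (a geometric transformation)."""
    return img[:, ::-1]

def random_erase(img, size=2, rng=None):
    """Zero out a random size x size patch (random erasing)."""
    rng = rng or random.Random(0)
    h, w = img.shape
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    out = img.copy()
    out[top:top + size, left:left + size] = 0
    return out

def augment(img, rng=None):
    """Produce a small set of augmented views of one training image."""
    return [img, horizontal_flip(img), np.rot90(img), random_erase(img, rng=rng)]

# Toy 4x4 grayscale "image"
image = np.arange(16, dtype=float).reshape(4, 4)
views = augment(image)
```

Each original image yields several distinct training samples, which is the "enhance the size of the dataset" effect the survey describes.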


Journal ArticleDOI
TL;DR: Examination of existing deep learning techniques for addressing class imbalanced data finds that research in this area is very limited, that most existing work focuses on computer vision tasks with convolutional neural networks, and that the effects of big data are rarely considered.
Abstract: The purpose of this study is to examine existing deep learning techniques for addressing class imbalanced data. Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g., fraud detection and cancer detection. Moreover, highly imbalanced data poses added difficulty, as most learners will exhibit bias towards the majority class, and in extreme cases, may ignore the minority class altogether. Class imbalance has been studied thoroughly over the last two decades using traditional machine learning models, i.e. non-deep learning. Despite recent advances in deep learning, along with its increasing popularity, very little empirical work in the area of deep learning with class imbalance exists. Having achieved record-breaking performance results in several complex domains, investigating the use of deep neural networks for problems containing high levels of class imbalance is of great interest. Available studies regarding class imbalance and deep learning are surveyed in order to better understand the efficacy of deep learning when applied to class imbalanced data. This survey discusses the implementation details and experimental results for each study, and offers additional insight into their strengths and weaknesses. Several areas of focus include: data complexity, architectures tested, performance interpretation, ease of use, big data application, and generalization to other domains. We have found that research in this area is very limited, that most existing work focuses on computer vision tasks with convolutional neural networks, and that the effects of big data are rarely considered. Several traditional methods for class imbalance, e.g. data sampling and cost-sensitive learning, prove to be applicable in deep learning, while more advanced methods that exploit neural network feature learning abilities show promising results. 
The survey concludes with a discussion that highlights various gaps in deep learning from class imbalanced data for the purpose of guiding future research.

1,377 citations
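Two of the traditional remedies the survey names, data sampling and cost-sensitive learning, can be sketched briefly; this is an illustrative toy implementation, not code from any surveyed study:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

def class_weights(y):
    """Inverse-frequency weights, usable as cost-sensitive loss multipliers."""
    counts = Counter(y)
    total = len(y)
    return {label: total / (len(counts) * n) for label, n in counts.items()}

# Highly imbalanced toy data: five majority samples, one minority sample.
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 0, 1]
Xb, yb = random_oversample(X, y)
w = class_weights(y)
```

Oversampling rebalances the training set itself, while the weights leave the data unchanged and instead penalize minority-class errors more heavily in the loss.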


Journal ArticleDOI
TL;DR: To provide relevant solutions for improving public health, healthcare providers are required to be fully equipped with appropriate infrastructure to systematically generate and analyze big data.
Abstract: ‘Big data’ is massive amounts of information that can work wonders. It has become a topic of special interest for the past two decades because of the great potential hidden in it. Various public and private sector industries generate, store, and analyze big data with the aim of improving the services they provide. In the healthcare industry, the various sources of big data include hospital records, medical records of patients, results of medical examinations, and devices that are part of the internet of things. Biomedical research also generates a significant portion of big data relevant to public healthcare. This data requires proper management and analysis in order to derive meaningful information; otherwise, seeking a solution by analyzing big data quickly becomes comparable to finding a needle in a haystack. There are various challenges associated with each step of handling big data, which can only be overcome by using high-end computing solutions for big data analysis. That is why, to provide relevant solutions for improving public health, healthcare providers are required to be fully equipped with appropriate infrastructure to systematically generate and analyze big data. Efficient management, analysis, and interpretation of big data can change the game by opening new avenues for modern healthcare. That is exactly why various industries, including the healthcare industry, are taking vigorous steps to convert this potential into better services and financial advantages. With a strong integration of biomedical and healthcare data, modern healthcare organizations can possibly revolutionize medical therapies and personalized medicine.

615 citations


Journal ArticleDOI
TL;DR: This paper researches how to apply the convolutional neural network (CNN) based algorithm on a chest X-ray dataset to classify pneumonia and shows that data augmentation generally is an effective way for all three algorithms to improve performance.
Abstract: Medical image classification plays an essential role in clinical treatment and teaching tasks. However, traditional methods have reached their performance ceiling, and much time and effort must be spent on extracting and selecting classification features. The deep neural network is an emerging machine learning method that has proven its potential for different classification tasks. Notably, the convolutional neural network dominates with the best results on varying image classification tasks. However, medical image datasets are hard to collect because labeling them requires a great deal of professional expertise. Therefore, this paper researches how to apply a convolutional neural network (CNN) based algorithm to a chest X-ray dataset to classify pneumonia. Three techniques are evaluated through experiments: a linear support vector machine classifier with local rotation- and orientation-free features, transfer learning on two convolutional neural network models (VGG16 and InceptionV3), and a capsule network trained from scratch. Data augmentation is a data preprocessing method applied to all three methods. The results of the experiments show that data augmentation is generally an effective way for all three algorithms to improve performance. Also, transfer learning is a more useful classification method on a small dataset compared to a support vector machine with Oriented FAST and Rotated BRIEF (ORB) features or a capsule network. In transfer learning, retraining specific features on a new target dataset is essential to improve performance. The second important factor is a proper network complexity that matches the scale of the dataset.

481 citations
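The "retraining specific features on a new target dataset" step of transfer learning can be illustrated by training only a classification head on frozen, pre-extracted features. This is a toy NumPy sketch with synthetic features, not the paper's VGG16/InceptionV3 pipeline; all names are illustrative:

```python
import numpy as np

def train_head(features, labels, lr=0.5, epochs=200):
    """Train a logistic-regression 'head' on frozen pretrained features,
    mimicking the retrain-the-top-layer step of transfer learning."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        z = features @ w + b
        p = 1.0 / (1.0 + np.exp(-z))          # sigmoid activation
        grad_w = features.T @ (p - labels) / len(labels)
        grad_b = float(np.mean(p - labels))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy 'extracted features': class 1 has larger activations on average.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(2, 1, (20, 3))])
labs = np.array([0] * 20 + [1] * 20)
w, b = train_head(feats, labs)
preds = (1 / (1 + np.exp(-(feats @ w + b))) > 0.5).astype(int)
acc = float((preds == labs).mean())
```

In a real pipeline the frozen base network would supply `feats`; only the small head is updated, which is why transfer learning works on small medical datasets.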


Journal ArticleDOI
TL;DR: The article discusses different challenges and key issues of IoT, architecture and important application domains, and the importance of big data and its analysis with respect to IoT has been discussed.
Abstract: Internet of Things (IoT) is a new paradigm that has changed the traditional way of living into a high-tech lifestyle. Smart cities, smart homes, pollution control, energy saving, smart transportation, and smart industries are examples of such transformations driven by IoT. A lot of crucial research studies and investigations have been done in order to enhance the technology through IoT. However, there are still many challenges and issues that need to be addressed to achieve the full potential of IoT. These challenges and issues must be considered from various aspects of IoT such as applications, challenges, enabling technologies, social and environmental impacts, etc. The main goal of this review article is to provide a detailed discussion from both technological and social perspectives. The article discusses different challenges and key issues of IoT, its architecture, and important application domains. It also brings to light the existing literature and illustrates its contributions to different aspects of IoT. Moreover, the importance of big data and its analysis with respect to IoT is discussed. This article will help readers and researchers understand IoT and its applicability to the real world.

433 citations


Journal ArticleDOI
TL;DR: This article reviews previous work in big data analytics and presents a discussion of open challenges and future directions for recognizing and mitigating uncertainty in this domain.
Abstract: Big data analytics has gained wide attention from both academia and industry as the demand for understanding trends in massive datasets increases. Recent developments in sensor networks, cyber-physical systems, and the ubiquity of the Internet of Things (IoT) have increased the collection of data (including health care, social media, smart cities, agriculture, finance, education, and more) to an enormous scale. However, the data collected from sensors, social media, financial records, etc. is inherently uncertain due to noise, incompleteness, and inconsistency. The analysis of such massive amounts of data requires advanced analytical techniques for efficiently reviewing and/or predicting future courses of action with high precision and advanced decision-making strategies. As the amount, variety, and speed of data increases, so too does the uncertainty inherent within, leading to a lack of confidence in the resulting analytics process and decisions made thereof. In comparison to traditional data techniques and platforms, artificial intelligence techniques (including machine learning, natural language processing, and computational intelligence) provide more accurate, faster, and scalable results in big data analytics. Previous research and surveys conducted on big data analytics tend to focus on one or two techniques or specific application domains. However, little work has been done in the field of uncertainty when applied to big data analytics as well as in the artificial intelligence techniques applied to the datasets. This article reviews previous work in big data analytics and presents a discussion of open challenges and future directions for recognizing and mitigating uncertainty in this domain.

246 citations


Journal ArticleDOI
TL;DR: The main focus of this survey is the application of deep learning techniques in detecting the crowd count, the persons involved, and the activity taking place in a large crowd under all climate conditions.
Abstract: Big data applications are consuming most of the space in industry and research areas. Among the widespread examples of big data, the role of video streams from CCTV cameras is as important as other sources such as social media data, sensor data, agriculture data, medical data, and data from space research. Surveillance videos make a major contribution to unstructured big data. CCTV cameras are deployed in all places where security is of high importance. Manual surveillance is tedious and time-consuming. Security can be defined in different terms in different contexts, such as theft identification, violence detection, or chances of explosion. In crowded public places, the term security covers almost all types of abnormal events. Among them, violence detection is difficult to handle since it involves group activity. Anomalous or abnormal activity analysis in a crowd video scene is very difficult due to several real-world constraints. The paper includes a thorough survey that starts from object recognition and action recognition, moves to crowd analysis, and finally covers violence detection in a crowd environment. The majority of the papers reviewed in this survey are based on deep learning techniques. Various deep learning methods are compared in terms of their algorithms and models. The main focus of this survey is the application of deep learning techniques in detecting the crowd count, the persons involved, and the activity taking place in a large crowd under all climate conditions. The paper discusses the underlying deep learning implementation technology involved in various crowd video analysis methods. Real-time processing, an important issue which is yet to be explored more in this field, is also considered. Few methods handle all these issues simultaneously. The issues recognized in existing methods are identified and summarized, and future directions are given to reduce the obstacles identified.
The survey provides a bibliographic summary of papers from ScienceDirect, IEEE Xplore and ACM digital library.

219 citations


Journal ArticleDOI
TL;DR: The proposed method aims to focus on selecting the attributes that aid in early detection of Diabetes Mellitus using predictive analysis, and shows that the decision tree and random forest algorithms hold best for the analysis of diabetic data.
Abstract: Diabetes is a chronic disease, or group of metabolic diseases, in which a person suffers from an extended level of blood glucose in the body, either because insulin production is inadequate or because the body’s cells do not respond properly to insulin. The constant hyperglycemia of diabetes is related to long-term damage, dysfunction, and failure of various organs, particularly the eyes, kidneys, nerves, heart, and blood vessels. The objective of this research is to make use of significant features, design a prediction algorithm using machine learning, and find the optimal classifier that gives the result closest to clinical outcomes. The proposed method focuses on selecting the attributes that aid in early detection of Diabetes Mellitus using predictive analysis. The results show that the decision tree and random forest algorithms have the highest specificity, 98.20% and 98.00% respectively, and hold best for the analysis of diabetic data, while Naive Bayes achieves the best accuracy of 82.30%. The research also generalizes the selection of optimal features from the dataset to improve classification accuracy.

217 citations
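The specificity and accuracy figures quoted above are standard confusion-matrix metrics; a minimal sketch of how they are computed (illustrative code, not the paper's implementation):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary prediction task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

def specificity(y_true, y_pred):
    """Specificity = TN / (TN + FP): how reliably negative cases are recognised."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)

def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)

# Toy labels: 1 = diabetic, 0 = non-diabetic.
y_true = [1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
```

High specificity (as reported for the decision tree and random forest) means few healthy patients are wrongly flagged, which matters for screening applications.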


Journal ArticleDOI
TL;DR: The requirements for process data analysis pipelines are characterized and recommendations are offered for each phase of the pipeline, showing a stronger focus on the storage and analysis phases of pipelines than on the ingestion, communication, and visualization stages.
Abstract: Smart manufacturing is strongly correlated with the digitization of all manufacturing activities. This increases the amount of data available to drive productivity and profit through data-driven decision making programs. The goal of this article is to assist data engineers in designing big data analysis pipelines for manufacturing process data. Thus, this paper characterizes the requirements for process data analysis pipelines and surveys existing platforms from academic literature. The results demonstrate a stronger focus on the storage and analysis phases of pipelines than on the ingestion, communication, and visualization stages. Results also show a tendency towards custom tools for ingestion and visualization, and relational data tools for storage and analysis. Tools for handling heterogeneous data are generally well-represented throughout the pipeline. Finally, batch processing tools are more widely adopted than real-time stream processing frameworks, and most pipelines opt for a common script-based data processing approach. Based on these results, recommendations are offered for each phase of the pipeline.

177 citations


Journal ArticleDOI
TL;DR: This study aims to analyze the effectiveness of various machine learning classification models for predicting personalized usage utilizing individual’s phone log data and presents the empirical evaluations of Artificial Neural Network based classification model, which is frequently used in deep learning and makes comparative analysis in this context-aware study.
Abstract: Due to the increasing popularity of recent advanced features and context-awareness in smart mobile phones, the contextual data relevant to users’ diverse activities with their phones are recorded through the device logs. Modeling and predicting an individual’s smartphone usage based on contexts, such as temporal, spatial, or social information, can be used to build various context-aware personalized systems. In order to intelligently assist users, a machine learning classifier based usage prediction model for individual users is key. Thus, we aim to analyze the effectiveness of various machine learning classification models for predicting personalized usage utilizing an individual’s phone log data. In our context-aware analysis, we first employ ten classic and well-known machine learning classification techniques: ZeroR, Naive Bayes, Decision Tree, Random Forest, Support Vector Machine, K-Nearest Neighbors, Adaptive Boosting, Repeated Incremental Pruning to Produce Error Reduction, Ripple Down Rule Learner, and Logistic Regression. We also present an empirical evaluation of an Artificial Neural Network based classification model, frequently used in deep learning, and make a comparative analysis in our context-aware study. The effectiveness of these classifier based context-aware models is examined by conducting a range of experiments on real mobile phone datasets collected from individual users. The overall experimental results and discussions can help both researchers and application developers design and build intelligent context-aware systems for smartphone users.

163 citations
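ZeroR, the first technique named above, is the simplest possible baseline: it ignores the context entirely and predicts the majority class. A minimal sketch (toy contexts, not the study's phone-log data):

```python
from collections import Counter

class ZeroR:
    """Majority-class baseline: predicts the most frequent label seen in training.
    Any context-aware classifier should beat this to justify its complexity."""

    def fit(self, contexts, labels):
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, contexts):
        # Contexts are ignored by design; ZeroR always returns the majority label.
        return [self.majority_ for _ in contexts]

# Toy phone-log contexts: (hour, weekday) -> app category used.
train_ctx = [(9, "Mon"), (13, "Tue"), (21, "Fri"), (22, "Sat"), (8, "Mon")]
train_lab = ["work", "work", "social", "social", "work"]
model = ZeroR().fit(train_ctx, train_lab)
preds = model.predict([(10, "Wed"), (23, "Sun")])
```

Comparing richer models (Decision Tree, Random Forest, etc.) against this floor is how such studies quantify the value of the contextual features.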


Journal ArticleDOI
TL;DR: The aim of the paper is to enable the use of topic modelling for researchers by presenting a step-by-step framework on a case and sharing a code template, which enables huge amounts of papers to be reviewed in a transparent, reliable, faster, and reproducible way.
Abstract: Manual exploratory literature reviews should be a thing of the past, as technology and the development of machine learning methods have matured. The learning curve for using machine learning methods is rapidly declining, enabling new possibilities for all researchers. A framework is presented on how to use topic modelling on a large collection of papers for an exploratory literature review and how that can be used for a full literature review. The aim of the paper is to enable the use of topic modelling for researchers by presenting a step-by-step framework on a case and sharing a code template. The framework consists of three steps: pre-processing, topic modelling, and post-processing, where the topic model Latent Dirichlet Allocation is used. The framework enables huge amounts of papers to be reviewed in a transparent, reliable, faster, and reproducible way.
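The pre-processing step of the three-step framework can be sketched as below. The stopword list and helper names are illustrative assumptions, and the topic-modelling step itself (fitting LDA, typically via a library) is not shown:

```python
import re
from collections import Counter

# Illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def preprocess(text, min_len=3):
    """Lowercase, tokenise, and drop stopwords and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]

def doc_term_counts(docs):
    """Bag-of-words counts per document: the input a topic model such as LDA expects."""
    return [Counter(preprocess(d)) for d in docs]

papers = [
    "A survey of deep learning for image classification",
    "Topic modelling of the literature on big data",
]
bows = doc_term_counts(papers)
```

Post-processing would then map the fitted topics back to papers to structure the review; only the deterministic pre-processing is sketched here.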

Journal ArticleDOI
TL;DR: This study shows that smart and smarter cities are associated with misunderstanding and deficiencies as regards their incorporation of, and contribution to, sustainability, and tremendous opportunities are available for utilising big data analytics and its application in smart cities of the future.
Abstract: There has recently been a conscious push for cities across the globe to be smart and even smarter and thus more sustainable by developing and implementing big data technologies and their applications across various urban domains in the hopes of reaching the required level of sustainability and improving the living standard of citizens. Having gained momentum and traction as a promising response to the needed transition towards sustainability and to the challenges of urbanisation, smart and smarter cities as approaches to data-driven urbanism are increasingly adopting the advanced forms of ICT to improve their performance in line with the goals of sustainable development and the requirements of urban growth. One of such forms that has tremendous potential to enhance urban operations, functions, services, designs, strategies, and policies in this direction is big data analytics and its application. This is due to the kind of well-informed decision-making and enhanced insights enabled by big data computing in the form of applied intelligence. However, topical studies on big data technologies and their applications in the context of smart and smarter cities tend to deal largely with economic growth and the quality of life in terms of service efficiency and betterment while overlooking and barely exploring the untapped potential of such applications for advancing sustainability. In fact, smart and smarter cities raise several issues and involve significant challenges when it comes to their development and implementation in the context of sustainability. 
In that regard, this paper provides a comprehensive, state-of-the-art review and synthesis of the field of smart and smarter cities in relation to sustainability and related big data analytics and its application in terms of the underlying foundations and assumptions, research issues and debates, opportunities and benefits, technological developments, emerging trends, future practices, and challenges and open issues. This study shows that smart and smarter cities are associated with misunderstanding and deficiencies as regards their incorporation of, and contribution to, sustainability. Nevertheless, as also revealed by this study, tremendous opportunities are available for utilising big data analytics and its application in smart cities of the future to improve their contribution to the goals of sustainable development by optimising and enhancing urban operations, functions, services, designs, strategies, and policies, as well as by finding answers to challenging analytical questions and thereby advancing knowledge forms. However, just as there are immense opportunities ahead to embrace and exploit, there are enormous challenges and open issues ahead to address and overcome in order to achieve a successful implementation of big data technology and its novel applications in such cities.

Journal ArticleDOI
TL;DR: This work develops a churn prediction model which assists telecom operators to predict customers who are most likely subject to churn and builds a new way of features’ engineering and selection on big data platform.
Abstract: Customer churn is a major problem and one of the most important concerns for large companies. Due to its direct effect on revenues, especially in the telecom field, companies are seeking to develop means to predict which customers are likely to churn. Therefore, finding the factors that increase customer churn is important in order to take the necessary actions to reduce it. The main contribution of our work is to develop a churn prediction model which assists telecom operators in predicting the customers who are most likely to churn. The model developed in this work uses machine learning techniques on a big data platform and builds a new way of feature engineering and selection. In order to measure the performance of the model, the Area Under Curve (AUC) standard measure is adopted, and the AUC value obtained is 93.3%. Another main contribution is the use of the customer social network in the prediction model by extracting Social Network Analysis (SNA) features. The use of SNA enhanced the performance of the model from 84 to 93.3% against the AUC standard. The model was prepared and tested in a Spark environment, working on a large dataset created by transforming big raw data provided by the SyriaTel telecom company. The dataset contained all customers’ information over 9 months and was used to train, test, and evaluate the system at SyriaTel. The model experimented with four algorithms: Decision Tree, Random Forest, Gradient Boosted Machine Tree (GBM), and Extreme Gradient Boosting (XGBOOST). The best results were obtained by applying the XGBOOST algorithm, which was used for classification in this churn predictive model.
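The AUC measure used to evaluate the churn model can be computed from scratch via its rank (Mann-Whitney) formulation; a hedged sketch with toy scores, not SyriaTel data:

```python
def auc(y_true, scores):
    """Area under the ROC curve via the rank formulation: the probability that
    a randomly chosen positive (churner) scores higher than a randomly chosen
    negative (non-churner), with ties counted as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy churn scores: churners (label 1) mostly receive higher model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
```

Because AUC depends only on the ranking of scores, not on a classification threshold, it is a common choice for imbalanced problems like churn.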

Journal ArticleDOI
TL;DR: It was recommended that research efforts should be geared towards developing scalable frameworks and algorithms that will accommodate data stream computing mode, effective resource allocation strategy and parallelization issues to cope with the ever-growing size and complexity of data.
Abstract: Recently, big data streams have become ubiquitous because many applications generate huge amounts of data at great velocity. This has made it difficult for existing data mining tools, technologies, methods, and techniques to be applied directly to big data streams, due to the inherent dynamic characteristics of big data. This paper presents a systematic review of big data stream analysis, employing a rigorous and methodical approach to examine the trends in big data stream tools and technologies as well as the methods and techniques employed in analysing big data streams. It provides a global view of big data stream tools and technologies and a comparison between them. Three major databases, Scopus, ScienceDirect and EBSCO, which index journals and conferences promoted by entities such as IEEE, ACM, SpringerLink, and Elsevier, were explored as data sources. Out of the initial 2295 papers that resulted from the first search string, 47 papers were found to be relevant to our research questions after implementing the inclusion and exclusion criteria. The study found that scalability, privacy and load balancing issues, as well as empirical analysis of big data streams and technologies, are still open for further research efforts. We also found that although significant research efforts have been directed to real-time analysis of big data streams, not much attention has been given to the preprocessing stage. Only a few big data streaming tools and technologies can handle all of the batch, streaming, and iterative jobs; no big data tool and technology currently seems to offer all the key features required, and no standard benchmark dataset for big data streaming analytics has been widely adopted.
In conclusion, it was recommended that research efforts should be geared towards developing scalable frameworks and algorithms that will accommodate data stream computing mode, effective resource allocation strategy and parallelization issues to cope with the ever-growing size and complexity of data.

Journal ArticleDOI
TL;DR: This study proposed an ECG (electrocardiogram) classification approach using machine learning based on several ECG features that achieved an overall accuracy of 96.75% using the GDB Tree algorithm and 97.98% using Random Forest for binary classification.
Abstract: This study proposed an ECG (electrocardiogram) classification approach using machine learning based on several ECG features. An electrocardiogram (ECG) is a signal that measures the electric activity of the heart. The proposed approach is implemented using ML-libs and the Scala language on the Apache Spark framework; MLlib is Apache Spark’s scalable machine learning library. The key challenge in ECG classification is to handle the irregularities in the ECG signals, which is very important for detecting the patient's status. Therefore, we have proposed an efficient approach to classify ECG signals with high accuracy. Each heartbeat is a combination of action impulse waveforms produced by different specialized cardiac tissues. Heartbeat classification faces some difficulties because these waveforms differ from one person to another; they are therefore described by a set of features, and these features are the inputs of the machine learning algorithm. In general, using Spark–Scala tools simplifies the usage of many algorithms, such as machine learning (ML) algorithms. On the other hand, Spark–Scala is preferred over other tools when the size of the data being processed is very large. In our case, we have used a dataset with 205,146 records to evaluate the performance of our approach. Machine learning libraries in Spark–Scala provide easy ways to implement many classification algorithms (Decision Tree, Random Forests, Gradient-Boosted Trees (GDB), etc.). The proposed method is evaluated and validated on the baseline MIT-BIH Arrhythmia and MIT-BIH Supraventricular Arrhythmia databases. The results show that our approach achieved an overall accuracy of 96.75% using the GDB Tree algorithm and 97.98% using Random Forest for binary classification. For multi-class classification, it achieved 98.03% accuracy using Random Forest; the Gradient Boosted Tree supports only binary classification.
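A sketch, in plain Python rather than the paper's Spark–Scala setup, of the kind of per-beat features (R-R interval statistics) an ECG classifier might consume; the feature names and values here are illustrative assumptions, not the study's actual feature set:

```python
def rr_features(r_peaks_ms):
    """Summarise R-R intervals (milliseconds between successive R peaks),
    a common family of ECG features fed to heartbeat classifiers."""
    rr = [b - a for a, b in zip(r_peaks_ms, r_peaks_ms[1:])]
    mean_rr = sum(rr) / len(rr)
    var_rr = sum((x - mean_rr) ** 2 for x in rr) / len(rr)
    return {
        "mean_rr": mean_rr,                  # average beat-to-beat interval
        "sdnn": var_rr ** 0.5,               # standard deviation of R-R intervals
        "heart_rate_bpm": 60000.0 / mean_rr, # beats per minute
    }

# Toy R-peak times in milliseconds for five detected beats.
peaks = [0, 800, 1610, 2400, 3190]
feats = rr_features(peaks)
```

Irregular rhythms show up as high variability in these intervals, which is why such features help the classifier flag arrhythmias.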

Journal ArticleDOI
TL;DR: This study proposes zero-padding for resizing images to the same size and compares it with the conventional approach of scaling images up (zooming in) using interpolation, showing that zero- padding had no effect on the classification accuracy but considerably reduced the training time.
Abstract: The input to a machine learning model is a one-dimensional feature vector. However, in recent learning models, such as convolutional and recurrent neural networks, two- and three-dimensional feature tensors can also be inputted to the model. During training, the machine adjusts its internal parameters to project each feature tensor close to its target. After training, the machine can be used to predict the target for previously unseen feature tensors. What this study focuses on is the requirement that feature tensors must be of the same size. In other words, the same number of features must be present for each sample. This creates a barrier in processing images and texts, as they usually have different sizes, and thus different numbers of features. In classifying an image using a convolutional neural network (CNN), the input is a three-dimensional tensor, where the value of each pixel in each channel is one feature. The three-dimensional feature tensor must be the same size for all images. However, images are not usually of the same size, and so neither are their corresponding feature tensors. Resizing images to the same size without deforming patterns contained therein is a major challenge. This study proposes zero-padding for resizing images to the same size and compares it with the conventional approach of scaling images up (zooming in) using interpolation. Our study showed that zero-padding had no effect on the classification accuracy but considerably reduced the training time. The reason is that neighboring zero input units (pixels) will not activate their corresponding convolutional unit in the next layer. Therefore, the synaptic weights on outgoing links from input units do not need to be updated if they contain a zero value. Theoretical justification along with experimental endorsements are provided in this paper.
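The proposed zero-padding can be sketched directly; a minimal pure-Python version for a single-channel image (the study's CNN experiments are not reproduced here, and the centering choice is an illustrative assumption):

```python
def zero_pad(img, target_h, target_w):
    """Centre an image in a zero-filled canvas of the target size, so that all
    inputs share one tensor shape without deforming the patterns they contain."""
    h, w = len(img), len(img[0])
    assert h <= target_h and w <= target_w, "target must be at least the image size"
    top = (target_h - h) // 2
    left = (target_w - w) // 2
    canvas = [[0] * target_w for _ in range(target_h)]
    for i in range(h):
        for j in range(w):
            canvas[top + i][left + j] = img[i][j]
    return canvas

small = [[1, 2], [3, 4]]
padded = zero_pad(small, 4, 4)
```

Unlike interpolation-based upscaling, the original pixel values are preserved exactly; the zeros around the border are the inert inputs the abstract argues do not trigger weight updates downstream.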

Journal ArticleDOI
TL;DR: This research work addresses the competency and limitations of the existing IE techniques related to data pre-processing, data extraction and transformation, and representations for huge volumes of multidimensional unstructured data and presents a systematic literature review of state-of-the-art techniques for a variety of big data.
Abstract: The process of information extraction (IE) is used to extract useful information from unstructured or semi-structured data. Big data raises new challenges for IE techniques with the rapid growth of multifaceted, also called multidimensional, unstructured data. Traditional IE systems are inefficient at dealing with this huge deluge of unstructured big data. The volume and variety of big data demand improvements to the computational capabilities of these IE systems. It is necessary to understand the competency and limitations of the existing IE techniques related to data pre-processing, data extraction and transformation, and representations for huge volumes of multidimensional unstructured data. Numerous studies have been conducted on IE, addressing the challenges and issues for different data types such as text, image, audio and video. Very limited consolidated research work has been conducted to investigate the task-dependent and task-independent limitations of IE covering all data types in a single study. This research work addresses this limitation and presents a systematic literature review of state-of-the-art techniques for a variety of big data, consolidating all data types. Recent challenges of IE are also identified and summarized. Potential solutions are proposed, giving future research directions in big data IE. The research is significant in terms of recent trends and challenges related to big data analytics. The outcome of the research and the recommendations will help to improve big data analytics by making it more productive.

Journal ArticleDOI
TL;DR: This paper addresses the issue of data fusion in the context of IoT networks, consisting of edge devices, network and communications units, and Cloud platforms, and proposes a distributed hierarchical data fusion architecture, in which different data sources are combined at each level of the IoT taxonomy to produce timely and accurate results.
Abstract: The Internet of Things (IoT) facilitates creation of smart spaces by converting existing environments into sensor-rich data-centric cyber-physical systems with an increasing degree of automation, giving rise to Industry 4.0. When adopted in commercial/industrial contexts, this trend is revolutionising many aspects of our everyday life, including the way people access and receive healthcare services. As we move towards Healthcare Industry 4.0, the underlying IoT systems of Smart Healthcare spaces are growing in size and complexity, making it important to ensure that extreme amounts of collected data are properly processed to provide valuable insights and decisions according to requirements in place. This paper focuses on the Smart Healthcare domain and addresses the issue of data fusion in the context of IoT networks, consisting of edge devices, network and communications units, and Cloud platforms. We propose a distributed hierarchical data fusion architecture, in which different data sources are combined at each level of the IoT taxonomy to produce timely and accurate results. This way, mission-critical decisions, as demonstrated by the presented Smart Healthcare scenario, are taken with minimum time delay, as soon as necessary information is generated and collected. The proposed approach was implemented using the Complex Event Processing technology, which natively supports the hierarchical processing model and specifically focuses on handling streaming data ‘on the fly’—a key requirement for storage-limited IoT devices and time-critical application domains. Initial experiments demonstrate that the proposed approach enables fine-grained decision taking at different data fusion levels and, as a result, improves the overall performance and reaction time of public healthcare services, thus promoting the adoption of the IoT technologies in Healthcare Industry 4.0.
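To make the hierarchical idea concrete, here is a minimal Python sketch of two fusion levels (the function names, the averaging/maximum fusion rules, and the threshold are illustrative assumptions; the paper's actual implementation uses Complex Event Processing over live streams):

```python
def fuse_edge(readings):
    """Edge-level fusion: reduce raw sensor samples to one summary value."""
    return sum(readings) / len(readings)

def fuse_cloud(edge_summaries, threshold):
    """Higher-level fusion: combine edge summaries and decide immediately,
    without waiting for raw data to reach the Cloud."""
    fused = max(edge_summaries)          # worst-case reading wins
    return ("ALERT" if fused > threshold else "OK", fused)

# Two wearable heart-rate sensors, fused first per device, then per patient.
edge_a = fuse_edge([88, 92, 90])
edge_b = fuse_edge([150, 155, 148])      # elevated readings
status, value = fuse_cloud([edge_a, edge_b], threshold=120)
```

Because each level only forwards summaries upward, a mission-critical decision can be taken as soon as the relevant summary arrives rather than after all raw data has been stored.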

Journal ArticleDOI
TL;DR: A survey on previous work in the area of contextual smartphone data analytics is made and a discussion of challenges and future directions for effectively learning context-aware rules from smartphone data, in order to build rule-based automated and intelligent systems are presented.
Abstract: Smartphones are considered one of the most essential and highly personal devices of individuals in today’s world. Due to the popularity of context-aware technology and recent developments in smartphones, these devices can collect and process raw contextual data about users’ surrounding environment and their corresponding behavioral activities with their phones. Thus, smartphone data analytics and building data-driven context-aware systems have recently gained wide attention from both academia and industry. In order to build intelligent context-aware applications on smartphones, effectively learning a set of context-aware rules from smartphone data is the key. This requires advanced data analytical techniques with high precision and intelligent decision-making strategies based on contexts. In comparison to traditional approaches, machine learning based techniques provide more effective and efficient results for smartphone data analytics and the corresponding context-aware rule learning. Thus, this article first surveys previous work in the area of contextual smartphone data analytics and then presents a discussion of challenges and future directions for effectively learning context-aware rules from smartphone data, in order to build rule-based automated and intelligent systems.
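As a toy illustration of learning rules from logged context data, the following Python sketch derives one rule per context by majority vote (the contexts, actions, and voting strategy are hypothetical; real systems use far richer features and learners):

```python
from collections import Counter, defaultdict

def learn_context_rules(log):
    """Derive one rule per context by majority vote over logged actions.

    `log` is a list of (context, action) pairs, e.g. gathered from a
    phone's usage history; the result maps each context to the action
    the user most often took in it.
    """
    by_context = defaultdict(Counter)
    for context, action in log:
        by_context[context][action] += 1
    return {ctx: counts.most_common(1)[0][0]
            for ctx, counts in by_context.items()}

log = [("meeting", "reject_call"), ("meeting", "reject_call"),
       ("meeting", "answer_call"), ("home", "answer_call")]
rules = learn_context_rules(log)
```

The learned mapping reads as context-aware rules such as "IF context is meeting THEN reject incoming calls".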

Journal ArticleDOI
TL;DR: The results posit that the credentials of the technology acceptance model together with task-technology fit contribute substantially to the enhancement of behavioral intentions to use the big data analytics system in healthcare, ultimately leading towards actual use.
Abstract: Big data analytics is gaining substantial attention due to its innovative contribution to decision making and strategic development across the healthcare field. Therefore, this study explored the adoption mechanism of big data analytics in healthcare organizations to inspect elements correlated to behavioral intention using the technology acceptance model and task-technology fit paradigm. Using a survey questionnaire, we analyzed 224 valid responses in AMOS v21 to test the hypotheses. Our results posit that the credentials of the technology acceptance model together with task-technology fit contribute substantially to the enhancement of behavioral intentions to use the big data analytics system in healthcare, ultimately leading towards actual use. Meanwhile, trust in and security of the information system also positively influenced the behavioral intention for use. Employee resistance to change is a key factor underlying failure of the innovative system in organizations and has been proven in this study to negatively moderate the relationship between intention to use and actual use of big data analytics in healthcare. Our results can be implemented by healthcare organizations to develop an understanding of the implementation of big data analytics and to promote psychological empowerment of employees to accept this innovative system.

Journal ArticleDOI
TL;DR: The experimental results show that the proposed system can count vehicles and classify their driving direction during weekday rush hours with mean absolute percentage error that is less than 10% and might be further used as a challenging test or additional training data.
Abstract: This study addresses the problem of traffic flow estimation based on data from a video surveillance camera. The target problem is formulated as counting and classifying vehicles by their driving direction. This subject area is in early development, and the focus of this work is only one of the busiest crossroads in the city of Chelyabinsk, Russia. To solve the posed problem, we employed the state-of-the-art Faster R-CNN two-stage detector together with the SORT tracker. A simple region-based heuristic algorithm was used to classify the vehicles’ movement direction. The baseline performance of the Faster R-CNN was enhanced by several modifications: focal loss, adaptive feature pooling, an additional mask branch, and anchor optimization. To train and evaluate the detector, we gathered 982 video frames with more than 60,000 objects presented in various conditions. The experimental results show that the proposed system can count vehicles and classify their driving direction during weekday rush hours with a mean absolute percentage error of less than 10%. The dataset presented here may be used by other researchers as a challenging test or as additional training data.
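A region-based direction heuristic of this kind can be sketched in a few lines of Python: look up which labelled region a track enters and which it exits (the region layout below is hypothetical, not the one used in the paper):

```python
def classify_direction(track, regions):
    """Classify a vehicle's driving direction from where its track enters
    and exits labelled regions around the crossroad.

    `track` is a list of (x, y) centroids produced by the tracker;
    `regions` maps a region name to its bounding box (x1, y1, x2, y2).
    """
    def region_of(point):
        x, y = point
        for name, (x1, y1, x2, y2) in regions.items():
            if x1 <= x <= x2 and y1 <= y <= y2:
                return name
        return None

    return (region_of(track[0]), region_of(track[-1]))

# Hypothetical regions at the top and bottom edges of the frame.
regions = {"north": (0, 0, 100, 20), "south": (0, 80, 100, 100)}
direction = classify_direction([(50, 10), (50, 50), (50, 90)], regions)
```

Each (entry, exit) pair then corresponds to one countable movement direction, e.g. north-to-south.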

Journal ArticleDOI
TL;DR: This is the first study to compare multiple data-level and algorithm-level deep learning methods across a range of class distributions and a unique analysis of the relationship between minority class size and optimal decision threshold and state-of-the-art performance on the given Medicare fraud detection task.
Abstract: Access to affordable healthcare is a nationwide concern that impacts a large majority of the United States population. Medicare is a Federal Government healthcare program that provides affordable health insurance to the elderly population and individuals with select disabilities. Unfortunately, there is a significant amount of fraud, waste, and abuse within the Medicare system that costs taxpayers billions of dollars and puts beneficiaries’ health and welfare at risk. Previous work has shown that publicly available Medicare claims data can be leveraged to construct machine learning models capable of automating fraud detection, but challenges associated with class-imbalanced big data hinder performance. With a minority class size of 0.03% and an opportunity to improve existing results, we use the Medicare fraud detection task to compare six deep learning methods designed to address the class imbalance problem. Data-level techniques used in this study include random over-sampling (ROS), random under-sampling (RUS), and a hybrid ROS–RUS. The algorithm-level techniques evaluated include a cost-sensitive loss function, the Focal Loss, and the Mean False Error Loss. A range of class ratios are tested by varying sample rates, and desirable class-wise performance is achieved by identifying optimal decision thresholds for each model. Neural networks are evaluated on a 20% holdout test set, and results are reported using the area under the receiver operating characteristic curve (AUC). Results show that ROS and ROS–RUS perform significantly better than baseline and algorithm-level methods with average AUC scores of 0.8505 and 0.8509, while ROS–RUS maximizes efficiency with a 4× speedup in training time. Plain RUS outperforms baseline methods with up to 30× improvements in training time, and all algorithm-level methods are found to produce more stable decision boundaries than baseline methods. Thresholding results suggest that the decision threshold always be optimized using a validation set, as we observe a strong linear relationship between the minority class size and the optimal threshold. To the best of our knowledge, this is the first study to compare multiple data-level and algorithm-level deep learning methods across a range of class distributions. Additional contributions include a unique analysis of the relationship between minority class size and optimal decision threshold and state-of-the-art performance on the given Medicare fraud detection task.
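As a rough illustration of the data-level techniques, the following Python sketch implements random over-sampling and random under-sampling on toy class lists (the class sizes and seed are arbitrary; the paper's pipeline applies these to Medicare claims feeding neural networks):

```python
import random

def random_over_sample(majority, minority, seed=0):
    """Random over-sampling (ROS): duplicate minority samples, drawn with
    replacement, until both classes are the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_under_sample(majority, minority, seed=0):
    """Random under-sampling (RUS): discard majority samples until both
    classes are the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

majority = list(range(1000))   # stands in for non-fraud claims
minority = list(range(3))      # stands in for the rare fraud class
maj_ros, min_ros = random_over_sample(majority, minority)
maj_rus, min_rus = random_under_sample(majority, minority)
```

A hybrid ROS–RUS, as evaluated in the paper, would under-sample the majority class part of the way and over-sample the minority class to meet it, trading training-set size against information loss.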

Journal ArticleDOI
TL;DR: A systematic review of studies on Big Data in relation to discrimination highlights the need for additional empirical research to assess how discriminatory practices are both voluntarily and accidentally emerging from the increasing use of data analytics in daily life.
Abstract: Big Data analytics such as credit scoring and predictive analytics offer numerous opportunities but also raise considerable concerns, among which the most pressing is the risk of discrimination. Although this issue has been examined before, a comprehensive study on this topic is still lacking. This literature review aims to identify studies on Big Data in relation to discrimination in order to (1) understand the causes and consequences of discrimination in data mining, (2) identify barriers to fair data-mining and (3) explore potential solutions to this problem. Six databases were systematically searched (between 2010 and 2017): PsychINDEX, SocIndex, PhilPapers, Cinhal, Pubmed and Web of Science. Most of the articles addressed the potential risk of discrimination of data mining technologies in numerous aspects of daily life (e.g. employment, marketing, credit scoring). The majority of the papers focused on instances of discrimination related to historically vulnerable categories, while others expressed the concern that scoring systems and predictive analytics might introduce new forms of discrimination in sectors like insurance and healthcare. Discriminatory consequences of data mining were mainly attributed to human bias and shortcomings of the law; therefore suggested solutions included comprehensive auditing strategies, implementation of data protection legislation and transparency enhancing strategies. Some publications also highlighted positive applications of Big Data technologies. This systematic review primarily highlights the need for additional empirical research to assess how discriminatory practices are both voluntarily and accidentally emerging from the increasing use of data analytics in our daily life. Moreover, since the majority of papers focused on the negative discriminative consequences of Big Data, more research is needed on the potential positive uses of Big Data with regards to social disparity.

Journal ArticleDOI
TL;DR: This paper examines the applicability of employing distributed stream processing frameworks at the data processing layer of Smart City and appraises the current state of their adoption and maturity among the IoT applications, namely Apache Storm, Apache Spark Streaming, and Apache Flink.
Abstract: The widespread growth of Big Data and the evolution of Internet of Things (IoT) technologies enable cities to obtain valuable intelligence from a large amount of real-time produced data. In a Smart City, various IoT devices continuously generate streams of data which need to be analyzed within a short period of time using Big Data techniques. Distributed stream processing frameworks (DSPFs) have the capacity to handle real-time data processing for Smart Cities. In this paper, we examine the applicability of employing distributed stream processing frameworks at the data processing layer of a Smart City and appraise the current state of their adoption and maturity among IoT applications. Our experiments focus on evaluating the performance of three DSPFs, namely Apache Storm, Apache Spark Streaming, and Apache Flink. According to our results, choosing a proper framework at the data analytics layer of a Smart City requires enough knowledge about the characteristics of target applications. Finally, we conclude that each of the frameworks studied here has its advantages and disadvantages. Our experiments show that Storm and Flink have very similar performance, while Spark Streaming has much higher latency but provides higher throughput.

Journal ArticleDOI
TL;DR: This survey paper provides a state-of-the-art overview of Cloud-centric Big Data placement together with data storage methodologies, and attempts to highlight the actual correlation between these two in terms of better supporting Big Data management.
Abstract: Currently, the data to be explored and exploited by computing systems increases at an exponential rate. The massive amount of data, or so-called “Big Data”, puts pressure on existing technologies to provide scalable, fast and efficient support. Recent applications and the current user support from multi-domain computing assisted in migrating from data-centric to knowledge-centric computing. However, it remains a challenge to optimally store and place or migrate such huge data sets across data centers (DCs). In particular, due to the frequent change of application and DC behaviour (i.e., resources or latencies), data access or usage patterns need to be analyzed as well. Primarily, the main objective is to find a better data storage location that improves the overall data placement cost as well as the application performance (such as throughput). In this survey paper, we provide a state-of-the-art overview of Cloud-centric Big Data placement together with data storage methodologies. It is an attempt to highlight the actual correlation between these two in terms of better supporting Big Data management. Our focus is on management aspects, which are seen under the prism of non-functional properties. In the end, the readers can appreciate the deep analysis of the respective technologies related to the management of Big Data and be guided towards their selection in the context of satisfying their non-functional application requirements. Furthermore, challenges are supplied, highlighting the current gaps in Big Data management and marking the way it needs to evolve in the near future.

Journal ArticleDOI
TL;DR: This study presents a machine learning approach to analyze the tweets to improve the customer’s experience and found that convolutional neural network (CNN) outperformed SVM and ANN models.
Abstract: Customer experience is one of the important concerns for airline industries. Twitter is one of the popular social media platforms where flight travelers share their feedback in the form of tweets. This study presents a machine learning approach to analyze the tweets to improve the customer experience. Features were extracted from the tweets using word embedding with the GloVe dictionary approach and the n-gram approach. Further, SVM (support vector machine) and several ANN (artificial neural network) architectures were considered to develop a classification model that maps each tweet into the positive or negative category. Additionally, a convolutional neural network (CNN) was developed to classify the tweets, and its results were compared with the most accurate model among the SVM and ANN architectures. It was found that the CNN outperformed the SVM and ANN models. In the end, association rule mining was performed on different categories of tweets to map their relationship with sentiment categories. Interesting associations were identified that can certainly help airline industries to improve their customer experience.
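As a small illustration of one of the feature extraction steps, the following Python sketch produces word n-grams from a tokenized tweet (the example tweet is invented):

```python
def ngrams(tokens, n):
    """Return the list of word n-grams for one tweet's token list.

    Each n-gram is a tuple of n consecutive tokens; counts of these
    tuples over a corpus form a simple feature vector per tweet.
    """
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tweet = "flight delayed again no update".split()
bigrams = ngrams(tweet, 2)
```

Bigrams such as ('flight', 'delayed') capture short phrases that single words miss, which is why n-gram features are often combined with word embeddings for sentiment classification.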

Journal ArticleDOI
TL;DR: It is argued that smart sustainable cities are becoming knowable, controllable, and tractable in new dynamic ways thanks to urban science, responsive to the data generated about their systems and domains by reacting to the analytical outcome of many aspects of urbanity.
Abstract: We are moving into an era where instrumentation, datafication, and computerization are routinely pervading the very fabric of cities, coupled with the interlinking, integration, and coordination of their systems and domains. As a result, vast troves of data are generated and exploited to operate, manage, organize, and regulate urban life, or a deluge of contextual and actionable data is produced, analyzed, and acted upon in real time in relation to various urban processes and practices. This data-driven approach to urbanism is increasingly becoming the mode of production for smart sustainable cities. In other words, a new era is presently unfolding wherein smart sustainable urbanism is increasingly becoming data-driven. However, topical studies tend to deal mostly with data-driven smart urbanism while barely exploring how this approach can improve and advance sustainable urbanism under what is labeled ‘data-driven smart sustainable cities.’ Having a threefold aim, this paper first examines how data-driven smart sustainable cities are being instrumented, datafied, and computerized so as to improve, advance, and maintain their contribution to the goals of sustainable development through more optimized processes and enhanced practices. Second, it highlights and substantiates the great potential of big data technology for enabling such contribution by identifying, synthesizing, distilling, and enumerating the key practical and analytical applications of this advanced technology in relation to multiple urban systems and domains with respect to operations, functions, services, designs, strategies, and policies. Third, it proposes, illustrates, and describes a novel architecture and typology of data-driven smart sustainable cities. The overall aim of this study suits thematic analysis as a research approach. 
I argue that smart sustainable cities are becoming knowable, controllable, and tractable in new dynamic ways thanks to urban science, responsive to the data generated about their systems and domains by reacting to the analytical outcome of many aspects of urbanity in terms of optimizing and enhancing operational functioning, management, planning, design, development, and governance in line with the goals of sustainable development. The proposed architecture, which can be replicated, tested, and evaluated in empirical research, will add additional depth to studies in the field. This study intervenes in the existing scholarly conversation by bringing new insights to and informing the ongoing debate on smart sustainable urbanism in light of big data science and analytics. This work serves to inform city stakeholders about the pivotal role of data-driven analytic thinking in smart sustainable urbanism practices, as well as draws special attention to the enormous benefits of the emerging paradigm of big data computing as to transforming the future form of such urbanism.

Journal ArticleDOI
TL;DR: The outcome of this research would help the insurance industries to assess the driving risk more accurately and to propose a solution to calculate the personalized premium based on the driving behavior with most importance towards prevention of risk.
Abstract: The emergence and growth of connected technologies and the adoption of big data are changing the face of all industries. In the insurance industry, Usage-Based Insurance (UBI) is the most popular use case of big data adoption. Initially, UBI started as a simple unitary Pay-As-You-Drive (PAYD) model, in which the classification of good and bad drivers remained an unresolved task. PAYD progressed towards the Pay-How-You-Drive (PHYD) model, in which the premium charged for personal auto insurance depends on post-trip analysis. The drawback of the PHYD model is its inability to provide proactive alerts to guide the driver during the trip. The PHYD model further progressed towards the Manage-How-You-Drive (MHYD) model, in which proactive engagement in the form of alerts is provided to drivers while they drive. The evolution of the PAYD, PHYD and MHYD models serves as the building blocks of UBI and facilitates the insurance industry in bridging the gap between insurer and customer with the introduction of the MHYD model. An increasing number of insurers are starting to launch PHYD or MHYD models all over the world, and widespread customer adoption is seen to improve driver safety by monitoring driving behavior. Consequently, the data flow between an insurer and their customers is increasing exponentially, which makes big data adoption a foundational brick in the technology landscape of insurers. The focus of this paper is to perform a detailed survey of the categories of MHYD. The survey results in the need to address the aggressive driving behavior and road rage incidents of drivers during short-term and long-term driving. The exhaustive survey is also used to propose a solution that finds the risk posed by aggressive driving and road rage incidents by considering the behavioral and emotional factors of a driver.
The outcome of this research would help the insurance industries to assess the driving risk more accurately and to propose a solution to calculate the personalized premium based on the driving behavior with most importance towards prevention of risk.

Journal ArticleDOI
TL;DR: A systematic and structured literature review of the feature-selection techniques used in studies related to big genomic data analytics and how it contributes to the research community is presented.
Abstract: In the era of accelerating growth of genomic data, feature-selection techniques are believed to become a game changer that can help substantially reduce the complexity of the data, thus making it easier to analyze and translate it into useful information. It is expected that within the next decade, researchers will head towards analyzing the genomes of all living creatures, making genomics the main generator of data. With the absence of a thorough investigation of the field, it is almost impossible for researchers to get an idea of how their work relates to existing studies as well as how it contributes to the research community. In this paper, we present a systematic and structured literature review of the feature-selection techniques used in studies related to big genomic data analytics.
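As a toy example of a filter-style feature-selection technique of the kind such studies survey, the following Python sketch keeps only features whose variance across samples exceeds a threshold (the data and threshold are invented; genomic pipelines use far more sophisticated criteria):

```python
def variance_filter(samples, threshold):
    """Filter-style feature selection: return the indices of features
    whose variance across samples exceeds the threshold.

    `samples` is a list of equal-length feature vectors (e.g. expression
    levels per gene); near-constant features carry little signal and can
    be dropped to reduce the dimensionality of the data.
    """
    n = len(samples)
    kept = []
    for j in range(len(samples[0])):
        col = [s[j] for s in samples]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var > threshold:
            kept.append(j)
    return kept

samples = [[1.0, 5.0, 0.1],
           [1.0, 9.0, 0.1],
           [1.0, 7.0, 0.1]]
selected = variance_filter(samples, threshold=0.5)   # only feature 1 varies
```

Filter methods like this are cheap and model-agnostic; wrapper and embedded methods, which the reviewed literature also covers, instead evaluate feature subsets against a learning algorithm.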

Journal ArticleDOI
TL;DR: A new architecture for real-time health status prediction and analytics system using big data technologies and measures the performance of Spark DT against traditional machine learning tools including Weka to show the effectiveness of the proposed architecture.
Abstract: A number of technologies enabled by the Internet of Things (IoT) have been used for the prevention of various chronic diseases; continuous and real-time tracking systems are a particularly important one. Wearable medical devices with sensors, health clouds and mobile applications continuously generate a huge amount of data, often called streaming big data. Due to the high speed of data generation, it is difficult to collect, process and analyze such massive data in real time using traditional methods, which are limited and time-consuming, in order to perform real-time actions in case of emergencies and to extract hidden value. Therefore, there is a significant need for real-time big data stream processing to ensure an effective and scalable solution. In order to overcome this issue, this work proposes a new architecture for a real-time health status prediction and analytics system using big data technologies. The system focuses on applying a distributed machine learning model to streaming health data events ingested into Spark Streaming through Kafka topics. Firstly, we transform the standard decision tree (DT) (C4.5) algorithm into a parallel, distributed, scalable and fast DT using Spark instead of Hadoop MapReduce, which has become limited for real-time computing. Secondly, this model is applied to streaming data coming from distributed sources of various diseases to predict health status. Based on several input attributes, the system predicts health status, sends an alert message to care providers and stores the details in a distributed database to perform health data analytics and stream reporting. We measure the performance of Spark DT against traditional machine learning tools including Weka. Finally, performance evaluation parameters such as throughput and execution time are calculated to show the effectiveness of the proposed architecture.
The experimental results show that the proposed system is able to effectively process and predict massive amounts of real-time medical data, enabled by IoT, from distributed sources covering various diseases.
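The predict-and-alert loop can be illustrated with a small Python sketch (the vital-sign thresholds, field names, and the simple rule standing in for the trained decision tree are all hypothetical; the actual system runs a distributed Spark model over Kafka streams):

```python
def predict_status(event):
    """Stand-in for the trained decision tree: map one health event to a
    status label using a couple of hypothetical vital-sign thresholds."""
    if event["heart_rate"] > 120 or event["spo2"] < 90:
        return "critical"
    return "normal"

def process_stream(events):
    """Consume events one by one, as a Kafka-to-Spark pipeline would,
    predict a status per event, and collect alerts for care providers."""
    alerts = []
    for event in events:
        if predict_status(event) == "critical":
            alerts.append(event["patient_id"])
    return alerts

stream = [{"patient_id": "p1", "heart_rate": 80, "spo2": 97},
          {"patient_id": "p2", "heart_rate": 135, "spo2": 95},
          {"patient_id": "p3", "heart_rate": 75, "spo2": 85}]
alerts = process_stream(stream)
```

In the proposed architecture the alerting step would also persist each event to a distributed database for later analytics and stream reporting.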