Book Chapter

Big Data Analytics and Preprocessing

01 Jan 2021-pp 25-43
TL;DR: In this article, the authors discuss the importance of preprocessing big data in terms of analysis time, utilized-resource percentage, storage, the efficiency of the analyzed data, and the information gained as output.
Abstract: Big data is a trending term in industry and academia that refers to the huge flood of collected data, which is very complex in nature. As a term, big data describes many data-related concepts with both technological and cultural meanings. In the big data community, big data analytics is used to discover hidden patterns and values that give an accurate representation of the data. Big data preprocessing is an important step in the analysis process: it is key to the success of the analysis in terms of analysis time, utilized-resource percentage, storage, the efficiency of the analyzed data, and the information gained as output. Preprocessing data involves dealing with concepts such as concept drift and data streams, which are considered significant challenges.
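The chapter itself sits behind the DOI, so as a hedged illustration of the stream preprocessing and concept-drift handling the abstract mentions, the Python sketch below cleans incoming records and flags drift when a sliding window's mean departs from a reference window. The window size, threshold, and the `clean`/`detect_drift` helpers are illustrative assumptions, not the chapter's method.

```python
import statistics
from collections import deque

def clean(record):
    """Record-level preprocessing: coerce to float, signal dirty records with None."""
    try:
        return float(record)
    except (TypeError, ValueError):
        return None  # caller skips unparseable records

def detect_drift(stream, window=100, threshold=3.0):
    """Yield a drift signal when the current window's mean moves more than
    `threshold` standard deviations away from the reference window's mean.
    (Window size and threshold are illustrative, not from the chapter.)"""
    reference, current = deque(maxlen=window), deque(maxlen=window)
    for raw in stream:
        value = clean(raw)
        if value is None:
            continue  # preprocessing step: discard dirty records
        if len(reference) < window:
            reference.append(value)  # still filling the reference window
            continue
        current.append(value)
        if len(current) == window:
            ref_mean = statistics.mean(reference)
            ref_sd = statistics.stdev(reference) or 1e-9  # guard zero spread
            if abs(statistics.mean(current) - ref_mean) > threshold * ref_sd:
                yield "drift"  # downstream analytics should re-train
                reference, current = current, deque(maxlen=window)
```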
Citations
Journal Article
TL;DR: In this article, Liu et al. propose a new concept suitable for data-driven robust optimization and design two new methods for constructing data-driven uncertainty sets: partial least squares (PLS) and kernel principal component analysis (KPCA) are used to capture the underlying uncertainties and correlations of the uncertain data, and the projection of the uncertain data on each principal component is obtained.
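The paper's exact construction is not reproduced here; the sketch below only illustrates the general KPCA step the TL;DR describes: projecting uncertain data onto kernel principal components and deriving simple per-component bounds. The synthetic data, kernel choice, component count, and quantile-box uncertainty set are all assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Hypothetical uncertain data: 200 samples of a 5-dimensional uncertain parameter.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated samples

# Project the uncertain data onto kernel principal components; the rbf kernel
# and the number of components are illustrative choices, not the paper's.
kpca = KernelPCA(n_components=3, kernel="rbf", gamma=0.1)
scores = kpca.fit_transform(data)  # projection of each sample on each component

# One simple data-driven uncertainty set: an axis-aligned box given by
# per-component quantile bounds in the projected space (the paper's sets may differ).
lower = np.quantile(scores, 0.05, axis=0)
upper = np.quantile(scores, 0.95, axis=0)
print("box uncertainty set in KPCA space:", list(zip(lower, upper)))
```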

7 citations

Journal Article
TL;DR: Wang et al. developed a novel integrated battery data cleaning framework that systematically solves data quality problems in cloud-based vehicle battery monitoring and management, which can further boost the practical application of vehicle big data platforms and the Internet of Vehicles.
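The framework's actual cleaning rules are not given in the TL;DR; as a minimal, hedged sketch of the kind of battery-data cleaning it refers to, the snippet below deduplicates a time series, nulls out physically implausible cell voltages, and interpolates short gaps. The column names, voltage range, and gap limit are assumptions.

```python
import pandas as pd

def clean_battery_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Generic cleaning for a cell-voltage time series (hypothetical columns
    'timestamp' and 'voltage'): deduplicate, enforce a plausible physical
    range, and interpolate short sensor dropouts."""
    df = df.drop_duplicates(subset="timestamp").sort_values("timestamp")
    # Voltages outside an assumed lithium-ion range become missing values.
    df.loc[~df["voltage"].between(2.0, 4.5), "voltage"] = None
    # Fill short dropouts only (here, up to 3 consecutive samples).
    df["voltage"] = df["voltage"].interpolate(limit=3)
    return df.dropna(subset=["voltage"])
```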

2 citations

Journal Article
TL;DR: In this article, the authors incorporate local spatial information into the fuzzy k-plane clustering (FkPC) method to handle noise present in an image, and show that the proposed FkPC_S method outperforms 10 related methods in the presence of noise.
Abstract: Human brain MRI images are complex, and the matter present in the brain exhibits non-spherical shapes. There exists uncertainty in the overlapping structure of brain tissue, i.e. a lack of distinctness in the class definition. Soft clustering methods can efficiently handle this uncertainty, and plane-based clustering methods are found to be more efficient for non-spherical data. The fuzzy k-plane clustering (FkPC) method is a soft plane-based clustering algorithm that can handle the uncertainty in medical images, but its performance degrades in the presence of noise. In this work, we incorporate local spatial information into the FkPC clustering method to handle the noise present in the image. The spatial regularization term included in the proposed FkPC_S method refines the membership value of a noisy pixel using information from its immediate neighbouring pixels. To show the effectiveness of the proposed FkPC_S method, extensive experiments are performed on one synthetic image and two publicly available human brain MRI datasets. The performance of the proposed method is compared with 10 related methods in terms of average segmentation accuracy and Dice score. The experimental results show that the proposed FkPC_S method is superior to the 10 related methods in the presence of noise. A statistically significant difference and the superior performance of the proposed method relative to the other methods are also established using the Friedman test.
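The paper's regularization term enters the FkPC objective itself; the sketch below is only a simplified stand-in for the core idea the abstract describes: refining a noisy pixel's fuzzy membership using its immediate neighbours. The blending weight `alpha`, the 4-connected neighbourhood, and the post-hoc renormalization are assumptions, not the authors' formulation.

```python
import numpy as np

def spatially_refine(memberships: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend each pixel's fuzzy membership with the mean membership of its
    4-connected neighbours, then renormalize so memberships sum to 1.
    `memberships` has shape (H, W, k) for k clusters/planes, values in [0, 1]."""
    padded = np.pad(memberships, ((1, 1), (1, 1), (0, 0)), mode="edge")
    neighbour_mean = (
        padded[:-2, 1:-1] + padded[2:, 1:-1] +   # up and down neighbours
        padded[1:-1, :-2] + padded[1:-1, 2:]     # left and right neighbours
    ) / 4.0
    refined = (1 - alpha) * memberships + alpha * neighbour_mean
    return refined / refined.sum(axis=2, keepdims=True)
```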

1 citation

Book Chapter
01 Jan 2023
TL;DR: In this paper, the authors describe how the gathering and use of a significant amount of information is required in every aspect of a Head Start program, including content and management, beginning with the information needed for children to enroll in the program.
Abstract: Data, which is shorthand for “information”, has always been gathered, reviewed, and/or analyzed as part of the running of the Head Start program. For children to enroll in the program, numerous pieces of information are needed. Information from screenings and any subsequent services is included in the delivery of health and dental services. The gathering and use of a significant amount of information are required in every aspect of a Head Start program, including content and management.
Journal Article
TL;DR: Wang et al. proposed a big data research architecture and an analysis model for grain storage security; as an example, they illustrate the supervision of the grain loss problem in storage security.
Abstract: Grain security guarantees national security. China has many widely distributed grain depots to supervise grain storage security. However, this has led to a lack of regulatory capacity and manpower. Amid the development of reserve-level information technology, big data supervision of grain storage security should be improved. This study proposes a big data research architecture and an analysis model for grain storage security; as an example, it illustrates the supervision of the grain loss problem in storage security. A statistical analysis model and a prediction- and clustering-based model for grain loss supervision were used to mine abnormal data. A combination of feature extraction and feature selection methods was chosen for dimensionality reduction. A comparative analysis showed that the nonlinear prediction models performed better on the grain loss data set, with R² values of 87.21%, 87.83%, 91.97%, and 89.40% on the test set for Gradient Boosting Regressor (GBR), Random Forest, Decision Tree, and XGBoost regression, respectively. As an example, nineteen abnormal data points were filtered out by GBR combined with residuals. The deep learning model performed best on mean absolute error, with an R² of 85.14% on the test set, but identified only one abnormal data point; this is contrary to the original intention of finding as many anomalies as possible for supervisory purposes. Five classes were generated using principal component analysis (PCA) dimensionality reduction combined with Density-Based Spatial Clustering of Applications with Noise (DBSCAN), with 11 anomalous data points screened out after adding the amount of normalized grain loss. Based on the existing grain information system, this paper provides a supervision model for grain storage that can help mine abnormal data. Unlike the current post-event supervision model, this study proposes a pre-event supervision model. This study provides a framework of ideas for subsequent scholarly research; the addition of big data technology will help improve supervisory capacity and efficiency in the field of grain supervision.
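As a hedged sketch of the residual-based screening the abstract describes (GBR combined with residuals), the snippet below fits a gradient boosting regressor and flags records whose standardized residual is large. The z-score threshold and the use of in-sample residuals are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def residual_anomalies(X: np.ndarray, y: np.ndarray, z_threshold: float = 3.0):
    """Fit a GBR, then flag records whose residual exceeds `z_threshold`
    standard deviations, mirroring the idea of residual-based screening."""
    model = GradientBoostingRegressor(random_state=0).fit(X, y)
    residuals = y - model.predict(X)
    z = (residuals - residuals.mean()) / residuals.std()
    return np.where(np.abs(z) > z_threshold)[0]  # indices of abnormal records
```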
References
Journal Article
TL;DR: The need to develop appropriate and efficient analytical methods to leverage massive volumes of heterogeneous data in unstructured text, audio, and video formats is highlighted and the need to devise new tools for predictive analytics for structured big data is reinforced.

2,962 citations

Journal Article
07 Feb 2014
TL;DR: Big data analytics in healthcare is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs; its potential is great, but challenges remain to be overcome.
Abstract: Objective: To describe the promise and potential of big data analytics in healthcare. Methods: The paper describes the nascent field of big data analytics in healthcare, discusses the benefits, outlines an architectural framework and methodology, describes examples reported in the literature, briefly discusses the challenges, and offers conclusions. Results: The paper provides a broad overview of big data analytics for healthcare researchers and practitioners. Conclusions: Big data analytics in healthcare is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs. Its potential is great; however, there remain challenges to overcome.

2,272 citations

Journal Article
TL;DR: The definition, characteristics, and classification of big data along with some discussions on cloud computing are introduced, and research challenges are investigated, with focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance.

2,141 citations

Journal Article
TL;DR: In this article, the authors present a state-of-the-art review offering a holistic view of big data (BD) challenges and big data analytics (BDA) methods theorized, proposed, or employed by organizations, to help others understand this landscape and make robust investment decisions.

1,267 citations

Journal Article
TL;DR: This paper presents a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics, and presents the prevalent Hadoop framework for addressing big data challenges.
Abstract: Recent technological advancements have led to a deluge of data from distinctive domains (e.g., health care and scientific sensors, user-generated data, Internet and financial companies, and supply chain systems) over the past two decades. The term big data was coined to capture the meaning of this emerging trend. In addition to its sheer volume, big data also exhibits other unique characteristics as compared with traditional data. For instance, big data is commonly unstructured and requires more real-time analysis. This development calls for new system architectures for data acquisition, transmission, storage, and large-scale data processing mechanisms. In this paper, we present a literature survey and system tutorial for big data analytics platforms, aiming to provide an overall picture for nonexpert readers and instill a do-it-yourself spirit for advanced audiences to customize their own big-data solutions. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of numerous approaches and mechanisms from research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data challenges. Finally, we outline several evaluation benchmarks and potential research directions for big data systems.
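As an illustration of the Hadoop-style analytics module the survey presents, the sketch below is the canonical word count written as a Hadoop Streaming-style mapper and reducer; here the two run in a single process as a stand-in for the distributed shuffle, whereas a real deployment would launch them as separate scripts via the Hadoop Streaming jar (deployment paths are not assumed here).

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum counts per word after grouping by key."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local, single-process stand-in for the cluster's sort-and-shuffle step.
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```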

1,002 citations

Trending Questions (1)
Why preprocessing and filtering data is important before visualizing it ?

Preprocessing and filtering data is important before visualizing it to ensure accuracy, efficiency, and to handle challenges like concept drift and data streams.
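As a minimal sketch of that point, the snippet below filters missing values and an out-of-range artefact before plotting, so the chart's scale reflects the real signal rather than sensor noise. The column name and the 999.0 error code are hypothetical.

```python
import pandas as pd

# Hypothetical raw readings: one missing value and one sensor error code (999.0).
raw = pd.DataFrame({"sensor": [21.1, None, 20.8, 999.0, 21.3]})

# Filter before visualizing: drop missing values and the out-of-range artefact,
# otherwise the single 999.0 reading would flatten the real signal on the chart.
clean = raw.dropna().query("sensor < 100")
clean.plot(y="sensor", kind="line")  # requires matplotlib to be installed
```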