Showing papers on "Data warehouse published in 2015"

PDF

Open Access

Journal Article•DOI•

A big data approach for logistics trajectory discovery from RFID-enabled production data

[...]

Ray Y. Zhong¹, Ray Y. Zhong², George Q. Huang¹, Shulin Lan¹, Qingyun Dai, Xu Chen², Tongda Zhang - Show less +3 more•Institutions (2)

University of Hong Kong¹, Shenzhen University²

01 Jul 2015-International Journal of Production Economics

TL;DR: A holistic Big Data approach to excavate frequent trajectory from massive RFID-enabled shopfloor logistics data with several innovations highlighted is proposed, which are able to guide end-users to carry out associated decisions.

...read moreread less

367 citations

Journal Article•DOI•

Methodologies for Cross-Domain Data Fusion: An Overview

[...]

Yu Zheng¹•Institutions (1)

Microsoft¹

01 Mar 2015-IEEE Transactions on Big Data

TL;DR: High-level principles of each category of methods are introduced, and examples in which these techniques are used to handle real big data problems are given, to help a wide range of communities find a solution for data fusion in big data projects.

...read moreread less

Abstract: Traditional data mining usually deals with data from a single domain. In the big data era, we face a diversity of datasets from different sources in different domains. These datasets consist of multiple modalities, each of which has a different representation, distribution, scale, and density. How to unlock the power of knowledge from multiple disparate (but potentially connected) datasets is paramount in big data research, essentially distinguishing big data from traditional data mining tasks. This calls for advanced techniques that can fuse knowledge from various datasets organically in a machine learning and data mining task. This paper summarizes the data fusion methodologies, classifying them into three categories: stage-based, feature level-based, and semantic meaning-based data fusion methods. The last category of data fusion methods is further divided into four groups: multi-view learning-based, similarity-based, probabilistic dependency-based, and transfer learning-based methods. These methods focus on knowledge fusion rather than schema mapping and data merging, significantly distinguishing between cross-domain data fusion and traditional data fusion studied in the database community. This paper does not only introduce high-level principles of each category of methods, but also give examples in which these techniques are used to handle real big data problems. In addition, this paper positions existing works in a framework, exploring the relationship and difference between different data fusion methods. This paper will help a wide range of communities find a solution for data fusion in big data projects.

...read moreread less

356 citations

Journal Article•DOI•

A Brief Introduction on Big Data 5Vs Characteristics and Hadoop Technology

[...]

Ishwarappa¹, J. Anuradha²•Institutions (2)

College of Engineering, Pune¹, VIT University²

01 Jan 2015-Procedia Computer Science

TL;DR: This paper presents the 5Vs characteristics of big data and the technique and technology used to handle big data in a wide variety of scalable database tools and techniques.

...read moreread less

253 citations

Proceedings Article•DOI•

Lambda architecture for cost-effective batch and speed big data processing

[...]

Mariam Kiran¹, Peter Murphy, Inder Monga, Jon Dugan, Sartaj Singh Baveja² - Show less +1 more•Institutions (2)

University of Bradford¹, Netaji Subhas Institute of Technology²

29 Oct 2015

TL;DR: An implementation of the lambda architecture design pattern is presented to construct a data-handling backend on Amazon EC2, providing high throughput, dense and intense data demand delivered as services, minimizing the cost of the network maintenance.

...read moreread less

Abstract: Sensor and smart phone technologies present opportunities for data explosion, streaming and collecting from heterogeneous devices every second. Analyzing these large datasets can unlock multiple behaviors previously unknown, and help optimize approaches to city wide applications or societal use cases. However, collecting and handling of these massive datasets presents challenges in how to perform optimized online data analysis ‘on-the-fly’, as current approaches are often limited by capability, expense and resources. This presents a need for developing new methods for data management particularly using public clouds to minimize cost, network resources and on-demand availability. This paper presents an implementation of the lambda architecture design pattern to construct a data-handling backend on Amazon EC2, providing high throughput, dense and intense data demand delivered as services, minimizing the cost of the network maintenance. This paper combines ideas from database management, cost models, query management and cloud computing to present a general architecture that could be applied in any given scenario where affordable online data processing of Big Datasets is needed. The results are presented with a case study of processing router sensor data on the current ESnet network data as a working example of the approach. The results showcase a reduction in cost and argue benefits for performing online analysis and anomaly detection for sensor data.

...read moreread less

178 citations

Journal Article•DOI•

Data Science in Statistics Curricula: Preparing Students to “Think with Data”

[...]

Johanna Hardin¹, Roger Hoerl², Nicholas J. Horton³, Deborah Nolan⁴, Ben Baumer⁵, O. Hall-Holt⁶, Paul Murrell⁷, Roger D. Peng⁸, P. Roback⁶, D. Temple Lang⁹, Mark Daniel Ward¹⁰ - Show less +7 more•Institutions (10)

Pomona College¹, Union College², Amherst College³, University of California, Berkeley⁴, Smith College⁵, St. Olaf College⁶, University of Auckland⁷, Johns Hopkins University⁸, University of California, Davis⁹, Purdue University¹⁰

29 Dec 2015-The American Statistician

TL;DR: In this article, the importance of data science proficiency and resources for instructors to implement data science in their own statistics curricula are discussed. But these data science topics have not traditionally been a major component of undergraduate programs in statistics.

...read moreread less

Abstract: A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to use databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes. The goal of this article is to motivate the importance of data science proficiency and to provide examples and resources for instructors to implement data science in their own statistics curricula. We provide case studies from seven institutions. These varied approaches to teaching data science demonstrate curricular innovations to address new needs. Also included here are examples of assignments designed for courses that foster engagement of und...

...read moreread less

151 citations

Posted Content•

The Anatomy of Big Data Computing

[...]

Raghavendra Kune¹, Pramod Kumar Konugurthi¹, Arun Agarwal², Raghavendra Rao Chillarige², Rajkumar Buyya³ - Show less +1 more•Institutions (3)

Department of Space¹, University UCINF², University of Melbourne³

04 Sep 2015-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: The evolution of big data computing, differences between traditional data warehousing and big data, taxonomy ofbig data computing and underpinning technologies, integrated platform of bigdata and clouds known as big data clouds, layered architecture and components of bigData cloud, and finally open‐technical challenges and future directions are discussed.

...read moreread less

Abstract: Advances in information technology and its widespread growth in several areas of business, engineering, medical and scientific studies are resulting in information/data explosion. Knowledge discovery and decision making from such rapidly growing voluminous data is a challenging task in terms of data organization and processing, which is an emerging trend known as Big Data Computing; a new paradigm which combines large scale compute, new data intensive techniques and mathematical models to build data analytics. Big Data computing demands a huge storage and computing for data curation and processing that could be delivered from on-premise or clouds infrastructures. This paper discusses the evolution of Big Data computing, differences between traditional data warehousing and Big Data, taxonomy of Big Data computing and underpinning technologies, integrated platform of Big Data and Clouds known as Big Data Clouds, layered architecture and components of Big Data Cloud and finally discusses open technical challenges and future directions.

...read moreread less

148 citations

Journal Article•DOI•

Using Semantic Web Technologies for Exploratory OLAP: A Survey

[...]

Alberto Abelló¹, Oscar Romero¹, Torben Bach Pedersen², Rafael Berlanga³, Victoria Nebot³, María José Aramburu³, Alkis Simitsis⁴ - Show less +3 more•Institutions (4)

Polytechnic University of Catalonia¹, Aalborg University², James I University³, Hewlett-Packard⁴

01 Feb 2015-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The convergence of some of the most influential technologies in the last few years, namely data warehousing (DW), on-line analytical processing (OLAP), and the Semantic Web (SW) is described, including SW support for intelligent MD querying, using SW technologies for providing context to data warehouses, and scalability issues.

...read moreread less

Abstract: This paper describes the convergence of some of the most influential technologies in the last few years, namely data warehousing (DW), on-line analytical processing (OLAP), and the Semantic Web (SW). OLAP is used by enterprises to derive important business-critical knowledge from data inside the company. However, the most interesting OLAP queries can no longer be answered on internal data alone, external data must also be discovered (most often on the web), acquired, integrated, and (analytically) queried, resulting in a new type of OLAP, exploratory OLAP . When using external data, an important issue is knowing the precise semantics of the data. Here, SW technologies come to the rescue, as they allow semantics (ranging from very simple to very complex) to be specified for web-available resources. SW technologies do not only support capturing the “passive” semantics, but also support active inference and reasoning on the data. The paper first presents a characterization of DW/OLAP environments, followed by an introduction to the relevant SW foundation concepts. Then, it describes the relationship of multidimensional (MD) models and SW technologies, including the relationship between MD models and SW formalisms. Next, the paper goes on to survey the use of SW technologies for data modeling and data provisioning, including semantic data annotation and semantic-aware extract, transform, and load (ETL) processes. Finally, all the findings are discussed and a number of directions for future research are outlined, including SW support for intelligent MD querying, using SW technologies for providing context to data warehouses, and scalability issues.

...read moreread less

144 citations

Proceedings Article•DOI•

Amazon Redshift and the Case for Simpler Data Warehouses

[...]

Anurag Windlass Gupta¹, Deepak Agarwal¹, Derek Tan¹, Jakub Kulesza¹, Pathak Rahul¹, Stefano Stefani¹, Vidhya Srinivasan¹ - Show less +3 more•Institutions (1)

Amazon.com¹

27 May 2015

TL;DR: An oft-overlooked differentiating characteristic of Amazon Redshift is discussed -- simplicity, designed to bring data warehousing to a mass market by making it easy to buy, easy to tune and easy to manage while also being fast and cost-effective.

...read moreread less

Abstract: Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. Since launching in February 2013, it has been Amazon Web Service's (AWS) fastest growing service, with many thousands of customers and many petabytes of data under management. Amazon Redshift's pace of adoption has been a surprise to many participants in the data warehousing community. While Amazon Redshift was priced disruptively at launch, available for as little as $1000/TB/year, there are many open-source data warehousing technologies and many commercial data warehousing engines that provide free editions for development or under some usage limit. While Amazon Redshift provides a modern MPP, columnar, scale-out architecture, so too do many other data warehousing engines. And, while Amazon Redshift is available in the AWS cloud, one can build data warehouses using EC2 instances and the database engine of one's choice with either local or network-attached storage. In this paper, we discuss an oft-overlooked differentiating characteristic of Amazon Redshift -- simplicity. Our goal with Amazon Redshift was not to compete with other data warehousing engines, but to compete with non-consumption. We believe the vast majority of data is collected but not analyzed. We believe, while most database vendors target larger enterprises, there is little correlation in today's economy between data set size and company size. And, we believe the models used to procure and consume analytics technology need to support experimentation and evaluation. Amazon Redshift was designed to bring data warehousing to a mass market by making it easy to buy, easy to tune and easy to manage while also being fast and cost-effective.

...read moreread less

131 citations

Book•

Trends in Cleaning Relational Data: Consistency and Deduplication

[...]

Ihab F. Ilyas¹, Xu Chu¹•Institutions (1)

University of Waterloo¹

30 Oct 2015

TL;DR: A taxonomy of current anomaly detection techniques, including error types, the automation of the detection process, and error propagation is proposed, and is concluded by highlighting current trends in "big data" cleaning.

...read moreread less

Abstract: Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. Poor data across businesses and the government cost the U.S. economy $3.1 trillion a year, according to a reportby InsightSquared in 2012. To detect data errors, data quality rules or integrity constraints ICs have been proposed as a declarative way to describe legal or correct data instances. Any subset of data that does not conform to the defined rules is considered erroneous, which is also referred to as a violation. Various kinds of data repairing techniques with different objectives have been introduced, where algorithms are used to detect subsets of the data that violate the declared integrity constraints, and even to suggest updates to the database such that the new database instance conforms with these constraints. While some of these algorithms aim to minimally change the database, others involve human experts or knowledge bases to verify the repairs suggested by the automatic repeating algorithms. In this paper, we discuss the main facets and directions in designing error detection and repairing techniques. We propose a taxonomy of current anomaly detection techniques, including error types, the automation of the detection process, and error propagation. We also propose a taxonomy of current data repairing techniques, including the repair target, the automation of the repair process, and the update model. We conclude by highlighting current trends in "big data" cleaning.

...read moreread less

128 citations

Journal Article•DOI•

Ten Simple Rules for Creating a Good Data Management Plan

[...]

William K. Michener¹•Institutions (1)

University of New Mexico¹

22 Oct 2015-PLOS Computational Biology

TL;DR: A data management plan is a document that describes how you will treat your data during a project and what happens with the data after the project ends, and is used in part to evaluate a project’s merit.

...read moreread less

Abstract: Research papers and data products are key outcomes of the science enterprise. Governmental, nongovernmental, and private foundation sponsors of research are increasingly recognizing the value of research data. As a result, most funders now require that sufficiently detailed data management plans be submitted as part of a research proposal. A data management plan (DMP) is a document that describes how you will treat your data during a project and what happens with the data after the project ends. Such plans typically cover all or portions of the data life cycle—from data discovery, collection, and organization (e.g., spreadsheets, databases), through quality assurance/quality control, documentation (e.g., data types, laboratory methods) and use of the data, to data preservation and sharing with others (e.g., data policies and dissemination approaches). Fig 1 illustrates the relationship between hypothetical research and data life cycles and highlights the links to the rules presented in this paper. The DMP undergoes peer review and is used in part to evaluate a project’s merit. Plans also document the data management activities associated with funded projects and may be revisited during performance reviews. Open in a separate window Fig 1 Relationship of the research life cycle (A) to the data life cycle (B); note: highlighted circles refer to the rules that are most closely linked to the steps of the data life cycle. As part of the research life cycle (A), many researchers (1) test ideas and hypotheses by (2) acquiring data that are (3) incorporated into various analyses and visualizations, leading to interpretations that are then (4) published in the literature and disseminated via other mechanisms (e.g., conference presentations, blogs, tweets), and that often lead back to (1) new ideas and hypotheses. During the data life cycle (B), researchers typically (1) develop a plan for how data will be managed during and after the project; (2) discover and acquire existing data and (3) collect and organize new data; (4) assure the quality of the data; (5) describe the data (i.e., ascribe metadata); (6) use the data in analyses, models, visualizations, etc.; and (7) preserve and (8) share the data with others (e.g., researchers, students, decision makers), possibly leading to new ideas and hypotheses.

...read moreread less

128 citations

Journal Article•DOI•

MouseMine: a new data warehouse for MGI.

[...]

Howie Motenko, Steven B. Neuhauser, M O'Keefe¹, Joel E. Richardson•Institutions (1)

Mitre Corporation¹

20 Jun 2015-Mammalian Genome

TL;DR: A general introduction to MouseMine is presented, examples of its use are presented, and the potential for further integration into the MGI interface is discussed.

...read moreread less

Abstract: MouseMine (www.mousemine.org) is a new data warehouse for accessing mouse data from Mouse Genome Informatics (MGI). Based on the InterMine software framework, MouseMine supports powerful query, reporting, and analysis capabilities, the ability to save and combine results from different queries, easy integration into larger workflows, and a comprehensive Web Services layer. Through MouseMine, users can access a significant portion of MGI data in new and useful ways. Importantly, MouseMine is also a member of a growing community of online data resources based on InterMine, including those established by other model organism databases. Adopting common interfaces and collaborating on data representation standards are critical to fostering cross-species data analysis. This paper presents a general introduction to MouseMine, presents examples of its use, and discusses the potential for further integration into the MGI interface.

...read moreread less

Book•

Big Data Fundamentals: Concepts, Drivers & Techniques

[...]

Thomas Erl, Wajid Khattak, Paul A. Buhler

29 Dec 2015

TL;DR: The authors begin by explaining how Big Data can propel an organization forward by solving a spectrum of previously intractable business problems and show how a Big Data solution environment can be built and integrated to offer competitive advantages.

...read moreread less

Abstract: This text should be required reading for everyone in contemporary business. --Peter Woodhull, CEO, Modus21 The one book that clearly describes and links Big Data concepts to business utility. --Dr. Christopher Starr, PhD Simply, this is the best Big Data book on the market! --Sam Rostam, Cascadian IT Group ...one of the most contemporary approaches Ive seen to Big Data fundamentals... --Joshua M. Davis, PhD The Definitive Plain-English Guide to Big Data for Business and Technology Professionals Big Data Fundamentals provides a pragmatic, no-nonsense introduction to Big Data. Best-selling IT author Thomas Erl and his team clearly explain key Big Data concepts, theory and terminology, as well as fundamental technologies and techniques. All coverage is supported with case study examples and numerous simple diagrams. The authors begin by explaining how Big Data can propel an organization forward by solving a spectrum of previously intractable business problems. Next, they demystify key analysis techniques and technologies and show how a Big Data solution environment can be built and integrated to offer competitive advantages. Discovering Big Datas fundamental concepts and what makes it different from previous forms of data analysis and data science Understanding the business motivations and drivers behind Big Data adoption, from operational improvements through innovation Planning strategic, business-driven Big Data initiatives Addressing considerations such as data management, governance, and security Recognizing the 5 V characteristics of datasets in Big Data environments: volume, velocity, variety, veracity, and value Clarifying Big Datas relationships with OLTP, OLAP, ETL, data warehouses, and data marts Working with Big Data in structured, unstructured, semi-structured, and metadata formats Increasing value by integrating Big Data resources with corporate performance monitoring Understanding how Big Data leverages distributed and parallel processing Using NoSQL and other technologies to meet Big Datas distinct data processing requirements Leveraging statistical approaches of quantitative and qualitative analysis Applying computational analysis methods, including machine learning

...read moreread less

Managing Big Data for Scientific Visualization

[...]

Michael Cox, David S. Ellsworth

01 Jan 2015

TL;DR: This section offers some structure to understand what has been done to manage big data for engineering and scientific visualization, and to understand and go forward in areas that may prove fruitful.

...read moreread less

Abstract: Many areas of endeavor have problems with big data. Some classical business applications have faced big data for some time (e.g. airline reservation systems), and newer business applications to exploit big data are under construction (e.g. data warehouses, federations of databases). While engineering and scientific visualization have also faced the problem for some time, solutions are less well developed, and common techniques are less well understood. In this section we offer some structure to understand what has been done to manage big data for engineering and scientific visualization, and to understand and go forward in areas that may prove fruitful. With this structure as backdrop, we discuss the work that has been done in management of big data, as well as our own work on demand-paged segments for fluid flow visualization.

...read moreread less

Journal Article•DOI•

From Data Quality to Big Data Quality

[...]

Carlo Batini¹, Anisa Rula¹, Monica Scannapieco², Gianluigi Viscusi³•Institutions (3)

University of Milano-Bicocca¹, National Institute of Statistics², École Polytechnique Fédérale de Lausanne³

01 Jan 2015-Journal of Database Management

TL;DR: The nature of the relationship between Data Quality and several research coordinates that are relevant in Big Data, such as the variety of data types, data sources and application domains, are examined, focusing on maps, semi-structured texts, linked open data, sensor & sensor networks and official statistics.

...read moreread less

Abstract: This article investigates the evolution of data quality issues from traditional structured data managed in relational databases to Big Data. In particular, the paper examines the nature of the relationship between Data Quality and several research coordinates that are relevant in Big Data, such as the variety of data types, data sources and application domains, focusing on maps, semi-structured texts, linked open data, sensor & sensor networks and official statistics. Consequently a set of structural characteristics is identified and a systematization of the a posteriori correlation between them and quality dimensions is provided. Finally, Big Data quality issues are considered in a conceptual framework suitable to map the evolution of the quality paradigm according to three core coordinates that are significant in the context of the Big Data phenomenon: the data type considered, the source of data, and the application domain. Thus, the framework allows ascertaining the relevant changes in data quality emerging with the Big Data phenomenon, through an integrative and theoretical literature review.

...read moreread less

Journal Article•DOI•

Evaluating the Quality of Social Media Data in Big Data Architecture

[...]

Anne Immonen¹, Pekka Pääkkönen¹, Eila Ovaska¹•Institutions (1)

VTT Technical Research Centre of Finland¹

16 Oct 2015-IEEE Access

TL;DR: The proposed solution improves business decision making by providing real-time, validated data for the user and is validated with an industrial case example, in which the customer insight is extracted from social media data in order to determine the customer satisfaction regarding the quality of a product.

...read moreread less

Abstract: The use of freely available online data is rapidly increasing, as companies have detected the possibilities and the value of these data in their businesses. In particular, data from social media are seen as interesting as they can, when properly treated, assist in achieving customer insight into business decision making. However, the unstructured and uncertain nature of this kind of big data presents a new kind of challenge: how to evaluate the quality of data and manage the value of data within a big data architecture? This paper contributes to addressing this challenge by introducing a new architectural solution to evaluate and manage the quality of social media data in each processing phase of the big data pipeline. The proposed solution improves business decision making by providing real-time, validated data for the user. The solution is validated with an industrial case example, in which the customer insight is extracted from social media data in order to determine the customer satisfaction regarding the quality of a product.

...read moreread less

Journal Article•DOI•

An interactive machine-learning-based electronic fraud and abuse detection system in healthcare insurance

[...]

Ilker Kose¹, Mehmet Göktürk¹, Kemal Kilic²•Institutions (2)

Gebze Institute of Technology¹, Sabancı University²

01 Nov 2015

TL;DR: The purpose of this study is to implement and evaluate a novel framework to detect fraudulent and abusive cases independently from the actors and commodities involved in the claims and an extensible structure to introduce new fraud and abuse types.

...read moreread less

Abstract: The nature of problem and claim management environment are well-defined.Very comprehensive literature revive is presented.An IML based novel DSS detecting fraud and abuse in healthcare insurance is developed.Unlike earlier studies, the system embraces all relevant actors and commodities.Real data are used for the evaluation and perform better results wrt previous studies. Detecting fraudulent and abusive cases in healthcare is one of the most challenging problems for data mining studies. However, most of the existing studies have a shortage of real data for analysis and focus on a very limited version of the problem by covering only a specific actor, healthcare service, or disease. The purpose of this study is to implement and evaluate a novel framework to detect fraudulent and abusive cases independently from the actors and commodities involved in the claims and an extensible structure to introduce new fraud and abuse types. Interactive machine learning that allows incorporating expert knowledge in an unsupervised setting is utilized to detect fraud and abusive cases in healthcare. In order to increase the accuracy of the framework, several well-known methods are utilized, such as the pairwise comparison method of analytic hierarchical processing (AHP) for weighting the actors and attributes, expectation maximization (EM) for clustering similar actors, two-stage data warehousing for proactive risk calculations, visualization tools for effective analyzing, and z-score and standardization in order to calculate the risks. The experts are involved in all phases of the study and produce six different abnormal behavior types using storyboards. The proposed framework is evaluated with real-life data for six different abnormal behavior types for prescriptions by covering all relevant actors and commodities. The Area Under the Curve (AUC) values are presented for each experiment. Moreover, a cost-saving model is also presented. The developed framework, i.e., the eFAD suite, is actor- and commodity-independent, configurable (i.e., easily adaptable in the dynamic environment of fraud and abusive behaviors), and effectively handles the fragmented nature of abnormal behaviors. The proposed framework combines both proactive and retrospective analysis with an enhanced visualization tool that significantly reduces the time requirements for the fact-finding process after the eFAD detects risky claims. This system is utilized by a company to produce monthly reports that include abnormal behaviors to be evaluated by the insurance company.

...read moreread less

Proceedings Article•DOI•

Big Data Pre-processing: A Quality Framework

[...]

Ikbal Taleb¹, Rachida Dssouli¹, Mohamed Adel Serhani²•Institutions (2)

Concordia University¹, College of Information Technology²

27 Jun 2015

TL;DR: A QBD model incorporating processes to support Data quality profile selection and adaptation is proposed and it tracks and registers on a data provenance repository the effect of every data transformation happened in the pre-processing phase.

...read moreread less

Abstract: With the abundance of raw data generated from various sources, Big Data has become a preeminent approach in acquiring, processing, and analyzing large amounts of heterogeneous data to derive valuable evidences. The size, speed, and formats in which data is generated and processed affect the overall quality of information. Therefore, Quality of Big Data (QBD) has become an important factor to ensure that the quality of data is maintained at all Big data processing phases. This paper addresses the QBD at the pre-processing phase, which includes sub-processes like cleansing, integration, filtering, and normalization. We propose a QBD model incorporating processes to support Data quality profile selection and adaptation. In addition, it tracks and registers on a data provenance repository the effect of every data transformation happened in the pre-processing phase. We evaluate the data quality selection module using large EEG dataset. The obtained results illustrate the importance of addressing QBD at an early phase of Big Data processing lifecycle since it significantly save on costs and perform accurate data analysis.

...read moreread less

Journal Article•DOI•

Processes Meet Big Data: Connecting Data Science with Process Science

[...]

Wil M. P. van der Aalst¹, Ernesto Damiani²•Institutions (2)

Eindhoven University of Technology¹, Khalifa University²

01 Nov 2015-IEEE Transactions on Services Computing

TL;DR: This editorial discusses the interplay between data science and process science and relates process mining to Big data technologies, service orientation, and cloud computing.

...read moreread less

Abstract: As more and more companies are embracing Big data, it has become apparent that the ultimate challenge is to relate massive amounts of event data to processes that are highly dynamic. To unleash the value of event data, events need to be tightly connected to the control and management of operational processes. However, the primary focus of Big data technologies is currently on storage, processing, and rather simple analytical tasks. Big data initiatives rarely focus on the improvement of end-to-end processes. To address this mismatch, we advocate a better integration of data science , data technology and process science . Data science approaches tend to be process agonistic whereas process science approaches tend to be model-driven without considering the “evidence” hidden in the data. Process mining aims to bridge this gap. This editorial discusses the interplay between data science and process science and relates process mining to Big data technologies, service orientation, and cloud computing.

...read moreread less

Proceedings Article•DOI•

Personal Data Lake with Data Gravity Pull

[...]

Coral Walker¹, Hassan H. Alrehamy¹•Institutions (1)

Cardiff University¹

26 Aug 2015

TL;DR: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data, and allows third-party plugins so that the unstructured data can be analyzed and queried.

...read moreread less

Abstract: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data. A data lake stores data regardless of format and thus provides an intuitive way to store personal data fragments of any type. Metadata management is a central part of the lake architecture. For structured/semi-structured data fragments, metadata may contain information about the schema of the data so that the data can be transformed into queryable data objects when required. For unstructured data, enabling gravity pull means allowing third-party plugins so that the unstructured data can be analyzed and queried.

...read moreread less

Journal Article•DOI•

Ontology-Based Data Integration between Clinical and Research Systems

[...]

Sebastian Mate¹, Felix Köpcke¹, Dennis Toddenroth¹, Marcus Martin¹, Hans-Ulrich Prokosch¹, Thomas Bürkle², Thomas Ganslandt¹ - Show less +3 more•Institutions (2)

University of Erlangen-Nuremberg¹, Bern University of Applied Sciences²

14 Jan 2015-PLOS ONE

TL;DR: This work uses ontologies to organize and describe the medical concepts of both the source system and the target system and demonstrates how a suitable level of abstraction may not only aid the interpretation of clinical data, but can also foster the reutilization of methods for un-locking it.

...read moreread less

Abstract: Data from the electronic medical record comprise numerous structured but uncoded ele-ments, which are not linked to standard terminologies. Reuse of such data for secondary research purposes has gained in importance recently. However, the identification of rele-vant data elements and the creation of database jobs for extraction, transformation and loading (ETL) are challenging: With current methods such as data warehousing, it is not feasible to efficiently maintain and reuse semantically complex data extraction and trans-formation routines. We present an ontology-supported approach to overcome this challenge by making use of abstraction: Instead of defining ETL procedures at the database level, we use ontologies to organize and describe the medical concepts of both the source system and the target system. Instead of using unique, specifically developed SQL statements or ETL jobs, we define declarative transformation rules within ontologies and illustrate how these constructs can then be used to automatically generate SQL code to perform the desired ETL procedures. This demonstrates how a suitable level of abstraction may not only aid the interpretation of clinical data, but can also foster the reutilization of methods for un-locking it.

...read moreread less

Journal Article•DOI•

The Impact of Driving Styles on Fuel Consumption: A Data-Warehouse-and-Data-Mining-Based Discovery Process

[...]

João Ferreira¹, Jose de Almeida, Alberto Rodrigues da Silva•Institutions (1)

University of Minho¹

20 Apr 2015-IEEE Transactions on Intelligent Transportation Systems

TL;DR: These findings show that introducing simple practices, such as optimal clutch, engine rotation, and engine running in idle, can reduce fuel consumption on average from 3 to 5l/100 km, meaning a saving of 30 l per bus on one day.

...read moreread less

Abstract: This paper discusses the results of applied research on the eco-driving domain based on a huge data set produced from a fleet of Lisbon's public transportation buses for a three-year period. This data set is based on events automatically extracted from the control area network bus and enriched with GPS coordinates, weather conditions, and road information. We apply online analytical processing (OLAP) and knowledge discovery (KD) techniques to deal with the high volume of this data set and to determine the major factors that influence the average fuel consumption, and then classify the drivers involved according to their driving efficiency. Consequently, we identify the most appropriate driving practices and styles. Our findings show that introducing simple practices, such as optimal clutch, engine rotation, and engine running in idle, can reduce fuel consumption on average from 3 to 5l/100 km, meaning a saving of 30 l per bus on one day. These findings have been strongly considered in the drivers' training sessions.

...read moreread less

Reference Entry•DOI•

Business Intelligence and Analytics

[...]

Daniel J. Power¹, Ramesh Sharda²•Institutions (2)

University of Northern Iowa¹, Oklahoma State University–Stillwater²

21 Jan 2015

TL;DR: Data mining is a process of analyzing large amounts of data to identify data content relationships and is the key component of predictive analytics.

...read moreread less

Abstract: A business intelligence system is a data-driven decision support system Managing data is especially important for business intelligence and analytics Data warehouses, marts or data-driven decision support systems are intended to help managers transform data into information and knowledge Routinely data is moved from source systems to a decision support data store Some comparison reports include external data on competitors or other relevant data Analytics refers to quantitative analysis of data There are three components of business analytics: i) descriptive analytics, ii) predictive analytics and iii) prescriptive analytics Besides statistical analysis techniques, data mining is the key component of predictive analytics Data mining is a process of analyzing large amounts of data to identify data content relationships Cloud computing and “Big” data are changing business intelligence and analytics Columnar data bases let analysts work with data from web logs and other nonrelational data sources Keywords: analytics; business intelligence; decision support; decision sciene

...read moreread less

Journal Article•DOI•

Discovery of medical Big Data analytics: Improving the prediction of traumatic brain injury survival rates by data mining Patient Informatics Processing Software Hybrid Hadoop Hive

[...]

James A. Rodger¹•Institutions (1)

Indiana University of Pennsylvania¹

01 Jan 2015-Informatics in Medicine Unlocked

TL;DR: The method improved data classification and showed that survival, mortality, and morbidity rates can be derived from the superset of Medical Operations data and used for future decision-making and planning.

...read moreread less

Patent•

Knowledge-intensive data processing system

[...]

Eric S. Chan¹, Dieter Gawlick¹, Adel Ghoneimy¹, Zhen Hua Liu¹•Institutions (1)

Business International Corporation¹

23 Mar 2015

TL;DR: In this paper, the authors provide systems and methods for managing and processing large amounts of complex and high-velocity data by capturing and extracting high-value data from low value data using big data and related technologies.

...read moreread less

Abstract: Embodiments of the invention provide systems and methods for managing and processing large amounts of complex and high-velocity data by capturing and extracting high-value data from low value data using big data and related technologies. Illustrative database systems described herein may collect and process data while extracting or generating high-value data. The high-value data may be handled by databases providing functions such as multi-temporality, provenance, flashback, and registered queries. In some examples, computing models and system may be implemented to combine knowledge and process management aspects with the near real-time data processing frameworks in a data-driven situation aware computing system.

...read moreread less

Journal Article•DOI•

Business intelligence for cross-process knowledge extraction at tourism destinations

[...]

Wolfram Höpken¹, Matthias Fuchs², Dimitri Keil², Maria Lexhagen²•Institutions (2)

University of Applied Sciences Ravensburg-Weingarten¹, Mid Sweden University²

06 May 2015-Information Technology & Tourism

TL;DR: This study proposes a novel approach for business intelligence-based cross-process knowledge extraction and decision support for tourism destinations that demonstrates the effectiveness of the proposed business intelligence architecture and the gained business benefits for a tourism destination.

...read moreread less

Abstract: Decision-relevant data stemming from various business processes within tourism destinations (e.g. booking or customer feedback) are usually extensively available in electronic form. However, these data are not typically utilized for product optimization and decision support by tourism managers. Although methods of business intelligence and knowledge extraction are employed in many travel and tourism domains, current applications usually deal with different business processes separately, which lacks a cross-process analysis approach. This study proposes a novel approach for business intelligence-based cross-process knowledge extraction and decision support for tourism destinations. The approach consists of (a) a homogeneous and comprehensive data model that serves as the basis of a central data warehouse, (b) mechanisms for extracting data from heterogeneous sources and integrating these data into the homogeneous data structures of the data warehouse, and (c) analysis methods for identifying important relationships and patterns across different business processes, thereby bringing to light new knowledge. A prototype of the proposed concepts was implemented for the leading Swedish mountain destination Are, which demonstrates the effectiveness of the proposed business intelligence architecture and the gained business benefits for a tourism destination.

...read moreread less

Introduction To Data Mining”, Person Education, 2007

[...]

Pang-Ning Tan, Michael Steinbach, Vipin Kumar

17 Jan 2015

Proceedings Article•DOI•

Implementing Multidimensional Data Warehouses into NoSQL

[...]

Max Chevalier¹, Mohammed El Malki¹, Arlind Kopliku¹, Olivier Teste¹, Ronan Tournier¹ - Show less +1 more•Institutions (1)

University of Toulouse¹

27 Apr 2015

TL;DR: A set of rules to map star schemas into two NoSQL models: column-oriented and document-oriented are defined and shown to be faster than MongoDB (document-oriented) in terms of loading time.

...read moreread less

Abstract: Not only SQL (NoSQL) databases are becoming increasingly popular and have some interesting strengths such as scalability and flexibility. In this paper, we investigate on the use of NoSQL systems for implementing OLAP (On-Line Analytical Processing) systems. More precisely, we are interested in instantiating OLAP systems (from the conceptual level to the logical level) and instantiating an aggregation lattice (optimization). We define a set of rules to map star schemas into two NoSQL models: column-oriented and document-oriented. The experimental part is carried out using the reference benchmark TPC. Our experiments show that our rules can effectively instantiate such systems (star schema and lattice). We also analyze differences between the two NoSQL systems considered. In our experiments, HBase (column-oriented) happens to be faster than MongoDB (document-oriented) in terms of loading time.

...read moreread less

Proceedings Article•DOI•

Optimizing Grouped Aggregation in Geo-Distributed Streaming Analytics

[...]

Benjamin Heintz¹, Abhishek Chandra¹, Ramesh K. Sitaraman²•Institutions (2)

University of Minnesota¹, University of Massachusetts Amherst²

15 Jun 2015

TL;DR: This work is focused on designing aggregation algorithms to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness, and presents a family of optimal offline algorithms that jointly minimize both staleness and traffic.

...read moreread less

Abstract: Large quantities of data are generated continuously over time and from disparate sources such as users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical analytics service in these settings uses a simple hub-and-spoke model, comprising a single central data warehouse and multiple edges connected by a wide-area network (WAN). A key decision for a geo-distributed streaming service is how much of the computation should be performed at the edge versus the center. In this paper, we examine this question in the context of windowed grouped aggregation, an important and widely used primitive in streaming queries. Our work is focused on designing aggregation algorithms to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result). Towards this end, we present a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we develop practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. This key insight allows us to exploit well known caching techniques in our design of online aggregation algorithms. We demonstrate the practicality of these algorithms through an implementation in Apache Storm, deployed on the PlanetLab testbed. The results of our experiments, driven by workloads derived from anonymized traces of a popular web analytics service offered by a large commercial CDN, show that our online aggregation algorithms perform close to the optimal algorithms for a variety of system configurations, stream arrival rates, and query types.

...read moreread less

Journal Article•DOI•

Augmenting Data Warehouses with Big Data

[...]

Nenad Jukic¹, Abhishek Sharma¹, Svetlozar Nestorov¹, Boris Jukic²•Institutions (2)

Loyola University Chicago¹, Clarkson University²

01 Jul 2015-Information Systems Management

TL;DR: The ability of big data (while acting as a direct source for impactful analysis) to also augment and enrich the analytical power of data warehouses is focused on.

...read moreread less

Abstract: In the past decade, corporations are increasingly engaging in efforts whose aim is the analysis and wide-ranging use of big data. The majority of academic big data articles have been focused on methods, approaches, opportunities, and organizational impact of big data analytics. In this article, the focus is on the ability of big data while acting as a direct source for impactful analysis to also augment and enrich the analytical power of data warehouses.

...read moreread less

Reference Book•DOI•

Big Data: Algorithms, Analytics, and Applications

[...]

Kuan-Ching Li, Hai Jiang, Laurence T. Yang, Alfredo Cuzzocrea

23 Feb 2015

TL;DR: Big Data: Algorithms, Analytics, and Applications bridges the gap between the vastness of Big Data and the appropriate computational methods for scientific and social discovery and provides readers with the basis for further efforts in this challenging scientific field.

...read moreread less

Abstract: As todays organizations are capturing exponentially larger amounts of data than ever, now is the time for organizations to rethink how they digest that data. Through advanced algorithms and analytics techniques, organizations can harness this data, discover hidden patterns, and use the newly acquired knowledge to achieve competitive advantages.Presenting the contributions of leading experts in their respective fields, Big Data: Algorithms, Analytics, and Applications bridges the gap between the vastness of Big Data and the appropriate computational methods for scientific and social discovery. It covers fundamental issues about Big Data, including efficient algorithmic methods to process data, better analytical strategies to digest data, and representative applications in diverse fields, such as medicine, science, and engineering. The book is organized into five main sections: Big Data Managementconsiders the research issues related to the management of Big Data, including indexing and scalability aspects Big Data Processingaddresses the problem of processing Big Data across a wide range of resource-intensive computational settings Big Data Stream Techniques and Algorithmsexplores research issues regarding the management and mining of Big Data in streaming environments Big Data Privacyfocuses on models, techniques, and algorithms for preserving Big Data privacy Big Data Applicationsillustrates practical applications of Big Data across several domains, including finance, multimedia tools, biometrics, and satellite Big Data processing Overall, the book reports on state-of-the-art studies and achievements in algorithms, analytics, and applications of Big Data. It provides readers with the basis for further efforts in this challenging scientific field that will play a leading role in next-generation database, data warehousing, data mining, and cloud computing research. It also explores related applications in diverse sectors, covering technologies for media/data communication, elastic media/data storage, cross-network media/data fusion, and SaaS.

...read moreread less

Collapse