
Showing papers on "Data warehouse published in 2012"


Journal ArticleDOI
01 Jan 2012-Database
TL;DR: YeastMine is a multifaceted search and retrieval environment that provides access to diverse data types and offers multiple scenarios in which it can be used such as a powerful search interface, a discovery tool, a curation aid and also a complex database presentation format.
Abstract: The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) provides high-quality curated genomic, genetic, and molecular information on the genes and their products of the budding yeast Saccharomyces cerevisiae. To accommodate the increasingly complex, diverse needs of researchers for searching and comparing data, SGD has implemented InterMine (http://www.InterMine.org), an open source data warehouse system with a sophisticated querying interface, to create YeastMine (http://yeastmine.yeastgenome.org). YeastMine is a multifaceted search and retrieval environment that provides access to diverse data types. Searches can be initiated with a list of genes, a list of Gene Ontology terms, or lists of many other data types. The results from queries can be combined for further analysis and saved or downloaded in customizable file formats. Queries themselves can be customized by modifying predefined templates or by creating a new template to access a combination of specific data types. YeastMine offers multiple scenarios in which it can be used such as a powerful search interface, a discovery tool, a curation aid and also a complex database presentation format. DATABASE URL: http://yeastmine.yeastgenome.org.

281 citations


Book
10 Aug 2012
TL;DR: A uniform logical framework for dealing with fundamental issues underlying central aspects of data quality, namely, data consistency, data deduplication, data accuracy, data currency, and information completeness is promoted, based on data quality rules.
Abstract: Data quality is one of the most important problems in data management. A database system typically aims to support the creation, maintenance, and use of large amounts of data, focusing on the quantity of data. However, real-life data are often dirty: inconsistent, duplicated, inaccurate, incomplete, or stale. Dirty data in a database routinely generate misleading or biased analytical results and decisions, and lead to loss of revenues, credibility and customers. With this comes the need for data quality management. In contrast to traditional data management tasks, data quality management enables the detection and correction of errors in the data, syntactic or semantic, in order to improve the quality of the data and hence, add value to business processes. While data quality has been a longstanding problem for decades, the prevalent use of the Web has increased the risks, on an unprecedented scale, of creating and propagating dirty data. This monograph gives an overview of fundamental issues underlying central aspects of data quality, namely, data consistency, data deduplication, data accuracy, data currency, and information completeness. We promote a uniform logical framework for dealing with these issues, based on data quality rules. The text is organized into seven chapters, focusing on relational data. Chapter One introduces data quality issues. A conditional dependency theory is developed in Chapter Two, for capturing data inconsistencies. It is followed by practical techniques in Chapter Three for discovering conditional dependencies, and for detecting inconsistencies and repairing data based on conditional dependencies. Matching dependencies are introduced in Chapter Four, as matching rules for data deduplication. A theory of relative information completeness is studied in Chapter Five, revising the classical Closed World Assumption and the Open World Assumption, to characterize incomplete information in the real world. A data currency model is presented in Chapter Six, to identify the current values of entities in a database and to answer queries with the current values, in the absence of reliable timestamps. Finally, interactions between these data quality issues are explored in Chapter Seven. Important theoretical results and practical algorithms are covered, but formal proofs are omitted. The bibliographical notes contain pointers to papers in which the results were presented and proven, as well as references to materials for further reading. This text is intended for a seminar course at the graduate level. It also serves as a useful resource for researchers and practitioners who are interested in the study of data quality. The fundamental research on data quality draws on several areas, including mathematical logic, computational complexity and database theory. It has raised as many questions as it has answered, and is a rich source of questions and vitality. Table of Contents: Data Quality: An Overview / Conditional Dependencies / Cleaning Data with Conditional Dependencies / Data Deduplication / Information Completeness / Data Currency / Interactions between Data Quality Issues

264 citations


Proceedings ArticleDOI
30 Mar 2012
TL;DR: Sieve, a framework for flexibly expressing quality assessment and fusion methods, is presented; it is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion.
Abstract: The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonic judgement that is suitable for each and every task is not feasible. To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion. We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.
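A rough Python sketch of what a quality assessment and fusion step does conceptually: each source's value is scored by a weighted combination of quality indicators and conflicts are resolved by keeping the highest-scored value. The indicator names, weights and DBpedia-style labels below are illustrative assumptions; this is not Sieve's actual configuration language or the LDIF API.

from dataclasses import dataclass

@dataclass
class SourceValue:
    source: str          # source label, e.g. "dbpedia-en" (illustrative)
    value: str
    recency: float       # 0..1, 1 = most recently updated
    completeness: float  # 0..1, fraction of expected fields the source filled

def assess(sv, weights=(0.6, 0.4)):
    # Scoring function: weighted sum of the quality indicators (weights assumed).
    w_rec, w_comp = weights
    return w_rec * sv.recency + w_comp * sv.completeness

def fuse(conflicting):
    # Fusion policy: keep the value coming from the best-scored source.
    return max(conflicting, key=assess).value

population = [
    SourceValue("dbpedia-en", "10,435,000", recency=0.9, completeness=0.80),
    SourceValue("dbpedia-pt", "10,562,178", recency=0.5, completeness=0.95),
]
print(fuse(population))   # prints the value from the higher-scored source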

263 citations


Journal ArticleDOI
TL;DR: Using InterMine, large biological databases can be created from a range of heterogeneous data sources, and the extensible data model allows for easy integration of new data types.
Abstract: Summary: InterMine is an open-source data warehouse system that facilitates the building of databases with complex data integration requirements and a need for a fast customizable query facility. Using InterMine, large biological databases can be created from a range of heterogeneous data sources, and the extensible data model allows for easy integration of new data types. The analysis tools include a flexible query builder, genomic region search and a library of ‘widgets’ performing various statistical analyses. The results can be exported in many commonly used formats. InterMine is a fully extensible framework where developers can add new tools and functionality. Additionally, there is a comprehensive set of web services, for which client libraries are provided in five commonly used programming languages. Availability: Freely available from http://www.intermine.org under the LGPL license. Contact: g.micklem@gen.cam.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
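For readers who want to try the web services mentioned above, here is a minimal query sketch using the InterMine Python client (pip install intermine). The calls follow the client's documented query-builder pattern, but the service URL, class name, attribute names and the constraint shown are illustrative assumptions rather than a verified example.

from intermine.webservice import Service

# YeastMine service endpoint (assumed); any InterMine instance exposes the same pattern.
service = Service("https://yeastmine.yeastgenome.org/yeastmine/service")

query = service.new_query("Gene")                       # root class of the query
query.add_view("primaryIdentifier", "symbol", "name")   # columns to return
query.add_constraint("symbol", "=", "ACT1")             # hypothetical example constraint

for row in query.rows():
    print(row["primaryIdentifier"], row["symbol"], row["name"])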

224 citations


Book
16 Nov 2012
TL;DR: This monograph provides an accessible introduction and reference to materialized views, explains their core ideas, highlights their recent developments, and points out their sometimes subtle connections to other research topics in databases.
Abstract: Materialized views are a natural embodiment of the ideas of precomputation and caching in databases. Instead of computing a query from scratch, a system can use results that have already been computed, stored, and kept in sync with database updates. The ability of materialized views to speed up queries benefits most database applications, ranging from traditional querying and reporting to web database caching, online analytical processing, and data mining. By reducing dependency on the availability of base data, materialized views have also laid much of the foundation for information integration and data warehousing systems. The database tradition of declarative querying distinguishes materialized views from generic applications of precomputation and caching in other contexts, and makes materialized views especially interesting, powerful, and challenging at the same time. Study of materialized views has generated a rich research literature and mature commercial implementations, aimed at providing efficient, effective, automated, and general solutions to the selection, use, and maintenance of materialized views. This monograph provides an accessible introduction and reference to materialized views, explains their core ideas, highlights their recent developments, and points out their sometimes subtle connections to other research topics in databases.

172 citations


Patent
05 Sep 2012
TL;DR: In this article, a data mining marshaller module associates each plugin to a particular data source and manages the plugin to periodically retrieve unstructured data from the data source based on a plurality of data items to be monitored on behalf of users.
Abstract: A system and method for collecting and processing data over a communications network. A data mining marshaller module associates each plugin to a particular data source and manages the plugin to periodically retrieve unstructured data from the data source based on a plurality of data items to be monitored on behalf of a plurality of users. The plugins convert unstructured data received from the data sources to structured data and the data marshaller module stores the structured data in a database. This enables the system and method to aggregate and display the structured data in multiple graphical representations according to the user's preference.

155 citations


Proceedings ArticleDOI
13 Mar 2012
TL;DR: This paper presents a systematic review of data quality dimensions for use in a proposed framework that combines data mining and statistical techniques to measure dependencies among dimensions and to illustrate how the extracted knowledge can increase process quality.
Abstract: Nowadays, activities and decision making in an organization are based on data and on information obtained from data analysis, which supports the construction of reliable and accurate processes. Because data are a significant resource in every organization, data quality is critical for managers and operational processes when identifying related performance issues. Moreover, high-quality data can increase the opportunity to deliver top-quality services in an organization. However, identifying the various aspects of data quality, from definitions and dimensions to types, strategies and techniques, is essential for equipping the methods and processes used to improve data. This paper focuses on a systematic review of data quality dimensions for use in a proposed framework that combines data mining and statistical techniques to measure dependencies among dimensions and to illustrate how extracting knowledge can increase process quality.

154 citations


Journal ArticleDOI
01 Oct 2012
TL;DR: This essay contends that a new vision for the IS discipline should address the challenges facing IS departments, and discusses the role of IS curricula and program development in delivering BI&A education.
Abstract: “Big Data,” huge volumes of data in both structured and unstructured forms generated by the Internet, social media, and computerized transactions, is straining our technical capacity to manage it. More importantly, the new challenge is to develop the capability to understand and interpret the burgeoning volume of data to take advantage of the opportunities it provides in many human endeavors, ranging from science to business. Data Science, and in business schools, Business Intelligence and Analytics (BI&A) are emerging disciplines that seek to address the demands of this new era. Big Data and BI&A present unique challenges and opportunities not only for the research community, but also for Information Systems (IS) programs at business schools. In this essay, we provide a brief overview of BI&A, speculate on the role of BI&A education in business schools, present the challenges facing IS departments, and discuss the role of IS curricula and program development in delivering BI&A education. We contend that a new vision for the IS discipline should address these challenges.

151 citations


Proceedings ArticleDOI
20 May 2012
TL;DR: Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data.
Abstract: Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data. It scales to thousands of nodes in a fault-tolerant manner. Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop on large datasets.

144 citations


Patent
Leigh Amaro1, Parag Ladhawala1
15 Mar 2012
TL;DR: In this article, a system and method configured to provide enhanced services based on check-in information obtained in a social network system and transaction location information observed in a payment processing system is presented.
Abstract: A system and method configured to provide enhanced services based on check-in information obtained in a social network system and transaction location information observed in a payment processing system. In one aspect, the transaction location may be used to validate, verify or authenticate the check-in location declared in the social network system. In another aspect, the transaction location can be used as a basis to automate a check-in in the social network system in accordance with a preference of a user. In a further aspect, the transaction location and the check-in location can be correlated to detect inaccurate data, correct the inaccurate data, and/or augment the data in a data warehouse about the locations of transaction terminals.

139 citations


Book ChapterDOI
01 Jan 2012
TL;DR: This chapter introduces the basic concepts of data preprocessing and organizes the methods for data preprocessing into the following categories: data cleaning, data integration, data reduction, and data transformation.
Abstract: Publisher Summary This chapter introduces the basic concepts of data preprocessing and the methods for data preprocessing are organized into the following categories: data cleaning, data integration, data reduction, and data transformation. Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability. There are several data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in data. Data integration merges data from multiple sources into a coherent data store such as a data warehouse. Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering. Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range. This can improve the accuracy and efficiency of mining algorithms involving distance measurements. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format. The different attribute types and data characteristics can help identify erroneous values and outliers, which will be useful in the data cleaning and integration steps. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining.
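As a minimal, self-contained illustration (not taken from the chapter) of two of the steps it describes, the Python/pandas sketch below fills a missing value during cleaning and then applies min-max normalization; all column names and values are made up.

import pandas as pd

df = pd.DataFrame({
    "age": [23, None, 31, 45],                        # missing value to clean
    "income": [30000.0, 58000.0, 91000.0, 72000.0],
})

# Data cleaning: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Data transformation: min-max normalization scales values into [0, 1],
# which helps mining algorithms that rely on distance measurements.
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)
print(df)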

Journal ArticleDOI
TL;DR: This paper presents research that used bibliometric content analysis to apply the HMPR guidelines to a representative sample of 362 DSS design-science research papers in 14 journals, highlighting major issues in DSS research that need attention.
Abstract: Design science has been an important strategy in decision support systems (DSS) research since the field’s inception in the early 1970s. Recent reviews of DSS research have indicated a need to improve its quality and relevance. DSS design-science research has an important role in this improvement because design-science research can engage industry and the profession in intellectually important projects. The Hevner, March, Park, and Ram (HMPR) guidelines for the conduct and assessment of information systems design-science research, published in MIS Quarterly in 2004, provide a vehicle for assessing DSS design-science research. This paper presents research that used bibliometric content analysis to apply the HMPR guidelines to a representative sample of 362 DSS design-science research papers in 14 journals. The analysis highlights major issues in DSS research that need attention: research design, evaluation, relevance, strategic focus, and theorizing.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A compiler framework, Kernel Weaver, is proposed that can automatically fuse relational algebra operators, thereby eliminating redundant data movement; key insights, lessons learned, measurements from the compiler implementation, and opportunities for further improvements are presented.
Abstract: Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to i) reduce data footprint to cut down data movement throughout GPU and CPU memory hierarchy, and ii) enlarge compiler optimization scope. We classify producer consumer dependences between compute kernels into three types, i) fine-grained thread-to-thread dependences, ii) medium-grained thread block dependences, and iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators thereby eliminating redundant data movement. The experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the micro-benchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.
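The data-movement argument generalizes beyond GPUs. The pure-Python sketch below is only an analogue of the idea (not the paper's CUDA implementation): fusing two operator bodies into a single pass removes the intermediate result and the extra traversal over it.

data = list(range(1_000_000))

# Unfused: operator 1 materializes an intermediate list, operator 2 re-reads it.
selected = [x for x in data if x % 3 == 0]        # selection "kernel"
total_unfused = sum(x * 2 for x in selected)      # projection/aggregation "kernel"

# Fused: both operator bodies run in a single pass, no intermediate is stored.
total_fused = sum(x * 2 for x in data if x % 3 == 0)

assert total_unfused == total_fused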

Patent
15 Jun 2012
TL;DR: In this article, a power context system is described that makes decisions related to device power usage based on factors such as location, load, available alternatives, cost of power, and cost of bandwidth.
Abstract: A power context system is described herein that makes decisions related to device power usage based on factors such as location, load, available alternatives, cost of power, and cost of bandwidth. The system incorporates contextual knowledge about the situation in which a device is being used. Using the context of location, devices can make smarter decisions about deciding which processes to migrate to the cloud, load balancing between applications, and switching to power saving modes depending on how far the user is from a power source. As the cloud becomes more frequently used, load balancing by utilizing distributed data warehouses to move processes to different locations in the world depending on factors such as accessibility, locales, and cost of electricity are considerations for power management. Power management of mobile devices is becoming important as integration with the cloud yields expectations of devices being able to reliably access and persist data.

Proceedings ArticleDOI
01 Jun 2012
TL;DR: This paper applies the Apriori algorithm to a database containing academic records of various students and tries to extract association rules in order to profile students based on various parameters like exam scores, term work grades, attendance and practical exams.
Abstract: Data mining is a process of identifying and extracting hidden patterns and information from databases and data warehouses. There are various algorithms and tools available for this purpose. Data mining has a vast range of applications ranging from business to medicine to engineering. In this paper, we discuss the application of data mining in education for student profiling and grouping. We make use of the Apriori algorithm for student profiling, which is one of the popular approaches for mining associations, i.e. discovering correlations among sets of items. The other algorithm used, for grouping students, is K-means clustering, which assigns a set of observations into subsets. In the field of academics, data mining can be very useful in discovering valuable information which can be used for profiling students based on their academic record. We apply the Apriori algorithm to the database containing academic records of various students and try to extract association rules in order to profile students based on various parameters like exam scores, term work grades, attendance and practical exams. We also apply K-means clustering to the same set of data in order to group the students. The implemented algorithms offer an effective way of profiling students which can be used in educational systems.
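A toy sketch of the two techniques on made-up student records (the flags and scores are illustrative, not the authors' dataset): a minimal Apriori-style pass that keeps itemsets above a support threshold and builds pair candidates only from the frequent single items, plus K-means from scikit-learn to group students by score.

from itertools import combinations
from sklearn.cluster import KMeans

# Each "transaction" lists the performance flags one student satisfies.
students = [
    {"high_exam", "good_termwork", "high_attendance", "pass_practical"},
    {"high_exam", "good_termwork", "pass_practical"},
    {"pass_practical"},
    {"high_exam", "good_termwork", "high_attendance", "pass_practical"},
    {"good_termwork", "high_attendance"},
    {"high_exam", "good_termwork", "high_attendance", "pass_practical"},
]
min_support = 0.5
n = len(students)

def support(itemset):
    return sum(itemset <= s for s in students) / n

# Apriori level 1: frequent single items.
items = {i for s in students for i in s}
L1 = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

# Apriori level 2: candidate pairs built only from frequent single items (pruning).
L2 = {a | b for a, b in combinations(L1, 2) if support(a | b) >= min_support}
for itemset in L2:
    print(sorted(itemset), round(support(itemset), 2))

# K-means on numeric scores groups students into performance clusters.
scores = [[85, 90], [78, 82], [40, 35], [92, 88], [55, 60], [88, 91]]
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores))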

Journal ArticleDOI
TL;DR: MitoMiner can be used to characterize the variability of the mitochondrial proteome between tissues and investigate how changes in the proteome may contribute to mitochondrial dysfunction and mitochondrial-associated diseases such as cancer, neurodegenerative diseases, obesity, diabetes, heart failure and the ageing process.
Abstract: MitoMiner (http://mitominer.mrc-mbu.cam.ac.uk/) is a data warehouse for the storage and analysis of mitochondrial proteomics data gathered from publications of mass spectrometry and green fluorescent protein tagging studies. In MitoMiner, these data are integrated with data from UniProt, Gene Ontology, Online Mendelian Inheritance in Man, HomoloGene, Kyoto Encyclopaedia of Genes and Genomes and PubMed. The latest release of MitoMiner stores proteomics data sets from 46 studies covering 11 different species from eumetazoa, viridiplantae, fungi and protista. MitoMiner is implemented by using the open source InterMine data warehouse system, which provides a user interface allowing users to upload data for analysis, personal accounts to store queries and results and enables queries of any data in the data model. MitoMiner also provides lists of proteins for use in analyses, including the new MitoMiner mitochondrial proteome reference sets that specify proteins with substantial experimental evidence for mitochondrial localization. As further mitochondrial proteomics data sets from normal and diseased tissue are published, MitoMiner can be used to characterize the variability of the mitochondrial proteome between tissues and investigate how changes in the proteome may contribute to mitochondrial dysfunction and mitochondrial-associated diseases such as cancer, neurodegenerative diseases, obesity, diabetes, heart failure and the ageing process.

Book ChapterDOI
27 Aug 2012
TL;DR: Insight is presented from interviews with seven established vendors about their key challenges with regard to pricing strategies in different market situations and associated research problems for the business intelligence community.
Abstract: Currently, multiple data vendors utilize the cloud-computing paradigm for trading raw data, associated analytical services, and analytic results as a commodity good. We observe that these vendors often move the functionality of data warehouses to cloud-based platforms. On such platforms, vendors provide services for integrating and analyzing data from public and commercial data sources. We present insights from interviews with seven established vendors about their key challenges with regard to pricing strategies in different market situations and derive associated research problems for the business intelligence community.

Journal Article
TL;DR: In this paper, the authors give an overview of data mining systems and some of their applications in different fields, and survey data mining tools and their applications in science and engineering.
Abstract: Today, multinational companies and large organizations have operations in many places around the world. Each place of operation may generate large volumes of data. Corporate decision makers require access to all such sources in order to take strategic decisions. Information and communication technologies are widely used in industry. The data warehouse delivers significant business value by improving the effectiveness of managerial decision-making. In an uncertain and highly competitive business environment, the value of strategic information systems such as these is easily recognized; however, in today's business environment, efficiency or speed is not the only key to competitiveness. Such a tremendous amount of data, on the order of terabytes to petabytes, has fundamentally changed science and engineering, transforming many disciplines from data-poor to increasingly data-rich, and calling for new, data-intensive methods of conducting research in science and engineering. Analyzing this vast amount of data and drawing fruitful conclusions and inferences requires special tools called data mining tools. This paper gives an overview of data mining systems and some of their applications in different fields.

Patent
17 Aug 2012
TL;DR: In this paper, a system is provided for qualifying and analyzing data for at least one business intelligence, where a data management system transforms raw data and stores it, and an analytic engine is included.
Abstract: A system is provided for qualifying and analyzing data for at least one business intelligence. A platform receives source data. A data management system transforms raw data and stores it. An analytic engine is included. In operation the data management system receives first, second, and third streams of source data. The first stream is client source data, the second stream is public source data and the third stream is acquired by the data management system. The data management system organizes the first, second and third streams of data into items and their attributes. The analytic engine receives the items with their attributes from the data management system and applies logic to provide multi-dimensional analysis relative to a scale for at least one business intelligence.

Patent
Ayman Hammad1, Matthew Joy1, Joseph Spears1, Mark Carlson1, Patrick Stan1 
27 Mar 2012
TL;DR: In this article, the authors present a system to integrate offer processing in a social networking environment, which includes a data warehouse configured to store data associating social networking accounts of a user with a financial payment account of the user.
Abstract: Systems and methods to integrate offer processing in a social networking environment. A computing apparatus includes: a data warehouse configured to store data associating a social networking account of a user with a financial payment account of the user; a portal configured to provide an offer to the user of the social networking account and to store data associating the offer with the financial payment account; a transaction handler configured to monitor transactions in the financial payment account of the user and detect a transaction that satisfies a set of requirements of the offer, when an authorization request for the transaction is being processed by the transaction handler; a message broker configured to generate a message in response to the transaction being detected; and a media controller coupled with the message broker to communicate the message to a social networking site.

Journal ArticleDOI
TL;DR: The state-of-the-art algorithms and applications in distributed data mining are surveyed and future research opportunities are discussed.
Abstract: Most data mining approaches assume that the data can be provided from a single source. If data are produced at many physically distributed locations, as at Wal-Mart, these methods require a data center which gathers data from the distributed locations. Sometimes, transmitting large amounts of data to a data center is expensive and even impractical. Therefore, distributed and parallel data mining algorithms were developed to solve this problem. In this paper, we survey the state-of-the-art algorithms and applications in distributed data mining and discuss future research opportunities.

Patent
07 Feb 2012
TL;DR: In this paper, an elastic, massively parallel processing (MPP) data warehouse leveraging a cloud computing system is described, where queries received via one or more API endpoints are decomposed into parallelizable subqueries and executed across a heterogeneous set of demand-instantiable computing units.
Abstract: In one embodiment, an elastic, massively parallel processing (MPP) data warehouse leveraging a cloud computing system is disclosed. Queries received via one or more API endpoints are decomposed into parallelizable subqueries and executed across a heterogeneous set of demand-instantiable computing units. Available computing units vary in capacity, storage, memory, bandwidth, and hardware; the specific mix of computing units instantiated is determined dynamically according to the specifics of the query. Better performance is obtained by modifying the mix of instantiated computing units according to a machine learning algorithm.
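A conceptual Python sketch of the decomposition pattern (not the patented system): an aggregate "query" is split into per-partition subqueries that run in parallel workers, and their partial results are combined, giving the same answer as a single-node computation. The partitioning and function names are illustrative.

from concurrent.futures import ProcessPoolExecutor

def partial_sum_and_count(partition):
    # Subquery executed independently on one data partition.
    return sum(partition), len(partition)

def parallel_avg(partitions):
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(partial_sum_and_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    partitions = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]
    print(parallel_avg(partitions))   # same result as a single-node AVG over all rows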

Journal ArticleDOI
01 Jun 2012
TL;DR: Search over DAta Warehouse (SODA) as discussed by the authors enables a Google-like search experience for data warehouses by taking keyword queries of business users and automatically generating executable SQL, which makes it much easier for business users to interactively explore highly-complex data warehouses.
Abstract: The purpose of data warehouses is to enable business analysts to make better decisions. Over the years the technology has matured and data warehouses have become extremely successful. As a consequence, more and more data has been added to the data warehouses and their schemas have become increasingly complex. These systems still work great in order to generate pre-canned reports. However, with their current complexity, they tend to be a poor match for non-tech-savvy business analysts who need answers to ad-hoc queries that were not anticipated. This paper describes the design, implementation, and experience of the SODA system (Search over DAta Warehouse). SODA bridges the gap between the business needs of analysts and the technical complexity of current data warehouses. SODA enables a Google-like search experience for data warehouses by taking keyword queries of business users and automatically generating executable SQL. The key idea is to use a graph pattern matching algorithm that uses the metadata model of the data warehouse. Our results with real data from a global player in the financial services industry show that SODA produces queries with high precision and recall, and makes it much easier for business users to interactively explore highly-complex data warehouses.
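As a toy illustration of the metadata-driven idea (not SODA's actual graph pattern matching algorithm), the Python sketch below maps keywords onto tables and columns recorded in a small metadata model and emits an executable SQL string; every table, column and join shown is made up.

METADATA = {
    # keyword -> (table, column); None means the keyword names a whole table.
    "customer": ("customer", None),
    "revenue":  ("orders", "revenue"),
    "country":  ("customer", "country"),
}
JOINS = {("customer", "orders"): "customer.id = orders.customer_id"}

def keywords_to_sql(keywords):
    hits = [METADATA[k] for k in keywords if k in METADATA]
    tables = sorted({t for t, _ in hits})
    columns = [f"{t}.{c}" for t, c in hits if c] or ["*"]
    sql = f"SELECT {', '.join(columns)} FROM {', '.join(tables)}"
    # Add join predicates for every pair of referenced tables known to the model.
    preds = [JOINS[pair] for pair in JOINS if set(pair) <= set(tables)]
    if preds:
        sql += " WHERE " + " AND ".join(preds)
    return sql

print(keywords_to_sql(["revenue", "country"]))
# SELECT orders.revenue, customer.country FROM customer, orders WHERE customer.id = orders.customer_id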

Journal ArticleDOI
TL;DR: Support Vector Machines (SVM) are established as the best classifier, with maximum accuracy and minimum root mean square error (RMSE), and the Radial Basis Kernel is identified as the best choice for the Support Vector Machine.
Abstract: Educational organizations are an important part of our society and play a vital role in the growth and development of any nation. Data mining is an emerging technique with the help of which one can efficiently learn from historical data and use that knowledge to predict the future behavior of areas of concern. The growth of the current education system would surely be enhanced if data mining were adopted as a futuristic strategic management tool. Data mining can facilitate better resource utilization in terms of student performance, course development and, finally, the development of a nation's education-related standards. In this paper, student data from a community college database have been taken, various classification approaches have been performed and a comparative analysis has been done. In this research work, Support Vector Machines (SVM) are established as the best classifier, with maximum accuracy and minimum root mean square error (RMSE). The study also includes a comparative analysis of all Support Vector Machine kernel types, and the Radial Basis Kernel is identified as the best choice for the Support Vector Machine. A decision tree approach is proposed which may be taken as an important basis for the selection of students for any course program. The paper aims to develop faith in data mining techniques so that the present education and business systems may adopt them as a strategic management tool. Considering the global opportunities coupled with global competition, even in the case of education it is essential to admit the best students as far as possible, so that their academic performance and subsequent placements are the best in the world. Data mining is useful whenever a system is dealing with large data sets. In any education system, student records, i.e. enrollment details, course eligibility criteria, course interest and academic performance, may be an important consideration for analyzing various trends; since all these systems are now computer-based information systems, data availability, modification and updating are a common process. Data warehousing may be taken as a good choice for maintaining records of past history. A data warehouse can be easily developed in any education institute with the adoption of a common data standard. Common data standards may eliminate the need for data clarification and modification before loading the data into a data warehouse. An institute with an efficient data warehousing and data mining approach can find novel ways of improving student behavior, success rates and course popularity. All these efforts may finally improve the quality of education, student intake, career counseling and the overall practices of the education system. In data mining, classification, clustering and regression are the three key approaches. Classification is a supervised learning approach in which students are grouped into identified classes (1). Classification rules may be identified from a part of the data known as training data, and further tested on the rest of the data (2). The effectiveness of a classification approach may be evaluated in terms of the reliability of the rules on the test data set. The clustering approach is based on unsupervised learning because there are no predefined classes; in this approach data may be grouped together as clusters (2), (3). The usability of the clusters in terms of the relevant area may be interpreted by a data mining expert.
Regression is a data mining approach that uses explanatory variables to predict an outcome variable. For example, the performance appraisal of faculty members may be done by regression analysis: faculty qualification, feedback rating and the amount of content covered may be taken as explanatory variables, and faculty salary, increment, bonus and perks may be estimated as outcome variables, so regression may be the best way to set a few important parameters based on existing variables (2).
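A hedged sketch of the comparison idea using scikit-learn on a public dataset (the community-college data is not available): accuracy and RMSE for SVM classifiers with linear, RBF and polynomial kernels, plus a decision tree for reference. The dataset choice and hyperparameters are assumptions for illustration only.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "SVM-linear": SVC(kernel="linear"),
    "SVM-rbf": SVC(kernel="rbf", gamma="scale"),
    "SVM-poly": SVC(kernel="poly", degree=3),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    acc = accuracy_score(y_te, pred)
    rmse = np.sqrt(mean_squared_error(y_te, pred))   # RMSE on 0/1 class labels
    print(f"{name:>12}: accuracy={acc:.3f}  RMSE={rmse:.3f}")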

Patent
13 Jul 2012
TL;DR: In this paper, a system and method for automatically reconfiguring a computing environment comprises a consumption analysis server, a placement server and a deployment server in communication with a set of virtual machine monitors.
Abstract: A system and method for automatically reconfiguring a computing environment comprises a consumption analysis server, a placement server, a deployment server in communication with a set of virtual machine monitors and a data warehouse in communication with a set of data collection agents, and a database. The consumption analysis server operates on measured resource utilization data in the data warehouse to yield a set of resource consumptions, available capacities and host and virtual machine configurations from the computing environment. The deployment server continuously monitors an event triggering condition and when the triggering condition is met, the placement server assigns a set of target virtual machines to a target set of hosts in a new placement and the deployment server implements the new placement through communication with the set of virtual machine monitors. The placement server right-sizes the virtual machines and the target set of hosts.

Patent
02 Mar 2012
TL;DR: In this paper, the authors present a system that includes a temporal data warehouse and a platform independent data warehouse load application that uses timestamp data from incoming data in conjunction with a relational algebra of set operators to identify and sequence net changes between the incoming data and data previously stored within the data warehouse.
Abstract: A system disclosed includes a temporal data warehouse and a platform independent temporal data warehouse load application operable to run on the system. The load application uses timestamp data from incoming data in conjunction with a relational algebra of set operators to identify and sequence net changes between the incoming data and data previously stored within the data warehouse. The load application loads the identified and sequenced net changes into the data warehouse with relatively little intrusion into normal operation of the data warehouse. Optimizations, including but not limited to, distinct partitioning of the workload into parallel streams are selectable via metadata.
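A conceptual pandas sketch of the net-change idea (not the patented load application): incoming rows are sequenced by timestamp, reduced to the latest version per key, and compared against the current warehouse state with a set-style difference so that only genuine inserts and updates remain. The schema and values are made up.

import pandas as pd

warehouse = pd.DataFrame({
    "key": [1, 2, 3],
    "value": ["a", "b", "c"],
    "ts": pd.to_datetime(["2012-01-01", "2012-01-01", "2012-01-02"]),
})
incoming = pd.DataFrame({
    "key": [2, 2, 3, 4],
    "value": ["b", "b2", "c", "d"],
    "ts": pd.to_datetime(["2012-01-01", "2012-01-05", "2012-01-02", "2012-01-06"]),
})

# Sequence by timestamp and keep only the most recent incoming row per key.
latest = incoming.sort_values("ts").drop_duplicates("key", keep="last")

# Set difference against the current warehouse state: rows whose (key, value, ts)
# already exist are unchanged and can be skipped; the rest are the net changes.
merged = latest.merge(warehouse, on=["key", "value", "ts"], how="left", indicator=True)
net_changes = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(net_changes)   # key 2 updated to "b2", key 4 inserted; key 3 unchanged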

Book ChapterDOI
03 Sep 2012
TL;DR: This paper proposes to model ETL processes using the standard representation mechanism denoted BPMN (Business Process Modeling and Notation), based on a classification of ETL objects resulting from a study of the most used commercial and open source ETL tools.
Abstract: Business Intelligence (BI) solutions require the design and implementation of complex processes (denoted ETL) that extract, transform, and load data from the sources to a common repository. New applications, like for example, real-time data warehousing, require agile and flexible tools that allow BI users to take timely decisions based on extremely up-to-date data. This calls for new ETL tools able to adapt to constant changes and quickly produce and modify executable code. A way to achieve this is to make ETL processes become aware of the business processes in the organization, in order to easily identify which data are required, and when and how to load them in the data warehouse. Therefore, we propose to model ETL processes using the standard representation mechanism denoted BPMN (Business Process Modeling and Notation). In this paper we present a BPMN-based metamodel for conceptual modeling of ETL processes. This metamodel is based on a classification of ETL objects resulting from a study of the most used commercial and open source ETL tools.

Patent
29 May 2012
TL;DR: In this article, a tax analysis data model, preloaded key performance indicator queries, and a query interface are described, along with a tax data warehouse that facilitates real-time processing of KPI queries.
Abstract: Systems, methods, and other embodiments associated with a tax analysis tool are described. In one embodiment, a method includes providing a tax analysis data model, preloaded key performance indicator queries, and a query interface. The example method may also include extracting tax-related data from a tax analysis database and transforming and loading the tax-related data into a tax data warehouse that facilitates real-time processing of KPI queries.

Journal Article
01 Jun 2012-BMJ
TL;DR: The implementation of i2b2-SSR for the multi-site, multi-stakeholder CARRA Registry has established a digital infrastructure for community-driven research data sharing in pediatric rheumatology in the USA.
Abstract: Objective Registries are a well-established mechanism for obtaining high quality, disease-specific data, but are often highly project-specific in their design, implementation, and policies for data use. In contrast to the conventional model of centralized data contribution, warehousing, and control, we design a self-scaling registry technology for collaborative data sharing, based upon the widely adopted Integrating Biology & the Bedside (i2b2) data warehousing framework and the Shared Health Research Information Network (SHRINE) peer-to-peer networking software. Materials and methods Focusing our design around creation of a scalable solution for collaboration within multi-site disease registries, we leverage the i2b2 and SHRINE open source software to create a modular, ontology-based, federated infrastructure that provides research investigators full ownership and access to their contributed data while supporting permissioned yet robust data sharing. We accomplish these objectives via web services supporting peer-group overlays, group-aware data aggregation, and administrative functions.

Journal ArticleDOI
TL;DR: It is argued that the maturity of a data warehousing process (DWP) could significantly mitigate such large-scale failures and ensure the delivery of consistent, high quality, “single-version of truth” data in a timely manner.
Abstract: Even though data warehousing (DW) requires huge investments, the data warehouse market is experiencing incredible growth. However, a large number of DW initiatives end up as failures. In this paper, we argue that the maturity of a data warehousing process (DWP) could significantly mitigate such large-scale failures and ensure the delivery of consistent, high quality, “single-version of truth” data in a timely manner. However, unlike software development, the assessment of DWP maturity has not yet been tackled in a systematic way. In light of the critical importance of data as a corporate resource, we believe that the need for a maturity model for DWP could not be greater. In this paper, we describe the design and development of a five-level DWP maturity model (DWP-M) over a period of three years. A unique aspect of this model is that it covers processes in both data warehouse development and operations. Over 20 key DW executives from 13 different corporations were involved in the model development process. The final model was evaluated by a panel of experts; the results strongly validate the functionality, productivity, and usability of the model. We present the initial and final DWP-M model versions, along with illustrations of several key process areas at different levels of maturity.