
Showing papers on "Data warehouse published in 2012"


Journal ArticleDOI
01 Jan 2012-Database
TL;DR: YeastMine is a multifaceted search and retrieval environment that provides access to diverse data types and offers multiple scenarios in which it can be used such as a powerful search interface, a discovery tool, a curation aid and also a complex database presentation format.
Abstract: The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) provides high-quality curated genomic, genetic, and molecular information on the genes and their products of the budding yeast Saccharomyces cerevisiae. To accommodate the increasingly complex, diverse needs of researchers for searching and comparing data, SGD has implemented InterMine (http://www.InterMine.org), an open source data warehouse system with a sophisticated querying interface, to create YeastMine (http://yeastmine.yeastgenome.org). YeastMine is a multifaceted search and retrieval environment that provides access to diverse data types. Searches can be initiated with a list of genes, a list of Gene Ontology terms, or lists of many other data types. The results from queries can be combined for further analysis and saved or downloaded in customizable file formats. Queries themselves can be customized by modifying predefined templates or by creating a new template to access a combination of specific data types. YeastMine offers multiple scenarios in which it can be used such as a powerful search interface, a discovery tool, a curation aid and also a complex database presentation format. DATABASE URL: http://yeastmine.yeastgenome.org.

281 citations


Book
10 Aug 2012
TL;DR: A uniform logical framework for dealing with fundamental issues underlying central aspects of data quality, namely, data consistency, data deduplication, data accuracy, data currency, and information completeness is promoted, based on data quality rules.
Abstract: Data quality is one of the most important problems in data management. A database system typically aims to support the creation, maintenance, and use of large amounts of data, focusing on the quantity of data. However, real-life data are often dirty: inconsistent, duplicated, inaccurate, incomplete, or stale. Dirty data in a database routinely generate misleading or biased analytical results and decisions, and lead to loss of revenues, credibility and customers. With this comes the need for data quality management. In contrast to traditional data management tasks, data quality management enables the detection and correction of errors in the data, syntactic or semantic, in order to improve the quality of the data and hence, add value to business processes. While data quality has been a longstanding problem for decades, the prevalent use of the Web has increased the risks, on an unprecedented scale, of creating and propagating dirty data. This monograph gives an overview of fundamental issues underlying central aspects of data quality, namely, data consistency, data deduplication, data accuracy, data currency, and information completeness. We promote a uniform logical framework for dealing with these issues, based on data quality rules. The text is organized into seven chapters, focusing on relational data. Chapter One introduces data quality issues. A conditional dependency theory is developed in Chapter Two, for capturing data inconsistencies. It is followed by practical techniques in Chapter Three for discovering conditional dependencies, and for detecting inconsistencies and repairing data based on conditional dependencies. Matching dependencies are introduced in Chapter Four, as matching rules for data deduplication. A theory of relative information completeness is studied in Chapter Five, revising the classical Closed World Assumption and the Open World Assumption, to characterize incomplete information in the real world. A data currency model is presented in Chapter Six, to identify the current values of entities in a database and to answer queries with the current values, in the absence of reliable timestamps. Finally, interactions between these data quality issues are explored in Chapter Seven. Important theoretical results and practical algorithms are covered, but formal proofs are omitted. The bibliographical notes contain pointers to papers in which the results were presented and proven, as well as references to materials for further reading. This text is intended for a seminar course at the graduate level. It also serves as a useful resource for researchers and practitioners who are interested in the study of data quality. The fundamental research on data quality draws on several areas, including mathematical logic, computational complexity and database theory. It has raised as many questions as it has answered, and is a rich source of questions and vitality. Table of Contents: Data Quality: An Overview / Conditional Dependencies / Cleaning Data with Conditional Dependencies / Data Deduplication / Information Completeness / Data Currency / Interactions between Data Quality Issues

264 citations


Proceedings ArticleDOI
30 Mar 2012
TL;DR: Sieve, a framework for flexibly expressing quality assessment and fusion methods, is presented; it is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion.
Abstract: The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonic judgement that is suitable for each and every task is not feasible. To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion. We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.
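A rough Python sketch of what a quality assessment and fusion step does conceptually: each source's value is scored by a weighted combination of quality indicators and conflicts are resolved by keeping the highest-scored value. The indicator names, weights and DBpedia-style labels below are illustrative assumptions; this is not Sieve's actual configuration language or the LDIF API.

from dataclasses import dataclass

@dataclass
class SourceValue:
    source: str          # source label, e.g. "dbpedia-en" (illustrative)
    value: str
    recency: float       # 0..1, 1 = most recently updated
    completeness: float  # 0..1, fraction of expected fields the source filled

def assess(sv, weights=(0.6, 0.4)):
    # Scoring function: weighted sum of the quality indicators (weights assumed).
    w_rec, w_comp = weights
    return w_rec * sv.recency + w_comp * sv.completeness

def fuse(conflicting):
    # Fusion policy: keep the value coming from the best-scored source.
    return max(conflicting, key=assess).value

population = [
    SourceValue("dbpedia-en", "10,435,000", recency=0.9, completeness=0.80),
    SourceValue("dbpedia-pt", "10,562,178", recency=0.5, completeness=0.95),
]
print(fuse(population))   # prints the value from the higher-scored source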

263 citations


Journal ArticleDOI
TL;DR: Using InterMine, large biological databases can be created from a range of heterogeneous data sources, and the extensible data model allows for easy integration of new data types.
Abstract: Summary: InterMine is an open-source data warehouse system that facilitates the building of databases with complex data integration requirements and a need for a fast customizable query facility. Using InterMine, large biological databases can be created from a range of heterogeneous data sources, and the extensible data model allows for easy integration of new data types. The analysis tools include a flexible query builder, genomic region search and a library of ‘widgets’ performing various statistical analyses. The results can be exported in many commonly used formats. InterMine is a fully extensible framework where developers can add new tools and functionality. Additionally, there is a comprehensive set of web services, for which client libraries are provided in five commonly used programming languages. Availability: Freely available from http://www.intermine.org under the LGPL license. Contact: g.micklem@gen.cam.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
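For readers who want to try the web services mentioned above, here is a minimal query sketch using the InterMine Python client (pip install intermine). The calls follow the client's documented query-builder pattern, but the service URL, class name, attribute names and the constraint shown are illustrative assumptions rather than a verified example.

from intermine.webservice import Service

# YeastMine service endpoint (assumed); any InterMine instance exposes the same pattern.
service = Service("https://yeastmine.yeastgenome.org/yeastmine/service")

query = service.new_query("Gene")                       # root class of the query
query.add_view("primaryIdentifier", "symbol", "name")   # columns to return
query.add_constraint("symbol", "=", "ACT1")             # hypothetical example constraint

for row in query.rows():
    print(row["primaryIdentifier"], row["symbol"], row["name"])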

224 citations


Book
16 Nov 2012
TL;DR: This monograph provides an accessible introduction and reference to materialized views, explains their core ideas, highlights their recent developments, and points out their sometimes subtle connections to other research topics in databases.
Abstract: Materialized views are a natural embodiment of the ideas of precomputation and caching in databases. Instead of computing a query from scratch, a system can use results that have already been computed, stored, and kept in sync with database updates. The ability of materialized views to speed up queries benefits most database applications, ranging from traditional querying and reporting to web database caching, online analytical processing, and data mining. By reducing dependency on the availability of base data, materialized views have also laid much of the foundation for information integration and data warehousing systems. The database tradition of declarative querying distinguishes materialized views from generic applications of precomputation and caching in other contexts, and makes materialized views especially interesting, powerful, and challenging at the same time. Study of materialized views has generated a rich research literature and mature commercial implementations, aimed at providing efficient, effective, automated, and general solutions to the selection, use, and maintenance of materialized views. This monograph provides an accessible introduction and reference to materialized views, explains their core ideas, highlights their recent developments, and points out their sometimes subtle connections to other research topics in databases.

172 citations


Patent
05 Sep 2012
TL;DR: In this article, a data mining marshaller module associates each plugin to a particular data source and manages the plugin to periodically retrieve unstructured data from the data source based on a plurality of data items to be monitored on behalf of users.
Abstract: A system and method for collecting and processing data over a communications network. A data mining marshaller module associates each plugin to a particular data source and manages the plugin to periodically retrieve unstructured data from the data source based on a plurality of data items to be monitored on behalf of a plurality of users. The plugins convert unstructured data received from the data sources to structured data and the data marshaller module stores the structured data in a database. This enables the system and method to aggregate and display the structured data in multiple graphical representations according to the user's preference.

155 citations


Proceedings ArticleDOI
13 Mar 2012
TL;DR: This paper presents a systematic review of data quality dimensions for use in a proposed framework that combines data mining and statistical techniques to measure dependencies among dimensions and to illustrate how the extracted knowledge can increase process quality.
Abstract: Nowadays, activities and decision making in an organization are based on data and on information obtained from data analysis, which supports the construction of reliable and accurate processes. Because data are a significant resource in every organization, data quality is critical for managers and operational processes when identifying related performance issues. Moreover, high-quality data can increase the opportunity to deliver top-quality services in an organization. However, identifying the various aspects of data quality, from definitions and dimensions to types, strategies and techniques, is essential for equipping the methods and processes used to improve data. This paper focuses on a systematic review of data quality dimensions for use in a proposed framework that combines data mining and statistical techniques to measure dependencies among dimensions and to illustrate how extracting knowledge can increase process quality.

154 citations


Journal ArticleDOI
01 Oct 2012
TL;DR: This essay contends that a new vision for the IS discipline should address the challenges facing IS departments, and discusses the role of IS curricula and program development in delivering BI&A education.
Abstract: “Big Data,” huge volumes of data in both structured and unstructured forms generated by the Internet, social media, and computerized transactions, is straining our technical capacity to manage it. More importantly, the new challenge is to develop the capability to understand and interpret the burgeoning volume of data to take advantage of the opportunities it provides in many human endeavors, ranging from science to business. Data Science, and in business schools, Business Intelligence and Analytics (BI&A) are emerging disciplines that seek to address the demands of this new era. Big Data and BI&A present unique challenges and opportunities not only for the research community, but also for Information Systems (IS) programs at business schools. In this essay, we provide a brief overview of BI&A, speculate on the role of BI&A education in business schools, present the challenges facing IS departments, and discuss the role of IS curricula and program development in delivering BI&A education. We contend that a new vision for the IS discipline should address these challenges.

151 citations


Proceedings ArticleDOI
20 May 2012
TL;DR: Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data.
Abstract: Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data. It scales to thousands of nodes in a fault-tolerant manner. Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop on large datasets.

144 citations


Patent
Leigh Amaro1, Parag Ladhawala1
15 Mar 2012
TL;DR: In this article, a system and method configured to provide enhanced services based on check-in information obtained in a social network system and transaction location information observed in a payment processing system is presented.
Abstract: A system and method configured to provide enhanced services based on check-in information obtained in a social network system and transaction location information observed in a payment processing system. In one aspect, the transaction location may be used to validate, verify or authenticate the check-in location declared in the social network system. In another aspect, the transaction location can be used as a basis to automate a check-in in the social network system in accordance with a preference of a user. In a further aspect, the transaction location and the check-in location can be correlated to detect inaccurate data, correct the inaccurate data, and/or augment the data in a data warehouse about the locations of transaction terminals.

139 citations


Book ChapterDOI
01 Jan 2012
TL;DR: This chapter introduces the basic concepts of data preprocessing and organizes the methods for data preprocessing into the following categories: data cleaning, data integration, data reduction, and data transformation.
Abstract: Publisher Summary This chapter introduces the basic concepts of data preprocessing and the methods for data preprocessing are organized into the following categories: data cleaning, data integration, data reduction, and data transformation. Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability. There are several data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in data. Data integration merges data from multiple sources into a coherent data store such as a data warehouse. Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering. Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range. This can improve the accuracy and efficiency of mining algorithms involving distance measurements. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format. The different attribute types and data characteristics can help identify erroneous values and outliers, which will be useful in the data cleaning and integration steps. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining.
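As a minimal, self-contained illustration (not taken from the chapter) of two of the steps it describes, the Python/pandas sketch below fills a missing value during cleaning and then applies min-max normalization; all column names and values are made up.

import pandas as pd

df = pd.DataFrame({
    "age": [23, None, 31, 45],                        # missing value to clean
    "income": [30000.0, 58000.0, 91000.0, 72000.0],
})

# Data cleaning: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Data transformation: min-max normalization scales values into [0, 1],
# which helps mining algorithms that rely on distance measurements.
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)
print(df)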

Journal ArticleDOI
TL;DR: This paper presents research that used bibliometric content analysis to apply the HMPR guidelines to a representative sample of 362 DSS design-science research papers in 14 journals, highlighting major issues in DSS research that need attention.
Abstract: Design science has been an important strategy in decision support systems (DSS) research since the field’s inception in the early 1970s. Recent reviews of DSS research have indicated a need to improve its quality and relevance. DSS design-science research has an important role in this improvement because design-science research can engage industry and the profession in intellectually important projects. The Hevner, March, Park, and Ram (HMPR) guidelines for the conduct and assessment of information systems design-science research, published in MIS Quarterly in 2004, provide a vehicle for assessing DSS design-science research. This paper presents research that used bibliometric content analysis to apply the HMPR guidelines to a representative sample of 362 DSS design-science research papers in 14 journals. The analysis highlights major issues in DSS research that need attention: research design, evaluation, relevance, strategic focus, and theorizing.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A compiler framework, Kernel Weaver, is proposed that can automatically fuse relational algebra operators, thereby eliminating redundant data movement; key insights, lessons learned, measurements from the compiler implementation, and opportunities for further improvements are presented.
Abstract: Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to i) reduce data footprint to cut down data movement throughout GPU and CPU memory hierarchy, and ii) enlarge compiler optimization scope. We classify producer consumer dependences between compute kernels into three types, i) fine-grained thread-to-thread dependences, ii) medium-grained thread block dependences, and iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators thereby eliminating redundant data movement. The experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the micro-benchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.
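The data-movement argument generalizes beyond GPUs. The pure-Python sketch below is only an analogue of the idea (not the paper's CUDA implementation): fusing two operator bodies into a single pass removes the intermediate result and the extra traversal over it.

data = list(range(1_000_000))

# Unfused: operator 1 materializes an intermediate list, operator 2 re-reads it.
selected = [x for x in data if x % 3 == 0]        # selection "kernel"
total_unfused = sum(x * 2 for x in selected)      # projection/aggregation "kernel"

# Fused: both operator bodies run in a single pass, no intermediate is stored.
total_fused = sum(x * 2 for x in data if x % 3 == 0)

assert total_unfused == total_fused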

Patent
15 Jun 2012
TL;DR: In this article, a power context system is described that makes decisions related to device power usage based on factors such as location, load, available alternatives, cost of power, and cost of bandwidth.
Abstract: A power context system is described herein that makes decisions related to device power usage based on factors such as location, load, available alternatives, cost of power, and cost of bandwidth. The system incorporates contextual knowledge about the situation in which a device is being used. Using the context of location, devices can make smarter decisions about deciding which processes to migrate to the cloud, load balancing between applications, and switching to power saving modes depending on how far the user is from a power source. As the cloud becomes more frequently used, load balancing by utilizing distributed data warehouses to move processes to different locations in the world depending on factors such as accessibility, locales, and cost of electricity are considerations for power management. Power management of mobile devices is becoming important as integration with the cloud yields expectations of devices being able to reliably access and persist data.

Proceedings ArticleDOI
01 Jun 2012
TL;DR: This paper applies the Apriori algorithm to a database containing academic records of various students and tries to extract association rules in order to profile students based on various parameters like exam scores, term work grades, attendance and practical exams.
Abstract: Data mining is a process of identifying and extracting hidden patterns and information from databases and data warehouses. There are various algorithms and tools available for this purpose. Data mining has a vast range of applications ranging from business to medicine to engineering. In this paper, we discuss the application of data mining in education for student profiling and grouping. We make use of the Apriori algorithm for student profiling, which is one of the popular approaches for mining associations, i.e. discovering correlations among sets of items. The other algorithm used, for grouping students, is K-means clustering, which assigns a set of observations into subsets. In the field of academics, data mining can be very useful in discovering valuable information which can be used for profiling students based on their academic record. We apply the Apriori algorithm to the database containing academic records of various students and try to extract association rules in order to profile students based on various parameters like exam scores, term work grades, attendance and practical exams. We also apply K-means clustering to the same set of data in order to group the students. The implemented algorithms offer an effective way of profiling students which can be used in educational systems.
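A toy sketch of the two techniques on made-up student records (the flags and scores are illustrative, not the authors' dataset): a minimal Apriori-style pass that keeps itemsets above a support threshold and builds pair candidates only from the frequent single items, plus K-means from scikit-learn to group students by score.

from itertools import combinations
from sklearn.cluster import KMeans

# Each "transaction" lists the performance flags one student satisfies.
students = [
    {"high_exam", "good_termwork", "high_attendance", "pass_practical"},
    {"high_exam", "good_termwork", "pass_practical"},
    {"pass_practical"},
    {"high_exam", "good_termwork", "high_attendance", "pass_practical"},
    {"good_termwork", "high_attendance"},
    {"high_exam", "good_termwork", "high_attendance", "pass_practical"},
]
min_support = 0.5
n = len(students)

def support(itemset):
    return sum(itemset <= s for s in students) / n

# Apriori level 1: frequent single items.
items = {i for s in students for i in s}
L1 = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

# Apriori level 2: candidate pairs built only from frequent single items (pruning).
L2 = {a | b for a, b in combinations(L1, 2) if support(a | b) >= min_support}
for itemset in L2:
    print(sorted(itemset), round(support(itemset), 2))

# K-means on numeric scores groups students into performance clusters.
scores = [[85, 90], [78, 82], [40, 35], [92, 88], [55, 60], [88, 91]]
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores))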

Journal ArticleDOI
TL;DR: MitoMiner can be used to characterize the variability of the mitochondrial proteome between tissues and investigate how changes in the proteome may contribute to mitochondrial dysfunction and mitochondrial-associated diseases such as cancer, neurodegenerative diseases, obesity, diabetes, heart failure and the ageing process.
Abstract: MitoMiner (http://mitominer.mrc-mbu.cam.ac.uk/) is a data warehouse for the storage and analysis of mitochondrial proteomics data gathered from publications of mass spectrometry and green fluorescent protein tagging studies. In MitoMiner, these data are integrated with data from UniProt, Gene Ontology, Online Mendelian Inheritance in Man, HomoloGene, Kyoto Encyclopaedia of Genes and Genomes and PubMed. The latest release of MitoMiner stores proteomics data sets from 46 studies covering 11 different species from eumetazoa, viridiplantae, fungi and protista. MitoMiner is implemented by using the open source InterMine data warehouse system, which provides a user interface allowing users to upload data for analysis, personal accounts to store queries and results and enables queries of any data in the data model. MitoMiner also provides lists of proteins for use in analyses, including the new MitoMiner mitochondrial proteome reference sets that specify proteins with substantial experimental evidence for mitochondrial localization. As further mitochondrial proteomics data sets from normal and diseased tissue are published, MitoMiner can be used to characterize the variability of the mitochondrial proteome between tissues and investigate how changes in the proteome may contribute to mitochondrial dysfunction and mitochondrial-associated diseases such as cancer, neurodegenerative diseases, obesity, diabetes, heart failure and the ageing process.

Book ChapterDOI
27 Aug 2012
TL;DR: Insight is presented from interviews with seven established vendors about their key challenges with regard to pricing strategies in different market situations and associated research problems for the business intelligence community.
Abstract: Currently, multiple data vendors utilize the cloud-computing paradigm for trading raw data, associated analytical services, and analytic results as a commodity good. We observe that these vendors often move the functionality of data warehouses to cloud-based platforms. On such platforms, vendors provide services for integrating and analyzing data from public and commercial data sources. We present insights from interviews with seven established vendors about their key challenges with regard to pricing strategies in different market situations and derive associated research problems for the business intelligence community.

Journal Article
TL;DR: In this paper, the authors give an overview of data mining systems and some of their applications in different fields, and survey data mining tools and their applications in science and engineering.
Abstract: Today, multinational companies and large organizations have operations in many places around the world. Each place of operation may generate large volumes of data. Corporate decision makers require access to all such sources in order to take strategic decisions. Information and communication technologies are widely used in industry. The data warehouse delivers significant business value by improving the effectiveness of managerial decision-making. In an uncertain and highly competitive business environment, the value of strategic information systems such as these is easily recognized; however, in today's business environment, efficiency or speed is not the only key to competitiveness. Such a tremendous amount of data, on the order of terabytes to petabytes, has fundamentally changed science and engineering, transforming many disciplines from data-poor to increasingly data-rich, and calling for new, data-intensive methods of conducting research in science and engineering. Analyzing this vast amount of data and drawing fruitful conclusions and inferences requires special tools called data mining tools. This paper gives an overview of data mining systems and some of their applications in different fields.

Patent
17 Aug 2012
TL;DR: In this paper, a system is provided for qualifying and analyzing data for at least one business intelligence, where a data management system transforms raw data and stores it, and an analytic engine is included.
Abstract: A system is provided for qualifying and analyzing data for at least one business intelligence. A platform receives source data. A data management system transforms raw data and stores it. An analytic engine is included. In operation the data management system receives first, second, and third streams of source data. The first stream is client source data, the second stream is public source data and the third stream is acquired by the data management system. The data management system organizes the first, second and third streams of data into items and their attributes. The analytic engine receives the items with their attributes from the data management system and applies logic to provide multi-dimensional analysis relative to a scale for at least one business intelligence.

Patent
Ayman Hammad1, Matthew Joy1, Joseph Spears1, Mark Carlson1, Patrick Stan1 
27 Mar 2012
TL;DR: In this article, the authors present a system to integrate offer processing in a social networking environment, which includes a data warehouse configured to store data associating social networking accounts of a user with a financial payment account of the user.
Abstract: Systems and methods to integrate offer processing in a social networking environment. A computing apparatus includes: a data warehouse configured to store data associating a social networking account of a user with a financial payment account of the user; a portal configured to provide an offer to the user of the social networking account and to store data associating the offer with the financial payment account; a transaction handler configured to monitor transactions in the financial payment account of the user and detect a transaction that satisfies a set of requirements of the offer, when an authorization request for the transaction is being processed by the transaction handler; a message broker configured to generate a message in response to the transaction being detected; and a media controller coupled with the message broker to communicate the message to a social networking site.

Journal ArticleDOI
TL;DR: The state-of-the-art algorithms and applications in distributed data mining are surveyed and future research opportunities are discussed.
Abstract: Most data mining approaches assume that the data can be provided from a single source. If data are produced at many physically distributed locations, as at Wal-Mart, these methods require a data center which gathers data from the distributed locations. Sometimes, transmitting large amounts of data to a data center is expensive and even impractical. Therefore, distributed and parallel data mining algorithms were developed to solve this problem. In this paper, we survey the state-of-the-art algorithms and applications in distributed data mining and discuss future research opportunities.

Patent
07 Feb 2012
TL;DR: In this paper, an elastic, massively parallel processing (MPP) data warehouse leveraging a cloud computing system is described, where queries received via one or more API endpoints are decomposed into parallelizable subqueries and executed across a heterogeneous set of demand-instantiable computing units.
Abstract: In one embodiment, an elastic, massively parallel processing (MPP) data warehouse leveraging a cloud computing system is disclosed. Queries received via one or more API endpoints are decomposed into parallelizable subqueries and executed across a heterogeneous set of demand-instantiable computing units. Available computing units vary in capacity, storage, memory, bandwidth, and hardware; the specific mix of computing units instantiated is determined dynamically according to the specifics of the query. Better performance is obtained by modifying the mix of instantiated computing units according to a machine learning algorithm.
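A conceptual Python sketch of the decomposition pattern (not the patented system): an aggregate "query" is split into per-partition subqueries that run in parallel workers, and their partial results are combined, giving the same answer as a single-node computation. The partitioning and function names are illustrative.

from concurrent.futures import ProcessPoolExecutor

def partial_sum_and_count(partition):
    # Subquery executed independently on one data partition.
    return sum(partition), len(partition)

def parallel_avg(partitions):
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(partial_sum_and_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    partitions = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]
    print(parallel_avg(partitions))   # same result as a single-node AVG over all rows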

Journal ArticleDOI
01 Jun 2012
TL;DR: Search over DAta Warehouse (SODA) as discussed by the authors enables a Google-like search experience for data warehouses by taking keyword queries of business users and automatically generating executable SQL, which makes it much easier for business users to interactively explore highly-complex data warehouses.
Abstract: The purpose of data warehouses is to enable business analysts to make better decisions. Over the years the technology has matured and data warehouses have become extremely successful. As a consequence, more and more data has been added to the data warehouses and their schemas have become increasingly complex. These systems still work great in order to generate pre-canned reports. However, with their current complexity, they tend to be a poor match for non-tech-savvy business analysts who need answers to ad-hoc queries that were not anticipated. This paper describes the design, implementation, and experience of the SODA system (Search over DAta Warehouse). SODA bridges the gap between the business needs of analysts and the technical complexity of current data warehouses. SODA enables a Google-like search experience for data warehouses by taking keyword queries of business users and automatically generating executable SQL. The key idea is to use a graph pattern matching algorithm that uses the metadata model of the data warehouse. Our results with real data from a global player in the financial services industry show that SODA produces queries with high precision and recall, and makes it much easier for business users to interactively explore highly-complex data warehouses.
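As a toy illustration of the metadata-driven idea (not SODA's actual graph pattern matching algorithm), the Python sketch below maps keywords onto tables and columns recorded in a small metadata model and emits an executable SQL string; every table, column and join shown is made up.

METADATA = {
    # keyword -> (table, column); None means the keyword names a whole table.
    "customer": ("customer", None),
    "revenue":  ("orders", "revenue"),
    "country":  ("customer", "country"),
}
JOINS = {("customer", "orders"): "customer.id = orders.customer_id"}

def keywords_to_sql(keywords):
    hits = [METADATA[k] for k in keywords if k in METADATA]
    tables = sorted({t for t, _ in hits})
    columns = [f"{t}.{c}" for t, c in hits if c] or ["*"]
    sql = f"SELECT {', '.join(columns)} FROM {', '.join(tables)}"
    # Add join predicates for every pair of referenced tables known to the model.
    preds = [JOINS[pair] for pair in JOINS if set(pair) <= set(tables)]
    if preds:
        sql += " WHERE " + " AND ".join(preds)
    return sql

print(keywords_to_sql(["revenue", "country"]))
# SELECT orders.revenue, customer.country FROM customer, orders WHERE customer.id = orders.customer_id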

Journal ArticleDOI
TL;DR: Support Vector Machines (SVM) are established as the best classifier, with maximum accuracy and minimum root mean square error (RMSE), and the Radial Basis Kernel is identified as the best choice for the Support Vector Machine.
Abstract: Educational organizations are an important part of our society and play a vital role in the growth and development of any nation. Data mining is an emerging technique with the help of which one can efficiently learn from historical data and use that knowledge to predict the future behavior of areas of concern. The growth of the current education system would surely be enhanced if data mining were adopted as a futuristic strategic management tool. Data mining can facilitate better resource utilization in terms of student performance, course development and, finally, the development of a nation's education-related standards. In this paper, student data from a community college database have been taken, various classification approaches have been performed and a comparative analysis has been done. In this research work, Support Vector Machines (SVM) are established as the best classifier, with maximum accuracy and minimum root mean square error (RMSE). The study also includes a comparative analysis of all Support Vector Machine kernel types, and the Radial Basis Kernel is identified as the best choice for the Support Vector Machine. A decision tree approach is proposed which may be taken as an important basis for the selection of students for any course program. The paper aims to develop faith in data mining techniques so that the present education and business systems may adopt them as a strategic management tool. Considering the global opportunities coupled with global competition, even in the case of education it is essential to admit the best students as far as possible, so that their academic performance and subsequent placements are the best in the world. Data mining is useful whenever a system is dealing with large data sets. In any education system, student records, i.e. enrollment details, course eligibility criteria, course interest and academic performance, may be an important consideration for analyzing various trends; since all these systems are now computer-based information systems, data availability, modification and updating are a common process. Data warehousing may be taken as a good choice for maintaining records of past history. A data warehouse can be easily developed in any education institute with the adoption of a common data standard. Common data standards may eliminate the need for data clarification and modification before loading the data into a data warehouse. An institute with an efficient data warehousing and data mining approach can find novel ways of improving student behavior, success rates and course popularity. All these efforts may finally improve the quality of education, student intake, career counseling and the overall practices of the education system. In data mining, classification, clustering and regression are the three key approaches. Classification is a supervised learning approach in which students are grouped into identified classes (1). Classification rules may be identified from a part of the data known as training data, and further tested on the rest of the data (2). The effectiveness of a classification approach may be evaluated in terms of the reliability of the rules on the test data set. The clustering approach is based on unsupervised learning because there are no predefined classes; in this approach data may be grouped together as clusters (2), (3). The usability of the clusters in terms of the relevant area may be interpreted by a data mining expert.
Regression is a data mining approach that uses explanatory variables to predict an outcome variable. For example, the performance appraisal of faculty members may be done by regression analysis: faculty qualification, feedback rating and the amount of content covered may be taken as explanatory variables, and faculty salary, increment, bonus and perks may be estimated as outcome variables, so regression may be the best way to set a few important parameters based on existing variables (2).
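A hedged sketch of the comparison idea using scikit-learn on a public dataset (the community-college data is not available): accuracy and RMSE for SVM classifiers with linear, RBF and polynomial kernels, plus a decision tree for reference. The dataset choice and hyperparameters are assumptions for illustration only.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "SVM-linear": SVC(kernel="linear"),
    "SVM-rbf": SVC(kernel="rbf", gamma="scale"),
    "SVM-poly": SVC(kernel="poly", degree=3),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    acc = accuracy_score(y_te, pred)
    rmse = np.sqrt(mean_squared_error(y_te, pred))   # RMSE on 0/1 class labels
    print(f"{name:>12}: accuracy={acc:.3f}  RMSE={rmse:.3f}")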

Patent
13 Jul 2012
TL;DR: In this paper, a system and method for automatically reconfiguring a computing environment comprises a consumption analysis server, a placement server and a deployment server in communication with a set of virtual machine monitors.
Abstract: A system and method for automatically reconfiguring a computing environment comprises a consumption analysis server, a placement server, a deployment server in communication with a set of virtual machine monitors and a data warehouse in communication with a set of data collection agents, and a database. The consumption analysis server operates on measured resource utilization data in the data warehouse to yield a set of resource consumptions, available capacities and host and virtual machine configurations from the computing environment. The deployment server continuously monitors an event triggering condition and when the triggering condition is met, the placement server assigns a set of target virtual machines to a target set of hosts in a new placement and the deployment server implements the new placement through communication with the set of virtual machine monitors. The placement server right-sizes the virtual machines and the target set of hosts.

Patent
02 Mar 2012
TL;DR: In this paper, the authors present a system that includes a temporal data warehouse and a platform independent data warehouse load application that uses timestamp data from incoming data in conjunction with a relational algebra of set operators to identify and sequence net changes between the incoming data and data previously stored within the data warehouse.
Abstract: A system disclosed includes a temporal data warehouse and a platform independent temporal data warehouse load application operable to run on the system. The load application uses timestamp data from incoming data in conjunction with a relational algebra of set operators to identify and sequence net changes between the incoming data and data previously stored within the data warehouse. The load application loads the identified and sequenced net changes into the data warehouse with relatively little intrusion into normal operation of the data warehouse. Optimizations, including but not limited to, distinct partitioning of the workload into parallel streams are selectable via metadata.
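A conceptual pandas sketch of the net-change idea (not the patented load application): incoming rows are sequenced by timestamp, reduced to the latest version per key, and compared against the current warehouse state with a set-style difference so that only genuine inserts and updates remain. The schema and values are made up.

import pandas as pd

warehouse = pd.DataFrame({
    "key": [1, 2, 3],
    "value": ["a", "b", "c"],
    "ts": pd.to_datetime(["2012-01-01", "2012-01-01", "2012-01-02"]),
})
incoming = pd.DataFrame({
    "key": [2, 2, 3, 4],
    "value": ["b", "b2", "c", "d"],
    "ts": pd.to_datetime(["2012-01-01", "2012-01-05", "2012-01-02", "2012-01-06"]),
})

# Sequence by timestamp and keep only the most recent incoming row per key.
latest = incoming.sort_values("ts").drop_duplicates("key", keep="last")

# Set difference against the current warehouse state: rows whose (key, value, ts)
# already exist are unchanged and can be skipped; the rest are the net changes.
merged = latest.merge(warehouse, on=["key", "value", "ts"], how="left", indicator=True)
net_changes = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(net_changes)   # key 2 updated to "b2", key 4 inserted; key 3 unchanged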

Book ChapterDOI
03 Sep 2012
TL;DR: This paper proposes to model ETL processes using the standard representation mechanism denoted BPMN (Business Process Modeling and Notation), based on a classification of ETL objects resulting from a study of the most used commercial and open source ETL tools.
Abstract: Business Intelligence (BI) solutions require the design and implementation of complex processes (denoted ETL) that extract, transform, and load data from the sources to a common repository. New applications, like for example, real-time data warehousing, require agile and flexible tools that allow BI users to take timely decisions based on extremely up-to-date data. This calls for new ETL tools able to adapt to constant changes and quickly produce and modify executable code. A way to achieve this is to make ETL processes become aware of the business processes in the organization, in order to easily identify which data are required, and when and how to load them in the data warehouse. Therefore, we propose to model ETL processes using the standard representation mechanism denoted BPMN (Business Process Modeling and Notation). In this paper we present a BPMN-based metamodel for conceptual modeling of ETL processes. This metamodel is based on a classification of ETL objects resulting from a study of the most used commercial and open source ETL tools.

Patent
29 May 2012
TL;DR: In this article, a tax analysis data model, preloaded key performance indicator queries, and a query interface are described, along with a tax data warehouse that facilitates real-time processing of KPI queries.
Abstract: Systems, methods, and other embodiments associated with a tax analysis tool are described. In one embodiment, a method includes providing a tax analysis data model, preloaded key performance indicator queries, and a query interface. The example method may also include extracting tax-related data from a tax analysis database and transforming and loading the tax-related data into a tax data warehouse that facilitates real-time processing of KPI queries.

Journal Article
01 Jun 2012-BMJ
TL;DR: The implementation of i2b2-SSR for the multi-site, multi-stakeholder CARRA Registry has established a digital infrastructure for community-driven research data sharing in pediatric rheumatology in the USA.
Abstract: Objective Registries are a well-established mechanism for obtaining high quality, disease-specific data, but are often highly project-specific in their design, implementation, and policies for data use. In contrast to the conventional model of centralized data contribution, warehousing, and control, we design a self-scaling registry technology for collaborative data sharing, based upon the widely adopted Integrating Biology & the Bedside (i2b2) data warehousing framework and the Shared Health Research Information Network (SHRINE) peer-to-peer networking software. Materials and methods Focusing our design around creation of a scalable solution for collaboration within multi-site disease registries, we leverage the i2b2 and SHRINE open source software to create a modular, ontology-based, federated infrastructure that provides research investigators full ownership and access to their contributed data while supporting permissioned yet robust data sharing. We accomplish these objectives via web services supporting peer-group overlays, group-aware data aggregation, and administrative functions.

Journal ArticleDOI
TL;DR: It is argued that the maturity of a data warehousing process (DWP) could significantly mitigate such large-scale failures and ensure the delivery of consistent, high quality, “single-version of truth” data in a timely manner.
Abstract: Even though data warehousing (DW) requires huge investments, the data warehouse market is experiencing incredible growth. However, a large number of DW initiatives end up as failures. In this paper, we argue that the maturity of a data warehousing process (DWP) could significantly mitigate such large-scale failures and ensure the delivery of consistent, high quality, “single-version of truth” data in a timely manner. However, unlike software development, the assessment of DWP maturity has not yet been tackled in a systematic way. In light of the critical importance of data as a corporate resource, we believe that the need for a maturity model for DWP could not be greater. In this paper, we describe the design and development of a five-level DWP maturity model (DWP-M) over a period of three years. A unique aspect of this model is that it covers processes in both data warehouse development and operations. Over 20 key DW executives from 13 different corporations were involved in the model development process. The final model was evaluated by a panel of experts; the results strongly validate the functionality, productivity, and usability of the model. We present the initial and final DWP-M model versions, along with illustrations of several key process areas at different levels of maturity.