
Showing papers on "Data warehouse" published in 2011


Journal ArticleDOI
06 May 2011
TL;DR: This paper examines a number of SQL and so-called "NoSQL" data stores designed to scale simple OLTP-style application loads over many servers, and contrasts the new systems on their data model, consistency mechanisms, storage mechanisms, durability guarantees, availability, query support, and other dimensions.
Abstract: In this paper, we examine a number of SQL and so-called "NoSQL" data stores designed to scale simple OLTP-style application loads over many servers. Originally motivated by Web 2.0 applications, these systems are designed to scale to thousands or millions of users doing updates as well as reads, in contrast to traditional DBMSs and data warehouses. We contrast the new systems on their data model, consistency mechanisms, storage mechanisms, durability guarantees, availability, query support, and other dimensions. These systems typically sacrifice some of these dimensions, e.g. database-wide transaction consistency, in order to achieve others, e.g. higher availability and scalability.

1,412 citations


Proceedings ArticleDOI
11 Apr 2011
TL;DR: This work presents an efficient hybrid system, called HyPer, that can handle both OLTP and OLAP simultaneously by using hardware-assisted replication mechanisms to maintain consistent snapshots of the transactional data.
Abstract: The two areas of online transaction processing (OLTP) and online analytical processing (OLAP) present different challenges for database architectures. Currently, customers with high rates of mission-critical transactions have split their data into two separate systems, one database for OLTP and one so-called data warehouse for OLAP. While allowing for decent transaction rates, this separation has many disadvantages including data freshness issues due to the delay caused by only periodically initiating the Extract Transform Load-data staging and excessive resource consumption due to maintaining two separate information systems. We present an efficient hybrid system, called HyPer, that can handle both OLTP and OLAP simultaneously by using hardware-assisted replication mechanisms to maintain consistent snapshots of the transactional data. HyPer is a main-memory database system that guarantees the ACID properties of OLTP transactions and executes OLAP query sessions (multiple queries) on the same, arbitrarily current and consistent snapshot. The utilization of the processor-inherent support for virtual memory management (address translation, caching, copy on update) yields both at the same time: unprecedentedly high transaction rates as high as 100000 per second and very fast OLAP query response times on a single system executing both workloads in parallel. The performance analysis is based on a combined TPC-C and TPC-H benchmark.
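A rough illustration of the virtual-memory snapshot idea described above: forking the database process lets the OS's copy-on-write page handling give the OLAP side a frozen, consistent view while OLTP keeps updating. This is only a minimal Python sketch of the mechanism, not HyPer's implementation (which is a C++ main-memory engine); the table contents are invented.

```python
import os

# Toy in-memory working set; in HyPer this is the OLTP data kept in RAM.
table = {"acct_1": 100, "acct_2": 250}

pid = os.fork()  # the OS marks pages copy-on-write; nothing is copied eagerly
if pid == 0:
    # Child = OLAP snapshot: it continues to see the state as of the fork,
    # even while the parent mutates its own copy (pages are duplicated lazily).
    print("[snapshot] total balance =", sum(table.values()))
    os._exit(0)
else:
    # Parent = OLTP: keeps applying updates to the live pages.
    table["acct_1"] += 50
    os.waitpid(pid, 0)
    print("[live] acct_1 =", table["acct_1"])
```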

674 citations


Book ChapterDOI
05 Oct 2011
TL;DR: This chapter contains sections titled: Introduction; Data-Mining Roots; Data-Mining Process; Large Data Sets; Data Warehouses for Data Mining; and Business Aspects of Data Mining: Why a Data-Mining Project Fails.
Abstract: This chapter contains sections titled: Introduction; Data-Mining Roots; Data-Mining Process; Large Data Sets; Data Warehouses for Data Mining; Business Aspects of Data Mining: Why a Data-Mining Project Fails; Organization of This Book; Review Questions and Problems; References for Further Study.

450 citations


Journal ArticleDOI
01 Jan 2011-Database
TL;DR: The BioMart Central Portal is a first-of-its-kind, community-driven effort to provide unified access to dozens of biological databases, and the project has shown that large-scale projects involving next-generation sequencing data can be managed efficiently in a distributed environment.
Abstract: Biological data management is a challenging undertaking. It is challenging for database designers, because biological concepts are complex and not always well defined, and therefore the data models that are used to represent them are constantly changing as new techniques are developed and new information becomes available. It is challenging for collaborating groups based in different geographical locations who wish to have unified access to their distributed data sources, because combining and presenting their data creates logistical difficulties. Finally, it is challenging for users of biological databases, because in order to correctly interpret the experimental data located in one database, additional information from other databases is frequently needed, requiring the user to learn multiple systems. The BioMart project (www.biomart.org) was initiated to address these challenges. The BioMart software is based on two fundamental concepts: data agnostic modelling and data federation. Data agnostic modelling simplifies the difficult and time-consuming task of data modelling. In BioMart, this is achieved by using a predefined, query-optimized relational schema that can be used to represent any kind of data (1). Data federation makes it possible to organize multiple, disparate and distributed database systems into what appears to be a single integrated virtual database. It therefore allows users to access and cross reference data from these data sources using a single user interface, without the need for database administrators to physically collate the data in one location. Using these fundamental concepts, the BioMart project has driven a change in the biological data management paradigm, where individual biological databases are managed by different custom built systems. To give more control to both the users and the data providers, a new, innovative solution was required. BioMart started by adapting data warehousing ideas to create one universal software system for biological data management and empower biologists with the ability to create complex, customized datasets through a web interface without the need for bioinformatics support (1). It subsequently introduced a new innovative way of creating large multi-database repositories that avoid the need to store all the data in a single location (2), and finally it proved that large-scale projects involving next generation sequencing data can be managed efficiently in a distributed environment (3). BioMart has successfully adapted data warehousing ideas such as data marts, dimensional modelling (4), and query optimization into the world of biological databases (5–13). BioMart's ability to quickly deploy a website hosting any type of data, user-friendly graphical user interface, several programmatic interfaces and support for third party tools contributed to its success and adoption by many different types of projects around the world as their data management platform (14). During the 10 years of its existence, BioMart has grown from humble beginnings as a ‘data mining extension’ for the Ensembl website (1), to become an international collaboration involving large number of different organizations located on five continents: Asia, Australia, Europe, North America and South America (3,15). It has a large community of users and developers and it has been successfully used in both academia and industry. 
The latest version of the BioMart software has been significantly enhanced with numerous graphical user interfaces that are tailored to different user groups. In addition, it has been further improved by parallel query processing, it is now extensible with different analysis tools and the installation process can be effortlessly completed with just a few mouse clicks (16). Building on the wealth of information that has become accessible through the BioMart interface, the BioMart Central Portal (15) has introduced an innovative alternative to the large data stores maintained by specialized organizations such as The European Bioinformatics Institute (EBI) or The National Center for Biotechnology Information (NCBI). BioMart Central Portal is a first-of-its-kind, community-driven effort to provide unified access to dozens of biological databases. All development and maintenance of individual databases is left to the individual data providers, making it a very cost-effective approach. The groups that maintain individual sources can do so at their own location without the necessity of any data exchange procedures. In addition, they can draw on the wealth of information available through the portal to expose their data in the context of third party annotations. The BioMart Central Portal approach is very democratic: everyone can join or remove their data source at any time. BioMart Central Portal is effectively a ‘Virtual Bioinformatics Institute’ with no onsite personnel, minimal administration, and a very ‘green’ footprint. More recently, the International Cancer Genome Consortium (ICGC) Data Portal has demonstrated how BioMart can scale to manage large collaborative projects involving next generation sequencing data (3). The ICGC is generating data on an unprecedented scale by sequencing 500 cancer genomes and matched normal control genomes for 50 different cancer types (17). The effort is distributed between multiple participating countries and sequencing centres. Given the scale of the effort, moving all of the data to a single location is impractical. Instead, the ICGC Data Portal relies on BioMart data federation. By replicating and distributing the data model across different centres that produce the same type of data according to the same recipe, the scalability of the effort is greatly improved. Each centre is only responsible for managing their own data while data access to all of the consortium data is managed by the BioMart software. This presents a scalable approach, not only in the traditional sense of parallelizing data processing and storage, but also in a more general sense of outsourcing the external annotation expertise by federating annotations from additional, independently-maintained databases that are available in the BioMart Central Portal. The future developments for BioMart involve specialized ‘pre-packaged’ and reusable data portals. One example already in development is the OncoPortal, aimed at researchers managing cancer data. It will include preconfigured access to sources of annotations that are useful for cancer research such as Ensembl (5), Reactome (12), COSMIC (9), Pancreatic Expression Database (10) and others. It will also include a set of tools that are specifically designed for cancer data analysis. There are plans to build other preconfigured portals for different research areas, such as a mouse portal and a model organism portal. 
It is an ambition of the BioMart community that the BioMart project remains at the forefront of innovative solutions for biological data management in the years to come. By creating these specialized solutions and further reducing the barriers to entry, the aim is to encourage more groups to share their data through BioMart, thereby further enhancing the entire BioMart community.

339 citations


Proceedings ArticleDOI
28 Oct 2011
TL;DR: This paper provides an overview of state-of-the-art research issues and achievements in the field of analytics over big data, and extends the discussion to analytics over big multidimensional data as well, by highlighting open problems and current research trends.
Abstract: In this paper, we provide an overview of state-of-the-art research issues and achievements in the field of analytics over big data, and we extend the discussion to analytics over big multidimensional data as well, by highlighting open problems and current research trends. Our analytical contribution is completed by several novel research directions arising in this field, which plays a leading role in next-generation Data Warehousing and OLAP research.

321 citations


Book ChapterDOI
01 Jan 2011
TL;DR: In this chapter, data mining and knowledge discovery (DMKD) is presented with basic concepts, a brief history of its evolution, mathematical foundations, and usable techniques, along with the data warehouse and the decision support system (DSS).
Abstract: In this chapter, data mining and knowledge discovery (DMKD) is presented with basic concepts, a brief history of its evolution, mathematical foundations, and usable techniques, along with the data warehouse and the decision support system (DSS). First, dataset and knowledge are defined and elucidated as understood under DMKD. DMKD is a discovery process operating at different hierarchies, granularities, and/or scales. Because this set of concepts is best understood when viewed and explained from various perspectives, the chapter starts with a definition followed by a table explaining DMKD from different views (Sect. 5.1). The evolution of DMKD is then briefly tracked, from the rapid advance in massive data to the birth of DMKD (Sect. 5.2). Some mathematical foundations are given in Sect. 5.3, i.e. probability theory, statistics, fuzzy sets, rough sets, data fields, and cloud models. Section 5.4 introduces some usable DMKD techniques: DMKD is used to discover sets of rules and exceptions through association, classification, clustering, prediction, discrimination, and exception detection. Sections 5.5 and 5.6 cover data warehouses and decision support systems; the former is one of the data sources for DMKD, while DMKD in turn provides new techniques to assist the latter in its tasks. Finally, trends and perspectives are summarized and forecast for two promising fields, web mining and spatial data mining (Sect. 5.7).

300 citations


Proceedings ArticleDOI
11 Apr 2011
TL;DR: This paper presents a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system and shows the effectiveness of RCFile in satisfying the four requirements.
Abstract: MapReduce-based data warehouse systems play an important role in supporting big data analytics, enabling typical Web service providers and social network sites (e.g., Facebook) to quickly understand the dynamics of user behavior trends and user needs. In such a system, the data placement structure is a critical factor that can affect warehouse performance in a fundamental way. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures in conventional databases, namely row-stores, column-stores, and hybrid-stores, in the context of large data analysis using MapReduce. We show that they are not very suitable for big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. With intensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen as the default option in the Facebook data warehouse system. It has also been adopted by Hive and Pig, the two most widely used data analysis systems developed at Facebook and Yahoo!, respectively.
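The layout RCFile describes, horizontal partitioning into row groups with column-wise storage inside each group, can be pictured with a small sketch. This is an illustrative Python toy, not the Hadoop/Hive implementation; the table and group size are invented.

```python
from itertools import islice

def to_row_groups(rows, columns, group_size=4):
    """Split rows horizontally into groups, then store each group column by column."""
    it = iter(rows)
    while True:
        group = list(islice(it, group_size))
        if not group:
            break
        # Within a row group, values of one column sit together, so a scan can
        # read only the needed columns while the rows of a group stay on one node.
        yield {col: [row[i] for row in group] for i, col in enumerate(columns)}

rows = [(1, "ad_click", 0.7), (2, "page_view", 0.1), (3, "ad_click", 0.9)]
for rg in to_row_groups(rows, columns=["id", "event", "score"], group_size=2):
    print(rg)  # e.g. {'id': [1, 2], 'event': ['ad_click', 'page_view'], 'score': [0.7, 0.1]}
```

Fast loading falls out of appending whole row groups as they arrive, and storage efficiency comes from compressing each column run separately within a group.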

285 citations


Patent
22 Jul 2011
TL;DR: In this paper, a data warehouse is configured to store transaction data of accounts issued by a plurality of issuers, and at least one processor is configured to calculate values of a first plurality of variables for each of the accounts.
Abstract: Systems and methods are provided to generate tools to evaluate the probability of an account being actually used by a business rather than an individual. In one aspect, a computing apparatus includes: a data warehouse configured to store transaction data of accounts issued by a plurality of issuers; and at least one processor configured to calculate values of a first plurality of variables for each of the accounts using the transaction data of the accounts issued by the plurality of issuers. The accounts include business accounts and non-business accounts. The at least one processor is further configured to identify a second plurality of variables from the first plurality of variables for a classification model to distinguish, using the values and logistic regression, the business accounts from the non-business accounts.
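The classification step in this abstract boils down to fitting a logistic regression over per-account variables and keeping the informative ones. The sketch below is a hedged illustration with scikit-learn; the feature names, data, and the coefficient-based pruning are invented for the example and are not taken from the patent.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-account variables derived from transaction data.
# Columns: avg_ticket, txns_per_month, share_of_weekday_txns
X = np.array([[480.0, 95, 0.92], [32.0, 12, 0.55], [610.0, 130, 0.88], [18.0, 7, 0.40]])
y = np.array([1, 0, 1, 0])  # 1 = business account, 0 = non-business account

model = LogisticRegression().fit(X, y)

# Variables with coefficients near zero contribute little; dropping them is one
# simple way to pick the smaller "second plurality of variables".
coefs = dict(zip(["avg_ticket", "txns_per_month", "share_of_weekday_txns"], model.coef_[0]))
print(coefs)
print(model.predict_proba([[300.0, 60, 0.80]]))  # [P(non-business), P(business)]
```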

231 citations


Patent
14 Jun 2011
TL;DR: In this article, the authors present a system that collects patron data, manages patron data in a high performance data warehouse, shares patron data with other systems and reports vital patron information.
Abstract: A casino resort management system collects patron data, manages patron data in a high performance data warehouse, shares patron data with other systems and reports vital patron information. The system particularly tracks machine history, including changes in location, configuration and performance, and tracks the location history, including game type and denomination, and allows for placards to be moved from one gaming machine to another without losing historical machine or location information. In addition, the system provides two particularly useful graphical displays that simplify visual analysis of the large amounts of data within a casino. One display method depicts tables of thin bar graphs that compactly allow side-by-side comparison of different groupings of machines and another display method depicts machines in three dimensions so that each dimension can provide visual information to a viewer.

182 citations


Proceedings ArticleDOI
12 Jun 2011
TL;DR: Graph Cube is introduced, a new data warehousing model that supports OLAP queries effectively on large multidimensional networks, and it is shown to be a powerful and efficient tool for decision support on such networks.
Abstract: We consider extending decision support facilities toward large sophisticated networks, upon which multidimensional attributes are associated with network entities, thereby forming the so-called multidimensional networks. Data warehouses and OLAP (Online Analytical Processing) technology have proven to be effective tools for decision support on relational data. However, they are not well-equipped to handle the new yet important multidimensional networks. In this paper, we introduce Graph Cube, a new data warehousing model that supports OLAP queries effectively on large multidimensional networks. By taking account of both attribute aggregation and structure summarization of the networks, Graph Cube goes beyond the traditional data cube model involved solely with numeric value based group-by's, thus resulting in a more insightful and structure-enriched aggregate network within every possible multidimensional space. Besides traditional cuboid queries, a new class of OLAP queries, crossboid, is introduced that is uniquely useful in multidimensional networks and has not been studied before. We implement Graph Cube by combining special characteristics of multidimensional networks with the existing well-studied data cube techniques. We perform extensive experimental studies on a series of real world data sets and Graph Cube is shown to be a powerful and efficient tool for decision support on large multidimensional networks.
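The kind of aggregation Graph Cube performs can be pictured as grouping vertices by a subset of their dimensional attributes and summarizing the edges between groups into an aggregate network. The sketch below is a toy illustration of one cuboid, not the paper's implementation; the vertex attributes and edge list are invented.

```python
from collections import defaultdict

# Vertices of a multidimensional network carry attributes (here: gender, city).
vertices = {1: ("M", "NY"), 2: ("F", "NY"), 3: ("F", "LA"), 4: ("M", "LA")}
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
DIMS = ("gender", "city")

def graph_cuboid(dims):
    """Aggregate the network onto the chosen dimensions, e.g. ('gender',)."""
    idx = [DIMS.index(d) for d in dims]
    group = {v: tuple(attrs[i] for i in idx) for v, attrs in vertices.items()}
    agg_edges = defaultdict(int)
    for u, v in edges:
        agg_edges[(group[u], group[v])] += 1  # structure summarization: count edges between groups
    return dict(agg_edges)

print(graph_cuboid(("gender",)))  # aggregate network over gender groups
print(graph_cuboid(("city",)))    # aggregate network over city groups
```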

179 citations


Journal ArticleDOI
TL;DR: This paper introduces Map-Join-Reduce, a system that extends and improves the MapReduce runtime framework to efficiently process complex data analysis tasks on large clusters, and presents a new data processing strategy which performs filtering-join-aggregation tasks in two successive MapReduce jobs.
Abstract: Data analysis is an important functionality in cloud computing which allows a huge amount of data to be processed over very large clusters. MapReduce is recognized as a popular way to handle data in the cloud environment due to its excellent scalability and good fault tolerance. However, compared to parallel databases, the performance of MapReduce is slower when it is adopted to perform complex data analysis tasks that require the joining of multiple data sets in order to compute certain aggregates. A common concern is whether MapReduce can be improved to produce a system with both scalability and efficiency. In this paper, we introduce Map-Join-Reduce, a system that extends and improves the MapReduce runtime framework to efficiently process complex data analysis tasks on large clusters. We first propose a filtering-join-aggregation programming model, a natural extension of MapReduce's filtering-aggregation programming model. Then, we present a new data processing strategy which performs filtering-join-aggregation tasks in two successive MapReduce jobs. The first job applies filtering logic to all the data sets in parallel, joins the qualified tuples, and pushes the join results to the reducers for partial aggregation. The second job combines all partial aggregation results and produces the final answer. The advantage of our approach is that we join multiple data sets in one go and thus avoid frequent checkpointing and shuffling of intermediate results, a major performance bottleneck in most of the current MapReduce-based systems. We benchmark our system against Hive, a state-of-the-art MapReduce-based data warehouse, on a 100-node cluster on Amazon EC2 using the TPC-H benchmark. The results show that our approach significantly boosts the performance of complex analysis queries.
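The filtering-join-aggregation flow can be caricatured as two chained jobs: the first filters each dataset, joins the qualifying tuples and emits partial aggregates; the second merges the partials into the final answer. The single-process Python sketch below only mimics that data flow on invented data; it is not the Map-Join-Reduce runtime.

```python
from collections import defaultdict

orders = [(1, "EU"), (2, "US"), (3, "EU")]             # (order_id, region)
items  = [(1, 100.0), (1, 40.0), (2, 70.0), (3, 5.0)]  # (order_id, price)

# Job 1: filter both datasets, join on order_id, push partial sums to "reducers".
wanted = {oid for oid, region in orders if region == "EU"}  # filtering
partials = defaultdict(float)
for oid, price in items:
    if oid in wanted:                                       # join of qualified tuples
        partials[oid % 2] += price                          # partial aggregation per reducer

# Job 2: combine all partial aggregates into the final answer.
print(sum(partials.values()))  # total revenue of EU orders -> 145.0
```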

Journal ArticleDOI
TL;DR: DEDUCE is envisioned as a simple, web-based environment that allows investigators access to administrative, financial, and clinical information generated during patient care, and that lets users filter through millions of clinical records, explore aggregate reports, and export extracts.

Patent
17 Mar 2011
TL;DR: In this paper, a system includes a transaction handler to process transactions, a data warehouse to store data recording the transactions, a portal configured to communicate with a search engine and to provide a user interface to receive a request from a merchant, and at least one processor coupled with the data warehouse and the portal.
Abstract: In one aspect, a system includes a transaction handler to process transactions, a data warehouse to store data recording the transactions, a portal configured to communicate with a search engine and to provide a user interface to receive a request from a merchant, and at least one processor coupled with the data warehouse and the portal. In response to the request received from the merchant via the portal, the at least one processor identifies a set of first statistics based on search activities of the search engine, identifies a set of second statistics based on the transactions relevant to the search activities, and uses the portal to juxtapose the set of first statistics and the set of second statistics.

Book
26 Aug 2011
TL;DR: Issues concerning the transference of data from existing databases to other applications are discussed, including why transformation and integration of data are necessary in the migration process.
Abstract: Issues concerning the transference of data from existing databases to other applications are discussed. The process of transferring data is called data migration, and it is divided into two processes: extracting data from existing systems in the form of an extracted file, and loading data from the extracted file into the new application. The new application usually requires data in different formats, and so transformation of data is generally needed. Moreover, the new application frequently requires data from more than one source database system, so integration of data is necessary in the migration process.
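The extract/transform/load split described above can be illustrated with a minimal sketch: extract rows from the source systems, reshape them to the target format, and load the extracted file, with integration happening when several sources feed one target. All database and column names below are hypothetical, and the in-memory SQLite sources only stand in for real existing systems.

```python
import csv
import sqlite3

def extract(conn, query):
    return conn.execute(query).fetchall()

def transform(rows):
    # The new application expects (customer_id, full_name) instead of split name fields.
    return [(cid, f"{first} {last}") for cid, first, last in rows]

def load(rows, out_path):
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)  # the "extracted file" the new application loads

# Two hypothetical source systems, stood up in memory for the example.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id, first, last)")
crm.execute("INSERT INTO customers VALUES (1, 'Ada', 'Lovelace')")
erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE clients (id, first, last)")
erp.execute("INSERT INTO clients VALUES (2, 'Alan', 'Turing')")

# Integration: rows from both sources are transformed and land in a single extract.
rows = transform(extract(crm, "SELECT id, first, last FROM customers")) + \
       transform(extract(erp, "SELECT id, first, last FROM clients"))
load(rows, "customers_extract.csv")
```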

Journal ArticleDOI
TL;DR: This paper proposes a model for the conceptual design of ETL processes, built upon enhancements of previous models to support some missing mapping features, and based on the UML environment.
Abstract: Extraction-transformation-loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, and its cleansing, customization, reformatting, integration, and insertion into a data warehouse. Building the ETL process is potentially one of the biggest tasks of building a warehouse; it is complex and time consuming, and it consumes most of a data warehouse project's implementation effort, cost, and resources. Building a data warehouse requires focusing closely on understanding three main areas: the source area, the destination area, and the mapping area (ETL processes). The source area has standard models such as the entity relationship diagram, and the destination area has standard models such as the star schema, but the mapping area still has no standard model. In spite of the importance of ETL processes, little research has been done in this area due to its complexity, and there is a clear lack of a standard model that can be used to represent ETL scenarios. In this paper we navigate through the efforts made to conceptualize ETL processes. Research in the field of modeling ETL processes can be categorized into three main approaches: modeling based on mapping expressions and guidelines, modeling based on conceptual constructs, and modeling based on the UML environment. These projects try to represent the main mapping activities at the conceptual level. Due to the variation and differences between the proposed solutions for the conceptual design of ETL processes, and due to their limitations, this paper also proposes a model for the conceptual design of ETL processes. The proposed model is built upon enhancements of the previous models to support some missing mapping features.

Proceedings ArticleDOI
12 Jun 2011
TL;DR: This paper proposes the design of a new cluster-based data warehouse system, Llama, a hybrid data management system which combines the features of row-wise and column-wise database systems, and designs a new join algorithm to facilitate fast join processing.
Abstract: To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, Llama, a hybrid data management system which combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on the TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiment is conducted on EC2. The results show that Llama has excellent load performance and its query performance is significantly better than the traditional MapReduce framework based on row-wise storage.
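Llama's correlation groups amount to vertically partitioning a table so that columns frequently read together land in the same file on the DFS. The sketch below only illustrates that partitioning step; the group definitions and schema are invented, not Llama's.

```python
# Hypothetical correlation groups: columns that co-occur in queries are stored together.
CORRELATION_GROUPS = {
    "orders_keys":  ["order_id", "customer_id"],
    "orders_facts": ["price", "discount"],
    "orders_dates": ["order_date"],
}

def vertical_partition(rows, schema):
    """Split row-wise records into one columnar chunk per correlation group."""
    chunks = {}
    for group, cols in CORRELATION_GROUPS.items():
        idx = [schema.index(c) for c in cols]
        # Each chunk would be written as a separate file on the distributed file system.
        chunks[group] = [[row[i] for i in idx] for row in rows]
    return chunks

schema = ["order_id", "customer_id", "price", "discount", "order_date"]
rows = [(1, 7, 19.9, 0.0, "2011-06-12"), (2, 8, 5.0, 0.1, "2011-06-13")]
print(vertical_partition(rows, schema))
```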

Patent
Nancy Switzer1
13 Sep 2011
TL;DR: In this article, a computing apparatus includes: a data warehouse configured to store transaction data, geo-demographic data, attitudinal data and lifestyle data of a plurality of customers; a profile generator coupled with the data warehouse to determine a profile for each respective customer of the plurality, the profile including at least one profile parameter to cluster customers based on the transaction data.
Abstract: In one aspect, a computing apparatus includes: a data warehouse configured to store transaction data, geo-demographic data, attitudinal data and lifestyle data of a plurality of customers; a profile generator coupled with the data warehouse to determine a profile for each respective customer of the plurality of customers, the profile including at least one profile parameter to cluster customers based on the transaction data, the geo-demographic data, the attitudinal data and the lifestyle data; and a segment detector coupled with the data warehouse to segment the plurality of customers in a space having at least one first dimension corresponding to the at least one profile parameter, a second dimension for a value score indicative of a level of profitability value of each respective customer, and a third dimension for a current status of each respective customer in connection with a goal.

Proceedings ArticleDOI
Rimma V. Nehme1, Nicolas Bruno1
12 Jun 2011
TL;DR: This paper presents a partitioning advisor that recommends the best partitioning design for an expected workload and its techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations in a shorter amount of time.
Abstract: In recent years, Massively Parallel Processors (MPPs) have gained ground enabling vast amounts of data processing. In such environments, data is partitioned across multiple compute nodes, which results in dramatic performance improvements during parallel query execution. To evaluate certain relational operators in a query correctly, data sometimes needs to be re-partitioned (i.e., moved) across compute nodes. Since data movement operations are much more expensive than relational operations, it is crucial to design a suitable data partitioning strategy that minimizes the cost of such expensive data transfers. A good partitioning strategy strongly depends on how the parallel system would be used. In this paper we present a partitioning advisor that recommends the best partitioning design for an expected workload. Our tool recommends which tables should be replicated (i.e., copied into every compute node) and which ones should be distributed according to specific column(s) so that the cost of evaluating similar workloads is minimized. In contrast to previous work, our techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations in a shorter amount of time. Our experimental evaluation using a real MPP system, Microsoft SQL Server 2008 Parallel Data Warehouse, with both real and synthetic workloads shows the effectiveness of the proposed techniques and the importance of deep integration of the partitioning advisor with the underlying query optimizer.
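The core choice the advisor automates, replicating a table on every node versus hash-distributing it on a column, can be caricatured by a cost heuristic over the expected workload. The sketch below is a deliberately crude stand-in: the threshold and inputs are invented, and the real advisor reasons inside the parallel query optimizer rather than with rules like these.

```python
def recommend(table_size_mb, join_column_freq, small_table_mb=256):
    """Toy heuristic: replicate small tables, distribute big ones on their hottest join key."""
    if table_size_mb <= small_table_mb:
        # Replication costs extra storage and load time, but removes re-partitioning
        # (expensive data movement) from every join against this table.
        return ("replicate", None)
    # Otherwise distribute on the column most often used in the workload's joins,
    # so matching rows co-locate and those joins run without shuffling data.
    best_col = max(join_column_freq, key=join_column_freq.get)
    return ("distribute", best_col)

print(recommend(64, {"nation_key": 12}))                   # ('replicate', None)
print(recommend(50000, {"order_key": 40, "cust_key": 9}))  # ('distribute', 'order_key')
```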

Patent
18 Mar 2011
TL;DR: In this paper, a system includes a transaction handler to process transactions, a data warehouse to store transaction data recording the transactions, and a portal configured to determine online activity tracking data, and at least one processor coupled with the data warehouse and the portal and configured to identify first users who have not been to a website of a first merchant within a predetermined period of time.
Abstract: In one aspect, a system includes a transaction handler to process transactions, a data warehouse to store transaction data recording the transactions, a portal configured to determine online activity tracking data, and at least one processor coupled with the data warehouse and the portal and configured to identify, using the transaction data and the online activity tracking data, first users who have not been to a website of a first merchant within a predetermined period of time, identify a set of transactions of the first users, and determine a spending pattern in the set of transactions of the first users.

Journal ArticleDOI
08 Mar 2011-PLOS ONE
TL;DR: An objective protocol for target prioritisation using TargetMine is proposed and the results show that the protocol can identify known disease-associated genes with high precision and coverage.
Abstract: Prioritising candidate genes for further experimental characterisation is a non-trivial challenge in drug discovery and biomedical research in general. An integrated approach that combines results from multiple data types is best suited for optimal target selection. We developed TargetMine, a data warehouse for efficient target prioritisation. TargetMine utilises the InterMine framework, with new data models such as protein-DNA interactions integrated in a novel way. It enables complicated searches that are difficult to perform with existing tools and it also offers integration of custom annotations and in-house experimental data. We proposed an objective protocol for target prioritisation using TargetMine and set up a benchmarking procedure to evaluate its performance. The results show that the protocol can identify known disease-associated genes with high precision and coverage. A demonstration version of TargetMine is available at http://targetmine.nibio.go.jp/.

Book
01 Jan 2011
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of your data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. * Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of your data.

Journal ArticleDOI
TL;DR: This paper addresses the applications of data mining in educational institutions to extract useful information from huge data sets and to provide an analytical tool for viewing and using this information in decision-making processes, illustrated with real-life examples.
Abstract: A few years ago, the information flow in the education field was relatively simple and the application of technology was limited. However, as we progress into a more integrated world where technology has become an integral part of business processes, the process of transferring information has become more complicated. Today, one of the biggest challenges that educational institutions face is the explosive growth of educational data and how to use this data to improve the quality of managerial decisions. Data mining techniques are analytical tools that can be used to extract meaningful knowledge from large data sets. This paper addresses the applications of data mining in educational institutions to extract useful information from huge data sets and to provide an analytical tool for viewing and using this information in decision-making processes, illustrated with real-life examples. In the modern world a huge amount of data is available which can be used effectively to produce vital information, and the information obtained can be applied in fields such as medical science, education, business, and agriculture. As huge amounts of data are collected and stored in databases, traditional statistical techniques and database management tools are no longer adequate for analyzing them. Data mining (sometimes called data or knowledge discovery) has become an area of growing significance because it helps in analyzing data from different perspectives and summarizing it into useful information (1). There is increasing research interest in using data mining in education. This new emerging field, called educational data mining, is concerned with developing methods that discover knowledge from data originating from educational environments (1). The data can be collected from the databases of various educational institutes; it can be personal or academic, and it can be used to understand students' behavior, to assist instructors, to improve teaching, to evaluate and improve e-learning systems, to improve curriculums, and for many other benefits (1)(2). Educational data mining uses many techniques such as decision trees, neural networks, k-nearest neighbor, naive Bayes, support vector machines and many others (3). Using these techniques, many kinds of knowledge can be discovered, such as association rules, classifications and clustering. The discovered knowledge can be used for organization of syllabi, prediction regarding enrolment of students in a particular programme, alienation of the traditional classroom teaching model, detection of unfair means used in online examinations, detection of abnormal values in the result sheets of students, and so on. This paper is organized as follows: Section II describes the related work. Section III describes the research question. Section IV describes the data mining techniques adopted. Section V discusses the application areas of these techniques in an educational institute. Section VI concludes the paper.

Patent
09 Mar 2011
TL;DR: In this article, a relational database warehouse system with query optimization capabilities is described that allows for speedy identification of sets of records of interest from amongst tens of millions of records, which may include complex derived attributes, generated by aggregating data from a plurality of records in base data tables.
Abstract: A relational database warehouse system with query optimization capabilities is described that allows for speedy identification of sets of records of interest from amongst tens of millions of records. The records of interest may include complex derived attributes, generated, at least in part, by aggregating data from a plurality of records in base data tables. In various embodiments, the query optimization capabilities allow the database warehouse system to identify conditions under which normal query execution may be replaced by one or more optimized execution methods, including, for example, eliminating unnecessary inner join operations on base data tables specified by a query, re-ordering the execution of group-by operations and left-outer join operations to greatly reduce the size of join tables produced while processing a query, and/or consolidating a set of segmentation queries for execution in one pass over the records of the database.
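One of the optimizations listed, re-ordering so the group-by runs before the left-outer join and the join sees pre-aggregated rows, can be illustrated with pandas. The tables and columns are hypothetical, and the rewrite shown is a generic version of the idea rather than the patented implementation.

```python
import pandas as pd

accounts = pd.DataFrame({"acct_id": [1, 2, 3], "segment": ["A", "B", "A"]})
txns = pd.DataFrame({"acct_id": [1, 1, 2, 3], "amount": [10, 20, 5, 7]})

# Naive order: left-outer join first (one row per transaction), then aggregate.
naive = (accounts.merge(txns, on="acct_id", how="left")
                 .groupby(["acct_id", "segment"], as_index=False)["amount"].sum())

# Re-ordered: aggregate transactions first, then join a single row per account.
pre_agg = txns.groupby("acct_id", as_index=False)["amount"].sum()
optimized = accounts.merge(pre_agg, on="acct_id", how="left")

print(naive)
print(optimized)  # same totals, but the intermediate join table is far smaller
```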

Journal ArticleDOI
TL;DR: This study focuses on the four dimensions of data quality noted as the most important to information consumers, namely accuracy, completeness, consistency, and timeliness, which are of particular concern for operational systems and most importantly for data warehouses.
Abstract: Data quality remains a persistent problem in practice and a challenge for research. In this study we focus on the four dimensions of data quality noted as the most important to information consumers, namely accuracy, completeness, consistency, and timeliness. These dimensions are of particular concern for operational systems, and most importantly for data warehouses, which are often used as the primary data source for analyses such as classification, a general type of data mining. However, the definitions and conceptual models of these dimensions have not been collectively considered with respect to data mining in general or classification in particular. Nor have they been considered for problem complexity. Conversely, these four dimensions of data quality have only been indirectly addressed by data mining research. Using definitions and constructs of data quality dimensions, our research evaluates the effects of both data quality and problem complexity on generated data and tests the results in a real-world case. Six different classification outcomes selected from the spectrum of classification algorithms show that data quality and problem complexity have significant main and interaction effects. From the findings of significant effects, the economics of higher data quality are evaluated for a frequent application of classification and illustrated by the real-world case.

Patent
18 Mar 2011
TL;DR: In this paper, a system includes a transaction handler to process transactions, a data warehouse to store data recording the transactions, and at least one processor coupled with the data warehouse and configured to identify a first set of customers who made first transactions correlated with an advertisement.
Abstract: In one aspect, a system includes a transaction handler to process transactions, a data warehouse to store data recording the transactions, and at least one processor coupled with the data warehouse and configured to identify a first set of customers who made first transactions correlated with an advertisement, identify a second set of customers not in the first set of customers, and determine a difference between a first pattern in a first set of transactions of the first set of customers and a second pattern in a second set of transactions of the second set of customers.

01 Jan 2011
TL;DR: This survey covers the conceptual and logical modeling of ETL processes, along with some design methods, and visits each stage of the E-T-L triplet, and examines problems that fall within each of these stages.
Abstract: The software processes that facilitate the original loading and the periodic refreshment of the data warehouse contents are commonly known as Extraction-Transformation-Loading (ETL) processes. The intention of this survey is to present the research work in the field of ETL technology in a structured way. To this end, we organize the coverage of the field as follows: (a) first, we cover the conceptual and logical modeling of ETL processes, along with some design methods, (b) we visit each stage of the E-T-L triplet, and examine problems that fall within each of these stages, (c) we discuss problems that pertain to the entirety of an ETL process, and, (d) we review some research prototypes of academic origin.

Patent
18 Apr 2011
TL;DR: In this article, the authors describe a method for analysis, migration, and validation of data from a source environment (such as an RDBMS system) to a target environment (e.g., a data warehouse appliance).
Abstract: Systems, apparatus, computer-readable storage media, and methods are disclosed for allowing analysis, migration, and validation of data from a source environment (such as an RDBMS system) to a target environment (such as a data warehouse (DW) appliance). In one example, a method comprises analyzing a source database, a source ETL environment, a target database, and a target ETL environment to produce configuration data, the configuration data being used for generating a mapping of the source database to a target database in the target database environment, a mapping of the source DDL code to target DDL code in the target database environment, and a mapping of source ETL code to target ETL code for the target database environment, and migrating at least one table from the source database, at least a portion of the source DDL code, and at least a portion of the source ETL code to the target database environment, where the migrating is based at least in part on the mapping generated using the configuration data.

Journal ArticleDOI
TL;DR: This paper presents a novel data mining framework for the exploration and extraction of actionable knowledge from data generated by electricity meters that incorporates functionality for interim summarization and incremental analysis using intelligent techniques.
Abstract: This paper presents a novel data mining framework for the exploration and extraction of actionable knowledge from data generated by electricity meters. Although a rich source of information for energy consumption analysis, electricity meters produce a voluminous, fast-paced, transient stream of data that conventional approaches are unable to address entirely. In order to overcome these issues, it is important for a data mining framework to incorporate functionality for interim summarization and incremental analysis using intelligent techniques. The proposed Incremental Summarization and Pattern Characterization (ISPC) framework demonstrates this capability. Stream data is structured in a data warehouse based on key dimensions enabling rapid interim summarization. Independently, the IPCL algorithm incrementally characterizes patterns in stream data and correlates these across time. Eventually, characterized patterns are consolidated with interim summarization to facilitate an overall analysis and prediction of energy consumption trends. Results of experiments conducted using the actual data from electricity meters confirm applicability of the ISPC framework.
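The interim-summarization idea, folding the fast meter stream into compact running aggregates keyed on the warehouse dimensions instead of retaining raw readings, can be sketched as below. The dimension key and the simple count/total/peak summary are illustrative only and do not reproduce the ISPC or IPCL algorithms.

```python
from collections import defaultdict

class InterimSummary:
    """Keep running aggregates per (meter, hour) so raw readings need not be stored."""
    def __init__(self):
        self.stats = defaultdict(lambda: {"count": 0, "total_kwh": 0.0, "peak_kwh": 0.0})

    def add(self, meter_id, hour, kwh):
        s = self.stats[(meter_id, hour)]
        s["count"] += 1
        s["total_kwh"] += kwh
        s["peak_kwh"] = max(s["peak_kwh"], kwh)  # incremental update in one pass over the stream

summary = InterimSummary()
for meter, hour, kwh in [("m1", 9, 1.2), ("m1", 9, 0.8), ("m2", 9, 2.4)]:
    summary.add(meter, hour, kwh)
print(dict(summary.stats))
```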

Proceedings ArticleDOI
12 Jun 2011
TL;DR: This work considers processing data warehousing queries over very large datasets by analyzing the complexity of this problem in the split execution environment of HadoopDB, with particular focus on join and aggregation operations.
Abstract: Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize performance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework. In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.
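The split-execution idea, pushing filtering, joins and partial aggregation into the per-node database layer and leaving only the final combination to the MapReduce side, can be sketched roughly as follows. The SQL, schema and combine step are invented for illustration and are not Hadapt's planner.

```python
import sqlite3
from collections import defaultdict

def node_partial(conn):
    # Pushed-down part: each node's local database filters and pre-aggregates.
    return conn.execute(
        "SELECT region, SUM(price) FROM orders WHERE status = 'shipped' GROUP BY region"
    ).fetchall()

def reduce_phase(partials):
    # Generic MapReduce part: merge the per-node partial aggregates.
    totals = defaultdict(float)
    for rows in partials:
        for region, subtotal in rows:
            totals[region] += subtotal
    return dict(totals)

# Two hypothetical nodes, stood up in memory for the example.
nodes = []
for data in ([("EU", 10.0, "shipped"), ("US", 4.0, "open")],
             [("EU", 2.5, "shipped"), ("US", 7.0, "shipped")]):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region, price, status)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", data)
    nodes.append(conn)

print(reduce_phase(node_partial(c) for c in nodes))  # {'EU': 12.5, 'US': 7.0}
```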

Journal ArticleDOI
TL;DR: The DeskTEAM software system, developed in the context of the Tropical Ecology Assessment and Monitoring Network (TEAM), a global network that monitors terrestrial vertebrates, incorporates features and functionality that make it relevant to the broad camera trapping community.