
Showing papers on "Data warehouse" published in 2010


Journal Article
TL;DR: Data mining is the search for new, valuable, and nontrivial information in large volumes of data, a cooperative effort of humans and computers. Data-mining activities can be placed into one of two categories: predictive data mining, which produces a model of the system described by the given data set, or descriptive data mining, which produces new, nontrivial information based on the available data set.
Abstract: Understand the need for analyses of large, complex, information-rich data sets. Identify the goals and primary tasks of the data-mining process. Describe the roots of data-mining technology. Recognize the iterative character of a data-mining process and specify its basic steps. Explain the influence of data quality on a data-mining process. Establish the relation between data warehousing and data mining. Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome. Data mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers. In practice, the two primary goals of data mining tend to be prediction and description. Prediction involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans. Therefore, it is possible to put data-mining activities into one of two categories: Predictive data mining, which produces the model of the system described by the given data set, or Descriptive data mining, which produces new, nontrivial information based on the available data set.
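
As a hedged illustration of the two categories described above, the sketch below fits a predictive model and a descriptive clustering on a synthetic data set using scikit-learn; the data, model choices, and thresholds are assumptions, not taken from the text.

```python
# Minimal sketch of predictive vs. descriptive data mining on assumed toy data.
# Requires numpy and scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # predictive: model a target
from sklearn.cluster import KMeans                # descriptive: find patterns

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                     # two numeric attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # known outcome for training

# Predictive data mining: build a model of the system described by the data,
# then use it to predict unknown/future values of the variable of interest.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("predicted outcome for a new record:", clf.predict([[0.5, -0.2]])[0])

# Descriptive data mining: derive new, human-interpretable information
# (here, natural groupings) from the available data set alone.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```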

4,646 citations


Proceedings ArticleDOI
01 Mar 2010
TL;DR: Hive is presented, an open-source data warehousing solution built on top of Hadoop that supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop.
Abstract: The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [1] is a popular open-source map-reduce implementation which is being used in companies like Yahoo, Facebook etc. to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug in custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog - Metastore - that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation. In Facebook, the Hive warehouse contains tens of thousands of tables and stores over 700TB of data and is being used extensively for both reporting and ad-hoc analyses by more than 200 users per month.
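
To give a feel for the kind of queries the abstract describes, here is an illustrative snippet of HiveQL held in a Python string; the table, columns, and transform script are hypothetical, and the snippet only prints the text rather than submitting it to a Hive deployment.

```python
# Illustrative HiveQL only; table names, columns and the transform script are
# assumptions, not taken from the paper or from any real warehouse.
hiveql = """
CREATE TABLE page_views (user_id BIGINT, url STRING, ts STRING)
PARTITIONED BY (dt STRING);

-- A SQL-like aggregation; Hive compiles this into map-reduce jobs on Hadoop.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE dt = '2010-03-01'
GROUP BY url;

-- HiveQL also lets users plug custom map-reduce scripts into a query.
SELECT TRANSFORM (user_id, url)
USING 'python my_mapper.py'        -- hypothetical user script
AS (user_id, normalized_url)
FROM page_views;
"""
print(hiveql)
```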

959 citations


Proceedings ArticleDOI
06 Jun 2010
TL;DR: This paper presents how Scribe, Hadoop and Hive together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook, enabling a data warehouse that stores more than 15PB of data and loads more than 60TB of new data every day.
Abstract: Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive, which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and future capabilities and improvements that we are working on.
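
A quick back-of-the-envelope check of the figures quoted above (15PB raw vs. 2.5PB compressed, 60TB vs. 10TB per day) shows the implied compression ratio:

```python
# Back-of-the-envelope check of the ratios implied by the figures in the
# abstract (15PB -> 2.5PB total, 60TB -> 10TB per day).
total_raw_pb, total_compressed_pb = 15, 2.5
daily_raw_tb, daily_compressed_tb = 60, 10

print(f"historical store: {total_raw_pb / total_compressed_pb:.1f}x compression")
print(f"daily load:       {daily_raw_tb / daily_compressed_tb:.1f}x compression")
# At ~10TB/day of compressed data, a year of loads adds roughly:
print(f"~{daily_compressed_tb * 365 / 1000:.1f}PB of compressed data per year")
```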

455 citations


Patent
06 Aug 2010
TL;DR: In this patent, a system identifies a second account identifier of a user from a second user identifier, based on the mapping data between first user identifiers and first account identifiers, to facilitate targeted advertising using the profile of the user and/or to provide information about certain transactions of the user related to a previously presented advertisement.
Abstract: In one aspect, a system includes a transaction handler to process transactions, a data warehouse to store transaction data recording the transactions processed at the transaction handler and to store mapping data between first user identifiers and first account identifiers, a profile generator to generate a profile of a user based on the transaction data, and a portal coupled to the transaction handler to receive a query identifying a second user identifier used by the first tracker to track online activities of a user. The system is to identify a second account identifier of the user from the second user identifier based on the mapping data between the first user identifiers and the first account identifiers to facilitate targeted advertising using the profile of the user and/or to provide information about certain transactions of the user related to a previously presented advertisement.
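
The mapping described in the claim is essentially a chained lookup from a tracker-side identifier to an account identifier to a profile; a minimal sketch, with entirely hypothetical identifiers and field names:

```python
# Minimal sketch of the identifier-mapping step described in the claim.
# All identifiers and profile fields are hypothetical.
first_user_to_account = {          # mapping data stored in the data warehouse
    "user-123": "acct-9001",
    "user-456": "acct-9002",
}
tracker_to_first_user = {          # links the tracker's (second) identifier
    "cookie-abc": "user-123",      # to the first user identifier
}
profiles = {"acct-9001": {"top_category": "travel"}}  # built from transactions

def resolve_profile(second_user_id: str):
    """Resolve tracker id -> first user id -> account id -> profile."""
    first_user_id = tracker_to_first_user.get(second_user_id)
    account_id = first_user_to_account.get(first_user_id)
    return profiles.get(account_id)

print(resolve_profile("cookie-abc"))   # -> {'top_category': 'travel'}
```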

390 citations


Patent
06 Aug 2010
TL;DR: In this patent, a data warehouse is coupled with a portal and a score evaluator to determine a second value for a first propensity score based on transaction data recording payment transactions of at least one user identified by user data.
Abstract: In one aspect, a computing apparatus includes: a transaction handler to process transactions; a data warehouse to store transaction data recording the transactions processed at the transaction handler; and a portal to receive a request from a client device over a network, the request including user data identifying at least one user. The client device has activity data recording activities of the user, and has the capability to determine from the activity data a first value for a first propensity score of the user. The computing apparatus further includes a score evaluator coupled to the data warehouse and the portal to determine a second value for the first propensity score based on transaction data recording payment transactions of the at least one user identified by the user data. The portal is configured to provide information based on the second value in response to the request.

270 citations


Journal ArticleDOI
TL;DR: The CDW platform would be a promising infrastructure to make full use of the TCM clinical data for scientific hypothesis generation, and promote the development of TCM from individualized empirical knowledge to large-scale evidence-based medicine.

210 citations


Book
01 Mar 2010
TL;DR: A complete and comprehensive handbook for the application of data mining techniques in marketing and customer relationship management that combines a technical and a business perspective, bridging the gap between data mining and its use in marketing.
Abstract: A complete and comprehensive handbook for the application of data mining techniques in marketing and customer relationship management. It combines a technical and a business perspective, bridging the gap between data mining and its use in marketing. It guides readers through all the phases of the data mining process, presenting a solid data mining methodology, data mining best practices and recommendations for the use of the data mining results for effective marketing. It answers the crucial question of 'what data to use' by proposing mining data marts and full lists of KPIs for all major industries. Data mining algorithms are presented in a simple and comprehensive way for business users along with real-world application examples from all major industries. The book is mainly addressed to marketers, business analysts and data mining practitioners who are looking for a how-to guide on data mining. It presents the authors' knowledge and experience from the "data mining trenches", revealing the secrets for data mining success.

184 citations


Patent
18 Aug 2010
TL;DR: In this patent, social media is aggregated from a plurality of social media websites, analyzed for sentiment, categorized by topic and user demographics, and archived in a data warehouse; various interfaces are provided to query and generate reports on the archived data.
Abstract: Systems and methods are provided to collect, analyze and report social media aggregated from a plurality of social media websites. Social media is retrieved from social media websites, analyzed for sentiment, and categorized by topic and user demographics. The data is then archived in a data warehouse and various interfaces are provided to query and generate reports on the archived data. In some embodiments, the system further recognizes alert conditions and sends alerts to interested users. In some embodiments, the system further recognizes situations where users can be influenced to view a company or its products in a more favorable light, and automatically posts responsive social media to one or more social media websites.
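
A toy sketch of the collect, analyze, and archive flow described above; the keyword-based sentiment rule, the topic assignment, and the in-memory stand-in for the data warehouse are illustrative assumptions only:

```python
# Toy sketch of the collect -> analyze -> archive flow; the keyword-based
# sentiment rule and the in-memory "warehouse" are illustrative assumptions.
POSITIVE, NEGATIVE = {"love", "great", "fast"}, {"hate", "slow", "broken"}

def analyze(post: dict) -> dict:
    words = set(post["text"].lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    post["sentiment"] = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    post["topic"] = "support" if "broken" in words else "general"
    return post

warehouse = []                                   # stand-in for the data warehouse
for raw in [{"site": "twitter", "text": "Love the new app, so fast"},
            {"site": "forum",   "text": "The checkout page is broken"}]:
    warehouse.append(analyze(raw))

# A simple "interface" to query the archived data, e.g. negative posts.
print([p for p in warehouse if p["sentiment"] == "negative"])
```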

180 citations


Patent
03 Aug 2010
TL;DR: In this patent, a system includes a transaction handler to process transactions, a data warehouse to store transaction data recording the transactions processed at the transaction handler, a profile generator to generate a profile of a user based on the transaction data, an advertisement selector to identify an advertisement based on the profile of the user, and a portal coupled to the transaction handler to provide the advertisement for presentation to the user in connection with information about the transaction.
Abstract: In one aspect, a system includes a transaction handler to process transactions, a data warehouse to store transaction data recording the transactions processed at the transaction handler, a profile generator to generate a profile of a user based on the transaction data, an advertisement selector to identify an advertisement based on the profile of the user in response to the transaction handler processing a transaction of the user, and a portal coupled to the transaction handler to provide the advertisement for presentation to the user in connection with information about the transaction of the user. In one example, the profile includes a plurality of values representing aggregated spending of the user in various areas to summarize the transactions of the user.
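
The "plurality of values representing aggregated spending of the user in various areas" amounts to summing transaction amounts per spending area; a hedged sketch with made-up transactions and category names:

```python
# Sketch of building a spending profile as aggregated amounts per area.
# Transaction records and category names are made up for illustration.
from collections import defaultdict

transactions = [
    {"user": "u1", "area": "grocery", "amount": 82.10},
    {"user": "u1", "area": "travel",  "amount": 431.00},
    {"user": "u1", "area": "grocery", "amount": 19.45},
]

def build_profile(txns):
    """Aggregate spending per area into the kind of profile the claim describes."""
    profile = defaultdict(float)
    for t in txns:
        profile[t["area"]] += t["amount"]
    return dict(profile)

profile = build_profile(t for t in transactions if t["user"] == "u1")
print(profile)   # e.g. {'grocery': 101.55, 'travel': 431.0}
```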

174 citations


Proceedings ArticleDOI
19 Apr 2010
TL;DR: PreDatA, short for Preparatory Data Analytics, is an approach to preparing and characterizing data while it is being produced by large-scale simulations running on peta-scale machines; it enhances the scalability and flexibility of the current I/O stack on HEC platforms.
Abstract: Peta-scale scientific applications running on High End Computing (HEC) platforms can generate large volumes of data. For high performance storage and in order to be useful to science end users, such data must be organized in its layout, indexed, sorted, and otherwise manipulated for subsequent data presentation, visualization, and detailed analysis. In addition, scientists desire to gain insights into selected data characteristics 'hidden' or 'latent' in these massive datasets while data is being produced by simulations. PreDatA, short for Preparatory Data Analytics, is an approach to preparing and characterizing data while it is being produced by large-scale simulations running on peta-scale machines. By dedicating additional compute nodes on the machine as 'staging' nodes and by staging simulations' output data through these nodes, PreDatA can exploit their computational power to perform select data manipulations with lower latency than attainable by first moving data into file systems and storage. Such in-transit manipulations are supported by the PreDatA middleware through asynchronous data movement to reduce write latency, application-specific operations on streaming data that are able to discover latent data characteristics, and appropriate data reorganization and metadata annotation to speed up subsequent data access. PreDatA enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and inspection, as well as for data exchange between concurrently running simulations.
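
The staging-node idea, computing summaries on output data while it streams toward storage rather than after it lands in files, can be sketched as a producer thread handing chunks to an asynchronous staging consumer; this is a generic illustration of the pattern, not the PreDatA middleware API:

```python
# Generic illustration of in-transit processing on a "staging" worker:
# the simulation pushes output chunks asynchronously; the staging side
# computes cheap summary statistics (latent characteristics) before the
# data reaches storage. Not the PreDatA API, only the pattern it describes.
import queue, threading, random

chunks = queue.Queue()

def simulation_producer(n_chunks=5):
    for step in range(n_chunks):
        data = [random.gauss(0, 1) for _ in range(1000)]   # one output chunk
        chunks.put((step, data))                           # asynchronous hand-off
    chunks.put(None)                                       # end-of-stream marker

def staging_consumer():
    while (item := chunks.get()) is not None:
        step, data = item
        # In-transit characterization plus metadata annotation would go here.
        print(f"step {step}: n={len(data)} min={min(data):.2f} max={max(data):.2f}")

threading.Thread(target=simulation_producer).start()
staging_consumer()
```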

173 citations


Proceedings ArticleDOI
09 Apr 2010
TL;DR: This paper proposes four data mining models for the Internet of Things: a multi-layer data mining model, a distributed data mining model, a Grid-based data mining model, and a data mining model from a multi-technology integration perspective.
Abstract: In this paper, we propose four data mining models for the Internet of Things: a multi-layer data mining model, a distributed data mining model, a Grid-based data mining model, and a data mining model from a multi-technology integration perspective. Among them, the multi-layer model includes four layers: 1) a data collection layer, 2) a data management layer, 3) an event processing layer, and 4) a data mining service layer. The distributed data mining model addresses the problems that arise from data being deposited at different sites. The Grid-based data mining model uses a Grid framework to realize the functions of data mining. The data mining model from the multi-technology integration perspective describes a corresponding framework for the future Internet. Several key issues in data mining for the IoT are also discussed.
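
A hedged sketch of how the four layers of the multi-layer model could be wired together; only the layering comes from the paper, while the function bodies are placeholder stubs:

```python
# Skeleton of the paper's four-layer model; the bodies are placeholder stubs.
def data_collection_layer():
    """Layer 1: gather raw readings from devices/sensors (stubbed here)."""
    return [{"sensor": "s1", "value": 21.5}, {"sensor": "s1", "value": 35.0}]

def data_management_layer(readings):
    """Layer 2: store/organize the data (here: a trivial in-memory 'table')."""
    return {"readings": readings}

def event_processing_layer(store):
    """Layer 3: turn raw data into events, e.g. threshold crossings."""
    return [r for r in store["readings"] if r["value"] > 30]

def data_mining_service_layer(events):
    """Layer 4: expose mined results (here: a simple count per sensor)."""
    summary = {}
    for e in events:
        summary[e["sensor"]] = summary.get(e["sensor"], 0) + 1
    return summary

print(data_mining_service_layer(
    event_processing_layer(data_management_layer(data_collection_layer()))))
```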

Journal ArticleDOI
01 Sep 2010
TL;DR: This paper describes a data warehouse system, called Cheetah, built on top of MapReduce and designed specifically for the authors' online advertising application to allow various simplifications and custom optimizations, and describes a stack of optimization techniques ranging from data compression and access methods to multi-query optimization and exploiting materialized views.
Abstract: Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm, called MapReduce, and its open source implementation Hadoop, has been widely adopted due to its impressive scalability and flexibility to handle structured as well as unstructured data. In this paper, we describe our data warehouse system, called Cheetah, built on top of MapReduce. Cheetah is designed specifically for our online advertising application to allow various simplifications and custom optimizations. First, we take a fresh look at the data warehouse schema design. In particular, we define a virtual view on top of the common star or snowflake data warehouse schema. This virtual view abstraction not only allows us to design a SQL-like but much more succinct query language, but also makes it easier to support many advanced query processing features. Next, we describe a stack of optimization techniques ranging from data compression and access method to multi-query optimization and exploiting materialized views. In fact, each node with commodity hardware in our cluster is able to process raw data at 1GBytes/s. Lastly, we show how to seamlessly integrate Cheetah into any ad-hoc MapReduce jobs. This allows MapReduce developers to fully leverage the power of both MapReduce and data warehouse technologies.
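
The virtual-view abstraction, exposing the joined star schema as one wide table so that queries stay succinct, might look roughly like the following; the schema, column names, and SQL text are invented and are not Cheetah's actual query language:

```python
# Illustrative only: a virtual view flattening a star schema so queries can be
# written against one wide table. Schema and columns are invented; this is not
# Cheetah's query language, which the paper describes as SQL-like but more succinct.
virtual_view = """
CREATE VIEW ad_events AS            -- the "virtual view" over the star schema
SELECT f.event_time, f.impressions, f.clicks, f.cost,
       a.campaign_name, p.publisher_name, g.country
FROM   impressions_fact f
JOIN   advertiser_dim a ON f.advertiser_id = a.advertiser_id
JOIN   publisher_dim  p ON f.publisher_id  = p.publisher_id
JOIN   geo_dim        g ON f.geo_id        = g.geo_id;
"""

query = """
-- Analysts then query the flat view without spelling out the joins each time.
SELECT campaign_name, SUM(clicks) / SUM(impressions) AS ctr
FROM   ad_events
WHERE  country = 'US'
GROUP BY campaign_name;
"""
print(virtual_view, query)
```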

Patent
23 Nov 2010
TL;DR: In this patent, a system includes a transaction handler, a data warehouse to store transaction data recording transactions processed at the transaction handler and to store account data identifying an account of a user, and a portal to receive a user selection of a first portion of an advertisement and, in response, to present a user interface.
Abstract: In one aspect, a system includes a transaction handler, a data warehouse to store transaction data recording transactions processed at the transaction handler and to store account data identifying an account of the user, and a portal to receive a user selection of a first portion of an advertisement and, in response, to present a user interface. The advertisement provides an offer and includes a second portion which when selected directs the user to a website of an advertiser. The data warehouse is to store data associating the offer with the account data of the user in response to a request made in the user interface. The system is to monitor transactions processed at the transaction handler to identify a purchase paid via the account and eligible for the redemption of the offer. The transaction handler is to provide statement credits to the user, if the payment transaction is identified.

Proceedings ArticleDOI
06 Jun 2010
TL;DR: In DataPath, queries do not request data, and data are automatically pushed onto processors, where they are then processed by any interested computation, making for a very lean and fast database system.
Abstract: Since the 1970's, database systems have been "compute-centric". When a computation needs the data, it requests the data, and the data are pulled through the system. We believe that this is problematic for two reasons. First, requests for data naturally incur high latency as the data are pulled through the memory hierarchy, and second, it makes it difficult or impossible for multiple queries or operations that are interested in the same data to amortize the bandwidth and latency costs associated with their data access. In this paper, we describe a purely-push based, research prototype database system called DataPath. DataPath is "data-centric". In DataPath, queries do not request data. Instead, data are automatically pushed onto processors, where they are then processed by any interested computation. We show experimentally on a multi-terabyte benchmark that this basic design principle makes for a very lean and fast database system.
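
The push-based principle can be illustrated with a small observer-style loop in which a single scan pushes each data chunk to every registered computation; this is a generic sketch of the idea, not DataPath's implementation:

```python
# Generic sketch of the "data-centric"/push-based idea: one pass over the data
# pushes each chunk to every interested computation, amortizing the access cost.
# This is not DataPath's implementation.
class SumQuery:
    def __init__(self):
        self.total = 0.0
    def consume(self, chunk):
        self.total += sum(chunk)

class CountQuery:
    def __init__(self):
        self.rows = 0
    def consume(self, chunk):
        self.rows += len(chunk)

queries = [SumQuery(), CountQuery()]           # concurrent interested computations
table = [[1.0, 2.0, 3.0], [4.0, 5.0]]          # the "table", stored as chunks

for chunk in table:                            # one pass over the data...
    for q in queries:                          # ...pushed to every computation
        q.consume(chunk)

print(queries[0].total, queries[1].rows)       # 15.0 5
```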

Patent
10 Aug 2010
TL;DR: In this patent, a transaction handler processes transactions, a data warehouse stores transaction data recording the transactions processed at the transaction handler, and a profile generator identifies a set of user clusters based on the transaction data.
Abstract: In one aspect, a computing apparatus includes: a transaction handler to process transactions; a data warehouse to store transaction data recording the transactions processed at the transaction handler; a profile generator to identify a set of user clusters based on transaction data; and a portal to enroll users and identify preferred communication channels of the users, receive offers from a plurality of entities, present data identifying the set of user clusters to the entities, receive bids on the clusters from the entities in accordance with types of the offers, based on the bids determine winning entities for a predetermined time period, and provide offers of the winning entities to respective enrolled users in respective clusters during the predetermined time period, using preferred communication channels of the respective enrolled users.

Patent
18 Oct 2010
TL;DR: In this patent, a data warehouse stores transaction data recording the transactions processed by a transaction handler, and a portal coupled with the data warehouse receives one or more parameters as input and provides spending activity information for presentation in response to the input.
Abstract: In one aspect, a computing apparatus includes: a transaction handler to process transactions, a data warehouse to store transaction data recording the transactions processed by the transaction handler, a portal coupled with the data warehouse to receive one or more parameters as an input and to provide spending activity information for presentation as a response to the input, and an analytics engine coupled with the portal and the data warehouse to analyze spending activities of a user based on the transaction data and the one or more parameters to generate the spending activity information regarding transactions in a plurality of accounts of the user.

Journal ArticleDOI
Nayem Rahman
TL;DR: An Extract-Transform-Load (ETL) metadata model is proposed that archives load observation timestamps and other useful load parameters, together with recommended algorithms and techniques for incremental refreshes that enable table loading while ensuring data consistency and integrity and improving load performance.
Abstract: Incremental load is an important factor for successful data warehousing. Lack of standardized incremental refresh methodologies can lead to poor analytical results, which can be unacceptable to an organization's analytical community. Successful data warehouse implementation depends on consistent metadata as well as incremental data load techniques. If consistent load timestamps are maintained and efficient transformation algorithms are used, it is possible to refresh databases with complete accuracy and with little or no manual checking. This paper proposes an Extract-Transform-Load (ETL) metadata model that archives load observation timestamps and other useful load parameters. The author also recommends algorithms and techniques for incremental refreshes that enable table loading while ensuring data consistency and integrity and improving load performance. In addition to significantly improving quality in incremental load techniques, these methods will save a substantial amount of data warehouse system resources.
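
A minimal sketch of a timestamp-driven incremental refresh of the kind the paper argues for, using SQLite so it is self-contained; the table and column names are assumptions rather than the paper's actual metadata model:

```python
# Sketch of an incremental refresh driven by load timestamps. Table and column
# names are assumptions; the paper's ETL metadata model archives such load
# observation timestamps and other load parameters.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL, update_ts TEXT);
CREATE TABLE dw_orders     (id INTEGER PRIMARY KEY, amount REAL, update_ts TEXT);
CREATE TABLE etl_metadata  (table_name TEXT PRIMARY KEY, last_load_ts TEXT);
INSERT INTO etl_metadata VALUES ('dw_orders', '2010-01-01T00:00:00');
INSERT INTO source_orders VALUES (1, 10.0, '2009-12-31T10:00:00'),
                                 (2, 25.0, '2010-01-02T08:30:00');
""")

def incremental_refresh(conn, now="2010-01-03T00:00:00"):
    (last_ts,) = conn.execute(
        "SELECT last_load_ts FROM etl_metadata WHERE table_name='dw_orders'").fetchone()
    # Load only rows that changed after the last recorded load observation.
    conn.execute("""INSERT OR REPLACE INTO dw_orders
                    SELECT id, amount, update_ts FROM source_orders
                    WHERE update_ts > ?""", (last_ts,))
    conn.execute("UPDATE etl_metadata SET last_load_ts=? WHERE table_name='dw_orders'", (now,))
    conn.commit()

incremental_refresh(db)
print(db.execute("SELECT * FROM dw_orders").fetchall())   # only the changed row
```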

Patent
07 Oct 2010
TL;DR: In this patent, a data warehouse stores transaction data, and purchase details associated with an authorization request are stored in the data warehouse in response to a determination that the account identifier in the request is associated with consent data.
Abstract: In one aspect, a computing apparatus includes: a transaction handler to process transactions; a portal to receive, from users, consent data that identifies account identifiers of the users; a data warehouse to store transaction data recording the transactions and store purchase details for at least some of the transactions; and a profile generator to generate profiles based on the transaction data and the purchase details stored in the data warehouse. In response to an authorization request received in the transaction handler for a payment transaction identifying a first account identifier, the system is to use the transaction handler to request purchase details associated with the authorization request from the merchant via a response to the authorization request, and receive and store the purchase details associated with the authorization request in the data warehouse, in response to a determination that the first account identifier is associated with consent data.

Journal ArticleDOI
01 Nov 2010
TL;DR: A user-centered approach to support the end-user requirements elicitation and the data warehouse multidimensional design tasks is introduced, based on a reengineering process that derives the multidimensional schema from a conceptual formalization of the domain.
Abstract: The data warehouse design task needs to consider both the end-user requirements and the organization data sources. For this reason, the data warehouse design has been traditionally considered a reengineering process, guided by requirements, from the data sources. Most current design methods available demand highly-expressive end-user requirements as input, in order to carry out the exploration and analysis of the data sources. However, eliciting the end-user information requirements can prove to be a demanding task. Importantly, in the data warehousing context, the analysis capabilities of the target data warehouse depend on what kind of data is available in the data sources. Thus, in those scenarios where the analysis capabilities of the data sources are not (fully) known, it is possible to help the data warehouse designer to identify and elicit unknown analysis capabilities. In this paper we introduce a user-centered approach to support the end-user requirements elicitation and the data warehouse multidimensional design tasks. Our proposal is based on a reengineering process that derives the multidimensional schema from a conceptual formalization of the domain. It starts by fully analyzing the data sources to identify, without considering requirements yet, the multidimensional knowledge they capture (i.e., data likely to be analyzed from a multidimensional point of view). Next, we propose to exploit this knowledge in order to support the requirements elicitation task. In this way, we are already conciliating requirements with the data sources, and we are able to fully exploit the analysis capabilities of the sources. Once requirements are clear, we automatically create the data warehouse conceptual schema according to the multidimensional knowledge extracted from the sources.

Book
02 Jun 2010
TL;DR: This lecture gives an overview of recent research in stream processing, ranging from answering simple queries on high-speed streams to loading real-time data feeds into a streaming warehouse for off-line analysis.
Abstract: Many applications process high volumes of streaming data, among them Internet traffic analysis, financial tickers, and transaction log mining. In general, a data stream is an unbounded data set that is produced incrementally over time, rather than being available in full before its processing begins. In this lecture, we give an overview of recent research in stream processing, ranging from answering simple queries on high-speed streams to loading real-time data feeds into a streaming warehouse for off-line analysis. We will discuss two types of systems for end-to-end stream processing: Data Stream Management Systems (DSMSs) and Streaming Data Warehouses (SDWs). A traditional database management system typically processes a stream of ad-hoc queries over relatively static data. In contrast, a DSMS evaluates static (long-running) queries on streaming data, making a single pass over the data and using limited working memory. In the first part of this lecture, we will discuss research problems in DSMSs, such as continuous query languages, non-blocking query operators that continually react to new data, and continuous query optimization. The second part covers SDWs, which combine the real-time response of a DSMS (by loading new data as soon as they arrive) with a data warehouse's ability to manage Terabytes of historical data on secondary storage. Table of Contents: Introduction / Data Stream Management Systems / Streaming Data Warehouses / Conclusions
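
The DSMS side of the picture, a long-running query evaluated in a single pass over unbounded data with bounded memory, can be sketched as a sliding-window aggregate; this is an assumed example, not tied to any particular DSMS or SDW:

```python
# Sketch of a continuous (long-running) query over a stream: a sliding-window
# average computed in a single pass with bounded memory. Illustrative only.
from collections import deque

def windowed_average(stream, window_size=3):
    window = deque(maxlen=window_size)        # the only state the query keeps
    for value in stream:                      # single pass, reacts to each arrival
        window.append(value)
        yield sum(window) / len(window)       # continuously emitted result

ticks = [10.0, 12.0, 11.0, 13.0, 20.0]        # e.g. a financial ticker
for avg in windowed_average(iter(ticks)):
    print(f"{avg:.2f}")
```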

01 Jan 2010
TL;DR: The purpose of the paper is to identify the reasons for data deficiencies, non-availability, or reachability problems at all the aforementioned stages of data warehousing and to formulate a descriptive classification of these causes.
Abstract: Data warehousing is gaining in eminence as organizations become aware of the benefits of decision-oriented and business-intelligence-oriented databases. However, there is one key stumbling block to the rapid development and implementation of quality data warehouses, namely data quality issues at the various stages of data warehousing. Specifically, problems arise in populating a warehouse with quality data. Over time, many researchers have contributed to the study of data quality issues, but no research has collectively gathered all the causes of data quality problems at all the phases of data warehousing, viz. 1) data sources, 2) data integration and data profiling, 3) data staging and ETL, and 4) data warehouse modeling and schema design. The purpose of this paper is to identify the reasons for data deficiencies, non-availability, or reachability problems at all the aforementioned stages of data warehousing and to formulate a descriptive classification of these causes. We have identified a possible set of causes of data quality issues from an extensive literature review and in consultation with data warehouse practitioners working at renowned IT giants in India. We hope this will help developers and implementers of warehouses to examine and analyze these issues before moving ahead with data integration and data warehouse solutions for quality decision-oriented and business-intelligence-oriented applications.

Patent
George Candea, Neoklis Polyzotis
12 May 2010
TL;DR: In this patent, a method concurrently executes a set of multiple queries through a processor to improve resource usage within a data warehouse system, permits a group of users of the system to simultaneously run queries, and applies a high-concurrency query operator to continuously optimize a large number of concurrent queries for highly concurrent dynamic workloads.
Abstract: In one embodiment, a method includes concurrently executing a set of multiple queries, through a processor, to improve a resource usage within a data warehouse system. The method also includes permitting a group of users of the data warehouse system to simultaneously run a set of queries. In addition, the method includes applying a high-concurrency query operator to continuously optimize a large number of concurrent queries for a set of highly concurrent dynamic workloads.

Journal ArticleDOI
TL;DR: The book offers a principled overview of key implementation techniques that are particularly important to multidimensional databases, including materialized views, bitmap indices, join indices, and star join processing.
Abstract: The present book's subject is multidimensional data models and data modeling concepts as they are applied in real data warehouses. The book aims to present the most important concepts within this subject in a precise and understandable manner. The book's coverage of fundamental concepts includes data cubes and their elements, such as dimensions, facts, and measures and their representation in a relational setting; it includes architecture-related concepts; and it includes the querying of multidimensional databases. The book also covers advanced multidimensional concepts that are considered to be particularly important. This coverage includes advanced dimension-related concepts such as slowly changing dimensions, degenerate and junk dimensions, outriggers, parent-child hierarchies, and unbalanced, non-covering, and non-strict hierarchies. The book offers a principled overview of key implementation techniques that are particularly important to multidimensional databases, including materialized views, bitmap indices, join indices, and star join processing. The book ends with a chapter that presents the literature on which the book is based and offers further readings for those readers who wish to engage in more in-depth study of specific aspects of the book's subject. Table of Contents: Introduction / Fundamental Concepts / Advanced Concepts / Implementation Issues / Further Readings
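
Of the implementation techniques listed, a bitmap index is compact enough to show in a few lines: one bitmap per distinct dimension value, with set bits marking the rows carrying that value. The sketch below uses Python integers as bitmaps and made-up data:

```python
# Tiny bitmap-index sketch: one bitmap (here a Python int) per distinct value,
# bit i set when row i has that value. Data and column are made up.
rows = ["US", "DE", "US", "FR", "DE", "US"]    # a low-cardinality dimension column

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def rows_matching(bitmap):
    return [i for i in range(len(rows)) if bitmap >> i & 1]

# Equality predicate: just read the bitmap.
print("country = 'US' ->", rows_matching(bitmaps["US"]))          # [0, 2, 5]
# Disjunction (country IN ('DE','FR')): bitwise OR of the two bitmaps.
print("country IN (DE,FR) ->", rows_matching(bitmaps["DE"] | bitmaps["FR"]))
```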

Book
08 Feb 2010
TL;DR: These practical, hands-on articles are fully updated to reflect current practices and terminology and cover the complete lifecycle including project planning, requirements gathering, dimensional modeling, ETL, and business intelligence and analytics.
Abstract: An unparalleled collection of recommended guidelines for data warehousing and business intelligence pioneered by Ralph Kimball and his team of colleagues from the Kimball Group. Recognized and respected throughout the world as the most influential leaders in the data warehousing industry, Ralph Kimball and the Kimball Group have written articles covering more than 250 topics that define the field of data warehousing. For the first time, the Kimball Group's incomparable advice, design tips, and best practices have been gathered in this remarkable collection of articles, which spans a decade of data warehousing innovation. Each group of articles is introduced with original commentaries that explain their role in the overall lifecycle methodology developed by the Kimball Group. These practical, hands-on articles are fully updated to reflect current practices and terminology and cover the complete lifecycle, including project planning, requirements gathering, dimensional modeling, ETL, and business intelligence and analytics. This easily referenced collection is nothing less than vital if you are involved with data warehousing or business intelligence in any capacity.

Journal ArticleDOI
TL;DR: Evaluation of cell culture stage-specific models indicates that production performance can be reliably predicted days prior to harvest, and implementation of this methodology on the manufacturing floor can facilitate a real-time decision making process and thereby improve the robustness of large scale bioprocesses.

Journal ArticleDOI
01 Sep 2010
TL;DR: The most relevant step in the framework is Multidimensional Design by Examples (MDBE), which is a novel method for deriving multidimensional conceptual schemas from relational sources according to end-user requirements, and is a fully automatic approach that handles and analyzes the end-user requirements automatically.
Abstract: It is widely accepted that the conceptual schema of a data warehouse must be structured according to the multidimensional model. Moreover, it has been suggested that the ideal scenario for deriving the multidimensional conceptual schema of the data warehouse would consist of a hybrid approach (i.e., a combination of data-driven and requirement-driven paradigms). Thus, the resulting multidimensional schema would satisfy the end-user requirements and would be conciliated with the data sources. Most current methods follow either a data-driven or requirement-driven paradigm and only a few use a hybrid approach. Furthermore, hybrid methods are unbalanced and do not benefit from all of the advantages brought by each paradigm. In this paper we present our approach for multidimensional design. The most relevant step in our framework is Multidimensional Design by Examples (MDBE), which is a novel method for deriving multidimensional conceptual schemas from relational sources according to end-user requirements. MDBE introduces several advantages over previous approaches, which can be summarized as three main contributions. (i) The MDBE method is a fully automatic approach that handles and analyzes the end-user requirements automatically. (ii) Unlike data-driven methods, we focus on data of interest to the end-user. However, the user may not be aware of all the potential analyses of the data sources and, in contrast to requirement-driven approaches, MDBE can propose new multidimensional knowledge related to concepts already queried by the user. (iii) Finally, MDBE proposes meaningful multidimensional schemas derived from a validation process. Therefore, the proposed schemas are sound and meaningful.

Patent
24 Sep 2010
TL;DR: The clinical informatics platform may include a data extraction facility that gathers clinical data from numerous sources, a data mapping facility that identifies and maps key data elements and links data over time, data normalization facility to normalize the clinical data and, optionally, de-identify the data, a flexible data warehouse for storing raw clinical data or longitudinal patient data, and a clinical analytics facility for data mining, analytic model building, patient risk identification, benchmarking, performing quality assurance, and patient tracking.
Abstract: The clinical analytics platform automates the capture, extraction, and reporting of data required for certain quality measures, provides real-time clinical surveillance, clinical dashboards, tracking lists, and alerts for specific, high-priority conditions, and offers dynamic, ad-hoc quality reporting capabilities. The clinical informatics platform may include a data extraction facility that gathers clinical data from numerous sources, a data mapping facility that identifies and maps key data elements and links data over time, a data normalization facility to normalize the clinical data and, optionally, de-identify the data, a flexible data warehouse for storing raw clinical data or longitudinal patient data, a clinical analytics facility for data mining, analytic model building, patient risk identification, benchmarking, performing quality assurance, and patient tracking, and a graphical user interface for presenting clinical analytics in an actionable format.

Proceedings ArticleDOI
01 Aug 2010
TL;DR: This paper identifies the critical roles of organizational routines and organization-wide capabilities for identifying, resourcing and implementing business analytics-based competitive actions in delivering performance gains and competitive advantage.
Abstract: Business analytics has the potential to deliver performance gains and competitive advantage. However, a theoretically grounded model identifying the factors and processes involved in realizing those performance gains has not been clearly articulated in the literature. This paper draws on the literature on dynamic capabilities to develop such a theoretical framework. It identifies the critical roles of organizational routines and organization-wide capabilities for identifying, resourcing and implementing business analytics-based competitive actions in delivering performance gains and competitive advantage. A theoretical framework and propositions for future research are developed.

Journal ArticleDOI
01 May 2010
TL;DR: The research suggests an overall model for predicting the data warehouse architecture selection decision and identifies the various contextual factors that affect the selection decision.
Abstract: Even though data warehousing has been in existence for over a decade, companies are still uncertain about a critical decision - which data warehouse architecture to implement? Based on the existing literature, theory, and interviews with experts, a research model was created that identifies the various contextual factors that affect the selection decision. The results from the field survey and multinomial logistic regression suggest that various combinations of organizational factors influence data warehouse architecture selection. The strategic view of the data warehouse prior to implementation emerged as a key determinant. The research suggests an overall model for predicting the data warehouse architecture selection decision.
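
For readers unfamiliar with the analysis method mentioned, multinomial logistic regression predicts a categorical outcome (here, the architecture choice) from several explanatory factors; the sketch below uses synthetic data and hypothetical labels, not the study's survey variables:

```python
# Sketch of the kind of analysis reported: multinomial logistic regression
# relating contextual factors to an architecture choice. The data below is
# synthetic; it is not the study's survey data or its actual factor set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Hypothetical contextual factors, e.g. strategic view, org size, data volume.
X = rng.normal(size=(120, 3))
# Hypothetical architecture labels: 0=independent marts, 1=bus, 2=hub-and-spoke.
y = rng.integers(0, 3, size=120)

model = LogisticRegression(max_iter=1000).fit(X, y)   # multinomial for 3 classes
print("predicted architecture for one firm:", model.predict(X[:1])[0])
print("class probabilities:", model.predict_proba(X[:1]).round(2))
```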

Patent
30 Sep 2010
TL;DR: In this patent, a data warehouse stores transaction data, and a profile generator generates a profile including a plurality of values representing aggregated spending in various spending areas to summarize transactions in a geographical area.
Abstract: In one aspect, a computing apparatus includes: a transaction handler to process transactions; a data warehouse to store transaction data recording the transactions processed at the transaction handler; a profile generator to generate, based on the transaction data, a profile including a plurality of values representing aggregated spending in various spending areas to summarize transactions in a geographical area; and a portal to receive advertisement data from an advertiser and to create an advertisement campaign based on the profile to deliver advertisements to users in the geographical area on behalf of the advertiser using one or more media channels.