
Showing papers on "Data warehouse" published in 2004


Book
01 Jan 2004
TL;DR: This book delivers real-world solutions for the most time- and labor-intensive portion of data warehousing: data staging, or the extract, transform, load (ETL) process. It also delineates best practices for extracting data from scattered sources.
Abstract: Co-written by Ralph Kimball, the world's leading data warehousing authority, whose previous books have sold more than 150,000 copies. Delivers real-world solutions for the most time- and labor-intensive portion of data warehousing: data staging, or the extract, transform, load (ETL) process. Delineates best practices for extracting data from scattered sources, removing redundant and inaccurate data, transforming the remaining data into correctly formatted data structures, and then loading the end product into the data warehouse. Offers proven time-saving ETL techniques, comprehensive guidance on building dimensional structures, and crucial advice on ensuring data quality.

576 citations


Journal ArticleDOI
TL;DR: The EnsMart system, a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools, has been applied to Ensembl, where it extends its genomic browser capabilities, facilitating rapid retrieval of customized data sets.
Abstract: The EnsMart system (www.ensembl.org/EnsMart) provides a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools. The system consists of a query-optimized database and interactive, user-friendly interfaces. EnsMart has been applied to Ensembl, where it extends its genomic browser capabilities, facilitating rapid retrieval of customized data sets. A wide variety of complex queries, on various types of annotations, for numerous species are supported. These can be applied to many research problems, ranging from SNP selection for candidate gene screening, through cross-species evolutionary comparisons, to microarray annotation. Users can group and refine biological data according to many criteria, including cross-species analyses, disease links, sequence variations, and expression patterns. Both tabulated list data and biological sequence output can be generated dynamically, in HTML, text, Microsoft Excel, and compressed formats. A wide range of sequence types, such as cDNA, peptides, coding regions, UTRs, and exons, with additional upstream and downstream regions, can be retrieved. The EnsMart database can be accessed via a public Web site, or through a Java application suite. Both implementations and the database are freely available for local installation, and can be extended or adapted to 'non-Ensembl' data sets.

425 citations


Proceedings ArticleDOI
01 Nov 2004
TL;DR: This paper investigates data mining as a technique for masking data, an approach therefore termed data mining based privacy protection, and adapts an iterative bottom-up generalization from data mining to generalize the data.

Abstract: The well-known privacy-preserved data mining modifies existing data mining techniques to work on randomized data. In this paper, we investigate data mining as a technique for masking data, an approach therefore termed data mining based privacy protection. This approach partially incorporates the requirement of a targeted data mining task into the process of masking data so that essential structure is preserved in the masked data. The idea is simple but novel: we explore the data generalization concept from data mining as a way to hide detailed information, rather than discover trends and patterns. Once the data is masked, standard data mining techniques can be applied without modification. Our work demonstrates another positive use of data mining technology: not only can it discover useful patterns, but it can also mask private information. We consider the following privacy problem: a data holder wants to release a version of data for building classification models, but wants to protect against linking the released data to an external source for inferring sensitive information. We adapt an iterative bottom-up generalization from data mining to generalize the data. The generalized data remains useful to classification but becomes difficult to link to other sources. The generalization space is specified by a hierarchical structure of generalizations. A key step is identifying the best generalization to climb up the hierarchy at each iteration. Enumerating all candidate generalizations is impractical. We present a scalable solution that examines at most one generalization in each iteration for each attribute involved in the linking.
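
The paper's actual algorithm selects a single best generalization per iteration using scalable statistics; none of that is reproduced here. Purely as a loose, hypothetical illustration of bottom-up generalization itself (climbing a value hierarchy until every remaining value is shared by at least k records), consider the Python sketch below; the taxonomy, attribute values, and threshold are all invented.

```python
from collections import Counter

# Hypothetical generalization hierarchy for one quasi-identifier attribute.
# Each value maps to its parent (more general) value; top values map to "ANY".
HIERARCHY = {
    "dentist": "health", "surgeon": "health",
    "teacher": "education", "professor": "education",
    "health": "ANY", "education": "ANY",
}

def generalize_once(values):
    """Replace every value by its parent in the hierarchy (one climb step)."""
    return [HIERARCHY.get(v, "ANY") for v in values]

def bottom_up_generalize(values, k=3):
    """Keep climbing the hierarchy until every remaining value occurs at least
    k times (a crude anonymity criterion used only for this sketch)."""
    while True:
        counts = Counter(values)
        if all(c >= k for c in counts.values()) or set(values) == {"ANY"}:
            return values
        values = generalize_once(values)

jobs = ["dentist", "surgeon", "teacher", "professor", "teacher", "dentist"]
print(bottom_up_generalize(jobs, k=3))
# ['health', 'health', 'education', 'education', 'education', 'health']
```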

330 citations


Proceedings ArticleDOI
13 Jun 2004
TL;DR: New algorithms for performing fast computation of several common database operations on commodity graphics processors, taking into account some of the limitations of the programming model of current GPUs and performing no data rearrangements are presented.
Abstract: We present new algorithms for performing fast computation of several common database operations on commodity graphics processors. Specifically, we consider operations such as conjunctive selections, aggregations, and semi-linear queries, which are essential computational components of typical database, data warehousing, and data mining applications. While graphics processing units (GPUs) have been designed for fast display of geometric primitives, we utilize the inherent pipelining and parallelism, single instruction and multiple data (SIMD) capabilities, and vector processing functionality of GPUs, for evaluating boolean predicate combinations and semi-linear queries on attributes and executing database operations efficiently. Our algorithms take into account some of the limitations of the programming model of current GPUs and perform no data rearrangements. Our algorithms have been implemented on a programmable GPU (e.g. NVIDIA's GeForce FX 5900) and applied to databases consisting of up to a million records. We have compared their performance with an optimized implementation of CPU-based algorithms. Our experiments indicate that the graphics processor available on commodity computer systems is an effective co-processor for performing database operations.
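
The authors' algorithms run on GPU hardware through the graphics pipeline; the sketch below is only a CPU-side analogy with assumed column names, using NumPy's vectorized boolean operations to mimic branch-free evaluation of a conjunctive selection over columnar data, followed by a simple aggregation.

```python
import numpy as np

# Columnar table of one million synthetic records (column names are made up).
n = 1_000_000
rng = np.random.default_rng(0)
price = rng.uniform(0, 100, n)
quantity = rng.integers(1, 50, n)
region = rng.integers(0, 4, n)

# Conjunctive selection evaluated as element-wise boolean vectors: a rough
# CPU analogue of combining predicates with SIMD operations on a GPU.
mask = (price > 40.0) & (price < 60.0) & (quantity >= 10) & (region == 2)

# A simple aggregation (count and sum) over the selected rows.
print(int(mask.sum()), float(price[mask].sum()))
```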

306 citations


Journal ArticleDOI
01 Apr 2004
TL;DR: Investigating the factors influencing adoption of data warehouse technology in the banking industry in Taiwan revealed that factors such as support from the top management, size of the bank, effect of champion, internal needs, and competitive pressure would affect the adoption of data warehouse technology.
Abstract: Previous literature suggests that various factors play crucial roles in the adoption of an information system; however, there is little empirical research about the factors affecting adoption of data warehouse technology, particularly in a single information technology intensive industry. In this study, we used a survey to investigate the factors influencing adoption of data warehouse technology in the banking industry in Taiwan. A total of 50 questionnaires were mailed to CIOs in domestic banks. The response rate was 60%. Discriminant analysis was employed to test hypotheses. The results revealed that factors such as support from the top management, size of the bank, effect of champion, internal needs, and competitive pressure would affect the adoption of data warehouse technology. The results and conclusions from this study may be a good reference for global banks in these aforementioned countries to establish and develop operational strategies, which in turn will facilitate the implementation in overseas branches.

221 citations


Patent
14 Jun 2004
TL;DR: A data services handler comprises an interface between a data store and applications that supply and consume data, and a real time information director (RTID) that transforms data under direction of polymorphic metadata that defines a security model and data integrity rules for application to the data.
Abstract: A data services handler comprises an interface between a data store and applications that supply and consume data, and a real time information director (RTID) that transforms data under direction of polymorphic metadata that defines a security model and data integrity rules for application to the data.

221 citations


Proceedings ArticleDOI
13 Jun 2004
TL;DR: A privacy framework for data integration is laid out, and challenges for data integration within this framework are discussed in the context of existing accomplishments, many of which present opportunities for the data mining community.

Abstract: Integrating data from multiple sources has been a longstanding challenge in the database community. Techniques such as privacy-preserving data mining promise privacy, but assume data integration has already been accomplished. Data integration methods, in turn, are seriously hampered by the inability to share the data to be integrated. This paper lays out a privacy framework for data integration. Challenges for data integration within this framework are discussed in the context of existing accomplishments in data integration. Many of these challenges are opportunities for the data mining community.

195 citations


Proceedings ArticleDOI
22 Aug 2004
TL;DR: This paper mines tables present in data warehouses and relational databases to develop an automatic segmentation system that overcomes limitations of existing supervised text segmentation approaches and is robust, accurate, and efficient.
Abstract: Automatically segmenting unstructured text strings into structured records is necessary for importing the information contained in legacy sources and text collections into a data warehouse for subsequent querying, analysis, mining and integration. In this paper, we mine tables present in data warehouses and relational databases to develop an automatic segmentation system. Thus, we overcome limitations of existing supervised text segmentation approaches, which require comprehensive manually labeled training data. Our segmentation system is robust, accurate, and efficient, and requires no additional manual effort. Thorough evaluation on real datasets demonstrates the robustness and accuracy of our system, with segmentation accuracy exceeding state of the art supervised approaches.

167 citations


Journal ArticleDOI
01 Dec 2004
TL;DR: The structures and processes used by BCBSNC for data warehouse governance represent best practices for other companies to follow and add to the body of knowledge about IT Governance, in general, and data warehousing governance, in particular.
Abstract: Effective governance is a key to data warehousing success. An example of a company that has excelled in data warehouse governance is Blue Cross and Blue Shield of North Carolina (BCBSNC). The data warehouse has resulted in many organizational benefits, including providing "a single version of the truth," better data analysis and time savings for users, reductions in head count, facilitation of the development of new applications, better data, and support for customer-focused business strategies. The structures and processes used by BCBSNC for data warehouse governance represent best practices for other companies to follow. For researchers, the experiences at BCBSNC support and add to the body of knowledge about IT governance, in general, and data warehousing governance, in particular.

152 citations


Journal ArticleDOI
TL;DR: The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception, and many lessons learned over the last four years and the challenges that still need to be addressed are discussed.
Abstract: The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception. With clickstreams being collected at the application-server layer, high-level events being logged, and data automatically transformed into a data warehouse using meta-data, common problems plaguing data mining using weblogs (e.g., sessionization and conflating multi-sourced data) were obviated, thus allowing us to concentrate on actual data mining goals. The paper briefly reviews the architecture and discusses many lessons learned over the last four years and the challenges that still need to be addressed. The lessons and challenges are presented across two dimensions: business-level vs. technical, and throughout the data mining lifecycle stages of data collection, data warehouse construction, business intelligence, and deployment. The lessons and challenges are also widely applicable to data mining domains outside retail e-commerce.

139 citations


Proceedings ArticleDOI
13 Jun 2004
TL;DR: The lattice property of the product of hierarchical dimensions ("diamond") is crucially exploited in the online algorithms to track approximate HHHs using only a small, fixed number of statistics per candidate node, regardless of the number of dimensions.
Abstract: Data items archived in data warehouses or those that arrive online as streams typically have attributes which take values from multiple hierarchies (e.g., time and geographic location; source and destination IP addresses). Providing an aggregate view of such data is important to summarize, visualize, and analyze. We develop the aggregate view based on certain hierarchically organized sets of large-valued regions ("heavy hitters"). Such Hierarchical Heavy Hitters (HHHs) were previously introduced as a crucial aggregation technique in one dimension. In order to analyze a wider range of data warehousing applications and realistic IP data streams, we generalize this problem to multiple dimensions. We identify and study two variants of HHHs for multi-dimensional data, namely the "overlap" and "split" cases, depending on how an aggregate computed for a child node in the multi-dimensional hierarchy is propagated to its parent element(s). For data warehousing applications, we present offline algorithms that take multiple passes over the data and produce the exact HHHs. For data stream applications, we present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. We show experimentally, using real and synthetic data, that our proposed online algorithms yield outputs which are very similar (virtually identical, in many cases) to their offline counterparts. The lattice property of the product of hierarchical dimensions ("diamond") is crucially exploited in our online algorithms to track approximate HHHs using only a small, fixed number of statistics per candidate node, regardless of the number of dimensions.
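
As a hedged illustration of the basic idea only: the sketch below computes one-dimensional hierarchical heavy hitters offline over IPv4 prefixes, using a simplified discounting rule (counts already reported at descendant HHHs are credited to ancestors before testing them). The multi-dimensional overlap/split variants and the one-pass approximate algorithms that are the paper's contribution are not shown; the data and threshold are invented.

```python
from collections import Counter

def prefixes(ip, levels=4):
    """Ancestors of a dotted prefix: itself, then shorter prefixes, down to '*'."""
    parts = ip.split(".")
    return [".".join(parts[:i]) or "*" for i in range(levels, -1, -1)]

def hierarchical_heavy_hitters(items, phi=0.25):
    """Simplified exact 1-D HHHs (offline, full 4-octet IPs as input): a node is
    reported if its rolled-up count, minus counts already credited to reported
    descendants, is at least phi * total."""
    total = len(items)
    threshold = phi * total
    exact = Counter(items)
    hhhs = {}
    credited = Counter()              # amount already claimed by reported descendants
    for level in range(4, -1, -1):    # most specific level first, root last
        level_counts = Counter()
        for item, c in exact.items():
            level_counts[prefixes(item)[4 - level]] += c
        for node, c in level_counts.items():
            residual = c - credited[node]
            if residual >= threshold:
                hhhs[node] = residual
                for anc in prefixes(node, levels=len(node.split("."))):
                    if anc != node:
                        credited[anc] += residual
    return hhhs

data = ["1.2.3.4"] * 3 + ["1.2.3.5"] * 3 + ["1.2.9.9"] * 2 + ["7.7.7.7"] * 2
print(hierarchical_heavy_hitters(data, phi=0.3))
# {'1.2.3.4': 3, '1.2.3.5': 3, '*': 4}
```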

Book ChapterDOI
31 Aug 2004
TL;DR: A novel method is proposed that computes a thin layer of the data cube that will be capable of supporting flexible and fast OLAP operations in the original high dimensional space, and has I/O costs that scale nicely with dimensionality.
Abstract: Data cube has been playing an essential role in fast OLAP (online analytical processing) in many multi-dimensional data warehouses. However, there exist data sets in applications like bioinformatics, statistics, and text processing that are characterized by high dimensionality, e.g., over 100 dimensions, and moderate size, e.g., around 10^6 tuples. No feasible data cube can be constructed with such data sets. In this paper we will address the problem of developing an efficient algorithm to perform OLAP on such data sets. Experience tells us that although data analysis tasks may involve a high dimensional space, most OLAP operations are performed only on a small number of dimensions at a time. Based on this observation, we propose a novel method that computes a thin layer of the data cube together with associated value-list indices. This layer, while being manageable in size, will be capable of supporting flexible and fast OLAP operations in the original high dimensional space. Through experiments we will show that the method has I/O costs that scale nicely with dimensionality. Furthermore, the costs are comparable to that of accessing an existing data cube when full materialization is possible.
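
The paper's thin-layer construction and value-list indices are more elaborate than this; as a hedged toy under invented dimension names, the sketch below shows the underlying trick of inverted tuple-id lists: a group-by over a small set of dimensions is answered by intersecting per-value tuple-id sets instead of materializing a full cube.

```python
from collections import defaultdict
from itertools import product

# Toy relation: each tuple is a dict of dimension -> value plus a measure.
rows = [
    {"A": "a1", "B": "b1", "C": "c2", "sales": 10},
    {"A": "a1", "B": "b2", "C": "c1", "sales": 5},
    {"A": "a2", "B": "b1", "C": "c2", "sales": 7},
    {"A": "a1", "B": "b1", "C": "c1", "sales": 3},
]

# Build one inverted value-list index per dimension: value -> set of tuple ids.
index = {dim: defaultdict(set) for dim in ("A", "B", "C")}
for tid, row in enumerate(rows):
    for dim in index:
        index[dim][row[dim]].add(tid)

def group_by(dims, measure="sales"):
    """Aggregate the measure for every value combination of the chosen
    dimensions by intersecting tuple-id lists (no full cube needed)."""
    result = {}
    for combo in product(*(index[d].keys() for d in dims)):
        tids = set.intersection(*(index[d][v] for d, v in zip(dims, combo)))
        if tids:
            result[combo] = sum(rows[t][measure] for t in tids)
    return result

print(group_by(["A", "B"]))
# {('a1', 'b1'): 13, ('a1', 'b2'): 5, ('a2', 'b1'): 7}
```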

Journal ArticleDOI
TL;DR: A multi-criteria decision analysis based process that would empower DM project teams to do thorough experimentation and analysis without being overwhelmed by the task of analyzing a significant number of DTs would offer a positive contribution to the DM process.

Proceedings ArticleDOI
14 Mar 2004
TL;DR: This paper presents a concept and an ongoing implementation of a multiversion data warehouse that is capable of handling changes in the structure of its schema as well as simulating alternative business scenarios.
Abstract: A data warehouse (DW) provides information for analytical processing, decision making, and data mining tools. On the one hand, the structure and content of a data warehouse reflect the real world, i.e. data stored in a DW come from real production systems. On the other hand, a DW and its tools may be used for predicting trends and simulating virtual business scenarios. This activity is often called what-if analysis. Traditional DW systems have a static structure of their schemas and relationships between data, and therefore they are not able to support any dynamics in their structure and content. For these purposes, multiversion data warehouses seem to be very promising. In this paper we present a concept and an ongoing implementation of a multiversion data warehouse that is capable of handling changes in the structure of its schema as well as simulating alternative business scenarios.

Book ChapterDOI
08 Nov 2004
TL;DR: In this paper, the authors present a framework for the design of the Data Warehouse back-stage (and the respective ETL processes) based on the key observation that this task fundamentally involves dealing with the specificities of information at very low levels of granularity including transformation rules at the attribute level.
Abstract: In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation, Loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW. In this paper, we present a framework for the design of the DW back-stage (and the respective ETL processes) based on the key observation that this task fundamentally involves dealing with the specificities of information at very low levels of granularity including transformation rules at the attribute level. Specifically, we present a disciplined framework for the modeling of the relationships between sources and targets in different levels of granularity (including coarse mappings at the database and table levels to detailed inter-attribute mappings at the attribute level). In order to accomplish this goal, we extend UML (Unified Modeling Language) to model attributes as first-class citizens. In our attempt to provide complementary views of the design artifacts in different levels of detail, our framework is based on a principled approach in the usage of UML packages, to allow zooming in and out the design of a scenario.

Journal ArticleDOI
Arun Sen1
01 Apr 2004
TL;DR: It is proposed that the metadata warehouse needs to be designed to store the metadata and manage its changes and several architectures are proposed that can be used to develop a metadata warehouse.
Abstract: In the past, metadata has always been a second-class citizen in the world of databases and data warehouses. Its main purpose has been to define the data. However, the current emphasis on metadata in the data warehouse and software repository communities has elevated it to a new prominence. The organization now needs metadata for tool integration, data integration and change management. The paper presents a chronological account of this evolution—both from conceptual and management perspectives. Repository concepts are currently being used to manage metadata for tool integration and data integration. As a final chapter in this evolution process, we point out the need of a concept called “metadata warehouse.” A real-life data warehouse project called TAMUS Information Portal (TIP) is used to describe the types of metadata needed in a data warehouse and the changes that the metadata go through. We propose that the metadata warehouse needs to be designed to store the metadata and manage its changes. We propose several architectures that can be used to develop a metadata warehouse.

Journal ArticleDOI
01 Jan 2004
TL;DR: The paper extends an existing multidimensional data model and algebraic query language to accommodate spatial values that exhibit partial containment relationships instead of the total containment relationships normally assumed in multidimensional data models.

Abstract: With the recent and continuing advances in areas such as wireless communications and positioning technologies, mobile, location-based services are becoming possible. Such services deliver location-dependent content to their users. More specifically, these services may capture the movements and requests of their users in multidimensional databases, i.e., data warehouses, and content delivery may be based on the results of complex queries on these data warehouses. Such queries aggregate detailed data in order to find useful patterns, e.g., in the interaction of a particular user with the services. The application of multidimensional technology in this context poses a range of new challenges. The specific challenge addressed here concerns the provision of an appropriate multidimensional data model. In particular, the paper extends an existing multidimensional data model and algebraic query language to accommodate spatial values that exhibit partial containment relationships instead of the total containment relationships normally assumed in multidimensional data models. Partial containment introduces imprecision in aggregation paths. The paper proposes a method for evaluating the imprecision of such paths. The paper also offers transformations of dimension hierarchies with partial containment relationships to simple hierarchies, to which existing precomputation techniques are applicable.
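
The paper defines its own imprecision measure and hierarchy transformations, which are not reproduced here. Purely as a hypothetical illustration of what partial containment means, the sketch below attaches a containment fraction to each child-parent edge and multiplies fractions along aggregation paths; all place names and fractions are invented.

```python
# Hypothetical partial-containment hierarchy: child -> list of (parent, fraction),
# where fraction is the share of the child contained in that parent.
PARTIAL = {
    "cell_17":    [("district_A", 0.7), ("district_B", 0.3)],
    "district_A": [("city", 1.0)],
    "district_B": [("city", 1.0)],
}

def path_weight(node, target):
    """Total fraction of `node` that rolls up into `target`, summed over all
    aggregation paths, multiplying partial-containment fractions along each path."""
    if node == target:
        return 1.0
    return sum(f * path_weight(parent, target)
               for parent, f in PARTIAL.get(node, []))

# A user located in cell_17 contributes 0.7 to district_A, 0.3 to district_B,
# and 1.0 in total at the city level (0.7*1.0 + 0.3*1.0).
print(path_weight("cell_17", "district_A"), path_weight("cell_17", "city"))
```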

Journal ArticleDOI
01 Dec 2004
TL;DR: This special issue presents a set of articles that describe recent work on semantic heterogeneity at the schema level, a problem often referred to as schema matching or schema mapping.
Abstract: Semantic heterogeneity is one of the key challenges in integrating and sharing data across disparate sources, data exchange and migration, data warehousing, model management, the Semantic Web and peer-to-peer databases. Semantic heterogeneity can arise at the schema level and at the data level. At the schema level, sources can differ in relations, attribute and tag names, data normalization, levels of detail, and the coverage of a particular domain. The problem of reconciling schema-level heterogeneity is often referred to as schema matching or schema mapping. At the data level, we find different representations of the same real-world entities (e.g., people, companies, publications, etc.). Reconciling data-level heterogeneity is referred to as data deduplication, record linkage, and entity/object matching. To exacerbate the heterogeneity challenges, schema elements of one source can be represented as data in another. This special issue presents a set of articles that describe recent work on semantic heterogeneity at the schema level.

Proceedings ArticleDOI
12 Nov 2004
TL;DR: This work proposes an extension of a conceptual multidimensional model with spatial dimensions, spatial hierarchies, and spatial measures and considers the inclusion of topological relationships and topological operators in the model.
Abstract: Data Warehouses and On-Line Analytical Processing systems rely on a multidimensional model that includes dimensions, hierarchies, and measures. Such a model allows users' requirements for supporting the decision-making process to be expressed and facilitates their subsequent implementation. Although Data Warehouses typically include a spatial or location dimension, this dimension is usually represented in an alphanumeric format. However, it is well known that a visual representation of spatial data helps reveal patterns that are difficult to discover otherwise. Further, a multidimensional model is seldom used for representing spatial data. In this work we propose an extension of a conceptual multidimensional model with spatial dimensions, spatial hierarchies, and spatial measures. We also consider the inclusion of topological relationships and topological operators in the model. We analyze different scenarios showing the significance and convenience of the proposed extension.

Journal ArticleDOI
TL;DR: This research merges data integrity theory with management theories about quality improvement using a data quality lens, and it demonstrates the usefulness of the combined theory for data quality improvement.
Abstract: Despite the established theory and the history of the practical use of integrity rules, data quality problems, which should be solvable using data integrity rules, persist in organizations. One effective mechanism to solve this problem is to embed data integrity in a continuous data quality improvement process. The result is an iterative data quality improvement process as data integrity rules are defined, violations of these rules are measured and analyzed, and then the rules are redefined to reflect the dynamic and global context of business process changes. Using action research, we study a global manufacturing company that applied these ideas for improving data quality as it built a global data warehouse. This research merges data integrity theory with management theories about quality improvement using a data quality lens, and it demonstrates the usefulness of the combined theory for data quality improvement.
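
As a very small, hypothetical sketch of the measure-and-analyze step in such an iterative loop (rule names, fields, and records are invented): integrity rules are expressed as predicates, violation rates are measured over incoming records, and the results feed the next round of rule refinement.

```python
# Hypothetical integrity rules for a product record, expressed as predicates.
RULES = {
    "unit_price_positive": lambda r: r["unit_price"] > 0,
    "currency_is_iso":     lambda r: len(r["currency"]) == 3,
    "qty_is_integer":      lambda r: float(r["qty"]).is_integer(),
}

records = [
    {"unit_price": 9.5,  "currency": "USD",  "qty": 4},
    {"unit_price": -1.0, "currency": "usd$", "qty": 2.5},
    {"unit_price": 3.0,  "currency": "EUR",  "qty": 1},
]

def measure_violations(records, rules):
    """Return the violation rate per rule: one measurement cycle of the
    iterative data quality improvement loop described above."""
    return {name: sum(not rule(r) for r in records) / len(records)
            for name, rule in rules.items()}

print(measure_violations(records, RULES))
# every rule is violated by one of the three records, so each rate is 1/3
```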

Patent
23 Jan 2004
TL;DR: In this paper, the authors propose a distributed system in which data is shared between multiple enterprise data sources and mobile clients such that requests from a mobile client for enterprise data are received, the appropriate enterprise data sources that contain the requested data are determined, and the enterprise data is retrieved from the determined enterprise data sources.
Abstract: Data is shared between multiple enterprise data sources and mobile clients in a distributed system such that requests from a mobile client for enterprise data are received, the appropriate enterprise data sources that contain the requested data are determined, and the enterprise data is retrieved from the determined enterprise data sources. Data maintained at a mobile client is shared with multiple enterprise data sources. The mobile clients send requests to an application server for synchronization of data records maintained at the mobile client with corresponding data records at the enterprise data sources. The client request includes metadata that identifies enterprise data sources for the requested data records and that specifies a relational correspondence between the requested data. The mobile client data records and the corresponding data records of the enterprise data sources are compared to identify any data conflicts between the two sets of data records. Any identified data conflicts are resolved.
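
A minimal sketch of the comparison step only, with an invented record layout and versioning scheme rather than the patent's metadata-driven design: client and source records are matched by key, and a conflict is flagged when both sides have moved past the last synchronized version.

```python
# Hypothetical records keyed by id, each carrying a last-modified version counter.
client_records = {
    1: {"version": 5, "name": "Acme Corp. (intl)"},
    2: {"version": 1, "name": "Globex"},
}
server_records = {
    1: {"version": 4, "name": "ACME Corporation"},
    2: {"version": 1, "name": "Globex"},
}
last_synced_version = {1: 3, 2: 1}   # versions both sides agreed on previously

def find_conflicts(client, server, synced):
    """A record conflicts if both the client and the server moved past the
    last synchronized version, i.e. both sides edited it independently."""
    conflicts = []
    for key in client.keys() & server.keys():
        client_changed = client[key]["version"] > synced.get(key, 0)
        server_changed = server[key]["version"] > synced.get(key, 0)
        if client_changed and server_changed:
            conflicts.append(key)
    return conflicts

print(find_conflicts(client_records, server_records, last_synced_version))  # [1]
```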

Book ChapterDOI
20 Sep 2004
TL;DR: The service infrastructure provided by OGSA-DAI is presented, giving a snapshot of its current state in an evolving effort to build infrastructure that allows easy integration of and access to distributed data using grids or web services.

Abstract: In today's large collaborative environments, potentially composed of multiple distinct organisations, uniform controlled access to data has become a key requirement if these organisations are to work together as Virtual Organisations. We refer to such an integrated set of data resources as a virtual data warehouse. The Open Grid Services Architecture – Data Access and Integration (OGSA-DAI) project was established to produce a common middleware solution, aligned with the Global Grid Forum's (GGF) OGSA vision [OGSA], to allow uniform access to data resources using a service based architecture. In this paper the service infrastructure provided by OGSA-DAI is presented, giving a snapshot of its current state in an evolving effort to build infrastructure that allows easy integration of and access to distributed data using grids or web services. More information about OGSA-DAI is available from the project web site: www.ogsadai.org.

Patent
09 Jul 2004
TL;DR: In this paper, a method and system for aggregating data in an enterprise network (100) are provided, which includes receiving real-time network performance data associated with a network resource (118).
Abstract: A method and system for aggregating data in an enterprise network (100) are provided. In one embodiment, a method for processing data in an enterprise network (100) includes receiving real-time network performance data associated with a network resource (118). A database (124) is updated with the real-time network performance data. The database table includes historical network performance data associated with the real-time network performance data.

Patent
13 May 2004
TL;DR: In this paper, the authors present a data integration method and system that enables data architects and others to simply load structured data objects (e.g., XML schemas, database tables, EDI documents or other data objects) and to visually draw mappings between and among elements in the data objects.
Abstract: A data integration method and system that enables data architects and others to simply load structured data objects (e.g., XML schemas, database tables, EDI documents or other structured data objects) and to visually draw mappings between and among elements in the data objects. From there, the tool auto-generates software program code required, for example, to programmatically marshal data from a source data object to a target data object.
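
At run time, such auto-generated marshaling code boils down to applying a declarative field mapping; the sketch below (with invented field names and conversions, not the patent's generated code) shows that idea for a single record.

```python
# Hypothetical mapping drawn visually in such a tool, captured as
# target_field -> (source_field, conversion function).
MAPPING = {
    "customer_name": ("CUST_NM", str.strip),
    "order_total":   ("AMT", float),
    "country_code":  ("CTRY", str.upper),
}

def marshal(source_record, mapping):
    """Apply the declared mapping to produce a target record: the kind of work
    the generated program code would perform for each source record."""
    return {target: convert(source_record[src])
            for target, (src, convert) in mapping.items()}

edi_row = {"CUST_NM": "  Initech  ", "AMT": "199.90", "CTRY": "us"}
print(marshal(edi_row, MAPPING))
# {'customer_name': 'Initech', 'order_total': 199.9, 'country_code': 'US'}
```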

Book ChapterDOI
20 Sep 2004
TL;DR: This paper presents a method for privately computing k-nn classification from distributed sources without revealing any information about the sources or their data, other than that revealed by the final classification result.
Abstract: The ability of databases to organize and share data often raises privacy concerns. Data warehousing combined with data mining, bringing data from multiple sources under a single authority, increases the risk of privacy violations. Privacy preserving data mining provides a means of addressing this issue, particularly if data mining is done in a way that doesn't disclose information beyond the result. This paper presents a method for privately computing k-nn classification from distributed sources without revealing any information about the sources or their data, other than that revealed by the final classification result.
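
The paper's contribution is the protocol that computes this result across parties without revealing their data, and that protocol is not reproduced here. For orientation only, a plain, centralized k-nearest-neighbour classification sketch is shown, i.e., the computation that the privacy-preserving version protects.

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k=3):
    """Plain centralized k-nearest-neighbour classification: the computation
    that the privacy-preserving protocol performs without pooling the data."""
    by_distance = sorted(labeled_points, key=lambda p: math.dist(query, p[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data that, in the paper's setting, would be split across several sources.
data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_classify((1.1, 1.0), data, k=3))   # 'A'
```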

Journal ArticleDOI
TL;DR: Various pairwise regression models were developed and described, and the best model was identified as the pairwise quadratic regression model with selective median, which is currently being used to impute missing data in real time.
Abstract: Loop detectors have been used to gather traffic data for over four decades. Loop data diagnostics have been extensively researched for single loops. Loop data diagnostics for the dual loops laid along 63 km (39 mi) of I-4 in Orlando, Florida, are specifically addressed here. In the I-4 data warehouse, dual-loop detectors provide flow, speed, and occupancy every 30 s. The mathematical relationships among flow, speed, occupancy, and average length of vehicles were used to flag bad data samples provided by a loop detector. A value called the entropy statistic is defined and used to determine the detectors that are stuck. Regression techniques were applied to fill the holes formed by the bad or missing samples. Various pairwise regression models were developed and described, and their performance on the loop data from January and February 2003 was analyzed. The best model was identified as the pairwise quadratic regression model with selective median, which is currently being used to impute missing data in real time.
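
The exact "pairwise quadratic regression with selective median" model is the paper's; the sketch below only mirrors the general pattern on invented data: fit a quadratic regression from each neighbouring detector to the failed one, predict with each fitted pair, and take the median of the candidate predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented speed readings (km/h) from three neighbouring detectors and from the
# detector to be imputed, over historical 30-second intervals.
neighbors = rng.uniform(60, 110, size=(3, 200))
target = 0.8 * neighbors[0] + 0.001 * neighbors[0] ** 2 + rng.normal(0, 2, 200)

# Fit one quadratic (degree-2 polynomial) regression per neighbouring detector.
models = [np.polyfit(nb, target, deg=2) for nb in neighbors]

def impute(current_neighbor_readings):
    """Predict the missing reading from each pairwise model, then take the median."""
    candidates = [np.polyval(coeffs, x)
                  for coeffs, x in zip(models, current_neighbor_readings)]
    return float(np.median(candidates))

print(impute([85.0, 90.0, 80.0]))
```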

Journal Article
TL;DR: The problem of supporting data provenance in scientific database applications is motivated and the DBNotes prototype developed at UC Santa Cruz is described that can be used to “eagerly” trace the provenance and flow of relational data.
Abstract: The problem of tracing the provenance (also known as lineage) of data is a ubiquitous problem that is frequently encountered in databases that are the result of many transformation steps. Scientific databases and data warehouses are some examples of such databases. However, contributions from the database research community towards this problem have been somewhat limited. In this paper, we motivate the problem of supporting data provenance in scientific database applications and provide some background on previous research. We also briefly describe the DBNotes prototype developed at UC Santa Cruz that can be used to “eagerly” trace the provenance and flow of relational data and describe some directions for further research.

Patent
07 May 2004
TL;DR: In this article, a data management services framework coupled with a centralized master repository for core enterprise reference data associated with an enterprise is presented, where a set of synonyms representing a mapping between the field in the master schema and a corresponding field in a particular one of the multiple schemas.
Abstract: In one embodiment, a system is provided for managing a centrally managed master repository for core enterprise reference data associated with an enterprise. A centralized master repository contains the reference data, the reference data being associated with multiple schemas, each schema including one or more data models for reference data, each data model including one or more fields. A data management services framework coupled to the repository provides services for managing the reference data in the repository. The services framework supports a master schema including a union of multiple models and associated fields in the multiple schemas. The services framework also supports a thesaurus including, for each field in the master schema, a set of synonyms each representing a mapping between the field in the master schema and a corresponding field in a particular one of the multiple schemas. The master schema and thesaurus facilitate centralized management of the reference data in the repository across multiple heterogeneous external operational systems that have different associated data models and are provided indirect access to the reference data in the repository for operational use of the reference data according to associated business workflows.
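
A tiny sketch of the thesaurus idea with invented system and field names: each master-schema field maps, per external system, to that system's corresponding field, so a record from any system can be normalized into the master layout.

```python
# Hypothetical thesaurus: master-schema field -> {system name: field in that system}.
THESAURUS = {
    "item_id":    {"erp": "MATNR", "crm": "product_code"},
    "item_name":  {"erp": "MAKTX", "crm": "product_label"},
    "unit_price": {"erp": "NETPR", "crm": "list_price"},
}

def to_master(record, system):
    """Normalize a record from one external system into the master schema by
    looking each master field up through the thesaurus."""
    return {master: record[fields[system]]
            for master, fields in THESAURUS.items()}

crm_row = {"product_code": "P-17", "product_label": "Widget", "list_price": 4.2}
print(to_master(crm_row, "crm"))
# {'item_id': 'P-17', 'item_name': 'Widget', 'unit_price': 4.2}
```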

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This paper implemented the bud-cut approach for incremental evaluation of ATGs, a formalism for schema-directed XML publishing, and experimentally evaluated its performance compared to recomputation.
Abstract: When large XML documents published from a database are maintained externally, it is inefficient to repeatedly recompute them when the database is updated. Vastly preferable is incremental update, as common for views stored in a data warehouse. However, to support schema-directed publishing, there may be no simple query that defines the mapping from the database to the external document. To meet the need for efficient incremental update, this paper studies two approaches for incremental evaluation of ATGs [4], a formalism for schema-directed XML publishing. The reduction approach seeks to push as much work as possible to the underlying DBMS. It is based on a relational encoding of XML trees and a nontrivial translation of ATGs to SQL 99 queries with recursion. However, a weakness of this approach is that it relies on high-end DBMS features rather than the lowest common denominator. In contrast, the bud-cut approach pushes only simple queries to the DBMS and performs the bulk of the work in middleware. It capitalizes on the tree structure of XML views to minimize unnecessary recomputations and leverages optimization techniques developed for XML publishing. While implementation of the reduction approach is not yet within the reach of commercial DBMSs, we have implemented the bud-cut approach and experimentally evaluated its performance compared to recomputation.

Patent
19 May 2004
TL;DR: An improved method of and apparatus for aggregating data including a scalable multi-dimensional database (MDDB) storing multidimensional data logically organized along N dimensions and a high performance aggregation engine that performs multi-stage data aggregation operations on the multidimensional data.
Abstract: An improved method of and apparatus for aggregating data including a scalable multi-dimensional database (MDDB) storing multidimensional data logically organized along N dimensions and a high performance aggregation engine that performs multi-stage data aggregation operations on the multidimensional data. A first stage of such data aggregation operations is performed along a first dimension of the N dimensions; and a second stage of such data aggregation operations is performed for a given slice in the first dimension along another dimension of the N dimensions. Such multi-stage data aggregation operations achieve a significant increase in system performance (e.g. decreased access/search time). The MDDB and high performance aggregation engine of the present invention may be integrated into a standalone data aggregation server supporting an OLAP system (one or more OLAP servers and clients), or may be integrated into a database management system (DBMS), thus achieving improved user flexibility and ease of use. The improved DBMS system of the present invention can be used to realize an improved Data Warehouse for supporting on-line analytical processing (OLAP) operations or to realize an improved informational database system, operational database system, or the like.
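
As a loose analogy only (the patent's MDDB layout and aggregation engine are not shown), the two-stage idea can be mimicked on an in-memory array: aggregate along a first dimension, then aggregate a chosen slice of the intermediate result along another dimension. Dimension names and sizes are invented.

```python
import numpy as np

# Invented 3-D cube of measures with dimensions (product, region, month).
rng = np.random.default_rng(2)
cube = rng.integers(0, 100, size=(4, 3, 12))

# Stage 1: aggregate along the first dimension (sum over products),
# leaving an intermediate (region, month) array.
stage1 = cube.sum(axis=0)

# Stage 2: for a given slice of the intermediate result (say region index 1),
# aggregate along another dimension (sum over months).
region = 1
stage2 = stage1[region].sum()

print(stage1.shape, int(stage2))   # (3, 12) and the yearly total for that region
```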