
Showing papers on "Data warehouse" published in 2001


Journal ArticleDOI
01 Dec 2001
TL;DR: A taxonomy is presented that distinguishes between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers, and is intended to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.
Abstract: Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

3,693 citations
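
To make the taxonomy concrete, here is a minimal, hypothetical sketch (not from the paper) of an element-level matcher that combines a language-based criterion (name similarity) with a constraint-based one (data-type compatibility). The schema representation, weights, and threshold are all invented for illustration.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Language-based matcher: normalized string similarity of element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def type_compatibility(t1, t2):
    """Constraint-based matcher: crude data-type compatibility score."""
    numeric = {"int", "float", "decimal"}
    if t1 == t2:
        return 1.0
    if t1 in numeric and t2 in numeric:
        return 0.7
    return 0.0

def match(schema1, schema2, w_name=0.6, w_type=0.4, threshold=0.5):
    """Score every element pair and keep those above a threshold."""
    candidates = []
    for n1, t1 in schema1:
        for n2, t2 in schema2:
            score = w_name * name_similarity(n1, n2) + w_type * type_compatibility(t1, t2)
            if score >= threshold:
                candidates.append((n1, n2, round(score, 2)))
    return sorted(candidates, key=lambda c: -c[2])

print(match([("CustName", "varchar"), ("Amt", "decimal")],
            [("CustomerName", "varchar"), ("Amount", "float")]))
```

A full matcher in the surveyed sense would also exploit instance data, structure, and auxiliary information such as dictionaries; this sketch only shows how two element-level criteria can be combined into one score.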


Journal ArticleDOI
TL;DR: It was found that management support and resources help to address organizational issues that arise during warehouse implementations; resources, user participation, and highly-skilled project team members increase the likelihood that warehousing projects will finish on-time, on-budget, with the right functionality; and diverse, unstandardized source systems and poor development technology will increase the technical issues that project teams must overcome.
Abstract: The IT implementation literature suggests that various implementation factors play critical roles in the success of an information system; however, there is little empirical research about the implementation of data warehousing projects. Data warehousing has unique characteristics that may impact the importance of factors that apply to it. In this study, a cross-sectional survey investigated a model of data warehousing success. Data warehousing managers and data suppliers from 111 organizations completed paired mail questionnaires on implementation factors and the success of the warehouse. The results from a Partial Least Squares analysis of the data identified significant relationships between the system quality and data quality factors and perceived net benefits. It was found that management support and resources help to address organizational issues that arise during warehouse implementations; resources, user participation, and highly skilled project team members increase the likelihood that warehousing projects will finish on time, on budget, and with the right functionality; and diverse, unstandardized source systems and poor development technology will increase the technical issues that project teams must overcome. The implementation's success with organizational and project issues, in turn, influences the system quality of the data warehouse; however, data quality is best explained by factors not included in the research model.

1,579 citations


Proceedings Article
11 Sep 2001
TL;DR: This paper proposes a new algorithm, Cupid, that discovers mappings between schema elements based on their names, data types, constraints, and schema structure, using a broader set of techniques than past approaches.
Abstract: Schema matching is a critical step in many applications, such as XML message mapping, data warehouse loading, and schema integration. In this paper, we investigate algorithms for generic schema matching, outside of any particular data model or application. We first present a taxonomy for past solutions, showing that a rich range of techniques is available. We then propose a new algorithm, Cupid, that discovers mappings between schema elements based on their names, data types, constraints, and schema structure, using a broader set of techniques than past approaches. Some of our innovations are the integrated use of linguistic and structural matching, context-dependent matching of shared types, and a bias toward leaf structure where much of the schema content resides. After describing our algorithm, we present experimental results that compare Cupid to two other schema matching systems.

1,533 citations
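
The sketch below is a loose, hypothetical illustration of combining linguistic and structural matching with a bias toward leaf structure; Cupid's actual TreeMatch algorithm, thresholds, and thesaurus-based name matching are considerably richer, so treat every name and weight here as an assumption.

```python
from difflib import SequenceMatcher

def lsim(a, b):
    """Linguistic similarity of element names (a stand-in for a real name matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def leaves(node):
    """A schema element is ('name', [children]); leaves hold the actual data content."""
    name, children = node
    return [name] if not children else [l for c in children for l in leaves(c)]

def ssim(n1, n2):
    """Structural similarity: how well the leaf sets under two elements line up."""
    l1, l2 = leaves(n1), leaves(n2)
    return sum(max(lsim(a, b) for b in l2) for a in l1) / len(l1)

def element_sim(n1, n2, w_struct=0.6):
    """Weighted combination, leaning toward leaf structure as the paper suggests."""
    return (1 - w_struct) * lsim(n1[0], n2[0]) + w_struct * ssim(n1, n2)

po1 = ("PurchaseOrder", [("Address", [("Street", []), ("City", [])])])
po2 = ("PO", [("POShipTo", [("street", []), ("city", [])])])
print(round(element_sim(po1, po2), 2))   # high despite the dissimilar root names
```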


Proceedings Article
11 Sep 2001
TL;DR: Potter’s Wheel is presented, an interactive data cleaning system that tightly integrates transformation and discrepancy detection, and users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.
Abstract: Cleaning data of errors in structure and content is important for data warehousing and integration. Current solutions for data cleaning involve many iterations of data “auditing” to find errors, and long-running transformations to fix them. Users need to endure long waits, and often write complex transformation scripts. We present Potter’s Wheel, an interactive data cleaning system that tightly integrates transformation and discrepancy detection. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations, or by showing the desired effects on example data values. In the background, Potter’s Wheel automatically infers structures for data values in terms of user-defined domains, and accordingly checks for constraint violations. Thus users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.

610 citations
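
As a rough, hypothetical illustration of interleaving transformation with discrepancy detection (not Potter's Wheel's actual implementation), the sketch below applies a split transform and flags values that do not fit an expected structure; in the real system the structure is inferred automatically from user-defined domains, whereas here the pattern is hard-coded.

```python
import re

rows = ["Smith, John", "Doe, Jane", "Madonna", "O'Brien,  Pat"]

def split_transform(value, sep=","):
    """Transform: split 'Last, First' into two fields, stripping whitespace."""
    parts = [p.strip() for p in value.split(sep, 1)]
    return parts if len(parts) == 2 else [value, ""]

def find_discrepancies(values, pattern=r"^[A-Za-z' ]+,\s*[A-Za-z' ]+$"):
    """Detection: flag values that do not fit the expected 'Last, First' structure."""
    return [v for v in values if not re.match(pattern, v)]

print("discrepancies:", find_discrepancies(rows))         # -> ['Madonna']
print("transformed:  ", [split_transform(r) for r in rows])
```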


Book
03 Sep 2001
TL;DR: Leading researchers from the fields of data mining, data visualization, and statistics present findings organized around topics introduced in two recent international knowledge discovery and data mining workshops as formal chapters that together comprise a complete, cohesive body of research.
Abstract: From the Publisher: Mainstream data mining techniques significantly limit the role of human reasoning and insight. Likewise, in data visualization, the role of computational analysis is relatively small. The power demonstrated individually by these approaches to knowledge discovery suggests that somehow uniting the two could lead to increased efficiency and more valuable results. But is this true? How might it be achieved? And what are the consequences for data-dependent enterprises? Information Visualization in Data Mining and Knowledge Discovery is the first book to ask and answer these thought-provoking questions. It is also the first book to explore the fertile ground of uniting data mining and data visualization principles in a new set of knowledge discovery techniques. Leading researchers from the fields of data mining, data visualization, and statistics present findings organized around topics introduced in two recent international knowledge discovery and data mining workshops. Collected and edited by three of the area's most influential figures, these chapters introduce the concepts and components of visualization, detail current efforts to include visualization and user interaction in data mining, and explore the potential for further synthesis of data mining algorithms and data visualization techniques. This incisive, groundbreaking research is sure to wield a strong influence in subsequent efforts in both academic and corporate settings. Features: Details advances made by leading researchers from the fields of data mining, data visualization, and statistics. Provides a useful introduction to the science of visualization, sketches the current role for visualization in data mining, and then takes a long look into its mostly untapped potential. Presents the findings of recent international KDD workshops as formal chapters that together comprise a complete, cohesive body of research. Offers compelling and practical information for professionals and researchers in database technology, data mining, knowledge discovery, artificial intelligence, machine learning, neural networks, statistics, pattern recognition, information retrieval, high-performance computing, and data visualization. Author Biography: Usama Fayyad is co-founder, president, and CEO of digiMine, a data warehousing and data mining ASP. Prior to digiMine, he founded and led Microsoft's Data Mining and Exploration Group, where he developed data mining prediction components for Microsoft Site Server and scalable algorithms for mining large databases. Georges G. Grinstein is a professor of computer science, director of the Institute for Visualization and Perception Research, and co-director of the Center for Bioinformatics and Computational Biology at the University of Massachusetts, Lowell. He is currently the chief technologist for AnVil Informatics, a data exploration company. Andreas Wierse is the managing director of VirCinity, a spin-off company of the Computing Centre of the University of Stuttgart. Previously, he worked at the Computer Centre, where he designed and implemented distributed data management for the COVISE visualization system and maintained a wide range of graphics workstations.

431 citations


Proceedings Article
11 Sep 2001
TL;DR: This paper presents a language, an execution model, and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently; experimental results report on the assessment of the proposed framework for data cleaning.
Abstract: The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for non-conventional applications, such as the migration of largely unstructured data into structured data, or the integration of heterogeneous scientific data sets in interdisciplinary fields (e.g., in environmental science), existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge is the design of a data flow graph that effectively generates clean data and can perform efficiently on large sets of input data. The difficulty comes from (i) a lack of clear separation between the logical specification of data transformations and their physical implementation, and (ii) the lack of explanation of cleaning results and of user interaction facilities to tune a data cleaning program. This paper addresses these two problems and presents a language, an execution model and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.

380 citations
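
The following toy pipeline (normalize, approximately match, cluster, merge) only illustrates the kind of cleaning the paper targets on bibliographic records; the paper's contribution is a declarative language and execution model, not hand-written code like this, and all record strings and thresholds below are invented.

```python
from difflib import SequenceMatcher

refs = [
    "J. Smith. Data Cleaning for DWs. VLDB 1999.",
    "Smith, J.  Data cleaning for DWs, VLDB'99",
    "A. Jones. Schema Matching. SIGMOD 2000.",
]

def normalize(r):
    """Mapping step: lower-case and strip punctuation before comparison."""
    return " ".join(r.lower().replace(",", " ").replace(".", " ").split())

def similar(a, b, threshold=0.75):
    """Matching step: approximate string similarity on normalized records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def cluster(records):
    """Clustering step (greedy): each record joins the first cluster it matches."""
    clusters = []
    for r in records:
        for c in clusters:
            if similar(r, c[0]):
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def merge(group):
    """Merging step: keep the longest (most complete) variant of each group."""
    return max(group, key=len)

print([merge(c) for c in cluster(refs)])   # two clean records instead of three
```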


Book ChapterDOI
12 Jul 2001
TL;DR: An ad-hoc grouping hierarchy based on the spatial index at the finest spatial granularity is constructed and incorporated in the lattice model and efficient methods to process arbitrary aggregations are presented.
Abstract: Spatial databases store information about the position of individual objects in space. In many applications however, such as traffic supervision or mobile communications, only summarized data, like the number of cars in an area or phones serviced by a cell, is required. Although this information can be obtained from transactional spatial databases, its computation is expensive, rendering online processing inapplicable. Driven by the non-spatial paradigm, spatial data warehouses can be constructed to accelerate spatial OLAP operations. In this paper we consider the star-schema and we focus on the spatial dimensions. Unlike the non-spatial case, the groupings and the hierarchies can be numerous and unknown at design time, therefore the well-known materialization techniques are not directly applicable. In order to address this problem, we construct an ad-hoc grouping hierarchy based on the spatial index at the finest spatial granularity. We incorporate this hierarchy in the lattice model and present efficient methods to process arbitrary aggregations. We finally extend our technique to moving objects by employing incremental update methods.

367 citations
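
A hypothetical, much-simplified sketch of the underlying idea: pre-aggregate counts at the finest spatial granularity (a regular grid stands in here for the partitioning induced by the spatial index) and answer an arbitrary spatial grouping by rolling up the covered cells. The cell size, the data, and the handling of partially covered cells are all simplifications.

```python
from collections import Counter

CELL = 10.0   # side length of a finest-granularity cell (illustrative)

def cell_of(x, y):
    return (int(x // CELL), int(y // CELL))

def preaggregate(points):
    """Object counts per finest-granularity cell (bottom of the ad-hoc hierarchy)."""
    return Counter(cell_of(x, y) for x, y in points)

def rollup(cell_counts, window):
    """Sum counts of cells fully inside a rectangular query window.
    Partially covered cells would need refinement in a real system."""
    (x1, y1), (x2, y2) = window
    return sum(cnt for (cx, cy), cnt in cell_counts.items()
               if x1 <= cx * CELL and (cx + 1) * CELL <= x2
               and y1 <= cy * CELL and (cy + 1) * CELL <= y2)

cars = [(3, 4), (12, 8), (15, 19), (27, 31), (14, 11)]
counts = preaggregate(cars)
print(rollup(counts, ((10, 0), (20, 20))))   # -> 3 cars in [10,20) x [0,20)
```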



Journal ArticleDOI
TL;DR: Multidimensional database technology will increasingly be applied where analysis results are fed directly into other systems, thereby eliminating humans from the loop; coupled with the need for continuous updates, this poses stringent performance requirements not met by current technology.
Abstract: Multidimensional database technology is a key factor in the interactive analysis of large amounts of data for decision making purposes. In contrast to previous technologies, these databases view data as multidimensional cubes that are particularly well suited for data analysis. Multidimensional models categorize data either as facts with associated numerical measures or as textual dimensions that characterize the facts. Queries aggregate measure values over a range of dimension values to provide results such as total sales per month of a given product. Multidimensional database technology is being applied to distributed data and to new types of data that current technology often cannot adequately analyze. For example, classic techniques such as preaggregation cannot ensure fast query response times when data-such as that obtained from sensors or GPS-equipped moving objects-changes continuously. Multidimensional database technology will increasingly be applied where analysis results are fed directly into other systems, thereby eliminating humans from the loop. When coupled with the need for continuous updates, this context poses stringent performance requirements not met by current technology.

303 citations
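
A minimal sketch of the fact/dimension view described above, assuming a tiny invented fact table: the numerical measure (sales amount) is aggregated over the month level of the time dimension for a given product.

```python
from collections import defaultdict

facts = [  # (product, date, store, amount); amount is the numerical measure
    ("Laptop", "2001-01-15", "Oslo",    1200.0),
    ("Laptop", "2001-01-28", "Aalborg",  950.0),
    ("Laptop", "2001-02-03", "Oslo",    1100.0),
    ("Phone",  "2001-01-09", "Oslo",     300.0),
]

def total_sales_per_month(facts, product):
    """Aggregate the measure over the month level of the time dimension."""
    totals = defaultdict(float)
    for prod, date, _store, amount in facts:
        if prod == product:
            totals[date[:7]] += amount    # roll day-level dates up to months
    return dict(totals)

print(total_sales_per_month(facts, "Laptop"))
# {'2001-01': 2150.0, '2001-02': 1100.0}
```

Pre-aggregation would store such monthly totals ahead of time; the article's point is that this becomes hard when the underlying measures, such as sensor or GPS readings, change continuously.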


Journal ArticleDOI
TL;DR: An approach that provides a theoretical foundation for the use of object-oriented databases and object-relational databases in data warehouse, multidimensional database, and online analytical processing applications and introduces a set of minimal constraints and extensions to the Unified Modeling Language for representing multidimensional modeling properties for these applications.
Abstract: The authors propose an approach that provides a theoretical foundation for the use of object-oriented databases and object-relational databases in data warehouse, multidimensional database, and online analytical processing applications. This approach introduces a set of minimal constraints and extensions to the Unified Modeling Language for representing multidimensional modeling properties for these applications. Multidimensional modeling offers two benefits. First, the model closely parallels how data analyzers think and, therefore, helps users understand data. Second, multidimensional modeling helps predict what final users want to do, thereby facilitating performance improvements. The authors are using their approach to create an automatic implementation of a multidimensional model. They plan to integrate commercial online-analytical-processing tool facilities within their GOLD model case tool as well, a task that involves data warehouse prototyping and sample data generation issues.

298 citations


Journal Article
TL;DR: This paper presents a method to support the identification and design of data marts by exploiting a goal-oriented process based on the Goal/Question/Metric paradigm developed at the University of Maryland.
Abstract: Data warehouses are databases devoted to analytical processing. They are used to support decision-making activities in most modern business settings, when complex data sets have to be studied and analyzed. The technology for analytical processing assumes that data are presented in the form of simple data marts, consisting of a well-identified collection of facts and data analysis dimensions (star schema). Despite the wide diffusion of data warehouse technology and concepts, we still lack methods that help and guide the designer in identifying and extracting such data marts out of an enterprisewide information system, covering the upstream, requirement-driven stages of the design process. Many existing methods and tools support the activities related to the efficient implementation of data marts on top of specialized technology (such as the ROLAP or MOLAP data servers). This paper presents a method to support the identification and design of data marts. The method is based on three basic steps. A first top-down step makes it possible to elicit and consolidate user requirements and expectations. This is accomplished by exploiting a goal-oriented process based on the Goal/Question/Metric paradigm developed at the University of Maryland. Ideal data marts are derived from user requirements. The second bottom-up step extracts candidate data marts

Book ChapterDOI
01 Feb 2001
TL;DR: This paper describes several languages for describing contents of data sources, the tradeoffs between them, and the associated reformulation algorithms.
Abstract: The data integration problem is to provide uniform access to multiple heterogeneous information sources available online (e.g., databases on the WWW). This problem has recently received considerable attention from researchers in the fields of Artificial Intelligence and Database Systems. The data integration problem is complicated by the facts that (1) sources contain closely related and overlapping data, (2) data is stored in multiple data models and schemas, and (3) data sources have differing query processing capabilities. A key element in a data integration system is the language used to describe the contents and capabilities of the data sources. While such a language needs to be as expressive as possible, it should also make it possible to efficiently address the main inference problem that arises in this context: to translate a user query that is formulated over a mediated schema into a query on the local schemas. This paper describes several languages for describing contents of data sources, the tradeoffs between them, and the associated reformulation algorithms.

Proceedings ArticleDOI
12 Jun 2001
TL;DR: An overview of research in real-time data mining-based intrusion detection systems (IDSs) is presented, along with an architecture consisting of sensors, detectors, a data warehouse, and model generation components that improves the efficiency and scalability of the IDS.
Abstract: We present an overview of our research in real time data mining-based intrusion detection systems (IDSs). We focus on issues related to deploying a data mining-based IDS in a real time environment. We describe our approaches to address three types of issues: accuracy, efficiency, and usability. To improve accuracy, data mining programs are used to analyze audit data and extract features that can distinguish normal activities from intrusions; we use artificial anomalies along with normal and/or intrusion data to produce more effective misuse and anomaly detection models. To improve efficiency, the computational costs of features are analyzed and a multiple-model cost-based approach is used to produce detection models with low cost and high accuracy. We also present a distributed architecture for evaluating cost-sensitive models in real-time. To improve usability, adaptive learning algorithms are used to facilitate model construction and incremental updates; unsupervised anomaly detection algorithms are used to reduce the reliance on labeled data. We also present an architecture consisting of sensors, detectors, a data warehouse, and model generation components. This architecture facilitates the sharing and storage of audit data and the distribution of new or updated models. This architecture also improves the efficiency and scalability of the IDS.

Journal ArticleDOI
TL;DR: This paper reports on the experiences with two systems that were developed at the University of Pennsylvania: K2, a view integration implementation, and GUS, a data warehouse.
Abstract: The integrated access to heterogeneous data sources is a major challenge for the biomedical community. Several solution strategies have been explored: link-driven federation of databases, view integration, and warehousing. In this paper we report on our experiences with two systems that were developed at the University of Pennsylvania: K2, a view integration implementation, and GUS, a data warehouse. Although the view integration and the warehouse approaches each have advantages, there is no clear "winner." Therefore, in selecting the best strategy for a particular application, users must consider the data characteristics, the performance guarantees required, and the programming resources available. Our experiences also point to some practical tips on how database updates should be published, and how XML can be used to facilitate the processing of updates in a warehousing environment.

Journal Article
TL;DR: Heterogeneous database systems attempt to unify disparate databases by providing uniform conceptual schemas that resolve representational heterogeneities, and by providing querying capabilities that aggregate and integrate distributed data.

Patent
16 Nov 2001
TL;DR: In this paper, a business model for use in a data warehouse system adaptable for multiple organizations is provided, which consists of a set of dimensions representing business reference aspects of the multiple organizations, a subset of measures representing measurements of business activity aspects of each organization, and relationships between the sets of dimensions and measures.
Abstract: A business model for use in a data warehouse system adaptable for multiple organizations is provided. The business model comprises a set of dimensions representing business reference aspects of the multiple organizations, a set of measures representing measurements of business activity aspects of the multiple organizations, and relationships between the set of dimensions and measures. A subset of the set of measures represents the business activity aspects of the specific organization. A subset of the set of dimensions represents the business aspects of a particular organization. The relationships allow for functional areas of analysis to use common dimensions for cross-functional analysis.

Book
01 Jan 2001
TL;DR: This book discusses the design and management of data warehousing, and the role of metadata in this process.
Abstract: Foreword. Preface. PART 1: OVERVIEW AND CONCEPTS. The Compelling Need for Data Warehousing. Data Warehouse: The Building Blocks. Trends in Data Warehousing. PART 2: PLANNING AND REQUIREMENTS. Planning and Project Management. Defining the Business Requirements. Requirements as the Driving Force for Data Warehousing. PART 3: ARCHITECTURE AND INFRASTRUCTURE. The Architectural Components. Infrastructure as the Foundation for Data Warehousing. The Significant Role of Metadata. PART 4: DATA DESIGN AND DATA PREPARATION. Principles of Dimensional Modeling. Dimensional Modeling: Advanced Topics. Data Extraction, Transformation, and Loading. Data Quality: A Key to Success. PART 5: INFORMATION ACCESS AND DELIVERY. Matching Information to the Classes of Users. OLAP in the Data Warehouse. Data Warehousing and the Web. Data Mining Basics. PART 6: IMPLEMENTATION AND MAINTENANCE. The Physical Design Process. Data Warehouse Deployment. Growth and Maintenance. Appendix A: Project Life Cycle Steps and Checklists. Appendix B: Critical Factors for Success. Appendix C: Guidelines for Evaluating Vendor Solutions. References. Glossary. Index.

Proceedings ArticleDOI
01 May 2001
TL;DR: This paper shows how to find an efficient plan for the maintenance of a set of materialized views, by exploiting common subexpressions between different view maintenance expressions, and develops a framework that cleanly integrates the various choices in a systematic and efficient manner.
Abstract: Materialized views have been found to be very effective at speeding up queries, and are increasingly being supported by commercial databases and data warehouse systems. However, whereas the amount of data entering a warehouse and the number of materialized views are rapidly increasing, the time window available for maintaining materialized views is shrinking. These trends necessitate efficient techniques for the maintenance of materialized views. In this paper, we show how to find an efficient plan for the maintenance of a set of materialized views, by exploiting common subexpressions between different view maintenance expressions. In particular, we show how to efficiently select (a) expressions and indices that can be effectively shared, by transient materialization; (b) additional expressions and indices for permanent materialization; and (c) the best maintenance plan (incremental or recomputation) for each view. These three decisions are highly interdependent, and the choice of one affects the choice of the others. We develop a framework that cleanly integrates the various choices in a systematic and efficient manner. Our evaluations show that many-fold improvement in view maintenance time can be achieved using our techniques. Our algorithms can also be used to efficiently select materialized views to speed up workloads containing queries and updates.
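
The sketch below illustrates, under invented cost numbers, two of the choices the paper optimizes jointly: picking incremental maintenance versus recomputation per view, and paying for a transiently materialized shared subexpression only once. It is a greedy, view-by-view simplification; the paper integrates these interdependent decisions systematically.

```python
views = {
    # view name: (incremental cost, recompute cost, shared subexpression or None)
    "sales_by_region":  (40.0, 300.0, "delta_sales_join_store"),
    "sales_by_product": (55.0, 280.0, "delta_sales_join_store"),
    "rarely_hit_view":  (90.0,  35.0, None),
}
SUBEXPR_COST = {"delta_sales_join_store": 60.0}   # cost of materializing it once

def plan(views):
    materialized = set()          # shared subexpressions already paid for
    decisions, total = {}, 0.0
    for name, (inc, rec, shared) in views.items():
        extra = SUBEXPR_COST[shared] if shared and shared not in materialized else 0.0
        if inc + extra <= rec:
            decisions[name] = "incremental"
            total += inc + extra
            if shared:
                materialized.add(shared)
        else:
            decisions[name] = "recompute"
            total += rec
    return decisions, total

print(plan(views))
# ({'sales_by_region': 'incremental', 'sales_by_product': 'incremental',
#   'rarely_hit_view': 'recompute'}, 190.0)
```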

Patent
28 Feb 2001
TL;DR: In this article, a stand-alone aggregation server for multidimensional databases (MDDBs) is presented, which can uniformly distribute data elements among a plurality of processors for balanced loading and processing, and therefore is highly scalable.
Abstract: Improved method of and apparatus for aggregating data elements in multidimensional databases (MDDB). In one aspect of the present invention, the apparatus is realized in the form of a high-performance stand-alone (i.e. external) aggregation server which can be plugged into conventional OLAP systems to achieve significant improvements in system performance. In accordance with the principles of the present invention, the stand-alone aggregation server contains a scalable MDDB and a high-performance aggregation engine that are integrated into the modular architecture of the aggregation server. The stand-alone aggregation server of the present invention can uniformly distribute data elements among a plurality of processors, for balanced loading and processing, and therefore is highly scalable. The stand-alone aggregation server of the present invention can be used to realize (i) an improved MDDB for supporting on-line analytical processing (OLAP) operations, (ii) an improved Internet URL Directory for supporting on-line information searching operations by Web-enabled client machines, as well as (iii) diverse types of MDDB-based systems for supporting real-time control of processes in response to complex states of information reflected in the MDDB. In another aspect of the present invention, the apparatus is integrated within a database management system (DBMS). The improved DBMS can be used to achieve a significant increase in system performance (e.g. decreased access/search time), user flexibility, and ease of use. The improved DBMS system of the present invention can be used to realize an improved Data Warehouse for supporting on-line analytical processing (OLAP) operations or to realize an improved informational database system, operational database system, or the like.

Journal ArticleDOI
01 Aug 2001
TL;DR: The experiment shows that the hybrid evolutionary algorithm delivers better performance than either the evolutionary algorithm or heuristics used alone in terms of the minimal query and maintenance cost and the evaluation cost to obtain the minimal cost.
Abstract: A data warehouse (DW) contains multiple views accessed by queries. One of the most important decisions in designing a DW is selecting views to materialize for the purpose of efficiently supporting decision making. The search space for possible materialized views is exponentially large. Therefore heuristics have been used to search for a near optimal solution. In this paper, we explore the use of an evolutionary algorithm for materialized view selection based on multiple global processing plans for queries. We apply a hybrid evolutionary algorithm to solve three related problems. The first is to optimize queries. The second is to choose the best global processing plan from multiple global processing plans. The third is to select materialized views from a given global processing plan. Our experiment shows that the hybrid evolutionary algorithm delivers better performance than either the evolutionary algorithm or heuristics used alone in terms of the minimal query and maintenance cost and the evaluation cost to obtain the minimal cost.
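
As a rough illustration of evolutionary materialized-view selection (not the paper's hybrid algorithm, which also chooses among multiple global processing plans and mixes in heuristics), the toy genetic search below minimizes estimated query cost plus maintenance cost over three candidate views; all cost figures and GA parameters are invented.

```python
import random
random.seed(0)   # reproducible toy run

QUERY_COST_WITHOUT = [100, 80, 120, 60]   # per-query cost with no helpful view
SAVINGS = [                                # SAVINGS[v][q]: saving on query q if view v is materialized
    [60, 0, 50, 0],
    [0, 50, 0, 30],
    [20, 20, 40, 0],
]
MAINT = [35, 25, 45]                       # maintenance cost per candidate view

def cost(selection):
    """Total cost = residual query costs + maintenance of the selected views."""
    query = 0
    for q in range(4):
        best_saving = max((SAVINGS[v][q] for v in range(3) if selection[v]), default=0)
        query += max(0, QUERY_COST_WITHOUT[q] - best_saving)
    return query + sum(MAINT[v] for v in range(3) if selection[v])

def evolve(pop_size=8, generations=30, mutation=0.2):
    pop = [[random.randint(0, 1) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)                       # fittest (cheapest) first
        parents = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randint(1, 2)
            child = a[:cut] + b[cut:]            # one-point crossover
            if random.random() < mutation:
                i = random.randrange(3)
                child[i] ^= 1                    # flip one view in or out
            children.append(child)
        pop = parents + children
    best = min(pop, key=cost)
    return best, cost(best)

print(evolve())   # the optimum under this toy cost model is [1, 1, 0] with cost 230
```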

Proceedings ArticleDOI
01 May 2001
TL;DR: Clio, a new semi-automated tool for creating schema mappings, is demonstrated; it employs a mapping-by-example paradigm that relies on the use of value correspondences describing how a value of a target attribute can be created from a set of values of source attributes.
Abstract: We consider the integration requirements of modern data intensive applications including data warehousing, global information systems and electronic commerce. At the heart of these requirements lies the schema mapping problem in which a source (legacy) database must be mapped into a different, but fixed, target schema. The goal of schema mapping is the discovery of a query or set of queries to map source databases into the new structure. We demonstrate Clio, a new semi-automated tool for creating schema mappings. Clio employs a mapping-by-example paradigm that relies on the use of value correspondences describing how a value of a target attribute can be created from a set of values of source attributes. A typical session with Clio starts with the user loading a source and a target schema into the system. These schemas are read from either an underlying Object-Relational database or from an XML file with an associated XML Schema. Users can then draw value correspondences mapping source attributes into target attributes. Clio's mapping engine incrementally produces the SQL queries that realize the mappings implied by the correspondences. Clio provides schema and data browsers and other feedback to allow users to understand the mapping produced. Entering and manipulating value correspondences can be done in two modes. In the Schema View mode, users see a representation of the source and target schema and create value correspondences by selecting schema objects from the source and mapping them to a target attribute. The alternative Data View mode offers a WYSIWYG interface for the mapping process that displays example data for both the source and target tables [3]. Users may add and delete value correspondences from this view and immediately see the changes reflected in the resulting target tuples. Also, the Data View mode helps users navigate through alternative mappings, understanding the often subtle differences between them. For example, in some cases, changing a join from an inner join to an outer join may dramatically change the resulting table. In other cases, the same change may have no effect due to constraints that hold on the source
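
A hypothetical sketch of the last step mentioned above, turning value correspondences into a SQL query that populates a target table; Clio's mapping engine also discovers join paths and reasons about inner versus outer joins, whereas here the FROM clause and all table and column names are supplied by hand.

```python
correspondences = [
    # (target attribute, SQL expression over the source relations)
    ("full_name", "c.first_name || ' ' || c.last_name"),
    ("city",      "a.city"),
]

def build_mapping_query(target_table, correspondences, source_from, where=None):
    """Emit an INSERT ... SELECT realizing the given value correspondences."""
    cols = ", ".join(t for t, _ in correspondences)
    exprs = ", ".join(e for _, e in correspondences)
    sql = f"INSERT INTO {target_table} ({cols})\nSELECT {exprs}\nFROM {source_from}"
    if where:
        sql += f"\nWHERE {where}"
    return sql

print(build_mapping_query(
    "customer_dim",
    correspondences,
    "customer c JOIN address a ON a.cust_id = c.id",
))
```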

01 Feb 2001
TL;DR: A taxonomy that covers many of the existing approaches to schema matching, and distinguishes between schema- and instance-level, element- and structure-level, and language- and constraint-based matchers, is presented.
Abstract: Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, in previous research many techniques have been proposed to achieve a partial automation of the Match operation for specific application domains. We present a taxonomy that covers many of the existing approaches, and we describe these approaches in some detail. In particular, we distinguish between schema- and instance-level, element- and structure-level, and language- and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

Journal ArticleDOI
TL;DR: More work must be done to develop domain-independent tools that solve the data cleaning problems associated with data warehouse development, and to achieve better synergy between database systems and data mining technology.
Abstract: Decision support systems form the core of business IT infrastructures because they let companies translate business information into tangible and lucrative results. Collecting, maintaining, and analyzing large amounts of data, however, involves expensive technical challenges that require organizational commitment. Many commercial tools are available for each of the three major data warehousing tasks: populating the data warehouse from independent operational databases, storing and managing the data, and analyzing the data to make intelligent business decisions. Data cleaning relates to heterogeneous data integration, a problem studied for many years. More work must be done to develop domain-independent tools that solve the data cleaning problems associated with data warehouse development. Most data mining research has focused on developing algorithms for building more accurate models or building models faster. However, data preparation and mining model deployment present several engaging problems that relate specifically to achieving better synergy between database systems and data mining technology.

Journal ArticleDOI
01 Apr 2001
TL;DR: This paper proposes the concept of pre-large itemsets and designs a novel, efficient, incremental mining algorithm based on it, which doesn't need to rescan the original database until a number of transactions have been newly inserted.
Abstract: Due to the increasing use of very large databases and data warehouses, mining useful information and helpful knowledge from transactions is evolving into an important research area. In the past, researchers usually assumed databases were static to simplify data mining problems. Thus, most of the classic algorithms proposed focused on batch mining, and did not utilize previously mined information in incrementally growing databases. In real-world applications, however, developing a mining algorithm that can incrementally maintain discovered information as a database grows is quite important. In this paper, we propose the concept of pre-large itemsets and design a novel, efficient, incremental mining algorithm based on it. Pre-large itemsets are defined by a lower support threshold and an upper support threshold. They act as gaps to avoid the movements of itemsets directly from large to small and vice-versa. The proposed algorithm doesn't need to rescan the original database until a number of transactions have been newly inserted. If the database has grown larger, then the number of new transactions allowed will be larger too.
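
A small sketch of the pre-large idea: itemsets are classified by the two thresholds, and a bound determines how many new transactions can be absorbed before a rescan. The bound used here, f = floor((Su - Sl) * d / (1 - Su)), is the safety bound commonly cited for this scheme and should be treated as an assumption rather than a quotation from the paper.

```python
import math

def classify(support, s_lower, s_upper):
    """Large / pre-large / small classification of an itemset's support ratio."""
    if support >= s_upper:
        return "large"
    if support >= s_lower:
        return "pre-large"
    return "small"

def max_new_transactions_before_rescan(d, s_lower, s_upper):
    """Assumed safety bound: new transactions absorbable without rescanning."""
    return math.floor((s_upper - s_lower) * d / (1 - s_upper))

d = 10_000                      # transactions in the original database
s_lower, s_upper = 0.04, 0.05   # lower and upper support thresholds
print(classify(0.045, s_lower, s_upper))                        # 'pre-large'
print(max_new_transactions_before_rescan(d, s_lower, s_upper))  # 105
```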

Journal ArticleDOI
TL;DR: A new dimension, called the data span dimension, is introduced, which allows user-defined selections of a temporal subset of the database, and a generic algorithm is described that takes any traditional incremental model maintenance algorithm and transforms it into an algorithm that allows restrictions on the data span dimension.
Abstract: Data mining algorithms have been the focus of much research. In practice, the input data to a data mining process resides in a large data warehouse whose data is kept up-to-date through periodic or occasional addition and deletion of blocks of data. Most data mining algorithms have either assumed that the input data is static, or have been designed for arbitrary insertions and deletions of data records. We consider a dynamic environment that evolves through systematic addition or deletion of blocks of data. We introduce a new dimension, called the data span dimension, which allows user-defined selections of a temporal subset of the database. Taking this new degree of freedom into account, we describe efficient model maintenance algorithms for frequent item sets and clusters. We then describe a generic algorithm that takes any traditional incremental model maintenance algorithm and transforms it into an algorithm that allows restrictions on the data span dimension. We also develop an algorithm for automatically discovering a specific class of interesting block selection sequences. In a detailed experimental study, we examine the validity and performance of our ideas on synthetic and real datasets.
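
The sketch below only illustrates the selection semantics of the data span dimension for block-evolving data, assuming invented per-block counts: frequent itemsets over a user-selected span of blocks are obtained by combining just those blocks. The paper's maintenance algorithms avoid the naive recounting implied here.

```python
from collections import Counter
from itertools import combinations

def count_block(transactions, max_size=2):
    """Itemset counts (up to max_size items) for one block of transactions."""
    c = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                c[itemset] += 1
    return c

def frequent_over_span(blocks, span, min_support):
    """blocks: list of (counts, n_transactions); span: slice of block indices."""
    total, n = Counter(), 0
    for counts, n_tx in blocks[span]:
        total += counts
        n += n_tx
    return {itemset: c for itemset, c in total.items() if c / n >= min_support}

b1 = (count_block([{"a", "b"}, {"a", "c"}]), 2)   # older block
b2 = (count_block([{"a", "b"}, {"b", "c"}]), 2)   # most recent block
print(frequent_over_span([b1, b2], slice(-1, None), min_support=0.5))
```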

Book ChapterDOI
04 Jan 2001
TL;DR: In this article, the authors propose a new characterization of minimal functional dependencies, which provides a formal framework simpler than previous proposals and is more efficient than the best operational solution (according to their knowledge): the algorithm Tane.
Abstract: Discovering functional dependencies from existing databases is an important technique strongly required in database design and administration tools. Investigated for many years, this issue has recently been addressed from a data mining viewpoint, in a novel and more efficient way that follows the principles of level-wise algorithms. In this paper, we propose a new characterization of minimal functional dependencies which provides a formal framework simpler than previous proposals. The algorithm defined to enforce our approach has been implemented and evaluated experimentally. It is more efficient (in whatever configuration of original data) than the best operational solution known to us: the algorithm Tane. Moreover, our approach also performs (without additional execution time) the mining of embedded functional dependencies, i.e. dependencies holding for a subset of the attribute set initially considered (e.g. for materialized views widely used in particular for managing data warehouses).
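
For readers unfamiliar with the problem, the sketch below checks whether a single functional dependency X -> A holds in a small invented relation; discovery algorithms such as the paper's (or Tane) instead search the lattice of attribute sets level-wise rather than testing candidate dependencies one at a time.

```python
def fd_holds(rows, lhs, rhs):
    """True iff every combination of lhs-values determines a single rhs-value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True

emp = [
    {"dept": "R&D",   "manager": "Ada",  "city": "Oslo"},
    {"dept": "R&D",   "manager": "Ada",  "city": "Oslo"},
    {"dept": "Sales", "manager": "Alan", "city": "Oslo"},
]
print(fd_holds(emp, ["dept"], "manager"))   # True:  dept -> manager
print(fd_holds(emp, ["city"], "manager"))   # False: city does not determine manager
```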

Proceedings ArticleDOI
02 Apr 2001
TL;DR: A new index structure called the SB-tree is introduced, which incorporates features from both segment trees and B-trees, supports the fast lookup of aggregate results based on time, and can be maintained efficiently when the data changes.
Abstract: Considers the problems of computing aggregation queries in temporal databases and of maintaining materialized temporal aggregate views efficiently. The latter problem is particularly challenging, since a single data update can cause aggregate results to change over the entire time-line. We introduce a new index structure called the SB-tree, which incorporates features from both segment trees (S-trees) and B-trees. SB-trees support the fast lookup of aggregate results based on time, and can be maintained efficiently when the data changes. We also extend the basic SB-tree index to handle cumulative (also called moving-window) aggregates. For materialized aggregate views in a temporal database or data warehouse, we propose building and maintaining SB-tree indices instead of the views themselves.
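
A naive, hypothetical baseline for the problem the SB-tree indexes: computing a temporal SUM aggregate from interval-stamped tuples by sweeping over interval endpoints. A single update here forces recomputing the whole time-line, which is exactly the cost the SB-tree is designed to avoid.

```python
def temporal_sum(tuples):
    """tuples: (start, end, value) with half-open valid-time intervals [start, end)."""
    events = []
    for start, end, value in tuples:
        events.append((start, value))    # value becomes live at start
        events.append((end, -value))     # ...and expires at end
    events.sort()
    result, running, prev_t = [], 0, None
    for t, delta in events:
        if prev_t is not None and t > prev_t and running != 0:
            result.append((prev_t, t, running))   # SUM is constant on [prev_t, t)
        running += delta
        prev_t = t
    return result

print(temporal_sum([(1, 5, 10), (3, 8, 5)]))
# [(1, 3, 10), (3, 5, 15), (5, 8, 5)]
```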

Book
01 Jan 2001
TL;DR: This book covers the fundamentals of data warehousing specifically for the IT professional who wants to get into the field.
Abstract: From the Publisher: This book covers the fundamentals of data warehousing specifically for the IT professional who wants to get into the field. It covers all significant topics, including planning requirements, architecture, infrastructure, design, data preparation, information delivery, implementation, and maintenance.

Proceedings ArticleDOI
29 Nov 2001
TL;DR: It is shown that the e-commerce domain can provide all the right ingredients for successful data mining and an integrated architecture for supporting this integration is described, which can dramatically reduce the pre-processing, cleaning, and data understanding effort in knowledge discovery projects.
Abstract: We show that the e-commerce domain can provide all the right ingredients for successful data mining. We describe an integrated architecture for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We emphasize the need for data collection at the application server layer (not the Web server) in order to support logging of data and metadata that is essential to the discovery process. We describe the data transformation bridges required from the transaction processing systems and customer event streams (e.g., clickstreams) to the data warehouse. We detail the mining workbench, which needs to provide multiple views of the data through reporting, data mining algorithms, visualization, and OLAP. We conclude with a set of challenges.

Journal ArticleDOI
TL;DR: A technique for declaratively specifying suitable reconciliation correspondences to be used in order to solve conflicts among data in different sources and the main goal of the method is to support the design of mediators that materialize the data in the Data Warehouse relations.
Abstract: Information integration is one of the most important aspects of a Data Warehouse. When data passes from the sources of the application-oriented operational environment to the Data Warehouse, possible inconsistencies and redundancies should be resolved, so that the warehouse is able to provide an integrated and reconciled view of data of the organization. We describe a novel approach to data integration in Data Warehousing. Our approach is based on a conceptual representation of the Data Warehouse application domain, and follows the so-called local-as-view paradigm: both source and Data Warehouse relations are defined as views over the conceptual model. We propose a technique for declaratively specifying suitable reconciliation correspondences to be used in order to solve conflicts among data in different sources. The main goal of the method is to support the design of mediators that materialize the data in the Data Warehouse relations. Starting from the specification of one such relation as a query over the conceptual model, a rewriting algorithm reformulates the query in terms of both the source relations and the reconciliation correspondences, thus obtaining a correct specification of how to load the data in the materialized view.