
Showing papers on "Data warehouse published in 1999"


Book
22 Dec 1999
TL;DR: This one-stop guide to choosing the right tools and technologies for a state-of-the-art data management strategy built on a Customer Relationship Management (CRM) framework helps you understand the principles of data warehousing and data mining systems and carefully spells out techniques for applying them so that your business gets the biggest pay-off possible.
Abstract: From the Publisher: How data mining delivers a powerful competitive advantage! Are you fully harnessing the power of information to support management and marketing decisions? You will, with this one-stop guide to choosing the right tools and technologies for a state-of-the-art data management strategy built on a Customer Relationship Management (CRM) framework. Authors Alex Berson, Stephen Smith, and Kurt Thearling help you understand the principles of data warehousing and data mining systems, and carefully spell out techniques for applying them so that your business gets the biggest pay-off possible. Find out about Online Analytical Processing (OLAP) tools that quickly navigate within your collected data. Explore privacy and legal issues... evaluate current data mining application packages... and let real-world examples show you how data mining can impact — and improve — all of your key business processes. Start uncovering your best prospects and offering them the products they really want (not what you think they want)!

637 citations


Book
28 Dec 1999
TL;DR: Mastering Data Mining shifts the focus from understanding data mining techniques to achieving business results, placing particular emphasis on customer relationship management.
Abstract: From the Publisher: "Berry and Linoff lead the reader down an enlightened path of best practices." -Dr. Jim Goodnight, President and Cofounder, SAS Institute Inc."This is a great book, and it will be in my stack of four or five essential resources for my professional work." -Ralph Kimball, Author of The Data Warehouse Lifecycle ToolkitMastering Data MiningIn this follow-up to their successful first book, Data Mining Techniques, Michael J. A. Berry and Gordon S. Linoff offer a case study-based guide to best practices in commercial data mining. Their first book acquainted you with the new generation of data mining tools and techniques and showed you how to use them to make better business decisions. Mastering Data Mining shifts the focus from understanding data mining techniques to achieving business results, placing particular emphasis on customer relationship management.In this book, you'll learn how to apply data mining techniques to solve practical business problems. After providing the fundamental principles of data mining and customer relationship management, Berry and Linoff share the lessons they have learned through a series of warts-and-all case studies drawn from their experience in a variety of industries, including e-commerce, banking, cataloging, retailing, and telecommunications.Through the cases, you will learn how to formulate the business problem, analyze the data, evaluate the results, and utilize this information for similar business problems in different industries.Berry and Linoff show you how to use data mining to:* Retain customer loyalty* Target the right prospects* Identify new markets for products and services* Recognize cross-selling opportunities on and off the Web. Thecompanion Web site features:* Updated information on data mining products and service providers* Information on data mining conferences, courses, and other sources of information* Full-color versions of the illustrations used in the book

470 citations


Journal ArticleDOI
TL;DR: Improved access to large electronic data sets, reliable and consistent annotation and effective tools for 'data mining' are critical to realize the full potential of whole-genome RNA expression studies.
Abstract: Technologies for whole-genome RNA expression studies are becoming increasingly reliable and accessible. However, universal standards to make the data more suitable for comparative analysis and for inter-operability with other information resources have yet to emerge. Improved access to large electronic data sets, reliable and consistent annotation and effective tools for 'data mining' are critical. Analysis methods that exploit large data warehouses of gene expression experiments will be necessary to realize the full potential of this technology.

434 citations


Book
07 Dec 1999
TL;DR: The Knowledge Management Toolkit as mentioned in this paper is the only "how-to" guide for building an enterprise knowledge management system from start to finish, showing how every stage can serve as a foundation for later enhancements.
Abstract: The only "how-to" guide for building an enterprise knowledge management system! Until now, implementing Knowledge Management (KM) has been like nailing jelly to the wall, but not anymore! The Knowledge Management Toolkit delivers hands-on techniques and tools for making KM happen at your company. You'll learn exactly how to use KM to make sure that every key decision is fully informed as you build on your existing intranet, data warehouse, and project management investments. Top researcher Amrit Tiwana walks you through the development of an enterprise KM system from start to finish, showing how every stage can serve as a foundation for later enhancements: * 10-step roadmap for implementing KM successfully * Checklists help you focus on critical issues every step of the way * Interactive toolkit format guides your strategic design decisions * Identify your key intangibles, and audit the knowledge you already have * Staff your project team and manage it effectively * Build a foundation of KM infrastructure that can evolve through results-driven, incremental steps * Mobilize your organization's subtle, "tacit" knowledge * Calculate and maximize ROI in KM systems * www.kmtoolkit.com: stay informed with the author's dedicated Web site, which provides ongoing support and updates from the KM community! Discover the best ways to align KM with business strategy, avoid key KM pitfalls such as excessive formalization and overreliance on technology, master prototyping, and understand the new role of the Chief Knowledge Officer. Tiwana also presents KM case studies from leading companies worldwide, from Nortel to Rolls Royce. If you're ready to transform KM from business-school theory to real-world competitive advantage, start right here! CD-ROM INCLUDED: Knowledge Management Toolkit, including an interactive 10-step KM roadmap and easy-to-customize KM evaluation forms, complete and unrestricted! MindManager Personal for creating, organizing, and sharing knowledge maps. Performance Now Enterprise, a trial version of the #1 change management tool. FrontPage 2000 45-day trial. Plus great tools for data mining, integrating mobile systems, workflow, modeling, and more!

420 citations


Journal ArticleDOI
01 Jun 1999
TL;DR: A novel method is presented that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner and provides significantly more accurate results than other efficient approximation techniques such as random sampling.
Abstract: Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries. In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy. We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.
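The essence of the compact data cube, loosely speaking, is to wavelet-transform the aggregate array and keep only the largest coefficients, then answer aggregate queries from that lossy representation. The sketch below is a minimal one-dimensional illustration using non-normalized Haar wavelets; the array, the cell values, and the coefficient budget k are hypothetical, and the paper's actual construction handles sparse, high-dimensional arrays with I/O-efficient algorithms.

```python
import numpy as np

def haar_decompose(data):
    """Full 1-D Haar decomposition (length must be a power of two)."""
    out = np.array(data, dtype=float)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = (out[:n:2] + out[1:n:2]) / 2.0
        diff = (out[:n:2] - out[1:n:2]) / 2.0
        out[:half], out[half:n] = avg, diff
        n = half
    return out

def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    out = np.array(coeffs, dtype=float)
    n = 1
    while n < len(out):
        avg, diff = out[:n].copy(), out[n:2 * n].copy()
        out[0:2 * n:2] = avg + diff
        out[1:2 * n:2] = avg - diff
        n *= 2
    return out

def compact_cube(measure, k):
    """Keep only the k largest-magnitude wavelet coefficients."""
    c = haar_decompose(measure)
    keep = np.argsort(np.abs(c))[-k:]
    sparse = np.zeros_like(c)
    sparse[keep] = c[keep]
    return sparse

# Toy example: a sparse 1-D "cube" of sales totals over 16 cells.
sales = np.array([0, 0, 5, 0, 0, 12, 0, 0, 0, 0, 0, 7, 0, 0, 3, 0], dtype=float)
approx = haar_reconstruct(compact_cube(sales, k=4))
print("exact SUM over cells 4..11: ", sales[4:12].sum())
print("approx SUM over cells 4..11:", approx[4:12].sum())
```

Keeping more coefficients tightens the approximation; keeping fewer shrinks the stored representation, which is the trade-off the compact data cube exposes.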

363 citations


Book
01 Jun 1999
TL;DR: This comprehensive volume, with a foreword by Jeff Ullman of Stanford University, will serve as a reference for students and commercial users, and encourage further use and development of materialized views.
Abstract: When an application is built, an underlying data model is chosen to make that application effective. Frequently, other applications need the same data, only modeled differently. The naive solution of copying the underlying data and modeling is costly in terms of storage and makes data maintenance and evolution impossible. View mechanisms are a technique to model data differently for various applications without affecting the underlying format and structure of the data. The technique enables applications to customize shared data objects without affecting other applications that use the same objects. The growing data-manipulation needs of companies cannot be met by existing legacy systems that contain valuable data. Thus view mechanisms are becoming increasingly important as a way to model and use legacy data in new applications. Materialized views are views that have been computed and stored in databases. Because they reduce the need to recompute the view and/or data being queried, they speed up the querying of large amounts of data. Further, because they provide a systematic way to describe how to recompute the data, maintenance and evolution can be automated. Materialized views are especially useful in data warehousing, query optimization, integrity constraint maintenance, online analytical processing, and applications such as billing, banking, and retailing. This comprehensive volume, with a foreword by Jeff Ullman of Stanford University, will serve as a reference for students and commercial users, and encourage further use and development of materialized views.

303 citations


Journal ArticleDOI
TL;DR: A conceptual framework for enhancing data quality in data warehouse environments is offered, and the factors that should be considered, such as the current level of data quality, the levels of quality needed by the relevant decision processes, and the potential benefits of projects designed to enhance data quality, are explored.
Abstract: Decisions by senior management lay the groundwork for lower corporate levels to develop policies and procedures for various corporate activities. However, the potential business contribution of these activities depends on the quality of the decisions and, in turn, on the quality of the data used to make them. Some inputs are judgmental, others are from transactional systems, and still others are from external sources, but all must have a level of quality appropriate for the decisions they will be part of. Although concern about the quality of one's data is not new, what is fairly recent is using the same data for multiple purposes, which can be quite different from their original purposes. Users working with a particular data set come to know and internalize its deficiencies and idiosyncrasies. This knowledge is lost when data is made available to other parties, as when data needed for decision making is collected in repositories called data warehouses. Here, we offer a conceptual framework for enhancing data quality in data warehouse environments. We explore the factors that should be considered, such as the current level of data quality, the levels of quality needed by the relevant decision processes, and the potential benefits of projects designed to enhance data quality. Those responsible for data quality have to understand the importance of such factors, as well as the interaction among them. This understanding is mandatory in data warehousing environments characterized by multiple users with differing needs for data quality. For warehouses supporting a limited number of decision processes, awareness of these issues coupled with good judgment should suffice. For more complex situations, however, the number and diversity of trade-offs make reliance on judgment alone problematic; for such situations, we offer a methodology. Nothing is more likely to undermine the performance and business value of a data warehouse than inappropriate, misunderstood, or ignored data quality.

298 citations


Journal Article
TL;DR: This article develops algorithms to select a set of views to materialize in a data warehouse in order to minimize the total query response time under the constraint of a given total view maintenance time, and designs an A* heuristic that delivers an optimal solution.
Abstract: A data warehouse stores materialized views derived from one or more sources for the purpose of efficiently implementing decision-support or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The goal is to select an appropriate set of views that minimizes total query response time and/or the cost of maintaining the selected views, given a limited amount of resource such as materialization time, storage space, or total view maintenance time. In this article, we develop algorithms to select a set of views to materialize in a data warehouse in order to minimize the total query response time under the constraint of a given total view maintenance time. As the above maintenance-cost view-selection problem is extremely intractable, we tackle some special cases and design approximation algorithms. First, we design an approximation greedy algorithm for the maintenance-cost view-selection problem in OR view graphs, which arise in many practical applications, e.g., data cubes. We prove that the query benefit of the solution delivered by the proposed greedy heuristic is within 63% of that of the optimal solution. Second, we also design an A* heuristic that delivers an optimal solution for the general case of AND-OR view graphs. We implemented our algorithms and a performance study of the algorithms shows that the proposed greedy algorithm for OR view graphs almost always delivers an optimal solution.
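The flavor of the greedy heuristic can be conveyed with a much simplified sketch: repeatedly materialize the view with the best ratio of query benefit to maintenance cost while the maintenance-time budget holds. The benefit model below treats views as independent, which the paper's OR and AND-OR view graphs do not; all names and numbers are illustrative assumptions rather than the authors' algorithm.

```python
def greedy_view_selection(views, maintenance_budget):
    """Greedy heuristic: materialize views by benefit per unit of
    maintenance cost until the maintenance-time budget is exhausted.

    `views` maps a view name to a dict with:
      'benefit'     - estimated reduction in total query response time
      'maintenance' - estimated view maintenance time
    """
    selected, used = [], 0.0
    remaining = dict(views)
    while remaining:
        # Consider only views that still fit in the maintenance budget.
        candidates = [(name, v) for name, v in remaining.items()
                      if used + v['maintenance'] <= maintenance_budget]
        if not candidates:
            break
        # Pick the view with the best benefit/maintenance ratio.
        name, v = max(candidates, key=lambda nv: nv[1]['benefit'] / nv[1]['maintenance'])
        selected.append(name)
        used += v['maintenance']
        del remaining[name]
    return selected, used

# Hypothetical data-cube views (benefit and maintenance times in arbitrary units).
views = {
    'by_product':        {'benefit': 90, 'maintenance': 30},
    'by_region':         {'benefit': 60, 'maintenance': 15},
    'by_product_region': {'benefit': 40, 'maintenance': 40},
}
print(greedy_view_selection(views, maintenance_budget=50))
# -> (['by_region', 'by_product'], 45.0)
```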

289 citations


Proceedings ArticleDOI
06 Nov 1999
TL;DR: In this article, the starER model combines the star structure, which is dominant in data warehouses, with the semantically rich constructs of the ER model; special types of relationships have been further added to support hierarchies.
Abstract: Modeling data warehouses is a complex task focusing, very often, on internal structures and implementation issues. In this paper we argue that, in order to accurately reflect the users' requirements in an error-free, understandable, and easily extendable data warehouse schema, special attention should be paid at the conceptual modeling phase. Based on a real mortgage business warehouse environment, we present a set of user modeling requirements and we discuss the involved concepts. Understanding the semantics of these concepts allows us to build a conceptual model—namely, the starER model—for their efficient handling. More specifically, the starER model combines the star structure, which is dominant in data warehouses, with the semantically rich constructs of the ER model; special types of relationships have been further added to support hierarchies. We present an evaluation of the starER model as well as a comparison of the proposed model with other existing models, pointing out differences and similarities. Examples from a mortgage data warehouse environment, in which starER is tested, reveal the ease of understanding of the model, as well as the efficiency in representing complex information at the semantic level.

279 citations


Patent
07 Jul 1999
TL;DR: An apparatus and corresponding method for selecting multimedia information, such as video, audio, graphics and text residing on a plurality of Data Warehouses, relational database management systems (RDMS) or object-oriented database systems (ODBA) connected to the Internet or other network, and for linking the multimedia information across the Internet, or other networks, to any phrase, word, sentence and paragraph of text.
Abstract: An apparatus and corresponding method for selecting multimedia information, such as video, audio, graphics and text residing on a plurality of Data Warehouses, relational database management systems (RDMS) or object-oriented database systems (ODBA) connected to the Internet or other network, and for linking the multimedia information across the Internet, or other network, to any phrase, word, sentence and paragraph of text; or numbers; or maps, charts, and tables; or still pictures and/or graphics; or moving pictures and/or graphics; or audio elements contained in documents on an Internet or intranet web site so that any viewer of a web site, or other network resource, can directly access updated information in the Data Warehouse or a database in real time are disclosed. The apparatus and corresponding method each: (i) stores a plurality of predetermined authentication procedures (such as user names and passwords) to gain admittance to Data Warehouses or databases, (ii) stores the Universal Resource Locators of intranet and Internet addresses of a plurality of expert-predetermined optimum databases or Data Warehouses containing text, audio, video and graphic information, or multimedia information relating to the information on the web site or other network resource; (iii) stores a plurality of expert-predetermined optimum queries for use in the search engines of each of the pre-selected databases, each query representing a discrete searchable concept as expressed by a word, phrase, sentence or paragraph of text, or any other media such as audio and video on a web site, or other network resource; and (iv) presents to the user the results of a search of the Data Warehouse or database through a graphical user interface (GUI) which coordinates and correlates viewer selection criteria with the expert optimum remote database selection and queries.

273 citations


Journal ArticleDOI
01 Jun 1999
TL;DR: DynaMat is presented, a system that dynamically materializes information at multiple levels of granularity in order to match the demand but also takes into account the maintenance restrictions for the warehouse, such as down time to update the views and space availability.
Abstract: Pre-computation and materialization of views with aggregate functions is a common technique in Data Warehouses. Due to the complex structure of the warehouse and the different profiles of the users who submit queries, there is a need for tools that will automate the selection and management of the materialized data. In this paper we present DynaMat, a system that dynamically materializes information at multiple levels of granularity in order to match the demand (workload) but also takes into account the maintenance restrictions for the warehouse, such as down time to update the views and space availability. DynaMat unifies the view selection and the view maintenance problems under a single framework using a novel “goodness” measure for the materialized views. DynaMat constantly monitors incoming queries and materializes the best set of views subject to the space constraints. During updates, DynaMat reconciles the current materialized view selection and refreshes the most beneficial subset of it within a given maintenance window. We compare DynaMat against a system that is given all queries in advance and the pre-computed optimal static view selection. The comparison is made based on a new metric, the Detailed Cost Savings Ratio, introduced for quantifying the benefits of view materialization against incoming queries. These experiments show that DynaMat's dynamic view selection outperforms the optimal static view selection and thus any sub-optimal static algorithm that has appeared in the literature.
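A toy sketch of the dynamic-materialization idea: keep a pool of materialized query results under a space budget, track how useful each one is, and evict the least valuable entries when new results need room. The goodness measure below (hits weighted by recency per unit size) and all names are illustrative assumptions, not DynaMat's actual metric or admission policy.

```python
import time

class ViewPool:
    """Toy view pool in the spirit of a dynamic materialization system:
    cache materialized query results under a space budget and evict the
    least 'good' entries when space runs out."""

    def __init__(self, space_budget):
        self.space_budget = space_budget
        self.pool = {}  # query signature -> (result, size, hits, last_used)

    def _goodness(self, entry):
        # Crude stand-in metric: frequently and recently used, small results win.
        _, size, hits, last_used = entry
        return (hits * last_used) / size

    def lookup(self, signature):
        """Return a cached result and update its usage statistics, or None."""
        if signature in self.pool:
            result, size, hits, _ = self.pool[signature]
            self.pool[signature] = (result, size, hits + 1, time.time())
            return result
        return None

    def admit(self, signature, result, size):
        """Evict lowest-goodness fragments until the new result fits."""
        while sum(e[1] for e in self.pool.values()) + size > self.space_budget:
            if not self.pool:
                return False  # result larger than the whole budget
            victim = min(self.pool, key=lambda s: self._goodness(self.pool[s]))
            del self.pool[victim]
        self.pool[signature] = (result, size, 1, time.time())
        return True
```

During a maintenance window, a real system would additionally decide which pooled fragments are worth refreshing within the available down time, which this sketch leaves out.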

Journal Article
TL;DR: This paper looks at database middleware systems as transformation engines, and discusses when and how data is transformed to provide users with the information they need.
Abstract: Many applications today need information from diverse data sources, in which related data may be represented quite differently. In one common scenario, a DBA wants to add data from a new source to an existing warehouse. The data in the new source may not match the existing warehouse schema. The new data may also be partially redundant with that in the existing warehouse, or formatted differently. Other applications may need to integrate data more dynamically, in response to user queries. Even applications using data from a single source often want to present it in a form other than the one in which it is stored. For example, a user may want to publish some information using a particular XML DTD, though the data is not stored in that form. In each of these scenarios, one or more data sets must be mapped into a single target representation. Needed transformations may include schema transformations (changing the structure of the data) [BLN86, RR98] and data transformation and cleansing (changing the format and vocabulary of the data and eliminating or at least reducing duplicates and errors) [Val, ETI, ME97, HS95]. In each area, there is a broad range of possible transformations, from simple to complex. Schema and data transformation have typically been studied separately. We believe they need to be handled together via a uniform mechanism. Database middleware systems [PGMW95, TRV96, ACPS96, Bon95] integrate data from multiple sources. To be effective, such systems must provide one or more integrated schemas, and must be able to transform data from different sources to answer queries against these schemas. The power of their query engines and their ability to connect to several information sources makes them a natural base for doing more complex transformations as well. In this paper, we look at database middleware systems as transformation engines, and discuss when and how data is transformed to provide users with the information they need.

Journal ArticleDOI
TL;DR: The authors' adaptive resampling approach surpasses previous decision-tree performance and validates the effectiveness of small, pooled local dictionaries.
Abstract: The authors' adaptive resampling approach surpasses previous decision-tree performance and validates the effectiveness of small, pooled local dictionaries. They demonstrate their approach using the Reuters-21578 benchmark data and a real-world customer E-mail routing system.

Proceedings ArticleDOI
23 Mar 1999
TL;DR: A formal model of dimension updates in a multidimensional model, a collection of primitive operators to perform them, and a study of the effect of these updates on a class of materialized views are presented, giving an algorithm to efficiently maintain them.
Abstract: OLAP systems support data analysis through a multidimensional data model, according to which data facts are viewed as points in a space of application-related "dimensions", organized into levels which conform to a hierarchy. The usual assumption is that the data points reflect the dynamic aspect of the data warehouse, while dimensions are relatively static. However, in practice, dimension updates are often necessary to adapt the multidimensional database to changing requirements. Structural updates can also take place, like addition of categories or modification of the hierarchical structure. When these updates are performed, the materialized aggregate views that are typically stored in OLAP systems must be efficiently maintained. These updates are poorly supported (or not supported at all) in current commercial systems, and have received little attention in the research literature. We present a formal model of dimension updates in a multidimensional model, a collection of primitive operators to perform them, and a study of the effect of these updates on a class of materialized views, giving an algorithm to efficiently maintain them.
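To make the dimension-update setting concrete, here is a deliberately tiny sketch: facts at the product level, a product-to-category dimension mapping, and a materialized SUM view at the category level that is maintained incrementally when a product is reclassified. All names and values are hypothetical; the paper's model covers a much richer set of structural and instance updates and a general maintenance algorithm.

```python
from collections import defaultdict

# Toy setup: facts at the 'product' level, a dimension mapping product -> category,
# and a materialized view of sales aggregated by category.
sales_by_product = {'p1': 100, 'p2': 40, 'p3': 25}
product_to_category = {'p1': 'electronics', 'p2': 'electronics', 'p3': 'toys'}

view_by_category = defaultdict(float)
for product, amount in sales_by_product.items():
    view_by_category[product_to_category[product]] += amount

def reclassify(product, new_category):
    """Dimension update: move `product` under `new_category` and
    incrementally maintain the aggregate view (no full recomputation)."""
    old_category = product_to_category[product]
    amount = sales_by_product[product]
    view_by_category[old_category] -= amount
    view_by_category[new_category] += amount
    product_to_category[product] = new_category

reclassify('p2', 'toys')
print(dict(view_by_category))   # {'electronics': 100.0, 'toys': 65.0}
```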

Patent
06 Apr 1999
TL;DR: In this article, a system for querying disparate, heterogeneous data sources over a network, where at least some of the data sources are World Wide Web pages or other semi-structured data sources, includes a query converter, a command transmitter, and a data retriever.
Abstract: A system for querying disparate, heterogeneous data sources over a network, where at least some of the data sources are World Wide Web pages or other semi-structured data sources, includes a query converter, a command transmitter, and a data retriever. The query converter produces, from at least a portion of a query, a set of commands which can be used to interact with a semi-structured data source. The query converter may accept a request in the same form as normally used to access a relational data base, therefore increasing the number of data bases available to a user in a transparent manner. The command transmitter issues the produced commands to the semi-structured data source. The data retriever then retrieves the desired data from the data source. In this manner, structured queries may be used to access both traditional, relational data bases as well as non-traditional semi-structured data bases such as web sites and flat files. The system may also include a request translator and a data translator for providing data context interchange. The request translator translates a request for data having a first data context into a query having a second data context, which is then handled by the query converter described above. The data translator translates data retrieved from the data context of the data source into the data context associated with the request. A related method for querying disparate data sources over a network is also described.

Journal ArticleDOI
TL;DR: This paper identifies recent accomplishments and associated research needs of the near term in spatial databases, addressing the growing data management and analysis needs of spatial applications such as geographic information systems.
Abstract: Spatial databases, addressing the growing data management and analysis needs of spatial applications such as geographic information systems, have been an active area of research for more than two decades. This research has produced a taxonomy of models for space, spatial data types and operators, spatial query languages and processing strategies, as well as spatial indexes and clustering techniques. However, more research is needed to improve support for network and field data, as well as query processing (e.g., cost models, bulk load). Another important need is to apply spatial data management accomplishments to newer applications, such as data warehouses and multimedia information systems. The objective of this paper is to identify recent accomplishments and associated research needs of the near term.

Book ChapterDOI
10 Jan 1999
TL;DR: In this paper, the authors propose an approximation greedy algorithm for the maintenance-cost view-selection problem in OR view graphs, which arise in many practical applications, e.g., data cubes.
Abstract: A data warehouse stores materialized views derived from one or more sources for the purpose of efficiently implementing decision-support or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The goal is to select an appropriate set of views that minimizes total query response time and/or the cost of maintaining the selected views, given a limited amount of resource such as materialization time, storage space, or total view maintenance time. In this article, we develop algorithms to select a set of views to materialize in a data warehouse in order to minimize the total query response time under the constraint of a given total view maintenance time. As the above maintenance-cost view-selection problem is extremely intractable, we tackle some special cases and design approximation algorithms. First, we design an approximation greedy algorithm for the maintenance-cost view-selection problem in OR view graphs, which arise in many practical applications, e.g., data cubes. We prove that the query benefit of the solution delivered by the proposed greedy heuristic is within 63% of that of the optimal solution. Second, we also design an A* heuristic that delivers an optimal solution for the general case of AND-OR view graphs. We implemented our algorithms and a performance study of the algorithms shows that the proposed greedy algorithm for OR view graphs almost always delivers an optimal solution.

Journal ArticleDOI
TL;DR: This paper adapts the Goal-Question-Metric approach from software quality management to a meta data management environment in order to link these special techniques to a generic conceptual framework of DW quality.

Patent
24 Sep 1999
TL;DR: In this paper, a method, apparatus, article of manufacture, and a memory structure for controlling the collection and dissemination of data stored in a data warehouse is disclosed, which comprises the steps of accepting a request for a privacy card from a consumer, querying the consumer for consumer personal information and privacy preferences, storing a customer unique proxy identifying the customer in the data warehouse, and issuing a privacy card comprising the proxy to the customer.
Abstract: A method, apparatus, article of manufacture, and a memory structure for controlling the collection and dissemination of data stored in a data warehouse is disclosed. The method comprises the steps of accepting a request for a privacy card from a consumer, querying the consumer for consumer personal information and privacy preferences, storing a customer unique proxy identifying the customer in the data warehouse, and issuing a privacy card comprising the proxy to the customer. The program storage device comprises a medium for storing instructions performing the method steps outlined above. The apparatus comprises a means for accepting the request for a privacy card from the consumer and for querying the consumer for personal information and privacy preferences, such as a kiosk, ATM or internet connection, a data warehouse for storing the customer unique proxy, and a means for issuing the privacy card.

Journal ArticleDOI
TL;DR: Together, constraint-based and multidimensional techniques can provide a more ad hoc, query-driven process that exploits the semantics of data more effectively than the processes supported by current standalone data-mining systems.
Abstract: Although many data-mining methodologies and systems have been developed in recent years, the authors contend that by and large, present mining models lack human involvement, particularly in the form of guidance and user control. They believe that data mining is most effective when the computer does what it does best, like searching large databases or counting, and users do what they do best, like specifying the current mining session's focus. This division of labor is best achieved through constraint-based mining, in which the user provides constraints that guide a search. Mining can also be improved by employing a multidimensional, hierarchical view of the data. Current data warehouse systems have provided a fertile ground for systematic development of this multidimensional mining. Together, constraint-based and multidimensional techniques can provide a more ad hoc, query-driven process that exploits the semantics of data more effectively than the processes supported by current standalone data-mining systems.

Book ChapterDOI
30 Aug 1999
TL;DR: This paper presents several efficient techniques to pre-process the records before sorting them so that potentially matching records will be brought to a close neighbourhood and implements a data cleansing system which can detect and remove more duplicate records than existing methods.
Abstract: Given the rapid growth of data, it is important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing is crucial because of the "garbage in, garbage out" principle. "Dirty" data files are prevalent because of incorrect or missing data values, inconsistent value naming conventions, and incomplete information. Hence, we may have multiple records referring to the same real world entity. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them so that potentially matching records will be brought to a close neighbourhood. Based on these techniques, we implement a data cleansing system which can detect and remove more duplicate records than existing methods.
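The sorted-neighbourhood idea behind such systems can be sketched very simply: normalize each record into a key so that likely duplicates sort next to each other, then compare only records that fall within a small sliding window. The normalization, similarity function, window size, and threshold below are illustrative assumptions rather than the paper's specific techniques.

```python
import re

def normalize(record):
    """Illustrative pre-processing: lower-case, strip punctuation,
    collapse whitespace, and sort tokens so near-duplicates sort together."""
    text = re.sub(r'[^a-z0-9 ]', ' ', record.lower())
    return ' '.join(sorted(text.split()))

def jaccard(a, b):
    """Token-set similarity between two normalized keys."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def find_duplicates(records, window=3, threshold=0.5):
    """Sorted-neighbourhood style scan: sort by the normalized key and only
    compare records within a small sliding window of each other."""
    keyed = sorted((normalize(r), i) for i, r in enumerate(records))
    duplicates = set()
    for pos in range(len(keyed)):
        for other in range(pos + 1, min(pos + window, len(keyed))):
            a, b = keyed[pos], keyed[other]
            if jaccard(a[0], b[0]) >= threshold:
                duplicates.add(tuple(sorted((a[1], b[1]))))
    return duplicates

records = [
    "John A. Smith, 12 Main St.",
    "Smith, John  12 Main Street",
    "Mary Jones, 5 Oak Ave.",
]
print(find_duplicates(records))   # {(0, 1)}
```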

Journal ArticleDOI
TL;DR: A broad range of algorithms that are scalable to very large data sets is described, addressing three classical data mining problems: market basket analysis, clustering, and classification.
Abstract: Established companies have had decades to accumulate masses of data about their customers, suppliers, products and services, and employees. Data mining, also known as knowledge discovery in databases, gives organizations the tools to sift through these vast data stores to find the trends, patterns, and correlations that can guide strategic decision making. Traditionally, algorithms for data analysis assume that the input data contains relatively few records. Current databases, however, are much too large to be held in main memory. To be efficient, the data mining techniques applied to very large databases must be highly scalable. An algorithm is said to be scalable if, given a fixed amount of main memory, its runtime increases linearly with the number of records in the input database. Recent work has focused on scaling data mining algorithms to very large data sets. The authors describe a broad range of algorithms that address three classical data mining problems: market basket analysis, clustering, and classification.
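For the market-basket problem in particular, scalability typically comes from algorithms that make a small, fixed number of passes over the data instead of holding it in memory. The sketch below is a minimal two-pass, Apriori-style frequent-pair counter over a streamed transaction source; the transaction data, support threshold, and restriction to pairs are illustrative simplifications, not the survey's algorithms.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Two-pass, Apriori-style frequent-pair mining that streams over the
    transactions instead of holding them in memory."""
    # Pass 1: count single items.
    item_counts = Counter()
    for basket in transactions():
        item_counts.update(set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count only pairs whose items are both frequent (Apriori pruning).
    pair_counts = Counter()
    for basket in transactions():
        items = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(items, 2))
    return {p: c for p, c in pair_counts.items() if c >= min_support}

def transactions():
    # Stand-in for a scan over a large transaction table.
    yield from [['milk', 'bread'], ['milk', 'bread', 'beer'],
                ['bread', 'beer'], ['milk', 'bread']]

print(frequent_pairs(transactions, min_support=2))
# {('bread', 'milk'): 3, ('beer', 'bread'): 2}
```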

Journal Article
TL;DR: This work argues that the Entity Relationship Model is not suited for multidimensional conceptual modeling because the semantics of the main characteristics of the paradigm cannot be adequately represented, and presents a specialization of the E/R model - called the Multidimensional Entity Relationship (ME/R) Model.
Abstract: Multidimensional data modeling plays a key role in the design of a data warehouse. We argue that the Entity Relationship Model is not suited for multidimensional conceptual modeling because the semantics of the main characteristics of the paradigm cannot be adequately represented. Consequently, we present a specialization of the E/R model - called the Multidimensional Entity Relationship (ME/R) Model. In order to express the multidimensional structure of the data we define two specialized relationship sets and a specialized entity set. The resulting ME/R model allows the adequate conceptual representation of the multidimensional data view inherent to OLAP, namely the separation of qualifying and quantifying data and the complex structure of dimensions. We demonstrate the usability of the ME/R model by an example taken from an actual project dealing with the analysis of vehicle repairs.

Patent
18 Nov 1999
TL;DR: In this article, a wide area network (WAN) of vending machines connected to a host that builds a database of vending-related information received from the vending machines is described, where the data warehouse is made available to one or more bottlers for analysis of individual vending machine routing needs and profitability.
Abstract: A wide area network (WAN) of vending machines connected to a host that builds a database of vending-related information received from the vending machines. Also, a communications system within each vending machine having a vending machine data acquisition unit and a multiple-communication-technology adapter to interface the data acquisition unit to multiple communication technologies including at least one wireless technology. Also, a data structure used to build the database, the data structure having data elements corresponding to an identity of a machine, recent and previous prediction information for the machine, and recent and previous refill-visit information for the machine, the elements being linked together. The multiple vending machines communicate with a communications concentrator via one of many communication technologies. The communications concentrator interfaces the multiple vending machines to a data warehouse that builds a database using the data structure mentioned above. The data warehouse is made available to one or more bottlers for analysis of individual vending machine routing needs and profitability.

Journal ArticleDOI
01 Dec 1999
TL;DR: A model of a data cube and an algebra to support OLAP operations on this cube is proposed that is simple and intuitive, and the algebra provides a means to concisely express complex OLAP queries.
Abstract: Data warehousing and On-Line Analytical Processing (OLAP) are two of the most significant new technologies in the business data processing arena. A data warehouse can be defined as a “very large” repository of historical data pertaining to an organization. OLAP refers to the technique of performing complex analysis over the information stored in a data warehouse. The complexity of queries required to support OLAP applications makes it difficult to implement using standard relational database technology. Moreover, there is currently no standard conceptual model for OLAP. There is clearly a need for such a model and an algebra as evidenced by the numerous SQL extensions offered by many vendors of OLAP products. In this paper, we address this issue by proposing a model of a data cube and an algebra to support OLAP operations on this cube. The model we present is simple and intuitive, and the algebra provides a means to concisely express complex OLAP queries.
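The kind of algebraic operations such a cube model supports can be illustrated with a very small sketch: cells keyed by dimension values and two operators, a SUM roll-up onto a subset of dimensions and a slice that fixes one dimension to a value. The cube contents, dimension names, and operator signatures below are hypothetical and far simpler than the algebra the paper proposes.

```python
from collections import defaultdict

# Toy cube: each cell is keyed by a tuple of dimension values
# (product, region, month) and holds a sales measure.
cube = {
    ('tv',    'east', '1999-01'): 10,
    ('tv',    'west', '1999-01'): 7,
    ('radio', 'east', '1999-02'): 4,
    ('tv',    'east', '1999-02'): 6,
}
DIMS = ('product', 'region', 'month')

def roll_up(cube, keep_dims):
    """Aggregate the cube onto a subset of its dimensions (SUM)."""
    keep = [DIMS.index(d) for d in keep_dims]
    out = defaultdict(int)
    for cell, measure in cube.items():
        out[tuple(cell[i] for i in keep)] += measure
    return dict(out)

def slice_(cube, dim, value):
    """Restrict the cube to cells where `dim` equals `value`."""
    i = DIMS.index(dim)
    return {cell: m for cell, m in cube.items() if cell[i] == value}

print(roll_up(cube, ('product',)))            # {('tv',): 23, ('radio',): 4}
print(roll_up(slice_(cube, 'region', 'east'), ('month',)))
# {('1999-01',): 10, ('1999-02',): 10}
```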

Book ChapterDOI
01 Sep 1999
TL;DR: This paper presents a formal framework to describe evolutions of multidimensional schemas and their effects on the schema and on the instances and describes how the algebra enables a tool supported environment for schema evolution.
Abstract: Database systems offering a multidimensional schema on a logical level (e.g. OLAP systems) are often used in data warehouse environments. The user requirements in these dynamic application areas are subject to frequent changes. This implies frequent structural changes of the database schema. In this paper, we present a formal framework to describe evolutions of multidimensional schemas and their effects on the schema and on the instances. The framework is based on a formal conceptual description of a multidimensional schema and a corresponding schema evolution algebra. Thus, the approach is independent of the actual implementation (e.g. MOLAP or ROLAP). We also describe how the algebra enables a tool supported environment for schema evolution.

Patent
26 Feb 1999
TL;DR: In this article, a data services layer is disclosed which maintains a dictionary of conceptual information and physical information about the data, and requests are written in a conceptual query language (CQL) which substantially uses terms belonging to or derived from a natural language.
Abstract: A data services layer is disclosed which maintains a dictionary of conceptual information and physical information about the data. Machine-readable requests to access the data are in a form related to a conceptual organization of the data, and is not specific to a physical organization of the data. A machine-readable query to obtain a subset of the data is produced by referencing the dictionary of conceptual and physical information about the data. The conceptual information is obtained from an object-relational-model of the data, and the physical information indicates how the data is organized on the data storage medium. Requests are written in a conceptual query language (CQL) which substantially uses terms belonging to or derived from a natural language. CQL includes terms in the classes of names and concepts, and wherein name terms are used to describe objects in the object-relational-model of the data, and concept terms are used to specify the data subset desired. Concept terms specify Facts desired from the data, and filters and sort specifications to be applied to the Facts. In an example embodiment, the data is organized in rows, and CQL includes a select command that retrieves data in rows. A set of data representing a profile of performance characteristics related to how to retrieve data is provided, and queries are formed based at least in part on the performance characteristics.

Patent
24 May 1999
TL;DR: In this article, the authors propose a method of defining aggregate levels to be used in aggregation in a data store having one or more dimensions, where levels correspond to attributes in the dimension, so that data can be aggregated into aggregates corresponding to values of those attributes.
Abstract: A method of defining aggregate levels to be used in aggregation in a data store having one or more dimensions. Levels are defined corresponding to attributes in the dimension, so that data can be aggregated into aggregates corresponding to values of those attributes. The invention provides for the definition of sub-levels which act as levels but define which detail entries in the associated dimension will contribute to the sub-level. The invention also provides for the definition of level groups. A level group can replace a level in a level cross-product and such a cross product is then expanded before aggregation into a set of cross products, each containing one of the level group entries.

Book
07 Nov 1999
TL;DR: A practical guide to getting started with data mining, from reading the data and building a model to understanding the model and performing prediction.
Abstract: I. STARTING OUT. 1. Introduction to Data Mining. What Is Data Mining? Why Use Data Mining? Case Studies of Implementing Data Mining. A Process for Successfully Deploying Data Mining for Competitive Advantage. A Note on Privacy Issues. Summary. 2. Getting Started with Data Mining. Classification (Supervised Learning). Clustering (Unsupervised Learning). A Clustering Example. Visualization. Association (Market Basket). Assortment Optimization. Prediction. Estimation. Summary. 3. The Data-Mining Process. Discussion of Data-Mining Methodology. The Example. Data Preparation. Defining a Study. Reading the Data and Building a Model. Understanding Your Model. Prediction. Summary. 4. Data-Mining Algorithms. Introduction. Decision Trees. Genetic Algorithms. Neural Networks. Bayesian Belief Networks. Statistics. Advanced Algorithms for Association. Algorithms for Assortment Optimization. Summary. 5. The Data-Mining Marketplace. Introduction (Trends). Data-Mining Vendors. Visualization. Useful Web Sites/Commercially Available Code. Data Sources For Mining. Summary. II. A RAPID TUTORIAL. 6. A Look at Angoss: KnowledgeSEEKER. Introduction. Data Preparation. Defining the Study. Building the Model. Understanding the Model. Prediction. Summary. 7. A Look at RightPoint DataCruncher. Introduction. Data Preparation. Defining the Study. Read Your Data/Build a Discovery Model. Understanding the Model. Perform Prediction. Summary. III. INDUSTRY FOCUS. 8. Industry Applications of Data Mining. Data-Mining Applications in Banking and Finance. Data-Mining Applications in Retail. Data-Mining Applications in Healthcare. Data-Mining Applications in Telecommunications. Summary. 9. Enabling Data Mining through Data Warehouses. Introduction. A Data-Warehouse Example in Banking and Finance. A Data-Warehouse Example in Retail. A Data-Warehouse Example in Healthcare. A Data-Warehouse Example in Telecommunications. Summary. Appendix A: Data-Mining Vendors. Data-Mining Players. Visualization Tools. Useful Web Sites. Information Access Providers. Data-Warehousing Vendors. Appendix B: Installing Demo Software. Installing Angoss KnowledgeSEEKER Demo. Installing the RightPoint DataCruncher Demo. Appendix C: References. Index.

Book
01 Jun 1999
TL;DR: This paper proposes a method of maintaining aggregate views (the summary-delta table method), and uses it to solve two problems in maintaining summary tables in a warehouse: how to efficiently maintain a summary table while minimizing the batch window needed for maintenance, and how to maintain a large set of summary tables defined over the same base tables.
Abstract: Data warehouses contain large amounts of information, often collected from a variety of independent sources. Decision-support functions in a warehouse, such as on-line analytical processing (OLAP), involve hundreds of complex aggregate queries over large volumes of data. It is not feasible to compute these queries by scanning the data sets each time. Warehouse applications therefore build a large number of summary tables, or materialized aggregate views, to help them increase the system performance.As changes, most notably new transactional data, are collected at the data sources, all summary tables at the warehouse that depend upon this data need to be updated. Usually, source changes are loaded into the warehouse at regular intervals, usually once a day, in a batch window, and the warehouse is made unavailable for querying while it is updated. Since the number of summary tables that need to be maintained is often large, a critical issue for data warehousing is how to maintain the summary tables efficiently.In this paper we propose a method of maintaining aggregate views (the summary-delta table method), and use it to solve two problems in maintaining summary tables in a warehouse: (1) how to efficiently maintain a summary table while minimizing the batch window needed for maintenance, and (2) how to maintain a large set of summary tables defined over the same base tables.While several papers have addressed the issues relating to choosing and materializing a set of summary tables, this is the first paper to address maintaining summary tables efficiently.