
Showing papers on "Data warehouse" published in 2000


Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining data streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. * Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

23,600 citations


Journal Article
TL;DR: This work classifies data quality problems that are addressed by data cleaning and provides an overview of the main solution approaches and discusses current tool support for data cleaning.
Abstract: We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.

1,675 citations


Journal ArticleDOI
TL;DR: The lineage problem is formally defined, lineage tracing algorithms for relational views with aggregation are developed, and mechanisms for performing consistent lineage tracing in a multisource data warehousing environment are proposed.
Abstract: We consider the view data lineage problem in a warehousing environment: For a given data item in a materialized warehouse view, we want to identify the set of source data items that produced the view item. We formally define the lineage problem, develop lineage tracing algorithms for relational views with aggregation, and propose mechanisms for performing consistent lineage tracing in a multisource data warehousing environment. Our result can form the basis of a tool that allows analysts to browse warehouse data, select view tuples of interest, and then “drill-through” to examine the exact source tuples that produced the view tuples of interest.

463 citations
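To make the drill-through idea concrete, here is a minimal Python/sqlite3 sketch; the schema and data are invented and this is not the paper's algorithm. For an aggregate view, the lineage of a view tuple is the set of base tuples in its group, so drill-through simply re-applies the view's grouping predicate restricted to the selected group.

```python
# Minimal drill-through sketch in Python/sqlite3 (invented schema; not the
# paper's lineage-tracing algorithms). For an aggregate view, the lineage of a
# view tuple is the set of base tuples in its group, so drill-through
# re-applies the view's grouping predicate restricted to the selected group.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store TEXT, item TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 'widget', 10.0),
        ('north', 'widget', 15.0),
        ('south', 'gadget',  7.5);
    -- Materialized warehouse view: total sales per store.
    CREATE TABLE sales_by_store AS
        SELECT store, SUM(amount) AS total FROM sales GROUP BY store;
""")

def drill_through(store):
    """Return the source tuples that produced the view tuple for this store."""
    return conn.execute(
        "SELECT store, item, amount FROM sales WHERE store = ?", (store,)
    ).fetchall()

# Pick view tuples of interest, then trace their lineage back to the source.
for store, total in conn.execute("SELECT store, total FROM sales_by_store"):
    print((store, total), "<-", drill_through(store))
```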


Proceedings ArticleDOI
09 Oct 2000
TL;DR: Polaris is presented, an interface for exploring large multi-dimensional databases that extends the well-known Pivot Table interface; it includes an interface for constructing visual specifications of table-based graphical displays and the ability to generate a precise set of relational queries from the visual specifications.
Abstract: In the last several years, large multi-dimensional databases have become common in a variety of applications such as data warehousing and scientific computing. Analysis and exploration tasks place significant demands on the interfaces to these databases. Because of the size of the data sets, dense graphical representations are more effective for exploration than spreadsheets and charts. Furthermore, because of the exploratory nature of the analysis, it must be possible for the analysts to change visualizations rapidly as they pursue a cycle involving first hypothesis and then experimentation. The authors present Polaris, an interface for exploring large multi-dimensional databases that extends the well-known Pivot Table interface. The novel features of Polaris include an interface for constructing visual specifications of table based graphical displays and the ability to generate a precise set of relational queries from the visual specifications. The visual specifications can be rapidly and incrementally developed, giving the analyst visual feedback as they construct complex queries and visualizations.

420 citations
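A rough sketch of the core compilation step described above: turning a pivot-style shelf assignment into a GROUP BY query. The tiny specification format, table, and field names below are invented; this is not the actual Polaris specification language.

```python
# Sketch of compiling a pivot-style visual specification into a relational
# query, in the spirit of Polaris (the spec format, table, and field names are
# hypothetical and much simpler than the real system).
import sqlite3

def spec_to_sql(table, rows, cols, measure, agg="SUM"):
    dims = list(rows) + list(cols)
    select = ", ".join(dims + [f"{agg}({measure}) AS {measure}_{agg.lower()}"])
    group = ", ".join(dims)
    return f"SELECT {select} FROM {table} GROUP BY {group}"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(region TEXT, year INTEGER, product TEXT, profit REAL);
    INSERT INTO orders VALUES
        ('east', 1999, 'chairs', 120.0),
        ('east', 2000, 'chairs',  80.0),
        ('west', 2000, 'desks',  200.0);
""")

# Rows shelf: region; columns shelf: year; the mark encodes SUM(profit).
sql = spec_to_sql("orders", rows=["region"], cols=["year"], measure="profit")
print(sql)
print(conn.execute(sql).fetchall())
```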


Journal ArticleDOI
TL;DR: Data mining is the science of finding unexpected, valuable, or interesting structures in large data sets, taking ideas and methods from statistics, machine learning, database technology, and other areas.
Abstract: Data mining is the science of finding unexpected, valuable, or interesting structures in large data sets. It is an interdisciplinary activity, taking ideas and methods from statistics, machine learning, database technology, and other areas.

414 citations


Journal ArticleDOI
16 May 2000
TL;DR: This paper studies how to refresh a local copy of an autonomous data source to maintain the copy up-to-date, and defines two freshness metrics, change models of the underlying data, and synchronization policies.
Abstract: In this paper we study how to refresh a local copy of an autonomous data source to maintain the copy up-to-date. As the size of the data grows, it becomes more difficult to maintain the copy "fresh," making it crucial to synchronize the copy effectively. We define two freshness metrics, change models of the underlying data, and synchronization policies. We analytically study how effective the various policies are. We also experimentally verify our analysis, based on data collected from 270 web sites for more than 4 months, and we show that our new policy improves the "freshness" very significantly compared to current policies in use.

406 citations
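The sketch below illustrates, under assumed definitions, the kind of freshness metrics the paper works with: a local copy is "fresh" if it reflects the source's latest change, and its "age" is how long it has been stale. The Element record and timestamps are invented, not the paper's formal definitions.

```python
# Sketch of two freshness-style metrics under assumed definitions: a local copy
# is "fresh" if it reflects the source's latest change, and its "age" is how
# long it has been out of date. The Element record and timestamps are invented.
from dataclasses import dataclass

@dataclass
class Element:
    last_source_change: float  # when the source item last changed
    last_sync: float           # when we last refreshed our local copy

def is_fresh(e: Element) -> bool:
    return e.last_sync >= e.last_source_change

def age(e: Element, now: float) -> float:
    return 0.0 if is_fresh(e) else now - e.last_source_change

def freshness(copies) -> float:
    return sum(is_fresh(e) for e in copies) / len(copies)

def average_age(copies, now: float) -> float:
    return sum(age(e, now) for e in copies) / len(copies)

copies = [Element(10.0, 12.0), Element(20.0, 15.0), Element(30.0, 31.0)]
print(freshness(copies))              # 2/3 of the copies are up to date
print(average_age(copies, now=40.0))  # only the stale copy contributes age
```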


Book
01 Jan 2000
TL;DR: This book presents a comparative review of the state of the art and best current practice of data warehouses and offers a conceptual framework by which the architecture and quality of data warehouse efforts can be assessed and improved using enriched metadata management combined with advanced techniques from databases, business modeling, and artificial intelligence.
Abstract: From the Publisher: Data warehouses have captured the attention of practitioners and researchers alike. But the design and optimization of data warehouses remains an art rather than a science. This book presents a comparative review of the state of the art and best current practice of data warehouses. It covers source and data integration, multidimensional aggregation, query optimization, update propagation, metadata management, quality assessment, and design optimization. Also, based on results of the European Data Warehouse Quality project, it offers a conceptual framework by which the architecture and quality of data warehouse efforts can be assessed and improved using enriched metadata management combined with advanced techniques from databases, business modeling, and artificial intelligence. For researchers and database professionals in academia and industry, the book offers an excellent introduction to the issues of quality and metadata usage in the context of data warehouses.

401 citations


Patent
03 Nov 2000
TL;DR: In this article, a data warehouse computing system (20) is presented, including a server connected with a client, a software distribution tool, a configuration and asset management tool, a fault management and recovery management tool, a capacity planning tool, a performance management tool, and a license management tool.
Abstract: A data warehouse computing system (20) including a server connected to a client (26), a data warehouse architecture (40), metadata management (130), a population architecture (140), an end-user access architecture (110), an operations architecture (78), and a development architecture (50). The operations architecture includes a server connected with a client, a software distribution tool, a configuration and asset management tool, a fault management and recovery management tool, a capacity planning tool, a performance management tool, a license management tool, a remote management tool, an event management tool, a systems monitoring and tuning tool, a security tool, a user administration tool, a production control application set, and a help desk tool. The development architecture includes a process management tool, a personal productivity tool, a quality management tool, a system building tool, an environment management tool, a program and project management tool, a personal productivity tool and an information management tool.

338 citations


Journal ArticleDOI
TL;DR: Recent advances in learning and mining problems related to hypertext in general and the Web in particular are surveyed and the continuum of supervised to semi-supervised to unsupervised learning problems is reviewed.
Abstract: With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of recent and ongoing research.

331 citations


PatentDOI
TL;DR: In this paper, a method for collecting data associated with the voice of a voice system user includes conducting a conversation with the user and capturing and digitizing a speech waveform of the user, extracting at least one acoustic feature from the digitized speech wave form and storing attribute data corresponding to the acoustic feature, together with an identifying indicia, in the data warehouse in a form to facilitate subsequent data mining.
Abstract: A method for collecting data associated with the voice of a voice system user includes conducting a conversation with the user, capturing and digitizing a speech waveform of the user, extracting at least one acoustic feature from the digitized speech waveform and storing attribute data corresponding to the acoustic feature, together with an identifying indicia, in the data warehouse in a form to facilitate subsequent data mining. User attributes can include gender, age, accent, native language, dialect, socioeconomic classification, educational level and emotional state. Data gathering can be repeated for a large number of users, until sufficient data is present. The attribute data to be stored can include raw acoustic features, or processed features, such as the user's emotional state, age, gender, socioeconomic group, and the like. In an alternative form of method, the user attribute can be used to modify the behavior of the voice system in real time, with or without storage of data for subsequent data mining. An apparatus for collecting data associated with a voice of a user includes a dialog management unit, an audio capture module, an acoustic front end, a processing module and a data warehouse. The acoustic front end receives and digitizes a speech waveform from the user and extracts at least one acoustic feature from the digitized speech waveform. The feature is correlated with at least one user attribute. The processing module analyzes the acoustic feature to determine the user attribute, which can then be stored in the data warehouse. The dialog management unit can include, for example, a telephone interactive voice response system. The processor can be an application specific circuit, a separate general purpose computer with appropriate software, or a processor portion of the IVR. The processing module can include an emotional state classifier, a speaker clusterer and classifier, a speech recognizer, and/or an accent identifier. Alternatively, the apparatus can be configured as a real-time-modifiable voice system for interaction with a user, which can be used to practice the method for tailoring a voice system response.

272 citations


01 Jan 2000
TL;DR: A method for developing dimensional models from traditional Entity Relationship models, which can be used to design data warehouses and data marts based on enterprise data models is described.
Abstract: This paper describes a method for developing dimensional models from traditional Entity Relationship models. This can be used to design data warehouses and data marts based on enterprise data models. The first step of the method involves classifying entities in the data model into a number of categories. The second step involves identifying hierarchies that exist in the model. The final step involves collapsing these hierarchies and aggregating transaction data to form dimensional models. A number of design alternatives are presented, including a flat schema, a terraced schema, a star schema and a snowflake schema. We also define a new type of schema called a star cluster schema. This is a restricted form of snowflake schema, which minimises the number of tables while avoiding overlap between different dimensional hierarchies. Individual schemas can be collected together to form constellations or galaxies. We illustrate the method using a simple example.
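A toy illustration of the collapsing step in Python/sqlite3: a Product-to-Category hierarchy from an ER model is flattened into a single dimension table and the transaction data is aggregated into a fact table, yielding a small star schema. Table and column names are invented for the example and do not come from the paper.

```python
# Toy "collapse the hierarchy" step: flatten an ER classification hierarchy
# into one dimension table and aggregate transactions into a fact table,
# giving a minimal star schema (invented schema, illustrative only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Operational (ER) tables: a classification hierarchy plus transactions.
    CREATE TABLE category(category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product(product_id INTEGER PRIMARY KEY, name TEXT,
                         category_id INTEGER REFERENCES category);
    CREATE TABLE sale(sale_id INTEGER PRIMARY KEY, product_id INTEGER,
                      sale_date TEXT, amount REAL);

    INSERT INTO category VALUES (1, 'beverages');
    INSERT INTO product  VALUES (10, 'coffee', 1), (11, 'tea', 1);
    INSERT INTO sale     VALUES (100, 10, '2000-05-01', 4.0),
                                (101, 11, '2000-05-01', 3.0);

    -- Dimensional model: collapse the hierarchy into one dimension table...
    CREATE TABLE dim_product AS
        SELECT p.product_id, p.name AS product, c.name AS category
        FROM product p JOIN category c USING (category_id);

    -- ...and aggregate the transaction data into the fact table.
    CREATE TABLE fact_sales AS
        SELECT product_id, sale_date, SUM(amount) AS total_amount
        FROM sale GROUP BY product_id, sale_date;
""")
print(conn.execute("""
    SELECT d.category, f.sale_date, SUM(f.total_amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category, f.sale_date
""").fetchall())
```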

Journal ArticleDOI
16 May 2000
TL;DR: A one pass algorithm for constructing a congressional sample is presented and this technique is used to also incrementally maintain the sample up-to-date without accessing the base relation, which demonstrates the efficacy of the techniques proposed.
Abstract: In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex decision support queries using precomputed summary statistics, such as samples. Decision support queries routinely segment the data into groups and then aggregate the information in each group (group-by queries). Depending on the data, there can be a wide disparity between the number of data items in each group. As a result, approximate answers based on uniform random samples of the data can result in poor accuracy for groups with very few data items, since such groups will be represented in the sample by very few (often zero) tuples. In this paper, we propose a general class of techniques for obtaining fast, highly-accurate answers for group-by queries. These techniques rely on precomputed non-uniform (biased) samples of the data. In particular, we propose congressional samples, a hybrid union of uniform and biased samples. Given a fixed amount of space, congressional samples seek to maximize the accuracy for all possible group-by queries on a set of columns. We present a one pass algorithm for constructing a congressional sample and use this technique to also incrementally maintain the sample up-to-date without accessing the base relation. We also evaluate query rewriting strategies for providing approximate answers from congressional samples. Finally, we conduct an extensive set of experiments on the TPC-D database, which demonstrates the efficacy of the techniques proposed.
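The following much-simplified sketch shows the accuracy problem the paper targets and why a biased, per-group allocation helps: under a uniform sample the small group gets few or zero rows, while a per-group allocation guarantees it a share of the sample budget. This is only a stand-in for a congressional sample, whose real construction is one-pass and space-bounded.

```python
# A much-simplified stand-in for a congressional sample: compare a uniform
# sample with a biased, per-group sample that guarantees every group a share
# of the budget (the paper's actual construction is one-pass and hybrid).
import random
from collections import defaultdict

random.seed(0)
# Skewed data: group 'a' has 10,000 rows, group 'b' only 20.
data = [("a", random.gauss(100, 10)) for _ in range(10_000)] + \
       [("b", random.gauss(500, 10)) for _ in range(20)]

def group_means(rows):
    sums, counts = defaultdict(float), defaultdict(int)
    for g, v in rows:
        sums[g] += v
        counts[g] += 1
    return {g: round(sums[g] / counts[g], 1) for g in sums}

# Uniform sample: the small group 'b' gets few (often zero) rows.
uniform = random.sample(data, 100)

# Biased sample: give each group its own allocation (equal share per group).
by_group = defaultdict(list)
for row in data:
    by_group[row[0]].append(row)
biased = [r for rows in by_group.values()
          for r in random.sample(rows, min(len(rows), 50))]

print("true    ", group_means(data))
print("uniform ", group_means(uniform))   # group 'b' poorly estimated or missing
print("biased  ", group_means(biased))    # both groups are well represented
```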

Patent
07 Jan 2000
TL;DR: In this paper, a system and method for integrating and accessing multiple data sources within a data warehouse architecture is described, where four types of information are represented by the metadata: abstract concepts, databases, transformations and mappings.
Abstract: A system and method is disclosed for integrating and accessing multiple data sources within a data warehouse architecture. The metadata formed by the present method provide a way to declaratively present domain specific knowledge, obtained by analyzing data sources, in a consistent and useable way. Four types of information are represented by the metadata: abstract concepts, databases, transformations and mappings. A mediator generator automatically generates data management computer code based on the metadata. The resulting code defines a translation library and a mediator class. The translation library provides a data representation for domain specific knowledge represented in a data warehouse, including “get” and “set” methods for attributes that call transformation methods and derive a value of an attribute if it is missing. The mediator class defines methods that take “distinguished” high-level objects as input and traverse their data structures and enter information into the data warehouse.

Journal ArticleDOI
TL;DR: This study comprehensively studies the option of expressing the mining algorithm in the form of SQL queries, using association rule mining as a case in point, and compares these alternatives on the basis of qualitative factors like automatic parallelization, development ease, portability and inter-operability.
Abstract: Data mining on large data warehouses is becoming increasingly important. In support of this trend, we consider a spectrum of architectural alternatives for coupling mining with database systems. These alternatives include: loose-coupling through a SQL cursor interface; encapsulation of a mining algorithm in a stored procedure; caching the data to a file system on-the-fly and mining; tight-coupling using primarily user-defined functions; and SQL implementations for processing in the DBMS. We comprehensively study the option of expressing the mining algorithm in the form of SQL queries using association rule mining as a case in point. We consider four options in SQL-92 and six options in SQL enhanced with object-relational extensions (SQL-OR). Our evaluation of the different architectural alternatives shows that from a performance perspective, the Cache option is superior, although the performance of the SQL-OR option is within a factor of two. Both the Cache and the SQL-OR approaches incur a higher storage penalty than the loose-coupling approach which performance-wise is a factor of 3 to 4 worse than Cache. The SQL-92 implementations were too slow to qualify as a competitive option. We also compare these alternatives on the basis of qualitative factors like automatic parallelization, development ease, portability and inter-operability. As a byproduct of this study, we identify some primitives for native support in database systems for decision-support applications.
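To give a flavor of the "mining as SQL queries" style, here is the textbook self-join formulation of support counting for 2-itemsets, issued from Python over sqlite3; it is not the paper's exact SQL-92 or SQL-OR formulations, and the basket data is invented.

```python
# Support counting for 2-itemsets via a self-join over a (tid, item) table:
# the classic way to push frequent-itemset counting into SQL. Illustrative
# only; not the paper's SQL-92 or SQL-OR implementations.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE basket(tid INTEGER, item TEXT);
    INSERT INTO basket VALUES
        (1, 'bread'), (1, 'milk'),
        (2, 'bread'), (2, 'milk'), (2, 'beer'),
        (3, 'milk'),  (3, 'beer');
""")

min_support = 2
frequent_pairs = conn.execute("""
    SELECT t1.item AS item1, t2.item AS item2, COUNT(*) AS support
    FROM basket t1 JOIN basket t2
      ON t1.tid = t2.tid AND t1.item < t2.item   -- avoid duplicate permutations
    GROUP BY t1.item, t2.item
    HAVING COUNT(*) >= ?
""", (min_support,)).fetchall()
print(frequent_pairs)  # e.g. [('beer', 'milk', 2), ('bread', 'milk', 2)]
```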

Posted Content
TL;DR: In this article, the authors propose an efficient view maintenance approach by exploiting common subexpressions between different view maintenance expressions, which can be materialized temporarily during view maintenance and then chosen to be maintained permanently.
Abstract: Because the presence of views enhances query performance, materialized views are increasingly being supported by commercial database/data warehouse systems. Whenever the data warehouse is updated, the materialized views must also be updated. However, whereas the amount of data entering a warehouse, the query loads, and the need to obtain up-to-date responses are all increasing, the time window available for making the warehouse up-to-date is shrinking. These trends necessitate efficient techniques for the maintenance of materialized views. In this paper, we show how to find an efficient plan for maintenance of a set of views, by exploiting common subexpressions between different view maintenance expressions. These common subexpressions may be materialized temporarily during view maintenance. Our algorithms also choose subexpressions/indices to be materialized permanently (and maintained along with other materialized views), to speed up view maintenance. While there has been much work on view maintenance in the past, our novel contributions lie in exploiting a recently developed framework for multiquery optimization to efficiently find good view maintenance plans as above. In addition to faster view maintenance, our algorithms can also be used to efficiently select materialized views to speed up workloads containing queries.
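A hand-rolled illustration of the sharing idea: two "materialized views" depend on the same join, so the common subexpression is materialized once as a temporary table and both views are rebuilt from it. Real systems (and the paper) maintain views incrementally and choose what to share automatically; this full-refresh sketch with an invented schema only shows the sharing.

```python
# Sketch of sharing a common subexpression across the refresh of two
# materialized views (invented schema; a full refresh rather than the paper's
# incremental maintenance plans).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INTEGER, item TEXT, amount REAL);
    CREATE TABLE store(store_id INTEGER, region TEXT);
    INSERT INTO store VALUES (1, 'east'), (2, 'west');
    INSERT INTO sales VALUES (1, 'pen', 2.0), (1, 'ink', 5.0), (2, 'pen', 3.0);
""")

def refresh_views(conn):
    conn.executescript("""
        -- Common subexpression, materialized temporarily for this refresh.
        DROP TABLE IF EXISTS tmp_sales_region;
        CREATE TEMP TABLE tmp_sales_region AS
            SELECT s.item, st.region, s.amount
            FROM sales s JOIN store st USING (store_id);

        -- Both views are derived from the shared intermediate result.
        DROP TABLE IF EXISTS mv_sales_by_region;
        CREATE TABLE mv_sales_by_region AS
            SELECT region, SUM(amount) AS total
            FROM tmp_sales_region GROUP BY region;

        DROP TABLE IF EXISTS mv_sales_by_item;
        CREATE TABLE mv_sales_by_item AS
            SELECT item, SUM(amount) AS total
            FROM tmp_sales_region GROUP BY item;
    """)

refresh_views(conn)
print(conn.execute("SELECT * FROM mv_sales_by_region").fetchall())
print(conn.execute("SELECT * FROM mv_sales_by_item").fetchall())
```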

Journal ArticleDOI
TL;DR: The study demonstrates that most of the items in the classic end-user satisfaction measure are still valid in the data warehouse environment, and that end-user satisfaction with data warehouses depends heavily on the roles and performance of organizational information centers.

01 Jan 2000
TL;DR: This paper shows how to systematically derive a conceptual warehouse schema that is even in generalized multidimensional normal form from an operational database.
Abstract: A data warehouse is an integrated and time-varying collection of data derived from operational data and primarily used in strategic decision making by means of online analytical processing (OLAP) techniques. Although it is generally agreed that warehouse design is a non-trivial problem and that multidimensional data models and star or snowflake schemata are relevant in this context, hardly any methods exist to date for deriving such a schema from an operational database. In this paper, we fill this gap by showing how to systematically derive a conceptual warehouse schema that is even in generalized multidimensional normal form.

Proceedings Article
10 Sep 2000
TL;DR: This paper discusses the major issues of a UB-Tree integration and favors the kernel integration because of the tight coupling with the query optimizer, which allows for optimal usage of the UB-Tree in execution plans.
Abstract: Multidimensional access methods have shown high potential for significant performance improvements in various application domains. However, only a few approaches have made their way into commercial products. In commercial database management systems (DBMSs) the B-Tree is still the prevalent indexing technique. Integrating new indexing methods into existing database kernels is in general a very complex and costly task. Exceptions exist, as our experience of integrating the UB-Tree into TransBase, a commercial DBMS, shows. The UB-Tree is a very promising multidimensional index, which has shown its superiority over traditional access methods in different scenarios, especially in OLAP applications. In this paper we discuss the major issues of a UB-Tree integration. As we will show, the complexity and cost of this task is reduced significantly due to the fact that the UB-Tree relies on the classical B-Tree. Even though commercial DBMSs provide interfaces for index extensions, we favor the kernel integration because of the tight coupling with the query optimizer, which allows for optimal usage of the UB-Tree in execution plans. Measurements on a real-world data warehouse show that the kernel integration leads to an additional performance improvement compared to our prototype implementation and competing index methods.
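The UB-Tree's key trick is mapping a multidimensional key to a space-filling-curve address (Z-order, by bit interleaving) that an ordinary B-Tree can index. Below is a minimal sketch of that address calculation for two dimensions; the real integration discussed above involves much more (Z-regions, range-query processing, optimizer support).

```python
# Sketch of the Z-order (bit-interleaving) address that lets a classical B-Tree
# index multidimensional keys; the actual UB-Tree integration also involves
# Z-regions, range-query post-filtering, and optimizer support.
def z_address(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into a single Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions come from y
    return z

points = [(3, 5), (200, 17), (3, 6), (4, 5)]
for p in sorted(points, key=lambda p: z_address(*p)):
    print(p, format(z_address(*p), "016b"))
# Points that are close in the plane, such as (3, 5), (3, 6) and (4, 5), stay
# close in the one-dimensional Z-order, so a standard B-Tree over z_address
# can serve multidimensional range queries reasonably well.
```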

Journal ArticleDOI
01 Sep 2000
TL;DR: A textual data mining architecture that extends a classic paradigm for knowledge discovery in databases is introduced, and a broad view of data mining—the process of discovering patterns in large collections of data—is described in some detail.
Abstract: This paper surveys applications of data mining techniques to large text collections, and illustrates how those techniques can be used to support the management of science and technology research. Specific issues that arise repeatedly in the conduct of research management are described, and a textual data mining architecture that extends a classic paradigm for knowledge discovery in databases is introduced. That architecture integrates information retrieval from text collections, information extraction to obtain data from individual texts, data warehousing for the extracted data, data mining to discover useful patterns in the data, and visualization of the resulting patterns. At the core of this architecture is a broad view of data mining—the process of discovering patterns in large collections of data—and that step is described in some detail. The final section of the paper illustrates how these ideas can be applied in practice, drawing upon examples from the recently completed first phase of the textual data mining program at the Office of Naval Research. The paper concludes by identifying some research directions that offer significant potential for improving the utility of textual data mining for research management applications.

Proceedings ArticleDOI
01 Feb 2000
TL;DR: A lineage tracing package for relational views with aggregation is implemented in the WHIPS data warehousing system prototype at Stanford, and a number of schemes for storing auxiliary views that enable consistent and efficient lineage tracing in a multi-source data warehouse are proposed.
Abstract: We consider the view data lineage problem in a warehousing environment: for a given data item in a materialized warehouse view, we want to identify the set of source data items that produced the view item. We formalize the problem and we present a lineage tracing algorithm for relational views with aggregation. Based on our tracing algorithm, we propose a number of schemes for storing auxiliary views that enable consistent and efficient lineage tracing in a multi-source data warehouse. We report on a performance study of the various schemes, identifying which schemes perform best in which settings. Based on our results, we have implemented a lineage tracing package in the WHIPS data warehousing system prototype at Stanford. With this package, users can select view tuples of interest, then efficiently "drill through" to examine the exact source tuples that produced the view tuples of interest.

Journal ArticleDOI
TL;DR: In contrast to the on-demand approach to information integration, the approach of tailored information repository construction, commonly referred to as data warehousing, is characterized by the following properties:
Abstract: In recent years, the number of digital information storage and retrieval systems has increased immensely. These information sources are generally interconnected via some network, and hence the task of integrating data from different sources to serve it up to users is an increasingly important one [10]. Applications that could benefit from this wealth of digital information are thus experiencing a pressing need for suitable integration tools that allow them to make effective use of such distributed and diverse data sets. In contrast to the on-demand approach to information integration, the approach of tailored information repository construction, commonly referred to as data warehousing, is characterized by the following properties:

Book
01 Jan 2000
TL;DR: This new book provides a strong foundation for the use of models within the context of building and using decision support systems, and it will focus on multi-dimensional databases and client/server computing.
Abstract: From the Publisher: Decision Support and Data Warehouse Systems ties the more traditional view of decision support to the rapidly evolving topics of database management and data warehousing. As organizations move quickly into network-based environments, the nature of decision support tools has become increasingly complex. These tools are now used collaboratively, and the use of data warehousing mechanisms will be a critical success factor for the survival of many organizations. This new book provides a strong foundation for the use of models within the context of building and using decision support systems, and it will focus on multi-dimensional databases and client/server computing.

Book
15 Dec 2000
TL;DR: The Function Point Counting Process is illustrated with an Example of Counting ILFs and EIFs, as well as three case studies in Project Management, which show the challenges and opportunities in establishing and applying function points in the software measurement industry.
Abstract: Foreword. Preface. Introduction. Basic Counting Rules. Advanced Counting. Preparing for Certification. What's Different? Software Measurement. Function Points and the Executive. Function Point Utilization. Automation. Industry Benchmarking Data. The International Function Point Users Group. About the Authors. 1. Software Measurement. Introduction. The Need for Software Measurement. Basic Software Measurement Elements. Software Measurement Model: Quantitative and Qualitative Elements. World-Class Measurement Program. Entry Level. Basic Level. Industry Leader Level. World-Class Level. Establishing a World-Class Measurement Program. Discovery Phase. Gap Analysis Phase. Summary. 2. Executive Introduction to Function Points. Introduction. Historical Perspective. Balanced Scorecard. Return on Investment. Unit of Work. Function Points. Defining Value. Time to Market. Accountability. Summary. 3. Measuring with Function Points. Introduction. Function Points in the Lifecycle. Function Point Measures. Productivity. Quality. Financial. Maintenance. Using Function Point Measurement Data Effectively. Developing a Measurement Profile. Available Industry Comparisons. Summary. 4. Using Function Points Effectively. Introduction. Project Manager Level: Estimating Software Projects. Using Function Points. IT Management Level: Establishing Performance Benchmarks. Industry Best Practices. Organization Level: Establishing Service-Level Measures. Project and Application Outsourcing. Maintenance Outsourcing. AD/M Outsourcing. Summary. 5. Software Industry Benchmark Data. Introduction. How IT Is Using Industry Data. Benchmarking. Concerns with Industry Data. Representativeness. Consistency. Standard Definitions. What Role Do Function Points Play? Sources of Industry Data. The Gartner Group. META Group. Rubin Systems, Inc. Software Productivity Research. ISBSG. Compass America. The David Consulting Group. The Benchmarking Exchange. Hackett Benchmarking & Research. Hope for the Future. Summary. 6. Introduction to Function Point Analysis. Introduction. The Function Point Counting Process. The Process Used to Size Function Points. Types of Counts. Identifying the Counting Scope and the Application Boundary. Summary. 7. Sizing Data Functions. Introduction. Data Functions. Internal Logical Files. External Interface Files. Complexity and Contribution: ILFs and EIFs. An Example of Counting ILFs and EIFs. Summary. 8. Sizing Transactional Functions. Introduction. Transactional Functions. External Inputs. Complexity and Contribution: EIs. An Example of Counting EIs. External Outputs. Complexity and Contribution: EOs. An Example of Counting EOs. External Inquiries. Complexity and Contribution: EQs. An Example of Counting EQs. Summary. 9. GENERAL SYSTEM CHARACTERISTICS. Introduction. The Process. General System Characteristics. 1. Data Communications. 2. Distributed Data Processing. 3. Performance. 4. Heavily Used Configuration. 5. Transaction Rate. 6. Online Data Entry. 7. End User Efficiency. 8. Online Update. 9. Complex Processing. 10. Reusability. 11. Installation Ease. 12. Operational Ease. 13. Multiple Sites. 14. Facilitate Change. Value Adjustment Factor. Summary. 10. Calculating and Applying Function Points. Introduction. Final Adjusted Function Point Count. Counting a Catalog Business: An Example. Function Point Calculations and Formulas. Development Project Function Point Count. Enhancement Project Function Point Count. Application Function Point Count. Summary. 11. Case Studies in Counting. Introduction. 
Three Case Studies. Problem A. Problem B. Problem C. Answers to the Three Case Studies. A Short Case Study in Project Management. The Problem. Answers. A Function Point Counting Exercise in Early Definition. The Problem. Answers. 12. Counting Advanced Technologies. Introduction. Object-Oriented Analysis. Client-Server Applications. Application Boundary. Data Functions. Technical Features. Transactional Functions. Web-Based Applications. Application Boundary. Functionality of Web-Based Applications. Data Warehouse Applications. Functionality of Data Warehouse Applications. Concerns about Productivity Rates for Data Warehouse Applications. Query/Report Generators. Data Functionality. Transactional Functionality. Summary. 13. Counting a GUI Application. Introduction. Counting GUI Functionality. GUI Counting Guidelines. Exercise in Counting a GUI System. 1. Determine the Type of Function Point Count. 2. Identify the Counting Scope and the Application Boundary. 3 and 4. Identify All Data and Transactional Functions and Their Complexity. 5. Determine the Unadjusted Function Point Count. 6. Determine the Value Adjustment Factor. 7. Calculate the Final Adjusted Function Point Count. 14. Counting an Object-Oriented Application. Introduction. Functional Description of Personnel Query Service. Starting Personnel Query Service. Query. Update. Create. Delete. Add and Delete Title, Location, and Organization Records. Add and Delete an Employee's Picture. Exit. Object Model for Personnel Query Service. System Diagram for Personnel Query Service. Function Point Analysis for Personnel Query Service. 15. Tools. Introduction. Basic Tool Selection Criteria. Selecting a Function Point Repository Tool. Selecting a Project-Estimating Tool. Conducting a Proof of Concept. 1. Identification of the Current Estimating Problem. 2. Definition of the Deliverable. 3. Process and Tool Selection. 4. Project Selection. 5. Review of the Estimating Process with the Project Managers. 6. Sizing and Complexity Analysis. 7. Identification of Project Variables. 8. Analysis of the Data. 9. Review of the Estimate. 10. Assessment of the Process. Summary. 16. Preparing for the CFPs Exam. Practice Certified Function Point Specialist Exam. Part I. Part II. Part III. Answer Sheets. Answer Sheet: Part I. Answer Sheet: Part II. Answer Sheet: Part III. Appendix A: Project Profile Worksheet. Appendix B: Project Profile Worksheet Guidelines. Appendix C: Complexity Factors Project Worksheet. Appendix D: Sample Project Analysis. Appendix E: Frequently Asked Questions (FAQs). Appendix F: Answers to the CFPS Practice Exam. Answers to the CFPS Practice Exam. Bibliography. Index.
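Since the book centers on the function point calculation, a small calculator for the final adjusted count may help: the standard IFPUG-style formula multiplies the unadjusted count by a value adjustment factor of 0.65 + 0.01 * (total degree of influence over the 14 general system characteristics). The example ratings below are invented.

```python
# Final adjusted function point count using the standard IFPUG-style formula
# VAF = 0.65 + 0.01 * TDI, where TDI is the sum of the 14 general system
# characteristics rated 0-5. Example ratings are invented.
def adjusted_function_points(unadjusted_fp: float, gsc_ratings: list[int]) -> float:
    if len(gsc_ratings) != 14 or not all(0 <= r <= 5 for r in gsc_ratings):
        raise ValueError("expect 14 general system characteristics rated 0-5")
    total_degree_of_influence = sum(gsc_ratings)
    value_adjustment_factor = 0.65 + 0.01 * total_degree_of_influence
    return unadjusted_fp * value_adjustment_factor

# Example: 300 unadjusted FP for a data warehouse application with
# moderately rated characteristics.
ratings = [3, 4, 3, 2, 3, 4, 4, 3, 2, 1, 2, 3, 1, 3]
print(adjusted_function_points(300, ratings))  # 300 * (0.65 + 0.38) = 309.0
```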

Proceedings ArticleDOI
28 Feb 2000
TL;DR: The main novelty of the work is that the framework permits the following performance optimizations which are tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down and short-circuited computation.
Abstract: Data quality concerns arise when one wants to correct anomalies in a single data source (e.g., duplicate elimination in a file), or when one wants to integrate data coming from multiple sources into a single new data source (e.g., data warehouse construction). Three data quality problems are typically encountered: (1) the absence of universal keys across different databases that is known as the object identity problem, (2) the existence of keyboard errors in the data, and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is globally called the data cleaning process. We propose a framework that models a data cleaning application as a directed graph of data transformations. Transformations are divided into four distinct classes: mapping, matching, clustering and merging; and each of them is implemented by a macro-operator. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is the ability to include human interaction explicitly in the process. Finally, we study performance optimizations which are tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down and short-circuited computation.
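As a toy version of one of the four operator classes, the sketch below implements matching as an approximate self-join with a string-similarity threshold. The paper's macro-operators and SQL extension are far richer; the records and similarity function here are illustrative only.

```python
# Toy "matching" operator: an approximate self-join that emits candidate
# duplicate pairs above a string-similarity threshold (illustrative only;
# not the paper's macro-operators or SQL extension).
from difflib import SequenceMatcher
from itertools import combinations

records = [
    (1, "Intl. Business Machines"),
    (2, "International Business Machines"),
    (3, "Oracle Corp."),
    (4, "Oracle Corporation"),
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Matching step: keep pairs whose similarity exceeds the threshold.
matches = [(id1, id2, round(similarity(n1, n2), 2))
           for (id1, n1), (id2, n2) in combinations(records, 2)
           if similarity(n1, n2) > 0.7]
print(matches)  # pairs (1, 2) and (3, 4) should survive the threshold
```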

Journal ArticleDOI
01 Dec 2000
TL;DR: The theoretical issues concerning the problem of answering queries using views, which is to find efficient methods of answering a query using a set of previously materialized views over the database, are surveyed.
Abstract: The problem of answering queries using views is to find efficient methods of answering a query using a set of previously materialized views over the database, rather than accessing the database relations. The problem has recently received significant attention because of its relevance to a wide variety of data management problems, such as query optimization, the maintenance of physical data independence, data integration and data warehousing. This article surveys the theoretical issues concerning the problem of answering queries using views.
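A tiny instance of the problem: a query that could scan the base table is instead answered from a previously materialized view. The rewriting below is done by hand over an invented schema; the survey is about finding such rewritings automatically and characterizing when they exist.

```python
# A hand-written query rewriting over a materialized view (invented schema).
# Both queries return the same answer, but the second never touches `sales`.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(region TEXT, item TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 'pen', 2.0), ('east', 'ink', 5.0),
                             ('west', 'pen', 3.0), ('west', 'ink', 1.0);
    -- Materialized view: per-region, per-item totals.
    CREATE TABLE v_sales AS
        SELECT region, item, SUM(amount) AS total
        FROM sales GROUP BY region, item;
""")

query_over_base = "SELECT region, SUM(amount) FROM sales GROUP BY region"
query_over_view = "SELECT region, SUM(total) FROM v_sales GROUP BY region"

print(conn.execute(query_over_base).fetchall())
print(conn.execute(query_over_view).fetchall())
```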

Patent
21 Jan 2000
TL;DR: A data warehousing system stores the raw data population for an underlying delivery system for a personal intelligence network that actively delivers highly personalized and timely informational and transactional data, and collects and distributes e-mail and other content from a hub-and-spoke style source architecture, as discussed by the authors.
Abstract: A data warehousing system stores the raw data population for an underlying delivery system for a personal intelligence network that actively delivers highly personalized and timely informational and transactional data, and collects and distributes e-mail and other content from a hub-and-spoke style source architecture. The architecture of the data storage system includes a data distribution depository warehousing the raw data, and a data distribution control server monitoring the state of the data for fault and other conditions. For instance, data may revert to the next-most recent state when a corruption is detected. Furthermore, data may be collected continuously but the data image of the repository may be frozen during subscriber inquiries, to avoid inconsistent output.

Proceedings ArticleDOI
01 Feb 2000
TL;DR: This work is the first to exhibit decidability in cases where the language for expressing the query and the views allows for recursion, and characterize data, expression, and combined complexity of the problem, showing that the proposed algorithms are essentially optimal.
Abstract: Query answering using views amounts to computing the answer to a query having information only on the extension of a set of views. This problem is relevant in several fields, such as information integration, data warehousing, query optimization, mobile computing, and maintaining physical data independence. We address query answering using views in a context where queries and views are regular path queries, i.e., regular expressions that denote the pairs of objects in the database connected by a matching path. Regular path queries are the basic query mechanism when the database is conceived as a graph, such as in semistructured data and data on the Web. We study algorithms for answering regular path queries using views under different assumptions, namely, closed and open domain, and sound, complete, and exact information on view extensions. We characterize data, expression, and combined complexity of the problem, showing that the proposed algorithms are essentially optimal. Our results are the first to exhibit decidability in cases where the language for expressing the query and the views allows for recursion.

Book ChapterDOI
09 Oct 2000
TL;DR: The case study illustrates the value of model management as a methodology for approaching meta-data related problems and helps clarify the required semantics of key operations.
Abstract: Model management is a framework for supporting meta-data related applications where models and mappings are manipulated as first class objects using operations such as Match, Merge, ApplyFunction, and Compose. To demonstrate the approach, we show how to use model management in two scenarios related to loading data warehouses. The case study illustrates the value of model management as a methodology for approaching meta-data related problems. It also helps clarify the required semantics of key operations. These detailed scenarios provide evidence that generic model management is useful and, very likely, implementable.

Patent
13 Jan 2000
TL;DR: In this paper, a method for graphically analyzing relationships in data (103) from one or more data sources of an enterprise is presented. But the method is especially useful in conjunction with a meta-model based technique for modeling the enterprise data.
Abstract: According to the invention, techniques are provided for visualizing customer data (103) contained in databases (6), data marts, and data warehouses (8). In an exemplary embodiment, the invention provides a method for graphically analyzing relationships in data (103) from one or more data sources of an enterprise. The method can be used with many popular visualization tools (21), such as On Line Analytical Processing (OLAP) tools (2) and the like. The method is especially useful in conjunction with a meta-model (103) based technique for modeling the enterprise data. The enterprise is typically a business activity (21), but can also be other loci of human activity (10). Embodiments according to the invention can display data from a variety of sources in order to provide visual representations of data in a data warehousing environment (8).

Journal ArticleDOI
Tom Barclay, Jim Gray, Don Slutz
16 May 2000
TL;DR: TerraServer as discussed by the authors is the world's largest online atlas, combining eight terabytes of image data from the United States Geological Survey (USGS) and SPIN-2.
Abstract: Microsoft® TerraServer stores aerial, satellite, and topographic images of the earth in a SQL database available via the Internet. It is the world's largest online atlas, combining eight terabytes of image data from the United States Geological Survey (USGS) and SPIN-2. Internet browsers provide intuitive spatial and text interfaces to the data. Users need no special hardware, software, or knowledge to locate and browse imagery. This paper describes how terabytes of “Internet unfriendly” geo-spatial images were scrubbed and edited into hundreds of millions of “Internet friendly” image tiles and loaded into a SQL data warehouse. All meta-data and imagery are stored in the SQL database. TerraServer demonstrates that general-purpose relational database technology can manage large scale image repositories, and shows that web browsers can be a good geo-spatial image presentation system.