scispace - formally typeset
Search or ask a question
Author

Usama M. Fayyad

Bio: Usama M. Fayyad is an academic researcher from Northeastern University. The author has contributed to research in topics: Knowledge extraction & Cluster analysis. The author has an hindex of 58, co-authored 135 publications receiving 24868 citations. Previous affiliations of Usama M. Fayyad include University of Michigan & California Institute of Technology.


Papers
More filters
Journal ArticleDOI
TL;DR: An overview of this emerging field is provided, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases.
Abstract: ■ Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

4,782 citations

Proceedings Article
01 Sep 1993
TL;DR: This paper addresses the use of the entropy minimization heuristic for discretizing the range of a continuous-valued attribute into multiple intervals.
Abstract: Since most real-world applications of classification learning involve continuous-valued attributes, properly addressing the discretization process is an important problem. This paper addresses the use of the entropy minimization heuristic for discretizing the range of a continuous-valued attribute into multiple intervals.

3,192 citations

Proceedings Article
24 Jul 1998
TL;DR: A procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution that allows the iterative algorithm to converge to a “better” local minimum.
Abstract: Practical approaches to clustering use an iterative procedure (e.g. K-Means, EM) which converges to one of numerous local minima. It is known that these iterative techniques are especially sensitive to initial starting conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution. The refined initial starting condition allows the iterative algorithm to converge to a “better” local minimum. The procedure is applicable to a wide class of clustering algorithms for both discrete and continuous data. We demonstrate the application of this method to the popular K-Means clustering algorithm and show that refined initial starting points indeed lead to improved solutions. Refinement run time is considerably lower than the time required to cluster the full database. The method is scalable and can be coupled with a scalable clustering algorithm to address the large-scale clustering problems in data mining.

1,214 citations


Cited by
More filters
Journal ArticleDOI

[...]

08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i —the square root of minus one, which seems an odd beast at that time—an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, it's still always evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third of edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining stream, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. *Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data

23,600 citations

Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations

Journal ArticleDOI
TL;DR: In this article, a reprocessed composite of the COBE/DIRBE and IRAS/ISSA maps, with the zodiacal foreground and confirmed point sources removed, is presented.
Abstract: We present a full-sky 100 μm map that is a reprocessed composite of the COBE/DIRBE and IRAS/ISSA maps, with the zodiacal foreground and confirmed point sources removed. Before using the ISSA maps, we remove the remaining artifacts from the IRAS scan pattern. Using the DIRBE 100 and 240 μm data, we have constructed a map of the dust temperature so that the 100 μm map may be converted to a map proportional to dust column density. The dust temperature varies from 17 to 21 K, which is modest but does modify the estimate of the dust column by a factor of 5. The result of these manipulations is a map with DIRBE quality calibration and IRAS resolution. A wealth of filamentary detail is apparent on many different scales at all Galactic latitudes. In high-latitude regions, the dust map correlates well with maps of H I emission, but deviations are coherent in the sky and are especially conspicuous in regions of saturation of H I emission toward denser clouds and of formation of H2 in molecular clouds. In contrast, high-velocity H I clouds are deficient in dust emission, as expected. To generate the full-sky dust maps, we must first remove zodiacal light contamination, as well as a possible cosmic infrared background (CIB). This is done via a regression analysis of the 100 μm DIRBE map against the Leiden-Dwingeloo map of H I emission, with corrections for the zodiacal light via a suitable expansion of the DIRBE 25 μm flux. This procedure removes virtually all traces of the zodiacal foreground. For the 100 μm map no significant CIB is detected. At longer wavelengths, where the zodiacal contamination is weaker, we detect the CIB at surprisingly high flux levels of 32 ± 13 nW m-2 sr-1 at 140 μm and of 17 ± 4 nW m-2 sr-1 at 240 μm (95% confidence). This integrated flux ~2 times that extrapolated from optical galaxies in the Hubble Deep Field. The primary use of these maps is likely to be as a new estimator of Galactic extinction. To calibrate our maps, we assume a standard reddening law and use the colors of elliptical galaxies to measure the reddening per unit flux density of 100 μm emission. We find consistent calibration using the B-R color distribution of a sample of the 106 brightest cluster ellipticals, as well as a sample of 384 ellipticals with B-V and Mg line strength measurements. For the latter sample, we use the correlation of intrinsic B-V versus Mg2 index to tighten the power of the test greatly. We demonstrate that the new maps are twice as accurate as the older Burstein-Heiles reddening estimates in regions of low and moderate reddening. The maps are expected to be significantly more accurate in regions of high reddening. These dust maps will also be useful for estimating millimeter emission that contaminates cosmic microwave background radiation experiments and for estimating soft X-ray absorption. We describe how to access our maps readily for general use.

15,988 citations

MonographDOI
TL;DR: The art and science of cause and effect have been studied in the social sciences for a long time as mentioned in this paper, see, e.g., the theory of inferred causation, causal diagrams and the identification of causal effects.
Abstract: 1. Introduction to probabilities, graphs, and causal models 2. A theory of inferred causation 3. Causal diagrams and the identification of causal effects 4. Actions, plans, and direct effects 5. Causality and structural models in the social sciences 6. Simpson's paradox, confounding, and collapsibility 7. Structural and counterfactual models 8. Imperfect experiments: bounds and counterfactuals 9. Probability of causation: interpretation and identification Epilogue: the art and science of cause and effect.

12,606 citations