
Showing papers on "Knowledge extraction published in 2005"


Book
01 Jan 2005
TL;DR: This book first surveys, then provides comprehensive yet concise algorithmic descriptions of methods, including classic methods plus the extensions and novel methods developed recently.
Abstract: This book organizes key concepts, theories, standards, methodologies, trends, challenges and applications of data mining and knowledge discovery in databases. It first surveys, then provides comprehensive yet concise algorithmic descriptions of methods, including classic methods plus the extensions and novel methods developed recently. It also gives in-depth descriptions of data mining applications in various interdisciplinary industries.

2,836 citations


Journal ArticleDOI
TL;DR: The major challenge of biomedical text mining over the next 5-10 years is to make these systems useful to biomedical researchers; this will require enhanced access to full text, better understanding of the feature space of biomedical literature, better methods for measuring the usefulness of systems to users, and continued cooperation with the biomedical research community to ensure that their needs are addressed.
Abstract: The volume of published biomedical research, and therefore the underlying biomedical knowledge base, is expanding at an increasing rate. Among the tools that can aid researchers in coping with this information overload are text mining and knowledge extraction. Significant progress has been made in applying text mining to named entity recognition, text classification, terminology extraction, relationship extraction and hypothesis generation. Several research groups are constructing integrated flexible text-mining systems intended for multiple uses. The major challenge of biomedical text mining over the next 5–10 years is to make these systems useful to biomedical researchers. This will require enhanced access to full text, better understanding of the feature space of biomedical literature, better methods for measuring the usefulness of systems to users, and continued cooperation with the biomedical research community to ensure that their needs are addressed.

782 citations


Journal ArticleDOI
TL;DR: This paper presents an electricity consumer characterization framework based on a knowledge discovery in databases (KDD) procedure, supported by data mining techniques, applied on the different stages of the process.
Abstract: This paper presents an electricity consumer characterization framework based on a knowledge discovery in databases (KDD) procedure, supported by data mining (DM) techniques applied at the different stages of the process. The core of this framework is a data mining model based on a combination of unsupervised and supervised learning techniques. Two main modules compose this framework: the load profiling module and the classification module. The load profiling module creates a set of consumer classes using a clustering operation and the representative load profiles for each class. The classification module uses this knowledge to build a classification model able to assign different consumers to the existing classes. The quality of this framework is illustrated with a case study concerning a real database of low-voltage (LV) consumers from the Portuguese distribution company.
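The two-module idea can be sketched in a few lines. This is a hedged illustration, not the paper's method: plain k-means stands in for the clustering operation, a nearest-centroid rule stands in for the classification module, and the four-bin "load curves" are invented toy data.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

def kmeans(profiles, k, iters=20):
    # deterministic init: spread the initial centroids across the data
    centroids = [profiles[i * (len(profiles) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in profiles:
            j = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[j].append(p)
        centroids = [mean(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

def classify(profile, centroids):
    # classification-module stand-in: assign a consumer to the nearest class
    return min(range(len(centroids)), key=lambda c: dist(profile, centroids[c]))

# Toy 4-bin daily load curves: night-heavy vs day-heavy consumers.
night = [[5, 5, 1, 1], [6, 5, 1, 2], [5, 6, 2, 1]]
day = [[1, 1, 5, 5], [2, 1, 6, 5], [1, 2, 5, 6]]
centroids = kmeans(night + day, k=2)  # the "representative load profiles"
```

The centroids double as the representative load profiles of each consumer class, and any new consumer's curve is assigned to a class by `classify`.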

446 citations


Proceedings Article
01 Jan 2005
TL;DR: This article presents a description and case study of CiteSpace II, a Java application which supports visual exploration with knowledge discovery in bibliographic databases, and qualitatively evaluates two resulting document-term co-citation and MeSH term co-occurrence visualizations.
Abstract: This article presents a description and case study of CiteSpace II, a Java application which supports visual exploration with knowledge discovery in bibliographic databases. Highly cited and pivotal documents, areas of specialization within a knowledge domain, and the emergence of research topics are visually mapped through a progressive knowledge domain visualization approach to detecting and visualizing trends and patterns in scientific literature. The test case in this study is progressive knowledge domain visualization of the field of medical informatics. Datasets based on publications from twelve journals in the medical informatics field, covering the period from 1964 to 2004, were extracted from PubMed and Web of Science (WOS) and developed as testbeds for evaluation of the CiteSpace system. Two resulting document-term co-citation and MeSH term co-occurrence visualizations are qualitatively evaluated for identification of pivotal documents, areas of specialization, and research trends. Practical applications in biomedical research settings are discussed.

358 citations


Journal ArticleDOI
TL;DR: A review of the available literature on the various measures devised for evaluating and ranking the discovered patterns produced by the data mining process and their strengths and weaknesses with respect to the level of user integration within the discovery process is presented.
Abstract: It is a well-known fact that the data mining process can generate many hundreds and often thousands of patterns from data. The task for the data miner then becomes one of determining the most useful patterns from those that are trivial or are already well known to the organization. It is therefore necessary to filter out such patterns through the use of some measure of the patterns' actual worth. This article presents a review of the available literature on the various measures devised for evaluating and ranking the patterns discovered by the data mining process. These so-called interestingness measures are generally divided into two categories: objective measures based on the statistical strengths or properties of the discovered patterns, and subjective measures derived from the user's beliefs or expectations within their particular problem domain. We evaluate the strengths and weaknesses of the various interestingness measures with respect to the level of user integration within the discovery process.
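Three classic objective measures from the survey's first category can be computed directly from transaction counts. A minimal sketch, with invented baskets and an invented rule {bread} -> {milk}; the function names are mine, not the article's:

```python
def support(transactions, itemset):
    # fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # conditional frequency of rhs given lhs
    return support(transactions, lhs | rhs) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    # lift > 1 suggests lhs and rhs co-occur more often than chance predicts
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}, {"bread", "milk"}]
conf = confidence(baskets, {"bread"}, {"milk"})   # 3/5 divided by 4/5 = 0.75
rule_lift = lift(baskets, {"bread"}, {"milk"})    # 0.75 / 0.8 = 0.9375
```

A ranking by any such measure is exactly the kind of filter the article surveys: sort the discovered rules by lift (or confidence) and present only the top of the list.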

344 citations


Book ChapterDOI
31 Oct 2005
TL;DR: A new language based on Linear Temporal Logic (LTL) is developed and this is combined with a standard XML format to store event logs and the LTL Checker verifies whether the observed behavior matches the (un)expected/(un)desirable behavior.
Abstract: Information systems are facing conflicting requirements. On the one hand, systems need to be adaptive and self-managing to deal with rapidly changing circumstances. On the other hand, legislation such as the Sarbanes-Oxley Act, is putting increasing demands on monitoring activities and processes. As processes and systems become more flexible, both the need for, and the complexity of monitoring increases. Our earlier work on process mining has primarily focused on process discovery, i.e., automatically constructing models describing knowledge extracted from event logs. In this paper, we focus on a different problem complementing process discovery. Given an event log and some property, we want to verify whether the property holds. For this purpose we have developed a new language based on Linear Temporal Logic (LTL) and we combine this with a standard XML format to store event logs. Given an event log and an LTL property, our LTL Checker verifies whether the observed behavior matches the (un)expected/(un)desirable behavior.
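The flavor of such log verification can be shown in miniature. This is a hedged toy, not the paper's LTL language or its XML log format: one eventually-follows property is checked over plain Python traces with invented activity names.

```python
def eventually_follows(trace, a, b):
    """Holds if every occurrence of activity a is later followed by b."""
    return all(b in trace[i + 1:] for i, act in enumerate(trace) if act == a)

# A toy event log: each trace is one process instance.
log = [
    ["register", "check", "approve", "archive"],
    ["register", "check", "archive"],          # violates the property
]
violations = [t for t in log if not eventually_follows(t, "check", "approve")]
```

The checker's role is exactly this separation of a log into conforming and violating traces, but driven by arbitrary LTL formulas rather than one hard-coded property.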

332 citations


Journal ArticleDOI
TL;DR: Methods and implemented systems for information extraction, which distills concrete data from sets of documents, are discussed, and results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions are summarized.
Abstract: An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extricate abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.
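The structured-records idea can be shown in the smallest possible form. A single hand-written regular expression stands in for a full IE system here; the sentence and entity names are invented, and real IE systems use learned extractors rather than one pattern:

```python
import re

# Extract (entity, entity) relation tuples from free text.
pattern = re.compile(r"(\w+) (?:interacts with|binds to) (\w+)")
text = "BRCA1 interacts with RAD51. TP53 binds to MDM2."
pairs = pattern.findall(text)
```

The resulting list of tuples is the kind of concrete, table-shaped data that the abstract describes handing off to traditional data-mining techniques.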

256 citations


Journal ArticleDOI
TL;DR: A framework for automated network analysis and visualization was proposed that incorporates several advanced techniques: a concept space approach, hierarchical clustering, social network analysis methods, and multidimensional scaling, which demonstrated that the system could achieve higher clustering recall and precision than did untrained subjects when detecting subgroups from criminal networks.
Abstract: Knowledge about the structure and organization of criminal networks is important for both crime investigation and the development of effective strategies to prevent crimes. However, except for network visualization, criminal network analysis remains primarily a manual process. Existing tools do not provide advanced structural analysis techniques that allow extraction of network knowledge from large volumes of criminal-justice data. To help law enforcement and intelligence agencies discover criminal network knowledge efficiently and effectively, in this research we proposed a framework for automated network analysis and visualization. The framework included four stages: network creation, network partition, structural analysis, and network visualization. Based upon it, we have developed a system called CrimeNet Explorer that incorporates several advanced techniques: a concept space approach, hierarchical clustering, social network analysis methods, and multidimensional scaling. Results from controlled experiments involving student subjects demonstrated that our system could achieve higher clustering recall and precision than did untrained subjects when detecting subgroups from criminal networks. Moreover, subjects identified central members and interaction patterns between groups significantly faster with the help of structural analysis functionality than with only visualization functionality. No significant gain in effectiveness was present, however. Our domain experts also reported that they believed CrimeNet Explorer could be very useful in crime investigation.

248 citations


Book
18 Nov 2005
TL;DR: This book introduces data mining in business settings, covering data mining processes and database support, core methods such as clustering, regression, neural networks, and decision trees, and applications including market-basket analysis and text and Web mining.
Abstract: Part I: INTRODUCTION Chapter 1: Initial Description of Data Mining in Business Chapter 2: Data Mining Processes and Knowledge Discovery Chapter 3: Database Support to Data Mining Part II: DATA MINING METHODS AS TOOLS Chapter 4: Overview of Data Mining Techniques Chapter 4 Appendix: Enterprise Miner Demonstration on Expenditure Data Set Chapter 5: Cluster Analysis Chapter 5 Appendix: Clementine Chapter 6: Regression Algorithms in Data Mining Chapter 7: Neural Networks in Data Mining Chapter 8: Decision Tree Algorithms Chapter 8 Appendix: Demonstration of See5 Decision Tree Analysis Chapter 9: Linear Programming-Based Methods Chapter 9 Appendix: Data Mining Linear Programming Formulations Part III: BUSINESS APPLICATIONS Chapter 10: Business Data Mining Applications Chapter 11: Market-Basket Analysis Chapter 11 Appendix: Market-Basket Procedure Part IV: DEVELOPING ISSUES Chapter 12: Text and Web Mining Chapter 12 Appendix: Semantic Text Analysis Chapter 13: Ethical Aspects of Data Mining

245 citations


Book ChapterDOI
TL;DR: It is shown how concept map-based knowledge models can be used to organize repositories of information in a way that makes them easily browsable, and how concept maps can improve searching algorithms for the Web.
Abstract: Information visualization has been a research topic for many years, leading to a mature field where guidelines and practices are well established. Knowledge visualization, in contrast, is a relatively new area of research that has received more attention recently due to the interest from the business community in Knowledge Management. In this paper we present the CmapTools software as an example of how concept maps, a knowledge visualization tool, can be combined with recent technology to provide integration between knowledge and information visualizations. We show how concept map-based knowledge models can be used to organize repositories of information in a way that makes them easily browsable, and how concept maps can improve searching algorithms for the Web. We also report on how information can be used to complement knowledge models and, based on the searching algorithms, improve the process of constructing concept maps.

220 citations


Book ChapterDOI
TL;DR: A generalized privacy preserving variant of the ID3 algorithm for vertically partitioned data distributed over two or more parties is introduced and a complete proof of security that gives a tight bound on the information revealed is given.
Abstract: Privacy and security concerns can prevent sharing of data, derailing data mining projects. Distributed knowledge discovery, if done correctly, can alleviate this problem. In this paper, we tackle the problem of classification. We introduce a generalized privacy preserving variant of the ID3 algorithm for vertically partitioned data distributed over two or more parties. Along with the algorithm, we give a complete proof of security that gives a tight bound on the information revealed.
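At the heart of ID3 is the information-gain computation that the paper's protocol evaluates securely across parties. A plaintext sketch of just that computation, with toy data (the secure multi-party machinery, which is the paper's actual contribution, is omitted entirely):

```python
import math

def entropy(labels):
    # Shannon entropy of a class-label multiset
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, labels, attr):
    # entropy reduction from splitting on one attribute column
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr], []).append(y)
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["+", "+", "-", "-"]
g0 = info_gain(rows, labels, 0)  # attribute 0 predicts the class perfectly
g1 = info_gain(rows, labels, 1)  # attribute 1 carries no information
```

In the vertically partitioned setting, each party holds some of the attribute columns, so the gain for a candidate split must be computed without any party revealing its column values; that is what the paper's protocol achieves.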

Book ChapterDOI
TL;DR: This article gives an overview of the Rough Set Exploration System (RSES), a freely available software system toolset for data exploration, classification support and knowledge discovery.
Abstract: This article gives an overview of the Rough Set Exploration System (RSES). RSES is a freely available software toolset for data exploration, classification support and knowledge discovery. The main functionalities of this software system are presented along with a brief explanation of the algorithmic methods used by RSES. Many of the RSES methods have originated from rough set theory introduced by Zdzislaw Pawlak during the early 1980s.

Journal ArticleDOI
TL;DR: The Intelligent Discovery Assistant (IDA) as discussed by the authors provides users with systematic enumerations of valid data mining processes and effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute.
Abstract: A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data mining algorithm, and postprocessing the mining results. There are many possible choices for each stage, and only some combinations are valid. Because of the large space and nontrivial interactions, both novices and data mining specialists need assistance in composing and selecting DM processes. Extending notions developed for statistical expert systems we present a prototype intelligent discovery assistant (IDA), which provides users with 1) systematic enumerations of valid DM processes, in order that important, potentially fruitful options are not overlooked, and 2) effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute. We use the prototype to show that an IDA can indeed provide useful enumerations and effective rankings in the context of simple classification processes. We discuss how an IDA could be an important tool for knowledge sharing among a team of data miners. Finally, we illustrate the claims with a demonstration of cost-sensitive classification using a more complicated process and data from the 1998 KDDCUP competition.
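The enumerate-then-rank idea translates naturally into code. A hedged sketch with an invented three-stage catalog; the validity constraints and the speed scores are illustrative assumptions, not the prototype's actual ontology:

```python
from itertools import product

# Hypothetical catalogs of operators for each stage of a DM process.
pre = ["none", "discretize", "normalize"]
algo = ["naive_bayes", "knn", "c4.5"]
post = ["none", "prune"]

def valid(p, a, q):
    if a == "knn" and p != "normalize":   # assumption: knn needs scaled inputs
        return False
    if q == "prune" and a != "c4.5":      # assumption: pruning applies to trees
        return False
    return True

# 1) systematic enumeration of the valid processes only
processes = [(p, a, q) for p, a, q in product(pre, algo, post) if valid(p, a, q)]

# 2) ranking of the valid processes by a chosen criterion (here: speed)
speed = {"naive_bayes": 3, "c4.5": 2, "knn": 1}
ranked = sorted(processes, key=lambda t: -speed[t[1]])
```

Swapping the key function (accuracy, cost, comprehensibility) re-ranks the same enumeration, which is how an IDA can serve different user criteria over one process space.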

Proceedings ArticleDOI
02 Oct 2005
TL;DR: An improved method for feature extraction, drawing on an existing unsupervised method, is introduced; it turns the task of feature extraction into one of term similarity by mapping crude (learned) features into a user-defined taxonomy of the entity's features.
Abstract: Capturing knowledge from free-form evaluative texts about an entity is a challenging task. New techniques of feature extraction, polarity determination and strength evaluation have been proposed. Feature extraction is particularly important to the task as it provides the underpinnings of the extracted knowledge. The work in this paper introduces an improved method for feature extraction that draws on an existing unsupervised method. By including user-specific prior knowledge of the evaluated entity, we turn the task of feature extraction into one of term similarity by mapping crude (learned) features into a user-defined taxonomy of the entity's features. Results show promise both in terms of the accuracy of the mapping as well as the reduction in the semantic redundancy of crude features.
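The mapping step can be illustrated with a deliberately crude similarity. This sketch uses token-level Jaccard overlap and an invented three-node taxonomy; the paper's actual term-similarity measure and taxonomy are not reproduced here:

```python
def jaccard(a, b):
    # word-overlap similarity between two feature phrases
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Hypothetical user-defined taxonomy of a product's features.
taxonomy = ["battery life", "screen quality", "camera"]

def map_feature(crude, taxonomy):
    # assign a crude learned feature to its most similar taxonomy node
    return max(taxonomy, key=lambda node: jaccard(crude, node))

mapped = {f: map_feature(f, taxonomy) for f in ["battery", "camera lens"]}
```

Collapsing many crude variants ("battery", "battery charge", "batteries") onto one taxonomy node is what reduces the semantic redundancy the abstract mentions.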

Proceedings ArticleDOI
03 Jan 2005
TL;DR: A new classification method called multi-class classification based on association rules (MCAR) is presented, which uses an efficient technique for discovering frequent items and employs a rule ranking method which ensures detailed rules with high confidence are part of the classifier.
Abstract: Summary form only given. Constructing fast, accurate classifiers for large data sets is an important task in data mining and knowledge discovery. In this research paper, a new classification method called multi-class classification based on association rules (MCAR) is presented. MCAR uses an efficient technique for discovering frequent items and employs a rule ranking method which ensures detailed rules with high confidence are part of the classifier. After experimentation with fifteen different data sets, the results indicated that the proposed method is an accurate and efficient classification technique. Furthermore, the classifiers produced are highly competitive with regards to error rate and efficiency, if compared with those generated by popular methods like decision trees, RIPPER and CBA.
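Classification by ranked association rules, the core of approaches like MCAR, is easy to sketch. The rules below are hand-written with invented confidence and support values; actual rule discovery from frequent items is omitted:

```python
rules = [
    # (antecedent itemset, class, confidence, support) -- invented examples
    ({"outlook=sunny", "humidity=high"}, "no", 0.95, 0.20),
    ({"outlook=overcast"}, "yes", 0.90, 0.25),
    ({"humidity=high"}, "no", 0.70, 0.35),
]

def predict(instance, rules, default="yes"):
    # rank rules by confidence, break ties by support; first match wins
    for antecedent, cls, _conf, _sup in sorted(rules,
                                               key=lambda r: (-r[2], -r[3])):
        if antecedent <= instance:
            return cls
    return default
```

Because ranking decides which rule fires first, the ranking method directly shapes the classifier's error rate, which is why MCAR's rule-ranking scheme matters.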

Journal ArticleDOI
TL;DR: A first evaluation framework for estimating and comparing different kinds of PPDM algorithms is presented; the criteria are applied to a specific set of algorithms and the resulting evaluations are discussed.
Abstract: Recently, a new class of data mining methods, known as privacy preserving data mining (PPDM) algorithms, has been developed by the research community working on security and knowledge discovery. The aim of these algorithms is the extraction of relevant knowledge from large amounts of data while at the same time protecting sensitive information. Several data mining techniques incorporating privacy protection mechanisms have been developed that allow one to hide sensitive itemsets or patterns before the data mining process is executed. Privacy preserving classification methods, instead, prevent a miner from building a classifier able to predict sensitive data. Additionally, privacy preserving clustering techniques have recently been proposed, which distort sensitive numerical attributes while preserving general features for clustering analysis. A crucial issue is to determine which of these privacy-preserving techniques better protect sensitive information. However, this is not the only criterion with respect to which these algorithms can be evaluated. It is also important to assess the quality of the data resulting from the modifications applied by each algorithm, as well as the performance of the algorithms. There is thus a need to identify a comprehensive set of criteria with respect to which to assess the existing PPDM algorithms and determine which algorithm meets specific requirements. In this paper, we present a first evaluation framework for estimating and comparing different kinds of PPDM algorithms. Then, we apply our criteria to a specific set of algorithms and discuss the evaluation results we obtain. Finally, some considerations about future work and promising directions in the context of privacy preservation in data mining are discussed.


BookDOI
01 Jan 2005
TL;DR: This volume collects long and short papers on machine learning and knowledge discovery, ranging from Weka4WS, a WSRF-enabled Weka toolkit for distributed data mining on grids, to the use of inductive logic programming for predicting protein-protein interactions from multiple genomic data.
Abstract: Invited Talks.- Data Analysis in the Life Sciences - Sparking Ideas -.- Machine Learning for Natural Language Processing (and Vice Versa?).- Statistical Relational Learning: An Inductive Logic Programming Perspective.- Recent Advances in Mining Time Series Data.- Focus the Mining Beacon: Lessons and Challenges from the World of E-Commerce.- Data Streams and Data Synopses for Massive Data Sets.- Long Papers.- k-Anonymous Patterns.- Interestingness is Not a Dichotomy: Introducing Softness in Constrained Pattern Mining.- Generating Dynamic Higher-Order Markov Models in Web Usage Mining.- Tree 2 - Decision Trees for Tree Structured Data.- Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results.- Cluster Aggregate Inequality and Multi-level Hierarchical Clustering.- Ensembles of Balanced Nested Dichotomies for Multi-class Problems.- Protein Sequence Pattern Mining with Constraints.- An Adaptive Nearest Neighbor Classification Algorithm for Data Streams.- Support Vector Random Fields for Spatial Classification.- Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication.- A Correspondence Between Maximal Complete Bipartite Subgraphs and Closed Patterns.- Improving Generalization by Data Categorization.- Mining Model Trees from Spatial Data.- Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text Classification.- Mining Paraphrases from Self-anchored Web Sentence Fragments.- M2SP: Mining Sequential Patterns Among Several Dimensions.- A Systematic Comparison of Feature-Rich Probabilistic Classifiers for NER Tasks.- Knowledge Discovery from User Preferences in Conversational Recommendation.- Unsupervised Discretization Using Tree-Based Density Estimation.- Weighted Average Pointwise Mutual Information for Feature Selection in Text Categorization.- Non-stationary Environment Compensation Using Sequential EM Algorithm for Robust Speech Recognition.- Hybrid Cost-Sensitive 
Decision Tree.- Characterization of Novel HIV Drug Resistance Mutations Using Clustering, Multidimensional Scaling and SVM-Based Feature Ranking.- Object Identification with Attribute-Mediated Dependences.- Weka4WS: A WSRF-Enabled Weka Toolkit for Distributed Data Mining on Grids.- Using Inductive Logic Programming for Predicting Protein-Protein Interactions from Multiple Genomic Data.- ISOLLE: Locally Linear Embedding with Geodesic Distance.- Active Sampling for Knowledge Discovery from Biomedical Data.- A Multi-metric Index for Euclidean and Periodic Matching.- Fast Burst Correlation of Financial Data.- A Propositional Approach to Textual Case Indexing.- A Quantitative Comparison of the Subgraph Miners MoFa, gSpan, FFSM, and Gaston.- Efficient Classification from Multiple Heterogeneous Databases.- A Probabilistic Clustering-Projection Model for Discrete Data.- Short Papers.- Collaborative Filtering on Data Streams.- The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-Based FIM Algorithms.- Community Mining from Multi-relational Networks.- Evaluating the Correlation Between Objective Rule Interestingness Measures and Real Human Interest.- A Kernel Based Method for Discovering Market Segments in Beef Meat.- Corpus-Based Neural Network Method for Explaining Unknown Words by WordNet Senses.- Segment and Combine Approach for Non-parametric Time-Series Classification.- Producing Accurate Interpretable Clusters from High-Dimensional Data.- Stress-Testing Hoeffding Trees.- Rank Measures for Ordering.- Dynamic Ensemble Re-Construction for Better Ranking.- Frequency-Based Separation of Climate Signals.- Efficient Processing of Ranked Queries with Sweeping Selection.- Feature Extraction from Mass Spectra for Classification of Pathological States.- Numbers in Multi-relational Data Mining.- Testing Theories in Particle Physics Using Maximum Likelihood and Adaptive Bin Allocation.- Improved Naive Bayes for Extremely Skewed 
Misclassification Costs.- Clustering and Prediction of Mobile User Routes from Cellular Data.- Elastic Partial Matching of Time Series.- An Entropy-Based Approach for Generating Multi-dimensional Sequential Patterns.- Visual Terrain Analysis of High-Dimensional Datasets.- An Auto-stopped Hierarchical Clustering Algorithm for Analyzing 3D Model Database.- A Comparison Between Block CEM and Two-Way CEM Algorithms to Cluster a Contingency Table.- An Imbalanced Data Rule Learner.- Improvements in the Data Partitioning Approach for Frequent Itemsets Mining.- On-Line Adaptive Filtering of Web Pages.- A Bi-clustering Framework for Categorical Data.- Privacy-Preserving Collaborative Filtering on Vertically Partitioned Data.- Indexed Bit Map (IBM) for Mining Frequent Sequences.- STochFS: A Framework for Combining Feature Selection Outcomes Through a Stochastic Process.- Speeding Up Logistic Model Tree Induction.- A Random Method for Quantifying Changing Distributions in Data Streams.- Deriving Class Association Rules Based on Levelwise Subspace Clustering.- An Incremental Algorithm for Mining Generators Representation.- Hybrid Technique for Artificial Neural Network Architecture and Weight Optimization.


Journal ArticleDOI
TL;DR: A specific data mining tool is presented that can help non-experts in data mining carry out the complete rule discovery process, and its utility is demonstrated by applying it to an adaptive Linux course that was developed.
Abstract: We introduce a methodology to improve Adaptive Systems for Web-Based Education. This methodology uses evolutionary algorithms as a data mining method for discovering interesting relationships in students' usage data. Such knowledge may be very useful for teachers and course authors to select the most appropriate modifications to improve the effectiveness of the course. We use Grammar-Based Genetic Programming (GBGP) with multi-objective optimization techniques to discover prediction rules. We present a specific data mining tool that can help non-experts in data mining carry out the complete rule discovery process, and demonstrate its utility by applying it to an adaptive Linux course that we developed.

Book ChapterDOI
03 Oct 2005
TL;DR: This paper systematically analyzes the problem of mining hidden communities on heterogeneous social networks and proposes a new method for learning an optimal linear combination of these relations which can best meet the user's expectation.
Abstract: Social network analysis has attracted much attention in recent years. Community mining is one of the major directions in social network analysis. Most of the existing methods on community mining assume that there is only one kind of relation in the network, and moreover, the mining results are independent of the users' needs or preferences. However, in reality, there exist multiple, heterogeneous social networks, each representing a particular kind of relationship, and each kind of relationship may play a distinct role in a particular task. In this paper, we systematically analyze the problem of mining hidden communities on heterogeneous social networks. Based on the observation that different relations have different importance with respect to a certain query, we propose a new method for learning an optimal linear combination of these relations which can best meet the user's expectation. With the obtained relation, better performance can be achieved for community mining.
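The core idea of learning a relation combination from a user's example query can be sketched as follows. This is a toy least-squares fit by gradient descent, not the paper's actual algorithm; the two relations, the labeled pairs, and all names are illustrative assumptions.

```python
# Two toy relations over four members, each an adjacency matrix.
REL_A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
REL_B = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 1, 0]]
relations = [REL_A, REL_B]

# The user's example query: members 0 and 1 should be strongly related,
# members 0 and 3 should not.
labeled = [((0, 1), 1.0), ((0, 3), 0.0)]

def fit_weights(relations, labeled, lr=0.1, steps=500):
    """Least-squares fit of a linear combination of relations to the labels."""
    w = [0.0] * len(relations)
    for _ in range(steps):
        for (i, j), target in labeled:
            pred = sum(w[k] * relations[k][i][j] for k in range(len(w)))
            err = pred - target
            for k in range(len(w)):
                w[k] -= lr * err * relations[k][i][j]
    return w

w = fit_weights(relations, labeled)
# Combined relation: weighted sum of the individual relation matrices.
combined = [[sum(w[k] * relations[k][i][j] for k in range(len(w)))
             for j in range(4)] for i in range(4)]
```

On this toy data the fit drives the weight of REL_A (which agrees with the user's expectation) toward 1 and the weight of REL_B toward 0; community mining would then run on `combined`.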

Journal ArticleDOI
TL;DR: The accuracy and interpretability of fuzzy models derived by this approach are studied, and the results show that the proposed approach is effective and practical for knowledge extraction.

Journal ArticleDOI
TL;DR: A technique that uses EM mixture modeling to perform clustering on distributed data that controls data sharing, preventing disclosure of individual data items or any results that can be traced to an individual site.
Abstract: Privacy and security considerations can prevent sharing of data, derailing data mining projects. Distributed knowledge discovery can alleviate this problem. We present a technique that uses EM mixture modeling to perform clustering on distributed data. This method controls data sharing, preventing disclosure of individual data items or any results that can be traced to an individual site.
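The privacy mechanism rests on a standard property of EM for mixture models: the M-step needs only sufficient statistics, which each site can compute locally and report as aggregates, so no individual record leaves its site. A minimal one-dimensional Gaussian-mixture sketch (fixed unit variance, toy data; not the paper's protocol) looks like this:

```python
import math

# Two sites hold disjoint 1-D data; only aggregate statistics leave a site.
site_data = [
    [0.9, 1.1, 1.0, 0.8],   # site 1, clustered near 1
    [4.9, 5.2, 5.0, 5.1],   # site 2, clustered near 5
]

def local_stats(data, means, var):
    """Per-site E-step: return only per-component sums, never raw items."""
    N = [0.0, 0.0]  # soft counts
    S = [0.0, 0.0]  # responsibility-weighted sums
    for x in data:
        p = [math.exp(-(x - m) ** 2 / (2 * var)) for m in means]
        z = sum(p)
        for k in range(2):
            r = p[k] / z
            N[k] += r
            S[k] += r * x
    return N, S

means, var = [0.0, 6.0], 1.0
for _ in range(20):                       # global EM iterations
    N = [0.0, 0.0]
    S = [0.0, 0.0]
    for data in site_data:                # each site reports aggregates only
        n, s = local_stats(data, means, var)
        for k in range(2):
            N[k] += n[k]
            S[k] += s[k]
    means = [S[k] / N[k] for k in range(2)]  # M-step at the coordinator
```

The coordinator sees only `(N, S)` per site; the paper's technique additionally controls how even these aggregates are shared so results cannot be traced to a single site.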

Book
01 Jan 2005
TL;DR: This book discusses information and knowledge visualization, including the development and use of a Management Information System (MIS) for DaimlerChrysler.
Abstract: Visualizing Knowledge and Information: An Introduction.- Visualizing Knowledge and Information: An Introduction.- Background.- Visual Queries: The Foundation of Visual Thinking.- Representational Correspondence as a Basic Principle of Diagram Design.- Knowledge Visualization.- Node-Link Mapping Principles for Visualizing Knowledge and Information.- Tools for Representing Problems and the Knowledge Required to Solve Them.- Collaborative Knowledge Visualization for Cross-Community Learning.- Information Visualization.- Modeling Interactive, 3-Dimensional Information Visualizations Supporting Information Seeking Behaviors.- Visualizing Information in Virtual Space: Prospects and Pitfalls.- The Impact of Dimensionality and Color Coding of Information Visualizations on Knowledge Acquisition.- Synergies Visualizing Knowledge and Information for Fostering Learning and Instruction.- Digital Concept Maps for Managing Knowledge and Information.- Concept Maps: Integrating Knowledge and Information Visualization.- Comprehensive Mapping of Knowledge and Information Resources: The Case of Webster.- Towards a Framework and a Model for Knowledge Visualization: Synergies Between Information and Knowledge Visualization.- ParIS - Visualizing Ideas and Information in a Resource-Based Learning Scenario.- Knowledge-Oriented Organization of Information for Fostering Information Use.- LEO: A Concept Map Based Course Visualization Tool for Instructors and Students.- Navigating Personal Information Repositories with Weblog Authoring and Concept Mapping.- Facilitating Web Search with Visualization and Data Mining Techniques.- The Role of Content Representations in Hypermedia Learning: Effects of Task and Learner Variables.- Supporting Self-regulated E-Learning with Visual Topic-Map-Navigation.- Information and Knowledge Visualization in Development and Use of a Management Information System (MIS) for DaimlerChrysler.

Journal ArticleDOI
TL;DR: The judgment theorems of consistent sets are examined, and the discernibility matrix of a formal context is introduced, by which an approach to attribute reduction in the concept lattice is presented.
Abstract: The theory of the concept lattice is an efficient tool for knowledge representation and knowledge discovery, and has been applied successfully to many fields. One focus of knowledge discovery is knowledge reduction. This paper proposes the theory of attribute reduction in the concept lattice, which extends the theory of the concept lattice. In this paper, the judgment theorems of consistent sets are examined, and the discernibility matrix of a formal context is introduced, by which we present an approach to attribute reduction in the concept lattice. The characteristics of three types of attributes are analyzed.
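A simplified sketch of the discernibility-matrix idea (using the rough-set-style discernibility between objects of a formal context, which is a simplification of the paper's concept-lattice definition): each matrix entry holds the attributes distinguishing a pair of objects, and a consistent attribute set must intersect every nonempty entry. The context and names below are illustrative assumptions.

```python
# Toy formal context: objects and the attributes they possess.
context = {
    'x1': {'a', 'b'},
    'x2': {'a', 'c'},
    'x3': {'b'},
}

# Discernibility matrix: for each object pair, the attributes that
# distinguish them (symmetric difference of their attribute sets).
objs = sorted(context)
disc = {}
for i, u in enumerate(objs):
    for v in objs[i + 1:]:
        disc[(u, v)] = context[u] ^ context[v]

def consistent(subset):
    """A consistent set must hit every nonempty discernibility entry."""
    return all(subset & d for d in disc.values() if d)
```

A reduct is then a minimal consistent subset; here `{'a', 'b'}` is consistent while `{'b', 'c'}` is not, because it cannot discern `x1` from `x3`.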

Book ChapterDOI
01 Jan 2005
TL;DR: The goal of the chapter is to present a knowledge discovery paradigm for multi-attribute and multicriteria decision making, which is based upon the concept of rough sets, in order to find concise classification patterns that agree with situations that are described by the data.
Abstract: In this chapter, we are concerned with discovering knowledge from data. The aim is to find concise classification patterns that agree with situations that are described by the data. Such patterns are useful for explanation of the data and for the prediction of future situations. They are particularly useful in such decision problems as technical diagnostics, performance evaluation and risk assessment. The situations are described by a set of attributes, which we might also call properties, features, characteristics, etc. Such attributes may be concerned with either the input or output of a situation. These situations may refer to states, examples, etc. Within this chapter, we will refer to them as objects. The goal of the chapter is to present a knowledge discovery paradigm for multi-attribute and multicriteria decision making, which is based upon the concept of rough sets. Rough set theory was introduced by Pawlak (Pawlak 1982, Pawlak 1991). Since then, it has often proved to be an excellent mathematical tool for the analysis of a vague description of objects. The adjective vague (referring to the quality of information) is concerned with inconsistency or ambiguity. The rough set philosophy is based on the assumption that with every object of the universe U there is associated a certain amount of information (data, knowledge). This information can be expressed by means of a number of attributes. The attributes describe the object. Objects which have the same description are said to be indiscernible (similar) with respect to the available information.
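The indiscernibility relation described above is the computational core of rough sets; a minimal sketch (toy information table, illustrative names) partitions objects into indiscernibility classes and builds the lower and upper approximations of a target set from them:

```python
# Toy information table: objects described by two attributes.
table = {
    'o1': {'color': 'red',  'size': 'big'},
    'o2': {'color': 'red',  'size': 'big'},
    'o3': {'color': 'blue', 'size': 'big'},
    'o4': {'color': 'blue', 'size': 'small'},
}

def ind_classes(table, attrs):
    """Partition objects into indiscernibility classes w.r.t. attrs."""
    classes = {}
    for obj, desc in table.items():
        key = tuple(desc[a] for a in attrs)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())

def approximations(table, attrs, target):
    """Lower approximation: classes fully inside target.
    Upper approximation: classes overlapping target."""
    lower, upper = set(), set()
    for cls in ind_classes(table, attrs):
        if cls <= target:
            lower |= cls
        if cls & target:
            upper |= cls
    return lower, upper

lower, upper = approximations(table, ['color', 'size'], {'o1', 'o3'})
```

Here `o1` and `o2` are indiscernible (same description), so the target `{'o1', 'o3'}` is vague: its lower approximation is `{'o3'}` and its upper approximation is `{'o1', 'o2', 'o3'}`, and the gap between the two is exactly the ambiguity the chapter discusses.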

Book ChapterDOI
01 Jan 2005
TL;DR: This chapter describes a six-stepDMKD process model and its component technologies, which help to design flexible, semiautomated, and easy-to-use DMKD models to enable building knowledge repositories and allowing for communication between several data mining tools, databases, and knowledge repositories.
Abstract: Data mining and knowledge discovery (DMKD) is a fast-growing field of research. Its popularity is caused by an ever increasing demand for tools that help in revealing and comprehending information hidden in huge amounts of data. Such data are generated on a daily basis by federal agencies, banks, insurance companies, retail stores, and on the WWW. This explosion came about through the increasing use of computers, scanners, digital cameras, bar codes, etc. We are in a situation where rich sources of data, stored in databases, warehouses, and other data repositories, are readily available but not easily analyzable. This causes pressure from the federal, business, and industry communities for improvements in the DMKD technology. What is needed is a clear and simple methodology for extracting the knowledge hidden in the data. In this chapter, an integrated DMKD process model based on technologies like XML, PMML, SOAP, UDDI, and OLE DB for DM is introduced. These technologies help to design flexible, semiautomated, and easy-to-use DMKD models to enable building knowledge repositories and allowing for communication between several data mining tools, databases, and knowledge repositories. They also enable integration and automation of the DMKD tasks. This chapter describes a six-step DMKD process model and its component technologies.

Book ChapterDOI
18 May 2005
TL;DR: This work introduces a new technique based on a bit level approximation of the data that allows raw data to be directly compared to the reduced representation, while still guaranteeing lower bounds to Euclidean distance.
Abstract: Because time series are a ubiquitous and increasingly prevalent type of data, there has been much research effort devoted to time series data mining recently. As with all data mining problems, the key to effective and scalable algorithms is choosing the right representation of the data. Many high level representations of time series have been proposed for data mining. In this work, we introduce a new technique based on a bit level approximation of the data. The representation has several important advantages over existing techniques. One unique advantage is that it allows raw data to be directly compared to the reduced representation, while still guaranteeing lower bounds to Euclidean distance. This fact can be exploited to produce faster exact algorithms for similarity search. In addition, we demonstrate that our new representation allows time series clustering to scale to much larger datasets.
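The raw-to-bits comparison with a guaranteed lower bound can be sketched with a simple clipped representation: each value is reduced to one bit saying whether it lies above the series mean, and a raw query can be compared against those bits. This is a toy illustration of the bit-level idea, not the paper's exact representation; data and names are assumptions.

```python
import math

def clip_bits(series):
    """Bit-level representation: 1 if the value is above the series mean."""
    mu = sum(series) / len(series)
    return [1 if x > mu else 0 for x in series], mu

def lb_dist(query, bits, mu):
    """Lower bound on Euclidean distance between a raw query and any
    series consistent with the bits (value > mu iff the bit is 1)."""
    total = 0.0
    for q, b in zip(query, bits):
        if b == 1 and q < mu:      # true value is above mu, query below
            total += (mu - q) ** 2
        elif b == 0 and q > mu:    # true value is at most mu, query above
            total += (q - mu) ** 2
    return math.sqrt(total)

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

raw = [1.0, 3.0, 2.0, 8.0]
bits, mu = clip_bits(raw)          # mu = 3.5 -> bits [0, 0, 0, 1]
query = [4.0, 3.0, 2.0, 1.0]

lb = lb_dist(query, bits, mu)      # computed from bits alone
true_d = euclid(query, raw)        # computed from raw data
```

Because `lb` never exceeds the true distance, a similarity search can prune candidates from the tiny bit representation alone and fall back to raw data only when the bound is not decisive, keeping the algorithm exact.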

Journal ArticleDOI
TL;DR: In this paper, the authors developed a method to index design knowledge that is intuitive to an engineering designer and therefore encourages the reuse of information. The method has been evaluated in two stages: evaluation of the individual taxonomies within the method, and indexing of 92 reports using the method.

Book
01 Jan 2005
TL;DR: The goal of this workshop was to encourage KDD researchers to take on the numerous challenges that Bioinformatics offers; it solicited papers proposing novel data mining techniques for tasks such as gene expression analysis, drug design, and other emerging problems in genomics and proteomics.
Abstract: Written especially for computer scientists, all necessary biology is explained. Presents new techniques on gene expression data mining, gene mapping for disease detection, and phylogenetic knowledge discovery.