
Showing papers on "Knowledge extraction" published in 2004


Book
17 Sep 2004
TL;DR: Adaptive Resonance Theory (ART) neural networks model real-time prediction, search, learning, and recognition, and design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks.
Abstract: Adaptive Resonance Theory (ART) neural networks model real-time prediction, search, learning, and recognition. ART networks function both as models of human cognitive information processing [1,2,3] and as neural systems for technology transfer [4]. A neural computation central to both the scientific and the technological analyses is the ART matching rule [5], which models the interaction between top-down expectation and bottom-up input, thereby creating a focus of attention which, in turn, determines the nature of coded memories. Sites of early and ongoing transfer of ART-based technologies include industrial venues such as the Boeing Corporation [6] and government venues such as MIT Lincoln Laboratory [7]. A recent report on industrial uses of neural networks [8] states: “[The] Boeing ... Neural Information Retrieval System is probably still the largest-scale manufacturing application of neural networks. It uses [ART] to cluster binary templates of aeroplane parts in a complex hierarchical network that covers over 100,000 items, grouped into thousands of self-organised clusters. Claimed savings in manufacturing costs are in millions of dollars per annum.” At Lincoln Lab, a team led by Waxman developed an image mining system which incorporates several models of vision and recognition developed in the Boston University Department of Cognitive and Neural Systems (BU/CNS). Over the years a dozen CNS graduates (Aguilar, Baloch, Baxter, Bomberger, Cunningham, Fay, Gove, Ivey, Mehanian, Ross, Rubin, Streilein) have contributed to this effort, which is now located at Alphatech, Inc. Customers for BU/CNS neural network technologies have attributed their selection of ART over alternative systems to the model's defining design principles. In listing the advantages of its THOT technology, for example, American Heuristics Corporation (AHC) cites several characteristic computational capabilities of this family of neural models, including fast on-line (one-pass) learning, “vigilant” detection of novel patterns, retention of rare patterns, improvement with experience, “weights [which] are understandable in real world terms,” and scalability (www.heuristics.com). Design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks, including fuzzy ARTMAP [9], ART-EMAP [10], ARTMAP-IC [11],

1,745 citations


Book
18 Nov 2004
TL;DR: The second edition of a highly praised, successful reference on data mining, with thorough coverage of big data applications, predictive analytics, and statistical analysis.
Abstract: The second edition of a highly praised, successful reference on data mining, with thorough coverage of big data applications, predictive analytics, and statistical analysis. Includes new chapters on Multivariate Statistics, Preparing to Model the Data, and Imputation of Missing Data, and an Appendix on Data Summarization and Visualization. Offers extensive coverage of the R statistical programming language. Contains 280 end-of-chapter exercises. Includes a companion website with further resources for all readers, and PowerPoint slides, a solutions manual, and suggested projects for instructors who adopt the book.

1,637 citations


Journal ArticleDOI
01 Feb 2004
TL;DR: An approach to the online learning of Takagi-Sugeno (TS) type models is proposed, based on a novel learning algorithm that recursively updates TS model structure and parameters by combining supervised and unsupervised learning.
Abstract: An approach to the online learning of Takagi-Sugeno (TS) type models is proposed in the paper. It is based on a novel learning algorithm that recursively updates TS model structure and parameters by combining supervised and unsupervised learning. The rule-base and parameters of the TS model continually evolve by adding new rules with more summarization power and by modifying existing rules and parameters. In this way, the rule-base structure is inherited and updated when new data become available. By applying this learning concept to the TS model we arrive at a new type of adaptive model called the Evolving Takagi-Sugeno model (ETS). The adaptive nature of these evolving TS models, in combination with the highly transparent and compact form of fuzzy rules, makes them a promising candidate for online modeling and control of complex processes, competitive with neural networks. The approach has been tested on data from an air-conditioning installation serving a real building. The results illustrate the viability and efficiency of the approach. The proposed concept, however, has significantly wider implications in a number of fields, including adaptive nonlinear control, fault detection and diagnostics, performance analysis, forecasting, knowledge extraction, robotics, and behavior modeling.
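To make the model class concrete, here is a minimal sketch of the inference step of a first-order Takagi-Sugeno model: Gaussian rule antecedents and linear consequents whose outputs are blended by normalized firing strengths. It is not the eTS learning algorithm itself (which adds and updates rules recursively as data arrive); all rule centres, spreads, and consequent parameters below are invented for illustration.

```python
import numpy as np

# Minimal first-order Takagi-Sugeno (TS) fuzzy model: each rule has a Gaussian
# antecedent centred on a focal point and a linear consequent. The model output
# is the firing-strength-weighted average of the rule consequents. This covers
# only the static inference step; the eTS algorithm additionally adds rules and
# updates their parameters recursively as new samples arrive.

class TSModel:
    def __init__(self, centers, spreads, consequents):
        self.centers = np.asarray(centers, float)          # (n_rules, n_inputs)
        self.spreads = np.asarray(spreads, float)          # (n_rules, n_inputs)
        self.consequents = np.asarray(consequents, float)  # (n_rules, n_inputs + 1): [bias, weights]

    def firing_strengths(self, x):
        d = (x - self.centers) / self.spreads
        return np.exp(-0.5 * np.sum(d ** 2, axis=1))       # Gaussian membership per rule

    def predict(self, x):
        x = np.asarray(x, float)
        tau = self.firing_strengths(x)
        tau = tau / (tau.sum() + 1e-12)                    # normalised firing strengths
        x_ext = np.concatenate(([1.0], x))                 # prepend bias term
        rule_outputs = self.consequents @ x_ext            # local linear models
        return float(tau @ rule_outputs)                   # weighted average

# Two illustrative rules on a single input (all numbers are made up):
model = TSModel(centers=[[0.0], [1.0]],
                spreads=[[0.5], [0.5]],
                consequents=[[0.0, 1.0],    # y = 0 + 1*x near x = 0
                             [2.0, -1.0]])  # y = 2 - 1*x near x = 1
print(model.predict([0.25]))
```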

956 citations


Journal ArticleDOI
TL;DR: A formal methodology is introduced, which allows us to compare multiple split criteria and permits us to present fundamental insights into the decision process.
Abstract: Knowledge Discovery in Databases (KDD) is an active and important research area with the promise of a high payoff in many business and scientific applications. One of the main tasks in KDD is classification. A particularly efficient method for classification is decision tree induction. The selection of the attribute used at each node of the tree to split the data (split criterion) is crucial in order to correctly classify objects. Different split criteria have been proposed in the literature (Information Gain, Gini Index, etc.). It is not obvious which of them will produce the best decision tree for a given data set. A large number of empirical tests have been conducted in order to answer this question. No conclusive results were found. In this paper we introduce a formal methodology which allows us to compare multiple split criteria. This permits us to present fundamental insights into the decision process. Furthermore, we are able to present a formal description of how to select between split criteria for a given data set. As an illustration we apply the methodology to two widely used split criteria: Gini Index and Information Gain.
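For reference, the two criteria compared in the paper have simple textbook definitions; the sketch below computes the weighted Gini index and the Information Gain of a candidate split. It illustrates the criteria only, not the paper's formal comparison methodology, and the tiny label set at the end is made up.

```python
import numpy as np

# Textbook impurity measures used as decision-tree split criteria: Gini index and
# entropy-based Information Gain, evaluated for a candidate partition of the data.

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_scores(labels, partition):
    """Return (weighted Gini of the split, Information Gain of the split).

    labels    : class label per object
    partition : list of index arrays, one per child node
    """
    labels = np.asarray(labels)
    n = len(labels)
    weighted_gini = sum(len(idx) / n * gini(labels[idx]) for idx in partition)
    weighted_entropy = sum(len(idx) / n * entropy(labels[idx]) for idx in partition)
    return weighted_gini, entropy(labels) - weighted_entropy

# Example: 10 objects, a binary attribute splits them into two children.
y = np.array(list("AAAABBBBBB"))
left, right = np.arange(0, 5), np.arange(5, 10)
print(split_scores(y, [left, right]))  # lower Gini / higher gain indicates a better split
```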

554 citations


Book
04 Nov 2004
TL;DR: This book discusses tech mining in an information age, covering knowledge discovery, information representation, and decision making in a rapidly changing environment.
Abstract: List of Figures. Preface. Acknowledgments. Acronyms & Shorthands-Glossary. PART I. UNDERSTAND TECH MINING. Chapter 1. Technological Innovation and the Need for Tech Mining. 1.1 Why Innovation is Significant. 1.2 Innovation Processes. 1.3 Innovation Institutions and Their Interests. 1.4 Innovators and Their Interests. 1.5 Technological Innovation in an Information Age. 1.6 Information About Emerging Technologies. Chapter 1 Take-Home Messages. Chapter Resources. Chapter 2. How Tech Mining Works. 2.1 What is Tech Mining? 2.2 Why Do Tech Mining? 2.3 What Is Tech Mining's Ancestry? 2.4 How to Conduct the Tech Mining Process? 2.5 Who Does Tech Mining? 2.6 Where Is Tech Mining Most Needed? Chapter 2 Take-Home Messages. Chapter Resources. Chapter 3. What Tech Mining Can Do for You. 3.1 Tech Mining Basics. 3.2 Tech Mining Analyses. 3.3 Putting Tech Mining Information to Good Use. 3.4 Managing and Measuring Tech Mining. Chapter 3 Take-Home Messages. Chapter 4. Example Results: Fuel Cells Tech Mining. 4.1 Overview of Fuel Cells. 4.2 Tech Mining Analyses. 4.3 Tech Mining Results. 4.4 Tech Mining Information Processes. 4.5 Tech Mining Information Products. Chapter 4 Take-Home Messages. Chapter Resources. Chapter 5. What to Watch For in Tech Mining. 5.1 Better Basics. 5.2 Research Profiling and Other Perspectives on the Data. 5.3 More Informative Products. 5.4 Knowledge Discovery. 5.5 Knowledge Management. 5.6 New Tech Mining Markets. 5.7 Dangers. Chapter 5 Take-Home Messages. Chapter Resources. PART II. DOING TECH MINING. Chapter 6. Finding the Right Sources. 6.1 R&D Activity. 6.2 R&D Output Databases. 6.3 Determining the Best Sources. 6.4 Arranging Access to Databases. Chapter 6 Take-Home Messages. Chapter Resources. Chapter 7. Forming the Right Query. 7.1 An Iterative Process. 7.2 Queries Based on Substantive Terms. 7.3 Nominal Queries. 7.4 Tactics and Strategies for Query Design. 7.5 Changing the Query. Chapter 7 Take-Home Messages. Chapter 8. Getting the Data. 8.1 Accessing Databases. 8.2 Search and Retrieval from a Database. 8.3 What to Do, and Not to Do. Chapter 8 Take-Home Messages. Chapter 9. Basic Analyses. 9.1 In the Beginning. 9.2 What You Can Do with the Data. 9.3 Relations Among Documents and Terms Occurring in Their Information Fields. 9.4 Relationships. 9.5 Helpful Basic Analyses. Chapter 9 Take-Home Messages. Chapter 10. Advanced Analyses. 10.1 Why Perform Advanced Analyses? 10.2 Data Representation. 10.3 Analytical Families. Chapter 10 Take-Home Messages. Chapter Resources. Chapter 11. Trend Analyses. 11.1 Perspective. 11.2 An Example Time Series Description and Forecast. 11.3 Multiple Forecasts. 11.4 Research Fronts. 11.5 Novelty. Chapter 11 Take-Home Messages. Chapter Resources. Chapter 12. Patent Analyses. 12.1 Why patent Analyses? 12.2 Getting Started. 12.3 The 'What' and 'Why' of patent Analysis. 12.4 Tech Mining Patent Analysis Case Illustration: Fuel Cells. 12.5 Patent Citation Analysis. 12.6 For Whom? 12.7 TRIZ. 12.8 Reflections. Chapter 12 Take-Home Messages. Chapter Resources. Chapter 13. Generating and Presenting Innovation Indicators. 13.1 Expert Opinion in Tech Mining. 13.2 Innovation Indicators. 13.3 Information Representation and Packaging. 13.4 Examples of Putting Tech Mining Information Representation to Use. 13.5 Summing Up. Chapter Resources. Chapter 14. Managing the Tech Mining Process. 14.1 Tough Challenges. 14.2 Tech Mining Communities. 14.3 Process Management. 14.4 Enhancing the Prospects of Tech Mining Utilization. 
14.5 Institutionalizing the Tech Mining Function. 14.6 The Learning Curve. Chapter 14 Take-Home Messages. Chapter 15. Measuring Tech Mining Results. 15.1 Why Measure? 15.2 What to Measure. 15.3 How to Measure. 15.4 Enabling Measurement. 15.5 Effective Measurement. 15.6 Using Measurements to Bolster Tech Mining. Chapter 15 Take-Home Messages. Chapter Resources. Chapter 16. Example Process: Tech Mining on Fuel Cells. 16.1 Introduction. 16.2 First Step: Issue Identification. 16.3 Second Step: Selection of Information Sources. 16.4 Third Step: Search Refinement and Data Retrieval. 16.5 Fourth Step: Data Cleaning. 16.6 Fifth Step: Basic Analyses. 16.7 Sixth Step: Advanced Analyses. 16.8 Seventh Step: Representation. 16.9 Eighth Step: Interpretation. 16.10 Ninth Step: Utilization. 16.11 What Can We Learn. Chapter 16 Take-Home Messages. Chapter Resources. Appendix A: Selected Publication and Patent Databases. Appendix B: Text Mining Software. Appendix C: What You Can Do Without Tech Mining Software. Appendix D: Statistics and Distributions for Analyzing Text Entities. References. Index.

369 citations


Journal Article
TL;DR: Cooperative Information Systems (CoopIS) 2004 International Conference (International Conference on Cooperative Information Systems) PC Co-chairs' Message- Keynote- Business Process Optimization- Workflow/Process/Web Services, I- Discovering Workflow Transactional behavior from Event-based Log- A Flexible Mediation Process for Large Distributed Information Systems- Exception Handling Through a Workflow- WorkFlow/Process, Web Services, II- Flexible and Composite Schema Matching Algorithm- Analysis, Transformation, and Improvements of ebXML Choreographies based on Work
Abstract: Cooperative Information Systems (CoopIS) 2004 International Conference- CoopIS 2004 International Conference (International Conference on Cooperative Information Systems) PC Co-chairs' Message- Keynote- Business Process Optimization- Workflow/Process/Web Services, I- Discovering Workflow Transactional Behavior from Event-Based Log- A Flexible Mediation Process for Large Distributed Information Systems- Exception Handling Through a Workflow- Workflow/Process/Web Services, II- A Flexible and Composite Schema Matching Algorithm- Analysis, Transformation, and Improvements of ebXML Choreographies Based on Workflow Patterns- The Notion of Business Process Revisited- Workflow/Process/Web Services, III- Disjoint and Overlapping Process Changes: Challenges, Solutions, Applications- Untangling Unstructured Cyclic Flows - A Solution Based on Continuations- Making Workflow Models Sound Using Petri Net Controller Synthesis- Database Management/Transaction- Concurrent Undo Operations in Collaborative Environments Using Operational Transformation- Refresco: Improving Query Performance Through Freshness Control in a Database Cluster- Automated Supervision of Data Production - Managing the Creation of Statistical Reports on Periodic Data- Schema Integration/Agents- Deriving Sub-schema Similarities from Semantically Heterogeneous XML Sources- Supporting Similarity Operations Based on Approximate String Matching on the Web- Managing Semantic Compensation in a Multi-agent System- Modelling with Ubiquitous Agents a Web-Based Information System Accessed Through Mobile Devices- Events- A Meta-service for Event Notification- Classification and Analysis of Distributed Event Filtering Algorithms- P2P/Collaboration- A Collaborative Model for Agricultural Supply Chains- FairNet - How to Counter Free Riding in Peer-to-Peer Data Structures- Supporting Collaborative Layouting in Word Processing- A Reliable Content-Based Routing Protocol over Structured Peer-to-Peer Networks- Applications, I- Covering Your Back: Intelligent Virtual Agents in Humanitarian Missions Providing Mutual Support- Dynamic Modelling of Demand Driven Value Networks- An E-marketplace for Auctions and Negotiations in the Constructions Sector- Applications, II- Managing Changes to Engineering Products Through the Co-ordination of Human and Technical Activities- Towards Automatic Deployment in eHome Systems: Description Language and Tool Support- A Prototype of a Context-Based Architecture for Intelligent Home Environments- Trust/Security/Contracts- Trust-Aware Collaborative Filtering for Recommender Systems- Service Graphs for Building Trust- Detecting Violators of Multi-party Contracts- Potpourri- Leadership Maintenance in Group-Based Location Management Scheme- TLS: A Tree-Based DHT Lookup Service for Highly Dynamic Networks- Minimizing the Network Distance in Distributed Web Crawling- Ontologies, DataBases, and Applications of Semantics (ODBASE) 2004 International Conference- ODBASE 2004 International Conference (Ontologies, DataBases, and Applications of Semantics) PC Co-chairs' Message- Keynote- Helping People (and Machines) Understanding Each Other: The Role of Formal Ontology- Knowledge Extraction- Automatic Initiation of an Ontology- Knowledge Extraction from Classification Schemas- Semantic Web in Practice- Generation and Management of a Medical Ontology in a Semantic Web Retrieval System- Semantic Web Based Content Enrichment and Knowledge Reuse in E-science- The Role of Foundational Ontologies in Manufacturing Domain Applications- 
Intellectual Property Rights Management Using a Semantic Web Information System- Ontologies and IR- Intelligent Retrieval of Digital Resources by Exploiting Their Semantic Context- The Chrysostom Knowledge Base: An Ontology of Historical Interactions- Text Simplification for Information-Seeking Applications- Information Integration- Integration of Integrity Constraints in Federated Schemata Based on Tight Constraining- Modal Query Language for Databases with Partial Orders- Composing Mappings Between Schemas Using a Reference Ontology- Assisting Ontology Integration with Existing Thesauri

284 citations


Book ChapterDOI
01 Jan 2004
TL;DR: This paper discusses modeling, representation and computation or validation of three types of complex semantic relationships: using predefined multi-ontology relationships for query processing and virtual relationships based on a set of patterns and paths between entities of interest.
Abstract: The primary goal of today's search and browsing techniques is to find relevant documents. As the current web evolves into the next generation termed the Semantic Web, the emphasis will shift from finding documents to finding facts, actionable information, and insights. Improving the ability to extract facts, mainly in the form of entities, embedded within documents leads to the fundamental challenge of discovering relevant and interesting relationships amongst the entities that these documents describe. Relationships are fundamental to semantics—to associate meanings to words, terms and entities. They are a key to new insights. Knowledge discovery is also about discovery of heretofore new relationships. The Semantic Web seeks to associate annotations (i.e., metadata), primarily consisting of concepts (often representing entities) from one or more ontologies/vocabularies, with all Web-accessible resources such that programs can associate "meaning with data". Not only does it support the goal of automatic interpretation and processing (access, invoke, utilize, and analyze), it also enables improvements in scalability compared to approaches that are not semantics-based. Identification, discovery, validation and utilization of relationships (such as during query evaluation) will be a critical computation on the Semantic Web. Based on our research over the last decade, this paper takes an empirical look at various types of simple and complex relationships, what is captured and how they are represented, and how they are identified, discovered or validated, and exploited. These relationships may be based only on what is contained in or directly derived from data (direct content based relationships), or may be based on information extraction, external and prior knowledge and user defined computations (content descriptive relationships). We also present some recent techniques for discovering indirect (i.e., transitive) and virtual (i.e., user-defined) yet meaningful (i.e., contextually relevant) relationships based on a set of patterns and paths between entities of interest. In particular, we will discuss modeling, representation and computation or validation of three types of complex semantic relationships: (a) using predefined multi-ontology relationships for query processing and

281 citations


Journal ArticleDOI
TL;DR: In spite of many remaining unsolved problems and need for further research and development, use of knowledge and semi-automation are the only viable alternatives towards development of useful object extraction systems, as some commercial systems on building extraction and 3D city modelling as well as advanced, practically oriented research have shown.
Abstract: The paper focuses mainly on extraction of important topographic objects, like buildings and roads, that have received much attention the last decade. As main input data, aerial imagery is considered, although other data, like from laser scanner, SAR and high-resolution satellite imagery, can be also used. After a short review of recent image analysis trends, and strategy and overall system aspects of knowledge-based image analysis, the paper focuses on aspects of knowledge that can be used for object extraction: types of knowledge, problems in using existing knowledge, knowledge representation and management, current and possible use of knowledge, upgrading and augmenting of knowledge. Finally, an overview on commercial systems regarding automated object extraction and use of a priori knowledge is given. In spite of many remaining unsolved problems and need for further research and development, use of knowledge and semi-automation are the only viable alternatives towards development of useful object extraction systems, as some commercial systems on building extraction and 3D city modelling as well as advanced, practically oriented research have shown.

277 citations


Journal ArticleDOI
TL;DR: The PowerBioNE system is the first system which deals with the cascaded entity name phenomenon, and the HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem.
Abstract: Motivation: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists to gather and make use of the knowledge encoded in text documents. In order to make organized and structured information available, automatically recognizing biomedical entity names becomes critical and is important for information retrieval, information extraction and automated knowledge acquisition. Results: In this paper, we present a named entity recognition system in the biomedical domain, called PowerBioNE. In order to deal with the special phenomena of naming conventions in the biomedical domain, we propose various evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) part-of-speech; (4) head noun trigger; (5) special verb trigger and (6) name alias feature. All the features are integrated effectively and efficiently through a hidden Markov model (HMM) and an HMM-based named entity recognizer. In addition, a k-Nearest Neighbor (k-NN) algorithm is proposed to resolve the data sparseness problem in our system. Finally, we present a pattern-based post-processing step to automatically extract rules from the training data to deal with the cascaded entity name phenomenon. To the best of our knowledge, PowerBioNE is the first system which deals with the cascaded entity name phenomenon. Evaluation shows that our system achieves F-measures of 66.6 and 62.2 on the 23 classes of GENIA V3.0 and V1.1, respectively. In particular, our system achieves an F-measure of 75.8 on the 'protein' class of GENIA V3.0. For comparison, our system outperforms the best published result by 7.8 on GENIA V1.1, without the help of any dictionaries. It also shows that our HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem. Moreover, evaluation on GENIA V3.0 shows that the post-processing for the cascaded entity name phenomenon improves the F-measure by 3.9. Finally, error analysis shows that about half of the errors are caused by the strict annotation scheme and the annotation inconsistency in the GENIA corpus. This suggests that our system achieves an acceptable F-measure of 83.6 on the 23 classes of GENIA V3.0 and in particular 86.2 on the 'protein' class, without the help of any dictionaries. We think that an F-measure of 90 on the 23 classes of GENIA V3.0, and in particular 92 on the 'protein' class, can be achieved through refinement of the annotation scheme in the GENIA corpus, such as a flexible annotation scheme and annotation consistency, and inclusion of a reasonable biomedical dictionary. Availability: A demo system is available at http://textmining.i2r.a-star.edu.sg/NLS/demo.htm. A technology license is available upon bilateral agreement.
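As background for the HMM component, here is a generic Viterbi decoder over BIO-style entity tags. It shows only the decoding step common to HMM-based named entity recognizers; PowerBioNE additionally integrates the evidential features listed above and k-NN smoothing, and every probability, tag, and vocabulary entry in the toy example is invented.

```python
import numpy as np

# Generic Viterbi decoding for an HMM sequence tagger over BIO-style entity tags.
# Toy transition/emission probabilities only; not PowerBioNE's feature-rich HMM.

def viterbi(obs, states, log_start, log_trans, log_emit):
    """obs: list of observation indices; returns the most probable tag sequence."""
    n, k = len(obs), len(states)
    delta = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_trans            # (k, k): prev -> cur
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(k)] + log_emit[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [states[s] for s in reversed(path)]

states = ["O", "B-protein", "I-protein"]
vocab = {"the": 0, "IL-2": 1, "gene": 2}
log_start = np.log([0.8, 0.15, 0.05])
log_trans = np.log([[0.8, 0.15, 0.05],   # from O
                    [0.3, 0.1, 0.6],     # from B-protein
                    [0.4, 0.1, 0.5]])    # from I-protein
log_emit = np.log([[0.7, 0.1, 0.2],      # O emits: the, IL-2, gene
                   [0.1, 0.8, 0.1],      # B-protein
                   [0.1, 0.3, 0.6]])     # I-protein
sentence = ["the", "IL-2", "gene"]
print(viterbi([vocab[w] for w in sentence], states, log_start, log_trans, log_emit))
```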

261 citations


Journal ArticleDOI
TL;DR: Reference entry for the Handbook of Data Mining and Knowledge Discovery.

252 citations


Dissertation
01 Jan 2004
TL;DR: Experimental results demonstrate that discovered patterns in extracted text can be used to effectively improve the underlying IE method, and an approach to using rules mined from extracted data to improve the accuracy of information extraction is presented.
Abstract: The popularity of the Web and the large number of documents available in electronic form have motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this dissertation, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases or KDD). By utilizing existing IE and KDD techniques, text-mining systems can be developed relatively rapidly and evaluated on existing text corpora for testing IE systems. We present a general text-mining framework called DISCOTEX which employs an IE module for transforming natural-language documents into structured data and a KDD module for discovering prediction rules from the extracted data. When discovering patterns in extracted text, strict matching of strings is inadequate because textual database entries generally exhibit variations due to typographical errors, misspellings, abbreviations, and other sources. We introduce the notion of discovering “soft-matching” rules from text and present two new learning algorithms. TEXTRISE is an inductive method for learning soft-matching prediction rules that integrates rule-based and instance-based learning methods. Simple, interpretable rules are discovered using rule induction, while a nearest-neighbor algorithm provides soft matching. SOFTAPRIORI is a text-mining algorithm for discovering association rules from texts that uses a similarity measure to allow flexible matching to variable database items. We present experimental results on inducing prediction and association rules from natural-language texts demonstrating that TEXTRISE and SOFTAPRIORI learn more accurate rules than previous methods for these tasks. We also present an approach to using rules mined from extracted data to improve the accuracy of information extraction. Experimental results demonstrate that such discovered patterns can be used to effectively improve the underlying IE method.
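To illustrate the soft-matching idea in isolation, the sketch below treats two extracted field values as matching when a string-similarity score exceeds a threshold, and computes the support of a rule under that relaxed notion of equality. It is a simplified stand-in, not the TEXTRISE or SOFTAPRIORI algorithms; the similarity measure, threshold, and records are all invented.

```python
from difflib import SequenceMatcher

# "Soft matching" for rules mined from extracted text: field values count as equal
# when they are sufficiently similar, tolerating typos and formatting variants.

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def soft_support(records, antecedent, consequent, field_a, field_c, threshold=0.8):
    """Fraction of records whose fields softly match both sides of the rule."""
    hits = sum(1 for r in records
               if similar(r[field_a], antecedent, threshold)
               and similar(r[field_c], consequent, threshold))
    return hits / len(records)

# Extracted records with typographical variation (hypothetical job-posting data).
records = [
    {"language": "Java", "area": "databases"},
    {"language": "JAVA", "area": "data bases"},
    {"language": "Javascript", "area": "web design"},
    {"language": "C++", "area": "databases"},
]
print(soft_support(records, "java", "databases", "language", "area"))  # 0.5
```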

Journal ArticleDOI
TL;DR: The proposed solution joins all the log files to reconstitute the visits of users who accessed the Web site, and adds a data summarization step that allows the analyst to select only the information of interest.
Abstract: Web usage mining applies data mining procedures to analyze user access of Web sites. As with any KDD (knowledge discovery and data mining) process, WUM contains three main steps: preprocessing, knowledge extraction, and results analysis. We focus on data preprocessing, a tedious, complex process. Analysts aim to determine the exact list of users who accessed the Web site and to reconstitute user sessions, that is, the sequence of actions each user performed on the Web site. Intersite WUM deals with Web server logs from several Web sites, generally belonging to the same organization. Thus, analysts must reassemble the users' path through all the different Web servers that they visited. Our solution is to join all the log files and reconstitute the visit. Classical data preprocessing involves three steps: data fusion, data cleaning, and data structuration. Our solution for WUM adds what we call advanced data preprocessing. This consists of a data summarization step, which will allow the analyst to select only the information of interest. We've successfully tested our solution in an experiment with log files from INRIA Web sites.
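A minimal sketch of the joining and session-reconstitution step is given below: records from several servers are fused, grouped per user, ordered by time, and cut into sessions by an inactivity timeout. The field names, the 30-minute timeout, and the IP-address-as-user assumption are illustrative simplifications, not the paper's actual heuristics.

```python
from datetime import datetime, timedelta

# Simplified intersite log fusion and session reconstitution: merge records from
# several Web servers, group per user, sort by time, and split on inactivity.

SESSION_TIMEOUT = timedelta(minutes=30)

def reconstitute_sessions(*log_files):
    records = [r for log in log_files for r in log]          # data fusion
    records.sort(key=lambda r: (r["user"], r["timestamp"]))  # group by user, order by time
    sessions, current, last = [], [], {}
    for r in records:
        u = r["user"]
        if current and (current[-1]["user"] != u or
                        r["timestamp"] - last[u] > SESSION_TIMEOUT):
            sessions.append(current)
            current = []
        current.append(r)
        last[u] = r["timestamp"]
    if current:
        sessions.append(current)
    return sessions

t = lambda s: datetime.fromisoformat(s)
server_a = [{"user": "10.0.0.1", "timestamp": t("2004-05-01 10:00"), "url": "/a/index.html"}]
server_b = [{"user": "10.0.0.1", "timestamp": t("2004-05-01 10:05"), "url": "/b/docs.html"},
            {"user": "10.0.0.1", "timestamp": t("2004-05-01 12:00"), "url": "/b/faq.html"}]
for s in reconstitute_sessions(server_a, server_b):
    print([r["url"] for r in s])   # two sessions: the third request starts a new one
```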

Journal ArticleDOI
TL;DR: A novel methodology that acquires collective knowledge from the World Wide Web using the Google API is presented, along with PANKOW, a concrete instantiation of this methodology, which is evaluated in two experiments: one with the aim of classifying novel instances with regard to an existing ontology and one with the aim of learning sub-/superconcept relations.
Abstract: The goal of giving a well-defined meaning to information is currently shared by endeavors such as the Semantic Web as well as by current trends within Knowledge Management. They all depend on the large-scale formalization of knowledge and on the availability of formal metadata about information resources. However, the question of how to provide the necessary formal metadata in an effective and efficient way is still not solved to a satisfactory extent. Certainly, the most effective way to provide such metadata as well as formalized knowledge is to let humans encode them directly into the system, but this is neither efficient nor feasible. Furthermore, as current social studies show, individual knowledge is often less powerful than the collective knowledge of a certain community. As a potential way out of the knowledge acquisition bottleneck, we present a novel methodology that acquires collective knowledge from the World Wide Web using the Google API. In particular, we present PANKOW, a concrete instantiation of this methodology, which is evaluated in two experiments: one with the aim of classifying novel instances with regard to an existing ontology and one with the aim of learning sub-/superconcept relations.
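The core of the pattern-based approach can be sketched as follows: instantiate Hearst-style patterns with a candidate instance and each candidate concept, count how often the resulting phrases occur, and pick the concept with the most evidence. PANKOW obtained such counts from the Google API; in this hedged sketch a small local list of sentences stands in for the Web, and the patterns, concepts, and sentences are invented.

```python
import re
from collections import Counter

# Pattern-based categorization in the spirit of PANKOW: score each candidate
# concept by counting occurrences of instantiated lexico-syntactic patterns.

PATTERNS = ["{concept}s such as {instance}",
            "{instance} is a {concept}",
            "{instance} and other {concept}s"]

def categorize(instance, concepts, corpus):
    scores = Counter()
    text = " ".join(corpus).lower()
    for concept in concepts:
        for pattern in PATTERNS:
            phrase = pattern.format(concept=concept, instance=instance).lower()
            scores[concept] += len(re.findall(re.escape(phrase), text))
    return scores.most_common()

corpus = [
    "Niagara Falls is a city in Ontario.",
    "Many tourists visit cities such as Toronto.",
    "Waterfalls such as Angel Falls are spectacular.",
]
print(categorize("Niagara Falls", ["city", "waterfall", "hotel"], corpus))
# [('city', 1), ('waterfall', 0), ('hotel', 0)]
```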

Journal ArticleDOI
TL;DR: A semantic-link-making tool is developed that lets users conveniently describe their understanding of provided resources and background knowledge; the knowledge grid infrastructure supports e-science through a set of relevant application services and semantic resources.
Abstract: The Internet and World Wide Web are milestones in the history of information sharing. Scientists are increasingly relying on them to support their research. Knowledge is the basis of realizing intelligent services. The knowledge grid is a mechanism that can synthesize knowledge from data through mining and inference methods and enable search engines to make inferences, answer questions, and draw conclusions from masses of data. The knowledge grid infrastructure supports e-science through a set of relevant application services and semantic resources. We have developed a semantic-link-making tool for users to conveniently describe their understanding of provided resources and background knowledge.

BookDOI
01 Sep 2004
TL;DR: It is shown how carefully crafted random matrices can achieve distance-preserving dimensionality reduction, accelerate spectral computations, and reduce the sample complexity of certain kernel methods.
Abstract: We show how carefully crafted random matrices can achieve distance-preserving dimensionality reduction, accelerate spectral computations, and reduce the sample complexity of certain kernel methods.
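As a concrete instance of distance-preserving dimensionality reduction, the sketch below applies a plain Gaussian random projection in the spirit of the Johnson-Lindenstrauss lemma and checks how well pairwise distances survive. It is a generic illustration, not the specific random-matrix constructions analyzed in the book, and the dimensions are arbitrary.

```python
import numpy as np

# Gaussian random projection: projecting onto k random directions scaled by
# 1/sqrt(k) approximately preserves pairwise Euclidean distances with high
# probability (Johnson-Lindenstrauss style argument).

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 200                       # points, original dim, reduced dim
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)  # random projection matrix
Y = X @ R                                     # reduced representation

def pairwise_dists(Z):
    sq = np.sum(Z ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0))

orig, proj = pairwise_dists(X), pairwise_dists(Y)
mask = ~np.eye(n, dtype=bool)
ratios = proj[mask] / orig[mask]
print(f"distance ratios: min={ratios.min():.3f}, max={ratios.max():.3f}")  # close to 1
```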

Book ChapterDOI
23 Feb 2004
TL;DR: The existing ARM methods are discussed, a set of guidelines for the design of novel ones is provided, some open algorithmic issues on the FCA side are listed, and two on-line methods computing the minimal generators of a closure system are proposed.
Abstract: Data mining (DM) is the extraction of regularities from raw data, which are further transformed within the wider process of knowledge discovery in databases (KDD) into non-trivial facts intended to support decision making. Formal concept analysis (FCA) offers an appropriate framework for KDD, whereby our focus here is on its potential for DM support. A variety of mining methods powered by FCA have been published and the figures grow steadily, especially in the association rule mining (ARM) field. However, an analysis of current ARM practices suggests the impact of FCA has not reached its limits, i.e., appropriate FCA-based techniques could successfully apply in a larger set of situations. As a first step in the projected FCA expansion, we discuss the existing ARM methods, provide a set of guidelines for the design of novel ones, and list some open algorithmic issues on the FCA side. As an illustration, we propose two on-line methods computing the minimal generators of a closure system.
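The closure operator at the heart of FCA-based mining can be stated in a few lines: the closure of an itemset is the set of items common to all transactions containing it, closed itemsets are its fixed points, and minimal generators are the smallest itemsets with a given closure. The sketch below illustrates this on a made-up transaction database; it is not one of the on-line minimal-generator algorithms proposed in the paper.

```python
# Closure operator from formal concept analysis applied to itemsets: the closure
# of an itemset A is the set of items shared by all transactions that contain A.

def closure(items, transactions):
    items = frozenset(items)
    covering = [t for t in transactions if items <= t]   # transactions containing A
    if not covering:
        return frozenset().union(*transactions)          # empty extent -> all items
    common = set(covering[0])
    for t in covering[1:]:
        common &= t                                      # intersect their item sets
    return frozenset(common)

transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}, {"b", "c"}]]

print(sorted(closure({"d"}, transactions)))       # ['a', 'c', 'd']: {'d'} generates this closed set
print(sorted(closure({"a", "b"}, transactions)))  # ['a', 'b']: already closed
```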

Book
01 Oct 2004
TL;DR: This collection surveys the most recent advances in the field and charts directions for future research, discussing topics that include distributed data mining algorithms for new application areas, several aspects of next-generation data mining systems and applications, and detection of recurrent patterns in digital media.
Abstract: Data mining, or knowledge discovery, has become an indispensable technology for businesses and researchers in many fields. Drawing on work in such areas as statistics, machine learning, pattern recognition, databases, and high performance computing, data mining extracts useful information from the large data sets now available to industry and science. This collection surveys the most recent advances in the field and charts directions for future research. The first part looks at pervasive, distributed, and stream data mining, discussing topics that include distributed data mining algorithms for new application areas, several aspects of next-generation data mining systems and applications, and detection of recurrent patterns in digital media. The second part considers data mining, counter-terrorism, and privacy concerns, examining such topics as biosurveillance, marshalling evidence through data mining, and link discovery. The third part looks at scientific data mining; topics include mining temporally-varying phenomena, data sets using graphs, and spatial data mining. The last part considers web, semantics, and data mining, examining advances in text mining algorithms and software, semantic webs, and other subjects.

Journal ArticleDOI
01 Jan 2004
TL;DR: A new approach is offered which improves classification accuracy by using a preliminary filtering procedure; the proposed filter is finally compared to the relaxation relabelling schema.
Abstract: Data mining and knowledge discovery aim at producing useful and reliable models from the data. Unfortunately, some databases contain noisy data which perturb the generalization of the models. An important source of noise consists of mislabelled training instances. We offer a new approach which improves classification accuracy by using a preliminary filtering procedure. An example is suspect when, in its neighbourhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database itself. Such suspect examples in the training data can be removed or relabelled. The filtered training set is then provided as input to learning algorithms. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final algorithm, show that removal gives better results than relabelling. Removal maintains the generalization error rate when we introduce from 0 to 20% class noise, especially when classes are well separable. The proposed filtering method is finally compared to the relaxation relabelling schema.
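A simplified version of the filtering rule is easy to sketch: flag an example as suspect when, among its nearest neighbours, the proportion of examples sharing its class is not clearly above that class's global proportion. The paper defines neighbourhoods through geometrical proximity graphs and uses a statistical test; the plain k-NN neighbourhood, the fixed margin, and the synthetic data below are illustrative stand-ins.

```python
import numpy as np

# Flag suspect (possibly mislabelled) training examples by comparing the class
# proportion in each example's neighbourhood with the global class proportion.

def suspect_mask(X, y, k=5, margin=0.1):
    X, y = np.asarray(X, float), np.asarray(y)
    n = len(y)
    global_prop = {c: np.mean(y == c) for c in np.unique(y)}
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                      # exclude the point itself
    suspect = np.zeros(n, dtype=bool)
    for i in range(n):
        neigh = np.argsort(d[i])[:k]
        local_prop = np.mean(y[neigh] == y[i])
        suspect[i] = local_prop <= global_prop[y[i]] + margin
    return suspect

rng = np.random.default_rng(1)
X0 = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
X1 = rng.normal(loc=3.0, scale=0.5, size=(20, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)
y[0] = 1                                             # inject one mislabelled example
mask = suspect_mask(X, y)
print("suspect indices:", np.where(mask)[0])         # index 0 should be flagged
X_clean, y_clean = X[~mask], y[~mask]                # the "removal" variant of the filter
```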

Journal ArticleDOI
TL;DR: A process based on multi-criteria decision analysis would empower DM project teams to carry out thorough experimentation and analysis without being overwhelmed by the task of analyzing a significant number of DTs, offering a positive contribution to the DM process.

Book ChapterDOI
TL;DR: This paper surveys the core concepts used in recent work on condensed representations for frequent sets.
Abstract: Solving inductive queries which have to return complete collections of patterns satisfying a given predicate has been studied extensively the last few years. The specific problem of frequent set mining from potentially huge boolean matrices has given rise to tens of efficient solvers. Frequent sets are indeed useful for many data mining tasks, including the popular association rule mining task but also feature construction, association-based classification, clustering, etc. The research in this area has been boosted by the fascinating concept of condensed representations w.r.t. frequency queries. Such representations can be used to support the discovery of every frequent set and its support without looking back at the data. Interestingly, the size of condensed representations can be several orders of magnitude smaller than the size of frequent set collections. Most of the proposals concern exact representations while it is also possible to consider approximated ones, i.e., to trade computational complexity with a bounded approximation on the computed support values. This paper surveys the core concepts used in the recent works on condensed representation for frequent sets.

01 Jan 2004
TL;DR: A new single-pass algorithm, called DSM-FI (Data Stream Mining for Frequent Itemsets), is proposed to mine all frequent itemsets over the entire history of data streams; a performance study shows that DSM-FI outperforms the well-known Lossy Counting algorithm in the same streaming environment.
Abstract: A data stream is a continuous, huge, fast-changing, infinite sequence of data elements. The nature of streaming data makes it essential to use online algorithms which require only one scan over the data for knowledge discovery. In this paper, we propose a new single-pass algorithm, called DSM-FI (Data Stream Mining for Frequent Itemsets), to mine all frequent itemsets over the entire history of data streams. DSM-FI has three major features, namely a single streaming data scan for counting itemsets' frequency information, an extended prefix-tree-based compact pattern representation, and a top-down frequent itemset discovery scheme. Our performance study shows that DSM-FI outperforms the well-known Lossy Counting algorithm in the same streaming environment.
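For context, the baseline named in the abstract, Lossy Counting, is a classic single-pass approximate counting scheme. The sketch below implements it for single items rather than itemsets (a deliberate simplification of what DSM-FI and Lossy Counting actually mine); the error parameter and toy stream are arbitrary.

```python
from math import ceil

# Lossy Counting over a stream of single items: every item with true frequency
# >= s*N is reported, and reported counts underestimate true counts by at most
# eps*N, using memory that depends on eps rather than on the stream length.

class LossyCounter:
    def __init__(self, eps=0.01):
        self.eps = eps
        self.width = ceil(1 / eps)       # bucket width
        self.n = 0                       # stream length so far
        self.entries = {}                # item -> (count, delta)

    def add(self, item):
        self.n += 1
        bucket = ceil(self.n / self.width)
        count, delta = self.entries.get(item, (0, bucket - 1))
        self.entries[item] = (count + 1, delta)
        if self.n % self.width == 0:     # prune at each bucket boundary
            self.entries = {i: (c, d) for i, (c, d) in self.entries.items()
                            if c + d > bucket}

    def frequent(self, s):
        """Items whose stored count is at least (s - eps) * N."""
        return {i: c for i, (c, d) in self.entries.items()
                if c >= (s - self.eps) * self.n}

lc = LossyCounter(eps=0.05)
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "e"] * 20
for x in stream:
    lc.add(x)
print(lc.frequent(s=0.2))   # 'a' (50%) and 'b' (20%) should be reported
```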

Journal ArticleDOI
01 Jan 2004
TL;DR: MOWCATL, an efficient method for mining frequent association rules from multiple sequential data sets, introduces the use of separate antecedent and consequent inclusion constraints, in addition to the traditional frequency and support constraints in sequential data mining.
Abstract: This paper presents MOWCATL, an efficient method for mining frequent association rules from multiple sequential data sets. Our goal is to find patterns in one or more sequences that precede the occurrence of patterns in other sequences. Recent work has highlighted the importance of using constraints to focus the mining process on the association rules relevant to the user. To refine the data mining process, this approach introduces the use of separate antecedent and consequent inclusion constraints, in addition to the traditional frequency and support constraints in sequential data mining. Moreover, separate antecedent and consequent maximum window widths are used to specify the antecedent and consequent patterns that are separated by either a maximal width time lag or a fixed width time lag. Multiple time series drought risk management data are used to show that our approach can be effectively employed in real-life problems. This approach is compared to existing methods to show how they complement each other to discover associations in the drought risk management domain. The experimental results validate the superior performance of our method for efficiently finding relationships between global climatic episodes and local drought conditions. Both the maximal and fixed width time lags are shown to be useful when finding interesting associations.
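The flavour of the antecedent/consequent time-lag constraints can be shown with a toy check over two aligned event sequences: count how often an antecedent event is followed, within a maximum lag, by a consequent event in the other series. The event names, series, and lag are invented, and the real method additionally enforces frequency/support constraints and separate window widths.

```python
# Toy lagged association check between two aligned event sequences, in the spirit
# of MOWCATL's separate antecedent/consequent constraints and time lags.

def lagged_support(antecedent_series, consequent_series, antecedent, consequent, max_lag):
    hits = 0
    occurrences = [t for t, e in enumerate(antecedent_series) if e == antecedent]
    for t in occurrences:
        window = consequent_series[t + 1: t + 1 + max_lag]   # look ahead up to max_lag steps
        if consequent in window:
            hits += 1
    return hits, len(occurrences)

# e.g., a global climatic episode series and a local drought index series
climate = ["normal", "el_nino", "normal", "normal", "el_nino", "normal", "normal"]
drought = ["none",   "none",    "severe", "none",   "none",    "none",   "severe"]
print(lagged_support(climate, drought, "el_nino", "severe", max_lag=2))  # (2, 2)
```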

Proceedings ArticleDOI
18 Apr 2004
TL;DR: This paper investigates the use, the benefits and the development requirements of Web-accessible ontologies for discrete-event simulation and modeling, and develops a prototype OWL-based ontology for modeling and simulation called the discrete-event modeling ontology (DeMO).
Abstract: Many fields have or are developing ontologies for their subdomains. The gene ontology (GO) is now considered to be a great success in biology, a field that has already developed several extensive ontologies. Similar advantages could accrue to the simulation and modeling community. Ontologies provide a way to establish common vocabularies and capture domain knowledge for organizing the domain with a community wide agreement or with the context of agreement between leading domain experts. They can be used to deliver significantly improved (semantic) search and browsing, integration of heterogeneous information sources, and improved analytics and knowledge discovery capabilities. Such knowledge can be used to establish common vocabularies, nomenclatures and taxonomies with links to detailed information sources. This paper investigates the use, the benefits and the development requirements of Web-accessible ontologies for discrete-event simulation and modeling. As a case study, the development of a prototype OWL-based ontology for modeling and simulation called the discrete-event modeling ontology (DeMO) is also discussed. Prototype ontologies such as DeMO can serve as a basis for achieving broader community agreement and adoption of ontologies for this field.
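To give a feel for what an OWL-based modeling ontology looks like in practice, the sketch below builds a tiny class taxonomy with the rdflib library (assumed to be available). The class names are hypothetical stand-ins chosen for illustration, not the actual DeMO class hierarchy described in the paper.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import OWL, RDF, RDFS

# Build a tiny OWL class taxonomy for discrete-event modeling concepts and print
# it as Turtle. Class names and the namespace URI are invented for illustration.

DEMO = Namespace("http://example.org/demo#")
g = Graph()
g.bind("demo", DEMO)

g.add((DEMO.DiscreteEventModel, RDF.type, OWL.Class))
g.add((DEMO.StateOrientedModel, RDF.type, OWL.Class))
g.add((DEMO.EventOrientedModel, RDF.type, OWL.Class))
g.add((DEMO.StateOrientedModel, RDFS.subClassOf, DEMO.DiscreteEventModel))
g.add((DEMO.EventOrientedModel, RDFS.subClassOf, DEMO.DiscreteEventModel))
g.add((DEMO.DiscreteEventModel, RDFS.comment,
       Literal("A model whose state changes at discrete points in simulated time.")))

print(g.serialize(format="turtle"))
```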

Journal ArticleDOI
01 Dec 2004
TL;DR: A visual concept ontology is proposed to guide experts in the visual description of the objects of their domain (e.g., pollen grains), resulting in a knowledge base that enables semantic image interpretation.
Abstract: This paper details a visual-concept-ontology-driven knowledge acquisition methodology. We propose to use a visual concept ontology to guide experts in the visual description of the objects of their domain (e.g., pollen grain). The proposed knowledge acquisition process results in a knowledge base enabling semantic image interpretation. An important benefit of our approach is that the knowledge acquisition process guided by the ontology leads to a knowledge base close to low-level vision. A visual concept ontology and a dedicated knowledge acquisition tool have been developed and are presented. We propose a generic methodology that is not linked to any application domain. An example shows how the knowledge acquisition model can be applied to the description of pollen grain images.

01 Apr 2004
TL;DR: This thesis is focused on the monotonicity property in knowledge discovery and more specifically in classification, attribute reduction, function decomposition, frequent patterns generation and missing values handling.
Abstract: The monotonicity property is ubiquitous in our lives and it appears in different roles: as domain knowledge, as a requirement, as a property that reduces the complexity of the problem, and so on. It is present in various domains: economics, mathematics, languages, operations research and many others. This thesis is focused on the monotonicity property in knowledge discovery and more specifically in classification, attribute reduction, function decomposition, frequent patterns generation and missing values handling. Four specific problems are addressed within four different methodologies, namely, rough sets theory, monotone decision trees, function decomposition and frequent patterns generation. In the first three parts, the monotonicity is domain knowledge and a requirement for the outcome of the classification process. The three methodologies are extended for dealing with monotone data in order to be able to guarantee that the outcome will also satisfy the monotonicity requirement. In the last part, monotonicity is a property that helps reduce the computation of the process of frequent patterns generation. Here the focus is on two of the best algorithms and their comparison both theoretically and experimentally. About the Author: Viara Popova was born in Bourgas, Bulgaria in 1972. She followed her secondary education at Mathematics High School "Nikola Obreshkov" in Bourgas. In 1996 she finished her higher education at Sofia University, Faculty of Mathematics and Informatics, where she graduated with a major in Informatics and a specialization in Information Technologies in Education. She then joined the Department of Information Technologies, first as an associated member and from 1997 as an assistant professor. In 1999 she became a PhD student at Erasmus University Rotterdam, Faculty of Economics, Department of Computer Science. In 2004 she joined the Artificial Intelligence Group within the Department of Computer Science, Faculty of Sciences at Vrije Universiteit Amsterdam as a PostDoc researcher.

Journal ArticleDOI
TL;DR: This work proposes a comprehensive software architecture for the next-generation grid, which integrates currently available services and components in Semantic Web, Semantic Grid, P2P, and ubiquitous systems.
Abstract: Just as the Internet is shifting its focus from information and communication to a knowledge delivery infrastructure, we see the Grid moving from computation and data management to a pervasive, worldwide knowledge management infrastructure. We have the technology to store and access data, but we seem to lack the ability to transform data tombs into useful data and extract knowledge from them. We review some of the current and future technologies that will impact the architecture, computational model, and applications of future grids. We attempt to forecast the evolution of computational grids into what we call the next-generation grid, with a particular focus on the use of semantics and knowledge discovery techniques and services. We propose a comprehensive software architecture for the next-generation grid, which integrates currently available services and components in Semantic Web, Semantic Grid, P2P, and ubiquitous systems. We'll also discuss a case study that shows how some new technologies can improve grid applications.

Book ChapterDOI
30 Aug 2004
TL;DR: Information scientists face the same challenge as a result of the digital revolution that expedites the production of terabytes of data from credit card transactions, medical examinations, telephone calls, stock values, and numerous other human activities.
Abstract: This chapter provides a research frame for geographic information scientists to study the integration of geospatial data mining and knowledge discovery. It examines the current state of data mining (DM) and knowledge discovery in databases (KDD) technology, identifies special needs for geospatial DM and KDD, and outlines research challenges and their significance to national research needs. The chapter briefly explores research frontiers in geographic knowledge discovery and proposes a research agenda to highlight short-term, midterm, and long-term objectives. The University Consortium for Geographic Information Science seeks to facilitate a multidisciplinary research effort on the development of geospatial DM/KDD science and technology. Geographic data has unique properties that require special consideration and techniques. Geographic information exists within highly dimensioned geographic measurement frameworks. Knowledge-based Geographic Information Science attempts to build higher-level geographic knowledge into digital geographic databases for analyzing complex phenomena. The development of DM and knowledge discovery tools must be supported by a solid geographic foundation.

Journal ArticleDOI
TL;DR: The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space, while ensembles of neural networks may be the best solution for large datasets.
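As a concrete reading of "ensembles of logistic regression classifiers", the sketch below bags scikit-learn logistic regressions with bootstrap sampling and majority voting on synthetic noisy, high-dimensional data. It is a generic illustration under those assumptions, not the ensemble construction or experimental setup of the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Bootstrap-aggregated (bagged) logistic regression with majority voting.

def fit_bagged_logreg(X, y, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))        # bootstrap sample
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    votes = np.stack([m.predict(X) for m in models])      # (n_models, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)  # majority vote

# Toy noisy, high-dimensional data: labels depend on the first feature only.
rng = np.random.default_rng(42)
X = rng.standard_normal((300, 50))
y = (X[:, 0] + 0.5 * rng.standard_normal(300) > 0).astype(int)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

models = fit_bagged_logreg(X_train, y_train)
acc = np.mean(predict_bagged(models, X_test) == y_test)
print(f"bagged logistic regression test accuracy: {acc:.2f}")
```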

Journal ArticleDOI
TL;DR: Ontology-enabled knowledge management experiences derived from a domain ontology development project at Intel Corporation are described and assessed.
Abstract: Ontology-enabled knowledge management experiences derived from a domain ontology development project at Intel Corporation are described and assessed.

Book ChapterDOI
01 Jan 2004
TL;DR: The MiningMart system presented in this chapter focuses on setting up and re-using best practice cases of preprocessing data stored in very large databases using a metadata model named M4 to declaratively define and document both, all steps of such a preprocessing chain and all the data involved.
Abstract: Although preprocessing is one of the key issues in data analysis, it is still common practice to address this task by manually entering SQL statements and using a variety of stand-alone tools. The results are not properly documented and hardly re-usable. The MiningMart system presented in this chapter focuses on setting up and re-using best practice cases of preprocessing data stored in very large databases. A metadata model named M4 is used to declaratively define and document both, all steps of such a preprocessing chain and all the data involved. For data and applied operators there is an abstract level, understandable by human users, and an executable level, used by the metadata compiler to run cases for given data sets. An integrated environment allows for rapid development of preprocessing chains. Adaptation to different environments is supported simply by specifying all involved database entities in the target DBMS. This allows reuse of best practice cases published on the Internet.