
Showing papers on "Knowledge extraction published in 2012"


Proceedings Article
01 Jan 2012
TL;DR: This paper is a brief introduction to the special session on interpretable models in machine learning, organized as part of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, with an overview of the context of wider research on interpretability of machine learning models.
Abstract: Data of different levels of complexity and of ever growing diversity of characteristics are the raw materials that machine learning practitioners try to model using their wide palette of methods and tools. The obtained models are meant to be a synthetic representation of the available, observed data that captures some of their intrinsic regularities or patterns. Therefore, the use of machine learning techniques for data analysis can be understood as a problem of pattern recognition or, more informally, of knowledge discovery and data mining. There exists a gap, though, between data modeling and knowledge extraction. Models, depending on the machine learning techniques employed, can be described in diverse ways but, in order to consider that some knowledge has been achieved from their description, we must take into account the human cognitive factor that any knowledge extraction process entails. These models as such can be rendered powerless unless they can be interpreted, and the process of human interpretation follows rules that go well beyond technical prowess. For this reason, interpretability is a paramount quality that machine learning methods should aim to achieve if they are to be applied in practice. This paper is a brief introduction to the special session on interpretable models in machine learning, organized as part of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. It includes a discussion on the several works accepted for the session, with an overview of the context of wider research on interpretability of machine learning models.

280 citations


Journal ArticleDOI
TL;DR: A general process for transforming historical electrical grid data into models that aim to predict the risk of failures for components and systems is introduced, and these models are sufficiently accurate to assist in maintaining New York City's electrical grid.
Abstract: Power companies can benefit from the use of knowledge discovery methods and statistical machine learning for preventive maintenance. We introduce a general process for transforming historical electrical grid data into models that aim to predict the risk of failures for components and systems. These models can be used directly by power companies to assist with prioritization of maintenance and repair work. Specialized versions of this process are used to produce (1) feeder failure rankings, (2) cable, joint, terminator, and transformer rankings, (3) feeder Mean Time Between Failure (MTBF) estimates, and (4) manhole events vulnerability rankings. The process in its most general form can handle diverse, noisy sources that are historical (static), semi-real-time, or real-time; it incorporates state-of-the-art machine learning algorithms for prioritization (supervised ranking or MTBF) and includes an evaluation of results via cross-validation and blind test. Above and beyond the ranked lists and MTBF estimates are business management interfaces that allow the prediction capability to be integrated directly into corporate planning and decision support; such interfaces rely on several important properties of our general modeling approach: that machine learning features are meaningful to domain experts, that the processing of data is transparent, and that prediction results are accurate enough to support sound decision making. We discuss the challenges in working with historical electrical grid data that were not designed for predictive purposes. The “rawness” of these data contrasts with the accuracy of the statistical models that can be obtained from the process; these models are sufficiently accurate to assist in maintaining New York City's electrical grid.
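The supervised ranking step described above can be sketched generically: train a model on historical component features, score each component's failure risk, and sort. The snippet below uses synthetic data and scikit-learn; it is illustrative only, not the pipeline used for New York City's grid, and all feature names are invented.

```python
# Generic sketch of "rank components by predicted failure risk"
# (synthetic features; not the actual grid-maintenance pipeline).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical historical features per feeder: age, load, past outages.
X = rng.random((200, 3))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.2, 200) > 1.0).astype(int)

model = LogisticRegression().fit(X, y)
risk = model.predict_proba(X)[:, 1]   # failure risk score per feeder
ranking = np.argsort(-risk)           # most at-risk feeders first
print("top 5 feeders to inspect:", ranking[:5])
```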

245 citations


Book
06 Dec 2012
TL;DR: This book discusses neural-symbolic integration: inserting symbolic background knowledge into neural networks, refining it through inductive learning, extracting rules from trained networks, and revising inconsistent knowledge, with applications in logic programming and nonmonotonic reasoning and a concluding chapter on the road ahead.
Abstract: 1. Introduction and Overview.- 1.1 Why Integrate Neurons and Symbols?.- 1.2 Strategies of Neural-Symbolic Integration.- 1.3 Neural-Symbolic Learning Systems.- 1.4 A Simple Example.- 1.5 How to Read this Book.- 1.6 Summary.- 2. Background.- 2.1 General Preliminaries.- 2.2 Inductive Learning.- 2.3 Neural Networks.- 2.3.1 Architectures.- 2.3.2 Learning Strategy.- 2.3.3 Recurrent Networks.- 2.4 Logic Programming.- 2.4.1 What is Logic Programming?.- 2.4.2 Fixpoints and Definite Programs.- 2.5 Nonmonotonic Reasoning.- 2.5.1 Stable Models and Acceptable Programs.- 2.6 Belief Revision.- 2.6.1 Truth Maintenance Systems.- 2.6.2 Compromise Revision.- I. Knowledge Refinement in Neural Networks.- 3. Theory Refinement in Neural Networks.- 3.1 Inserting Background Knowledge.- 3.2 Massively Parallel Deduction.- 3.3 Performing Inductive Learning.- 3.4 Adding Classical Negation.- 3.5 Adding Metalevel Priorities.- 3.6 Summary and Further Reading.- 4. Experiments on Theory Refinement.- 4.1 DNA Sequence Analysis.- 4.2 Power Systems Fault Diagnosis.- 4.3 Discussion.- 4.4 Appendix.- II. Knowledge Extraction from Neural Networks.- 5. Knowledge Extraction from Trained Networks.- 5.1 The Extraction Problem.- 5.2 The Case of Regular Networks.- 5.2.1 Positive Networks.- 5.2.2 Regular Networks.- 5.3 The General Case Extraction.- 5.3.1 Regular Subnetworks.- 5.3.2 Knowledge Extraction from Subnetworks.- 5.3.3 Assembling the Final Rule Set.- 5.4 Knowledge Representation Issues.- 5.5 Summary and Further Reading.- 6. Experiments on Knowledge Extraction.- 6.1 Implementation.- 6.2 The Monk's Problems.- 6.3 DNA Sequence Analysis.- 6.4 Power Systems Fault Diagnosis.- 6.5 Discussion.- III. Knowledge Revision in Neural Networks.- 7. Handling Inconsistencies in Neural Networks.- 7.1 Theory Revision in Neural Networks.- 7.1.1 The Equivalence with Truth Maintenance Systems.- 7.1.2 Minimal Learning.- 7.2 Solving Inconsistencies in Neural Networks.- 7.2.1 Compromise Revision.- 7.2.2 Foundational Revision.- 7.2.3 Nonmonotonic Theory Revision.- 7.3 Summary of the Chapter.- 8. Experiments on Handling Inconsistencies.- 8.1 Requirements Specifications Evolution as Theory Refinement.- 8.1.1 Analysing Specifications.- 8.1.2 Revising Specifications.- 8.2 The Automobile Cruise Control System.- 8.2.1 Knowledge Insertion.- 8.2.2 Knowledge Revision: Handling Inconsistencies.- 8.2.3 Knowledge Extraction.- 8.3 Discussion.- 8.4 Appendix.- 9. Neural-Symbolic Integration: The Road Ahead.- 9.1 Knowledge Extraction.- 9.2 Adding Disjunctive Information.- 9.3 Extension to the First-Order Case.- 9.4 Adding Modalities.- 9.5 New Preference Relations.- 9.6 A Proof Theoretical Approach.- 9.7 The "Forbidden Zone" [A_max, A_min].- 9.8 Acceptable Programs and Neural Networks.- 9.9 Epilogue.

245 citations


Proceedings ArticleDOI
12 Aug 2012
TL;DR: In this article, a local-first approach to community discovery is proposed, which democratically lets each node vote for the communities it sees surrounding it in its limited view of the global system, using a label propagation algorithm.
Abstract: Community discovery in complex networks is an interesting problem with a number of applications, especially in the knowledge extraction task in social and information networks. However, many large networks often lack a particular community organization at a global level. In these cases, traditional graph partitioning algorithms fail to let the latent knowledge embedded in modular structure emerge, because they impose a top-down global view of a network. We propose here a simple local-first approach to community discovery, able to unveil the modular organization of real complex networks. This is achieved by democratically letting each node vote for the communities it sees surrounding it in its limited view of the global system, i.e. its ego neighborhood, using a label propagation algorithm; finally, the local communities are merged into a global collection. We tested this intuition against the state-of-the-art overlapping and non-overlapping community discovery methods, and found that our new method clearly outperforms the others in the quality of the obtained communities, evaluated by using the extracted communities to predict the metadata about the nodes of several real world networks. We also show how our method is deterministic, fully incremental, and has a limited time complexity, so that it can be used on web-scale real networks.
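A minimal sketch of the local-first idea: run label propagation inside each node's ego network, then merge overlapping local communities into a global collection. The helper names and merge threshold below are hypothetical; this is the shape of the approach, not the authors' implementation.

```python
# Illustrative local-first community discovery on an adjacency dict
# (node -> set of neighbours); not the paper's actual algorithm.
import random
from collections import Counter

def label_propagation(nodes, adj, iters=10):
    """Plain label propagation restricted to the given node set."""
    labels = {v: v for v in nodes}
    for _ in range(iters):
        order = list(nodes)
        random.shuffle(order)
        changed = False
        for v in order:
            neigh = [labels[u] for u in adj[v] if u in nodes]
            if not neigh:
                continue
            best = Counter(neigh).most_common(1)[0][0]
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:
            break
    groups = {}
    for v, lab in labels.items():
        groups.setdefault(lab, set()).add(v)
    return list(groups.values())

def local_first_communities(adj):
    communities = []
    for ego in adj:
        ego_net = adj[ego] | {ego}     # the node's limited local view
        for c in label_propagation(ego_net, adj):
            c.add(ego)
            # merge with an existing community if they largely overlap
            for existing in communities:
                if len(c & existing) >= 0.5 * min(len(c), len(existing)):
                    existing |= c
                    break
            else:
                communities.append(set(c))
    return communities
```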

230 citations


Proceedings ArticleDOI
29 Oct 2012
TL;DR: A novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases is developed, which improves the quality of prior link-based models, and also eliminates the need for explicit interlinkage between entities.
Abstract: Measuring the semantic relatedness between two entities is the basis for numerous tasks in IR, NLP, and Web-based knowledge extraction. This paper focuses on disambiguating names in a Web or text document by jointly mapping all names onto semantically related entities registered in a knowledge base. To this end, we have developed a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models, and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities. Thus, our method is more versatile and can cope with long-tail and newly emerging entities that have few or no links associated with them. For efficiency, we have developed approximation techniques based on min-hash sketches and locality-sensitive hashing. Our experiments on semantic relatedness and on named entity disambiguation demonstrate the superiority of our method compared to state-of-the-art baselines.
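The min-hash approximation mentioned above can be illustrated in a few lines: hash each keyphrase under many hash functions, keep the minima, and use the fraction of agreeing minima between two signatures as an estimate of the Jaccard overlap of the keyphrase sets. This is a simplification of the paper's weighted, partial-phrase-overlap measure; the example data and helper names are made up.

```python
# Illustrative MinHash estimate of keyphrase-set overlap between two entities.
import hashlib

def minhash_signature(phrases, num_hashes=64):
    sig = []
    for i in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{i}:{p}".encode()).hexdigest(), 16)
            for p in phrases))
    return sig

def estimated_similarity(sig_a, sig_b):
    """Fraction of agreeing minima estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

e1 = {"machine learning", "neural network", "knowledge extraction"}
e2 = {"machine learning", "deep neural network", "knowledge extraction"}
print(estimated_similarity(minhash_signature(e1), minhash_signature(e2)))
```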

224 citations


Journal ArticleDOI
TL;DR: The results of this study indicate that the HeuristicsMiner algorithm is especially suited to real-life settings, and it is shown that, particularly for highly complex event logs, knowledge discovery from such data sets can become a major problem for traditional process discovery techniques.

216 citations


Journal ArticleDOI
TL;DR: A discussion of the design and implementation choices for each visual analysis technique is presented, followed by a discussion of three diverse use cases in which TopicNets enables fast discovery of information that is otherwise hard to find.
Abstract: We present TopicNets, a Web-based system for visual and interactive analysis of large sets of documents using statistical topic models. A range of visualization types and control mechanisms to support knowledge discovery are presented. These include corpus- and document-specific views, iterative topic modeling, search, and visual filtering. Drill-down functionality is provided to allow analysts to visualize individual document sections and their relations within the global topic space. Analysts can search across a dataset through a set of expansion techniques on selected document and topic nodes. Furthermore, analysts can select relevant subsets of documents and perform real-time topic modeling on these subsets to interactively visualize topics at various levels of granularity, allowing for a better understanding of the documents. A discussion of the design and implementation choices for each visual analysis technique is presented. This is followed by a discussion of three diverse use cases in which TopicNets enables fast discovery of information that is otherwise hard to find. These include a corpus of 50,000 successful NSF grant proposals, 10,000 publications from a large research center, and single documents including a grant proposal and a PhD thesis.
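For readers who want a concrete starting point, the kind of topic model that drives such views can be fit with a few lines of scikit-learn. The snippet below is a generic illustration on a toy corpus; it is not TopicNets or its API.

```python
# Generic LDA topic modelling sketch (toy corpus, illustrative only).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["grant proposal on neural networks and learning",
        "thesis chapter on protein structure prediction",
        "report on power grid maintenance and failures"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # document-topic weights for drill-down views

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:]]
    print(f"topic {k}:", top)
```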

163 citations


Book ChapterDOI
01 Jan 2012
TL;DR: This chapter provides a survey of the major work on named entity recognition and relation extraction in the past few decades, with a focus on work from the natural language processing community.
Abstract: Information extraction is the task of finding structured information from unstructured or semi-structured text. It is an important task in text mining and has been extensively studied in various research communities including natural language processing, information retrieval and Web mining. It has a wide range of applications in domains such as biomedical literature mining and business intelligence. Two fundamental tasks of information extraction are named entity recognition and relation extraction. The former refers to finding names of entities such as people, organizations and locations. The latter refers to finding the semantic relations such as FounderOf and HeadquarteredIn between entities. In this chapter we provide a survey of the major work on named entity recognition and relation extraction in the past few decades, with a focus on work from the natural language processing community.
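As a toy illustration of the relation extraction task surveyed here, the snippet below pulls FounderOf and HeadquarteredIn triples out of text with two hand-written patterns. Real systems replace these with statistical sequence labelers and relation classifiers; the sentences and patterns are invented for the example.

```python
# Toy pattern-based relation extraction (illustrative only).
import re

PATTERNS = [
    (re.compile(r"([A-Z][\w ]+?) was founded by ([A-Z][\w ]+)"), "FounderOf"),
    (re.compile(r"([A-Z][\w ]+?) is headquartered in ([A-Z][\w ]+)"), "HeadquarteredIn"),
]

def extract_relations(text):
    triples = []
    for pattern, relation in PATTERNS:
        for a, b in pattern.findall(text):
            if relation == "FounderOf":
                triples.append((b.strip(), relation, a.strip()))  # founder first
            else:
                triples.append((a.strip(), relation, b.strip()))
    return triples

print(extract_relations("Acme Corp was founded by Jane Doe. "
                        "Acme Corp is headquartered in Berlin."))
```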

158 citations


Proceedings ArticleDOI
20 Aug 2012
TL;DR: This paper presents three system design principles that can inform organizations on effective analytic and data collection processes, system organization, and data dissemination practices, and illustrates each principle with the authors' own experiences and recommendations.
Abstract: The big data phenomenon refers to the practice of collecting and processing very large data sets and to the associated systems and algorithms used to analyze these massive datasets. Architectures for big data usually span multiple machines and clusters, and they commonly consist of multiple special-purpose sub-systems. Coupled with the knowledge discovery process, the big data movement offers organizations many unique opportunities to benefit (with respect to new insights, business optimizations, etc.). However, due to the difficulty of analyzing such large datasets, big data presents unique systems engineering and architectural challenges. In this paper, we present three system design principles that can inform organizations on effective analytic and data collection processes, system organization, and data dissemination practices. The principles presented derive from our own research and development experiences with big data problems from various federal agencies, and we illustrate each principle with our own experiences and recommendations.

154 citations


Journal ArticleDOI
TL;DR: This paper reviews methods for functional dependency, conditional functional dependency, approximate functional dependency, and inclusion dependency discovery in relational databases, and a method for discovering XML functional dependencies.
Abstract: Functional and inclusion dependency discovery is important to knowledge discovery, database semantics analysis, database design, and data quality assessment. Motivated by the importance of dependency discovery, this paper reviews the methods for functional dependency, conditional functional dependency, approximate functional dependency, and inclusion dependency discovery in relational databases and a method for discovering XML functional dependencies.
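A functional dependency X -> A holds when any two rows agreeing on X also agree on A; the naive check below makes that definition concrete. It only tests individual candidate dependencies on a toy relation and does not implement the lattice pruning used by real discovery algorithms such as TANE.

```python
# Naive check of a candidate functional dependency over a small relation.
def holds(rows, lhs, rhs):
    """True if every pair of rows agreeing on lhs also agrees on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True

rows = [
    {"zip": "10001", "city": "New York", "street": "5th Ave"},
    {"zip": "10001", "city": "New York", "street": "Broadway"},
    {"zip": "94105", "city": "San Francisco", "street": "Market St"},
]
print(holds(rows, ["zip"], "city"))    # True: zip -> city
print(holds(rows, ["zip"], "street"))  # False: zip does not determine street
```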

136 citations


Journal ArticleDOI
TL;DR: This paper describes in detail what kind of shallow knowledge is extracted, how it is automatically done from a large corpus, and how additional semantics are inferred from aggregate statistics of the automatically extracted shallow knowledge.
Abstract: Access to a large amount of knowledge is critical for success at answering open-domain questions for DeepQA systems such as IBM Watson™. Formal representation of knowledge has the advantage of being easy to reason with, but acquisition of structured knowledge in open domains from unstructured data is often difficult and expensive. Our central hypothesis is that shallow syntactic knowledge and its implied semantics can be easily acquired and can be used in many areas of a question-answering system. We take a two-stage approach to extract the syntactic knowledge and implied semantics. First, shallow knowledge from large collections of documents is automatically extracted. Second, additional semantics are inferred from aggregate statistics of the automatically extracted shallow knowledge. In this paper, we describe in detail what kind of shallow knowledge is extracted, how it is automatically done from a large corpus, and how additional semantics are inferred from aggregate statistics. We also briefly discuss the various ways extracted knowledge is used throughout the IBM DeepQA system.
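A highly simplified sketch of the two-stage idea: shallow subject-verb-object tuples are assumed to have been extracted from a corpus, and aggregate counts over them yield implied semantics such as a verb's selectional preferences. The tuples and function names below are hypothetical, not IBM's actual components.

```python
# Stage 1: shallow SVO tuples assumed to come from a parser over a corpus.
# Stage 2: aggregate statistics over those tuples imply semantics.
from collections import Counter

svo_tuples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "treats", "fever"),
    ("ibuprofen", "treats", "headache"),
    ("aspirin", "causes", "nausea"),
]

verb_object = Counter((v, o) for _, v, o in svo_tuples)
verb_total = Counter(v for _, v, _ in svo_tuples)

def selectional_preference(verb, obj):
    """P(object | verb) estimated from the aggregated shallow tuples."""
    return verb_object[(verb, obj)] / verb_total[verb]

print(selectional_preference("treats", "headache"))  # 2/3
```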

Journal Article
TL;DR: This is an introductory paper on different techniques used for classification and feature selection.
Abstract: Data mining is a form of knowledge discovery essential for solving problems in a specific domain. Classification is a technique used for discovering classes of unknown data. Various methods for classification exist, such as Bayesian, decision trees, rule-based, and neural networks. Before applying any mining technique, irrelevant attributes need to be filtered out. Filtering is done using different feature selection techniques such as wrapper, filter, and embedded techniques. This is an introductory paper on different techniques used for classification and feature selection.

Journal Article
TL;DR: This study explores how the interaction of students with each other and with their instructors predicts their learning outcomes (as measured by their final grades) and aims to enrich the existing body of literature, while augmenting the understanding of effective learning strategies across a variety of new delivery modes.
Abstract: Introduction According to a recent survey conducted by Campus Computing (campuscomputing.net) and WCET (wcet.info), almost 88% of the surveyed institutions reported having used an LMS (Learning Management System) as a medium for course delivery for both on-campus and online offerings. In addition to various student information management systems (SISs), LMSs are providing the educational community with a goldmine of unexploited data about students' learning characteristics, behaviours, and patterns. The turning of such raw data into useful information and knowledge will enable institutes of higher education (HEIs) to rethink and improve students' learning experiences by using the data to streamline their teaching and learning processes, to extract and analyse students' learning and navigation patterns and behaviours, to analyse threaded discussion and interaction logs, and to provide feedback to students and to faculty about the unfolding of their students' learning experiences (Hung & Crooks, 2009; Garcia, Romero, Ventura, & de Castro, 2011). To this end, data mining has emerged as a powerful analytical and exploratory tool supported by faster multi-core 64-bit CPUs with larger memories, and by powerful database reporting tools. Originating in corporate business practices, data mining is multidisciplinary by nature and springs from several different disciplines including computer science, artificial intelligence, statistics, and biometrics. Using various approaches (such as classification, clustering, association rules, and visualization), data mining has been gaining momentum in higher education, which is now using a variety of applications, most notably in enrolment, learning patterns, personalization, and threaded discussion analysis. By discovering hidden relationships, patterns, and interdependencies, and by correlating raw/unstructured institutional data, data mining is beginning to facilitate the decision-making process in higher educational institutions. This interest in data mining is timely and critical, particularly as universities are diversifying their delivery modes to include more online and mobile learning environments. EDM (educational data mining) has the potential to help HEIs understand the dynamics and patterns of a variety of learning environments and to provide insightful data for rethinking and improving students' learning experiences. This paper is focused on understanding live video streaming (LVS) students' learning behaviours, their interactions, and their learning outcomes. More specifically, this study explores how the interaction of students with each other and with their instructors predicts their learning outcomes (as measured by their final grades). By investigating these interrelated dimensions, this study aims to enrich the existing body of literature, while augmenting the understanding of effective learning strategies across a variety of new delivery modes. This paper is divided into four sections. It begins by reviewing the literature dealing with the use of data mining in administrative and academic environments, followed by a short discussion of the way in which data mining is used to understand various dimensions of learning. The second section explains the purpose and the research questions explored in this paper. The third section describes the background of the study and details its methodological approach (sampling, data collection, and analysis).
The paper concludes by highlighting key findings, by discussing the study's limitations, and by proposing several recommendations for distance education administrators and practitioners. Data mining applications in administrative and academic environments At the intersection of several disciplines including computer science, statistics, and psychometrics (Garcia et al., 2011), data mining has thrived in business practices as a knowledge discovery tool intended to transform raw data into high-level knowledge for decision support (Hen & Lee, 2008). …

Book ChapterDOI
08 Oct 2012
TL;DR: This work defines a mapping between DRT and RDF/OWL for the production of quality linked data and ontologies, and presents FRED, an online tool for converting text into internally well-connected and linked-data-ready ontologies in web-service-acceptable time.
Abstract: We have implemented a novel approach for robust ontology design from natural language texts by combining Discourse Representation Theory (DRT), linguistic frame semantics, and ontology design patterns. We show that DRT-based frame detection is feasible by conducting a comparative evaluation of our approach and existing tools. Furthermore, we define a mapping between DRT and RDF/OWL for the production of quality linked data and ontologies, and present FRED, an online tool for converting text into internally well-connected and linked-data-ready ontologies in web-service-acceptable time.

Journal ArticleDOI
TL;DR: An extensive experimental evaluation shows that the proposed parallel method for computing rough set approximations based on the MapReduce technique is effective for data mining.

Journal ArticleDOI
TL;DR: In this paper, a new methodology is presented for examining associations and correlations in building operational data, thereby discovering useful knowledge about energy conservation; it is based on a basic data mining technique, association rule mining.

Journal ArticleDOI
TL;DR: DBpedia‐Live publishes the newly added/deleted triples in files, in order to enable synchronization between the DBpedia endpoint and other DBpedia mirrors.
Abstract: Purpose – DBpedia extracts structured information from Wikipedia, interlinks it with other knowledge bases and freely publishes the results on the web using Linked Data and SPARQL. However, the DBpedia release process is heavyweight and releases are sometimes based on several months old data. DBpedia‐Live solves this problem by providing a live synchronization method based on the update stream of Wikipedia. This paper seeks to address these issues.Design/methodology/approach – Wikipedia provides DBpedia with a continuous stream of updates, i.e. a stream of articles, which were recently updated. DBpedia‐Live processes that stream on the fly to obtain RDF data and stores the extracted data back to DBpedia. DBpedia‐Live publishes the newly added/deleted triples in files, in order to enable synchronization between the DBpedia endpoint and other DBpedia mirrors.Findings – During the realization of DBpedia‐Live the authors learned that it is crucial to process Wikipedia updates in a priority queue. Recently‐upd...

Book ChapterDOI
01 Jan 2012
TL;DR: How this work produces, on a largely automated and ongoing basis, nonredundant lists of atomic-resolution structures at different resolution thresholds for use in knowledge-driven RNA applications is addressed.
Abstract: The continual improvement of methods for RNA 3D structure modeling and prediction requires accurate and statistically meaningful data concerning RNA structure, both for extraction of knowledge and for benchmarking of structure predictions. The source of sufficiently accurate structural data for these purposes is atomic-resolution X-ray structures of RNA nucleotides, oligonucleotides, and biologically functional RNA molecules. All of our basic knowledge of bond lengths, angles, and stereochemistry in RNA nucleotides, as well as their interaction preferences, including all types of base-pairing, base-stacking, and base-backbone interactions, is ultimately extracted from X-ray structures. One key requirement for reference databases intended for knowledge extraction is the nonredundancy of the structures that are included in the analysis, to avoid bias in the deduced frequency parameters. Here, we address this issue and detail how we produce, on a largely automated and ongoing basis, nonredundant lists of atomic-resolution structures at different resolution thresholds for use in knowledge-driven RNA applications. The file collections are available for download at http://rna.bgsu.edu/nrlist. The primary lists that we provide only include X-ray structures, organized by resolution thresholds, but for completeness, we also provide separate lists that include structures solved by NMR or cryo-EM.

Journal ArticleDOI
TL;DR: In this article, the authors explore the applications of data mining techniques which have been developed to support the knowledge management process, and classify the journal articles indexed in the ScienceDirect database from 2007 to 2012.
Abstract: Data mining is one of the most important steps of the knowledge discovery in databases process and is considered a significant subfield of knowledge management. Research in data mining continues to grow in business and in learning organizations and is expected to do so over the coming decades. This review paper explores the applications of data mining techniques which have been developed to support the knowledge management process. The journal articles indexed in the ScienceDirect database from 2007 to 2012 are analyzed and classified. The discussion of the findings is divided into 4 topics: (i) knowledge resource; (ii) knowledge types and/or knowledge datasets; (iii) data mining tasks; and (iv) data mining techniques and applications used in knowledge management. The article first briefly describes the definition of data mining and data mining functionality. Then the knowledge management rationale and major knowledge management tools integrated in the knowledge management cycle are described. Finally, the applications of data mining techniques in the process of knowledge management are summarized and discussed.

Journal ArticleDOI
30 Mar 2012-PLOS ONE
TL;DR: By building simple models constrained by predefined visual boundaries, one achieves not only good comprehensibility but also classification performance that does not differ significantly from that of the usually more complex models built using the default settings of the classical decision tree algorithm.
Abstract: Purpose: Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of the reasoning behind the classification model are possible.
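The trade-off discussed above can be reproduced in miniature with scikit-learn: a shallow, readable tree versus a tree grown with default settings. This generic comparison stands in for, but is not, the authors' visually constrained method.

```python
# Comprehensible (depth-limited) vs default decision tree, illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

simple = DecisionTreeClassifier(max_depth=3, random_state=0)   # readable model
default = DecisionTreeClassifier(random_state=0)               # default settings

print("depth-limited:", cross_val_score(simple, X, y, cv=5).mean())
print("default tree: ", cross_val_score(default, X, y, cv=5).mean())
```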

Proceedings Article
12 Jul 2012
TL;DR: A distributed, Web-scale implementation of a path-constrained random walk model that learns syntactic-semantic inference rules for binary relations from a graph representation of the parsed text and the knowledge base is described.
Abstract: We study how to extend a large knowledge base (Freebase) by reading relational information from a large Web text corpus. Previous studies on extracting relational knowledge from text show the potential of syntactic patterns for extraction, but they do not exploit background knowledge of other relations in the knowledge base. We describe a distributed, Web-scale implementation of a path-constrained random walk model that learns syntactic-semantic inference rules for binary relations from a graph representation of the parsed text and the knowledge base. Experiments show significant accuracy improvements in binary relation prediction over methods that consider only text, or only the existing knowledge base.
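The core feature of a path-constrained random walk can be shown on a toy knowledge graph: the probability of reaching a candidate target from a source by following a fixed sequence of relations. The graph, relations, and facts below are invented for illustration; the paper's system computes many such path features at Web scale over parsed text plus Freebase.

```python
# Tiny path-constrained random walk feature on a made-up knowledge graph.
graph = {
    ("CharlotteBronte", "wrote"): ["JaneEyre"],
    ("JaneEyre", "language"): ["English"],
    ("VictorHugo", "wrote"): ["LesMiserables"],
    ("LesMiserables", "language"): ["French"],
}

def path_probability(source, relation_path, target):
    """Uniform random walk along the given relation path."""
    frontier = {source: 1.0}
    for rel in relation_path:
        nxt = {}
        for node, p in frontier.items():
            succs = graph.get((node, rel), [])
            for s in succs:
                nxt[s] = nxt.get(s, 0.0) + p / len(succs)
        frontier = nxt
    return frontier.get(target, 0.0)

# Feature for the candidate fact speaksLanguage(CharlotteBronte, English):
print(path_probability("CharlotteBronte", ["wrote", "language"], "English"))  # 1.0
```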

Journal ArticleDOI
TL;DR: The goal of this paper is to use a set of MCDM methods to rank classification algorithms, with empirical results based on software defect detection datasets; the decision maker (DM) is involved in the ranking procedure by assigning user weights to the performance measures.

Journal ArticleDOI
TL;DR: The results of knowledge extraction from data mining are illustrated as knowledge patterns, rules, and knowledge maps in order to propose suggestions and solutions to online group buying firms for future development.
Abstract: Highlights: Online group buying is an effective marketing method. Group buying has become extremely popular. This study proposes a data mining approach for exploring online group buying behavior in Taiwan. This study uses the Apriori algorithm and clustering analysis for data mining. Knowledge extraction is used to propose suggestions to online group buying firms for future development. Online group buying is an effective marketing method. By using online group buying, customers get unbelievable discounts on premium products and services. This not only meets customer demand, but also helps sellers to find new ways to sell products and open up new business models; all parties benefit from these transactions. During these bleak economic times, group buying has become extremely popular. Therefore, this study proposes a data mining approach for exploring online group buying behavior in Taiwan. This study uses the Apriori algorithm as an association rules approach, together with clustering analysis, for mining customer knowledge among online group buying customers in Taiwan. The results of knowledge extraction from data mining are illustrated as knowledge patterns, rules, and knowledge maps in order to propose suggestions and solutions to online group buying firms for future development.
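A miniature version of the association-rule step: brute-force supports over toy group-buying transactions and the confidence of one rule. A full Apriori implementation adds candidate pruning level by level; the items and thresholds here are made up, not the study's data.

```python
# Toy frequent-itemset and rule-confidence computation (illustrative only).
from itertools import combinations

transactions = [
    {"restaurant_deal", "movie_ticket"},
    {"restaurant_deal", "spa_voucher"},
    {"restaurant_deal", "movie_ticket", "spa_voucher"},
    {"movie_ticket"},
]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = [frozenset(c) for size in (1, 2)
            for c in combinations(items, size)
            if support(set(c)) >= min_support]

# Confidence of the rule {restaurant_deal} -> {movie_ticket}
conf = support({"restaurant_deal", "movie_ticket"}) / support({"restaurant_deal"})
print(frequent, conf)
```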

Journal ArticleDOI
TL;DR: An easy to use and freely accessible tool, GeneWizard, that exploits text mining and microarray data fusion for supporting researchers in discovering gene-disease relationships is described.
Abstract: A huge amount of important biomedical information is hidden in the bulk of research articles in biomedical fields. At the same time, the publication of databases of biological information and of experimental datasets generated by high-throughput methods is in great expansion, and a wealth of annotated gene databases, chemical, genomic (including microarray datasets), clinical and other types of data repositories are now available on the Web. Thus a current challenge of bioinformatics is to develop targeted methods and tools that integrate scientific literature, biological databases and experimental data for reducing the time of database curation and for accessing evidence, either in the literature or in the datasets, useful for the analysis at hand. Under this scenario, this article reviews the knowledge discovery systems that fuse information from the literature, gathered by text mining, with microarray data for enriching the lists of down- and upregulated genes with elements for biological understanding and for generating and validating new biological hypotheses. Finally, an easy to use and freely accessible tool, GeneWizard, that exploits text mining and microarray data fusion for supporting researchers in discovering gene-disease relationships is described.

Book ChapterDOI
Junjie Wu1
01 Jan 2012
TL;DR: The phrase “data mining” was coined in the late 1980s to describe the activity of extracting interesting patterns from data.
Abstract: The phrase “data mining” was coined in the late 1980s to describe the activity of extracting interesting patterns from data. Since then, data mining and knowledge discovery have become one of the hottest topics in both academia and industry. They provide valuable business and scientific intelligence hidden in large amounts of historical data.

Journal ArticleDOI
TL;DR: The aim of this article is to highlight the interplay between operations research and data mining, and to place a particular emphasis on the emerging theme of applying multi-objective approaches in this context.

Journal ArticleDOI
TL;DR: This is the first research work to perform single-pass incremental and interactive mining for weighted frequent patterns using a single database scan, and the proposed tree structures and algorithms are very efficient and scalable.
Abstract: Weighted frequent pattern (WFP) mining is more practical than frequent pattern mining because it can consider the different semantic significance (weight) of the items. For this reason, WFP mining has become an important research issue in data mining and knowledge discovery. However, existing algorithms cannot be applied to incremental and interactive WFP mining, or to stream data mining, because they are based on a static database and require multiple database scans. In this paper, we present two novel tree structures, IWFPT_WA (Incremental WFP tree based on weight ascending order) and IWFPT_FD (Incremental WFP tree based on frequency descending order), and two new algorithms, IWFP_WA and IWFP_FD, for incremental and interactive WFP mining using a single database scan. They are effective for incremental and interactive mining because they utilize the current tree structure and the previous mining results when a database is updated or a minimum support threshold is changed. IWFP_WA gains an advantage in candidate pattern generation by keeping the highest-weighted item at the bottom of IWFPT_WA. IWFP_FD ensures that no non-candidate item can appear before candidate items in any branch of IWFPT_FD, and thus speeds up prefix-tree and conditional-tree creation during the mining operation. IWFPT_FD also achieves a highly compact incremental tree that saves memory space. To our knowledge, this is the first research work to perform single-pass incremental and interactive mining for weighted frequent patterns. Extensive performance analyses show that our tree structures and algorithms are very efficient and scalable for single-pass incremental and interactive WFP mining.
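The weighted-support notion that distinguishes WFP mining from plain frequent pattern mining can be written down directly: the support of a pattern times the average weight of its items. The brute-force loop below only makes that measure concrete on invented data; it is not the single-pass IWFP tree algorithm.

```python
# Weighted support = support(pattern) x average item weight (illustrative).
from itertools import combinations

weights = {"diamond": 0.9, "gold": 0.7, "bread": 0.1}
transactions = [
    {"diamond", "gold"},
    {"gold", "bread"},
    {"diamond", "gold", "bread"},
    {"bread"},
]

def weighted_support(pattern):
    sup = sum(pattern <= t for t in transactions) / len(transactions)
    avg_w = sum(weights[i] for i in pattern) / len(pattern)
    return sup * avg_w

items = sorted(weights)
for size in (1, 2):
    for combo in combinations(items, size):
        print(combo, round(weighted_support(set(combo)), 3))
```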


Journal ArticleDOI
TL;DR: In this paper, the authors describe methods for analyzing a BIM to query for spatial information that is relevant to construction practitioners and that is typically represented only implicitly in the BIM; they integrate ifcXML data and other spatial data to develop a richer model for construction users.

Journal ArticleDOI
TL;DR: The CoGAR framework is presented to efficiently support constrained generalized association rule mining; the opportunistic confidence constraint, a new constraint proposed in this paper, makes it possible to discriminate between significant and redundant rules by analyzing similar rules belonging to different abstraction levels.