
Showing papers on "Knowledge extraction published in 2010"


Journal ArticleDOI
TL;DR: The cTAKES annotations are the foundation for methods and modules for higher-level semantic processing of clinical free-text, and its components, specifically trained for the clinical domain, create rich linguistic and semantic annotations.

1,748 citations


Journal ArticleDOI
01 Nov 2010
TL;DR: The most relevant studies carried out in educational data mining to date are surveyed, and the different groups of users, the types of educational environments, and the data they provide are described.
Abstract: Educational data mining (EDM) is an emerging interdisciplinary research area that deals with the development of methods to explore data originating in an educational context. EDM uses computational approaches to analyze educational data in order to study educational questions. This paper surveys the most relevant studies carried out in this field to date. First, it introduces EDM and describes the different groups of users, the types of educational environments, and the data they provide. It then goes on to list the most typical/common tasks in the educational environment that have been resolved through data-mining techniques, and finally, some of the most promising future lines of research are discussed.

1,723 citations


01 Jan 2010
TL;DR: Five papers are selected from the accepted papers of the Fourth International Workshop on Knowledge Discovery from Data Streams, spanning recommendation algorithms, clustering, drifting concepts, and frequent pattern mining; the common thread in all the papers is that learning occurs while data continuously flows.
Abstract: Wide-area sensor infrastructures, remote sensors, and wireless sensor networks yield massive volumes of disparate, dynamic, and geographically distributed data. As sensors become ubiquitous, a set of broad requirements is beginning to emerge across high-priority applications including disaster preparedness and management, adaptability to climate change, national or homeland security, and the management of critical infrastructures. The raw data from sensors need to be efficiently managed and transformed into usable information through data fusion, which in turn must be converted to predictive insights via knowledge discovery, ultimately facilitating automated or human-induced tactical decisions or strategic policy. The challenges for the knowledge discovery community are immense. Sensors produce dynamic data streams or events requiring real-time analysis methodologies and systems. Moreover, in most cases these streams are distributed in space, requiring spatio-temporal knowledge discovery solutions. All these aspects are of increasing importance to the research community, as new algorithms are needed to process this continuous flow of data in reasonable time. Learning from data streams requires algorithms that process examples in constant time and memory, usually scanning the data once. Moreover, if the process is not strictly stationary (as in most real-world applications), the target concept may gradually change over time. This is an incremental task that requires incremental learning algorithms that take drift into account. For this special issue of Intelligent Data Analysis we selected five papers from the accepted papers of the Fourth International Workshop on Knowledge Discovery from Data Streams, an associated workshop of the 18th European Conference on Machine Learning (ECML) and the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), co-located in Warsaw, Poland, in 2007, and of the First ACM SIGKDD Workshop on Knowledge Discovery from Sensor Data (SensorKDD'07), co-located with the 2007 Knowledge Discovery and Data Mining (KDD) conference organized by the Association for Computing Machinery (ACM). The selected papers cover a large spectrum of research on knowledge discovery from data streams, ranging from recommendation algorithms and clustering to drifting concepts and frequent pattern mining. The common thread in all the papers is that learning occurs while data continuously flows. The first paper, Novelty Detection with Application to Data Streams by Spinosa, Carvalho and Gama, presents and evaluates a new approach to novelty detection from data streams. The ability to detect novel concepts is an important aspect of a machine learning system, essential when dealing with nonstationary distributions. The approach presented here intends to take novelty detection beyond one-class classification.

789 citations
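
As a concrete illustration of the constant-time, one-pass, drift-aware learning the abstract calls for, here is a minimal Python sketch. The DDM-style warning rule, the warmup length, and the simulated stream are illustrative assumptions, not the methods of the selected papers.

```python
# Sketch: one-pass monitoring of a stream with simple drift detection.
# Thresholds follow the DDM heuristic; all constants here are assumptions.
import math
import random

class DriftDetector:
    """Tracks the online error rate; signals drift when it degrades
    well past its best observed level."""
    def __init__(self, warmup=30):
        self.n = 0
        self.p = 0.0                      # running error rate
        self.p_min = float("inf")
        self.s_min = float("inf")
        self.warmup = warmup

    def update(self, error):              # error: 1 if misclassified, else 0
        self.n += 1
        self.p += (error - self.p) / self.n   # constant time and memory
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n >= self.warmup and self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        # DDM-style rule: drift when error exceeds p_min + 3 * s_min
        return self.n >= self.warmup and self.p + s > self.p_min + 3 * self.s_min

random.seed(1)
detector = DriftDetector()
for i in range(2000):
    err = random.random() < (0.1 if i < 1000 else 0.6)  # concept changes at 1000
    if detector.update(int(err)):
        print("drift suspected at example", i)
        break
```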


Book ChapterDOI
06 Oct 2010
TL;DR: To deal with streaming unbalanced classes, a sliding window Kappa statistic is proposed for evaluation in time-changing data streams, and a study on Twitter data is performed using learning algorithms for data streams.
Abstract: Micro-blogs are a challenging new source of information for data mining techniques. Twitter is a micro-blogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, generated constantly, and well suited to knowledge discovery using data stream mining. We briefly discuss the challenges that Twitter data streams pose, focusing on classification problems, and then consider these streams for opinion mining and sentiment analysis. To deal with streaming unbalanced classes, we propose a sliding window Kappa statistic for evaluation in time-changing data streams. Using this statistic, we perform a study on Twitter data using learning algorithms for data streams.

612 citations
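
A minimal sketch of the sliding-window Kappa idea described above: Cohen's kappa computed over a bounded window of (prediction, truth) pairs, so the measure tracks time-changing, unbalanced streams. The window size and the toy stream are assumptions; the paper's exact formulation may differ.

```python
# Sketch: a sliding-window Kappa statistic for evaluating a stream
# classifier under class imbalance. Window size is an assumption.
from collections import Counter, deque

def window_kappa(window):
    """Cohen's kappa over (predicted, actual) pairs in the window."""
    n = len(window)
    p0 = sum(p == a for p, a in window) / n               # observed agreement
    pred = Counter(p for p, _ in window)
    actual = Counter(a for _, a in window)
    pe = sum(pred[c] * actual[c] for c in pred) / n ** 2  # chance agreement
    return (p0 - pe) / (1 - pe) if pe < 1 else 0.0

window = deque(maxlen=1000)  # forget old examples as the stream moves on
for pred, actual in [("+", "+"), ("-", "+"), ("-", "-"), ("+", "+")]:
    window.append((pred, actual))
print(window_kappa(window))
```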


Journal ArticleDOI
TL;DR: This paper provides an introduction to ontology-based information extraction and reviews the details of the different OBIE systems developed so far, identifying a common architecture among these systems and classifying them based on different factors, which leads to a better understanding of their operation.
Abstract: Information extraction (IE) aims to retrieve certain types of information from natural language text by processing it automatically. For example, an IE system might retrieve information about geopolitical indicators of countries from a set of web pages while ignoring other types of information. Ontology-based information extraction (OBIE) has recently emerged as a subfield of information extraction. Here, ontologies - which provide formal and explicit specifications of conceptualizations - play a crucial role in the IE process. Because of the use of ontologies, this field is related to knowledge representation and has the potential to assist the development of the Semantic Web. In this paper, we provide an introduction to ontology-based information extraction and review the details of different OBIE systems developed so far. We attempt to identify a common architecture among these systems and classify them based on different factors, which leads to a better understanding of their operation. We also discuss the implementation details of these systems, including the tools they use and the metrics used to measure their performance. In addition, we attempt to identify possible future directions for this field.

409 citations
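
To make the OBIE idea concrete, here is a toy sketch in which an ontology's classes are matched against text through simple lexical patterns. The miniature ontology and patterns are invented for illustration; the systems reviewed in the paper use much richer linguistic processing.

```python
# Sketch: the core loop shared by many OBIE systems -- match an ontology's
# classes (here via simple lexical patterns) against text and emit instances.
import re

ontology = {  # class -> lexical pattern (a real system would use richer NLP)
    "Country": r"\b(France|Brazil|Japan)\b",
    "Indicator": r"\b(GDP|inflation rate|life expectancy)\b",
}

def extract(text):
    instances = []
    for cls, pattern in ontology.items():
        for match in re.finditer(pattern, text):
            instances.append((cls, match.group()))
    return instances

print(extract("The GDP of Brazil grew faster than the inflation rate."))
```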


Proceedings ArticleDOI
25 Jul 2010
TL;DR: Experimental results show that the proposed system can provide effective mobile sequential recommendations, and that the knowledge extracted from location traces can be used to coach drivers toward more efficient use of energy.
Abstract: The increasing availability of large-scale location traces creates unprecedented opportunities to change the paradigm for knowledge discovery in transportation systems. A particularly promising area is extracting energy-efficient transportation patterns (green knowledge), which can serve as guidance for reducing inefficiencies in the energy consumption of transportation sectors. However, extracting green knowledge from location traces is not a trivial task: conventional data analysis tools are usually not customized for handling the massive, complex, dynamic, and distributed nature of location traces. To that end, in this paper we provide a focused study of extracting energy-efficient transportation patterns from location traces. Specifically, we focus initially on a sequence of mobile recommendations. As a case study, we develop a mobile recommender system capable of recommending a sequence of pick-up points for taxi drivers, or a sequence of potential parking positions. The goal of this mobile recommendation system is to maximize the probability of business success. Along this line, we provide a Potential Travel Distance (PTD) function for evaluating each candidate sequence. This PTD function possesses a monotone property which can be used to effectively prune the search space. Based on this PTD function, we develop two algorithms, LCP and SkyRoute, for finding the recommended routes. Finally, experimental results show that the proposed system can provide effective mobile sequential recommendations, and that the knowledge extracted from location traces can be used to coach drivers toward more efficient use of energy.

360 citations
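
A hedged sketch of how a monotone evaluation function lets one prune candidate route sequences, in the spirit of the PTD property: once a partial sequence's cost reaches the best complete cost, no extension can recover. The toy distance table, the start point, and the cost-minimization framing are assumptions; the paper's PTD function and its LCP/SkyRoute algorithms are more involved.

```python
# Sketch: branch-and-prune over pick-up sequences using a monotone bound.
# Distances and the start point are invented for illustration.
from itertools import permutations

dist = {("A", "B"): 2.0, ("B", "A"): 2.0, ("A", "C"): 5.0, ("C", "A"): 5.0,
        ("B", "C"): 1.5, ("C", "B"): 1.5}

best_seq, best_cost = None, float("inf")
for seq in permutations(["B", "C"]):       # candidate pick-up sequences
    cost, here, pruned = 0.0, "A", False
    for p in seq:
        cost += dist[(here, p)]
        here = p
        if cost >= best_cost:              # monotone: extending never lowers cost
            pruned = True
            break
    if not pruned and cost < best_cost:
        best_seq, best_cost = seq, cost

print(best_seq, best_cost)
```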


Proceedings Article
23 Aug 2010
TL;DR: This work presents a state-of-the-art system for entity disambiguation that not only addresses these challenges but also scales to knowledge bases with several million entries using very few resources.
Abstract: The integration of facts derived from information extraction systems into existing knowledge bases requires a system to disambiguate entity mentions in the text. This is challenging due to issues such as non-uniform variations in entity names, mention ambiguity, and entities absent from a knowledge base. We present a state-of-the-art system for entity disambiguation that not only addresses these challenges but also scales to knowledge bases with several million entries using very few resources. Further, our approach achieves performance of up to 95% on entities mentioned in newswire and 80% on a public test set that was designed to include challenging queries.

356 citations
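
A minimal sketch of the disambiguation setting described above: rank knowledge-base candidates for a mention by name match and context overlap, with a NIL threshold for entities absent from the KB. The toy knowledge base, feature weights, and threshold are assumptions; the actual system's features and scale differ.

```python
# Sketch: scoring KB candidates for a mention; weights are assumptions.
kb = {
    "Paris (city)":  {"aliases": {"paris"},
                      "context": {"france", "seine", "capital"}},
    "Paris Hilton":  {"aliases": {"paris", "paris hilton"},
                      "context": {"heiress", "hotel", "celebrity"}},
}

def disambiguate(mention, context_words, nil_threshold=0.1):
    best, best_score = "NIL", nil_threshold   # entities may be absent from the KB
    for entity, info in kb.items():
        name_score = 1.0 if mention.lower() in info["aliases"] else 0.0
        ctx_score = len(context_words & info["context"]) / max(len(info["context"]), 1)
        score = 0.5 * name_score + 0.5 * ctx_score
        if score > best_score:
            best, best_score = entity, score
    return best

print(disambiguate("Paris", {"capital", "france", "tour"}))
```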


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This work addresses the question of how to automatically decide which information to transfer between classes without any human intervention, tapping into linguistic knowledge bases to provide the semantic link between sources (what) and targets (where) of knowledge transfer.
Abstract: Remarkable performance has been reported for recognizing single object classes. Scalability to large numbers of classes, however, remains an important challenge for today's recognition methods. Several authors have promoted knowledge transfer between classes as a key ingredient for addressing this challenge. However, in previous work the decision of which knowledge to transfer has required either manual supervision or at least a few training examples, limiting the scalability of these approaches. In this work we explicitly address the question of how to automatically decide which information to transfer between classes without the need for any human intervention. For this, we tap into linguistic knowledge bases to provide the semantic link between sources (what) and targets (where) of knowledge transfer. We provide a rigorous experimental evaluation of different knowledge bases and state-of-the-art techniques from Natural Language Processing, which goes far beyond the limited use of language in related work. We also give insights into the applicability (why) of different knowledge sources and similarity measures for knowledge transfer.

332 citations
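
A small sketch of using a linguistic knowledge base to choose transfer sources automatically, here WordNet path similarity between class names via NLTK (assumes the WordNet corpus is downloaded, e.g. nltk.download("wordnet")). The class names and the choice of path similarity are illustrative; the paper evaluates several knowledge bases and similarity measures.

```python
# Sketch: rank known classes as transfer sources for a novel class by
# WordNet similarity of their names. Requires NLTK + the WordNet corpus.
from nltk.corpus import wordnet as wn

def class_similarity(a, b):
    """Max path similarity between any noun senses of the two class names."""
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(a, pos=wn.NOUN)
              for s2 in wn.synsets(b, pos=wn.NOUN)]
    return max((s for s in scores if s is not None), default=0.0)

known_classes = ["horse", "car", "dog"]
novel = "zebra"
sources = sorted(known_classes, key=lambda c: class_similarity(novel, c),
                 reverse=True)
print(sources)  # classes most similar to "zebra" come first, e.g. "horse"
```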



01 Jan 2010
TL;DR: The potential use of classification-based data mining techniques such as rule-based classifiers, decision trees, naive Bayes, and artificial neural networks on massive volumes of healthcare data is examined.
Abstract: The healthcare environment is generally perceived as being 'information rich' yet 'knowledge poor'. There is a wealth of data available within healthcare systems, but a lack of effective analysis tools to discover the hidden relationships and trends in these data. Knowledge discovery and data mining have found numerous applications in business and scientific domains, and valuable knowledge can likewise be discovered by applying data mining techniques to healthcare data, which the industry collects in huge amounts but, unfortunately, rarely "mines" to uncover hidden information. In this study, we briefly examine the potential use of classification-based data mining techniques such as rule-based classifiers, decision trees, naive Bayes, and artificial neural networks on massive volumes of healthcare data. For data preprocessing and effective decision making, the One Dependency Augmented Naive Bayes classifier (ODANB) and the naive credal classifier 2 (NCC2) are used; the latter is an extension of naive Bayes to imprecise probabilities that aims at delivering robust classifications even when dealing with small or incomplete data sets. Using medical profiles such as age, sex, blood pressure and blood sugar, the approach can predict the likelihood of patients developing heart disease, enabling significant knowledge, e.g. patterns and relationships between medical factors related to heart disease, to be established.

279 citations
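
For illustration, a plain Gaussian naive Bayes over the kind of medical profile mentioned in the abstract (age, sex, blood pressure, blood sugar). Scikit-learn's classifier and the tiny made-up records stand in for the ODANB/NCC2 classifiers the study actually uses.

```python
# Sketch: naive Bayes on made-up medical profiles; not the study's ODANB/NCC2.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# columns: age, sex (0/1), systolic BP, fasting blood sugar
X = np.array([[63, 1, 145, 150], [41, 0, 120, 90],
              [67, 1, 160, 170], [35, 0, 118, 85]])
y = np.array([1, 0, 1, 0])  # 1 = heart disease

model = GaussianNB().fit(X, y)
profile = np.array([[58, 1, 150, 130]])
print(model.predict_proba(profile))  # likelihood of disease for a new patient
```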


Journal ArticleDOI
TL;DR: A global comparison of all the presented data mining approaches is provided, focusing on the different steps and tasks into which each approach divides the whole KDD process, and a new data mining and knowledge discovery process, named the refined data mining process, is proposed.
Abstract: Up to now, many data mining and knowledge discovery methodologies and process models have been developed, with varying degrees of success. In this paper, we describe the data mining and knowledge discovery methodologies and process models most used in industrial and academic projects and most cited in the scientific literature, providing an overview of their evolution along data mining and knowledge discovery history and setting down the state of the art in this topic. For every approach, we provide a brief description of the proposed knowledge discovery in databases (KDD) process, discussing its special features and its outstanding advantages and disadvantages. Apart from that, a global comparison of all the presented data mining approaches is provided, focusing on the different steps and tasks into which each approach divides the whole KDD process. As a result of the comparison, we propose a new data mining and knowledge discovery process, named the refined data mining process, for developing any kind of data mining and knowledge discovery project. The refined data mining process is built on specific steps taken from the analyzed approaches.

Book ChapterDOI
20 Sep 2010
TL;DR: This paper considers the transductive classification problem on heterogeneous networked data which share a common topic and proposes a novel graph-based regularization framework, GNetMine, to model the link structure in information networks with arbitrary network schema and arbitrary number of object/link types.
Abstract: A heterogeneous information network is a network composed of multiple types of objects and links. Recently, it has been recognized that strongly-typed heterogeneous information networks are prevalent in the real world. Sometimes, label information is available for some objects. Learning from such labeled and unlabeled data via transductive classification can lead to good knowledge extraction of the hidden network structure. However, although classification on homogeneous networks has been studied for decades, classification on heterogeneous networks has not been explored until recently. In this paper, we consider the transductive classification problem on heterogeneous networked data which share a common topic. Only some objects in the given network are labeled, and we aim to predict labels for all types of the remaining objects. A novel graph-based regularization framework, GNetMine, is proposed to model the link structure in information networks with an arbitrary network schema and an arbitrary number of object/link types. Specifically, we explicitly respect the type differences by preserving consistency over each relation graph corresponding to each type of links separately. Efficient computational schemes are then introduced to solve the corresponding optimization problem. Experiments on the DBLP data set show that our algorithm significantly improves the classification accuracy over existing state-of-the-art methods.
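
A minimal sketch of label propagation run over one relation graph per link type, in the spirit of the graph-based regularization described above. The adjacency matrix, alpha, and iteration count are invented; GNetMine's actual objective and optimization are richer.

```python
# Sketch: per-relation-graph label propagation; constants are assumptions.
import numpy as np

def propagate(relation_graphs, Y, alpha=0.8, iters=50):
    """relation_graphs: list of (n x n) adjacency matrices, one per link type.
    Y: (n x k) one-hot label matrix, zero rows for unlabeled objects."""
    F = Y.copy()
    norm = [A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
            for A in relation_graphs]
    for _ in range(iters):
        smooth = sum(S @ F for S in norm) / len(norm)  # consistency per link type
        F = alpha * smooth + (1 - alpha) * Y           # stay close to known labels
    return F.argmax(axis=1)

# toy network: four objects linked in a chain, two of them labeled
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.zeros((4, 2))
Y[0, 0] = 1   # object 0 labeled class 0
Y[3, 1] = 1   # object 3 labeled class 1
print(propagate([A], Y))  # expected: [0 0 1 1]
```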

Proceedings ArticleDOI
25 Oct 2010
TL;DR: In this paper, appropriate and efficient networks for breast cancer knowledge discovery from clinically collected data sets are investigated, and principal component techniques are used to reduce the dimension of the data and find appropriate networks.
Abstract: In this paper, appropriate and efficient networks for breast cancer knowledge discovery from clinically collected data sets are investigated. Invoking various data mining techniques, the aim is to determine the percentage likelihood of disease development using the developed network; the results help in choosing a reasonable treatment for the patient. Several neural network structures are evaluated for this investigation. The performance of the statistical neural network structures self-organizing map (SOM), radial basis function network (RBF), general regression neural network (GRNN), and probabilistic neural network (PNN) is tested both on the Wisconsin breast cancer data (WBCD) and on the Shiraz Namazi Hospital breast cancer data (NHBCD). To overcome the high dimensionality of the data set, and in view of the correlated nature of the data, principal component techniques are used to reduce the dimension of the data and find appropriate networks. The results are quite satisfactory and include a comparison of the effectiveness of each proposed network for such problems.
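
A brief sketch of the dimension-reduction step on the Wisconsin data using principal components. Since scikit-learn ships no GRNN/PNN, a logistic regression stands in for the evaluated networks, and the component count is an assumption.

```python
# Sketch: PCA then a stand-in classifier on the Wisconsin breast cancer data.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(),
                      PCA(n_components=5),          # handle correlated features
                      LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # accuracy on held-out patients
```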

01 Jan 2010
TL;DR: This work focuses particularly on the area of time series motif discovery (Lin and Keogh 2002), also known as the extraction of recurrent patterns, which are relevant because they summarise the time series of a domain and help the domain expert understand the database at hand.
Abstract: Data Mining or Knowledge Discovery in Databases (KDD) is an important area of computer science. The relevance of this area is due to the enormous quantity of information daily produced by different sources, for instance the web, biological processes, finance, the aeronautic industry, retail, and telecommunications data. A considerable amount of this information represents temporal events which are typically stored in the form of time series. There are several phenomena expected to be identified among databases of this type, namely through motif (pattern) discovery, classification, clustering, query by content, abnormality detection, and forecasting of property values. We focus particularly on the area of time series motif discovery (Lin and Keogh 2002), also known as the extraction of recurrent patterns. These patterns are relevant because they summarise the time series of a domain and help the domain expert understand the database at hand (Ferreira et al. 2006). Figure 1 shows an example of this type of pattern in the context of electroencephalogram (EEG) time series; this specific motif is detected in three different time series in the database.
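
A naive brute-force sketch of motif discovery: compare all z-normalized subsequences of a fixed length and return the closest non-overlapping pair. The window length and the synthetic series are assumptions; practical algorithms such as Lin and Keogh's use random projection and other speedups.

```python
# Sketch: brute-force best motif pair in one time series.
import numpy as np

def znorm(w):
    return (w - w.mean()) / (w.std() + 1e-8)

def best_motif(ts, m):
    """Return start indices of the closest pair of m-length subsequences."""
    windows = [znorm(ts[i:i + m]) for i in range(len(ts) - m + 1)]
    best, pair = np.inf, None
    for i in range(len(windows)):
        for j in range(i + m, len(windows)):   # skip trivial overlapping matches
            d = np.linalg.norm(windows[i] - windows[j])
            if d < best:
                best, pair = d, (i, j)
    return pair

ts = np.sin(np.linspace(0, 8 * np.pi, 400)) + 0.1 * np.random.randn(400)
print(best_motif(ts, m=50))  # the recurring sine cycle should surface
```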

Journal ArticleDOI
TL;DR: The CDW platform would be a promising infrastructure to make full use of the TCM clinical data for scientific hypothesis generation, and promote the development of TCM from individualized empirical knowledge to large-scale evidence-based medicine.

Journal ArticleDOI
TL;DR: In this paper, a review of rule extraction from SVM classifiers is presented, and a comparison of the algorithms' salient features and relative performance as measured by a number of metrics is made.

Journal ArticleDOI
TL;DR: This paper clarifies the geospatial knowledge discovery process and its relation to scientific knowledge construction, and identifies challenges to its playing a greater role in regional science.
Abstract: We have access to an unprecedented amount of fine-grained data on cities, transportation, economies, and societies, much of these data referenced in geo-space and time. There is a tremendous opportunity to discover new knowledge about spatial economies that can inform theory and modeling in regional science. However, there is little evidence of computational methods for discovering knowledge from databases in the regional science literature. This paper addresses this gap by clarifying the geospatial knowledge discovery process, its relation to scientific knowledge construction, and identifying challenges to a greater role in regional science.

Journal ArticleDOI
TL;DR: A broad overview is provided of the methodologies developed to handle and process MS metabolomic data, compare the samples, and highlight the relevant metabolites, from the raw data through to biomarker discovery.
Abstract: While metabolomics attempts to comprehensively analyse the small molecules characterising a biological system, MS has been promoted as the gold standard for studying the wide chemical diversity and range of concentrations of the metabolome. On the other hand, extracting the relevant information from the overwhelming amount of data generated by modern analytical platforms has become an important issue for knowledge discovery in this research field. The appropriate treatment of such data is therefore of crucial importance in order for the data to provide valuable information. The aim of this review is to provide a broad overview of the methodologies developed to handle and process MS metabolomic data, compare the samples and highlight the relevant metabolites, starting from the raw data and going through to biomarker discovery. As data handling can be further separated into data processing, data pre-treatment and data analysis, recent advances in each of these steps are detailed separately.
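
As a small illustration of the pre-treatment step discussed in the review, a log transformation followed by autoscaling of a (samples x features) peak-intensity matrix; the matrix values are invented.

```python
# Sketch: two common metabolomics pre-treatment steps on made-up intensities.
import numpy as np

intensities = np.array([[1200.0, 35.0, 560.0],
                        [ 980.0, 40.0, 610.0],
                        [1500.0, 22.0, 480.0]])   # samples x metabolite features

logged = np.log1p(intensities)                    # tame the intensity range
autoscaled = (logged - logged.mean(axis=0)) / logged.std(axis=0)
print(autoscaled.round(2))  # each feature now has mean 0, unit variance
```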

Journal ArticleDOI
TL;DR: This study provides compelling evidence that information theory can explain a significant number of phenomena or events in visualization, while no example has been found which is fundamentally in conflict with information theory.
Abstract: In this paper, we examine whether or not information theory can be one of the theoretic frameworks for visualization. We formulate concepts and measurements for qualifying visual information. We illustrate these concepts with examples that manifest the intrinsic and implicit use of information theory in many existing visualization techniques. We outline the broad correlation between visualization and the major applications of information theory, while pointing out the difference in emphasis and some technical gaps. Our study provides compelling evidence that information theory can explain a significant number of phenomena or events in visualization, while no example has been found which is fundamentally in conflict with information theory. We also notice that the emphasis of some traditional applications of information theory, such as data compression or data communication, may not always suit visualization, as the former typically focuses on the efficient throughput of a communication channel, whilst the latter focuses on the effectiveness in aiding the perceptual and cognitive process for data understanding and knowledge discovery. These findings suggest that further theoretic developments are necessary for adopting and adapting information theory for visualization.
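
One of the simplest measures involved in such an analysis is the Shannon entropy of the data distribution a visualization must convey. A short sketch follows, with the binning choice as an assumption.

```python
# Sketch: Shannon entropy of a dataset's value distribution, in bits.
import numpy as np

def shannon_entropy(values, bins=16):
    counts, _ = np.histogram(values, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                          # ignore empty bins (0 log 0 = 0)
    return float(-(p * np.log2(p)).sum())

data = np.random.randn(10_000)
print(shannon_entropy(data))  # how much information a view must convey
```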

Journal ArticleDOI
TL;DR: A novel framework is proposed for mining high-utility sequential patterns, enabling more realistic information extraction from sequence databases with non-binary frequency values of items in sequences and different importance/significance values for distinct items.
Abstract: Mining sequential patterns is an important research issue in data mining and knowledge discovery with broad applications. However, the existing sequential pattern mining approaches consider only binary frequency values of items in sequences and equal importance/significance values for distinct items. Therefore, they cannot faithfully represent many real-world scenarios. In this paper, we propose a novel framework for mining high-utility sequential patterns, enabling more realistic information extraction from sequence databases with non-binary frequency values of items in sequences and different importance/significance values for distinct items. Moreover, for mining high-utility sequential patterns, we propose two new algorithms: UtilityLevel, which follows a level-wise candidate generation approach, and UtilitySpan, which follows a pattern-growth approach. Extensive performance analyses show that our algorithms are very efficient and scalable for mining high-utility sequential patterns.
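
A toy sketch of the utility notion underlying high-utility sequential pattern mining: per-item importance weights times purchase quantities, summed over sequences that contain the pattern, then thresholded. The data, the greedy occurrence matching, and the threshold are assumptions; UtilityLevel/UtilitySpan compute utilities more carefully.

```python
# Sketch: sequence utility with quantities and importance weights.
importance = {"bread": 1, "milk": 2, "wine": 8}   # external utility per item

# each sequence: (item, quantity) purchases in time order
database = [
    [("bread", 2), ("milk", 1), ("wine", 1)],
    [("milk", 3), ("wine", 2)],
    [("bread", 1), ("milk", 1)],
]

def sequence_utility(pattern, seq):
    """Utility of pattern as an in-order subsequence of seq (greedy match)."""
    total, idx = 0, 0
    for item, qty in seq:
        if idx < len(pattern) and item == pattern[idx]:
            total += qty * importance[item]
            idx += 1
    return total if idx == len(pattern) else 0

pattern = ["milk", "wine"]
utility = sum(sequence_utility(pattern, s) for s in database)
print(pattern, utility, utility >= 15)  # keep pattern if utility >= threshold
```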

Journal ArticleDOI
TL;DR: A new concept-based mining model that analyzes terms on the sentence, document, and corpus levels rather than the traditional analysis of the document only is introduced and can efficiently find significant matching concepts between documents, according to the semantics of their sentences.
Abstract: Most of the common techniques in text mining are based on the statistical analysis of a term, either a word or a phrase. Statistical analysis of term frequency captures the importance of the term within a document only. However, two terms can have the same frequency in their documents, but one term contributes more to the meaning of its sentences than the other term. Thus, the underlying text mining model should indicate terms that capture the semantics of text. In this case, the mining model can capture terms that present the concepts of the sentence, which leads to discovery of the topic of the document. A new concept-based mining model that analyzes terms on the sentence, document, and corpus levels is introduced. The concept-based mining model can effectively discriminate between nonimportant terms with respect to sentence semantics and terms which hold the concepts that represent the sentence meaning. The proposed mining model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure. The term which contributes to the sentence semantics is analyzed on the sentence, document, and corpus levels rather than the traditional analysis of the document only. The proposed model can efficiently find significant matching concepts between documents, according to the semantics of their sentences. The similarity between documents is calculated based on a new concept-based similarity measure. The proposed similarity measure takes full advantage of using the concept analysis measures on the sentence, document, and corpus levels in calculating the similarity between documents. Large sets of experiments using the proposed concept-based mining model on different data sets in text clustering are conducted. The experiments demonstrate extensive comparison between the concept-based analysis and the traditional analysis. Experimental results demonstrate the substantial enhancement of the clustering quality using the sentence-based, document-based, corpus-based, and combined approach concept analysis.

Journal ArticleDOI
TL;DR: An overview of driving forces, theoretical frameworks, architectures, techniques, case studies, and open issues of domain-driven data mining shows that the findings are not actionable, and lack soft power in solving real-world complex problems.
Abstract: Traditional data mining research mainly focuses on developing, demonstrating, and pushing the use of specific algorithms and models. The process of data mining stops at pattern identification. Consequently, a widely seen fact is that 1) many algorithms have been designed of which very few are repeatable and executable in the real world, 2) often many patterns are mined but a major proportion of them are either commonsense or of no particular interest to business, and 3) end users generally cannot easily understand and take them over for business use. In summary, we see that the findings are not actionable, and lack soft power in solving real-world complex problems. Thorough efforts are essential for promoting the actionability of knowledge discovery in real-world smart decision making. To this end, domain-driven data mining (D3M) has been proposed to tackle the above issues, and promote the paradigm shift from "data-centered knowledge discovery" to "domain-driven, actionable knowledge delivery." In D3M, ubiquitous intelligence is incorporated into the mining process and models, and a corresponding problem-solving system is formed as the space for knowledge discovery and delivery. Based on our related work, this paper presents an overview of driving forces, theoretical frameworks, architectures, techniques, case studies, and open issues of D3M. We understand D3M discloses many critical issues with no thorough and mature solutions available for now, which indicates the challenges and prospects for this new topic.

Journal ArticleDOI
TL;DR: This paper surveys an important subclass, Directed Probabilistic Topic Models (DPTMs), with soft clustering abilities, and their applications to knowledge discovery in text corpora, giving basic concepts, advantages and disadvantages in chronological order.
Abstract: Graphical models have become the basic framework for topic-based probabilistic modeling. Especially models with latent variables have proved to be effective in capturing hidden structures in the data. In this paper, we survey an important subclass, Directed Probabilistic Topic Models (DPTMs), with soft clustering abilities, and their applications to knowledge discovery in text corpora. From an unsupervised learning perspective, "topics are semantically related probabilistic clusters of words in text corpora; and the process for finding these topics is called topic modeling". In topic modeling, a document consists of different hidden topics, and the topic probabilities provide an explicit representation of a document to smooth data from the semantic level. It has been an active area of research during the last decade. Many models have been proposed for handling the problems of modeling text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We give basic concepts, advantages and disadvantages in chronological order, a classification of existing models into different categories, and their parameter estimation and inference algorithms together with model performance evaluation measures. We also discuss their applications, open challenges and future directions in this dynamic area of research.
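
For a concrete example of a DPTM in use, a minimal LDA fit with scikit-learn; the four-document corpus and the topic count are invented for illustration.

```python
# Sketch: fitting a small LDA topic model (the canonical DPTM).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining discovers patterns in large data",
        "topic models cluster words in text corpora",
        "stream mining handles drifting data in real time",
        "latent topics give a semantic view of documents"]

vectorizer = CountVectorizer(stop_words="english").fit(docs)
X = vectorizer.transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}:", top)   # most probable words per latent topic
```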

Patent
30 Jun 2010
TL;DR: In this article, the authors present a system for determining user specific information and knowledge relevancy, relevant knowledge and information discovery, user intent and relevant interactions via intelligent messaging, collaboration, sharing and information categorisation, further delivering created knowledge accessible through a personalised user experience.
Abstract: Systems and methods for determining user specific information and knowledge relevancy, relevant knowledge and information discovery, user intent and relevant interactions via intelligent messaging, collaboration, sharing and information categorisation, further delivering created knowledge accessible through a personalised user experience.

Journal ArticleDOI
01 Dec 2010
TL;DR: It is concluded that when integrating a real-life application like BibSonomy into research, certain constraints have to be considered; but in general, the tight interplay between the scientific work and the running system has made BibSonomy a valuable platform for demonstrating and evaluating Web 2.0 research.
Abstract: Social resource sharing systems are central elements of the Web 2.0 and use the same kind of lightweight knowledge representation, called folksonomy. Their large user communities and ever-growing networks of user-generated content have made them an attractive object of investigation for researchers from different disciplines like Social Network Analysis, Data Mining, Information Retrieval or Knowledge Discovery. In this paper, we summarize and extend our work on different aspects of this branch of Web 2.0 research, demonstrated and evaluated within our own social bookmark and publication sharing system BibSonomy, which is currently among the three most popular systems of its kind. We structure this presentation along the different interaction phases of a user with our system, coupling the relevant research questions of each phase with the corresponding implementation issues. This approach reveals in a systematic fashion important aspects and results of the broad bandwidth of folksonomy research like capturing of emergent semantics, spam detection, ranking algorithms, analogies to search engine log data, personalized tag recommendations and information extraction techniques. We conclude that when integrating a real-life application like BibSonomy into research, certain constraints have to be considered; but in general, the tight interplay between our scientific work and the running system has made BibSonomy a valuable platform for demonstrating and evaluating Web 2.0 research.

Proceedings Article
02 Jun 2010
TL;DR: This paper presents the first joint approach for bio-event extraction that obtains state-of-the-art results and adopts a novel formulation by jointly predicting events and arguments, as well as individual dependency edges that compose the argument paths.
Abstract: Knowledge extraction from online repositories such as PubMed holds the promise of dramatically speeding up biomedical research and drug design. After initially focusing on recognizing proteins and binary interactions, the community has recently shifted their attention to the more ambitious task of recognizing complex, nested event structures. State-of-the-art systems use a pipeline architecture in which the candidate events are identified first, and subsequently the arguments. This fails to leverage joint inference among events and arguments for mutual disambiguation. Some joint approaches have been proposed, but they still lag much behind in accuracy. In this paper, we present the first joint approach for bio-event extraction that obtains state-of-the-art results. Our system is based on Markov logic and adopts a novel formulation by jointly predicting events and arguments, as well as individual dependency edges that compose the argument paths. On the BioNLP'09 Shared Task dataset, it reduced F1 errors by more than 10% compared to the previous best joint approach.

Journal ArticleDOI
TL;DR: FiVaTech proposes an unsupervised, page-level data extraction approach to deduce the schema and templates for each individual deep Website, which contains either singleton or multiple data records in one Webpage.
Abstract: Web data extraction has been an important part for many Web data analysis applications. In this paper, we formulate the data extraction problem as the decoding process of page generation based on structured data and tree templates. We propose an unsupervised, page-level data extraction approach to deduce the schema and templates for each individual deep Website, which contains either singleton or multiple data records in one Webpage. FiVaTech applies tree matching, tree alignment, and mining techniques to achieve the challenging task. In experiments, FiVaTech has much higher precision than EXALG and is comparable with other record-level extraction systems like ViPER and MSE. The experiments show an encouraging result for the test pages used in many state-of-the-art Web data extraction works.

Journal ArticleDOI
TL;DR: An ontology of PGx relationships built starting from a lexicon of key pharmacogenomic entities and a syntactic parse of more than 87 million sentences from 17 million MEDLINE abstracts is described, creating a network of 40,000 relationships between more than 200 entity types with clear semantics.

Journal ArticleDOI
TL;DR: A comparative introduction to TCM and modern biomedicine is presented, together with a survey of the related information sources of TCM, a review and discussion of the state of the art and the development of text mining techniques with applications to TCM, and a discussion of the research issues around TCM text mining and its future directions.

Book
20 Jan 2010
TL;DR: Domain Driven Data Mining enhances the actionability and wider deployment of existing data-centered data mining through a combination of domain- and business-oriented factors, constraints and intelligence.
Abstract: In the present thriving global economy, a need has evolved for complex data analysis to enhance an organization's production systems, decision-making tactics, and performance. In turn, data mining has emerged as one of the most active areas in information technologies. Domain Driven Data Mining offers state-of-the-art research and development outcomes on methodologies, techniques, approaches and successful applications in domain driven, actionable knowledge discovery. About this book:
- Enhances the actionability and wider deployment of existing data-centered data mining through a combination of domain- and business-oriented factors, constraints and intelligence.
- Examines real-world challenges to and complexities of the current KDD methodologies and techniques.
- Details a paradigm shift from "data-centered pattern mining" to "domain driven actionable knowledge discovery" for next-generation KDD research and applications.
- Bridges the gap between business expectations and research output through detailed exploration of the findings, thoughts and lessons learned in conducting several large-scale, real-world data mining business applications.
- Includes techniques, methodologies and case studies in real-life enterprise data mining.
- Addresses new areas such as blog mining.
Domain Driven Data Mining is suitable for researchers, practitioners and university students in the areas of data mining and knowledge discovery, knowledge engineering, human-computer interaction, artificial intelligence, intelligent information processing, decision support systems, knowledge management, and KDD project management.