
Showing papers on "Knowledge extraction published in 2017"


Journal ArticleDOI
TL;DR: This article provides a systematic review of existing knowledge graph embedding techniques, covering not only the state of the art but also the latest trends, organized by the type of information used in the embedding task.
Abstract: Knowledge graph (KG) embedding embeds the components of a KG, including entities and relations, into continuous vector spaces, so as to simplify the manipulation while preserving the inherent structure of the KG. It can benefit a variety of downstream tasks such as KG completion and relation extraction, and hence has quickly gained massive attention. In this article, we provide a systematic review of existing techniques, including not only the state of the art but also the latest trends. In particular, we organize the review by the type of information used in the embedding task. Techniques that conduct embedding using only facts observed in the KG are first introduced. We describe the overall framework, specific model design, typical training procedures, as well as pros and cons of such techniques. After that, we discuss techniques that further incorporate additional information besides facts. We focus specifically on the use of entity types, relation paths, textual descriptions, and logical rules. Finally, we briefly introduce how KG embedding can be applied to and benefit a wide variety of downstream tasks such as KG completion, relation extraction, question answering, and so forth.
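Models in the fact-only family typically score a triple with a simple vector operation. Below is a minimal sketch of a translation-based scoring function in the spirit of TransE; the entities, relation, and dimensionality are invented for illustration, and real systems learn the embeddings by minimizing a ranking loss rather than using random vectors.

```python
import numpy as np

# Toy vocabulary of entities and relations (hypothetical examples).
entities = ["Tokyo", "Japan", "Paris", "France"]
relations = ["capital_of"]

rng = np.random.default_rng(0)
dim = 16
ent_emb = {e: rng.normal(size=dim) for e in entities}
rel_emb = {r: rng.normal(size=dim) for r in relations}

def transe_score(head, relation, tail):
    """Translation-based plausibility: higher (less negative) = more plausible.
    Implements f(h, r, t) = -||h + r - t||_2 as in TransE."""
    h, r, t = ent_emb[head], rel_emb[relation], ent_emb[tail]
    return -np.linalg.norm(h + r - t)

# During training, parameters would be adjusted so that observed facts
# score higher than corrupted ones (e.g. via a margin ranking loss).
print(transe_score("Tokyo", "capital_of", "Japan"))
print(transe_score("Paris", "capital_of", "Japan"))
```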

1,905 citations


Journal ArticleDOI
TL;DR: The state of the art of data mining and analytics is reviewed through eight unsupervised learning and ten supervised learning algorithms, as well as the application status of semi-supervised learning algorithms.
Abstract: Data mining and analytics have played an important role in knowledge discovery and decision making/support in the process industry over the past several decades. As the computational engine of data mining and analytics, machine learning serves as a basic tool for information extraction, data pattern recognition and prediction. From the perspective of machine learning, this paper provides a review of existing data mining and analytics applications in the process industry over the past several decades. The state of the art of data mining and analytics is reviewed through eight unsupervised learning and ten supervised learning algorithms, as well as the application status of semi-supervised learning algorithms. Several perspectives are highlighted and discussed for future research on data mining and analytics in the process industry.

657 citations


Journal ArticleDOI
TL;DR: This survey summarizes, categorizes and analyzes those contributions on data preprocessing that cope with streaming data, and takes into account the existing relationships between the different families of methods (feature and instance selection, and discretization).

342 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: This paper reviews data mining techniques and their applications in areas such as educational data mining (EDM), finance, commerce, the life sciences and medicine, and groups existing approaches to determine how data mining can be used in different fields.
Abstract: Data mining is also known as Knowledge Discovery in Databases (KDD). It is defined as the process of extracting interesting, interpretable and useful information from raw data. Many different sources generate raw data in very large amounts, which is the main reason the applications of data mining are increasing rapidly. This paper reviews data mining techniques and their applications in areas such as educational data mining (EDM), finance, commerce, the life sciences and medicine. We group existing approaches to determine how data mining can be used in different fields. Our categorization specifically focuses on the research that has been published over the period 2007–2017. With this categorization, we present an easy and concise view of the different models adapted in data mining.

235 citations


Journal ArticleDOI
TL;DR: The most relevant PPDM techniques from the literature and the metrics used to evaluate such techniques are surveyed, and typical applications of PPDM methods in relevant fields are presented.
Abstract: The collection and analysis of data are continuously growing due to the pervasiveness of computing devices. The analysis of such information is fostering businesses and contributing beneficially to the society in many different fields. However, this storage and flow of possibly sensitive data poses serious privacy concerns. Methods that allow the knowledge extraction from data, while preserving privacy, are known as privacy-preserving data mining (PPDM) techniques. This paper surveys the most relevant PPDM techniques from the literature and the metrics used to evaluate such techniques and presents typical applications of PPDM methods in relevant fields. Furthermore, the current challenges and open issues in PPDM are discussed.
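One family of techniques covered in such surveys perturbs data or query answers before they are mined. The sketch below shows an illustrative perturbation method, adding Laplace noise to a count query in the style of differential privacy; the data, the privacy parameter, and the scenario are assumptions, not a specific method from this survey.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sensitive attribute: 1 = has condition, 0 = does not.
records = rng.integers(0, 2, size=1000)

def noisy_count(values, epsilon=0.5):
    """Release a count perturbed with Laplace noise (sensitivity 1),
    so the exact contribution of any single record is masked."""
    true_count = int(values.sum())
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

print("true count:", int(records.sum()))
print("privacy-preserving count:", round(noisy_count(records), 1))
```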

223 citations


Book ChapterDOI
Bo Xu1, Yong Xu1, Jiaqing Liang1, Chenhao Xie1, Bin Liang1, Wanyun Cui1, Yanghua Xiao1 
27 Jun 2017
TL;DR: A never-ending Chinese knowledge extraction system, CN-DBpedia, which can automatically generate a knowledge base that is ever-increasing in size and constantly updated, and reduces the human costs by reusing the ontology of existing knowledge bases and building an end-to-end fact extraction model.
Abstract: Great efforts have been dedicated to harvesting knowledge bases from online encyclopedias. These knowledge bases play important roles in enabling machines to understand texts. However, most current knowledge bases are in English, and non-English knowledge bases, especially Chinese ones, are still very rare. Many previous systems that extract knowledge from online encyclopedias, although applicable for building a Chinese knowledge base, still suffer from two challenges. The first is that it requires great human effort to construct an ontology and build a supervised knowledge extraction model. The second is that the update frequency of knowledge bases is very slow. To solve these challenges, we propose a never-ending Chinese knowledge extraction system, CN-DBpedia, which can automatically generate a knowledge base that is ever-increasing in size and constantly updated. Specifically, we reduce the human costs by reusing the ontology of existing knowledge bases and building an end-to-end fact extraction model. We further propose a smart active update strategy to keep the freshness of our knowledge base with little human cost. The 164 million API calls of the published services justify the success of our system.
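At its simplest, harvesting an encyclopedia page means turning semi-structured content such as infoboxes into (subject, predicate, object) facts. The sketch below illustrates that step on a hypothetical infobox; the page layout and field names are invented and do not reflect CN-DBpedia's actual ontology or extraction model.

```python
# Hypothetical infobox text scraped from an encyclopedia article.
page_title = "Shanghai"
infobox_lines = [
    "Country = China",
    "Population = 24,870,895",
    "Area = 6,340.5 km2",
]

def extract_facts(subject, lines):
    """Turn 'key = value' infobox lines into (subject, predicate, object)
    triples; real systems map keys to an ontology and normalise values."""
    facts = []
    for line in lines:
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        facts.append((subject, key, value))
    return facts

for fact in extract_facts(page_title, infobox_lines):
    print(fact)
```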

197 citations


Journal ArticleDOI
01 Mar 2017
TL;DR: The five Vs of big data (volume, velocity, variety, veracity, and value) are reviewed, as well as new technologies, including NoSQL databases, that have emerged to accommodate the needs of big data initiatives.
Abstract: The era of big data has resulted in the development and application of technologies and methods aimed at effectively using massive amounts of data to support decision-making and knowledge discovery activities. In this paper, the five Vs of big data (volume, velocity, variety, veracity, and value) are reviewed, as well as new technologies, including NoSQL databases, that have emerged to accommodate the needs of big data initiatives. The role of conceptual modeling for big data is then analyzed, and suggestions are made for effective conceptual modeling efforts with respect to big data.

197 citations


Book ChapterDOI
01 Jan 2017
TL;DR: The objective of this background paper is to describe emerging sources of Big Data, their use in urban research, and the challenges that arise with their use.
Abstract: Big Data is the term being used to describe a wide spectrum of observational or “naturally-occurring” data generated through transactional, operational, planning and social activities that are not specifically designed for research. Due to the structure and access conditions associated with such data, their use for research and analysis becomes significantly complicated. New sources of Big Data are rapidly emerging as a result of technological, institutional, social, and business innovations. The objective of this background paper is to describe emerging sources of Big Data, their use in urban research, and the challenges that arise with their use. To a certain extent, Big Data in the urban context has become narrowly associated with sensor (e.g., Internet of Things) or socially generated (e.g., social media or citizen science) data. However, there are many other sources of observational data that are meaningful to different groups of urban researchers and user communities. Examples include privately held transactions data, confidential administrative micro-data, data from arts and humanities collections, and hybrid data consisting of synthetic or linked data.

168 citations


Journal ArticleDOI
TL;DR: The first outcomes for imbalanced classification in Big Data problems are presented, introducing the current research state of this area and analyzing the behavior of standard pre-processing techniques in this particular framework.
Abstract: Big Data applications have been emerging during the last years, and researchers from many disciplines are aware of the high advantages related to knowledge extraction from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome these issues, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way to adapt to commodity hardware. Being still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, inner problems of imbalanced data, namely lack of data and small disjuncts, are accentuated during the data partitioning to fit the MapReduce programming style. This paper is designed under three main pillars. First, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current research state of this area. Second, to analyze the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout this work, we carry out a discussion on the challenges and future directions for the topic.
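Among the standard pre-processing techniques for imbalanced data that studies like this evaluate, one of the simplest is random oversampling of the minority class before the data are partitioned. The sketch below shows that idea in plain Python/NumPy on synthetic labels; it is not a MapReduce job and not the paper's actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical imbalanced labels: 950 majority (0) vs 50 minority (1).
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)

def random_oversample(X, y):
    """Duplicate minority-class rows until both classes are equal in size,
    a common pre-processing step before splitting data across mappers."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

X_bal, y_bal = random_oversample(X, y)
print("before:", np.bincount(y), "after:", np.bincount(y_bal))
```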

165 citations


Journal ArticleDOI
TL;DR: Overall, the unsupervised rules generated by flexible pattern mining are found to be the most consistent, whereas the supervised rules from classification trees are the most sensitive to user-preferences.
Abstract: Four methods are developed for data mining discrete multi-objective optimization datasets. Two of the methods are unsupervised, one is supervised and the other is hybrid. Knowledge is represented as patterns in one method, and as rules in the other methods. The methods are applied to three real-world production system optimization problems. Extracted knowledge is compared across methods and provides new insights. The first part of this paper served as a comprehensive survey of data mining methods that have been used to extract knowledge from solutions generated during multi-objective optimization. The current paper addresses three major shortcomings of existing methods, namely, lack of interactiveness in the objective space, inability to handle discrete variables and inability to generate explicit knowledge. Four data mining methods are developed that can discover knowledge in the decision space and visualize it in the objective space. These methods are (i) sequential pattern mining, (ii) clustering-based classification trees, (iii) hybrid learning, and (iv) flexible pattern mining. Each method uses a unique learning strategy to generate explicit knowledge in the form of patterns, decision rules and unsupervised rules. The methods are also capable of taking the decision makers' preferences into account to generate knowledge unique to preferred regions of the objective space. Three realistic production systems involving different types of discrete variables are chosen as application studies. A multi-objective optimization problem is formulated for each system and solved using NSGA-II to generate the optimization datasets. Next, all four methods are applied to each dataset. In each application, the methods discover similar knowledge for specified regions of the objective space. Overall, the unsupervised rules generated by flexible pattern mining are found to be the most consistent, whereas the supervised rules from classification trees are the most sensitive to user-preferences.
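The clustering-based classification tree idea can be illustrated in a few lines: cluster the optimization archive in the objective space, then learn explicit rules over the decision variables that predict cluster membership. The sketch below uses scikit-learn on synthetic data; the variables, cluster count and tree depth are invented, and this is not the paper's implementation or its production-system datasets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)

# Synthetic stand-in for an NSGA-II archive: decision variables and objectives.
decisions = rng.integers(0, 5, size=(200, 3))          # discrete variables
objectives = np.column_stack([decisions.sum(axis=1) + rng.normal(0, .5, 200),
                              5 - decisions[:, 0] + rng.normal(0, .5, 200)])

# 1) Unsupervised step: group solutions by where they lie in objective space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(objectives)

# 2) Supervised step: explain the clusters with rules over decision variables.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(decisions, labels)
print(export_text(tree, feature_names=["x1", "x2", "x3"]))
```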

145 citations


Journal ArticleDOI
TL;DR: This paper introduces ontologies, a systematic method for articulating a “controlled vocabulary” of agreed-upon terms and their inter-relationships, provides a review of current efforts to create ontologies related to behavior change interventions, and suggests future work.
Abstract: A central goal of behavioral medicine is the creation of evidence-based interventions for promoting behavior change. Scientific knowledge about behavior change could be more effectively accumulated using "ontologies." In information science, an ontology is a systematic method for articulating a "controlled vocabulary" of agreed-upon terms and their inter-relationships. It involves three core elements: (1) a controlled vocabulary specifying and defining existing classes; (2) specification of the inter-relationships between classes; and (3) codification in a computer-readable format to enable knowledge generation, organization, reuse, integration, and analysis. This paper introduces ontologies, provides a review of current efforts to create ontologies related to behavior change interventions and suggests future work. This paper was written by behavioral medicine and information science experts and was developed in partnership between the Society of Behavioral Medicine's Technology Special Interest Group (SIG) and the Theories and Techniques of Behavior Change Interventions SIG. In recent years significant progress has been made in the foundational work needed to develop ontologies of behavior change. Ontologies of behavior change could facilitate a transformation of behavioral science from a field in which data from different experiments are siloed into one in which data across experiments could be compared and/or integrated. This could facilitate new approaches to hypothesis generation and knowledge discovery in behavioral science.

Journal ArticleDOI
TL;DR: This work develops a novel method for feature learning on biological knowledge graphs that combines symbolic methods, in particular knowledge representation using symbolic logic and automated reasoning, with neural networks to generate embeddings of nodes that encode for related information within knowledge graphs.
Abstract: Motivation: Biological data and knowledge bases increasingly rely on Semantic Web technologies and the use of knowledge graphs for data integration, retrieval and federated queries. In the past years, feature learning methods that are applicable to graph-structured data are becoming available, but have not yet widely been applied and evaluated on structured biological knowledge. Results: We develop a novel method for feature learning on biological knowledge graphs. Our method combines symbolic methods, in particular knowledge representation using symbolic logic and automated reasoning, with neural networks to generate embeddings of nodes that encode for related information within knowledge graphs. Through the use of symbolic logic, these embeddings contain both explicit and implicit information. We apply these embeddings to the prediction of edges in the knowledge graph representing problems of function prediction, finding candidate genes of diseases, protein-protein interactions, or drug target relations, and demonstrate performance that matches and sometimes outperforms traditional approaches based on manually crafted features. Our method can be applied to any biological knowledge graph, and will thereby open up the increasing amount of Semantic Web based knowledge bases in biology to use in machine learning and data analytics. Availability and implementation: https://github.com/bio-ontology-research-group/walking-rdf-and-owl. Contact: robert.hoehndorf@kaust.edu.sa. Supplementary information: Supplementary data are available at Bioinformatics online.
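The released implementation is called walking-rdf-and-owl, which suggests a walk-based embedding step. The sketch below shows that general recipe, random walks over a toy graph fed to a gensim skip-gram model, and omits the symbolic-logic and reasoning components the method actually adds; the node names and hyperparameters are invented.

```python
import random
from gensim.models import Word2Vec

# Toy knowledge graph as an adjacency list (hypothetical biological entities).
graph = {
    "GeneA": ["Disease1", "ProteinX"],
    "ProteinX": ["GeneA", "ProteinY"],
    "ProteinY": ["ProteinX", "Disease1"],
    "Disease1": ["GeneA", "ProteinY"],
}

def random_walks(graph, walks_per_node=20, walk_length=8, seed=0):
    """Treat each walk as a 'sentence' of node tokens for skip-gram training."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_length - 1):
                node = rng.choice(graph[node])
                walk.append(node)
            walks.append(walk)
    return walks

model = Word2Vec(sentences=random_walks(graph), vector_size=32,
                 window=3, min_count=1, sg=1, epochs=20, seed=0)
print(model.wv.most_similar("GeneA", topn=2))
```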

Journal ArticleDOI
TL;DR: Granular computing and the acquisition of IF-THEN rules are two basic issues in knowledge representation and data mining; a rough set approach to knowledge discovery in incomplete multi-scale decision tables is proposed from the perspective of granular computing.
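As background for readers unfamiliar with rough sets, the sketch below computes the lower (certain) and upper (possible) approximations of a decision class under an indiscernibility relation, on a toy, complete, single-scale decision table; the incomplete multi-scale machinery of the paper is not reproduced, and the table is invented.

```python
from collections import defaultdict

# Toy decision table: (condition attributes) -> decision.
table = [
    ({"temp": "high", "humid": "low"},  "yes"),
    ({"temp": "high", "humid": "low"},  "no"),
    ({"temp": "low",  "humid": "low"},  "yes"),
    ({"temp": "low",  "humid": "high"}, "no"),
]

def approximations(table, target_decision):
    """Group objects that are indiscernible on condition attributes, then
    compute the lower (certain) and upper (possible) approximations."""
    blocks = defaultdict(list)
    for i, (cond, _) in enumerate(table):
        blocks[tuple(sorted(cond.items()))].append(i)
    target = {i for i, (_, d) in enumerate(table) if d == target_decision}
    lower, upper = set(), set()
    for block in blocks.values():
        if set(block) <= target:
            lower |= set(block)
        if set(block) & target:
            upper |= set(block)
    return lower, upper

print(approximations(table, "yes"))  # certain vs possible 'yes' objects
```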

Journal ArticleDOI
TL;DR: A large-scale knowledge graph is constructed, which integrates terms, documents, databases and other knowledge resources and can facilitate various knowledge services such as knowledge visualization, knowledge retrieval, and knowledge recommendation, and helps the sharing, interpretation, and utilization of TCM health care knowledge.

Journal ArticleDOI
TL;DR: A knowledge discovery-based approach that allows the context-aware system to adapt its behaviour in runtime by analysing large amounts of data generated in Ambient assisted living systems and stored in cloud repositories is proposed.
Abstract: Context-aware monitoring is an emerging technology that provides real-time personalised health-care services and a rich area of big data application. In this paper, we propose a knowledge discovery-based approach that allows the context-aware system to adapt its behaviour in runtime by analysing large amounts of data generated in ambient assisted living (AAL) systems and stored in cloud repositories. The proposed BDCaM model facilitates analysis of big data inside a cloud environment. It first mines the trends and patterns in the data of an individual patient with associated probabilities and utilizes that knowledge to learn proper abnormal conditions. The outcomes of this learning method are then applied in context-aware decision-making processes for the patient. A use case is implemented to illustrate the applicability of the framework that discovers the knowledge of classification to identify the true abnormal conditions of patients having variations in blood pressure (BP) and heart rate (HR). The evaluation shows a much better estimate of detecting proper anomalous situations for different types of patients. The accuracy and efficiency obtained for the implemented case study demonstrate the effectiveness of the proposed model.

Proceedings ArticleDOI
Ingo Scholtes1
13 Aug 2017
TL;DR: This work develops a model selection technique to infer the optimal number of layers of such a model and shows that it outperforms baseline Markov order detection techniques and makes it possible to infer graphical models that capture both topological and temporal characteristics of such data.
Abstract: We introduce a framework for the modeling of sequential data capturing pathways of varying lengths observed in a network. Such data are important, e.g., when studying click streams in the Web, travel patterns in transportation systems, information cascades in social networks, biological pathways, or time-stamped social interactions. While it is common to apply graph analytics and network analysis to such data, recent works have shown that temporal correlations can invalidate the results of such methods. This raises a fundamental question: When is a network abstraction of sequential data justified? Addressing this open question, we propose a framework that combines Markov chains of multiple, higher orders into a multi-layer graphical model that captures temporal correlations in pathways at multiple length scales simultaneously. We develop a model selection technique to infer the optimal number of layers of such a model and show that it outperforms baseline Markov order detection techniques. An application to eight real-world data sets on pathways and temporal networks shows that it makes it possible to infer graphical models that capture both topological and temporal characteristics of such data. Our work highlights fallacies of network abstractions and provides a principled answer to the open question of when they are justified. Generalizing network representations to multi-order graphical models, it opens perspectives for new data mining and knowledge discovery algorithms.
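A naive baseline for the order-detection problem the paper improves upon is to fit Markov chains of increasing order to the observed pathways and compare them with a penalised likelihood score. The sketch below does this with AIC on invented pathways; it glosses over subtleties (such as comparing models on different numbers of transitions) that the paper's multi-order model selection handles properly.

```python
import math
from collections import Counter, defaultdict

# Hypothetical pathways observed in a network (e.g. click streams).
paths = [list("abcd"), list("abcd"), list("abce"), list("xbcd"),
         list("xbce"), list("abce"), list("xbcd"), list("abcd")]

def fit_markov(paths, k):
    """Fit a k-th order Markov chain by maximum likelihood; return its
    log-likelihood and the number of free transition probabilities."""
    trans = defaultdict(Counter)
    for p in paths:
        for i in range(k, len(p)):
            trans[tuple(p[i - k:i])][p[i]] += 1
    ll = 0.0
    for nxt in trans.values():
        total = sum(nxt.values())
        for count in nxt.values():
            ll += count * math.log(count / total)
    n_params = sum(len(nxt) - 1 for nxt in trans.values())
    return ll, n_params

for order in (1, 2, 3):
    ll, n_params = fit_markov(paths, order)
    print(f"order {order}: AIC = {2 * n_params - 2 * ll:.2f}")
```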

Journal ArticleDOI
23 Jun 2017-PLOS ONE
TL;DR: Evaluation of the method showed that drNER gives good results and can be used for knowledge extraction of evidence-based dietary recommendations and is the first attempt at extracting dietary concepts.
Abstract: Evidence-based dietary information represented as unstructured text is crucial information that needs to be accessed in order to help dietitians follow the new knowledge that arrives daily with newly published scientific reports. Different named-entity recognition (NER) methods have been introduced previously to extract useful information from the biomedical literature. They focus on, for example, extracting gene mentions, protein mentions, relationships between genes and proteins, chemical concepts, and relationships between drugs and diseases. In this paper, we present a novel NER method, called drNER, for knowledge extraction of evidence-based dietary information. To the best of our knowledge this is the first attempt at extracting dietary concepts. DrNER is a rule-based NER that consists of two phases. The first one involves the detection and determination of entity mentions, and the second one involves the selection and extraction of the entities. We evaluate the method by using text corpora from heterogeneous sources, including text from several scientifically validated web sites and text from scientific publications. Evaluation of the method showed that drNER gives good results and can be used for knowledge extraction of evidence-based dietary recommendations.
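To give a feel for what a rule-based dietary NER looks like, the sketch below combines a tiny lexicon with a quantity regex; the lexicon, patterns, and two-phase split are invented for illustration and are not drNER's actual rules.

```python
import re

# Invented mini-lexicon of dietary entities and a unit pattern.
FOOD_TERMS = {"vitamin d", "calcium", "whole grains", "saturated fat"}
QUANTITY = re.compile(r"\b\d+(\.\d+)?\s*(mg|g|iu)\b", re.IGNORECASE)

def extract_mentions(sentence):
    """Phase 1: detect candidate mentions; phase 2: keep lexicon/regex hits."""
    text = sentence.lower()
    mentions = [term for term in FOOD_TERMS if term in text]
    mentions += [m.group(0) for m in QUANTITY.finditer(sentence)]
    return mentions

print(extract_mentions("Adults should take 600 IU of vitamin D daily."))
```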

Book ChapterDOI
18 Sep 2017
TL;DR: This work focuses on corpus-based set expansion, which is a critical task in knowledge discovery and may facilitate numerous downstream applications, such as information extraction, taxonomy induction, question answering, and web search.
Abstract: Corpus-based set expansion (i.e., finding the “complete” set of entities belonging to the same semantic class, based on a given corpus and a tiny set of seeds) is a critical task in knowledge discovery. It may facilitate numerous downstream applications, such as information extraction, taxonomy induction, question answering, and web search.
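A toy version of corpus-based set expansion scores each candidate entity by how similar its co-occurrence profile is to that of the seed set. The sketch below uses a hand-made corpus, seeds, and candidates; real systems work at corpus scale with much richer context features.

```python
from collections import Counter

# Tiny stand-in corpus: each "document" is a list of tokens.
corpus = [
    ["paris", "france", "capital", "europe"],
    ["berlin", "germany", "capital", "europe"],
    ["tokyo", "japan", "capital", "asia"],
    ["python", "programming", "language"],
]
seeds = {"paris", "berlin"}
candidates = {"tokyo", "python"}

def context_profile(entity):
    """Bag of tokens co-occurring with the entity across the corpus."""
    profile = Counter()
    for doc in corpus:
        if entity in doc:
            profile.update(t for t in doc if t != entity)
    return profile

def overlap(p, q):
    """Similarity as the overlap between two co-occurrence bags."""
    return sum(min(p[t], q[t]) for t in p)

seed_profile = Counter()
for s in seeds:
    seed_profile.update(context_profile(s))

ranked = sorted(candidates, key=lambda c: -overlap(context_profile(c), seed_profile))
print(ranked)  # entities most similar to the seeds come first
```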

Journal ArticleDOI
09 Aug 2017
TL;DR: A semi-random data partitioning framework is proposed to solve class imbalance and sample representativeness issues in granular computing, and shows that avoiding class imbalance results in better model performance.
Abstract: Due to the vast and rapid increase in the size of data, machine learning has become an increasingly more popular approach for the purpose of knowledge discovery and predictive modelling. For both of the above purposes, it is essential to have a data set partitioned into a training set and a test set. In particular, the training set is used towards learning a model and the test set is then used towards evaluating the performance of the model learned from the training set. The split of the data into the two sets, however, and the influence on model performance, has only been investigated with respect to the optimal proportion for the two sets, with no attention paid to the characteristics of the data within the training and test sets. Thus, the current practice is to randomly split the data into approximately 70% for training and 30% for testing. In this paper, we show that this way of partitioning the data leads to two major issues: (a) class imbalance and (b) sample representativeness issues. Class imbalance is known to affect the performance of many classifiers by introducing a bias towards the majority class; the representativeness of the training set affects a model’s performance through the lack of opportunity for the algorithm to learn, by not presenting it with relevant examples—similar to testing a student on material that was not taught. To solve the above two issues, we propose a semi-random data partitioning framework, in the setting of granular computing. While we discuss how the framework can address both issues, in this paper, we focus on avoiding class imbalance when partitioning the data, through the proposed approach. The results show that avoiding class imbalance results in better model performance.
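The class-imbalance part of the problem can be approximated with stratified sampling, which keeps class proportions equal across the training and test sets. The sketch below contrasts a purely random split with a stratified one using scikit-learn on synthetic data; it does not reproduce the paper's granular-computing framework or its representativeness handling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Synthetic imbalanced data set: 90% class 0, 10% class 1.
X = rng.normal(size=(500, 4))
y = np.array([0] * 450 + [1] * 50)

# Purely random split: class proportions in train/test can drift.
_, _, _, y_test_rand = train_test_split(X, y, test_size=0.3, random_state=0)

# Stratified split: both sets keep (approximately) the original class ratio.
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.3, random_state=0,
                                          stratify=y)

print("random split test ratio:    ", y_test_rand.mean().round(3))
print("stratified split test ratio:", y_test_strat.mean().round(3))
```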

Journal ArticleDOI
TL;DR: A novel approach for estimating temporal association pattern prevalence values is proposed, along with a novel temporal fuzzy similarity measure that satisfies monotonicity for finding the similarity between any two temporal patterns.

Journal ArticleDOI
TL;DR: The data index and search engine DataMed is designed to be, for data, what PubMed has been for the scientific literature; it supports the findability and accessibility of data sets.
Abstract: The value of broadening searches for data across multiple repositories has been identified by the biomedical research community. As part of the US National Institutes of Health (NIH) Big Data to Knowledge initiative, we work with an international community of researchers, service providers and knowledge experts to develop and test a data index and search engine, which are based on metadata extracted from various data sets in a range of repositories. DataMed is designed to be, for data, what PubMed has been for the scientific literature. DataMed supports the findability and accessibility of data sets. These characteristics—along with interoperability and reusability—compose the four FAIR principles to facilitate knowledge discovery in today's big data–intensive science landscape.

Journal ArticleDOI
TL;DR: Using the three-stage-based knowledge transfer, the beneficial knowledge from the source domain can be extensively, self-adaptively leveraged in the target domain.
Abstract: We study a novel fuzzy clustering method to improve the segmentation performance on the target texture image by leveraging the knowledge from a prior texture image. Two knowledge transfer mechanisms, i.e. knowledge-leveraged prototype transfer (KL-PT) and knowledge-leveraged prototype matching (KL-PM) are first introduced as the bases. Applying them, the knowledge-leveraged transfer fuzzy C-means (KL-TFCM) method and its three-stage-interlinked framework, including knowledge extraction, knowledge matching, and knowledge utilization, are developed. There are two specific versions: KL-TFCM-c and KL-TFCM-f, i.e. the so-called crisp and flexible forms, which use the strategies of maximum matching degree and weighted sum, respectively. The significance of our work is fourfold: 1) Owing to the adjustability of referable degree between the source and target domains, KL-PT is capable of appropriately learning the insightful knowledge, i.e. the cluster prototypes, from the source domain; 2) KL-PM is able to self-adaptively determine the reasonable pairwise relationships of cluster prototypes between the source and target domains, even if the numbers of clusters differ in the two domains; 3) The joint action of KL-PM and KL-PT can effectively resolve the data inconsistency and heterogeneity between the source and target domains, e.g. the data distribution diversity and cluster number difference. Thus, using the three-stage-based knowledge transfer, the beneficial knowledge from the source domain can be extensively, self-adaptively leveraged in the target domain. As evidence of this, both KL-TFCM-c and KL-TFCM-f surpass many existing clustering methods in texture image segmentation; and 4) In the case of different cluster numbers between the source and target domains, KL-TFCM-f proves higher clustering effectiveness and segmentation performance than does KL-TFCM-c.

Journal ArticleDOI
01 Jan 2017
TL;DR: It is argued that the knowledge graph is a suitable candidate for this data model, and current research is described and some of the promises and challenges of this approach are discussed.
Abstract: In modern machine learning, raw data is the preferred input for our models. Where a decade ago data scientists were still engineering features, manually picking out the details they thought salient, they now prefer the data in their raw form. As long as we can assume that all relevant and irrelevant information is present in the input data, we can design deep models that build up intermediate representations to sift out relevant features. However, these models are often domain specific and tailored to the task at hand, and therefore unsuited for learning on heterogeneous knowledge: information of different types and from different domains. If we can develop methods that operate on this form of knowledge, we can dispense with a great deal of ad-hoc feature engineering and train deep models end-to-end in many more domains. To accomplish this, we first need a data model capable of expressing heterogeneous knowledge naturally in various domains, in as usable a form as possible, and satisfying as many use cases as possible. In this position paper, we argue that the knowledge graph is a suitable candidate for this data model. This paper describes current research and discusses some of the promises and challenges of this approach.

Journal ArticleDOI
TL;DR: This paper summarizes privacy-preserving techniques, focusing on graph-modification methods which alter the graph’s structure and release the entire anonymous network, allowing researchers and third parties to apply all graph-mining processes on anonymous data, from local to global knowledge extraction.
Abstract: Recently, a huge amount of social networks have been made publicly available. In parallel, several definitions and methods have been proposed to protect users' privacy when publicly releasing these data. Some of them were drawn from relational dataset anonymization techniques, which are more mature than network anonymization techniques. In this paper we summarize privacy-preserving techniques, focusing on graph-modification methods which alter the graph's structure and release the entire anonymous network. These methods allow researchers and third parties to apply all graph-mining processes on anonymous data, from local to global knowledge extraction.
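One graph-modification family covered by such surveys is random perturbation, where some true edges are removed and the same number of false ones are added before the whole network is released. The sketch below shows this with networkx on the Zachary karate club graph as a stand-in social network; the perturbation rate is arbitrary and this is not a specific method from the survey.

```python
import random
import networkx as nx

rng = random.Random(0)
G = nx.karate_club_graph()  # stand-in for a social network to be released

def perturb_edges(G, fraction=0.1):
    """Delete a fraction of edges and add the same number of random non-edges,
    so the released graph keeps global structure but hides exact links."""
    H = G.copy()
    n_swap = int(fraction * H.number_of_edges())
    removed = rng.sample(list(H.edges()), n_swap)
    H.remove_edges_from(removed)
    nodes = list(H.nodes())
    added = 0
    while added < n_swap:
        u, v = rng.sample(nodes, 2)
        if not H.has_edge(u, v):
            H.add_edge(u, v)
            added += 1
    return H

anon = perturb_edges(G)
print(G.number_of_edges(), anon.number_of_edges())
```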

Journal ArticleDOI
TL;DR: It has been found that, despite more efficient alternative approaches, the Apriori algorithm is still a widely used frequent itemset generation technique in applications of association rule mining for health informatics.
Abstract: Association rule mining is an effective data mining technique which has been used widely in health informatics research right from its introduction. Since health informatics has received a lot of attention from researchers in the last decade, and has developed various sub-domains, it is interesting as well as essential to review the state of the art of health informatics research. As knowledge discovery researchers and practitioners have applied an array of data mining techniques for knowledge extraction from health data, the application of association rule mining techniques to the health informatics domain is the focus of this survey and is studied in detail. Through critical analysis of the literature on applications of association rule mining for health informatics from 2005 to 2014, it has been found that, despite more efficient alternative approaches, the Apriori algorithm is still a widely used frequent itemset generation technique in applications of association rule mining for health informatics. Moreover, other limitations related to applications of association rule mining for health informatics have also been identified, and recommendations have been made to mitigate those limitations. Furthermore, the algorithms and tools utilized for applications of association rule mining have also been identified, conclusions have been drawn from the literature surveyed, and future research directions have been presented.
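For readers unfamiliar with the algorithm the survey keeps encountering, the sketch below is a compact Apriori-style frequent itemset miner: candidates are generated level by level and kept only if they meet a minimum support. The transactions and threshold are invented, not health-informatics data.

```python
from itertools import combinations

# Toy transactions (invented item codes, not health-informatics data).
transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]
min_support = 3  # absolute support threshold

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining: a (k+1)-itemset can only be
    frequent if its k-subsets are frequent, so candidates are built from
    the survivors of the previous level and pruned by support counting."""
    items = {i for t in transactions for i in t}
    frequent = []
    k_sets = [frozenset([i]) for i in items]
    while k_sets:
        counts = {s: sum(s <= t for t in transactions) for s in k_sets}
        survivors = [s for s, c in counts.items() if c >= min_support]
        frequent += [(set(s), counts[s]) for s in survivors]
        k_sets = list({a | b for a, b in combinations(survivors, 2)
                       if len(a | b) == len(a) + 1})
    return frequent

for itemset, support in apriori(transactions, min_support):
    print(itemset, "support =", support)
```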

Journal ArticleDOI
TL;DR: A graph-based approach to knowledge reuse for supporting knowledge-driven decision-making in new product development and the feasibility and effectiveness of the proposed approach are demonstrated.
Abstract: Pre-existing knowledge buried in manufacturing enterprises can be reused to help decision-makers develop good judgements to make decisions about the problems in new product development, which in turn speeds up and improves the quality of product innovation. This paper presents a graph-based approach to knowledge reuse for supporting knowledge-driven decision-making in new product development. The paper first illustrates the iterative process of knowledge-driven decision-making in new product development. Then, a novel framework is proposed to facilitate this process, where knowledge maps and knowledge navigation are involved. Here, OWL ontologies are employed to construct knowledge maps, which appropriately capture and organise knowledge resources generated at various stages of the product lifecycle; the Personalised PageRank algorithm is used to perform knowledge navigation, which finds the most relevant knowledge in knowledge maps for a given problem in new product development. Finally, the feasibility and effectiveness of the proposed approach are demonstrated.
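The knowledge-navigation step relies on the Personalised PageRank algorithm, which ranks nodes by relevance to a restart node rather than by global centrality. The sketch below runs it with networkx on an invented knowledge map; the node names, edges, and damping factor are assumptions, and a real knowledge map would be built from OWL ontologies.

```python
import networkx as nx

# Invented knowledge map: nodes are knowledge resources, edges are links
# (e.g. "references" or "produced in the same lifecycle stage").
G = nx.Graph()
G.add_edges_from([
    ("problem:vibration", "report:gearbox-test"),
    ("report:gearbox-test", "cad:gearbox-v2"),
    ("cad:gearbox-v2", "spec:material-steel"),
    ("report:paint-defects", "spec:coating"),
])

# Personalised PageRank: restart at the node describing the current problem,
# so scores measure relevance to that problem rather than global centrality.
scores = nx.pagerank(G, alpha=0.85, personalization={"problem:vibration": 1.0})
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {node}")
```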

Journal ArticleDOI
Hailun Lin1, Yong Liu1, Weiping Wang1, Yinliang Yue1, Zheng Lin1 
01 Jan 2017
TL;DR: This work proposes ETransR, a method which automatically learns entity and relation feature representations in continuous vector spaces, in order to measure the semantic relatedness of knowledge mentions for knowledge resolution.
Abstract: Knowledge resolution is the task of clustering knowledge mentions, e.g., entity and relation mentions into several disjoint groups with each group representing a unique entity or relation. Such resolution is a central step in constructing high-quality knowledge graph from unstructured text. Previous research has tackled this problem by making use of various textual and structural features from a semantic dictionary or a knowledge graph. This may lead to poor performance on knowledge mentions with poor or not well-known contexts. In addition, it is also limited by the coverage of the semantic dictionary or knowledge graph. In this work, we propose ETransR, a method which automatically learns entity and relation feature representations in continuous vector spaces, in order to measure the semantic relatedness of knowledge mentions for knowledge resolution. Experimental results on two benchmark datasets show that our proposed method delivers significant improvements compared with the state-of-the-art baselines on the task of knowledge resolution.

Journal ArticleDOI
TL;DR: Two case studies, on an industrial-scale separation tower and the Tennessee Eastman process simulation, demonstrate that data clustering and feature extraction effectively reveal significant process trends in high-dimensional, multivariate data.
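A generic version of the clustering-plus-feature-extraction combination mentioned in the TL;DR is principal component analysis followed by k-means. The sketch below applies that pair to synthetic two-regime sensor data with scikit-learn; the data, dimensionality, and cluster count are invented and unrelated to the tower or Tennessee Eastman datasets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)

# Synthetic stand-in for high-dimensional multivariate process measurements:
# two operating regimes with different means across 20 sensors.
regime_a = rng.normal(loc=0.0, scale=1.0, size=(150, 20))
regime_b = rng.normal(loc=3.0, scale=1.0, size=(150, 20))
X = np.vstack([regime_a, regime_b])

# Feature extraction: project onto the first two principal components.
scores = PCA(n_components=2).fit_transform(X)

# Clustering in the reduced space reveals the operating regimes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(labels))
```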

Journal ArticleDOI
TL;DR: This work has created a domain-targeted, high precision knowledge extraction pipeline, leveraging Open IE, crowdsourcing, and a novel canonical schema learning algorithm (called CASI), that produces high precisionknowledge targeted to a particular domain - in this case, elementary science.
Abstract: Our goal is to construct a domain-targeted, high precision knowledge base (KB), containing general (subject,predicate,object) statements about the world, in support of a downstream question-answering (QA) application. Despite recent advances in information extraction (IE) techniques, no suitable resource for our task already exists; existing resources are either too noisy, too named-entity centric, or too incomplete, and typically have not been constructed with a clear scope or purpose. To address these, we have created a domain-targeted, high precision knowledge extraction pipeline, leveraging Open IE, crowdsourcing, and a novel canonical schema learning algorithm (called CASI), that produces high precision knowledge targeted to a particular domain - in our case, elementary science. To measure the KB’s coverage of the target domain’s knowledge (its "comprehensiveness" with respect to science) we measure recall with respect to an independent corpus of domain text, and show that our pipeline produces output with over 80% precision and 23% recall with respect to that target, a substantially higher coverage of tuple-expressible science knowledge than other comparable resources. We have made the KB publicly available at http://data.allenai.org/tuple-kb .

Journal ArticleDOI
TL;DR: The impact that data sharing has in science and society is reviewed and guidelines to improve the efficient sharing of research data are presented.
Abstract: Initiatives for sharing research data are opportunities to increase the pace of knowledge discovery and scientific progress. The reuse of research data has the potential to avoid the duplication of data sets and to bring new views from multiple analysis of the same data set. For example, the study of genomic variations associated with cancer profits from the universal collection of such data and helps in selecting the most appropriate therapy for a specific patient. However, data sharing poses challenges to the scientific community. These challenges are of ethical, cultural, legal, financial, or technical nature. This article reviews the impact that data sharing has in science and society and presents guidelines to improve the efficient sharing of research data.