
Showing papers on "Knowledge extraction published in 2015"


Journal ArticleDOI
TL;DR: A thorough overview and analysis of the main approaches to entity linking is presented, and various applications, the evaluation of entity linking systems, and future directions are discussed.
Abstract: The large number of potential applications from bridging web data with knowledge bases has led to an increase in entity linking research. Entity linking is the task of linking entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.

702 citations
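The core task summarized in the entry above, resolving an ambiguous mention against a knowledge base despite name variation, can be illustrated with a minimal sketch: generate candidates from an alias dictionary, then rank them by prior popularity plus context overlap. The alias table, priors, and mention below are invented for illustration and are not from the surveyed systems.

```python
# Minimal entity-linking sketch: dictionary-based candidate generation
# followed by ranking on prior popularity + context word overlap.
# The alias table and priors below are invented for illustration only.

ALIASES = {
    "jaguar": [
        {"entity": "Jaguar_(animal)", "prior": 0.35,
         "context": {"cat", "wild", "amazon", "predator"}},
        {"entity": "Jaguar_Cars", "prior": 0.55,
         "context": {"car", "vehicle", "british", "luxury"}},
        {"entity": "Jacksonville_Jaguars", "prior": 0.10,
         "context": {"nfl", "football", "team"}},
    ],
}

def link(mention: str, context_words: set[str]) -> str | None:
    """Return the best-scoring candidate entity for a mention, or None."""
    candidates = ALIASES.get(mention.lower(), [])
    best, best_score = None, float("-inf")
    for cand in candidates:
        overlap = len(context_words & cand["context"])
        score = cand["prior"] + 0.5 * overlap   # simple linear combination
        if score > best_score:
            best, best_score = cand["entity"], score
    return best

print(link("Jaguar", {"the", "new", "luxury", "car", "model"}))
# -> Jaguar_Cars
```

Real systems replace the hand-built dictionary with mention-entity statistics mined from large corpora and use far richer disambiguation features, but the candidate-then-rank structure is the same.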


Journal ArticleDOI
TL;DR: This paper surveys self-labeled methods for semi-supervised classification, proposes a taxonomy based on their main characteristics, and empirically measures their performance in terms of transductive and inductive classification capabilities.
Abstract: Semi-supervised classification methods are suitable tools to tackle training sets with large amounts of unlabeled data and a small quantity of labeled data. This problem has been addressed by several approaches with different assumptions about the characteristics of the input data. Among them, self-labeled techniques follow an iterative procedure that aims to obtain an enlarged labeled data set by accepting that their own predictions tend to be correct. In this paper, we provide a survey of self-labeled methods for semi-supervised classification. From a theoretical point of view, we propose a taxonomy based on their main characteristics. Empirically, we conduct an exhaustive study that involves a large number of data sets, with different ratios of labeled data, aiming to measure their performance in terms of transductive and inductive classification capabilities. The results are contrasted with nonparametric statistical tests, and the best-performing self-labeled models are identified. Moreover, a semi-supervised learning module has been developed for the Knowledge Extraction based on Evolutionary Learning (KEEL) software, integrating the analyzed methods and data sets.

457 citations
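The iterative procedure described in the abstract above, enlarging the labeled set with the model's own confident predictions, corresponds to the classic self-training family. Below is a generic self-training sketch using scikit-learn on synthetic data; it is an illustration of the idea, not an implementation of any specific method from the taxonomy or of the KEEL module.

```python
# Generic self-training sketch (one family of self-labeled methods):
# iteratively accept the classifier's most confident predictions as labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=30, replace=False)       # small labeled set
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

X_l, y_l = X[labeled], y[labeled]
X_u = X[unlabeled]

for _ in range(10):                       # self-training iterations
    clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)
    if len(X_u) == 0:
        break
    proba = clf.predict_proba(X_u)
    conf = proba.max(axis=1)
    keep = conf >= 0.95                   # accept only confident predictions
    if not keep.any():
        break
    X_l = np.vstack([X_l, X_u[keep]])
    y_l = np.concatenate([y_l, proba.argmax(axis=1)[keep]])
    X_u = X_u[~keep]

print("final labeled-set size:", len(X_l))
```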


Proceedings ArticleDOI
01 Dec 2015
TL;DR: This paper proposes an architecture to create a flexible and scalable machine learning as a service, using real-world sensor and weather data by running different algorithms at the same time.
Abstract: The demand for knowledge extraction has been increasing. With the growing amount of data being generated by global data sources (e.g., social media and mobile apps) and the popularization of context-specific data (e.g., the Internet of Things), companies and researchers need to connect all these data and extract valuable information. Machine learning has been gaining much attention in data mining, driving the emergence of new solutions. This paper proposes an architecture to create flexible and scalable machine learning as a service. An open source implementation is presented. As a case study, a forecast of electricity demand was generated using real-world sensor and weather data by running different algorithms at the same time.

281 citations
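The case study above runs several algorithms on the same sensor and weather features and compares their forecasts. A minimal sketch of that "different algorithms at the same time" idea follows, using scikit-learn regressors on synthetic stand-in data; the service layer, data sources, and algorithm choices of the paper's architecture are not reproduced.

```python
# Minimal sketch of running several forecasting algorithms side by side
# on the same feature set (synthetic stand-in for sensor/weather data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
temp = rng.normal(20, 8, 1000)             # temperature (synthetic)
hour = rng.integers(0, 24, 1000)           # hour of day
demand = (300 + 5 * np.abs(temp - 18)
          + 20 * np.sin(hour / 24 * 2 * np.pi)
          + rng.normal(0, 10, 1000))       # synthetic electricity demand

X = np.column_stack([temp, hour])
X_tr, X_te, y_tr, y_te = train_test_split(X, demand, random_state=0)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name:18s} MAE = {mae:.1f}")
```

In a service setting the models would typically be trained concurrently (separate workers or jobs) rather than in a loop, with the comparison step selecting the model to serve.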


Journal ArticleDOI
TL;DR: This paper focuses on identifying slow learners among students and highlighting them with a predictive data mining model based on classification algorithms; a knowledge flow model spanning all five classifiers is also shown.

212 citations


Journal ArticleDOI
TL;DR: This review consolidates the surveyed papers according to the disciplines, models, tasks, and methods involved in data mining, comparing them in terms of methods, algorithms, and results.

209 citations


Journal ArticleDOI
TL;DR: This study investigates the use of visualization techniques reported between 1996 and 2013 and evaluates innovative approaches to information visualization of electronic health record (EHR) data for knowledge discovery.

202 citations


Journal ArticleDOI
TL;DR: A generic framework for knowledge discovery in massive BAS (building automation system) data using DM techniques is presented; it is specifically designed considering the low quality and complexity of BAS data, the diversity of advanced DM techniques, and the integration of knowledge discovered by DM techniques with domain knowledge in the building field.

176 citations


Proceedings ArticleDOI
01 Feb 2015
TL;DR: This survey paper investigates why ontology has the potential to help semantic data mining and how formal semantics in ontologies can be incorporated into the data mining process.
Abstract: Semantic Data Mining refers to the data mining tasks that systematically incorporate domain knowledge, especially formal semantics, into the process. In the past, many research efforts have attested the benefits of incorporating domain knowledge in data mining. At the same time, the proliferation of knowledge engineering has enriched the family of domain knowledge, especially formal semantics and Semantic Web ontologies. An ontology is an explicit specification of a conceptualization and a formal way to define the semantics of knowledge and data. The formal structure of ontologies makes them a natural way to encode domain knowledge for data mining. In this survey paper, we introduce general concepts of semantic data mining. We investigate why ontology has the potential to help semantic data mining and how formal semantics in ontologies can be incorporated into the data mining process. We provide detailed discussions of the advances and state of the art of ontology-based approaches, and an introduction to approaches based on other forms of knowledge representation.

157 citations
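One common way formal semantics enters the mining process is to generalize raw items along a class hierarchy before pattern mining, so that patterns emerge at the concept level rather than the instance level. The tiny taxonomy and transactions below are invented for illustration; real semantic data mining approaches work over OWL/RDF ontologies rather than a hand-written dictionary.

```python
# Sketch: use a class hierarchy (a stand-in for an ontology) to generalize
# transaction items before counting frequent patterns.
from collections import Counter
from itertools import combinations

# hypothetical subclass-of relations
PARENT = {"espresso": "coffee", "latte": "coffee",
          "croissant": "pastry", "muffin": "pastry"}

transactions = [
    {"espresso", "croissant"},
    {"latte", "muffin"},
    {"espresso", "muffin"},
]

def generalize(items: set[str]) -> set[str]:
    """Lift each item to its parent class when one is defined."""
    return {PARENT.get(item, item) for item in items}

pair_counts = Counter()
for t in transactions:
    g = generalize(t)
    pair_counts.update(combinations(sorted(g), 2))

print(pair_counts.most_common(1))
# -> [(('coffee', 'pastry'), 3)]  the generalized pattern appears in every basket
```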


Journal ArticleDOI
TL;DR: A definition of big data in healthcare is proposed: big data is defined by volume, as datasets with log(n*p) ≥ 7, and is further characterized by great variety and high velocity.
Abstract: Objective. The aim of this study was to provide a definition of big data in healthcare. Methods. A systematic search of PubMed literature published until May 9, 2014, was conducted. We noted the number of statistical individuals and the number of variables for all papers describing a dataset. These papers were classified into fields of study. Characteristics attributed to big data by authors were also considered. Based on this analysis, a definition of big data was proposed. Results. A total of 196 papers were included. Big data can be defined as datasets with log(n*p) ≥ 7, where n is the number of statistical individuals and p the number of variables. Properties of big data are its great variety and high velocity. Big data raises challenges on veracity, on all aspects of the workflow, on extracting meaningful information, and on sharing information. Big data requires new computational methods that optimize data management. Related concepts are data reuse, false knowledge discovery, and privacy issues. Conclusion. Big data is defined by volume. Big data should not be confused with data reuse: data can be big without being reused for another purpose, for example, in omics. Inversely, data can be reused without being necessarily big, for example, secondary use of Electronic Medical Records (EMR) data.

152 citations
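The volume criterion above, datasets with log(n*p) ≥ 7 for n statistical individuals and p variables, is straightforward to check; the logarithm is read here as base 10, matching the order-of-magnitude framing, and the example sizes below are illustrative.

```python
# Check the volume criterion log10(n * p) >= 7 from the definition above.
import math

def is_big_data(n_individuals: int, n_variables: int, threshold: float = 7.0) -> bool:
    return math.log10(n_individuals * n_variables) >= threshold

# e.g., an omics dataset: 5,000 patients x 20,000 gene-expression variables
print(is_big_data(5_000, 20_000))      # log10(1e8) = 8.0  -> True
# a typical EMR extract: 10,000 patients x 50 variables
print(is_big_data(10_000, 50))         # log10(5e5) ~ 5.7  -> False
```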


Posted Content
TL;DR: The evolution of big data computing, differences between traditional data warehousing and big data, taxonomy of big data computing and underpinning technologies, integrated platform of big data and clouds known as big data clouds, layered architecture and components of big data cloud, and finally open technical challenges and future directions are discussed.
Abstract: Advances in information technology and its widespread growth in several areas of business, engineering, medical and scientific studies are resulting in an information/data explosion. Knowledge discovery and decision making from such rapidly growing voluminous data is a challenging task in terms of data organization and processing, an emerging trend known as Big Data Computing: a new paradigm that combines large-scale compute, new data-intensive techniques and mathematical models to build data analytics. Big Data computing demands huge storage and computing resources for data curation and processing, which could be delivered from on-premise or cloud infrastructures. This paper discusses the evolution of Big Data computing, differences between traditional data warehousing and Big Data, a taxonomy of Big Data computing and underpinning technologies, an integrated platform of Big Data and Clouds known as Big Data Clouds, the layered architecture and components of the Big Data Cloud, and finally open technical challenges and future directions.

148 citations


Journal ArticleDOI
TL;DR: In this paper, a day-typing process that uses Symbolic Aggregate approXimation (SAX), motif and discord extraction, and clustering to detect the underlying structure of building performance data is presented.
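Symbolic Aggregate approXimation (SAX), named in the entry above, converts a z-normalized time series into a short symbol string via piecewise aggregate approximation (PAA) and Gaussian breakpoints; motifs and discords are then found over these words. A minimal SAX sketch for a 4-symbol alphabet follows; it is illustrative and not the authors' day-typing pipeline.

```python
# Minimal SAX sketch: z-normalize, reduce with PAA, map segment means
# to symbols using breakpoints of the standard normal distribution.
import numpy as np

# Breakpoints that split N(0,1) into 4 equiprobable regions (alphabet a-d).
BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])
ALPHABET = "abcd"

def sax(series: np.ndarray, n_segments: int) -> str:
    z = (series - series.mean()) / series.std()          # z-normalize
    segments = np.array_split(z, n_segments)             # PAA segments
    means = np.array([seg.mean() for seg in segments])
    symbols = np.searchsorted(BREAKPOINTS, means)         # map to alphabet index
    return "".join(ALPHABET[i] for i in symbols)

t = np.linspace(0, 2 * np.pi, 96)          # e.g., one day at 15-min resolution
daily_load = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=96)
print(sax(daily_load, n_segments=8))       # prints an 8-symbol SAX word
```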

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work introduces the problem of visual verification of relation phrases and develops a Visual Knowledge Extraction system called VisKE, which has been used not only to enrich existing textual knowledge bases by improving their recall, but also to augment open-domain question-answer reasoning.
Abstract: How can we know whether a statement about our world is valid? For example, given a relationship between a pair of entities, e.g., 'eat(horse, hay)', how can we know whether this relationship is true or false in general? Gathering such knowledge about entities and their relationships is one of the fundamental challenges in knowledge extraction. Most previous work on knowledge extraction has focused purely on text-driven reasoning for verifying relation phrases. In this work, we introduce the problem of visual verification of relation phrases and develop a Visual Knowledge Extraction system called VisKE. Given a verb-based relation phrase between common nouns, our approach assesses its validity by jointly analyzing text and images and reasoning about the spatial consistency of the relative configurations of the entities and the relation involved. Our approach involves no explicit human supervision, thereby enabling large-scale analysis. Using our approach, we have already verified over 12000 relation phrases. Our approach has been used not only to enrich existing textual knowledge bases by improving their recall, but also to augment open-domain question-answer reasoning.

Journal ArticleDOI
TL;DR: Through the use of the accelerator, three representative heuristic fuzzy-rough feature selection algorithms have been enhanced and it is shown that these modified algorithms are much faster than their original counterparts.

Journal ArticleDOI
TL;DR: A time series data mining methodology for temporal knowledge discovery in big BAS data to identify dynamics, patterns and anomalies in building operations, derive temporal association rules within and between subsystems, assess building system performance and spot opportunities in energy conservation.

Journal ArticleDOI
TL;DR: A systematic study of the rough set-based discretization (RSBD) techniques found in the literature, which categorizes them into a taxonomy that provides a useful roadmap for new researchers in the area.
Abstract: The extraction of knowledge from a huge volume of data using rough set methods requires the transformation of continuous value attributes to discrete intervals. This paper presents a systematic study of the rough set-based discretization (RSBD) techniques found in the literature and categorizes them into a taxonomy. In the literature, no review is solely based on RSBD. Only a few rough set discretizers have been studied, while many new developments have been overlooked and need to be highlighted. Therefore, this study presents a formal taxonomy that provides a useful roadmap for new researchers in the area of RSBD. The review also elaborates the process of RSBD with the help of a case study. The study of the existing literature focuses on the techniques adapted in each article, the comparison of these with other similar approaches, the number of discrete intervals they produce as output, their effects on classification and the application of these techniques in a domain. The techniques adopted in each article have been considered as the foundation for the taxonomy. Moreover, a detailed analysis of the existing discretization techniques has been conducted while keeping the concept of RSBD applications in mind. The findings are summarized and presented in this paper.
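The transformation the abstract above describes, turning a continuous attribute into discrete intervals before rough set analysis, can be illustrated with a deliberately simple equal-frequency cut-point scheme. This is a generic illustration of the continuous-to-interval step only, not one of the rough set-based discretizers surveyed in the paper, which choose cut points using rough set criteria; the sample values are invented.

```python
# Generic equal-frequency discretization of one continuous attribute
# (illustrates the interval transformation that precedes rough set analysis;
#  RSBD methods select cuts with rough set criteria instead).
import numpy as np

def equal_frequency_cuts(values: np.ndarray, n_intervals: int) -> np.ndarray:
    """Cut points placed at quantiles so each interval holds ~equal counts."""
    qs = np.linspace(0, 1, n_intervals + 1)[1:-1]
    return np.quantile(values, qs)

def discretize(values: np.ndarray, cuts: np.ndarray) -> np.ndarray:
    """Map each value to the index of the interval it falls into."""
    return np.searchsorted(cuts, values)

ages = np.array([19, 23, 25, 31, 38, 42, 47, 55, 61, 70])
cuts = equal_frequency_cuts(ages, n_intervals=3)
print("cut points:", cuts)
print("interval labels:", discretize(ages, cuts))
```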

Journal ArticleDOI
Min Song1, Won Chul Kim1, Dahee Lee1, Go Eun Heo1, Keun Young Kang1 
TL;DR: In this paper, a comprehensive text-mining system is presented that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework; it shows fairly good accuracy as well as the ability to configure text-processing components.

Journal ArticleDOI
TL;DR: A rule-based system capturing the conditions that trigger emotions, grounded in an emotional model and Bayesian probability, is proposed; the experimental results validate the feasibility of the approach.
Abstract: Highlights: we develop a rule-based system that triggers emotions based on the emotional model; we extract the corresponding cause events for fine-grained emotions; we obtain the proportions of different cause components under different emotions; linguistic features and Bayesian probability are used in this paper. Emotion analysis and emotion cause extraction are key research tasks in natural language processing and public opinion mining. This paper presents a rule-based approach to emotion cause component detection for Chinese micro-blogs. Our research has important scientific value for social network knowledge discovery and data mining. It also has great potential for analyzing the psychological processes of consumers. First, this paper proposes a rule-based system capturing the conditions that trigger emotions, based on an emotional model. Second, this paper extracts the corresponding cause events for fine-grained emotions from the results of events, actions of agents, and aspects of objects. The proportions of different cause components under different emotions are obtained by constructing an emotional lexicon and identifying different linguistic features, and the proposed approach is based on Bayesian probability. Finally, this paper presents experiments on an emotion corpus of Chinese micro-blogs. The experimental results validate the feasibility of the approach. Existing problems and further work are also presented at the end.
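The proportion estimates described above amount to conditional probabilities of a cause component given an emotion, estimated from annotated counts. A hedged sketch of that counting step with add-one smoothing follows; the component categories and counts are invented, and the paper's linguistic rules and lexicon are not reproduced.

```python
# Sketch: estimate P(cause component | emotion) from annotated counts,
# with add-one smoothing. Counts and category names are invented.
from collections import defaultdict

# (emotion, cause_component) annotation counts -- hypothetical
counts = {
    ("joy", "event"): 40, ("joy", "agent_action"): 25, ("joy", "object_aspect"): 10,
    ("anger", "event"): 18, ("anger", "agent_action"): 50, ("anger", "object_aspect"): 7,
}
COMPONENTS = ["event", "agent_action", "object_aspect"]

def component_distribution(emotion: str) -> dict[str, float]:
    totals = defaultdict(int)
    for (emo, comp), c in counts.items():
        if emo == emotion:
            totals[comp] += c
    denom = sum(totals.values()) + len(COMPONENTS)        # add-one smoothing
    return {c: (totals[c] + 1) / denom for c in COMPONENTS}

print(component_distribution("anger"))
# agent actions dominate anger causes in this toy data
```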

Journal ArticleDOI
TL;DR: This paper aims to encourage research scientists who do not have extensive programming and data mining knowledge to take advantage of existing data mining tools and to embrace classical data mining and LOD approaches, in support of gaining more insight and recognizing patterns in highly complex data sets.

Journal ArticleDOI
TL;DR: Performance over a complex spectrum of simulated genetic datasets demonstrated that these new mechanisms dramatically improve nearly every performance metric on datasets with 20 attributes and made it possible for ExSTraCS to reliably scale up to perform on related 200 and 2000-attribute datasets.
Abstract: Algorithmic scalability is a major concern for any machine learning strategy in this age of ‘big data’. A large number of potentially predictive attributes is emblematic of problems in bioinformatics, genetic epidemiology, and many other fields. Previously, ExSTraCS was introduced as an extended Michigan-style supervised learning classifier system that combined a set of powerful heuristics to successfully tackle the challenges of classification, prediction, and knowledge discovery in complex, noisy, and heterogeneous problem domains. While Michigan-style learning classifier systems are powerful and flexible learners, they are not considered to be particularly scalable. For the first time, this paper presents a complete description of the ExSTraCS algorithm and introduces an effective strategy to dramatically improve learning classifier system scalability. ExSTraCS 2.0 addresses scalability with (1) a rule specificity limit, (2) new approaches to expert knowledge guided covering and mutation mechanisms, and (3) the implementation and utilization of the TuRF algorithm for improving the quality of expert knowledge discovery in larger datasets. Performance over a complex spectrum of simulated genetic datasets demonstrated that these new mechanisms dramatically improve nearly every performance metric on datasets with 20 attributes and made it possible for ExSTraCS to reliably scale up to perform on related 200 and 2000-attribute datasets. ExSTraCS 2.0 was also able to reliably solve the 6, 11, 20, 37, 70, and 135 multiplexer problems, and did so in similar or fewer learning iterations than previously reported, with smaller finite training sets, and without using building blocks discovered from simpler multiplexer problems. Furthermore, ExSTraCS usability was made simpler through the elimination of previously critical run parameters.
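The multiplexer problems used as benchmarks above are pure Boolean functions: the first k bits address one of the remaining 2^k bits, and the addressed bit is the class label. A small generator for the 6-multiplexer (k = 2) is sketched below to make the benchmark concrete; it is not part of ExSTraCS itself.

```python
# The 6-multiplexer benchmark: 2 address bits select one of 4 data bits,
# and the selected bit is the class label. Used to test classifier systems.
from itertools import product

def multiplexer(bits: tuple[int, ...], address_bits: int = 2) -> int:
    address = int("".join(map(str, bits[:address_bits])), 2)
    return bits[address_bits + address]

# enumerate the full 6-bit truth table (2^6 = 64 instances)
dataset = [(bits, multiplexer(bits)) for bits in product((0, 1), repeat=6)]
print(len(dataset), "instances; example:", dataset[37])
```

Larger variants (11, 20, 37, 70, 135 bits) follow the same pattern with more address bits, which is what makes them a convenient scalability ladder.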

Journal ArticleDOI
TL;DR: Three different parallel matrix-based methods are introduced to process large-scale, incomplete data; they are built on MapReduce and implemented on Twister, a lightweight MapReduce runtime system.
Abstract: As the volume of data grows at an unprecedented rate, large-scale data mining and knowledge discovery present a tremendous challenge. Rough set theory, which has been used successfully in solving problems in pattern recognition, machine learning, and data mining, centers on the idea that a set of distinct objects may be approximated via a lower and upper bound. In order to obtain the benefits that rough sets can provide for data mining and related tasks, efficient computation of these approximations is vital. The recently introduced cloud computing model, MapReduce, has gained a lot of attention from the scientific community for its applicability to large-scale data analysis. In previous research, we proposed a MapReduce-based method for computing approximations in parallel, which can efficiently process complete data but fails in the case of missing (incomplete) data. To address this shortcoming, three different parallel matrix-based methods are introduced to process large-scale, incomplete data. All of them are built on MapReduce and implemented on Twister, a lightweight MapReduce runtime system. The proposed parallel methods are then experimentally shown to be efficient for processing large-scale data.
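The lower and upper approximations that the parallel methods above compute can be shown on a tiny, single-machine example: the lower approximation of a target set X collects the equivalence classes fully contained in X, the upper approximation collects those that intersect X. The decision table below is invented, and the MapReduce/Twister parallelization itself is not reproduced.

```python
# Single-machine sketch of rough set lower/upper approximations.
# Objects are grouped into equivalence classes by their attribute values.
from collections import defaultdict

# object id -> condition attribute values (hypothetical decision table)
table = {1: ("red", "s"), 2: ("red", "s"), 3: ("blue", "m"),
         4: ("blue", "m"), 5: ("red", "l")}
target = {1, 2, 3}          # X: objects with some decision value

# build equivalence classes of the indiscernibility relation
classes = defaultdict(set)
for obj, attrs in table.items():
    classes[attrs].add(obj)

lower = {o for c in classes.values() if c <= target for o in c}
upper = {o for c in classes.values() if c & target for o in c}

print("lower approximation:", lower)   # {1, 2}
print("upper approximation:", upper)   # {1, 2, 3, 4}
```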

Journal ArticleDOI
TL;DR: Mass-Up brings knowledge discovery within reach of MALDI-TOF-MS researchers by allowing data preprocessing, as well as subsequent analysis including biomarker discovery, clustering, biclustering and three-dimensional PCA visualization.
Abstract: Mass spectrometry is one of the most important techniques in the field of proteomics. MALDI-TOF mass spectrometry has become popular during the last decade due to its high speed and sensitivity for detecting proteins and peptides. MALDI-TOF-MS can also be used in combination with machine learning techniques and statistical methods for knowledge discovery. Although there are many software libraries and tools that can be combined for this kind of analysis, there is still a need for all-in-one solutions with user-friendly graphical interfaces that avoid the need for programming skills. Mass-Up, an open multiplatform software application for MALDI-TOF-MS knowledge discovery, is herein presented. The Mass-Up software allows data preprocessing, as well as subsequent analysis including (i) biomarker discovery, (ii) clustering, (iii) biclustering, (iv) three-dimensional PCA visualization and (v) classification of large sets of spectra data. Mass-Up brings knowledge discovery within reach of MALDI-TOF-MS researchers. Mass-Up is distributed under license GPLv3 and it is open and free to all users at http://sing.ei.uvigo.es/mass-up .

Journal ArticleDOI
TL;DR: The objective of this paper is to investigate knowledge reduction in formal concept analysis (FCA) and to propose a method based on Non-Negative Matrix Factorization (NMF) for addressing the issue.

Book ChapterDOI
01 Jan 2015
TL;DR: This chapter offers a first exploration of the general potential of Artificial Intelligence Techniques in Human Resource Management; a brief foundation elaborates on the central functionalities of Artificial Intelligence Techniques and the central requirements of Human Resource Management based on the task-technology fit approach.
Abstract: Artificial Intelligence Techniques and its subset, Computational Intelligence Techniques, are not new to Human Resource Management, and since their introduction, a heterogeneous set of suggestions on how to use Artificial Intelligence and Computational Intelligence in Human Resource Management has accumulated. While such contributions offer detailed insights into specific application possibilities, an overview of the general potential is missing. Therefore, this chapter offers a first exploration of the general potential of Artificial Intelligence Techniques in Human Resource Management. To this end, a brief foundation elaborates on the central functionalities of Artificial Intelligence Techniques and the central requirements of Human Resource Management based on the task-technology fit approach. Based on this, the potential of Artificial Intelligence in Human Resource Management is explored in six selected scenarios (turnover prediction with artificial neural networks, candidate search with knowledge-based search engines, staff rostering with genetic algorithms, HR sentiment analysis with text mining, resume data acquisition with information extraction and employee self-service with interactive voice response). The insights gained based on the foundation and exploration are discussed and summarized.

Book ChapterDOI
11 Oct 2015
TL;DR: This paper presents an approach to building knowledge graphs by exploiting semantic technologies to reconcile the data continuously crawled from diverse sources, to scale to billions of triples extracted from the crawled content, and to support interactive queries on the data.
Abstract: There is a huge amount of data spread across the web and stored in databases that we can use to build knowledge graphs. However, exploiting this data to build knowledge graphs is difficult due to the heterogeneity of the sources, scale of the amount of data, and noise in the data. In this paper we present an approach to building knowledge graphs by exploiting semantic technologies to reconcile the data continuously crawled from diverse sources, to scale to billions of triples extracted from the crawled content, and to support interactive queries on the data. We applied our approach, implemented in the DIG system, to the problem of combating human trafficking and deployed it to six law enforcement agencies and several non-governmental organizations to assist them with finding traffickers and helping victims.
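The knowledge-graph construction described above reconciles extractions into RDF triples and supports queries over them. A tiny illustration of that representation using rdflib follows; the namespace, facts, and reconciliation query are invented examples for illustration, not the DIG schema or pipeline.

```python
# Tiny illustration of representing extractions as RDF triples and querying
# them with SPARQL (rdflib). Namespace and facts are invented examples.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/")
g = Graph()

ad = URIRef(EX["ad/123"])
g.add((ad, RDF.type, EX.Advertisement))
g.add((ad, EX.phoneNumber, Literal("555-0100")))
g.add((ad, EX.location, Literal("CityA")))

ad2 = URIRef(EX["ad/456"])
g.add((ad2, RDF.type, EX.Advertisement))
g.add((ad2, EX.phoneNumber, Literal("555-0100")))   # same phone -> linkable
g.add((ad2, EX.location, Literal("CityB")))

# find ads that share a phone number (a simple reconciliation signal)
q = """
PREFIX ex: <http://example.org/>
SELECT ?a ?b WHERE {
  ?a ex:phoneNumber ?p .
  ?b ex:phoneNumber ?p .
  FILTER(STR(?a) < STR(?b))
}
"""
for row in g.query(q):
    print(row.a, "shares a phone number with", row.b)
```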

Journal ArticleDOI
TL;DR: Experimental studies prove the effectiveness of RRW against malicious attacks and show that the proposed technique outperforms existing ones.
Abstract: Advancement in information technology is playing an increasing role in the use of information systems comprising relational databases. These databases are used effectively in collaborative environments for information extraction; consequently, they are vulnerable to security threats concerning ownership rights and data tampering. Watermarking is advocated to enforce ownership rights over shared relational data and to provide a means for tackling data tampering. When ownership rights are enforced using watermarking, the underlying data undergoes certain modifications, as a result of which the data quality gets compromised. Reversible watermarking is employed to ensure data quality along with data recovery. However, such techniques are usually not robust against malicious attacks and do not provide any mechanism to selectively watermark a particular attribute by taking into account its role in knowledge discovery. Therefore, reversible watermarking is required that ensures: (i) watermark encoding and decoding by accounting for the role of all the features in knowledge discovery; and (ii) original data recovery in the presence of active malicious attacks. In this paper, a robust and semi-blind reversible watermarking (RRW) technique for numerical relational data is proposed that addresses the above objectives. Experimental studies prove the effectiveness of RRW against malicious attacks and show that the proposed technique outperforms existing ones.

Journal ArticleDOI
TL;DR: These findings show that introducing simple practices, such as optimal clutch, engine rotation, and engine running in idle, can reduce fuel consumption on average by 3 to 5 l/100 km, meaning a saving of 30 l per bus per day.
Abstract: This paper discusses the results of applied research on the eco-driving domain based on a huge data set produced from a fleet of Lisbon's public transportation buses over a three-year period. This data set is based on events automatically extracted from the controller area network (CAN) bus and enriched with GPS coordinates, weather conditions, and road information. We apply online analytical processing (OLAP) and knowledge discovery (KD) techniques to deal with the high volume of this data set, to determine the major factors that influence the average fuel consumption, and then to classify the drivers involved according to their driving efficiency. Consequently, we identify the most appropriate driving practices and styles. Our findings show that introducing simple practices, such as optimal clutch, engine rotation, and engine running in idle, can reduce fuel consumption on average by 3 to 5 l/100 km, meaning a saving of 30 l per bus per day. These findings have been strongly considered in the drivers' training sessions.

Journal ArticleDOI
TL;DR: An electricity medium voltage (MV) customer characterization framework supported by knowledge discovery in databases (KDD) is presented and a rule set for the automatic classification of new consumers is developed.

Book ChapterDOI
11 Oct 2015
TL;DR: The notion of relatedness explanation is formalized and different criteria are introduced to build explanations based on information theory, diversity, and their combinations, to harness knowledge available in a variety of KGs.
Abstract: Knowledge graphs (KGs) are a key ingredient for searching, browsing and knowledge discovery activities. Motivated by the need to harness knowledge available in a variety of KGs, we face the following two problems. First, given a pair of entities defined in some KG, find an explanation of their relatedness. We formalize the notion of relatedness explanation and introduce different criteria to build explanations based on information theory, diversity and their combinations. Second, given a pair of entities, find other pairs of entities sharing a similar relatedness perspective. We describe an implementation of our ideas in a tool, called RECAP, which is based on RDF and SPARQL. We provide an evaluation of RECAP and a comparison with related systems on real-world data.
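RECAP, as described above, builds relatedness explanations from the connections between two entities in an RDF knowledge graph queried via SPARQL. A hedged sketch of the simplest building block, retrieving the direct predicates linking two DBpedia entities, follows; it is not the RECAP ranking or explanation-selection logic, and the chosen entities are just an example.

```python
# Simplest building block of a relatedness explanation: direct predicates
# linking two entities in DBpedia, retrieved over SPARQL.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
SELECT DISTINCT ?p WHERE {
  { <http://dbpedia.org/resource/Barack_Obama> ?p
      <http://dbpedia.org/resource/United_States> . }
  UNION
  { <http://dbpedia.org/resource/United_States> ?p
      <http://dbpedia.org/resource/Barack_Obama> . }
}
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["p"]["value"])
```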

Journal ArticleDOI
TL;DR: This paper presents the updating properties for dynamic maintenance of approximations when the criteria values in the set-valued decision system evolve with time, and proposes two incremental algorithms corresponding to the addition and removal of criteria values.

Journal ArticleDOI
Francesco Gullo1
TL;DR: A high-level overview of the most prominent tasks and methods that form the basis of data mining is provided.