
Showing papers on "Knowledge extraction" published in 2021


Journal ArticleDOI
TL;DR: A comprehensive literature review is presented to provide an overview of how machine learning techniques can be applied to realize manufacturing mechanisms with intelligent actions, and several significant research questions that remain unanswered in the recent literature on the same topic are highlighted.
Abstract: Manufacturing organizations need to use different kinds of techniques and tools in order to fulfill their foundational goals. In this regard, using machine learning (ML) and data mining (DM) techniques and tools can be very helpful for dealing with challenges in manufacturing. Therefore, in this paper, a comprehensive literature review is presented to provide an overview of how machine learning techniques can be applied to realize manufacturing mechanisms with intelligent actions. Furthermore, it points to several significant research questions that remain unanswered in the recent literature on the same topic. Our survey aims to provide researchers with a solid understanding of the main approaches and algorithms used to improve manufacturing processes over the past two decades. It presents previous ML studies and recent advances in manufacturing by grouping them under four main subjects: scheduling, monitoring, quality, and failure. It comprehensively discusses existing solutions in manufacturing according to various aspects, including tasks (i.e., clustering, classification, regression), algorithms (i.e., support vector machine, neural network), learning types (i.e., ensemble learning, deep learning), and performance metrics (i.e., accuracy, mean absolute error). Furthermore, the main steps of the knowledge discovery in databases (KDD) process to be followed in manufacturing applications are explained in detail. In addition, some statistics about the current state of the field are given from different perspectives. Finally, it explains the advantages of using machine learning techniques in manufacturing, expresses ways to overcome certain challenges, and offers some possible further research directions.
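
To make the KDD workflow the survey outlines concrete, here is a minimal, hedged sketch in Python: a synthetic (invented) manufacturing-quality dataset is preprocessed, modelled with a support vector machine, and evaluated with accuracy, one of the metrics the review covers.

    # Minimal sketch of a KDD-style workflow on a hypothetical manufacturing-quality dataset.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    # Hypothetical sensor features (e.g., temperature, vibration, pressure) and a pass/fail label.
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Preprocessing + modelling stages of the KDD process combined in a single pipeline.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(X_train, y_train)

    # Evaluation stage: accuracy, one of the performance metrics highlighted in the survey.
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))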

237 citations


Journal ArticleDOI
TL;DR: A comprehensive literature review of major scientific contributions made so far in this research area is undertaken, and a holistic overview of the main applications of Data Science to digital marketing is presented to generate insights related to the creation of innovative Data Mining and knowledge discovery techniques.

196 citations


Journal ArticleDOI
TL;DR: It is argued that if the project is goal-directed and process-driven, the process model view still largely holds, whereas when data science projects become more exploratory, the paths that the project can take become more varied and a more flexible model is called for.
Abstract: CRISP-DM (CRoss-Industry Standard Process for Data Mining) has its origins in the second half of the nineties and is thus about two decades old. According to many surveys and user polls it is still the de facto standard for developing data mining and knowledge discovery projects. However, the field has undoubtedly moved on considerably in twenty years, with data science now being the leading term, favoured over data mining. In this paper we investigate whether, and in what contexts, CRISP-DM is still fit for purpose for data science projects. We argue that if the project is goal-directed and process-driven, the process model view still largely holds. On the other hand, when data science projects become more exploratory, the paths that the project can take become more varied, and a more flexible model is called for. We suggest what the outlines of such a trajectory-based model might look like and how it can be used to categorise data science projects (goal-directed, exploratory or data management). We examine seven real-life exemplars where exploratory activities play an important role and compare them against 51 use cases extracted from the NIST Big Data Public Working Group. We anticipate this categorisation can help project planning in terms of time and cost characteristics.

120 citations


Journal ArticleDOI
TL;DR: It is shown that an LBD (literature-based discovery) approach can be feasible not only for discovering drug candidates for COVID-19, but also for generating mechanistic explanations, and that it can be generalized to other diseases as well as to other clinical questions.

106 citations


Journal ArticleDOI
TL;DR: A new method of structural modeling is utilized to generate the structured derivation relationship, thus completing the natural language knowledge extraction process of the object-oriented knowledge system.
Abstract: With recent technological advances, clustering is increasingly being used in various domains, including natural language recognition. This article contributes to the clustering of natural language and fulfills the requirements for dynamic updating of a knowledge system. It proposes a method of dynamic knowledge extraction based on sentence clustering recognition using a neural network-based framework. The conversion process from natural language papers to an object-oriented knowledge system is studied, considering the related problems of sentence vectorization. The article studies the attributes of sentence vectorization using various basic definitions, judgment theorems, and postprocessing elements. The sentence clustering recognition method uses the concept of pre-reliability as a measure of the credibility of sentence recognition results. An ART2 neural network simulation program is written in MATLAB, and the effect of the neural network on sentence recognition is analysed. A post-reliability evaluation index is used to assess the credibility of the constructed model, and the implementation steps for the conjunctive-rule sentence pattern are introduced in detail. A new method of structural modeling is utilized to generate the structured derivation relationship, thus completing the natural language knowledge extraction process of the object-oriented knowledge system. An application example from mechanical CAD demonstrates the specific implementation and confirms the effectiveness of the proposed method.
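
The paper implements an ART2 network in MATLAB; purely as an illustration of the overall pipeline (vectorize sentences, cluster the vectors, read cluster membership as a recognition result), the sketch below substitutes scikit-learn's k-means and a TF-IDF vectorizer for the paper's ART2 network and sentence-vector attributes. The example sentences are invented.

    # Rough sketch: sentence vectorization followed by clustering (k-means stands in for ART2).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    sentences = [
        "The shaft is connected to the gear.",
        "The gear meshes with the pinion.",
        "If the load exceeds the limit, the clutch disengages.",
        "If the temperature rises, the valve opens.",
    ]

    # Vectorize sentences (the paper defines its own sentence-vector attributes).
    vectors = TfidfVectorizer().fit_transform(sentences)

    # Cluster the sentence vectors; cluster labels play the role of recognised sentence patterns.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for sentence, cluster in zip(sentences, labels):
        print(cluster, sentence)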

101 citations


Journal ArticleDOI
TL;DR: In this paper, a comprehensive review of applications of deep learning in network traffic monitoring and analysis (NTMA) is provided, and the authors discuss key challenges, open issues, and future research directions for using deep learning in NTMA applications.

96 citations


Proceedings ArticleDOI
14 Jan 2021
TL;DR: In this paper, the authors propose a two-stage learning algorithm that leverages knowledge from multiple tasks to solve the problem of catastrophic forgetting and difficulties in dataset balancing, by separating the two stages, i.e., knowledge extraction and knowledge composition.
Abstract: Sequential fine-tuning and multi-task learning are methods aiming to incorporate knowledge from multiple tasks; however, they suffer from catastrophic forgetting and difficulties in dataset balancing. To address these shortcomings, we propose AdapterFusion, a new two-stage learning algorithm that leverages knowledge from multiple tasks. First, in the knowledge extraction stage we learn task-specific parameters called adapters that encapsulate the task-specific information. We then combine the adapters in a separate knowledge composition step. We show that by separating the two stages, i.e., knowledge extraction and knowledge composition, the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner. We empirically evaluate AdapterFusion on 16 diverse NLU tasks, and find that it effectively combines various types of knowledge at different layers of the model. We show that our approach outperforms traditional strategies such as full fine-tuning as well as multi-task learning. Our code and adapters are available at AdapterHub.ml.
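
A minimal, hedged sketch of the knowledge-composition idea: attention over the outputs of several frozen task adapters at a single layer. This is a conceptual PyTorch illustration of the fusion mechanism, not the authors' implementation (which is released via AdapterHub.ml); the dimensions and module structure are assumptions.

    # Conceptual sketch of attention-based fusion over pre-trained adapter outputs (one layer).
    import torch
    import torch.nn as nn

    class FusionLayer(nn.Module):
        def __init__(self, hidden_dim):
            super().__init__()
            self.query = nn.Linear(hidden_dim, hidden_dim)
            self.key = nn.Linear(hidden_dim, hidden_dim)

        def forward(self, layer_input, adapter_outputs):
            # adapter_outputs: (num_adapters, batch, seq, hidden) from frozen task adapters.
            q = self.query(layer_input)                    # (batch, seq, hidden)
            k = self.key(adapter_outputs)                  # (num_adapters, batch, seq, hidden)
            scores = torch.einsum("bsh,absh->abs", q, k)   # attention score per adapter
            weights = torch.softmax(scores, dim=0)         # mix the adapters per token
            return torch.einsum("abs,absh->bsh", weights, adapter_outputs)

    fusion = FusionLayer(hidden_dim=16)
    x = torch.randn(2, 5, 16)             # layer input
    adapters = torch.randn(3, 2, 5, 16)   # outputs of 3 task adapters
    print(fusion(x, adapters).shape)      # torch.Size([2, 5, 16])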

86 citations


Journal ArticleDOI
TL;DR: An efficient algorithm for HUSP mining, called HUSP mining with UL-list (HUSP-ULL), is presented; it utilizes a lexicographic q-sequence (LQS)-tree and a utility-linked (UL)-list structure to quickly discover HUSPs.
Abstract: High-utility sequential pattern (HUSP) mining is an emerging topic in the field of knowledge discovery in databases. It consists of discovering subsequences that have a high utility (importance) in sequences, which are referred to as HUSPs. HUSPs can be applied to many real-life applications, such as market basket analysis, e-commerce recommendations, click-stream analysis, and route planning. Several algorithms have been proposed to efficiently mine utility-based useful sequential patterns. However, due to the combinatorial explosion of the search space for low utility thresholds and large-scale data, the performance of these algorithms is unsatisfactory in terms of runtime and memory usage. Hence, this article proposes an efficient algorithm for the task of HUSP mining, called HUSP mining with UL-list (HUSP-ULL). It utilizes a lexicographic q-sequence (LQS)-tree and a utility-linked (UL)-list structure to quickly discover HUSPs. Furthermore, two pruning strategies are introduced in HUSP-ULL to obtain tight upper bounds on the utility of the candidate sequences and to reduce the search space by pruning unpromising candidates early. Substantial experiments on both real-life and synthetic datasets show that HUSP-ULL can effectively and efficiently discover the complete set of HUSPs and that it outperforms the state-of-the-art algorithms.
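
To make the notion of "utility" concrete, a toy, hedged sketch: in a quantitative sequence database each item carries a purchase quantity and a unit profit, and a sequence's utility is the sum of quantity times profit over its items. The items, profits, and quantities below are invented.

    # Toy illustration of utility computation in a quantitative sequence database.
    unit_profit = {"bread": 1, "milk": 2, "wine": 5}   # hypothetical external utilities

    # One sequence = ordered list of itemsets, each item paired with a purchase quantity.
    sequence = [[("bread", 2), ("milk", 1)], [("wine", 3)]]

    def sequence_utility(seq):
        return sum(qty * unit_profit[item] for itemset in seq for item, qty in itemset)

    # Utility = 2*1 + 1*2 + 3*5 = 19; HUSP mining keeps subsequences whose total utility
    # across the database meets a minimum utility threshold.
    print(sequence_utility(sequence))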

80 citations


Journal ArticleDOI
TL;DR: A new model to identify collective abnormal human behaviors from large pedestrian data in smart cities is introduced, and the results show that the deep learning solution outperforms both data mining and state-of-the-art solutions in terms of runtime and accuracy.

77 citations


Journal ArticleDOI
TL;DR: In this article, a post-processing method for entity recognition that automatically generates a dictionary is proposed, together with a pruning strategy combining the Viterbi algorithm and rules to achieve a higher recognition rate for elementary mathematical entities.
Abstract: Chinese word segmentation is an important research direction in research on elementary mathematics knowledge extraction. The speed of segmentation directly affects subsequent applications, and the accuracy of segmentation directly affects the corresponding research in the next step. Among the machine learning methods for extracting basic mathematical knowledge points, the Conditional Random Field (CRF) model implements new word discovery well and is increasingly used in knowledge extraction for basic mathematics. This article first introduces the traditional CRF process for named entity recognition. Then, an improved algorithm, CRF++, for the conditional random field model is proposed. Since the recognition rate of named entities based on traditional machine learning methods is not high, a post-processing method for entity recognition that automatically generates a dictionary is proposed. After identifying mathematical entities, a pruning strategy combining the Viterbi algorithm with rules is proposed to achieve a higher recognition rate for elementary mathematical entities. Finally, several methods for disambiguation after entity recognition are introduced.
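
Since the pruning strategy builds on Viterbi decoding over label scores, here is a compact, generic Viterbi sketch in Python; the BIO tag set, emission scores, and transition scores are hypothetical and unrelated to the paper's trained CRF.

    # Generic Viterbi decoding over per-position label scores plus transition scores.
    import numpy as np

    tags = ["B", "I", "O"]                      # hypothetical BIO tag set
    emission = np.log(np.array([                # scores for a 4-character input
        [0.6, 0.1, 0.3],
        [0.2, 0.6, 0.2],
        [0.3, 0.2, 0.5],
        [0.5, 0.2, 0.3],
    ]))
    transition = np.log(np.array([              # transition[i, j] = score of tag i -> tag j
        [0.2, 0.6, 0.2],
        [0.2, 0.5, 0.3],
        [0.4, 0.1, 0.5],
    ]))

    def viterbi(emission, transition):
        n, k = emission.shape
        score = emission[0].copy()
        back = np.zeros((n, k), dtype=int)
        for t in range(1, n):
            cand = score[:, None] + transition + emission[t]   # (prev_tag, cur_tag)
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        path = [int(score.argmax())]
        for t in range(n - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return [tags[i] for i in reversed(path)]

    print(viterbi(emission, transition))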

64 citations


Journal ArticleDOI
TL;DR: In this paper, a review of hybrid modeling techniques, associated system identification methodologies and model assessment criteria for chemical and biochemical processes is presented, with a focus on the use of big data and machine learning frameworks.

Journal ArticleDOI
TL;DR: A general data mining-based framework is proposed that can extract typical electricity load patterns (TELPs), discover insightful information hidden in the patterns, and improve the interpretability of clustering results to explain the relations between dynamic influencing factors related to electricity consumption and TELPs.

Proceedings ArticleDOI
TL;DR: In this paper, Wang et al. propose a Knowledge-Preserving Incremental Heterogeneous Graph Neural Network (KPGNN) for incremental social event detection, which models complex social messages into unified social graphs to facilitate data utilization and explores the expressive power of GNNs for knowledge extraction.
Abstract: Social events provide valuable insights into group social behaviors and public concerns and therefore have many applications in fields such as product recommendation and crisis management. The complexity and streaming nature of social messages make it appealing to address social event detection in an incremental learning setting, where acquiring, preserving, and extending knowledge are major concerns. Most existing methods, including those based on incremental clustering and community detection, learn limited amounts of knowledge as they ignore the rich semantics and structural information contained in social data. Moreover, they cannot memorize previously acquired knowledge. In this paper, we propose a novel Knowledge-Preserving Incremental Heterogeneous Graph Neural Network (KPGNN) for incremental social event detection. To acquire more knowledge, KPGNN models complex social messages into unified social graphs to facilitate data utilization and explores the expressive power of GNNs for knowledge extraction. To continuously adapt to the incoming data, KPGNN adopts contrastive loss terms that cope with a changing number of event classes. It also leverages the inductive learning ability of GNNs to efficiently detect events and extends its knowledge from previously unseen data. To deal with large social streams, KPGNN adopts a mini-batch subgraph sampling strategy for scalable training, and periodically removes obsolete data to maintain a dynamic embedding space. KPGNN requires no feature engineering and has few hyperparameters to tune. Extensive experiment results demonstrate the superiority of KPGNN over various baselines.
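
As a rough, hedged illustration of the contrastive idea (pull a message embedding towards messages of the same event and push it away from messages of other events), a generic triplet-style loss in PyTorch is shown below; KPGNN's actual loss terms, which handle a changing number of event classes, are not reproduced here.

    # Generic triplet-style contrastive loss over message embeddings (illustrative only).
    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=1.0):
        # Embeddings of: a message, a message from the same event, a message from another event.
        pos_dist = F.pairwise_distance(anchor, positive)
        neg_dist = F.pairwise_distance(anchor, negative)
        return F.relu(pos_dist - neg_dist + margin).mean()

    anchor = torch.randn(8, 64, requires_grad=True)
    positive = torch.randn(8, 64)
    negative = torch.randn(8, 64)
    loss = triplet_loss(anchor, positive, negative)
    loss.backward()
    print(float(loss))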

Journal ArticleDOI
01 Jul 2021
TL;DR: In this article, the authors present a characterization of different types of KGs along with their construction approaches, and discuss current KG applications, problems, and challenges as well as perspectives for future research.
Abstract: With the extensive growth of data, joined with the thriving development of the Internet in this century, finding or extracting valuable information and knowledge from these huge, noisy data has become harder. The Knowledge Graph (KG) is one of the concepts that has come into public view as a result of this development. In addition, with that development, especially in the last two decades, the need to process and extract valuable information more efficiently has increased. A KG presents a common framework for knowledge representation, based on the analysis and extraction of entities and relationships. Techniques for KG construction can extract information from structured, unstructured or even semi-structured data sources, and finally organize the information into knowledge represented in a graph. This paper presents a characterization of different types of KGs along with their construction approaches. It reviews existing academic, industry and expert KG systems and discusses their features in detail. A systematic review methodology has been followed to conduct the review. Several databases (Scopus, GS, WoS) and journals (SWJ, Applied Ontology, JWS) were analysed to collect relevant studies, which were filtered using inclusion and exclusion criteria. This review includes the state of the art, a literature review, a characterization of KGs, and the knowledge extraction techniques of KGs. In addition, this paper overviews current KG applications, problems, and challenges, and discusses perspectives for future research. The main aim of this paper is to analyse all existing KGs with their features, techniques, applications, problems, and challenges. To the best of our knowledge, such a characterization table covering the most commonly used KGs has not been presented earlier.

Journal ArticleDOI
TL;DR: In this paper, a comprehensive review of data preprocessing techniques for analysing massive building operational data is presented, including missing value imputation, outlier detection, data reduction, data scaling, data transformation, and data partitioning.
Abstract: The rapid development in data science and the increasing availability of building operational data have provided great opportunities for developing data-driven solutions for intelligent building energy management. Data preprocessing serves as the foundation for valid data analyses. It is an indispensable step in building operational data analysis considering the intrinsic complexity of building operations and deficiencies in data quality. Data preprocessing refers to a set of techniques for enhancing the quality of the raw data, such as outlier removal and missing value imputation. This article serves as a comprehensive review of data preprocessing techniques for analysing massive building operational data. A wide variety of data preprocessing techniques are summarised in terms of their applications in missing value imputation, outlier detection, data reduction, data scaling, data transformation, and data partitioning. In addition, three state-of-the-art data science techniques are proposed to tackle practical data challenges in the building field, i.e., data augmentation, transfer learning, and semi-supervised learning. In-depth discussions are presented on the pros and cons of existing preprocessing methods, possible directions for future research, and potential applications in smart building energy management. The insights obtained are helpful for the development of data-driven research in the building field.
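
A small, hedged sketch combining three of the reviewed preprocessing steps (missing value imputation, outlier detection, and data scaling) with scikit-learn and pandas; the column names and values describe a hypothetical building-operation table, not a real dataset.

    # Sketch: imputation, outlier flagging, and scaling on hypothetical building-operation data.
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import IsolationForest

    df = pd.DataFrame({
        "chiller_power_kw": [120.0, 118.5, np.nan, 640.0, 121.2],   # 640.0 is an injected outlier
        "supply_temp_c": [6.8, 7.0, 6.9, 7.1, np.nan],
    })

    # Missing value imputation.
    imputed = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

    # Outlier detection: flag anomalous rows rather than silently dropping them.
    imputed["outlier"] = IsolationForest(contamination=0.2, random_state=0).fit_predict(imputed) == -1

    # Data scaling on the clean rows only.
    clean = imputed.loc[~imputed["outlier"], df.columns]
    scaled = StandardScaler().fit_transform(clean)
    print(scaled.shape)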

Journal ArticleDOI
TL;DR: A novel method, Type-aware Attentive Path Reasoning (TAPR), is proposed to complete the knowledge graph by simultaneously considering KG structural information, textual information, and type information; it uses type-level attention to select the most relevant type of a given entity in a specific triple without any predefined rules or patterns.
Abstract: Knowledge graphs (KGs) often encounter knowledge incompleteness. Path reasoning, which predicts the unknown path relation between pairwise entities based on existing facts, is one of the most promising approaches to knowledge graph completion. However, most conventional path reasoning methods exclusively consider the entity description included in fact triples, ignoring both the type information of entities and the interaction between different semantic representations. In this study, we propose a novel method, Type-aware Attentive Path Reasoning (TAPR), to complete the knowledge graph by simultaneously considering KG structural information, textual information, and type information. More specifically, we first leverage types to enrich the representational learning of entities and relationships. Next, we describe a type-level attention mechanism that selects the most relevant type of a given entity in a specific triple, without any predefined rules or patterns, to reduce the impact of noisy types. After learning the distributed representation of all paths, path-level attention assigns different weights to the paths, from which relations among entity pairs are calculated. We conduct a series of experiments on a real-world dataset to demonstrate the effectiveness of TAPR. Experimental results show that our method significantly outperforms all baselines on link prediction and entity prediction tasks.
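
A hedged sketch of the path-level attention idea: given vector representations of several paths between an entity pair, learn attention weights and aggregate them into a single relation representation. The scoring function and dimensions are illustrative assumptions, not the paper's architecture.

    # Illustrative path-level attention: weight and aggregate path representations.
    import torch
    import torch.nn as nn

    class PathAttention(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.scorer = nn.Linear(dim, 1)

        def forward(self, path_reprs):
            # path_reprs: (num_paths, dim) representations of paths between one entity pair.
            weights = torch.softmax(self.scorer(path_reprs).squeeze(-1), dim=0)
            return weights @ path_reprs    # weighted sum -> relation representation (dim,)

    paths = torch.randn(4, 32)             # 4 reasoning paths, 32-dim each
    aggregated = PathAttention(32)(paths)
    print(aggregated.shape)                # torch.Size([32])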

Posted Content
TL;DR: This work proposes a method for automatically rewriting queries into “BERTese”, a paraphrase query that is directly optimized towards better knowledge extraction, and adds auxiliary loss functions that encourage the query to correspond to actual language tokens.
Abstract: Large pre-trained language models have been shown to encode large amounts of world and commonsense knowledge in their parameters, leading to substantial interest in methods for extracting that knowledge. In past work, knowledge was extracted by taking manually-authored queries and gathering paraphrases for them using a separate pipeline. In this work, we propose a method for automatically rewriting queries into "BERTese", a paraphrase query that is directly optimized towards better knowledge extraction. To encourage meaningful rewrites, we add auxiliary loss functions that encourage the query to correspond to actual language tokens. We empirically show our approach outperforms competing baselines, obviating the need for complex pipelines. Moreover, BERTese provides some insight into the type of language that helps language models perform knowledge extraction.
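
A hedged sketch of the "correspond to actual language tokens" idea: penalise rewritten query vectors by their distance to the nearest vocabulary embedding. This is a generic reconstruction of the auxiliary-loss concept, not the authors' code; the dimensions are invented.

    # Illustrative auxiliary loss: push rewritten query vectors towards real token embeddings.
    import torch

    def nearest_token_loss(query_vectors, vocab_embeddings):
        # query_vectors: (seq, dim); vocab_embeddings: (vocab, dim)
        dists = torch.cdist(query_vectors, vocab_embeddings)   # (seq, vocab)
        return dists.min(dim=1).values.mean()                  # mean distance to nearest token

    vocab = torch.randn(1000, 64)
    rewritten = torch.randn(10, 64, requires_grad=True)
    loss = nearest_token_loss(rewritten, vocab)
    loss.backward()
    print(float(loss))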

Journal ArticleDOI
TL;DR: In this article, the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency are summarized and examined.
Abstract: In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic and semantic tasks like text classification, machine translation, cognitive dialogue systems, information retrieval via Natural Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal Transformer architecture, leading to designs such as BERT, GPT (I, II, III), etc. Although these large-size models have achieved unprecedented performance, they come at high computational costs. Consequently, some of the recent NLP architectures have utilized concepts of transfer learning, pruning, quantization, and knowledge distillation to achieve moderate model sizes while keeping nearly similar performance to that achieved by their predecessors. Additionally, to mitigate the data size challenge raised by language models from a knowledge extraction perspective, Knowledge Retrievers have been built to extract explicit data documents from a large corpus of databases with greater efficiency and accuracy. Recent research has also focused on superior inference by providing efficient attention to longer input sequences. In this paper, we summarize and examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency. We provide a detailed understanding and functioning of the different architectures, a taxonomy of NLP designs, comparative evaluations, and future directions in NLP.

Journal ArticleDOI
02 Jan 2021
TL;DR: This communication aims to review the works involving knowledge discovery in catalysis using ML techniques to see patterns, develop models for prediction and deduce heuristic rules for the future.
Abstract: The use of machine learning (ML) in catalysis has been significantly increased in recent years due to the astonishing developments in data processing technologies and the accumulation of a large am...

Journal ArticleDOI
TL;DR: A four-stage MapReduce framework based solely on the well-known Spark platform is presented for high-utility sequential pattern mining, and is shown to deliver faster and more efficient mining when dealing with large data sets.
Abstract: The concepts of sequential pattern mining have become a growing topic in data mining, finding a home most recently in the Internet of Things (IoT), where large volumes of data are presented by the second for analysis and knowledge extraction. One key topic within the realm of sequential pattern mining is high-utility sequential pattern mining (HUSPM). HUSPM takes into account the fusion of utility and sequence factors to assist in the determination of sequential patterns of high utility from databases and data sources. That being said, almost all existing literature focuses on using only a single machine to increase mining performance. In this work, we present a four-stage MapReduce framework based solely on the well-known Spark platform for use in HUSPM. This framework is shown to provide more efficient and faster mining performance when dealing with large data sets. It consists of four phases (initialization, mining, updating, and generation) to handle big data sets based on the MapReduce framework running on the Spark platform. Experiments indicate that the designed model is capable of handling very big data sets, while state-of-the-art algorithms only achieve good performance on small data sets.
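
A heavily simplified, hedged PySpark sketch of the map/reduce flavour of such a framework: candidate patterns with local utilities are aggregated by key and filtered against a minimum utility threshold. The real four-phase algorithm with its pruning strategies is far more involved; the data and names below are illustrative.

    # Heavily simplified Spark map/reduce sketch: sum local utilities of candidate patterns.
    from pyspark import SparkContext

    sc = SparkContext(appName="husp-sketch")

    # Each record: (candidate_pattern, local_utility) emitted by a mining/map phase (toy data).
    candidates = sc.parallelize([
        (("a",), 7), (("a", "b"), 12), (("a",), 4), (("b",), 9), (("a", "b"), 6),
    ])

    min_utility = 10
    high_utility = (candidates
                    .reduceByKey(lambda u1, u2: u1 + u2)          # updating phase: global utility
                    .filter(lambda kv: kv[1] >= min_utility))      # generation phase: thresholding

    print(high_utility.collect())
    sc.stop()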

Journal ArticleDOI
20 Apr 2021 - Entropy
TL;DR: In this paper, the authors used decision trees, k-nearest neighbors, logistic regression, naive Bayes, random forest, and support vector machines to predict student retention at each of three levels during their first, second, and third years of study.
Abstract: Data mining is employed to extract useful information and to detect patterns from often large data sets, and is closely related to knowledge discovery in databases and data science. In this investigation, we formulate models based on machine learning algorithms to extract relevant information predicting student retention at various levels, using higher education data and specifying the relevant variables involved in the modeling. Then, we utilize this information to help the process of knowledge discovery. We predict student retention at each of three levels during their first, second, and third years of study, obtaining models with an accuracy that exceeds 80% in all scenarios. These models allow us to adequately predict the level at which dropout occurs. The machine learning algorithms used in this work are: decision trees, k-nearest neighbors, logistic regression, naive Bayes, random forest, and support vector machines, of which the random forest technique performs the best. We find that secondary education score and the community poverty index are important predictive variables, which have not been previously reported in educational studies of this type. The dropout assessment at various levels reported here is valid for higher education institutions around the world with conditions similar to the Chilean case, where dropout rates affect the efficiency of such institutions. Having the ability to predict dropout based on student data enables these institutions to take preventive measures and avoid dropouts. In the case study, balancing the majority and minority classes improves the performance of the algorithms.
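
A hedged sketch of the modelling setup described (a random forest with class balancing, evaluated by cross-validation); the two features and the data are synthetic stand-ins for the study's variables, not the Chilean dataset.

    # Sketch: random forest for dropout prediction with class balancing (synthetic data).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n = 1000
    X = np.column_stack([
        rng.normal(5.5, 1.0, n),      # hypothetical secondary education score
        rng.uniform(0, 1, n),         # hypothetical community poverty index
    ])
    y = (0.8 * (X[:, 1] > 0.7) + 0.2 * (X[:, 0] < 4.5) + rng.random(n) * 0.3 > 0.5).astype(int)

    # class_weight="balanced" is one simple way to address the majority/minority imbalance.
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
    print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())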

Journal ArticleDOI
TL;DR: In this paper, the authors classify the existing research into three aspects, i.e., transaction tracing and blockchain address linking, the analysis of collective user behaviors, and the study of individual user behaviors.
Abstract: Cryptocurrencies gain trust in users by publicly disclosing the full creation and transaction history. In return, the transaction history faithfully records the whole spectrum of cryptocurrency user behaviors. This article analyzes and summarizes the existing research on knowledge discovery in cryptocurrency transactions using data mining techniques. Specifically, we classify the existing research into three aspects, i.e., transaction tracing and blockchain address linking, the analysis of collective user behaviors, and the study of individual user behaviors. For each aspect, we present the problems, summarize the methodologies, and discuss major findings in the literature. Furthermore, an enumeration of transaction data parsing and visualization tools and services is also provided. Finally, we outline several gaps and trends for future investigation in this research area.

Journal ArticleDOI
22 Jan 2021 - iScience
TL;DR: This work explores whether materials science knowledge can be automatically inferred from textual information contained in journal papers, and shows, using natural language processing methods, that vector representations trained for every word in the authors' corpus can indeed capture this knowledge in a completely unsupervised manner.

Journal ArticleDOI
TL;DR: This work presents a graph-based approach for representing a complete transactional database, which enables the storing of all relevant information for extracting FIs in one pass, along with an algorithm that extracts the FIs from the graph-based structure.
Abstract: Frequent itemset mining is an active research problem in the domain of data mining and knowledge discovery. With the advances in database technology and an exponential increase in the data to be stored, there is a need for efficient approaches that can quickly extract useful information from such large datasets. Frequent itemset (FI) mining is a data mining task that finds itemsets in a transactional database which occur together above a certain frequency. Finding these FIs usually requires multiple passes over the database, making efficient algorithms crucial for mining FIs. This work presents a graph-based approach for representing a complete transactional database. The proposed graph-based representation enables the storing of all relevant information (for extracting FIs) of the database in one pass. An algorithm that extracts the FIs from the graph-based structure is then presented. Experimental results are reported comparing the proposed approach with 17 related FI mining methods on six benchmark datasets. Results show that the proposed approach performs better than the others in terms of time.
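
For readers new to the task, a tiny, hedged sketch of what frequent itemset mining computes, using brute-force counting over a toy transactional database; the paper's contribution, the single-pass graph representation, is not reproduced here.

    # Brute-force frequent itemset counting on a toy transactional database (illustration only).
    from itertools import combinations
    from collections import Counter

    transactions = [
        {"bread", "milk"},
        {"bread", "milk", "butter"},
        {"milk", "butter"},
        {"bread", "butter"},
    ]
    min_support = 2

    counts = Counter()
    for t in transactions:
        for size in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1

    frequent = {itemset: c for itemset, c in counts.items() if c >= min_support}
    print(frequent)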

Journal ArticleDOI
TL;DR: A comprehensive review of text analytics finds that the ontology- and rule-based approach has been dominant; at the same time, recent research has attempted to apply state-of-the-art machine learning methods.

Journal ArticleDOI
TL;DR: The successful application to omics data illustrates the potential of sparse structured regularization for identifying disease's molecular signatures and for creating high-performance clinical decision support systems towards more personalized healthcare.
Abstract: The development of new molecular and cell technologies is having a significant impact on the quantity of data generated nowadays. The growth of omics databases is creating a considerable potential for knowledge discovery and, concomitantly, is bringing new challenges to statistical learning and computational biology for health applications. Indeed, the high dimensionality of these data may hamper the use of traditional regression methods and parameter estimation algorithms due to the intrinsic non-identifiability of the inherent optimization problem. Regularized optimization has been rising as a promising and useful strategy to solve these ill-posed problems by imposing additional constraints in the solution parameter space. In particular, the field of statistical learning with sparsity has been significantly contributing to building accurate models that also bring interpretability to biological observations and phenomena. Beyond the now-classic elastic net, one of the best-known methods that combine lasso with ridge penalizations, we briefly overview recent literature on structured regularizers and penalty functions that have been applied in biomedical data to build parsimonious models in a variety of underlying contexts, from survival to generalized linear models. These methods include functions of $\ell _k$-norms and network-based penalties that take into account the inherent relationships between the features. The successful application to omics data illustrates the potential of sparse structured regularization for identifying disease's molecular signatures and for creating high-performance clinical decision support systems towards more personalized healthcare. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
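
For reference, a standard formulation of the elastic net mentioned above, with response $y$, design matrix $X$, coefficients $\beta$, and tuning parameters $\lambda_1, \lambda_2 \ge 0$ controlling the lasso and ridge penalties:

    $$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$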

Journal ArticleDOI
TL;DR: KnetMiner is an integrated, intelligent, interactive gene and gene network discovery platform that helps scientists explore and understand the biological stories of complex traits and diseases across species.
Abstract: The generation of new ideas and scientific hypotheses is often the result of extensive literature and database searches, but, with the growing wealth of public and private knowledge, the process of searching diverse and interconnected data to generate new insights into genes, gene networks, traits and diseases is becoming both more complex and more time-consuming. To guide this technically challenging data integration task and to make gene discovery and hypothesis generation easier for researchers, we have developed a comprehensive software package called KnetMiner, which is open-source and containerized for easy use. KnetMiner is an integrated, intelligent, interactive gene and gene network discovery platform that supports scientists in exploring and understanding the biological stories of complex traits and diseases across species. It features fast algorithms for generating rich interactive gene networks and prioritizing candidate genes based on knowledge mining approaches. KnetMiner is used in many plant science institutions and has been adopted by several plant breeding organizations to accelerate gene discovery. The software is generic and customizable and can therefore be readily applied to new species and data types; for example, it has been applied to pest insects and fungal pathogens, and most recently repurposed to support COVID-19 research. Here, we give an overview of the main approaches behind KnetMiner and we report plant-centric case studies for identifying genes, gene networks and trait relationships in Triticum aestivum (bread wheat), as well as an evidence-based approach to rank candidate genes under a large Arabidopsis thaliana QTL. KnetMiner is available at: https://knetminer.org.

Proceedings ArticleDOI
19 Apr 2021
TL;DR: In this paper, Wang et al. propose a novel relational table representation learning approach considering both the intra- and inter-table contextual information, which employs the attention mechanism to adaptively focus on the most informative intra-table cells of the same row or column.
Abstract: Information extraction from semi-structured webpages provides valuable long-tailed facts for augmenting knowledge graphs. Relational Web tables are a critical component containing additional entities and attributes of rich and diverse knowledge. However, extracting knowledge from relational tables is challenging because of sparse contextual information. Existing work linearizes table cells and relies heavily on modifying deep language models such as BERT, which only capture information from related cells in the same table. In this work, we propose a novel relational table representation learning approach considering both the intra- and inter-table contextual information. On one hand, the proposed Table Convolutional Network model employs the attention mechanism to adaptively focus on the most informative intra-table cells of the same row or column; on the other hand, it aggregates inter-table contextual information from various types of implicit connections between cells across different tables. Specifically, we propose three novel aggregation modules for (i) cells of the same value, (ii) cells of the same schema position, and (iii) cells linked to the same page topic. We further devise a supervised multi-task training objective for jointly predicting column type and pairwise column relation, as well as a table cell recovery objective for pre-training. Experiments on real Web table datasets demonstrate that our method outperforms competitive baselines in F1 for both column type prediction and pairwise column relation prediction.

Journal ArticleDOI
TL;DR: To meet the high-efficiency question answering needs of patients and doctors, a system is proposed that integrates medical professional knowledge, knowledge graphs, and a question answering system that conducts man-machine dialogue through natural language.
Abstract: To meet the high-efficiency question answering needs of existing patients and doctors, this system integrates medical professional knowledge, knowledge graphs, and question answering systems that conduct man-machine dialogue through natural language. The system is positioned in the medical field, uses crawler technology with vertical medical websites as data sources, and uses diseases as the core entity to construct a knowledge graph containing 44,000 knowledge entities of 7 types and 300,000 entities of 11 kinds. The graph is stored in the Neo4j graph database, and rule-based matching methods and string-matching algorithms are used to construct a domain lexicon to classify and query questions. This system has specific practical value for medical knowledge graphs and question answering systems.
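
A hedged sketch of how such a Neo4j-backed lookup might be issued from Python using the official neo4j driver; the node labels, relationship type, property names, and connection details are invented for illustration and will differ from the system's actual schema.

    # Illustrative Neo4j lookup for a disease-centred medical knowledge graph (hypothetical schema).
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def symptoms_of(disease_name):
        # Hypothetical labels/relationship: (:Disease)-[:HAS_SYMPTOM]->(:Symptom)
        query = (
            "MATCH (d:Disease {name: $name})-[:HAS_SYMPTOM]->(s:Symptom) "
            "RETURN s.name AS symptom"
        )
        with driver.session() as session:
            return [record["symptom"] for record in session.run(query, name=disease_name)]

    print(symptoms_of("influenza"))
    driver.close()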

Journal ArticleDOI
TL;DR: In this paper, the authors developed a new machine learning-based system for town planners, disaster recovery strategists, and landslide researchers, which revealed hidden knowledge about a range of complex scenarios created from five landslide feature attributes.
Abstract: Understanding the complex dynamics of global landslides is essential for disaster planners to make timely and effective decisions that save lives and reduce the economic impacts on society. Using NASA’s inventory of global landslide data, we developed a new machine learning (ML)–based system for town planners, disaster recovery strategists, and landslide researchers. Our system revealed hidden knowledge about a range of complex scenarios created from five landslide feature attributes. Users of our system can select from a list of $1.295\times {10}^{64}$ possible global landslide scenarios to discover valuable knowledge and predictions about the selected scenario in an interactive manner. Three ML algorithms—anomaly detection, decomposition analysis, and automated regression analysis—are used to elicit detailed knowledge about 25 scenarios selected from 14,532 global landslide records covering 12,220 injuries and 63,573 fatalities across 157 countries. Anomaly detection, logistic regression, and decomposition analysis performed well for all scenarios under study, with the area under the curve averaging 0.951, 0.911, and 0.896, respectively. Moreover, the prediction accuracy of linear regression had a mean absolute percentage error of 0.255. To the best of our knowledge, our scenario-based ML knowledge discovery system is the first of its kind to provide a comprehensive understanding of global landslide data.