
Showing papers on "Knowledge extraction" published in 2021


Journal ArticleDOI
TL;DR: A comprehensive literature review is presented to provide an overview of how machine learning techniques can be applied to realize manufacturing mechanisms with intelligent actions, and several significant research questions that remain unanswered in the recent literature on the same topic are highlighted.
Abstract: Manufacturing organizations need to use different kinds of techniques and tools in order to fulfill their foundational goals. In this regard, using machine learning (ML) and data mining (DM) techniques and tools can be very helpful for dealing with challenges in manufacturing. Therefore, in this paper, a comprehensive literature review is presented to provide an overview of how machine learning techniques can be applied to realize manufacturing mechanisms with intelligent actions. Furthermore, it points to several significant research questions that remain unanswered in the recent literature on the same topic. Our survey aims to provide researchers with a solid understanding of the main approaches and algorithms used to improve manufacturing processes over the past two decades. It presents previous ML studies and recent advances in manufacturing by grouping them under four main subjects: scheduling, monitoring, quality, and failure. It comprehensively discusses existing solutions in manufacturing according to various aspects, including tasks (i.e., clustering, classification, regression), algorithms (i.e., support vector machine, neural network), learning types (i.e., ensemble learning, deep learning), and performance metrics (i.e., accuracy, mean absolute error). Furthermore, the main steps of the knowledge discovery in databases (KDD) process to be followed in manufacturing applications are explained in detail. In addition, some statistics about the current state of the field are given from different perspectives. Finally, it explains the advantages of using machine learning techniques in manufacturing, expresses ways to overcome certain challenges, and offers some possible further research directions.
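
To make the KDD workflow the survey outlines concrete, here is a minimal, hedged sketch in Python: a synthetic (invented) manufacturing-quality dataset is preprocessed, modelled with a support vector machine, and evaluated with accuracy, one of the metrics the review covers.

    # Minimal sketch of a KDD-style workflow on a hypothetical manufacturing-quality dataset.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    # Hypothetical sensor features (e.g., temperature, vibration, pressure) and a pass/fail label.
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Preprocessing + modelling stages of the KDD process combined in a single pipeline.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(X_train, y_train)

    # Evaluation stage: accuracy, one of the performance metrics highlighted in the survey.
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))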

237 citations


Journal ArticleDOI
TL;DR: A comprehensive literature review of major scientific contributions made so far in this research area is undertaken, and a holistic overview of the main applications of Data Science to digital marketing is presented to generate insights related to the creation of innovative Data Mining and knowledge discovery techniques.

196 citations


Journal ArticleDOI
TL;DR: It is argued that if the project is goal-directed and process-driven, the process model view still largely holds, whereas when data science projects become more exploratory, the paths that the project can take become more varied and a more flexible model is called for.
Abstract: CRISP-DM (CRoss-Industry Standard Process for Data Mining) has its origins in the second half of the nineties and is thus about two decades old. According to many surveys and user polls it is still the de facto standard for developing data mining and knowledge discovery projects. However, the field has undoubtedly moved on considerably in twenty years, with data science now being the leading term, favoured over data mining. In this paper we investigate whether, and in what contexts, CRISP-DM is still fit for purpose for data science projects. We argue that if the project is goal-directed and process-driven, the process model view still largely holds. On the other hand, when data science projects become more exploratory, the paths that the project can take become more varied, and a more flexible model is called for. We suggest what the outlines of such a trajectory-based model might look like and how it can be used to categorise data science projects (goal-directed, exploratory or data management). We examine seven real-life exemplars where exploratory activities play an important role and compare them against 51 use cases extracted from the NIST Big Data Public Working Group. We anticipate this categorisation can help project planning in terms of time and cost characteristics.

120 citations


Journal ArticleDOI
TL;DR: It is shown that an LBD (literature-based discovery) approach can be feasible not only for discovering drug candidates for COVID-19, but also for generating mechanistic explanations, and that it can be generalized to other diseases as well as to other clinical questions.

106 citations


Journal ArticleDOI
TL;DR: A new method of structural modeling is utilized to generate the structured derivation relationship, thus completing the natural language knowledge extraction process of the object-oriented knowledge system.
Abstract: With recent technological advances, clustering is increasingly being used in various domains, including natural language recognition. This article contributes to the clustering of natural language and fulfills the requirements for dynamic updating of a knowledge system. It proposes a method of dynamic knowledge extraction based on sentence clustering recognition using a neural network-based framework. The conversion process from natural language papers to an object-oriented knowledge system is studied, considering the related problems of sentence vectorization. The article studies the attributes of sentence vectorization using various basic definitions, judgment theorems, and postprocessing elements. The sentence clustering recognition method uses the concept of pre-reliability as a measure of the credibility of sentence recognition results. An ART2 neural network simulation program is written in MATLAB, and the effect of the neural network on sentence recognition is analysed. A post-reliability evaluation index is used to assess the credibility of the constructed model, and the implementation steps for the conjunctive-rule sentence pattern are introduced in detail. A new method of structural modeling is utilized to generate the structured derivation relationship, thus completing the natural language knowledge extraction process of the object-oriented knowledge system. An application example from mechanical CAD demonstrates the specific implementation and confirms the effectiveness of the proposed method.
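
The paper implements an ART2 network in MATLAB; purely as an illustration of the overall pipeline (vectorize sentences, cluster the vectors, read cluster membership as a recognition result), the sketch below substitutes scikit-learn's k-means and a TF-IDF vectorizer for the paper's ART2 network and sentence-vector attributes. The example sentences are invented.

    # Rough sketch: sentence vectorization followed by clustering (k-means stands in for ART2).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    sentences = [
        "The shaft is connected to the gear.",
        "The gear meshes with the pinion.",
        "If the load exceeds the limit, the clutch disengages.",
        "If the temperature rises, the valve opens.",
    ]

    # Vectorize sentences (the paper defines its own sentence-vector attributes).
    vectors = TfidfVectorizer().fit_transform(sentences)

    # Cluster the sentence vectors; cluster labels play the role of recognised sentence patterns.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for sentence, cluster in zip(sentences, labels):
        print(cluster, sentence)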

101 citations


Journal ArticleDOI
TL;DR: In this paper, a comprehensive review of applications of deep learning in network traffic monitoring and analysis (NTMA) is provided, and the authors discuss key challenges, open issues, and future research directions for using deep learning in NTMA applications.

96 citations


Proceedings ArticleDOI
14 Jan 2021
TL;DR: In this paper, the authors propose a two-stage learning algorithm that leverages knowledge from multiple tasks to solve the problem of catastrophic forgetting and difficulties in dataset balancing, by separating the two stages, i.e., knowledge extraction and knowledge composition.
Abstract: Sequential fine-tuning and multi-task learning are methods aiming to incorporate knowledge from multiple tasks; however, they suffer from catastrophic forgetting and difficulties in dataset balancing. To address these shortcomings, we propose AdapterFusion, a new two-stage learning algorithm that leverages knowledge from multiple tasks. First, in the knowledge extraction stage we learn task-specific parameters called adapters that encapsulate the task-specific information. We then combine the adapters in a separate knowledge composition step. We show that by separating the two stages, i.e., knowledge extraction and knowledge composition, the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner. We empirically evaluate AdapterFusion on 16 diverse NLU tasks, and find that it effectively combines various types of knowledge at different layers of the model. We show that our approach outperforms traditional strategies such as full fine-tuning as well as multi-task learning. Our code and adapters are available at AdapterHub.ml.
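
A minimal, hedged sketch of the knowledge-composition idea: attention over the outputs of several frozen task adapters at a single layer. This is a conceptual PyTorch illustration of the fusion mechanism, not the authors' implementation (which is released via AdapterHub.ml); the dimensions and module structure are assumptions.

    # Conceptual sketch of attention-based fusion over pre-trained adapter outputs (one layer).
    import torch
    import torch.nn as nn

    class FusionLayer(nn.Module):
        def __init__(self, hidden_dim):
            super().__init__()
            self.query = nn.Linear(hidden_dim, hidden_dim)
            self.key = nn.Linear(hidden_dim, hidden_dim)

        def forward(self, layer_input, adapter_outputs):
            # adapter_outputs: (num_adapters, batch, seq, hidden) from frozen task adapters.
            q = self.query(layer_input)                    # (batch, seq, hidden)
            k = self.key(adapter_outputs)                  # (num_adapters, batch, seq, hidden)
            scores = torch.einsum("bsh,absh->abs", q, k)   # attention score per adapter
            weights = torch.softmax(scores, dim=0)         # mix the adapters per token
            return torch.einsum("abs,absh->bsh", weights, adapter_outputs)

    fusion = FusionLayer(hidden_dim=16)
    x = torch.randn(2, 5, 16)             # layer input
    adapters = torch.randn(3, 2, 5, 16)   # outputs of 3 task adapters
    print(fusion(x, adapters).shape)      # torch.Size([2, 5, 16])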

86 citations


Journal ArticleDOI
TL;DR: An efficient algorithm for HUSP mining, called HUSP mining with UL-list (HUSP-ULL), is presented; it utilizes a lexicographic q-sequence (LQS)-tree and a utility-linked (UL)-list structure to quickly discover HUSPs.
Abstract: High-utility sequential pattern (HUSP) mining is an emerging topic in the field of knowledge discovery in databases. It consists of discovering subsequences that have a high utility (importance) in sequences, which are referred to as HUSPs. HUSPs can be applied to many real-life applications, such as market basket analysis, e-commerce recommendations, click-stream analysis, and route planning. Several algorithms have been proposed to efficiently mine utility-based useful sequential patterns. However, due to the combinatorial explosion of the search space for low utility thresholds and large-scale data, the performance of these algorithms is unsatisfactory in terms of runtime and memory usage. Hence, this article proposes an efficient algorithm for the task of HUSP mining, called HUSP mining with UL-list (HUSP-ULL). It utilizes a lexicographic q-sequence (LQS)-tree and a utility-linked (UL)-list structure to quickly discover HUSPs. Furthermore, two pruning strategies are introduced in HUSP-ULL to obtain tight upper bounds on the utility of the candidate sequences and to reduce the search space by pruning unpromising candidates early. Substantial experiments on both real-life and synthetic datasets show that HUSP-ULL can effectively and efficiently discover the complete set of HUSPs and that it outperforms the state-of-the-art algorithms.
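
To make the notion of "utility" concrete, a toy, hedged sketch: in a quantitative sequence database each item carries a purchase quantity and a unit profit, and a sequence's utility is the sum of quantity times profit over its items. The items, profits, and quantities below are invented.

    # Toy illustration of utility computation in a quantitative sequence database.
    unit_profit = {"bread": 1, "milk": 2, "wine": 5}   # hypothetical external utilities

    # One sequence = ordered list of itemsets, each item paired with a purchase quantity.
    sequence = [[("bread", 2), ("milk", 1)], [("wine", 3)]]

    def sequence_utility(seq):
        return sum(qty * unit_profit[item] for itemset in seq for item, qty in itemset)

    # Utility = 2*1 + 1*2 + 3*5 = 19; HUSP mining keeps subsequences whose total utility
    # across the database meets a minimum utility threshold.
    print(sequence_utility(sequence))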

80 citations


Journal ArticleDOI
TL;DR: A new model to identify collective abnormal human behaviors from large pedestrian data in smart cities is introduced, and the results show that the deep learning solution outperforms both data mining and state-of-the-art solutions in terms of runtime and accuracy.

77 citations


Journal ArticleDOI
TL;DR: In this article, a post-processing method for entity recognition that automatically generates a dictionary is proposed, together with a pruning strategy combining the Viterbi algorithm and rules to achieve a higher recognition rate for elementary mathematical entities.
Abstract: Chinese word segmentation is an important research direction in research on elementary mathematics knowledge extraction. The speed of segmentation directly affects subsequent applications, and the accuracy of segmentation directly affects the corresponding research in the next step. Among the machine learning methods for extracting basic mathematical knowledge points, the Conditional Random Field (CRF) model implements new word discovery well and is increasingly used in knowledge extraction for basic mathematics. This article first introduces the traditional CRF process for named entity recognition. Then, an improved algorithm, CRF++, for the conditional random field model is proposed. Since the recognition rate of named entities based on traditional machine learning methods is not high, a post-processing method for entity recognition that automatically generates a dictionary is proposed. After identifying mathematical entities, a pruning strategy combining the Viterbi algorithm with rules is proposed to achieve a higher recognition rate for elementary mathematical entities. Finally, several methods for disambiguation after entity recognition are introduced.
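
Since the pruning strategy builds on Viterbi decoding over label scores, here is a compact, generic Viterbi sketch in Python; the BIO tag set, emission scores, and transition scores are hypothetical and unrelated to the paper's trained CRF.

    # Generic Viterbi decoding over per-position label scores plus transition scores.
    import numpy as np

    tags = ["B", "I", "O"]                      # hypothetical BIO tag set
    emission = np.log(np.array([                # scores for a 4-character input
        [0.6, 0.1, 0.3],
        [0.2, 0.6, 0.2],
        [0.3, 0.2, 0.5],
        [0.5, 0.2, 0.3],
    ]))
    transition = np.log(np.array([              # transition[i, j] = score of tag i -> tag j
        [0.2, 0.6, 0.2],
        [0.2, 0.5, 0.3],
        [0.4, 0.1, 0.5],
    ]))

    def viterbi(emission, transition):
        n, k = emission.shape
        score = emission[0].copy()
        back = np.zeros((n, k), dtype=int)
        for t in range(1, n):
            cand = score[:, None] + transition + emission[t]   # (prev_tag, cur_tag)
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        path = [int(score.argmax())]
        for t in range(n - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return [tags[i] for i in reversed(path)]

    print(viterbi(emission, transition))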

64 citations


Journal ArticleDOI
TL;DR: In this paper, a review of hybrid modeling techniques, associated system identification methodologies and model assessment criteria for chemical and biochemical processes is presented, with a focus on the use of big data and machine learning frameworks.

Journal ArticleDOI
TL;DR: A general data mining-based framework is proposed that can extract typical electricity load patterns (TELPs), discover insightful information hidden in the patterns, and improve the interpretability of clustering results to explain the relations between dynamic influencing factors related to electricity consumption and TELPs.

Proceedings ArticleDOI
TL;DR: In this paper, Wang et al. propose a Knowledge-Preserving Incremental Heterogeneous Graph Neural Network (KPGNN) for incremental social event detection, which models complex social messages into unified social graphs to facilitate data utilization and explores the expressive power of GNNs for knowledge extraction.
Abstract: Social events provide valuable insights into group social behaviors and public concerns and therefore have many applications in fields such as product recommendation and crisis management. The complexity and streaming nature of social messages make it appealing to address social event detection in an incremental learning setting, where acquiring, preserving, and extending knowledge are major concerns. Most existing methods, including those based on incremental clustering and community detection, learn limited amounts of knowledge as they ignore the rich semantics and structural information contained in social data. Moreover, they cannot memorize previously acquired knowledge. In this paper, we propose a novel Knowledge-Preserving Incremental Heterogeneous Graph Neural Network (KPGNN) for incremental social event detection. To acquire more knowledge, KPGNN models complex social messages into unified social graphs to facilitate data utilization and explores the expressive power of GNNs for knowledge extraction. To continuously adapt to the incoming data, KPGNN adopts contrastive loss terms that cope with a changing number of event classes. It also leverages the inductive learning ability of GNNs to efficiently detect events and extends its knowledge from previously unseen data. To deal with large social streams, KPGNN adopts a mini-batch subgraph sampling strategy for scalable training, and periodically removes obsolete data to maintain a dynamic embedding space. KPGNN requires no feature engineering and has few hyperparameters to tune. Extensive experiment results demonstrate the superiority of KPGNN over various baselines.
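
As a rough, hedged illustration of the contrastive idea (pull a message embedding towards messages of the same event and push it away from messages of other events), a generic triplet-style loss in PyTorch is shown below; KPGNN's actual loss terms, which handle a changing number of event classes, are not reproduced here.

    # Generic triplet-style contrastive loss over message embeddings (illustrative only).
    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=1.0):
        # Embeddings of: a message, a message from the same event, a message from another event.
        pos_dist = F.pairwise_distance(anchor, positive)
        neg_dist = F.pairwise_distance(anchor, negative)
        return F.relu(pos_dist - neg_dist + margin).mean()

    anchor = torch.randn(8, 64, requires_grad=True)
    positive = torch.randn(8, 64)
    negative = torch.randn(8, 64)
    loss = triplet_loss(anchor, positive, negative)
    loss.backward()
    print(float(loss))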

Journal ArticleDOI
01 Jul 2021
TL;DR: In this article, the authors present a characterization of different types of KGs along with their construction approaches, and discuss current KG applications, problems, and challenges as well as perspectives for future research.
Abstract: With the extensive growth of data, joined with the thriving development of the Internet in this century, finding or extracting valuable information and knowledge from these huge, noisy data has become harder. The Knowledge Graph (KG) is one of the concepts that has come into public view as a result of this development. In addition, with that development, especially in the last two decades, the need to process and extract valuable information more efficiently has increased. A KG presents a common framework for knowledge representation, based on the analysis and extraction of entities and relationships. Techniques for KG construction can extract information from structured, unstructured or even semi-structured data sources, and finally organize the information into knowledge represented in a graph. This paper presents a characterization of different types of KGs along with their construction approaches. It reviews existing academic, industry and expert KG systems and discusses their features in detail. A systematic review methodology has been followed to conduct the review. Several databases (Scopus, GS, WoS) and journals (SWJ, Applied Ontology, JWS) were analysed to collect relevant studies, which were filtered using inclusion and exclusion criteria. This review includes the state of the art, a literature review, a characterization of KGs, and the knowledge extraction techniques of KGs. In addition, this paper overviews current KG applications, problems, and challenges, and discusses perspectives for future research. The main aim of this paper is to analyse all existing KGs with their features, techniques, applications, problems, and challenges. To the best of our knowledge, such a characterization table covering the most commonly used KGs has not been presented earlier.

Journal ArticleDOI
TL;DR: In this paper, a comprehensive review of data preprocessing techniques for analysing massive building operational data is presented, including missing value imputation, outlier detection, data reduction, data scaling, data transformation, and data partitioning.
Abstract: The rapid development in data science and the increasing availability of building operational data have provided great opportunities for developing data-driven solutions for intelligent building energy management. Data preprocessing serves as the foundation for valid data analyses. It is an indispensable step in building operational data analysis considering the intrinsic complexity of building operations and deficiencies in data quality. Data preprocessing refers to a set of techniques for enhancing the quality of the raw data, such as outlier removal and missing value imputation. This article serves as a comprehensive review of data preprocessing techniques for analysing massive building operational data. A wide variety of data preprocessing techniques are summarised in terms of their applications in missing value imputation, outlier detection, data reduction, data scaling, data transformation, and data partitioning. In addition, three state-of-the-art data science techniques are proposed to tackle practical data challenges in the building field, i.e., data augmentation, transfer learning, and semi-supervised learning. In-depth discussions are presented on the pros and cons of existing preprocessing methods, possible directions for future research, and potential applications in smart building energy management. The insights obtained are helpful for the development of data-driven research in the building field.
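
A small, hedged sketch combining three of the reviewed preprocessing steps (missing value imputation, outlier detection, and data scaling) with scikit-learn and pandas; the column names and values describe a hypothetical building-operation table, not a real dataset.

    # Sketch: imputation, outlier flagging, and scaling on hypothetical building-operation data.
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import IsolationForest

    df = pd.DataFrame({
        "chiller_power_kw": [120.0, 118.5, np.nan, 640.0, 121.2],   # 640.0 is an injected outlier
        "supply_temp_c": [6.8, 7.0, 6.9, 7.1, np.nan],
    })

    # Missing value imputation.
    imputed = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

    # Outlier detection: flag anomalous rows rather than silently dropping them.
    imputed["outlier"] = IsolationForest(contamination=0.2, random_state=0).fit_predict(imputed) == -1

    # Data scaling on the clean rows only.
    clean = imputed.loc[~imputed["outlier"], df.columns]
    scaled = StandardScaler().fit_transform(clean)
    print(scaled.shape)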

Journal ArticleDOI
TL;DR: A novel method, Type-aware Attentive Path Reasoning (TAPR), is proposed to complete the knowledge graph by simultaneously considering KG structural information, textual information, and type information; it uses type-level attention to select the most relevant type of a given entity in a specific triple without any predefined rules or patterns.
Abstract: Knowledge graphs (KGs) often encounter knowledge incompleteness. Path reasoning, which predicts the unknown path relation between pairwise entities based on existing facts, is one of the most promising approaches to knowledge graph completion. However, most conventional path reasoning methods exclusively consider the entity description included in fact triples, ignoring both the type information of entities and the interaction between different semantic representations. In this study, we propose a novel method, Type-aware Attentive Path Reasoning (TAPR), to complete the knowledge graph by simultaneously considering KG structural information, textual information, and type information. More specifically, we first leverage types to enrich the representational learning of entities and relationships. Next, we describe a type-level attention mechanism that selects the most relevant type of a given entity in a specific triple, without any predefined rules or patterns, to reduce the impact of noisy types. After learning the distributed representation of all paths, path-level attention assigns different weights to the paths, from which relations among entity pairs are calculated. We conduct a series of experiments on a real-world dataset to demonstrate the effectiveness of TAPR. Experimental results show that our method significantly outperforms all baselines on link prediction and entity prediction tasks.
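
A hedged sketch of the path-level attention idea: given vector representations of several paths between an entity pair, learn attention weights and aggregate them into a single relation representation. The scoring function and dimensions are illustrative assumptions, not the paper's architecture.

    # Illustrative path-level attention: weight and aggregate path representations.
    import torch
    import torch.nn as nn

    class PathAttention(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.scorer = nn.Linear(dim, 1)

        def forward(self, path_reprs):
            # path_reprs: (num_paths, dim) representations of paths between one entity pair.
            weights = torch.softmax(self.scorer(path_reprs).squeeze(-1), dim=0)
            return weights @ path_reprs    # weighted sum -> relation representation (dim,)

    paths = torch.randn(4, 32)             # 4 reasoning paths, 32-dim each
    aggregated = PathAttention(32)(paths)
    print(aggregated.shape)                # torch.Size([32])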

Posted Content
TL;DR: This work proposes a method for automatically rewriting queries into “BERTese”, a paraphrase query that is directly optimized towards better knowledge extraction, and adds auxiliary loss functions that encourage the query to correspond to actual language tokens.
Abstract: Large pre-trained language models have been shown to encode large amounts of world and commonsense knowledge in their parameters, leading to substantial interest in methods for extracting that knowledge. In past work, knowledge was extracted by taking manually-authored queries and gathering paraphrases for them using a separate pipeline. In this work, we propose a method for automatically rewriting queries into "BERTese", a paraphrase query that is directly optimized towards better knowledge extraction. To encourage meaningful rewrites, we add auxiliary loss functions that encourage the query to correspond to actual language tokens. We empirically show our approach outperforms competing baselines, obviating the need for complex pipelines. Moreover, BERTese provides some insight into the type of language that helps language models perform knowledge extraction.
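
A hedged sketch of the "correspond to actual language tokens" idea: penalise rewritten query vectors by their distance to the nearest vocabulary embedding. This is a generic reconstruction of the auxiliary-loss concept, not the authors' code; the dimensions are invented.

    # Illustrative auxiliary loss: push rewritten query vectors towards real token embeddings.
    import torch

    def nearest_token_loss(query_vectors, vocab_embeddings):
        # query_vectors: (seq, dim); vocab_embeddings: (vocab, dim)
        dists = torch.cdist(query_vectors, vocab_embeddings)   # (seq, vocab)
        return dists.min(dim=1).values.mean()                  # mean distance to nearest token

    vocab = torch.randn(1000, 64)
    rewritten = torch.randn(10, 64, requires_grad=True)
    loss = nearest_token_loss(rewritten, vocab)
    loss.backward()
    print(float(loss))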

Journal ArticleDOI
TL;DR: In this article, the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency are summarized and examined.
Abstract: In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic and semantic tasks like text classification, machine translation, cognitive dialogue systems, information retrieval via Natural Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal Transformer architecture, leading to designs such as BERT, GPT (I, II, III), etc. Although these large-size models have achieved unprecedented performance, they come at high computational costs. Consequently, some of the recent NLP architectures have utilized concepts of transfer learning, pruning, quantization, and knowledge distillation to achieve moderate model sizes while keeping nearly similar performance to that achieved by their predecessors. Additionally, to mitigate the data size challenge raised by language models from a knowledge extraction perspective, Knowledge Retrievers have been built to extract explicit data documents from a large corpus of databases with greater efficiency and accuracy. Recent research has also focused on superior inference by providing efficient attention to longer input sequences. In this paper, we summarize and examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency. We provide a detailed understanding and functioning of the different architectures, a taxonomy of NLP designs, comparative evaluations, and future directions in NLP.

Journal ArticleDOI
02 Jan 2021
TL;DR: This communication aims to review the works involving knowledge discovery in catalysis using ML techniques to see patterns, develop models for prediction and deduce heuristic rules for the future.
Abstract: The use of machine learning (ML) in catalysis has been significantly increased in recent years due to the astonishing developments in data processing technologies and the accumulation of a large am...

Journal ArticleDOI
TL;DR: A four-stage MapReduce framework based solely on the well-known Spark platform is presented for high-utility sequential pattern mining, and is shown to deliver faster and more efficient mining when dealing with large data sets.
Abstract: The concepts of sequential pattern mining have become a growing topic in data mining, finding a home most recently in the Internet of Things (IoT), where large volumes of data are presented by the second for analysis and knowledge extraction. One key topic within the realm of sequential pattern mining is high-utility sequential pattern mining (HUSPM). HUSPM takes into account the fusion of utility and sequence factors to assist in the determination of sequential patterns of high utility from databases and data sources. That being said, almost all existing literature focuses on using only a single machine to increase mining performance. In this work, we present a four-stage MapReduce framework based solely on the well-known Spark platform for use in HUSPM. This framework is shown to provide more efficient and faster mining performance when dealing with large data sets. It consists of four phases (initialization, mining, updating, and generation) to handle big data sets based on the MapReduce framework running on the Spark platform. Experiments indicate that the designed model is capable of handling very big data sets, while state-of-the-art algorithms only achieve good performance on small data sets.
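
A heavily simplified, hedged PySpark sketch of the map/reduce flavour of such a framework: candidate patterns with local utilities are aggregated by key and filtered against a minimum utility threshold. The real four-phase algorithm with its pruning strategies is far more involved; the data and names below are illustrative.

    # Heavily simplified Spark map/reduce sketch: sum local utilities of candidate patterns.
    from pyspark import SparkContext

    sc = SparkContext(appName="husp-sketch")

    # Each record: (candidate_pattern, local_utility) emitted by a mining/map phase (toy data).
    candidates = sc.parallelize([
        (("a",), 7), (("a", "b"), 12), (("a",), 4), (("b",), 9), (("a", "b"), 6),
    ])

    min_utility = 10
    high_utility = (candidates
                    .reduceByKey(lambda u1, u2: u1 + u2)          # updating phase: global utility
                    .filter(lambda kv: kv[1] >= min_utility))      # generation phase: thresholding

    print(high_utility.collect())
    sc.stop()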

Journal ArticleDOI
20 Apr 2021 - Entropy
TL;DR: In this paper, the authors used decision trees, k-nearest neighbors, logistic regression, naive Bayes, random forest, and support vector machines to predict student retention at each of three levels during their first, second, and third years of study.
Abstract: Data mining is employed to extract useful information and to detect patterns from often large data sets, and is closely related to knowledge discovery in databases and data science. In this investigation, we formulate models based on machine learning algorithms to extract relevant information predicting student retention at various levels, using higher education data and specifying the relevant variables involved in the modeling. Then, we utilize this information to help the process of knowledge discovery. We predict student retention at each of three levels during their first, second, and third years of study, obtaining models with an accuracy that exceeds 80% in all scenarios. These models allow us to adequately predict the level at which dropout occurs. The machine learning algorithms used in this work are: decision trees, k-nearest neighbors, logistic regression, naive Bayes, random forest, and support vector machines, of which the random forest technique performs the best. We find that secondary education score and the community poverty index are important predictive variables, which have not been previously reported in educational studies of this type. The dropout assessment at various levels reported here is valid for higher education institutions around the world with conditions similar to the Chilean case, where dropout rates affect the efficiency of such institutions. Having the ability to predict dropout based on student data enables these institutions to take preventive measures and avoid dropouts. In the case study, balancing the majority and minority classes improves the performance of the algorithms.
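
A hedged sketch of the modelling setup described (a random forest with class balancing, evaluated by cross-validation); the two features and the data are synthetic stand-ins for the study's variables, not the Chilean dataset.

    # Sketch: random forest for dropout prediction with class balancing (synthetic data).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n = 1000
    X = np.column_stack([
        rng.normal(5.5, 1.0, n),      # hypothetical secondary education score
        rng.uniform(0, 1, n),         # hypothetical community poverty index
    ])
    y = (0.8 * (X[:, 1] > 0.7) + 0.2 * (X[:, 0] < 4.5) + rng.random(n) * 0.3 > 0.5).astype(int)

    # class_weight="balanced" is one simple way to address the majority/minority imbalance.
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
    print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())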

Journal ArticleDOI
TL;DR: In this paper, the authors classify the existing research into three aspects, i.e., transaction tracing and blockchain address linking, the analysis of collective user behaviors, and the study of individual user behaviors.
Abstract: Cryptocurrencies gain trust in users by publicly disclosing the full creation and transaction history. In return, the transaction history faithfully records the whole spectrum of cryptocurrency user behaviors. This article analyzes and summarizes the existing research on knowledge discovery in cryptocurrency transactions using data mining techniques. Specifically, we classify the existing research into three aspects, i.e., transaction tracing and blockchain address linking, the analysis of collective user behaviors, and the study of individual user behaviors. For each aspect, we present the problems, summarize the methodologies, and discuss major findings in the literature. Furthermore, an enumeration of transaction data parsing and visualization tools and services is also provided. Finally, we outline several gaps and trends for future investigation in this research area.

Journal ArticleDOI
22 Jan 2021 - iScience
TL;DR: This work explores whether materials science knowledge can be automatically inferred from textual information contained in journal papers, and shows, using natural language processing methods, that vector representations trained for every word in the authors' corpus can indeed capture this knowledge in a completely unsupervised manner.

Journal ArticleDOI
TL;DR: This work presents a graph-based approach for representing a complete transactional database, which enables the storing of all relevant information for extracting FIs in one pass, along with an algorithm that extracts the FIs from the graph-based structure.
Abstract: Frequent itemset mining is an active research problem in the domain of data mining and knowledge discovery. With the advances in database technology and an exponential increase in the data to be stored, there is a need for efficient approaches that can quickly extract useful information from such large datasets. Frequent itemset (FI) mining is a data mining task that finds itemsets in a transactional database which occur together above a certain frequency. Finding these FIs usually requires multiple passes over the database, making efficient algorithms crucial for mining FIs. This work presents a graph-based approach for representing a complete transactional database. The proposed graph-based representation enables the storing of all relevant information (for extracting FIs) of the database in one pass. An algorithm that extracts the FIs from the graph-based structure is then presented. Experimental results are reported comparing the proposed approach with 17 related FI mining methods on six benchmark datasets. Results show that the proposed approach performs better than the others in terms of time.
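
For readers new to the task, a tiny, hedged sketch of what frequent itemset mining computes, using brute-force counting over a toy transactional database; the paper's contribution, the single-pass graph representation, is not reproduced here.

    # Brute-force frequent itemset counting on a toy transactional database (illustration only).
    from itertools import combinations
    from collections import Counter

    transactions = [
        {"bread", "milk"},
        {"bread", "milk", "butter"},
        {"milk", "butter"},
        {"bread", "butter"},
    ]
    min_support = 2

    counts = Counter()
    for t in transactions:
        for size in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1

    frequent = {itemset: c for itemset, c in counts.items() if c >= min_support}
    print(frequent)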

Journal ArticleDOI
TL;DR: A comprehensive review of text analytics finds that the ontology- and rule-based approach has been dominant; at the same time, recent research has attempted to apply state-of-the-art machine learning methods.

Journal ArticleDOI
TL;DR: The successful application to omics data illustrates the potential of sparse structured regularization for identifying disease's molecular signatures and for creating high-performance clinical decision support systems towards more personalized healthcare.
Abstract: The development of new molecular and cell technologies is having a significant impact on the quantity of data generated nowadays. The growth of omics databases is creating a considerable potential for knowledge discovery and, concomitantly, is bringing new challenges to statistical learning and computational biology for health applications. Indeed, the high dimensionality of these data may hamper the use of traditional regression methods and parameter estimation algorithms due to the intrinsic non-identifiability of the inherent optimization problem. Regularized optimization has been rising as a promising and useful strategy to solve these ill-posed problems by imposing additional constraints in the solution parameter space. In particular, the field of statistical learning with sparsity has been significantly contributing to building accurate models that also bring interpretability to biological observations and phenomena. Beyond the now-classic elastic net, one of the best-known methods that combine lasso with ridge penalizations, we briefly overview recent literature on structured regularizers and penalty functions that have been applied in biomedical data to build parsimonious models in a variety of underlying contexts, from survival to generalized linear models. These methods include functions of $\ell _k$-norms and network-based penalties that take into account the inherent relationships between the features. The successful application to omics data illustrates the potential of sparse structured regularization for identifying disease's molecular signatures and for creating high-performance clinical decision support systems towards more personalized healthcare. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
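
For reference, a standard formulation of the elastic net mentioned above, with response $y$, design matrix $X$, coefficients $\beta$, and tuning parameters $\lambda_1, \lambda_2 \ge 0$ controlling the lasso and ridge penalties:

    $$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$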

Journal ArticleDOI
TL;DR: KnetMiner is an integrated, intelligent, interactive gene and gene network discovery platform that helps scientists explore and understand the biological stories of complex traits and diseases across species.
Abstract: The generation of new ideas and scientific hypotheses is often the result of extensive literature and database searches, but, with the growing wealth of public and private knowledge, the process of searching diverse and interconnected data to generate new insights into genes, gene networks, traits and diseases is becoming both more complex and more time-consuming. To guide this technically challenging data integration task and to make gene discovery and hypothesis generation easier for researchers, we have developed a comprehensive software package called KnetMiner, which is open-source and containerized for easy use. KnetMiner is an integrated, intelligent, interactive gene and gene network discovery platform that supports scientists in exploring and understanding the biological stories of complex traits and diseases across species. It features fast algorithms for generating rich interactive gene networks and prioritizing candidate genes based on knowledge mining approaches. KnetMiner is used in many plant science institutions and has been adopted by several plant breeding organizations to accelerate gene discovery. The software is generic and customizable and can therefore be readily applied to new species and data types; for example, it has been applied to pest insects and fungal pathogens, and most recently repurposed to support COVID-19 research. Here, we give an overview of the main approaches behind KnetMiner and we report plant-centric case studies for identifying genes, gene networks and trait relationships in Triticum aestivum (bread wheat), as well as an evidence-based approach to rank candidate genes under a large Arabidopsis thaliana QTL. KnetMiner is available at: https://knetminer.org.

Proceedings ArticleDOI
19 Apr 2021
TL;DR: In this paper, Wang et al. propose a novel relational table representation learning approach considering both the intra- and inter-table contextual information, which employs the attention mechanism to adaptively focus on the most informative intra-table cells of the same row or column.
Abstract: Information extraction from semi-structured webpages provides valuable long-tailed facts for augmenting knowledge graphs. Relational Web tables are a critical component containing additional entities and attributes of rich and diverse knowledge. However, extracting knowledge from relational tables is challenging because of sparse contextual information. Existing work linearizes table cells and relies heavily on modifying deep language models such as BERT, which only capture information from related cells in the same table. In this work, we propose a novel relational table representation learning approach considering both the intra- and inter-table contextual information. On one hand, the proposed Table Convolutional Network model employs the attention mechanism to adaptively focus on the most informative intra-table cells of the same row or column; on the other hand, it aggregates inter-table contextual information from various types of implicit connections between cells across different tables. Specifically, we propose three novel aggregation modules for (i) cells of the same value, (ii) cells of the same schema position, and (iii) cells linked to the same page topic. We further devise a supervised multi-task training objective for jointly predicting column type and pairwise column relation, as well as a table cell recovery objective for pre-training. Experiments on real Web table datasets demonstrate that our method outperforms competitive baselines in F1 for both column type prediction and pairwise column relation prediction.

Journal ArticleDOI
TL;DR: To meet the high-efficiency question answering needs of patients and doctors, a system is proposed that integrates medical professional knowledge, knowledge graphs, and a question answering system that conducts man-machine dialogue through natural language.
Abstract: To meet the high-efficiency question answering needs of existing patients and doctors, this system integrates medical professional knowledge, knowledge graphs, and question answering systems that conduct man-machine dialogue through natural language. The system is positioned in the medical field, uses crawler technology with vertical medical websites as data sources, and uses diseases as the core entity to construct a knowledge graph containing 44,000 knowledge entities of 7 types and 300,000 entities of 11 kinds. The graph is stored in the Neo4j graph database, and rule-based matching methods and string-matching algorithms are used to construct a domain lexicon to classify and query questions. This system has specific practical value for medical knowledge graphs and question answering systems.
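
A hedged sketch of how such a Neo4j-backed lookup might be issued from Python using the official neo4j driver; the node labels, relationship type, property names, and connection details are invented for illustration and will differ from the system's actual schema.

    # Illustrative Neo4j lookup for a disease-centred medical knowledge graph (hypothetical schema).
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def symptoms_of(disease_name):
        # Hypothetical labels/relationship: (:Disease)-[:HAS_SYMPTOM]->(:Symptom)
        query = (
            "MATCH (d:Disease {name: $name})-[:HAS_SYMPTOM]->(s:Symptom) "
            "RETURN s.name AS symptom"
        )
        with driver.session() as session:
            return [record["symptom"] for record in session.run(query, name=disease_name)]

    print(symptoms_of("influenza"))
    driver.close()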

Journal ArticleDOI
TL;DR: In this paper, the authors developed a new machine learning-based system for town planners, disaster recovery strategists, and landslide researchers, which revealed hidden knowledge about a range of complex scenarios created from five landslide feature attributes.
Abstract: Understanding the complex dynamics of global landslides is essential for disaster planners to make timely and effective decisions that save lives and reduce the economic impacts on society. Using NASA’s inventory of global landslide data, we developed a new machine learning (ML)–based system for town planners, disaster recovery strategists, and landslide researchers. Our system revealed hidden knowledge about a range of complex scenarios created from five landslide feature attributes. Users of our system can select from a list of $1.295\times {10}^{64}$ possible global landslide scenarios to discover valuable knowledge and predictions about the selected scenario in an interactive manner. Three ML algorithms—anomaly detection, decomposition analysis, and automated regression analysis—are used to elicit detailed knowledge about 25 scenarios selected from 14,532 global landslide records covering 12,220 injuries and 63,573 fatalities across 157 countries. Anomaly detection, logistic regression, and decomposition analysis performed well for all scenarios under study, with the area under the curve averaging 0.951, 0.911, and 0.896, respectively. Moreover, the prediction accuracy of linear regression had a mean absolute percentage error of 0.255. To the best of our knowledge, our scenario-based ML knowledge discovery system is the first of its kind to provide a comprehensive understanding of global landslide data.