
Showing papers in "Knowledge and Information Systems in 2021"


Journal ArticleDOI
TL;DR: In this article, the authors conduct a systematic overview of the latest studies on model complexity in deep learning, discuss its applications to understanding model generalization, model optimization, and model selection and design, and propose several interesting future directions.
Abstract: Model complexity is a fundamental problem in deep learning. In this paper, we conduct a systematic overview of the latest studies on model complexity in deep learning. Model complexity of deep learning can be categorized into expressive capacity and effective model complexity. We review the existing studies on those two categories along four important factors, including model framework, model size, optimization process, and data complexity. We also discuss the applications of deep learning model complexity including understanding model generalization, model optimization, and model selection and design. We conclude by proposing several interesting future directions.

71 citations


Journal ArticleDOI
TL;DR: An up-to-date systematic review of recommender systems applied to e-Recruitment, considering only papers published from 2012 up to 2020, shows a clear trend toward hybrid and non-traditional techniques to overcome the challenges of the e-Recruitment domain.
Abstract: Recommender Systems (RS) are a subclass of information filtering systems that seek to predict the rating or preference a user would give to an item. e-Recruitment is one of the domains in which RS can contribute, as they can present a list of interesting jobs to a candidate or a list of candidates to a recruiter. This study presents an up-to-date systematic review of recommender systems applied to e-Recruitment, considering only papers published from 2012 up to 2020. We searched three databases for published journal articles, conference papers and book chapters. We then evaluated these works in terms of which kinds of RS were applied for e-Recruitment, what kind of information was used in the e-Recruitment RS, and how they were assessed. A total of 896 papers were collected, out of which sixty-three research works were included in the survey based on the inclusion and exclusion criteria adopted. We divided the recommender types into five categories (Content-Based Recommendation 26.98%; Collaborative Filtering 6.35%; Knowledge-Based Recommendation 12.7%; Hybrid approaches 20.63%; and Other Types 33.33%); the types of information used were divided into four categories (Social Network 38.1%; Resumes and Job Posts 42.85%; Behavior or Feedback 12.7%; and Others 6.35%); and the assessment types were categorized into four types (Expert Validation 20.83%; Machine Learning Metrics 41.67%; Challenge-specific Metrics 22.92%; and Utility Measures 14.58%). Although in many cases a paper may belong to more than one category on each evaluation axis, we chose the most predominant one for our categorization. In addition, there is a clear trend toward hybrid and non-traditional techniques to overcome the challenges of the e-Recruitment domain.

33 citations


Journal ArticleDOI
TL;DR: The SSTPMF model performs better in alleviating the cold start problem than state-of-the-art methods in terms of normalized discounted cumulative gain on both data sets, and the results obtained from two real data sets show that taking POI correlation and user similarity into account can further improve recommendation performance.
Abstract: In recent years, point of interest (POI) recommendation has gained increasing attention all over the world. POI recommendation plays an indispensable role in assisting people to find places they are likely to enjoy. Existing models exploit POI recommendation inadequately because they overlook implicit correlations among users and POIs and suffer from the cold start problem. To overcome these problems, this work proposes a social spatio-temporal probabilistic matrix factorization (SSTPMF) model that exploits POI similarity and user similarity and integrates different spaces, including the social space, geographical space and POI category space, in similarity modelling. In other words, this model proposes a multivariable inference approach for POI recommendation using latent similarity factors. The results obtained from two real data sets, Foursquare and Gowalla, show that taking POI correlation and user similarity into account can further improve recommendation performance. In addition, the experimental results show that the SSTPMF model performs better in alleviating the cold start problem than state-of-the-art methods in terms of normalized discounted cumulative gain on both data sets.

32 citations
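
Since results are reported in normalized discounted cumulative gain, it may help to see the metric itself; a minimal NumPy sketch (illustrative only, not the authors' evaluation code):

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@k for one user: `relevances` lists the graded relevance of the
    recommended items in ranked order (e.g., 1 if the POI was visited)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts[:ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 1, 0, 0], k=5))  # ranked hits at positions 1 and 3
```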


Journal ArticleDOI
TL;DR: An AMT-IRE (short for Attentive Multi-Task learning-based group Itinerary REcommendation) framework, which can dynamically learn the inner relations between group members and obtain consensus group preferences via the attention mechanism is proposed.
Abstract: Tourism is one of the largest service industries and a popular leisure activity participated in by people with friends or family. A significant problem faced by tourists is how to plan sequences of points of interest (POIs) that maintain a balance between the group preferences and the given temporal and spatial constraints. Most traditional group itinerary recommendation methods adopt predefined preference aggregation strategies without considering the group members' distinctive characteristics and inner relations. Besides, POI textual information is beneficial for capturing overall group preferences but is rarely considered. With these concerns in mind, this paper proposes an AMT-IRE (short for Attentive Multi-Task learning-based group Itinerary REcommendation) framework, which can dynamically learn the inner relations between group members and obtain consensus group preferences via the attention mechanism. Meanwhile, AMT-IRE integrates POI categories and POI textual information via another attention network. Finally, the group preferences are used in a variant of the orienteering problem to recommend group itineraries. Extensive experiments on six datasets validate the effectiveness of AMT-IRE.

21 citations
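
The consensus-preference idea, aggregating member embeddings with attention weights rather than a fixed strategy such as averaging, can be illustrated in a few lines; the embeddings and scoring vector below are made up, not AMT-IRE's learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
members = rng.normal(size=(4, 16))   # 4 group members, 16-dim preference embeddings
w = rng.normal(size=16)              # hypothetical attention scoring vector

scores = members @ w                 # one importance score per member
alpha = softmax(scores)              # attention weights sum to 1
group_pref = alpha @ members         # weighted consensus preference (16-dim)
print(alpha.round(3), group_pref.shape)
```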


Journal ArticleDOI
TL;DR: In this paper, a unified model that jointly learns user's and POI dynamics is presented, termed RELINE (REcommendations with muLtIple Network Embeddings), by embedding eight relational graphs into one shared latent space.
Abstract: The rapid growth of users' involvement in Location-Based Social Networks has led to the expeditious growth of the data on a global scale. The need to access and retrieve relevant information close to users' preferences is an open problem which continuously raises new challenges for recommendation systems. Existing models exploit point-of-interest (POI) recommendation inadequately due to the sparsity and cold start problems. To overcome these problems, many models have been proposed in the literature; however, most of them ignore important factors such as geographical proximity, social influence, or temporal and preference dynamics, which affects their accuracy when personalizing recommendations. In this work, we investigate these problems and present a unified model that jointly learns user and POI dynamics. Our proposal is termed RELINE (REcommendations with muLtIple Network Embeddings). More specifically, RELINE captures (i) the social, (ii) the geographical, and (iii) the temporal influence, as well as (iv) the users' preference dynamics, by embedding eight relational graphs into one shared latent space. We have evaluated our approach against state-of-the-art methods on three large real-world datasets in terms of accuracy. Additionally, we have examined the effectiveness of our approach against the cold-start problem. Performance evaluation results demonstrate that significant performance improvement is achieved in comparison to existing state-of-the-art methods.

19 citations


Journal ArticleDOI
TL;DR: The proposed NCTR model is built upon a hybrid neural network framework with fine-grained modeling of latent representations and nonlinear feature interactions for rating prediction; experiments show that NCTR significantly outperforms several state-of-the-art recommendation methods.
Abstract: Collaborative filtering (CF) is a common method used by many recommender systems. Traditional CF algorithms exploit users' ratings as the sole information source to learn user preferences. However, ratings are usually sparse, which seriously degrades recommendation results. Most existing CF algorithms use ratings and textual information to alleviate the sparsity of data and then utilize matrix factorization to achieve the latent feature interactions for rating prediction. Nevertheless, the following shortcomings remain in these studies: (1) the word orders and surrounding words of the textual information are ignored; (2) the nonlinearity of feature interactions is seldom exploited. Therefore, we propose a novel hybrid neural network to combine textual information and rating (NCTR) information for item recommendation. The proposed NCTR model is built upon a hybrid neural network framework with fine-grained modeling of latent representations and nonlinear feature interactions for rating prediction. Specifically, a convolutional neural network is applied to effectively extract contextual features from textual information. Meanwhile, a fusion layer is exploited to combine features, and multilayer perceptrons are used to model the nonlinear interactions between the merged item latent features and user latent features. Experimental results over five real-world datasets show that NCTR significantly outperforms several state-of-the-art recommendation methods. Source code is available at https://github.com/luojia527/NCTR_master.

19 citations
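
A rough PyTorch approximation of the described architecture (CNN over review text, a fusion layer concatenating text/item/user features, then an MLP producing the rating) may clarify the data flow; all layer sizes are assumptions, and the released code linked above is the authoritative version:

```python
import torch
import torch.nn as nn

class NCTRSketch(nn.Module):
    def __init__(self, n_users, n_items, vocab, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.word_emb = nn.Embedding(vocab, dim)
        # CNN extracts contextual features from the item's review text
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, user, item, text):             # text: (batch, seq_len)
        words = self.word_emb(text).transpose(1, 2)  # (batch, dim, seq)
        text_feat = torch.relu(self.conv(words)).max(dim=2).values
        # fusion layer: concatenate text, item, and user latent features
        fused = torch.cat([text_feat, self.item_emb(item),
                           self.user_emb(user)], dim=1)
        return self.mlp(fused).squeeze(-1)           # predicted rating

model = NCTRSketch(n_users=100, n_items=50, vocab=1000)
r = model(torch.tensor([3]), torch.tensor([7]), torch.randint(0, 1000, (1, 20)))
```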


Journal ArticleDOI
TL;DR: Comparative evaluations between ARMR and baselines show that ARMR outperforms all baselines in terms of medication recommendation, achieving DDI reduction regardless of the number of DDI types considered.
Abstract: Medication recommendation is attracting enormous attention due to its promise in effectively prescribing medicines and improving the survival rate of patients. Among all challenges, drug–drug interactions (DDI) related to undesired duplication, antagonism, or alternation between drugs could lead to fatal side effects. Previous studies usually provide models with DDI knowledge to achieve DDI reduction. However, mixing patients with different DDI rates places stringent requirements on the generalization performance of models. In pursuit of a more effective method, we propose the adversarially regularized model for medication recommendation (ARMR). Specifically, ARMR first models temporal information from medical records to obtain patient representations and builds a key-value memory network based on information from historical admissions. Then, ARMR carries out multi-hop reading on the memory network to recommend medications. Meanwhile, ARMR uses a GAN model to adversarially regularize the distribution of patient representations by matching it to a desired Gaussian distribution to achieve DDI reduction. Comparative evaluations between ARMR and baselines show that ARMR outperforms all baselines in terms of medication recommendation, achieving DDI reduction regardless of the number of DDI types considered.

19 citations
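
The adversarial regularization step, matching the distribution of patient representations to a Gaussian prior with a GAN-style discriminator, can be sketched as follows; the shapes and the discriminator are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()

patient_repr = torch.randn(128, 64, requires_grad=True)  # encoder output (assumed 64-dim)
prior = torch.randn(128, 64)                             # samples from the Gaussian prior

# Discriminator: tell prior samples (label 1) from patient representations (label 0)
d_loss = bce(disc(prior), torch.ones(128, 1)) + \
         bce(disc(patient_repr.detach()), torch.zeros(128, 1))
# Encoder regularizer: fool the discriminator into scoring representations as prior
g_loss = bce(disc(patient_repr), torch.ones(128, 1))
```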


Journal ArticleDOI
TL;DR: In this paper, the authors focus on two causal inference tasks, i.e., treatment effect estimation and causal discovery for time series data, and provide a comprehensive review of the approaches in each task.
Abstract: Time series data are collections of chronological observations generated in domains such as medicine and finance. Over the years, different tasks such as classification, forecasting and clustering have been proposed to analyze this type of data. Time series data have also been used to study the effect of interventions over time. Moreover, in many fields of science, learning the causal structure of dynamic systems and time series data is considered an interesting task which plays an important role in scientific discoveries. Estimating the effect of an intervention and identifying the causal relations from the data can be performed via causal inference. Existing surveys on time series discuss traditional tasks such as classification and forecasting or explain the details of the approaches proposed to solve a specific task. In this paper, we focus on two causal inference tasks, i.e., treatment effect estimation and causal discovery for time series data, and provide a comprehensive review of the approaches in each task. Furthermore, we curate a list of commonly used evaluation metrics and datasets for each task and provide in-depth insight. These metrics and datasets can serve as benchmarks for research in the field.

17 citations


Journal ArticleDOI
TL;DR: A novel framework called CANE is proposed to simultaneously learn the node representations and identify the network communities and achieves substantial performance gains over state-of-the-art baselines in various applications including link prediction, node classification, recommendation, network visualization, and community detection.
Abstract: Network embedding aims to learn a low-dimensional representation vector for each node while preserving the inherent structural properties of the network, which could benefit various downstream mining tasks such as link prediction and node classification. Most existing works can be considered as generative models that approximate the underlying node connectivity distribution in the network, or as discriminative models that predict edge existence under a specific discriminative task. Although several recent works try to unify the two types of models with adversarial learning to improve performance, they only consider the local pairwise connectivity between nodes. Higher-order structural information such as communities, which essentially reflects the global topology structure of the network, is largely ignored. To this end, we propose a novel framework called CANE to simultaneously learn the node representations and identify the network communities. The two tasks are integrated and mutually reinforce each other under a novel adversarial learning framework. Specifically, with the detected communities, CANE jointly minimizes the pairwise connectivity loss and the community assignment error to improve node representation learning. In turn, the learned node representations provide high-quality features to facilitate community detection. Experimental results on multiple real datasets demonstrate that CANE achieves substantial performance gains over state-of-the-art baselines in various applications including link prediction, node classification, recommendation, network visualization, and community detection.

17 citations


Journal ArticleDOI
TL;DR: This paper proposes a framework that introduces keyword search into natural language question answering to compensate for its defects, and confirms that NLQSK can answer more questions than existing state-of-the-art question answering systems.
Abstract: Natural language question answering over knowledge graphs has received widespread attention. However, existing methods aim to improve every phase of natural language question answering while neglecting a key defect: not all query intentions can be identified and mapped to the correct SPARQL statement. In contrast, keyword search relies on the links among multiple keywords regardless of the exact logical relations in the question. Therefore, we propose a framework (abbreviated as NLQSK) that introduces keyword search into natural language question answering to compensate for the defect mentioned above. First, we translate a natural language question into top-k SPARQL statements by using existing methods. Second, we transform the valuable information that cannot be identified and mapped into keywords, and then return the neighboring information in the knowledge graph by keyword index. Third, we combine the SPARQL block (i.e., the SPARQL statement and its result) and keyword search to produce the answer to the natural language question. Finally, experiments on the benchmark dataset confirm that keyword search can compensate for the defects of natural language question answering, and that NLQSK can answer more questions than existing state-of-the-art question answering systems.

17 citations
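
The flow the abstract describes (try top-k SPARQL statements, fall back to keyword search over the knowledge graph for unmapped intentions, then combine) might look roughly like this; every helper and the toy graph below are hypothetical stand-ins, not the paper's components:

```python
def translate_to_sparql(question, top_k):
    # Stand-in for the paper's top-k SPARQL generation: returns candidate
    # (statement, unmapped-phrase) pairs. Purely illustrative.
    return [("SELECT ?x WHERE { ?x :bornIn :Berlin }", ["physicist"])][:top_k]

def keyword_lookup(phrases, kg):
    # Stand-in keyword index: entities whose neighbourhood matches the phrases.
    return {e for e in kg if any(p in kg[e] for p in phrases)}

kg = {"Einstein": "physicist bornIn Berlin", "Kafka": "writer bornIn Prague"}

def answer(question):
    for sparql, unmapped in translate_to_sparql(question, top_k=5):
        partial = {"Einstein", "Kafka"}          # pretend SPARQL block result
        if not unmapped:
            return partial
        # keyword search compensates for intentions that were not mapped
        narrowed = partial & keyword_lookup(unmapped, kg)
        if narrowed:
            return narrowed
    return None

print(answer("Which physicist was born in Berlin?"))  # {'Einstein'}
```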


Journal ArticleDOI
TL;DR: A new categorization of concept drifts for class imbalanced problems is put forward; experiments reveal the high influence of the newly considered factors and their local drifts, as well as differences in existing classifiers' reactions to such factors.
Abstract: Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and the presence of unsafe types of examples. Despite often being present in real-world data, the interactions between concept drifts and local data difficulty factors have not yet been investigated in concept drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations, on the predictions of representative online classifiers. Experimental results reveal the high influence of the newly considered factors and their local drifts, as well as differences in existing classifiers' reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address challenges posed by imbalanced data streams.
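
As a concrete illustration of the studied setting (not the paper's stream generators), a toy binary stream with a rare minority class whose region drifts gradually can be produced in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

def stream(n=10_000, imbalance=0.05):
    """Yield (x, y): the minority class (y=1) is rare and its mean drifts."""
    for t in range(n):
        y = int(rng.random() < imbalance)
        drift = np.array([2.0 * t / n, 0.0])   # gradual drift of the minority mean
        mean = (np.array([1.0, 1.0]) + drift) if y else np.array([-1.0, -1.0])
        yield rng.normal(mean, 1.0), y

xs, ys = zip(*stream(1000))
print(f"minority fraction: {np.mean(ys):.3f}")
```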

Journal ArticleDOI
TL;DR: This paper presents parallel solutions for evaluating k nearest neighbor queries on large databases of time series, compares them based on various measures of quality and time performance, and offers a tool that uses the characteristics of application data to determine which algorithm to choose for that application and how to set the parameters for that algorithm.
Abstract: This paper presents parallel solutions (developed based on two state-of-the-art algorithms, iSAX and sketch) for evaluating k nearest neighbor queries on large databases of time series, compares them based on various measures of quality and time performance, and offers a tool that uses the characteristics of application data to determine which algorithm to choose for that application and how to set its parameters. Specifically, our experiments show that: (i) iSAX and its derivatives perform best in both time and quality when the time series can be characterized by a few low-frequency Fourier coefficients, a regime where the iSAX pruning approach works well; (ii) iSAX performs significantly less well when high-frequency Fourier coefficients carry much of the energy of the time series; (iii) a random projection approach based on sketches, by contrast, is more or less independent of the frequency power spectrum. The experiments show the close relationship between pruning ratio and time for exact iSAX, as well as between pruning ratio and the quality of approximate iSAX. Our toolkit analyzes typical time series of an application (i) to determine optimal segment sizes for iSAX and (ii) to decide when to use Parallel Sketches instead of iSAX. Our algorithms have been implemented using Spark, evaluated over a cluster of nodes, and applied to both real and synthetic data. The results apply to any databases of numerical sequences, whether or not they relate to time.
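
The decision rule hinges on how much of a series' spectral energy sits in a few low-frequency Fourier coefficients; a simple NumPy check of that property (the coefficient count and interpretation are illustrative, not the toolkit's actual rule):

```python
import numpy as np

def low_freq_energy_ratio(series, n_coeffs=8):
    """Fraction of spectral energy in the first n_coeffs Fourier coefficients
    (DC excluded); high values suggest iSAX-style pruning will work well."""
    spectrum = np.abs(np.fft.rfft(series - np.mean(series))) ** 2
    return spectrum[1:1 + n_coeffs].sum() / spectrum[1:].sum()

t = np.linspace(0, 1, 512)
smooth = np.sin(2 * np.pi * 3 * t)                 # low-frequency series
noisy = np.random.default_rng(0).normal(size=512)  # roughly flat spectrum
print(low_freq_energy_ratio(smooth), low_freq_energy_ratio(noisy))
```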

Journal ArticleDOI
TL;DR: In this paper, an iterative deep learning NER framework using distant supervision is proposed for automatic labelling of domain-specific datasets; applied to mineral exploration reports, it produced a large BIO-annotated dataset with six geological categories.
Abstract: Studies on named entity recognition (NER) often require a substantial amount of human-annotated training data. This makes technical domain-specific NER from industry data especially challenging, as labelled data are scarce. Despite English being the surface language, the technical jargon and writing conventions used in technical documents pose low-resource language challenges where techniques such as transfer learning hardly work. Relieving labour-intensive annotation through automatic labelling is thus an important research topic, seeking ways to obtain labelled data quickly and consistently. In this work, we propose an iterative deep learning NER framework using distant supervision for automatic labelling of domain-specific datasets. The framework is applied to mineral exploration reports and produced a large BIO-annotated dataset with six geological categories. This quality-labelled dataset, OzROCK, is made publicly available to support future research on technical domain NER. Experimental results demonstrate the effectiveness of this approach, further confirmed by domain experts. The generalisation ability is verified by applying the framework to two other datasets: one for disease names and the other for chemical names. Overall, our approach can effectively reduce annotation efforts by identifying a much smaller subset that is challenging for automatic labelling and thus requires attention from human experts.
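
The iterative distant-supervision loop can be summarized schematically; `auto_label`, `train_ner`, `split_by_agreement` and `expert_review` are hypothetical placeholders for the framework's stages, not functions from the paper:

```python
def iterative_distant_ner(corpus, seed_lexicon, rounds=3):
    """Schematic loop only: every helper named here is a stand-in."""
    labels = auto_label(corpus, seed_lexicon)      # distant-supervision pass
    model = None
    for _ in range(rounds):
        model = train_ner(corpus, labels)          # deep NER model
        predicted = model.predict(corpus)
        # keep confident agreements; route the hard subset to human experts
        labels, hard_subset = split_by_agreement(labels, predicted)
        labels.update(expert_review(hard_subset))
    return model, labels
```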

Journal ArticleDOI
TL;DR: This paper proposes a novel method for hashtag recommendation that resolves the data sparseness problem by exploiting the most relevant tweet information from external knowledge sources and significantly outperforms the current state-of-the-art methods.
Abstract: With the rapid growth of Twitter in recent years, there has been a tremendous increase in the number of tweets generated by users. Twitter allows users to make use of hashtags to facilitate effective categorization and retrieval of tweets. Despite the usefulness of hashtags, a major fraction of tweets do not contain hashtags. Several methods have been proposed to recommend hashtags based on lexical and topical features of tweets. However, semantic features and data sparsity in tweet representation have rarely been addressed by existing methods. In this paper, we propose a novel method for hashtag recommendation that resolves the data sparseness problem by exploiting the most relevant tweet information from external knowledge sources. In addition to lexical and topical features, the proposed method incorporates semantic features based on word embeddings and a user influence feature based on users' influential positions. To gain the advantage of various hashtag recommendation methods based on different features, our proposed method aggregates these methods using learning-to-rank and generates top-ranked hashtags. Experimental results show that the proposed method significantly outperforms the current state-of-the-art methods.
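
The aggregation step can be pictured as combining per-method candidate scores into one ranking; in the paper the combination is fit by learning-to-rank, whereas the fixed weights below are only a toy stand-in:

```python
import numpy as np

# Hypothetical scores for 5 candidate hashtags from four recommenders
# (rows: lexical, topical, semantic, user-influence methods).
scores = np.array([[0.9, 0.1, 0.4, 0.3, 0.2],
                   [0.2, 0.8, 0.5, 0.1, 0.3],
                   [0.7, 0.3, 0.6, 0.2, 0.1],
                   [0.1, 0.2, 0.9, 0.4, 0.3]])
weights = np.array([0.4, 0.2, 0.3, 0.1])   # would be learned by learning-to-rank

combined = weights @ scores                 # aggregated score per candidate
print("ranked candidate indices:", np.argsort(combined)[::-1])
```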

Journal ArticleDOI
Chao Wu, Qingyu Xiong, Min Gao, Qiude Li, Yang Yu, Wang Kaige
TL;DR: This work proposes an improved model based on convolutional neural networks that extracts more useful information and analyzes sentiment more accurately in comment text, and outperforms other state-of-the-art methods.
Abstract: Aspect-based sentiment analysis predicts the sentiment polarity of specific aspect terms in the text. Compared to general sentiment analysis, it extracts more useful information and analyzes the sentiment in the comment text more accurately. Many previous approaches use long short-term memory networks with attention mechanisms to directly learn aspect-specific representations and model the comment text. However, these methods tend to ignore the importance of aspect term positions and the interactive information between aspect terms and other words. To address these issues, we propose an improved model based on convolutional neural networks. First, a novel relative position encoding layer integrates the relative position information of specific aspect terms in a text. Second, by using the aspect attention mechanism, the semantic relationship between aspect terms and words in the text is fully considered. To verify the effectiveness of the proposed model, we conduct a large number of experiments and comparisons on seven public datasets. The experimental results show that this model outperforms other state-of-the-art methods.
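
The relative position idea, weighting each token by its distance to the aspect-term span before convolution, is easy to sketch; the linear decay used here is an illustrative choice, not necessarily the paper's exact encoding:

```python
import numpy as np

def relative_position_weights(seq_len, aspect_start, aspect_end):
    """Weight each token by proximity to the aspect-term span; tokens inside
    the span get weight 1. The linear decay is an illustrative choice."""
    idx = np.arange(seq_len)
    dist = np.where(idx < aspect_start, aspect_start - idx,
                    np.where(idx > aspect_end, idx - aspect_end, 0))
    return 1.0 - dist / seq_len

# "the battery life is great but the screen is dim": aspect = tokens 1-2
print(relative_position_weights(10, 1, 2).round(2))
```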

Journal ArticleDOI
TL;DR: This paper explores approximate decision tree variants by means of a multi-objective optimization problem, demonstrating a significant performance improvement targeting field-programmable gate array devices.
Abstract: So far, multiple classifier systems have been increasingly designed to take advantage of hardware features, such as high parallelism and computational power. Indeed, compared to software implementations, hardware accelerators guarantee higher throughput and lower latency. Although the combination of multiple classifiers leads to high classification accuracy, the required area overhead makes the design of a hardware accelerator unfeasible, hindering the adoption of commercial configurable devices. For this reason, in this paper, we exploit the approximate computing design paradigm to trade off classification accuracy against hardware area overhead. In particular, starting from trained decision tree models and employing the precision-scaling technique, we explore approximate decision tree variants by means of a multi-objective optimization problem, demonstrating a significant performance improvement targeting field-programmable gate array devices.

Journal ArticleDOI
TL;DR: A framework for incorporating clustering as a method of feature extraction for classification is put forward, which serves as a platform to answer ten essential questions regarding the studied subject.
Abstract: There is a certain belief among data science researchers and enthusiasts alike that clustering can be used to improve classification quality. Insofar as this belief is fairly uncontroversial, it is also very general and therefore produces a lot of confusion around the subject. There are many ways of using clustering in classification and it obviously cannot always improve the quality of predictions, so a question arises, in which scenarios exactly does it help? Since we were unable to find a rigorous study addressing this question, in this paper, we try to shed some light on the concept of using clustering for classification. To do so, we first put forward a framework for incorporating clustering as a method of feature extraction for classification. The framework is generic w.r.t. similarity measures, clustering algorithms, classifiers, and datasets and serves as a platform to answer ten essential questions regarding the studied subject. Each answer is formulated based on a separate experiment on 16 publicly available datasets, followed by an appropriate statistical analysis. After performing the experiments and analyzing the results separately, we discuss them from a global perspective and form general conclusions regarding using clustering as feature extraction for classification.
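
A minimal version of the framework's core idea, appending cluster-derived features (here, distances to k-means centroids) to the original features before classification, can be put together with scikit-learn; note that a rigorous experiment would fit the clustering inside each training fold, which this toy comparison skips:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
X_aug = np.hstack([X, km.transform(X)])   # append distances to the 8 centroids

clf = RandomForestClassifier(random_state=0)
print(cross_val_score(clf, X, y).mean(), cross_val_score(clf, X_aug, y).mean())
```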

Journal ArticleDOI
TL;DR: A comprehensive overview of the existing works in this line of research can be found in this paper, where the authors discuss and analyze various aspects of the proposed algorithms for data stream classification with concept evolution detection and adaptation.
Abstract: Developing effective and efficient data stream classifiers is challenging for the machine learning community because of the dynamic nature of data streams. As a result, many data stream learning algorithms have been proposed during the past decades and have achieved great success in various fields. This paper aims to explore a specific type of challenge in learning evolving data streams, called concept evolution (emergence of novel classes). Concept evolution indicates that the underlying patterns evolve over time, and new patterns (classes) may emerge at any time in streaming data. Therefore, data stream classifiers with emerging class detection have received increasing attention in recent years due to their practical value in many real-world applications. In this article, we provide a comprehensive overview of the existing works in this line of research. We discuss and analyze various aspects of the proposed algorithms for data stream classification with concept evolution detection and adaptation. Additionally, we discuss the potential application areas in which these techniques can be used. We also provide a detailed overview of the evaluation measures and datasets used in these studies. Finally, we describe the current research challenges and future directions for data stream classification with novel class detection.

Journal ArticleDOI
TL;DR: In this article, answer-set programs that specify database repairs are used as a basis for solving computational and reasoning problems around causality in databases, including causal responsibility, and causes are also introduced at the attribute level by appealing to an attribute-based repair semantics that uses null values.
Abstract: There is a recently established correspondence between database tuples as causes for query answers in databases and tuple-based repairs of inconsistent databases with respect to denial constraints. In this work, answer-set programs that specify database repairs are used as a basis for solving computational and reasoning problems around causality in databases, including causal responsibility. Furthermore, causes are also introduced at the attribute level by appealing to an attribute-based repair semantics that uses null values. Corresponding repair-programs are introduced and used as a basis for computation and reasoning about attribute-level causes. The answer-set programs are extended in order to capture causality under integrity constraints.

Journal ArticleDOI
TL;DR: Zhang et al. propose a pipeline framework for question answering over knowledge graph (KGQA), which consists of three cascaded components: (1) an entity detection model that labels the entity mention in the question; (2) an entity linking model that builds a question pattern classifier according to the correlations between question patterns and relation types; and (3) a simple yet effective relation detection model that matches the semantic similarity between the question and relation candidates.
Abstract: Question answering over knowledge graph (KGQA), which automatically answers natural language questions by querying the facts in a knowledge graph (KG), has drawn significant attention in recent years. In this paper, we focus on single-relation questions, which can be answered through a single fact in the KG. This task is non-trivial, since capturing the meaning of questions and selecting the golden fact from billions of facts in the KG are both challenging. We propose a pipeline framework for KGQA, which consists of three cascaded components: (1) an entity detection model, which labels the entity mention in the question; (2) a novel entity linking model, which considers the contextual information of candidate entities in the KG and builds a question pattern classifier according to the correlations between question patterns and relation types to mitigate the entity ambiguity problem; and (3) a simple yet effective relation detection model, which is used to match the semantic similarity between the question and relation candidates. Substantial experiments on the SimpleQuestions benchmark dataset show that our proposed method achieves better performance than many existing state-of-the-art methods in terms of accuracy, top-N recall and other evaluation metrics.

Journal ArticleDOI
TL;DR: In this article, the authors propose a hybrid approach for location selection, evaluate the potential location for the automotive manufacturing plant of Turkey, and present a comprehensive analysis of weighting and multiple criteria decision-making (MCDM) methods.
Abstract: Location selection is a strategic decision that significantly influences the revenue, level of competition, and success of companies and countries. This study aims to propose a hybrid approach for location selection, to evaluate the potential location for the automotive manufacturing plant of Turkey, and to provide a comprehensive analysis of weighting and multiple criteria decision-making (MCDM) methods. The proposed approach integrates different objective and subjective weighting, MCDM, and Copeland methods. Turkey has recently introduced its first automobile prototypes and has announced that the manufacturing plant will be located in Bursa. This decision is thoroughly examined via four objective weighting methods (entropy; criteria importance through inter-criteria correlation, CRITIC; standard deviation; and mean weight) and a subjective method (the analytic hierarchy process). Besides, the alternatives are evaluated based on six MCDM methods: the technique for order preference by similarity to ideal solution (TOPSIS), the preference ranking organization method for enrichment evaluations (PROMETHEE), vise kriterijumska optimizacija i kompromisno resenje (VIKOR), organisation, rangement et synthèse de données relationnelles (ORESTE), elimination and choice translating reality (ELECTRE), and the weighted sum method. The outcomes of the weighting and MCDM methods, the impact of the attribute weights provided by each method on rankings, the outcome of each method pair, and the selection of the best location (Bursa) are thoroughly evaluated considering a real-world case with a potential outcome, which makes the evaluations more realistic and tangible, unlike most other studies in the literature. In this regard, Spearman's rank correlation coefficients are considered. Also, sensitivity analysis is conducted to reveal the robustness of the methods and the impact of each weight on outcomes. Some considerable results, including the most robust method and optimal method pairs for the case, are presented.
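
Of the objective weighting methods listed, the entropy method is easy to state concretely; a small NumPy sketch with made-up criterion values (not the study's data):

```python
import numpy as np

def entropy_weights(X):
    """Objective criteria weights via the entropy method: criteria whose
    values differ more across alternatives receive larger weights."""
    P = X / X.sum(axis=0)                          # normalize each criterion column
    logs = np.where(P > 0, np.log(P), 0.0)
    e = -(P * logs).sum(axis=0) / np.log(len(X))   # entropy per criterion
    d = 1.0 - e                                    # degree of divergence
    return d / d.sum()

# 4 candidate locations x 3 criteria (illustrative numbers only)
X = np.array([[7., 3., 5.], [6., 4., 6.], [8., 2., 4.], [5., 5., 7.]])
print(entropy_weights(X).round(3))
```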


Journal ArticleDOI
TL;DR: In this article, a novel family of metrics based on ball coverage by classes, named Overlap Number of Balls, is developed; these metrics provide both good estimates of class overlap and strong correlations with classification performance.
Abstract: Data Science and Machine Learning have become fundamental assets for companies and research institutions alike. As one of its fields, supervised classification allows for class prediction of new samples, learning from given training data. However, some properties can cause datasets to be problematic to classify. In order to evaluate a dataset a priori, data complexity metrics have been used extensively. They provide information regarding different intrinsic characteristics of the data, which serve to evaluate classifier compatibility and a course of action that improves performance. However, most complexity metrics focus on just one characteristic of the data, which can be insufficient to properly evaluate the dataset with respect to classifier performance. In fact, class overlap, a very detrimental feature for the classification process (especially when imbalance among class labels is also present), is hard to assess. This research work focuses on revisiting complexity metrics based on data morphology. In accordance with their nature, the premise is that they provide both good estimates of class overlap and strong correlations with classification performance. For that purpose, a novel family of metrics has been developed. Being based on ball coverage by classes, they are named Overlap Number of Balls. Finally, some prospects for the adaptation of this family of metrics to singular (more complex) problems are discussed.
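
The ball-coverage intuition can be approximated cheaply: a point whose nearest neighbour belongs to another class bounds any same-class ball around it to radius zero, signalling overlap. The sketch below is an illustrative simplification, not the paper's exact ONB family:

```python
import numpy as np
from scipy.spatial.distance import cdist

def overlap_estimate(X, y):
    """Fraction of points whose nearest neighbour is from another class;
    higher values indicate more class overlap. Simplified illustration only."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    nearest = D.argmin(axis=1)
    return np.mean(y[nearest] != y)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(overlap_estimate(X, y))   # grows as the two classes overlap more
```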

Journal ArticleDOI
TL;DR: This paper presents theoretical evidence for, and an empirical implementation of, two strategies for reducing the high time complexity of the ensemble shapelet transform algorithm, guaranteeing near-lossless accuracy under some preconditions while reducing the time complexity.
Abstract: In the research area of time series classification, the ensemble shapelet transform algorithm is one of the state-of-the-art algorithms for classification. However, its high time complexity hinders its application, since its base classifier, the shapelet transform, involves costly distance calculation and shapelet selection. Therefore, in this paper we introduce a novel algorithm, the short isometric shapelet transform (SIST), which contains two strategies to reduce the time complexity. The first strategy of SIST fixes the length of shapelets based on a simplified distance calculation, which largely reduces the number of shapelet candidates and speeds up the distance calculation in the ensemble shapelet transform algorithm. The second strategy is to train a single linear classifier in the feature space instead of an ensemble classifier. Theoretical evidence for these two strategies is presented to guarantee near-lossless accuracy under some preconditions while reducing the time complexity. Furthermore, empirical experiments demonstrate the superior performance of the proposed algorithm.

Journal ArticleDOI
TL;DR: Wang et al. propose a weighted participation index (WPI) to identify co-locations with or without rare features; its conditional anti-monotone property can be utilized to prune the search space.
Abstract: A co-location pattern indicates a group of spatial features whose instances are frequently located together in a proximate geographic area. Spatial co-location pattern mining (SCPM) is valuable for many practical applications. Numerous previous SCPM studies emphasize equal participation per feature; as a result, interesting co-locations with rare features cannot be captured. In this paper, we propose a novel interest measure, the weighted participation index (WPI), to identify co-locations with or without rare features. The WPI measure possesses a conditional anti-monotone property which can be utilized to prune the search space. In addition, a fast row instance identification mechanism based on the ordered NR-tree is proposed to enhance efficiency. Subsequently, the ordered NR-tree-based algorithm is developed. To further improve efficiency and process massive spatial data, we break the ordered NR-tree into multiple independent subtrees and parallelize the ordered NR-tree-based algorithm on the MapReduce framework. Extensive experiments are conducted on both real and synthetic datasets to verify the effectiveness, efficiency and scalability of our techniques.
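
A participation-style computation with per-feature weights can be sketched as follows; the paper's exact WPI definition may differ, so treat this as an illustration of the idea that weights let rare features contribute:

```python
def weighted_participation_index(pattern, row_instances, counts, weights):
    """Illustrative only. pattern: tuple of features; row_instances: list of
    row instances, each a dict feature -> instance id; counts: total instances
    per feature; weights: importance weight per feature (boosts rare ones)."""
    ratios = {}
    for f in pattern:
        distinct = {row[f] for row in row_instances}
        ratios[f] = weights[f] * len(distinct) / counts[f]
    return min(ratios.values())   # anti-monotone aggregate, as in classic PI

rows = [{"A": 1, "B": 4}, {"A": 1, "B": 5}, {"A": 2, "B": 4}]
print(weighted_participation_index(("A", "B"), rows,
                                   counts={"A": 10, "B": 5},
                                   weights={"A": 1.0, "B": 2.0}))
```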

Journal ArticleDOI
TL;DR: Experimental results on several text classification datasets demonstrate that CREX increases the credibility of DNNs; comprehensive analysis shows three meaningful improvements: significantly higher DNN accuracy on new and previously unseen data beyond the test set, enhanced fairness in terms of the equality of opportunity metric with reduced discrimination toward certain demographic groups, and improved robustness to adversarial attacks.
Abstract: Recent studies have shown that state-of-the-art DNNs are not always credible, despite their impressive performance on the hold-out test sets of a variety of tasks. These models tend to exploit dataset shortcuts to make predictions, rather than learn the underlying task. The non-credibility could lead to low generalization, adversarial vulnerability, as well as algorithmic discrimination by DNN models. In this paper, we propose CREX in order to develop more credible DNNs. The high-level idea of CREX is to encourage DNN models to focus more on evidence that actually matters for the task at hand and to avoid overfitting to data-dependent shortcuts. Specifically, in the DNN training process, CREX directly regularizes the local explanation with expert rationales, i.e., a subset of features highlighted by domain experts as justifications for predictions, to enforce the alignment between local explanations and rationales. Even when rationales are not available, CREX could still be useful by requiring the generated explanations to be sparse. In addition, CREX is widely applicable to different network architectures, including CNN, LSTM and attention models. Experimental results on several text classification datasets demonstrate that CREX could increase the credibility of DNNs. Comprehensive analysis further shows three meaningful improvements of CREX: (1) it significantly increases DNN accuracy on new and previously unseen data beyond the test set; (2) it enhances the fairness of DNNs in terms of the equality of opportunity metric and reduces models' discrimination toward certain demographic groups; and (3) it promotes the robustness of DNN models with respect to adversarial attacks. These experimental results highlight the advantages of the increased credibility brought by CREX.
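
One schematic way to regularize local explanations with rationales is to penalize gradient saliency that falls outside the expert-marked span; CREX's exact regularizer may differ, and everything below (model, shapes) is a toy assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rationale_penalty(model, emb, y, rationale_mask):
    """Penalize gradient attribution mass on tokens outside the rationale.
    emb: (batch, seq, dim) input embeddings; rationale_mask: (batch, seq)."""
    emb = emb.clone().requires_grad_(True)
    loss = F.cross_entropy(model(emb), y)
    grads, = torch.autograd.grad(loss, emb, create_graph=True)
    saliency = grads.abs().sum(dim=-1)               # per-token attribution
    return (saliency * (1 - rationale_mask)).sum()   # mass off the rationale

# Toy usage: mean over nothing fancy, just flatten then a linear classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(5 * 8, 2))
emb = torch.randn(3, 5, 8)                 # batch of 3, 5 tokens, 8-dim
mask = torch.zeros(3, 5); mask[:, :2] = 1  # rationale = first 2 tokens
penalty = rationale_penalty(model, emb, torch.tensor([0, 1, 0]), mask)
```

In training, such a penalty would be added to the task loss with a weighting coefficient.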

Journal ArticleDOI
TL;DR: A literature review within the context of manufacturing systems that use databases and ontologies, identifying their respective strengths and weaknesses, and an implementation in a real industrial scenario that demonstrates how different modeling approaches can be used for the same purpose are presented.
Abstract: The literature on the modeling and management of data generated through the lifecycle of a manufacturing system is split into two main paradigms: product lifecycle management (PLM) and product, process, resource (PPR) modeling. These paradigms are complementary, and the latter could be considered a more neutral version of the former. There are two main technologies associated with these paradigms: ontologies and databases. Database technology is widespread in industry and is well established. Ontologies remain largely a plaything of the academic community and, despite numerous projects and publications, have seen limited implementation in industrial manufacturing applications. The main objective of this paper is to provide a comparison between ontologies and databases, offering both qualitative and quantitative analyses in the context of PLM and PPR. To achieve this, the article presents (1) a literature review within the context of manufacturing systems that use databases and ontologies, identifying their respective strengths and weaknesses, and (2) an implementation in a real industrial scenario that demonstrates how different modeling approaches can be used for the same purpose. This experiment is used to enable discussion and comparative analysis of both modeling strategies.

Journal ArticleDOI
TL;DR: A straightforward two-dimensional data representation is proposed that allows faster processing of datasets with a large number of examples and dimensions, and an adaptive drift detector developed on this visual representation is efficient for fast streams with thousands of features and as accurate as existing costly methods.
Abstract: Stream mining considers the online arrival of examples at high speed and the possibility of changes in their descriptive features or class definitions compared with past knowledge (i.e., concept drifts). The fast detection of drifts is essential to keep the predictive model updated and stable in changing environments. For many applications, such as those related to smart sensors, the high number of features is an additional challenge in terms of memory and time for stream processing. This paper presents an unsupervised and model-independent concept drift detector suitable for high-speed and high-dimensional data streams. We propose a straightforward two-dimensional data representation that allows faster processing of datasets with a large number of examples and dimensions. We developed an adaptive drift detector on this visual representation that is efficient for fast streams with thousands of features and is as accurate as existing costly methods that perform various statistical tests considering each feature individually. Our method achieves better performance, measured by execution time and accuracy, in classification problems for different types of drifts. The experimental evaluation considering synthetic and real data demonstrates the method's versatility in several domains, including entomology, medicine, and transportation systems.

Journal ArticleDOI
TL;DR: This paper presents a hybrid approach that integrates an ensemble-learning framework by combining a Multiscale Laplacian Graph kernel and a feature-based linear kernel, using a pattern-matching engine to identify biomedical events with arguments.
Abstract: Bio-event extraction is an extensive research area in the field of biomedical text mining that focuses on elucidating relationships between biomolecules and can reveal various aspects of their nature. Bio-event extraction plays a vital role in biomedical literature mining applications such as biological network construction, pathway curation, and drug repurposing. Extracting biological events automatically is a difficult task because of the uncertainty and variety of natural language, such as negations and speculations, which provides further room for the development of feasible methodologies. This paper presents a hybrid approach that integrates an ensemble-learning framework combining a Multiscale Laplacian Graph kernel and a feature-based linear kernel, using a pattern-matching engine to identify biomedical events with arguments. The graph-based kernel not only captures the topological relationships between individual event nodes but also identifies the associations among subgraphs for complex events. In addition, lexico-syntactic patterns are used to automatically discover the semantic role of each word in the sentence. For performance evaluation, we used the gold standard corpora, namely BioNLP-ST (2009, 2011, and 2013) and GENIA-MK. Experimental results show that our approach achieved better performance than other state-of-the-art systems.

Journal ArticleDOI
TL;DR: SemKeyphrase, an unsupervised cluster-based approach for keyphrase extraction from MOOC video lectures, is proposed; it incorporates a new semantic relatedness metric and a two-phase ranking algorithm, called PhraseRank.
Abstract: Massive open online courses (MOOCs) have emerged as a great resource for learners. Numerous challenges remain to be addressed in order to make MOOCs more useful and convenient for learners. One such challenge is how to automatically extract a set of keyphrases from MOOC video lectures that can help students quickly identify the right knowledge they want to learn and thus expedite their learning process. In this paper, we propose SemKeyphrase, an unsupervised cluster-based approach for keyphrase extraction from MOOC video lectures. SemKeyphrase incorporates a new semantic relatedness metric and a ranking algorithm, called PhraseRank, that ranks candidates in two phases. We conducted experiments on a real-world dataset of MOOC video lectures, and the results show that our proposed approach outperforms state-of-the-art keyphrase extraction methods.