
Showing papers in "Knowledge and Information Systems in 2017"


Journal ArticleDOI
TL;DR: This survey article enumerates, categorizes, and compares many of the methods that have been proposed to detect change points in time series, and presents some grand challenges for the community to consider.
Abstract: Change points are abrupt variations in time series data. Such abrupt changes may represent transitions that occur between states. Detection of change points is useful in modelling and prediction of time series and is found in application areas such as medical condition monitoring, climate change detection, speech and image analysis, and human activity analysis. This survey article enumerates, categorizes, and compares many of the methods that have been proposed to detect change points in time series. The methods examined include both supervised and unsupervised algorithms that have been introduced and evaluated. We introduce several criteria to compare the algorithms. Finally, we present some grand challenges for the community to consider.

788 citations


Journal ArticleDOI
TL;DR: This review paper presents a selection of challenges of particular current interest, such as feature selection for high-dimensional small sample size data, large-scale data, and secure feature selection, as well as some representative applications of feature selection.
Abstract: Feature selection is one of the key problems for machine learning and data mining. In this review paper, a brief historical background of the field is given, followed by a selection of challenges of particular current interest, such as feature selection for high-dimensional small sample size data, large-scale data, and secure feature selection. Along with these challenges, some hot topics for feature selection have emerged, e.g., stable feature selection, multi-view feature selection, distributed feature selection, multi-label feature selection, online feature selection, and adversarial feature selection. The recent advances in these topics are then surveyed in this paper. For each topic, the existing problems are analyzed, and then current solutions to these problems are presented and discussed. Besides these topics, some representative applications of feature selection are also introduced, such as applications in bioinformatics, social media, and multimedia retrieval.

219 citations


Journal ArticleDOI
TL;DR: This work substantiates its points with extensive experiments, using clustering and outlier detection methods with and without index acceleration, and discusses what one can learn from evaluations, whether experiments are properly designed, and what kind of conclusions one should avoid.
Abstract: Any paper proposing a new algorithm should come with an evaluation of efficiency and scalability (particularly when we are designing methods for "big data"). However, there are several (more or less serious) pitfalls in such evaluations. We would like to point the attention of the community to these pitfalls. We substantiate our points with extensive experiments, using clustering and outlier detection methods with and without index acceleration. We discuss what we can learn from evaluations, whether experiments are properly designed, and what kind of conclusions we should avoid. We close with some general recommendations but maintain that the design of fair and conclusive experiments will always remain a challenge for researchers and an integral part of the scientific endeavor.

184 citations


Journal ArticleDOI
TL;DR: A novel algorithm named EFIM (EFficient high-utility Itemset Mining), which introduces several new ideas to more efficiently discover high-utility itemsets and is in general two to three orders of magnitude faster than the state-of-the-art algorithms.
Abstract: In recent years, high-utility itemset mining has emerged as an important data mining task. However, it remains computationally expensive both in terms of runtime and memory consumption. It is thus an important challenge to design more efficient algorithms for this task. In this paper, we address this issue by proposing a novel algorithm named EFIM (EFficient high-utility Itemset Mining), which introduces several new ideas to more efficiently discover high-utility itemsets. EFIM relies on two new upper bounds named revised sub-tree utility and local utility to more effectively prune the search space. It also introduces a novel array-based utility counting technique named Fast Utility Counting to calculate these upper bounds in linear time and space. Moreover, to reduce the cost of database scans, EFIM proposes efficient database projection and transaction merging techniques named High-utility Database Projection and High-utility Transaction Merging (HTM), also performed in linear time. An extensive experimental study on various datasets shows that EFIM is in general two to three orders of magnitude faster than the state-of-the-art algorithms d2HUP, HUI-Miner, HUP-Miner, FHM and UP-Growth+ on dense datasets and performs quite well on sparse datasets. Moreover, a key advantage of EFIM is its low memory consumption.
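
To illustrate the flavor of array-based utility counting, the sketch below computes a simple per-item utility bound (the transaction-weighted utility) in one pass over the database using one bin per item. It is a hedged illustration of linear-time utility counting in general, not EFIM's exact revised sub-tree or local utility bounds.

```python
from collections import defaultdict

def utility_bin_counts(database):
    """Array/dict-based utility counting: one pass over the database, one bin per item.

    database: list of transactions, each a list of (item, utility) pairs.
    Returns, per item, the transaction-weighted utility (the sum of the utilities of
    every transaction containing the item), a simple upper bound used to prune items.
    """
    bins = defaultdict(int)
    for transaction in database:
        tu = sum(u for _, u in transaction)   # transaction utility
        for item, _ in transaction:
            bins[item] += tu
    return bins

# Items whose bound falls below min_util can never appear in a high-utility itemset.
db = [[("a", 5), ("b", 2)], [("a", 4), ("c", 3)], [("b", 1), ("c", 6)]]
min_util = 10
promising = {i for i, bound in utility_bin_counts(db).items() if bound >= min_util}
print(promising)
```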

169 citations


Journal ArticleDOI
TL;DR: Significant contributions made in recent years are emphasized, including progress on modern sentence extraction approaches that improve concept coverage, information diversity and content coherence, as well as attempts from summarization frameworks that integrate sentence compression, and more abstractive systems that are able to produce completely new sentences.
Abstract: The task of automatic document summarization aims at generating short summaries for originally long documents. A good summary should cover the most important information of the original document or a cluster of documents, while being coherent, non-redundant and grammatically readable. Numerous approaches for automatic summarization have been developed to date. In this paper, we give a self-contained, broad overview of recent progress made for document summarization within the last 5 years. Specifically, we emphasize significant contributions made in recent years that represent the state-of-the-art of document summarization, including progress on modern sentence extraction approaches that improve concept coverage, information diversity and content coherence, as well as attempts from summarization frameworks that integrate sentence compression, and more abstractive systems that are able to produce completely new sentences. In addition, we review progress made for document summarization in domains, genres and applications that are different from traditional settings. We also point out some of the latest trends and highlight a few possible future directions.

143 citations


Journal ArticleDOI
TL;DR: A novel transfer learning and domain adaptation approach, referred to as visual domain adaptation (VDA), which reduces the difference between the joint marginal and conditional distributions across domains in an unsupervised manner where no label is available in the test set.
Abstract: One of the serious challenges in computer vision and image classification is learning an accurate classifier for a new unlabeled image dataset when no labeled training data are available. Transfer learning and domain adaptation are two outstanding solutions that tackle this challenge by employing available datasets, even those with significant differences in distribution and properties, and transferring knowledge from a related domain to the target domain. The main difference between these two solutions lies in their primary assumptions about changes in the marginal and conditional distributions: transfer learning focuses on problems with the same marginal distribution and different conditional distributions, while domain adaptation deals with the opposite conditions. Most prior works have exploited these two learning strategies separately for the domain shift problem, where training and test sets are drawn from different distributions. In this paper, we exploit joint transfer learning and domain adaptation to cope with domain shift problems in which the distribution difference is significantly large, particularly for vision datasets. We therefore put forward a novel transfer learning and domain adaptation approach, referred to as visual domain adaptation (VDA). Specifically, VDA reduces the difference between the joint marginal and conditional distributions across domains in an unsupervised manner, where no label is available in the test set. Moreover, VDA constructs condensed domain-invariant clusters in the embedding representation to separate the various classes alongside the domain transfer. In this work, we employ pseudo target label refinement to iteratively converge to the final solution. Employing an iterative procedure along with a novel optimization problem creates a robust and effective representation for adaptation across domains. Extensive experiments on 16 real vision datasets with different difficulties verify that VDA significantly outperforms state-of-the-art methods on the image classification problem.
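
Methods in this family typically quantify distribution differences with the maximum mean discrepancy (MMD). The snippet below shows the empirical linear-kernel MMD between source and target features as a minimal, generic illustration; VDA's actual objective (joint marginal and conditional terms, pseudo-label refinement, clustering regularization) is considerably more involved.

```python
import numpy as np

def mmd_linear(Xs, Xt):
    """Empirical maximum mean discrepancy with a linear kernel: the squared distance
    between the source and target feature means. A smaller value indicates better
    alignment of the marginal distributions."""
    delta = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(delta @ delta)

# Toy usage: the gap shrinks as the two feature clouds are brought closer together.
Xs = np.random.randn(100, 16) + 1.0   # source features
Xt = np.random.randn(120, 16) - 1.0   # target features
print(mmd_linear(Xs, Xt))
```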

125 citations


Journal ArticleDOI
TL;DR: A methodical analysis of task scheduling in cloud and grid computing is presented based on swarm intelligence and bio-inspired techniques, and will enable readers to choose a suitable approach for suggesting better schemes for scheduling users' applications.
Abstract: Heterogeneous distributed computing systems are emerging as platforms for executing scientific and computationally intensive applications. Cloud computing in this context describes a paradigm to deliver resources such as computing and storage on an on-demand basis using a pay-per-use model. These resources are managed by data centers and dynamically provisioned to users based on their availability, demand and the quality parameters required to be satisfied. The scheduling of tasks onto distributed and virtual resources is a main concern that can affect the performance of the system. In the literature, a lot of work has been done considering cost and makespan as the parameters affecting the scheduling of dependent tasks. Prior work has discussed the various challenges affecting the performance of dependent task scheduling but did not consider storage cost and failure-rate-related challenges. This paper accomplishes a review of the use of meta-heuristic techniques for scheduling tasks in cloud computing. We present a taxonomy and a comparative review of these algorithms. A methodical analysis of task scheduling in cloud and grid computing is presented based on swarm intelligence and bio-inspired techniques. This work will enable readers to choose a suitable approach for suggesting better schemes for scheduling users' applications. Future research issues are also suggested in this work.

114 citations


Journal ArticleDOI
TL;DR: The neuro-fuzzy approach was used to detect the most important variables which affect the wind speed according to the fractal dimensions, and the main goal was to investigate the influence of terrain roughness length and different heights of the wind speed on the wind speed prediction.
Abstract: Fluctuation of wind speed affects wind energy systems since the potential wind power is proportional to the cube of wind speed. Hence, precise prediction of wind speed is very important for improving the performance of such systems. Due to the unstable behavior of the wind speed above different terrains, in this study fractal characteristics of the wind speed series were analyzed. According to the self-similarity characteristic and scale invariance, fractal extrapolation interpolation prediction can be performed by extending the fractal characteristic from an internal interval to an external interval. Afterward, a neuro-fuzzy technique was applied to the fractal data because of the high nonlinearity of the data. The neuro-fuzzy approach was used to detect the most important variables which affect the wind speed according to the fractal dimensions. The main goal was to investigate the influence of terrain roughness length and different heights of the wind speed on the wind speed prediction.

96 citations


Journal ArticleDOI
TL;DR: This paper analyzes the properties of an incremental algorithm that uses a sorted tree structure with a sliding window to compute AUC with forgetting, called prequential AUC, and shows that the proposed measure is statistically consistent with AUC computed traditionally on streams without drift and comparably fast to existing evaluation procedures.
Abstract: Modern data-driven systems often require classifiers capable of dealing with streaming imbalanced data and concept changes. The assessment of learning algorithms in such scenarios is still a challenge, as existing online evaluation measures focus on efficiency, but are susceptible to class ratio changes over time. In case of static data, the area under the receiver operating characteristics curve, or simply AUC, is a popular measure for evaluating classifiers both on balanced and imbalanced class distributions. However, the characteristics of AUC calculated on time-changing data streams have not been studied. This paper analyzes the properties of our recent proposal, an incremental algorithm that uses a sorted tree structure with a sliding window to compute AUC with forgetting. The resulting evaluation measure, called prequential AUC, is studied in terms of: visualization over time, processing speed, differences compared to AUC calculated on blocks of examples, and consistency with AUC calculated traditionally. Simulation results show that the proposed measure is statistically consistent with AUC computed traditionally on streams without drift and comparably fast to existing evaluation procedures. Finally, experiments on real-world and synthetic data showcase characteristic properties of prequential AUC compared to classification accuracy, G-mean, Kappa, Kappa M, and recall when used to evaluate classifiers on imbalanced streams with various difficulty factors.
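
For intuition, the following sketch computes AUC over a sliding window of (score, label) pairs; forgetting happens by dropping the oldest example once the window is full. It uses a quadratic pairwise comparison for clarity, whereas the paper's sorted-tree structure achieves the same result far more efficiently.

```python
from collections import deque

def window_auc(window):
    """AUC over (score, label) pairs via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for s, y in window if y == 1]
    neg = [s for s, y in window if y == 0]
    if not pos or not neg:
        return float("nan")
    # Count score pairs where a positive outranks a negative (ties count 0.5).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def prequential_auc(stream, window_size=1000):
    """Yield the AUC after each example, using only the most recent window_size examples."""
    window = deque(maxlen=window_size)   # old examples are forgotten automatically
    for score, label in stream:
        window.append((score, label))
        yield window_auc(window)
```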

91 citations


Journal ArticleDOI
TL;DR: Different performance evaluation parameters such as precision, recall, f-measure, and accuracy have been considered to evaluate the performance of the proposed approach on two different datasets, i.e., the IMDb dataset and the polarity dataset.
Abstract: It is common practice for users or customers to share their comments or reviews about products on different social networking sites. An analyst usually needs to process these reviews properly to obtain meaningful information from them. Classification of the sentiments associated with reviews is one of these processing steps. The reviews are typically written in text format, and while processing them, each word of a review is considered as a feature. Thus, feature selection needs to be carried out to select the best features from the set of all features. In this paper, a machine learning algorithm, namely the support vector machine, is used to select the best features from the training data. These features are then given as input to an artificial neural network for further processing. Different performance evaluation parameters such as precision, recall, f-measure, and accuracy have been considered to evaluate the performance of the proposed approach on two different datasets, i.e., the IMDb dataset and the polarity dataset.
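
One plausible reading of this pipeline, sketched below with scikit-learn, is to extract TF-IDF features, keep the features weighted highly by a linear SVM, and feed them to a small feed-forward network. The dataset loading and the hyper-parameters shown are assumptions, not the paper's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel

# SVM-based feature selection followed by an ANN classifier (illustrative sketch).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=2)),                       # each word becomes a feature
    ("select", SelectFromModel(LinearSVC(C=1.0, max_iter=5000))),  # keep SVM-weighted features
    ("ann", MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)),
])

# pipeline.fit(train_reviews, train_labels)
# pipeline.score(test_reviews, test_labels)   # precision/recall/F1 via sklearn.metrics
```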

82 citations


Journal ArticleDOI
TL;DR: This research proposes a semi-supervised sentiment analysis approach that incorporates lexicon-based methodology with machine learning in order to improve sentiment analysis performance.
Abstract: An immense amount of data has become available with the advent of social media in the last decade. This data can be used for sentiment analysis and decision making. The data present on blogs, news/review sites, social networks, etc., are so enormous that manual labeling is not feasible and an automatic approach is required for their analysis. The sentiment of the masses can be understood by analyzing this large-scale and opinion-rich data. The major issues in the application of automated approaches are data unavailability, data sparsity, domain independence and inadequate performance. This research proposes a semi-supervised sentiment analysis approach that incorporates lexicon-based methodology with machine learning in order to improve sentiment analysis performance. Mathematical models such as information gain and cosine similarity are employed to revise the sentiment scores defined in SentiWordNet. This research also emphasizes the importance of nouns and employs them as semantic features along with other parts of speech. The evaluation of performance measures and comparison with state-of-the-art techniques show that the proposed approach is superior.

Journal ArticleDOI
TL;DR: Current state-of-the-art cloud resource allocation schemes are extensively reviewed to highlight their strengths and weaknesses and a thematic taxonomy is presented based on resource allocation optimization objectives to classify the existing literature.
Abstract: Cloud computing has emerged as a popular computing model to process data and execute computationally intensive applications in a pay-as-you-go manner. Due to the ever-increasing demand for cloud-based applications, it is becoming difficult to efficiently allocate resources according to user requests while satisfying the service-level agreement between service providers and consumers. Furthermore, cloud resource heterogeneity, the unpredictable nature of workload, and the diversified objectives of cloud actors further complicate resource allocation in the cloud computing environment. Consequently, both the industry and academia have commenced substantial research efforts to efficiently handle the aforementioned multifaceted challenges with cloud resource allocation. The lack of a comprehensive review covering the resource allocation aspects of optimization objectives, design approaches, optimization methods, target resources, and instance types has motivated a review of existing cloud resource allocation schemes. In this paper, current state-of-the-art cloud resource allocation schemes are extensively reviewed to highlight their strengths and weaknesses. Moreover, a thematic taxonomy is presented based on resource allocation optimization objectives to classify the existing literature. The cloud resource allocation schemes are analyzed based on the thematic taxonomy to highlight the commonalities and deviations among them. Finally, several opportunities are suggested for the design of optimal resource allocation schemes.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a fast, efficient, and parallel framework for counting k-node graphlets, which leverages a number of theoretical combinatorial arguments that allow them to obtain significant improvement on the scalability of graphlet counting.
Abstract: From social science to biology, numerous applications often rely on graphlets for intuitive and meaningful characterization of networks. While graphlets have witnessed a tremendous success and impact in a variety of domains, there has yet to be a fast and efficient framework for computing the frequencies of these subgraph patterns. However, existing methods are not scalable to large networks with billions of nodes and edges. In this paper, we propose a fast, efficient, and parallel framework as well as a family of algorithms for counting k-node graphlets. The proposed framework leverages a number of theoretical combinatorial arguments that allow us to obtain significant improvement on the scalability of graphlet counting. For each edge, we count a few graphlets and obtain the exact counts of others in constant time using the combinatorial arguments. On a large collection of 300+ networks from a variety of domains, our graphlet counting strategies are on average 460× faster than existing methods. This brings new opportunities to investigate the use of graphlets on much larger networks and newer applications as we show in the experiments. To the best of our knowledge, this paper provides the largest graphlet computations to date.
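
A tiny example of the combinatorial-argument style for 3-node graphlets per edge: only the triangle count requires a neighbourhood intersection, and the open-wedge count then follows in constant time from the endpoint degrees. This is a simplified sketch; the paper's framework covers larger graphlets and parallel execution.

```python
import networkx as nx

def edge_3node_graphlets(G):
    """For each edge, count triangles directly and derive open wedges in constant time."""
    counts = {}
    for u, v in G.edges():
        tri = len(set(G[u]) & set(G[v]))                          # triangles containing (u, v)
        wedge = (G.degree(u) - 1) + (G.degree(v) - 1) - 2 * tri   # open 2-paths through (u, v)
        counts[(u, v)] = {"triangle": tri, "wedge": wedge}
    return counts

print(edge_3node_graphlets(nx.karate_club_graph())[(0, 1)])
```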

Journal ArticleDOI
TL;DR: A new architecture called a cascading network is proposed that is capable of distributing a deep neural network between a local device and the cloud while keeping the required communication network traffic to a minimum and allows for an early-stopping mechanism during the recall phase of the network.
Abstract: Most of the research on deep neural networks so far has been focused on obtaining higher accuracy levels by building increasingly large and deep architectures. Training and evaluating these models is only feasible when large amounts of resources such as processing power and memory are available. Typical applications that could benefit from these models are, however, executed on resource-constrained devices. Mobile devices such as smartphones already use deep learning techniques, but they often have to perform all processing on a remote cloud. We propose a new architecture called a cascading network that is capable of distributing a deep neural network between a local device and the cloud while keeping the required communication network traffic to a minimum. The network begins processing on the constrained device, and only relies on the remote part when the local part does not provide an accurate enough result. The cascading network allows for an early-stopping mechanism during the recall phase of the network. We evaluated our approach in an Internet of Things context where a deep neural network adds intelligence to a large number of heterogeneous connected devices. This technique enables a whole variety of autonomous systems where sensors, actuators and computing nodes can work together. We show that the cascading architecture allows for a substantial improvement in evaluation speed on constrained devices while the loss in accuracy is kept to a minimum.
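
A minimal sketch of the early-stopping cascade, assuming both the on-device and the cloud model expose a predict_proba-style interface (a hypothetical signature): the device answers locally when its confidence clears a threshold and offloads to the cloud otherwise.

```python
import numpy as np

def cascade_predict(x, local_model, cloud_model, confidence=0.9):
    """Early-stopping cascade: answer locally when confident, otherwise offload.

    local_model / cloud_model are assumed to expose predict_proba(x) -> class
    probabilities. Only the input (or an intermediate feature vector) would need
    to be transmitted in the offloaded case, keeping network traffic low.
    """
    local_probs = local_model.predict_proba(x)
    if np.max(local_probs) >= confidence:
        return int(np.argmax(local_probs)), "local"   # early stop on the device
    cloud_probs = cloud_model.predict_proba(x)        # fall back to the remote part
    return int(np.argmax(cloud_probs)), "cloud"
```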

Journal ArticleDOI
TL;DR: An efficient algorithm named fast algorithm for mining discriminative high utility patterns (DHUPs) with strong frequency affinity (FDHUP) is proposed to efficiently discover DHUPs by considering both the utility and frequency affinity constraints.
Abstract: Recently, high utility pattern mining (HUPM) has been extensively studied. Many approaches for HUPM have been proposed in recent years, but most of them aim at mining HUPs without any consideration for their frequency. This has the major drawback that any combination of a low utility item with a very high utility pattern is regarded as a HUP, even if this combination has low affinity and contains items that rarely co-occur. Thus, frequency should be a key criterion to select HUPs. To address this issue, and derive high utility interesting patterns (HUIPs) with strong frequency affinity, the HUIPM algorithm was proposed. However, it recursively constructs a series of conditional trees to produce candidates and then derive the HUIPs. This procedure is time-consuming and may lead to a combinatorial explosion when the minimum utility threshold is set relatively low. In this paper, an efficient algorithm named fast algorithm for mining discriminative high utility patterns (DHUPs) with strong frequency affinity (FDHUP) is proposed to efficiently discover DHUPs by considering both the utility and frequency affinity constraints. Two compact structures named EI-table and FU-tree and three pruning strategies are introduced in the proposed algorithm to reduce the search space, and efficiently and effectively discover DHUPs. An extensive experimental study shows that the proposed FDHUP algorithm considerably outperforms the state-of-the-art HUIPM algorithm in terms of execution time, memory consumption, and scalability.

Journal ArticleDOI
TL;DR: A simple and efficient algorithm for k-degree anonymity in large networks by considering the neighbourhood centrality score of each edge, which preserves the most important edges of the network, reducing the information loss and increasing the data utility.
Abstract: The problem of anonymization in large networks and the utility of released data are considered in this paper. Although there are some anonymization methods for networks, most of them cannot be applied in large networks because of their complexity. In this paper, we devise a simple and efficient algorithm for k-degree anonymity in large networks. Our algorithm constructs a k-degree anonymous network by the minimum number of edge modifications. We compare our algorithm with other well-known k-degree anonymous algorithms and demonstrate that information loss in real networks is lowered. Moreover, we consider the edge relevance in order to improve the data utility on anonymized networks. By considering the neighbourhood centrality score of each edge, we preserve the most important edges of the network, reducing the information loss and increasing the data utility. An evaluation of clustering processes is performed on our algorithm, proving that edge neighbourhood centrality increases data utility. Lastly, we apply our algorithm to different large real datasets and demonstrate their efficiency and practical utility.
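
The core of k-degree anonymity is making every degree value occur at least k times. The greedy sketch below anonymizes only the degree sequence; realising the new sequence with a minimum number of edge modifications, guided by edge neighbourhood centrality, is the part the paper's algorithm addresses.

```python
def k_anonymous_degree_sequence(degrees, k):
    """Make a degree sequence k-anonymous: every degree value is shared by >= k nodes.

    Greedy sketch: sort degrees in descending order, cut them into groups of at least k,
    and raise every degree in a group to the group's maximum.
    """
    order = sorted(range(len(degrees)), key=lambda i: -degrees[i])
    target = list(degrees)
    i = 0
    while i < len(order):
        group = order[i:i + k]
        if len(order) - (i + k) < k:          # fold a too-small remainder into the last group
            group = order[i:]
        gmax = max(degrees[j] for j in group)
        for j in group:
            target[j] = gmax
        i += len(group)
    return target

print(k_anonymous_degree_sequence([5, 4, 4, 3, 2, 2, 1], k=2))   # -> [5, 5, 4, 4, 2, 2, 2]
```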

Journal ArticleDOI
TL;DR: This paper proposes a localized transportation mode choice model, with which it can dynamically predict the bus travel demand for different bus routing by taking into account both bus and taxi travel demands, and identifies region pairs with flawed bus routes which are effectively optimized using this approach.
Abstract: Optimal planning for public transportation is one of the keys helping to bring a sustainable development and a better quality of life in urban areas. Compared to private transportation, public transportation uses road space more efficiently and produces fewer accidents and emissions. However, in many cities people prefer to take private transportation other than public transportation due to the inconvenience of public transportation services. In this paper, we focus on the identification and optimization of flawed region pairs with problematic bus routing to improve utilization efficiency of public transportation services, according to people's real demand for public transportation. To this end, we first provide an integrated mobility pattern analysis between the location traces of taxicabs and the mobility records in bus transactions. Based on the mobility patterns, we propose a localized transportation mode choice model, with which we can dynamically predict the bus travel demand for different bus routing by taking into account both bus and taxi travel demands. This model is then used for bus routing optimization which aims to convert as many people from private transportation to public transportation as possible given budget constraints on the bus route modification. We also leverage the model to identify region pairs with flawed bus routes, which are effectively optimized using our approach. To validate the effectiveness of the proposed methods, extensive studies are performed on real-world data collected in Beijing which contains 19 million taxi trips and 10 million bus trips.

Journal ArticleDOI
TL;DR: This survey aims to provide a thorough review of a wide range of result diversification techniques, including various definitions of diversification, corresponding algorithms, diversification techniques designed for specific applications including databases, search engines, recommendation systems, graphs, time series and data streams, as well as result diversification systems.
Abstract: Nowadays, in information systems such as web search engines and databases, diversity is becoming increasingly essential and is receiving more and more attention for improving users' satisfaction. In this sense, query result diversification is of vital importance and well worth researching. Some issues, such as the definition of diversification and efficient diverse query processing, are particularly challenging to handle in information systems. Many researchers have focused on various dimensions of the diversification problem. In this survey, we aim to provide a thorough review of a wide range of result diversification techniques, including various definitions of diversification, corresponding algorithms, diversification techniques designed for specific applications including databases, search engines, recommendation systems, graphs, time series and data streams, as well as result diversification systems. We also propose some open research directions, which are challenging and have not been explored up till now, to improve the quality of query results.

Journal ArticleDOI
TL;DR: This paper first estimates how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and then embed the corresponding information as weights associated with objects in a framework called Weighted-Object Ensemble Clustering (WOEC).
Abstract: Ensemble clustering has attracted increasing attention in recent years. Its goal is to combine multiple base clusterings into a single consensus clustering of increased quality. Most of the existing ensemble clustering methods treat each base clustering and each object as equally important, while some approaches make use of weights associated with clusters or clusterings when assembling the different base clusterings. Boosting algorithms developed for classification have led to the idea of considering weighted objects during the clustering process. However, not much effort has been put toward incorporating weighted objects into the consensus process. To fill this gap, in this paper, we propose a framework called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as weights associated with objects. We propose three different consensus techniques to leverage the weighted objects. All three reduce the ensemble clustering problem to a graph partitioning one. We experimentally demonstrate the gain in performance that our WOEC methodology achieves with respect to state-of-the-art ensemble clustering methods, as well as its stability and robustness.
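
A small sketch of the first step: build the co-association matrix from the base clusterings and turn per-object agreement into a weight. The uncertainty-based weight used here is an illustrative choice, not necessarily the paper's exact weighting scheme.

```python
import numpy as np

def object_weights(base_clusterings):
    """Co-association matrix and a per-object 'difficulty' weight.

    base_clusterings: list of label arrays, one array per base clustering.
    C[i, j] is the fraction of base clusterings placing objects i and j in the same
    cluster; entries near 0.5 indicate disagreement, so the average uncertainty of an
    object's co-associations is used as its weight (harder objects get larger weights).
    """
    labels = np.asarray(base_clusterings)                  # shape: (n_clusterings, n_objects)
    C = np.mean([np.equal.outer(l, l) for l in labels], axis=0)
    uncertainty = 1.0 - 2.0 * np.abs(C - 0.5)              # 0 when unanimous, 1 when split 50/50
    return uncertainty.mean(axis=1)

print(object_weights([[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]))
```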

Journal ArticleDOI
TL;DR: This paper presents an approach to detect and minimize the violations of the so-called conservativity principle where novel subsumption entailments between named concepts in one of the input ontologies are considered as unwanted.
Abstract: In order to enable interoperability between ontology-based systems, ontology matching techniques have been proposed. However, when the generated mappings lead to undesired logical consequences, their usefulness may be diminished. In this paper, we present an approach to detect and minimize the violations of the so-called conservativity principle where novel subsumption entailments between named concepts in one of the input ontologies are considered as unwanted. The practical applicability of the proposed approach is experimentally demonstrated on the datasets from the Ontology Alignment Evaluation Initiative.

Journal ArticleDOI
TL;DR: This paper classifies smart city data in sensitive, quasi-sensitive, and open/public levels and then suggests different strategies to process and publish the data within these categories, including data collection, cleansing, anonymization, and publishing.
Abstract: Smart city data come from heterogeneous sources including various types of the Internet of Things such as traffic, weather, pollution, noise, and portable devices. They are characterized with diverse quality issues and with different types of sensitive information. This makes data processing and publishing challenging. In this paper, we propose a framework to streamline smart city data management, including data collection, cleansing, anonymization, and publishing. The paper classifies smart city data in sensitive, quasi-sensitive, and open/public levels and then suggests different strategies to process and publish the data within these categories. The paper evaluates the framework using a real-world smart city data set, and the results verify its effectiveness and efficiency. The framework can be a generic solution to manage smart city data.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed algorithm mines high utility patterns more efficiently than the state-of-the-art algorithms.
Abstract: High utility pattern mining has been studied as an essential topic in the field of pattern mining in order to satisfy requirements of many real-world applications that need to process non-binary databases including item importance such as market analysis. In this paper, we propose an efficient algorithm with a novel indexed list-based data structure for mining high utility patterns. Previous approaches first generate an enormous number of candidate patterns on the basis of overestimation methods in their mining processes and then identify actual high utility patterns from the candidates through an additional database scan, which leads to high computational overheads. Although several list-based algorithms to discover high utility patterns without candidate generation have been suggested in recent years, they require a large number of comparison operations. Our method facilitates efficient mining of high utility patterns with the proposed indexed list by effectively reducing the total number of such operations. Moreover, we develop two techniques based on this novel data structure to further enhance the mining performance of the proposed method. Experimental results on real and synthetic datasets show that the proposed algorithm mines high utility patterns more efficiently than the state-of-the-art algorithms.

Journal ArticleDOI
TL;DR: A detailed survey of recent applications of business analytics to churn, with a focus on computational intelligence methods, is provided, preceded by an in-depth discussion of churn within the context of customer continuity management.
Abstract: Globalization processes and market deregulation policies are rapidly changing the competitive environments of many economic sectors. The appearance of new competitors and technologies leads to an increase in competition and, with it, a growing preoccupation among service-providing companies with creating stronger customer bonds. In this context, anticipating the customer's intention to abandon the provider, a phenomenon known as churn, becomes a competitive advantage. Such anticipation can be the result of the correct application of information-based knowledge extraction in the form of business analytics. In particular, the use of intelligent data analysis, or data mining, for the analysis of market surveyed information can be of great assistance to churn management. In this paper, we provide a detailed survey of recent applications of business analytics to churn, with a focus on computational intelligence methods. This is preceded by an in-depth discussion of churn within the context of customer continuity management. The survey is structured according to the stages identified as basic for the building of the predictive models of churn, as well as according to the different types of predictive methods employed and the business areas of their application.

Journal ArticleDOI
TL;DR: The main contributions of the proposed method are twofold: based on the interests and activities of users on networks, some small communities of similar users are discovered, and then, by using social relations, the discovered communities are extended.
Abstract: Recently, social networking sites have been offering a rich resource of heterogeneous data. The analysis of such data can lead to the discovery of unknown information and relations in these networks. The detection of communities including 'similar' nodes is a challenging topic in the analysis of social network data, and it has been widely studied in the social networking community in the context of the underlying graph structure. Online social networks, in addition to having graph structures, include effective user information within the networks. Using this information helps to enhance the quality of community discovery. In this study, a method of community discovery is provided. Besides the communication among nodes, content information is used as well to improve the quality of the discovered communities. This is a new approach based on frequent patterns and the actions of users on networks, particularly social networking sites where users carry out their preferred activities. The main contributions of the proposed method are twofold: First, based on the interests and activities of users on networks, some small communities of similar users are discovered, and then, by using social relations, the discovered communities are extended. The F-measure is used to evaluate the results on two real-world datasets (Blogcatalog and Flickr), demonstrating that the proposed method improves the quality of community detection.

Journal ArticleDOI
TL;DR: The results show that the graph-based approach is able to handle the specification, integration and analysis of enterprise models represented with different modelling languages, and that the integration challenge resides in defining appropriate mapping functions between the schemas.
Abstract: Enterprise models assist the governance and transformation of organizations through the specification, communication and analysis of strategy, goals, processes, information, along with the underlying application and technological infrastructure. Such models cross-cut different concerns and are often conceptualized using domain-specific modelling languages. This paper explores the application of graph-based semantic techniques to specify, integrate and analyse multiple, heterogeneous enterprise models. In particular, the proposal described in this paper (1) specifies enterprise models as ontological schemas, (2) uses transformation mapping functions to integrate the ontological schemas and (3) analyses the integrated schemas with graph querying and logical inference. The proposal is evaluated through a scenario that integrates three distinct enterprise modelling languages: the business model canvas, e3value, and the business layer of the ArchiMate language. The results show, on the one hand, that the graph-based approach is able to handle the specification, integration and analysis of enterprise models represented with different modelling languages and, on the other, that the integration challenge resides in defining appropriate mapping functions between the schemas.

Journal ArticleDOI
TL;DR: In this paper, the density function is defined as the distance along the shortest path between each majority instance and a minority-cluster pseudo-centroid in an underlying cluster graph.
Abstract: Class imbalance is a challenging problem that manifests as unsatisfactory classification performance on a minority class. A trivial classifier tends to misclassify minority instances because of their tiny fraction. In this paper, our density function is defined as the distance along the shortest path between each majority instance and a minority-cluster pseudo-centroid in an underlying cluster graph. A short path implies highly overlapping dense minority instances. In contrast, a long path indicates a sparsity of instances. A new under-sampling algorithm is proposed to eliminate majority instances with low distances because these instances are insignificant and obscure the classification boundary in the overlapping region. The results show predictive improvements on a minority class from various classifiers on different UCI datasets.
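
A rough sketch of the idea under stated assumptions: approximate the cluster graph with a k-nearest-neighbour graph, use the minority instance nearest to the minority mean as a pseudo-centroid, and drop the majority instances whose shortest-path distance to it is smallest. The graph construction and the removal ratio are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def graph_undersample(X, y, removal_ratio=0.3, k=5):
    """Remove the majority instances closest (along graph paths) to a minority pseudo-centroid."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    A = kneighbors_graph(X, k, mode="distance")          # weighted k-NN graph
    G = nx.from_scipy_sparse_array(A)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    # Minority instance nearest to the minority mean acts as the pseudo-centroid.
    centroid = minority[np.argmin(np.linalg.norm(X[minority] - X[minority].mean(0), axis=1))]
    dist = nx.single_source_dijkstra_path_length(G, int(centroid), weight="weight")
    ranked = sorted(majority, key=lambda i: dist.get(i, np.inf))   # closest first
    drop = set(ranked[: int(removal_ratio * len(majority))])
    keep = np.array([i for i in range(len(y)) if i not in drop])
    return X[keep], y[keep]
```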

Journal ArticleDOI
TL;DR: A comprehensive review of data analytics paradigms for intrusion detection along with an overview of techniques that apply contextual information in a layered manner with consistent, coherent, and feasible evidence toward the correct prediction of cyber-attacks is presented.
Abstract: Research in cyber-security has demonstrated that dealing with cyber-attacks is by no means an easy task. One particular limitation of existing research originates from the uncertainty of information that is gathered to discover attacks. This uncertainty is partly due to the lack of attack prediction models that utilize contextual information to analyze activities that target computer networks. The focus of this paper is a comprehensive review of data analytics paradigms for intrusion detection along with an overview of techniques that apply contextual information for intrusion detection. A new research taxonomy is introduced consisting of several dimensions of data mining techniques, which create attack prediction models. The survey reveals the need to use multiple categories of contextual information in a layered manner with consistent, coherent, and feasible evidence toward the correct prediction of cyber-attacks.

Journal ArticleDOI
TL;DR: Analysis of the intrinsic complexity of several microarray datasets with and without feature selection, and its connection with the empirical results obtained by four widely used classifiers, demonstrates that a correlation exists between microarray data complexity and the classification error rates.
Abstract: Data complexity analysis enables an understanding of whether classification performance could be affected, not by algorithm limitations, but by intrinsic data characteristics. Microarray datasets based on high numbers of gene expressions combined with small sample sizes represent a particular challenge for machine learning researchers. This type of data also has other particularities that may negatively affect the generalization capacity of classifiers, such as overlaps between classes and class imbalance. Making use of several complexity measures, we analyzed the intrinsic complexity of several microarray datasets with and without feature selection and then explored the connection with the empirical results obtained by four widely used classifiers. Experimental results for 21 binary and multiclass datasets demonstrate that a correlation exists between microarray data complexity and the classification error rates.
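
As a concrete example of such a measure, Fisher's discriminant ratio (often called F1 in the data-complexity literature) scores the best single-feature separation between two classes; the paper uses a set of complexity measures, of which this is just one representative.

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Fisher's discriminant ratio (the F1 complexity measure) for a binary dataset.

    Higher values mean the classes are well separated along at least one feature;
    values near zero indicate heavy class overlap, which tends to raise error rates.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12   # avoid division by zero
    return float(np.max(num / den))
```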

Journal ArticleDOI
TL;DR: This paper proposes a new online transfer learning algorithm that merges and leverages the classifiers of the source and target domain with an ensemble method and demonstrates that the algorithm outperforms the compared baseline algorithms.
Abstract: Transfer learning aims to enhance performance in a target domain by exploiting useful information from auxiliary or source domains when the labeled data in the target domain are insufficient or difficult to acquire. In some real-world applications, the data of source domain are provided in advance, but the data of target domain may arrive in a stream fashion. This kind of problem is known as online transfer learning. In practice, there can be several source domains that are related to the target domain. The performance of online transfer learning is highly associated with selected source domains, and simply combining the source domains may lead to unsatisfactory performance. In this paper, we seek to promote classification performance in a target domain by leveraging labeled data from multiple source domains in online setting. To achieve this, we propose a new online transfer learning algorithm that merges and leverages the classifiers of the source and target domain with an ensemble method. The mistake bound of the proposed algorithm is analyzed, and the comprehensive experiments on three real-world data sets illustrate that our algorithm outperforms the compared baseline algorithms.
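
A hedged sketch of one way such an ensemble can work: pretrained source classifiers and an online target classifier vote with weights that are multiplicatively decreased when a member errs, while the target model keeps learning from the stream. The exact weight update and mistake bound in the paper differ from this simplified version.

```python
import numpy as np
from sklearn.linear_model import Perceptron

class OnlineTransferEnsemble:
    """Weighted vote over pretrained source classifiers and an online target classifier."""

    def __init__(self, source_models, classes, beta=0.8):
        self.sources = list(source_models)        # already trained on the source domains
        self.target = Perceptron()                # learned online on the target stream
        self.classes = classes
        self.beta = beta
        self.w = np.ones(len(self.sources) + 1)   # one weight per ensemble member
        self.target_ready = False

    def _members(self):
        return self.sources + ([self.target] if self.target_ready else [])

    def predict(self, x):                         # x: a single example as a 2D array
        votes = {}
        for weight, model in zip(self.w, self._members()):
            label = model.predict(x)[0]
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)

    def update(self, x, y):                       # y: true label as a length-1 array
        for i, model in enumerate(self._members()):
            if model.predict(x)[0] != y[0]:
                self.w[i] *= self.beta            # penalise the mistaken member
        self.target.partial_fit(x, y, classes=self.classes)
        self.target_ready = True
```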

Journal ArticleDOI
TL;DR: This paper proposes SGMatch, which represents graphs in smaller units called graphlets and develops a matching technique to leverage this representation; SGMatch substantially improves the performance of current state-of-the-art techniques for larger query graphs with different structures, i.e., cliques, paths or subgraphs.
Abstract: Graphs are natural candidates for modeling application domains, such as social networks, pattern recognition, citation networks, or protein---protein interactions. One of the most challenging tasks in managing graphs is subgraph matching over data graphs, which attempts to find one-to-one correspondences, called solutions, among the query and data nodes. To compute solutions, most contemporary techniques use backtracking and recursion. An open research question is whether graphs can be matched based on parts and local solutions can be combined to reach a global matching. In this paper, we present an approach based on graph decomposition called SGMatch to match graphs. We represent graphs in smaller units called graphlets and develop a matching technique to leverage this representation. Pruning strategies use a new notion of edge covering called minimum hub cover and metadata, such as statistics and inverted indices, to reduce the number of matching candidates. Our evaluation of SGMatch versus contemporary algorithms, i.e., VF2, GraphQL, QuickSI, GADDI, or SPath, shows that SGMatch substantially improves the performance of current state-of-the-art techniques for larger query graphs with different structures, i.e., cliques, paths or subgraphs.