
Showing papers in "Knowledge and Information Systems in 2014"


Journal ArticleDOI
TL;DR: A sensitivity analysis-based method for explaining prediction models that can be applied to any type of classification or regression model, and which is equivalent to commonly used additive model-specific methods when explaining an additive model.
Abstract: We present a sensitivity analysis-based method for explaining prediction models that can be applied to any type of classification or regression model. Its advantage over existing general methods is that all subsets of input features are perturbed, so interactions and redundancies between features are taken into account. Furthermore, when explaining an additive model, the method is equivalent to commonly used additive model-specific methods. We illustrate the method's usefulness with examples from artificial and real-world data sets and an empirical analysis of running times. Results from a controlled experiment with 122 participants suggest that the method's explanations improved the participants' understanding of the model.
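To make the perturbation idea concrete, here is a minimal Monte Carlo sketch of subset-perturbation explanations, not the authors' exact algorithm: model is assumed to be any callable returning a numeric prediction, X_background a sample of training rows used to fill in hidden features, and explain_instance is a hypothetical helper name.

import numpy as np

def explain_instance(model, x, X_background, n_samples=200, seed=None):
    # Contribution of feature i = average change in the model output when feature i
    # is revealed on top of a random subset of other "known" features, with the
    # hidden features replaced by values drawn from background data.
    rng = np.random.default_rng(seed)
    n_features = x.shape[0]
    contributions = np.zeros(n_features)
    for i in range(n_features):
        diffs = []
        for _ in range(n_samples):
            known = rng.random(n_features) < 0.5                  # random feature subset
            background = X_background[rng.integers(len(X_background))]
            x_with = np.where(known, x, background)
            x_with[i] = x[i]                                      # feature i revealed
            x_without = x_with.copy()
            x_without[i] = background[i]                          # feature i hidden
            diffs.append(model(x_with.reshape(1, -1))[0] - model(x_without.reshape(1, -1))[0])
        contributions[i] = np.mean(diffs)
    return contributions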

1,024 citations


Journal ArticleDOI
TL;DR: A probabilistic framework for classifier combination is proposed, which gives rigorous optimality conditions (minimum classification error) for four combination methods: majority vote, weighted majority vote, recall combiner and the naive Bayes combiner, revealing that there is no definitive best combiner among the four candidates.
Abstract: We propose a probabilistic framework for classifier combination, which gives rigorous optimality conditions (minimum classification error) for four combination methods: majority vote, weighted majority vote, recall combiner and the naive Bayes combiner. The framework is based on two assumptions: class-conditional independence of the classifier outputs and an assumption about the individual accuracies. The four combiners are derived subsequently from one another, by progressively relaxing and then eliminating the second assumption. In parallel, the number of the trainable parameters increases from one combiner to the next. Simulation studies reveal that if the parameter estimates are accurate and the first assumption is satisfied, the order of preference of the combiners is: naive Bayes, recall, weighted majority and majority. By inducing label noise, we expose a caveat coming from the stability-plasticity dilemma. Experimental results with 73 benchmark data sets reveal that there is no definitive best combiner among the four candidates, giving a slight preference to naive Bayes. This combiner was better for problems with a large number of fairly balanced classes while weighted majority vote was better for problems with a small number of unbalanced classes.
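As a quick illustration of two of the four combiners, here is a minimal sketch (plain NumPy, helper names ours, not from the paper); the log-odds weights are the classic choice under the class-conditional independence assumption.

import numpy as np

def majority_vote(labels):
    # labels: 1-D array of non-negative integer class predictions, one per classifier.
    return int(np.bincount(labels).argmax())

def weighted_majority_vote(labels, accuracies, n_classes):
    # Weight each classifier's vote by log(p / (1 - p)) of its estimated accuracy p.
    p = np.clip(np.asarray(accuracies, dtype=float), 1e-6, 1 - 1e-6)
    weights = np.log(p / (1 - p))
    scores = np.zeros(n_classes)
    for label, w in zip(labels, weights):
        scores[label] += w
    return int(scores.argmax())

# Example: three classifiers vote [1, 0, 1]; the much more accurate one carries the decision.
print(majority_vote(np.array([1, 0, 1])))                       # 1
print(weighted_majority_vote([1, 0, 1], [0.6, 0.9, 0.55], 2))   # 0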

194 citations


Journal ArticleDOI
TL;DR: It is shown that coordinate descent-based methods have a more efficient update rule than ALS and faster, more stable convergence than SGD, and it is empirically demonstrated that CCD++ is much faster than ALS and SGD in both multi-core and distributed settings.
Abstract: Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating least squares (ALS) and stochastic gradient descent (SGD) are two popular approaches to compute matrix factorization, and there has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been thoroughly explored. In this paper, we show that coordinate descent-based methods have a more efficient update rule compared to ALS and have faster and more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, with a synthetic dataset containing 14.6 billion ratings, on a distributed memory cluster with 64 processors, to deliver the desired test RMSE, CCD++ is 49 times faster than SGD and 20 times faster than ALS. When the number of processors is increased to 256, CCD++ takes only 16 s and is still 40 times faster than SGD and 20 times faster than ALS.
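The rank-one coordinate descent idea can be sketched in a few lines. The toy below (dense Python dict of observed ratings, integer user/item ids starting at 0, function name ccd_rank_one_sketch is ours) illustrates the update rule, not the parallel CCD++ implementation evaluated in the paper.

import numpy as np

def ccd_rank_one_sketch(R_obs, k=10, n_outer=10, n_inner=3, lam=0.1, seed=0):
    # R_obs: dict mapping (user, item) -> rating, with small integer ids.
    rng = np.random.default_rng(seed)
    n_u = max(u for u, _ in R_obs) + 1
    n_i = max(i for _, i in R_obs) + 1
    W = rng.random((n_u, k)) * 0.01
    H = rng.random((n_i, k)) * 0.01
    resid = {(u, i): r - W[u] @ H[i] for (u, i), r in R_obs.items()}
    for _ in range(n_outer):
        for t in range(k):                                # update one rank-one factor at a time
            for (u, i) in resid:                          # add factor t's contribution back in
                resid[(u, i)] += W[u, t] * H[i, t]
            for _ in range(n_inner):                      # alternate closed-form scalar updates
                num, den = np.zeros(n_u), np.full(n_u, lam)
                for (u, i), r in resid.items():
                    num[u] += r * H[i, t]
                    den[u] += H[i, t] ** 2
                W[:, t] = num / den
                num, den = np.zeros(n_i), np.full(n_i, lam)
                for (u, i), r in resid.items():
                    num[i] += r * W[u, t]
                    den[i] += W[u, t] ** 2
                H[:, t] = num / den
            for (u, i) in resid:                          # remove the updated contribution again
                resid[(u, i)] -= W[u, t] * H[i, t]
    return W, H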

147 citations


Journal ArticleDOI
TL;DR: An efficient utility mining approach that adopts an indexing mechanism to speed up execution and reduce the memory requirement of the mining process, together with a pruning strategy that reduces the number of unpromising itemsets considered during mining.
Abstract: Recently, utility mining has been widely discussed in the field of data mining. It finds high utility itemsets by considering both profits and quantities of items in transactional data sets. However, most of the existing approaches are based on the principle of levelwise processing, as in the traditional two-phase utility mining algorithm, to find high utility itemsets. In this paper, we propose an efficient utility mining approach that adopts an indexing mechanism to speed up the execution and reduce the memory requirement in the mining process. The indexing mechanism can imitate the traditional projection algorithms to achieve the aim of projecting sub-databases for mining. In addition, a pruning strategy is also applied to reduce the number of unpromising itemsets in mining. Finally, the experimental results on synthetic data sets and on a real data set show the superior performance of the proposed approach.
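For readers unfamiliar with the utility measure itself, a minimal sketch (toy data and names ours; the paper's contribution is the index and pruning built around this computation):

def itemset_utility(itemset, transactions, profits):
    # Utility of an itemset: over every transaction containing all of its items,
    # sum quantity(item) * profit(item) for the items of the itemset.
    total = 0
    for txn in transactions:                      # txn: dict {item: quantity}
        if all(item in txn for item in itemset):
            total += sum(txn[item] * profits[item] for item in itemset)
    return total

transactions = [{'a': 2, 'b': 1}, {'a': 1, 'c': 3}, {'b': 4}]
profits = {'a': 5, 'b': 2, 'c': 1}
print(itemset_utility({'a', 'b'}, transactions, profits))   # 2*5 + 1*2 = 12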

131 citations


Journal ArticleDOI
TL;DR: The results obtained show that methods using the One-vs-One strategy lead to better performance and more robust classifiers when dealing with noisy data, especially with the most disruptive noise schemes.
Abstract: The presence of noise in data is a common problem that produces several negative consequences in classification problems. In multi-class problems, these consequences are aggravated in terms of accuracy, building time, and complexity of the classifiers. In these cases, an interesting approach to reduce the effect of noise is to decompose the problem into several binary subproblems, reducing the complexity and, consequently, dividing the effects caused by noise into each of these subproblems. This paper analyzes the usage of decomposition strategies, and more specifically the One-vs-One scheme, to deal with noisy multi-class datasets. In order to investigate whether the decomposition is able to reduce the effect of noise or not, a large number of datasets are created introducing different levels and types of noise, as suggested in the literature. Several well-known classification algorithms, with or without decomposition, are trained on them in order to check when decomposition is advantageous. The results obtained show that methods using the One-vs-One strategy lead to better performances and more robust classifiers when dealing with noisy data, especially with the most disruptive noise schemes.
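The One-vs-One decomposition itself is easy to sketch (scikit-learn also ships a ready-made OneVsOneClassifier; the toy below, with hypothetical helper names, just makes the scheme explicit):

from itertools import combinations
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def train_ovo(X, y, base=DecisionTreeClassifier()):
    # One binary classifier per pair of classes, trained only on that pair's instances.
    models = {}
    for ci, cj in combinations(np.unique(y), 2):
        mask = np.isin(y, [ci, cj])
        models[(ci, cj)] = clone(base).fit(X[mask], y[mask])
    return models

def predict_ovo(models, x):
    # Aggregate the pairwise outputs by simple voting.
    votes = {}
    for pair, m in models.items():
        pred = m.predict(x.reshape(1, -1))[0]
        votes[pred] = votes.get(pred, 0) + 1
    return max(votes, key=votes.get)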

124 citations


Journal ArticleDOI
TL;DR: This research focuses on identifying key values of components to represent underlying characteristics of the linguistic phenomenon of irony by means of three conceptual layers, and shows how complex and subjective the task of automatically detecting irony can be.
Abstract: It is well known that irony is one of the most subtle devices used to, in a refined way and without a negation marker, deny what is literally said. As such, its automatic detection would represent valuable knowledge regarding tasks as diverse as sentiment analysis, information extraction, or decision making. The research described in this article is focused on identifying key values of components to represent underlying characteristics of this linguistic phenomenon. In the absence of a negation marker, we focus on representing the core of irony by means of three conceptual layers. These layers involve 8 different textual features. By representing four available data sets with these features, we try to find hints about how to deal with this unexplored task from a computational point of view. Our findings are assessed by human annotators in two strata: isolated sentences and entire documents. The results show how complex and subjective the task of automatically detecting irony could be.

121 citations


Journal ArticleDOI
TL;DR: This paper proposes a hybrid Web service tag recommendation strategy, named WSTRec, which employs tag co-occurrence, tag mining, and semantic relevance measurement for tag recommendation.
Abstract: Clustering Web services would greatly boost the ability of Web service search engines to retrieve relevant services. The performance of traditional Web service description language (WSDL)-based Web service clustering is not satisfactory, because it relies on a single data source. Recently, Web service search engines such as Seekda! allow users to manually annotate Web services using tags, which describe functions of Web services or provide additional contextual and semantic information. In this paper, we cluster Web services by utilizing both WSDL documents and tags. To handle the clustering performance limitation caused by uneven tag distribution and noisy tags, we propose a hybrid Web service tag recommendation strategy, named WSTRec, which employs tag co-occurrence, tag mining, and semantic relevance measurement for tag recommendation. Extensive experiments are conducted based on our real-world dataset, which consists of 15,968 Web services. The experimental results demonstrate the effectiveness of our proposed service clustering and tag recommendation strategies. Specifically, compared with traditional WSDL-based Web service clustering approaches, the proposed approach produces gains of up to 14 % in both precision and recall in most cases.

119 citations


Journal ArticleDOI
TL;DR: It is shown that the main bottleneck in mining such massive datasets is the time taken to build the index, and a novel bulk loading mechanism is introduced, the first of its kind specifically tailored to a time series index.
Abstract: There is an increasingly pressing need, by several applications in diverse domains, for developing techniques able to index and mine very large collections of time series. Examples of such applications come from astronomy, biology, the web, and other domains. It is not unusual for these applications to involve numbers of time series in the order of hundreds of millions to billions. However, all relevant techniques that have been proposed in the literature so far have not considered any data collections much larger than one million time series. In this paper, we describe iSAX 2.0 and its improvements, iSAX 2.0 Clustered and iSAX2+, three methods designed for indexing and mining truly massive collections of time series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of its kind specifically tailored to a time series index. We show how our methods allow mining on datasets that would otherwise be completely untenable, including the first published experiments to index one billion time series, and experiments in mining massive data from domains as diverse as entomology, DNA and web-scale image collections.
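For context, the indexes above build on the SAX symbolic representation; a minimal sketch of plain SAX (not the variable-cardinality iSAX encoding or the bulk-loading machinery) is:

import numpy as np
from scipy.stats import norm

def sax_word(ts, n_segments=8, alphabet_size=4):
    # Assumes len(ts) is divisible by n_segments.
    ts = (ts - ts.mean()) / (ts.std() + 1e-12)           # z-normalize
    paa = ts.reshape(n_segments, -1).mean(axis=1)        # piecewise aggregate approximation
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    return np.searchsorted(breakpoints, paa)             # one symbol index per segment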

111 citations


Journal ArticleDOI
TL;DR: A set of similarity criteria derived from a user study conducted with a set of OLAP practitioners and researchers is proposed and a function for estimating the similarity between OLAP queries based on three components: the query group-by set, its selection predicate, and the measures required in output is proposed.
Abstract: OLAP queries are not normally formulated in isolation, but in the form of sequences called OLAP sessions. Recognizing that two OLAP sessions are similar would be useful for different applications, such as query recommendation and personalization; however, the problem of measuring OLAP session similarity has not been studied so far. In this paper, we aim at filling this gap. First, we propose a set of similarity criteria derived from a user study conducted with a set of OLAP practitioners and researchers. Then, we propose a function for estimating the similarity between OLAP queries based on three components: the query group-by set, its selection predicate, and the measures required in output. To assess the similarity of OLAP sessions, we investigate the feasibility of extending four popular methods for measuring similarity, namely the Levenshtein distance, the Dice coefficient, the tf–idf weight, and the Smith–Waterman algorithm. Finally, we experimentally compare these four extensions to show that the Smith–Waterman extension is the one that best captures the users’ criteria for session similarity.
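The Smith-Waterman extension can be pictured as local sequence alignment where the substitution score comes from the query similarity function; a hedged sketch (names and the score mapping are ours, not the paper's exact scoring scheme):

import numpy as np

def smith_waterman_sessions(s1, s2, query_sim, gap=-0.5):
    # s1, s2: lists of OLAP queries; query_sim(q1, q2) returns a similarity in [0, 1].
    def score(q1, q2):
        return 2.0 * query_sim(q1, q2) - 1.0     # map [0, 1] to [-1, 1]: dissimilar pairs penalize
    H = np.zeros((len(s1) + 1, len(s2) + 1))
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + score(s1[i - 1], s2[j - 1]),
                          H[i - 1, j] + gap,
                          H[i, j - 1] + gap)
    return H.max()                               # best local alignment score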

73 citations


Journal ArticleDOI
TL;DR: This paper presents an efficient method called NetSleuth, including a highly efficient algorithm to identify likely sets of seed nodes given a snapshot, and shows that the virus propagation ripple can be optimized in a principled way by maximizing likelihood.
Abstract: Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper, we answer this question affirmatively and give an efficient method called NetSleuth for the well-known susceptible-infected virus propagation model. Essentially, we are after that set of seed nodes that best explains the given snapshot. We propose to employ the minimum description length principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give a highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NetSleuth can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experimentation on our method shows high accuracy in the detection of seed nodes, in addition to the correct automatic identification of their number. Moreover, NetSleuth scales linearly in the number of nodes of the graph.

70 citations


Journal ArticleDOI
TL;DR: This paper organizes a survey on five major email mining tasks, namely spam detection, email categorization, contact analysis, email network property analysis and email visualization, and systematically reviews the commonly used techniques.
Abstract: Email is one of the most popular forms of communication nowadays, mainly due to its efficiency, low cost, and support for diverse types of information. In order to facilitate better usage of emails and explore business potential in emailing, various data mining techniques have been applied on email data. In this paper, we present a brief survey of the major research efforts on email mining. To emphasize the differences between email mining and general text mining, we organize our survey on five major email mining tasks, namely spam detection, email categorization, contact analysis, email network property analysis and email visualization. Those tasks are inherently incorporated into various usages of emails. We systematically review the commonly used techniques and also discuss the related software tools available.

Journal ArticleDOI
TL;DR: A new matrix factorization approach for bounded rating matrices, called Bounded Matrix Factorization (BMF), which imposes a lower and an upper bound on every estimated missing element of the rating matrix.
Abstract: Matrix factorization has been widely utilized as a latent factor model for solving the recommender system problem using collaborative filtering. For a recommender system, all the ratings in the rating matrix are bounded within a pre-determined range. In this paper, we propose a new improved matrix factorization approach for such a rating matrix, called Bounded Matrix Factorization (BMF), which imposes a lower and an upper bound on every estimated missing element of the rating matrix. We present an efficient algorithm to solve BMF based on the block coordinate descent method. We show that our algorithm is scalable for large matrices with missing elements on multicore systems with low memory. We present substantial experimental results illustrating that the proposed method outperforms state-of-the-art algorithms for recommender systems, such as stochastic gradient descent, alternating least squares with regularization, SVD++ and Bias-SVD, on real-world datasets such as Jester, Movielens, Book crossing, Online dating and Netflix.

Journal ArticleDOI
TL;DR: This paper achieves unprecedented results by finding “dense subgraph” patterns and combining them with techniques such as node orderings and compact data structures, and proposes a compact data structure that represents dense subgraphs without using virtual nodes.
Abstract: Compressed representations have become effective to store and access large Web and social graphs, in order to support various graph querying and mining tasks. The existing representations exploit various typical patterns in those networks and provide basic navigation support. In this paper, we obtain unprecedented results by finding "dense subgraph" patterns and combining them with techniques such as node orderings and compact data structures. On those representations, we support out-neighbor and out/in-neighbor queries, as well as mining queries based on the dense subgraphs. First, we propose a compression scheme for Web graphs that reduces edges by representing dense subgraphs with "virtual nodes"; over this scheme, we apply node orderings and other compression techniques. With this approach, we match the best current compression ratios that support out-neighbor queries (i.e., nodes pointed from a given node), using 1.0-1.8 bits per edge (bpe) on large Web graphs, and retrieving each neighbor of a node in 0.6-1.0 microseconds (µs). When supporting both out- and in-neighbor queries, instead, our technique generally offers the best time when using little space. If the reduced graph, instead, is represented with a compact data structure that supports bidirectional navigation, we obtain the most compact Web graph representations (0.9-1.5 bpe) that support out/in-neighbor navigation; yet, the time per neighbor extracted rises to around 5-20 µs. We also propose a compact data structure that represents dense subgraphs without using virtual nodes. It allows us to recover out/in-neighbors and answer other more complex queries on the dense subgraphs identified. This structure is not competitive on Web graphs, but on social networks, it achieves 4-13 bpe and 8-12 µs per out/in-neighbor retrieved, which improves upon all existing representations.
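The virtual-node trick is easy to illustrate in isolation (a toy sketch with names of our choosing; the paper's contribution also covers discovering the dense subgraphs and the compact structures layered on top):

def compress_biclique(edges, sources, targets, virtual_id):
    # Replace a complete bipartite pattern (every source links to every target)
    # by edges through one virtual node: |S|*|T| edges become |S| + |T|.
    edges = set(edges)
    for s in sources:
        for t in targets:
            edges.discard((s, t))
    edges.update((s, virtual_id) for s in sources)
    edges.update((virtual_id, t) for t in targets)
    return edges

E = {(a, b) for a in (1, 2, 3) for b in (4, 5, 6)}          # 3x3 biclique, 9 edges
E2 = compress_biclique(E, [1, 2, 3], [4, 5, 6], virtual_id=100)
print(len(E), len(E2))                                       # 9 6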

Journal ArticleDOI
TL;DR: Results show that the proposed new intrinsic IC computing method and the new relatedness measure correlated better with human judgments than related works.
Abstract: Computing semantic similarity/relatedness between concepts and words is an important issue of many research fields. Information theoretic approaches exploit the notion of Information Content (IC) that provides for a concept a better understanding of its semantics. In this paper, we present a complete IC metrics survey with a critical study. Then, we propose a new intrinsic IC computing method using taxonomical features extracted from an ontology for a particular concept. This approach quantifies the subgraph formed by the concept subsumers using the depth and the descendents count as taxonomical parameters. In a second part, we integrate this IC metric in a new parameterized multistrategy approach for measuring word semantic relatedness. This measure exploits the WordNet features such as the noun "is a" taxonomy, the nominalization relation allowing the use of verb "is a" taxonomy and the shared words (overlaps) in glosses. Our work has been evaluated and compared with related works using a wide set of benchmarks conceived for word semantic similarity/relatedness tasks. Obtained results show that our IC method and the new relatedness measure correlated better with human judgments than related works.
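The abstract names the two taxonomical parameters (descendant count and depth) but not the exact formula; the sketch below is only a plausible combination of a classic hyponym-based intrinsic IC with a normalized depth term, not the paper's definition:

import math

def intrinsic_ic(n_descendants, depth, max_nodes, max_depth):
    # Hyponym term (Seco-style): fewer descendants -> more specific -> higher IC.
    hypo_term = 1.0 - math.log(n_descendants + 1) / math.log(max_nodes)
    # Depth term: deeper concepts treated as more informative (assumed scaling, not the paper's).
    depth_term = math.log(depth + 1) / math.log(max_depth + 1)
    return hypo_term * depth_term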

Journal ArticleDOI
TL;DR: The proposed Exclusive and Complete Clustering (ExCC) algorithm captures non-overlapping clusters in data streams with mixed attributes, such that each point either belongs to some cluster or is an outlier/noise, and is adaptive to change in the data distribution.
Abstract: Continually advancing technology has made it feasible to capture data online for onward transmission as a steady flow of newly generated data points, termed a data stream. Continuity and unboundedness of data streams make storage of data and multiple scans of data an impractical proposition for the purpose of knowledge discovery. The need to learn structures from data in a streaming environment has been a driving force for making clustering a popular technique for knowledge discovery from data streams. The continuous nature of streaming data makes it infeasible to look for point membership among the clusters discovered so far, necessitating employment of a synopsis structure to consolidate incoming data points. This synopsis is exploited for building a clustering scheme to meet subsequent user demands. The proposed Exclusive and Complete Clustering (ExCC) algorithm captures non-overlapping clusters in data streams with mixed attributes, such that each point either belongs to some cluster or is an outlier/noise. It deploys a fixed granularity grid structure as synopsis and performs clustering by coalescing dense regions in the grid. Speed-based pruning is applied to the synopsis prior to clustering to ensure currency of discovered clusters. Extensive experimentation demonstrates that the algorithm is robust, identifies succinct outliers on-the-fly and is adaptive to change in the data distribution. The ExCC algorithm is further evaluated for performance and compared with other contemporary algorithms.
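A toy version of the fixed-granularity grid synopsis (class and threshold names are ours, not ExCC's):

from collections import defaultdict
import numpy as np

class GridSynopsis:
    # Counts streaming points per grid cell; dense cells are candidates for
    # coalescing into clusters, while sparse cells are treated as outliers/noise.
    def __init__(self, cell_width, density_threshold):
        self.cell_width = cell_width
        self.density_threshold = density_threshold
        self.counts = defaultdict(int)

    def _cell(self, point):
        return tuple(np.floor(np.asarray(point, dtype=float) / self.cell_width).astype(int))

    def insert(self, point):
        self.counts[self._cell(point)] += 1

    def dense_cells(self):
        return [c for c, n in self.counts.items() if n >= self.density_threshold]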

Journal ArticleDOI
TL;DR: A system that monitors the evolution of the learning process is presented; it is able to self-diagnose degradations of this process, using change detection mechanisms, and to self-repair the decision models.
Abstract: This work addresses the problem of mining data streams generated in dynamic environments where the distribution underlying the observations may change over time. We present a system that monitors the evolution of the learning process. The system is able to self-diagnose degradations of this process, using change detection mechanisms, and self-repair the decision models. The system uses meta-learning techniques that characterize the domain of applicability of previously learned models. The meta-learner can detect recurrence of contexts, using unlabeled examples, and take pro-active actions by activating previously learned models. The experimental evaluation on three text mining problems demonstrates the main advantages of the proposed system: it provides information about the recurrence of concepts and rapidly adapts decision models when drift occurs.

Journal ArticleDOI
TL;DR: A new metric, the stratified Brier score, is proposed to capture class-specific calibration, analogous to the per-class metrics widely used to assess the discriminative performance of classifiers in imbalanced scenarios, and previous work in this direction is extended with ample additional empirical evidence for the utility of the strategy.
Abstract: Obtaining good probability estimates is imperative for many applications. The increased uncertainty and typically asymmetric costs surrounding rare events increase this need. Experts (and classification systems) often rely on probabilities to inform decisions. However, we demonstrate that class probability estimates obtained via supervised learning in imbalanced scenarios systematically underestimate the probabilities for minority class instances, despite ostensibly good overall calibration. To our knowledge, this problem has not previously been explored. We propose a new metric, the stratified Brier score, to capture class-specific calibration, analogous to the per-class metrics widely used to assess the discriminative performance of classifiers in imbalanced scenarios. We propose a simple, effective method to mitigate the bias of probability estimates for imbalanced data that bags estimators independently calibrated over balanced bootstrap samples. This approach drastically improves performance on the minority instances without greatly affecting overall calibration. We extend our previous work in this direction by providing ample additional empirical evidence for the utility of this strategy, using both support vector machines and boosted decision trees as base learners. Finally, we show that additional uncertainty can be exploited via a Bayesian approach by considering posterior distributions over bagged probability estimates.
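The stratified Brier score is simple to compute; a hedged sketch for the binary case (helper name ours):

import numpy as np

def stratified_brier(y_true, p_pred, positive_class=1):
    # Brier score computed separately over each class's instances, so poor
    # calibration on the rare class is not averaged away by the majority class.
    y_true, p_pred = np.asarray(y_true), np.asarray(p_pred)
    scores = {}
    for c in np.unique(y_true):
        mask = y_true == c
        outcome = (y_true[mask] == positive_class).astype(float)
        scores[c] = float(np.mean((p_pred[mask] - outcome) ** 2))
    return scores

# Minority (class 1) probabilities systematically too low -> much worse class-1 score.
print(stratified_brier([0, 0, 0, 1, 1], [0.1, 0.2, 0.1, 0.4, 0.3]))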

Journal ArticleDOI
TL;DR: A library of common ontology alignment patterns as reusable templates of recurring correspondences is developed, based on a detailed analysis of frequent ontology mismatches, and an application of ontology alignment patterns to an ontology transformation service is described.
Abstract: Interoperability between heterogeneous ontological descriptions can be performed through ontology mediation techniques. At the heart of ontology mediation lies the alignment: a specification of correspondences between ontology entities. Ontology matching can bring some automation but is limited to finding simple correspondences. Design patterns have proven themselves useful to capture experience in design problems. In this article, we introduce ontology alignment patterns as reusable templates of recurring correspondences. Based on a detailed analysis of frequent ontology mismatches, we develop a library of common patterns. Ontology alignment patterns can be used to refine correspondences, either by the alignment designer or via pattern detection algorithms. We distinguish three levels of abstraction for ontology alignment representation, going from executable transformation rules, to concrete correspondences between two ontologies, to ontology alignment patterns at the third level. We express patterns using an ontology alignment representation language, making them ready to use in practical mediation tasks. We extract mismatches from vocabularies associated with data sets published as linked open data, and we evaluate the ability of correspondence patterns to provide proper alignments for these mismatches. Finally, we describe an application of ontology alignment patterns for an ontology transformation service.

Journal ArticleDOI
TL;DR: Experimental results illustrate that NSA, especially NSA with a consistency weight adjusting strategy (NSA+), significantly outperforms particle swarm optimization and a clonal selection algorithm on the QoS-aware service selection problem.
Abstract: Web service selection, as an important part of web service composition, has a direct influence on the quality of the composite service. Many works have been carried out in recent years to find efficient algorithms for the quality of service (QoS)-aware service selection problem. In this paper, a negative selection immune algorithm (NSA) is proposed; as far as we know, this is the first time that NSA has been introduced into the web service selection problem. Domain terms and operations of NSA are first redefined for the QoS-aware service selection problem. NSA is then constructed to demonstrate how the negative selection principle can be used to solve this question. Thirdly, an inconsistency analysis between local exploitation and global planning is presented, through which a local alteration of a composite service scheme can be transferred to the global exploration correctly; it is a general adjusting method and independent of the algorithm. Finally, extensive experimental results illustrate that NSA, especially NSA with a consistency weight adjusting strategy (NSA+), significantly outperforms particle swarm optimization and a clonal selection algorithm on the QoS-aware service selection problem. The superiority of NSA+ over the others becomes more evident as the number of component tasks and related candidate services increases.

Journal ArticleDOI
Wei Liu, Su Peng, Wei Du, Wei Wang, Guo Sun Zeng
TL;DR: This work builds a security overhead model to reasonably measure the security overheads incurred by the sensitive data and develops a data placement strategy to dynamically place the intermediate data for the scientific workflows.
Abstract: Massive computation power and storage capacity of cloud computing systems allow scientists to deploy data-intensive applications without the infrastructure investment, where large application datasets can be stored in the cloud. Based on the pay-as-you-go model, data placement strategies have been developed to cost-effectively store large volumes of generated datasets in the scientific cloud workflows. As promising as it is, this paradigm also introduces many new challenges for data security when the users outsource sensitive data for sharing on the cloud servers, which are not within the same trusted domain as the data owners. This challenge is further complicated by the security constraints on the potential sensitive data for the scientific workflows in the cloud. To effectively address this problem, we propose a security-aware intermediate data placement strategy. First, we build a security overhead model to reasonably measure the security overheads incurred by the sensitive data. Second, we develop a data placement strategy to dynamically place the intermediate data for the scientific workflows. Finally, our experimental results show that our strategy can effectively improve the intermediate data security while ensuring the data transfer time during the execution of scientific workflows.

Journal ArticleDOI
TL;DR: The complexity of the clustering model is proved, and various pruning strategies are explored to design the efficient algorithm GAMer (Graph & Attribute Miner), which achieves low runtimes and high clustering quality.
Abstract: In this work, we propose a new method to find homogeneous object groups in a single vertex-labeled graph. The basic premise is that many prevalent datasets consist of multiple types of information: graph data to represent the relations between objects and attribute data to characterize the single objects. Analyzing both information types simultaneously can increase the expressiveness of the resulting patterns. Our patterns of interest are sets of objects that are densely connected within the associated graph and also show high similarity regarding their attributes. Since full-space clustering of attribute data is known to often be futile, we analyze the similarity of objects with respect to subsets of their attributes. In order to take full advantage of all present information, we combine the paradigms of dense subgraph mining and subspace clustering. For our approach, we face several challenges to achieve a sound combination of the two paradigms. We maximize our twofold clusters according to their density, size, and number of relevant dimensions. The optimization of these three objectives usually is conflicting; thus, we realize a trade-off between these characteristics to obtain meaningful patterns. We develop a redundancy model to confine the clustering to a manageable size by selecting only the most interesting clusters for the result set. We prove the complexity of our clustering model and we particularly focus on the exploration of various pruning strategies to design the efficient algorithm GAMer (Graph & Attribute Miner). In thorough experiments on synthetic and real-world data we show that GAMer achieves low runtimes and high clustering qualities. We provide all datasets, measures, executables, and parameter settings on our website http://dme.rwth-aachen.de/gamer .

Journal ArticleDOI
TL;DR: A novel, hubness-aware secondary similarity measure, which takes into account both the supervised and the unsupervised hubness information, is proposed and an extensive experimental evaluation shows it to be much more appropriate for high-dimensional data classification than the standard measure.
Abstract: Learning from high-dimensional data is usually quite challenging, as captured by the well-known phrase curse of dimensionality. Data analysis often involves measuring the similarity between different examples. This sometimes becomes a problem, as many widely used metrics tend to concentrate in high-dimensional feature spaces. The reduced contrast makes it more difficult to distinguish between close and distant points, which renders many traditional distance-based learning methods ineffective. Secondary distances based on shared neighbor similarities have recently been proposed as one possible solution to this problem. However, these initial metrics failed to take hubness into account. Hubness is a recently described aspect of the dimensionality curse, and it affects all sorts of k-nearest neighbor learning methods in severely negative ways. This paper is the first to discuss the impact of hubs on forming the shared neighbor similarity scores. We propose a novel, hubness-aware secondary similarity measure simhub_s and an extensive experimental evaluation shows it to be much more appropriate for high-dimensional data classification than the standard simcos_s measure. The proposed similarity changes the underlying kNN graph in such a way that it reduces the overall frequency of label mismatches in k-neighbor sets and increases the purity of occurrence profiles, which improves classifier performance. It is a hybrid measure, which takes into account both the supervised and the unsupervised hubness information. The analysis shows that both components are useful in their own ways and that the measure is therefore properly defined. This new similarity does not increase the overall computational cost, and the improvement is essentially ‘free’.
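For orientation, the standard shared-neighbor secondary similarity that the paper starts from looks roughly like this (the hubness-aware simhub_s additionally weights each shared neighbor by how informative it is, which is omitted in this sketch):

def shared_neighbor_similarity(knn_lists, i, j):
    # knn_lists[i]: the k nearest neighbors of point i under the primary metric.
    Ni, Nj = set(knn_lists[i]), set(knn_lists[j])
    k = len(knn_lists[i])
    return len(Ni & Nj) / k          # overlap of the two k-NN lists, normalized by k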

Journal ArticleDOI
TL;DR: A survey is presented which illustrates distinguishing features and provides a comparison of selected modern component-based frameworks for real-time embedded systems; the survey is restricted to frameworks that support the full development life cycle.
Abstract: The use of components significantly helps in development of real-time embedded systems. There have been a number of component frameworks developed for this purpose, and some of them have already become well established in this area. Even though these frameworks share the general idea of component-based development, they significantly differ in the range of supported features and maturity. This makes it relatively difficult to select the right component framework and thus poses a significant obstacle in adoption of the component-based development approach for developing real-time embedded systems. To provide guidance in choosing a component framework, or at least relevant concepts when building a custom framework, we present a survey, which illustrates distinguishing features and provides comparison of selected modern component-based frameworks for real-time embedded systems. Compared to other existing surveys, this survey focuses specifically on criteria connected with real-time and embedded systems. Further, to be practically relevant, we restrict the survey only to the frameworks that support the full development life cycle (i.e. from design till execution support). In this context, the survey illustrates the complexity of development in each framework by giving specification and code samples.

Journal ArticleDOI
TL;DR: This work proposes the Depth-First SPelling (DFSP) algorithm, which is faster than PrefixSpan, its leading competitor, and superior to other sequential pattern mining algorithms for biological sequences.
Abstract: Scientific progress in recent years has led to the generation of huge amounts of biological data, most of which remains unanalyzed. Mining the data may provide insights into various realms of biology, such as finding co-occurring biosequences, which are essential for biological data mining and analysis. Data mining techniques like sequential pattern mining may reveal implicitly meaningful patterns among the DNA or protein sequences. If biologists hope to unlock the potential of sequential pattern mining in their field, it is necessary to move away from traditional sequential pattern mining algorithms, because they have difficulty handling a small number of items and long sequences in biological data, such as gene and protein sequences. To address the problem, we propose the Depth-First SPelling (DFSP) algorithm for mining sequential patterns in biological sequences. The algorithm’s processing speed is faster than that of PrefixSpan, its leading competitor, and it is superior to other sequential pattern mining algorithms for biological sequences.

Journal ArticleDOI
TL;DR: Novel utility measurements based on more complex community-based graph models are proposed, together with a general k-anonymization framework that can be used with various utility measurements to achieve k-anonymity with small utility loss on given social networks.
Abstract: Privacy and utility are two main desiderata of good sensitive information publishing schemes. For publishing social networks, many existing algorithms rely on k-anonymity as a criterion to guarantee privacy protection. They reduce the utility loss by first using the degree sequence to model the structural properties of the original social network and then minimizing the changes on the degree sequence caused by the anonymization process. However, the degree sequence-based graph model is simple, and it fails to capture many important graph topological properties. Consequently, the existing anonymization algorithms that rely on this simple graph model to measure utility cannot guarantee generating anonymized social networks of high utility. In this paper, we propose novel utility measurements that are based on more complex community-based graph models. We also design a general k-anonymization framework, which can be used with various utility measurements to achieve k-anonymity with small utility loss on given social networks. Finally, we conduct extensive experimental evaluation on real datasets to evaluate the effectiveness of the new utility measurements proposed. The results demonstrate that our scheme achieves significant improvement on the utility of the anonymized social networks compared with the existing anonymization algorithms. The utility losses of many social network statistics of the anonymized social networks generated by our scheme are under 1 % in most cases.

Journal ArticleDOI
TL;DR: In this work, probabilistic opposition-based learning for particles is incorporated with PSO to enhance the convergence rate—it uses velocity clamping and inertia weights to control the position, speed and direction of particles to avoid premature convergence.
Abstract: A probabilistic opposition-based Particle Swarm Optimization algorithm with Velocity Clamping and inertia weights (OvcPSO) is designed for function optimization, to accelerate the convergence speed and improve solution accuracy on standard benchmark functions. In this work, probabilistic opposition-based learning for particles is incorporated with PSO to enhance the convergence rate; velocity clamping and inertia weights are used to control the position, speed and direction of particles to avoid premature convergence. A comprehensive set of 58 complex benchmark functions covering a wide range of dimensions has been used for experimental verification. It is evident from the results that OvcPSO can deal with complex optimization problems effectively and efficiently. A series of experiments have been performed to investigate the influence of population size and dimensions upon the performance of different PSO variants. OvcPSO also outperforms FDR-PSO, CLPSO, FIPS, CPSO-H and GOPSO on various benchmark functions. Last but not least, OvcPSO has also been compared with opposition-based differential evolution (ODE); it outperforms ODE with lower swarm populations and on higher-dimensional functions.
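The two ingredients named in the title are compact enough to sketch; the step below uses the standard inertia-weight PSO update with our own parameter defaults, not OvcPSO's tuned settings or its probabilistic opposition schedule:

import numpy as np

def opposite_position(x, lower, upper):
    # Opposition-based learning: the mirrored candidate of x within [lower, upper].
    return lower + upper - x

def pso_step(x, v, p_best, g_best, lower, upper, w=0.7, c1=1.5, c2=1.5, v_max=1.0, seed=None):
    # One inertia-weighted velocity/position update with velocity clamping.
    rng = np.random.default_rng(seed)
    v = w * v + c1 * rng.random(x.shape) * (p_best - x) + c2 * rng.random(x.shape) * (g_best - x)
    v = np.clip(v, -v_max, v_max)                 # velocity clamping
    x = np.clip(x + v, lower, upper)              # keep the particle inside the search box
    return x, v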

Journal ArticleDOI
TL;DR: This work proposes a multi-view algorithm based on constrained clustering that can operate with an incomplete mapping and shows that this approach significantly improves clustering performance over several other methods for transferring constraints and allows multi-view clustering to be reliably applied when given a limited mapping between the views.
Abstract: Multi-view learning algorithms typically assume a complete bipartite mapping between the different views in order to exchange information during the learning process. However, many applications provide only a partial mapping between the views, creating a challenge for current methods. To address this problem, we propose a multi-view algorithm based on constrained clustering that can operate with an incomplete mapping. Given a set of pairwise constraints in each view, our approach propagates these constraints using a local similarity measure to those instances that can be mapped to the other views, allowing the propagated constraints to be transferred across views via the partial mapping. It uses co-EM to iteratively estimate the propagation within each view based on the current clustering model, transfer the constraints across views, and then update the clustering model. By alternating the learning process between views, this approach produces a unified clustering model that is consistent with all views. We show that this approach significantly improves clustering performance over several other methods for transferring constraints and allows multi-view clustering to be reliably applied when given a limited mapping between the views. Our evaluation reveals that the propagated constraints have high precision with respect to the true clusters in the data, explaining their benefit to clustering performance in both single- and multi-view learning scenarios.

Journal ArticleDOI
TL;DR: From experimental results, it is observed that the proposed model exhibits higher accuracy than existing two-factor fuzzy time series models.
Abstract: The fuzzy time series forecasting method has been applied in several domains, such as stock market price, temperature, sales, crop production and academic enrollments. In this paper, we introduce a model to deal with forecasting problems of two factors. The proposed model is designed using fuzzy time series and artificial neural network. In a fuzzy time series forecasting model, the length of intervals in the universe of discourse always affects the results of forecasting. Therefore, an artificial neural network-based technique is employed for determining the intervals of the historical time series data sets by clustering them into different groups. The historical time series data sets are then fuzzified, and the high-order fuzzy logical relationships are established among fuzzified values based on the fuzzy time series method. The paper also introduces some rules for interval weighting to defuzzify the fuzzified time series data sets. From experimental results, it is observed that the proposed model exhibits higher accuracy than existing two-factor fuzzy time series models.

Journal ArticleDOI
TL;DR: This work demonstrates that a grammar-guided genetic programming approach, previously shown to be competitive in terms of scalability, expressiveness, flexibility and the ability to restrict the search space when discovering frequent association rules, is also appropriate for mining rare association rules, and it provides mechanisms to discard noise from the rare association rule set.
Abstract: To date, association rule mining has mainly focused on the discovery of frequent patterns. Nevertheless, it is often interesting to focus on those that do not frequently occur. Existing algorithms for mining this kind of infrequent pattern are mainly based on exhaustive search methods and can be applied only over categorical domains. In a previous work, the use of grammar-guided genetic programming for the discovery of frequent association rules was introduced, showing that this proposal was competitive in terms of scalability, expressiveness, flexibility and the ability to restrict the search space. The goal of this work is to demonstrate that this proposal is also appropriate for the discovery of rare association rules. This approach allows one to obtain solutions within specified time limits and does not require large amounts of memory, as current algorithms do. It also provides mechanisms to discard noise from the rare association rule set by applying four different and specific fitness functions, which are compared and studied in depth. Finally, this approach is compared with other existing algorithms for mining rare association rules, and an analysis of the mined rules is performed. As a result, this approach mines rare rules with low and consistent execution times. The experimental study shows that this proposal obtains a small and accurate set of rules close to the size specified by the data miner.

Journal ArticleDOI
TL;DR: This paper describes a new approach that forms overlapping mixed communities in a bipartite network based on dual optimization of modularity, with a second algorithm applied to the decomposition of vertices resulting from the evolutionary process.
Abstract: Many algorithms have been designed to discover community structure in networks. These algorithms are mostly dedicated to detecting disjoint communities. Very few of them are intended to discover overlapping communities, and bipartite networks in particular have hardly been explored for the detection of such communities. In this paper, we describe a new approach that forms overlapping mixed communities in a bipartite network based on dual optimization of modularity. To this end, we propose two algorithms. The first one is an evolutionary algorithm dedicated to global optimization of Newman's modularity on the line graph. This algorithm has been tested on well-known real benchmark networks and compared with several other existing methods of community detection in networks. The second one is an algorithm that locally optimizes the Mancoridis graph modularity, which we have adapted to bipartite graphs. Specifically, this second algorithm is applied to the decomposition of vertices resulting from the evolutionary process, and also characterizes the overlapping communities taking into account their semantic aspect. Our approach requires no a priori knowledge of the number of communities searched for in the network. We show its interest on two datasets, namely a group of synthetic networks and a real-world network whose structure is difficult to understand.
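As a reference point, the (disjoint-community) Newman modularity that the evolutionary algorithm optimizes on the line graph can be computed as follows; the overlapping, bipartite machinery of the paper is built on top of this quantity:

import numpy as np

def modularity(adj, communities):
    # Q = (1 / 2m) * sum_ij [A_ij - k_i * k_j / (2m)] * delta(c_i, c_j)
    # adj: symmetric adjacency matrix; communities: one label per node.
    m = adj.sum() / 2.0
    degrees = adj.sum(axis=1)
    q = 0.0
    for i in range(len(adj)):
        for j in range(len(adj)):
            if communities[i] == communities[j]:
                q += adj[i, j] - degrees[i] * degrees[j] / (2.0 * m)
    return q / (2.0 * m)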