
Showing papers in "IEEE Transactions on Knowledge and Data Engineering in 2005"


Journal ArticleDOI
TL;DR: This paper presents an overview of the field of recommender systems and describes the current generation of recommendation methods that are usually classified into the following three main categories: content-based, collaborative, and hybrid recommendation approaches.
Abstract: This paper presents an overview of the field of recommender systems and describes the current generation of recommendation methods that are usually classified into the following three main categories: content-based, collaborative, and hybrid recommendation approaches. This paper also describes various limitations of current recommendation methods and discusses possible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications. These extensions include, among others, an improvement of understanding of users and items, incorporation of the contextual information into the recommendation process, support for multicriteria ratings, and a provision of more flexible and less intrusive types of recommendations.
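
A hedged sketch of the collaborative category mentioned above (not the paper's own method): user-based filtering over a tiny, hypothetical rating matrix, with cosine similarity as the user-to-user weight.

# Minimal user-based collaborative filtering sketch (illustrative only).
# Ratings matrix: rows = users, columns = items, 0 = unrated (hypothetical data).
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    # Cosine similarity between two rating vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

def predict(user, item):
    # Weighted average of other users' ratings for the item,
    # weighted by their similarity to the target user.
    sims, vals = [], []
    for other in range(ratings.shape[0]):
        if other != user and ratings[other, item] > 0:
            sims.append(cosine_sim(ratings[user], ratings[other]))
            vals.append(ratings[other, item])
    return np.dot(sims, vals) / np.sum(sims) if sims else 0.0

print(predict(user=0, item=2))  # predicted rating of user 0 for item 2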

9,873 citations


Journal ArticleDOI
TL;DR: With the categorizing framework, the efforts toward building an integrated system for intelligent feature selection are continued, and an illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms.
Abstract: This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
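
As a minimal illustration of a filter-style evaluation criterion of the kind surveyed here (not the paper's unifying platform), the sketch below ranks features by their absolute correlation with the class label; the data and scoring choice are assumptions.

# Simple filter-based feature ranking sketch: score each feature by the
# absolute Pearson correlation between the feature and the class label.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # hypothetical data, 5 features
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(float)  # label depends on features 0 and 2

def correlation_scores(X, y):
    scores = []
    for j in range(X.shape[1]):
        c = np.corrcoef(X[:, j], y)[0, 1]
        scores.append(abs(c))
    return np.array(scores)

scores = correlation_scores(X, y)
ranked = np.argsort(scores)[::-1]                # best features first
print("feature ranking:", ranked, "scores:", np.round(scores[ranked], 3))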

2,605 citations


Journal ArticleDOI
TL;DR: It is shown theoretically and empirically that AUC is a better measure (defined precisely) than accuracy; well-established claims in machine learning based on accuracy are then reevaluated using AUC, yielding interesting and surprising new results.
Abstract: The area under the ROC (receiver operating characteristics) curve, or simply AUC, has been traditionally used in medical diagnosis since the 1970s. It has recently been proposed as an alternative single-number measure for evaluating the predictive ability of learning algorithms. However, no formal arguments were given as to why AUC should be preferred over accuracy. We establish formal criteria for comparing two different measures for learning algorithms and we show theoretically and empirically that AUC is a better measure (defined precisely) than accuracy. We then reevaluate well-established claims in machine learning based on accuracy using AUC and obtain interesting and surprising new results. For example, it has been well-established and accepted that Naive Bayes and decision trees are very similar in predictive accuracy. We show, however, that Naive Bayes is significantly better than decision trees in AUC. The conclusions drawn in this paper may make a significant impact on machine learning and data mining applications.
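
The accuracy-versus-AUC contrast is easy to reproduce on a toy imbalanced example. The sketch below uses scikit-learn (an assumption, not part of the paper) and a trivial majority-class predictor whose accuracy looks strong while its AUC reveals no ranking ability.

# Accuracy vs. AUC on a toy imbalanced problem (illustrative sketch;
# scikit-learn is assumed here, it is not part of the paper).
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 95 + [1] * 5)     # heavily imbalanced labels
scores = np.zeros(100)                    # a trivial scorer: same score for everyone
y_pred = (scores > 0.5).astype(int)       # always predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.95, looks good
print("AUC:", roc_auc_score(y_true, scores))        # 0.5, no ranking ability at all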

1,528 citations


Journal ArticleDOI
TL;DR: Experiments on UCI data sets and application to the Web page classification task indicate that tri-training can effectively exploit unlabeled data to enhance the learning performance.
Abstract: In many practical data mining applications, such as Web page classification, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain. Therefore, semi-supervised learning algorithms such as co-training have attracted much attention. In this paper, a new co-training style semi-supervised learning algorithm, named tri-training, is proposed. This algorithm generates three classifiers from the original labeled example set. These classifiers are then refined using unlabeled examples in the tri-training process. In detail, in each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. Since tri-training neither requires the instance space to be described with sufficient and redundant views nor does it put any constraints on the supervised learning algorithm, its applicability is broader than that of previous co-training style algorithms. Experiments on UCI data sets and application to the Web page classification task indicate that tri-training can effectively exploit unlabeled data to enhance the learning performance.
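
A minimal sketch of one tri-training-style round, assuming scikit-learn classifiers; it keeps only the basic agreement rule and omits the error-rate conditions that the full algorithm uses to decide when labeling is safe.

# Sketch of one tri-training-style round (simplified: only the agreement rule,
# without the error-rate conditions of the full algorithm). scikit-learn assumed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=300, random_state=0)
X_lab, y_lab, X_unlab = X[:60], y[:60], X[60:]      # small labeled pool, rest unlabeled

# Three classifiers trained on bootstrap samples of the labeled set.
clfs = []
for seed in range(3):
    Xb, yb = resample(X_lab, y_lab, random_state=seed)
    clfs.append(DecisionTreeClassifier(random_state=seed).fit(Xb, yb))

preds = np.array([c.predict(X_unlab) for c in clfs])   # shape (3, n_unlabeled)

# For classifier 0: adopt unlabeled examples on which classifiers 1 and 2 agree.
agree = preds[1] == preds[2]
X_new = np.vstack([X_lab, X_unlab[agree]])
y_new = np.concatenate([y_lab, preds[1][agree]])
clfs[0] = DecisionTreeClassifier(random_state=0).fit(X_new, y_new)
print("newly labeled examples for classifier 0:", agree.sum())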

1,067 citations



Journal ArticleDOI
TL;DR: A novel document clustering method that clusters documents into different semantic classes by using locality preserving indexing (LPI); LPI is shown to be an unsupervised approximation of the supervised linear discriminant analysis (LDA) method, which gives the intuitive motivation of the approach.
Abstract: We propose a novel document clustering method which aims to cluster the documents into different semantic classes. The document space is generally of high dimensionality and clustering in such a high dimensional space is often infeasible due to the curse of dimensionality. By using locality preserving indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which the documents related to the same semantics are close to each other. Different from previous document clustering methods based on latent semantic indexing (LSI) or nonnegative matrix factorization (NMF), our method tries to discover both the geometric and discriminating structures of the document space. Theoretical analysis of our method shows that LPI is an unsupervised approximation of the supervised linear discriminant analysis (LDA) method, which gives the intuitive motivation of our method. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets.

707 citations


Journal ArticleDOI
TL;DR: CHARM is an efficient algorithm for mining all frequent closed itemsets; it enumerates closed sets over a dual itemset-tidset search tree using an efficient hybrid search that skips many levels, and employs a technique called diffsets to reduce the memory footprint of intermediate computations.
Abstract: The set of frequent closed itemsets uniquely determines the exact frequency of all itemsets, yet it can be orders of magnitude smaller than the set of all frequent itemsets. In this paper, we present CHARM, an efficient algorithm for mining all frequent closed itemsets. It enumerates closed sets using a dual itemset-tidset search tree, using an efficient hybrid search that skips many levels. It also uses a technique called diffsets to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any "nonclosed" sets found during computation. We also present CHARM-L, an algorithm that outputs the closed itemset lattice, which is very useful for rule generation and visualization. An extensive experimental evaluation on a number of real and synthetic databases shows that CHARM is a state-of-the-art algorithm that outperforms previous methods. Further, CHARM-L explicitly generates the frequent closed itemset lattice.
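
For intuition about what a frequent closed itemset is (not how CHARM computes them), the brute-force sketch below enumerates closed itemsets over a tiny hypothetical transaction database.

# Brute-force enumeration of frequent closed itemsets on a toy database.
# This illustrates the concept only; CHARM avoids this exponential enumeration.
from itertools import combinations

transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c', 'd'}]
min_support = 2
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(itemset <= t for t in transactions)

# All frequent itemsets.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s

# An itemset is closed if no proper superset has the same support.
closed = {iset: s for iset, s in frequent.items()
          if not any(iset < other and s == s2 for other, s2 in frequent.items())}

for iset, s in sorted(closed.items(), key=lambda x: (-x[1], sorted(x[0]))):
    print(sorted(iset), "support:", s)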

668 citations


Journal ArticleDOI
TL;DR: This work proposes a generalized temporal role-based access control (GTRBAC) model capable of expressing a wider range of temporal constraints, including periodic as well as duration constraints on roles, user-role assignments, and role-permission assignments.
Abstract: Role-based access control (RBAC) models have generated a great interest in the security community as a powerful and generalized approach to security management. In many practical scenarios, users may be restricted to assume roles only at predefined time periods. Furthermore, roles may only be invoked on prespecified intervals of time depending upon when certain actions are permitted. To capture such dynamic aspects of a role, a temporal RBAC (TRBAC) model has been recently proposed. However, the TRBAC model addresses the role enabling constraints only. In this work, we propose a generalized temporal role-based access control (GTRBAC) model capable of expressing a wider range of temporal constraints. In particular, the model allows expressing periodic as well as duration constraints on roles, user-role assignments, and role-permission assignments. In an interval, activation of a role can further be restricted as a result of numerous activation constraints including cardinality constraints and maximum active duration constraints. The GTRBAC model extends the syntactic structure of the TRBAC model and its event and trigger expressions subsume those of TRBAC. Furthermore, GTRBAC allows expressing role hierarchies and separation of duty (SoD) constraints for specifying fine-grained temporal semantics.
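
A toy sketch of checking periodic enabling and per-day duration constraints for a role; the data structures and names are hypothetical simplifications, not the GTRBAC constraint language.

# Toy check of temporal role constraints: a role is enabled only during given
# weekly periods, and total activation time per day must stay under a cap.
# The data structures are hypothetical simplifications, not GTRBAC syntax.
from datetime import datetime

role_constraints = {
    "doctor_on_call": {
        # enabled Mon-Fri (weekdays 0-4), 09:00-17:00
        "enabled_days": {0, 1, 2, 3, 4},
        "enabled_hours": range(9, 17),
        "max_active_minutes_per_day": 240,
    }
}

def can_activate(role, when, minutes_already_active, requested_minutes):
    c = role_constraints[role]
    if when.weekday() not in c["enabled_days"]:
        return False                                  # outside the periodic enabling window
    if when.hour not in c["enabled_hours"]:
        return False
    total = minutes_already_active + requested_minutes
    return total <= c["max_active_minutes_per_day"]   # duration constraint

print(can_activate("doctor_on_call", datetime(2005, 6, 7, 10, 30), 200, 30))  # True
print(can_activate("doctor_on_call", datetime(2005, 6, 7, 10, 30), 230, 30))  # False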

619 citations


Journal ArticleDOI
TL;DR: A novel FP-array technique is presented that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree-based algorithms; the technique works especially well for sparse data sets.
Abstract: Efficient algorithms for mining frequent itemsets are crucial for mining association rules as well as for many other data mining tasks. Methods for mining frequent itemsets have been implemented using a prefix-tree structure, known as an FP-tree, for storing compressed information about frequent itemsets. Numerous experimental results have demonstrated that these algorithms perform extremely well. In this paper, we present a novel FP-array technique that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree-based algorithms. Our technique works especially well for sparse data sets. Furthermore, we present new algorithms for mining all, maximal, and closed frequent itemsets. Our algorithms use the FP-tree data structure in combination with the FP-array technique efficiently and incorporate various optimization techniques. We also present experimental results comparing our methods with existing algorithms. The results show that our methods are the fastest for many cases. Even though the algorithms consume much memory when the data sets are sparse, they are still the fastest ones when the minimum support is low. Moreover, they are always among the fastest algorithms and consume less memory than other methods when the data sets are dense.

527 citations


Journal ArticleDOI
TL;DR: A new privacy-enabling algorithm is presented, named k-Same, that guarantees face recognition software cannot reliably recognize deidentified faces, even though many facial details are preserved.
Abstract: In the context of sharing video surveillance data, a significant threat to privacy is face recognition software, which can automatically identify known people, such as from a database of drivers' license photos, and thereby track people regardless of suspicion. This paper introduces an algorithm to protect the privacy of individuals in video surveillance data by deidentifying faces such that many facial characteristics remain but the face cannot be reliably recognized. A trivial solution to deidentifying faces involves blacking out each face. This thwarts any possible face recognition, but because all facial details are obscured, the result is of limited use. Many ad hoc attempts, such as covering eyes, fail to thwart face recognition because of the robustness of face recognition methods. This work presents a new privacy-enabling algorithm, named k-Same, that guarantees face recognition software cannot reliably recognize deidentified faces, even though many facial details are preserved. The algorithm determines similarity between faces based on a distance metric and creates new faces by averaging image components, which may be the original image pixels (k-Same-Pixel) or eigenvectors (k-Same-Eigen). Results are presented on a standard collection of real face images with varying k.
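
The averaging step at the heart of a k-Same-Pixel-style procedure can be sketched on synthetic image vectors as follows; this is a simplification, since the published algorithm also partitions the face set so that each face contributes to only one average.

# Simplified k-Same-Pixel-style de-identification sketch on synthetic "images":
# each face vector is replaced by the average of its k closest face vectors.
# This is an illustrative simplification, not the exact published procedure.
import numpy as np

rng = np.random.default_rng(0)
faces = rng.uniform(0, 255, size=(10, 64 * 64))   # 10 hypothetical 64x64 face images
k = 3

def k_same_pixel(faces, k):
    deidentified = np.empty_like(faces)
    for i, f in enumerate(faces):
        dists = np.linalg.norm(faces - f, axis=1)      # Euclidean distance to every face
        nearest = np.argsort(dists)[:k]                # the k closest faces (including itself)
        deidentified[i] = faces[nearest].mean(axis=0)  # replace with their pixel-wise average
    return deidentified

out = k_same_pixel(faces, k)
print(out.shape)   # (10, 4096): each output face is an average of k originals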

504 citations


Journal ArticleDOI
TL;DR: A kernel-boundary-alignment algorithm is proposed, which considers the training data imbalance as prior information to augment SVMs to improve class-prediction accuracy and works effectively on several data sets.
Abstract: An imbalanced training data set can pose serious problems for many real-world data mining tasks that employ SVMs to conduct supervised learning. In this paper, we propose a kernel-boundary-alignment algorithm, which considers the training data imbalance as prior information to augment SVMs to improve class-prediction accuracy. Using a simple example, we first show that SVMs can suffer from high incidences of false negatives when the training instances of the target class are heavily outnumbered by the training instances of a nontarget class. The remedy we propose is to adjust the class boundary by modifying the kernel matrix, according to the imbalanced data distribution. Through theoretical analysis backed by empirical study, we show that our kernel-boundary-alignment algorithm works effectively on several data sets.
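
The paper's remedy modifies the kernel matrix itself; the sketch below instead shows a simpler, commonly used counterweight to the same false-negative problem, class reweighting in a standard SVM (scikit-learn assumed), purely to make the imbalance issue concrete.

# Not the kernel-boundary-alignment algorithm: a simpler, widely used remedy
# for the same false-negative problem is class reweighting in a standard SVM.
# Shown only to illustrate the imbalance issue; scikit-learn is assumed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           n_informative=3, random_state=0)
X_tr, y_tr, X_te, y_te = X[:700], y[:700], X[700:], y[700:]

plain = SVC(kernel='rbf').fit(X_tr, y_tr)
weighted = SVC(kernel='rbf', class_weight='balanced').fit(X_tr, y_tr)

# Recall on the rare (target) class: the unweighted SVM tends to miss it.
print("recall, plain SVM:   ", recall_score(y_te, plain.predict(X_te)))
print("recall, weighted SVM:", recall_score(y_te, weighted.predict(X_te)))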

Journal ArticleDOI
TL;DR: A substructure-based classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the data set.
Abstract: Computational techniques that build models to correctly assign chemical compounds to various classes of interest have many applications in pharmaceutical research and are used extensively at various phases during the drug development process. These techniques are used to solve a number of classification problems such as predicting whether or not a chemical compound has the desired biological activity, is toxic or nontoxic, and filtering out drug-like compounds from large compound libraries. This paper presents a substructure-based classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the data set. The advantage of this approach is that during classification model construction, all relevant substructures are available allowing the classifier to intelligently select the most discriminating ones. The computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Experimental evaluation on eight different classification problems shows that our approach is computationally scalable and, on average, outperforms existing schemes by 7 percent to 35 percent.

Journal ArticleDOI
TL;DR: An in-memory and disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets showing that the algorithm scales well in both cases are presented.
Abstract: A new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large and high-dimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors. Outliers are those points with the largest values of weight. The algorithm HilOut makes use of the notion of space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points in the data set. During this phase, the algorithm isolates candidate outliers and reduces this set at each iteration. If the size of this set becomes n, then the algorithm stops, reporting the exact solution. The second phase calculates the exact solution with a final scan examining further the candidate outliers that remained after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after much less than d + 1 steps. We present both an in-memory and a disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets showing that the algorithm scales well in both cases.
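
The weight definition is easy to evaluate exactly by brute force on small data, which is useful for sanity-checking results; the sketch below is quadratic in N, precisely the cost HilOut is designed to avoid.

# Brute-force distance-based outlier scoring on a small data set:
# weight(p) = sum of distances from p to its k nearest neighbors,
# and the top-n weights are reported. Quadratic in N; HilOut exists
# precisely to avoid this cost on large, high-dimensional data.
import numpy as np

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, size=(100, 5)),
                  rng.normal(8, 1, size=(3, 5))])      # 3 obvious injected outliers
k, n = 5, 3

dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
np.fill_diagonal(dists, np.inf)                        # exclude self-distance
knn = np.sort(dists, axis=1)[:, :k]                    # k smallest distances per point
weights = knn.sum(axis=1)

top_n = np.argsort(weights)[::-1][:n]
print("top outlier indices:", top_n)    # expected: 100, 101, 102 (in some order)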

Journal ArticleDOI
TL;DR: This work presents polynomial-time heuristics for selecting views to optimize total query response time under a disk-space constraint for some important special cases of the general data warehouse scenario, viz., AND view graphs, where each query/view has a unique evaluation, and OR view graphs, and extends these heuristics to general AND-OR view graphs.
Abstract: A data warehouse stores materialized views of data from one or more sources, with the purpose of efficiently implementing decision-support or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The goal is to select an appropriate set of views that minimizes total query response time and the cost of maintaining the selected views, given a limited amount of resource, e.g., materialization time, storage space, etc. In this work, we have developed a theoretical framework for the general problem of selection of views in a data warehouse. We present polynomial-time heuristics for a selection of views to optimize total query response time under a disk-space constraint, for some important special cases of the general data warehouse scenario, viz.: 1) an AND view graph, where each query/view has a unique evaluation, e.g., when a multiple-query optimizer can be used to generate a global evaluation plan for the queries, and 2) an OR view graph, in which any view can be computed from any one of its related views, e.g., data cubes. We present proofs showing that the algorithms are guaranteed to provide a solution that is fairly close to (within a constant factor ratio of) the optimal solution. We extend our heuristic to the general AND-OR view graphs. Finally, we address in detail the view-selection problem under the maintenance cost constraint and present provably competitive heuristics.

Journal ArticleDOI
TL;DR: This work presents TREEMINER, a novel algorithm to discover all frequent subtrees in a forest using a new data structure called scope-list, contrasts it with a pattern matching tree mining algorithm (PATTERNMATCHER), and also compares it with TREEMINERD, which counts only distinct occurrences of a pattern.
Abstract: Mining frequent trees is very useful in domains like bioinformatics, Web mining, mining semistructured data, etc. We formulate the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. We present TREEMINER, a novel algorithm to discover all frequent subtrees in a forest, using a new data structure called scope-list. We contrast TREEMINER with a pattern matching tree mining algorithm (PATTERNMATCHER), and we also compare it with TREEMINERD, which counts only distinct occurrences of a pattern. We conduct detailed experiments to test the performance and scalability of these methods. We also use tree mining to analyze RNA structure and phylogenetics data sets from bioinformatics domain.

Journal ArticleDOI
TL;DR: The experiments show that MAFIA performs best when mining long itemsets and outperforms other algorithms on dense data by a factor of three to 30.
Abstract: We present a new algorithm for mining maximal frequent itemsets from a transactional database. The search strategy of the algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms that significantly improve mining performance. Our implementation for support counting combines a vertical bitmap representation of the data with an efficient bitmap compression scheme. In a thorough experimental analysis, we isolate the effects of individual components of MAFIA including search space pruning techniques and adaptive compression. We also compare our performance with previous work by running tests on very different types of data sets. Our experiments show that MAFIA performs best when mining long itemsets and outperforms other algorithms on dense data by a factor of three to 30.

Journal ArticleDOI
TL;DR: This paper proposes an alternative mining task, mining top-k frequent closed itemsets of length no less than min_l, and shows that TFP has high performance and linear scalability in terms of the database size.
Abstract: Frequent itemset mining has been studied extensively in literature. Most previous studies require the specification of a min_support threshold and aim at mining a complete set of frequent itemsets satisfying min_support. However, in practice, it is difficult for users to provide an appropriate min_support threshold. In addition, a complete set of frequent itemsets is much less compact than a set of frequent closed itemsets. In this paper, we propose an alternative mining task: mining top-k frequent closed itemsets of length no less than min_l, where k is the desired number of frequent closed itemsets to be mined, and min_l is the minimal length of each itemset. An efficient algorithm, called TFP, is developed for mining such itemsets without min_support. Starting at min_support = 0 and by making use of the length constraint and the properties of top-k frequent closed itemsets, min_support can be raised effectively and FP-Tree can be pruned dynamically both during and after the construction of the tree using our two proposed methods: the closed node count and descendant_sum. Moreover, mining is further speeded up by employing a top-down and bottom-up combined FP-Tree traversing strategy, a set of search space pruning methods, a fast 2-level hash-indexed result tree, and a novel closed itemset verification scheme. Our extensive performance study shows that TFP has high performance and linear scalability in terms of the database size.

Journal ArticleDOI
TL;DR: This paper presents an agent-based and context-oriented approach that supports the composition of Web services, where software agents engage in conversations with their peers to agree on the Web services that participate in this process.
Abstract: This paper presents an agent-based and context-oriented approach that supports the composition of Web services. A Web service is an accessible application that other applications and humans can discover and invoke to satisfy multiple needs. To reduce the complexity that characterizes the composition of Web services, two concepts are put forward, namely, software agent and context. A software agent is an autonomous entity that acts on behalf of users and the context is any relevant information that characterizes a situation. During the composition process, software agents engage in conversations with their peers to agree on the Web services that participate in this process. Conversations between agents take into account the execution context of the Web services. The security of the computing resources on which the Web services are executed constitutes another core component of the agent-based and context-oriented approach presented in this paper.

Journal ArticleDOI
TL;DR: Two types of periodicities are defined, and a scalable, computationally efficient algorithm is proposed for each type, and the algorithms are extended in order to discover the periodic patterns of unknown periods at the same time without affecting the time complexity.
Abstract: Periodicity mining is used for predicting trends in time series data. Discovering the rate at which the time series is periodic has always been an obstacle for fully automated periodicity mining. Existing periodicity mining algorithms assume that the periodicity rate (or simply the period) is user-specified. This assumption is a considerable limitation, especially in time series data where the period is not known a priori. In this paper, we address the problem of detecting the periodicity rate of a time series database. Two types of periodicities are defined, and a scalable, computationally efficient algorithm is proposed for each type. The algorithms perform in O(n log n) time for a time series of length n. Moreover, the proposed algorithms are extended in order to discover the periodic patterns of unknown periods at the same time without affecting the time complexity. Experimental results show that the proposed algorithms are highly accurate with respect to the discovered periodicity rates and periodic patterns. Real-data experiments demonstrate the practicality of the discovered periodic patterns.
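
One way to build intuition for period detection is a brute-force confidence check over candidate periods, sketched below on a hypothetical symbol series; this naive check only illustrates the idea of scoring candidate periods and has none of the O(n log n) efficiency of the paper's algorithms.

# Naive period detection on a discretized time series: for each candidate
# period p, measure how often a symbol repeats p steps later.
series = list("abcabcabcabcabcabdabc")   # hypothetical symbol series, period ~3 with one noisy symbol

def period_confidence(series, p):
    matches = sum(series[i] == series[i + p] for i in range(len(series) - p))
    return matches / (len(series) - p)

for p in range(1, 8):
    print(f"period {p}: confidence {period_confidence(series, p):.2f}")
# The smallest candidate period with high confidence (here p = 3; its multiple
# p = 6 also scores high) is taken as the periodicity rate.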

Journal ArticleDOI
TL;DR: This paper presents a clustering algorithm for partitioning a minimum spanning tree with a constraint on minimum group size that is sufficiently efficient to be practical for large data sets and yields results that are comparable to the best available heuristic methods for microaggregation.
Abstract: This paper presents a clustering algorithm for partitioning a minimum spanning tree with a constraint on minimum group size. The problem is motivated by microaggregation, a disclosure limitation technique in which similar records are aggregated into groups containing a minimum of k records. Heuristic clustering methods are needed since the minimum information loss microaggregation problem is NP-hard. Our MST partitioning algorithm for microaggregation is sufficiently efficient to be practical for large data sets and yields results that are comparable to the best available heuristic methods for microaggregation. For data that contain pronounced clustering effects, our method results in significantly lower information loss. Our algorithm is general enough to accommodate different measures of information loss and can be used for other clustering applications that have a constraint on minimum group size.

Journal ArticleDOI
TL;DR: This paper proposes a policy integration framework for merging heterogeneous role-based access control policies of multiple domains into a global access control policy, and proposes an integer programming (IP)-based approach for optimal resolution of conflicts.
Abstract: Multidomain application environments where distributed multiple organizations interoperate with each other are becoming a reality as witnessed by emerging Internet-based enterprise applications. Composition of a global coherent security policy that governs information and resource accesses in such environments is a challenging problem. In this paper, we propose a policy integration framework for merging heterogeneous role-based access control (RBAC) policies of multiple domains into a global access control policy. A key challenge in composition of this policy is the resolution of conflicts that may arise among the RBAC policies of individual domains. We propose an integer programming (IP)-based approach for optimal resolution of such conflicts. The optimality criterion is to maximize interdomain role accesses while keeping autonomy losses within the acceptable limit.
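
A toy integer program in the same spirit, not the paper's formulation: choose which interdomain accesses to enable so that benefit is maximized while total autonomy loss stays within a budget. It assumes SciPy 1.9+ for scipy.optimize.milp; all names and numbers are hypothetical.

# Toy integer program in the spirit of conflict resolution: choose which
# cross-domain role accesses to enable (binary x_i) to maximize their benefit
# while total autonomy loss stays within a budget. This is NOT the paper's
# formulation, just a minimal illustration; SciPy >= 1.9 is assumed for milp.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

benefit = np.array([8.0, 5.0, 6.0, 3.0])        # value of each interdomain access (hypothetical)
autonomy_loss = np.array([4.0, 3.0, 5.0, 1.0])  # autonomy cost each access induces
budget = 7.0                                    # acceptable total autonomy loss

c = -benefit                                    # milp minimizes, so negate to maximize
constraints = LinearConstraint(autonomy_loss[np.newaxis, :], -np.inf, budget)
res = milp(c, constraints=constraints,
           integrality=np.ones(4), bounds=Bounds(0, 1))

print("enable accesses:", np.round(res.x).astype(int), "total benefit:", -res.fun)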

Journal ArticleDOI
TL;DR: Algorithms are presented that modify an initial road-network representation so that it works better as a basis for predicting an object's position; known movement patterns of the object, in the form of routes, are exploited; and acceleration profiles are used together with the routes.
Abstract: With the continued advances in wireless communications, geo-positioning, and consumer electronics, an infrastructure is emerging that enables location-based services that rely on the tracking of the continuously changing positions of entire populations of service users, termed moving objects. This scenario is characterized by large volumes of updates, for which reason location update technologies become important. A setting is assumed in which a central database stores a representation of each moving object's current position. This position is to be maintained so that it deviates from the user's real position by at most a given threshold. To do so, each moving object stores locally the central representation of its position. Then, an object updates the database whenever the deviation between its actual position (as obtained from a GPS device) and the database position exceeds the threshold. The main issue considered is how to represent the location of a moving object in a database so that tracking can be done with as few updates as possible. The paper proposes to use the road network within which the objects are assumed to move for predicting their future positions. The paper presents algorithms that modify an initial road-network representation, so that it works better as a basis for predicting an object's position; it proposes to use known movement patterns of the object, in the form of routes; and, it proposes to use acceleration profiles together with the routes. Using real GPS-data and a corresponding real road network, the paper offers empirical evaluations and comparisons that include three existing approaches and all the proposed approaches.
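
The threshold-driven update policy described above can be sketched independently of the prediction models that are the paper's contribution; the constant-position predictor and numbers below are placeholders.

# Basic threshold-driven location update policy: the client keeps the server's
# predicted position locally and sends an update only when its true GPS
# position deviates from that prediction by more than the threshold.
# The constant-position predictor below is a placeholder; the paper's point is
# to replace it with road-network- and route-based predictors.
import math

THRESHOLD = 200.0   # meters (hypothetical accuracy guarantee)

class TrackedObject:
    def __init__(self, x, y):
        self.server_pos = (x, y)   # what the central database currently assumes
        self.updates = 0

    def predict(self):
        return self.server_pos     # naive predictor: object stays where last reported

    def report(self, gps_x, gps_y):
        px, py = self.predict()
        deviation = math.hypot(gps_x - px, gps_y - py)
        if deviation > THRESHOLD:
            self.server_pos = (gps_x, gps_y)   # send an update to the database
            self.updates += 1

obj = TrackedObject(0.0, 0.0)
for step in range(1, 11):
    obj.report(gps_x=step * 60.0, gps_y=0.0)   # moving 60 m per step along x
print("updates sent:", obj.updates)            # 2 updates over 600 m with a 200 m bound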

Journal ArticleDOI
TL;DR: This work considers alternative aggregate functions and techniques that utilize Euclidean distance bounds, spatial access methods, and/or network distance materialization structures and shows that their relative performance depends on the problem characteristics.
Abstract: Aggregate nearest neighbor queries return the object that minimizes an aggregate distance function with respect to a set of query points. Consider, for example, several users at specific locations (query points) that want to find the restaurant (data point), which leads to the minimum sum of distances that they have to travel in order to meet. We study the processing of such queries for the case where the position and accessibility of spatial objects are constrained by spatial (e.g., road) networks. We consider alternative aggregate functions and techniques that utilize Euclidean distance bounds, spatial access methods, and/or network distance materialization structures. Our algorithms are experimentally evaluated with synthetic and real data. The results show that their relative performance depends on the problem characteristics.
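
On small in-memory data the query can be answered by brute force, which makes the semantics clear; the sketch below uses Euclidean distance and the sum aggregate, whereas the paper's algorithms target road-network distances and other aggregates as well.

# Brute-force aggregate nearest neighbor query: return the data point that
# minimizes the aggregate (here: sum) of Euclidean distances to all query
# points. This sketch only illustrates the query semantics.
import numpy as np

rng = np.random.default_rng(2)
restaurants = rng.uniform(0, 100, size=(50, 2))                # hypothetical data points
users = np.array([[10.0, 10.0], [90.0, 20.0], [50.0, 80.0]])   # query points

# distances[i, j] = distance from restaurant i to user j
distances = np.linalg.norm(restaurants[:, None, :] - users[None, :, :], axis=2)
agg = distances.sum(axis=1)            # aggregate function: sum (could be max, min, ...)
best = int(np.argmin(agg))
print("meeting point:", restaurants[best], "total travel distance:", round(agg[best], 1))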

Journal ArticleDOI
TL;DR: The Intelligent Discovery Assistant (IDA) as discussed by the authors provides users with systematic enumerations of valid data mining processes and effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute.
Abstract: A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data mining algorithm, and postprocessing the mining results. There are many possible choices for each stage, and only some combinations are valid. Because of the large space and nontrivial interactions, both novices and data mining specialists need assistance in composing and selecting DM processes. Extending notions developed for statistical expert systems we present a prototype intelligent discovery assistant (IDA), which provides users with 1) systematic enumerations of valid DM processes, in order that important, potentially fruitful options are not overlooked, and 2) effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute. We use the prototype to show that an IDA can indeed provide useful enumerations and effective rankings in the context of simple classification processes. We discuss how an IDA could be an important tool for knowledge sharing among a team of data miners. Finally, we illustrate the claims with a demonstration of cost-sensitive classification using a more complicated process and data from the 1998 KDDCUP competition.

Journal ArticleDOI
TL;DR: This paper discusses and compares several strategies that utilize only known values and exploit the fact that "missing is useful" for cost reduction in cost-sensitive decision tree learning, considering both test costs and misclassification costs.
Abstract: Many real-world data sets for machine learning and data mining contain missing values, and much previous research regards this as a problem and attempts to impute missing values before training and testing. In this paper, we study this issue in cost-sensitive learning that considers both test costs and misclassification costs. If some attributes (tests) are too expensive in obtaining their values, it would be more cost-effective to leave their values missing, similar to skipping expensive and risky tests (missing values) in patient diagnosis (classification). That is, "missing is useful" because missing values actually reduce the total cost of tests and misclassifications and, therefore, it is not meaningful to impute their values. We discuss and compare several strategies that utilize only known values and exploit the fact that "missing is useful" for cost reduction in cost-sensitive decision tree learning.
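
The trade-off can be made concrete with a tiny expected-cost calculation; all numbers below are hypothetical.

# Tiny illustration of the test-cost / misclassification-cost trade-off with a
# missing attribute value. All numbers are hypothetical.
TEST_COST = 50.0             # cost of obtaining the missing attribute (e.g., a lab test)
MISCLASS_COST = 400.0        # cost of a wrong prediction
P_ERROR_WITHOUT_TEST = 0.20  # error probability if we classify without the value
P_ERROR_WITH_TEST = 0.05     # error probability if we obtain the value first

cost_skip = P_ERROR_WITHOUT_TEST * MISCLASS_COST            # 80.0
cost_test = TEST_COST + P_ERROR_WITH_TEST * MISCLASS_COST   # 70.0
print("expected cost if value stays missing:", cost_skip)
print("expected cost if value is acquired:  ", cost_test)
# Here acquiring the value is cheaper, but raising TEST_COST above 60 flips the
# decision, which is why leaving a value missing can be the cost-optimal choice.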

Journal ArticleDOI
TL;DR: This work proposes a family of novel unsupervised methods for feature subset selection from multivariate time series (MTS) based on common principal component analysis, termed CLeVer, which outperforms RFE, FC, and random selection by up to a factor of two in terms of the classification accuracy, while taking up to 2 orders of magnitude less processing time than RFE and FC.
Abstract: Feature subset selection (FSS) is a known technique to preprocess the data before performing any data mining tasks, e.g., classification and clustering. FSS provides both cost-effective predictors and a better understanding of the underlying process that generated the data. We propose a family of novel unsupervised methods for feature subset selection from multivariate time series (MTS) based on common principal component analysis, termed CLeVer. Traditional FSS techniques, such as recursive feature elimination (RFE) and Fisher criterion (FC), have been applied to MTS data sets, e.g., brain computer interface (BCI) data sets. However, these techniques may lose the correlation information among features, while our proposed techniques utilize the properties of the principal component analysis to retain that information. In order to evaluate the effectiveness of our selected subset of features, we employ classification as the target data mining task. Our exhaustive experiments show that CLeVer outperforms RFE, FC, and random selection by up to a factor of two in terms of the classification accuracy, while taking up to 2 orders of magnitude less processing time than RFE and FC.

Journal ArticleDOI
TL;DR: A confidence-based dynamic ensemble (CDE) is proposed that employs a three-level classification scheme and uses a confidence factor to make dynamic adjustments to its member classifiers so as to improve class-prediction accuracy, accommodate new semantics, and assist in the discovery of useful low-level features.
Abstract: We propose using one-class, two-class, and multiclass SVMs to annotate images for supporting keyword retrieval of images. Providing automatic annotation requires an accurate mapping of images' low-level perceptual features (e.g., color and texture) to some high-level semantic labels (e.g., landscape, architecture, and animals). Much work has been performed in this area; however, there is a lack of ability to assess the quality of annotation. In this paper, we propose a confidence-based dynamic ensemble (CDE), which employs a three-level classification scheme. At the base-level, CDE uses one-class support vector machines (SVMs) to characterize a confidence factor for ascertaining the correctness of an annotation (or a class prediction) made by a binary SVM classifier. The confidence factor is then propagated to the multiclass classifiers at subsequent levels. CDE uses the confidence factor to make dynamic adjustments to its member classifiers so as to improve class-prediction accuracy, to accommodate new semantics, and to assist in the discovery of useful low-level features. Our empirical studies on a large real-world data set demonstrate CDE to be very effective.

Journal ArticleDOI
TL;DR: A model is presented that reduces the evaluation of aggregate queries to the problem of selecting qualifying tuples and grouping them into collections on which an aggregate function is to be applied.
Abstract: Spatiotemporal databases are becoming increasingly more common. Typically, applications modeling spatiotemporal objects need to process vast amounts of data. In such cases, generating aggregate information from the data set is more useful than individually analyzing every entry. In this paper, we study the most relevant techniques for the evaluation of aggregate queries on spatial, temporal, and spatiotemporal data. We also present a model that reduces the evaluation of aggregate queries to the problem of selecting qualifying tuples and the grouping of these tuples into collections on which an aggregate function is to be applied. This model gives us a framework that allows us to analyze and compare the different existing techniques for the evaluation of aggregate queries. At the same time, it allows us to identify opportunities for research on types of aggregate queries that have not been studied.
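
The evaluation model just described (select qualifying tuples, group them, apply an aggregate function per group) maps directly onto a small sketch; the relation and query below are hypothetical.

# Sketch of the evaluation model: select qualifying tuples, group them into
# collections, and apply an aggregate function to each group. The spatiotemporal
# "table" and the query window below are hypothetical.
from collections import defaultdict

# (region, timestamp, vehicle_count) tuples
tuples = [
    ("north", 8, 120), ("north", 9, 150), ("south", 8, 90),
    ("south", 9, 60), ("north", 10, 200), ("south", 10, 80),
]

def aggregate_query(tuples, qualifies, group_key, agg):
    selected = [t for t in tuples if qualifies(t)]        # 1) selection
    groups = defaultdict(list)
    for t in selected:                                    # 2) grouping
        groups[group_key(t)].append(t[2])
    return {g: agg(vals) for g, vals in groups.items()}   # 3) aggregation

# "Average vehicle count per region between timestamps 8 and 9"
result = aggregate_query(
    tuples,
    qualifies=lambda t: 8 <= t[1] <= 9,
    group_key=lambda t: t[0],
    agg=lambda vals: sum(vals) / len(vals),
)
print(result)   # {'north': 135.0, 'south': 75.0}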

Journal ArticleDOI
TL;DR: A composability model is proposed to ascertain that Web services can safely be combined, hence avoiding unexpected failures at runtime, and the concepts of composability degree and τ-composability to cater for partial and total composability are introduced.
Abstract: We propose a composability model to ascertain that Web services can safely be combined, hence avoiding unexpected failures at runtime. Composability is checked through a set of rules organized into four levels, syntactic, static semantic, dynamic semantic, and qualitative levels. We introduce the concepts of composability degree and τ-composability to cater for partial and total composability. We also propose a set of algorithms for checking composability. Finally, we conduct a performance study (analytical and experimental) of the proposed algorithms.

Journal ArticleDOI
TL;DR: CMTreeMiner is presented, a computationally efficient algorithm that discovers only closed and maximal frequent subtrees in a database of labeled rooted trees, where the rooted trees can be either ordered or unordered.
Abstract: Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. One important problem in mining databases of trees is to find frequently occurring subtrees. Because of the combinatorial explosion, the number of frequent subtrees usually grows exponentially with the size of frequent subtrees and, therefore, mining all frequent subtrees becomes infeasible for large tree sizes. We present CMTreeMiner, a computationally efficient algorithm that discovers only closed and maximal frequent subtrees in a database of labeled rooted trees, where the rooted trees can be either ordered or unordered. The algorithm mines both closed and maximal frequent subtrees by traversing an enumeration tree that systematically enumerates all frequent subtrees. Several techniques are proposed to prune the branches of the enumeration tree that do not correspond to closed or maximal frequent subtrees. Heuristic techniques are used to arrange the order of computation so that relatively expensive computation is avoided as much as possible. We study the performance of our algorithm through extensive experiments, using both synthetic data and data sets from real applications. The experimental results show that our algorithm is very efficient in reducing the search space and quickly discovers all closed and maximal frequent subtrees.