
Showing papers in "International Journal of Data Warehousing and Mining in 2007"


Journal ArticleDOI
TL;DR: The task of multi-label classification is introduced, the sparse related literature is organized into a structured presentation, and comparative experimental results for certain multi-label classification methods are reported.
Abstract: Nowadays, multi-label classification methods are increasingly required by modern applications, such as protein function classification, music categorization and semantic scene classification. This paper introduces the task of multi-label classification, organizes the sparse related literature into a structured presentation and performs comparative experimental results of certain multi-label classification methods. It also contributes the definition of concepts for the quantification of the multi-label nature of a data set.

2,592 citations
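
As a rough illustration of the ideas summarized above, the following Python sketch computes two standard measures of the multi-label nature of a dataset (label cardinality and label density) and applies the simple binary-relevance problem transformation. This is only an assumed, minimal rendering of these common concepts, not the paper's own code, and its definitions may differ in detail from the paper's.

```python
# Minimal sketch (not the paper's code): quantify the multi-label nature of a
# dataset via label cardinality/density and train a binary-relevance classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def label_cardinality(Y):
    """Average number of labels per example (Y is a binary indicator matrix)."""
    return Y.sum(axis=1).mean()

def label_density(Y):
    """Label cardinality normalized by the total number of labels."""
    return label_cardinality(Y) / Y.shape[1]

def binary_relevance_fit(X, Y):
    """Train one independent binary classifier per label."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    return np.column_stack([m.predict(X) for m in models])

# Toy example: 6 instances, 2 features, 3 labels.
X = np.array([[0.1, 1.0], [0.2, 0.9], [0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.4, 0.6]])
Y = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 1, 0], [1, 0, 0]])
print("cardinality:", label_cardinality(Y), "density:", label_density(Y))
models = binary_relevance_fit(X, Y)
print(binary_relevance_predict(models, X))
```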


Journal ArticleDOI
TL;DR: This article proposes an innovative advanced OLAP visualization technique that meaningfully combines the so-called OLAP dimension flattening process, which allows us to extract two-dimensional OLAP views from multidimensional data cubes, and very efficient data compression techniques for such views, which allow us to generate "semantics-aware" compressed representations where data are grouped along OLAP hierarchies.
Abstract: Efficiently supporting advanced OLAP visualization of multidimensional data cubes is a novel and challenging research topic, which is of interest to a large family of data warehouse applications relying on the management of spatio-temporal (e.g., mobile) data, scientific and statistical data, sensor network data, biological data, etc. On the other hand, the issue of visualizing multidimensional data domains has been rather neglected by the research community, since it does not belong to the well-founded conceptual-logical-physical design hierarchy inherited from relational database methodologies. Inspired by these considerations, in this article we propose an innovative advanced OLAP visualization technique that meaningfully combines (i) the so-called OLAP dimension flattening process, which allows us to extract two-dimensional OLAP views from multidimensional data cubes, and (ii) very efficient data compression techniques for such views, which allow us to generate "semantics-aware" compressed representations where data are grouped along OLAP hierarchies.

56 citations
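
To make the notion of a "flattened" two-dimensional OLAP view concrete, here is a loose, hypothetical analogue in Python: roll a fact table up to two chosen hierarchy levels to obtain a 2D view. The column names and data are invented for illustration, and the paper's actual flattening process and semantics-aware compression are not reproduced.

```python
# Loose analogue of extracting a two-dimensional OLAP view from a cube: roll the
# fact table up to two chosen hierarchy levels (time at 'year', geography at 'region').
import pandas as pd

facts = pd.DataFrame({
    "year":   [2006, 2006, 2006, 2007, 2007, 2007],
    "month":  [1, 1, 2, 1, 2, 2],
    "region": ["EU", "US", "EU", "US", "EU", "US"],
    "city":   ["Rome", "NYC", "Paris", "LA", "Rome", "NYC"],
    "sales":  [10, 20, 15, 30, 25, 5],
})

view_2d = facts.pivot_table(index="year", columns="region", values="sales", aggfunc="sum")
print(view_2d)
```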


Journal ArticleDOI
TL;DR: This chapter introduces a novel cache-based architecture that takes into account the containment properties of geographical data and predicates, and allows evicting the most irrelevant values from the cache.
Abstract: GML is a promising model for integrating geodata within data warehouses. The resulting databases are generally large and require spatial operators to be handled. Depending on the size of the target geographical data and the number and complexity of operators in a query, the processing time may quickly become prohibitive. To optimize spatial queries over GML-encoded data, this chapter introduces a novel cache-based architecture. A new cache replacement policy is then proposed. It takes into account the containment properties of geographical data and predicates, and allows evicting the most irrelevant values from the cache. Experiments with the GeoCache prototype show the effectiveness of the proposed architecture and the associated replacement policy compared to existing work.

38 citations
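
The following Python sketch illustrates the general idea of a containment-aware replacement policy: entries whose bounding boxes fall outside the current query window are treated as the most irrelevant and evicted first. The class, its API, and the bounding-box test are assumptions made for illustration; this is not the GeoCache policy itself.

```python
# Hypothetical sketch of a containment-aware cache: entries whose bounding boxes
# are not contained in the current query window are evicted first.
def contains(outer, inner):
    """Axis-aligned bounding-box containment: boxes are (xmin, ymin, xmax, ymax)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

class SpatialCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}           # key -> (bounding box, value)

    def put(self, key, bbox, value, query_window):
        if len(self.entries) >= self.capacity:
            self.evict(query_window)
        self.entries[key] = (bbox, value)

    def evict(self, query_window):
        # Prefer evicting an entry whose geometry is irrelevant to the current window.
        for key, (bbox, _) in self.entries.items():
            if not contains(query_window, bbox):
                del self.entries[key]
                return
        self.entries.pop(next(iter(self.entries)))   # fall back to FIFO-like eviction

cache = SpatialCache(capacity=2)
window = (0, 0, 10, 10)
cache.put("a", (1, 1, 2, 2), "geomA", window)
cache.put("b", (50, 50, 60, 60), "geomB", window)   # outside the window
cache.put("c", (3, 3, 4, 4), "geomC", window)       # forces eviction of "b"
print(list(cache.entries))
```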


Journal ArticleDOI
TL;DR: This article presents a new evolutionary algorithm for the induction of mixed decision trees that searches for an optimal tree in a global manner; that is, it learns a tree structure and finds tests in one run of the EA.
Abstract: This article presents a new evolutionary algorithm (EA) for the induction of mixed decision trees. In nonterminal nodes of a mixed tree, different types of tests can be placed, ranging from a typical inequality test up to an oblique test based on a splitting hyperplane. In contrast to classical top-down methods, the proposed system searches for an optimal tree in a global manner; that is, it learns a tree structure and finds tests in one run of the EA. Specialized genetic operators are developed, which allow the system to exchange parts of trees, generate new subtrees, prune existing ones, and change the node type and the tests. An informed mutation application scheme is introduced, reducing the number of unprofitable modifications. The proposed approach is experimentally verified on both artificial and real-life data, and the results are promising. Scaling of system performance with increasing training data size was also investigated.

32 citations
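
As a toy illustration of global (non-top-down) evolutionary induction, the Python sketch below evolves whole trees, with structure, thresholds, and leaf labels all modified in one loop. It is an assumed simplification: tests are plain axis-parallel thresholds, leaf labels are evolved directly, and the paper's mixed/oblique tests, specialized crossover, and informed mutation scheme are not reproduced.

```python
# Toy sketch of globally evolving a decision tree (not the paper's algorithm).
import random

def make_leaf(y):
    return ("leaf", random.choice(sorted(set(y))))   # leaf labels are evolved, not fitted

def random_tree(X, y, depth=2):
    if depth == 0:
        return make_leaf(y)
    f = random.randrange(len(X[0]))
    t = random.choice([row[f] for row in X])         # threshold drawn from observed values
    return ("node", f, t, random_tree(X, y, depth - 1), random_tree(X, y, depth - 1))

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, f, t, left, right = tree
    return predict(left if x[f] <= t else right, x)

def fitness(tree, X, y):
    return sum(predict(tree, x) == c for x, c in zip(X, y)) / len(y)

def mutate(tree, X, y):
    if tree[0] == "leaf" or random.random() < 0.3:
        return random_tree(X, y, depth=2)            # regrow a subtree: structure + tests at once
    _, f, t, left, right = tree
    if random.random() < 0.5:
        return ("node", f, t, mutate(left, X, y), right)
    return ("node", f, t, left, mutate(right, X, y))

def evolve(X, y, pop_size=20, generations=30):
    pop = [random_tree(X, y, depth=3) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda tr: -fitness(tr, X, y))
        pop = pop[: pop_size // 2]                    # truncation selection
        pop += [mutate(random.choice(pop), X, y) for _ in range(pop_size - len(pop))]
    return max(pop, key=lambda tr: fitness(tr, X, y))

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0], [0.5, 0.5]]
y = [0, 0, 1, 1, 1, 0]
best = evolve(X, y)
print("training accuracy:", fitness(best, X, y))
```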


Book ChapterDOI
TL;DR: A two-phase normalization approach is proposed: heterogeneous dimensions are reshaped into a set of well-behaved homogeneous subdimensions, followed by the enforcement of summarizability in each dimension's data hierarchy.
Abstract: Comprehensive data analysis has become indispensable in a variety of domains. OLAP (On-Line Analytical Processing) systems tend to perform poorly or even fail when applied to complex data scenarios. The restriction of the underlying multidimensional data model to admit only homogeneous and balanced dimension hierarchies is too rigid for many real-world applications and, therefore, has to be overcome in order to provide adequate OLAP support. We present a framework for classifying and modeling complex multidimensional data, with the major effort at the conceptual level being to transform irregular hierarchies so that they can be navigated in a uniform manner. The properties of various hierarchy types are formalized and a two-phase normalization approach is proposed: heterogeneous dimensions are reshaped into a set of well-behaved homogeneous subdimensions, followed by the enforcement of summarizability in each dimension's data hierarchy. Mapping the data to a visual data browser relies solely on metadata, which captures the properties of facts, dimensions, and relationships within the dimensions. The navigation is schema-based; that is, users interact with dimensional levels with on-demand data display. The power of our approach is exemplified using a real-world study from the domain of academic administration.

31 citations
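
One common ingredient of making ragged hierarchies summarizable is padding shorter drill-down paths with placeholder members so that all leaves sit at the same depth. The small Python sketch below illustrates only that ingredient, on invented data; the paper's full two-phase normalization (heterogeneous-to-homogeneous reshaping followed by summarizability enforcement) is not reproduced.

```python
# Hypothetical illustration: pad ragged drill-down paths with placeholder members
# so every leaf ends up at the same depth (a common way to restore summarizability).
def pad_paths(paths):
    """paths: list of drill-down paths, e.g. ['Europe', 'Italy', 'Rome']."""
    depth = max(len(p) for p in paths)
    padded = []
    for p in paths:
        # Repeat the last real member as a placeholder down to the common depth.
        padded.append(p + [f"{p[-1]} (other)"] * (depth - len(p)))
    return padded

ragged = [
    ["Europe", "Italy", "Rome"],
    ["Europe", "Monaco"],          # city level missing
    ["Online"],                    # a heterogeneous branch with no geography
]
for path in pad_paths(ragged):
    print(" > ".join(path))
```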


Journal ArticleDOI
TL;DR: This chapter introduces a similarity-preserving function called the sequence and set similarity measure (S3M) that captures both the order of occurrence of page visits and the content of pages, and proposes a new clustering algorithm, SeqPAM, for clustering sequential data.
Abstract: With the growth in the number of Web users and the necessity of making information available on the Web, the problem of Web personalization has become very critical and popular. Developers are trying to customize a Web site to the needs of specific users with the help of knowledge acquired from user navigational behavior. Since user page visits are intrinsically sequential in nature, efficient clustering algorithms for sequential data are needed. In this chapter, we introduce a similarity-preserving function called the sequence and set similarity measure (S3M) that captures both the order of occurrence of page visits and the content of pages. We conducted pilot experiments comparing the results of PAM, a standard clustering algorithm, with two similarity measures: cosine and S3M. The goodness of the clusters resulting from both measures was computed using a cluster validation technique based on average Levenshtein distance. Results on the pilot dataset established the effectiveness of S3M for sequential data. Based on these results, we propose a new clustering algorithm, SeqPAM, for clustering sequential data. We tested the new algorithm on two datasets, namely the cti and msnbc datasets. We provide recommendations for Web personalization based on the clusters obtained from SeqPAM for the msnbc dataset.

28 citations
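
A plausible rendering of a sequence-and-set style similarity is sketched below in Python: an order-sensitive component (normalized longest common subsequence) blended with a content component (Jaccard similarity of the visited page sets) under a weight alpha. The exact S3M definition and weighting are given in the paper and may differ from this sketch.

```python
# Hedged sketch of a sequence-and-set similarity: order term (normalized LCS)
# blended with a content term (Jaccard over the sets of visited pages).
def lcs_length(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def s3m_like(seq_a, seq_b, alpha=0.5):
    order_sim = lcs_length(seq_a, seq_b) / max(len(seq_a), len(seq_b))
    set_a, set_b = set(seq_a), set(seq_b)
    content_sim = len(set_a & set_b) / len(set_a | set_b)
    return alpha * order_sim + (1 - alpha) * content_sim

sess1 = ["home", "products", "cart", "checkout"]
sess2 = ["home", "cart", "products", "faq"]
print(round(s3m_like(sess1, sess2), 3))
```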


Journal ArticleDOI
TL;DR: This article addresses the problem of mining pairs of items, such that the presence of one excludes the other, and proposes a probability-based evaluation metric and a mining algorithm that is tested on transaction data.
Abstract: Association rule mining is a popular task that involves the discovery of co-occurrences of items in transaction databases. Several extensions of the traditional association rule mining model have been proposed so far; however, the problem of mining for mutually exclusive items has not been directly tackled yet. Such information could be useful in various cases (e.g., when the expression of one gene excludes the expression of another), or it can serve as a strong hint for revealing inherent taxonomic information. In this article, we address the problem of mining pairs of items such that the presence of one excludes the other. First, we provide a concise review of the literature; then we define the problem, propose a probability-based evaluation metric, and finally present a mining algorithm that we test on transaction data.

22 citations
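
The Python sketch below shows one simple way to score candidate mutually exclusive pairs: both items must be individually frequent while their observed co-occurrence is far below what independence would predict. The thresholds, the score, and the data are assumptions for illustration; the paper's probability-based metric and mining algorithm may differ.

```python
# Hedged sketch: candidate mutually exclusive pairs are frequent items whose joint
# support is much smaller than expected under independence.
from itertools import combinations

def mutual_exclusion_candidates(transactions, min_support=0.2, max_ratio=0.25):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    support = {i: sum(i in t for t in transactions) / n for i in items}
    frequent = [i for i in items if support[i] >= min_support]
    candidates = []
    for a, b in combinations(sorted(frequent), 2):
        joint = sum(a in t and b in t for t in transactions) / n
        expected = support[a] * support[b]           # joint support under independence
        if joint <= max_ratio * expected:
            candidates.append((a, b, joint, expected))
    return candidates

transactions = [{"tea"}, {"coffee"}, {"tea", "milk"}, {"coffee", "milk"},
                {"tea"}, {"coffee", "sugar"}, {"tea", "sugar"}, {"coffee"}]
for a, b, joint, exp in mutual_exclusion_candidates(transactions):
    print(f"{a} / {b}: joint={joint:.2f}, expected={exp:.2f}")
```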


Journal ArticleDOI
TL;DR: This article discusses how third generation (3G) customers are predicted using logistic regression analysis, statistical tools such as Classification and Regression Trees (CART) and Multivariate Adaptive Regression Splines (MARS), and variables derived from the raw variables.
Abstract: In this article we discuss how we have predicted third generation (3G) customers using logistic regression analysis and statistical tools such as Classification and Regression Trees (CART) and Multivariate Adaptive Regression Splines (MARS), together with variables derived from the raw variables. The basic idea reflected in this paper is that the performance of logistic regression using raw variables alone can be improved upon by the use of various functions of the raw variables and dummies representing potential segments of the population.

21 citations
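
The following Python sketch illustrates the core idea on synthetic data: a logistic regression on raw variables alone versus one augmented with a transformed variable and a segment dummy. The feature names, transformations, and data-generating process are hypothetical; they only mimic the kind of derived variables the article describes.

```python
# Hedged sketch: logistic regression on raw variables vs. raw + derived variables
# (a log transform and a dummy for a "heavy caller" segment).  Data is synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "monthly_spend": rng.gamma(2.0, 30.0, size=500),
    "voice_minutes": rng.gamma(3.0, 100.0, size=500),
})
# Synthetic target loosely tied to a non-linear function of the raw variables.
p = 1 / (1 + np.exp(-(np.log1p(raw["monthly_spend"]) - 4 + (raw["voice_minutes"] > 400))))
y = rng.binomial(1, p)

derived = raw.assign(log_spend=np.log1p(raw["monthly_spend"]),
                     heavy_caller=(raw["voice_minutes"] > 400).astype(int))

for name, X in [("raw only", raw), ("raw + derived", derived)]:
    model = LogisticRegression(max_iter=1000).fit(X, y)
    print(name, "training accuracy:", round(model.score(X, y), 3))
```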


Journal ArticleDOI
TL;DR: This chapter uses a novel approach to instantiate and solve four versions of the Materialized View Selection (MVS) problem using three sampling techniques and two databases, and compares the resulting solutions with the optimal solutions to the actual problems.
Abstract: In any online decision support system, the backbone is a data warehouse. In order to facilitate rapid responses to complex business decision support queries, it is common practice to materialize an appropriate set of views at the data warehouse. However, this typically requires solving the Materialized View Selection (MVS) problem to select the right set of views to materialize in order to achieve a certain level of service given a limited amount of resources such as materialization time, storage space, or view maintenance time. Dynamic changes in the source data and the end users' requirements necessitate rapid and repetitive instantiation and solution of the MVS problem. In an online decision support context, time is of the essence in finding acceptable solutions to this problem. In this chapter, we have used a novel approach to instantiate and solve four versions of the MVS problem using three sampling techniques and two databases. We compared these solutions with the optimal solutions corresponding to the actual problems. In our experimentation, we found that the sampling approach resulted in substantial savings in time while producing good solutions.

18 citations
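
To give a flavor of view selection under a resource limit with sampled inputs, the Python sketch below greedily picks views by estimated benefit per unit of storage, where the benefit is estimated from a random sample of the query workload. All names, sizes, and speedups are invented, and the paper's four MVS formulations and its specific sampling techniques are not reproduced.

```python
# Hedged sketch: greedy view selection under a space budget, with per-view benefit
# estimated from a *sample* of the query workload rather than the full workload.
import random

def estimate_benefit(view, sampled_queries, speedup):
    """Total estimated saving of 'view' over the sampled queries it can answer."""
    return sum(speedup[view] for q in sampled_queries if view in q["answerable_by"])

def greedy_select(views, sizes, speedup, workload, budget, sample_frac=0.2, seed=0):
    random.seed(seed)
    sample = random.sample(workload, max(1, int(sample_frac * len(workload))))
    selected, used = [], 0
    remaining = set(views)
    while remaining:
        # Pick the view with the best benefit per unit of storage that still fits.
        best = max(remaining, key=lambda v: estimate_benefit(v, sample, speedup) / sizes[v])
        remaining.discard(best)
        if used + sizes[best] <= budget and estimate_benefit(best, sample, speedup) > 0:
            selected.append(best)
            used += sizes[best]
    return selected

views = ["v_region_month", "v_product_day", "v_customer_year"]
sizes = {"v_region_month": 40, "v_product_day": 120, "v_customer_year": 25}
speedup = {"v_region_month": 8.0, "v_product_day": 15.0, "v_customer_year": 3.0}
workload = [{"answerable_by": {"v_region_month"}} for _ in range(30)] + \
           [{"answerable_by": {"v_product_day", "v_customer_year"}} for _ in range(20)]
print(greedy_select(views, sizes, speedup, workload, budget=100))
```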


Journal ArticleDOI
TL;DR: Alternative design solutions that can be adopted, in the presence of late measurements, to support different types of queries that enable meaningful historical analysis are discussed, with the aim of enabling well-informed design decisions.
Abstract: Though in most data warehousing applications no relevance is given to the time when events are recorded, some domains call for a different behavior. In particular, whenever late measurements of events take place, and particularly when the events registered are subject to further updates, traditional design solutions fail to preserve accountability and query consistency. In this article, we discuss the alternative design solutions that can be adopted, in the presence of late measurements, to support different types of queries that enable meaningful historical analysis. These solutions are based on enforcing the distinction between transaction time and valid time within the schema that represents the fact of interest. In addition, we provide a qualitative and quantitative comparison of the proposed solutions, aimed at enabling well-informed design decisions.

15 citations
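
The distinction between valid time and transaction time can be illustrated with a small Python sketch: each fact version records both when the event occurred and when it was registered, so late measurements and later corrections can be replayed "as known at" a given date. The record layout and helper below are assumptions for illustration, not the article's specific design alternatives.

```python
# Hedged sketch: fact versions carry valid time (when the event occurred) and
# transaction time (when the version was recorded), enabling accountable queries.
from dataclasses import dataclass
from datetime import date

@dataclass
class FactVersion:
    event_id: str
    measure: float
    valid_date: date        # when the event really happened
    recorded_on: date       # when this version entered the warehouse

facts = [
    FactVersion("order-1", 100.0, date(2007, 1, 10), date(2007, 1, 12)),  # late measurement
    FactVersion("order-1", 120.0, date(2007, 1, 10), date(2007, 2, 1)),   # later correction
]

def as_known_at(facts, event_id, as_of):
    """Latest version of an event that had been recorded by 'as_of' (accountability)."""
    versions = [f for f in facts if f.event_id == event_id and f.recorded_on <= as_of]
    return max(versions, key=lambda f: f.recorded_on) if versions else None

print(as_known_at(facts, "order-1", date(2007, 1, 15)).measure)   # 100.0
print(as_known_at(facts, "order-1", date(2007, 3, 1)).measure)    # 120.0
```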


Journal ArticleDOI
TL;DR: This report proposes the Gradually Expanded Tree Ensemble (GetEnsemble) method, which handles the difficulties of the task by ensembling expanded trees and is found to outperform the other methods evaluated on this task.
Abstract: Our LAMDAer team won the grand championship (open category) of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2006 Data Mining Competition. This report presents our solution to the PAKDD 2006 Data Mining Competition. Following a brief description of the task, we discuss the difficulties of the task and explain the motivation of our solution. Then, we propose the Gradually Expanded Tree Ensemble (GetEnsemble) method, which handles these difficulties by ensembling expanded trees. We evaluated the proposed method and several other methods using AUC, and found that the proposed method beats the others on this task. In addition, we show how the proposed method can be used to obtain cues about which kinds of second generation (2G) customers are likely to become third generation (3G) users.

Journal ArticleDOI
TL;DR: Using real world data provided to the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2006 Data Mining Competition, the effectiveness of Friedman’s stochastic gradient boosting (Multiple Additive Regression Trees [MART]) for the rapid development of a high performance predictive model is explored.
Abstract: Mobile phone customers face many choices regarding handset hardware, add-on services, and features to subscribe to from their service providers. Mobile phone companies are now increasingly interested in the drivers of migration to third generation (3G) hardware and services. Using real-world data provided to the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2006 Data Mining Competition, we explore the effectiveness of Friedman's stochastic gradient boosting (Multiple Additive Regression Trees [MART]) for the rapid development of a high-performance predictive model.
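
For readers unfamiliar with the technique, the Python sketch below shows Friedman-style stochastic gradient boosting with scikit-learn (setting subsample below 1.0 gives the stochastic variant) evaluated by AUC. The data is synthetic and the hyperparameters are arbitrary; the actual competition data and the MART implementation used by the authors are not reproduced here.

```python
# Hedged sketch: stochastic gradient boosting on synthetic, imbalanced data,
# evaluated with AUC.  Not the authors' model or data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, subsample=0.5, random_state=0)
gbm.fit(X_tr, y_tr)
scores = gbm.predict_proba(X_te)[:, 1]
print("test AUC:", round(roc_auc_score(y_te, scores), 3))
```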

Journal ArticleDOI
TL;DR: The results on the discovery of pairs or groups of sibling terms with XTREEM-SA (Xhtml TREE mining for sibling associations), an algorithm that extracts semantics from Web documents, are presented and the challenges of evaluating semantics extracted from the Web against handcrafted ontologies of high quality but possibly low coverage are elaborate.
Abstract: The automated discovery of relationships among terms contributes to the automation of the ontology engineering process and allows for sophisticated query expansion in information retrieval. While there are many findings on the identification of direct hierarchical relations among concepts, less attention has been paid to the discovery of sibling terms. These are terms that share a common, a priori unknown parent, such as co-hyponyms and co-meronyms. In this study, we present our results on the discovery of pairs or groups of sibling terms with XTREEM-SA (Xhtml TREE mining for sibling associations), an algorithm that extracts semantics from Web documents. While conventional methods process an appropriately prepared corpus, XTREEM-SA takes as input an arbitrary collection of Web documents on a given topic and finds sibling relations between terms in this corpus. It is thus independent of domain and language, does not require linguistic preprocessing, and does not rely on syntactic or other rules of text formation. We describe XTREEM-SA and evaluate it against two reference ontologies. In this context, we also elaborate on the challenges of evaluating semantics extracted from the Web against handcrafted ontologies of high quality but possibly low coverage.
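
The intuition of mining sibling terms from document structure can be illustrated with a simplified Python sketch: terms that appear as text of sibling HTML elements (e.g., items of the same list) are counted as candidate sibling pairs. The parser below is a deliberately crude assumption (it ignores void tags and nested text) and is not the XTREEM-SA algorithm.

```python
# Hedged sketch inspired by the sibling-term idea: text of sibling elements under
# the same parent is grouped, and pairs within a group are counted.
from html.parser import HTMLParser
from collections import Counter
from itertools import combinations

class SiblingGroups(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []          # ids of currently open elements
        self.groups = {}         # parent id -> list of child text snippets
        self.next_id = 0
        self.current_text = []

    def handle_starttag(self, tag, attrs):
        self.next_id += 1
        self.stack.append(self.next_id)
        self.current_text = []

    def handle_data(self, data):
        self.current_text.append(data.strip())

    def handle_endtag(self, tag):
        if not self.stack:
            return
        self.stack.pop()
        parent = self.stack[-1] if self.stack else 0
        text = " ".join(t for t in self.current_text if t)
        if text:
            self.groups.setdefault(parent, []).append(text)
        self.current_text = []

def sibling_pairs(html):
    parser = SiblingGroups()
    parser.feed(html)
    pairs = Counter()
    for siblings in parser.groups.values():
        for a, b in combinations(sorted(set(siblings)), 2):
            pairs[(a, b)] += 1
    return pairs

print(sibling_pairs("<ul><li>jazz</li><li>rock</li><li>blues</li></ul>"))
```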

Journal ArticleDOI
TL;DR: This article combines several classifiers, some of them ensemble techniques, into a heterogeneous meta-ensemble, to produce a probability estimate for each test case and uses a simple decision theoretic framework to form a classification.
Abstract: This article describes the entry of the Super Computer Data Mining (SCDM) Project to the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2006 Data Mining Competition. The SCDM project is developing data mining tools for parallel execution on Linux clusters. The code is freely available; please contact the first author for a copy. We combine several classifiers, some of them ensemble techniques, into a heterogeneous meta-ensemble to produce a probability estimate for each test case. We then use a simple decision-theoretic framework to form a classification. The meta-ensemble contains a Bayesian neural network, a learning classifier system (LCS), attribute selection-based ensemble algorithms (Filtered Attribute Subspace-based Bagging with Injected Randomness [FASBIR]), and more widely known classifiers such as logistic regression, Naive Bayes (NB), and C4.5.
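
The overall recipe (average probability estimates from heterogeneous models, then apply a decision-theoretic threshold) can be sketched in a few lines of Python. The base models, costs, and data below are stand-ins chosen for illustration; the SCDM entry's actual components (Bayesian neural network, LCS, FASBIR, C4.5) are not reproduced.

```python
# Hedged sketch: heterogeneous meta-ensemble by probability averaging, followed by
# a minimum-expected-cost threshold.  Models and costs are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=15, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

members = [LogisticRegression(max_iter=1000), GaussianNB(), DecisionTreeClassifier(max_depth=5)]
probs = np.mean([m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in members], axis=0)

# Decide "positive" when the expected cost of a false negative outweighs a false positive.
cost_fp, cost_fn = 1.0, 5.0
threshold = cost_fp / (cost_fp + cost_fn)        # classic minimum-expected-cost threshold
decisions = (probs >= threshold).astype(int)
print("positives flagged:", decisions.sum(), "of", len(decisions))
```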

Journal ArticleDOI
TL;DR: A new single-pass algorithm for discovering significant intervals that represent the intrinsic nature of the data is presented; its characteristics, advantages, and disadvantages are discussed; and its performance is compared with previously developed level-wise and SQL-based algorithms for significant interval discovery.
Abstract: Sensor-based applications, such as smart homes, require the prediction of event occurrences for automating the environment using time-series data collected over a period of time. In these applications, it is important to predict events in tight and accurate intervals to effectively automate the application. This article deals with the discovery of significant intervals from time-series data. Although there is a considerable body of work on sequential mining of transactional data, most of it deals with time-point data and makes several passes over the entire data set in order to discover frequently occurring patterns/events. We propose an approach in which significant intervals representing the intrinsic nature of the data are discovered in a single pass. In our approach, time-series data is folded over a periodicity (day, week, etc.) in which the intervals are formed. Significant intervals are discovered from this interval data that satisfy the criteria of minimum confidence and maximum interval length specified by the user. Both compression and working with intervals contribute towards improving the efficiency of the algorithm. In this article, we present a new single-pass algorithm for detecting significant intervals; discuss its characteristics, advantages, and disadvantages; and analyze it. Finally, we compare the performance of our algorithm with previously developed level-wise and SQL-based algorithms for significant interval discovery (SID).
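
A simplified Python sketch of the general idea follows: event times are folded over a daily periodicity, sorted, and grown into intervals in one pass; an interval is kept if it is short enough and covers a large enough fraction of the observed days. The interval-growing rule, parameters, and data are assumptions; the authors' single-pass algorithm differs in detail.

```python
# Hedged sketch: fold event times over a day, grow intervals in one pass, and keep
# those meeting a maximum length and minimum confidence.  Not the authors' algorithm.
def significant_intervals(event_minutes_per_day, max_len=60, min_conf=0.6):
    n_days = len(event_minutes_per_day)
    folded = sorted(m for day in event_minutes_per_day for m in day)   # fold over the day
    intervals, start = [], 0
    for i in range(1, len(folded) + 1):
        # Close the current interval when the next event would make it too long.
        if i == len(folded) or folded[i] - folded[start] > max_len:
            lo, hi = folded[start], folded[i - 1]
            conf = sum(any(lo <= m <= hi for m in day) for day in event_minutes_per_day) / n_days
            if conf >= min_conf:
                intervals.append((lo, hi, conf))
            start = i
    return intervals

# Minutes-after-midnight at which a sensor fired, for five days.
days = [[430, 440, 1210], [435, 1205], [428, 445], [431, 1215], [1200]]
print(significant_intervals(days))
```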

Journal ArticleDOI
TL;DR: A hyper-heuristic for constructing rule search heuristics for weighted covering algorithms that allows producing rules of a desired generality is proposed; it is based on PN space, a new ROC-like tool for the analysis, evaluation, and visualization of rules.
Abstract: Rule induction from examples is a machine learning technique that finds rules of the form condition → class, where condition and class are logic expressions of the form variable₁ = value₁ ∧ variable₂ = value₂ ∧ … ∧ variableₖ = valueₖ. There are in general three approaches to rule induction: exhaustive search, divide-and-conquer, and separate-and-conquer (or its extension as weighted covering). Among them, the third approach, depending on the rule search heuristic used, can avoid the problems of producing many redundant rules (a limitation of the first approach) or non-overlapping rules (a limitation of the second approach). In this chapter, we propose a hyper-heuristic to construct rule search heuristics for weighted covering algorithms that allows producing rules of a desired generality. The hyper-heuristic is based on PN space, a new ROC-like tool for the analysis, evaluation, and visualization of rules. Well-known rule search heuristics such as entropy, the Laplacian, weighted relative accuracy, and others are equivalent to ones produced by the hyper-heuristic. Moreover, it can generate new non-linear rule search heuristics, some of which are especially appropriate for description tasks. The non-linear rule search heuristics have been experimentally compared with others on the generality of rules induced from UCI datasets and used to learn regulatory rules from microarray data.
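
To ground the PN-space view, the Python sketch below evaluates a few well-known rule search heuristics in PN coordinates, where p and n are the positives and negatives covered by a rule and P and N are the dataset totals. These are the standard textbook formulas; the chapter's hyper-heuristic for constructing such heuristics is not reproduced.

```python
# Standard rule search heuristics in PN coordinates (p, n covered; P, N totals).
def precision(p, n, P, N):
    return p / (p + n)

def laplace(p, n, P, N):
    return (p + 1) / (p + n + 2)

def weighted_relative_accuracy(p, n, P, N):
    coverage = (p + n) / (P + N)
    return coverage * (p / (p + n) - P / (P + N))

# Two candidate rules on a dataset with P = 40 positives and N = 60 negatives.
for name, (p, n) in [("specific rule", (8, 1)), ("general rule", (30, 20))]:
    print(name,
          "precision=%.3f" % precision(p, n, 40, 60),
          "laplace=%.3f" % laplace(p, n, 40, 60),
          "wra=%.3f" % weighted_relative_accuracy(p, n, 40, 60))
```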

Journal ArticleDOI
TL;DR: The background behind the preparation of the dataset, the choice of judging criteria incorporating both a quantitative measure of accuracy and a set of subjective qualitative assessments, and finally a summary of the participation and results are discussed.
Abstract: The 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2006 Data Mining Competition involved the problem of classifying mobile telecom network customers into second generation (2G) and third generation (3G) services, with the ultimate aim of identifying existing 2G network customers who had a high potential of switching to the mobile operator's new 3G mobile network and services. This paper discusses the background behind the preparation of the dataset, the choice of judging criteria incorporating both a quantitative measure of accuracy and a set of subjective qualitative assessments, and finally a summary of the participation and results. We also highlight in the report interesting observations and findings from some of the participating teams.

Journal ArticleDOI
TL;DR: A novel algorithm called Artificial Immune Recognition System (AIRS) that is based on the specificity of the human immune system is used, achieving an accuracy rate in the range of 80% to 90%, depending on the set of parameter values used.
Abstract: With the recent introduction of third generation (3G) technology in the field of mobile communications, mobile phone service providers will have to find an effective strategy to market this new technology. One approach is to analyze the current profile of existing 3G subscribers to discover common patterns in their usage of mobile phones. With these usage patterns, the service provider can effectively target certain classes of customers who are more likely to purchase their subscription plans. To discover these patterns, we use a novel algorithm called the Artificial Immune Recognition System (AIRS) that is based on the specificity of the human immune system. In our experiment, the algorithm performs well, achieving an accuracy rate in the range of 80% to 90%, depending on the set of parameter values used.

Journal ArticleDOI
TL;DR: This article presents an efficient feature selection approach to the promoter recognition, prediction, and localization problem, and implements a hybrid filter-wrapper, feature-deletion (or addition) algorithmic process to select those BPBVs that best discriminate between two DNA sequence target classes.
Abstract: With the completion of various whole genomes, one of the fundamental bioinformatics tasks is the identification of functional regulatory regions, such as promoters, and the computational discovery of genes from the produced DNA sequences. Confronted with huge amounts of DNA sequences, the utilization of automated computational sequence analysis methods and tools is more than demanding. In this article, we present an efficient feature selection approach to the promoter recognition, prediction, and localization problem. The whole approach is implemented in a system called MineProm. The basic idea underlying our approach is that each position-nucleotide pair in a DNA sequence is represented by a distinct binary-valued feature, the binary position base value (BPBV). A hybrid filter-wrapper, feature-deletion (or addition) algorithmic process is then used to select those BPBVs that best discriminate between two DNA sequence target classes (i.e., promoter vs. nonpromoter). MineProm is tested on two widely used benchmark data sets. Assessment of the results demonstrates the reliability of the approach.
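
The BPBV representation itself is easy to sketch: every (position, nucleotide) pair of an aligned DNA window becomes one 0/1 feature. The Python sketch below shows only that encoding, on invented sequences; MineProm's hybrid filter-wrapper feature deletion/addition step is not shown.

```python
# Hedged sketch of the binary position base value (BPBV) idea: each
# (position, nucleotide) pair becomes a distinct binary feature.
import numpy as np

BASES = "ACGT"

def bpbv_encode(sequences):
    """Encode equal-length DNA sequences as a binary matrix of position-base features."""
    length = len(sequences[0])
    features = np.zeros((len(sequences), length * len(BASES)), dtype=int)
    for row, seq in enumerate(sequences):
        for pos, base in enumerate(seq.upper()):
            if base in BASES:                      # unknown bases (e.g. 'N') stay all-zero
                features[row, pos * len(BASES) + BASES.index(base)] = 1
    return features

seqs = ["TATAAT", "TTGACA", "TATACT"]
X = bpbv_encode(seqs)
print(X.shape)        # (3, 24): 6 positions x 4 bases
print(X[0])
```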

Journal ArticleDOI
TL;DR: This article proposes two new item-based algorithms, which are experimentally evaluated with kNN, and shows that, in terms of precision, the proposed methods outperform kNN classification by up to 15%, whereas compared to other methods, like the C4.5 system, improvement exceeds 30%.
Abstract: The existence of noise in the data significantly impacts the accuracy of classification. In this article, we are concerned with the development of novel classification algorithms that can efficiently handle noise. To attain this, we recognize an analogy between k nearest neighbors (kNN) classification and user-based collaborative filtering algorithms, as they both find a neighborhood of similar past data and process its contents to make a prediction about new data. The recent development of item-based collaborative filtering algorithms, which are based on similarities between items instead of transactions, addresses the sensitivity of user-based methods to noise in recommender systems. For this reason, we focus on the item-based paradigm, in comparison to kNN algorithms, to provide improved robustness against noise for the classification problem. We propose two new item-based algorithms, which are experimentally evaluated against kNN. Our results show that, in terms of precision, the proposed methods outperform kNN classification by up to 15%, whereas compared to other methods, like the C4.5 system, the improvement exceeds 30%.
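
The user-based versus item-based contrast can be illustrated on binary data with the Python sketch below: the user-based route (kNN) compares rows (instances), while an item-based route compares feature columns with the class column and aggregates those similarities for a new instance. This is a loose, hypothetical construction meant only to illustrate the analogy; it is not either of the paper's two proposed algorithms.

```python
# Hedged illustration of the analogy: user-based = similar rows (kNN);
# item-based = column-to-class similarities aggregated for a new instance.
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def knn_predict(X_train, y_train, x, k=3):
    sims = np.array([cosine(row, x) for row in X_train])
    neighbours = np.argsort(-sims)[:k]
    return int(round(y_train[neighbours].mean()))

def item_based_predict(X_train, y_train, x):
    # Similarity of each feature column to the class column, aggregated over the
    # features present in the new instance.
    col_sims = np.array([cosine(X_train[:, j], y_train) for j in range(X_train.shape[1])])
    score = float(col_sims @ x) / max(x.sum(), 1)
    return int(score >= 0.5)

X_train = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 1, 0]])
y_train = np.array([1, 1, 0, 0, 1])
x_new = np.array([1, 0, 0, 1])
print("user-based (kNN):", knn_predict(X_train, y_train, x_new))
print("item-based:      ", item_based_predict(X_train, y_train, x_new))
```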

Journal ArticleDOI
TL;DR: An information-theoretic framework is used to simultaneously cluster the logged process traces, encoding structural information, as well as a number of performance metrics associated with them, providing the analyst with a compact and handy description of major execution scenarios for the process.
Abstract: Mining process logs has been increasingly attracting the data mining community, due to the opportunities that process mining techniques offer for the analysis and design of complex processes. Currently, these techniques focus on "structural" aspects by only considering which activities were executed and in which order, disregarding any other kind of data usually kept by real systems (e.g., activity executors, parameter values, and time-stamps). In this article, we aim at discovering different process variants by clustering process logs. To this purpose, an information-theoretic framework is used to simultaneously cluster the logged process traces, encoding structural information as well as a number of performance metrics associated with them. Each cluster is equipped with a specific model, providing the analyst with a compact and handy description of the major execution scenarios for the process.
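
To illustrate the kind of input such trace clustering works on, the Python sketch below represents each logged trace by structural features (activity occurrence counts) plus a performance metric (duration) and clusters the result. Note the swap: the article uses an information-theoretic co-clustering framework, whereas this sketch uses plain k-means on invented traces purely for illustration.

```python
# Hedged sketch: cluster process traces on combined structural (activity counts)
# and performance (duration) features.  Plain k-means stands in for the article's
# information-theoretic framework; traces are invented.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

traces = [
    (["receive", "check", "approve", "pay"], 4.0),
    (["receive", "check", "reject"], 1.5),
    (["receive", "check", "approve", "pay"], 3.5),
    (["receive", "check", "check", "reject"], 2.0),
]
activities = sorted({a for acts, _ in traces for a in acts})

def featurize(trace):
    acts, duration = trace
    counts = [acts.count(a) for a in activities]      # structural part
    return counts + [duration]                        # performance part

X = StandardScaler().fit_transform(np.array([featurize(t) for t in traces]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```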