
Showing papers in "Knowledge and Information Systems" (2007)


Journal ArticleDOI
TL;DR: This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.
Abstract: This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. With each algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and review current and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.

4,944 citations


Journal ArticleDOI
TL;DR: This study quantifies the sensitivity of feature selection algorithms to variations in the training set by assessing the stability of the feature preferences that they express in the form of weights-scores, ranks, or a selected feature subset.
Abstract: With the proliferation of extremely high-dimensional data, feature selection algorithms have become indispensable components of the learning process. Strangely, despite extensive work on the stability of learning algorithms, the stability of feature selection algorithms has been relatively neglected. This study is an attempt to fill that gap by quantifying the sensitivity of feature selection algorithms to variations in the training set. We assess the stability of feature selection algorithms based on the stability of the feature preferences that they express in the form of weights-scores, ranks, or a selected feature subset. We examine a number of measures to quantify the stability of feature preferences and propose an empirical way to estimate them. We perform a series of experiments with several feature selection algorithms on a set of proteomics datasets. The experiments allow us to explore the merits of each stability measure and create stability profiles of the feature selection algorithms. Finally, we show how stability profiles can support the choice of a feature selection algorithm.
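One way to picture the kind of stability estimation described above is to compare the feature rankings a selector produces on bootstrap resamples of the training set. The sketch below is our illustration, not the authors' exact procedure: the resampling scheme, the univariate scorer, and Spearman correlation as the similarity between rankings are all assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def stability_estimate(X, y, score_fn, n_resamples=20, seed=0):
    """Average pairwise Spearman correlation between the feature
    scores obtained on bootstrap resamples of the training set."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scorings = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)           # bootstrap resample
        scorings.append(score_fn(X[idx], y[idx]))  # one score per feature
    corrs = [spearmanr(a, b)[0]
             for i, a in enumerate(scorings) for b in scorings[i + 1:]]
    return float(np.mean(corrs))

def abs_corr(X, y):
    """Illustrative univariate scorer: |correlation| of each feature with y."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)

X, y = np.random.randn(100, 30), np.random.randn(100)
print(stability_estimate(X, y, abs_corr))
```

Repeating this for several selectors yields the kind of stability profile the paper uses to compare them.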

536 citations


Journal ArticleDOI
TL;DR: Two variations of a method that uses the median from a neighborhood of a data point and a threshold value to compare the difference between the median and the observed data value are proposed.
Abstract: In this article we consider the problem of detecting unusual values or outliers from time series data where the process by which the data are created is difficult to model. The main consideration is the fact that data closer in time are more correlated to each other than those farther apart. We propose two variations of a method that uses the median from a neighborhood of a data point and a threshold value to compare the difference between the median and the observed data value. Both variations of the method are fast and can be used for data streams that occur in quick succession such as sensor data on an airplane.
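The core idea is simple enough to sketch directly: compare each observation to the median of its temporal neighborhood and flag large deviations. This is a minimal illustration of the general approach, with an illustrative window size and threshold; it is not either of the paper's two exact variations.

```python
import numpy as np

def median_outliers(x, k=2, tau=4.0):
    """Flag x[i] when it deviates from the median of its k neighbors
    on each side (x[i] itself excluded) by more than tau."""
    x = np.asarray(x, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    for i in range(len(x)):
        lo, hi = max(0, i - k), min(len(x), i + k + 1)
        window = np.delete(x[lo:hi], i - lo)   # neighborhood without x[i]
        flags[i] = abs(x[i] - np.median(window)) > tau
    return flags

print(median_outliers([1.0, 1.1, 0.9, 9.0, 1.0, 1.2], k=2, tau=3.0))
# [False False False  True False False]
```

Because each point touches only a fixed-size window, the per-sample cost is constant, which is what makes the method suitable for fast streams.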

219 citations


Journal ArticleDOI
TL;DR: A novel tree structure, called CanTree (canonical-order tree), captures the content of the transaction database and orders tree nodes according to some canonical order; the tree can be easily maintained when database transactions are inserted, deleted, and/or modified.
Abstract: Since its introduction, frequent-pattern mining has been the subject of numerous studies, including incremental updating. Many existing incremental mining algorithms are Apriori-based and are not easily adaptable to FP-tree-based frequent-pattern mining. In this paper, we propose a novel tree structure, called CanTree (canonical-order tree), that captures the content of the transaction database and orders tree nodes according to some canonical order. By exploiting its nice properties, the CanTree can be easily maintained when database transactions are inserted, deleted, and/or modified. For example, the CanTree does not require adjustment, merging, and/or splitting of tree nodes during maintenance. No rescan of the entire updated database or reconstruction of a new tree is needed for incremental updating. Experimental results show the effectiveness of our CanTree in the incremental mining of frequent patterns. Moreover, the applicability of CanTrees is not confined to incremental mining; CanTrees are also applicable to other frequent-pattern mining tasks, including constrained mining and interactive mining.
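The maintenance-free property follows from fixing the node order up front. The toy sketch below uses lexicographic order as the canonical order (an assumption; any fixed order works) to show why insertions and deletions only touch counters, never the tree shape.

```python
class Node:
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

class CanTree:
    """Toy canonical-order tree: transactions are inserted along a path
    determined by a fixed item order, so updates never reorder nodes."""
    def __init__(self):
        self.root = Node()

    def _path(self, transaction):
        node = self.root
        for item in sorted(transaction):     # canonical (lexicographic) order
            node = node.children.setdefault(item, Node(item))
            yield node

    def insert(self, transaction):
        for node in self._path(transaction):
            node.count += 1

    def delete(self, transaction):           # assumes it was inserted before
        for node in self._path(transaction):
            node.count -= 1                  # counts change; structure does not

tree = CanTree()
tree.insert({"b", "a", "c"})
tree.insert({"a", "b"})
tree.delete({"a", "b"})
```

Contrast this with an FP-tree, whose frequency-descending node order can force merging and splitting whenever the frequency ranking changes.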

149 citations


Journal ArticleDOI
TL;DR: This paper proposes a new solution which goes the opposite way, that is, adapting the multi-instance representation to single-instance learning algorithms, and shows that the proposed method works well on standard as well as generalized multi-instance problems.
Abstract: In multi-instance learning, the training set is composed of labeled bags, each of which consists of many unlabeled instances; that is, an object is represented by a set of feature vectors instead of only one feature vector. Most current multi-instance learning algorithms work by adapting single-instance learning algorithms to the multi-instance representation, while this paper proposes a new solution that goes the opposite way: adapting the multi-instance representation to single-instance learning algorithms. In detail, the instances of all the bags are first collected together and clustered into d groups. Each bag is then re-represented by d binary features, where the value of the ith feature is set to one if the concerned bag has instances falling into the ith group and zero otherwise. Thus, each bag is represented by one feature vector so that single-instance classifiers can be used to distinguish different classes of bags. By repeating the above process with different values of d, many classifiers can be generated and then combined into an ensemble for prediction. Experiments show that the proposed method works well on standard as well as generalized multi-instance problems.
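The re-representation step lends itself to a compact sketch. Below, k-means stands in for the clustering step (the specific clusterer and the toy data are our assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def bags_to_binary_features(bags, d, seed=0):
    """bags: list of (n_i, p) instance arrays. Returns an (n_bags, d)
    0/1 matrix; entry (b, j) is 1 iff bag b has an instance in group j."""
    km = KMeans(n_clusters=d, n_init=10, random_state=seed).fit(np.vstack(bags))
    features = np.zeros((len(bags), d), dtype=int)
    for b, bag in enumerate(bags):
        features[b, km.predict(bag)] = 1
    return features

rng = np.random.default_rng(0)
bags = [rng.normal(size=(5, 2)), rng.normal(size=(3, 2)) + 4.0]
print(bags_to_binary_features(bags, d=3))   # one d-bit vector per bag
```

Each output row is an ordinary feature vector, so any single-instance classifier can be trained on it; varying d and combining the resulting classifiers gives the ensemble described above.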

139 citations


Journal ArticleDOI
TL;DR: A data transformation is proposed that minimally suppresses the domain values in the data to satisfy the set of privacy templates and the transformed data is free of sensitive inferences even in the presence of data mining algorithms.
Abstract: We present an approach to limiting the confidence of inferring sensitive properties in order to protect against the threats posed by data mining abilities. The problem has dual goals: preserve the information for a wanted data analysis request and limit the usefulness of unwanted sensitive inferences that may be derived from the release of data. Sensitive inferences are specified by a set of "privacy templates". Each template specifies the sensitive property to be protected, the attributes identifying a group of individuals, and a maximum threshold for the confidence of inferring the sensitive property given the identifying attributes. We show that suppressing the domain values monotonically decreases the maximum confidence of such sensitive inferences. Hence, we propose a data transformation that minimally suppresses the domain values in the data to satisfy the set of privacy templates. The transformed data is free of sensitive inferences even in the presence of data mining algorithms. The prior k-anonymization focuses on personal identities; this work focuses on the association between personal identities and sensitive properties.
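A template check is easy to state concretely: for a template (identifying attributes, sensitive value, threshold h), compute the maximum confidence of the inference over all identifying groups. The pandas toy below (the table and all names are ours, purely illustrative) shows the quantity the suppression procedure drives below h:

```python
import pandas as pd

def max_inference_confidence(df, id_attrs, sensitive, value):
    """Maximum over identifying groups of P(sensitive == value | group)."""
    return df.groupby(id_attrs)[sensitive].apply(lambda s: (s == value).mean()).max()

df = pd.DataFrame({"job":     ["nurse", "nurse", "clerk", "clerk"],
                   "sex":     ["F", "F", "M", "M"],
                   "disease": ["HIV", "flu", "flu", "flu"]})
print(max_inference_confidence(df, ["job", "sex"], "disease", "HIV"))  # 0.5
```

Suppressing a domain value (e.g., replacing "nurse" with a more general value) merges identifying groups, and merging can only lower this maximum, which is the monotonicity property the paper exploits.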

138 citations


Journal ArticleDOI
TL;DR: This paper proposes a fast approximation algorithm for the single linkage method that reduces the time complexity to O(nB) by rapidly finding the near clusters to be connected using Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search.
Abstract: The single linkage method is a fundamental agglomerative hierarchical clustering algorithm. This algorithm initially regards each point as a single cluster. In the agglomeration step, it connects the pair of clusters whose nearest members are closest. This step is repeated until only one cluster remains. The single linkage method can efficiently detect clusters of arbitrary shape. However, a drawback of this method is its large time complexity of O(n²), where n represents the number of data points. This time complexity makes the method infeasible for large data. This paper proposes a fast approximation algorithm for the single linkage method. Our algorithm reduces the time complexity to O(nB) by rapidly finding the near clusters to be connected using Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search. Here, B represents the maximum number of points going into a single hash entry, and in practice it diminishes to a small constant compared to n for sufficiently large hash tables. Experimentally, we show that (1) the proposed algorithm obtains clustering results similar to those obtained by the single linkage method and (2) it runs faster for large data than the single linkage method.
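The LSH ingredient can be sketched in a few lines: random projections quantized to a grid assign each point a hash key, and only points sharing a key (a bucket of size at most B) are examined as candidate nearest neighbors. The hash-family parameters below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(X, n_hashes=8, cell=1.0, seed=0):
    """Group row indices of X by a random-projection grid hash."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n_hashes, X.shape[1]))        # random directions
    b = rng.uniform(0, cell, size=n_hashes)            # random offsets
    keys = np.floor((X @ A.T + b) / cell).astype(int)  # one key per point
    buckets = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)
    return buckets

X = np.random.randn(1000, 5)
print(max(len(v) for v in lsh_buckets(X).values()))    # bucket size, i.e. B
```

Restricting the nearest-member search to within buckets is what replaces the O(n²) all-pairs scan with the O(nB) bound quoted above.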

106 citations


Journal ArticleDOI
TL;DR: This work presents a novel approach for detecting instances with attribute noise and demonstrates its usefulness with case studies using two different real-world software measurement data sets, showing that PANDA provides better noise detection performance than the DM algorithm.
Abstract: Analyzing the quality of data prior to constructing data mining models is emerging as an important issue. Algorithms for identifying noise in a given data set can provide a good measure of data quality. Considerable attention has been devoted to detecting class noise or labeling errors. In contrast, limited research has been devoted to detecting instances with attribute noise, in part due to the difficulty of the problem. We present a novel approach for detecting instances with attribute noise and demonstrate its usefulness with case studies using two different real-world software measurement data sets. Our approach, called the Pairwise Attribute Noise Detection Algorithm (PANDA), is compared with a nearest neighbor, distance-based outlier detection technique (denoted DM) investigated in related literature. Since what constitutes noise is domain specific, our case studies use a software engineering expert to inspect the instances identified by the two approaches to determine whether they actually contain noise. It is shown that PANDA provides better noise detection performance than the DM algorithm.

79 citations


Journal ArticleDOI
TL;DR: An extension of the information bottleneck framework, called coordinated conditional information bottleneck, is presented, which takes negative relevance information into account by maximizing a conditional mutual information score subject to constraints.
Abstract: Data clustering is a popular approach for automatically finding classes, concepts, or groups of patterns. In practice, this discovery process should avoid redundancies with existing knowledge about class structures or groupings, and reveal novel, previously unknown aspects of the data. In order to deal with this problem, we present an extension of the information bottleneck framework, called coordinated conditional information bottleneck, which takes negative relevance information into account by maximizing a conditional mutual information score subject to constraints. Algorithmically, one can apply an alternating optimization scheme that can be used in conjunction with different types of numeric and non-numeric attributes. We discuss extensions of the technique to the tasks of semi-supervised classification and enumeration of successive non-redundant clusterings. We present experimental results for applications in text mining and computer vision.
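Schematically (our paraphrase of the abstract, not an equation quoted from the paper), the objective is to find a clustering C of the data X that is maximally informative about a relevance variable Y given the known structure Z, under a compression constraint:

```latex
% Schematic conditional information bottleneck objective (our paraphrase):
\max_{p(c \mid x)} \; I(C; Y \mid Z)
\quad \text{subject to} \quad I(C; X) \le \theta
```

Conditioning on Z is what encodes the negative relevance information: structure already explained by Z earns no reward, pushing the solution toward novel, non-redundant clusterings.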

79 citations


Journal ArticleDOI
TL;DR: This paper provides an overview and a sampling of the many ways the automotive industry has utilized AI, soft computing, and other intelligent system technologies in domains as diverse as manufacturing, diagnostics, on-board systems, warranty analysis, and design.
Abstract: There is a common misconception that the automobile industry is slow to adopt new technologies, such as artificial intelligence (AI) and soft computing. The reality is that many new technologies are deployed and brought to the public through the vehicles that they drive. This paper provides an overview and a sampling of the many ways the automotive industry has utilized AI, soft computing, and other intelligent system technologies in domains as diverse as manufacturing, diagnostics, on-board systems, warranty analysis, and design.

71 citations


Journal ArticleDOI
TL;DR: This paper presents metrics for assessing the content quality and maturity of categorization standards and applies these metrics to eCl@ss, UNSPSC, eOTD, and RNTD, showing that the amount of content is very unevenly spread over top-level categories and that more expressive structural features exist only for parts of these standards.
Abstract: Many e-business scenarios require the integration of product-related data into target applications or target documents at the recipient's side. Such tasks can be automated much better if the textual descriptions are augmented by a machine-feasible representation of the product semantics. For this purpose, categorization standards for products and services, like UNSPSC, eCl@ss, the ECCMA Open Technical Dictionary (eOTD), or the RosettaNet Technical Dictionary (RNTD), are available, but they vary in terms of structural properties and content. In this paper, we present metrics for assessing the content quality and maturity of such standards and apply these metrics to eCl@ss, UNSPSC, eOTD, and RNTD. Our analysis shows that (1) the amount of content is very unevenly spread over top-level categories, which contradicts the promise of a broad scope implicitly made by the existence of a large number of top-level categories, and that (2) more expressive structural features exist only for parts of these standards. Additionally, we (3) measure the amount of maintenance in the various top-level categories, which helps identify the actively maintained subject areas as opposed to those that are effectively dead branches. Finally, we show how our approach can be used (4) by enterprises for selecting an appropriate standard, and (5) by standards bodies for monitoring the maintenance of a standard as a whole.

Journal ArticleDOI
TL;DR: The rule-focusing methodology is proposed, an interactive methodology for the visual post-processing of association rules that exploits the user's focus to guide the generation of the rules by means of a specific constraint-based rule-mining algorithm.
Abstract: Because of the enormous number of rules that can be produced by data mining algorithms, knowledge post-processing is a difficult stage in an association rule discovery process. In order to find relevant knowledge for decision making, the user (a decision maker specialized in the data studied) needs to rummage through the rules. To assist in this task, we propose the rule-focusing methodology, an interactive methodology for the visual post-processing of association rules. It allows the user to explore large sets of rules freely by focusing his/her attention on limited subsets. This new approach relies on rule interestingness measures, on a visual representation, and on interactive navigation among the rules. We have implemented the rule-focusing methodology in a prototype system called ARVis. It exploits the user's focus to guide the generation of the rules by means of a specific constraint-based rule-mining algorithm.

Journal ArticleDOI
TL;DR: VR-VTK, a multimodal interface to VTK in a virtual environment, addresses several problems specific to spatial 3D interaction; a number of additional features, such as more complex interaction methods and enhanced depth perception, are discussed.
Abstract: The object-oriented Visualization Toolkit (VTK) is widely used for scientific visualization. VTK is a visualization library that provides a large number of functions for presenting three-dimensional data. Interaction with the visualized data is controlled with two-dimensional input devices, such as mouse and keyboard. Support for real three-dimensional and multimodal input is non-existent. This paper describes VR-VTK: a multimodal interface to VTK in a virtual environment. Six-degree-of-freedom input devices are used for spatial 3D interaction. They control the 3D widgets that are used to interact with the visualized data. Head tracking is used for camera control. Pedals are used for clutching. Speech input is used for application commands and system control. To address several problems specific to spatial 3D interaction, a number of additional features, such as more complex interaction methods and enhanced depth perception, are discussed. Furthermore, the need for multimodal input to support interaction with the visualization is shown. Two existing VTK applications are ported using VR-VTK to run in a desktop virtual reality system. Informal user experiences are presented.

Journal ArticleDOI
TL;DR: This paper describes an approach to defining confidence for ETIs that preserves the interpretation of confidence as an estimate of a conditional probability, and shows how association rules based on ETIs can have better coverage than rules based on traditional itemsets.
Abstract: In this paper, we explore extending association analysis to non-traditional types of patterns and non-binary data by generalizing the notion of confidence. We begin by describing a general framework that measures the strength of the connection between two association patterns by the extent to which the strength of one association pattern provides information about the strength of another. Although this framework can serve as the basis for designing or analyzing measures of association, the focus in this paper is to use the framework as the basis for extending the traditional concept of confidence to error-tolerant itemsets (ETIs) and continuous data. To that end, we provide two examples. First, we (1) describe an approach to defining confidence for ETIs that preserves the interpretation of confidence as an estimate of a conditional probability, and (2) show how association rules based on ETIs can have better coverage (at an equivalent confidence level) than rules based on traditional itemsets. Next, we derive a confidence measure for continuous data that agrees with the standard confidence measure when applied to binary transaction data. Further analysis of this result exposes some of the important issues involved in constructing a confidence measure for continuous data.
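The binary case, and one natural continuous generalization that collapses to it, fit in a few lines. This is our illustration of the "agrees with the standard measure on binary data" property; it is not necessarily the exact measure the paper derives.

```python
import numpy as np

def confidence(a, b):
    """sum(a*b)/sum(a): equals supp(A and B)/supp(A), i.e. standard
    confidence, whenever a and b are 0/1 indicator vectors, but is
    also defined for nonnegative continuous columns."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sum(a * b) / np.sum(a))

A = np.array([1, 1, 0, 1])          # transactions containing itemset A
B = np.array([1, 0, 0, 1])          # transactions containing itemset B
print(confidence(A, B))             # 2/3, the usual conf(A -> B)
print(confidence([0.5, 2.0, 0.0, 1.0], [1.0, 0.0, 0.3, 1.0]))  # continuous case
```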

Journal ArticleDOI
TL;DR: This paper presents the JITIK approach to modeling knowledge and information distribution, giving a high-level account of the research made around this project and emphasizing two particular aspects: a sophisticated argument-based mechanism for deciding among conflicting distribution policies, and the embedding of JITIK agents in enterprises using the service-oriented architecture paradigm.
Abstract: Knowledge and information distribution is indeed one of the main processes in knowledge management. Today, most information technology tools for supporting this distribution are based on repositories accessed through Web-based systems. This approach has, however, many practical limitations, mainly due to the strain it puts on the user, who is responsible for accessing the right knowledge and information at the right moments. As a solution to this problem, we have proposed an alternative approach based on the notion of delegating distribution tasks to synthetic agents, which become responsible for taking care of the organization's as well as the individuals' interests. In this way, many knowledge and information distribution tasks can be performed in the background, and the agents can recognize relevant events as triggers for distributing the right information to the right users at the right time. In this paper, we present the JITIK approach to modeling knowledge and information distribution, giving a high-level account of the research made around this project and emphasizing two particular aspects: a sophisticated argument-based mechanism for deciding among conflicting distribution policies, and the embedding of JITIK agents in enterprises using the service-oriented architecture paradigm. It must be remarked that a JITIK-based application is currently being implemented for one of the leading industries in Mexico.

Journal ArticleDOI
TL;DR: The purpose of this study is to bridge the gap with transformation algorithms for mapping the data from an abstract space to an intuitive one, which include shape correlation, periodicity, multiphysics, and spatial Bayesian.
Abstract: Analytical models intend to reveal the inner structure, dynamics, or relationships of things. However, they are not necessarily intuitive to humans. Conventional scientific visualization methods are intuitive, but limited by depth, dimension, and resolution. The purpose of this study is to bridge the gap with transformation algorithms for mapping the data from an abstract space to an intuitive one; these include shape correlation, periodicity, multiphysics, and spatial Bayesian algorithms. We tested this approach with an oceanographic case study. We found that the interactive visualization increases robustness in object tracking and positive detection accuracy in object prediction. We also found that the interactive method enables the user to process the image data at less than 1 min per image, versus 30 min per image manually. As a result, our test system can handle at least 10 times more data sets than traditional manual analyses. The results also suggest that minimal human interactions with appropriate computational transformations or cues may significantly increase overall productivity.

Journal ArticleDOI
TL;DR: It is shown that sequential time series clustering is not meaningless, and that the problem highlighted in these works stems from their use of the Euclidean distance metric as the distance measure in the delay-vector space.
Abstract: Sequential time series clustering is a technique used to extract important features from time series data. The method can be shown to be the process of clustering in the delay-vector space formalism used in the dynamical systems literature. Recently, the startling claim was made that sequential time series clustering is meaningless. This has important consequences for a significant amount of work in the literature, since such a claim invalidates these works' contributions. In this paper, we show that sequential time series clustering is not meaningless, and that the problem highlighted in these works stems from their use of the Euclidean distance metric as the distance measure in the delay-vector space. As a solution, we consider quite a general class of time series, and propose a regime based on two types of similarity that can exist between delay vectors, giving rise naturally to an alternative distance measure to Euclidean distance in the delay-vector space. We show that, using this alternative distance measure, sequential time series clustering can indeed be meaningful. We repeat a key experiment in the work on which the "meaningless" claim was based, and show that our method leads to a successful clustering outcome.
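The delay-vector space the argument turns on is just the space of sliding windows: each window of w consecutive samples becomes one point to be clustered. A minimal construction:

```python
import numpy as np

def delay_vectors(x, w):
    """Return the (n - w + 1, w) matrix of length-w sliding windows of x."""
    return np.lib.stride_tricks.sliding_window_view(np.asarray(x, float), w)

x = np.sin(np.linspace(0, 6 * np.pi, 200))
V = delay_vectors(x, w=16)      # points in the delay-vector space
print(V.shape)                  # (185, 16)
```

The paper's point is that the choice of distance between the rows of V, not the embedding itself, is what makes the subsequent clustering meaningful or not.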

Journal ArticleDOI
TL;DR: Experiments on the challenging KDD-Cup-98 datasets show that variations in data quality have a great impact on the cost and quality of discovered association rules, and confirm the proposed approach for the integrated management of data quality indicators in the KDD process, ensuring the quality of data mining results.
Abstract: The quality of discovered association rules is commonly evaluated by interestingness measures (commonly support and confidence) with the purpose of supplying indicators to the user in the understanding and use of the newly discovered knowledge. Low-quality datasets have a very bad impact on the quality of the discovered association rules, and one might legitimately wonder if a so-called "interesting" rule denoted LHS → RHS is meaningful when 30% of the LHS data are not up-to-date anymore, 20% of the RHS data are not accurate, and 15% of the LHS data come from a data source that is well known for its bad credibility. This paper presents an overview of data quality characterization and management techniques that can be advantageously employed for improving the quality awareness of the knowledge discovery and data mining processes. We propose to integrate data quality indicators for quality-aware association rule mining. We propose a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations in data quality have a great impact on the cost and quality of discovered association rules and confirm our approach for the integrated management of data quality indicators in the KDD process, which ensures the quality of data mining results.

Journal ArticleDOI
TL;DR: This paper presents engineering decision-making on pipe stress analysis through the application of knowledge-based systems (KBS); the KBS establishes a bidirectional communication with the current engineering software for pipe stress analysis so that the user benefits from this integration.
Abstract: This paper presents engineering decision-making on pipe stress analysis through the application of knowledge-based systems (KBS). Stress analysis, as part of the design and analysis of process pipe networks, serves to identify whether a given pipe arrangement can cope with weight, thermal, and pressure stress at safe operation levels. An iterative design-and-analysis cycle is routinely performed by engineers when analyzing existing networks or designing new process pipe networks. In our proposal, the KBS establishes a bidirectional communication with the current engineering software for pipe stress analysis, so that the user benefits from this integration. The stress analysis knowledge base is constructed by registering the senior engineers' know-how. The engineers' overall strategy during pipe stress analysis, to some extent captured by the KBS, is presented. Savings in engineering man-hours and usefulness in guiding experts in pipe stress analysis are the major benefits for the process industry.

Journal ArticleDOI
TL;DR: The experimental study suggests criteria for the inclusion of human factors into the user model guiding and controlling the adaptation process, and the Intelligent System for User Modelling has been developed.
Abstract: A scientific problem solving environment should be built in such a way that users (scientists) might exploit underlying technologies without a specialised knowledge about available tools and resources. An adaptive user interface can be considered as an opportunity in addressing this challenge. This paper explores the importance of individual human abilities in the design of adaptive user interfaces for scientific problem solving environments. In total, seven human factors (gender, learning abilities, locus of control, attention focus, cognitive strategy, and verbal and nonverbal IQs) have been evaluated regarding their impact on interface adjustments done manually by users. People's preferences for different interface configurations have been investigated. The experimental study suggests criteria for the inclusion of human factors into the user model guiding and controlling the adaptation process. To provide automatic means of adaptation, the Intelligent System for User Modelling has been developed.

Journal ArticleDOI
TL;DR: As data mining is increasingly recognized as a key technology to analyzing and understanding the data, the need for knowledge discovery from real-world low-quality data becomes not just overwhelming, but also compelling.
Abstract: Data mining is dedicated to searching for novel and actionable patterns and relationships that exist in a large volume of data. The mining process typically involves four major steps: (1) data collection, for example, transferring data collected from the production systems into data warehouses; (2) data preprocessing, transforming/cleansing the data to remove errors, filling missing values, and checking for inconsistency or duplicates; (3) finding patterns and models from preprocessed data; and (4) developing and monitoring the knowledge model [5, 7]. In data-driven application domains, many potential causes, such as unreliable data acquisition sources, faulty sensors, data collection errors, and the lack of data representation standards, make data vulnerable to errors and therefore lead to poor-quality data. Although these factors and constraints are widely accepted by general data mining practitioners, most applications have traditionally ignored the need for developing appropriate approaches for representing and reasoning with such data imperfections. As data mining is increasingly recognized as a key technology for analyzing and understanding data, the need for knowledge discovery from real-world low-quality data becomes not just overwhelming, but also compelling. As a result, issues related to data quality have become more and more important.

Journal ArticleDOI
TL;DR: Some challenges faced by refineries that seek to be lean, nimble, and proactive are outlined, and methodologies drawn from artificial intelligence – software agents, pattern recognition, expert systems – have a role to play.
Abstract: Agile manufacturing is the capability to prosper in a competitive environment of continuous and unpredictable changes by reacting quickly and effectively to changing markets and other exogenous factors. The agility of petroleum refineries is determined by two factors: the ability to control the process and the ability to efficiently manage the supply chain. In this paper, we outline some challenges faced by refineries that seek to be lean, nimble, and proactive. These problems, which arise in supply chain management and operations management, are seldom amenable to traditional, monolithic solutions. As discussed here using several examples, methodologies drawn from artificial intelligence (software agents, pattern recognition, expert systems) have a role to play on this path toward agility.

Journal ArticleDOI
TL;DR: This paper proposes two efficient algorithms, namely the Sample-Gene Search and the Gene–Sample Search, to mine the complete set of coherent gene clusters from microarray data sets that record the expression levels of various genes under a set of samples during a series of time points.
Abstract: Extensive studies have shown that mining microarray data sets is important in bioinformatics research and biomedical applications. In this paper, we explore a novel type of gene–sample–time microarray data set that records the expression levels of various genes under a set of samples during a series of time points. In particular, we propose the mining of coherent gene clusters from such data sets. Each cluster contains a subset of genes and a subset of samples such that the genes are coherent on the samples along the time series. The coherent gene clusters may identify the samples corresponding to some phenotypes (e.g., diseases), and suggest the candidate genes correlated to the phenotypes. We present two efficient algorithms, namely the Sample-Gene Search and the Gene–Sample Search, to mine the complete set of coherent gene clusters. We empirically evaluate the performance of our approaches on both a real microarray data set and synthetic data sets. The test results show that our approaches are both efficient and effective in finding meaningful coherent gene clusters.

Journal ArticleDOI
TL;DR: An approach to improve the management of complexity during the redesign of technical processes is proposed; its modeling approach, an extension of the Multimodeling and Multilevel Flow Modeling methodologies, is used to represent a process hierarchically, thus improving the identification of analogous equipment/sections from different processes.
Abstract: An approach to improve the management of complexity during the redesign of technical processes is proposed. The approach consists of two abstract steps. In the first step, model-based reasoning is used to automatically generate alternative representations of an existing process at several levels of abstraction. In the second step, process alternatives are generated through the application of case-based reasoning. The key point of our framework is the modeling approach, which is an extension of the Multimodeling and Multilevel Flow Modeling methodologies. These, together with a systematic design methodology, are used to represent a process hierarchically, thus improving the identification of analogous equipment/sections from different processes. The hierarchical representation results in sets of equipment/sections organized according to their functions and intentions. A case-based reasoning system then retrieves from a library of cases the equipment/sections similar to the one selected by the user. The final output is a set of equipment/sections ordered according to their similarity. Human intervention is necessary to adapt the most promising case within the original process.

Journal ArticleDOI
TL;DR: A novel prefix aggregate tree (PAT) structure is developed for online warehousing of data streams and answering ad hoc aggregate queries; query answering using a PAT costs more than with a fully materialized data cube, but the query answering time is still kept linear in the size of the transient segment.
Abstract: In some business applications such as trading management in financial institutions, it is required to accurately answer ad hoc aggregate queries over data streams. Materializing and incrementally maintaining a full data cube or even its compression or approximation over a data stream is often computationally prohibitive. On the other hand, although previous studies proposed approximate methods for continuous aggregate queries, they cannot provide accurate answers. In this paper, we develop a novel prefix aggregate tree (PAT) structure for online warehousing data streams and answering ad hoc aggregate queries. Often, a data stream can be partitioned into the historical segment, which is stored in a traditional data warehouse, and the transient segment, which can be stored in a PAT to answer ad hoc aggregate queries. The size of a PAT is linear in the size of the transient segment, and only one scan of the data stream is needed to create and incrementally maintain a PAT. Although the query answering using PAT costs more than the case of a fully materialized data cube, the query answering time is still kept linear in the size of the transient segment. Our extensive experimental results on both synthetic and real data sets illustrate the efficiency and the scalability of our design.
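A stripped-down version of the prefix-tree idea shows why one pass suffices and why prefix queries stay cheap: every node keeps the aggregate of all records sharing its dimension-value prefix. This toy (the dimension values and the single sum measure are our assumptions) omits the paper's further machinery:

```python
class PATNode:
    def __init__(self):
        self.agg, self.children = 0.0, {}

def insert(root, dims, measure):
    """Add one stream record; updates the aggregate along one path."""
    node = root
    node.agg += measure
    for v in dims:
        node = node.children.setdefault(v, PATNode())
        node.agg += measure

def query(root, prefix):
    """Aggregate of all records whose dimensions start with `prefix`."""
    node = root
    for v in prefix:
        if v not in node.children:
            return 0.0
        node = node.children[v]
    return node.agg

root = PATNode()
for dims, m in [(("NY", "tech"), 10.0), (("NY", "bank"), 5.0), (("SF", "tech"), 7.0)]:
    insert(root, dims, m)
print(query(root, ("NY",)))     # 15.0
```

Each record updates exactly one root-to-leaf path, so maintenance needs a single scan of the transient segment, and the tree's size stays linear in that segment.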

Journal ArticleDOI
TL;DR: This paper discusses the idea of semantic service requests for composite services, and presents a multi-attribute utility theory (MAUT) based model of composite service requests that enables unambiguous understanding of the service needs and more precise generation of the desired compositions.
Abstract: When meeting the challenges in automatic and semi-automatic Web service composition, capturing the user’s service demand and preferences is as important as knowing what the services can do. This paper discusses the idea of semantic service requests for composite services, and presents a multi-attribute utility theory (MAUT) based model of composite service requests. Service requests are modeled as user preferences and constraints. Two preference structures, additive independence and generalized additive independence, are utilized in calculating the expected utilities of service composition outcomes. The model is also based on an iterative and incremental scheme meant to better capture requirements in accordance with service consumers’ needs. OWL-S markup vocabularies and associated inference mechanism are used as a means to bring semantics to service requests. Ontology conceptualizations and language constructs are added to OWL-S as uniform representations of possible aspects of the requests. This model of semantics in service requests enables unambiguous understanding of the service needs and more precise generation of the desired compositions. An application scenario is presented to illustrate how the proposed model can be applied in the real business world.
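For the additive-independence case mentioned above, the expected utility of a composition outcome is just a weighted sum of per-attribute utilities. The weights, attributes, and utility functions below are invented for illustration:

```python
weights = {"price": 0.5, "latency": 0.3, "reliability": 0.2}
utils = {"price":       lambda p: 1 - p / 100.0,   # cheaper is better
         "latency":     lambda t: 1 - t / 10.0,    # faster is better
         "reliability": lambda r: r}               # higher is better

def additive_utility(outcome):
    """U(x) = sum_i w_i * u_i(x_i) under additive independence."""
    return sum(weights[a] * utils[a](outcome[a]) for a in weights)

print(additive_utility({"price": 40.0, "latency": 2.0, "reliability": 0.99}))
# 0.5*0.6 + 0.3*0.8 + 0.2*0.99 = 0.738
```

Generalized additive independence relaxes this by allowing utilities over overlapping subsets of attributes rather than single attributes.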

Journal ArticleDOI
TL;DR: This work combines a well-known unimodal regression algorithm with a simple dynamic-programming approach to obtain an optimal quadratic-time algorithm for the problem of unimodal k-segmentation and describes a more efficient greedy-merging heuristic that is experimentally shown to give solutions very close to the optimal.
Abstract: We study the problem of segmenting a sequence into k pieces so that the resulting segmentation satisfies monotonicity or unimodality constraints. Unimodal functions can be used to model phenomena in which a measured variable first increases to a certain level and then decreases. We combine a well-known unimodal regression algorithm with a simple dynamic-programming approach to obtain an optimal quadratic-time algorithm for the problem of unimodal k-segmentation. In addition, we describe a more efficient greedy-merging heuristic that is experimentally shown to give solutions very close to the optimal. As a concrete application of our algorithms, we describe methods for testing if a sequence behaves unimodally or not. The methods include segmentation error comparisons, permutation testing, and a BIC-based scoring scheme. Our experimental evaluation shows that our algorithms and the proposed unimodality tests give very intuitive results, for both real-valued and binary data.
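The unimodal-regression building block can be illustrated with two isotonic fits: an increasing fit up to a candidate peak and a decreasing fit after it, keeping the split with the least squared error. This sketch is a classic construction used here only to convey the idea; it is not the paper's optimized algorithm.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def unimodal_fit(y):
    """Best unimodal (increase-then-decrease) fit to y by trying every peak."""
    y = np.asarray(y, dtype=float)
    n, best_err, best_fit = len(y), np.inf, None
    for peak in range(1, n + 1):
        up = IsotonicRegression(increasing=True).fit_transform(np.arange(peak), y[:peak])
        down = (IsotonicRegression(increasing=False).fit_transform(np.arange(n - peak), y[peak:])
                if peak < n else np.array([]))
        fit = np.concatenate([up, down])
        err = float(np.sum((fit - y) ** 2))
        if err < best_err:
            best_err, best_fit = err, fit
    return best_err, best_fit

err, fit = unimodal_fit([1, 3, 2, 5, 4, 2, 1])
print(round(err, 2), fit)
```

Comparing this constrained fitting error against an unconstrained segmentation error is the essence of the unimodality tests mentioned above.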

Journal ArticleDOI
TL;DR: The paper proposes a technological ontology-driven framework for configuration support of networked organizations, referred to as KSNet, which integrates concepts of business intelligence and Web intelligence into a collaboration environment of a networked organization in pursuit of knowledge logistics goals.
Abstract: Nowadays, organizations must continually adapt to market and organizational changes to achieve their most important goals. Migration to business services and service-oriented architectures provides a valuable opportunity to attain the organization's objectives. This migration causes evolution both in organizational structure and in technology, enabling businesses to dynamically change vendors and services. One such form of organizational structure is the networked organization. Technologies of business intelligence and Web intelligence effectively support business processes within networked organizations. While business intelligence focuses on the development of services for consumer needs recognition, information search, and evaluation of alternatives, Web intelligence addresses the advancement of Web-empowered systems, services, and environments. The paper proposes a technological ontology-driven framework for configuration support as applied to networked organizations. The framework integrates concepts of business intelligence and Web intelligence into a collaboration environment of a networked organization in pursuit of knowledge logistics goals. This framework, referred to as KSNet, is based on the integration of software agent technology and Web services. The knowledge logistics functions of KSNet are complemented by the technological functions of knowledge-gathering agents. The services of these agents are implemented with CAPNET, a FIPA-compliant agent platform. CAPNET allows consuming agent services in a service-oriented way. The applicability of the approach is illustrated through a "Binni scenario"-based case study of a portable field hospital configuration.

Journal ArticleDOI
TL;DR: The results show that the S-Club scheme significantly improves search performance and outperforms existing approaches.
Abstract: Information services play a key role in grid systems, handling the resource discovery and management process. Existing information service architectures suffer from poor scalability, long search response times, and large traffic overhead. In this paper, we propose a service club mechanism, called S-Club, for efficient service discovery. In S-Club, an overlay is built on the existing Grid Information Service (GIS) mesh network of CROWN, so that GISs are organized as service clubs. Each club serves a certain type of service, while each GIS may join one or more clubs. S-Club is adopted in our CROWN Grid, and the performance of S-Club is evaluated by comprehensive simulations. The results show that the S-Club scheme significantly improves search performance and outperforms existing approaches.

Journal ArticleDOI
TL;DR: His group developed the CAPNET agent platform and has been involved in several projects for the energy industry, ranging from petroleum exploration and production to knowledge management, with special focus on industrial exploitation of agent technology.
Abstract: Matias Alvarado is currently a Research Scientist at the Centre of Research and Advanced Studies (CINVESTAV-IPN, Mexico). He got a Ph.D. degree in computer science from the Technical University of Catalonia, with a major in artificial intelligence. He received the B.Sc. degree in mathematics from the National Autonomous University of Mexico. His interests in research and technological applications include knowledge management and decision making; autonomous agents and multiagent systems for supply chain disruption management; and concurrency control, pattern recognition, and computational logic. He is the author of about 50 scientific papers, a journal special issues guest editor on topics of artificial intelligence and knowledge management for the oil industry, and an academic invited to the National University of Singapore, Technical University of Catalonia, University of Oxford, University of Utrecht, and Benemerita Universidad Autonoma de Puebla. Leonid Sheremetov received the Ph.D. degree in computer science in 1990 from the St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, where he has worked as a Research Fellow and a Senior Research Fellow since 1982. Now he is a Principal Investigator of the Research Program on Applied Mathematics and Computing of the Mexican Petroleum Institute, where he leads the Distributed Intelligent Systems Group, and a part-time professor at the Artificial Intelligence Laboratory of the Centre for Computing Research of the National Polytechnic Institute (CIC-IPN), Mexico. His current research interests include multiagent systems, the semantic Web, decision support systems, and enterprise information integration. His group developed the CAPNET agent platform and has been involved in several projects for the energy industry, ranging from petroleum exploration and production to knowledge management, with special focus on industrial exploitation of agent technology. He is also a member of the editorial boards of several journals. Rene Banares-Alcantara has worked at the University of Oxford since October 2003 and is now a Reader in engineering science at the Department of Engineering Science and a Fellow in engineering at New College. He previously held a readership at the University of Edinburgh and lectureships in Spain and at the Universidad Nacional Autonoma de Mexico (UNAM). He obtained his undergraduate degree from UNAM and the M.S. and Ph.D. degrees from Carnegie Mellon University (CMU). Starting with his work at CMU, his research interests have been in the area of process systems engineering, in particular chemical process design and synthesis. He has developed a strong relationship with computer science/artificial intelligence research groups in different universities and research institutes, with current research also linking to social and biological modeling. He has (co)authored more than 100 refereed publications and has been a Principal Investigator and a Researcher in several EPSRC and European Union projects. Francisco Cantu-Ortiz obtained the Ph.D. degree in artificial intelligence from the University of Edinburgh, United Kingdom, and the Bachelor's degree in computer systems engineering from the Tecnologico de Monterrey (ITESM), Mexico. He is a Full Professor of artificial intelligence at Tecnologico de Monterrey and is also the Dean of Research and Graduate Studies. He has been the Head of the Center for Artificial Intelligence and of the Informatics Research Center. Dr. Cantu-Ortiz has been the General Chair of about 15 international conferences in artificial intelligence and expert systems and was a Local Chair of the International Joint Conference on Artificial Intelligence in 2003. His research interests include knowledge-based systems and inference, machine learning, and data mining using Bayesian and statistical techniques for business intelligence, technology management, and entrepreneurial science. More recently, his interests have extended to epistemology and philosophy of science. He was the President of the Mexican Society for Artificial Intelligence and is a member of the IEEE Computer Society and the ACM.