Showing papers in &quot;Knowledge and Information Systems in 2007&quot;

Automatic outlier detection for time series: an application to sensor data

TL;DR: This study quantifies the sensitivity of feature selection algorithms to variations in the training set by assessing the stability of the feature preferences that they express in the form of weights-scores, ranks, or a selected feature subset.

...read moreread less

Abstract: With the proliferation of extremely high-dimensional data, feature selection algorithms have become indispensable components of the learning process Strangely, despite extensive work on the stability of learning algorithms, the stability of feature selection algorithms has been relatively neglected This study is an attempt to fill that gap by quantifying the sensitivity of feature selection algorithms to variations in the training set We assess the stability of feature selection algorithms based on the stability of the feature preferences that they express in the form of weights-scores, ranks, or a selected feature subset We examine a number of measures to quantify the stability of feature preferences and propose an empirical way to estimate them We perform a series of experiments with several feature selection algorithms on a set of proteomics datasets The experiments allow us to explore the merits of each stability measure and create stability profiles of the feature selection algorithms Finally, we show how stability profiles can support the choice of a feature selection algorithm

...read moreread less

536 citations

Journal Article•DOI•

[...]

Sabyasachi Basu, Martin Meckesheimer

CanTree: a canonical-order tree for incremental frequent-pattern mining

TL;DR: Two variations of a method that uses the median from a neighborhood of a data point and a threshold value to compare the difference between the median and the observed data value are proposed.

...read moreread less

Abstract: In this article we consider the problem of detecting unusual values or outliers from time series data where the process by which the data are created is difficult to model. The main consideration is the fact that data closer in time are more correlated to each other than those farther apart. We propose two variations of a method that uses the median from a neighborhood of a data point and a threshold value to compare the difference between the median and the observed data value. Both variations of the method are fast and can be used for data streams that occur in quick succession such as sensor data on an airplane.

...read moreread less

219 citations

Journal Article•DOI•

[...]

Carson K. Leung¹, Quamrul I. Khan¹, Zhan Li¹, Tariqul Hoque¹•Institutions (1)

University of Manitoba¹

03 Apr 2007-Knowledge and Information Systems

TL;DR: A novel tree structure, called CanTree (canonical-order tree), that captures the content of the transaction database and orders tree nodes according to some canonical order, which can be easily maintained when database transactions are inserted, deleted, and/or modified.

...read moreread less

Abstract: Since its introduction, frequent-pattern mining has been the subject of numerous studies, including incremental updating. Many existing incremental mining algorithms are Apriori-based, which are not easily adoptable to FP-tree-based frequent-pattern mining. In this paper, we propose a novel tree structure, called CanTree (canonical-order tree), that captures the content of the transaction database and orders tree nodes according to some canonical order. By exploiting its nice properties, the CanTree can be easily maintained when database transactions are inserted, deleted, and/or modified. For example, the CanTree does not require adjustment, merging, and/or splitting of tree nodes during maintenance. No rescan of the entire updated database or reconstruction of a new tree is needed for incremental updating. Experimental results show the effectiveness of our CanTree in the incremental mining of frequent patterns. Moreover, the applicability of CanTrees is not confined to incremental mining; CanTrees can also be applicable to other frequent-pattern mining tasks including constrained mining and interactive mining.

...read moreread less

149 citations

Journal Article•DOI•

Solving multi-instance problems with classifier ensemble based on constructive clustering

[...]

Zhi-Hua Zhou¹, Min-Ling Zhang¹•Institutions (1)

Nanjing University¹

Handicapping attacker's confidence: an alternative to k -anonymization

TL;DR: This paper proposes a new solution which goes at an opposite way, that is, adapting the multi-instance representation to single-instance learning algorithms, and shows that the proposed method works well on standard as well as generalized multi- instance problems.

...read moreread less

Abstract: In multi-instance learning, the training set is composed of labeled bags each consists of many unlabeled instances, that is, an object is represented by a set of feature vectors instead of only one feature vector. Most current multi-instance learning algorithms work through adapting single-instance learning algorithms to the multi-instance representation, while this paper proposes a new solution which goes at an opposite way, that is, adapting the multi-instance representation to single-instance learning algorithms. In detail, the instances of all the bags are collected together and clustered into d groups first. Each bag is then re-represented by d binary features, where the value of the ith feature is set to one if the concerned bag has instances falling into the ith group and zero otherwise. Thus, each bag is represented by one feature vector so that single-instance classifiers can be used to distinguish different classes of bags. Through repeating the above process with different values of d, many classifiers can be generated and then they can be combined into an ensemble for prediction. Experiments show that the proposed method works well on standard as well as generalized multi-instance problems.

...read moreread less

139 citations

Journal Article•DOI•

[...]

Ke Wang¹, Benjamin C. M. Fung¹, Philip S. Yu²•Institutions (2)

Simon Fraser University¹, IBM²

03 Apr 2007-Knowledge and Information Systems

TL;DR: A data transformation is proposed that minimally suppresses the domain values in the data to satisfy the set of privacy templates and the transformed data is free of sensitive inferences even in the presence of data mining algorithms.

...read moreread less

Abstract: We present an approach of limiting the confidence of inferring sensitive properties to protect against the threats caused by data mining abilities. The problem has dual goals: preserve the information for a wanted data analysis request and limit the usefulness of unwanted sensitive inferences that may be derived from the release of data. Sensitive inferences are specified by a set of “privacy templates". Each template specifies the sensitive property to be protected, the attributes identifying a group of individuals, and a maximum threshold for the confidence of inferring the sensitive property given the identifying attributes. We show that suppressing the domain values monotonically decreases the maximum confidence of such sensitive inferences. Hence, we propose a data transformation that minimally suppresses the domain values in the data to satisfy the set of privacy templates. The transformed data is free of sensitive inferences even in the presence of data mining algorithms. The prior k-anonymization k has been italicized consistently throughout this article. focuses on personal identities. This work focuses on the association between personal identities and sensitive properties.

...read moreread less

138 citations

Journal Article•DOI•

Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing

[...]

Hisashi Koga¹, Tetsuo Ishibashi¹, Toshinori Watanabe¹•Institutions (1)

University of Electro-Communications¹

The pairwise attribute noise detection algorithm

TL;DR: This paper proposes a fast approximation algorithm for the single linkage method that reduces the time complexity to O(nB) by rapidly finding the near clusters to be connected by Locality-Sensitive Hashing, a fast algorithms for the approximate nearest neighbor search.

...read moreread less

Abstract: The single linkage method is a fundamental agglomerative hierarchical clustering algorithm. This algorithm regards each point as a single cluster initially. In the agglomeration step, it connects a pair of clusters such that the distance between the nearest members is the shortest. This step is repeated until only one cluster remains. The single linkage method can efficiently detect clusters in arbitrary shapes. However, a drawback of this method is a large time complexity of O(n 2), where n represents the number of data points. This time complexity makes this method infeasible for large data. This paper proposes a fast approximation algorithm for the single linkage method. Our algorithm reduces the time complexity to O(nB) by rapidly finding the near clusters to be connected by Locality-Sensitive Hashing, a fast algorithm for the approximate nearest neighbor search. Here, B represents the maximum number of points going into a single hash entry and it practically diminishes to a small constant as compared to n for sufficiently large hash tables. Experimentally, we show that (1) the proposed algorithm obtains clustering results similar to those obtained by the single linkage method and (2) it runs faster for large data than the single linkage method.

...read moreread less

106 citations

Journal Article•DOI•

[...]

Jason Van Hulse¹, Taghi M. Khoshgoftaar¹, Haiying Huang¹•Institutions (1)

Florida Atlantic University¹

Non-redundant data clustering

TL;DR: This work presents a novel approach for detecting instances with attribute noise and demonstrates its usefulness with case studies using two different real-world software measurement data sets, showing that PANDA provides better noise detection performance than the DM algorithm.

...read moreread less

Abstract: Analyzing the quality of data prior to constructing data mining models is emerging as an important issue. Algorithms for identifying noise in a given data set can provide a good measure of data quality. Considerable attention has been devoted to detecting class noise or labeling errors. In contrast, limited research work has been devoted to detecting instances with attribute noise, in part due to the difficulty of the problem. We present a novel approach for detecting instances with attribute noise and demonstrate its usefulness with case studies using two different real-world software measurement data sets. Our approach, called Pairwise Attribute Noise Detection Algorithm (PANDA), is compared with a nearest neighbor, distance-based outlier detection technique (denoted DM) investigated in related literature. Since what constitutes noise is domain specific, our case studies uses a software engineering expert to inspect the instances identified by the two approaches to determine whether they actually contain noise. It is shown that PANDA provides better noise detection performance than the DM algorithm.

...read moreread less

79 citations

Journal Article•DOI•

[...]

David Gondek¹, Thomas Hofmann¹•Institutions (1)

Brown University¹

Intelligent systems in the automotive industry: applications and trends

TL;DR: An extension of the information bottleneck framework, called coordinated conditional information bottleneck, is presented, which takes negative relevance information into account by maximizing a conditional mutual information score subject to constraints.

...read moreread less

Abstract: Data clustering is a popular approach for automatically finding classes, concepts, or groups of patterns. In practice, this discovery process should avoid redundancies with existing knowledge about class structures or groupings, and reveal novel, previously unknown aspects of the data. In order to deal with this problem, we present an extension of the information bottleneck framework, called coordinated conditional information bottleneck, which takes negative relevance information into account by maximizing a conditional mutual information score subject to constraints. Algorithmically, one can apply an alternating optimization scheme that can be used in conjunction with different types of numeric and non-numeric attributes. We discuss extensions of the technique to the tasks of semi-supervised classification and enumeration of successive non-redundant clusterings. We present experimental results for applications in text mining and computer vision.

...read moreread less

79 citations

Journal Article•DOI•

[...]

Oleg Yurievitch Gusikhin¹, Nestor Rychtyckyj¹, Dimitar Petrov Filev¹•Institutions (1)

Ford Motor Company¹

A quantitative analysis of product categorization standards: content, coverage, and maintenance of eCl@ss, UNSPSC, eOTD, and the RosettaNet Technical Dictionary

TL;DR: This paper provides an overview and a sampling of many of the ways that the automotive industry has utilized AI, soft computing and other intelligent system technologies in such diverse domains like manufacturing, diagnostics, on-board systems, warranty analysis and design.

...read moreread less

Abstract: There is a common misconception that the automobile industry is slow to adapt new technologies, such as artificial intelligence (AI) and soft computing. The reality is that many new technologies are deployed and brought to the public through the vehicles that they drive. This paper provides an overview and a sampling of many of the ways that the automotive industry has utilized AI, soft computing and other intelligent system technologies in such diverse domains like manufacturing, diagnostics, on-board systems, warranty analysis and design.

...read moreread less

71 citations

Journal Article•DOI•

[...]

Martin Hepp¹, Joerg Leukel², Volker Schmitz³•Institutions (3)

Digital Enterprise Research Institute¹, University of Hohenheim², University of Duisburg-Essen³

01 Sep 2007-Knowledge and Information Systems

TL;DR: This paper presents metrics for assessing the content quality and maturity of categorization standards and applies these metrics to eCl@ss, UNSPSC, eOTD, and RNTD, showing that the amount of content is very unevenly spread over top-level categories and that more expressive structural features exist only for parts of these standards.

...read moreread less

Abstract: Many e-business scenarios require the integration of product-related data into target applications or target documents at the recipient’s side. Such tasks can be automated much better if the textual descriptions are augmented by a machine-feasible representation of the product semantics. For this purpose, categorization standards for products and services, like UNSPSC, eCl@ss, the ECCMA Open Technical Dictionary (eOTD), or the RosettaNet Technical Dictionary (RNTD) are available, but they vary in terms of structural properties and content. In this paper, we present metrics for assessing the content quality and maturity of such standards and apply these metrics to eCl@ss, UNSPSC, eOTD, and RNTD. Our analysis shows that (1) the amount of content is very unevenly spread over top-level categories, which contradicts the promise of a broad scope implicitly made by the existence of a large number of top-level categories, and that (2) more expressive structural features exist only for parts of these standards. Additionally, we (3) measure the amount of maintenance in the various top-level categories, which helps identify the actively maintained subject areas as compared to those which ones are rather dead branches. Finally, we show how our approach can be used (4) by enterprises for selecting an appropriate standard, and (5) by standards bodies for monitoring the maintenance of a standard as a whole.

...read moreread less

Journal Article•DOI•

Interactive visual exploration of association rules with rule-focusing methodology

[...]

Julien Blanchard¹, Fabrice Guillet¹, Henri Briand¹•Institutions (1)

École polytechnique de l'université de Nantes¹

01 Sep 2007-Knowledge and Information Systems

TL;DR: The rule-focusing methodology is proposed, an interactive methodology for the visual post-processing of association rules that exploits the user's focus to guide the generation of the rules by means of a specific constraint-based rule-mining algorithm.

...read moreread less

Abstract: On account of the enormous amounts of rules that can be produced by data mining algorithms, knowledge post-processing is a difficult stage in an association rule discovery process. In order to find relevant knowledge for decision making, the user (a decision maker specialized in the data studied) needs to rummage through the rules. To assist him/her in this task, we here propose the rule-focusing methodology, an interactive methodology for the visual post-processing of association rules. It allows the user to explore large sets of rules freely by focusing his/her attention on limited subsets. This new approach relies on rule interestingness measures, on a visual representation, and on interactive navigation among the rules. We have implemented the rule-focusing methodology in a prototype system called ARVis. It exploits the user's focus to guide the generation of the rules by means of a specific constraint-based rule-mining algorithm.

...read moreread less

Journal Article•DOI•

A multimodal virtual reality interface for 3D interaction with VTK

[...]

Arjan J. F. Kok¹, Robert van Liere•Institutions (1)

Open University in the Netherlands¹

09 Oct 2007-Knowledge and Information Systems

TL;DR: VR-VTK: a multimodal interface to VTK on a virtual environment to address several problems specific for spatial 3D interaction and a number of additional features, such as more complex interaction methods and enhanced depth perception, are discussed.

...read moreread less

Abstract: The object-oriented visualization Toolkit (VTK) is widely used for scientific visualization. VTK is a visualization library that provides a large number of functions for presenting three-dimensional data. Interaction with the visualized data is controlled with two-dimensional input devices, such as mouse and keyboard. Support for real three-dimensional and multimodal input is non-existent. This paper describes VR-VTK: a multimodal interface to VTK on a virtual environment. Six degree of freedom input devices are used for spatial 3D interaction. They control the 3D widgets that are used to interact with the visualized data. Head tracking is used for camera control. Pedals are used for clutching. Speech input is used for application commands and system control. To address several problems specific for spatial 3D interaction, a number of additional features, such as more complex interaction methods and enhanced depth perception, are discussed. Furthermore, the need for multimodal input to support interaction with the visualization is shown. Two existing VTK applications are ported using VR-VTK to run in a desktop virtual reality system. Informal user experiences are presented.

...read moreread less

Journal Article•DOI•

Generalizing the notion of confidence

[...]

Michael Steinbach¹, Vipin Kumar¹•Institutions (1)

University of Minnesota¹

03 Aug 2007-Knowledge and Information Systems

TL;DR: This paper describes an approach to defining confidence for ETIs that preserves the interpretation of confidence as an estimate of a conditional probability, and shows how association rules based on ETIs can have better coverage thanrules based on traditional itemsets.

...read moreread less

Abstract: In this paper, we explore extending association analysis to non-traditional types of patterns and non-binary data by generalizing the notion of confidence. We begin by describing a general framework that measures the strength of the connection between two association patterns by the extent to which the strength of one association pattern provides information about the strength of another. Although this framework can serve as the basis for designing or analyzing measures of association, the focus in this paper is to use the framework as the basis for extending the traditional concept of confidence to error-tolerant itemsets (ETIs) and continuous data. To that end, we provide two examples. First, we (1) describe an approach to defining confidence for ETIs that preserves the interpretation of confidence as an estimate of a conditional probability, and (2) show how association rules based on ETIs can have better coverage (at an equivalent confidence level) than rules based on traditional itemsets. Next, we derive a confidence measure for continuous data that agrees with the standard confidence measure when applied to binary transaction data. Further analysis of this result exposes some of the important issues involved in constructing a confidence measure for continuous data.

...read moreread less

Journal Article•DOI•

Knowledge and information distribution leveraged by intelligent agents

[...]

Ramóon F. Brena¹, José Luis Aguirre¹, Carlos Iván Chesñevar², Eduardo H. Ramírez¹, Leonardo Garrido¹ - Show less +1 more•Institutions (2)

Monterrey Institute of Technology and Higher Education¹, Universidad Nacional del Sur²

Visual transformation for interactive spatiotemporal data mining

TL;DR: This paper presents the JITik approach to model knowledge and information distribution, giving a high-level account of the research made around this project, emphasizing two particular aspects: a sophisticated argument-based mechanism for deciding among conflicting distribution policies, and the embedding of JITIK agents in enterprises using the service-oriented architecture paradigm.

...read moreread less

Abstract: Knowledge and Information distribution is indeed one of the main processes in Knowledge Management. Today, most Information Technology tools for supporting this distribution are based on repositories accessed through Web-based systems. This approach has, however, many practical limitations, mainly due to the strain they put on the user, who is responsible of accessing the right Knowledge and Information at the right moments. As a solution for this problem, we have proposed an alternative approach which is based on the notion of delegation of distribution tasks to synthetic agents, which become responsible of taking care of the organization's as well as the individuals' interests. In this way, many Knowledge and Information distribution tasks can be performed on the background, and the agents can recognize relevant events as triggers for distributing the right information to the right users at the right time. In this paper, we present the JITIK approach to model knowledge and information distribution, giving a high-level account of the research made around this project, emphasizing two particular aspects: a sophisticated argument-based mechanism for deciding among conflicting distribution policies, and the embedding of JITIK agents in enterprises using the service-oriented architecture paradigm. It must be remarked that a JITIK-based application is currently being implemented for one of the leading industries in Mexico.

...read moreread less

Journal Article•DOI•

[...]

Yang Cai¹, Richard P. Stumpf², Timothy T. Wynne², Michelle C. Tomlinson², Daniel Sai Ho Chung¹, Xavier Boutonnier¹, Matthias Ihmig¹, Rafael de M. Franco¹, Nathaniel Bauernfeind¹ - Show less +5 more•Institutions (2)

Carnegie Mellon University¹, Silver Spring Networks²

09 Oct 2007-Knowledge and Information Systems

TL;DR: The purpose of this study is to bridge the gap with transformation algorithms for mapping the data from an abstract space to an intuitive one, which include shape correlation, periodicity, multiphysics, and spatial Bayesian.

...read moreread less

Abstract: Analytical models intend to reveal inner structure, dynamics, or relationship of things. However, they are not necessarily intuitive to humans. Conventional scientific visualization methods are intuitive, but limited by depth, dimension, and resolution. The purpose of this study is to bridge the gap with transformation algorithms for mapping the data from an abstract space to an intuitive one, which include shape correlation, periodicity, multiphysics, and spatial Bayesian. We tested this approach with the oceanographic case study. We found that the interactive visualization increases robustness in object tracking and positive detection accuracy in object prediction. We also found that the interactive method enables the user to process the image data at less than 1 min per image versus 30 min per image manually. As a result, our test system can handle at least 10 times more data sets than traditional manual analyses. The results also suggest that minimal human interactions with appropriate computational transformations or cues may significantly increase the overall productivity.

...read moreread less

Journal Article•DOI•

Making clustering in delay-vector space meaningful

[...]

Jason Robert Chen¹•Institutions (1)

Australian National University¹

03 Apr 2007-Knowledge and Information Systems

TL;DR: It is shown that sequential time series clustering is not meaningless, and that the problem highlighted in these works stem from their use of the Euclidean distance metric as the distance measure in the delay-vector space.

...read moreread less

Abstract: Sequential time series clustering is a technique used to extract important features from time series data. The method can be shown to be the process of clustering in the delay-vector space formalism used in the Dynamical Systems literature. Recently, the startling claim was made that sequential time series clustering is meaningless. This has important consequences for a significant amount of work in the literature, since such a claim invalidates these work’s contribution. In this paper, we show that sequential time series clustering is not meaningless, and that the problem highlighted in these works stem from their use of the Euclidean distance metric as the distance measure in the delay-vector space. As a solution, we consider quite a general class of time series, and propose a regime based on two types of similarity that can exist between delay vectors, giving rise naturally to an alternative distance measure to Euclidean distance in the delay-vector space. We show that, using this alternative distance measure, sequential time series clustering can indeed be meaningful. We repeat a key experiment in the work on which the “meaningless” claim was based, and show that our method leads to a successful clustering outcome.

...read moreread less

Journal Article•DOI•

Data quality awareness: a case study for cost optimal association rule mining

[...]

Laure Berti-Equille¹•Institutions (1)

University of Rennes¹

Decision-making on pipe stress analysis enabled by knowledge-based systems

TL;DR: Experiments on the challenging KDD-Cup-98 datasets show that variations on data quality have a great impact on the cost and quality of discovered association rules and confirm the approach for the integrated management of data quality indicators into the KDD process that ensure the quality of data mining results.

...read moreread less

Abstract: The quality of discovered association rules is commonly evaluated by interestingness measures (commonly support and confidence) with the purpose of supplying indicators to the user in the understanding and use of the new discovered knowledge. Low-quality datasets have a very bad impact over the quality of the discovered association rules, and one might legitimately wonder if a so-called “interesting” rule noted LHS→ RHS is meaningful when 30% of the LHS data are not up-to-date anymore, 20% of the RHS data are not accurate, and 15% of the LHS data come from a data source that is well-known for its bad credibility. This paper presents an overview of data quality characterization and management techniques that can be advantageously employed for improving the quality awareness of the knowledge discovery and data mining processes. We propose to integrate data quality indicators for quality aware association rule mining. We propose a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations on data quality have a great impact on the cost and quality of discovered association rules and confirm our approach for the integrated management of data quality indicators into the KDD process that ensure the quality of data mining results.

...read moreread less

Journal Article•DOI•

[...]

Matías Alvarado¹, Miguel A. Rodriguez-Toral², Armando Rosas², Sergio Ayala²•Institutions (2)

Instituto Politécnico Nacional¹, Mexican Institute of Petroleum²

On the role of individual human abilities in the design of adaptive user interfaces for scientific problem solving environments

TL;DR: This paper presents engineering decision-making on pipe stress analysis through the application of knowledge-based systems (KBS), and establishes a bidirectional communication with the current engineering software for pipe stressAnalysis so that the user benefits from this integration.

...read moreread less

Abstract: This paper presents engineering decision-making on pipe stress analysis through the application of knowledge-based systems (KBS). Stress analysis, as part of the design and analysis of process pipe networks, serves to identify whether a given pipe arrangement can cope with weight, thermal, and pressure stress at safe operation levels. An iterative process of design and analysis cycle is done routinely by engineers while analyzing the existing networks or while designing the process pipe networks. In our proposal, the KBS establishes a bidirectional communication with the current engineering software for pipe stress analysis, so that the user benefits from this integration. The stress analysis knowledge base is constructed by registering the senior engineers’ know-how. The engineers’ overall strategy to follow up during the pipe stress analysis, to some extent contained by the KBS, is presented. Advantages in saving engineering man-hours and usefulness in guiding experts in pipe stress analysis are the major services for the process industry.

...read moreread less

Journal Article•DOI•

[...]

Elena Zudilova-Seinstra¹•Institutions (1)

University of Amsterdam¹

09 Oct 2007-Knowledge and Information Systems

TL;DR: The experimental study suggests criteria for the inclusion of human factors into the user model guiding and controlling the adaptation process, and the Intelligent System for User Modelling has been developed.

...read moreread less

Abstract: A scientific problem solving environment should be built in such a way that users (scientists) might exploit underlying technologies without a specialised knowledge about available tools and resources. An adaptive user interface can be considered as an opportunity in addressing this challenge. This paper explores the importance of individual human abilities in the design of adaptive user interfaces for scientific problem solving environments. In total, seven human factors (gender, learning abilities, locus of control, attention focus, cognitive strategy and verbal and nonverbal IQs) have been evaluated regarding their impact on interface adjustments done manually by users. People’s preferences for different interface configurations have been investigated. The experimental study suggests criteria for the inclusion of human factors into the user model guiding and controlling the adaptation process. To provide automatic means of adaptation, the Intelligent System for User Modelling has been developed.

...read moreread less

Journal Article•DOI•

Editorial: Special issue on mining low-quality data

[...]

Xingquan Zhu¹, Taghi M. Khoshgoftaar², Ian Davidson³, Shichao Zhang⁴•Institutions (4)

Chinese Academy of Sciences¹, Florida Atlantic University², State University of New York System³, Guangxi Normal University⁴

Artificial intelligence methodologies for agile refining: an overview

TL;DR: As data mining is increasingly recognized as a key technology to analyzing and understanding the data, the need for knowledge discovery from real-world low-quality data becomes not just overwhelming, but also compelling.

...read moreread less

Abstract: Data mining is dedicated to searching for novel and actionable patterns and relationships that exist in a large volume of data. The mining process typically involves four major steps: (1) data collection, for example, transferring data collected from the production systems into data warehouses; (2) data preprocessing, transforming/cleansing the data to remove errors, filling missing values, and checking for inconsistency or duplicates; (3) finding patterns and models from preprocessed data; and (4) developing and monitoring the knowledge model [5, 7]. In data-driven application domains, many potential reasons, such as unreliable data acquisition sources, faulty sensors, data collection errors, and the lack of data representation standards, will make data vulnerable to errors and therefore lead to poor quality data. Although these factors and constraints are widely accepted by general data mining practitioners, most applications have traditionally ignored the need for developing appropriate approaches for representing and reasoning with such data imperfections. As data mining is increasingly recognized as a key technology to analyzing and understanding the data, the need for knowledge discovery from real-world low-quality data becomes not just overwhelming, but also compelling. As a result, issues related to data quality have become more and more

...read moreread less

Journal Article•DOI•

[...]

Rajagopalan Srinivasan_aff n¹•Institutions (1)

National University of Singapore¹

Mining gene–sample–time microarray data: a coherent gene cluster discovery approach

TL;DR: Some challenges faced by refineries that seek to be lean, nimble, and proactive are outlined, and methodologies drawn from artificial intelligence – software agents, pattern recognition, expert systems – have a role to play.

...read moreread less

Abstract: Agile manufacturing is the capability to prosper in a competitive environment of continuous and unpredictable changes by reacting quickly and effectively to the changing markets and other exogenous factors Agility of petroleum refineries is determined by two factors – ability to control the process and ability to efficiently manage the supply chain In this paper, we outline some challenges faced by refineries that seek to be lean, nimble, and proactive These problems, which arise in supply chain management and operations management are seldom amenable to traditional, monolithic solutions As discussed here using several examples, methodologies drawn from artificial intelligence – software agents, pattern recognition, expert systems – have a role to play in this path toward agility

...read moreread less

Journal Article•DOI•

[...]

Daxin Jiang¹, Jian Pei², Murali Ramanathan³, Chuan Lin³, Chun Tang³, Aidong Zhang³ - Show less +2 more•Institutions (3)

Nanyang Technological University¹, Simon Fraser University², University at Buffalo³

14 Nov 2007-Knowledge and Information Systems

TL;DR: This paper proposes two efficient algorithms, namely the Sample-Gene Search and the Gene–Sample Search, to mine the complete set of coherent gene clusters from microarray data sets that records the expression levels of various genes under a set of samples during a series of time points.

...read moreread less

Abstract: Extensive studies have shown that mining microarray data sets is important in bioinformatics research and biomedical applications. In this paper, we explore a novel type of gene–sample–time microarray data sets that records the expression levels of various genes under a set of samples during a series of time points. In particular, we propose the mining of coherent gene clusters from such data sets. Each cluster contains a subset of genes and a subset of samples such that the genes are coherent on the samples along the time series. The coherent gene clusters may identify the samples corresponding to some phenotypes (e.g., diseases), and suggest the candidate genes correlated to the phenotypes. We present two efficient algorithms, namely the Sample-Gene Search and the Gene–Sample Search, to mine the complete set of coherent gene clusters. We empirically evaluate the performance of our approaches on both a real microarray data set and synthetic data sets. The test results have shown that our approaches are both efficient and effective to find meaningful coherent gene clusters.

...read moreread less

Journal Article•DOI•

A hierarchical approach for the redesign of chemical processes

[...]

Ivan Lopez-Arevalo¹, René Bañares-Alcántara², Arantza Aldea³, A. Rodríguez-Martínez⁴•Institutions (4)

CINVESTAV¹, University of Oxford², Oxford Brookes University³, Universidad Autónoma del Estado de Morelos⁴

Answering ad hoc aggregate queries from data streams using prefix aggregate trees

TL;DR: An approach to improve the management of complexity during the redesign of technical processes is proposed, which is an extension of the Multimodeling and Multilevel Flow Modeling methodologies and used to represent a process hierarchically, thus improving the identification of analogous equipment/sections from different processes.

...read moreread less

Abstract: An approach to improve the management of complexity during the redesign of technical processes is proposed. The approach consists of two abstract steps. In the first step, model-based reasoning is used to generate automatically alternative representations of an existing process at several levels of abstraction. In the second step, process alternatives are generated through the application of case-based reasoning. The key point of our framework is the modeling approach, which is an extension of the Multimodeling and Multilevel Flow Modeling methodologies. These, together with a systematic design methodology, are used to represent a process hierarchically, thus improving the identification of analogous equipment/sections from different processes. The hierarchical representation results in sets of equipment/sections organized according to their functions and intentions. A case-based reasoning system then retrieves from a library of cases similar equipment/sections to the one selected by the user. The final output is a set of equipment/sections ordered according to their similarity. Human intervention is necessary to adapt the most promising case within the original process.

...read moreread less

Journal Article•DOI•

[...]

Moonjung Cho¹, Jian Pei², Ke Wang²•Institutions (2)

University at Buffalo¹, Simon Fraser University²

03 Aug 2007-Knowledge and Information Systems

TL;DR: A novel prefix aggregate tree (PAT) structure for online warehousing data streams and answering ad hoc aggregate queries is developed, which costs more than the case of a fully materialized data cube, but the query answering time is still kept linear in the size of the transient segment.

...read moreread less

Abstract: In some business applications such as trading management in financial institutions, it is required to accurately answer ad hoc aggregate queries over data streams. Materializing and incrementally maintaining a full data cube or even its compression or approximation over a data stream is often computationally prohibitive. On the other hand, although previous studies proposed approximate methods for continuous aggregate queries, they cannot provide accurate answers. In this paper, we develop a novel prefix aggregate tree (PAT) structure for online warehousing data streams and answering ad hoc aggregate queries. Often, a data stream can be partitioned into the historical segment, which is stored in a traditional data warehouse, and the transient segment, which can be stored in a PAT to answer ad hoc aggregate queries. The size of a PAT is linear in the size of the transient segment, and only one scan of the data stream is needed to create and incrementally maintain a PAT. Although the query answering using PAT costs more than the case of a fully materialized data cube, the query answering time is still kept linear in the size of the transient segment. Our extensive experimental results on both synthetic and real data sets illustrate the efficiency and the scalability of our design.

...read moreread less

Journal Article•DOI•

Modeling semantics in composite Web service requests by utility elicitation

[...]

Qianhui Althea Liang¹, Jen-Yao Chung², Steven M. Miller¹•Institutions (2)

Singapore Management University¹, IBM²

14 Nov 2007-Knowledge and Information Systems

TL;DR: This paper discusses the idea of semantic service requests for composite services, and presents a multi-attribute utility theory (MAUT) based model of composite service requests that enables unambiguous understanding of the service needs and more precise generation of the desired compositions.

...read moreread less

Abstract: When meeting the challenges in automatic and semi-automatic Web service composition, capturing the user’s service demand and preferences is as important as knowing what the services can do. This paper discusses the idea of semantic service requests for composite services, and presents a multi-attribute utility theory (MAUT) based model of composite service requests. Service requests are modeled as user preferences and constraints. Two preference structures, additive independence and generalized additive independence, are utilized in calculating the expected utilities of service composition outcomes. The model is also based on an iterative and incremental scheme meant to better capture requirements in accordance with service consumers’ needs. OWL-S markup vocabularies and associated inference mechanism are used as a means to bring semantics to service requests. Ontology conceptualizations and language constructs are added to OWL-S as uniform representations of possible aspects of the requests. This model of semantics in service requests enables unambiguous understanding of the service needs and more precise generation of the desired compositions. An application scenario is presented to illustrate how the proposed model can be applied in the real business world.

...read moreread less

Journal Article•DOI•

Algorithms for unimodal segmentation with applications to unimodality detection

[...]

Niina Haiminen¹, Aristides Gionis¹, Kari Laasonen¹•Institutions (1)

Helsinki Institute for Information Technology¹

19 Dec 2007-Knowledge and Information Systems

TL;DR: This work combines a well-known unimodal regression algorithm with a simple dynamic-programming approach to obtain an optimal quadratic-time algorithm for the problem of unimmodal k-segmentation and describes a more efficient greedy-merging heuristic that is experimentally shown to give solutions very close to the optimal.

...read moreread less

Abstract: We study the problem of segmenting a sequence into k pieces so that the resulting segmentation satisfies monotonicity or unimodality constraints. Unimodal functions can be used to model phenomena in which a measured variable first increases to a certain level and then decreases. We combine a well-known unimodal regression algorithm with a simple dynamic-programming approach to obtain an optimal quadratic-time algorithm for the problem of unimodal k-segmentation. In addition, we describe a more efficient greedy-merging heuristic that is experimentally shown to give solutions very close to the optimal. As a concrete application of our algorithms, we describe methods for testing if a sequence behaves unimodally or not. The methods include segmentation error comparisons, permutation testing, and a BIC-based scoring scheme. Our experimental evaluation shows that our algorithms and the proposed unimodality tests give very intuitive results, for both real-valued and binary data.

...read moreread less

Journal Article•DOI•

Ontology-driven intelligent service for configuration support in networked organizations

[...]

Alexander V. Smirnov¹, Nikolay Shilov¹, Tatiana Levashova¹, Leonid Sheremetov², Miguel Contreras² - Show less +1 more•Institutions (2)

Russian Academy of Sciences¹, Mexican Institute of Petroleum²

S-Club: an overlay-based efficient service discovery mechanism in CROWN Grid

TL;DR: The paper proposes a technological ontology-driven framework for configuration support as applied to networked organization referred to as KSNet, which integrates concepts of business intelligence and Web intelligence into a collaboration environment of a networked organizations on the base of attainment of knowledge logistics purposes.

...read moreread less

Abstract: Nowadays, organizations must continually adapt to market and organizational changes to achieve their most important goals. Migration to business services and service-oriented architectures provides a valuable opportunity to attain the organization objectives. This migration causes evolution both in organizational structure and in technology-enabling businesses to dynamically change vendors and services. One of the forms of organizational structures is the form of networked organization. Technologies of business intelligence and Web intelligence effectively support business processes within the networked organizations. While business intelligence focuses on development of services for consumer needs recognition, information search, and evaluation of alternatives; Web intelligence addresses advancement of Web-empowered systems, services, and environments. The paper proposes a technological ontology-driven framework for configuration support as applied to networked organization. The framework integrates concepts of business intelligence and Web intelligence into a collaboration environment of a networked organization on the base of attainment of knowledge logistics purposes. This framework referred to as KSNet is based on the integration of software agent technology and Web services. Knowledge logistics functions of KSNet are complemented by technological functions of knowledge-gathering agents. The services of these agents are implemented with CAPNET, a FIPA compliant agent platform. CAPNET allows consuming services of agents in a service-oriented way. Applicability of the approach is illustrated through a “Binni scenario”-based case study of a portable field hospital configuration.

...read moreread less

Journal Article•DOI•

[...]

Chunming Hu¹, Yanmin Zhu², Jinpeng Huai¹, Yunhao Liu², Lionel M. Ni² - Show less +1 more•Institutions (2)

Beihang University¹, Hong Kong University of Science and Technology²

Current challenges and trends in intelligent computing and knowledge management in industry

TL;DR: The results show that S-Club scheme significantly improves search performance and outperforms existing approaches.

...read moreread less

Abstract: Information service plays a key role in grid system, handles resource discovery and management process. Employing existing information service architectures suffers from poor scalability, long search response time, and large traffic overhead. In this paper, we propose a service club mechanism, called S-Club, for efficient service discovery. In S-Club, an overlay based on existing Grid Information Service (GIS) mesh network of CROWN is built, so that GISs are organized as service clubs. Each club serves for a certain type of service while each GIS may join one or more clubs. S-Club is adopted in our CROWN Grid and the performance of S-Club is evaluated by comprehensive simulations. The results show that S-Club scheme significantly improves search performance and outperforms existing approaches.

...read moreread less

Journal Article•DOI•

[...]

Matías Alvarado¹, Leonid Sheremetov², René Bañares-Alcántara³, Francisco J. Cantu-Ortiz⁴•Institutions (4)

CINVESTAV¹, American Petroleum Institute², University of Oxford³, Monterrey Institute of Technology and Higher Education⁴