
Showing papers by "James Bailey published in 2008"


Journal ArticleDOI
TL;DR: To discover regions of correlated spatio-temporal change in graphs, an algorithm called cSTAG is proposed, which addresses the problem of finding clusters that optimise both temporal and spatial distance measures simultaneously.
Abstract: Graphs provide powerful abstractions of relational data, and are widely used in fields such as network management, web page analysis and sociology. While many graph representations of data describe dynamic and time evolving relationships, most graph mining work treats graphs as static entities. Our focus in this paper is to discover regions of a graph that are evolving in a similar manner. To discover regions of correlated spatio-temporal change in graphs, we propose an algorithm called cSTAG. Whereas most clustering techniques are designed to find clusters that optimise a single distance measure, cSTAG addresses the problem of finding clusters that optimise both temporal and spatial distance measures simultaneously. We show the effectiveness of cSTAG using a quantitative analysis of accuracy on synthetic data sets, as well as demonstrating its utility on two large, real-life data sets, where one is the routing topology of the Internet, and the other is the dynamic graph of files accessed together on the 1998 World Cup official website.
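For intuition, the core idea of clustering under two distance measures at once can be sketched as below. The convex combination of a spatial and a temporal distance matrix, the alpha weight, and the use of average-linkage clustering are all illustrative assumptions; they are not cSTAG's actual objective or algorithm.

```python
# Minimal sketch: cluster graph regions under a combined spatial and
# temporal distance. The convex combination and average-linkage step are
# illustrative assumptions, not cSTAG's actual formulation.
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def combined_clusters(d_spatial, d_temporal, alpha=0.5, n_clusters=3):
    """d_spatial, d_temporal: symmetric (n x n) numpy distance matrices."""
    d = alpha * d_spatial + (1.0 - alpha) * d_temporal   # trade off the two criteria
    z = linkage(squareform(d, checks=False), method="average")
    return fcluster(z, t=n_clusters, criterion="maxclust")
```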

59 citations


Proceedings Article
01 Oct 2008
TL;DR: This work introduces a new technique for building decision trees that is better suited to gene expression data, based on consideration of the area under the Receiver Operating Characteristics (ROC) curve, to help determine decision tree characteristics, such as node selection and stopping criteria.
Abstract: Gene expression information from microarray experiments is a primary form of data for biological analysis and can offer insights into disease processes and cellular behaviour. Such datasets are particularly challenging to build classifiers for, due to their very high dimensional nature and small sample size. Decision trees are a seemingly attractive technique for this domain, due to their easily interpretable white box nature and noise resistance. However, existing decision tree methods tend to perform rather poorly for classifying gene expression data. To address this gap, we introduce a new technique for building decision trees that is better suited to this scenario. Our method is based on consideration of the area under the Receiver Operating Characteristics (ROC) curve, to help determine decision tree characteristics, such as node selection and stopping criteria. We experimentally compare our algorithm, called ROC-tree, against other well known decision tree techniques, on a number of gene expression datasets. The experimental results clearly demonstrate that ROC-tree can deliver better classification accuracy in a range of challenging situations.
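A minimal sketch of the general idea of AUC-guided attribute selection at a tree node is given below; the single-feature AUC ranking shown is an assumption for illustration and is far simpler than ROC-tree's actual node-selection and stopping criteria.

```python
# Illustrative sketch only: rank candidate split attributes at a node by
# how well their raw values separate the two classes (AUC). ROC-tree's
# real node selection and stopping criteria are more involved.
from sklearn.metrics import roc_auc_score

def best_split_feature(X, y):
    """X: (n_samples, n_features) numpy expression matrix, y: binary labels."""
    best_feat, best_auc = None, 0.5
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])
        auc = max(auc, 1.0 - auc)      # make the score direction-invariant
        if auc > best_auc:
            best_feat, best_auc = j, auc
    return best_feat, best_auc
```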

25 citations


Book ChapterDOI
15 Sep 2008
TL;DR: Experiments show that the proposed method, based on consideration of the area under the Receiver Operating Characteristics (ROC) curve, can substantially boost the classification performance of the k-NN algorithm and is even able to deliver better accuracy than state-of-the-art non-k-NN classifiers, such as support vector machines.
Abstract: The k-nearest neighbour (k-NN) technique, due to its interpretable nature, is a simple and very intuitively appealing method to address classification problems. However, choosing an appropriate distance function for k-NN can be challenging and an inferior choice can make the classifier highly vulnerable to noise in the data. In this paper, we propose a new method for determining a good distance function for k-NN. Our method is based on consideration of the area under the Receiver Operating Characteristics (ROC) curve, which is a well known method to measure the quality of binary classifiers. It computes weights for the distance function, based on ROC properties within an appropriate neighbourhood for the instances whose distance is being computed. We experimentally compare the effect of our scheme with a number of other well-known k-NN distance metrics, as well as with a range of different classifiers. Experiments show that our method can substantially boost the classification performance of the k-NN algorithm. Furthermore, in a number of cases our technique is even able to deliver better accuracy than state-of-the-art non k-NN classifiers, such as support vector machines.
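The flavour of the approach can be sketched as follows; note that the paper derives weights from ROC properties within a neighbourhood of the instances being compared, whereas this simplified sketch assumes one global AUC-based weight per feature.

```python
# Simplified sketch: weight each feature by a global AUC-derived score and
# use a weighted Euclidean distance in k-NN. The paper instead computes
# ROC-based weights locally, within a neighbourhood of the two instances.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

def auc_feature_weights(X, y):
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    return np.abs(aucs - 0.5) * 2.0    # 0 = uninformative, 1 = perfectly ranking

def fit_weighted_knn(X, y, k=5):
    w = auc_feature_weights(X, y)
    # Scaling features by sqrt(w) makes plain Euclidean distance equal to
    # the w-weighted Euclidean distance.
    knn = KNeighborsClassifier(n_neighbors=k).fit(X * np.sqrt(w), y)
    return knn, w
```

At prediction time the same scaling must be applied, for example knn.predict(X_test * np.sqrt(w)).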

22 citations


Proceedings ArticleDOI
26 Oct 2008
TL;DR: Experimental results demonstrate that the proposed technique can efficiently identify and explain contrast behaviour which would be difficult or impossible to isolate using standard techniques.
Abstract: Contrast data mining is a key tool for finding differences between sets of objects, or classes, and contrast patterns are a popular method for discrimination between two classes. However, such patterns can be limited in two primary ways: i) They do not readily allow second order differentiation - i.e. discovering contrasts of contrasts, ii) Mining contrast patterns often results in an overwhelming volume of output for the user. To address these limitations, this paper proposes a method which can identify contrast behaviour across both classes and also groups of classes. Furthermore, to increase interpretability for the user, it presents a new technique for finding the attributes which represent the key underlying factors behind the contrast behaviour. The associated mining task is computationally challenging and we describe an efficient algorithm to handle it, based on binary decision diagrams. Experimental results demonstrate that our technique can efficiently identify and explain contrast behaviour which would be difficult or impossible to isolate using standard techniques.
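To make the notion of a contrast concrete, the toy sketch below flags single items whose relative support differs sharply between two classes; it illustrates only first-order contrasts and none of the paper's BDD-based machinery for itemsets, groups of classes, or contrasts of contrasts.

```python
# Toy illustration of first-order contrast behaviour: single items whose
# relative support differs sharply between two classes. The paper handles
# itemsets, groups of classes and second-order contrasts via BDDs.
from collections import Counter

def item_contrasts(class_a, class_b, min_ratio=3.0):
    """class_a, class_b: lists of transactions, each transaction a set of items."""
    supp_a = Counter(item for t in class_a for item in t)
    supp_b = Counter(item for t in class_b for item in t)
    contrasts = {}
    for item, count in supp_a.items():
        rate_a = count / len(class_a)
        rate_b = supp_b.get(item, 0) / len(class_b)
        ratio = rate_a / rate_b if rate_b > 0 else float("inf")
        if ratio >= min_ratio:
            contrasts[item] = ratio
    return contrasts
```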

15 citations


Book ChapterDOI
26 Apr 2008
TL;DR: This paper presents a new and extensible approach to discovering acronym patterns, along with a new approach for both ranking the patterns and utilizing them within search queries.
Abstract: Techniques for being able to automatically identify acronym patterns are very important for enhancing a multitude of applications that rely upon search. This task is challenging, due to the many ways that acronyms and their expansions can be embedded in text. Methods for ranking and exploiting acronym patterns are another related, yet mostly untouched area. In this paper we present a new and extensible approach to discover acronym patterns. Furthermore, we present a new approach for both ranking the patterns and utilizing them within search queries. In our pattern discovery system, we are able to achieve a clear separation between higher and lower level functionalities. This enables great flexibility and allows users to easily configure and tune the system for different target domains. We evaluate our system and show how it is able to offer new capabilities, compared to existing work in the area.
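As a concrete (and deliberately simplistic) illustration of one acronym pattern, the sketch below matches the common "long form (ABC)" shape by comparing the acronym's letters to the initials of the preceding words; the paper's system supports far more general, configurable patterns as well as ranking.

```python
# Toy example of one acronym pattern: "long form (ABC)", matched by
# checking the acronym letters against the initials of preceding words.
# The paper's discovery system is far more general and configurable.
import re

def find_acronym_pairs(text):
    pairs = []
    for m in re.finditer(r"\(([A-Z]{2,6})\)", text):
        acronym = m.group(1)
        words = text[:m.start()].split()[-len(acronym):]
        if len(words) == len(acronym) and all(
            w[0].upper() == c for w, c in zip(words, acronym)
        ):
            pairs.append((" ".join(words), acronym))
    return pairs

print(find_acronym_pairs("the Receiver Operating Characteristics (ROC) curve"))
# [('Receiver Operating Characteristics', 'ROC')]
```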

9 citations



Book ChapterDOI
15 Oct 2008
TL;DR: The g-MARS (gapped Markov Chain with Support Vector Machine) protein classifier is presented, which models the structure of a protein sequence by measuring the transition probabilities between pairs of amino acids and can be generalized to incorporate gaps in the Markov chain.
Abstract: Classifying protein sequences has important applications in areas such as disease diagnosis, treatment development and drug design. In this paper we present a highly accurate classifier called the g-MARS (gapped Markov Chain with Support Vector Machine) protein classifier. It models the structure of a protein sequence by measuring the transition probabilities between pairs of amino acids. This results in a Markov chain style model for each protein sequence. Then, to capture the similarity among non-exactly matching protein sequences, we show that this model can be generalized to incorporate gaps in the Markov chain. We perform a thorough experimental study and compare g-MARS to several other state-of-the-art protein classifiers. Overall, we demonstrate that g-MARS has superior accuracy and operates efficiently on a diverse range of protein families.
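The kind of gapped transition statistics the model is built on can be sketched roughly as below; the per-gap row normalisation and the fixed-length feature vector are assumptions made for illustration, and the real g-MARS model construction differs in its details.

```python
# Rough sketch of gapped amino-acid transition statistics: for each gap g,
# count how often residue a is followed by residue b exactly g+1 positions
# later, then row-normalise. g-MARS's actual model differs in detail.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def gapped_transition_features(seq, max_gap=2):
    feats = []
    for g in range(max_gap + 1):
        counts = np.zeros((20, 20))
        for i in range(len(seq) - g - 1):
            a, b = seq[i], seq[i + g + 1]
            if a in IDX and b in IDX:
                counts[IDX[a], IDX[b]] += 1
        row_sums = counts.sum(axis=1, keepdims=True)
        probs = np.divide(counts, row_sums,
                          out=np.zeros_like(counts), where=row_sums > 0)
        feats.append(probs.ravel())
    return np.concatenate(feats)   # one fixed-length vector per protein
```

These fixed-length vectors could then be fed to any standard SVM implementation, for example sklearn.svm.SVC.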

3 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: An adaptive control framework for a multi-compartmental model of a pressure-limited respirator and lung mechanics system where the plant and reference model involve switching and time-varying dynamics is developed.
Abstract: In this paper, we develop an adaptive control framework for a multi-compartmental model of a pressure-limited respirator and lung mechanics system. Specifically, we develop a model reference direct adaptive controller framework where the plant and reference model involve switching and time-varying dynamics. We then apply the proposed adaptive feedback controller framework to stabilize a given limit cycle corresponding to a clinically plausible respiratory pattern.
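For readers unfamiliar with model reference adaptive control, a textbook, non-switching version of the idea is sketched below (state feedback, single input, known input matrix B); this is background intuition only and not the paper's switching, time-varying framework.

```latex
% Textbook MRAC sketch (no switching or time variation), for intuition only.
\begin{align*}
  \dot{x}(t)   &= A x(t) + B u(t)                 && \text{plant, $A$ uncertain}\\
  \dot{x}_m(t) &= A_m x_m(t) + B_m r(t)           && \text{reference model}\\
  u(t)         &= \hat{K}_x^{\mathsf T}(t)\, x(t) + \hat{k}_r(t)\, r(t)
                                                  && \text{adaptive control law}\\
  e(t)         &= x(t) - x_m(t)                   && \text{tracking error}\\
  \dot{\hat{K}}_x(t) &= -\Gamma_x\, x(t)\, e^{\mathsf T}(t) P B, \qquad
  \dot{\hat{k}}_r(t)  = -\Gamma_r\, r(t)\, e^{\mathsf T}(t) P B
\end{align*}
```

where $P$ solves the Lyapunov equation $A_m^{\mathsf T} P + P A_m = -Q$ for some $Q \succ 0$.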

3 citations


Book ChapterDOI
15 Oct 2008
TL;DR: A novel clustering algorithm is developed which incorporates functional gene information from the Gene Ontology into the clustering process, resulting in more biologically meaningful clusters, and the potential of such methods for exploring cancer etiology is shown.
Abstract: Gene expression profiling provides insight into the functions of genes at a molecular level. Clustering of gene expression profiles can facilitate the identification of the underlying biological program driving genes' co-expression. Standard clustering methods, which group genes based on similar expression values, fail to capture weak expression correlations, potentially causing genes in the same biological process to be grouped separately. We have developed a novel clustering algorithm which incorporates functional gene information from the Gene Ontology into the clustering process, resulting in more biologically meaningful clusters. We have validated our method using a multi-cancer microarray dataset. In addition, we show the potential of such methods for the exploration of cancer etiology.
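One simple way to let Gene Ontology annotations influence a clustering, shown purely for illustration, is to blend an expression-based distance with a GO-term dissimilarity; the Jaccard measure and the fixed blending weight below are assumptions, not the paper's algorithm.

```python
# Illustrative only: blend an expression-based distance with a Gene
# Ontology dissimilarity (1 - Jaccard overlap of GO term sets). The
# blending weight and similarity measure are assumptions, not the
# paper's actual method.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def go_informed_distance(expr, genes, go_terms, alpha=0.7):
    """expr: (n_genes, n_samples) matrix; go_terms: dict gene -> set of GO ids."""
    d_expr = squareform(pdist(expr, metric="correlation"))
    n = len(genes)
    d_go = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a = go_terms.get(genes[i], set())
            b = go_terms.get(genes[j], set())
            sim = len(a & b) / len(a | b) if (a or b) else 0.0
            d_go[i, j] = d_go[j, i] = 1.0 - sim
    return alpha * d_expr + (1.0 - alpha) * d_go   # feed to any distance-based clusterer
```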

2 citations


Posted Content
TL;DR: In this article, the problem of deciding satisfiability of first-order logic queries over views is studied, with the aim being to delimit the boundary between the decidable and the undecidable fragments of this language.
Abstract: We study the problem of deciding satisfiability of first-order logic queries over views, our aim being to delimit the boundary between the decidable and the undecidable fragments of this language. Views currently occupy a central place in database research, due to their role in applications such as information integration and data warehousing. Our main result is the identification of a decidable class of first-order queries over unary conjunctive views that generalises the decidability of the classical class of first-order sentences over unary relations, known as the Löwenheim class. We then demonstrate how various extensions of this class lead to undecidability and also provide some expressivity results. Besides its theoretical interest, our new decidable class is potentially interesting for use in applications such as deciding implication of complex dependencies, analysis of a restricted class of active database rules, and ontology reasoning.
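As a small illustrative example (not taken from the paper) of what a first-order query over unary conjunctive views looks like: given a binary relation $R$, define two unary views and ask whether some database makes a sentence over them true.

```latex
% Illustrative example of unary conjunctive views and a first-order
% sentence over them; not drawn from the paper itself.
\begin{align*}
  V_1(x) &\;\leftarrow\; \exists y\, R(x, y) && \text{(elements with an outgoing $R$-edge)}\\
  V_2(x) &\;\leftarrow\; \exists y\, R(y, x) && \text{(elements with an incoming $R$-edge)}\\
  \varphi &\;=\; \exists x\, \bigl(V_1(x) \wedge \neg V_2(x)\bigr)
          && \text{satisfiable: take } R = \{(a, b)\}
\end{align*}
```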

1 citation


Book
01 Jan 2008
TL;DR: Proceedings volume with sessions on reasoning, applications, and querying for the Semantic Web, including several contributions on the Xcerpt web query language.
Abstract: Table of contents:
- Session 1. Invited Talk: The RuleML Family of Web Rule Languages
- Session 2. Reasoning I: Automated Reasoning Support for First-Order Ontologies; Combining Safe Rules and Ontologies by Interfacing of Reasoners
- Session 3. Applications: Realizing Business Processes with ECA Rules: Benefits, Challenges, Limits; Interaction Protocols and Capabilities: A Preliminary Report; Semantic Web Reasoning for Analyzing Gene Expression Profiles
- Session 4. Querying: Data Model and Query Constructs for Versatile Web Query Languages: State-of-the-Art and Challenges for Xcerpt; AMaχoS - Abstract Machine for Xcerpt: Architecture and Principles; Towards More Precise Typing Rules for Xcerpt
- Session 5. Reasoning II: Extending an OWL Web Node with Reactive Behavior; Supporting Open and Closed World Reasoning on the Web; Reasoning with Temporal Constraints in RDF
- Session 6. Reasoning III: Bidirectional Mapping Between OWL DL and Attempto Controlled English; XML Querying Using Ontological Information; Semantic Web Reasoning Using a Blackboard System
- Systems Session: Effective and Efficient Data Access in the Versatile Web Query Language Xcerpt; Web Queries with Style: Rendering Xcerpt Programs with CSSNG; Information Gathering in a Dynamic World; Practice of Inductive Reasoning on the Semantic Web: A System for Semantic Web Mining; Fuzzy Time Intervals: System Description of the FuTI-Library; A Prototype of a Descriptive Type System for Xcerpt

Proceedings Article
25 Sep 2008
TL;DR: This book constitutes the proceedings of the 9th International Conference on Web Information Systems Engineering, WISE 2008, held in Auckland, New Zealand, in September 2008; it contains 17 revised full papers and 14 revised short papers presented together with two keynote talks.
Abstract: This book constitutes the proceedings of the 9th International Conference on Web Information Systems Engineering, WISE 2008, held in Auckland, New Zealand, in September 2008. The 17 revised full papers and 14 revised short papers presented together with two keynote talks were carefully reviewed and selected from around 110 submissions. The papers are organized in topical sections on grid computing and peer-to-peer systems; Web mining; rich Web user interfaces; semantic Web; Web information retrieval; Web data integration; queries and peer-to-peer systems; and Web services.

Proceedings Article
05 Dec 2008
TL;DR: This work investigates several forms of contextual evidence outside the retrieved passage, such as query/fulltext similarity, query/citation sentence similarity, query/title similarity, and query/abstract similarity, and the experimental results suggest that document context provides the strongest contextual evidence for this task.
Abstract: Query Expansion is a widely used technique that augments a query with synonymous and related terms in order to address a common issue in ad hoc retrieval: the vocabulary mismatch problem, where relevant documents contain query terms that are semantically similar, but lexically distinct. Standard query expansion techniques include pseudo relevance feedback and ontology-based expansion. In this paper, we explore the use of contextual information as a means of expanding the context surrounding the unit of retrieval, rather than the query, which in this case is a document passage. The ad hoc retrieval task that we focus on in this paper was investigated at the TREC 2006 Genomics track, where systems were required to retrieve relevant answer passages. The most commonly reported indexing strategy was passage indexing. Although this simplifies post-retrieval processing, retrieval performance can be hurt as valuable contextual information in the containing document is lost. The focus of this paper is to investigate various forms of contextual evidence of similarity outside of the passage, such as query/fulltext similarity, query/citation sentence similarity, query/title similarity, and query/abstract similarity. These similarity scores are then used to boost the rank of passages that exhibit high contextual evidence of query similarity. Our experimental results suggest that document context provides the strongest evidence of contextual information for this task.
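A minimal sketch of the boosting step is shown below; the linear combination and the equal default weights are assumptions made for illustration, whereas the paper evaluates each source of contextual evidence in its own right.

```python
# Minimal sketch: boost a passage's retrieval score with document-level
# contextual similarity scores (fulltext, citation sentence, title,
# abstract). The linear form and equal default weights are assumptions.
def boosted_score(passage_score, context_scores, weights=None):
    """context_scores: dict, e.g. {'fulltext': 0.8, 'title': 0.4, ...}."""
    if weights is None:
        weights = {k: 0.25 for k in context_scores}   # assumed equal weighting
    boost = sum(weights.get(k, 0.0) * v for k, v in context_scores.items())
    return passage_score + boost

def rerank(passages, weights=None):
    """passages: list of dicts with 'score' and 'context' keys."""
    return sorted(passages,
                  key=lambda p: boosted_score(p["score"], p["context"], weights),
                  reverse=True)
```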