
Showing papers by "James Bailey published in 2014"


Proceedings ArticleDOI
24 Aug 2014
TL;DR: It is shown how the resulting NP-hard global optimization problem could be efficiently approximately solved via spectral relaxation and semi-definite programming techniques.
Abstract: Most current mutual information (MI) based feature selection techniques are greedy in nature and are thus prone to sub-optimal decisions. Potential performance improvements could be gained by systematically posing MI-based feature selection as a global optimization problem. A rare attempt at providing a global solution for MI-based feature selection is the recently proposed Quadratic Programming Feature Selection (QPFS) approach. We point out that the QPFS formulation faces several non-trivial issues, in particular, how to properly treat feature `self-redundancy' while ensuring the convexity of the objective function. In this paper, we take a systematic approach to the problem of global MI-based feature selection. We show how the resulting NP-hard global optimization problem can be efficiently and approximately solved via spectral relaxation and semi-definite programming techniques. We experimentally demonstrate the efficiency and effectiveness of these novel feature selection frameworks.
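For orientation, the QPFS-style program the paper starts from poses feature selection as a quadratic optimization over feature weights (the notation below is a common presentation of QPFS and may differ in detail from the paper):

\[ \min_{\mathbf{x}} \; (1-\alpha)\,\mathbf{x}^{\top} Q\, \mathbf{x} \;-\; \alpha\, \mathbf{f}^{\top} \mathbf{x} \quad \text{s.t.} \quad \mathbf{x} \ge 0, \; \textstyle\sum_i x_i = 1, \]

where \(Q_{ij}\) is the mutual information between features i and j (redundancy), \(f_i\) is the mutual information between feature i and the class (relevance), and \(\alpha\) balances the two. Requiring a genuinely global, combinatorial selection rather than a relaxed weight vector x is presumably what yields the NP-hard problem that the paper attacks with spectral relaxation and semi-definite programming.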

117 citations


Proceedings Article
21 Jun 2014
TL;DR: It is argued that a further type of statistical adjustment for the mutual information is also beneficial - an adjustment to correct selection bias, which requires computation of the variance of mutual information under a hypergeometric model of randomness.
Abstract: Mutual information is a very popular measure for comparing clusterings. Previous work has shown that it is beneficial to make an adjustment for chance to this measure, by subtracting an expected value and normalizing via an upper bound. This yields the constant baseline property that enhances intuitiveness. In this paper, we argue that a further type of statistical adjustment for the mutual information is also beneficial - an adjustment to correct selection bias. This type of adjustment is useful when carrying out many clustering comparisons, to select one or more preferred clusterings. It reduces the tendency for the mutual information to choose clustering solutions i) with more clusters, or ii) induced on fewer data points, when compared to a reference one. We term our new adjusted measure the standardized mutual information. It requires computation of the variance of mutual information under a hypergeometric model of randomness, which is technically challenging. We derive an analytical formula for this variance and analyze its complexity. We then experimentally assess how our new measure can address selection bias and also increase interpretability. We recommend using the standardized mutual information when making multiple clustering comparisons in situations where the number of records is small compared to the number of clusters considered.
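As a hedged sketch of the adjustments being discussed (symbols assumed rather than quoted from the paper), with \(I(U,V)\) the mutual information between clusterings U and V: the chance-adjusted form subtracts the expected value and normalizes by an upper bound (one common choice shown), while the standardized form instead divides the centred value by its standard deviation under the hypergeometric model of randomness:

\[ \mathrm{AMI}(U,V) = \frac{I(U,V) - \mathbb{E}[I(U,V)]}{\max(H(U), H(V)) - \mathbb{E}[I(U,V)]}, \qquad \mathrm{SMI}(U,V) = \frac{I(U,V) - \mathbb{E}[I(U,V)]}{\sqrt{\operatorname{Var}(I(U,V))}}. \]

Roughly speaking, the standardization is what addresses the selection bias described above: comparisons whose MI has high variance under the null are no longer favoured merely for producing occasional high raw values.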

102 citations


Journal ArticleDOI
TL;DR: The analysis suggests that the molecular basis of cell shape may, in addition to motor force, be a key adaptive strategy for malaria parasite dissemination and, as such, transmission.
Abstract: Motility is a fundamental part of cellular life and survival, including for Plasmodium parasites--single-celled protozoan pathogens responsible for human malaria. The motile life cycle forms achieve motility, called gliding, via the activity of an internal actomyosin motor. Although gliding is based on the well-studied system of actin and myosin, its core biomechanics are not completely understood. Currently accepted models suggest it results from a specifically organized cellular motor that produces a rearward directional force. When linked to surface-bound adhesins, this force is passaged to the cell posterior, propelling the parasite forwards. Gliding motility is observed in all three life cycle stages of Plasmodium: sporozoites, merozoites and ookinetes. However, it is only the ookinetes--formed inside the midgut of infected mosquitoes--that display continuous gliding without the necessity of host cell entry. This makes them ideal candidates for invasion-free biomechanical analysis. Here we apply a plate-based imaging approach to study ookinete motion in three-dimensional (3D) space to understand Plasmodium cell motility and how movement facilitates midgut colonization. Using single-cell tracking and numerical analysis of parasite motion in 3D, our analysis demonstrates that ookinetes move with a conserved left-handed helical trajectory. Investigation of cell morphology suggests this trajectory may be based on the ookinete subpellicular cytoskeleton, with complementary whole and subcellular electron microscopy showing that, like their motion paths, ookinetes share a conserved left-handed corkscrew shape and underlying twisted microtubular architecture. Through comparisons of 3D movement between wild-type ookinetes and a cytoskeleton-knockout mutant we demonstrate that perturbation of cell shape changes motion from helical to broadly linear. Therefore, while the precise linkages between cellular architecture and actomyosin motor organization remain unknown, our analysis suggests that the molecular basis of cell shape may, in addition to motor force, be a key adaptive strategy for malaria parasite dissemination and, as such, transmission.

48 citations


Proceedings Article
21 Jun 2014
TL;DR: This paper proposes a novel approach to the MI-based feature selection problem, in which the overfitting phenomenon is controlled rigorously by means of a statistical test, and develops local and global optimization algorithms for this new feature selection model.
Abstract: Mutual information (MI) based approaches are a popular feature selection paradigm. Although the stated goal of MI-based feature selection is to identify a subset of features that share the highest mutual information with the class variable, most current MI-based techniques are greedy methods that make use of low dimensional MI quantities. The reason for using low dimensional approximations has mostly been attributed to the difficulty of estimating the high dimensional MI from limited samples. In this paper, we argue a different viewpoint: even given a very large amount of data, the high dimensional MI objective is still problematic to employ as a meaningful optimization criterion because of its overfitting nature - the MI almost always increases as more features are added, thus leading to a trivial solution which includes all features. We propose a novel approach to the MI-based feature selection problem, in which the overfitting phenomenon is controlled rigorously by means of a statistical test. We develop local and global optimization algorithms for this new feature selection model, and demonstrate its effectiveness in the applications of explaining variables and objects.
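One way to picture this kind of statistically controlled selection is a forward search that only accepts a candidate feature if the MI gain it brings survives a significance test. The sketch below is purely illustrative (the function names, the hash-based joint MI estimate, and the permutation test are stand-ins, not the paper's actual model or test); features are assumed discrete.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def joint_mi(features, y):
    # Approximate the joint MI of several discrete features with the class by
    # collapsing each row of selected feature values into a single label.
    joint = np.array([hash(tuple(row)) for row in features])
    return mutual_info_score(joint, y)

def select_features(X, y, alpha=0.05, n_perm=200, rng=np.random.default_rng(0)):
    """Forward selection gated by a permutation test on the MI gain (illustrative)."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Greedily pick the candidate giving the largest MI gain.
        gain, best = max((joint_mi(X[:, selected + [j]], y), j) for j in remaining)
        base = joint_mi(X[:, selected], y) if selected else 0.0
        observed = gain - base
        # Permutation test: is the observed gain larger than expected by chance?
        null = []
        for _ in range(n_perm):
            y_perm = rng.permutation(y)
            null.append(joint_mi(X[:, selected + [best]], y_perm) -
                        (joint_mi(X[:, selected], y_perm) if selected else 0.0))
        p_value = np.mean([g >= observed for g in null])
        if p_value > alpha:
            break  # gain is explainable by chance; stop to avoid overfitting
        selected.append(best)
        remaining.remove(best)
    return selected
```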

30 citations


Book ChapterDOI
21 Apr 2014
TL;DR: The meta path-based similarity measure PathSim is extended by incorporating richer information, such as transitive similarity and temporal dynamics, to help solve the problem of similarity search in heterogeneous information networks.
Abstract: Heterogeneous information networks have attracted much attention in recent years and a key challenge is to compute the similarity between two objects. In this paper, we study the problem of similarity search in heterogeneous information networks, and extend the meta path-based similarity measure PathSim by incorporating richer information, such as transitive similarity and temporal dynamics. Experiments on a large DBLP network show that our improved similarity measure is more effective at identifying similar authors in terms of their future collaborations.
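For context, the base PathSim measure being extended is usually defined, for a symmetric meta path \(\mathcal{P}\), as (definition recalled from the PathSim literature rather than quoted from this paper):

\[ s(x, y) = \frac{2\,\big|\{ p_{x \rightsquigarrow y} : p \in \mathcal{P} \}\big|}{\big|\{ p_{x \rightsquigarrow x} : p \in \mathcal{P} \}\big| + \big|\{ p_{y \rightsquigarrow y} : p \in \mathcal{P} \}\big|}, \]

i.e. the number of meta path instances connecting x and y, normalized by the two objects' self-connectivity. The paper's extension layers transitive similarity and temporal dynamics on top of this base score.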

19 citations


Journal ArticleDOI
TL;DR: Two new algorithms for alternative clustering generation are presented; their distinctive feature is a principled formulation of an objective function that facilitates the discovery of a subspace satisfying natural quality and orthogonality criteria.
Abstract: Clustering analysis is important for exploring complex datasets. Alternative clustering analysis is an emerging subfield involving techniques for the generation of multiple different clusterings, allowing the data to be viewed from different perspectives. We present two new algorithms for alternative clustering generation. A distinctive feature of our algorithms is their principled formulation of an objective function, facilitating the discovery of a subspace satisfying natural quality and orthogonality criteria. The first algorithm is a regularization of the principal component analysis method, whereas the second is a regularization of graph-based dimension reduction. In both cases, we demonstrate that a globally optimal subspace solution can be computed. Experimental evaluation shows our techniques are able to equal or outperform a range of existing methods.
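One plausible shape for the kind of regularized objective described above (an illustrative form under assumed notation, not necessarily the paper's exact formulation) is to seek an orthonormal projection W that preserves variance while being penalized for aligning with the subspace already captured by the existing clustering:

\[ \max_{W^{\top} W = I} \; \operatorname{tr}\!\big(W^{\top} X^{\top} X\, W\big) \;-\; \lambda\, \operatorname{tr}\!\big(W^{\top} M\, W\big), \]

where M encodes the existing clustering structure and \(\lambda\) trades clustering quality against orthogonality. Trace-difference objectives of this kind reduce to an eigendecomposition of \(X^{\top}X - \lambda M\), which is consistent with the globally optimal subspace solution claimed in the abstract.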

16 citations


Proceedings ArticleDOI
01 Dec 2014
TL;DR: This paper generalizes eight information theoretic crisp indices to soft clusterings, so that they can be used with partitions of any type (i.e., crisp or soft, with soft including fuzzy, probabilistic and possibilistic cases).
Abstract: There have been a large number of external validity indices proposed for cluster validity. One such class of cluster comparison indices is the information theoretic measures, due to their strong mathematical foundation and their ability to detect non-linear relationships. However, they are devised for evaluating crisp (hard) partitions. In this paper, we generalize eight information theoretic crisp indices to soft clusterings, so that they can be used with partitions of any type (i.e., crisp or soft, with soft including fuzzy, probabilistic and possibilistic cases). We present experimental results to demonstrate the effectiveness of the generalized information theoretic indices.
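One standard way to lift a crisp information theoretic index to soft partitions, in the spirit of the generalization described above (an illustrative sketch, not necessarily the paper's exact construction; the function name soft_nmi is hypothetical), is to build a "soft contingency table" from the two membership matrices and evaluate the usual formula on it:

```python
import numpy as np

def soft_nmi(U, V, eps=1e-12):
    """Normalized mutual information between two soft partitions.

    U: (n, r) membership matrix of clustering 1; V: (n, c) of clustering 2.
    With crisp one-hot memberships this recovers the standard NMI.
    """
    N = U.T @ V                        # soft contingency table, shape (r, c)
    P = N / N.sum()                    # joint "probability" of cluster pairs
    pu = P.sum(axis=1, keepdims=True)  # marginals of clustering 1
    pv = P.sum(axis=0, keepdims=True)  # marginals of clustering 2
    mi = np.sum(P * np.log(P / (pu @ pv + eps) + eps))
    hu = -np.sum(pu * np.log(pu + eps))
    hv = -np.sum(pv * np.log(pv + eps))
    return mi / max(np.sqrt(hu * hv), eps)

# crisp example: identical partitions of 4 points into 2 clusters -> NMI close to 1
U = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)
print(soft_nmi(U, U))
```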

14 citations


Proceedings ArticleDOI
25 Nov 2014
TL;DR: This paper proposes a proof of concept Do-Not-Disturb (DND) service that can a) determine a user's context relevant for DND service from the built-in smartphone sensors and b) correctly predict the DND status based on the given context such as being in a meeting, sleeping, or working at the office.
Abstract: Modern sensor-equipped smartphones have attracted significant research interest in the pervasive computing community for recognizing user context and creating context-aware applications at a personal or community scale. In this paper, we propose a proof-of-concept Do-Not-Disturb (DND) service that can a) determine a user's context relevant to the DND service from the built-in smartphone sensors and b) correctly predict the DND status based on the given context, such as being in a meeting, sleeping, or working at the office. In this preliminary study, we investigate whether sensor data can be clustered to represent user contexts. We use standard machine learning techniques to learn the relationship between a user's context and the corresponding DND status (available or unavailable). Given a user's current context, the DND service predicts a DND status and configures the mobile device accordingly. Our preliminary experiment demonstrates that the proposed system can achieve a prediction accuracy of up to 90% when trained with sufficient data.

12 citations


Proceedings Article
01 Jan 2014
TL;DR: This paper focuses on the core problem of computing substring matching probability in uncertain sequences and proposes an efficient dynamic programming algorithm for this task, which contributes towards a foundation for adapting classic sequence mining methods to deal with uncertain data.
Abstract: Substring matching is fundamental to data mining methods for sequential data. It involves checking the existence of a short subsequence within a longer sequence, ensuring no gaps within a match. Whilst a large amount of existing work has focused on substring matching and mining techniques for certain sequences, there are only a few results for uncertain sequences. Uncertain sequences provide powerful representations for modelling sequence behavioural characteristics in emerging domains, such as bioinformatics, sensor streams and trajectory analysis. In this paper, we focus on the core problem of computing substring matching probability in uncertain sequences and propose an efficient dynamic programming algorithm for this task. We demonstrate our approach is both competitive theoretically, as well as effective and scalable experimentally. Our results contribute towards a foundation for adapting classic sequence mining methods to deal with uncertain data.
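A minimal sketch of the kind of computation involved, assuming each position of the uncertain sequence is an independent categorical distribution over symbols (an illustrative dynamic program, not necessarily the paper's algorithm; the function name is hypothetical): track the probability mass over pattern-prefix states position by position and absorb the mass that completes a match.

```python
def substring_match_probability(pattern, uncertain_seq):
    """Probability that `pattern` occurs contiguously at least once.

    uncertain_seq is a list of dicts {symbol: probability}, one per position,
    with positions assumed independent (illustrative assumption).
    """
    m = len(pattern)
    # KMP failure function for the pattern
    fail, k = [0] * m, 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    def step(q, c):
        # automaton transition: longest prefix matched after reading symbol c in state q
        while q > 0 and c != pattern[q]:
            q = fail[q - 1]
        return q + 1 if c == pattern[q] else 0

    dist = [0.0] * m      # probability of being in each prefix state (0..m-1)
    dist[0] = 1.0
    matched = 0.0         # absorbed probability of having seen a full match
    for pos_dist in uncertain_seq:
        new = [0.0] * m
        for q, pq in enumerate(dist):
            if pq == 0.0:
                continue
            for c, pc in pos_dist.items():
                nq = step(q, c)
                if nq == m:
                    matched += pq * pc
                else:
                    new[nq] += pq * pc
        dist = new
    return matched
```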

9 citations


Book ChapterDOI
13 May 2014
TL;DR: This paper presents CSMiner, a mining method with various pruning techniques, which is substantially faster than the baseline method and demonstrates that this problem has important applications, and at the same time is very challenging.
Abstract: In this paper, we tackle a novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes C+ and C− and a query object o, we want to find the top-k subspaces S that maximize the ratio of the likelihood of o in C+ against that in C−. We demonstrate that this problem has important applications and, at the same time, is very challenging: it does not even admit a polynomial time approximation. We present CSMiner, a mining method with various pruning techniques. CSMiner is substantially faster than the baseline method. Our experimental results on real data sets verify the effectiveness and efficiency of our method.
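Concretely, the score of a subspace S for the query object o can be pictured as a likelihood ratio, with the class-conditional likelihoods estimated, for example, by kernel density estimation restricted to the dimensions in S (an illustrative formulation under assumed notation; details may differ from the paper):

\[ \mathrm{score}(S \mid o) \;=\; \frac{\hat{L}_{C^{+}}(o_S)}{\hat{L}_{C^{-}}(o_S)}, \]

where \(o_S\) is the projection of o onto S. CSMiner's pruning techniques presumably work by bounding the attainable value of such a ratio so that unpromising subspaces can be discarded without full evaluation.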

9 citations


Journal Article
TL;DR: This work introduces a real-time feedback system for surgical technique within a temporal bone surgical simulator and shows that this feedback system performs exceptionally well with respect to accuracy and effectiveness.
Abstract: Timely feedback on surgical technique is an important aspect of surgical skill training in any learning environment, be it virtual or otherwise. Feedback on technique should be provided in real-time to allow trainees to recognize and amend their errors as they occur. Expert surgeons have typically carried out this task, but they have limited time available to spend with trainees. Virtual reality surgical simulators offer effective, repeatable training at relatively low cost, but their benefits may not be fully realized while they still require the presence of experts to provide feedback. We attempt to overcome this limitation by introducing a real-time feedback system for surgical technique within a temporal bone surgical simulator. Our evaluation study shows that this feedback system performs exceptionally well with respect to accuracy and effectiveness.

Proceedings ArticleDOI
25 Nov 2014
TL;DR: This work shows that diagnostic information that is not considered sensitive could be used to identify a user after just three consecutive days of monitoring, using only diagnostic features like hardware statistics and system settings.
Abstract: Mobile smart phones capture a great amount of information about a user across a variety of different data domains. This information can be sensitive and allow for identifying a user profile, thus causing potential threats to a user's privacy. Our work shows that diagnostic information that is not considered sensitive could be used to identify a user after just three consecutive days of monitoring. We have used the Device Analyzer dataset to determine what features of a mobile device are important in identifying a user. Many mobile games and applications collect diagnostic data as a means of identifying or resolving issues. Diagnostic data is commonly accepted as less sensitive information. Our experimental results demonstrate that, using only diagnostic features like hardware statistics and system settings, a user's device can be identified with an accuracy of 94% with a Naive Bayes classifier.

Book ChapterDOI
13 May 2014
TL;DR: This paper focuses on comparison measures for two important graph clustering approaches, community detection and blockmodelling, and proposes comparison measures that work for weighted (and unweighted) graphs.
Abstract: Clustering in graphs aims to group vertices with similar patterns of connections. Applications include discovering communities and latent structures in graphs. Many algorithms have been proposed to find graph clusterings, but an open problem is the need for suitable comparison measures to quantitatively validate these algorithms, perform consensus clustering and track evolving (graph) clusters across time. To date, most comparison measures have focused on comparing the vertex groupings, and completely ignore the difference in the structural approximations in the clusterings, which can lead to counter-intuitive comparisons. In this paper, we propose new measures that account for differences in the approximations. We focus on comparison measures for two important graph clustering approaches, community detection and blockmodelling, and propose comparison measures that work for weighted (and unweighted) graphs.

Journal ArticleDOI
TL;DR: Commonly measured laboratory variables are tested for their ability to identify surgical patients at risk of major adverse events (death, unplanned intensive care unit (ICU) admission or rapid response team (RRT) activation).
Abstract: Background/Aims: To test whether commonly measured laboratory variables can identify surgical patients at risk of major adverse events (death, unplanned intensive care unit (ICU) admission or rapid response team (RRT) activation). Methods: We conducted a prospective observational study in a surgical ward of a university-affiliated hospital in a cohort of 834 surgical patients admitted for >24 h. We applied a previously validated multivariable model-derived risk assessment to each combined set of common laboratory tests to identify patients at risk. We compared the clinical course of such patients with that of control patients from the same ward who had blood tests but were identified as low risk. Results: We studied 7955 batches and 73 428 individual tests in 834 patients (males 55%; average age 65.8 ± 17.6 years). Among these patients, 66 (7.9%) were identified as ‘high risk’. High-risk patients were older (75.9 vs 61.8 years of age; P < 0.0001), had much greater early (48 h) mortality (6/66 (9%) vs 4/768 (0.5%); P < 0.0001) and greater overall hospital mortality (11/66 (16.7%) vs 9/768 (1.2%); P < 0.0001). They also had more early (8/66 (12.1%) vs 14/768 (1.8%); P = 0.0001) and overall in-hospital unplanned ICU admissions (12/66 (18.2%) vs 18/768 (2.3%); P < 0.0001) and more early (26/66 (39.3%) vs 50/768 (6.5%); P < 0.0001) and overall in-hospital RRT calls (26/66 (39.4%) vs 55/768 (7.2%); P < 0.0001). Conclusions: Commonly performed laboratory tests identify surgical ward patients at risk of early major adverse events. Further studies are needed to assess whether such an identification system can be used to trigger interventions that help improve patient outcomes.

Book ChapterDOI
15 Sep 2014
TL;DR: In this paper, a simple and effective filtering algorithm (FILTA) is proposed that can be flexibly used in conjunction with any meta-clustering method to support the discovery of multiple clusterings in a dataset.
Abstract: Meta-clustering is a popular approach for finding multiple clusterings in a dataset; it takes a large number of base clusterings as input for further user navigation and refinement. However, the effectiveness of meta-clustering is highly dependent on the distribution of the base clusterings, and open challenges exist with regard to its stability and noise tolerance. In this paper we propose a simple and effective filtering algorithm (FILTA) that can be flexibly used in conjunction with any meta-clustering method. Given a (raw) set of base clusterings, FILTA employs information theoretic criteria to remove those having poor quality or high redundancy. The filtered set of clusterings is then highly suitable for further exploration, particularly the use of visualization for determining the dominant views in the dataset. We evaluate FILTA on both synthetic and real world datasets, and show how its use can enhance view discovery for complex scenarios.
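A hedged sketch of this kind of quality-and-redundancy filtering (the function name, the quality criterion plugged in via quality_fn, and the thresholds are placeholders, not FILTA's exact information theoretic criteria):

```python
from sklearn.metrics import normalized_mutual_info_score

def filter_clusterings(base_clusterings, quality_fn, min_quality=0.2, max_redundancy=0.8):
    """Keep base clusterings that are of sufficient quality and not redundant
    with an already-kept clustering.

    base_clusterings: list of label arrays (one array per base clustering).
    quality_fn: maps a label array to a quality score (placeholder criterion).
    """
    # consider higher-quality clusterings first
    ranked = sorted(base_clusterings, key=quality_fn, reverse=True)
    kept = []
    for labels in ranked:
        if quality_fn(labels) < min_quality:
            break  # remaining candidates are lower quality still
        # redundancy measured here by pairwise NMI against the kept set
        if all(normalized_mutual_info_score(labels, k) < max_redundancy for k in kept):
            kept.append(labels)
    return kept
```

The filtered set would then be passed to any meta-clustering method for visualization and view discovery, as the abstract describes.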

Book ChapterDOI
01 Dec 2014
TL;DR: It is demonstrated that higher order transformations have the potential to boost prediction performance and that DLR is a promising method for transfer learning.
Abstract: Density based logistic regression (DLR) is a recently introduced classification technique that performs a one-to-one non-linear transformation of the original feature space into another feature space based on density estimations. This new feature space is particularly well suited for learning a logistic regression model. Whilst performance gains, good interpretability and time efficiency make DLR attractive, there exist some limitations to its formulation. In this paper, we tackle these limitations and propose several new extensions: 1) a more robust methodology for performing density estimations, 2) a method that can transform two or more features into a single target feature, based on the use of higher order kernel density estimation, and 3) an analysis of the utility of DLR for transfer learning scenarios. We evaluate our extensions using several synthetic and publicly available datasets, demonstrating that higher order transformations have the potential to boost prediction performance and that DLR is a promising method for transfer learning.
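As a rough picture of the density based transform being extended (a plausible univariate form; the paper's exact estimator may differ), each feature \(x_j\) is mapped through class-conditional density estimates before a logistic regression model is fit:

\[ \phi_j(x_j) = \log \frac{\hat{p}(x_j \mid y = 1)}{\hat{p}(x_j \mid y = 0)}, \]

and the proposed higher order extension replaces the univariate estimates with joint kernel density estimates over two or more features, producing a single transformed target feature.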

Proceedings ArticleDOI
24 Aug 2014
TL;DR: This work proposes a transfer learning framework to adapt a classifier built on a single temporal bone specimen to multiple specimens, and builds a surgical end-product performance classifier from 16 expert trials on a simulated temporal bone specimen.
Abstract: Evaluation of the outcome (end-product) of surgical procedures carried out in virtual reality environments is an essential part of simulation-based surgical training. Automated end-product assessment can be carried out by performance classifiers built from a set of expert performances. When applied to temporal bone surgery simulation, these classifiers can evaluate performance on the bone specimen they were trained on, but they cannot be extended to new specimens. Thus, new expert performances need to be recorded for each new specimen, requiring considerable time commitment from time-poor expert surgeons. To eliminate this need, we propose a transfer learning framework to adapt a classifier built on a single temporal bone specimen to multiple specimens. Once a classifier is trained, we translate each new specimen's features to the original feature space, which allows us to carry out performance evaluation on different specimens using the same classifier. In our experiment, we built a surgical end-product performance classifier from 16 expert trials on a simulated temporal bone specimen. We applied the transfer learning approach to 8 new specimens to obtain machine-generated end-products. We also collected end-products for these 8 specimens drilled by a single expert. We then compared the machine-generated end-products to those drilled by the expert. The drilled regions generated by transfer learning were similar to those drilled by the expert.

Proceedings ArticleDOI
14 Dec 2014
TL;DR: This paper proposes a new formulation of the graph clustering problem that results in clusterings that are easy to interpret and more accurate than state-of-the-art algorithms for both synthetic and real datasets.
Abstract: Graphs are a powerful representation of relational data, such as social and biological networks. Often, these entities form groups and are organised according to a latent structure. However, these groupings and structures are generally unknown and it can be difficult to identify them. Graph clustering is an important type of approach used to discover these vertex groups and the latent structure within graphs. One type of approach for graph clustering is non-negative matrix factorisation. However, the formulations of existing factorisation approaches can be overly relaxed: their groupings and results are consequently difficult to interpret, they may fail to discover the true latent structure and groupings, and they may converge to extreme solutions. In this paper, we propose a new formulation of the graph clustering problem that results in clusterings that are easy to interpret. Combined with a novel algorithm, the clusterings are also more accurate than those of state-of-the-art algorithms on both synthetic and real datasets.
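For orientation, NMF-style graph clustering of the kind discussed is often posed as approximating the adjacency matrix by a vertex-to-cluster assignment and a cluster interaction (image) matrix (a generic formulation, not the paper's new one):

\[ \min_{H \ge 0,\; B \ge 0} \; \lVert A - H B H^{\top} \rVert_F^2, \]

where A is the (possibly weighted) adjacency matrix, H assigns vertices to clusters and B captures the density of connections between clusters. One common reading of the "overly relaxed" criticism is that H is allowed arbitrary non-negative values rather than behaving like a (near) hard assignment, which is what makes the resulting groupings hard to interpret.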