Showing papers by "Svetha Venkatesh" published in 2017


Journal ArticleDOI
TL;DR: DeepCare is introduced, an end-to-end deep dynamic neural network that reads medical records, stores previous illness history, infers current illness states and predicts future medical outcomes, demonstrating the efficacy of DeepCare for disease progression modeling, intervention recommendation, and future risk prediction.

342 citations


Journal ArticleDOI
TL;DR: A new deep learning system that learns to extract features from medical records and predicts future risk automatically achieves superior accuracy compared to traditional techniques, detects meaningful clinical motifs, and uncovers the underlying structure of the disease and intervention space.
Abstract: Feature engineering remains a major bottleneck when creating predictive systems from electronic medical records. At present, an important missing element is detecting predictive regular clinical motifs from irregular episodic records. We present Deepr (short for Deep record), a new end-to-end deep learning system that learns to extract features from medical records and predicts future risk automatically. Deepr transforms a record into a sequence of discrete elements separated by coded time gaps and hospital transfers. On top of the sequence is a convolutional neural net that detects and combines predictive local clinical motifs to stratify the risk. Deepr permits transparent inspection and visualization of its inner working. We validate Deepr on hospital data to predict unplanned readmission after discharge. Deepr achieves superior accuracy compared to traditional techniques, detects meaningful clinical motifs, and uncovers the underlying structure of the disease and intervention space.

261 citations
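
The record-to-sequence step described in the Deepr abstract above can be sketched as follows. This is only an illustration under stated assumptions: the gap bins, the token names (`GAP_0-1m`, `TRANSFER`, and so on), the placement of the transfer marker, and the diagnosis codes in the sample record are all hypothetical, since the abstract says only that time gaps and hospital transfers are "coded".

```python
from datetime import date

# Hypothetical gap bins (upper bound in days -> token); not taken from the paper.
GAP_BINS = [(30, "GAP_0-1m"), (90, "GAP_1-3m"), (180, "GAP_3-6m"), (365, "GAP_6-12m")]
GAP_OVERFLOW = "GAP_12m+"


def gap_token(days):
    """Map a gap in days to a discrete token."""
    for upper, token in GAP_BINS:
        if days <= upper:
            return token
    return GAP_OVERFLOW


def record_to_sequence(visits):
    """Flatten visits into one token sequence.

    visits: list of (visit_date, codes, transferred) tuples sorted by date.
    A gap token is emitted between consecutive visits, and a TRANSFER
    token (an assumed marker) precedes the codes of a transferred visit.
    """
    sequence = []
    previous = None
    for when, codes, transferred in visits:
        if previous is not None:
            sequence.append(gap_token((when - previous).days))
        if transferred:
            sequence.append("TRANSFER")
        sequence.extend(codes)
        previous = when
    return sequence


# Made-up example record with three visits.
visits = [
    (date(2016, 1, 5), ["E11", "I10"], False),
    (date(2016, 1, 25), ["I50"], True),
    (date(2016, 9, 1), ["N18"], False),
]
sequence = record_to_sequence(visits)
print(sequence)
```

A sequence of this kind is what a convolutional net could then scan for local motifs; the network itself is not sketched here.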


Proceedings Article
01 Jan 2017
TL;DR: The Column Network (CLN) is a deep learning model for collective classification in multi-relational domains; it encodes multi-relations between any two instances and allows complex functions to be approximated at the network level with a small set of free parameters.
Abstract: Relational learning deals with data that are characterized by relational structures. An important task is collective classification, which is to jointly classify networked objects. While it holds great promise for better accuracy than non-collective classifiers, collective classification is computationally challenging and has not yet leveraged the recent breakthroughs in deep learning. We present Column Network (CLN), a novel deep learning model for collective classification in multi-relational domains. CLN has many desirable theoretical properties: (i) it encodes multi-relations between any two instances; (ii) it is deep and compact, allowing complex functions to be approximated at the network level with a small set of free parameters; (iii) local and relational features are learned simultaneously; (iv) long-range, higher-order dependencies between instances are supported naturally; and (v) crucially, learning and inference are efficient with linear complexity in the size of the network and the number of relations. We evaluate CLN on multiple real-world applications: (a) delay prediction in software projects, (b) PubMed Diabetes publication classification and (c) film genre classification. In all of these applications, CLN demonstrates a higher accuracy than state-of-the-art rivals.

108 citations


Proceedings ArticleDOI
Cheng Li1, Sunil Gupta1, Santu Rana1, Vu Nguyen1, Svetha Venkatesh1, Alistair Shilton1 
01 Jan 2017
TL;DR: In this article, a dropout strategy is proposed for high-dimensional Bayesian optimization that optimizes only a subset of variables at each iteration; theoretical bounds for the regret are derived and used to inform the derivation of the algorithm.
Abstract: Scaling Bayesian optimization to high dimensions is a challenging task, as the global optimization of a high-dimensional acquisition function can be expensive and often infeasible. Existing methods depend either on limited active variables or the additive form of the objective function. We propose a new method for high-dimensional Bayesian optimization that uses a dropout strategy to optimize only a subset of variables at each iteration. We derive theoretical bounds for the regret and show how they can inform the derivation of our algorithm. We demonstrate the efficacy of our algorithms for optimization on two benchmark functions and two real-world applications: training cascade classifiers and optimizing alloy composition.

85 citations
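
The idea of optimizing only a subset of variables per iteration, as described in the abstract above, can be sketched roughly as follows. Several things here are assumptions rather than the authors' method: the function name `dropout_propose`, the choice to fill the dropped-out coordinates by copying them from the best point found so far, the use of plain random search as a stand-in for the inner acquisition optimizer, and the toy acquisition function.

```python
import random


def dropout_propose(acquisition, best_x, d, bounds, n_candidates=200, rng=None):
    """Propose a point by searching over only d randomly chosen dimensions.

    acquisition: callable scoring a full D-dimensional point (higher is better).
    best_x:      best point so far; dropped-out coordinates are copied from it
                 (an assumed fill-in rule, not stated in the abstract).
    bounds:      list of (low, high) pairs, one per dimension.
    """
    rng = rng or random.Random(0)
    active = sorted(rng.sample(range(len(best_x)), d))
    best_candidate, best_value = None, float("-inf")
    # Random search stands in for a proper acquisition optimizer.
    for _ in range(n_candidates):
        candidate = list(best_x)
        for i in active:
            low, high = bounds[i]
            candidate[i] = rng.uniform(low, high)
        value = acquisition(candidate)
        if value > best_value:
            best_candidate, best_value = candidate, value
    return active, best_candidate


def toy_acquisition(x):
    """Hypothetical acquisition surface that peaks at the origin."""
    return -sum(v * v for v in x)


active, proposal = dropout_propose(
    toy_acquisition, best_x=[0.5] * 10, d=3, bounds=[(-1.0, 1.0)] * 10
)
print(active, proposal)
```

Only the three active coordinates change in the proposal, so the inner search runs in 3 dimensions instead of 10.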


Journal ArticleDOI
TL;DR: An iterative method is described that uses machine learning to optimise process development, incorporating multiple qualitative and quantitative objectives. The method is demonstrated with a novel fluid processing platform for synthesis of short polymer fibers, showing how the synthesis process can be efficiently directed to achieve material and process objectives.
Abstract: The discovery of processes for the synthesis of new materials involves many decisions about process design, operation, and material properties. Experimentation is crucial but as complexity increases, exploration of variables can become impractical using traditional combinatorial approaches. We describe an iterative method which uses machine learning to optimise process development, incorporating multiple qualitative and quantitative objectives. We demonstrate the method with a novel fluid processing platform for synthesis of short polymer fibers, and show how the synthesis process can be efficiently directed to achieve material and process objectives.

80 citations


Proceedings Article
Santu Rana1, Cheng Li1, Sunil Gupta1, Vu Nguyen1, Svetha Venkatesh1 
01 Jan 2017
TL;DR: An algorithm is proposed that enables local gradient-dependent algorithms to move through the flat terrain by using a sequence of gross-to-finer Gaussian process priors on the objective function, as there exists a large enough length-scale for which the acquisition function can be made to have a significant gradient at any location in the parameter space.
Abstract: Bayesian optimization is an efficient way to optimize expensive black-box functions, such as designing a new product with the highest quality or tuning the hyperparameters of a machine learning algorithm. However, it has a serious limitation when the parameter space is high-dimensional, as Bayesian optimization crucially depends on solving a global optimization of a surrogate utility function in the same number of dimensions. The surrogate utility function, commonly known as the acquisition function, is a continuous function but can be extremely sharp in high dimensions, having only a few peaks marooned in a large terrain of almost flat surface. Global optimization algorithms such as DIRECT are infeasible at higher dimensions and gradient-dependent methods cannot move if initialized in the flat terrain. We propose an algorithm that enables local gradient-dependent algorithms to move through the flat terrain by using a sequence of gross-to-finer Gaussian process priors on the objective function. We leverage two underlying facts: a) there exists a large enough length-scale for which the acquisition function can be made to have a significant gradient at any location in the parameter space, and b) the extrema of consecutive acquisition functions are close, since the functions differ only by a small difference in the length-scales. Theoretical guarantees are provided and experiments clearly demonstrate the utility of the proposed method on both benchmark test functions and real-world case studies.

73 citations


Journal ArticleDOI
TL;DR: This work explores the textual cues of online communities interested in depression, and finds that topics and psycholinguistic features are highly valid predictors of community subgroup.
Abstract: Depression is a highly prevalent mental health problem and is a co-morbidity of other mental, physical, and behavioural disorders. The internet allows individuals who are depressed or caring for those who are depressed, to connect with others via online communities; however, the characteristics of these discussions have not yet been fully explored. This work aims to explore the textual cues of online communities interested in depression. A total of 5,000 posts were randomly selected from 24 online communities. Five subgroups of online communities were identified: Depression, Bipolar Disorder, Self-Harm, Grief/Bereavement, and Suicide. Psycholinguistic features and content topics were extracted from the posts and analysed. Machine learning techniques were used to discriminate the online conversations in the depression communities from the other subgroups. Topics and psycholinguistic features were found to be highly valid predictors of community subgroup. Clear discrimination between linguistic features and topics, alongside good predictive power is an important step in understanding social media and its use in mental health.

53 citations


Proceedings Article
Vu Nguyen1, Sunil Gupta1, Santu Rana1, Cheng Li1, Svetha Venkatesh1 
01 Jan 2017
TL;DR: This analysis is the first to study a stopping criterion for EI to prevent unnecessary evaluations, and demonstrates empirically that EI using ymax is both more computationally efficient and more accurate than EI using μmax.
Abstract: Bayesian optimization (BO) is a sample-efficient method for global optimization of expensive, noisy, black-box functions using probabilistic methods. The performance of a BO method depends on its selection strategy through the acquisition function. Expected improvement (EI) is one of the most widely used acquisition functions for BO; it computes the expectation of the improvement function over the incumbent. The incumbent is usually selected as the best-observed value so far, termed ymax (for a maximization problem). Recent work has studied the convergence rate for EI under some mild assumptions or zero noise of observations. In particular, the work of Wang and de Freitas (2014) has derived the sublinear regret for EI under stochastic noise. However, due to the difficulty of the stochastic noise setting, and to make the convergence proof feasible, they use an alternative choice for the incumbent: the maximum of the Gaussian process predictive mean, μmax. This modification makes the algorithm computationally inefficient because it requires an additional global optimization step to estimate μmax that is costly and may be inaccurate. To address this issue, we derive a sublinear convergence rate for EI using the commonly used ymax. Moreover, our analysis is the first to study a stopping criterion for EI to prevent unnecessary evaluations. Our analysis complements the results of Wang and de Freitas (2014) to theoretically cover two incumbent settings for EI. Finally, we demonstrate empirically that EI using ymax is both more computationally efficient and more accurate than EI using μmax.

49 citations
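
For readers unfamiliar with expected improvement, the standard closed form under a Gaussian posterior (for maximization) can be sketched as below, with the incumbent taken as the best observed value ymax as in the abstract above. The function name, argument names, and sample numbers are illustrative assumptions; the posterior mean `mu` and standard deviation `sigma` would normally come from a fitted Gaussian process, which is not shown.

```python
import math


def expected_improvement(mu, sigma, incumbent):
    """Closed-form EI for maximization under a Gaussian posterior N(mu, sigma^2).

    EI = (mu - incumbent) * Phi(z) + sigma * phi(z), with z = (mu - incumbent) / sigma.
    """
    if sigma <= 0.0:
        # Degenerate posterior: improvement is deterministic.
        return max(mu - incumbent, 0.0)
    z = (mu - incumbent) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - incumbent) * cdf + sigma * pdf


# Incumbent as the best observed value so far (ymax); made-up observations.
observed = [0.2, 0.9, 0.5]
y_max = max(observed)
ei = expected_improvement(mu=1.0, sigma=0.3, incumbent=y_max)
print(ei)
```

Using ymax costs a single `max` over the observations, whereas μmax would require a separate global optimization of the posterior mean, which is the inefficiency the abstract points out.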


Journal ArticleDOI
TL;DR: Advances in lightning-fast cluster computing were employed to process large-scale data, consisting of 6.4 terabytes of data containing 3.8 billion records from all the media, demonstrating the capability of advanced techniques in machine learning to aid in the discovery of meaningful patterns from medical data, and social media data, at scale.

33 citations


Book ChapterDOI
23 May 2017
TL;DR: A unified framework for anomaly detection in video based on the restricted Boltzmann machine, a recent powerful method for unsupervised learning and representation learning, that can detect and localize the abnormalities at pixel level with better accuracy than those of baselines, and achieve competitive performance compared with state-of-the-art approaches.
Abstract: Automated detection of abnormal events in video surveillance is an important task in research and practical applications. This is, however, a challenging problem due to the growing collection of data without knowledge of what should be defined as “abnormal”, and the expensive feature engineering procedure. In this paper we introduce a unified framework for anomaly detection in video based on the restricted Boltzmann machine (RBM), a recent powerful method for unsupervised learning and representation learning. Our proposed system works directly on the image pixels rather than hand-crafted features; it learns new representations for data in a completely unsupervised manner without the need for labels, and then reconstructs the data to recognize the locations of abnormal events based on the reconstruction errors. More importantly, our approach can be deployed in both offline and streaming settings: the trained parameters of the model are fixed in the offline setting, whilst they are updated incrementally as video data arrive in a stream. Experiments on three public benchmark video datasets show that our proposed method can detect and localize the abnormalities at pixel level with better accuracy than the baselines, and achieve competitive performance compared with state-of-the-art approaches. Moreover, as RBM belongs to a wider class of deep generative models, our framework lays the groundwork towards a more powerful deep unsupervised abnormality detection framework.

29 citations


Journal ArticleDOI
TL;DR: The hierarchical semi-Markov conditional random field, a generalisation of embedded undirected Markov chains to model complex hierarchical, nested Markov processes, is presented, and efficient algorithms are developed for learning and constrained inference in a partially-supervised setting.

Journal ArticleDOI
TL;DR: The results showed that DistEn values are minimally affected by the variations of input parameters compared to ApEn and SampEn, and DistEn showed the most consistent and the best performance in differentiating physiological and pathological conditions with various input parameters among reported complexity measures.
Abstract: Distribution entropy (DistEn) is a recently developed measure of complexity that is used to analyse heart rate variability (HRV) data. Its calculation requires two input parameters: the embedding dimension m, and the number of bins M, which replaces the tolerance parameter r that is used by the existing approximation entropy (ApEn) and sample entropy (SampEn) measures. The performance of DistEn can also be affected by the data length N. In our previous studies, we have analyzed the stability and performance of DistEn with respect to one parameter (m or M) or a combination of two parameters (N and M). However, the impact of varying all three input parameters on DistEn has not yet been studied. Since DistEn is predominantly aimed at analysing short-length HRV signals, it is important to comprehensively study the stability, consistency and performance of the measure using multiple case studies. In this study, we examined the impact of changing input parameters on DistEn for synthetic and physiological signals. We also compared the variations of DistEn and its performance in distinguishing physiological (Elderly from Young) and pathological (Healthy from Arrhythmia) conditions with ApEn and SampEn. The results showed that DistEn values are minimally affected by the variations of input parameters compared to ApEn and SampEn. DistEn also showed the most consistent and the best performance in differentiating physiological and pathological conditions with various input parameters among reported complexity measures. In conclusion, DistEn is found to be the best measure for analysing short-length HRV time series.
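
To make the roles of the parameters m, M and N in the abstract above concrete, here is a minimal sketch of a distribution-entropy computation. The abstract does not spell out the algorithm, so the details are assumptions based on the commonly described form of DistEn: Chebyshev distances between all pairs of m-dimensional embedding vectors, an equal-width histogram with M bins, and Shannon entropy normalised by log2(M). The test signal is synthetic, not HRV data.

```python
import math
import random


def dist_en(signal, m=2, M=64):
    """Sketch of distribution entropy for a 1-D signal (assumed formulation)."""
    n_vec = len(signal) - m + 1
    if n_vec < 2 or M < 2:
        raise ValueError("need at least two embedding vectors and M >= 2")
    # Embedding vectors of dimension m.
    vectors = [signal[i:i + m] for i in range(n_vec)]
    # Chebyshev distance for every distinct pair (i < j).
    dists = [
        max(abs(a - b) for a, b in zip(vectors[i], vectors[j]))
        for i in range(n_vec)
        for j in range(i + 1, n_vec)
    ]
    low, high = min(dists), max(dists)
    if high == low:
        return 0.0  # all distances identical: a single occupied bin
    # Equal-width histogram with M bins over [low, high].
    width = (high - low) / M
    counts = [0] * M
    for dist in dists:
        counts[min(int((dist - low) / width), M - 1)] += 1
    total = len(dists)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts if c)
    return entropy / math.log2(M)  # normalised to [0, 1]


# Synthetic signal of length N = 100: a sine wave plus Gaussian noise.
rng = random.Random(0)
noisy = [math.sin(0.3 * i) + 0.1 * rng.gauss(0.0, 1.0) for i in range(100)]
value = dist_en(noisy, m=2, M=64)
print(value)
```

Because the histogram uses a fixed number of bins rather than a tolerance r, the measure has no threshold to tune, which is consistent with the abstract's description of M replacing r.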

Proceedings Article
01 Jan 2017
TL;DR: This paper studies the regret bound of two transfer learning algorithms in Bayesian optimisation and proposes a new way to model the difference between the source and target as a Gaussian process which is used to adapt the source data.
Abstract: This paper studies the regret bound of two transfer learning algorithms in Bayesian optimisation. The first algorithm models any difference between the source and target functions as a noise process. The second algorithm proposes a new way to model the difference between the source and target as a Gaussian process which is then used to adapt the source data. We show that in both cases the regret bounds are tighter than in the no transfer case. We also experimentally compare the performance of these algorithms relative to no transfer learning and demonstrate benefits of transfer learning.

Journal ArticleDOI
01 Oct 2017
TL;DR: Experimental results show that the kernel-based features gained significantly higher prediction performance than existing techniques, by up to 16.3%, suggesting the potential and applicability of the proposed features in a wide spectrum of applications on data analytics at population levels.
Abstract: When using tweets to predict population health index, due to the large scale of data, an aggregation of tweets by population has been a popular practice in learning features to characterize the population. This would alleviate the computational cost for extracting features on each individual tweet. On the other hand, much information on the population could be lost as the distribution of textual features of a population could be important for identifying the health index of that population. In addition, there could be relationships between features and those relationships could also convey predictive information of the health index. In this paper, we propose mid-level features namely kernel-based features for prediction of health indices of populations from social media data. The kernel-based features are extracted on the distributions of textual features over population tweets and encode the relationships between individual textual features in a kernel function. We implemented our features using three different kernel functions and applied them for two case studies of population health prediction: across-year prediction and across-county prediction. The kernel-based features were evaluated and compared with existing features on a dataset collected from the Behavioral Risk Factor Surveillance System dataset. Experimental results show that the kernel-based features gained significantly higher prediction performance than existing techniques, by up to 16.3%, suggesting the potential and applicability of the proposed features in a wide spectrum of applications on data analytics at population levels.

Posted Content
TL;DR: This paper uses the recently introduced Column Network for the expanded graph, resulting in a new end-to-end graph classification model dubbed Virtual Column Network (VCN), validated on two tasks: predicting bio-activity of chemical compounds, and finding software vulnerability from source code.
Abstract: Learning representation for graph classification turns a variable-size graph into a fixed-size vector (or matrix). Such a representation works nicely with algebraic manipulations. Here we introduce a simple method to augment an attributed graph with a virtual node that is bidirectionally connected to all existing nodes. The virtual node represents the latent aspects of the graph, which are not immediately available from the attributes and local connectivity structures. The expanded graph is then put through any node representation method. The representation of the virtual node is then the representation of the entire graph. In this paper, we use the recently introduced Column Network for the expanded graph, resulting in a new end-to-end graph classification model dubbed Virtual Column Network (VCN). The model is validated on two tasks: (i) predicting bio-activity of chemical compounds, and (ii) finding software vulnerability from source code. Results demonstrate that VCN is competitive against well-established rivals.
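
The graph augmentation step described in the abstract above, adding a virtual node bidirectionally connected to all existing nodes, can be sketched as follows. The adjacency-dictionary representation, the function name `add_virtual_node`, and the `__virtual__` identifier are assumptions for illustration; node attributes and the downstream node representation method (such as Column Network) are not modelled.

```python
def add_virtual_node(adjacency, virtual_id="__virtual__"):
    """Return a copy of the graph with a virtual node linked to every node.

    adjacency: dict mapping each node to a set of neighbour nodes (directed edges).
    The input graph is left unmodified.
    """
    if virtual_id in adjacency:
        raise ValueError("virtual_id clashes with an existing node")
    expanded = {node: set(neighbours) for node, neighbours in adjacency.items()}
    # Make sure nodes that only appear as edge targets are also present.
    for neighbours in adjacency.values():
        for node in neighbours:
            expanded.setdefault(node, set())
    real_nodes = list(expanded)
    # Bidirectional edges between the virtual node and every real node.
    expanded[virtual_id] = set(real_nodes)
    for node in real_nodes:
        expanded[node].add(virtual_id)
    return expanded


# Made-up three-node path graph.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
expanded = add_virtual_node(graph)
print(sorted(expanded["__virtual__"]))
```

After any node representation method is run on the expanded graph, the vector learned for the virtual node would serve as the representation of the whole graph.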

Journal ArticleDOI
TL;DR: Independent of small between-LGA differences in utilisation, and in contrast to the expected greater prevalence of osteoarthritis in disadvantaged populations, the data suggest low correlation between ‘need’ vs. ‘uptake’ of surgery in rural/regional areas.
Abstract: The names of the co-authors Steven Graves and Michelle Lorimer were missing from the manuscript supplied for publication. The lead authors regret this error and apologize for any inconvenience.

Posted Content
TL;DR: This work proposes to work with regular patterns whose unlabeled data is abundant and usually easy to collect in practice, which allows the system to be trained completely in an unsupervised procedure and liberate the author from the need for costly data annotation.
Abstract: Automated detection of abnormalities in data has been an active research area in recent years because of its diverse applications in practice, including video surveillance, industrial damage detection and network intrusion detection. However, building an effective anomaly detection system is a non-trivial task since it requires tackling the challenging issues of the shortage of annotated data, the inability to define anomaly objects explicitly and the expensive cost of the feature engineering procedure. Unlike existing approaches, which only partially solve these problems, we develop a unique framework to cope with the problems above simultaneously. Instead of handling the ambiguous definition of anomaly objects, we propose to work with regular patterns whose unlabeled data is abundant and usually easy to collect in practice. This allows our system to be trained completely in an unsupervised procedure and liberates us from the need for costly data annotation. By learning a generative model that captures the normality distribution in data, we can isolate abnormal data points that result in low normality scores (high abnormality scores). Moreover, by leveraging the power of generative networks, i.e. energy-based models, we are also able to learn the feature representation automatically rather than relying on hand-crafted features that have been dominating anomaly detection research over many decades. We demonstrate our proposal on the specific application of video anomaly detection, and the experimental results indicate that our method performs better than baselines and is comparable with state-of-the-art methods on many benchmark video anomaly detection datasets.

Proceedings ArticleDOI
Vu Nguyen1, Sunil Gupta1, Santu Rana1, Cheng Li1, Svetha Venkatesh1 
01 Jan 2017
TL;DR: This paper proposes the filtering expansion strategy for Bayesian optimization that starts from the initial region and gradually expands the search space, and develops an efficient algorithm for this strategy and derive its regret bound.
Abstract: Bayesian optimization (BO) has recently emerged as a powerful and flexible tool for hyper-parameter tuning and more generally for the efficient global optimization of expensive black-box functions. Systems implementing BO have successfully solved difficult problems in automatic design choices and machine learning hyper-parameter tuning. Many recent advances in the methodologies and theories underlying Bayesian optimization have extended the framework to new applications and provided greater insights into the behavior of these algorithms. Still, these established techniques always require a user-defined space to perform optimization. This pre-defined space specifies the ranges of hyper-parameter values. In many situations, however, it can be difficult to prescribe such spaces, as prior knowledge is often unavailable. Setting these regions arbitrarily can lead to inefficient optimization: if a space is too large, we can miss the optimum with a limited budget; on the other hand, if a space is too small, it may not contain the optimum point that we want to get. The unknown search space problem is intractable to solve in practice. Therefore, in this paper, we narrow down to consider specifically the setting of a "weakly specified" search space for Bayesian optimization. By weakly specified space, we mean that the pre-defined space is placed at a sufficiently good region so that the optimization can expand and reach the optimum. However, this pre-defined space need not include the global optimum. We tackle this problem by proposing the filtering expansion strategy for Bayesian optimization. Our approach starts from the initial region and gradually expands the search space. We develop an efficient algorithm for this strategy and derive its regret bound. These theoretical results are complemented by an extensive set of experiments on benchmark functions and two real-world applications which demonstrate the benefits of our proposed approach.

Proceedings ArticleDOI
03 Apr 2017
TL;DR: To predict the percentage of adults in a county reporting "insufficient sleep", a health behavior, and, at the same time, their health outcomes, novel textual and temporal features are proposed; the combination of kernel-based textual features and temporal information predicts both the health behavior and the health outcomes well.
Abstract: From 1984, the US has annually conducted the Behavioral Risk Factor Surveillance System (BRFSS) surveys to capture either health behaviors, such as drinking or smoking, or health outcomes, including mental, physical, and generic health, of the population. Although this kind of information at a population level, such as US counties, is important for local governments to identify local needs, traditional datasets may take years to collate and to become publicly available. Geocoded social media data can provide an alternative reflection of local health trends. In this work, to predict the percentage of adults in a county reporting "insufficient sleep", a health behavior, and, at the same time, their health outcomes, novel textual and temporal features are proposed. The proposed textual features are defined at mid-level and can be applied on top of various low-level textual features. They are computed via kernel functions on underlying features and encode the relationships between individual underlying features over a population. To further enrich the predictive ability of the health indices, the textual features are augmented with temporal information. We evaluated the proposed features and compared them with existing features using a dataset collected from the BRFSS. Experimental results show that the combination of kernel-based textual features and temporal information predict well both the health behavior (with best performance at rho=0.82) and health outcomes (with best performance at rho=0.78), demonstrating the capability of social media data in prediction of population health indices. The results also show that our proposed features gained higher correlation coefficients than did the existing ones, increasing the correlation coefficient by up to 0.16, suggesting the potential of the approach in a wide spectrum of applications on data analytics at population levels.

Posted Content
TL;DR: In this paper, the authors proposed a deep learning architecture that can effectively handle these challenges for predicting ICU mortality outcomes, which is based on Long Short-Term Memory (LSTM) and has layered attention mechanisms.
Abstract: Modeling physiological time-series in ICU is of high clinical importance. However, data collected within ICU are irregular in time and often contain missing measurements. Since absence of a measure would signify its lack of importance, the missingness is indeed informative and might reflect the decision making by the clinician. Here we propose a deep learning architecture that can effectively handle these challenges for predicting ICU mortality outcomes. The model is based on Long Short-Term Memory, and has layered attention mechanisms. At the sensing layer, the model decides whether to observe and incorporate parts of the current measurements. At the reasoning layer, evidences across time steps are weighted and combined. The model is evaluated on the PhysioNet 2012 dataset showing competitive and interpretable results.

Proceedings Article
01 Jan 2017
TL;DR: This work formulates a process-constrained batch Bayesian optimisation problem, proposes two algorithms, pc-BO(basic) and pc-BO(nested), and shows that the regret of pc-BO(nested) is sublinear.
Abstract: Prevailing batch Bayesian optimisation methods allow all control variables to be freely altered at each iteration. Real-world experiments, however, often have physical limitations making it time-consuming to alter all settings for each recommendation in a batch. This gives rise to a unique problem in BO: in a recommended batch, a set of variables that are expensive to experimentally change needs to be fixed, while the remaining control variables can be varied. We formulate this as a process-constrained batch Bayesian optimisation problem. We propose two algorithms, pc-BO(basic) and pc-BO(nested). pc-BO(basic) is simpler but lacks a convergence guarantee. In contrast, pc-BO(nested) is slightly more complex, but admits convergence analysis. We show that the regret of pc-BO(nested) is sublinear. We demonstrate the performance of both pc-BO(basic) and pc-BO(nested) by optimising benchmark test functions, tuning hyper-parameters of the SVM classifier, optimising the heat-treatment process for an Al-Sc alloy to achieve target hardness, and optimising the short polymer fibre production process.

Posted Content
TL;DR: This work presents a new distributed representation in deep neural nets wherein the information is represented in native form as a matrix, which differs from current neural architectures that rely on vector representations.
Abstract: We present a new distributed representation in deep neural nets wherein the information is represented in native form as a matrix. This differs from current neural architectures that rely on vector representations. We consider matrices as central to the architecture and they compose the input, hidden and output layers. The model representation is more compact and elegant -- the number of parameters grows only with the largest dimension of the incoming layer rather than the number of hidden units. We derive several new deep networks: (i) feed-forward nets that map an input matrix into an output matrix, (ii) recurrent nets which map a sequence of input matrices into a sequence of output matrices. We also reinterpret existing models for (iii) memory-augmented networks and (iv) graphs using matrix notations. For graphs we demonstrate how the new notations lead to simple but effective extensions with multiple attentions. Extensive experiments on handwritten digits recognition, face reconstruction, sequence to sequence learning, EEG classification, and graph-based node classification demonstrate the efficacy and compactness of the matrix architectures.

Journal ArticleDOI
TL;DR: This paper investigates and identifies latent meta-groups of online communities with and without mental health-related conditions including depression and autism, and analyzes sentiment-based, psycholinguistics-based and topic-based features from blog posts made by members of these online communities.
Abstract: Social media are an online means of interaction among individuals. People are increasingly using social media, especially online communities, to discuss health concerns and seek support. Understanding topics, sentiment, and structures of these communities informs important aspects of health-related conditions. There has been growing research interest in analysing online mental health communities; however, analysis of these communities with health concerns has been limited. This paper investigates and identifies latent meta-groups of online communities with and without mental health-related conditions including depression and autism. Large datasets from online communities were crawled. We analyse sentiment-based, psycholinguistics-based and topic-based features from blog posts made by members of these online communities. The work focuses on using nonparametric methods to infer latent topics automatically from the corpus of affective words in the blog posts. The visualization of the discovered meta-communities in their use of latent topics shows a difference between the groups. This presents evidence of the emotion-bearing difference in online mental health-related communities, suggesting a possible angle for support and intervention. The methodology might offer potential machine learning techniques for research and practice in psychiatry.

Journal ArticleDOI
TL;DR: A novel model that fills in missing values in EMR data and uses the new representation to predict key hospital events; the method is also extended to a supervised model for predicting multiple related risk outcomes in an integrated framework.
Abstract: Electronic medical records (EMRs) are being increasingly used for “risk” prediction. By “risks,” we denote outcomes such as emergency presentation, readmission, and the length of hospitalizations. However, EMR data analysis is complicated by missing entries. There are two reasons: the “primary reason for admission” is included in the EMR, but the comorbidities (other chronic diseases) are left uncoded, and many zero values in the data are accurate, reflecting that a patient has not accessed medical facilities. A key challenge is to deal with the peculiarities of this data: unlike many other datasets, EMR data is sparse, reflecting the fact that patients have some but not all diseases. We propose a novel model to fill in these missing values and use the new representation for prediction of key hospital events. To fill in missing values, we represent the feature-patient matrix as a product of two low-rank factors, preserving the sparsity property in the product. Intuitively, the product regularization allows sparse imputation of patient conditions reflecting common comorbidities across patients. We develop a scalable optimization algorithm based on the block coordinate descent method to find an optimal solution. We evaluate the proposed framework on two real-world EMR cohorts: Cancer (7000 admissions) and Acute Myocardial Infarction (AMI, 2652 admissions). Our results show that the AUC for 3-month emergency presentation prediction improves significantly, from 0.729 to 0.741 for the Cancer data and from 0.699 to 0.723 for the AMI data. Similarly, the AUC for 3-month emergency admission prediction improves from 0.730 to 0.752 for the Cancer data and from 0.682 to 0.724 for the AMI data. We also extend the proposed method to a supervised model for predicting multiple related risk outcomes (e.g., emergency presentations and hospital admissions over 3-, 6-, and 12-month periods) in an integrated framework. The supervised model consistently outperforms state-of-the-art baseline methods.

Journal ArticleDOI
TL;DR: The proposed framework for mixed-type multioutcome prediction uses a cumulative loss function composed of a specific loss function for each outcome type, and its predictive performance is shown to be better than that of several state-of-the-art baselines.
Abstract: Health analysis often involves prediction of multiple outcomes of mixed type. Existing work is restricted to either a limited number of outcomes or specific outcome types. We propose a framework for mixed-type multioutcome prediction. The framework uses a cumulative loss function composed of a specific loss function for each outcome type: for example, least squares (continuous outcome), hinge (binary outcome), Poisson (count outcome), and exponential (nonnegative outcome). To model these outcomes jointly, we impose a commonality across the prediction parameters through a common matrix normal prior. The framework is formulated as iterative optimization problems and solved using an efficient block-coordinate descent method. We empirically demonstrate both scalability and convergence. We apply the proposed model to a synthetic dataset and then to two real-world cohorts: a cancer cohort and an acute myocardial infarction cohort collected over a two-year period. We predict multiple emergency-related outcomes: for example, future emergency presentations (binary), emergency admissions (count), emergency length of stay in days (nonnegative), and emergency time to next admission in days (nonnegative). We show that the predictive performance of the proposed model is better than several state-of-the-art baselines.
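The cumulative loss described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's formulation: the function names, the log links for the Poisson and exponential terms, and the unweighted sum are hypothetical, and the matrix normal prior that couples the prediction parameters is omitted.

```python
import math


def squared_loss(y, f):
    """Least-squares loss for a continuous outcome."""
    return 0.5 * (y - f) ** 2


def hinge_loss(y, f):
    """Hinge loss for a binary outcome coded as -1 or +1."""
    return max(0.0, 1.0 - y * f)


def poisson_loss(y, f):
    """Poisson negative log-likelihood with rate exp(f), constant log(y!) dropped."""
    return math.exp(f) - y * f


def exponential_loss(y, f):
    """Exponential negative log-likelihood with mean exp(f), for a nonnegative outcome."""
    return f + y * math.exp(-f)


# Hypothetical registry mapping an outcome type to its loss function.
LOSSES = {
    "continuous": squared_loss,
    "binary": hinge_loss,
    "count": poisson_loss,
    "nonnegative": exponential_loss,
}


def cumulative_loss(outcomes):
    """Sum the type-specific losses over (type, target, score) triples."""
    return sum(LOSSES[kind](y, f) for kind, y, f in outcomes)


example = [
    ("continuous", 2.0, 1.0),
    ("binary", 1, 0.5),
    ("count", 3, 0.0),
    ("nonnegative", 2.0, 0.0),
]
total = cumulative_loss(example)
```

In a joint model the scores would come from per-outcome parameter vectors that share a prior; here they are fixed numbers so that each term can be checked by hand.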

Posted Content
TL;DR: Column Bundle is proposed, a novel deep neural network for capturing the shared statistics in data that demonstrates a comparable and competitive performance in all datasets against state-of-the-art methods designed specifically for each type.
Abstract: Much recent machine learning research has been directed towards leveraging shared statistics among labels, instances and data views, commonly referred to as multi-label, multi-instance and multi-view learning. The underlying premises are that there exist correlations among input parts and among output targets, and that predictive performance increases when these correlations are incorporated. In this paper, we propose Column Bundle (CLB), a novel deep neural network for capturing the shared statistics in data. CLB is generic in that the same architecture can be applied to various types of shared statistics by changing only input and output handling. CLB is capable of scaling to thousands of input parts and output labels by avoiding explicit modeling of pairwise relations. We evaluate CLB on different types of data: (a) multi-label, (b) multi-view, (c) multi-view/multi-label and (d) multi-instance. CLB demonstrates a comparable and competitive performance in all datasets against state-of-the-art methods designed specifically for each type.

Posted Content
TL;DR: An end-to-end model that reads medical records and predicts future risk, which adopts the algebraic view in that discrete medical objects are embedded into continuous vectors lying in the same space and the health trajectory is modeled using a recurrent neural network.
Abstract: Understanding the latent processes from Electronic Medical Records could be a game changer in modern healthcare. However, the processes are complex due to the interaction between at least three dynamic components: the illness, the care and the recording practice. Existing methods are inadequate in capturing the dynamic structure of care. We propose an end-to-end model that reads medical records and predicts future risk. The model adopts the algebraic view in that discrete medical objects are embedded into continuous vectors lying in the same space. The bag of diseases and comorbidities recorded at each hospital visit is modeled as a function of sets. The same holds for the bag of treatments. The interaction between diseases and treatments at a visit is modeled as the residual of the diseases minus the treatments. Finally, the health trajectory, which is a sequence of visits, is modeled using a recurrent neural network. We report preliminary results on chronic diseases - diabetes and mental health - for predicting unplanned readmission.
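A minimal sketch of the visit representation described in this abstract follows. It assumes mean pooling as the set function and a plain tanh recurrent cell; both are illustrative choices not stated in the abstract, and all names, embeddings and weights are hypothetical.

```python
import math


def mean_pool(codes, embedding):
    """Average the embedding vectors of a bag of codes (assumed set function)."""
    dim = len(next(iter(embedding.values())))
    if not codes:
        return [0.0] * dim
    return [sum(embedding[c][d] for c in codes) / len(codes) for d in range(dim)]


def visit_vector(diseases, treatments, embedding):
    """Residual of the pooled diseases minus the pooled treatments."""
    pooled_d = mean_pool(diseases, embedding)
    pooled_t = mean_pool(treatments, embedding)
    return [d - t for d, t in zip(pooled_d, pooled_t)]


def rnn_step(h, x, w_h, w_x):
    """One step of a plain recurrent cell: h_new = tanh(W_h h + W_x x)."""
    return [
        math.tanh(
            sum(w_h[i][j] * h[j] for j in range(len(h)))
            + sum(w_x[i][j] * x[j] for j in range(len(x)))
        )
        for i in range(len(w_h))
    ]


def encode_trajectory(visits, embedding, w_h, w_x):
    """Run the recurrent cell over a sequence of (diseases, treatments) visits."""
    h = [0.0] * len(w_h)
    for diseases, treatments in visits:
        h = rnn_step(h, visit_vector(diseases, treatments, embedding), w_h, w_x)
    return h


# Toy embeddings: diseases and treatments share one 2-dimensional space.
embedding = {
    "diabetes": [1.0, 0.0],
    "hypertension": [0.0, 1.0],
    "insulin": [0.5, 0.0],
}
v = visit_vector(["diabetes", "hypertension"], ["insulin"], embedding)
state = encode_trajectory(
    [(["diabetes", "hypertension"], ["insulin"])],
    embedding,
    w_h=[[0.0, 0.0], [0.0, 0.0]],
    w_x=[[1.0, 0.0], [0.0, 1.0]],
)
```

The final state would feed a risk classifier such as a logistic output layer; that part is left out because the abstract does not describe it.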

Posted Content
TL;DR: The Budgeted Batch Bayesian Optimization (B3O) is presented - it is shown empirically that the proposed B3O outperforms the existing fixed batch BO approaches in finding the optimum whilst requiring fewer evaluations, thus saving cost and time.
Abstract: Parameter settings profoundly impact the performance of machine learning algorithms and laboratory experiments. The classical grid search or trial-and-error methods are exponentially expensive in large parameter spaces, and Bayesian optimization (BO) offers an elegant alternative for global optimization of black box functions. In situations where the black box function can be evaluated at multiple points simultaneously, batch Bayesian optimization is used. Current batch BO approaches are restrictive in that they fix the number of evaluations per batch, and this can be wasteful when the number of specified evaluations is larger than the number of real maxima in the underlying acquisition function. We present the Budgeted Batch Bayesian Optimization (B3O) for hyper-parameter tuning and experimental design - we identify the appropriate batch size for each iteration in an elegant way. To set the batch size flexibly, we use the infinite Gaussian mixture model (IGMM) for automatically identifying the number of peaks in the underlying acquisition functions. We solve the intractability of estimating the IGMM directly from the acquisition function by formulating batch generalized slice sampling to efficiently draw samples from the acquisition function. We perform extensive experiments on both synthetic functions and two real-world applications - machine learning hyper-parameter tuning and experimental design for alloy hardening. We show empirically that the proposed B3O outperforms the existing fixed batch BO approaches in finding the optimum whilst requiring fewer evaluations, thus saving cost and time.
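The idea of letting the number of acquisition peaks set the batch size can be shown with a much simpler stand-in than the method in this abstract. The sketch below does not use an IGMM or slice sampling; it only counts strict local maxima of a one-dimensional acquisition function evaluated on a grid. The function names and the `max_batch` cap are hypothetical.

```python
def find_peak_indices(acquisition):
    """Indices of strict interior local maxima of a sampled 1-D function."""
    return [
        i
        for i in range(1, len(acquisition) - 1)
        if acquisition[i] > acquisition[i - 1] and acquisition[i] > acquisition[i + 1]
    ]


def propose_batch(grid, acquisition, max_batch):
    """Pick one point per peak, highest peak first, up to max_batch points."""
    peaks = find_peak_indices(acquisition)
    ranked = sorted(peaks, key=lambda i: acquisition[i], reverse=True)
    return [grid[i] for i in ranked[:max_batch]]


# Toy acquisition values on a grid of seven candidate settings.
grid = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
acquisition = [0.0, 1.0, 0.2, 2.0, 0.1, 0.5, 0.3]

batch = propose_batch(grid, acquisition, max_batch=5)
small_batch = propose_batch(grid, acquisition, max_batch=2)
```

With three peaks and a cap of five, only three points are proposed, which is the saving over a fixed batch size that the abstract argues for.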


Posted Content
04 Mar 2017
TL;DR: This work presents a new distributed representation in deep neural nets wherein the information is represented in native form as a matrix, which differs from current neural architectures that rely on vector representations.
Abstract: We present a new distributed representation in deep neural nets wherein the information is represented in native form as a matrix. This differs from current neural architectures that rely on vector representations. We consider matrices as central to the architecture and they compose the input, hidden and output layers. The model representation is more compact and elegant -- the number of parameters grows only with the largest dimension of the incoming layer rather than the number of hidden units. We derive several new deep networks: (i) feed-forward nets that map an input matrix into an output matrix, (ii) recurrent nets which map a sequence of input matrices into a sequence of output matrices. We also reinterpret existing models for (iii) memory-augmented networks and (iv) graphs using matrix notations. For graphs we demonstrate how the new notations lead to simple but effective extensions with multiple attentions. Extensive experiments on handwritten digit recognition, face reconstruction, sequence-to-sequence learning, EEG classification, and graph-based node classification demonstrate the efficacy and compactness of the matrix architectures.
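The compactness claim in this abstract can be made concrete with a small sketch. It assumes a bilinear matrix-to-matrix layer of the form H = tanh(UᵀXV + B), which is one natural way to map an input matrix to an output matrix; the abstract does not give the exact equations, so this form, the tanh nonlinearity and all function names are assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def matrix_layer(x, u, v, b):
    """Assumed matrix layer: tanh(U^T X V + B), mapping r x c to r' x c'."""
    z = matmul(matmul(transpose(u), x), v)
    return [
        [math.tanh(z[i][j] + b[i][j]) for j in range(len(z[0]))]
        for i in range(len(z))
    ]


def matrix_layer_params(r_in, c_in, r_out, c_out):
    """Parameters in U (r_in x r_out), V (c_in x c_out) and B (r_out x c_out)."""
    return r_in * r_out + c_in * c_out + r_out * c_out


def vector_layer_params(r_in, c_in, r_out, c_out):
    """Parameters of a dense layer on the flattened input and output, with bias."""
    return (r_in * c_in) * (r_out * c_out) + r_out * c_out


x = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]      # 2 x 3 input matrix
u = [[1.0, 0.0], [0.0, 1.0]]                  # 2 x 2 row mixing
v = [[1.0], [1.0], [1.0]]                     # 3 x 1 column mixing
b = [[0.0], [0.0]]                            # 2 x 1 bias
h = matrix_layer(x, u, v, b)

compact = matrix_layer_params(28, 28, 20, 20)
dense = vector_layer_params(28, 28, 20, 20)
```

For a 28 x 28 input mapped to a 20 x 20 hidden matrix, the assumed matrix layer has 1,520 parameters against 314,000 for the dense layer on flattened vectors.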