
Showing papers by "Pang-Ning Tan published in 2015"


Journal ArticleDOI
TL;DR: The largest challenge of this task was the heterogeneity of the data, formats, and metadata; the documented integration procedures nevertheless make the large, complex database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data.
Abstract: Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database. Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.
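To make the two-module design concrete, below is a minimal, hypothetical pandas sketch of how a LAGOS-style water quality module and geospatial module could be joined on a common lake identifier. All table, column, and value choices are invented for illustration and are not the actual LAGOS schema.

```python
# Hypothetical sketch of the two-module design; names and values are
# invented for illustration, not the actual LAGOS schema.
import pandas as pd

# LAGOSLIMNO-style module: water quality observations, many rows per lake.
limno = pd.DataFrame({
    "lagos_lake_id": [101, 101, 205],
    "sample_date": pd.to_datetime(["2005-07-01", "2006-07-15", "2005-08-03"]),
    "total_phosphorus_ugL": [12.0, 15.5, 40.2],
})

# LAGOSGEO-style module: geospatial predictors, one row per lake.
geo = pd.DataFrame({
    "lagos_lake_id": [101, 205],
    "lake_area_ha": [52.3, 8.7],
    "watershed_agriculture_pct": [10.5, 64.0],
})

# A shared lake identifier is what makes the database extensible: a newly
# compiled water quality dataset only needs to be mapped to this key to
# inherit every geospatial predictor.
analysis_table = limno.merge(geo, on="lagos_lake_id", how="left")
print(analysis_table)
```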

104 citations



Proceedings Article
01 Jan 2015
TL;DR: A novel approach called FactORized MUlti-task LeArning model (Formula) learns a personalized model for each patient via a sparse multi-task learning method; it delivered superior predictive performance while the personalized models offered many useful medical insights.
Abstract: Medical predictive modeling is a challenging problem due to the heterogeneous nature of the patients. In order to build effective medical predictive models, we need to address this heterogeneity during modeling and allow patients to have their own personalized models instead of using a one-size-fits-all model. However, building a personalized model for each patient is computationally expensive, and the over-parametrization makes the model susceptible to overfitting. To address these challenges, we propose a novel approach called FactORized MUlti-task LeArning model (Formula), which learns the personalized model of each patient via a sparse multi-task learning method. The personalized models are assumed to share a low-rank representation, known as the base models. Formula is designed to simultaneously learn the base models as well as the personalized model of each patient, where the latter is a linear combination of the base models. We have performed extensive experiments to evaluate the proposed approach on a real medical data set. The proposed approach delivered superior predictive performance while the personalized models offered many useful medical insights.
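The factorization at the core of Formula can be sketched in a few lines of numpy: each patient's personalized linear model w_i = B c_i is a combination of k shared base models (the columns of B), and learning alternates between the per-patient combination weights and the shared bases. The sketch below is a toy illustration under assumed sizes, penalties, and update rules, not the authors' actual optimization procedure.

```python
# Toy sketch of factorized multi-task learning: personalized models are
# sparse combinations of shared base models. Sizes, penalties, and the
# alternating updates are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_features, k = 40, 15, 3

# Synthetic stand-in for per-patient records: X[i] holds one patient's
# feature rows, y[i] the outcomes.
B_true = rng.normal(size=(n_features, k))
X = [rng.normal(size=(25, n_features)) for _ in range(n_patients)]
y = [X[i] @ (B_true @ rng.normal(size=k)) + 0.1 * rng.normal(size=25)
     for i in range(n_patients)]

B = rng.normal(size=(n_features, k))   # shared base models
C = np.zeros((k, n_patients))          # per-patient combination weights
lam = 0.05                             # sparsity level (assumed)

def soft_threshold(v, t):
    """Proximal step for an L1 penalty (a crude sparsity proxy)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

for _ in range(10):
    # Step 1: fix B, update each patient's sparse combination weights.
    for i in range(n_patients):
        Z = X[i] @ B                   # patient's design in base-model space
        c = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(k), Z.T @ y[i])
        C[:, i] = soft_threshold(c, lam)
    # Step 2: fix C, refit the shared bases by least squares, using
    # X_i B c_i = (c_i^T kron X_i) vec(B) with column-major vec.
    M = np.vstack([np.kron(C[:, i][None, :], X[i]) for i in range(n_patients)])
    vecB, *_ = np.linalg.lstsq(M, np.concatenate(y), rcond=None)
    B = vecB.reshape(n_features, k, order="F")

# Each patient's personalized linear model is a column of W = B @ C.
W = B @ C
print(W.shape)  # (n_features, n_patients)
```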

27 citations


Proceedings ArticleDOI
01 Oct 2015
TL;DR: A truncated exponential kernel is introduced to represent spatial contiguity constraints for region delineation using constrained spectral clustering, and a Hadamard product that combines the kernel with the landscape feature similarity matrix is shown to produce regions that are more spatially contiguous than those of baseline algorithms.
Abstract: A regionalization system delineates the geographical landscape into spatially contiguous, homogeneous units for landscape ecology research and applications. In this study, we investigated a quantitative approach for developing a regionalization system using constrained clustering algorithms. Unlike conventional clustering, constrained clustering uses domain constraints to help guide the clustering process towards finding a desirable solution. For region delineation, the adjacency relationship between neighboring spatial units can be provided as constraints to ensure that the resulting regions are geographically connected. However, using a large-scale terrestrial ecology data set as our case study, we showed that incorporating such constraints into existing constrained clustering algorithms is not straightforward. First, the algorithms must carefully balance the trade-off between spatial contiguity and landscape homogeneity of the regions. Second, the effectiveness of the algorithms strongly depends on how the spatial constraints are represented and incorporated into the clustering framework. In this paper, we introduced a truncated exponential kernel to represent spatial contiguity constraints for region delineation using constrained spectral clustering. We also showed that a Hadamard product approach that combines the kernel with the landscape feature similarity matrix can produce regions that are more spatially contiguous than those produced by baseline algorithms.
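The kernel construction described above is straightforward to prototype. The sketch below builds a landscape feature similarity matrix, a truncated exponential spatial kernel, and their Hadamard product, then feeds the combined affinity to off-the-shelf spectral clustering; all data, kernel parameters, and cluster counts are assumed values for illustration rather than the paper's settings.

```python
# Illustrative Hadamard-product affinity for spatially contiguous
# clustering; parameters are assumptions, not the paper's settings.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(42)
n = 200
coords = rng.uniform(0, 100, size=(n, 2))   # spatial unit centroids (e.g. km)
features = rng.normal(size=(n, 5))          # landscape variables per unit

# Landscape feature similarity: RBF kernel on feature distances.
feat_d = cdist(features, features)
S_feat = np.exp(-(feat_d ** 2) / (2 * np.median(feat_d) ** 2))

# Truncated exponential kernel on spatial distance: similarity decays
# with distance and is cut to zero beyond a truncation radius, so only
# nearby units can end up in the same region.
spat_d = cdist(coords, coords)
radius, scale = 20.0, 10.0                  # assumed truncation/decay values
S_spat = np.where(spat_d <= radius, np.exp(-spat_d / scale), 0.0)

# Hadamard (element-wise) product: high affinity only for units that are
# both similar in landscape features AND spatially close.
S = S_feat * S_spat

labels = SpectralClustering(
    n_clusters=8, affinity="precomputed", random_state=0
).fit_predict(S)
print(np.bincount(labels))                  # region sizes
```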

13 citations


Proceedings ArticleDOI
17 Oct 2015
TL;DR: In this article, a hierarchical learning method known as MF-Tree is proposed to efficiently classify data sets with a large number of classes while simultaneously inducing a taxonomy structure that captures relationships among the classes.
Abstract: Many big data applications require accurate classification of objects into one of possibly thousands or millions of categories. Such classification tasks are challenging due to issues such as class imbalance, high testing cost, and model interpretability problems. To overcome these challenges, we propose a novel hierarchical learning method known as MF-Tree to efficiently classify data sets with a large number of classes while simultaneously inducing a taxonomy structure that captures relationships among the classes. Unlike many other existing hierarchical learning methods, our approach is designed to optimize a global objective function. We demonstrate the equivalence between our proposed regularized loss function and the Hilbert-Schmidt Independence Criterion (HSIC). The latter has a useful additive property, which allows us to decompose the multi-class learning problem into hierarchical binary classification tasks. To improve its training efficiency, an approximate algorithm for inducing MF-Tree is also proposed. We performed extensive experiments to compare MF-Tree against several state-of-the-art algorithms and showed both its effectiveness and efficiency when applied to real-world data sets.
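The tree-of-binary-tasks idea can be illustrated with a toy recursive partitioner: at each node the remaining classes are split into two super-classes and a binary classifier is trained to route examples down the tree. The sketch below uses 2-means on class centroids as a simple stand-in for the paper's HSIC-based split objective, so it demonstrates the structure, not the actual MF-Tree algorithm.

```python
# Toy label-tree builder: recursively split classes into two super-classes
# and train one binary router per internal node. The split criterion
# (2-means on class centroids) is a stand-in, not the MF-Tree objective.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def build_tree(X, y, classes):
    if len(classes) == 1:
        return classes[0]                       # leaf: a single class label
    # Group classes by clustering their centroids into two super-classes.
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    side = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(centroids)
    left = [c for c, s in zip(classes, side) if s == 0]
    right = [c for c, s in zip(classes, side) if s == 1]
    if not left or not right:                   # degenerate split: halve the list
        left, right = classes[: len(classes) // 2], classes[len(classes) // 2:]
    # Binary task at this node: route examples left vs right.
    mask = np.isin(y, classes)
    target = np.isin(y[mask], right).astype(int)
    clf = LogisticRegression(max_iter=1000).fit(X[mask], target)
    return (clf, build_tree(X, y, left), build_tree(X, y, right))

def predict_one(node, x):
    if not isinstance(node, tuple):
        return node                             # reached a leaf class
    clf, left, right = node
    return predict_one(right if clf.predict(x[None, :])[0] == 1 else left, x)

# Synthetic demo: 6 classes with shifted means, 100 points each.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10)) + np.repeat(np.arange(6), 100)[:, None]
y = np.repeat(np.arange(6), 100)
tree = build_tree(X, y, list(range(6)))
print(predict_one(tree, X[0]), y[0])
```

Besides scaling prediction cost logarithmically in the number of classes, the induced tree itself is the interpretable taxonomy the abstract refers to.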

4 citations

