
Showing papers by "Vipin Kumar published in 2019"


Journal ArticleDOI
TL;DR: Some of the cross-cutting research themes in machine learning that are applicable across several geoscience problems, and the importance of a deep collaboration between machine learning and geosciences for synergistic advancements in both disciplines are discussed.
Abstract: Geosciences is a field of great societal relevance that requires solutions to several urgent problems facing humanity and the planet. As geosciences enters the era of big data, machine learning (ML), which has been widely successful in commercial domains, offers immense potential to contribute to problems in geosciences. However, geoscience applications introduce novel challenges for ML due to combinations of geoscience properties encountered in every problem, requiring novel research in machine learning. This article introduces researchers in the ML community to the challenges offered by geoscience problems and the opportunities that exist for advancing both machine learning and geosciences. We first highlight typical sources of geoscience data and describe their common properties. We then describe some of the common categories of geoscience problems where machine learning can play a role, discussing the challenges faced by existing ML methods and opportunities for novel ML research. We conclude by discussing some of the cross-cutting research themes in machine learning that are applicable across several geoscience problems, and the importance of a deep collaboration between machine learning and geosciences for synergistic advancements in both disciplines.

290 citations


Book ChapterDOI
31 Jan 2019
TL;DR: It is shown that a PGRNN can improve prediction accuracy over that of physical models, while generating outputs consistent with physical laws, and achieving good generalizability.
Abstract: This paper proposes a physics-guided recurrent neural network model (PGRNN) that combines RNNs and physics-based models to leverage their complementary strengths and improve the modeling of physical processes. Specifically, we show that a PGRNN can improve prediction accuracy over that of physical models, while generating outputs consistent with physical laws, and achieving good generalizability. Standard RNNs, even when producing superior prediction accuracy, often produce physically inconsistent results and lack generalizability. We further enhance this approach by using a pre-training method that leverages the simulated data from a physics-based model to address the scarcity of observed data. The PGRNN has the flexibility to incorporate additional physical constraints and we incorporate a density-depth relationship. Both enhancements further improve PGRNN performance. Although we present and evaluate this methodology in the context of modeling the dynamics of temperature in lakes, it is applicable more widely to a range of scientific and engineering disciplines where mechanistic (also known as process-based) models are used, e.g., power engineering, climate science, materials science, computational chemistry, and biomedicine.

190 citations
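The density-depth constraint the PGRNN abstract mentions can be illustrated as an extra penalty added to the training loss. The water-density formula below is a standard empirical fit for fresh water; the penalty form and the `lam` weighting are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def water_density(temp_c):
    """Empirical density of fresh water (kg/m^3) at temperature temp_c (deg C)."""
    return 1000.0 * (1.0 - (temp_c + 288.9414) * (temp_c - 3.9863) ** 2
                     / (508929.2 * (temp_c + 68.12963)))

def physics_guided_loss(pred, obs, lam=1.0):
    """Mean squared error plus a density-depth consistency penalty.

    pred, obs: temperature profiles ordered from surface to bottom.
    Denser water should sit deeper, so predicted density must be
    non-decreasing with depth; any decrease is penalized.
    """
    mse = np.mean((pred - obs) ** 2)
    dens = water_density(pred)
    violations = np.maximum(dens[:-1] - dens[1:], 0.0)
    return mse + lam * np.sum(violations)
```

A stable profile (temperature falling with depth above 4 °C) incurs no penalty, while an inverted profile is penalized even when it matches the observations, which is how physically inconsistent predictions are discouraged.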


Journal ArticleDOI
TL;DR: In this paper, the authors acknowledge support from the Department of the Interior North Central Climate Adaptation Science Center, the University of Minnesota, and the National Science Foundation (NSF).
Abstract: Department of the Interior Northeast Climate Adaptation Science Center; Midwest Glacial Lakes Fish Habitat Partnership grant through FWS; NSF Expedition in Computing Grant [1029711]; National Science Foundation (NSF) [EAR-PF-1725386]; Digital Technology Center at the University of Minnesota; Department of the Interior North Central Climate Adaptation Science Center; North Temperate Lakes Long-Term Ecological Research [NSF DEB-1440297]; Global Lake Ecological Observatory Network [NSF 1702991]

188 citations


Journal ArticleDOI
TL;DR: BridGE is developed, a computational approach for identifying pathways connected by genetic interactions from genome-wide genotype data that discovers significant interactions in Parkinson’s disease, schizophrenia, hypertension, prostate cancer, breast cancer, and type 2 diabetes.
Abstract: Genetic interactions have been reported to underlie phenotypes in a variety of systems, but the extent to which they contribute to complex disease in humans remains unclear. In principle, genome-wide association studies (GWAS) provide a platform for detecting genetic interactions, but existing methods for identifying them from GWAS data tend to focus on testing individual locus pairs, which undermines statistical power. Importantly, a global genetic network mapped for a model eukaryotic organism revealed that genetic interactions often connect genes between compensatory functional modules in a highly coherent manner. Taking advantage of this expected structure, we developed a computational approach called BridGE that identifies pathways connected by genetic interactions from GWAS data. Applying BridGE broadly, we discover significant interactions in Parkinson's disease, schizophrenia, hypertension, prostate cancer, breast cancer, and type 2 diabetes. Our novel approach provides a general framework for mapping complex genetic networks underlying human disease from genome-wide genotype data.

47 citations
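BridGE's exact statistics are not given in the abstract, but the core idea it describes, scoring interactions that bridge two pathways rather than testing individual locus pairs, can be sketched with a density score assessed by permutation. The function names and the permutation scheme below are illustrative assumptions:

```python
import numpy as np

def bridge_density(inter, path_a, path_b):
    """Mean interaction strength over all SNP pairs linking two pathways."""
    return inter[np.ix_(path_a, path_b)].mean()

def bridge_pvalue(inter, path_a, path_b, n_perm=500, seed=0):
    """Permutation p-value for the observed between-pathway density."""
    rng = np.random.default_rng(seed)
    obs = bridge_density(inter, path_a, path_b)
    n = inter.shape[0]
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        pa = perm[:len(path_a)]
        pb = perm[len(path_a):len(path_a) + len(path_b)]
        hits += bridge_density(inter, pa, pb) >= obs
    return (hits + 1) / (n_perm + 1)
```

Aggregating over all bridging pairs is what recovers the statistical power that per-pair tests lose to multiple-testing correction.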


Posted ContentDOI
TL;DR: In this paper, the authors present a case study of the Honey Bee Network's decentralized model for collecting, verifying and disseminating grassroots innovations and provide a road map for its replication in Africa.
Abstract: This paper presents a case study of the Honey Bee Network’s decentralized model for collecting, verifying and disseminating grassroots innovations and provides a road map for its replication in Africa. The Honey Bee Network brings together governmental and non‐governmental institutions, members of academia, scholars and a large number of volunteers. Through the Network’s activities, locally designed solutions and traditional knowledge with the potential to be refined and scaled up are scouted and members of the Network work with the innovators to help their ideas reach their commercial or non‐commercial potential. The Network has been involved in the sharing of grassroots technology developed in India with Kenya, notably a food processing machine, seed sowing device, and a small tractor. Through these pilot programs, actors at the grassroots had a chance to collaborate and co‐design solutions adapted to the Kenyan context. This experience revealed a willingness in Kenya to further invest in grassroots innovation initiatives, and Network members identified many conditions that would make Kenya the right choice for an African network hub, such as a rich traditional knowledge system and institutional willingness and recognition of the dynamism of the informal sector. Lessons from the Network’s experience in Kenya and its technology transfer program are collected and turned into recommendations for the development of a sister Network in Africa.

19 citations




Proceedings ArticleDOI
25 Jul 2019
TL;DR: A novel adversarial training approach for sequential data classification is developed by investigating when and how to perturb a sequence for effective data augmentation, and the proposed method is shown to outperform baselines on a diverse set of real-world sequential datasets.
Abstract: The last decade has witnessed a surge of interest in applying deep learning models for discovering sequential patterns from a large volume of data. Recent works show that deep learning models can be further improved by enforcing models to learn a smooth output distribution around each data point. This can be achieved by augmenting training data with slight perturbations that are designed to alter model outputs. Such adversarial training approaches have shown much success in improving the generalization performance of deep learning models on static data, e.g., transaction data or image data captured on a single snapshot. However, when applied to sequential data, the standard adversarial training approaches cannot fully capture the discriminative structure of a sequence. This is because real-world sequential data are often collected over a long period of time and may include much information that is irrelevant to the classification task. To this end, we develop a novel adversarial training approach for sequential data classification by investigating when and how to perturb a sequence for an effective data augmentation. Finally, we demonstrate the superiority of the proposed method over baselines on a diverse set of real-world sequential datasets.

16 citations
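The paper's selection of when and how to perturb is not reproduced here; this sketch shows the general recipe on a toy mean-pooled logistic model: rank timesteps by a salience proxy and apply an FGSM-style perturbation only to the top-k steps. The salience proxy and the toy model are illustrative assumptions:

```python
import numpy as np

def perturb_sequence(x, w, y, eps=0.1, top_k=2):
    """Return an adversarially perturbed copy of sequence x (shape T x d).

    Toy model: probability = sigmoid(mean_t x_t . w); the gradient of the
    cross-entropy loss w.r.t. each step is (p - y) * w / T.
    """
    T = x.shape[0]
    p = 1.0 / (1.0 + np.exp(-(x.mean(axis=0) @ w)))
    grad_step = (p - y) * w / T                 # dL/dx_t for this toy model
    salience = np.abs(x @ w)                    # per-step relevance proxy
    chosen = np.argsort(salience)[-top_k:]      # "when" to perturb
    x_adv = x.copy()
    x_adv[chosen] += eps * np.sign(grad_step)   # "how": FGSM-style step
    return x_adv, chosen
```

Perturbing only salient steps is what distinguishes this from static-data adversarial training, which would perturb the whole input uniformly.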



Proceedings ArticleDOI
01 Aug 2019
TL;DR: A generative model that combines multi-scale remote sensing data to detect croplands at high resolution is proposed; the model can track classification confidence in real time and potentially enable early detection of croplands.
Abstract: Effective and timely monitoring of croplands is critical for managing food supply. While remote sensing data from earth-observing satellites can be used to monitor croplands over large regions, this task is challenging for small-scale croplands as they cannot be captured precisely using coarse-resolution data. On the other hand, the remote sensing data in higher resolution are collected less frequently and contain missing or disturbed data. Hence, traditional sequential models cannot be directly applied on high-resolution data to extract temporal patterns, which are essential to identify crops. In this work, we propose a generative model to combine multi-scale remote sensing data to detect croplands at high resolution. During the learning process, we leverage the temporal patterns learned from coarse-resolution data to generate missing high-resolution data. Additionally, the proposed model can track classification confidence in real time and potentially lead to an early detection. The evaluation in an intensively cultivated region demonstrates the effectiveness of the proposed method in cropland detection.

7 citations
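The cross-scale idea, using temporal patterns from coarse-resolution data to fill gaps in the high-resolution series, can be illustrated with a much simpler stand-in than the paper's learned generative model: an affine fit on jointly observed time steps. Treat this purely as an illustration of the data setup:

```python
import numpy as np

def impute_high_res(high, coarse):
    """Fill NaN entries of a high-res series from a coarse-res series.

    Assumes the two series track the same phenology up to an affine
    relationship, estimated from time steps where both are observed.
    """
    observed = ~np.isnan(high)
    slope, intercept = np.polyfit(coarse[observed], high[observed], 1)
    filled = high.copy()
    filled[~observed] = slope * coarse[~observed] + intercept
    return filled
```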


01 Jan 2019
TL;DR: Use of a general-purpose HDQF provides a method to assess and visualize data quality and quickly identify areas for improvement, and the results show that data quality issues can be efficiently identified and visualized.
Abstract: The ability to assess data quality is essential for secondary use of EHR data and an automated Healthcare Data Quality Framework (HDQF) can be used as a tool to support a healthcare organization’s data quality initiatives. Use of a general purpose HDQF provides a method to assess and visualize data quality to quickly identify areas for improvement. The value of the approach is illustrated for two analytics use cases: 1) predictive models and 2) clinical quality measures. The results show that data quality issues can be efficiently identified and visualized. The automated HDQF is much less time consuming than a manual approach to data quality and the framework can be rerun repeatedly on additional datasets without much effort.
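The abstract does not enumerate HDQF's concrete checks; completeness and plausibility below are two data-quality dimensions commonly assessed for EHR data, shown as a minimal illustration of what an automated framework computes per field:

```python
def completeness(records, field):
    """Fraction of records with a non-missing value for the given field."""
    present = sum(1 for r in records if r.get(field) not in (None, ""))
    return present / len(records)

def plausibility(records, field, lo, hi):
    """Fraction of recorded values that fall inside a plausible range."""
    values = [r[field] for r in records if r.get(field) is not None]
    in_range = sum(1 for v in values if lo <= v <= hi)
    return in_range / len(values) if values else 0.0
```

Running such checks over every field and dataset, then visualizing the scores, is what replaces the manual data-quality review the abstract contrasts against.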

Proceedings ArticleDOI
21 Aug 2019
TL;DR: This work proposes a new knowledge-driven representation for clinical data mining as well as trajectory mining, called Severity Encoding Variables (SEVs), and studies which characteristics make representations most suitable for particular clinical analytics tasks including trajectory mining.
Abstract: Different analytic techniques operate optimally with different types of data. As the use of EHR-based analytics expands to newer tasks, data will have to be transformed into different representations, so the tasks can be optimally solved. We classified representations into broad categories based on their characteristics, and proposed a new knowledge-driven representation for clinical data mining as well as trajectory mining, called Severity Encoding Variables (SEVs). Additionally, we studied which characteristics make representations most suitable for particular clinical analytics tasks including trajectory mining. Our evaluation shows that, for regression, most data representations performed similarly, with SEV achieving a slight (albeit statistically significant) advantage. For patients at high risk of diabetes, it outperformed the competing representation by (relative) 20%. For association mining, SEV achieved the highest performance. Its ability to constrain the search space of patterns through clinical knowledge was key to its success.
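The paper's actual SEV definitions come from clinical knowledge and are not given in the abstract. This hypothetical example shows the flavor of the representation: a raw lab value is encoded as an ordinal severity level (the thresholds follow common HbA1c cut points and are used only for illustration):

```python
def hba1c_severity(value):
    """Map an HbA1c percentage to an ordinal severity code (illustrative)."""
    if value < 5.7:
        return 0   # normal
    if value < 6.5:
        return 1   # prediabetes
    if value < 8.0:
        return 2   # diabetes, moderate control
    return 3       # diabetes, poor control
```

Encoding clinical knowledge this way is what constrains the pattern search space, which the abstract credits for SEV's success in association mining.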

Proceedings ArticleDOI
01 Dec 2019
TL;DR: The causal rule mining framework is evaluated on the Electronic Health Records data of a large cohort of 152000 patients from Mayo Clinic and it is shown that the patterns extracted are sufficiently rich to explain the controversial findings in the medical literature regarding the effect of a class of cholesterol drugs on Type-II Diabetes Mellitus (T2DM).
Abstract: Our aging population increasingly suffers from multiple chronic diseases simultaneously, necessitating the comprehensive treatment of these conditions. Finding the optimal set of drugs for a combinatorial set of diseases is a combinatorial pattern exploration problem. Association rule mining is a popular tool for such problems, but the requirement of health care for finding causal, rather than associative, patterns renders association rule mining unsuitable. To address this issue, we propose a novel framework based on the Rubin-Neyman causal model for extracting causal rules from observational data, correcting for a number of common biases. Specifically, given a set of interventions and a set of items that define subpopulations (e.g., diseases), we wish to find all subpopulations in which effective intervention combinations exist and in each such subpopulation, we wish to find all intervention combinations such that dropping any intervention from this combination will reduce the efficacy of the treatment. A key aspect of our framework is the concept of closed intervention sets which extend the concept of quantifying the effect of a single intervention to a set of concurrent interventions. Closed intervention sets also allow for a pruning strategy that is strictly more efficient than the traditional pruning strategy used by the Apriori algorithm. To implement our ideas, we introduce and compare five methods of estimating causal effect from observational data and rigorously evaluate them on synthetic data to mathematically prove (when possible) why they work. We also evaluated our causal rule mining framework on the Electronic Health Records (EHR) data of a large cohort of 152000 patients from Mayo Clinic and showed that the patterns we extracted are sufficiently rich to explain the controversial findings in the medical literature regarding the effect of a class of cholesterol drugs on Type-II Diabetes Mellitus (T2DM).
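One standard Rubin-Neyman-style estimator of causal effect from observational data is inverse propensity weighting (IPW); whether it is among the paper's five estimators is not stated, so the sketch below is only an example of the kind of estimator the framework compares:

```python
import numpy as np

def ipw_ate(treated, outcome, propensity):
    """IPW estimate of the average treatment effect.

    treated: 0/1 treatment indicators; propensity: P(treated | covariates).
    Clipping the propensities avoids division blow-ups at the extremes.
    """
    t = np.asarray(treated, dtype=float)
    y = np.asarray(outcome, dtype=float)
    p = np.clip(np.asarray(propensity, dtype=float), 1e-3, 1 - 1e-3)
    return np.mean(t * y / p - (1 - t) * y / (1 - p))
```

On randomized data (propensity 0.5 everywhere) this reduces to the difference in group means, which is why it corrects the confounding biases that plain association rules ignore.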

Proceedings ArticleDOI
16 Mar 2019
TL;DR: MINT is a novel framework for model integration that captures extensive knowledge about models and data and aims to automatically compose them together to create valid end-to-end simulations.
Abstract: Understanding the interactions between natural processes and human activities poses major challenges as it requires the integration of models and data across disparate disciplines. It typically takes many months and even years to create valid end-to-end simulations as different models need to be configured in consistent ways and generate data that is usable by other models. MINT is a novel framework for model integration that captures extensive knowledge about models and data and aims to automatically compose them together. MINT guides a user to pose a well-formed modeling question, select and configure appropriate models, find and prepare appropriate datasets, compose data and models into end-to-end workflows, run the simulations, and visualize the results. MINT currently includes hydrology, agriculture, and socioeconomic models.
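MINT's knowledge representation is far richer than this, but the core of automatic composition can be sketched as dependency ordering: a model can run once every dataset it consumes has been produced by another model. The model and dataset names below are hypothetical:

```python
from graphlib import TopologicalSorter

def compose(models):
    """models: name -> (inputs, outputs). Return a valid execution order."""
    producers = {out: name
                 for name, (_, outs) in models.items() for out in outs}
    deps = {name: {producers[i] for i in ins if i in producers}
            for name, (ins, _) in models.items()}
    return list(TopologicalSorter(deps).static_order())
```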

Journal ArticleDOI
06 Dec 2019
TL;DR: This paper studies an ensemble learning based framework for automatically mapping plantations in southern Kalimantan on a yearly scale using remote sensing data and examines the effectiveness of several components in this framework, including class aggregation, data sampling, learning model selection and post-processing.
Abstract: Plantation mapping is important for understanding deforestation and climate change. While most existing plantation products are created manually, in this paper we study an ensemble learning based framework for automatically mapping plantations in southern Kalimantan on a yearly scale using remote sensing data. We study the effectiveness of several components in this framework, including class aggregation, data sampling, learning model selection and post-processing, by comparing with multiple baselines. In addition, we analyze the quality of our plantation mapping product by visual examination of high resolution images. We also compare our method to existing manually labeled plantation datasets and show that our method can achieve a better balance of precision (i.e., user’s accuracy) and recall (i.e., producer’s accuracy).
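The abstract does not specify the ensemble's combination rule; per-pixel majority voting below is a minimal illustration of combining several classifiers' plantation/non-plantation maps into one product:

```python
from collections import Counter

def majority_vote(model_maps):
    """model_maps: one label list per model, aligned by pixel index."""
    return [Counter(pixel_votes).most_common(1)[0][0]
            for pixel_votes in zip(*model_maps)]
```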


Patent
03 Oct 2019
TL;DR: A method of identifying land cover receives multi-spectral values for a plurality of locations at a plurality of times; for a selected location, a latent representation is determined at each time from the latent representation at the previous time and the multi-spectral values of nearby locations at the previous time, and the latent representation is used to predict land cover.
Abstract: A method of identifying land cover includes receiving multi-spectral values for a plurality of locations at a plurality of times. A location is selected and for each time in the plurality of times, a latent representation of the multi-spectral values is determined based on a latent representation of multi-spectral values determined for a previous time and multi-spectral values for the previous time of a plurality of other locations that are near the selected location. The determined latent representation is then used to predict a land cover for the selected location at the time.

Posted Content
TL;DR: The notion of a sub-interval relationship (SIR) is defined to capture interactions that are prominent only in certain sub-intervals of time, a fast algorithm with optimality guarantees is proposed to find the most interesting SIR in a pair of time series, and the method's utility and scalability are demonstrated on a real-world climate science dataset, yielding useful domain insights.
Abstract: Traditional approaches focus on finding relationships between two entire time series, however, many interesting relationships exist in small sub-intervals of time and remain feeble during other sub-intervals. We define the notion of a sub-interval relationship (SIR) to capture such interactions that are prominent only in certain sub-intervals of time. To that end, we propose a fast-optimal guaranteed algorithm to find most interesting SIR relationship in a pair of time series. Lastly, we demonstrate the utility of our method in climate science domain based on a real-world dataset along with its scalability scope and obtain useful domain insights.
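The paper's algorithm finds the strongest SIR quickly and with optimality guarantees; the exhaustive O(n^2) search below only illustrates what is being optimized: the sub-interval (of at least `min_len` steps) on which two series are most strongly correlated. The minimum-length constraint is an assumption made here to keep tiny spurious intervals out:

```python
import numpy as np

def best_sir(x, y, min_len=15):
    """Brute-force search for the sub-interval with the strongest correlation."""
    best_r, best_ij = 0.0, (0, min_len)
    n = len(x)
    for i in range(n - min_len + 1):
        for j in range(i + min_len, n + 1):
            xs, ys = x[i:j], y[i:j]
            if xs.std() == 0 or ys.std() == 0:
                continue  # correlation undefined on constant segments
            r = np.corrcoef(xs, ys)[0, 1]
            if abs(r) > abs(best_r):
                best_r, best_ij = r, (i, j)
    return best_r, best_ij
```

Two series that look unrelated over their full length can still contain a sub-interval where they move in lockstep, which is exactly the case the full-series correlation misses.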

Posted Content
TL;DR: An overview is provided of how recent advances in machine learning and the availability of data from earth-observing satellites can dramatically improve our ability to automatically map croplands over long periods and over large regions.
Abstract: This paper provides an overview of how recent advances in machine learning and the availability of data from earth-observing satellites can dramatically improve our ability to automatically map croplands over long periods and over large regions. It discusses three applications in the domain of crop monitoring where ML approaches are beginning to show great promise. For each application, it highlights machine learning challenges, proposed approaches, and recent results. The paper concludes with a discussion of major challenges that need to be addressed before ML approaches will reach their full potential for this problem of great societal relevance.

Proceedings ArticleDOI
01 Dec 2019
TL;DR: This work presents a multi-view framework to classify spatio-temporal phenomena at multiple resolutions that utilizes the complementarity of features across different resolutions and improves the corresponding models by enforcing consistency of their predictions on unlabeled data.
Abstract: In this work, we present a multi-view framework to classify spatio-temporal phenomena at multiple resolutions. This approach utilizes the complementarity of features across different resolutions and improves the corresponding models by enforcing consistency of their predictions on unlabeled data. Unlike traditional multi-view learning problems, the key challenge in our case is that there is a many-to-one correspondence between instances across different resolutions, which needs to be explicitly modeled. Experiments on the real-world application of mapping urban areas using spatial raster datasets from satellite observations show the benefits of the proposed multi-view framework.
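One simple way to model the many-to-one correspondence the abstract highlights, used here as an assumed illustration rather than the paper's formulation, is to compare each coarse-resolution prediction with the mean of the predictions for its fine-resolution instances:

```python
import numpy as np

def consistency_loss(fine_probs, coarse_probs, groups):
    """Penalize cross-resolution disagreement on unlabeled data.

    groups: coarse index -> list of fine indices mapped to that coarse cell.
    """
    total = 0.0
    for coarse_idx, fine_idx in groups.items():
        total += (fine_probs[fine_idx].mean() - coarse_probs[coarse_idx]) ** 2
    return total / len(groups)
```

Minimizing such a term on unlabeled data couples the two resolution-specific models, which is how each view's complementary features can improve the other.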