Author

Jong Lee

Bio: Jong Lee is an academic researcher from the University of Illinois at Urbana–Champaign. The author has contributed to research in topics including Cyberinfrastructure and Data management. The author has an h-index of 6 and has co-authored 15 publications receiving 123 citations. Previous affiliations of Jong Lee include the National Center for Supercomputing Applications.

Papers
Proceedings ArticleDOI
31 Aug 2015
TL;DR: This paper describes the conceptual framework motivating the SEAD project and the suite of data services developed and deployed as an initial implementation of this approach, and identifies some key architectural features of the approach as well as open challenges to fully realizing its value in the broad ecosystem of cyberinfrastructure.
Abstract: When the effort to curate and preserve data is made at the end of a project, there is little opportunity to leverage ongoing research work to reduce curation costs or, conversely, to leverage curation efforts to improve research productivity. In the Sustainable Environment Actionable Data (SEAD) project, we have envisioned a more active approach to data curation and preservation in which these processes occur in parallel with research and generate sufficient short- and long-term return on researcher investments for self-interest to drive their adoption. In this paper, we describe the conceptual framework motivating the SEAD project and the suite of data services we have developed and deployed as an initial implementation of this approach. Use cases in which these services can reduce curation effort and aid ongoing research are highlighted and, based on our experience to date, we identify some key architectural features of our approach as well as open challenges to fully realizing the value of this approach in the broad ecosystem of cyberinfrastructure.

29 citations

Journal ArticleDOI
TL;DR: A set of workflows is demonstrated that facilitates rapid and repeatable creation of GI landscape designs, which are incorporated into complex ecohydrological models using web applications and services at watershed scales sufficient to address diverse ecosystem service goals.
Abstract: Land use planners, landscape architects, and water resource managers are using Green Infrastructure (GI) designs in urban environments to promote ecosystem services including mitigation of storm water flooding and water quality degradation. An expanded set of urban sustainability goals also includes increasing carbon sequestration and songbird habitat, reducing urban heat island effects, and improving landscape aesthetics. GI is conceptualized to improve water and ecosystem quality by reducing storm water runoff at the source but, when properly designed, may also benefit these expanded goals. With the increasing use of GI in urban contexts, there is an emerging need to facilitate participatory design and scenario evaluation to enable better communication between GI designers and groups impacted by these designs. A major barrier to this type of public participation is the complexity of parameterizing, operating, visualizing, and interpreting the results of complex ecohydrological models at the watershed scales sufficient to address diverse ecosystem service goals. This paper demonstrates a set of workflows to facilitate rapid and repeatable creation of GI landscape designs, which are incorporated into complex models using web applications and services. For this project, we use the RHESSys (Regional Hydro-Ecologic Simulation System) ecohydrologic model to evaluate participatory GI landscape designs generated by stakeholders and decision makers, but note that the workflow could be adapted to a set of other watershed models.
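The abstract does not specify the web service interface behind these workflows, but a minimal sketch can illustrate the participatory-design loop it describes: a stakeholder-drawn GI design is submitted to a hypothetical model-execution service that translates the features into model parameters and queues a watershed run. The service URL, endpoint, payload schema, and response fields below are illustrative assumptions, not the project's published interface.

```python
# Hypothetical sketch: submit a participatory GI design to a web service that
# parameterizes and runs a watershed model such as RHESSys. The URL, endpoint,
# and payload schema are illustrative assumptions, not the project's API.
import requests

SERVICE = "https://gi-designer.example.org/api/runs"  # placeholder endpoint

# A stakeholder-drawn design: two GI features expressed as minimal GeoJSON.
design = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-88.2272, 40.1106]},
         "properties": {"gi_type": "rain_garden", "area_m2": 40}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-88.2301, 40.1099]},
         "properties": {"gi_type": "bioswale", "area_m2": 25}},
    ],
}

# Submit the design along with the scenario to evaluate; the service is
# assumed to convert the features into model parameters and queue a run.
resp = requests.post(SERVICE, json={"watershed": "example-watershed",
                                    "scenario": "10yr-storm",
                                    "design": design})
resp.raise_for_status()
print("Run queued:", resp.json().get("run_id"))
```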

22 citations

Proceedings ArticleDOI
22 Jul 2018
TL;DR: Some of the challenges encountered in designing and developing a system that can be easily adapted to different scientific areas are discussed, including support for large amounts of data, horizontal scaling of domain-specific preprocessing algorithms, and the ability to provide new data visualizations in the web browser.
Abstract: Clowder is an open source data management system to support data curation of long tail data and metadata across multiple research domains and diverse data types. Institutions and labs can install and customize their own instance of the framework on local hardware or on remote cloud computing resources to provide a shared service to distributed communities of researchers. Data can be ingested directly from instruments or manually uploaded by users and then shared with remote collaborators using a web front end. We discuss some of the challenges encountered in designing and developing a system that can be easily adapted to different scientific areas, including digital preservation, geoscience, materials science, medicine, social science, cultural heritage, and the arts. Some of these challenges include support for large amounts of data, horizontal scaling of domain-specific preprocessing algorithms, the ability to provide new data visualizations in the web browser, a comprehensive web service API for automatic data ingestion and curation, a suite of social annotation and metadata management features to support data annotation by communities of users and algorithms, and a web-based front end to interact with code running on heterogeneous clusters, including HPC resources.
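As a rough illustration of the automatic-ingestion pattern the abstract mentions, the Python sketch below creates a dataset on a Clowder instance and uploads a file to it over the REST API. The endpoint paths, instance URL, API-key handling, and response fields shown are assumptions based on typical Clowder deployments rather than details taken from the paper; check a specific instance's API documentation before relying on them.

```python
# Hypothetical sketch of programmatic ingestion into a Clowder instance.
# Endpoint paths, host, auth scheme, and response fields are assumptions.
import requests

BASE = "https://clowder.example.org"   # placeholder instance URL
KEY = {"key": "YOUR_API_KEY"}          # placeholder API key

# Create an empty dataset to hold the uploaded data.
resp = requests.post(f"{BASE}/api/datasets/createempty",
                     params=KEY,
                     json={"name": "Sensor run 2018-07-22",
                           "description": "Raw instrument output"})
resp.raise_for_status()
dataset_id = resp.json()["id"]

# Upload a local file into the new dataset; extractors registered on the
# instance could then run domain-specific preprocessing on it automatically.
with open("run001.csv", "rb") as f:
    upload = requests.post(f"{BASE}/api/uploadToDataset/{dataset_id}",
                           params=KEY,
                           files={"File": f})
upload.raise_for_status()
print("Uploaded file id:", upload.json()["id"])
```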

20 citations

Proceedings ArticleDOI
29 Oct 2015
TL;DR: Brown Dog is presented: two highly extensible services that aim to leverage any existing pieces of code, libraries, services, or standalone software towards providing users with a simple-to-use and programmable means of automated aid in the curation and indexing of distributed collections of uncurated and/or unstructured data.
Abstract: We present Brown Dog, two highly extensible services that aim to leverage any existing pieces of code, libraries, services, or standalone software (past or present) towards providing users with a simple-to-use and programmable means of automated aid in the curation and indexing of distributed collections of uncurated and/or unstructured data. Data collections such as these, encompassing large varieties of data in addition to large amounts of data, pose a significant challenge within modern-day "Big Data" efforts. The two services, the Data Access Proxy (DAP) and the Data Tilling Service (DTS), focusing on format conversions and content-based analysis/extraction respectively, wrap relevant conversion and extraction operations within arbitrary software, manage their deployment in an elastic manner, and manage job execution from behind a deliberately compact REST API. We describe the motivation and need/scientific drivers for such services, the constituent components that allow arbitrary software/code to be used and managed, and lastly an evaluation of the systems' capabilities and scalability.
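To make the "compact REST API" pattern concrete, the sketch below submits a file to a format-conversion endpoint and polls for the converted result, which is one way a DAP-style service could be driven programmatically. The host name, endpoint path, and response conventions are hypothetical placeholders, not the published DAP/DTS interface.

```python
# Hypothetical sketch of driving a format-conversion service behind a compact
# REST API, in the spirit of the Data Access Proxy. Host, path, and response
# conventions are illustrative assumptions, not the published interface.
import time
import requests

BASE = "https://dap.example.org"       # placeholder service URL

# Ask the service to convert an uploaded file into a target format.
with open("drawing.dwg", "rb") as f:
    job = requests.post(f"{BASE}/convert/pdf", files={"file": f})
job.raise_for_status()
result_url = job.text.strip()          # assume the service returns a URL to poll

# Poll until the converted file is ready, then download it.
for _ in range(30):
    out = requests.get(result_url)
    if out.status_code == 200:
        with open("drawing.pdf", "wb") as g:
            g.write(out.content)
        break
    time.sleep(2)                      # conversion still running; wait and retry
else:
    raise TimeoutError("conversion did not finish in time")
```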

17 citations

01 Jan 2014
TL;DR: The Reproducibility@XSEDE workshop, as discussed by the authors, focused on reproducibility in large-scale computational research and highlighted four areas of particular interest to XSEDE: documentation and training that promotes reproducible research; system-level tools that provide build- and run-time information at the level of the individual job; the need to model best practices in research collaborations involving XSEDE staff; and continued work on gateways and related technologies.
Abstract: This is the final report on Reproducibility@XSEDE, a one-day workshop held in conjunction with XSEDE14, the annual conference of the Extreme Science and Engineering Discovery Environment (XSEDE). The workshop's discussion-oriented agenda focused on reproducibility in large-scale computational research. Two important themes capture the spirit of the workshop submissions and discussions: (1) organizational stakeholders, especially supercomputer centers, are in a unique position to promote, enable, and support reproducible research; and (2) individual researchers should conduct each experiment as though someone will replicate that experiment. Participants documented numerous issues, questions, technologies, practices, and potentially promising initiatives emerging from the discussion, but also highlighted four areas of particular interest to XSEDE: (1) documentation and training that promotes reproducible research; (2) system-level tools that provide build- and run-time information at the level of the individual job; (3) the need to model best practices in research collaborations involving XSEDE staff; and (4) continued work on gateways and related technologies. In addition, an intriguing question emerged from the day's interactions: would there be value in establishing an annual award for excellence in reproducible research?

15 citations


Cited by
Journal ArticleDOI
TL;DR: The Whole Tale project, as discussed by the authors, aims to connect computational, data-intensive research efforts with the larger research process, transforming the knowledge discovery and dissemination process into one where data products are united with research articles to create "living publications" or tales.

99 citations

Journal ArticleDOI
TL;DR: The many objectives and meanings of reproducibility are discussed within the context of scientific computing, and technical barriers to reproducible work are described.
Abstract: Reproducibility is widely considered to be an essential requirement of the scientific process. However, a number of serious concerns have been raised recently, questioning whether today's computational work is adequately reproducible. In principle, it should be possible to specify a computation in sufficient detail that anyone can reproduce it exactly. But in practice, there are fundamental, technical, and social barriers to doing so. The many objectives and meanings of reproducibility are discussed within the context of scientific computing. Technical barriers to reproducibility are described, extant approaches are surveyed, and open areas of research are identified.

91 citations

Journal ArticleDOI
TL;DR: A taxonomy of workflow management system (WMS) characteristics is proposed, including aspects previously overlooked, that frames a review of prevalent WMSs used by the scientific community, elucidates their evolution to handle the challenges arising with the emergence of the “fourth paradigm,” and identifies research needed to maintain progress.
Abstract: Modern scientific collaborations have opened up the opportunity to solve complex problems that require both multidisciplinary expertise and large-scale computational experiments. These experiments typically consist of a sequence of processing steps that need to be executed on selected computing platforms. Execution poses a challenge, however, due to (1) the complexity and diversity of applications, (2) the diversity of analysis goals, (3) the heterogeneity of computing platforms, and (4) the volume and distribution of data. A common strategy to make these in silico experiments more manageable is to model them as workflows and to use a workflow management system to organize their execution. This article looks at the overall challenge posed by a new order of scientific experiments and the systems they need to be run on, and examines how this challenge can be addressed by workflows and workflow management systems. It proposes a taxonomy of workflow management system (WMS) characteristics, including aspects previously overlooked. This frames a review of prevalent WMSs used by the scientific community, elucidates their evolution to handle the challenges arising with the emergence of the “fourth paradigm,” and identifies research needed to maintain progress in this area.

82 citations

Journal ArticleDOI
TL;DR: The authors found that, overall, tree density and understory vegetation density are positively associated with preference in a power-curve relationship, even though the nature of the relationship between bioretention density and preference remains unclear.

45 citations

Journal ArticleDOI
TL;DR: This article discusses how research infrastructures are identified and referenced by scholars in the research literature and how those references are being collected and analyzed for the purposes of evaluating impact, and it identifies notable challenges that impede the analysis of impact metrics.
Abstract: Recent policy shifts on the part of funding agencies and journal publishers are causing changes in the acknowledgment and citation behaviors of scholars. A growing emphasis on open science and reproducibility is changing how authors cite and acknowledge “research infrastructures”—entities that are used as inputs to or as underlying foundations for scholarly research, including data sets, software packages, computational models, observational platforms, and computing facilities. At the same time, stakeholder interest in quantitative understanding of impact is spurring increased collection and analysis of metrics related to use of research infrastructures. This article reviews work spanning several decades on tracing and assessing the outcomes and impacts from these kinds of research infrastructures. We discuss how research infrastructures are identified and referenced by scholars in the research literature and how those references are being collected and analyzed for the purposes of evaluating impact. Synthesizing common features of a wide range of studies, we identify notable challenges that impede the analysis of impact metrics for research infrastructures and outline key open research questions that can guide future research and applications related to such metrics.

43 citations