
Showing papers on "Workflow published in 2021"


Journal ArticleDOI
TL;DR: It is shown how the popular workflow management system Snakemake can be used to guarantee reproducibility, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.
Abstract: Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.
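As an illustration of the kind of analysis the article discusses, a minimal Snakemake workflow might be declared as follows. This is a sketch in Snakemake's Python-based rule syntax; the file names, the seqtk trimming command, and the plotting script are placeholders and are not taken from the article.

```python
# Snakefile -- minimal sketch of a reproducible two-step analysis
rule all:
    input:
        "results/summary_plot.pdf"

rule quality_control:
    input:
        "data/raw/sample.fastq"
    output:
        "results/qc/sample_filtered.fastq"
    shell:
        "seqtk trimfq {input} > {output}"

rule plot_results:
    input:
        "results/qc/sample_filtered.fastq"
    output:
        "results/summary_plot.pdf"
    script:
        "scripts/plot.py"
```

Running `snakemake --cores 4` would then rebuild only the outputs whose inputs or code have changed, which is the property the authors exploit for reproducibility and adaptability.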

519 citations


Journal ArticleDOI
TL;DR: In this article, a workflow for preprocessing single-cell RNA-sequencing data that balances efficiency and accuracy is described, based on the kallisto and bustools programs.
Abstract: We describe a workflow for preprocessing of single-cell RNA-sequencing data that balances efficiency and accuracy. Our workflow is based on the kallisto and bustools programs, and is near optimal in speed with a constant memory requirement, providing scalability for arbitrarily large datasets. The workflow is modular, and we demonstrate its flexibility by showing how it can be used for RNA velocity analyses.

170 citations


Journal ArticleDOI
TL;DR: In this article, a global single-cell mass spectrometry-based proteomics approach is proposed for large-scale single cell analyses, which can provide insights into the molecular basis for cellular heterogeneity.
Abstract: Large-scale single-cell analyses are of fundamental importance in order to capture biological heterogeneity within complex cell systems, but have largely been limited to RNA-based technologies. Here we present a comprehensive benchmarked experimental and computational workflow, which establishes global single-cell mass spectrometry-based proteomics as a tool for large-scale single-cell analyses. By exploiting a primary leukemia model system, we demonstrate both through pre-enrichment of cell populations and through a non-enriched unbiased approach that our workflow enables the exploration of cellular heterogeneity within this aberrant developmental hierarchy. Our approach is capable of consistently quantifying ~1000 proteins per cell across thousands of individual cells using limited instrument time. Furthermore, we develop a computational workflow (SCeptre) that effectively normalizes the data, integrates available FACS data and facilitates downstream analysis. The approach presented here lays a foundation for implementing global single-cell proteomics studies across the world. Single-cell proteomics can provide insights into the molecular basis for cellular heterogeneity. Here, the authors develop a multiplexed single-cell proteomics and computational workflow, and show that their strategy captures the cellular hierarchies in an Acute Myeloid Leukemia culture model.

149 citations


Journal ArticleDOI
TL;DR: This article proposes secure AIoT for implicit group recommendations (SAIoT-GR), introducing a collaborative Bayesian network model and a noncooperative game as its algorithms; the architecture is able to maximize the advantages of the two modules.
Abstract: The emergence of Artificial Intelligence of Things (AIoT) has provided novel insights for many social computing applications such as group recommender systems. As the distances between people have been greatly shortened, there has been more general demand for the provision of personalized services aimed at groups instead of individuals. The existing methods for capturing group-level preference features from individuals have mostly been established via aggregation and face two challenges: secure data management workflows are absent, and implicit preference feedback is ignored. To tackle these current difficulties, this paper proposes secure AIoT for implicit group recommendations (SAIoT-GR). For the hardware module, a secure IoT structure is developed as the bottom support platform. For the software module, a collaborative Bayesian network model and noncooperative game are introduced as algorithms. This secure AIoT architecture is able to maximize the advantages of the two modules. In addition, a large number of experiments are carried out to evaluate the performance of SAIoT-GR in terms of efficiency and robustness.

117 citations


Journal ArticleDOI
TL;DR: This study develops an uncertainty-aware Online Scheduling Algorithm (ROSA) to schedule dynamic and multiple workflows with deadlines, which performs better than the five compared algorithms with respect to costs, deviation, resource utilization, and fairness.
Abstract: Scheduling workflows in cloud service environments has attracted great enthusiasm, and various approaches have been reported to date. However, these approaches often ignore the uncertainties in the scheduling environment, such as the uncertain task start/execution/finish time, the uncertain data transfer time among tasks, and the sudden arrival of new workflows. Ignoring these uncertain factors often leads to the violation of workflow deadlines and increases the service renting costs of executing workflows. This study is devoted to improving the performance of cloud service platforms by minimizing uncertainty propagation when scheduling workflow applications that have both uncertain task execution time and data transfer time. To be specific, a novel scheduling architecture is designed to control the count of workflow tasks directly waiting on each service instance (e.g., virtual machine and container). Once a task is completed, its start/execution/finish times are available, which means its uncertainties disappear and will not affect the subsequent waiting tasks on the same service instance. Thus, controlling the count of waiting tasks on service instances can prohibit the propagation of uncertainties. Based on this architecture, we develop an uncertainty-aware Online Scheduling Algorithm (ROSA) to schedule dynamic and multiple workflows with deadlines. The proposed ROSA skillfully integrates both the proactive and reactive strategies. During the execution of the generated baseline schedules, the reactive strategy in ROSA will be dynamically called to produce new proactive baseline schedules for dealing with uncertainties. Then, on the basis of real-world workflow traces, five groups of simulation experiments are carried out to compare ROSA with five typical algorithms. The comparison results reveal that ROSA performs better than the five compared algorithms with respect to costs (up to 56 percent), deviation (up to 70 percent), resource utilization (up to 37 percent), and fairness (up to 37 percent).
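The ROSA algorithm itself is considerably more involved, but the core architectural idea of bounding the number of tasks directly waiting on each service instance can be sketched roughly as follows. The limit of two waiting tasks and all names are illustrative assumptions, not the paper's values.

```python
from collections import deque

MAX_WAITING = 2  # assumed per-instance limit on directly waiting tasks

class ServiceInstance:
    def __init__(self, name):
        self.name = name
        self.waiting = deque()  # tasks already committed to this instance

    def can_accept(self):
        # A task's uncertainty disappears once it finishes, so keeping the
        # committed queue short limits how far upstream delays can propagate.
        return len(self.waiting) < MAX_WAITING

def dispatch(ready_tasks, instances):
    """Assign ready tasks only to instances whose waiting queue is short."""
    deferred = []
    for task in ready_tasks:
        target = next((inst for inst in instances if inst.can_accept()), None)
        if target is None:
            deferred.append(task)        # defer: stops uncertainty propagation
        else:
            target.waiting.append(task)
    return deferred
```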

116 citations


Journal ArticleDOI
TL;DR: In this article, a cloud-edge based dynamic reconfiguration approach for service workflows in mobile e-commerce environments is proposed, where the value and cost attributes of a service are considered, and a long short-term memory (LSTM) neural network is used to predict the stability of services.
Abstract: The emergence of mobile service composition meets the current needs for real-time eCommerce. However, the requirements for eCommerce, such as safety and timeliness, are becoming increasingly strict. Thus, the cloud-edge hybrid computing model has been introduced to accelerate information processing, especially in mobile scenarios. However, the mobile environment is characterized by limited resource storage and users who frequently move, and these characteristics strongly affect the reliability of service composition running in this environment. Consequently, applications are likely to fail if inappropriate services are invoked. To ensure that the composite service can operate normally, traditional dynamic reconfiguration methods tend to focus on cloud service scheduling. Unfortunately, most of these approaches cannot support timely responses to dynamic changes. In this article, a cloud-edge based dynamic reconfiguration approach for service workflows in mobile eCommerce environments is proposed. First, the service quality concept is extended. Specifically, the value and cost attributes of a service are considered. The value attribute is used to assess the stability of the service for some time to come, and the cost attribute is the cost of a service invocation. Second, a long short-term memory (LSTM) neural network is used to predict the stability of services, which is related to the calculation of the value attribute. Then, in view of the limited available equipment resources, a method for calculating the cost of calling a service is introduced. Third, candidate services are selected by considering both service stability and the cost of service invocation, thus yielding a dynamic reconfiguration scheme that is more suitable for the cloud-edge environment. Finally, a series of comparative experiments were carried out, and the experimental results prove that the method proposed in this article offers higher stability, lower energy consumption, and more accurate service prediction.
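A rough sketch of the candidate-selection step described above, where a service's predicted stability (its value attribute) is traded off against its invocation cost. The linear weighting and the callables are illustrative assumptions, not the authors' formulation.

```python
def select_candidate(services, predict_stability, invocation_cost, alpha=0.7):
    """Pick the candidate service with the best value/cost trade-off.

    predict_stability(s): stability score in [0, 1], e.g. produced by an LSTM
    trained on the service's recent availability history.
    invocation_cost(s): normalized cost of invoking the service on the
    available equipment.  alpha is a hypothetical trade-off weight.
    """
    def score(service):
        return alpha * predict_stability(service) - (1 - alpha) * invocation_cost(service)

    return max(services, key=score)
```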

93 citations


Journal ArticleDOI
TL;DR: To make machine-learning analyses in the life sciences more computationally reproducible, standards based on data, model and code publication, programming best practices and workflow automation are proposed.
Abstract: To make machine-learning analyses in the life sciences more computationally reproducible, we propose standards based on data, model and code publication, programming best practices and workflow automation. By meeting these standards, the community of researchers applying machine-learning methods in the life sciences can ensure that their analyses are worthy of trust.

88 citations


Book ChapterDOI
01 Jan 2021
TL;DR: This chapter introduces Duet, the authors' tool for easier FL for scientists and data owners, and provides a proof-of-concept demonstration of an FL workflow using an example of how to train a convolutional neural network.
Abstract: PySyft is an open-source multi-language library enabling secure and private machine learning by wrapping and extending popular deep learning frameworks such as PyTorch in a transparent, lightweight, and user-friendly manner. Its aim is to both help popularize privacy-preserving techniques in machine learning by making them as accessible as possible via Python bindings and common tools familiar to researchers and data scientists, as well as to be extensible such that new Federated Learning (FL), Multi-Party Computation, or Differential Privacy methods can be flexibly and simply implemented and integrated. This chapter will introduce the methods available within the PySyft library and describe their implementations. We will then provide a proof-of-concept demonstration of a FL workflow using an example of how to train a convolutional neural network. Next, we review the use of PySyft in academic literature to date and discuss future use-cases and development plans. Most importantly, we introduce Duet: our tool for easier FL for scientists and data owners.
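The Duet API itself is not reproduced here; the sketch below only illustrates, in plain Python/NumPy, the federated averaging step that such an FL workflow coordinates while raw data stays with its owners. Client names, sizes, and the toy parameters are made up.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of model parameters from clients that never share raw data."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in range(n_layers)
    ]

# Example: two clients, each holding a list of NumPy parameter arrays
w_a = [np.ones((2, 2)), np.zeros(2)]
w_b = [np.full((2, 2), 3.0), np.ones(2)]
global_weights = federated_average([w_a, w_b], client_sizes=[100, 300])
```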

84 citations


Journal ArticleDOI
TL;DR: A principled Bayesian workflow is introduced that provides guidelines and checks for valid data analysis, avoiding overfitting complex models to noise, and capturing relevant data structure in a probabilistic model.
Abstract: Experiments in research on memory, language, and in other areas of cognitive science are increasingly being analyzed using Bayesian methods. This has been facilitated by the development of probabilistic programming languages such as Stan, and easily accessible front-end packages such as brms. The utility of Bayesian methods, however, ultimately depends on the relevance of the Bayesian model, in particular whether or not it accurately captures the structure of the data and the data analyst's domain expertise. Even with powerful software, the analyst is responsible for verifying the utility of their model. To demonstrate this point, we introduce a principled Bayesian workflow (Betancourt, 2018) to cognitive science. Using a concrete working example, we describe basic questions one should ask about the model: prior predictive checks, computational faithfulness, model sensitivity, and posterior predictive checks. The running example for demonstrating the workflow is data on reading times with a linguistic manipulation of object versus subject relative clause sentences. This principled Bayesian workflow also demonstrates how to use domain knowledge to inform prior distributions. It provides guidelines and checks for valid data analysis, avoiding overfitting complex models to noise, and capturing relevant data structure in a probabilistic model. Given the increasing use of Bayesian methods, we aim to discuss how these methods can be properly employed to obtain robust answers to scientific questions. All data and code accompanying this article are available from https://osf.io/b2vx9/. (PsycInfo Database Record (c) 2021 APA, all rights reserved).
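As a flavour of the first step the workflow recommends, here is a prior predictive check sketched in plain Python/NumPy rather than the Stan/brms code used in the article; the lognormal reading-time model and the prior values are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_obs = 1000, 100

# Hypothetical priors for a lognormal model of reading times (ms)
alpha = rng.normal(6.0, 0.6, n_sims)       # intercept on the log scale
beta = rng.normal(0.0, 0.1, n_sims)        # effect of relative-clause type
sigma = np.abs(rng.normal(0.0, 0.5, n_sims))

condition = rng.choice([-0.5, 0.5], size=(n_sims, n_obs))
rt = rng.lognormal(alpha[:, None] + beta[:, None] * condition,
                   sigma[:, None])

# Do simulated reading times fall in a plausible range before seeing any data?
print(np.percentile(rt, [5, 50, 95]))
```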

82 citations


Journal ArticleDOI
TL;DR: A comprehensive review was conducted on GBI-targeted studies enlisting ENVI-met as the primary tool, providing researchers with an overview of the ENVI-met methodology and recommendations to refine research on GBI thermal effects.

80 citations


Journal ArticleDOI
TL;DR: Current limitations and challenges are discussed, including advances in network implementations, applications to unconventional resources, dataset acquisition and synthetic training, extrapolative potential, accuracy loss from soft computing, and the computational cost of 3D Deep Learning.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a cloud workflow scheduling approach which combines particle swarm optimization and idle time slot-aware rules, to minimize the execution cost of a workflow application under a deadline constraint.
Abstract: Workflow scheduling is a key issue and remains a challenging problem in cloud computing. Faced with the large number of virtual machine (VM) types offered by cloud providers, cloud users need to choose the most appropriate VM type for each task. Multiple task scheduling sequences exist in a workflow application. Different task scheduling sequences have a significant impact on the scheduling performance. It is not easy to determine the most appropriate set of VM types for tasks and the best task scheduling sequence. Besides, the idle time slots on VM instances should be used fully to increase resources' utilization and save the execution cost of a workflow. This paper considers these three aspects simultaneously and proposes a cloud workflow scheduling approach which combines particle swarm optimization (PSO) and idle time slot-aware rules, to minimize the execution cost of a workflow application under a deadline constraint. A new particle encoding is devised to represent the VM type required by each task and the scheduling sequence of tasks. An idle time slot-aware decoding procedure is proposed to decode a particle into a scheduling solution. To handle tasks' invalid priorities caused by the randomness of PSO, a repair method is used to repair those priorities to produce valid task scheduling sequences. The proposed approach is compared with state-of-the-art cloud workflow scheduling algorithms. Experiments show that the proposed approach outperforms the comparative algorithms in terms of both of the execution cost and the success rate in meeting the deadline.
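The full approach couples this with PSO particle encoding, decoding, and repair, but the idle time slot-aware placement rule on its own can be sketched as follows; the interval representation and names are assumptions for illustration.

```python
def earliest_idle_slot(busy, ready, duration):
    """Return the earliest start time >= ready that fits `duration`
    into the idle gaps of a VM instance's schedule.

    busy: sorted list of (start, end) intervals already occupied on the VM.
    """
    t = ready
    for start, end in busy:
        if t + duration <= start:     # the task fits in the gap before this interval
            return t
        t = max(t, end)               # otherwise skip past the busy interval
    return t                          # append after the last busy interval

# Example: VM busy in [0, 4) and [9, 12); a 3-unit task becomes ready at time 2
print(earliest_idle_slot([(0, 4), (9, 12)], ready=2, duration=3))  # -> 4
```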

Journal ArticleDOI
29 Nov 2021
TL;DR: Chatbots have the potential to be integrated into clinical practice by working alongside health practitioners to reduce costs, refine workflow efficiencies, and improve patient outcomes, and further research and interdisciplinary collaboration could advance this technology to dramatically improve the quality of care for patients, rebalance the workload for clinicians, and revolutionize the practice of medicine.
Abstract: Background: Chatbot is a timely topic applied in various fields, including medicine and health care, for human-like knowledge transfer and communication. Machine learning, a subset of artificial intelligence, has been proven particularly applicable in health care, with the ability for complex dialog management and conversational flexibility. Objective: This review article aims to report on the recent advances and current trends in chatbot technology in medicine. A brief historical overview, along with the developmental progress and design characteristics, is first introduced. The focus will be on cancer therapy, with in-depth discussions and examples of diagnosis, treatment, monitoring, patient support, workflow efficiency, and health promotion. In addition, this paper will explore the limitations and areas of concern, highlighting ethical, moral, security, technical, and regulatory standards and evaluation issues to explain the hesitancy in implementation. Methods: A search of the literature published in the past 20 years was conducted using the IEEE Xplore, PubMed, Web of Science, Scopus, and OVID databases. The screening of chatbots was guided by the open-access Botlist directory for health care components and further divided according to the following criteria: diagnosis, treatment, monitoring, support, workflow, and health promotion. Results: Even after addressing these issues and establishing the safety or efficacy of chatbots, human elements in health care will not be replaceable. Therefore, chatbots have the potential to be integrated into clinical practice by working alongside health practitioners to reduce costs, refine workflow efficiencies, and improve patient outcomes. Other applications in pandemic support, global health, and education are yet to be fully explored. Conclusions: Further research and interdisciplinary collaboration could advance this technology to dramatically improve the quality of care for patients, rebalance the workload for clinicians, and revolutionize the practice of medicine.

Journal ArticleDOI
TL;DR: PatRoon as discussed by the authors is a new R based open-source software platform, which provides comprehensive, fully tailored and straightforward non-target analysis workflows, making the use, evaluation and mixing of well-tested algorithms seamless by harmonizing various common (primarily open) software tools under a consistent interface.
Abstract: Mass spectrometry based non-target analysis is increasingly adopted in environmental sciences to screen and identify numerous chemicals simultaneously in highly complex samples. However, current data processing software either lack functionality for environmental sciences, solve only part of the workflow, are not openly available and/or are restricted in input data formats. In this paper, we present patRoon, a new R based open-source software platform, which provides comprehensive, fully tailored and straightforward non-target analysis workflows. This platform makes the use, evaluation and mixing of well-tested algorithms seamless by harmonizing various common (primarily open) software tools under a consistent interface. In addition, patRoon offers various functionality and strategies to simplify and perform automated processing of complex (environmental) data effectively. patRoon implements several effective optimization strategies to significantly reduce computational times. The ability of patRoon to perform time-efficient and automated non-target data annotation of environmental samples is demonstrated with a simple and reproducible workflow using open-access data of spiked samples from a drinking water treatment plant study. In addition, the ability to easily use, combine and evaluate different algorithms was demonstrated for three commonly used feature finding algorithms. This article, combined with already published works, demonstrates that patRoon helps make comprehensive (environmental) non-target analysis readily accessible to a wider community of researchers.

Journal ArticleDOI
Guanjie Wang1, Liyu Peng1, Kaiqi Li1, Linggang Zhu1, Jian Zhou1, Naihua Miao1, Zhimei Sun1 
TL;DR: An open-source computational platform named ALKEMIE, an acronym for Artificial Learning and Knowledge Enhanced Materials Informatics Engineering, is presented; it enables easy access to data-driven techniques for broad communities and has an elaborately designed, user-friendly graphical user interface that makes the workflow and dataflow more maneuverable and transparent, facilitating ease of use for scientists with broad backgrounds.

Journal ArticleDOI
TL;DR: A digital twin-based assembly data management and process traceability approach for complex products is proposed and the Digital Twin-based Assembly Process Management and Control System (DT-APMCS) was designed to verify the efficiency of the proposed approach.

Journal ArticleDOI
TL;DR: In this paper, the authors present a review of the main bottom-up physics-based UBEM tools, comparing them from a user-oriented perspective, focusing on the required inputs, the reported outputs, the exploited workflow, the applicability of each tool, and the potential users.
Abstract: Regulations corroborate the importance of retrofitting existing building stocks or constructing new energy-efficient districts. There is, thus, a need for modeling tools to evaluate energy scenarios to better manage and design cities, and numerous methodologies and tools have been developed. Among them, Urban Building Energy Modeling (UBEM) tools allow the energy simulation of buildings at large scales. Choosing an appropriate UBEM tool, balancing the level of complexity, accuracy, usability, and computing needs, remains a challenge for users. The review focuses on the main bottom-up physics-based UBEM tools, comparing them from a user-oriented perspective. Five categories are used: (i) the required inputs, (ii) the reported outputs, (iii) the exploited workflow, (iv) the applicability of each tool, and (v) the potential users. Moreover, a critical discussion is proposed focusing on interests and trends in research and development. The results highlighted major differences between UBEM tools that must be considered to choose the proper one for an application. Barriers to the adoption of UBEM tools include the need for a standardized ontology, a common three-dimensional city model, a standard procedure to collect data, and a standard set of test cases. This feeds into future development of UBEM tools to support cities' sustainability goals.

Journal ArticleDOI
TL;DR: An online multi-workflow scheduling framework, named NOSF, is proposed to schedule deadline-constrained workflows with random arrivals and uncertain task execution time; it significantly outperforms two state-of-the-art algorithms in terms of reducing VM rental costs and deadline violation probability.
Abstract: Cloud has become an important platform for executing numerous deadline-constrained scientific applications generally represented by workflow models. It provides scientists a simple and cost-efficient method of running workflows on their rental Virtual Machines (VMs) anytime and anywhere. Since pay-as-you-go is a dominating pricing solution in clouds, extensive research efforts have been devoted to minimizing the monetary cost of executing workflows by designing tailored VM allocation mechanisms. However, most of them assume that the task execution time in clouds is static and can be estimated in advance, which is impractical in real scenarios due to performance fluctuation of VMs. In this paper, we propose an online multi-workflow scheduling framework, named NOSF, to schedule deadline-constrained workflows with random arrivals and uncertain task execution time. In NOSF, the workflow scheduling process consists of three phases, including workflow preprocessing, VM allocation and feedback process. Built upon the new framework, a deadline-aware heuristic algorithm is then developed to elastically provision suitable VMs for workflow execution, with the objective of minimizing the rental cost and improving resource utilization. Simulation results demonstrate that the proposed algorithm significantly outperforms two state-of-the-art algorithms in terms of reducing VM rental costs and deadline violation probability, as well as improving the resource utilization efficiency.

Journal ArticleDOI
TL;DR: The experimental results show that the proposed algorithm reduces makespan, enhances resource utilization, and improves load balancing, compared to MOHEFT and MCP, the well-known workflow scheduling algorithms of the literature.
Abstract: Cloud computing is one of the most popular distributed environments, in which multiple powerful and heterogeneous resources are used by different user applications. Task scheduling and resource provisioning are two important challenges of the cloud environment, together called cloud resource management. Resource management is a major problem, especially for scientific workflows, due to their heavy calculations and the dependencies between their operations. Several algorithms and methods have been developed to manage cloud resources. In this paper, the combination of state-action-reward-state-action (SARSA) learning and a genetic algorithm is used to manage cloud resources. At the first step, the intelligent agents schedule the tasks during the learning process by exploring the workflow. Then, in the resource provisioning step, each resource is assigned to an agent, and its utilization is attempted to be maximized in the learning process of its corresponding agent. This is conducted by selecting the most appropriate set of tasks that maximizes the utilization of the resource. The genetic algorithm is utilized for convergence of the agents of the proposed method, and to achieve global optimization. The fitness function exploited by this genetic algorithm seeks to achieve more efficient resource utilization and better load balancing by observing the deadlines of the tasks. The experimental results show that the proposed algorithm reduces makespan, enhances resource utilization, and improves load balancing, compared to MOHEFT and MCP, the well-known workflow scheduling algorithms of the literature.
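A minimal sketch of the tabular SARSA update that underlies the learning step; the state/action encodings, rewards, and hyperparameters here are generic illustrations, not the paper's.

```python
from collections import defaultdict
import random

Q = defaultdict(float)          # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def choose_action(state, actions):
    """Epsilon-greedy action selection over the current Q estimates."""
    if random.random() < epsilon:                       # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])    # exploit

def sarsa_update(s, a, reward, s_next, a_next):
    """On-policy update: Q(s,a) <- Q(s,a) + alpha * (r + gamma*Q(s',a') - Q(s,a))."""
    Q[(s, a)] += alpha * (reward + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```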

Posted Content
TL;DR: In this article, a narrative review of interpretability methods for deep learning models for medical image analysis applications is presented, which is based on the type of generated explanations and technical similarities.
Abstract: Artificial Intelligence has emerged as a useful aid in numerous clinical applications for diagnosis and treatment decisions. Deep neural networks have shown the same or better performance than clinicians in many tasks, owing to the rapid increase in the available data and computational power. In order to conform to the principles of trustworthy AI, it is essential that the AI system be transparent, robust, fair and ensure accountability. Current deep neural solutions are referred to as black-boxes due to a lack of understanding of the specifics concerning the decision making process. Therefore, there is a need to ensure interpretability of deep neural networks before they can be incorporated in the routine clinical workflow. In this narrative review, we utilized systematic keyword searches and domain expertise to identify nine different types of interpretability methods that have been used for understanding deep learning models for medical image analysis applications, based on the type of generated explanations and technical similarities. Furthermore, we report the progress made towards evaluating the explanations produced by various interpretability methods. Finally, we discuss limitations, provide guidelines for using interpretability methods and future directions concerning the interpretability of deep neural networks for medical imaging analysis.

Journal ArticleDOI
TL;DR: An interactive visual analysis workflow for the end-to-end analysis of Imaging Mass Cytometry data that was developed in close collaboration with domain expert partners is presented and the effectiveness of the workflow and ImaCytE is shown.
Abstract: Tissue functionality is determined by the characteristics of tissue-resident cells and their interactions within their microenvironment. Imaging Mass Cytometry offers the opportunity to distinguish cell types with high precision and link them to their spatial location in intact tissues at sub-cellular resolution. This technology produces large amounts of spatially-resolved high-dimensional data, which constitutes a serious challenge for the data analysis. We present an interactive visual analysis workflow for the end-to-end analysis of Imaging Mass Cytometry data that was developed in close collaboration with domain expert partners. We implemented the presented workflow in an interactive visual analysis tool; ImaCytE. Our workflow is designed to allow the user to discriminate cell types according to their protein expression profiles and analyze their cellular microenvironments, aiding in the formulation or verification of hypotheses on tissue architecture and function. Finally, we show the effectiveness of our workflow and ImaCytE through a case study performed by a collaborating specialist.

Journal ArticleDOI
TL;DR: Workflow managers have been developed to simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing as mentioned in this paper.
Abstract: The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.

Journal ArticleDOI
TL;DR: A dependable algorithm for scheduling workflow applications on CPCS that uses slack to recover failed tasks and allows all tasks to share the available slack in the system to improve soft-error reliability.
Abstract: Cyber–physical cloud systems (CPCS) are integrations of cyber–physical systems (CPS) and cloud computing infrastructures. Integrating CPS into cloud computing infrastructures could improve the performance in many aspects. However, new reliability and security challenges are also introduced. This fact highlights the need to develop novel methodologies to tackle these challenges in CPCS. To this end, this article is oriented toward enhancing the soft-error reliability of real-time workflows on CPCS while satisfying the lifetime reliability, security, and real-time constraints. In this article, we propose a dependable algorithm for scheduling workflow applications on CPCS. The proposed algorithm uses slack to recover failed tasks and allows all tasks to share the available slack in the system. To improve soft-error reliability, the algorithm first determines the priority of tasks, then assigns the maximum frequency to each task, and finally assigns the recoveries to tasks dynamically. Slack also can be used to utilize security services for satisfying system security requirements. The lifetime reliability constraint is met by dynamically scaling down the operating frequency of low-priority tasks. Extensive experiments on real-world workflow benchmarks demonstrate that the proposed scheme reduces the probability of failure by up to 52.1% and improves the scheduling feasibility by up to 83.5% compared to a number of representative approaches.

Journal ArticleDOI
TL;DR: By identifying aspects of machine learning that can be reused from project to project, open-source tools that help in specific parts of the pipeline, and possible combinations, an overview of support in MLOps is given.
Abstract: Nowadays, machine learning projects have become more and more relevant to various real-world use cases. The success of complex Neural Network models depends upon many factors, as the requirement for structured and machine learning-centric project development management arises. Due to the multitude of tools available for different operational phases, responsibilities and requirements become more and more unclear. In this work, Machine Learning Operations (MLOps) technologies and tools for every part of the overall project pipeline, as well as involved roles, are examined and clearly defined. With the focus on the inter-connectivity of specific tools and comparison by well-selected requirements of MLOps, model performance, input data, and system quality metrics are briefly discussed. By identifying aspects of machine learning that can be reused from project to project, open-source tools that help in specific parts of the pipeline, and possible combinations, an overview of support in MLOps is given. Deep learning has revolutionized the field of image processing, and building an automated machine learning workflow for object detection is of great interest for many organizations. For this, a simple MLOps workflow for object detection with images is portrayed.

Journal ArticleDOI
TL;DR: In this paper, the authors provide guidelines for interpreting single-cell transcriptomic maps to identify cell types, states and other biologically relevant patterns with the objective of creating an annotated map of cells.
Abstract: Single-cell transcriptomics can profile thousands of cells in a single experiment and identify novel cell types, states and dynamics in a wide variety of tissues and organisms. Standard experimental protocols and analysis workflows have been developed to create single-cell transcriptomic maps from tissues. This tutorial focuses on how to interpret these data to identify cell types, states and other biologically relevant patterns with the objective of creating an annotated map of cells. We recommend a three-step workflow including automatic cell annotation (wherever possible), manual cell annotation and verification. Frequently encountered challenges are discussed, as well as strategies to address them. Guiding principles and specific recommendations for software tools and resources that can be used for each step are covered, and an R notebook is included to help run the recommended workflow. Basic familiarity with computer software is assumed, and basic knowledge of programming (e.g., in the R language) is recommended. This tutorial provides guidelines for interpreting single-cell transcriptomic maps to identify cell types, states and other biologically relevant patterns.
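The article provides an R notebook for the recommended workflow; purely as an illustration of the automatic-annotation step, a naive marker-gene scoring pass could look like this in Python. The matrix layout, marker lists, and the mean-expression score are simplifying assumptions, not the authors' code.

```python
import numpy as np

def annotate_by_markers(expr, genes, markers):
    """Assign each cell the reference cell type whose marker genes score highest.

    expr:    cells x genes array of normalized expression
    genes:   list of gene names matching the columns of expr
    markers: dict mapping cell-type name -> list of marker gene names
    """
    gene_idx = {g: j for j, g in enumerate(genes)}
    labels = []
    for row in expr:
        scores = {}
        for cell_type, marker_genes in markers.items():
            cols = [gene_idx[g] for g in marker_genes if g in gene_idx]
            scores[cell_type] = row[cols].mean() if cols else float("-inf")
        labels.append(max(scores, key=scores.get))
    return labels

# Toy example: two cells, three genes, two hypothetical cell types
expr = np.array([[5.0, 0.1, 0.2],
                 [0.2, 4.0, 3.5]])
print(annotate_by_markers(expr, ["CD3E", "CD19", "MS4A1"],
                          {"T cell": ["CD3E"], "B cell": ["CD19", "MS4A1"]}))
```

Manual annotation and verification, the second and third steps of the recommended workflow, would then refine or overrule these automatic labels.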

Journal ArticleDOI
TL;DR: The LCZ Generator as discussed by the authors is an online platform that maps a city of interest into Local Climate Zones (LCZs), solely expecting a valid training area file and some metadata as input.
Abstract: Since their introduction in 2012, Local Climate Zones (LCZs) emerged as a new standard for characterising urban landscapes, providing a holistic classification approach that takes into account micro-scale land-cover and associated physical properties. In 2015, as part of the community-based World Urban Database and Access Portal Tools (WUDAPT) project, a protocol was developed that enables the mapping of cities into LCZs, using freely available data and software packages, yet performed on local computing facilities. The ‘LCZ Generator’ described here further simplifies this process, providing an online platform that maps a city of interest into LCZs, solely expecting a valid training area file and some metadata as input. The web application (available at https://lcz-generator.rub.de) integrates the state-of-the-art of LCZ mapping, and simultaneously provides an automated accuracy assessment, training data derivatives and a novel approach to identify suspicious training areas. As this contribution explains all front- and back-end procedures, databases, and underlying datasets in detail, it serves as the primary ‘User Guide’ for this web application. We anticipate this development will significantly ease the workflow of researchers and practitioners interested in using the LCZ framework for a variety of urban-induced human and environmental impacts. In addition, this development will ease the accessibility and dissemination of maps and their metadata.

Journal ArticleDOI
TL;DR: The emerging intelligent automation (IA) as mentioned in this paper is the combination of RPA, AI and soft computing, which can further surpass traditional DM to achieve unprecedented levels of operational efficiency, decision quality and system reliability.

Journal ArticleDOI
TL;DR: The algorithm introduced in this paper utilizes a load balancing routine to maximize resources’ efficiency at execution time and performs task scheduling with the least makespan and cost.
Abstract: Cloud infrastructures are suitable environments for processing large scientific workflows. Nowadays, new challenges are emerging in the field of optimizing workflows such that they can meet users' service quality requirements. The key to workflow optimization is the scheduling of workflow tasks, which is a famous NP-hard problem. Although several methods have been proposed based on the genetic algorithm for task scheduling in clouds, our proposed method is more efficient than other proposed methods due to the use of new genetic operators as well as modified genetic operators and the use of a load balancing routine. Moreover, a solution obtained from a heuristic is used as one of the initial population chromosomes, and an efficient routine is also used for generating the rest of the initial population chromosomes. An adaptive fitness function is used that takes into account both cost and makespan. The algorithm introduced in this paper utilizes a load balancing routine to maximize resources' efficiency at execution time. The performance of the proposed algorithm is evaluated by comparing the results with state-of-the-art algorithms of this field, and the results indicate that the proposed algorithm has remarkable superiority in comparison to other algorithms and performs task scheduling with the least makespan and cost.
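A small sketch of an adaptive fitness function in the spirit described, combining normalized makespan and cost; the weights, normalization bounds, and the evaluate callable are assumptions for illustration, not the paper's exact formulation.

```python
def fitness(chromosome, evaluate, w_time=0.5, w_cost=0.5,
            makespan_bound=1.0, cost_bound=1.0):
    """Lower is better: weighted sum of normalized makespan and monetary cost.

    evaluate(chromosome) is assumed to return (makespan, cost) for a candidate
    schedule; the bounds could be the worst values seen in the current
    generation, which is what makes the weighting adaptive.
    """
    makespan, cost = evaluate(chromosome)
    return w_time * makespan / makespan_bound + w_cost * cost / cost_bound
```

A population could then be ranked with `sorted(population, key=lambda c: fitness(c, evaluate))` before selection, crossover, and mutation are applied.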

Journal ArticleDOI
TL;DR: In this article, a subset of the PubChem database relevant for exposomics, PubChemLite, is presented as a database resource that can be (and has been) integrated into current workflows for high resolution mass spectrometry.
Abstract: Compound (or chemical) databases are an invaluable resource for many scientific disciplines. Exposomics researchers need to find and identify relevant chemicals that cover the entirety of potential (chemical and other) exposures over entire lifetimes. This daunting task, with over 100 million chemicals in the largest chemical databases, coupled with broadly acknowledged knowledge gaps in these resources, leaves researchers faced with too much—yet not enough—information at the same time to perform comprehensive exposomics research. Furthermore, the improvements in analytical technologies and computational mass spectrometry workflows coupled with the rapid growth in databases and increasing demand for high throughput “big data” services from the research community present significant challenges for both data hosts and workflow developers. This article explores how to reduce candidate search spaces in non-target small molecule identification workflows, while increasing content usability in the context of environmental and exposomics analyses, so as to profit from the increasing size and information content of large compound databases, while increasing efficiency at the same time. In this article, these methods are explored using PubChem, the NORMAN Network Suspect List Exchange and the in silico fragmentation approach MetFrag. A subset of the PubChem database relevant for exposomics, PubChemLite, is presented as a database resource that can be (and has been) integrated into current workflows for high resolution mass spectrometry. Benchmarking datasets from earlier publications are used to show how experimental knowledge and existing datasets can be used to detect and fill gaps in compound databases to progressively improve large resources such as PubChem, and topic-specific subsets such as PubChemLite. PubChemLite is a living collection, updating as annotation content in PubChem is updated, and exported to allow direct integration into existing workflows such as MetFrag. The source code and files necessary to recreate or adjust this are jointly hosted between the research parties (see data availability statement). This effort shows that enhancing the FAIRness (Findability, Accessibility, Interoperability and Reusability) of open resources can mutually enhance several resources for whole community benefit. The authors explicitly welcome additional community input on ideas for future developments.

Journal ArticleDOI
03 Mar 2021
TL;DR: The objective of the research work is to use the bio-inspired bacteria foraging optimization algorithm (BFOA) along with other heuristic algorithms for a better search of the scheduling solution space for multiple workflows; the results demonstrate that the hybrid approach (MinMin/Myopic with BFOA) outperforms other approaches.
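As an illustration of how a heuristic seed for the BFOA population could be computed, here is a compact MinMin sketch; task dependencies are ignored, all names are illustrative, and the paper's implementation will differ.

```python
def min_min(tasks, resources, exec_time):
    """MinMin: repeatedly schedule the task with the smallest minimum completion time.

    exec_time[(task, resource)] gives the estimated run time of task on resource.
    Returns a mapping task -> resource, usable as one seed chromosome.
    """
    ready = {r: 0.0 for r in resources}   # time at which each resource becomes free
    assignment = {}
    unscheduled = set(tasks)
    while unscheduled:
        task, resource, finish = min(
            ((t, r, ready[r] + exec_time[(t, r)])
             for t in unscheduled for r in resources),
            key=lambda x: x[2],
        )
        assignment[task] = resource
        ready[resource] = finish
        unscheduled.remove(task)
    return assignment
```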
Abstract: Efficient scheduling of tasks in workflows of cloud or grid applications is a key to achieving better utilization of resources as well as timely completion of the user jobs. Many scientific applications comprise several tasks that are dependent in nature and are specified by workflow graphs. The aim of the cloud meta-scheduler is to schedule the user application tasks (and the applications) so as to optimize the resource utilization and to execute the user applications in minimum amount of time. During the past decade, there have been several attempts to use bio-inspired scheduling algorithms to obtain an optimal or near optimal schedule in order to minimize the overall schedule length and to optimize the use of resources. However, as the number of tasks increases, the solution space comprising different tasks-resource mapping sequences increases exponentially. Hence, there is a need to devise mechanisms to improvise the search strategies of the bio-inspired scheduling algorithms for better scheduling solutions in lesser number of iterations/time. The objective of the research work in this paper is to use bio-inspired bacteria foraging optimization algorithm (BFOA) along with other heuristics algorithms for better search of the scheduling solution space for multiple workflows. The idea is to first find a schedule by the heuristic algorithms such as MaxMin, MinMin, and Myopic, and use these as initial solutions (along with other randomly generated solutions) in the search space to get better solutions using BFOA. The performance of our approach with the existing approaches is compared for quality of the scheduling solutions. The results demonstrate that our hybrid approach (MinMin/Myopic with BFOA) outperforms other approaches.