
Showing papers on "Workflow" published in 2020


Journal ArticleDOI
TL;DR: PhyloSuite is designed for both beginners and experienced researchers, allowing the former to quick‐start their way into phylogenetic analysis, and the latter to conduct, store and manage their work in a streamlined way, and spend more time investigating scientific questions instead of wasting it on transferring files from one software program to another.
Abstract: Multigene and genomic data sets have become commonplace in the field of phylogenetics, but many existing tools are not designed for such data sets, which often makes the analysis time-consuming and tedious. Here, we present PhyloSuite, a user-friendly, cross-platform, open-source, stand-alone workflow desktop platform with a Python graphical user interface, dedicated to streamlining molecular sequence data management and evolutionary phylogenetics studies. It uses a plugin-based system that integrates several phylogenetic and bioinformatic tools, thereby streamlining the entire procedure, from data acquisition to phylogenetic tree annotation (in combination with iTOL). It has the following features: (a) a point-and-click and drag-and-drop graphical user interface; (b) a workplace to manage and organize molecular sequence data and results of analyses; (c) GenBank entry extraction and comparative statistics; and (d) a phylogenetic workflow with batch processing capability, comprising sequence alignment (MAFFT and MACSE), alignment optimization (trimAl, HmmCleaner and Gblocks), data set concatenation, best partitioning scheme and best evolutionary model selection (PartitionFinder and ModelFinder), and phylogenetic inference (MrBayes and IQ-TREE). PhyloSuite is designed for both beginners and experienced researchers, allowing the former to quick-start their way into phylogenetic analysis, and the latter to conduct, store and manage their work in a streamlined way, and spend more time investigating scientific questions instead of wasting it on transferring files from one software program to another.
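As an illustration of the kind of batch pipeline PhyloSuite wraps behind its interface, the sketch below chains three of the cited command-line tools directly. It assumes MAFFT, trimAl and IQ-TREE are installed and on PATH, the file name my_sequences.fasta is a placeholder, and the flags shown are the commonly documented ones and may need adjusting for your installed versions.

```python
# Minimal sketch of an alignment -> trimming -> tree-inference pipeline,
# assuming mafft, trimal and iqtree are installed and on PATH.
import subprocess
from pathlib import Path

def run(cmd):
    """Run one external tool and fail loudly if it errors."""
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def phylo_pipeline(fasta_in: str, workdir: str = "phylo_out"):
    out = Path(workdir)
    out.mkdir(exist_ok=True)
    aligned = out / "aligned.fasta"
    trimmed = out / "trimmed.fasta"

    # 1) multiple sequence alignment with MAFFT (writes the alignment to stdout)
    with open(aligned, "w") as fh:
        subprocess.run(["mafft", "--auto", fasta_in], stdout=fh, check=True)

    # 2) alignment trimming with trimAl
    run(["trimal", "-in", str(aligned), "-out", str(trimmed), "-automated1"])

    # 3) model selection (ModelFinder via -m MFP) + ML tree inference with IQ-TREE
    run(["iqtree", "-s", str(trimmed), "-m", "MFP", "-bb", "1000"])

if __name__ == "__main__":
    phylo_pipeline("my_sequences.fasta")  # hypothetical input file
```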

1,144 citations


Journal ArticleDOI
TL;DR: The existing literature in the field of automated machine learning (AutoML) is reviewed to help healthcare professionals with limited data science expertise better utilize machine learning models "off-the-shelf" and to foster widespread adoption of AutoML in healthcare.

346 citations


Journal ArticleDOI
TL;DR: A comprehensive survey of machine learning testing can be found in this article, which covers 138 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (i.e., the data, learning program, and framework), testing workflow, and application scenarios.
Abstract: This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 138 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in machine learning testing.
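One concrete instance of the "testing workflow" the survey covers is a metamorphic (invariance) test: the expected relation is that tiny input perturbations should not flip a classifier's predictions. The sketch below is illustrative only; the dataset, model and tolerance are arbitrary choices.

```python
# Metamorphic (invariance) test: predictions should be stable under tiny noise.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def test_prediction_invariance(model, X, eps=1e-3, agreement_threshold=0.99):
    """Check that predictions agree before and after a small perturbation."""
    rng = np.random.default_rng(0)
    X_perturbed = X + rng.uniform(-eps, eps, size=X.shape)
    agreement = np.mean(model.predict(X) == model.predict(X_perturbed))
    assert agreement >= agreement_threshold, f"only {agreement:.2%} predictions stable"
    return agreement

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)
print("invariance agreement:", test_prediction_invariance(model, X))
```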

343 citations


Journal ArticleDOI
TL;DR: This work introduces MetaboAnalystR 3.0, a significantly improved pipeline with three key new features: efficient parameter optimization for peak picking; automated batch effect correction; and more accurate pathway activity prediction that offers an efficient pipeline to support high-throughput global metabolomics in the open-source R environment.
Abstract: Liquid chromatography coupled to high-resolution mass spectrometry platforms are increasingly employed to comprehensively measure metabolome changes in systems biology and complex diseases. Over the past decade, several powerful computational pipelines have been developed for spectral processing, annotation, and analysis. However, significant obstacles remain with regard to parameter settings, computational efficiencies, batch effects, and functional interpretations. Here, we introduce MetaboAnalystR 3.0, a significantly improved pipeline with three key new features: (1) efficient parameter optimization for peak picking; (2) automated batch effect correction; and (3) more accurate pathway activity prediction. Our benchmark studies showed that this workflow was 20-100 times faster than other well-established workflows and produced more biologically meaningful results. In summary, MetaboAnalystR 3.0 offers an efficient pipeline to support high-throughput global metabolomics in the open-source R environment.
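For readers unfamiliar with what batch effect correction does in such a pipeline, the toy sketch below shows the general idea (per-batch median centering of log intensities). It is a deliberately simplified Python illustration, not the MetaboAnalystR 3.0 algorithm, which is an R package with a more sophisticated, QC-aware procedure.

```python
# Toy illustration of batch-effect correction by per-batch median centering.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# toy feature table: rows = samples, columns = metabolite features
intensities = pd.DataFrame(rng.lognormal(mean=5, sigma=0.3, size=(12, 4)),
                           columns=[f"feat_{i}" for i in range(4)])
batch = pd.Series(["A"] * 6 + ["B"] * 6, name="batch")
intensities.loc[batch == "B"] *= 1.8          # simulate a systematic batch shift

log_int = np.log2(intensities)
# subtract each batch's median per feature, then add back the global median
corrected = (log_int
             - log_int.groupby(batch).transform("median")
             + log_int.median())
print(corrected.groupby(batch).median().round(2))  # batch medians now aligned
```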

320 citations


Journal ArticleDOI
TL;DR: This protocol describes how to use GNPS to explore uploaded metabolomics data, and provides step-by-step instructions for creating reproducible, high-quality molecular networks.
Abstract: Global Natural Product Social Molecular Networking (GNPS) is an interactive online small molecule-focused tandem mass spectrometry (MS2) data curation and analysis infrastructure. It is intended to provide as much chemical insight as possible into an untargeted MS2 dataset and to connect this chemical insight to the user's underlying biological questions. This can be performed within one liquid chromatography (LC)-MS2 experiment or at the repository scale. GNPS-MassIVE is a public data repository for untargeted MS2 data with sample information (metadata) and annotated MS2 spectra. These publicly accessible data can be annotated and updated with the GNPS infrastructure, keeping a continuous record of all changes. This knowledge is disseminated across all public data; it is a living dataset. Molecular networking, one of the main analysis tools used within the GNPS platform, creates a structured data table that reflects the molecular diversity captured in tandem mass spectrometry experiments by computing the relationships of the MS2 spectra as spectral similarity. This protocol provides step-by-step instructions for creating reproducible, high-quality molecular networks. For training purposes, the reader is led through a 90- to 120-min procedure that starts by recalling an example public dataset and its sample information and proceeds to creating and interpreting a molecular network. Each data analysis job can be shared or cloned to disseminate the knowledge gained, thus propagating information that can lead to the discovery of molecules, metabolic pathways, and ecosystem/community interactions.
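The core computation behind molecular networking is a spectral-similarity score between pairs of MS2 spectra. The sketch below shows a plain binned cosine score to convey the intuition; GNPS itself uses a more refined modified cosine with explicit peak matching and precursor-mass shifts, so treat this only as a conceptual illustration with toy peak lists.

```python
# Plain cosine similarity between two binned MS2 spectra (conceptual only).
import numpy as np

def binned_cosine(spec_a, spec_b, bin_width=0.1, max_mz=2000.0):
    """spec_* are lists of (mz, intensity) peaks; returns cosine in [0, 1]."""
    n_bins = int(max_mz / bin_width)
    va, vb = np.zeros(n_bins), np.zeros(n_bins)
    for mz, inten in spec_a:
        va[int(mz / bin_width)] += inten
    for mz, inten in spec_b:
        vb[int(mz / bin_width)] += inten
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0

spec1 = [(101.07, 35.0), (175.12, 100.0), (290.15, 60.0)]   # toy peaks
spec2 = [(101.08, 40.0), (175.11, 90.0), (312.20, 20.0)]
print(f"cosine similarity: {binned_cosine(spec1, spec2):.3f}")
```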

274 citations


Journal ArticleDOI
TL;DR: An automated BPM solution is investigated to select and compose services in an open business environment; blockchain technology (BCT) is explored and proposed to transfer and verify the trustworthiness of businesses and partners; and a BPM framework is developed to illustrate how BCT can be integrated to support prompt, reliable, and cost-effective evaluation and transfer of Quality of Service in workflow composition and management.
Abstract: Business process management (BPM) aims to optimize business processes to achieve better system performance such as higher profit, quicker response, and better services. BPM systems in Industry 4.0 are required to digitize and automate business process workflows and support the transparent interoperations of service vendors. The critical bottleneck to advancing BPM systems is the evaluation, verification, and transformation of trustworthiness and digitized assets. Most BPM systems rely heavily on domain experts or third parties to deal with trustworthiness. In this paper, an automated BPM solution is investigated to select and compose services in an open business environment; blockchain technology (BCT) is explored and proposed to transfer and verify the trustworthiness of businesses and partners; and a BPM framework is developed to illustrate how BCT can be integrated to support prompt, reliable, and cost-effective evaluation and transfer of Quality of Service in workflow composition and management.
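To make the trust-transfer idea more concrete, the sketch below shows how a hash-chained, append-only ledger can make QoS evaluations of candidate services tamper-evident during workflow composition. It is a conceptual illustration only, not the framework proposed in the paper, and the record fields are hypothetical.

```python
# Conceptual hash-chained ledger of QoS evaluations (not the paper's framework).
import hashlib, json, time

class QoSLedger:
    def __init__(self):
        self.chain = [{"index": 0, "prev_hash": "0" * 64,
                       "record": "genesis", "timestamp": time.time()}]

    @staticmethod
    def _hash(block):
        return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

    def add_qos_record(self, service_id, metrics):
        prev = self.chain[-1]
        self.chain.append({"index": prev["index"] + 1,
                           "prev_hash": self._hash(prev),
                           "record": {"service": service_id, "qos": metrics},
                           "timestamp": time.time()})

    def verify(self):
        """Any tampering with an earlier QoS record breaks the hash chain."""
        return all(self.chain[i]["prev_hash"] == self._hash(self.chain[i - 1])
                   for i in range(1, len(self.chain)))

ledger = QoSLedger()
ledger.add_qos_record("payment-service", {"latency_ms": 120, "availability": 0.999})
ledger.add_qos_record("shipping-service", {"latency_ms": 300, "availability": 0.995})
print("ledger intact:", ledger.verify())
```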

201 citations


Journal ArticleDOI
TL;DR: This is a foundational study that formalises and categorises the existing usage of AR and VR in the construction industry and provides a roadmap to guide future research efforts.

182 citations


Journal ArticleDOI
TL;DR: Five future directions of big data applications in manufacturing are presented, ranging from modelling and simulation to real-time big data analytics and cybersecurity, and several research domains are identified that are driven by the available capabilities of the big data ecosystem.
Abstract: Advanced manufacturing is one of the core national strategies in the US (AMP), Germany (Industry 4.0) and China (Made in China 2025). The emergence of the concept of Cyber-Physical Systems (CPS) and big data enables manufacturing to become smarter and more competitive among nations. Many researchers have proposed new solutions with big data enabling tools for manufacturing applications in three directions: product, production and business. Big data has been a fast-changing research area with many new opportunities for applications in manufacturing. This paper presents a systematic literature review of the state of the art of big data in manufacturing. Six key drivers of big data applications in manufacturing have been identified. The key drivers are system integration, data, prediction, sustainability, resource sharing and hardware. Based on the requirements of manufacturing, nine essential components of the big data ecosystem are captured. They are data ingestion, storage, computing, analytics, visualization, management, workflow, infrastructure and security. Several research domains are identified that are driven by the available capabilities of the big data ecosystem. Five future directions of big data applications in manufacturing are presented, ranging from modelling and simulation to real-time big data analytics and cybersecurity.

181 citations


Journal ArticleDOI
TL;DR: The metan R package is described, a collection of functions that implement a workflow‐based approach to check, manipulate and summarize typical MET data, and how they integrate into a workflow to explore and analyse MET data.
Abstract: Multi-environment trials (MET) are crucial steps in plant breeding programs that aim at increasing crop productivity to ensure global food security. The analysis of MET data requires the combination of several approaches including data manipulation, visualization, and modeling. As new methods are proposed, analyzing MET data correctly and completely remains a challenge, often intractable with existing tools. Here we describe the metan R package, a collection of functions that implement a workflow-based approach to (a) check, manipulate and summarise typical MET data; (b) analyze individual environments using both fixed and mixed-effect models; (c) compute parametric and non-parametric stability statistics; (d) implement biometrical models widely used in MET analysis; and (e) plot typical MET data quickly. In this paper, we present a summary of the functions implemented in metan and how they integrate into a workflow to explore and analyze MET data. We guide the user along a gentle learning curve and show how, by adding only a few commands or options at a time, powerful analyses can be implemented. metan offers a flexible, intuitive, and richly documented working environment with tools that will facilitate the implementation of a complete analysis of MET data sets.
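As a rough illustration of one of the non-parametric stability ideas such MET analyses rely on, the toy sketch below ranks genotypes within each environment and uses the variance of those ranks as a stability proxy. It is written in Python for illustration only; metan itself is an R package with far richer and properly validated statistics.

```python
# Toy non-parametric stability proxy: variance of yield ranks across environments.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
genotypes = [f"G{i}" for i in range(1, 6)]
environments = [f"E{j}" for j in range(1, 5)]
# toy yield table: rows = genotypes, columns = environments
yields = pd.DataFrame(rng.normal(loc=4.0, scale=0.5, size=(5, 4)),
                      index=genotypes, columns=environments)

ranks = yields.rank(axis=0, ascending=False)   # rank genotypes within each environment
summary = pd.DataFrame({
    "mean_yield": yields.mean(axis=1),
    "rank_variance": ranks.var(axis=1),        # lower = more stable across environments
}).sort_values(["rank_variance", "mean_yield"], ascending=[True, False])
print(summary.round(2))
```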

171 citations


Journal ArticleDOI
TL;DR: The performance of 14 different bagging- and boosting-based ensembles, including XGBoost, LightGBM and Random Forest, is empirically analyzed in terms of predictive capability and efficiency.

165 citations


Journal ArticleDOI
27 Nov 2020
TL;DR: DTC should be viewed as a comprehensive mode of construction that prioritizes closing the control loops rather than an extension of BIM tools integrated with sensing and monitoring technologies.
Abstract: The concept of a “digital twin” as a model for data-driven management and control of physical systems has emerged over the past decade in the domains of manufacturing, production, and operations. In the context of buildings and civil infrastructure, the notion of a digital twin remains ill-defined, with little or no consensus among researchers and practitioners of the ways in which digital twin processes and data-centric technologies can support design and construction. This paper builds on existing concepts of Building Information Modeling (BIM), lean project production systems, automated data acquisition from construction sites and supply chains, and artificial intelligence to formulate a mode of construction that applies digital twin information systems to achieve closed loop control systems. It contributes a set of four core information and control concepts for digital twin construction (DTC), which define the dimensions of the conceptual space for the information used in DTC workflows. Working from the core concepts, we propose a DTC information system workflow—including information stores, information processing functions, and monitoring technologies—according to three concentric control workflow cycles. DTC should be viewed as a comprehensive mode of construction that prioritizes closing the control loops rather than an extension of BIM tools integrated with sensing and monitoring technologies.

Journal ArticleDOI
28 May 2020
TL;DR: This paper conducted an online survey with 183 participants who work in various aspects of data science and found that data science teams are extremely collaborative and work with a variety of stakeholders and tools during the six common steps of a data science workflow (e.g., clean data and train model).
Abstract: Today, the prominence of data science within organizations has given rise to teams of data science workers collaborating on extracting insights from data, as opposed to individual data scientists working alone. However, we still lack a deep understanding of how data science workers collaborate in practice. In this work, we conducted an online survey with 183 participants who work in various aspects of data science. We focused on their reported interactions with each other (e.g., managers with engineers) and with different tools (e.g., Jupyter Notebook). We found that data science teams are extremely collaborative and work with a variety of stakeholders and tools during the six common steps of a data science workflow (e.g., clean data and train model). We also found that the collaborative practices workers employ, such as documentation, vary according to the kinds of tools they use. Based on these findings, we discuss design implications for supporting data science team collaborations and future research directions.

Journal ArticleDOI
TL;DR: The comparison results show that DGLDPSO is better than or at least comparable to other state-of-the-art large-scale optimization algorithms and workflow scheduling algorithms.
Abstract: Cloud workflow scheduling is a significant topic in both commercial and industrial applications. However, the growing scale of workflows has made such a scheduling problem increasingly challenging. Many current algorithms often deal with small- or medium-scale problems (e.g., less than 1000 tasks) and face difficulties in providing satisfactory solutions when dealing with large-scale problems, due to the curse of dimensionality. To this aim, this article proposes a dynamic group learning distributed particle swarm optimization (DGLDPSO) for large-scale optimization and extends it for large-scale cloud workflow scheduling. DGLDPSO is efficient for large-scale optimization due to its following two advantages. First, the entire population is divided into many groups, and these groups are coevolved by using the master-slave multigroup distributed model, forming a distributed PSO (DPSO) to enhance the algorithm diversity. Second, a dynamic group learning (DGL) strategy is adopted for DPSO to balance diversity and convergence. When DGLDPSO is applied to large-scale cloud workflow scheduling, an adaptive renumber strategy (ARS) is further developed to relate solutions to the resource characteristics and to make the search behavior meaningful rather than aimless. Experiments are conducted on a set of large-scale benchmark functions and on large-scale cloud workflow scheduling instances to further investigate the performance of DGLDPSO. The comparison results show that DGLDPSO is better than or at least comparable to other state-of-the-art large-scale optimization algorithms and workflow scheduling algorithms.
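The sketch below illustrates only the multigroup particle-swarm idea (a population split into groups that each track their own best attractor) on a toy continuous benchmark. It is not the paper's DGLDPSO; the dynamic group learning and adaptive renumber strategies are omitted, and the benchmark and hyper-parameters are arbitrary.

```python
# Generic multigroup PSO sketch on a toy benchmark (not the paper's DGLDPSO).
import numpy as np

def sphere(x):                      # toy benchmark: minimize sum of squares
    return float(np.sum(x ** 2))

def grouped_pso(dim=30, n_groups=4, group_size=10, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n = n_groups * group_size
    pos = rng.uniform(-5, 5, size=(n, dim))
    vel = np.zeros((n, dim))
    pbest, pbest_val = pos.copy(), np.array([sphere(p) for p in pos])
    groups = np.repeat(np.arange(n_groups), group_size)

    for _ in range(iters):
        # each group uses its own best particle as the social attractor
        gbest = np.vstack([pbest[groups == g][np.argmin(pbest_val[groups == g])]
                           for g in range(n_groups)])
        r1, r2 = rng.random((n, dim)), rng.random((n, dim))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest[groups] - pos)
        pos = pos + vel
        vals = np.array([sphere(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]

    return pbest_val.min()

print("best objective found:", grouped_pso())
```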

Journal ArticleDOI
TL;DR: A framework for interactive and explainable machine learning that enables users to understand machine learning models; diagnose model limitations using different explainable AI methods; as well as refine and optimize the models is proposed.
Abstract: We propose a framework for interactive and explainable machine learning that enables users to (1) understand machine learning models; (2) diagnose model limitations using different explainable AI methods; as well as (3) refine and optimize the models. Our framework combines an iterative XAI pipeline with eight global monitoring and steering mechanisms, including quality monitoring, provenance tracking, model comparison, and trust building. To operationalize the framework, we present explAIner, a visual analytics system for interactive and explainable machine learning that instantiates all phases of the suggested pipeline within the commonly used TensorBoard environment. We performed a user-study with nine participants across different expertise levels to examine their perception of our workflow and to collect suggestions to fill the gap between our system and framework. The evaluation confirms that our tightly integrated system leads to an informed machine learning process while disclosing opportunities for further extensions.
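As an example of the kind of global, model-agnostic diagnostic such an XAI pipeline can plug in, the sketch below computes permutation feature importance on a held-out set. The dataset, model and number of repeats are illustrative choices, not part of the explAIner system.

```python
# Permutation feature importance as a global model diagnostic (illustrative only).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:5]
for idx in top:   # features whose shuffling hurts held-out accuracy the most
    print(f"{X.columns[idx]:<25} importance = {result.importances_mean[idx]:.3f}")
```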

Posted Content
TL;DR: By mapping the challenges found to the steps of the machine learning deployment workflow, it is shown that practitioners face issues at each stage of the deployment process.
Abstract: In recent years, machine learning has received increased interest both as an academic research field and as a solution for real-world business problems. However, the deployment of machine learning models in production systems can present a number of issues and concerns. This survey reviews published reports of deploying machine learning solutions in a variety of use cases, industries and applications and extracts practical considerations corresponding to stages of the machine learning deployment workflow. Our survey shows that practitioners face challenges at each stage of deployment. The goal of this paper is to lay out a research agenda for exploring approaches to address these challenges.

Journal ArticleDOI
TL;DR: AiiDA is an open-source high-throughput infrastructure addressing the challenges arising from the needs of automated workflow management and data provenance recording, supporting throughputs of tens of thousands of processes per hour.
Abstract: The ever-growing availability of computing power and the sustained development of advanced computational methods have contributed much to recent scientific progress. These developments present new challenges driven by the sheer amount of calculations and data to manage. Next-generation exascale supercomputers will harden these challenges, such that automated and scalable solutions become crucial. In recent years, we have been developing AiiDA (aiida.net), a robust open-source high-throughput infrastructure addressing the challenges arising from the needs of automated workflow management and data provenance recording. Here, we introduce developments and capabilities required to reach sustained performance, with AiiDA supporting throughputs of tens of thousands of processes per hour, while automatically preserving and storing the full data provenance in a relational database, making it queryable and traversable and thus enabling high-performance data analytics. AiiDA's workflow language provides advanced automation, error-handling features and a flexible plugin model that allows interfacing with external simulation software. The associated plugin registry enables seamless sharing of extensions, empowering a vibrant user community dedicated to making simulations more robust, user-friendly and reproducible.
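The sketch below conveys only the general idea of automatic provenance recording, wrapping workflow steps so each call's inputs and outputs are stored for later tracing. It deliberately does not use AiiDA's actual API or storage model, and the two workflow steps are hypothetical.

```python
# Conceptual provenance recording for workflow steps (not AiiDA's API).
import functools, uuid

PROVENANCE = []   # in a real system this would be a queryable database

def track(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROVENANCE.append({
            "id": str(uuid.uuid4()),
            "process": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "outputs": result,
        })
        return result
    return wrapper

@track
def relax_structure(lattice_constant):      # hypothetical workflow step
    return {"energy": -2.1 * lattice_constant}

@track
def band_gap(relaxed):                      # hypothetical follow-up step
    return {"gap_eV": max(0.0, 1.5 + relaxed["energy"] / 10)}

gap = band_gap(relax_structure(3.9))
for record in PROVENANCE:                   # full lineage of the final result
    print(record["process"], "->", record["outputs"])
```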

Journal ArticleDOI
TL;DR: ANPELA has a unique ability to evaluate the performance of the whole LFQ workflow and enables the discovery of the optimal LFQ workflow(s) through a comprehensive performance ranking of all 560 workflows.
Abstract: Label-free quantification (LFQ) with a specific and sequentially integrated workflow of acquisition technique, quantification tool and processing method has emerged as a popular technique employed in metaproteomic research to provide a comprehensive landscape of the adaptive response of microbes to external stimuli and their interactions with other organisms or host cells. The performance of a specific LFQ workflow is highly dependent on the studied data. Hence, it is essential to discover the most appropriate one for a specific data set. However, it is challenging to perform such discovery due to the large number of possible workflows and the multifaceted nature of the evaluation criteria. Herein, a web server ANPELA (https://idrblab.org/anpela/) was developed and validated as the first tool enabling performance assessment of the whole LFQ workflow (collective assessment by five well-established criteria with distinct underlying theories), and it enabled the identification of the optimal LFQ workflow(s) by a comprehensive performance ranking. ANPELA not only automatically detects the diverse formats of data generated by all quantification tools but also provides the most complete set of processing methods among the available web servers and stand-alone tools. Systematic validation using metaproteomic benchmarks revealed ANPELA's capabilities in (1) discovering well-performing workflow(s), (2) enabling assessment from multiple perspectives and (3) validating LFQ accuracy using spiked proteins. ANPELA has a unique ability to evaluate the performance of the whole LFQ workflow and enables the discovery of the optimal LFQ workflow(s) through a comprehensive performance ranking of all 560 workflows. Therefore, it has great potential for applications in metaproteomic and other studies requiring LFQ techniques, as many features are shared among proteomic studies.

Journal ArticleDOI
TL;DR: This review provides guidance to interested users of the multi-omics domain by addressing challenges of the underlying biology, giving an overview of the available toolset, addressing common pitfalls, and acknowledging current methods’ limitations.
Abstract: Multi-omics, variously called integrated omics, pan-omics, and trans-omics, aims to combine two or more omics data sets to aid in data analysis, visualization and interpretation to determine the mechanism of a biological process. Multi-omics efforts have taken center stage in biomedical research leading to the development of new insights into biological events and processes. However, the mushrooming of a myriad of tools, datasets, and approaches tends to inundate the literature and overwhelm researchers new to the field. The aims of this review are to provide an overview of the current state of the field, inform on available reliable resources, discuss the application of statistics and machine/deep learning in multi-omics analyses, discuss findable, accessible, interoperable, reusable (FAIR) research, and point to best practices in benchmarking. Thus, we provide guidance to interested users of the domain by addressing challenges of the underlying biology, giving an overview of the available toolset, addressing common pitfalls, and acknowledging current methods' limitations. We conclude with practical advice and recommendations on software engineering and reproducibility practices to share a comprehensive awareness with new researchers in multi-omics for end-to-end workflow.

Journal ArticleDOI
TL;DR: Through this synthesis study, the many interdependencies of each step in the collection and processing chain are identified, and approaches to formalize and ensure a successful workflow and product development are outlined.
Abstract: With the increasing role that unmanned aerial systems (UAS) are playing in data collection for environmental studies, two key challenges relate to harmonizing and providing standardized guidance for data collection, and also establishing protocols that are applicable across a broad range of environments and conditions. In this context, a network of scientists are cooperating within the framework of the Harmonious Project to develop and promote harmonized mapping strategies and disseminate operational guidance to ensure best practice for data collection and interpretation. The culmination of these efforts is summarized in the present manuscript. Through this synthesis study, we identify the many interdependencies of each step in the collection and processing chain, and outline approaches to formalize and ensure a successful workflow and product development. Given the number of environmental conditions, constraints, and variables that could possibly be explored from UAS platforms, it is impractical to provide protocols that can be applied universally under all scenarios. However, it is possible to collate and systematically order the fragmented knowledge on UAS collection and analysis to identify the best practices that can best ensure the streamlined and rigorous development of scientific products.

Journal ArticleDOI
TL;DR: A blockchain model is developed to protect data security and patients' privacy, ensure data provenance, provide patients full control of their health records, and achieve patient-centric HIE.
Abstract: Health Information Exchange (HIE) exhibits remarkable benefits for patient care such as improving healthcare quality and expediting coordinated care. The Office of the National Coordinator (ONC) for Health Information Technology is seeking patient-centric HIE designs that shift data ownership from providers to patients. There are multiple barriers to patient-centric HIE in the current system, such as security and privacy concerns, data inconsistency, and timely access to the right records across multiple healthcare facilities. After investigating the current workflow of HIE, this paper provides a feasible solution to these challenges by utilizing the unique features of blockchain, a distributed ledger technology which is considered "unhackable". Utilizing the smart contract feature, which is a programmable self-executing protocol running on a blockchain, we developed a blockchain model to protect data security and patients' privacy, ensure data provenance, and provide patients full control of their health records. By personalizing data segmentation and an "allowed list" for clinicians to access their data, this design achieves patient-centric HIE. We conducted a large-scale simulation of this patient-centric HIE process and quantitatively evaluated the model's feasibility, stability, security, and robustness.
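To illustrate the patient-managed "allowed list" idea in isolation, the sketch below grants or denies access to record segments and appends every decision to an audit trail. It is a conceptual stand-in, not the paper's smart contract, and the identifiers are hypothetical.

```python
# Conceptual patient-managed allowed list with an append-only audit trail.
import time

class PatientRecordContract:
    def __init__(self, patient_id):
        self.patient_id = patient_id
        self.allowed = {}          # segment -> set of clinician ids
        self.audit_log = []        # append-only history of access decisions

    def grant(self, clinician_id, segment):
        self.allowed.setdefault(segment, set()).add(clinician_id)

    def revoke(self, clinician_id, segment):
        self.allowed.get(segment, set()).discard(clinician_id)

    def request_access(self, clinician_id, segment):
        granted = clinician_id in self.allowed.get(segment, set())
        self.audit_log.append({"time": time.time(), "clinician": clinician_id,
                               "segment": segment, "granted": granted})
        return granted

contract = PatientRecordContract("patient-001")
contract.grant("dr-smith", "lab_results")
print(contract.request_access("dr-smith", "lab_results"))   # True
print(contract.request_access("dr-jones", "lab_results"))   # False, but logged
```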

Journal ArticleDOI
TL;DR: Generally applicable recommendations on implementation and quality assurance of AI models are presented for commonly used applications in radiotherapy such as auto-segmentation, automated treatment planning and synthetic computed tomography.

Journal ArticleDOI
TL;DR: A theoretical human-centered framework for Operator 4.0 is defined, based on data collection about the workers’ performance, actions and reactions, with the final objective to improve the overall factory performance and organization.

Journal ArticleDOI
TL;DR: The empirical study, based on real-world applications from the Pegasus workflow management system, reveals that the NN-DNSGA-II algorithm significantly outperforms the other alternatives in most cases with respect to metrics used for DMOPs with an unknown true Pareto-optimal front, including the number of non-dominated solutions, Schott's spacing and the hypervolume indicator.
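For reference, one of the quality indicators named here, Schott's spacing metric, measures how evenly solutions are spread along an approximated Pareto front (lower is more uniform). The sketch below computes it for a toy two-objective front.

```python
# Schott's spacing metric for an approximated Pareto front (toy example).
import numpy as np

def schott_spacing(front):
    """front: (n_solutions, n_objectives) array of objective vectors."""
    front = np.asarray(front, dtype=float)
    n = len(front)
    # d_i: Manhattan distance from solution i to its nearest neighbour on the front
    dists = np.abs(front[:, None, :] - front[None, :, :]).sum(axis=2)
    np.fill_diagonal(dists, np.inf)
    d = dists.min(axis=1)
    return float(np.sqrt(np.sum((d.mean() - d) ** 2) / (n - 1)))

toy_front = [(1.0, 9.0), (2.0, 7.5), (3.0, 6.5), (5.0, 4.0), (8.0, 1.0)]
print(f"Schott's spacing: {schott_spacing(toy_front):.3f}")
```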

Posted ContentDOI
14 Jan 2020-bioRxiv
TL;DR: The metan R package is described, a collection of functions that implement a workflow-based approach to check, manipulate and summarise typical MET data, and how they integrate into a workflow to explore and analyze MET data.
Abstract: Multi-environment trials (MET) are crucial steps in plant breeding programs that aim at increasing crop productivity to ensure global food security. The analysis of MET data requires the combination of several approaches including data manipulation, visualization, and modeling. As new methods are proposed, analyzing MET data correctly and completely remains a challenge, often intractable with existing tools. Here we describe the metan R package, a collection of functions that implement a workflow-based approach to (a) check, manipulate and summarise typical MET data; (b) analyze individual environments using both fixed and mixed-effect models; (c) compute parametric and non-parametric stability statistics; (d) implement biometrical models widely used in MET analysis; and (e) plot typical MET data quickly. In this paper, we present a summary of the functions implemented in metan and how they integrate into a workflow to explore and analyze MET data. We guide the user along a gentle learning curve and show how, by adding only a few commands or options at a time, powerful analyses can be implemented. metan offers a flexible, intuitive, and richly documented working environment with tools that will facilitate the implementation of a complete analysis of MET data sets.

Journal ArticleDOI
TL;DR: The workflow of ML for computational studies of materials, with a specific interest in the prediction of materials properties, is illustrated, and the fundamental ideas of ML are presented.
Abstract: We give here a brief overview of the use of machine learning (ML) in our field, for chemists and materials scientists with no experience with these techniques. We illustrate the workflow of ML for computational studies of materials, with a specific interest in the prediction of materials properties. We present concisely the fundamental ideas of ML, and for each stage of the workflow, we give examples of the possibilities and questions to be considered in implementing ML-based modeling.
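A minimal end-to-end sketch of the workflow the review describes (featurize, split, fit, evaluate) is given below. The descriptor columns and the synthetic target standing in for a materials property are purely illustrative placeholders, not a real dataset.

```python
# Toy materials-property prediction workflow: featurize, split, fit, evaluate.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# hypothetical composition-derived descriptors for 200 toy "materials"
data = pd.DataFrame({
    "mean_atomic_radius": rng.uniform(1.0, 2.5, 200),
    "mean_electronegativity": rng.uniform(0.8, 3.5, 200),
    "valence_electron_count": rng.integers(1, 12, 200),
})
# synthetic target standing in for a property such as formation energy
target = (-1.2 * data["mean_electronegativity"]
          + 0.4 * data["mean_atomic_radius"]
          + rng.normal(0, 0.1, 200))

X_tr, X_te, y_tr, y_te = train_test_split(data, target, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("MAE on held-out materials:",
      round(mean_absolute_error(y_te, model.predict(X_te)), 3))
```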

Journal ArticleDOI
TL;DR: The workflow for liver lesion detection, segmentation, classification, monitoring, and prediction of tumor recurrence and patient survival is illustrated, and challenges are discussed, including ethical considerations, cohorting, data collection, anonymization, and availability of expert annotations.
Abstract: Interest for deep learning in radiology has increased tremendously in the past decade due to the high achievable performance for various computer vision tasks such as detection, segmentation, classification, monitoring, and prediction. This article provides step-by-step practical guidance for conducting a project that involves deep learning in radiology, from defining specifications, to deployment and scaling. Specifically, the objectives of this article are to provide an overview of clinical use cases of deep learning, describe the composition of multi-disciplinary team, and summarize current approaches to patient, data, model, and hardware selection. Key ideas will be illustrated by examples from a prototypical project on imaging of colorectal liver metastasis. This article illustrates the workflow for liver lesion detection, segmentation, classification, monitoring, and prediction of tumor recurrence and patient survival. Challenges are discussed, including ethical considerations, cohorting, data collection, anonymization, and availability of expert annotations. The practical guidance may be adapted to any project that requires automated medical image analysis.
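One small but essential step in such a segmentation workflow is scoring predictions against expert annotations; the sketch below computes the Dice similarity coefficient for toy binary masks.

```python
# Dice similarity coefficient between a predicted mask and an expert annotation.
import numpy as np

def dice_coefficient(pred, truth, eps=1e-8):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

truth = np.zeros((64, 64), dtype=np.uint8)
truth[20:40, 20:40] = 1                      # expert-annotated lesion
pred = np.zeros_like(truth)
pred[22:42, 18:38] = 1                       # model prediction, slightly offset
print(f"Dice: {dice_coefficient(pred, truth):.3f}")
```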

Journal ArticleDOI
31 Jan 2020
TL;DR: This paper argues that FAIR principles for workflows need to address their specific nature in terms of their composition of executable software steps, their provenance, and their development.
Abstract: Computational workflows describe the complex multi-step methods that are used for data collection, data preparation, analytics, predictive modelling, and simulation that lead to new data products. ...

Journal ArticleDOI
TL;DR: This review focuses on the main bottom-up physics-based UBEM tools, comparing them from a user-oriented perspective, and highlights major differences between UBEM tools that must be considered to choose the proper one for an application.

Journal ArticleDOI
TL;DR: A novel resource provisioning mechanism and a workflow scheduling algorithm, named Greedy Resource Provisioning and modified HEFT (GRP-HEFT), are proposed for minimizing the makespan of a given workflow subject to a budget constraint under the hourly-based cost model of modern IaaS clouds.
Abstract: In Infrastructure as a Service (IaaS) Clouds, users are charged to utilize cloud services according to a pay-per-use model. If users intend to run their workflow applications on cloud resources within a specific budget, they have to adjust their demands for cloud resources with respect to this budget. Although several scheduling approaches have introduced solutions to optimize the makespan of workflows on a set of heterogeneous IaaS cloud resources within a certain budget, the hourly-based cost model of some well-known cloud providers (e.g., Amazon EC2 Cloud) can easily lead to a higher makespan and some schedulers may not find any feasible solution. In this article, we propose a novel resource provisioning mechanism and a workflow scheduling algorithm, named Greedy Resource Provisioning and modified HEFT (GRP-HEFT), for minimizing the makespan of a given workflow subject to a budget constraint for the hourly-based cost model of modern IaaS clouds. As a resource provisioning mechanism, we propose a greedy algorithm which lists the instance types according to their efficiency rate. For our scheduler, we modified the HEFT algorithm to consider a budget limit. GRP-HEFT is compared against state-of-the-art workflow scheduling techniques, including MOACS (Multi-Objective Ant Colony System), PSO (Particle Swarm Optimization), and GA (Genetic Algorithm). The experimental results demonstrate that GRP-HEFT outperforms GA, PSO, and MOACS for several well-known scientific workflow applications for different problem sizes on average by 13.64, 19.77, and 11.69 percent, respectively. Also in terms of time complexity, GRP-HEFT outperforms GA, PSO and MOACS.
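The sketch below illustrates only the greedy-provisioning intuition: rank instance types by an efficiency rate (compute per unit hourly cost) and rent the most efficient ones that still fit the remaining budget. It is not the paper's GRP-HEFT algorithm, which additionally modifies HEFT task scheduling; the instance catalog is hypothetical.

```python
# Greedy instance provisioning by efficiency rate under a budget (illustrative).
from dataclasses import dataclass

@dataclass
class InstanceType:
    name: str
    compute_units: float    # relative processing speed
    hourly_price: float

    @property
    def efficiency(self):
        return self.compute_units / self.hourly_price

def greedy_provision(instance_types, budget, hours_needed):
    """Pick instances in decreasing efficiency order while the budget allows."""
    plan, remaining = [], budget
    for itype in sorted(instance_types, key=lambda t: t.efficiency, reverse=True):
        cost = itype.hourly_price * hours_needed
        while remaining >= cost:
            plan.append(itype.name)
            remaining -= cost
    return plan, remaining

catalog = [InstanceType("small", 1.0, 0.10),     # hypothetical catalog
           InstanceType("large", 3.5, 0.40),
           InstanceType("xlarge", 6.0, 0.90)]
plan, leftover = greedy_provision(catalog, budget=5.0, hours_needed=3)
print("provisioned:", plan, "| budget left:", round(leftover, 2))
```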

Journal ArticleDOI
29 Sep 2020
TL;DR: Wang et al. study implicit knowledge in the industrial Internet of Things (IIoT) by using collaborative learning techniques and, considering the increased dimensions and dynamics of IoT devices, explore the possible relationships between users and between APIs.
Abstract: The industrial Internet of Things (IIoT), a new computing mode in Industry 4.0, is deployed to connect IoT devices and use communication technology to respond to control commands and handle industrial data. IIoT is typically employed to improve the efficiency of computing and sensing and can be used in many scenarios, such as intelligent manufacturing and video surveillance. To build an IIoT system, we need a collection of software to manage and monitor each system component when there are large numbers of devices. An application programming interface (API) is an effective way to invoke public services provided by different platforms. Developers can invoke different APIs to operate IoT devices without knowing the implementation process. We can design a workflow to configure how and when to invoke target APIs. Thus, APIs are a powerful tool for rapidly developing industrial systems. However, the increasing number of APIs exacerbates the problem of finding suitable APIs. Current recommendation methods have shortcomings: for example, most existing methods focus on the relation between users and APIs but neglect the valuable relations among the users or APIs themselves. To address these problems, this article studies implicit knowledge in IIoT by using collaborative learning techniques. Considering the increased dimensions and dynamics of IoT devices, we explore the possible relationships between users and between APIs. We enhance the matrix factorization (MF) model with the mined implicit knowledge, that is, the implicit relationships on both sides. We build an ensemble model by using all implicit knowledge. We conduct experiments on a collected real-world dataset and simulate industrial system scenarios. The experimental results verify the effectiveness and superiority of the proposed models.
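As background for the models being enhanced, the sketch below trains a plain matrix-factorization baseline by stochastic gradient descent on a handful of toy user-API interaction scores. The paper's contribution, adding mined user-user and API-API implicit relationships and an ensemble, is not reproduced here; the data and hyper-parameters are illustrative.

```python
# Plain matrix-factorization baseline for user-API score prediction (toy data).
import numpy as np

def train_mf(ratings, n_factors=4, lr=0.01, reg=0.05, epochs=300, seed=0):
    """ratings: list of (user_idx, api_idx, score) observations."""
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_apis = max(a for _, a, _ in ratings) + 1
    P = rng.normal(0, 0.1, (n_users, n_factors))   # user latent factors
    Q = rng.normal(0, 0.1, (n_apis, n_factors))    # API latent factors
    for _ in range(epochs):
        for u, a, r in ratings:
            pu, qa = P[u].copy(), Q[a].copy()
            err = r - pu @ qa
            P[u] += lr * (err * qa - reg * pu)     # SGD update for the user
            Q[a] += lr * (err * pu - reg * qa)     # SGD update for the API
    return P, Q

observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 2.0), (2, 1, 4.5)]
P, Q = train_mf(observed)
print("predicted score for user 2, API 0:", round(float(P[2] @ Q[0]), 2))
```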