
Showing papers on "Workflow" published in 2012


Journal ArticleDOI
TL;DR: Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to compute clusters without modifying the workflow.
Abstract: Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to compute clusters without modifying the workflow. It is the first system to support the use of automatically inferred multiple named wildcards (or variables) in input and output filenames.
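
As a rough illustration of the named-wildcard idea (not Snakemake's actual implementation, and using an invented file pattern), the sketch below shows how wildcard values can be inferred by matching a concrete output filename against a pattern containing {name} placeholders:

```python
import re

def infer_wildcards(pattern, filename):
    """Infer named wildcard values by matching a concrete filename against a
    pattern such as 'mapped/{sample}.{ext}'. Illustrative sketch only."""
    regex = ""
    for part in re.split(r"(\{\w+\})", pattern):
        if part.startswith("{") and part.endswith("}"):
            regex += "(?P<%s>.+)" % part[1:-1]   # placeholder -> named group
        else:
            regex += re.escape(part)             # literal text
    match = re.fullmatch(regex, filename)
    return match.groupdict() if match else None

print(infer_wildcards("mapped/{sample}.{ext}", "mapped/A123.bam"))
# -> {'sample': 'A123', 'ext': 'bam'}
```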

1,932 citations


Journal ArticleDOI
TL;DR: How SRM is applied in proteomics is described, recent advances are reviewed, present selected applications and a perspective on the future of this powerful technology is provided.
Abstract: Selected reaction monitoring (SRM) is a targeted mass spectrometry technique that is emerging in the field of proteomics as a complement to untargeted shotgun methods. SRM is particularly useful when predetermined sets of proteins, such as those constituting cellular networks or sets of candidate biomarkers, need to be measured across multiple samples in a consistent, reproducible and quantitatively precise manner. Here we describe how SRM is applied in proteomics, review recent advances, present selected applications and provide a perspective on the future of this powerful technology.

1,187 citations


Journal ArticleDOI
TL;DR: The Biological General Repository for Interaction Datasets (BioGRID) is an open access archive of genetic and protein interactions that are curated from the primary biomedical literature for all major model organism species.
Abstract: The Biological General Repository for Interaction Datasets (BioGRID: http://thebiogrid.org) is an open access archive of genetic and protein interactions that are curated from the primary biomedical literature for all major model organism species. As of September 2012, BioGRID houses more than 500 000 manually annotated interactions from more than 30 model organisms. BioGRID maintains complete curation coverage of the literature for the budding yeast Saccharomyces cerevisiae, the fission yeast Schizosaccharomyces pombe and the model plant Arabidopsis thaliana. A number of themed curation projects in areas of biomedical importance are also supported. BioGRID has established collaborations and/or shares data records for the annotation of interactions and phenotypes with most major model organism databases, including Saccharomyces Genome Database, PomBase, WormBase, FlyBase and The Arabidopsis Information Resource. BioGRID also actively engages with the text-mining community to benchmark and deploy automated tools to expedite curation workflows. BioGRID data are freely accessible through both a user-defined interactive interface and in batch downloads in a wide variety of formats, including PSI-MI2.5 and tab-delimited files. BioGRID records can also be interrogated and analyzed with a series of new bioinformatics tools, which include a post-translational modification viewer, a graphical viewer, a REST service and a Cytoscape plugin.

1,000 citations


Posted Content
TL;DR: In this paper, the authors outline a framework that will enable crowd work that is complex, collaborative, and sustainable, and lay out research challenges in twelve major areas: workflow, task assignment, hierarchy, real-time response, synchronous collaboration, quality control, crowds guiding AIs, AIs guiding crowds, platforms, job design, reputation, and motivation.
Abstract: Paid crowd work offers remarkable opportunities for improving productivity, social mobility, and the global economy by engaging a geographically distributed workforce to complete complex tasks on demand and at scale. But it is also possible that crowd work will fail to achieve its potential, focusing on assembly-line piecework. Can we foresee a future crowd workplace in which we would want our children to participate? This paper frames the major challenges that stand in the way of this goal. Drawing on theory from organizational behavior and distributed computing, as well as direct feedback from workers, we outline a framework that will enable crowd work that is complex, collaborative, and sustainable. The framework lays out research challenges in twelve major areas: workflow, task assignment, hierarchy, real-time response, synchronous collaboration, quality control, crowds guiding AIs, AIs guiding crowds, platforms, job design, reputation, and motivation.

803 citations


Journal ArticleDOI
TL;DR: This work introduces a methodology for the application of process mining techniques that leads to the identification of regular behavior, process variants, and exceptional medical cases in a case study conducted at a hospital emergency service.

452 citations


Journal ArticleDOI
TL;DR: This protocol describes three workflows based on the NetworkAnalyzer and RINalyzer plug-ins for Cytoscape, a popular software platform for networks, to perform a topological analysis of biological networks.
Abstract: Computational analysis and interactive visualization of biological networks and protein structures are common tasks for gaining insight into biological processes. This protocol describes three workflows based on the NetworkAnalyzer and RINalyzer plug-ins for Cytoscape, a popular software platform for networks. NetworkAnalyzer has become a standard Cytoscape tool for comprehensive network topology analysis. In addition, RINalyzer provides methods for exploring residue interaction networks derived from protein structures. The first workflow uses NetworkAnalyzer to perform a topological analysis of biological networks. The second workflow applies RINalyzer to study protein structure and function and to compute network centrality measures. The third workflow combines NetworkAnalyzer and RINalyzer to compare residue networks. The full protocol can be completed in ∼2 h.
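
NetworkAnalyzer and RINalyzer are Cytoscape plug-ins rather than a scripting API, so the following is only a hedged analogue, in Python/networkx, of the kind of topology metrics the first workflow computes (degree centrality, betweenness, clustering); the toy network and node names are invented:

```python
import networkx as nx

# A toy interaction network; nodes and edges are made up for illustration.
G = nx.Graph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")])

metrics = {
    "degree_centrality": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "clustering": nx.clustering(G),
}
for name, values in metrics.items():
    top = max(values, key=values.get)
    print(f"{name}: highest for node {top} ({values[top]:.3f})")
```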

424 citations


Proceedings ArticleDOI
08 Oct 2012
TL;DR: WorkflowSim as mentioned in this paper extends the existing CloudSim simulator by providing a higher layer of workflow management, which takes into consideration heterogeneous system overheads and failures, and it is shown that ignoring system overheads and failures when simulating scientific workflows can cause significant inaccuracies in the predicted workflow runtime.
Abstract: Simulation is one of the most popular evaluation methods in scientific workflow studies. However, existing workflow simulators fail to provide a framework that takes into consideration heterogeneous system overheads and failures. They also lack support for widely used workflow optimization techniques such as task clustering. In this paper, we introduce WorkflowSim, which extends the existing CloudSim simulator by providing a higher layer of workflow management. We also show that ignoring system overheads and failures when simulating scientific workflows can cause significant inaccuracies in the predicted workflow runtime. To further validate its value in promoting other research work, we introduce two promising research areas for which WorkflowSim provides a unique and effective evaluation platform.
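
WorkflowSim itself is a Java extension of CloudSim; the minimal Monte Carlo sketch below (with invented overhead and failure-rate values) only illustrates the paper's point that per-task overheads and retried failures inflate the makespan well beyond an idealized estimate:

```python
import random

def simulate_makespan(task_runtimes, overhead=5.0, failure_rate=0.05, trials=1000):
    """Average makespan of a serial task chain when each task pays a scheduling
    overhead and may fail and be retried. Values are illustrative only."""
    totals = []
    for _ in range(trials):
        total = 0.0
        for runtime in task_runtimes:
            total += overhead + runtime
            while random.random() < failure_rate:   # retry on failure
                total += overhead + runtime
        totals.append(total)
    return sum(totals) / trials

tasks = [30.0, 45.0, 20.0, 60.0]
print("ideal estimate :", sum(tasks))
print("with overheads :", round(simulate_makespan(tasks), 1))
```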

405 citations


Book ChapterDOI
01 Jan 2012
TL;DR: The NeOn Methodology suggests a variety of pathways for developing ontologies in commonly occurring situations, for example, when available ontologies need to be re-engineered, aligned, modularized, localized to support different languages and cultures, and integrated with ontology design patterns and non-ontological resources.
Abstract: In contrast to other approaches that provide methodological guidance for ontology engineering, the NeOn Methodology does not prescribe a rigid workflow, but instead it suggests a variety of pathways for developing ontologies. The nine scenarios proposed in the methodology cover commonly occurring situations, for example, when available ontologies need to be re-engineered, aligned, modularized, localized to support different languages and cultures, and integrated with ontology design patterns and non-ontological resources, such as folksonomies or thesauri. In addition, the NeOn Methodology framework provides (a) a glossary of processes and activities involved in the development of ontologies, (b) two ontology life cycle models, and (c) a set of methodological guidelines for different processes and activities, which are described (a) functionally, in terms of goals, inputs, outputs, and relevant constraints; (b) procedurally, by means of workflow specifications; and (c) empirically, through a set of illustrative examples.

361 citations


Proceedings ArticleDOI
10 Nov 2012
TL;DR: It is found that the key factor determining the performance of an algorithm is its ability to decide which workflows in an ensemble to admit or reject for execution, and an admission procedure based on workflow structure and estimates of task runtimes can significantly improve the quality of solutions.
Abstract: Large-scale applications expressed as scientific workflows are often grouped into ensembles of inter-related workflows. In this paper, we address a new and important problem concerning the efficient management of such ensembles under budget and deadline constraints on Infrastructure-as-a-Service (IaaS) clouds. We discuss, develop, and assess algorithms based on static and dynamic strategies for both task scheduling and resource provisioning. We perform the evaluation via simulation using a set of scientific workflow ensembles with a broad range of budget and deadline parameters, taking into account uncertainties in task runtime estimations, provisioning delays, and failures. We find that the key factor determining the performance of an algorithm is its ability to decide which workflows in an ensemble to admit or reject for execution. Our results show that an admission procedure based on workflow structure and estimates of task runtimes can significantly improve the quality of solutions.
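
A minimal sketch of the admission idea highlighted above, not the paper's actual algorithms: a workflow is admitted only if its estimated cost fits the remaining budget and its critical path (computed from task runtime estimates) fits the deadline. Task names, runtimes and prices are illustrative.

```python
def critical_path(tasks, deps):
    """Longest path through the task graph, given runtime estimates and
    dependency edges (pred, succ). Assumes the graph is acyclic."""
    memo = {}
    def finish(t):
        if t not in memo:
            preds = [p for p, s in deps if s == t]
            memo[t] = tasks[t] + max((finish(p) for p in preds), default=0.0)
        return memo[t]
    return max(finish(t) for t in tasks)

def admit(tasks, deps, price_per_second, budget_left, deadline):
    """Admit a workflow only if its estimated cost fits the remaining budget
    and its critical path fits the deadline (illustrative rule only)."""
    est_cost = sum(tasks.values()) * price_per_second
    return est_cost <= budget_left and critical_path(tasks, deps) <= deadline

tasks = {"t1": 100.0, "t2": 200.0, "t3": 150.0}
deps = [("t1", "t2"), ("t1", "t3")]
print(admit(tasks, deps, price_per_second=0.001, budget_left=1.0, deadline=400.0))
```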

312 citations


Proceedings ArticleDOI
11 Feb 2012
TL;DR: It is argued that Turkomatic's collaborative approach can be more successful than the conventional workflow design process and implications for the design of collaborative crowd planning systems are discussed.
Abstract: Preparing complex jobs for crowdsourcing marketplaces requires careful attention to workflow design, the process of decomposing jobs into multiple tasks, which are solved by multiple workers. Can the crowd help design such workflows? This paper presents Turkomatic, a tool that recruits crowd workers to aid requesters in planning and solving complex jobs. While workers decompose and solve tasks, requesters can view the status of worker-designed workflows in real time; intervene to change tasks and solutions; and request new solutions to subtasks from the crowd. These features lower the threshold for crowd employers to request complex work. During two evaluations, we found that allowing the crowd to plan without requester supervision is partially successful, but that requester intervention during workflow planning and execution improves quality substantially. We argue that Turkomatic's collaborative approach can be more successful than the conventional workflow design process and discuss implications for the design of collaborative crowd planning systems.

298 citations


Journal ArticleDOI
TL;DR: A new representation of interventions as multidimensional time series formed by synchronized signals acquired over time is proposed; the resulting workflow models combine low-level signals with high-level information such as predefined phases and can be used to detect actions and trigger events.

Patent
27 Jun 2012
TL;DR: A scalable data storage service may maintain tables in a non-relational data store on behalf of clients as discussed by the authors, where items stored in tables may be partitioned and indexed using a simple or composite primary key.
Abstract: A system that implements a scalable data storage service may maintain tables in a non-relational data store on behalf of clients. The system may provide a Web services interface through which service requests are received, and an API usable to request that a table be created, deleted, or described; that an item be stored, retrieved, deleted, or its attributes modified; or that a table be queried (or scanned) with filtered items and/or their attributes returned. An asynchronous workflow may be invoked to create or delete a table. Items stored in tables may be partitioned and indexed using a simple or composite primary key. The system may not impose pre-defined limits on table size, and may employ a flexible schema. The service may provide a best-effort or committed throughput model. The system may automatically scale and/or re-partition tables in response to detecting workload changes, node failures, or other conditions or anomalies.
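
As a generic illustration of partitioning items by a composite primary key (not the patented system's actual scheme), the sketch below hashes the hash-key attribute of each key to pick a partition; the attribute names and fixed partition count are invented:

```python
import hashlib

NUM_PARTITIONS = 8  # illustrative; a real service would re-partition dynamically

def partition_for(hash_key):
    """Map the hash-key attribute of a composite primary key to a partition."""
    digest = hashlib.md5(str(hash_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Items with a composite primary key (customer_id = hash key, order_ts = range key).
items = [("alice", 1001), ("alice", 1002), ("bob", 1001)]
for hash_key, range_key in items:
    print(hash_key, range_key, "-> partition", partition_for(hash_key))
```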

Patent
02 Mar 2012
TL;DR: In this article, a deployment system enables a developer to generate a deployment plan according to a logical, multi-tier application blueprint defined by application architects, which includes tasks to be executed for deploying application components on virtual computing resources provided in a cloud infrastructure.
Abstract: A deployment system enables a developer to generate a deployment plan according to a logical, multi-tier application blueprint defined by application architects. The deployment plan includes tasks to be executed for deploying application components on virtual computing resources provided in a cloud infrastructure. The deployment plan includes time dependencies that determine an execution order of the tasks according to dependencies between application components specified in the application blueprint. The deployment plan enables system administrators to view the application blueprint as an ordered workflow view that facilitates collaboration between system administrators and application architects.
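
A minimal sketch of deriving an execution order from component dependencies, assuming a hypothetical blueprint expressed as a task-to-prerequisites map; this illustrates the ordered-workflow idea, not the patented deployment system:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical blueprint: each deployment task lists the tasks it depends on.
blueprint = {
    "provision_vm":   [],
    "install_db":     ["provision_vm"],
    "install_appsrv": ["provision_vm"],
    "deploy_war":     ["install_appsrv", "install_db"],
    "configure_lb":   ["deploy_war"],
}

# Topological sort yields an execution order consistent with the dependencies.
plan = list(TopologicalSorter(blueprint).static_order())
print("execution order:", plan)
```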

Journal ArticleDOI
TL;DR: DataSpaces essentially implements a semantically specialized virtual shared space abstraction that can be associatively accessed by all components and services in the application workflow. It enables live data to be extracted from running simulation components, indexes this data online, and then allows it to be monitored, queried and accessed by other components and services via the space using semantically meaningful operators.
Abstract: Emerging high-performance distributed computing environments are enabling new end-to-end formulations in science and engineering that involve multiple interacting processes and data-intensive application workflows. For example, current fusion simulation efforts are exploring coupled models and codes that simultaneously simulate separate application processes, such as the core and the edge turbulence. These components run on different high performance computing resources, need to interact at runtime with each other and with services for data monitoring, data analysis and visualization, and data archiving. As a result, they require efficient and scalable support for dynamic and flexible couplings and interactions, which remains a challenge. This paper presents DataSpaces, a flexible interaction and coordination substrate that addresses this challenge. DataSpaces essentially implements a semantically specialized virtual shared space abstraction that can be associatively accessed by all components and services in the application workflow. It enables live data to be extracted from running simulation components, indexes this data online, and then allows it to be monitored, queried and accessed by other components and services via the space using semantically meaningful operators. The underlying data transport is asynchronous, low-overhead and largely memory-to-memory. The design, implementation, and experimental evaluation of DataSpaces using a coupled fusion simulation workflow are presented.
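
DataSpaces is an HPC substrate written for C/MPI codes; the toy class below (with invented method and variable names) only illustrates the shared-space abstraction described here: producers put named data tagged with a timestep, and consumers query and retrieve it associatively.

```python
class ToySpace:
    """In-memory stand-in for a shared space: objects are indexed by
    (variable name, timestep) and can be queried associatively."""
    def __init__(self):
        self._index = {}

    def put(self, var, timestep, data):
        self._index[(var, timestep)] = data

    def get(self, var, timestep):
        return self._index.get((var, timestep))

    def query(self, var):
        """All timesteps currently available for a variable."""
        return sorted(t for (v, t) in self._index if v == var)

space = ToySpace()
space.put("edge_turbulence", 10, [0.1, 0.2, 0.4])   # producer component
print(space.query("edge_turbulence"))                # consumer discovers data
print(space.get("edge_turbulence", 10))
```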

Journal ArticleDOI
TL;DR: The results of this study indicate that the HeuristicsMiner algorithm is especially suited to real-life settings, and it is shown that, particularly for highly complex event logs, knowledge discovery from such data sets can become a major problem for traditional process discovery techniques.
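
For readers unfamiliar with HeuristicsMiner, the sketch below computes direct-follows counts from a toy event log and the dependency measure commonly associated with the algorithm, (|a>b| - |b>a|) / (|a>b| + |b>a| + 1); the event log is invented and the sketch is not the algorithm's full implementation.

```python
from collections import Counter

def dependency_measures(traces):
    """Direct-follows counts and a HeuristicsMiner-style dependency measure
    for every observed pair of directly-following activities."""
    follows = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            follows[(a, b)] += 1
    deps = {}
    for (a, b), n_ab in follows.items():
        n_ba = follows[(b, a)]
        deps[(a, b)] = (n_ab - n_ba) / (n_ab + n_ba + 1)
    return deps

log = [["register", "triage", "treat"], ["register", "treat", "triage"],
       ["register", "triage", "treat"]]
for pair, value in sorted(dependency_measures(log).items()):
    print(pair, round(value, 2))
```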

Journal ArticleDOI
TL;DR: This paper proposes a new QoS-based workflow scheduling algorithm based on a novel concept called Partial Critical Paths (PCP), which tries to minimize the cost of workflow execution while meeting a user-defined deadline.
Abstract: Recently, utility Grids have emerged as a new model of service provisioning in heterogeneous distributed systems. In this model, users negotiate with service providers on their required Quality of Service and on the corresponding price to reach a Service Level Agreement. One of the most challenging problems in utility Grids is workflow scheduling, i.e., the problem of satisfying the QoS of the users as well as minimizing the cost of workflow execution. In this paper, we propose a new QoS-based workflow scheduling algorithm based on a novel concept called Partial Critical Paths (PCP), which tries to minimize the cost of workflow execution while meeting a user-defined deadline. The PCP algorithm has two phases: in the deadline distribution phase it recursively assigns subdeadlines to the tasks on the partial critical paths ending at previously assigned tasks, and in the planning phase it assigns the cheapest service to each task while meeting its subdeadline. The simulation results show that the performance of the PCP algorithm is very promising.
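
A minimal sketch of the planning-phase rule described in the abstract: for a task with an assigned sub-deadline, choose the cheapest service offer whose estimated runtime still meets that sub-deadline. The service offers and numbers are illustrative, and the deadline-distribution phase is not shown.

```python
def plan_task(services, subdeadline):
    """Return the cheapest offer that finishes within the sub-deadline,
    or None if no offer qualifies. Offers are illustrative."""
    feasible = [s for s in services if s["runtime"] <= subdeadline]
    return min(feasible, key=lambda s: s["cost"]) if feasible else None

offers = [
    {"name": "slow-cheap",  "runtime": 120.0, "cost": 0.02},
    {"name": "medium",      "runtime": 60.0,  "cost": 0.05},
    {"name": "fast-pricey", "runtime": 20.0,  "cost": 0.20},
]
print(plan_task(offers, subdeadline=90.0))   # -> the 'medium' offer
```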

Book ChapterDOI
01 Jan 2012
TL;DR: An architecture that allows running workflow instances to be modified dynamically, based on an object-oriented approach, is introduced, and case handling is introduced as a technique for flexible process enactment based on data dependencies rather than process structures.
Abstract: BPM architectures are in the centre of Chapter 7, starting from the WfMC Architecture and proceeding towards service oriented architectures and architectures for flexible workflow management. In particular, an architecture that allows running workflow instances to be modified dynamically, based on an object-oriented approach, is introduced. Web services and their composition are sketched, describing the core concepts of the XML-based service composition language WS-BPEL. Advanced service composition based on semantic concepts is sketched, and case handling is introduced as a technique for flexible process enactment based on data dependencies rather than process structures.

Journal ArticleDOI
TL;DR: Bpipe is a simple, dedicated programming language for defining and executing bioinformatics pipelines that is fully self-contained and cross-platform, making it very easy to adopt and deploy into existing environments.
Abstract: Summary: Bpipe is a simple, dedicated programming language for defining and executing bioinformatics pipelines. It specializes in enabling users to turn existing pipelines based on shell scripts or command line tools into highly flexible, adaptable and maintainable workflows with a minimum of effort. Bpipe ensures that pipelines execute in a controlled and repeatable fashion and keeps audit trails and logs to ensure that experimental results are reproducible. Requiring only Java as a dependency, Bpipe is fully self-contained and cross-platform, making it very easy to adopt and deploy into existing environments. Availability and implementation: Bpipe is freely available from http://bpipe.org under a BSD License.

Journal ArticleDOI
TL;DR: This short article will concentrate only on cheminformatics applications and the workflow tools most commonly used in cheminformatics, namely Pipeline Pilot and KNIME.
Abstract: There are many examples of scientific workflow systems [1, 2]; in this short article I will concentrate only on cheminformatics applications and the workflow tools most commonly used in cheminformatics, namely Pipeline Pilot [3] and KNIME [4]. Workflow solutions have been used for years in bioinformatics and other sciences, and some also have applications in so-called “business intelligence” and “predictive analytics”. Readers can find details of Discovery Net, Galaxy, Kepler, Triana, SOMA, SMILA, VisTrails, and others on the Web. Kappler has compared Competitive Workflow, Taverna and Pipeline Pilot [5]. Taverna has been widely used in bioinformatics but is also used with the Chemistry Development Kit (CDK) [6, 7]. CDK-Taverna workflows are made freely available at myExperiment.org [8]. (myExperiment.org also includes KNIME workflows.) DiscoveryNet was one of the earliest examples of a scientific workflow system; its concepts were later commercialized in InforSense Knowledge Discovery Environment (KDE). My 2007 review [1] centered on Pipeline Pilot and InforSense KDE; KNIME was then a relative newcomer. In 2009 the loss-making InforSense organization was acquired by IDBS and KDE has made progress in translational medicine [9]. InforSense's ChemSense [10] used ChemAxon's JChem Cartridge, and ChemAxon chemical structure, property prediction, and enumeration tools. ChemSense's three major pharmaceutical customers have turned to other solutions. The InforSense Suite lives on but is not seen as a "personal productivity tool"; rather it is integrated into the IDBS ELN platform. KNIME and Pipeline Pilot are now the market leaders in personal productivity in cheminformatics.

Proceedings ArticleDOI
20 May 2012
TL;DR: This work introduces Makeflow, a simple system for expressing and running a data-intensive workflow across multiple execution engines without requiring changes to the application or workflow description, and introduces Workbench, a suite of benchmarks designed for analyzing common workflow patterns.
Abstract: In recent years, there has been a renewed interest in languages and systems for large scale distributed computing. Unfortunately, most systems available to the end user use a custom description language tightly coupled to a specific runtime implementation, making it difficult to transfer applications between systems. To address this problem, we introduce Makeflow, a simple system for expressing and running a data-intensive workflow across multiple execution engines without requiring changes to the application or workflow description. Makeflow allows any user familiar with basic Unix Make syntax to generate a workflow and run it on one of many supported execution systems. Furthermore, in order to assess the performance characteristics of the various execution engines available to users and assist them in selecting one for use, we introduce Workbench, a suite of benchmarks designed for analyzing common workflow patterns. We evaluate Workbench on two physical architectures -- the first a storage cluster with local disks and a slower network and the second a high performance computing cluster with a central parallel filesystem and fast network -- using a variety of execution engines. We conclude by demonstrating three applications that use Makeflow to execute data intensive applications consisting of thousands of jobs.
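
Makeflow files use Unix Make syntax and dispatch jobs to different execution engines; the local Python sketch below (with invented filenames and commands) only illustrates the underlying Make semantics of running a command when its target is missing or older than its sources:

```python
import os, subprocess

# Make-style rules: (target, sources, shell command). Filenames are illustrative,
# and the sketch assumes the initial input file data.txt already exists.
rules = [
    ("data.sorted", ["data.txt"],    "sort data.txt > data.sorted"),
    ("report.txt",  ["data.sorted"], "wc -l data.sorted > report.txt"),
]

def out_of_date(target, sources):
    if not os.path.exists(target):
        return True
    return any(os.path.getmtime(s) > os.path.getmtime(target) for s in sources)

for target, sources, command in rules:    # rules listed in dependency order
    if out_of_date(target, sources):
        print("running:", command)
        subprocess.run(command, shell=True, check=True)
    else:
        print("up to date:", target)
```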

Journal ArticleDOI
TL;DR: The Substitutable Medical Applications, Reusable Technologies (SMART) Platforms project as mentioned in this paper aims to develop a health information technology platform with substitutable applications (apps) constructed around core services.

Journal ArticleDOI
TL;DR: Tavaxy reduces the workflow development cycle by introducing the use of workflow patterns to simplify workflow creation and enables the re-use and integration of existing (sub-) workflows from Taverna and Galaxy, and allows the creation of hybrid workflows.
Abstract: Over the past decade the workflow system paradigm has evolved as an efficient and user-friendly approach for developing complex bioinformatics applications. Two popular workflow systems that have gained acceptance by the bioinformatics community are Taverna and Galaxy. Each system has a large user-base and supports an ever-growing repository of application workflows. However, workflows developed for one system cannot be imported and executed easily on the other. The lack of interoperability is due to differences in the models of computation, workflow languages, and architectures of both systems. This lack of interoperability limits sharing of workflows between the user communities and leads to duplication of development efforts. In this paper, we present Tavaxy, a stand-alone system for creating and executing workflows based on using an extensible set of re-usable workflow patterns. Tavaxy offers a set of new features that simplify and enhance the development of sequence analysis applications: It allows the integration of existing Taverna and Galaxy workflows in a single environment, and supports the use of cloud computing capabilities. The integration of existing Taverna and Galaxy workflows is supported seamlessly at both run-time and design-time levels, based on the concepts of hierarchical workflows and workflow patterns. The use of cloud computing in Tavaxy is flexible, where the users can either instantiate the whole system on the cloud, or delegate the execution of certain sub-workflows to the cloud infrastructure. Tavaxy reduces the workflow development cycle by introducing the use of workflow patterns to simplify workflow creation. It enables the re-use and integration of existing (sub-) workflows from Taverna and Galaxy, and allows the creation of hybrid workflows. Its additional features exploit recent advances in high performance cloud computing to cope with the increasing data size and complexity of analysis. The system can be accessed either through a cloud-enabled web-interface or downloaded and installed to run within the user's local environment. All resources related to Tavaxy are available at http://www.tavaxy.org .

Journal ArticleDOI
TL;DR: The scheduling problem in hybrid clouds is introduced, covering the main characteristics to be considered when scheduling workflows, along with a brief survey of some of the scheduling algorithms used in these systems.
Abstract: Schedulers for cloud computing determine on which processing resource jobs of a workflow should be allocated. In hybrid clouds, jobs can be allocated on either a private cloud or a public cloud on a pay-per-use basis. The capacity of the communication channels connecting these two types of resources impacts the makespan and the cost of workflow execution. This article introduces the scheduling problem in hybrid clouds, presenting the main characteristics to be considered when scheduling workflows, as well as a brief survey of some of the scheduling algorithms used in these systems. To assess the influence of communication channels on job allocation, we compare and evaluate the impact of the available bandwidth on the performance of some of the scheduling algorithms.
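
A hedged back-of-the-envelope sketch of the bandwidth effect the article evaluates: when a job runs on the public cloud, the time to move its input over the inter-cloud link is added to its runtime, so a slow link can erase the benefit of faster public resources. All numbers are illustrative.

```python
def estimated_finish(runtime_s, data_mb, bandwidth_mbps, speedup=1.0):
    """Runtime (possibly sped up on public resources) plus the time to push
    input data over the inter-cloud link; bandwidth is in megabits per second."""
    transfer_s = (data_mb * 8) / bandwidth_mbps
    return runtime_s / speedup + transfer_s

job_runtime, job_data = 600.0, 2000.0   # 10 min of work, 2 GB of input
print("private cloud   :", estimated_finish(job_runtime, 0.0, bandwidth_mbps=1000))
print("public, 100 Mbps:", estimated_finish(job_runtime, job_data, 100, speedup=2.0))
print("public, 10 Mbps :", estimated_finish(job_runtime, job_data, 10, speedup=2.0))
```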

Journal ArticleDOI
TL;DR: Understanding end users' perspectives towards HIE technology is crucial to the long-term success of HIE, and user- and role-specific customization to accommodate differences in workflow and information needs may increase the adoption and use of HIE.

Journal ArticleDOI
01 Feb 2012
TL;DR: ReStore is a system that manages the storage and reuse of intermediate results of whole MapReduce jobs that are part of a workflow, and it can also create additional reuse opportunities by materializing and storing the output of query execution operators that are executed within a MapReduce job.
Abstract: Analyzing large scale data has emerged as an important activity for many organizations in the past few years. This large scale data analysis is facilitated by the MapReduce programming and execution model and its implementations, most notably Hadoop. Users of MapReduce often have analysis tasks that are too complex to express as individual MapReduce jobs. Instead, they use high-level query languages such as Pig, Hive, or Jaql to express their complex tasks. The compilers of these languages translate queries into workflows of MapReduce jobs. Each job in these workflows reads its input from the distributed file system used by the MapReduce system and produces output that is stored in this distributed file system and read as input by the next job in the workflow. The current practice is to delete these intermediate results from the distributed file system at the end of executing the workflow. One way to improve the performance of workflows of MapReduce jobs is to keep these intermediate results and reuse them for future workflows submitted to the system. In this paper, we present ReStore, a system that manages the storage and reuse of such intermediate results. ReStore can reuse the output of whole MapReduce jobs that are part of a workflow, and it can also create additional reuse opportunities by materializing and storing the output of query execution operators that are executed within a MapReduce job. We have implemented ReStore as an extension to the Pig dataflow system on top of Hadoop, and we experimentally demonstrate significant speedups on queries from the PigMix benchmark.
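
ReStore is implemented inside the Pig/Hadoop stack; the sketch below is only a generic illustration of the reuse idea it describes, caching the output location of a job keyed by a fingerprint of its input and script and skipping re-execution on a hit. Function and path names are invented.

```python
import hashlib

store = {}   # fingerprint -> path of a previously materialized output

def fingerprint(input_path, script_text):
    return hashlib.sha1((input_path + "\n" + script_text).encode()).hexdigest()

def run_job(input_path, script_text, execute):
    """Reuse a stored intermediate result when the same job (same input and
    same script) was seen before; otherwise execute and remember the output."""
    key = fingerprint(input_path, script_text)
    if key in store:
        print("reusing", store[key])
        return store[key]
    output_path = execute(input_path, script_text)   # run the actual job
    store[key] = output_path
    return output_path

# Usage sketch: the second call is served from the store instead of re-running.
fake_exec = lambda inp, script: "/tmp/out-" + fingerprint(inp, script)[:8]
run_job("/data/clicks", "GROUP BY url", fake_exec)
run_job("/data/clicks", "GROUP BY url", fake_exec)
```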

Journal ArticleDOI
15 Feb 2012
TL;DR: The Yabi system reflects careful design of both execution and data models, abstracting technical details away from users who are not skilled in HPC and providing an intuitive, scalable, drag-and-drop web-based workflow environment in which the same tools can also be accessed via a command line.
Abstract: Background: There is a significant demand for creating pipelines or workflows in the life science discipline that chain a number of discrete compute and data intensive analysis tasks into sophisticated analysis procedures. This need has led to the development of general as well as domain-specific workflow environments that are either complex desktop applications or Internet-based applications. Complexities can arise when configuring these applications in heterogeneous compute and storage environments if the execution and data access models are not designed appropriately. These complexities manifest themselves through limited access to available HPC resources, significant overhead required to configure tools and the inability for users to simply manage files across heterogeneous HPC storage infrastructure.

Proceedings ArticleDOI
08 Oct 2012
TL;DR: This paper recommends a minimal set of auxiliary resources to be preserved together with the workflows as an aggregation object and provides a software tool for end-users to create such aggregations and to assess their completeness.
Abstract: Workflows provide a popular means for preserving scientific methods by explicitly encoding their process. However, some of them are subject to a decay in their ability to be re-executed or reproduce the same results over time, largely due to the volatility of the resources required for workflow executions. This paper provides an analysis of the root causes of workflow decay based on an empirical study of a collection of Taverna workflows from the myExperiment repository. Although our analysis was based on a specific type of workflow, the outcomes and methodology should be applicable to workflows from other systems, at least those whose executions also rely largely on accessing third-party resources. Based on our understanding of decay, we recommend a minimal set of auxiliary resources to be preserved together with the workflows as an aggregation object and provide a software tool for end-users to create such aggregations and to assess their completeness.
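
One crude, hedged signal of the decay described here is whether the third-party services a workflow calls are still reachable; the sketch below checks a list of placeholder endpoint URLs and is not the assessment tool the paper provides.

```python
from urllib.request import urlopen

# Placeholder endpoints standing in for a workflow's third-party dependencies.
dependencies = ["http://example.org/blast", "http://example.org/uniprot"]

def unreachable(urls, timeout=5):
    dead = []
    for url in urls:
        try:
            urlopen(url, timeout=timeout)     # 2xx/3xx responses count as alive
        except (OSError, ValueError):         # connection errors, HTTP errors, bad URLs
            dead.append(url)
    return dead

dead = unreachable(dependencies)
print(f"{len(dead)} of {len(dependencies)} dependencies unreachable:", dead)
```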

Posted Content
TL;DR: ReStore as discussed by the authors is an extension to the Pig dataflow system on top of Hadoop, and it can also create additional reuse opportunities by materializing and storing the output of query execution operators that are executed within a MapReduce job.
Abstract: Analyzing large scale data has emerged as an important activity for many organizations in the past few years. This large scale data analysis is facilitated by the MapReduce programming and execution model and its implementations, most notably Hadoop. Users of MapReduce often have analysis tasks that are too complex to express as individual MapReduce jobs. Instead, they use high-level query languages such as Pig, Hive, or Jaql to express their complex tasks. The compilers of these languages translate queries into workflows of MapReduce jobs. Each job in these workflows reads its input from the distributed file system used by the MapReduce system and produces output that is stored in this distributed file system and read as input by the next job in the workflow. The current practice is to delete these intermediate results from the distributed file system at the end of executing the workflow. One way to improve the performance of workflows of MapReduce jobs is to keep these intermediate results and reuse them for future workflows submitted to the system. In this paper, we present ReStore, a system that manages the storage and reuse of such intermediate results. ReStore can reuse the output of whole MapReduce jobs that are part of a workflow, and it can also create additional reuse opportunities by materializing and storing the output of query execution operators that are executed within a MapReduce job. We have implemented ReStore as an extension to the Pig dataflow system on top of Hadoop, and we experimentally demonstrate significant speedups on queries from the PigMix benchmark.

Journal ArticleDOI
TL;DR: It is demonstrated that logic-based workflow verification can be applied to SWSpec, which is capable of checking compliance and also detecting conflicts among the imposed requirements, and will support scalable service interoperation in the form of workflows in open environments.
Abstract: This paper presents a requirement-oriented automated framework for formal verification of service workflows. It is based on our previous work describing the requirement-oriented service workflow specification language called SWSpec. This language has been developed to enable the workflow composer, as well as arbitrary services willing to participate in a workflow, to formally and uniformly impose their own requirements. As such, SWSpec provides a formal way to regulate and control workflows. The key component of the proposed framework is a set of verification algorithms that rely on propositional logic. We demonstrate that logic-based workflow verification can be applied to SWSpec, which is capable of checking compliance and also detecting conflicts among the imposed requirements. By automating the compliance-checking process, this framework will support scalable service interoperation in the form of workflows in open environments.
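
The paper's SWSpec language and verification algorithms are not reproduced here; the toy sketch below only illustrates the propositional idea: requirements imposed by different parties are boolean formulas over workflow facts, checked for compliance against a concrete workflow and for joint satisfiability (conflict detection) by enumeration. All requirement names and atoms are invented.

```python
from itertools import product

# Requirements as propositional formulas over workflow facts (illustrative only).
ATOMS = ["encrypts_data", "logs_access", "uses_external_service"]
requirements = {
    "provider":  lambda f: f["encrypts_data"] or not f["uses_external_service"],
    "composer":  lambda f: f["uses_external_service"] and not f["encrypts_data"],
    "regulator": lambda f: f["logs_access"],
}

def complies(facts):
    """Which imposed requirements does a concrete workflow satisfy?"""
    return {name: req(facts) for name, req in requirements.items()}

def conflicting(reqs):
    """True if no truth assignment satisfies all requirements at once."""
    for values in product([False, True], repeat=len(ATOMS)):
        facts = dict(zip(ATOMS, values))
        if all(req(facts) for req in reqs.values()):
            return False
    return True

print(complies({"encrypts_data": True, "logs_access": True, "uses_external_service": True}))
print("requirements conflict:", conflicting(requirements))
```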

Journal ArticleDOI
TL;DR: A Service Workflow Specification language is proposed, called SWSpec, which allows arbitrary services in a workflow to formally and uniformly impose their requirements, and will provide a formal way to regulate and control workflows as well as enrich the proliferation of service provisions and consumptions in open environments.
Abstract: Advanced technologies have changed the nature of business processes in the form of services. In coordinating services to achieve a particular objective, a service workflow is used to control service composition, execution sequences and path selection. Since existing mechanisms are insufficient for addressing the diversity and dynamicity of the requirements in a large-scale distributed environment, developing a formal requirements specification is necessary. In this paper, we propose a Service Workflow Specification language, called SWSpec, which allows arbitrary services in a workflow to formally and uniformly impose their requirements. As such, the solution will provide a formal way to regulate and control workflows as well as enrich the proliferation of service provisions and consumptions in open environments.