
Showing papers on "Workflow" published in 2010


Proceedings ArticleDOI
20 Oct 2010
TL;DR: The greatest value of Mininet will be supporting collaborative network research, by enabling self-contained SDN prototypes which anyone with a PC can download, run, evaluate, explore, tweak, and build upon.
Abstract: Mininet is a system for rapidly prototyping large networks on the constrained resources of a single laptop. The lightweight approach of using OS-level virtualization features, including processes and network namespaces, allows it to scale to hundreds of nodes. Experiences with our initial implementation suggest that the ability to run, poke, and debug in real time represents a qualitative change in workflow. We share supporting case studies culled from over 100 users, at 18 institutions, who have developed Software-Defined Networks (SDN). Ultimately, we think the greatest value of Mininet will be supporting collaborative network research, by enabling self-contained SDN prototypes which anyone with a PC can download, run, evaluate, explore, tweak, and build upon.
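
The workflow the abstract describes is scriptable through Mininet's Python API. The sketch below is a minimal, hedged example of that style of use (build a tiny topology, run a command on a virtual host, tear everything down); it assumes a local Mininet installation, typically needs root privileges, and the topology and commands are only illustrative.

```python
# Minimal sketch of the prototyping workflow described above, assuming a local
# Mininet installation (usually run with sudo). Topology and commands are illustrative.
from mininet.net import Mininet
from mininet.topo import SingleSwitchTopo
from mininet.log import setLogLevel

def demo():
    setLogLevel('info')
    net = Mininet(topo=SingleSwitchTopo(k=3))  # three hosts behind one switch
    net.start()
    h1, h2 = net.get('h1', 'h2')
    print(h1.cmd('ping -c1', h2.IP()))         # poke the live network from a virtual host
    net.stop()

if __name__ == '__main__':
    demo()
```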

1,890 citations


Proceedings ArticleDOI
20 Apr 2010
TL;DR: This paper presents a particle swarm optimization (PSO) based heuristic to schedule applications to cloud resources that takes into account both computation cost and data transmission cost, and shows that PSO can achieve as much as 3 times cost savings as compared to BRS.
Abstract: Cloud computing environments facilitate applications by providing virtualized resources that can be provisioned dynamically. However, users are charged on a pay-per-use basis. User applications may incur large data retrieval and execution costs when they are scheduled taking into account only the ‘execution time’. In addition to optimizing execution time, the cost arising from data transfers between resources as well as execution costs must also be taken into account. In this paper, we present a particle swarm optimization (PSO) based heuristic to schedule applications to cloud resources that takes into account both computation cost and data transmission cost. We experiment with a workflow application by varying its computation and communication costs. We compare the cost savings when using PSO and existing ‘Best Resource Selection’ (BRS) algorithm. Our results show that PSO can achieve: a) as much as 3 times cost savings as compared to BRS, and b) good distribution of workload onto resources.
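
As a rough illustration of the cost model being optimized (not the authors' implementation), the sketch below sums per-task execution cost and the transfer cost incurred whenever dependent tasks are mapped to different resources; all names and the flat transfer price are assumptions.

```python
# Hedged sketch of the cost a PSO particle (a task-to-resource mapping) would be
# scored on: computation cost plus data transfer cost between dependent tasks
# placed on different resources. Names and the flat transfer price are illustrative.
def total_cost(mapping, exec_cost, data_size_gb, transfer_price_per_gb):
    """mapping: task -> resource; exec_cost[task][resource]: execution cost in $;
    data_size_gb[(t1, t2)]: data volume t1 sends to t2."""
    cost = sum(exec_cost[task][resource] for task, resource in mapping.items())
    for (src, dst), gb in data_size_gb.items():
        if mapping[src] != mapping[dst]:      # transfers within one resource are free here
            cost += gb * transfer_price_per_gb
    return cost
```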

837 citations


Journal ArticleDOI
TL;DR: The full workflow of the TPP is described, along with an example on a sample data set, demonstrating that the setup and use of the tools are straightforward and well supported and do not require specialized informatic resources or knowledge.
Abstract: The Trans-Proteomic Pipeline (TPP) is a suite of software tools for the analysis of MS/MS data sets. The tools encompass most of the steps in a proteomic data analysis workflow in a single, integrated software system. Specifically, the TPP supports all steps from spectrometer output file conversion to protein-level statistical validation, including quantification by stable isotope ratios. We describe here the full workflow of the TPP and the tools therein, along with an example on a sample data set, demonstrating that the setup and use of the tools are straightforward and well supported and do not require specialized informatic resources or knowledge.

756 citations


Journal ArticleDOI
01 Apr 2010-Methods
TL;DR: The key message is that quality assurance and quality control are essential throughout the entire RT-qPCR workflow: from living cells, through extraction of nucleic acids, storage, and various enzymatic steps such as DNase treatment, reverse transcription and PCR amplification, to data analysis and finally reporting.

648 citations


Journal ArticleDOI
TL;DR: An eight-dimensional model specifically designed to address the sociotechnical challenges involved in design, development, implementation, use and evaluation of HIT within complex adaptive healthcare systems is introduced.
Abstract: Conceptual models have been developed to address challenges inherent in studying health information technology (HIT). This chapter introduces an 8-dimensional model specifically designed to address the socio-technical challenges involved in design, development, implementation, use, and evaluation of HIT within complex adaptive healthcare systems. The 8 dimensions are not independent, sequential, or hierarchical, but rather are interdependent and interrelated concepts similar to compositions of other complex adaptive systems. Hardware and software computing infrastructure refers to equipment and software used to power, support, and operate clinical applications and devices. Clinical content refers to textual or numeric data and images that constitute the “language” of clinical applications. The human computer interface includes all aspects of the computer that users can see, touch, or hear as they interact with it. People refers to everyone who interacts in some way with the system, from developers to end-users, including potential patient-users. Workflow and communication are the processes or steps involved in assuring that patient care tasks are carried out effectively. Two additional dimensions of the model are internal organizational features (e.g., environment, policies, procedures, and culture) and external rules and regulations, both of which may facilitate or constrain many aspects of the preceding dimensions. The final dimension is measurement and monitoring, which refers to the process of measuring and evaluating both intended and unintended consequences of HIT implementation and use. We illustrate how our model has been successfully applied in real-world complex adaptive settings to understand and improve HIT applications at various stages of development and implementation.

579 citations


Journal ArticleDOI
TL;DR: Mayavi as discussed by the authors is an open-source, general-purpose, 3D scientific visualization package that provides easy and interactive tools for data visualization that fit with the scientific user's workflow.
Abstract: Mayavi is an open-source, general-purpose, 3D scientific visualization package. It seeks to provide easy and interactive tools for data visualization that fit with the scientific user's workflow. For this purpose, Mayavi provides several entry points: a full-blown interactive application; a Python library with both a MATLAB-like interface focused on easy scripting and a feature-rich object hierarchy; widgets associated with these objects for assembling in a domain-specific application, and plugins that work with a general purpose application-building framework. In this article, we present an overview of the various features of Mayavi, we then provide insight on the design and engineering decisions made in implementing Mayavi, and finally discuss a few novel applications.
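
For readers unfamiliar with the scripting entry point mentioned above, here is a minimal example of Mayavi's MATLAB-like mlab interface; it assumes a recent Mayavi installation with a working GUI backend, and the helix data is just a placeholder.

```python
# Minimal example of Mayavi's mlab scripting interface (assumes `pip install mayavi`
# and a GUI backend); the helix below is placeholder data.
import numpy as np
from mayavi import mlab

t = np.linspace(0, 4 * np.pi, 200)
x, y, z = np.sin(t), np.cos(t), 0.1 * t
mlab.plot3d(x, y, z, t, tube_radius=0.025)  # a colored 3D helix
mlab.show()
```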

413 citations


Journal ArticleDOI
TL;DR: A matrix based k-means clustering strategy for data placement in scientific cloud workflows that dynamically clusters newly generated datasets to the most appropriate data centres-based on dependencies-during the runtime stage is proposed.
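
Only the TL;DR is shown, so the following is a loose sketch of the stated idea rather than the paper's algorithm: represent how often datasets are used together as a dependency matrix, cluster its rows with k-means, and place each cluster at one data centre. The scikit-learn call and the random matrix are assumptions for illustration.

```python
# Loose sketch of dependency-based data placement: cluster the rows of a
# dataset-dependency matrix with k-means and map each cluster to a data centre.
# The random symmetric matrix and scikit-learn usage are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def place_datasets(dependency_matrix, n_centres):
    """dependency_matrix[i, j]: how many tasks use both dataset i and dataset j."""
    labels = KMeans(n_clusters=n_centres, n_init=10).fit_predict(dependency_matrix)
    return {dataset: int(centre) for dataset, centre in enumerate(labels)}

m = np.random.randint(0, 5, size=(12, 12))
placement = place_datasets((m + m.T) // 2, n_centres=3)   # symmetric co-usage counts
```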

325 citations


Journal ArticleDOI
TL;DR: This paper addresses the issue of selecting and composing Web services not only according to their functional requirements but also to their transactional properties and QoS characteristics by proposing a selection algorithm that satisfies user's preferences as weights over QoS criteria and as risk levels defining semantically the transactional requirements.
Abstract: Web services are the best-known implementation of service-oriented architectures and have brought with them some challenging research issues. One of these is composition, i.e., the capability to recursively construct a composite Web service as a workflow of other existing Web services, which are developed by different organizations and offer diverse functionalities (e.g., ticket purchase, payment), transactional properties (e.g., compensatable or not), and Quality of Service (QoS) values (e.g., execution price, success rate). Selecting, for each activity of the workflow, a Web service that meets the user's requirements is still an important challenge. Indeed, choosing one Web service among a set that fulfils the same functionality is a critical task, generally depending on a combined evaluation of QoS. However, conventional QoS-aware composition approaches do not consider transactional constraints during the composition process. This paper addresses the issue of selecting and composing Web services not only according to their functional requirements but also to their transactional properties and QoS characteristics. We propose a selection algorithm that satisfies users' preferences, expressed as weights over QoS criteria and as risk levels that semantically define the transactional requirements. Proofs and experimental results are presented.
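
As a hedged illustration of the kind of selection the paper describes (not its actual algorithm), the sketch below first filters candidates by a required transactional property and then scores them with user-supplied weights over QoS criteria; the candidate structure, criteria names, and price inversion are assumptions.

```python
# Hedged sketch of transaction-aware, QoS-weighted service selection.
# Candidate structure, criteria names, and the price inversion are illustrative.
def select_service(candidates, weights, required_tx):
    """candidates: [{'name': ..., 'transactional': 'compensatable' | 'pivot' | ...,
    'qos': {'price': ..., 'success_rate': ...}}, ...]; weights: criterion -> weight."""
    eligible = [c for c in candidates if c['transactional'] == required_tx]

    def score(candidate):
        total = 0.0
        for criterion, weight in weights.items():
            value = candidate['qos'][criterion]
            # invert cost-like criteria so that higher scores are always better
            total += weight * (1.0 / value if criterion == 'price' else value)
        return total

    return max(eligible, key=score) if eligible else None
```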

325 citations


Proceedings ArticleDOI
TL;DR: This paper uses three characteristic workflows to compare the performance of a commercial cloud with that of a typical HPC system, and it analyzes the various costs associated with running those workflows in the cloud.
Abstract: The proliferation of commercial cloud computing providers has generated significant interest in the scientific computing community. Much recent research has attempted to determine the benefits and drawbacks of cloud computing for scientific applications. Although clouds have many attractive features, such as virtualization, on-demand provisioning, and "pay as you go" usage-based pricing, it is not clear whether they are able to deliver the performance required for scientific applications at a reasonable price. In this paper we examine the performance and cost of clouds from the perspective of scientific workflow applications. We use three characteristic workflows to compare the performance of a commercial cloud with that of a typical HPC system, and we analyze the various costs associated with running those workflows in the cloud. We find that the performance of clouds is not unreasonable given the hardware resources provided, and that performance comparable to HPC systems can be achieved given similar resources. We also find that the cost of running workflows on a commercial cloud can be reduced by storing data in the cloud rather than transferring it from outside.

254 citations


Book
01 Jan 2010
TL;DR: This monograph contends that provenance can and should reliably be tracked and exploited on the Web, and investigates the necessary foundations to achieve such a vision, as well as identifying an open approach and a model for provenance.
Abstract: Provenance, i.e., the origin or source of something, is becoming an important concern, since it offers the means to verify data products, to infer their quality, to analyse the processes that led to them, and to decide whether they can be trusted. For instance, provenance enables the reproducibility of scientific results; provenance is necessary to track attribution and credit in curated databases; and, it is essential for reasoners to make trust judgements about the information they use over the Semantic Web. As the Web allows information sharing, discovery, aggregation, filtering and flow in an unprecedented manner, it also becomes very difficult to identify, reliably, the original source that produced an information item on the Web. Since the emerging use of provenance in niche applications is undoubtedly demonstrating the benefits of provenance, this monograph contends that provenance can and should reliably be tracked and exploited on the Web, and investigates the necessary foundations to achieve such a vision. Multiple data sources have been used to compile the largest bibliographical database on provenance so far. This large corpus permits the analysis of emerging trends in the research community. Specifically, the CiteSpace tool identifies clusters of papers that constitute research fronts, from which characteristics are extracted to structure a foundational framework for provenance on the Web. Such an endeavour requires a multi-disciplinary approach, since it requires contributions from many computer science sub-disciplines, but also other non-technical fields given the human challenge that is anticipated. To develop such a vision, it is necessary to provide a definition of provenance that applies to the Web context. A conceptual definition of provenance is expressed in terms of processes, and is shown to generalise various definitions of provenance commonly encountered. Furthermore, by bringing realistic distributed systems assumptions, this definition is refined as a query over assertions made by applications. Given that the majority of work on provenance has been undertaken by the database, workflow and e-science communities, some of their work is reviewed, contrasting approaches, and focusing on important topics believed to be crucial for bringing provenance to the Web, such as abstraction, collections, storage, queries, workflow evolution, semantics and activities involving human interactions. However, provenance approaches developed in the context of databases and workflows essentially deal with closed systems. By that, it is meant that workflow or database management systems are in full control of the data they manage, and track their provenance within their own scope, but not beyond. In the context of the Web, a broader approach is required by which chunks of provenance representation can be brought together to describe the provenance of information flowing across multiple systems. For this purpose, this monograph puts forward the Open Provenance Vision, which is an approach that consists of controlled vocabulary, serialisation formats and interfaces to allow the provenance of individual systems to be expressed, connected in a coherent fashion, and queried seamlessly. In this context, the Open Provenance Model is an emerging community-driven representation of provenance, which has been actively used by some 20 teams to exchange provenance information, in line with the Open Provenance Vision. 
After identifying an open approach and a model for provenance, techniques to expose provenance over the Web are investigated. In particular, Semantic Web technologies are discussed since they have been successfully exploited to express, query and reason over provenance. Symmetrically, Semantic Web technologies such as RDF, underpinning the Linked Data effort, are analysed since they offer their own difficulties with respect to provenance. A powerful argument for provenance is that it can help make systems transparent, so that it becomes possible to determine whether a particular use of information is appropriate under a set of rules. Such capability helps make systems and information accountable. To offer accountability, provenance itself must be authentic, and rely on security approaches, which are described in the monograph. This is then followed by systems where provenance is the basis of an auditing mechanism to check past processes against rules or regulations. In practice, not all users want to check and audit provenance, instead, they may rely on measures of quality or trust; hence, emerging provenance-based approaches to compute trust and quality of data are reviewed.

248 citations


Proceedings ArticleDOI
11 Dec 2010
TL;DR: Experimental results show that the proposed Revised Discrete Particle Swarm Optimization (RDPSO) algorithm can achieve much greater cost savings and better performance on makespan and cost optimization.
Abstract: A cloud workflow system is a type of platform service which facilitates the automation of distributed applications based on the novel cloud infrastructure. Compared with grid environments, data transfer is a significant overhead for cloud workflows due to the market-oriented business model of cloud environments. In this paper, a Revised Discrete Particle Swarm Optimization (RDPSO) algorithm is proposed to schedule applications among cloud services, taking both data transmission cost and computation cost into account. Experiments are conducted with a set of workflow applications by varying their data communication costs and computation costs according to a cloud price model. Makespan, cost optimization ratio, and cost savings are compared across RDPSO, standard PSO, and the BRS (Best Resource Selection) algorithm. Experimental results show that the proposed RDPSO algorithm can achieve greater cost savings and better performance on makespan and cost optimization.

Journal ArticleDOI
TL;DR: The results of the service-oriented application applied to alpine runoff models are shown, including the use of geospatial services facilitating discovery, access, processing and visualization of geospatial data in a distributed manner.
Abstract: Environmental modelling often requires a long iterative process of sourcing, reformatting, analyzing, and introducing various types of data into the model. Much of the data to be analyzed are geospatial data (digital terrain models (DTM), river basin boundaries, snow cover from satellite imagery, etc.), and so the modelling workflow typically involves the use of multiple desktop GIS and remote sensing software packages, with limited compatibility among them. Recent advances in service-oriented architectures (SOA) are allowing users to migrate from dedicated desktop solutions to on-line, loosely coupled, and standards-based services which accept source data, process them, and pass results as basic parameters to other intermediate services and/or then to the main model, which also may be made available on-line. This contribution presents a service-oriented application that addresses the issues of data accessibility and service interoperability for environmental models. Key model capabilities are implemented as geospatial services, which are combined to form complex services, and may be reused in other similar contexts. This work was carried out under the auspices of the AWARE project funded by the European programme Global Monitoring for Environment and Security (GMES). We show results of the service-oriented application applied to alpine runoff models, including the use of geospatial services facilitating discovery, access, processing and visualization of geospatial data in a distributed manner.

Journal ArticleDOI
Ke Liu, Hai Jin, Jinjun Chen, Xiao Liu, Dong Yuan, Yun Yang
01 Nov 2010
TL;DR: This paper presents a novel compromised-time-cost scheduling algorithm which considers the characteristics of cloud computing to accommodate instance-intensive cost-constrained workflows by compromising execution time and cost with user input enabled on the fly.
Abstract: The concept of cloud computing continues to spread widely as it gains acceptance. Cloud computing has many unique advantages which can be utilized to facilitate workflow execution. Instance-intensive cost-constrained cloud workflows are workflows with a large number of workflow instances (i.e. instance intensive) bounded by a certain budget for execution (i.e. cost constrained) on a cloud computing platform (i.e. cloud workflows). However, there are, so far, no dedicated scheduling algorithms for instance-intensive cost-constrained cloud workflows. This paper presents a novel compromised-time-cost scheduling algorithm which considers the characteristics of cloud computing to accommodate instance-intensive cost-constrained workflows by compromising execution time and cost with user input enabled on the fly. The simulation performed demonstrates that the algorithm can cut the mean execution cost by over 15% whilst meeting the user-designated deadline, or shorten the mean execution time by over 20% within the user-designated execution cost.
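
The sketch below is a simplified stand-in for the compromise the abstract describes, not the published algorithm: for each ready task it picks the cheapest service that still fits the remaining deadline, and falls back to the fastest service when nothing fits. The service tuples, time estimates, and greedy policy are all assumptions.

```python
# Simplified stand-in for a time/cost compromise: prefer the cheapest service that
# keeps the workflow within its deadline, otherwise fall back to the fastest one.
# Service tuples and the greedy policy are illustrative assumptions.
def pick_service(services, remaining_deadline, downstream_time):
    """services: list of (name, exec_time, price); downstream_time: estimated time
    still needed after this task on the critical path."""
    feasible = [s for s in services if s[1] + downstream_time <= remaining_deadline]
    if feasible:
        return min(feasible, key=lambda s: s[2])  # cheapest service that meets the deadline
    return min(services, key=lambda s: s[1])      # nothing fits: take the fastest service
```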

Journal ArticleDOI
TL;DR: It is demonstrated that agent-based execution scales better than a non-distributed approach, with at least 70% and 120% improvements in process execution time, and throughput, respectively, even with a large number of concurrent process instances.
Abstract: The Business Process Execution Language (BPEL) standardizes the development of composite enterprise applications that make use of software components exposed as Web services. BPEL processes are currently executed by a centralized orchestration engine, in which issues such as scalability, platform heterogeneity, and division across administrative domains can be difficult to manage. We propose a distributed agent-based orchestration engine in which several lightweight agents execute a portion of the original business process and collaborate in order to execute the complete process. The complete set of standard BPEL activities is supported, and the transformations of several BPEL activities to the agent-based architecture are described. Evaluations of an implementation of this architecture demonstrate that agent-based execution scales better than a non-distributed approach, with at least 70% and 120% improvements in process execution time and throughput, respectively, even with a large number of concurrent process instances. In addition, the distributed architecture successfully executes large processes that are shown to be infeasible to execute with a non-distributed engine.

Patent
24 Mar 2010
TL;DR: In this article, a workflow server can satisfy requests by assigning tasks to different service providers that provide software services; each of the tasks can be assigned to a corresponding one of the software services.
Abstract: A workflow server can receive requests, each for a business process workflow conforming to a business process model. Each business process workflow can include a set of interdependent tasks. The workflow server can satisfy received requests by assigning tasks to different service providers that provide software services. Each of the tasks can be assigned to a corresponding one of the software services. For each task, the workflow server can also define an allocated cost per software service and a time allocation per software service for completing the corresponding task. Different service providers, including those assigned to tasks, can receive information about tasks not directly assigned to them by the workflow server. The different service providers can then bid on these tasks. When bids are won, tasks for a business process workflow can be reassigned based on the winning bids.

Proceedings Article
11 Jul 2010
TL;DR: A planner is described, TURKONTROL, that formulates workflow control as a decision-theoretic optimization problem, trading off the implicit quality of a solution artifact against the cost for workers to achieve it.
Abstract: Crowd-sourcing is a recent framework in which human intelligence tasks are outsourced to a crowd of unknown people ("workers") as an open call (e.g., on Amazon's Mechanical Turk). Crowd-sourcing has become immensely popular with hordes of employers ("requesters"), who use it to solve a wide variety of jobs, such as dictation transcription, content screening, etc. In order to achieve quality results, requesters often subdivide a large task into a chain of bite-sized subtasks that are combined into a complex, iterative workflow in which workers check and improve each other's results. This paper raises an exciting question for AI: could an autonomous agent control these workflows without human intervention, yielding better results than today's state of the art, a fixed control program? We describe a planner, TURKONTROL, that formulates workflow control as a decision-theoretic optimization problem, trading off the implicit quality of a solution artifact against the cost for workers to achieve it. We lay out the mathematical framework to govern the various decisions at each point in a popular class of workflows. Based on our analysis we implement the workflow control algorithm and present experiments demonstrating that TURKONTROL obtains much higher utilities than popular fixed policies.
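
To make the decision-theoretic trade-off concrete, here is a toy sketch (not TURKONTROL itself): keep paying for improvement jobs only while the expected utility gain from higher quality exceeds the per-job cost. The utility curve and improvement model are placeholders supplied by the caller.

```python
# Toy sketch of the quality-versus-cost trade-off: improve the artifact only while
# the expected utility gain exceeds the cost of another worker job. The utility
# curve and improvement model are caller-supplied placeholders, not the paper's.
def control_workflow(quality, utility, improve, cost_per_job, budget):
    """quality: current estimate in [0, 1]; utility(q): value in dollars at quality q;
    improve(q): expected quality after one more improve-and-vote round."""
    spent = 0.0
    while spent + cost_per_job <= budget:
        expected_gain = utility(improve(quality)) - utility(quality)
        if expected_gain <= cost_per_job:   # submitting now beats paying for more work
            break
        quality = improve(quality)
        spent += cost_per_job
    return quality, spent
```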

Proceedings ArticleDOI
03 Oct 2010
TL;DR: Chronicle captures the entire video history of a graphical document, and provides links between the content and the relevant areas of the history, which makes any working document a potentially powerful learning tool.
Abstract: We describe Chronicle, a new system that allows users to explore document workflow histories. Chronicle captures the entire video history of a graphical document, and provides links between the content and the relevant areas of the history. Users can indicate specific content of interest, and see the workflows, tools, and settings needed to reproduce the associated results, or to better understand how the content was constructed to allow for informed modification. Thus, by storing rich information about the document's workflow history, Chronicle makes any working document a potentially powerful learning tool. We outline some of the challenges surrounding the development of such a system, and then describe our implementation within an image editing application. A qualitative user study produced extremely encouraging results, as users unanimously found the system both useful and easy to use.

30 Jun 2010
TL;DR: This paper describes how the recently overhauled technical architecture of Taverna addresses issues of efficiency, scalability, and extensibility, and presents performance results based on a collection of synthetic workflows as well as a concrete case study involving a production workflow in the area of cancer research.
Abstract: The Taverna workflow management system is an open source project with a history of widespread adoption within multiple experimental science communities, and a long-term ambition of effectively supporting the evolving need of those communities for complex, data-intensive, service-based experimental pipelines. This short paper describes how the recently overhauled technical architecture of Taverna addresses issues of efficiency, scalability, and extensibility, and presents performance results based on a collection of synthetic workflows, as well as a concrete case study involving a production workflow in the area of cancer research.

Proceedings ArticleDOI
05 Jul 2010
TL;DR: SciCumulus is a cloud middleware that explores parameter sweep and data fragmentation parallelism in scientific workflow activities (with provenance support) and works between the SWfMS and the cloud.
Abstract: Most of the large-scale scientific experiments modeled as scientific workflows produce a large amount of data and require workflow parallelism to reduce workflow execution time. Some of the existing Scientific Workflow Management Systems (SWfMS) explore parallelism techniques - such as parameter sweep and data fragmentation. In those systems, several computing resources are used to accomplish many computational tasks in homogeneous environments, such as multiprocessor machines or cluster systems. Cloud computing has become a popular high performance computing model in which (virtualized) resources are provided as services over the Web. Some scientists are starting to adopt the cloud model in scientific domains and are moving their scientific workflows (programs and data) from local environments to the cloud. Nevertheless, it is still difficult for the scientist to express a parallel computing paradigm for the workflow on the cloud. Capturing distributed provenance data at the cloud is also an issue. Existing approaches for executing scientific workflows using parallel processing are mainly focused on homogeneous environments whereas, in the cloud, the scientist has to manage new aspects such as initialization of virtualized instances, scheduling over different cloud environments, impact of data transferring and management of instance images. In this paper we propose SciCumulus, a cloud middleware that explores parameter sweep and data fragmentation parallelism in scientific workflow activities (with provenance support). It works between the SWfMS and the cloud. SciCumulus is designed considering cloud specificities. We have evaluated our approach by executing simulated experiments to analyze the overhead imposed by clouds on the workflow execution time.
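
The parameter-sweep parallelism mentioned above can be pictured with the small sketch below, which fans one workflow activity out over every parameter combination using a local thread pool as a stand-in for cloud instances; the grid, the activity callable, and the use of threads are all illustrative assumptions.

```python
# Illustrative fan-out of one workflow activity over a parameter grid, with a local
# thread pool standing in for virtualized cloud instances. Not SciCumulus itself.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def sweep(activity, parameter_grid, max_workers=4):
    """activity: callable taking one parameter dict; parameter_grid: name -> list of values."""
    names = list(parameter_grid)
    combos = [dict(zip(names, values)) for values in product(*parameter_grid.values())]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(activity, combos))
    return list(zip(combos, results))

# Example (hypothetical activity): run a simulation over two swept parameters.
# results = sweep(run_simulation, {'resolution': [128, 256], 'seed': [1, 2, 3]})
```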

Proceedings ArticleDOI
25 Oct 2010
TL;DR: This paper proposes a new QoS-based workflow scheduling algorithm based on a novel concept called Partial Critical Path that recursively schedules the critical path ending at a recently scheduled node.
Abstract: Recently, utility grids have emerged as a new model of service provisioning in heterogeneous distributed systems. In this model, users negotiate with providers on their required Quality of Service and on the corresponding price to reach a Service Level Agreement. One of the most challenging problems in utility grids is workflow scheduling, i.e., the problem of satisfying users' QoS as well as minimizing the cost of workflow execution. In this paper, we propose a new QoS-based workflow scheduling algorithm based on a novel concept called Partial Critical Path. This algorithm recursively schedules the critical path ending at a recently scheduled node. The proposed algorithm tries to minimize the cost of workflow execution while meeting a user-defined deadline. The simulation results show that the performance of our algorithm is very promising.
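
As a rough aid to intuition (a sketch under assumed inputs, not the published algorithm), the function below walks backwards from an already scheduled node and collects the chain of most time-critical unscheduled predecessors, which is the kind of partial critical path the scheduler would then assign as a unit.

```python
# Rough sketch of extracting a partial critical path: starting from a scheduled
# node, repeatedly follow the most time-critical unscheduled predecessor.
# The DAG encoding and runtime estimates are illustrative assumptions.
def partial_critical_path(parents, runtime, scheduled, end_task):
    """parents: task -> list of predecessor tasks; runtime: task -> estimated duration;
    scheduled: set of tasks that already have an assigned service."""
    path = []
    task = end_task
    while True:
        unscheduled = [p for p in parents.get(task, []) if p not in scheduled]
        if not unscheduled:
            break
        task = max(unscheduled, key=lambda p: runtime[p])  # the critical parent
        path.append(task)
    return list(reversed(path))   # path runs from the earliest task towards end_task
```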

Journal ArticleDOI
TL;DR: A set of new analytical methods consisting of workflow fragmentation assessments, pattern recognition, and data visualization are presented, which are accordingly designed to uncover hidden regularities embedded in the flow of the work.

Journal ArticleDOI
TL;DR: There is little "flow" in nurse workflow, and the chaotic pace implies that nurses rarely complete an activity before switching to another, suggesting that the opportunity to use critical thinking and engage in planning care is severely limited under these circumstances.
Abstract: Objective: To quantitatively measure workflow and computer use, the activities of 27 medical-surgical RNs were recorded through direct observation. Background: Previous studies have shown how nurses spend their time but have not documented the pattern, duration, or frequency of activities. The absence

Journal ArticleDOI
TL;DR: A workflow is proposed that guides the user through the data analysis process and allows practitioners and non-statisticians to get an overview over panel performances in a rapid manner without the need to be familiar with details on the statistical methods.
Abstract: This paper discusses statistical methods and a workflow strategy for comparing performance across multiple sensory panels that participated in a proficiency test (also referred to as an inter-laboratory test). Performance comparison and analysis are based on a data set collected from 26 sensory panels carrying out profiling on the same set of candy samples. The candy samples were produced according to an experimental design using design factors such as sugar and acid level. Because of the exceptionally large amount of data and the availability of multiple statistical and graphical tools in the PanelCheck software, a workflow is proposed that guides the user through the data analysis process. This allows practitioners and non-statisticians to get an overview of panel performances in a rapid manner without the need to be familiar with details of the statistical methods. Visualisation of data analysis results plays an important role, as this provides a time-saving and efficient way of screening and investigating sensory panel performances. Most of the statistical methods used in this paper are available in the open source software PanelCheck, which may be downloaded and used for free.

Journal ArticleDOI
Qihua Wang, Ninghui Li
TL;DR: This work proposes the role-and-relation-based access control (R2BAC) model for workflow authorization systems, formally defines three levels of resiliency in workflow systems, and studies computational problems related to these notions of resiliency.
Abstract: We propose the role-and-relation-based access control (R2BAC) model for workflow authorization systems. In R2BAC, in addition to a user’s role memberships, the user’s relationships with other users help determine whether the user is allowed to perform a certain step in a workflow. For example, a constraint may require that two steps must not be performed by users who have conflicts of interests. We study computational complexity of the workflow satisfiability problem, which asks whether a set of users can complete a workflow. In particular, we apply tools from parameterized complexity theory to better understand the complexities of this problem. Furthermore, we reduce the workflow satisfiability problem to SAT and apply SAT solvers to address the problem. Experiments show that our algorithm can solve instances of reasonable size efficiently. Finally, it is sometimes not enough to ensure that a workflow can be completed in normal situations. We study the resiliency problem in workflow authorization systems, which asks whether a workflow can be completed even if a number of users may be absent. We formally define three levels of resiliency in workflow systems and study computational problems related to these notions of resiliency.
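
To make the satisfiability question concrete, here is a brute-force toy checker (the paper instead reduces the problem to SAT and applies parameterized complexity tools): it searches for an assignment of authorized users to steps that respects binary constraints such as separation of duty. All inputs in the example are hypothetical.

```python
# Toy brute-force check of workflow satisfiability: find an assignment of authorized
# users to steps that satisfies all binary constraints (the paper reduces this to SAT).
from itertools import product

def satisfiable(steps, authorized, constraints):
    """authorized: step -> set of allowed users; constraints: list of
    (step_a, step_b, predicate) where predicate(user_a, user_b) must hold."""
    domains = [sorted(authorized[step]) for step in steps]
    for assignment in product(*domains):
        plan = dict(zip(steps, assignment))
        if all(pred(plan[a], plan[b]) for a, b, pred in constraints):
            return plan
    return None   # no valid assignment: the workflow is unsatisfiable

# Hypothetical example: 'approve' and 'pay' must be done by different users.
plan = satisfiable(['approve', 'pay'],
                   {'approve': {'alice', 'bob'}, 'pay': {'alice'}},
                   [('approve', 'pay', lambda u, v: u != v)])
```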

Proceedings ArticleDOI
13 Nov 2010
TL;DR: This paper ran experiments using three typical workflow applications on Amazon's EC2, and investigated some of the ways in which data can be managed for workflows in the cloud.
Abstract: Efficient data management is a key component in achieving good performance for scientific workflows in distributed environments. Workflow applications typically communicate data between tasks using files. When tasks are distributed, these files are either transferred from one computational node to another, or accessed through a shared storage system. In grids and clusters, workflow data is often stored on network and parallel file systems. In this paper we investigate some of the ways in which data can be managed for workflows in the cloud. We ran experiments using three typical workflow applications on Amazon's EC2. We discuss the various storage and file systems we used, describe the issues and problems we encountered deploying them on EC2, and analyze the resulting performance and cost of the workflows.

Patent
14 Apr 2010
TL;DR: In this paper, a data services framework workflow processing system and method is disclosed, which includes receiving a request for data from a client and based on the request, determining a workflow to process the request.
Abstract: A data services framework workflow processing system and method is disclosed. The method includes receiving a request for data from a client and, based on the request, determining a workflow to process the request. The method also includes, based on the workflow, generating a plurality of backend calls. Additionally, the method includes, based on the plurality of backend calls, selecting one or more data sources from a plurality of data sources. The method also includes transmitting one or more of the plurality of backend calls to the selected data sources and receiving, from each of the selected data sources, a response to the plurality of backend calls.

Journal ArticleDOI
TL;DR: This work describes workflows and extensions to Kepler to stream and analyze data from observatory networks and archives, and focuses on the use of two newly integrated data sources in Kepler: DataTurbine and OPeNDAP.

Journal ArticleDOI
TL;DR: This work proposes an approach for managing large-scale experiments based on provenance gathering during all phases of the life cycle, and foresees that such an approach may help scientists gain more control over the trials of the scientific experiment.
Abstract: One of the main challenges of scientific experiments is to allow scientists to manage and exchange their scientific computational resources (data, programs, models, etc.). The effective management of such experiments requires a specific set of cardinal facilities, such as experiment specification techniques, workflow derivation heuristics and provenance mechanisms. These facilities characterise the experiment life cycle in three phases: composition, execution, and analysis. Existing work on supporting scientific workflows is mainly concerned with the execution and analysis phases, and therefore fails to support the scientific experiment throughout its life cycle as a set of integrated experimentation technologies. In large-scale experiments this represents a research challenge. We propose an approach for managing large-scale experiments based on provenance gathering during all phases of the life cycle. We foresee that such an approach may help scientists gain more control over the trials of the scientific experiment.

Patent
25 Mar 2010
TL;DR: In this article, a computer-based system and method of providing document isolation during routing of a document through a workflow is disclosed, which comprises maintaining a separate "working" copy of the original base document while the document is routed through the workflow.
Abstract: A computer-based system and method of providing document isolation during routing of a document through a workflow is disclosed. The method comprises maintaining a separate "working" copy of the original base document while the document is routed through a workflow. Access controls, which define who may access the original document as well as any versions of the working copy document, are defined and stored in relation to the documents. The access controls further define the types of actions users may take with respect to the document. Users are selectively directed to the appropriate document, either the base document or the working copy, and selectively granted permission to perform publishing operations on the working copy document, as determined by the access controls.

Journal ArticleDOI
TL;DR: A web-based prokaryotic genome annotation server, Integrative Services for Genomics Analysis (ISGA), which builds upon the Ergatis workflow system, integrates other dynamic analysis tools and provides intuitive web interfaces for biologists to customize and execute their own annotation pipelines.
Abstract: Summary: Ergatis is a flexible workflow management system for designing and executing complex bioinformatics pipelines. However, its complexity restricts its usage to only highly skilled bioinformaticians. We have developed a web-based prokaryotic genome annotation server, Integrative Services for Genomics Analysis (ISGA), which builds upon the Ergatis workflow system, integrates other dynamic analysis tools and provides intuitive web interfaces for biologists to customize and execute their own annotation pipelines. ISGA is designed to be installed at genomics core facilities and be used directly by biologists. Availability: ISGA is accessible at http://isga.cgb.indiana.edu/ and the system is also freely available for local installation. Contact: qunfeng.dong@unt.edu Supplementary information: Supplementary data are available at Bioinformatics online.