
Showing papers by "Carole Goble published in 2016"


Journal ArticleDOI
TL;DR: The FAIR Data Principles as mentioned in this paper are a set of data reuse principles that focus on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.
Abstract: There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

7,602 citations


Proceedings ArticleDOI
08 Dec 2016
TL;DR: This work proposes simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows, combining a simple and robust method for describing data collections (BDBags) with data descriptions (Research Objects) and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing.
Abstract: Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified. We address these issues by proposing simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing. We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.
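To make the packaging idea concrete, here is a minimal sketch using the standard bagit Python library, which implements the BagIt format that BDBags build on. It is not the authors' tooling: BDBags additionally use a fetch.txt for remote members and a Research Object metadata directory, and Minids are minted by a separate identifier service, none of which is shown. The directory name and metadata values are hypothetical.

```python
# Minimal sketch (assumed paths and metadata): package a local data
# directory as a plain BagIt bag. BDBag-specific features (fetch.txt for
# remote members, Research Object manifest, Minid minting) are not shown.
import bagit

bag = bagit.make_bag(
    "my_dataset",                       # directory converted to a bag in place
    {"Source-Organization": "Example Lab",
     "External-Description": "Images and genome sequences for pipeline X"},
    checksums=["sha256"],               # checksums recorded in the bag manifest
)

bag.validate()                          # verify payload files against the manifest
print(bag.info)                         # bag-level metadata from bag-info.txt
```

Because the bag carries checksums for every member, a recipient can re-run the validation step to confirm the dataset was transferred completely and without modification.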

45 citations


Journal ArticleDOI
TL;DR: The standards, software tooling, repositories and infrastructures that support this work are explored, and what makes them vital to realizing the Human Physiome are detailed.
Abstract: Reconstructing and understanding the Human Physiome virtually is a complex mathematical problem, and a highly demanding computational challenge. Mathematical models spanning from the molecular level through to whole populations of individuals must be integrated, then personalized. This requires interoperability with multiple disparate and geographically separated data sources, and myriad computational software tools. Extracting and producing knowledge from such sources, even when the databases and software are readily available, is a challenging task. Despite the difficulties, researchers must frequently perform these tasks so that available knowledge can be continually integrated into the common framework required to realize the Human Physiome. Software and infrastructures that support the communities that generate these, together with their underlying standards to format, describe and interlink the corresponding data and computer models, are pivotal to the Human Physiome being realized. They provide the foundations for integrating, exchanging and re-using data and models efficiently, and correctly, while also supporting the dissemination of growing knowledge in these forms. In this paper, we explore the standards, software tooling, repositories and infrastructures that support this work, and detail what makes them vital to realizing the Human Physiome.

36 citations


Journal ArticleDOI
TL;DR: This work presents BioVeL, an operational, scalable and flexible Internet-based virtual laboratory that meets new demands for data processing and analysis in biodiversity science and ecology, successfully integrating existing and popular tools and practices from different scientific disciplines.
Abstract: Making forecasts about biodiversity and giving support to policy relies increasingly on large collections of data held electronically, and on substantial computational capability and capacity to analyse, model, simulate and predict using such data. However, the physically distributed nature of data resources and of expertise in advanced analytical tools creates many challenges for the modern scientist. Across the wider biological sciences, presenting such capabilities on the Internet (as “Web services”) and using scientific workflow systems to compose them for particular tasks is a practical way to carry out robust “in silico” science. However, use of this approach in biodiversity science and ecology has thus far been quite limited. BioVeL is a virtual laboratory for data analysis and modelling in biodiversity science and ecology, freely accessible via the Internet. BioVeL includes functions for accessing and analysing data through curated Web services; for performing complex in silico analysis through exposure of R programs, workflows, and batch processing functions; for on-line collaboration through sharing of workflows and workflow runs; for experiment documentation through reproducibility and repeatability; and for computational support via seamless connections to supporting computing infrastructures. We developed and improved more than 60 Web services with significant potential in many different kinds of data analysis and modelling tasks. We composed reusable workflows using these Web services, also incorporating R programs. Deploying these tools into an easy-to-use and accessible ‘virtual laboratory’, free via the Internet, we applied the workflows in several diverse case studies. We opened the virtual laboratory for public use and through a programme of external engagement we actively encouraged scientists and third party application and tool developers to try out the services and contribute to the activity. Our work shows we can deliver an operational, scalable and flexible Internet-based virtual laboratory to meet new demands for data processing and analysis in biodiversity science and ecology. In particular, we have successfully integrated existing and popular tools and practices from different scientific disciplines to be used in biodiversity and ecological research.

35 citations


Journal ArticleDOI
TL;DR: The Manchester Synthetic Biology Research Centre's integrated technology platforms provide a unique capability to facilitate predictable engineering of microbial bio-factories for chemicals production.
Abstract: The Manchester Synthetic Biology Research Centre (SYNBIOCHEM) is a foundry for the biosynthesis and sustainable production of fine and speciality chemicals. The Centre's integrated technology platforms provide a unique capability to facilitate predictable engineering of microbial bio-factories for chemicals production. An overview of these capabilities is described.

7 citations


Journal ArticleDOI
TL;DR: An overview of foundry activities that are being applied to grand challenge projects to deliver innovation in bio-based chemicals production for industrial biotechnology is provided.

7 citations


Journal ArticleDOI
TL;DR: The approach taken by ELIXIR-UK to address the advice by the ELIXIR Scientific Advisory Board that Nodes need to develop “mechanisms to ensure that each Node continues to be representative of the Bioinformatics efforts within the country” is presented.
Abstract: ELIXIR is the European infrastructure established specifically for the sharing and sustainability of life science data. To provide up-to-date resources and services, ELIXIR needs to undergo a continuous process of refreshing the services provided by its national Nodes. Here we present the approach taken by ELIXIR-UK to address the advice by the ELIXIR Scientific Advisory Board that Nodes need to develop “mechanisms to ensure that each Node continues to be representative of the Bioinformatics efforts within the country”. ELIXIR-UK put in place an open and transparent process to identify potential ELIXIR resources within the UK during late 2015 and early to mid-2016. Areas of strategic strength were identified and Expressions of Interest in these priority areas were requested from the UK community. Criteria were established, in discussion with the ELIXIR Hub, and prospective ELIXIR-UK resources were assessed by an independent committee set up by the Node for this purpose. Of 19 resources considered, 14 were judged to be immediately ready to be included in the UK ELIXIR Node’s portfolio. A further five were placed on the Node’s roadmap for future consideration for inclusion. ELIXIR-UK expects to repeat this process regularly to ensure its portfolio continues to reflect its community’s strengths.

6 citations


DOI
01 Jan 2016
TL;DR: This report documents the program and the outcomes of Dagstuhl Perspectives Workshop 16252, "Engineering Academic Software".
Abstract: This report documents the program and the outcomes of Dagstuhl Perspectives Workshop 16252 "Engineering Academic Software".

6 citations


Journal ArticleDOI
TL;DR: The results provide clear evidence that data science should adopt single-field natural language search interfaces for variable search supporting in particular: query reformulation; data browsing; faceted search; surrogates; relevance feedback; summarization, analytics, and visual presentation.
Abstract: Background: Data discovery, particularly the discovery of key variables and their inter-relationships, is key to secondary data analysis, and in turn, the evolving field of data science. Interface designers have presumed that their users are domain experts, and so they have provided complex interfaces to support these “experts.” Such interfaces hark back to a time when searches needed to be accurate first time as there was a high computational cost associated with each search. Our work is part of a governmental research initiative between the medical and social research funding bodies to improve the use of social data in medical research. Objective: The cross-disciplinary nature of data science means no assumptions can be made regarding the domain expertise of a particular scientist, whose interests may intersect multiple domains. Here we consider the common requirement for scientists to seek archived data for secondary analysis. This has more in common with the search needs of the “Google generation” than with their single-domain, single-tool forebears. Our study compares a Google-like interface with traditional ways of searching for noncomplex health data in a data archive. Methods: Two user interfaces are evaluated for the same set of tasks in extracting data from surveys stored in the UK Data Archive (UKDA). One interface, Web search, is “Google-like,” enabling users to browse, search for, and view metadata about study variables, whereas the other, traditional search, has a standard multi-option user interface. Results: Using a comprehensive set of tasks with 20 volunteers, we found that the Web search interface met data discovery needs and expectations better than the traditional search. A task × interface repeated measures analysis showed a main effect indicating that answers found through the Web search interface were more likely to be correct (F(1,19) = 37.3, P < .001), with a main effect of task (F(3,57) = 6.3, P < .001). Further, participants completed the tasks significantly faster using the Web search interface (F(1,19) = 18.0, P < .001). There was also a main effect of task (F(2,38) = 4.1, P = .025, Greenhouse-Geisser correction applied). Overall, participants were asked to rate learnability, ease of use, and satisfaction. Paired mean comparisons showed that the Web search interface received significantly higher ratings than the traditional search interface for learnability (P = .002, 95% CI [0.6-2.4]), ease of use (P < .001, 95% CI [1.2-3.2]), and satisfaction (P < .001, 95% CI [1.8-3.5]). The results show superior cross-domain usability of Web search, which is consistent with its general familiarity and with enabling queries to be refined as the search proceeds, which treats serendipity as part of the refinement. Conclusions: The results provide clear evidence that data science should adopt single-field natural language search interfaces for variable search, supporting in particular: query reformulation; data browsing; faceted search; surrogates; relevance feedback; summarization, analytics, and visual presentation. [J Med Internet Res 2016;18(1):e13]
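For readers unfamiliar with the analysis, the sketch below shows how a task × interface repeated-measures ANOVA of the kind reported above can be run with statsmodels. The column names and placeholder data are invented for illustration; this is not the study's data or code.

```python
# Illustrative sketch only: a task x interface repeated-measures ANOVA.
# Factor names, column names, and the random placeholder data are hypothetical.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
participants = range(1, 21)                 # 20 volunteers, as in the study design
interfaces = ["web_search", "traditional"]
tasks = ["t1", "t2", "t3", "t4"]

rows = [
    {"participant": p, "interface": i, "task": t,
     # one observation per participant x interface x task cell
     "score": rng.normal(loc=0.8 if i == "web_search" else 0.6, scale=0.1)}
    for p in participants for i in interfaces for t in tasks
]
df = pd.DataFrame(rows)

# Two within-subject factors: interface and task.
result = AnovaRM(df, depvar="score", subject="participant",
                 within=["interface", "task"]).fit()
print(result)   # F statistics and p-values per factor and interaction
```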

6 citations


Proceedings ArticleDOI
06 Jun 2016
TL;DR: The authors describe how TavernaProv exports Taverna workflow-run provenance using PROV-O (the OWL ontology of the generic PROV model), declaring relations both as starting-point and qualified terms to simplify querying and interoperability with PROV tools.
Abstract:

Abstraction levels

PROV is a generic model for describing provenance. While this means there are generally multiple ways to express the same history in PROV, a scientific workflow run with processors and data values naturally matches the Activity and Entity classes and the relations wasGeneratedBy and used. However, using PROV-O to describe the details of a Taverna execution meant a significant increase in verbosity. To simplify query and interoperability with PROV tools, we declare relations both with starting point terms and as qualified terms, e.g. to represent an Activity that used different values as different input parameters, we provide both a direct used and a qualifiedUsage pointing to a Usage that specifies hadRole and entity.

PROV deliberately does not mandate how to make the design decisions on what activities, entities and agents participate in a particular scenario, but for interoperability purposes this flexibility means that PROV is a kind of "XML for provenance": a common language with a defined semantics, but one which can be applied in many different ways. One interoperability design question for representing computational workflow runs is how much of the workflow engine's internal logic and language should be explicit in the PROV representation. As we primarily wanted to convey what happened in the workflow at the same granularity as its definition, we tried to hide provenance that would be intrinsic to the Taverna Engine, e.g. implicit iteration is not shown as a separate Activity. However, keeping provenance only at the dataflow level (inputs/outputs of workflow processes) meant that TavernaProv could not easily represent "deeper" provenance such as the intermediate values of while-loops or intermittent failures that were automatically recovered by Taverna's retry mechanism, as we wanted to avoid unrolled workflow provenance.

Keeping the link between the workflow definition and execution is essential to understanding Taverna provenance, yet PROV doesn't describe the structure of a Plan. Taverna's workflow definition is in Linked Data using the SCUFL2 vocabulary, which includes many implementation details for the Taverna Engine, and so forming a meaningful query like "What is the value made by calls to web service X?" means understanding the whole conceptual model of Taverna workflow definitions. Therefore Taverna's PROV export also includes an annotation with the wfdesc abstraction of the workflow definition, embedding user annotations and higher-level information like web service location. wfdesc deliberately leaves out execution details like iteration and parallelism controls as it primarily functions as a target for user-driven annotations about the workflow steps. Correspondingly, the provenance bundle from TavernaProv includes a higher-level wfprov abstraction of the workflow execution, with direct shortcuts like describedByProcess and describedByParameter to bypass the indirection of PROV qualified terms, simplifying queries like "Which web service consumed value Y?". The duality between wfdesc and wfprov is similar to the "future provenance" model of P-Plan and OPMW and its workflow templates, and to the split between "prospective provenance" and "retrospective provenance" of the ProvONE Data Model for Scientific Workflow Provenance.

Identifiers and interoperability

A great advantage of using Linked Data was that we could use the same identifiers in all three formats. One challenge was that Taverna workflows are often run within a desktop user interface or on the command line, and with privacy concerns we didn't have the luxury of a server to mint URIs; we had already learnt our lesson with LSIDs. Taverna therefore generates UUID-based structured http:// URIs within our namespaces, e.g. http://ns.taverna.org.uk/2011/run/d5ee659e-e11e-43a5-bc0a-58d93674e5e2/process/1e027057-2aeb-47f7-97dc03e19e9772be/ and http://ns.taverna.org.uk/2010/workflowBundle/2f0e94ef-b5c4-455d-aeab1e9611f46b8b/workflow/HelloWorld/processor/hello/. The scufl2-info web service provides a minimal JSON-LD wfprov/wfdesc representation identifying the URI as a provenance or workflow item, but (by design) not having access to the data bundle it can't say anything more. We found that our UUID-based URIs don't play too well with PROV Toolbox and alternative PROV formats like PROV-N and PROV-XML, as every URI ending in / is registered as a separate namespace in order to form valid QNames. A suggested improvement for TavernaProv is to generate more PROV-friendly identifiers, while remaining compliant with the 10 simple rules for identifiers. Similarly, OWL reasoning is not generally applied by PROV-O consumers, so even though wfprov formally extends PROV in its ontology definitions, we needed to explicitly add the implied PROV-O statements in TavernaProv's Turtle output.

Common Workflow Language

Common Workflow Language (CWL) has created a workflow language specification, a reference implementation (cwltool), and a large community of workflow system developers who are adding CWL support across bioinformatics, including Apache Taverna and Galaxy. Unlike wfdesc, OPMW and P-Plan, CWL workflows are primarily intended to be executed, with a strong emphasis on the dataflow between command line tools packaged as Docker images. CWL is specified using Schema Salad, which provides JSON-LD constructs in YAML. The CWL dataflow model is inspired by wfdesc and Apache Taverna and thus has similar execution semantics and provenance requirements. CWL is planning its provenance format based on PROV-O, wfprov and JSON-LD. As one of the CWL adopters, Apache Taverna will naturally also aim to support the CWL provenance model.

Future work

To face the verbosity issue, we are considering splitting out wfprov statements to a different file; as a ZIP archive, the Taverna Data Bundle can contain many provenance formats. Similarly, splitting out the details using PROV-O Qualified Terms to a separate file is worth considering; this could also improve PROV visualization of workflow provenance. Having such separate PROV bundles would also make it easier for Taverna to support the ProvONE model as an additional format. PROV links could be added to Research Object Bundles to relate their data files to what would then be multiple workflow provenance traces that describe their generation and usage.

References

1. Katherine Wolstencroft, Robert Haines, Donal Fellows, Alan Williams, David Withers, Stuart Owen, Stian Soiland-Reyes, Ian Dunlop, Aleksandra Nenadic, Paul Fisher, Jiten Bhagat, Khalid Belhajjame, Finn Bacall, Alex Hardisty, Abraham Nieva de la Hidalga, Maria P. Balcazar Vargas, Shoaib Sufi, Carole Goble (2013): The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Research, 41(W1): W557–W561. doi:10.1093/nar/gkt328
2. Paolo Missier, Satya Sahoo, Jun Zhao, Carole Goble, Amit Sheth (2010): Janus: from Workflows to Semantic Provenance and Linked Open Data. In: Provenance and Annotation of Data and Processes, Third International Provenance and Annotation Workshop (IPAW'10), 15–16 Jun 2010. Springer, Berlin: 129–141. doi:10.1007/978-3-642-17819-1_16
3. Stian Soiland-Reyes, Matthew Gamble, Robert Haines (2014): Research Object Bundle 1.0. researchobject.org Specification, 2014-11-05. https://w3id.org/bundle/ doi:10.5281/zenodo.12586
4. Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina Hettne, Raul Palma, Eleni Mina, Oscar Corcho, José Manuel Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole Goble (2015): Using a suite of ontologies for preserving workflow-centric research objects. Web Semantics: Science, Services and Agents on the World Wide Web. doi:10.1016/j.websem.2015.01.003
5. Daniel Garijo, Yolanda Gil, Oscar Corcho (2014): Towards workflow ecosystems through semantic and standard representations. WORKS '14: Proceedings of the 9th Workshop on Workflows in Support of Large-Scale Science. doi:10.1109/WORKS.2014.13
6. Víctor Cuevas-Vicenttín, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, Saumen Dey (2014): The PBase Scientific Workflow Provenance Repository. International Journal of Digital Curation 9(2). doi:10.2218/ijdc.v9i2.332
7. Stian Soiland-Reyes, Alan R Williams (2016): What exactly happened to LSID? myGrid developer blog, 2016-02-26. http://dev.mygrid.org.uk/blog/2016/02/what-exactly-happened-to-lsid/ doi:10.5281/zenodo.46804
8. Julie A McMurry, Niklas Blomberg, Tony Burdett, Nathalie Conte, Michel Dumontier, Donal Fellows, Alejandra Gonzalez-Beltran, Philipp Gormanns, Janna Hastings, Melissa A Haendel, Henning Hermjakob, Jean-Karim Hériché, Jon C Ison, Rafael C Jimenez, Simon Jupp, Nick Juty, Camille Laibe, Nicolas Le Novère, James Malone, Maria Jesus Martin, Johanna R McEntyre, Chris Morris, Juha Muilu, Wolfgang Müller, Christopher J Mungall, Philippe Rocca-Serra, Susanna-Assunta Sansone, Murat Sariyar, Jacky L Snoep, Natalie J Stanford, Neil Swainston, Nicole L Washington, Alan R Williams, Katherine Wolstencroft, Carole Goble, Helen Parkinson (2015): 10 Simple rules for design, provision, and reuse of identifiers for web-based life science data. Zenodo. Submitted to PLoS Computational Biology. doi:10.5281/zenodo.31765
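To illustrate the dual starting-point/qualified-term pattern described above, here is a small sketch using rdflib in Python. The namespace and URIs are hypothetical examples, not actual TavernaProv output.

```python
# Sketch (assumed URIs): the same usage declared both with the PROV
# starting-point term prov:used and with the qualified term
# prov:qualifiedUsage, recording which input parameter (role) was filled.
from rdflib import Graph, Namespace, BNode, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/run/")      # hypothetical run namespace

g = Graph()
g.bind("prov", PROV)

activity = EX["process/1"]                     # a workflow processor invocation
value = EX["data/42"]                          # a data value consumed by it
role = EX["parameter/genome"]                  # the input port the value filled

g.add((activity, RDF.type, PROV.Activity))
g.add((value, RDF.type, PROV.Entity))

# Starting-point term: easy to query with plain PROV tools.
g.add((activity, PROV.used, value))

# Qualified term: records *how* the entity was used (which input parameter).
usage = BNode()
g.add((activity, PROV.qualifiedUsage, usage))
g.add((usage, RDF.type, PROV.Usage))
g.add((usage, PROV.entity, value))
g.add((usage, PROV.hadRole, role))

print(g.serialize(format="turtle"))
```

Emitting both forms trades verbosity for queryability: consumers that only understand starting-point terms still see prov:used, while richer tools can follow prov:qualifiedUsage to recover the parameter role.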

5 citations


Journal ArticleDOI
TL;DR: This special issue compiles a set of articles that discuss new research, development, and deployment efforts in running eScience and eEngineering workloads on cloud infrastructures and platforms, starting with two articles that explore the runtime platform support required for executing Big Data science on clouds.
Abstract: During the past decade, data-driven science and engineering have emerged as a key paradigm for performing scientific research, enabling innovations through new kinds of experiments that were earlier impossible. Today’s science has access to advanced instruments like next generation genome sequencers, gigapixel survey telescopes, and networks of sensors that monitor cyber-physical systems, and these are generating datasets that are growing exponentially in complexity and data volume. Big Data, across all dimensions of volume, velocity, variety, and veracity, are offering unique opportunities to enable scientific discovery as well as novel challenges to scientific platforms. Such dynamic, distributed, and data-intensive applications hold the solutions to vital scientific and societal problems of the 21st century. In order to achieve breakthroughs in new knowledge, there is a need to develop data-driven system models, perform analytics at large scales, manage data from instruments and analyses, and share and visualize the results with scientific peers and the society at large. To this end, cloud computing offers a computing model for running such data-intensive scientific and engineering applications. Clouds have democratized resource access to underserved disciplines, making it possible to perform nontrivial scientific explorations for just a few hundred dollars. Clouds are particularly cost-effective for Big Data applications due to their co-location of elastic compute resources with data, and their use of commodity hardware, which economizes on costs for non-high performance computing (HPC) workloads. Many contemporary Big Data platforms that have emerged from online enterprises such as Google and Twitter are also optimized for such commodity hardware, as found in their own data centers. Of course, there are costs associated with data transfer and storage, in keeping with the pay-as-you-go model, that may not be well suited for applications requiring frequent transfer of and long-term storage of terabytes of data. Likewise, it is valuable to understand how data-intensive or even HPC applications that have been developed for computing grids, at one end, and applications developed for workstation tools like MATLAB and R, at the other end, can be effectively run on clouds. These are some of the practical realities that are worth exploring on the relevance of clouds for data-driven scientific applications. In this special issue, we have compiled a set of articles that discuss new research, development, and deployment efforts in running eScience and eEngineering workloads on cloud infrastructures and platforms. The open solicitation, which followed the 3rd Workshop on Scientific Cloud Computing (ScienceCloud), invited research and case studies on a variety of topics relevant to data-driven scientific computing on clouds: use of cloud-based technologies to address innovative compute and data-driven scientific problems that are not well served by current HPC clusters and grids, programming platforms for elastic and Big Data applications, performance and cost-effective computing on clouds, and gaps in diverse cloud fabrics and service offerings, among others. In all, the special issue received 28 articles, of which six were selected for publication after multiple rounds of reviews and revisions. The special issue starts with two articles that explore the runtime platform support required for executing Big Data science on clouds.
In TomusBlobs: Scalable Data-Intensive Processing on Azure Clouds [1], the authors address the limitations of data storage within IaaS clouds such as Amazon S3 and Microsoft Azure BLOBs that are, while co-located in the data center, not present in the virtual machines (VMs) and need to be accessed over the network. Their distributed storage on VMs is optimized for concurrent access and elastic scaling, even

DOI
12 Sep 2016
TL;DR: This lightning talk aims to inspire the WSSSPE audience to think about actions they can take themselves rather than actions they want others to take, and to publish a more fully developed Dagstuhl Manifesto by December 2016.
Abstract: Software is fundamental to academic research work, both as part of the method and as the result of research. In June 2016, 25 people gathered at Schloss Dagstuhl for a week-long Perspectives Workshop and began to develop a manifesto which places emphasis on the scholarly value of academic software and on personal responsibility. Twenty pledges cover the recognition of academic software, the academic software process and the intellectual content of academic software. This is still work in progress. Through this lightning talk, we aim to get feedback and hone these further, as well as to inspire the WSSSPE audience to think about actions they can take themselves rather than actions they want others to take. We aim to publish a more fully developed Dagstuhl Manifesto by December 2016.

01 Jan 2016
TL;DR: FAIRDOM: Reproducible Systems Biology through FAIR Asset Management, presented at the Reproducibility, standards and SOP in bioinformatics meeting, Rome, Italy.
Abstract: Citation for published version (APA): Stanford, N., Bacall, F., Golebiewski, M., Krebs, O., Kuzyakiv, R., Nguyen, Q., Owen, S., Soiland-Reyes, S., Straszewski, J., van Niekerk, D., Williams, A., Wolstencroft, K., Malmström, L., Rinn, B., Snoep, J., Müller, W., & Goble, C. (Accepted/In press). FAIRDOM: Reproducible Systems Biology through FAIR Asset Management. Paper presented at Reproducibility, standards and SOP in bioinformatics, Rome, Italy.


08 Jun 2016
TL;DR: This paper takes a real-world workflow containing user-created design abstractions and compares these with abstractions created by ZOOM UserViews and Workflow Summaries systems, showing that semi-automatic and manual approaches largely overlap from a process perspective.
Abstract: In recent years the need to simplify or to hide sensitive information in provenance has given way to research on provenance abstraction. In the context of scientific workflows, existing research provides techniques to semi-automatically create abstractions of a given workflow description, which are in turn used as filters over the workflow's provenance traces. An alternative approach that is commonly adopted by scientists is to build workflows with abstractions embedded into the workflow's design, such as using sub-workflows. This paper reports on the comparison of manual versus semi-automated approaches in a context where result abstractions are used to filter report-worthy results of computational scientific analyses. Specifically, we take a real-world workflow containing user-created design abstractions and compare these with abstractions created by the ZOOM*UserViews and Workflow Summaries systems. Our comparison shows that semi-automatic and manual approaches largely overlap from a process perspective; meanwhile, there is a dramatic mismatch in terms of data artefacts retained in an abstracted account of derivation. We discuss reasons and suggest future research directions.

Posted Content
TL;DR: In this article, a comparison of manual versus semi-automated approaches in a context where result abstractions are used to filter report-worthy results of computational scientific analyses is presented.
Abstract: In recent years the need to simplify or to hide sensitive information in provenance has given way to research on provenance abstraction. In the context of scientific workflows, existing research provides techniques to semi-automatically create abstractions of a given workflow description, which are in turn used as filters over the workflow's provenance traces. An alternative approach that is commonly adopted by scientists is to build workflows with abstractions embedded into the workflow's design, such as using sub-workflows. This paper reports on the comparison of manual versus semi-automated approaches in a context where result abstractions are used to filter report-worthy results of computational scientific analyses. Specifically, we take a real-world workflow containing user-created design abstractions and compare these with abstractions created by the ZOOM UserViews and Workflow Summaries systems. Our comparison shows that semi-automatic and manual approaches largely overlap from a process perspective; meanwhile, there is a dramatic mismatch in terms of data artefacts retained in an abstracted account of derivation. We discuss reasons and suggest future research directions.