
Showing papers on "Data management published in 2015"


Journal ArticleDOI
Yu Zheng
TL;DR: A systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics, and introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors.
Abstract: The advances in location-acquisition and mobile computing techniques have generated massive spatial trajectory data, which represent the mobility of a diversity of moving objects, such as people, vehicles, and animals. Many techniques have been proposed for processing, managing, and mining trajectory data in the past decade, fostering a broad range of applications. In this article, we conduct a systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics. Following a road map from the derivation of trajectory data, to trajectory data preprocessing, to trajectory data management, and to a variety of mining tasks (such as trajectory pattern mining, outlier detection, and trajectory classification), the survey explores the connections, correlations, and differences among these existing techniques. This survey also introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors, to which more data mining and machine learning techniques can be applied. Finally, some public trajectory datasets are presented. This survey can help shape the field of trajectory data mining, providing a quick understanding of this field to the community.
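The survey's final theme, transforming trajectories into graphs, matrices, and tensors, can be illustrated with a small sketch. This is only a generic illustration of the idea (not code from the survey); the grid size and the sample trajectory below are hypothetical. The approach discretizes GPS points into grid cells and counts cell-to-cell transitions, yielding a sparse adjacency-matrix view of the trajectory.

```python
from collections import defaultdict

def trajectory_to_transition_counts(points, cell=0.01):
    """Discretize (lat, lon) points into grid cells and count transitions.

    The resulting dict behaves like a sparse adjacency matrix, i.e. the
    graph/matrix view of a trajectory described in the survey.
    """
    cells = [(round(lat / cell), round(lon / cell)) for lat, lon in points]
    counts = defaultdict(int)
    for src, dst in zip(cells, cells[1:]):
        if src != dst:                      # ignore self-transitions
            counts[(src, dst)] += 1
    return counts

# Hypothetical trajectory: a few GPS fixes of one moving object.
trajectory = [(39.9841, 116.3180), (39.9845, 116.3252), (39.9901, 116.3310)]
print(trajectory_to_transition_counts(trajectory))
```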

1,289 citations


Journal ArticleDOI
TL;DR: This paper discusses approaches and environments for carrying out analytics on Clouds for Big Data applications, and identifies possible gaps in technology and provides recommendations for the research community on future directions on Cloud-supported Big Data computing and analytics solutions.

773 citations


Journal ArticleDOI
TL;DR: This survey aims to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks.
Abstract: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bound, disk-based systems. Some issues, such as fault tolerance and consistency, are also more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as the data storage layer. Much of this research has focused on several dimensions: modern CPU and memory hierarchy utilization, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important technology in memory management and some key factors that need to be considered in order to achieve efficient in-memory data management and processing.

391 citations


Journal ArticleDOI
TL;DR: A Multi-Level Smart City architecture based on semantic web technologies and Dempster-Shafer uncertainty theory is proposed, described and explained in terms of its functionality and some real-time context-aware scenarios.

332 citations


Journal ArticleDOI
TL;DR: An overview of next-generation artificial intelligence and blockchain technologies is provided, together with innovative solutions that may be used to accelerate biomedical research and provide patients with new tools to control and profit from their personal data, as well as incentives to undergo constant health monitoring.
Abstract: The increased availability of data and recent advancements in artificial intelligence present unprecedented opportunities in healthcare and major challenges for patients, developers, providers and regulators. Novel deep learning and transfer learning techniques are turning any data about a person into medical data, transforming simple facial pictures and videos into powerful sources of data for predictive analytics. Presently, patients do not have control over the access privileges to their medical records and remain unaware of the true value of the data they have. In this paper, we provide an overview of the next-generation artificial intelligence and blockchain technologies and present innovative solutions that may be used to accelerate biomedical research and provide patients with new tools to control and profit from their personal data, as well as incentives to undergo constant health monitoring. We introduce new concepts to appraise and evaluate personal records, including the combination-, time- and relationship-value of the data. We also present a roadmap for a blockchain-enabled decentralized personal health data ecosystem to enable novel approaches for drug discovery, biomarker development, and preventative healthcare. A secure and transparent distributed personal data marketplace utilizing blockchain and deep learning technologies may be able to resolve the challenges faced by regulators and return control over personal data, including medical records, back to the individuals.

311 citations


Journal ArticleDOI
TL;DR: This paper presents the 5Vs characteristics of big data and surveys the techniques and technologies used to handle big data across a wide variety of scalable database tools.

253 citations


Journal ArticleDOI
TL;DR: It is proposed in this position paper that big data analytics can be successfully combined with VPH technologies to produce robust and effective in silico medicine solutions.
Abstract: The idea that the purely phenomenological knowledge that we can extract by analyzing large amounts of data can be useful in healthcare seems to contradict the desire of VPH researchers to build detailed mechanistic models for individual patients. But in practice no model is ever entirely phenomenological or entirely mechanistic. We propose in this position paper that big data analytics can be successfully combined with VPH technologies to produce robust and effective in silico medicine solutions. In order to do this, big data technologies must be further developed to cope with some specific requirements that emerge from this application. Such requirements are: working with sensitive data; analytics of complex and heterogeneous data spaces, including nontextual information; distributed data management under security and performance constraints; specialized analytics to integrate bioinformatics and systems biology information with clinical observations at tissue, organ and organism scales; and specialized analytics to define the “physiological envelope” during the daily life of each patient. These domain-specific requirements suggest a need for targeted funding, in which big data technologies for in silico medicine become the research priority.

240 citations


Journal ArticleDOI
25 Feb 2015 - PLOS ONE
TL;DR: It is concluded that research data cannot be regarded as knowledge commons, but research policies that better incentivise data sharing are needed to improve the quality of research results and foster scientific progress.
Abstract: Despite widespread support from policy makers, funding agencies, and scientific journals, academic researchers rarely make their research data available to others. At the same time, data sharing in research is attributed a vast potential for scientific progress. It allows the reproducibility of study results and the reuse of old data for new research questions. Based on a systematic review of 98 scholarly papers and an empirical survey among 603 secondary data users, we develop a conceptual framework that explains the process of data sharing from the primary researcher’s point of view. We show that this process can be divided into six descriptive categories: Data donor, research organization, research community, norms, data infrastructure, and data recipients. Drawing from our findings, we discuss theoretical implications regarding knowledge creation and dissemination as well as research policy measures to foster academic collaboration. We conclude that research data cannot be regarded as knowledge commons, but research policies that better incentivise data sharing are needed to improve the quality of research results and foster scientific progress.

203 citations


Journal ArticleDOI
TL;DR: The concept of process-structure-property (PSP) linkages is introduced, and it is illustrated how the determination of PSPs is one of the main objectives of materials data science.
Abstract: The field of materials science and engineering is on the cusp of a digital data revolution. After reviewing the nature of data science and Big Data, we discuss the features of materials data that distinguish them from data in other fields. We introduce the concept of process-structure-property (PSP) linkages and illustrate how the determination of PSPs is one of the main objectives of materials data science. Then we review a selection of materials databases, as well as important aspects of materials data management, such as storage hardware, archiving strategies, and data access strategies. We introduce the emerging field of materials data analytics, which focuses on data-driven approaches to extract and curate materials knowledge from available data sets. The critical need for materials e-collaboration platforms is highlighted, and we conclude the article with a number of suggestions regarding the near-term future of the materials data science field.

199 citations


Journal ArticleDOI
TL;DR: A searchable attribute-based proxy reencryption system that enables a data owner to efficiently share his data with a specified group of users matching a sharing policy; the data not only maintains its searchable property, but the corresponding search keyword(s) can also be updated after the data sharing.
Abstract: To date, the growth of electronic personal data has led to a trend in which data owners prefer to remotely outsource their data to clouds to enjoy high-quality retrieval and storage services without worrying about the burden of local data management and maintenance. However, secure sharing and searching of the outsourced data is a formidable task, which may easily incur the leakage of sensitive personal information. Efficient data sharing and searching with security is therefore of critical importance. This paper, for the first time, proposes a searchable attribute-based proxy reencryption system. Compared with existing systems that support either searchable attribute-based functionality or attribute-based proxy reencryption, our new primitive supports both abilities and provides a flexible keyword update service. In particular, the system enables a data owner to efficiently share his data with a specified group of users matching a sharing policy; meanwhile, the data not only maintains its searchable property, but the corresponding search keyword(s) can also be updated after the data sharing. The new mechanism is applicable to many real-world applications, such as electronic health record systems. It is also proven chosen-ciphertext secure in the random oracle model.

179 citations


Proceedings ArticleDOI
29 Oct 2015
TL;DR: An implementation of the lambda architecture design pattern is presented to construct a data-handling backend on Amazon EC2, providing high-throughput services for dense and intense data demand while minimizing the cost of network maintenance.
Abstract: Sensor and smart phone technologies present opportunities for data explosion, streaming and collecting from heterogeneous devices every second. Analyzing these large datasets can reveal previously unknown behaviors and help optimize approaches to city-wide applications or societal use cases. However, collecting and handling these massive datasets presents challenges in how to perform optimized online data analysis ‘on-the-fly’, as current approaches are often limited by capability, expense and resources. This presents a need for developing new methods for data management, particularly using public clouds to minimize cost, network resources and on-demand availability. This paper presents an implementation of the lambda architecture design pattern to construct a data-handling backend on Amazon EC2, providing high-throughput services for dense and intense data demand while minimizing the cost of network maintenance. This paper combines ideas from database management, cost models, query management and cloud computing to present a general architecture that could be applied in any scenario where affordable online data processing of big datasets is needed. The results are presented with a case study of processing router sensor data from the current ESnet network as a working example of the approach. The results showcase a reduction in cost and argue the benefits of performing online analysis and anomaly detection for sensor data.
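As a rough illustration of the lambda architecture pattern that the paper builds on (not the authors' EC2 implementation), the sketch below merges a precomputed batch view with a continuously updated speed layer at query time; the metric names and numbers are hypothetical.

```python
from collections import defaultdict

# Batch layer: periodically recomputed view over the full historical dataset.
batch_view = {"router-1": 1_200_345, "router-2": 980_212}   # e.g. total packets

# Speed layer: incremental counts for events that arrived after the last batch run.
speed_view = defaultdict(int)

def ingest_realtime(event):
    """Update the speed layer as events stream in."""
    speed_view[event["router"]] += event["packets"]

def query(router):
    """Serving layer: merge batch and real-time views at query time."""
    return batch_view.get(router, 0) + speed_view.get(router, 0)

ingest_realtime({"router": "router-1", "packets": 42})
print(query("router-1"))   # batch result plus the not-yet-batched increment
```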

Journal ArticleDOI
TL;DR: A set of best practices and examples of software tools are presented that can enable research transparency, reproducibility and new knowledge by facilitating idea generation, research planning, data management and the dissemination of data and results.

Journal ArticleDOI
01 Feb 2015
TL;DR: This article surveys RDF data management architectures and systems designed for a cloud environment and, more generally, those large-scale RDF data management systems that can be easily deployed therein.
Abstract: The Resource Description Framework (RDF) pioneered by the W3C is increasingly being adopted to model data in a variety of scenarios, in particular data to be published or exchanged on the Web. Managing large volumes of RDF data is challenging, due to the sheer size, the heterogeneity, and the further complexity brought by RDF reasoning. To tackle the size challenge, distributed storage architectures are required. Cloud computing is an emerging paradigm massively adopted in many applications for the scalability, fault-tolerance, and elasticity features it provides, enabling the easy deployment of distributed and parallel architectures. In this article, we survey RDF data management architectures and systems designed for a cloud environment and, more generally, those large-scale RDF data management systems that can be easily deployed therein. We first give the necessary background, then describe the existing systems and proposals in this area, and classify them according to dimensions related to their capabilities and implementation techniques. The survey ends with a discussion of open problems and perspectives.
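For readers unfamiliar with RDF itself, the following minimal sketch loads a few triples and runs a SPARQL query. It uses the rdflib library, which is not one of the systems discussed in the survey, and the example vocabulary and data are hypothetical.

```python
from rdflib import Graph

# A tiny, hypothetical RDF dataset in Turtle syntax.
turtle_data = """
@prefix ex: <http://example.org/> .
ex:paper1 ex:title "RDF in the cloud" ; ex:year 2015 .
ex:paper2 ex:title "Distributed SPARQL" ; ex:year 2014 .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# SPARQL query: titles of papers published in 2015.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?title WHERE { ?p ex:title ?title ; ex:year 2015 . }
""")
for row in results:
    print(row.title)
```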

Journal ArticleDOI
TL;DR: A definition of big data in healthcare is proposed: big data is defined by volume, namely datasets with log(n × p) ≥ 7 (where n is the number of statistical individuals and p the number of variables), together with its great variety and high velocity.
Abstract: Objective. The aim of this study was to provide a definition of big data in healthcare. Methods. A systematic search of PubMed literature published until May 9, 2014, was conducted. We noted the number of statistical individuals and the number of variables for all papers describing a dataset. These papers were classified into fields of study. Characteristics attributed to big data by authors were also considered. Based on this analysis, a definition of big data was proposed. Results. A total of 196 papers were included. Big data can be defined as datasets with log(n × p) ≥ 7, where n is the number of statistical individuals and p the number of variables. Properties of big data are its great variety and high velocity. Big data raises challenges on veracity, on all aspects of the workflow, on extracting meaningful information, and on sharing information. Big data requires new computational methods that optimize data management. Related concepts are data reuse, false knowledge discovery, and privacy issues. Conclusion. Big data is defined by volume. Big data should not be confused with data reuse: data can be big without being reused for another purpose, for example, in omics. Inversely, data can be reused without being necessarily big, for example, secondary use of Electronic Medical Records (EMR) data.
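A quick worked example of the proposed volume criterion, assuming the logarithm is base 10; the dataset sizes below are made up.

```python
import math

def is_big_data(n, p):
    """Volume criterion from the paper: log10(n * p) >= 7 (base-10 assumed)."""
    return math.log10(n * p) >= 7

# Hypothetical examples:
print(is_big_data(1_000_000, 100))  # 10^8 -> log10 = 8, qualifies
print(is_big_data(10_000, 50))      # 5*10^5 -> log10 ~ 5.7, does not
```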

Proceedings ArticleDOI
27 May 2015
TL;DR: A new approach named factorized learning is introduced that pushes ML computations through joins and avoids redundancy in both I/O and computations; it is often substantially faster than the alternatives, but not always the fastest, necessitating a cost-based approach.
Abstract: Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.
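To make the idea of pushing computation through a key-foreign-key join concrete, here is a small numpy sketch for one gradient computation of linear regression: the naive variant materializes the join, while the factorized variant aggregates partial results per join key and touches each dimension-table row once. The tables and sizes are hypothetical, and this illustrates only the redundancy-avoidance idea, not the paper's system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized schema: fact table S (foreign key, 2 features, label)
# and dimension table R (4 features per key).
n, d_s, d_r, n_keys = 8, 2, 4, 3
keys = rng.integers(0, n_keys, size=n)          # foreign-key column of S
X_s = rng.normal(size=(n, d_s))                 # features stored in S
X_r = rng.normal(size=(n_keys, d_r))            # features stored in R
y = rng.normal(size=n)
w_s, w_r = np.zeros(d_s), np.zeros(d_r)

# Naive "learn after join": materialize the join output, duplicating R's rows.
X_join = np.hstack([X_s, X_r[keys]])
err = X_join @ np.concatenate([w_s, w_r]) - y
grad_naive = X_join.T @ err / n

# Factorized: compute R's contribution once per key instead of once per fact row.
partial_r = X_r @ w_r                           # one inner product per R row
err_f = X_s @ w_s + partial_r[keys] - y         # same residuals, no wide join
grad_s = X_s.T @ err_f / n
err_by_key = np.bincount(keys, weights=err_f, minlength=n_keys)
grad_r = X_r.T @ err_by_key / n                 # touches each R row once

assert np.allclose(grad_naive, np.concatenate([grad_s, grad_r]))
```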

Proceedings ArticleDOI
08 Jun 2015
TL;DR: The concept of a data lake is emerging as a popular way to organize and build the next generation of systems to master new big data challenges, but there are lots of concerns and questions for large enterprises to implement data lakes.
Abstract: The concept of a data lake is emerging as a popular way to organize and build the next generation of systems to master new big data challenges, but there are lots of concerns and questions for large enterprises to implement data lakes. The paper discusses the concept of data lakes and shares the author's thoughts and practices of data lakes.

Journal ArticleDOI
J.J. McArthur
TL;DR: In this article, the authors describe the process used to address and overcome four key challenges in developing building information management (BIM) models suitable for sustainable operations management: identification of the critical information required to inform operational decisions; the high level of effort needed to create new or modify existing BIM models for the building(s); management of information transfer between real-time operations and monitoring systems and the BIM model; and handling of uncertainty arising from incomplete building documentation.

Journal ArticleDOI
TL;DR: This investigation of VGI for disaster management provides broader insight into key challenges and impacts of VGI on geospatial data practices and the wider field of geographical science.
Abstract: The immediacy of locational information requirements and importance of data currency for natural disaster events highlights the value of volunteered geographic information (VGI) in all stages of disaster management, including prevention, preparation, response, and recovery. The practice of private citizens generating online geospatial data presents new opportunities for the creation and dissemination of disaster-related geographic data from a dense network of intelligent observers. VGI technologies enable rapid sharing of diverse geographic information for disaster management at a fraction of the resource costs associated with traditional data collection and dissemination, but they also present new challenges. These include a lack of data quality assurance and issues surrounding data management, liability, security, and the digital divide. There is a growing need for researchers to explore and understand the implications of these data and data practices for disaster management. In this article, we review the current state of knowledge in this emerging field and present recommendations for future research. Significantly, we note further research is warranted in the pre-event phases of disaster management, where VGI may present an opportunity to connect and engage individuals in disaster preparation and strengthen community resilience to potential disaster events. Our investigation of VGI for disaster management provides broader insight into key challenges and impacts of VGI on geospatial data practices and the wider field of geographical science.

Journal ArticleDOI
TL;DR: A data management plan is a document that describes how you will treat your data during a project and what happens with the data after the project ends, and is used in part to evaluate a project’s merit.
Abstract: Research papers and data products are key outcomes of the science enterprise. Governmental, nongovernmental, and private foundation sponsors of research are increasingly recognizing the value of research data. As a result, most funders now require that sufficiently detailed data management plans be submitted as part of a research proposal. A data management plan (DMP) is a document that describes how you will treat your data during a project and what happens with the data after the project ends. Such plans typically cover all or portions of the data life cycle—from data discovery, collection, and organization (e.g., spreadsheets, databases), through quality assurance/quality control, documentation (e.g., data types, laboratory methods) and use of the data, to data preservation and sharing with others (e.g., data policies and dissemination approaches). Fig 1 illustrates the relationship between hypothetical research and data life cycles and highlights the links to the rules presented in this paper. The DMP undergoes peer review and is used in part to evaluate a project’s merit. Plans also document the data management activities associated with funded projects and may be revisited during performance reviews. Fig 1. Relationship of the research life cycle (A) to the data life cycle (B); note: highlighted circles refer to the rules that are most closely linked to the steps of the data life cycle. As part of the research life cycle (A), many researchers (1) test ideas and hypotheses by (2) acquiring data that are (3) incorporated into various analyses and visualizations, leading to interpretations that are then (4) published in the literature and disseminated via other mechanisms (e.g., conference presentations, blogs, tweets), and that often lead back to (1) new ideas and hypotheses. During the data life cycle (B), researchers typically (1) develop a plan for how data will be managed during and after the project; (2) discover and acquire existing data and (3) collect and organize new data; (4) assure the quality of the data; (5) describe the data (i.e., ascribe metadata); (6) use the data in analyses, models, visualizations, etc.; and (7) preserve and (8) share the data with others (e.g., researchers, students, decision makers), possibly leading to new ideas and hypotheses.

Journal ArticleDOI
TL;DR: The business world is rapidly digitizing as companies embrace sensors, mobile devices, radio frequency identification, audio and video streams, software logs, and the Internet to predict needs, avert fraud and waste, understand relationships, and connect with stakeholders both internal and external to the firm.
Abstract: The business world is rapidly digitizing as companies embrace sensors, mobile devices, radio frequency identification, audio and video streams, software logs, and the Internet to predict needs, avert fraud and waste, understand relationships, and connect with stakeholders both internal and external to the firm. Digitization creates challenges because for most companies it is unevenly distributed throughout the organization: in a 2013 survey, only 39% of company-wide investment in digitization was identified as being in the IT budget (Weill and Woerner, 2013a). This uneven, disconnected investment makes it difficult to consolidate and simplify the increasing amount of data that is one of the outcomes of digitization. This in turn makes it more difficult to derive insight – and then proceed based on that insight. Early big data research identified over a dozen characteristics of data (e.g., location, network associations, latency, structure, softness) that challenge extant data management practices (Santos and Singer, 2012).

Journal ArticleDOI
TL;DR: There is a need for data literacy and it is advantageous to have a unified terminology; data literacy can be offered both to researchers, who need to become data-literate science workers, and with the goal of educating data management professionals.
Abstract: Purpose – The role of data literacy is discussed in the light of such activities as data quality, data management, data curation, and data citation. The differing terms and their relationship to the most important literacies are examined. The paper aims to discuss these issues. Design/methodology/approach – By stressing the importance of data literacy in fulfilling the mission of the contemporary academic library, the paper centres on information literacy, while the characteristics of other relevant literacies are also examined. The content of data literacy education is explained in the context of data-related activities. Findings – It can be concluded that there is a need for data literacy and it is advantageous to have a unified terminology. Data literacy can be offered both to researchers, who need to become data-literate science workers, and with the goal of educating data management professionals. Several lists of competencies contain important skills and abilities, many of them indicating the close r...

Journal ArticleDOI
TL;DR: These protocols facilitate high-speed lossless data compression and content-based multiview image fusion optimized for multicore CPU architectures (reducing image data size 30–500-fold), as well as visualization, editing and annotation of multiterabyte image data and cell-lineage reconstructions with tens of millions of data points.
Abstract: Light-sheet microscopy is a powerful method for imaging the development and function of complex biological systems at high spatiotemporal resolution and over long time scales. Such experiments typically generate terabytes of multidimensional image data, and thus they demand efficient computational solutions for data management, processing and analysis. We present protocols and software to tackle these steps, focusing on the imaging-based study of animal development. Our protocols facilitate (i) high-speed lossless data compression and content-based multiview image fusion optimized for multicore CPU architectures, reducing image data size 30-500-fold; (ii) automated large-scale cell tracking and segmentation; and (iii) visualization, editing and annotation of multiterabyte image data and cell-lineage reconstructions with tens of millions of data points. These software modules are open source. They provide high data throughput using a single computer workstation and are readily applicable to a wide spectrum of biological model systems.
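As a toy illustration of the lossless-compression step only (not the authors' optimized multicore implementation), the sketch below compresses an image volume slab by slab with Python's built-in zlib and verifies the round trip. The volume here is random noise, so real microscopy data would compress far better than this example.

```python
import zlib
import numpy as np

def compress_volume(volume, chunk_planes=16, level=6):
    """Losslessly compress a 3D image volume one slab of z-planes at a time."""
    chunks = []
    for z in range(0, volume.shape[0], chunk_planes):
        slab = np.ascontiguousarray(volume[z:z + chunk_planes])
        chunks.append(zlib.compress(slab.tobytes(), level))
    return chunks

def decompress_volume(chunks, shape, dtype):
    """Reassemble the original volume from compressed slabs."""
    raw = b"".join(zlib.decompress(c) for c in chunks)
    return np.frombuffer(raw, dtype=dtype).reshape(shape)

# Hypothetical 64x256x256 16-bit stack standing in for light-sheet data.
vol = np.random.randint(0, 4096, size=(64, 256, 256), dtype=np.uint16)
chunks = compress_volume(vol)
restored = decompress_volume(chunks, vol.shape, vol.dtype)
assert np.array_equal(vol, restored)            # lossless round trip
```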

Journal ArticleDOI
TL;DR: During the design, implementation and execution of the benchmarks, a number of point cloud data management improvements were proposed and partly tested: Morton/Hilbert codes for ordering data, Morton codes and Morton ranges, algorithms for parallel query execution, and a unique vario-scale LoD data organization that avoids the density jumps of the well-known discrete LoD data organizations.
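For context, a Morton (Z-order) code interleaves the bits of a point's coordinates so that points close in space tend to be close in the resulting one-dimensional ordering. The sketch below is a generic 2D illustration of that encoding, not the benchmark code from the paper.

```python
def morton_2d(x, y, bits=16):
    """Interleave the bits of two non-negative integer coordinates (x in the even positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Points that are near each other get nearby Morton codes, which makes a plain
# B-tree or sorted file usable as a coarse spatial index for point clouds.
points = [(3, 5), (4, 4), (100, 7)]
print(sorted(points, key=lambda p: morton_2d(*p)))
```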

Book
29 Dec 2015
TL;DR: The authors begin by explaining how Big Data can propel an organization forward by solving a spectrum of previously intractable business problems and show how a Big Data solution environment can be built and integrated to offer competitive advantages.
Abstract: "This text should be required reading for everyone in contemporary business." --Peter Woodhull, CEO, Modus21. "The one book that clearly describes and links Big Data concepts to business utility." --Dr. Christopher Starr, PhD. "Simply, this is the best Big Data book on the market!" --Sam Rostam, Cascadian IT Group. "...one of the most contemporary approaches I've seen to Big Data fundamentals..." --Joshua M. Davis, PhD. The definitive plain-English guide to Big Data for business and technology professionals: Big Data Fundamentals provides a pragmatic, no-nonsense introduction to Big Data. Best-selling IT author Thomas Erl and his team clearly explain key Big Data concepts, theory and terminology, as well as fundamental technologies and techniques. All coverage is supported with case study examples and numerous simple diagrams. The authors begin by explaining how Big Data can propel an organization forward by solving a spectrum of previously intractable business problems. Next, they demystify key analysis techniques and technologies and show how a Big Data solution environment can be built and integrated to offer competitive advantages. Topics covered include: discovering Big Data's fundamental concepts and what makes it different from previous forms of data analysis and data science; understanding the business motivations and drivers behind Big Data adoption, from operational improvements through innovation; planning strategic, business-driven Big Data initiatives; addressing considerations such as data management, governance, and security; recognizing the 5 V characteristics of datasets in Big Data environments (volume, velocity, variety, veracity, and value); clarifying Big Data's relationships with OLTP, OLAP, ETL, data warehouses, and data marts; working with Big Data in structured, unstructured, semi-structured, and metadata formats; increasing value by integrating Big Data resources with corporate performance monitoring; understanding how Big Data leverages distributed and parallel processing; using NoSQL and other technologies to meet Big Data's distinct data processing requirements; leveraging statistical approaches of quantitative and qualitative analysis; and applying computational analysis methods, including machine learning.

Journal ArticleDOI
26 Oct 2015 - PLOS ONE
TL;DR: The Genomics Virtual Laboratory is designed and implemented as a middleware layer of machine images, cloud management tools, and online services that enable researchers to build arbitrarily sized compute clusters on demand, pre-populated with fully configured bioinformatics tools, reference datasets and workflow and visualisation options.
Abstract: Background: Analyzing high throughput genomics data is a complex and compute intensive task, generally requiring numerous software tools and large reference data sets, tied together in successive stages of data transformation and visualisation. A computational platform enabling best practice genomics analysis ideally meets a number of requirements, including: a wide range of analysis and visualisation tools, closely linked to large user and reference data sets; workflow platform(s) enabling accessible, reproducible, portable analyses, through a flexible set of interfaces; highly available, scalable computational resources; and flexibility and versatility in the use of these resources to meet demands and expertise of a variety of users. Access to an appropriate computational platform can be a significant barrier to researchers, as establishing such a platform requires a large upfront investment in hardware, experience, and expertise. Results: We designed and implemented the Genomics Virtual Laboratory (GVL) as a middleware layer of machine images, cloud management tools, and online services that enable researchers to build arbitrarily sized compute clusters on demand, pre-populated with fully configured bioinformatics tools, reference datasets and workflow and visualisation options. The platform is flexible in that users can conduct analyses through web-based (Galaxy, RStudio, IPython Notebook) or command-line interfaces, and add/remove compute nodes and data resources as required. Best-practice tutorials and protocols provide a path from introductory training to practice. The GVL is available on the OpenStack-based Australian Research Cloud (http://nectar.org.au) and the Amazon Web Services cloud. The principles, implementation and build process are designed to be cloud-agnostic. Conclusions: This paper provides a blueprint for the design and implementation of a cloud-based Genomics Virtual Laboratory. We discuss scope, design considerations and technical and logistical constraints, and explore the value added to the research community through the suite of services and resources provided by our implementation.
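The "arbitrarily sized compute clusters on demand" idea can be sketched with a generic cloud API call. The snippet below uses boto3 against AWS EC2 purely as an illustration: the AMI ID, instance type, count, and key pair name are hypothetical, and the actual GVL build uses its own cloud management tooling rather than this call.

```python
import boto3  # assumes AWS credentials and a default region are configured

ec2 = boto3.resource("ec2")

# Launch a small cluster from a machine image that is assumed to be
# pre-populated with bioinformatics tools and reference data.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical GVL-style image ID
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=4,                        # grow or shrink to match the analysis
    KeyName="my-keypair",              # hypothetical SSH key pair name
)
print([i.id for i in instances])
```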

Journal ArticleDOI
TL;DR: The SEEK platform has been adopted by many systems biology consortia across Europe and is a data management environment that has a low barrier of uptake and provides rich resources for collaboration.
Abstract: Systems biology research typically involves the integration and analysis of heterogeneous data types in order to model and predict biological processes. Researchers therefore require tools and resources to facilitate the sharing and integration of data, and for linking of data to systems biology models. There are a large number of public repositories for storing biological data of a particular type, for example transcriptomics or proteomics, and there are several model repositories. However, this silo-type storage of data and models is not conducive to systems biology investigations. Interdependencies between multiple omics datasets and between datasets and models are essential. Researchers require an environment that will allow the management and sharing of heterogeneous data and models in the context of the experiments which created them. The SEEK is a suite of tools to support the management, sharing and exploration of data and models in systems biology. The SEEK platform provides an access-controlled, web-based environment for scientists to share and exchange data and models for day-to-day collaboration and for public dissemination. A plug-in architecture allows the linking of experiments, their protocols, data, models and results in a configurable system that is available 'off the shelf'. Tools to run model simulations, plot experimental data and assist with data annotation and standardisation combine to produce a collection of resources that support analysis as well as sharing. Underlying semantic web resources additionally extract and serve SEEK metadata in RDF (Resource Description Format). SEEK RDF enables rich semantic queries, both within SEEK and between related resources in the web of Linked Open Data. The SEEK platform has been adopted by many systems biology consortia across Europe. It is a data management environment that has a low barrier of uptake and provides rich resources for collaboration. This paper provides an update on the functions and features of the SEEK software, and describes the use of the SEEK in the SysMO consortium (Systems biology for Micro-organisms), and the VLN (virtual Liver Network), two large systems biology initiatives with different research aims and different scientific communities.

Journal ArticleDOI
TL;DR: This paper proposes a generic semantic big data architecture based on the “Knowledge as a Service” approach to cope with heterogeneity and scalability challenges, and focuses on enriching the NIST Big Data model with semantics in order to smartly understand the collected data.
Abstract: Advances supported by emerging wearable technologies in healthcare promise patients a provision of high quality of care. Wearable computing systems represent one of the main thrust areas used to transform traditional healthcare systems into active systems able to continuously monitor and control the patients' health in order to manage their care at an early stage. However, their proliferation creates challenges related to data management and integration. The diversity and variety of wearable data related to healthcare, their huge volume and their distribution make data processing and analytics more difficult. In this paper, we propose a generic semantic big data architecture based on the "Knowledge as a Service" approach to cope with heterogeneity and scalability challenges. Our main contribution focuses on enriching the NIST Big Data model with semantics in order to smartly understand the collected data, and generate more accurate and valuable information by correlating scattered medical data stemming from multiple wearable devices and/or from other distributed data sources. We have implemented and evaluated a Wearable KaaS platform to smartly manage heterogeneous data coming from wearable devices in order to assist physicians in supervising the patient's health evolution and keep the patient up-to-date about his/her status.
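A very small sketch of the enrichment idea described above: raw readings from different wearable devices use different field names, and a shared vocabulary lets them be merged per patient and correlated. The vocabulary, device payloads, and alert threshold below are all hypothetical, and this is not the Wearable KaaS platform itself.

```python
# Map device-specific field names onto a shared (hypothetical) vocabulary.
VOCAB = {"hr": "heart_rate_bpm", "pulse": "heart_rate_bpm", "spo2": "oxygen_saturation_pct"}

def enrich(reading):
    """Rewrite a raw device payload into vocabulary terms plus its provenance."""
    return {
        "patient": reading["patient"],
        "source": reading["device"],
        "observations": {VOCAB[k]: v for k, v in reading["data"].items() if k in VOCAB},
    }

raw = [
    {"patient": "p42", "device": "wristband-A", "data": {"hr": 118, "steps": 52}},
    {"patient": "p42", "device": "chest-strap-B", "data": {"pulse": 121}},
]
enriched = [enrich(r) for r in raw]

# Correlate the two sources: flag the patient if both report an elevated heart rate.
rates = [e["observations"]["heart_rate_bpm"] for e in enriched
         if "heart_rate_bpm" in e["observations"]]
if len(rates) >= 2 and min(rates) > 100:
    print("alert: elevated heart rate confirmed by multiple devices for p42")
```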

Journal ArticleDOI
Lanjun Wang, Shuo Zhang, Juwei Shi, Limei Jiao, Oktie Hassanzadeh, Jia Zou, Chen Wang
01 May 2015
TL;DR: A schema management framework for document stores that discovers and persists schemas of JSON records in a repository, and also supports queries and schema summarization, and proposes a new data structure, eSiBu-Tree, to store schemas and support queries.
Abstract: Document stores that provide the efficiency of a schema-less interface are widely used by developers in mobile and cloud applications. However, the simplicity that developers achieve conversely leads to complexity for data management due to the lack of a schema. In this paper, we present a schema management framework for document stores. This framework discovers and persists schemas of JSON records in a repository, and also supports queries and schema summarization. The major technical challenge comes from the varied structures of records caused by the schema-less data model and schema evolution. In the discovery phase, we apply a canonical form based method and propose an algorithm based on equivalent sub-trees to group equivalent schemas efficiently. Together with the algorithm, we propose a new data structure, eSiBu-Tree, to store schemas and support queries. In order to present a single summarized representation for heterogeneous schemas in records, we introduce the concept of "skeleton", and propose to use it as a relaxed form of the schema, which captures a small set of core attributes. Finally, extensive experiments based on real data sets demonstrate the efficiency of our proposed schema discovery algorithms, and practical use cases in real-world data exploration and integration scenarios are presented to illustrate the effectiveness of using skeletons in these applications.
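The core of the discovery phase, reducing each JSON record to a canonical structural signature so that records with equivalent schemas can be grouped, can be sketched as follows. This simplified version uses sorted (path, type) pairs rather than the paper's eSiBu-Tree, and the sample records are hypothetical.

```python
import json
from collections import defaultdict

def schema_signature(value, path=""):
    """Return a canonical, hashable set of (path, type) pairs for a JSON value."""
    if isinstance(value, dict):
        pairs = set()
        for k, v in value.items():
            pairs |= schema_signature(v, f"{path}.{k}")
        return frozenset(pairs)
    if isinstance(value, list):
        pairs = set()
        for item in value:
            pairs |= schema_signature(item, f"{path}[]")
        return frozenset(pairs)
    return frozenset({(path, type(value).__name__)})

records = [
    '{"name": "Ada", "age": 36}',
    '{"age": 41, "name": "Bob"}',                      # same schema, different key order
    '{"name": "Eve", "contacts": [{"email": "e@x"}]}'  # evolved schema
]
groups = defaultdict(list)
for rec in records:
    groups[schema_signature(json.loads(rec))].append(rec)
print(len(groups), "distinct schemas discovered")      # -> 2
```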

Book
18 Nov 2015
TL;DR: Crowdsourced Data Management: Industry and Academic Perspectives simultaneously introduces academics to real problems that practitioners encounter every day, and provides a survey of the state of the art in crowd-powered algorithms and system design tailored to large-scale data processing.
Abstract: Crowdsourcing and human computation enable organizations to accomplish tasks that are currently not possible for fully automated techniques to complete, or that require more flexibility and scalability than traditional employment relationships can facilitate. In the area of data processing, companies have benefited from crowd workers on platforms such as Amazon's Mechanical Turk or Upwork to complete tasks as varied as content moderation, web content extraction, entity resolution, and video/audio/image processing. Several academic researchers from diverse areas, ranging from the social sciences to computer science, have embraced crowdsourcing as a research area, resulting in algorithms and systems that improve crowd work quality, latency, and cost. Despite the relative nascence of the field, the academic and the practitioner communities have largely operated independently of each other for the past decade, rarely exchanging techniques and experiences. Crowdsourced Data Management: Industry and Academic Perspectives aims to narrow the gap between academics and practitioners. On the academic side, it summarizes the state of the art in crowd-powered algorithms and system design tailored to large-scale data processing. On the industry side, it surveys 13 industry users - such as Google, Facebook, and Microsoft - and four marketplace providers of crowd work - such as CrowdFlower and Upwork - to identify how hundreds of engineers and tens of millions of dollars are invested in various crowdsourcing solutions. Crowdsourced Data Management: Industry and Academic Perspectives simultaneously introduces academics to real problems that practitioners encounter every day, and provides a survey of the state of the art for practitioners to incorporate into their designs. Through the surveys, it also highlights the fact that crowd-powered data processing is a large and growing field. Over the next decade, most technical organizations are likely to benefit in some way from crowd work, and this monograph can help guide the effective adoption of crowdsourcing across these organizations.

Book Chapter
01 Aug 2015
TL;DR: A data collection and energy management platform for smart homes is presented that enhances the value of information given by smart energy meter data by providing user-tailored real-time energy consumption feedback and advice that can be easily accessed and acted upon by the household.
Abstract: This paper presents a data collection and energy feedback platform for smart homes to enhance the value of information given by smart energy meter data by providing user-tailored real-time energy consumption feedback and advice that can be easily accessed and acted upon by the household. Our data management platform consists of an SQL server back-end which collects data, namely, aggregate power consumption as well as consumption of major appliances, temperature, humidity, light, and motion data. These data streams allow us to infer information about the household’s appliance usage and domestic activities, which in turn enables meaningful and useful energy feedback. The platform developed has been rolled out in 20 UK households over a period of just over 21 months. As well as the data streams mentioned, qualitative data such as appliance survey, tariff, house construction type and occupancy information are also included. The paper presents a review of publically available smart home datasets and a description of our own smart home set up and monitoring platform. We then provide examples of the types of feedback that can be generated, looking at the suitability of electricity tariffs and appliance specific feedback.
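A minimal sketch of the kind of back-end table described here, using Python's built-in sqlite3 in place of the authors' SQL Server deployment; the channel names and readings are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for the SQL Server back-end
conn.execute("""
    CREATE TABLE readings (
        household   TEXT NOT NULL,
        channel     TEXT NOT NULL,   -- e.g. 'aggregate_power_w', 'fridge_power_w'
        ts          TEXT NOT NULL,   -- ISO-8601 timestamp
        value       REAL NOT NULL
    )
""")

now = datetime.now(timezone.utc).isoformat()
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [("house-07", "aggregate_power_w", now, 412.5),
     ("house-07", "fridge_power_w", now, 95.0),
     ("house-07", "temperature_c", now, 20.4)],
)

# Example feedback query: how much of the aggregate load is the fridge right now?
rows = dict(conn.execute(
    "SELECT channel, value FROM readings WHERE household = 'house-07'"
).fetchall())
print(f"fridge share: {rows['fridge_power_w'] / rows['aggregate_power_w']:.0%}")
```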