
Showing papers on "Data warehouse" published in 2019


Journal Article
TL;DR: This paper shows that OLAP possesses some potential to support the spatio-temporal dimensions of a data warehouse, but new tools are needed to exploit those dimensions fully.
Abstract: To exploit the full potential of the spatial and temporal dimensions of a data warehouse, new tools are needed. It has been shown that OLAP possesses a certain potential to support spatio-temporal ...

189 citations


Proceedings ArticleDOI
08 Apr 2019
TL;DR: This paper outlines a selection of use cases that Presto supports at Facebook, and describes its architecture and implementation, and calls out features and performance optimizations that enable it to support these use cases.
Abstract: Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. Presto is designed to be adaptive, flexible, and extensible. It supports a wide variety of use cases with diverse characteristics. These range from user-facing reporting applications with sub-second latency requirements to multi-hour ETL jobs that aggregate or join terabytes of data. Presto's Connector API allows plugins to provide a high performance I/O interface to dozens of data sources, including Hadoop data warehouses, RDBMSs, NoSQL systems, and stream processing systems. In this paper, we outline a selection of use cases that Presto supports at Facebook. We then describe its architecture and implementation, and call out features and performance optimizations that enable it to support these use cases. Finally, we present performance results that demonstrate the impact of our main design decisions.
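
The abstract describes Presto's Connector API exposing heterogeneous data sources behind a single SQL interface. As a minimal sketch of what a federated query could look like from a client, the snippet below uses the presto-python-client package (prestodb); the coordinator host, catalogs, and table names are hypothetical, not taken from the paper.

```python
# Minimal sketch of a federated Presto query, assuming the presto-python-client
# package (prestodb) and a reachable coordinator. Catalog, schema, and table
# names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# Join a Hive data warehouse table with a MySQL table through two connectors.
cur.execute("""
    SELECT c.region, count(*) AS views
    FROM hive.web.page_views v
    JOIN mysql.crm.customers c ON v.user_id = c.user_id
    WHERE v.ds = '2019-04-08'
    GROUP BY c.region
""")
for region, views in cur.fetchall():
    print(region, views)
```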

91 citations


Journal ArticleDOI
TL;DR: Data quality frameworks are surveyed in a comparative way regarding the definition, assessment, and improvement of data quality with a focus on methodologies that are applicable in a wide range of business environments to aid the decision process concerning the suitability of these methods.
Abstract: Nowadays, the importance of achieving and maintaining a high standard of data quality is widely recognized by both practitioners and researchers. Based on its impact on businesses, the quality of data is commonly viewed as a valuable asset. The literature comprises various techniques for defining, assessing, and improving data quality. However, requirements for data and their quality vary between organizations. Due to this variety, choosing suitable methods that are advantageous for the data quality of an organization or in a particular context can be challenging. This paper surveys data quality frameworks in a comparative way regarding the definition, assessment, and improvement of data quality with a focus on methodologies that are applicable in a wide range of business environments. To aid the decision process concerning the suitability of these methods, we further provide a decision guide to data quality frameworks. This guidance aims to help narrow down possible choices for data quality methodologies based on a number of specified criteria.
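
To make the assessment step concrete, here is an illustrative sketch (not taken from any surveyed framework) of scoring three commonly cited data quality dimensions with pandas; the dataframe and column names are invented.

```python
# Illustrative sketch: scoring completeness, uniqueness, and validity with
# pandas. Data and validity rule are invented for demonstration.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x", "c@x.com"],
})

completeness = df["email"].notna().mean()                        # share of non-null values
uniqueness = df["customer_id"].nunique() / len(df)               # share of distinct keys
validity = df["email"].str.contains(r"@.+\.", na=False).mean()   # crude format rule

print(f"completeness={completeness:.2f}, uniqueness={uniqueness:.2f}, validity={validity:.2f}")
```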

86 citations


Journal ArticleDOI
19 Feb 2019-PLOS ONE
TL;DR: An i2b2-to-OMOP transformation, driven by the ARCH-OMOP ontology and the OMOP concept mapping dictionary, is developed and is being used to send NEHPO production data to AOU.
Abstract: Background The All Of Us Research Program (AOU) is building a nationwide cohort of one million patients' EHR and genomic data. Data interoperability is paramount to the program's success. AOU is standardizing its EHR data around the Observational Medical Outcomes Partnership (OMOP) data model. OMOP is one of several standard data models presently used in national-scale initiatives. Each model is unique enough to make interoperability difficult. The i2b2 data warehousing and analytics platform is used at over 200 sites worldwide, which uses a flexible ontology-driven approach for data storage. We previously demonstrated this ontology system can drive data reconfiguration, to transform data into new formats without site-specific programming. We previously implemented this on our 12-site Accessible Research Commons for Health (ARCH) network to transform i2b2 into the Patient Centered Outcomes Research Network model. Methods and results Here, we leverage our investment in i2b2 high-performance transformations to support the AOU OMOP data pipeline. Because the ARCH ontology has gained widespread national interest (through the Accrual to Clinical Trials network, other PCORnet networks, and the Nebraska Lexicon), we leveraged sites' existing investments into this standard ontology. We developed an i2b2-to-OMOP transformation, driven by the ARCH-OMOP ontology and the OMOP concept mapping dictionary. We demonstrated and validated our approach in the AOU New England HPO (NEHPO). First, we transformed into OMOP a fake patient dataset in i2b2 and verified through AOU tools that the data was structurally compliant with OMOP. We then transformed a subset of data in the Partners Healthcare data warehouse into OMOP. We developed a checklist of assessments to ensure the transformed data had self-integrity (e.g., the distributions have an expected shape and required fields are populated), using OMOP's visual Achilles data quality tool. This i2b2-to-OMOP transformation is being used to send NEHPO production data to AOU. It is open-source and ready for use by other research projects.
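
The core idea is a dictionary-driven transformation of i2b2 facts into OMOP tables. The sketch below illustrates that idea only; it is not the paper's pipeline. Column names follow the usual i2b2 OBSERVATION_FACT and OMOP CONDITION_OCCURRENCE layouts but are abbreviated, and the concept map content is invented.

```python
# Simplified, illustrative dictionary-driven i2b2-to-OMOP transform.
import pandas as pd

# i2b2 facts: local concept codes per patient
facts = pd.DataFrame({
    "patient_num": [101, 102],
    "concept_cd": ["ICD10:E11.9", "ICD10:I10"],
    "start_date": ["2019-01-05", "2019-02-11"],
})

# Concept mapping dictionary: local code -> OMOP standard concept id (illustrative values)
concept_map = pd.DataFrame({
    "source_code": ["ICD10:E11.9", "ICD10:I10"],
    "concept_id": [201826, 320128],
})

omop = (facts.merge(concept_map, left_on="concept_cd", right_on="source_code")
             .rename(columns={"patient_num": "person_id",
                              "concept_id": "condition_concept_id",
                              "start_date": "condition_start_date"})
             [["person_id", "condition_concept_id", "condition_start_date"]])
print(omop)
```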

66 citations


Book ChapterDOI
26 Aug 2019
TL;DR: This work surveys the existing literature, proposes a complete definition and a generic, extensible data lake architecture, and introduces three future research axes, one of which concerns metadata management consisting of intra- and inter-metadata.
Abstract: As a relatively new concept, data lake has neither a standard definition nor an acknowledged architecture. Thus, we study the existing work and propose a complete definition and a generic and extensible architecture of data lake. What’s more, we introduce three future research axes in connection with our health-care Information Technology (IT) activities. They are related to (i) metadata management that consists of intra- and inter-metadata, (ii) a unified ecosystem for companies’ data warehouses and data lakes and (iii) data lake governance.
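
The distinction between intra-metadata (describing one dataset) and inter-metadata (describing relations between datasets) can be pictured with a small data structure. The attribute names below are illustrative assumptions, not the authors' model.

```python
# Minimal sketch of the intra-/inter-metadata distinction; field names are assumed.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class IntraMetadata:
    """Metadata describing a single dataset inside the lake."""
    dataset_id: str
    source: str
    format: str                                   # e.g. "csv", "parquet", "dicom"
    schema: Dict[str, str] = field(default_factory=dict)
    tags: List[str] = field(default_factory=list)

@dataclass
class InterMetadata:
    """Metadata describing a relationship between two datasets."""
    from_dataset: str
    to_dataset: str
    relation: str                                 # e.g. "derived_from", "joins_with"

catalog = [IntraMetadata("lab_results_2019", "hospital_his", "csv",
                         {"patient_id": "int", "loinc_code": "str"}, ["health"])]
lineage = [InterMetadata("lab_results_2019", "lab_results_clean", "derived_from")]
print(catalog, lineage)
```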

58 citations


Journal ArticleDOI
03 Oct 2019-Sensors
TL;DR: An Internet of Medical Things (IoMT) platform for pervasive healthcare that ensures interoperability, quality of the detection process, and scalability in an M2M-based architecture, and provides functionalities for the processing of high volumes of data, knowledge extraction, and common healthcare services is proposed.
Abstract: Pervasive healthcare services have undergone a great evolution in recent years. The technological development of communication networks, including the Internet, sensor networks, and M2M (Machine-to-Machine) have given rise to new architectures, applications, and standards related to addressing almost all current e-health challenges. Among the standards, the importance of OpenEHR has been recognized, since it enables the separation of medical semantics from data representation of electronic health records. However, it does not meet the requirements related to interoperability of e-health devices in M2M networks, or in the Internet of Things (IoT) scenarios. Moreover, the lack of interoperability hampers the application of new data-processing techniques, such as data mining and online analytical processing, due to the heterogeneity of the data and the sources. This article proposes an Internet of Medical Things (IoMT) platform for pervasive healthcare that ensures interoperability, quality of the detection process, and scalability in an M2M-based architecture, and provides functionalities for the processing of high volumes of data, knowledge extraction, and common healthcare services. The platform uses the semantics described in OpenEHR for both data quality evaluation and standardization of healthcare data stored by the association of IoMT devices and observations defined in OpenEHR. Moreover, it enables the application of big data techniques and online analytic processing (OLAP) through Hadoop Map/Reduce and content-sharing through fast healthcare interoperability resource (FHIR) application programming interfaces (APIs).

55 citations


Book ChapterDOI
26 Aug 2019
TL;DR: This work investigates the existing data lake literature and discusses various design and realization aspects of data lakes, such as governance and data models, in order to identify challenges and research gaps concerning data lake architecture, governance, and a comprehensive realization strategy.
Abstract: The digital transformation leads to massive amounts of heterogeneous data challenging traditional data warehouse solutions in enterprises. In order to exploit these complex data for competitive advantages, the data lake recently emerged as a concept for more flexible and powerful data analytics. However, existing literature on data lakes is rather vague and incomplete, and the various realization approaches that have been proposed neither cover all aspects of data lakes nor do they provide a comprehensive design and realization strategy. Hence, enterprises face multiple challenges when building data lakes. To address these shortcomings, we investigate existing data lake literature and discuss various design and realization aspects for data lakes, such as governance or data models. Based on these insights, we identify challenges and research gaps concerning (1) data lake architecture, (2) data lake governance, and (3) a comprehensive strategy to realize data lakes. These challenges still need to be addressed to successfully leverage the data lake in practice.

45 citations


Journal ArticleDOI
TL;DR: An open data integration platform for patient, clinical, medical and historical data siloed across multiple health information systems is proposed; its suitability for holistic information management, decision support, and predictive analytics justifies its role in the advancement of e-healthcare.

45 citations


Posted Content
TL;DR: The authors demonstrate that true and extensible parallel processing of database servers on the cloud can efficiently serve OLAP application demands, provided the BI designer plans for a highly partitioned database deployed on massively parallel database servers, each hosting at least one partition of the underlying database.
Abstract: Cloud computing is gradually gaining popularity among businesses due to its distinct advantages over self-hosted IT infrastructures. Business Intelligence (BI) is a highly resource-intensive system requiring large-scale parallel processing and significant storage capacities to host data warehouses. In self-hosted environments it was feared that BI would eventually face a resource crunch, because it would not be feasible for companies to keep adding resources to host a never-ending expansion of data warehouses and the online analytical processing (OLAP) demands on the underlying networking. Cloud computing has instigated new hope for the future prospects of BI. However, how will BI be implemented on the cloud, and what will the traffic and demand profiles look like? This research attempts to answer these key questions with regard to taking BI to the cloud. The cloud hosting of BI has been demonstrated with the help of a simulation on OPNET comprising a cloud model with multiple OLAP application servers applying parallel query loads on an array of servers hosting relational databases. The simulation results reflect that true and extensible parallel processing of database servers on the cloud can efficiently process OLAP application demands. Hence, the BI designer needs to plan for a highly partitioned database running on massively parallel database servers in which each server hosts at least one partition of the underlying database serving the OLAP demands.
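
The partitioned-parallel idea can be pictured with a toy fan-out/merge aggregation: each worker stands in for a database server holding one partition, and partial results are merged. This is only an illustration of the concept, not the paper's OPNET simulation.

```python
# Toy sketch: partial GROUP BY on each partition in parallel, then merge.
from concurrent.futures import ProcessPoolExecutor
from collections import Counter

def aggregate_partition(rows):
    """Partial 'GROUP BY region, SUM(amount)' on one partition."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals

if __name__ == "__main__":
    partitions = [
        [("EU", 10), ("US", 5)],    # partition hosted on server 1
        [("EU", 7), ("APAC", 3)],   # partition hosted on server 2
    ]
    with ProcessPoolExecutor() as pool:
        partials = pool.map(aggregate_partition, partitions)
    result = sum(partials, Counter())   # merge partial aggregates
    print(dict(result))                 # {'EU': 17, 'US': 5, 'APAC': 3}
```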

44 citations


Journal ArticleDOI
TL;DR: It is believed that CAMP FHIR can serve as an alternative to implementing new CDMs on a project-by-project basis and could support rare data sharing opportunities, such as collaborations between academic medical centers and community hospitals.
Abstract: Background: In a multisite clinical research collaboration, institutions may or may not use the same common data model (CDM) to store clinical data. To overcome this challenge, we proposed to use Health Level 7’s Fast Healthcare Interoperability Resources (FHIR) as a meta-CDM—a single standard to represent clinical data. Objective: In this study, we aimed to create an open-source application termed the Clinical Asset Mapping Program for FHIR (CAMP FHIR) to efficiently transform clinical data to FHIR for supporting source-agnostic CDM-to-FHIR mapping. Methods: Mapping with CAMP FHIR involves (1) mapping each source variable to its corresponding FHIR element and (2) mapping each item in the source data’s value sets to the corresponding FHIR value set item for variables with strict value sets. To date, CAMP FHIR has been used to transform 108 variables from the Informatics for Integrating Biology & the Bedside (i2b2) and Patient-Centered Outcomes Research Network data models to fields across 7 FHIR resources. It is designed to allow input from any source data model and will support additional FHIR resources in the future. Results: We have used CAMP FHIR to transform data on approximately 23,000 patients with asthma from our institution’s i2b2 database. Data quality and integrity were validated against the origin point of the data, our enterprise clinical data warehouse. Conclusions: We believe that CAMP FHIR can serve as an alternative to implementing new CDMs on a project-by-project basis. Moreover, the use of FHIR as a CDM could support rare data sharing opportunities, such as collaborations between academic medical centers and community hospitals. We anticipate adoption and use of CAMP FHIR to foster sharing of clinical data across institutions for downstream applications in translational research.
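
The two mapping steps described in the Methods (source variable to FHIR element, source value-set item to FHIR value-set item) can be sketched as simple lookup tables. This is not CAMP FHIR's code; the mapping entries below are invented for illustration.

```python
# Illustrative two-step mapping: variable -> FHIR element, then value -> FHIR value set item.
variable_map = {
    # (source_model, source_variable) -> (FHIR resource, FHIR element)
    ("i2b2", "sex_cd"): ("Patient", "gender"),
    ("i2b2", "birth_date"): ("Patient", "birthDate"),
}
value_set_map = {
    ("Patient", "gender"): {"M": "male", "F": "female", "U": "unknown"},
}

def to_fhir(source_model, variable, value):
    resource, element = variable_map[(source_model, variable)]
    mapped = value_set_map.get((resource, element), {}).get(value, value)
    return {"resourceType": resource, element: mapped}

print(to_fhir("i2b2", "sex_cd", "F"))   # {'resourceType': 'Patient', 'gender': 'female'}
```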

42 citations


Journal ArticleDOI
TL;DR: This paper aims at studying data integration technologies, tools, and applications within the healthcare domain and discusses future research directions in the integration of Big healthcare data.
Abstract: In recent years, the radical advancement of technologies has given rise to an abundance of software applications, social media, and smart devices such as smartphone, sensors, and so on. More extensive use of these applications and tools in various industrial domains has led to data deluge, which has fostered enormous challenges and opportunities. However, it is not only the volume of the data but also the speed, variety, and uncertainty, which are promoting a massive challenge for traditional technologies such as data warehouse. These diverse and unprecedented characteristics have engendered the notion of “Big Data.” The data-intensive industries have been experiencing a wide variety of challenges in terms of processing, managing, and analysis of data. For instance, the healthcare sector is confronting difficulties in respect of integration or fusion of diverse medical data stemming from multiple heterogeneous sources. Data integration is critically important within the healthcare sector because it enriches data, enhances its value, and more importantly paves a solid foundation for highly efficient and effective healthcare analytics such as predicting diseases or an outbreak. Several data integration technologies and tools have been developed over the last two decades. This paper aims at studying data integration technologies, tools, and applications within the healthcare domain. Furthermore, this paper discusses future research directions in the integration of Big healthcare data.

Journal ArticleDOI
TL;DR: A quality assurance (QA) process and code base are developed to accompany the incremental transformation of the Department of Veterans Affairs Corporate Data Warehouse health care database into the Observational Medical Outcomes Partnership (OMOP) CDM and to prevent incremental load errors.
Abstract: Background The development and adoption of health care common data models (CDMs) have addressed some of the logistical challenges of performing research on data generated from disparate health care systems by standardizing data representations and leveraging standardized terminology to express clinical information consistently. However, transforming a data system into a CDM is not a trivial task, and maintaining an operational, enterprise-capable CDM that is incrementally updated within a data warehouse is challenging. Objectives To develop a quality assurance (QA) process and code base to accompany our incremental transformation of the Department of Veterans Affairs Corporate Data Warehouse health care database into the Observational Medical Outcomes Partnership (OMOP) CDM, in order to prevent incremental load errors. Methods We designed and implemented a multistage QA approach centered on completeness, value conformance, and relational conformance data-quality elements. For each element we describe key incremental load challenges, our extract, transform, and load (ETL) solution to overcome those challenges, and potential impacts of incremental load failure. Results Completeness and value conformance data-quality elements are most affected by incremental changes to the CDW, while updates to source identifiers impact relational conformance. ETL failures surrounding these elements lead to incomplete and inaccurate capture of clinical concepts as well as data fragmentation across patients, providers, and locations. Conclusion Development of robust QA processes supporting accurate transformation of OMOP and other CDMs from source data is still in evolution, and opportunities exist to extend the existing QA framework and tools used for incremental ETL QA processes.
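
The three data-quality elements named above (completeness, value conformance, relational conformance) translate naturally into post-load checks. The sketch below is illustrative, not the VA's code base: table and column names follow common OMOP conventions, and any non-zero count is treated as an incremental load error.

```python
# Illustrative post-load QA checks for an incremental OMOP ETL, run via any DB-API cursor.
QA_CHECKS = {
    # Completeness: required fields populated after the load
    "completeness_person_id": """
        SELECT count(*) FROM condition_occurrence WHERE person_id IS NULL
    """,
    # Value conformance: concept ids resolve to entries in the concept vocabulary
    "value_conformance_concepts": """
        SELECT count(*) FROM condition_occurrence co
        LEFT JOIN concept c ON co.condition_concept_id = c.concept_id
        WHERE c.concept_id IS NULL
    """,
    # Relational conformance: no orphan rows after source identifiers change
    "relational_conformance_persons": """
        SELECT count(*) FROM condition_occurrence co
        LEFT JOIN person p ON co.person_id = p.person_id
        WHERE p.person_id IS NULL
    """,
}

def run_checks(cursor):
    """Run each check; any non-zero count flags an incremental load error."""
    failures = {}
    for name, sql in QA_CHECKS.items():
        cursor.execute(sql)
        (count,) = cursor.fetchone()
        if count:
            failures[name] = count
    return failures
```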

Proceedings ArticleDOI
25 Jun 2019
TL;DR: Apache Hive is an open-source relational database system for analytic big-data workloads; this paper describes a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications.
Abstract: Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.

Journal ArticleDOI
01 Jan 2019
TL;DR: This paper provides an overview of the existing ETL process data quality approaches, and presents a comparative study of some commercial ETL tools to show how much these tools consider data quality dimensions.
Abstract: The accuracy and relevance of Business Intelligence & Analytics (BI&A) rely on the ability to bring high data quality to the data warehouse from both internal and external sources using the ETL process. The latter is complex and time-consuming as it manages data with heterogeneous content and diverse quality problems. Ensuring data quality requires tracking quality defects along the ETL process. In this paper, we present the main ETL quality characteristics. We provide an overview of the existing ETL process data quality approaches. We also present a comparative study of some commercial ETL tools to show how much these tools consider data quality dimensions. To illustrate our study, we carry out experiments using an ETL dedicated solution (Talend Data Integration) and a data quality dedicated solution (Talend Data Quality). Based on our study, we identify and discuss quality challenges to be addressed in our future research.

Book ChapterDOI
08 Sep 2019
TL;DR: In data lakes, data querying and analysis depend on a metadata system that must be efficient and comprehensive; however, metadata management in data lakes remains an open issue, and criteria for evaluating its effectiveness are largely nonexistent.
Abstract: Over the past decade, the data lake concept has emerged as an alternative to data warehouses for storing and analyzing big data. A data lake allows storing data without any predefined schema. Therefore, data querying and analysis depend on a metadata system that must be efficient and comprehensive. However, metadata management in data lakes remains a current issue and the criteria for evaluating its effectiveness are more or less nonexistent.

Journal ArticleDOI
TL;DR: An evolutionary game theory-based method for materialized view selection in data warehouses is presented, which exploits the multiple view processing plan structure to represent the search space of the problem.
Abstract: A data warehouse contains a number of views that are used to respond to system queries. On the one hand, the time-consuming process of responding to analytical queries of the data warehouse requires storing intermediate views for efficient query answering; on the other hand, the large number and high volume of intermediate views make storing all views impossible. Hence, choosing the optimal set of views for materialization is one of the most important decisions in data warehouse design, in order to increase the efficiency of query answering. Since the search space of the problem is very large, searching among the collections of all possible views of a data warehouse is very expensive, and it is therefore necessary to use methods that solve the problem in an acceptable time. Randomized methods, such as game theory-based optimization approaches, try to increase the speed of selecting materialized views by finding near-optimal solutions. In this article, an evolutionary game theory-based method for materialized view selection in data warehouses is presented, which exploits the multiple view processing plan structure to represent the search space of the problem. In this method, a population of players is created, each of which is a solution to the problem. Three strategies are considered for each player, and at each repetition of the game, players attempt to choose the best strategy for themselves. At the end of the game, the final solution is calculated according to the strategies selected by the players. Our empirical evaluations revealed that the proposed method converges appropriately for large data warehouses and its execution time is very good. It is also shown that the quality of the solutions of the proposed method is better than that of other similar randomized methods.
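
To make the player/strategy framing concrete, here is a toy evolutionary-game-style loop. The abstract does not name the three strategies or the payoff model, so the strategy set, benefit/cost figures, and update rule below are all illustrative assumptions, not the paper's algorithm.

```python
# Toy evolutionary-game-style search for a set of views to materialize.
import random

VIEWS = ["v1", "v2", "v3", "v4"]
QUERY_BENEFIT = {"v1": 50, "v2": 30, "v3": 20, "v4": 10}   # saved query cost (invented)
MAINT_COST = {"v1": 25, "v2": 10, "v3": 15, "v4": 5}        # maintenance cost (invented)
STRATEGIES = ["add_view", "drop_view", "keep"]              # assumed strategy set

def payoff(solution):
    return sum(QUERY_BENEFIT[v] - MAINT_COST[v] for v in solution)

def play(solution):
    """One round: a player tries a strategy and keeps the change if it pays off."""
    strategy = random.choice(STRATEGIES)
    candidate = set(solution)
    if strategy == "add_view" and len(candidate) < len(VIEWS):
        candidate.add(random.choice([v for v in VIEWS if v not in candidate]))
    elif strategy == "drop_view" and candidate:
        candidate.remove(random.choice(sorted(candidate)))
    return candidate if payoff(candidate) >= payoff(solution) else solution

random.seed(0)
players = [set(random.sample(VIEWS, 2)) for _ in range(10)]   # initial population
for _ in range(50):                                           # repeated game rounds
    players = [play(p) for p in players]
best = max(players, key=payoff)
print(sorted(best), payoff(best))
```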

Book ChapterDOI
26 Jan 2019
TL;DR: Conceptual, logical, and physical models of distributed storages and procedures for inter-level transitions are proposed; the location of data on the nodes and the data replication routes are determined by the criterion of minimum total cost of data storage and processing, using a modified genetic algorithm.
Abstract: The article considers the scientific and practical task of creating information technology for the construction of distributed data warehouses of a hybrid type, taking into account the properties of the data and the statistics of queries against the storage. An analysis of the problem of data warehouse construction with respect to data properties and executed queries is carried out. Conceptual, logical, and physical models of distributed storages and procedures for inter-level transitions are proposed. The location of data on the nodes and the data replication routes are determined by the criterion of minimum total cost of data storage and processing, using a modified genetic algorithm.
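
A bare-bones genetic algorithm for such a placement problem might look as follows. The cost matrices, GA settings, and operators are invented for illustration; the paper's specific modifications are not reproduced here.

```python
# Bare-bones GA sketch: place data fragments on nodes by minimum total
# storage + processing cost (all figures invented).
import random

random.seed(1)
N_FRAGMENTS, N_NODES = 6, 3
STORAGE = [[random.randint(1, 9) for _ in range(N_NODES)] for _ in range(N_FRAGMENTS)]
PROCESSING = [[random.randint(1, 9) for _ in range(N_NODES)] for _ in range(N_FRAGMENTS)]

def total_cost(placement):          # placement[i] = node hosting fragment i
    return sum(STORAGE[i][n] + PROCESSING[i][n] for i, n in enumerate(placement))

def crossover(a, b):
    cut = random.randrange(1, N_FRAGMENTS)
    return a[:cut] + b[cut:]

def mutate(p, rate=0.1):
    return [random.randrange(N_NODES) if random.random() < rate else n for n in p]

population = [[random.randrange(N_NODES) for _ in range(N_FRAGMENTS)] for _ in range(30)]
for _ in range(100):
    population.sort(key=total_cost)
    parents = population[:10]                       # simple elitist selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    population = parents + children

best = min(population, key=total_cost)
print(best, total_cost(best))
```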

Journal ArticleDOI
TL;DR: A new architecture for the Cognitive Internet of Things (CIoT) and big data is proposed by combining the Data Warehouse (DWH) and Data Lake (DL) and defining a tool for heterogeneous data collection.

Journal ArticleDOI
TL;DR: A hybrid information infrastructure for business intelligence and analytics (BI&A) and KM based on an educational data warehouse (EDW) and an enterprise architecture (EA) repository that allows the digitization of knowledge and empowers the visualization and the analysis of dissimilar organizational components as people, processes, and technology is presented.
Abstract: Advances in science and technology, the Internet of Things, and the proliferation of mobile apps are critical factors to the current increase in the amount, structure, and size of information that organizations have to store, process, and analyze. Traditional data storages present technical deficiencies when handling huge volumes of data and are not adequate for process modeling and business intelligence; to cope with these deficiencies, new methods and technologies have been developed under the umbrella of big data. However, there is still the need in higher education institutions (HEIs) of a technological tool that can be used for big data processing and knowledge management (KM). To overcome this issue, it is essential to develop an information infrastructure that allows the capturing of knowledge and facilitates experimentation by having cleaned and consistent data. Thus, this paper presents a hybrid information infrastructure for business intelligence and analytics (BI&A) and KM based on an educational data warehouse (EDW) and an enterprise architecture (EA) repository that allows the digitization of knowledge and empowers the visualization and the analysis of dissimilar organizational components as people, processes, and technology. The proposed infrastructure was created based on research and will serve to run different experiments to analyze educational data and academic processes and for the creation of explicit knowledge using different algorithms and methods of educational data mining, learning analytics, online analytical processing (OLAP), and EA analytics.

Journal ArticleDOI
TL;DR: In this paper, the concept of "aesthetic practices" is proposed to capture the work needed for population data to come into relation so that it can be disseminated via government data portals, in our case, the Census Hub of the European Statistical System (ESS) and the Danish Ministry of Education's Data Warehouse.
Abstract: We develop the concept of ‘aesthetic practices’ to capture the work needed for population data to come into relation so that it can be disseminated via government data portals, in our case, the Census Hub of the European Statistical System (ESS) and the Danish Ministry of Education’s Data Warehouse. The portals form part of open government data (OGD) initiatives, which we understand as governing technologies. We argue that to function as such, aesthetic practices are required so that data produced at dispersed sites can be brought into relation and projected as populations at data portals in forms such as bar charts, heat maps and tables. Two examples of aesthetic practices are analysed based on ethnographic studies we have conducted on the production of data for the Hub and Warehouse: metadata and data cleaning. Metadata enables data to come into relation by containing and accounting for (some of) the differences between data. Data cleaning deals with the indeterminacies and absences of data and involves algorithms to determine what values data can obtain so they can be brought into relation. We attend to how both aesthetic practices involve normative decisions that make absent what exceeds them: embodied knowledge that cannot or has not been documented; and data that cannot meet the forms required of data portals. While these aesthetic practices are necessary to sustain data portals as ‘sites of projection,’ we also bring critical attention to their performative effects for knowing, enacting and governing populations.

Journal ArticleDOI
01 Jul 2019-BMJ Open
TL;DR: This project will create person-level profiles guided by the Gelberg-Andersen Behavioral Model and describe new patterns of HIV care utilisation behaviour among persons living with HIV, and ‘missed opportunities’ for re-engaging them back into care.
Abstract: Introduction Linkage and retention in HIV medical care remains problematic in the USA. Extensive health utilisation data collection through electronic health records (EHR) and claims data represent new opportunities for scientific discovery. Big data science (BDS) is a powerful tool for investigating HIV care utilisation patterns. The South Carolina (SC) office of Revenue and Fiscal Affairs (RFA) data warehouse captures individual-level longitudinal health utilisation data for persons living with HIV (PLWH). The data warehouse includes EHR, claims and data from private institutions, housing, prisons, mental health, Medicare, Medicaid, State Health Plan and the department of health and human services. The purpose of this study is to describe the process for creating a comprehensive database of all SC PLWH, and plans for using BDS to explore, identify, characterise and explain new predictors of missed opportunities for HIV medical care utilisation. Methods and analysis This project will create person-level profiles guided by the Gelberg-Andersen Behavioral Model and describe new patterns of HIV care utilisation. The population for the comprehensive database comes from statewide HIV surveillance data (2005–2016) for all SC PLWH (N≈18000). Surveillance data are available from the state health department’s enhanced HIV/AIDS Reporting System (e-HARS). Additional data pulls for the e-HARS population will include Ryan White HIV/AIDS Program Service Reports, Health Sciences SC data and Area Health Resource Files. These data will be linked to the RFA data and serve as sources for traditional and vulnerable domain Gelberg-Anderson Behavioral Model variables. The project will use BDS techniques such as machine learning to identify new predictors of HIV care utilisation behaviour among PLWH, and ‘missed opportunities’ for re-engaging them back into care. Ethics and dissemination The study team applied for data from different sources and submitted individual Institutional Review Board (IRB) applications to the University of South Carolina (USC) IRB and other local authorities/agencies/state departments. This study was approved by the USC IRB (#Pro00068124) in 2017. To protect the identity of the persons living with HIV (PLWH), researchers will only receive linked deidentified data from the RFA. Study findings will be disseminated at local community forums, community advisory group meetings, meetings with our state agencies, local partners and other key stakeholders (including PLWH, policy-makers and healthcare providers), presentations at academic conferences and through publication in peer-reviewed articles. Data security and patient confidentiality are the bedrock of this study. Extensive data agreements ensuring data security and patient confidentiality for the deidentified linked data have been established and are stringently adhered to. The RFA is authorised to collect and merge data from these different sources and to ensure the privacy of all PLWH. The legislatively mandated SC data oversight council reviewed the proposed process stringently before approving it. Researchers will get only the encrypted deidentified dataset to prevent any breach of privacy in the data transfer, management and analysis processes. In addition, established secure data governance rules, data encryption and encrypted predictive techniques will be deployed. 
In addition to the data anonymisation as a part of privacy-preserving analytics, encryption schemes that protect running prediction algorithms on encrypted data will also be deployed. Best practices and lessons learnt about the complex processes involved in negotiating and navigating multiple data sharing agreements between different entities are being documented for dissemination.

Book ChapterDOI
01 Jan 2019
TL;DR: Building smart Remote Patient Monitoring (RPM) models using cloud-based technologies will preserve the lives of patients, especially the elderly who live alone, and gives great hope for developing smart healthcare systems that can provide innovative medical services.
Abstract: Healthcare informatics is undergoing a revolution because of the availability of safe, wearable sensors at low cost. Smart hospitals have exploited the development of Internet of Things (IoT) sensors to create Remote Patient Monitoring (RPM) models that observe patients in their homes. RPM is one of the Ambient Assisted Living (AAL) applications. The long-term monitoring of patients using AALs generates big data; therefore, AALs must adopt cloud-based architectures to store, process, and analyze it. The use of big data analytics for handling and analyzing massive amounts of medical data will bring a major shift in the healthcare field. Advanced software frameworks such as Hadoop will promote the success of medical assistive applications because they allow data to be stored in its native form, not only as electronic medical records in data warehouses. In addition, Spark and its machine learning libraries analyze big medical data up to ten times faster than MapReduce. Advanced cloud technologies capable of handling big data give great hope for developing smart healthcare systems that can provide innovative medical services. Building smart RPM models using cloud-based technologies will preserve the lives of patients, especially the elderly who live alone. A case study monitoring patients suffering from chronic diseases (blood pressure disorders) for 24 h, with a reading every 15 min, using a cloud-based monitoring model shows its effectiveness in predicting the health status of patients.

Journal ArticleDOI
TL;DR: This paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying the efficiency of those strategies in query performance, demonstrating the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies.
Abstract: Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Some studies were conducted for understanding the ways of optimizing the performance of several storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance, when using Hive as the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying the efficiency of those strategies in query performance. The obtained results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies. Defining the partitions aligned with the attributes that are frequently used in the conditions/filters of the queries can significantly increase the efficiency of the system in terms of response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were verified. The same is not verified with the use of bucketing strategies, which shows potential benefits in very specific scenarios, suggesting a more restricted use of this functionality, namely in the context of bucketing two tables by the join attribute of these tables.
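
The partitioning and bucketing strategies evaluated in the paper correspond to standard HiveQL data organization clauses. As a sketch only, the DDL below (issued through PyHive) partitions a denormalized fact table by the attribute most often used in query filters and buckets it by a join key; the host, table, and column names are hypothetical, not the paper's benchmark schema.

```python
# Sketch of partitioning/bucketing a denormalized Hive table via PyHive.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, username="etl")
cur = conn.cursor()

# Partition by the attribute frequently used in query filters (date),
# bucket by the usual join attribute (customer_id).
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_flat (
        customer_id BIGINT,
        product     STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# A query filtering on the partition column only reads the matching partitions.
cur.execute("SELECT product, sum(amount) FROM sales_flat "
            "WHERE sale_date = '2019-06-01' GROUP BY product")
print(cur.fetchall())
```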

Journal ArticleDOI
TL;DR: A novel coral reefs optimization-based method is introduced for materialized view selection in a data warehouse; evaluations show that this method offers higher-quality solutions than other similar randomized methods in terms of query coverage rate.
Abstract: High response time of analytical queries is one of the most challenging issues of data warehouses. Complicated nature of analytical queries and enormous volume of data are the most important reasons of this high response time. The aim of materialized view selection is to reduce the response time of these analytical queries. For this purpose, the search space is firstly constructed by producing the set of all possible views based on given queries and then, the (semi-) optimal set of materialized views will be selected so that the queries can be answered at the lowest cost using them. Various materialized view selection methods have been proposed in the literature, most of which are randomized methods due to the time-consuming nature of this problem. Randomized view selection methods choose a semi-optimal set of proper views for materialization in an appropriate time using one or a combination of some meta-heuristic(s). In this paper, a novel coral reefs optimization-based method is introduced for materialized view selection in a data warehouse. Coral reefs optimization algorithm is an optimization method that solves problems by simulating the coral behaviors for placement and growth in reefs. In the proposed method, each solution of the problem is considered as a coral, which is always trying to be placed and grow in the reefs. In each step, special operators of the coral reefs optimization algorithm are applied on the solutions. After several steps, better solutions are more likely to survive and grow on the reefs. The best solution is finally chosen as the final solution of the problem. The practical evaluations of the proposed method show that this method offers higher quality solutions than other similar random methods in terms of coverage rate of queries.

Journal ArticleDOI
TL;DR: In this paper, the authors present a literature review on the main applications of OLAP technology in the analysis of information network data, conducting a systematic review to list the works that apply OLAP technologies to graph data.
Abstract: Many real systems produce network data or highly interconnected data, which can be called information networks. These information networks form a critical component in modern information infrastructure, constituting a large graph data volume. The analysis of information network data covers several technological areas, among them OLAP technologies. OLAP is a technology that enables multi-dimensional and multi-level analysis on a large volume of data, providing aggregated data visualizations with different perspectives. This article presents a literature review on the main applications of OLAP technology in the analysis of information network data. To achieve this goal, it presents a systematic review listing the works that apply OLAP technologies to graph data. It defines seven comparison criteria (Materialization, Network, Selection, Aggregation, Model, OLAP Operations, Analytics) to qualify the works found based on their functionalities. The works are analyzed according to each criterion and discussed to identify trends and challenges in the application of OLAP to information networks.
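
A tiny illustration of the "graph OLAP" idea discussed above: rolling a network up by a node attribute and aggregating edge weights between the resulting groups. The graph content and attribute names are invented; this is not drawn from any surveyed system.

```python
# Illustrative graph roll-up: aggregate nodes by "institution", sum edge weights.
import networkx as nx
from collections import defaultdict

g = nx.Graph()
g.add_node("alice", institution="MIT")
g.add_node("bob", institution="MIT")
g.add_node("carol", institution="ETH")
g.add_edge("alice", "carol", weight=3)    # e.g. number of co-authored papers
g.add_edge("bob", "carol", weight=1)
g.add_edge("alice", "bob", weight=2)

rolled_up = defaultdict(int)
for u, v, data in g.edges(data=True):
    gu, gv = g.nodes[u]["institution"], g.nodes[v]["institution"]
    rolled_up[tuple(sorted((gu, gv)))] += data["weight"]

print(dict(rolled_up))   # {('ETH', 'MIT'): 4, ('MIT', 'MIT'): 2}
```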

Journal ArticleDOI
TL;DR: The newly modelled biological data types and the enhanced visual and analytical features of TargetMine are described and it is demonstrated how the newer enhancements in TargetMine have contributed to a more expansive coverage of the biological data space and can help interpret genotype–phenotype relations.
Abstract: Biological data analysis is the key to new discoveries in disease biology and drug discovery. The rapid proliferation of high-throughput 'omics' data has necessitated a need for tools and platforms that allow the researchers to combine and analyse different types of biological data and obtain biologically relevant knowledge. We had previously developed TargetMine, an integrative data analysis platform for target prioritisation and broad-based biological knowledge discovery. Here, we describe the newly modelled biological data types and the enhanced visual and analytical features of TargetMine. These enhancements have included: an enhanced coverage of gene-gene relations, small molecule metabolite to pathway mappings, an improved literature survey feature, and in silico prediction of gene functional associations such as protein-protein interactions and global gene co-expression. We have also described two usage examples on trans-omics data analysis and extraction of gene-disease associations using MeSH term descriptors. These examples have demonstrated how the newer enhancements in TargetMine have contributed to a more expansive coverage of the biological data space and can help interpret genotype-phenotype relations. TargetMine with its auxiliary toolkit is available at https://targetmine.mizuguchilab.org. The TargetMine source code is available at https://github.com/chenyian-nibio/targetmine-gradle.

Proceedings ArticleDOI
03 May 2019
TL;DR: This paper proposes in this paper a methodological approach to build and manage a metadata system that is specific to textual documents in data lakes, and applies some specific techniques from the text mining and information retrieval domains to extract, store and reuse these metadata within the COREL research project.
Abstract: Data lakes have emerged as an alternative to data warehouses for the storage, exploration and analysis of big data. In a data lake, data are stored in a raw state and bear no explicit schema. Thence, an efficient metadata system is essential to avoid the data lake turning to a so-called data swamp. Existing works about managing data lake metadata mostly focus on structured and semi-structured data, with little research on unstructured data. Thus, we propose in this paper a methodological approach to build and manage a metadata system that is specific to textual documents in data lakes. First, we make an inventory of usual and meaningful metadata to extract. Then, we apply some specific techniques from the text mining and information retrieval domains to extract, store and reuse these metadata within the COREL research project, in order to validate our proposals.
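
As a small illustration of the kind of metadata such a system might extract from textual documents, the sketch below records document size and the strongest TF-IDF terms using scikit-learn. It is illustrative only and not the metadata model or pipeline of the COREL project; the documents are invented.

```python
# Illustrative extraction of basic metadata (size, top TF-IDF terms) for text documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = {
    "report_01.txt": "data lake metadata management for textual documents",
    "report_02.txt": "text mining and information retrieval extract keywords",
}

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs.values())
terms = vectorizer.get_feature_names_out()

metadata = []
for (name, text), row in zip(docs.items(), tfidf.toarray()):
    top_terms = [terms[i] for i in row.argsort()[::-1][:3]]   # 3 strongest terms
    metadata.append({"document": name, "n_chars": len(text), "keywords": top_terms})

print(metadata)
```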

Journal ArticleDOI
TL;DR: This paper discusses modeling and querying mobility data warehouses, providing a comprehensive collection of queries implemented using PostgreSQL and PostGIS as database backend, extended with the libraries provided by MobilityDB, a moving object database that extends the PostgreSQL database with temporal data types, allowing seamless integration with spatial and non-spatial data.
Abstract: The interest in mobility data analysis has grown dramatically with the wide availability of devices that track the position of moving objects. Mobility analysis can be applied, for example, to analyze traffic flows. To support mobility analysis, trajectory data warehousing techniques can be used. Trajectory data warehouses typically include, as measures, segments of trajectories, linked to spatial and non-spatial contextual dimensions. This paper goes beyond this concept, by including, as measures, the trajectories of moving objects at any point in time. In this way, online analytical processing (OLAP) queries, typically including aggregation, can be combined with moving object queries, to express queries like "List the total number of trucks running at less than 2 km from each other more than 50% of its route in the province of Antwerp" in a concise and elegant way. Existing proposals for trajectory data warehouses do not support queries like this, since they are based on either the segmentation of the trajectories, or a pre-aggregation of measures. The solution presented here is implemented using MobilityDB, a moving object database that extends the PostgreSQL database with temporal data types, allowing seamless integration with relational spatial and non-spatial data. This integration leads to the concept of mobility data warehouses. This paper discusses modeling and querying mobility data warehouses, providing a comprehensive collection of queries implemented using PostgreSQL and PostGIS as database backend, extended with the libraries provided by MobilityDB.

Journal ArticleDOI
TL;DR: The objectives of this review were to identify unique aspects of pediatrics that are relevant to the potential impact of data science on child health and provide examples of how common data science tasks are being used to facilitate clinical and biomedical child health research.

Journal ArticleDOI
TL;DR: A health system's high-throughput NLP architecture may serve as a benchmark for large-scale clinical research using a CUI-based approach; the authors also present a predictive-model use case for predicting 30-day hospital readmission.