Showing papers in "Distributed and Parallel Databases in 2018"

PDF

Open Access

Journal Article•DOI•

An adaptive multi-objective evolutionary algorithm for constrained workflow scheduling in Clouds

[...]

Miao Zhang¹, Huiqi Li¹, Li Liu², Rajkumar Buyya³•Institutions (3)

Beijing Institute of Technology¹, University of Science and Technology Beijing², University of Melbourne³

01 Jun 2018-Distributed and Parallel Databases

TL;DR: An adaptive individual-assessment scheme based on evolutionary states is proposed to handle the constraints in multi-objective optimization problems and it achieves better optimization ability when it is applied to solve Cloud workflow scheduling problem.

...read moreread less

Abstract: The Cloud workflow scheduling is to find proper Cloud resources for the execution of workflow tasks to efficiently utilize resources and meet different user’s quality of service requirements. Cloud workflow scheduling is a constrained and NP-complete problem and multi-objective evolutionary algorithms have shown their excellent ability to solve such problem. But most existing works simply use static penalty function to handle constraints which usually result in premature when the constraints become strict. On the other hand, with the search space being more tremendous and chaotic, how to balance the ability of exploring the entire search space and exploiting the important regions during the evolutionary process is increasingly important. In this paper, an adaptive individual-assessment scheme based on evolutionary states is proposed to handle the constraints in multi-objective optimization problems. In addition, the evolutionary parameters are also adjusted accordingly to balance the exploration and exploitation ability. These are distinguishable from most previous studies that directly incorporate multi-objective evolutionary algorithm to search excellent solutions for Cloud workflow scheduling. Experimental results demonstrate the proposed algorithm outperforms other state-of-the-art methods in convergence and diversity, and it also achieves better optimization ability when it is applied to solve Cloud workflow scheduling problem.

...read moreread less

28 citations

Journal Article•DOI•

Energy-aware task scheduling in mobile cloud computing

[...]

Chaogang Tang¹, Mingyang Hao¹, Xianglin Wei, Wei Chen¹•Institutions (1)

China University of Mining and Technology¹

18 Jun 2018-Distributed and Parallel Databases

TL;DR: This paper model the task scheduling problem at the end-user mobile device as an energy consumption optimization problem, while taking into account task dependency, data transmission and other constraint conditions such as task deadline and cost and presents several heuristic algorithms to solve it.

...read moreread less

Abstract: The limited energy supply, computing, storage and transmission capabilities of mobile devices pose a number of challenges for improving the quality of service (QoS) of various mobile applications, which has stimulated the emergence of many enhanced mobile computing paradigms, such as mobile cloud computing (MCC), fog computing, mobile edge computing (MEC), etc. The mobile devices need to partition mobile applications into related tasks and decide which tasks should be offloaded to remote computing facilities provided by cloud computing, fog nodes etc. It is very important yet tough to decide which tasks to be uploaded and where they are scheduled since this could greatly impact the applications’ timeliness and mobile devices’ lifetime. In this paper, we model the task scheduling problem at the end-user mobile device as an energy consumption optimization problem, while taking into account task dependency, data transmission and other constraint conditions such as task deadline and cost. We further present several heuristic algorithms to solve it. A series of simulation experiments are conducted to evaluate the performance of the algorithms and the results show that our proposed algorithms outperform the state-of-the-art algorithms in energy efficiency as well as response time.

...read moreread less

26 citations

Journal Article•DOI•

How to exploit high performance computing in population-based metaheuristics for solving association rule mining problem

[...]

Youcef Djenouri¹, Djamel Djenouri, Zineb Habbas², Asma Belhadi•Institutions (2)

University of Southern Denmark¹, Metz²

13 Jan 2018-Distributed and Parallel Databases

TL;DR: This is the first work that explores the combination of GPU and cluster-based parallel computing with the population-based metaheuristics in association rule mining and shows that the proposed solution outperforms the HPC-based ARM approaches when exploring Webdocs instance.

...read moreread less

Abstract: The application of population-based metaheuristics approaches to the association rules mining problem is explored in this paper. The combination of GPU and cluster-based parallel computing techniques is investigated for the purpose of accelerating the process of extracting the correlations between items in sizeable data instances. We propose four parallel-based approaches that benefit from the cluster intensive computing in the generation process and the massively GPU threading. This is by evaluating the association rules in parallel on GPU. To validate the proposed approaches, the most used population-based metaheuristics (GA, PSO, and BSO) have been executed on a cluster of GPUs to solve benchmarks of large and big ARM instances. We used Intel Xeon 64bit quad-core processor E5520 coupled to an Nvidia Tesla C2075 GPU device. The results show that the BSO outperforms GA and PSO. They also show that the proposed solution outperforms the HPC-based ARM approaches when exploring Webdocs instance (the largest instance existing on the web). To our knowledge, this is the first work that explores the combination of GPU and cluster-based parallel computing with the population-based metaheuristics in association rule mining.

...read moreread less

21 citations

Journal Article•DOI•

A statistically-based ontology matching tool

[...]

Peter Ochieng¹, Swaib Kyanda¹•Institutions (1)

Makerere University¹

01 Mar 2018-Distributed and Parallel Databases

TL;DR: This work explores the use of a predictive statistical model to establish an alignment between two input ontologies and demonstrates how to integrate ontology partitioning and parallelism in the ontology matching process in order to make the statistical predictive model scalable to large ontological matching tasks.

...read moreread less

Abstract: Ontologies have become a popular means of knowledge sharing and reuse. This has motivated development of large independent ontologies within the same or different domains with some overlapping information among them. In order to match such large ontologies, automatic matchers become an inevitable solution. This work explores the use of a predictive statistical model to establish an alignment between two input ontologies. We demonstrate how to integrate ontology partitioning and parallelism in the ontology matching process in order to make the statistical predictive model scalable to large ontology matching tasks. Unlike most ontology matching tools which establish 1:1 cardinality mappings, our statistical model generates one-to-many cardinality mappings.

...read moreread less

13 citations

Journal Article•DOI•

Multi-join query optimization in bucket-based encrypted databases using an enhanced ant colony optimization algorithm

[...]

Mahmoud Jafarinejad¹, Morteza Amini¹•Institutions (1)

Sharif University of Technology¹

01 Jun 2018-Distributed and Parallel Databases

TL;DR: An enhanced ant-colony algorithm (named BACO) is proposed which aims to reduce the required processing efforts in multi-join query optimization problem alongside with reducing the total false-positive results generated in Bucket-based encrypted databases.

...read moreread less

Abstract: One of the organizations' main concerns is to protect sensitive data in database systems, especially the ones outsourced to untrusted service providers. An effective solution for this issue is to employ database encryption methods. Among different encryption approaches, Bucket-based method has the advantage of balancing security and performance of database operations. However, generating false-positive results in executing queries is the main drawback of this method. On the other hand, multi-join queries are one of the most critical operations executed on these stored sensitive data. Hence, acceptable processing and response time in executing multi-join queries is required. In this paper, we propose an enhanced ant-colony algorithm (named BACO) which aims to reduce the required processing efforts in multi-join query optimization problem alongside with reducing the total false-positive results generated in Bucket-based encrypted databases. Our enhanced solution approach leads to much less response time without losing solutions' quality. Experimental results denote that our proposed solution can yield 75% decrease in multi-join queries processing efforts and 74% decrease in the total amount of false-positive results in a faster manner and with better performance than previous methods.

...read moreread less

12 citations

Journal Article•DOI•

An effective weighted rule-based method for entity resolution

[...]

Hiba Abu Ahmad¹, Hongzhi Wang¹•Institutions (1)

Harbin Institute of Technology¹

02 Aug 2018-Distributed and Parallel Databases

TL;DR: This paper defines a multi-attributes weighted rule system (MAWR) that investigates all values of records’ attributes in order to represent the difficult record-entity mapping and proposes a rule generation algorithm based on this system.

...read moreread less

Abstract: Entity resolution is an important task in data cleaning to detect records that belong to the same entity. It has a critical impact on digital libraries where different entities share the same name without any identifier key. Conventional methods adopt similarity measures and clustering techniques to reveal the records of a specific entity. Due to the lack of performance, recent methods build rules on records’ attributes with distinct values for entities to overcome some drawbacks. However, they use inadequate attributes and ignore common and empty attributes values which affect the quality of entity resolution. In this paper, we define a multi-attributes weighted rule system (MAWR) that investigates all values of records’ attributes in order to represent the difficult record-entity mapping. Then, we propose a rule generation algorithm based on this system. We also propose an entity resolution algorithm (MAWR-ER) depending on the generated rules to identify entities. We verify our method on real data, and the experimental results prove the effectiveness and efficiency of our proposed method.

...read moreread less

10 citations

Journal Article•DOI•

MetaStore: an adaptive metadata management framework for heterogeneous metadata models

[...]

Ajinkya Prabhune¹, Rainer Stotzka¹, Vaibhav Sakharkar¹, Jürgen Hesser², Michael Gertz² - Show less +1 more•Institutions (2)

Karlsruhe Institute of Technology¹, Heidelberg University²

01 Mar 2018-Distributed and Parallel Databases

TL;DR: MetaStore is an adaptive metadata management framework based on a NoSQL database and an RDF triple store that automatically segregates the different categories of metadata in their corresponding data models to maximize the utilization of the data models supported by NoSQL databases.

...read moreread less

Abstract: In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data, and the handling of associated metadata is critical, as it enables discovering, analyzing, reusing, and sharing of scientific data. Moreover, metadata produced by scientific experiments are heterogeneous and subject to frequent changes, demanding a flexible data model. Existing metadata management systems provide a broad range of features for handling scientific metadata. However, the principal limitation of these systems is their architecture design that is restricted towards either a single or at the most a few standard metadata models. Support for handling different types of metadata models, i.e., administrative, descriptive, structural, and provenance metadata, and including community-specific metadata models is not possible with these systems. To address this challenge, we present MetaStore, an adaptive metadata management framework based on a NoSQL database and an RDF triple store. MetaStore provides a set of core functionalities to handle heterogeneous metadata models by automatically generating the necessary software code (services) and on-the-fly extends the functionality of the framework. To handle dynamic metadata and to control metadata quality, MetaStore also provides an extended set of functionalities such as enabling annotation of images and text by integrating the Web Annotation Data Model, allowing communities to define discipline-specific vocabularies using Simple Knowledge Organization System, and providing advanced search and analytical capabilities by integrating the ElasticSearch. To maximize the utilization of the data models supported by NoSQL databases, MetaStore automatically segregates the different categories of metadata in their corresponding data models. Complex provenance graphs and dynamic metadata are modeled and stored in an RDF triple store, whereas the static metadata is stored in a NoSQL database. For enabling large-scale harvesting (sharing) of metadata using the METS standard over the OAI-PMH protocol, MetaStore is designed OAI-compliant. Finally, to show the practical usability of the MetaStore framework and that the requirements from the research communities have been realized, we describe our experience in the adoption of MetaStore for three communities.

...read moreread less

10 citations

Journal Article•DOI•

Distributed computing connected components with linear communication cost

[...]

Xing Feng¹, Lijun Chang², Xuemin Lin³, Lu Qin¹, Wenjie Zhang³, Long Yuan⁴ - Show less +2 more•Institutions (4)

University of Technology, Sydney¹, University of Sydney², University of New South Wales³, East China Normal University⁴

01 Sep 2018-Distributed and Parallel Databases

TL;DR: A new paradigm based on graph decomposition to compute CCs and BCCs with O(m) total communication cost is proposed, which can outperform the existing techniques by one order of magnitude regarding the total running time.

...read moreread less

Abstract: The paper studies three fundamental problems in graph analytics, computing connected components (CCs), biconnected components (BCCs), and 2-edge-connected components (ECCs) of a graph. With the recent advent of big data, developing efficient distributed algorithms for computing CCs, BCCs and ECCs of a big graph has received increasing interests. As with the existing research efforts, we focus on the Pregel programming model, while the techniques may be extended to other programming models including MapReduce and Spark. The state-of-the-art techniques for computing CCs and BCCs in Pregel incur $$O(m\times \#\text {supersteps})$$ total costs for both data communication and computation, where m is the number of edges in a graph and #supersteps is the number of supersteps. Since the network communication speed is usually much slower than the computation speed, communication costs are the dominant costs of the total running time in the existing techniques. In this paper, we propose a new paradigm based on graph decomposition to compute CCs and BCCs with O(m) total communication cost. The total computation costs of our techniques are also smaller than that of the existing techniques in practice, though theoretically almost the same. Moreover, we also study distributed computing ECCs. We are the first to study this problem and an approach with O(m) total communication cost is proposed. Comprehensive empirical studies demonstrate that our approaches can outperform the existing techniques by one order of magnitude regarding the total running time.

...read moreread less

10 citations

Journal Article•DOI•

Crowd enabled curation and querying of large and noisy text mined protein interaction data

[...]

Hasan M. Jamil¹, Fereidoon Sadri²•Institutions (2)

University of Idaho¹, University of North Carolina at Greensboro²

01 Mar 2018-Distributed and Parallel Databases

TL;DR: This paper presents a novel approach to annotation and curation of biological database contents using crowd computing that is designed for literature mined protein-protein interaction data curation and is amenable to substantial generalization.

...read moreread less

Abstract: The abundance of mined, predicted and uncertain biological data warrant massive, efficient and scalable curation efforts. The human expertise required for any successful curation enterprise is often economically prohibitive, especially for speculative end user queries that ultimately may not bear fruit. So the challenge remains in devising a low cost engine capable of delivering fast but tentative annotation and curation of a set of data items that can later be authoritatively validated by experts demanding significantly smaller investment. The aim thus is to make a large volume of predicted data available for use as early as possible with an acceptable degree of confidence in their accuracy while the curation continues. In this paper, we present a novel approach to annotation and curation of biological database contents using crowd computing. The technical contribution is in the identification and management of trust of mechanical turks, and support for ad hoc declarative queries, both of which are leveraged to enable reliable analytics using noisy predicted interactions. While the proposed approach and the CrowdCure system are designed for literature mined protein-protein interaction data curation, they are amenable to substantial generalization.

...read moreread less

10 citations

Journal Article•DOI•

P-PIF: a ProvONE provenance interoperability framework for analyzing heterogeneous workflow specifications and provenance traces

[...]

Ajinkya Prabhune¹, Aaron Zweig², Rainer Stotzka¹, Jürgen Hesser³, Michael Gertz³ - Show less +1 more•Institutions (3)

Karlsruhe Institute of Technology¹, Stanford University², Heidelberg University³

01 Mar 2018-Distributed and Parallel Databases

TL;DR: A ProvONE-based Provenance Interoperability Framework that completely automates the modeling of provenance from heterogeneous WfMSs by automatically translating the scientific workflows to their equivalent representation in a ProvONE prospective graph using the Prov2ONE algorithm.

...read moreread less

Abstract: Enabling provenance interoperability by analyzing heterogeneous provenance information from different scientific workflow management systems is a novel research topic. With the advent of the ProvONE model, it is now possible to model both the prospective as well as the retrospective provenance in a single provenance model. Scientific workflows are composed using a declarative definition language, such as BPEL, SCUFL/t2flow, or MoML. Associated with the execution of a workflow is its corresponding provenance that is modeled and stored in the data model specified by the workflow system. However, sharing of provenance generated by heterogeneous workflows is a challenging task and prevents the aggregate analysis and comparison of workflows and their associated provenance. To address these challenges, this paper introduces a ProvONE-based Provenance Interoperability Framework that completely automates the modeling of provenance from heterogeneous WfMSs by: (a) automatically translating the scientific workflows to their equivalent representation in a ProvONE prospective graph using the Prov2ONE algorithm, (b) enriching the ProvONE prospective graph with the retrospective provenance exported by the WfMSs, and (c) native support for storing the ProvONE provenance graphs in a Resource Description Framework triplestore that supports the SPARQL query language for querying and retrieving ProvONE graphs. The Prov2ONE algorithm is based on a set of vocabulary translation rules between workflow specifications and the ProvONE model. The correctness and completeness proof of the algorithm is shown and its complexity is analyzed. Moreover, to demonstrate the practical applicability of the complete framework, ProvONE graphs for workflows defined in BPEL, SCUFL, and MoML are generated. Finally, the provenance challenge queries are extended with six additional queries for retrieving the provenance modeled in ProvONE.

...read moreread less

8 citations

Journal Article•DOI•

Web-scale provenance reconstruction of implicit information diffusion on social media

[...]

Io Taxidou¹, Sven Lieber², Peter M. Fischer¹, Tom De Nies², Ruben Verborgh² - Show less +1 more•Institutions (2)

University of Freiburg¹, Ghent University²

01 Mar 2018-Distributed and Parallel Databases

TL;DR: The mechanisms of implicit information diffusion are investigated by computing its fine-grained provenance by proving that explicit mechanisms are insufficient to capture influence and unravels a significant part of implicit interactions and influence in social media.

...read moreread less

Abstract: Fast, massive, and viral data diffused on social media affects a large share of the online population, and thus, the (prospective) information diffusion mechanisms behind it are of great interest to researchers. The (retrospective) provenance of such data is equally important because it contributes to the understanding of the relevance and trustworthiness of the information. Furthermore, computing provenance in a timely way is crucial for particular use cases and practitioners, such as online journalists that promptly need to assess specific pieces of information. Social media currently provide insufficient mechanisms for provenance tracking, publication and generation, while state-of-the-art on social media research focuses mainly on explicit diffusion mechanisms (like retweets in Twitter or reshares in Facebook).The implicit diffusion mechanisms remain understudied due to the difficulties of being captured and properly understood. From a technical side, the state of the art for provenance reconstruction evaluates small datasets after the fact, sidestepping requirements for scale and speed of current social media data. In this paper, we investigate the mechanisms of implicit information diffusion by computing its fine-grained provenance. We prove that explicit mechanisms are insufficient to capture influence and our analysis unravels a significant part of implicit interactions and influence in social media. Our approach works incrementally and can be scaled up to cover a truly Web-scale scenario like major events. We can process datasets consisting of up to several millions of messages on a single machine at rates that cover bursty behaviour, without compromising result quality. By doing that, we provide to online journalists and social media users in general, fine grained provenance reconstruction which sheds lights on implicit interactions not captured by social media providers. These results are provided in an online fashion which also allows for fast relevance and trustworthiness assessment.

...read moreread less

Journal Article•DOI•

Information flow control on encrypted data for service composition among multiple clouds

[...]

Ning Xi¹, Jianfeng Ma¹, Cong Sun¹, Di Lu¹, Yulong Shen¹ - Show less +1 more•Institutions (1)

Xidian University¹

01 Jun 2018-Distributed and Parallel Databases

TL;DR: This work defines a new type of flow called the encryption flow to describe the dependence relationship among different encrypted data objects across multiple services and proposes the secure information flow verification theorem, which provides a more effective way compared with centralized verification approaches.

...read moreread less

Abstract: Homomorphic encryption allows the direct operations on encrypted data, which provides a promising way to protect outsourcing data in clouds. However, it can not guarantee the end-to-end data security if different cloud services are composed together. Especially for the operations on encrypted data, it may violate the standard noninterference, which can not be solved by traditional information flow control approaches. In order to analyze the information flow with encrypted data, we define a new type of flow called the encryption flow to describe the dependence relationship among different encrypted data objects across multiple services. Based on the new definition on encrypted flow, we propose the secure information flow verification theorem and specify the improved security constraints on each service component. Then a distributed information flow control framework and algorithm are designed for verification on regular and encrypted flow across multiple clouds. Through the experiments, we can obtain that our approach is more appropriate for the verification across multiple clouds and provides a more effective way compared with centralized verification approaches.

...read moreread less

Journal Article•DOI•

A K-way spectral partitioning of an ontology for ontology matching

[...]

Peter Ochieng¹, Swaib Kyanda¹•Institutions (1)

Makerere University¹

22 Mar 2018-Distributed and Parallel Databases

TL;DR: In this paper, it is demonstrated that spectral partitioning of an ontology can generate high quality partitions geared towards ontology matching.

...read moreread less

Abstract: Ontology matching, the process of resolving heterogeneity between two ontologies consumes a lot of computing memory and time. This problem is exacerbated in large ontology matching tasks. To address the problem of time and space complexity in the matching process, ontology partitioning has been adopted as one of the methods, however, most ontology partitioning algorithms either produce incomplete partitions or are slow in the partitioning process hence eroding the benefits of the partitioning. In this paper, we demonstrate that spectral partitioning of an ontology can generate high quality partitions geared towards ontology matching.

...read moreread less

Journal Article•DOI•

Flexible partitioning for selective binary theta-joins in a massively parallel setting

[...]

Ioannis Koumarelas¹, Athanasios Naskos², Anastasios Gounaris²•Institutions (2)

Hasso Plattner Institute¹, Aristotle University of Thessaloniki²

30 Apr 2018-Distributed and Parallel Databases

TL;DR: This work focuses on generic theta joins in a massively parallel environment, such as MapReduce and Spark, and proposes an ensemble-based partitioning approach that tackles all three aspects: communication cost, memory and computation limitations of reducers, and the total execution time.

...read moreread less

Abstract: Efficient join processing plays an important role in big data analysis. In this work, we focus on generic theta joins in a massively parallel environment, such as MapReduce and Spark. Theta joins are notoriously slow due to their inherent quadratic complexity, even when their selectivity is low, e.g., 1%. The main performance bottleneck differs between cases, and is due to any of the following factors or their combination: amount of data being shuffled, memory load on reducers, or computation load on reducers. We propose an ensemble-based partitioning approach that tackles all three aspects. In this way, we can save communication cost, we better respect the memory and computation limitations of reducers and overall, we reduce the total execution time. The key idea behind our partitioning is to cluster join key values following two techniques, namely matrix re-arrangement and agglomerative clustering. These techniques can run either in isolation or in combination. We present thorough experimental results using both band queries on real data and arbitrary synthetic predicates. We show that we can save up to 45% of the communication cost and reduce the computation load of a single reducer up to 50% in band queries, whereas the savings are up to 74 and 80%, respectively, in queries with arbitrary theta predicates. Apart from being effective, the potential benefits of our approach can be estimated before execution from metadata, which allows for informed partitioning decisions. Finally, our solutions are flexible in that they can account for any weighted combination of the three bottleneck factors.

...read moreread less

Journal Article•DOI•

Executable schema mappings for statistical data processing

[...]

Paolo Atzeni¹, Luigi Bellomarini¹, Francesca Bugiotti², Marco De Leonardis³•Institutions (3)

Roma Tre University¹, Université Paris-Saclay², Hewlett-Packard³

01 Jun 2018-Distributed and Parallel Databases

TL;DR: This paper illustrates here a proposal where a high-level language, EXL, is used for the declarative specification of statistical programs, and a translation into executable form in various target systems is available.

...read moreread less

Abstract: Data processing is the core of any statistical information system. Statisticians are interested in specifying transformations and manipulations of data at a high level, in terms of entities of statistical models. We illustrate here a proposal where a high-level language, EXL, is used for the declarative specification of statistical programs, and a translation into executable form in various target systems is available. The language is based on the theory of schema mappings, in particular those defined by a specific class of tgds, which we actually use to optimize user programs and facilitate the translation towards several target systems. The characteristics of such class guarantee good tractability properties and the applicability in Big Data settings. A concrete implementation, EXLEngine, has been carried out and is currently used at the Bank of Italy.

...read moreread less

Journal Article•DOI•

Scalable datacenter multicast using in-packet bitmaps

[...]

Kun Huang¹, Xin Su²•Institutions (2)

Chinese Academy of Sciences¹, Hunan Police Academy²

03 May 2018-Distributed and Parallel Databases

TL;DR: This paper uses a bitmap to encode switch ports of a multicast tree in the packet header, eliminating false positive forwarding and uses a clustered Golomb coding method to compress in-packet bitmaps for further reducing the bandwidth overhead.

...read moreread less

Abstract: Scalability is a key issue in datacenter multicast as it needs to support a large number of groups in commodity switches with limited fast memory. Previous in-packet Bloom filter-based datacenter multicast schemes have been proposed to address the scalability issue. They encode a multicast tree in a Bloom filter carried in each packet. However, these schemes induce high bandwidth overhead due to the false positives inherent in Bloom filters, and cannot scale well to the increasing variety of group sizes. In this paper, we propose an in-packet bitmap-based approach towards scalable datacenter multicast, improving the bandwidth efficiency. We use a bitmap to encode switch ports of a multicast tree in the packet header, eliminating false positive forwarding. In addition, we use a clustered Golomb coding method to compress in-packet bitmaps for further reducing the bandwidth overhead. Experimental results on simulations and a Click-based switch prototype demonstrate that our scheme achieves up to several orders of magnitude reductions in bandwidth overhead compared to previous schemes.

...read moreread less

Journal Article•DOI•

AUDIT: approving and tracking updates with dependencies in collaborative databases

[...]

Khaleel Mershad¹, Qutaibah M. Malluhi², Mourad Ouzzani³, Mingjie Tang⁴, Michael Gribskov⁴, Walid G. Aref⁴ - Show less +2 more•Institutions (4)

Arts, Sciences and Technology University in Lebanon¹, Qatar University², Qatar Computing Research Institute³, Purdue University⁴

01 Mar 2018-Distributed and Parallel Databases

TL;DR: This paper presents a scalable cloud-based collaborative database system to support collaboration and data curation scenarios based on an Update Pending Approval model and fully realized the system inside HBase, a cloud- based platform.

...read moreread less

Abstract: Collaborative databases such as genome databases, often involve extensive curation activities where collaborators need to interact to be able to converge and agree on the content of data. In a typical scenario, a member of the collaboration makes some updates and these become visible to all collaborators for possible comments and modifications. At the same time, these updates are usually pending the approval or rejection from the data custodian based on the related discussion and the content of the data. Unfortunately, the approval and authorization of updates in current databases is based solely on the identity of the user, e.g., via the SQL GRANT and REVOKE commands. In this paper, we present a scalable cloud-based collaborative database system to support collaboration and data curation scenarios. Our system is based on an Update Pending Approval model. In a nutshell, when a collaborator updates a given data item, it is marked as pending approval until the data custodian approves or rejects the update. Until then, any other collaborator can view and comment on the data, pending its approval. We fully realized our system inside HBase, a cloud-based platform. We also conducted extensive experiments showing that the system scales well under different workloads.

...read moreread less

Journal Article•DOI•

COACT: a query interface language for collaborative databases

[...]

Khaleel Mershad¹, Qutaibah M. Malluhi², Mourad Ouzzani³, Mingjie Tang⁴, Michael Gribskov⁴, Walid G. Aref⁴, Deo Prakash⁵ - Show less +3 more•Institutions (5)

Arts, Sciences and Technology University in Lebanon¹, Qatar University², Qatar Computing Research Institute³, Purdue University⁴, Shri Mata Vaishno Devi University⁵

01 Mar 2018-Distributed and Parallel Databases

TL;DR: A high level SQL query interface language for PIs and collaborators to interact with the UPA framework, which defines a set of UPA keywords that are used as a part of the history tracking mechanism to select specific versions of a data item, and a setof UPA options that select specific version based on possible future decisions of PIs.

...read moreread less

Abstract: Data curation activities in collaborative databases mandate that collaborators interact until they converge and agree on the content of their data. In a previous work, we presented a cloud-based collaborative database system that promotes and enables collaboration and data curation scenarios. Our system classifies different versions of a data item to either pending, approved, or rejected. The approval or rejection of a certain version is done by the database Principle Investigators (or PIs) based on its value. Our system also allows collaborators to view the status of each version and help PIs take decisions by providing feedback based on their experiments and/or opinions. Most importantly, our system provided mechanisms for history tracking of different versions to trace the modifications and approval/rejection done by both collaborators and PIs on different versions of a data item. We labeled our system as Update-Pending-Approval model (or UPA). In this paper, we describe a high level SQL query interface language for PIs and collaborators to interact with the UPA framework. We define a set of UPA keywords that are used as a part of the history tracking mechanism to select specific versions of a data item, and a set of UPA options that select specific versions based on possible future decisions of PIs. We implemented our query interface mechanism on top of Apache Phoenix, taking into consideration that the UPA system was implemented on top of Apache HBase. We test the performance of the UPA query language by executing several queries that contain different complexity levels and discuss their results.

...read moreread less

Journal Article•DOI•

Monitoring distributed fragmented skylines

[...]

Odysseas Papapetrou¹, Minos Garofalakis²•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, Technical University of Crete²

27 Mar 2018-Distributed and Parallel Databases

TL;DR: This work presents the first known distributed algorithms for continuous monitoring of skylines over complex functions of fragmented multi-dimensional objects, and proposes several optimizations, including a technique for adaptively determining the most efficient monitoring strategy for each object.

...read moreread less

Abstract: Distributed skyline computation is important for a wide range of domains, from distributed and web-based systems to ISP-network monitoring and distributed databases. The problem is particularly challenging in dynamic distributed settings, where the goal is to efficiently monitor a continuous skyline query over a collection of distributed streams. All existing work relies on the assumption of a single point of reference for object attributes/dimensions: objects may be vertically or horizontally partitioned, but the accurate value of each dimension for each object is always maintained by a single site. This assumption is unrealistic for several distributed applications, where object information is fragmented over a set of distributed streams (each monitored by a different site) and needs to be aggregated (e.g., averaged) across several sites. Furthermore, it is frequently useful to define skyline dimensions through complex functions over the aggregated objects, which raises further challenges for dealing with distribution and object fragmentation. We present the first known distributed algorithms for continuous monitoring of skylines over complex functions of fragmented multi-dimensional objects. Our algorithms rely on decomposition of the skyline monitoring problem to a select set of distributed threshold-crossing queries, which can be monitored locally at each site. We propose several optimizations, including: (a) a technique for adaptively determining the most efficient monitoring strategy for each object, (b) an approximate monitoring technique, and (c) a strategy that reduces communication overhead by grouping together threshold-crossing queries. Furthermore, we discuss how our proposed algorithms can be used to address other continuous query types. A thorough experimental study with synthetic and real-life data sets verifies the effectiveness of our schemes and demonstrates order-of-magnitude improvements in communication costs compared to the only alternative centralized solution.

...read moreread less

Journal Article•DOI•

An accurate estimation algorithm for big data streams

[...]

Qin Xin¹, Jianping Wu¹•Institutions (1)

Tsinghua University¹

23 May 2018-Distributed and Parallel Databases

TL;DR: This paper proposes a new sketch, which can significantly improve the insertion speed while improving the accuracy, and extensive experimental results show that this sketch significantly outperforms the state-of-the-art both in terms of accuracy and speed.

...read moreread less

Abstract: Sketch is a memory-efficient data structure, and is used to store and query the frequency of any item in a given multiset. As it can achieve fast query and update, it has been applied to various fields. Different sketches have different advantages and disadvantages. Sketches are originally proposed for estimation of flow size in network measurement. The key factor of sketches for network measurement is the insertion speed and accuracy. In this paper, we propose a new sketch, which can significantly improve the insertion speed while improving the accuracy. Our key methods include on-chip/off-chip separation and partial update algorithm. Extensive experimental results show that our sketch significantly outperforms the state-of-the-art both in terms of accuracy and speed.

...read moreread less

Journal Article•DOI•

A fast and low idle time method for mining frequent patterns in distributed and many-task computing environments

[...]

Chun-Cheng Lin¹, Sheng Hao Chung¹, Ju-Chin Chen², Yuan Tse Yu³, Kawuu W. Lin² - Show less +1 more•Institutions (3)

National Chiao Tung University¹, National Kaohsiung First University of Science and Technology², National Kaohsiung Normal University³

01 Dec 2018-Distributed and Parallel Databases

TL;DR: A method that can begin mining when a small part of an FP-tree is received is proposed, and the idle time of computing nodes can be reduced, and thus, the time required for mining can be reduction effectively.

...read moreread less

Abstract: Association rules mining has attracted much attention among data mining topics because it has been successfully applied in various fields to find the association between purchased items by identifying frequent patterns (FPs). Currently, databases are huge, ranging in size from terabytes to petabytes. Although past studies can effectively discover FPs to deduce association rules, the execution efficiency is still a critical problem, particularly for big data. Progressive size working set (PSWS) and parallel FP-growth (PFP) are state-of-the-art methods that have been applied successfully to parallel and distributed computing technology to improve mining processing time in many-task computing, thereby bridging the gap between high-throughput and high-performance computing. However, such methods cannot mine before obtaining a complete FP-tree or the corresponding subdatabase, causing a high idle time for computing nodes. We propose a method that can begin mining when a small part of an FP-tree is received. The idle time of computing nodes can be reduced, and thus, the time required for mining can be reduced effectively. Through an empirical evaluation, the proposed method is shown to be faster than PSWS and PFP.

...read moreread less

Journal Article•DOI•

Online multi-view subspace learning via group structure analysis for visual object tracking

[...]

Wanqi Yang¹, Wanqi Yang², Yinghuan Shi², Yang Gao², Ming Yang¹ - Show less +1 more•Institutions (2)

Nanjing Normal University¹, Nanjing University²

25 May 2018-Distributed and Parallel Databases

TL;DR: A novel online multi-view subspace learning algorithm (OMEL) via group structure analysis, which consistently learns a low-dimensional representation shared across views with time changing for visual object tracking is proposed.

...read moreread less

Abstract: In this paper, we focus on incrementally learning a robust multi-view subspace representation for visual object tracking. During the tracking process, due to the dynamic background variation and target appearance changing, it is challenging to learn an informative feature representation of tracking object, distinguished from the dynamic background. To this end, we propose a novel online multi-view subspace learning algorithm (OMEL) via group structure analysis, which consistently learns a low-dimensional representation shared across views with time changing. In particular, both group sparsity and group interval constraints are incorporated to preserve the group structure in the low-dimensional subspace, and our subspace learning model will be incrementally updated to prevent repetitive computation of previous data. We extensively evaluate our proposed OMEL on multiple benchmark video tracking sequences, by comparing with six related tracking algorithms. Experimental results show that OMEL is robust and effective to learn dynamic subspace representation for online object tracking problems. Moreover, several evaluation tests are additionally conducted to validate the efficacy of group structure assumption.

...read moreread less