
Showing papers on "Distributed database published in 2014"


Proceedings ArticleDOI
19 May 2014
TL;DR: Wang et al. as discussed by the authors proposed a secure kNN protocol that protects the confidentiality of the data, user's input query, and data access patterns, and empirically analyzed the efficiency of their protocols through various experiments.
Abstract: For the past decade, query processing on relational data has been studied extensively, and many theoretical and practical solutions to query processing have been proposed under various scenarios. With the recent popularity of cloud computing, users now have the opportunity to outsource their data as well as the data management tasks to the cloud. However, due to the rise of various privacy issues, sensitive data (e.g., medical records) need to be encrypted before outsourcing to the cloud. In addition, query processing tasks should be handled by the cloud; otherwise, there would be no point in outsourcing the data in the first place. Processing queries over encrypted data without the cloud ever decrypting the data is a very challenging task. In this paper, we focus on solving the k-nearest neighbor (kNN) query problem over an encrypted database outsourced to a cloud: a user issues an encrypted query record to the cloud, and the cloud returns the k closest records to the user. We first present a basic scheme and demonstrate that such a naive solution is not secure. To provide better security, we propose a secure kNN protocol that protects the confidentiality of the data, the user's input query, and data access patterns. Also, we empirically analyze the efficiency of our protocols through various experiments. These results indicate that our secure protocol is very efficient on the user end, and this lightweight scheme allows a user to use any mobile device to perform the kNN query.

285 citations
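The computation the protocol above must carry out securely is ordinary k-nearest-neighbor search. The following is a minimal plaintext sketch of that computation for reference only; the paper's contribution lies in performing these steps over encrypted records via secure subprotocols, which are not reproduced here, and all names in the sketch are illustrative.

```python
# A minimal plaintext sketch of the kNN computation that the secure protocol above
# must reproduce over encrypted records. The actual scheme replaces these arithmetic
# steps with secure two-party subprotocols so the cloud never sees the records, the
# query, or which records are returned.
import heapq


def knn_plaintext(records, query, k):
    """Return the k records closest to `query` under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return heapq.nsmallest(k, records, key=lambda r: sq_dist(r, query))


if __name__ == "__main__":
    db = [(1.0, 2.0), (4.0, 4.0), (0.5, 1.5), (9.0, 9.0)]
    print(knn_plaintext(db, query=(1.0, 1.0), k=2))
```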


Proceedings ArticleDOI
19 May 2014
TL;DR: DICE is introduced, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies for specified accuracy levels.
Abstract: Interactive ad-hoc analytics over large datasets has become an increasingly popular use case. We detail the challenges encountered when building a distributed system that allows the interactive exploration of a data cube. We introduce DICE, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies for specified accuracy levels. A novel framework is provided that combines three concepts: faceted exploration of data cubes, speculative execution of queries and query execution over subsets of data. We discuss design considerations, implementation details and optimizations of our system. Experiments demonstrate that DICE provides a sub-second interactive cube exploration experience at the billion-tuple scale that is at least 33% faster than current approaches.

149 citations


Journal ArticleDOI
TL;DR: A prototype water resource management IIS is developed which integrates geoinformatics, EIS, and cloud services, and a novel approach to information management is proposed that allows any participant to act as a sensor as well as a contributor to the information warehouse.
Abstract: Water scarcity and floods are the major challenges for human society both present and future. Effective and scientific management of water resources requires a good understanding of water cycles, and a systematic integration of observations can lead to better prediction results. This paper presents an integrated approach to water resource management based on geoinformatics including technologies such as Remote Sensing (RS), Geographical Information Systems (GIS), Global Positioning Systems (GPS), Enterprise Information Systems (EIS), and cloud services. The paper introduces a prototype IIS called Water Resource Management Enterprise Information System (WRMEIS) that integrates functions such as data acquisition, data management and sharing, modeling, and knowledge management. A system called SFFEIS (Snowmelt Flood Forecasting Enterprise Information System) based on the WRMEIS structure has been implemented. It includes an operational database, Extraction-Transformation-Loading (ETL), an information warehouse, temporal and spatial analysis, simulation/prediction models, knowledge management, and other functions. In this study, a prototype water resource management IIS is developed which integrates geoinformatics, EIS, and cloud services. It also proposes a novel approach to information management that allows any participant to act as a sensor as well as a contributor to the information warehouse. Both users and the public contribute data and knowledge. This study highlights the crucial importance of a systematic approach toward IISs for effective resource and environment management.

124 citations


Journal ArticleDOI
TL;DR: This work exploits the fact that RDTs can naturally fit into a parallel and fully distributed architecture, and develops protocols to implement privacy-preserving RDTs that enable general and efficient distributed privacy-preserving knowledge discovery.
Abstract: Distributed data is ubiquitous in modern information driven applications. With multiple sources of data, the natural challenge is to determine how to collaborate effectively across proprietary organizational boundaries while maximizing the utility of collected information. Since using only local data gives suboptimal utility, techniques for privacy-preserving collaborative knowledge discovery must be developed. Existing cryptography-based work for privacy-preserving data mining is still too slow to be effective for the large-scale data sets of today's big data challenge. Previous work on random decision trees (RDT) shows that it is possible to generate equivalent and accurate models at much smaller cost. We exploit the fact that RDTs can naturally fit into a parallel and fully distributed architecture, and develop protocols to implement privacy-preserving RDTs that enable general and efficient distributed privacy-preserving knowledge discovery.

115 citations
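A minimal sketch of the random decision tree idea the abstract builds on: because the tree structure is chosen at random, independently of the data, each party can populate leaf class counts from its own records, which is what makes RDTs amenable to a distributed, privacy-preserving setting. The secure aggregation protocols that are the paper's actual contribution are not modeled here, and attribute values are assumed binary for brevity.

```python
# Sketch of a random decision tree (RDT): the structure is generated at random,
# independent of the data, so each party could add its local class counts to the
# leaves without sharing raw records. Secure count aggregation is omitted.
import random
from collections import Counter


def build_random_tree(attributes, depth):
    """Recursively pick random split attributes; leaves hold class counters."""
    if depth == 0 or not attributes:
        return {"leaf": Counter()}
    attr = random.choice(attributes)
    rest = [a for a in attributes if a != attr]
    return {"attr": attr,
            "children": {v: build_random_tree(rest, depth - 1) for v in (0, 1)}}


def add_counts(tree, record, label):
    """Route one local record down the tree and update the leaf's class counts."""
    while "leaf" not in tree:
        tree = tree["children"][record[tree["attr"]]]
    tree["leaf"][label] += 1


def predict(tree, record):
    while "leaf" not in tree:
        tree = tree["children"][record[tree["attr"]]]
    counts = tree["leaf"]
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    tree = build_random_tree(attributes=["a", "b", "c"], depth=2)
    local_data = [({"a": 0, "b": 1, "c": 0}, "x"), ({"a": 1, "b": 0, "c": 1}, "y")]
    for rec, lab in local_data:    # each party would do this on its own data
        add_counts(tree, rec, lab)
    print(predict(tree, {"a": 0, "b": 1, "c": 0}))
```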


Journal ArticleDOI
TL;DR: This comment points out several mathematical errors in the proof of Theorem 3, and gives the correct expression of B3.
Abstract: This paper considers a stochastic optimization approach for job scheduling and server management in large-scale, geographically distributed data centers. Randomly arriving jobs are routed to a choice of servers. The number of active servers depends on server activation decisions that are updated at a slow time scale, and the service rates of the servers are controlled by power scaling decisions that are made at a faster time scale. We develop a two-time-scale decision strategy that offers provable power cost and delay guarantees. The performance and robustness of the approach are illustrated through simulations.

104 citations


Journal ArticleDOI
Tamir Tassa1
TL;DR: This paper proposes a protocol for secure mining of association rules in horizontally distributed databases, built on the Fast Distributed Mining (FDM) algorithm of Cheung et al., which is an unsecured distributed version of the Apriori algorithm.
Abstract: We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al., which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms: one that computes the union of private subsets that each of the interacting players holds, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to that earlier protocol. In addition, it is simpler and is significantly more efficient in terms of communication rounds, communication cost and computational cost.

103 citations


Journal ArticleDOI
01 Aug 2014
TL;DR: This work reproduces performance and scalability benchmarking experiments of HBase and Cassandra that have been conducted by previous research and compares the results.
Abstract: Distributed database system performance benchmarks are an important source of information for decision makers who must select the right technology for their data management problems. Since important decisions rely on trustworthy experimental data, it is necessary to reproduce experiments and verify the results. We reproduce performance and scalability benchmarking experiments of HBase and Cassandra that have been conducted by previous research and compare the results. The scope of our reproduced experiments is extended with a performance evaluation of Cassandra on different Amazon EC2 infrastructure configurations, and an evaluation of Cassandra and HBase elasticity by measuring scaling speed and performance impact while scaling.

99 citations


Book ChapterDOI
16 Dec 2014
TL;DR: This work proposes a hybrid logical clock, HLC, that combines the best of logical clocks and physical clocks, and shows that HLC has many benefits for wait-free transaction ordering and performing snapshot reads in multiversion globally distributed databases.
Abstract: There is a gap between the theory and practice of distributed systems in terms of the use of time. The theory of distributed systems shunned the notion of time, and introduced “causality tracking” as a clean abstraction to reason about concurrency. The practical systems employed physical time (NTP) information but in a best effort manner due to the difficulty of achieving tight clock synchronization. In an effort to bridge this gap and reconcile the theory and practice of distributed systems on the topic of time, we propose a hybrid logical clock, HLC, that combines the best of logical clocks and physical clocks. HLC captures the causality relationship like logical clocks, and enables easy identification of consistent snapshots in distributed systems. Dually, HLC can be used in lieu of physical/NTP clocks since it maintains its logical clock to be always close to the NTP clock. Moreover, HLC fits into the 64-bit NTP timestamp format, and is masking tolerant to NTP kinks and uncertainties. We show that HLC has many benefits for wait-free transaction ordering and performing snapshot reads in multiversion globally distributed databases.

94 citations
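The update rules of the hybrid logical clock can be summarized in a few lines. The sketch below follows the send/receive rules described in the paper, with physical_time() standing in for the local NTP-synchronized clock; packing (l, c) into the 64-bit NTP timestamp format is omitted.

```python
# Sketch of the hybrid logical clock (HLC) update rules: l tracks the largest
# physical time seen so far, and c is a bounded counter used to capture causality
# when physical timestamps tie. physical_time() stands in for the local NTP clock.
import time


def physical_time():
    return time.time_ns()


class HLC:
    def __init__(self):
        self.l = 0   # largest physical time observed so far
        self.c = 0   # logical counter for events sharing the same l

    def now(self):
        """Timestamp a local or send event."""
        prev_l = self.l
        self.l = max(prev_l, physical_time())
        self.c = self.c + 1 if self.l == prev_l else 0
        return (self.l, self.c)

    def update(self, m_l, m_c):
        """Merge the timestamp (m_l, m_c) carried on a received message."""
        prev_l = self.l
        self.l = max(prev_l, m_l, physical_time())
        if self.l == prev_l and self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif self.l == prev_l:
            self.c += 1
        elif self.l == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        return (self.l, self.c)


if __name__ == "__main__":
    a, b = HLC(), HLC()
    ts = a.now()            # event on node a
    print(b.update(*ts))    # node b receives a's message and merges the timestamp
```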


Journal ArticleDOI
TL;DR: This is the first solution supporting geographically distributed clients to connect directly to an encrypted cloud database, and to execute concurrent and independent operations including those modifying the database structure.
Abstract: Placing critical data in the hands of a cloud provider should come with the guarantee of security and availability for data at rest, in motion, and in use. Several alternatives exist for storage services, while data confidentiality solutions for the database as a service paradigm are still immature. We propose a novel architecture that integrates cloud database services with data confidentiality and the possibility of executing concurrent operations on encrypted data. This is the first solution supporting geographically distributed clients to connect directly to an encrypted cloud database, and to execute concurrent and independent operations including those modifying the database structure. The proposed architecture has the further advantage of eliminating intermediate proxies that limit the elasticity, availability, and scalability properties that are intrinsic in cloud-based solutions. The efficacy of the proposed architecture is evaluated through theoretical analyses and extensive experimental results based on a prototype implementation subject to the TPC-C standard benchmark for different numbers of clients and network latencies.

88 citations


Proceedings ArticleDOI
01 Oct 2014
TL;DR: MapReduce is a minimization technique which makes use of file indexing with mapping, sorting, shuffling and finally reducing, and is implemented here for Big Data analysis using HDFS.
Abstract: We live in an on-demand, on-command digital universe with data proliferating from institutions, individuals and machines at a very high rate. This data is categorized as "Big Data" due to its sheer Volume, Variety and Velocity. Most of this data is unstructured, quasi-structured or semi-structured, and it is heterogeneous in nature. The volume and heterogeneity of the data, together with the speed at which it is generated, make it difficult for the present computing infrastructure to manage Big Data. Traditional data management, warehousing and analysis systems fall short of tools to analyze this data. Due to the specific nature of Big Data, it is stored in distributed file system architectures. Hadoop and HDFS by Apache are widely used for storing and managing Big Data. Analyzing Big Data is a challenging task as it involves large distributed file systems which should be fault tolerant, flexible and scalable. MapReduce is widely used for the efficient analysis of Big Data. Traditional DBMS techniques like joins and indexing, and other techniques like graph search, are used for classification and clustering of Big Data. These techniques are being adapted for use in MapReduce. In this paper we suggest various methods for catering to the problems at hand through the MapReduce framework over the Hadoop Distributed File System (HDFS). MapReduce is a minimization technique which makes use of file indexing with mapping, sorting, shuffling and finally reducing. MapReduce techniques are studied in this paper and implemented for Big Data analysis using HDFS.

83 citations
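As a reference for the map, shuffle/sort, and reduce phases the abstract refers to, here is a minimal single-process word-count illustration; on Hadoop these phases run distributed over HDFS blocks, but the dataflow is the same.

```python
# Minimal in-process illustration of the map, shuffle/sort, and reduce phases.
from collections import defaultdict


def map_phase(documents):
    """Map: emit (word, 1) pairs."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle/sort: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum counts per word."""
    return {word: sum(values) for word, values in groups.items()}


if __name__ == "__main__":
    docs = ["big data needs distributed storage", "big data needs MapReduce"]
    print(reduce_phase(shuffle(map_phase(docs))))
```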


Proceedings Article
01 Jan 2014
TL;DR: This paper presents Salt, a distributed database that allows developers to improve the performance and scalability of their ACID applications through the incremental adoption of the BASE approach, a new abstraction that encapsulates the workflow of performance-critical transactions.
Abstract: This paper presents Salt, a distributed database that allows developers to improve the performance and scalability of their ACID applications through the incremental adoption of the BASE approach. Salt's motivation is rooted in the Pareto principle: for many applications, the transactions that actually test the performance limits of ACID are few. To leverage this insight, Salt introduces BASE transactions, a new abstraction that encapsulates the workflow of performance-critical transactions. BASE transactions retain desirable properties like atomicity and durability, but, through the new mechanism of Salt Isolation, control which granularity of isolation they offer to other transactions, depending on whether they are BASE or ACID. This flexibility allows BASE transactions to reap the performance benefits of the BASE paradigm without compromising the guarantees enjoyed by the remaining ACID transactions. For example, in our MySQL Cluster-based implementation of Salt, BASE-ifying just one out of 11 transactions in the open source ticketing application Fusion Ticket yields a 6.5x increase over the throughput obtained with an ACID implementation.

Patent
26 Mar 2014
TL;DR: A heterogeneous big data integration method and system based on data warehouses is presented, in which all kinds of data are integrated by combining the advantages of a relational database, a distributed database and a memory database; deep data analysis is carried out on the basis of the data warehouses, and data mining is deepened continuously.
Abstract: The invention provides a heterogeneous big data integration method and system based on data warehouses. The incidence relations between structured data, semi-structured data and unstructured data are established; all kinds of data are integrated by combining the advantages of a relational database, a distributed database and a memory database; deep data analysis is carried out on the basis of the data warehouses; and data mining is deepened continuously, so that high-efficiency and high-quality heterogeneous big data analysis is achieved. The structured data, semi-structured data and unstructured data in Internet applications are associated; through Map/Reduce distributed processing and data mining, the processing results and relevant data are written into memory in a database structure mode; thus a simple in-memory database is formed, and high-speed calculation and fast response can be achieved conveniently.

Proceedings ArticleDOI
19 May 2014
TL;DR: In this paper, locality-sensitive data shuffling is introduced to reduce the amount of network communication for distributed operators such as join and aggregation, which can improve performance by up to a factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.
Abstract: The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, this is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key, but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to a factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.
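To illustrate the first technique, optimal partition assignment, the following hedged sketch assigns each hash partition to the node that already stores most of its tuples, so that fewer tuples must cross the network during the shuffle phase. The paper computes an optimal, load-balanced assignment; this greedy version only illustrates the objective, and the histogram format is an assumption.

```python
# Greedy illustration of locality-aware partition assignment: send each partition to
# the node that already holds most of it, minimizing (approximately) shuffled tuples.

def assign_partitions(histogram):
    """histogram[p][n] = number of tuples of partition p already stored on node n."""
    assignment = {}
    for partition, per_node in histogram.items():
        assignment[partition] = max(per_node, key=per_node.get)
    return assignment


def shuffled_tuples(histogram, assignment):
    """Tuples that still have to cross the network under a given assignment."""
    return sum(count
               for partition, per_node in histogram.items()
               for node, count in per_node.items()
               if node != assignment[partition])


if __name__ == "__main__":
    hist = {0: {"n1": 900, "n2": 100},   # partition 0 is mostly local to n1
            1: {"n1": 200, "n2": 800}}   # partition 1 is mostly local to n2
    plan = assign_partitions(hist)
    print(plan, shuffled_tuples(hist, plan))
```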

Patent
12 Mar 2014
TL;DR: In this article, a distributed database system may implement fast crash recovery by establishing a connection with one or more storage nodes of a distributed storage system storing data for a database implemented by the database head node.
Abstract: A distributed database system may implement fast crash recovery. Upon recovery from a database head node failure, a connection with one or more storage nodes of a distributed storage system storing data for a database implemented by the database head node may be established. Upon establishment of the connection with the storage nodes, that database may be made available for access, such as for various access requests. In various embodiments, redo log records may not be replayed in order to provide access to the database. In at least some embodiments, the storage nodes may provide a current state of data stored for the database in response to requests.

Journal ArticleDOI
TL;DR: This paper presents an innovative system, coined DISTROD, for detecting outliers, namely abnormal instances or observations, from multiple large distributed databases; the global outliers it detects are consistent with those produced by the centralized detection paradigm.
Abstract: In this paper, we present an innovative system, coined as DISTROD (a.k.a DISTRibuted Outlier Detector), for detecting outliers, namely abnormal instances or observations, from multiple large distributed databases. DISTROD is able to effectively detect the so-called global outliers from distributed databases that are consistent with those produced by the centralized detection paradigm. DISTROD is equipped with a number of optimization/boosting strategies which empower it to significantly enhance its speed performance and reduce its communication overhead. Experimental evaluation demonstrates the good performance of DISTROD in terms of speed and communication overhead.
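As a rough illustration of how distributed distance-based outlier detection can stay consistent with the centralized result (the specific DISTROD optimizations are not reproduced), the sketch below uses the common definition that a point is a global outlier if fewer than k points in the union of all databases lie within radius r of it; a site only needs to nominate points that already look like outliers locally, since k nearby neighbors at a single site rule a point out globally.

```python
# Toy distributed distance-based outlier detection on 1-D data; the local-candidate
# step is the pruning idea, the final check matches the centralized definition.

def neighbors_within(point, data, r):
    return sum(1 for q in data if q is not point and abs(q - point) <= r)


def local_candidates(local_data, r, k):
    """Points that are not ruled out by their own site's data."""
    return [p for p in local_data if neighbors_within(p, local_data, r) < k]


def global_outliers(sites, r, k):
    union = [p for site in sites for p in site]
    candidates = [p for site in sites for p in local_candidates(site, r, k)]
    return [p for p in candidates if neighbors_within(p, union, r) < k]


if __name__ == "__main__":
    site_a, site_b = [1.0, 1.1, 1.2, 9.0], [1.05, 1.15, 5.0]
    print(global_outliers([site_a, site_b], r=0.5, k=2))   # 9.0 and 5.0 stand out
```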

Patent
22 Oct 2014
TL;DR: In this paper, a health insurance outpatient clinic big data extraction system and method based on a hadoop platform is described, which consists of a data acquisition module, a data storage module, data cleaning module, and a data analyzing and processing module.
Abstract: The invention discloses a health insurance outpatient clinic big data extraction system and method based on a hadoop platform. The system comprises a data acquisition module, a data storage module, a data cleaning module, a data analyzing and processing module, an Hbase distributed database and a data display module. The data acquisition module is connected with the data storage module, the data storage module is connected with the data analyzing and processing module through the data cleaning module, and a data query and analysis module is respectively connected with the Hbase distributed database and the data display module. The system and method have the advantages that a Hadoop cluster can be formed by thousands of cheap servers, a distributed file system cluster is constructed on large-scale cheap machines, data extraction and analysis cost is reduced to a large extent, and parallel processing can be carried out on outpatient clinic big data. Meanwhile, reliability and security of the data are well guaranteed by means of the replica storage strategy of HDFS.

Proceedings ArticleDOI
06 Oct 2014
TL;DR: Salt, as presented in this paper, is a distributed database that allows developers to improve the performance and scalability of their ACID applications through the incremental adoption of the BASE approach, via BASE transactions, a new abstraction that encapsulates the workflow of performance-critical transactions.
Abstract: This paper presents Salt, a distributed database that allows developers to improve the performance and scalability of their ACID applications through the incremental adoption of the BASE approach. Salt's motivation is rooted in the Pareto principle: for many applications, the transactions that actually test the performance limits of ACID are few. To leverage this insight, Salt introduces BASE transactions, a new abstraction that encapsulates the workflow of performance-critical transactions. BASE transactions retain desirable properties like atomicity and durability, but, through the new mechanism of Salt Isolation, control which granularity of isolation they offer to other transactions, depending on whether they are BASE or ACID. This flexibility allows BASE transactions to reap the performance benefits of the BASE paradigm without compromising the guarantees enjoyed by the remaining ACID transactions. For example, in our MySQL Cluster-based implementation of Salt, BASE-ifying just one out of 11 transactions in the open source ticketing application Fusion Ticket yields a 6.5x increase over the throughput obtained with an ACID implementation.

Journal ArticleDOI
01 Dec 2014
TL;DR: The results of comprehensive performance experiments show that the Incremental approach significantly outperforms any other known method from the literature and requires no a priori knowledge of which nodes of a distributed system are involved in executing a transaction.
Abstract: Modern database systems employ Snapshot Isolation to implement concurrency control and isolation because it promises superior query performance compared to lock-based alternatives. Furthermore, Snapshot Isolation never blocks readers, which is an important property for modern information systems, which have mixed workloads of heavy OLAP queries and short update transactions. This paper revisits the problem of implementing Snapshot Isolation in a distributed database system and makes three important contributions. First, a complete definition of Distributed Snapshot Isolation is given, thereby extending existing definitions from the literature. Based on this definition, a set of criteria is proposed to efficiently implement Snapshot Isolation in a distributed system. Second, the design space of alternative methods to implement Distributed Snapshot Isolation is presented based on this set of criteria. Third, a new approach to implement Distributed Snapshot Isolation is devised; we refer to this approach as Incremental. The results of comprehensive performance experiments with the TPC-C benchmark show that the Incremental approach significantly outperforms any other known method from the literature. Furthermore, the Incremental approach requires no a priori knowledge of which nodes of a distributed system are involved in executing a transaction. Also, the Incremental approach can execute transactions that involve data from a single node only with the same efficiency as a centralized database system. This way, the Incremental approach takes advantage of sharding or other ways to improve data locality. The cost for synchronizing transactions in a distributed system is only paid by transactions that actually involve data from several nodes. All these properties make the Incremental approach more practical than related methods proposed in the literature.
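For reference, the snapshot-read rule that Snapshot Isolation builds on can be sketched in a few lines of multi-version storage code. The distributed part the paper studies, namely how to choose and coordinate snapshot and commit timestamps incrementally across nodes, is not reproduced here.

```python
# Minimal multi-version storage with the snapshot-read rule: a transaction sees the
# newest version committed no later than its snapshot timestamp.

class MVStore:
    def __init__(self):
        self.versions = {}          # key -> list of (commit_ts, value)

    def commit_write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))

    def snapshot_read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        visible = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return max(visible)[1] if visible else None


if __name__ == "__main__":
    store = MVStore()
    store.commit_write("x", "old", commit_ts=5)
    store.commit_write("x", "new", commit_ts=9)
    print(store.snapshot_read("x", snapshot_ts=7))   # reads "old": ts 9 is invisible
```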

Patent
17 Mar 2014
TL;DR: In this paper, the working copies of database entries residing in a local database of a secondary storage computing device are modified in response to instructions to modify the database entries in a deduplication database.
Abstract: An information management system can modify working copies of database entries residing in a local database of a secondary storage computing device in response to instructions to modify the database entries residing in a deduplication database. If the working copy does not already reside in the local database, a copy of the database entry, or portion thereof, from the deduplication database can be used to generate the working copy. Based on a desired policy, the working copies in the local database can be merged with the actual database entries in the deduplication database.

Proceedings ArticleDOI
01 Dec 2014
TL;DR: Systems implementing this architecture could provide companies with on-demand tools facilitating the tasks of storing, analyzing, understanding and reacting to their data, either in batch or stream fashion, and could turn into a valuable asset for improving business performance and a key market differentiator in this fast-paced environment.
Abstract: This work describes a proposal for developing and testing a scalable machine learning architecture able to provide real-time predictions or analytics as a service over domain-independent big data, working on top of the Hadoop ecosystem and providing real-time analytics as a service through a RESTful API. Systems implementing this architecture could provide companies with on-demand tools facilitating the tasks of storing, analyzing, understanding and reacting to their data, either in batch or stream fashion, and could turn into a valuable asset for improving business performance and a key market differentiator in this fast-paced environment. In order to validate the proposed architecture, two systems are developed, each one providing classical machine-learning services in different domains: the first one involves a recommender system for web advertising, while the second consists of a prediction system which learns from gamers' behavior and tries to predict future events such as purchases or churning. An evaluation is carried out on these systems, and the results show that both services are able to provide fast responses even when a number of concurrent requests are made; in the particular case of the second system, the results clearly show that the computed predictions significantly outperform random guessing.

Patent
11 Jul 2014
TL;DR: In this paper, a distributed computing application is described that provides a highly elastic and multi-tenant platform for Hadoop applications and other workloads running in a virtualized environment.
Abstract: A distributed computing application is described that provides a highly elastic and multi-tenant platform for Hadoop applications and other workloads running in a virtualized environment. Deployments of a distributed computing application, such as Hadoop, may be executed concurrently with a distributed database application, such as HBase, using a shared instance of a distributed filesystem, or in other cases, multiple instances of the distributed filesystem. Computing resources allocated to region server nodes executing as VMs may be isolated from compute VMs of the distributed computing application, as well as from data nodes executing as VMs of the distributed filesystem.

Proceedings ArticleDOI
27 Mar 2014
TL;DR: This paper gives a brief overview of Big Data, Hadoop MapReduce andHadoop Distributed File System along with its architecture.
Abstract: Hadoop is an open source cloud computing platform of the Apache Foundation that provides a software programming framework called MapReduce and a distributed file system, HDFS. It is a Linux-based set of tools that uses commodity hardware, which is relatively inexpensive, to handle, analyze and transform large quantities of data. The Hadoop Distributed File System, HDFS, stores huge data sets reliably and streams them to user applications at high bandwidth, and MapReduce is a framework used for processing massive data sets in a distributed fashion over several machines. This paper gives a brief overview of Big Data, Hadoop MapReduce and the Hadoop Distributed File System along with its architecture.

Proceedings ArticleDOI
12 Jun 2014
TL;DR: An assessment criterion comprising various security features for the analysis of sharded NoSQL databases is proposed and presented, which helps various organizations in the selection of an appropriate and reliable database in accordance with their preferences and security requirements.
Abstract: NoSQL databases are easy to scale out because of their flexible schema and support for BASE (Basically Available, Soft State and Eventually Consistent) properties. The process of scaling out in most of these databases is supported by sharding, which is considered the key feature in providing faster reads and writes to the database. However, securing the data sharded over various servers is a challenging problem, because the data is processed and transmitted in a distributed fashion over an unsecured network. Although extensive research has been performed on NoSQL sharding mechanisms, no specific criterion has been defined to analyze the security of sharded architectures. This paper proposes an assessment criterion comprising various security features for the analysis of sharded NoSQL databases. It presents a detailed view of the security features offered by NoSQL databases and analyzes them with respect to the proposed assessment criteria. The presented analysis helps various organizations in the selection of an appropriate and reliable database in accordance with their preferences and security requirements.

Journal ArticleDOI
TL;DR: The architecture is a new scheme for accurate leak point detection, which is more consistent with practical application in the large-scale petrochemical industry.
Abstract: In the large-scale petrochemical industry, one of the most concerning problems is the leakage of toxic gas. To solve this problem, it is necessary to locate the leak points and feed the possible location of leak points back to rescuers. Although some researchers have previously presented several methods to locate leak points, they ignored the impact of external factors, such as wind, and internal factors, such as the internal pressure of equipment, on the accurate detection of leak points. Fundamentally, both of those factors belong to context-aware data in a context-aware system. Therefore, this article proposes a context-aware system architecture for leak point detection in the large-scale petrochemical industry. In this three-layer architecture, a distributed database based on data categorization is designed in the storage layer, which is able to choose the most efficient approach to store the context-aware data from the gathering layer according to different context-aware data types. Then a real-time template matching algorithm for context-aware systems is presented in the computing layer to process the context-aware data stream. The architecture is a new scheme for accurate leak point detection, which is more consistent with practical application in the large-scale petrochemical industry.

Proceedings ArticleDOI
24 Sep 2014
TL;DR: This work designs a distributed storage and index model for HBase Spatial, a scalable spatial data storage based on HBase, that can effectively enhance the query speed of big spatial data and provide a good solution for storage.
Abstract: In recent years, the scale of spatial data has grown enormously and its storage has encountered many problems. Traditional DBMSs can efficiently handle some big spatial data. However, popular open source relational database systems are overwhelmed by the high insertion rates, querying requirements and terabytes of data that these systems have to handle. On the other hand, key-value storage can effectively support large scale operations. To resolve the problems of big vector spatial data storage and query, we bring forward HBase Spatial, a scalable spatial data storage based on HBase. At first, we analyze the distributed storage model of HBase. Then, we design a distributed storage and index model. Finally, the advantages of our storage model and index algorithm are proven by experiments with both big sample sets and typical benchmarks on a cluster, compared with MongoDB and MySQL, which shows that our model can effectively enhance the query speed of big spatial data and provide a good solution for storage.
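One common way to key spatial data in a key-value store such as HBase is to quantize latitude and longitude and interleave their bits (a Z-order, geohash-like curve) so that nearby points tend to share row-key prefixes and can be retrieved with prefix scans. The sketch below illustrates that idea only; it is not necessarily the exact index model the paper designs.

```python
# Illustrative Z-order row key for spatial data in a key-value store: quantize
# lat/lon and interleave their bits so nearby points tend to share key prefixes.

def z_order_key(lat, lon, bits=16):
    """Interleave `bits` bits of quantized lat/lon into a single hex row key."""
    lat_q = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    lon_q = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    z = 0
    for i in range(bits):
        z |= ((lon_q >> i) & 1) << (2 * i)       # even bit positions: longitude
        z |= ((lat_q >> i) & 1) << (2 * i + 1)   # odd bit positions: latitude
    return format(z, "0{}x".format(bits // 2))


if __name__ == "__main__":
    # Nearby points produce keys with a long common prefix; distant ones do not.
    print(z_order_key(48.8566, 2.3522))     # Paris
    print(z_order_key(48.8600, 2.3600))     # also Paris
    print(z_order_key(-33.8688, 151.2093))  # Sydney
```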

Patent
13 Mar 2014
TL;DR: In this paper, a distributed database consisting of a plurality of server racks and one or more many-core processor servers in each of the server racks is presented, where the data is configured as tables distributed to the servers for storage in the servers.
Abstract: A distributed database, comprising a plurality of server racks, and one or more many-core processor servers in each of the plurality of server racks, wherein each of the one or more many-core processor servers comprises a many-core processor configured to store and access data on one or more solid state drives in the distributed database, where the one or more solid state drives are configured to enable retrieval of data through one or more text-searchable indexes. The one or more many-core processor servers are configured to communicate within the plurality of server racks via a network, and the data is configured as one or more tables distributed to the one or more many-core processor servers for storage in the one or more solid state drives.

Proceedings ArticleDOI
26 May 2014
TL;DR: This work proposes a uniform data management system that is environment-aware, as it monitors and models the global cloud infrastructure, and offers predictable data handling performance for transfer cost and time, and reduces the monetary costs and transfer time by up to 3 times.
Abstract: Today's continuously growing cloud infrastructures provide support for processing ever increasing amounts of scientific data. Cloud resources for computation and storage are spread among globally distributed datacenters. Thus, to leverage the full computation power of the clouds, global data processing across multiple sites has to be fully enabled. However, managing data across geographically distributed datacenters is not trivial, as it involves high and variable latencies among sites which come at a high monetary cost. In this work, we propose a uniform data management system for scientific applications running across geographically distributed sites. Our solution is environment-aware, as it monitors and models the global cloud infrastructure, and offers predictable data handling performance for transfer cost and time. In terms of efficiency, it provides the applications with the possibility to set a tradeoff between money and time and optimizes the transfer strategy accordingly. The system was validated on Microsoft's Azure cloud across 6 EU and US datacenters. The experiments were conducted on hundreds of nodes using both synthetic benchmarks and the real-life A-Brain application. The results show that our system is able to model and predict the cloud performance well and to leverage this into efficient data dissemination. Our approach reduces the monetary costs and transfer time by up to 3 times.
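The money/time tradeoff the system exposes can be illustrated with a toy strategy selector: given predicted transfer time and monetary cost for several candidate strategies, pick the one minimizing a weighted combination controlled by a user-chosen factor. The strategy names and numbers below are purely illustrative and not taken from the paper.

```python
# Toy selector for a money/time tradeoff over candidate transfer strategies.

def pick_strategy(candidates, money_weight):
    """candidates: {name: (predicted_seconds, predicted_dollars)}; 0 <= money_weight <= 1."""
    max_t = max(t for t, _ in candidates.values())
    max_c = max(c for _, c in candidates.values())

    def score(item):
        seconds, dollars = item[1]
        # Normalize both dimensions so the weight expresses a genuine tradeoff.
        return (1.0 - money_weight) * seconds / max_t + money_weight * dollars / max_c

    return min(candidates.items(), key=score)[0]


if __name__ == "__main__":
    options = {"direct": (620.0, 4.10),
               "via-intermediate-site": (410.0, 6.30),
               "extra-parallel-streams": (350.0, 7.90)}
    print(pick_strategy(options, money_weight=0.1))   # favors time: picks the fastest option
    print(pick_strategy(options, money_weight=0.9))   # favors cost: picks the cheapest option
```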

Journal ArticleDOI
TL;DR: This paper proposes a new quorum-based data replication protocol with the objectives of minimizing the data update cost and providing high availability and data consistency, and compares the proposed approach with two existing approaches using response time, data consistency, data availability, and communication costs.
Abstract: Data grids have been adopted by many scientific communities that need to share, access, transport, process, and manage geographically distributed large data collections. Data replication is one of the main mechanisms used in data grids whereby identical copies of data are generated and stored at various distributed sites to either improve data access performance or reliability or both. However, when data updates are allowed, it is a great challenge to simultaneously improve performance and reliability while ensuring data consistency of such huge and widely distributed data. In this paper, we address this problem. We propose a new quorum-based data replication protocol with the objectives of minimizing the data update cost, providing high availability and data consistency. We compare the proposed approach with two existing approaches using response time, data consistency, data availability, and communication costs. The results show that the proposed approach performs substantially better than the benchmark approaches.
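The quorum condition underlying protocols of this kind is that with N replicas, a write must reach W of them and a read must consult R of them, with R + W > N so that every read quorum overlaps the latest write quorum. The sketch below illustrates only this basic mechanism; the paper's specific quorum construction and update-cost optimizations are not modeled.

```python
# Minimal read/write quorum mechanism: overlapping quorums let a read return the
# newest committed version by taking the highest version number it sees.

class QuorumReplicas:
    def __init__(self, n, r, w):
        assert r + w > n, "R + W must exceed N for read/write quorums to intersect"
        self.replicas = [dict() for _ in range(n)]   # each replica: key -> (version, value)
        self.n, self.r, self.w = n, r, w

    def write(self, key, value, version, targets):
        assert len(targets) >= self.w, "write did not reach a write quorum"
        for i in targets:
            self.replicas[i][key] = (version, value)

    def read(self, key, targets):
        assert len(targets) >= self.r, "read did not reach a read quorum"
        answers = [self.replicas[i][key] for i in targets if key in self.replicas[i]]
        return max(answers)[1] if answers else None   # newest version wins


if __name__ == "__main__":
    q = QuorumReplicas(n=5, r=2, w=4)
    q.write("site", "v1", version=1, targets=[0, 1, 2, 3])   # reaches a write quorum
    print(q.read("site", targets=[3, 4]))                    # overlap at replica 3 -> "v1"
```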

Patent
13 Mar 2014
TL;DR: In this paper, a database system may maintain a plurality of log records at a distributed storage system, each of which is associated with a change to a data page, and upon detection of a coalesce event, log records linked to the particular data page may be applied to generate the particular page in its current state.
Abstract: A database system may maintain a plurality of log records at a distributed storage system. Each of the plurality of log records may be associated with a respective change to a data page. Upon detection of a coalesce event for a particular data page, log records linked to the particular data page may be applied to generate the particular data page in its current state. Detecting the coalesce event may be a determination that the number of log records linked to the particular data page exceeds a threshold.
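A hedged sketch of the coalesce idea described above: a storage node keeps a base page image plus the redo log records linked to that page, and once the number of pending records crosses a threshold it applies them to materialize the current page. The record format and apply function here are purely illustrative.

```python
# Toy page store: coalesce is triggered when pending log records exceed a threshold.

COALESCE_THRESHOLD = 3


def apply_record(page, record):
    """Illustrative redo apply: each record sets one byte offset of the page."""
    offset, value = record
    page = bytearray(page)
    page[offset] = value
    return bytes(page)


class PageStore:
    def __init__(self, base_page):
        self.page = base_page
        self.pending = []                 # redo log records linked to this page

    def append_log(self, record):
        self.pending.append(record)
        if len(self.pending) > COALESCE_THRESHOLD:   # coalesce event detected
            self.coalesce()

    def coalesce(self):
        for record in self.pending:
            self.page = apply_record(self.page, record)
        self.pending.clear()

    def current_page(self):
        """Current state = stored image with any still-pending records applied on read."""
        page = self.page
        for record in self.pending:
            page = apply_record(page, record)
        return page


if __name__ == "__main__":
    store = PageStore(base_page=bytes(8))
    for i in range(5):
        store.append_log((i, 0xFF))       # appends eventually exceed the threshold
    print(store.current_page().hex())
```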

Proceedings ArticleDOI
19 May 2014
TL;DR: Experimental results show that several query optimization techniques for distributed graph pattern matching can lead to an order of magnitude improvement in query performance.
Abstract: Greedy algorithms for subgraph pattern matching operations are often sufficient when the graph data set can be held in memory on a single machine. However, as graph data sets increasingly expand and require external storage and partitioning across a cluster of machines, more sophisticated query optimization techniques become critical to avoid explosions in query latency. In this paper, we introduce several query optimization techniques for distributed graph pattern matching. These techniques include (1) a System-R style dynamic programming-based optimization algorithm that considers both linear and bushy plans, (2) a cycle detection-based algorithm that leverages cycles to reduce intermediate result set sizes, and (3) a computation reusing technique that eliminates redundant query execution and data transfer over the network. Experimental results show that these algorithms can lead to an order of magnitude improvement in query performance.