A Distributed Cache for Hadoop Distributed File System in Real-Time Cloud Services

doi:10.1109/GRID.2012.17

Proceedings Article

- pp 12-21

TLDR

Experimental results show that the novel cache system, built on top of the Hadoop Distributed File System, can store files of widely varying sizes and achieves millisecond-level access performance in highly concurrent environments.

Abstract:

Improving file access performance is a major challenge in real-time cloud services. In this paper, we analyze the preconditions for addressing this problem, considering requirements, hardware, software, and network environments in the cloud. We then describe the design and implementation of a novel distributed layered cache system built on top of the Hadoop Distributed File System, named the HDFS-based Distributed Cache System (HDCache). The cache system consists of a client library and multiple cache services. The cache services are designed with three access layers: an in-memory cache, a snapshot of the local disk, and the actual disk view as provided by HDFS. Files loaded from HDFS are cached in shared memory, which can be accessed directly by a client library. Multiple applications integrated with a client library can access a cache service simultaneously. Cache services are organized in a P2P style using a distributed hash table. Every cached file has three replicas on different cache service nodes in order to improve robustness and alleviate the workload. Experimental results show that the cache system can store files of widely varying sizes and achieves millisecond-level access performance in highly concurrent environments.
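
The layered lookup and the three-replica placement described in the abstract can be illustrated with a minimal Python sketch. This is not the HDCache implementation: the class names `HashRing` and `LayeredCache`, the use of MD5 for ring positions, the in-process dictionaries standing in for shared memory and the local disk snapshot, and the policy of promoting a file to the upper layers on a miss are all assumptions made for illustration.

```python
import bisect
import hashlib
from typing import Callable, Dict, List


def _hash(key: str) -> int:
    """Map a string to a point on the hash ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; assumes node names are unique."""

    def __init__(self, nodes: List[str], replicas: int = 3) -> None:
        self.replicas = replicas
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._points = [p for p, _ in self._ring]

    def nodes_for(self, path: str) -> List[str]:
        """Return the distinct nodes that hold replicas of `path`."""
        if not self._ring:
            return []
        # First node clockwise from the file's position, then its successors.
        start = bisect.bisect_right(self._points, _hash(path))
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


class LayeredCache:
    """Three-layer read path: memory, local disk snapshot, then HDFS."""

    def __init__(self, load_from_hdfs: Callable[[str], bytes]) -> None:
        self.memory: Dict[str, bytes] = {}         # stands in for shared memory
        self.disk_snapshot: Dict[str, bytes] = {}  # stands in for the local disk layer
        self._load_from_hdfs = load_from_hdfs      # hypothetical HDFS reader

    def get(self, path: str) -> bytes:
        if path in self.memory:
            return self.memory[path]
        data = self.disk_snapshot.get(path)
        if data is None:
            # Miss in both local layers: fall through to HDFS.
            data = self._load_from_hdfs(path)
            self.disk_snapshot[path] = data
        # Assumed policy: promote the file to memory on every miss.
        self.memory[path] = data
        return data


if __name__ == "__main__":
    ring = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
    print(ring.nodes_for("/data/example.bin"))

    fake_hdfs = {"/data/example.bin": b"payload"}
    cache = LayeredCache(fake_hdfs.__getitem__)
    print(cache.get("/data/example.bin"))
```

Because each node occupies a single point on the ring in this sketch, walking clockwise from the file's position yields distinct nodes, which mirrors the requirement that the three replicas live on different cache service nodes.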

Citations

Journal Article

A Survey on Network Methodologies for Real-Time Analytics of Massive IoT Data and Open Research Issues

Shikhar Verma, +4 more

- 14 Apr 2017 -

IEEE Communications Surveys and Tutorial...

TL;DR: State-of-the-art network methodologies suitable for real-time IoT analytics are reviewed, and a number of prospective research problems and future research directions are presented, focusing on network methodologies for real-time IoT analytics.

Journal Article

Large-scale data mining using genetics-based machine learning

Jaume Bacardit, +1 more

- 01 Jan 2013 -

Wiley Interdisciplinary Reviews-Data Min...

TL;DR: Different classes of methods that alone or (in many cases) combined accelerate genetics‐based machine learning methods are reviewed.

Patent

Prioritizing data requests based on quality of service

Thomas A. Phelan, +3 more

TL;DR: A method of prioritizing data requests in a computing system based on quality of service includes identifying a plurality of data requests and assigning cache memory to each of the plurality of requests based on the prioritization.

Proceedings Article

AutoReplica: Automatic data replica manager in distributed caching and data processing systems

Zhengyu Yang, +3 more

TL;DR: This paper proposes a complete solution called AutoReplica, a replica manager in distributed caching and data processing systems with SSD-HDD tiered storage, and proposes a migrate-on-write technique called “fusion cache” to seamlessly migrate and prefetch among local and remote replicas without pausing the subsystem.

Proceedings Article

Analytical review on Hadoop Distributed file system

Kalpana Dwivedi, +1 more

TL;DR: A step-by-step introduction to data management using a file system and using an RDBMS is given, followed by the need for the Hadoop Distributed File System and its working process.

References

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008 -

Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Journal Article

The Google file system

Sanjay Ghemawat, +2 more

TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.

Proceedings Article

Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!).

Fay W. Chang, +8 more

TL;DR: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. It is used by many Google projects, including web indexing, Google Earth, and Google Finance.

Proceedings Article

Dynamo: Amazon's highly available key-value store

Giuseppe DeCandia, +8 more

TL;DR: Dynamo is presented, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience and that makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.

Related Papers (5)

MapReduce: simplified data processing on large clusters

HDCache: A Distributed Cache System for Real-Time Cloud Services

The Hadoop Distributed File System

Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS

The Google file system