The Hadoop Distributed File System

doi:10.1109/MSST.2010.5496972

Proceedings Article

Konstantin Shvachko, +3 more

- pp 1-10

TLDR

This paper describes the architecture of HDFS and reports on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.

Abstract:

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.

Citations

Journal Article

Edge Computing: Vision and Challenges

Weisong Shi, +4 more

- 09 Jun 2016 -

IEEE Internet of Things Journal

TL;DR: The definition of edge computing is introduced, followed by several case studies, ranging from cloud offloading to smart home and city, as well as collaborative edge, to materialize the concept of edge computing.

Journal Article

The rise of big data on cloud computing

Ibrahim Abaker Targio Hashem, +5 more

- 01 Jan 2015 -

Information Systems

TL;DR: The definition, characteristics, and classification of big data along with some discussions on cloud computing are introduced, and research challenges are investigated, with focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance.

Proceedings Article

Apache Hadoop YARN: yet another resource negotiator

Vinod Kumar Vavilapalli, +15 more

TL;DR: The design, development, and current state of deployment of YARN, the next generation of Hadoop's compute platform, are summarized. YARN decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.

Proceedings Article

In search of an understandable consensus algorithm

Diego Ongaro, +1 more

TL;DR: Raft is a consensus algorithm for managing a replicated log that separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered.

Journal Article

A review of clustering techniques and developments

Amit Saxena, +8 more

- 06 Dec 2017 -

Neurocomputing

TL;DR: The applications of clustering in fields such as image segmentation, object and character recognition, and data mining are highlighted, and the approaches used in these methods are discussed with their respective state of the art and applicability.

References

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

Journal Article

The Google file system

Sanjay Ghemawat, +2 more

TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.

Book

Hadoop: The Definitive Guide

Tom White

TL;DR: This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Journal Article

Hive: a warehousing solution over a map-reduce framework

Ashish Thusoo, +8 more

TL;DR: Hadoop is a popular open-source map-reduce implementation that is used as an alternative for storing and processing extremely large data sets on commodity hardware.

Proceedings Article

Ceph: a scalable, high-performance distributed file system

Sage A. Weil, +4 more

TL;DR: Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.

Related Papers (5)

MapReduce: simplified data processing on large clusters

The Google file system

Spark: cluster computing with working sets

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Hadoop: The Definitive Guide