Proceedings Article
The Hadoop Distributed File System
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert J. Chansler
pp. 1-10
TLDR
The architecture of HDFS is described, and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.
Abstract:
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
Citations
Journal Article
Edge Computing: Vision and Challenges
TL;DR: The definition of edge computing is introduced, followed by several case studies, ranging from cloud offloading to smart home and city, as well as collaborative edge, to materialize the concept of edge computing.
Journal Article
The rise of big data on cloud computing
Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Binti Mokhtar, Abdullah Gani, Samee U. Khan
TL;DR: The definition, characteristics, and classification of big data along with some discussions on cloud computing are introduced, and research challenges are investigated, with focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance.
Proceedings Article
Apache Hadoop YARN: yet another resource negotiator
Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, Eric Baldeschwieler
TL;DR: The design, development, and current state of deployment of YARN, the next generation of Hadoop's compute platform, are summarized; YARN decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
Proceedings Article
In search of an understandable consensus algorithm
Diego Ongaro, John Ousterhout
TL;DR: Raft is a consensus algorithm for managing a replicated log that separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered.
Journal Article
A review of clustering techniques and developments
Amit Saxena, Mukesh Prasad, Akshansh Gupta, Neha Bharill, Om Prakash Patel, Aruna Tiwari, Meng Joo Er, Weiping Ding, Chin-Teng Lin
TL;DR: The applications of clustering in fields such as image segmentation, object and character recognition, and data mining are highlighted, and the approaches used in these methods are discussed along with their state of the art and applicability.
References
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal Article
The Google file system
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Book
Hadoop: The Definitive Guide
TL;DR: This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Journal Article
Hive: a warehousing solution over a map-reduce framework
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy
TL;DR: Hadoop is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware.
Proceedings Article
Ceph: a scalable, high-performance distributed file system
TL;DR: Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.