
Showing papers on "Distributed File System" published in 2018


Journal ArticleDOI
01 Aug 2018
TL;DR: ParallelRaft is developed, a consensus protocol derived from Raft, which breaks Raft's strict serialization by exploiting the out-of-order I/O completion tolerance capability of databases.
Abstract: PolarFS is a distributed file system with ultra-low latency and high availability, designed for the POLARDB database service, which is now available on the Alibaba Cloud. PolarFS utilizes a lightweight network stack and I/O stack in user space, taking full advantage of emerging technologies such as RDMA, NVMe, and SPDK. In this way, the end-to-end latency of PolarFS has been reduced drastically, and our experiments show that the write latency of PolarFS is quite close to that of a local file system on SSD. To keep replicas consistent while maximizing I/O throughput for PolarFS, we develop ParallelRaft, a consensus protocol derived from Raft, which breaks Raft's strict serialization by exploiting the out-of-order I/O completion tolerance of databases. ParallelRaft inherits the understandability and ease of implementation of Raft while providing much better I/O scalability for PolarFS. We also describe the shared storage architecture of PolarFS, which provides strong support for POLARDB.
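The core relaxation in ParallelRaft can be pictured with a toy check: a later log entry may be acknowledged before earlier ones complete, as long as it does not touch the same block range. The sketch below is only illustrative and assumes writes are described by byte offset and length; none of the names are PolarFS APIs.

```python
# Illustrative sketch (not PolarFS code): ParallelRaft-style out-of-order
# acknowledgement. A write may be acknowledged while earlier entries are still
# pending, provided none of those pending entries overlap its block range.
from dataclasses import dataclass
from typing import List

@dataclass
class WriteEntry:
    index: int      # position in the replication log
    offset: int     # starting byte offset of the write
    length: int     # number of bytes written
    done: bool = False

def overlaps(a: WriteEntry, b: WriteEntry) -> bool:
    return a.offset < b.offset + b.length and b.offset < a.offset + a.length

def can_ack_out_of_order(log: List[WriteEntry], entry: WriteEntry) -> bool:
    """True if no earlier, still-pending entry touches the same range."""
    return not any(
        e.index < entry.index and not e.done and overlaps(e, entry)
        for e in log
    )

log = [WriteEntry(1, 0, 4096), WriteEntry(2, 8192, 4096)]
print(can_ack_out_of_order(log, log[1]))   # True: entry 2 may be acked before entry 1
```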

76 citations


Proceedings ArticleDOI
01 Sep 2018
TL;DR: GekkoFS is a temporary, highly-scalable burst buffer file system which has been specifically optimized for new access patterns of data-intensive High-Performance Computing applications, significantly outperforming the capabilities of general-purpose parallel file systems.
Abstract: We present GekkoFS, a temporary, highly scalable burst buffer file system which has been specifically optimized for the new access patterns of data-intensive High-Performance Computing (HPC) applications. The file system provides relaxed POSIX semantics, offering only those features that are actually required by most (though not all) applications. It provides scalable I/O performance and reaches millions of metadata operations even on a small number of nodes, significantly outperforming general-purpose parallel file systems.
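One way burst buffers of this kind avoid a central metadata server is to spread metadata across the participating nodes, for example by hashing the file path. The sketch below shows only that placement idea under assumed names; it is not the GekkoFS implementation.

```python
# Hash-based metadata placement sketch: the node owning a file's metadata is
# derived from a hash of its path, so lookups need no central server.
import hashlib
from typing import List

def metadata_owner(path: str, nodes: List[str]) -> str:
    digest = hashlib.md5(path.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

nodes = ["node00", "node01", "node02", "node03"]
for p in ["/job/out/rank0.dat", "/job/out/rank1.dat", "/job/ckpt/step10"]:
    print(p, "->", metadata_owner(p, nodes))
```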

37 citations


Proceedings ArticleDOI
11 Oct 2018
TL;DR: Wharf is introduced, a middleware that transparently adds distributed storage support to Docker; it partitions Docker's runtime state into local and global parts, efficiently synchronizes accesses to the global state, and minimizes the synchronization overhead.
Abstract: Container management frameworks, such as Docker, package diverse applications and their complex dependencies in self-contained images, which facilitates application deployment, distribution, and sharing. Currently, Docker employs a shared-nothing storage architecture, i.e. every Docker-enabled host requires its own copy of an image on local storage to create and run containers. This greatly inflates storage utilization, network load, and job completion times in the cluster. In this paper, we investigate the option of storing container images in and serving them from a distributed file system. By sharing images in a distributed storage layer, storage utilization can be reduced and redundant image retrievals from a Docker registry become unnecessary. We introduce Wharf, a middleware to transparently add distributed storage support to Docker. Wharf partitions Docker's runtime state into local and global parts and efficiently synchronizes accesses to the global state. By exploiting the layered structure of Docker images, Wharf minimizes the synchronization overhead. Our experiments show that compared to Docker on local storage, Wharf can speed up image retrievals by up to 12x, has more stable performance, and introduces only a minor overhead when accessing data on distributed storage.
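The key idea of synchronizing only the global part of the runtime state, at the granularity of image layers, can be sketched as follows; the class and methods are hypothetical stand-ins, not Wharf's interfaces.

```python
# Per-layer synchronization sketch: the first host needing a missing layer
# fetches it into the shared store; others wait on that layer's lock only.
import threading

class SharedLayerStore:
    def __init__(self):
        self._locks = {}             # layer digest -> lock (global state)
        self._present = set()        # layers already in shared storage
        self._guard = threading.Lock()

    def ensure_layer(self, digest, fetch):
        with self._guard:
            lock = self._locks.setdefault(digest, threading.Lock())
        with lock:                   # contention is limited to one layer
            if digest not in self._present:
                fetch(digest)        # e.g. pull from the registry into the DFS
                self._present.add(digest)

store = SharedLayerStore()
store.ensure_layer("sha256:abc123", lambda d: print("pulling", d))
store.ensure_layer("sha256:abc123", lambda d: print("pulling", d))  # no second pull
```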

26 citations


Journal ArticleDOI
TL;DR: Novel methods, referred to as Opass, are proposed to optimize parallel data reads, as well as to reduce the imbalance of parallel writes on distributed file systems to benefit parallel data-intensive analysis and balanced data access.
Abstract: The distributed file system HDFS is widely deployed as the bedrock for much parallel big data analysis. However, when multiple parallel applications run over the shared file system, the data requests from different processes/executors are unfortunately served in a surprisingly imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes arise because: a) unlike conventional parallel file systems, which use striping policies to evenly distribute data among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk file, in several copies placed by a relatively random policy, which can result in an uneven data distribution among storage nodes; and b) under the data retrieval policy of HDFS, the more data a storage node contains, the higher the probability that it will be selected to serve the data. Therefore, on nodes serving multiple chunk files, data requests from different processes/executors compete for shared resources such as the hard disk head and network bandwidth, resulting in degraded I/O performance. In this paper, we first conduct a complete analysis of how remote and imbalanced read/write patterns occur and how they are affected by the size of the cluster. We then propose novel methods, referred to as Opass, to optimize parallel data reads and to reduce the imbalance of parallel writes on distributed file systems. Our proposed methods can benefit parallel data-intensive analysis with various parallel data access strategies. Opass adopts new matching-based algorithms to match processes to data so as to achieve the maximum degree of data locality and balanced data access. Furthermore, to reduce the imbalance of parallel writes, Opass employs a heatmap for monitoring the I/O status of storage nodes and applies an HM-LRU policy to select a locally optimal storage node for serving write requests. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results from both benchmarks and well-known parallel applications show the performance benefits and scalability of Opass.
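The locality-and-balance objective behind Opass can be illustrated with a greedy stand-in: each chunk is assigned to the least-loaded node that holds one of its replicas. The paper uses matching-based algorithms; this sketch only conveys the objective, and all names are illustrative.

```python
# Greedy locality-aware assignment sketch (stand-in for Opass's matching).
def assign_chunks(chunk_replicas, nodes):
    """chunk_replicas: {chunk_id: [nodes holding a replica]}"""
    load = {n: 0 for n in nodes}
    assignment = {}
    for chunk, holders in chunk_replicas.items():
        candidates = holders or nodes            # fall back to a remote read
        target = min(candidates, key=lambda n: load[n])
        assignment[chunk] = target
        load[target] += 1
    return assignment, load

replicas = {"c1": ["n1", "n2"], "c2": ["n2", "n3"], "c3": ["n2"], "c4": ["n1", "n3"]}
print(assign_chunks(replicas, ["n1", "n2", "n3"]))
```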

25 citations


01 Jan 2018
TL;DR: This dissertation proposes an architecture with a Virtual Distributed File System (VDFS) as a new layer between the compute layer and the storage layer, and achieves these goals through an implementation of VDFS called Alluxio (formerly Tachyon), which presents a set of disparate data stores as a single file system.
Abstract: Author(s): Li, Haoyuan | Advisor(s): Stoica, Ion; Shenker, Scott | The world is entering the data revolution era. Along with the latest advancements of the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and the Internet of Things (IoT), the amount of data we are generating, collecting, storing, managing, and analyzing is growing exponentially. Storing and processing these data has exposed tremendous challenges and opportunities. Over the past two decades, we have seen significant innovation in the data stack. For example, in the computation layer, the ecosystem started from the MapReduce framework and grew into many different general and specialized systems, such as Apache Spark for general data processing; Apache Storm and Apache Samza for stream processing; Apache Mahout for machine learning; TensorFlow and Caffe for deep learning; and Presto and Apache Drill for SQL workloads. There are more than a hundred popular frameworks for various workloads, and the number is growing. Similarly, the storage layer of the ecosystem grew from the Apache Hadoop Distributed File System (HDFS) to a variety of choices as well, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases, to realize different tradeoffs in cost, speed, and semantics. This increasing complexity in the stack creates challenges on multiple fronts. Data is siloed in various storage systems, making it difficult for users and applications to find and access the data efficiently. For system developers, it requires more work to integrate a new compute or storage component as a building block into the existing ecosystem. For data application developers, understanding and managing the correct way to access different data stores becomes more complex. For end users, accessing data from various and often remote data stores often results in a performance penalty and a semantics mismatch. For system admins, adding, removing, or upgrading an existing compute or data store, or migrating data from one store to another, can be arduous if the physical storage has been deeply coupled with all applications. To address these challenges, this dissertation proposes an architecture with a Virtual Distributed File System (VDFS) as a new layer between the compute layer and the storage layer. Adding VDFS to the stack brings many benefits. Specifically, VDFS enables global data accessibility for different compute frameworks, efficient in-memory data sharing and management across applications and data stores, high I/O performance and efficient use of network bandwidth, and a flexible choice of compute and storage. Meanwhile, as the layer through which data is accessed and data metrics and usage patterns are collected, it also provides users insight into their data and can be used to optimize data access based on workloads. We achieve these goals through an implementation of VDFS called Alluxio (formerly Tachyon). Alluxio presents a set of disparate data stores as a single file system, greatly reducing the complexity of the storage APIs and semantics exposed to applications. Alluxio is designed with a memory-centric architecture, enabling applications to leverage memory-speed I/O simply by using Alluxio. Alluxio has been deployed at hundreds of leading companies in production, serving critical workloads. Its open-source community has attracted more than 800 contributors worldwide from over 200 companies. In this dissertation, we also investigate lineage as an important technique in the VDFS to improve write performance, and we propose DFS-Perf, a scalable distributed file system performance evaluation framework, to help researchers and developers better design and implement systems in the Alluxio ecosystem.
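The essence of a virtual distributed file system is a single namespace whose prefixes are mounted onto different under-stores. The sketch below shows that routing idea only; the classes are hypothetical and much simpler than Alluxio's actual API, caching, and lineage machinery.

```python
# Minimal virtual-DFS facade: one namespace, prefix-based mounts to backends.
class VirtualDFS:
    def __init__(self):
        self._mounts = {}                         # prefix -> backend

    def mount(self, prefix, backend):
        self._mounts[prefix] = backend

    def _resolve(self, path):
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path.startswith(prefix):
                return self._mounts[prefix], path[len(prefix):]
        raise FileNotFoundError(path)

    def read(self, path):
        backend, rel = self._resolve(path)
        return backend.read(rel)

class DictStore:                                  # stand-in for HDFS, S3, etc.
    def __init__(self, data): self.data = data
    def read(self, rel): return self.data[rel]

vfs = VirtualDFS()
vfs.mount("/hdfs", DictStore({"/logs/a": b"hdfs bytes"}))
vfs.mount("/s3", DictStore({"/models/m": b"s3 bytes"}))
print(vfs.read("/hdfs/logs/a"), vfs.read("/s3/models/m"))
```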

24 citations


Proceedings ArticleDOI
01 Feb 2018
TL;DR: The proposed approach improves the performance of file encryption/decryption by using the AES and OTP algorithms integrated on Hadoop; the size of the encrypted file increases by only 20% over the original file size, compared with 50% in previous AES-only work.
Abstract: Cloud computing has become attractive for huge data volumes because of its ability to provide users with on-demand, reliable, flexible, and low-cost services. With the increasing use of cloud applications, data security protection has become an important issue for the cloud. In this work, the proposed approach improves the performance of file encryption/decryption by using the AES and OTP algorithms integrated on Hadoop, where files are encrypted within HDFS and decrypted within the Map task. In previous work, encryption/decryption used the AES algorithm alone, and the size of the encrypted file increased by 50% over the original file size. The proposed approach improves this ratio: the size of the encrypted file increases by only 20% over the original file size. We also compare this approach with the previously implemented method, implement the new approach to secure HDFS, and conduct experimental studies to verify its effectiveness.
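A minimal sketch of the AES stage of such a pipeline is shown below, assuming the third-party cryptography package; the OTP stage and the actual HDFS/MapReduce integration described in the paper are not reproduced, and the function names are illustrative.

```python
# Client-side AES-CTR encryption before data lands in HDFS (sketch only).
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_before_hdfs(plaintext: bytes, key: bytes) -> bytes:
    nonce = os.urandom(16)                  # stored alongside the ciphertext
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    return nonce + enc.update(plaintext) + enc.finalize()

def decrypt_in_map_task(blob: bytes, key: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor()
    return dec.update(body) + dec.finalize()

key = os.urandom(32)                        # AES-256 key
blob = encrypt_before_hdfs(b"record 42", key)
assert decrypt_in_map_task(blob, key) == b"record 42"
```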

24 citations


Journal ArticleDOI
TL;DR: The empirical results show that the proposed architecture is helpful in saving Namenode memory overhead as well as reducing the disk seek time to a great extent.

22 citations


Proceedings ArticleDOI
26 Nov 2018
TL;DR: This work designs an adaptive tiered storage using in-memory and on-disk tables stored in a high-performance distributed database to efficiently store small files and improve their performance in HDFS.
Abstract: The Hadoop Distributed File System (HDFS) is designed to handle massive amounts of data, preferably stored in very large files. The poor performance of HDFS in managing small files has long been a bane of the Hadoop community. In many production deployments of HDFS, almost 25% of the files are less than 16 KB in size and as much as 42% of all file system operations are performed on these small files. We have designed an adaptive tiered storage using in-memory and on-disk tables stored in a high-performance distributed database to efficiently store small files and improve their performance in HDFS. Our solution is completely transparent, and it does not require any changes in the HDFS clients or the applications using the Hadoop platform. In experiments, we observed up to 61 times higher throughput in writing files, and for real-world workloads from Spotify our solution reduces the latency of reading and writing small files by factors of 3.15 and 7.39, respectively.
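The routing decision at the heart of such a tiered design is a simple size check; the sketch below uses sqlite as a stand-in for the distributed in-memory/on-disk tables, and the 16 KB threshold is taken from the measurements quoted above. It is illustrative only, not the authors' implementation.

```python
# Size-based tiering sketch: small files become database rows, large files
# follow the normal HDFS block path.
import sqlite3

SMALL_FILE_LIMIT = 16 * 1024                  # 16 KB

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE small_files (path TEXT PRIMARY KEY, data BLOB)")

def write_file(path, data, hdfs_put=lambda p, d: None):
    if len(data) <= SMALL_FILE_LIMIT:
        db.execute("INSERT OR REPLACE INTO small_files VALUES (?, ?)", (path, data))
    else:
        hdfs_put(path, data)                  # regular block-based write

def read_file(path, hdfs_get=lambda p: None):
    row = db.execute("SELECT data FROM small_files WHERE path = ?", (path,)).fetchone()
    return row[0] if row else hdfs_get(path)

write_file("/logs/tiny.json", b"{}")
print(read_file("/logs/tiny.json"))
```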

20 citations


Proceedings ArticleDOI
21 Mar 2018
TL;DR: Endolith is an auditing framework for verifying file integrity and tracking file history without third-party reliance, using a smart contract-based blockchain; it is built on Ethereum and the Hadoop Distributed File System (HDFS).
Abstract: Blockchains like Bitcoin and Ethereum have seen significant adoption in the past few years and show promise for designing applications without any centralized reliance on third parties. In this paper, we present Endolith, an auditing framework for verifying file integrity and tracking file history without third-party reliance, using a smart contract-based blockchain. Annotated files are continuously monitored, and metadata about changes, including file hashes, is stored tamper-proof on the blockchain. Based on this, Endolith can prove that a file stored a long time ago has not been changed without authorization or, if it has, track when it was changed and by whom. The Endolith implementation is based on Ethereum and the Hadoop Distributed File System (HDFS). Our evaluation on a public blockchain network shows that Endolith is efficient for files that are infrequently modified but often accessed, which are common characteristics of data archives.
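The auditing idea reduces to hashing a monitored file and appending the hash to a tamper-evident record chain. In Endolith those records live on Ethereum via a smart contract; the plain hash chain below is only a local stand-in for that storage.

```python
# File-integrity audit sketch: SHA-256 of the file, appended to a hash-linked log.
import hashlib, json, time

def file_hash(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def append_record(chain: list, path: str) -> None:
    prev = chain[-1]["record_hash"] if chain else "0" * 64
    rec = {"path": path, "file_hash": file_hash(path),
           "timestamp": time.time(), "prev": prev}
    rec["record_hash"] = hashlib.sha256(
        json.dumps(rec, sort_keys=True).encode()).hexdigest()
    chain.append(rec)

def unchanged_since_last_record(chain: list, path: str) -> bool:
    return bool(chain) and chain[-1]["file_hash"] == file_hash(path)
```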

20 citations


Proceedings ArticleDOI
01 Sep 2018
TL;DR: A novel algorithm is presented to balance data blocks on specific nodes (i.e., custom block placement) by dividing the total set of nodes into two categories, such as homogeneous vs. heterogeneous or high-performing vs. low-performing nodes.
Abstract: To store and analyze Big Data, Hadoop is the most common tool for researchers and scientists. The storage of huge amounts of data in Hadoop is done using the Hadoop Distributed File System (HDFS). HDFS uses a block placement policy to split a very large file into blocks and place them across the cluster in a distributed manner. Hadoop and HDFS have been designed to work efficiently on homogeneous clusters, but in this era of networking we cannot assume a cluster of homogeneous nodes only. So there is a need for a storage policy that works efficiently on both homogeneous and heterogeneous clusters, so that applications can be executed time-efficiently in homogeneous as well as heterogeneous environments. Data locality in Hadoop maps a data block to a process on the same node, but when dealing with Big Data it is often necessary to map data blocks to processes across multiple nodes. To deal with this, Hadoop has functionality to copy the data block to the node where the mapper is running. This causes considerable performance degradation, especially on heterogeneous clusters, due to I/O delay or network congestion. Here we present a novel algorithm to balance data blocks on specific nodes (i.e., custom block placement) by dividing the total set of nodes into two categories, such as homogeneous vs. heterogeneous or high-performing vs. low-performing nodes. This policy helps achieve better load arrangement among the nodes and lets us place data blocks exactly where we want them to be for processing.
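A toy version of category-aware placement is shown below: nodes are split into two groups and blocks are dealt out in a configurable ratio. The weighting and round-robin choice are illustrative assumptions, not the paper's exact policy.

```python
# Category-aware block placement sketch: high-performing nodes take a larger,
# configurable share of the blocks; the rest go to low-performing nodes.
from itertools import cycle

def place_blocks(blocks, high_nodes, low_nodes, high_share=0.7):
    placement, high_iter, low_iter = {}, cycle(high_nodes), cycle(low_nodes)
    for i, block in enumerate(blocks):
        to_high = (i % 10) < round(high_share * 10)
        placement[block] = next(high_iter) if to_high else next(low_iter)
    return placement

blocks = [f"blk_{i}" for i in range(10)]
print(place_blocks(blocks, ["fast1", "fast2"], ["slow1"]))
```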

19 citations


Journal ArticleDOI
TL;DR: This work produces a robust, available, scalable, and serverless solution structure, especially for storing large amounts of data in the medical field; the system achieves a high level of security through static Internet protocol addresses, user credentials, and synchronously encrypted file contents.
Abstract: Recently, the use of the internet has become widespread, increasing the use of mobile phones, tablets, computers, Internet of Things (IoT) devices, and other digital sources. With the help of new-generation digital medical equipment, the health sector has also tended to grow in an unpredictable way: it now holds nearly 10% of global data and continues to grow beyond what other sectors have. This progress has greatly enlarged the amount of produced data, which cannot be handled with conventional methods. In this work, an efficient model for the storage of medical images using a distributed file system structure has been developed. With this work, a robust, available, scalable, and serverless solution structure has been produced, especially for storing large amounts of data in the medical field. Furthermore, the system achieves a high level of security through static Internet protocol (IP) addresses, user credentials, and synchronously encrypted file contents. One of the most important features of the system is high performance and easy scalability. In this way, the system can work with fewer hardware elements and be more robust than others that use a name node architecture. According to the test results, the performance of the designed system is better than that of a Not Only Structured Query Language (NoSQL) system by 97%, a relational database management system (RDBMS) by 80%, and an operating system (OS) by 74%.

Patent
04 Jun 2018
TL;DR: A node associated with an organization may receive a storage identifier for new credit data associated with an individual, and the node may use the storage identifier to search the distributed data sources.
Abstract: A node associated with an organization may receive a storage identifier for new credit data associated with an individual. A distributed ledger and distributed data sources may be used to share the new credit data with a network of nodes. The node may update a smart contract with the storage identifier for the new credit data. The node may receive, from a particular device associated with the organization, a request for the new credit data. The node may obtain the storage identifier for the new credit data from the smart contract. The node may obtain the new credit data by using the storage identifier to search the distributed data sources. The node may provide the new credit data to the particular device. The node may perform actions to obtain additional new credit data from the distributed data sources or provide the additional new credit data to the distributed data sources.

Proceedings ArticleDOI
02 Jul 2018
TL;DR: This work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it, which leads to unbalanced data, hot spots, and performance degradation.
Abstract: The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has been performed on the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it. This leads to unbalanced data, hot spots, and performance degradation. In order to address this problem, we propose a new workload-aware balanced replica deletion algorithm. We also show that our algorithm successfully maintains the data balance and achieves up to 48% improvement in execution time when compared to HDFS, while only creating an overhead of 1.69% on average.
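The selection step of a workload-aware deletion policy can be sketched as scoring each replica holder by how much data it stores and how busy it is, then removing the replica from the highest-scoring node. The weights below are illustrative assumptions, not the paper's algorithm.

```python
# Replica-deletion sketch: when lowering the replication factor, drop the
# replica held by the most data-heavy and most loaded DataNode.
def choose_replica_to_delete(replica_nodes, node_bytes, node_load, alpha=0.5):
    """Score = alpha * normalized stored bytes + (1 - alpha) * normalized load."""
    max_b = max(node_bytes.values()) or 1
    max_l = max(node_load.values()) or 1
    def score(n):
        return alpha * node_bytes[n] / max_b + (1 - alpha) * node_load[n] / max_l
    return max(replica_nodes, key=score)

print(choose_replica_to_delete(
    ["dn1", "dn2", "dn3"],
    node_bytes={"dn1": 900, "dn2": 400, "dn3": 850},
    node_load={"dn1": 0.2, "dn2": 0.9, "dn3": 0.7}))
```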

Patent
27 Dec 2018
TL;DR: A storage controller and a container scheduler execute on processors, and a persistent storage volume is mapped to the service container, where the content of the persistent storage volume is replicated to the second storage node.
Abstract: Containerized high-performance network storage is disclosed. For example, first and second memories are associated with first and second hosts and separated by a network. A storage controller and a container scheduler execute on processors. The container scheduler instantiates first and second storage containers on the respective first and second hosts. The storage controller configures the first and second storage containers as first and second storage nodes of a distributed file system. The container scheduler instantiates a service container on the first host. The storage controller receives a persistent volume claim associated with the service container and then creates a persistent storage volume in the first storage node based on the persistent volume claim. The persistent storage volume is mapped to the service container, where a content of the persistent storage volume is replicated to the second storage node.

Posted Content
TL;DR: Hoard, using two NVMe disks per node and a distributed file system for caching, achieves a 2.1x speed-up over a 10Gb/s NFS central storage system on a 16 GPU (4 nodes, 4 GPUs per node) cluster for a challenging AlexNet ImageNet image classification benchmark.
Abstract: Deep Learning system architects strive to design a balanced system where the computational accelerator (FPGA, GPU, etc.) is not starved for data. Feeding training data fast enough to keep accelerator utilization high is difficult when utilizing dedicated hardware like GPUs. As accelerators get faster, the storage media and data buses feeding the data have not kept pace, and the ever-increasing size of training data further compounds the problem. We describe the design and implementation of a distributed caching system called Hoard that stripes the data across fast local disks of multiple GPU nodes using a distributed file system, efficiently feeding the data to ensure minimal degradation in GPU utilization due to I/O starvation. Hoard can cache the data from a central storage system before the start of the job or during the initial execution of the job, and it feeds the cached data for subsequent epochs of the same job and for different invocations of jobs that share the same data requirements, e.g. hyper-parameter tuning. Hoard exposes a POSIX file system interface so that existing deep learning frameworks can take advantage of the cache without any modifications. We show that Hoard, using two NVMe disks per node and a distributed file system for caching, achieves a 2.1x speed-up over a 10Gb/s NFS central storage system on a 16 GPU (4 nodes, 4 GPUs per node) cluster for a challenging AlexNet ImageNet image classification benchmark with a 150GB input dataset. As a result of the caching, Hoard eliminates the I/O bottlenecks introduced by the shared storage and increases the utilization of the system by 2x compared to using the shared storage without the cache.
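The striping idea can be sketched as a deterministic mapping from dataset chunks to the node-local caches, so that every node serves a share of reads in later epochs. Chunk size and paths below are assumptions, not Hoard's configuration.

```python
# Striping sketch: chunk i of a dataset is cached on node (i mod N).
CHUNK_BYTES = 64 * 1024 * 1024                 # assumed 64 MB stripes

def stripe_location(dataset, chunk_index, cache_nodes):
    node = cache_nodes[chunk_index % len(cache_nodes)]
    return f"{node}:/nvme/cache/{dataset}/chunk-{chunk_index:06d}"

nodes = ["gpu-node-0", "gpu-node-1", "gpu-node-2", "gpu-node-3"]
for i in range(6):
    print(stripe_location("imagenet-train", i, nodes))
```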

Journal ArticleDOI
TL;DR: A distributed file system with a flattened and fine-grained metadata service, LocoMeta, is proposed to bridge the performance gap between file system metadata and key-value stores.
Abstract: Key-value stores provide scalable metadata service for distributed file systems. However, the metadata's organization itself, which uses a directory tree structure, does not fit the key-value access pattern, thereby limiting performance. To address this issue, we propose a distributed file system with a flattened and fine-grained metadata service, LocoMeta, to bridge the performance gap between file system metadata and key-value stores. LocoMeta is designed to bridge the gap between file metadata and key-value stores with two techniques. First, LocoMeta flattens the directory content and structure, organizing file and directory index nodes in a flat space while reversely indexing the directory entries. Second, it exploits a fine-grained division method to improve key-value access performance. Evaluations show that LocoMeta with eight nodes boosts metadata throughput by five times, approaching 93 percent of the throughput of a single-node key-value store, compared to 18 percent for the state-of-the-art IndexFS.
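Flattening the namespace over a key-value store roughly means keying every directory entry by (parent id, name) and every inode by its id, so path resolution becomes a chain of point lookups rather than a tree walk. The sketch below illustrates that layout only; it is not LocoMeta's schema.

```python
# Flattened file-system metadata over a key-value store (illustrative layout).
kv = {}                                      # stand-in for the KV store
ROOT = 0
next_id = 1

def create(parent_id, name, is_dir):
    global next_id
    inode = next_id; next_id += 1
    kv[f"dentry/{parent_id}/{name}"] = inode          # reverse-indexed entry
    kv[f"inode/{inode}"] = {"is_dir": is_dir, "parent": parent_id}
    return inode

def lookup(path):
    inode = ROOT
    for part in filter(None, path.split("/")):
        inode = kv[f"dentry/{inode}/{part}"]          # one point lookup per step
    return inode

d = create(ROOT, "data", True)
f = create(d, "a.csv", False)
print(lookup("/data/a.csv") == f)            # True
```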

Journal ArticleDOI
TL;DR: A strategy is proposed to dynamically set the replica factor for each data item considering the popularity of data, its current replication factor and the number of active nodes present in the cloud storage to maintain an optimal number of replicas.

Proceedings ArticleDOI
02 Apr 2018
TL;DR: Results show that increasing the replication factor of the "hot" data increases the availability and locality of the data, and thus decreases the job execution time.
Abstract: The massive growth in the volume of data and the demand for big data utilisation has led to an increasing prevalence of Hadoop Distributed File System (HDFS) solutions. However, the performance of Hadoop and indeed HDFS has some limitations and remains an open problem in the research community. The ultimate goal of our research is to develop an adaptive replication system; this paper presents the first phase of the work: an investigation into the replication factor used in HDFS to determine whether increasing the replication factor for in-demand data can improve the performance of the system. We constructed a physical Hadoop cluster for our experimental environment, using TestDFSIO and both real-world and synthetic data sets (NOAA and TPC-H) with Hive to validate our proposal. Results show that increasing the replication factor of the "hot" data increases the availability and locality of the data, and thus decreases the job execution time.

Patent
04 Dec 2018
TL;DR: A cache overlay layer can store additional state information on a per-block basis that details whether each individual block of file data within the cache overlay layer is clean, dirty, or has a write back to the storage layer in progress.
Abstract: Implementations are provided herein for having at least two data streams associated with each file in a file system. The first, a cache overlay layer, can store additional state information on a per block basis that details whether each individual block of file data within the cache overlay layer is clean, dirty, or indicates that a write back to the storage layer is in progress. The second, a storage layer, can be a use case defined repository that can transform data using data augmentation methods or store unmodified raw data in local storage. File system operations directed to the cache overlay layer can be processed asynchronously from file system operations directed to the storage layer.
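The per-block state machine described in the claim (clean, dirty, write back in progress) can be sketched as follows; the class is a hypothetical illustration, not the patented implementation.

```python
# Per-block cache overlay state sketch.
from enum import Enum

class BlockState(Enum):
    CLEAN = "clean"
    DIRTY = "dirty"
    WRITEBACK = "write back in progress"

class CacheOverlay:
    def __init__(self):
        self.blocks, self.state = {}, {}

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.state[block_no] = BlockState.DIRTY

    def flush(self, block_no, storage_put):
        if self.state.get(block_no) is BlockState.DIRTY:
            self.state[block_no] = BlockState.WRITEBACK
            storage_put(block_no, self.blocks[block_no])   # asynchronous in practice
            self.state[block_no] = BlockState.CLEAN

cache = CacheOverlay()
cache.write(7, b"new bytes")
cache.flush(7, lambda n, d: print("persisting block", n))
print(cache.state[7])
```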

Patent
10 Jul 2018
TL;DR: A Spark-framework-based distributed implementation method for a GAN (generative adversarial network) is proposed, which comprises the following steps: a host node randomly initializes the network configuration and generates a parameter set; a data file is uploaded directly to a distributed file system; an elastic distributed data set of Spark is constructed; for each training data subset RDD (resilient distributed dataset), the host node transmits parameters, configuration, and network update state to all slave nodes; each slave node trains part of the data and updates the parameters; and the GAN model is trained in parallel in a data-parallel manner until Nash equilibrium is reached.
Abstract: The invention relates to a Spark-framework-based distributed implementation method for a GAN (generative adversarial network). The method comprises the following steps: a host node randomly initializes the network configuration, and a parameter set is generated; a data file is uploaded directly to a distributed file system; an elastic distributed data set of Spark is constructed; for each training data subset RDD (resilient distributed dataset), the host node transmits parameters, configuration, and network update state to all slave nodes; each slave node trains part of the data and updates the parameters; the GAN model is trained in parallel in a data-parallel manner until Nash equilibrium is reached, and training is completed. The method is intended for distributed training of GAN models in the presence of massive data. By taking full advantage of Spark's memory-based computing framework and its suitability for recursive computing, training of the GAN model can be accelerated, efficiency is improved, and better extensibility is achieved.
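The broadcast-train-average loop the patent describes can be sketched with PySpark; a plain least-squares update stands in for the GAN's generator/discriminator steps, so this only illustrates the data-parallel pattern, not an actual GAN, and it assumes a local PySpark installation.

```python
# Data-parallel parameter update sketch: broadcast parameters, compute local
# updates per partition, average them on the driver.
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[4]", "data-parallel-sketch")
data = sc.parallelize([(float(x), 3.0 * x + 1.0) for x in range(1000)], 4)

params = np.zeros(2)                               # [slope, intercept]
for epoch in range(20):
    b = sc.broadcast(params)
    def local_grad(rows):
        w, c = b.value
        g, n = np.zeros(2), 0
        for x, y in rows:
            err = (w * x + c) - y
            g += np.array([err * x, err]); n += 1
        yield g / max(n, 1)
    grads = data.mapPartitions(local_grad).collect()
    params = params - 1e-6 * np.mean(grads, axis=0)   # averaged update

print(params)                                      # slope approaches 3.0
sc.stop()
```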

Journal ArticleDOI
TL;DR: The experimental results show that the proposed scheme can greatly reduce I/O time and improve the utilization of storage servers when running database-relevant applications, compared with the commonly used block data placement strategy, i.e., the round-robin placement policy.
Abstract: This paper proposes a new data placement policy to allocate data blocks across the storage servers of distributed/parallel file systems, yielding an even block access workload distribution. To this end, we first analyze the history of the block access sequence of a specific application and then introduce a k-partition algorithm to divide data blocks into multiple groups according to their access frequency. After that, each group has almost the same access workload, and we can thus distribute these block groups onto the storage servers of the distributed file system, to achieve the goal of uniformly assigning data blocks when running the application. In summary, this newly proposed data placement policy yields not only an even data distribution but also balanced block data access. The experimental results show that the proposed scheme can greatly reduce I/O time and improve the utilization of storage servers when running database-relevant applications, compared with the commonly used block data placement strategy, i.e., the round-robin placement policy.
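A simple stand-in for the k-partition step is the classic greedy heuristic: sort blocks by access frequency and always put the next block into the currently lightest group. This is only an illustration of the balancing goal, not the paper's exact algorithm.

```python
# Greedy frequency-balanced grouping of blocks into k groups (one per server).
import heapq

def k_partition(block_freq, k):
    groups = [[] for _ in range(k)]
    heap = [(0, i) for i in range(k)]              # (group weight, group index)
    heapq.heapify(heap)
    for block, freq in sorted(block_freq.items(), key=lambda kv: -kv[1]):
        weight, i = heapq.heappop(heap)
        groups[i].append(block)
        heapq.heappush(heap, (weight + freq, i))
    return groups

freq = {"b1": 90, "b2": 70, "b3": 40, "b4": 35, "b5": 30, "b6": 5}
print(k_partition(freq, 3))                        # three roughly equal-load groups
```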

Proceedings ArticleDOI
01 Mar 2018
TL;DR: A review of the algorithms and methodologies suggested for the storage of large volumes of unstructured, real-time data and high-velocity streams in Hadoop.
Abstract: Big data is a core topic in industry and research fields as well as for society as a whole. Analytics of big data is predictive analysis rather than the traditional descriptive analysis of data. Hadoop is the most widely used tool for big data analytics in social media companies like Google, Facebook, Yahoo, and Amazon. Hadoop uses a distributed file system for the storage of large volumes of unstructured, real-time data and streams arriving at high velocity. Much importance has been given to data storage in Hadoop, but the security of the data has been largely ignored and given very little importance. We review the algorithms and methodologies that have been suggested.

Book ChapterDOI
07 Jan 2018
TL;DR: This paper shows that common file system operations can run concurrently without synchronisation, while still retaining semantics reasonably similar to the Posix hierarchical structure; the one exception is the move operation, for which it is proved that, unless synchronised, it will have anomalous behaviour.
Abstract: Distributed file systems play a vital role in large-scale enterprise services. However, the designer of a distributed file system faces a vexing choice between strong consistency and asynchronous replication. The former supports a standard sequential model by synchronising operations, but is slow and fragile. The latter is highly available and responsive, but exposes users to concurrency anomalies. In this paper, we describe a rigorous and general approach to navigating this trade-off by leveraging static verification tools that allow different file system designs to be verified. We show that common file system operations can run concurrently without synchronisation, while still retaining semantics reasonably similar to the Posix hierarchical structure. The one exception is the move operation, for which we prove that, unless synchronised, it will have anomalous behaviour.
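The move anomaly is easy to reproduce in miniature: two moves that are each legal on their own replica, when merged without synchronisation, leave a cycle that is no longer reachable from the root. The sketch below only illustrates that effect.

```python
# Concurrent, unsynchronised moves creating an orphaned cycle (toy model).
def reachable_from_root(parent):
    children = {}
    for node, p in parent.items():
        children.setdefault(p, []).append(node)
    seen, stack = set(), ["/"]
    while stack:
        n = stack.pop()
        seen.add(n)
        stack.extend(children.get(n, []))
    return seen

parent = {"/a": "/", "/b": "/"}          # both directories start under the root
parent["/a"] = "/b"                      # replica 1: move /a under /b
parent["/b"] = "/a"                      # replica 2, concurrently: move /b under /a
print(reachable_from_root(parent))       # {'/'}: /a and /b now form an orphaned cycle
```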

Proceedings ArticleDOI
01 Dec 2018
TL;DR: The experimental analysis carried out in this paper helps use Hadoop clusters effectively for research and analysis and provides suggestions for selecting the size of a Hadoop cluster according to data size and generation speed.
Abstract: Big data analytics helps in analyzing structured data transactions as well as analytics programs that contain semi-structured and unstructured data. Internet clickstream data, mobile-phone call details, and server logs are examples of big data. Relational database-oriented datasets do not fit in a traditional data warehouse, since big data sets are updated frequently and large amounts of data are generated in real time. Many open source solutions are available for handling this large-scale data. The Hadoop Distributed File System (HDFS) is one of the solutions that helps in storing, managing, and analyzing big data. Hadoop has become a standard for distributed storage and computing in big data analytics applications. It has the capability to manage distributed nodes for data storage and processing in a distributed manner. The Hadoop architecture is also described as "store everything now and decide how to process later." Challenges and issues of multi-node Hadoop cluster setup and configuration are discussed in this paper. Troubleshooting for high availability of nodes under different Hadoop cluster failure scenarios is experimented with using datasets of different sizes. The experimental analysis carried out in this paper helps use Hadoop clusters effectively for research and analysis. It also provides suggestions for selecting the size of a Hadoop cluster according to data size and generation speed.

Patent
26 Jul 2018
TL;DR: A network is described with a plurality of nodes for storing linearly integrated data records in a distributed file system, a client installed on each node, each client configured to obtain security information from at least one other node in the network, and a module contained within each client for delivering the obtained security information to an endpoint security application of the corresponding node.
Abstract: Systems and methods are provided for distributing security information. The systems and methods include a network having a plurality of nodes for storing a plurality of linearly integrated data records in a distributed file system, each linearly integrated data record including security information, a client installed on each node, each client configured to obtain the security information from at least one other node in the network, and a module contained within each client for delivering the obtained security information to an endpoint security application of the corresponding node.

Proceedings ArticleDOI
16 May 2018
TL;DR: An extensive study of the Hadoop Distributed File System (HDFS) code evolution over time, based on reports and patch files available from the official Apache issue tracker, to assist developers in improving the design of similar systems and implementing more solid systems in general.
Abstract: Frameworks for large-scale distributed data processing, such as the Hadoop ecosystem, are at the core of the big data revolution we have experienced over the last decade. In this paper, we conduct an extensive study of the Hadoop Distributed File System (HDFS)'s code evolution. Our study is based on the reports and patch files (patches) available from the official Apache issue tracker (JIRA) and our goal was to make complete use of the entire history of HDFS at the time and the richness of the available data. The purpose of our study is to assist developers in improving the design of similar systems and implementing more solid systems in general. In contrast to prior work, our study covers all reports that have been submitted over HDFS's lifetime, rather than a sampled subset. Additionally, we include all associated patch files that have been verified by the developers of the system and classify the root causes of issues at a finer granularity than prior work, by manually inspecting all 3302 reports over the first nine years, based on a two-level classification scheme that we developed. This allows us to present a different perspective of HDFS, including a focus on the system's evolution over time, as well as a detailed analysis of characteristics that have not been previously studied in detail. These include, for example, the scope and complexity of issues in terms of the size of the patch that fixes it and number of files it affects, the time it takes before an issue is exposed, the time it takes to resolve an issue and how these vary over time. Our results indicate that bug reports constitute the most dominant type, having a continuously increasing rate over time. Moreover, the overall scope and complexity of reports and patch files remain surprisingly stable throughout HDFS' lifetime, despite the significant growth the code base experiences over time. Finally, as part of our work, we created a detailed database that includes all reports and patches, along with the key characteristics we extracted.

Journal ArticleDOI
TL;DR: To improve MapReduce processing performance, the proposed method reduces JVM creation time by reusing a single JVM to run multiple mappers (rather than creating a JVM for every mapper).
Abstract: Hadoop uses the Hadoop distributed file system for storing big data, and uses MapReduce to process big data in cloud computing environments. Because Hadoop is optimized for large file sizes, it has difficulties processing large numbers of small files. A small file can be defined as any file that is significantly smaller than the Hadoop block size, which is typically set to 64 MB. Hadoop is optimized to store data in relatively large files, and thus suffers from name node memory insufficiency and increased scheduling and processing time when processing large numbers of small files. This study proposes a performance improvement method for MapReduce processing, which integrates the CombineFileInputFormat method and the reuse feature of the Java Virtual Machine (JVM). Existing methods create a mapper for every small file. Unlike these methods, the proposed method reduces the number of created mappers by processing large numbers of files that are combined by a single split using CombineFileInputFormat. Moreover, to improve MapReduce processing performance, the proposed method reduces JVM creation time by reusing a single JVM to run multiple mappers (rather than creating a JVM for every mapper).
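What CombineFileInputFormat achieves can be pictured as packing many small files into far fewer splits capped at the block size, so far fewer mappers (and, with JVM reuse, far fewer JVMs) are created. The sketch below models only that packing; sizes are illustrative.

```python
# Packing small files into combined splits, capped at the HDFS block size.
BLOCK_SIZE = 64 * 1024 * 1024                  # 64 MB, as in the abstract

def combine_into_splits(files):
    """files: list of (path, size_in_bytes) -> list of splits (lists of paths)."""
    splits, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > BLOCK_SIZE:
            splits.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        splits.append(current)
    return splits

small_files = [(f"/in/part-{i:05d}", 2 * 1024 * 1024) for i in range(100)]  # 2 MB each
print(len(combine_into_splits(small_files)))   # 4 splits instead of 100 mappers
```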

Book ChapterDOI
07 Jul 2018
TL;DR: This work proposes the Small Hadoop Distributed File System (SHDFS), which is based on the original HDFS and adds two novel modules: a merging module, which finds correlated files by user-based collaborative filtering and merges them into a single large file to reduce the total number of files, and a caching module.
Abstract: The Hadoop Distributed File System (HDFS) is designed to reliably store and manage large-scale files. All the files in HDFS are managed by a single server, the NameNode. The NameNode stores metadata, in its main memory, for each file stored in HDFS. HDFS suffers a performance penalty as the number of small files increases. Storing and managing a mass of small files imposes a heavy burden on the NameNode, and the number of files that can be stored in HDFS is constrained by the size of the NameNode's main memory. In order to improve the efficiency of storing and accessing small files on HDFS, we propose the Small Hadoop Distributed File System (SHDFS), which is based on the original HDFS. Compared to the original HDFS, we add two novel modules to the proposed SHDFS: a merging module and a caching module. In the merging module, a correlated-files model is proposed, which is used to find correlated files by user-based collaborative filtering and then merge them into a single large file to reduce the total number of files. In the caching module, we use a log-linear model to identify hot-spot data that users frequently access, and we design a special memory subsystem to cache these hot-spot data. The caching mechanism speeds up access to hot-spot data.
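The output of a merging module like SHDFS's can be pictured as one large file plus an offset index over its members; the collaborative-filtering step that decides which files are correlated is not reproduced here, and the format below is an illustrative assumption.

```python
# Packing correlated small files into one blob with an offset index.
def merge_files(contents):
    """contents: {name: bytes} -> (merged_blob, {name: (offset, length)})."""
    index, blob, offset = {}, bytearray(), 0
    for name, data in contents.items():
        index[name] = (offset, len(data))
        blob.extend(data)
        offset += len(data)
    return bytes(blob), index

def read_member(blob, index, name):
    offset, length = index[name]
    return blob[offset:offset + length]

blob, idx = merge_files({"a.xml": b"<a/>", "b.xml": b"<b/>", "c.xml": b"<c/>"})
print(read_member(blob, idx, "b.xml"))         # b'<b/>'
```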

Patent
22 Jun 2018
TL;DR: A dynamic load balancing method for a distributed file system under a cloud environment is presented, which mainly comprises the steps of acquiring information from all nodes of the distributed file system under the cloud environment and computing the threshold needed for the system to be balanced according to the disk space utilization rate, CPU utilization rate, and memory utilization rate of each node.
Abstract: The invention discloses a dynamic load balancing method for a distributed file system under a cloud environment. The method mainly comprises the steps of acquiring information from all nodes of the distributed file system under the cloud environment, judging whether the file system is balanced, computing the threshold needed for the system to be balanced according to the disk space utilization rate, CPU utilization rate, memory utilization rate, disk I/O occupancy rate, and network bandwidth occupancy rate of each node, and performing imbalance adjustment on the load of the file system according to the threshold and the disk space utilization rate. The method balances the load of the file system while supporting the execution of cloud computation tasks, and continuously adjusts the load by monitoring node information, thereby improving the execution efficiency of cloud computation over the file system.
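The kind of load score and balance check the patent describes can be sketched as a weighted mix of the five utilisation metrics plus a deviation threshold around the cluster mean; the weights and threshold below are illustrative assumptions.

```python
# Weighted node-load score and imbalance detection sketch.
WEIGHTS = {"disk": 0.35, "cpu": 0.2, "mem": 0.15, "disk_io": 0.15, "net": 0.15}

def node_load(metrics):
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

def find_imbalanced(nodes, threshold=0.1):
    loads = {n: node_load(m) for n, m in nodes.items()}
    mean = sum(loads.values()) / len(loads)
    over = [n for n, l in loads.items() if l > mean + threshold]
    under = [n for n, l in loads.items() if l < mean - threshold]
    return over, under

cluster = {
    "dn1": {"disk": 0.92, "cpu": 0.6, "mem": 0.7, "disk_io": 0.8, "net": 0.5},
    "dn2": {"disk": 0.40, "cpu": 0.3, "mem": 0.4, "disk_io": 0.2, "net": 0.3},
    "dn3": {"disk": 0.55, "cpu": 0.5, "mem": 0.5, "disk_io": 0.4, "net": 0.4},
}
print(find_imbalanced(cluster))                # (['dn1'], ['dn2'])
```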

Patent
20 Jul 2018
TL;DR: A business big data analysis system takes a Browser/Server structure as its overall structure and includes WIFI probes, a receiving server, a distributed data analysis system, an HDFS distributed file system, an HBASE distributed database, and a network connecting the WIFI probes, the receiving server, and the other components.
Abstract: The invention relates to a business big data analysis system. The business big data analysis system takes a Browser/Server structure as the overall structure of the analysis system and includes WIFI probes, a receiving server, a distributed data analysis system, an HDFS distributed file system, an HBASE distributed database, and a network connecting the WIFI probes, the receiving server, the distributed data analysis system, the HDFS distributed file system, and the HBASE distributed database. The system employs the WIFI probe devices to detect mobile phone MAC addresses and extract business passenger flow data for big data analysis, takes the B/S structure as the overall structure of the system, puts the management system at the Web terminal, utilizes a single physical server to construct a server cluster to collect data and store it at the server side in a distributed manner, and performs effective storage, analysis, and classification of mass data to form an organized business big data analysis system that is convenient to operate, so that a user can conveniently check the business passenger flow data index analysis results through the system.