
Showing papers on "Distributed File System" published in 2018


Journal ArticleDOI
01 Aug 2018
TL;DR: ParallelRaft is developed, a consensus protocol derived from Raft, which breaks Raft's strict serialization by exploiting the out-of-order I/O completion tolerance capability of databases.
Abstract: PolarFS is a distributed file system with ultra-low latency and high availability, designed for the POLARDB database service, which is now available on the Alibaba Cloud. PolarFS utilizes a lightweight network stack and I/O stack in user space, taking full advantage of emerging technologies such as RDMA, NVMe, and SPDK. In this way, the end-to-end latency of PolarFS has been reduced drastically, and our experiments show that the write latency of PolarFS is quite close to that of a local file system on SSD. To keep replicas consistent while maximizing I/O throughput for PolarFS, we develop ParallelRaft, a consensus protocol derived from Raft, which breaks Raft's strict serialization by exploiting the out-of-order I/O completion tolerance of databases. ParallelRaft inherits the understandability and ease of implementation of Raft while providing much better I/O scalability for PolarFS. We also describe the shared storage architecture of PolarFS, which provides strong support for POLARDB.
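The core relaxation in ParallelRaft can be pictured with a toy check: a later log entry may be acknowledged before earlier ones complete, as long as it does not touch the same block range. The sketch below is only illustrative and assumes writes are described by byte offset and length; none of the names are PolarFS APIs.

```python
# Illustrative sketch (not PolarFS code): ParallelRaft-style out-of-order
# acknowledgement. A write may be acknowledged while earlier entries are still
# pending, provided none of those pending entries overlap its block range.
from dataclasses import dataclass
from typing import List

@dataclass
class WriteEntry:
    index: int      # position in the replication log
    offset: int     # starting byte offset of the write
    length: int     # number of bytes written
    done: bool = False

def overlaps(a: WriteEntry, b: WriteEntry) -> bool:
    return a.offset < b.offset + b.length and b.offset < a.offset + a.length

def can_ack_out_of_order(log: List[WriteEntry], entry: WriteEntry) -> bool:
    """True if no earlier, still-pending entry touches the same range."""
    return not any(
        e.index < entry.index and not e.done and overlaps(e, entry)
        for e in log
    )

log = [WriteEntry(1, 0, 4096), WriteEntry(2, 8192, 4096)]
print(can_ack_out_of_order(log, log[1]))   # True: entry 2 may be acked before entry 1
```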

76 citations


Proceedings ArticleDOI
01 Sep 2018
TL;DR: GekkoFS is a temporary, highly-scalable burst buffer file system which has been specifically optimized for new access patterns of data-intensive High-Performance Computing applications, significantly outperforming the capabilities of general-purpose parallel file systems.
Abstract: We present GekkoFS, a temporary, highly scalable burst buffer file system which has been specifically optimized for the new access patterns of data-intensive High-Performance Computing (HPC) applications. The file system provides relaxed POSIX semantics, offering only those features that are actually required by most (though not all) applications. It provides scalable I/O performance and reaches millions of metadata operations even on a small number of nodes, significantly outperforming general-purpose parallel file systems.
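One way burst buffers of this kind avoid a central metadata server is to spread metadata across the participating nodes, for example by hashing the file path. The sketch below shows only that placement idea under assumed names; it is not the GekkoFS implementation.

```python
# Hash-based metadata placement sketch: the node owning a file's metadata is
# derived from a hash of its path, so lookups need no central server.
import hashlib
from typing import List

def metadata_owner(path: str, nodes: List[str]) -> str:
    digest = hashlib.md5(path.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

nodes = ["node00", "node01", "node02", "node03"]
for p in ["/job/out/rank0.dat", "/job/out/rank1.dat", "/job/ckpt/step10"]:
    print(p, "->", metadata_owner(p, nodes))
```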

37 citations


Proceedings ArticleDOI
11 Oct 2018
TL;DR: Wharf is introduced, a middleware that transparently adds distributed storage support to Docker; it partitions Docker's runtime state into local and global parts, efficiently synchronizes accesses to the global state, and minimizes the synchronization overhead.
Abstract: Container management frameworks, such as Docker, package diverse applications and their complex dependencies in self-contained images, which facilitates application deployment, distribution, and sharing. Currently, Docker employs a shared-nothing storage architecture, i.e. every Docker-enabled host requires its own copy of an image on local storage to create and run containers. This greatly inflates storage utilization, network load, and job completion times in the cluster. In this paper, we investigate the option of storing container images in and serving them from a distributed file system. By sharing images in a distributed storage layer, storage utilization can be reduced and redundant image retrievals from a Docker registry become unnecessary. We introduce Wharf, a middleware to transparently add distributed storage support to Docker. Wharf partitions Docker's runtime state into local and global parts and efficiently synchronizes accesses to the global state. By exploiting the layered structure of Docker images, Wharf minimizes the synchronization overhead. Our experiments show that compared to Docker on local storage, Wharf can speed up image retrievals by up to 12x, has more stable performance, and introduces only a minor overhead when accessing data on distributed storage.
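The key idea of synchronizing only the global part of the runtime state, at the granularity of image layers, can be sketched as follows; the class and methods are hypothetical stand-ins, not Wharf's interfaces.

```python
# Per-layer synchronization sketch: the first host needing a missing layer
# fetches it into the shared store; others wait on that layer's lock only.
import threading

class SharedLayerStore:
    def __init__(self):
        self._locks = {}             # layer digest -> lock (global state)
        self._present = set()        # layers already in shared storage
        self._guard = threading.Lock()

    def ensure_layer(self, digest, fetch):
        with self._guard:
            lock = self._locks.setdefault(digest, threading.Lock())
        with lock:                   # contention is limited to one layer
            if digest not in self._present:
                fetch(digest)        # e.g. pull from the registry into the DFS
                self._present.add(digest)

store = SharedLayerStore()
store.ensure_layer("sha256:abc123", lambda d: print("pulling", d))
store.ensure_layer("sha256:abc123", lambda d: print("pulling", d))  # no second pull
```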

26 citations


Journal ArticleDOI
TL;DR: Novel methods, referred to as Opass, are proposed to optimize parallel data reads, as well as to reduce the imbalance of parallel writes on distributed file systems to benefit parallel data-intensive analysis and balanced data access.
Abstract: The distributed file system HDFS is widely deployed as the bedrock for much parallel big data analysis. However, when multiple parallel applications run over the shared file system, the data requests from different processes/executors are unfortunately served in a surprisingly imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes arise because: a) unlike conventional parallel file systems, which use striping policies to evenly distribute data among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk file, in several copies placed by a relatively random policy, which can result in an uneven data distribution among storage nodes; and b) under the data retrieval policy of HDFS, the more data a storage node contains, the higher the probability that it will be selected to serve the data. Therefore, on nodes serving multiple chunk files, data requests from different processes/executors compete for shared resources such as the hard disk head and network bandwidth, resulting in degraded I/O performance. In this paper, we first conduct a complete analysis of how remote and imbalanced read/write patterns occur and how they are affected by the size of the cluster. We then propose novel methods, referred to as Opass, to optimize parallel data reads and to reduce the imbalance of parallel writes on distributed file systems. Our proposed methods can benefit parallel data-intensive analysis with various parallel data access strategies. Opass adopts new matching-based algorithms to match processes to data so as to achieve the maximum degree of data locality and balanced data access. Furthermore, to reduce the imbalance of parallel writes, Opass employs a heatmap for monitoring the I/O status of storage nodes and applies an HM-LRU policy to select a locally optimal storage node for serving write requests. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results from both benchmarks and well-known parallel applications show the performance benefits and scalability of Opass.
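The locality-and-balance objective behind Opass can be illustrated with a greedy stand-in: each chunk is assigned to the least-loaded node that holds one of its replicas. The paper uses matching-based algorithms; this sketch only conveys the objective, and all names are illustrative.

```python
# Greedy locality-aware assignment sketch (stand-in for Opass's matching).
def assign_chunks(chunk_replicas, nodes):
    """chunk_replicas: {chunk_id: [nodes holding a replica]}"""
    load = {n: 0 for n in nodes}
    assignment = {}
    for chunk, holders in chunk_replicas.items():
        candidates = holders or nodes            # fall back to a remote read
        target = min(candidates, key=lambda n: load[n])
        assignment[chunk] = target
        load[target] += 1
    return assignment, load

replicas = {"c1": ["n1", "n2"], "c2": ["n2", "n3"], "c3": ["n2"], "c4": ["n1", "n3"]}
print(assign_chunks(replicas, ["n1", "n2", "n3"]))
```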

25 citations


01 Jan 2018
TL;DR: This dissertation proposes an architecture with a Virtual Distributed File System (VDFS) as a new layer between the compute layer and the storage layer, and achieves these goals through an implementation of VDFS called Alluxio (formerly Tachyon), which presents a set of disparate data stores as a single file system.
Abstract: Author(s): Li, Haoyuan | Advisor(s): Stoica, Ion; Shenker, Scott | The world is entering the data revolution era. Along with the latest advancements of the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and the Internet of Things (IoT), the amount of data we are generating, collecting, storing, managing, and analyzing is growing exponentially. Storing and processing these data has exposed tremendous challenges and opportunities. Over the past two decades, we have seen significant innovation in the data stack. For example, in the computation layer, the ecosystem started from the MapReduce framework and grew into many different general and specialized systems, such as Apache Spark for general data processing; Apache Storm and Apache Samza for stream processing; Apache Mahout for machine learning; TensorFlow and Caffe for deep learning; and Presto and Apache Drill for SQL workloads. There are more than a hundred popular frameworks for various workloads, and the number is growing. Similarly, the storage layer of the ecosystem grew from the Apache Hadoop Distributed File System (HDFS) to a variety of choices as well, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases, to realize different tradeoffs in cost, speed, and semantics. This increasing complexity in the stack creates challenges on multiple fronts. Data is siloed in various storage systems, making it difficult for users and applications to find and access the data efficiently. For system developers, it requires more work to integrate a new compute or storage component as a building block into the existing ecosystem. For data application developers, understanding and managing the correct way to access different data stores becomes more complex. For end users, accessing data from various and often remote data stores often results in a performance penalty and a semantics mismatch. For system admins, adding, removing, or upgrading an existing compute or data store, or migrating data from one store to another, can be arduous if the physical storage has been deeply coupled with all applications. To address these challenges, this dissertation proposes an architecture with a Virtual Distributed File System (VDFS) as a new layer between the compute layer and the storage layer. Adding VDFS to the stack brings many benefits. Specifically, VDFS enables global data accessibility for different compute frameworks, efficient in-memory data sharing and management across applications and data stores, high I/O performance and efficient use of network bandwidth, and a flexible choice of compute and storage. Meanwhile, as the layer through which data is accessed and data metrics and usage patterns are collected, it also provides users insight into their data and can be used to optimize data access based on workloads. We achieve these goals through an implementation of VDFS called Alluxio (formerly Tachyon). Alluxio presents a set of disparate data stores as a single file system, greatly reducing the complexity of the storage APIs and semantics exposed to applications. Alluxio is designed with a memory-centric architecture, enabling applications to leverage memory-speed I/O simply by using Alluxio. Alluxio has been deployed at hundreds of leading companies in production, serving critical workloads. Its open-source community has attracted more than 800 contributors worldwide from over 200 companies. In this dissertation, we also investigate lineage as an important technique in the VDFS to improve write performance, and we propose DFS-Perf, a scalable distributed file system performance evaluation framework, to help researchers and developers better design and implement systems in the Alluxio ecosystem.
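The essence of a virtual distributed file system is a single namespace whose prefixes are mounted onto different under-stores. The sketch below shows that routing idea only; the classes are hypothetical and much simpler than Alluxio's actual API, caching, and lineage machinery.

```python
# Minimal virtual-DFS facade: one namespace, prefix-based mounts to backends.
class VirtualDFS:
    def __init__(self):
        self._mounts = {}                         # prefix -> backend

    def mount(self, prefix, backend):
        self._mounts[prefix] = backend

    def _resolve(self, path):
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path.startswith(prefix):
                return self._mounts[prefix], path[len(prefix):]
        raise FileNotFoundError(path)

    def read(self, path):
        backend, rel = self._resolve(path)
        return backend.read(rel)

class DictStore:                                  # stand-in for HDFS, S3, etc.
    def __init__(self, data): self.data = data
    def read(self, rel): return self.data[rel]

vfs = VirtualDFS()
vfs.mount("/hdfs", DictStore({"/logs/a": b"hdfs bytes"}))
vfs.mount("/s3", DictStore({"/models/m": b"s3 bytes"}))
print(vfs.read("/hdfs/logs/a"), vfs.read("/s3/models/m"))
```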

24 citations


Proceedings ArticleDOI
01 Feb 2018
TL;DR: The proposed approach improves the performance of file encryption/decryption by using the AES and OTP algorithms integrated on Hadoop; the size of the encrypted file increases by only 20% over the original file size, compared with 50% in previous AES-only work.
Abstract: Cloud computing has become attractive for huge data volumes because of its ability to provide users with on-demand, reliable, flexible, and low-cost services. With the increasing use of cloud applications, data security protection has become an important issue for the cloud. In this work, the proposed approach improves the performance of file encryption/decryption by using the AES and OTP algorithms integrated on Hadoop, where files are encrypted within HDFS and decrypted within the Map task. In previous work, encryption/decryption used the AES algorithm alone, and the size of the encrypted file increased by 50% over the original file size. The proposed approach improves this ratio: the size of the encrypted file increases by only 20% over the original file size. We also compare this approach with the previously implemented method, implement the new approach to secure HDFS, and conduct experimental studies to verify its effectiveness.
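A minimal sketch of the AES stage of such a pipeline is shown below, assuming the third-party cryptography package; the OTP stage and the actual HDFS/MapReduce integration described in the paper are not reproduced, and the function names are illustrative.

```python
# Client-side AES-CTR encryption before data lands in HDFS (sketch only).
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_before_hdfs(plaintext: bytes, key: bytes) -> bytes:
    nonce = os.urandom(16)                  # stored alongside the ciphertext
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    return nonce + enc.update(plaintext) + enc.finalize()

def decrypt_in_map_task(blob: bytes, key: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor()
    return dec.update(body) + dec.finalize()

key = os.urandom(32)                        # AES-256 key
blob = encrypt_before_hdfs(b"record 42", key)
assert decrypt_in_map_task(blob, key) == b"record 42"
```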

24 citations


Journal ArticleDOI
TL;DR: The empirical results show that the proposed architecture is helpful in saving Namenode memory overhead as well as reducing the disk seek time to a great extent.

22 citations


Proceedings ArticleDOI
26 Nov 2018
TL;DR: This work designs an adaptive tiered storage using in-memory and on-disk tables stored in a high-performance distributed database to efficiently store small files and improve their performance in HDFS.
Abstract: The Hadoop Distributed File System (HDFS) is designed to handle massive amounts of data, preferably stored in very large files. The poor performance of HDFS in managing small files has long been a bane of the Hadoop community. In many production deployments of HDFS, almost 25% of the files are less than 16 KB in size and as much as 42% of all file system operations are performed on these small files. We have designed an adaptive tiered storage using in-memory and on-disk tables stored in a high-performance distributed database to efficiently store small files and improve their performance in HDFS. Our solution is completely transparent, and it does not require any changes in the HDFS clients or the applications using the Hadoop platform. In experiments, we observed up to 61 times higher throughput in writing files, and for real-world workloads from Spotify our solution reduces the latency of reading and writing small files by factors of 3.15 and 7.39, respectively.
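The routing decision at the heart of such a tiered design is a simple size check; the sketch below uses sqlite as a stand-in for the distributed in-memory/on-disk tables, and the 16 KB threshold is taken from the measurements quoted above. It is illustrative only, not the authors' implementation.

```python
# Size-based tiering sketch: small files become database rows, large files
# follow the normal HDFS block path.
import sqlite3

SMALL_FILE_LIMIT = 16 * 1024                  # 16 KB

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE small_files (path TEXT PRIMARY KEY, data BLOB)")

def write_file(path, data, hdfs_put=lambda p, d: None):
    if len(data) <= SMALL_FILE_LIMIT:
        db.execute("INSERT OR REPLACE INTO small_files VALUES (?, ?)", (path, data))
    else:
        hdfs_put(path, data)                  # regular block-based write

def read_file(path, hdfs_get=lambda p: None):
    row = db.execute("SELECT data FROM small_files WHERE path = ?", (path,)).fetchone()
    return row[0] if row else hdfs_get(path)

write_file("/logs/tiny.json", b"{}")
print(read_file("/logs/tiny.json"))
```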

20 citations


Proceedings ArticleDOI
21 Mar 2018
TL;DR: Endolith is an auditing framework for verifying file integrity and tracking file history without third-party reliance, using a smart contract-based blockchain; it is built on Ethereum and the Hadoop Distributed File System (HDFS).
Abstract: Blockchains like Bitcoin and Ethereum have seen significant adoption in the past few years and show promise for designing applications without any centralized reliance on third parties. In this paper, we present Endolith, an auditing framework for verifying file integrity and tracking file history without third-party reliance, using a smart contract-based blockchain. Annotated files are continuously monitored, and metadata about changes, including file hashes, is stored tamper-proof on the blockchain. Based on this, Endolith can prove that a file stored a long time ago has not been changed without authorization or, if it has, track when it was changed and by whom. The Endolith implementation is based on Ethereum and the Hadoop Distributed File System (HDFS). Our evaluation on a public blockchain network shows that Endolith is efficient for files that are infrequently modified but often accessed, which are common characteristics of data archives.
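The auditing idea reduces to hashing a monitored file and appending the hash to a tamper-evident record chain. In Endolith those records live on Ethereum via a smart contract; the plain hash chain below is only a local stand-in for that storage.

```python
# File-integrity audit sketch: SHA-256 of the file, appended to a hash-linked log.
import hashlib, json, time

def file_hash(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def append_record(chain: list, path: str) -> None:
    prev = chain[-1]["record_hash"] if chain else "0" * 64
    rec = {"path": path, "file_hash": file_hash(path),
           "timestamp": time.time(), "prev": prev}
    rec["record_hash"] = hashlib.sha256(
        json.dumps(rec, sort_keys=True).encode()).hexdigest()
    chain.append(rec)

def unchanged_since_last_record(chain: list, path: str) -> bool:
    return bool(chain) and chain[-1]["file_hash"] == file_hash(path)
```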

20 citations


Proceedings ArticleDOI
01 Sep 2018
TL;DR: A novel algorithm is presented to balance data blocks on specific nodes (i.e., custom block placement) by dividing the total set of nodes into two categories, such as homogeneous vs. heterogeneous or high-performing vs. low-performing nodes.
Abstract: To store and analyze Big Data, Hadoop is the most common tool for researchers and scientists. The storage of huge amounts of data in Hadoop is done using the Hadoop Distributed File System (HDFS). HDFS uses a block placement policy to split a very large file into blocks and place them across the cluster in a distributed manner. Hadoop and HDFS have been designed to work efficiently on homogeneous clusters, but in this era of networking we cannot assume a cluster of homogeneous nodes only. So there is a need for a storage policy that works efficiently on both homogeneous and heterogeneous clusters, so that applications can be executed time-efficiently in homogeneous as well as heterogeneous environments. Data locality in Hadoop maps a data block to a process on the same node, but when dealing with Big Data it is often necessary to map data blocks to processes across multiple nodes. To deal with this, Hadoop has functionality to copy the data block to the node where the mapper is running. This causes considerable performance degradation, especially on heterogeneous clusters, due to I/O delay or network congestion. Here we present a novel algorithm to balance data blocks on specific nodes (i.e., custom block placement) by dividing the total set of nodes into two categories, such as homogeneous vs. heterogeneous or high-performing vs. low-performing nodes. This policy helps achieve better load arrangement among the nodes and lets us place data blocks exactly where we want them to be for processing.
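A toy version of category-aware placement is shown below: nodes are split into two groups and blocks are dealt out in a configurable ratio. The weighting and round-robin choice are illustrative assumptions, not the paper's exact policy.

```python
# Category-aware block placement sketch: high-performing nodes take a larger,
# configurable share of the blocks; the rest go to low-performing nodes.
from itertools import cycle

def place_blocks(blocks, high_nodes, low_nodes, high_share=0.7):
    placement, high_iter, low_iter = {}, cycle(high_nodes), cycle(low_nodes)
    for i, block in enumerate(blocks):
        to_high = (i % 10) < round(high_share * 10)
        placement[block] = next(high_iter) if to_high else next(low_iter)
    return placement

blocks = [f"blk_{i}" for i in range(10)]
print(place_blocks(blocks, ["fast1", "fast2"], ["slow1"]))
```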

19 citations


Journal ArticleDOI
TL;DR: This work produces a robust, available, scalable, and serverless solution structure, especially for storing large amounts of data in the medical field; the system achieves a high level of security through static Internet protocol addresses, user credentials, and synchronously encrypted file contents.
Abstract: Recently, the use of the internet has become widespread, increasing the use of mobile phones, tablets, computers, Internet of Things (IoT) devices, and other digital sources. With the help of new-generation digital medical equipment, the health sector has also tended to grow in an unpredictable way: it now holds nearly 10% of global data and continues to grow beyond what other sectors have. This progress has greatly enlarged the amount of produced data, which cannot be handled with conventional methods. In this work, an efficient model for the storage of medical images using a distributed file system structure has been developed. With this work, a robust, available, scalable, and serverless solution structure has been produced, especially for storing large amounts of data in the medical field. Furthermore, the system achieves a high level of security through static Internet protocol (IP) addresses, user credentials, and synchronously encrypted file contents. One of the most important features of the system is high performance and easy scalability. In this way, the system can work with fewer hardware elements and be more robust than others that use a name node architecture. According to the test results, the performance of the designed system is better than that of a Not Only Structured Query Language (NoSQL) system by 97%, a relational database management system (RDBMS) by 80%, and an operating system (OS) by 74%.

Patent
04 Jun 2018
TL;DR: A node associated with an organization may receive a storage identifier for new credit data associated with an individual, and the node may use the storage identifier to search the distributed data sources.
Abstract: A node associated with an organization may receive a storage identifier for new credit data associated with an individual. A distributed ledger and distributed data sources may be used to share the new credit data with a network of nodes. The node may update a smart contract with the storage identifier for the new credit data. The node may receive, from a particular device associated with the organization, a request for the new credit data. The node may obtain the storage identifier for the new credit data from the smart contract. The node may obtain the new credit data by using the storage identifier to search the distributed data sources. The node may provide the new credit data to the particular device. The node may perform actions to obtain additional new credit data from the distributed data sources or provide the additional new credit data to the distributed data sources.

Proceedings ArticleDOI
02 Jul 2018
TL;DR: This work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it, which leads to unbalanced data, hot spots, and performance degradation.
Abstract: The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has been performed on the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it. This leads to unbalanced data, hot spots, and performance degradation. In order to address this problem, we propose a new workload-aware balanced replica deletion algorithm. We also show that our algorithm successfully maintains the data balance and achieves up to 48% improvement in execution time when compared to HDFS, while only creating an overhead of 1.69% on average.
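The selection step of a workload-aware deletion policy can be sketched as scoring each replica holder by how much data it stores and how busy it is, then removing the replica from the highest-scoring node. The weights below are illustrative assumptions, not the paper's algorithm.

```python
# Replica-deletion sketch: when lowering the replication factor, drop the
# replica held by the most data-heavy and most loaded DataNode.
def choose_replica_to_delete(replica_nodes, node_bytes, node_load, alpha=0.5):
    """Score = alpha * normalized stored bytes + (1 - alpha) * normalized load."""
    max_b = max(node_bytes.values()) or 1
    max_l = max(node_load.values()) or 1
    def score(n):
        return alpha * node_bytes[n] / max_b + (1 - alpha) * node_load[n] / max_l
    return max(replica_nodes, key=score)

print(choose_replica_to_delete(
    ["dn1", "dn2", "dn3"],
    node_bytes={"dn1": 900, "dn2": 400, "dn3": 850},
    node_load={"dn1": 0.2, "dn2": 0.9, "dn3": 0.7}))
```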

Patent
27 Dec 2018
TL;DR: A storage controller and a container scheduler execute on processors, and a persistent storage volume is mapped to the service container, where the content of the persistent storage volume is replicated to the second storage node.
Abstract: Containerized high-performance network storage is disclosed. For example, first and second memories are associated with first and second hosts and separated by a network. A storage controller and a container scheduler execute on processors. The container scheduler instantiates first and second storage containers on the respective first and second hosts. The storage controller configures the first and second storage containers as first and second storage nodes of a distributed file system. The container scheduler instantiates a service container on the first host. The storage controller receives a persistent volume claim associated with the service container and then creates a persistent storage volume in the first storage node based on the persistent volume claim. The persistent storage volume is mapped to the service container, where a content of the persistent storage volume is replicated to the second storage node.

Posted Content
TL;DR: Hoard, using two NVMe disks per node and a distributed file system for caching, achieves a 2.1x speed-up over a 10Gb/s NFS central storage system on a 16 GPU (4 nodes, 4 GPUs per node) cluster for a challenging AlexNet ImageNet image classification benchmark.
Abstract: Deep Learning system architects strive to design a balanced system where the computational accelerator (FPGA, GPU, etc.) is not starved for data. Feeding training data fast enough to keep accelerator utilization high is difficult when utilizing dedicated hardware like GPUs. As accelerators get faster, the storage media and data buses feeding the data have not kept pace, and the ever-increasing size of training data further compounds the problem. We describe the design and implementation of a distributed caching system called Hoard that stripes the data across fast local disks of multiple GPU nodes using a distributed file system, efficiently feeding the data to ensure minimal degradation in GPU utilization due to I/O starvation. Hoard can cache the data from a central storage system before the start of the job or during the initial execution of the job, and it feeds the cached data for subsequent epochs of the same job and for different invocations of jobs that share the same data requirements, e.g. hyper-parameter tuning. Hoard exposes a POSIX file system interface so that existing deep learning frameworks can take advantage of the cache without any modifications. We show that Hoard, using two NVMe disks per node and a distributed file system for caching, achieves a 2.1x speed-up over a 10Gb/s NFS central storage system on a 16 GPU (4 nodes, 4 GPUs per node) cluster for a challenging AlexNet ImageNet image classification benchmark with a 150GB input dataset. As a result of the caching, Hoard eliminates the I/O bottlenecks introduced by the shared storage and increases the utilization of the system by 2x compared to using the shared storage without the cache.
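The striping idea can be sketched as a deterministic mapping from dataset chunks to the node-local caches, so that every node serves a share of reads in later epochs. Chunk size and paths below are assumptions, not Hoard's configuration.

```python
# Striping sketch: chunk i of a dataset is cached on node (i mod N).
CHUNK_BYTES = 64 * 1024 * 1024                 # assumed 64 MB stripes

def stripe_location(dataset, chunk_index, cache_nodes):
    node = cache_nodes[chunk_index % len(cache_nodes)]
    return f"{node}:/nvme/cache/{dataset}/chunk-{chunk_index:06d}"

nodes = ["gpu-node-0", "gpu-node-1", "gpu-node-2", "gpu-node-3"]
for i in range(6):
    print(stripe_location("imagenet-train", i, nodes))
```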

Journal ArticleDOI
TL;DR: A distributed file system with a flattened and fine-grained metadata service, LocoMeta, is proposed to bridge the performance gap between file system metadata and key-value stores.
Abstract: Key-value stores provide scalable metadata service for distributed file systems. However, the metadata's organization itself, which uses a directory tree structure, does not fit the key-value access pattern, thereby limiting performance. To address this issue, we propose a distributed file system with a flattened and fine-grained metadata service, LocoMeta, to bridge the performance gap between file system metadata and key-value stores. LocoMeta is designed to bridge the gap between file metadata and key-value stores with two techniques. First, LocoMeta flattens the directory content and structure, organizing file and directory index nodes in a flat space while reversely indexing the directory entries. Second, it exploits a fine-grained division method to improve key-value access performance. Evaluations show that LocoMeta with eight nodes boosts metadata throughput by five times, approaching 93 percent of the throughput of a single-node key-value store, compared to 18 percent for the state-of-the-art IndexFS.
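Flattening the namespace over a key-value store roughly means keying every directory entry by (parent id, name) and every inode by its id, so path resolution becomes a chain of point lookups rather than a tree walk. The sketch below illustrates that layout only; it is not LocoMeta's schema.

```python
# Flattened file-system metadata over a key-value store (illustrative layout).
kv = {}                                      # stand-in for the KV store
ROOT = 0
next_id = 1

def create(parent_id, name, is_dir):
    global next_id
    inode = next_id; next_id += 1
    kv[f"dentry/{parent_id}/{name}"] = inode          # reverse-indexed entry
    kv[f"inode/{inode}"] = {"is_dir": is_dir, "parent": parent_id}
    return inode

def lookup(path):
    inode = ROOT
    for part in filter(None, path.split("/")):
        inode = kv[f"dentry/{inode}/{part}"]          # one point lookup per step
    return inode

d = create(ROOT, "data", True)
f = create(d, "a.csv", False)
print(lookup("/data/a.csv") == f)            # True
```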

Journal ArticleDOI
TL;DR: A strategy is proposed to dynamically set the replica factor for each data item considering the popularity of data, its current replication factor and the number of active nodes present in the cloud storage to maintain an optimal number of replicas.

Proceedings ArticleDOI
02 Apr 2018
TL;DR: Results show that increasing the replication factor of the "hot" data increases the availability and locality of the data, and thus decreases the job execution time.
Abstract: The massive growth in the volume of data and the demand for big data utilisation has led to an increasing prevalence of Hadoop Distributed File System (HDFS) solutions. However, the performance of Hadoop and indeed HDFS has some limitations and remains an open problem in the research community. The ultimate goal of our research is to develop an adaptive replication system; this paper presents the first phase of the work: an investigation into the replication factor used in HDFS to determine whether increasing the replication factor for in-demand data can improve the performance of the system. We constructed a physical Hadoop cluster for our experimental environment, using TestDFSIO and both real-world and synthetic data sets (NOAA and TPC-H) with Hive to validate our proposal. Results show that increasing the replication factor of the "hot" data increases the availability and locality of the data, and thus decreases the job execution time.

Patent
04 Dec 2018
TL;DR: A cache overlay layer can store additional state information on a per-block basis that details whether each individual block of file data within the cache overlay layer is clean, dirty, or has a write back to the storage layer in progress.
Abstract: Implementations are provided herein for having at least two data streams associated with each file in a file system. The first, a cache overlay layer, can store additional state information on a per block basis that details whether each individual block of file data within the cache overlay layer is clean, dirty, or indicates that a write back to the storage layer is in progress. The second, a storage layer, can be a use case defined repository that can transform data using data augmentation methods or store unmodified raw data in local storage. File system operations directed to the cache overlay layer can be processed asynchronously from file system operations directed to the storage layer.
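The per-block state machine described in the claim (clean, dirty, write back in progress) can be sketched as follows; the class is a hypothetical illustration, not the patented implementation.

```python
# Per-block cache overlay state sketch.
from enum import Enum

class BlockState(Enum):
    CLEAN = "clean"
    DIRTY = "dirty"
    WRITEBACK = "write back in progress"

class CacheOverlay:
    def __init__(self):
        self.blocks, self.state = {}, {}

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.state[block_no] = BlockState.DIRTY

    def flush(self, block_no, storage_put):
        if self.state.get(block_no) is BlockState.DIRTY:
            self.state[block_no] = BlockState.WRITEBACK
            storage_put(block_no, self.blocks[block_no])   # asynchronous in practice
            self.state[block_no] = BlockState.CLEAN

cache = CacheOverlay()
cache.write(7, b"new bytes")
cache.flush(7, lambda n, d: print("persisting block", n))
print(cache.state[7])
```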

Patent
10 Jul 2018
TL;DR: A Spark-framework-based distributed implementation method for a GAN (generative adversarial network) is proposed, which comprises the following steps: a host node randomly initializes the network configuration and generates a parameter set; a data file is uploaded directly to a distributed file system; an elastic distributed data set of Spark is constructed; for each training data subset RDD (resilient distributed dataset), the host node transmits parameters, configuration, and network update state to all slave nodes; each slave node trains part of the data and updates the parameters; and the GAN model is trained in parallel in a data-parallel manner until Nash equilibrium is reached.
Abstract: The invention relates to a Spark-framework-based distributed implementation method for a GAN (generative adversarial network). The method comprises the following steps: a host node randomly initializes the network configuration, and a parameter set is generated; a data file is uploaded directly to a distributed file system; an elastic distributed data set of Spark is constructed; for each training data subset RDD (resilient distributed dataset), the host node transmits parameters, configuration, and network update state to all slave nodes; each slave node trains part of the data and updates the parameters; the GAN model is trained in parallel in a data-parallel manner until Nash equilibrium is reached, and training is completed. The method is intended for distributed training of GAN models in the presence of massive data. By taking full advantage of Spark's memory-based computing framework and its suitability for recursive computing, training of the GAN model can be accelerated, efficiency is improved, and better extensibility is achieved.
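The broadcast-train-average loop the patent describes can be sketched with PySpark; a plain least-squares update stands in for the GAN's generator/discriminator steps, so this only illustrates the data-parallel pattern, not an actual GAN, and it assumes a local PySpark installation.

```python
# Data-parallel parameter update sketch: broadcast parameters, compute local
# updates per partition, average them on the driver.
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[4]", "data-parallel-sketch")
data = sc.parallelize([(float(x), 3.0 * x + 1.0) for x in range(1000)], 4)

params = np.zeros(2)                               # [slope, intercept]
for epoch in range(20):
    b = sc.broadcast(params)
    def local_grad(rows):
        w, c = b.value
        g, n = np.zeros(2), 0
        for x, y in rows:
            err = (w * x + c) - y
            g += np.array([err * x, err]); n += 1
        yield g / max(n, 1)
    grads = data.mapPartitions(local_grad).collect()
    params = params - 1e-6 * np.mean(grads, axis=0)   # averaged update

print(params)                                      # slope approaches 3.0
sc.stop()
```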

Journal ArticleDOI
TL;DR: The experimental results show that the proposed scheme can greatly reduce I/O time and improve the utilization of storage servers when running database-relevant applications, compared with the commonly used block data placement strategy, i.e., the round-robin placement policy.
Abstract: This paper proposes a new data placement policy to allocate data blocks across the storage servers of distributed/parallel file systems, yielding an even block access workload distribution. To this end, we first analyze the history of the block access sequence of a specific application and then introduce a k-partition algorithm to divide data blocks into multiple groups according to their access frequency. After that, each group has almost the same access workload, and we can thus distribute these block groups onto the storage servers of the distributed file system, to achieve the goal of uniformly assigning data blocks when running the application. In summary, this newly proposed data placement policy yields not only an even data distribution but also balanced block data access. The experimental results show that the proposed scheme can greatly reduce I/O time and improve the utilization of storage servers when running database-relevant applications, compared with the commonly used block data placement strategy, i.e., the round-robin placement policy.
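A simple stand-in for the k-partition step is the classic greedy heuristic: sort blocks by access frequency and always put the next block into the currently lightest group. This is only an illustration of the balancing goal, not the paper's exact algorithm.

```python
# Greedy frequency-balanced grouping of blocks into k groups (one per server).
import heapq

def k_partition(block_freq, k):
    groups = [[] for _ in range(k)]
    heap = [(0, i) for i in range(k)]              # (group weight, group index)
    heapq.heapify(heap)
    for block, freq in sorted(block_freq.items(), key=lambda kv: -kv[1]):
        weight, i = heapq.heappop(heap)
        groups[i].append(block)
        heapq.heappush(heap, (weight + freq, i))
    return groups

freq = {"b1": 90, "b2": 70, "b3": 40, "b4": 35, "b5": 30, "b6": 5}
print(k_partition(freq, 3))                        # three roughly equal-load groups
```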

Proceedings ArticleDOI
01 Mar 2018
TL;DR: A review of the algorithms and methodologies suggested for the storage of large volumes of unstructured, real-time data and high-velocity streams in Hadoop.
Abstract: Big data is a core topic in industry and research fields as well as for society as a whole. Analytics of big data is predictive analysis rather than the traditional descriptive analysis of data. Hadoop is the most widely used tool for big data analytics in social media companies like Google, Facebook, Yahoo, and Amazon. Hadoop uses a distributed file system for the storage of large volumes of unstructured, real-time data and streams arriving at high velocity. Much importance has been given to data storage in Hadoop, but the security of the data has been largely ignored and given very little importance. We review the algorithms and methodologies that have been suggested.

Book ChapterDOI
07 Jan 2018
TL;DR: This paper shows that common file system operations can run concurrently without synchronisation, while still retaining semantics reasonably similar to the Posix hierarchical structure; the one exception is the move operation, for which it is proved that, unless synchronised, it will have anomalous behaviour.
Abstract: Distributed file systems play a vital role in large-scale enterprise services. However, the designer of a distributed file system faces a vexing choice between strong consistency and asynchronous replication. The former supports a standard sequential model by synchronising operations, but is slow and fragile. The latter is highly available and responsive, but exposes users to concurrency anomalies. In this paper, we describe a rigorous and general approach to navigating this trade-off by leveraging static verification tools that allow different file system designs to be verified. We show that common file system operations can run concurrently without synchronisation, while still retaining semantics reasonably similar to the Posix hierarchical structure. The one exception is the move operation, for which we prove that, unless synchronised, it will have anomalous behaviour.
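The move anomaly is easy to reproduce in miniature: two moves that are each legal on their own replica, when merged without synchronisation, leave a cycle that is no longer reachable from the root. The sketch below only illustrates that effect.

```python
# Concurrent, unsynchronised moves creating an orphaned cycle (toy model).
def reachable_from_root(parent):
    children = {}
    for node, p in parent.items():
        children.setdefault(p, []).append(node)
    seen, stack = set(), ["/"]
    while stack:
        n = stack.pop()
        seen.add(n)
        stack.extend(children.get(n, []))
    return seen

parent = {"/a": "/", "/b": "/"}          # both directories start under the root
parent["/a"] = "/b"                      # replica 1: move /a under /b
parent["/b"] = "/a"                      # replica 2, concurrently: move /b under /a
print(reachable_from_root(parent))       # {'/'}: /a and /b now form an orphaned cycle
```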

Proceedings ArticleDOI
01 Dec 2018
TL;DR: The experimental analysis carried out in this paper helps use Hadoop clusters effectively for research and analysis and provides suggestions for selecting the size of a Hadoop cluster according to data size and generation speed.
Abstract: Big data analytics helps in analyzing structured data transactions as well as analytics programs that contain semi-structured and unstructured data. Internet clickstream data, mobile-phone call details, and server logs are examples of big data. Relational database-oriented datasets do not fit in a traditional data warehouse, since big data sets are updated frequently and large amounts of data are generated in real time. Many open source solutions are available for handling this large-scale data. The Hadoop Distributed File System (HDFS) is one of the solutions that helps in storing, managing, and analyzing big data. Hadoop has become a standard for distributed storage and computing in big data analytics applications. It has the capability to manage distributed nodes for data storage and processing in a distributed manner. The Hadoop architecture is also described as "store everything now and decide how to process later." Challenges and issues of multi-node Hadoop cluster setup and configuration are discussed in this paper. Troubleshooting for high availability of nodes under different Hadoop cluster failure scenarios is experimented with using datasets of different sizes. The experimental analysis carried out in this paper helps use Hadoop clusters effectively for research and analysis. It also provides suggestions for selecting the size of a Hadoop cluster according to data size and generation speed.

Patent
26 Jul 2018
TL;DR: A network is described with a plurality of nodes for storing linearly integrated data records in a distributed file system, a client installed on each node, each client configured to obtain security information from at least one other node in the network, and a module contained within each client for delivering the obtained security information to an endpoint security application of the corresponding node.
Abstract: Systems and methods are provided for distributing security information. The systems and methods include a network having a plurality of nodes for storing a plurality of linearly integrated data records in a distributed file system, each linearly integrated data record including security information, a client installed on each node, each client configured to obtain the security information from at least one other node in the network, and a module contained within each client for delivering the obtained security information to an endpoint security application of the corresponding node.

Proceedings ArticleDOI
16 May 2018
TL;DR: An extensive study of the Hadoop Distributed File System (HDFS) code evolution over time, based on reports and patch files available from the official Apache issue tracker, to assist developers in improving the design of similar systems and implementing more solid systems in general.
Abstract: Frameworks for large-scale distributed data processing, such as the Hadoop ecosystem, are at the core of the big data revolution we have experienced over the last decade. In this paper, we conduct an extensive study of the Hadoop Distributed File System (HDFS)'s code evolution. Our study is based on the reports and patch files (patches) available from the official Apache issue tracker (JIRA) and our goal was to make complete use of the entire history of HDFS at the time and the richness of the available data. The purpose of our study is to assist developers in improving the design of similar systems and implementing more solid systems in general. In contrast to prior work, our study covers all reports that have been submitted over HDFS's lifetime, rather than a sampled subset. Additionally, we include all associated patch files that have been verified by the developers of the system and classify the root causes of issues at a finer granularity than prior work, by manually inspecting all 3302 reports over the first nine years, based on a two-level classification scheme that we developed. This allows us to present a different perspective of HDFS, including a focus on the system's evolution over time, as well as a detailed analysis of characteristics that have not been previously studied in detail. These include, for example, the scope and complexity of issues in terms of the size of the patch that fixes it and number of files it affects, the time it takes before an issue is exposed, the time it takes to resolve an issue and how these vary over time. Our results indicate that bug reports constitute the most dominant type, having a continuously increasing rate over time. Moreover, the overall scope and complexity of reports and patch files remain surprisingly stable throughout HDFS' lifetime, despite the significant growth the code base experiences over time. Finally, as part of our work, we created a detailed database that includes all reports and patches, along with the key characteristics we extracted.

Journal ArticleDOI
TL;DR: To improve MapReduce processing performance, the proposed method reduces JVM creation time by reusing a single JVM to run multiple mappers (rather than creating a JVM for every mapper).
Abstract: Hadoop uses the Hadoop distributed file system for storing big data, and uses MapReduce to process big data in cloud computing environments. Because Hadoop is optimized for large file sizes, it has difficulties processing large numbers of small files. A small file can be defined as any file that is significantly smaller than the Hadoop block size, which is typically set to 64 MB. Hadoop is optimized to store data in relatively large files, and thus suffers from name node memory insufficiency and increased scheduling and processing time when processing large numbers of small files. This study proposes a performance improvement method for MapReduce processing, which integrates the CombineFileInputFormat method and the reuse feature of the Java Virtual Machine (JVM). Existing methods create a mapper for every small file. Unlike these methods, the proposed method reduces the number of created mappers by processing large numbers of files that are combined by a single split using CombineFileInputFormat. Moreover, to improve MapReduce processing performance, the proposed method reduces JVM creation time by reusing a single JVM to run multiple mappers (rather than creating a JVM for every mapper).
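What CombineFileInputFormat achieves can be pictured as packing many small files into far fewer splits capped at the block size, so far fewer mappers (and, with JVM reuse, far fewer JVMs) are created. The sketch below models only that packing; sizes are illustrative.

```python
# Packing small files into combined splits, capped at the HDFS block size.
BLOCK_SIZE = 64 * 1024 * 1024                  # 64 MB, as in the abstract

def combine_into_splits(files):
    """files: list of (path, size_in_bytes) -> list of splits (lists of paths)."""
    splits, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > BLOCK_SIZE:
            splits.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        splits.append(current)
    return splits

small_files = [(f"/in/part-{i:05d}", 2 * 1024 * 1024) for i in range(100)]  # 2 MB each
print(len(combine_into_splits(small_files)))   # 4 splits instead of 100 mappers
```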

Book ChapterDOI
07 Jul 2018
TL;DR: This work proposes the Small Hadoop Distributed File System (SHDFS), which is based on the original HDFS and adds two novel modules: a merging module, which finds correlated files by user-based collaborative filtering and merges them into a single large file to reduce the total number of files, and a caching module.
Abstract: The Hadoop Distributed File System (HDFS) is designed to reliably store and manage large-scale files. All the files in HDFS are managed by a single server, the NameNode. The NameNode stores metadata, in its main memory, for each file stored in HDFS. HDFS suffers a performance penalty as the number of small files increases. Storing and managing a mass of small files imposes a heavy burden on the NameNode, and the number of files that can be stored in HDFS is constrained by the size of the NameNode's main memory. In order to improve the efficiency of storing and accessing small files on HDFS, we propose the Small Hadoop Distributed File System (SHDFS), which is based on the original HDFS. Compared to the original HDFS, we add two novel modules to the proposed SHDFS: a merging module and a caching module. In the merging module, a correlated-files model is proposed, which is used to find correlated files by user-based collaborative filtering and then merge them into a single large file to reduce the total number of files. In the caching module, we use a log-linear model to identify hot-spot data that users frequently access, and we design a special memory subsystem to cache these hot-spot data. The caching mechanism speeds up access to hot-spot data.
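The output of a merging module like SHDFS's can be pictured as one large file plus an offset index over its members; the collaborative-filtering step that decides which files are correlated is not reproduced here, and the format below is an illustrative assumption.

```python
# Packing correlated small files into one blob with an offset index.
def merge_files(contents):
    """contents: {name: bytes} -> (merged_blob, {name: (offset, length)})."""
    index, blob, offset = {}, bytearray(), 0
    for name, data in contents.items():
        index[name] = (offset, len(data))
        blob.extend(data)
        offset += len(data)
    return bytes(blob), index

def read_member(blob, index, name):
    offset, length = index[name]
    return blob[offset:offset + length]

blob, idx = merge_files({"a.xml": b"<a/>", "b.xml": b"<b/>", "c.xml": b"<c/>"})
print(read_member(blob, idx, "b.xml"))         # b'<b/>'
```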

Patent
22 Jun 2018
TL;DR: A dynamic load balancing method for a distributed file system under a cloud environment is presented, which mainly comprises the steps of acquiring information from all nodes of the distributed file system under the cloud environment and computing the threshold needed for the system to be balanced according to the disk space utilization rate, CPU utilization rate, and memory utilization rate of each node.
Abstract: The invention discloses a dynamic load balancing method for a distributed file system under a cloud environment. The method mainly comprises the steps of acquiring information from all nodes of the distributed file system under the cloud environment, judging whether the file system is balanced, computing the threshold needed for the system to be balanced according to the disk space utilization rate, CPU utilization rate, memory utilization rate, disk I/O occupancy rate, and network bandwidth occupancy rate of each node, and performing imbalance adjustment on the load of the file system according to the threshold and the disk space utilization rate. The method balances the load of the file system while supporting the execution of cloud computation tasks, and continuously adjusts the load by monitoring node information, thereby improving the execution efficiency of cloud computation over the file system.
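The kind of load score and balance check the patent describes can be sketched as a weighted mix of the five utilisation metrics plus a deviation threshold around the cluster mean; the weights and threshold below are illustrative assumptions.

```python
# Weighted node-load score and imbalance detection sketch.
WEIGHTS = {"disk": 0.35, "cpu": 0.2, "mem": 0.15, "disk_io": 0.15, "net": 0.15}

def node_load(metrics):
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

def find_imbalanced(nodes, threshold=0.1):
    loads = {n: node_load(m) for n, m in nodes.items()}
    mean = sum(loads.values()) / len(loads)
    over = [n for n, l in loads.items() if l > mean + threshold]
    under = [n for n, l in loads.items() if l < mean - threshold]
    return over, under

cluster = {
    "dn1": {"disk": 0.92, "cpu": 0.6, "mem": 0.7, "disk_io": 0.8, "net": 0.5},
    "dn2": {"disk": 0.40, "cpu": 0.3, "mem": 0.4, "disk_io": 0.2, "net": 0.3},
    "dn3": {"disk": 0.55, "cpu": 0.5, "mem": 0.5, "disk_io": 0.4, "net": 0.4},
}
print(find_imbalanced(cluster))                # (['dn1'], ['dn2'])
```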

Patent
20 Jul 2018
TL;DR: A business big data analysis system takes a Browser/Server structure as its overall structure and includes WIFI probes, a receiving server, a distributed data analysis system, an HDFS distributed file system, an HBASE distributed database, and a network connecting the WIFI probes, the receiving server, and the other components.
Abstract: The invention relates to a business big data analysis system. The business big data analysis system takes a Browser/Server structure as the overall structure of the analysis system and includes WIFI probes, a receiving server, a distributed data analysis system, an HDFS distributed file system, an HBASE distributed database, and a network connecting the WIFI probes, the receiving server, the distributed data analysis system, the HDFS distributed file system, and the HBASE distributed database. The system employs the WIFI probe devices to detect mobile phone MAC addresses and extract business passenger flow data for big data analysis, takes the B/S structure as the overall structure of the system, puts the management system at the Web terminal, utilizes a single physical server to construct a server cluster to collect data and store it at the server side in a distributed manner, and performs effective storage, analysis, and classification of mass data to form an organized business big data analysis system that is convenient to operate, so that a user can conveniently check the business passenger flow data index analysis results through the system.