
Showing papers on "Distributed File System" published in 2017


Proceedings ArticleDOI
12 Oct 2017
TL;DR: A novel service handoff system that seamlessly migrates offloading services to the nearest edge server while the mobile client is moving is presented, and an important performance problem during Docker container migration is identified.
Abstract: Supporting smooth movement of mobile clients is important when offloading services on an edge computing platform. Interruption-free client mobility demands seamless migration of the offloading service to nearby edge servers. However, fast migration of offloading services across edge servers in a WAN environment poses significant challenges to the handoff service design. In this paper, we present a novel service handoff system which seamlessly migrates offloading services to the nearest edge server while the mobile client is moving. Service handoff is achieved via container migration. We identify an important performance problem during Docker container migration. Based on our systematic study of container layer management and image stacking, we propose a migration method which leverages the layered storage system to reduce file system synchronization overhead, without dependence on the distributed file system. We implement a prototype system and conduct experiments using real-world product applications. Evaluation results reveal that, compared to state-of-the-art service handoff systems designed for edge computing platforms, our system reduces the total service handoff time by 80% (56%) with a network bandwidth of 5 Mbps (20 Mbps).
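
A minimal sketch of the layer-reuse idea behind such a handoff (not the paper's implementation): before migrating a container, only the image layers missing at the target edge server, plus the thin writable layer, need to be transferred. Layer digests and sizes below are illustrative.

```python
# Hypothetical sketch: plan a container migration that reuses image layers
# already cached on the target edge server, so only missing layers are sent.

def plan_layer_transfer(source_layers, target_layers, writable_layer_size):
    """source_layers / target_layers: dict mapping layer digest -> size in bytes."""
    missing = {d: s for d, s in source_layers.items() if d not in target_layers}
    bytes_to_send = sum(missing.values()) + writable_layer_size
    bytes_saved = sum(s for d, s in source_layers.items() if d in target_layers)
    return missing, bytes_to_send, bytes_saved


if __name__ == "__main__":
    # Read-only image layers on the source edge server (digest -> size).
    source = {"sha256:base": 120_000_000, "sha256:runtime": 45_000_000,
              "sha256:app": 8_000_000}
    # The target already caches the base and runtime layers.
    target = {"sha256:base": 120_000_000, "sha256:runtime": 45_000_000}
    missing, send, saved = plan_layer_transfer(source, target, writable_layer_size=2_000_000)
    print("layers to transfer:", sorted(missing))
    print(f"bytes to send: {send}, bytes saved by layer reuse: {saved}")
```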

157 citations


Proceedings ArticleDOI
09 May 2017
TL;DR: An overview of the ADLS architecture, design points, and performance is presented; distinguishing aspects include its design for handling multiple storage tiers, exabyte scale, and comprehensive security and data sharing features.
Abstract: Azure Data Lake Store (ADLS) is a fully-managed, elastic, scalable, and secure file system that supports Hadoop distributed file system (HDFS) and Cosmos semantics. It is specifically designed and optimized for a broad spectrum of Big Data analytics that depend on a very high degree of parallel reads and writes, as well as collocation of compute and data for high bandwidth and low-latency access. It brings together key components and features of Microsoft's Cosmos file system (long used by internal customers at Microsoft) and HDFS, and is a unified file storage solution for analytics on Azure. Internal and external workloads run on this unified platform. Distinguishing aspects of ADLS include its design for handling multiple storage tiers, exabyte scale, and comprehensive security and data sharing features. We present an overview of ADLS architecture, design points, and performance.

99 citations


Patent
20 Oct 2017
TL;DR: In this article, a block chain-based distributed storage method is described, where the file is stored and located on the basis of its hash value in the distributed network and tampering with the file is prevented.
Abstract: The embodiment of the disclosure relates to block chain-based distributed storage, and discloses a block chain-based distributed storage method. The method includes: obtaining a file that needs to be stored, and generating a hash value and an index of the hash value for the file. The method also includes: recording the hash value in a block chain network, and storing the file in a distributed network, wherein the file is located in the distributed network on the basis of the hash value. Therefore, according to the embodiment of the disclosure, tamper-proofing and distributed storage of the file can be realized by combining a block chain and a distributed file system; this solves the problem of the limited capacity of block chain nodes and effectively guarantees the authenticity and availability of the stored file.
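
A minimal sketch of the store-hash-on-chain / store-content-off-chain flow described above, assuming in-memory stand-ins for both the blockchain ledger and the distributed file store; it is illustrative only, not the patented implementation.

```python
import hashlib

ledger = []          # append-only list standing in for the blockchain
object_store = {}    # hash -> bytes, standing in for the distributed file system


def store_file(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    ledger.append(digest)          # record the hash on-chain (its position is the index)
    object_store[digest] = data    # store the content off-chain, keyed by the hash
    return digest


def fetch_file(digest: str) -> bytes:
    if digest not in ledger:
        raise KeyError("hash not recorded on chain")
    data = object_store[digest]
    # Integrity check: any tampering with the stored content changes its hash.
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("stored file does not match on-chain hash")
    return data


if __name__ == "__main__":
    h = store_file(b"contract.pdf contents")
    assert fetch_file(h) == b"contract.pdf contents"
    object_store[h] = b"tampered"          # simulate tampering
    try:
        fetch_file(h)
    except ValueError as e:
        print("tamper detected:", e)
```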

64 citations


Book ChapterDOI
27 Feb 2017
TL;DR: HopsFS is introduced, a next generation distribution of the Hadoop Distributed File System that replaces HDFS' single node in-memory metadata service, with a distributed metadata service built on a NewSQL database that enables an order of magnitude larger and higher throughput clusters compared to HDFS.
Abstract: Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases. In this paper, we introduce HopsFS, a next generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS' single-node in-memory metadata service with a distributed metadata service built on a NewSQL database. By removing the metadata bottleneck, HopsFS enables an order of magnitude larger and higher-throughput clusters compared to HDFS. Metadata capacity has been increased to at least 37 times HDFS' capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients, and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.

56 citations


Proceedings ArticleDOI
09 May 2017
TL;DR: This work presents OctopusFS, a novel distributed file system that is aware of heterogeneous storage media with different capacities and performance characteristics, and offers a variety of pluggable policies for automating data management across the storage tiers and cluster nodes.
Abstract: The ever-growing data storage and I/O demands of modern large-scale data analytics are challenging the current distributed storage systems. A promising trend is to exploit the recent improvements in memory, storage media, and networks for sustaining high performance and low cost. While past work explores using memory or SSDs as local storage, or combining local with network-attached storage in cluster computing, this work focuses on managing multiple storage tiers in a distributed setting. We present OctopusFS, a novel distributed file system that is aware of heterogeneous storage media (e.g., memory, SSDs, HDDs, NAS) with different capacities and performance characteristics. The system offers a variety of pluggable policies for automating data management across the storage tiers and cluster nodes. The policies employ multi-objective optimization techniques for making intelligent data management decisions based on the requirements of fault tolerance, data and load balancing, and throughput maximization. At the same time, the storage media are explicitly exposed to users and applications, allowing them to choose the distribution and placement of replicas in the cluster based on their own performance and fault tolerance requirements. Our extensive evaluation shows the immediate benefits of using OctopusFS with data-intensive processing systems, such as Hadoop and Spark, in terms of both increased performance and better cluster utilization.
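
The following is an illustrative sketch, not OctopusFS code, of what a pluggable, tier-aware placement policy could look like: each candidate (node, tier) pair is scored on throughput and remaining capacity, and replicas are spread across distinct nodes. The weights and media figures are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Media:
    node: str
    tier: str            # e.g. "memory", "ssd", "hdd"
    throughput_mbps: float
    free_gb: float


def place_replicas(candidates, block_gb, replicas=3, w_perf=0.7, w_cap=0.3):
    """Greedy multi-objective placement: maximize a weighted score, one replica per node."""
    usable = [m for m in candidates if m.free_gb >= block_gb]
    max_tp = max(m.throughput_mbps for m in usable)
    max_free = max(m.free_gb for m in usable)
    ranked = sorted(usable,
                    key=lambda m: w_perf * m.throughput_mbps / max_tp
                                + w_cap * m.free_gb / max_free,
                    reverse=True)
    chosen, used_nodes = [], set()
    for m in ranked:
        if m.node not in used_nodes:          # fault tolerance: distinct nodes
            chosen.append(m)
            used_nodes.add(m.node)
        if len(chosen) == replicas:
            break
    return chosen


if __name__ == "__main__":
    cluster = [Media("n1", "memory", 8000, 16), Media("n1", "hdd", 150, 900),
               Media("n2", "ssd", 500, 200), Media("n3", "hdd", 150, 700),
               Media("n4", "ssd", 500, 50)]
    for m in place_replicas(cluster, block_gb=0.128):
        print(m.node, m.tier)
```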

43 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: Evaluations show that LocoFS with eight nodes boosts metadata throughput by 5 times, reaching 93% of the throughput of a single-node key-value store, compared to 18% for the state-of-the-art IndexFS.
Abstract: Key-Value stores provide scalable metadata service for distributed file systems. However, the metadata itself, which is organized in a directory tree structure, does not fit the key-value access pattern, thereby limiting performance. To address this issue, we propose a distributed file system with a loosely-coupled metadata service, LocoFS, to bridge the performance gap between file system metadata and key-value stores. LocoFS is designed to decouple the dependencies between different kinds of metadata with two techniques. First, LocoFS decouples the directory content and structure, organizing file and directory index nodes in a flat space while reversely indexing the directory entries. Second, it decouples the file metadata to further improve the key-value access performance. Evaluations show that LocoFS with eight nodes boosts metadata throughput by 5 times, reaching 93% of the throughput of a single-node key-value store, compared to 18% in the state-of-the-art IndexFS.
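
A simplified illustration of the decoupling idea (not the actual LocoFS design details): directories and files live in flat key-value tables keyed by (parent id, name), with rarely-changing and frequently-changing file metadata kept in separate records. All field names here are assumptions for the sketch.

```python
import uuid

dir_table = {}    # (parent_id, name) -> dir_id
file_table = {}   # (parent_id, name) -> {"access": ..., "content": ...}  (decoupled metadata)

ROOT = "root"
dir_table[(None, "/")] = ROOT


def mkdir(parent_id, name):
    dir_id = uuid.uuid4().hex
    dir_table[(parent_id, name)] = dir_id
    return dir_id


def create_file(parent_id, name, size=0):
    file_table[(parent_id, name)] = {
        "access": {"mode": 0o644, "uid": 1000},      # rarely-changing metadata
        "content": {"size": size, "blocks": []},     # frequently-changing metadata
    }


def lookup(path):
    """Resolve /a/b/f with one point read per component against the flat tables."""
    parent = ROOT
    parts = [p for p in path.split("/") if p]
    for comp in parts[:-1]:
        parent = dir_table[(parent, comp)]
    return file_table[(parent, parts[-1])]


if __name__ == "__main__":
    d = mkdir(ROOT, "logs")
    create_file(d, "app.log", size=4096)
    print(lookup("/logs/app.log")["content"])
```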

32 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: TagIt is a scalable data management service framework built around a scalable, distributed metadata indexing framework, with which a flexible tagging capability is implemented to support data discovery; it can expedite data search by up to 10X over the extant decoupled approach.
Abstract: Data services such as search, discovery, and management in scalable distributed environments have traditionally been decoupled from the underlying file systems, and are often deployed using external databases and indexing services. However, modern data production rates, looming data movement costs, and the lack of metadata entail revisiting the decoupled file system-data services design philosophy. In this paper, we present TagIt, a scalable data management service framework aimed at scientific datasets, which is tightly integrated into a shared-nothing distributed file system. A key feature of TagIt is a scalable, distributed metadata indexing framework, using which we implement a flexible tagging capability to support data discovery. The tags can also be associated with an active operator, for pre-processing, filtering, or automatic metadata extraction, which we seamlessly offload to file servers in a load-aware fashion. Our evaluation shows that TagIt can expedite data search by up to 10X over the extant decoupled approach.

29 citations


Patent
24 Nov 2017
TL;DR: A distributed streaming data processing method and system in a cloud environment is presented: to handle data that arrives in large concurrent volumes and fast streams, Spark Streaming is used in place of the MapReduce batch computing of a traditional Lambda architecture; by instantiating multiple input streams, stream computing over multi-table data is achieved, the computing results are stored in the distributed file system HDFS, and efficient query is provided through the distributed query system Impala.
Abstract: The invention provides a distributed streaming data processing method and system in a cloud environment. To address the characteristics of the Internet of Things era, where data arrives in large concurrent volumes and fast streams, the stream computing engine Spark Streaming is used to replace the MapReduce batch computing of a traditional Lambda architecture. By instantiating multiple input streams, stream computing over multi-table data is achieved; the computing results are stored in the distributed file system HDFS, and efficient query is achieved through the distributed query system Impala.

24 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: This work is the first of its kind to provide comprehensive insights on user behavior from multiple science domains through metadata analysis of a large-scale shared file system.
Abstract: The Oak Ridge Leadership Computing Facility (OLCF) runs the No. 4 supercomputer in the world, supported by a petascale file system, to facilitate scientific discovery. In this paper, using the daily file system metadata snapshots collected over 500 days, we have studied the behavioral trends of 1,362 active users and 380 projects across 35 science domains. In particular, we have analyzed both individual and collective behavior of users and projects, highlighting needs from individual communities and the overall requirements to operate the file system. We have analyzed the metadata across three dimensions, namely (i) the projects' file generation and usage trends, using quantitative file system-centric metrics, (ii) scientific user behavior on the file system, and (iii) the data sharing trends of users and projects. To the best of our knowledge, our work is the first of its kind to provide comprehensive insights on user behavior from multiple science domains through metadata analysis of a large-scale shared file system. We envision that this OLCF case study will provide valuable insights for the design, operation, and management of storage systems at scale, and also encourage other HPC centers to undertake similar efforts.

22 citations


Patent
15 Mar 2017
TL;DR: A distributed-storage-based Docker image downloading method is described, in which a distributed file system is mounted at the image storage directory of the Registry node, and all nodes in the cluster create corresponding directories and mount the distributed file system.
Abstract: The invention discloses a distributed-storage-based Docker image downloading method, which comprises the following steps: a distributed file system is mounted at the image storage directory of the Registry node, and all nodes in the cluster create corresponding directories and mount the distributed file system; a Docker node in the cluster requests an image download from the Registry node; the Registry node determines the storage location of the image data in the distributed file system according to the Docker node's request and returns image metadata to the Docker node; and the Docker node determines the storage location of the image data from the received metadata and extracts the image data directly from the storage node. Only metadata is exchanged between the Docker node and the Registry node, while the actual data is read from the storage node, which relieves the network bottleneck at the Registry node.
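
A hedged sketch of this pull flow, under the assumption of a shared mount path and in-memory stand-ins for both sides: the Registry answers with image metadata only (layer digests plus their paths on the shared distributed file system mount), and the Docker node copies layer data straight from that mount instead of streaming it through the Registry. Paths and digests are illustrative.

```python
import os

SHARED_MOUNT = "/mnt/dfs/registry"   # distributed file system mounted on every node


class Registry:
    def __init__(self):
        self.images = {"web:1.0": ["sha256:aaa", "sha256:bbb"]}

    def resolve(self, image):
        # Metadata only: no layer bytes cross this link.
        return [{"digest": d, "path": os.path.join(SHARED_MOUNT, "blobs", d)}
                for d in self.images[image]]


class DockerNode:
    def __init__(self, local_dir="/var/lib/docker/layers"):
        self.local_dir = local_dir

    def pull(self, registry, image):
        for layer in registry.resolve(image):
            # In a real deployment this would be a local filesystem copy from the
            # shared mount; here we just report the plan.
            print(f"copy {layer['path']} -> {os.path.join(self.local_dir, layer['digest'])}")


if __name__ == "__main__":
    DockerNode().pull(Registry(), "web:1.0")
```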

22 citations


Journal ArticleDOI
TL;DR: A dynamic transmission rate adjustment strategy to prevent potential incast congestion when replicating a file to a server, a network-aware data node selection strategy to reduce file read latency, and a load-aware replica maintenance strategy to quickly create file replicas under replica node failures are proposed.
Abstract: In data intensive clusters, a large number of files are stored, processed and transferred simultaneously. To increase data availability, some file systems create and store three replicas for each file in randomly selected servers across different racks. However, they neglect file heterogeneity and server heterogeneity, which can be leveraged to further enhance data availability and file system efficiency. As files have heterogeneous popularities, a rigid number of three replicas may not provide immediate response to an excessive number of read requests to hot files, and wastes resources (including energy) on replicas of cold files that have few read requests. Also, since servers are heterogeneous in network bandwidth, hardware configuration and capacity (i.e., the maximal number of service requests that can be supported simultaneously), it is crucial to select replica servers so as to ensure low replication delay and request response delay. In this paper, we propose an Energy-Efficient Adaptive File Replication System (EAFR), which incorporates three components. It is adaptive to time-varying file popularities to achieve a good tradeoff between data availability and efficiency. Higher popularity of a file leads to more replicas and vice versa. Also, to achieve energy efficiency, servers are classified into hot servers and cold servers with different energy consumption, and cold files are stored in cold servers. EAFR then selects a server with sufficient capacity (including network bandwidth and capacity) to hold a replica. To further improve the performance of EAFR, we propose a dynamic transmission rate adjustment strategy to prevent potential incast congestion when replicating a file to a server, a network-aware data node selection strategy to reduce file read latency, and a load-aware replica maintenance strategy to quickly create file replicas under replica node failures. Experimental results on a real-world cluster show the effectiveness of EAFR and the proposed strategies in reducing file read latency, replication time, and power consumption in large clusters.
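
A sketch of the adaptive idea under assumed thresholds (not EAFR's actual parameters): hotter files get more replicas placed on hot servers, and cold files fall back to a single replica on low-power cold servers with spare capacity.

```python
def target_replicas(reads_per_hour, hot_threshold=100, cold_threshold=5):
    if reads_per_hour >= hot_threshold:
        return 4          # popular file: extra replica for read throughput
    if reads_per_hour <= cold_threshold:
        return 1          # cold file: save space and energy
    return 3              # default HDFS-style replication


def choose_servers(n_replicas, hot_servers, cold_servers, is_cold_file):
    pool = cold_servers if is_cold_file else hot_servers
    # Pick the servers with the most spare capacity (requests they can still take).
    ranked = sorted(pool, key=lambda s: s["spare_capacity"], reverse=True)
    return [s["name"] for s in ranked[:n_replicas]]


if __name__ == "__main__":
    hot = [{"name": "h1", "spare_capacity": 40}, {"name": "h2", "spare_capacity": 75},
           {"name": "h3", "spare_capacity": 10}]
    cold = [{"name": "c1", "spare_capacity": 90}]
    r = target_replicas(reads_per_hour=250)
    print(r, choose_servers(r, hot, cold, is_cold_file=False))
```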

Proceedings ArticleDOI
22 May 2017
TL;DR: SafeFS is a modular architecture based on software-defined storage principles featuring stackable building blocks that can be combined to construct a secure distributed file system that allows users to specialize their data store to their specific needs by choosing the combination of blocks that provide the best safety and performance tradeoffs.
Abstract: The exponential growth of data produced, the ever faster and ubiquitous connectivity, and the collaborative processing tools lead to a clear shift of data stores from local servers to the cloud. This migration, occurring across different application domains and types of users (individual or corporate), raises two immediate challenges. First, out-sourcing data introduces security risks, hence protection mechanisms must be put in place to provide guarantees such as privacy, confidentiality and integrity. Second, there is no "one-size-fits-all" solution that would provide the right level of safety or performance for all applications and users, and it is therefore necessary to provide mechanisms that can be tailored to the various deployment scenarios. In this paper, we address both challenges by introducing SafeFS, a modular architecture based on software-defined storage principles featuring stackable building blocks that can be combined to construct a secure distributed file system. SafeFS allows users to specialize their data store to their specific needs by choosing the combination of blocks that provide the best safety and performance tradeoffs. The file system is implemented in user space using FUSE and can access remote data stores. The provided building blocks notably include mechanisms based on encryption, replication, and coding. We implemented SafeFS and performed in-depth evaluation across a range of workloads. Results reveal that while each layer has a cost, one can build safe yet efficient storage architectures. Furthermore, the different combinations of blocks sometimes yield surprising tradeoffs.
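
A minimal sketch of the stackable-layer idea: each block wraps the one below it and adds a single concern. The toy XOR "encryption" layer and in-memory stores are placeholders; SafeFS itself is a FUSE file system with real encryption and coding drivers.

```python
class MemoryStore:
    def __init__(self):
        self.data = {}
    def write(self, key, value: bytes):
        self.data[key] = value
    def read(self, key) -> bytes:
        return self.data[key]


class XorLayer:                      # placeholder for a real cipher layer
    def __init__(self, lower, key=0x5A):
        self.lower, self.key = lower, key
    def write(self, k, v):
        self.lower.write(k, bytes(b ^ self.key for b in v))
    def read(self, k):
        return bytes(b ^ self.key for b in self.lower.read(k))


class ReplicationLayer:
    def __init__(self, lowers):
        self.lowers = lowers
    def write(self, k, v):
        for layer in self.lowers:
            layer.write(k, v)
    def read(self, k):
        for layer in self.lowers:     # fall back to the next replica on failure
            try:
                return layer.read(k)
            except KeyError:
                continue
        raise KeyError(k)


if __name__ == "__main__":
    # Stack: encryption on top of two replicated in-memory backends.
    fs = XorLayer(ReplicationLayer([MemoryStore(), MemoryStore()]))
    fs.write("/doc.txt", b"secret")
    print(fs.read("/doc.txt"))
```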

Proceedings ArticleDOI
01 Jun 2017
TL;DR: This paper presents an improved replica placement policy for Hadoop Distributed File System (HDFS), which is specifically designed for heterogeneous clusters, and can generate perfectly even replica assignment, and achieve load balance among cluster nodes in any heterogeneous or homogeneous environments without the running of the load balance utility.
Abstract: Load balance is a crucial issue for data-intensive computing on cloud platforms, because a load-balanced cluster can significantly improve the completion time of data-intensive jobs. In this paper, we present an improved replica placement policy for the Hadoop Distributed File System (HDFS), which is specifically designed for heterogeneous clusters. The HDFS replica placement policy cannot generate balanced replica assignment, and hence has to rely on a load balance utility to balance the load among cluster nodes. In contrast, our proposed policy can generate perfectly even replica assignment, and also achieve load balance among cluster nodes in any heterogeneous or homogeneous environment without running the load balance utility.
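
The paper's exact algorithm is not reproduced here; the sketch below only illustrates one plausible way to obtain an even assignment on heterogeneous nodes, by handing out replicas in proportion to node capacity with largest-remainder rounding.

```python
def balanced_assignment(capacities_gb, total_replicas):
    """Assign replica counts to nodes proportionally to their storage capacity."""
    total = sum(capacities_gb.values())
    quotas = {n: total_replicas * c / total for n, c in capacities_gb.items()}
    assign = {n: int(q) for n, q in quotas.items()}
    leftover = total_replicas - sum(assign.values())
    # Hand the remaining replicas to the nodes with the largest fractional parts.
    for n, _ in sorted(quotas.items(), key=lambda kv: kv[1] - int(kv[1]), reverse=True)[:leftover]:
        assign[n] += 1
    return assign


if __name__ == "__main__":
    nodes = {"fast-1": 8000, "fast-2": 8000, "slow-1": 2000, "slow-2": 2000}
    print(balanced_assignment(nodes, total_replicas=300))
    # e.g. {'fast-1': 120, 'fast-2': 120, 'slow-1': 30, 'slow-2': 30}
```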

Patent
04 Jan 2017
TL;DR: A data processing method and device are proposed, comprising the following steps: acquiring input data including structured, semi-structured, or unstructured data; storing structured and semi-structured data with an HDFS (Hadoop Distributed File System); and storing unstructured data with the MooseFS distributed file system.
Abstract: The embodiments of the invention provide a data processing method and device. The method comprises the following steps: acquiring input data including structured data, semi-structured data, or unstructured data; if the input data is structured or semi-structured, storing it with an HDFS (Hadoop Distributed File System) and performing data modeling; and if the input data is unstructured, storing it with the MooseFS distributed file system. This data processing method addresses the problems that HDFS in Hadoop cannot process a large number of small files effectively and that processing small files with Hadoop MapReduce wastes considerable resources.

Journal ArticleDOI
TL;DR: Using an interactive BDA platform built with simulated patient data and open-source software technologies, HBase is recommended for securing patient data while querying entire hospital volumes through a simplified clinical event model across clinical services.
Abstract: Big data analytics (BDA) is important to reduce healthcare costs. However, there are many challenges of data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective, to establish an interactive BDA platform with simulated patient data using open-source software technologies, was achieved by constructing a platform framework with the Hadoop Distributed File System (HDFS) using HBase (a key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files revealed sustained availability over hundreds of iterations; however, completing the MapReduce load into HBase required a week for 10 TB and a month for three billion (30 TB) indexed patient records, respectively. Inconsistencies found in MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance with high usability for technical support but poor usability for clinical services. Modeling the hospital system on patient-centric data was challenging in HBase, as not all data profiles were fully integrated with the complex patient-to-hospital relationships. However, we recommend using HBase to secure patient data while querying entire hospital volumes through a simplified clinical event model across clinical services.

Journal ArticleDOI
TL;DR: A comprehensive performance measurement of different applications on scale-up and scale-out clusters configured with HDFS and a remote file system and a performance prediction model is proposed to help users select the best platforms that lead to the minimum latency.
Abstract: MapReduce is a popular computing model for parallel data processing on large-scale datasets, which can vary from gigabytes to terabytes and petabytes. Though Hadoop MapReduce normally uses the Hadoop Distributed File System (HDFS) as its local file system, it can be configured to use a remote file system. This raises an interesting question: for a given application, which is the best running platform among the different combinations of scale-up and scale-out Hadoop with remote and local file systems? However, there has been no previous research on how different types of applications (e.g., CPU-intensive, data-intensive) with different characteristics (e.g., input data size) can benefit from the different platforms. Thus, in this paper, we conduct a comprehensive performance measurement of different applications on scale-up and scale-out clusters configured with HDFS and a remote file system (i.e., OFS), respectively. We identify and study how different job characteristics (e.g., input data size, the number of file reads/writes, and the amount of computation) affect the performance of different applications on the different platforms. Based on the measurement results, we also propose a performance prediction model to help users select the best platforms that lead to the minimum latency. Our evaluation using a Facebook workload trace demonstrates the effectiveness of our prediction model. This study is expected to provide guidance for users to choose the best platform to run different applications with different characteristics in environments that provide both remote and local storage, such as HPC clusters and cloud environments.

Proceedings ArticleDOI
14 May 2017
TL;DR: A new algorithm, OMSS (Optimized MapFile based Storage of Small files), merges small files into a large file using the worst-fit strategy, reducing internal fragmentation in data blocks, which in turn leads to fewer data blocks being consumed for the same number of small files.
Abstract: Hadoop is open-source software based on the MapReduce framework. The Hadoop Distributed File System (HDFS) performs well while storing and managing data sets of very large size. However, the performance of HDFS suffers while handling a large number of small files, since they put a heavy burden on the NameNode of HDFS both in terms of memory and access time. To overcome these defects, we merge small files into a large file and store the merged file on HDFS. Generally, when small files are merged, variation in the size distribution of files is not taken into consideration. We propose a new algorithm, OMSS (Optimized MapFile based Storage of Small files), which merges the small files into a large file based on the worst-fit strategy. The strategy helps in reducing internal fragmentation in data blocks, which in turn leads to fewer data blocks consumed for the same number of small files. Fewer data blocks mean lower memory overhead at the major nodes of the Hadoop cluster and hence increased data processing efficiency. Our experimental results indicate that the time to process data on HDFS containing unprocessed small files reduces significantly, to 590 s when MapFile is used, and it reduces further to 440 s when OMSS is used. Compared to the MapFile merging algorithm, OMSS reduces memory requirements by 34.7%.
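
A sketch of worst-fit merging as described above: each small file goes into the open merge block with the most remaining space, which limits internal fragmentation. The 128 MB block size and random file sizes are illustrative, and this is not the OMSS implementation itself.

```python
import random

BLOCK = 128 * 1024 * 1024            # 128 MB HDFS block


def worst_fit_merge(file_sizes, block_size=BLOCK):
    blocks = []                       # each block is {"free": int, "files": [...]}
    for idx, size in enumerate(file_sizes):
        candidates = [b for b in blocks if b["free"] >= size]
        if candidates:
            target = max(candidates, key=lambda b: b["free"])   # worst fit
        else:
            target = {"free": block_size, "files": []}
            blocks.append(target)
        target["files"].append(idx)
        target["free"] -= size
    return blocks


if __name__ == "__main__":
    random.seed(1)
    sizes = [random.randint(1, 20) * 1024 * 1024 for _ in range(200)]  # 1-20 MB files
    merged = worst_fit_merge(sizes)
    wasted = sum(b["free"] for b in merged)
    print(f"{len(sizes)} small files packed into {len(merged)} blocks, "
          f"{wasted / 1024 / 1024:.0f} MB internal fragmentation")
```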

Journal ArticleDOI
TL;DR: An algorithm is introduced that estimates the potential of the files located in each node of the grid using a binary tree structure; simulation results indicate that the proposed scheme offers better data access performance in terms of hit ratio and average job execution time compared to other state-of-the-art strategies.
Abstract: Recently, data replication has received considerable attention in the field of grid computing. The main goal of data replication algorithms is to optimize data access performance by replicating the most popular files. When a file does not exist in the node where it was requested, it necessarily has to be transferred from another node, causing delays in the completion of file requests. The general idea behind data replication is to keep track of the most popular files requested in the grid and create copies of them in selected nodes. In this way, more file requests can be completed over a period of time and average job execution time is reduced. In this paper, we introduce an algorithm that estimates the potential of the files located in each node of the grid, using a binary tree structure. Also, the file scope and the file type are taken into account. By potential of a file, we mean its increasing or decreasing demand over a period of time. The file scope generally refers to the extent of the group of users who are interested or potentially interested in a file. The file types are divided into read and write intensive. Our scheme mainly promotes the high-potential files for replication, based on the temporal locality principle. The simulation results indicate that the proposed scheme can offer better data access performance in terms of the hit ratio and the average job execution time, compared to other state-of-the-art strategies.

Journal ArticleDOI
TL;DR: An attempt to provide a lucid comparison among three prominent technologies used for handling Big Data, viz. the Hadoop Distributed File System, the Cassandra file system, and the Quantcast file system.
Abstract: Objective: With the emergence of the notion of the "Internet of Things (IoT)", a colossal amount of information is being generated through sensors and other computing devices and chips. This paper is an attempt to provide a lucid comparison among three prominent technologies used for handling Big Data, viz. the Hadoop Distributed File System, the Cassandra file system, and the Quantcast file system. Apart from these three premier file systems, the paper also explores the newly proposed ATrain Distributed System for handling Big Data. Methods: An inner perspective of the above-stated file systems, considering various aspects of handling big data, is described in detail. The paper also provides insight into the situations in which these technologies are useful. Findings: Effectively tackling the five V's (Variety, Volume, Velocity, Veracity and Value) of Big Data has become a challenging task for researchers around the world. Hadoop is one such technology, which is open source and capable of handling big data effectively. It breaks big data into fixed-size chunks known as blocks, and these blocks are saved at distinct locations in a distributed manner. The Cassandra file system is an alternative to Hadoop that eliminates Hadoop's single point of failure, as it follows a master-less peer-to-peer distributed ring architecture instead of a client-server architecture. The third technology is the Quantcast file system, which is written in C++. It also handles big data effectively and efficiently; moreover, it claims to save up to fifty percent of disk space by implementing erasure encoding. Application: Organizations may use any of these available frameworks for handling big data depending upon the nature of their needs.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: MDFS, a mimic-defense-theory-based architecture for DFS with the capability to improve data security, is proposed; Mimic Defense (MD), the proactive defense embedded in MDFS, emphasizes dynamism, heterogeneity, and redundancy.
Abstract: As the Internet and the big data system evolve rapidly, the deployment of distributed applications becomes widespread, promoting the development of Distributed File System (DFS). The existing defense technologies for DFS, such as detection or patching, mainly aim to protect the system from known attacks and vulnerabilities. However, it is difficult for those systems to solve the growing security issues from the unknown threats due to their passiveness and hysteresis. In this paper, we propose MDFS, a mimic defense theory based architecture for DFS with the capability to improve the data security. Mimic Defense (MD), a proactive defense embedded in MDFS, emphasizes dynamism, heterogeneity and redundancy. The key benefits of MD are transferring the attack surface as well as increasing the cost of modification.

Journal ArticleDOI
Ting Yang, Pen Haibo, Wei Li, Dong Yuan, Albert Y. Zomaya
TL;DR: Experimental results show that the variable hypergraph coverage based strategy can not only reduce energy consumption, but can also improve the network performance in the datacenter.
Abstract: Distributed storage systems, e.g., the Hadoop Distributed File System (HDFS), have been widely used in datacenters for handling large amounts of data due to their excellent performance in terms of fault tolerance, reliability and scalability. However, these storage systems usually adopt the same replication and storage strategy to guarantee data availability, i.e., creating the same number of replicas for all data sets and randomly storing them across data nodes. Such strategies do not fully consider the different requirements of data availability for different data sets. More servers than necessary are thus used to store replicas of rarely-used data, which leads to increased energy consumption. To address this issue, we propose an energy-efficient storage strategy for cloud datacenters based on a novel hypergraph coverage model. According to users' requirements of data availability in different applications, our proposed algorithm can selectively determine the corresponding minimum hyperedge coverage, which represents the minimum set of data nodes required in the datacenter. Hence, some other data nodes can be turned off for the purpose of energy saving. We have also implemented our proposed algorithm as a dynamic runtime strategy in an HDFS-based prototype datacenter for performance evaluation. Experimental results show that the variable hypergraph coverage-based strategy can not only reduce energy consumption, but can also improve the network performance in the datacenter.
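
To make the coverage idea concrete, here is a hedged sketch that treats each candidate data node as the set of data-set replicas it hosts and greedily keeps nodes on until every data set meets its required copy count; the rest could be powered down. This is plain greedy set multi-cover under assumed inputs, not the paper's exact hypergraph algorithm.

```python
def min_active_nodes(node_contents, required_copies):
    remaining = dict(required_copies)            # data set -> copies still needed
    active = []
    nodes = dict(node_contents)
    while any(v > 0 for v in remaining.values()) and nodes:
        # Pick the node that satisfies the most still-needed copies.
        best = max(nodes, key=lambda n: sum(1 for d in nodes[n] if remaining.get(d, 0) > 0))
        gain = [d for d in nodes[best] if remaining.get(d, 0) > 0]
        if not gain:
            break                                # remaining demand cannot be met
        active.append(best)
        for d in gain:
            remaining[d] -= 1
        del nodes[best]
    return active, remaining


if __name__ == "__main__":
    hosts = {"n1": {"A", "B"}, "n2": {"B", "C"}, "n3": {"A", "C"}, "n4": {"C"}}
    need = {"A": 1, "B": 1, "C": 2}              # per-data-set availability requirement
    keep_on, unmet = min_active_nodes(hosts, need)
    print("nodes to keep on:", keep_on,
          "unmet demand:", {k: v for k, v in unmet.items() if v > 0})
```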

Patent
19 Dec 2017
TL;DR: In this article, the authors present a file system that may support data management for a distributed data storage and computing system, such as Apache™ Hadoop®, which may include an expandable tree-based indexing framework that enables convenient expansion of the file system.
Abstract: Disclosed is a file system that may support data management for a distributed data storage and computing system, such as Apache™ Hadoop®. The file system may include an expandable tree-based indexing framework that enables convenient expansion of the file system. As a non-limiting example, the file system disclosed herein may enable indexing, storage, and management of a billion or more files, which is 1,000 times the capacity of currently available file systems. The file system includes a root index system and a number of leaf index systems that are organized in a tree data structure. The leaf index systems provide heartbeat information to the root index system to enable the root index system to maintain a lightweight and searchable index of file references and leaf index references. Each of the leaf indexes maintains an index or mapping of file references to file block addresses within data storage devices that store files.

Patent
04 Jan 2017
TL;DR: A storage method under a cloud computing platform is proposed, in which a cloud data backup system based on the Hadoop distributed file system is constructed.
Abstract: The invention discloses a storage method under a cloud computing platform, which comprises the following steps: (1) a cloud data backup system based on the Hadoop distributed file system is constructed and physically divided into a client, a backup server, and a Hadoop distributed file system cluster; (2) the client stores information about the backup server that provides service for the local machine, and when backup or recovery is needed, a corresponding request is sent to the backup server; and (3) the backup server receives the client's request and carries out backup and recovery of files, managing files through splitting: before a file is uploaded, it is split into small file blocks which are then uploaded, and when the file is recovered, its file blocks are first downloaded and, once all blocks have been downloaded, merged back into the original file. The invention thus provides a novel storage method based on the cloud computing platform and improves file storage efficiency.
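
A minimal sketch of the split-before-upload / merge-after-download step described in the abstract. The 4 MB block size and file paths are illustrative, and writing the parts to a local directory stands in for the transfer to the HDFS-backed backup store.

```python
import os

BLOCK_SIZE = 4 * 1024 * 1024


def split_file(path, out_dir):
    """Split a file into fixed-size blocks named so that lexical order = original order."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    with open(path, "rb") as src:
        for i, chunk in enumerate(iter(lambda: src.read(BLOCK_SIZE), b"")):
            part = os.path.join(out_dir, f"{os.path.basename(path)}.part{i:05d}")
            with open(part, "wb") as dst:
                dst.write(chunk)
            parts.append(part)
    return parts


def merge_file(parts, out_path):
    with open(out_path, "wb") as dst:
        for part in sorted(parts):               # restore original order
            with open(part, "rb") as src:
                dst.write(src.read())


if __name__ == "__main__":
    with open("demo.bin", "wb") as f:
        f.write(os.urandom(10 * 1024 * 1024))
    blocks = split_file("demo.bin", "backup_blocks")
    merge_file(blocks, "restored.bin")
    print(open("demo.bin", "rb").read() == open("restored.bin", "rb").read())
```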

Proceedings ArticleDOI
14 May 2017
TL;DR: This paper details the techniques and optimizations that enable HopsFS to surpass 1 million file system operations per second - at least 16 times higher throughput than HDFS.
Abstract: HopsFS is an open-source, next-generation distribution of the Apache Hadoop Distributed File System (HDFS) that replaces the main scalability bottleneck in HDFS, its single-node in-memory metadata service, with a shared-nothing distributed system built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables significantly larger cluster sizes, more than an order of magnitude higher throughput, and significantly lower client latencies for large clusters. In this paper, we detail the techniques and optimizations that enable HopsFS to surpass 1 million file system operations per second, at least 16 times higher throughput than HDFS. In particular, we discuss how we exploit recent high-performance features from NewSQL databases, such as application-defined partitioning, partition-pruned index scans, and distribution-aware transactions. Together with more traditional techniques, such as batching and write-ahead caches, we show how many incremental optimizations have enabled a revolution in distributed hierarchical file system performance.

Patent
15 Feb 2017
TL;DR: A data migration method and device for distributed file systems are proposed, consisting of the following steps: periodically acquiring the working state information of the data nodes in a distributed file system; judging, based on that working state information, whether any disks in a data node are working abnormally; determining, based on a data block attribute list, at least one data block stored on the abnormally working disks; and, for each such data block, selecting a target data node from the other data nodes and migrating the data block to it.
Abstract: The invention relates to a data migration method and device, and belongs to the field of distributed technology. The method comprises the following steps: periodically acquiring the working state information of the data nodes in a distributed file system; judging, based on that working state information, whether any disks in a data node are working abnormally; if so, determining, based on a data block attribute list, at least one data block stored on the abnormally working disks; for each such data block, selecting a target data node from among the other data nodes; and migrating the data block to the target data node. The data migration process is completed automatically by the distributed file system when a data node goes down, a disk is damaged, or similar failures occur, without manual intervention by an operator, so the method and device are simple and convenient and realize real-time migration of data.
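
A hedged sketch of this migration flow: poll node and disk health, find the blocks residing on abnormal disks via a block-attribute list, then pick a healthy target node for each block. The health reports and the free-space-based selection rule are illustrative assumptions.

```python
def find_blocks_to_migrate(node_reports, block_locations):
    """node_reports: node -> {disk: "ok"|"failed"}; block_locations: block -> (node, disk)."""
    bad = {(n, d) for n, disks in node_reports.items()
           for d, state in disks.items() if state != "ok"}
    return [b for b, loc in block_locations.items() if loc in bad]


def pick_target(source_node, node_reports, free_space):
    """Choose a fully healthy node (other than the source) with the most free space."""
    healthy = [n for n, disks in node_reports.items()
               if n != source_node and all(s == "ok" for s in disks.values())]
    return max(healthy, key=lambda n: free_space[n]) if healthy else None


if __name__ == "__main__":
    reports = {"dn1": {"sdb": "failed", "sdc": "ok"},
               "dn2": {"sdb": "ok"}, "dn3": {"sdb": "ok"}}
    locations = {"blk_001": ("dn1", "sdb"), "blk_002": ("dn1", "sdc")}
    space = {"dn1": 100, "dn2": 500, "dn3": 800}
    for blk in find_blocks_to_migrate(reports, locations):
        print(blk, "->", pick_target(locations[blk][0], reports, space))
```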

Journal ArticleDOI
TL;DR: A distributed caching scheme to efficiently access small files in the Hadoop distributed file system that reduces the volume of metadata to manage in the NameNode by combining and storing multiple small files in a block.
Abstract: In this paper, we propose a distributed caching scheme to efficiently access small files in Hadoop distributed file system. The proposed scheme reduces the volume of metadata to manage in the NameNode by combining and storing multiple small files in a block. In addition, it reduces unnecessary accesses by maintaining information on requested files using client cache and DataNode cache, and synchronizing metadata of the client cache. The client cache maintains small files requested by users and metadata, and each DataNode cache maintains small files frequently requested by users. Performance evaluation shows that the proposed distributed cache management scheme significantly outperforms existing schemes in small file access costs.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: The data value is introduced and redefined, and a heterogeneous storage architecture based on the data value is proposed; it can dynamically calculate the data value, choose the appropriate storage strategy according to that value, and improve the overall performance of the system.
Abstract: The rapid development of the Internet and the arrival of the big data age have triggered the problem of storing massive data. Traditional storage systems have difficulty solving the storage problems of big data; a heterogeneous storage hierarchy for distributed storage, built on domestic servers usually configured with large-capacity solid state drives and hard disk drives, provides the best solution to the big data storage problem. In this paper, we introduce and redefine the data value, and propose a heterogeneous storage architecture based on the data value, which can dynamically calculate the data value, choose the appropriate storage strategy according to that value, and improve the overall performance of the system. Finally, the paper compares and analyzes the performance of different strategies and values based on the Hadoop distributed file system.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: An improved dynamic replica creation strategy based on file heat and node load, combined with the characteristics of the hybrid cloud environment, is presented in this paper; it can adaptively adjust the number of copies, reduce the average response time, and achieve better load balance across the cluster.
Abstract: Replica creation strategy is one of the important research directions for distributed file systems in the hybrid cloud environment. However, traditional replica creation strategies simply calculate file heat based on the number of accesses to the file within a period of time; moreover, creating too many copies without considering the node load seriously affects node performance. To solve this problem, an improved dynamic replica creation strategy based on file heat and node load, combined with the characteristics of the hybrid cloud environment, is presented in this paper. Historical file heat, the current access frequency over three cycles, and the rate of change of the file are considered comprehensively in the heat calculation, which is based on LRFU (Least Recently/Frequently Used). Combined with the node load, the average heat and the average load are used to adjust the number of copies, which allows the strategy to adapt dynamically to changes in the environment. Experiments show that, as file access patterns and traffic intensity change, the improved strategy is sensitive to access frequency and can adaptively adjust the number of copies, reduce the average response time, and achieve better load balance across the cluster.
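
Below is a sketch of an LRFU-style heat score combined with node load, under assumed weights and thresholds; the paper's exact formula (three history cycles, file change rate, etc.) is not reproduced here.

```python
import math


def lrfu_heat(access_times, now, decay=0.05):
    """Each past access contributes 0.5^(decay * age): recent and frequent => hot."""
    return sum(0.5 ** (decay * (now - t)) for t in access_times)


def replica_count(heat, avg_heat, node_load, avg_load, base=2, max_replicas=6):
    extra = math.floor(heat / max(avg_heat, 1e-9))       # hotter than average => more copies
    if node_load > 1.5 * avg_load:                       # overloaded nodes damp replica growth
        extra = max(extra - 1, 0)
    return min(base + extra, max_replicas)


if __name__ == "__main__":
    now = 1000.0
    hot_file = [now - dt for dt in (1, 3, 5, 8, 13, 20)]
    cold_file = [now - dt for dt in (400, 900)]
    h_hot, h_cold = lrfu_heat(hot_file, now), lrfu_heat(cold_file, now)
    avg = (h_hot + h_cold) / 2
    print(replica_count(h_hot, avg, node_load=0.4, avg_load=0.5))   # more replicas
    print(replica_count(h_cold, avg, node_load=0.4, avg_load=0.5))  # stays near base
```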

Patent
15 Dec 2017
TL;DR: In this article, an access control method for distributed storage under a cloud environment is proposed, which adopts an HDFS distributed file system of a Hadoop cluster to serve as a basic cloud storage system, and a safe access control function is added on the basis of the basic Cloud storage system.
Abstract: The invention relates to an access control method for distributed storage under a cloud environment. The method adopts the HDFS distributed file system of a Hadoop cluster as the basic cloud storage system, and adds a secure access control function on top of it. Access control in the cloud storage system is realized via Ranger, and a fine-grained role-based access control authorization system is built, so that the cloud storage system can reliably support effective isolation and integrity protection for different levels or types of information belonging to multiple users, achieving isolation of cloud data. Access control to specific data nodes in the cloud storage system is achieved via Kerberos, which solves the access control problems within the Hadoop cluster: between the client and the management node, between the management node and the data nodes, and among the data nodes.

Patent
08 Mar 2017
TL;DR: In this article, a file retrieving system based on an HDFS (Hadoop Distributed File System) is described, which comprises a system configuring module, a file management module, an index management module and a retrieving portal module.
Abstract: The invention discloses a file retrieving system based on an HDFS (Hadoop Distributed File System). The file retrieving system comprises a system configuring module, a file management module, an index management module, a retrieving portal module, a MongoDB database, an HDFS cluster, a Spark cluster and an ElasticSearch cluster, wherein the file management module is used for storing files into the HDFS cluster; the index management module is used for creating indexes through the Spark cluster and storing the indexes into the ElasticSearch cluster; the retrieving portal module is used for transmitting retrieval conditions to the ElasticSearch cluster to perform index matching in order to obtain retrieval results; and the MongoDB database is used for storing records generated in the file retrieving process. In the file retrieving system, the HDFS cluster, the Spark cluster and the ElasticSearch cluster are distributed clusters, so the query load is relieved and query efficiency is increased. A client-server architecture is adopted, so horizontal extensibility and stability are achieved, the overall processing capability of the clusters is improved, and the working state of the system is stable. A redundant replica strategy is adopted, so index reliability and index integrity can be ensured.