Proceedings ArticleDOI

SoMeta: Scalable Object-Centric Metadata Management for High Performance Computing

TL;DR
SoMeta, a scalable and decentralized metadata management approach for object-centric storage in HPC systems, is presented; it provides a flat namespace that is dynamically partitioned, a tagging approach that lets metadata be efficiently searched and updated, and a light-weight, fault-tolerant management strategy.
Abstract
Scientific data sets, which grow rapidly in volume, often carry rich metadata, such as information about their associated experiments or simulations. Without effective metadata management, these data sets become difficult to utilize and their value is lost over time. Ideally, metadata should be managed along with its corresponding data by a single storage system, where it can be accessed and updated directly. However, existing storage systems in high-performance computing (HPC) environments, such as the Lustre parallel file system, still use a static metadata structure composed of a fixed, non-extensible set of attributes. The burden of metadata management therefore falls upon end-users and requires ad-hoc metadata management software to be developed. With the advent of "object-centric" storage systems, there is an opportunity to solve this issue. In this paper, we present SoMeta, a scalable and decentralized metadata management approach for object-centric storage in HPC systems. It provides a flat namespace that is dynamically partitioned, a tagging approach to manage metadata that can be efficiently searched and updated, and a light-weight, fault-tolerant management strategy. In our experiments, SoMeta achieves up to 3.7X speedup over Lustre for common metadata operations, and is up to 16X faster than SciDB and MongoDB for advanced metadata operations such as adding and searching tags. Additionally, in contrast to existing storage systems, SoMeta offers scalable user-space metadata management by allowing users to specify the number of metadata servers depending on their workload.
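The design the abstract describes — a flat object namespace, hash-partitioned across a user-chosen number of metadata servers, with free-form tags that can be added and searched — can be sketched as below. This is a minimal illustrative model, not SoMeta's actual API: all class and method names are hypothetical, and real metadata servers would be separate networked processes rather than in-process objects.

```python
# Hypothetical sketch of SoMeta-style tag-based metadata management.
# A flat namespace is hash-partitioned over N metadata servers; each
# object carries an extensible set of key/value tags. Names are
# illustrative assumptions, not SoMeta's real interface.
from zlib import crc32

class MetadataServer:
    """One partition of the flat namespace."""
    def __init__(self):
        self.store = {}  # object name -> {tag key: tag value}

    def create(self, name, tags):
        self.store[name] = dict(tags)

    def add_tag(self, name, key, value):
        self.store[name][key] = value

    def search(self, key, value):
        # scan this partition for objects whose tag matches
        return [n for n, tags in self.store.items() if tags.get(key) == value]

class FlatNamespace:
    def __init__(self, n_servers=4):
        # users choose the server count to match their workload
        self.servers = [MetadataServer() for _ in range(n_servers)]

    def _partition(self, name):
        # hashing the full object name spreads entries evenly across
        # partitions, avoiding the hot directories of a hierarchy
        return self.servers[crc32(name.encode()) % len(self.servers)]

    def create(self, name, **tags):
        self._partition(name).create(name, tags)

    def add_tag(self, name, key, value):
        self._partition(name).add_tag(name, key, value)

    def search(self, key, value):
        # a tag search fans out to every partition and merges results
        hits = []
        for s in self.servers:
            hits += s.search(key, value)
        return hits

ns = FlatNamespace(n_servers=4)
ns.create("run42/temperature", experiment="climate-2017", rank=0)
ns.add_tag("run42/temperature", "units", "kelvin")
print(ns.search("experiment", "climate-2017"))  # ['run42/temperature']
```

Object creation and tag updates touch exactly one partition, which is why such operations scale with the server count, while tag searches must fan out to all partitions.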


Citations
Proceedings ArticleDOI

Toward scalable and asynchronous object-centric data management for HPC

TL;DR: This paper formulates an object-centric data abstraction, named Proactive Data Containers (PDC), and its mappings to different levels of the storage hierarchy; the prototype achieves performance comparable with HDF5 and PLFS in reading and writing data at small scale, and outperforms them at scales larger than 10K cores.
Journal ArticleDOI

Survey of Storage Systems for High-Performance Computing

TL;DR: A thorough understanding of today's storage infrastructures, including their strengths and weaknesses, is crucially important for designing and implementing scalable storage systems suitable for demands of exascale computing.
Journal ArticleDOI

An Integrated Indexing and Search Service for Distributed File Systems

TL;DR: The evaluation demonstrates that TagIt, which is integrated into two popular distributed file systems (GlusterFS and CephFS), can expedite data search operations by up to 10X over the extant decoupled approach.
Proceedings ArticleDOI

Using a Robust Metadata Management System to Accelerate Scientific Discovery at Extreme Scales

TL;DR: This paper demonstrates that an RDBMS is a viable technology for managing data-oriented metadata, offering significant performance advantages over HDF5 with metadata querying that is 150X to 650X faster, and greatly accelerating post-processing.
References
Journal ArticleDOI

The Google file system

TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Proceedings ArticleDOI

The Hadoop Distributed File System

TL;DR: The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.
Proceedings ArticleDOI

Ceph: a scalable, high-performance distributed file system

TL;DR: Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
Proceedings Article

GPFS: A Shared-Disk File System for Large Computing Clusters

TL;DR: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters; the paper discusses how distributed locking and recovery techniques were extended to scale to large clusters.
Journal ArticleDOI

A cost-effective, high-bandwidth storage architecture

TL;DR: Measurements of the prototype NASD system show that these services can be cost-effectively integrated into a next-generation disk drive ASIC, and show scalable bandwidth for NASD-specialized file systems.