Server-side I/O coordination for parallel file systems
Summary
1. INTRODUCTION
* This author has now joined the R&D center of Dawning Information.
- Large-scale data-intensive supercomputing relies on parallel file systems, such as Lustre [1] , GPFS [22] , PVFS [9] , and PanFS [18] for high-performance I/O.
- Many high-performance computing (HPC) applications have become "I/O bound", unable to scale with increasing compute power.
- Multiple clients access data from a parallel file system independently, and there is explicit synchronization among these I/O clients, also known as Inter-request Synchronization.
- These requests are likely to be served in different order on different file servers because they are scheduled independently.
- Thus the average completion time is: Tavg = 4t.
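The effect can be made concrete with a small simulation. The following is a toy model, not the paper's exact figure: m requests each place one unit-time sub-request on each of n servers, and a request completes only when its slowest server finishes. A rotated (uncoordinated) schedule makes every request finish last somewhere, reproducing Tavg = 4t; a common order brings the average down.

```python
# Toy model: m requests, each split into one sub-request of duration t=1
# on each of n servers. A request completes only when its slowest server
# has finished its part.

def avg_completion(orders):
    """orders[s] is the service order (list of request ids) on server s."""
    m = len(orders[0])
    finish = [0] * m
    for order in orders:
        for pos, req in enumerate(order):
            finish[req] = max(finish[req], pos + 1)  # t = 1 per sub-request
    return sum(finish) / m

n = m = 4
# Uncoordinated worst case: each server rotates the order, so every request
# is served last on some server and completes at time m.
uncoordinated = [[(r + s) % m for r in range(m)] for s in range(n)]
# Coordinated: all servers serve requests in the same order.
coordinated = [list(range(m)) for _ in range(n)]

print(avg_completion(uncoordinated))  # 4.0 -> matches Tavg = 4t
print(avg_completion(coordinated))    # 2.5 -> (1 + 2 + 3 + 4) / 4
```

The gap between the two averages is exactly the synchronization penalty the paper targets.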
- The contribution of this paper is four-fold. First, the authors identify and analyze the impact of data synchronization on I/O performance in parallel file systems.
- Second, the authors propose an effective server-side I/O coordination scheme for parallel I/O systems to reduce the average completion time of I/O requests, and thus to alleviate the performance penalties of data synchronization.
- Third, the authors implement a prototype of the I/O coordination scheme in PVFS2 and MPI-IO.
- Finally, the authors evaluate the proposed scheme both analytically and experimentally.
- Experimental and analytical results are discussed in Section 5.
2. THE IMPACT OF DATA SYNCHRONIZATION
- Data synchronization is common in parallel file systems, where I/O requests usually consist of multiple pieces of data access in multiple file servers and will not complete until all involved servers have completed their parts.
- Each file server was installed with a 7200RPM SATA II 250GB hard disk drive (HDD), a PCI-E X4 100GB solid state disk (SSD), and the interconnection was 4X InfiniBand.
- The number of concurrent IOR instances was 10, to simulate 10 concurrent applications.
- Figure 3 shows the finish time of different requests on different file servers.
- The results also reveal that there is a significant potential to shorten completion time by coordinated I/O scheduling on file servers.
3. I/O COORDINATION
- To reduce the overhead of data synchronization, the authors propose a server-side I/O coordination scheme that re-arranges I/O requests on file servers, so that requests from the same application are serviced in the same order on all involved nodes.
- The authors allocate an integer value for each application running on the cluster.
- According to the definition, all I/O requests from one application have the same 'Application ID'.
- For applications with multiple parallel processes, such as MPI programs, there might be large amounts of data synchronization.
- In a system with many concurrent clients, a request issued earlier might get a later arrival time on some file servers.
3.1 Algorithm
- These I/O requests might come from multiple applications.
- In the same 'Time Window', I/O requests are ordered by the value of 'Application ID'; while in different 'Time Windows', requests in an earlier window would be serviced prior to those in a later one.
- It takes both performance and fairness into consideration.
- Figure 4 illustrates how the I/O coordination algorithm works in parallel file systems.
- The scheduler on each file server then reorders the requests in each 'Time Window' by 'Application ID', so that requests from one application can be serviced at the same time on all file servers, as shown in subfigure (c).
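The ordering rule can be sketched in a few lines. The window length and field names below are illustrative assumptions, not the paper's exact implementation: each server derives the same sort key from the client-supplied timestamp and application ID, so all servers independently converge on the same order.

```python
# Sketch of the server-side ordering rule: earlier 'Time Window' first,
# then 'Application ID' within a window. WINDOW is a tunable assumption.
WINDOW = 1.0  # window length in seconds

def coordination_key(request_time, app_id):
    """Requests in an earlier window go first; within one window,
    requests are ordered by Application ID."""
    window_index = int(request_time // WINDOW)
    return (window_index, app_id)

# Example: three requests from two applications, as (request_time, app_id).
reqs = [(0.9, 2), (0.1, 1), (1.2, 1)]
ordered = sorted(reqs, key=lambda r: coordination_key(*r))
print(ordered)  # [(0.1, 1), (0.9, 2), (1.2, 1)]
```

Note that the request at 0.9s from application 2 is served before the later request from application 1, because window order dominates and the two earlier requests share window 0.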
3.2 Completion Time Analysis
- Assume that the number of file servers is n, the number of concurrent applications is m, and that each application needs to access data on all file servers (for simplicity).
- The average completion time can be represented as Formula (1), where F(k) denotes the cumulative distribution function and f(x) the probability density function.
- With the I/O coordination strategy, all file servers serve applications one at a time.
- As the number of concurrent applications m increases, the reduction in average completion time approaches 50%.
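A Monte Carlo estimate illustrates this trend under a simplified model of the analysis above (independent uniformly random service orders per server; unit service time per application; all assumptions mine, not the paper's exact derivation):

```python
import random

random.seed(0)

def expected_avg_completion(n, m, trials=2000):
    """Monte Carlo: each of n servers serves m applications in an
    independent random order; an application completes at the 1-based
    position of its slowest server."""
    total = 0.0
    for _ in range(trials):
        finish = [0] * m
        for _ in range(n):
            order = list(range(m))
            random.shuffle(order)
            for pos, app in enumerate(order):
                finish[app] = max(finish[app], pos + 1)
        total += sum(finish) / m
    return total / trials

n, m = 16, 32
uncoord = expected_avg_completion(n, m)
coord = (m + 1) / 2        # all servers share one order: avg = (m+1)/2
print(1 - coord / uncoord)  # relative reduction, tending toward ~0.5
```

With coordination the average is (m+1)/2, while without it each application tends to complete near position m on its slowest server, so the relative reduction approaches one half as m grows.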
4. IMPLEMENTATION
- The authors have implemented the server-side I/O coordination scheme under PVFS2 [9] and MPI-IO.
- PVFS2 is an open source parallel file system developed jointly by Clemson University and Argonne National Laboratory.
- It is a virtual parallel file system for Linux clusters based on underlying native file systems on storage nodes.
- The prototype implementation includes modifications to the PVFS2 request scheduling module and the PVFS2 driver package in ROMIO [26] MPI-IO library.
4.1 Implementation in PVFS2
- The authors modified the client interface and server side request scheduler in PVFS2.
- The authors utilize the 'PVFS_hint' mechanism to pass the two parameters between I/O clients and file servers.
- When a file server receives a request, the scheduler first calculates its priority, and then inserts the request into the request queue in ascending order of priority.
- Therefore, all I/O requests are serviced in the order of req_prior.
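One plausible encoding of req_prior, together with the ordered insertion, can be sketched as follows. The window-dominant integer encoding and the MAX_APPS bound are my assumptions, not PVFS2's actual code:

```python
import bisect

# Assumed encoding: the 'Time Window' index dominates, and 'Application ID'
# breaks ties inside a window, mirroring the ordering rule of Section 3.1.
MAX_APPS = 1 << 16
WINDOW = 1.0

def req_prior(request_time, app_id):
    return int(request_time // WINDOW) * MAX_APPS + app_id

queue = []  # kept in ascending req_prior order, i.e. service order
for t, app in [(0.2, 7), (0.3, 3), (1.1, 3), (0.8, 7)]:
    prio = req_prior(t, app)
    bisect.insort(queue, (prio, t, app))  # ordered insert, O(n) shift

print([(t, app) for _, t, app in queue])
# [(0.3, 3), (0.2, 7), (0.8, 7), (1.1, 3)]
```

Within window 0, application 3's request jumps ahead of application 7's earlier request, while the window-1 request waits at the tail, exactly the behavior the scheduler needs for a deterministic cross-server order.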
4.2 Implementation in MPI-IO Library
- The authors also modified the PVFS2 driver in ROMIO [26] to pass 'Request Time' and 'Application ID' via 'PVFS_hint'.
- 'Application ID' is generated the first time when an MPI program calls function MPI_File_open, and then it is broadcast to all MPI processes.
- For system performance tuning, the authors also provide a configuration interface for parallel file system administrators.
- ROMIO [26] is a high-performance, portable implementation of MPI-IO, providing applications with a uniform interface in the top layer, and dealing with data access to various file systems by an internal abstract I/O device layer called ADIO.
The authors illustrate this with an example call to the PVFS2 data read interface, with 'Request Time' and 'Application ID' attached as hints.
These code modifications in the MPI-IO library are transparent to application programmers and users.
- There is no need to modify the application source code; the user can simply relink the program against the modified MPI-IO library.
- The request time is one of the primary factors used for request reordering on file servers in the proposed I/O coordination strategy.
- For this reason, the clocks of all machines in the large-scale system must be synchronized.
- In their implementation, the request time is generated in the MPI-IO library on the client side, so all client machines must share a common clock.
- Clock skew of client nodes may lead to unexpected requests service orders, especially for the collective I/O synchronization and inter-request synchronization cases.
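A toy example (with invented numbers) shows how skew inverts the intended service order: client A issues first in real time, but its fast clock stamps its request later than client B's.

```python
# Two clients issue requests in true order A-then-B, but client A's clock
# runs 50 ms fast, so the timestamped order seen by the scheduler flips.
skew = {"A": +0.05, "B": 0.0}              # per-client clock error (seconds)
true_issue = [("A", 10.00), ("B", 10.02)]  # (client, true wall-clock time)

stamped = [(t + skew[c], c) for c, t in true_issue]
service_order = [c for _, c in sorted(stamped)]
print(service_order)  # ['B', 'A'] -- skew inverted the intended order
```

In practice the skew only matters when it is comparable to the inter-request gap, which is why tight clock synchronization (e.g. NTP) is a prerequisite for the scheme.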
5.1 Experiments Setup
- The authors' experiments were conducted on a 65-node Sun Fire Linux-based cluster, with one head node and 64 computing nodes.
- The computing nodes were Sun Fire X2200 servers, each with dual 2.3GHz Opteron quad-core processors, 8GB memory, and a 250GB 7200RPM SATA hard drive.
- All 65 nodes were connected with Gigabit Ethernet.
- MPI-TILE-IO and Noncontig are designed to test the performance of MPI-IO for non-contiguous access workloads.
- Before each run, the authors flushed memory to avoid the impact of memory cache and buffer.
5.2 Results and Analysis
- First, the authors conducted experiments to evaluate the completion time of I/O requests with the proposed I/O coordination strategy, comparing it against the original scheduling strategy (without I/O coordination) in PVFS2.
- The authors then compared the average completion time under different numbers of concurrent applications.
- Next the authors conducted experiments to evaluate the scalability of the proposed I/O coordination strategy.
- As the number of file servers increases, the completion-time reduction stays around 46% in the 64-node HDD environment and 39% in the 16-node SSD environment.
- The request sizes of all programs were 128 KB, and the stripe size was 4 KB.
6.1 Server-side I/O Scheduling in Parallel File Systems
- In order to obtain sustained peak I/O performance, a collection of I/O scheduling techniques have been developed for server-side I/O scheduling in parallel file systems, such as disk-directed I/O [13], server-directed I/O [23], and stream-based I/O [11, 21].
- To the best of their knowledge, little effort has been devoted to reducing the average completion time of I/O requests of multiple applications for multiple file servers.
- Numerous research efforts have been devoted to improving quality of service (QoS) of I/O requests in distributed or parallel storage systems [4, 8, 10, 12, 19, 29] .
- Some of them adopted deadline-driven strategies [12, 19, 31] that allow the upper layer to specify latency and throughput goals of file servers and schedule the requests based on Earliest Deadline First (EDF) [16] or its variants [19, 20, 31].
- Moreover, their approach takes into consideration multiple file servers.
6.2 Coordinated scheduling
- Coordinated scheduling has been recognized as an effective approach to obtain efficient execution for parallel or distributed environments.
- The scheduler packs synchronized processes into gangs and schedules them simultaneously, to alleviate performance penalties of communicative synchronization.
- Feitelson et al. [5] made a comparison of various packing schemes for gang scheduling, and evaluated them under different cases.
- Zhang et al. [32] proposed an inter-server coordination technique in parallel file systems to improve the spatial locality and program reuse distance.
- The motivation and methodology of the two approaches are very different.
Frequently Asked Questions (12)
Q2. What are the future works in "Server-side i/o coordination for parallel file systems" ?
In the future, the authors plan to investigate optimization of the I/O coordination strategy based on application data access patterns. The authors also plan to add a minimum group communication in I/O coordination, to explore its feasibility for imbalanced data access workloads.
Q3. How does the average completion time decrease after request re-ordering?
In other words, after requests re-ordering, the average completion time decreases from 4t to 3t, which reveals a significant potential for shortening average completion time through request re-ordering at the file servers.
Q4. Why did some requests with high priority arrive late on some file servers?
The reason is that, due to nonuniform network delays, some requests with low priority were already issued to the storage devices by the time requests with high priority arrived on some file servers.
Q5. What is the proposed I/O scheduling algorithm?
The proposed I/O scheduling algorithm is based on the observation that requests from the same application have a better locality and, equally important, the execution will be optimized if these requests finish at the same time.
Q6. What is the prototype implementation of PVFS2?
The prototype implementation includes modifications to the PVFS2 request scheduling module and the PVFS2 driver package in ROMIO [26] MPI-IO library.
Q7. What is the way to schedule parallel file systems?
Parallel file systems consisting of high-performance storage devices should set a short time window, and those consisting of lower-performance storage devices should set a relatively large window size.
Q8. What is the common classification of I/O requests in parallel computing?
Synchronization of I/O requests across processes is common in parallel computing, and can be classified into three categories (as shown in Figure 1).
Q9. How do the orders of data accesses differ across file servers?
The experimental results also indicate that, due to the independent scheduling strategy on each file server, data accesses are finished in different orders for concurrent applications.
Q10. How do the authors reduce the wait time of socket connections?
These techniques succeed in achieving high bandwidth in disks and networks of file servers, by reducing either the frequency of disk seeks, or the waiting time of socket connections.
Q11. What is the maximum finish time in the SSD environment?
From subfigures (b) and (d), the authors observe that the maximum finish time is, on average, 4.4 times the minimum in the HDD environment and 3.1 times in the SSD environment.
Q12. How can the proposed I/O coordination scheme reduce the I/O completion time?
The experimental results demonstrate that, compared to the conventional data access strategy, the proposed I/O coordination scheme can reduce the I/O completion time by up to 46% and provide a comparable I/O bandwidth.