Proceedings ArticleDOI

Server-side I/O coordination for parallel file systems

TL;DR: Experimental results demonstrate that the proposed server-side I/O coordination scheme can reduce average completion time by 8% to 46%, and provide higher I/O bandwidth than that of default data access strategies adopted by PVFS2 for heavy I/O workloads.
Abstract: Parallel file systems have become a common component of modern high-end computers to mask the ever-increasing gap between disk data access speed and CPU computing power. However, while working well for certain applications, current parallel file systems lack the ability to effectively handle concurrent I/O requests with data synchronization needs, whereas concurrent I/O is the norm in data-intensive applications. Recognizing that an I/O request will not complete until all involved file servers in the parallel file system have completed their parts, in this paper we propose a server-side I/O coordination scheme for parallel file systems. The basic idea is to coordinate file servers to serve one application at a time in order to reduce the completion time, and in the meantime maintain the server utilization and fairness. A window-wide coordination concept is introduced to serve our purpose. We present the proposed I/O coordination algorithm and its corresponding analysis of average completion time in this study. We also implement a prototype of the proposed scheme under the PVFS2 file system and MPI-IO environment. Experimental results demonstrate that the proposed scheme can reduce average completion time by 8% to 46%, and provide higher I/O bandwidth than that of default data access strategies adopted by PVFS2 for heavy I/O workloads. Experimental results also show that the server-side I/O coordination scheme has good scalability.

Summary (4 min read)

1. INTRODUCTION

  • Large-scale data-intensive supercomputing relies on parallel file systems, such as Lustre [1], GPFS [22], PVFS [9], and PanFS [18], for high-performance I/O.
  • Many high-performance computing (HPC) applications have become "I/O bounded", unable to scale with increasing compute power.
  • Multiple clients access data from a parallel file system independently, and there is explicit synchronization among these I/O clients, also known as Inter-request Synchronization.
  • These requests are likely to be served in different order on different file servers because they are scheduled independently.
  • Thus the average completion time is: T_avg = 4t.


  • The contribution of this paper is four-fold.
  • First, the authors present the data synchronization problems in parallel file systems.
  • Second, the authors propose an effective server-side I/O coordination scheme for parallel I/O systems to reduce the average completion time of I/O requests, and thus to alleviate the performance penalties of data synchronization.
  • Third, the authors implement a prototype of the I/O coordination scheme in PVFS2 and MPI-IO.
  • Finally, the authors evaluate the proposed scheme both analytically and experimentally.
  • Experimental and analytical results are discussed in Section 5.

2. THE IMPACT OF DATA SYNCHRONIZATION

  • Data synchronization is common in parallel file systems, where I/O requests usually consist of multiple pieces of data access in multiple file servers and will not complete until all involved servers have completed their parts.
  • Each file server was installed with a 7200RPM SATA II 250GB hard disk drive (HDD), a PCI-E X4 100GB solid state disk (SSD), and the interconnection was 4X InfiniBand.
  • The number of concurrent IOR instances was 10, to simulate 10 concurrent applications.
  • Figure 3 shows the finish time of different requests on different file servers.
  • The results also reveal that there is a significant potential to shorten completion time by coordinated I/O scheduling on file servers.

3. I/O COORDINATION

  • In order to reduce the overhead of data synchronization, the authors propose a server-side I/O coordination scheme which re-arranges I/O requests on file servers, so that requests are serviced in the same order in terms of applications on all involved nodes.
  • The authors allocate an integer value for each application running on the cluster.
  • According to the definition, all I/O requests from one application have the same 'Application ID'.
  • For applications with multiple parallel processes, such as MPI programs, there might be large amounts of data synchronization.
  • In a system with many concurrent clients, a request issued earlier might get a later arrival time on some file servers.

3.1 Algorithm

  • These I/O requests might come from multiple applications.
  • In the same 'Time Window', I/O requests are ordered by the value of 'Application ID'; while in different 'Time Windows', requests in an earlier window would be serviced prior to those in a later one.
  • It takes both performance and fairness into consideration.
  • Figure 4 illustrates how the I/O coordination algorithm works in parallel file systems.
  • The scheduler on each file server then reorders the requests in each 'Time Window' by 'Application ID', so that requests from one application can be serviced at the same time on all file servers, as shown in subfigure (c).

3.2 Completion Time Analysis

  • Assume that the number of file servers is n, the number of concurrent applications is m, and that each application needs to access data on all file servers (for simplicity).
  • The average completion time can be represented as Formula (1), where F(k) denotes the probability distribution function and f(x) the probability density function; both averages are restated compactly after this list.
  • With the I/O coordination strategy, all file servers serve applications one at a time.
  • As the number of concurrent applications m increases, the decrease rate is approaching 50%.
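
For reference, the two average completion times summarized above are derived in Section 3.2 of the full text; restated in the paper's notation (n file servers, m concurrent applications, per-sub-request service time t):

$$T_{avg} = mt - \frac{t}{m^{n}}\sum_{k=1}^{m-1} k^{n} \quad \text{(without coordination, Formula (1))}$$

$$T'_{avg} = \frac{1}{m}\sum_{k=1}^{m} kt = \frac{m+1}{2}\,t \quad \text{(with coordination, Formula (2))}$$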

4. IMPLEMENTATION

  • The authors have implemented the server-side I/O coordination scheme under PVFS2 [9] and MPI-IO.
  • PVFS2 is an open source parallel file system developed jointly by Clemson University and Argonne National Laboratory.
  • It is a virtual parallel file system for Linux clusters based on underlying native file systems on storage nodes.
  • The prototype implementation includes modifications to the PVFS2 request scheduling module and the PVFS2 driver package in ROMIO [26] MPI-IO library.

4.1 Implementation in PVFS2

  • The authors modified the client interface and server side request scheduler in PVFS2.
  • The authors utilize the 'PVFS_hint' mechanism to pass the two parameters between I/O clients and file servers.
  • When a file server receives a request, the scheduler first calculates its priority, and then inserts the request into the request queue in ascending order of priority.
  • Therefore, all I/O requests are serviced in the order of req_prior, as sketched below.
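
A minimal sketch of that priority computation, using the formula given later in the full text (Section 4.1); the function name and integer widths here are illustrative, not PVFS2 source code:

#include <stdint.h>

/* Priority per the paper's formula: requests sort by 'Time Window' first
   (req_time / interval), then by 'Application ID' within a window. The
   factor 32768 leaves room for app_id values in the documented 0..32767
   range. */
static int64_t request_priority(int64_t req_time_ms, int64_t interval_ms,
                                int32_t app_id)
{
    return req_time_ms / interval_ms * 32768 + app_id;
}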

4.2 Implementation in MPI-IO Library

  • The authors also modified the PVFS2 driver in ROMIO [26] to pass 'Request Time' and 'Application ID' via 'PVFS_hint'.
  • 'Application ID' is generated the first time an MPI program calls MPI_File_open, and it is then broadcast to all MPI processes.
  • For system performance tuning, the authors also provide a configuration interface for parallel file system administrators.
  • ROMIO [26] is a high-performance, portable implementation of MPI-IO, providing applications with a uniform interface in the top layer, and dealing with data access to various file systems by an internal abstract I/O device layer called ADIO.
  • Following is an example of calling the PVFS2 data read interface (the original listing is not preserved here; a hedged MPI sketch of the 'Application ID' broadcast follows).
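
Below is a sketch, under stated assumptions, of the 'Application ID' generation and broadcast described above, using standard MPI calls; the helper gen_app_id and its recipe are hypothetical, not the paper's code:

#include <mpi.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical ID generator: the paper only states that the ID is created
   the first time MPI_File_open is called and then broadcast. */
static int gen_app_id(void)
{
    srand((unsigned)time(NULL) ^ (unsigned)getpid());
    return rand() % 32768;   /* fits the documented 0..32767 range */
}

int main(int argc, char **argv)
{
    int rank, app_id = 0;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        app_id = gen_app_id();            /* generated once per application */
    /* Every process tags its I/O with the same 'Application ID'. */
    MPI_Bcast(&app_id, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    /* ... reads issued here would carry app_id and a request time as hints ... */
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}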

These code modifications in the MPI-IO library are transparent to application programmers and users.

  • There is no need to modify the application source code; the user can simply relink the program using the modified MPI-IO library.
  • The request time is one of the primary factors used for request reordering on file servers in the proposed I/O coordination strategy.
  • For this reason, the clock of all machines in the large-scale system must be synchronized.
  • In their implementation, the request time is generated in the MPI-IO library at the client side, so all the client machines must adopt the same clock.
  • Clock skew of client nodes may lead to unexpected requests service orders, especially for the collective I/O synchronization and inter-request synchronization cases.

5.1 Experiments Setup

  • The authors' experiments were conducted on a 65-node Sun Fire Linux-based cluster, with one head node and 64 computing nodes.
  • The computing nodes were Sun Fire X2200 servers, each with dual 2.3GHz Opteron quad-core processors, 8GB memory, and a 250GB 7200RPM SATA hard drive.
  • All 65 nodes were connected with Gigabit Ethernet.
  • MPI-TILE-IO and Noncontig are designed to test the performance of MPI-IO for non-contiguous access workloads.
  • Before each run, the authors flushed memory to avoid the impact of memory cache and buffer.

5.2 Results and Analysis

  • First, the authors conducted experiments to evaluate the completion time of I/O requests with the proposed I/O coordination strategy, comparing it with the original scheduling strategy (without I/O coordination) in PVFS2.
  • The authors then compared the average completion time with different numbers of concurrent applications.
  • Next the authors conducted experiments to evaluate the scalability of the proposed I/O coordination strategy.
  • As the number of file servers increases, the completion time decrease is around 46% for the 64-node HDD environment and 39% for the 16-node SSD environment.
  • The request sizes of all programs were 128 KB, and the stripe size was 4 KB.

6.1 Server-side I/O Scheduling in Parallel File Systems

  • In order to obtain sustained peak I/O performance, a collection of I/O scheduling techniques have been developed for the server side I/O scheduling of parallel file systems, such as disk-directed I/O [13] , server-directed I/O [23] , and stream-based I/O [11, 21] .
  • To the best of their knowledge, little effort has been devoted to reducing the average completion time of I/O requests of multiple applications for multiple file servers.
  • Numerous research efforts have been devoted to improving quality of service (QoS) of I/O requests in distributed or parallel storage systems [4, 8, 10, 12, 19, 29] .
  • Some of them adopted deadline-driven strategies [12, 19, 31], which allow the upper layer to specify latency and throughput goals of file servers and schedule the requests based on Earliest Deadline First (EDF) [16] or its variants [19, 20, 31].
  • Moreover, their approach takes into consideration multiple file servers.

6.2 Coordinated scheduling

  • Coordinated scheduling has been recognized as an effective approach to obtain efficient execution for parallel or distributed environments.
  • The scheduler packs synchronized processes into gangs and schedules them simultaneously, to alleviate performance penalties of communicative synchronization.
  • Feitelson et al. [5] made a comparison of various packing schemes for gang scheduling, and evaluated them under different cases.
  • Zhang et al. [32] proposed an inter-server coordination technique in parallel file systems to improve the spatial locality and program reuse distance.
  • The motivation and methodology of the design and implementation of their approach and the authors' approach are very different.




Server-Side I/O Coordination for Parallel File Systems
Huaiming Song
, Yanlong Yin
, Xian-He Sun
, Rajeev Thakur
, Samuel Lang
Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, USA
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
{huaiming.song, yyin2, sun}@iit.edu, {thakur, slang}@mcs.anl.gov
ABSTRACT
Parallel file systems have become a common component of mod-
ern high-end computers to mask the ever-increasing gap between
disk data access speed and CPU computing power. However, while
working well for certain applications, current parallel file systems
lack the ability to effectively handle concurrent I/O requests with
data synchronization needs, whereas concurrent I/O is the norm in
data-intensive applications. Recognizing that an I/O request will
not complete until all involved file servers in the parallel file sys-
tem have completed their parts, in this paper we propose a server-
side I/O coordination scheme for parallel file systems. The basic
idea is to coordinate file servers to serve one application at a time
in order to reduce the completion time, and in the meantime main-
tain the server utilization and fairness. A window-wide coordina-
tion concept is introduced to serve our purpose. We present the
proposed I/O coordination algorithm and its corresponding analy-
sis of average completion time in this study. We also implement
a prototype of the proposed scheme under the PVFS2 file system
and MPI-IO environment. Experimental results demonstrate that
the proposed scheme can reduce average completion time by 8%
to 46%, and provide higher I/O bandwidth than that of default data
access strategies adopted by PVFS2 for heavy I/O workloads. Ex-
perimental results also show that the server-side I/O coordination
scheme has good scalability.
Categories and Subject Descriptors
B.4.3 [Interconnections]: Parallel I/O; D.4.3 [File Systems Man-
agement]: Access methods
Keywords
server-side I/O coordination; parallel I/O synchronization; I/O op-
timization; parallel file systems
1. INTRODUCTION
* This author has now joined the R&D center, Dawning Information Industrial LLC, Beijing, China. Email: songhm@sugon.com
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SC11, November 12-18, 2011, Seattle, Washington, USA
Copyright 2011 ACM 978-1-4503-0771-0/11/11 ...$10.00.
Large-scale data-intensive supercomputing relies on parallel file
systems, such as Lustre [1], GPFS [22], PVFS [9], and PanFS[18]
for high-performance I/O. However, performance improvements in
computing capacity have vastly outpaced the improvements in I/O
performance in the past few decades and will likely continue in
the future. Many high-performance computing (HPC) applications
have become “I/O bounded”, unable to scale with increasing com-
pute power. The gap in performance between compute and I/O is
amplified further when multiple applications compete for limited
I/O and storage resources at the same time, as this leads to thrash-
ing scenarios within the HPC storage system. Parallel file systems
have difficulty handling I/O workloads of multiple applications for
two primary reasons. First, the file servers perform data accesses in
an interleaved fashion, resulting in excessive disk seeks. Second,
file servers perform I/O requests independently, without knowledge
of the order of requests performed at other servers, whereas HPC
applications tend to coordinate I/O across all processes. This sce-
nario leads to under-utilization of compute resources, as all com-
pute processes are held waiting for completion of an I/O request
that is delayed by the interleaved scheduling choices made by an
individual file server.
In general, data files are striped across all or a part of the file
servers in parallel file systems. One I/O request issued from a sin-
gle client often involves data accesses on multiple servers, and the
parallel I/O library has to merge the multiple data pieces from these
file servers together. Moreover, collective data access from multi-
ple clients, such as collective I/O in MPI-IO [26], has to wait for
all aggregators to complete. Synchronization of I/O requests across
processes is common in parallel computing, and can be classified
into the following three categories (as shown in Figure 1).
• Intra-request Synchronization: One I/O request issued by one client accesses data in multiple file servers. It needs to gather/scatter data pieces from/to multiple storage nodes and merge them together to complete the I/O request.
• Collective I/O Synchronization: Multiple I/O clients access data from multiple file servers collectively within a single application. It has to wait for all aggregators to complete their collective I/O operations before continuing.
• Inter-request Synchronization: Multiple clients access data from a parallel file system independently, and there is explicit synchronization among these I/O clients.
Figure 1 shows the three scenarios of data synchronization. The
first two categories are implicit synchronization and the third one is
explicit. In a large-scale and high-performance computing system,
the parallel file system is often shared by multiple applications.
When these applications run simultaneously, each file server may

[Figure 1: Three scenarios of data synchronization in parallel I/O: intra-request sync, collective I/O sync, and inter-request sync]
receive multiple I/O requests from different applications. However,
these requests are likely to be served in different order on differ-
ent file servers because they are scheduled independently. Figure 2
is an example of the I/O request scheduling in 4 file servers, and
there are 3 applications: A, B and C. Usually, the completion time
of each application depends on the completion time of the last file
server to finish the request. In the left subfigure, all nodes serve the
requests in different orders. The completion times of the I/O re-
quests from the three applications are: $T_A = 4t$; $T_B = 4t$; $T_C = 4t$. Thus the average completion time is $T_{avg} = 4t$. If we re-arrange the requests in the file servers, letting all nodes service the requests in the same order, as shown in the right part of Figure 2, the completion times are: $T_A = 2t$; $T_B = 3t$; $T_C = 4t$. The average completion time is $T_{avg} = 3t$. In other words, after request re-ordering, the average completion time decreases from 4t to 3t, which reveals a significant potential for shortening average completion time through request re-ordering at the file servers.
Figure 2: Order of request handling affects completion time.
In the left subfigure, service order is different on different file
servers, and the average completion time for the three applica-
tions is 4t. While in the right subfigure, requests are serviced
in concert, and the average completion time reduces to 3t.
Existing scheduling algorithms in parallel file systems, such as
disk-directed I/O [13], server-directed I/O [23], and stream-based
I/O [11, 21], focus on reducing data access overhead on either stor-
age nodes or network traffic, to improve throughput of each file
server. These approaches have demonstrated the importance of
scheduling in parallel file systems to improve performance. How-
ever, little attention has been paid to server-side I/O coordination
in order to reduce average completion time of multiple applications
competing for limited I/O resources. In this paper, we propose a
new server-side I/O coordination scheme for parallel file systems
that enables all file servers to schedule requests from different ap-
plications in a coordinated way, to reduce the synchronization time
across clients for multiple applications.
The contribution of this paper is four-fold. First, we present
the data synchronization problems in parallel file systems. Sec-
ond, we propose an effective server-side I/O coordination scheme
for parallel I/O systems to reduce the average completion time of
I/O requests, and thus to alleviate the performance penalties of data
synchronization. Third, we implement a prototype of the I/O co-
ordination scheme in PVFS2 and MPI-IO. Finally, we evaluate the
proposed scheme both analytically and experimentally.
The remainder of this paper is organized as follows. Section 2
examines the overhead of data synchronization without I/O coor-
dination. Section 3 describes the design of I/O coordination algo-
rithm and gives an analysis of completion time. Section 4 presents
the implementation of the proposed I/O scheme in PVFS2 and MPI-
IO. Experimental and analytical results are discussed in Section 5.
Section 6 reviews related work in server-side I/O scheduling and
parallel job scheduling. Finally, Section 7 concludes this study and
discusses potential future work.
2. THE IMPACT OF DATA SYNCHRONIZATION
Data synchronization is common in parallel file systems, where
I/O requests usually consist of multiple pieces of data access in
multiple file servers and will not complete until all involved servers
have completed their parts. However, due to independent schedul-
ing strategies on file servers, I/O requests with synchronization

[Figure 3 panels: (a) Finish time on different file servers (HDD); (b) Minimum and maximum finish time (HDD); (c) Finish time on different file servers (SSD); (d) Minimum and maximum finish time (SSD)]
Figure 3: The finish time of I/O requests from different applications on different file servers. This set of experiments used the intra-request synchronization scenario with 10 concurrent IOR instances and an 8-node PVFS2 system. The stripe size of PVFS2 was 64KB, and each IOR instance issued a 4MB contiguous read request to the PVFS2 system. Thereby every request involved all 8 file servers, and the size of requested data on one file server was 512KB. ‘App K’ (K=0..9) refers to an IOR instance, ‘FS N’ (N=0..7) refers to a file server. ‘MIN’ refers to the finish time of the first file server to complete, and ‘MAX’ refers to the finish time of the last file server to complete. The completion time of each application relies on the ‘MAX’ finish time for that application on all involved file servers.
needs from different applications are very likely to be served in
different orders on different file servers.
Understanding the impact of data synchronization in parallel I/O
systems is critical to efficiently improving completion time. In this
section, we evaluate the request completion time when file servers
serve requests from multiple applications simultaneously. We em-
ployed 8 nodes for PVFS2 file servers. Each file server was in-
stalled with a 7200RPM SATA II 250GB hard disk drive (HDD),
a PCI-E X4 100GB solid state disk (SSD), and the interconnection
was 4X InfiniBand. We adopted the IOR benchmark to simulate
the intra-request synchronization scenario and measured the finish
time of all requests on different file servers. The number of concur-
rent IOR instances was 10, to simulate 10 concurrent applications.
In these experiments we show only the intra-request data synchro-
nization case, so each instance was configured with only one pro-
cess, which issued a 4MB contiguous data read request. Figure 3
shows the finish time of different requests on different file servers.
From Figure 3 (a) and (c), we can see that, in both HDD and SSD environments, the finish time of every application varies considerably across file servers. From subfigures (b) and (d), we can see that the maximum finish time is 4.4 times the minimum on average in the HDD environment and 3.1 times in the SSD environment.
The completion time of one request is equal to the maximum value
of all finish times on all involved file servers. Therefore, the signif-
icant deviation of finish time on multiple file servers leads to high
completion time of data accesses.
The experimental results also indicate that, due to the indepen-
dent scheduling strategy on each file server, data accesses are fin-
ished in different orders for concurrent applications. The difference
of service orders on different file servers will become much greater
in the inter-request or collective I/O synchronization cases, where
each application has multiple processes. As a result, the indepen-
dent scheduling strategy on file servers introduces a large number
of idle CPU cycles waiting for data synchronization on computing
nodes, and the case will become even worse for large-scale HPC
clusters. The results also reveal that there is a significant potential
to shorten completion time by coordinated I/O scheduling on file
servers.
3. I/O COORDINATION
In order to reduce the overhead of data synchronization, we pro-
pose a server-side I/O coordination scheme which re-arranges I/O
requests on file servers, so that requests are serviced in the same
order in terms of applications on all involved nodes. Data syn-
chronization usually explicitly or implicitly exists in parallel pro-
cesses of parallel applications. The re-ordering aims at scheduling

[Figure 4: I/O coordination scheme in parallel file systems. Panels: (a) Original I/O requests; (b) Divided by Time Windows; (c) Re-order in each Time Window]
the parallel I/O requests that need to be synchronized to run to-
gether, which can benefit the system with a shorter average comple-
tion time of all I/O requests. A good scheduling algorithm should
take into account both performance and fairness. A good practi-
cal scheduling algorithm also requires simplicity in implementa-
tion. The proposed I/O coordination is no exception. For fairness,
all I/O requests should be serviced within an acceptable period to
avoid starvation. To provide the right balance of performance and fairness, the concepts of ‘Time Window’ and ‘Application ID’ are introduced to support the server-side I/O coordination approach.
Time Window. All I/O requests issued to a file server can be
regarded as a time series. The time series is then divided into suc-
cessive segments by a fixed time interval. Here each segment of the
time series is referred to as Time Window. Thus, one Time Window
consists of a number of I/O requests. The value of the time interval
can be regarded as Time Window Width.
Application ID. We allocate an integer value for each appli-
cation running on the cluster. The integer is an identification of
“which application an I/O request belongs to”, and is referred to as
Application ID. For each I/O request, it will pass on this integer to
the file servers.
According to the definition, all I/O requests from one application have the same ‘Application ID’. For applications with multiple parallel processes, such as MPI programs, there might be large amounts of data synchronization. In order to alleviate the performance penalties of synchronization, I/O requests from all processes should have the same ‘Application ID’, and they should be served in concert on multiple file servers. The ‘Application ID’ is generated automatically in the parallel I/O library and is transparent to users. It can be implemented in parallel I/O client libraries or the middleware layer, without modifying application programs.
For fairness, requests in an earlier ‘Time Window’ will be ser-
viced prior to those in a later one, to avoid starvation. The request
time can use either file-server-side time (the arrival time of a re-
quest) or client-side time (the issue time of a request). Because of
network latency and load imbalance issues, one client side request
may have different arrival time on different file servers. In a system
with many concurrent clients, a request issued earlier might get a
later arrival time on some file servers. For these reasons, in our
implementation, we choose client-side time as the request time.
3.1 Algorithm
It is not difficult to imagine that in a parallel file system, a large
number of I/O requests might be queued on each file server at a
time. These I/O requests might come from multiple applications.
As all arriving requests are tagged with a request time and an ‘Application ID’, the I/O coordination algorithm can be described as follows: in the same ‘Time Window’, I/O requests are ordered by the value of ‘Application ID’; across different ‘Time Windows’, requests in an earlier window are serviced prior to those in a later one.
The proposed I/O scheduling algorithm is based on the obser-
vation that requests from the same application have a better lo-
cality and, equally important, the execution will be optimized if
these requests finish at the same time. It takes both performance
and fairness into consideration. In each time window, requests are
served one application at a time in order to reduce the overhead
of data synchronization. In addition, none of the requests will be
starved, because requests in an earlier time window will always be
performed first.
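To make the ordering rule concrete, the following is a minimal sketch of window-then-application ordering (not the paper's PVFS2 scheduler); the request fields and the window width W are assumptions consistent with the description above:

#include <stdio.h>
#include <stdlib.h>

/* Assumed request record: client-side issue time in ms plus an Application ID. */
typedef struct { long req_time_ms; int app_id; } io_request;

static const long W = 1000;   /* 'Time Window' width in ms (assumed value) */

/* Earlier windows always come first (fairness, no starvation); within a
   window, requests are grouped by 'Application ID' (one app at a time). */
static int cmp(const void *a, const void *b)
{
    const io_request *x = a, *y = b;
    long wx = x->req_time_ms / W, wy = y->req_time_ms / W;
    if (wx != wy) return wx < wy ? -1 : 1;
    return (x->app_id > y->app_id) - (x->app_id < y->app_id);
}

int main(void)
{
    io_request q[] = { {1500, 2}, {100, 3}, {900, 1}, {1200, 1} };
    size_t n = sizeof q / sizeof q[0];
    qsort(q, n, sizeof q[0], cmp);           /* coordinated service order */
    for (size_t i = 0; i < n; i++)
        printf("t=%4ldms app=%d\n", q[i].req_time_ms, q[i].app_id);
    return 0;
}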
Figure 4 illustrates how the I/O coordination algorithm works in
parallel file systems. In this example, there are 4 file servers and
three concurrent applications. The original request arrival orders
are inconsistent on different file servers, such as in subfigure (a).
The series of I/O requests are split into successive ‘Time Windows’
by a fixed time interval on all file servers, as shown in subfigure (b).
The scheduler on each file server then reorders the requests in each ‘Time Window’ by ‘Application ID’, so that requests from one application can be serviced at the same time on all file servers, as shown in subfigure (c).

The scheduler on each file server maintains a queue for all requests, which determines the service order of I/O requests. When a new I/O request arrives, if the queue is empty, the request will be scheduled immediately. If the queue is not empty, the scheduler will insert the request into the queue in terms of ‘Time Window’ and ‘Application ID’. The scheduler keeps issuing the request with the highest priority (i.e., the head of the current queue) to the low-level storage devices on each file server. Since the ‘Application ID’ and request time are generated at the client side and then passed to the file servers, there is no communication between different file servers while scheduling the requests. The use of ‘Application ID’ and ‘Time Window’ has significantly simplified the implementation of the coordination and laid the foundation for good scalability as the number of file servers increases.
3.2 Completion Time Analysis
Assume that the number of file servers is n, the number of con-
current applications is m, and that each application needs to access
data on all file servers (for simplicity). A collective data access
from one application is mapped into n sub-parts to all file servers,
and each sub-part is also a request in a file server. The service time
on each file server for each sub-part is t.
Without I/O coordination, the sub-parts are served in different
file servers independently. As requests are issued simultaneously,
the sub-parts may be served randomly without order on all file
servers. Hence for each sub-part, the finish time on each file server
can randomly fall in {t, 2t, 3t, ..., mt}, and the finish time of data
access for one application depends on the latest finish time of all
nodes. The expectation of completion time of one data access is
equal to the expectation of the maximum finish time on all n file
servers. The average completion time can be represented as Formula (1), where F(k) denotes the probability distribution function and f(x) the probability density function. From the formula, we observe that if there is only 1 file server, the expectation of completion time is $\frac{m+1}{2}t$, which conforms to the distribution of our assumption. The formula also indicates that the completion time increases as the number of file servers n increases, and also as the number of concurrent applications m increases. When the file server number n is very large, $\frac{t}{m^{n}}\sum_{k=1}^{m-1} k^{n}$ would be close to 0, and then the average completion time would be close to mt.
$$T_{avg} = E(\max(T)) = \Big(\sum_{k=1}^{m} k f(k)\Big)t = \Big(\sum_{k=1}^{m} k\big(F(k) - F(k-1)\big)\Big)t = \Big(\sum_{k=1}^{m} k\Big(\big(\tfrac{k}{m}\big)^{n} - \big(\tfrac{k-1}{m}\big)^{n}\Big)\Big)t = mt - \frac{t}{m^{n}}\sum_{k=1}^{m-1} k^{n} \qquad (1)$$
With the I/O coordination strategy, all file servers serve applica-
tions one at a time. I/O requests with synchronization needs will be
served at the same time on all file servers. Therefore, the comple-
tion times for these applications are: t, 2t, ..., mt, and the average
completion time can be represented as Formula (2). The formula
indicates that the average completion time is independent of n, the
number of file servers. That means the average completion time
of the I/O coordination scheme is much more scalable than that of
existing independent scheduling strategies. Currently, parallel file
systems usually reach up to hundreds of storage nodes or even be-
yond. The proposed I/O coordination strategy is a practical way to
reduce the request completion time for data-intensive applications.
$$T'_{avg} = \frac{1}{m}\sum_{k=1}^{m} kt = \frac{m+1}{2}\,t \qquad (2)$$
From Formula (1) and (2), we can calculate the reduction of the
average completion time as follows.
$$T_{diff} = T_{avg} - T'_{avg} = \frac{m-1}{2}\,t - \frac{t}{m^{n}}\sum_{k=1}^{m-1} k^{n} \qquad (3)$$
As can be seen in Formula (3), when the number of file servers n is very large, the reduction of completion time would be close to $\frac{m-1}{2}t$, and the decrease rate would be approaching $\frac{m-1}{2m}$. As the number of concurrent applications m increases, the decrease rate approaches 50%.
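As a quick numeric check on Formulas (1)-(3), the short program below (not from the paper) evaluates both averages, in units of t, under the uniform finish-time model stated above:

#include <stdio.h>
#include <math.h>

/* Formula (1): E[max] over n servers, finish times uniform on {t,...,mt}, t = 1. */
static double t_avg_uncoordinated(int n, int m)
{
    double sum = 0.0;
    for (int k = 1; k < m; k++)
        sum += pow((double)k / m, n);   /* (1/m^n) * sum of k^n, computed stably */
    return m - sum;
}

/* Formula (2): coordinated servers finish applications at t, 2t, ..., mt. */
static double t_avg_coordinated(int m) { return (m + 1) / 2.0; }

int main(void)
{
    const int cases[][2] = { {1, 10}, {8, 10}, {64, 10}, {256, 32} };
    for (int i = 0; i < 4; i++) {
        int n = cases[i][0], m = cases[i][1];
        double a = t_avg_uncoordinated(n, m);
        double b = t_avg_coordinated(m);
        /* Formula (3) is the difference a - b; also print the decrease rate. */
        printf("n=%3d m=%2d  T_avg=%6.3f  T'_avg=%6.3f  decrease=%5.1f%%\n",
               n, m, a, b, 100.0 * (a - b) / a);
    }
    return 0;
}

For n = 1 the first routine returns (m+1)/2, matching Formula (2), and for large n the decrease rate approaches (m-1)/2m, consistent with the 50% limit noted above.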
4. IMPLEMENTATION
We have implemented the server-side I/O coordination scheme
under PVFS2[9] and MPI-IO. PVFS2 is an open source parallel file
system developed jointly by Clemson University and Argonne Na-
tional Laboratory. It is a virtual parallel file system for Linux clus-
ters based on underlying native file systems on storage nodes. The
prototype implementation includes modifications to the PVFS2 re-
quest scheduling module and the PVFS2 driver package in ROMIO
[26] MPI-IO library.
4.1 Implementation in PVFS2
We modified the client interface and server-side request scheduler in PVFS2. The client interface passes ‘Application ID’ and ‘Request Time’ to the file servers, and then the file servers re-arrange request service orders based on the two parameters.
We utilize the ‘PVFS_hint’ mechanism to pass the two parameters between I/O clients and file servers. Two new hint types are defined in the PVFS2 source code: ‘PINT_HINT_APP_ID’ and ‘PINT_HINT_REQ_TIME’, representing the Application ID and request time respectively. We modified the client-side interface PVFS_sys_read/write(), adding ‘PVFS_hint’ as a parameter, so that the hint could be passed to the PVFS2 server side.
When a file server receives a request, the scheduler first calculates its priority, and then inserts the request into the request queue in ascending order of priority. The smaller the priority number a request gets, the earlier it would be scheduled. The request priority is calculated as follows.
req_prior = req_time / interval * 32768 + app_id;
Here req_time is the issue time of the I/O request from the client
side, and it is an integer value referring to the number of millisec-
onds since ‘1970-01-01 00:00:00 UTC’. Interval is the width of
the ‘Time Window’, which can be defined as a startup parameter
in the PVFS2 configuration file. If interval is not configured, it
will use the default value (1000ms for HDD and 250ms for SSD).
App_id represents the ‘Application ID’, and it is an integer value in the range 0 to 32767. From the formula we observe that req_prior orders requests first by ‘Time Window’ and then by ‘Application ID’ within each window.
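
A hedged sketch of the server-side step just described: compute req_prior when a request arrives and insert it into the queue in ascending priority. The list structure and the 1000 ms default are illustrative simplifications, not PVFS2's actual scheduler code:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative queue node; real PVFS2 requests carry far more state. */
typedef struct node {
    long long prior;
    struct node *next;
} node;

/* req_prior = req_time / interval * 32768 + app_id, per the formula above;
   1000 ms is the paper's HDD default window width. */
static long long req_prior(long long req_time_ms, int app_id)
{
    const long long interval = 1000;
    return req_time_ms / interval * 32768 + app_id;
}

/* Insert keeping the queue in ascending req_prior; the scheduler always
   issues the head of the queue to the low-level storage devices next. */
static void enqueue(node **head, long long prior)
{
    node *n = malloc(sizeof *n);
    n->prior = prior;
    while (*head && (*head)->prior <= prior)
        head = &(*head)->next;
    n->next = *head;
    *head = n;
}

int main(void)
{
    node *queue = NULL;
    enqueue(&queue, req_prior(2100, 5));   /* window 2, app 5 */
    enqueue(&queue, req_prior(900, 7));    /* window 0, app 7 */
    enqueue(&queue, req_prior(950, 2));    /* window 0, app 2: served first */
    while (queue) {
        node *next = queue->next;
        printf("prior=%lld\n", queue->prior);
        free(queue);
        queue = next;
    }
    return 0;
}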

Citations
More filters
Proceedings ArticleDOI
19 May 2014
TL;DR: Experiments show how CALCioM can be used to efficiently and transparently improve the scheduling strategy between two otherwise interfering applications, given specified metrics of machine wide efficiency.
Abstract: Unmatched computation and storage performance in new HPC systems have led to a plethora of I/O optimizations ranging from application-side collective I/O to network and disk-level request scheduling on the file system side. As we deal with ever larger machines, the interferences produced by multiple applications accessing a shared parallel file system in a concurrent manner become a major problem. These interferences often break single-application I/O optimizations, dramatically degrading application I/O performance and, as a result, lowering machine wide efficiency. This paper focuses on CALCioM, a framework that aims to mitigate I/O interference through the dynamic selection of appropriate scheduling policies. CALCioM allows several applications running on a supercomputer to communicate and coordinate their I/O strategy in order to avoid interfering with one another. In this work, we examine four I/O strategies that can be accommodated in this framework: serializing, interrupting, interfering and coordinating. Experiments on Argonne's BG/P Surveyor machine and on several clusters of the French Grid'5000 show how CALCioM can be used to efficiently and transparently improve the scheduling strategy between two otherwise interfering applications, given specified metrics of machine wide efficiency.

112 citations


Cites background from "Server-side I/O coordination for pa..."

  • ...[35], with an application’s id instead of an object id....


Proceedings ArticleDOI
23 May 2016
TL;DR: This work provides the first deep insight into the role of each of the potential root causes of interference and their interplay in HPC storage systems and can help developers and platform owners improve I/O performance and motivate further research addressing the problem across all components of the I/O stack.
Abstract: As we move toward the exascale era, performance variability in HPC systems remains a challenge. I/O interference, a major cause of this variability, is becoming more important every day with the growing number of concurrent applications that share larger machines. Earlier research efforts on mitigating I/O interference focus on a single potential cause of interference (e.g., the network). Yet the root causes of I/O interference can be diverse. In this work, we conduct an extensive experimental campaign to explore the various root causes of I/O interference in HPC storage systems. We use microbenchmarks on the Grid'5000 testbed to evaluate how the applications' access pattern, the network components, the file system's configuration, and the backend storage devices influence I/O interference. Our studies reveal that in many situations interference is a result of bad flow control in the I/O path, rather than being caused by some single bottleneck in one of its components. We further show that interference-free behavior is not necessarily a sign of optimal performance. To the best of our knowledge, our work provides the first deep insight into the role of each of the potential root causes of interference and their interplay. Our findings can help developers and platform owners improve I/O performance and motivate further research addressing the problem across all components of the I/O stack.

77 citations


Cites background from "Server-side I/O coordination for pa..."

  • ...The focus has been on causes as diverse as access locality in disks [2], synchronization across storage servers [2], [3], or network contention [4]–[7]....


Proceedings ArticleDOI
20 May 2013
TL;DR: Experimental results show that PDLA is effective in improving data access performance of parallel I/O systems and a runtime system is designed and developed to integrate the PDLA replication scheme with existing parallel I/O systems.
Abstract: The performance gap between computing power and the I/O system is ever increasing, and in the meantime more and more High Performance Computing (HPC) applications are becoming data intensive. This study describes an I/O data replication scheme, named Pattern-Direct and Layout-Aware (PDLA) data replication scheme, to alleviate this performance gap. The basic idea of PDLA is replicating identified data access pattern, and saving these reorganized replications with optimized data layouts based on access cost analysis. A runtime system is designed and developed to integrate the PDLA replication scheme and existing parallel I/O system; a prototype of PDLA is implemented under the MPICH2 and PVFS2 environments. Experimental results show that PDLA is effective in improving data access performance of parallel I/O systems.

45 citations


Cites background from "Server-side I/O coordination for pa..."

  • ...As a result, the disks work in an interleaving way and each request can finish only when all the sub-requests on all nodes finish [16]....


Proceedings ArticleDOI
08 Sep 2015
TL;DR: This paper proposes a burst buffer based I/O orchestration framework, named TRIO, to intercept and reshape the bursty writes for better sequential write traffic to storage servers, and demonstrates that TRIO could efficiently utilize storage bandwidth and reduce the average job I/O time by 37% on average for data-intensive applications in typical checkpointing scenarios.
Abstract: The growing computing power on leadership HPC systems is often accompanied by ever-escalating failure rates. Checkpointing is a common defensive mechanism used by scientific applications for failure recovery. However, directly writing the large and bursty checkpointing dataset to parallel file systems can incur significant I/O contention on storage servers. Such contention in turn degrades bandwidth utilization of storage servers and prolongs the average job I/O time of concurrent applications. Recently burst buffers have been proposed as an intermediate layer to absorb the bursty I/O traffic from compute nodes to storage backend. But an I/O orchestration mechanism is still desirable to efficiently move checkpointing data from burst buffers to storage backend. In this paper, we propose a burst buffer based I/O orchestration framework, named TRIO, to intercept and reshape the bursty writes for better sequential write traffic to storage servers. Meanwhile, TRIO coordinates the flushing orders among concurrent burst buffers to alleviate the contention on storage server. Our experimental results demonstrated that TRIO could efficiently utilize storage bandwidth and reduce the average job I/O time by 37% on average for data-intensive applications in typical checkpointing scenarios.

41 citations


Cites background from "Server-side I/O coordination for pa..."

  • ...Server-side optimizations generally embed their solutions inside the storage server, overcoming issues of contention by dynamically coordinating data movement among servers [37, 14, 43]....


Proceedings ArticleDOI
16 Nov 2014
TL;DR: This paper presents Omnisc'IO, an approach that builds a grammar-based model of the I/O behavior of HPC applications and uses it to predict when futureI/O operations will occur, and where and how much data will be accessed.
Abstract: The increasing gap between the computation performance of post-petascale machines and the performance of their I/O subsystem has motivated many I/O optimizations including prefetching, caching, and scheduling techniques. In order to further improve these techniques, modeling and predicting spatial and temporal I/O patterns of HPC applications as they run has became crucial. In this paper we present Omnisc'IO, an approach that builds a grammar-based model of the I/O behavior of HPC applications and uses it to predict when future I/O operations will occur, and where and how much data will be accessed. Omnisc'IO is transparently integrated into the POSIX and MPI I/O stacks and does not require any modification in applications or higher level I/O libraries. It works without any prior knowledge of the application and converges to accurate predictions within a couple of iterations only. Its implementation is efficient in both computation time and memory footprint.

39 citations

References
More filters
Journal ArticleDOI
TL;DR: The problem of multiprogram scheduling on a single processor is studied from the viewpoint of the characteristics peculiar to the program functions that need guaranteed service and it is shown that an optimum fixed priority scheduler possesses an upper bound to processor utilization.
Abstract: The problem of multiprogram scheduling on a single processor is studied from the viewpoint of the characteristics peculiar to the program functions that need guaranteed service. It is shown that an optimum fixed priority scheduler possesses an upper bound to processor utilization which may be as low as 70 percent for large task sets. It is also shown that full processor utilization can be achieved by dynamically assigning priorities on the basis of their current deadlines. A combination of these two scheduling techniques is also discussed.

7,067 citations


"Server-side I/O coordination for pa..." refers background in this paper

  • ...Some of them adopted deadlinedriven strategies [12, 19, 31], that allow the upper layer to specify latency and throughput goals of file servers and schedule the requests based on Earliest Deadline First(EDF) [16] or its variants [19, 20, 31]....


Proceedings Article
Frank B. Schmuck1, Roger L. Haskin1
28 Jan 2002
TL;DR: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters; the paper discusses how distributed locking and recovery techniques were extended to scale to large clusters.
Abstract: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.

1,434 citations


"Server-side I/O coordination for pa..." refers methods in this paper

  • ...GPFS: A Shared-Disk File System for Large Computing Clusters....


  • ...Large-scale data-intensive supercomputing relies on parallel file systems, such as Lustre [1], GPFS [22], PVFS [9], and PanFS[18] for high-performance I/O....


Journal ArticleDOI
12 Nov 2000
TL;DR: It is demonstrated that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler, and that a small sample of the possible schedules is sufficient to identify a good schedule quickly.
Abstract: Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule. This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs which run well together. We demonstrate an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase which collects information about various possible schedules, and a symbiosis phase which uses that information to predict which schedule will provide the best performance. We show that a small sample of the possible schedules is sufficient to identify a good schedule quickly. On a system with random job arrivals and departures, response time is improved as much as 17% over a schedule which does not incorporate symbiosis.

619 citations

Proceedings ArticleDOI
21 Feb 1999
TL;DR: This work describes how the MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests and explains in detail the two key optimizations ROMIO performs: data sieving for non Contiguous requests from one process and collective I/O for noncont contiguous requests from multiple processes.
Abstract: The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications, an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC), on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

470 citations


"Server-side I/O coordination for pa..." refers background or methods in this paper

  • ...Moreover, collective data access from multiple clients, such as collective I/O in MPI-IO [26], has to wait for all aggregators to complete....


  • ...We also modified the PVFS2 driver in ROMIO [26] to pass ‘Request Time’ and ‘Application ID’ via ‘PVFS_hint’....


  • ...The prototype implementation includes modifications to the PVFS2 request scheduling module and the PVFS2 driver package in ROMIO [26] MPI-IO library....


  • ...ROMIO[26] is a high-performance, portable implementation of MPI-IO, providing applications with a uniform interface in the top layer, and dealing with data access to various file systems by an internal abstract I/O device layer called ADIO....


11 Sep 1998
TL;DR: MPI-IO as discussed by the authors allows users to access a non-contiguous data set with a single I/O function call, which provides MPI implementations an opportunity to optimize data access.
Abstract: The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications--an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)--on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

466 citations

Frequently Asked Questions (12)
Q1. What are the contributions in "Server-side i/o coordination for parallel file systems" ?

Recognizing that an I/O request will not complete until all involved file servers in the parallel file system have completed their parts, in this paper the authors propose a server-side I/O coordination scheme for parallel file systems. The authors present the proposed I/O coordination algorithm and its corresponding analysis of average completion time in this study. The authors also implement a prototype of the proposed scheme under the PVFS2 file system and MPI-IO environment. Experimental results demonstrate that the proposed scheme can reduce average completion time by 8% to 46%, and provide higher I/O bandwidth than that of default data access strategies adopted by PVFS2 for heavy I/O workloads.

In the future, the authors plan to investigate optimization of the I/O coordination strategy based on application data access patterns. The authors also plan to add a minimum group communication in I/O coordination, to explore its feasibility for imbalanced data access workloads. 

In other words, after requests re-ordering, the average completion time decreases from 4t to 3t, which reveals a significant potential for shortening average completion time through request re-ordering at the file servers. 

The reason is that, due to nonuniform network delays, some requests with low priority were already issued to the storage devices in cases when some requests with high priority arrived late on some file servers.

The proposed I/O scheduling algorithm is based on the observation that requests from the same application have a better locality and, equally important, the execution will be optimized if these requests finish at the same time. 

The prototype implementation includes modifications to the PVFS2 request scheduling module and the PVFS2 driver package in ROMIO [26] MPI-IO library. 

Parallel file systems consisting of high-performance storage devices should set a short time window, and those consisting of lower-performance storage devices should set a relatively large window size.

Synchronization of I/O requests across processes is common in parallel computing, and can be classified into the following three categories (as shown in Figure 1).

The experimental results also indicate that, due to the independent scheduling strategy on each file server, data accesses are finished in different orders for concurrent applications. 

These techniques succeed in achieving high bandwidth in disks and networks of file servers, by reducing either the frequency of disk seeks, or the waiting time of socket connections. 

From subfigures (b) and (d), the authors can see that the maximum finish time is 4.4 times the minimum on average in the HDD environment and 3.1 times in the SSD environment.

The experimental results demonstrate that, compared to the conventional data access strategy, the proposed I/O coordination scheme can reduce the I/O completion time by up to 46% and provide a comparable I/O bandwidth. 
