Proceedings ArticleDOI

Server-side I/O coordination for parallel file systems

TL;DR: Experimental results demonstrate that the proposed server-side I/O coordination scheme can reduce average completion time by 8% to 46%, and provide higher I/O bandwidth than that of default data access strategies adopted by PVFS2 for heavy I/O workloads.
Abstract: Parallel file systems have become a common component of modern high-end computers to mask the ever-increasing gap between disk data access speed and CPU computing power. However, while working well for certain applications, current parallel file systems lack the ability to effectively handle concurrent I/O requests with data synchronization needs, whereas concurrent I/O is the norm in data-intensive applications. Recognizing that an I/O request will not complete until all involved file servers in the parallel file system have completed their parts, in this paper we propose a server-side I/O coordination scheme for parallel file systems. The basic idea is to coordinate file servers to serve one application at a time in order to reduce the completion time, and in the meantime maintain the server utilization and fairness. A window-wide coordination concept is introduced to serve our purpose. We present the proposed I/O coordination algorithm and its corresponding analysis of average completion time in this study. We also implement a prototype of the proposed scheme under the PVFS2 file system and MPI-IO environment. Experimental results demonstrate that the proposed scheme can reduce average completion time by 8% to 46%, and provide higher I/O bandwidth than that of default data access strategies adopted by PVFS2 for heavy I/O workloads. Experimental results also show that the server-side I/O coordination scheme has good scalability.

Summary (4 min read)

1. INTRODUCTION

  • Large-scale data-intensive supercomputing relies on parallel file systems, such as Lustre [1], GPFS [22], PVFS [9], and PanFS [18], for high-performance I/O.
  • Many high-performance computing (HPC) applications have become "I/O bounded", unable to scale with increasing compute power.
  • Multiple clients access data from a parallel file system independently, and there is explicit synchronization among these I/O clients, also known as Inter-request Synchronization.
  • These requests are likely to be served in different order on different file servers because they are scheduled independently.
  • Thus the average completion time is: T_avg = 4t.


  • The contribution of this paper is four-fold.
  • First, the authors present the data synchronization problems in parallel file systems.
  • Second, the authors propose an effective server-side I/O coordination scheme for parallel I/O systems to reduce the average completion time of I/O requests, and thus to alleviate the performance penalties of data synchronization.
  • Third, the authors implement a prototype of the I/O coordination scheme in PVFS2 and MPI-IO.
  • Finally, the authors evaluate the proposed scheme both analytically and experimentally.
  • Experimental and analytical results are discussed in Section 5.

2. THE IMPACT OF DATA SYNCHRONIZATION

  • Data synchronization is common in parallel file systems, where I/O requests usually consist of multiple pieces of data access in multiple file servers and will not complete until all involved servers have completed their parts.
  • Each file server was installed with a 7200RPM SATA II 250GB hard disk drive (HDD), a PCI-E X4 100GB solid state disk (SSD), and the interconnection was 4X InfiniBand.
  • The number of concurrent IOR instances was 10, to simulate 10 concurrent applications.
  • Figure 3 shows the finish time of different requests on different file servers.
  • The results also reveal that there is a significant potential to shorten completion time by coordinated I/O scheduling on file servers.

3. I/O COORDINATION

  • In order to reduce the overhead of data synchronization, the authors propose a server-side I/O coordination scheme which re-arranges I/O requests on file servers, so that requests are serviced in the same order in terms of applications on all involved nodes.
  • The authors allocate an integer value for each application running on the cluster.
  • According to the definition, all I/O requests from one application have the same 'Application ID'.
  • For applications with multiple parallel processes, such as MPI programs, there might be large amounts of data synchronization.
  • In a system with many concurrent clients, a request issued earlier might get a later arrival time on some file servers.

3.1 Algorithm

  • These I/O requests might come from multiple applications.
  • In the same 'Time Window', I/O requests are ordered by the value of 'Application ID'; while in different 'Time Windows', requests in an earlier window would be serviced prior to those in a later one.
  • It takes both performance and fairness into consideration.
  • Figure 4 illustrates how the I/O coordination algorithm works in parallel file systems.
  • The scheduler on each file server then reorders the requests in each 'Time Window' by 'Application ID', so that requests from one application can be serviced at the same time on all file servers, as shown in subfigure (c).

3.2 Completion Time Analysis

  • Assume that the number of file servers is n, the number of concurrent applications is m, and that each application needs to access data on all file servers (for simplicity).
  • The average completion time can be represented as Formula (1), where F(k) denotes the probability distribution function and f(x) the probability density function; both averages are restated compactly after this list.
  • With the I/O coordination strategy, all file servers serve applications one at a time.
  • As the number of concurrent applications m increases, the decrease rate is approaching 50%.
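
For reference, the two average completion times summarized above are derived in Section 3.2 of the full text; restated in the paper's notation (n file servers, m concurrent applications, per-sub-request service time t):

$$T_{avg} = mt - \frac{t}{m^{n}}\sum_{k=1}^{m-1} k^{n} \quad \text{(without coordination, Formula (1))}$$

$$T'_{avg} = \frac{1}{m}\sum_{k=1}^{m} kt = \frac{m+1}{2}\,t \quad \text{(with coordination, Formula (2))}$$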

4. IMPLEMENTATION

  • The authors have implemented the server-side I/O coordination scheme under PVFS2 [9] and MPI-IO.
  • PVFS2 is an open source parallel file system developed jointly by Clemson University and Argonne National Laboratory.
  • It is a virtual parallel file system for Linux clusters based on underlying native file systems on storage nodes.
  • The prototype implementation includes modifications to the PVFS2 request scheduling module and the PVFS2 driver package in ROMIO [26] MPI-IO library.

4.1 Implementation in PVFS2

  • The authors modified the client interface and server side request scheduler in PVFS2.
  • The authors utilize the 'PVFS_hint' mechanism to pass the two parameters between I/O clients and file servers.
  • When a file server receives a request, the scheduler first calculates its priority, and then inserts the request into the request queue in ascending order of priority.
  • Therefore, all I/O requests are serviced in the order of req_prior, as sketched below.
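
A minimal sketch of that priority computation, using the formula given later in the full text (Section 4.1); the function name and integer widths here are illustrative, not PVFS2 source code:

#include <stdint.h>

/* Priority per the paper's formula: requests sort by 'Time Window' first
   (req_time / interval), then by 'Application ID' within a window. The
   factor 32768 leaves room for app_id values in the documented 0..32767
   range. */
static int64_t request_priority(int64_t req_time_ms, int64_t interval_ms,
                                int32_t app_id)
{
    return req_time_ms / interval_ms * 32768 + app_id;
}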

4.2 Implementation in MPI-IO Library

  • The authors also modified the PVFS2 driver in ROMIO [26] to pass 'Request Time' and 'Application ID' via 'PVFS_hint'.
  • 'Application ID' is generated the first time an MPI program calls MPI_File_open, and it is then broadcast to all MPI processes.
  • For system performance tuning, the authors also provide a configuration interface for parallel file system administrators.
  • ROMIO [26] is a high-performance, portable implementation of MPI-IO, providing applications with a uniform interface in the top layer, and dealing with data access to various file systems by an internal abstract I/O device layer called ADIO.
  • Following is an example of calling the PVFS2 data read interface (the original listing is not preserved here; a hedged MPI sketch of the 'Application ID' broadcast follows).
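
Below is a sketch, under stated assumptions, of the 'Application ID' generation and broadcast described above, using standard MPI calls; the helper gen_app_id and its recipe are hypothetical, not the paper's code:

#include <mpi.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical ID generator: the paper only states that the ID is created
   the first time MPI_File_open is called and then broadcast. */
static int gen_app_id(void)
{
    srand((unsigned)time(NULL) ^ (unsigned)getpid());
    return rand() % 32768;   /* fits the documented 0..32767 range */
}

int main(int argc, char **argv)
{
    int rank, app_id = 0;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        app_id = gen_app_id();            /* generated once per application */
    /* Every process tags its I/O with the same 'Application ID'. */
    MPI_Bcast(&app_id, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    /* ... reads issued here would carry app_id and a request time as hints ... */
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}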

These code modifications in the MPI-IO library are transparent to application programmers and users.

  • There is no need to modify the application source code; the user can simply relink the program using the modified MPI-IO library.
  • The request time is one of the primary factors used for request reordering on file servers in the proposed I/O coordination strategy.
  • For this reason, the clock of all machines in the large-scale system must be synchronized.
  • In their implementation, the request time is generated in the MPI-IO library at the client side, so all the client machines must adopt the same clock.
  • Clock skew of client nodes may lead to unexpected requests service orders, especially for the collective I/O synchronization and inter-request synchronization cases.

5.1 Experiments Setup

  • The authors' experiments were conducted on a 65-node Sun Fire Linux-based cluster, with one head node and 64 computing nodes.
  • The computing nodes were Sun Fire X2200 servers, each with dual 2.3GHz Opteron quad-core processors, 8GB memory, and a 250GB 7200RPM SATA hard drive.
  • All 65 nodes were connected with Gigabit Ethernet.
  • MPI-TILE-IO and Noncontig are designed to test the performance of MPI-IO for non-contiguous access workloads.
  • Before each run, the authors flushed memory to avoid the impact of memory cache and buffer.

5.2 Results and Analysis

  • First, the authors conducted experiments to evaluate the completion time of I/O requests with the proposed I/O coordination strategy, comparing it with the original scheduling strategy (without I/O coordination) in PVFS2.
  • The authors then compared the average completion time with different numbers of concurrent applications.
  • Next the authors conducted experiments to evaluate the scalability of the proposed I/O coordination strategy.
  • As the number of file servers increases, the completion time decrease is around 46% for the 64-node HDD environment and 39% for the 16-node SSD environment.
  • The request sizes of all programs were 128 KB, and the stripe size was 4 KB.

6.1 Server-side I/O Scheduling in Parallel File Systems

  • In order to obtain sustained peak I/O performance, a collection of I/O scheduling techniques have been developed for the server side I/O scheduling of parallel file systems, such as disk-directed I/O [13] , server-directed I/O [23] , and stream-based I/O [11, 21] .
  • To the best of their knowledge, little effort has been devoted to reducing the average completion time of I/O requests of multiple applications for multiple file servers.
  • Numerous research efforts have been devoted to improving quality of service (QoS) of I/O requests in distributed or parallel storage systems [4, 8, 10, 12, 19, 29] .
  • Some of them adopted deadline-driven strategies [12, 19, 31], which allow the upper layer to specify latency and throughput goals of file servers and schedule the requests based on Earliest Deadline First (EDF) [16] or its variants [19, 20, 31].
  • Moreover, their approach takes into consideration multiple file servers.

6.2 Coordinated scheduling

  • Coordinated scheduling has been recognized as an effective approach to obtain efficient execution for parallel or distributed environments.
  • The scheduler packs synchronized processes into gangs and schedules them simultaneously, to alleviate performance penalties of communicative synchronization.
  • Feitelson et al. [5] made a comparison of various packing schemes for gang scheduling, and evaluated them under different cases.
  • Zhang et al. [32] proposed an inter-server coordination technique in parallel file systems to improve the spatial locality and program reuse distance.
  • The motivation and methodology of the design and implementation of their approach and the authors' approach are very different.




Server-Side I/O Coordination for Parallel File Systems
Huaiming Song
, Yanlong Yin
, Xian-He Sun
, Rajeev Thakur
, Samuel Lang
Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, USA
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
{huaiming.song, yyin2, sun}@iit.edu, {thakur, slang}@mcs.anl.gov
ABSTRACT
Parallel file systems have become a common component of mod-
ern high-end computers to mask the ever-increasing gap between
disk data access speed and CPU computing power. However, while
working well for certain applications, current parallel file systems
lack the ability to effectively handle concurrent I/O requests with
data synchronization needs, whereas concurrent I/O is the norm in
data-intensive applications. Recognizing that an I/O request will
not complete until all involved file servers in the parallel file sys-
tem have completed their parts, in this paper we propose a server-
side I/O coordination scheme for parallel file systems. The basic
idea is to coordinate file servers to serve one application at a time
in order to reduce the completion time, and in the meantime main-
tain the server utilization and fairness. A window-wide coordina-
tion concept is introduced to serve our purpose. We present the
proposed I/O coordination algorithm and its corresponding analy-
sis of average completion time in this study. We also implement
a prototype of the proposed scheme under the PVFS2 file system
and MPI-IO environment. Experimental results demonstrate that
the proposed scheme can reduce average completion time by 8%
to 46%, and provide higher I/O bandwidth than that of default data
access strategies adopted by PVFS2 for heavy I/O workloads. Ex-
perimental results also show that the server-side I/O coordination
scheme has good scalability.
Categories and Subject Descriptors
B.4.3 [Interconnections]: Parallel I/O; D.4.3 [File Systems Man-
agement]: Access methods
Keywords
server-side I/O coordination; parallel I/O synchronization; I/O op-
timization; parallel file systems
1. INTRODUCTION
* This author has now joined the R&D center, Dawning Information Industrial LLC, Beijing, China. Email: songhm@sugon.com
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SC11, November 12-18, 2011, Seattle, Washington, USA
Copyright 2011 ACM 978-1-4503-0771-0/11/11 ...$10.00.
Large-scale data-intensive supercomputing relies on parallel file
systems, such as Lustre [1], GPFS [22], PVFS [9], and PanFS[18]
for high-performance I/O. However, performance improvements in
computing capacity have vastly outpaced the improvements in I/O
performance in the past few decades and will likely continue in
the future. Many high-performance computing (HPC) applications
have become “I/O bounded”, unable to scale with increasing com-
pute power. The gap in performance between compute and I/O is
amplified further when multiple applications compete for limited
I/O and storage resources at the same time, as this leads to thrash-
ing scenarios within the HPC storage system. Parallel file systems
have difficulty handling I/O workloads of multiple applications for
two primary reasons. First, the file servers perform data accesses in
an interleaved fashion, resulting in excessive disk seeks. Second,
file servers perform I/O requests independently, without knowledge
of the order of requests performed at other servers, whereas HPC
applications tend to coordinate I/O across all processes. This sce-
nario leads to under-utilization of compute resources, as all com-
pute processes are held waiting for completion of an I/O request
that is delayed by the interleaved scheduling choices made by an
individual file server.
In general, data files are striped across all or a part of the file
servers in parallel file systems. One I/O request issued from a sin-
gle client often involves data accesses on multiple servers, and the
parallel I/O library has to merge the multiple data pieces from these
file servers together. Moreover, collective data access from multi-
ple clients, such as collective I/O in MPI-IO [26], has to wait for
all aggregators to complete. Synchronization of I/O requests across
processes is common in parallel computing, and can be classified
into the following three categories (as shown in Figure 1).
• Intra-request Synchronization: One I/O request issued by one client accesses data in multiple file servers. It needs to gather/scatter data pieces from/to multiple storage nodes and merge them together to complete the I/O request.
• Collective I/O Synchronization: Multiple I/O clients access data from multiple file servers collectively within a single application. It has to wait for all aggregators to complete their collective I/O operations before continuing.
• Inter-request Synchronization: Multiple clients access data from a parallel file system independently, and there is explicit synchronization among these I/O clients.
Figure 1 shows the three scenarios of data synchronization. The
first two categories are implicit synchronization and the third one is
explicit. In a large-scale and high-performance computing system,
the parallel file system is often shared by multiple applications.
When these applications run simultaneously, each file server may

[Figure 1: Three scenarios of data synchronization in parallel I/O: intra-request sync, collective I/O sync, and inter-request sync]
receive multiple I/O requests from different applications. However,
these requests are likely to be served in different order on differ-
ent file servers because they are scheduled independently. Figure 2
is an example of the I/O request scheduling in 4 file servers, and
there are 3 applications: A, B and C. Usually, the completion time
of each application depends on the completion time of the last file
server to finish the request. In the left subfigure, all nodes serve the
requests in different orders. The completion times of the I/O re-
quests from the three applications are: $T_A = 4t$; $T_B = 4t$; $T_C = 4t$. Thus the average completion time is $T_{avg} = 4t$. If we re-arrange the requests in the file servers, letting all nodes service the requests in the same order, as shown in the right part of Figure 2, the completion times are: $T_A = 2t$; $T_B = 3t$; $T_C = 4t$. The average completion time is $T_{avg} = 3t$. In other words, after request re-ordering, the average completion time decreases from 4t to 3t, which reveals a significant potential for shortening average completion time through request re-ordering at the file servers.
Figure 2: Order of request handling affects completion time.
In the left subfigure, service order is different on different file
servers, and the average completion time for the three applica-
tions is 4t. While in the right subfigure, requests are serviced
in concert, and the average completion time reduces to 3t.
Existing scheduling algorithms in parallel file systems, such as
disk-directed I/O [13], server-directed I/O [23], and stream-based
I/O [11, 21], focus on reducing data access overhead on either stor-
age nodes or network traffic, to improve throughput of each file
server. These approaches have demonstrated the importance of
scheduling in parallel file systems to improve performance. How-
ever, little attention has been paid to server-side I/O coordination
in order to reduce average completion time of multiple applications
competing for limited I/O resources. In this paper, we propose a
new server-side I/O coordination scheme for parallel file systems
that enables all file servers to schedule requests from different ap-
plications in a coordinated way, to reduce the synchronization time
across clients for multiple applications.
The contribution of this paper is four-fold. First, we present
the data synchronization problems in parallel file systems. Sec-
ond, we propose an effective server-side I/O coordination scheme
for parallel I/O systems to reduce the average completion time of
I/O requests, and thus to alleviate the performance penalties of data
synchronization. Third, we implement a prototype of the I/O co-
ordination scheme in PVFS2 and MPI-IO. Finally, we evaluate the
proposed scheme both analytically and experimentally.
The remainder of this paper is organized as follows. Section 2
examines the overhead of data synchronization without I/O coor-
dination. Section 3 describes the design of I/O coordination algo-
rithm and gives an analysis of completion time. Section 4 presents
the implementation of the proposed I/O scheme in PVFS2 and MPI-
IO. Experimental and analytical results are discussed in Section 5.
Section 6 reviews related work in server-side I/O scheduling and
parallel job scheduling. Finally, Section 7 concludes this study and
discusses potential future work.
2. THE IMPACT OF DATA SYNCHRONIZATION
Data synchronization is common in parallel file systems, where
I/O requests usually consist of multiple pieces of data access in
multiple file servers and will not complete until all involved servers
have completed their parts. However, due to independent schedul-
ing strategies on file servers, I/O requests with synchronization

[Figure 3 panels: (a) Finish time on different file servers (HDD); (b) Minimum and maximum finish time (HDD); (c) Finish time on different file servers (SSD); (d) Minimum and maximum finish time (SSD)]
Figure 3: The finish time of I/O requests from different applications on different file servers. This set of experiments used the intra-request synchronization scenario with 10 concurrent IOR instances and an 8-node PVFS2 system. The stripe size of PVFS2 was 64KB, and each IOR instance issued a 4MB contiguous read request to the PVFS2 system. Thereby every request involved all 8 file servers, and the size of requested data on one file server was 512KB. ‘App K’ (K=0..9) refers to an IOR instance, ‘FS N’ (N=0..7) refers to a file server. ‘MIN’ refers to the finish time of the first file server to complete, and ‘MAX’ refers to the finish time of the last file server to complete. The completion time of each application relies on the ‘MAX’ finish time for that application on all involved file servers.
needs from different applications are very likely to be served in
different orders on different file servers.
Understanding the impact of data synchronization in parallel I/O
systems is critical to efficiently improving completion time. In this
section, we evaluate the request completion time when file servers
serve requests from multiple applications simultaneously. We em-
ployed 8 nodes for PVFS2 file servers. Each file server was in-
stalled with a 7200RPM SATA II 250GB hard disk drive (HDD),
a PCI-E X4 100GB solid state disk (SSD), and the interconnection
was 4X InfiniBand. We adopted the IOR benchmark to simulate
the intra-request synchronization scenario and measured the finish
time of all requests on different file servers. The number of concur-
rent IOR instances was 10, to simulate 10 concurrent applications.
In these experiments we show only the intra-request data synchro-
nization case, so each instance was configured with only one pro-
cess, which issued a 4MB contiguous data read request. Figure 3
shows the finish time of different requests on different file servers.
From Figure 3 (a) and (c), we can see that, in both HDD and SSD environments, the finish time of every application varies considerably across file servers. From subfigures (b) and (d), we can see that the maximum finish time is 4.4 times the minimum on average in the HDD environment and 3.1 times in the SSD environment.
The completion time of one request is equal to the maximum value
of all finish times on all involved file servers. Therefore, the signif-
icant deviation of finish time on multiple file servers leads to high
completion time of data accesses.
The experimental results also indicate that, due to the indepen-
dent scheduling strategy on each file server, data accesses are fin-
ished in different orders for concurrent applications. The difference
of service orders on different file servers will become much greater
in the inter-request or collective I/O synchronization cases, where
each application has multiple processes. As a result, the indepen-
dent scheduling strategy on file servers introduces a large number
of idle CPU cycles waiting for data synchronization on computing
nodes, and the case will become even worse for large-scale HPC
clusters. The results also reveal that there is a significant potential
to shorten completion time by coordinated I/O scheduling on file
servers.
3. I/O COORDINATION
In order to reduce the overhead of data synchronization, we pro-
pose a server-side I/O coordination scheme which re-arranges I/O
requests on file servers, so that requests are serviced in the same
order in terms of applications on all involved nodes. Data syn-
chronization usually explicitly or implicitly exists in parallel pro-
cesses of parallel applications. The re-ordering aims at scheduling

[Figure 4: I/O coordination scheme in parallel file systems. Panels: (a) Original I/O requests; (b) Divided by Time Windows; (c) Re-order in each Time Window]
the parallel I/O requests that need to be synchronized to run to-
gether, which can benefit the system with a shorter average comple-
tion time of all I/O requests. A good scheduling algorithm should
take into account both performance and fairness. A good practi-
cal scheduling algorithm also requires simplicity in implementa-
tion. The proposed I/O coordination is no exception. For fairness,
all I/O requests should be serviced within an acceptable period to
avoid starvation. To provide the right balance of performance and fairness, the concepts of ‘Time Window’ and ‘Application ID’ are introduced to support the server-side I/O coordination approach.
Time Window. All I/O requests issued to a file server can be
regarded as a time series. The time series is then divided into suc-
cessive segments by a fixed time interval. Here each segment of the
time series is referred to as Time Window. Thus, one Time Window
consists of a number of I/O requests. The value of the time interval
can be regarded as Time Window Width.
Application ID. We allocate an integer value for each appli-
cation running on the cluster. The integer is an identification of
“which application an I/O request belongs to”, and is referred to as
Application ID. For each I/O request, it will pass on this integer to
the file servers.
According to the definition, all I/O requests from one application have the same ‘Application ID’. For applications with multiple parallel processes, such as MPI programs, there might be large amounts of data synchronization. In order to alleviate the performance penalties of synchronization, I/O requests from all processes should have the same ‘Application ID’, and they should be served in concert on multiple file servers. The ‘Application ID’ is generated automatically in the parallel I/O library and is transparent to users. It can be implemented in parallel I/O client libraries or the middleware layer, without modifying application programs.
For fairness, requests in an earlier ‘Time Window’ will be ser-
viced prior to those in a later one, to avoid starvation. The request
time can use either file-server-side time (the arrival time of a re-
quest) or client-side time (the issue time of a request). Because of
network latency and load imbalance issues, one client side request
may have different arrival time on different file servers. In a system
with many concurrent clients, a request issued earlier might get a
later arrival time on some file servers. For these reasons, in our
implementation, we choose client-side time as the request time.
3.1 Algorithm
It is not difficult to imagine that in a parallel file system, a large
number of I/O requests might be queued on each file server at a
time. These I/O requests might come from multiple applications.
As all arriving requests are tagged with a request time and an ‘Application ID’, the I/O coordination algorithm can be described as follows: in the same ‘Time Window’, I/O requests are ordered by the value of ‘Application ID’; across different ‘Time Windows’, requests in an earlier window are serviced prior to those in a later one.
The proposed I/O scheduling algorithm is based on the obser-
vation that requests from the same application have a better lo-
cality and, equally important, the execution will be optimized if
these requests finish at the same time. It takes both performance
and fairness into consideration. In each time window, requests are
served one application at a time in order to reduce the overhead
of data synchronization. In addition, none of the requests will be
starved, because requests in an earlier time window will always be
performed first.
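To make the ordering rule concrete, the following is a minimal sketch of window-then-application ordering (not the paper's PVFS2 scheduler); the request fields and the window width W are assumptions consistent with the description above:

#include <stdio.h>
#include <stdlib.h>

/* Assumed request record: client-side issue time in ms plus an Application ID. */
typedef struct { long req_time_ms; int app_id; } io_request;

static const long W = 1000;   /* 'Time Window' width in ms (assumed value) */

/* Earlier windows always come first (fairness, no starvation); within a
   window, requests are grouped by 'Application ID' (one app at a time). */
static int cmp(const void *a, const void *b)
{
    const io_request *x = a, *y = b;
    long wx = x->req_time_ms / W, wy = y->req_time_ms / W;
    if (wx != wy) return wx < wy ? -1 : 1;
    return (x->app_id > y->app_id) - (x->app_id < y->app_id);
}

int main(void)
{
    io_request q[] = { {1500, 2}, {100, 3}, {900, 1}, {1200, 1} };
    size_t n = sizeof q / sizeof q[0];
    qsort(q, n, sizeof q[0], cmp);           /* coordinated service order */
    for (size_t i = 0; i < n; i++)
        printf("t=%4ldms app=%d\n", q[i].req_time_ms, q[i].app_id);
    return 0;
}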
Figure 4 illustrates how the I/O coordination algorithm works in
parallel file systems. In this example, there are 4 file servers and
three concurrent applications. The original request arrival orders
are inconsistent on different file servers, such as in subfigure (a).
The series of I/O requests are split into successive ‘Time Windows’
by a fixed time interval on all file servers, as shown in subfigure (b).
The scheduler on each file server then reorders the requests in each ‘Time Window’ by ‘Application ID’, so that requests from one application can be serviced at the same time on all file servers, as shown in subfigure (c).

The scheduler on each file server maintains a queue for all requests, which determines the service order of I/O requests. When a new I/O request arrives, if the queue is empty, the request will be scheduled immediately. If the queue is not empty, the scheduler will insert the request into the queue in terms of ‘Time Window’ and ‘Application ID’. The scheduler keeps issuing the request with the highest priority (i.e., the head of the current queue) to the low-level storage devices on each file server. Since the ‘Application ID’ and request time are generated at the client side and then passed to the file servers, there is no communication between different file servers while scheduling the requests. The use of ‘Application ID’ and ‘Time Window’ has significantly simplified the implementation of the coordination and laid the foundation for good scalability as the number of file servers increases.
3.2 Completion Time Analysis
Assume that the number of file servers is n, the number of con-
current applications is m, and that each application needs to access
data on all file servers (for simplicity). A collective data access
from one application is mapped into n sub-parts to all file servers,
and each sub-part is also a request in a file server. The service time
on each file server for each sub-part is t.
Without I/O coordination, the sub-parts are served in different
file servers independently. As requests are issued simultaneously,
the sub-parts may be served randomly without order on all file
servers. Hence for each sub-part, the finish time on each file server
can randomly fall in {t, 2t, 3t, ..., mt}, and the finish time of data
access for one application depends on the latest finish time of all
nodes. The expectation of completion time of one data access is
equal to the expectation of the maximum finish time on all n file
servers. The average completion time can be represented as Formula (1), where F(k) denotes the probability distribution function and f(x) the probability density function. From the formula, we observe that if there is only 1 file server, the expectation of completion time is $\frac{m+1}{2}t$, which conforms to the distribution of our assumption. The formula also indicates that the completion time increases as the number of file servers n increases, and also as the number of concurrent applications m increases. When the file server number n is very large, $\frac{t}{m^{n}}\sum_{k=1}^{m-1} k^{n}$ would be close to 0, and then the average completion time would be close to mt.
$$T_{avg} = E(\max(T)) = \Big(\sum_{k=1}^{m} k f(k)\Big)t = \Big(\sum_{k=1}^{m} k\big(F(k) - F(k-1)\big)\Big)t = \Big(\sum_{k=1}^{m} k\Big(\big(\tfrac{k}{m}\big)^{n} - \big(\tfrac{k-1}{m}\big)^{n}\Big)\Big)t = mt - \frac{t}{m^{n}}\sum_{k=1}^{m-1} k^{n} \qquad (1)$$
With the I/O coordination strategy, all file servers serve applica-
tions one at a time. I/O requests with synchronization needs will be
served at the same time on all file servers. Therefore, the comple-
tion times for these applications are: t, 2t, ..., mt, and the average
completion time can be represented as Formula (2). The formula
indicates that the average completion time is independent of n, the
number of file servers. That means the average completion time
of the I/O coordination scheme is much more scalable than that of
existing independent scheduling strategies. Currently, parallel file
systems usually reach up to hundreds of storage nodes or even be-
yond. The proposed I/O coordination strategy is a practical way to
reduce the request completion time for data-intensive applications.
$$T'_{avg} = \frac{1}{m}\sum_{k=1}^{m} kt = \frac{m+1}{2}\,t \qquad (2)$$
From Formula (1) and (2), we can calculate the reduction of the
average completion time as follows.
$$T_{diff} = T_{avg} - T'_{avg} = \frac{m-1}{2}\,t - \frac{t}{m^{n}}\sum_{k=1}^{m-1} k^{n} \qquad (3)$$
As can be seen in Formula (3), when the number of file servers n is very large, the reduction of completion time would be close to $\frac{m-1}{2}t$, and the decrease rate would be approaching $\frac{m-1}{2m}$. As the number of concurrent applications m increases, the decrease rate approaches 50%.
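As a quick numeric check on Formulas (1)-(3), the short program below (not from the paper) evaluates both averages, in units of t, under the uniform finish-time model stated above:

#include <stdio.h>
#include <math.h>

/* Formula (1): E[max] over n servers, finish times uniform on {t,...,mt}, t = 1. */
static double t_avg_uncoordinated(int n, int m)
{
    double sum = 0.0;
    for (int k = 1; k < m; k++)
        sum += pow((double)k / m, n);   /* (1/m^n) * sum of k^n, computed stably */
    return m - sum;
}

/* Formula (2): coordinated servers finish applications at t, 2t, ..., mt. */
static double t_avg_coordinated(int m) { return (m + 1) / 2.0; }

int main(void)
{
    const int cases[][2] = { {1, 10}, {8, 10}, {64, 10}, {256, 32} };
    for (int i = 0; i < 4; i++) {
        int n = cases[i][0], m = cases[i][1];
        double a = t_avg_uncoordinated(n, m);
        double b = t_avg_coordinated(m);
        /* Formula (3) is the difference a - b; also print the decrease rate. */
        printf("n=%3d m=%2d  T_avg=%6.3f  T'_avg=%6.3f  decrease=%5.1f%%\n",
               n, m, a, b, 100.0 * (a - b) / a);
    }
    return 0;
}

For n = 1 the first routine returns (m+1)/2, matching Formula (2), and for large n the decrease rate approaches (m-1)/2m, consistent with the 50% limit noted above.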
4. IMPLEMENTATION
We have implemented the server-side I/O coordination scheme
under PVFS2[9] and MPI-IO. PVFS2 is an open source parallel file
system developed jointly by Clemson University and Argonne Na-
tional Laboratory. It is a virtual parallel file system for Linux clus-
ters based on underlying native file systems on storage nodes. The
prototype implementation includes modifications to the PVFS2 re-
quest scheduling module and the PVFS2 driver package in ROMIO
[26] MPI-IO library.
4.1 Implementation in PVFS2
We modified the client interface and server-side request scheduler in PVFS2. The client interface passes ‘Application ID’ and ‘Request Time’ to the file servers, and then the file servers re-arrange request service orders based on the two parameters.
We utilize the ‘PVFS_hint’ mechanism to pass the two parameters between I/O clients and file servers. Two new hint types are defined in the PVFS2 source code: ‘PINT_HINT_APP_ID’ and ‘PINT_HINT_REQ_TIME’, representing the Application ID and request time respectively. We modified the client-side interface PVFS_sys_read/write(), adding ‘PVFS_hint’ as a parameter, so that the hint could be passed to the PVFS2 server side.
When a file server receives a request, the scheduler first calculates its priority, and then inserts the request into the request queue in ascending order of priority. The smaller the priority number a request gets, the earlier it would be scheduled. The request priority is calculated as follows.
req_prior = req_time / interval * 32768 + app_id;
Here req_time is the issue time of the I/O request from the client
side, and it is an integer value referring to the number of millisec-
onds since ‘1970-01-01 00:00:00 UTC’. Interval is the width of
the ‘Time Window’, which can be defined as a startup parameter
in the PVFS2 configuration file. If interval is not configured, it
will use the default value (1000ms for HDD and 250ms for SSD).
App_id represents the ‘Application ID’, and it is an integer value in the range 0 to 32767. From the formula we observe that req_prior orders requests first by ‘Time Window’ and then by ‘Application ID’ within each window.
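
A hedged sketch of the server-side step just described: compute req_prior when a request arrives and insert it into the queue in ascending priority. The list structure and the 1000 ms default are illustrative simplifications, not PVFS2's actual scheduler code:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative queue node; real PVFS2 requests carry far more state. */
typedef struct node {
    long long prior;
    struct node *next;
} node;

/* req_prior = req_time / interval * 32768 + app_id, per the formula above;
   1000 ms is the paper's HDD default window width. */
static long long req_prior(long long req_time_ms, int app_id)
{
    const long long interval = 1000;
    return req_time_ms / interval * 32768 + app_id;
}

/* Insert keeping the queue in ascending req_prior; the scheduler always
   issues the head of the queue to the low-level storage devices next. */
static void enqueue(node **head, long long prior)
{
    node *n = malloc(sizeof *n);
    n->prior = prior;
    while (*head && (*head)->prior <= prior)
        head = &(*head)->next;
    n->next = *head;
    *head = n;
}

int main(void)
{
    node *queue = NULL;
    enqueue(&queue, req_prior(2100, 5));   /* window 2, app 5 */
    enqueue(&queue, req_prior(900, 7));    /* window 0, app 7 */
    enqueue(&queue, req_prior(950, 2));    /* window 0, app 2: served first */
    while (queue) {
        node *next = queue->next;
        printf("prior=%lld\n", queue->prior);
        free(queue);
        queue = next;
    }
    return 0;
}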

Citations
More filters
Proceedings ArticleDOI
19 May 2014
TL;DR: Experiments show how CALCioM can be used to efficiently and transparently improve the scheduling strategy between two otherwise interfering applications, given specified metrics of machine wide efficiency.
Abstract: Unmatched computation and storage performance in new HPC systems have led to a plethora of I/O optimizations ranging from application-side collective I/O to network and disk-level request scheduling on the file system side. As we deal with ever larger machines, the interferences produced by multiple applications accessing a shared parallel file system in a concurrent manner become a major problem. These interferences often break single-application I/O optimizations, dramatically degrading application I/O performance and, as a result, lowering machine wide efficiency. This paper focuses on CALCioM, a framework that aims to mitigate I/O interference through the dynamic selection of appropriate scheduling policies. CALCioM allows several applications running on a supercomputer to communicate and coordinate their I/O strategy in order to avoid interfering with one another. In this work, we examine four I/O strategies that can be accommodated in this framework: serializing, interrupting, interfering and coordinating. Experiments on Argonne's BG/P Surveyor machine and on several clusters of the French Grid'5000 show how CALCioM can be used to efficiently and transparently improve the scheduling strategy between two otherwise interfering applications, given specified metrics of machine wide efficiency.

112 citations


Cites background from "Server-side I/O coordination for pa..."

  • ...[35], with an application’s id instead of an object id....


Proceedings ArticleDOI
23 May 2016
TL;DR: This work provides the first deep insight into the role of each of the potential root causes of interference and their interplay in HPC storage systems and can help developers and platform owners improve I/O performance and motivate further research addressing the problem across all components of the I/O stack.
Abstract: As we move toward the exascale era, performance variability in HPC systems remains a challenge. I/O interference, a major cause of this variability, is becoming more important every day with the growing number of concurrent applications that share larger machines. Earlier research efforts on mitigating I/O interference focus on a single potential cause of interference (e.g., the network). Yet the root causes of I/O interference can be diverse. In this work, we conduct an extensive experimental campaign to explore the various root causes of I/O interference in HPC storage systems. We use microbenchmarks on the Grid'5000 testbed to evaluate how the applications' access pattern, the network components, the file system's configuration, and the backend storage devices influence I/O interference. Our studies reveal that in many situations interference is a result of bad flow control in the I/O path, rather than being caused by some single bottleneck in one of its components. We further show that interference-free behavior is not necessarily a sign of optimal performance. To the best of our knowledge, our work provides the first deep insight into the role of each of the potential root causes of interference and their interplay. Our findings can help developers and platform owners improve I/O performance and motivate further research addressing the problem across all components of the I/O stack.

77 citations


Cites background from "Server-side I/O coordination for pa..."

  • ...The focus has been on causes as diverse as access locality in disks [2], synchronization across storage servers [2], [3], or network contention [4]–[7]....


Proceedings ArticleDOI
20 May 2013
TL;DR: Experimental results show that PDLA is effective in improving data access performance of parallel I/O systems and a runtime system is designed and developed to integrate the PDLA replication scheme with existing parallel I/O systems.
Abstract: The performance gap between computing power and the I/O system is ever increasing, and in the meantime more and more High Performance Computing (HPC) applications are becoming data intensive. This study describes an I/O data replication scheme, named Pattern-Direct and Layout-Aware (PDLA) data replication scheme, to alleviate this performance gap. The basic idea of PDLA is replicating identified data access pattern, and saving these reorganized replications with optimized data layouts based on access cost analysis. A runtime system is designed and developed to integrate the PDLA replication scheme and existing parallel I/O system; a prototype of PDLA is implemented under the MPICH2 and PVFS2 environments. Experimental results show that PDLA is effective in improving data access performance of parallel I/O systems.

45 citations


Cites background from "Server-side I/O coordination for pa..."

  • ...As a result, the disks work in an interleaving way and each request can finish only when all the sub-requests on all nodes finish [16]....


Proceedings ArticleDOI
08 Sep 2015
TL;DR: This paper proposes a burst buffer based I/O orchestration framework, named TRIO, to intercept and reshape the bursty writes for better sequential write traffic to storage servers, and demonstrates that TRIO could efficiently utilize storage bandwidth and reduce the average job I/O time by 37% on average for data-intensive applications in typical checkpointing scenarios.
Abstract: The growing computing power on leadership HPC systems is often accompanied by ever-escalating failure rates. Checkpointing is a common defensive mechanism used by scientific applications for failure recovery. However, directly writing the large and bursty checkpointing dataset to parallel file systems can incur significant I/O contention on storage servers. Such contention in turn degrades bandwidth utilization of storage servers and prolongs the average job I/O time of concurrent applications. Recently burst buffers have been proposed as an intermediate layer to absorb the bursty I/O traffic from compute nodes to storage backend. But an I/O orchestration mechanism is still desirable to efficiently move checkpointing data from burst buffers to storage backend. In this paper, we propose a burst buffer based I/O orchestration framework, named TRIO, to intercept and reshape the bursty writes for better sequential write traffic to storage servers. Meanwhile, TRIO coordinates the flushing orders among concurrent burst buffers to alleviate the contention on storage server. Our experimental results demonstrated that TRIO could efficiently utilize storage bandwidth and reduce the average job I/O time by 37% on average for data-intensive applications in typical checkpointing scenarios.

41 citations


Cites background from "Server-side I/O coordination for pa..."

  • ...Server-side optimizations generally embed their solutions inside the storage server, overcoming issues of contention by dynamically coordinating data movement among servers [37, 14, 43]....


Proceedings ArticleDOI
16 Nov 2014
TL;DR: This paper presents Omnisc'IO, an approach that builds a grammar-based model of the I/O behavior of HPC applications and uses it to predict when futureI/O operations will occur, and where and how much data will be accessed.
Abstract: The increasing gap between the computation performance of post-petascale machines and the performance of their I/O subsystem has motivated many I/O optimizations including prefetching, caching, and scheduling techniques. In order to further improve these techniques, modeling and predicting spatial and temporal I/O patterns of HPC applications as they run has became crucial. In this paper we present Omnisc'IO, an approach that builds a grammar-based model of the I/O behavior of HPC applications and uses it to predict when future I/O operations will occur, and where and how much data will be accessed. Omnisc'IO is transparently integrated into the POSIX and MPI I/O stacks and does not require any modification in applications or higher level I/O libraries. It works without any prior knowledge of the application and converges to accurate predictions within a couple of iterations only. Its implementation is efficient in both computation time and memory footprint.

39 citations

References
More filters
Journal ArticleDOI
TL;DR: The problem of multiprogram scheduling on a single processor is studied from the viewpoint of the characteristics peculiar to the program functions that need guaranteed service and it is shown that an optimum fixed priority scheduler possesses an upper bound to processor utilization.
Abstract: The problem of multiprogram scheduling on a single processor is studied from the viewpoint of the characteristics peculiar to the program functions that need guaranteed service. It is shown that an optimum fixed priority scheduler possesses an upper bound to processor utilization which may be as low as 70 percent for large task sets. It is also shown that full processor utilization can be achieved by dynamically assigning priorities on the basis of their current deadlines. A combination of these two scheduling techniques is also discussed.

7,067 citations


"Server-side I/O coordination for pa..." refers background in this paper

  • ...Some of them adopted deadlinedriven strategies [12, 19, 31], that allow the upper layer to specify latency and throughput goals of file servers and schedule the requests based on Earliest Deadline First(EDF) [16] or its variants [19, 20, 31]....


Proceedings Article
Frank B. Schmuck1, Roger L. Haskin1
28 Jan 2002
TL;DR: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters; the paper discusses how distributed locking and recovery techniques were extended to scale to large clusters.
Abstract: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.

1,434 citations


"Server-side I/O coordination for pa..." refers methods in this paper

  • ...GPFS: A Shared-Disk File System for Large Computing Clusters....


  • ...Large-scale data-intensive supercomputing relies on parallel file systems, such as Lustre [1], GPFS [22], PVFS [9], and PanFS[18] for high-performance I/O....


Journal ArticleDOI
12 Nov 2000
TL;DR: It is demonstrated that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler, and that a small sample of the possible schedules is sufficient to identify a good schedule quickly.
Abstract: Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule. This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs which run well together. We demonstrate an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase which collects information about various possible schedules, and a symbiosis phase which uses that information to predict which schedule will provide the best performance. We show that a small sample of the possible schedules is sufficient to identify a good schedule quickly. On a system with random job arrivals and departures, response time is improved as much as 17% over a schedule which does not incorporate symbiosis.

619 citations

Proceedings ArticleDOI
21 Feb 1999
TL;DR: This work describes how the MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests and explains in detail the two key optimizations ROMIO performs: data sieving for non Contiguous requests from one process and collective I/O for noncont contiguous requests from multiple processes.
Abstract: The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications, an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC), on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

470 citations


"Server-side I/O coordination for pa..." refers background or methods in this paper

  • ...Moreover, collective data access from multiple clients, such as collective I/O in MPI-IO [26], has to wait for all aggregators to complete....


  • ...We also modified the PVFS2 driver in ROMIO [26] to pass ‘Request Time’ and ‘Application ID’ via ‘PVFS_hint’....


  • ...The prototype implementation includes modifications to the PVFS2 request scheduling module and the PVFS2 driver package in ROMIO [26] MPI-IO library....


  • ...ROMIO[26] is a high-performance, portable implementation of MPI-IO, providing applications with a uniform interface in the top layer, and dealing with data access to various file systems by an internal abstract I/O device layer called ADIO....


11 Sep 1998
TL;DR: MPI-IO as discussed by the authors allows users to access a non-contiguous data set with a single I/O function call, which provides MPI implementations an opportunity to optimize data access.
Abstract: The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications--an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)--on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

466 citations

Frequently Asked Questions (12)
Q1. What are the contributions in "Server-side i/o coordination for parallel file systems" ?

Recognizing that an I/O request will not complete until all involved file servers in the parallel file system have completed their parts, in this paper the authors propose a server-side I/O coordination scheme for parallel file systems. The authors present the proposed I/O coordination algorithm and its corresponding analysis of average completion time in this study. The authors also implement a prototype of the proposed scheme under the PVFS2 file system and MPI-IO environment. Experimental results demonstrate that the proposed scheme can reduce average completion time by 8% to 46%, and provide higher I/O bandwidth than that of default data access strategies adopted by PVFS2 for heavy I/O workloads.

In the future, the authors plan to investigate optimization of the I/O coordination strategy based on application data access patterns. The authors also plan to add a minimum group communication in I/O coordination, to explore its feasibility for imbalanced data access workloads. 

In other words, after requests re-ordering, the average completion time decreases from 4t to 3t, which reveals a significant potential for shortening average completion time through request re-ordering at the file servers. 

The reason is that, due to nonuniform network delays, some requests with low priority were already issued to the storage devices in cases when some requests with high priority arrived late on some file servers.

The proposed I/O scheduling algorithm is based on the observation that requests from the same application have a better locality and, equally important, the execution will be optimized if these requests finish at the same time. 

The prototype implementation includes modifications to the PVFS2 request scheduling module and the PVFS2 driver package in ROMIO [26] MPI-IO library. 

Parallel file systems consisting of high-performance storage devices should set a short time window, and those consisting of lower-performance storage devices should set a relatively large window size.

Synchronization of I/O requests across processes is common in parallel computing, and can be classified into the following three categories (as shown in Figure 1).

The experimental results also indicate that, due to the independent scheduling strategy on each file server, data accesses are finished in different orders for concurrent applications. 

These techniques succeed in achieving high bandwidth in disks and networks of file servers, by reducing either the frequency of disk seeks, or the waiting time of socket connections. 

From subfigures (b) and (d), the authors can see that the maximum finish time is 4.4 times the minimum on average in the HDD environment and 3.1 times in the SSD environment.

The experimental results demonstrate that, compared to the conventional data access strategy, the proposed I/O coordination scheme can reduce the I/O completion time by up to 46% and provide a comparable I/O bandwidth. 
