Proceedings ArticleDOI

Performance Optimization of a Parallel, Two Stage Stochastic Linear Program

TL;DR: This paper explores the parallelization of two-stage stochastic resource allocation problems that seek an optimal solution in the first stage, while accounting for sudden changes in resource requirements by evaluating multiple possible scenarios in the second stage.
Abstract: Stochastic optimization is used in several high impact contexts to provide optimal solutions in the face of uncertainties. This paper explores the parallelization of two-stage stochastic resource allocation problems that seek an optimal solution in the first stage, while accounting for sudden changes in resource requirements by evaluating multiple possible scenarios in the second stage. Unlike typical scientific computing algorithms, linear programs (which are the individual grains of computation in our parallel design) have unpredictable and long execution times. This confounds both a priori load distribution as well as persistence-based dynamic load balancing techniques. We present a master-worker decomposition coupled with a pull-based work assignment scheme for load balance. We discuss some of the challenges encountered in optimizing both the master and the worker portions of the computations, and techniques to address them. Of note are cut retirement schemes for balancing memory requirements with duplicated worker computation, and scenario clustering for accelerating the evaluation of similar scenarios. We base our work in the context of a real application: the optimization of US military aircraft allocation to various cargo and personnel movement missions in the face of uncertain demands. We demonstrate scaling up to 122 cores of an intel 64 cluster, even for very small, but representative datasets. Our decision to eschew problem-specific decompositions has resulted in a parallel infrastructure that should be easily adapted to other similar problems. Similarly, we believe the techniques developed in this paper will be generally applicable to other contexts that require quick solutions to stochastic optimization problems.

Summary (3 min read)

Introduction

  • The authors base their work in the context of a real application: the optimization of US military aircraft allocation to various cargo and personnel movement missions in the face of uncertain demands.
  • Keywords: stochastic optimization, parallel computing, large scale optimization, airfleet management. Stochastic optimization provides a means of coping with the uncertainty inherent in real-world systems, and with models that are nonlinear, of high dimensionality, or not conducive to deterministic optimization techniques.
  • With deterministic approaches, the solutions obtained can be far from optimal even with small perturbations of the input data.
  • Examples include making investment decisions in order to increase profit, transportation (planning and scheduling logistics), design-space exploration in product design, etc.
  • The authors present their parallel decomposition and some interesting considerations in dealing with computation-communication granularity, responsiveness, and the lack of persistence of work loads in an iterative setting.

II. MODEL FORMULATION & APPROACH

  • The United States Air Mobility Command (AMC) manages a fleet of over 1300 aircraft [2] that operate globally under uncertain and rapidly changing demands.
  • Aircraft are allocated at different bases in anticipation of the demands for several missions to be conducted over an upcoming time period (typically, fifteen days to one month).
  • The purpose of a stochastic formulation is to optimally allocate aircraft to each mission such that subsequent disruptions are minimized.
  • Note that their formulation of the aircraft allocation model has complete recourse (i.e. all candidate allocations generated are feasible) because any demand (in a particular scenario) that cannot be satisfied by a candidate allocation is met by short term leasing of civilian aircraft at a high cost while evaluating that scenario.
  • The second stage optimization helps Stage 1 to take the recourse action of increasing the capacity for satisfying an unmet demand by providing feedback in the form of additional constraints (cuts) on the Stage 1 LP (6); the two-stage structure and this feedback cut are restated in the sketch below.
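For readers of this summary, the block below restates the two-stage structure and the feedback cut referenced above in LaTeX. The symbols follow the paper's notation (C, A, b, E_l, e_l, p_k, θ_k, q_k, W, h_k, T_k, π_k); the inequality directions are the standard Benders forms, assumed here because the extracted equations in the paper body lost their relation symbols.

```latex
\begin{align}
  \text{Stage 1 (master):}\quad & \min_{x,\,\theta}\; Cx + \sum_{k=1}^{K} p_k \theta_k
      \quad \text{s.t.}\;\; Ax \le b,\;\; E_l x + \theta \ge e_l \\
  \text{Stage 2 (scenario } k\text{):}\quad & \min_{y}\; q_k^{T} y
      \quad \text{s.t.}\;\; W y \ge h_k - T_k x \\
  \text{Feedback cut:}\quad & \theta_k \ge \pi_k (h_k - T_k \bar{x}) - \pi_k T_k (x - \bar{x})
\end{align}
```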

III. PARALLEL PROGRAM DESIGN

  • The authors have implemented the program in Charm++ [6], [7], which is a message-driven, objectoriented parallel programming framework with an adaptive run-time system.
  • The coarse-grained decomposition also emphasizes any sequential bottlenecks and has motivated some of their efforts in optimizing solve times.
  • Since the unit of sequential computation is an LP solve, the two-stage formulation maps readily onto a two-stage parallel design, with the first stage generating candidate allocations, and the second stage evaluating these allocations over a spectrum of scenarios that are of interest.
  • An Allocation Generator object acts as the master and generates allocations, while a collection of Scenario Evaluator objects are responsible for the evaluation of all the scenarios.
  • Charm++ provides flexibility in the placement of compute objects on processors.

IV. OPTIMIZING STAGE 1

  • The two-stage design yields an allocation that is iteratively evolved towards the optimal.
  • As the Stage 1 model grows larger every round, it becomes increasingly limited by the memory subsystem and experiences dilated times for LP solves.
  • Cut Usage Rate = (num rounds in which cut is active) / (num rounds since its generation) (7). The authors therefore implemented a cut retirement scheme that discards/retires cuts whenever the total number of cuts in the Stage 1 model exceeds a configurable threshold.
  • Under the recency-based scoring schemes (LRU and LRFU), recently used cuts are scored higher.
  • For more details and proofs of the weighing function, refer to [9].

V. OPTIMIZING STAGE 2

  • The Stage 2 scenario evaluations constitute the major volume of the computation involved in the Benders approach because of the large number of scenarios in practical applications.
  • Their experiments show that runs with advanced start take fewer rounds to converge than with a fresh start.
  • The authors do not yet have data to back any line of reasoning that can explain this.
  • The authors also implement random clustering for reference.
  • Figure 10 compares the improvement in average Stage 2 solve times when scenarios are clustered using Algorithm 1; a sketch of one possible similarity-based ordering follows this list.
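The paper's Algorithm 1 for scenario clustering is not reproduced in this summary, so the following is only a plausible sketch, assuming scenarios are ordered greedily by similarity of their demand vectors so that consecutive Stage 2 solves benefit from advanced (warm) starts. The Scenario struct, the Euclidean distance metric, and clusterOrder are illustrative assumptions, not the authors' implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scenario record: its index and a demand vector.
struct Scenario {
    int id;
    std::vector<double> demand;
};

// Euclidean distance between two demand vectors (illustrative metric).
static double distance(const Scenario& a, const Scenario& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.demand.size(); ++i) {
        double d = a.demand[i] - b.demand[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Greedy nearest-neighbor ordering: start from the first scenario and
// repeatedly pick the unvisited scenario closest to the last one chosen,
// so that consecutive LP solves see similar right-hand sides and the
// solver's advanced-start basis remains a good starting point.
std::vector<int> clusterOrder(const std::vector<Scenario>& scenarios) {
    std::vector<int> order;
    if (scenarios.empty()) return order;
    std::vector<bool> used(scenarios.size(), false);
    std::size_t current = 0;
    used[current] = true;
    order.push_back(scenarios[current].id);
    for (std::size_t step = 1; step < scenarios.size(); ++step) {
        double best = -1.0;
        std::size_t bestIdx = 0;
        for (std::size_t j = 0; j < scenarios.size(); ++j) {
            if (used[j]) continue;
            double d = distance(scenarios[current], scenarios[j]);
            if (best < 0.0 || d < best) { best = d; bestIdx = j; }
        }
        used[bestIdx] = true;
        order.push_back(scenarios[bestIdx].id);
        current = bestIdx;
    }
    return order;
}
```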

VI. SCALABILITY

  • With the optimizations described above, the authors were able to scale medium-sized problems up to 122 cores of an Intel-64 Clovertown (2.33 GHz) cluster with 8 cores per node.
  • For 120 scenarios, an execution that uses 122 processors represents the limit of parallel decomposition using the described approach: one Stage 1 object, one Work Allocator object, and 120 Scenario Evaluators that each solve one scenario.
  • Figure 13(a) and 13(b) show the scalability plots with Stage 1 and Stage 2 wall time breakdown.
  • The plots also demonstrate Amdahl’s effect as the maximum parallelism available is proportional to the number of scenarios that can be solved in parallel, and scaling is limited by the sequential Stage 1 computations.
  • It must be noted that real-world problems may involve several hundreds or thousands of scenarios, and their current design should yield significant speedups because of Stage 2 parallelization.

VIII. SUMMARY

  • Most stochastic programs incorporate a large number of scenarios to hedge against many possible uncertainties.
  • For stochastic optimization with Benders approach, the vast bulk of computation can be parallelized using a master-worker design described in this paper.
  • The authors presented an LRFU-based cut management scheme that completely eliminates the memory bottleneck and significantly reduces the Stage 1 solve time, thus making the optimization of large scale problems tractable.
  • Much higher speedups can be obtained for real-world problems, which present much larger Stage 2 computational loads.
  • The authors are currently exploring methods such as Lagrangian decomposition to alleviate the remaining sequential Stage 1 bottleneck.

IX. ACKNOWLEDGMENTS

  • The research was supported by MITRE Research Agreement Number 81990 with UIUC.
  • The Gurobi linear program solver is licensed at no cost for academic use.
  • Runs on Abe cluster were done under the TeraGrid [14] allocation grant ASC050040N supported by NSF.


Performance Optimization of a
Parallel, Two Stage Stochastic Linear Program
Akhil Langer
, Ramprasad Venkataraman
, Udatta Palekar
, Laxmikant V. Kale
Dept of Computer Science,
College of Business
University of Illinois at Urbana-Champaign
{alanger, ramv, palekar, kale}@illinois.edu
Steven Baker
MITRE Corporation
sbaker@mitre.org
Abstract—Stochastic optimization is used in several high impact contexts to provide optimal solutions in the face of uncertainties. This paper explores the parallelization of two-stage stochastic resource allocation problems that seek an optimal solution in the first stage, while accounting for sudden changes in resource requirements by evaluating multiple possible scenarios in the second stage. Unlike typical scientific computing algorithms, linear programs (which are the individual grains of computation in our parallel design) have unpredictable and long execution times. This confounds both a priori load distribution as well as persistence-based dynamic load balancing techniques. We present a master-worker decomposition coupled with a pull-based work assignment scheme for load balance. We discuss some of the challenges encountered in optimizing both the master and the worker portions of the computations, and techniques to address them. Of note are cut retirement schemes for balancing memory requirements with duplicated worker computation, and scenario clustering for accelerating the evaluation of similar scenarios. We base our work in the context of a real application: the optimization of US military aircraft allocation to various cargo and personnel movement missions in the face of uncertain demands. We demonstrate scaling up to 122 cores of an Intel 64 cluster, even for very small, but representative datasets. Our decision to eschew problem-specific decompositions has resulted in a parallel infrastructure that should be easily adapted to other similar problems. Similarly, we believe the techniques developed in this paper will be generally applicable to other contexts that require quick solutions to stochastic optimization problems.
Keywords-stochastic optimization, parallel computing, large scale optimization, airfleet management
I. INTRODUCTION
Stochastic optimization provides a means of coping with the uncertainty inherent in real-world systems, and with models that are nonlinear, of high dimensionality, or not conducive to deterministic optimization techniques. Deterministic approaches find optima for a fixed combination of inputs. However, the solutions obtained can be far from optimal even with small perturbations of the input data. This can be problematic because real-world systems often have many perturbations from mean values. Stochastic optimization allows the modeler to account for this randomness by looking for optimality across multiple possible scenarios [1]. Typically, the search for an optimum involves the evaluation of candidate solutions for many possible combinations or variations in input values (scenarios). Since the number of likely or possible scenarios is typically quite large, there is a clear motivation to explore parallel computing to handle this large computational burden.
Stochastic optimization algorithms have applications in statistics, science, engineering, and business. Examples include making investment decisions in order to increase profit, transportation (planning and scheduling logistics), design-space exploration in product design, etc. There are other applications in agriculture, energy, telecommunications, military, medicine, water management, etc.
In this paper, we describe our design for a parallel program to solve a 2-stage stochastic linear optimization model for an aircraft planning problem. We present our parallel decomposition and some interesting considerations in dealing with computation-communication granularity, responsiveness, and the lack of persistence of work loads in an iterative setting.
In Section II we briefly describe the aircraft allocation problem and its formulation as a two-stage stochastic program. In Section III we discuss our parallel program design for the Benders decomposition approach. In Section IV, we present challenges and strategies for optimizing the Stage 1 component of the computations, while in Section V we present our study of the Stage 2 computations. Scalability results are presented in Section VI, while we summarize related work in Section VII.
II. MODEL FORMULATION & APPROACH
The United States Air Mobility Command (AMC)¹ manages a fleet of over 1300 aircraft [2] that operate globally under uncertain and rapidly changing demands. Aircraft are allocated at different bases in anticipation of the demands for several missions to be conducted over an upcoming time period (typically, fifteen days to one month). Causes of changes include demand variation, aircraft breakdown, weather, natural disaster, conflict, etc. The purpose of a stochastic formulation is to optimally allocate aircraft to each mission such that subsequent disruptions are minimized.
Aircraft are allocated by aircraft type, airlift wing, mission type and day. In situations when self-owned military aircraft are not sufficient for outstanding missions, civilian aircraft are leased. The cost of renting civilian aircraft procured in advance for the entire planning cycle is lower than the rent of civilian aircraft leased at short notice. Therefore, a good prediction of the aircraft demand prior to the schedule execution reduces the execution cost.
¹ http://www.amc.af.mil/

We model the allocation process as a two-stage stochastic linear program (LP) with Stage 1 generating candidate allocations and Stage 2 evaluating the allocations over many scenarios. This iterative method developed by Benders [3] has been widely applied to Stochastic Programming. Note that our formulation of the aircraft allocation model has complete recourse (i.e. all candidate allocations generated are feasible) because any demand (in a particular scenario) that cannot be satisfied by a candidate allocation is met by short term leasing of civilian aircraft at a high cost while evaluating that scenario. A detailed description of our model and the potential cost benefits of stochastic vs deterministic models is available elsewhere [4], [5]. To illustrate the size of the datasets of interest, Table I lists the sizes of various airlift fleet assignment models. 3t corresponds to an execution period of 3 days, 5t for 5 days, and so on.
In Stage 1, before a realization of the demands is known, decisions about long-term leasing of civilian aircraft are made, and the allocations of aircraft to different missions at each base location are also decided:

    min   Cx + Σ_{k=1}^{K} p_k θ_k          (1)
    s.t.  Ax ≤ b,                           (2)
          E_l x + θ ≥ e_l                   (3)
In the objective function (1), x corresponds to the allocations by aircraft type, location, mission and time. C is the cost of allocating military aircraft and leasing civilian aircraft. θ = {θ_k | k = 1, ..., K} is the vector of Stage 2 costs for the K scenarios, p_k is the probability of occurrence of scenario k, l corresponds to the iteration in which the constraint was generated, and E_l (e_l) are the coefficients (right hand sides) of the corresponding constraints. Constraints in (2) are the feasibility constraints, while constraints in (3) are cuts which represent an outer linearization of the recourse function.
In Stage 2, the expected cost of an allocation for each scenario in a collection of possible scenarios is computed by solving an LP for that scenario:

    min   q_k^T y                           (4)
    s.t.  W y ≥ h_k − T_k x                 (5)
The second stage optimization helps Stage 1 to take the recourse action of increasing the capacity for satisfying an unmet demand by providing feedback in the form of additional constraints (cuts) on the Stage 1 LP (6). Here, π_k are the dual multipliers obtained from the Stage 2 optimization and x̄ is the allocation vector obtained from the last Stage 1 optimization:

    θ_k ≥ π_k (h_k − T_k x̄) − π_k T_k (x − x̄)          (6)
III. PARALLEL PROGRAM DESIGN
a) Programming Model: We have implemented the program in Charm++ [6], [7], which is a message-driven, object-oriented parallel programming framework with an adaptive run-time system. It allows expressing the computations in terms of interacting collections of objects and also implicitly overlaps computation with communication. Messaging is one-sided and computation is asynchronous and sender-driven, facilitating the expression of control flow which is not bulk synchronous (SPMD) in nature.

TABLE I
MODEL SIZES OF INTEREST (120 SCENARIOS)

Model Name    Num stg1 variables    Num stg2 variables    Num stg2 constraints
3t            255                   1076400               668640
5t            345                   1663440               1064280
10t           570                   3068760               1988640
15t           795                   4157040               2805000
30t           1470                  7956480               5573400

Fig. 1. Design Schematic: a Stg1Solver exchanges allocations and cuts with a Comm (Work Allocator) object, which hands scenarios and allocations to a collection of Stg2Solvers (diagram omitted).
b) Coarse Grained Decomposition: To exploit the state of the art in LP solvers, our design delegates the individual LP solves to a library (Gurobi [8]). This allows us to build atop the domain expertise required to tune these numerically intensive algorithms. However, the same decision also results in a very coarse grain of computation, as the individual solves are not decomposed further. Parallel programs usually benefit from a medium or fine-grained decomposition as it permits a better overlap of computation with communication. In Charm++ programs, medium-sized grains allow the runtime system to be more responsive and give it more flexibility in balancing load. Adopting a coarse-grained decomposition motivates other mitigating design decisions described here. It also emphasizes any sequential bottlenecks and has motivated some of our efforts in optimizing solve times.
c) Two-stage Design: Since the unit of sequential computation is an LP solve, the two-stage formulation maps readily onto a two-stage parallel design, with the first stage generating candidate allocations, and the second stage evaluating these allocations over a spectrum of scenarios that are of interest. Feedback cuts from the second stage LPs guide the generation of a new candidate allocation. There are many such iterations (rounds) until an optimal allocation is found. We express this as a master-worker design in Charm++ with two types (C++ classes) of compute objects. An Allocation Generator object acts as the master and generates allocations, while a collection of Scenario Evaluator objects are responsible for the evaluation of all the scenarios.

Fig. 2. Stage 1 LP solve times with and without advanced start on 2.67 GHz Dual Westmere Xeon (plot comparing advanced start vs. fresh start by round number omitted).
d) Unpredictable Grain Sizes: Experiments show that LP
solves for different scenarios take different amounts of time.
Hence, an a priori static distribution of scenarios across all
the Scenario Evaluators will not achieve a good load balance.
Unlike typical algorithms in parallel, scientific computing, the
time taken for an individual grain of computation (LP solve)
is also devoid of any persistence across different iterations
(rounds). This precludes the use of any persistence-based
dynamic load balancers available in Charm++. To tackle
this fundamental unpredictability in the time taken for a
unit of computation we adopt a work-request or pull-based
mechanism to ensure load-balance. We create a separate work
management entity, Work Allocator object (Comm in Figure 1),
that is responsible for doling out work units as needed. As
soon as a Scenario Evaluator becomes idle, it sends a work
request to the Work Allocator which assigns it an unevaluated
scenario. Figure 1 is a schematic representing our design.
e) Maintaining Responsiveness: A pull-based mechanism to achieve load balance requires support from a very responsive Work Allocator. Charm++ provides flexibility in the placement of compute objects on processors. We use this to place the Allocation Generator and the Work Allocator objects on dedicated processors. This ensures a responsive Work Allocator object and allows fast handling of work requests from the Scenario Evaluators, unimpeded by the long, coarse-grained solves that would otherwise be executing.
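As a plain shared-memory C++ illustration of the pull-based scheme (not the authors' Charm++ message-driven code, where the same exchange happens through asynchronous entry-method invocations), the Work Allocator amounts to a queue of unevaluated scenario indices that idle Scenario Evaluators drain. The class and method names below are hypothetical.

```cpp
#include <mutex>
#include <optional>
#include <queue>

// Minimal sketch of a pull-based work allocator: Scenario Evaluators call
// requestWork() whenever they become idle, so fast workers naturally pick
// up more scenarios than slow ones, regardless of how long each LP takes.
class WorkAllocator {
public:
    explicit WorkAllocator(int numScenarios) {
        for (int s = 0; s < numScenarios; ++s) pending_.push(s);
    }

    // Returns the next unevaluated scenario, or nothing when the round is done.
    std::optional<int> requestWork() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_.empty()) return std::nullopt;
        int scenario = pending_.front();
        pending_.pop();
        return scenario;
    }

    // Refill the queue at the start of the next Benders round.
    void startRound(int numScenarios) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_ = std::queue<int>();
        for (int s = 0; s < numScenarios; ++s) pending_.push(s);
    }

private:
    std::mutex mutex_;
    std::queue<int> pending_;
};
```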
IV. OPTIMIZING STAGE 1
f) Advanced Starts: The two-stage design yields an allocation that is iteratively evolved towards the optimal. Typically, this results in LPs that are only incrementally different from the corresponding LPs in the previous round, as only a few additional constraints may be added every round. LP solvers can exploit such situations by maintaining internal state from a call so that a later call may start its search for an optimum from the previous solution. This is called advanced start (or warm start), and can significantly reduce the time required to find a solution to an LP. We enabled advanced starts for the Stage 1 LP and observed sizable performance benefits (Figure 2).

Fig. 3. Stage 1 memory usage and LP solve times for the 15 time period model on a Dell 2.6 GHz Lisbon Opteron 4180 (plot omitted).

Fig. 4. The impact of artificially constraining memory bandwidth available for an LP solve (10 time period model) on a system with an Intel 64 (Clovertown) 2.33 GHz dual-socket quad-core processor with a 1333 MHz front side bus (per socket), 2x4 MB L2 cache and 2 GB/core memory (plot of average Stage 2 solve time vs. number of cores per node omitted).
g) Memory Footprint and Bandwidth: An observation from Figure 2 is that the Stage 1 solve time increases steadily with the round number irrespective of the use of advanced starts. Our investigation pointed to an increasing solver memory footprint as the cause for such behavior.

Fig. 5. Cut usage rate for a 5 time period model (histogram of number of cuts, log scale, vs. cut usage rate omitted).

Fig. 6. Stage 1 LP solve times and memory usage for the 15 time period model solved to 1% convergence with Cut Window of 75 (run on 8 cores of 2.6 GHz Lisbon Opteron 4180; plot omitted).
During each round, the Allocation Generator incorporates
feedback from the evaluation of each scenario into the Stage 1
model. This feedback is in the form of constraints (cuts) which
are additional rows added to a matrix maintained internally
by the library. The number of cuts added to the model
grows monotonically with the number of rounds; requiring
an increasing amount of memory to store and solve an LP.
Figure 3 captures this trend by plotting memory utilization
for the Allocation Generator object (which includes LP library
memory footprint) and the time taken for the Stage 1 solves
by round number. The memory usage is as high as 5 GB and
the solve time for a single grain of Stage 1 computation can
reach 100s.
To improve the characterization of the LP solves, we
designed an experiment that artificially limits the memory
bandwidth available to a single LP solver instance by simultaneously running multiple, independent LP solver instances
on a multicore node. Our results (Figure 4) show that for the
same problem size, the time to solution of an LP is increased
substantially by limiting the available memory bandwidth per
core. As the Stage 1 model grows larger every round, it
becomes increasingly limited by the memory subsystem and
experiences dilated times for LP solves.
h) Curbing Solver Memory Footprint: For large Stage
1 problems, which take many iterations to converge, the
increasing Stage 1 solve times and the increasing memory
demands exacerbate the serial bottleneck at the Allocation
Generator, and pose a threat to the very tractability of the
Benders approach. However, an important observation in this
context is that not all the cuts added to a Stage 1 problem may
actually constrain the feasible space in which the optimum
solution is found. As new cuts are added, older cuts may no
longer be binding or active. They may become active again in a
later round, or may be rendered redundant if they are dominated
by newer cuts. Such cuts simply add to the size of the Stage 1
model and its solve time, and can be safely discarded. Figure 5
plots a histogram of the cut usage rate (defined by equation 7)
for the cuts generated during the course of convergence of a
5 time period model. Most of the cuts have very low usage
rates while a significant number of the cuts are not used at
all. This suggests that the size of the Stage 1 problem may
be reduced noticeably without diluting the description of the
feasible space for the LP solution.
    Cut Usage Rate = (num rounds in which cut is active) / (num rounds since its generation)          (7)
We therefore implemented a cut retirement scheme that
discards/retires cuts whenever the total number of cuts in the
Stage 1 model exceeds a configurable threshold. After every
round of the Benders method, the cut score is updated based on
its activity in that round. Cuts with small usage rates (defined
by Equation 7) are discarded. The desired number of lowest
scoring cuts can be determined using a partial sort that runs
in linear time.
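A minimal sketch of that selection step, assuming the per-cut scores are held in a plain array: std::nth_element performs the expected linear-time partial partition, after which the lowest-scoring cuts sit at the front of the index array and can be retired. The function name and data layout are illustrative, not the authors' code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Given the usage-rate (or LRFU) score of every cut in the Stage 1 model,
// return the indices of the `numToRetire` lowest-scoring cuts. The partial
// partition done by std::nth_element runs in expected linear time, so this
// stays cheap even when the model holds many thousands of cuts.
std::vector<std::size_t> selectCutsToRetire(const std::vector<double>& score,
                                            std::size_t numToRetire) {
    std::vector<std::size_t> idx(score.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    if (numToRetire >= idx.size()) return idx;        // retire everything
    std::nth_element(idx.begin(), idx.begin() + numToRetire, idx.end(),
                     [&](std::size_t a, std::size_t b) { return score[a] < score[b]; });
    idx.resize(numToRetire);                           // keep the lowest scores
    return idx;
}
```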
Discarding a cut that may be required during a later
round only results in some repeated work. This is because
the Benders approach will cause any necessary cuts to be
regenerated via scenario evaluations in future rounds. This
approach could increase the number of rounds required to
reach convergence, but lowers execution times for each Stage
1 LP solve by limiting the required memory and access
bandwidth. Figure 6 demonstrates these effects and shows
the benefit of cut management on the Stage 1 memory usage
and solve times of the 15 time period model solved to 1%
convergence tolerance. The time to solution reduced from
19025s without cut retirement to 8184s with cut retirement
- a 57% improvement.
We define a Cut Window as the upper limit on the number of
cuts allowed in the Stage 1 model, expressed as the maximum
number of cuts divided by the number of scenarios. Figure 7(a)
and 7(b) describe the effect of different Cut Windows on
the time and number of rounds to convergence. Smaller Cut
Windows reduce the individual Stage 1 solve times, leading to
an overall improvement in the time to solution even though it
takes more rounds to converge. However, decreasing the Cut
Window beyond a certain limit leads to a significant increase
in the number of rounds because several useful cuts are
discarded and have to be regenerated in later rounds. Further
reducing the Cut Window makes it impossible to converge
because the collection of cuts is no longer sufficient. These
experiments demonstrate the need to make an informed choice
of the Cut Window to get the shortest time to solution, e.g.
for the 5 time period model with 120 scenarios, an optimal
Cut Window size is close to 25 while for the 10 time period
model with 120 scenarios it is close to 15.
i) Evaluating Cut-Retirement Strategies: We investigate cut management further to study its performance with different cut scoring schemes. Three cut scoring schemes are discussed here, namely the least frequently used, the least recently used, and the least recently/frequently used:
Least Frequently Used (LFU): A cut is scored based on its rate of activity since its generation (Equation 7).

This scoring method was used for the results presented in Figures 7(a) and 7(b).

Fig. 7. Performance of the 5 and 10 time period models with different Cut Windows: (a) 5 time period model, solved to 0.1% convergence on 8 cores of 2.26 GHz Dual Nehalem; (b) 10 time period model, solved to 1% convergence on 32 cores of 2.67 GHz Intel Xeon hex-core processors (plots of number of rounds and time to solution versus cut window omitted).
Least Recently Used (LRU): In this scheme, the recently used cuts are scored higher; a cut's score is simply the last round in which it was active (LRU Score = last active round for the cut).
Least Recently/Frequently Used (LRFU): This scheme takes both the recency and frequency of cut activity into account. Each round in which the cut was active contributes to the cut score. The contribution is determined by a weighing function F(x), where x is the time span from the activity in the past to the current time:

    LRFU Score = Σ_{i=1}^{k} F(t_base − t_i)

where t_1, t_2, ..., t_k are the rounds in which the cut was active and t_1 < t_2 < ... < t_k ≤ t_base. This policy can demand a large amount of memory if every reference to every cut has to be maintained, and it also demands considerable computation every time cut retirement decisions are made. Lee et al. [9] have proposed a weighing function F(x) = (1/p)^(λx) (p ≥ 2) which reduces the storage and computational needs drastically. They tested it for cache replacement policies and obtained competitive results. With this weighing function, the cut score can be calculated as follows:

    S_{t_k} = F(0) + F(δ) · S_{t_{k−1}},

where S_{t_k} is the cut score at the kth reference to the cut, S_{t_{k−1}} was the cut score at the (k−1)th reference, and δ = t_k − t_{k−1}. For more details and proofs for the weighing function refer to [9]. We use p = 2 and λ = 0.5 for our experiments.
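The incremental update above maps directly onto a per-cut running score. The sketch below uses p = 2 and λ = 0.5 as in the experiments, with a helper to decay scores to a common round before cuts are compared at retirement time; the struct and function names are illustrative, not the authors' code.

```cpp
#include <cmath>

// LRFU bookkeeping for one cut: with the weighing function of Lee et al. [9],
// the score folds into a single running value per cut instead of a full
// reference history.
struct CutScore {
    double score = 0.0;        // S_{t_k}, value at the last active round
    int lastActiveRound = -1;  // t_k, the most recent round the cut was active
};

// F(x) = (1/p)^(lambda * x), with p = 2 and lambda = 0.5 as in the paper.
inline double weight(double x, double p = 2.0, double lambda = 0.5) {
    return std::pow(1.0 / p, lambda * x);
}

// Called when the cut is active (binding) in `round`:
//   S_{t_k} = F(0) + F(t_k - t_{k-1}) * S_{t_{k-1}}
inline void recordActivity(CutScore& c, int round) {
    if (c.lastActiveRound < 0) {
        c.score = weight(0.0);                        // first reference
    } else {
        double delta = static_cast<double>(round - c.lastActiveRound);
        c.score = weight(0.0) + weight(delta) * c.score;
    }
    c.lastActiveRound = round;
}

// Decays a stored score to the current round so cuts whose last activity
// happened in different rounds can be compared when choosing which to retire.
inline double currentScore(const CutScore& c, int round) {
    if (c.lastActiveRound < 0) return 0.0;            // never active
    return weight(static_cast<double>(round - c.lastActiveRound)) * c.score;
}
```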
Figure 8 compares the result of these strategies. LRFU
gives the best performance of the three. The cut windows used
for these experiments were the optimal values obtained from
experiments in Figure 7(a) and 7(b).
Fig. 8. Performance of different cut scoring strategies (LFU, LRU, LRFU) for the 5 time period model (8 cores, cut-window=25, 0.1% convergence) and the 10 time period model (32 cores, cut-window=15, 1% convergence); bar charts of time in seconds and number of rounds omitted.
V. OPTIMIZING STAGE 2
j) Advanced Starts: In every iteration, there are as many
Stage 2 LP solves as there are scenarios. This constitutes the
major volume of the computation involved in the Benders
approach because of the large number of scenarios in practical
applications. Even a small reduction in the number of rounds
or average Stage 2 solve times can have sizable payoffs. In this
section, we analyze different strategies to reduce the amount
of time spent in Stage 2 work.
In contrast to the Stage 1 LP solves, Stage 2 solves take
more time with the advanced start feature as compared to a
fresh start. This can happen because the initial basis from the
previous scenario solve can be a bad starting point for the
new scenario. Despite the slower solves with advanced starts,
our experiments show that runs with advanced start take fewer
rounds to converge than with a fresh start. This indicates that
starting from the previous solves gives us better cuts. This
behavior was seen across several input datasets. We do not
yet have data to back any line of reasoning that can explain this.

Citations
More filters
Proceedings ArticleDOI
01 Jan 2013
TL;DR: This paper explores the parallelization of a two-stage stochastic integer program solved using branch-and-bound, and examines the scalability of the designs using sample aircraft allocation problems for the US airfleet.
Abstract: Many real-world planning problems require searching for an optimal solution in the face of uncertain input. One approach to is to express them as a two-stage stochastic optimization problem where the search for an optimum in one stage is informed by the evaluation of multiple possible scenarios in the other stage. If integer solutions are required, then branch-and-bound techniques are the accepted norm. However, there has been little prior work in parallelizing and scaling branch-and-bound algorithms for stochastic optimization problems. In this paper, we explore the parallelization of a two-stage stochastic integer program solved using branch-and-bound. We present a range of factors that influence the parallel design for such problems. Unlike typical, iterative scientific applications, we encounter several interesting characteristics that make it challenging to realize a scalable design. We present two design variations that navigate some of these challenges. Our designs seek to increase the exposed parallelism while delegating sequential linear program solves to existing libraries. We evaluate the scalability of our designs using sample aircraft allocation problems for the US airfleet. It is important that these problems be solved quickly while evaluating large number of scenarios. Our attempts result in strong scaling to hundreds of cores for these datasets. We believe similar results are not common in literature, and that our experiences will feed usefully into further research on this topic.

10 citations


Cites background or methods from "Performance Optimization of a Paral..."

  • ...It provides a Python based programming framework for developing stochastic optimization models....

    [...]

  • ...Ensuring high utilization of compute resources will therefore require interleaving the iterative twostage evaluation of multiple BnB vertices....

    [...]

Proceedings ArticleDOI
01 May 2022
TL;DR: RAPTOR represents important progress towards improvement of computational drug discovery, in terms of size of libraries screened, and for the possibility of generating training data fast enough to serve the last generation of docking surrogate models.
Abstract: We describe the design, implementation and performance of the RADICAL-Pilot task overlay (RAPTOR). RAPTOR enables the execution of heterogeneous tasks-i.e., functions and executables with arbitrary duration-on HPC platforms, pro-viding high throughput and high resource utilization. RAPTOR supports the high throughput virtual screening requirements of DOE's National Virtual Biotechnology Laboratory effort to find therapeutic solutions for COVID-19. RAPTOR has been used on 8300 compute nodes to sustain 144M/hour docking hits, and to screen 1011 ligands. To the best of our knowledge, both the throughput rate and aggregated number of executed tasks are a factor of two greater than previously reported in literature. RAPTOR represents important progress towards improvement of computational drug discovery, in terms of size of libraries screened, and for the possibility of generating training data fast enough to serve the last generation of docking surrogate models.

1 citations

Dissertation
01 Jan 2016
TL;DR: This dissertation proposes a chance-constrained model that integrates operating rooms to surgeries and then to develop schedules, and proposes an equivalent risk-neutral minimax reformulation for the considered problem, to which dual decomposition methods apply.
Abstract: The primary focus of this dissertation is to develop solution methods for stochastic programs, especially those with binary decisions and risk-averse features such as chance constraint or riskminimizing objective. We approach these problems through a scenario-based reformulation, e.g., sample average approximation, which is more amenable to solution by decomposition methods. The reformulation is often of intractable scale due to the use of a large number of scenarios to represent the uncertainty. Our goal is to develop specialized decomposition algorithms that take advantage of the problem structure and solve the problem in reasonable time. We first study a surgery planning problem with uncertainty in surgery durations. A common practice is to first assign operating rooms to surgeries and then to develop schedules. We propose a chance-constrained model that integrates these two steps, yielding a better tradeoff between cost and the quality of service. A branch-and-cut algorithm is developed, which exploits valid inequalities derived from a bin packing problem and a series of single-machine scheduling problems. We also discuss models and solutions given ambiguous distributional information. Computational results demonstrate the efficacy of the proposed algorithm and provide insights into enhancing performance by the integrated model, managing quality of service via chance constraint, and using data to guide planning under distributional information ambiguity. Next, we study general chance-constrained 0-1 programs, where decisions made before the realization of uncertainty are binary. As most of the existing methods fail when the number of scenarios is large, we develop dual decomposition algorithms that find solutions through bounds and cuts efficiently. Then we derive a proposition about computing the Lagrangian dual whose application substantially reduces the number of subproblems to solve, and develop a cut aggregation method that accelerates the solution of individual subproblems. We also explore non-trivial parallel schemes to implement our algorithms in a distributed system. All of them are shown to improve the speed of the algorithms effectively. We then continue to study dual decomposition, but for risk-averse stochastic 0-1 programs, which do not have chance constraints but minimize the risk of some random outcome measured by a coherent risk function. Using generic dual representations for coherent risk measures, we derive xi an equivalent risk-neutral minimax reformulation for the considered problem, to which dual decomposition methods apply. Motivated by some observation of inefficiency in the foregoing work, we investigate in more depth how to exploit the Lagrangian relaxation by comparing three different approaches for computing lower bounds. We also study parallelism more comprehensively, testing schemes that represent different combinations of basic/master-worker, synchronous/asynchronous and push/pull systems, and identify that the best is a master-worker, asynchronous and pull scheme, which achieves near-linear or even super-linear speedup.

1 citations


Cites background from "Performance Optimization of a Paral..."

  • ...A few examples include military (Langer et al., 2012), energy (Carøe and Schultz, 1998; Wang et al., 2012), finance (Kouwenberg, 2001; Yu et al., 2003), healthcare (Denton et al., 2007; Salmerón and Apte, 2010) and supply chain (Goh et al., 2007; Santoso et al., 2005)....

    [...]

  • ...A few examples include military (Langer et al., 2012), energy (Carøe and Schultz, 1998; Wang et al....

    [...]

Proceedings ArticleDOI
01 Jan 2015
TL;DR: This paper proposes a split-and-merge (SAM) method for accelerating the convergence of stochastic linear programs, which splits the original problem into subproblems, and utilizes the dual constraints from the subproblems to accelerate the convergence of the original problem.
Abstract: Stochastic program optimizations are computationally very expensive, especially when the number of scenarios are large. Complexity of the focal application, and the slow convergence rate add to its computational complexity. We propose a split-and-merge (SAM) method for accelerating the convergence of stochastic linear programs. SAM splits the original problem into subproblems, and utilizes the dual constraints from the subproblems to accelerate the convergence of the original problem. Our initial results are very encouraging, giving up to 58% reduction in the optimization time. In this paper we discuss the initial results, the ongoing and the future work.

Cites background from "Performance Optimization of a Paral..."

  • ...Langer et al (Langer et al., 2012) propose clustering schemes for solving similar scenarios in succession that significantly reduces the Stage 2 scenario optimization times by use of advanced/warm start....

    [...]

References
More filters
BookDOI
27 Jun 2011
TL;DR: This textbook provides a first course in stochastic programming suitable for students with a basic knowledge of linear programming, elementary analysis, and probability to help students develop an intuition on how to model uncertainty into mathematical problems.
Abstract: The aim of stochastic programming is to find optimal decisions in problems which involve uncertain data. This field is currently developing rapidly with contributions from many disciplines including operations research, mathematics, and probability. At the same time, it is now being applied in a wide variety of subjects ranging from agriculture to financial planning and from industrial engineering to computer networks. This textbook provides a first course in stochastic programming suitable for students with a basic knowledge of linear programming, elementary analysis, and probability. The authors aim to present a broad overview of the main themes and methods of the subject. Its prime goal is to help students develop an intuition on how to model uncertainty into mathematical problems, what uncertainty changes bring to the decision process, and what techniques help to manage uncertainty in solving the problems.In this extensively updated new edition there is more material on methods and examples including several new approaches for discrete variables, new results on risk measures in modeling and Monte Carlo sampling methods, a new chapter on relationships to other methods including approximate dynamic programming, robust optimization and online methods.The book is highly illustrated with chapter summaries and many examples and exercises. Students, researchers and practitioners in operations research and the optimization area will find it particularly of interest. Review of First Edition:"The discussion on modeling issues, the large number of examples used to illustrate the material, and the breadth of the coverage make'Introduction to Stochastic Programming' an ideal textbook for the area." (Interfaces, 1998)

5,398 citations

Journal ArticleDOI
J. F. Benders1
TL;DR: Paper presented to the 8th International Meeting of the Institute of Management Sciences, Brussels, August 23-26, 1961, introducing the partitioning procedure for mixed-variables programming problems now known as Benders decomposition.
Abstract: Paper presented to the 8th International Meeting of the Institute of Management Sciences, Brussels, August 23-26, 1961.

1,750 citations


"Performance Optimization of a Paral..." refers methods in this paper

  • ...Scalability results are presented in Section VI, while we summarize related work in Section VII....

    [...]

Book
24 Apr 2009

1,364 citations


"Performance Optimization of a Paral..." refers background in this paper

  • ...Keywords-stochastic optimization, parallel computing, large scale optimization, airfleet management I. INTRODUCTION Stochastic optimization provides a means of coping with the uncertainty inherent in real-world systems; and with models that are nonlinear, of high dimensionality, or not conducive to…...

    [...]

Journal ArticleDOI
TL;DR: Experimental results from trace-driven simulations show that the performance of the LRFU is at least competitive with that of previously known policies for the workloads the authors considered.
Abstract: Efficient and effective buffering of disk blocks in main memory is critical for better file system performance due to a wide speed gap between main memory and hard disks. In such a buffering system, one of the most important design decisions is the block replacement policy that determines which disk block to replace when the buffer is full. In this paper, we show that there exists a spectrum of block replacement policies that subsumes the two seemingly unrelated and independent Least Recently Used (LRU) and Least Frequently Used (LFU) policies. The spectrum is called the LRFU (Least Recently/Frequently Used) policy and is formed by how much more weight we give to the recent history than to the older history. We also show that there is a spectrum of implementations of the LRFU that again subsumes the LRU and LFU implementations. This spectrum is again dictated by how much weight is given to recent and older histories and the time complexity of the implementations lies between O(1) (the time complexity of LRU) and {\rm O}(\log_2 n) (the time complexity of LFU), where n is the number of blocks in the buffer. Experimental results from trace-driven simulations show that the performance of the LRFU is at least competitive with that of previously known policies for the workloads we considered.

593 citations


"Performance Optimization of a Paral..." refers background in this paper

  • ...Note that our formulation of the aircraft allocation model has complete recourse (i.e. all candidate allocations generated are feasible) because any demand (in a particular scenario) that cannot be satisfied by a candidate allocation is met by short term leasing of civilian aircraft at a high cost…...

    [...]

Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "Performance optimization of a parallel, two stage stochastic linear program" ?

This paper explores the parallelization of two-stage stochastic resource allocation problems that seek an optimal solution in the first stage, while accounting for sudden changes in resource requirements by evaluating multiple possible scenarios in the second stage. The authors present a master-worker decomposition coupled with a pull-based work assignment scheme for load balance. The authors discuss some of the challenges encountered in optimizing both the master and the worker portions of the computations, and techniques to address them. The authors demonstrate scaling up to 122 cores of an Intel 64 cluster, even for very small, but representative datasets. Similarly, the authors believe the techniques developed in this paper will be generally applicable to other contexts that require quick solutions to stochastic optimization problems.