Adaptive Strategies for Dynamic Pricing Agents

doi:10.1109/WI-IAT.2011.193

Sara Ramezani, Peter A. N. Bosman, and Han La Poutré

CWI, Dutch National Institute for Mathematics and Computer Science

P.O. Box 94079, NL-1090 GB, Amsterdam, The Netherlands

Email: {S.Ramezani, Peter.Bosman, Han.La.Poutre}@cwi.nl

Abstract: Dynamic Pricing (DyP) is a form of Revenue Management in which the price of a (usually) perishable good is changed over time to increase revenue. It is an effective method that has become even more relevant and useful with the emergence of Internet firms and the possibility of readily and frequently updating prices. In this paper a new approach to DyP is presented. We design adaptive dynamic pricing strategies and optimize their parameters offline with an Evolutionary Algorithm (EA), while the strategies themselves can deal with stochastic market dynamics quickly online. We design two adaptive heuristic dynamic pricing strategies in a duopoly where each firm has a finite inventory of a single type of good. We consider two cases, one in which the average of a customer population's stochastic valuation for each of the goods is constant throughout the selling horizon and one in which the average customer valuation for each good changes according to a random Brownian motion. We also design an agent-based software framework for simulating various dynamic pricing strategies in agent-based marketplaces with multiple firms in a bounded time horizon. We use an EA to optimize the parameters for each of the pricing strategies in each of the settings and compare the strategies with other strategies from the literature. We also perform sensitivity analysis and show that the optimized strategies work well even when used in settings with varied demand functions.

I. INTRODUCTION

Dynamic Pricing (DyP) is a form of Revenue Management (RM) that involves changing the price of goods or services over time with the aim of increasing revenue. Revenue management is a much broader term that refers to various techniques for increasing the revenue from (usually) perishable goods or services. RM became particularly popular within the airline industry after the deregulation of the industry in the United States in the late 1970s [18].

Today, the Internet provides exceptional opportunities for practicing RM, and particularly DyP. This is due both to the amount of data available and to the restructuring of price posting procedures. Thus, the Internet can facilitate offering different prices to different customers and posting new prices at minimal extra cost. This also allows for the increased use of intelligent autonomous agents in e-commerce: agents designed for automatic buying, selling, price comparison, bargaining, etc.

Most RM methods exploit the differences between customers' valuations of a good, or changes in these valuations over time, to boost revenue. This has led to different methods of distinguishing between customers based on their valuation of the goods, such as fare class distinction, capacity control, dynamic pricing, auctions, promotions, coupons, and price discrimination methods such as group discounts [18].

Here we focus on dynamic pricing [7]. By changing prices over time, firms can ask for the price that yields the highest revenue at each moment. This allows them to distinguish between customers in cases where customers with different utilities buy at different times, as well as to exploit changes in the valuations of the same customers over time. Many RM methods can be categorized as dynamic pricing, be it the end-of-season markdown of a fashion retailer or the inflated last-minute price of a business-class flight ticket. The main question is when and how to change the prices in order to obtain the most revenue. This depends on the market structure and dynamics, and most importantly on the customer demand rate and how it changes over time.

In this paper we study dynamic pricing of a limited supply of goods in a competitive finite-horizon market. We design and implement an interactive agent-based marketplace where the agents are the firms who wish to increase their revenue using dynamic pricing strategies. We study two cases, one in which the customers' valuations for the products follow the same distribution throughout the selling horizon and one in which the average of their valuations follows a random Brownian motion over time.

We design two adaptive pricing strategies and use learning to optimize them. The execution of the strategies is not computationally intensive (O(1) complexity) and they are understandable from a practical perspective. The strategies use only the observed market response to set new prices at each point in time. The parameters used in each strategy are then optimized using an Evolutionary Algorithm (EA) [3]. Simulation software has been implemented that can generate and simulate dynamic pricing strategies in various oligopolistic markets. The EA uses this simulator as a black box and optimizes the parameters for a given strategy using the obtained revenue as the fitness criterion. So on the one hand, the strategies are capable of very fast adaptive decision making at run-time, and on the other hand their parameters are optimized offline to tune them for a more specific setting.

In many real-world applications, some general knowledge of the market dynamics exists beforehand, although it may differ from what will actually happen, both because of inaccuracies in the estimations and predictions and because of unexpected changes to the market. Using our proposed approach, this knowledge can be used for offline learning, to optimize the parameters of the strategies before the actual selling starts. Also, because of the adaptiveness of the proposed pricing strategies, any deviations from the expected dynamics of the market will be detected quickly and accounted for by the strategy online, and thus the strategies also work reasonably well in various market settings different from those they have been tuned for.

In the strategies we present, the selling agents do not assume the demand to have an a priori structure, though the EA uses the market simulation with a particular structure to optimize the performance of the strategy over the space of its parameters. Thus, information about the demand structure is passed to the selling agents implicitly through the parameters. The strategies are adaptive in the sense that they detect changes in the market and adjust their prices accordingly. Hence they can effectively deal with market changes.

In order to evaluate our strategies' performance, we compare them to a number of strategies previously studied in the DyP literature. We show that our strategies outperform a fixed price (FP) strategy that is optimized offline. This is significant, because FP strategies perform very well in many configurations [9]. It should be noted here that in most other models an FP strategy is actually the optimal strategy and DyP is used to find this optimal fixed price, but in our case no fixed price is optimal due to the combination of competition, finite inventory, and a finite time horizon. This is also shown by the fact that our strategies outperform the offline-optimized FP (which is very close to the actual best possible fixed price). The same features make the analytical computation of the best solution in our model intractable, requiring experimentation to evaluate our strategies. In fact, a benefit of using simulations is that we are able to tackle more complicated models that are too difficult to approach theoretically.

Furthermore, the strategies also perform better than the optimized versions of the derivative follower (DF) learning algorithm, which keeps changing the price in the same direction (increasing or decreasing) as long as the revenue keeps increasing, and then reverses the direction of the price change. Such algorithms have previously been used successfully in DyP settings [4], [6], [14]. Our proposed strategies are also compared to the Goal Directed (GD) strategy of [6], which is particularly similar to one of our strategies. Both of our strategies outperform the GD strategy, for which the only parameter, the initial price, is optimized for the given setting.
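The derivative follower rule described above can be illustrated with a minimal Python sketch. The fixed step size, the function name, and the choice to reverse direction only when revenue strictly drops are assumptions made for illustration; the cited DF variants may differ in these details.

```python
def derivative_follower(last_price, last_revenue, prev_revenue, direction, step):
    """One price update of a simple derivative follower (illustrative sketch).

    direction is +1 (raising prices) or -1 (lowering prices).
    The fixed step size is an assumption, not taken from the paper.
    """
    # Reverse the direction of change only when the last change reduced revenue.
    if last_revenue < prev_revenue:
        direction = -direction
    return last_price + direction * step, direction
```

For example, a firm that raised its price and saw revenue rise keeps raising it, whereas a revenue drop makes it start lowering the price in the next interval.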

Finally, we show that optimized adaptive strategies still perform reasonably well when various changes are made to the market configuration after the learning phase. To this end, we evaluate the performance of a strategy with parameters optimized for a given configuration on a stochastically varied configuration. The obtained revenue is then compared with the revenue obtained by using the same strategy with parameters optimized for the varied setting. This gives us the regret of wrongly estimating the market structure. The variations in configuration that we study include altering the demand function by changing the customer/good ratio.

II. RELATED WORK

DyP has been a very active research area in recent years. Many studies try to learn the demand structure, or the parameters of a known demand function, on the fly. They typically use part of the selling horizon for exploring the market, trying out the demand rate for different prices in a systematic way, and another portion of the time for exploiting the market, using the best price(s) based on their estimates [2], [8]. Others use statistical learning methods and heuristics based on mathematical estimations of the optimal price [1]. While most DyP models are monopolies, there are some that model competitors in the marketplace as well [14], [15], [16]. In this work we deal with a duopoly market, though the firm does not model a competitor explicitly.

Works that are similar to ours in experimenting with heuristic strategies by simulation are fewer. The Information Economics group at IBM has investigated the effect of interacting pricing agents, which they call pricebots, in a number of works (see [14] for a survey). In some, they use game-theoretic analysis and experiment with heuristics that aim at achieving the optimal equilibrium price [11]. They focus on the market dynamics and pricing patterns that arise when using these strategies against each other. They also study shopbots [10], strategic buyer agents, and pricing where agents may be differentiated horizontally or vertically based on their preferences for different attributes of a product. Some of the algorithms discussed in these works rely on more information than can be obtained from the market simulations only, but we have compared our work to the FP and a few versions of the DF strategies, both of which are used in these works.

Multi-attribute DyP is also discussed in [5] and [13]. In [13] a heuristic method for dynamic pricing is presented which consists of a preference elicitation algorithm and a dynamic pricing algorithm. The method is then compared to a DF algorithm and the GD algorithm from [6]. Their model differs from the one we use in the existence of multiple attributes (we consider a single attribute here, the price) and also because they have a finite number of identifiable buyers (compared to our infinite population of one-time buyers). Their algorithms, although similar to ours in fast online decision making and the use of simulation for the evaluation, were not directly comparable to the strategies in our current model due to their strong dependence on these differences, the simplification of which would significantly undermine the strengths of their strategies.

In [4], a heuristic Model Optimizer (MO) method is designed and compared to a DF pricing strategy. The MO strategy uses information from the previous time intervals for a more detailed model of the demand, and solves a non-linear equation in each time step using a simplex hill-climbing approach. This heuristic strategy differs from ours in its online computational complexity, which is much higher due to the online optimization.

In [6] a DF algorithm and an inventory-based GD algorithm are used in a number of simulations to show how they actually behave in a market and in which scenarios each one is useful. We compare our strategies with both of these strategies, because they are both compatible with our model and comparable with our strategies in their computational intensity and the information they use.

In [17], a few different EA methods are used to solve a dynamic pricing problem. Their approach is not comparable to ours since they use their optimizer to optimize actual prices for a dynamic pricing model with a small number (fewer than 10) of time steps. A method similar to [17] is not useful in our stochastic model because optimizing prices using an EA would lead to an over-fitted solution that works better than the adaptive strategies only for the instances (see Definition 2 in Section IV) it is optimized for and considerably worse on average. It is also far more time-consuming when considering a larger number of time steps.

III. MODEL

We have a market with two competing firms. The revenue (the sum of the prices of the items sold) of one of the firms is optimized using DyP. Each firm can change the price of its goods at the start of equidistant time intervals.

A. Firms

We have a finite number, $m$, of firms, $\{0, 1, \ldots, m-1\}$. We refer to firm $j$'s good type as $g_j$. The firm starts off with an initial inventory $Y_j$ of its product, and the capacity left of the good at time $t$ is denoted by $y_j(t)$ (the $t$ can be omitted if there is no chance of ambiguity).

The model is a finite-horizon model: the goods left at the end of each time step are transferred into the next, and all goods are lost at the end of the whole time span. Each firm announces a selling price, $p_j(t)$, for each good type in each time interval $t$. A cost for each good type, $cr_j$, serves as a reserve price for goods of that type.

B. Customers

1) Preferences: The customers specify their preferences using non-negative cardinal utilities that are exchangeable with monetary payments. Each customer has a valuation function that determines these utilities. Customers are unit-demand: they only have preferences on sets consisting of one item, so their valuation functions are defined as $v : G \to \mathbb{R}^{+}$. Thus, a customer has to specify only a single number for each good type, and its valuation for getting more than one item is always zero. A customer's utility for getting an item is equal to the difference between its valuation for the item and the item's price, which is $u_j(t) = v(g_j) - p_j(t)$ for firm $j$'s good at time $t$.

2) Population: We model the customer population as unbounded. This means that the distribution of customers does not change after an item is sold. Also, the valuations of all of the customers for each unit of each of the good types follow the same distribution. This distribution, which is denoted by $Pr_{j,t}$ for firm $j$'s good at time $t$, may or may not change over time.

These distributions are all normal distributions. We consider two settings. In the first setting, the normal distribution is the same for each good type and customer segment pair throughout the time horizon (so the $t$ can be omitted from $Pr_{j,t}$). In the second setting, which we refer to as the Brownian setting, the mean of each of the $Pr_{j,t}$ distributions changes over time, following a basic model of Brownian motion: the mean increases by a constant amount ($b$), decreases by the same constant amount, or does not change, each of these cases happening with equal probability ($\frac{1}{3}$). This allows for some structured dynamism in the demand pattern in the model.
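As an illustration of the Brownian setting, the following minimal Python sketch generates a path for the mean valuation. It assumes that the mean is updated once per time interval, which is one possible reading of the description above; the function and argument names are hypothetical.

```python
import random


def brownian_means(initial_mean, b, num_intervals, rng):
    """Mean valuation per time interval under the three-way random walk.

    Assumption: the mean moves once per interval by +b, -b, or 0,
    each with probability 1/3.
    """
    means = [initial_mean]
    for _ in range(num_intervals - 1):
        means.append(means[-1] + rng.choice((-b, 0.0, b)))
    return means


def sample_valuation(mean, std_dev, rng):
    """Draw one customer's valuation from the normal distribution for an interval."""
    return rng.gauss(mean, std_dev)
```

In the first (static) setting the same sketch applies with b set to zero, so the mean stays constant throughout the horizon.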

3) Customer arrival: The number of customers that arrive in each time step follows a Poisson process with a constant intensity $a$. The firms may be aware of the parameter of this process when making their pricing decisions. In each time interval, the customers arrive consecutively after the firms have set their prices. They may or may not buy a product based on their choice function and, in any case, leave the market afterwards.

4) Choice Model: At any time $t$ at which a customer has to make a purchase decision, it will buy one unit of the good $g^{*}$ offered by firm $f^{*} \in \arg\max_j \{u_j(t) \mid u_j(t) > 0\}$, if it exists, i.e. the item for which it has the highest utility provided that not all items are priced higher than it is willing to pay, with probability $1 - \lambda$, and does not purchase anything with probability $\lambda$. The $\lambda$ factor models a general chance of a purchase not occurring, which is close to the natural behavior of customers in many contexts. Note that the effect of $\lambda$ can also be achieved by changing the arrival rate when we are using a Poisson arrival process, but this is not so with all arrival models. If $g^{*}$ does not exist, the customer will not make a purchase. Ties are broken randomly.

As is evident from the model, the customers are myopic (greedy) and purchase only based on current utilities, not any prediction of what will happen next.
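The arrival and choice rules above can be sketched in Python as follows. This is an illustrative reading rather than the simulator's actual code: the function names are hypothetical, the Poisson sampler uses Knuth's multiplication method so that only the standard library is needed, and firms without stock are excluded from the choice set because customers cannot buy from them.

```python
import math
import random


def poisson_sample(intensity, rng):
    """Number of arrivals in one interval (Knuth's multiplication method)."""
    limit = math.exp(-intensity)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def choose_firm(valuations, prices, stock, no_buy_prob, rng):
    """Index of the firm a myopic customer buys from, or None for no purchase.

    valuations[j] is v(g_j), prices[j] is p_j(t), and no_buy_prob plays
    the role of lambda in the choice model.
    """
    positive = {
        j: valuations[j] - prices[j]
        for j in range(len(prices))
        if stock[j] > 0 and valuations[j] - prices[j] > 0
    }
    # No good with positive utility, or the general chance of not buying.
    if not positive or rng.random() < no_buy_prob:
        return None
    best = max(positive.values())
    # Ties between equally attractive goods are broken uniformly at random.
    return rng.choice([j for j, u in positive.items() if u == best])
```

A customer with valuations (10, 8) facing prices (5, 7) has utilities 5 and 1 and therefore buys from firm 0, unless the no-purchase event occurs.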

C. Modeling Time

In any dynamic pricing model, by definition, the firms should be able to adjust their prices over time. While changing prices at any particular moment may become more plausible, particularly with Internet firms, it is still more common in the literature for the change of prices to occur at fixed time intervals. We suppose that there are $T$ time intervals, numbered from 1 to $T$ successively. At the start of each time interval, all firms set the prices for their goods.

IV. MARKET SIMULATION

We have developed software for simulating the marketplace described in the previous section. The software uses an event queue to keep track of two types of events: pricing events, in which firms set the price for each of their item types at the beginning of each time period, and customer arrival events.
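One way such an event queue could be organized is sketched below in Python, using a binary heap ordered by event time. The event encoding, the zero-based interval numbering, and the uniform placement of arrivals within an interval are assumptions for illustration and are not taken from the simulator itself.

```python
import heapq
import random

# Event kinds; pricing sorts before arrival when two events share a timestamp.
PRICING, ARRIVAL = 0, 1


def build_event_queue(num_intervals, arrivals_per_interval, rng):
    """Create a heap of (time, kind, interval) events for one simulation."""
    events = []
    for t in range(num_intervals):
        # Firms set their prices at the very start of interval t.
        heapq.heappush(events, (float(t), PRICING, t))
        # Customers then arrive one after another during the interval.
        for _ in range(arrivals_per_interval[t]):
            heapq.heappush(events, (t + rng.random(), ARRIVAL, t))
    return events


def drain(events):
    """Pop all events in chronological order."""
    order = []
    while events:
        order.append(heapq.heappop(events))
    return order
```

Processing the drained list in order guarantees that every customer sees the prices announced at the start of the interval in which it arrives.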

Some notation that helps describe the experiments in the following section is defined here.

Definition 1 (Configuration): A configuration is a model where all parameters are set. These parameters consist of the properties of the firms (costs of goods, initial stock, etc.) and the valuation distributions and arrival rate of the customers.

Definition 2 (Instance): An instance of the problem is a specific configuration together with samplings for the stochastic variables (i.e. a fixed random seed for the pseudo-random generator in the software).

Definition 3 (Pricing Strategy): A pricing strategy, or simply strategy, is a function that, given a fixed number of parameters, sets a new price for a unit of the firm's good in each time step. The function can also depend on the previous events that have occurred in the market. We assume here that the firm is aware of the previous customers' behavior and previous prices, and that the firm knows the customer arrival rate, but nothing about the customers' valuation functions. All strategies are deterministic.

Based on the above definitions, an instance of the problem is deterministic given the firms' strategies, i.e. it will yield the same results when the firms use the same strategies, while a configuration alone does not contain enough information to determine an outcome.

Definition 4 (Simulation): By a simulation, we designate a single execution of a particular instance of the problem with fixed strategies for the firms.

Definition 5 (Batch run): By a batch run, or simply batch, consisting of n simulations, we mean the simulation of n different instances of the problem that share the same configuration and use the same strategy for each of the firms throughout the n instances.
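The relation between configurations, instances, simulations, and batch runs can be made concrete with a small Python sketch. The market inside `simulate` is a deliberately toy stand-in (a single firm and one potential customer per interval), and all names and dictionary keys are hypothetical; only the role of the random seed and the averaging over instances reflect the definitions above.

```python
import random
from statistics import mean


def simulate(config, strategy, seed):
    """Run one instance: a configuration plus a fixed random seed."""
    rng = random.Random(seed)  # the seed fixes every stochastic sampling
    revenue, stock = 0.0, config["initial_stock"]
    for t in range(config["num_intervals"]):
        price = strategy(t)
        # Toy demand (an assumption): one potential customer per interval.
        valuation = rng.gauss(config["mean_valuation"], config["std_valuation"])
        if stock > 0 and valuation > price:
            revenue += price
            stock -= 1
    return revenue


def batch_run(config, strategy, seeds):
    """Average revenue over instances that share a configuration and strategy."""
    return mean(simulate(config, strategy, seed) for seed in seeds)
```

Calling `simulate` twice with the same seed and strategy returns the same revenue, which is the determinism property stated above, while `batch_run` averages out the sampling noise across instances.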

V. ADAPTIVE HEURISTIC STRATEGIES

We present two heuristic pricing strategies in this section.

A. The Inventory Based (IB) Strategy

The first strategy adaptively adjusts the prices for a firm based on the number of goods it has left and the number of goods that it has sold in the previous time interval. We call it the Inventory Based (IB) strategy.

In each time step, the strategy retains the previous price if the rate of items sold in the previous time interval is close to the rate needed to sell all the items by the end of the time horizon (this "closeness" is controlled by the parameters noChangeThreshUp and noChangeThreshDown). It increases the price if too many items have been sold in the previous interval and decreases it if too few have been sold. The maxDecPercent and maxIncPercent parameters, along with the distance of the sales rate from the expected sales rate, control the amount of change in price in each time step. The other parameter used in this strategy is initialPrice, the price the firm uses in the first time interval. The details of this strategy can be seen in Algorithm 1.

Algorithm 1 InventoryBasedStrategy(initialPrice, noChangeThreshUp, noChangeThreshDown, maxIncPercent, maxDecPercent)

1: if time = 0 then
2:     price ← initialPrice
3:     return price
4: if numGoodsLeft = 0 then
5:     return lastPrice
6: if pastCustomers = 0 then
7:     pastCustomers ← 1
8: pastSold ← pastSold × aveCustomers / pastCustomers
9: α ← (pastSold × timeLeft) / numGoodsLeft
10: if α < 1 then
11:     ∆ ← α − 1
12: else
13:     ∆ ← 1 − 1/α
14: if ∆ < 0 then
15:     if |∆| < noChangeThreshDown then
16:         price ← lastPrice
17:     else
18:         price ← lastPrice × (1 + ∆ × maxDecPercent)
19: else
20:     if ∆ < noChangeThreshUp then
21:         price ← lastPrice
22:     else
23:         price ← lastPrice × (1 + ∆ × maxIncPercent)
24: return price

In Algorithm 1, pastSold is the number of items sold in the previous time step, and numGoodsLeft is the number of items left in the inventory. timeLeft is the number of time steps left in the selling horizon, and lastPrice is the price of a unit of the good in the previous time step. Finally, pastCustomers is the total number of customers in the previous time step and aveCustomers is the average number of customers per time step (the same as the arrival intensity a).

In line 8, the number of items sold in the past time interval is normalized by the average number of customers arriving in each time step and the number of goods left, to factor out the stochasticity as much as possible. Note that this is a dynamic indicator updated at the beginning of each time step, so it takes into account the current state of the agent. In line 9, α is defined as an indicator of how fast the inventory would be exhausted if the sales were to continue at the current rate. If α is smaller than one, then the sales rate is too slow, and if it is larger than one, the inventory would be exhausted before the end of the time horizon, so there is an opportunity for increasing the price. The parameter ∆ is then (in the if-then statement starting from line 10) defined as a normalized version of α that is negative if the sales rate is too low and positive if it is too high. The if-then statement starting from line 14 is where the final pricing decision is made. If the absolute value of ∆ is smaller than the respective threshold parameter for the positive or negative case, i.e. if the sales rate is close enough to the desired rate, then the price is not changed; otherwise it is changed in proportion to ∆, subject to the maximum allowable change rate.
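A possible Python rendering of the IB strategy is sketched below. It follows the prose description, reading the test in line 14 of Algorithm 1 as a sign test on ∆ and the test in line 15 as a comparison of |∆| with the threshold. The function signature is hypothetical, and the maxIncPercent and maxDecPercent parameters are assumed to be given as fractions (for example 0.2 for 20%).

```python
def inventory_based_price(time, last_price, past_sold, past_customers,
                          ave_customers, time_left, num_goods_left,
                          initial_price, no_change_thresh_up,
                          no_change_thresh_down, max_inc_percent,
                          max_dec_percent):
    """Price for the coming interval under the Inventory Based strategy."""
    if time == 0:
        return initial_price
    if num_goods_left == 0:
        return last_price  # nothing left to sell, keep the previous price
    past_customers = max(past_customers, 1)  # guard against division by zero
    # Sales normalized to an interval with the average number of customers.
    normalized_sold = past_sold * ave_customers / past_customers
    # alpha > 1: stock would run out early; alpha < 1: sales are too slow.
    alpha = normalized_sold * time_left / num_goods_left
    delta = alpha - 1.0 if alpha < 1.0 else 1.0 - 1.0 / alpha
    if delta < 0:
        if abs(delta) < no_change_thresh_down:
            return last_price
        return last_price * (1.0 + delta * max_dec_percent)
    if delta < no_change_thresh_up:
        return last_price
    return last_price * (1.0 + delta * max_inc_percent)
```

With 10 goods and 10 intervals left, selling 2 items to an average-sized crowd gives α = 2 and ∆ = 0.5, so a price of 100 with maxIncPercent = 0.2 is raised by 10%.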

B. The Revenue Based (RB) Strategy

The Revenue Based (RB) strategy uses an estimation of a desirable price to set the price in each time step. It uses the revenue per customer (RPC) criterion (which considers all customers, even the ones that did not buy from the firm) to assess the revenue obtained when using a particular price and compares that to the RPC needed to finish the inventory by the end of the selling horizon. The algorithm for this strategy can be seen in Algorithm 2. The additional parameters used in this strategy are expPrice, the expected price used in the estimation of the expected RPC, and maxDelta, the maximum amount the price can change per time step. These variables are also used: RPC, the revenue per customer in the previous time step, and expRPC, the expected RPC, used as a control parameter.

Algorithm 2 RevenueBasedStrategy(initialPrice, expPrice, maxDelta)

1: if time = 0 then
2:     price ← initialPrice
3:     return price
4: expRPC ← (numGoodsLeft × expPrice) / (timeLeft × aveCustomers)
5: if numGoodsLeft = 0 then
6:     return lastPrice
7: if pastCustomers = 0 then
8:     pastCustomers ← 1
9: RPC ← (pastSold × lastPrice) / pastCustomers
10: α ← RPC / expRPC
11: if α ≤ 1 then
12:     ∆ ← α − 1
13: else
14:     ∆ ← 1 − 1/α
15: price ← lastPrice + ∆ × maxDelta
16: return price

In this algorithm, the expRPC variable (defined in line 4) is the expected revenue per customer for the rest of the selling horizon, provided that the firm sells at the expected price, expPrice, which is one of the input parameters. The RPC variable is the revenue obtained per customer in the previous time step (regardless of whether they made a purchase or not). In this algorithm, α is defined as the ratio of the RPC from the previous time step to the expected RPC. Here, too, ∆ is a normalization of α, positive if the obtained revenue is higher than expected and negative if it is lower (line 11), and the price changes in proportion to ∆. Note that here the lower than expected revenue is attributed to a price that is too high, thus preventing many customers from making a purchase. This is not always the case, but it can be a safe assumption when the initial price and expected price are chosen properly, as we can see from the experimental results.
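Algorithm 2 can be rendered in Python along the following lines. The signature is hypothetical, the empty-inventory check is moved ahead of the expRPC computation (which does not change the result), and the sketch assumes that timeLeft, aveCustomers, and expPrice are all positive so that no division by zero occurs.

```python
def revenue_based_price(time, last_price, past_sold, past_customers,
                        ave_customers, time_left, num_goods_left,
                        initial_price, exp_price, max_delta):
    """Price for the coming interval under the Revenue Based strategy."""
    if time == 0:
        return initial_price
    if num_goods_left == 0:
        return last_price  # nothing left to sell, keep the previous price
    # Revenue per customer needed to clear the stock at the expected price.
    exp_rpc = num_goods_left * exp_price / (time_left * ave_customers)
    past_customers = max(past_customers, 1)  # guard against division by zero
    # Revenue per customer actually obtained in the previous interval.
    rpc = past_sold * last_price / past_customers
    alpha = rpc / exp_rpc
    # delta lies in [-1, 1): negative when revenue fell short of the target.
    delta = alpha - 1.0 if alpha <= 1.0 else 1.0 - 1.0 / alpha
    return last_price + delta * max_delta
```

Unlike the IB strategy, the adjustment here is additive: the price moves by at most maxDelta currency units per interval rather than by a percentage of the previous price.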

Note that both strategies depend only on information from the sales in the one previous time step. Also, the parameters are designed so as to sustain a certain amount of stability in the price. This is both a means of controlling the effects of stochastic noise causing sudden jumps in the price (especially since the changes depend on one previous time step only) and a response to the fact that too much price fluctuation is not desirable from the customers' perspective. Also, in all strategies reported in this work, the previous price is kept if there are no more goods to be sold. This has no effect on the simulation because customers will not be given the option to buy from firms that have no goods left.

C. Computing the Parameters

We want to find settings for the parameters of the strategies that we have defined such that the strategies perform well. The associated numerical optimization task is generally not easy because it is the outcome of a non-trivial simulation that we want to optimize. The problem at hand can thus be seen as a black-box optimization problem of unknown difficulty. We therefore need black-box optimization algorithms that are capable of tackling a large class of problems effectively. The algorithm of our choice is called AMaLGaM. AMaLGaM is essentially an Evolutionary Algorithm (EA) in which a normal distribution is estimated from the better, selected solutions and subsequently adapted to be aligned favorably with the local structure of the search space. New solutions are then constructed by sampling the normal distribution. A parameter-free version of AMaLGaM exists that can easily be applied to solve any optimization problem. This version was recently found to be among the most competent black-box optimization algorithms [3], [12].
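The general idea of estimating a normal distribution from selected solutions and sampling new solutions from it can be illustrated with the following Python sketch. It is explicitly not AMaLGaM: it is a plain Gaussian estimation-of-distribution loop with independent dimensions and none of AMaLGaM's adaptive mechanisms, and the population sizes and the toy fitness function in the usage note are arbitrary assumptions.

```python
import random
from statistics import mean, pstdev


def gaussian_eda(fitness, dim, rng, pop_size=40, keep=10, generations=30):
    """Maximize a black-box fitness with a very simple Gaussian EDA.

    Simplified stand-in for illustration only, not AMaLGaM. The fitness
    function is assumed to be deterministic.
    """
    mu, sigma = [0.0] * dim, [1.0] * dim
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Sample a population from the current per-dimension normal model.
        population = [[rng.gauss(mu[i], sigma[i]) for i in range(dim)]
                      for _ in range(pop_size)]
        population.sort(key=fitness, reverse=True)
        elite = population[:keep]
        top_fit = fitness(elite[0])
        if top_fit > best_fit:
            best, best_fit = elite[0], top_fit
        # Re-estimate the model from the selected (better) solutions.
        for i in range(dim):
            column = [x[i] for x in elite]
            mu[i] = mean(column)
            sigma[i] = max(pstdev(column), 1e-6)
    return best, best_fit
```

In the setting of this paper, the solution vector would hold the strategy parameters and the fitness would be the simulated revenue; a toy call such as `gaussian_eda(lambda x: -((x[0] - 1) ** 2 + (x[1] + 2) ** 2), 2, random.Random(0))` shows the interface.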

In order to tune the experiments for the EA, which is not designed to handle stochasticity, on the one hand, and not to over-fit a single instance of the problem on the other, we use the following method. The fitness used in the EA is the average revenue obtained from a batch run of 100
