Implications of Decentralized Q-learning Resource
Allocation in Wireless Networks

Francesc Wilhelmi, Boris Bellalta
Wireless Networking (WN-UPF)
Universitat Pompeu Fabra
Barcelona, Spain

Cristina Cano
WINE Group
Universitat Oberta de Catalunya
Castelldefels, Spain

Anders Jonsson
Art. Int. and Mach. Learn. (AIML-UPF)
Universitat Pompeu Fabra
Barcelona, Spain
Abstract—Reinforcement Learning is gaining attention from the wireless networking community due to its potential to learn good-performing configurations only from the observed results. In this work we propose a stateless variation of Q-learning, which we apply to exploit spatial reuse in a wireless network. In particular, we allow networks to modify both their transmission power and the channel used solely based on the experienced throughput. We concentrate on a completely decentralized scenario in which no information about neighbouring nodes is available to the learners. Our results show that although the algorithm is able to find the best-performing actions to enhance aggregate throughput, there is high variability in the throughput experienced by the individual networks. We identify the cause of this variability as the adversarial setting of our setup, in which the most played actions provide intermittent good/poor performance depending on the neighbouring decisions. We also evaluate the effect of the intrinsic learning parameters of the algorithm on this variability.
I. INTRODUCTION
Reinforcement Learning (RL) has recently seen widespread use in the wireless communications field to solve many kinds of problems such as Access Point (AP) association [1], channel selection [2] or transmit power adjustment [3], as it allows learning good-performing configurations only from the observed results. Among these, Q-learning has been applied to dynamic channel assignment in mobile networks in [4] and to automatic channel selection in femtocell networks in [5]. However, to the best of our knowledge, the case of a fully decentralized scenario, where nodes do not have knowledge of each other, has not yet been considered.
In this work we propose a stateless variation of Q-learning in which nodes select the transmission power and channel to use solely based on their resulting throughput. We concentrate on a fully decentralized scenario where no information about the actions and resulting performance of the other nodes is available to the learners. Note that inferring the throughput of neighbouring nodes allocated to different channels is costly, as periodic sensing in the other channels would then be needed.
We aim to characterize the performance of Q-learning in such scenarios, obtaining insight on the most played actions (i.e., channel and transmit power selected) and the resulting performance. We observe that when no information about the neighbours is available to the learners, these will tend to apply selfish strategies that result in alternating good/poor performance depending on the actions of the others. In such scenarios, we show that the use of Q-learning allows each network to find the best-performing actions, though without reaching a steady solution. Note that achieving a steady solution in a decentralized environment relies on finding a Nash Equilibrium, a concept used in Game Theory to define a set of individual strategies that maximize the profits of each player in a non-cooperative game, regardless of the others' strategy. Formally, a set of best player actions a* = (a*_1, ..., a*_n) ∈ A leads to a Nash Equilibrium if a*_i ∈ B_i(a*_{-i}), ∀i ∈ N, where B_i(a_{-i}) is the best response to the others' actions (a_{-i}). Thus, the consequences of not reaching a Nash Equilibrium can have an impact on performance variability.
In addition, we look at the resulting performance in terms of throughput when varying several parameters intrinsic to the learning algorithm, which helps in understanding the interactions between the degree of exploration, the learning rate, and the variability of the resulting performance.
The remainder of this document is structured as follows: Section II introduces the simulation scenario and considerations. Then, Section III presents our stateless variation of Q-learning and its practical implementation for the resource allocation problem in Wireless Networks (WNs). Simulation results are discussed in Section IV. Finally, some concluding remarks are provided in Section V.
II. SYSTEM MODEL
For the remainder of this work, we consider a scenario in which several WNs are placed in a 3D map (with parameters described later in Section IV-A), each one formed by an Access Point (AP) transmitting to a single Station (STA) in the downlink.
A. Channel modelling
Path-loss and shadowing effects are modelled using the log-distance model for indoor communications. The path-loss between WN i and j is given by

PL_{i,j} = P_{tx,i} − P_{rx,j} = PL_0 + 10·α_PL·log10(d_{i,j}) + G_s + (d_{i,j}/d_obs)·G_o,

where P_{tx,i} is the transmitted power in dBm by WN i, P_{rx,j} is the power in dBm received in WN j, PL_0 is the path-loss at one meter in dB, α_PL is the path-loss exponent, d_{i,j} is the distance between the transmitter and the receiver in meters, G_s is the shadowing loss in dB, and G_o is the obstacles loss in dB. Note that we include the factor d_obs, which is the distance between two obstacles in meters.
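To make the channel model concrete, the following Python sketch evaluates the path-loss expression above. It is an illustrative rendering only, not the authors' simulation code: the function names are ours, the default values are taken from Table I later in this paper, and the random shadowing and obstacle terms are replaced by their mean values.

    import math

    def path_loss_db(d_m, pl0_db=5.0, alpha_pl=4.4, g_s_db=9.5,
                     g_o_db=30.0, d_obs_m=5.0):
        """Log-distance indoor path loss (dB):
        PL_0 + 10*alpha_PL*log10(d) + G_s + (d / d_obs) * G_o.
        G_s and G_o are random in the model; their means are used here."""
        return (pl0_db
                + 10.0 * alpha_pl * math.log10(d_m)
                + g_s_db
                + (d_m / d_obs_m) * g_o_db)

    def received_power_dbm(p_tx_dbm, d_m):
        """P_rx,j = P_tx,i - PL_{i,j}, with powers in dBm."""
        return p_tx_dbm - path_loss_db(d_m)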
B. Throughput calculation
By using the power received and the interference, we calculate the maximum theoretical throughput of each WN i at time t ∈ {1, 2, ...} using the Shannon Capacity:

Γ_{i,t} = B·log2(1 + SINR_{i,t}),

where B is the channel bandwidth and the experienced Signal to Interference plus Noise Ratio (SINR) is given by:

SINR_{i,t} = P_{i,t} / (I_{i,t} + N),

where P_{i,t} and I_{i,t} are the received power and the sum of the interference at WN i at time t, respectively, and N is the noise floor power. For each STA in a WN, the interference is considered to be the total power received from all the APs of the other coexisting WNs, as if they were continuously transmitting. Adjacent channel interference is also considered in I_{i,t}, ∀i ∈ {1, .., W}, where W is the number of neighbouring WNs. We consider that the transmitted power leaked to adjacent channels is 20 dB lower for each channel of separation.
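Under the same caveats (hypothetical helper names, illustrative only), the throughput computation could be sketched as follows; the 20 dB per-channel attenuation of leaked power is applied to each interfering AP before summing interference in linear units.

    import math

    def dbm_to_mw(p_dbm):
        """Convert a power in dBm to milliwatts."""
        return 10.0 ** (p_dbm / 10.0)

    def shannon_throughput_mbps(p_rx_dbm, interferers, own_channel,
                                bandwidth_mhz=20.0, noise_dbm=-100.0):
        """Shannon-capacity throughput (Mbps) of one WN.

        `interferers` is a list of (received_power_dbm, channel) tuples, one
        per neighbouring AP; leaked power is assumed to drop by 20 dB per
        channel of separation, and all interferers are treated as always
        transmitting, as described above.
        """
        interference_mw = sum(
            dbm_to_mw(p_dbm - 20.0 * abs(ch - own_channel))
            for p_dbm, ch in interferers)
        sinr = dbm_to_mw(p_rx_dbm) / (interference_mw + dbm_to_mw(noise_dbm))
        return bandwidth_mhz * math.log2(1.0 + sinr)  # B in MHz gives Mbps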
III. DECENTRALIZED STATELESS Q-LEARNING FOR
ENHANCING SPATIAL REUSE IN WNS
Q-learning [6, 7] is an RL technique that enables an agent to learn the optimal policy to follow in a given environment. A set of possible states describing the environment and actions are defined in this model. In particular, an agent maintains an estimate of the expected long-term discounted reward for each state-action pair, and selects actions with the aim of maximizing it. The expected cumulative reward V^π(s) is given by:

V^π(s) = lim_{N→∞} E[ Σ_{t=1}^{N} r^π_t(s) ],
where r^π_t(s) is the reward obtained at iteration t after starting from state s and by following policy π. Since the reward may easily get unbounded, a discount factor parameter (γ < 1) is used. The optimal policy π* that maximizes the total expected reward is given by the Bellman Optimality Equation [6]:

Q*(s, a) = E{ r_{t+1} + γ·max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }.
Hence, Q-learning receives information about the current state-action tuple (s_t, a_t), the generated reward r_t and the next state s_{t+1}, in order to update the Q-table:

Q̂(s_t, a_t) ← (1 − α_t)·Q̂(s_t, a_t) + α_t·[ r_t + γ·max_a Q̂(s_{t+1}, a) ],
where α_t is the learning rate at time t, and max_a Q̂(s_{t+1}, a) is the best estimated value for the next state s_{t+1}. The optimal solution is theoretically achieved with probability 1 if Σ_{t=0}^{∞} α_t = ∞ and Σ_{t=0}^{∞} α_t^2 < ∞, which ensures that lim_{t→∞} Q̂(s, a) = Q*(s, a). Since we focus on a completely decentralized scenario where no information about the other nodes is available, the system can then be fully described by the set of actions and rewards.¹ Thus, we propose using a stateless variation of the original Q-learning algorithm. To apply decentralized learning to the resource allocation problem, we consider each WN to be an agent running Stateless Q-learning through an ε-greedy action-selection strategy, so that actions a ∈ A correspond to all the possible configurations that can be chosen with respect to the channel and transmit power. During the learning process we assume that WNs select actions sequentially, so that at each learning iteration every agent takes an action in an ordered way. The order in which WNs choose an action at each iteration is randomly drawn at the beginning of that iteration. The reward after choosing an action is set as:
r_{i,t} = Γ_{i,t} / Γ*_i,

where Γ_{i,t} is the throughput experienced at time t by WN i ∈ {1, ..., n}, n being the number of WNs in the scenario, and Γ*_i = B·log2(1 + SNR_i) is WN i's maximum achievable throughput (i.e., when it uses the maximum transmission power and there is no interference). Each WN applies Stateless Q-learning as follows (see the sketch after this list):
• Initially, it sets the estimates of its actions k ∈ {1, ..., K} to 0: Q̂(a_k) = 0.
• At each iteration, it applies an action by following the ε-greedy strategy, i.e., it selects the best-rewarding action with probability 1 − ε_t, and a random one (uniformly distributed) the rest of the times.
• After choosing action a_k, it observes the generated reward (the relative experienced throughput) and updates the estimated value Q̂(a_k).
• Finally, ε_t is updated to follow a decreasing sequence: ε_t = ε_0 / √t.
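The sketch below is a minimal Python rendering of such a Stateless Q-learning agent. It is our own illustration, not the authors' code; the class and method names, as well as the random tie-breaking among equally valued actions, are assumptions.

    import math
    import random

    class StatelessQAgent:
        """Stateless (single-state) Q-learning with epsilon-greedy selection.

        Actions index the {channel, transmit power} configurations of a WN.
        """

        def __init__(self, n_actions, alpha=1.0, gamma=0.95, eps0=1.0):
            self.q = [0.0] * n_actions           # Q-hat(a_k), initialised to 0
            self.alpha, self.gamma, self.eps0 = alpha, gamma, eps0
            self.t = 1

        def select_action(self):
            eps = self.eps0 / math.sqrt(self.t)  # decreasing exploration rate
            if random.random() < eps:
                return random.randrange(len(self.q))   # explore uniformly
            best = max(self.q)
            # exploit: break ties between equally valued actions at random
            return random.choice([k for k, v in enumerate(self.q) if v == best])

        def update(self, action, reward):
            # Stateless counterpart of the Q-learning update: the "next state"
            # term collapses to the best current estimate, max_k Q-hat(a_k).
            target = reward + self.gamma * max(self.q)
            self.q[action] += self.alpha * (target - self.q[action])
            self.t += 1

Here the reward passed to update() would be the normalised throughput Γ_{i,t}/Γ*_i observed after the WN applies the chosen {channel, transmit power} configuration.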
Note, as well, that the optimal policy cannot be derived for the presented scenario, but it can be approximated to enhance spatial reuse. This is due to the nature of the presented environment, in which the decisions of each WN affect the performance of the others. The implementation details of Stateless Q-learning are formally described in Algorithm 1. The presented learning approach is intended to operate at the PHY level, allowing the operation of current MAC-layer communication standards (e.g., in IEEE 802.11 WLANs, channel access is governed by the CSMA/CA operation, so that Stateless Q-learning may contribute to improving spatial reuse at the PHY level).
IV. PERFORMANCE EVALUATION
In this section we introduce the simulation parameters and describe the experiments.² Then, we show the main results.

¹ We note that local information such as the observed instantaneous channel quality could be incorporated in the state definition. However, such a description of the system entails increased complexity.
² The code used for simulations can be found at https://github.com/wn-upf/Decentralized_Qlearning_Resource_Allocation_in_WNs.git (commit eb4042a1830c8ea30b7eae3d72a51afe765a8d86).

Algorithm 1: Stateless Q-learning

Function StatelessQLearning(SINR, A)
  Input:  SINR: Signal-to-Interference-plus-Noise Ratio sensed at the STA
          A: set of possible actions in {1, ..., K}
  Output: Γ: mean throughput experienced in the WN
  initialize: t = 0, Q̂(a_k) = 0 for all a_k ∈ A
  while active do
      Select a_k = argmax_{k=1,...,K} Q̂(a_k) with probability 1 − ε,
             or k ~ U(1, K) otherwise
      Observe reward r_{a_k} = Γ_{a_k,t} / Γ*
      Q̂(a_k) ← Q̂(a_k) + α·( r_{a_k} + γ·max Q̂ − Q̂(a_k) )
      ε_t ← ε_0 / √t
      t ← t + 1
  end
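Tying Algorithm 1 to the multi-agent setting of Section III, one learning iteration could look like the following sketch. It reuses the StatelessQAgent class from the earlier snippet; evaluate_throughput is a hypothetical helper returning the per-WN throughput of a joint configuration, and the reward is computed once per round here, which is a simplification of the sequential selection described above.

    import random

    def run_iteration(agents, evaluate_throughput, max_throughputs):
        """One learning iteration: WNs choose actions sequentially in a
        random order, then each observes its normalised throughput as the
        reward and updates its estimates."""
        order = list(range(len(agents)))
        random.shuffle(order)                      # random decision order per iteration
        actions = [None] * len(agents)
        for i in order:
            actions[i] = agents[i].select_action()
        throughputs = evaluate_throughput(actions)   # per-WN throughput (Mbps)
        for i, agent in enumerate(agents):
            agent.update(actions[i], throughputs[i] / max_throughputs[i])
        return actions, throughputs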
A. Simulation Parameters
According to [8], a typical high-density scenario for residential buildings contains 0.0033 APs/m³. We then consider a map scenario with dimensions 10 × 5 × 10 m containing 4 WNs that form a grid topology in which STAs are placed at the maximum possible distance from the other networks. This toy scenario allows us to study the performance of Stateless Q-learning in a controlled environment, which is useful to check the applicability of RL in WNs by only using local information.³ We consider that the number of channels is equal to half the number of coexisting WNs, so that we can study a challenging situation regarding spatial reuse. Table I details the parameters used.
Parameter                                Value
Map size (m)                             10 × 5 × 10
Number of coexistent WNs                 4
APs/STAs per WN                          1 / 1
Distance AP-STA (m)                      √2
Number of Channels                       2
Channel Bandwidth (MHz)                  20
Initial channel selection model          Uniformly distributed
Transmit power values (dBm)              {5, 10, 15, 20}
PL_0 (dB)                                5
α_PL                                     4.4
G_s (dB)                                 Normally distributed with mean 9.5
G_o (dB)                                 Uniformly distributed with mean 30
d_obs (meters between two obstacles)     5
Noise level (dBm)                        -100
Traffic model                            Full buffer (downlink)

TABLE I: Simulation parameters
B. Optimal solution
We first identify the optimal solutions that maximize: i) the aggregate throughput, and ii) the proportional fairness, which is computed as the logarithmic sum of the throughput experienced by each WN, i.e., PF = max_{k∈A} Σ_i log(Γ_{i,k}). The optimal solutions are listed in Table II.

³ The analysis of the presented learning mechanisms in more congested scenarios is left as future work.
WN id   Action that maximizes the Aggregate Throughput   Action that maximizes the Proportional Fairness
1       1 (2)                                             7 (8)
2       1 (2)                                             8 (7)
3       7 (8)                                             7 (8)
4       8 (7)                                             8 (7)

TABLE II: Optimal configurations (action indexes) to achieve the maximum network throughput and proportional fairness, resulting in 1124 Mbps and 891 Mbps, respectively. The analogous (symmetric) solution is shown in parentheses. Action indexes 1 to 8 are mapped to {channel number, transmit power (dBm)}: {1,5}, {2,5}, {1,10}, {2,10}, {1,15}, {2,15}, {1,20} and {2,20}, respectively.
Note that, since the considered scenario is symmetric, there are two equivalent solutions. Note, as well, that in order to maximize the aggregate network throughput, two of the WNs sacrifice themselves by choosing a lower transmit power. This result is therefore not likely to occur in an adversarial selfish setting.
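For reference, the optimal configurations in Table II can be obtained by exhaustive search over the 8^4 joint actions of the four WNs. The sketch below uses the same assumptions as the earlier snippets (a hypothetical evaluate_throughput helper returning per-WN throughput in Mbps, with 1-based action indexes matching Table II).

    import itertools
    import math

    # Action indexes 1-8 map to {channel, transmit power (dBm)} as in Table II.
    ACTIONS = [(1, 5), (2, 5), (1, 10), (2, 10), (1, 15), (2, 15), (1, 20), (2, 20)]

    def optimal_configurations(evaluate_throughput, n_wns=4):
        """Return the joint actions maximizing aggregate throughput and
        proportional fairness (log-sum of per-WN throughputs)."""
        best_agg = best_pf = None
        for joint in itertools.product(range(1, len(ACTIONS) + 1), repeat=n_wns):
            thr = evaluate_throughput(joint)          # per-WN throughput (Mbps)
            agg = sum(thr)
            pf = sum(math.log(x) for x in thr)        # proportional fairness metric
            if best_agg is None or agg > best_agg[0]:
                best_agg = (agg, joint)
            if best_pf is None or pf > best_pf[0]:
                best_pf = (pf, joint)
        return best_agg, best_pf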
C. Input Parameters Analysis
We first analyse the effects of modifying α (the learning rate), γ (the discount factor) and ε_0 (the initial exploration coefficient of the ε-greedy update rule) with respect to the achieved network throughput. We run simulations of 10000 iterations and capture the results of the last 5000 iterations to ensure that the initial transitory phase has ended. Each simulation is repeated 100 times for averaging purposes.
Figure 1 shows the average aggregate throughput achieved for each of the proposed combinations. It can be observed that the best results with respect to the aggregate throughput, regarding both average and variance, are achieved when α = 1, γ = 0.95 and ε_0 = 1. This means that for achieving the best results (i.e., high average aggregate throughput and low variance), the immediate reward of a given action must be considered rather than any previous information (α = 1). We see that the difference between the pay-off offered by the best action and the current one must also be high (γ = 0.95). In addition, exploration must be highly boosted at the beginning (ε_0 = 1). For this setting, the resulting throughput (902.739 Mbps) represents 80.29% of the one provided by the optimal configuration that maximizes the aggregate throughput (shown in Table II). Regarding proportional fairness, the algorithm's resulting throughput is only 1.32% higher than the optimal.
We also evaluate the relationship between different values of α and γ and the average aggregate throughput and standard deviation (shown in Figure 2). We observe a remarkably higher aggregate throughput when α > γ. We also see that the variability between different simulation runs is much lower when the average throughput is higher. Additionally, we note a peak in the standard deviation when γ ≈ α and γ > α.

Fig. 1: Effect of α, γ and ε_0 on the average aggregate throughput (100 simulation runs per sample). [Figure omitted: aggregate throughput (Mbps) vs. α for the combinations of γ ∈ {0.95, 0.5, 0.05} and ε_0 ∈ {1, 0.5}, together with the optimal maximum aggregate throughput and maximum proportional fairness references.]
Fig. 2: Evaluation of α and γ: (a) average aggregate throughput (Mbps); (b) standard deviation (Mbps). [Surface plots over α and γ omitted.]
To further understand the effects of modifying each of the aforementioned parameters, we show, for different ε_0, α and γ: i) the individual throughput experienced by each WN during the total 10000 iterations of a single simulation run (Figure 3), ii) the average throughput experienced by each WN for the last 5000 iterations, also for a single simulation run (Figure 4), and iii) the probability of choosing each action at each WN (Figure 5). We observe the following aspects:
• In Figure 3, a high variability of the throughput experienced by each WN can be observed, especially if ε_0 is high (as in Figures 3(a) and 3(c)). A high degree of exploration allows WNs to discover changes in the resulting performance of their actions due to the activity of the other nodes, which at the same time generates more variability (WNs adapt to changes in the environment).
• Despite the variability generated, we obtain fairer results for high ε_0 (Figure 4). Hence, there is a relationship between the variability generated and the average throughput fairness.
Fig. 3: Individual throughput experienced by each WN during a single simulation run for different ε_0, α and γ. [Panels omitted: (a) ε_0 = 1, α = 1, γ = 0.95; (b) ε_0 = 0.1, α = 1, γ = 0.95; (c) ε_0 = 1, α = 0.1, γ = 0.05; (d) ε_0 = 0.1, α = 0.1, γ = 0.05; each showing throughput (Mbps) vs. iteration for WN1–WN4.]
Fig. 4: Average throughput experienced by each WN during the last 5000 iterations of a total of 10000 (single simulation run), for different ε_0, α and γ. [Panels omitted: (a) ε_0 = 1, α = 1, γ = 0.95; (b) ε_0 = 0.1, α = 1, γ = 0.95; (c) ε_0 = 1, α = 0.1, γ = 0.05; (d) ε_0 = 0.1, α = 0.1, γ = 0.05; each showing mean throughput (Mbps) per WN id.]
• Finally, in Figures 5(a) and 5(c) we observe that, for the former, there are two favourite actions that are played the most, while for the latter there is only one preferred action. The lower the learning rate (α), and consequently the discount factor (γ), the higher the probability of choosing a unique action, which turns out to be the one that provided the best performance in the past. The opposite occurs for higher α and γ values, since giving more importance to the immediate reward allows reacting only to the recently-played actions of the neighbouring nodes: the algorithm is short-sighted.

Fig. 5: Probability of choosing each action at each WN for a single simulation run (10000 iterations) and different ε_0, α and γ values. [Panels omitted: (a) ε_0 = 1, α = 1, γ = 0.95; (b) ε_0 = 0.1, α = 1, γ = 0.95; (c) ε_0 = 1, α = 0.1, γ = 0.05; (d) ε_0 = 0.1, α = 0.1, γ = 0.05; each showing the probability of action indexes 1–8 for WN1–WN4.]
V. CONCLUSIONS
Decentralized Q-learning can be used to improve spatial reuse in dense wireless networks, enhancing performance as a result of exploiting the most rewarding actions. We have shown in this article, by means of a toy scenario, that Stateless Q-learning in particular allows finding good-performing configurations that achieve close-to-optimal (in terms of throughput maximization and proportional fairness) solutions.
However, the competitiveness of the presented fully-decentralized environment entails the non-existence of a Nash Equilibrium. Thus, we have also identified high variability in the experienced individual throughput due to the constant changes of the played actions, motivated by the fact that the reward generated by each action changes according to the opponents' ones. We have evaluated the impact of the parameters intrinsic to the learning algorithm on this variability, showing that it can be reduced by decreasing the exploration degree and the learning rate. This individual reduction in throughput variability, however, occurs at the expense of losing aggregate performance.
This variability can potentially result in negative effects on the overall performance of the WNs. The effects of such fluctuations on higher layers of the protocol stack can have severe consequences depending on the time scale at which they occur. For example, noticing high throughput fluctuations may trigger congestion recovery procedures in TCP (Transmission Control Protocol), which would harm the experienced performance.
We leave for future work further extending the decentralized approach in order to find collaborative algorithms that allow the neighbouring WNs to reach an equilibrium that grants acceptable individual performance. Acquiring some kind of knowledge about the neighbouring WNs is expected to help solve the variability issues arising from decentralization. This information may be directly exchanged or inferred from observations. Furthermore, we intend to analyse other learning approaches in the future for performance comparison in the resource allocation problem.
ACKNOWLEDGMENT
This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502), and by the European Regional Development Fund under grant TEC2015-71303-R (MINECO/FEDER).
REFERENCES
[1] Chen, L. (2010, May). A distributed access point selection algorithm based on no-regret learning for wireless access networks. In Vehicular Technology Conference (VTC 2010-Spring), 2010 IEEE 71st (pp. 1-5). IEEE.
[2] Maghsudi, S., & Stańczak, S. (2015). Channel selection for network-assisted D2D communication via no-regret bandit learning with calibrated forecasting. IEEE Transactions on Wireless Communications, 14(3), 1309-1322.
[3] Maghsudi, S., & Stańczak, S. (2015). Joint channel selection and power control in infrastructureless wireless networks: A multiplayer multi-armed bandit framework. IEEE Transactions on Vehicular Technology, 64(10), 4565-4578.
[4] Nie, J., & Haykin, S. (1999). A Q-learning-based dynamic channel assignment technique for mobile communication systems. IEEE Transactions on Vehicular Technology, 48(5), 1676-1687.
[5] Bennis, M., & Niyato, D. (2010, December). A Q-learning based approach to interference avoidance in self-organized femtocell networks. In GLOBECOM Workshops (GC Wkshps), 2010 IEEE (pp. 706-710). IEEE.
[6] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT Press.
[7] Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
[8] Bellalta, B. (2016). IEEE 802.11ax: High-efficiency WLANs. IEEE Wireless Communications, 23(1), 38-46.