Implications of Decentralized Q-learning Resource
Allocation in Wireless Networks

Francesc Wilhelmi, Boris Bellalta
Wireless Networking (WN-UPF)
Universitat Pompeu Fabra
Barcelona, Spain

Cristina Cano
WINE Group
Universitat Oberta de Catalunya
Castelldefels, Spain

Anders Jonsson
Art. Int. and Mach. Learn. (AIML-UPF)
Universitat Pompeu Fabra
Barcelona, Spain
Abstract—Reinforcement Learning is gaining attention from the wireless networking community due to its potential to learn good-performing configurations only from the observed results. In this work we propose a stateless variation of Q-learning, which we apply to exploit spatial reuse in a wireless network. In particular, we allow networks to modify both their transmission power and the channel used solely based on the experienced throughput. We concentrate on a completely decentralized scenario in which no information about neighbouring nodes is available to the learners. Our results show that although the algorithm is able to find the best-performing actions to enhance aggregate throughput, there is high variability in the throughput experienced by the individual networks. We identify the cause of this variability as the adversarial setting of our setup, in which the most played actions provide intermittent good/poor performance depending on the neighbouring decisions. We also evaluate the effect of the intrinsic learning parameters of the algorithm on this variability.
I. INTRODUCTION
Reinforcement Learning (RL) has recently seen widespread use in the wireless communications field to solve many kinds of problems such as Access Point (AP) association [1], channel selection [2] or transmit power adjustment [3], as it allows learning good-performing configurations only from the observed results. Among these, Q-learning has been applied to dynamic channel assignment in mobile networks in [4] and to automatic channel selection in femtocell networks in [5]. However, to the best of our knowledge, the case of a fully decentralized scenario, where nodes do not have knowledge of each other, has not yet been considered.
In this work we propose a stateless variation of Q-learning in which nodes select the transmission power and channel to use solely based on their resulting throughput. We concentrate on a fully decentralized scenario where no information about the actions and resulting performance of the other nodes is available to the learners. Note that inferring the throughput of neighbouring nodes allocated to different channels is costly, as periodic sensing in the other channels would then be needed.
We aim to characterize the performance of Q-learning in such scenarios, obtaining insight on the most played actions (i.e., channel and transmit power selected) and the resulting performance. We observe that when no information about the neighbours is available to the learners, these will tend to apply selfish strategies that result in alternating good/poor performance depending on the actions of the others. In such scenarios, we show that the use of Q-learning allows each network to find the best-performing actions, though without reaching a steady solution. Note that achieving a steady solution in a decentralized environment relies on finding a Nash Equilibrium, a concept used in Game Theory to define a set of individual strategies that maximize the profits of each player in a non-cooperative game, regardless of the others' strategy. Formally, a set of best player actions a* = (a*_1, ..., a*_n) ∈ A leads to a Nash Equilibrium if a*_i ∈ B_i(a*_{-i}), ∀i ∈ N, where B_i(a_{-i}) is the best response to the others' actions (a_{-i}). Thus, the consequences of not reaching a Nash Equilibrium can have an impact on performance variability.
In addition, we look at the resulting performance in terms of throughput when varying several parameters intrinsic to the learning algorithm, which helps in understanding the interactions between the degree of exploration, the learning rate, and the variability of the resulting performance.
The remainder of this document is structured as follows: Section II introduces the simulation scenario and considerations. Then, Section III presents our stateless variation of Q-learning and its practical implementation for the resource allocation problem in Wireless Networks (WNs). Simulation results are discussed in Section IV. Finally, some concluding remarks are provided in Section V.
II. SYSTEM MODEL
For the remainder of this work, we consider a scenario in which several WNs are placed in a 3D map (with parameters described later in Section IV-A), each one formed by an Access Point (AP) transmitting to a single Station (STA) in the downlink.
A. Channel modelling
Path-loss and shadowing effects are modelled using the log-distance model for indoor communications. The path-loss between WN i and j is given by

PL_{i,j} = P_{tx,i} − P_{rx,j} = PL_0 + 10·α_PL·log10(d_{i,j}) + G_s + (d_{i,j}/d_obs)·G_o,

where P_{tx,i} is the transmitted power in dBm by WN i, P_{rx,j} is the power in dBm received in WN j, PL_0 is the path-loss at one meter in dB, α_PL is the path-loss exponent, d_{i,j} is the distance between the transmitter and the receiver in meters, G_s is the shadowing loss in dB, and G_o is the obstacles loss in dB. Note that we include the factor d_obs, which is the distance between two obstacles in meters.
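To make the channel model concrete, the following Python sketch evaluates the path-loss expression above. It is an illustrative rendering only, not the authors' simulation code: the function names are ours, the default values are taken from Table I later in this paper, and the random shadowing and obstacle terms are replaced by their mean values.

    import math

    def path_loss_db(d_m, pl0_db=5.0, alpha_pl=4.4, g_s_db=9.5,
                     g_o_db=30.0, d_obs_m=5.0):
        """Log-distance indoor path loss (dB):
        PL_0 + 10*alpha_PL*log10(d) + G_s + (d / d_obs) * G_o.
        G_s and G_o are random in the model; their means are used here."""
        return (pl0_db
                + 10.0 * alpha_pl * math.log10(d_m)
                + g_s_db
                + (d_m / d_obs_m) * g_o_db)

    def received_power_dbm(p_tx_dbm, d_m):
        """P_rx,j = P_tx,i - PL_{i,j}, with powers in dBm."""
        return p_tx_dbm - path_loss_db(d_m)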
B. Throughput calculation
By using the power received and the interference, we calculate the maximum theoretical throughput of each WN i at time t ∈ {1, 2, ...} using the Shannon Capacity:

Γ_{i,t} = B·log2(1 + SINR_{i,t}),

where B is the channel bandwidth and the experienced Signal to Interference plus Noise Ratio (SINR) is given by:

SINR_{i,t} = P_{i,t} / (I_{i,t} + N),

where P_{i,t} and I_{i,t} are the received power and the sum of the interference at WN i at time t, respectively, and N is the noise floor power. For each STA in a WN, the interference is considered to be the total power received from all the APs of the other coexisting WNs, as if they were continuously transmitting. Adjacent channel interference is also considered in I_{i,t}, ∀i ∈ {1, .., W}, where W is the number of neighbouring WNs. We consider that the transmitted power leaked to adjacent channels is 20 dB lower for each channel of separation.
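Under the same caveats (hypothetical helper names, illustrative only), the throughput computation could be sketched as follows; the 20 dB per-channel attenuation of leaked power is applied to each interfering AP before summing interference in linear units.

    import math

    def dbm_to_mw(p_dbm):
        """Convert a power in dBm to milliwatts."""
        return 10.0 ** (p_dbm / 10.0)

    def shannon_throughput_mbps(p_rx_dbm, interferers, own_channel,
                                bandwidth_mhz=20.0, noise_dbm=-100.0):
        """Shannon-capacity throughput (Mbps) of one WN.

        `interferers` is a list of (received_power_dbm, channel) tuples, one
        per neighbouring AP; leaked power is assumed to drop by 20 dB per
        channel of separation, and all interferers are treated as always
        transmitting, as described above.
        """
        interference_mw = sum(
            dbm_to_mw(p_dbm - 20.0 * abs(ch - own_channel))
            for p_dbm, ch in interferers)
        sinr = dbm_to_mw(p_rx_dbm) / (interference_mw + dbm_to_mw(noise_dbm))
        return bandwidth_mhz * math.log2(1.0 + sinr)  # B in MHz gives Mbps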
III. DECENTRALIZED STATELESS Q-LEARNING FOR
ENHANCING SPATIAL REUSE IN WNS
Q-learning [6, 7] is an RL technique that enables an agent to learn the optimal policy to follow in a given environment. A set of possible states describing the environment and actions are defined in this model. In particular, an agent maintains an estimate of the expected long-term discounted reward for each state-action pair, and selects actions with the aim of maximizing it. The expected cumulative reward V^π(s) is given by:

V^π(s) = lim_{N→∞} E[ Σ_{t=1}^{N} r^π_t(s) ],
where r^π_t(s) is the reward obtained at iteration t after starting from state s and by following policy π. Since the reward may easily get unbounded, a discount factor parameter (γ < 1) is used. The optimal policy π* that maximizes the total expected reward is given by the Bellman Optimality Equation [6]:

Q*(s, a) = E{ r_{t+1} + γ·max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }.
Hence, Q-learning receives information about the current state-action tuple (s_t, a_t), the generated reward r_t and the next state s_{t+1}, in order to update the Q-table:

Q̂(s_t, a_t) ← (1 − α_t)·Q̂(s_t, a_t) + α_t·[ r_t + γ·max_a Q̂(s_{t+1}, a) ],
where α_t is the learning rate at time t, and max_a Q̂(s_{t+1}, a) is the best estimated value for the next state s_{t+1}. The optimal solution is theoretically achieved with probability 1 if Σ_{t=0}^{∞} α_t = ∞ and Σ_{t=0}^{∞} α_t^2 < ∞, which ensures that lim_{t→∞} Q̂(s, a) = Q*(s, a). Since we focus on a completely decentralized scenario where no information about the other nodes is available, the system can then be fully described by the set of actions and rewards.¹ Thus, we propose using a stateless variation of the original Q-learning algorithm. To apply decentralized learning to the resource allocation problem, we consider each WN to be an agent running Stateless Q-learning through an ε-greedy action-selection strategy, so that actions a ∈ A correspond to all the possible configurations that can be chosen with respect to the channel and transmit power. During the learning process we assume that WNs select actions sequentially, so that at each learning iteration every agent takes an action in an ordered way. The order in which WNs choose an action at each iteration is randomly drawn at the beginning of that iteration. The reward after choosing an action is set as:
r_{i,t} = Γ_{i,t} / Γ*_i,

where Γ_{i,t} is the throughput experienced at time t by WN i ∈ {1, ..., n}, n being the number of WNs in the scenario, and Γ*_i = B·log2(1 + SNR_i) is WN i's maximum achievable throughput (i.e., when it uses the maximum transmission power and there is no interference). Each WN applies Stateless Q-learning as follows (see the sketch after this list):
• Initially, it sets the estimates of its actions k ∈ {1, ..., K} to 0: Q̂(a_k) = 0.
• At each iteration, it applies an action by following the ε-greedy strategy, i.e., it selects the best-rewarding action with probability 1 − ε_t, and a random one (uniformly distributed) the rest of the times.
• After choosing action a_k, it observes the generated reward (the relative experienced throughput) and updates the estimated value Q̂(a_k).
• Finally, ε_t is updated to follow a decreasing sequence: ε_t = ε_0 / √t.
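The sketch below is a minimal Python rendering of such a Stateless Q-learning agent. It is our own illustration, not the authors' code; the class and method names, as well as the random tie-breaking among equally valued actions, are assumptions.

    import math
    import random

    class StatelessQAgent:
        """Stateless (single-state) Q-learning with epsilon-greedy selection.

        Actions index the {channel, transmit power} configurations of a WN.
        """

        def __init__(self, n_actions, alpha=1.0, gamma=0.95, eps0=1.0):
            self.q = [0.0] * n_actions           # Q-hat(a_k), initialised to 0
            self.alpha, self.gamma, self.eps0 = alpha, gamma, eps0
            self.t = 1

        def select_action(self):
            eps = self.eps0 / math.sqrt(self.t)  # decreasing exploration rate
            if random.random() < eps:
                return random.randrange(len(self.q))   # explore uniformly
            best = max(self.q)
            # exploit: break ties between equally valued actions at random
            return random.choice([k for k, v in enumerate(self.q) if v == best])

        def update(self, action, reward):
            # Stateless counterpart of the Q-learning update: the "next state"
            # term collapses to the best current estimate, max_k Q-hat(a_k).
            target = reward + self.gamma * max(self.q)
            self.q[action] += self.alpha * (target - self.q[action])
            self.t += 1

Here the reward passed to update() would be the normalised throughput Γ_{i,t}/Γ*_i observed after the WN applies the chosen {channel, transmit power} configuration.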
Note, as well, that the optimal policy cannot be derived for the presented scenario, but it can be approximated to enhance spatial reuse. This is due to the nature of the presented environment, in which the decisions of each WN affect the performance of the others. The implementation details of Stateless Q-learning are formally described in Algorithm 1. The presented learning approach is intended to operate at the PHY level, allowing the operation of current MAC-layer communication standards (e.g., in IEEE 802.11 WLANs, channel access is governed by the CSMA/CA operation, so that Stateless Q-learning may contribute to improving spatial reuse at the PHY level).
IV. PERFORMANCE EVALUATION
In this section we introduce the simulation parameters and describe the experiments.² Then, we show the main results.

¹ We note that local information such as the observed instantaneous channel quality could be incorporated in the state definition. However, such a description of the system entails increased complexity.
² The code used for simulations can be found at https://github.com/wn-upf/Decentralized_Qlearning_Resource_Allocation_in_WNs.git (commit eb4042a1830c8ea30b7eae3d72a51afe765a8d86).

Algorithm 1: Stateless Q-learning

Function StatelessQLearning(SINR, A)
  Input:  SINR: Signal-to-Interference-plus-Noise Ratio sensed at the STA
          A: set of possible actions in {1, ..., K}
  Output: Γ: mean throughput experienced in the WN
  initialize: t = 0, Q̂(a_k) = 0 for all a_k ∈ A
  while active do
      Select a_k = argmax_{k=1,...,K} Q̂(a_k) with probability 1 − ε,
             or k ~ U(1, K) otherwise
      Observe reward r_{a_k} = Γ_{a_k,t} / Γ*
      Q̂(a_k) ← Q̂(a_k) + α·( r_{a_k} + γ·max Q̂ − Q̂(a_k) )
      ε_t ← ε_0 / √t
      t ← t + 1
  end
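Tying Algorithm 1 to the multi-agent setting of Section III, one learning iteration could look like the following sketch. It reuses the StatelessQAgent class from the earlier snippet; evaluate_throughput is a hypothetical helper returning the per-WN throughput of a joint configuration, and the reward is computed once per round here, which is a simplification of the sequential selection described above.

    import random

    def run_iteration(agents, evaluate_throughput, max_throughputs):
        """One learning iteration: WNs choose actions sequentially in a
        random order, then each observes its normalised throughput as the
        reward and updates its estimates."""
        order = list(range(len(agents)))
        random.shuffle(order)                      # random decision order per iteration
        actions = [None] * len(agents)
        for i in order:
            actions[i] = agents[i].select_action()
        throughputs = evaluate_throughput(actions)   # per-WN throughput (Mbps)
        for i, agent in enumerate(agents):
            agent.update(actions[i], throughputs[i] / max_throughputs[i])
        return actions, throughputs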
A. Simulation Parameters
According to [8], a typical high-density scenario for residential buildings contains 0.0033 APs/m³. We then consider a map scenario with dimensions 10 × 5 × 10 m containing 4 WNs that form a grid topology in which STAs are placed at the maximum possible distance from the other networks. This toy scenario allows us to study the performance of Stateless Q-learning in a controlled environment, which is useful to check the applicability of RL in WNs by only using local information.³ We consider that the number of channels is equal to half the number of coexisting WNs, so that we can study a challenging situation regarding spatial reuse. Table I details the parameters used.
Parameter                                Value
Map size (m)                             10 × 5 × 10
Number of coexistent WNs                 4
APs/STAs per WN                          1 / 1
Distance AP-STA (m)                      √2
Number of Channels                       2
Channel Bandwidth (MHz)                  20
Initial channel selection model          Uniformly distributed
Transmit power values (dBm)              {5, 10, 15, 20}
PL_0 (dB)                                5
α_PL                                     4.4
G_s (dB)                                 Normally distributed with mean 9.5
G_o (dB)                                 Uniformly distributed with mean 30
d_obs (meters between two obstacles)     5
Noise level (dBm)                        -100
Traffic model                            Full buffer (downlink)

TABLE I: Simulation parameters
B. Optimal solution
We first identify the optimal solutions that maximize: i) the aggregate throughput, and ii) the proportional fairness, which is computed as the logarithmic sum of the throughput experienced by each WN, i.e., PF = max_{k∈A} Σ_i log(Γ_{i,k}). The optimal solutions are listed in Table II.

³ The analysis of the presented learning mechanisms in more congested scenarios is left as future work.
WN id   Action that maximizes the Aggregate Throughput   Action that maximizes the Proportional Fairness
1       1 (2)                                             7 (8)
2       1 (2)                                             8 (7)
3       7 (8)                                             7 (8)
4       8 (7)                                             8 (7)

TABLE II: Optimal configurations (action indexes) to achieve the maximum network throughput and proportional fairness, resulting in 1124 Mbps and 891 Mbps, respectively. The analogous (symmetric) solution is shown in parentheses. Action indexes 1 to 8 are mapped to {channel number, transmit power (dBm)}: {1,5}, {2,5}, {1,10}, {2,10}, {1,15}, {2,15}, {1,20} and {2,20}, respectively.
Note that, since the considered scenario is symmetric, there are two equivalent solutions. Note, as well, that in order to maximize the aggregate network throughput, two of the WNs sacrifice themselves by choosing a lower transmit power. This result is therefore not likely to occur in an adversarial selfish setting.
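For reference, the optimal configurations in Table II can be obtained by exhaustive search over the 8^4 joint actions of the four WNs. The sketch below uses the same assumptions as the earlier snippets (a hypothetical evaluate_throughput helper returning per-WN throughput in Mbps, with 1-based action indexes matching Table II).

    import itertools
    import math

    # Action indexes 1-8 map to {channel, transmit power (dBm)} as in Table II.
    ACTIONS = [(1, 5), (2, 5), (1, 10), (2, 10), (1, 15), (2, 15), (1, 20), (2, 20)]

    def optimal_configurations(evaluate_throughput, n_wns=4):
        """Return the joint actions maximizing aggregate throughput and
        proportional fairness (log-sum of per-WN throughputs)."""
        best_agg = best_pf = None
        for joint in itertools.product(range(1, len(ACTIONS) + 1), repeat=n_wns):
            thr = evaluate_throughput(joint)          # per-WN throughput (Mbps)
            agg = sum(thr)
            pf = sum(math.log(x) for x in thr)        # proportional fairness metric
            if best_agg is None or agg > best_agg[0]:
                best_agg = (agg, joint)
            if best_pf is None or pf > best_pf[0]:
                best_pf = (pf, joint)
        return best_agg, best_pf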
C. Input Parameters Analysis
We first analyse the effects of modifying α (the learning rate), γ (the discount factor) and ε_0 (the initial exploration coefficient of the ε-greedy update rule) with respect to the achieved network throughput. We run simulations of 10000 iterations and capture the results of the last 5000 iterations to ensure that the initial transitory phase has ended. Each simulation is repeated 100 times for averaging purposes.
Figure 1 shows the average aggregate throughput achieved for each of the proposed combinations. It can be observed that the best results with respect to the aggregate throughput, regarding both average and variance, are achieved when α = 1, γ = 0.95 and ε_0 = 1. This means that for achieving the best results (i.e., high average aggregate throughput and low variance), the immediate reward of a given action must be considered rather than any previous information (α = 1). We see that the difference between the pay-off offered by the best action and the current one must also be high (γ = 0.95). In addition, exploration must be highly boosted at the beginning (ε_0 = 1). For this setting, the resulting throughput (902.739 Mbps) represents 80.29% of the one provided by the optimal configuration that maximizes the aggregate throughput (shown in Table II). Regarding proportional fairness, the algorithm's resulting throughput is only 1.32% higher than the optimal.
We also evaluate the relationship between different values of α and γ and the average aggregate throughput and standard deviation (shown in Figure 2). We observe a remarkably higher aggregate throughput when α > γ. We also see that the variability between different simulation runs is much lower when the average throughput is higher. Additionally, we note a peak in the standard deviation when γ ≈ α and γ > α.

Fig. 1: Effect of α, γ and ε_0 on the average aggregate throughput (100 simulation runs per sample). [Figure omitted: aggregate throughput (Mbps) vs. α for the combinations of γ ∈ {0.95, 0.5, 0.05} and ε_0 ∈ {1, 0.5}, together with the optimal maximum aggregate throughput and maximum proportional fairness references.]
Fig. 2: Evaluation of α and γ: (a) average aggregate throughput (Mbps); (b) standard deviation (Mbps). [Surface plots over α and γ omitted.]
To further understand the effects of modifying each of the aforementioned parameters, we show, for different ε_0, α and γ: i) the individual throughput experienced by each WN during the total 10000 iterations of a single simulation run (Figure 3), ii) the average throughput experienced by each WN for the last 5000 iterations, also for a single simulation run (Figure 4), and iii) the probability of choosing each action at each WN (Figure 5). We observe the following aspects:
• In Figure 3, a high variability of the throughput experienced by each WN can be observed, especially if ε_0 is high (as in Figures 3(a) and 3(c)). A high degree of exploration allows WNs to discover changes in the resulting performance of their actions due to the activity of the other nodes, which at the same time generates more variability (WNs adapt to changes in the environment).
• Despite the variability generated, we obtain fairer results for high ε_0 (Figure 4). Hence, there is a relationship between the variability generated and the average throughput fairness.
Fig. 3: Individual throughput experienced by each WN during a single simulation run for different ε_0, α and γ. [Panels omitted: (a) ε_0 = 1, α = 1, γ = 0.95; (b) ε_0 = 0.1, α = 1, γ = 0.95; (c) ε_0 = 1, α = 0.1, γ = 0.05; (d) ε_0 = 0.1, α = 0.1, γ = 0.05; each showing throughput (Mbps) vs. iteration for WN1–WN4.]
Fig. 4: Average throughput experienced by each WN during the last 5000 iterations of a total of 10000 (single simulation run), for different ε_0, α and γ. [Panels omitted: (a) ε_0 = 1, α = 1, γ = 0.95; (b) ε_0 = 0.1, α = 1, γ = 0.95; (c) ε_0 = 1, α = 0.1, γ = 0.05; (d) ε_0 = 0.1, α = 0.1, γ = 0.05; each showing mean throughput (Mbps) per WN id.]
• Finally, in Figures 5(a) and 5(c) we observe that, for the former, there are two favourite actions that are played the most, while for the latter there is only one preferred action. The lower the learning rate (α), and consequently the discount factor (γ), the higher the probability of choosing a unique action, which turns out to be the one that provided the best performance in the past. The opposite occurs for higher α and γ values, since giving more importance to the immediate reward allows reacting only to the recently-played actions of the neighbouring nodes: the algorithm is short-sighted.

Fig. 5: Probability of choosing each action at each WN for a single simulation run (10000 iterations) and different ε_0, α and γ values. [Panels omitted: (a) ε_0 = 1, α = 1, γ = 0.95; (b) ε_0 = 0.1, α = 1, γ = 0.95; (c) ε_0 = 1, α = 0.1, γ = 0.05; (d) ε_0 = 0.1, α = 0.1, γ = 0.05; each showing the probability of action indexes 1–8 for WN1–WN4.]
V. CONCLUSIONS
Decentralized Q-learning can be used to improve spatial reuse in dense wireless networks, enhancing performance as a result of exploiting the most rewarding actions. We have shown in this article, by means of a toy scenario, that Stateless Q-learning in particular allows finding good-performing configurations that achieve close-to-optimal (in terms of throughput maximization and proportional fairness) solutions.
However, the competitiveness of the presented fully-decentralized environment entails the non-existence of a Nash Equilibrium. Thus, we have also identified high variability in the experienced individual throughput due to the constant changes of the played actions, motivated by the fact that the reward generated by each action changes according to the opponents' ones. We have evaluated the impact of the parameters intrinsic to the learning algorithm on this variability, showing that it can be reduced by decreasing the exploration degree and the learning rate. This individual reduction in throughput variability, however, occurs at the expense of losing aggregate performance.
This variability can potentially result in negative effects on the overall performance of the WNs. The effects of such fluctuations on higher layers of the protocol stack can have severe consequences depending on the time scale at which they occur. For example, noticing high throughput fluctuations may trigger congestion recovery procedures in TCP (Transmission Control Protocol), which would harm the experienced performance.
We leave for future work further extending the decentralized approach in order to find collaborative algorithms that allow the neighbouring WNs to reach an equilibrium that grants acceptable individual performance. Acquiring some kind of knowledge about the neighbouring WNs is expected to help solve the variability issues arising from decentralization. This information may be directly exchanged or inferred from observations. Furthermore, we intend to analyse other learning approaches in the future for performance comparison in the resource allocation problem.
ACKNOWLEDGMENT
This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502), and by the European Regional Development Fund under grant TEC2015-71303-R (MINECO/FEDER).
REFERENCES
[1] Chen, L. (2010, May). A distributed access point selection algorithm based on no-regret learning for wireless access networks. In Vehicular Technology Conference (VTC 2010-Spring), 2010 IEEE 71st (pp. 1-5). IEEE.
[2] Maghsudi, S., & Stańczak, S. (2015). Channel selection for network-assisted D2D communication via no-regret bandit learning with calibrated forecasting. IEEE Transactions on Wireless Communications, 14(3), 1309-1322.
[3] Maghsudi, S., & Stańczak, S. (2015). Joint channel selection and power control in infrastructureless wireless networks: A multiplayer multi-armed bandit framework. IEEE Transactions on Vehicular Technology, 64(10), 4565-4578.
[4] Nie, J., & Haykin, S. (1999). A Q-learning-based dynamic channel assignment technique for mobile communication systems. IEEE Transactions on Vehicular Technology, 48(5), 1676-1687.
[5] Bennis, M., & Niyato, D. (2010, December). A Q-learning based approach to interference avoidance in self-organized femtocell networks. In GLOBECOM Workshops (GC Wkshps), 2010 IEEE (pp. 706-710). IEEE.
[6] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT Press.
[7] Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
[8] Bellalta, B. (2016). IEEE 802.11ax: High-efficiency WLANs. IEEE Wireless Communications, 23(1), 38-46.