# A Reinforcement Learning Framework for Optimizing Age of Information in RF-Powered Communication Systems

04 May 2020 · IEEE Transactions on Communications (Institute of Electrical and Electronics Engineers (IEEE)) · Vol. 68, Iss. 8, pp. 4747-4760

TL;DR: In this article, a real-time monitoring system is considered where multiple source nodes are responsible for sending update packets to a common destination node in order to maintain the freshness of information at the destination.

Abstract: In this paper, we study a real-time monitoring system in which multiple source nodes are responsible for sending update packets to a common destination node in order to maintain the freshness of information at the destination. Since it may not always be feasible to replace or recharge batteries in all source nodes, we consider that the nodes are powered through wireless energy transfer (WET) by the destination. For this system setup, we investigate the optimal online sampling policy (referred to as the age-optimal policy ) that jointly optimizes WET and scheduling of update packet transmissions with the objective of minimizing the long-term average weighted sum of Age of Information (AoI) values for different physical processes (observed by the source nodes) at the destination node, referred to as the sum-AoI . To solve this optimization problem, we first model this setup as an average cost Markov decision process (MDP) with finite state and action spaces. Due to the extreme curse of dimensionality in the state space of the formulated MDP, classical reinforcement learning algorithms are no longer applicable to our problem even for reasonable-scale settings. Motivated by this, we propose a deep reinforcement learning (DRL) algorithm that can learn the age-optimal policy in a computationally-efficient manner. We further characterize the structural properties of the age-optimal policy analytically, and demonstrate that it has a threshold-based structure with respect to the AoI values for different processes. We extend our analysis to characterize the structural properties of the policy that maximizes average throughput for our system setup, referred to as the throughput-optimal policy . Afterwards, we analytically demonstrate that the structures of the age-optimal and throughput-optimal policies are different. We also numerically demonstrate these structures as well as the impact of system design parameters on the optimal achievable average weighted sum-AoI.

## Summary


### Introduction

- In practice, the timely delivery of the measurements to the destination nodes is greatly restricted by the limited energy budget of the source nodes and the pathloss of the wireless channel between the source and destination nodes.
- The staleness of information status at the destination nodes increases, which eventually degrades the performance of such real-time applications.
- This necessitates designing efficient transmission policies for freshness-aware RF-powered communication systems, which is the main objective of this paper.

### B. Contributions

- This paper studies a real-time monitoring system in which multiple source nodes are supposed to keep the status of their observed physical processes fresh at a common destination node by transmitting update packets frequently over time.
- By analytically establishing the monotonicity property of the value function associated with the formulated MDP, the authors show that the age-optimal policy is a threshold-based policy with respect to each of the AoI values for different processes.
- The authors' results provide several useful system design insights.
- They show that the differences between the structures of the age-optimal and throughput-optimal policies in the single source-destination pair model mainly depend upon the AoI value of the observed process at the destination node.
- After showing the convergence of their proposed DRL algorithm, their numerical results also demonstrate the impact of system design parameters, such as the capacity of batteries and the size of update packets, on the achievable average weighted sum-AoI.

### C. Organization

- The long-term weighted sum-AoI minimization problem is then formulated in Section III, where a DRL algorithm is proposed to obtain its solution.
- Afterwards, the authors present their analysis used to characterize the structural properties of the age-optimal policy in Section IV.
- The novelty of their MDP formulation lies in the use of the newly emerging concept of AoI in the objective function to quantify freshness of information, which has not been done in the other research areas.
- The key differences between the structural properties of the age-optimal and throughput-optimal policies in the single source-destination pair model are demonstrated in Section V. Section VI verifies their analytical findings from Sections IV and V as well as evaluates the performance of their proposed DRL algorithm numerically.

### A. Network Model

- Each source node is supposed to keep the information status of its observed process at a destination node (for instance, a cellular BS) fresh by sending status update packets over time.
- The destination node is assumed to have a stable energy source whereas each source node is equipped with an RF energy harvesting circuitry as its only source of energy.
- When Ai(k) reaches Amax,i, it means that the available information at the destination nodes about process i is too stale to be of any use.
- In addition, this assumption makes the AoI variable of each process take only a finite number of values, i.e., the AoI state space of each process is finite.
- The locations of the source nodes are known a priori, and hence their average channel power gains are preestimated and known at the destination node.
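The capped AoI dynamics described above can be sketched in a few lines. This is an assumed convention, not code from the paper: AoI is taken to reset to 1 on a successful update and otherwise to grow by one per slot, saturating at Amax (the standard slotted-AoI model; the paper defines the exact dynamics in Section II).

```python
# Sketch of capped AoI evolution (assumption: reset to 1 on a successful
# update, grow by one per slot otherwise, saturate at a_max).
def next_aoi(aoi: int, a_max: int, update_received: bool) -> int:
    if update_received:
        return 1
    return min(aoi + 1, a_max)

# Illustrative AoI trajectory over 5 slots with one successful update at slot 2:
traj = []
aoi = 3
for received in [False, False, True, False, False]:
    aoi = next_aoi(aoi, a_max=5, update_received=received)
    traj.append(aoi)
print(traj)  # [4, 5, 1, 2, 3]
```

The saturation at `a_max` is what keeps the per-process AoI state space finite, as the bullet above notes.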

### B. State and Action Spaces

- At the beginning of an arbitrary time slot k, the state si(k) of a source node i is characterized by its battery level, the AoI of its observed process i at the destination, and its uplink and downlink channel power gains from the destination node, i.e., si(k) ≜ (Bi(k), Ai(k), gi(k), hi(k)) ∈ Sai.
- Note that Sai is the state space which contains all the combinations of Bi(k), Ai(k), gi(k) and hi(k), where the superscript a indicates that it is defined for the average AoI minimization problem.
- The authors assume that P is sufficiently large such that the energy harvested at each source node due to uplink data transmissions by the other source nodes is negligible.
- Under action Ti, slot k is allocated for information transmission, where source node i sends an update packet about its observed process to the destination.
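The state and action bookkeeping above can be illustrated as follows. The names and types here are hypothetical (the paper writes si(k) ≜ (Bi(k), Ai(k), gi(k), hi(k)) and actions H and Ti); this is only a sketch of the structure.

```python
from dataclasses import dataclass

# Hypothetical per-source state tuple mirroring s_i(k) = (B_i(k), A_i(k), g_i(k), h_i(k)).
@dataclass(frozen=True)
class SourceState:
    battery: int   # B_i(k): quantized battery level
    aoi: int       # A_i(k) in {1, ..., A_max_i}
    g_down: int    # g_i(k): quantized downlink channel power gain
    h_up: int      # h_i(k): quantized uplink channel power gain

# Action set: either wireless energy transfer (H) in slot k,
# or an update-packet transmission T_i from one of the N sources.
def action_space(n_sources: int) -> list[str]:
    return ["H"] + [f"T{i}" for i in range(1, n_sources + 1)]

print(action_space(3))  # ['H', 'T1', 'T2', 'T3']
```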

### A. Problem Statement

- The authors objective is to obtain the optimal policy, which specifies the actions taken at different states of the system over time, achieving the minimum average weighted sum-AoI, i.e., sum of AoI values for different processes at the destination.
- The authors intention behind using a weighted average cost function is to provide a generic problem formulation that can account for the potential differences between the observed physical processes by the source nodes in terms of the impact of the AoI value of each process on the optimal actions taken at the destination node.

### B. MDP Formulation

- Clearly, an upper bound to the performance of the continuous system can be obtained by reversing the use of the floor and ceiling in the definitions of eTi(k) and eHi(k).
- These conditional probabilities are determined according to the Markovian fading channel model considered in the problem.
- Clearly, for a reasonable number of both the discrete values for each state variable (i.e., Amax,i, Gi, Hi, and bmax,i + 1) and the source nodes deployed in the network (N), the state space will have a massive number of states.
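The state-space explosion noted above is easy to check numerically: with Amax,i AoI values, Gi downlink gain levels, Hi uplink gain levels, and bmax,i + 1 battery levels per source, the joint state space has their product raised to the power N (when all N sources share the same discretization). The parameter values below are illustrative only, not those used in the paper.

```python
# Size of the joint MDP state space for N identically discretized sources.
def state_space_size(n_sources: int, a_max: int, g_levels: int,
                     h_levels: int, b_max: int) -> int:
    per_source = a_max * g_levels * h_levels * (b_max + 1)
    return per_source ** n_sources

# Even modest discretizations explode quickly with N:
print(state_space_size(1, a_max=50, g_levels=4, h_levels=4, b_max=9))  # 8000
print(state_space_size(4, a_max=50, g_levels=4, h_levels=4, b_max=9))  # 8000**4, about 4.1e15
```

This is why tabular methods such as RVIA or VIA become infeasible here even for a handful of sources.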

### C. Deep Reinforcement Learning for Optimizing AoI

- DRL is suitable for their problem since it can reduce the dimensionality of the large state space while learning the optimal policy at the same time [55].
- By applying the update step in (15), the system can always exploit the learning process by taking the action which minimizes the long-term average cost, i.e., the action that minimizes the Q-function value of the current state.
- Algorithm 1 summarizes the steps of the proposed DRL algorithm.
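The exploit step described above (take the action that minimizes the Q-function value at the current state) can be sketched generically. This is not the paper's Algorithm 1 or its update (15): it is a minimal ε-greedy Q-learning skeleton with assumed names, a linear approximator standing in for the deep network, and a discounted TD target used in place of the paper's average-cost update purely to keep the sketch short. Since AoI is a cost to be minimized, the greedy action is the argmin of the Q-values.

```python
import numpy as np

rng = np.random.default_rng(0)

class QApprox:
    """Linear stand-in for the Q-network (illustrative, not the paper's model)."""
    def __init__(self, state_dim: int, n_actions: int, lr: float = 0.01):
        self.w = np.zeros((n_actions, state_dim))
        self.lr = lr

    def q_values(self, s: np.ndarray) -> np.ndarray:
        return self.w @ s

    def act(self, s: np.ndarray, eps: float) -> int:
        if rng.random() < eps:                   # explore
            return int(rng.integers(len(self.w)))
        return int(np.argmin(self.q_values(s)))  # exploit: minimize cost

    def update(self, s, a, cost, s_next, gamma=0.95):
        # One-step TD update toward cost + gamma * min_a' Q(s', a').
        target = cost + gamma * np.min(self.q_values(s_next))
        td_err = target - self.q_values(s)[a]
        self.w[a] += self.lr * td_err * s
```

The ε parameter trades off exploring state-action pairs against exploiting the current cost estimate, which is the same tension the numerical-results section attributes the small optimality gap to.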

### IV. STRUCTURAL PROPERTIES OF THE AGE-OPTIMAL POLICY

- More specifically, the optimal actions at some states can now be directly determined based on the optimal actions taken at some other states (due to the threshold-based structure of the age-optimal policy), and hence the computational complexity of the policy improvement step can be greatly reduced.
- It is also worth noting that the case of N = 1 in their system setup refers to the classical single source-destination pair model studied in most prior works on AoI in the literature, e.g., [4], [6], [8]–[13].

### V. AGE-OPTIMAL POLICY VS. THROUGHPUT-OPTIMAL POLICY

- The authors aim to analytically compare the structural properties of the age-optimal and the throughput-optimal policies.
- Due to its higher tractability (as demonstrated in the previous section), the authors will focus on the single source-destination pair model for this comparison.
- Specifically, the authors first formulate the average throughput maximization problem for the case of N = 1 in the system setup presented in Section II.
- Afterwards, the authors investigate some structural properties of the throughput-optimal policy from which they highlight the differences between the structures of the age-optimal and throughput-optimal policies.

### A. Average Throughput Maximization Formulation and Proposed Solution

- Srd is the state space of the discrete model for the throughput maximization problem, i.e., when the battery and channel power gain are discretized.
- Note that the AoI is not included now in the state of the system.
- For such a single source-destination pair model, the action space is defined as A ≜ {H, T1}, where the source node can either harvest energy or transmit a packet of size S at each time slot.
- Hence, the average throughput maximization problem is modeled as a finite-state finite-action MDP for which there exists an optimal stationary deterministic policy [53].
- Clearly, Q(s, a) represents the expected reward resulting from taking action a in state s.

### B. Structural Properties of the Throughput-optimal Policy

- By using (31), the result can be obtained using the same approach used in the proof of Lemma 2, i.e., by applying mathematical induction to the iterations of the VIA.
- This result can be obtained using the same approach used in the proof of Theorem 2.
- The authors' results in Theorems 2 and 3 clearly demonstrate that the structures of the age-optimal and throughput-optimal policies are different, which will also be verified in the numerical results section.

### VI. NUMERICAL RESULTS

- The authors verify their analytical results derived in Section IV, and show the performance of their proposed DRL algorithm in terms of the achievable average weighted sum-AoI as a function of system design parameters.
- The downlink and uplink channel power gains between the destination and source nodes are modeled as gi = hi = Γψ²di^(−ν), where Γ is the signal power gain at a reference distance of 1 meter, ψ² ∼ exp(1) denotes the small-scale fading gain, and di^(−ν) represents standard power-law path loss with exponent ν.
- In addition, for the single source-destination pair model in Figs. 6 and 7, the points located inside the solid polygon refer to the states for which it is possible to transmit an update packet (take the T1 action), i.e., for each of those states b1 ≥ eT1.
- Furthermore, the points located inside the dotted polygon represent the set Sth,ad (defined in Remark 2), i.e., the set of states over which the age-optimal policy has a threshold-based structure.
- Note that the dotted polygon is the same as the solid one in Fig.
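The quasi-static fading model quoted above, gi = hi = Γψ²di^(−ν) with ψ² ∼ exp(1), can be sampled once per slot as follows. The parameter values here are illustrative, not the ones used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def channel_gain(gamma_ref: float, distance: float, nu: float) -> float:
    """One quasi-static slot realization of g_i = Gamma * psi^2 * d^(-nu),
    where psi^2 ~ exp(1) is the small-scale (Rayleigh power) fading gain."""
    psi_sq = rng.exponential(1.0)
    return gamma_ref * psi_sq * distance ** (-nu)

# Since E[psi^2] = 1, the empirical mean over many slots approaches
# Gamma * d^(-nu). Illustrative numbers: gamma_ref = 1e-3, d = 10 m, nu = 2.5.
gains = [channel_gain(1e-3, 10.0, 2.5) for _ in range(100_000)]
print(np.mean(gains))  # approximately 1e-3 * 10**-2.5, i.e. about 3.16e-6
```

Gains drawn this way stay constant within a slot and are redrawn independently each slot, matching the quasi-static flat-fading assumption in the system model.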

### B. Comparison of the Structures of the Age-optimal and Throughput-optimal Policies

- Note that the slight gap between the optimal value and the achievable average AoI by the DRL algorithm is due to using an ε-greedy policy in the DRL algorithm (required for exploring all the state-action pairs while learning the optimal policy, and hence guaranteeing the convergence of the algorithm).
- It is observed that the achievable average sum-AoI monotonically decreases as the size of update packets decreases and/or the capacity of batteries increases.

### VII. CONCLUSION

- The authors have proposed an implementable age-optimal sampling strategy for designing freshness-aware RF-powered communication systems.
- To obtain the age-optimal policy, the problem was modeled as an average cost MDP with finite state and action spaces.
- Multiple system design insights were drawn from their numerical results.
- They showed that the structures of the age-optimal and throughput-optimal policies in the single source-destination pair model are similar when the AoI value is relatively small (i.e., there is no urgency to update the information status at the destination node).
- The authors' results also revealed that the optimal average weighted sum-AoI is monotonically increasing (decreasing) with respect to the size of update packets (the capacity of batteries at the source nodes).


A Reinforcement Learning Framework for Optimizing Age of Information in RF-Powered Communication Systems

Mohamed A. Abd-Elmagid, Harpreet S. Dhillon and Nikolaos Pappas

The self-archived postprint version of this journal article is available at Linköping University Institutional Repository (DiVA): http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-169252

N.B.: When citing this work, cite the original publication: Abd-Elmagid, M. A., Dhillon, H. S., Pappas, N., (2020), A Reinforcement Learning Framework for Optimizing Age of Information in RF-Powered Communication Systems, IEEE Transactions on Communications, 68(8), 4747-4760. https://doi.org/10.1109/TCOMM.2020.2991992

©2020 IEEE (http://www.ieee.org/index.html). Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.




Index Terms—Age of Information, RF energy harvesting, Markov Decision Process, Reinforcement learning.

I. INTRODUCTION

A typical real-time monitoring system consists of source and destination nodes, where source nodes observe underlying stochastic processes while the destination nodes keep track of the status of these processes through status updates transmitted (often wirelessly) by the source nodes. Examples of the source nodes include Internet of Things (IoT) devices, aggregators and sensors, while examples of the destination nodes include cellular base stations (BSs) [2]. The performance of many such real-time systems and applications depends upon how fresh the status updates are when they reach the destination nodes. In practice, the timely delivery of the measurements to the destination nodes is greatly restricted by the limited energy budget of the source nodes and the pathloss of the wireless channel between the source and destination nodes. Specifically, this could result in the loss or out-of-order reception of the measurements at the destination nodes. Consequently, the staleness of information status at the destination nodes increases, which eventually degrades the performance of such real-time applications.

Since it is highly inefficient or even impractical to replace or recharge batteries in many source nodes, energy harvesting solutions have been considered to enable a self-perpetuating operation of communication networks by supplementing or even circumventing the use of replaceable batteries in the source nodes. Due to its ubiquity and cost-efficient implementation, radio-frequency (RF) energy harvesting has quickly emerged as an appealing solution for charging low-power source nodes (especially the ones that are deployed at difficult-to-reach places) [3]. This necessitates designing efficient transmission policies for freshness-aware RF-powered communication systems, which is the main objective of this paper. Towards this objective, we use the concept of AoI to quantify the freshness of information at the destination nodes [4]. This raises the obvious question of optimally scheduling packet transmissions from these RF-powered source nodes with the objective of minimizing the average AoI at the destination nodes, subject to the energy causality constraints at the source nodes. To address this question, this paper makes the first attempt, to the best of our knowledge, to develop a reinforcement learning-based framework in which we: i) propose a computationally-efficient approach to characterize the age-optimal transmission policy numerically, ii) analytically derive the structural properties of the age-optimal policy, and iii) analytically characterize key differences in the structural properties of the age-optimal and throughput-optimal policies.

M. A. Abd-Elmagid and H. S. Dhillon are with Wireless@VT, Department of ECE, Virginia Tech, Blacksburg, VA. Email: {maelaziz, hdhillon}@vt.edu. N. Pappas is with the Department of Science and Technology, Linköping University, SE-60174 Norrköping, Sweden. Email: nikolaos.pappas@liu.se. The support of the U.S. NSF (Grant CPS-1739642) is gratefully acknowledged. This paper was presented in part at the IEEE Globecom, 2019 [1].

A. Related Work

First introduced in [4], AoI is a new metric that quantifies the freshness of information at a destination node due to the transmission of update packets by the source node. Formally, AoI is defined as the time passed since the latest successfully received update packet at the destination was generated at the source node. Under a simple queue-theoretic model in which randomly generated packets arrive at the source according to a Poisson process and then are transmitted to the destination using a first-come-first-served (FCFS) discipline, the authors of [4] characterized the average AoI expression. Afterwards, a series of works [5]–[12] aimed at characterizing the average AoI and its variations (e.g., Peak Age-of-Information (PAoI) [8]–[10] and Value of Information of Update (VoIU) [11]) for adaptations of the queueing model studied in [4]. Another direction of research [13]–[33] focused on employing AoI as a performance metric for different communication systems that deal with time-critical information while having limited resources, e.g., multi-server information-update systems [14], broadcast networks [15]–[17], multi-hop networks [18], cognitive networks [19], unmanned aerial vehicle (UAV)-assisted communication systems [20]–[22], IoT networks [2], [23], [24], ultra-reliable low-latency vehicular networks [25], multicast networks [26], decentralized random access schemes [32], and multi-state time-varying networks [33]. Particularly, the objective of this research direction was to characterize optimal policies that minimize average AoI, referred to as age-optimal policies, by applying different tools from optimization theory. Note that [13]–[33] did not consider energy harvesting as a powering source for the source nodes.

Different from [13]–[33], another line of research [34]–[48] focused on the class of problems in which the source node is powered by energy harvesting under various system settings. The objective of this line of research was to investigate age-optimal offline/online policies for update packet transmissions subject to the energy causality constraint at the source under various assumptions regarding the battery size, transmission time of update packets and channel modeling. Specifically, the infinite battery capacity case was studied in [34]–[37], [44] whereas [38]–[43], [45], [46] considered the case of finite battery capacity. Different from [36]–[41] where it was assumed that each update packet could be transmitted to the destination instantly subject to the energy causality constraint, [34], [43], [44] considered stochastic transmission time and [35], [45], [46] studied the non-zero fixed transmission time case. While [34]–[36], [38]–[42], [45] considered error-free channel models, i.e., every update packet transmission is successfully received at the destination, a noisy channel model was considered in [37], [43], [44], [46]. A common model of the energy harvesting process in [34]–[45] is an external point process (e.g., a Poisson process) independent from all the system design parameters. In contrast, when the source node is powered by RF energy harvesting, as considered in this paper, the energy harvested at the source is a function of the temporal variation of the channel state information (CSI). This, in turn, means that the age-optimal policies studied in [34]–[44] are not directly applicable to this setting. In particular, one needs to incorporate CSI statistics in the process of decision-making, which adds another layer of complexity to the analysis of age-optimal policies for such settings.

Before going into more details about our contributions, it is instructive to note that the problem of age-optimal policy design in wireless powered communication systems has been studied very recently in [47], [48] for a single source-destination pair model. However, neither of the policies proposed in [47], [48] took into account the evolution of the battery level at the source and the variation of CSI over time in the process of decision-making. It is also worth noting that [22], [46], [49]–[52] have recently applied reinforcement learning-based algorithms to characterize the age-optimal policy. However, none of these works applied a DRL-based algorithm to efficiently design freshness-aware RF-powered communication systems. Different from these, we consider a more general model in which multiple RF-powered source nodes are deployed to potentially sense different physical processes. For this setting, we provide a novel reinforcement learning framework in which we: 1) develop a DRL-based algorithm that characterizes the online age-optimal sampling policy while considering the dynamics of batteries, AoI values for different processes and CSI, and 2) analytically characterize key differences between the structures of the online age-optimal and throughput-optimal policies. More details on our contributions are provided next.

Fig. 1. An illustration of the system setup.

B. Contributions

This paper studies a real-time monitoring system in which multiple source nodes are supposed to keep the status of their observed physical processes fresh at a common destination node by transmitting update packets frequently over time. Furthermore, each source node is assumed to be powered by harvesting energy from RF signals broadcast by the destination node. For this setup, our main contributions are listed next.

A novel DRL algorithm for optimizing average weighted sum-AoI. Given an importance weight for each physical process at the destination node, we study the long-term average weighted sum-AoI (i.e., sum of AoI values for different processes at the destination node) minimization problem in which WET and scheduling of update packet transmissions from different source nodes are jointly optimized. To tackle this problem, we model it as an average cost MDP with finite state and action spaces. In particular, the MDP determines whether each time slot should be allocated for WET or an update packet transmission from one of the source nodes. This decision is based on the available energies at the source nodes (or their battery levels), the AoI values of different processes at the destination node, and the CSI. Due to the extreme curse of dimensionality in the state space of the formulated MDP, it is computationally infeasible to characterize the age-optimal policy using classical reinforcement learning algorithms [53], [54] such as the relative value iteration algorithm (RVIA), value iteration algorithm (VIA) or policy iteration algorithm (PIA). To overcome this hurdle, we propose a novel DRL algorithm that can learn the age-optimal policy in a computationally-efficient manner.

Analytical characterization of the structural properties of the age-optimal policy. By analytically establishing the monotonicity property of the value function associated with the formulated MDP, we show that the age-optimal policy is a threshold-based policy with respect to each of the AoI values for different processes.¹ Moreover, for the single source-destination pair model (i.e., the case of having a single source node), our results demonstrate that the age-optimal policy is a threshold-based policy with respect to each of the system state variables, i.e., the battery level at the source, the AoI at the destination and the channel power gains. This result is of interest on its own because of the relevance of the source-destination pair model in a plethora of applications, such as predicting and controlling forest fires, safety of an intelligent transportation system, and efficient energy utilization in future smart homes. Not surprisingly, this model has been of interest in a large proportion of the prior work on AoI. Furthermore, this result allows us to analytically demonstrate the key differences between the structures of the age-optimal and throughput-optimal policies.

System design insights. Our results provide several useful system design insights. For instance, they show that the differences between the structures of the age-optimal and throughput-optimal policies in the single source-destination pair model mainly depend upon the AoI value of the observed process at the destination node. In particular, while the age-optimal and throughput-optimal policies have different structures when the AoI value is large, these differences start to vanish as the AoI value decreases. After showing the convergence of our proposed DRL algorithm, our numerical results also demonstrate the impact of system design parameters, such as the capacity of batteries and the size of update packets, on the achievable average weighted sum-AoI. Specifically, they reveal that the achievable average weighted sum-AoI by the DRL algorithm is monotonically decreasing (monotonically increasing) with the capacity of batteries (the size of update packets).

C. Organization

The rest of the paper is organized as follows. Section II presents our system model. The long-term weighted sum-AoI minimization problem is then formulated in Section III, where a DRL algorithm is proposed to obtain its solution. Afterwards, we present our analysis used to characterize the structural properties of the age-optimal policy in Section IV. Using the analytical results derived in Section IV, the key differences between the structural properties of the age-optimal and throughput-optimal policies in the single source-destination pair model are demonstrated in Section V. Section VI verifies our analytical findings from Sections IV and V as well as evaluates the performance of our proposed DRL algorithm numerically. Finally, Section VII concludes the paper.

¹Note that constructing a threshold-based optimal policy under the analytical framework of MDPs is common in other research areas (such as power control and distributed detection) as well. However, the novelty of our MDP formulation lies in the use of the newly emerging concept of AoI in the objective function to quantify freshness of information, which has not been done in the other research areas. This process of decision-making is performed while accounting for various system design parameters (i.e., the battery levels, the AoI values at the destination node, and the CSI) as system state variables.

II. SYSTEM MODEL

A. Network Model

We study a real-time monitoring system in which a set $\mathcal{I}$ of $N$ source nodes is deployed to observe potentially different physical processes, such as temperature or humidity. Each source node is supposed to keep the information status of its observed process at a destination node (for instance, a cellular BS) fresh by sending status update packets over time. In the context of IoT networks, the source node could refer to a single IoT device or an aggregator located near a group of IoT devices, which transmits update packets collected from them to the destination node. The destination node is assumed to have a stable energy source, whereas each source node is equipped with RF energy harvesting circuitry as its only source of energy. In particular, the source nodes harvest energy from the RF signals broadcast by the destination in the downlink, and the energy harvested at source node $i$ is stored in a battery with finite capacity $B_{\max,i}$ Joules. The source and destination nodes are assumed to have a single antenna each and operate over the same frequency channel. Hence, at a given time instant, a source node cannot simultaneously harvest wireless energy in the downlink and transmit data in the uplink.

We consider a discrete time horizon composed of slots of unit length (without loss of generality), where slot $k = 0, 1, \ldots$ corresponds to the time duration $[k, k+1)$. Denote by $B_i(k)$ and $A_i(k)$ the amount of available energy at source node $i$ and the AoI of its observed process $i$ at the destination, respectively, at the beginning of time slot $k$. We assume that $A_i(k)$ is upper bounded by a finite value $A_{\max,i}$, which can be chosen to be arbitrarily large, i.e., $A_i(k) \in \{1, 2, \cdots, A_{\max,i}\}$. When $A_i(k)$ reaches $A_{\max,i}$, the available information at the destination node about process $i$ is too stale to be of any use. In addition, this assumption ensures that the AoI variable of each process takes only a finite number of values, i.e., the AoI state space of each process is finite. This will facilitate the solution of the MDP, as will be clarified in the next section. Let $g_i(k)$ and $h_i(k)$ denote the downlink and uplink channel power gains between the destination and source node $i$ over slot $k$, respectively. The downlink and uplink channels are assumed to be affected by quasi-static flat fading, i.e., they remain constant over a time slot but change independently from one slot to another. The locations of the source nodes are known a priori, and hence their average channel power gains are pre-estimated and known at the destination node. In particular, at the beginning of an arbitrary time slot, the destination node has perfect knowledge of the channel power gains in that slot, and only statistical knowledge for future slots. This is a reasonable assumption for many IoT applications.
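Under this channel model, per-slot realizations are straightforward to draw. The sketch below assumes Rayleigh fading, so each power gain is exponentially distributed around its pre-estimated mean; the paper itself only requires that the average gains are known, and the function name and parameters are illustrative, not from the paper.

```python
import random

def sample_channel_gains(mean_gains, num_slots, seed=0):
    """Draw i.i.d. per-slot channel power gains for each source node.

    Quasi-static flat fading: a gain is constant within a slot and
    redrawn independently for the next slot. Rayleigh fading is an
    assumption made here, so the power gain of node i is exponential
    with its pre-estimated mean mean_gains[i].
    """
    rng = random.Random(seed)
    # gains[k][i] = channel power gain of node i in slot k
    return [[rng.expovariate(1.0 / g) for g in mean_gains]
            for _ in range(num_slots)]

gains = sample_channel_gains(mean_gains=[1.0, 0.5], num_slots=3)
```

The same sampler can be reused for downlink ($g_i$) and uplink ($h_i$) gains with their respective means.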

B. State and Action Spaces

At the beginning of an arbitrary time slot $k$, the state $s_i(k)$ of a source node $i$ is characterized by its battery level, the AoI of its observed process $i$ at the destination, and its uplink and downlink channel power gains from the destination node, i.e., $s_i(k) \triangleq (B_i(k), A_i(k), g_i(k), h_i(k)) \in \mathcal{S}_i^a$. Note that $\mathcal{S}_i^a$ is the state space which contains all the combinations of $B_i(k)$, $A_i(k)$, $g_i(k)$ and $h_i(k)$, where the superscript $a$ indicates that it is defined for the average AoI minimization problem. The state of the system at slot $k$ is then given by $s(k) = \{s_i(k)\}_{i \in \mathcal{I}} \in \mathcal{S}^a$, where $\mathcal{S}^a$ is the system state space. Based on $s(k)$, the action taken at slot $k$ is given by $a(k) \in \mathcal{A} \triangleq \{H, T_1, T_2, \cdots, T_N\}$, as illustrated in Fig. 1. When $a(k) = H$, slot $k$ is dedicated to WET, where the destination broadcasts an RF energy signal in the downlink to charge the batteries at the source nodes. Particularly, the amount of energy harvested by an arbitrary source node $i$ can be expressed as

$$E_i^H(k) = \eta P g_i(k), \quad (1)$$

where $\eta$ is the efficiency of the energy harvesting circuitry and $P$ is the average transmit power of the destination. We assume that $P$ is sufficiently large such that the energy harvested at each source node due to uplink data transmissions by the other source nodes is negligible. On the other hand, when $a(k) = T_i$, slot $k$ is allocated for information transmission, where source $i$ sends an update packet about its observed process to the destination. We consider a generate-at-will policy [13], where the source scheduled for transmission generates an update packet at the beginning of the time slot whenever that slot is allocated for information transmission. According to Shannon's formula, when the energy consumed by source $i$ to transmit an update packet of size $\bar{S}$ in slot $k$ is $E_i^T(k)$, its maximum reliable transmission rate is $\log_2\left(1 + \frac{h_i(k)\,E_i^T(k)}{\sigma^2}\right)$ bits/Hz (recall that the slot length is unity), where $\sigma^2$ is the noise power at the destination. Hence, the action $T_i$ can only be decided if the battery level at source $i$ satisfies the following condition:

$$B_i(k) \geq E_i^T(k) = \frac{\sigma^2}{h_i(k)}\left(2^{\bar{S}} - 1\right). \quad (2)$$
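Equations (1) and (2) translate directly into a feasibility check for the transmit action. A minimal sketch follows, with assumed numerical values for $\eta$, $P$, $\sigma^2$ and $\bar{S}$ (none of these numbers come from the paper):

```python
ETA = 0.5       # energy-harvesting efficiency eta (assumed value)
P = 2.0         # average transmit power of the destination, in Watts (assumed)
SIGMA2 = 1e-3   # noise power sigma^2 at the destination (assumed)
S_BAR = 4.0     # update-packet size S-bar, in bits/Hz (assumed)

def harvested_energy(g_i):
    """Eq. (1): energy harvested by node i during a WET slot of unit length."""
    return ETA * P * g_i

def required_tx_energy(h_i):
    """Eq. (2): minimum energy node i needs to deliver one update of size
    S_BAR within the unit-length slot, obtained by inverting Shannon's
    rate formula log2(1 + h * E / sigma^2) >= S_BAR."""
    return (SIGMA2 / h_i) * (2.0 ** S_BAR - 1.0)

def can_transmit(battery_i, h_i):
    """Action T_i is only feasible when the battery covers Eq. (2)."""
    return battery_i >= required_tx_energy(h_i)
```

For example, with $h_i(k) = 0.1$ the required transmit energy is $(10^{-3}/0.1)(2^4 - 1) = 0.15$ J, so a node with $0.2$ J in its battery may be scheduled while a node with $0.1$ J may not.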

In every time slot, the battery level at each source node and the AoI values for the different processes at the destination are updated based on the action decided. Specifically, if $a(k) = T_i$, then the battery level at source $i$ decreases by $E_i^T(k)$, and the AoI value of its observed process $i$ becomes one (recall that a generate-at-will policy is employed); if $a(k) = H$, then the battery level at source $i$ increases by $E_i^H(k)$ and the AoI value of process $i$ increases by one; otherwise, the battery level at source $i$ does not change and the AoI value of process $i$ increases by one. Hence, the evolution of the battery level at source $i$ and the AoI value of its observed process at the destination node can be expressed, respectively, by

$$B_i(k+1) = \begin{cases} B_i(k) - E_i^T(k), & \text{if } a(k) = T_i, \\ \min\left\{B_{\max,i},\, B_i(k) + E_i^H(k)\right\}, & \text{if } a(k) = H, \\ B_i(k), & \text{otherwise}, \end{cases} \quad (3)$$

$$A_i(k+1) = \begin{cases} 1, & \text{if } a(k) = T_i, \\ \min\left\{A_{\max,i},\, A_i(k) + 1\right\}, & \text{otherwise}. \end{cases} \quad (4)$$
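The per-slot dynamics in (3) and (4) can be sketched as a pure transition function. The helper below is illustrative (its name and signature are not from the paper); it takes the per-node energies from Eqs. (1) and (2) for the current slot as inputs.

```python
def step(batteries, aois, action, e_tx, e_harv, b_max, a_max):
    """One slot of the dynamics in Eqs. (3)-(4).

    action: 'H' for WET, or an integer i for the transmit action T_i.
    e_tx[i] / e_harv[i]: this slot's Eq.-(2) / Eq.-(1) energies for node i.
    Returns the next (batteries, aois); the inputs are not mutated.
    """
    next_b, next_a = [], []
    for i in range(len(batteries)):
        if action == i:                       # a(k) = T_i: node i transmits
            next_b.append(batteries[i] - e_tx[i])
            next_a.append(1)                  # fresh update: AoI resets to 1
        elif action == 'H':                   # a(k) = H: every node harvests
            next_b.append(min(b_max[i], batteries[i] + e_harv[i]))
            next_a.append(min(a_max[i], aois[i] + 1))
        else:                                 # another node transmits: idle
            next_b.append(batteries[i])
            next_a.append(min(a_max[i], aois[i] + 1))
    return next_b, next_a
```

Note that under action $H$ every AoI grows, while action $T_i$ resets only process $i$ and ages the rest, matching the case structure of (3) and (4).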

Fig. 2. AoI evolution vs. time when $N = 1$ and $A_{\max,1} = 4$.

To help visualize (4), Fig. 2 shows the AoI evolution for process 1 as a function of the actions taken over time when $N = 1$ and $A_{\max,1} = 4$.
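A trace of this kind can be reproduced by replaying Eq. (4) on a hand-picked action sequence; the schedule below is an arbitrary illustration, not the exact one in Fig. 2.

```python
A_MAX = 4  # A_max,1 = 4, as in Fig. 2

def aoi_trace(actions, a0=1):
    """Replay Eq. (4) for a single source (N = 1): 'T' resets the AoI
    to 1, and any other action ages it by one, clipped at A_MAX."""
    trace = [a0]
    for a in actions:
        trace.append(1 if a == 'T' else min(A_MAX, trace[-1] + 1))
    return trace

# Harvest twice, transmit once, then idle long enough to saturate at A_MAX:
print(aoi_trace("HHTHHHH"))  # [1, 2, 3, 1, 2, 3, 4, 4]
```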

III. PROBLEM FORMULATION AND PROPOSED SOLUTION

A. Problem Statement

Our objective is to obtain the optimal policy, which specifies the actions taken at the different states of the system over time, achieving the minimum average weighted sum-AoI, i.e., the weighted sum of the AoI values for the different processes at the destination. Particularly, a policy $\pi = \{\pi_0, \pi_1, \cdots\}$ is a sequence of probability measures of actions over the state space. For instance, the probability measure $\pi_k$ specifies the probability of taking action $a(k)$, conditioned on the sequence $s^k$, which includes the past states and actions and the current state, i.e., $s^k \triangleq \{s(0), a(0), \cdots, s(k-1), a(k-1), s(k)\}$. Formally, $\pi_k$ specifies $\mathbb{P}(a(k) \mid s^k)$ such that $\sum_{a(k) \in \mathcal{A}(s(k))} \mathbb{P}(a(k) \mid s^k) = 1$, where $\mathcal{A}(s(k))$ is the set of possible actions at state $s(k) \in \mathcal{S}^a$. The policy $\pi$ is said to be stationary when $\mathbb{P}(a(k) \mid s^k) = \mathbb{P}(a(k) \mid s(k)), \forall k$, and is called deterministic if $\mathbb{P}(a(k) \mid s^k) = 1$ for some $a(k) \in \mathcal{A}(s(k))$. Under a policy $\pi$, the long-term average AoI of process $i$ at the destination, starting from an initial state $s(0)$, can be expressed as

$$\bar{A}_i^{\pi} \triangleq \limsup_{K \to \infty} \frac{1}{K+1} \sum_{k=0}^{K} \mathbb{E}\left[A_i(k) \mid s(0)\right], \quad (5)$$

where the expectation is taken with respect to the channel conditions and the policy. Our goal is to find the optimal policy $\pi^{\star}$, referred to as the age-optimal policy, that minimizes the average weighted sum-AoI such that

$$\pi^{\star} = \arg\min_{\pi} \sum_{i \in \mathcal{I}} \theta_i \bar{A}_i^{\pi}, \quad (6)$$

where $\theta_i \geq 0$ and $\sum_{i=1}^{N} \theta_i = 1$. Here, $\theta_i$ is a weight accounting for the importance of process $i$ at the destination node. Our intention behind using a weighted average cost function is to provide a generic problem formulation that can account for the potential differences between the physical processes observed by the source nodes, in terms of the impact of each process's AoI value on the optimal actions taken at the destination node. In particular, the weights can be chosen according to the importance of the AoI values of

##### Citations


TL;DR: This work investigates the age performance of uncoded and coded schemes in the presence of stragglers under i.i.d. exponential transmission delays and shows that asymptotically MM-MDS coded scheme outperforms the other schemes.

Abstract: We consider a status update system in which the update packets need to be processed to extract the embedded useful information. The source node sends the acquired information to a computation unit (CU) which consists of a master node and $n$ worker nodes. The master node distributes the received computation task to the worker nodes. Upon computation, the master node aggregates the results and sends them back to the source node to keep it updated . We investigate the age performance of uncoded and coded (repetition coded, MDS coded, and multi-message MDS (MM-MDS) coded) schemes in the presence of stragglers under i.i.d. exponential transmission delays and i.i.d shifted exponential computation times. We show that asymptotically MM-MDS coded scheme outperforms the other schemes. Furthermore, we characterize the optimal codes such that the average age is minimized.

59 citations


TL;DR: The goal in this paper is to characterize the spatial distribution of the mean AoI observed by the SD pairs by modeling them as a bipolar Poisson point process (PPP) by efficiently capturing the interference-induced coupling in the activities of theSD pairs.

Abstract: This paper considers a large-scale wireless network consisting of source-destination (SD) pairs, where the sources send time-sensitive information, termed status updates , to their corresponding destinations in a time-slotted fashion. We employ age of information (AoI) for quantifying the freshness of the status updates measured at the destination nodes under the preemptive and non-preemptive queueing disciplines with no storage facility. The non-preemptive queue drops the newly arriving updates until the update in service is successfully delivered, whereas the preemptive queue replaces the current update in service with the newly arriving update, if any. As the update delivery rate for a given link is a function of the interference field seen from the receiver, the temporal mean AoI can be treated as a random variable over space. Our goal in this paper is to characterize the spatial distribution of the mean AoI observed by the SD pairs by modeling them as a bipolar Poisson point process (PPP). Towards this objective, we first derive accurate bounds on the moments of success probability while efficiently capturing the interference-induced coupling in the activities of the SD pairs. Using this result, we then derive tight bounds on the moments as well as the spatial distribution of peak AoI (PAoI). Our numerical results verify our analytical findings and demonstrate the impact of various system design parameters on the mean PAoI.

55 citations


TL;DR: The quality of an update is model as an increasing function of the processing time spent while generating the update at the transmitter, and distortion is used as a proxy for quality, and model distortion as a decreasing function of processing time.

Abstract: We consider an information update system where an information receiver requests updates from an information provider in order to minimize its age of information. The updates are generated at the information provider (transmitter) as a result of completing a set of tasks such as collecting data and performing computations. We refer to this as the update generation process. We model the $quality$ of an update as an increasing function of the processing time spent while generating the update at the transmitter. In particular, we use $distortion$ as a proxy for $quality$, and model distortion as a decreasing function of processing time. Processing longer at the transmitter results in a better quality (lower distortion) update, but it causes the update to age. We determine the age-optimal policies for the update request times at the receiver and the update processing times at the transmitter subject to a minimum required quality (maximum allowed distortion) constraint on the updates. For the required quality constraint, we consider the cases of constant maximum allowed distortion constraints, as well as age-dependent maximum allowed distortion constraints.

43 citations


TL;DR: A cache updating system with a source, a cache and a user, an alternating maximization based method to find the update rates for the cache and for the user is provided to maximize the freshness of the files at the user.

Abstract: We consider a cache updating system with a source, a cache and a user. There are $n$ files. The source keeps the freshest version of the files which are updated with known rates $\lambda _{i}$ . The cache downloads and keeps the freshest version of the files from the source with rates $c_{i}$ . The user gets updates from the cache with rates $u_{i}$ . When the user gets an update, it either gets a fresh update from the cache or the file at the cache becomes outdated by a file update at the source in which case the user gets an outdated update. We find an analytical expression for the average freshness of the files at the user. Next, we generalize our setting to the case where there are multiple caches in between the source and the user, and find the average freshness at the user. We provide an alternating maximization based method to find the update rates for the cache(s), $c_{i}$ , and for the user, $u_{i}$ , to maximize the freshness of the files at the user. We observe that for a given set of update rates for the user (resp. for the cache), the optimal rate allocation policy for the cache (resp. for the user) is a threshold policy , where the optimal update rates for rapidly changing files at the source may be equal to zero. Finally, we consider a system where multiple users are connected to a single cache and find update rates for the cache and the users to maximize the total freshness over all users.

42 citations


TL;DR: In this article, the AoI-optimal policy for a single source-destination pair in which a radio frequency (RF)-powered source sends status updates about some physical process to a destination node was analyzed.

Abstract: This paper characterizes the structure of the Age of Information (AoI)-optimal policy in wireless powered communication systems while accounting for the time and energy costs of generating status updates at the source nodes. In particular, for a single source-destination pair in which a radio frequency (RF)-powered source sends status updates about some physical process to a destination node, we minimize the long-term average AoI at the destination node. The problem is modeled as an average cost Markov Decision Process (MDP) in which, the generation times of status updates at the source, the transmissions of status updates from the source to the destination, and the wireless energy transfer (WET) are jointly optimized. After proving the monotonicity property of the value function associated with the MDP, we analytically demonstrate that the AoI-optimal policy has a threshold-based structure w.r.t. the state variables. Our numerical results verify the analytical findings and reveal the impact of state variables on the structure of the AoI-optimal policy. Our results also demonstrate the impact of system design parameters on the optimal achievable average AoI as well as the superiority of our proposed joint sampling and updating policy w.r.t. the generate-at-will policy.

39 citations

##### References


01 Jan 1988. TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning; the discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications.

Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations

TL;DR: This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

Abstract: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. 

23,074 citations

26 May 2013. TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs.

Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

7,316 citations


25 Mar 2012. TL;DR: A time-average age metric is employed for the performance evaluation of status update systems, and the existence of an optimal rate at which a source must generate its information to keep its status as timely as possible at all its monitors is shown.

Abstract: Increasingly ubiquitous communication networks and connectivity via portable devices have engendered a host of applications in which sources, for example people and environmental sensors, send updates of their status to interested recipients. These applications desire status updates at the recipients to be as timely as possible; however, this is typically constrained by limited network resources. In this paper, we employ a time-average age metric for the performance evaluation of status update systems. We derive general methods for calculating the age metric that can be applied to a broad class of service systems. We apply these methods to queue-theoretic system abstractions consisting of a source, a service facility and monitors, with the model of the service facility (physical constraints) a given. The queue discipline of first-come-first-served (FCFS) is explored. We show the existence of an optimal rate at which a source must generate its information to keep its status as timely as possible at all its monitors. This rate differs from those that maximize utilization (throughput) or minimize status packet delivery delay. While our abstractions are simpler than their real-world counterparts, the insights obtained, we believe, are a useful starting point in understanding and designing systems that support real time status updates.

1,879 citations