
A Reinforcement Learning Framework for Optimizing Age of Information in RF-Powered Communication Systems

04 May 2020 · IEEE Transactions on Communications (IEEE) · Vol. 68, Iss. 8, pp. 4747-4760
TL;DR: In this article, a real-time monitoring system is considered where multiple source nodes are responsible for sending update packets to a common destination node in order to maintain the freshness of information at the destination.
Abstract: In this paper, we study a real-time monitoring system in which multiple source nodes are responsible for sending update packets to a common destination node in order to maintain the freshness of information at the destination. Since it may not always be feasible to replace or recharge batteries in all source nodes, we consider that the nodes are powered through wireless energy transfer (WET) by the destination. For this system setup, we investigate the optimal online sampling policy (referred to as the age-optimal policy ) that jointly optimizes WET and scheduling of update packet transmissions with the objective of minimizing the long-term average weighted sum of Age of Information (AoI) values for different physical processes (observed by the source nodes) at the destination node, referred to as the sum-AoI . To solve this optimization problem, we first model this setup as an average cost Markov decision process (MDP) with finite state and action spaces. Due to the extreme curse of dimensionality in the state space of the formulated MDP, classical reinforcement learning algorithms are no longer applicable to our problem even for reasonable-scale settings. Motivated by this, we propose a deep reinforcement learning (DRL) algorithm that can learn the age-optimal policy in a computationally-efficient manner. We further characterize the structural properties of the age-optimal policy analytically, and demonstrate that it has a threshold-based structure with respect to the AoI values for different processes. We extend our analysis to characterize the structural properties of the policy that maximizes average throughput for our system setup, referred to as the throughput-optimal policy . Afterwards, we analytically demonstrate that the structures of the age-optimal and throughput-optimal policies are different. We also numerically demonstrate these structures as well as the impact of system design parameters on the optimal achievable average weighted sum-AoI.

Summary (4 min read)

Introduction

  • In practice, the timely delivery of the measurements to the destination nodes is greatly restricted by the limited energy budget of the source nodes and the pathloss of the wireless channel between the source and destination nodes.
  • The staleness of information status at the destination nodes increases, which eventually degrades the performance of such real-time applications.
  • This necessitates designing efficient transmission policies for freshness-aware RF-powered communication systems, which is the main objective of this paper.

B. Contributions

  • This paper studies a real-time monitoring system in which multiple source nodes are supposed to keep the status of their observed physical processes fresh at a common destination node by transmitting update packets frequently over time.
  • By analytically establishing the monotonicity property of the value function associated with the formulated MDP, the authors show that the age-optimal policy is a threshold-based policy with respect to each of the AoI values for different processes.
  • The authors' results provide several useful system design insights.
  • They show that the differences between the structures of the age-optimal and throughput-optimal policies in the single source-destination pair model mainly depend upon the AoI value of the observed process at the destination node.
  • After showing the convergence of their proposed DRL algorithm, their numerical results also demonstrate the impact of system design parameters, such as the capacity of batteries and the size of update packets, on the achievable average weighted sum-AoI.

C. Organization

  • The long-term weighted sum-AoI minimization problem is then formulated in Section III, where a DRL algorithm is proposed to obtain its solution.
  • Afterwards, the authors present their analysis used to characterize the structural properties of the age-optimal policy in Section IV.
  • The novelty of their MDP formulation lies in the use of the newly emerging concept of AoI in the objective function to quantify freshness of information, which has not been done in the other research areas.
  • The key differences between the structural properties of the age-optimal and throughput-optimal policies in the single source-destination pair model are demonstrated in Section V. Section VI verifies their analytical findings from Sections IV and V as well as evaluates the performance of their proposed DRL algorithm numerically.

A. Network Model

  • Each source node is supposed to keep the information status of its observed process at a destination node (for instance, a cellular BS) fresh by sending status update packets over time.
  • The destination node is assumed to have a stable energy source whereas each source node is equipped with an RF energy harvesting circuitry as its only source of energy.
  • When Ai(k) reaches Amax,i, it means that the available information at the destination nodes about process i is too stale to be of any use.
  • In addition, this assumption makes the AoI variable of each process take only a finite number of values, i.e., the AoI state space of each process is finite.
  • The locations of the source nodes are known a priori, and hence their average channel power gains are preestimated and known at the destination node.

B. State and Action Spaces

  • At the beginning of an arbitrary time slot k, the state s_i(k) of a source node i is characterized by its battery level, the AoI of its observed process i at the destination, and its uplink and downlink channel power gains from the destination node, i.e., s_i(k) ≜ (B_i(k), A_i(k), g_i(k), h_i(k)) ∈ S_i^a (see the sketch after this list).
  • Note that S_i^a is the state space which contains all the combinations of B_i(k), A_i(k), g_i(k) and h_i(k), where the superscript a indicates that it is defined for the average AoI minimization problem.
  • The authors assume that P is sufficiently large such that the energy harvested at each source node due to uplink data transmissions by the other source nodes is negligible.
  • When a(k) = T_i, slot k is allocated for information transmission where source i sends an update packet about its observed process to the destination.
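To make this bookkeeping concrete, here is a minimal sketch of the per-source state tuple and the action set in Python; the names (SourceState, make_action_space) are ours, not the paper's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceState:
    """State s_i(k) = (B_i(k), A_i(k), g_i(k), h_i(k)) of source i."""
    battery: float   # B_i(k): available energy
    aoi: int         # A_i(k) in {1, ..., A_max,i}
    g_down: float    # g_i(k): downlink channel power gain
    h_up: float      # h_i(k): uplink channel power gain

def make_action_space(n_sources: int) -> list[str]:
    """A = {H, T_1, ..., T_N}: harvest, or schedule source i's update."""
    return ["H"] + [f"T{i}" for i in range(1, n_sources + 1)]

# A two-source system state s(k) = {s_1(k), s_2(k)} and its action space:
s_k = (SourceState(0.3, 5, 0.8, 0.6), SourceState(0.1, 2, 0.4, 0.9))
print(make_action_space(2))  # ['H', 'T1', 'T2']
```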

A. Problem Statement

  • The authors' objective is to obtain the optimal policy, which specifies the actions taken at different states of the system over time, achieving the minimum average weighted sum-AoI, i.e., the weighted sum of AoI values for different processes at the destination.
  • The authors' intention behind using a weighted average cost function is to provide a generic problem formulation that can account for the potential differences between the physical processes observed by the source nodes in terms of the impact of the AoI value of each process on the optimal actions taken at the destination node.

B. MDP Formulation

  • Clearly, an upper bound to the performance of the continuous system can be obtained by reversing the use of the floor and ceiling in the definitions of e_i^T(k) and e_i^H(k).
  • These conditional probabilities are determined according to the Markovian fading channel model considered in the problem.
  • Clearly, for a reasonable number of both the discrete values for each state variable (i.e., Amax,i, Gi, Hi, and bmax,i + 1) and the source nodes deployed in the network (N), the state space will have a massive number of states.
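The curse of dimensionality mentioned above is easy to quantify: the joint state space is the product of the per-source state-variable cardinalities, raised to the power N. A back-of-the-envelope check with illustrative (not paper-specific) discretization sizes:

```python
def num_states(a_max: int, n_g: int, n_h: int, b_levels: int, n: int) -> int:
    """|S^a| = (A_max * G * H * (b_max + 1))^N for N identical sources."""
    return (a_max * n_g * n_h * b_levels) ** n

print(num_states(a_max=50, n_g=8, n_h=8, b_levels=101, n=1))  # 323,200
print(num_states(a_max=50, n_g=8, n_h=8, b_levels=101, n=3))  # ~3.4 * 10^16
```

Even three sources with modest quantization already put tabular methods such as RVIA out of reach, which motivates the function-approximation (DRL) route.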

C. Deep Reinforcement Learning for Optimizing AoI

  • DRL is suitable for their problem since it can reduce the dimensionality of the large state space while learning the optimal policy at the same time [55].
  • By applying the update step in (15), the system can always exploit the learning process by taking the action which minimizes the long-term average cost, i.e., the action that minimizes the Q-function value of the current state.
  • Algorithm 1 summarizes the steps of the proposed DRL algorithm.
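Algorithm 1 and the update step (15) are not reproduced in this summary, so the following is only a generic DQN-style sketch of the idea: a neural network approximates Q(s, a), and each training step regresses Q toward the one-step cost plus the minimum next-state Q-value. The discount factor here is a common practical surrogate for the paper's average-cost criterion, and all architecture and hyperparameter choices are placeholders.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.f(s)

def td_step(q: QNet, q_target: QNet, opt: torch.optim.Optimizer,
            s, a, cost, s_next, gamma: float = 0.99) -> float:
    """One cost-minimizing Q-learning step:
    target = cost + gamma * min_a' Q_target(s', a')."""
    with torch.no_grad():
        target = cost + gamma * q_target(s_next).min(dim=1).values
    pred = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def act(q: QNet, s: torch.Tensor, eps: float, n_actions: int) -> int:
    """Epsilon-greedy over Q, picking the cost-minimizing action."""
    if torch.rand(()) < eps:
        return int(torch.randint(n_actions, ()))
    return int(q(s.unsqueeze(0)).argmin(dim=1))
```

Acting with argmin (rather than argmax) reflects that the Q-values here encode costs (AoI) rather than rewards.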

IV. STRUCTURAL PROPERTIES OF THE AGE-OPTIMAL POLICY

  • More specifically, the optimal actions at some states can now be directly determined based on the optimal actions taken at some other states (due to the threshold-based structure of the age-optimal policy), and hence the computational complexity of the policy improvement step can be greatly reduced (see the sketch after this list).
  • It is also worth noting that the case of N = 1 in their system setup refers to the classical single source-destination pair model studied in most prior works on AoI in the literature, e.g., [4], [6], [8]–[13].
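As a small illustration of the computational saving from the threshold structure (our construction, not the paper's): fixing the battery level and channel state and sweeping the AoI axis lets one stop at the first AoI value where transmitting is optimal, since it must then be optimal for all larger AoI values.

```python
def fill_policy_by_aoi_threshold(is_transmit_optimal, a_max: int) -> list[str]:
    """is_transmit_optimal(aoi) evaluates one state (the expensive step).

    Under a threshold structure, the first True settles all larger AoIs."""
    policy = []
    for aoi in range(1, a_max + 1):
        if is_transmit_optimal(aoi):
            policy.extend(["T"] * (a_max - aoi + 1))
            break
        policy.append("H")
    return policy

# Hypothetical threshold at AoI = 7 with A_max = 10:
print(fill_policy_by_aoi_threshold(lambda a: a >= 7, a_max=10))
# -> ['H', 'H', 'H', 'H', 'H', 'H', 'T', 'T', 'T', 'T']
```

A binary search along the AoI axis would shrink this further, to O(log A_max) state evaluations per (battery, channel) slice.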

V. AGE-OPTIMAL POLICY VS. THROUGHPUT-OPTIMAL POLICY

  • The authors aim to analytically compare the structural properties of the age-optimal and the throughput-optimal policies.
  • Due to its higher tractability (as demonstrated in the previous section), the authors will focus on the single source-destination pair model for this comparison.
  • Specifically, the authors first formulate the average throughput maximization problem for the case of N = 1 in the system setup presented in Section II.
  • Afterwards, the authors investigate some structural properties of the throughput-optimal policy from which they highlight the differences between the structures of the age-optimal and throughput-optimal policies.

A. Average Throughput Maximization Formulation and Proposed Solution

  • S_d^r is the state space of the discrete model for the throughput maximization problem, i.e., when the battery and channel power gain are discretized.
  • Note that the AoI is not included now in the state of the system.
  • For such a single source-destination pair model, the action space is defined as A ≜ {H, T_1}, where the source node can either harvest energy or transmit a packet of size S at each time slot.
  • Hence, the average throughput maximization problem is modeled as a finite-state finite-action MDP for which there exists an optimal stationary deterministic policy [53].
  • Clearly, Q(s, a) represents the expected reward resulting from taking action a in state s.
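As a concrete illustration of solving such a finite-state, finite-action average-reward MDP, here is a generic relative value iteration (RVIA) sketch in NumPy; the transition kernel P and reward r are placeholders to be filled with the model's dynamics, and this is not the paper's exact implementation.

```python
import numpy as np

def rvia(P: np.ndarray, r: np.ndarray, tol: float = 1e-9,
         max_iter: int = 100_000):
    """Relative value iteration for an average-reward MDP.

    P: transitions, shape (S, A, S); r: rewards, shape (S, A).
    Returns (gain, bias vector h, greedy deterministic policy)."""
    n_states, _ = r.shape
    h = np.zeros(n_states)
    ref = 0                       # arbitrary reference state
    for _ in range(max_iter):
        q = r + P @ h             # q[s, a] = r(s, a) + sum_s' P(s'|s, a) h(s')
        h_new = q.max(axis=1)
        h_new -= h_new[ref]       # renormalize so h stays bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    q = r + P @ h
    return q.max(axis=1)[ref], h, q.argmax(axis=1)
```

The returned policy is stationary and deterministic, matching the existence result cited above [53]; the age-minimization problem has the same form with max replaced by min over a cost.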

B. Structural Properties of the Throughput-optimal Policy

  • By using (31), the result can be obtained using the same approach used in the proof of Lemma 2, i.e., by applying mathematical induction to the iterations of the VIA.
  • This result can be obtained using the same approach used in the proof of Theorem 2.
  • The authors' results in Theorems 2 and 3 clearly demonstrate that the structures of the age-optimal and throughput-optimal policies are different, which will also be verified in the numerical results section.

VI. NUMERICAL RESULTS

  • The authors verify their analytical results derived in Section IV, and show the performance of their proposed DRL algorithm in terms of the achievable average weighted sum-AoI as a function of system design parameters.
  • The downlink and uplink channel power gains between the destination and source nodes are modeled as g_i = h_i = Γψ²d_i^{-ν}, where Γ is the signal power gain at a reference distance of 1 meter, ψ² ∼ exp(1) denotes the small-scale fading gain, and d_i^{-ν} represents standard power-law path loss with exponent ν (a sampling sketch follows this list).
  • In addition, for the single source-destination pair model in Figs. 6 and 7, the points located inside the solid polygon refer to the states for which it is possible to transmit an update packet (take the T_1 action), i.e., for each of those states b_1 ≥ e_1^T.
  • Furthermore, the points located inside the dotted polygon represent the set S^a_{th,d} (defined in Remark 2), i.e., the set of states over which the age-optimal policy has a threshold-based structure.
  • Note that the dotted polygon is the same as the solid one in Fig.
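For reference, drawing channel gains from the model in the first bullet above (ψ² ∼ exp(1), i.e., Rayleigh small-scale fading) is a one-liner; the parameter values below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_gain(gamma_ref: float, d: float, nu: float) -> float:
    """g_i = h_i = Gamma * psi^2 * d^(-nu), with psi^2 ~ exp(1)."""
    return gamma_ref * rng.exponential(1.0) * d ** (-nu)

g = channel_gain(gamma_ref=1e-3, d=10.0, nu=2.5)  # example values only
```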

B. Comparison of the Structures of the Age-optimal and Throughput-optimal Policies

  • Note that the slight gap between the optimal value and the average AoI achievable by the DRL algorithm is due to using an ε-greedy policy in the DRL algorithm (required for exploring all the state-action pairs while learning the optimal policy, and hence guaranteeing the convergence of the algorithm).
  • It is observed that the achievable average sum-AoI monotonically decreases as the size of update packets decreases and/or the capacity of batteries increases.

VII. CONCLUSION

  • The authors have proposed an implementable age-optimal sampling strategy for designing freshness-aware RF-powered communication systems.
  • To obtain the age-optimal policy, the problem was modeled as an average cost MDP with finite state and action spaces.
  • Multiple system design insights were drawn from their numerical results.
  • They showed that the structures of the age-optimal and throughput-optimal policies in the single source-destination pair model are similar when the AoI value is relatively small (i.e., there is no urgency to update the information status at the destination node).
  • The authors' results also revealed that the optimal average weighted sum-AoI is monotonically increasing (decreasing) with respect to the size of update packets (the capacity of batteries at the source nodes).


A Reinforcement Learning Framework for
Optimizing Age of Information in RF-Powered
Communication Systems
Mohamed A. Abd-Elmagid, Harpreet S. Dhillon and Nikolaos Pappas
The self-archived postprint version of this journal article is available at Linköping
University Institutional Repository (DiVA):
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-169252
N.B.: When citing this work, cite the original publication.
Abd-Elmagid, M. A., Dhillon, H. S., Pappas, N., (2020), A Reinforcement Learning Framework for
Optimizing Age of Information in RF-Powered Communication Systems, IEEE Transactions on
Communications, 68(8), 4747-4760. https://doi.org/10.1109/TCOMM.2020.2991992
Original publication available at:
https://doi.org/10.1109/TCOMM.2020.2991992
Copyright: Institute of Electrical and Electronics Engineers
http://www.ieee.org/index.html
©2020 IEEE. Personal use of this material is permitted. However, permission to
reprint/republish this material for advertising or promotional purposes or for
creating new collective works for resale or redistribution to servers or lists, or to reuse
any copyrighted component of this work in other works must be obtained from the
IEEE.

A Reinforcement Learning Framework for
Optimizing Age of Information in RF-powered
Communication Systems
Mohamed A. Abd-Elmagid, Harpreet S. Dhillon, and Nikolaos Pappas
Abstract—In this paper, we study a real-time monitoring
system in which multiple source nodes are responsible for sending
update packets to a common destination node in order to
maintain the freshness of information at the destination. Since
it may not always be feasible to replace or recharge batteries
in all source nodes, we consider that the nodes are powered
through wireless energy transfer (WET) by the destination. For
this system setup, we investigate the optimal online sampling
policy (referred to as the age-optimal policy) that jointly optimizes
WET and scheduling of update packet transmissions with the
objective of minimizing the long-term average weighted sum of
Age of Information (AoI) values for different physical processes
(observed by the source nodes) at the destination node, referred to
as the sum-AoI. To solve this optimization problem, we first model
this setup as an average cost Markov decision process (MDP)
with finite state and action spaces. Due to the extreme curse of
dimensionality in the state space of the formulated MDP, classical
reinforcement learning algorithms are no longer applicable to our
problem even for reasonable-scale settings. Motivated by this,
we propose a deep reinforcement learning (DRL) algorithm that
can learn the age-optimal policy in a computationally-efficient
manner. We further characterize the structural properties of
the age-optimal policy analytically, and demonstrate that it
has a threshold-based structure with respect to the AoI values
for different processes. We extend our analysis to characterize
the structural properties of the policy that maximizes average
throughput for our system setup, referred to as the throughput-
optimal policy. Afterwards, we analytically demonstrate that the
structures of the age-optimal and throughput-optimal policies are
different. We also numerically demonstrate these structures as
well as the impact of system design parameters on the optimal
achievable average weighted sum-AoI.
Index Terms—Age of Information, RF energy harvesting,
Markov Decision Process, Reinforcement learning.
I. INTRODUCTION
A typical real-time monitoring system consists of source
and destination nodes, where source nodes observe underlying
stochastic processes while the destination nodes keep track of
the status of these processes through status updates transmitted
(often wirelessly) by the source nodes. Examples of the source
nodes include Internet of Things (IoT) devices, aggregators
and sensors, while those of the destination nodes include cellular
base stations (BSs) [2]. The performance of many such real-
time systems and applications depends upon how fresh the
M. A. Abd-Elmagid and H. S. Dhillon are with Wireless@VT, Department
of ECE, Virginia Tech, Blacksburg, VA. Email: {maelaziz, hdhillon}@vt.edu.
N. Pappas is with the Department of Science and Technology, Linköping University, SE-60174 Norrköping, Sweden. Email: nikolaos.pappas@liu.se. The support of the U.S. NSF (Grant CPS-1739642) is gratefully acknowledged.
This paper was presented in part at the IEEE Globecom, 2019 [1].
status updates are when they reach the destination nodes.
In practice, the timely delivery of the measurements to the
destination nodes is greatly restricted by the limited energy
budget of the source nodes and the pathloss of the wireless
channel between the source and destination nodes. Specifi-
cally, this could result in the loss or out-of-order reception
of the measurements at the destination nodes. Consequently,
the staleness of information status at the destination nodes
increases, which eventually degrades the performance of such
real-time applications.
Since it is highly inefficient or even impractical to replace or
recharge batteries in many source nodes, energy harvesting so-
lutions have been considered to enable a self-perpetuating op-
eration of communication networks by supplementing or even
circumventing the use of replaceable batteries in the source
nodes. Due to its ubiquity and cost efficient implementation,
radio-frequency (RF) energy harvesting has quickly emerged
as an appealing solution for charging low-power source nodes
(especially the ones that are deployed at difficult-to-reach
places) [3]. This necessitates designing efficient transmission
policies for freshness-aware RF-powered communication sys-
tems, which is the main objective of this paper. Towards this
objective, we use the concept of AoI to quantify the freshness
of information at the destination nodes [4]. This raises the
obvious question of optimally scheduling packet transmissions
from these RF-powered source nodes with the objective of
minimizing the average AoI at the destination nodes, subject to
the energy causality constraints at the source nodes. To address
this question, this paper makes the first attempt, to the best
of our knowledge, to develop a reinforcement learning-based
framework in which we: i) propose a computationally-efficient
approach to characterize the age-optimal transmission policy
numerically, ii) analytically derive the structural properties of
the age-optimal policy, and iii) analytically characterize key
differences in the structural properties of the age-optimal and
throughput-optimal policies.
A. Related Work
First introduced in [4], AoI is a new metric that quantifies
the freshness of information at a destination node due to the
transmission of update packets by the source node. Formally,
AoI is defined as the time passed since the latest successfully
received update packet at the destination was generated at the
source node. Under a simple queue-theoretic model in which
randomly generated packets arrive at the source according to

a Poisson process and then are transmitted to the destination
using a first-come-first-served (FCFS) discipline, the authors
of [4] characterized the average AoI expression. Afterwards,
a series of works [5]–[12] aimed at characterizing the average
AoI and its variations (e.g., Peak Age-of-Information (PAoI)
[8]–[10] and Value of Information of Update (VoIU) [11])
for adaptations of the queueing model studied in [4]. Another
direction of research [13]–[33] focused on employing AoI as
a performance metric for different communication systems
that deal with time critical information while having lim-
ited resources, e.g., multi-server information-update systems
[14], broadcast networks [15]–[17], multi-hop networks [18],
cognitive networks [19], unmanned aerial vehicle (UAV)-
assisted communication systems [20]–[22], IoT networks [2],
[23], [24], ultra-reliable low-latency vehicular networks [25],
multicast networks [26], decentralized random access schemes
[32], and multi-state time-varying networks [33]. Particularly,
the objective of this research direction was to characterize
optimal policies that minimize average AoI, referred to as age-
optimal policies, by applying different tools from optimization
theory. Note that [13]–[33] did not consider energy harvesting
as a powering source for the source nodes.
Different from [13]–[33], another line of research [34]–[48]
focused on the class of problems in which the source node is
powered by energy harvesting under various system settings.
The objective of this line of research was to investigate age-
optimal offline/online policies for update packet transmissions
subject to the energy causality constraint at the source under
various assumptions regarding the battery size, transmission
time of update packets and channel modeling. Specifically,
the infinite battery capacity case was studied in [34]–[37],
[44] whereas [38]–[43], [45], [46] considered the case of
finite battery capacity. Different from [36]–[41] where it was
assumed that each update packet could be transmitted to the
destination instantly subject to the energy causality constraint,
[34], [43], [44] considered stochastic transmission time and
[35], [45], [46] studied the non-zero fixed transmission time
case. While [34]–[36], [38]–[42], [45] considered error-free
channel models, i.e., every update packet transmission is
successfully received at the destination, a noisy channel model
was considered in [37], [43], [44], [46]. A common model
of the energy harvesting process in [34]–[45] is an external
point process (e.g., Poisson process) independent from all the
system design parameters. In contrast, when the source node is
powered by RF energy harvesting, as considered in this paper,
the energy harvested at the source is a function of the temporal
variation of the channel state information (CSI). This, in turn,
means that the age-optimal policies studied in [34]–[44] are not
directly applicable to this setting. In particular, one needs to
incorporate CSI statistics in the process of decision-making,
which adds another layer of complexity to the analysis of age-
optimal policies for such settings.
Before going into more details about our contributions, it is
instructive to note that the problem of age-optimal policy in
wireless powered communications systems has been studied
very recently in [47], [48] for a single source-destination pair
model. However, neither of the policies proposed in [47], [48]
took into account the evolution of the battery level at the
Fig. 1. An illustration of the system setup.
source and the variation of CSI over time in the process of
decision-making. It is also worth noting that [22], [46], [49]–
[52] have recently applied reinforcement learning-based algo-
rithms to characterize the age-optimal policy. However, none
of these works applied a DRL-based algorithm to efficiently
design freshness-aware RF-powered communication systems.
Different from these, we consider a more general model in
which multiple RF-powered source nodes are deployed to
potentially sense different physical processes. For this setting,
we provide a novel reinforcement learning framework in which
we: 1) develop a DRL-based algorithm that characterizes the
online age-optimal sampling policy while considering the dy-
namics of batteries, AoI values for different processes and CSI,
and 2) analytically characterize key differences between the
structures of the online age-optimal and throughput-optimal
policies. More details on our contributions are provided next.
B. Contributions
This paper studies a real-time monitoring system in which
multiple source nodes are supposed to keep the status of their
observed physical processes fresh at a common destination
node by transmitting update packets frequently over time.
Furthermore, each source node is assumed to be powered by
harvesting energy from RF signals broadcast by the destination
node. For this setup, our main contributions are listed next.
A novel DRL algorithm for optimizing average weighted
sum-AoI. Given an importance weight for each physical pro-
cess at the destination node, we study the long-term average
weighted sum-AoI (i.e., sum of AoI values for different
processes at the destination node) minimization problem in
which WET and scheduling of update packet transmissions
from different source nodes are jointly optimized. To tackle
this problem, we model it as an average cost MDP with finite
state and action spaces. In particular, the MDP determines
whether each time slot should be allocated for WET or an
update packet transmission from one of the source nodes. This
decision is based on the available energies at the source nodes
(or their battery levels), the AoI values of different processes
at the destination node, and the CSI. Due to the extreme curse
of dimensionality in the state space of the formulated MDP, it
is computationally infeasible to characterize the age-optimal
policy using classical reinforcement learning algorithms [53],
[54] such as relative value iteration algorithm (RVIA), value
iteration algorithm (VIA) or policy iteration algorithm (PIA).
To overcome this hurdle, we propose a novel DRL algorithm

that can learn the age-optimal policy in a computationally-
efficient manner.
Analytical characterization for the structural properties
of the age-optimal policy. By analytically establishing the
monotonicity property of the value function associated with
the formulated MDP, we show that the age-optimal policy is a
threshold-based policy with respect to each of the AoI values
for different processes.¹ Moreover, for the single source-
destination pair model (i.e., the case of having a single source
node), our results demonstrate that the age-optimal policy is
a threshold-based policy with respect to each of the system
state variables, i.e., the battery level at the source, the AoI at
the destination and the channel power gains. This result is of
interest on its own because of the relevance of the source-
destination pair model in a plethora of applications, such as
predicting and controlling forest fires, safety of an intelligent
transportation system, and efficient energy utilization in future
smart homes. Not surprisingly, this model has been of interest
in a large proportion of the prior work on AoI. Furthermore,
this result allows us to analytically demonstrate the key differ-
ences between the structures of the age-optimal and throughput-optimal policies.
System design insights. Our results provide several useful
system design insights. For instance, they show that the
differences between the structures of the age-optimal and
throughput-optimal policies in the single source-destination
pair model mainly depend upon the AoI value of the observed
process at the destination node. In particular, while the age-
optimal and throughput-optimal policies have different struc-
tures when the AoI value is large, these differences start to
vanish as the AoI value decreases. After showing the conver-
gence of our proposed DRL algorithm, our numerical results
also demonstrate the impact of system design parameters, such
as the capacity of batteries and the size of update packets, on
the achievable average weighted sum-AoI. Specifically, they
reveal that the achievable average weighted sum-AoI by the
DRL algorithm is monotonically decreasing (monotonically
increasing) with the capacity of batteries (the size of update
packets).
C. Organization
The rest of the paper is organized as follows. Section II
presents our system model. The long-term weighted sum-AoI
minimization problem is then formulated in Section III, where
a DRL algorithm is proposed to obtain its solution. Afterwards,
we present our analysis used to characterize the structural
properties of the age-optimal policy in Section IV. Using
the analytical results derived in Section IV, the key differ-
ences between the structural properties of the age-optimal and
throughput-optimal policies in the single source-destination
¹Note that constructing a threshold-based optimal policy under the analyt-
ical framework of MDPs is common in other research areas (such as power
control and distributed detection) as well. However, the novelty of our MDP
formulation lies in the use of the newly emerging concept of AoI in the
objective function to quantify freshness of information, which has not been
done in the other research areas. This process of decision-making is performed
while accounting for various system design parameters (i.e., the battery levels,
the AoI values at the destination node, and the CSI) as system state variables.
pair model are demonstrated in Section V. Section VI verifies
our analytical findings from Sections IV and V as well as
evaluates the performance of our proposed DRL algorithm
numerically. Finally, Section VII concludes the paper.
II. SYSTEM MODEL
A. Network Model
We study a real-time monitoring system in which a set I of
N source nodes is deployed to observe potentially different
physical processes, such as temperature or humidity. Each
source node is supposed to keep the information status of its
observed process at a destination node (for instance, a cellular
BS) fresh by sending status update packets over time. In the
context of IoT networks, the source node could refer to a
single IoT device or an aggregator located near a group of IoT
devices, which transmits update packets collected from them to
the destination node. The destination node is assumed to have
a stable energy source whereas each source node is equipped
with an RF energy harvesting circuitry as its only source of
energy. In particular, the source nodes harvest energy from
the RF signals broadcast by the destination in the downlink
such that the energy harvested at source node i is stored in
a battery with finite capacity B_max,i Joules. The source and
destination nodes are assumed to have a single antenna each
and operate over the same frequency channel. Hence, at a
given time instant, each source node cannot simultaneously
harvest wireless energy in downlink and transmit data in
uplink.
We consider a discrete time horizon composed of slots of
unit length (without loss of generality) where slot k = 0, 1, . . .
corresponds to the time duration [k, k + 1). Denote by B_i(k) and A_i(k) the amount of available energy at source node i and the AoI of its observed process i at the destination, respectively, at the beginning of time slot k. We assume that A_i(k) is upper bounded by a finite value A_max,i which can be chosen to be arbitrarily large, i.e., A_i(k) ∈ {1, 2, · · · , A_max,i}.
When A_i(k) reaches A_max,i, it means that the available
information at the destination nodes about process i is too stale
to be of any use. In addition, this assumption makes the AoI
variable of each process only take finite number of values, i.e.,
the AoI state space of each process is finite. This will facilitate
the solution of the MDP, as will be clarified in the next section. Let g_i(k) and h_i(k) denote the downlink and uplink channel power
gains between the destination and source node i over slot k,
respectively. The downlink and uplink channels are assumed
to be affected by quasi-static flat fading, i.e., they remain
constant over a time slot but change independently from one
slot to another. The locations of the source nodes are known
a priori, and hence their average channel power gains are pre-
estimated and known at the destination node. In particular, at
the beginning of an arbitrary time slot, the destination node
has perfect knowledge about the channel power gains in that
slot, and only a statistical knowledge for future slots. This is
a very reasonable assumption for many IoT applications.
B. State and Action Spaces
At the beginning of an arbitrary time slot k, the state s_i(k) of a source node i is characterized by its battery level, the AoI of its observed process i at the destination, and its uplink and downlink channel power gains from the destination node, i.e., $s_i(k) \triangleq (B_i(k), A_i(k), g_i(k), h_i(k)) \in \mathcal{S}_i^a$. Note that $\mathcal{S}_i^a$ is the state space which contains all the combinations of B_i(k), A_i(k), g_i(k) and h_i(k), where the superscript a indicates that it is defined for the average AoI minimization problem. The state of the system at slot k is then given by $s(k) = \{s_i(k)\}_{i\in\mathcal{I}} \in \mathcal{S}^a$, where $\mathcal{S}^a$ is the system state space. Based on s(k), the action taken at slot k is given by $a(k) \in \mathcal{A} \triangleq \{H, T_1, T_2, \cdots, T_N\}$, as illustrated in Fig. 1. When a(k) = H, slot k is dedicated for WET where the destination broadcasts an RF energy signal in the downlink to charge the batteries at the source nodes. Particularly, the amount of energy harvested by an arbitrary source node i can be expressed as

$$E_i^H(k) = \eta P g_i(k), \qquad (1)$$
where η is the efficiency of the energy harvesting circuitry and
P is the average transmit power by the destination. We assume
that P is sufficiently large such that the energy harvested at
each source node due to uplink data transmissions by the other
source nodes is negligible. On the other hand, when a(k) = T_i, slot k is allocated for information transmission where source i
sends an update packet about its observed process to the des-
tination. We consider a generate-at-will policy [13], where the
source scheduled for transmission generates an update packet
at the beginning of the time slot whenever that slot is allocated
for information transmission. According to Shannon’s formula,
when the energy consumed by source i to transmit an update packet of size S in slot k is $E_i^T(k)$, its maximum reliable transmission rate is $\log_2\big(1 + \frac{h_i(k) E_i^T(k)}{\sigma^2}\big)$ bits/Hz (recall that the slot length is unity), where $\sigma^2$ is the noise power at the destination. Hence, the action T_i can only be decided if the battery level at source i satisfies the following condition

$$B_i(k) \geq E_i^T(k) = \frac{\sigma^2}{h_i(k)}\left(2^{\bar{S}} - 1\right). \qquad (2)$$
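Equations (1) and (2) translate directly into code. The following is a minimal sketch, with all function and parameter names ours and values left to the caller: it computes the harvested energy of a WET slot and the minimum energy that makes action T_i feasible.

```python
def harvested_energy(eta: float, p_avg: float, g: float) -> float:
    """Eq. (1): E_i^H(k) = eta * P * g_i(k)."""
    return eta * p_avg * g

def tx_energy(sigma2: float, h: float, s_bar: float) -> float:
    """Eq. (2): E_i^T(k) = (sigma2 / h_i(k)) * (2^S_bar - 1)."""
    return (sigma2 / h) * (2.0 ** s_bar - 1.0)

def can_transmit(battery: float, sigma2: float, h: float, s_bar: float) -> bool:
    """Action T_i is feasible only if B_i(k) >= E_i^T(k)."""
    return battery >= tx_energy(sigma2, h, s_bar)
```

Note the inverse dependence on the uplink gain h_i(k): a weak channel inflates the energy needed for a fixed packet size, which is exactly what couples the CSI to the scheduling decisions.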
In every time slot, the battery level at each source node and
the AoI values for different processes at the destination are
updated based on the action decided. Specifically, if a(k) = T_i, then the battery level at source i decreases by $E_i^T(k)$, and the AoI value of its observed process i becomes one (recall that a generate-at-will policy is employed); if a(k) = H, then the battery level at source i increases by $E_i^H(k)$ and the AoI value of process i increases by one; otherwise, the battery level at source i does not change and the AoI value of process i increases by one. Hence, the evolution of the battery level at source i and the AoI value of its observed process at the destination node can be expressed, respectively, by

$$B_i(k+1) = \begin{cases} B_i(k) - E_i^T(k), & \text{if } a(k) = T_i,\\ \min\!\left\{B_{\max,i},\, B_i(k) + E_i^H(k)\right\}, & \text{if } a(k) = H,\\ B_i(k), & \text{otherwise,} \end{cases} \qquad (3)$$

$$A_i(k+1) = \begin{cases} 1, & \text{if } a(k) = T_i,\\ \min\left\{A_{\max,i},\, A_i(k) + 1\right\}, & \text{otherwise.} \end{cases} \qquad (4)$$
Fig. 2. AoI evolution vs. time when N = 1 and A_max,1 = 4.

To help visualize (4), Fig. 2 shows the AoI evolution for process 1 as a function of actions taken over time when N = 1 and A_max,1 = 4.
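The per-slot dynamics in (3) and (4) can be transcribed directly; the sketch below (argument names are ours) advances the battery and AoI of one source by a single slot.

```python
def step(action: str, i: int, b_i: float, a_i: int, e_tx: float, e_h: float,
         b_max: float, a_max: int) -> tuple[float, int]:
    """Apply Eqs. (3)-(4) to source i, where action is 'H' or 'T<j>'."""
    if action == f"T{i}":                   # source i transmits
        b_next = b_i - e_tx                 # battery drained by E_i^T(k)
        a_next = 1                          # generate-at-will: AoI resets to 1
    elif action == "H":                     # WET slot
        b_next = min(b_max, b_i + e_h)      # charge, capped at B_max,i
        a_next = min(a_max, a_i + 1)        # process keeps aging
    else:                                   # some other source transmits
        b_next = b_i
        a_next = min(a_max, a_i + 1)
    return b_next, a_next

# Example matching Fig. 2's setting (N = 1, A_max,1 = 4): an 'H' slot
# charges the battery and ages the process, saturating the AoI at 4.
print(step("H", 1, b_i=0.20, a_i=4, e_tx=0.15, e_h=0.05, b_max=0.50, a_max=4))
# -> (0.25, 4)
```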
III. PROBLEM FORMULATION AND PROPOSED SOLUTION
A. Problem Statement
Our objective is to obtain the optimal policy, which spec-
ifies the actions taken at different states of the system over
time, achieving the minimum average weighted sum-AoI, i.e.,
sum of AoI values for different processes at the destination.
Particularly, a policy $\pi = \{\pi_0, \pi_1, \cdots\}$ is a sequence of probability measures of actions over the state space. For instance, the probability measure $\pi_k$ specifies the probability of taking action a(k), conditioned on the sequence $s^k$ which includes the past states and actions, and the current state, i.e., $s^k \triangleq \{s(0), a(0), \cdots, s(k-1), a(k-1), s(k)\}$. Formally, $\pi_k$ specifies $\mathbb{P}(a(k) \mid s^k)$ such that $\sum_{a(k)\in\mathcal{A}(s(k))} \mathbb{P}(a(k) \mid s^k) = 1$, where $\mathcal{A}(s(k))$ is the set of possible actions at state $s(k) \in \mathcal{S}^a$. The policy $\pi$ is said to be stationary when $\mathbb{P}(a(k) \mid s^k) = \mathbb{P}(a(k) \mid s(k)), \forall k$, and is called deterministic if $\mathbb{P}(a(k) \mid s^k) = 1$ for some $a(k) \in \mathcal{A}(s(k))$. Under a policy $\pi$, the long-term average AoI of process i at the destination starting from an initial state s(0) can be expressed as

$$\bar{A}_i^\pi \triangleq \limsup_{K\to\infty} \frac{1}{K+1} \sum_{k=0}^{K} \mathbb{E}\left[A_i(k) \mid s(0)\right], \qquad (5)$$
where the expectation is taken with respect to the channel
conditions and the policy. Our goal is to find the optimal policy $\pi^\star$, referred to as the age-optimal policy, that minimizes the average weighted sum-AoI such that

$$\pi^\star = \arg\min_{\pi} \sum_{i\in\mathcal{I}} \theta_i \bar{A}_i^\pi, \qquad (6)$$

where $\theta_i \geq 0$ and $\sum_{i=1}^{N} \theta_i = 1$. Here, $\theta_i$ is a weight
accounting for the importance of process i at the destination
node. Our intention behind using a weighted average cost
function is to provide a generic problem formulation that can
account for the potential differences between the observed
physical processes by the source nodes in terms of the impact
of the AoI value of each process on the optimal actions
taken at the destination node. In particular, the weights can
be chosen according to the importance of the AoI values of the different processes at the destination node.
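To make the objective in (5)-(6) concrete, the sketch below estimates the weighted sum-AoI of a given policy by a long finite-horizon average, a standard Monte Carlo surrogate for the limsup in (5). It is our own illustration, not the paper's evaluation procedure: `policy` and `env_step` are caller-supplied stand-ins for a decision rule and the system dynamics of Section II.

```python
import numpy as np

def weighted_sum_aoi(policy, env_step, aoi0: np.ndarray, theta: np.ndarray,
                     horizon: int = 100_000) -> float:
    """Finite-horizon estimate of sum_i theta_i * Abar_i^pi (Eqs. (5)-(6)).

    policy(state) -> action, env_step(state, action) -> next state, where a
    state is any object exposing its AoI vector as state["aoi"].
    """
    state = {"aoi": aoi0.astype(float).copy()}
    total = 0.0
    for _ in range(horizon):
        total += float(theta @ state["aoi"])  # weighted instantaneous sum-AoI
        state = env_step(state, policy(state))
    return total / horizon
```

A longer horizon tightens the estimate of the long-term average; in practice one would also discard an initial burn-in segment so the estimate is not biased by the starting state s(0).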

Citations
Journal ArticleDOI
TL;DR: This work investigates the age performance of uncoded and coded schemes in the presence of stragglers under i.i.d. exponential transmission delays and shows that asymptotically MM-MDS coded scheme outperforms the other schemes.
Abstract: We consider a status update system in which the update packets need to be processed to extract the embedded useful information. The source node sends the acquired information to a computation unit (CU) which consists of a master node and $n$ worker nodes. The master node distributes the received computation task to the worker nodes. Upon computation, the master node aggregates the results and sends them back to the source node to keep it updated . We investigate the age performance of uncoded and coded (repetition coded, MDS coded, and multi-message MDS (MM-MDS) coded) schemes in the presence of stragglers under i.i.d. exponential transmission delays and i.i.d shifted exponential computation times. We show that asymptotically MM-MDS coded scheme outperforms the other schemes. Furthermore, we characterize the optimal codes such that the average age is minimized.

59 citations

Journal ArticleDOI
TL;DR: The goal in this paper is to characterize the spatial distribution of the mean AoI observed by the SD pairs by modeling them as a bipolar Poisson point process (PPP) by efficiently capturing the interference-induced coupling in the activities of theSD pairs.
Abstract: This paper considers a large-scale wireless network consisting of source-destination (SD) pairs, where the sources send time-sensitive information, termed status updates , to their corresponding destinations in a time-slotted fashion. We employ age of information (AoI) for quantifying the freshness of the status updates measured at the destination nodes under the preemptive and non-preemptive queueing disciplines with no storage facility. The non-preemptive queue drops the newly arriving updates until the update in service is successfully delivered, whereas the preemptive queue replaces the current update in service with the newly arriving update, if any. As the update delivery rate for a given link is a function of the interference field seen from the receiver, the temporal mean AoI can be treated as a random variable over space. Our goal in this paper is to characterize the spatial distribution of the mean AoI observed by the SD pairs by modeling them as a bipolar Poisson point process (PPP). Towards this objective, we first derive accurate bounds on the moments of success probability while efficiently capturing the interference-induced coupling in the activities of the SD pairs. Using this result, we then derive tight bounds on the moments as well as the spatial distribution of peak AoI (PAoI). Our numerical results verify our analytical findings and demonstrate the impact of various system design parameters on the mean PAoI.

55 citations

Posted Content
TL;DR: The quality of an update is modeled as an increasing function of the processing time spent while generating the update at the transmitter; distortion is used as a proxy for quality and is modeled as a decreasing function of processing time.
Abstract: We consider an information update system where an information receiver requests updates from an information provider in order to minimize its age of information. The updates are generated at the information provider (transmitter) as a result of completing a set of tasks such as collecting data and performing computations. We refer to this as the update generation process. We model the quality of an update as an increasing function of the processing time spent while generating the update at the transmitter. In particular, we use distortion as a proxy for quality, and model distortion as a decreasing function of processing time. Processing longer at the transmitter results in a better quality (lower distortion) update, but it causes the update to age. We determine the age-optimal policies for the update request times at the receiver and the update processing times at the transmitter subject to a minimum required quality (maximum allowed distortion) constraint on the updates. For the required quality constraint, we consider the cases of constant maximum allowed distortion constraints, as well as age-dependent maximum allowed distortion constraints.

43 citations

Journal ArticleDOI
TL;DR: For a cache updating system with a source, a cache and a user, an alternating-maximization-based method is provided to find the update rates for the cache and for the user that maximize the freshness of the files at the user.
Abstract: We consider a cache updating system with a source, a cache and a user. There are $n$ files. The source keeps the freshest version of the files which are updated with known rates $\lambda _{i}$ . The cache downloads and keeps the freshest version of the files from the source with rates $c_{i}$ . The user gets updates from the cache with rates $u_{i}$ . When the user gets an update, it either gets a fresh update from the cache or the file at the cache becomes outdated by a file update at the source in which case the user gets an outdated update. We find an analytical expression for the average freshness of the files at the user. Next, we generalize our setting to the case where there are multiple caches in between the source and the user, and find the average freshness at the user. We provide an alternating maximization based method to find the update rates for the cache(s), $c_{i}$ , and for the user, $u_{i}$ , to maximize the freshness of the files at the user. We observe that for a given set of update rates for the user (resp. for the cache), the optimal rate allocation policy for the cache (resp. for the user) is a threshold policy , where the optimal update rates for rapidly changing files at the source may be equal to zero. Finally, we consider a system where multiple users are connected to a single cache and find update rates for the cache and the users to maximize the total freshness over all users.

42 citations

Journal ArticleDOI
TL;DR: In this article, the AoI-optimal policy for a single source-destination pair in which a radio frequency (RF)-powered source sends status updates about some physical process to a destination node was analyzed.
Abstract: This paper characterizes the structure of the Age of Information (AoI)-optimal policy in wireless powered communication systems while accounting for the time and energy costs of generating status updates at the source nodes. In particular, for a single source-destination pair in which a radio frequency (RF)-powered source sends status updates about some physical process to a destination node, we minimize the long-term average AoI at the destination node. The problem is modeled as an average cost Markov Decision Process (MDP) in which, the generation times of status updates at the source, the transmissions of status updates from the source to the destination, and the wireless energy transfer (WET) are jointly optimized. After proving the monotonicity property of the value function associated with the MDP, we analytically demonstrate that the AoI-optimal policy has a threshold-based structure w.r.t. the state variables. Our numerical results verify the analytical findings and reveal the impact of state variables on the structure of the AoI-optimal policy. Our results also demonstrate the impact of system design parameters on the optimal achievable average AoI as well as the superiority of our proposed joint sampling and updating policy w.r.t. the generate-at-will policy.

39 citations

References
Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations

Journal ArticleDOI
26 Feb 2015-Nature
TL;DR: This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Abstract: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

23,074 citations

Proceedings ArticleDOI
26 May 2013
TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

7,316 citations


Proceedings ArticleDOI
25 Mar 2012
TL;DR: A time-average age metric is employed for the performance evaluation of status update systems and the existence of an optimal rate at which a source must generate its information to keep its status as timely as possible at all its monitors is shown.
Abstract: Increasingly ubiquitous communication networks and connectivity via portable devices have engendered a host of applications in which sources, for example people and environmental sensors, send updates of their status to interested recipients. These applications desire status updates at the recipients to be as timely as possible; however, this is typically constrained by limited network resources. In this paper, we employ a time-average age metric for the performance evaluation of status update systems. We derive general methods for calculating the age metric that can be applied to a broad class of service systems. We apply these methods to queue-theoretic system abstractions consisting of a source, a service facility and monitors, with the model of the service facility (physical constraints) a given. The queue discipline of first-come-first-served (FCFS) is explored. We show the existence of an optimal rate at which a source must generate its information to keep its status as timely as possible at all its monitors. This rate differs from those that maximize utilization (throughput) or minimize status packet delivery delay. While our abstractions are simpler than their real-world counterparts, the insights obtained, we believe, are a useful starting point in understanding and designing systems that support real time status updates.

1,879 citations

Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "A reinforcement learning framework for optimizing age of information in rf-powered communication systems" ?

In this paper, the authors study a real-time monitoring system in which multiple source nodes are responsible for sending update packets to a common destination node in order to maintain the freshness of information at the destination. Since it may not always be feasible to replace or recharge batteries in all source nodes, the authors consider that the nodes are powered through wireless energy transfer ( WET ) by the destination. For this system setup, the authors investigate the optimal online sampling policy ( referred to as the age-optimal policy ) that jointly optimizes WET and scheduling of update packet transmissions with the objective of minimizing the long-term average weighted sum of Age of Information ( AoI ) values for different physical processes ( observed by the source nodes ) at the destination node, referred to as the sum-AoI. Motivated by this, the authors propose a deep reinforcement learning ( DRL ) algorithm that can learn the age-optimal policy in a computationally-efficient manner. The authors further characterize the structural properties of the age-optimal policy analytically, and demonstrate that it has a threshold-based structure with respect to the AoI values for different processes. Afterwards, the authors analytically demonstrate that the structures of the age-optimal and throughput-optimal policies are different. The authors also numerically demonstrate these structures as well as the impact of system design parameters on the optimal achievable average weighted sum-AoI.