Journal ArticleDOI

A composable worst case latency analysis for multi-rank DRAM devices under open row policy

01 Nov 2016-Real-time Systems (Springer US)-Vol. 52, Iss: 6, pp 761-807
TL;DR: A new memory controller design together with a novel, composable worst case analysis for DDR DRAM that provides improved latency bounds compared to existing works by explicitly modeling the DRAM state and can be applied to multi-rank devices, which allow for increased access parallelism.
Abstract: As multi-core systems are becoming more popular in real-time embedded systems, strict timing requirements for accessing shared resources must be met. In particular, a detailed latency analysis for double data rate dynamic RAM (DDR DRAM) is highly desirable. Several researchers have proposed predictable memory controllers to provide guaranteed memory access latency. However, the performance of such controllers sharply decreases as DDR devices become faster and the width of memory buses is increased. High-performance commercial-off-the-shelf (COTS) memory controllers in general-purpose systems employ open row policy to improve average case access latencies and memory throughput, but the use of such policy is not compatible with existing real-time controllers. In this article, we present a new memory controller design together with a novel, composable worst case analysis for DDR DRAM that provides improved latency bounds compared to existing works by explicitly modeling the DRAM state. In particular, our approach scales better with increasing memory speed by predictably taking advantage of shorter latency for access to open DRAM rows. Furthermore, it can be applied to multi-rank devices, which allow for increased access parallelism. We evaluate our approach based on worst case analysis bounds and simulation results, using both synthetic tasks and a set of realistic benchmarks. In particular, benchmark evaluations show up to 45 % improvement in worst case task execution time compared to a competing predictable memory controller for a system with 16 requestors and one rank.

Summary (3 min read)

1 Introduction

  • In real-time embedded systems, the use of chip multiprocessors (CMPs) is becoming more popular due to their low power and high performance capabilities.
  • As memory devices are getting faster, the performance of predictable controllers is greatly diminished because the difference in access time between cached and not cached data in DRAM devices is growing.
  • In addition, the authors dynamically exploit the parallelism in the DRAM structure to reduce the interference among multiple requestors (cores or DMA).
  • (3) Based on the latency bounds for individual requests, the authors show how to compute the overall latency suffered by a task running on a fully timing compositional core [34].
  • The rest of the article is organized as follows.

2 DRAM Basics

  • Modern DRAM memory systems are composed of a memory controller and memory device.
  • In addition, modern systems can have multiple memory channels (i.e., multiple command and data buses).
  • For open requests, only a read or a write command is generated since the desired row is already cached in the row buffer.
  • Similarly, the tWTR timing constraint between the end of the data of Request 2 and the read command of Request 3 must be satisfied before the read command is issued.
  • Each requestor also shares banks with every other requestor.

4 Memory Controller

  • The arbitration rules of the memory controller are formalized in order to derive worst case latency analysis.
  • Note that their described latency analysis depends on the arbitration rules only, and not on the detailed implementation of the controller.
  • Note that CAS commands are considered serviced only when the associated data is transmitted to prevent a requestor from being delayed by two, rather than one, data transfers of another requestor.
  • (3) The controller then services the next write command (R3) in the FIFO queue at t = 4 following Rule-3.
  • Following the example, it is clear that if Requestors 1 and 3 have a long list of write commands waiting to be enqueued, the read command of Requestor 2 could be pushed back indefinitely; the worst case latency would thus be unbounded if the controller did not limit the number of re-orderings.
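The starvation problem in the bullet above can be made concrete with a small sketch. This is a hypothetical illustration of bounded re-ordering, not the paper's actual arbitration rules: writes may bypass a pending read only a capped number of times, which is what keeps the read's worst case latency bounded.

```python
from collections import deque

def enqueue(fifo, cmd, reorder_cap, bypasses):
    """Let a write ("W") bypass a waiting read ("R") at most reorder_cap times.

    Hypothetical sketch: names and queue model are ours, not the paper's.
    """
    if cmd == "W" and "R" in fifo and bypasses.get("R", 0) < reorder_cap:
        fifo.insert(fifo.index("R"), cmd)          # write jumps ahead of the read
        bypasses["R"] = bypasses.get("R", 0) + 1   # count one re-ordering
    else:
        fifo.append(cmd)                           # cap reached: queue behind

fifo, bypasses = deque(["R"]), {}
for _ in range(5):
    enqueue(fifo, "W", reorder_cap=2, bypasses=bypasses)
# Only two writes got ahead of the read; the remaining three queue behind it.
assert list(fifo) == ["W", "W", "R", "W", "W", "W"]
```

With the cap in place, the read's delay due to re-ordering is bounded by the cap rather than by the (unbounded) number of arriving writes.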

5 Worst Case Per-Request Latency

  • The worst case latency for a single memory request of a requestor under analysis is derived.
  • To simplify the analysis, the request latency is decomposed into two parts, tAC and tCD, as shown in Figure 6. tAC (Arrival-to-CAS) is the worst case interval between the arrival of a request at the front of the command buffer and the enqueuing of its corresponding CAS command into the FIFO.
  • Since again there are no timing constraints between such commands, the PRE or CAS command can only delay the ACT under analysis for one clock cycle due to command bus contention.
  • Therefore, the following lemma is obtained: Lemma 2.
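A trivial sketch of the decomposition in the bullets above, assuming tCD denotes the CAS-to-Data interval (only tAC is defined in this excerpt); the values below are arbitrary placeholders, not derived bounds.

```python
def request_latency(t_ac, t_cd):
    """Per-request worst case latency as the sum of the two intervals:
    tAC (Arrival-to-CAS) and tCD (assumed: CAS-to-Data)."""
    return t_ac + t_cd

# Placeholder values, purely for illustration.
assert request_latency(20, 14) == 34
```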

6 Worst Case Cumulative Latency

  • This section shows how to use the results of previous section to compute the cumulative latency over all requests generated by the task under analysis.
  • Let us assume that the requestor executing the task under analysis is a fully timing compositional core as described in [34] (example: ARM7).
  • Since the analysis in Section 5 depends on the order of requests, this section shows how to derive a safe worst case requests order given the number of each type of requests.
  • Note that tAC , as computed in Eq.(1) and Eq.(9), depends on both the previous request of the task under analysis and the specific values of timing constraints, which vary based on the DDR device.

7 Shared Data

  • A final but important discussion concerns data sharing in hard real-time systems.
  • First, the set of communicating cores that share data are grouped into a shared queue partition in the front end, where each requestor has a request queue within the shared queue partition.
  • Even when the system is structured as a set of software partitions, high-speed I/O still requires data to be shared among cores and DMA requestors.
  • When partition A is executing on core 1, the DMA for partition A will not be executing and hence does not access data at the same time.

8 Evaluation

  • The authors directly compare their approach against the Analyzable Memory Controller (AMC) [25] since AMC employs a fair round robin arbitration that does not prioritize the requestors, similarly to their system.
  • Since a synthetic benchmark is used, various parameters can be changed and fed as input to the analysis to observe how the worst case latency bound changes.
  • For a 16-bit data bus, AMC performs significantly better; this is expected since AMC can efficiently interleave over 4 banks, while their memory controller must issue 4 consecutive memory requests.
  • Even for 4 requestors with a 32-bit bus and 1 rank, the improvement is up to 50% over AMC, while in the case of a 16-bit data bus, results are between 4 and 30% better than AMC.
  • Next, notice that the difference between simulated and analytical time (T-bar vs. box) for AMC is quite small; the maximum difference is less than 10% of the analytical bound.

9 Conclusions

  • This article presented a new worst case latency analysis that takes DRAM state information into account to provide a composable bound.
  • The authors' approach is specifically targeted at multi-core systems using modern DRAM devices with high clock rates and wide data buses.
  • First of all, the authors plan to synthesize and test the proposed controller on FPGA.


A Composable Worst Case Latency Analysis
for Multi-Rank DRAM Devices under Open Row Policy
Zheng Pei Wu · Rodolfo Pellizzoni · Danlu Guo
Received: date / Accepted: date
Abstract As multi-core systems are becoming more popular in real-time embedded
systems, strict timing requirements for accessing shared resources must be met. In
particular, a detailed latency analysis for Double Data Rate Dynamic RAM (DDR
DRAM) is highly desirable. Several researchers have proposed predictable memory
controllers to provide guaranteed memory access latency. However, the performance
of such controllers sharply decreases as DDR devices become faster and the width of
memory buses is increased. High-performance Commercial-Off-The-Shelf (COTS)
memory controllers in general-purpose systems employ open row policy to improve
average case access latencies and memory throughput, but the use of such policy is
not compatible with existing real-time controllers. In this article, we present a new
memory controller design together with a novel, composable worst case analysis for
DDR DRAM that provides improved latency bounds compared to existing works by
explicitly modeling the DRAM state. In particular, our approach scales better with in-
creasing memory speed by predictably taking advantage of shorter latency for access
to open DRAM rows. Furthermore, it can be applied to multi-rank devices, which al-
low for increased access parallelism. We evaluate our approach based on worst case
analysis bounds and simulation results, using both synthetic tasks and a set of realis-
tic benchmarks. In particular, benchmark evaluations show up to 45% improvement
in worst case task execution time compared to a competing predictable memory con-
troller for a system with 16 requestors and one rank.
1 Introduction
In real-time embedded systems, the use of chip multiprocessors (CMPs) is becoming
more popular due to their low power and high performance capabilities. As appli-
cations running on these multi-core systems are becoming more memory intensive,
Zheng Pei Wu · Rodolfo Pellizzoni · Danlu Guo
Department of Electrical and Computer Engineering, University of Waterloo (Canada)
E-mail: {zpwu, rpellizz, dlguo}@uwaterloo.ca

the shared main memory resource is turning into a significant bottleneck. Therefore,
there is a need to bound the worst case memory latency caused by contention among
multiple cores to provide hard guarantees to real-time tasks. Several researchers have
addressed this problem by proposing new timing analyses for contention in main
memory and caches [30, 29, 28]. However, such analyses assume a constant time for
each memory request (load or store). In practice, modern CMPs use Double Data
Rate Dynamic RAM (DDR DRAM) as their main memory. The assumption of con-
stant access time in DRAM can lead to highly pessimistic bounds because DRAM
is a complex and stateful resource, i.e., the time required to perform one memory
request is highly dependent on the history of previous and concurrent requests.
DRAM access time is highly variable because of two main reasons: (1) DRAM
employs an internal caching mechanism where large chunks of data are first loaded
into a row buffer before being read or written. (2) In addition, DRAM devices use a
parallel structure; in particular, multiple operations targeting different internal buffers
can be performed simultaneously. Due to these characteristics, developing a safe yet
realistic memory latency analysis is very challenging. To overcome such challenges,
a number of other researches have proposed the design of predictable DRAM con-
trollers [25, 1, 31, 12, 27]. These controllers simplify the analysis of memory latency
by statically pre-computing sequences of memory commands. The key idea is that
static command sequences allow leveraging DRAM parallelism without the require-
ment to analyze dynamic state information. Existing predictable controllers have been
shown to provide tight, predictable memory latency for hard real-time tasks when
applied to older DRAM standards such as DDR2. However, as we show in our eval-
uation, they perform poorly in the presence of more modern DRAM devices such as
DDR3 [17]. The first drawback of existing predictable controllers is that they do not
take advantage of the caching mechanism. As memory devices are getting faster, the
performance of predictable controllers is greatly diminished because the difference
in access time between cached and not cached data in DRAM devices is growing.
Furthermore, as memory buses are becoming wider, the amount of data that can be
transferred in each bus cycle increases. For this reason, the ability of existing pre-
dictable controllers to exploit DRAM access parallelism in a static manner is dimin-
ished. Finally, memory controllers employed in Commercial-Off-The-Shelf (COTS)
systems are typically optimized for average case latency and maximum throughput,
and they behave quite differently compared to the discussed real-time controllers.
Hence, existing latency bounds cannot directly be applied to such controllers.
Therefore, in this article we consider a different approach that takes advantage
of the DRAM caching mechanism by explicitly modelling and analyzing DRAM
state information. In addition, we dynamically exploit the parallelism in the DRAM
structure to reduce the interference among multiple requestors (cores or DMA). Our
approach relies on the design of a new predictable memory controller, which fairly
arbitrates among commands of different requestors. The structure of our controller is
similar to existing controllers, but compared to COTS systems, we disable request re-
ordering to avoid a requestor being unfairly delayed (possibly forever). Our technique
relies on statically partitioning the available main memory (DRAM banks) among re-
questors. As such, it is targeted at partitioned real-time systems, such as integrated
modular avionics systems [26], where different applications are allocated on individual
cores and communication between applications is limited. For the same reason, it
is also restricted to multi-core, rather than many-core systems; in the evaluation, we
consider systems with up to 16 requestors.
In more details, the major contributions of this work are the following. (1) We
discuss the design of a new dynamic, predictable memory controller based on static
bank partitioning. (2) Based on the discussed controller, we derive a worst case DDR
DRAM memory latency analysis for individual load/store requests issued by a re-
questor under analysis in the presence of multiple other requestors contending for
memory access. Our analysis is composable, in the sense that the latency bound does
not depend on the activity of the other requestors, only on the number of requestors,
and it makes no assumption on the characteristics of the requestor under analysis (i.e.,
it can be an in-order/out-of-order core, DMA, etc.). (3) Based on the latency bounds
for individual requests, we show how to compute the overall latency suffered by a
task running on a fully timing compositional core [34]. (4) We evaluate our analy-
sis against previous predictable approaches using both synthetic tasks and a set of
benchmarks executed on an architectural simulator. In particular, we show that our
approach scales significantly better with faster memory devices. We show results both
in terms of worst case analysis bounds, and measured latency on the simulator. For
a commonly used DRAM in a system with 16 requestors and no inter-core commu-
nication, our method shows up to 45% improvements on task worst case execution
time compared to [25].
The rest of the article is organized as follows. Section 2 provides required back-
ground knowledge on how DRAM works. Section 3 compares our approach to related
work in the field. Section 4 discusses our memory controller design and Section 5
and 6 detail our worst case latency analysis. Section 7 discusses shared data, while
evaluation results are presented in Section 8. Finally, Section 9 concludes the article.
2 DRAM Basics
Modern DRAM memory systems are composed of a memory controller and mem-
ory device. Figure 1 shows an example of such system, where multiple cores and
DMA devices send requests to load or store data to the memory controller; the con-
troller handles individual requests by controlling the operation of the memory de-
vices, which stores the actual data. Since our request latency analysis is independent
of the characteristics of the hardware entity communicating with the memory con-
trollers, in Sections 2-5 we use the term requestor to denote any component (core or
DMA) that can send requests to the controller.
The device and controller are connected by a command bus and a data bus. The
command bus is used to transfer memory commands, which controls the operation of
the device, while the data bus carries the transferred data associated with a request.
The two buses can be used in parallel: a request of one requestor can use the command
bus while a request of another requestor uses the data bus. However, no more than
one request can use the command bus (or data bus) at the same time. The logic of the
controller is typically divided into a front end and back end. The front end generates
one or more memory commands for each request. The back end arbitrates among
generated commands and issues them to the device through the command bus. As we
discuss in Section 2.1, there are specific timing constraints that the back end must
satisfy.
Modern memory devices are organized into ranks, and each rank is divided into
multiple banks, which can be accessed in parallel provided that no collisions occur on
either bus. Each bank comprises a row buffer and an array of storage cells organized
as rows^1 and columns, as shown in Figure 1. In addition, modern systems can have
multiple memory channels (i.e., multiple command and data buses). Each channel can
be treated independently, or channels can be interleaved together. This article treats each
channel independently and focuses on the analysis within a single channel. Note that
the optimization of requestor assignments to channels in real-time memory controllers
has been discussed in [10, 11].
[Figure: requestors (cores and DMA devices) send requests to the DRAM controller, whose front end and back end drive the command and data buses to a memory device organized into ranks, each containing Banks 1..N with per-bank row buffers.]
Fig. 1: DDR DRAM Organization
To access the data in a DRAM row, an Activate (ACT) command must be issued
to load the data into the row buffer before it can be read or written. Once the data
is in the row buffer, a CAS (read or write) command can be issued to retrieve or
store the data. If a second request needs to access a different row within the same
bank, the row buffer must be written back to the data array with a Pre-charge (PRE)
command before the second row can be activated. Finally, a periodic Refresh (REF)
command must be issued to all ranks and banks to ensure data integrity. Note that
each command takes one clock cycle on the command bus to be serviced.
A row that is cached in the row buffer is considered open, otherwise the row is
considered closed. A request that accesses an open row is called an Open Request
and a request that accesses a closed row is called Close Request. To avoid confusion,
requests are categorized as load or store while read and write are used to refer to
memory commands. When a request reaches the front end of the controller, the cor-
rect memory commands will be generated based on the status of the row buffers. For
open requests, only a read or a write command is generated since the desired row is
already cached in the row buffer. For a close request, if the row buffer contains a row
that is not the desired row, then a PRE command is generated to close the current row.
^1 DRAM rows are also referred to as 'pages' in the literature.

Then an ACT is generated to load the new row and finally read/write is generated to
access data. If the row buffer is empty, then only ACT and read/write commands are
needed. Finally, all open rows must be closed with PRE commands before a REF can
be issued.
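The command-generation rules described in the last two paragraphs can be sketched as follows. This is our minimal reading of the front-end behavior, not the paper's implementation; names are illustrative, and "CAS" stands for the request's read or write command.

```python
from typing import List, Optional

def commands_for_request(target_row: int, open_row: Optional[int]) -> List[str]:
    """Commands the front end generates for one request to a given bank.

    open_row is the row currently held in the bank's row buffer
    (None if the buffer is empty). Illustrative sketch only.
    """
    if open_row == target_row:        # open request: row already buffered
        return ["CAS"]
    if open_row is None:              # empty row buffer: no precharge needed
        return ["ACT", "CAS"]
    return ["PRE", "ACT", "CAS"]      # row conflict: close the current row first

assert commands_for_request(3, 3) == ["CAS"]
assert commands_for_request(3, None) == ["ACT", "CAS"]
assert commands_for_request(3, 7) == ["PRE", "ACT", "CAS"]
```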
The size of a row is large (several kB), so each request only accesses a small por-
tion of the row by selecting the appropriate columns. Each CAS command accesses
data in a burst of length BL, and the amount of data transferred is BL · WBUS, where
WBUS is the width of the data bus. Since DDR memory transfers data on both the rising
and falling edges of the clock, the amount of time for one transfer is tBUS = BL/2
memory clock cycles. For example, with BL = 8 and a WBUS of 64 bits, it will take 4
cycles to transfer 64 bytes of data.
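The burst arithmetic above can be written down directly; this is a sketch of the formulas in the text (function names are ours):

```python
def burst_bytes(bl, w_bus_bits):
    """Data moved by one CAS command: BL beats of WBUS bits each."""
    return bl * w_bus_bits // 8

def transfer_cycles(bl):
    """DDR moves two beats per clock, so tBUS = BL/2 memory clock cycles."""
    return bl // 2

# The example from the text: BL = 8 on a 64-bit bus.
assert burst_bytes(8, 64) == 64       # 64 bytes per burst
assert transfer_cycles(8) == 4        # 4 memory clock cycles
```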
2.1 DRAM Timing Constraints
The memory device takes time to perform different operations and therefore timing
constraints between various commands must be satisfied by the memory controller.
The operation and timing constraints of memory devices are defined by the JEDEC
standard [17]. The standard defines different families of devices, such as DDR2 /
DDR3 / DDR4. As an example, Table 1 lists all timing parameters of interest to the
analysis, with typical values for DDR3 and DDR2 devices^2. Note that as the frequency
increases and thus the clock period becomes smaller, the value of the timing
parameters in number of clock cycles also tends to increase. Figures 2 and 3 illus-
trate the various timing constraints. Square boxes represent commands issued on the
command bus (A for ACT, P for PRE and R/W for Read and Write). The data be-
ing transferred on the data bus is also shown. To avoid excessive clutter, command
and data transfers belonging to the same request are shown on the same line, but we
stress again that the command and data buses can be operated in parallel. Horizontal
arrows represent timing constraints between different commands while the vertical
arrows show when each request arrives. R denotes rank and B denotes bank in the
figures. Note that constraints are not drawn to actual scale to make the figures easier
to understand.
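The gating behavior described next can be sketched as a simple rule: a command is eligible at the earliest cycle at which every applicable JEDEC constraint is satisfied, i.e., the maximum over all (prior-event time + required separation) pairs. The sketch below is our simplification of a back-end check, not the paper's controller, and the constraint values used are placeholders rather than real JEDEC figures.

```python
def earliest_issue(prior_events, constraints):
    """Earliest cycle at which a command satisfies every applicable constraint.

    prior_events: cycle at which each relevant earlier event occurred.
    constraints:  minimum separation (in cycles) required after that event.
    Illustrative simplification; real controllers track many more events.
    """
    return max(prior_events[e] + constraints[e] for e in constraints)

# A write gated by both tRCD (after its own ACT) and tRTW (after a prior read);
# the values 9 and 7 are placeholders, not taken from the JEDEC standard.
assert earliest_issue({"ACT": 0, "READ": 2}, {"ACT": 9, "READ": 7}) == 9
```

This is why, in Figure 2, the write of Request 2 waits past tRCD: another constraint (tRTW) dominates the maximum.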
Figure 2 shows constraints related to banks within the same rank. All three
requests are close requests targeting the same rank. Requests 1 and 3 are accessing
Bank 0 while Request 2 is accessing Bank 1. Notice that the write command of Request
2 cannot be issued immediately once the tRCD timing constraint has been satisfied.
This is because there is another timing constraint, tRTW, between the read command of
Request 1 and the write command of Request 2, and the write command can only be
issued once all applicable constraints are satisfied. Similarly, the tWTR timing
constraint between the end of the data of Request 2 and the read command of Request
3 must be satisfied before the read command is issued. Figure 3 shows timing
constraints between different ranks, which only consist of tRTR [33]. This is the time
between the end of the data of one rank and the beginning of the data of another rank.
Note that Request 3 is targeting an open row; therefore, it does not need to issue a
PRE or ACT command.
^2 We use DDR3 in our evaluation since we found it to be the most commonly employed standard in
related work on predictable DRAM controllers.

Citations
Journal ArticleDOI
TL;DR: This paper derives a generalized interference delay analysis for DRAM main memory that accounts for a breadth of features deployed in COTS platforms, and explores the design space by studying the effects of each feature on both the worst-case delay for critical applications, and the bandwidth for noncritical applications.
Abstract: Commercial off-the-shelf (COTS) heterogeneous multiple processors systems-on-chip (MPSoCs) are appealing platforms for emerging mixed criticality systems (MCSs). To satisfy MCS requirements, the platform must guarantee predictable timing bounds for critical applications, without degrading average performance for noncritical applications. In particular, this paper studies the main memory subsystem, which in modern MPSoCs is typically based on double data rate synchronous dynamic access memory. While there exists previous work on worst-case DRAM latency analysis, such work only covers a small subset of possible COTS configurations, which are not targeted at MCS. Therefore, we derive a generalized interference delay analysis for DRAM main memory that accounts for a breadth of features deployed in COTS platforms. We then explore the design space by studying the effects of each feature on both the worst-case delay for critical applications, and the bandwidth for noncritical applications.

49 citations


Cites background or methods or result from "A composable worst case latency ana..."

  • ...Although Part-All scheme (which is followed by many related works [12], [8], [6], [7], [10], [20]) is able to substantially reduce WCD of the critical PEs compared to no-Part (Figures 3a and 3b), it also significantly reduces the bandwidth delivered to the noncritical PEs (Figure 4)....


  • ...As discussed in [12], [20], [9], this bound can then be used to either derive the worst-case execution time of single real-time task, or to perform response-time analysis for a multi-tasking application....


  • ...[12], [8], [6], [7]) considers DRAM bank partitioning, where banks are partitioned among PEs to reduce bank conflicts and hence improve WCD....


  • ...Similar to previous work [9], [10], we do not account for the delay from the refresh process because it can be often neglected compared to other delays [9], or otherwise, it can be added as an extra delay term to the execution time of a task using existing methods [11], [12]....


Proceedings ArticleDOI
21 Apr 2020
TL;DR: A fine-grained analysis of the memory contention experienced by parallel tasks running on a multi-core platform is proposed, formulated to bound the memory interference by leveraging a three-phase execution model and holistically considering multiple memory transactions issued during each phase.
Abstract: When adopting multi-core systems for safety-critical applications, certification requirements mandate bounding the delays incurred in accessing shared resources. This is the case of global memories, whose access is often regulated by memory controllers optimized for average-case performance and not designed to be predictable. As a consequence, worst-case bounds on memory access delays often result to be too pessimistic, drastically reducing the advantage of having multiple cores. This paper proposes a fine-grained analysis of the memory contention experienced by parallel tasks running on a multi-core platform. To this end, an optimization problem is formulated to bound the memory interference by leveraging a three-phase execution model and holistically considering multiple memory transactions issued during each phase. Experimental results show the advantage in adopting the proposed approach on both synthetic task sets and benchmarks.

36 citations

Proceedings Article
01 Jan 2020
TL;DR: A framework to analyze the memory contention in COTS MPSoCs and provide safe and tight bounds to the delays suffered by any critical task due to this contention is proposed and comparisons with the state-of-the art approaches show that the proposed analysis provides the tightest bounds across all evaluated access scenarios.
Abstract: Multiple-Processors Systems-on-Chip (MPSoCs) provide an appealing platform to execute Mixed Criticality Systems (MCS) with both time-sensitive critical tasks and performance-oriented noncritical tasks. Their heterogeneity with a variety of processing elements can address the conflicting requirements of those tasks. Nonetheless, the complex (and hence hard-to-analyze) architecture of Commercial-Off-The-Shelf (COTS) MPSoCs presents a challenge encumbering their adoption for MCS. In this paper, we propose a framework to analyze the memory contention in COTS MPSoCs and provide safe and tight bounds to the delays suffered by any critical task due to this contention. Unlike existing analyses, our solution is based on two main novel approaches. 1) It conducts a hybrid analysis that blends both request-level and task-level analyses into the same framework. 2) It leverages available knowledge about the types of memory requests of the task under analysis as well as contending tasks; specifically, we consider information that is already obtainable by applying existing static analysis tools to each task in isolation. Thanks to these novel techniques, our comparisons with the state-of-the art approaches show that the proposed analysis provides the tightest bounds across all evaluated access scenarios. 2012 ACM Subject Classification Computer systems organization → Real-time systems; Computer systems organization→ System on a chip; Computer systems organization→ Multicore architectures

18 citations

Journal ArticleDOI
TL;DR: This article proposes a SDRAM controller that reorders read and write commands, which minimizes data bus turnarounds and compares the approach analytically and experimentally with existing real-time SDRam controllers both from the worst-case latency and power consumption perspectives.
Abstract: Synchronous dynamic random access memories (SDRAMs) are widely employed in multi- and many-core platforms due to their high-density and low-cost. Nevertheless, their benefits come at the price of a complex two-stage access protocol, which reflects their bank-based structure and an internal level of explicitly managed caching. In scenarios in which requestors demand real-time guarantees, these features pose a predictability challenge and, in order to tackle it, several SDRAM controllers have been proposed. In this context, recent research shows that a combination of bank privatization and open-row policy (exploiting the caching over the boundary of a single request) represents an effective way to tackle the problem. However, such approach uncovered a new challenge: the data bus turnaround overhead. In SDRAMs, a single data bus is shared by read and write operations. Alternating read and write operations is, consequently, highly undesirable, as the data bus must remain idle during a turnaround. Therefore, in this article, we propose a SDRAM controller that reorders read and write commands, which minimizes data bus turnarounds. Moreover, we compare our approach analytically and experimentally with existing real-time SDRAM controllers both from the worst-case latency and power consumption perspectives.

12 citations


Cites background from "A composable worst case latency ana..."

  • ...However, if that is not the case, its ability to effectively exploit the SDRAM is compromised [11]....


  • ...Such strategy has been discussed in [11] and is out of the scope of this article....


  • ...Supporting different granularities, which would be necessary for instance if a DMA engine competes for the SDRAM with cache-relying processors, is out of the scope of this article (as it constitutes an orthogonal challenge already investigated in [11])....


  • ...We highlight that the same assumption has been made in [11], [12], which also employed a trace-based approach....


  • ...To address the aforementioned scenario, researchers proposed using a combination of bank privatization and openrow policy [6], [7], [11]....


Dissertation
01 Jan 2012
TL;DR: This work proposes DRAM power-aware rank scheduling schemes applied to the last-level cache and the memory controller that reduces write requests to DRAM and the state transitions by replacing cache blocks based on their dirty states and DRAM rank power states.
Abstract: Modern DRAMs provide multiple low-power states to save their energy consumption during idle times. The use of low-power states, however, can cause performance degradation because state transitions from low-power states to an active state incur time penalty. To effectively utilize the low-power states, we propose DRAM power-aware rank scheduling schemes applied to the last-level cache and the memory controller. Our scheme utilizing the last-level cache reduces write requests to DRAM and the state transitions by replacing cache blocks based on their dirty states and DRAM rank power states. Our scheme utilizing the memory controller decreases the state transitions with rank power state-aware batch writes. With the second scheme, the states transitions are reduced by 21.2%, on average. Consequently DRAM energy consumption is reduced by 11.2%, on average, with no performance loss.

9 citations

References
Journal ArticleDOI
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86).The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

4,039 citations


"A composable worst case latency ana..." refers methods in this paper

  • ...For each benchmark, we obtain the memory trace by running the benchmark on the gem5 [3] architecture simulator; we employed a simple in-order timing model using the x86 instruction set architecture as our objective is the evaluation of the memory system rather than detailed core simulation....

Journal ArticleDOI
John L. Henning1
TL;DR: On August 24, 2006, the Standard Performance Evaluation Corporation (SPEC) announced CPU2006, which replaces CPU2000, and the SPEC CPU benchmarks are widely used in both industry and academia.
Abstract: On August 24, 2006, the Standard Performance Evaluation Corporation (SPEC) announced CPU2006 [2], which replaces CPU2000. The SPEC CPU benchmarks are widely used in both industry and academia [3].

1,864 citations


"A composable worst case latency ana..." refers background in this paper

  • ...However, for simulation results, the other requestors are running the lbm benchmark from SPEC2006 CPU suite [15], which is highly bandwidth intensive....

Journal ArticleDOI
TL;DR: The architectural influence on static timing analysis is described and recommendations as to profitable and unacceptable architectural features are given and results show that measurement-based methods still used in industry are not useful for quite commonly used complex processors.
Abstract: Embedded hard real-time systems need reliable guarantees for the satisfaction of their timing constraints. Experience with the use of static timing-analysis methods and the tools based on them in the automotive and the aeronautics industries is positive. However, both the precision of the results and the efficiency of the analysis methods are highly dependent on the predictability of the execution platform. In fact, the architecture determines whether a static timing analysis is practically feasible at all and whether the most precise obtainable results are precise enough. Results contained in this paper also show that measurement-based methods still used in industry are not useful for quite commonly used complex processors. This dependence on the architectural development is of growing concern to the developers of timing-analysis tools and their customers, the developers in industry. The problem reaches a new level of severity with the advent of multicore architectures in the embedded domain. This paper describes the architectural influence on static timing analysis and gives recommendations as to profitable and unacceptable architectural features.

249 citations


"A composable worst case latency ana..." refers background in this paper

  • ...(3) Based on the latency bounds for individual requests, we show how to compute the overall latency suffered by a task running on a fully timing compositional core [34]....

  • ...Let us assume that the requestor executing the task under analysis is a fully timing compositional core as described in [34] (example: ARM7)....

Proceedings ArticleDOI
04 Jun 2007
TL;DR: It is time for a new era of processors whose temporal behavior is as easily controlled as their logical function, and these machines are called precision timed (PRET) machines.
Abstract: Patterson and Ditzel [12] did not invent reduced instruction set computers (RISC) in 1980. Earlier computers all had reduced instruction sets. Instead, they argued that trends in computer architecture had gotten off the sweet spot, and that by dropping back a few years and forking a new version of architectures, leveraging what had been learned, they could get better computers by employing simpler instruction sets.

244 citations


"A composable worst case latency ana..." refers background in this paper

  • ...Their work is part of a larger effort to develop PTARM [24], a precision-timed (PRET [8, 5]) architecture....

Proceedings ArticleDOI
30 Sep 2007
TL;DR: In this article, the authors present a memory controller design that provides a guaranteed minimum bandwidth and a maximum latency bound to the IPs, which is accomplished using a novel two-step approach to predictable SDRAM sharing.
Abstract: Memory requirements of intellectual property components (IP) in contemporary multi-processor systems-on-chip are increasing. Large high-speed external memories, such as DDR2 SDRAMs, are shared between a multitude of IPs to satisfy these requirements at a low cost per bit. However, SDRAMs have highly variable access times that depend on previous requests. This makes it difficult to accurately and analytically determine latencies and the useful bandwidth at design time, and hence to guarantee that hard real-time requirements are met. The main contribution of this paper is a memory controller design that provides a guaranteed minimum bandwidth and a maximum latency bound to the IPs. This is accomplished using a novel two-step approach to predictable SDRAM sharing. First, we define memory access groups, corresponding to precomputed sequences of SDRAM commands, with known efficiency and latency. Second, a predictable arbiter is used to schedule these groups dynamically at run-time, such that an allocated bandwidth and a maximum latency bound is guaranteed to the IPs. The approach is general and covers all generations of SDRAM. We present a modular implementation of our memory controller that is efficiently integrated into the network interface of a network-on-chip. The area of the implementation is cheap, and scales linearly with the number of IPs. An instance with six ports runs at 200 MHz and requires 0.042 mm2 in 0.13μm CMOS technology.

239 citations

Frequently Asked Questions (16)
Q1. What contributions have the authors mentioned in the paper "A composable worst case latency analysis for multi-rank dram devices under open row policy" ?

In this article, the authors present a new memory controller design together with a novel, composable worst case analysis for DDR DRAM that provides improved latency bounds compared to existing works by explicitly modeling the DRAM state. In particular, their approach scales better with increasing memory speed by predictably taking advantage of shorter latency for access to open DRAM rows. Furthermore, it can be applied to multi-rank devices, which allow for increased access parallelism. The authors evaluate their approach based on worst case analysis bounds and simulation results, using both synthetic tasks and a set of realistic benchmarks.

First of all, the authors plan to synthesize and test the proposed controller on FPGA. 

Since AMC was originally described for a slower DDR2 device, the authors recomputed the length of AMC static command groups based on the timing parameters of the employed DDR3 device. 

By decomposing the request latency, tAC and tCD can be computed separately, greatly simplifying the analysis; tReq is then computed as the sum of the two components. 

tAE : since the authors want to ensure that no command in the global queue is delayed by commands in the refresh sequence, the authors need to wait for the longest timing constraint between an ACT command and any other command issued after ending the sequence. 

tIP and tIA represent the worst case delay between inserting a command in the FIFO queue and when that command is issued, and thus capture interference caused by other requestors. 

To derive the total latency for accessing shared data, assume the task under analysis performs NSL loads and NSS stores to shared data. 
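The composition suggested by the fragment above can be sketched as follows. This is not the paper's code; `total_shared_latency` and the toy per-request bounds are hypothetical names, assuming the total is bounded by summing one worst-case bound per load and per store:

```python
# Sketch (assumed composition): once per-request worst-case bounds for
# shared-data loads and stores are known, the total latency a task
# spends on shared data is bounded by NSL loads plus NSS stores, each
# charged its worst-case latency.
def total_shared_latency(nsl, nss, t_load, t_store):
    """Upper bound on shared-data latency for the task under analysis."""
    return nsl * t_load + nss * t_store

# Toy numbers for illustration only (cycles), not values from the paper:
print(total_shared_latency(nsl=100, nss=40, t_load=55, t_store=60))  # -> 7900
```

The bound is linear in the request counts, which is what makes the per-request analysis composable at the task level.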

By carefully scheduling the static command sequences, the controller can significantly reduce the size of each TDMA slot compared to previous static controllers when handling small size requests that do not require interleaving. 

The assumption of constant access time in DRAM can lead to highly pessimistic bounds because DRAM is a complex and stateful resource, i.e., the time required to perform one memory request is highly dependent on the history of previous and concurrent requests. 

Since memory traces were obtained, no worst case request pattern is needed: the order of requests is assumed to be known. Instead, the authors simply computed the worst case latency of each request based on the type of the previous request, according to Table 4. 

Modern memory devices are organized into ranks, and each rank is divided into multiple banks, which can be accessed in parallel provided that no collisions occur on either bus. 

Since the analysis in Section 5 depends on the order of requests, this section shows how to derive a safe worst case requests order given the number of each type of requests. 

The tFAW constraint limits the number of banks that can be activated within a rolling time window, limiting the current drawn by the device to prevent overheating. 
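The standard form of this constraint (at most four ACT commands in any window of tFAW cycles) can be sketched as a rolling-window check. This is an illustrative model, not the paper's controller logic; the class and parameter names are hypothetical:

```python
# Sketch: enforce the four-activate-window (tFAW) constraint --
# at most four ACT commands may be issued within any tFAW-cycle window.
from collections import deque

class FawChecker:
    def __init__(self, t_faw):
        self.t_faw = t_faw
        self.acts = deque(maxlen=4)  # timestamps of the last four ACTs

    def earliest_act(self, now):
        """Earliest cycle >= now at which a new ACT may be issued."""
        if len(self.acts) < 4:
            return now
        # The oldest of the last four ACTs must leave the tFAW window.
        return max(now, self.acts[0] + self.t_faw)

    def issue_act(self, cycle):
        assert cycle >= self.earliest_act(cycle)
        self.acts.append(cycle)

checker = FawChecker(t_faw=30)
for c in (0, 5, 10, 15):
    checker.issue_act(c)
# A fifth ACT must wait until cycle 0 + 30 = 30.
print(checker.earliest_act(16))  # -> 30
```

In a worst-case analysis, this rolling window is what forces the bound to account for tFAW stalls whenever more than four row activations can pile up across banks.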

The downside is that the analysis is pessimistic, since it assumes that an interfering requestor could cause maximum delay on each individual command of the requestor under analysis, while this might not be possible in practice. 

The worst case latency for a single request to shared data for the task under analysis is then, for a load request:

$$t^{Req}_{Shared}(Load) = \sum_{i=1}^{k-1} t^{Req}_{Other,i}(M + s - 1) + t^{Req}_{Analysis}(Load,\, M + s - 1), \qquad (25)$$

while for a store request it is:

$$t^{Req}_{Shared}(Store) = \sum_{i=1}^{k-1} t^{Req}_{Other,i}(M + s - 1) + t^{Req}_{Analysis}(Store,\, M + s - 1).$$
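Eq. (25) sums one worst-case bound per other requestor sharing the data, plus the bound for the request under analysis, all evaluated with M + s - 1 interfering requestors. A minimal sketch, where `t_req_other` and `t_req_analysis` are stand-ins for the paper's per-request bounds and the numeric values are toys:

```python
# Sketch of Eq. (25): worst-case latency of one shared-data request.
# k requestors share the data; M + s - 1 is the assumed count of
# interfering requestors passed into each per-request bound.
def t_req_shared(req_type, k, M, s, t_req_other, t_req_analysis):
    interferers = M + s - 1
    # One worst-case request from each of the k-1 other sharers...
    total = sum(t_req_other(i, interferers) for i in range(1, k))
    # ...plus the request under analysis itself.
    return total + t_req_analysis(req_type, interferers)

# Toy bounds for illustration only, not the paper's values:
other = lambda i, n: 10 + n                              # t^Req_Other,i
analysis = lambda typ, n: (20 if typ == "Load" else 25) + n
print(t_req_shared("Load", 3, 4, 2, other, analysis))    # -> 55
```

The store case differs only in the final term, mirroring the second equation above.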

This is because there is another timing constraint, tRTW, between the read command of Request 1 and the write command of Request 2, and the write command can only be issued once all applicable constraints are satisfied.