A composable worst case latency analysis for multi-rank DRAM devices under open row policy
Summary
1 Introduction
- In real-time embedded systems, the use of chip multiprocessors (CMPs) is becoming more popular due to their low power and high performance capabilities.
- As memory devices get faster, the performance of predictable controllers is greatly diminished, because the gap in access time between data cached in the row buffer and data that is not cached is growing.
- In addition, the authors dynamically exploit the parallelism in the DRAM structure to reduce the interference among multiple requestors (cores or DMA).
- (3) Based on the latency bounds for individual requests, the authors show how to compute the overall latency suffered by a task running on a fully timing compositional core [34].
2 DRAM Basics
- Modern DRAM memory systems are composed of a memory controller and a memory device.
- In addition, modern systems can have multiple memory channels (i.e., multiple command and data buses).
- For open requests, only a read or a write command is generated, since the desired row is already cached in the row buffer (the open-row versus closed-row cases are illustrated in the sketch after this list).
- Similarly, the tWTR timing constraint between the end of the data of Request 2 and the read command of Request 3 must be satisfied before the read command is issued.
- Each requestor also shares banks with every other requestor.
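To make the open-row behavior above concrete, here is a minimal sketch (not the paper's implementation) of the commands a controller generates per request under open row policy; the `Bank` and `commands_for` names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Bank:
    open_row: Optional[int] = None  # row currently cached in the row buffer

def commands_for(row: int, is_read: bool, bank: Bank) -> list[str]:
    cas = "READ" if is_read else "WRITE"
    if bank.open_row == row:
        # Open request (row hit): only the CAS command is generated.
        return [cas]
    cmds = []
    if bank.open_row is not None:
        cmds.append("PRE")  # close (precharge) the currently open row
    cmds.append("ACT")      # activate the requested row
    bank.open_row = row
    return cmds + [cas]

# Example: the second access to the same row is an open request.
bank = Bank()
print(commands_for(7, True, bank))   # ['ACT', 'READ'] (bank was idle)
print(commands_for(7, False, bank))  # ['WRITE'] (row hit)
```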
4 Memory Controller
- The arbitration rules of the memory controller are formalized in order to derive the worst case latency analysis.
- Note that the described latency analysis depends only on the arbitration rules, not on the detailed implementation of the controller.
- Note that CAS commands are considered serviced only when the associated data is transmitted to prevent a requestor from being delayed by two, rather than one, data transfers of another requestor.
- (3) The controller then services the next write command (R3) in the FIFO queue at t = 4 following Rule-3.
- Following the example, it is clear that if Requestors 1 and 3 had a long stream of write commands waiting to be enqueued, the read command of Requestor 2 would be pushed back indefinitely, and the worst case latency would be unbounded if the controller did not limit the amount of re-ordering.
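The re-ordering bound above can be illustrated with a small arbiter sketch. This is only an illustration with assumed names and a made-up `REORDER_LIMIT`; the paper's actual arbitration rules (Rule-1 through Rule-3) differ in detail:

```python
from collections import deque

REORDER_LIMIT = 2  # max writes allowed to jump ahead of a waiting read

def next_command(queue: deque, bypass_count: int):
    """Pick the next command to service; returns (command, new bypass count)."""
    if not queue:
        return None, bypass_count
    head = queue[0]
    if head["type"] == "READ":
        return queue.popleft(), 0  # read serviced: reset the counter
    # Head is a write; check whether a read is waiting behind it.
    has_waiting_read = any(c["type"] == "READ" for c in queue)
    if has_waiting_read and bypass_count >= REORDER_LIMIT:
        # Re-ordering budget exhausted: service the oldest read instead.
        for i, cmd in enumerate(queue):
            if cmd["type"] == "READ":
                del queue[i]
                return cmd, 0
    cmd = queue.popleft()
    return cmd, bypass_count + (1 if has_waiting_read else 0)
```

With such a cap, a pending read is serviced after at most `REORDER_LIMIT` bypassing writes, which is exactly what makes the worst case latency bounded.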
5 Worst Case Per-Request Latency
- The worst case latency for a single memory request of a requestor under analysis is derived.
- To simplify the analysis, the request latency is decomposed into two parts, tAC and tCD, as shown in Figure 6. tAC (Arrival-to-CAS) is the worst case interval between the arrival of a request at the front of the command buffer and the enqueuing of its corresponding CAS command into the FIFO; tCD (CAS-to-Data) covers the remaining interval, from the enqueuing of the CAS command to the completion of the corresponding data transfer (see the sketch after this list).
- Since again there are no timing constraints between such commands, the PRE or CAS command can only delay the ACT under analysis by one clock cycle due to command bus contention.
- This yields the bound stated in Lemma 2.
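As a minimal sketch of the decomposition, the per-request bound tReq is the sum of the two independently bounded components; the bounding functions and cycle counts below are placeholders standing in for the paper's lemmas, not its actual formulas:

```python
def t_ac(request_state: str, bounds: dict) -> int:
    """Worst case Arrival-to-CAS interval: an open request only enqueues
    a CAS, while a closed request must first issue PRE and/or ACT."""
    return bounds["tAC_open"] if request_state == "open" else bounds["tAC_closed"]

def t_cd(request_type: str, bounds: dict) -> int:
    """Worst case CAS-to-Data interval, from enqueuing the CAS command
    to the completion of the data transfer."""
    return bounds["tCD_read"] if request_type == "read" else bounds["tCD_write"]

def t_req(request_state: str, request_type: str, bounds: dict) -> int:
    # Because the two intervals are disjoint, the per-request bound is
    # simply their sum: tReq = tAC + tCD.
    return t_ac(request_state, bounds) + t_cd(request_type, bounds)

# Example with made-up cycle counts:
bounds = {"tAC_open": 10, "tAC_closed": 40, "tCD_read": 25, "tCD_write": 30}
print(t_req("closed", "read", bounds))  # 40 + 25 = 65
```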
6 Worst Case Cumulative Latency
- This section shows how to use the results of the previous section to compute the cumulative latency over all requests generated by the task under analysis.
- Let us assume that the requestor executing the task under analysis is a fully timing compositional core as described in [34] (e.g., ARM7).
- Since the analysis in Section 5 depends on the order of requests, this section shows how to derive a safe worst case request order given the number of requests of each type (a simplified safe over-approximation is sketched after this list).
- Note that tAC, as computed in Eq. (1) and Eq. (9), depends on both the previous request of the task under analysis and the specific values of the timing constraints, which vary based on the DDR device.
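The following sketch shows one simple way to obtain a safe cumulative bound from per-type request counts: charge each request the worst per-request bound over all feasible predecessor types. This over-approximates every actual order (safe but possibly pessimistic) and is not the paper's exact ordering construction; the table layout and values are assumptions:

```python
def cumulative_bound(counts: dict[str, int], t_req_table: dict) -> int:
    """counts: number of requests per type, e.g. {"load": 120, "store": 40}.
    t_req_table[(prev, cur)]: worst case per-request latency bound."""
    types = list(counts)
    total = 0
    for cur, n in counts.items():
        # Charge each request of this type the worst bound over all
        # possible previous request types.
        worst_prev = max(t_req_table[(prev, cur)] for prev in types)
        total += n * worst_prev
    return total

# Example with made-up bounds:
table = {("load", "load"): 50, ("load", "store"): 60,
         ("store", "load"): 70, ("store", "store"): 55}
print(cumulative_bound({"load": 2, "store": 1}, table))  # 2*70 + 1*60 = 200
```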
8 Evaluation
- The authors directly compare their approach against the Analyzable Memory Controller (AMC) [25] since AMC employs a fair round robin arbitration that does not prioritize the requestors, similarly to their system.
- Since synthetic benchmarks are used, various parameters can be changed and fed as input to the analysis to observe how the worst case latency bound changes.
- For a 16-bit data bus, AMC performs significantly better; this is expected, since AMC can efficiently interleave over 4 banks, while their memory controller must issue 4 consecutive memory requests.
- Even for 4 requestors with a 32-bit bus and 1 rank, the improvement over AMC is up to 50%, while in the case of a 16-bit data bus, the results are between 4% and 30% better than AMC.
- Next, notice that the difference between simulated and analytical time (T-bar vs. box) for AMC is quite small: the maximum difference is less than 10% of the analytical bound.
9 Conclusions
- This article presented a new worst case latency analysis that takes DRAM state information into account to provide a composable bound.
- The authors' approach is specifically targeted at multi-core systems using modern DRAM devices with high clock rates and wide data buses.
- First of all, the authors plan to synthesize and test the proposed controller on FPGA.
Frequently Asked Questions (16)
Q2. What future work do the authors mention in the paper "A composable worst case latency analysis for multi-rank DRAM devices under open row policy"?
First of all, the authors plan to synthesize and test the proposed controller on FPGA.
Q3. Why did the authors recompute the length of AMC static command groups?
Since AMC was originally described for a slower DDR2 device, the authors recomputed the length of AMC static command groups based on the timing parameters of the employed DDR3 device.
Q4. How can tAC and tCD be computed separately?
By decomposing, the latency for tAC and tCD can now be computed separately, greatly simplifying the analysis; tReq is then computed as the sum of the two components.
Q5. What is the longest timing constraint between an ACT command and any other command?
tAE: since the authors want to ensure that no command in the global queue is delayed by commands in the refresh sequence, they need to wait for the longest timing constraint between an ACT command and any other command issued after the end of the sequence.
Q6. What do tIP and tIA represent?
tIP and tIA represent the worst case delay between inserting a command in the FIFO queue and when that command is issued, and thus capture interference caused by other requestors.
Q7. How is the latency of a request to shared data calculated?
To derive the total latency for accessing shared data for the task under analysis, assume the number of loads to shared data is NSL and the number of stores to shared data is NSS.
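Under a natural reading of this answer, the total shared-data latency charges each load and store its worst case shared bound; this is a sketch with assumed names, where the two per-request bounds stand in for Eq. (25) below:

```python
# Hypothetical sketch: total shared-data latency as NSL loads plus NSS
# stores, each charged its worst case per-request shared bound.
def total_shared_latency(NSL: int, NSS: int,
                         t_shared_load: int, t_shared_store: int) -> int:
    return NSL * t_shared_load + NSS * t_shared_store
```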
Q8. How can the controller reduce the size of each TDMA slot?
By carefully scheduling the static command sequences, the controller can significantly reduce the size of each TDMA slot compared to previous static controllers when handling small size requests that do not require interleaving.
Q9. What is the main reason why DRAM is a complex and stateful resource?
The assumption of constant access time in DRAM can lead to highly pessimistic bounds because DRAM is a complex and stateful resource, i.e., the time required to perform one memory request is highly dependent on the history of previous and concurrent requests.
Q10. How is the worst case latency of a memory request computed?
Since memory traces were obtained, no worst case pattern is needed, as the order of requests is assumed to be known; instead, the authors simply computed the worst case latency of each request based on the type of the previous request, according to Table 4.
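A sketch of this trace-based computation: with the order known, each request's bound is a table lookup keyed by (previous type, current type). The table below is a stand-in for the paper's Table 4, with made-up values:

```python
T_REQ = {  # hypothetical per-request bounds in cycles (stand-in for Table 4)
    ("load", "load"): 50, ("load", "store"): 60,
    ("store", "load"): 70, ("store", "store"): 55,
}

def trace_latency(trace: list[str]) -> int:
    total, prev = 0, "load"  # assume a neutral initial previous type
    for req in trace:
        total += T_REQ[(prev, req)]
        prev = req
    return total

print(trace_latency(["load", "store", "load"]))  # 50 + 60 + 70 = 180
```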
Q11. How are modern memory devices organized?
Modern memory devices are organized into ranks, and each rank is divided into multiple banks, which can be accessed in parallel provided that no collisions occur on either bus.
Q12. How is a safe worst case request order derived?
Since the analysis in Section 5 depends on the order of requests, the authors show how to derive a safe worst case request order given the number of requests of each type.
Q13. What constraint limits the amount of current drawn by the device?
The tFAW (four activate window) constraint limits the number of banks that can be activated within a rolling time window, in order to limit the current drawn by the device and prevent overheating.
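For illustration, the tFAW rule can be checked with a sliding window over ACT issue times: in DDRx devices at most four ACT commands may be issued to a rank within any rolling tFAW window, so the next ACT must wait until the oldest of the last four is at least tFAW in the past. The function name and values are illustrative:

```python
from collections import deque

def earliest_act(now: int, recent_acts: deque, tFAW: int) -> int:
    """recent_acts holds the issue times of up to the last 4 ACT
    commands to this rank (oldest first)."""
    if len(recent_acts) < 4:
        return now  # fewer than 4 recent ACTs: tFAW does not restrict
    return max(now, recent_acts[0] + tFAW)

# Example, tFAW = 30 cycles: four ACTs at t = 0, 5, 10, 15 force the
# fifth ACT to wait until t = 30.
acts = deque([0, 5, 10, 15], maxlen=4)
print(earliest_act(20, acts, 30))  # 30
```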
Q14. What is the downside of the analysis?
The downside is that the analysis is pessimistic, since it assumes that an interfering requestor could cause the maximum delay on each individual command of the requestor under analysis, while this might not be possible in practice.
Q15. What is the worst case latency for a single request to shared data?
The worst case latency for a single request to shared data for the task under analysis is then:

$$t^{Req}_{Shared}(\mathit{Load}) = \sum_{i=1}^{k-1} t^{Req}_{Other,i}(M + s - 1) + t^{Req}_{Analysis}(\mathit{Load},\, M + s - 1), \tag{25}$$

for a load request, while for a store request it is:

$$t^{Req}_{Shared}(\mathit{Store}) = \sum_{i=1}^{k-1} t^{Req}_{Other,i}(M + s - 1) + t^{Req}_{Analysis}(\mathit{Store},\, M + s - 1).$$
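A direct transcription of Eq. (25) as code may help; `t_req_other` and `t_req_analysis` below are placeholder callables standing in for the per-request bounds derived in the paper:

```python
def t_req_shared(kind: str, k: int, M: int, s: int,
                 t_req_other, t_req_analysis) -> int:
    """Worst case latency of a single shared-data request, per Eq. (25).
    kind is "Load" or "Store"; k is the number of requestors."""
    # Interference from the k-1 other requestors, each parameterized
    # by M + s - 1, plus the bound for the request under analysis.
    interference = sum(t_req_other(i, M + s - 1) for i in range(1, k))
    return interference + t_req_analysis(kind, M + s - 1)

# Example with constant stand-in bounds of 100 and 80 cycles:
print(t_req_shared("Load", k=4, M=2, s=3,
                   t_req_other=lambda i, n: 100,
                   t_req_analysis=lambda kind, n: 80))  # 3*100 + 80 = 380
```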
Q16. Why is the write command of Request 2 delayed?
This is because there is another timing constraint, tRTW, between the read command of Request 1 and the write command of Request 2, and the write command can only be issued once all applicable constraints are satisfied (see the sketch below).
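As a closing illustration of this rule, the earliest issue time of a command is the maximum over the ready times imposed by all applicable constraints; the dictionary keys and values below are made up:

```python
def earliest_issue(ready_times: dict[str, int]) -> int:
    """A command issues only once every applicable constraint is met,
    i.e. at the maximum of the individual ready times."""
    return max(ready_times.values())

# Request 2's write must respect tRTW (measured from Request 1's read)
# even if the command bus is free earlier; values are illustrative.
print(earliest_issue({"bus_free": 12, "tRTW_after_read": 20}))  # 20
```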