Journal ArticleDOI

A composable worst case latency analysis for multi-rank DRAM devices under open row policy

01 Nov 2016-Real-time Systems (Springer US)-Vol. 52, Iss: 6, pp 761-807
TL;DR: A new memory controller design together with a novel, composable worst case analysis for DDR DRAM that provides improved latency bounds compared to existing works by explicitly modeling the DRAM state and can be applied to multi-rank devices, which allow for increased access parallelism.
Abstract: As multi-core systems are becoming more popular in real-time embedded systems, strict timing requirements for accessing shared resources must be met. In particular, a detailed latency analysis for double data rate dynamic RAM (DDR DRAM) is highly desirable. Several researchers have proposed predictable memory controllers to provide guaranteed memory access latency. However, the performance of such controllers sharply decreases as DDR devices become faster and the width of memory buses is increased. High-performance commercial-off-the-shelf (COTS) memory controllers in general-purpose systems employ open row policy to improve average case access latencies and memory throughput, but the use of such policy is not compatible with existing real-time controllers. In this article, we present a new memory controller design together with a novel, composable worst case analysis for DDR DRAM that provides improved latency bounds compared to existing works by explicitly modeling the DRAM state. In particular, our approach scales better with increasing memory speed by predictably taking advantage of shorter latency for access to open DRAM rows. Furthermore, it can be applied to multi-rank devices, which allow for increased access parallelism. We evaluate our approach based on worst case analysis bounds and simulation results, using both synthetic tasks and a set of realistic benchmarks. In particular, benchmark evaluations show up to 45 % improvement in worst case task execution time compared to a competing predictable memory controller for a system with 16 requestors and one rank.

Summary (3 min read)

1 Introduction

  • In real-time embedded systems, the use of chip multiprocessors (CMPs) is becoming more popular due to their low power and high performance capabilities.
  • As memory devices are getting faster, the performance of predictable controllers is greatly diminished because the difference in access time between cached and not cached data in DRAM devices is growing.
  • In addition, the authors dynamically exploit the parallelism in the DRAM structure to reduce the interference among multiple requestors (cores or DMA).
  • (3) Based on the latency bounds for individual requests, the authors show how to compute the overall latency suffered by a task running on a fully timing compositional core [34].
  • The rest of the article is organized as follows.

2 DRAM Basics

  • Modern DRAM memory systems are composed of a memory controller and memory device.
  • In addition, modern systems can have multiple memory channels (i.e., multiple command and data buses).
  • For open requests, only a read or a write command is generated since the desired row is already cached in the row buffer.
  • Similarly, the tWTR timing constraint between the end of the data of Request 2 and the read command of Request 3 must be satisfied before the read command is issued.
  • Each requestor also shares banks with every other requestor.

4 Memory Controller

  • The arbitration rules of the memory controller are formalized in order to derive worst case latency analysis.
  • Note that their described latency analysis depends on the arbitration rules only, and not on the detailed implementation of the controller.
  • Note that CAS commands are considered serviced only when the associated data is transmitted to prevent a requestor from being delayed by two, rather than one, data transfers of another requestor.
  • (3) The controller then services the next write command (R3) in the FIFO queue at t = 4 following Rule-3.
  • Following the example, it is clear that if Requestors 1 and 3 have a long list of write commands waiting to be enqueued, the read command of Requestor 2 could be pushed back indefinitely; the worst case latency would thus be unbounded if the controller did not limit the number of re-orderings.
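The starvation problem in the bullet above can be made concrete with a small sketch. This is a hypothetical illustration of bounded re-ordering, not the paper's actual arbitration rules: writes may bypass a pending read only a capped number of times, which is what keeps the read's worst case latency bounded.

```python
from collections import deque

def enqueue(fifo, cmd, reorder_cap, bypasses):
    """Let a write ("W") bypass a waiting read ("R") at most reorder_cap times.

    Hypothetical sketch: names and queue model are ours, not the paper's.
    """
    if cmd == "W" and "R" in fifo and bypasses.get("R", 0) < reorder_cap:
        fifo.insert(fifo.index("R"), cmd)          # write jumps ahead of the read
        bypasses["R"] = bypasses.get("R", 0) + 1   # count one re-ordering
    else:
        fifo.append(cmd)                           # cap reached: queue behind

fifo, bypasses = deque(["R"]), {}
for _ in range(5):
    enqueue(fifo, "W", reorder_cap=2, bypasses=bypasses)
# Only two writes got ahead of the read; the remaining three queue behind it.
assert list(fifo) == ["W", "W", "R", "W", "W", "W"]
```

With the cap in place, the read's delay due to re-ordering is bounded by the cap rather than by the (unbounded) number of arriving writes.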

5 Worst Case Per-Request Latency

  • The worst case latency for a single memory request of a requestor under analysis is derived.
  • To simplify the analysis, the request latency is decomposed into two parts, tAC and tCD, as shown in Figure 6. tAC (Arrival-to-CAS) is the worst case interval between the arrival of a request at the front of the command buffer and the enqueuing of its corresponding CAS command into the FIFO.
  • Since again there are no timing constraints between such commands, the PRE or CAS command can only delay the ACT under analysis for one clock cycle due to command bus contention.
  • Therefore, the following lemma is obtained: Lemma 2.
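A trivial sketch of the decomposition in the bullets above, assuming tCD denotes the CAS-to-Data interval (only tAC is defined in this excerpt); the values below are arbitrary placeholders, not derived bounds.

```python
def request_latency(t_ac, t_cd):
    """Per-request worst case latency as the sum of the two intervals:
    tAC (Arrival-to-CAS) and tCD (assumed: CAS-to-Data)."""
    return t_ac + t_cd

# Placeholder values, purely for illustration.
assert request_latency(20, 14) == 34
```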

6 Worst Case Cumulative Latency

  • This section shows how to use the results of previous section to compute the cumulative latency over all requests generated by the task under analysis.
  • Let us assume that the requestor executing the task under analysis is a fully timing compositional core as described in [34] (example: ARM7).
  • Since the analysis in Section 5 depends on the order of requests, this section shows how to derive a safe worst case requests order given the number of each type of requests.
  • Note that tAC , as computed in Eq.(1) and Eq.(9), depends on both the previous request of the task under analysis and the specific values of timing constraints, which vary based on the DDR device.

7 Shared Data

  • A final but important discussion concerns data sharing in hard real-time systems.
  • First, the set of communicating cores that share data are grouped into a shared queue partition in the front end, where each requestor has a request queue within the shared queue partition.
  • Even when the system is structured as a set of software partitions, high-speed I/O still requires data to be shared among cores and DMA requestors.
  • When partition A is executing on core 1, the DMA for partition A will not be executing and hence does not access data at the same time.

8 Evaluation

  • The authors directly compare their approach against the Analyzable Memory Controller (AMC) [25] since AMC employs a fair round robin arbitration that does not prioritize the requestors, similarly to their system.
  • Since a synthetic benchmark is used, various parameters can be changed and fed as input to the analysis to observe how the worst case latency bound changes.
  • For a 16-bit data bus, AMC performs significantly better; this is expected since AMC can efficiently interleave over 4 banks, while their memory controller must issue 4 consecutive memory requests.
  • Even for 4 requestors with a 32-bit bus and 1 rank, the improvement is up to 50% over AMC, while in the case of a 16-bit data bus, results are between 4 and 30% better than AMC.
  • Next, notice that the difference between simulated and analytical time (T-bar vs. box) for AMC is quite small; the maximum difference is less than 10% of the analytical bound.

9 Conclusions

  • This article presented a new worst case latency analysis that takes DRAM state information into account to provide a composable bound.
  • The authors' approach is specifically targeted at multi-core systems using modern DRAM devices with high clock rates and wide data buses.
  • First of all, the authors plan to synthesize and test the proposed controller on FPGA.


A Composable Worst Case Latency Analysis
for Multi-Rank DRAM Devices under Open Row Policy
Zheng Pei Wu · Rodolfo Pellizzoni · Danlu Guo
Received: date / Accepted: date
Abstract As multi-core systems are becoming more popular in real-time embedded
systems, strict timing requirements for accessing shared resources must be met. In
particular, a detailed latency analysis for Double Data Rate Dynamic RAM (DDR
DRAM) is highly desirable. Several researchers have proposed predictable memory
controllers to provide guaranteed memory access latency. However, the performance
of such controllers sharply decreases as DDR devices become faster and the width of
memory buses is increased. High-performance Commercial-Off-The-Shelf (COTS)
memory controllers in general-purpose systems employ open row policy to improve
average case access latencies and memory throughput, but the use of such policy is
not compatible with existing real-time controllers. In this article, we present a new
memory controller design together with a novel, composable worst case analysis for
DDR DRAM that provides improved latency bounds compared to existing works by
explicitly modeling the DRAM state. In particular, our approach scales better with in-
creasing memory speed by predictably taking advantage of shorter latency for access
to open DRAM rows. Furthermore, it can be applied to multi-rank devices, which al-
low for increased access parallelism. We evaluate our approach based on worst case
analysis bounds and simulation results, using both synthetic tasks and a set of realis-
tic benchmarks. In particular, benchmark evaluations show up to 45% improvement
in worst case task execution time compared to a competing predictable memory con-
troller for a system with 16 requestors and one rank.
1 Introduction
In real-time embedded systems, the use of chip multiprocessors (CMPs) is becoming
more popular due to their low power and high performance capabilities. As appli-
cations running on these multi-core systems are becoming more memory intensive,
Zheng Pei Wu · Rodolfo Pellizzoni · Danlu Guo
Department of Electrical and Computer Engineering, University of Waterloo (Canada)
E-mail: {zpwu, rpellizz, dlguo}@uwaterloo.ca

the shared main memory resource is turning into a significant bottleneck. Therefore,
there is a need to bound the worst case memory latency caused by contention among
multiple cores to provide hard guarantees to real-time tasks. Several researchers have
addressed this problem by proposing new timing analyses for contention in main
memory and caches [30, 29, 28]. However, such analyses assume a constant time for
each memory request (load or store). In practice, modern CMPs use Double Data
Rate Dynamic RAM (DDR DRAM) as their main memory. The assumption of con-
stant access time in DRAM can lead to highly pessimistic bounds because DRAM
is a complex and stateful resource, i.e., the time required to perform one memory
request is highly dependent on the history of previous and concurrent requests.
DRAM access time is highly variable because of two main reasons: (1) DRAM
employs an internal caching mechanism where large chunks of data are first loaded
into a row buffer before being read or written. (2) In addition, DRAM devices use a
parallel structure; in particular, multiple operations targeting different internal buffers
can be performed simultaneously. Due to these characteristics, developing a safe yet
realistic memory latency analysis is very challenging. To overcome such challenges,
a number of other researches have proposed the design of predictable DRAM con-
trollers [25, 1, 31, 12, 27]. These controllers simplify the analysis of memory latency
by statically pre-computing sequences of memory commands. The key idea is that
static command sequences allow leveraging DRAM parallelism without the require-
ment to analyze dynamic state information. Existing predictable controllers have been
shown to provide tight, predictable memory latency for hard real-time tasks when
applied to older DRAM standards such as DDR2. However, as we show in our eval-
uation, they perform poorly in the presence of more modern DRAM devices such as
DDR3 [17]. The first drawback of existing predictable controllers is that they do not
take advantage of the caching mechanism. As memory devices are getting faster, the
performance of predictable controllers is greatly diminished because the difference
in access time between cached and not cached data in DRAM devices is growing.
Furthermore, as memory buses are becoming wider, the amount of data that can be
transferred in each bus cycle increases. For this reason, the ability of existing pre-
dictable controllers to exploit DRAM access parallelism in a static manner is dimin-
ished. Finally, memory controllers employed in Commercial-Off-The-Shelf (COTS)
systems are typically optimized for average case latency and maximum throughput,
and they behave quite differently compared to the discussed real-time controllers.
Hence, existing latency bounds cannot directly be applied to such controllers.
Therefore, in this article we consider a different approach that takes advantage
of the DRAM caching mechanism by explicitly modelling and analyzing DRAM
state information. In addition, we dynamically exploit the parallelism in the DRAM
structure to reduce the interference among multiple requestors (cores or DMA). Our
approach relies on the design of a new predictable memory controller, which fairly
arbitrates among commands of different requestors. The structure of our controller is
similar to existing controllers, but compared to COTS systems, we disable request re-
ordering to avoid a requestor being unfairly delayed (possibly forever). Our technique
relies on statically partitioning the available main memory (DRAM banks) among re-
questors. As such, it is targeted at partitioned real-time systems, such as integrated
modular avionics systems [26], where different applications are allocated on individual
cores and communication between applications is limited. For the same reason, it
is also restricted to multi-core, rather than many-core systems; in the evaluation, we
consider systems with up to 16 requestors.
In more details, the major contributions of this work are the following. (1) We
discuss the design of a new dynamic, predictable memory controller based on static
bank partitioning. (2) Based on the discussed controller, we derive a worst case DDR
DRAM memory latency analysis for individual load/store requests issued by a re-
questor under analysis in the presence of multiple other requestors contending for
memory access. Our analysis is composable, in the sense that the latency bound does
not depend on the activity of the other requestors, only on the number of requestors,
and it makes no assumption on the characteristics of the requestor under analysis (i.e.,
it can be an in-order/out-of-order core, DMA, etc.). (3) Based on the latency bounds
for individual requests, we show how to compute the overall latency suffered by a
task running on a fully timing compositional core [34]. (4) We evaluate our analy-
sis against previous predictable approaches using both synthetic tasks and a set of
benchmarks executed on an architectural simulator. In particular, we show that our
approach scales significantly better with faster memory devices. We show results both
in terms of worst case analysis bounds, and measured latency on the simulator. For
a commonly used DRAM in a system with 16 requestors and no inter-core commu-
nication, our method shows up to 45% improvements on task worst case execution
time compared to [25].
The rest of the article is organized as follows. Section 2 provides required back-
ground knowledge on how DRAM works. Section 3 compares our approach to related
work in the field. Section 4 discusses our memory controller design and Section 5
and 6 detail our worst case latency analysis. Section 7 discusses shared data, while
evaluation results are presented in Section 8. Finally, Section 9 concludes the article.
2 DRAM Basics
Modern DRAM memory systems are composed of a memory controller and mem-
ory device. Figure 1 shows an example of such system, where multiple cores and
DMA devices send requests to load or store data to the memory controller; the con-
troller handles individual requests by controlling the operation of the memory de-
vices, which stores the actual data. Since our request latency analysis is independent
of the characteristics of the hardware entity communicating with the memory con-
trollers, in Sections 2-5 we use the term requestor to denote any component (core or
DMA) that can send requests to the controller.
The device and controller are connected by a command bus and a data bus. The
command bus is used to transfer memory commands, which controls the operation of
the device, while the data bus carries the transferred data associated with a request.
The two buses can be used in parallel: a request of one requestor can use the command
bus while a request of another requestor uses the data bus. However, no more than
one request can use the command bus (or data bus) at the same time. The logic of the
controller is typically divided into a front end and back end. The front end generates
one or more memory commands for each request. The back end arbitrates among
generated commands and issues them to the device through the command bus. As we
discuss in Section 2.1, there are specific timing constraints that the back end must
satisfy.
Modern memory devices are organized into ranks, and each rank is divided into
multiple banks, which can be accessed in parallel provided that no collisions occur on
either bus. Each bank comprises a row buffer and an array of storage cells organized
as rows^1 and columns, as shown in Figure 1. In addition, modern systems can have
multiple memory channels (i.e., multiple command and data buses). Each channel can
be treated independently, or channels can be interleaved together. This article treats each
channel independently and focuses on the analysis within a single channel. Note that
the optimization of requestor assignments to channels in real-time memory controllers
has been discussed in [10, 11].
[Figure: requestors (cores and DMA devices) send requests to the DRAM controller, whose front end and back end drive the command and data buses to a memory device organized into ranks, each containing Banks 1..N with per-bank row buffers.]
Fig. 1: DDR DRAM Organization
To access the data in a DRAM row, an Activate (ACT) command must be issued
to load the data into the row buffer before it can be read or written. Once the data
is in the row buffer, a CAS (read or write) command can be issued to retrieve or
store the data. If a second request needs to access a different row within the same
bank, the row buffer must be written back to the data array with a Pre-charge (PRE)
command before the second row can be activated. Finally, a periodic Refresh (REF)
command must be issued to all ranks and banks to ensure data integrity. Note that
each command takes one clock cycle on the command bus to be serviced.
A row that is cached in the row buffer is considered open, otherwise the row is
considered closed. A request that accesses an open row is called an Open Request
and a request that accesses a closed row is called Close Request. To avoid confusion,
requests are categorized as load or store while read and write are used to refer to
memory commands. When a request reaches the front end of the controller, the cor-
rect memory commands will be generated based on the status of the row buffers. For
open requests, only a read or a write command is generated since the desired row is
already cached in the row buffer. For a close request, if the row buffer contains a row
that is not the desired row, then a PRE command is generated to close the current row.
^1 DRAM rows are also referred to as 'pages' in the literature.

Then an ACT is generated to load the new row and finally read/write is generated to
access data. If the row buffer is empty, then only ACT and read/write commands are
needed. Finally, all open rows must be closed with PRE commands before a REF can
be issued.
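The command-generation rules described in the last two paragraphs can be sketched as follows. This is our minimal reading of the front-end behavior, not the paper's implementation; names are illustrative, and "CAS" stands for the request's read or write command.

```python
from typing import List, Optional

def commands_for_request(target_row: int, open_row: Optional[int]) -> List[str]:
    """Commands the front end generates for one request to a given bank.

    open_row is the row currently held in the bank's row buffer
    (None if the buffer is empty). Illustrative sketch only.
    """
    if open_row == target_row:        # open request: row already buffered
        return ["CAS"]
    if open_row is None:              # empty row buffer: no precharge needed
        return ["ACT", "CAS"]
    return ["PRE", "ACT", "CAS"]      # row conflict: close the current row first

assert commands_for_request(3, 3) == ["CAS"]
assert commands_for_request(3, None) == ["ACT", "CAS"]
assert commands_for_request(3, 7) == ["PRE", "ACT", "CAS"]
```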
The size of a row is large (several kB), so each request only accesses a small por-
tion of the row by selecting the appropriate columns. Each CAS command accesses
data in a burst of length BL, and the amount of data transferred is BL · WBUS, where
WBUS is the width of the data bus. Since DDR memory transfers data on both the rising
and falling edges of the clock, the amount of time for one transfer is tBUS = BL/2
memory clock cycles. For example, with BL = 8 and a WBUS of 64 bits, it will take 4
cycles to transfer 64 bytes of data.
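The burst arithmetic above can be written down directly; this is a sketch of the formulas in the text (function names are ours):

```python
def burst_bytes(bl, w_bus_bits):
    """Data moved by one CAS command: BL beats of WBUS bits each."""
    return bl * w_bus_bits // 8

def transfer_cycles(bl):
    """DDR moves two beats per clock, so tBUS = BL/2 memory clock cycles."""
    return bl // 2

# The example from the text: BL = 8 on a 64-bit bus.
assert burst_bytes(8, 64) == 64       # 64 bytes per burst
assert transfer_cycles(8) == 4        # 4 memory clock cycles
```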
2.1 DRAM Timing Constraints
The memory device takes time to perform different operations and therefore timing
constraints between various commands must be satisfied by the memory controller.
The operation and timing constraints of memory devices are defined by the JEDEC
standard [17]. The standard defines different families of devices, such as DDR2 /
DDR3 / DDR4. As an example, Table 1 lists all timing parameters of interest to the
analysis, with typical values for DDR3 and DDR2 devices^2. Note that as the frequency
increases and thus the clock period becomes smaller, the value of the timing
parameters in number of clock cycles also tends to increase. Figures 2 and 3 illus-
trate the various timing constraints. Square boxes represent commands issued on the
command bus (A for ACT, P for PRE and R/W for Read and Write). The data be-
ing transferred on the data bus is also shown. To avoid excessive clutter, command
and data transfers belonging to the same request are shown on the same line, but we
stress again that the command and data buses can be operated in parallel. Horizontal
arrows represent timing constraints between different commands while the vertical
arrows show when each request arrives. R denotes rank and B denotes bank in the
figures. Note that constraints are not drawn to actual scale to make the figures easier
to understand.
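The gating behavior described next can be sketched as a simple rule: a command is eligible at the earliest cycle at which every applicable JEDEC constraint is satisfied, i.e., the maximum over all (prior-event time + required separation) pairs. The sketch below is our simplification of a back-end check, not the paper's controller, and the constraint values used are placeholders rather than real JEDEC figures.

```python
def earliest_issue(prior_events, constraints):
    """Earliest cycle at which a command satisfies every applicable constraint.

    prior_events: cycle at which each relevant earlier event occurred.
    constraints:  minimum separation (in cycles) required after that event.
    Illustrative simplification; real controllers track many more events.
    """
    return max(prior_events[e] + constraints[e] for e in constraints)

# A write gated by both tRCD (after its own ACT) and tRTW (after a prior read);
# the values 9 and 7 are placeholders, not taken from the JEDEC standard.
assert earliest_issue({"ACT": 0, "READ": 2}, {"ACT": 9, "READ": 7}) == 9
```

This is why, in Figure 2, the write of Request 2 waits past tRCD: another constraint (tRTW) dominates the maximum.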
Figure 2 shows constraints related to banks within the same rank. All three
requests are close requests targeting the same rank. Requests 1 and 3 are accessing
Bank 0 while Request 2 is accessing Bank 1. Notice that the write command of Request
2 cannot be issued immediately once the tRCD timing constraint has been satisfied.
This is because there is another timing constraint, tRTW, between the read command of
Request 1 and the write command of Request 2, and the write command can only be
issued once all applicable constraints are satisfied. Similarly, the tWTR timing
constraint between the end of the data of Request 2 and the read command of Request
3 must be satisfied before the read command is issued. Figure 3 shows timing
constraints between different ranks, which only consist of tRTR [33]. This is the time
between the end of the data of one rank and the beginning of the data of another rank.
Note that Request 3 is targeting an open row; therefore, it does not need to issue a
PRE or ACT command.
^2 We use DDR3 in our evaluation since we found it to be the most commonly employed standard in
related work on predictable DRAM controllers.

Citations
Journal ArticleDOI
TL;DR: This paper derives a generalized interference delay analysis for DRAM main memory that accounts for a breadth of features deployed in COTS platforms, and explores the design space by studying the effects of each feature on both the worst-case delay for critical applications, and the bandwidth for noncritical applications.
Abstract: Commercial off-the-shelf (COTS) heterogeneous multiple processors systems-on-chip (MPSoCs) are appealing platforms for emerging mixed criticality systems (MCSs). To satisfy MCS requirements, the platform must guarantee predictable timing bounds for critical applications, without degrading average performance for noncritical applications. In particular, this paper studies the main memory subsystem, which in modern MPSoCs is typically based on double data rate synchronous dynamic access memory. While there exists previous work on worst-case DRAM latency analysis, such work only covers a small subset of possible COTS configurations, which are not targeted at MCS. Therefore, we derive a generalized interference delay analysis for DRAM main memory that accounts for a breadth of features deployed in COTS platforms. We then explore the design space by studying the effects of each feature on both the worst-case delay for critical applications, and the bandwidth for noncritical applications.

49 citations


Cites background or methods or result from "A composable worst case latency ana..."

  • ...Although Part-All scheme (which is followed by many related works [12], [8], [6], [7], [10], [20]) is able to substantially reduce WCD of the critical PEs compared to no-Part (Figures 3a and 3b), it also significantly reduces the bandwidth delivered to the noncritical PEs (Figure 4)....


  • ...As discussed in [12], [20], [9], this bound can then be used to either derive the worst-case execution time of single real-time task, or to perform response-time analysis for a multi-tasking application....


  • ...[12], [8], [6], [7]) considers DRAM bank partitioning, where banks are partitioned among PEs to reduce bank conflicts and hence improve WCD....


  • ...Similar to previous work [9], [10], we do not account for the delay from the refresh process because it can be often neglected compared to other delays [9], or otherwise, it can be added as an extra delay term to the execution time of a task using existing methods [11], [12]....


Proceedings ArticleDOI
21 Apr 2020
TL;DR: A fine-grained analysis of the memory contention experienced by parallel tasks running on a multi-core platform is proposed, formulated to bound the memory interference by leveraging a three-phase execution model and holistically considering multiple memory transactions issued during each phase.
Abstract: When adopting multi-core systems for safety-critical applications, certification requirements mandate bounding the delays incurred in accessing shared resources. This is the case of global memories, whose access is often regulated by memory controllers optimized for average-case performance and not designed to be predictable. As a consequence, worst-case bounds on memory access delays often result to be too pessimistic, drastically reducing the advantage of having multiple cores. This paper proposes a fine-grained analysis of the memory contention experienced by parallel tasks running on a multi-core platform. To this end, an optimization problem is formulated to bound the memory interference by leveraging a three-phase execution model and holistically considering multiple memory transactions issued during each phase. Experimental results show the advantage in adopting the proposed approach on both synthetic task sets and benchmarks.

36 citations

Proceedings Article
01 Jan 2020
TL;DR: A framework to analyze the memory contention in COTS MPSoCs and provide safe and tight bounds to the delays suffered by any critical task due to this contention is proposed and comparisons with the state-of-the art approaches show that the proposed analysis provides the tightest bounds across all evaluated access scenarios.
Abstract: Multiple-Processors Systems-on-Chip (MPSoCs) provide an appealing platform to execute Mixed Criticality Systems (MCS) with both time-sensitive critical tasks and performance-oriented noncritical tasks. Their heterogeneity with a variety of processing elements can address the conflicting requirements of those tasks. Nonetheless, the complex (and hence hard-to-analyze) architecture of Commercial-Off-The-Shelf (COTS) MPSoCs presents a challenge encumbering their adoption for MCS. In this paper, we propose a framework to analyze the memory contention in COTS MPSoCs and provide safe and tight bounds to the delays suffered by any critical task due to this contention. Unlike existing analyses, our solution is based on two main novel approaches. 1) It conducts a hybrid analysis that blends both request-level and task-level analyses into the same framework. 2) It leverages available knowledge about the types of memory requests of the task under analysis as well as contending tasks; specifically, we consider information that is already obtainable by applying existing static analysis tools to each task in isolation. Thanks to these novel techniques, our comparisons with the state-of-the art approaches show that the proposed analysis provides the tightest bounds across all evaluated access scenarios. 2012 ACM Subject Classification Computer systems organization → Real-time systems; Computer systems organization→ System on a chip; Computer systems organization→ Multicore architectures

18 citations

Journal ArticleDOI
TL;DR: This article proposes a SDRAM controller that reorders read and write commands, which minimizes data bus turnarounds and compares the approach analytically and experimentally with existing real-time SDRam controllers both from the worst-case latency and power consumption perspectives.
Abstract: Synchronous dynamic random access memories (SDRAMs) are widely employed in multi- and many-core platforms due to their high-density and low-cost. Nevertheless, their benefits come at the price of a complex two-stage access protocol, which reflects their bank-based structure and an internal level of explicitly managed caching. In scenarios in which requestors demand real-time guarantees, these features pose a predictability challenge and, in order to tackle it, several SDRAM controllers have been proposed. In this context, recent research shows that a combination of bank privatization and open-row policy (exploiting the caching over the boundary of a single request) represents an effective way to tackle the problem. However, such approach uncovered a new challenge: the data bus turnaround overhead. In SDRAMs, a single data bus is shared by read and write operations. Alternating read and write operations is, consequently, highly undesirable, as the data bus must remain idle during a turnaround. Therefore, in this article, we propose a SDRAM controller that reorders read and write commands, which minimizes data bus turnarounds. Moreover, we compare our approach analytically and experimentally with existing real-time SDRAM controllers both from the worst-case latency and power consumption perspectives.

12 citations


Cites background from "A composable worst case latency ana..."

  • ...However, if that is not the case, its ability to effectively exploit the SDRAM is compromised [11]....


  • ...Such strategy has been discussed in [11] and is out of the scope of this article....


  • ...Supporting different granularities, which would be necessary for instance if a DMA engine competes for the SDRAM with cache-relying processors, is out of the scope of this article (as it constitutes an orthogonal challenge already investigated in [11])....


  • ...We highlight that the same assumption has been made in [11], [12], which also employed a trace-based approach....


  • ...To address the aforementioned scenario, researchers proposed using a combination of bank privatization and openrow policy [6], [7], [11]....


Dissertation
01 Jan 2012
TL;DR: This work proposes DRAM power-aware rank scheduling schemes applied to the last-level cache and the memory controller that reduces write requests to DRAM and the state transitions by replacing cache blocks based on their dirty states and DRAM rank power states.
Abstract: Modern DRAMs provide multiple low-power states to save their energy consumption during idle times. The use of low-power states, however, can cause performance degradation because state transitions from low-power states to an active state incur time penalty. To effectively utilize the low-power states, we propose DRAM power-aware rank scheduling schemes applied to the last-level cache and the memory controller. Our scheme utilizing the last-level cache reduces write requests to DRAM and the state transitions by replacing cache blocks based on their dirty states and DRAM rank power states. Our scheme utilizing the memory controller decreases the state transitions with rank power state-aware batch writes. With the second scheme, the states transitions are reduced by 21.2%, on average. Consequently DRAM energy consumption is reduced by 11.2%, on average, with no performance loss.

9 citations

References
Journal ArticleDOI
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86).The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

4,039 citations


"A composable worst case latency ana..." refers methods in this paper

  • ...For each benchmark, we obtain the memory trace by running the benchmark on the gem5 [3] architecture simulator; we employed a simple in-order timing model using the x86 instruction set architecture as our objective is the evaluation of the memory system rather than detailed core simulation....

Journal ArticleDOI
John L. Henning1
TL;DR: On August 24, 2006, the Standard Performance Evaluation Corporation (SPEC) announced CPU2006, which replaces CPU2000, and the SPEC CPU benchmarks are widely used in both industry and academia.
Abstract: On August 24, 2006, the Standard Performance Evaluation Corporation (SPEC) announced CPU2006 [2], which replaces CPU2000. The SPEC CPU benchmarks are widely used in both industry and academia [3].

1,864 citations


"A composable worst case latency ana..." refers background in this paper

  • ...However, for simulation results, the other requestors are running the lbm benchmark from SPEC2006 CPU suite [15], which is highly bandwidth intensive....

Journal ArticleDOI
TL;DR: The architectural influence on static timing analysis is described and recommendations as to profitable and unacceptable architectural features are given and results show that measurement-based methods still used in industry are not useful for quite commonly used complex processors.
Abstract: Embedded hard real-time systems need reliable guarantees for the satisfaction of their timing constraints. Experience with the use of static timing-analysis methods and the tools based on them in the automotive and the aeronautics industries is positive. However, both the precision of the results and the efficiency of the analysis methods are highly dependent on the predictability of the execution platform. In fact, the architecture determines whether a static timing analysis is practically feasible at all and whether the most precise obtainable results are precise enough. Results contained in this paper also show that measurement-based methods still used in industry are not useful for quite commonly used complex processors. This dependence on the architectural development is of growing concern to the developers of timing-analysis tools and their customers, the developers in industry. The problem reaches a new level of severity with the advent of multicore architectures in the embedded domain. This paper describes the architectural influence on static timing analysis and gives recommendations as to profitable and unacceptable architectural features.

249 citations


"A composable worst case latency ana..." refers background in this paper

  • ...(3) Based on the latency bounds for individual requests, we show how to compute the overall latency suffered by a task running on a fully timing compositional core [34]....

  • ...Let us assume that the requestor executing the task under analysis is a fully timing compositional core as described in [34] (example: ARM7)....

Proceedings ArticleDOI
04 Jun 2007
TL;DR: It is time for a new era of processors whose temporal behavior is as easily controlled as their logical function, and these machines are called precision timed (PRET) machines.
Abstract: Patterson and Ditzel [12] did not invent reduced instruction set computers (RISC) in 1980. Earlier computers all had reduced instruction sets. Instead, they argued that trends in computer architecture had gotten off the sweet spot, and that by dropping back a few years and forking a new version of architectures, leveraging what had been learned, they could get better computers by employing simpler instruction sets.

244 citations


"A composable worst case latency ana..." refers background in this paper

  • ...Their work is part of a larger effort to develop PTARM [24], a precision-timed (PRET [8, 5]) architecture....

Proceedings ArticleDOI
30 Sep 2007
TL;DR: In this article, the authors present a memory controller design that provides a guaranteed minimum bandwidth and a maximum latency bound to the IPs, which is accomplished using a novel two-step approach to predictable SDRAM sharing.
Abstract: Memory requirements of intellectual property components (IP) in contemporary multi-processor systems-on-chip are increasing. Large high-speed external memories, such as DDR2 SDRAMs, are shared between a multitude of IPs to satisfy these requirements at a low cost per bit. However, SDRAMs have highly variable access times that depend on previous requests. This makes it difficult to accurately and analytically determine latencies and the useful bandwidth at design time, and hence to guarantee that hard real-time requirements are met. The main contribution of this paper is a memory controller design that provides a guaranteed minimum bandwidth and a maximum latency bound to the IPs. This is accomplished using a novel two-step approach to predictable SDRAM sharing. First, we define memory access groups, corresponding to precomputed sequences of SDRAM commands, with known efficiency and latency. Second, a predictable arbiter is used to schedule these groups dynamically at run-time, such that an allocated bandwidth and a maximum latency bound is guaranteed to the IPs. The approach is general and covers all generations of SDRAM. We present a modular implementation of our memory controller that is efficiently integrated into the network interface of a network-on-chip. The area of the implementation is cheap, and scales linearly with the number of IPs. An instance with six ports runs at 200 MHz and requires 0.042 mm2 in 0.13μm CMOS technology.

239 citations

Frequently Asked Questions (16)
Q1. What contributions have the authors mentioned in the paper "A composable worst case latency analysis for multi-rank dram devices under open row policy" ?

In this article, the authors present a new memory controller design together with a novel, composable worst case analysis for DDR DRAM that provides improved latency bounds compared to existing works by explicitly modeling the DRAM state. In particular, their approach scales better with increasing memory speed by predictably taking advantage of shorter latency for access to open DRAM rows. Furthermore, it can be applied to multi-rank devices, which allow for increased access parallelism. The authors evaluate their approach based on worst case analysis bounds and simulation results, using both synthetic tasks and a set of realistic benchmarks.

First of all, the authors plan to synthesize and test the proposed controller on FPGA. 

Since AMC was originally described for a slower DDR2 device, the authors recomputed the length of AMC static command groups based on the timing parameters of the employed DDR3 device. 

By decomposing the request latency, tAC and tCD can be computed separately, greatly simplifying the analysis; tReq is then computed as the sum of the two components. 

tAE : since the authors want to ensure that no command in the global queue is delayed by commands in the refresh sequence, the authors need to wait for the longest timing constraint between an ACT command and any other command issued after ending the sequence. 

tIP and tIA represent the worst case delay between inserting a command in the FIFO queue and when that command is issued, and thus capture interference caused by other requestors. 

To derive the total latency for accessing shared data, assume the task under analysis performs NSL loads and NSS stores to shared data. 
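The composition suggested by the fragment above can be sketched as follows. This is not the paper's code; `total_shared_latency` and the toy per-request bounds are hypothetical names, assuming the total is bounded by summing one worst-case bound per load and per store:

```python
# Sketch (assumed composition): once per-request worst-case bounds for
# shared-data loads and stores are known, the total latency a task
# spends on shared data is bounded by NSL loads plus NSS stores, each
# charged its worst-case latency.
def total_shared_latency(nsl, nss, t_load, t_store):
    """Upper bound on shared-data latency for the task under analysis."""
    return nsl * t_load + nss * t_store

# Toy numbers for illustration only (cycles), not values from the paper:
print(total_shared_latency(nsl=100, nss=40, t_load=55, t_store=60))  # -> 7900
```

The bound is linear in the request counts, which is what makes the per-request analysis composable at the task level.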

By carefully scheduling the static command sequences, the controller can significantly reduce the size of each TDMA slot compared to previous static controllers when handling small size requests that do not require interleaving. 

The assumption of constant access time in DRAM can lead to highly pessimistic bounds because DRAM is a complex and stateful resource, i.e., the time required to perform one memory request is highly dependent on the history of previous and concurrent requests. 

Since memory traces were obtained, no worst case request pattern is needed: the order of requests is assumed to be known. Instead, the authors simply computed the worst case latency of each request based on the type of the previous request, according to Table 4. 

Modern memory devices are organized into ranks, and each rank is divided into multiple banks, which can be accessed in parallel provided that no collisions occur on either bus. 

Since the analysis in Section 5 depends on the order of requests, this section shows how to derive a safe worst case requests order given the number of each type of requests. 

The tFAW constraint limits the number of banks that can be activated within a rolling time window, limiting the current drawn by the device to prevent overheating. 
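The standard form of this constraint (at most four ACT commands in any window of tFAW cycles) can be sketched as a rolling-window check. This is an illustrative model, not the paper's controller logic; the class and parameter names are hypothetical:

```python
# Sketch: enforce the four-activate-window (tFAW) constraint --
# at most four ACT commands may be issued within any tFAW-cycle window.
from collections import deque

class FawChecker:
    def __init__(self, t_faw):
        self.t_faw = t_faw
        self.acts = deque(maxlen=4)  # timestamps of the last four ACTs

    def earliest_act(self, now):
        """Earliest cycle >= now at which a new ACT may be issued."""
        if len(self.acts) < 4:
            return now
        # The oldest of the last four ACTs must leave the tFAW window.
        return max(now, self.acts[0] + self.t_faw)

    def issue_act(self, cycle):
        assert cycle >= self.earliest_act(cycle)
        self.acts.append(cycle)

checker = FawChecker(t_faw=30)
for c in (0, 5, 10, 15):
    checker.issue_act(c)
# A fifth ACT must wait until cycle 0 + 30 = 30.
print(checker.earliest_act(16))  # -> 30
```

In a worst-case analysis, this rolling window is what forces the bound to account for tFAW stalls whenever more than four row activations can pile up across banks.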

The downside is that the analysis is pessimistic, since it assumes that an interfering requestor could cause maximum delay on each individual command of the requestor under analysis, while this might not be possible in practice. 

The worst case latency for a single request to shared data for the task under analysis is then, for a load request:

$$t^{Req}_{Shared}(Load) = \sum_{i=1}^{k-1} t^{Req}_{Other,i}(M + s - 1) + t^{Req}_{Analysis}(Load,\, M + s - 1), \qquad (25)$$

while for a store request it is:

$$t^{Req}_{Shared}(Store) = \sum_{i=1}^{k-1} t^{Req}_{Other,i}(M + s - 1) + t^{Req}_{Analysis}(Store,\, M + s - 1).$$
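Eq. (25) sums one worst-case bound per other requestor sharing the data, plus the bound for the request under analysis, all evaluated with M + s - 1 interfering requestors. A minimal sketch, where `t_req_other` and `t_req_analysis` are stand-ins for the paper's per-request bounds and the numeric values are toys:

```python
# Sketch of Eq. (25): worst-case latency of one shared-data request.
# k requestors share the data; M + s - 1 is the assumed count of
# interfering requestors passed into each per-request bound.
def t_req_shared(req_type, k, M, s, t_req_other, t_req_analysis):
    interferers = M + s - 1
    # One worst-case request from each of the k-1 other sharers...
    total = sum(t_req_other(i, interferers) for i in range(1, k))
    # ...plus the request under analysis itself.
    return total + t_req_analysis(req_type, interferers)

# Toy bounds for illustration only, not the paper's values:
other = lambda i, n: 10 + n                              # t^Req_Other,i
analysis = lambda typ, n: (20 if typ == "Load" else 25) + n
print(t_req_shared("Load", 3, 4, 2, other, analysis))    # -> 55
```

The store case differs only in the final term, mirroring the second equation above.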

This is because there is another timing constraint, tRTW, between the read command of Request 1 and the write command of Request 2, and the write command can only be issued once all applicable constraints are satisfied.