Showing papers by "Moinuddin K. Qureshi" published in 2010


Proceedings ArticleDOI
01 Apr 2010
TL;DR: This work proposes adaptive Write Cancellation policies, which can abort the processing of a scheduled write request if a read request arrives at the same bank within a predetermined period, and Write Pausing, which exploits the iterative write algorithms used in PCM to pause at the end of each write iteration to service any pending reads.
Abstract: Phase Change Memory (PCM) is emerging as a promising technology to build large-scale main memory systems in a cost-effective manner. A characteristic of PCM is that it has write latency much higher than read latency. A higher write latency can typically be tolerated using buffers. However, once a write request is scheduled for service to a bank, it can still cause increased latency for later arriving read requests to the same bank. We show that for the baseline PCM system with read-priority scheduling, the write requests increase the effective read latency to 2.3x (on average), causing significant performance degradation. To reduce the read latency of PCM devices under such scenarios, we propose adaptive Write Cancellation policies. Such policies can abort the processing of a scheduled write request if a read request arrives at the same bank within a predetermined period. We also propose Write Pausing, which exploits the iterative write algorithms used in PCM to pause at the end of each write iteration to service any pending reads. For the baseline system, the proposed technique removes 75% of the latency increase incurred by read requests and improves overall system performance by 46% (on average), while requiring negligible hardware and simple extensions to the PCM controller.
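The difference between the three scheduling behaviours described in this abstract can be illustrated with a small timing model. This is only a sketch under stated assumptions, not the paper's controller: the function name `read_wait_cycles`, the policy labels, the cycle counts, and the `cancel_window` parameter are all hypothetical, and the cost of re-issuing a cancelled write is not modeled.

```python
def read_wait_cycles(read_arrival, write_start, iterations, iter_cycles,
                     policy="baseline", cancel_window=0):
    """Cycles a read stalls behind an in-flight write to the same bank.

    All parameters are illustrative assumptions: a write is modeled as
    `iterations` equal steps of `iter_cycles` cycles each.
    """
    write_end = write_start + iterations * iter_cycles
    if not (write_start <= read_arrival < write_end):
        return 0  # no write in flight when the read arrives
    elapsed = read_arrival - write_start
    if policy == "cancel" and elapsed < cancel_window:
        # Write Cancellation: abort the write so the read goes first.
        # The write must be re-issued later (that cost is not modeled here).
        return 0
    if policy == "pause":
        # Write Pausing: wait only until the current write iteration ends.
        return (-elapsed) % iter_cycles
    return write_end - read_arrival  # the write runs to completion


if __name__ == "__main__":
    for policy in ("baseline", "pause", "cancel"):
        print(policy, read_wait_cycles(300, 0, 8, 125, policy, cancel_window=500))
```

In this toy setting the read waits for the rest of the write under the baseline, for the rest of one iteration under pausing, and not at all when the write is still inside the cancellation window.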

313 citations


Proceedings ArticleDOI
19 Jun 2010
TL;DR: Morphable Memory System (MMS) is a robust architecture for efficiently incorporating MLC PCM devices in main memory, based on the observation that memory requirement varies between workloads and that systems are typically over-provisioned in terms of memory capacity.
Abstract: Phase Change Memory (PCM) is emerging as a scalable and power-efficient technology to architect future main memory systems. The scalability of PCM is enhanced by the property that PCM devices can store multiple bits per cell. While such Multi-Level Cell (MLC) devices can offer high density, this benefit comes at the expense of increased read latency, which can cause significant performance degradation. This paper proposes Morphable Memory System (MMS), a robust architecture for efficiently incorporating MLC PCM devices in main memory. MMS is based on the observation that memory requirement varies between workloads, and systems are typically over-provisioned in terms of memory capacity. So, during a phase of low memory usage, some of the MLC devices can be operated at fewer bits per cell to obtain lower latency. When the workload requires full memory capacity, these devices can be restored to high density MLC operation to have full main-memory capacity. We provide the runtime monitors, the hardware-OS interface, and the detailed mechanism for implementing MMS. Our evaluations on an 8-core 8GB MLC PCM-based system show that MMS provides, on average, low latency access for 95% of all memory requests, thereby improving overall system performance by 40%.

211 citations


Proceedings ArticleDOI
11 Sep 2010
TL;DR: This work proposes Feedback-Directed Pipelining (FDP), a software framework that chooses the core-to-stage allocation at run-time; FDP first maximizes the performance of the workload and then saves power by reducing the number of active cores, without impacting performance.
Abstract: Extracting high performance from Chip Multiprocessors requires that the application be parallelized. A common software technique to parallelize loops is pipeline parallelism in which the programmer/compiler splits each loop iteration into stages and each stage runs on a certain number of cores. It is important to choose the number of cores for each stage carefully because the core-to-stage allocation determines performance and power consumption. Finding the best core-to-stage allocation for an application is challenging because the number of possible allocations is large, and the best allocation depends on the input set and machine configuration. This paper proposes Feedback-Directed Pipelining (FDP), a software framework that chooses the core-to-stage allocation at run-time. FDP first maximizes the performance of the workload and then saves power by reducing the number of active cores, without impacting performance. Our evaluation on a real SMP system with two Core2Quad processors (8 cores) shows that FDP provides an average speedup of 4.2x which is significantly higher than the 2.3x speedup obtained with a practical profile-based allocation. We also show that FDP is robust to changes in machine configuration and input set.

62 citations


Journal ArticleDOI
TL;DR: The proposed accelerated critical sections (ACS) mechanism reduces the thread serialization caused by contention for critical sections by executing them on the high-performance core of an asymmetric chip multiprocessor, which can execute them faster than the smaller cores can.
Abstract: Contention for critical sections can reduce performance and scalability by causing thread serialization. The proposed accelerated critical sections (ACS) mechanism reduces this limitation. ACS executes critical sections on the high-performance core of an asymmetric chip multiprocessor (ACMP), which can execute them faster than the smaller cores can.

29 citations


Patent
19 Nov 2010
TL;DR: A property of a write data stream is detected, a write leveling process is adapted in response to the detected property, and the process is applied to the stream to generate physical addresses from the write line addresses.
Abstract: Adaptive write leveling in limited lifetime memory devices including performing a method for monitoring a write data stream that includes write line addresses. A property of the write data stream is detected and a write leveling process is adapted in response to the detected property. The write leveling process is applied to the write data stream to generate physical addresses from the write line addresses.

26 citations


Patent
09 Apr 2010
TL;DR: When a first memory region, whose units operate at a first density, is smaller than its target size by more than a threshold, a memory unit operating at a second density in a second region is dynamically reassigned to the first region during normal system operation and operates at the first density afterwards.
Abstract: A computer memory with dynamic cell density including a method that obtains a target size for a first memory region. The first memory region includes first memory units operating at a first density. The first memory units are included in a memory in a memory system. The memory is operable at the first density and a second density. The method also includes: determining that a current size of the first memory region is not within a threshold of the target size and that the first memory region is smaller than the target size; identifying a second memory unit currently operating at the second density in a second memory region, the second memory unit included in the memory; and dynamically reassigning, during normal system operation, the second memory unit into the first memory region, the second memory unit operating at the first density after being reassigned to the first memory region.
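The steps listed in this abstract can be sketched as a single rebalancing step. This is a minimal illustration under assumptions, not the patented implementation: region size is measured as a count of units, and the names `MemoryUnit`, `rebalance_step`, and all parameters are hypothetical. Migrating the data held in the reassigned unit is not modeled.

```python
from dataclasses import dataclass


@dataclass
class MemoryUnit:
    uid: int
    bits_per_cell: int  # density the unit currently operates at


def rebalance_step(first_region, second_region, target_size, threshold,
                   first_density, second_density):
    """Move one unit into the first region if that region is too small.

    Returns the reassigned unit, or None when no reassignment happens.
    """
    current = len(first_region)
    outside_threshold = abs(target_size - current) > threshold
    if not (outside_threshold and current < target_size):
        return None
    # Identify a unit currently operating at the second density.
    unit = next((u for u in second_region
                 if u.bits_per_cell == second_density), None)
    if unit is None:
        return None
    second_region.remove(unit)
    unit.bits_per_cell = first_density  # operates at the first density now
    first_region.append(unit)
    return unit


if __name__ == "__main__":
    first = [MemoryUnit(0, 1)]
    second = [MemoryUnit(1, 2), MemoryUnit(2, 2)]
    print(rebalance_step(first, second, target_size=3, threshold=1,
                         first_density=1, second_density=2))
```

Calling the step repeatedly grows the first region until its size is within the threshold of the target, at which point it returns `None`.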

25 citations


Patent
27 Jul 2010
TL;DR: A mechanism approximates data switching activity by summing, across a set of monitored data storage devices, the counts of identified bits that have changed state; a power manager then uses the approximation to adjust operational parameters of the data processing system.
Abstract: A mechanism is provided for approximating data switching activity in a data processing system. A data switching activity identification mechanism in the data processing system receives an identification of a set of data storage devices and a set of bits in the set of data storage devices in the data processing system to be monitored for the data switching activity. The data switching activity identification mechanism sums a count of the identified bits that have changed state for the data storage device along with other counts of the identified bits that have changed state for other data storage devices in the set of data storage devices to form an approximation of data switching activity. A power manager in the data processing system then adjusts a set of operational parameters associated with the data processing system using the approximation of data switching activity.
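The summation described in this abstract amounts to counting changed bits among the monitored positions of each storage device and adding the counts together. The sketch below assumes each device is represented by an integer value and a bit mask selecting the monitored bits; the function name `switching_activity` and the tuple layout are hypothetical, and the power manager's response is not modeled.

```python
def switching_activity(samples):
    """Approximate switching activity over monitored storage devices.

    samples: iterable of (previous_value, current_value, monitored_mask)
    tuples, one per device (an assumed representation).
    """
    total = 0
    for previous, current, mask in samples:
        changed = (previous ^ current) & mask  # monitored bits that flipped
        total += bin(changed).count("1")       # count the flipped bits
    return total


if __name__ == "__main__":
    devices = [
        (0b1010, 0b0110, 0b1111),  # two monitored bits changed
        (0xFF, 0x0F, 0xF0),        # four monitored bits changed
    ]
    print(switching_activity(devices))
```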

4 citations


Patent
Moinuddin K. Qureshi
04 Feb 2010
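TL;DR: Upon a cache miss for a data item, a cache snoop and an access to physical memory are initiated in parallel if the indicator bit of a saturation counter is a first predetermined bit (one (1) or zero (0)), whereas a cache snoop is initiated if the bit is a second predetermined bit.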
Abstract: An apparatus for memory access prediction which includes a plurality of processors, a plurality of memory caches associated with the processors, a plurality of saturation counters associated with the processors, each of the saturation counters having an indicator bit, and a physical memory shared with the processors, saturation counters and memory caches. Upon a cache miss for a data item, a cache snoop and access to physical memory are initiated in parallel for the data item if the indicator bit is a first predetermined bit (one (1) or zero (0)) whereas a cache snoop is initiated if the most significant bit is a second predetermined bit (zero (0) or one (1)).
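The decision described in this abstract can be sketched with a small saturating counter. Several details here are assumptions rather than statements from the patent: the indicator bit is taken to be the counter's most significant bit, the first predetermined bit is taken to be one, and the update rule (count up when the snoop finds nothing, count down when it hits) is invented for illustration. The class name `MissPredictor` and its methods are hypothetical.

```python
class MissPredictor:
    """Saturating counter whose top bit selects the action on a cache miss."""

    def __init__(self, bits=2):
        self.max_value = (1 << bits) - 1
        self.msb_mask = 1 << (bits - 1)
        self.value = 0

    def indicator_bit(self):
        # Assumption: the indicator bit is the most significant bit.
        return 1 if self.value & self.msb_mask else 0

    def on_cache_miss(self):
        # Assumption: indicator bit 1 means "start the memory access in parallel".
        if self.indicator_bit() == 1:
            return ("snoop", "memory")
        return ("snoop",)

    def update(self, snoop_hit):
        # Assumed training rule, not taken from the abstract.
        if snoop_hit:
            self.value = max(0, self.value - 1)
        else:
            self.value = min(self.max_value, self.value + 1)


if __name__ == "__main__":
    predictor = MissPredictor()
    print(predictor.on_cache_miss())
    predictor.update(snoop_hit=False)
    predictor.update(snoop_hit=False)
    print(predictor.on_cache_miss())
```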

4 citations