Author

Alaa R. Alameldeen

Bio: Alaa R. Alameldeen is an academic researcher at Intel whose work centers on cache and CPU cache design. He has an h-index of 29 and has co-authored 86 publications receiving 5,455 citations. His previous affiliations include the Wisconsin Alumni Research Foundation and Simon Fraser University.


Papers
Journal ArticleDOI
TL;DR: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers; the toolset includes timing simulator modules for modeling the timing of the memory system and microprocessors.
Abstract: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under GNU GPL [9].

1,515 citations

Journal ArticleDOI
02 Mar 2004
TL;DR: An adaptive policy that dynamically adapts to the costs and benefits of cache compression is developed and it is shown that compression can improve performance for memory-intensive commercial workloads by up to 17%.
Abstract: Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access latency, potentially degrading performance. In this paper, we develop an adaptive policy that dynamically adapts to the costs and benefits of cache compression. We propose a two-level cache hierarchy where the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way set-associative with LRU replacement, where each set can store up to eight compressed lines but has space for only four uncompressed lines. On each L2 reference, the LRU stack depth and compressed size determine whether compression (could have) eliminated a miss or incurs an unnecessary decompression overhead. Based on this outcome, the adaptive policy updates a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. We evaluate adaptive cache compression using full-system simulation and a range of benchmarks. We show that compression can improve performance for memory-intensive commercial workloads by up to 17%. However, always using compression hurts performance for low-miss-rate benchmarks, due to unnecessary decompression overhead, degrading performance by up to 18%. By dynamically monitoring workload behavior, the adaptive policy achieves comparable benefits from compression, while never degrading performance by more than 0.4%.
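The single global saturating counter is the heart of this policy. The minimal sketch below shows one plausible realization, assuming an 8-way set where LRU depths 0-3 would hit even with uncompressed storage and depths 4-7 hit only if lines are compressed; the counter width, threshold, and unit-weight updates are illustrative assumptions, not the paper's tuned mechanism:

```python
# Minimal sketch of an adaptive cache-compression predictor.
# Counter range, threshold, and update weights are assumed values.

COUNTER_MAX = 255            # assumed saturating-counter range
THRESHOLD = COUNTER_MAX // 2

counter = THRESHOLD          # single global predictor


def update_policy(lru_depth, line_is_compressed):
    """Update the predictor on an L2 reference.

    lru_depth: LRU stack depth of the referenced line (0 = MRU).
    line_is_compressed: whether the line is stored compressed.
    """
    global counter
    if lru_depth >= 4 and line_is_compressed:
        # Compression (could have) eliminated a miss: bias toward compressing.
        counter = min(COUNTER_MAX, counter + 1)
    elif lru_depth < 4 and line_is_compressed:
        # The line would have hit uncompressed anyway; compression only adds
        # decompression latency here: bias away from compressing.
        counter = max(0, counter - 1)


def should_compress():
    """Predict whether to allocate new L2 lines in compressed form."""
    return counter >= THRESHOLD
```

A real implementation would likely weight the increment and decrement by the relative costs of a miss and a decompression; the unit weights here keep the sketch simple.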

304 citations

Journal ArticleDOI
01 Jun 2008
TL;DR: Two architectural techniques are proposed that enable microprocessor caches (L1 and L2), to operate at low voltages despite very high memory cell failure rates, and enable a 40% voltage reduction, which reduces power by 85% and energy per instruction (EPI) by 53%.
Abstract: One of the most effective techniques to reduce a processor's power consumption is to reduce supply voltage. However, reducing voltage in the context of manufacturing-induced parameter variations can cause many types of memory circuits to fail. As a result, voltage scaling is limited by a minimum voltage, often called Vccmin, beyond which circuits may not operate reliably. Large memory structures (e.g., caches) typically set Vccmin for the whole processor. In this paper, we propose two architectural techniques that enable microprocessor caches (L1 and L2) to operate at low voltages despite very high memory cell failure rates. The Word-disable scheme combines two consecutive cache lines to form a single cache line where only non-failing words are used. The Bit-fix scheme uses a quarter of the ways in a cache set to store positions and fix bits for failing bits in other ways of the set. During high-voltage operation, both schemes allow use of the entire cache. During low-voltage operation, they sacrifice cache capacity by 50% and 25%, respectively, to reduce Vccmin below 500 mV. Compared to current designs with a Vccmin of 825 mV, our schemes enable a 40% voltage reduction, which reduces power by 85% and energy per instruction (EPI) by 53%.
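As a rough illustration of the Word-disable idea, the sketch below fuses two consecutive physical lines, each with a per-word failure mask, into one logical line built only from functional words. The line geometry, names, and the simple first-fit packing are assumptions; the actual scheme fixes how logical words map onto non-failing physical positions:

```python
# Illustrative sketch of word-disable: two consecutive physical cache
# lines, each with a mask of words that fail at low voltage, are fused
# into a single logical line built only from non-failing words.

WORDS_PER_LINE = 16  # e.g., a 64-byte line of 4-byte words (assumed)


def fuse_lines(line_a, line_b, fail_a, fail_b):
    """Build one logical line from the functional words of two physical
    lines. fail_a/fail_b are per-word booleans (True = cell failure).
    Returns None if the pair cannot supply enough good words."""
    good = [w for w, bad in zip(line_a, fail_a) if not bad]
    good += [w for w, bad in zip(line_b, fail_b) if not bad]
    if len(good) < WORDS_PER_LINE:
        return None  # this pair must be disabled entirely
    return good[:WORDS_PER_LINE]


# Example: each physical line loses a few words, but the pair still
# yields one full logical line (hence the 50% capacity sacrifice).
line_a = list(range(16))
line_b = list(range(16, 32))
fail_a = [i % 5 == 0 for i in range(16)]   # 4 failing words
fail_b = [i % 7 == 0 for i in range(16)]   # 3 failing words
print(fuse_lines(line_a, line_b, fail_a, fail_b))
```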

289 citations

Proceedings ArticleDOI
08 Feb 2003
TL;DR: It is shown that variability in multi-threaded commercial workloads can lead to incorrect architectural conclusions (e.g., 31% of the time in one experiment), and a methodology is proposed, based on multiple simulations and standard statistical techniques, to compensate for variability.
Abstract: Multi-threaded commercial workloads implement many important Internet services. Consequently, these workloads are increasingly used to evaluate the performance of uniprocessor and multiprocessor system designs. This paper identifies performance variability as a potentially major challenge for architectural simulation studies using these workloads. Variability refers to the differences between multiple estimates of a workload's performance. Time variability occurs when a workload exhibits different characteristics during different phases of a single run. Space variability occurs when small variations in timing cause runs starting from the same initial condition to follow widely different execution paths. Variability is a well-known phenomenon in real systems, but is nearly universally ignored in simulation experiments. In a central result of this paper, we show that variability in multi-threaded commercial workloads can lead to incorrect architectural conclusions (e.g., 31% of the time in one experiment). We propose a methodology, based on multiple simulations and standard statistical techniques, to compensate for variability. Our methodology greatly reduces the probability of reaching incorrect conclusions, while enabling simulations to finish within reasonable time limits.
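A minimal sketch of the statistical side of such a methodology, assuming a normal approximation and fabricated runtimes; the paper's actual procedure may differ in detail:

```python
# Sketch: run each configuration several times with small timing
# perturbations, then compare designs using confidence intervals on the
# mean instead of single runs. The runtimes are fabricated and the
# normal approximation (z = 1.96) is an assumed simplification.

import math
import random


def mean_and_ci(samples, z=1.96):
    """Return the sample mean and an ~95% confidence half-width."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, z * math.sqrt(var / n)


# Ten perturbed runs of two hypothetical designs (cycles, made up):
runs_a = [random.gauss(100.0, 3.0) for _ in range(10)]
runs_b = [random.gauss(97.0, 3.0) for _ in range(10)]

mean_a, ci_a = mean_and_ci(runs_a)
mean_b, ci_b = mean_and_ci(runs_b)

# Conclude B is faster only if the intervals do not overlap.
if mean_b + ci_b < mean_a - ci_a:
    print("B faster with ~95% confidence")
else:
    print("No significant difference; more runs needed")
```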

252 citations

01 Jan 2004
TL;DR: This work proposes and evaluates a simple significance-based compression scheme that has a low compression and decompression overhead and provides comparable compression ratios to more complex schemes that have higher cache hit latencies.
Abstract: With the widening gap between processor and memory speeds, memory system designers may find cache compression beneficial to increase cache capacity and reduce off-chip bandwidth. Most hardware compression algorithms fall into the dictionary-based category, which depends on building a dictionary and using its entries to encode repeated data values. Such algorithms are effective in compressing large data blocks and files. Cache lines, however, are typically short (32-256 bytes), and a per-line dictionary imposes a significant overhead that limits the compressibility and increases the decompression latency of such algorithms. For such short lines, significance-based compression is an appealing alternative. We propose and evaluate a simple significance-based compression scheme that has low compression and decompression overheads. This scheme, Frequent Pattern Compression (FPC), compresses individual cache lines on a word-by-word basis by storing common word patterns in a compressed format accompanied by an appropriate prefix. For a 64-byte cache line, compression can be completed in three cycles and decompression in five cycles, assuming 12 FO4 gate delays per cycle. We propose a compressed cache design in which data are stored in compressed form in the L2 cache but uncompressed in the L1 cache. L2 cache lines are compressed to predetermined sizes that never exceed their original size, to reduce decompression overhead. This simple scheme provides compression ratios comparable to more complex schemes that have higher cache hit latencies.
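The toy encoder below works in the spirit of FPC: each 32-bit word is matched against a few frequent patterns and emitted as a short prefix plus only the payload bits that pattern needs. The specific prefixes and pattern set are illustrative, not the paper's exact encoding:

```python
# Toy significance-based encoder in the spirit of FPC. Each 32-bit word
# gets a 3-bit prefix naming a pattern plus a variable-size payload.
# Prefix values and the pattern set are illustrative assumptions.

def to_signed(w):
    """Interpret a 32-bit value as a signed integer."""
    return w - (1 << 32) if w & (1 << 31) else w


def encode_word(w):
    """Return (prefix, payload_bits) for one 32-bit word."""
    if w == 0:
        return "000", 0            # all-zero word
    s = to_signed(w)
    if -128 <= s <= 127:
        return "001", 8            # sign-extended 8-bit value
    if -32768 <= s <= 32767:
        return "010", 16           # sign-extended 16-bit value
    if w == (w & 0xFF) * 0x01010101:
        return "011", 8            # one byte repeated four times
    return "111", 32               # uncompressible word


def compressed_bits(words):
    """Total compressed size in bits for a line of 32-bit words."""
    return sum(3 + payload for _, payload in map(encode_word, words))


# Example: a 64-byte line (sixteen 32-bit words), mostly zeros and
# small values, compresses to a fraction of its original 512 bits.
line = [0, 0, 5, 0xFFFFFFFF, 0x7F7F7F7F, 0, 300, 0] * 2
print(compressed_bits(line), "bits vs", 16 * 32)
```

A hardware implementation would match all words in parallel and round the result up to one of the predetermined line sizes; the sequential loop here is purely for clarity.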

233 citations


Cited by
Journal ArticleDOI
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

4,039 citations

Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taking into account configuring clusters with 4 cores gives thebest EDA2P and EDAP.
Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like the energy-delay-area^2 product (EDA2P) and the energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA2P and EDAP.
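For concreteness, the composite metrics named above multiply energy and delay by increasing powers of area, so weighting area more heavily can flip a ranking. The numbers in this sketch are invented solely to show that effect; they are not McPAT results:

```python
# Worked example of the composite metrics named above (lower is better).
# Both configurations and all numbers are made up to show how adding
# area terms can flip a ranking; they are not results from the paper.

def edp(e, d):        # energy-delay product
    return e * d

def edap(e, d, a):    # energy-delay-area product
    return e * d * a

def eda2p(e, d, a):   # energy-delay-area^2 product
    return e * d * a * a

cfg8 = (1.00, 1.00, 1.30)   # 8-core clusters: fast, but larger area
cfg4 = (1.05, 1.02, 1.00)   # 4-core clusters: slightly slower, smaller

print("EDP  :", edp(*cfg8[:2]), "vs", edp(*cfg4[:2]))    # 8-core wins
print("EDAP :", edap(*cfg8), "vs", edap(*cfg4))          # 4-core wins
print("EDA2P:", eda2p(*cfg8), "vs", eda2p(*cfg4))        # 4-core wins
```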

2,487 citations

01 Jan 2005
TL;DR: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers and has released a set of timing simulator modules for modeling the timing of the memory system and microprocessors.

1,464 citations