
Showing papers in "ACM Sigarch Computer Architecture News in 2000"


Journal ArticleDOI
TL;DR: The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation.
Abstract: The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation...

59 citations


Journal ArticleDOI
TL;DR: This work focuses on spatial locality optimization such that all the data that are loaded as a block in the cache will be used successively by the program.
Abstract: One of the most effective ways to improve program performance on today's computers is to optimize the way cache memories are used. In particular, many scientific applications contain loop nests that operate on large multi-dimensional arrays whose sizes are often parameterized. No special attention is paid to cache memory performance when such loops are written. In this work, we focus on spatial locality optimization such that all the data that are loaded as a block in the cache will be used successively by the program. Our method consists of providing a new array reference evaluation function to the compiler, such that the data layout corresponds exactly to the utilization order of these data. The computation of this function draws on the theory of parameterized polyhedra and Ehrhart polynomials.

54 citations
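
The core idea, matching the array's memory layout to the order in which the loop visits elements, can be sketched without the paper's parameterized-polyhedra machinery. The C++ fragment below is a minimal illustration under invented names; the authors instead derive the layout function automatically via Ehrhart polynomials.

```cpp
#include <cstddef>
#include <vector>

// Column-major layout function: linear index of element (i, j) of an
// n x m array when columns are stored contiguously.
inline std::size_t col_major(std::size_t i, std::size_t j, std::size_t n) {
    return j * n + i;   // consecutive values of i are adjacent in memory
}

// The loop nest visits the array column by column.  Because the layout
// function matches that order, the inner loop touches consecutive
// addresses and every element of a fetched cache block is used before
// the block is evicted; a row-major layout would stride by m elements.
void scale_by_columns(std::vector<double>& a, std::size_t n, std::size_t m) {
    for (std::size_t j = 0; j < m; ++j)
        for (std::size_t i = 0; i < n; ++i)
            a[col_major(i, j, n)] *= 2.0;
}

int main() {
    std::vector<double> a(64 * 64, 1.0);
    scale_by_columns(a, 64, 64);
}
```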


Journal ArticleDOI
TL;DR: In this article, the authors describe the development of integrated, low power, CMOS communication devices and sensors, which makes a rich design space of networked sensors viable and can be deeply embedded in the physical world.
Abstract: Technological progress in integrated, low-power, CMOS communication devices and sensors makes a rich design space of networked sensors viable. They can be deeply embedded in the physical world and ...

38 citations


Journal ArticleDOI
TL;DR: Communication in cache-coherent distributed shared memory often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads; this paper proposes last-touch prediction as a way to reduce these costs.
Abstract: Communication in cache-coherent distributed shared memory (DSM) often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads. This paper proposes Last-Tou...

15 citations
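
As a generic, hedged sketch of the last-touch idea (not necessarily the authors' mechanism), a predictor can learn which (block, PC) pairs historically preceded an invalidation and self-invalidate eagerly when the pair recurs:

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

struct LastTouchPredictor {
    std::unordered_set<uint64_t> last_touch_sigs;   // trained signatures
    std::unordered_map<uint32_t, uint32_t> last_pc; // block -> last accessor PC

    static uint64_t sig(uint32_t block, uint32_t pc) {
        return (static_cast<uint64_t>(block) << 32) | pc;
    }

    // On every access: record the PC and predict whether this access is
    // the block's last touch, so the copy can be flushed early instead of
    // waiting for a remote invalidation message.
    bool access(uint32_t block, uint32_t pc) {
        last_pc[block] = pc;
        return last_touch_sigs.count(sig(block, pc)) != 0;
    }

    // On a real invalidation: train with the PC that touched the block last.
    void invalidated(uint32_t block) {
        auto it = last_pc.find(block);
        if (it != last_pc.end())
            last_touch_sigs.insert(sig(block, it->second));
    }
};

int main() {
    LastTouchPredictor p;
    p.access(7, 0x400);                 // first encounter: no prediction yet
    p.invalidated(7);                   // train: PC 0x400 was the last touch
    return p.access(7, 0x400) ? 0 : 1;  // the repeat is now predicted
}
```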


Journal ArticleDOI
TL;DR: This paper evaluates the performance impact of inline caches and type feedback in an actual Java virtual machine, the new open-source Java VM JIT compiler LaTTe, and finds that while monomorphic and polymorphic inline caches achieve speedups of a geometric mean of 3% and 9% respectively, type feedback cannot improve further over polymorphic inline caches and even degrades the performance for some programs.
Abstract: Java, an object-oriented language, uses virtual methods to support the extension and reuse of classes. Unfortunately, virtual method calls affect performance and thus require an efficient implementation, especially when just-in-time (JIT) compilation is done. Inline caches and type feedback are solutions used by compilers for dynamically-typed object-oriented languages such as SELF [1, 2, 3], where virtual call overheads are much more critical to performance than in Java. With an inline cache, a virtual call that would otherwise have been translated into an indirect jump with two loads is translated into a simpler direct jump with a single compare. With type feedback combined with adaptive compilation, virtual methods can be inlined using checking code which verifies that the target method is equal to the inlined one. This paper evaluates the performance impact of these techniques in an actual Java virtual machine, our new open-source Java VM JIT compiler called LaTTe [4]. We also discuss the engineering issues in implementing these techniques. Our experimental results with the SPECjvm98 benchmarks indicate that while monomorphic inline caches and polymorphic inline caches achieve speedups of as much as a geometric mean of 3% and 9% respectively, type feedback cannot improve further over polymorphic inline caches and even degrades the performance for some programs.

12 citations
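
The inline-cache transformation, an indirect vtable dispatch replaced by one compare plus a direct call, can be modeled in C++. LaTTe itself patches machine code at the call site; the Circle specialization below is hard-wired only because C++ cannot generate code at runtime.

```cpp
#include <cstdio>
#include <typeinfo>

struct Shape { virtual ~Shape() = default; virtual double area() const = 0; };
struct Circle : Shape {
    double r;
    explicit Circle(double r_) : r(r_) {}
    double area() const override { return 3.14159 * r * r; }
};
struct Square : Shape {
    double s;
    explicit Square(double s_) : s(s_) {}
    double area() const override { return s * s; }
};

// One call site's monomorphic inline cache: remember the receiver type
// seen last time; if it is still Circle, dispatch with one compare and a
// direct, inlinable call instead of an indirect vtable jump.
double area_ic(const Shape* obj) {
    static const std::type_info* cached = nullptr;
    if (cached && *cached == typeid(Circle) && typeid(*obj) == *cached)
        return static_cast<const Circle*>(obj)->area();   // fast path
    cached = &typeid(*obj);   // cache miss: remember the new receiver type
    return obj->area();       // slow path: normal virtual dispatch
}

int main() {
    Circle c{2.0};
    Square s{3.0};
    std::printf("%.2f %.2f %.2f\n", area_ic(&c), area_ic(&c), area_ic(&s));
}
```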


Journal ArticleDOI
TL;DR: By combining reordering with aggressive inlining, a larger executable image produced through inlining can be effectively remapped onto the cache address space, while not noticeably increasing the instruction cache miss rate.
Abstract: Memory hierarchy performance has always been an important issue in computer architecture design. The likelihood of a bottleneck in the memory hierarchy is increasing, as improvements in microprocessor performance continue to outpace those made in the memory system. As a result, effective utilization of cache memories is essential in today's architectures. The nature of procedural software poses visibility problems when attempting to perform program optimization. One approach to increasing visibility in procedural design is to perform procedure inlining. The main downside of using inlining is that inlined procedures can place excess pressure on the instruction cache. To address this issue we apply code reordering. By combining reordering with aggressive inlining, a larger executable image produced through inlining can be effectively remapped onto the cache address space, while not noticeably increasing the instruction cache miss rate. In this paper, we evaluate our ability to perform aggressive inlining by employing cache line coloring. We have implemented three variations of our coloring algorithm in the Alto toolset and compare them against Alto's aggressive basic block reordering algorithms. Alto allows us to generate optimized executables that can be run on hardware to generate results. We find that by using our algorithms, we can achieve up to a 21% reduction in execution time over the base Compaq optimizing compiler, and a 6.4% reduction when compared to Alto's interprocedural basic block reordering algorithm.

10 citations
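
A toy version of the coloring constraint (our sketch; Alto's algorithms are more sophisticated): treat each I-cache line a procedure covers as a color, and place a hot callee so its colors are disjoint from its caller's.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint32_t kLineBytes = 32;
constexpr uint32_t kCacheLines = 512;   // direct-mapped 16 KB I-cache

// The set of cache lines (colors) a procedure covers when placed at addr.
std::vector<uint32_t> colors(uint32_t addr, uint32_t size) {
    std::vector<uint32_t> c;
    for (uint32_t a = addr; a < addr + size; a += kLineBytes)
        c.push_back((a / kLineBytes) % kCacheLines);
    return c;
}

// Greedy placement: slide the callee forward a line at a time until its
// colors avoid the caller's, so a hot call/return pair cannot evict each
// other from the I-cache.  Assumes the caller is smaller than the cache.
uint32_t place_after(uint32_t caller_addr, uint32_t caller_size,
                     uint32_t callee_size) {
    uint32_t addr = caller_addr + caller_size;
    std::vector<uint32_t> caller = colors(caller_addr, caller_size);
    for (uint32_t tries = 0; tries < kCacheLines; ++tries) {
        bool conflict = false;
        for (uint32_t col : colors(addr, callee_size)) {
            for (uint32_t cc : caller)
                if (col == cc) { conflict = true; break; }
            if (conflict) break;
        }
        if (!conflict) return addr;
        addr += kLineBytes;   // shift by one line (leaves a gap) and retry
    }
    return addr;   // caller covers the whole cache; conflicts unavoidable
}

int main() {
    // Hot caller at 0x1000, 1 KB long; find a start for a 512-byte callee.
    std::printf("callee placed at %#x\n", place_after(0x1000, 1024, 512));
}
```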


Journal ArticleDOI
TL;DR: Effective techniques that exploit properties of the particular polyhedra generated by the CME are presented; they reduce the complexity of solving the CME, yielding a significant speed-up over traditional methods.
Abstract: Cache Miss Equations (CME) [GMM97] is a method that accurately describes the cache behavior by means of polyhedra. Even though the computational cost of generating the CME is a linear function of the number of references, solving them is very time consuming, so studying a whole program may be infeasible. In this work, we present effective techniques that exploit some properties of the particular polyhedra generated by the CME. These techniques reduce the complexity of the algorithm used to solve the CME, resulting in a significant speed-up compared with traditional methods. In particular, the proposed approach does not require the computation of the vertices of each polyhedron, which has exponential complexity.

6 citations
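
A cache miss equation describes, for each iteration point, when two references collide in the cache. The toy below enumerates the condition directly for a direct-mapped cache (the paper instead solves it symbolically over polyhedra); the addresses and loop are invented for illustration.

```cpp
#include <cstdint>
#include <cstdio>

// Direct-mapped cache parameters: 32-byte lines, 256 sets (8 KB).
constexpr uint64_t kLine = 32, kSets = 256;

uint64_t set_of(uint64_t addr) { return (addr / kLine) % kSets; }

int main() {
    // Model of:  for (i = 0; i < N; ++i) x += A[i] + B[i];  (8-byte elems)
    const uint64_t A = 0x10000, B = 0x30000, N = 1024, kElem = 8;
    uint64_t conflicts = 0;
    for (uint64_t i = 0; i < N; ++i) {
        // Conflict-miss condition of the corresponding CME: both
        // references of one iteration fall into the same cache set.
        if (set_of(A + i * kElem) == set_of(B + i * kElem))
            ++conflicts;
    }
    // With these (invented) base addresses the arrays are 128 KB apart,
    // so they collide in every iteration.
    std::printf("iterations with A/B set conflicts: %llu of %llu\n",
                (unsigned long long)conflicts, (unsigned long long)N);
}
```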


Journal ArticleDOI
TL;DR: Caching and other latency-tolerating techniques have been quite successful in maintaining high memory system performance for general-purpose processors, as discussed by the authors; however, TLB misses have become a serious problem for many applications.
Abstract: Caching and other latency tolerating techniques have been quite successful in maintaining high memory system performance for general purpose processors. However, TLB misses have become a serious bo...

6 citations


Journal ArticleDOI
TL;DR: Because the biggest barrier to detecting independence in irregular codes is memory disambiguation, the goals of this paper are to identify memory-independent tasks using a profile-based approach and to measure the amount of STP by estimating the amount of memory-independent instructions those tasks expose.
Abstract: Tomorrow's microprocessors will be able to handle multiple flows of control. Applications that exhibit task-level parallelism (TLP) and can be decomposed into parallel tasks will perform well on these platforms. TLP arises when a task is independent of its neighboring code. Traditional parallel compilers exploit one variety of TLP, loop-level parallelism (LLP), where loop iterations are executed in parallel. LLP is overwhelmingly found in numeric, typically FORTRAN, programs with regular patterns of data accesses. In contrast, irregular applications, typified by general-purpose integer applications, exhibit little LLP as they tend to access data in irregular patterns through pointers. Without pointer disambiguation to analyze data access dependences, traditional parallel compilers cannot parallelize these irregular applications and ensure correct execution. We focus on a different variety of TLP, namely Speculative Task Parallelism (STP). STP arises when a task (either a leaf procedure, a non-leaf procedure, or an entire loop) is control- and memory-independent of its preceding code, and thus could be executed in parallel. Two sections of code are memory-independent when neither contains a store to a memory location that the other accesses. To exploit STP, we assume a hypothetical speculative machine that supports speculative futures (a parallel programming construct that executes a task early on a different thread or processor) with mechanisms for resolving incorrect speculation when the task is not, after all, independent. This allows us to speculatively parallelize code when there is a high probability of independence, but no guarantee. Figure 1 illustrates STP, showing a task Y in the dynamic instruction stream of an irregular application that has no memory access conflicts with a group of instructions, X, that precede Y. The shorter of X and Y determines the overlap of memory-independent instructions, as seen in Figures 1(b) and 1(c). In the absence of any register dependences, X and Y may be executed in parallel, resulting in shorter execution time. It is hard for traditional parallel compilers of pointer-based languages to expose this parallelism. The goals of this paper are to identify such regions as X and Y within irregular applications and to find the number of instructions that may thus be removed from the critical path. This number represents the maximum STP when the cost of exploiting STP is zero. Because the biggest barrier to detecting independence in irregular codes is memory disambiguation, we identify memory-independent tasks using a profile-based approach and measure the amount of STP by estimating the amount of memory-independent instructions those tasks expose. We vary the level of control dependence and memory dependence to investigate their effect on the amount of memory-independence we find. We profile at different memory granularities and introduce synchronization to expose higher levels of memory-independence. Across this variety of speculation assumptions, 7 to 22% of dynamic instructions are within tasks that are found to be memory-independent. These results are for the SPECint95 benchmarks, a set of irregular applications for which traditional methods of parallelization are ineffective.

6 citations
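
The paper's memory-independence criterion is directly checkable from a profile. A minimal sketch (the names are ours): record the load and store address sets of candidate regions X and Y and test that neither's stores intersect the other's accesses; coarser profiling granularities correspond to masking low address bits before insertion.

```cpp
#include <cstdint>
#include <unordered_set>

// Per-region memory profile; granularity can be coarsened by masking low
// address bits before insertion (the paper varies this).
struct MemProfile {
    std::unordered_set<uint64_t> loads, stores;
    void load(uint64_t a)  { loads.insert(a); }
    void store(uint64_t a) { stores.insert(a); }
};

// The paper's criterion: X and Y are memory-independent when neither
// contains a store to a location that the other accesses.
bool memory_independent(const MemProfile& x, const MemProfile& y) {
    for (uint64_t a : x.stores)
        if (y.loads.count(a) || y.stores.count(a)) return false;
    for (uint64_t a : y.stores)
        if (x.loads.count(a) || x.stores.count(a)) return false;
    return true;   // Y could run speculatively as a future alongside X
}

int main() {
    MemProfile x, y;
    x.store(0x100); x.load(0x104);
    y.load(0x200);  y.store(0x204);
    return memory_independent(x, y) ? 0 : 1;   // disjoint sets: independent
}
```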


Journal ArticleDOI
TL;DR: A cheap and flexible way to design efficient consistency approaches for SVM systems is proposed and an execution-driven simulator aimed at studying the behavior of memory consistency models is developed.
Abstract: Shared Virtual Memory (SVM) systems are a class of Distributed Shared Memory (DSM) systems. These systems offer a shared-memory programming model that is more intuitive than the message-passing paradigm. Other advantages include low hardware and maintenance costs. This paper introduces a new simulation environment for such architectures. The developed tool is an execution-driven simulator aimed at studying the behavior of memory consistency models, with the exception of those needing compiler modifications. Thus, we propose a cheap and flexible way to design efficient consistency approaches for SVM systems.

5 citations


Journal ArticleDOI
TL;DR: Multiprocessing is already prevalent in servers, where multiple clients present an obvious source of thread-level parallelism, but as discussed by the authors, the case for multiprocessing is less clear for desktop applications.
Abstract: Multiprocessing is already prevalent in servers where multiple clients present an obvious source of thread-level parallelism. However, the case for multiprocessing is less clear for desktop applica...

Journal ArticleDOI
TL;DR: Higher microprocessor frequencies accentuate the performance cost of memory accesses; this is especially noticeable in Intel's IA32 architecture, where the lack of registers results in an increased number of memory accesses.
Abstract: Higher microprocessor frequencies accentuate the performance cost of memory accesses. This is especially noticeable in Intel's IA32 architecture where lack of registers results in increased num...

Journal ArticleDOI
TL;DR: This paper addresses the exploitation of profile information in adaptive systems such as just-in-time compilers, dynamic optimizers, and binary translators.
Abstract: Recently, there has been a growing interest in exploiting profile information in adaptive systems such as just-in-time compilers, dynamic optimizers, and binary translators. In this paper, we show ...

Journal ArticleDOI
TL;DR: This work applies Simultaneous Speculation Scheduling to situations where purely static techniques cannot prove data independence; it can be seen as a cost-effective alternative to purely dynamic speculation techniques.
Abstract: Simultaneous Speculation Scheduling (S3) is a combined compiler and architecture technique to control multiple-path execution. It can be used for dual-path branch speculation in the case of unpredictable branches and for multiple-path speculative execution of loop iterations in the case of loop-carried dependences that would otherwise make parallel execution impossible. We apply S3 in situations where purely static techniques cannot prove data independence. S3 can be seen as a cost-effective alternative to purely dynamic speculation techniques. We explain the S3 technique and discuss the requirements on possible target architectures. We further compare S3 to other speculation techniques.

Journal ArticleDOI
TL;DR: It is shown that predicated class testing can reduce the direct cost of virtual function calls by 43% on average, which, compared with other class testing techniques, is a performance improvement of up to 31% per function call.
Abstract: Runtime class testing is a technique whereby virtual function calls are transformed into statically-bound function calls through a series of conditional branches. Through this transformation, the overhead of virtual function calls can be significantly reduced. However, the drawback of these tests is that by relying on conditional branches, the amount of instruction-level parallelism (ILP) is limited and the mispredict penalties can be relatively high. We show that by using predication during class testing, these drawbacks can be eliminated, and the benefits of class testing can be improved upon. Predication is a supported feature of the Explicitly Parallel Instruction Computing (EPIC) architecture; it converts control dependencies into data dependencies and thus eliminates the mispredict penalties. With analytical cost models and experimental results, we show that predicated class testing can reduce the direct cost of virtual function calls by 43% on average, which, compared with other class testing techniques, is a performance improvement of up to 31% per function call. These results are based on architectural specifications that most modern architectures will exceed, and are considered conservative. The actual speedups resulting from predicated class testing are expected to be higher.
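
The class-test transformation itself is easy to show in C++; true predication is an EPIC hardware feature with no C++ equivalent, so the sketch below keeps the conditional branch and only notes where predicates would replace it.

```cpp
#include <cstdio>
#include <typeinfo>

struct Base {
    virtual ~Base() = default;
    virtual int f() const { return 1; }
};
struct Derived : Base {
    int f() const override { return 2; }
};

// Runtime class test: the virtual call is replaced by a type check plus a
// statically bound (and therefore inlinable) call.  On an EPIC machine the
// two arms would be guarded by complementary predicates p and !p instead
// of a conditional branch, removing the misprediction penalty; standard
// C++ can only express the branching form shown here.
int f_class_tested(const Base* obj) {
    if (typeid(*obj) == typeid(Derived))                       // class test
        return static_cast<const Derived*>(obj)->Derived::f(); // direct call
    return obj->f();                                           // fallback
}

int main() {
    Derived d;
    Base b;
    std::printf("%d %d\n", f_class_tested(&d), f_class_tested(&b));
}
```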

Journal ArticleDOI
TL;DR: The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems.
Abstract: The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems. Cryptography provides the mechanisms nec...

Journal ArticleDOI
TL;DR: In this article, the authors studied the behavior of programs in the SPECint95 suite and observed that six out of eight programs exhibit a new kind of value locality, the frequent value locality.
Abstract: By studying the behavior of programs in the SPECint95 suite we observed that six out of eight programs exhibit a new kind of value locality, the frequent value locality, according to which a few va...
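
A hedged sketch of how frequent value locality can be measured (our toy profiler, not the authors' infrastructure): tally each value seen in an access trace and report the coverage of the top few.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    // Invented toy access trace; a real study would instrument loads/stores.
    std::vector<uint32_t> trace = {0, 0, 1, 0, 42, 0, 1, 0, 7, 0, 1, 0};

    std::unordered_map<uint32_t, uint64_t> freq;
    for (uint32_t v : trace) ++freq[v];

    // Rank values by how often they occur.
    std::vector<std::pair<uint64_t, uint32_t>> ranked;  // (count, value)
    for (const auto& kv : freq) ranked.push_back({kv.second, kv.first});
    std::sort(ranked.rbegin(), ranked.rend());          // descending by count

    // How much of the traffic do the two most frequent values cover?
    std::size_t top = std::min<std::size_t>(2, ranked.size());
    uint64_t covered = 0;
    for (std::size_t k = 0; k < top; ++k) covered += ranked[k].first;
    std::printf("top %zu values cover %llu of %zu accesses\n",
                top, (unsigned long long)covered, trace.size());
}
```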

Journal ArticleDOI
TL;DR: Simulation is the primary method for evaluating computer systems during all phases of the design process, but, as the authors point out, one significant problem with simulation is that it rarely models the system exactly.
Abstract: Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and qu...

Journal ArticleDOI
TL;DR: In this article, the authors point out that building systems such as OS kernels and embedded software is difficult, and that an important source of this difficulty is the numerous rules they must obey: interrupts cannot be disabled for too long.
Abstract: Building systems such as OS kernels and embedded software is difficult. An important source of this difficulty is the numerous rules they must obey: interrupts cannot be disabled for "too long," gl...

Journal ArticleDOI
TL;DR: A method for integrating software pipelining and load-store elimination techniques is introduced, and it is demonstrated that the integrated algorithm is more effective than other methods.
Abstract: Software pipelining can generate efficient schedules for loops by overlapping the execution of operations from different iterations in order to exploit maximum Instruction Level Parallelism (ILP). Code optimization can decrease the total number of calculations and memory-related operations; as a result, instruction schedulers can use the freed resources to construct shorter schedules. In particular, when the data is not present in the cache, performance is significantly degraded by memory references. Therefore, elimination of redundant load-store operations is most important for improving overall performance. This paper introduces a method for integrating software pipelining and load-store elimination techniques. Moreover, we demonstrate that the integrated algorithm is more effective than other methods.
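
A minimal illustration of the load elimination that the paper integrates with software pipelining (the pipelining itself is a scheduler transform and is not shown): a[i-1] was just stored by the previous iteration, so its reload can be replaced by a register copy carried across iterations.

```cpp
#include <cstddef>

// Naive loop: each iteration reloads a[i-1], a value the previous
// iteration just stored.
void prefix_naive(double* a, const double* b, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
}

// Load-eliminated version: the cross-iteration value is carried in a
// register, so the loop body keeps only the store.  A software pipeliner
// can then use the freed memory-port slot to build a shorter schedule.
void prefix_load_eliminated(double* a, const double* b, std::size_t n) {
    if (n < 2) return;
    double carry = a[0];            // the only remaining load of a[]
    for (std::size_t i = 1; i < n; ++i) {
        carry += b[i];
        a[i] = carry;
    }
}

int main() {
    double a[4] = {1, 0, 0, 0};
    double b[4] = {0, 2, 3, 4};
    prefix_load_eliminated(a, b, 4);   // a becomes {1, 3, 6, 10}
}
```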

Journal ArticleDOI
TL;DR: A compiler- and microarchitecture-assisted framework for dynamic energy monitoring by a program is presented, introducing new semantics for branch instructions to get around the uncertainty of control flow paths.
Abstract: We present a compiler- and microarchitecture-assisted framework for dynamic energy monitoring by a program. A program will perform an ealloc (the operating system will support energy allocation through a program-level utility ealloc, similar to dynamic memory allocation primitives such as malloc) before entering every logical algorithm such as sorting, an FIR filter, or a convolution. Based on the actual energy allocated by the OS (the return parameter of ealloc), a program will enter one of its many variants, each consuming a different amount of energy, through an eswitch. The compiler needs to monitor energy more frequently, probably as often as once every basic block. The existing instruction-level energy models can be used. Assuming reasonably accurate instruction-level energy models, the compiler can compute a fairly accurate energy value for a basic block. However, the uncertainty of control flow paths makes it close to impossible for the compiler to maintain a program-level energy value. We introduce new semantics for branch instructions to get around this problem. A branch instruction, in addition to the branch condition and target offset, contains two more fields: taken basic block energy and not-taken basic block energy. The microarchitecture is assumed to maintain an energy counter (EC) which is program-visible. The branch instruction also modifies the EC by adding to it the energy of the taken or not-taken basic block, based on the branch condition. The EC thus provides an exact model of the program energy consumed up to that point. An instruction to clear the EC (make it zero) allows the compiler to use the EC to track the energy used between any two fixed points in the program. The compiler can now insert transformations that are conditional upon the EC.
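
A software model of the proposed branch semantics (the EC, ealloc, and eswitch are the paper's proposals; the energy numbers below are invented for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// Program-visible energy counter (EC); in the paper it is maintained by
// the microarchitecture, here it is a global for simulation.
static uint64_t EC = 0;

// Branch with the paper's extended semantics: besides deciding direction,
// it adds the compiler-computed energy of the taken or not-taken successor
// basic block to the EC.
inline bool branch(bool cond, uint64_t taken_energy, uint64_t not_taken_energy) {
    EC += cond ? taken_energy : not_taken_energy;
    return cond;
}

int main() {
    EC = 0;   // the "clear EC" instruction
    int x = 7;
    // The energy values (120 and 80 units) are made up; a real compiler
    // would derive them from instruction-level energy models.
    if (branch(x > 0, /*taken=*/120, /*not_taken=*/80)) {
        // taken block: its energy was already added by the branch
    } else {
        // not-taken block
    }
    std::printf("energy since last clear: %llu units\n",
                (unsigned long long)EC);
    // A program could now switch to a cheaper variant via an eswitch if
    // EC exceeds the budget returned by ealloc (both proposed utilities).
}
```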

Journal ArticleDOI
TL;DR: Results show the viability of a COMA-BC system as a way of exploiting parallelism at a low cost using workstations and the cache coherence protocol that has been developed.
Abstract: In this paper we put forward a design for a multicomputer system based on a network of workstations, which we call COMA-BC. It has a common address space in which a shared-variables programming model can be used. The management of the shared address space is performed in a similar way to that in existing multiprocessor COMA systems. To be exact, the shared address space is divided into blocks, and their copies reside in the attraction memories of the workstations. The key piece in this system is the cache coherence protocol that we have developed. The goal of the protocol is to minimize the number and size of the messages travelling through the network, so that parallel applications can be executed without creating inconsistencies among the copies of the blocks residing in the different nodes of the system. The proposed system has not been built, but a simulation environment has been specifically developed. This environment allows the simulation of the execution of standard parallel applications on COMA-BC, and it is execution-driven. Using the results and a simple analytic model, we have obtained performance figures for the execution of standard parallel applications in terms of speedup and efficiency. These results show the viability of a COMA-BC system as a way of exploiting parallelism at a low cost using workstations.
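
A generic write-invalidate sketch in the spirit of the protocol's goal of minimizing messages (the actual COMA-BC states and message types are the paper's own and are not reproduced here):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// State of a block's copies across attraction memories.
enum class State { Invalid, Shared, Exclusive };

struct Directory {
    std::unordered_map<uint32_t, std::vector<int>> copies;  // block -> nodes
    std::unordered_map<uint32_t, State> state;
    uint64_t messages = 0;   // the quantity the protocol tries to minimize

    void read(uint32_t block, int node) {
        copies[block].push_back(node);
        state[block] = State::Shared;
        ++messages;                        // one data reply to the reader
    }

    void write(uint32_t block, int node) {
        // Invalidate every other copy: one message per remote sharer,
        // exactly the traffic a good protocol keeps small.
        for (int n : copies[block])
            if (n != node) ++messages;
        copies[block] = {node};
        state[block] = State::Exclusive;
    }
};

int main() {
    Directory dir;
    dir.read(42, 0);
    dir.read(42, 1);
    dir.write(42, 2);                      // invalidates copies at nodes 0, 1
    return static_cast<int>(dir.messages); // 4 messages in this toy accounting
}
```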

Journal ArticleDOI
TL;DR: Data can have five or more dimensions and can be of rank 2 or greater (rank 0 is commonly a scalar, 1 a vector, etc.).
Abstract: The growing gap between processor and memory has been exacerbated by the way we are using computers: in order to obtain an actual data for a given instruction, it can take several, if not a lot of, different memory accesses, dependent upon the way data are structured. Indeed, data is often a function of independent variables, or dimensions (e.g. space, time, energy, ...). Some data can have five or more dimensions and can be of rank 2 or greater (rank 0 is commonly a scalar, 1 a vector, etc.). Often there is an association between the dimensionality of data and its geometry, called a mesh or a grid. Such a grid can become quite complex; its structure can even be dynamically changed during the very run of the application.
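
A concrete instance of the indexing cost described above (our example, not the paper's): one element of a rank-3 data set needs an offset computed from three dimensions, and pointer-based layouts would turn that arithmetic into extra dependent memory accesses.

```cpp
#include <cstddef>
#include <vector>

// Row-major linearization of a rank-3 index (i, j, k) for an
// ni x nj x nk data set: pure address arithmetic, no pointer chasing.
inline std::size_t offset3(std::size_t i, std::size_t j, std::size_t k,
                           std::size_t nj, std::size_t nk) {
    return (i * nj + j) * nk + k;
}

double sample(const std::vector<double>& grid,
              std::size_t i, std::size_t j, std::size_t k,
              std::size_t nj, std::size_t nk) {
    // One memory access per element; an array-of-arrays layout would add
    // one dependent load per dimension instead.
    return grid[offset3(i, j, k, nj, nk)];
}

int main() {
    std::vector<double> grid(4 * 5 * 6, 1.0);   // rank-3, 4 x 5 x 6
    return sample(grid, 1, 2, 3, 5, 6) > 0.0 ? 0 : 1;
}
```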