
Showing papers in "ACM Sigarch Computer Architecture News in 2000"


Journal ArticleDOI
TL;DR: The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation.
Abstract: The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation...

59 citations


Journal ArticleDOI
TL;DR: This work focuses on spatial locality optimization such that all the data that are loaded as a block in the cache will be used successively by the program.
Abstract: One of the most effective ways to improve program performance on today's computers is to optimize the way cache memories are used. In particular, many scientific applications contain loop nests that operate on large multi-dimensional arrays whose sizes are often parameterized. No special attention is paid to cache memory performance when such loops are written. In this work, we focus on spatial locality optimization such that all the data that are loaded as a block in the cache will be used successively by the program. Our method consists of providing a new array reference evaluation function to the compiler, such that the data layout corresponds exactly to the utilization order of these data. The computation of this function draws on the theory of parameterized polyhedra and Ehrhart polynomials.

54 citations
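
The core idea, matching the array's memory layout to the order in which the loop visits elements, can be sketched without the paper's parameterized-polyhedra machinery. The C++ fragment below is a minimal illustration under invented names; the authors instead derive the layout function automatically via Ehrhart polynomials.

```cpp
#include <cstddef>
#include <vector>

// Column-major layout function: linear index of element (i, j) of an
// n x m array when columns are stored contiguously.
inline std::size_t col_major(std::size_t i, std::size_t j, std::size_t n) {
    return j * n + i;   // consecutive values of i are adjacent in memory
}

// The loop nest visits the array column by column.  Because the layout
// function matches that order, the inner loop touches consecutive
// addresses and every element of a fetched cache block is used before
// the block is evicted; a row-major layout would stride by m elements.
void scale_by_columns(std::vector<double>& a, std::size_t n, std::size_t m) {
    for (std::size_t j = 0; j < m; ++j)
        for (std::size_t i = 0; i < n; ++i)
            a[col_major(i, j, n)] *= 2.0;
}

int main() {
    std::vector<double> a(64 * 64, 1.0);
    scale_by_columns(a, 64, 64);
}
```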


Journal ArticleDOI
TL;DR: In this article, the authors describe the development of integrated, low power, CMOS communication devices and sensors, which makes a rich design space of networked sensors viable and can be deeply embedded in the physical world.
Abstract: Technological progress in integrated, low-power, CMOS communication devices and sensors makes a rich design space of networked sensors viable. They can be deeply embedded in the physical world and ...

38 citations


Journal ArticleDOI
TL;DR: Communication in cache-coherent distributed shared memory often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads; this paper proposes last-touch prediction as a way to reduce these costs.
Abstract: Communication in cache-coherent distributed shared memory (DSM) often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads. This paper proposes Last-Tou...

15 citations
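
As a generic, hedged sketch of the last-touch idea (not necessarily the authors' mechanism), a predictor can learn which (block, PC) pairs historically preceded an invalidation and self-invalidate eagerly when the pair recurs:

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

struct LastTouchPredictor {
    std::unordered_set<uint64_t> last_touch_sigs;   // trained signatures
    std::unordered_map<uint32_t, uint32_t> last_pc; // block -> last accessor PC

    static uint64_t sig(uint32_t block, uint32_t pc) {
        return (static_cast<uint64_t>(block) << 32) | pc;
    }

    // On every access: record the PC and predict whether this access is
    // the block's last touch, so the copy can be flushed early instead of
    // waiting for a remote invalidation message.
    bool access(uint32_t block, uint32_t pc) {
        last_pc[block] = pc;
        return last_touch_sigs.count(sig(block, pc)) != 0;
    }

    // On a real invalidation: train with the PC that touched the block last.
    void invalidated(uint32_t block) {
        auto it = last_pc.find(block);
        if (it != last_pc.end())
            last_touch_sigs.insert(sig(block, it->second));
    }
};

int main() {
    LastTouchPredictor p;
    p.access(7, 0x400);                 // first encounter: no prediction yet
    p.invalidated(7);                   // train: PC 0x400 was the last touch
    return p.access(7, 0x400) ? 0 : 1;  // the repeat is now predicted
}
```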


Journal ArticleDOI
TL;DR: This paper evaluates the performance impact of inline caches and type feedback in an actual Java virtual machine, the new open-source Java VM JIT compiler LaTTe, and finds that while monomorphic and polymorphic inline caches achieve speedups of a geometric mean of 3% and 9% respectively, type feedback cannot improve further over polymorphic inline caches and even degrades the performance for some programs.
Abstract: Java, an object-oriented language, uses virtual methods to support the extension and reuse of classes. Unfortunately, virtual method calls affect performance and thus require an efficient implementation, especially when just-in-time (JIT) compilation is done. Inline caches and type feedback are solutions used by compilers for dynamically-typed object-oriented languages such as SELF [1, 2, 3], where virtual call overheads are much more critical to performance than in Java. With an inline cache, a virtual call that would otherwise have been translated into an indirect jump with two loads is translated into a simpler direct jump with a single compare. With type feedback combined with adaptive compilation, virtual methods can be inlined using checking code which verifies that the target method is equal to the inlined one. This paper evaluates the performance impact of these techniques in an actual Java virtual machine, our new open-source Java VM JIT compiler called LaTTe [4]. We also discuss the engineering issues in implementing these techniques. Our experimental results with the SPECjvm98 benchmarks indicate that while monomorphic inline caches and polymorphic inline caches achieve speedups of as much as a geometric mean of 3% and 9% respectively, type feedback cannot improve further over polymorphic inline caches and even degrades the performance for some programs.

12 citations
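
The inline-cache transformation, an indirect vtable dispatch replaced by one compare plus a direct call, can be modeled in C++. LaTTe itself patches machine code at the call site; the Circle specialization below is hard-wired only because C++ cannot generate code at runtime.

```cpp
#include <cstdio>
#include <typeinfo>

struct Shape { virtual ~Shape() = default; virtual double area() const = 0; };
struct Circle : Shape {
    double r;
    explicit Circle(double r_) : r(r_) {}
    double area() const override { return 3.14159 * r * r; }
};
struct Square : Shape {
    double s;
    explicit Square(double s_) : s(s_) {}
    double area() const override { return s * s; }
};

// One call site's monomorphic inline cache: remember the receiver type
// seen last time; if it is still Circle, dispatch with one compare and a
// direct, inlinable call instead of an indirect vtable jump.
double area_ic(const Shape* obj) {
    static const std::type_info* cached = nullptr;
    if (cached && *cached == typeid(Circle) && typeid(*obj) == *cached)
        return static_cast<const Circle*>(obj)->area();   // fast path
    cached = &typeid(*obj);   // cache miss: remember the new receiver type
    return obj->area();       // slow path: normal virtual dispatch
}

int main() {
    Circle c{2.0};
    Square s{3.0};
    std::printf("%.2f %.2f %.2f\n", area_ic(&c), area_ic(&c), area_ic(&s));
}
```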


Journal ArticleDOI
TL;DR: By combining reordering with aggressive inlining, a larger executable image produced through inlining can be effectively remapped onto the cache address space, while not noticeably increasing the instruction cache miss rate.
Abstract: Memory hierarchy performance has always been an important issue in computer architecture design. The likelihood of a bottleneck in the memory hierarchy is increasing, as improvements in microprocessor performance continue to outpace those made in the memory system. As a result, effective utilization of cache memories is essential in today's architectures. The nature of procedural software poses visibility problems when attempting to perform program optimization. One approach to increasing visibility in procedural design is to perform procedure inlining. The main downside of using inlining is that inlined procedures can place excess pressure on the instruction cache. To address this issue we apply code reordering. By combining reordering with aggressive inlining, a larger executable image produced through inlining can be effectively remapped onto the cache address space, while not noticeably increasing the instruction cache miss rate. In this paper, we evaluate our ability to perform aggressive inlining by employing cache line coloring. We have implemented three variations of our coloring algorithm in the Alto toolset and compare them against Alto's aggressive basic block reordering algorithms. Alto allows us to generate optimized executables that can be run on hardware to generate results. We find that by using our algorithms, we can achieve up to a 21% reduction in execution time over the base Compaq optimizing compiler, and a 6.4% reduction when compared to Alto's interprocedural basic block reordering algorithm.

10 citations
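
A toy version of the coloring constraint (our sketch; Alto's algorithms are more sophisticated): treat each I-cache line a procedure covers as a color, and place a hot callee so its colors are disjoint from its caller's.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint32_t kLineBytes = 32;
constexpr uint32_t kCacheLines = 512;   // direct-mapped 16 KB I-cache

// The set of cache lines (colors) a procedure covers when placed at addr.
std::vector<uint32_t> colors(uint32_t addr, uint32_t size) {
    std::vector<uint32_t> c;
    for (uint32_t a = addr; a < addr + size; a += kLineBytes)
        c.push_back((a / kLineBytes) % kCacheLines);
    return c;
}

// Greedy placement: slide the callee forward a line at a time until its
// colors avoid the caller's, so a hot call/return pair cannot evict each
// other from the I-cache.  Assumes the caller is smaller than the cache.
uint32_t place_after(uint32_t caller_addr, uint32_t caller_size,
                     uint32_t callee_size) {
    uint32_t addr = caller_addr + caller_size;
    std::vector<uint32_t> caller = colors(caller_addr, caller_size);
    for (uint32_t tries = 0; tries < kCacheLines; ++tries) {
        bool conflict = false;
        for (uint32_t col : colors(addr, callee_size)) {
            for (uint32_t cc : caller)
                if (col == cc) { conflict = true; break; }
            if (conflict) break;
        }
        if (!conflict) return addr;
        addr += kLineBytes;   // shift by one line (leaves a gap) and retry
    }
    return addr;   // caller covers the whole cache; conflicts unavoidable
}

int main() {
    // Hot caller at 0x1000, 1 KB long; find a start for a 512-byte callee.
    std::printf("callee placed at %#x\n", place_after(0x1000, 1024, 512));
}
```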


Journal ArticleDOI
TL;DR: Effective techniques that exploit properties of the particular polyhedra generated by the CME are presented; they reduce the complexity of solving the CME, yielding a significant speed-up over traditional methods.
Abstract: Cache Miss Equations (CME) [GMM97] is a method that accurately describes the cache behavior by means of polyhedra. Even though the computational cost of generating the CME is a linear function of the number of references, solving them is very time consuming, so studying a whole program may be infeasible. In this work, we present effective techniques that exploit some properties of the particular polyhedra generated by the CME. These techniques reduce the complexity of the algorithm used to solve the CME, resulting in a significant speed-up compared with traditional methods. In particular, the proposed approach does not require the computation of the vertices of each polyhedron, which has exponential complexity.

6 citations
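
A cache miss equation describes, for each iteration point, when two references collide in the cache. The toy below enumerates the condition directly for a direct-mapped cache (the paper instead solves it symbolically over polyhedra); the addresses and loop are invented for illustration.

```cpp
#include <cstdint>
#include <cstdio>

// Direct-mapped cache parameters: 32-byte lines, 256 sets (8 KB).
constexpr uint64_t kLine = 32, kSets = 256;

uint64_t set_of(uint64_t addr) { return (addr / kLine) % kSets; }

int main() {
    // Model of:  for (i = 0; i < N; ++i) x += A[i] + B[i];  (8-byte elems)
    const uint64_t A = 0x10000, B = 0x30000, N = 1024, kElem = 8;
    uint64_t conflicts = 0;
    for (uint64_t i = 0; i < N; ++i) {
        // Conflict-miss condition of the corresponding CME: both
        // references of one iteration fall into the same cache set.
        if (set_of(A + i * kElem) == set_of(B + i * kElem))
            ++conflicts;
    }
    // With these (invented) base addresses the arrays are 128 KB apart,
    // so they collide in every iteration.
    std::printf("iterations with A/B set conflicts: %llu of %llu\n",
                (unsigned long long)conflicts, (unsigned long long)N);
}
```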


Journal ArticleDOI
TL;DR: Caching and other latency-tolerating techniques have been quite successful in maintaining high memory system performance for general-purpose processors, as discussed by the authors; however, TLB misses have become a serious problem for many applications.
Abstract: Caching and other latency tolerating techniques have been quite successful in maintaining high memory system performance for general purpose processors. However, TLB misses have become a serious bo...

6 citations


Journal ArticleDOI
TL;DR: Because the biggest barrier to detecting independence in irregular codes is memory disambiguation, the goals of this paper are to identify memory-independent tasks using a profile-based approach and to measure the amount of STP by estimating the amount of memory-independent instructions those tasks expose.
Abstract: Tomorrow's microprocessors will be able to handle multiple flows of control. Applications that exhibit task-level parallelism (TLP) and can be decomposed into parallel tasks will perform well on these platforms. TLP arises when a task is independent of its neighboring code. Traditional parallel compilers exploit one variety of TLP, loop-level parallelism (LLP), where loop iterations are executed in parallel. LLP is overwhelmingly found in numeric, typically FORTRAN, programs with regular patterns of data accesses. In contrast, irregular applications, typified by general-purpose integer applications, exhibit little LLP as they tend to access data in irregular patterns through pointers. Without pointer disambiguation to analyze data access dependences, traditional parallel compilers cannot parallelize these irregular applications and ensure correct execution. We focus on a different variety of TLP, namely Speculative Task Parallelism (STP). STP arises when a task (either a leaf procedure, a non-leaf procedure, or an entire loop) is control- and memory-independent of its preceding code, and thus could be executed in parallel. Two sections of code are memory-independent when neither contains a store to a memory location that the other accesses. To exploit STP, we assume a hypothetical speculative machine that supports speculative futures (a parallel programming construct that executes a task early on a different thread or processor) with mechanisms for resolving incorrect speculation when the task is not, after all, independent. This allows us to speculatively parallelize code when there is a high probability of independence, but no guarantee. Figure 1 illustrates STP, showing a task Y in the dynamic instruction stream of an irregular application that has no memory access conflicts with a group of instructions, X, that precede Y. The shorter of X and Y determines the overlap of memory-independent instructions, as seen in Figures 1(b) and 1(c). In the absence of any register dependences, X and Y may be executed in parallel, resulting in shorter execution time. It is hard for traditional parallel compilers of pointer-based languages to expose this parallelism. The goals of this paper are to identify such regions as X and Y within irregular applications and to find the number of instructions that may thus be removed from the critical path. This number represents the maximum STP when the cost of exploiting STP is zero. Because the biggest barrier to detecting independence in irregular codes is memory disambiguation, we identify memory-independent tasks using a profile-based approach and measure the amount of STP by estimating the amount of memory-independent instructions those tasks expose. We vary the level of control dependence and memory dependence to investigate their effect on the amount of memory-independence we find. We profile at different memory granularities and introduce synchronization to expose higher levels of memory-independence. Across this variety of speculation assumptions, 7 to 22% of dynamic instructions are within tasks that are found to be memory-independent. These results are for the SPECint95 benchmarks, a set of irregular applications for which traditional methods of parallelization are ineffective.

6 citations
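
The paper's memory-independence criterion is directly checkable from a profile. A minimal sketch (the names are ours): record the load and store address sets of candidate regions X and Y and test that neither's stores intersect the other's accesses; coarser profiling granularities correspond to masking low address bits before insertion.

```cpp
#include <cstdint>
#include <unordered_set>

// Per-region memory profile; granularity can be coarsened by masking low
// address bits before insertion (the paper varies this).
struct MemProfile {
    std::unordered_set<uint64_t> loads, stores;
    void load(uint64_t a)  { loads.insert(a); }
    void store(uint64_t a) { stores.insert(a); }
};

// The paper's criterion: X and Y are memory-independent when neither
// contains a store to a location that the other accesses.
bool memory_independent(const MemProfile& x, const MemProfile& y) {
    for (uint64_t a : x.stores)
        if (y.loads.count(a) || y.stores.count(a)) return false;
    for (uint64_t a : y.stores)
        if (x.loads.count(a) || x.stores.count(a)) return false;
    return true;   // Y could run speculatively as a future alongside X
}

int main() {
    MemProfile x, y;
    x.store(0x100); x.load(0x104);
    y.load(0x200);  y.store(0x204);
    return memory_independent(x, y) ? 0 : 1;   // disjoint sets: independent
}
```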


Journal ArticleDOI
TL;DR: A cheap and flexible way to design efficient consistency approaches for SVM systems is proposed and an execution-driven simulator aimed at studying the behavior of memory consistency models is developed.
Abstract: Shared Virtual Memory (SVM) systems are a class of Distributed Shared Memory (DSM) systems. These systems offer a shared-memory programming model that is more intuitive than the message-passing paradigm. Other advantages include low hardware and maintenance costs. This paper introduces a new simulation environment for such architectures. The developed tool is an execution-driven simulator aimed at studying the behavior of memory consistency models, with the exception of those needing compiler modifications. Thus, we propose a cheap and flexible way to design efficient consistency approaches for SVM systems.

5 citations


Journal ArticleDOI
TL;DR: Multiprocessing is already prevalent in servers, where multiple clients present an obvious source of thread-level parallelism, but as discussed by the authors, the case for multiprocessing is less clear for desktop applications.
Abstract: Multiprocessing is already prevalent in servers where multiple clients present an obvious source of thread-level parallelism. However, the case for multiprocessing is less clear for desktop applica...

Journal ArticleDOI
TL;DR: Higher microprocessor frequencies accentuate the performance cost of memory accesses; this is especially noticeable in Intel's IA32 architecture, where the lack of registers results in an increased number of memory accesses.
Abstract: Higher microprocessor frequencies accentuate the performance cost of memory accesses. This is especially noticeable in Intel's IA32 architecture where lack of registers results in increased num...

Journal ArticleDOI
TL;DR: This paper addresses the exploitation of profile information in adaptive systems such as just-in-time compilers, dynamic optimizers, and binary translators.
Abstract: Recently, there has been a growing interest in exploiting profile information in adaptive systems such as just-in-time compilers, dynamic optimizers, and binary translators. In this paper, we show ...

Journal ArticleDOI
TL;DR: This work applies Simultaneous Speculation Scheduling to situations where purely static techniques cannot prove data independence; it can be seen as a cost-effective alternative to purely dynamic speculation techniques.
Abstract: Simultaneous Speculation Scheduling (S3) is a combined compiler and architecture technique to control multiple-path execution. It can be used for dual-path branch speculation in the case of unpredictable branches and for multiple-path speculative execution of loop iterations in the case of loop-carried dependences that would otherwise make parallel execution impossible. We apply S3 in situations where purely static techniques cannot prove data independence. S3 can be seen as a cost-effective alternative to purely dynamic speculation techniques. We explain the S3 technique and discuss the requirements on possible target architectures. We further compare S3 to other speculation techniques.

Journal ArticleDOI
TL;DR: It is shown that predicated class testing can reduce the direct cost of virtual function calls by 43% on average, which, compared with other class testing techniques, is a performance improvement of up to 31% per function call.
Abstract: Runtime class testing is a technique whereby virtual function calls are transformed into statically-bound function calls through a series of conditional branches. Through this transformation, the overhead of virtual function calls can be significantly reduced. However, the drawback of these tests is that by relying on conditional branches, the amount of instruction-level parallelism (ILP) is limited and the mispredict penalties can be relatively high. We show that by using predication during class testing, these drawbacks can be eliminated, and the benefits of class testing can be improved upon. Predication is a supported feature of the Explicitly Parallel Instruction Computing (EPIC) architecture; it converts control dependencies into data dependencies and thus eliminates the mispredict penalties. With analytical cost models and experimental results, we show that predicated class testing can reduce the direct cost of virtual function calls by 43% on average, which, compared with other class testing techniques, is a performance improvement of up to 31% per function call. These results are based on architectural specifications that most modern architectures will exceed, and are considered conservative. The actual speedups resulting from predicated class testing are expected to be higher.
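
The class-test transformation itself is easy to show in C++; true predication is an EPIC hardware feature with no C++ equivalent, so the sketch below keeps the conditional branch and only notes where predicates would replace it.

```cpp
#include <cstdio>
#include <typeinfo>

struct Base {
    virtual ~Base() = default;
    virtual int f() const { return 1; }
};
struct Derived : Base {
    int f() const override { return 2; }
};

// Runtime class test: the virtual call is replaced by a type check plus a
// statically bound (and therefore inlinable) call.  On an EPIC machine the
// two arms would be guarded by complementary predicates p and !p instead
// of a conditional branch, removing the misprediction penalty; standard
// C++ can only express the branching form shown here.
int f_class_tested(const Base* obj) {
    if (typeid(*obj) == typeid(Derived))                       // class test
        return static_cast<const Derived*>(obj)->Derived::f(); // direct call
    return obj->f();                                           // fallback
}

int main() {
    Derived d;
    Base b;
    std::printf("%d %d\n", f_class_tested(&d), f_class_tested(&b));
}
```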

Journal ArticleDOI
TL;DR: The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems.
Abstract: The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems. Cryptography provides the mechanisms nec...

Journal ArticleDOI
TL;DR: In this article, the authors studied the behavior of programs in the SPECint95 suite and observed that six out of eight programs exhibit a new kind of value locality, the frequent value locality.
Abstract: By studying the behavior of programs in the SPECint95 suite we observed that six out of eight programs exhibit a new kind of value locality, the frequent value locality, according to which a few va...
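
A hedged sketch of how frequent value locality can be measured (our toy profiler, not the authors' infrastructure): tally each value seen in an access trace and report the coverage of the top few.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    // Invented toy access trace; a real study would instrument loads/stores.
    std::vector<uint32_t> trace = {0, 0, 1, 0, 42, 0, 1, 0, 7, 0, 1, 0};

    std::unordered_map<uint32_t, uint64_t> freq;
    for (uint32_t v : trace) ++freq[v];

    // Rank values by how often they occur.
    std::vector<std::pair<uint64_t, uint32_t>> ranked;  // (count, value)
    for (const auto& kv : freq) ranked.push_back({kv.second, kv.first});
    std::sort(ranked.rbegin(), ranked.rend());          // descending by count

    // How much of the traffic do the two most frequent values cover?
    std::size_t top = std::min<std::size_t>(2, ranked.size());
    uint64_t covered = 0;
    for (std::size_t k = 0; k < top; ++k) covered += ranked[k].first;
    std::printf("top %zu values cover %llu of %zu accesses\n",
                top, (unsigned long long)covered, trace.size());
}
```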

Journal ArticleDOI
TL;DR: Simulation is the primary method for evaluating computer systems during all phases of the design process, but, as the authors point out, one significant problem with simulation is that it rarely models the system exactly.
Abstract: Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and qu...

Journal ArticleDOI
TL;DR: In this article, the authors point out that building systems such as OS kernels and embedded software is difficult, and that an important source of this difficulty is the numerous rules they must obey: interrupts cannot be disabled for too long.
Abstract: Building systems such as OS kernels and embedded software is difficult. An important source of this difficulty is the numerous rules they must obey: interrupts cannot be disabled for "too long," gl...

Journal ArticleDOI
TL;DR: A method for integrating software pipelining and load-store elimination techniques is introduced, and it is demonstrated that the integrated algorithm is more effective than other methods.
Abstract: Software pipelining can generate efficient schedules for loops by overlapping the execution of operations from different iterations in order to exploit maximum Instruction Level Parallelism (ILP). Code optimization can decrease the total number of calculations and memory-related operations; as a result, instruction schedulers can use the freed resources to construct shorter schedules. In particular, when the data is not present in the cache, performance is significantly degraded by memory references. Therefore, elimination of redundant load-store operations is most important for improving overall performance. This paper introduces a method for integrating software pipelining and load-store elimination techniques. Moreover, we demonstrate that the integrated algorithm is more effective than other methods.
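
A minimal illustration of the load elimination that the paper integrates with software pipelining (the pipelining itself is a scheduler transform and is not shown): a[i-1] was just stored by the previous iteration, so its reload can be replaced by a register copy carried across iterations.

```cpp
#include <cstddef>

// Naive loop: each iteration reloads a[i-1], a value the previous
// iteration just stored.
void prefix_naive(double* a, const double* b, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
}

// Load-eliminated version: the cross-iteration value is carried in a
// register, so the loop body keeps only the store.  A software pipeliner
// can then use the freed memory-port slot to build a shorter schedule.
void prefix_load_eliminated(double* a, const double* b, std::size_t n) {
    if (n < 2) return;
    double carry = a[0];            // the only remaining load of a[]
    for (std::size_t i = 1; i < n; ++i) {
        carry += b[i];
        a[i] = carry;
    }
}

int main() {
    double a[4] = {1, 0, 0, 0};
    double b[4] = {0, 2, 3, 4};
    prefix_load_eliminated(a, b, 4);   // a becomes {1, 3, 6, 10}
}
```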

Journal ArticleDOI
TL;DR: A compiler- and microarchitecture-assisted framework for dynamic energy monitoring by a program is presented, introducing new semantics for branch instructions to get around the uncertainty of control flow paths.
Abstract: We present a compiler- and microarchitecture-assisted framework for dynamic energy monitoring by a program. A program will perform an ealloc (the operating system will support energy allocation through a program-level utility ealloc, similar to dynamic memory allocation primitives such as malloc) before entering every logical algorithm such as sorting, an FIR filter, or a convolution. Based on the actual energy allocated by the OS (the return parameter of ealloc), a program will enter one of its many variants, each consuming a different amount of energy, through an eswitch. The compiler needs to monitor energy more frequently, probably as often as once every basic block. The existing instruction-level energy models can be used. Assuming reasonably accurate instruction-level energy models, the compiler can compute a fairly accurate energy value for a basic block. However, the uncertainty of control flow paths makes it close to impossible for the compiler to maintain a program-level energy value. We introduce new semantics for branch instructions to get around this problem. A branch instruction, in addition to the branch condition and target offset, contains two more fields: taken basic block energy and not-taken basic block energy. The microarchitecture is assumed to maintain an energy counter (EC) which is program-visible. The branch instruction also modifies the EC by adding to it the energy of the taken or not-taken basic block, based on the branch condition. The EC thus provides an exact model of the program energy consumed up to that point. An instruction to clear the EC (make it zero) allows the compiler to use the EC to track the energy used between any two fixed points in the program. The compiler can now insert transformations that are conditional upon the EC.
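
A software model of the proposed branch semantics (the EC, ealloc, and eswitch are the paper's proposals; the energy numbers below are invented for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// Program-visible energy counter (EC); in the paper it is maintained by
// the microarchitecture, here it is a global for simulation.
static uint64_t EC = 0;

// Branch with the paper's extended semantics: besides deciding direction,
// it adds the compiler-computed energy of the taken or not-taken successor
// basic block to the EC.
inline bool branch(bool cond, uint64_t taken_energy, uint64_t not_taken_energy) {
    EC += cond ? taken_energy : not_taken_energy;
    return cond;
}

int main() {
    EC = 0;   // the "clear EC" instruction
    int x = 7;
    // The energy values (120 and 80 units) are made up; a real compiler
    // would derive them from instruction-level energy models.
    if (branch(x > 0, /*taken=*/120, /*not_taken=*/80)) {
        // taken block: its energy was already added by the branch
    } else {
        // not-taken block
    }
    std::printf("energy since last clear: %llu units\n",
                (unsigned long long)EC);
    // A program could now switch to a cheaper variant via an eswitch if
    // EC exceeds the budget returned by ealloc (both proposed utilities).
}
```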

Journal ArticleDOI
TL;DR: Results show the viability of a COMA-BC system as a way of exploiting parallelism at a low cost using workstations and the cache coherence protocol that has been developed.
Abstract: In this paper we put forward a design for a multicomputer system based on a network of workstations, which we call COMA-BC. It has a common address space in which a shared-variables programming model can be used. The management of the shared address space is performed in a similar way to that in existing multiprocessor COMA systems. To be exact, the shared address space is divided into blocks, and their copies reside in the attraction memories of the workstations. The key piece in this system is the cache coherence protocol that we have developed. The goal of the protocol is to minimize the number and size of the messages travelling through the network, so that parallel applications can be executed without creating inconsistencies among the copies of the blocks residing in the different nodes of the system. The proposed system has not been built, but a simulation environment has been specifically developed. This environment allows the simulation of the execution of standard parallel applications on COMA-BC, and it is execution-driven. Using the results and a simple analytic model, we have obtained performance figures for the execution of standard parallel applications in terms of speedup and efficiency. These results show the viability of a COMA-BC system as a way of exploiting parallelism at a low cost using workstations.
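
A generic write-invalidate sketch in the spirit of the protocol's goal of minimizing messages (the actual COMA-BC states and message types are the paper's own and are not reproduced here):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// State of a block's copies across attraction memories.
enum class State { Invalid, Shared, Exclusive };

struct Directory {
    std::unordered_map<uint32_t, std::vector<int>> copies;  // block -> nodes
    std::unordered_map<uint32_t, State> state;
    uint64_t messages = 0;   // the quantity the protocol tries to minimize

    void read(uint32_t block, int node) {
        copies[block].push_back(node);
        state[block] = State::Shared;
        ++messages;                        // one data reply to the reader
    }

    void write(uint32_t block, int node) {
        // Invalidate every other copy: one message per remote sharer,
        // exactly the traffic a good protocol keeps small.
        for (int n : copies[block])
            if (n != node) ++messages;
        copies[block] = {node};
        state[block] = State::Exclusive;
    }
};

int main() {
    Directory dir;
    dir.read(42, 0);
    dir.read(42, 1);
    dir.write(42, 2);                      // invalidates copies at nodes 0, 1
    return static_cast<int>(dir.messages); // 4 messages in this toy accounting
}
```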

Journal ArticleDOI
TL;DR: Data can have five or more dimensions and can be of rank 2 or greater (rank 0 is commonly a scalar, 1 a vector, etc.).
Abstract: The growing gap between processor and memory has been exacerbated by the way we are using computers: in order to obtain an actual data for a given instruction, it can take several, if not a lot of, different memory accesses, dependent upon the way data are structured. Indeed, data is often a function of independent variables, or dimensions (e.g. space, time, energy, ...). Some data can have five or more dimensions and can be of rank 2 or greater (rank 0 is commonly a scalar, 1 a vector, etc.). Often there is an association between the dimensionality of data and its geometry, called a mesh or a grid. Such a grid can become quite complex; its structure can even be dynamically changed during the very run of the application.
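
A concrete instance of the indexing cost described above (our example, not the paper's): one element of a rank-3 data set needs an offset computed from three dimensions, and pointer-based layouts would turn that arithmetic into extra dependent memory accesses.

```cpp
#include <cstddef>
#include <vector>

// Row-major linearization of a rank-3 index (i, j, k) for an
// ni x nj x nk data set: pure address arithmetic, no pointer chasing.
inline std::size_t offset3(std::size_t i, std::size_t j, std::size_t k,
                           std::size_t nj, std::size_t nk) {
    return (i * nj + j) * nk + k;
}

double sample(const std::vector<double>& grid,
              std::size_t i, std::size_t j, std::size_t k,
              std::size_t nj, std::size_t nk) {
    // One memory access per element; an array-of-arrays layout would add
    // one dependent load per dimension instead.
    return grid[offset3(i, j, k, nj, nk)];
}

int main() {
    std::vector<double> grid(4 * 5 * 6, 1.0);   // rank-3, 4 x 5 x 6
    return sample(grid, 1, 2, 3, 5, 6) > 0.0 ? 0 : 1;
}
```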