
Showing papers in "IEEE Transactions on Computers in 2001"


Journal ArticleDOI
TL;DR: Experimental results from trace-driven simulations show that the performance of the LRFU is at least competitive with that of previously known policies for the workloads the authors considered.
Abstract: Efficient and effective buffering of disk blocks in main memory is critical for better file system performance due to a wide speed gap between main memory and hard disks. In such a buffering system, one of the most important design decisions is the block replacement policy that determines which disk block to replace when the buffer is full. In this paper, we show that there exists a spectrum of block replacement policies that subsumes the two seemingly unrelated and independent Least Recently Used (LRU) and Least Frequently Used (LFU) policies. The spectrum is called the LRFU (Least Recently/Frequently Used) policy and is formed by how much more weight we give to the recent history than to the older history. We also show that there is a spectrum of implementations of the LRFU that again subsumes the LRU and LFU implementations. This spectrum is again dictated by how much weight is given to recent and older histories and the time complexity of the implementations lies between O(1) (the time complexity of LRU) and O(log2 n) (the time complexity of LFU), where n is the number of blocks in the buffer. Experimental results from trace-driven simulations show that the performance of the LRFU is at least competitive with that of previously known policies for the workloads we considered.

593 citations
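
As a rough illustration of the recency/frequency spectrum described in the abstract, the sketch below maintains a combined recency-and-frequency value per block with the weighting function F(x) = (1/2)^(lambda*x); small lambda behaves more like LFU, large lambda approaches LRU. This is a minimal sketch of the idea with assumed parameters, not the paper's optimized heap-based implementation.

```python
# Minimal LRFU-style buffer cache sketch (illustrative, not the paper's code).
class LRFUCache:
    def __init__(self, capacity, lam=0.1):
        self.capacity = capacity
        self.lam = lam          # controls the recency/frequency trade-off
        self.clock = 0          # virtual time, advanced once per reference
        self.crf = {}           # block -> combined recency/frequency value
        self.last = {}          # block -> time of last reference

    def _decay(self, block):
        # F(x + y) = F(x) * F(y), so the stored CRF can be decayed in O(1).
        age = self.clock - self.last[block]
        return self.crf[block] * (0.5 ** (self.lam * age))

    def reference(self, block):
        self.clock += 1
        if block in self.crf:
            self.crf[block] = 1.0 + self._decay(block)   # F(0) = 1
        else:
            if len(self.crf) >= self.capacity:
                victim = min(self.crf, key=self._decay)  # lowest decayed CRF
                del self.crf[victim], self.last[victim]
            self.crf[block] = 1.0
        self.last[block] = self.clock


cache = LRFUCache(capacity=3, lam=0.1)
for b in [1, 2, 3, 1, 1, 4, 2]:
    cache.reference(b)
print(sorted(cache.crf))   # blocks currently resident
```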


Journal ArticleDOI
TL;DR: The primary contribution of this paper is in introducing several state machine-based computational elements for performing sigmoid nonlinearity mappings, linear gain, and exponentiation functions, and describing an efficient method for the generation of, and conversion between, stochastic and deterministic binary signals.
Abstract: This paper examines a number of stochastic computational elements employed in artificial neural networks, several of which are introduced for the first time, together with an analysis of their operation. We briefly include multiplication, squaring, addition, subtraction, and division circuits in both unipolar and bipolar formats, the principles of which are well-known, at least for unipolar signals. We have introduced several modifications to improve the speed of the division operation. The primary contribution of this paper, however, is in introducing several state machine-based computational elements for performing sigmoid nonlinearity mappings, linear gain, and exponentiation functions. We also describe an efficient method for the generation of, and conversion between, stochastic and deterministic binary signals. The validity of the present approach is demonstrated in a companion paper through a sample application, the recognition of noisy optical characters using soft competitive learning. Network generalization capabilities of the stochastic network maintain a squared error within 10 percent of that of a floating-point implementation for a wide range of noise levels. While the accuracy of stochastic computation may not compare favorably with more conventional binary radix-based computation, the low circuit area, power, and speed characteristics may, in certain situations, make them attractive for VLSI implementation of artificial neural networks.

497 citations
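
To make the unipolar stochastic encoding concrete: a value p in [0, 1] is carried by a bit stream whose bits are 1 with probability p, and multiplication reduces to a bitwise AND of two independent streams. The sketch below is my own illustration of that principle, not circuitry from the paper; accuracy improves with stream length.

```python
import random

def to_stream(p, n_bits, rng):
    # Unipolar stochastic encoding: each bit is 1 with probability p.
    return [1 if rng.random() < p else 0 for _ in range(n_bits)]

def stochastic_multiply(p, q, n_bits=4096, seed=0):
    rng = random.Random(seed)
    a = to_stream(p, n_bits, rng)
    b = to_stream(q, n_bits, rng)
    # One AND gate per bit: P(a_i AND b_i = 1) = p * q for independent streams.
    product_stream = [x & y for x, y in zip(a, b)]
    return sum(product_stream) / n_bits

print(stochastic_multiply(0.8, 0.5))   # close to 0.40 for long streams
```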


Journal ArticleDOI
TL;DR: This paper defines four temporal constraints based on determining a maximum number of deadlines that can be missed during a window of time (a given number of invocations) and provides the theoretical analysis of the properties and relationships of these constraints.
Abstract: In a hard real-time system, it is assumed that no deadline is missed, whereas, in a soft or firm real-time system, deadlines can be missed, although this usually happens in a nonpredictable way. However, most hard real-time systems could miss some deadlines provided that it happens in a known and predictable way. Also, adding predictability on the pattern of missed deadlines for soft and firm real-time systems is desirable, for instance, to guarantee levels of quality of service. We introduce the concept of weakly hard real-time systems to model real-time systems that can tolerate a clearly specified degree of missed deadlines. For this purpose, we define four temporal constraints based on determining a maximum number of deadlines that can be missed during a window of time (a given number of invocations). This paper provides the theoretical analysis of the properties and relationships of these constraints. It also shows the exact conditions under which a constraint is harder to satisfy than another constraint. Finally, results on fixed priority scheduling and response-time schedulability tests for a wide range of process models are presented.

395 citations
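
One of the constraint forms described above can be read as: in any window of k consecutive invocations, at most m deadlines are missed. The helper below is a straightforward sliding-window check over a recorded miss history; it is illustrative only, since the paper's contribution is the analysis relating such constraints, not this check.

```python
def satisfies_window_constraint(miss_history, m, k):
    """True if every window of k consecutive invocations contains at most
    m missed deadlines.  miss_history is a sequence of 0 (met) / 1 (missed)."""
    if len(miss_history) < k:
        return sum(miss_history) <= m
    window = sum(miss_history[:k])
    if window > m:
        return False
    for i in range(k, len(miss_history)):
        window += miss_history[i] - miss_history[i - k]   # slide the window
        if window > m:
            return False
    return True

# A task allowed to miss at most 1 deadline in any 4 consecutive invocations:
print(satisfies_window_constraint([0, 1, 0, 0, 0, 1, 0, 0], m=1, k=4))  # True
print(satisfies_window_constraint([0, 1, 0, 1, 0, 0, 0, 0], m=1, k=4))  # False
```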


Journal ArticleDOI
TL;DR: The concept of frames, along with the mechanisms and strategies outlined in this paper, will play an important role in future processor architecture and the success of these optimizations demonstrates the significance of the rePLay Framework.
Abstract: In this paper, we propose a new processor framework that supports dynamic optimization. The rePLay Framework embeds an optimization engine atop a high-performance execution engine. The heart of the rePLay Framework is the concept of a frame. Frames are large, single-entry, single-exit optimization regions spanning many basic blocks in the program's dynamic instruction stream, yet containing only a single flow of control. This atomic property of frames increases the flexibility in applying optimizations. To support frames, rePLay includes a hardware-based recovery mechanism that rolls back the architectural state to the beginning of a frame if, for example, an early exit condition is detected. This mechanism permits the optimizer to make speculative, aggressive optimizations upon frames. In this paper, we investigate some of the underlying phenomena that support rePLay. Primarily, we evaluate rePLay's region formation strategy. A rePLay configuration with a 256-entry frame cache, using a 74 KB frame constructor and frame sequencer, achieves an average frame size of 88 Alpha AXP instructions with 68 percent coverage of the dynamic instruction stream, an average frame completion rate of 92.81 percent, and a frame predictor accuracy of 81.26 percent. These results soundly demonstrate that the frames upon which the optimizations are performed are large and stable. Using the most frequently initiated frames from rePLay executions as samples, we also highlight possible strategies for the rePLay optimization engine. Coupled with the high coverage of frames achieved through the dynamic frame construction, the success of these optimizations demonstrates the significance of the rePLay Framework. We believe that the concept of frames, along with the mechanisms and strategies outlined in this paper, will play an important role in future processor architecture.

204 citations


Journal ArticleDOI
TL;DR: This contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs) that perform modular exponentiation with very long integers, at the heart of many practical public-key algorithms such as RSA and discrete logarithm schemes.
Abstract: It is widely recognized that security issues will play a crucial role in the majority of future computer and communication systems. Central tools for achieving system security are cryptographic algorithms. This contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs). The proposed architectures perform modular exponentiation with very long integers. This operation is at the heart of many practical public-key algorithms such as RSA and discrete logarithm schemes. We combine a high-radix Montgomery modular multiplication algorithm with a new systolic array design. The designs are flexible, allowing any choice of operand and modulus. The new architecture also allows the use of high radices. Unlike previous approaches, we systematically implement and compare several variants of our new architecture for different bit lengths. We provide absolute area and timing measures for each architecture. The results allow conclusions about the feasibility and time-space trade-offs of our architecture for implementation on commercially available FPGAs. We found that 1,024-bit RSA decryption can be done in 3.1 ms with our fastest architecture.

196 citations
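
The core operation these architectures accelerate is modular exponentiation built from Montgomery modular multiplication. The radix-2 software sketch below shows the arithmetic being computed, with each loop iteration corresponding roughly to what a hardware stage processes; the paper's high-radix systolic FPGA designs are not reproduced here, and the function names are mine.

```python
def mont_mul(a, b, n, r_bits):
    """Radix-2 Montgomery product a*b*2^(-r_bits) mod n (n odd, a, b < n)."""
    s = 0
    for i in range(r_bits):
        if (a >> i) & 1:
            s += b
        if s & 1:
            s += n            # make s even so the halving is exact modulo n
        s >>= 1
    return s - n if s >= n else s

def mont_exp(base, exponent, n):
    """Left-to-right square-and-multiply in the Montgomery domain."""
    r_bits = n.bit_length()
    r = 1 << r_bits
    base_m = (base * r) % n        # map the operand into the Montgomery domain
    acc = r % n                    # Montgomery representation of 1
    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc, n, r_bits)
        if bit == '1':
            acc = mont_mul(acc, base_m, n, r_bits)
    return mont_mul(acc, 1, n, r_bits)   # map the result back out of the domain

# RSA-style check with toy numbers (a real design targets 1,024-bit operands):
n, e, d = 3233, 17, 2753            # toy RSA modulus 61*53
msg = 65
cipher = mont_exp(msg, e, n)
print(cipher, mont_exp(cipher, d, n))   # second value recovers 65
```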


Journal ArticleDOI
TL;DR: This paper presents a new parallel multiplier for the Galois field GF(2^m) whose elements are represented using the optimal normal basis of type II, and the time complexities of the proposed and the Massey-Omura multipliers are similar.
Abstract: This paper presents a new parallel multiplier for the Galois field GF(2^m) whose elements are represented using the optimal normal basis of type II. The proposed multiplier requires 1.5(m^2 - m) XOR gates, as compared to 2(m^2 - m) XOR gates required by the Massey-Omura multiplier. The time complexities of the proposed and the Massey-Omura multipliers are similar.

164 citations


Journal ArticleDOI
TL;DR: This work attempts to bring the power issue to the earliest phases of microprocessor development, in particular, the stage of defining a chip microarchitecture, by investigating power-optimization techniques of superscalar microprocessors at the microarchitecture level that do not compromise performance.
Abstract: In recent years, reducing power has become an important design goal for high-performance microprocessors. This work attempts to bring the power issue to the earliest phases of microprocessor development, in particular, the stage of defining a chip microarchitecture. We investigate power-optimization techniques of superscalar microprocessors at the microarchitecture level that do not compromise performance. First, major targets for power reduction are identified within microarchitecture, where power is heavily consumed or will be heavily consumed in next-generation superscalar processors. Then, a new, energy-efficient version of a multicluster microarchitecture is developed that reduces energy at the identified critical design points with minimal performance impact. A methodology is developed for energy-performance optimization at the microarchitecture level that generates, for a microarchitecture, a set of energy-efficient configurations, forming a convex hull in the power-performance space. Detailed simulation of the baseline and proposed multicluster architectures has been performed using the developed optimization methodology. A comparison of the two microarchitectures, both optimized for energy efficiency, shows that the multicluster architecture is potentially up to twice as energy efficient for wide issue processors, with an advantage that grows with the issue width. Conversely, at the same power dissipation level, the multicluster architecture supports configurations with measurably higher performance than equivalent conventional designs.

161 citations


Journal ArticleDOI
TL;DR: Different design trade-offs in the DAISY system and their impact on final system performance are reported, and the results show high degrees of instruction parallelism with reasonable translation overhead and memory usage.
Abstract: We describe a VLIW architecture designed specifically as a target for dynamic compilation of an existing instruction set architecture. This design approach offers the simplicity and high performance of statically scheduled architectures, achieves compatibility with an established architecture, and makes use of dynamic adaptation. Thus, the original architecture is implemented using dynamic compilation, a process we refer to as DAISY (Dynamically Architected Instruction Set from Yorktown). The dynamic compiler exploits runtime profile information to optimize translations so as to extract instruction level parallelism. This paper reports different design trade-offs in the DAISY system and their impact on final system performance. The results show high degrees of instruction parallelism with reasonable translation overhead and memory usage.

152 citations


Journal ArticleDOI
TL;DR: This paper presents an efficient method for implementing the Discrete Cosine Transform (DCT) with distributed arithmetic that uses the recursive DCT algorithm and requires less area than the conventional approaches, regardless of the memory reduction techniques employed in the ROM Accumulators.
Abstract: This paper presents an efficient method for implementing the Discrete Cosine Transform (DCT) with distributed arithmetic. While conventional approaches use the original DCT algorithm or the even-odd frequency decomposition of the DCT algorithm, the proposed architecture uses the recursive DCT algorithm and requires less area than the conventional approaches, regardless of the memory reduction techniques employed in the ROM Accumulators (RACs). An efficient architecture for implementing the scaled DCT with distributed arithmetic is also proposed. The new architecture requires even less area while keeping the same structural regularity for an easy VLSI implementation. A comparison of synthesized DCT processors shows that the proposed method reduces the hardware area of regular and scaled DCT processors by 17 percent and 23 percent, respectively, relative to a conventional design. With the row-column decomposition method, the proposed architectures can be easily extended to compute the two-dimensional DCT required in many image compression applications such as HDTV.

150 citations
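
Distributed arithmetic replaces the multipliers in an inner product y = sum(c_k * x_k) with a lookup table indexed by one bit from each input, accumulated with shifts over the input word length; this is the mechanism inside the ROM Accumulators (RACs) mentioned above. The sketch below illustrates the bit-serial principle for a generic coefficient vector; it does not reproduce the paper's recursive or scaled DCT architectures.

```python
def distributed_arithmetic_dot(coeffs, xs, width=8):
    """Bit-serial inner product sum(c*x) using a table of partial sums,
    with xs given as unsigned `width`-bit integers (illustrative RAC model)."""
    n = len(coeffs)
    # Precompute the "ROM": for every combination of one bit per input,
    # store the sum of the coefficients whose corresponding bit is set.
    rom = [sum(c for bit, c in enumerate(coeffs) if (addr >> bit) & 1)
           for addr in range(1 << n)]
    acc = 0
    for b in range(width - 1, -1, -1):          # MSB first, shift-accumulate
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> b) & 1) << i
        acc = (acc << 1) + rom[addr]
    return acc

coeffs = [3, -1, 4, 2]
xs = [10, 7, 255, 0]
print(distributed_arithmetic_dot(coeffs, xs),
      sum(c * x for c, x in zip(coeffs, xs)))   # both print 1043
```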


Journal ArticleDOI
TL;DR: This study deals with model-based dependability transient analysis of phased mission systems and proposes a modeling methodology that exploits the power of the class of Markov regenerative stochastic Petri net models to attack the weak points of the state-of-the-art.
Abstract: This study deals with model-based dependability transient analysis of phased mission systems. A review of the studies in the literature showed that several aspects of multiphased systems pose challenging problems to the dependability evaluation methods and tools. To attack the weak points of the state-of-the-art we propose a modeling methodology that exploits the power of the class of Markov regenerative stochastic Petri net models. By exploiting the techniques available in the literature for the analysis of the Markov Regenerative Processes, we obtain an analytical solution technique with a low computational complexity, basically dominated by the cost of the separate analysis of the system inside each phase. Last, the existence of analytical solutions allows us to derive the sensitivity functions of the dependability measures, thus providing the dependability engineer with additional means for the study of phased mission systems.

146 citations


Journal ArticleDOI
TL;DR: This paper addresses the reward-based scheduling problem for periodic tasks and proves the optimality of Rate Monotonic Scheduling (with harmonic periods), Earliest Deadline First, and Least Laxity First policies for the case of uniprocessors when used with the optimal service times.
Abstract: Reward-based scheduling refers to the problem in which there is a reward associated with the execution of a task. In our framework, each real-time task comprises a mandatory and an optional part. The mandatory part must complete before the task's deadline, while a nondecreasing reward function is associated with the execution of the optional part, which can be interrupted at any time. Imprecise computation and Increased-Reward-with-Increased-Service models fall within the scope of this framework. In this paper, we address the reward-based scheduling problem for periodic tasks. An optimal schedule is one where mandatory parts complete in a timely manner and the weighted average reward is maximized. For linear and concave reward functions, which are most common, we 1) show the existence of an optimal schedule where the optional service time of a task is constant at every instance and 2) show how to efficiently compute this service time. We also prove the optimality of Rate Monotonic Scheduling (with harmonic periods), Earliest Deadline First, and Least Laxity First policies for the case of uniprocessors when used with the optimal service times we computed. Moreover, we extend our result by showing that any policy which can fully utilize all the processors is also optimal for the multiprocessor periodic reward-based scheduling. To show that our optimal solution is pushing the limits of reward-based scheduling, we further prove that, when the reward functions are convex, the problem becomes NP-Hard. Our static optimal solution, besides providing considerable reward improvements over the previous suboptimal strategies, also has a major practical benefit. Run-time overhead is eliminated and existing scheduling disciplines may be used without modification with the computed optimal service times.
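
The structural result above, that for concave reward functions an optimal schedule gives each task a constant optional service time at every instance, can be sanity-checked numerically: by concavity (Jensen's inequality), spreading a fixed optional budget evenly across invocations never yields less total reward than an uneven split. A tiny check with an example concave reward function of my own choosing:

```python
import math, random

def reward(t):
    # An example nondecreasing concave reward function (my choice, not the paper's).
    return math.log1p(t)

random.seed(1)
budget, invocations = 10.0, 5
even = [budget / invocations] * invocations
raw = [random.random() for _ in range(invocations)]
uneven = [budget * r / sum(raw) for r in raw]      # same total optional budget

# Even allocation never loses to an uneven one for a concave reward function.
print(sum(reward(t) for t in even) >= sum(reward(t) for t in uneven))  # True
```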

Journal ArticleDOI
TL;DR: Two low-complexity bit-parallel systolic multipliers are presented based on the algorithm proposed, which can be applied in computing multiplications over the class of fields GF(2^m) in which the elements are represented with the root of an irreducible equally spaced polynomial.
Abstract: Two operations, the cyclic shifting and the inner product, are defined by the properties of irreducible all one polynomials. An effective algorithm is proposed for computing multiplications over a class of fields GF(2^m) using the two operations. Then, two low-complexity bit-parallel systolic multipliers are presented based on the algorithm. The first multiplier is composed of (m+1)^2 identical cells, each consisting of one 2-input AND gate, one 2-input XOR gate, and three 1-bit latches. The other multiplier comprises (m+1)^2 identical cells and m XOR gates. Each cell consists of one 2-input AND gate, one 2-input XOR gate, and four 1-bit latches. Each multiplier exhibits very low latency and propagation delay and is thus very fast. Moreover, the architectures of the two multipliers can be applied in computing multiplications over the class of fields GF(2^m) in which the elements are represented with the root of an irreducible equally spaced polynomial of degree m.

Journal ArticleDOI
TL;DR: By means of the calculus of variations, an explicit formula is derived that links the optimal checkpointing frequency with a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery.
Abstract: Checkpointing is an effective fault-tolerant technique for improving system availability and reliability. However, a blind checkpointing placement can result in either performance degradation or expensive recovery cost. By means of the calculus of variations, we derive an explicit formula that links the optimal checkpointing frequency with a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery. The theoretical result shows that the optimal checkpointing frequency is proportional to the square root of the failure rate and can be uniquely determined by the failure rate (time-varying or constant) if the recovery function is strictly increasing and the failure rate satisfies λ(∞) > 0. J.L. Bruno and E.G. Coffman (1997) suggest that optimal checkpointing by its nature is a function of system failure rate, i.e., the time-varying failure rate demands time-varying checkpointing in order to meet the criteria of certain optimality. The results obtained in this paper agree with their viewpoint.
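
To illustrate the square-root relationship stated above: for a constant failure rate lambda and checkpoint overhead C, a first-order approximation of the optimal checkpoint interval is sqrt(2C/lambda), and for a time-varying rate the checkpointing density tracks sqrt(lambda(t)). The numeric sketch below is my own illustration of that dependence, not the paper's variational derivation; the constant of proportionality is assumed.

```python
import math

def checkpoint_times(failure_rate, horizon, overhead, dt=0.001):
    """Place checkpoints so the local interval tracks sqrt(2*overhead/rate(t)).
    failure_rate: callable t -> lambda(t) > 0.  Purely illustrative."""
    times, t, last = [], 0.0, 0.0
    while t < horizon:
        target = math.sqrt(2.0 * overhead / failure_rate(t))
        if t - last >= target:
            times.append(round(t, 3))
            last = t
        t += dt
    return times

# Constant rate: evenly spaced checkpoints roughly sqrt(2*C/lambda) ~ 1.41 apart.
print(checkpoint_times(lambda t: 0.1, horizon=10, overhead=0.1)[:4])
# Increasing rate: checkpoints become denser as the failure rate grows.
print(checkpoint_times(lambda t: 0.05 * (1 + t), horizon=10, overhead=0.1)[:6])
```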

Journal ArticleDOI
TL;DR: An in-depth investigation of software and hardware techniques to take advantage of the DRAM mode control capabilities at a module granularity for energy savings using a memory system architecture capturing five different energy modes and corresponding resynchronization times.
Abstract: The anticipated explosive growth of pervasive and mobile computing devices that are typically constrained by energy has brought hardware and software techniques for energy conservation into the spotlight. While there have been several studies and proposals for energy conservation for CPUs and peripherals, energy optimization techniques for selective operating mode control of DRAMs have not been fully explored. It has been shown that, for some systems, as much as 90 percent of overall system energy (excluding I/O) is consumed by the DRAM modules, thus, they serve as a good candidate for energy optimizations. Further, DRAM technology has also matured to provide several low energy operating modes (power modes), making it an opportune moment to conduct studies exploring the potential benefits of mode control techniques. This paper conducts an in-depth investigation of software and hardware techniques to take advantage of the DRAM mode control capabilities at a module granularity for energy savings. Using a memory system architecture capturing five different energy modes and corresponding resynchronization times, this paper presents several novel compilation techniques to both cluster the data across memory banks as well as to detect module idleness and perform energy mode transitions. In addition, hardware-assisted approaches (called self-monitoring) based on predictions of module interaccess times are proposed. These techniques are extensively evaluated using a set of a dozen benchmarks. It is shown that we get an average of 61 percent savings in DRAM energy using compiler-directed mode control. One of the self-monitored approaches gives as much as 89 percent savings (72 percent on the average), coming as close as 8.8 percent to the optimal energy savings that one can expect with DRAM module mode control. The optimization techniques are demonstrated to be invaluable for energy savings as memory technologies continue to evolve.
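
A simplified model of the hardware-assisted (self-monitoring) idea described above: a per-module controller watches idle time and steps the module into progressively lower-power modes after threshold periods, paying a resynchronization penalty on the next access. The mode names, power ratios, and thresholds below are placeholders of my own for illustration, not the paper's measured values.

```python
# Illustrative per-module mode controller.  Modes, power ratios, and
# resynchronization costs are made-up placeholders, not the paper's numbers.
MODES = [                        # (name, relative power, resync cost in cycles)
    ("active",    1.00,    0),
    ("standby",   0.50,    2),
    ("napping",   0.10,   30),
    ("powerdown", 0.02, 9000),
]
THRESHOLDS = [0, 10, 100, 10000]   # idle cycles before entering each mode

def simulate(access_times, end_time):
    mode, idle, energy, stall = 0, 0, 0.0, 0
    accesses = set(access_times)
    for t in range(end_time):
        if t in accesses:
            stall += MODES[mode][2]   # pay the resynchronization cost on wake-up
            mode, idle = 0, 0
        else:
            idle += 1
            while mode + 1 < len(MODES) and idle >= THRESHOLDS[mode + 1]:
                mode += 1             # step down after enough idleness
        energy += MODES[mode][1]
    return energy, stall

print(simulate(access_times=[5, 6, 7, 500, 501], end_time=2000))
```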

Journal ArticleDOI
TL;DR: The design of the Impulse architecture and how an Impulse memory system can be used in a variety of ways to improve the performance of memory-bound applications are described and the effectiveness of these optimizations are demonstrated.
Abstract: Impulse is a memory system architecture that adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to remap their data structures in memory. As a result, they can control how their data is accessed and cached, which can improve cache and bus utilization. The Impulse design does not require any modification to processor, cache, or bus designs since all the functionality resides at the memory controller. As a result, Impulse can be adopted in conventional systems without major system changes. We describe the design of the Impulse architecture and how an Impulse memory system can be used in a variety of ways to improve the performance of memory-bound applications. Impulse can be used to dynamically create superpages cheaply, to dynamically recolor physical pages, to perform strided fetches, and to perform gathers and scatters through indirection vectors. Our performance results demonstrate the effectiveness of these optimizations in a variety of scenarios. Using Impulse can speed up a range of applications from 20 percent to over a factor of 5. Alternatively, Impulse can be used by the OS for dynamic superpage creation; the best policy for creating superpages using Impulse outperforms previously known superpage creation policies.
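
The gather-through-an-indirection-vector use of Impulse can be modeled as follows: the application configures a shadow region so that element i of the shadow maps to element index_vector[i] of the underlying array, and the processor then sees a dense stream. The Python model below only shows the remapping semantics; the class name and interface are my invention, and the real mechanism lives in the memory controller and the OS.

```python
class ShadowGather:
    """Toy model of an Impulse-style shadow region: reading shadow[i]
    returns data[index_vector[i]], so sparse accesses look dense."""
    def __init__(self, data, index_vector):
        self.data = data
        self.index_vector = index_vector

    def __getitem__(self, i):
        return self.data[self.index_vector[i]]

    def __len__(self):
        return len(self.index_vector)

# Sparse access pattern: only the touched elements are pulled through the
# "remapped" region, which is what improves cache-line and bus utilization.
x = list(range(1000))
cols = [3, 17, 256, 511, 998]        # indices a sparse computation actually needs
gathered = ShadowGather(x, cols)
print([gathered[i] for i in range(len(gathered))])   # [3, 17, 256, 511, 998]
```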

Journal ArticleDOI
TL;DR: In this paper, the authors study the load balancing problem for dense linear algebra kernels on heterogeneous networks of workstations and propose a data allocation heuristic to balance the load on heterogeneous platforms with respect to the performance of processors.
Abstract: The authors study the implementation of dense linear algebra kernels, such as matrix multiplication or linear system solvers, on heterogeneous networks of workstations. The uniform block-cyclic data distribution scheme commonly used for homogeneous collections of processors limits the performance of these linear algebra kernels on heterogeneous grids to the speed of the slowest processor. We present and study more sophisticated data allocation strategies that balance the load on heterogeneous platforms with respect to the performance of the processors. When targeting unidimensional grids, the load-balancing problem can be solved rather easily. When targeting two-dimensional grids, which are the key to scalability and efficiency for numerical kernels, the problem turns out to be surprisingly difficult. We formally state the 2D load-balancing problem and prove its NP-completeness. Next, we introduce a data allocation heuristic, which turns out to be very satisfactory: Its practical usefulness is demonstrated by MPI experiments conducted with a heterogeneous network of workstations.
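
For the unidimensional case that the abstract describes as solvable rather easily, a simple incremental allocation balances the load: repeatedly give the next block column to the processor that would finish its enlarged share soonest. The sketch below shows that greedy 1D allocation; the hard 2D problem and the paper's heuristic are not reproduced here.

```python
import heapq

def allocate_1d(num_columns, speeds):
    """Distribute block columns over processors with given relative speeds,
    keeping the maximum (columns / speed) small.  Greedy 1D sketch."""
    shares = [0] * len(speeds)
    # Heap keyed by the completion time each processor would have if it
    # received one more column.
    heap = [(1.0 / s, i) for i, s in enumerate(speeds)]
    heapq.heapify(heap)
    for _ in range(num_columns):
        _, i = heapq.heappop(heap)
        shares[i] += 1
        heapq.heappush(heap, ((shares[i] + 1) / speeds[i], i))
    return shares

# Three workstations, one twice as fast and one three times as fast as the first:
print(allocate_1d(num_columns=12, speeds=[1, 2, 3]))   # [2, 4, 6], all finish together
```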

Journal ArticleDOI
TL;DR: A systematic design of a modified Mastrovito multiplier, suitable for GF(2^m) generated by high-Hamming-weight irreducible polynomials, is proposed; the design effectively exploits the spatial correlation of elements in the Mastrovito product matrix to reduce the complexity.
Abstract: This paper considers the design of bit-parallel dedicated finite field multipliers using standard basis. An explicit algorithm is proposed for efficient construction of Mastrovito product matrix, based on which we present a systematic design of Mastrovito multiplier applicable to GF(2^m) generated by an arbitrary irreducible polynomial. This design effectively exploits the spatial correlation of elements in Mastrovito product matrix to reduce the complexity. Using a similar methodology, we propose a systematic design of modified Mastrovito multiplier, which is suitable for GF(2^m) generated by high-Hamming weight irreducible polynomials. For both original and modified Mastrovito multipliers, the developed multiplier architectures are highly modular, which is desirable for VLSI hardware implementation. Applying the proposed algorithm and design approach, we study the Mastrovito multipliers for several special irreducible polynomials, such as trinomial and equally-spaced-polynomial, and the obtained complexity results match the best known results. Moreover, we have discovered several new special irreducible polynomials which also lead to low-complexity Mastrovito multipliers.
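
The object at the center of this design is the Mastrovito product matrix: for a fixed field element a, the product c = a*b in GF(2^m) becomes a binary matrix-vector product c = M(a)*b over GF(2), where column j of M holds the coefficients of a*x^j mod f(x). The code below is a plain software rendering of that definition as a correctness reference; the paper's complexity-reducing exploitation of the matrix structure is not shown.

```python
def mastrovito_matrix(a_bits, f_bits, m):
    """Column j of M is the coefficient vector of a(x) * x^j mod f(x),
    so that c = M(a) . b over GF(2) equals a*b in GF(2^m)."""
    def xtime(poly):                        # multiply by x and reduce modulo f
        poly <<= 1
        if (poly >> m) & 1:
            poly ^= f_bits
        return poly & ((1 << m) - 1)

    cols, cur = [], a_bits & ((1 << m) - 1)
    for _ in range(m):
        cols.append(cur)
        cur = xtime(cur)
    # M[i][j] = coefficient i of a*x^j mod f
    return [[(cols[j] >> i) & 1 for j in range(m)] for i in range(m)]

def multiply(a_bits, b_bits, f_bits, m):
    M = mastrovito_matrix(a_bits, f_bits, m)
    c = 0
    for i in range(m):
        bit = 0
        for j in range(m):
            bit ^= M[i][j] & (b_bits >> j)   # AND then XOR-accumulate over GF(2)
        c |= (bit & 1) << i
    return c

# GF(2^4) with the trinomial f(x) = x^4 + x + 1 (a common small example):
f, m = 0b10011, 4
print(bin(multiply(0b0110, 0b0011, f, m)))   # (x^2+x)(x+1) = x^3 + x -> 0b1010
```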

Journal ArticleDOI
TL;DR: An investigation has been made into the use of stochastic arithmetic to implement an artificial neural network solution to a typical pattern recognition application, with results indicating an order of magnitude improvement over the floating-point implementation assuming clock frequency parity.
Abstract: For Part I, see ibid., pp. 891-905. An investigation has been made into the use of stochastic arithmetic to implement an artificial neural network solution to a typical pattern recognition application. Optical character recognition is performed on very noisy characters in the E-13B MICR font. The artificial neural network is composed of two layers, the first layer being a set of soft competitive learning subnetworks and the second a set of fully connected linear output neurons. The observed number of clock cycles in the stochastic case represents an order of magnitude improvement over the floating-point implementation assuming clock frequency parity. Network generalization capabilities were also compared based on the network squared error as a function of the amount of noise added to the input patterns. The stochastic network maintains a squared error within 10 percent of that of the floating-point implementation for a wide range of noise levels.

Journal ArticleDOI
TL;DR: This paper first describes the Swing Modulo Scheduling technique and evaluates it against other heuristic methods for the Perfect Club benchmark suite on a generic VLIW architecture, showing that it outperforms them in terms of the quality of the obtained schedules and compilation time.
Abstract: This paper presents a novel software pipelining approach, which is called Swing Modulo Scheduling (SMS). It generates schedules that are near optimal in terms of initiation interval, register requirements, and stage count. Swing Modulo Scheduling is a heuristic approach that has a low computational cost. This paper first describes the technique and evaluates it for the Perfect Club benchmark suite on a generic VLIW architecture. SMS is compared with other heuristic methods, showing that it outperforms them in terms of the quality of the obtained schedules and compilation time. To further explore the effectiveness of SMS, the experience of incorporating it into a production quality compiler for the Equator MAP1000 processor is described; implementation issues are discussed, as well as modifications and improvements to the original algorithm. Finally, experimental results from using a set of industrial multimedia applications are presented.

Journal ArticleDOI
TL;DR: In this article, power analysis attacks are applied to cryptosystems that use scalar multiplication on Koblitz curves and a number of countermeasures against simple and differential power analysis attack are suggested.
Abstract: Because of their shorter key sizes, cryptosystems based on elliptic curves are being increasingly used in practical applications. A special class of elliptic curves, namely, Koblitz curves, offers an additional, but crucial advantage of considerably reduced processing time. Power analysis attacks are applied to cryptosystems that use scalar multiplication on Koblitz curves. Both the simple and the differential power analysis attacks are considered and a number of countermeasures are suggested. While the proposed countermeasures against the simple power analysis attacks rely on making the power consumption for the elliptic curve scalar multiplication independent of the secret key, those for the differential power analysis attacks depend on randomizing the secret key prior to each execution of the scalar multiplication. These countermeasures are computationally efficient and suitable for hardware implementation.
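
One of the differential power analysis countermeasures mentioned, randomizing the secret scalar before each scalar multiplication, relies on the identity (k + r*n)*P = k*P, where n is the order of the base point. The toy example below demonstrates that identity with textbook double-and-add on a tiny prime-field curve; a Koblitz curve over GF(2^m) would be used in practice, and the curve, point formulas, and names here are mine, chosen only to make the identity checkable.

```python
# Toy prime-field curve y^2 = x^3 + a*x + b over F_p; None is the point at infinity.
import random

p, a, b = 17, 2, 2
P = (5, 1)                                   # on the curve: 1 = 125 + 10 + 2 mod 17

def add(Q, R):
    if Q is None: return R
    if R is None: return Q
    (x1, y1), (x2, y2) = Q, R
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                          # Q + (-Q) = infinity
    if Q == R:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def scalar_mult(k, Q):
    result = None
    while k:
        if k & 1:
            result = add(result, Q)
        Q = add(Q, Q)
        k >>= 1
    return result

# Order of P: smallest n > 0 with n*P = infinity (computed, not assumed).
n, T = 1, P
while T is not None:
    T = add(T, P)
    n += 1

k = 7                                        # the "secret" scalar
r = random.randrange(1, 1000)
k_random = k + r * n                         # countermeasure: fresh k' per execution
print(scalar_mult(k, P) == scalar_mult(k_random, P))   # True: same point, different trace
```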

Journal ArticleDOI
TL;DR: The architectural issues explored in this study show that, when Java applications are executed with a JIT compiler, selective translation using good heuristics can improve performance, but the saving is only 10-15 percent at best; the study also offers insights and architectural proposals for designing an efficient Java runtime system.
Abstract: The Java Virtual Machine (JVM) is the cornerstone of Java technology and its efficiency in executing the portable Java bytecodes is crucial for the success of this technology. Interpretation, Just-in-Time (JIT) compilation, and hardware realization are well-known solutions for a JVM and previous research has proposed optimizations for each of these techniques. However, each technique has its pros and cons and may not be uniformly attractive for all hardware platforms. Instead, an understanding of the architectural implications of JVM implementations with real applications can be crucial to the development of enabling technologies for efficient Java runtime system development on a wide range of platforms. Toward this goal, this paper examines architectural issues from both the hardware and JVM implementation perspectives. The paper starts by identifying the important execution characteristics of Java applications from a bytecode perspective. It then explores the potential of a smart JIT compiler strategy that can dynamically interpret or compile based on associated costs and investigates the CPU and cache architectural support that would benefit JVM implementations. We also study the available parallelism during the different execution modes using applications from the SPECjvm98 benchmarks. At the bytecode level, it is observed that less than 5 out of the 256 bytecodes constitute 90 percent of the dynamic bytecode stream. Method sizes fall into a trinodal distribution with peaks at 1, 9, and 26 bytecodes across all benchmarks. The architectural issues explored in this study show that, when Java applications are executed with a JIT compiler, selective translation using good heuristics can improve performance, but the saving is only 10-15 percent at best. The instruction and data cache performance of Java applications are seen to be better than that of C/C++ applications except in the case of data cache performance in the JIT mode. Write misses resulting from installation of JIT compiler output dominate the misses and deteriorate the data cache performance in JIT mode. A study on the available parallelism shows that Java programs executed using JIT compilers have parallelism comparable to C/C++ programs for small window sizes, but falls behind when the window size is increased. Java programs executed using the interpreter have very little parallelism due to the stack nature of the JVM instruction set, which is dominant in the interpreted execution mode. In addition, this work gives revealing insights and architectural proposals for designing an efficient Java runtime system.

Journal ArticleDOI
TL;DR: The results show that the SDF architecture can outperform the superscalar, scales better with the number of functional units, and allows for a good exploitation of Thread Level Parallelism (TLP) and available chip area.
Abstract: In this paper, the scheduled dataflow (SDF) architecture-a decoupled memory/execution, multithreaded architecture using nonblocking threads-is presented in detail and evaluated against superscalar architecture. Recent focus in the field of new processor architectures is mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend allows for better performance, but at the expense of increased hardware complexity and, possibly, higher power expenditures resulting from dynamic instruction scheduling. Our research deviates from this trend by exploring a simpler, yet powerful execution paradigm that is based on dataflow and multithreading. A program is partitioned into nonblocking execution threads. In addition, all memory accesses are decoupled from the thread's execution. Data is preloaded into the thread's context (registers) and all results are poststored after the completion of the thread's execution. While multithreading and decoupling are possible with control-flow architectures, SDF makes it easier to coordinate the memory accesses and execution of a thread, as well as eliminate unnecessary dependencies among instructions. We have compared the execution cycles required for programs on SDF with the execution cycles required by programs on SimpleScalar (a superscalar simulator) by considering the essential aspects of these architectures in order to have a fair comparison. The results show that SDF architecture can outperform the superscalar. SDF performance scales better with the number of functional units and allows for a good exploitation of Thread Level Parallelism (TLP) and available chip area.

Journal ArticleDOI
TL;DR: This work presents a compiler technique, which is based on Shasha and Snir's delay set analysis, to hide the underlying relaxed memory consistency model and to guarantee sequential consistency, and introduces dominators with respect to a node in a control flow graph to identify memory-barrier nodes.
Abstract: We present a compiler technique, which is based on Shasha and Snir's delay set analysis, to hide the underlying relaxed memory consistency model for an optimizing compiler for explicitly parallel programs. The compiler presents programmers with a sequentially consistent view of the underlying machine, irrespective of whether it follows a sequentially consistent model or a relaxed model. To hide the underlying relaxed memory consistency model and to guarantee sequential consistency, our algorithm inserts fence instructions by identifying memory-barrier nodes. We reduce the number of fence instructions by exploiting the ordering constraints of the underlying memory consistency model and the property of fence and synchronization operations. We introduce dominators with respect to a node in a control flow graph to identify memory-barrier nodes and show that minimizing the number of memory-barrier nodes is NP-hard.

Journal ArticleDOI
TL;DR: A batching policy is proposed that schedules the video with the maximum factored queue length and is referred to as MFQL, which yields excellent empirical results in terms of standard performance measures such as average latency time, defection rates, and fairness.
Abstract: In a video-on-demand environment, batching of video requests is often used to reduce I/O demand and improve throughput. Since viewers may defect if they experience long waits, a good video scheduling policy needs to consider not only the batch size but also the viewer defection probabilities and wait times. Two conventional scheduling policies for batching are the first-come-first-served (FCFS) policy, which schedules the video with the longest waiting request, and the maximum queue length (MQL) policy, which selects the video with the maximum number of waiting requests. Neither of these policies leads to entirely satisfactory results. MQL tends to be too aggressive in scheduling popular videos by considering only the queue length to maximize batch size, while FCFS has the opposite effect by completely ignoring the queue length and focusing on arrival time to reduce defection. In this paper, we introduce the notion of factored queue length and propose a batching policy that schedules the video with the maximum factored queue length. We refer to this as the MFQL policy. The factored queue length is obtained by weighting each video queue length with a factor which is biased against the more popular videos. An optimization problem is formulated to solve for the best weighting factors for the various videos. We also consider MFQL implementation issues. A simulation is developed to compare the proposed MFQL variants with FCFS and MQL. Our study shows that MFQL yields excellent empirical results in terms of standard performance measures such as average latency time, defection rates, and fairness.
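
A minimal rendering of the selection rule: each waiting queue's length is multiplied by a per-video factor biased against popular titles, and the video with the largest factored queue length is scheduled when a stream becomes free. The weighting used below (inverse square root of popularity) is only a placeholder of mine; the paper derives the weighting factors from an optimization problem.

```python
import math

def pick_video_mfql(queue_lengths, popularity):
    """Return the video with the maximum factored queue length.  The
    1/sqrt(popularity) factor is an illustrative choice, not the paper's
    optimized weights."""
    best, best_score = None, -1.0
    for video, qlen in queue_lengths.items():
        factor = 1.0 / math.sqrt(popularity[video])
        score = factor * qlen
        if score > best_score:
            best, best_score = video, score
    return best

queues = {"A": 6, "B": 5, "C": 2}               # waiting requests per video
popularity = {"A": 0.70, "B": 0.20, "C": 0.10}  # request probabilities
# Plain MQL would pick "A" (longest queue); the factored queue length picks "B".
print(pick_video_mfql(queues, popularity))
```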

Journal ArticleDOI
TL;DR: This paper considers how to effectively use a bank-exposed memory system comprised of small, decentralized cache banks for sequential programs and demonstrates that using bank disambiguation improves performance, by a factor of 3 to 5 over using ILP alone.
Abstract: Technological trends require that future scalable microprocessors be decentralized. Applying these trends toward memory systems shows that the size of the cache accessible in a single cycle will decrease in a future generation of chips. Thus, a bank-exposed memory system comprised of small, decentralized cache banks must eventually replace that of a monolithic cache. This paper considers how to effectively use such a memory system for sequential programs. This paper presents Maps, the software technology central to bank-exposed architectures, which are architectures with bank-exposed memory systems. Maps solves the problem of bank disambiguation-that of determining at compile-time which bank a memory reference is accessing. Bank disambiguation is important because it enables the compile-time optimization for data locality, where data can be placed close to the computation that requires it. Two methods for bank disambiguation are presented: equivalence-class unification and modulo unrolling. Experimental results are presented using a compiler for the MIT Raw machine, a bank-exposed architecture that relies on the compiler to 1) manage its memory and 2) orchestrate its instruction level parallelism and communication. Results on Raw using sequential codes demonstrate that using bank disambiguation improves performance, by a factor of 3 to 5 over using ILP alone.

Journal ArticleDOI
TL;DR: A fast algorithm for multiplicative inversion in GF(2^m) using normal basis is proposed, which is an improvement on those proposed by Itoh and Tsujii and by Chang et al., which are based on Fermat's theorem and require O(log m) multiplications.
Abstract: A fast algorithm for multiplicative inversion in GF(2^m) using normal basis is proposed. It is an improvement on those proposed by Itoh and Tsujii and by Chang et al., which are based on Fermat's theorem and require O(log m) multiplications. The number of multiplications is reduced by decomposing m-1 into several factors and a small remainder.
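
The Fermat-based starting point for these algorithms is the identity a^(-1) = a^(2^m - 2) in GF(2^m); the cited methods reduce the multiplication count through clever exponent decompositions (Itoh-Tsujii uses the binary form of m-1, the proposed method factors m-1). The sketch below only implements the plain Fermat identity with square-and-multiply over a polynomial basis, as a correctness reference rather than the optimized normal-basis algorithm.

```python
def gf_mul(x, y, f, m):
    """Carry-less multiplication of x and y, reduced modulo the irreducible f."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if (x >> m) & 1:
            x ^= f
    return r

def gf_inverse_fermat(a, f, m):
    """a^(2^m - 2) by square-and-multiply: the baseline that Itoh-Tsujii-style
    decompositions of m-1 improve upon (they need only O(log m) multiplies)."""
    result, base, exponent = 1, a, (1 << m) - 2
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base, f, m)
        base = gf_mul(base, base, f, m)
        exponent >>= 1
    return result

# GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1:
f, m = 0x11B, 8
a = 0x53
inv = gf_inverse_fermat(a, f, m)
print(hex(inv), hex(gf_mul(a, inv, f, m)))   # 0xca 0x1  (a * a^-1 = 1)
```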

Journal ArticleDOI
TL;DR: A simulation-based performance study of several of the new high-performance DRAM architectures, each evaluated in a small system organization, reveals several things, including that bus transmission speed will soon become a primary factor limiting memory-system performance.
Abstract: This paper presents a simulation-based performance study of several of the new high-performance DRAM architectures, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use only a handful of DRAM chips (~10, as opposed to ~1 or ~100). The study covers Fast Page Mode, Extended Data Out, Synchronous, Enhanced Synchronous, Double Data Rate, Synchronous Link, Rambus, and Direct Rambus designs. Our simulations reveal several things: 1) Current advanced DRAM technologies are attacking the memory bandwidth problem but not the latency problem; 2) bus transmission speed will soon become a primary factor limiting memory-system performance; 3) the post-L2 address stream still contains significant locality, though it varies from application to application; 4) systems without L2 caches are feasible for low- and medium-speed CPUs (1 GHz and below); and 5) as we move to wider buses, row access time becomes more prominent, making it important to investigate techniques to exploit the available locality to decrease access time.

Journal ArticleDOI
TL;DR: A range of information-lossless address and instruction trace compression schemes that can reduce both storage space and access time by an order of magnitude or more, without discarding either references or interreference timing information from the original trace are discussed.
Abstract: The tremendous storage space required for a useful data base of program traces has prompted a search for trace reduction techniques. In this paper, we discuss a range of information-lossless address and instruction trace compression schemes that can reduce both storage space and access time by an order of magnitude or more, without discarding either references or interreference timing information from the original trace. The PDATS family of trace compression techniques achieves trace coding densities of about six references per byte. This family of techniques is now in use as the standard in the NMSU TraceBase, an extensive trace archive that has been established for use by the international research and teaching community.
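
The central idea in this family of schemes is to record each reference as a small offset from the previous reference, using a record only as large as the offset needs, so no references or timing information are discarded. The encoder below illustrates lossless offset-based coding of an address stream; the actual PDATS record layout and its timing fields are not reproduced here.

```python
import struct

def encode_offsets(addresses):
    """Lossless delta coding: each address is stored as a signed offset from
    the previous one, in 1, 2, or 4 bytes (a simplified stand-in for PDATS)."""
    out, prev = bytearray(), 0
    for addr in addresses:
        delta = addr - prev
        prev = addr
        if -128 <= delta < 128:
            out += b"\x01" + struct.pack("<b", delta)
        elif -2**15 <= delta < 2**15:
            out += b"\x02" + struct.pack("<h", delta)
        else:
            out += b"\x04" + struct.pack("<i", delta)
    return bytes(out)

def decode_offsets(blob):
    addresses, prev, i = [], 0, 0
    fmt = {1: "<b", 2: "<h", 4: "<i"}
    while i < len(blob):
        size = blob[i]
        (delta,) = struct.unpack_from(fmt[size], blob, i + 1)
        prev += delta
        addresses.append(prev)
        i += 1 + size
    return addresses

trace = [0x40001000, 0x40001004, 0x40001008, 0x40002000, 0x40001010]
blob = encode_offsets(trace)
print(len(blob), decode_offsets(blob) == trace)   # 15 bytes vs. 20 raw, True
```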

Journal ArticleDOI
TL;DR: Algorithms are developed to multiply two vectors, a vector and a matrix, and two matrices on an OTIS-Mesh optoelectronic computer; the algorithm that multiplies a column vector by a row vector uses an optimal number of data moves for both the group row and group submesh mappings.
Abstract: We develop algorithms to multiply two vectors, a vector and a matrix, and two matrices on an OTIS-Mesh optoelectronic computer. Two mappings, group row and group submesh, of a matrix onto an OTIS-Mesh are considered and the relative merits of each compared. We show that our algorithms to multiply a column and row vector use an optimal number of data moves for both the group row and group submesh mappings, our algorithm to multiply a row vector and a column vector is optimal for the group row mapping, and our algorithm to multiply a matrix by a column vector is optimal for the group row mapping.

Journal ArticleDOI
TL;DR: This paper presents a symbolic approach for the analysis of bounded Petri nets and shows how large reachability sets can be generated, represented, and analyzed with moderate BDD sizes.
Abstract: This paper presents a symbolic approach for the analysis of bounded Petri nets. The structure and behavior of the Petri net is symbolically modeled by using Boolean functions, thus reducing reasoning about Petri nets to Boolean calculation. The set of reachable markings is calculated by symbolically firing the transitions in the Petri net. Highly concurrent systems suffer from the state explosion problem produced by an exponential increase of the number of reachable states. This state explosion is handled by using Binary Decision Diagrams (BDDs) which are capable of representing large sets of markings with small data structures. Petri nets have the ability to model a large variety of systems and the flexibility to describe causality, concurrency, and conditional relations. The manipulation of vast state spaces generated by Petri nets enables the efficient analysis of a wide range of problems, e.g., deadlock freeness, liveness, and concurrency. A number of examples are presented in order to show how large reachability sets can be generated, represented, and analyzed with moderate BDD sizes. By using this symbolic framework, properties requiring an exhaustive analysis of the reachability graph can be efficiently verified.
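
The computation at the core of this approach is a least fixed point: starting from the initial marking, repeatedly add every marking obtainable by firing an enabled transition until nothing new appears. The sketch below performs that fixed point for a safe (1-bounded) net with explicit Python sets standing in for the Boolean/BDD representation; the symbolic encoding itself is what keeps the real tool's memory use moderate, and it is not reproduced here.

```python
def enabled(marking, pre):
    # A transition of a safe (1-bounded) net is enabled if every input place holds a token.
    return all(marking[p] for p in pre)

def fire(marking, pre, post):
    new = list(marking)
    for p in pre:
        new[p] = 0
    for p in post:
        new[p] = 1
    return tuple(new)

def reachable(initial, transitions):
    """Least fixed point over the reachability set; sets stand in for BDDs."""
    seen, frontier = {initial}, {initial}
    while frontier:
        next_frontier = set()
        for m in frontier:
            for pre, post in transitions:
                if enabled(m, pre):
                    m2 = fire(m, pre, post)
                    if m2 not in seen:
                        seen.add(m2)
                        next_frontier.add(m2)
        frontier = next_frontier
    return seen

# A 4-place cycle: one token moves p0 -> p1 -> p2 -> p3 -> p0.
transitions = [([0], [1]), ([1], [2]), ([2], [3]), ([3], [0])]
print(sorted(reachable((1, 0, 0, 0), transitions)))   # 4 reachable markings
```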