
Showing papers in "IEEE Transactions on Computers in 2001"


Journal ArticleDOI
TL;DR: Experimental results from trace-driven simulations show that the performance of the LRFU is at least competitive with that of previously known policies for the workloads the authors considered.
Abstract: Efficient and effective buffering of disk blocks in main memory is critical for better file system performance due to a wide speed gap between main memory and hard disks. In such a buffering system, one of the most important design decisions is the block replacement policy that determines which disk block to replace when the buffer is full. In this paper, we show that there exists a spectrum of block replacement policies that subsumes the two seemingly unrelated and independent Least Recently Used (LRU) and Least Frequently Used (LFU) policies. The spectrum is called the LRFU (Least Recently/Frequently Used) policy and is formed by how much more weight we give to the recent history than to the older history. We also show that there is a spectrum of implementations of the LRFU that again subsumes the LRU and LFU implementations. This spectrum is again dictated by how much weight is given to recent and older histories and the time complexity of the implementations lies between O(1) (the time complexity of LRU) and O(log2 n) (the time complexity of LFU), where n is the number of blocks in the buffer. Experimental results from trace-driven simulations show that the performance of the LRFU is at least competitive with that of previously known policies for the workloads we considered.

593 citations
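
As a rough illustration of the recency/frequency spectrum described in the abstract, the sketch below maintains a combined recency-and-frequency value per block with the weighting function F(x) = (1/2)^(lambda*x); small lambda behaves more like LFU, large lambda approaches LRU. This is a minimal sketch of the idea with assumed parameters, not the paper's optimized heap-based implementation.

```python
# Minimal LRFU-style buffer cache sketch (illustrative, not the paper's code).
class LRFUCache:
    def __init__(self, capacity, lam=0.1):
        self.capacity = capacity
        self.lam = lam          # controls the recency/frequency trade-off
        self.clock = 0          # virtual time, advanced once per reference
        self.crf = {}           # block -> combined recency/frequency value
        self.last = {}          # block -> time of last reference

    def _decay(self, block):
        # F(x + y) = F(x) * F(y), so the stored CRF can be decayed in O(1).
        age = self.clock - self.last[block]
        return self.crf[block] * (0.5 ** (self.lam * age))

    def reference(self, block):
        self.clock += 1
        if block in self.crf:
            self.crf[block] = 1.0 + self._decay(block)   # F(0) = 1
        else:
            if len(self.crf) >= self.capacity:
                victim = min(self.crf, key=self._decay)  # lowest decayed CRF
                del self.crf[victim], self.last[victim]
            self.crf[block] = 1.0
        self.last[block] = self.clock


cache = LRFUCache(capacity=3, lam=0.1)
for b in [1, 2, 3, 1, 1, 4, 2]:
    cache.reference(b)
print(sorted(cache.crf))   # blocks currently resident
```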


Journal ArticleDOI
TL;DR: The primary contribution of this paper is in introducing several state machine-based computational elements for performing sigmoid nonlinearity mappings, linear gain, and exponentiation functions, and describing an efficient method for the generation of, and conversion between, stochastic and deterministic binary signals.
Abstract: This paper examines a number of stochastic computational elements employed in artificial neural networks, several of which are introduced for the first time, together with an analysis of their operation. We briefly include multiplication, squaring, addition, subtraction, and division circuits in both unipolar and bipolar formats, the principles of which are well-known, at least for unipolar signals. We have introduced several modifications to improve the speed of the division operation. The primary contribution of this paper, however, is in introducing several state machine-based computational elements for performing sigmoid nonlinearity mappings, linear gain, and exponentiation functions. We also describe an efficient method for the generation of, and conversion between, stochastic and deterministic binary signals. The validity of the present approach is demonstrated in a companion paper through a sample application, the recognition of noisy optical characters using soft competitive learning. Network generalization capabilities of the stochastic network maintain a squared error within 10 percent of that of a floating-point implementation for a wide range of noise levels. While the accuracy of stochastic computation may not compare favorably with more conventional binary radix-based computation, the low circuit area, power, and speed characteristics may, in certain situations, make them attractive for VLSI implementation of artificial neural networks.

497 citations
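
To make the unipolar stochastic encoding concrete: a value p in [0, 1] is carried by a bit stream whose bits are 1 with probability p, and multiplication reduces to a bitwise AND of two independent streams. The sketch below is my own illustration of that principle, not circuitry from the paper; accuracy improves with stream length.

```python
import random

def to_stream(p, n_bits, rng):
    # Unipolar stochastic encoding: each bit is 1 with probability p.
    return [1 if rng.random() < p else 0 for _ in range(n_bits)]

def stochastic_multiply(p, q, n_bits=4096, seed=0):
    rng = random.Random(seed)
    a = to_stream(p, n_bits, rng)
    b = to_stream(q, n_bits, rng)
    # One AND gate per bit: P(a_i AND b_i = 1) = p * q for independent streams.
    product_stream = [x & y for x, y in zip(a, b)]
    return sum(product_stream) / n_bits

print(stochastic_multiply(0.8, 0.5))   # close to 0.40 for long streams
```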


Journal ArticleDOI
TL;DR: This paper defines four temporal constraints based on determining a maximum number of deadlines that can be missed during a window of time (a given number of invocations) and provides the theoretical analysis of the properties and relationships of these constraints.
Abstract: In a hard real-time system, it is assumed that no deadline is missed, whereas, in a soft or firm real-time system, deadlines can be missed, although this usually happens in a nonpredictable way. However, most hard real-time systems could miss some deadlines provided that it happens in a known and predictable way. Also, adding predictability on the pattern of missed deadlines for soft and firm real-time systems is desirable, for instance, to guarantee levels of quality of service. We introduce the concept of weakly hard real-time systems to model real-time systems that can tolerate a clearly specified degree of missed deadlines. For this purpose, we define four temporal constraints based on determining a maximum number of deadlines that can be missed during a window of time (a given number of invocations). This paper provides the theoretical analysis of the properties and relationships of these constraints. It also shows the exact conditions under which a constraint is harder to satisfy than another constraint. Finally, results on fixed priority scheduling and response-time schedulability tests for a wide range of process models are presented.

395 citations
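
One of the constraint forms described above can be read as: in any window of k consecutive invocations, at most m deadlines are missed. The helper below is a straightforward sliding-window check over a recorded miss history; it is illustrative only, since the paper's contribution is the analysis relating such constraints, not this check.

```python
def satisfies_window_constraint(miss_history, m, k):
    """True if every window of k consecutive invocations contains at most
    m missed deadlines.  miss_history is a sequence of 0 (met) / 1 (missed)."""
    if len(miss_history) < k:
        return sum(miss_history) <= m
    window = sum(miss_history[:k])
    if window > m:
        return False
    for i in range(k, len(miss_history)):
        window += miss_history[i] - miss_history[i - k]   # slide the window
        if window > m:
            return False
    return True

# A task allowed to miss at most 1 deadline in any 4 consecutive invocations:
print(satisfies_window_constraint([0, 1, 0, 0, 0, 1, 0, 0], m=1, k=4))  # True
print(satisfies_window_constraint([0, 1, 0, 1, 0, 0, 0, 0], m=1, k=4))  # False
```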


Journal ArticleDOI
TL;DR: The concept of frames, along with the mechanisms and strategies outlined in this paper, will play an important role in future processor architecture and the success of these optimizations demonstrates the significance of the rePLay Framework.
Abstract: In this paper, we propose a new processor framework that supports dynamic optimization. The rePLay Framework embeds an optimization engine atop a high-performance execution engine. The heart of the rePLay Framework is the concept of a frame. Frames are large, single-entry, single-exit optimization regions spanning many basic blocks in the program's dynamic instruction stream, yet containing only a single flow of control. This atomic property of frames increases the flexibility in applying optimizations. To support frames, rePLay includes a hardware-based recovery mechanism that rolls back the architectural state to the beginning of a frame if, for example, an early exit condition is detected. This mechanism permits the optimizer to make speculative, aggressive optimizations upon frames. In this paper, we investigate some of the underlying phenomena that support rePLay. Primarily, we evaluate rePLay's region formation strategy. A rePLay configuration with a 256-entry frame cache, using a 74 KB frame constructor and frame sequencer, achieves an average frame size of 88 Alpha AXP instructions with 68 percent coverage of the dynamic instruction stream, an average frame completion rate of 92.81 percent, and a frame predictor accuracy of 81.26 percent. These results soundly demonstrate that the frames upon which the optimizations are performed are large and stable. Using the most frequently initiated frames from rePLay executions as samples, we also highlight possible strategies for the rePLay optimization engine. Coupled with the high coverage of frames achieved through the dynamic frame construction, the success of these optimizations demonstrates the significance of the rePLay Framework. We believe that the concept of frames, along with the mechanisms and strategies outlined in this paper, will play an important role in future processor architecture.

204 citations


Journal ArticleDOI
TL;DR: This contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs) that perform modular exponentiation with very long integers, at the heart of many practical public-key algorithms such as RSA and discrete logarithm schemes.
Abstract: It is widely recognized that security issues will play a crucial role in the majority of future computer and communication systems. Central tools for achieving system security are cryptographic algorithms. This contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs). The proposed architectures perform modular exponentiation with very long integers. This operation is at the heart of many practical public-key algorithms such as RSA and discrete logarithm schemes. We combine a high-radix Montgomery modular multiplication algorithm with a new systolic array design. The designs are flexible, allowing any choice of operand and modulus. The new architecture also allows the use of high radices. Unlike previous approaches, we systematically implement and compare several variants of our new architecture for different bit lengths. We provide absolute area and timing measures for each architecture. The results allow conclusions about the feasibility and time-space trade-offs of our architecture for implementation on commercially available FPGAs. We found that 1,024-bit RSA decryption can be done in 3.1 ms with our fastest architecture.

196 citations
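
The core operation these architectures accelerate is modular exponentiation built from Montgomery modular multiplication. The radix-2 software sketch below shows the arithmetic being computed, with each loop iteration corresponding roughly to what a hardware stage processes; the paper's high-radix systolic FPGA designs are not reproduced here, and the function names are mine.

```python
def mont_mul(a, b, n, r_bits):
    """Radix-2 Montgomery product a*b*2^(-r_bits) mod n (n odd, a, b < n)."""
    s = 0
    for i in range(r_bits):
        if (a >> i) & 1:
            s += b
        if s & 1:
            s += n            # make s even so the halving is exact modulo n
        s >>= 1
    return s - n if s >= n else s

def mont_exp(base, exponent, n):
    """Left-to-right square-and-multiply in the Montgomery domain."""
    r_bits = n.bit_length()
    r = 1 << r_bits
    base_m = (base * r) % n        # map the operand into the Montgomery domain
    acc = r % n                    # Montgomery representation of 1
    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc, n, r_bits)
        if bit == '1':
            acc = mont_mul(acc, base_m, n, r_bits)
    return mont_mul(acc, 1, n, r_bits)   # map the result back out of the domain

# RSA-style check with toy numbers (a real design targets 1,024-bit operands):
n, e, d = 3233, 17, 2753            # toy RSA modulus 61*53
msg = 65
cipher = mont_exp(msg, e, n)
print(cipher, mont_exp(cipher, d, n))   # second value recovers 65
```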


Journal ArticleDOI
TL;DR: This paper presents a new parallel multiplier for the Galois field GF(2^m) whose elements are represented using the optimal normal basis of type II, and the time complexities of the proposed and the Massey-Omura multipliers are similar.
Abstract: This paper presents a new parallel multiplier for the Galois field GF(2^m) whose elements are represented using the optimal normal basis of type II. The proposed multiplier requires 1.5(m^2 - m) XOR gates, as compared to 2(m^2 - m) XOR gates required by the Massey-Omura multiplier. The time complexities of the proposed and the Massey-Omura multipliers are similar.

164 citations


Journal ArticleDOI
TL;DR: This work attempts to bring the power issue to the earliest phases of microprocessor development, in particular, the stage of defining a chip microarchitecture, by investigating power-optimization techniques of superscalar microprocessors at the microarchitecture level that do not compromise performance.
Abstract: In recent years, reducing power has become an important design goal for high-performance microprocessors. This work attempts to bring the power issue to the earliest phases of microprocessor development, in particular, the stage of defining a chip microarchitecture. We investigate power-optimization techniques of superscalar microprocessors at the microarchitecture level that do not compromise performance. First, major targets for power reduction are identified within microarchitecture, where power is heavily consumed or will be heavily consumed in next-generation superscalar processors. Then, a new, energy-efficient version of a multicluster microarchitecture is developed that reduces energy at the identified critical design points with minimal performance impact. A methodology is developed for energy-performance optimization at the microarchitecture level that generates, for a microarchitecture, a set of energy-efficient configurations, forming a convex hull in the power-performance space. Detailed simulation of the baseline and proposed multicluster architectures has been performed using the developed optimization methodology. A comparison of the two microarchitectures, both optimized for energy efficiency, shows that the multicluster architecture is potentially up to twice as energy efficient for wide issue processors, with an advantage that grows with the issue width. Conversely, at the same power dissipation level, the multicluster architecture supports configurations with measurably higher performance than equivalent conventional designs.

161 citations


Journal ArticleDOI
TL;DR: Different design trade-offs in the DAISY system and their impact on final system performance are reported, and the results show high degrees of instruction parallelism with reasonable translation overhead and memory usage.
Abstract: We describe a VLIW architecture designed specifically as a target for dynamic compilation of an existing instruction set architecture. This design approach offers the simplicity and high performance of statically scheduled architectures, achieves compatibility with an established architecture, and makes use of dynamic adaptation. Thus, the original architecture is implemented using dynamic compilation, a process we refer to as DAISY (Dynamically Architected Instruction Set from Yorktown). The dynamic compiler exploits runtime profile information to optimize translations so as to extract instruction level parallelism. This paper reports different design trade-offs in the DAISY system and their impact on final system performance. The results show high degrees of instruction parallelism with reasonable translation overhead and memory usage.

152 citations


Journal ArticleDOI
TL;DR: This paper presents an efficient method for implementing the Discrete Cosine Transform (DCT) with distributed arithmetic that uses the recursive DCT algorithm and requires less area than the conventional approaches, regardless of the memory reduction techniques employed in the ROM Accumulators.
Abstract: This paper presents an efficient method for implementing the Discrete Cosine Transform (DCT) with distributed arithmetic. While conventional approaches use the original DCT algorithm or the even-odd frequency decomposition of the DCT algorithm, the proposed architecture uses the recursive DCT algorithm and requires less area than the conventional approaches, regardless of the memory reduction techniques employed in the ROM Accumulators (RACs). An efficient architecture for implementing the scaled DCT with distributed arithmetic is also proposed. The new architecture requires even less area while keeping the same structural regularity for an easy VLSI implementation. A comparison of synthesized DCT processors shows that the proposed method reduces the hardware area of regular and scaled DCT processors by 17 percent and 23 percent, respectively, relative to a conventional design. With the row-column decomposition method, the proposed architectures can be easily extended to compute the two-dimensional DCT required in many image compression applications such as HDTV.

150 citations
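
Distributed arithmetic replaces the multipliers in an inner product y = sum(c_k * x_k) with a lookup table indexed by one bit from each input, accumulated with shifts over the input word length; this is the mechanism inside the ROM Accumulators (RACs) mentioned above. The sketch below illustrates the bit-serial principle for a generic coefficient vector; it does not reproduce the paper's recursive or scaled DCT architectures.

```python
def distributed_arithmetic_dot(coeffs, xs, width=8):
    """Bit-serial inner product sum(c*x) using a table of partial sums,
    with xs given as unsigned `width`-bit integers (illustrative RAC model)."""
    n = len(coeffs)
    # Precompute the "ROM": for every combination of one bit per input,
    # store the sum of the coefficients whose corresponding bit is set.
    rom = [sum(c for bit, c in enumerate(coeffs) if (addr >> bit) & 1)
           for addr in range(1 << n)]
    acc = 0
    for b in range(width - 1, -1, -1):          # MSB first, shift-accumulate
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> b) & 1) << i
        acc = (acc << 1) + rom[addr]
    return acc

coeffs = [3, -1, 4, 2]
xs = [10, 7, 255, 0]
print(distributed_arithmetic_dot(coeffs, xs),
      sum(c * x for c, x in zip(coeffs, xs)))   # both print 1043
```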


Journal ArticleDOI
TL;DR: This study deals with model-based dependability transient analysis of phased mission systems and proposes a modeling methodology that exploits the power of the class of Markov regenerative stochastic Petri net models to attack the weak points of the state-of-the-art.
Abstract: This study deals with model-based dependability transient analysis of phased mission systems. A review of the studies in the literature showed that several aspects of multiphased systems pose challenging problems to the dependability evaluation methods and tools. To attack the weak points of the state-of-the-art we propose a modeling methodology that exploits the power of the class of Markov regenerative stochastic Petri net models. By exploiting the techniques available in the literature for the analysis of the Markov Regenerative Processes, we obtain an analytical solution technique with a low computational complexity, basically dominated by the cost of the separate analysis of the system inside each phase. Last, the existence of analytical solutions allows us to derive the sensitivity functions of the dependability measures, thus providing the dependability engineer with additional means for the study of phased mission systems.

146 citations


Journal ArticleDOI
TL;DR: This paper addresses the reward-based scheduling problem for periodic tasks and proves the optimality of Rate Monotonic Scheduling (with harmonic periods), Earliest Deadline First, and Least Laxity First policies for the case of uniprocessors when used with the optimal service times.
Abstract: Reward-based scheduling refers to the problem in which there is a reward associated with the execution of a task. In our framework, each real-time task comprises a mandatory and an optional part. The mandatory part must complete before the task's deadline, while a nondecreasing reward function is associated with the execution of the optional part, which can be interrupted at any time. Imprecise computation and Increased-Reward-with-Increased-Service models fall within the scope of this framework. In this paper, we address the reward-based scheduling problem for periodic tasks. An optimal schedule is one where mandatory parts complete in a timely manner and the weighted average reward is maximized. For linear and concave reward functions, which are most common, we 1) show the existence of an optimal schedule where the optional service time of a task is constant at every instance and 2) show how to efficiently compute this service time. We also prove the optimality of Rate Monotonic Scheduling (with harmonic periods), Earliest Deadline First, and Least Laxity First policies for the case of uniprocessors when used with the optimal service times we computed. Moreover, we extend our result by showing that any policy which can fully utilize all the processors is also optimal for the multiprocessor periodic reward-based scheduling. To show that our optimal solution is pushing the limits of reward-based scheduling, we further prove that, when the reward functions are convex, the problem becomes NP-Hard. Our static optimal solution, besides providing considerable reward improvements over the previous suboptimal strategies, also has a major practical benefit. Run-time overhead is eliminated and existing scheduling disciplines may be used without modification with the computed optimal service times.
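
The structural result above, that for concave reward functions an optimal schedule gives each task a constant optional service time at every instance, can be sanity-checked numerically: by concavity (Jensen's inequality), spreading a fixed optional budget evenly across invocations never yields less total reward than an uneven split. A tiny check with an example concave reward function of my own choosing:

```python
import math, random

def reward(t):
    # An example nondecreasing concave reward function (my choice, not the paper's).
    return math.log1p(t)

random.seed(1)
budget, invocations = 10.0, 5
even = [budget / invocations] * invocations
raw = [random.random() for _ in range(invocations)]
uneven = [budget * r / sum(raw) for r in raw]      # same total optional budget

# Even allocation never loses to an uneven one for a concave reward function.
print(sum(reward(t) for t in even) >= sum(reward(t) for t in uneven))  # True
```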

Journal ArticleDOI
TL;DR: Two low-complexity bit-parallel systolic multipliers are presented based on the algorithm proposed, which can be applied in computing multiplications over the class of fields GF(2^m) in which the elements are represented with the root of an irreducible equally spaced polynomial.
Abstract: Two operations, the cyclic shifting and the inner product, are defined by the properties of irreducible all one polynomials. An effective algorithm is proposed for computing multiplications over a class of fields GF(2^m) using the two operations. Then, two low-complexity bit-parallel systolic multipliers are presented based on the algorithm. The first multiplier is composed of (m+1)^2 identical cells, each consisting of one 2-input AND gate, one 2-input XOR gate, and three 1-bit latches. The other multiplier comprises (m+1)^2 identical cells and m XOR gates. Each cell consists of one 2-input AND gate, one 2-input XOR gate, and four 1-bit latches. Each multiplier exhibits very low latency and propagation delay and is thus very fast. Moreover, the architectures of the two multipliers can be applied in computing multiplications over the class of fields GF(2^m) in which the elements are represented with the root of an irreducible equally spaced polynomial of degree m.

Journal ArticleDOI
TL;DR: By means of the calculus of variations, an explicit formula is derived that links the optimal checkpointing frequency with a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery.
Abstract: Checkpointing is an effective fault-tolerant technique for improving system availability and reliability. However, a blind checkpointing placement can result in either performance degradation or expensive recovery cost. By means of the calculus of variations, we derive an explicit formula that links the optimal checkpointing frequency with a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery. The theoretical result shows that the optimal checkpointing frequency is proportional to the square root of the failure rate and can be uniquely determined by the failure rate (time-varying or constant) if the recovery function is strictly increasing and the failure rate satisfies λ(∞) > 0. J.L. Bruno and E.G. Coffman (1997) suggest that optimal checkpointing by its nature is a function of system failure rate, i.e., the time-varying failure rate demands time-varying checkpointing in order to meet the criteria of certain optimality. The results obtained in this paper agree with their viewpoint.
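
To illustrate the square-root relationship stated above: for a constant failure rate lambda and checkpoint overhead C, a first-order approximation of the optimal checkpoint interval is sqrt(2C/lambda), and for a time-varying rate the checkpointing density tracks sqrt(lambda(t)). The numeric sketch below is my own illustration of that dependence, not the paper's variational derivation; the constant of proportionality is assumed.

```python
import math

def checkpoint_times(failure_rate, horizon, overhead, dt=0.001):
    """Place checkpoints so the local interval tracks sqrt(2*overhead/rate(t)).
    failure_rate: callable t -> lambda(t) > 0.  Purely illustrative."""
    times, t, last = [], 0.0, 0.0
    while t < horizon:
        target = math.sqrt(2.0 * overhead / failure_rate(t))
        if t - last >= target:
            times.append(round(t, 3))
            last = t
        t += dt
    return times

# Constant rate: evenly spaced checkpoints roughly sqrt(2*C/lambda) ~ 1.41 apart.
print(checkpoint_times(lambda t: 0.1, horizon=10, overhead=0.1)[:4])
# Increasing rate: checkpoints become denser as the failure rate grows.
print(checkpoint_times(lambda t: 0.05 * (1 + t), horizon=10, overhead=0.1)[:6])
```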

Journal ArticleDOI
TL;DR: An in-depth investigation of software and hardware techniques to take advantage of the DRAM mode control capabilities at a module granularity for energy savings using a memory system architecture capturing five different energy modes and corresponding resynchronization times.
Abstract: The anticipated explosive growth of pervasive and mobile computing devices that are typically constrained by energy has brought hardware and software techniques for energy conservation into the spotlight. While there have been several studies and proposals for energy conservation for CPUs and peripherals, energy optimization techniques for selective operating mode control of DRAMs have not been fully explored. It has been shown that, for some systems, as much as 90 percent of overall system energy (excluding I/O) is consumed by the DRAM modules, thus, they serve as a good candidate for energy optimizations. Further, DRAM technology has also matured to provide several low energy operating modes (power modes), making it an opportune moment to conduct studies exploring the potential benefits of mode control techniques. This paper conducts an in-depth investigation of software and hardware techniques to take advantage of the DRAM mode control capabilities at a module granularity for energy savings. Using a memory system architecture capturing five different energy modes and corresponding resynchronization times, this paper presents several novel compilation techniques to both cluster the data across memory banks as well as to detect module idleness and perform energy mode transitions. In addition, hardware-assisted approaches (called self-monitoring) based on predictions of module interaccess times are proposed. These techniques are extensively evaluated using a set of a dozen benchmarks. It is shown that we get an average of 61 percent savings in DRAM energy using compiler-directed mode control. One of the self-monitored approaches gives as much as 89 percent savings (72 percent on the average), coming as close as 8.8 percent to the optimal energy savings that one can expect with DRAM module mode control. The optimization techniques are demonstrated to be invaluable for energy savings as memory technologies continue to evolve.
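
A simplified model of the hardware-assisted (self-monitoring) idea described above: a per-module controller watches idle time and steps the module into progressively lower-power modes after threshold periods, paying a resynchronization penalty on the next access. The mode names, power ratios, and thresholds below are placeholders of my own for illustration, not the paper's measured values.

```python
# Illustrative per-module mode controller.  Modes, power ratios, and
# resynchronization costs are made-up placeholders, not the paper's numbers.
MODES = [                        # (name, relative power, resync cost in cycles)
    ("active",    1.00,    0),
    ("standby",   0.50,    2),
    ("napping",   0.10,   30),
    ("powerdown", 0.02, 9000),
]
THRESHOLDS = [0, 10, 100, 10000]   # idle cycles before entering each mode

def simulate(access_times, end_time):
    mode, idle, energy, stall = 0, 0, 0.0, 0
    accesses = set(access_times)
    for t in range(end_time):
        if t in accesses:
            stall += MODES[mode][2]   # pay the resynchronization cost on wake-up
            mode, idle = 0, 0
        else:
            idle += 1
            while mode + 1 < len(MODES) and idle >= THRESHOLDS[mode + 1]:
                mode += 1             # step down after enough idleness
        energy += MODES[mode][1]
    return energy, stall

print(simulate(access_times=[5, 6, 7, 500, 501], end_time=2000))
```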

Journal ArticleDOI
TL;DR: The design of the Impulse architecture and how an Impulse memory system can be used in a variety of ways to improve the performance of memory-bound applications are described and the effectiveness of these optimizations are demonstrated.
Abstract: Impulse is a memory system architecture that adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to remap their data structures in memory. As a result, they can control how their data is accessed and cached, which can improve cache and bus utilization. The Impulse design does not require any modification to processor, cache, or bus designs since all the functionality resides at the memory controller. As a result, Impulse can be adopted in conventional systems without major system changes. We describe the design of the Impulse architecture and how an Impulse memory system can be used in a variety of ways to improve the performance of memory-bound applications. Impulse can be used to dynamically create superpages cheaply, to dynamically recolor physical pages, to perform strided fetches, and to perform gathers and scatters through indirection vectors. Our performance results demonstrate the effectiveness of these optimizations in a variety of scenarios. Using Impulse can speed up a range of applications from 20 percent to over a factor of 5. Alternatively, Impulse can be used by the OS for dynamic superpage creation; the best policy for creating superpages using Impulse outperforms previously known superpage creation policies.
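
The gather-through-an-indirection-vector use of Impulse can be modeled as follows: the application configures a shadow region so that element i of the shadow maps to element index_vector[i] of the underlying array, and the processor then sees a dense stream. The Python model below only shows the remapping semantics; the class name and interface are my invention, and the real mechanism lives in the memory controller and the OS.

```python
class ShadowGather:
    """Toy model of an Impulse-style shadow region: reading shadow[i]
    returns data[index_vector[i]], so sparse accesses look dense."""
    def __init__(self, data, index_vector):
        self.data = data
        self.index_vector = index_vector

    def __getitem__(self, i):
        return self.data[self.index_vector[i]]

    def __len__(self):
        return len(self.index_vector)

# Sparse access pattern: only the touched elements are pulled through the
# "remapped" region, which is what improves cache-line and bus utilization.
x = list(range(1000))
cols = [3, 17, 256, 511, 998]        # indices a sparse computation actually needs
gathered = ShadowGather(x, cols)
print([gathered[i] for i in range(len(gathered))])   # [3, 17, 256, 511, 998]
```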

Journal ArticleDOI
TL;DR: In this paper, the authors study the load balancing problem for dense linear algebra kernels on heterogeneous networks of workstations and propose a data allocation heuristic to balance the load on heterogeneous platforms with respect to the performance of processors.
Abstract: The authors study the implementation of dense linear algebra kernels, such as matrix multiplication or linear system solvers, on heterogeneous networks of workstations. The uniform block-cyclic data distribution scheme commonly used for homogeneous collections of processors limits the performance of these linear algebra kernels on heterogeneous grids to the speed of the slowest processor. We present and study more sophisticated data allocation strategies that balance the load on heterogeneous platforms with respect to the performance of the processors. When targeting unidimensional grids, the load-balancing problem can be solved rather easily. When targeting two-dimensional grids, which are the key to scalability and efficiency for numerical kernels, the problem turns out to be surprisingly difficult. We formally state the 2D load-balancing problem and prove its NP-completeness. Next, we introduce a data allocation heuristic, which turns out to be very satisfactory: Its practical usefulness is demonstrated by MPI experiments conducted with a heterogeneous network of workstations.
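
For the unidimensional case that the abstract describes as solvable rather easily, a simple incremental allocation balances the load: repeatedly give the next block column to the processor that would finish its enlarged share soonest. The sketch below shows that greedy 1D allocation; the hard 2D problem and the paper's heuristic are not reproduced here.

```python
import heapq

def allocate_1d(num_columns, speeds):
    """Distribute block columns over processors with given relative speeds,
    keeping the maximum (columns / speed) small.  Greedy 1D sketch."""
    shares = [0] * len(speeds)
    # Heap keyed by the completion time each processor would have if it
    # received one more column.
    heap = [(1.0 / s, i) for i, s in enumerate(speeds)]
    heapq.heapify(heap)
    for _ in range(num_columns):
        _, i = heapq.heappop(heap)
        shares[i] += 1
        heapq.heappush(heap, ((shares[i] + 1) / speeds[i], i))
    return shares

# Three workstations, one twice as fast and one three times as fast as the first:
print(allocate_1d(num_columns=12, speeds=[1, 2, 3]))   # [2, 4, 6], all finish together
```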

Journal ArticleDOI
TL;DR: A systematic design of a modified Mastrovito multiplier, suitable for GF(2^m) generated by high-Hamming-weight irreducible polynomials, is proposed; the design effectively exploits the spatial correlation of elements in the Mastrovito product matrix to reduce the complexity.
Abstract: This paper considers the design of bit-parallel dedicated finite field multipliers using standard basis. An explicit algorithm is proposed for efficient construction of Mastrovito product matrix, based on which we present a systematic design of Mastrovito multiplier applicable to GF(2^m) generated by an arbitrary irreducible polynomial. This design effectively exploits the spatial correlation of elements in Mastrovito product matrix to reduce the complexity. Using a similar methodology, we propose a systematic design of modified Mastrovito multiplier, which is suitable for GF(2^m) generated by high-Hamming weight irreducible polynomials. For both original and modified Mastrovito multipliers, the developed multiplier architectures are highly modular, which is desirable for VLSI hardware implementation. Applying the proposed algorithm and design approach, we study the Mastrovito multipliers for several special irreducible polynomials, such as trinomial and equally-spaced-polynomial, and the obtained complexity results match the best known results. Moreover, we have discovered several new special irreducible polynomials which also lead to low-complexity Mastrovito multipliers.
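
The object at the center of this design is the Mastrovito product matrix: for a fixed field element a, the product c = a*b in GF(2^m) becomes a binary matrix-vector product c = M(a)*b over GF(2), where column j of M holds the coefficients of a*x^j mod f(x). The code below is a plain software rendering of that definition as a correctness reference; the paper's complexity-reducing exploitation of the matrix structure is not shown.

```python
def mastrovito_matrix(a_bits, f_bits, m):
    """Column j of M is the coefficient vector of a(x) * x^j mod f(x),
    so that c = M(a) . b over GF(2) equals a*b in GF(2^m)."""
    def xtime(poly):                        # multiply by x and reduce modulo f
        poly <<= 1
        if (poly >> m) & 1:
            poly ^= f_bits
        return poly & ((1 << m) - 1)

    cols, cur = [], a_bits & ((1 << m) - 1)
    for _ in range(m):
        cols.append(cur)
        cur = xtime(cur)
    # M[i][j] = coefficient i of a*x^j mod f
    return [[(cols[j] >> i) & 1 for j in range(m)] for i in range(m)]

def multiply(a_bits, b_bits, f_bits, m):
    M = mastrovito_matrix(a_bits, f_bits, m)
    c = 0
    for i in range(m):
        bit = 0
        for j in range(m):
            bit ^= M[i][j] & (b_bits >> j)   # AND then XOR-accumulate over GF(2)
        c |= (bit & 1) << i
    return c

# GF(2^4) with the trinomial f(x) = x^4 + x + 1 (a common small example):
f, m = 0b10011, 4
print(bin(multiply(0b0110, 0b0011, f, m)))   # (x^2+x)(x+1) = x^3 + x -> 0b1010
```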

Journal ArticleDOI
TL;DR: An investigation has been made into the use of stochastic arithmetic to implement an artificial neural network solution to a typical pattern recognition application, with results indicating an order of magnitude improvement over the floating-point implementation assuming clock frequency parity.
Abstract: For Part I, see ibid., pp. 891-905. An investigation has been made into the use of stochastic arithmetic to implement an artificial neural network solution to a typical pattern recognition application. Optical character recognition is performed on very noisy characters in the E-13B MICR font. The artificial neural network is composed of two layers, the first layer being a set of soft competitive learning subnetworks and the second a set of fully connected linear output neurons. The observed number of clock cycles in the stochastic case represents an order of magnitude improvement over the floating-point implementation assuming clock frequency parity. Network generalization capabilities were also compared based on the network squared error as a function of the amount of noise added to the input patterns. The stochastic network maintains a squared error within 10 percent of that of the floating-point implementation for a wide range of noise levels.

Journal ArticleDOI
TL;DR: This paper first describes the Swing Modulo Scheduling technique and evaluates it against other heuristic methods for the Perfect Club benchmark suite on a generic VLIW architecture, showing that it outperforms them in terms of the quality of the obtained schedules and compilation time.
Abstract: This paper presents a novel software pipelining approach, which is called Swing Modulo Scheduling (SMS). It generates schedules that are near optimal in terms of initiation interval, register requirements, and stage count. Swing Modulo Scheduling is a heuristic approach that has a low computational cost. This paper first describes the technique and evaluates it for the Perfect Club benchmark suite on a generic VLIW architecture. SMS is compared with other heuristic methods, showing that it outperforms them in terms of the quality of the obtained schedules and compilation time. To further explore the effectiveness of SMS, the experience of incorporating it into a production quality compiler for the Equator MAP1000 processor is described; implementation issues are discussed, as well as modifications and improvements to the original algorithm. Finally, experimental results from using a set of industrial multimedia applications are presented.

Journal ArticleDOI
TL;DR: In this article, power analysis attacks are applied to cryptosystems that use scalar multiplication on Koblitz curves and a number of countermeasures against simple and differential power analysis attack are suggested.
Abstract: Because of their shorter key sizes, cryptosystems based on elliptic curves are being increasingly used in practical applications. A special class of elliptic curves, namely, Koblitz curves, offers an additional, but crucial advantage of considerably reduced processing time. Power analysis attacks are applied to cryptosystems that use scalar multiplication on Koblitz curves. Both the simple and the differential power analysis attacks are considered and a number of countermeasures are suggested. While the proposed countermeasures against the simple power analysis attacks rely on making the power consumption for the elliptic curve scalar multiplication independent of the secret key, those for the differential power analysis attacks depend on randomizing the secret key prior to each execution of the scalar multiplication. These countermeasures are computationally efficient and suitable for hardware implementation.
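
One of the differential power analysis countermeasures mentioned, randomizing the secret scalar before each scalar multiplication, relies on the identity (k + r*n)*P = k*P, where n is the order of the base point. The toy example below demonstrates that identity with textbook double-and-add on a tiny prime-field curve; a Koblitz curve over GF(2^m) would be used in practice, and the curve, point formulas, and names here are mine, chosen only to make the identity checkable.

```python
# Toy prime-field curve y^2 = x^3 + a*x + b over F_p; None is the point at infinity.
import random

p, a, b = 17, 2, 2
P = (5, 1)                                   # on the curve: 1 = 125 + 10 + 2 mod 17

def add(Q, R):
    if Q is None: return R
    if R is None: return Q
    (x1, y1), (x2, y2) = Q, R
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                          # Q + (-Q) = infinity
    if Q == R:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def scalar_mult(k, Q):
    result = None
    while k:
        if k & 1:
            result = add(result, Q)
        Q = add(Q, Q)
        k >>= 1
    return result

# Order of P: smallest n > 0 with n*P = infinity (computed, not assumed).
n, T = 1, P
while T is not None:
    T = add(T, P)
    n += 1

k = 7                                        # the "secret" scalar
r = random.randrange(1, 1000)
k_random = k + r * n                         # countermeasure: fresh k' per execution
print(scalar_mult(k, P) == scalar_mult(k_random, P))   # True: same point, different trace
```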

Journal ArticleDOI
TL;DR: The architectural issues explored in this study show that, when Java applications are executed with a JIT compiler, selective translation using good heuristics can improve performance, but the saving is only 10-15 percent at best; the study also offers insights and architectural proposals for designing an efficient Java runtime system.
Abstract: The Java Virtual Machine (JVM) is the cornerstone of Java technology and its efficiency in executing the portable Java bytecodes is crucial for the success of this technology. Interpretation, Just-in-Time (JIT) compilation, and hardware realization are well-known solutions for a JVM and previous research has proposed optimizations for each of these techniques. However, each technique has its pros and cons and may not be uniformly attractive for all hardware platforms. Instead, an understanding of the architectural implications of JVM implementations with real applications can be crucial to the development of enabling technologies for efficient Java runtime system development on a wide range of platforms. Toward this goal, this paper examines architectural issues from both the hardware and JVM implementation perspectives. The paper starts by identifying the important execution characteristics of Java applications from a bytecode perspective. It then explores the potential of a smart JIT compiler strategy that can dynamically interpret or compile based on associated costs and investigates the CPU and cache architectural support that would benefit JVM implementations. We also study the available parallelism during the different execution modes using applications from the SPECjvm98 benchmarks. At the bytecode level, it is observed that less than 5 out of the 256 bytecodes constitute 90 percent of the dynamic bytecode stream. Method sizes fall into a trinodal distribution with peaks at 1, 9, and 26 bytecodes across all benchmarks. The architectural issues explored in this study show that, when Java applications are executed with a JIT compiler, selective translation using good heuristics can improve performance, but the saving is only 10-15 percent at best. The instruction and data cache performance of Java applications are seen to be better than that of C/C++ applications except in the case of data cache performance in the JIT mode. Write misses resulting from installation of JIT compiler output dominate the misses and deteriorate the data cache performance in JIT mode. A study on the available parallelism shows that Java programs executed using JIT compilers have parallelism comparable to C/C++ programs for small window sizes, but falls behind when the window size is increased. Java programs executed using the interpreter have very little parallelism due to the stack nature of the JVM instruction set, which is dominant in the interpreted execution mode. In addition, this work gives revealing insights and architectural proposals for designing an efficient Java runtime system.

Journal ArticleDOI
TL;DR: The results show that the SDF architecture can outperform the superscalar, scales better with the number of functional units, and allows for a good exploitation of Thread Level Parallelism (TLP) and available chip area.
Abstract: In this paper, the scheduled dataflow (SDF) architecture-a decoupled memory/execution, multithreaded architecture using nonblocking threads-is presented in detail and evaluated against superscalar architecture. Recent focus in the field of new processor architectures is mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend allows for better performance, but at the expense of increased hardware complexity and, possibly, higher power expenditures resulting from dynamic instruction scheduling. Our research deviates from this trend by exploring a simpler, yet powerful execution paradigm that is based on dataflow and multithreading. A program is partitioned into nonblocking execution threads. In addition, all memory accesses are decoupled from the thread's execution. Data is preloaded into the thread's context (registers) and all results are poststored after the completion of the thread's execution. While multithreading and decoupling are possible with control-flow architectures, SDF makes it easier to coordinate the memory accesses and execution of a thread, as well as eliminate unnecessary dependencies among instructions. We have compared the execution cycles required for programs on SDF with the execution cycles required by programs on SimpleScalar (a superscalar simulator) by considering the essential aspects of these architectures in order to have a fair comparison. The results show that SDF architecture can outperform the superscalar. SDF performance scales better with the number of functional units and allows for a good exploitation of Thread Level Parallelism (TLP) and available chip area.

Journal ArticleDOI
TL;DR: This work presents a compiler technique, which is based on Shasha and Snir's delay set analysis, to hide the underlying relaxed memory consistency model and to guarantee sequential consistency, and introduces dominators with respect to a node in a control flow graph to identify memory-barrier nodes.
Abstract: We present a compiler technique, which is based on Shasha and Snir's delay set analysis, to hide the underlying relaxed memory consistency model for an optimizing compiler for explicitly parallel programs. The compiler presents programmers with a sequentially consistent view of the underlying machine, irrespective of whether it follows a sequentially consistent model or a relaxed model. To hide the underlying relaxed memory consistency model and to guarantee sequential consistency, our algorithm inserts fence instructions by identifying memory-barrier nodes. We reduce the number of fence instructions by exploiting the ordering constraints of the underlying memory consistency model and the property of fence and synchronization operations. We introduce dominators with respect to a node in a control flow graph to identify memory-barrier nodes and show that minimizing the number of memory-barrier nodes is NP-hard.

Journal ArticleDOI
TL;DR: A batching policy is proposed that schedules the video with the maximum factored queue length and is referred to as MFQL, which yields excellent empirical results in terms of standard performance measures such as average latency time, defection rates, and fairness.
Abstract: In a video-on-demand environment, batching of video requests is often used to reduce I/O demand and improve throughput. Since viewers may defect if they experience long waits, a good video scheduling policy needs to consider not only the batch size but also the viewer defection probabilities and wait times. Two conventional scheduling policies for batching are the first-come-first-served (FCFS) policy, which schedules the video with the longest waiting request, and the maximum queue length (MQL) policy, which selects the video with the maximum number of waiting requests. Neither of these policies leads to entirely satisfactory results. MQL tends to be too aggressive in scheduling popular videos by considering only the queue length to maximize batch size, while FCFS has the opposite effect by completely ignoring the queue length and focusing on arrival time to reduce defection. In this paper, we introduce the notion of factored queue length and propose a batching policy that schedules the video with the maximum factored queue length. We refer to this as the MFQL policy. The factored queue length is obtained by weighting each video queue length with a factor which is biased against the more popular videos. An optimization problem is formulated to solve for the best weighting factors for the various videos. We also consider MFQL implementation issues. A simulation is developed to compare the proposed MFQL variants with FCFS and MQL. Our study shows that MFQL yields excellent empirical results in terms of standard performance measures such as average latency time, defection rates, and fairness.
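
A minimal rendering of the selection rule: each waiting queue's length is multiplied by a per-video factor biased against popular titles, and the video with the largest factored queue length is scheduled when a stream becomes free. The weighting used below (inverse square root of popularity) is only a placeholder of mine; the paper derives the weighting factors from an optimization problem.

```python
import math

def pick_video_mfql(queue_lengths, popularity):
    """Return the video with the maximum factored queue length.  The
    1/sqrt(popularity) factor is an illustrative choice, not the paper's
    optimized weights."""
    best, best_score = None, -1.0
    for video, qlen in queue_lengths.items():
        factor = 1.0 / math.sqrt(popularity[video])
        score = factor * qlen
        if score > best_score:
            best, best_score = video, score
    return best

queues = {"A": 6, "B": 5, "C": 2}               # waiting requests per video
popularity = {"A": 0.70, "B": 0.20, "C": 0.10}  # request probabilities
# Plain MQL would pick "A" (longest queue); the factored queue length picks "B".
print(pick_video_mfql(queues, popularity))
```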

Journal ArticleDOI
TL;DR: This paper considers how to effectively use a bank-exposed memory system comprised of small, decentralized cache banks for sequential programs and demonstrates that using bank disambiguation improves performance, by a factor of 3 to 5 over using ILP alone.
Abstract: Technological trends require that future scalable microprocessors be decentralized. Applying these trends toward memory systems shows that the size of the cache accessible in a single cycle will decrease in a future generation of chips. Thus, a bank-exposed memory system comprised of small, decentralized cache banks must eventually replace that of a monolithic cache. This paper considers how to effectively use such a memory system for sequential programs. This paper presents Maps, the software technology central to bank-exposed architectures, which are architectures with bank-exposed memory systems. Maps solves the problem of bank disambiguation-that of determining at compile-time which bank a memory reference is accessing. Bank disambiguation is important because it enables the compile-time optimization for data locality, where data can be placed close to the computation that requires it. Two methods for bank disambiguation are presented: equivalence-class unification and modulo unrolling. Experimental results are presented using a compiler for the MIT Raw machine, a bank-exposed architecture that relies on the compiler to 1) manage its memory and 2) orchestrate its instruction level parallelism and communication. Results on Raw using sequential codes demonstrate that using bank disambiguation improves performance, by a factor of 3 to 5 over using ILP alone.

Journal ArticleDOI
TL;DR: A fast algorithm for multiplicative inversion in GF(2^m) using normal basis is proposed, which is an improvement on those proposed by Itoh and Tsujii and by Chang et al., which are based on Fermat's theorem and require O(log m) multiplications.
Abstract: A fast algorithm for multiplicative inversion in GF(2^m) using normal basis is proposed. It is an improvement on those proposed by Itoh and Tsujii and by Chang et al., which are based on Fermat's theorem and require O(log m) multiplications. The number of multiplications is reduced by decomposing m-1 into several factors and a small remainder.
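
The Fermat-based starting point for these algorithms is the identity a^(-1) = a^(2^m - 2) in GF(2^m); the cited methods reduce the multiplication count through clever exponent decompositions (Itoh-Tsujii uses the binary form of m-1, the proposed method factors m-1). The sketch below only implements the plain Fermat identity with square-and-multiply over a polynomial basis, as a correctness reference rather than the optimized normal-basis algorithm.

```python
def gf_mul(x, y, f, m):
    """Carry-less multiplication of x and y, reduced modulo the irreducible f."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if (x >> m) & 1:
            x ^= f
    return r

def gf_inverse_fermat(a, f, m):
    """a^(2^m - 2) by square-and-multiply: the baseline that Itoh-Tsujii-style
    decompositions of m-1 improve upon (they need only O(log m) multiplies)."""
    result, base, exponent = 1, a, (1 << m) - 2
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base, f, m)
        base = gf_mul(base, base, f, m)
        exponent >>= 1
    return result

# GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1:
f, m = 0x11B, 8
a = 0x53
inv = gf_inverse_fermat(a, f, m)
print(hex(inv), hex(gf_mul(a, inv, f, m)))   # 0xca 0x1  (a * a^-1 = 1)
```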

Journal ArticleDOI
TL;DR: A simulation-based performance study of several of the new high-performance DRAM architectures, each evaluated in a small system organization, reveals several things, including that bus transmission speed will soon become a primary factor limiting memory-system performance.
Abstract: This paper presents a simulation-based performance study of several of the new high-performance DRAM architectures, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use only a handful of DRAM chips (~10, as opposed to ~1 or ~100). The study covers Fast Page Mode, Extended Data Out, Synchronous, Enhanced Synchronous, Double Data Rate, Synchronous Link, Rambus, and Direct Rambus designs. Our simulations reveal several things: 1) Current advanced DRAM technologies are attacking the memory bandwidth problem but not the latency problem; 2) bus transmission speed will soon become a primary factor limiting memory-system performance; 3) the post-L2 address stream still contains significant locality, though it varies from application to application; 4) systems without L2 caches are feasible for low- and medium-speed CPUs (1 GHz and below); and 5) as we move to wider buses, row access time becomes more prominent, making it important to investigate techniques to exploit the available locality to decrease access time.

Journal ArticleDOI
TL;DR: A range of information-lossless address and instruction trace compression schemes that can reduce both storage space and access time by an order of magnitude or more, without discarding either references or interreference timing information from the original trace are discussed.
Abstract: The tremendous storage space required for a useful data base of program traces has prompted a search for trace reduction techniques. In this paper, we discuss a range of information-lossless address and instruction trace compression schemes that can reduce both storage space and access time by an order of magnitude or more, without discarding either references or interreference timing information from the original trace. The PDATS family of trace compression techniques achieves trace coding densities of about six references per byte. This family of techniques is now in use as the standard in the NMSU TraceBase, an extensive trace archive that has been established for use by the international research and teaching community.
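
The central idea in this family of schemes is to record each reference as a small offset from the previous reference, using a record only as large as the offset needs, so no references or timing information are discarded. The encoder below illustrates lossless offset-based coding of an address stream; the actual PDATS record layout and its timing fields are not reproduced here.

```python
import struct

def encode_offsets(addresses):
    """Lossless delta coding: each address is stored as a signed offset from
    the previous one, in 1, 2, or 4 bytes (a simplified stand-in for PDATS)."""
    out, prev = bytearray(), 0
    for addr in addresses:
        delta = addr - prev
        prev = addr
        if -128 <= delta < 128:
            out += b"\x01" + struct.pack("<b", delta)
        elif -2**15 <= delta < 2**15:
            out += b"\x02" + struct.pack("<h", delta)
        else:
            out += b"\x04" + struct.pack("<i", delta)
    return bytes(out)

def decode_offsets(blob):
    addresses, prev, i = [], 0, 0
    fmt = {1: "<b", 2: "<h", 4: "<i"}
    while i < len(blob):
        size = blob[i]
        (delta,) = struct.unpack_from(fmt[size], blob, i + 1)
        prev += delta
        addresses.append(prev)
        i += 1 + size
    return addresses

trace = [0x40001000, 0x40001004, 0x40001008, 0x40002000, 0x40001010]
blob = encode_offsets(trace)
print(len(blob), decode_offsets(blob) == trace)   # 15 bytes vs. 20 raw, True
```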

Journal ArticleDOI
TL;DR: Algorithms are developed to multiply two vectors, a vector and a matrix, and two matrices on an OTIS-Mesh optoelectronic computer; the algorithm that multiplies a column vector by a row vector uses an optimal number of data moves for both the group row and group submesh mappings.
Abstract: We develop algorithms to multiply two vectors, a vector and a matrix, and two matrices on an OTIS-Mesh optoelectronic computer. Two mappings, group row and group submesh, of a matrix onto an OTIS-Mesh are considered and the relative merits of each compared. We show that our algorithms to multiply a column and row vector use an optimal number of data moves for both the group row and group submesh mappings, our algorithm to multiply a row vector and a column vector is optimal for the group row mapping, and our algorithm to multiply a matrix by a column vector is optimal for the group row mapping.

Journal ArticleDOI
TL;DR: This paper presents a symbolic approach for the analysis of bounded Petri nets and shows how large reachability sets can be generated, represented, and analyzed with moderate BDD sizes.
Abstract: This paper presents a symbolic approach for the analysis of bounded Petri nets. The structure and behavior of the Petri net is symbolically modeled by using Boolean functions, thus reducing reasoning about Petri nets to Boolean calculation. The set of reachable markings is calculated by symbolically firing the transitions in the Petri net. Highly concurrent systems suffer from the state explosion problem produced by an exponential increase of the number of reachable states. This state explosion is handled by using Binary Decision Diagrams (BDDs) which are capable of representing large sets of markings with small data structures. Petri nets have the ability to model a large variety of systems and the flexibility to describe causality, concurrency, and conditional relations. The manipulation of vast state spaces generated by Petri nets enables the efficient analysis of a wide range of problems, e.g., deadlock freeness, liveness, and concurrency. A number of examples are presented in order to show how large reachability sets can be generated, represented, and analyzed with moderate BDD sizes. By using this symbolic framework, properties requiring an exhaustive analysis of the reachability graph can be efficiently verified.
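
The computation at the core of this approach is a least fixed point: starting from the initial marking, repeatedly add every marking obtainable by firing an enabled transition until nothing new appears. The sketch below performs that fixed point for a safe (1-bounded) net with explicit Python sets standing in for the Boolean/BDD representation; the symbolic encoding itself is what keeps the real tool's memory use moderate, and it is not reproduced here.

```python
def enabled(marking, pre):
    # A transition of a safe (1-bounded) net is enabled if every input place holds a token.
    return all(marking[p] for p in pre)

def fire(marking, pre, post):
    new = list(marking)
    for p in pre:
        new[p] = 0
    for p in post:
        new[p] = 1
    return tuple(new)

def reachable(initial, transitions):
    """Least fixed point over the reachability set; sets stand in for BDDs."""
    seen, frontier = {initial}, {initial}
    while frontier:
        next_frontier = set()
        for m in frontier:
            for pre, post in transitions:
                if enabled(m, pre):
                    m2 = fire(m, pre, post)
                    if m2 not in seen:
                        seen.add(m2)
                        next_frontier.add(m2)
        frontier = next_frontier
    return seen

# A 4-place cycle: one token moves p0 -> p1 -> p2 -> p3 -> p0.
transitions = [([0], [1]), ([1], [2]), ([2], [3]), ([3], [0])]
print(sorted(reachable((1, 0, 0, 0), transitions)))   # 4 reachable markings
```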