
Showing papers in "IEEE Transactions on Computers" in 1999


Journal ArticleDOI
TL;DR: Experimental results obtained from a large number of benchmarks indicate that application of the proposed conflict analysis techniques to SAT algorithms can be extremely effective for a large number of representative classes of SAT instances.
Abstract: This paper introduces GRASP (Generic seaRch Algorithm for the Satisfiability Problem), a new search algorithm for Propositional Satisfiability (SAT). GRASP incorporates several search-pruning techniques that proved to be quite powerful on a wide variety of SAT problems. Some of these techniques are specific to SAT, whereas others are similar in spirit to approaches in other fields of Artificial Intelligence. GRASP is premised on the inevitability of conflicts during the search and its most distinguishing feature is the augmentation of basic backtracking search with a powerful conflict analysis procedure. Analyzing conflicts to determine their causes enables GRASP to backtrack nonchronologically to earlier levels in the search tree, potentially pruning large portions of the search space. In addition, by "recording" the causes of conflicts, GRASP can recognize and preempt the occurrence of similar conflicts later on in the search. Finally, straightforward bookkeeping of the causality chains leading up to conflicts allows GRASP to identify assignments that are necessary for a solution to be found. Experimental results obtained from a large number of benchmarks indicate that application of the proposed conflict analysis techniques to SAT algorithms can be extremely effective for a large number of representative classes of SAT instances.

1,482 citations
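
A rough sketch of the conflict-recording idea (not GRASP's actual implementation; the names and data structures below are invented for illustration): once the causality chain behind a conflict has been traced back to the responsible decision assignments, their negation is recorded as a learned clause and the search jumps back to the second-highest decision level involved.

    def analyze_conflict(implicated):
        """implicated: list of (literal, decision_level) pairs obtained by walking the
        causality chain from the falsified clause back to decision assignments.
        Returns the recorded clause and the nonchronological backtrack level."""
        learned_clause = [-lit for lit, _ in implicated]        # forbid this combination
        levels = sorted({lvl for _, lvl in implicated}, reverse=True)
        backtrack_level = levels[1] if len(levels) > 1 else 0   # skip intermediate levels
        return learned_clause, backtrack_level

For instance, if a conflict is traced back to decisions x3 (level 2), x7 (level 5), and x9 (level 9), the recorded clause is (¬x3 ∨ ¬x7 ∨ ¬x9) and the search can jump straight back to level 5, pruning levels 6 through 8.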


Journal ArticleDOI
TL;DR: Evaluation of this software-hardware approach shows that it is quite effective in achieving high performance when running sequential binaries; the architecture also includes a simple and efficient hardware mechanism to enable register-level communication between on-chip processors.
Abstract: Much emphasis is now being placed on chip-multiprocessor (CMP) architectures for exploiting thread-level parallelism in applications. In such architectures, speculation may be employed to execute applications that cannot be parallelized statically. In this paper, we present an efficient CMP architecture for the speculative execution of sequential binaries without source recompilation. We present software support that enables the identification of threads from a sequential binary. The hardware includes a memory disambiguation mechanism that enables the detection of interthread memory dependence violations during speculative execution. This hardware is different from past proposals in that it does not rely on a snoopy-based cache-coherence protocol. Instead, it uses an approach similar to a directory-based scheme. Furthermore, the architecture includes a simple and efficient hardware mechanism to enable register-level communication between on-chip processors. Evaluation of this software-hardware approach shows that it is quite effective in achieving high performance when running sequential binaries.

327 citations


Journal ArticleDOI
TL;DR: This paper describes an approach for bounding the worst-case and best-case performance of large code segments on machines that exploit both pipelining and instruction caching, and indicates that the timing analyzer efficiently produces tight predictions of worst-case and best-case performance for pipelining and instruction caching.
Abstract: Predicting the execution time of code segments in real-time systems is challenging. Most recently designed machines contain pipelines and caches. Pipeline hazards may result in multicycle delays. Instruction or data memory references may not be found in cache and these misses typically require several cycles to resolve. Whether an instruction will stall due to a pipeline hazard or a cache miss depends on the dynamic sequence of previous instructions executed and memory references performed. Furthermore, these penalties are not independent since delays due to pipeline stalls and cache miss penalties may overlap. This paper describes an approach for bounding the worst and best case performance of large code segments on machines that exploit both pipelining and instruction caching. First, a method is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. Next, these categorizations are used in the pipeline analysis of sequences of instructions representing paths within the program. A timing analyzer uses the pipeline path analysis to estimate the worst and best-case execution performance of each loop and function in the program. Finally, a graphical user interface is invoked that allows a user to request timing predictions on portions of the program. The results indicate that the timing analyzer efficiently produces tight predictions of worst and best-case performance for pipelining and instruction caching.

223 citations


Journal ArticleDOI
TL;DR: An architecture based on a new formulation of the multiplication matrix is described and it is shown that the Mastrovito multiplier for the generating trinomial x^m + x^n + 1, where m ≠ 2n, also requires m^2 - 1 XOR and m^2 AND gates.
Abstract: An efficient algorithm for the multiplication in GF(2^m) was introduced by Mastrovito. The space complexity of the Mastrovito multiplier for the irreducible trinomial x^m + x + 1 was given as m^2 - 1 XOR and m^2 AND gates. In this paper, we describe an architecture based on a new formulation of the multiplication matrix and show that the Mastrovito multiplier for the generating trinomial x^m + x^n + 1, where m ≠ 2n, also requires m^2 - 1 XOR and m^2 AND gates. However, m^2 - m/2 XOR gates are sufficient when the generating trinomial is of the form x^m + x^(m/2) + 1 for an even m. We also calculate the time complexity of the proposed Mastrovito multiplier and give design examples for the irreducible trinomials x^7 + x^4 + 1 and x^6 + x^3 + 1.

211 citations
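
To make the arithmetic concrete, here is a plain bit-serial multiplication in GF(2^7) reduced by the irreducible trinomial x^7 + x^4 + 1, one of the paper's design examples. This sketch only illustrates the field operation itself; the Mastrovito approach realizes the same product as a single matrix-vector multiplication whose gate counts are the ones quoted above.

    def gf2m_mult(a, b, m=7, trinomial=0b10010001):   # 0b10010001 encodes x^7 + x^4 + 1
        """Multiply field elements a and b (bit i = coefficient of x^i) modulo the trinomial."""
        prod = 0
        for i in range(m):                       # schoolbook polynomial product over GF(2)
            if (b >> i) & 1:
                prod ^= a << i
        for i in range(2 * m - 2, m - 1, -1):    # reduce degrees 2m-2 down to m
            if (prod >> i) & 1:
                prod ^= trinomial << (i - m)
        return prod

For example, gf2m_mult(0b0000010, 0b1000000) computes x * x^6 = x^7, which reduces to x^4 + 1, i.e., 0b0010001.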


Journal ArticleDOI
TL;DR: It is found that a superthreaded processor can achieve good performance on complex application programs through this close coupling of compile-time and run-time information.
Abstract: The common single-threaded execution model limits processors to exploiting only the relatively small amount of instruction-level parallelism that is available in application programs. The superthreaded processor, on the other hand, is a concurrent multithreaded architecture (CMA) that can exploit the multiple granularities of parallelism that are available in general-purpose application programs. Unlike other CMAs that rely primarily on hardware for run-time dependence detection and speculation, the superthreaded processor combines compiler-directed thread-level speculation of control and data dependences with run-time data dependence verification hardware. This hybrid of a superscalar processor and a multiprocessor-on-a-chip can utilize many of the existing compiler techniques used in traditional parallelizing compilers developed for multiprocessors. Additional unique compiler techniques, such as the conversion of data speculation into control speculation, are also introduced to generate the superthreaded code and to enhance the parallelism between threads. A detailed execution-driven simulator is used to evaluate the performance potential of this new architecture. It is found that a superthreaded processor can achieve good performance on complex application programs through this close coupling of compile-time and run-time information.

190 citations


Journal ArticleDOI
TL;DR: The concept of temporal partitioning, which partitions a task into temporally interconnected subtasks, is introduced; temporal partitioning followed by proper scheduling can facilitate configurable-computer-based execution.
Abstract: FPGA-based configurable computing machines are evolving rapidly. They offer the ability to deliver very high performance at a fraction of the cost when compared to supercomputers. The first generation of configurable computers (those with multiple FPGAs connected using a specific interconnect) used statically reconfigurable FPGAs. On these configurable computers, computations are performed by partitioning an entire task into spatially interconnected subtasks. Such configurable computers are used in logic emulation systems and for functional verification of hardware. In general, configurable computers provide the ability to reconfigure rapidly to any desired custom form. Hence, the available resources can be reused effectively to cut down the hardware costs and also improve the performance. In this paper, we introduce the concept of temporal partitioning to partition a task into temporally interconnected subtasks. Specifically, we present algorithms for temporal partitioning and scheduling data flow graphs for configurable computers. We are given a configurable computing unit (RPU) with a logic capacity of S_RPU and a computational task represented by an acyclic data flow graph G=(V, E). Computations with logic area requirements that exceed S_RPU cannot be completely mapped on a configurable computer (using traditional spatial mapping techniques). However, a temporal partitioning of the data flow graph followed by proper scheduling can facilitate the configurable computer based execution. Temporal partitioning of the data flow graph is a k-way partitioning of G=(V, E) such that each partitioned segment will not exceed S_RPU in its logic requirement. Scheduling assigns an execution order to the partitioned segments so as to ensure proper execution. Thus, scheduling assigns each segment s_i in {s_1, s_2, ..., s_k} a unique position j in the execution order, 1 ≤ i ≤ k, 1 ≤ j ≤ k, such that the computation executes in proper sequential order as defined by the flow graph G=(V, E).

182 citations
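
As a deliberately naive illustration of the problem statement (not the paper's algorithm), the sketch below walks the data flow graph in topological order and closes a segment whenever the next operation would exceed the RPU capacity S_RPU; because segments fill in topological order, every predecessor of an operation lands in the same or an earlier segment, so executing the segments one after another respects the flow graph.

    from graphlib import TopologicalSorter

    def temporal_partition(predecessors, area, s_rpu):
        """predecessors: node -> set of predecessor nodes in the DFG G = (V, E);
        area: node -> logic area; s_rpu: RPU logic capacity.
        Returns segments s_1, ..., s_k, scheduled in list order."""
        segments, current, used = [], [], 0
        for v in TopologicalSorter(predecessors).static_order():
            if area[v] > s_rpu:
                raise ValueError(f"operation {v} alone exceeds the RPU capacity")
            if used + area[v] > s_rpu:           # close the current segment
                segments.append(current)
                current, used = [], 0
            current.append(v)
            used += area[v]
        if current:
            segments.append(current)
        return segments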


Journal ArticleDOI
TL;DR: In this paper, a high-speed method for function approximation that employs symmetric bipartite tables is presented, which uses less memory by taking advantage of symmetry and leading zeros in one of the two tables.
Abstract: This paper presents a high-speed method for function approximation that employs symmetric bipartite tables. This method performs two parallel table lookups to obtain a carry-save (borrow-save) function approximation, which is either converted to a two's complement number or is Booth encoded. Compared to previous methods for bipartite table approximations, this method uses less memory by taking advantage of symmetry and leading zeros in one of the two tables. It also has a closed-form solution for the table entries, provides tight bounds on the maximum absolute error, and can be applied to a wide range of functions. A variation of this method provides accurate initial approximations that are useful in multiplicative divide and square root algorithms.

166 citations
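
The following sketch shows the basic (unfolded) bipartite construction the paper improves on: split the input bits into three fields, index one table with the high and middle fields and a linear-correction table with the high and low fields, and add the two lookups. The symmetry the paper exploits comes from the correction entries being antisymmetric about the midpoint of the low field, so only half of them need to be stored; that folding, the carry-save output format, and the error bounds are omitted here. The field widths and target function are arbitrary choices for illustration.

    def build_bipartite_tables(f, dfdx, n0, n1, n2):
        """Tables such that f(x) ~ a0[x0][x1] + a1[x0][x2] for x in [0, 1)
        with n0 + n1 + n2 fraction bits split as x = x0 | x1 | x2."""
        s1, s2, s3 = 2.0 ** -n0, 2.0 ** -(n0 + n1), 2.0 ** -(n0 + n1 + n2)
        d1 = (s1 - s3) / 2                       # midpoint of the combined (x1, x2) field
        d2 = (s2 - s3) / 2                       # midpoint of the x2 field
        a0 = [[f(i0 * s1 + i1 * s2 + d2) for i1 in range(2 ** n1)]
              for i0 in range(2 ** n0)]
        a1 = [[dfdx(i0 * s1 + d1) * (i2 * s3 - d2) for i2 in range(2 ** n2)]
              for i0 in range(2 ** n0)]          # antisymmetric in i2 about d2
        return a0, a1

    def bipartite_lookup(x, a0, a1, n0, n1, n2):
        bits = int(x * 2 ** (n0 + n1 + n2))
        i2 = bits & ((1 << n2) - 1)
        i1 = (bits >> n2) & ((1 << n1) - 1)
        i0 = bits >> (n1 + n2)
        return a0[i0][i1] + a1[i0][i2]           # the two lookups proceed in parallel in hardware

With, say, f = math.sin, dfdx = math.cos, and n0 = n1 = n2 = 5, the two small tables approximate sin on [0, 1) far more accurately than a single table addressed by only the top ten bits.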


Journal ArticleDOI
TL;DR: In this article, a software reliability model based on non-homogeneous Poisson process is used to minimize the expected total software cost and a software tool is also developed using Excel and Visual Basic to facilitate the task of determining the optimal software release time.
Abstract: In this paper, a cost model with warranty cost, time to remove each error detected in the software system, and risk cost due to software failure is developed. A software reliability model based on non-homogeneous Poisson process is used. The optimal release policies to minimize the expected total software cost are discussed. A software tool is also developed using Excel and Visual Basic to facilitate the task of determining the optimal software release time. Numerical examples are provided to illustrate the results.

164 citations
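
As a hedged illustration of how such a cost model is exercised (the mean value function and cost terms below are generic textbook choices, not necessarily this paper's exact formulation), the expected cost of releasing at time T can be written down and minimized numerically:

    import math

    def expected_cost(T, a, b, c_test, c_fix, c_warranty, c_risk, T_w, x):
        """Illustrative total cost of releasing at time T.
        m(t) = a * (1 - exp(-b t)) is an NHPP mean value function (Goel-Okumoto form);
        T_w is the warranty period and x a mission time used for the risk term."""
        m = lambda t: a * (1.0 - math.exp(-b * t))
        testing  = c_test * T                            # cost of testing effort
        fix_test = c_fix * m(T)                          # removing errors found before release
        warranty = c_warranty * (m(T + T_w) - m(T))      # errors surfacing under warranty
        risk     = c_risk * (1.0 - math.exp(-(m(T + x) - m(T))))   # risk cost of field failure
        return testing + fix_test + warranty + risk

    def optimal_release(a, b, costs, T_w, x, horizon=200.0, step=0.5):
        """Crude grid search for the release time minimizing the expected cost,
        the kind of exploration the paper's Excel/Visual Basic tool automates."""
        grid = [i * step for i in range(int(horizon / step) + 1)]
        return min(grid, key=lambda T: expected_cost(T, a, b, *costs, T_w, x))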


Journal ArticleDOI
TL;DR: A method for estimating task execution times is presented in order to facilitate dynamic scheduling in a heterogeneous metacomputing environment; it predicts the execution time as a function of several parameters of the input data.
Abstract: In this paper, a method for estimating task execution times is presented in order to facilitate dynamic scheduling in a heterogeneous metacomputing environment. Execution time is treated as a random variable and is statistically estimated from past observations. This method predicts the execution time as a function of several parameters of the input data and does not require any direct information about the algorithms used by the tasks or the architecture of the machines. Techniques based upon the concept of analytic benchmarking/code profiling are used to characterize the performance differences between machines, allowing observations from dissimilar machines to be used when making a prediction. Experimental results are presented which use actual execution time data gathered from 16 heterogeneous machines.

157 citations
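
One simple way to realize the idea (a sketch with invented names, not the authors' exact estimator): treat past runs as samples of a common parametric model, scale each sample by a machine factor obtained from analytic benchmarking/code profiling, and fit the parameters by least squares so that observations from dissimilar machines can be pooled.

    import numpy as np

    def fit_time_model(samples):
        """samples: list of (input_params, machine_factor, observed_seconds).
        Fits t ~ machine_factor * (w . [1, params...]) by least squares."""
        X = np.array([f * np.array([1.0, *p]) for p, f, _ in samples])
        y = np.array([t for _, _, t in samples])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w

    def predict_time(input_params, machine_factor, w):
        return machine_factor * float(np.dot(w, [1.0, *input_params]))

The residual spread of the fit can also be reported alongside each prediction, in the spirit of treating execution time as a random variable rather than a point estimate.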


Journal ArticleDOI
TL;DR: The Markov prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs; it reduces the overall execution stalls due to instruction and data memory operations by an average of 54 percent for various commercial benchmarks, while using only two-thirds the memory of a demand-fetch cache organization.
Abstract: Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions from the memory subsystem, and then prioritizing the delivery of those references to the processor. This design results in a prefetching system that provides good coverage, is accurate, and produces timely results that can be effectively used by the processor. We also explored a range of techniques that can be used to reduce the bandwidth demands of prefetching, leading to improved memory system performance. In our cycle-level simulations, the Markov Prefetcher reduces the overall execution stalls due to instruction and data memory operations by an average of 54 percent for various commercial benchmarks while only using two-thirds the memory of a demand-fetch cache organization.

132 citations
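
A toy software model of the prediction table (invented names; the real prefetcher operates on the cache-miss address stream in hardware): every miss address remembers which miss addresses tended to follow it, and on a miss the most frequent successors are returned as prioritized prefetch candidates.

    from collections import Counter, defaultdict

    class MarkovPrefetchTable:
        def __init__(self, width=4):
            self.successors = defaultdict(Counter)   # miss addr -> counts of following miss addrs
            self.last_miss = None
            self.width = width                       # prefetch candidates issued per miss

        def on_miss(self, addr):
            """Record the observed transition and return candidates, highest priority first."""
            if self.last_miss is not None:
                self.successors[self.last_miss][addr] += 1
            self.last_miss = addr
            return [a for a, _ in self.successors[addr].most_common(self.width)]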


Journal ArticleDOI
TL;DR: A new analytical model for obtaining latency measures in high-radix k-ary n-cubes with fully adaptive routing, based on Duato's algorithm (1998), is proposed.
Abstract: Analytical models of deterministic routing in wormhole-routed k-ary n-cubes have widely been reported in the literature. Although many fully adaptive routing algorithms have been proposed to overcome the performance limitations of deterministic routing, there have been hardly any studies that describe analytical models for these algorithms. This paper proposes a new analytical model for obtaining latency measures in high-radix k-ary n-cubes with fully adaptive routing, based on Duato's algorithm (1998). The validity of the model is demonstrated by comparing analytical results with those obtained through simulation experiments.

Journal ArticleDOI
TL;DR: It is proved that, for the important hypercube structured multiprocessor systems (n-cubes), the diagnosability under the comparison model is n when n ≥ 5.
Abstract: A. Sengupta and A. Dahbura (1992) discussed how to characterize a diagnosable system under the comparison diagnosis model proposed by J. Maeng and M. Malek (1981) and a polynomial algorithm was given to identify the faulty processors provided that the system's diagnosability is known. However, for a general system, the determination of its diagnosability is not algorithmically easy. This paper proves that, for the important hypercube structured multiprocessor systems (n-cubes), the diagnosability under the comparison model is n when n ≥ 5. The paper also studies the diagnosability of the enhanced hypercube, which is obtained by adding 2^(n-1) more links to a regular hypercube of 2^n processors. It is shown that the augmented communication ability among processors also increases the system's diagnosability under the comparison model. We prove that the diagnosability is n+1 for an enhanced hypercube when n ≥ 6.

Journal ArticleDOI
TL;DR: This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity, and proposes a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior.
Abstract: The growing disparity between processor and memory performance has made cache misses increasingly expensive. Additionally, data and instruction caches are not always used efficiently, resulting in large numbers of cache misses. Therefore, the importance of cache performance improvements at each level of the memory hierarchy will continue to grow. In numeric programs, there are several known compiler techniques for optimizing data cache performance. However, integer (nonnumeric) programs often have irregular access patterns that are more difficult for the compiler to optimize. In the past, cache management techniques such as cache bypassing were implemented manually at the machine-language-programming level. As the available chip area grows, it makes sense to spend more resources to allow intelligent control over the cache management. In this paper, we present an approach to improving cache effectiveness, taking advantage of the growing chip area, utilizing run-time adaptive cache management techniques, optimizing both performance and cost of implementation. Specifically, we are aiming to increase data cache effectiveness for integer programs. We propose a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior. This scheme is fully compatible with existing instruction set architectures. This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity. Then, detailed trace-driven simulations of the integer applications are used to show that the implementation described in this paper can achieve performance close to that of the upper bound.

Journal ArticleDOI
TL;DR: This paper describes a compiler algorithm for optimizing cache locality in scientific codes on uniprocessor and multiprocessor machines that considers loop and data layout transformations in a unified framework and can optimize some nests for which optimization techniques based on loop transformations alone are not successful.
Abstract: Exploiting locality of reference is key to realizing high levels of performance on modern processors. This paper describes a compiler algorithm for optimizing cache locality in scientific codes on uniprocessor and multiprocessor machines. A distinctive characteristic of our algorithm is that it considers loop and data layout transformations in a unified framework. Our approach is very effective at reducing cache misses and can optimize some nests for which optimization techniques based on loop transformations alone are not successful. An important special case is one in which data layouts of some arrays are fixed and cannot be changed. We show how our algorithm can accommodate this case and demonstrate how it can be used to optimize multiple loop nests. Experiments on several benchmarks show that the techniques presented in this paper result in substantial improvement in cache performance.
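
A small concrete example of why the unified view matters (illustrative code, not the paper's algorithm; assume the 2D arrays are stored row-major): a loop transformation alone fixes a nest whose references agree on the preferred order, but when the references disagree, a data layout transformation on one array is what makes every access stride-1.

    def scale_poor(a, b, n):
        for j in range(n):                  # column-major walk over row-major arrays:
            for i in range(n):              # poor spatial locality on both a and b
                a[i][j] = 2.0 * b[i][j]

    def scale_interchanged(a, b, n):
        for i in range(n):                  # loop interchange alone restores stride-1 access
            for j in range(n):
                a[i][j] = 2.0 * b[i][j]

    def transpose_conflict(a, b, n):
        for i in range(n):                  # a[i][j] wants j innermost, b[j][i] wants i innermost:
            for j in range(n):              # no loop order serves both references
                a[i][j] = b[j][i]

    def transpose_after_layout_change(a, bT, n):
        for i in range(n):                  # storing b transposed (bT[i][j] == b[j][i])
            for j in range(n):              # makes both references stride-1
                a[i][j] = bT[i][j]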

Journal ArticleDOI
TL;DR: New heuristic algorithms are proposed for the Graph Partitioning problem and detailed experimental results are presented on benchmark suites used in the previous literature, consisting of graphs derived from parametric models and of "real-world" graphs of large size.
Abstract: New heuristic algorithms are proposed for the Graph Partitioning problem. A greedy construction scheme with an appropriate tie-breaking rule (MIN-MAX-GREEDY) produces initial assignments in a very fast time. For some classes of graphs, independent repetitions of MIN-MAX-GREEDY are sufficient to reproduce solutions found by more complex techniques. When the method is not competitive, the initial assignments are used as starting points for a prohibition-based scheme, where the prohibition is chosen in a randomized and reactive way, with a bias towards more successful choices in the previous part of the run. The relationship between prohibition-based diversification (Tabu Search) and the variable-depth Kernighan-Lin algorithm is discussed. Detailed experimental results are presented on benchmark suites used in the previous literature, consisting of graphs derived from parametric models (random graphs, geometric graphs, etc.) and of "real-world" graphs of large size. On the first series of graphs, a better performance for equivalent or smaller computing times is obtained, while, on the large "real-world" instances, significantly better results than those of multilevel algorithms are obtained, but for a much larger computational effort.
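
To ground the construction phase, here is a simplified greedy bisection in the spirit of, though not identical to, the MIN-MAX-GREEDY rule described above: grow the smaller side with the unassigned vertex whose edges lean most strongly toward it and away from the other side, with randomized tie-breaking so that independent repetitions explore different solutions.

    import random

    def greedy_bisect(adj, seed=0):
        """adj: dict vertex -> set of neighbours. Returns a balanced 2-way partition."""
        rng = random.Random(seed)
        unassigned = set(adj)
        a, b = {unassigned.pop()}, set()
        while unassigned:
            side, other = (a, b) if len(a) <= len(b) else (b, a)
            v = max(unassigned,
                    key=lambda u: (len(adj[u] & side) - len(adj[u] & other), rng.random()))
            side.add(v)
            unassigned.remove(v)
        return a, b

    def cut_size(adj, a, b):
        return sum(1 for u in a for v in adj[u] if v in b)

Independent restarts with different seeds play the role of the fast construction phase; the reactive prohibition-based local search then refines the most promising starting points.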

Journal ArticleDOI
TL;DR: This paper proposes new methods to: 1) Perform fault tolerance based task clustering, which determines the best placement of assertion and duplicate-and-compare tasks, 2) Derive the best error recovery topology using a small number of extra processing elements, 3) Exploit multidimensional assertions, and 4) Share assertions to reduce the fault tolerance overhead.
Abstract: Embedded systems employed in critical applications demand high reliability and availability in addition to high performance. Hardware-software co-synthesis of an embedded system is the process of partitioning, mapping, and scheduling its specification into hardware and software modules to meet performance, cost, reliability, and availability goals. In this paper, we address the problem of hardware-software co-synthesis of fault-tolerant real-time heterogeneous distributed embedded systems. Fault detection capability is imparted to the embedded system by adding assertion and duplicate-and-compare tasks to the task graph specification prior to co-synthesis. The dependability (reliability and availability) of the architecture is evaluated during co-synthesis. Our algorithm, called COFTA (Co-synthesis Of Fault-Tolerant Architectures), allows the user to specify multiple types of assertions for each task. It uses the assertion or combination of assertions which achieves the required fault coverage without incurring too much overhead. We propose new methods to: 1) Perform fault tolerance based task clustering, which determines the best placement of assertion and duplicate-and-compare tasks, 2) Derive the best error recovery topology using a small number of extra processing elements, 3) Exploit multidimensional assertions, and 4) Share assertions to reduce the fault tolerance overhead. Our algorithm can tackle multirate systems commonly found in multimedia applications. Application of the proposed algorithm to a large number of real-life telecom transport system examples (the largest example consisting of 2,172 tasks) shows its efficacy. For fault secure architectures, which just have fault detection capabilities, COFTA is able to achieve up to 48.8 percent and 25.6 percent savings in embedded system cost over architectures employing duplication and task-based fault tolerance techniques, respectively. The average cost overhead of COFTA fault-secure architectures over simplex architectures is only 7.3 percent. In case of fault-tolerant architectures, which can not only detect but also tolerate faults, COFTA is able to achieve up to 63.1 percent and 23.8 percent savings in embedded system cost over architectures employing triple-modular redundancy and task-based fault tolerance techniques, respectively. The average cost overhead of COFTA fault-tolerant architectures over simplex architectures is only 55.4 percent.

Journal ArticleDOI
TL;DR: This paper analyzes some of the main properties of a double base number system, using bases 2 and 3, and introduces an index calculus for logarithmic-like arithmetic with considerable hardware reductions in lookup table size.
Abstract: In this paper, we analyze some of the main properties of a double base number system, using bases 2 and 3; in particular, we emphasize the sparseness of the representation. A simple geometric interpretation allows an efficient implementation of the basic arithmetic operations and we introduce an index calculus for logarithmic-like arithmetic with considerable hardware reductions in lookup table size. We discuss the application of this number system in the area of digital signal processing; we illustrate the discussion with examples of finite impulse response filtering.
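
A quick way to see the sparseness the authors emphasize is the standard greedy conversion into double-base form (a generic algorithm, not necessarily the one used in the paper): repeatedly subtract the largest number of the form 2^a * 3^b that still fits.

    def dbns_greedy(n):
        """Return [(a, b), ...] such that n == sum of 2**a * 3**b over the pairs."""
        terms = []
        while n > 0:
            best_val, best = 0, None
            p3, b = 1, 0
            while p3 <= n:                       # try every power of 3 not exceeding n
                a = (n // p3).bit_length() - 1   # largest a with 2**a * 3**b <= n
                if (p3 << a) > best_val:
                    best_val, best = p3 << a, (a, b)
                p3 *= 3
                b += 1
            terms.append(best)
            n -= best_val
        return terms

For example, dbns_greedy(127) returns [(2, 3), (1, 2), (0, 0)], i.e., 127 = 4*27 + 2*9 + 1: three terms, where the binary representation needs seven ones.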

Journal ArticleDOI
TL;DR: A new comparison-based model for distributed fault diagnosis in multicomputer systems with a weak reliable broadcast capability is described, along with a polynomial-time diagnosis algorithm that diagnoses all fault situations with low latency and very little overhead.
Abstract: This paper describes a new comparison-based model for distributed fault diagnosis in multicomputer systems with a weak reliable broadcast capability. The classical problems of diagnosability and diagnosis are both considered under this broadcast comparison model. A characterization of diagnosable systems is given, which leads to a polynomial-time diagnosability algorithm. A polynomial-time diagnosis algorithm for t-diagnosable systems is also given. A variation of this algorithm, which allows dynamic fault occurrence and incomplete diagnostic information, has been implemented in the COmmon Spaceborne Multicomputer Operating System (COSMOS). Results produced using a simulator for the JPL MAX multicomputer system running COSMOS show that the algorithm diagnoses all fault situations with low latency and very little overhead. These simulations demonstrate the practicality of the proposed diagnosis model and algorithm for multicomputer systems having weak reliable broadcast. This includes systems with fault-tolerant hardware for broadcast, as well as those where reliable broadcast is implemented in software.

Journal ArticleDOI
TL;DR: A microarchitecture incorporating a trace cache provides high instruction fetch bandwidth with low latency by explicitly sequencing through the program at the higher level of traces, both in terms of control flow prediction and instruction supply.
Abstract: As the instruction issue width of superscalar processors increases, instruction fetch bandwidth requirements will also increase. It will eventually become necessary to fetch multiple basic blocks per clock cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. Trace caches overcome this limitation by caching traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. In this paper, we present and evaluate a microarchitecture incorporating a trace cache. The microarchitecture provides high instruction fetch bandwidth with low latency by explicitly sequencing through the program at the higher level of traces, both in terms of (1) control flow prediction and (2) instruction supply. For the SPEC95 integer benchmarks, trace-level sequencing improves performance from 15 percent to 35 percent over an otherwise equally sophisticated, but contiguous, multiple-block fetch mechanism. Most of this performance improvement is due to the trace cache. However, for one benchmark whose performance is limited by branch mispredictions, the performance gain is almost entirely due to improved prediction accuracy.
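
A toy software model of the structure (names invented for illustration): a trace is keyed by its starting PC together with the predicted outcomes of the branches inside it, so a single lookup can deliver instructions from basic blocks that are not contiguous in memory.

    class TraceCache:
        def __init__(self, max_branches=2):
            self.max_branches = max_branches    # branch outcomes recorded per trace
            self.traces = {}                    # (start_pc, outcomes) -> instruction list

        def lookup(self, start_pc, predicted_outcomes):
            """Hit only if the multiple-branch predictor agrees with the stored path."""
            key = (start_pc, tuple(predicted_outcomes[:self.max_branches]))
            return self.traces.get(key)

        def fill(self, start_pc, actual_outcomes, instructions):
            """Capture a trace from the retiring dynamic instruction stream."""
            key = (start_pc, tuple(actual_outcomes[:self.max_branches]))
            self.traces[key] = list(instructions)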

Journal ArticleDOI
TL;DR: DAT, a technique that augments loop tiling with data alignment, is presented; it achieves improved efficiency (by ensuring that the cache is never under-utilized) as well as improved flexibility (by eliminating self-interference cache conflicts independent of the tile size), resulting in more stable and better cache performance.
Abstract: Loop blocking (tiling) is a well-known compiler optimization that helps improve cache performance by dividing the loop iteration space into smaller blocks (tiles); reuse of array elements within each tile is maximized by ensuring that the working set for the tile fits into the data cache. Padding is a data alignment technique that involves the insertion of dummy elements into a data structure for improving cache performance. In this work, we present DAT, a technique that augments loop tiling with data alignment, achieving improved efficiency (by ensuring that the cache is never under-utilized) as well as improved flexibility (by eliminating self-interference cache conflicts independent of the tile size). This results in a more stable and better cache performance than existing approaches, in addition to maximizing cache utilization, eliminating self-interference, and minimizing cross-interference conflicts. Further, while all previous efforts are targeted at programs characterized by the reuse of a single array, we also address the issue of minimizing conflict misses when several tiled arrays are involved. To validate our technique, we ran extensive experiments using both simulations as well as actual measurements on SUN Sparc5 and Sparc10 workstations. The results on benchmarks exhibiting varying memory access patterns demonstrate the effectiveness of our technique through consistently high hit ratios and improved performance across varying problem sizes.

Journal ArticleDOI
TL;DR: This paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size and major mechanisms driving the observed trade-offs are described.
Abstract: Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Trade-offs among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or overstate the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, major mechanisms driving the observed trade-offs are described. The paper also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction cache behavior have synergistic effects, and that the benefits of larger instruction windows and larger data caches trade off and have overlapping effects. In addition, simulations of only 50 million instructions can yield representative results if these short windows are carefully selected.

Journal ArticleDOI
TL;DR: The design presented here incorporates a concurrent position correction logic, operating in parallel with the LOP, to detect the presence of that error and produce the correct shift amount.
Abstract: This paper describes the design of a leading-one prediction (LOP) logic for floating-point addition with an exact determination of the shift amount for normalization of the adder result. Leading-one prediction is a technique to calculate the number of leading zeros of the result in parallel with the addition. However, the prediction might be in error by one bit and previous schemes to correct this error result in a delay increase. The design presented here incorporates a concurrent position correction logic, operating in parallel with the LOP, to detect the presence of that error and produce the correct shift amount. We describe the error detection as part of the overall LOP, perform estimates of its delay and complexity, and compare with previous schemes.
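
The behavior being predicted can be stated in a few lines (a functional model, not the gate-level design): the LOP supplies a shift amount that is either exact or one too small, and the correction logic detects the short-by-one case; in the paper that detection runs concurrently with the LOP, whereas this sequential sketch checks it after the shift.

    def normalize(significand, predicted_shift, width):
        """Shift the adder result left by the LOP's predicted amount; if the prediction
        was short by one bit (the known failure mode), shift once more."""
        mask = (1 << width) - 1
        shifted = (significand << predicted_shift) & mask
        if significand and not (shifted >> (width - 1)) & 1:
            shifted = (shifted << 1) & mask      # the one-bit position correction
        return shifted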

Journal ArticleDOI
TL;DR: A novel class of arithmetic architectures for Galois fields GF(2^k) is described, capable of exploring the time-space trade-off paradigm in a flexible manner, and two different approaches to squaring are provided.
Abstract: The article describes a novel class of arithmetic architectures for Galois fields GF(2^k). The main applications of the architecture are public key systems which are based on the discrete logarithm problem for elliptic curves. The architectures use a representation of the field GF(2^k) as GF((2^n)^m), where k = n·m. The approach explores bit-parallel arithmetic in the subfield GF(2^n) and serial processing for the extension field arithmetic. This mixed parallel-serial (hybrid) approach can lead to fast implementations. As the core module, a hybrid multiplier is introduced and several optimizations are discussed. We provide two different approaches to squaring. We develop exact expressions for the complexity of parallel squarers in composite fields, which can have a surprisingly low complexity. The hybrid architectures are capable of exploring the time-space trade-off paradigm in a flexible manner. In particular, the number of clock cycles for one field multiplication, which is the atomic operation in most public key schemes, can be reduced by a factor of n compared to other known realizations. The acceleration is achieved at the cost of an increased computational complexity. We describe a proof-of-concept implementation of an ASIC for multiplication and squaring in GF((2^n)^m), with m variable.

Journal ArticleDOI
TL;DR: A novel scheduling approach is presented for servicing soft aperiodic requests in a hard real-time environment where a set of hard periodic tasks is scheduled using the Earliest Deadline First algorithm; it achieves full processor utilization and optimal aperiodic responsiveness while still guaranteeing the execution of the periodic tasks.
Abstract: We present a novel scheduling approach for servicing soft aperiodic requests in a hard real time environment, where a set of hard periodic tasks is scheduled using the Earliest Deadline First algorithm. The main characteristic of the proposed algorithm is that it achieves full processor utilization and optimal aperiodic responsiveness, still guaranteeing the execution of the periodic tasks. Another interesting feature of the proposed algorithm is that it can easily be tuned to balance performance versus complexity for adapting it to different application requirements. Schedulability issues, performance results, and implementation complexity of the algorithm are discussed and compared with other methods, such as Background, the Total Bandwidth Server, and the Slack Stealer. Resource reclaiming and extensions to more general cases are also considered. Extensive simulations show that a substantial improvement can be achieved with a little increase of complexity, ranging from the performance of the Total Bandwidth Server up to the optimal behavior.
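
For context on one of the baselines mentioned above, the Total Bandwidth Server assigns each aperiodic request a deadline carved out of the spare bandwidth and then lets plain EDF schedule it; the formula is standard and shown here only for reference, while the proposed algorithm refines this assignment toward optimal responsiveness.

    def tbs_deadline(release_time, exec_time, prev_deadline, server_utilization):
        """d_k = max(r_k, d_{k-1}) + C_k / U_s, where U_s is the processor bandwidth
        left over by the hard periodic tasks. The request enters the EDF queue with
        this deadline, so the periodic guarantees are preserved as long as the
        periodic utilization plus U_s does not exceed one."""
        return max(release_time, prev_deadline) + exec_time / server_utilization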

Journal ArticleDOI
TL;DR: A new gate-level model that handles time-multiplexed computation is proposed, and an enhanced force directed scheduling (FDS) algorithm is introduced to partition sequential circuits; it finds a correct partition with low logic and communication costs, under the assumption that maximum performance is desired.
Abstract: A fundamental feature of Dynamically Reconfigurable FPGAs (DRFPGAs) is that the logic and interconnect are time-multiplexed. Thus, for a circuit to be implemented on a DRFPGA, it needs to be partitioned such that each subcircuit can be executed at a different time. In this paper, the partitioning of sequential circuits for execution on a DRFPGA is studied. To determine how to correctly partition a sequential circuit and what the costs of doing so are, we propose a new gate-level model that handles time-multiplexed computation. We also introduce an enhanced force directed scheduling (FDS) algorithm to partition sequential circuits that finds a correct partition with low logic and communication costs, under the assumption that maximum performance is desired. We use our algorithm to partition seven large ISCAS'89 sequential benchmark circuits. The experimental results show that the enhanced FDS reduces communication costs by 27.5 percent with only a 1.1 percent increase in the gate cost compared to traditional FDS.

Journal ArticleDOI
TL;DR: This paper shows that the Walsh spectrum of Boolean functions can be analyzed by looking at algebraic properties of a class of Cayley graphs associated with Boolean functions.
Abstract: Several problems in digital logic can be conveniently approached in the spectral domain. In this paper we show that the Walsh spectrum of Boolean functions can be analyzed by looking at algebraic properties of a class of Cayley graphs associated with Boolean functions. We use this idea to investigate the Walsh spectrum of certain special functions.
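
The correspondence can be stated compactly: associate with a Boolean function f on n variables the Cayley graph over Z_2^n whose connection set is {x : f(x) = 1}; the eigenvalues of that graph are exactly the Walsh coefficients of f. The brute-force sketch below computes the coefficients directly (exponential in n, for illustration only).

    from itertools import product

    def walsh_spectrum(f, n):
        """W_f(w) = sum over x of f(x) * (-1)^(w . x); these are the eigenvalues of
        the Cayley graph of Z_2^n with connection set supp(f)."""
        points = list(product((0, 1), repeat=n))
        def sign(w, x):
            return -1 if sum(wi & xi for wi, xi in zip(w, x)) % 2 else 1
        return {w: sum(f(x) * sign(w, x) for x in points) for w in points}

For the three-variable majority function, for example, the coefficients are 4 at w = 000, -2 at the three weight-one points, 0 at the weight-two points, and 2 at w = 111, illustrating how a symmetric function's spectrum depends only on the weight of w.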

Journal ArticleDOI
Jack Jean, Karen A. Tomko, V. Yavagal, J. Shah, R. Cook
TL;DR: The development of a dynamically reconfigurable system that can support multiple applications running concurrently is described, and the impact of supporting concurrency and preloading in reducing application execution time is demonstrated.
Abstract: This paper describes the development of a dynamically reconfigurable system that can support multiple applications running concurrently. A dynamically reconfigurable system allows hardware reconfiguration while part of the reconfigurable hardware is busy computing. An FPGA resource manager (RM) is developed to allocate and de-allocate FPGA resources and to preload FPGA configuration files. For each individual application, different tasks that require FPGA resources are represented as a flow graph which is made available to the RM so as to enable efficient resource management and preloading. The performance of using the RM to support several applications is summarized. The impact of supporting concurrency and preloading in reducing application execution time is demonstrated.

Journal ArticleDOI
TL;DR: The proposed scheme employs a pseudorandom scan cell selection routine which, in conjunction with a conventional signature analysis and simple reasoning procedure, allows flexible trade-offs between the test application time and the diagnostic resolution.
Abstract: The paper presents a new fault diagnosis technique for scan-based designs with BIST. It can be used for nonadaptive identification of the scan cells that are driven by erroneous signals. The proposed scheme employs a pseudorandom scan cell selection routine which, in conjunction with a conventional signature analysis and simple reasoning procedure, allows flexible trade-offs between the test application time and the diagnostic resolution.

Journal ArticleDOI
TL;DR: This work extends the concept of subcube to the more powerful pseudocube and defines a class of symmetric functions, particularly suitable for SPP representation, as a relevant example of application of this new technique.
Abstract: Consider a hypercube of 2^n points described by n Boolean variables and a subcube of 2^m points, m ≤ n. As is well-known, the Boolean function with value 1 in the points of the subcube can be expressed as the product (AND) of n-m variables. The standard synthesis of arbitrary functions exploits this property. We extend the concept of subcube to the more powerful pseudocube. The basic set is still composed of 2^m points, but has a more general form. The function with value 1 in a pseudocube, called pseudoproduct, is expressed as the AND of n-m EXOR-factors, each containing at most m+1 variables. Subcubes are special cases of pseudocubes and their corresponding pseudoproducts reduce to standard products. An arbitrary Boolean function can be expressed as a sum of pseudoproducts (SPP). This expression is in general much shorter than the standard sum of products, as demonstrated on some known benchmarks. The logical network of an n-bit adder is designed in SPP, as a relevant example of application of this new technique. A class of symmetric functions is also defined, particularly suitable for SPP representation.
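
A minimal concrete instance of the definitions (a toy example of ours, not one from the paper): take n = 3 and the 2^1-point set {011, 101}. Its two points differ in two coordinates, so it is not a subcube, yet it is a pseudocube, and its characteristic function is a single pseudoproduct of n - m = 2 EXOR factors, each with at most m + 1 = 2 variables.

    def pseudoproduct(x1, x2, x3):
        """(x1 XOR x2) AND x3: value 1 exactly on the pseudocube {011, 101}."""
        return (x1 ^ x2) & x3

    # enumerate the on-set to check the claim
    on_set = [(x1, x2, x3) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)
              if pseudoproduct(x1, x2, x3)]
    assert on_set == [(0, 1, 1), (1, 0, 1)]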

Journal ArticleDOI
TL;DR: This paper describes four applications in the domain of configurable computing, considering both static and dynamic systems, including: SPYDER (a reconfigurable processor development system), RENCO (a reconfigurable network computer), Firefly (an evolving machine), and the BioWatch (a self-repairing watch).
Abstract: Field-programmable gate arrays (FPGAs) are large, fast integrated circuits that can be modified, or configured, almost at any point by the end user. Within the domain of configurable computing, we distinguish between two modes of configurability: static, where the configurable processor's configuration string is loaded once at the outset, after which it does not change during execution of the task at hand, and dynamic, where the processor's configuration may change at any moment. This paper describes four applications in the domain of configurable computing, considering both static and dynamic systems, including: SPYDER (a reconfigurable processor development system), RENCO (a reconfigurable network computer), Firefly (an evolving machine), and the BioWatch (a self-repairing watch). While static configurability mainly aims at attaining the classical computing goal of improving performance, dynamic configurability might bring about an entirely new breed of hardware devices: ones that are able to adapt within dynamic environments.