
Showing papers in "IEEE Transactions on Computers in 1994"


Journal ArticleDOI
TL;DR: An additive fuzzy system can uniformly approximate any real continuous function on a compact domain to any degree of accuracy.
Abstract: An additive fuzzy system can uniformly approximate any real continuous function on a compact domain to any degree of accuracy. An additive fuzzy system approximates the function by covering its graph with fuzzy patches in the input-output state space and averaging patches that overlap. The fuzzy system computes a conditional expectation E[Y|X] if we view the fuzzy sets as random sets. Each fuzzy rule defines a fuzzy patch and connects commonsense knowledge with state-space geometry. Neural or statistical clustering systems can approximate the unknown fuzzy patches from training data. These adaptive fuzzy systems approximate a function at two levels. At the local level the neural system approximates and tunes the fuzzy rules. At the global level the rules or patches approximate the function.

1,282 citations
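
The patch-and-average construction is small enough to sketch in code. The following Python toy is not from the paper: the Gaussian set functions, the patch width, and the target function sin(x) are illustrative choices. It covers the graph with fuzzy patches and outputs the membership-weighted centroid, the conditional-expectation view mentioned above.

import math

def additive_fuzzy(x, rules, width=0.4):
    """Additive fuzzy system: average the rule patches that fire at x."""
    # Gaussian set membership is an assumption for illustration; the
    # approximation theorem covers broad classes of set shapes.
    weights = [math.exp(-((x - cx) / width) ** 2) for cx, _ in rules]
    total = sum(weights)
    # Membership-weighted centroid of the patch outputs: reads as the
    # conditional expectation E[Y|X=x] when patches are viewed as random sets.
    return sum(w * cy for w, (_, cy) in zip(weights, rules)) / total

# Cover the graph of sin on [0, pi] with seven fuzzy patches.
rules = [(c, math.sin(c)) for c in [i * math.pi / 6 for i in range(7)]]
for x in (0.5, 1.0, 2.0):
    print("x=%.1f  approx=%+.3f  sin=%+.3f" % (x, additive_fuzzy(x, rules), math.sin(x)))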


Journal ArticleDOI
TL;DR: High quality pseudorandom pattern generators built around rule 90 and 150 programmable cellular automata with a rule selector are proposed as running key generators in stream ciphers; both schemes provide better security against different types of attacks.
Abstract: This paper deals with the theory and application of Cellular Automata (CA) for a class of block ciphers and stream ciphers. Based on CA state transitions, certain fundamental transformations are defined which are block ciphering functions of the proposed enciphering scheme. These fundamental transformations are found to generate the simple (alternating) group of even permutations, which in turn is a subgroup of the permutation group. These functions are implemented with a class of programmable cellular automata (PCA) built around rules 51, 153, and 195. Further, high quality pseudorandom pattern generators built around rule 90 and 150 programmable cellular automata with a rule selector (i.e., combining function) are proposed as running key generators in stream ciphers. Both schemes provide better security against different types of attacks. With a simple, regular, modular and cascadable structure, CA-based schemes are ideally suited to VLSI implementation.

381 citations
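
A rough sketch of the hybrid rule-90/150 PCA idea in Python: each cell's rule is picked by a per-cell selector bit. The cell count, seed, and rule placement below are made up for illustration and are not one of the paper's maximal-length configurations.

def hybrid_ca_step(state, rule150_mask):
    """One step of a hybrid rule-90/150 cellular automaton (null boundaries).

    Cells where rule150_mask is 1 apply rule 150 (left XOR self XOR right);
    the others apply rule 90 (left XOR right)."""
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        nxt.append(left ^ (state[i] & rule150_mask[i]) ^ right)
    return nxt

state = [1, 0, 0, 0, 0, 0, 0, 0]   # illustrative 8-cell seed
mask = [0, 1, 0, 0, 1, 0, 1, 0]    # illustrative 90/150 rule selector
for _ in range(5):
    state = hybrid_ca_step(state, mask)
    print("".join(map(str, state)))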


Journal ArticleDOI
TL;DR: A reset subsystem is designed that can be embedded in an arbitrary distributed system in order to allow the system processes to reset the system when necessary, and is very robust: it can tolerate fail-stop failures and repairs of processes and channels, even when a reset is in progress.
Abstract: A reset subsystem is designed that can be embedded in an arbitrary distributed system in order to allow the system processes to reset the system when necessary. Our design is layered, and comprises three main components: a leader election, a spanning tree construction, and a diffusing computation. Each of these components is self-stabilizing in the following sense: if the coordination between the up-processes in the system is ever lost (due to failures or repairs of processes and channels), then each component eventually reaches a state where coordination is regained. This capability makes our reset subsystem very robust: it can tolerate fail-stop failures and repairs of processes and channels, even when a reset is in progress.

313 citations


Journal ArticleDOI
TL;DR: The vertex connectivity for the n-dimensional cube is obtained, and the minimal sets of faulty nodes that disconnect the cube are characterized.
Abstract: Introduces a new measure of conditional connectivity for large regular graphs by requiring each vertex to have at least g good neighbors in the graph. Based on this requirement, the vertex connectivity for the n-dimensional cube is obtained, and the minimal sets of faulty nodes that disconnect the cube are characterized.

270 citations


Journal ArticleDOI
D. Lee, Mihalis Yannakakis
TL;DR: In this paper, the complexity of finite-state machine testing has been studied and it has been shown that it is PSPACE-complete to determine whether a finite state machine has a preset distinguishing sequence.
Abstract: We study the complexity of two fundamental problems in the testing of finite-state machines. 1) Distinguishing sequences (state identification). We show that it is PSPACE-complete to determine whether a finite-state machine has a preset distinguishing sequence. There are machines that have distinguishing sequences, but only of exponential length. We give a polynomial time algorithm that determines whether a finite-state machine has an adaptive distinguishing sequence. (The previous classical algorithms take exponential time.) Furthermore, if there is an adaptive distinguishing sequence, then we give an efficient algorithm that constructs such a sequence of length at most n(n-1)/2 (which is the best possible), where n is the number of states. 2) Unique input/output sequences (state verification). It is PSPACE-complete to determine whether a state of a machine has a unique input/output sequence. There are machines whose states have unique input/output sequences but only of exponential length.

266 citations
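
Deciding whether a preset distinguishing sequence exists is PSPACE-complete, but verifying a candidate sequence is easy: apply it to every start state and check that the output strings are pairwise distinct. A small Python sketch on a hypothetical 3-state Mealy machine (the machine and alphabet are made up for illustration):

def is_preset_distinguishing(machine, inputs):
    """True if `inputs`, applied from every start state, yields a distinct
    output string per state -- the defining property of a preset
    distinguishing sequence.

    machine: {state: {input: (next_state, output)}}, a Mealy machine."""
    responses = set()
    for start in machine:
        s, out = start, []
        for a in inputs:
            s, o = machine[s][a]
            out.append(o)
        resp = tuple(out)
        if resp in responses:
            return False       # two start states answer identically
        responses.add(resp)
    return True

# Hypothetical 3-state machine over inputs {a, b}.
M = {
    0: {"a": (1, 0), "b": (0, 0)},
    1: {"a": (2, 1), "b": (0, 0)},
    2: {"a": (0, 0), "b": (2, 1)},
}
print(is_preset_distinguishing(M, "a"))   # False: states 0 and 2 collide
print(is_preset_distinguishing(M, "aa"))  # True: outputs 01, 10, 00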


Journal ArticleDOI
TL;DR: To mitigate false sharing and to enhance spatial locality, the layout of shared data in cache blocks is optimized in a programmer-transparent manner and it is shown that this approach can reduce the number of misses on shared data by about 10% on average.
Abstract: The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in the paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We show that this approach can reduce the number of misses on shared data by about 10% on average.

265 citations
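
The false-sharing effect itself is easy to reproduce in a toy model. The sketch below, a deliberately crude write-invalidate model over a made-up trace (not the paper's methodology), counts coherence misses when two processors update different words of one block, and again when the layout places the words in separate blocks:

def coherence_misses(writes, block_of):
    """Toy write-invalidate model: a write misses unless this processor
    already holds the block, and it invalidates every other copy.

    writes: sequence of (processor, variable); block_of: variable -> block."""
    holder = {}     # block -> processor currently holding a valid copy
    misses = 0
    for proc, var in writes:
        blk = block_of[var]
        if holder.get(blk) != proc:
            misses += 1
        holder[blk] = proc
    return misses

# P0 and P1 update different words in an interleaved fashion.
trace = [(0, "x"), (1, "y")] * 1000
print(coherence_misses(trace, {"x": 0, "y": 0}))  # same block: 2000 misses
print(coherence_misses(trace, {"x": 0, "y": 1}))  # separate blocks: 2 misses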


Journal ArticleDOI
TL;DR: A comprehensive study of new residue generators and MOMA's is presented, and four design schemes for n-input residue generators mod A, best suited for various pairs of n and A, are proposed.
Abstract: A residue generator is an essential building block of encoding/decoding circuitry for arithmetic error detecting codes and of binary-to-residue number system (RNS) converters. In either case, a residue generator is an overhead for a system and as such it should be built with a minimum amount of hardware and should not compromise the speed of the system. The multioperand modular adder (MOMA) is a computational element used to implement various operations in digital signal processing systems using RNS. A comprehensive study of new residue generators and MOMA's is presented. The design methods given here take advantage of the periodicity of the series of powers of 2 taken modulo A (A is a modulus). Four design schemes for the n-input residue generators mod A, which are best suited for various pairs of n and A, are proposed. Their pipelined versions can be clocked with the cycle determined by the delay of a full-adder and a latch. A family of design methods for parallel and word-serial circuits, using similar concepts, is also given. Both classes of circuits employ new highly-parallel schemes using carry-save adders with end-around carry and a minimal amount of ROM, and are well-suited for VLSI implementation. They are faster and use less hardware than similar circuits known to date. One of the MOMA's can be used to build a high-speed residue-to-binary converter based on the Chinese remainder theorem.

224 citations
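
The periodicity these designs rely on can be modeled in a few lines: the weights 2^i mod A eventually cycle, so the n input bits can be folded onto a small number of weight classes before a final modular sum. The Python below is a software model of that folding idea only, not of the paper's carry-save adder trees:

def residue_mod(bits, A):
    """n-input residue generator mod A: fold bit i onto the weight
    2^i mod A, exploiting that these weights repeat with the period of
    the series 2^0, 2^1, ... taken modulo A."""
    seen, powers, p = {}, [], 1 % A
    while p not in seen:                 # find transient + period of 2^i mod A
        seen[p] = len(powers)
        powers.append(p)
        p = (2 * p) % A
    start = seen[p]                      # index where the cycle begins
    period = len(powers) - start
    total = 0
    for i, b in enumerate(bits):
        j = i if i < len(powers) else start + (i - start) % period
        total += b * powers[j]
    return total % A

value = 0b101101101101
bits = [(value >> i) & 1 for i in range(12)]
assert residue_mod(bits, 11) == value % 11
print(residue_mod(bits, 11))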


Journal ArticleDOI
TL;DR: The problem of guaranteeing synchronous message deadlines in token ring networks where the timed token medium access control protocol is employed is studied, and a normalized proportional allocation scheme is proposed that can guarantee synchronous message deadlines for synchronous traffic of up to 33% of available utilization.
Abstract: We study the problem of guaranteeing synchronous message deadlines in token ring networks where the timed token medium access control protocol is employed. Synchronous bandwidth, defined as the maximum time for which a node can transmit its synchronous messages every time it receives the token, is a key parameter in the control of synchronous message transmission. To ensure the transmission of synchronous messages before their deadlines, synchronous capacities must be properly allocated to individual nodes. We address the issue of appropriate allocation of the synchronous capacities. Several synchronous bandwidth allocation schemes are analyzed in terms of their ability to satisfy deadline constraints of synchronous messages. We show that an inappropriate allocation of the synchronous capacities could cause message deadlines to be missed, even if the synchronous traffic is extremely low. We propose a scheme, called the normalized proportional allocation scheme, which can guarantee the synchronous message deadlines for synchronous traffic of up to 33% of available utilization.

177 citations
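
The normalized proportional scheme itself is a one-line formula: node i receives H_i = (U_i / U) * (TTRT - tau), where U_i = C_i / P_i is the stream's utilization, U the total synchronous utilization, TTRT the target token rotation time, and tau the per-rotation overhead. A sketch with made-up parameters:

def normalized_proportional(streams, ttrt, tau):
    """Normalized proportional allocation: H_i = (U_i / U) * (TTRT - tau),
    dividing the usable bandwidth TTRT - tau in proportion to each node's
    synchronous utilization U_i = C_i / P_i."""
    utils = [c / p for c, p in streams]
    total = sum(utils)
    return [u / total * (ttrt - tau) for u in utils]

# Three synchronous streams (C_i, P_i); TTRT and tau values are made up.
streams = [(2.0, 20.0), (1.0, 40.0), (3.0, 60.0)]
H = normalized_proportional(streams, ttrt=8.0, tau=1.0)
print([round(h, 3) for h in H], "sum =", sum(H))  # the H_i sum to TTRT - tau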


Journal ArticleDOI
TL;DR: The paper compares the trace-sampling techniques of set sampling and time sampling using the multi-billion reference traces of A. Borg et al. (1990) and applies both techniques to multi-megabyte caches, where sampling is most valuable, finding that set sampling meets the 10% sampling goal, while time sampling does not.
Abstract: The paper compares the trace-sampling techniques of set sampling and time sampling. Using the multi-billion reference traces of A. Borg et al. (1990), we apply both techniques to multi-megabyte caches, where sampling is most valuable. We evaluate whether either technique meets a 10% sampling goal: a method meets this goal if, at least 90% of the time, it estimates the trace's true misses per instruction with ≤10% relative error using ≤10% of the trace. Results for these traces and caches show that set sampling meets the 10% sampling goal, while time sampling does not. We also find that cold-start bias in time samples is most effectively reduced by the technique of D.A. Wood et al. (1991). Nevertheless, overcoming cold-start bias requires tens of millions of consecutive references.

147 citations


Journal ArticleDOI
TL;DR: An analytical model is presented for the performance evaluation of hypercube computers, aimed at modeling a deadlock-free wormhole routing scheme prevalent on second generation hypercube systems, and extended to virtual cut-through routing and random wormhole routing techniques.
Abstract: We present an analytical model for the performance evaluation of hypercube computers. This analysis is aimed at modeling a deadlock-free wormhole routing scheme prevalent on second generation hypercube systems. Probability of blocking and average message delay are the two performance measures discussed. We start with the communication traffic to find the probability of blocking. The traffic analysis can capture any message destination distribution. Next, we find the average message delay that consists of two parts. The first part is the actual message transfer delay between any source and destination nodes. The second part of the delay is due to blocking caused by the wormhole routing scheme. The analysis is also extended to virtual cut-through routing and random wormhole routing techniques. The validity of the model is demonstrated by comparing analytical results with those from simulation.

127 citations


Journal ArticleDOI
TL;DR: A new method for polynomial interpolation in hardware, based on an interleaved memory function interpolator, with advantages demonstrated by its application to an accurate logarithmic number system (LNS) arithmetic unit.
Abstract: This paper describes a new method for polynomial interpolation in hardware, with advantages demonstrated by its application to an accurate logarithmic number system (LNS) arithmetic unit. The use of an interleaved memory reduces storage requirements by allowing each stored function value to be used in interpolation across several segments. This strategy can be shown to always use fewer words of memory than an optimized polynomial with stored polynomial coefficients. Interleaved memory function interpolators are then applied to the specific goal of an accurate logarithmic number system arithmetic unit. Many accuracy requirements for the LNS arithmetic unit are possible. Although round to nearest would be desirable, it cannot be easily achieved. The goal suggested is to ensure that the worst case LNS relative error is smaller than the worst case floating point (FP) relative error. Using the interleaved memory interpolator, the detailed design of an LNS arithmetic unit is performed using a second order polynomial interpolator including approximately 91K bits of ROM. This arithmetic unit has better accuracy and less complexity than previous LNS units.

Journal ArticleDOI
TL;DR: Hardware designs that produce exactly rounded results for the functions of reciprocal, square root, 2^x, and log2(x) are presented, and delay and area comparisons are made based on the degree of the approximating polynomial and the accuracy of the final result.
Abstract: This paper presents hardware designs that produce exactly rounded results for the functions of reciprocal, square root, 2^x, and log2(x). These designs use polynomial approximation in which the terms in the approximation are generated in parallel, and then summed by using a multi-operand adder. To reduce the number of terms in the approximation, the input interval is partitioned into subintervals of equal size, and different coefficients are used for each subinterval. The coefficients used in the approximation are initially determined based on the Chebyshev series approximation. They are then adjusted to obtain exactly rounded results for all inputs. Hardware designs are presented, and delay and area comparisons are made based on the degree of the approximating polynomial and the accuracy of the final result. For single-precision floating point numbers, a design that produces exactly rounded results for all four functions has an estimated delay of 80 ns and a total chip area of 98 mm^2 in a 1.0-micron CMOS technology. Allowing the results to have a maximum error of one unit in the last place reduces the computational delay by 5% to 30% and the area requirements by 33% to 77%.
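
The starting point of such designs, an independent polynomial per subinterval fitted at Chebyshev nodes, is easy to prototype; the paper's subsequent coefficient adjustment for exact rounding is the hard part and is omitted here. A Python sketch for 1/x on [1, 2) with 16 subintervals (the interval count and degree are illustrative choices):

import math

def cheb_poly(f, a, b, degree=2):
    """Interpolate f at the Chebyshev nodes of [a, b]; returns monomial
    coefficients c[0..degree].  This is only the initial fit, before any
    adjustment of coefficients toward exactly rounded outputs."""
    n = degree + 1
    xs = [(a + b) / 2 + (b - a) / 2 * math.cos((2 * k + 1) * math.pi / (2 * n))
          for k in range(n)]
    coef = [f(x) for x in xs]
    for j in range(1, n):                 # Newton divided differences
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    poly = [0.0] * n                      # expand Newton form via Horner steps
    for i in range(n - 1, -1, -1):
        shifted = [0.0] + poly[:-1]       # poly * x
        poly = [s - xs[i] * p for s, p in zip(shifted, poly)]
        poly[0] += coef[i]
    return poly

# 16 equal subintervals of [1, 2), one small coefficient table per interval.
tables = [cheb_poly(lambda t: 1.0 / t, 1 + i / 16, 1 + (i + 1) / 16)
          for i in range(16)]
x = 1.37
c = tables[int((x - 1) * 16)]
print(c[0] + c[1] * x + c[2] * x * x, 1.0 / x)   # close agreement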

Journal ArticleDOI
TL;DR: It is demonstrated that the verification of the circuit design for the hidden weighted bit function proposed by Bryant can be carried out efficiently in terms of FBDD's, while this is, for principal reasons, impossible in terms of OBDD's.
Abstract: OBDD's are the state-of-the-art data structure for Boolean function manipulation. Basic tasks of Boolean manipulation such as equivalence test, satisfiability test, tautology test and single Boolean synthesis steps can be performed efficiently in terms of fixed ordered OBDD's. The bottleneck of most OBDD-applications is the size of the represented Boolean functions, since the total computation merely remains tractable as long as the OBDD-representations remain of reasonable size. Since OBDD's are known to be restricted FBDD's (free BDD's, i.e., BDD's that test, on each path, each input variable at most once), and FBDD-representations are often much more (sometimes even exponentially more) concise than OBDD-representations, we propose to work with a more general FBDD-based data structure. We show that FBDD's of a fixed type provide, similar to OBDD's of a fixed variable ordering, canonical representations of Boolean functions, and that basic tasks of Boolean manipulation can be performed in terms of fixed typed FBDD's about as efficiently as in terms of fixed ordered OBDD's. In order to demonstrate the power of the FBDD-concept, we show that the verification of the circuit design for the hidden weighted bit function proposed by Bryant can be carried out efficiently in terms of FBDD's, while this is, for principal reasons, impossible in terms of OBDD's.

Journal ArticleDOI
TL;DR: A novel architecture for a fault-tolerant multiprocessor environment that achieves the performance of a triple modular redundant system using only duplex system redundancy and requires no rollbacks for recovering from single faults is proposed.
Abstract: We propose a novel architecture for a fault-tolerant multiprocessor environment. It is assumed that the multiprocessor organization consists of a pool of active processing modules and either a small number of spare modules or active modules with some spare processing capacity. A fault-tolerance scheme is developed for duplex systems using checkpoints. Our scheme, unlike traditional checkpointing schemes, requires no rollbacks for recovering from single faults. The objective is to achieve performance of a triple modular redundant system using duplex system redundancy.

Journal ArticleDOI
TL;DR: A method is described that exploits properties of standing waves to substantially reduce clock skews due to unequal path lengths, for distribution network diameters up to several meters.
Abstract: The design of a synchronous system having a global clock must account for propagation-delay-induced phase shifts experienced by the clock signal (clock skew) in its distribution network. As clock speeds and system diameters increase, this requirement becomes increasingly constraining on system designs. The paper describes a method that exploits properties of standing waves to substantially reduce clock skews due to unequal path lengths, for distribution network diameters up to several meters. The basic principles are developed for a loaded transmission line, and then applied to an arbitrary branching tree of such lines to implement a clock distribution network. The extension of this method to two- and three-dimensional distribution media is also presented, suggesting the feasibility of implementing printed circuit board clock planes exhibiting negligible phase shift over their extents.

Journal ArticleDOI
TL;DR: It is shown that the new approach maintains the high throughput of previous schemes, yet needs lower hardware overhead and achieves higher fault coverage than the previous schemes of J.Y. Jou and D.I. Tao.
Abstract: Algorithm-based fault tolerance (ABFT) is a low-overhead system-level fault tolerance technique. Many ABFT schemes have been proposed in the past for fast Fourier transform (FFT) networks. In this paper, a new ABFT scheme for FFT networks is proposed. We show that the new approach maintains the high throughput of previous schemes, yet needs lower hardware overhead and achieves higher fault coverage than the previous schemes of J.Y. Jou et al. (1988) and D.I. Tao et al. (1990).
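
As a flavor of low-overhead concurrent checking for FFTs, the sketch below uses two DFT identities, X[0] = sum(x) and sum(X) = N*x[0], as checksums around an unmodified radix-2 FFT. This is a generic checksum-style test in the spirit of ABFT, not the specific encoding proposed in the paper:

import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

def abft_check(x, X, tol=1e-9):
    """Checksum test from two DFT identities: X[0] equals sum(x), and
    sum(X) equals N*x[0].  A cheap concurrent error check."""
    return (abs(X[0] - sum(x)) < tol and
            abs(sum(X) - len(x) * x[0]) < tol)

x = [complex(i % 5, 0) for i in range(16)]
X = fft(x)
print(abft_check(x, X))   # True on the fault-free result
X[3] += 0.5               # inject an error in one output
print(abft_check(x, X))   # False: the checksum flags it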

Journal ArticleDOI
TL;DR: New hardware-based algorithms for computing the common elementary functions (division, logarithm, reciprocal square root, arc tangent, sine and cosine) exploit microscopic parallelism using specialized hardware, with heavy use of truncation based on detailed accuracy analysis.
Abstract: As the name suggests, elementary functions play a vital role in scientific computations. Yet due to their inherent nature, they are a considerable computing task by themselves. Not surprisingly, since the dawn of computing, the goal of speeding up elementary function computation has been pursued. This paper describes new hardware based algorithms for the computation of the common elementary functions, namely division, logarithm, reciprocal square root, arc tangent, sine and cosine. These algorithms exploit microscopic parallelism using specialized hardware with heavy use of truncation based on detailed accuracy analysis. The contribution of this work lies in the fact that these algorithms are very fast and yet accurate. If we let the time to perform an IEEE Standard 754 double precision floating point multiplication be τ, our algorithms achieve roughly 3.68τ, 4.56τ, 5.25τ, 3.69τ, 7.06τ, and 6.5τ for division, logarithm, square root, exponential, arc tangent, and complex exponential (sine and cosine), respectively. The trade-off is the need for tables and some specialized hardware. The total amount of tables required, however, is less than 128 Kbytes. We discuss the hardware, algorithmic and accuracy aspects of these algorithms.

Journal ArticleDOI
TL;DR: The authors present a simple and efficient nonblocking shared FIFO queue algorithm with O(n) system latency, no additional memory requirements, and enqueuing and dequeuing times independent of the size of the queue.
Abstract: Nonblocking algorithms for concurrent objects guarantee that an object is always accessible, in contrast to blocking algorithms in which a slow or halted process can render part or all of the data structure inaccessible to other processes. A number of algorithms have been proposed for shared FIFO queues, but nonblocking implementations are few and either limit the concurrency or provide inefficient solutions. The authors present a simple and efficient nonblocking shared FIFO queue algorithm with O(n) system latency, no additional memory requirements, and enqueuing and dequeuing times independent of the size of the queue. They use the compare & swap operation as the basic synchronization primitive. They model their algorithm analytically and with a simulation, and compare its performance with that of a blocking FIFO queue. They find that the nonblocking queue has better performance if processors are occasionally slow, but worse performance if some processors are always slower than others.
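
The retry discipline behind such algorithms (read the shared state, compute a new value, compare & swap it in, and start over on interference) can be sketched compactly. The Python below simulates CAS with a lock, since Python exposes no hardware CAS, and swaps whole immutable queue states; this is far coarser than the paper's algorithm but shows the nonblocking pattern:

import threading

class AtomicRef:
    """Atomic cell with compare&swap, simulated with a lock; hardware
    provides this as a single instruction."""
    def __init__(self, value):
        self._value, self._lock = value, threading.Lock()
    def load(self):
        return self._value
    def cas(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class NonblockingQueue:
    """FIFO where each operation installs a whole new immutable state via
    CAS, retrying on interference.  Much coarser than the paper's queue,
    but it shows the read/compute/CAS/retry discipline."""
    def __init__(self):
        self._state = AtomicRef(())
    def enqueue(self, item):
        while True:
            old = self._state.load()
            if self._state.cas(old, old + (item,)):
                return
    def dequeue(self):
        while True:
            old = self._state.load()
            if not old:
                return None
            if self._state.cas(old, old[1:]):
                return old[0]

q = NonblockingQueue()
for i in range(3):
    q.enqueue(i)
print(q.dequeue(), q.dequeue(), q.dequeue())   # 0 1 2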

Journal ArticleDOI
Peter Kornerup
TL;DR: It is shown how the multiplier, with some simple back-end connections, can compute modular inverses and perform modular division for a power of two as modulus.
Abstract: A very simple multiplier cell is developed for use in a linear, purely systolic array forming a digit-serial multiplier for unsigned or 2's-complement operands. Each cell produces two digit-product terms and accumulates these into a previous sum of the same weight, developing the product least significant digit first. Grouping two terms per cell, the ratio of active elements to latches is low, and only ⌈n/2⌉ cells are needed for a full n by n multiply. A modular multiplier is then developed by incorporating a Montgomery type of modulo reduction. Two such multipliers interconnect to form a purely systolic modular exponentiator, capable of performing RSA encryption at very high clock frequencies, but with a low gate count and small area. It is also shown how the multiplier, with some simple back-end connections, can compute modular inverses and perform modular division for a power of two as modulus.
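
The Montgomery reduction that the multiplier folds into its systolic array can be modeled arithmetically in a few lines. This software sketch (word sizes and moduli are illustrative; the paper's contribution is the digit-serial hardware mapping, not the mathematics) chains Montgomery products into a modular exponentiation, the RSA core operation:

def mont_mul(a, b, n, r_bits):
    """Montgomery product a*b*R^{-1} mod n, with R = 2^r_bits (n odd, n < R)."""
    R = 1 << r_bits
    n_prime = (-pow(n, -1, R)) % R      # n * n' == -1 (mod R); Python 3.8+
    t = a * b
    m = (t * n_prime) % R
    u = (t + m * n) >> r_bits           # exactly divisible by R
    return u - n if u >= n else u

def mont_exp(base, exp, n, r_bits=16):
    """Left-to-right modular exponentiation via chained Montgomery products."""
    R = 1 << r_bits
    acc = R % n                          # Montgomery form of 1
    x = (base * R) % n                   # convert base into Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, r_bits)
        if bit == "1":
            acc = mont_mul(acc, x, n, r_bits)
    return mont_mul(acc, 1, n, r_bits)   # convert back out

assert mont_exp(123, 65537, 46727) == pow(123, 65537, 46727)
print(mont_exp(123, 65537, 46727))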

Journal ArticleDOI
TL;DR: It is proved that enhanced hypercubes, which are already known to improve on regular hypercubes in measurements such as mean internode distance, diameter and traffic density, also achieve improvements in diagnosability.
Abstract: An enhanced hypercube is obtained by adding 2^(n-1) more links to a regular hypercube of 2^n processors. It has been shown that enhanced hypercubes have very good improvements over regular hypercubes in many measurements such as mean internode distance, diameter and traffic density. This paper proves that enhanced hypercubes also achieve improvements in diagnosability. Two diagnosis strategies, both using the well-known PMC diagnostic model, are studied: the precise (one-step) strategy proposed by Preparata, Metze and Chien (1967), and the pessimistic strategy proposed by Friedman (1975). Under the precise strategy, the diagnosability is shown to increase to n+1 in enhanced hypercubes (in regular hypercubes, the diagnosability is n under this strategy). Under the pessimistic strategy, the diagnosability is shown to increase to 2n (in regular hypercubes, the diagnosability under this strategy is 2n-2). Since the failure probability of a single node is fairly low nowadays, an increase of diagnosability by one or two considerably enhances the system's self-diagnostic capability; given also that diagnosability does not "easily" grow as links are added to a network, these improvements are noticeable.

Journal ArticleDOI
TL;DR: The performance of large symmetric tree networks is examined by aggregating the component links and processors into a single equivalent processor, and closed form solutions for the minimum finish time and the optimal data allocation are obtained.
Abstract: Optimal load allocation for load sharing a divisible job over processors interconnected in either a bus or a tree network is considered. The processors are either equipped with front-end processors or not so equipped. Closed form solutions for the minimum finish time and the optimal data allocation for each processor are obtained. The performance of large symmetric tree networks is examined by aggregating the component links and processors into a single equivalent processor. This allows an easy examination of large tree networks. In addition, it becomes possible to find a closed form solution for the optimal amount of data that is to be assigned to each processor in the tree network in order to achieve the minimum finish time.
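
For the bus case with front-end processors, a closed form drops out of the equal-finish-time condition: alpha_{i+1} = alpha_i * w_i*Tcp / (z*Tcm + w_{i+1}*Tcp), normalized so the fractions sum to 1. A sketch with made-up speeds; the variable names follow common divisible-load notation and are not necessarily the paper's:

def bus_load_fractions(w, z, tcp, tcm):
    """Divisible-load allocation on a bus network with front-end processors,
    so node i computes while later nodes still receive data.  Equal finish
    times give alpha_{i+1} = alpha_i * w_i*Tcp / (z*Tcm + w_{i+1}*Tcp).

    w: per-node inverse compute speeds; z: inverse bus speed."""
    alpha = [1.0]
    for i in range(len(w) - 1):
        alpha.append(alpha[-1] * w[i] * tcp / (z * tcm + w[i + 1] * tcp))
    s = sum(alpha)
    alpha = [a / s for a in alpha]       # fractions must sum to 1
    # Finish time of node i: bus time for nodes 1..i, then its own compute.
    finish = [sum(alpha[:i + 1]) * z * tcm + alpha[i] * w[i] * tcp
              for i in range(len(w))]
    return alpha, finish

alpha, finish = bus_load_fractions(w=[1.0, 2.0, 1.5], z=0.2, tcp=1.0, tcm=1.0)
print([round(a, 4) for a in alpha])
print([round(f, 4) for f in finish])   # all equal: the optimality condition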

Journal ArticleDOI
TL;DR: It is shown that the diagonal mesh outperforms the toroidal mesh in all cases, and thus provides an attractive alternative to the toroidal mesh network.
Abstract: Diagonal and toroidal meshes are degree-4 point-to-point interconnection networks suitable for connecting communication elements in parallel computers, particularly multicomputers. The two networks have a similar structure. The toroidal mesh is popular and well-studied, whereas the diagonal mesh is relatively new. In this paper, we show that the diagonal mesh has a smaller diameter and a larger bisection width. It also retains advantages of the toroidal mesh network such as a simple rectangular structure, wirability and scalability. An optimal self-routing algorithm is developed for these networks. Using this algorithm and the existing routing algorithm for the toroidal mesh, we simulate and compare the performance of these two networks with N=35×71=2485, N=49×99=4851, and N=69×139=9591 nodes under a constant system load with a fixed number of messages. Deflection routing is used to resolve conflicts. The effects of various deflection criteria are also investigated. We show that the diagonal mesh outperforms the toroidal mesh in all cases, and thus provides an attractive alternative to the toroidal mesh network.

Journal ArticleDOI
TL;DR: Tasks that require linear chain, ring, mesh, and torus structures, which are quite useful in parallel and pipeline computations, are considered. The techniques are based on a key concept called free dimension, which can be used to partition a cube into subcubes such that each subcube contains at most one faulty node.
Abstract: Fault tolerance in hypercubes is achieved by exploiting inherent redundancy and executing tasks on faulty hypercubes. The authors consider tasks that require linear chain, ring, mesh, and torus structures, which are quite useful in parallel and pipeline computations. They assume the number of faults is on the order of the number of dimensions of the hypercube. The techniques are based on a key concept called free dimension, which can be used to partition a cube into subcubes such that each subcube contains at most one faulty node. Subgraphs are embedded in each subcube and then merged to form the entire graph.
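
A simplified reading of the free-dimension idea can be sketched as a recursive partition: repeatedly pick a dimension that splits the remaining faults between the two half-cubes until every subcube holds at most one fault. The Python below is that simplification only, not the paper's full embedding machinery:

def split_by_free_dimensions(n, faults):
    """Partition an n-cube so each resulting subcube holds at most one
    faulty node, by recursively choosing a splitting dimension that
    separates at least one pair of faults.

    faults: iterable of node labels (ints with n significant bits)."""
    def recurse(dims, faults):
        if len(faults) <= 1:
            return [(dims, faults)]          # subcube with <= 1 fault
        for d in range(n):
            if d in dict(dims):
                continue
            zero = [f for f in faults if not (f >> d) & 1]
            one = [f for f in faults if (f >> d) & 1]
            if zero and one:                 # dimension d separates some faults
                return (recurse(dims + [(d, 0)], zero) +
                        recurse(dims + [(d, 1)], one))
        raise ValueError("duplicate faulty nodes cannot be separated")
    return recurse([], list(set(faults)))

# 4-cube with three faulty nodes.
for fixed, flt in split_by_free_dimensions(4, [0b0011, 0b1011, 0b0110]):
    print("fixed bits", fixed, "-> faults", [bin(f) for f in flt])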

Journal ArticleDOI
TL;DR: A division algorithm in which the quotient-digit selection is performed by rounding the shifted residual in carry-save form is presented, and several convenient values of the radix are selected.
Abstract: A division algorithm in which the quotient-digit selection is performed by rounding the shifted residual in carry-save form is presented. To allow the use of this simple function, the divisor (and dividend) is prescaled to a range close to one. The implementation presented results in a fast iteration because of the use of carry-save forms and suitable recodings. The execution time is calculated and several convenient values of the radix are selected. Comparison with other dividers for radices 2^9 to 2^18 is performed using the same assumptions.
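
The recurrence is short enough to simulate. In the sketch below, floating point stands in for the carry-save residual, prescaling uses an exact reciprocal where hardware would use a small table, and radix 512 is one value from the paper's comparison range; quotient digit j is literally round(residual * radix) because the scaled divisor is close to 1:

def divide_by_rounding(x, d, radix=512, digits=6):
    """Digit-recurrence division with quotient-digit selection by rounding,
    enabled by prescaling the divisor to a range close to one."""
    scale = 1.0 / d
    ds, w = d * scale, x * scale        # scaled divisor is ~1
    q = 0.0
    for j in range(1, digits + 1):
        qj = round(w * radix)           # selection is just a rounding
        w = w * radix - qj * ds         # next shifted residual
        q += qj * radix ** -j
    return q

x, d = 0.71, 0.93
print(divide_by_rounding(x, d), x / d)   # the two values agree closely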

Journal ArticleDOI
TL;DR: Formal results for precedence constrained, real-time scheduling of unit time tasks are extended to arbitrarily timed tasks with preemption, and an exact characterisation of the EDF-like schedulers that can be used to transparently enforce precedence constraints among tasks is shown.
Abstract: Formal results for precedence constrained, real-time scheduling of unit time tasks are extended to arbitrarily timed tasks with preemption. An exact characterisation of the EDF-like schedulers that can be used to transparently enforce precedence constraints among tasks is shown. These extended results are then integrated with a well-known protocol that handles real-time scheduling of tasks with shared resources but does not consider precedence constraints. This results in schedulability formulas for task sets that allow preemption, shared resources, and precedence constraints, and a practical algorithm for many real-time uniprocessor systems.
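
One classical way to make a plain EDF scheduler respect precedence, consistent with the family of EDF-like schedulers characterized here, is to modify releases forward and deadlines backward along the precedence DAG: r_i' = max(r_i, r_pred' + C_pred) and d_i' = min(d_i, d_succ' - C_succ). A Python sketch on a hypothetical three-task chain:

def topo_order(tasks, succ):
    """Depth-first topological order of the precedence DAG."""
    order, seen = [], set()
    def visit(t):
        if t not in seen:
            seen.add(t)
            for s in succ.get(t, []):
                visit(s)
            order.append(t)
    for t in tasks:
        visit(t)
    return order[::-1]

def enforce_precedence(tasks, succ):
    """Release/deadline modification so plain EDF respects precedence.

    tasks: {name: (release, wcet, deadline)}; succ: {name: [successors]}."""
    pred = {t: [] for t in tasks}
    for t, ss in succ.items():
        for s in ss:
            pred[s].append(t)
    order = topo_order(tasks, succ)
    r = {t: tasks[t][0] for t in tasks}
    d = {t: tasks[t][2] for t in tasks}
    for t in order:                       # releases flow forward
        for p in pred[t]:
            r[t] = max(r[t], r[p] + tasks[p][1])
    for t in reversed(order):             # deadlines flow backward
        for s in succ.get(t, []):
            d[t] = min(d[t], d[s] - tasks[s][1])
    return r, d

tasks = {"A": (0, 2, 10), "B": (0, 3, 9), "C": (0, 1, 20)}
r, d = enforce_precedence(tasks, succ={"A": ["B"], "B": ["C"]})
print(r, d)   # A's deadline tightens to 6, so EDF runs A, then B, then C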

Journal ArticleDOI
TL;DR: A new scheme for designing error detecting and error correcting codes around cellular automata (CA) is reported and a CA-based hardware scheme for very fast decoding (and correcting) of the codewords is also reported.
Abstract: A new scheme for designing error detecting and error correcting codes around cellular automata (CA) is reported. A simple and efficient scheme for generating SEC-DED codes is presented which can also be extended for generating codes with higher distances. A CA-based hardware scheme for very fast decoding (and correcting) of the codewords is also reported.

Journal ArticleDOI
TL;DR: For some sizes of shift register, the maximal-length LFSR implementation requires more than a single gate, and for some, the discrete logarithm calculation is hard; the paper proposes the use of certain one-gate LFSR's whose sequence lengths are nearly maximal and which support easy discrete logarithms.
Abstract: A linear feedback shift register, or LFSR, can implement an event counter by shifting whenever an event occurs. A single two-input exclusive-OR gate is often the only additional hardware necessary to allow a shift register to generate, by successive shifts, all of its possible nonzero values. The counting application requires that the number of shifts be recoverable from the LFSR contents so that further processing and analysis may be done. Recovering this number from the shift register value corresponds to a problem from number theory and cryptography known as the discrete logarithm. For some sizes of shift register, the maximal-length LFSR implementation requires more than a single gate, and for some, the discrete logarithm calculation is hard. The paper proposes for such sizes the use of certain one-gate LFSR's whose sequence lengths are nearly maximal, and which support easy discrete logarithms. These LFSR's have a concise mathematical characterization, and are quite common. The paper concludes by describing an application of these ideas in a computer hardware monitor, and by presenting a table that describes efficient LFSR's of size up to 64 bits.
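
A one-gate counter of this kind is a few lines of code. The sketch below uses the primitive trinomial x^7 + x + 1, so a 7-bit LFSR with a single XOR gate cycles through all 127 nonzero states, and recovers the event count with a brute-force discrete-log table; the paper's point is to pick LFSR's whose logarithm can be computed more cleverly than this:

def lfsr_step(state, taps=(6, 5), nbits=7):
    """Shift once; with two taps the feedback is a single XOR gate.
    Taps (6, 5) under this shift convention realize x^7 + x + 1, which
    is primitive, so all 127 nonzero states are visited."""
    fb = ((state >> taps[0]) ^ (state >> taps[1])) & 1
    return ((state << 1) | fb) & ((1 << nbits) - 1)

# Build the discrete-log table by brute force (the expensive route).
log_table, s = {}, 1
for i in range(127):
    log_table[s] = i
    s = lfsr_step(s)

s = 1
for _ in range(83):        # 83 events shift the counter 83 times
    s = lfsr_step(s)
print(log_table[s])        # -> 83, the recovered event count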

Journal ArticleDOI
TL;DR: The paper shows that a decoder implemented using the new power-sum circuit will have less complex circuitry and shorter decoding delay than one implemented using conventional product-sum multipliers.
Abstract: A systolic power-sum circuit designed to perform AB^2+C computations in the finite field GF(2^m) is presented, where A, B, and C are arbitrary elements of GF(2^m). This new circuit is constructed from m^2 identical cells, each of which consists of three 2-input AND gates, one 2-input XOR gate, one 3-input XOR gate, and ten latches. The AB^2+C computation is required in decoding many error-correcting codes. The paper shows that a decoder implemented using the new power-sum circuit will have less complex circuitry and shorter decoding delay than one implemented using conventional product-sum multipliers.
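
The AB^2 + C operation is easy to model in software, which makes the circuit's function concrete. The sketch below works in GF(2^4) with the field polynomial x^4 + x + 1; both choices are illustrative, and the paper's contribution is the systolic array of m^2 cells, not this bit-loop:

def gf_mul(a, b, poly=0b10011, m=4):
    """Multiply in GF(2^m): carry-less product with reduction by the
    field polynomial (x^4 + x + 1 here)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly        # poly includes the x^m term, clearing bit m
    return r

def power_sum(A, B, C):
    """A*B^2 + C in GF(2^m): the operation the systolic circuit computes."""
    return gf_mul(A, gf_mul(B, B)) ^ C

print(bin(power_sum(0b0110, 0b0011, 0b1010)))   # -> 0b111 in GF(2^4)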

Journal ArticleDOI
TL;DR: The Chaos router, a randomizing, nonminimal adaptive packet router, is introduced; it is shown to be deadlock free and probabilistically livelock free, and performance results are presented for a variety of workloads.
Abstract: The Chaos router, a randomizing, nonminimal adaptive packet router, is introduced. Adaptive routers allow messages to dynamically select paths, depending on network traffic, and bypass congested nodes. This flexibility contrasts with oblivious packet routers, where the path of a packet is statically determined at the source node. A key advancement of the Chaos router over previous nonminimal routers is the use of randomization to eliminate the need for livelock protection. This simplifies adaptive routing to be of approximately the same complexity along the critical decision path as an oblivious router. The primary cost is that the Chaos router is probabilistically livelock free rather than deterministically livelock free, but evidence is presented implying that these are equivalent in practice. The principal advantage is excellent performance for nonuniform traffic patterns. The Chaos router is described, it is shown to be deadlock free and probabilistically livelock free, and performance results are presented for a variety of workloads.

Journal ArticleDOI
TL;DR: The authors present a model that can accurately evaluate the performance of single-buffered and multibuffered MIN's (multistage interconnection networks) with 2×2 switching elements (SE's), and show that the proposed model is consistently much more accurate than previous ones, irrespective of the size of the network, the buffers, and the traffic condition.
Abstract: Multistage interconnection networks (MIN's) have a number of applications in the areas of computing and communication. While several analytical models have been proposed for the performance evaluation of MIN's, they are either not very accurate or too complex to be generalized. The authors propose a new model for evaluating multibuffered MIN's with 2×2 switching elements. It effectively and realistically models the correlation of packet movements between two adjacent stages as well as across subsequent network cycles. As a result, the proposed model is very accurate for MIN's of any size and under any traffic conditions. It is also simple and can be easily generalized.