
Showing papers in "IEEE Transactions on Computers in 1985"


Journal ArticleDOI
TL;DR: In this article, the authors presented a new class of universal routing networks, called fat-trees, which might be used to interconnect the processors of a general-purpose parallel supercomputer, and proved that a fat-tree of a given size is nearly the best routing network of that size.
Abstract: The author presents a new class of universal routing networks, called fat-trees, which might be used to interconnect the processors of a general-purpose parallel supercomputer. A fat-tree routing network is parameterized not only in the number of processors, but also in the amount of simultaneous communication it can support. Since communication can be scaled independently from the number of processors, substantial hardware can be saved for such applications as finite-element analysis without resorting to a special-purpose architecture. It is proved that a fat-tree of a given size is nearly the best routing network of that size. This universality theorem is established using a three-dimensional VLSI model that incorporates wiring as a direct cost. In this model, hardware size is measured as physical volume. It is proved that for any given amount of communications hardware, a fat-tree built from that amount of hardware can simulate every other network built from the same amount of hardware, using only slightly more time (a polylogarithmic factor greater).

1,147 citations


Journal ArticleDOI
G. F. Pfister, V. A. Norton
TL;DR: Even moderate hot spot traffic was found to cause very significant degradation, severely degrading all memory access, not just access to shared lock locations, due to an effect the authors call tree saturation; the technique of message combining was found to be an effective means of eliminating this problem if it arises due to lock or synchronization contention.
Abstract: The combining of messages within a multistage switching network has been proposed to reduce memory contention in highly parallel shared-memory multiprocessors, especially for shared lock and synchronization data. A quantitative investigation of the performance impact of such contention and the effectiveness of combining in reducing this impact is reported. The effect of a nonuniform traffic pattern consisting of a single hot spot of higher access rate superimposed on a background of uniform traffic was investigated. The potential degradation due to even moderate hot spot traffic was found to be very significant, severely degrading all memory access, not just access to shared lock locations, due to an effect the authors call tree saturation. The technique of message combining was found to be an effective means of eliminating this problem if it arises due to lock or synchronization contention.
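For intuition, the hot-spot effect admits a standard back-of-the-envelope calculation often quoted alongside this result: if each of N processors issues requests at rate lam and a fraction h of all requests target one module, the hot module sees lam·(1 + h·(N-1)) requests per cycle, which caps the sustainable per-processor rate at 1/(1 + h·(N-1)) and the total throughput near 1/h for large N. The sketch below is not code from the paper; the sizes and hot-spot fractions are assumed, and the tree-saturation effect identified in the paper makes real degradation worse than this idealized, queueing-free bound.

```python
# Back-of-the-envelope hot-spot analysis (illustrative, not from the paper).
# Assumptions: N processors, N memory modules, each processor issues requests at
# rate lam (requests per memory cycle); a fraction h of all requests go to a single
# "hot" module, the rest are spread uniformly over all N modules.
def hot_module_load(N, h, lam):
    """Requests per cycle arriving at the hot module."""
    return lam * N * (h + (1.0 - h) / N)   # = lam * (1 + h*(N-1))

def max_rate_per_processor(N, h):
    """Largest lam for which the hot module (capacity 1 request/cycle) keeps up."""
    return 1.0 / (1.0 + h * (N - 1))

if __name__ == "__main__":
    for N in (64, 256, 1024):
        for h in (0.01, 0.05, 0.125):
            lam_max = max_rate_per_processor(N, h)
            print(f"N={N:5d} h={h:5.3f}  max rate/processor={lam_max:.4f}"
                  f"  total throughput={N * lam_max:7.2f}  (limit 1/h = {1/h:.1f})")
```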

610 citations


Journal ArticleDOI
TL;DR: The conclusion from the analysis is that the pseudonoise generator's output sequence and the sequences generated by the linear feedback shift registers should be uncorrelated, which leads to constraints for the nonlinear combining function to be used.
Abstract: Pseudonoise sequences generated by linear feedback shift registers [1] with some nonlinear combining function have been proposed [2]–[5] for cryptographic applications as running key generators in stream ciphers. In this correspondence it will be shown that the number of trials to break these ciphers can be significantly reduced by using correlation methods. By comparison of computer simulations and theoretical results based on a statistical model, the validity of this analysis is demonstrated. Rubin [6] has shown that it is computationally feasible to solve a cipher proposed by Pless [2] in a known plaintext attack, using as few as 15 characters. Here, the number of ciphertext symbols is determined to perform a ciphertext-only attack on the Pless cipher using the correlation attack. Our conclusion from the analysis is that the pseudonoise generator's output sequence and the sequences generated by the linear feedback shift registers should be uncorrelated. This leads to constraints for the nonlinear combining function to be used.
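As an illustration of the correlation idea only (not the Pless cipher nor the exact attack in the correspondence), the sketch below builds a toy Geffe-style generator from three small LFSRs with assumed lengths, taps, and seeds, then recovers the initial state of one register by trying all of its states and keeping the one whose output agrees most often with the keystream: the correct state agrees roughly 75 percent of the time, wrong states only about 50 percent.

```python
# Toy correlation attack on a Geffe-style combiner (illustrative sketch).
# All register lengths, taps, and seeds are arbitrary assumptions.
import itertools

def lfsr(seed_bits, taps, n):
    """Generate n output bits from a Fibonacci LFSR (taps are state indices)."""
    state = list(seed_bits)
    out = []
    for _ in range(n):
        out.append(state[-1])
        fb = 0
        for t in taps:
            fb ^= state[t]
        state = [fb] + state[:-1]
    return out

def geffe(x1, x2, x3):
    """Combiner: output = x2 if x1 == 1 else x3 (correlated ~0.75 with x2 and x3)."""
    return [(a & b) ^ ((1 ^ a) & c) for a, b, c in zip(x1, x2, x3)]

N = 512
taps1, taps2, taps3 = [0, 2], [0, 1, 3, 4], [0, 3]
seed1, seed2, seed3 = [1, 0, 1], [0, 1, 1, 0, 1], [1, 1, 0, 1]
keystream = geffe(lfsr(seed1, taps1, N), lfsr(seed2, taps2, N), lfsr(seed3, taps3, N))

# Attack register 3 alone: try every nonzero seed, measure agreement with keystream.
best = max(
    (seed for seed in itertools.product([0, 1], repeat=len(seed3)) if any(seed)),
    key=lambda seed: sum(a == b for a, b in zip(lfsr(list(seed), taps3, N), keystream)),
)
print("recovered seed for LFSR3:", list(best), " true seed:", seed3)
```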

547 citations


Journal ArticleDOI
Yung-Terng Wang, Morris
TL;DR: A taxonomy of load sharing algorithms is proposed that draws a basic dichotomy between source-initiative and server-initiative approaches, and a performance metric called the Q-factor (quality of load sharing) is defined which summarizes both the overall efficiency and fairness of an algorithm.
Abstract: An important part of a distributed system design is the choice of a load sharing or global scheduling strategy. A comprehensive literature survey on this topic is presented. We propose a taxonomy of load sharing algorithms that draws a basic dichotomy between source-initiative and server-initiative approaches. The taxonomy enables ten representative algorithms to be selected for performance evaluation. A performance metric called the Q-factor (quality of load sharing) is defined which summarizes both overall efficiency and fairness of an algorithm and allows algorithms to be ranked by performance. We then evaluate the algorithms using both mathematical and simulation techniques. The results of the study show that: i) the choice of load sharing algorithm is a critical design decision; ii) for the same level of scheduling information exchange, server-initiative has the potential of outperforming source-initiative algorithms (whether this potential is realized depends on factors such as communication overhead); iii) the Q-factor is a useful yardstick; iv) some algorithms, which have previously received little attention, e.g., multiserver cyclic service, may provide effective solutions.

507 citations


Journal ArticleDOI
TL;DR: Tight upper and lower bounds are proved on the number of processors, information transfer, wire area, and time needed to sort N numbers in a bounded-degree fixed-connection network.
Abstract: In this paper, we prove tight upper and lower bounds on the number of processors, information transfer, wire area, and time needed to sort N numbers in a bounded-degree fixed-connection network. Our most important new results are: 1) the construction of an N-node degree-3 network capable of sorting N numbers in O(log N) word steps; 2) a proof that any network capable of sorting N Θ(log N)-bit numbers in T bit steps requires area A where AT^2 = Ω(N^2 log^2 N); and 3) the construction of a "small-constant-factor" bounded-degree network that sorts N Θ(log N)-bit numbers in T = Θ(log N) bit steps with A = Θ(N^2) area.

395 citations


Journal ArticleDOI
TL;DR: In this article, a pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2^m) with the simple squaring property of the normal basis representation used together with this multiplier.
Abstract: Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that can be easily realized on VLSI chips. Massey and Omura [1] recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. In this paper, a pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2^m). With the simple squaring property of the normal basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2^m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable, and therefore, naturally suitable for VLSI implementation.
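The inversion pipeline relies on the identity a^(-1) = a^(2^m - 2) in GF(2^m), which needs only squarings and multiplications (and squaring is a plain cyclic shift of coordinates in a normal basis). The sketch below is not the Massey-Omura circuit; it uses an ordinary polynomial-basis software model of GF(2^8) with an assumed irreducible polynomial purely to check that exponentiation identity.

```python
# Inversion in GF(2^m) via a^(2^m - 2), using squarings and multiplications only.
# Polynomial-basis software model (GF(2^8) with x^8+x^4+x^3+x+1, an assumed choice);
# the normal-basis hardware in the paper evaluates the same exponent, with squaring
# reduced to a cyclic shift of the coordinates.
M = 8
IRRED = 0x11B

def gf_mul(a, b):
    """Carry-less multiply of a and b, reduced modulo IRRED."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= IRRED
    return r

def gf_inv(a):
    """a^(2^m - 2) by square-and-multiply: squarings interleaved with multiplies."""
    assert a != 0
    result, sq, exp = 1, a, (1 << M) - 2      # 2^m - 2 = 0b111...10
    while exp:
        if exp & 1:
            result = gf_mul(result, sq)
        sq = gf_mul(sq, sq)                   # squaring step (cyclic shift in a normal basis)
        exp >>= 1
    return result

if __name__ == "__main__":
    for a in (1, 2, 3, 0x53, 0xCA):
        assert gf_mul(a, gf_inv(a)) == 1
    print("a * a^(2^m - 2) == 1 verified for sample elements")
```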

373 citations


Journal ArticleDOI
TL;DR: A graph matching approach is proposed for solving the task assignment problem encountered in distributed computing systems, with a cost function defined in terms of a single unit, time, and a new optimization criterion, called the minimax criterion, under which both minimization of interprocessor communication and balance of processor loading can be achieved.
Abstract: A graph matching approach is proposed in this paper for solving the task assignment problem encountered in distributed computing systems. A cost function defined in terms of a single unit, time, is proposed for evaluating the effectiveness of task assignment. This cost function represents the maximum time for a task to complete module execution and communication in all the processors. A new optimization criterion, called the minimax criterion, is also proposed, based on which both minimization of interprocessor communication and balance of processor loading can be achieved. The proposed approach allows various system constraints to be included for consideration. With the proposed cost function and the minimax criterion, optimal task assignment is defined. Graphs are then used to represent the module relationship of a given task and the processor structure of a distributed computing system. Module assignment to system processors is transformed into a type of graph matching, called weak homomorphism. The search of optimal weak homomorphism corresponding to optimal task assignment is next formulated as a state-space search problem. It is then solved by the well-known A* algorithm in artificial intelligence after proper heuristic information for speeding up the search is suggested. An illustrative example and some experimental results are also included to show the effectiveness of the heuristic search.
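To make the minimax cost concrete, the toy sketch below enumerates all assignments of four invented modules to two processors and picks the one minimizing the maximum, over processors, of execution time plus interprocessor communication time. Exhaustive search stands in here for the paper's A* search over weak homomorphisms, and all module and communication costs are made up.

```python
# Minimax task assignment, brute force over a toy example (illustrative only).
from itertools import product

MODULES = ["m1", "m2", "m3", "m4"]
PROCS = ["p1", "p2"]
# Assumed execution time of each module on each processor.
EXEC = {("m1", "p1"): 5, ("m1", "p2"): 7,
        ("m2", "p1"): 4, ("m2", "p2"): 3,
        ("m3", "p1"): 6, ("m3", "p2"): 6,
        ("m4", "p1"): 2, ("m4", "p2"): 5}
# Assumed communication cost, paid only when the two modules sit on different processors.
COMM = {("m1", "m2"): 3, ("m2", "m3"): 4, ("m3", "m4"): 1}

def cost(assignment):
    """Minimax cost: max over processors of (execution + interprocessor communication)."""
    per_proc = {p: 0 for p in PROCS}
    for m, p in assignment.items():
        per_proc[p] += EXEC[(m, p)]
    for (a, b), c in COMM.items():
        if assignment[a] != assignment[b]:
            per_proc[assignment[a]] += c
            per_proc[assignment[b]] += c
    return max(per_proc.values())

best = min((dict(zip(MODULES, choice)) for choice in product(PROCS, repeat=len(MODULES))),
           key=cost)
print("optimal assignment:", best, "minimax cost:", cost(best))
```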

358 citations


Journal ArticleDOI
Takagi, Yasuura, Yajima
TL;DR: Since the multiplier has a regular cellular array structure similar to an array multiplier, it is suitable for VLSI implementation and is excellent in both computation speed and regularity in layout.
Abstract: A high-speed VLSI multiplication algorithm internally using redundant binary representation is proposed. In n bit binary integer multiplication, n partial products are first generated and then added up pairwise by means of a binary tree of redundant binary adders. Since parallel addition of two n-digit redundant binary numbers can be performed in a constant time independent of n without carry propagation, n bit multiplication can be performed in a time proportional to log_2 n. The computation time is almost the same as that by a multiplier with a Wallace tree, in which three partial products will be converted into two, in contrast to our two-to-one conversion, and is much shorter than that by an array multiplier for longer operands. The number of computation elements of an n bit multiplier based on the algorithm is proportional to n^2. It is almost the same as those of conventional ones. Furthermore, since the multiplier has a regular cellular array structure similar to an array multiplier, it is suitable for VLSI implementation. Thus, the multiplier is excellent in both computation speed and regularity in layout. It can be implemented on a VLSI chip with an area proportional to n^2 log_2 n. The algorithm can be directly applied to both unsigned and 2's complement binary integer multiplication.
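The property doing the work is that two radix-2 signed-digit (redundant binary) numbers can be added with no carry propagation: each output digit depends only on a constant number of neighboring input digits. The sketch below implements one common formulation of that totally parallel addition rule (digits in {-1, 0, 1}, least significant digit first) and checks it against ordinary integer addition; it illustrates the principle rather than reproducing the paper's multiplier.

```python
# Carry-free addition of radix-2 signed-digit numbers (digits -1, 0, 1), LSD first.
# Two-step rule: position i picks an interim sum w_i and carry c_{i+1} using only
# (x_i, y_i) and the signs at position i-1, so no carry chain ever forms.
import random

def sd_add(x, y):
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    w = [0] * n          # interim sum digits
    c = [0] * (n + 1)    # c[i+1] is the carry produced at position i
    for i in range(n):
        z = x[i] + y[i]
        lower_nonneg = (i == 0) or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if z == 2:
            c[i + 1], w[i] = 1, 0
        elif z == 1:
            c[i + 1], w[i] = (1, -1) if lower_nonneg else (0, 1)
        elif z == 0:
            c[i + 1], w[i] = 0, 0
        elif z == -1:
            c[i + 1], w[i] = (0, -1) if lower_nonneg else (-1, 1)
        else:  # z == -2
            c[i + 1], w[i] = -1, 0
    # Final digits: w_i plus the carry from the position below, always in {-1, 0, 1}.
    return [w[i] + c[i] for i in range(n)] + [c[n]]

def sd_value(digits):
    return sum(d * (1 << i) for i, d in enumerate(digits))

if __name__ == "__main__":
    random.seed(0)
    for _ in range(1000):
        a = [random.choice([-1, 0, 1]) for _ in range(12)]
        b = [random.choice([-1, 0, 1]) for _ in range(12)]
        s = sd_add(a, b)
        assert sd_value(s) == sd_value(a) + sd_value(b)
        assert all(d in (-1, 0, 1) for d in s)
    print("carry-free signed-digit addition verified on 1000 random pairs")
```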

344 citations


Journal ArticleDOI
TL;DR: The procedures are designed to minimize the length of the longest wire in the system, thus minimizing the communication time between cells; although the underlying network problems are NP-complete, the procedures are proved to be reliable by assuming a probabilistic model of cell failure.
Abstract: VLSI technologists are fast developing wafer-scale integration. Rather than partitioning a silicon wafer into chips as is usually done, the idea behind wafer-scale integration is to assemble an entire system (or network of chips) on a single wafer, thus avoiding the costs and performance loss associated with individual packaging of chips. A major problem with assembling a large system of microprocessors on a single wafer, however, is that some of the processors, or cells, on the wafer are likely to be defective. In the paper, we describe practical procedures for integrating "around" such faults. The procedures are designed to minimize the length of the longest wire in the system, thus minimizing the communication time between cells. Although the underlying network problems are NP-complete, we prove that the procedures are reliable by assuming a probabilistic model of cell failure. We also discuss applications of the work to problems in VLSI layout theory, graph theory, fault-tolerant systems, planar geometry, and the probabilistic analysis of algorithms.
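As a toy illustration of the objective only (not one of the paper's procedures), the sketch below scatters random cell failures over a wafer modeled as a grid, threads the surviving cells into a single chain in snake order, and reports the longest wire that this naive patching needs; the point of the paper's algorithms is to do substantially better than such a baseline.

```python
# Naive "integrate around faults" baseline: chain the live cells of a faulty grid
# in snake order and measure the longest wire (Manhattan distance between
# consecutive live cells). Grid size and failure probability are assumed values.
import random

def snake_chain_longest_wire(rows, cols, p_fail, seed=0):
    rng = random.Random(seed)
    alive = [[rng.random() >= p_fail for _ in range(cols)] for _ in range(rows)]
    # Visit cells row by row, alternating direction, keeping only live cells.
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs if alive[r][c])
    longest = max(abs(r1 - r2) + abs(c1 - c2)
                  for (r1, c1), (r2, c2) in zip(order, order[1:]))
    return len(order), longest

if __name__ == "__main__":
    for p in (0.0, 0.1, 0.3):
        n_live, wire = snake_chain_longest_wire(16, 16, p)
        print(f"failure prob {p:.1f}: {n_live:3d} live cells, longest wire = {wire}")
```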

268 citations


Journal ArticleDOI
TL;DR: A pipeline structure of a transform decoder similar to a systolic array is developed to decode Reed-Solomon (RS) codes, using a modified Euclidean algorithm for computing the error-locator polynomial.
Abstract: A pipeline structure of a transform decoder similar to a systolic array is developed to decode Reed-Solomon (RS) codes. An important ingredient of this design is a modified Euclidean algorithm for computing the error-locator polynomial. The computation of inverse field elements is completely avoided in this modification of Euclid's algorithm. The new decoder is regular and simple, and naturally suitable for VLSI implementation. An example illustrating both the pipeline and systolic array aspects of this decoder structure is given for a (15,9) RS code.

247 citations


Journal ArticleDOI
Guo-Jie Li, Wah
TL;DR: In this paper, a methodology is proposed to systematically search and reduce the design space of systolic arrays and to obtain the optimal design; examples of applying the method, including matrix multiplication, finite impulse response filtering, deconvolution, and triangular matrix inversion, are given.
Abstract: Conventional design of systolic arrays is based on the mapping of an algorithm onto an interconnection of processing elements in a VLSI chip. This mapping is done in an ad hoc manner, and the resulting configuration usually represents a feasible but suboptimal design. In this paper, systolic arrays are characterized by three classes of parameters: the velocities of data flows, the spatial distributions of data, and the periods of computation. By relating these parameters in constraint equations that govern the correctness of the design, the design is formulated into an optimization problem. The size of the search space is a polynomial of the problem size, and a methodology to systematically search and reduce this space and to obtain the optimal design is proposed. Some examples of applying the method, including matrix multiplication, finite impulse response filtering, deconvolution, and triangular-matrix inversion, are given.

Journal ArticleDOI
Mackinnon, Taylor, Meijer, Akl
TL;DR: A cryptographic scheme for controlling access to information within a group of users organized in a hierarchy was proposed in [1].
Abstract: A cryptographic scheme for controlling access to information within a group of users organized in a hierarchy was proposed in [1]. The scheme enables a user at some level to compute from his own cryptographic key the keys of the users below him in the organization.
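A minimal sketch of the scheme of [1] as it is commonly described, with toy parameters throughout (an assumed five-class hierarchy, a deliberately tiny modulus, and an arbitrary secret base key): each class gets a public exponent t_i chosen so that t_i divides t_j exactly when class j is at or below class i, the key of class i is K^(t_i) mod M, and a user derives a descendant's key by raising their own key to t_j / t_i.

```python
# Akl-Taylor style hierarchical key assignment (toy parameters; illustrative only).
# Hierarchy (assumed): director > {manager_a, manager_b}; each manager > its own clerk.
from math import prod

BELOW = {  # strict descendants of each class
    "director":  {"manager_a", "manager_b", "clerk_a", "clerk_b"},
    "manager_a": {"clerk_a"},
    "manager_b": {"clerk_b"},
    "clerk_a":   set(),
    "clerk_b":   set(),
}
PRIMES = {"director": 2, "manager_a": 3, "manager_b": 5, "clerk_a": 7, "clerk_b": 11}

M = 101 * 113            # toy modulus (product of two small primes); not secure
K0 = 1234                # the central authority's secret base key

def down_set(c):
    return BELOW[c] | {c}

# Public exponent t_c: product of the primes of every class NOT at or below c.
T = {c: prod(PRIMES[d] for d in PRIMES if d not in down_set(c)) for c in PRIMES}
KEY = {c: pow(K0, T[c], M) for c in PRIMES}

def derive(own_class, own_key, target_class):
    """A user of own_class derives target_class's key if target is at or below it."""
    if T[target_class] % T[own_class] != 0:
        raise PermissionError("target class is not below this class")
    return pow(own_key, T[target_class] // T[own_class], M)

assert derive("manager_a", KEY["manager_a"], "clerk_a") == KEY["clerk_a"]
assert derive("director", KEY["director"], "clerk_b") == KEY["clerk_b"]
print("hierarchical key derivation works for the toy hierarchy")
```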

Journal ArticleDOI
Atallah
TL;DR: The purpose of this correspondence is to describe an O(n log n) time algorithm for enumerating all the axes of symmetry of a planar figure which is made up of segments, circles, points, etc.
Abstract: A straight line is an axis of symmetry of a planar figure if the figure is invariant to reflection with respect to that line. The purpose of this correspondence is to describe an O(n log n) time algorithm for enumerating all the axes of symmetry of a planar figure which is made up of (possibly intersecting) segments, circles, points, etc. The solution involves a reduction of the problem to a combinatorial question on words. Our algorithm is optimal since we can establish an Ω(n log n) time lower bound for this problem.

Journal ArticleDOI
TL;DR: This study solves the prefix computation problem, where the order of the elements is specified by a linked list, under the weakest PRAM model, in which shared memory locations can only be exclusively read or written (the EREW model).
Abstract: The prefix computation problem is to compute all n initial products a_1 * ... * a_i, i = 1, ..., n, of a set of n elements, where * is an associative operation. An O((log n / log(2n/p)) · (n/p)) time deterministic parallel algorithm using p ≤ n processors is presented to solve the prefix computation problem, when the order of the elements is specified by a linked list. For p ≤ O(n^(1-ε)) (ε > 0 any constant), this algorithm achieves linear speedup. Such optimal speedup was previously achieved only by probabilistic algorithms. This study assumes the weakest PRAM model, where shared memory locations can only be exclusively read or written (the EREW model).
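For intuition, the sketch below simulates the classic pointer-jumping approach to prefix computation on a linked list: in each synchronous round every element combines its value with its predecessor's and then doubles its jump distance, finishing in O(log n) rounds with one (simulated) processor per element. This is the textbook EREW building block, not the processor-efficient algorithm of the paper.

```python
# Prefix computation on a linked list by synchronous pointer jumping (Wyllie-style).
# Every "processor" handles one list element; O(log n) rounds. This illustrates the
# PRAM idea only -- the paper's contribution is achieving it with fewer processors.
import operator

def linked_list_prefix(values, prev, op=operator.add):
    """values[i], prev[i] describe element i; prev[head] is None.
    Returns pref[i] = values combined from the head up to and including i."""
    val = list(values)
    prev = list(prev)
    n = len(val)
    for _ in range(max(1, n.bit_length())):          # O(log n) synchronous rounds
        new_val, new_prev = val[:], prev[:]
        for i in range(n):                            # "in parallel"
            if prev[i] is not None:
                new_val[i] = op(val[prev[i]], val[i])
                new_prev[i] = prev[prev[i]]
        val, prev = new_val, new_prev
    return val

if __name__ == "__main__":
    # List stored in scrambled order: logical order is index 3 -> 0 -> 4 -> 2 -> 1.
    values = [20, 30, 1, 10, 3]
    prev   = [3, 2, 4, None, 0]
    print(linked_list_prefix(values, prev))   # -> [30, 64, 34, 10, 33]
```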

Journal ArticleDOI
TL;DR: The authors present a scheduling algorithm which works dynamically and on loosely coupled distributed systems for tasks with hard real-time constraints; i.e., the tasks must meet their deadlines.
Abstract: Most systems which are required to operate under severe real-time constraints assume that all tasks and their characteristics are known a priori. Scheduling of such tasks can be done statically. Further, scheduling algorithms operating under such conditions are usually limited to multiprocessor configurations. The authors present a scheduling algorithm which works dynamically and on loosely coupled distributed systems for tasks with hard real-time constraints; i.e., the tasks must meet their deadlines. It uses a scheduling component local to every node and a distributed scheduling scheme which is specifically suited to hard real-time constraints and other timing considerations. Periodic tasks, nonperiodic tasks, scheduling overheads, communication overheads due to scheduling and preemption are all accounted for in the algorithm. Simulation studies are used to evaluate the performance of the algorithm.
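The local scheduling component can be pictured with a minimal guarantee test (a simplified sketch assuming nonpreemptive earliest-deadline-first execution on a single node, not the paper's algorithm): a newly arrived task is accepted only if, with the task added, every already-guaranteed task still meets its deadline.

```python
# Minimal local "guarantee" test for hard real-time tasks on one node (sketch only).
# A task is (ready_time, computation_time, deadline); scheduling is nonpreemptive EDF.
def can_guarantee(guaranteed, new_task, now=0):
    tasks = sorted(guaranteed + [new_task], key=lambda t: t[2])   # earliest deadline first
    t = now
    for ready, comp, deadline in tasks:
        t = max(t, ready) + comp          # start when both the CPU and the task are ready
        if t > deadline:
            return False                  # some deadline would be missed: reject the task
    return True

if __name__ == "__main__":
    guaranteed = [(0, 4, 10), (2, 3, 9)]
    print(can_guarantee(guaranteed, (1, 2, 15)))   # True: fits after the others
    print(can_guarantee(guaranteed, (0, 5, 8)))    # False: would force a missed deadline
```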

Journal ArticleDOI
TL;DR: The theory and design of systematic t-unidirectional error-detecting codes are developed, and optimal systematic codes capable of detecting 2, 3, and 6 unidirectional errors using 2, 3, and 4 check bits are given.
Abstract: The theory and design of systematic t-unidirectional error-detecting codes are developed. Optimal systematic codes capable of detecting 2, 3, and 6 unidirectional errors using 2, 3, and 4 check bits, respectively, are given. For r ≥ 5, where r is the number of check bits, the systematic codes described here can detect up to 5·2^(r-4) + r - 4 unidirectional errors. Encoding/decoding methods for these codes are also investigated.
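For context, the classical Berger code detects all unidirectional errors by appending the count of 0s in the information word; the codes in this paper trade that all-error guarantee for far fewer check bits when only t errors must be detected. The sketch below implements the Berger baseline (not the paper's constructions) and shows one multibit unidirectional error being caught.

```python
# Berger code: check symbol = number of 0s in the information word, in binary.
# A unidirectional (all 1->0 or all 0->1) error moves the data's 0-count and the
# stored check value in opposite directions, so it is always detected.
def berger_encode(info_bits):
    zeros = info_bits.count(0)
    r = len(info_bits).bit_length()                 # check bits needed
    check = [(zeros >> i) & 1 for i in range(r)]    # LSB first
    return info_bits + check

def berger_check(codeword, k):
    info, check = codeword[:k], codeword[k:]
    stored = sum(b << i for i, b in enumerate(check))
    return info.count(0) == stored

if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    cw = berger_encode(data)
    assert berger_check(cw, len(data))
    # Inject a unidirectional 1 -> 0 error hitting three data bits: always caught.
    corrupted = [0 if (i in (0, 2, 3) and b == 1) else b for i, b in enumerate(cw)]
    print("error detected:", not berger_check(corrupted, len(data)))
```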

Journal ArticleDOI
TL;DR: The concept of presortedness and its use in sorting is studied, and a new insertion sort algorithm is shown to be optimal with respect to three natural measures.
Abstract: The concept of presortedness and its use in sorting are studied. Natural ways to measure presortedness are given and some general properties necessary for a measure are proposed. A concept of a sorting algorithm optimal with respect to a measure of presortedness is defined, and examples of such algorithms are given. A new insertion sort algorithm is shown to be optimal with respect to three natural measures. The problem of finding an optimal algorithm for an arbitrary measure is studied, and partial results are proven.
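One of the natural measures is the number of inversions, and straight insertion sort adapts to it: the element shifts it performs equal the number of inversions exactly, so nearly sorted inputs are sorted in nearly linear time. The short check below illustrates that fact; it is not the paper's new algorithm, which is optimal with respect to further measures as well.

```python
# Insertion sort does exactly Inv(X) element shifts, where Inv(X) = number of inversions.
import random

def inversions(xs):
    return sum(1 for i in range(len(xs)) for j in range(i + 1, len(xs)) if xs[i] > xs[j])

def insertion_sort_shifts(xs):
    a = list(xs)
    shifts = 0
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]          # shift one element to the right
            shifts += 1
            j -= 1
        a[j + 1] = key
    return a, shifts

if __name__ == "__main__":
    random.seed(1)
    xs = random.sample(range(100), 20)
    sorted_xs, shifts = insertion_sort_shifts(xs)
    assert sorted_xs == sorted(xs) and shifts == inversions(xs)
    print(f"{shifts} shifts == {inversions(xs)} inversions; running time is O(n + Inv)")
```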

Journal ArticleDOI
TL;DR: This paper clarifies the relation between the diameter k and the edge connectivity c_e or node connectivity c_n of digraphs by deriving two inequalities involving the number of nodes, the maximum degree, and the minimum degree.
Abstract: This paper clarifies the relation between the diameter k and the edge connectivity c_e or node connectivity c_n of digraphs. Two inequalities relating these quantities are derived, in which n is the number of nodes, D is the maximum degree, and d is the minimum degree.

Journal ArticleDOI
Thu V. Vu
TL;DR: Two conversion techniques based on the Chinese remainder theorem are developed for use in residue number systems; one is particularly useful for sign detection, while the other is preferable for the full conversion from residues to unsigned or 2's complement integers.
Abstract: Two conversion techniques based on the Chinese remainder theorem are developed for use in residue number systems. The new implementations are fast and simple mainly because adders modulo a large and arbitrary integer M are effectively replaced by binary adders and possibly a lookup table of small address space. Although different in form, both techniques share the same principle that an appropriate representation of the summands must be employed in order to evaluate a sum modulo M efficiently. The first technique reduces the sum modulo M in the conversion formula to a sum modulo 2 through the use of fractional representation, which also exposes the sign bit of numbers. Thus, this technique is particularly useful for sign detection and for any operation requiring a comparison with a binary fraction of M. The other technique is preferable for the full conversion from residues to unsigned or 2's complement integers. By expressing the summands in terms of quotients and remainders with respect to a properly chosen divisor, the second technique systematically replaces the sum modulo M by two binary sums, one accumulating the quotients modulo a power of 2 and the other accumulating the remainders the ordinary way. A final recombination step is required but is easily implemented with a small lookup table and binary adders.
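A software sketch of the two identities underneath (illustrative only, with assumed moduli): the usual Chinese remainder reconstruction X = Σ x_i · M_i · (M_i^(-1) mod m_i) (mod M), and its fractional form X/M = frac(Σ x_i · (M_i^(-1) mod m_i) / m_i), which exposes the most significant, sign-carrying information without a mod-M adder; the hardware techniques in the paper are about evaluating such sums with binary adders and small tables.

```python
# Chinese-remainder reconstruction and its fractional form (illustrative sketch).
from math import prod, floor

MODULI = [7, 11, 13, 15]                # assumed pairwise-coprime moduli
M = prod(MODULI)                        # dynamic range 15015

def to_residues(x):
    return [x % m for m in MODULI]

def crt(residues):
    """Standard CRT: X = sum x_i * M_i * (M_i^{-1} mod m_i) (mod M)."""
    total = 0
    for x_i, m_i in zip(residues, MODULI):
        M_i = M // m_i
        total += x_i * M_i * pow(M_i, -1, m_i)
    return total % M

def fractional_crt(residues):
    """X / M as a fraction in [0, 1): frac( sum x_i * inv_i / m_i )."""
    s = sum(x_i * pow(M // m_i, -1, m_i) / m_i for x_i, m_i in zip(residues, MODULI))
    return s - floor(s)

if __name__ == "__main__":
    for x in (0, 1, 12345, M - 1, M // 3):
        r = to_residues(x)
        assert crt(r) == x
        # Sign detection for a symmetric range [-M/2, M/2): "negative" iff fraction >= 1/2.
        assert (fractional_crt(r) >= 0.5) == (x >= M / 2)
    print("CRT reconstruction and fractional sign detection verified")
```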

Journal ArticleDOI
TL;DR: This paper shows that the node-connectivity of SRG is (2r - 2) and presents routing methods for situations with a certain number of node failures, and the routing algorithms are shown to be computationally efficient.
Abstract: A class of communication networks which is suitable for "multiple processor systems" was studied by Pradhan and Reddy. The underlying graph (to be called Shift and Replace graph or SRG) is based on DeBruijn digraphs and is a function of two parameters r and m. Pradhan and Reddy have shown that the node-connectivity of SRG is at least r. The same authors give a routing algorithm which generally requires 2m hops if the number of node failures is ≤ (r - 1). In this paper we show that the node-connectivity of SRG is (2r - 2). This would immediately imply that the system can tolerate up to (2r - 3) node failures. We then present routing methods for situations with a certain number of node failures. When this number is ≤ (r - 2) our routing algorithm requires at most m + 3 + log_r m hops if 3 + log_r m ≤ m. When the number of node failures is ≤ (2r - 3) our routing algorithm requires at most m + 5 + log_r m hops if 4 + log_r m ≤ m. In all the other situations our routing algorithm requires no more than 2m hops. The routing algorithms are shown to be computationally efficient.

Journal ArticleDOI
TL;DR: A heuristic is presented for the effective cooperation of multiple decentralized components of a job scheduling function; it can dynamically adapt to the quality of the state information being processed and is based on Bayesian decision theory.
Abstract: There is a wide spectrum of techniques that can be aptly named decentralized control. However, certain functions in distributed operating systems, e.g., scheduling, operate under such demanding requirements that no known optimal control solutions exist. It has been shown that heuristics are necessary. This paper presents a heuristic for the effective cooperation of multiple decentralized components of a job scheduling function. An especially useful feature of the heuristic is that it can dynamically adapt to the quality of the state information being processed. Extensive simulation results show the utility of this heuristic. The simulation results are compared to several analytical models and a baseline simulation model. The heuristic itself is based on the application of Bayesian decision theory. Bayesian decision theory was used because its principles can be applied as a systematic approach to complex decision making under conditions of imperfect knowledge, and it can run relatively cheaply in real time.

Journal ArticleDOI
TL;DR: A discrete-time model is presented of memory interference in multiprocessor systems using multiple-bus interconnection networks that differs from earlier models in its ability to model variable connection time and arbitrary inter-request time.
Abstract: A discrete-time model is presented of memory interference in multiprocessor systems using multiple-bus interconnection networks. It differs from earlier models in its ability to model variable connection time and arbitrary inter-request time. The model describes each processing element's behavior by means of a semi-Markov process, taking as input the number of processing elements, the number of memory modules, the number of buses, the mean think time of the processing elements, and the first and second moments of the connection time between processing elements and memories. The model produces as output the memory bandwidth, processing element utilization, memory module utilization, average queue length at a memory, and average waiting time experienced by a processing element while waiting to access a memory. Using the model, it is possible to analyze the interaction of the input parameters on the system performance without using a complex Markov chain; a four-state semi-Markov process is sufficient regardless of the think and connection time distributions. The accuracy and capability of the model are illustrated.

Journal ArticleDOI
TL;DR: This correspondence shows that the DTS problem is NP-complete, and presents a longest-first sequential scheduling algorithm which runs in worst case time O(dm log n) and uses O(m) space to produce a solution of length less than four times optimal.
Abstract: The problem of diagnostic test scheduling (DTS) is to assign to each edge e of a diagnostic graph G a time interval of length l(e) so that intervals corresponding to edges at any given vertex do not overlap and the overall finishing time is minimum. In this correspondence we show that the DTS problem is NP-complete. Then we present a longest-first sequential scheduling algorithm which runs in worst case time O(dm log n) and uses O(m) space to produce a solution of length less than four times optimal. Then we show that the general performance bound can be strengthened to 3·OPT(G) for low-degree graphs and to 2·OPT(G) in some special cases of binomial diagnostic graphs.

Journal ArticleDOI
TL;DR: A highly reliable and efficient double-loop network architecture is proposed, based on a forward loop backward hop topology, with a loop in the forward direction connecting all the neighboring nodes and a backward loop connecting nodes that are separated by a distance of ⌊√N⌋.
Abstract: Single-loop networks tend to become unreliable when the number of nodes in the network becomes large. Reliability can be improved using double loops. In this paper a highly reliable and efficient double-loop network architecture is proposed and analyzed. This network is based on forward loop backward hop topology, with a loop in the forward direction connecting all the neighboring nodes, and a backward loop connecting nodes that are separated by a distance ⌊√N⌋, where N is the number of nodes in the network. It is shown that this topology is optimal, among this class of double-loop networks, in terms of diameter, average hop distance, processing overhead, delay, throughput, and reliability. The paper includes derivation of closed form expressions for diameter and average hop distance, throughput, and number of distinct routes between two farthest nodes. For fault-tolerance study, the effect of node and link failures on the performance of the network is analyzed. A simple distributed routing algorithm for reliable loop network operation is also presented.
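The topology itself is easy to reproduce: node i has a forward link to (i + 1) mod N and a backward link to (i - ⌊√N⌋) mod N. The sketch below builds that digraph for a few assumed sizes and measures diameter and average hop distance by breadth-first search; these are the quantities for which the paper derives closed-form expressions.

```python
# Forward loop, backward hop network: diameter and average hop distance via BFS.
from collections import deque
from math import isqrt

def fb_hop_metrics(N):
    skip = isqrt(N)                      # backward hop distance = floor(sqrt(N))
    neighbors = lambda i: ((i + 1) % N, (i - skip) % N)
    diameter, total = 0, 0
    for src in range(N):                 # directed BFS from every node
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
    return diameter, total / (N * (N - 1))

if __name__ == "__main__":
    for N in (16, 64, 100, 256):
        d, avg = fb_hop_metrics(N)
        print(f"N={N:4d}  diameter={d:3d}  average hop distance={avg:.2f}")
```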

Journal ArticleDOI
Fisher, Kung
TL;DR: This paper provides a spectrum of synchronization models; based on the assumptions made for each model, theoretical lower bounds on clock skew are derived, and appropriate or best possible synchronization schemes for large processor arrays are proposed.
Abstract: Highly parallel VLSI computing structures consist of many processing elements operating simultaneously. In order for such processing elements to communicate among themselves, some provision must be made for synchronization of data transfer. The simplest means of synchronization is the use of a global clock. Unfortunately, large clocked systems can be difficult to implement because of the inevitable problem of clock skews and delays, which can be especially acute in VLSI systems as feature sizes shrink. For the near term, good engineering and technology improvements can be expected to maintain the feasibility of clocking in such systems; however, clock distribution problems crop up in any technology as systems grow. An alternative means of enforcing necessary synchronization is the use of self-timed asynchronous schemes, at the cost of increased design complexity and hardware cost. Realizing that different circumstances call for different synchronization methods, this paper provides a spectrum of synchronization models; based on the assumptions made for each model, theoretical lower bounds on clock skew are derived, and appropriate or best possible synchronization schemes for large processor arrays are proposed.

Journal ArticleDOI
TL;DR: The effect of failures on the performance of multiple-bus multiprocessors is considered and mathematical models are developed to compute the reliability and the performance-related bandwidth availability.
Abstract: The effect of failures on the performance of multiple-bus multiprocessors is considered. Bandwidth expressions for this architecture are derived for uniform and nonuniform memory references. Mathematical models are developed to compute the reliability and the performance-related bandwidth availability (BA). The results obtained for the multiple-bus interconnection are compared with those of a crossbar. The models are also extended to analyze the partial bus structure, where the memories are divided into groups and each group is connected to a subset of buses. The reliability and the BA of the multiple-bus and partial bus architectures are compared.

Journal ArticleDOI
Abramovici, Menon
TL;DR: This approach is based on extending fault simulation and test generation for stuck faults to cover bridging faults as well, and shows that adequate bridging fault coverage can be obtained in most cases without using sequences of vectors.
Abstract: In this correspondence we present a practical approach to fault simulation and test generation for bridging faults in combinational circuits. Unlike previous work, we consider unrestricted bridging faults, including those that introduce feedback. Our approach is based on extending fault simulation and test generation for stuck faults to cover bridging faults as well. We consider combinational testing only, and show that adequate bridging fault coverage can be obtained in most cases without using sequences of vectors.
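The flavor of (non-feedback) bridging-fault simulation can be shown with a small invented example rather than the authors' tool: a wired-AND bridge between two internal nets is injected by forcing both nets to the AND of their fault-free values, and input vectors are replayed to see which ones make the faulty output differ from the good one.

```python
# Wired-AND bridging-fault simulation on a toy combinational circuit (sketch only).
# Circuit: n1 = a AND b, n2 = c OR d, out = n1 XOR n2.
from itertools import product

def simulate(a, b, c, d, bridge=None):
    n1 = a & b
    n2 = c | d
    if bridge == ("n1", "n2"):           # wired-AND bridge: both nets take n1 AND n2
        n1 = n2 = n1 & n2
    return n1 ^ n2

def detects(vectors, bridge):
    """Vectors that expose the bridge, i.e. good and faulty outputs differ."""
    return [v for v in vectors if simulate(*v) != simulate(*v, bridge=bridge)]

if __name__ == "__main__":
    all_vectors = list(product([0, 1], repeat=4))
    hits = detects(all_vectors, ("n1", "n2"))
    print(f"{len(hits)} of {len(all_vectors)} input vectors detect the n1/n2 wired-AND bridge")
    print("one detecting vector (a, b, c, d):", hits[0])
```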

Journal ArticleDOI
TL;DR: In this paper, the authors present analytical results for the calculation of the resulting effective bandwidth for one and two access streams to a memory system in a vector processor; examples of measurements on a Cray X-MP and corresponding simulations are also presented.
Abstract: Memory interleaving and multiple access ports are the key to a high memory bandwidth in vector processor systems. Each of the active ports supports an independent access stream to memory among which access conflicts may arise. Such conflicts lead to a decrease in memory bandwidth. The authors present some analytical results for the calculation of the resulting effective bandwidth for one and two access streams to a memory system in a vector processor. In particular, conditions for conflict-free access are given together with some conflicting cases that should be avoided. Finally, examples of measurements on a Cray X-MP and corresponding simulations are presented.
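The single-stream case rests on a simple fact that the analysis builds on: with M interleaved banks and stride s, consecutive vector elements fall into M / gcd(M, s) distinct banks, so access is conflict-free when that revisit interval is at least the bank busy time and degrades proportionally otherwise. The sketch below tabulates this for a generic interleaved memory with assumed parameters, not the Cray X-MP's actual memory system.

```python
# Effective bandwidth of one stride-s stream on M interleaved banks (simplified model).
# One access is issued per cycle; a bank stays busy for `busy` cycles after each access.
from math import gcd

def relative_bandwidth(M, stride, busy):
    distinct_banks = M // gcd(M, stride)      # banks actually visited by the stream
    # The same bank is revisited every `distinct_banks` accesses; if that is shorter
    # than its busy time, the stream stalls and bandwidth drops proportionally.
    return min(1.0, distinct_banks / busy)

if __name__ == "__main__":
    M, busy = 16, 4                           # assumed: 16 banks, 4-cycle bank busy time
    for stride in (1, 2, 3, 4, 8, 16, 17):
        bw = relative_bandwidth(M, stride, busy)
        print(f"stride {stride:2d}: uses {M // gcd(M, stride):2d} banks, "
              f"relative bandwidth = {bw:.2f}")
```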

Journal ArticleDOI
TL;DR: A state model is presented which clarifies various coherence mechanisms as well as introduces a new state to enable the multicache system to more efficiently handle the processor writes.
Abstract: A coherence problem may occur in a multicache system as soon as data inconsistency exists in the private caches and the main memory. Without an effective solution to the coherence problem, the effectiveness of a multicache system will be inherently limited. The problem is closely examined in this paper and previous solutions, both centralized approaches and distributed approaches, are analyzed based on the notion of semicritical sections. A state model is then presented which clarifies various coherence mechanisms as well as introduces a new state to enable the multicache system to more efficiently handle the processor writes. Software guidance, for performance and not for integrity, is advocated in a new proposal which in a practical multicache environment explores the benefit of the new state with little cost.

Journal ArticleDOI
TL;DR: The rearrangeability proof and the control algorithm are well known for the Benes network, but there has been little progress for the case of nonsymmetric networks of similar hardware requirements.
Abstract: For any parallel computer systems which consist of many processing elements and memories, interconnection networks provide communication paths among processing elements and memories. Both the rearrangeability proof and the control algorithm are well known for the Benes network, which is intrinsically symmetric. However, there has been little progress for the case of nonsymmetric networks of similar hardware requirements.