
Showing papers on "Loop fission published in 1989"


Journal ArticleDOI
01 Apr 1989
TL;DR: The Cydra™ 5 architecture adds unique support for overlapping successive iterations of a loop to a very long instruction word (VLIW) base, allowing highly parallel loop execution for a much larger class of loops than can be vectorized.
Abstract: The Cydra™ 5 architecture adds unique support for overlapping successive iterations of a loop to a very long instruction word (VLIW) base. This architecture allows highly parallel loop execution for a much larger class of loops than can be vectorized, without requiring the unrolling of loops usually used by compilers for VLIW machines. This paper discusses the Cydra 5 loop scheduling model, the special architectural features which support it, and the loop compilation techniques used to take full advantage of the architecture.
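
As a rough illustration of the general idea of overlapping successive loop iterations (independent of the Cydra 5 hardware support described above), the C sketch below hand-pipelines a loop computing a[i] = b[i]*c + d so that the load, the multiply-add, and the store of three different iterations are in flight during each pass of the kernel loop. The function name and the three-stage split are illustrative assumptions, not taken from the paper.

/* Hypothetical illustration: a[i] = b[i] * c + d, split into load,
   multiply-add, and store stages so that three different iterations
   are "in flight" during each pass of the kernel loop.               */
#include <stddef.h>

void saxpy_pipelined(float *a, const float *b, float c, float d, size_t n)
{
    if (n < 2) {                      /* too short to overlap: plain loop */
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] * c + d;
        return;
    }

    float ld   = b[0];                /* prologue: load for iteration 0   */
    float calc = ld * c + d;          /* compute for iteration 0,         */
    ld = b[1];                        /* load for iteration 1             */

    for (size_t i = 2; i < n; i++) {  /* steady state: three iterations overlap */
        a[i - 2] = calc;              /* store for iteration i-2          */
        calc     = ld * c + d;        /* compute for iteration i-1        */
        ld       = b[i];              /* load for iteration i             */
    }

    a[n - 2] = calc;                  /* epilogue: drain the pipeline     */
    a[n - 1] = ld * c + d;
}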

147 citations


Proceedings ArticleDOI
01 Jun 1989
TL;DR: An iterative loop-folding procedure, implemented in the CATHEDRAL II compiler, is presented, which may significantly improve the utilization of the parallel hardware available in a data path.
Abstract: In this paper, we discuss a control-flow transformation called loop folding, applied during the scheduling of register-transfer code for DSP systems. Loop folding is functionally equivalent to data-path pipelining. An iterative loop-folding procedure, implemented in the CATHEDRAL II compiler, is presented. This technique may significantly improve the utilization of the parallel hardware available in a data path.

141 citations


Patent
13 Mar 1989
TL;DR: In this paper, a logic simulator has a time loop with a number of time slots into which events are scheduled, such that event times corresponding to different cycles around the loop may be simultaneously present on the loop.
Abstract: A logic simulator has a time loop with a number of time slots into which events are scheduled. The events are wrapped around the loop, so that event times corresponding to different cycles around the loop may be simultaneously present on the loop. This allows a small loop size to be used, which improves performance. Preferably, the loop size is a prime number. If a complete cycle of the loop is made without finding any non-empty slots, a jump is made to the next event time, so as to speed up the processing. In one described embodiment, the loop size is static, while in a second described embodiment the loop size is dynamically varied to minimize the insertion of events with different event times into the same slot.
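
The sketch below is a minimal C model of a wrap-around timing wheel of the general kind the patent describes: the slot index is the event time modulo the (prime) wheel size, each event keeps its full time so that events from different cycles around the loop can share a slot, and only events whose time matches the current simulation time are fired. The names, the slot count of 127, and the per-slot linked lists are illustrative assumptions, not details from the patent.

#include <stdlib.h>

#define WHEEL_SIZE 127                /* prime loop size, as preferred    */

struct event {
    unsigned long time;               /* full event time, not just a slot */
    void (*action)(void *ctx);
    void *ctx;
    struct event *next;
};

static struct event *wheel[WHEEL_SIZE];   /* one event list per time slot */

/* Events are wrapped around the loop: different event times may land in
   the same slot, so each event carries its absolute time.               */
void wheel_insert(struct event *ev)
{
    unsigned slot = ev->time % WHEEL_SIZE;
    ev->next = wheel[slot];
    wheel[slot] = ev;
}

/* Fire every event scheduled for 'now'; events in the same slot that
   belong to a later cycle around the wheel are left in place.           */
void wheel_fire(unsigned long now)
{
    struct event **p = &wheel[now % WHEEL_SIZE];
    while (*p) {
        if ((*p)->time == now) {
            struct event *ev = *p;
            *p = ev->next;            /* unlink the matching event        */
            ev->action(ev->ctx);
            free(ev);                 /* assumes events are heap-allocated */
        } else {
            p = &(*p)->next;          /* different cycle: skip            */
        }
    }
}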

57 citations


Patent
Leslie D. Kohn1
14 Feb 1989
TL;DR: In this paper, a special purpose instruction is used to reduce the program overhead associated with conditional branching at the end of a program loop by comparing a loop counter with a decrement value.
Abstract: A method and apparatus for providing program loop control in a data processor employs a special purpose instruction that substantially reduces the program overhead associated with conditional branching at the end of a program loop. The instruction first compares a loop counter with a decrement value. If the loop counter has counted down, a loop condition code, which is stored in a dedicated register bit, is cleared. Otherwise, the loop condition code remains set to indicate that further iterations of the loop are required. The decremented value of the loop counter is then stored in a loop counter register. In parallel with decrementing of the loop counter, a conditional branch is executed based on the value of the loop condition code set in the immediately previous iteration of the loop. If the loop condition code is cleared, i.e. if the loop has been completed, program control proceeds to the instruction following the loop after execution of the next instruction in sequence. Conversely, if the loop condition code is set, program control returns to the branch address, i.e. the beginning of the loop, after execution of the next instruction in sequence. All of the operations of the present invention are performed within a single instruction cycle.
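
The C function below is a loose, register-level model of the instruction semantics as the abstract states them: the counter is decremented and the loop condition code updated, while the branch decision is taken from the condition code left by the previous iteration, so the two can proceed in parallel in hardware. The structure fields, the exact compare/decrement ordering, and the handling of the delay-slot instruction are guesses made for illustration only.

/* A rough model of the loop-control instruction; names are invented.   */
struct loop_state {
    long     counter;       /* loop counter register                     */
    int      cc;            /* dedicated loop condition code bit         */
    unsigned pc;            /* address of the instruction to run after
                               the delay-slot instruction                */
};

/* One execution of the instruction.  'step' is the decrement value and
   'branch_addr' the top-of-loop address.                                */
void loop_branch(struct loop_state *s, long step, unsigned branch_addr,
                 unsigned fall_through_addr)
{
    int take_branch = s->cc;          /* decided by the previous iteration */

    s->counter -= step;               /* decrement and compare             */
    s->cc = (s->counter > 0);         /* cleared once the loop counts down */

    /* the next sequential instruction still executes (delay slot); then:  */
    s->pc = take_branch ? branch_addr : fall_through_addr;
}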

37 citations


Patent
23 Jan 1989
TL;DR: In this article, a data flow type information processor includes a program storing portion, a data pair producing portion, and a processing portion, together with a function for synchronizing on all of the loop variables.
Abstract: A data flow type information processor includes a program storing portion, a data pair producing portion and a processing portion. When the processor executes a data flow program having a loop structure, a function for synchronizing on all of the loop variables, that is, a function for assuring that the values of all of the loop variables have been determined in the loop execution stage under consideration, is applied to the group of instruction information that determines loop termination.

33 citations


Patent
24 Apr 1989
TL;DR: In this paper, the authors present a horizontal computer for execution of an instruction loop with overlapped code, which includes a plurality of processors, a multiconnect unit for storing operands, an instruction unit for specifying address offsets and operations to be performed by the processors, and an invariant address unit for combining the address offsets with a modifiable pointer to form source and destination addresses.
Abstract: A horizontal computer for execution of an instruction loop with overlapped code. The computer includes a plurality of processors, a multiconnect unit for storing operands for the processors, an instruction unit for specifying address offsets and operations to be performed by the processors, and an invariant address unit for combining the address offsets with a modifiable pointer to form source and destination addresses in the multiconnect unit. The instruction unit enables different ones of the processors as a function of which iteration of the loop is being executed, for example by means of processor control circuitry or by selectively providing instructions to the processors, so that different operations are performed during different iterations.
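
The fragment below is a toy C model of the addressing scheme the abstract describes: each instruction carries a fixed offset, the invariant address unit combines that offset with a modifiable pointer to form an address into the multiconnect storage, and advancing the pointer once per iteration makes the same (unchanged) loop body touch fresh locations on each iteration. The size, names, and direction of pointer movement are assumptions made for illustration.

#define MC_SIZE 64                       /* illustrative multiconnect size */

static double   multiconnect[MC_SIZE];
static unsigned mc_ptr;                  /* modifiable pointer             */

/* Source/destination address = instruction offset combined with pointer. */
static unsigned mc_addr(unsigned offset)
{
    return (mc_ptr + offset) % MC_SIZE;
}

static double mc_read(unsigned offset)            { return multiconnect[mc_addr(offset)]; }
static void   mc_write(unsigned offset, double v) { multiconnect[mc_addr(offset)] = v; }

/* Advance the pointer once per loop iteration so the offsets encoded in
   the invariant loop code map onto different locations each iteration.   */
static void mc_advance(void)
{
    mc_ptr = (mc_ptr + MC_SIZE - 1) % MC_SIZE;
}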

20 citations


Patent
Vaughn L. Mower1
20 Nov 1989
TL;DR: In this article, a fast acquisition coherent code tracking loop for use in direct sequence spread spectrum systems is provided with an embedded frequency offset loop containing a pair of multipliers: one is coupled to the carrier tracking loop through a scaling circuit, and the other to the output of the carrier tracking loop's highly stable VCO, providing extremely fast phase acquisition of the received PN code.
Abstract: A fast acquisition coherent code tracking loop for use in direct sequence spread spectrum systems is provided with an embedded frequency offset loop. The frequency offset loop in the code tracking loop is provided with a pair of multipliers: one is coupled to the carrier tracking loop through a scaling circuit, and the second is coupled to the output of the highly stable VCO of the carrier tracking loop, providing extremely fast phase acquisition of the received PN code and very high frequency stability of the code tracking loop.

20 citations


Journal ArticleDOI
TL;DR: It is shown that this model-based double iterative loop strategy has an important practical advantage in that it reduces the required number of set point changes to real subprocesses in order to achieve optimality.

15 citations


Proceedings ArticleDOI
01 Aug 1989
TL;DR: Loop spreading restructures parallel loops to balance parallel tasks on multiple processors; this work shows how the method keeps the performance of matrix multiplication and a simplex algorithm from decreasing as the size of the input changes.
Abstract: When the number of processors P is less than the number of tasks N in a parallel loop, the loop has to be executed in ⌈N/P⌉ rounds and the last round executes only (N mod P) tasks. In many cases, in the last round all but a few processors are idle, which causes a significant drop in performance. This performance drop becomes more and more detrimental as the number of processors increases. Loop spreading is a technique for restructuring parallel loops so as to balance parallel tasks on multiple processors. A spread loop runs at least as fast as the non-spread loop even when N mod P = 0, and shows no performance drop when N changes. We show how the method keeps the performance of the matrix multiplication and a simplex algorithm from decreasing as the size of input changes.
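
The small C program below illustrates the imbalance the paper targets, not the paper's spreading transformation itself: a round-by-round schedule of N tasks on P processors runs ceil(N/P) rounds with only N mod P processors busy in the last round, whereas a balanced partition gives every processor either floor(N/P) or ceil(N/P) tasks. The example sizes are arbitrary.

#include <stdio.h>

int main(void)
{
    int N = 10, P = 4;                   /* arbitrary example sizes        */

    int rounds = (N + P - 1) / P;        /* ceil(N/P) rounds               */
    int last   = N % P ? N % P : P;      /* processors busy in last round  */
    printf("round-based: %d rounds, last round uses %d of %d processors\n",
           rounds, last, P);

    /* A balanced partition keeps every processor busy: each one gets
       either floor(N/P) or ceil(N/P) tasks.                              */
    for (int p = 0; p < P; p++) {
        int tasks = N / P + (p < N % P ? 1 : 0);
        printf("processor %d gets %d tasks\n", p, tasks);
    }
    return 0;
}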

6 citations


Patent
21 Oct 1989
TL;DR: In this paper, the authors propose an approach to reduce the overhead of loop execution by placing the decrement, compare, and branch-to-top instructions in hardware, reducing the number of instructions in the loop and speeding loop execution.
Abstract: Method and apparatus to avoid the code space and time overhead of the software loop. Loops (repeatedly executed blocks of instructions) are often used in software and microcode. Loops may be employed for array manipulation, storage initialization, division and square-root interpretation, and micro-interpretation of instructions with variable-length operands. Software creates loops by keeping an iteration count in a register or in memory. During each iteration of the code loop, software decrements the count, and then branches to the "top" of the loop if the count remains nonzero. This apparatus puts the decrement, compare, and branch-to-top into hardware, reducing the number of instructions in the loop and speeding loop execution. Hardware further speeds loop execution by eliminating the wait for the branch to the top-of-loop instruction; that is, it prefetches the top-of-loop instruction near the bottom of the loop. The loop may be initialized for a fixed iteration count, or can accept a variable count in the iteration count register. The apparatus consists of counters for the number of instructions in the loop, an iteration counter, a pointer to the top-of-loop location, and an instruction to initiate the loop.
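
The sketch below is a rough C model of the loop bookkeeping the abstract lists: a top-of-loop pointer, a count of the instructions in the loop, and an iteration counter, used so that no decrement/compare/branch instructions appear in the loop body and the top-of-loop instruction can be fetched as soon as the bottom of the loop is reached. The field names and the fetch-time interface are invented for illustration.

struct hw_loop {
    unsigned      top;       /* address of the top-of-loop instruction    */
    unsigned      length;    /* number of instructions in the loop        */
    unsigned long iters;     /* remaining iteration count                 */
};

/* Called for each fetched instruction address: if the bottom of an
   active loop was just fetched, wrap back to the top (counting one
   iteration) or fall through once the count is exhausted.               */
unsigned hw_loop_next_pc(struct hw_loop *lp, unsigned pc)
{
    unsigned bottom = lp->top + lp->length - 1;

    if (pc == bottom && lp->iters > 1) {
        lp->iters--;
        return lp->top;      /* branch-to-top handled in hardware         */
    }
    if (pc == bottom)
        lp->iters = 0;       /* last iteration: fall out of the loop      */
    return pc + 1;
}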

6 citations


Journal ArticleDOI
TL;DR: A new approach for hierarchical system optimization and parameter estimation of a large-scale industrial process is described, which significantly reduces the on-line iterations required to reach the optimum and has a balanced distribution of model-based computations in the internal double loop.
Abstract: A new approach for hierarchical system optimization and parameter estimation of a large-scale industrial process is described. This approach can be viewed as a hierarchical implementation of the approximate linear model approach with a triple iterative loop structure. The internal double loop iteration, where only model-based information is involved, is a two-model approach. Augmentation is introduced to enforce and accelerate convergence in the internal double loop. The third iterative loop arises from the hierarchical implementation of the technique and involves coordination of price vectors to ensure balance between separable sub-optimization problems. The major advantages of this triple iterative loop configuration are that it significantly reduces the on-line iterations required to reach the optimum, especially when the process is linear or nearly linear, and that it has a balanced distribution of model-based computations in the internal double loop. Optimality and convergence conditions are examined...

Proceedings ArticleDOI
01 Aug 1989
TL;DR: The concept of the reservation table, originally used to develop pipeline control strategies, is extended so that an optimal schedule can be obtained from analysis of the extended reservation table, or scheduling table; the technique makes use of the cyclic regularity of loops.
Abstract: Loop optimization is an important aspect of microcode compaction to minimize execution time. In this paper a new loop optimization technique for horizontal microprograms is presented which makes use of the cyclic regularity of loops. We have extended the concept of the reservation table, which is used to develop a pipeline control strategy, so that both data dependencies and resource conflicts are taken into account. Based on the analysis of the extended reservation table, or scheduling table, an optimal schedule can be obtained. The iterations of a loop are then rearranged to form a new loop body, whose length may be greater than that of the original one, but the average initiation latency between iterations is minimal.
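
The paper's extended scheduling table is not reproduced here, but the short C program below shows the classical reservation-table test it builds on: an initiation latency L between successive iterations causes a resource conflict exactly when some resource row has two marks L time steps apart. The table contents are a made-up example, not taken from the paper.

#include <stdbool.h>
#include <stdio.h>

#define RES   3                   /* resources (rows)                     */
#define TICKS 5                   /* time steps one iteration occupies    */

static const int table[RES][TICKS] = {
    {1, 0, 0, 1, 0},              /* resource 0 busy at t = 0 and t = 3   */
    {0, 1, 1, 0, 0},
    {0, 0, 0, 0, 1},
};

/* Latency L is forbidden iff two iterations started L apart would need
   the same resource in the same time step.                              */
static bool latency_is_forbidden(int L)
{
    for (int r = 0; r < RES; r++)
        for (int t = 0; t + L < TICKS; t++)
            if (table[r][t] && table[r][t + L])
                return true;
    return false;
}

int main(void)
{
    for (int L = 1; L < TICKS; L++)
        printf("initiation latency %d: %s\n", L,
               latency_is_forbidden(L) ? "forbidden" : "allowed");
    return 0;
}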

Patent
20 Jun 1989
TL;DR: In this paper, a compiler for generating code that enables multiple processors to process programs in parallel is presented; the code makes the multiprocessor system operate in the following manner: one iteration of an outer loop in a set of nested loops is assigned to each processor.
Abstract: A compiler for generating code for enabling multiple processors to process programs in parallel. The code enables the multiple processor system to operate in the following manner: one iteration of an outer loop in a set of nested loops is assigned to each processor. If the outer loop contains more iterations than there are processors in the system, the processors are initially assigned the earlier iterations, and the remaining iterations are assigned to the processors as they finish their earlier iterations. Each processor runs the inner loop iterations serially. In order to enforce dependencies in the loops, each processor reports its progress through its iterations of the inner loop to the processor executing the succeeding outer loop iteration, and then waits until the processor computing the preceding outer loop iteration is ahead or behind in processing its inner loop iterations by an amount which guarantees that dependencies will be enforced.
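
The sketch below is a simplified C rendering (using C11 atomics, which the patent of course does not) of the synchronization scheme the abstract describes: the processor assigned outer iteration p runs the inner loop serially, publishes its inner-loop progress for the processor handling the next outer iteration, and before each inner iteration waits until its predecessor is far enough ahead. The dependence distance DELAY, the loop bounds, and all names are illustrative assumptions; the dynamic reassignment of leftover outer iterations is omitted.

#include <stdatomic.h>

#define OUTER 8
#define INNER 1000
#define DELAY 2          /* assumed distance that makes the dependence safe */

static atomic_long progress[OUTER];   /* inner-loop progress per outer iteration */

extern void body(int outer, long inner);   /* hypothetical loop body             */

/* Run by the processor assigned outer iteration 'p'.                            */
void run_outer_iteration(int p)
{
    for (long j = 0; j < INNER; j++) {
        if (p > 0) {
            long need = j + DELAY;         /* predecessor must be this far along */
            if (need > INNER)
                need = INNER;
            while (atomic_load(&progress[p - 1]) < need)
                ;                          /* spin; a real runtime would yield   */
        }

        body(p, j);

        atomic_store(&progress[p], j + 1); /* report progress to the successor   */
    }
}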