
Showing papers on "Loop fission" published in 2008


Book ChapterDOI
29 Mar 2008
TL;DR: This work proposes an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously and finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation.
Abstract: The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant body of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization together with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem. We propose an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as for locality optimization. The approach enables the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism at various levels can be detected. Preliminary studies of the framework show promising results.
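As a rough, hand-written illustration of the kind of code such tiling hyperplanes ultimately describe (a sketch only, not the paper's algorithm or its output; the tile size and the stencil are assumptions), a rectangularly tiled loop nest in C looks like this:

```c
#define N 1024
#define T 64   /* illustrative tile size; the paper derives hyperplanes, not fixed rectangles */

/* Original nest: a simple stencil with regular (affine) array accesses. */
void relax(double a[N][N], const double b[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            a[i][j] = 0.25 * (b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1]);
}

/* Tiled nest: the outer ii/jj loops enumerate tiles (candidates for
 * coarse-grained distribution across processors), while the inner i/j loops
 * execute one tile and reuse b[] while it is still in cache. */
void relax_tiled(double a[N][N], const double b[N][N]) {
    for (int ii = 1; ii < N - 1; ii += T)
        for (int jj = 1; jj < N - 1; jj += T)
            for (int i = ii; i < ii + T && i < N - 1; i++)
                for (int j = jj; j < jj + T && j < N - 1; j++)
                    a[i][j] = 0.25 * (b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1]);
}
```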

231 citations


Patent
19 May 2008
TL;DR: In this article, the compiler replaces non-countable loops with a parallelized loop pattern that uses outlined function calls defined in a parallelization library (PL) in order to speculatively execute iterations of the parallelized loops.
Abstract: A system and method for speculatively parallelizing non-countable loops in a multi-threaded application. A multi-core processor receives instructions for a multi-threaded application. The application may contain non-countable loops. Non-countable loops have an iteration count value that cannot be determined prior to the execution of the non-countable loop, a loop index value that cannot be non-speculatively determined prior to the execution of an iteration of the non-countable loop, and control that is not transferred out of the loop body by a code line in the loop body. The compiler replaces the non-countable loop with a parallelized loop pattern that uses outlined function calls defined in a parallelization library (PL) in order to speculatively execute iterations of the parallelized loop. The parallelized loop pattern is configured to squash and re-execute any speculative thread of the parallelized loop pattern that is signaled to have a transaction failure.
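For concreteness (an invented example, not taken from the patent), a non-countable loop is one like the list traversal below, whose trip count cannot be computed before the loop runs:

```c
struct node { int value; struct node *next; };

/* Non-countable: the iteration count depends on data only known at run time
 * (the length of the list), and the loop "index" (p) cannot be determined
 * non-speculatively before an iteration executes.  The patent's compiler
 * replaces such a loop with a speculative parallel pattern built from
 * outlined calls into its parallelization library; that pattern is not
 * reproduced here. */
long sum_list(const struct node *p) {
    long s = 0;
    while (p != NULL) {
        s += p->value;
        p = p->next;
    }
    return s;
}
```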

38 citations


19 Sep 2008
TL;DR: Loop filters are designed totally in the Z-domain by utilizing a method that minimizes the loop phase jitter, allowing one to operate in ranges where previous methods cannot operate.
Abstract: In a traditional loop filter the product between loop noise bandwidth and integration time (BLT) should remain well below unity in order to ensure the stability of the loop. This constraint, required for having a stable loop, significantly limits the maximum integration time and/or noise bandwidth. The current methodology in designing digital tracking loop filters mostly relies on transforming a continuous-time system into a discrete-time one. This transform, from the S-domain to the Z-domain, is done by means of Laplace to Z-domain mappings, such as the bilinear transform. In these cases, the digital loop will be equivalent to its analog counterpart only if BLT is close to zero (Stephens & Thomas 1995, Lindsey & Chie 1981). As the product BLT increases, the effective loop noise bandwidth and closed loop pole locations deviate from the desired ones and eventually the loop becomes unstable. By designing filters with the controlled-root method the deficiencies of the continuous-update approximation in large BLT applications are avoided (Stephens & Thomas 1995). However, by using this method for the conventional NCOs (denoted as rate-only feedback NCOs in Stephens & Thomas 1995), which are mostly used in software receivers, the BLT is still limited to less than 0.4 for third-order loops. In this paper, by considering the effect of integration and dump in the linear model of the digital phase-locked loop and considering rate-only feedback NCOs, loop filters are designed totally in the Z-domain by utilizing a method that minimizes the loop phase jitter. It is shown that, by using these new filters, a significant improvement for high BLT can be achieved, allowing one to operate in ranges where previous methods cannot operate. As a result, stable loops with higher bandwidths and/or longer integration times can be easily designed. The deficiencies of previous methods are analyzed and the loop instability for large BLT is shown by employing live GPS signals. The new loop filters are implemented in a GPS software receiver and their performance for large BLT is evaluated using live GPS signals for static tests and hardware-simulated signals for dynamic tests.
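For reference, the bilinear transform mentioned above maps an analog (S-domain) loop filter to a digital (Z-domain) one via the substitution below, where T is the update interval; as the abstract notes, the resulting equivalence only holds well when BLT is small.

```latex
% Bilinear (Tustin) mapping from the S-domain to the Z-domain
s \;\approx\; \frac{2}{T}\,\frac{1 - z^{-1}}{1 + z^{-1}},
\qquad
F(z) \;=\; F(s)\Big|_{\,s = \frac{2}{T}\frac{1 - z^{-1}}{1 + z^{-1}}}
```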

35 citations


Patent
30 May 2008
TL;DR: In this paper, various technologies and techniques are disclosed for transforming a sequential loop into a parallel loop for use with a transactional memory system; both open-ended and closed-ended sequential loops can be transformed.
Abstract: Various technologies and techniques are disclosed for transforming a sequential loop into a parallel loop for use with a transactional memory system. Open ended and/or closed ended sequential loops can be transformed to parallel loops. For example, a section of code containing an original sequential loop is analyzed to determine a fixed number of iterations for the original sequential loop. The original sequential loop is transformed into a parallel loop that can generate transactions in an amount up to the fixed number of iterations. As another example, an open ended sequential loop can be transformed into a parallel loop that generates a separate transaction containing a respective work item for each iteration of a speculation pipeline. The parallel loop is then executed using the transactional memory system, with at least some of the separate transactions being executed on different threads.

32 citations


Journal ArticleDOI
TL;DR: This work extends and adapts an existing outer loop pipelining approach known as single dimension software pipelining to generate schedules for field-programmable gate-array (FPGA) hardware coprocessors, suggesting that inclusion of outer loop pipelining in future hardware compilers may be worthwhile.
Abstract: Most hardware compilers apply loop pipelining to increase the parallelism achieved, but pipelining is restricted to only the innermost level in a nested loop. In this work we extend and adapt an existing outer loop pipelining approach known as single dimension software pipelining to generate schedules for field-programmable gate-array (FPGA) hardware coprocessors. Each loop level in nine test loops is pipelined and the resulting schedules are implemented in VHDL and targeted to an Altera Stratix II FPGA. The results show that the fastest solution for all but one of the loops occurs when pipelining is applied one to three levels above the innermost loop. Across the nine test loops we achieve an acceleration over the innermost loop solution of up to seven times, with a mean speedup of 3.2 times. The results suggest that inclusion of outer loop pipelining in future hardware compilers may be worthwhile, as it can allow significantly improved results to be achieved at the cost of a small increase in compile time.

27 citations


Proceedings ArticleDOI
10 Jul 2008
TL;DR: This paper describes a simple and quick method for loop bound calculation using a model checker that can not only find loop bounds for integer iterator variables but also works with practically all kinds of loops.
Abstract: Knowing the boundaries of loops is an important prerequisite for both static and dynamic worst-case execution time (WCET) analysis. However, loop bound calculation is a complex task of its own, and often more effort than planned has to be put into it. This paper describes a simple and quick method for loop bound calculation using a model checker that can not only find loop bounds for integer iterator variables but also works with practically all kinds of loops.
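A hypothetical example of why this matters (the function and numbers are invented, not from the paper): the bound of the loop below is not an explicit iterator limit, yet WCET analysis needs it.

```c
/* The iteration count is floor(log2(n)) for n >= 1, not a literal constant in
 * the source; a model checker can establish a bound such as "at most 31
 * iterations for a 32-bit unsigned n" by exploring the reachable states. */
unsigned highest_bit(unsigned n) {
    unsigned pos = 0;
    while (n > 1) {   /* bound depends on the value range of n, not on a counter */
        n >>= 1;
        pos++;
    }
    return pos;
}
```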

22 citations


Patent
Michael K. Gschwind1
14 Oct 2008
TL;DR: In this article, a safe access range, of the total access range of the array reference in the source code, is identified over which a compiler-based optimization of the loop source code can be safely applied without introducing new exception conditions.
Abstract: Mechanisms are provided for analyzing and optimizing loops with conditional control flow in source code based on array reference safety. Mechanisms are provided for analyzing blocks of the source code to identify a conditional control flow loop having loop source code specifying a total access range for an array reference. A safe access range, of the total access range of the array reference in the loop source code, is identified over which a compiler-based optimization of the loop source code can be safely applied without introducing new exception conditions. The compiler-based optimization of the loop source code is performed based on the identified safe access range to generate optimized code. The optimized code is output for generation of executable code for execution on a processor.
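A hedged sketch of the situation the patent describes (the code, bounds, and names are invented for illustration): the loop below only touches a[i] on iterations where the guard holds, so the sub-range over which the access is provably in bounds is the "safe access range" for any transformation that would make the access unconditional.

```c
#define LEN 256

/* 'n' may exceed LEN; the guarded access is safe as written, but a transformed
 * (e.g. if-converted or unrolled) version that touches a[i] unconditionally is
 * only safe over i < LEN.  A compiler can optimize aggressively over that safe
 * range without introducing new exception conditions and leave the remainder
 * in guarded form. */
double sum_guarded(const double a[LEN], int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i < LEN)          /* conditional control flow inside the loop */
            s += a[i];
    }
    return s;
}
```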

21 citations


Patent
12 Mar 2008
TL;DR: In this article, the authors propose a loop protection mechanism including dynamically determining a link connecting two adjacent nodes of a loop in a communication network in accordance with a predefined criterion, and reconfiguring the loop so that the loop is broken at the determined link, which is an optimal link in terms of the predefined criterion.
Abstract: A loop protection mechanism including dynamically determining a link connecting two adjacent nodes of a loop in a communication network in accordance with a predefined criterion, and reconfiguring the loop so that the loop is broken at the determined link, which is an optimal link in terms of the predefined criterion. Breaking the loop enables, for example, the use of loop-free technologies (e.g. Ethernet) in a physical loop architecture.

21 citations


Patent
12 Nov 2008
TL;DR: In this article, a machine instruction is defined that identifies a loop start, stores a corresponding loop start address on a return stack (or in other suitable storage) and directs fetch logic to take advantage of the identification by retaining in a fetch buffer or instruction cache the instruction(s) beginning at the loop start address, thereby avoiding usual branch delays on subsequent iterations of the loop.
Abstract: Instruction set techniques have been developed to identify explicitly the beginning of a loop body and to code a conditional loop-end in ways that allow a processor implementation to efficiently manage an instruction fetch buffer and/or entries in an instruction cache. In particular, for some computations and processor implementations, a machine instruction is defined that identifies a loop start, stores a corresponding loop start address on a return stack (or in other suitable storage) and directs fetch logic to take advantage of the identification by retaining in a fetch buffer or instruction cache the instruction(s) beginning at the loop start address, thereby avoiding usual branch delays on subsequent iterations of the loop. A conditional loop-end instruction can be used in conjunction with the loop start instruction to discard (or simply mark as no longer needed) the loop start address and the loop body instructions retained in the fetch buffer or instruction cache.

15 citations


Proceedings ArticleDOI
10 Mar 2008
TL;DR: An effective scheduling framework, Register and Memory Sensitive Partitioning (RMSP), is proposed to minimize average schedule length per iteration under register and memory dual constraints for parallel embedded systems.
Abstract: Loops are the most important sections of embedded applications. To achieve high performance, two loop transformation techniques are often applied, namely loop pipelining and loop partitioning. Loop pipelining is an effective approach to increase parallelism and reduce schedule length. Loop partitioning with prefetching increases data locality and hides memory latency. However, loop pipelining increases register pressure and loop partitioning increases the local memory requirement. As most embedded systems have a limited number of registers and limited memory, without careful study these two techniques cannot be applied effectively. In this paper, we propose an effective scheduling framework, Register and Memory Sensitive Partitioning (RMSP), to minimize the average schedule length per iteration under register and memory dual constraints for parallel embedded systems. Experiments show that RMSP reduces schedule length by 14.1% on average compared to previous methods applied directly.

15 citations


Journal ArticleDOI
TL;DR: Two algorithms of Software PIpelining for NEsted loops (SPINE) are proposed based on the fundamental understanding of the properties of software pipelining for nested loops: the SPINE-FULL algorithm generates fully parallelized loops with the minimal overheads and theSPINE-ROW-WISE algorithm achieves the maximal parallelism in an iteration with a fixed row-wise execution sequence.

Dissertation
01 Jan 2008
TL;DR: A hardware loop controller architecture is presented which supports hardware generation from the polyhedral representation used for loop transformations, and several methods are presented to estimate bounds on Ehrhart quasi-polynomials.
Abstract: Current high-level design environments offer little support to implement data-intensive applications on heterogeneous-memory systems; they rather focus on parallelism. This thesis addresses the memory hierarchy problem through high-level transformations of loop structures. The composition of long transformation sequences by combining shorter subsequences is studied together with the influence of the order of applying transformation steps. Several methods are presented to estimate bounds on Ehrhart quasi-polynomials, which can be used to statically evaluate program properties, such as memory usage. Since loop transformations not only influence the data access pattern but also the control complexity, we present a hardware loop controller architecture which supports hardware generation from the polyhedral representation used for loop transformations. The techniques are demonstrated by the semi-automatic generation of an FPGA implementation of an inverse discrete wavelet transform.
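As a small worked example of the kind of quantity involved (not taken from the thesis): the iteration count of a triangular loop nest is an ordinary polynomial in the parameter n, while adding a stride makes it a quasi-polynomial with periodic coefficients.

```latex
% Iterations of: for (i = 0; i < n; i++) for (j = 0; j <= i; j++)
\#\{(i,j) \mid 0 \le i < n,\; 0 \le j \le i\} \;=\; \frac{n(n+1)}{2}

% Iterations of: for (i = 0; i <= n; i += 2)  -- a quasi-polynomial of period 2
\#\{i \mid 0 \le i \le n,\; i \equiv 0 \ (\mathrm{mod}\ 2)\} \;=\;
\begin{cases}
\tfrac{n}{2} + 1 & n \text{ even}\\[2pt]
\tfrac{n+1}{2}   & n \text{ odd}
\end{cases}
```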

Proceedings ArticleDOI
17 Mar 2008
TL;DR: This paper proposes the first dynamic loop scheduler, to the authors' knowledge, that targets scratch-pad memory (SPM) based chip multiprocessors, and presents an experimental evaluation of it, which reveals that the proposed scheduler is very effective in practice and brings between 13.7% and 41.7% performance savings over a static loop scheduling scheme.
Abstract: Executing array based applications on a chip multiprocessor requires effective loop parallelization techniques. One of the critical issues that need to be tackled by an optimizing compiler in this context is loop scheduling, which distributes the iterations of a loop to be executed in parallel across the available processors. Most of the existing work in this area targets cache based execution platforms. In comparison, this paper proposes the first dynamic loop scheduler, to our knowledge, that targets scratch-pad memory (SPM) based chip multiprocessors, and presents an experimental evaluation of it. The main idea behind our approach is to identify the set of loop iterations that access the SPM and those that do not. This information is exploited at runtime to balance the loads of the processors involved in executing the loop nest at hand. Therefore, the proposed dynamic scheduler takes advantage of the SPM in performing the loop iteration-to-processor mapping. Our experimental evaluation with eight array/loop intensive applications reveals that the proposed scheduler is very effective in practice and brings between 13.7% and 41.7% performance savings over a static loop scheduling scheme, which is also tested in our experiments.

Book ChapterDOI
21 Jul 2008
TL;DR: The resemblance with the well-known and widely used loop transformations leads us to take concepts and results from this domain and see how they fit in the Array OL context.
Abstract: The Array OL specification model is a mixed graphical-textual language designed to model multidimensional intensive signal processing applications. Data and task parallelism are specified directly in the model. High-level transformations are defined on this model, allowing the refactoring of an application and furthermore providing directions for optimization. The resemblance with the well-known and widely used loop transformations leads us to take concepts and results from this domain and see how they fit in the Array OL context.

Proceedings ArticleDOI
23 Sep 2008
TL;DR: A new technique for optimizing loops that contain kernels mapped on a reconfigurable fabric by combining unrolling with shifting to relocate the function calls contained in the loop body such that in every iteration of the transformed loop, software functions execute in parallel with multiple instances of the kernel.
Abstract: Loops are an important source of optimization. In this paper, we propose a new technique for optimizing loops that contain kernels mapped on a reconfigurable fabric. We assume the Molen machine organization and programming paradigm as our framework. The method we propose extends our previous work on loop unrolling for reconfigurable architectures by combining unrolling with shifting to relocate the function calls contained in the loop body such that in every iteration of the transformed loop, software functions (running on GPP) execute in parallel with multiple instances of the kernel (running on FPGA). The algorithm is based on profiling information about the kernel's execution times on GPP and FPGA, memory transfers and area utilization. In the experimental part, we apply this method to a loop nest extracted from MPEG2 encoder containing the DCT kernel. The achieved speedup is 19.65x over software execution and 1.8x over loop unrolling.
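A simplified sketch of the unroll-and-shift idea (the function names, the unroll factor of two, and the trivial placeholder bodies are assumptions; the real method targets the Molen organization and is driven by profiling of kernel execution times, memory transfers, and area):

```c
#include <stdio.h>

/* Placeholders invented for this sketch: the software part that runs on the
 * GPP and the kernel that would be mapped to the FPGA. */
static void prepare(int i) { printf("prepare %d\n", i); }
static void kernel(int i)  { printf("kernel  %d\n", i); }

/* Original loop: software step and kernel call strictly alternate. */
void original(int n) {
    for (int i = 0; i < n; i++) {
        prepare(i);
        kernel(i);
    }
}

/* Unrolled by 2 and shifted: the prepare() calls for later iterations are
 * issued next to two kernel instances, so (on the real target) GPP work can
 * overlap with multiple FPGA kernel executions.  In sequential C the overlap
 * is only notional; the point is the reordering that makes it legal, assuming
 * prepare(i) only needs to precede kernel(i). */
void unrolled_shifted(int n) {
    if (n <= 0) return;
    prepare(0);
    for (int i = 0; i < n; i += 2) {
        if (i + 1 < n) prepare(i + 1);
        if (i + 2 < n) prepare(i + 2);
        kernel(i);                      /* two kernel instances per iteration */
        if (i + 1 < n) kernel(i + 1);
    }
}
```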

Proceedings Article
16 Sep 2008
TL;DR: It is shown that for disjunctive logic programs, loop formulas of loops with no external support can be computed in polynomial time, and that an iterative procedure using unit propagation on these formulas and the program completion computes the well-founded models in the case of normal logic programs and the least fixed point of a simplification operator used by DLV for disjunctive logic programs.
Abstract: If a loop has no external support rules, then its loop formula is equivalent to a set of unit clauses; and if it has exactly one external support rule, then its loop formula is equivalent to a set of binary clauses. In this paper, we consider how to compute these loops and their loop formulas in a normal logic program, and use them to derive consequences of a logic program. We show that an iterative procedure based on unit propagation, the program completion and the loop formulas of loops with no external support rules can compute the same consequences as the "Expand" operator in smodels, which is known to compute the well-founded model when the given normal logic program has no constraints. We also show that using the loop formulas of loops with at most one external support rule, the same procedure can compute more consequences, and these extra consequences can help ASP solvers such as cmodels to find answer sets of certain logic programs.

Patent
16 May 2008
TL;DR: In this paper, a loop is first simdized as if the memory unit imposes no alignment constraints, and then the compiler inserts data reorganization operations to satisfy the actual alignment requirements of the hardware.
Abstract: An approach is provided for vectorizing misaligned references in compiled code for SIMD architectures that support only aligned loads and stores. In this framework, a loop is first simdized as if the memory unit imposes no alignment constraints. The compiler then inserts data reorganization operations to satisfy the actual alignment requirements of the hardware. Finally, the code generation algorithm generates SIMD codes based on the data reorganization graph, addressing realistic issues such as runtime alignments, unknown loop bounds, residual iteration counts, and multiple statements with arbitrary alignment combinations. Loop peeling is used to reduce the computational overhead associated with misaligned data. A loop prologue and epilogue are peeled from individual iterations in the simdized loop, and vector-splicing instructions are applied to the peeled iterations, while the steady-state loop body incurs no additional computational overhead.
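A minimal scalar sketch of the peeling idea (this is not the patent's code-generation algorithm; vector intrinsics are omitted, and the 16-byte alignment, vector width, and names are assumptions): a prologue is peeled until the store address is aligned, the steady-state body processes whole aligned chunks, and an epilogue handles the residual iterations.

```c
#include <stdint.h>
#include <stddef.h>

#define VLEN 4   /* assumed vector width: 16 bytes = 4 floats */

void scale(float *a, const float *b, float s, size_t n) {
    size_t i = 0;

    /* Peeled prologue: run scalar iterations until a[i] is 16-byte aligned. */
    while (i < n && ((uintptr_t)&a[i] % 16u) != 0) {
        a[i] = s * b[i];
        i++;
    }

    /* Steady-state body: whole aligned chunks of VLEN elements.  A simdizing
     * compiler would emit aligned vector loads/stores here, plus whatever
     * data reorganization is needed if b has a different misalignment than a. */
    for (; i + VLEN <= n; i += VLEN)
        for (size_t k = 0; k < VLEN; k++)
            a[i + k] = s * b[i + k];

    /* Peeled epilogue: residual iterations that do not fill a vector. */
    for (; i < n; i++)
        a[i] = s * b[i];
}
```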

Journal Article
TL;DR: This communication provides an approach for the application of PID controllers within a cascade control system configuration based on considerations about the expected operating modes of both controllers, and proposes the use of a tuning that provides a balanced set-point / load-disturbance performance for the secondary controller.
Abstract: This communication provides an approach for the application of PID controllers within a cascade control system configuration. Based on considerations about the expected operating modes of both controllers, the tunings of both the inner and outer loop controllers are selected accordingly. This fact motivates the use of a tuning for the secondary controller that provides a balanced set-point / load-disturbance performance. A new approach is also provided for the assimilation of the inner closed-loop transfer function to a suitable form for tuning of the outer controller. Because this inevitably introduces unmodelled dynamics into the design of the primary controller, a robust tuning is needed. The introduction and use of an additional sensor that allows for a separation of the fast and slow dynamics of the process results in a nested loop configuration, as shown in figure (1). Each loop has associated with it a corresponding PID controller. The controller of the inner loop is called the secondary controller, whereas the controller of the outer loop is the primary controller, the output of the primary loop being the variable of interest. The rationale behind this configuration is that the fast dynamics of the inner loop will provide faster disturbance attenuation and minimize the possible effect of disturbances before they affect the primary output. This setup involves two controllers; it is therefore necessary to tune both PIDs. The usual approach involves tuning the secondary controller while setting the primary controller in manual mode. In a second step, the primary controller is tuned by considering the secondary controller acting on the inner loop. It is therefore a more complicated design procedure than that of a standard single-loop based PID control system. In this paper a design issue that has not been addressed is considered: the tradeoff between the performance for set-point and load-disturbance response. When a load disturbance occurs at the primary loop, the global load-disturbance response depends on the set-point tracking performance of the secondary loop. In addition, good load-disturbance performance is expected for the secondary controller in order to attenuate disturbances that enter directly at the secondary loop. Also, it is well known that when the controller is optimally tuned for set-point response, the load-disturbance performance can be very poor (1). Based on this observation, this paper proposes the use of a balanced performance tuning (2) for the secondary loop. Furthermore, an approximation procedure is provided in order to assimilate the dynamics seen by the primary controller to a First-Order-Plus-Time-Delay model such that usual tuning rules for PID control can be applied. However, a robust tuning is suggested here, because the primary controller will need to face unmodelled dynamics coming from the model approximation used for the secondary loop. Note that this kind of approximation is always needed if simple model-based tuning rules are to be applied. The rest of the paper is organized as follows. The next section presents the cascade control configuration and control setup to be used. Section 3 provides the main contribution of the paper: the design approach involving tuning of the controllers and the approximation method. Section 4 presents an application example, whereas Section 5 ends with some conclusions and suggestions for further research.
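For context (a standard textbook form, not a result from the paper), the First-Order-Plus-Time-Delay model to which the inner closed loop is assimilated has the transfer function:

```latex
% First-Order-Plus-Time-Delay (FOPTD) approximation of the inner closed loop:
% K is the gain, T the time constant and L the apparent dead time.
G(s) \;=\; \frac{K\,e^{-Ls}}{T s + 1}
```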

Proceedings ArticleDOI
19 Oct 2008
TL;DR: This work presents a novel loop transformation technique, particularly well suited for optimizing embedded compilers, where an increase in compilation time is acceptable in exchange for significant performance increase.
Abstract: We present a novel loop transformation technique, particularly well suited for optimizing embedded compilers, where an increase in compilation time is acceptable in exchange for significant performance increase. The transformation technique optimizes loops containing nested conditional blocks. Specifically, the transformation takes advantage of the fact that the Boolean value of the conditional expression, determining the true/false paths, can be statically analyzed using a novel interval analysis technique that can evaluate conditional expressions in the general polynomial form. Results from interval analysis combined with loop dependency information is used to partition the iteration space of the nested loop. In such cases, the loop nest is decomposed such as to eliminate the conditional test, thus substantially reducing the execution time. Our technique completely eliminates the conditional from the loops (unlike previous techniques) thus further facilitating the application of other optimizations and improving the overall speedup. Applying the proposed transformation technique on loop kernels taken from Mediabench, SPEC-2000, mpeg4, qsdpcm and gimp, on average we measured a 175% (1.75X) improvement of execution time when running on a SPARC processor, a 336% (4.36X) improvement of execution time when running on an Intel Core Duo processor and a 198.9% (2.98X) improvement of execution time when running on a PowerPC G5 processor.
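A hand-made example of the kind of rewrite described (the condition and bounds are invented; the paper's interval analysis handles conditions in general polynomial form): when the branch outcome is a static function of the iterator, the iteration space can be partitioned so each resulting loop is branch-free.

```c
/* Original: the branch is taken exactly when i < m, which interval analysis
 * can determine statically from the loop bounds. */
void original(float out[], const float x[], const float y[], int n, int m) {
    for (int i = 0; i < n; i++) {
        if (i < m)
            out[i] = x[i] + y[i];
        else
            out[i] = x[i] - y[i];
    }
}

/* Transformed: the iteration space [0,n) is split at m, so the conditional
 * disappears from both loop bodies.  (Assumes 0 <= m <= n; a real compiler
 * would also handle the boundary cases.) */
void partitioned(float out[], const float x[], const float y[], int n, int m) {
    for (int i = 0; i < m; i++)
        out[i] = x[i] + y[i];
    for (int i = m; i < n; i++)
        out[i] = x[i] - y[i];
}
```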

Patent
04 Jun 2008
TL;DR: In this paper, data dependence testing for loop fusion, e.g., with code replication, array contraction, and/or loop interchange, is described, and a compiler may optimize code for efficient execution during run-time by testing for dependencies associated with improving memory locality through code replication in loops that enable various loop transformations.
Abstract: Methods and apparatus for data dependence testing for loop fusion, e.g., with code replication, array contraction, and/or loop interchange, are described. In one embodiment, a compiler may optimize code for efficient execution during run-time by testing for dependencies associated with improving memory locality through code replication in loops that enable various loop transformations. Other embodiments are also described.
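A small hand-written illustration of the combination the patent covers (not its actual method; the arrays and sizes are placeholders): two loops that communicate through a temporary array can, once a dependence test shows fusion is legal, be fused and the array contracted to a scalar.

```c
#define N 1000

/* Before: t[] carries values from the first loop to the second. */
void before(double c[N], const double a[N], const double b[N]) {
    double t[N];
    for (int i = 0; i < N; i++)
        t[i] = a[i] * b[i];
    for (int i = 0; i < N; i++)
        c[i] = t[i] + a[i];
}

/* After fusion + array contraction: each t[i] is produced and consumed in the
 * same iteration, so the whole array shrinks to a scalar and the intermediate
 * values never go through memory. */
void after(double c[N], const double a[N], const double b[N]) {
    for (int i = 0; i < N; i++) {
        double t = a[i] * b[i];
        c[i] = t + a[i];
    }
}
```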

Journal ArticleDOI
TL;DR: This paper presents a cache-conscious analytical model for profitable loop fusion and uses this model to tune fusion parameters for different architectures through empirical search, showing significant speedup over fully optimised code generated by state-of-the-art commercial compilers.
Abstract: Loop fusion is recognised as an effective transformation for improving memory hierarchy performance. However, unconstrained loop fusion can lead to poor performance because of increased register pressure and cache conflict misses. In this paper, we present a cache-conscious analytical model for profitable loop fusion. We use this model to tune fusion parameters for different architectures through empirical search. Experiments on four different platforms for a set of applications show significant speedup over fully optimised code generated by state-of-the-art commercial compilers.

Proceedings ArticleDOI
18 May 2008
TL;DR: To examine the lock-in behaviour of the alias-locked loop, a nonlinear model is developed and used to simulate the architecture in the locked state, demonstrating the existence of stable modes of operation with bounded orbits.
Abstract: This paper presents a phase-locked loop (PLL) using an aliasing divider, referred to as an alias-locked loop (ALL). The ALL architecture makes it possible to create high-speed frequency synthesis circuits without relying on a traditional divider in the feedback path. To examine the lock-in behaviour of the alias-locked loop, a nonlinear model is developed and used to simulate the architecture in the locked state. Simulation results demonstrate the existence of stable modes of operation with bounded orbits. A version of the ALL is designed in 90-nm CMOS technology and simulated.

Patent
17 Apr 2008
TL;DR: In this article, a method, apparatus, and computer program are provided for assessing fairness, performance, and livelock in a logic development process utilizing comparative parallel looping, and the forward performance of each of the multiple processor threads is compared with each other.
Abstract: A method, apparatus, and computer program are provided for assessing fairness, performance, and livelock in a logic development process utilizing comparative parallel looping. Multiple loop macros are generated, the multiple loop macros respectively correspond to multiple processor threads, and the multiple loop macros are parallel comparative loop macros. The multiple processor threads for the multiple loop macros are executed in which a common resource is accessed. A forward performance of each of the multiple processor threads is verified. The forward performance of the multiple processor threads is compared with each other. It is determined whether any of the multiple processor threads fails to meet a minimum loop count or a minimum loop time. It is determined whether any of the multiple processor threads exceeds a maximum loop count or a maximum loop time. It is recognized whether fairness is maintained during the execution of the multiple processor threads.

01 Jan 2008
TL;DR: This paper presents an equation-based model to estimate the area and clock frequency of the loop controller during high-level synthesis and manages to keep estimation errors reasonably low, so the estimation model can be used during design space exploration.
Abstract: High-level synthesis overcomes the high design effort required when using an FPGA by moving the hardware design to a higher abstraction level. At this higher level, loop transformations are used to improve the characteristics of the program. These transformations have a large impact on the resulting hardware, but their impact is only known after the time-consuming synthesis steps. This hinders fast design space exploration. In this paper, we tackle this issue by estimating the performance of the hardware loop controller, an often overlooked component in other approaches. We present an equation-based model to estimate the area and clock frequency of the loop controller during high-level synthesis. In our approach, we manage to keep estimation errors reasonably low, so our estimation model can be used during design space exploration. Due to its simplicity, the overhead is minimal, which is critical when many design variants need to be estimated.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper proposes a modular control error compensator aimed at mitigating the performance degradation caused when the inner loop specifications are not achieved and shows that this compensator can be designed using μ synthesis and proposed an iterative procedure to optimize performance based on two concrete worst-case metrics.
Abstract: For complex dynamic systems, a modular control design process is often employed, wherein the overall design is partitioned into smaller modules. The designers of each module only possess a model for a particular subset of the entire plant as well as closed loop performance specifications for the other module(s). In this paper, we will examine a common modular control strategy in which an outer loop controller computes a desired virtual control input and the inner loop computes real control inputs in order to achieve this desired virtual control input as closely as possible. The outer loop design is based on a specification for the inner loop, which may not always be achieved. We propose a modular control error compensator that is aimed at mitigating the performance degradation caused when the inner loop specifications are not achieved. We show that this compensator can be designed using μ synthesis and propose an iterative procedure to optimize performance based on two concrete worst-case metrics. The effectiveness of the proposed compensator is shown through an automotive example.

Proceedings ArticleDOI
05 May 2008
TL;DR: A closed loop control scheme for a tele-echography system with communication delays is here experimentally developed and incorporated in an Internal Model Control (IMC) closed loop structure.
Abstract: A closed loop control scheme for a tele-echography system with communication delays is here experimentally developed. Nowadays, available tele-echography systems work in an open loop manner, without taking into account the communication delays. We propose here to apply the closed loop approach we previously developed theoretically. First, we briefly recall the key steps of this approach, which uses partial differential equation (PDE) modeling of the delays, leading to an infinite dimensional model of the overall tele-operation system that we incorporate in an Internal Model Control (IMC) closed loop structure. We apply it in real-time thanks to a tele-operation testing platform which has been built so that various network and physical parameters are adjustable. It is composed of a master PC and a slave PC, communicating on a WLAN, and of a Phantom device which must track the positions given by a joystick. Finally, this paper shows the results obtained with a basic predictive control tested on the platform.

Journal ArticleDOI
TL;DR: A new loop fusion algorithm that is capable of fusing loop nests in the presence of fusion-preventing anti-dependences is presented, which reduces data cache misses, and improves the performance results of both sequential and parallel versions of the Jacobi program.
Abstract: Traditionally, loop nests are fused only when the data dependences in the loop nests are not violated. This paper presents a new loop fusion algorithm that is capable of fusing loop nests in the presence of fusion-preventing anti-dependences. All the violated anti-dependences are removed by automatic array copying. As a case study, this aggressive loop fusion strategy is applied to a Jacobi solver. The performance of iterative methods is typically limited by the speed of the memory system. Fusing the two loop nests in the Jacobi solver into one reduces data cache misses, and consequently, improves the performance results of both sequential and parallel versions of the Jacobi program, as validated by our experimental results on an HP AlphaServer SC45 supercomputer.
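A condensed 1-D sketch of the idea (the paper works with the full 2-D Jacobi solver; the scalar copies here stand in for the automatic array copying it describes): the anti-dependence that normally prevents fusing the update loop with the copy-back loop is removed by preserving the values the fused body still needs to read.

```c
#define N 1024

/* Unfused Jacobi sweep: one loop nest computes the new values, a second one
 * copies them back. */
void jacobi_unfused(double old[N], double nxt[N]) {
    for (int i = 1; i < N - 1; i++)
        nxt[i] = 0.5 * (old[i - 1] + old[i + 1]);
    for (int i = 1; i < N - 1; i++)
        old[i] = nxt[i];
}

/* Fused sweep: writing old[i] inside the same loop would violate the
 * anti-dependence on old[i], which iteration i+1 still reads as its left
 * neighbour.  The value is therefore saved before being overwritten
 * ('prev'/'cur' are a scalar stand-in for the paper's array copying). */
void jacobi_fused(double old[N], double nxt[N]) {
    double prev = old[0];
    for (int i = 1; i < N - 1; i++) {
        double cur = old[i];                 /* save before overwriting */
        nxt[i] = 0.5 * (prev + old[i + 1]);
        old[i] = nxt[i];
        prev = cur;
    }
}
```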

Journal ArticleDOI
TL;DR: This paper proposes a new loop filter design which, in addition to satisfying the prescribed lock-in range specification, achieves several other performance requirements as well, such as small noise bandwidth and good transient response (small settling time, small overshoot).

Book ChapterDOI
27 Jan 2008
TL;DR: The first step toward a loop-conscious processor architecture that has great potential to achieve high performance and relatively low energy consumption is taken, namely the proposed loop window, which can directly feed the execution backend queues with instructions.
Abstract: Current processors frequently run applications containing loop structures. However, traditional processor designs do not take into account the semantic information of the executed loops, failing to exploit an important opportunity. In this paper, we take our first step toward a loop-conscious processor architecture that has great potential to achieve high performance and relatively low energy consumption. In particular, we propose to store simple dynamic loops in a buffer, namely the loop window. Loop instructions are kept in the loop window along with all the information needed to build the rename mapping. Therefore, the loop window can directly feed the execution backend queues with instructions, avoiding the need for using the prediction, fetch, decode, and rename stages of the normal processor pipeline. Our results show that the loop window is a worthwhile complexity-effective alternative for processor design that reduces front-end activity by 14% for SPECint benchmarks and by 45% for SPECfp benchmarks.

01 Jan 2008
TL;DR: This paper studies the effect that the execution of the Linux spin-lock loop in the Sun UltraSPARC T1 and T2 processors introduces on other running tasks, especially in the worst case scenario where the workload shows high contention on a lock.
Abstract: —Spin locks are a synchronization mechanisms used to provide mutual exclusion to shared software resources. Spin locks are used over other synchronization mechanisms in several situations, like when the average waiting time to obtain the lock is short, in which case the probability of getting the lock is high, or when it is no possible to use other synchronization mechanisms. In this paper, we study the effect that the execution of the Linux spin-lock loop in the Sun UltraSPARC T1 and T2 processors introduces on other running tasks, especially in the worst case scenario where the workload shows high contention on a lock. For this purpose, we create a task that continuously executes the spin-lock loop and execute several instances of this task together with another active tasks. Our results show that, when the spin-lock tasks run with other applications in the same core of a T1 or a T2 processor, they introduce a significant overhead on other applications: 31% in T1 and 42% in T2, on average, respectively. For the T1 and T2 processors, we identify the fetch bandwidth as the main source of interaction between active threads and the spin-lock threads. We, propose 4 different variants of the Linux spin-lock loop that require less fetch bandwidth. Our proposal reduces the overhead of the spin-lock tasks over the other Applications down to 3.5% and 1.5% on average, in T1 and T2 respectively. This is a reduction of 28 percentage points with respect to the Linux spin-lock loop for T1. For T2 the reduction is about 40 percentage points.