
Showing papers in "Design Automation for Embedded Systems in 2002"


Journal ArticleDOI
TL;DR: This paper compares three heuristic search algorithms: genetic algorithm (GA), simulated annealing (SA) and tabu search (TS), for hardware–software partitioning and shows that TS is superior to SA and GA in terms of both search time and quality of solutions.
Abstract: This paper compares three heuristic search algorithms: genetic algorithm (GA), simulated annealing (SA) and tabu search (TS), for hardware–software partitioning. The algorithms operate on functional blocks for designs represented as directed acyclic graphs, with the objective of minimising processing time under various hardware area constraints. The comparison involves a model for calculating processing time based on a non-increasing first-fit algorithm to schedule tasks, given that shared resource conflicts do not occur. The results show that TS is superior to SA and GA in terms of both search time and quality of solutions. In addition, we have implemented an intensification strategy in TS called penalty reward, which can further improve the quality of results.

142 citations
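As an illustration of the kind of move-based search the comparison above covers, here is a minimal tabu-search sketch for hardware–software partitioning. The task data, cost model (serial execution time under an area budget), and tabu tenure are all invented for illustration; the paper's scheduling-based processing-time model and its penalty-reward intensification are not reproduced here.

```python
import random

def cost(assign, tasks, area_budget):
    """Total processing time under a toy serial model; infeasible if the
    hardware area budget is exceeded."""
    time = sum(t["hw_time"] if a else t["sw_time"] for a, t in zip(assign, tasks))
    area = sum(t["hw_area"] for a, t in zip(assign, tasks) if a)
    return time if area <= area_budget else float("inf")

def tabu_partition(tasks, area_budget, iters=200, tenure=4, seed=0):
    rng = random.Random(seed)
    n = len(tasks)
    cur = [False] * n                      # start with everything in software
    best, best_cost = cur[:], cost(cur, tasks, area_budget)
    tabu = {}                              # task index -> iteration until which it is tabu
    for it in range(iters):
        candidates = []
        for i in range(n):
            nxt = cur[:]
            nxt[i] = not nxt[i]            # move: flip one task between HW and SW
            c = cost(nxt, tasks, area_budget)
            # aspiration: a tabu move is allowed if it beats the best known cost
            if tabu.get(i, -1) < it or c < best_cost:
                candidates.append((c, i, nxt))
        if not candidates:
            continue
        c, i, cur = min(candidates, key=lambda x: (x[0], rng.random()))
        tabu[i] = it + tenure              # forbid undoing this move for `tenure` iterations
        if c < best_cost:
            best, best_cost = cur[:], c
    return best, best_cost
```

On a tiny three-task instance the loop quickly settles on moving the two most profitable tasks to hardware while the third stays in software to respect the area budget.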


Journal ArticleDOI
TL;DR: A modular, flexible, and scalable heterogeneous multi-processor architecture template based on distributed shared memory is proposed and an efficient and transparent protocol for communication and (re)configuration is presented, enabling incremental design.
Abstract: The key issue in the design of Systems-on-a-Chip (SoC) is to trade off efficiency against flexibility, and time to market versus cost. Current deep submicron processing technologies enable integration of multiple software programmable processors (e.g., CPUs, DSPs) and dedicated hardware components into a single cost-efficient IC. Our top-down design methodology with various abstraction levels helps design these ICs in a reasonable amount of time. This methodology starts with a high-level executable specification, and converges towards a silicon implementation. A major task in the design process is to ensure that all components (hardware and software) communicate with each other correctly. In this article, we tackle this problem in the context of the signal processing domain in two ways: we propose a modular, flexible, and scalable heterogeneous multi-processor architecture template based on distributed shared memory, and we present an efficient and transparent protocol for communication and (re)configuration. The protocol implementations have been incorporated in libraries, which allows quick traversal of the various abstraction levels, so enabling incremental design. The design decisions to be taken at each abstraction level are evaluated by means of (co-)simulation. Prototyping is used too, to verify the system's functional correctness. The effectiveness of our approach is illustrated by a design case of a multi-standard video and image codec.

93 citations


Journal ArticleDOI
TL;DR: A number of high-level intermediate representations for compiling dataflow programs onto self-timed DSP platforms are reviewed, including representations for modeling the placement of interprocessor communication (IPC) operations; separating synchronization from data transfer during IPC; modeling and optimizing linear orderings of communication operations; performing accurate design space exploration under communication resource contention.
Abstract: Self-timed scheduling is an attractive implementation style for multiprocessor DSP systems due to its ability to exploit predictability in application behavior, its avoidance of over-constrained synchronization, and its simplified clocking requirements. However, analysis and optimization of self-timed systems under real-time constraints is challenging due to the complex, irregular dynamics of self-timed operation. In this paper, we review a number of high-level intermediate representations for compiling dataflow programs onto self-timed DSP platforms, including representations for modeling the placement of interprocessor communication (IPC) operations; separating synchronization from data transfer during IPC; modeling and optimizing linear orderings of communication operations; performing accurate design space exploration under communication resource contention; and exploring alternative processor assignments during the synthesis process. We review the structure of these representations, and discuss efficient techniques that operate on them to streamline scheduling, communication synthesis, and power management of multiprocessor DSP implementations.

61 citations


Journal ArticleDOI
TL;DR: A design flow is presented that finds critical software loops automatically and manually re-implements them in configurable logic by expressing them in SA-C, a C language variation supporting a dataflow computation model and designed to specify and map DSP applications onto reconfigurable logic.
Abstract: We examine the energy and performance benefits that can be obtained by re-mapping frequently executed loops from a microprocessor to reconfigurable logic. We present a design flow that finds critical software loops automatically and manually re-implements these loops in configurable logic by expressing them in SA-C, a C language variation supporting a dataflow computation model and designed to specify and map DSP applications onto reconfigurable logic. We apply this design flow to several examples from the MediaBench benchmark suite and report the energy and performance improvements.

50 citations


Journal ArticleDOI
TL;DR: A system-level design methodology for the efficient exploration of the architectural parameters of the memory sub-systems, from the energy-delay joint perspective, based on the EDP metric taking into consideration both performance and energy constraints is proposed.
Abstract: In this paper, we propose a system-level design methodology for the efficient exploration of the architectural parameters of the memory sub-systems, from the energy-delay joint perspective. The aim is to find the best configuration of the memory hierarchy without performing the exhaustive analysis of the parameters space. The target system architecture includes the processor, separated instruction and data caches, the main memory, and the system buses. To achieve a fast convergence toward the near-optimal configuration, the proposed methodology adopts an iterative local-search algorithm based on the sensitivity analysis of the cost function with respect to the tuning parameters of the memory sub-system architecture. The exploration strategy is based on the Energy-Delay Product (EDP) metric, taking into consideration both performance and energy constraints. The effectiveness of the proposed methodology has been demonstrated through the design space exploration of a real-world case study: the optimization of the memory hierarchy of a MicroSPARC2-based system executing the set of Mediabench benchmarks for multimedia applications. Experimental results have shown an optimization speedup of 2 orders of magnitude with respect to the full search approach, while the near-optimal system-level configuration found is within approximately 2% of the optimal full-search configuration.

44 citations
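The sensitivity-driven exploration above can be illustrated with a toy Energy-Delay Product search. The analytic energy and delay models below are invented stand-ins for simulation results, and the one-parameter-at-a-time greedy descent is only a sketch of the iterative local-search idea, not the paper's algorithm:

```python
def edp(cfg):
    """Toy energy-delay model over (log2 icache size, log2 dcache size).
    Bigger caches cut miss delay but cost energy; purely illustrative."""
    ic, dc = cfg
    delay = 100 / (1 + ic) + 100 / (1 + dc)      # fewer misses with bigger caches
    energy = 2 + 0.3 * (2 ** ic + 2 ** dc)       # per-access energy grows with size
    return energy * delay

def local_search(start, lo=0, hi=6):
    """Greedy sensitivity-style descent: repeatedly move one cache parameter
    by one step in the direction that improves EDP the most."""
    cur = start
    while True:
        neighbors = []
        for i in (0, 1):
            for step in (-1, 1):
                v = list(cur)
                v[i] += step
                if lo <= v[i] <= hi:
                    neighbors.append(tuple(v))
        best = min(neighbors, key=edp)
        if edp(best) >= edp(cur):                # no neighbor improves: done
            return cur
        cur = best
```

On this toy model the descent evaluates only a handful of configurations yet reaches the same optimum an exhaustive sweep of the 7×7 grid would find, which is the kind of saving versus full search the abstract reports.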


Journal ArticleDOI
TL;DR: It is shown that high-performing FPGA realizations can in principle be obtained in a fraction of the design time currently employed to realize a parameterized implementation.
Abstract: Compaan is a software tool that is capable of automatically translating nested loop programs, written in Matlab, into parallel process network descriptions suitable for implementation in hardware. In this article, we show a methodology and tool to convert these process networks into FPGA implementations. We will show that we can in principle obtain high-performing realizations in a fraction of the design time currently employed to realize a parameterized implementation. This allows us to rapidly explore a range of transformations, such as loop unrolling and skewing, to generate a circuit that meets the requirements of a particular application. The QR decomposition algorithm is used to demonstrate the capability of the tool. We present results showing how the number of clock cycles and calculations-per-second vary with these transformations using a simple implementation of the function units. We also provide an indication of what we expect to achieve in the near future once the tools are completed and the transformations are applied to parallel, highly pipelined implementations of the function units.

43 citations


Journal ArticleDOI
TL;DR: A formal approach to the development of embedded controllers for a railway, using correctness-preserving refinement to add implementation detail to the models and to decompose them into sub-systems, arriving at models of individual controllers; the B Method is used as the formal notation and methodology.
Abstract: We describe a formal approach to the development of embedded controllers for a railway. The approach starts with a system-level specification modeling the system under control and the desired control behavior. Correctness-preserving refinement is then used to add more and more implementation detail to the models and to decompose the models into sub-systems to arrive at models of individual controllers. The B Method is used as the formal notation and methodology.

31 citations


Journal ArticleDOI
TL;DR: A novel, efficient, small and very simple hardware unit, the SoC Lock Cache (SoCLC), which resolves the critical section (CS) interactions among multiple processors and improves performance in terms of lock latency, lock delay and bandwidth consumption in a shared-memory multiprocessor SoC.
Abstract: In this dissertation, we implement efficient lock-based synchronization by a novel, high performance, simple and scalable hardware technique and associated software for a target shared-memory multiprocessor System-on-a-Chip (SoC). The custom hardware part of our solution is provided in the form of an intellectual property (IP) hardware unit which we call the SoC Lock Cache (SoCLC). SoCLC provides effective lock hand-off by reducing on-chip memory traffic and improving performance in terms of lock latency, lock delay and bandwidth consumption. The proposed solution is independent from the memory hierarchy, cache protocol and the processor architectures used in the SoC, which enables easily applicable implementations of the SoCLC (e.g., as a reconfigurable or partially/fully custom logic), and which distinguishes SoCLC from previous approaches. Furthermore, the SoCLC mechanism has been extended to support priority inheritance with an immediate priority ceiling protocol (IPCP) implemented in hardware, which enhances the hard real-time performance of the system. Our experimental results in a four-processor SoC indicate that SoCLC can achieve up to 37% overall speedup over spin-lock and up to 48% overall speedup over MCS for a microbenchmark with false sharing. The priority inheritance implemented as part of the SoCLC hardware, on the other hand, achieves 1.43X speedup in overall execution time of a robot application when compared to the priority inheritance implementation under the Atalanta real-time operating system. Furthermore, it has been shown that with the IPCP mechanism integrated into the SoCLC, all of the tasks of the robot application could meet their deadlines (e.g., a high priority task with 250us worst case response time could complete its execution in 93us with SoCLC, however the same task missed its deadline by completing its execution in 283us without SoCLC). 
Therefore, with IPCP support, our solution can provide better real-time guarantees for real-time systems. To automate SoCLC design, we have also developed an SoCLC-generator tool, PARLAK, that generates user specified configurations of a custom SoCLC. We used PARLAK to generate SoCLCs from a version for two processors with 32 lock variables occupying 2,520 gates up to a version for fourteen processors with 256 lock variables occupying 78,240 gates.

27 citations


Journal ArticleDOI
TL;DR: A new design tool framework called IMPACCT is proposed, which correctly combines the state-of-the-art techniques at the system level, thereby saving even experienced designers from many pitfalls of system-level power management.
Abstract: Power-aware systems are those that must exploit a wide range of power/performance trade-offs in order to adapt to the power availability and application requirements. They require the integration of many novel power management techniques, ranging from voltage scaling to subsystem shutdown. However, those techniques do not always compose synergistically with each other; in fact, they can combine subtractively and often yield counterintuitive, and sometimes incorrect, results in the context of a complete system. This can become a serious problem as more of these power-aware systems are being deployed in mission-critical applications. To address the problem of technique integration for power-aware embedded systems, we propose a new design tool framework called IMPACCT and the associated design methodology. The system modeling methodology includes an application model for capturing timing/power constraints and mode dependencies at the system level. The tool performs power-aware scheduling and mode selection to ensure that all timing/power constraints are satisfied and that all overhead is taken into account. IMPACCT then synthesizes the implementation targeting a symmetric multiprocessor platform. Experimental results show that the increased dynamic range of power/performance settings enabled a Mars rover to achieve significant acceleration while using less energy. More importantly, our tool correctly combines the state-of-the-art techniques at the system level, thereby saving even experienced designers from many pitfalls of system-level power management.

20 citations


Journal ArticleDOI
TL;DR: This work presents an approach that extends instruction and data cache modeling from basic blocks to program segments thereby increasing the overall running time analysis precision and combines it with data flow analysis based prediction of cache line contents.
Abstract: Verification of software running time is essential in embedded system design with real-time constraints. Simulation with incomplete test patterns is unsafe for complex architectures when software running times are input data dependent. Formal analysis of such dependencies leads to software running time intervals rather than single values. These intervals depend on program properties, execution paths and states of processes, as well as on the target architecture. In the target architecture, caches have a major influence on software running time. Current cache analysis techniques as a part of running time analysis approaches combine basic block level cache modeling with explicit or implicit program path analysis. We present an approach that extends instruction and data cache modeling from basic blocks to program segments, thereby increasing the overall running time analysis precision. We combine it with data flow analysis based prediction of cache line contents. This novel cache analysis approach shows high precision in the presented experiments.

16 citations


Journal ArticleDOI
TL;DR: GEZEL is proposed, a design environment consisting of a design language and an implementation methodology for domain-specific processors, such as the security processors used to implement cryptographic algorithms under high-throughput and/or low-energy-consumption constraints.
Abstract: Security processors are used to implement cryptographic algorithms with high throughput and/or low energy consumption constraints. The design of these processors is a balancing act between flexibility and energy consumption. The target is to create a processor with just enough programmability to cover a set of algorithms--an application domain. This paper proposes GEZEL, a design environment consisting of a design language and an implementation methodology that can be used for such domain specific processors. We use the security domain as driver, and discuss the impact of the domain on the target architecture. We also present a methodology to create, refine and verify a security processor.

Journal ArticleDOI
TL;DR: A fast and simple algorithm for sharing resources in multiprocessor systems is presented, together with an innovative procedure for assigning a preemption threshold to tasks; together these allow the use of a single user stack.
Abstract: The primary goal for real-time kernel software for single and multiple-processor-on-a-chip systems is to support the design of timely and cost-effective systems. The kernel must provide time guarantees, in order to predict the timely behavior of the application; an extremely fast response time, in order not to waste computing power outside of the application cycles; and it must save as much RAM space as possible in order to reduce the overall cost of the chip. The research on real-time software systems has produced algorithms that make it possible to effectively schedule system resources while guaranteeing the deadlines of the application, and to group tasks into a very small number of non-preemptive sets which require much less RAM for stack space. Unfortunately, up to now the research focus has been on time guarantees rather than on the optimization of RAM usage. Furthermore, these techniques do not apply to multiprocessor architectures, which are likely to be widely used in future microcontrollers. This paper presents innovative scheduling and optimization algorithms that effectively solve the problem of guaranteeing schedulability with extremely low operating system overhead while minimizing RAM usage. We developed a fast and simple algorithm for sharing resources in multiprocessor systems, together with an innovative procedure for assigning a preemption threshold to tasks. These allow the use of a single user stack. The experimental part shows the effectiveness of a simulated-annealing-based tool that finds a schedulable system configuration starting from the selection of a near-optimal task allocation. When used in conjunction with our preemption threshold assignment algorithm, our tool further reduces the RAM usage in multiprocessor systems.
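The preemption-threshold idea above can be sketched concretely: two tasks can share a non-preemptive group when neither task's priority exceeds the other's threshold, and a single shared stack then only needs room for one frame per group. The greedy first-fit grouping and the task data below are illustrative, not the paper's assignment algorithm:

```python
def can_share(t1, t2):
    """Two tasks are mutually non-preemptive (so they can share a group) when
    neither has priority above the other's preemption threshold."""
    return t1["pri"] <= t2["thresh"] and t2["pri"] <= t1["thresh"]

def group_tasks(tasks):
    """Greedy first-fit grouping into mutually non-preemptive sets,
    highest priority first."""
    groups = []
    for t in sorted(tasks, key=lambda t: -t["pri"]):
        for g in groups:
            if all(can_share(t, u) for u in g):
                g.append(t)
                break
        else:
            groups.append([t])
    return groups

def stack_bound(tasks):
    """Worst-case shared-stack size: at most one task per group can be on
    the stack at any time, so sum the per-group maxima."""
    return sum(max(t["stack"] for t in g) for g in group_tasks(tasks))
```

With four tasks whose thresholds pair them into two groups, the bound is the sum of the two largest per-group frames rather than the sum of all four frames under full preemption.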

Journal ArticleDOI
TL;DR: A novel method (HASoC) for developing embedded systems targeted at system-on-a-chip implementations; the method supports a lifecycle that explicitly separates the behavior of a system from its implementation technology.
Abstract: We present a novel method (HASoC) for developing embedded systems that are targeted at system-on-a-chip implementations. The object-oriented development method is based on the experiences of using our existing MOOSE technique and supports a lifecycle that explicitly separates the behavior of a system from its implementation technology. The design process, which uses a notation based on extensions to UML-RT, begins with the incremental development and validation of an abstract executable model of a system. Subsequently, this model is partitioned into hardware and software sub-systems to create a committed model, which is mapped onto a system platform that defines the implementation environment. The methodology emphasizes the reuse of pre-existing hardware and software platforms to ease the development process. A partial example application is presented in order to illustrate the main concepts in our methodology.

Journal ArticleDOI
TL;DR: The proposed approach has been applied to the Lx family of scalable embedded VLIW processors, jointly designed by STMicroelectronics and HP Labs, and demonstrated an average accuracy of 5% of the instruction-level estimation engine with respect to the RTL engine, with an average speed-up of four orders of magnitude.
Abstract: This paper describes a technique for modeling and estimating the power consumption at the system level for embedded VLIW (Very Long Instruction Word) architectures. The method is based on a hierarchy of dynamic power estimation engines: from the instruction level down to the gate/transistor level. Power macro-models have been developed for the main components of the system: the VLIW core, the register file, the instruction and data caches. The main goal is to define a system-level simulation framework for the dynamic profiling of the power behavior during the software execution, providing also a break-down of the power contributions due to the single components of the system. The proposed approach has been applied to the Lx family of scalable embedded VLIW processors, jointly designed by STMicroelectronics and HP Labs. Experimental results, carried out over a set of benchmarks for embedded multimedia applications, have demonstrated an average accuracy of 5% of the instruction-level estimation engine with respect to the RTL engine, with an average speed-up of four orders of magnitude.

Journal ArticleDOI
TL;DR: This paper presents a methodology for designing and evaluating high-speed data acquisition systems using reprogrammable platforms.
Abstract: Complex embedded systems that do not target mass markets often have design and engineering costs that exceed production costs. One example is the triggering and data acquisition system (DAQ) integrated into high-energy physics experiments. Parameterizable and reprogrammable architectures are natural candidates as platforms for specialized embedded systems like high-speed data acquisition systems. In order to facilitate the design of specialized embedded systems, design strategies and tools are needed that greatly increase the efficiency of the design process. End-user programmability of reprogrammable platforms is essential, because system designers, without training in low-level programming languages, are required to change the base design, compare designs, and generate configuration data for the reprogrammable platforms. This paper presents a methodology for designing and evaluating high-speed data acquisition systems using reprogrammable platforms.

Journal ArticleDOI
TL;DR: How LOPOCOS can support the system designer in identifying energy-efficient hardware/software implementations for the desired embedded systems is demonstrated by highlighting the necessary optimization steps during design space exploration for DVS-enabled architectures.
Abstract: In this paper, we introduce the LOPOCOS (Low Power Co-synthesis) system, a prototype CAD tool for system-level co-design. LOPOCOS targets the design of energy-efficient embedded systems implemented as heterogeneous distributed architectures. In particular, it is designed to solve the specific problems involved in architectures that include dynamic voltage scalable (DVS) processors. The aim of this paper is to demonstrate how LOPOCOS can support the system designer in identifying energy-efficient hardware/software implementations for the desired embedded systems, highlighting the necessary optimization steps during design space exploration for DVS-enabled architectures. The optimization steps carried out in LOPOCOS involve component allocation and task/communication mapping as well as scheduling and dynamic voltage scaling. LOPOCOS has the following key features, which contribute to this energy efficiency. During voltage scaling, power profile information of task execution is taken into account, which improves the accuracy of the energy estimation. A combined optimization for scheduling and communication mapping, based on a genetic algorithm, simultaneously optimizes execution order and communication mapping towards the utilization of the DVS processors and the timing behaviour. Furthermore, a separation of task and communication mapping allows a more effective implementation of both optimization steps. Extensive experiments are conducted to demonstrate the efficiency of LOPOCOS. We report up to 38% higher energy reductions compared to previous co-synthesis techniques for DVS systems. The investigations include a real-life example of an optical flow detection algorithm.
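For orientation, the energy leverage that DVS gives a co-synthesis tool like the one above rests on the common simplified model f ∝ V and E_cycle ∝ V²: stretching execution into available slack by a factor k divides dynamic energy by roughly k². A minimal sketch under that assumption (the function name and all numbers are illustrative, not from the paper):

```python
def dvs_energy(tasks_cycles, f_max, deadline):
    """Normalized dynamic energy when the clock is scaled uniformly so the
    task set just meets its deadline. Assumes the simplified model f ∝ V and
    E_cycle ∝ V^2, so slowing down by factor k divides energy by k^2."""
    total = sum(tasks_cycles)
    t_at_fmax = total / f_max
    if t_at_fmax > deadline:
        raise ValueError("infeasible even at f_max")
    k = deadline / t_at_fmax           # available slowdown factor (>= 1)
    v_scale = 1.0 / k                  # voltage scales with frequency
    return total * v_scale ** 2        # in units of E_cycle at nominal voltage
```

For example, two tasks totalling 5M cycles on a 100 MHz processor finish in 0.05 s; with a 0.1 s deadline the clock can be halved, cutting dynamic energy to a quarter of the full-speed value.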

Journal ArticleDOI
TL;DR: A new technique for the detection of Integrated Circuits within images of Printed Circuit Boards autonomously and without the need to be assisted by CAD data is presented, and results showing the reduction in complexity when compared to a Hough Transform are presented.
Abstract: This paper presents a new technique for the detection of Integrated Circuits within images of Printed Circuit Boards autonomously and without the need to be assisted by CAD data. The technique is a key part of a suite of algorithms targeted at an embedded System On Chip architecture based on the ARM7 platform for real-time detection of PCB images for diagnostic purposes. The technique has a significant reduction in complexity when compared to conventional approaches such as the Hough Transform. The reduction in complexity makes the approach ideal for an embedded vision application such as the one described in this paper. This paper presents the technique, the target embedded architecture and results showing the reduction in complexity when compared to a Hough Transform.

Journal ArticleDOI
TL;DR: A compositional framework, together with its supporting toolset, for hardware/software co-design, based on Interval Temporal Logic and its executable subset, Tempura; refinement derives both the software and hardware parts of the implementation from a single formal specification of the system, while preserving all properties of the system specification.
Abstract: We describe a compositional framework, together with its supporting toolset, for hardware/software co-design. Our framework is an integration of a formal approach within a traditional design flow. The formal approach is based on Interval Temporal Logic and its executable subset, Tempura. Refinement is the key element in our framework because it derives the software and hardware parts of the implementation from a single formal specification of the system, while preserving all properties of the system specification. During refinement, simulation is used to choose the appropriate refinement rules, which are applied automatically in the HOL system. The framework is illustrated with two case studies. The work presented is part of a UK collaborative research project between the Software Technology Research Laboratory at De Montfort University and the Oxford University Computing Laboratory.

Journal ArticleDOI
TL;DR: The rationale for developing the distributed semantics of Virtuoso's microkernel is described, together with some of the implementation issues; extensions of the model towards heterogeneous embedded target systems are discussed.
Abstract: Virtuoso VSP is a fully distributed real-time operating system originally developed on the Inmos transputer. Its generic architecture is based on a small but very fast nanokernel and a portable preemptive microkernel. It was subsequently ported, in single and virtual single processor implementations, to a wide range of processors. This paper describes the rationale for developing the distributed semantics of Virtuoso's microkernel and describes some of the implementation issues. The analysis is based on the parallel DSP implementations, as these push the performance limits most for hard real-time applications. Extensions of the model towards heterogeneous embedded target systems are discussed.

Journal ArticleDOI
TL;DR: The energy benefits of combining the configurable features of voltage scaling and cache way shutdown in a single platform are illustrated and methods to assist a designer to tune such a platform to a particular software task and to particular energy optimization criteria are described.
Abstract: System-on-a-chip platform manufacturers are increasingly adding configurable features that provide power and performance flexibility, in order to increase a platform's applicability to a variety of embedded computing systems. We illustrate the energy benefits of combining the configurable features of voltage scaling and cache way shutdown in a single platform. We describe methods to assist a designer to tune such a platform to a particular software task and to particular energy optimization criteria.
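A tiny sketch of the tuning problem described above: score the cross-product of voltage levels and active cache ways under toy delay/energy models, and keep the cheapest configuration that meets a performance constraint. All models and numbers here are invented for illustration; the methods in the paper are not reproduced:

```python
# Toy configurable platform: supply voltages (frequency assumed ∝ V)
# and the number of active cache ways.
VOLTAGES = [1.0, 1.2, 1.4]
WAYS = [1, 2, 4]

def delay(v, ways, base_cycles=1e6, miss_penalty=2e5):
    f = 1e8 * v                          # Hz; toy model f ∝ V
    miss_cycles = miss_penalty * (4 / ways)  # fewer ways -> more miss cycles
    return (base_cycles + miss_cycles) / f   # seconds

def energy(v, ways, base_cycles=1e6, miss_penalty=2e5):
    miss_cycles = miss_penalty * (4 / ways)
    cycles = base_cycles + miss_cycles
    # per-cycle energy grows with V^2 and with the number of powered ways
    return cycles * v ** 2 * (0.5 + 0.125 * ways)

def tune(deadline):
    """Exhaustive tuning: cheapest (voltage, ways) pair meeting the deadline."""
    feasible = [(energy(v, w), v, w) for v in VOLTAGES for w in WAYS
                if delay(v, w) <= deadline]
    if not feasible:
        raise ValueError("no configuration meets the deadline")
    return min(feasible)[1:]
```

Note how the best setting shifts with the constraint: a tight deadline favors all ways powered at low voltage, while a looser one lets the tuner shut ways down too.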

Journal ArticleDOI
TL;DR: A software pipelining framework, CALiBeR (Cluster Aware Load Balancing Retiming Algorithm), suitable for compilers targeting clustered embedded VLIW processors; experiments demonstrate that the algorithm compares favorably with one of the best state-of-the-art algorithms.
Abstract: This paper proposes a software pipelining framework, CALiBeR (Cluster Aware Load Balancing Retiming Algorithm), suitable for compilers targeting clustered embedded VLIW processors. CALiBeR can be used by embedded system designers to explore different code optimization alternatives, that is, high-quality customized retiming solutions for desired throughput and program memory size requirements, while minimizing register pressure. An extensive set of experimental results is presented, demonstrating that our algorithm compares favorably with one of the best state-of-the-art algorithms, achieving up to 50% improvement in performance and up to 47% improvement in register requirements. In order to empirically assess the effectiveness of clustering for high-ILP applications, additional experiments are presented contrasting the performance achieved by software-pipelined kernels executing on clustered and on centralized machines.

Journal ArticleDOI
TL;DR: This paper presents a novel approach to computing tight upper bounds on the processor utilization for general real-time systems where tasks are composed of subtasks and precedence constraints may exist among subtasks of the same task.
Abstract: This paper presents a novel approach to computing tight upper bounds on the processor utilization for general real-time systems where tasks are composed of subtasks and precedence constraints may exist among subtasks of the same task. By careful analysis of preemption effects among tasks, the problem is formulated as a set of linear programming (LP) problems. Observations are made to reduce the number of LP problem instances required to be solved, which greatly improves the computation time of the utilization bounds. Furthermore, additional constraints are allowed to be included under certain circumstances to improve the quality of the bounds.
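The LP-based bounds of the paper are not reproduced here; as a reference point, the classical Liu and Layland rate-monotonic utilization bound that such analyses tighten can be sketched as:

```python
def ll_bound(n):
    """Classical Liu-Layland bound for n independent periodic tasks under
    rate-monotonic scheduling: schedulable if U <= n(2^(1/n) - 1)."""
    return n * (2 ** (1 / n) - 1)

def rm_schedulable(tasks):
    """Sufficient (not necessary) test: tasks are (wcet, period) pairs."""
    u = sum(c / t for c, t in tasks)
    return u <= ll_bound(len(tasks))
```

The bound decreases toward ln 2 ≈ 0.693 as n grows; bounds tailored to task structure, such as the LP-derived ones above, admit higher utilizations.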

Journal ArticleDOI
TL;DR: This paper describes the development of CADRE (Configurable Asynchronous DSP for Reduced Energy), a 750K transistor, high performance, low-power digital signal processor IP block intended for digital mobile phone chipsets.
Abstract: Asynchronous design techniques have a number of compelling features that make them suited for complex system-on-chip designs. However, it is necessary to develop practical and efficient design techniques to overcome the present shortage of commercial design tools. This paper describes the development of CADRE (Configurable Asynchronous DSP for Reduced Energy), a 750K-transistor, high-performance, low-power digital signal processor IP block intended for digital mobile phone chipsets. A short time period was available for the project, so a methodology was developed that allowed high-level simulation of the design at the earliest possible stage within the conventional schematic entry environment and simulation tools used for later circuit-level performance and power consumption assessment. Initial modeling was based on C behavioral models of the various data and control components, with the many asynchronous control circuits required generated automatically from their specifications. This enabled design options to be explored early. Unusual features of the design, such as the Register Bank, which is designed to exploit data access patterns, are presented along with the power and performance results of the processor as a whole.

Journal ArticleDOI
TL;DR: The necessary and sufficient condition for achieving the maximum throughput in a given pipeline operating under modulo scheduling is established, based on which a methodology for designing the hardware pipelines that achieve such a throughput is developed.
Abstract: Exploiting instruction-level parallelism (ILP) is extremely important for achieving high performance in application specific instruction set processors (ASIPs) and embedded processors. Unlike conventional general purpose processors, ASIPs and embedded processors typically run a single application and hence must be optimized extensively for this in order to extract maximum performance. Further, low power and low cost requirements of ASIPs may demand reuse of pipeline stages causing pipelines with complex structural hazards. In such architectures, exploiting higher ILP is a major challenge to the designer. Existing techniques deal with either scheduling hardware pipelines to obtain higher throughput or software pipelining--an instruction scheduling technique for iterative computation--for exploiting greater ILP. We integrate these techniques to co-schedule hardware and software pipelines to achieve greater instruction throughput. In this paper, we develop the underlying theory of Co-Scheduling, called the Modulo-Scheduled Pipeline (or MS-Pipeline) theory. More specifically, we establish the necessary and sufficient condition for achieving the maximum throughput in a given pipeline operating under modulo scheduling. Further, we establish a sufficient condition to achieve a specified throughput, based on which we also develop a methodology for designing the hardware pipelines that achieve such a throughput. Further, we present initial experimental results which help to establish the usefulness of MS-pipeline theory in software pipelining. As the proposed theory helps to analyze and improve the throughput of Modulo-Scheduled Pipelines (MS-pipelines), it is especially useful in designing ASIPs and embedded processors.
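The throughput conditions above concern modulo-scheduled pipelines; the standard lower bound on the initiation interval (MII) that modulo scheduling works against can be sketched as follows. The operation counts and dependence cycles below are illustrative examples, not data from the paper:

```python
import math

def res_mii(op_counts, resource_counts):
    """Resource-constrained bound: each resource class limits the initiation
    interval to ceil(uses per iteration / available units)."""
    return max(math.ceil(op_counts[r] / resource_counts[r]) for r in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained bound: for each dependence cycle (given as a
    (total latency, total dependence distance) pair),
    II >= ceil(latency / distance)."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

def mii(op_counts, resource_counts, cycles):
    """Minimum initiation interval: the larger of the two bounds."""
    return max(res_mii(op_counts, resource_counts),
               rec_mii(cycles))
```

For a loop with 6 ALU ops on 2 ALUs and 3 memory ops on 1 port, resources alone allow II = 3, but a dependence cycle of latency 4 and distance 1 pushes the achievable II to 4; structural hazards from reused pipeline stages, as studied above, tighten the resource side further.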