
Showing papers in "ACM Transactions on Design Automation of Electronic Systems in 2007"


Journal ArticleDOI
TL;DR: A resynthesis approach is introduced wherein a sequence of gates is chosen from a network, and the reversible specification it realizes is resynthesized as an independent problem in hopes of reducing the network cost.
Abstract: We present certain new techniques for the synthesis of reversible networks of Toffoli gates, as well as improvements to previous methods. Gate count and technology oriented cost metrics are used. Two new synthesis procedures employing Reed-Muller spectra are introduced and shown to complement earlier synthesis approaches. The previously proposed template simplification method is enhanced through the introduction of a faster and more efficient template application algorithm, an updated classification of the templates, and the addition of new templates of sizes 7 and 9. A resynthesis approach is introduced wherein a sequence of gates is chosen from a network, and the reversible specification it realizes is resynthesized as an independent problem in hopes of reducing the network cost. Empirical results are presented to show that the methods are efficient in terms of the realization of reversible benchmark specifications.
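The cascade structure that resynthesis operates on can be illustrated in a few lines of Python; the gate list and bit indices below are invented for illustration and are not taken from the paper:

```python
# Hypothetical sketch of a reversible network as a cascade of Toffoli gates.
# A Toffoli gate (controls c1, c2, target t) flips bit t iff both controls are 1.

def toffoli(state, c1, c2, t):
    """Apply one Toffoli gate to a tuple of bits, returning the new tuple."""
    bits = list(state)
    if bits[c1] and bits[c2]:
        bits[t] ^= 1
    return tuple(bits)

def run_network(state, gates):
    """Apply a cascade (list) of gates left to right."""
    for c1, c2, t in gates:
        state = toffoli(state, c1, c2, t)
    return state

# A two-gate cascade; because each Toffoli gate is self-inverse, applying the
# reversed cascade restores the input -- the property that templates exploit.
gates = [(0, 1, 2), (0, 2, 1)]
out = run_network((1, 1, 0), gates)
back = run_network(out, list(reversed(gates)))
assert back == (1, 1, 0)
```

A resynthesis step in this spirit would cut a subsequence out of `gates`, treat the permutation it realizes as a fresh specification, and substitute any cheaper cascade realizing the same permutation.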

195 citations


Journal ArticleDOI
TL;DR: This article builds a case for clustered VLIW processors with four or more clusters, and provides a classification of the intercluster interconnection design space, and evaluates a subset of this design space to show that the most commonly used type of interconnection, RF-to-RF, fails to meet achievable performance by a large factor, while certain other types of interconnections can lower this gap considerably.
Abstract: VLIW processors have started gaining acceptance in the embedded systems domain. However, monolithic register file VLIW processors with a large number of functional units are not viable. This is because of the need for a large number of ports to support FU requirements, which makes them expensive and extremely slow. A simple solution is to break the register file into a number of smaller register files with a subset of FUs connected to it. These architectures are termed clustered VLIW processors. In this article, we first build a case for clustered VLIW processors with four or more clusters by showing that the achievable ILP in most of the media applications for a 16 ALU and 8 LD/ST VLIW processor is around 20. We then provide a classification of the intercluster interconnection design space, and show that a large part of this design space is currently unexplored. Next, using our performance evaluation methodology, we evaluate a subset of this design space and show that the most commonly used type of interconnection, RF-to-RF, fails to meet achievable performance by a large factor, while certain other types of interconnections can lower this gap considerably. We also establish that this behavior is heavily application dependent, emphasizing the importance of application-specific architecture exploration. We also present results about the statistical behavior of these different architectures by varying the number of clusters in our framework from 4 to 16. These results clearly show the advantages of one specific architecture over others. Finally, based on our results, we propose a new interconnection network, which should lower this performance gap.

64 citations


Journal ArticleDOI
TL;DR: This work builds a library of new instructions created with various encoding alternatives taking into account the data path architecture constraints, and chooses the best set of instructions while satisfying the instruction bitwidth constraint.
Abstract: Application-specific instructions can significantly improve the performance, energy-efficiency, and code size of configurable processors. While generating new instructions from application-specific operation patterns has been a common way to improve the instruction set (IS) of a configurable processor, automating the design of ISs for given applications poses new challenges---how to create as well as utilize new instructions in a systematic manner, and how to choose the best set of application-specific instructions considering the various effects the new instructions may have on the data path and the compilation? To address these problems, we present a novel IS synthesis framework that optimizes the IS through an efficient instruction encoding for the given application as well as for the given data path architecture. We first build a library of new instructions created with various encoding alternatives taking into account the data path architecture constraints, and then select the best set of instructions while satisfying the instruction bitwidth constraint. We formulate the problem using integer linear programming and also present an effective heuristic algorithm. Experimental results using our technique generate ISs that show improvements of up to about 40% over the native IS for several application benchmarks running on typical embedded RISC processors.

60 citations


Journal ArticleDOI
TL;DR: This article presents an automated approach for analyzing data reuse opportunities in a program that allows modification of the program to use custom scratch-pad memory configurations comprising a hierarchical set of buffers for local storage of frequently reused data to reduce energy consumption and improve memory system performance.
Abstract: In multimedia and other streaming applications, a significant portion of energy is spent on data transfers. Exploiting data reuse opportunities in the application, we can reduce this energy by making copies of frequently used data in a small local memory and replacing speed- and power-inefficient transfers from main off-chip memory by more efficient local data transfers. In this article we present an automated approach for analyzing these opportunities in a program that allows modification of the program to use custom scratch-pad memory configurations comprising a hierarchical set of buffers for local storage of frequently reused data. Using our approach we are able to both reduce energy consumption of the memory subsystem when using a scratch-pad memory by about a factor of two, on average, and improve memory system performance compared to a cache of the same size.
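The energy argument behind data reuse can be sketched with a toy cost model; the buffer size and per-access costs below are made up, not measured values from the article:

```python
# Schematic data-reuse example: copy a frequently reused block into a small
# local (scratch-pad) buffer and serve repeated reads from there instead of
# from main off-chip memory. Costs are invented energy units.

COST_MAIN, COST_LOCAL = 10, 1


def energy_with_copy(n_reads, block):
    """One copy of the block into the scratch-pad, then all reads are local."""
    return len(block) * COST_MAIN + n_reads * COST_LOCAL


def energy_without_copy(n_reads):
    """Every read goes to main memory."""
    return n_reads * COST_MAIN


block = list(range(16))
reads = 500                      # the block is reused many times
assert energy_with_copy(reads, block) < energy_without_copy(reads)
```

The copy only pays off when the reuse count is high enough to amortize the initial transfer, which is exactly what an automated reuse analysis has to establish per array region.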

59 citations


Journal ArticleDOI
TL;DR: The proposed methods are effective in predicting the probability distribution of total chip leakage, and it is shown that ignoring spatial correlations can underestimate the standard deviation of full-chip leakage power.
Abstract: In this article, we present a method to analyze the total leakage current of a circuit under process variations, considering interdie and intradie variations as well as the effect of the spatial correlations of intradie variations. The approach considers both the subthreshold and gate tunneling leakage power, as well as their interactions. With process variations, each leakage component is approximated by a lognormal distribution, and the total chip leakage is computed as a sum of the correlated lognormals. Since the lognormals to be summed are large in number and have complicated correlation structures due to both spatial correlations and the correlation among different leakage mechanisms, we propose an efficient method to reduce the number of correlated lognormals for summation to a manageable quantity. We do so by identifying dominant states of leakage currents and taking advantage of the spatial correlation model and input states at the gates. An improved approach utilizing the principal components computed from spatially correlated process parameters is also proposed to further improve runtime efficiency. We show that the proposed methods are effective in predicting the probability distribution of total chip leakage, and that ignoring spatial correlations can underestimate the standard deviation of full-chip leakage power.
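The effect of spatial correlation on the leakage distribution can be checked numerically; the Monte Carlo sketch below is an illustration of the statistical claim, not the paper's analytical lognormal-summation method, and all parameter values are invented:

```python
# Illustrative Monte Carlo check: summing correlated lognormal leakage
# components versus (incorrectly) treating them as independent.
import numpy as np

rng = np.random.default_rng(0)
n_gates, n_samples = 4, 200_000

# Correlated Gaussian process parameters with a simple spatial correlation
# matrix (diagonal 1.0, off-diagonal 0.5).
corr = 0.5 * np.ones((n_gates, n_gates)) + 0.5 * np.eye(n_gates)
L = np.linalg.cholesky(corr)
g = rng.standard_normal((n_samples, n_gates)) @ L.T

# Each gate's leakage is lognormal in the underlying Gaussian parameter.
leak = np.exp(0.3 * g)
total = leak.sum(axis=1)         # full-chip leakage samples

# Ignoring spatial correlation (independent samples) underestimates sigma.
g_ind = rng.standard_normal((n_samples, n_gates))
total_ind = np.exp(0.3 * g_ind).sum(axis=1)
assert total.std() > total_ind.std()
```

Positive spatial correlation adds positive covariance terms to the variance of the sum, so the independent model understates the spread of full-chip leakage, matching the article's observation.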

45 citations


Journal ArticleDOI
TL;DR: This article presents several DVS scheduling algorithms that implement these methods and can guarantee task deadlines under arbitrarily large transition time overheads while reducing energy consumption by as much as 40% when compared to previous methods.
Abstract: Transition time overhead is a critical problem for hard real-time systems that employ dynamic voltage scaling (DVS) for power and energy management. While it is common practice in much previous work to ignore transition overhead, such algorithms cannot guarantee deadlines and/or are less effective in saving energy when transition overhead is significant and not appropriately dealt with. In this article we introduce two techniques, one offline and one online, to correctly account for transition overhead in preemptive fixed-priority real-time systems. We present several DVS scheduling algorithms that implement these methods and can guarantee task deadlines under arbitrarily large transition time overheads while reducing energy consumption by as much as 40% compared to previous methods.
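Why ignoring transition overhead breaks deadline guarantees can be shown with a back-of-the-envelope check; the cycle counts, frequencies, and overhead below are invented numbers, not from the article:

```python
# Sketch: a lower frequency that "fits" the deadline when the voltage/frequency
# transition time is ignored can miss the deadline once that time is counted.

def meets_deadline(wcet_cycles, freq_hz, deadline_s, transition_s):
    """A job is safe only if execution time plus mode-switch overhead fits."""
    return wcet_cycles / freq_hz + transition_s <= deadline_s

wcet, deadline, overhead = 8_000_000, 0.011, 0.002

# At 1 GHz the job plus a 2 ms transition fits in an 11 ms deadline...
assert meets_deadline(wcet, 1e9, deadline, overhead)
# ...and naive scaling to 800 MHz looks safe if the overhead is ignored...
assert meets_deadline(wcet, 0.8e9, deadline, 0.0)
# ...but actually misses the deadline once the transition is accounted for.
assert not meets_deadline(wcet, 0.8e9, deadline, overhead)
```

An overhead-aware scheduler in the spirit of the article must budget the transition time into the schedulability test before committing to a lower voltage.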

36 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the resulting test set is as effective in detecting untargeted faults as an n-detection test set generated by a deterministic test generation procedure.
Abstract: We describe a procedure for forming n-detection test sets for n>1 without applying a test generation procedure to target faults. The proposed procedure accepts a one-detection test set. It extracts test cubes for target faults from the one-detection test set, and merges the test cubes to obtain new test vectors. By extracting and merging different test cubes in different iterations of this process, an n-detection test set is obtained. Merging of test cubes does not require test generation or fault simulation. Fault simulation is required for extracting test cubes for target faults. We demonstrate that the resulting test set is as effective in detecting untargeted faults as an n-detection test set generated by a deterministic test generation procedure. We also discuss the application of the proposed procedure starting from a random test set (instead of a one-detection test set).
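The cube-merging step at the heart of the procedure can be illustrated directly; the cubes below are hand-made examples, not benchmark data:

```python
# Toy illustration of merging test cubes: two cubes with don't-care
# positions ('-') combine into one vector iff no specified bit conflicts.

def merge(cube_a, cube_b):
    """Return the merged cube if every position is compatible, else None."""
    out = []
    for a, b in zip(cube_a, cube_b):
        if a == '-':
            out.append(b)
        elif b == '-' or a == b:
            out.append(a)
        else:
            return None          # conflicting specified bits
    return ''.join(out)

assert merge('1-0-', '-10-') == '110-'   # compatible: one vector detects both
assert merge('1-0-', '0---') is None     # bit 0 conflicts: cannot merge
```

A merged vector detects the target faults of both source cubes, which is how repeated extraction and merging grows a one-detection set into an n-detection set without invoking test generation.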

31 citations


Journal ArticleDOI
TL;DR: A novel graph-based topological floorplan representation, named 3D-subTCG (3-Dimensional Transitive Closure subGraph), is used to deal with the 3-dimensional (temporal) floorplanning/placement problem, arising from dynamically reconfigurable FPGAs.
Abstract: By improving logic capacity through time-sharing, dynamically reconfigurable Field Programmable Gate Arrays (FPGAs) can handle designs of high complexity and functionality. In this paper, we use a novel graph-based topological floorplan representation, named 3D-subTCG (3-Dimensional Transitive Closure subGraph), to deal with the 3-dimensional (temporal) floorplanning/placement problem arising from dynamically reconfigurable FPGAs. The 3D-subTCG uses three transitive closure graphs to model the temporal and spatial relations between modules. We derive the feasibility conditions for the precedence constraints induced by the execution of the dynamically reconfigurable FPGAs. Because the geometric relationship is transparent to the 3D-subTCG and its induced operations (i.e., we can directly detect the relationship between any two tasks from the representation), we can easily detect any violation of the temporal precedence constraints on the 3D-subTCG. We also derive important properties of the 3D-subTCG to reduce the solution space and shorten the running time for 3D (temporal) floorplanning/placement. Experimental results show that our 3D-subTCG-based algorithm is very effective and efficient.

25 citations


Journal ArticleDOI
TL;DR: This article presents new techniques that apply loop fusion and tiling to several loop nests and parallelize the resulting code across different processors, showing a significant reduction in the number of data cache misses and in processing time.
Abstract: Multiprocessor system-on-a-chip (MPSoC) architectures have received a lot of attention in the past years, but few advances in compilation techniques target these architectures. This is particularly true for the exploitation of data locality. Most of the compilation techniques for parallel architectures discussed in the literature are based on a single loop nest. This article presents new techniques that consist in applying loop fusion and tiling to several loop nests and parallelizing the resulting code across different processors. These two techniques reduce the number of memory accesses. However, they increase dependencies and thereby reduce the exploitable parallelism in the code. This article tries to address this contradiction. To optimize the memory space used by temporary arrays, smaller buffers are used as a replacement. Different strategies are studied to optimize the processing time spent accessing these buffers. The experiments show that these techniques yield a significant reduction in the number of data cache misses (30%) and in processing time (50%).
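The two transformations can be rendered schematically in Python; the array size, tile size, and computations below are invented for illustration:

```python
# Sketch of loop fusion plus tiling: two passes over the data become one
# tiled pass, so the intermediate value is consumed right after it is
# produced and the temporary array shrinks to a scalar.

N, T = 8, 4
a = list(range(N))

# Original: two separate loop nests, with a full temporary array b.
b = [a[i] + 1 for i in range(N)]
c = [b[i] * 2 for i in range(N)]

# Fused and tiled: one pass, processed tile by tile; the temporary never
# leaves the "local buffer" (here just a scalar).
c_fused = [0] * N
for t0 in range(0, N, T):              # tile loop
    for i in range(t0, min(t0 + T, N)):
        tmp = a[i] + 1                 # was b[i]
        c_fused[i] = tmp * 2

assert c_fused == c
```

In the article's setting the tile loop is also the natural unit to distribute across processors, which is where the tension between locality and exploitable parallelism arises.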

22 citations


Journal ArticleDOI
TL;DR: It is shown how SANs can be used early in the design cycle to identify the best performance/power trade-offs among several application-architecture combinations.
Abstract: The objective of this article is to introduce the use of Stochastic Automata Networks (SANs) as an effective formalism for application-architecture modeling in system-level average-case analysis for platform-based design. By platform, we mean a family of heterogeneous architectures that satisfy a set of architectural constraints imposed to allow re-use of hardware and software components. More precisely, we show how SANs can be used early in the design cycle to identify the best performance/power trade-offs among several application-architecture combinations. Having this information available not only helps avoid lengthy simulations for predicting power and performance figures, but also enables efficient mapping of different applications onto a chosen platform. We illustrate the benefits of our methodology by using the “Picture-in-Picture” video decoder as a driver application.

20 citations


Journal ArticleDOI
TL;DR: An alternate formulation of the problem is to treat maximum permitted test power as a constraint and achieve a test power that is within this limit using the fewest number of gated scan cells, thereby leading to the least impact in area overhead.
Abstract: Power reduction during test application is important from the viewpoint of chip reliability and for obtaining correct test results. One of the ways to reduce scan test power is to block transitions propagating from the outputs of scan cells through combinational logic. In order to accomplish this, some researchers have proposed setting primary inputs to appropriate values or adding extra gates at the outputs of scan cells. In this article, we point out the limitations of such full gating techniques in terms of area overhead and performance degradation. We propose an alternate solution where a partial set of scan cells is gated. A subset of scan cells is selected to give maximum reduction in test power within a given area constraint. An alternate formulation of the problem is to treat the maximum permitted test power as a constraint and achieve a test power within this limit using the fewest gated scan cells, thereby minimizing the impact on area overhead. Our problem formulation also incorporates performance constraints and prevents the inclusion of gating points on critical paths. The area overhead is predictable and closely corresponds to the average power reduction.

Journal ArticleDOI
TL;DR: A new online voltage scaling technique for battery-powered embedded systems with real-time constraints takes into account the execution times and discharge currents of tasks to further reduce the battery charge consumption when compared to the recently reported slack forwarding technique.
Abstract: This article proposes a new online voltage scaling (VS) technique for battery-powered embedded systems with real-time constraints. The VS technique takes into account the execution times and discharge currents of tasks to further reduce the battery charge consumption when compared to the recently reported slack forwarding technique [Ahmed and Chakrabarti 2004], while maintaining a low online complexity of O(1). Furthermore, we investigate the impact of online rescheduling and remapping on the battery charge consumption for tasks with data dependencies, which has not been explicitly addressed in the literature, and propose a novel rescheduling/remapping technique. Finally, we take leakage power into consideration and extend the proposed online techniques to include adaptive body biasing (ABB), which is used to reduce the leakage power. We demonstrate and compare the efficiency of the presented techniques using seven real-life benchmarks and numerous automatically generated examples.

Journal ArticleDOI
TL;DR: This paper presents a novel method to construct a dynamic single assignment (DSA) form of array intensive, pointer free C programs that scales very well with growing program sizes and overcomes a number of important limitations of existing methods.
Abstract: This paper presents a novel method to construct a dynamic single assignment (DSA) form of array intensive, pointer free C programs. A program in DSA form does not perform any destructive update of scalars and array elements; that is, each element is written at most once. As DSA makes the dependencies between variable references explicit, it facilitates complex analyses and optimizations of programs. Existing transformations into DSA perform a complex data flow analysis with exponential analysis time, and they work only for a limited class of input programs. Our method removes irregularities from the data flow by adding copy assignments to the program, so that it can use simple data flow analyses. The presented DSA transformation scales very well with growing program sizes and overcomes a number of important limitations of existing methods. We have implemented the method and it is being used in the context of memory optimization and verification of those optimizations. Experiments show that in practice, the method scales well indeed, and that added copy operations can be removed in case they are unwanted.
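The core idea of DSA can be shown on a tiny hand-made example (rendered in Python rather than C, and not taken from the paper): a repeatedly overwritten scalar gains an iteration dimension so that every element is written at most once.

```python
# Before: destructive update -- 'acc' is overwritten every iteration.
N = 5
acc = 0
for i in range(N):
    acc = acc + i

# After (DSA form): 'acc' becomes an array indexed by iteration; each
# element is written exactly once, so every flow dependence is explicit
# in the indices (acc_dsa[i+1] depends on acc_dsa[i]).
acc_dsa = [0] * (N + 1)
for i in range(N):
    acc_dsa[i + 1] = acc_dsa[i] + i

assert acc_dsa[N] == acc == 10
```

Making each write unique is what lets later analyses read dependencies straight off the subscripts instead of reasoning about which write reaches which read.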

Journal ArticleDOI
TL;DR: This article presents a sink-n-hoist framework for a compiler to generate balanced scheduling of power-gating instructions, attempting to merge several power-gating instructions into a single compound instruction and thereby reducing the number of power-gating instructions issued.
Abstract: Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies due to the continuing size reductions and increasing speeds of transistors. Recent studies have attempted to reduce leakage power using integrated architecture and compiler power-gating mechanisms. This approach involves compilers inserting instructions into programs to shut down and wake up components, as appropriate. While early studies showed this approach to be effective, there are concerns about the large number of power-control instructions being added to programs due to the increasing number of components equipped with power-gating controls in SoC design platforms. In this article we present a sink-n-hoist framework for a compiler to generate balanced scheduling of power-gating instructions. Our solution attempts to merge several power-gating instructions into a single compound instruction, thereby reducing the number of power-gating instructions issued. We performed experiments by incorporating our compiler analysis and scheduling policies into SUIF compiler tools and by simulating the energy consumption using Wattch toolkits. The experimental results demonstrate that our mechanisms are effective in reducing the number of power-gating instructions while further reducing leakage power compared to previous methods.

Journal ArticleDOI
TL;DR: A fast incremental hierarchical memory-size requirement estimation technique that allows for the first time to handle real-life industrial-size applications and get realistic feedback during loop transformation exploration.
Abstract: Modern embedded multimedia and telecommunications systems need to store and access huge amounts of data. This becomes a critical factor for the overall energy consumption, area, and performance of the systems. Loop transformations are essential to improve the data access locality and regularity in order to optimally design or utilize a memory hierarchy. However, due to abstract high-level cost functions, current loop transformation steering techniques do not take the memory platform sufficiently into account. They usually also result in only one final transformation solution. On the other hand, the loop transformation search space for real-life applications is huge, especially if the memory platform is still not fully fixed. Use of existing loop transformation techniques will therefore typically lead to suboptimal end-products. It is critical to find all interesting loop transformation instances. This can only be achieved by performing an evaluation of the effect of later design stages at the early loop transformation stage. This article presents a fast incremental hierarchical memory-size requirement estimation technique. It estimates the influence of any given sequence of loop transformation instances on the mapping of application data onto a hierarchical memory platform. As the exact memory platform instantiation is often not yet defined at this high-level design stage, a platform-independent estimation is introduced with a Pareto curve output for each loop transformation instance. Comparison among the Pareto curves helps the designer, or a steering tool, to find all interesting loop transformation instances that might later lead to low-power data mapping for any of the many possible memory hierarchy instances. Initially, the source code is used as input for estimation. However, performing the estimation repeatedly from the source code is too slow for large search space exploration.
An incremental approach, based on local updating of the previous result, is therefore used to handle sequences of different loop transformations. Experiments show that the initial approach takes a few seconds, which is two orders of magnitude faster than state-of-the-art solutions but still too costly to be performed interactively many times. The incremental approach typically takes just a few milliseconds, which is another two orders of magnitude faster than the initial approach. This huge speedup allows us for the first time to handle real-life industrial-size applications and get realistic feedback during loop transformation exploration.

Journal ArticleDOI
TL;DR: This work uses the max-min ant colony optimization technique to solve both time- and resource-constrained scheduling problems and automatically constructs a time/area tradeoff curve in a fast, effective manner.
Abstract: Design space exploration during high-level synthesis is often conducted through ad hoc probing of the solution space using some scheduling algorithm. This is not only time consuming but also very dependent on the designer's experience. We propose a novel design exploration method that exploits the duality of time- and resource-constrained scheduling problems. Our exploration automatically constructs a time/area tradeoff curve in a fast, effective manner. It is a general approach and can be combined with any high-quality scheduling algorithm. In our work, we use the max-min ant colony optimization technique to solve both time- and resource-constrained scheduling problems. Our algorithm provides significant solution-quality improvements (an average 17.3% reduction in resource counts) with similar runtime compared to using force-directed scheduling exhaustively at every time step. It also scales well across a comprehensive benchmark suite constructed with classic and real-life samples.
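The max-min ant system flavor used here can be sketched on a toy choice problem; the options, costs, and parameters below are invented, and this is a generic illustration of the metaheuristic rather than the paper's scheduler:

```python
# Minimal max-min ant system sketch: ants sample options in proportion to
# pheromone, only the iteration-best ant deposits, and pheromone is clamped
# to [tau_min, tau_max] -- the defining "max-min" rule.
import random

random.seed(1)
options = [0, 1, 2]                  # e.g., candidate time steps for an operation
cost = {0: 5.0, 1: 2.0, 2: 4.0}      # lower is better
tau = {o: 1.0 for o in options}
tau_min, tau_max, rho = 0.1, 5.0, 0.5

for _ in range(30):
    ants = [random.choices(options, weights=[tau[o] for o in options])[0]
            for _ in range(4)]
    best = min(ants, key=lambda o: cost[o])
    for o in options:
        deposit = 1.0 / cost[o] if o == best else 0.0
        # evaporate, reinforce the best, then clamp to the max-min bounds
        tau[o] = min(tau_max, max(tau_min, (1 - rho) * tau[o] + deposit))

# Pheromone tends to concentrate on the cheapest option; the clamp keeps
# the others alive enough to preserve exploration.
assert all(tau_min <= t <= tau_max for t in tau.values())
```

In the scheduling setting, an "option" would be an assignment of an operation to a control step (or a resource), and the deposit would reward schedules with shorter latency or fewer resources.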

Journal ArticleDOI
TL;DR: A framework for FPGA-based application design that supports a hierarchical modeling approach that integrates application and device modeling techniques and allows development of a library of models for design reuse and integrates a high-level performance estimator for rapid estimation of the latency, area, and energy of the designs.
Abstract: For an FPGA designer, several choices are available in terms of target FPGA devices, IP-cores, algorithms, synthesis options, runtime reconfiguration, degrees of parallelism, among others, while implementing a design. Evaluation of design alternatives in the early stages of the design cycle is important because the choices made can have a critical impact on the performance of the final design. However, a large number of alternatives not only results in a large number of designs, but also makes it a hard problem to efficiently manage, simulate, and evaluate them. In this article, we present a framework for FPGA-based application design that addresses the aforementioned issues. This framework supports a hierarchical modeling approach that integrates application and device modeling techniques and allows development of a library of models for design reuse. The framework integrates a high-level performance estimator for rapid estimation of the latency, area, and energy of the designs. In addition, a design space exploration tool allows efficient evaluation of candidate designs against the given performance requirements. The framework also supports extension through integration of widely used tools for FPGA-based design while presenting a unified environment for different target FPGAs. We demonstrate our framework through the modeling and performance estimation of a signal processing kernel and the design of end-to-end applications.

Journal ArticleDOI
TL;DR: A new technique, called Adaptive Stochastic Gradient Voltage-and-Task Scheduling (ASG-VTS), for power optimization of multicore hard realtime systems that combines stochastic and energy-gradient techniques to simultaneously solve the slack distribution and task reordering problem.
Abstract: This paper presents a new technique, called Adaptive Stochastic Gradient Voltage-and-Task Scheduling (ASG-VTS), for power optimization of multicore hard real-time systems. ASG-VTS combines stochastic and energy-gradient techniques to simultaneously solve the slack distribution and task reordering problems. It produces very efficient results with few mode transitions. Our experiments show that ASG-VTS reduces the number of mode transitions by 4.8 times compared to traditional energy-gradient-based approaches. Also, our heuristic algorithm can quickly find a solution that is as good as the optimal for a real-life GSM encoder/decoder benchmark. The runtime of ASG-VTS is 150 times and 1034 times faster than the energy-gradient-based and optimal ILP algorithms, respectively. Since the runtime of ASG-VTS is very low, it is ideal for design space exploration in system-level design tools. We have also developed a web-based interface for the ASG-VTS algorithm.

Journal ArticleDOI
TL;DR: This article presents an approach to area optimization of arithmetic datapaths at register-transfer level (RTL) on those designs that perform polynomial computations over finite word-length operands (bit-vectors) as algebra over finite integer rings of residue classes Z2m.
Abstract: This article presents an approach to area optimization of arithmetic datapaths at register-transfer level (RTL). The focus is on those designs that perform polynomial computations (add, mult) over finite word-length operands (bit-vectors). We model such polynomial computations over m-bit vectors as algebra over finite integer rings of residue classes Z2m. Subsequently, we use the number-theoretic and algebraic properties of such rings to transform a given datapath computation into another, bit-true equivalent computation. We also derive a cost model to estimate, at RTL, the area cost of the computation. Using the transformation procedure along with the cost model, we devise algorithmic procedures to search for a lower-cost implementation. We show how these theoretical concepts can be applied to RTL optimization of arithmetic datapaths within practical CAD settings. Experiments conducted over a variety of benchmarks demonstrate substantial optimizations using our approach.
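The ring-theoretic point is easy to demonstrate at small scale; the polynomials below are a hand-picked example for m = 3, not one of the paper's benchmarks:

```python
# Two syntactically different polynomials can be bit-true equivalent over
# Z_2^m because some nonzero polynomials vanish on every m-bit value.

m = 3
M = 1 << m                                   # arithmetic is modulo 2^m = 8

orig = lambda x: (5 * x * x + 5 * x) % M     # original datapath polynomial
opt = lambda x: (x * x + x) % M              # cheaper candidate

# Their difference is 4x^2 + 4x = 4x(x + 1); since x(x + 1) is always even,
# the difference is a multiple of 8, so it vanishes in Z_8 and the two
# computations are bit-true equivalent over 3-bit vectors:
assert all(orig(x) == opt(x) for x in range(M))
```

Exploiting such vanishing polynomials is what lets a transformation replace large coefficients or extra multiplications with a strictly cheaper, bit-true equivalent RTL computation.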

Journal ArticleDOI
TL;DR: In this article, integer linear programming formulations and heuristic techniques for process allocation and data mapping on symmetric multiprocessing (SMP) and block-multithreading-based network processors are presented.
Abstract: Network processors incorporate several architectural features, including symmetric multiprocessing (SMP), block multithreading, and multiple memory elements, to support the high-performance requirements of current day applications. This article presents automated system-level design techniques for application development on such architectures. We propose integer linear programming formulations and heuristic techniques for process allocation and data mapping on SMP and block-multithreading-based network processors. The techniques incorporate process transformations and multithreading-aware data mapping to maximize the throughput of the application. The article presents experimental results that evaluate the techniques by implementing network processing applications on the Intel IXP 2400 architecture.

Journal ArticleDOI
TL;DR: A code-block-level containment-checking-based methodology for application partitioning verification and a state space reduction technique specific to the containment checking, reachability analysis, and deadlock detection problems are proposed.
Abstract: With the advent of multiprocessor embedded platforms, application partitioning and mapping have gained primacy as a design step. The output of this design step is a multithreaded partitioned application where each thread is mapped to a processing element (processor or ASIC) in the multiprocessor platform. This partitioned application must be verified to be consistent with the native unpartitioned application. This verification task is called application (or task) partitioning verification. This work proposes a code-block-level containment-checking-based methodology for application partitioning verification. We use a UML-based code-block-level modeling language which is rich enough to model most designs. We formulate the application partitioning verification problem as a special case of the containment checking problem, which we call the complete containment checking problem. We propose a state space reduction technique specific to the containment checking, reachability analysis, and deadlock detection problems. We propose novel data structures and token propagation methodologies which enhance the efficiency of containment checking. We present an efficient containment checking algorithm for the application partitioning verification problem. We develop a containment checking tool called TraceMatch and present experimental results. We present a comparison of the state space reduction achieved by TraceMatch with that achieved by formal analysis and verification tools like Spin, PEP, PROD, and LoLA.

Journal ArticleDOI
TL;DR: This algorithm divides an image computation step into a disjunctive set of easier ones that can be performed in isolation, and uses hypergraph partitioning to minimize the number of live variables in each dis junctive component, and variable scopes to simplify transition relations and reachable state subsets.
Abstract: Existing BDD-based symbolic algorithms designed for hardware designs do not perform well on software programs. We propose novel techniques based on unique characteristics of software programs. Our algorithm divides an image computation step into a disjunctive set of easier ones that can be performed in isolation. We use hypergraph partitioning to minimize the number of live variables in each disjunctive component, and variable scopes to simplify transition relations and reachable state subsets. Our experiments on nontrivial C programs show that BDD-based symbolic algorithms can directly handle software models with a much larger number of state variables than for hardware designs.

Journal ArticleDOI
TL;DR: The problem of area-balanced bipartitioning is shown to be NP-hard, and a maxflow-based heuristic is proposed.
Abstract: This article addresses the problem of recursively bipartitioning a given floorplan F using monotone staircases. At each level of the hierarchy, a monotone staircase from one corner of F to its opposite corner is identified, such that (i) the two parts of the bipartition are nearly equal in area (or in the number of blocks), and (ii) the number of nets crossing the staircase is minimal. The problem of area-balanced bipartitioning is shown to be NP-hard, and a maxflow-based heuristic is proposed. Such a hierarchy may be useful to repeater placement in deep-submicron physical design, and also to global routing.
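The heuristic above reduces finding a low-crossing staircase cut to a max-flow/min-cut computation. The following is a generic Edmonds-Karp max-flow sketch of that core step; the graph, capacities, and node names in the usage are illustrative, not taken from the paper, and the reduction from floorplan to flow network is omitted.

```python
# Edmonds-Karp max-flow: repeatedly find a shortest augmenting path by BFS
# and push the bottleneck capacity along it. The min cut it certifies is
# what the staircase-bipartitioning heuristic uses to minimize net crossings.
from collections import deque

def max_flow(cap, s, t):
    """cap: dict[u][v] -> capacity; mutated in place as the residual graph."""
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:           # no augmenting path left: flow is maximum
            return flow
        path, v = [], t               # recover the augmenting path t -> s
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:             # push flow, maintain residual edges
            cap[u][v] -= aug
            cap.setdefault(v, {}).setdefault(u, 0)
            cap[v][u] += aug
        flow += aug
```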

Journal ArticleDOI
TL;DR: Experimental results show that these new techniques can reduce idle energy by 50--70%, or 30--50% of total system energy over previous offline-optimal but unsequenced techniques based on localized break-even-time analysis, thanks to rich options offered by mode sequencing.
Abstract: This article presents techniques for reducing idle energy by mode-sequence optimization (MSO) under timing constraints. Our component-level CoMSO algorithm computes energy-optimal mode-transition sequences for different lengths of idle intervals. Our system-level SyMSO algorithm shifts tasks within slack intervals while satisfying all timing and resource constraints in the given schedule. Experimental results on a commercial software-defined radio show that these new techniques can reduce idle energy by 50--70%, or 30--50% of total system energy over previous offline-optimal but unsequenced techniques based on localized break-even-time analysis, thanks to rich options offered by mode sequencing.
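The break-even-time analysis that this work improves upon can be sketched as follows. This is the baseline single-mode selection, not the CoMSO/SyMSO sequencing algorithms; the mode parameters in the usage are invented for illustration, with each mode given as (idle power in W, transition energy in J, transition time in s).

```python
# Break-even-time analysis: for an idle interval of length t, a low-power
# mode pays a fixed transition cost and then saves power for the remainder;
# the best single mode minimizes total idle-interval energy.
def idle_energy(mode, t):
    p_idle, e_trans, t_trans = mode
    if t < t_trans:                 # interval too short to enter this mode
        return float('inf')
    return e_trans + p_idle * (t - t_trans)

def best_mode(modes, t):
    """Pick the single mode with minimum energy over an idle interval t."""
    return min(modes, key=lambda m: idle_energy(m, t))
```

Short intervals favor staying active, medium ones favor a shallow sleep mode, and long ones favor shutdown; the paper's point is that a *sequence* of mode transitions within one interval can beat any single choice.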

Journal ArticleDOI
TL;DR: A novel technique---decode filter cache (DFC)---for minimizing power consumption with minimal performance impact and the next fetch prediction mechanism reduces miss penalty by more than 91%.
Abstract: With advances in semiconductor technology, power management has increasingly become a very important design constraint in processor design. In embedded processors, instruction fetch and decode consume more than 40% of processor power. This calls for development of power minimization techniques for the fetch and decode stages of the processor pipeline. To this end, the filter cache has been proposed as an architectural extension for reducing power consumption. A filter cache is placed between the CPU and the instruction cache (I-cache) to provide the instruction stream. A filter cache has the advantages of shorter access time and lower power consumption. However, the downside of a filter cache is a possible performance loss in case of cache misses. In this article, we present a novel technique---decode filter cache (DFC)---for minimizing power consumption with minimal performance impact. The DFC stores decoded instructions. Thus, a hit in the DFC eliminates instruction fetch and its subsequent decoding. The bypassing of both instruction fetch and decode reduces processor power. We present a runtime approach for predicting whether the next fetch source is present in the DFC. In case a miss is predicted, we reduce the miss penalty by accessing the I-cache directly. We propose to classify instructions as cacheable or noncacheable, depending on the decode width. For efficient use of the cache space, a sectored cache design is used for the DFC so that both cacheable and noncacheable instructions can coexist in the DFC sector. Experimental results show that the DFC reduces processor power by 34% on average and our next fetch prediction mechanism reduces miss penalty by more than 91%.
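The basic DFC access pattern can be illustrated with a toy direct-mapped cache of decoded instructions. This sketch omits the paper's sectored design, cacheability classification, and next-fetch prediction; the line count and decode function are illustrative assumptions.

```python
# Toy decode filter cache: a hit returns the stored *decoded* instruction,
# skipping both fetch and decode; a miss falls back to the I-cache path
# (modeled by calling `decode`) and installs the decoded form.
class DecodeFilterCache:
    def __init__(self, n_lines):
        self.n = n_lines
        self.tags = [None] * n_lines
        self.data = [None] * n_lines
        self.hits = self.misses = 0

    def access(self, pc, decode):
        idx, tag = pc % self.n, pc // self.n
        if self.tags[idx] == tag:
            self.hits += 1
            return self.data[idx]     # decoded instruction: no fetch/decode
        self.misses += 1              # miss: fetch from I-cache and decode
        self.tags[idx], self.data[idx] = tag, decode(pc)
        return self.data[idx]
```

A tight loop that fits in the cache turns every iteration after the first into hits, which is exactly the behavior the power savings rely on.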

Journal ArticleDOI
TL;DR: This article presents an application-driven retargetable prototyping platform that aims to facilitate the design exploration of the communication subsystem through application-level execution-driven simulations and quantitative analysis and shows that, through careful analysis and construction, it is possible for the modeling environment to support the common features of these architectures as part of the library.
Abstract: In multiprocessor-based SoCs, optimizing the communication architecture is often as important, if not more important, than optimizing the computation architecture. While there are mature platforms and techniques for the modeling and evaluation of architectures of processing elements, the same is not true for communication architectures. This article presents an application-driven retargetable prototyping platform that fills this gap. This environment aims to facilitate design exploration of the communication subsystem through application-level execution-driven simulations and quantitative analysis. Based on an analysis of a wide range of on-chip communication architectures, we describe how a specific hierarchical class library can be used to develop new on-chip communication architectures, or variants of existing ones, with relatively little incremental effort. We demonstrate this through three case studies, including two commercial on-chip bus systems and an on-chip packet switching network. Here we show that, through careful analysis and construction, it is possible for the modeling environment to support the common features of these architectures as part of the library and permit instantiation of the individual architectures as variants of the library design. Consequently, system-level design choices regarding the communication architecture can be made with high confidence in the early stages of design. In addition to improving design quality, this methodology also significantly shortens design time.

Journal ArticleDOI
TL;DR: This article proposes an approach called deadspace utilization (DSU) to reclaim the unused area of an interconnect optimized floorplan by linear programming and shows that this technique can be applied to reduce the area and total wirelength of an interconnect optimized floorplan further while routability can be maintained at the same time.
Abstract: Interconnect optimization has become the major concern in floorplanning. Many approaches use simulated annealing (SA) with a cost function composed of a weighted sum of area, wirelength, and interconnect cost. These approaches can reduce the interconnect cost efficiently, but the area penalty of the interconnect optimized floorplan is usually quite large. In this article, we propose an approach called deadspace utilization (DSU) to reclaim the unused area of an interconnect optimized floorplan by linear programming. Since modules are not necessarily rectangular in shape in floorplanning, some deadspace can be redistributed to the modules to increase the area occupied by each module. If the area of each module can be expanded by the same ratio, the whole floorplan can be compacted by that ratio to give a smaller floorplan. However, we limit the compaction ratio to prevent overcongestion. Experiments show that this deadspace utilization technique can further reduce the area and total wirelength of an interconnect optimized floorplan while maintaining routability.
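The uniform-expansion idea can be sketched in its simplest special case. The paper solves a linear program because deadspace regions are shared among neighboring modules; the closed form below assumes instead that each module has a private pool of reclaimable deadspace, and the congestion cap of 1.25 is an invented illustrative value, not the paper's.

```python
# Simplified deadspace utilization: every module i of area a_i may absorb
# up to d_i of adjacent deadspace. A uniform expansion ratio r is feasible
# when a_i * (r - 1) <= d_i for all i, so the best r is limited by the
# tightest module; the ratio is capped to avoid overcongestion.
def max_expansion_ratio(modules, cap=1.25):
    """modules: list of (area, private_deadspace). Returns the largest
    feasible uniform expansion ratio, capped at `cap`."""
    r = 1.0 + min(d / a for a, d in modules)
    return min(r, cap)
```

If every module can grow by ratio r, the whole floorplan can then be compacted by 1/r, which is where the area saving comes from.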

Journal ArticleDOI
TL;DR: A realistic technique supported by a tool flow to explore operation shuffling for improving generation of L0 clusters and a technique to support VLIW processors with multiple data clusters, which is essential to apply the methodology to real world processors.
Abstract: Clustering L0 buffers is effective for energy reduction in the instruction memory hierarchy of embedded VLIW processors. However, the efficiency of the clustering depends on the schedule of the target application. Especially in heterogeneous or data clustered VLIW processors, determining an energy-efficient schedule is more constrained. This article proposes a realistic technique supported by a tool flow to explore operation shuffling for improving generation of L0 clusters. The tool flow explores assignment of operations for each cycle and generates various schedules. This approach makes it possible to reduce energy consumption for various processor architectures. However, the computational complexity is large because of the huge exploration space. Therefore, some heuristics are also developed, which reduce the size of the exploration space while the solution quality remains reasonable. Furthermore, we also propose a technique to support VLIW processors with multiple data clusters, which is essential to apply the methodology to real-world processors. The experimental results indicate potential gains of up to 27.6% in energy in L0 buffers, through operation shuffling for heterogeneous processor architectures as well as a homogeneous architecture. Furthermore, the proposed heuristics drastically reduce the exploration search space by about 90%, while the results are comparable to full search, with average differences of less than 1%. The experimental results indicate that energy efficiency can be improved in most of the media benchmarks by the proposed methodology, where the average gain is around 10% in comparison with generating clusters without operation shuffling.
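The per-cycle exploration can be caricatured with a full search over issue-slot assignments. The cost model here is a toy proxy (number of L0 clusters activated per cycle, on the reasoning that fewer active clusters means fewer buffers fetching); the paper's flow uses a real energy model and the heuristics it describes to prune exactly this kind of exponential space.

```python
# Operation shuffling by exhaustive search: for each cycle, try every
# assignment of its operations to issue slots and keep the one that
# activates the fewest L0 clusters. Ops are treated as interchangeable
# across slots (no FU-type constraints), an illustrative simplification.
from itertools import permutations

def best_shuffle(schedule, slot_to_cluster):
    """schedule: list of cycles, each a list of ops.
    slot_to_cluster: cluster id for each issue slot.
    Returns the minimum total active-cluster count over all shuffles."""
    total = 0
    n_slots = len(slot_to_cluster)
    for ops in schedule:
        best = n_slots + 1
        for perm in permutations(range(n_slots), len(ops)):
            best = min(best, len({slot_to_cluster[s] for s in perm}))
        total += best
    return total
```

With two clusters of two slots each, two independent operations can always be packed into a single cluster, which is the kind of opportunity shuffling exposes that a fixed schedule may miss.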

Journal ArticleDOI
TL;DR: A race-condition-aware clock skew scheduling is proposed to determine the clock skew schedule by taking race conditions into account, and the objective is not only to optimize the clock period, but also to minimize heuristically the required inserted delay.
Abstract: In this article, we provide a fresh viewpoint on the interactions between clock skew scheduling and delay insertion. A race-condition-aware (RCA) clock skew scheduling is proposed to determine the clock skew schedule by taking race conditions (i.e., hold violations) into account. Our objective is not only to optimize the clock period, but also to heuristically minimize the required inserted delay. Compared with previous work, our major contribution includes the following two aspects. First, our approach achieves exactly the same results, but has significant improvement in time complexity. Second, our viewpoint can be generalized to other sequential timing optimization techniques.
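The constraint structure underlying race-condition-aware skew scheduling is the standard system of difference constraints, which can be checked with Bellman-Ford. This sketch is not the paper's RCA algorithm; setup and hold times are folded into the path delays for brevity, and the example paths are invented.

```python
# Skew-schedule feasibility for a fixed clock period T. For each
# register-to-register path (i, j) with min delay dmin and max delay Dmax:
#   setup: s_i - s_j <= T - Dmax
#   hold (race): s_j - s_i <= dmin
# A schedule exists iff the difference-constraint graph has no negative
# cycle; Bellman-Ford from a virtual source finds one feasible assignment.
def feasible_skews(n, paths, T):
    """paths: list of (i, j, dmin, Dmax). Returns a skew list or None."""
    edges = []
    for i, j, dmin, Dmax in paths:
        edges.append((j, i, T - Dmax))   # setup constraint edge
        edges.append((i, j, dmin))       # hold constraint edge
    dist = [0.0] * n                     # virtual source reaches every node
    for _ in range(n):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:                # further relaxation => negative cycle
        if dist[u] + w < dist[v] - 1e-9:
            return None
    return dist                          # one feasible skew assignment
```

Shrinking T tightens the setup edges until the cycle through a short path goes negative, which is precisely where delay insertion (padding dmin) becomes necessary; minimizing that inserted delay is the paper's objective.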

Journal ArticleDOI
TL;DR: This article presents an explicit design for the noncompound (k, f2(k))-USB, and presents an efficient detailed routing algorithm for the new (k, W)-USB designs.
Abstract: A switch block with k sides and W terminals on each side is said to be universal (a (k, W)-USB) if it is routable for every set of 2-pin nets of channel density at most W. The generic optimum universal switch block design problem is to design a (k, W)-USB with the minimum number of switches for every pair of (k, W). This problem was first proposed and solved for k = 4 in Chang et al. [1996], and then solved for even W or for k ≤ 6 in Shyu et al. [2000] and Fan et al. [2002b]. No optimum (k, W)-USB is known for k ≥ 7 and odd W ≥ 3. But it is already known that when W is a large odd number, a near-optimum (k, W)-USB can be obtained by a disjoint union of (W − f2(k))/2 copies of the optimum (k, 2)-USB and a noncompound (k, f2(k))-USB, where the value of f2(k) is unknown for k ≥ 8. In this article, we show that f2(k) = (k + 3 − i)/3, where 1 ≤ i ≤ 6 and i ≡ k (mod 6), and present an explicit design for the noncompound (k, f2(k))-USB. Combining these two results we obtain the exact designs of (k, W)-USBs for all k ≥ 7 and odd W ≥ 3. The new (k, W)-USB designs also yield an efficient detailed routing algorithm.
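The closed form and the compound decomposition are easy to evaluate directly. The code below transcribes the formula as reconstructed in this abstract (f2(k) = (k + 3 − i)/3 with i ≡ k (mod 6), 1 ≤ i ≤ 6); note that this choice of i always makes k + 3 − i a multiple of 3 and f2(k) odd, consistent with the odd-W setting.

```python
# f2(k) per the abstract's formula, and the compound decomposition of an
# optimum (k, W)-USB for odd W >= f2(k): (W - f2(k))/2 copies of the
# optimum (k, 2)-USB plus one noncompound (k, f2(k))-USB.
def f2(k):
    i = ((k - 1) % 6) + 1            # unique i in 1..6 with i = k (mod 6)
    return (k + 3 - i) // 3          # exact: k + 3 - i is a multiple of 3

def decomposition(k, W):
    """Return (#copies of the (k,2)-USB, density of the noncompound part)."""
    assert W % 2 == 1 and W >= f2(k)
    return (W - f2(k)) // 2, f2(k)
```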