Showing papers in "ACM Transactions on Design Automation of Electronic Systems in 2008"
TL;DR: Both experimental and analysis results show that the NoC architecture scales very well in terms of area, performance, energy, and design effort, while the P2P and bus-based architectures scale poorly on all counts except for performance and area, respectively.
Abstract: Traditionally, design-space exploration for systems-on-chip (SoCs) has focused on the computational aspects of the problem at hand. However, as the number of components on a single chip and their performance continue to increase, a shift from computation-based to communication-based design becomes mandatory. As a result, the communication architecture plays a major role in the area, performance, and energy consumption of the overall system. This article presents a comprehensive evaluation of three on-chip communication architectures targeting multimedia applications. Specifically, we compare and contrast the network-on-chip (NoC) with point-to-point (P2P) and bus-based communication architectures in terms of area, performance, and energy consumption. As the main contribution, we present complete P2P, bus-, and NoC-based implementations of a real multimedia application (i.e., the MPEG-2 encoder), and provide direct measurements using an FPGA prototype and actual video clips, rather than simulation and synthetic workloads. We also support the experimental findings through a theoretical analysis. Both experimental and analysis results show that the NoC architecture scales very well in terms of area, performance, energy, and design effort, while the P2P and bus-based architectures scale poorly on all counts except for performance and area, respectively.
228 citations
TL;DR: This work incorporates electrical masking into an extended PTM framework for reliability computation by computing error-attenuation probabilities based on analytical models, and defines a susceptibility measure to identify gates whose errors are not well masked.
Abstract: We propose the probabilistic transfer matrix (PTM) framework to capture nondeterministic behavior in logic circuits. PTMs provide a concise description of both normal and faulty behavior, and are well-suited to reliability and error susceptibility calculations. A few simple composition rules based on connectivity can be used to recursively build larger PTMs (representing entire logic circuits) from smaller gate PTMs. PTMs for gates in series are combined using matrix multiplication, and PTMs for gates in parallel are combined using the tensor product operation. PTMs can accurately calculate joint output probabilities in the presence of reconvergent fanout and inseparable joint input distributions. To improve computational efficiency, we encode PTMs as algebraic decision diagrams (ADDs). We also develop equivalent ADD algorithms for newly defined matrix operations such as eliminate_variables and eliminate_redundant_variables, which aid in the numerical computation of circuit PTMs. We use PTMs to evaluate circuit reliability and derive polynomial approximations for circuit error probabilities in terms of gate error probabilities. PTMs can also analyze the effects of logic and electrical masking on error mitigation. We show that ignoring logic masking can overestimate errors by an order of magnitude. We incorporate electrical masking by computing error attenuation probabilities, based on analytical models, into an extended PTM framework for reliability computation. We further define a susceptibility measure to identify gates whose errors are not well masked. We show that hardening a few gates can significantly improve circuit reliability.
139 citations
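To make the composition rules above concrete, here is a minimal Python sketch (my own illustration, not the authors' ADD-based implementation): gate PTMs are row-stochastic matrices indexed by input combinations and output values, series composition is an ordinary matrix product, and parallel composition is a Kronecker (tensor) product. The 5% error probability is a made-up value.

```python
import numpy as np

def gate_ptm(truth_table, p_err):
    """PTM of a single-output gate: rows are input combinations, columns
    are output values {0, 1}; with probability p_err the output flips."""
    m = np.zeros((len(truth_table), 2))
    for row, out in enumerate(truth_table):
        m[row, out] = 1.0 - p_err
        m[row, 1 - out] = p_err
    return m

p = 0.05                           # made-up gate error probability
AND = gate_ptm([0, 0, 0, 1], p)    # 4x2: inputs 00, 01, 10, 11
NOT = gate_ptm([1, 0], p)          # 2x2

stage1 = np.kron(NOT, NOT)         # gates in parallel: tensor product (4x4)
circuit = stage1 @ AND             # gates in series: matrix product (4x2)

uniform_inputs = np.full(4, 0.25)  # joint input distribution
print(uniform_inputs @ circuit)    # [P(out = 0), P(out = 1)]
```

Because each gate PTM is row-stochastic, the composed circuit PTM is too, so the joint output distribution follows directly from a vector-matrix product with the input distribution.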
TL;DR: This work presents a technique for automata-based checker generation of PSL properties for dynamic verification, and shows that the generated checkers are resource-efficient for use in hardware emulation, simulation acceleration and silicon debug.
Abstract: Assertion-based verification with languages such as PSL is gaining in importance. From assertions, one can generate hardware assertion checkers for use in emulation, simulation acceleration and silicon debug. We present techniques for checker generation of the complete set of PSL properties, including all variants of operators, both strong and weak. A full automata-based approach allows an entire assertion to be represented by a single automaton, hence allowing optimizations that cannot be done in a modular approach where subcircuits are created only for individual operators. For this purpose, automata algorithms are developed for the base cases, and a complete set of rewrite rules is derived for other operators. Automata splitting is introduced for an efficient implementation of the eventually! operator.
123 citations
TL;DR: The PeaCE codesign environment is the first full-fledged HW-SW codesign environment that provides a seamless codesign flow from functional simulation to system synthesis, and is a reconfigurable framework in the sense that third-party design tools can be integrated to build a customized tool chain.
Abstract: Existing hardware-software (HW-SW) codesign tools mainly focus on HW-SW cosimulation to build a virtual prototyping environment that enables software design and system verification without the need to build a hardware prototype. Beyond HW-SW cosimulation, however, HW-SW codesign methodology also involves system specification, functional simulation, design-space exploration, and hardware-software cosynthesis. The PeaCE codesign environment is the first full-fledged HW-SW codesign environment that provides a seamless codesign flow from functional simulation to system synthesis. Targeting multimedia applications with real-time constraints, PeaCE specifies the system behavior with a heterogeneous composition of three models of computation and maximally utilizes the features of the formal models throughout the whole design process. It is also a reconfigurable framework in the sense that third-party design tools can be integrated to build a customized tool chain. Experiments with industry-strength examples prove the viability of the proposed technique.
122 citations
TL;DR: It is demonstrated that in addition to harnessing the probabilistic behavior of PCMOS devices, PSoC architectures yield significant improvements in both energy consumed and performance in the context of probabilistic or randomized applications with broad utility.
Abstract: Parameter variations, noise susceptibility, and increasing energy dissipation of CMOS devices have been recognized as major challenges in circuit and microarchitecture design in the nanometer regime. Among these, parameter variations and noise susceptibility are increasingly causing CMOS devices to behave in an “unreliable” or “probabilistic” manner. To address these challenges, a shift in design paradigm from current-day deterministic designs to “statistical” or “probabilistic” designs is deemed inevitable. To respond to this need, in this article, we introduce and study an entirely novel family of probabilistic architectures: the probabilistic system-on-a-chip (PSoC). PSoC architectures are based on CMOS devices rendered probabilistic due to noise, referred to as probabilistic CMOS or PCMOS devices. We demonstrate that in addition to harnessing the probabilistic behavior of PCMOS devices, PSoC architectures yield significant improvements in both energy consumed and performance in the context of probabilistic or randomized applications with broad utility. All of our application and architectural savings are quantified using the product of the energy and performance, denoted (energy × performance): the PCMOS-based gains are as high as a substantial multiplicative factor of over 560 when compared to a competing energy-efficient CMOS-based realization. Our architectural design is application specific and involves navigating a design space spanning the algorithm (application), its architecture (PSoC), and the probabilistic technology (PCMOS).
78 citations
TL;DR: Techniques are presented to merge multiple use-cases into one hardware design to minimize cost and design time, making it well suited for fast design-space exploration (DSE) in MPSoC systems.
Abstract: Future applications for embedded systems demand chip multiprocessor designs to meet real-time deadlines. The large number of applications in these systems generates an exponential number of use-cases. The key design automation challenges are designing systems for these use-cases and fast exploration of software and hardware implementation alternatives with accurate performance evaluation of these use-cases. These challenges cannot be overcome by current design methodologies, which are semiautomated, time consuming, and error prone. In this article, we present a design methodology to generate multiprocessor systems in a systematic and fully automated way for multiple use-cases. Techniques are presented to merge multiple use-cases into one hardware design to minimize cost and design time, making it well suited for fast design-space exploration (DSE) in MPSoC systems. Heuristics to partition use-cases are also presented such that each partition can fit in an FPGA and all use-cases can be catered for. The proposed methodology is implemented in a tool for Xilinx FPGAs for evaluation. The tool is also made available online for the benefit of the research community and is used to carry out a DSE case study with multiple use-cases of real-life applications: H.263 and JPEG decoders. The generation of the entire design takes about 100 ms, and the whole DSE was completed in 45 minutes, including FPGA mapping and synthesis. The heuristics used for use-case partitioning reduce the design-exploration time elevenfold in a case study with mobile-phone applications.
76 citations
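As a rough illustration of the use-case partitioning idea, here is a hypothetical first-fit sketch with invented component areas; the tool's actual cost model and heuristics differ. The key point it captures is that merged use-cases share hardware, so a partition costs the area of the union of components rather than the sum.

```python
# Hypothetical component areas (FPGA slices); purely illustrative values.
AREA = {"proc": 1200, "mem": 800, "vld": 500, "idct": 700, "dct": 700, "cc": 400}

def merged_area(use_cases):
    """Merging use-cases shares hardware, so the merged design needs the
    area of the *union* of their components, not the sum."""
    return sum(AREA[c] for c in set().union(*use_cases))

def partition(use_cases, capacity):
    """Greedy first-fit decreasing: put each use-case into the first
    partition whose merged design still fits the FPGA; otherwise open
    a new partition (i.e., a separate bitstream)."""
    parts = []
    for uc in sorted(use_cases, key=lambda u: merged_area([u]), reverse=True):
        for p in parts:
            if merged_area(p + [uc]) <= capacity:
                p.append(uc)
                break
        else:
            parts.append([uc])
    return parts

use_cases = [{"proc", "mem", "vld", "idct"},   # e.g. decoding
             {"proc", "mem", "dct", "cc"},     # e.g. encoding
             {"proc", "mem", "cc"}]            # e.g. a simple filter
print(partition(use_cases, capacity=3500))     # -> two partitions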
TL;DR: This work proposes a methodology which enables heterogeneous specification of complex, electronic systems in SystemC supporting the integration of components under different models of computation (MoCs) and proposes the set of rules and guidelines required by each specific MoC.
Abstract: This work proposes a methodology which enables heterogeneous specification of complex, electronic systems in SystemC supporting the integration of components under different models of computation (MoCs). This feature is necessary in order to deal with the growing complexity, concurrency, and heterogeneity of electronic embedded systems. The specification methodology is based on the SystemC standard language. Nevertheless, the use of SystemC for heterogeneous system specification is not straightforward. The first problem to be addressed is the efficient and predictable mapping of untimed events required by abstract MoCs over the discrete-event MoC on which the SystemC simulation kernel is based. This mapping is essential in order to understand the simulation results provided by the SystemC model of those MoCs. The specification methodology proposes the set of rules and guidelines required by each specific MoC. Moreover, the methodology supports a smooth integration of several MoCs in the same system specification. A set of facilities is provided covering the deficiencies of the language. These facilities constitute the methodology-specific library called HetSC. The methodology and associated library have been demonstrated to be useful for the specification of complex, heterogeneous embedded systems supporting essential design tasks such as performance analysis and SW generation.
73 citations
TL;DR: Experiments with preliminary examples, including the H.263 decoder, show that the proposed parallel-programming framework increases the design productivity of MPSoC software significantly.
Abstract: As more processing elements are integrated in a single chip, embedded software design becomes more challenging: it becomes parallel programming for nontrivial heterogeneous multiprocessors with diverse communication architectures and design constraints such as hardware cost, power, and timeliness. In the current practice of parallel programming with MPI or OpenMP, the programmer must manually optimize the parallel code for each target architecture and for the design constraints. Thus, the cost of design-space exploration for MPSoCs (multiprocessor systems-on-chip) becomes prohibitively large as the software development overhead increases drastically. To solve this problem, we develop a parallel-programming framework based on a novel programming model called common intermediate code (CIC). In a CIC, functional parallelism and data parallelism of application tasks are specified independently of the target architecture and design constraints. The CIC translator then translates the CIC into the final parallel code, considering the target architecture and design constraints, which makes the CIC retargetable. Experiments with preliminary examples, including the H.263 decoder, show that the proposed parallel-programming framework increases the design productivity of MPSoC software significantly.
64 citations
TL;DR: A new hardware-software emulation framework is presented that allows designers a complete exploration of the thermal behavior of final MPSoC designs early in the design flow and shows speedups of three orders of magnitude compared to cycle-accurate MPSoC simulators.
Abstract: New tendencies envisage multiprocessor systems-on-chips (MPSoCs) as a promising solution for the consumer electronics market. MPSoCs are complex to design, as they must execute multiple applications (games, video) while meeting additional design constraints (energy consumption, time-to-market). Moreover, the rise of temperature in the die for MPSoCs can seriously affect their final performance and reliability. In this article, we present a new hardware-software emulation framework that allows designers a complete exploration of the thermal behavior of final MPSoC designs early in the design flow. The proposed framework uses FPGA emulation as the key element to model hardware components of the considered MPSoC platform at multimegahertz speeds. It automatically extracts detailed system statistics that are used as input to our software thermal library running in a host computer. This library calculates at runtime the temperature of on-chip components, based on the collected statistics from the emulated system and final floorplan of the MPSoC. This enables fast testing of various thermal management techniques. Our results show speedups of three orders of magnitude compared to cycle-accurate MPSoC simulators.
47 citations
TL;DR: The experimental results on two pipelined processor models demonstrate several orders-of-magnitude reduction in overall validation effort by drastically reducing both test-generation time and number of test programs required to achieve a coverage goal.
Abstract: Functional validation is a major bottleneck in pipelined processor design due to the combined effects of increasing design complexity and lack of efficient techniques for directed test generation. Directed test vectors can reduce overall validation effort, since shorter tests can reach the same coverage goal as random tests. This article presents a specification-driven directed test generation methodology. The proposed methodology makes three important contributions. First, a general graph model is developed that can capture the structure and behavior (instruction set) of a wide variety of pipelined processors. The graph model is generated from the processor specification. Next, we propose a functional fault model that is used to define the functional coverage for pipelined architectures. Finally, we propose two complementary test generation techniques: test generation using model checking, and test generation using template-based procedures. These test generation techniques accept the graph model of the architecture as input and generate test programs to detect all the faults in the functional fault model. Our experimental results on two pipelined processor models demonstrate several orders-of-magnitude reduction in overall validation effort by drastically reducing both the test-generation time and the number of test programs required to achieve a coverage goal.
46 citations
TL;DR: This work developed a software thermal sensor (STS) in a Linux system with a Pentium 4 Northwood core that offers detailed power and temperature breakdowns of each functional unit at runtime, enabling more efficient online power and thermal monitoring and management at a higher level, such as the operating system.
Abstract: The evolution of microprocessors has been hindered by increasing power consumption and heat dissipation on die. An excessive amount of heat creates reliability problems, reduces the lifetime of a processor, and elevates the cost of cooling and packaging considerably. It is therefore imperative to be able to monitor the temperature variations across the die in a timely and accurate manner. Most current techniques rely on on-chip thermal sensors to report the temperature of the processor. Unfortunately, significant variation in chip temperature, both spatially and temporally, exposes the limitations of these sensors. We present a compensating approach to tracking chip temperature through an OS-resident software module that generates live power and thermal profiles of the processor. We developed such a software thermal sensor (STS) in a Linux system with a Pentium 4 Northwood core. We employed highly efficient numerical methods in our model to minimize the overhead of temperature calculation. We also developed an efficient algorithm for functional unit power modeling. Our power and thermal models are calibrated and validated against on-chip sensor readings, thermal images of the Northwood heat spreader, and thermometer measurements on the package. The resulting STS offers detailed power and temperature breakdowns of each functional unit at runtime, enabling more efficient online power and thermal monitoring and management at a higher level, such as the operating system.
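The runtime temperature calculation can be pictured with a lumped-RC thermal model integrated by explicit Euler steps, sketched below with invented per-unit parameters; the STS itself uses more efficient numerical methods and models calibrated against sensor and thermal-image data.

```python
def thermal_step(T, P, R, C, T_amb=45.0, dt=1e-3):
    """One explicit-Euler step of a lumped RC thermal model:
    C * dT/dt = P - (T - T_amb) / R   (T in deg C, P in watts)."""
    return T + dt * (P - (T - T_amb) / R) / C

# Hypothetical functional-unit parameters, purely for illustration.
T = 50.0                      # current unit temperature
for _ in range(1000):         # 1 s of simulated time at dt = 1 ms
    P = 8.0                   # power estimated from counter activity
    T = thermal_step(T, P, R=2.0, C=0.05)
print(f"steady-state estimate: {T:.1f} C")   # ~= T_amb + P*R = 61 C
```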
TL;DR: The goal of this project is to evaluate recently proposed security primitives for reconfigurable hardware by building a real embedded system with several cores on a single FPGA and implementing these primitives on the system.
Abstract: The extremely high cost of custom ASIC fabrication makes FPGAs an attractive alternative for deployment of custom hardware. Embedded systems based on reconfigurable hardware integrate many functions onto a single device. Since embedded designers often have no choice but to use soft IP cores obtained from third parties, the cores operate at different trust levels, resulting in mixed-trust designs. The goal of this project is to evaluate recently proposed security primitives for reconfigurable hardware by building a real embedded system with several cores on a single FPGA and implementing these primitives on the system. Overcoming the practical problems of integrating multiple cores together with security mechanisms will help us to develop realistic security-policy specifications that drive enforcement mechanisms on embedded systems.
TL;DR: This work proposes a predictive closed-loop flow control mechanism and develops traffic source and router models specifically targeted to NoCs and demonstrates the applicability of the proposed flow controller to actual designs using real NoC implementations.
Abstract: Networks-on-Chip (NoC) communication architectures have emerged recently as a scalable solution to on-chip communication problems. While NoC architectures may offer higher bandwidth compared to traditional bus-based communication, their performance can degrade significantly in the absence of effective flow control algorithms. Unfortunately, flow control algorithms developed for macronetworks either rely on local information or suffer from large communication overhead and unpredictable delays. Hence, using them in the NoC context is problematic at best. For this reason, we propose a predictive closed-loop flow control mechanism and make the following contributions: First, we develop traffic source and router models specifically targeted to NoCs. Then, we utilize these models to predict possible congestion in the network. Based on this information, the proposed scheme controls the packet injection rate at the traffic sources in order to regulate the total number of packets in the network. We also illustrate the proposed traffic source model and the applicability of the proposed flow controller to actual designs using real NoC implementations. Finally, simulations and an experimental study using our FPGA prototype show that the proposed controller delivers better performance than traditional switch-to-switch flow control algorithms under various real and synthetic traffic patterns.
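A toy version of the predictive closed-loop idea, with an invented queue model and controller gain (the article's source and router models are far more detailed): predict next-step buffer occupancy, then correct the source injection rate toward an occupancy set point.

```python
def control_injection(q, arrivals, service, q_ref, rate, rate_max, k=0.25):
    """One-step-ahead prediction of router buffer occupancy followed by
    a proportional correction of the source injection rate (all
    parameters here are invented for illustration)."""
    q_pred = max(0.0, q + arrivals - service)   # predicted occupancy
    rate += k * (q_ref - q_pred)                # steer toward set point
    return min(max(rate, 0.0), rate_max)

rate, q = 1.0, 12.0                  # packets/cycle, packets buffered
for step in range(6):
    q = max(0.0, q + rate * 4 - 6)   # queue evolution over a 4-cycle window
    rate = control_injection(q, rate * 4, 6, q_ref=8.0,
                             rate=rate, rate_max=2.0)
    print(f"step {step}: occupancy {q:4.1f}, injection rate {rate:.2f}")
```

With these numbers the loop settles at an occupancy of 8 packets and an injection rate that matches the 6-packet service capacity of the window.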
TL;DR: A language extension for SystemC along with a design methodology for describing and simulating dynamically reconfigurable systems at all levels of abstraction are introduced, providing maximum freedom of description of reconfiguration behavior and its control.
Abstract: With the ongoing integration of (dynamic) reconfiguration into current system models, new methodologies and tools are needed to help the designer during the development process. This article introduces a language extension for SystemC along with a design methodology for describing and simulating dynamically reconfigurable systems at all levels of abstraction. The presented library provides maximum freedom of description of reconfiguration behavior and its control, while featuring simulation of runtime configuration, removal, and exchange of custom modules as well as third-party IP-cores during the complete architecture refinement process. When designing at RT-level, the resulting hardware description can easily be synthesized by standard synthesis tools.
[...]
TL;DR: The key technologies required for effective binary synthesis are surveyed: decompilation techniques necessary for binary synthesis to achieve results competitive with source-level synthesis, hardware/software partitioning methods necessary to find critical binary regions suitable for synthesis, synthesis methods for converting regions to custom circuits, and binary update methods that enable replacement of critical binary areas by circuits.
Abstract: Recent high-level synthesis approaches and C-based hardware description languages attempt to improve the hardware design process by allowing developers to capture desired hardware functionality in a well-known high-level source language. However, these approaches have yet to achieve wide commercial success due in part to the difficulty of incorporating such approaches into software tool flows. The requirement of using a specific language, compiler, or development environment may cause many software developers to resist such approaches due to the difficulty and possible instability of changing well-established robust tool flows. Thus, in the past several years, synthesis from binaries has been introduced, both in research and in commercial tools, as a means of better integrating with tool flows by supporting all high-level languages and software compilers. Binary synthesis can be more easily integrated into a software development tool flow by only requiring an additional backend tool, and it even enables completely transparent dynamic translation of executing binaries to configurable hardware circuits. In this article, we survey the key technologies underlying the important emerging field of binary synthesis. We compare binary synthesis to several related areas of research, and we then describe the key technologies required for effective binary synthesis: decompilation techniques necessary for binary synthesis to achieve results competitive with source-level synthesis, hardware/software partitioning methods necessary to find critical binary regions suitable for synthesis, synthesis methods for converting regions to custom circuits, and binary update methods that enable replacement of critical binary regions by circuits.
TL;DR: A new design environment, called SoCDAL, is introduced for accelerating multiprocessor system-on-chip design through fast design-space exploration targeting real-time multimedia systems, together with a new approach that enables analyzing a process network model statically under some restrictions.
Abstract: Time-to-market pressure and the ever-growing design complexity of multiprocessor system-on-chips have demanded an efficient design environment that enables fast exploration of a large design space. In this article, we introduce a new design environment, called SoCDAL, for accelerating multiprocessor system-on-chip design through fast design-space exploration targeting real-time multimedia systems. SoCDAL is a set of mostly automated tools covering system specification, hardware/software estimation, application-to-architecture mapping, simulation model generation, and system verification through simulation. The process network model has been widely used for system specification because of its modeling capability. However, it is hard to use for real-time systems design, since its behavior cannot be estimated statically. We introduce a new approach that enables analyzing a process network model statically under some restrictions. For hardware/software estimation, we analyze the code statically. The application-to-architecture mapping process implements a novel algorithm to support an arbitrary number of processors, with performance evaluation by static scheduling that considers communication behavior. The mapping results are used to automatically generate simulation models at several transaction levels, which are then passed to a commercial tool. We show the effectiveness of our approaches through experimental results with multimedia applications such as JPEG, H.263, and H.264 encoders, as well as an H.264 decoder.
TL;DR: An original algorithm is presented to automatically check the deadlock-freeness of a network with a given routing function; a prototype tool has been developed, and automatic deadlock checking of large-scale networks with various routing functions has been successfully achieved.
Abstract: We present an extension of Duato's necessary and sufficient condition that a routing function must satisfy in order to be deadlock-free, to support environment constraints inducing extra dependencies between messages. We also present an original algorithm to automatically check the deadlock-freeness of a network with a given routing function. A prototype tool has been developed, and automatic deadlock checking of large-scale networks with various routing functions has been successfully achieved. We provide comparative results with the standard approach, highlighting the benefits of our method.
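For the flavor of such automatic checks: the classic Dally-Seitz condition declares a routing function deadlock-free if its channel dependency graph is acyclic (Duato's condition, which the article extends, is weaker, requiring only an acyclic escape subnetwork). Below is a minimal cycle check on a hand-built dependency graph; the channel names are invented for illustration.

```python
def has_cycle(deps):
    """DFS cycle detection on a channel dependency graph, given as
    {channel: set of channels it may wait on}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    def dfs(c):
        color[c] = GRAY
        for nxt in deps.get(c, ()):
            if color.get(nxt, WHITE) == GRAY:
                return True                       # back edge: cycle found
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False
    return any(color[c] == WHITE and dfs(c) for c in deps)

# Channels of a 4-node ring with deterministic clockwise routing:
ring = {"c01": {"c12"}, "c12": {"c23"}, "c23": {"c30"}, "c30": {"c01"}}
print(has_cycle(ring))   # True: pure clockwise ring routing can deadlock
```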
TL;DR: A query language for checking different notions of user-definable approximate equivalence is presented which extends the syntax of the AnaCTL model checking language, so that AnaCTL can be used to combine model checking with equivalence checking.
Abstract: We present a method for the application of formal techniques like model checking and equivalence checking to the validation of the transient response of nonlinear analog circuits. We propose a temporal logic called AnaCTL (computational tree logic for analog circuit verification) which is suitable for specifying properties specific to analog circuits. The application of AnaCTL to the validation of the transient behavior of arbitrarily nonlinear analog circuits is presented. The transient response of a circuit under all possible input waveforms is represented as a finite state machine (FSM) by bounding and discretizing the continuous state space of an analog circuit. We have developed algorithms to run AnaCTL queries on this discretized model using search-based methods, which reduce the runtime considerably by avoiding creation of the whole FSM. The application of these methods to several real-life analog circuits is presented, and we show that this system is a useful aid for detecting and debugging early design errors. We also present methods for checking the equivalence of the transient responses of two analog circuits. The behaviors of two different analog circuits are rarely exactly identical. Hence, we introduce a notion of approximate equivalence. A query language for checking different notions of user-definable approximate equivalence is presented which extends the syntax of the AnaCTL model checking language. In its extended form, AnaCTL can be used to combine model checking with equivalence checking.
TL;DR: A hybrid design-time/runtime reconfiguration scheduling heuristic is developed that generates its final schedule at runtime but carries out most computations at design-time.
Abstract: Due to the emergence of portable devices that must run complex dynamic applications, there is a need for flexible platforms for embedded systems. Runtime reconfigurable hardware can provide this flexibility, but the reconfiguration latency can significantly decrease the performance. When dealing with task graphs, runtime support that schedules the reconfigurations in advance can drastically reduce this overhead. However, executing complex scheduling heuristics at runtime may generate an excessive penalty. Hence, we have developed a hybrid design-time/runtime reconfiguration scheduling heuristic that generates its final schedule at runtime but carries out most computations at design-time. We have tested our approach on a PowerPC 405 processor embedded on an FPGA, demonstrating that it generates a very small runtime penalty while providing almost as good schedules as a full runtime approach.
TL;DR: This article describes and evaluates a compiler algorithm that maps the arrays of a loop-based computation to internal storage structures, either RAM blocks or discrete registers, to minimize the overall execution time while considering the capacity and bandwidth constraints of the storage resources.
Abstract: Configurable architectures offer the unique opportunity of realizing hardware designs tailored to the specific data and computational patterns of an application code. Customizing the storage structures is becoming increasingly important in mitigating the continuing gap between memory latencies and internal computing speeds. In this article we describe and evaluate a compiler algorithm that maps the arrays of a loop-based computation to internal storage structures, either RAM blocks or discrete registers. Our objective is to minimize the overall execution time while considering the capacity and bandwidth constraints of the storage resources. The novelty of our approach lies in creating a single framework that combines high-level compiler techniques with lower-level scheduling information for mapping the data. We illustrate the benefits of our approach for a set of image/signal processing kernels using a Xilinx Virtex™ Field-Programmable Gate Array (FPGA). Our algorithm leads to faster designs compared to the state-of-the-art custom data layout mapping technique, in some instances using less storage. When compared to hand-coded designs, our results are comparable in terms of execution time and resources, but are derived in a minute fraction of the design time.
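A greatly simplified greedy sketch of the RAM-versus-register mapping decision, with invented arrays and budgets; the paper's actual algorithm couples this choice with scheduling information to model bandwidth constraints rather than using a single benefit heuristic.

```python
# Hypothetical arrays: (name, size in words, accesses per iteration).
arrays = [("coef", 8, 4), ("line", 256, 2), ("img", 4096, 1), ("tmp", 16, 6)]

def map_storage(arrays, reg_budget=64, ram_ports=2):
    """Greedy sketch: discrete registers give full bandwidth, so spend the
    register budget on the hottest (smallest) arrays first; everything
    else goes to dual-ported RAM blocks."""
    mapping, regs_used = {}, 0
    # Benefit heuristic: accesses per word favors hot, small arrays.
    for name, size, acc in sorted(arrays, key=lambda a: a[2] / a[1],
                                  reverse=True):
        if regs_used + size <= reg_budget:
            mapping[name] = "registers"
            regs_used += size
        else:
            mapping[name] = f"RAM block ({ram_ports} ports)"
    return mapping

print(map_storage(arrays))
# -> coef and tmp in registers; line and img in RAM blocks
```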
TL;DR: An optimization framework is presented that explicitly considers the characteristics of the FC-Bh system and is aimed at minimizing fuel consumption; it is applied on top of a prediction-based DPM policy to derive a new fuel-efficient DPM scheme.
Abstract: This article presents our work on the development of a fuel cell (FC) and battery hybrid (FC-Bh) system for use in portable microelectronic systems. We describe the design and control of the hybrid system, as well as a dynamic power management (DPM)-based energy management policy that extends its operational lifetime. The FC is of the proton exchange membrane (PEM) type, operates at room temperature, and has an energy density which is 4--6 times that of a Li-ion battery. The FC cannot respond to sudden changes in the load, and so a system powered solely by the FC is not economical. An FC-Bh power source, on the other hand, can provide the high energy density of the FC and the high power density of a battery. In this work we first describe the prototype FC-Bh system that we have built. Such a prototype helps to characterize the performance of a hybrid power source, and also helps explore new energy management strategies for embedded systems powered by hybrid sources. Next we describe a Matlab/Simulink-based FC-Bh system simulator which serves as an alternate experimental platform and enables quick evaluation of system-level control policies. Finally, we present an optimization framework that explicitly considers the characteristics of the FC-Bh system and is aimed at minimizing fuel consumption. This optimization framework is applied on top of a prediction-based DPM policy and is used to derive a new fuel-efficient DPM scheme. The proposed scheme demonstrates up to 32% system lifetime extension compared to a competing scheme when run on a real trace-based MPEG encoding example.
TL;DR: This article derives utilization bounds for several variants of global preemptive/nonpreemptive EDF scheduling, and compares the performance of different utilization bound tests.
Abstract: Field Programmable Gate Arrays (FPGAs) are very popular in today's embedded systems design, and Partial Runtime-Reconfigurable (PRTR) FPGAs allow HW tasks to be placed and removed dynamically at runtime. Hardware task scheduling on PRTR FPGAs brings many challenging issues to traditional real-time scheduling theory, which have not been adequately addressed by the research community compared to software task scheduling on CPUs. In this article, we consider the schedulability analysis problem of HW task scheduling on PRTR FPGAs. We derive utilization bounds for several variants of global preemptive/nonpreemptive EDF scheduling, and compare the performance of the different utilization bound tests.
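For a flavor of such tests, the classic Goossens-Funk-Baruah utilization bound for global preemptive EDF on m identical processors is easy to state in code. This is a standard bound for implicit-deadline task sets, not necessarily one of the article's PRTR-specific variants; the task utilizations below are invented.

```python
def gfb_edf_schedulable(utils, m):
    """Goossens-Funk-Baruah bound for global preemptive EDF on m
    identical processors (implicit-deadline tasks): schedulable if
        U_total <= m - (m - 1) * U_max."""
    u_total, u_max = sum(utils), max(utils)
    return u_total <= m - (m - 1) * u_max

# Three HW tasks with utilizations C_i / T_i on a 2-slot fabric:
print(gfb_edf_schedulable([0.5, 0.4, 0.3], m=2))   # 1.2 <= 1.5 -> True
print(gfb_edf_schedulable([0.9, 0.5, 0.4], m=2))   # 1.8 <= 1.1 -> False
```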
TL;DR: An efficient algorithm to construct a low-power zero-skew gated clock network is proposed, given the module locations and activity information, along with a recursive approach to compute the effective switched capacitance of a general gated and buffered clock network, accounting for both the clock tree's and the controller tree's switched capacitance.
Abstract: We propose an efficient algorithm to construct a low-power zero-skew gated clock network, given the module locations and activity information. Unlike previous works, we consider masking logic insertion and buffer insertion simultaneously, and guarantee to yield a zero-skew clock tree. Both the logical and physical information of the modules are carefully taken into consideration when determining where masking logic should be inserted. We also account for the power overhead of the control signals so that the total average power consumption of the constructed zero-skew gated clock network can be minimized. To this end, we present a recursive approach to compute the effective switched capacitance of a general gated and buffered clock network, accounting for both the clock tree's and the controller tree's switched capacitance. The power consumption of the gated clock networks constructed by our algorithm is 20% to 36% lower than that reported in the best previous work in the literature.
TL;DR: A theoretical framework is proposed that allows designers to quantify the performance improvement that is to be expected if they were to migrate from a fully synchronous design to the proposed multiple VFI design style.
Abstract: The increasing variability in manufacturing process parameters is expected to lead to significant performance degradation in deep submicron technologies. Multiple Voltage-Frequency Island (VFI) design styles with fine-grained, process-variation aware clocking have recently been shown to possess increased immunity to manufacturing process variations. In this article, we propose a theoretical framework that allows designers to quantify the performance improvement that is to be expected if they were to migrate from a fully synchronous design to the proposed multiple VFI design style. Specifically, we provide techniques to efficiently and accurately estimate the probability distribution of the execution rate (or throughput) of both single and multiple VFI systems under the influence of manufacturing process variations. Finally, using an MPEG-2 encoder benchmark, we demonstrate how the proposed analysis framework can be used by designers to make architectural decisions such as the granularity of VFI domain partitioning based on the throughput constraints their systems are required to satisfy.
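One way to picture the throughput analysis is a Monte Carlo experiment under an invented variation model (i.i.d. Gaussian stage frequencies, domains serving independent streams in parallel); the article instead derives the distributions analytically, so treat this only as intuition for why per-domain clocking helps under variation.

```python
import random, statistics

def throughput_dist(n_domains=4, stages=4, n=20_000, mu=1.0, sigma=0.08):
    """Monte Carlo sketch: a fully synchronous chip clocks every stage at
    the globally slowest one, while each VFI runs at the speed of its own
    slowest stage (hypothetical model, invented parameters)."""
    sync, vfi = [], []
    for _ in range(n):
        f = [[random.gauss(mu, sigma) for _ in range(stages)]
             for _ in range(n_domains)]
        sync.append(n_domains * min(min(d) for d in f))  # one global clock
        vfi.append(sum(min(d) for d in f))               # per-domain clocks
    return sync, vfi

sync, vfi = throughput_dist()
pct5 = lambda xs: sorted(xs)[len(xs) // 20]              # 5th percentile
print(f"mean: sync {statistics.mean(sync):.2f}  vfi {statistics.mean(vfi):.2f}")
print(f"5th pct: sync {pct5(sync):.2f}  vfi {pct5(vfi):.2f}")
```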
TL;DR: A computationally efficient technique for reducing interconnect active power in VLSI systems is presented, and the existence of a unique power-optimal wire order within a bundle is proven, and a method to construct this order is derived.
Abstract: A computationally efficient technique for reducing interconnect active power in VLSI systems is presented. Power reduction is accomplished by simultaneous wire spacing and net ordering, such that cross-capacitances between wires are optimally shared. The existence of a unique power-optimal wire order within a bundle is proven, and a method to construct this order is derived. The optimal order of wires depends only on the activity factors of the underlying signals; hence, it can be performed prior to spacing optimization. By using this order of wires, optimality of the combined solution is guaranteed (as compared with any other ordering and spacing of the wires). Timing-aware power optimization is enabled by simultaneously considering timing criticality weights and activity factors for the signals. The proposed algorithm has been applied to various interconnect layouts, including wire bundles from high-end microprocessor circuits in 65 nm technology. Interconnect power reduction of 17% on average has been observed in such bundles.
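The structure of the problem shows up even in a brute-force sketch: for a given wire order, the optimal spacing shares the fixed bundle width via a Lagrange condition, and one can then enumerate orders for a small bundle. The activity factors are invented; the article derives the unique optimal order directly rather than by enumeration.

```python
from itertools import permutations
from math import sqrt

def best_bundle_power(acts, total_space=1.0, shield_act=0.0):
    """Cross-coupling power of one wire order with optimally shared
    spacing. Power ~ sum over adjacent gaps of (a_i + a_j) / s_gap with
    the gaps summing to the fixed bundle width; Lagrange optimization
    gives s_gap ~ sqrt(a_i + a_j), so the minimum power is
    (sum of sqrt(a_i + a_j))^2 / total_space."""
    w = [shield_act] + list(acts) + [shield_act]   # non-switching walls
    return sum(sqrt(w[i] + w[i + 1]) for i in range(len(w) - 1)) ** 2 \
        / total_space

acts = [0.1, 0.9, 0.4, 0.6, 0.2]                   # invented activities
order = min(permutations(acts), key=best_bundle_power)
print(order, round(best_bundle_power(order), 3))
```

Note that at uniform spacing every order costs the same; the order matters precisely because spacing can be redistributed, which is why ordering and spacing are optimized jointly.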
TL;DR: This paper proposes an efficient transformation-based approach to construct a timing-driven octilinear Steiner tree, based on the computation of the octilinear distance and the concepts of Steiner-point reassignment and path reconstruction in an octilinear routing model.
Abstract: It is well known that the problem of constructing a timing-driven rectilinear Steiner tree for any signal net is important in performance-driven designs and has been extensively studied. Until now, many efficient approaches have been proposed for the construction of a timing-driven rectilinear Steiner tree. As the technology process advances, +45° and −45° diagonal segments can be permitted in an octilinear routing model. To our knowledge, no approach has been proposed to construct a timing-driven octilinear Steiner tree (TOST) for a signal net. In this paper, given a rectilinear Steiner tree (RST) for a signal net, we propose an efficient transformation-based approach to construct a timing-driven octilinear Steiner tree based on the computation of the octilinear distance and the concepts of Steiner-point reassignment and path reconstruction in an octilinear routing model. The experimental results show that our proposed transformation-based approach can use reasonable CPU time to construct a TOST, and a 10%--18% improvement in timing delay and a 5%--14% improvement in total wire length over the original RSTs are obtained in the construction of TOSTs for the tested signal nets.
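The octilinear distance mentioned above has a simple closed form: cover min(|dx|, |dy|) of the span with diagonal segments of length sqrt(2) each, then finish the remainder rectilinearly. A short sketch comparing it to the rectilinear (Manhattan) distance:

```python
from math import sqrt

def octilinear_dist(p, q):
    """Shortest wire length between two pins when +/-45-degree segments
    are allowed alongside horizontal/vertical ones."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    return sqrt(2) * min(dx, dy) + abs(dx - dy)

def rectilinear_dist(p, q):
    """Manhattan distance: horizontal/vertical segments only."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

a, b = (0, 0), (3, 5)
print(rectilinear_dist(a, b))           # 8
print(round(octilinear_dist(a, b), 3))  # 3*sqrt(2) + 2 ~= 6.243
```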
TL;DR: A circuit-switched interconnection architecture is proposed which uses crossroad switches to construct dedicated channels dynamically between any pair of cores in application-specific SoCs of moderate scale; its power consumption can be further reduced by approximately 25% by applying a partially dedicated path mechanism.
Abstract: As the number of cores on a chip increases, the power consumed by the communication structures takes a significant portion of the overall power budget. In this article, we first propose a circuit-switched interconnection architecture which uses crossroad switches to construct dedicated channels dynamically between any pair of cores in application-specific SoCs of moderate scale. The structure of the crossroad switch is simple, and it can be regarded as a NoC-lite router; we can easily construct a low-power on-chip network with these switches using a system-level design methodology. We also present the design methodology to tailor the proposed interconnection architecture to low-power structures via two proposed optimization schemes based on profiled communication characteristics. The first scheme is power-aware topology construction, which can build low-power application-specific interconnection topologies. To further reduce the power consumption, we propose a second optimization scheme that predetermines the operating mode of the dual-mode switches in the NoC at runtime. We evaluate several interconnection techniques, and the results show that the proposed architecture achieves lower power and higher performance than the others under certain constraints and scale boundaries. We take multimedia applications as case studies, and the experimental results show that power-aware topology construction saves approximately 49% of the interconnection-architecture power. The power consumption can be further reduced by approximately 25% by applying the partially dedicated path mechanism.
TL;DR: A method is described which makes it possible to evolve large synthetic RTL benchmark circuits with a predefined structure and testability, and a new collection of synthetic benchmark circuits was developed.
Abstract: This article presents a new real-world application of evolutionary computing in the area of digital circuit testing. A method is described which makes it possible to evolve large synthetic RTL benchmark circuits with a predefined structure and testability. Using the proposed method, a new collection of synthetic benchmark circuits was developed. These benchmark circuits will be useful in the validation of novel algorithms and tools in the area of digital circuit testing. The evolved benchmark circuits currently represent the most complex benchmark circuits with a known level of testability. Furthermore, these circuits are the largest that have ever been designed by means of evolutionary algorithms. This work also investigates suitable parameters of the evolutionary algorithm for this problem and explores the limits on the complexity of evolved circuits.
TL;DR: A dense logic design for matching multiple regular expressions with a field-programmable gate array (FPGA) at 10+ Gbps is presented; it leverages design techniques that enforce the shortest critical path on most FPGA architectures while optimizing the circuit size.
Abstract: This article presents a dense logic design for matching multiple regular expressions with a field-programmable gate array (FPGA) at 10+ Gbps. It leverages design techniques that enforce the shortest critical path on most FPGA architectures while optimizing the circuit size. The architecture is capable of supporting a maximum throughput of 12.90 Gbps on a Xilinx Virtex 4 LX200, and its performance scales linearly with size. Additionally, this article presents techniques for parsing data streams to provide semantic information for patterns found within a data stream. We illustrate how a content-based router can be implemented with our parsing techniques using an XML parser as an example. The content-based router presented was designed, implemented, and tested in a Xilinx Virtex XCV2000E FPGA on the FPX platform. It is capable of processing 32 bits of data per clock cycle and runs at 100 MHz. This allows the system to process and route XML messages at 3.2 Gbps.
TL;DR: A layout-based scan chain ordering method is proposed to improve fault coverage for LOS test with limited routing overhead; a fast and effective algorithm is used to eliminate conflicts in test vectors while restricting the extra scan chain routing.
Abstract: Launch-off-shift (LOS) is a popular delay test technique for scan-based designs. However, it is usually not possible to achieve good delay fault coverage in LOS test due to conflicts in test vectors. In this article, we propose a layout-based scan chain ordering method to improve fault coverage for LOS test with limited routing overhead. A fast and effective algorithm is used to eliminate conflicts in test vectors while restricting the extra scan chain routing. This approach provides two advantages. (1) The proposed method can improve delay fault coverage for LOS test. (2) With layout information taken into account, the routing penalty is limited, and thus the impact on circuit performance is not significant. Experimental results show that the proposed LOS test method achieves about the same level of delay fault coverage as enhanced scan does, while the average scan chain wire length is about 2.2 times that of the shortest scan chain.