The organization of behavior: A neuropsychological theory

Multiple, often conflicting objectives arise naturally in most real-world optimization scenarios. As evolutionary algorithms possess several characteristics due to which they are well suited to this type of problem, evolution-based methods have been used for multiobjective optimization for more than a decade. Meanwhile evolutionary multiobjective optimization has become established as a separate subdiscipline combining the fields of evolutionary computation and classical multiple criteria decision making. In this paper, the basic principles of evolutionary multiobjective optimization are discussed from an algorithm design perspective. The focus is on the major issues such as fitness assignment, diversity preservation, and elitism in general rather than on particular algorithms. Different techniques to implement these strongly related concepts will be discussed, and further important aspects such as constraint handling and preference articulation are treated as well. Finally, two applications will presented and some recent trends in the field will be outlined.

Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications

Modern embedded computing systems tend to be heterogeneous in the sense of being composed of subsystems with very different characteristics, which communicate and interact in a variety of ways-synchronous or asynchronous, buffered or unbuffered, etc. Obviously, when designing such systems, a modeling language needs to reflect this heterogeneity. Today's modeling environments usually offer a variant of what we call amorphous heterogeneity to address this problem. This paper argues that modeling systems in this manner leads to unexpected and hard-to-analyze interactions between the communication mechanisms and proposes a more structured approach to heterogeneity, called hierarchical heterogeneity, to solve this problem. It proposes a model structure and semantic framework that support this form of heterogeneity, and discusses the issues arising from heterogeneous component interaction and the desire for component reuse. It introduces the notion of domain polymorphism as a way to address these issues.

/pdf/taming-heterogeneity-the-ptolemy-approach-v4qpoppcbo.pdf

Taming heterogeneity - the Ptolemy approach

Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy.In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.

Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks

To alleviate the complex communication problems that arise as the number of on-chip components increases, network-on-chip (NoC) architectures have been recently proposed to replace global interconnects. In this paper, we first provide a general description of NoC architectures and applications. Then, we enumerate several related research problems organized under five main categories: Application characterization, communication paradigm, communication infrastructure, analysis, and solution evaluation. Motivation, problem description, proposed approaches, and open issues are discussed for each problem from system, microarchitecture, and circuit perspectives. Finally, we address the interactions among these research problems and put the NoC design process into perspective.

Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives

The coarse-grained reconfigurable architectures have advantages over the traditional FPGAs in terms of delay, area and configuration time. To execute entire applications, most of them combine an instruction set processor(ISP) and a reconfigurable matrix. However, not much attention is paid to the integration of these two parts, which results in high communication overhead and programming difficulty. To address this problem, we propose a novel architecture with tightly coupled very long instruction word (VLIW) processor and coarse-grained reconfigurable matrix. The advantages include simplified programming model, shared resource costs, and reduced communication overhead. To exploit this architecture, our previously developed compiler framework is adapted to the new architecture. The results show that the new architecture has good performance and is very compiler-friendly.

ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix

We present cycle-static dataflow (CSDF), which is a new model for the specification and implementation of digital signal processing algorithms. The CSDF paradigm is an extension of synchronous dataflow that still allows for static scheduling and, thus, a very efficient implementation of an application. In comparison with synchronous dataflow, it is more versatile because it also supports algorithms with a cyclically changing, but predefined, behavior. Our examples show that this capability results in a higher degree of parallelism and, hence, a higher throughput, shorter delays, and less buffer memory. Moreover, they indicate that CSDF is essential for modelling prescheduled components, like application-specific integrated circuits. Besides introducing the CSDF paradigm, we also derive necessary and sufficient conditions for the schedulability of a CSDF graph. We present and compare two methods for checking the liveness of a graph. The first one checks the liveness of loops, and the second one constructs a single-processor schedule for one iteration of the graph. Once the schedulability is tested, a makespan optimal schedule on a multiprocessor can be constructed. We also introduce the heuristic scheduling method of our graphical rapid prototyping environment (GRAPE).

Cycle-static dataflow

Coarse-grained reconfigurable architectures have become increasingly important in recent years. Automatic design or compilation tools are essential to their success. In this paper, we present a modulo scheduling algorithm to exploit loop-level parallelism for coarse-grained reconfigurable architectures. This algorithm is a key part of our dynamically reconfigurable embedded systems compiler (DRESC). It is capable of solving placement, scheduling and routing of operations simultaneously in a modulo-constrained 3D space and uses an abstract architecture representation to model a wide class of coarse-grained architectures. The experimental results show high performance and efficient resource utilization on tested kernels.

/pdf/exploiting-loop-level-parallelism-on-coarse-grained-w2pj8m6rjh.pdf

Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling

Coarse-grained reconfigurable architectures have become increasingly important in recent years. Automatic design or compiling tools are essential to their success. In this paper, we present a retargetable compiler for a family of coarse-grained reconfigurable architectures. Several key issues are addressed. Program analysis and transformation prepare dataflow for scheduling. Architecture abstraction generates an internal graph representation from a concrete architecture description. A modulo scheduling algorithm is key to exploit parallelism and achieve high performance. The experimental results show up to 28.7 instructions per cycle (IPC) over tested kernels.

DRESC: a retargetable compiler for coarse-grained reconfigurable architectures

Multimedia support appears on embedded platforms, such as WAP for mobile phones. However, true multimedia applications require both the computation power that only dedicated hardware can provide and the flexibility of software implementations. To this end, we are investigating reconfigurable architectures, composed of an instruction-set processor running software processes and coupled to an FPGA on which hardware tasks are spawned by dynamic partial reconfiguration. This paper focuses on two main aspects. It explains how separating communication from computation enables hardware multi-tasking and it describes our implementation of a fixed communication-layer that decouples the computation elements, allowing them to be dynamically reconfigured. This communication layer is an interconnection network, implemented on a Virtex FPGA, allowing fast synchronous communication between hardware tasks implemented on the same matrix. The network is a 2D torus and uses wormhole routing. It achieves transfer rates up to 77.6 MB/s between two adjacent routers, when clocked at 40 MHz. Interconnection networks on FPGAs allow fine-grain dynamic partial reconfiguration and make hardware multi-tasking a reality.

/pdf/interconnection-networks-enable-fine-grain-dynamic-multi-4kma64eejl.pdf

Rudy Lauwereins

Papers

ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix

Cycle-static dataflow

Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling

DRESC: a retargetable compiler for coarse-grained reconfigurable architectures

Interconnection Networks Enable Fine-Grain Dynamic Multi-tasking on FPGAs