There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality.

Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

I and i

On robust estimation of the location parameter

Aspect] is a simple and practical aspect-oriented extension to Java With just a few new constructs, AspectJ provides support for modular implementation of a range of crosscutting concerns. In AspectJ's dynamic join point model, join points are well-defined points in the execution of the program; pointcuts are collections of join points; advice are special method-like constructs that can be attached to pointcuts; and aspects are modular units of crosscutting implementation, comprising pointcuts, advice, and ordinary Java member declarations. AspectJ code is compiled into standard Java bytecode. Simple extensions to existing Java development environments make it possible to browse the crosscutting structure of aspects in the same kind of way as one browses the inheritance structure of classes. Several examples show that AspectJ is powerful, and that programs written using it are easy to understand.

An overview of AspectJ

Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic codesigns, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the tradeoffs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.

Efficient Processing of Deep Neural Networks: A Tutorial and Survey

The objective of this paper is to give a comprehensive introduction to applied cryptography with an engineer or computer scientist in mind. The emphasis is on the knowledge needed to create practical systems which supports integrity, confidentiality, or authenticity. Topics covered includes an introduction to the concepts in cryptography, attacks against cryptographic systems, key use and handling, random bit generation, encryption modes, and message authentication codes. Recommendations on algorithms and further reading is given in the end of the paper. This paper should make the reader able to build, understand and evaluate system descriptions and designs based on the cryptographic components described in the paper.

[서평]「Applied Cryptography」

This article demonstrates an approach for combining general tuning techniques with the POWER8 hardware architecture through optimizing three representative stencil benchmarks. Two typical real-world applications, with kernels similar to those of the winning programs of the Gordon Bell Prize 2016 and 2017, are employed to illustrate algorithm modifications and a combination of hardware-oriented tuning strategies with the application algorithms. This work fills the gap between hardware capability and software performance of the POWER8 processor, and provides useful guidance for optimizing stencil-based scientific applications on POWER systems.

https://dl.acm.org/doi/pdf/10.1145/3264422

Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor

This paper introduces ADAM, an approach for merging multiple FPGA designs into a single hardware design, so that multiple place-and-route tasks can be replaced by a single task to speed up functional evaluation of designs, especially during the development process. ADAM has three key elements. First, a novel approximate maximum common subgraph detection algorithm with linear time complexity to maximize sharing of resources in the merged design. Second, a prototype tool implementing this common subgraph detection algorithm for dataflow graphs derived from Verilog designs; this tool would also generate the appropriate control circuits to enable selection of the original designs at runtime. Third, a comprehensive analysis of compilation time versus degree of similarity to identify the optimized user parameters for the proposed approach. Experimental results show that ADAM can reduce compilation time by around 5 times when each design is 95% similar to the others, and the compilation time is reduced from 1 hour to 10 minutes in the case of binomial filters.

/pdf/adam-automated-design-analysis-and-merging-for-speeding-up-4ahjm121o5.pdf

ADAM: Automated Design Analysis and Merging for Speeding up FPGA Development

This paper presents the Forward Financial Framework (F3), an application framework for describing and implementing forward looking financial computations on high performance, heterogeneous platforms. F3 allows the computational finance problem specification to be captured precisely yet succinctly, then automatically creates efficient implementations for heterogeneous platforms, utilising both multi-core CPUs and FPGAs. The automatic mapping of a high-level problem description to a low-level heterogeneous implementation is possible due to the domain-specific knowledge which is built in F3, along with a software architecture that allows for additional domain knowledge and rules to be added to the framework. Currently the system is able to utilise domain-knowledge of the run-time characteristics of pricing tasks to partition pricing problems and allocate them to appropriate compute resources, and to exploit relationships between financial instruments to balance computation against communication. The versatility of the framework is demonstrated using a benchmark of option pricing problems, where F3 achieves comparable speed and energy efficiency to external manual implementations. Further, the domain-knowledge guided partitioning scheme suggests a partitioning of subtasks that is 13% faster than the average, while exploiting domain dependencies to reduce redundant computations results in an average gain in efficiency of 27%.

A Heterogeneous Computing Framework for Computational Finance

Genetic Algorithms (GAs) are a class of numerical and combinatorial optimisers which are especially useful for solving complex non-linear and non-convex problems. However, the required execution time often limits their application to small-scale or latency-insensitive problems, so techniques to increase the computational efficiency of GAs are needed. FPGA-based acceleration has significant potential for speeding up genetic algorithms, but existing FPGA GAs are limited by the generational approaches inherited from software GAs. Many parts of the generational approach do not map well to hardware, such as the large shared population memory and intrinsic loop-carried dependency. To address this problem, this paper proposes a new hardware-oriented approach to GAs, called Pipelined Genetic Propagation (PGP), which is intrinsically distributed and pipelined. PGP represents a GA solver as a graph of loosely coupled genetic operators, which allows the solution to be scaled to the available resources, and also to dynamically change topology at run-time to explore different solution strategies. Experiments show that pipelined genetic propagation is effective in solving seven different applications. Our PGP design is 5 times faster than a recent FPGA-based GA system, and 90 times faster than a CPU-based GA system.

/pdf/pipelined-genetic-propagation-59q5pr8202.pdf

Pipelined Genetic Propagation

Residue number system (RNS), which originates from the Chinese remainder theorem, is regarded as a promising number representation in the domain of digital signal processing (DSP). This paper describes our work on optimizing residue arithmetic units on the platform of reconfigurable devices, such as FPGAs. First, we provide improved designs for residue arithmetic units. For reverse converters from RNS to binary numbers, we propose a novel design that uses only n-bit additions. Compared to previous work, the design consumes up to 14.3% less area and provides lower latency. Second, we develop a reconfigurable RNS arithmetic library generator for the moduli set {2n-1, 2n, 2n+1}. The generator supports a wide range of RNS numbers, and enables us to perform an extensive comparison between RNS and other number representations at both the arithmetic unit level and the application level. The comparison shows that, for applications involving a large number of multiplications, the RNS designs can reduce up to 1/2 DSP48s for large bit-width settings.

/pdf/optimizing-residue-arithmetic-on-fpgas-4ndf41gamc.pdf

Wayne Luk

Papers

Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor

ADAM: Automated Design Analysis and Merging for Speeding up FPGA Development

A Heterogeneous Computing Framework for Computational Finance

Pipelined Genetic Propagation

Optimizing residue arithmetic on FPGAs