Developing high-performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g. multi-core CPUs, GPUs, and FPGAs), and their associated vendor-optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To help developers determine which embedded platform is most suitable for their application, we conduct a comprehensive benchmark of the run-time performance and energy efficiency of a wide range of vision kernels. We discuss rationales for why a given underlying hardware architecture innately performs well or poorly based on the characteristics of a range of vision kernel categories. Specifically, our study is performed for three commonly used HW accelerators for embedded vision applications: ARM57 CPU, Jetson TX2 GPU and ZCU102 FPGA, using their vendor-optimized vision libraries: OpenCV, VisionWorks and xfOpenCV. Our results show that the GPU achieves an energy/frame reduction ratio of 1.1–3.2× compared to the others for simple kernels. For more complicated kernels and complete vision pipelines, however, the FPGA outperforms the others with energy/frame reduction ratios of 1.2–22.3×. We also observe that the FPGA performs increasingly better as a vision application's pipeline complexity grows.
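The energy/frame reduction ratios quoted above can be read as a ratio of per-frame energies. The following is a minimal Python sketch, assuming that energy per frame is average power divided by frame rate; the abstract does not state the measurement formula, and the function names and the power and frame-rate figures are hypothetical, not measurements from the study.

```python
def energy_per_frame(avg_power_watts, frames_per_second):
    """Energy in joules spent per processed frame (assumed: power / frame rate)."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return avg_power_watts / frames_per_second


def reduction_ratio(baseline_energy, candidate_energy):
    """How many times less energy per frame the candidate needs than the baseline."""
    if candidate_energy <= 0:
        raise ValueError("candidate_energy must be positive")
    return baseline_energy / candidate_energy


if __name__ == "__main__":
    # Hypothetical figures, chosen only to illustrate the arithmetic.
    platform_a = energy_per_frame(avg_power_watts=8.0, frames_per_second=20.0)
    platform_b = energy_per_frame(avg_power_watts=5.0, frames_per_second=40.0)
    print("energy/frame reduction ratio:", reduction_ratio(platform_a, platform_b))
```

Under this reading, a platform can win on energy per frame either by drawing less power or by finishing each frame faster, which is why the ranking may differ between simple kernels and full pipelines.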

https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1219&context=ece_pubs

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

We present and analyze a quantum algorithm to estimate credit risk more efficiently than Monte Carlo simulations can do on classical computers. More precisely, we estimate the economic capital requirement, i.e. the difference between the Value at Risk and the expected value of a given loss distribution. The economic capital requirement is an important risk metric because it summarizes the amount of capital required to remain solvent at a given confidence level. We implement this problem for a realistic loss distribution and analyze its scaling to a realistic problem size. In particular, we provide estimates of the total number of required qubits, the expected circuit depth, and how this translates into an expected runtime under reasonable assumptions on future fault-tolerant quantum hardware.
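For reference, the classical Monte Carlo baseline mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's quantum algorithm: it assumes a hypothetical portfolio of independent defaults, and all names and parameters are invented for the example. It computes the economic capital as the Value at Risk at confidence level alpha minus the expected loss.

```python
import math
import random


def simulate_losses(default_probs, loss_given_default, n_samples, seed=0):
    """Sample total portfolio losses, assuming independent defaults (a simplification)."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n_samples):
        total = 0.0
        for p, lgd in zip(default_probs, loss_given_default):
            if rng.random() < p:  # asset defaults with probability p
                total += lgd
        losses.append(total)
    return losses


def value_at_risk(losses, alpha):
    """Empirical quantile: smallest sampled loss x with P(L <= x) >= alpha."""
    ordered = sorted(losses)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


def economic_capital(losses, alpha):
    """Value at Risk at level alpha minus the expected (mean) loss."""
    expected_loss = sum(losses) / len(losses)
    return value_at_risk(losses, alpha) - expected_loss


if __name__ == "__main__":
    # Hypothetical three-asset portfolio.
    samples = simulate_losses([0.05, 0.10, 0.02], [1.0, 2.0, 5.0], n_samples=10000)
    print("economic capital at 99%:", economic_capital(samples, alpha=0.99))
```

The statistical error of such an estimate shrinks only with the square root of the number of samples, which is the cost a quantum approach of this kind aims to reduce.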

Credit Risk Analysis Using Quantum Computers

Monte Carlo Methods and Models in Finance and Insurance

This paper presents MAXelerator, the first hardware accelerator for privacy-preserving machine learning (ML) on cloud servers. Cloud-based ML is being increasingly employed in various data-sensitive scenarios. While it enhances both the efficiency and the quality of the service, it also raises concerns about the privacy of the users' data. We create a practical privacy-preserving solution for matrix-based ML on cloud servers. We show that for the majority of ML applications, the privacy-sensitive computation boils down to either matrix multiplication, which is a repetition of Multiply-Accumulate (MAC), or the MAC itself. We design an FPGA architecture for privacy-preserving MAC to accelerate the ML computation based on the well-known Secure Function Evaluation protocol named Yao's Garbled Circuit. MAXelerator demonstrates up to 57× improvement in throughput per core compared to the fastest existing GC framework. We corroborate the effectiveness of the accelerator with real-world case studies in privacy-sensitive scenarios.
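The observation that matrix multiplication is a repetition of MAC can be made concrete with a short Python sketch. It operates on plain, unprotected integers and does not implement Yao's Garbled Circuit or any part of the MAXelerator design; the function names are chosen for illustration only.

```python
def mac(accumulator, a, b):
    """One Multiply-Accumulate step: accumulator + a * b."""
    return accumulator + a * b


def matmul(left, right):
    """Multiply two matrices (lists of rows) using nothing but repeated MAC steps."""
    inner = len(right)
    if any(len(row) != inner for row in left):
        raise ValueError("inner dimensions do not match")
    cols = len(right[0])
    result = [[0] * cols for _ in range(len(left))]
    for i, row in enumerate(left):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                # Each output element is a chain of `inner` MAC operations.
                acc = mac(acc, row[k], right[k][j])
            result[i][j] = acc
    return result


if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because every output element is an independent chain of MAC steps, a single protected MAC unit is enough to cover the whole product.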

https://dl.acm.org/doi/pdf/10.1145/3195970.3196074

MAXelerator: FPGA accelerator for privacy preserving multiply-accumulate (MAC) on cloud servers

The rapidly growing number of applications based on morphological operations in image processing and computer vision makes efficient implementations of these key blocks an important topic of research. Nevertheless, a detailed comparison of the energy efficiency and performance of these implementations that covers all available major hardware platforms is still missing. In this paper we evaluate in detail the performance and power consumption of the most efficient available morphological image processing algorithms for CPU, GPU, and FPGA platforms. In addition, we study the suitability of available morphological library units for high-level synthesis and compare the results with an optimized hand-coded FPGA implementation. We demonstrate that even high-end GPUs fall far short of the throughput of modern CPUs and FPGAs. Our experimental results show that an FPGA implementation is 8–10 times more energy efficient for this application, while being comparable in speed to CPUs for large kernels.

A quantitative cross-architecture study of morphological image processing on CPUs, GPUs, and FPGAs

In today's markets, high-speed and energy-efficient computations are mandatory in the financial and insurance industries. At the same time, the gradual convergence of high-performance computing with embedded systems is having a huge impact on design methodologies, where dedicated accelerators are implemented to increase performance and energy efficiency. This paper follows this trend and presents a novel way to price high-dimensional American options using techniques from the embedded community. The proposed architecture targets heterogeneous CPU/FPGA systems, and it exploits FPGA reconfiguration to deliver high throughput. With a bit-true algorithmic transformation based on recomputation, it is possible to eliminate the memory bottleneck and access costs. The result is a pricing system that is 16x faster and 268x more energy-efficient than an optimized Intel CPU implementation.

Reverse Longstaff-Schwartz American option pricing on hybrid CPU/FPGA systems

Risk analysis and management currently have a strong presence in financial institutions, where high performance and energy efficiency are key requirements for acceleration systems, especially when it comes to intraday analysis. In this regard, we approach the estimation of the widely employed portfolio risk metrics value-at-risk (VaR) and conditional value-at-risk (cVaR) by means of nested Monte Carlo (MC) simulations. We do so by combining theory and software/hardware implementation. This allows us, for the first time, to investigate their performance on heterogeneous compute systems and across different compute platforms, namely central processing unit (CPU), the Xeon Phi many integrated core (MIC) architecture, graphics processing unit (GPU), and field-programmable gate array (FPGA). To this end, the OpenCL framework is employed to generate portable code, and the size of the simulations is scaled in order to evaluate variations in performance. Furthermore, we assess different parallelization schemes, and the targeted platforms are evaluated and compared in terms of runtime and energy efficiency. Our implementation also allowed us to derive a new algorithmic optimization regarding the generation of the required random number sequences. Moreover, we provide specific guidelines on how to properly handle these sequences in portable code, and on how to efficiently implement nested MC-based VaR and cVaR simulations on heterogeneous compute systems.
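The structure of a nested MC estimate of VaR and cVaR can be sketched in plain Python. This is a toy, single-threaded illustration under stated assumptions, not the paper's OpenCL implementation: the portfolio is a single hypothetical European call option under Black-Scholes dynamics, the outer scenarios reuse the risk-neutral drift for simplicity, and all names and parameters are invented for the example.

```python
import math
import random


def inner_value(spot, strike, rate, vol, maturity, n_inner, rng):
    """Inner loop: Monte Carlo value of a European call for one outer scenario."""
    if maturity <= 0:
        return max(spot - strike, 0.0)
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    total = 0.0
    for _ in range(n_inner):
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        total += max(terminal - strike, 0.0)
    return math.exp(-rate * maturity) * total / n_inner


def nested_losses(spot, strike, rate, vol, maturity, horizon,
                  n_outer, n_inner, seed=0):
    """Outer loop: simulate the spot at the risk horizon, then revalue via the inner loop."""
    rng = random.Random(seed)
    value_today = inner_value(spot, strike, rate, vol, maturity, n_inner, rng)
    losses = []
    for _ in range(n_outer):
        shock = rng.gauss(0.0, 1.0)
        spot_h = spot * math.exp((rate - 0.5 * vol * vol) * horizon
                                 + vol * math.sqrt(horizon) * shock)
        value_h = inner_value(spot_h, strike, rate, vol,
                              maturity - horizon, n_inner, rng)
        losses.append(value_today - value_h)  # positive number = loss
    return losses


def var_and_cvar(losses, alpha):
    """Empirical VaR (alpha-quantile) and cVaR (mean of losses at or beyond the VaR)."""
    ordered = sorted(losses)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    tail = ordered[index:]
    return ordered[index], sum(tail) / len(tail)


if __name__ == "__main__":
    sample = nested_losses(100.0, 100.0, 0.01, 0.2, 1.0, 10 / 250,
                           n_outer=200, n_inner=200)
    print("VaR, cVaR at 95%:", var_and_cvar(sample, 0.95))
```

The total work is the product of the outer and inner sample counts, and every outer scenario is independent of the others, which is what makes the problem both expensive and well suited to parallel accelerators.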

/pdf/nested-mc-based-risk-measurement-of-complex-portfolios-4juwmgq9kr.pdf

Nested MC-Based Risk Measurement of Complex Portfolios: Acceleration and Energy Efficiency

High-performance computing systems are in high demand in financial risk analysis and management, which has become a key process at the core of financial institutions. Risk metrics such as portfolio value-at-risk and expected shortfall are computed on a daily basis. In this regard, the most general and flexible approach is a nested Monte Carlo simulation, which is very compute-intensive. In this work, we exploit the Open Computing Language (OpenCL) to efficiently map the nested simulation of complex portfolios with multiple algorithms onto heterogeneous computing systems, maximizing system-level performance. The code portability and individual customizations allow us to profile the kernels on different accelerating platforms, such as CPU, Intel's Xeon Phi, and GPU. The combination of OpenCL, a new bit-accurate algorithmic optimization, and the extension of an existing numerical scheme using interpolation allows us to achieve over 1000x speedup compared to the state-of-the-art approach, making this approach feasible even for intraday risk analysis. Our proposed system design also minimizes costly host-device data transfers and the required device global memory, enabling complex portfolios to be easily scaled.

Near Real-Time Risk Simulation of Complex Portfolios on Heterogeneous Computing Systems with OpenCL

In the field of high-performance heterogeneous computing systems, field programmable gate arrays (FPGAs) have shown great advantages in terms of acceleration and energy efficiency. With the inclusion of the OpenCL framework for parallel programming, the design complexity has also been greatly reduced. However, the parallel implementation of applications containing data-dependent branches usually suffers a significant loss in performance, which affects all platforms alike. This data dependency causes the execution of parallel threads, also called work-items in OpenCL, to diverge. Whereas fixed architectures like CPU, GPU and Xeon Phi cannot efficiently cope with this divergent execution, the flexibility offered by FPGAs in terms of architecture can be exploited to tackle this problem. In this work, we present a new approach for FPGA implementations that decouples the parallel OpenCL work-items, avoiding the interference of data-dependent branches between them. We also demonstrate the necessary workarounds to obtain the maximum performance in a pipelined design when unpredictable for-loop exit conditions are caused by the data dependency. Furthermore, we show how to efficiently interleave computation with transfers to device global memory in each work-item. This approach is then evaluated with a real-life case study from finance, with four different configurations implemented on FPGA with Xilinx SDAccel, and compared to the optimized implementation on CPU, GPU, and Xeon Phi. Our results show that FPGAs can deliver up to 5.5x speedup, whereas the system-level energy efficiency increases between 2x and 9.5x in all cases.

Javier Alejandro Varela

Papers

A quantitative cross-architecture study of morphological image processing on CPUs, GPUs, and FPGAs

Reverse Longstaff-Schwartz American option pricing on hybrid CPU/FPGA systems

Nested MC-Based Risk Measurement of Complex Portfolios: Acceleration and Energy Efficiency

Near Real-Time Risk Simulation of Complex Portfolios on Heterogeneous Computing Systems with OpenCL

Exploiting Decoupled OpenCL Work-Items with Data Dependencies on FPGAs: A Case Study