Topic

SIMD

About: SIMD is a research topic. Over the lifetime, 5842 publications have been published within this topic receiving 107472 citations.

...read moreread less

Papers published on a yearly basis

Papers

PDF

Open Access

More filters

Journal Article•DOI•

The Design and Implementation of FFTW3

[...]

Matteo Frigo¹, Steven G. Johnson²•Institutions (2)

IBM¹, Massachusetts Institute of Technology²

24 Jan 2005

TL;DR: It is shown that such an approach can yield an implementation of the discrete Fourier transform that is competitive with hand-optimized libraries, and the software structure that makes the current FFTW3 version flexible and adaptive is described.

...read moreread less

Abstract: FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.

...read moreread less

5,172 citations

Proceedings Article•DOI•

MediaBench: a tool for evaluating and synthesizing multimedia and communications systems

[...]

Chunho Lee¹, Miodrag Potkonjak¹, William H. Mangione-Smith¹•Institutions (1)

University of California, Los Angeles¹

01 Dec 1997

TL;DR: The MediaBench benchmark suite as discussed by the authors is a benchmark suite that has been designed to fill the gap between the compiler community and embedded applications developers, which has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement, and integration with system synthesis algorithms to establish usefulness.

...read moreread less

Abstract: Significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose computing, and more specifically the SPEC benchmark suite. At the same time, a number of microprocessor architectures have emerged which have VLIW and SIMD structures that are well matched to the needs of the ILP compilers. Most of these processors are targeted at embedded applications such as multimedia and communications, rather than general-purpose systems. Conventional wisdom, and a history of hand optimization of inner-loops, suggests that ILP compilation techniques are well suited to these applications. Unfortunately, there currently exists a gap between the compiler community and embedded applications developers. This paper presents MediaBench, a benchmark suite that has been designed to fill this gap. This suite has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement to establish uniqueness, and integration with system synthesis algorithms to establish usefulness.

...read moreread less

2,254 citations

Journal Article•DOI•

Some Computer Organizations and Their Effectiveness

[...]

Michael J. Flynn¹•Institutions (1)

Johns Hopkins University¹

01 Sep 1972-IEEE Transactions on Computers

TL;DR: A hierarchical model of computer organizations is developed, based on a tree model using request/service type resources as nodes, which indicates that saturation develops when the fraction of task time spent locked out approaches 1/n, where n is the number of processors.

...read moreread less

Abstract: A hierarchical model of computer organizations is developed, based on a tree model using request/service type resources as nodes. Two aspects of the model are distinguished: logical and physical. General parallel- or multiple-stream organizations are examined as to type and effectiveness?especially regarding intrinsic logical difficulties. The overlapped simplex processor (SISD) is limited by data dependencies. Branching has a particularly degenerative effect. The parallel processors [single-instruction stream-multiple-data stream (SIMD)] are analyzed. In particular, a nesting type explanation is offered for Minsky's conjecture?the performance of a parallel processor increases as log M instead of M (the number of data stream processors). Multiprocessors (MIMD) are subjected to a saturation syndrome based on general communications lockout. Simplified queuing models indicate that saturation develops when the fraction of task time spent locked out (L/E) approaches 1/n, where n is the number of processors. Resources sharing in multiprocessors can be used to avoid several other classic organizational problems.

...read moreread less

1,982 citations

Journal Article•DOI•

NVIDIA Tesla: A Unified Graphics and Computing Architecture

[...]

Erik Lindholm¹, John R. Nickolls¹, S. Oberman¹, John S. Montrym¹•Institutions (1)

Nvidia¹

01 Mar 2008-IEEE Micro

TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.

...read moreread less

Abstract: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

...read moreread less

1,492 citations

Journal Article•DOI•

Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks

[...]

Yu-Hsin Chen¹, Joel Emer², Vivienne Sze¹•Institutions (2)

Massachusetts Institute of Technology¹, Nvidia²

18 Jun 2016

TL;DR: A novel dataflow, called row-stationary (RS), is presented, that minimizes data movement energy consumption on a spatial architecture and can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine local storage, direct inter-PE communication and spatial parallelism.

...read moreread less

Abstract: Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy.In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.

...read moreread less

1,126 citations

Collapse

Network Information

Performance

Metrics

6,095

Papers

115,259

Citations

No. of papers in the topic in previous years
Year	Papers
2023	59
2022	194
2021	121
2020	173
2019	167
2018	202

SIMD

Papers published on a yearly basis

Papers

Trending Questions (10)

Network Information

Related Topics (5)

Performance

Metrics