Topic

FLOPS

About: FLOPS is a research topic. Over its lifetime, 259 publications have been published within this topic, receiving 4,315 citations. The topic is also known as Floating Point Operations Per Second.


Papers
Book Chapter DOI
08 Sep 2018
TL;DR: ShuffleNet V2 proposes evaluating the direct metric (speed) on the target platform rather than relying only on FLOPs; from a series of controlled experiments it derives several practical guidelines for efficient network design.
Abstract: Currently, neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g., speed, also depends on other factors such as memory access cost and platform characteristics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. Comprehensive ablation experiments verify that our model is state of the art in terms of the speed-accuracy tradeoff.
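The FLOPs-versus-speed distinction the abstract draws can be made concrete with a small measurement. The sketch below is not from the paper; the layer sizes are arbitrary and it assumes PyTorch is available. It counts the analytic FLOPs of a single convolution and times the same layer, showing why the two metrics need not agree across platforms.

```python
# Minimal sketch: analytic FLOPs (indirect metric) vs. measured latency
# (direct metric) for one convolution. Shapes are illustrative only.
import time
import torch
import torch.nn as nn

def conv_flops(cin, cout, k, h_out, w_out):
    # Standard convolution: one multiply-accumulate counted as 2 FLOPs.
    return 2 * cin * cout * k * k * h_out * w_out

layer = nn.Conv2d(128, 128, kernel_size=3, padding=1)
x = torch.randn(1, 128, 56, 56)

with torch.no_grad():
    layer(x)                              # warm-up
    t0 = time.perf_counter()
    for _ in range(100):
        layer(x)
    latency_ms = (time.perf_counter() - t0) / 100 * 1e3

print(f"analytic FLOPs:   {conv_flops(128, 128, 3, 56, 56):,}")
print(f"measured latency: {latency_ms:.2f} ms (platform-dependent)")
```

The FLOP count is fixed by the layer shape, while the measured latency varies with memory access cost and platform characteristics, which is exactly the gap the paper's guidelines address.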

3,393 citations

Proceedings Article DOI
11 Jun 1998
TL;DR: Describes a family of semi-dynamic and dynamic edge-triggered flip-flops, intended for use with static and dynamic circuits respectively, that are used in the UltraSPARC-III microprocessor.
Abstract: Describes a family of semi-dynamic and dynamic edge-triggered flip-flops to be used with static and dynamic circuits, respectively. The flip-flops provide both short latency and the capability of incorporating logic functions with minimum delay penalty, properties which make them very attractive for high-performance microprocessor design. The circuits described are used in the UltraSPARC-III microprocessor.

192 citations

Journal Article DOI
G. F. Grohoski
TL;DR: The IBM RISC System/6000 processor is a second-generation RISC processor which reduces the execution pipeline penalties caused by branch instructions and also provides high floating-point performance.
Abstract: The IBM RISC System/6000 processor is a second-generation RISC processor which reduces the execution pipeline penalties caused by branch instructions and also provides high floating-point performance. It employs multiple functional units which operate concurrently to maximize the instruction execution rate. By employing these advanced machine-organization techniques, it can execute up to four instructions simultaneously. Approximately 11 MFLOPS are achieved on the LINPACK benchmarks.
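As a rough illustration of where a MFLOPS figure like this comes from, the sketch below uses the standard LINPACK operation count for solving a dense n x n system, about 2/3 n^3 + 2 n^2 floating-point operations. The problem size and solve time in the example are hypothetical, not measurements from the paper.

```python
# Sketch: deriving a MFLOPS rating from a LINPACK-style solve.
# flops ~= 2/3 * n^3 + 2 * n^2 for a dense n x n system;
# MFLOPS = flops / (seconds * 1e6).
def linpack_mflops(n, seconds):
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / (seconds * 1e6)

# Illustrative numbers only: a 100x100 system solved in ~0.0625 s
# works out to roughly 11 MFLOPS.
print(f"{linpack_mflops(100, 0.0625):.1f} MFLOPS")
```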

149 citations

Journal Article DOI
TL;DR: In this paper, a flop of a pair (X, B) is defined as a flip of a pair (X, B') which is crepant for K_X + B.
Abstract: A result by Birkar-Cascini-Hacon-McKernan, together with the boundedness of the length of extremal rays, implies that different minimal models can be connected by a sequence of flops. A flop of a pair (X, B) is a flip of a pair (X, B') which is crepant for K_X + B.
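For readers unfamiliar with the terminology, the crepancy condition behind this definition can be written out as follows; this is a sketch of the standard definition, not a quotation from the paper.

```latex
% Crepant condition (standard definition, stated for reference): for the
% birational map \phi : X \dashrightarrow X' and a common resolution
% p : W \to X, q : W \to X', with B' the strict transform of B,
p^*(K_X + B) = q^*(K_{X'} + B')
% i.e. the flip changes no discrepancies of K_X + B, which is what
% "crepant" means; a flop is a flip satisfying this condition.
```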

128 citations

Journal Article DOI
01 May 2002
TL;DR: Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads that fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, and achieves excellent "real-computation" per transistor and per watt ratios.
Abstract: Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6, 5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second-level cache with a peak bandwidth of sixty-four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) it provides high bandwidth for non-unit-stride memory accesses, (3) it supports gather/scatter instructions efficiently, (4) it fully integrates with the EV8 core through a narrow, streamlined interface, rather than acting as a co-processor, (5) it can achieve a peak of 104 operations per cycle, and (6) it achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter-intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.
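The per-cycle figures in the abstract turn into peak FLOP rates only once a clock frequency is fixed. The sketch below is a back-of-the-envelope illustration with a purely hypothetical clock; the ratios are taken from the abstract, but the GFLOPS number is not from the paper or its simulations.

```python
# Back-of-the-envelope sketch of the peak figures implied by the abstract.
VECTOR_FLOPS_PER_CYCLE = 32           # Tarantula vector unit, double precision
PEAK_FLOPS_RATIO_VS_EV8 = 8           # "peak speedup in terms of flops of 8X"
EV8_FLOPS_PER_CYCLE = VECTOR_FLOPS_PER_CYCLE / PEAK_FLOPS_RATIO_VS_EV8  # -> 4

clock_hz = 1.5e9                      # hypothetical clock, for illustration only
peak_gflops = VECTOR_FLOPS_PER_CYCLE * clock_hz / 1e9

print(f"EV8 core:  {EV8_FLOPS_PER_CYCLE:.0f} flops/cycle")
print(f"Tarantula: {VECTOR_FLOPS_PER_CYCLE} flops/cycle, "
      f"about {peak_gflops:.0f} GFLOPS at {clock_hz / 1e9:.1f} GHz")
```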

112 citations


Network Information
Related Topics (5)
Matrix (mathematics): 105.5K papers, 1.9M citations, 75% related
Deep learning: 79.8K papers, 2.1M citations, 74% related
Robustness (computer science): 94.7K papers, 1.6M citations, 74% related
Wavelet: 78K papers, 1.3M citations, 74% related
Pixel: 136.5K papers, 1.5M citations, 74% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    219
2022    409
2021    22
2020    13
2019    9
2018    13