scispace - formally typeset
Search or ask a question
Topic

Floating point

About: Floating point is a research topic. Over the lifetime, 7407 publications have been published within this topic receiving 111554 citations.


Papers
More filters
Proceedings ArticleDOI
18 Jun 2018
TL;DR: A quantization scheme is proposed that allows inference to be carried out using integer- only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware.
Abstract: The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.

2,067 citations

Book
01 Jan 1991
TL;DR: This book gives a thorough, up-to-date treatment of the behavior of numerical algorithms in finite precision arithmetic by combining algorithmic derivations, perturbation theory, and rounding error analysis.
Abstract: From the Publisher: What is the most accurate way to sum floating point numbers? What are the advantages of IEEE arithmetic? How accurate is Gaussian elimination and what were the key breakthroughs in the development of error analysis for the method? The answers to these and many related questions are included here. This book gives a thorough, up-to-date treatment of the behavior of numerical algorithms in finite precision arithmetic. It combines algorithmic derivations, perturbation theory, and rounding error analysis. Software practicalities are emphasized throughout, with particular reference to LAPACK and MATLAB. The best available error bounds, some of them new, are presented in a unified format with a minimum of jargon. Because of its central role in revealing problem sensitivity and providing error bounds, perturbation theory is treated in detail. Historical perspective and insight are given, with particular reference to the fundamental work of Wilkinson and Turing, and the many quotations provide further information in an accessible format. The book is unique in that algorithmic developments and motivations are given succinctly and implementation details minimized, so that attention can be concentrated on accuracy and stability results. Here, in one place and in a unified notation, is error analysis for most of the standard algorithms in matrix computations. Not since Wilkinson's Rounding Errors in Algebraic Processes (1963) and The Algebraic Eigenvalue Problem (1965) has any volume treated this subject in such depth. A number of topics are treated that are not usually covered in numerical analysis textbooks, including floating point summation, block LU factorization, condition number estimation, the Sylvester equation, powers of matrices, finite precision behavior of stationary iterative methods, Vandermonde systems, and fast matrix multiplication. Although not designed specifically as a textbook, this volume is a suitable reference for an advanced course, and could be used by instructors at all levels as a supplementary text from which to draw examples, historical perspective, statements of results, and exercises (many of which have never before appeared in textbooks). The book is designed to be a comprehensive reference and its bibliography contains more than 1100 references from the research literature. Audience Specialists in numerical analysis as well as computational scientists and engineers concerned about the accuracy of their results will benefit from this book. Much of the book can be understood with only a basic grounding in numerical analysis and linear algebra. About the Author Nicholas J. Higham is a Professor of Applied Mathematics at the University of Manchester, England. He is the author of more than 40 publications and is a member of the editorial boards of the SIAM Journal on Matrix Analysis and Applications and the IMA Journal of Numerical Analysis. His book Handbook of Writing for the Mathematical Sciences was published by SIAM in 1993.

1,911 citations

01 Jan 2002
TL;DR: Higham as discussed by the authors gives a thorough, up-to-date treatment of the behavior of numerical algorithms in finite precision arithmetic, combining algorithmic derivations, perturbation theory, and rounding error analysis.
Abstract: From the Publisher: What is the most accurate way to sum floating point numbers? What are the advantages of IEEE arithmetic? How accurate is Gaussian elimination and what were the key breakthroughs in the development of error analysis for the method? The answers to these and many related questions are included here. This book gives a thorough, up-to-date treatment of the behavior of numerical algorithms in finite precision arithmetic. It combines algorithmic derivations, perturbation theory, and rounding error analysis. Software practicalities are emphasized throughout, with particular reference to LAPACK and MATLAB. The best available error bounds, some of them new, are presented in a unified format with a minimum of jargon. Because of its central role in revealing problem sensitivity and providing error bounds, perturbation theory is treated in detail. Historical perspective and insight are given, with particular reference to the fundamental work of Wilkinson and Turing, and the many quotations provide further information in an accessible format. The book is unique in that algorithmic developments and motivations are given succinctly and implementation details minimized, so that attention can be concentrated on accuracy and stability results. Here, in one place and in a unified notation, is error analysis for most of the standard algorithms in matrix computations. Not since Wilkinson's Rounding Errors in Algebraic Processes (1963) and The Algebraic Eigenvalue Problem (1965) has any volume treated this subject in such depth. A number of topics are treated that are not usually covered in numerical analysis textbooks, including floating point summation, block LU factorization, condition number estimation, the Sylvester equation, powers of matrices, finite precision behavior of stationary iterative methods, Vandermonde systems, and fast matrix multiplication. Although not designed specifically as a textbook, this volume is a suitable reference for an advanced course, and could be used by instructors at all levels as a supplementary text from which to draw examples, historical perspective, statements of results, and exercises (many of which have never before appeared in textbooks). The book is designed to be a comprehensive reference and its bibliography contains more than 1100 references from the research literature. Audience Specialists in numerical analysis as well as computational scientists and engineers concerned about the accuracy of their results will benefit from this book. Much of the book can be understood with only a basic grounding in numerical analysis and linear algebra. About the Author Nicholas J. Higham is a Professor of Applied Mathematics at the University of Manchester, England. He is the author of more than 40 publications and is a member of the editorial boards of the SIAM Journal on Matrix Analysis and Applications and the IMA Journal of Numerical Analysis. His book Handbook of Writing for the Mathematical Sciences was published by SIAM in 1993.

1,565 citations

Journal ArticleDOI
David E. Goldberg1
TL;DR: This paper presents a tutorial on the aspects of floating-point that have a direct impact on designers of computer systems, and concludes with examples of how computer system builders can better support floating point.
Abstract: Floating-point arithmetic is considered as esoteric subject by many people. This is rather surprising, because floating-point is ubiquitous in computer systems: Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on the aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating point standard, and concludes with examples of how computer system builders can better support floating point.

1,372 citations

StandardDOI
01 Jan 2008

1,354 citations


Network Information
Related Topics (5)
Scalability
50.9K papers, 931.6K citations
85% related
Cache
59.1K papers, 976.6K citations
85% related
Data compression
43.6K papers, 756.5K citations
82% related
Server
79.5K papers, 1.4M citations
82% related
Scheduling (computing)
78.6K papers, 1.3M citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
2023105
2022206
2021214
2020302
2019304
2018309