Proceedings ArticleDOI

Fused floating-point add and subtract unit

TL;DR: The paper describes a fused floating-point add-subtract unit using the IEEE-754 standard 32-bit floating-point number representation, and compares a single add unit, the fused floating-point add-subtract unit, and a multifunctional unit.
Abstract: The paper describes a fused floating-point add-subtract unit using the IEEE-754 standard 32-bit floating-point number representation. In the fused add-subtract unit, the add unit and the subtract unit operate in parallel, and this fused approach reduces both the hardware required and the cost of the designed unit. If an operation beyond addition and subtraction, such as multiplication, is to be performed, a multifunctional design is required. The multifunctional floating-point unit places a multiplier unit above the fused add-subtract unit, which uses hardware more efficiently than separate unit blocks for each operation. This method reduces the area of the designed block, but the speed of operation is also reduced. Blocks are shared based on the operations common to each designed unit; for example, every floating-point unit performs rounding after each operation, so designing a common rounding block reduces the hardware. The paper also compares a single add unit, the fused floating-point add-subtract unit, and a multifunctional unit.
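
As a rough illustration of the fused idea, the following C model (a sketch of the concept only, not the paper's RTL; frexpf/ldexpf stand in for the hardware unpack and alignment stages) performs the operand alignment once and feeds the same aligned significands to an add path and a subtract path in parallel:

```c
/* fused_addsub.c -- software model of a fused add-subtract unit.
 * Minimal sketch of the concept: operand unpacking and exponent
 * alignment are done once and shared by the add path and the
 * subtract path, which then run independently.
 */
#include <math.h>
#include <stdio.h>

typedef struct { float sum, diff; } addsub_t;

static addsub_t fused_addsub(float a, float b)
{
    /* Shared step: align both significands to a common exponent,
     * as a single pre-alignment block would in hardware. */
    int ea, eb;
    float ma = frexpf(a, &ea);     /* a = ma * 2^ea, 0.5 <= |ma| < 1 */
    float mb = frexpf(b, &eb);
    int e  = (ea > eb) ? ea : eb;  /* common exponent */
    float sa = ldexpf(ma, ea - e); /* aligned significands */
    float sb = ldexpf(mb, eb - e);

    /* Parallel paths: one adder and one subtractor consume the
     * same aligned operands, then renormalize. */
    addsub_t r;
    r.sum  = ldexpf(sa + sb, e);
    r.diff = ldexpf(sa - sb, e);
    return r;
}

int main(void)
{
    addsub_t r = fused_addsub(3.5f, 1.25f);
    printf("sum = %g, diff = %g\n", r.sum, r.diff); /* 4.75, 2.25 */
    return 0;
}
```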
Citations
Journal Article
TL;DR: A review of floating-point (FP) unit designs for the arithmetic operations of addition, subtraction, multiplication, and division on reconfigurable hardware architectures, including multi-operand units that utilize both the distributed and the fused concepts.
Abstract: Arithmetic operations in digital systems form an important part of study in recent years. Floating-point (FP) implementation on reconfigurable hardware architectures is an important area of concern. Many new designs have been proposed recently and are distinguished from existing approaches on the basis of various performance parameters. This paper reviews FP unit designs for the arithmetic operations of addition, subtraction, multiplication, and division. Multi-operand units, which utilize both the distributed and the fused concepts, are also reviewed.

1 citation

Journal ArticleDOI
TL;DR: An effective implementation of a Fused Floating-point Add-Subtract (FFAS) unit with a modified dual-path design is presented; the Dual-Path FFAS (DPFFAS) unit has reduced latency compared with the FFAS unit.
Abstract: Reconfigurable architectures have provided a low-cost, fast-turnaround platform for the development and deployment of designs in communication and signal processing applications. Floating-point operations are used in most signal processing applications that require high precision and good accuracy. In this paper, an effective implementation of a Fused Floating-point Add-Subtract (FFAS) unit with a modified dual-path design is presented. To enhance the performance of the FFAS unit on reconfigurable architectures, a dual-path unit with a modification in the close-path design is proposed. The proposed design is targeted at a Xilinx Virtex-6 device and implemented on an ML605 evaluation board for single, double, and double-extended precision. Compared to a discrete floating-point adder design, the FFAS unit reduces area requirements and power dissipation, as the latter shares common logic. A Dual-Path FFAS (DPFFAS) unit has reduced latency compared with the FFAS unit, and the latency is further reduced with the proposed modified DPFFAS.
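
The dual-path routing decision can be sketched in software. In the illustration below (an assumption for illustration, not this paper's design), an effective subtraction with exponent difference at most 1 is routed to the close path, where massive cancellation and a long renormalization shift can occur; all other cases take the far path, which needs at most a one-bit normalization shift:

```c
/* dual_path.c -- sketch of the dual-path idea behind a DPFFAS unit.
 * Assumed convention: the "close" path handles effective subtraction
 * with exponent difference <= 1; everything else takes the "far" path.
 */
#include <math.h>
#include <stdio.h>

static const char *pick_path(float a, float b, int subtract)
{
    int ea, eb;
    frexpf(a, &ea);                 /* extract exponents only */
    frexpf(b, &eb);
    int d = ea - eb;
    /* Effective subtraction: a subtract of like signs, or an add of
     * unlike signs, both cancel significand bits. */
    int eff_sub = subtract ^ (signbit(a) != signbit(b));
    return (eff_sub && d >= -1 && d <= 1) ? "close" : "far";
}

int main(void)
{
    printf("%s\n", pick_path(1.000001f, 1.0f, 1)); /* close: cancellation */
    printf("%s\n", pick_path(1024.0f, 1.0f, 1));   /* far: big alignment  */
    return 0;
}
```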

Additional excerpts

  • ...A multifunctional floating point unit [14] is designed that includes multiplication above fused add-sub unit producing a hardware efficient implementation of fused floating point arithmetic....

    [...]

Proceedings ArticleDOI
01 Dec 2016
TL;DR: A stuck-at fault model using the built-in self-test (BIST) method is designed for the floating-point unit to check for faults in the design; parallel testing of independent blocks reduces dynamic power by 10.47% compared to the conventional method.
Abstract: Arithmetic computations can be on integer or floating-point (real) numbers. In digital systems, the ALU handles arithmetic operations. However, the ALU is not suitable for handling operations on real numbers, as the result may not be precise and accurate. Hence, to perform operations on real numbers, digital systems use a dedicated unit called the floating-point unit (FPU). In this paper, the designed FPU is single precision and operates on the IEEE 754-2008 format. The available arithmetic operations on this FPU are floating-point multiplication, division, addition, and subtraction. The designed FPU can operate on both normal (normalized) and subnormal (denormalized) floating-point numbers. In this paper, a stuck-at fault model using the built-in self-test (BIST) method is designed for the floating-point unit to check for faults in the design. The basic idea behind BIST is that the device tests itself. The proposed design is modified for parallel testing by dividing the FPU into three independent blocks: while one block is in normal operation, another block of the FPU is tested in parallel. The design's RTL code is written in Verilog HDL, and Xilinx Vivado 2015 is used for simulation. The proposed method reduces dynamic power by 10.47% compared to the conventional method.
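
A minimal C sketch of the BIST principle described here; the test table and the unit under test are hypothetical stand-ins for the on-chip pattern generator, response analyzer, and FPU block:

```c
/* bist_sketch.c -- the basic BIST idea modeled in C: apply a stored
 * stimulus table to the unit under test and compare against golden
 * responses, producing a pass/fail flag.
 */
#include <stdio.h>

/* Unit under test: stands in for one FPU block (here, the adder). */
static float uut_add(float a, float b) { return a + b; }

/* Golden stimulus/response table, as a test ROM would hold.
 * All values are exactly representable, so comparison is exact. */
static const struct { float a, b, expect; } vectors[] = {
    {  1.0f, 2.0f,  3.0f },
    { -1.5f, 0.5f, -1.0f },
    {  0.0f, 0.0f,  0.0f },
};

static int bist_run(void)
{
    for (unsigned i = 0; i < sizeof vectors / sizeof vectors[0]; i++)
        if (uut_add(vectors[i].a, vectors[i].b) != vectors[i].expect)
            return 0;  /* mismatch: fault detected */
    return 1;          /* all responses match: pass */
}

int main(void)
{
    printf("BIST %s\n", bist_run() ? "pass" : "FAIL");
    return 0;
}
```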
Dissertation
01 Jan 2020
TL;DR: In this dissertation, a math coprocessor for the AMIR CPU that can perform addition, subtraction, multiplication, and division on IEEE-754 single-precision floating-point numbers is presented.
Abstract: Math coprocessors are vital components in modern computing to improve the overall performance of the system. The AMIR CPU is a homegrown softcore 32-bit CPU that can only handle integer numbers, making it inadequate for high-performance real-time systems. The aim of this project is to design and develop a math coprocessor for the AMIR CPU that can perform addition, subtraction, multiplication, and division on IEEE-754 single-precision floating-point numbers. The design of the math coprocessor is devised and improved based on past works on IEEE 754 floating-point operations and math coprocessor implementations. The architecture of the proposed math coprocessor consists of a control unit with instruction decode, a floating-point computation unit, and a register file; the architecture type is a serial controller with a pipelined data path. The proposed math coprocessor retrieves an instruction from the instruction register, decodes it, retrieves operands from the CPU registers, performs the computation, and then stores the results in the internal register, pending retrieval by the AMIR CPU. The proposed math coprocessor achieved at least 99.999% accuracy for all four arithmetic operations with a maximum frequency of 63.8 MHz, while utilizing less than 30% of the available resources on board an Intel Cyclone IV EP4CE10E22C8 FPGA. The design is not without flaws, as it has problems with instruction queueing due to the absence of an instruction buffer. Nevertheless, with further improvements and features, the proposed math coprocessor has the potential to enable the AMIR CPU to be used in a wide range of applications.
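
The decode-execute-writeback flow described above can be sketched as follows; the instruction encoding and register-file layout are invented for illustration and are not the AMIR coprocessor's actual format:

```c
/* copro_flow.c -- sketch of a serial-controller coprocessor flow:
 * decode an instruction word, read operands from a register file,
 * compute, and write the result back pending retrieval by the CPU.
 * The 11-bit field layout below is hypothetical.
 */
#include <stdio.h>

enum { OP_ADD, OP_SUB, OP_MUL, OP_DIV };

static float regfile[8] = { 0.0f, 1.5f, 2.0f, 8.0f };

static void execute(unsigned instr)
{
    /* Hypothetical encoding: [op:2][rd:3][rs1:3][rs2:3] */
    unsigned op  = (instr >> 9) & 3;
    unsigned rd  = (instr >> 6) & 7;
    unsigned rs1 = (instr >> 3) & 7;
    unsigned rs2 =  instr       & 7;
    float a = regfile[rs1], b = regfile[rs2], r = 0.0f;
    switch (op) {
    case OP_ADD: r = a + b; break;
    case OP_SUB: r = a - b; break;
    case OP_MUL: r = a * b; break;
    case OP_DIV: r = a / b; break;
    }
    regfile[rd] = r;  /* result held pending retrieval by the CPU */
}

int main(void)
{
    execute((OP_MUL << 9) | (4u << 6) | (2u << 3) | 3u); /* r4 = r2*r3 */
    printf("r4 = %g\n", regfile[4]);                     /* 16 */
    return 0;
}
```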
References
Book
01 Feb 1996
TL;DR: An overview of digital design with Verilog HDL and its application in computer-aided digital design, covering hierarchical modeling concepts through advanced topics such as timing, user-defined primitives, the PLI, and logic synthesis.
Abstract: PART I. BASIC VERILOG TOPICS.
1. Overview of Digital Design with Verilog HDL. Evolution of Computer Aided Digital Design. Emergence of HDLs. Typical Design Flow. Importance of HDLs. Popularity of Verilog HDL. Trends in HDLs.
2. Hierarchical Modeling Concepts. Design Methodologies. 4-bit Ripple Carry Counter. Modules. Instances. Components of a Simulation. Example. Design Block. Stimulus Block. Summary. Exercises.
3. Basic Concepts. Lexical Conventions. Whitespace. Comments. Operators. Number Specification. Sized numbers. Unsized numbers. X or Z values. Negative numbers. Underscore characters and question marks. Strings. Identifiers and Keywords. Escaped Identifiers. Data Types. Value Set. Nets. Registers. Vectors. Integer, Real, and Time Register Data Types. Integer. Real. Time. Arrays. Memories. Parameters. Strings. System Tasks and Compiler Directives. System Tasks. Displaying information. Monitoring information. Stopping and finishing in a simulation. Compiler Directives. 'define. 'include. Summary. Exercises.
4. Modules and Ports. Modules. Ports. List of Ports. Port Declaration. Port Connection Rules. Inputs. Outputs. Inouts. Width matching. Unconnected ports. Example of illegal port connection. Connecting Ports to External Signals. Connecting by ordered list. Connecting ports by name. Hierarchical Names. Summary. Exercises.
5. Gate-Level Modeling. Gate Types. And/Or Gates. Buf/Not Gates. Bufif/notif. Examples. Gate-level multiplexer. 4-bit full adder. Gate Delays. Rise, Fall, and Turn-off Delays. Rise delay. Fall delay. Turn-off delay. Min/Typ/Max Values. Min value. Typ value. Max value. Delay Example. Summary. Exercises.
6. Dataflow Modeling. Continuous Assignments. Implicit Continuous Assignment. Delays. Regular Assignment Delay. Implicit Continuous Assignment Delay. Net Declaration Delay. Expressions, Operators, and Operands. Expressions. Operands. Operators. Operator Types. Arithmetic Operators. Binary operators. Unary operators. Logical Operators. Relational Operators. Equality Operators. Bitwise Operators. Reduction Operators. Shift Operators. Concatenation Operator. Replication Operator. Conditional Operator. Operator Precedence. Examples. 4-to-1 Multiplexer. Method 1: logic equation. Method 2: conditional operator. 4-bit Full Adder. Method 1: dataflow operators. Method 2: full adder with carry lookahead. Ripple Counter. Summary. Exercises.
7. Behavioral Modeling. Structured Procedures. Initial Statement. Always Statement. Procedural Assignments. Blocking assignments. Nonblocking Assignments. Application of nonblocking assignments. Timing Controls. Delay-Based Timing Control. Regular delay control. Intra-assignment delay control. Zero delay control. Event-Based Timing Control. Regular event control. Named event control. Event OR control. Level-Sensitive Timing Control. Conditional Statements. Multiway Branching. Case Statement. Casex, casez Keywords. Loops. While Loop. For Loop. Repeat Loop. Forever loop. Sequential and Parallel Blocks. Block Types. Sequential blocks. Parallel blocks. Special Features of Blocks. Nested blocks. Named blocks. Disabling named blocks. Examples. 4-to-1 Multiplexer. 4-bit Counter. Traffic Signal Controller. Specification. Stimulus. Summary. Exercises.
8. Tasks and Functions. Differences Between Tasks and Functions. Tasks. Task Declaration and Invocation. Task Examples. Use of Input and Output Arguments. Asymmetric Sequence Generator. Functions. Function Declaration and Invocation. Function Examples. Parity calculation. Left/right shifter. Summary. Exercises.
9. Useful Modeling Techniques. Procedural Continuous Assignments. Assign and deassign. Force and release. Force and release on registers. Force and release on nets. Overriding Parameters. Defparam Statement. Module_Instance Parameter Values. Conditional Compilation and Execution. Conditional Compilation. Conditional Execution. Time Scales. Useful System Tasks. File Output. Opening a file. Writing to files. Closing files. Displaying Hierarchy. Strobing. Random Number Generation. Initializing Memory from File. Value Change Dump File. Summary. Exercises.
PART II. ADVANCED VERILOG TOPICS.
10. Timing and Delays. Types of Delay Models. Distributed Delay. Lumped Delay. Pin-to-Pin Delays. Path Delay Modeling. Specify Blocks. Inside Specify Blocks. Parallel Connection. Full Connection. Specparam Statements. Conditional Path Delays. Rise, fall, and turn-off delays. Min, max, and typical delays. Handling x transitions. Timing Checks. $setup and $hold checks. $setup task. $hold task. $width Check. Delay Back-Annotation. Summary. Exercises.
11. Switch-Level Modeling. Switch-Modeling Elements. MOS Switches. CMOS Switches. Directional Switches. Power and Ground. Resistive Switches. Delay Specification on Switches. MOS and CMOS switches. Bidirectional pass switches. Specify blocks. Examples. CMOS Nor Gate. 2-to-1 Multiplexer. Simple CMOS Flip-Flop. Summary. Exercises.
12. User-Defined Primitives. UDP Basics. Parts of UDP Definition. UDP Rules. Combinational UDPs. Combinational UDP Definition. State Table Entries. Shorthand Notation for Don't Cares. Instantiating UDP Primitives. Example of a Combinational UDP. Sequential UDPs. Level-Sensitive Sequential UDPs. Edge-Sensitive Sequential UDPs. Example of a Sequential UDP. UDP Table Shorthand Symbols. Guidelines for UDP Design. Summary. Exercises.
13. Programming Language Interface. Uses of PLI. Linking and Invocation of PLI Tasks. Linking PLI Tasks. Linking PLI in Verilog-XL. Linking in VCS. Invoking PLI Tasks. General Flow of PLI Task Addition and Invocation. Internal Data Representation. PLI Library Routines. Access Routines. Mechanics of Access Routines. Types of Access Routines. Examples of Access Routines. Utility Routines. Mechanics of Utility Routines. Types of Utility Routines. Example of Utility Routines. Summary. Exercises.
14. Logic Synthesis with Verilog HDL. What Is Logic Synthesis? Impact of Logic Synthesis. Verilog HDL Synthesis. Verilog Constructs. Verilog Operators. Interpretation of a Few Verilog Constructs. The assign statement. The if-else statement. The case statement. For loops. The function statement. Synthesis Design Flow. RTL to Gates. RTL Description. Translation. Unoptimized Intermediate Representation. Logic Optimization. Technology Mapping and Optimization. Technology library. Design constraints. Optimized gate-level description. An Example of RTL-to-Gates. Design Specification. RTL description. Technology library. Design constraints. Logic synthesis. Final, Optimized, Gate-Level Description. IC Fabrication. Verification of Gate-Level Netlist. Functional Verification. Timing Verification. Modeling Tips for Logic Synthesis. Verilog Coding Style. Use meaningful names for signals and variables. Avoid mixing positive and negative edge-triggered flip-flops. Use basic building blocks vs. use continuous assign statements. Instantiate multiplexers vs. use if-else or case statements. Use parentheses to optimize logic structure. Use arithmetic operators *, /, and % vs. design building blocks. Be careful with multiple assignments to the same variable. Define if-else or case statements explicitly. Design Partitioning. Horizontal partitioning. Vertical partitioning. Parallelizing design structure. Design Constraint Specification. Example of Sequential Circuit Synthesis. Design Specification. Circuit Requirements. Finite State Machine (FSM). Verilog Description. Technology Library. Design Constraints. Logic Synthesis. Optimized Gate-Level Netlist. Verification. Summary. Exercises.
PART III. APPENDICES. A. Strength Modeling and Advanced Net Definitions. B. List of PLI Routines. C. List of Keywords, System Tasks, and Compiler Directives. D. Formal Syntax Definition. E. Verilog Tidbits. F. Verilog Examples. Index.

432 citations

Book
01 Mar 1995
TL;DR: The IBM RISC System/6000 (RS/6000) floating-point unit (FPU) exemplifies a second-generation RISC CPU architecture and an implementation which greatly increases floating point performance and accuracy.
Abstract: The IBM RISC System/6000 (RS/6000) floating-point unit (FPU) exemplifies a second-generation RISC CPU architecture and an implementation which greatly increases floating-point performance and accuracy. The key feature of the FPU is a unified floating-point multiply-add-fused unit (MAF) which performs the accumulate operation (A × B) + C as an indivisible operation. This single functional unit reduces the latency for chained floating-point operations, as well as rounding errors and chip busing. It also reduces the number of adders/normalizers by combining the addition required for fast multiplication with accumulation. The MAF unit is made practical by a unique fast-shifter, which eases the overlap of multiplication and addition, and a leading-zero/one anticipator, which eases overlap of normalization and addition. The accumulate instruction required by this architecture reduces the instruction path length by combining two instructions into one. Additionally, the RS/6000 FPU is tightly coupled to the rest of the CPU, unlike typical floating-point coprocessor chips.
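
The accuracy benefit of performing (A × B) + C as an indivisible operation can be reproduced with C99's fma(), which, like the MAF unit, rounds only once; a separate multiply followed by an add rounds twice, so the results can differ:

```c
/* maf_rounding.c -- why the fused (A*B)+C matters for accuracy.
 * fma() (C99 <math.h>) rounds a*b + c exactly once; a*b + c written
 * as two operations rounds the product first and can lose the result.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + 0x1p-27;       /* 1 + 2^-27, exactly representable */
    double b = 1.0 - 0x1p-27;
    double c = -1.0;

    /* Exact product is 1 - 2^-54, which rounds to 1.0 in double. */
    double two_step = a * b + c;    /* product rounded, then sum rounded */
    double fused    = fma(a, b, c); /* single rounding of a*b + c */

    printf("two-step: %a\n", two_step); /* 0x0p+0: the 2^-54 term is lost */
    printf("fused:    %a\n", fused);    /* -0x1p-54: exact residual kept */
    return 0;
}
```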

180 citations

Journal ArticleDOI
E. Hokenek, R. K. Montoye, Peter W. Cook
TL;DR: Improved design techniques for logarithmic addition and higher-order counters for multiplication complete this second-generation RISC floating-point unit design; leading-zero anticipation allows for reduced overall system latency.
Abstract: A 440,000-transistor second-generation RISC (reduced instruction set computer) floating-point chip is described. The pipeline latency is only two cycles, and a double-precision result is produced every cycle. System throughput and accuracy are increased by using a floating-point multiply-add-fused unit, which carries out a double-precision accumulate as a two-cycle pipelined execution with only one rounding error. While the cycle time (40 ns) is competitive with other CMOS RISC systems, the floating-point performance stretches to the range of bipolar RISC systems (7.4-13 MFLOPS LINPACK). Leading zero anticipation makes the two-cycle pipeline possible by nearly eliminating the additional postnormalization time, and it allows for reduced overall system latency. Partial decode shifters allow complete time sharing for the multiply and data alignment. Improved design techniques for logarithmic addition and higher order counters for multiplication complete this second-generation RISC floating-point unit design.
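
For orientation, the sketch below shows only the serial post-normalization step that a leading-zero anticipator overlaps with the addition itself: after an effective subtraction cancels the top bits, the leading zeros of the raw significand are counted and shifted out, while an LZA predicts this count in parallel with the adder so the shift can start immediately. The 32-bit significand format is hypothetical:

```c
/* normalize.c -- serial post-normalization after subtraction; a
 * leading-zero anticipator would produce the shift count in parallel
 * with the adder instead of after it.
 */
#include <stdint.h>
#include <stdio.h>

/* Count leading zeros of a nonzero 32-bit value (portable loop). */
static int clz32(uint32_t x)
{
    int n = 0;
    while (!(x & 0x80000000u)) { x <<= 1; n++; }
    return n;
}

/* Renormalize a raw significand/exponent pair after subtraction. */
static void normalize(uint32_t *sig, int *exp)
{
    if (*sig == 0) return;          /* exact cancellation: leave as zero */
    int n = clz32(*sig);
    *sig <<= n;                     /* restore the leading 1 to the top bit */
    *exp -= n;
}

int main(void)
{
    uint32_t sig = 0x00A00000u;     /* raw subtractor output: 8 leading zeros */
    int exp = 10;
    normalize(&sig, &exp);
    printf("sig = 0x%08X, exp = %d\n", sig, exp); /* 0xA0000000, exp = 2 */
    return 0;
}
```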

152 citations

Journal ArticleDOI
R. K. Montoye, E. Hokenek, Steve Runyon
TL;DR: The key feature of the RS/6000 FPU is a unified floating-point multiply-add-fused unit (MAF) which performs the accumulate operation as an indivisible operation, reducing the latency for chained floating-point operations, as well as rounding errors and chip busing.
Abstract: The IBM RISC System/6000 (RS/6000) floating-point unit (FPU) exemplifies a second-generation RISC CPU architecture and an implementation which greatly increases floating-point performance and accuracy. The key feature of the FPU is a unified floating-point multiply-add-fused unit (MAF) which performs the accumulate operation (A × B) + C as an indivisible operation. This single functional unit reduces the latency for chained floating-point operations, as well as rounding errors and chip busing. It also reduces the number of adders/normalizers by combining the addition required for fast multiplication with accumulation. The MAF unit is made practical by a unique fast-shifter, which eases the overlap of multiplication and addition, and a leading-zero/one anticipator, which eases overlap of normalization and addition. The accumulate instruction required by this architecture reduces the instruction path length by combining two instructions into one. Additionally, the RS/6000 FPU is tightly coupled to the rest of the CPU, unlike typical floating-point coprocessor chips.

145 citations

Journal ArticleDOI
TL;DR: Two fused floating-point operations are described and applied to the implementation of fast Fourier transform (FFT) processors and the numerical results of the fused implementations are slightly more accurate, since they use fewer rounding operations.
Abstract: This paper describes two fused floating-point operations and applies them to the implementation of fast Fourier transform (FFT) processors. The fused operations are a two-term dot product and an add-subtract unit. The FFT processors use "butterfly" operations that consist of multiplications, additions, and subtractions of complex-valued data. Both radix-2 and radix-4 butterflies are implemented efficiently with the two fused floating-point operations. When placed and routed using a high performance standard cell technology, the fused FFT butterflies are about 15 percent faster and 30 percent smaller than a conventional implementation. Also the numerical results of the fused implementations are slightly more accurate, since they use fewer rounding operations.
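
A sketch of a radix-2 butterfly in C using the two fused operations named here: fma() stands in for the two-term dot product (a hardware dot-product unit would round once over the whole dot product, which fma only approximates by fusing each multiply into its accumulation), and an add-subtract pair produces both outputs from one evaluation of the product term:

```c
/* butterfly.c -- radix-2 FFT butterfly built from a fused-style
 * two-term dot product (modeled with C99 fma) and an add-subtract
 * pair producing a + t and a - t from the same t = w*b.
 */
#include <math.h>
#include <stdio.h>

typedef struct { double re, im; } cpx;

static void butterfly(cpx a, cpx b, cpx w, cpx *out0, cpx *out1)
{
    /* Complex multiply as two two-term dot products:
     * t.re = w.re*b.re - w.im*b.im, t.im = w.re*b.im + w.im*b.re.
     * Each fma folds one multiply into its accumulation with a
     * single rounding. */
    cpx t = {
        fma(w.re, b.re, -w.im * b.im),
        fma(w.re, b.im,  w.im * b.re),
    };
    /* Add-subtract pair: both outputs share the same operands. */
    out0->re = a.re + t.re;  out0->im = a.im + t.im;
    out1->re = a.re - t.re;  out1->im = a.im - t.im;
}

int main(void)
{
    cpx a = { 1, 0 }, b = { 0, 1 }, w = { 0, -1 };  /* w = -i */
    cpx x, y;
    butterfly(a, b, w, &x, &y);
    printf("x = %g%+gi, y = %g%+gi\n", x.re, x.im, y.re, y.im);
    /* t = w*b = (-i)(i) = 1, so x = 2+0i, y = 0+0i */
    return 0;
}
```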

115 citations


Additional excerpts

  • ...I. INTRODUCTION The single-precision IEEE 754 standard fused floating-point add-subtract unit performs X = A + B (1). Further improving the design for a multifunctional unit, i.e. addition, subtraction and multiplication....

    [...]