## Real-Time BIST Detector for BGA Faults in Field Programmable Gate Arrays (FPGAs)

James P. Hofmeister, Justin Judkins, Ph.D., Edgar Ortiz, Douglas Goodman, Ridgetop Group, Inc. 3580 West Ina Road, Tucson, Arizona 85741 USA Hoffy@ridgetop-group.com, Justin@ridgetop-group.com

Pradeep Lall, Ph.D., Auburn University, Auburn, Alabama 36849 USA lall@eng.auburn.edu

Abstract - In this paper we introduce a solder-joint built-in-self-test (SJ BIST) for detecting high-resistance and intermittent faults in operational, fully programmed field programmable gate arrays (FPGAs). The approach is simple to implement, offers a method to detect high-resistance faults that result from damaged solder joints, and uses a maximum of one small capacitor externally connected to each selected test pin or each group of two test pins.

### INTRODUCTION

This paper introduces an innovative, in-situ solder-joint built-in-self-test (SJ BIST) to detect high-resistance damage to solder-joint networks of fully operational Field Programmable Gate Arrays (FPGAs) in ball grid array (BGA) packages such as a XILINX® FG1152/FG1156. FPGAs are used in all kinds of control systems in both defense and commercial applications.

A two-port group SJ BIST core was designed, programmed, simulated, synthesized, and loaded into an FPGA on a development board. The SJ BIST core correctly detects and reports instances of high resistance without false errors. The initial test results are presented in this paper. Initial designs for Highly Accelerated Life Test (HALT) experiments have been completed and we plan on fabricating boards, populating them with programmed FPGAs, and conducting HALTs at both the Center for Advanced Vehicle Electronics (CAVE) at Auburn University and at a Department of Defense contractor during a Phase II period of a Small Business Innovation Research contract award. Evaluation of the SJ BIST is also being conducted at a German university under the sponsorship of an automobile manufacturer.

### Mechanics-of-Failure

Solder-joint damage under thermo-mechanical and shock stresses is cumulative; damage manifests in the form of plastic work and cracks, which propagate until the eventual fracture of solder joints [1-4], resulting in FPGA operational failures. An illustration of a fractured solder joint (or bump) under thermo-mechanical stresses is shown in Figure 1. Thermo-mechanical stresses may result from differential expansion under environmental and operational temperature exposure due to coefficient of thermal expansion (CTE) mismatches. Shock loads may be imposed during shipping and normal operation in harsh environments. Even though one or more solder balls (bumps) are cracked, a solder-joint network belonging to a damaged bump might not immediately experience a catastrophic failure. One reason for this is that other solder balls of the BGA package remain intact and tend to keep the package pressed toward the board to maintain electrical contact between the surfaces of cracks [4-6].

However, subsequent mechanical vibration or shock tends to cause such cracked bumps to momentarily open and cause hard-to-diagnose faults of high resistance ($100\Omega$, $300\Omega$, $500\Omega$ and $1000\Omega$ have been used as threshold levels [1,7-10]) lasting for periods of hundreds of nanoseconds, or less, to more than 1µs [1,5,10].



Figure 1: Crack Propagation at the Top and Bottom of a Solder Joint, 15mm BGA [2]

These intermittent faults increase in frequency as evidenced by a practice of logging BGA package failures only after multiple events of high resistance: an initial event followed by some number (for example, 2 to 10) of additional events within a specified period of time, such as ten percent of the number of cycles of the initial event [8-10]. Even then, an intermittent fault of high resistance in a solder-joint network might not result in an operational fault. For example, the high-resistance fault might happen in a ground or power connection, or it might happen during a period when the network is not being written, or it might be too short in duration to cause a signal error. Figure 2 shows a shock-actuated intermittent OPEN (high resistance) of a package interconnect.
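The logging practice described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a procedure taken from the cited references: the function name `failure_confirmed`, its parameters, and the simplification of anchoring the confirmation window at the first recorded event are all hypothetical.

```python
def failure_confirmed(event_cycles, additional_events=2, window_fraction=0.10):
    """Return True if the first high-resistance event is followed by enough
    additional events within a window sized as a fraction of the cycle count
    at which the first event occurred.

    event_cycles      -- cycle numbers at which high-resistance events were seen
    additional_events -- follow-up events required (the text cites 2 to 10)
    window_fraction   -- window length as a fraction of the first event's cycle
    """
    if not event_cycles:
        return False
    events = sorted(event_cycles)
    first = events[0]
    window_end = first + window_fraction * first
    # Count only the events that follow the initial one inside the window.
    followups = sum(1 for cycle in events[1:] if cycle <= window_end)
    return followups >= additional_events


if __name__ == "__main__":
    # Initial event at cycle 1000: the window covers cycles 1000 through 1100.
    print(failure_confirmed([1000, 1040, 1090]))
    print(failure_confirmed([1000, 1500, 2000]))
```

A fuller implementation would probably restart the window from a later event when the first one is not confirmed; that refinement is omitted here for brevity.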

Figure 3 represents HALT results for XILINX FG1156 Daisy Chain packages in which 30 out of 32 tested packages failed in a test period consisting of 3108 cycles. Each temperature cycle of the HALT was a transition from -55°C to 125°C in 30 minutes with 3-minute ramps and 12-minute dwells. What is not immediately apparent is that each of the logged FPGA failures (diamond symbols) represents at least 30 events of high resistance: a FAIL was defined as at least two OPENs (net resistance of $500\Omega$ or higher) within one temperature cycle, and a package failure was logged only after 15 FAILs [9]. A single fault in a temperature cycle was not counted as a FAIL event.
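The counting rule behind the "at least 30 events" figure can be made explicit with a short Python sketch. It assumes our reading of the text: a temperature cycle counts as a FAIL when it contains at least two OPENs, and a package failure is logged once 15 FAILs have accumulated. The function names and the per-cycle list representation are hypothetical.

```python
def count_fail_cycles(opens_per_cycle, min_opens_per_fail=2):
    """Count temperature cycles that qualify as a FAIL (>= min_opens OPENs)."""
    return sum(1 for n in opens_per_cycle if n >= min_opens_per_fail)


def package_failure_logged(opens_per_cycle, min_opens_per_fail=2, fails_to_log=15):
    """True once enough FAIL cycles have accumulated to log a package failure."""
    return count_fail_cycles(opens_per_cycle, min_opens_per_fail) >= fails_to_log


if __name__ == "__main__":
    # Smallest history that triggers a log entry: 15 cycles with 2 OPENs each.
    minimal_history = [2] * 15
    print(package_failure_logged(minimal_history), sum(minimal_history))
    # Isolated single OPENs never count, no matter how many cycles see one.
    print(package_failure_logged([1] * 100))
```

The minimal history contains 15 x 2 = 30 OPEN events, which is where the lower bound of 30 comes from.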



Figure 2: Shock-actuated Failure: Transient Strain and Resistance



Figure 3: Representation of XILINX FPGA HALT Test Results [9]

### Location of Greatest Stress on FPGA I/O Ports

The I/O ports of an FPGA nearest the edges of the BGA package, especially those nearest one of the four corners of a BGA package, experience the greatest thermo-mechanical stresses [11-14]. For this reason, the corner I/O solder joints of the XILINX FG1156 are either not used or they are used as additional ground connections. This means that I/O ports on the outer edge of the BGA package that are near one of the four corners are strong candidates for SJ BIST testing because those ports are likely to fail first.

#### State of the Art

In previous work, the authors have demonstrated the use of leading indicators of failure for prognostication of electronics [11-14]. One important reason for using an in-situ SJ BIST is that stress magnitudes are hard to derive, much less keep track of, which leads to inaccurate life expectancy predictions [15]. Another reason for using an in-situ SJ BIST is that even though a particular damaged solder-joint port might not result in immediate FPGA operational failure, the damage indicates the FPGA is likely to have other I/O ports that are damaged – i.e., the FPGA is no longer reliable. An in-situ SJ BIST can also be used in newly designed manufacturing reliability tests to address a concern that failure modes caused by the PCB-FPGA assembly are not being detected during component qualification [6].

Prior to this innovation, there were no known methods for detecting faults in operational, fully programmed FPGAs. Furthermore, FPGAs are not amenable to the measurement techniques typically used in manufacturing reliability tests such as Highly Accelerated Life Tests [4]. This is because those measurement techniques require devices to be powered-off and because FPGA I/O ports are digital, rather than analog, circuits, an example of which is shown in Figure 4.

Modern BGA FPGAs, such as the fine-pitch XILINX FG1156, have more than a thousand I/O ports and very small pitch and ball sizes. For example, the FG1156 has a  $34 \times 34$  array of nominal 0.60mm solder balls with a pitch of 1.0mm (see Figure 5). This tends to make physical inspection techniques impractical and not useful.

## SJ BIST INNOVATION

The SJ BIST innovation requires the attachment of a small capacitor to an I/O port, preferably an unused port near a corner of the package. The SJ BIST writes a logical '1' to charge the capacitor and then reads the voltage across the charged capacitor. If the solder-joint network is undamaged, the write causes the capacitor to be fully charged and a logical '1' is read by the SJ BIST. When the solder-joint network is sufficiently damaged, the RC time constant becomes large, the capacitor is insufficiently charged, a logical '0' instead of a logical '1' is read by the SJ BIST, and a fault is reported.
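The dependence of the read value on the network resistance can be written out with a first-order RC model. The following is a sketch under stated assumptions rather than the design equation of the SJ BIST core: it treats the solder-joint network as a single series resistance $R$ charging an initially discharged capacitor $C$ from the supply voltage $V_{DD}$, and the symbols $t_w$ (time between the start of the write and the read) and $V_{IH}$ (input voltage above which the port reads a logical '1') are introduced here for illustration.

$$
V_C(t) = V_{DD}\left(1 - e^{-t/(RC)}\right)
$$

A logical '0' is read, and a fault is reported, when $V_C(t_w) < V_{IH}$. Solving for $R$ gives the smallest resistance that the test flags:

$$
R > \frac{t_w}{C \,\ln\!\left(\dfrac{V_{DD}}{V_{DD} - V_{IH}}\right)}
$$

Under this model, a larger capacitor or a shorter write-to-read interval lowers the resistance at which a fault is reported.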



Figure 4: Example of an FPGA I/O Buffer [16]



Figure 5: Bottom View of a XILINX FG1156 – Package Size is 35 x 35 mm with a 34 x 34 Array of Solder Balls of Nominal Diameter of 0.6mm and Pitch of 1.0mm [17]

## **SJ BIST Description**

This SJ BIST description is for two cases: one in which the solder-joint network is undamaged, and one in which the solder-joint network is damaged enough to cause errors (faults) in I/O signals.

## Undamaged Solder Joint

Referring to Figure 6, the top picture is the normal signal across a $1.0\mu$F capacitor connected to two I/O ports selected for testing. The signal across the capacitor is caused by the SJ BIST writing '1s' and '0s.'



Figure 6: Solder Joint BIST – Input 1MHz Clock: Signal Across Capacitance: Normal Resistance of <1 Ohm (top) and Resistance of 100 Ohm (bottom): 2µs x 2.0V Grid

Still referring to Figure 6, the charged voltage on the capacitor is read from the second I/O port. The SJ BIST then writes a '1' and a '0' to the same capacitor through the second I/O port and reads the charge through the first I/O port.

#### Damaged Solder Joint

A high-resistance fault in an I/O port causes the capacitor to fail to fully charge, as shown in the oscilloscope views at the bottom of Figure 6 and Figure 7. Because of the increase in the network resistance, the voltage across the capacitor is less than 1.0V instead of 3.3V at the time of the read. The SJ BIST therefore reads a logical '0' instead of a logical '1', which it detects as a fault.
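A rough numerical check of this behavior is sketched below in Python using an ideal first-order RC model. The 3.3V supply, the $1.0\mu$F capacitor, and the $100\Omega$ fault resistance come from the text; the 0.1 Ohm healthy resistance (one value satisfying "<1 Ohm"), the 1µs charge window, and the 1.65V read threshold are illustrative assumptions, and the driver output impedance of the I/O buffer is ignored.

```python
import math


def capacitor_voltage(r_ohm, c_farad, t_charge_s, vdd=3.3):
    """Voltage on an initially discharged capacitor after charging through
    r_ohm for t_charge_s seconds (ideal first-order RC model)."""
    tau = r_ohm * c_farad
    return vdd * (1.0 - math.exp(-t_charge_s / tau))


def reads_as_one(v_cap, v_threshold=1.65):
    """Hypothetical input threshold: at or above it the port reads '1'."""
    return v_cap >= v_threshold


C = 1.0e-6          # 1.0 uF test capacitor (from the text)
T_CHARGE = 1.0e-6   # assumed charge window: one period of a 1 MHz clock

healthy = capacitor_voltage(0.1, C, T_CHARGE)    # assumed healthy network
damaged = capacitor_voltage(100.0, C, T_CHARGE)  # 100 Ohm fault

if __name__ == "__main__":
    print(f"healthy: {healthy:.3f} V, reads '1': {reads_as_one(healthy)}")
    print(f"damaged: {damaged:.3f} V, reads '1': {reads_as_one(damaged)}")
```

With these assumed values the healthy network charges to essentially the full supply voltage, while the damaged network stays well below 1.0V, in line with the oscilloscope traces described above.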

Should a fault occur in both I/O ports, the capacitor might not be fully discharged during the '0' writes. The SJ BIST employs special logic to detect this condition and to continually write a '0' to discharge the capacitor before resuming the normal write-read logic.

## SJ BIST: Fault Evaluation

We have verified that the SJ BIST works correctly for the following conditions: (1) fault during the write of a '1' to I/O port 1 and (2) fault during the write of a '1' to I/O port 2. We verified the SJ BIST works correctly with no false alarms at 100kHz, 1MHz, 10MHz and 20MHz. The test result for 1MHz is shown in Figure 6; Figure 7 shows the test result at 10MHz.



Figure 7: Solder-Joint BIST – Input 10MHz Clock: Signal Across Capacitance: Normal Resistance of <1 Ohm (top) and Resistance of 100 Ohm (bottom): 2µs x 2.0V Grid

## SJ BIST Signals

The SJ BIST, at minimum, must present at least one error signal (a fault indicator) either to an external FPGA I/O port or to an internal fault management program. For evaluation and investigation, our SJ BIST core provides two error signals plus fault counts.

The SJ BIST, at minimum, must accept at least one control signal: an enable (disable) BIST.

## Error Signals and Fault Counts

In addition to recording fault counts, the SJ BIST core described in this paper provides two error signals: (1) at least one fault has been detected in the two-port network being tested, and (2) at least one fault is currently active. The fault counts are provided for research evaluation purposes. For a deployed SJ BIST, we anticipate most applications would only use the two error signals. We also believe a deployed SJ BIST application would most likely use at least four groups of cores – one for each corner of an FPGA.

### Control Signals

In addition to CLK, the SJ BIST core has two input-control signals: ENABLE and RESET. ENABLE is used to turn the SJ BIST detection on and off; RESET is used to reset both the fault signal latches and the fault counters. For a deployed SJ BIST, RESET might not be used.
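The interaction of the control and error signals can be illustrated with a small behavioral model in Python. This is a sketch for discussion only, not the HDL of the SJ BIST core: the class name, the single `fault_count` field, the `read_ok` argument standing in for the result of a write-read test, and the choice to give RESET priority over ENABLE are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class SJBistCoreModel:
    fault_latched: bool = False  # error signal 1: at least one fault was detected
    fault_active: bool = False   # error signal 2: a fault is currently active
    fault_count: int = 0         # research-only fault counter

    def clock(self, enable: bool, reset: bool, read_ok: bool = True) -> None:
        """Advance the model by one test sequence."""
        if reset:
            # RESET clears the fault signal latches and the fault counters.
            self.fault_latched = False
            self.fault_active = False
            self.fault_count = 0
            return
        if not enable:
            # Detection is off: nothing is tested, so no fault can be active.
            self.fault_active = False
            return
        self.fault_active = not read_ok
        if self.fault_active:
            self.fault_latched = True
            self.fault_count += 1


if __name__ == "__main__":
    core = SJBistCoreModel()
    core.clock(enable=True, reset=False, read_ok=False)  # intermittent fault
    core.clock(enable=True, reset=False, read_ok=True)   # fault has cleared
    print(core)
```

After the two clocks in the usage example, the latched signal remains set while the active signal has dropped, which is the distinction between the two error signals described above.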

## Faults: Duration, Detection, and Number of Ports

Our current effort is focused on the design and development of two SJ BIST cores: a two-port and a one-port SJ BIST. To test more than one or two I/O ports, we believe that multiple SJ BIST cores should be used in the deployed FPGA.

Each of the SJ BIST cores has advantages and disadvantages related to the number of gates, the number of externally connected capacitors, the power dissipation and the minimum duration of a fault period for "guaranteed" detection.

## Fault Duration and Detection: Two-Port SJ BIST Core

Referring back to Figure 6, the signal sequence is write-read '10' (test I/O port 1), write-read '10' (test I/O port 2), and continue to repeat the sequence. Parallel logic checks to see if the capacitor is written to correctly. This sequence takes two clocks to complete, which means the following: (1) a fault must have a minimum duration of two clock periods for "guaranteed" detection; (2) a fault with a duration of one-half of a clock period is detectable when it occurs at the start of either the write-read '1' or the write-read '0' sequence for that pin. For an FPGA with a 20MHz CLK, the guaranteed detection duration is 100ns.
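The guaranteed-detection duration follows directly from the clock frequency, as the short Python sketch below shows. The function name is hypothetical; the two-clock sequence length and the clock frequencies are taken from the text.

```python
def guaranteed_detection_time_s(clk_hz, clocks_per_sequence=2):
    """Minimum fault duration (seconds) that always overlaps a full
    write-read sequence of the two-port core."""
    return clocks_per_sequence / clk_hz


if __name__ == "__main__":
    # Clock frequencies at which the core was verified.
    for clk_hz in (100e3, 1e6, 10e6, 20e6):
        t_ns = guaranteed_detection_time_s(clk_hz) * 1e9
        print(f"{clk_hz / 1e6:g} MHz -> {t_ns:g} ns")
```

For the 20MHz clock this evaluates to two 50ns periods, that is, the 100ns quoted above.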

To test eight I/O ports, two I/O ports for each corner of a BGA package, four 2-port SJ BIST cores could be used and the error signals ORed together.
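Combining the error signals of several cores amounts to a logical OR per signal. The Python sketch below models this for four two-port cores; the `CoreStatus` type and the `combine` function are hypothetical names used only for illustration.

```python
from typing import Iterable, NamedTuple


class CoreStatus(NamedTuple):
    fault_latched: bool  # at least one fault was detected by this core
    fault_active: bool   # a fault is currently active on this core


def combine(statuses: Iterable[CoreStatus]) -> CoreStatus:
    """OR the error signals of several SJ BIST cores into one pair."""
    statuses = list(statuses)
    return CoreStatus(
        fault_latched=any(s.fault_latched for s in statuses),
        fault_active=any(s.fault_active for s in statuses),
    )


if __name__ == "__main__":
    # One two-port core per package corner; one corner has a past fault.
    corners = [
        CoreStatus(False, False),
        CoreStatus(True, False),
        CoreStatus(False, False),
        CoreStatus(False, False),
    ]
    print(combine(corners))
```

ORing the signals keeps the interface to the fault management logic at two bits, at the cost of no longer identifying which corner reported the fault.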

#### SUMMARY

In this paper we presented an overview of the physics of failure associated with the solder joints of FPGAs in BGA packages: the primary contributor to fatigue damage is thermomechanical stress related to CTE mismatches, shock and vibration, and power on-off sequencing. Solder-joint fatigue damage can result in cracks that cause intermittent instances of high-resistance spikes that are hard to diagnose. In reliability testing, OPENs (faults) are often characterized by spikes of $100\Omega$ or more lasting for less than 100ns to $1\mu$s or longer.

Prior to the innovative SJ BIST presented in this paper, there were no known methods for detecting high-resistance faults in solder-joint networks belonging to operational, fully programmed FPGAs.

An in-situ SJ BIST that can be used in operational FPGAs is useful because stress magnitudes are hard to derive, which leads to inaccurate life expectancy predictions; in addition, even though a particular damaged solder-joint port might not result in immediate FPGA operational failure, the damage indicates the FPGA is no longer reliable. An in-situ SJ BIST can also be used in newly designed manufacturing reliability tests to investigate failure modes related to the PCB-FPGA assembly.

Two SJ BIST cores have been designed: a one-port SJ BIST and a two-port SJ BIST. The two-port SJ BIST was programmed, simulated, synthesized, loaded into an FPGA on a development board, and tested in a laboratory. The test results show the SJ BIST core correctly detects and reports instances of high resistance ($100\Omega$ or more) without false errors, and no errors are detected or reported when the network resistance is $1.0\Omega$ or less.

#### ACKNOWLEDGMENT

The authors would like to acknowledge the support of the NAVAIR SBIR Program for its assistance in advancing this work. Two patent applications have been filed associated with this work.

#### REFERENCES

- [1]. Accelerated Reliability Task IPC-SM-785, SMT Force Group Standard, Product Reliability Committee of the IPC, Published by Analysis Tech., Inc., 2005, <u>www.analysistech.com/event-tech-IPC-SM-785</u>.

- [2]. P. Lall, M. N. Islam, N. Singh, J. C. Suhling, and R. Darveaux, "Model for BGA and CSP reliability in automotive underhood applications," *IEEE Trans. Comp. and Pack. Tech.*, vol. 27, no. 3, Sep. 2004, pp. 585-593.
- [3]. R. Gannamani, V. Valluri, Sidharth and M-L. Zhang, "Reliability evaluation of chip scale packages," Advanced Micro Devices, Sunnyvale, CA, in Daisy Chain Samples, App. Note, *Spansion*, Jul. 2003, pp. 4-9.
- [4]. Sony Semiconductor Quality and Reliability Handbook, Revised May 2001, vol. 2, pp. 66-67, vol. 4, pp. 120-129, <u>http://www.sony.net/products/SC-HP/tec/catalog</u>.
- [5]. Use Condition Based Reliability Evaluation: An Example Applied to Ball Grid Array (BGA) Packages, SEMATECH Technology Transfer #99083813A-XFR, International SEMATECH, 1999, pg. 6.
- [6]. Comparison of Ball Grid Array (BGA) Component and Assembly Level Qualification Tests and Failure Modes, SEMATECH Technology Transfer #00053957A-XFR, International SEMATECH, May 31, 2000, pp. 1-4.
- [7]. R. Roergren, P-E. Teghall, and P. Carlsson, "Reliability of BGA packages in an automotive environment," IVF-The Swedish Institute of Production Engineering Research, Moelndal, Sweden, accessed Dec. 25, 2005, <u>http://www.ivf.se</u>.
- [8]. D. E. Hodges Popp, A. Mawer, and G. Presas, "Flip chip PBGA solder joint reliability: power cycling versus thermal cycling," Motorola Semiconductor Products Sector, Austin, TX, Dec. 19, 2005.
- [9]. The Reliability Report, XILINX, Sep. 1, 2003, pp. 225-229, xgoogle.xilinx.com.
- [10]. J-P. Clech, D. M. Noctor, J. C. Manock, G. W. Lynott, and F. E. Bader, "Surface mount assembly failure statistics and failure-free times," in *Proc., 44<sup>th</sup> ECTC,* Washington, D.C., May 1-4, 1994, pp. 487-497.
- [11]. P. Lall, P. Choudhary, and S. Gupte, "Health monitoring for damage initiation & progression during mechanical shock in electronic assemblies," *Proc.* 56<sup>th</sup> IEEE Electronic Components and Technology Conf., San Diego, CA, May 30-Jun. 2, 2006, pp. 85-94.

- [12]. P. Lall, M. Hande, M. N. Singh, J. Suhling and J. Lee, "Feature extraction and damage data for prognostication of leaded and leadfree electronics," *Proc.* 56<sup>th</sup> IEEE Electronic Components and Technology Conf., San Diego, CA, May 30-Jun. 2, 2006, pp.718-727.
- [13]. P. Lall, N. Islam and J. Suhling, "Leading indicators of failure for prognostication of leaded and lead-free electronics in harsh environments," *Proc. ASME InterPACK Conf.*, San Francisco, CA, Jul. 17-22, 2005, Paper IPACK2005-73426, pp. 1-9.
- [14]. P. Lall, M. N. Islam, K. Rahim, and J. Suhling, "Prognostics and health management of electronic packaging," Accepted for publication in *IEEE Trans. on Components and Packaging Technologies*, Paper available in digital format on IEEE Xplore, Mar. 2005, pp. 1-12.
- [15]. P. Lall, "Challenges in accelerated life testing," Inter Society Conf., Thermal Phenomena, 2004, pg. 727.
- [16]. FPGA I/O Buffer shown was taken from documentation for Altera FPGA development kit, May, 2006.
- [17]. XILINX Fine-Pitch BGA (FG1156/FGG1156) Package, PK039 (v1.2), Jun. 25, 2004.