Author

Martin D. F. Wong

Bio: Martin D. F. Wong is an academic researcher from the University of Illinois at Urbana–Champaign. The author has contributed to research in topics: Static routing & Routing (electronic design automation). The author has an h-index of 31 and has co-authored 206 publications receiving 3,707 citations. Previous affiliations of Martin D. F. Wong include Urbana University & University of Toronto.


Papers
Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy; it guarantees the same computational complexity as the fastest sequential version and achieves up to 10 times speedup.
Abstract: Breadth-first search (BFS) has wide applications in electronic design automation (EDA) as well as in other fields. Researchers have tried to accelerate BFS on the GPU, but the two published works are both asymptotically slower than the fastest CPU implementation. In this paper, we present a new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy. It guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
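The abstract does not spell out the hierarchical queue layout or the kernel arrangement; as a reference point, here is a minimal sequential sketch of the level-synchronous, frontier-queue BFS whose O(|V|+|E|) work bound the GPU version is claimed to match. The adjacency-list format and function name are illustrative assumptions.

```python
from collections import deque

def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency list (dict of lists).

    Each vertex enters a frontier queue once and each edge is scanned
    once, giving the O(|V|+|E|) work bound that the paper's GPU version
    preserves with hierarchical per-block queues.
    """
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        next_frontier = deque()
        for u in frontier:          # one kernel launch per level on the GPU
            for v in adj[u]:        # neighbor expansion, parallel per thread
                if v not in dist:   # claimed atomically in the GPU version
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return dist
```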

235 citations

Proceedings ArticleDOI
31 May 2005
TL;DR: Two iterative algorithms are presented, based on node-by-node and row-by-row traversals of the power grid respectively; both can be viewed as efficient implementations of the classical successive over-relaxation (SOR) method for solving linear systems.
Abstract: Due to the extremely large size of power grids, IR drop analysis has become a computationally challenging problem both in terms of runtime and memory usage. Although IR drop analysis can be naturally formulated as the problem of solving a linear system, the system is too large to be solved by existing linear solvers. In this paper, we present two iterative algorithms based on node-by-node traversals and row-by-row traversals of the power grid, respectively. Our algorithms are extremely fast and guarantee convergence to the exact solutions. In fact, they can be considered efficient implementations of the classical successive over-relaxation (SOR) iterative method for solving linear systems. Our methods take full advantage of the special structure of the power grid. Experimental results show that our algorithms outperform the random-walk-based algorithm, the best-known prior method. For a 16-million-node problem, our row-based algorithm took 26.47 minutes while the random-walk-based algorithm took 19.6 hours. Our row-based algorithm produced an exact solution while the random walk produced a solution with a maximum error of 5.7 mV.
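As a point of reference for the SOR connection, here is a textbook dense-matrix sketch of the iteration; the paper's algorithms instead exploit the sparse grid structure by sweeping node by node or row by row, so this baseline, its parameter values, and the function name are illustrative assumptions.

```python
import numpy as np

def sor_solve(G, b, omega=1.5, tol=1e-6, max_iters=10000):
    """Successive over-relaxation on the linear system G x = b.

    Sweeps the unknowns one by one, updating each in place -- the
    classical iteration that the paper's node-by-node and row-by-row
    grid traversals implement efficiently on sparse power grids.
    """
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iters):
        max_delta = 0.0
        for i in range(n):
            sigma = G[i] @ x - G[i, i] * x[i]        # off-diagonal terms
            x_new = (1 - omega) * x[i] + omega * (b[i] - sigma) / G[i, i]
            max_delta = max(max_delta, abs(x_new - x[i]))
            x[i] = x_new
        if max_delta < tol:                          # converged everywhere
            return x
    return x
```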

105 citations

Proceedings ArticleDOI
07 Jun 2004
TL;DR: This paper proposes a maze routing method that accounts for optical effects within the routing algorithm; by exploiting the symmetry of the optical system, diffraction costs are precomputed in look-up tables, and an effective Lagrangian-relaxation-based algorithm solves the resulting constrained routing problem.
Abstract: As technology migrates into the deep submicron manufacturing (DSM) era, the critical dimension of circuits is becoming smaller than the lithographic wavelength. The unavoidable light diffraction phenomena of sub-wavelength technologies have become one of the major factors limiting yield. Optical proximity correction (OPC) is one of the methods adopted to compensate for the light diffraction effect as a post-layout process. However, the process is time-consuming and the results are still limited by the original layout quality. In this paper, we propose a maze routing method that considers the optical effect within the routing algorithm. By utilizing the symmetry of the optical system, the light diffraction is efficiently calculated and stored in tables. The costs that guide the router to minimize optical interference are obtained from these look-up tables. The problem is first formulated as a constrained maze routing problem, then shown to be a multi-constrained shortest-path problem. Based on the Lagrangian relaxation method, an effective algorithm is designed to solve the problem.
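The abstract gives only the outline of the method; the sketch below shows Lagrangian relaxation on a single aggregated optical-penalty constraint, running one shortest-path search (Dijkstra, standing in for the maze router) per trial multiplier. The paper treats a multi-constrained formulation, so the graph encoding, bounds, and function names here are assumptions.

```python
import heapq

def best_path(adj, s, t, lam):
    """Dijkstra under the combined weight: wire cost + lam * optical penalty.

    adj[u] is a list of (v, cost, penalty) edges; the graph is assumed
    simple and connected, so a path from s to t always exists.
    """
    dist, parent = {s: 0.0}, {s: None}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, cost, penalty in adj[u]:
            nd = d + cost + lam * penalty
            if nd < dist.get(v, float("inf")):
                dist[v], parent[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [], t
    while node is not None:            # walk parents back to the source
        path.append(node)
        node = parent[node]
    path.reverse()
    pen = sum(p for u, v in zip(path, path[1:])
              for w, _, p in adj[u] if w == v)    # raw penalty of the path
    return path, pen

def lagrangian_route(adj, s, t, bound, iters=30, hi=1e3):
    """Bisection on the multiplier of the relaxed penalty constraint."""
    lo = 0.0
    for _ in range(iters):
        lam = (lo + hi) / 2
        _, pen = best_path(adj, s, t, lam)
        if pen > bound:
            lo = lam                   # constraint violated: penalize harder
        else:
            hi = lam                   # feasible: try cheaper paths
    return best_path(adj, s, t, hi)[0]
```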

96 citations

Proceedings ArticleDOI
23 Jan 2007
TL;DR: This paper presents an optimal algorithm that minimizes lithography cost subject to any given coupling capacitance bound; because coupling capacitance is a local effect, the resulting dummy metal insertion achieves a highly uniform density, which automatically ameliorates the chemical mechanical polishing (CMP) problem.
Abstract: As integrated circuit manufacturing technology advances into the 65nm and 45nm nodes, extensive resolution enhancement techniques (RETs) are needed to correctly manufacture a chip design. The widely used RET called off-axis illumination (OAI) introduces forbidden pitches, which lead to very complex design rules. It has been observed that imposing uniformity on layout designs can substantially improve printability under OAI. For metal layers, uniformity can be achieved simply by inserting dummy metal wire segments at all free spaces. Simulation results indeed show significant improvement in printability with such a dummy metal insertion approach. To minimize mask cost, it is advantageous to use dummy metal segments that are of the same size as regular metal wires due to their simple geometry. But these dummy wires are printable and hence increase coupling capacitances and potentially affect yield. The alternative is to replace a printable dummy wire segment with a set of parallel sub-resolution thin wires (which are not printed). These invisible dummy metal segments do not increase coupling capacitances but bring a higher lithography cost, which includes mask cost and RET/process expense. This paper presents a strategy for dummy metal insertion that can optimally trade off lithography cost and coupling capacitance. In particular, we present an optimal algorithm that can minimize lithography cost subject to any given coupling capacitance bound. Moreover, this dummy metal insertion achieves a highly uniform density because of the locality of coupling capacitance, which automatically ameliorates the chemical mechanical polishing (CMP) problem.
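The optimal algorithm itself is not described in the abstract; the following hypothetical knapsack-style dynamic program over the two fill choices per free space (cheap printable dummy that adds coupling capacitance vs. costlier sub-resolution fill that adds none) illustrates the trade-off being optimized. The discretized capacitance units and per-slot costs are assumptions, not the paper's formulation.

```python
def min_litho_cost(slots, cap_bound):
    """Minimize total lithography cost subject to a capacitance bound.

    slots: list of (cheap_cost, cap_units, expensive_cost) tuples, one per
    free space; the printable dummy costs cheap_cost and adds cap_units of
    coupling capacitance, the sub-resolution fill costs expensive_cost and
    adds none.  dp[c] = minimum cost with at most c capacitance units used.
    """
    INF = float("inf")
    dp = [0.0] * (cap_bound + 1)
    for cheap, cap, expensive in slots:
        new = [INF] * (cap_bound + 1)
        for c in range(cap_bound + 1):
            new[c] = dp[c] + expensive          # sub-resolution fill
            if c >= cap:                        # printable dummy fits budget
                new[c] = min(new[c], dp[c - cap] + cheap)
        dp = new
    return dp[cap_bound]

# Example: three free spaces, capacitance budget of 2 units -> cost 6.0.
print(min_litho_cost([(1.0, 1, 4.0), (1.0, 1, 4.0), (1.0, 2, 4.0)], 2))
```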

86 citations

Proceedings ArticleDOI
05 Nov 2012
TL;DR: Although the general TPL layout decomposition problem is NP-hard, the standard cell based row-structure variant is shown to be polynomial-time solvable, and a polynomial-time algorithm is proposed that solves it optimally.
Abstract: As minimum feature size keeps shrinking and next-generation lithography (e.g., EUV) is further delayed, double patterning lithography (DPL) has been widely recognized as a feasible lithography solution at the 20nm technology node. However, as technology continues to scale to 14/10nm, DPL begins to show its limitations and usually generates too many undesirable stitches. Triple patterning lithography (TPL) is a natural extension of DPL to conquer these difficulties and achieve a stitch-free layout decomposition. In this paper, we study the standard cell based row-structure layout decomposition problem in TPL. Although the general TPL layout decomposition problem is NP-hard, we show that the standard cell based TPL layout decomposition problem is polynomial-time solvable. We propose a polynomial-time algorithm to solve the problem optimally, and our approach has the capability to find all stitch-free decompositions. Color balancing is also considered to ensure a balanced triple patterning decomposition. To speed up the algorithm, we further propose a hierarchical algorithm for standard cell based layouts, which reduces the run time by 34.5% on average without sacrificing optimality. We also extend our algorithm to allow stitches for complex circuit designs, and our algorithm is guaranteed to find optimal solutions with the minimum number of stitches.
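To make concrete why a row structure helps, here is a toy dynamic program that three-colors a chain of features where each feature may conflict only with its left neighbor; a real cell row has richer boundary conflict graphs, so this is a simplified illustration, not the paper's algorithm.

```python
def tpl_color_row(conflict_prev):
    """Three-color a chain of features for triple patterning.

    conflict_prev[i] is True if feature i conflicts with feature i-1
    (entry 0 is ignored).  dp[i][c] records whether feature i can take
    mask c in some valid coloring of the prefix; keeping all parents
    instead of one would enumerate every stitch-free decomposition.
    """
    n = len(conflict_prev)
    dp = [[False] * 3 for _ in range(n)]
    dp[0] = [True, True, True]
    for i in range(1, n):
        for c in range(3):
            dp[i][c] = any(dp[i - 1][p] and (not conflict_prev[i] or p != c)
                           for p in range(3))
    if not any(dp[-1]):
        return None                      # no stitch-free decomposition
    colors = [dp[-1].index(True)]        # backtrack one valid assignment
    for i in range(n - 1, 0, -1):
        c = colors[-1]
        for p in range(3):
            if dp[i - 1][p] and (not conflict_prev[i] or p != c):
                colors.append(p)
                break
    colors.reverse()
    return colors

# Four features, each conflicting with its left neighbor.
print(tpl_color_row([False, True, True, True]))  # alternating masks, e.g. [1, 0, 1, 0]
```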

82 citations


Cited by
Journal Article
TL;DR: The phase-shifting mask as mentioned in this paper consists of a normal transmission mask that has been coated with a transparent layer patterned to ensure that the optical phases of nearest apertures are opposite.
Abstract: The phase-shifting mask consists of a normal transmission mask that has been coated with a transparent layer patterned to ensure that the optical phases of nearest apertures are opposite. Destructive interference between waves from adjacent apertures cancels some diffraction effects and increases the spatial resolution with which such patterns can be projected. A simple theory predicts a near doubling of resolution for partially coherent illumination with σ < 0.3, and substantial improvements in resolution for σ < 0.7. Initial results obtained with a phase-shifting mask patterned with typical device structures by electron-beam lithography and exposed using a Mann 4800 10× tool reveal a 40-percent increase in usable resolution, with some structures printed at a resolution of 1000 lines/mm. Phase-shifting mask structures can be used to facilitate proximity printing with larger gaps between mask and wafer. Theory indicates that the increase in resolution is accompanied by a minimal decrease in depth of focus. Thus the phase-shifting mask may be the most desirable device for enhancing optical lithography resolution in the VLSI/VHSIC era.
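A small numerical illustration of the mechanism described above: with a 180° phase shift on one of two adjacent apertures, the coherent image amplitudes subtract rather than add, leaving a dark null between the features. The Gaussian aperture images and unit amplitudes are illustrative assumptions, not the paper's model.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 1001)          # position, in units of the pitch
d, w = 1.0, 0.45                          # aperture separation, image width
a1 = np.exp(-((x - d / 2) ** 2) / w**2)   # coherent image of aperture 1
a2 = np.exp(-((x + d / 2) ** 2) / w**2)   # coherent image of aperture 2
conventional = (a1 + a2) ** 2             # in phase: the peaks wash together
phase_shifted = (a1 - a2) ** 2            # opposite phase: exact null at x=0
print("intensity midway between features:",
      conventional[500], "vs", phase_shifted[500])
```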

705 citations

01 Jan 2012
TL;DR: By including versions of varying levels of optimization of the same fundamental algorithm, the Parboil benchmarks present opportunities to demonstrate tools and architectures that help programmers get the most out of their parallel hardware.
Abstract: The Parboil benchmarks are a set of throughput computing applications useful for studying the performance of throughput computing architecture and compilers. The name comes from the culinary term for a partial cooking process, which represents our belief that useful throughput computing benchmarks must be “cooked”, or preselected to implement a scalable algorithm with fine-grained parallel tasks. But useful benchmarks for this field cannot be “fully cooked”, because the architectures and programming models and supporting tools are evolving rapidly enough that static benchmark codes will lose relevance very quickly. We have collected benchmarks from throughput computing application researchers in many different scientific and commercial fields including image processing, biomolecular simulation, fluid dynamics, and astronomy. Each benchmark includes several implementations. Some implementations we provide as readable base implementations from which new optimization efforts can begin, and others as examples of the current state-of-the-art targeting specific CPU and GPU architectures. As we continue to optimize these benchmarks for new and existing architectures ourselves, we will also gladly accept new implementations and benchmark contributions from developers to recognize those at the frontier of performance optimization on each architecture. Finally, by including versions of varying levels of optimization of the same fundamental algorithm, the benchmarks present opportunities to demonstrate tools and architectures that help programmers get the most out of their parallel hardware. Less optimized versions are presented as challenges to the compiler and architecture research communities: to develop the technology that automatically raises the performance of simpler implementations to the performance level of sophisticated programmer-optimized implementations, or demonstrate any other performance or programmability improvements. We hope that these benchmarks will facilitate effective demonstrations of such technology.

695 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sums, achieving an asymptotically optimal O(|V|+|E|) work complexity.
Abstract: Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sums that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
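As a rough sketch of the prefix-sum scheme, this sequential NumPy rendering computes each frontier vertex's write offset with an exclusive scan, so neighbor gathering is contention-free; the CSR graph layout is standard, but the function name and the np.unique deduplication (handled by atomics on the GPU) are stand-ins for the paper's details.

```python
import numpy as np

def bfs_prefix_sum(indptr, indices, source):
    """BFS over a CSR graph (NumPy arrays), one prefix-sum pass per level.

    The exclusive scan over frontier degrees assigns each vertex a
    disjoint slice of the gather buffer, which is what makes the
    fine-grained parallel expansion race-free in the GPU version.
    """
    n = len(indptr) - 1
    dist = np.full(n, -1, dtype=np.int64)
    dist[source] = 0
    frontier = np.array([source])
    level = 0
    while frontier.size:
        degrees = indptr[frontier + 1] - indptr[frontier]
        offsets = np.concatenate(([0], np.cumsum(degrees)))   # exclusive scan
        gathered = np.empty(offsets[-1], dtype=np.int64)
        for i, u in enumerate(frontier):       # each u handled by one thread
            gathered[offsets[i]:offsets[i + 1]] = indices[indptr[u]:indptr[u + 1]]
        level += 1
        fresh = np.unique(gathered[dist[gathered] == -1])     # dedupe claims
        dist[fresh] = level
        frontier = fresh
    return dist
```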

541 citations