Topic

Supercomputer

About: Supercomputer is a research topic. Over its lifetime, 9,990 publications have been published within this topic, receiving 150,873 citations. The topic is also known as: High performance computing & High-performance computing.


Papers
Journal ArticleDOI
Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C. Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando G. S. L. Brandão, David A. Buell, B. Burkett, Yu Chen, Zijun Chen, Ben Chiaro, Roberto Collins, William Courtney, Andrew Dunsworth, Edward Farhi, Brooks Foxen, Austin G. Fowler, Craig Gidney, Marissa Giustina, R. Graff, Keith Guerin, Steve Habegger, Matthew P. Harrigan, Michael J. Hartmann, Alan Ho, Markus R. Hoffmann, Trent Huang, Travis S. Humble, Sergei V. Isakov, Evan Jeffrey, Zhang Jiang, Dvir Kafri, Kostyantyn Kechedzhi, Julian Kelly, Paul V. Klimov, Sergey Knysh, Alexander N. Korotkov, Fedor Kostritsa, David Landhuis, Mike Lindmark, E. Lucero, Dmitry I. Lyakh, Salvatore Mandrà, Jarrod R. McClean, Matt McEwen, Anthony Megrant, Xiao Mi, Kristel Michielsen, Masoud Mohseni, Josh Mutus, Ofer Naaman, Matthew Neeley, Charles Neill, Murphy Yuezhen Niu, Eric Ostby, Andre Petukhov, John Platt, Chris Quintana, Eleanor Rieffel, Pedram Roushan, Nicholas C. Rubin, Daniel Sank, Kevin J. Satzinger, Vadim Smelyanskiy, Kevin J. Sung, Matthew D. Trevithick, Amit Vainsencher, Benjamin Villalonga, Theodore White, Z. Jamie Yao, Ping Yeh, Adam Zalcman, Hartmut Neven, John M. Martinis
24 Oct 2019-Nature
TL;DR: Quantum supremacy is demonstrated using a programmable superconducting processor known as Sycamore, which takes approximately 200 seconds to sample one instance of a quantum circuit a million times; the equivalent task would take a state-of-the-art classical supercomputer around ten thousand years to compute.
Abstract: The promise of quantum computers is that certain computational tasks might be executed exponentially faster on a quantum processor than on a classical processor [1]. A fundamental challenge is to build a high-fidelity processor capable of running quantum algorithms in an exponentially large computational space. Here we report the use of a processor with programmable superconducting qubits [2-7] to create quantum states on 53 qubits, corresponding to a computational state-space of dimension 2^53 (about 10^16). Measurements from repeated experiments sample the resulting probability distribution, which we verify using classical simulations. Our Sycamore processor takes about 200 seconds to sample one instance of a quantum circuit a million times; our benchmarks currently indicate that the equivalent task for a state-of-the-art classical supercomputer would take approximately 10,000 years. This dramatic increase in speed compared to all known classical algorithms is an experimental realization of quantum supremacy [8-14] for this specific computational task, heralding a much-anticipated computing paradigm.
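As a quick sanity check of the figures quoted above, the following Python sketch uses only numbers stated in the abstract to reproduce the state-space size 2^53 ≈ 10^16 and the implied ratio of roughly a billion between ten thousand years and 200 seconds.

```python
# Back-of-the-envelope check of the figures quoted in the Sycamore abstract.
# All inputs come from the abstract itself; nothing here is measured data.

n_qubits = 53
state_space = 2 ** n_qubits                  # dimension of the computational state space
print(f"2^{n_qubits} = {state_space:.3e}")   # ~9.007e+15, i.e. about 10^16

sycamore_seconds = 200                            # time to sample one circuit a million times
classical_seconds = 10_000 * 365.25 * 24 * 3600   # ~10,000 years, in seconds

speedup = classical_seconds / sycamore_seconds
print(f"Implied speedup: {speedup:.2e}x")    # roughly 1.6e9
```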

2,527 citations

18 Dec 2006
TL;DR: The parallel landscape is framed with seven questions, and recommendations are made: the overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems, and the target should be 1000s of cores per chip, as such chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW

The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met for nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work for 2- or 8-processor systems, but is likely to face diminishing returns as 16- and 32-processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following:

• The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
• The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
• Instead of traditional benchmarks, use 13 "Dwarfs" to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.)
• "Autotuners" should play a larger role than conventional compilers in translating parallel programs.
• To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
• To be successful, programming models should be independent of the number of processors.
• To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
• Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.
• Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines.
• To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost.

Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.
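To make the MIPS-per-watt / per-area / per-dollar recommendation concrete, here is a minimal Python sketch that ranks candidate processing elements by those three ratios. The element names and figures are hypothetical placeholders chosen for illustration, not data from the Berkeley report.

```python
# Hypothetical comparison of processing elements by the efficiency metrics the
# report recommends (MIPS/watt, MIPS/mm^2, MIPS/dollar). All numbers are
# invented for illustration only.
candidates = {
    # name: (MIPS, watts, mm^2 of silicon, development cost in dollars)
    "big out-of-order core": (20_000, 30.0, 25.0, 5_000_000),
    "small in-order core":   (4_000,  1.5,  2.0,   500_000),
    "simd/vector element":   (8_000,  2.0,  3.0,   800_000),
}

for name, (mips, watts, area, dollars) in candidates.items():
    print(f"{name:24s}  "
          f"MIPS/W={mips / watts:8.1f}  "
          f"MIPS/mm^2={mips / area:8.1f}  "
          f"MIPS/$={mips / dollars:8.4f}")
```

Under these made-up numbers the small, simple elements win on every ratio, which is the intuition behind the report's recommendation of thousands of modest cores per chip.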

2,262 citations

Proceedings ArticleDOI
13 Dec 2014
TL;DR: This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
Abstract: Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have recently been proposed which can offer a high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to place and route at 28 nm, containing a combination of custom storage and computational units, with industry-grade interconnects.
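The key property the abstract relies on is that a network's total weight footprint, while too large for a single chip, can fit in the aggregate on-chip storage of a multi-chip system. The sketch below checks that property for a hypothetical set of layers and an assumed per-chip eDRAM capacity; none of the sizes are taken from the paper.

```python
# Check whether a network's weights fit in the aggregate on-chip storage of a
# multi-chip system (hypothetical sizes; not the configuration from the paper).
layer_weight_counts = [                 # weights per layer of a hypothetical DNN
    256 * 1024 * 1024,                  # a large fully connected layer
    64 * 1024 * 1024,
    16 * 1024 * 1024,
]
bytes_per_weight = 2                    # assumed 16-bit fixed-point weights
per_chip_storage_bytes = 36 * 1024**2   # assumed 36 MB of on-chip eDRAM per chip
num_chips = 64                          # system size used in the abstract

footprint = sum(layer_weight_counts) * bytes_per_weight
capacity = per_chip_storage_bytes * num_chips
print(f"weight footprint:          {footprint / 1024**2:.0f} MB")
print(f"aggregate on-chip storage: {capacity / 1024**2:.0f} MB")
print("fits on chip" if footprint <= capacity else "needs off-chip memory")
```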

1,486 citations

Proceedings Article
Frank B. Schmuck1, Roger L. Haskin1
28 Jan 2002
TL;DR: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters; the paper describes GPFS and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
Abstract: GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
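As a conceptual illustration of the distributed-locking idea the abstract refers to, and not of GPFS's actual token protocol, here is a minimal Python sketch of a manager that hands out byte-range lock tokens to cluster nodes and revokes conflicting ones.

```python
# Minimal sketch of byte-range lock tokens handed out by a central manager.
# A conceptual illustration only, not GPFS's actual implementation.
from dataclasses import dataclass

@dataclass
class Token:
    node: str
    start: int   # first byte of the locked range
    end: int     # last byte of the locked range (inclusive)

class TokenManager:
    def __init__(self):
        self.tokens = []    # currently granted Token objects

    def acquire(self, node, start, end):
        # Revoke tokens held by other nodes whose byte range overlaps the request.
        for t in [t for t in self.tokens
                  if t.node != node and not (end < t.start or start > t.end)]:
            self.tokens.remove(t)   # a real system would ask the holder to flush and release
        token = Token(node, start, end)
        self.tokens.append(token)
        return token

mgr = TokenManager()
mgr.acquire("node-a", 0, 4095)      # node A locks the first 4 KB
mgr.acquire("node-b", 4096, 8191)   # disjoint range, nothing revoked
mgr.acquire("node-b", 0, 1023)      # overlaps node A's token, so it is revoked
print([(t.node, t.start, t.end) for t in mgr.tokens])
```

The point of the exercise is that disjoint byte ranges never interact, so most lock traffic stays local; only genuinely conflicting accesses pay a coordination cost.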

1,434 citations

Book
01 Jun 1994
TL;DR: In this work, the author presents a new class of universal routing networks, called fat-trees, which might be used to interconnect the processors of a general-purpose parallel supercomputer, and proves that a fat-tree of a given size is nearly the best routing network of that size.
Abstract: The author presents a new class of universal routing networks, called fat-trees, which might be used to interconnect the processors of a general-purpose parallel supercomputer. A fat-tree routing network is parameterized not only in the number of processors, but also in the amount of simultaneous communication it can support. Since communication can be scaled independently from the number of processors, substantial hardware can be saved for such applications as finite-element analysis without resorting to a special-purpose architecture. It is proved that a fat-tree of a given size is nearly the best routing network of that size. This universality theorem is established using a three-dimensional VLSI model that incorporates wiring as a direct cost. In this model, hardware size is measured as physical volume. It is proved that for any given amount of communications hardware, a fat-tree built from that amount of hardware can simulate every other network built from the same amount of hardware, using only slightly more time (a polylogarithmic factor greater).
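To make the "fat" in fat-tree concrete, the sketch below uses the common textbook simplification of a binary fat-tree whose channel capacity doubles at every level toward the root, so the aggregate capacity per level stays constant; the paper itself parameterizes channel capacities more generally.

```python
# Link capacities of a simplified binary fat-tree over n processors:
# half as many channels at each level toward the root, each with twice the
# capacity, so total capacity per level is constant (bisection bandwidth
# does not shrink as the machine grows). Textbook simplification only.
import math

def fat_tree_levels(num_processors, leaf_capacity=1):
    edge_levels = int(math.log2(num_processors))
    for level in range(edge_levels):             # level 0 = links at the leaves
        channels = num_processors >> level
        capacity = leaf_capacity << level
        print(f"level {level}: {channels:3d} channels x capacity {capacity:3d} "
              f"(level total {channels * capacity})")

fat_tree_levels(16)
```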

1,227 citations


Network Information
Related Topics (5)
Scalability: 50.9K papers, 931.6K citations (89% related)
Cache: 59.1K papers, 976.6K citations (85% related)
Cloud computing: 156.4K papers, 1.9M citations (85% related)
Server: 79.5K papers, 1.4M citations (84% related)
Virtual machine: 43.9K papers, 718.3K citations (82% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    304
2022    708
2021    328
2020    406
2019    459
2018    545