Other affiliations: University of California, Davis, Advanced Micro Devices, Washington State University
Bio: Mark Oskin is an academic researcher from University of Washington. The author has contributed to research in topics: Quantum information & Nearest neighbor search. The author has an hindex of 27, co-authored 95 publications receiving 3376 citations. Previous affiliations of Mark Oskin include University of California, Davis & Advanced Micro Devices.
Papers published on a yearly basis
••16 Apr 1998
TL;DR: This work describes an implementation of Active Pages on RADram (Reconfigurable Architecture DRAM), a memory system based upon the integration of DRAM and reconfigurable logic and explores the sensitivity of the results to implementations in other memory technologies.
Abstract: Microprocessors and memory systems suffer from a growing gap in performance. We introduce Active Pages, a computation model which addresses this gap by shifting data-intensive computations to the memory system. An Active Page consists of a page of data and a set of associated functions which can operate upon that data. We describe an implementation of Active Pages on RADram (Reconfigurable Architecture DRAM), a memory system based upon the integration of DRAM and reconfigurable logic. Results from the SimpleScalar simulator [BA97] demonstrate up to 1000X speedups on several applications using the RADram system versus conventional memory systems. We also explore the sensitivity of our results to implementations in other memory technologies.
••07 Mar 2009
TL;DR: The case for fully deterministic shared memory multiprocessing is made and it is shown that determinism can be provided with little performance cost using the architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.
Abstract: Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits the ability to properly test multithreaded code, becoming a major stumbling block to the much-needed widespread adoption of parallel programming.In this paper we make the case for fully deterministic shared memory multiprocessing (DMP). The behavior of an arbitrary multithreaded program on a DMP system is only a function of its inputs. The core idea is to make inter-thread communication fully deterministic. Previous approaches to coping with nondeterminism in multithreaded programs have focused on replay, a technique useful only for debugging. In contrast, while DMP systems are directly useful for debugging by offering repeatability by default, we argue that parallel programs should execute deterministically in the field as well. This has the potential to make testing more assuring and increase the reliability of deployed multithreaded software. We propose a range of approaches to enforcing determinism and discuss their implementation trade-offs. We show that determinism can be provided with little performance cost using our architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.
TL;DR: The RAMP project's goal is to enable the intensive, multidisciplinary innovation that the computing industry will need to tackle the problems of parallel processing.
Abstract: The RAMP project's goal is to enable the intensive, multidisciplinary innovation that the computing industry will need to tackle the problems of parallel processing. RAMP itself is an open-source, community-developed, FPGA-based emulator of parallel architectures. its design framework lets a large, collaborative community develop and contribute reusable, composable design modules. three complete designs - for transactional memory, distributed systems, and distributed-shared memory - demonstrate the platform's potential.
••18 Nov 2002
TL;DR: A programming framework is developed and applied to a variety of problems, including matrix multiplication and 3-SAT, and it is believed that efficient use of graphics hardware will become increasingly important to high-performance computing on commodity hardware.
Abstract: Recently, graphics hardware architectures have begun to emphasize versatility, offering rich new ways to programmatically reconfigure the graphics pipeline. In this paper, we explore whether current graphics architectures can be applied to problems where general-purpose vector processors might traditionally be used. We develop a programming framework and apply it to a variety of problems, including matrix multiplication and 3-SAT. Comparing the speed of our graphics card implementations to standard CPU implementations, we demonstrate startling performance improvements in many cases, as well as room for improvement in others. We analyze the bottlenecks and propose minor extensions to current graphics architectures which would improve their effectiveness for solving general-purpose problems. Based on our results and current trends in microarchitecture, we believe that efficient use of graphics hardware will become increasingly important to high-performance computing on commodity hardware.
•08 Jul 2015
TL;DR: Grappa enables users to program a cluster as if it were a single, large, non-uniform memory access (NUMA) machine, and addresses deficiencies of previous DSM systems by exploiting application parallelism, trading off latency for throughput.
Abstract: We present Grappa, a modern take on software distributed shared memory (DSM) for in-memory data-intensive applications Grappa enables users to program a cluster as if it were a single, large, non-uniform memory access (NUMA) machine Performance scales up even for applications that have poor locality and input-dependent load distribution Grappa addresses deficiencies of previous DSM systems by exploiting application parallelism, trading off latency for throughput We evaluate Grappa with an in-memory MapReduce framework (10× faster than Spark ); a vertex-centric framework inspired by GraphLab (133× faster than native GraphLab ); and a relational query execution engine (125× faster than Shark ) All these frameworks required only 60-690 lines of Grappa code
TL;DR: There is, I think, something ethereal about i —the square root of minus one, which seems an odd beast at that time—an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …
TL;DR: In this paper, the authors proposed a method for quantum interconnects, which convert quantum states from one physical system to those of another in a reversible manner, allowing the distribution of entanglement across the network and teleportation of quantum states between nodes.
Abstract: Quantum networks provide opportunities and challenges across a range of intellectual and technical frontiers, including quantum computation, communication and metrology. The realization of quantum networks composed of many nodes and channels requires new scientific capabilities for generating and characterizing quantum coherence and entanglement. Fundamental to this endeavour are quantum interconnects, which convert quantum states from one physical system to those of another in a reversible manner. Such quantum connectivity in networks can be achieved by the optical interactions of single photons and atoms, allowing the distribution of entanglement across the network and the teleportation of quantum states between nodes.
18 Dec 2006
TL;DR: The parallel landscape is frame with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS each development dollar.
Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. • Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) • “Autotuners” should play a larger role than conventional compilers in translating parallel programs. • To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. • To be successful, programming models should be independent of the number of processors. • To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. 1 The Landscape of Parallel Computing Research: A View From Berkeley • Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. • Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. • To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.
TL;DR: This report describes, summarize, and analyzes the latest research in mapping general‐purpose computation to graphics hardware.
Abstract: The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware. We begin with the technical motivations that underlie general-purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general-purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general-purpose application development on graphics hardware. This survey should be of particular interest to researchers who are interested in using the latest GPGPU applications in their systems of interest.
••01 Oct 2002
TL;DR: This work quantifies the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explores the large scale behavior of several programs, and develops a set of algorithms based on clustering capable of analyzing this behavior.
Abstract: Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ramifications for many architectural and compiler techniques, from thread scheduling, to feedback directed optimizations, to the way programs are simulated. However, in order to take advantage of time-varying behavior, we must first develop the analytical tools necessary to automatically and efficiently analyze program behavior over large sections of execution.Our goal is to develop automatic techniques that are capable of finding and exploiting the Large Scale Behavior of programs (behavior seen over billions of instructions). The first step towards this goal is the development of a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program. To this end we examine the use of Basic Block Vectors. We quantify the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explore the large scale behavior of several programs, and develop a set of algorithms based on clustering capable of analyzing this behavior. We then demonstrate an application of this technology to automatically determine where to simulate for a program to help guide computer architecture research.