Author

Sameh Galal

Other affiliations: Nokia
Bio: Sameh Galal is an academic researcher from Stanford University. The author has contributed to research in topics including modular design and floating-point units. The author has an h-index of 9, having co-authored 13 publications receiving 818 citations. Previous affiliations of Sameh Galal include Nokia.

Papers
Journal Article
TL;DR: This work presents a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints, and finds that in a 90 nm CMOS technology at 1 W/mm2, one can achieve a performance of 27 GFlops/mm2 single precision and 7.5 GFlops/mm2 double precision.
Abstract: Energy-efficient computation is critical if we are going to continue to scale performance in power-limited systems. For floating-point applications that have large amounts of data parallelism, one should optimize the throughput/mm2 given a power density constraint. We present a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints. Looking at FP multiply-add units and ignoring register and memory overheads, we find that in a 90 nm CMOS technology at 1 W/mm2, one can achieve a performance of 27 GFlops/mm2 single precision and 7.5 GFlops/mm2 double precision. Adding register file overheads reduces the throughput by less than 50 percent if the compute intensity is high. Since the energy of the basic gates is no longer scaling rapidly, maintaining constant power density with scaling requires moving the overall FP architecture to a lower energy/performance point. A 1 W/mm2 design at 90 nm is a "high-energy" design, so scaling it to a lower-energy design in 45 nm still yields a 7× performance gain, while a more balanced 0.1 W/mm2 design only speeds up by 3.5× when scaled to 45 nm. Performance scaling below 45 nm rapidly decreases, with a projected improvement of only ~3× for both power densities when scaling to a 22 nm technology.
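The selection that such a trade-off curve enables, namely picking the highest-throughput FP multiply-add design whose power density stays within a budget, can be sketched as a short Python example. The design points, energies, and helper names below are illustrative assumptions, not data or code from the paper.

```python
# Illustrative sketch of the area/power trade-off described above: given a set of
# FP multiply-add design points on an energy/performance Pareto curve, pick the
# one that maximizes throughput per mm^2 without exceeding a power-density budget.
# The design points below are made-up placeholders, not numbers from the paper.
from dataclasses import dataclass

@dataclass
class FpuDesign:
    name: str
    gflops: float              # throughput of one unit, GFlops
    area_mm2: float            # area of one unit, mm^2
    energy_pj_per_flop: float  # energy per flop, pJ

    @property
    def gflops_per_mm2(self) -> float:
        return self.gflops / self.area_mm2

    @property
    def power_density_w_per_mm2(self) -> float:
        # GFlops/mm^2 * pJ/flop = 1e9 flop/s/mm^2 * 1e-12 J/flop = 1e-3 W/mm^2
        return self.gflops_per_mm2 * self.energy_pj_per_flop * 1e-3

def best_design_under_power_density(designs, budget_w_per_mm2):
    """Return the feasible design with the highest GFlops/mm^2, or None."""
    feasible = [d for d in designs if d.power_density_w_per_mm2 <= budget_w_per_mm2]
    return max(feasible, key=lambda d: d.gflops_per_mm2, default=None)

if __name__ == "__main__":
    candidates = [
        FpuDesign("deep-pipeline", gflops=8.0, area_mm2=0.20, energy_pj_per_flop=30.0),
        FpuDesign("balanced",      gflops=5.0, area_mm2=0.15, energy_pj_per_flop=15.0),
        FpuDesign("low-voltage",   gflops=2.5, area_mm2=0.12, energy_pj_per_flop=4.0),
    ]
    for budget in (1.0, 0.1):  # the two power densities (W/mm^2) discussed above
        best = best_design_under_power_density(candidates, budget)
        if best is None:
            print(f"{budget} W/mm^2: no feasible design")
        else:
            print(f"{budget} W/mm^2: {best.name}, {best.gflops_per_mm2:.1f} GFlops/mm^2")
```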

133 citations

Patent
01 Oct 2004
TL;DR: In this patent, a token describing information feed data is formed and received at a first data processing arrangement via the network, where it is processed to determine the feed data and to provide access to it.
Abstract: Sharing information feed data via a network involves forming a token describing the information feed data. The token is received at a first data processing arrangement via the network. The token is processed at the first data processing arrangement to determine the information feed data. Access to the information feed data is provided at the first data processing arrangement based on processing of the token.
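The flow in the abstract (form a token describing a feed, receive it at a device, decode it, and grant access) can be sketched in a few lines of Python. The base64/JSON token format and the function names are assumptions made for illustration; the patent does not specify an encoding.

```python
# Minimal sketch of the token flow described above, under assumed details:
# the token is simply a base64-encoded JSON description of the feed (the patent
# does not specify a format), and "providing access" means returning a feed URL.
import base64
import json

def form_token(feed_url: str, title: str) -> str:
    """Form a token describing the information feed data."""
    description = {"url": feed_url, "title": title}
    return base64.urlsafe_b64encode(json.dumps(description).encode()).decode()

def process_token(token: str) -> dict:
    """Process a received token to determine the information feed data."""
    return json.loads(base64.urlsafe_b64decode(token.encode()).decode())

def provide_access(token: str) -> str:
    """Provide access to the feed at the receiving arrangement based on the token."""
    feed = process_token(token)
    return feed["url"]  # e.g. hand this to the device's feed reader

if __name__ == "__main__":
    t = form_token("https://example.com/news.rss", "Example News")
    print(provide_access(t))  # https://example.com/news.rss
```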

123 citations

Journal Article
TL;DR: This article discusses the dark memory state, presents Pareto curves for compute units, accelerators, and on-chip memory, and motivates the need for HW/SW codesign for parallelism and locality.
Abstract: Unlike traditional dark silicon works that attack the computing logic, this article puts a focus on the memory part, which dissipates most of the energy for memory-bound CPU applications. This article discusses the dark memory state and present Pareto curves for compute units, accelerators, and on-chip memory, and motivates the need for HW/SW codesign for parallelism and locality. –Muhammad Shafique, Vienna University of Technology
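The memory-bound argument behind dark memory can be made concrete with a back-of-the-envelope roofline check: when a kernel's operational intensity (flops per byte moved) falls below the machine's compute-to-bandwidth ratio, the memory system rather than the compute units sets performance, which is what makes locality a first-class codesign target. The machine and kernel numbers in the sketch below are illustrative placeholders, not figures from the article.

```python
# Back-of-the-envelope roofline check: is a kernel compute-bound or memory-bound?
# Machine and kernel numbers are illustrative placeholders only.
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      intensity_flops_per_byte: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flops_per_byte)

peak, bw = 500.0, 50.0  # GFlops and GB/s for a placeholder machine
for name, intensity in [("stream-like", 0.25), ("stencil", 2.0), ("blocked GEMM", 32.0)]:
    perf = attainable_gflops(peak, bw, intensity)
    bound = "memory-bound" if perf < peak else "compute-bound"
    print(f"{name:12s} {intensity:5.2f} flops/byte -> {perf:6.1f} GFlops ({bound})")
```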

121 citations

Patent
29 Sep 2006
TL;DR: In this patent, a method and a mobile terminal executing the method are presented for browsing available information feeds on a limited display area via sequential views, with item descriptions inspected one at a time through swift, 1-click type actions.
Abstract: A method and a mobile terminal executing the method for browsing available information feeds on a limited display area via sequential views. Items of a certain feed are first listed by utilizing representative identifiers. The user of the terminal device may through swift, 1-click type actions then inspect the descriptions of preferred items one at a time before selecting the item to be fully accessed.
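The browsing flow the patent describes (list items by representative identifiers, step through item descriptions one at a time with 1-click actions, then fully access the selected item) can be sketched as a tiny Python class. The item fields and method names are assumptions for illustration, not the patent's terminology.

```python
# Illustrative sketch of the sequential-view browsing flow described above.
# Item structure and method names are assumptions for illustration only.
class FeedBrowser:
    def __init__(self, items):
        # items: list of dicts with "id", "description", and "content" keys (assumed)
        self.items = items
        self.index = 0

    def list_identifiers(self):
        """First view: list items by their representative identifiers."""
        return [item["id"] for item in self.items]

    def next_description(self):
        """1-click action: show the next item's description, one at a time."""
        item = self.items[self.index]
        self.index = (self.index + 1) % len(self.items)
        return item["description"]

    def open_current(self):
        """Fully access the most recently inspected item."""
        return self.items[(self.index - 1) % len(self.items)]["content"]

browser = FeedBrowser([
    {"id": "news-1", "description": "Headline A", "content": "Full story A"},
    {"id": "news-2", "description": "Headline B", "content": "Full story B"},
])
print(browser.list_identifiers())   # ['news-1', 'news-2']
print(browser.next_description())   # Headline A
print(browser.open_current())       # Full story A
```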

109 citations


Cited by
Patent
01 Feb 2006
TL;DR: In this article, the authors describe hardware, software and electronic service components and systems to provide large-scale, reliable and secure foundations for distributed databases and content management systems, combining unstructured and structured data, and allowing post-input reorganization to achieve a high degree of flexibility.
Abstract: The invention relates to hardware, software and electronic service components and systems to provide large-scale, reliable, and secure foundations for distributed databases and content management systems, combining unstructured and structured data, and allowing post-input reorganization to achieve a high degree of flexibility.

659 citations

Journal Article
Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, D. Glasco
TL;DR: The capabilities of state-of-the art GPU-based high-throughput computing systems are discussed and the challenges to scaling single-chip parallel-computing systems are considered, highlighting high-impact areas that the computing research community can address.
Abstract: This article discusses the capabilities of state-of-the art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. Nvidia Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.

626 citations

Patent
01 Feb 2006
TL;DR: In this patent, systems and methods are disclosed including hardware, software, and electronic service components and systems to provide large-scale, reliable, and secure foundations for distributed databases and content management systems, combining unstructured and structured data and allowing post-input reorganization to achieve a high degree of flexibility.
Abstract: Disclosed herein are systems and methods including hardware, software and electronic service components and systems to provide large-scale, reliable, and secure foundations for distributed databases and content management systems combining unstructured and structured data, and allowing post-input reorganization to achieve a high degree of flexibility.

576 citations

Proceedings Article
01 Dec 2012
TL;DR: A programming model is defined that allows programmers to identify approximable code regions -- code that can produce imprecise but acceptable results -- and offloading such regions to a neural processing unit is shown to be faster and more energy efficient than executing the original code.
Abstract: This paper describes a learning-based approach to the acceleration of approximate programs. We describe the Parrot transformation, a program transformation that selects and trains a neural network to mimic a region of imperative code. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a neural processing unit (NPU). The NPU is tightly coupled to the processor pipeline to accelerate small code regions. Since neural networks produce inherently approximate results, we define a programming model that allows programmers to identify approximable code regions -- code that can produce imprecise but acceptable results. Offloading approximable code regions to NPUs is faster and more energy efficient than executing the original code. For a set of diverse applications, NPU acceleration provides whole-application speedup of 2.3x and energy savings of 3.0x on average with quality loss of at most 9.6%.
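The core mechanism, recording a code region's input/output behavior, training a cheap surrogate to mimic it, and answering later calls from the surrogate, can be sketched in a few lines. The polynomial surrogate and class below are a simplified stand-in for illustration only; the paper's actual approach trains a neural network and offloads it to a hardware NPU.

```python
# Simplified stand-in for the Parrot-style transformation described above:
# sample an "approximable" region, fit a cheap surrogate to mimic it, and
# answer later calls from the surrogate. A real NPU uses a trained neural
# network and hardware offload; a numpy polynomial fit keeps this sketch small.
import numpy as np

def exact_kernel(x: float) -> float:
    """The precise (expensive, in the real setting) region of code."""
    return np.sin(x) * np.exp(-0.1 * x)

class ApproximableRegion:
    def __init__(self, fn, degree=7):
        self.fn, self.degree, self.coeffs = fn, degree, None

    def train(self, lo, hi, n=512):
        """Learning phase: observe input/output pairs and fit a surrogate."""
        xs = np.linspace(lo, hi, n)
        ys = np.array([self.fn(x) for x in xs])
        self.coeffs = np.polyfit(xs, ys, self.degree)

    def __call__(self, x):
        """After training, answer calls from the surrogate instead of the code."""
        if self.coeffs is None:
            return self.fn(x)  # fall back to precise execution
        return np.polyval(self.coeffs, x)

region = ApproximableRegion(exact_kernel)
region.train(0.0, 5.0)
x = 2.3
print(exact_kernel(x), region(x))  # imprecise but acceptable result
```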

532 citations

Journal Article
06 Jan 2021-Nature
TL;DR: In this paper, the authors demonstrate a computationally specific integrated photonic hardware accelerator (tensor core) that is capable of operating at speeds of trillions of multiply-accumulate operations per second.
Abstract: With the proliferation of ultrahigh-speed mobile networks and internet-connected devices, along with the rise of artificial intelligence (AI), the world is generating exponentially increasing amounts of data that need to be processed in a fast and efficient way. Highly parallelized, fast and scalable hardware is therefore becoming progressively more important. Here we demonstrate a computationally specific integrated photonic hardware accelerator (tensor core) that is capable of operating at speeds of trillions of multiply-accumulate operations per second (10^12 MAC operations per second, or tera-MACs per second). The tensor core can be considered as the optical analogue of an application-specific integrated circuit (ASIC). It achieves parallelized photonic in-memory computing using phase-change-material memory arrays and photonic chip-based optical frequency combs (soliton microcombs). The computation is reduced to measuring the optical transmission of reconfigurable and non-resonant passive components and can operate at a bandwidth exceeding 14 gigahertz, limited only by the speed of the modulators and photodetectors. Given recent advances in hybrid integration of soliton microcombs at microwave line rates, ultralow-loss silicon nitride waveguides, and high-speed on-chip detectors and modulators, our approach provides a path towards full complementary metal-oxide-semiconductor (CMOS) wafer-scale integration of the photonic tensor core. Although we focus on convolutional processing, more generally our results indicate the potential of integrated photonics for parallel, fast, and efficient computational hardware in data-heavy AI applications such as autonomous driving, live video processing, and next-generation cloud computing services.

478 citations