Author

Sameh Galal

Other affiliations: Nokia
Bio: Sameh Galal is an academic researcher from Stanford University. The author has contributed to research in topics including modular design and floating-point units. The author has an h-index of 9, having co-authored 13 publications receiving 818 citations. Previous affiliations of Sameh Galal include Nokia.

Papers
Journal Article
TL;DR: This work presents a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints, and finds that in a 90 nm CMOS technology at 1 W/mm2, one can achieve a performance of 27 GFlops/mm2 single precision and 7.5 GFlops/mm2 double precision.
Abstract: Energy-efficient computation is critical if we are going to continue to scale performance in power-limited systems. For floating-point applications that have large amounts of data parallelism, one should optimize the throughput/mm2 given a power density constraint. We present a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints. Looking at FP multiply-add units and ignoring register and memory overheads, we find that in a 90 nm CMOS technology at 1 W/mm2, one can achieve a performance of 27 GFlops/mm2 single precision and 7.5 GFlops/mm2 double precision. Adding register file overheads reduces the throughput by less than 50 percent if the compute intensity is high. Since the energy of the basic gates is no longer scaling rapidly, maintaining constant power density with scaling requires moving the overall FP architecture to a lower energy/performance point. A 1 W/mm2 design at 90 nm is a "high-energy" design, so scaling it to a lower-energy design in 45 nm still yields a 7× performance gain, while a more balanced 0.1 W/mm2 design only speeds up by 3.5× when scaled to 45 nm. Performance scaling below 45 nm rapidly decreases, with a projected improvement of only ~3× for both power densities when scaling to a 22 nm technology.
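The selection that such a trade-off curve enables, namely picking the highest-throughput FP multiply-add design whose power density stays within a budget, can be sketched as a short Python example. The design points, energies, and helper names below are illustrative assumptions, not data or code from the paper.

```python
# Illustrative sketch of the area/power trade-off described above: given a set of
# FP multiply-add design points on an energy/performance Pareto curve, pick the
# one that maximizes throughput per mm^2 without exceeding a power-density budget.
# The design points below are made-up placeholders, not numbers from the paper.
from dataclasses import dataclass

@dataclass
class FpuDesign:
    name: str
    gflops: float              # throughput of one unit, GFlops
    area_mm2: float            # area of one unit, mm^2
    energy_pj_per_flop: float  # energy per flop, pJ

    @property
    def gflops_per_mm2(self) -> float:
        return self.gflops / self.area_mm2

    @property
    def power_density_w_per_mm2(self) -> float:
        # GFlops/mm^2 * pJ/flop = 1e9 flop/s/mm^2 * 1e-12 J/flop = 1e-3 W/mm^2
        return self.gflops_per_mm2 * self.energy_pj_per_flop * 1e-3

def best_design_under_power_density(designs, budget_w_per_mm2):
    """Return the feasible design with the highest GFlops/mm^2, or None."""
    feasible = [d for d in designs if d.power_density_w_per_mm2 <= budget_w_per_mm2]
    return max(feasible, key=lambda d: d.gflops_per_mm2, default=None)

if __name__ == "__main__":
    candidates = [
        FpuDesign("deep-pipeline", gflops=8.0, area_mm2=0.20, energy_pj_per_flop=30.0),
        FpuDesign("balanced",      gflops=5.0, area_mm2=0.15, energy_pj_per_flop=15.0),
        FpuDesign("low-voltage",   gflops=2.5, area_mm2=0.12, energy_pj_per_flop=4.0),
    ]
    for budget in (1.0, 0.1):  # the two power densities (W/mm^2) discussed above
        best = best_design_under_power_density(candidates, budget)
        if best is None:
            print(f"{budget} W/mm^2: no feasible design")
        else:
            print(f"{budget} W/mm^2: {best.name}, {best.gflops_per_mm2:.1f} GFlops/mm^2")
```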

133 citations

Patent
01 Oct 2004
TL;DR: In this patent, a token describing information feed data is formed and received at a first data processing arrangement via the network, where it is processed to determine the feed data and to provide access to it.
Abstract: Sharing information feed data via a network involves forming a token describing the information feed data. The token is received at a first data processing arrangement via the network. The token is processed at the first data processing arrangement to determine the information feed data. Access to the information feed data is provided at the first data processing arrangement based on processing of the token.
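The flow in the abstract (form a token describing a feed, receive it at a device, decode it, and grant access) can be sketched in a few lines of Python. The base64/JSON token format and the function names are assumptions made for illustration; the patent does not specify an encoding.

```python
# Minimal sketch of the token flow described above, under assumed details:
# the token is simply a base64-encoded JSON description of the feed (the patent
# does not specify a format), and "providing access" means returning a feed URL.
import base64
import json

def form_token(feed_url: str, title: str) -> str:
    """Form a token describing the information feed data."""
    description = {"url": feed_url, "title": title}
    return base64.urlsafe_b64encode(json.dumps(description).encode()).decode()

def process_token(token: str) -> dict:
    """Process a received token to determine the information feed data."""
    return json.loads(base64.urlsafe_b64decode(token.encode()).decode())

def provide_access(token: str) -> str:
    """Provide access to the feed at the receiving arrangement based on the token."""
    feed = process_token(token)
    return feed["url"]  # e.g. hand this to the device's feed reader

if __name__ == "__main__":
    t = form_token("https://example.com/news.rss", "Example News")
    print(provide_access(t))  # https://example.com/news.rss
```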

123 citations

Journal Article
TL;DR: This article discusses the dark memory state, presents Pareto curves for compute units, accelerators, and on-chip memory, and motivates the need for HW/SW codesign for parallelism and locality.
Abstract: Unlike traditional dark silicon works that attack the computing logic, this article puts a focus on the memory part, which dissipates most of the energy for memory-bound CPU applications. This article discusses the dark memory state and present Pareto curves for compute units, accelerators, and on-chip memory, and motivates the need for HW/SW codesign for parallelism and locality. –Muhammad Shafique, Vienna University of Technology
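The memory-bound argument behind dark memory can be made concrete with a back-of-the-envelope roofline check: when a kernel's operational intensity (flops per byte moved) falls below the machine's compute-to-bandwidth ratio, the memory system rather than the compute units sets performance, which is what makes locality a first-class codesign target. The machine and kernel numbers in the sketch below are illustrative placeholders, not figures from the article.

```python
# Back-of-the-envelope roofline check: is a kernel compute-bound or memory-bound?
# Machine and kernel numbers are illustrative placeholders only.
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      intensity_flops_per_byte: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flops_per_byte)

peak, bw = 500.0, 50.0  # GFlops and GB/s for a placeholder machine
for name, intensity in [("stream-like", 0.25), ("stencil", 2.0), ("blocked GEMM", 32.0)]:
    perf = attainable_gflops(peak, bw, intensity)
    bound = "memory-bound" if perf < peak else "compute-bound"
    print(f"{name:12s} {intensity:5.2f} flops/byte -> {perf:6.1f} GFlops ({bound})")
```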

121 citations

Patent
29 Sep 2006
TL;DR: In this patent, a method and a mobile terminal executing the method are presented for browsing available information feeds on a limited display area via sequential views, with item descriptions inspected one at a time through swift, 1-click type actions.
Abstract: A method and a mobile terminal executing the method for browsing available information feeds on a limited display area via sequential views. Items of a certain feed are first listed by utilizing representative identifiers. The user of the terminal device may through swift, 1-click type actions then inspect the descriptions of preferred items one at a time before selecting the item to be fully accessed.
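The browsing flow the patent describes (list items by representative identifiers, step through item descriptions one at a time with 1-click actions, then fully access the selected item) can be sketched as a tiny Python class. The item fields and method names are assumptions for illustration, not the patent's terminology.

```python
# Illustrative sketch of the sequential-view browsing flow described above.
# Item structure and method names are assumptions for illustration only.
class FeedBrowser:
    def __init__(self, items):
        # items: list of dicts with "id", "description", and "content" keys (assumed)
        self.items = items
        self.index = 0

    def list_identifiers(self):
        """First view: list items by their representative identifiers."""
        return [item["id"] for item in self.items]

    def next_description(self):
        """1-click action: show the next item's description, one at a time."""
        item = self.items[self.index]
        self.index = (self.index + 1) % len(self.items)
        return item["description"]

    def open_current(self):
        """Fully access the most recently inspected item."""
        return self.items[(self.index - 1) % len(self.items)]["content"]

browser = FeedBrowser([
    {"id": "news-1", "description": "Headline A", "content": "Full story A"},
    {"id": "news-2", "description": "Headline B", "content": "Full story B"},
])
print(browser.list_identifiers())   # ['news-1', 'news-2']
print(browser.next_description())   # Headline A
print(browser.open_current())       # Full story A
```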

109 citations


Cited by
Patent
01 Feb 2006
TL;DR: In this article, the authors describe hardware, software and electronic service components and systems to provide large-scale, reliable and secure foundations for distributed databases and content management systems, combining unstructured and structured data, and allowing post-input reorganization to achieve a high degree of flexibility.
Abstract: The invention relates to hardware, software and electronic service components and systems to provide large-scale, reliable, and secure foundations for distributed databases and content management systems, combining unstructured and structured data, and allowing post-input reorganization to achieve a high degree of flexibility.

659 citations

Journal Article
Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, D. Glasco
TL;DR: The capabilities of state-of-the art GPU-based high-throughput computing systems are discussed and the challenges to scaling single-chip parallel-computing systems are considered, highlighting high-impact areas that the computing research community can address.
Abstract: This article discusses the capabilities of state-of-the art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. Nvidia Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.

626 citations

Patent
01 Feb 2006
TL;DR: In this patent, systems and methods are disclosed including hardware, software, and electronic service components and systems to provide large-scale, reliable, and secure foundations for distributed databases and content management systems, combining unstructured and structured data and allowing post-input reorganization to achieve a high degree of flexibility.
Abstract: Disclosed herein are systems and methods including hardware, software and electronic service components and systems to provide large-scale, reliable, and secure foundations for distributed databases and content management systems combining unstructured and structured data, and allowing post-input reorganization to achieve a high degree of flexibility.

576 citations

Proceedings Article
01 Dec 2012
TL;DR: A programming model is defined that allows programmers to identify approximable code regions -- code that can produce imprecise but acceptable results -- and offloading such regions to a neural processing unit is shown to be faster and more energy efficient than executing the original code.
Abstract: This paper describes a learning-based approach to the acceleration of approximate programs. We describe the Parrot transformation, a program transformation that selects and trains a neural network to mimic a region of imperative code. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a neural processing unit (NPU). The NPU is tightly coupled to the processor pipeline to accelerate small code regions. Since neural networks produce inherently approximate results, we define a programming model that allows programmers to identify approximable code regions -- code that can produce imprecise but acceptable results. Offloading approximable code regions to NPUs is faster and more energy efficient than executing the original code. For a set of diverse applications, NPU acceleration provides whole-application speedup of 2.3x and energy savings of 3.0x on average with quality loss of at most 9.6%.
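The core mechanism, recording a code region's input/output behavior, training a cheap surrogate to mimic it, and answering later calls from the surrogate, can be sketched in a few lines. The polynomial surrogate and class below are a simplified stand-in for illustration only; the paper's actual approach trains a neural network and offloads it to a hardware NPU.

```python
# Simplified stand-in for the Parrot-style transformation described above:
# sample an "approximable" region, fit a cheap surrogate to mimic it, and
# answer later calls from the surrogate. A real NPU uses a trained neural
# network and hardware offload; a numpy polynomial fit keeps this sketch small.
import numpy as np

def exact_kernel(x: float) -> float:
    """The precise (expensive, in the real setting) region of code."""
    return np.sin(x) * np.exp(-0.1 * x)

class ApproximableRegion:
    def __init__(self, fn, degree=7):
        self.fn, self.degree, self.coeffs = fn, degree, None

    def train(self, lo, hi, n=512):
        """Learning phase: observe input/output pairs and fit a surrogate."""
        xs = np.linspace(lo, hi, n)
        ys = np.array([self.fn(x) for x in xs])
        self.coeffs = np.polyfit(xs, ys, self.degree)

    def __call__(self, x):
        """After training, answer calls from the surrogate instead of the code."""
        if self.coeffs is None:
            return self.fn(x)  # fall back to precise execution
        return np.polyval(self.coeffs, x)

region = ApproximableRegion(exact_kernel)
region.train(0.0, 5.0)
x = 2.3
print(exact_kernel(x), region(x))  # imprecise but acceptable result
```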

532 citations

Journal Article
06 Jan 2021-Nature
TL;DR: In this paper, the authors demonstrate a computationally specific integrated photonic hardware accelerator (tensor core) that is capable of operating at speeds of trillions of multiply-accumulate operations per second.
Abstract: With the proliferation of ultrahigh-speed mobile networks and internet-connected devices, along with the rise of artificial intelligence (AI), the world is generating exponentially increasing amounts of data that need to be processed in a fast and efficient way. Highly parallelized, fast and scalable hardware is therefore becoming progressively more important. Here we demonstrate a computationally specific integrated photonic hardware accelerator (tensor core) that is capable of operating at speeds of trillions of multiply-accumulate operations per second (10^12 MAC operations per second, or tera-MACs per second). The tensor core can be considered as the optical analogue of an application-specific integrated circuit (ASIC). It achieves parallelized photonic in-memory computing using phase-change-material memory arrays and photonic chip-based optical frequency combs (soliton microcombs). The computation is reduced to measuring the optical transmission of reconfigurable and non-resonant passive components and can operate at a bandwidth exceeding 14 gigahertz, limited only by the speed of the modulators and photodetectors. Given recent advances in hybrid integration of soliton microcombs at microwave line rates, ultralow-loss silicon nitride waveguides, and high-speed on-chip detectors and modulators, our approach provides a path towards full complementary metal-oxide-semiconductor (CMOS) wafer-scale integration of the photonic tensor core. Although we focus on convolutional processing, more generally our results indicate the potential of integrated photonics for parallel, fast, and efficient computational hardware in data-heavy AI applications such as autonomous driving, live video processing, and next-generation cloud computing services.

478 citations