Elmoustapha Ould-Ahmed-Vall
Researcher at Intel
Publications - 299
Citations - 1664
Elmoustapha Ould-Ahmed-Vall is an academic researcher from Intel. The author has contributed to research on the topics Operand and Opcode. The author has an h-index of 19 and has co-authored 299 publications receiving 1656 citations. Previous affiliations of Elmoustapha Ould-Ahmed-Vall include Georgia Institute of Technology & AMIT.
Papers
Patent
Compression in machine learning and deep learning processing
Ray Joydeep,Ben Ashbaugh,Surti Prasoonkumar,Pradeep Ramani,Rama Harihara,Jerin C. Justin,Jing Huang,Xiaoming Cui,Timothy B. Costa,Ting Gong,Elmoustapha Ould-Ahmed-Vall,Kumar Balasubramanian,Anil Thomas,Oguz H. Elibol,Jayaram Bobba,Guozhong Zhuang,Bhavani Subramanian,Gokce Keskin,Chandrasekaran Sakthivel,Rajesh Poornachandran +19 more
TL;DR: This patent describes an embodiment of an apparatus for compression of untyped data. The apparatus includes a graphical processing unit (GPU) with a data compression pipeline, and the pipeline includes a data port coupled with one or more shader cores.
Patent
Collapsing of multiple nested loops, methods and instructions
TL;DR: This patent presents a multi-dimensional loop counter update instruction, along with decode logic and execution logic. Methods to collapse loops using such instructions are also described and claimed.
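The loop-collapsing idea in the TL;DR above can be illustrated in software. The following is a minimal sketch, not the patented instruction: it replaces a loop nest with one flat loop and advances a multi-dimensional counter on every iteration, carrying from the innermost dimension outward. The names `update_counter` and `collapsed_loop` are hypothetical, and the odometer-style carry rule is an assumption about what such a counter update does.

```python
from itertools import product


def update_counter(counter, bounds):
    """Advance a multi-dimensional loop counter by one step, in place.

    Hypothetical software stand-in for a single counter-update instruction.
    Returns False once every dimension has wrapped, meaning the nest is done.
    """
    for dim in reversed(range(len(bounds))):  # innermost dimension first
        counter[dim] += 1
        if counter[dim] < bounds[dim]:
            return True
        counter[dim] = 0  # wrap and carry into the next outer dimension
    return False


def collapsed_loop(bounds):
    """Yield the index tuples of a loop nest using one flat loop."""
    if any(b <= 0 for b in bounds):
        return  # an empty dimension means the nest body never runs
    counter = [0] * len(bounds)
    while True:
        yield tuple(counter)
        if not update_counter(counter, bounds):
            break


# Compare the collapsed loop with the equivalent three-level nest.
bounds = (2, 3, 2)
visited = list(collapsed_loop(bounds))
nested = list(product(*(range(b) for b in bounds)))
```

Here `visited` and `nested` contain the same index tuples in the same order, so the single flat loop visits exactly the iterations of the original nest.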
Patent
Compute optimizations for neural networks
Kevin Nealis,Anbang Yao,Xiaoming Chen,Elmoustapha Ould-Ahmed-Vall,Sara S. Baghsorkhi,Eriko Nurvitadhi,Balaji Vembu,Nicolas C. Galoppo Von Borries,Rajkishore Barik,Tsung-Han Lin,Sinha Kamal +10 more
TL;DR: This patent describes a compute apparatus that decodes a single instruction into a decoded instruction specifying multiple operands, including an input value and a quantized weight value associated with a neural network. The apparatus includes an arithmetic logic unit with a barrel shifter, an adder, and an accumulator register.
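One way a barrel shifter, an adder, and an accumulator can combine an input value with a quantized weight is shift-and-accumulate arithmetic. The sketch below assumes, beyond what the TL;DR states, that each weight is quantized to a non-negative power of two and stored as its exponent, so that multiplying by the weight reduces to a shift. The names `shift_accumulate` and `dot_power_of_two` are hypothetical.

```python
def shift_accumulate(acc, input_value, weight_exponent):
    """Return acc + input_value * 2**weight_exponent using shift and add only.

    Assumption: the quantized weight is a power of two given by its exponent.
    A negative exponent becomes an arithmetic right shift, which rounds
    toward negative infinity.
    """
    if weight_exponent >= 0:
        shifted = input_value << weight_exponent
    else:
        shifted = input_value >> -weight_exponent
    return acc + shifted


def dot_power_of_two(inputs, weight_exponents):
    """Dot product of integer inputs with power-of-two weights."""
    acc = 0  # plays the role of the accumulator register
    for x, e in zip(inputs, weight_exponents):
        acc = shift_accumulate(acc, x, e)
    return acc


inputs = [3, 5, 8]
exponents = [2, 0, -1]  # weights 4, 1, and 0.5
result = dot_power_of_two(inputs, exponents)  # 3*4 + 5*1 + 8*0.5 = 21
```

Storing only the exponent keeps each weight to a few bits and avoids a multiplier, at the cost of restricting weights to powers of two.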
Patent
Compute optimizations for low precision machine learning operations
Elmoustapha Ould-Ahmed-Vall,Sara S. Baghsorkhi,Anbang Yao,Kevin Nealis,Xiaoming Chen,Koker Altug,Appu Abhishek R,John C. Weast,Mike B. MacPherson,Dukhwan Kim,Linda L. Hurd,Ben Ashbaugh,Barath Lakshmanan,Liwei Ma,Ray Joydeep,Ping T. Tang,Michael S. Strickland +16 more
TL;DR: This patent describes an accelerator module comprising a memory stack that includes multiple memory dies and a graphics processing unit (GPU) coupled with the memory stack via one or more memory controllers. The GPU includes a plurality of multiprocessors having a single instruction, multiple thread (SIMT) architecture, and at least one single instruction causes at least a portion of the GPU to perform a floating-point operation on input having differing precisions.
Patent
Systems, apparatuses, and methods for performing mask bit compression
TL;DR: This patent describes a single mask bit compression instruction that includes a source writemask register operand, a destination writemask operand, and an opcode. It does not, however, specify a single opcode.