
Showing papers by "Elmoustapha Ould-Ahmed-Vall published in 2020"


Patent•
21 Jul 2020
TL;DR: In this patent, a matrix compress instruction is decoded and executed to generate a compressed result according to a compress algorithm: the specified decompressed source matrix is compressed either by packing non-zero-valued elements together and storing each element's matrix position in a header, or by using fewer bits to represent one or more elements and using the header to identify which elements are represented by fewer bits; the compressed result is then stored to the specified compressed destination matrix.
Abstract: Disclosed embodiments relate to matrix compress/decompress instructions. In one example, a processor includes fetch circuitry to fetch a compress instruction having a format with fields to specify an opcode and locations of decompressed source and compressed destination matrices, decode circuitry to decode the fetched compress instructions, and execution circuitry, responsive to the decoded compress instruction, to: generate a compressed result according to a compress algorithm by compressing the specified decompressed source matrix by either packing non-zero-valued elements together and storing the matrix position of each non-zero-valued element in a header, or using fewer bits to represent one or more elements and using the header to identify matrix elements being represented by fewer bits; and store the compressed result to the specified compressed destination matrix.

8 citations
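
As a concrete illustration of the non-zero-packing variant described above, here is a minimal Python sketch; the header layout (a list of (row, col) positions) and all names are assumptions for illustration, not the patented encoding.

```python
def compress_pack_nonzero(matrix):
    """Pack non-zero elements together; record each element's (row, col)
    matrix position in a header, as the abstract describes."""
    header, packed = [], []
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value != 0:
                header.append((r, c))
                packed.append(value)
    return header, packed

def decompress_pack_nonzero(header, packed, rows, cols):
    """Inverse transform: scatter the packed elements back by position."""
    matrix = [[0] * cols for _ in range(rows)]
    for (r, c), value in zip(header, packed):
        matrix[r][c] = value
    return matrix

# A sparse 2x3 tile round-trips through the packed form.
hdr, data = compress_pack_nonzero([[0, 5, 0], [7, 0, 0]])
assert decompress_pack_nonzero(hdr, data, 2, 3) == [[0, 5, 0], [7, 0, 0]]
```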


Patent•
01 Apr 2020
TL;DR: In this patent, a source matrix is transformed into a row-interleaved destination matrix by interleaving the J elements of each J-element sub-column of the source matrix, in either row-major or column-major order, into a K-wide submatrix of the specified destination matrix.
Abstract: Disclosed embodiments relate to systems and methods for performing instructions to transform matrices into a row-interleaved format. In one example, a processor includes fetch and decode circuitry to fetch and decode an instruction having fields to specify an opcode and locations of source and destination matrices, wherein the opcode indicates that the processor is to transform the specified source matrix into the specified destination matrix having the row-interleaved format; and execution circuitry to respond to the decoded instruction by transforming the specified source matrix into the specified RowInt-formatted destination matrix by interleaving J elements of each J-element sub-column of the specified source matrix in either row-major or column-major order into a K-wide submatrix of the specified destination matrix, the K-wide submatrix having K columns and enough rows to hold the J elements.

1 citation
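
To make the re-layout concrete, here is a hedged Python sketch of the row-interleave for the row-major case; the parameter J and the loop structure are illustrative assumptions about how the transform described in the abstract could be rendered in software.

```python
def to_row_interleaved(src, j):
    """Interleave the J elements of each J-element sub-column of src
    contiguously into the rows of the destination."""
    m, n = len(src), len(src[0])
    assert m % j == 0, "M must be a multiple of J"
    dst = []
    for block in range(m // j):        # one destination row per J source rows
        row = []
        for col in range(n):           # each J-element sub-column...
            for k in range(j):         # ...lands contiguously in the row
                row.append(src[block * j + k][col])
        dst.append(row)
    return dst

# A 4x2 source with J=2 becomes a 2x4 row-interleaved destination.
assert to_row_interleaved([[1, 2], [3, 4], [5, 6], [7, 8]], 2) == \
    [[1, 3, 2, 4], [5, 7, 6, 8]]
```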


Patent•
17 Sep 2020
TL;DR: In this patent, the authors present a page fault management mechanism for predictive page fault handling, which comprises a processor to receive a virtual address that triggered a page fault for a compute process, check a virtual memory space for the virtual memory allocation of that process, and manage the page fault according to a first protocol when the faulting virtual address is the last page in the allocation, or a second protocol when it is not.
Abstract: Methods and apparatus relating to predictive page fault handling. In an example, an apparatus comprises a processor to receive a virtual address that triggered a page fault for a compute process, check a virtual memory space for a virtual memory allocation for the compute process that triggered the page fault and manage the page fault according to one of a first protocol in response to a determination that the virtual address that triggered the page fault is a last page in the virtual memory allocation for the compute process, or a second protocol in response to a determination that the virtual address that triggered the page fault is not a last page in the virtual memory allocation for the compute process. Other embodiments are also disclosed and claimed.

1 citation
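
The control flow reduces to a last-page test that selects between two handling protocols. A minimal Python sketch, assuming a 4 KiB page size and placeholder protocol bodies (both are illustrative, not details from the patent):

```python
PAGE_SIZE = 4096  # assumed page size

def handle_page_fault(fault_addr, alloc_base, alloc_size):
    """Dispatch a page fault per the abstract: first protocol if the
    faulting address falls in the last page of the allocation."""
    last_page = (alloc_base + alloc_size - 1) // PAGE_SIZE
    if fault_addr // PAGE_SIZE == last_page:
        return first_protocol(fault_addr)    # e.g. predictively extend
    return second_protocol(fault_addr)       # e.g. plain demand paging

def first_protocol(addr):                    # placeholder body
    return f"first protocol for {hex(addr)}"

def second_protocol(addr):                   # placeholder body
    return f"second protocol for {hex(addr)}"

# A fault in the second (last) page of a two-page allocation.
print(handle_page_fault(0x201FFF, 0x200000, 2 * PAGE_SIZE))
```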


Patent•
09 Jan 2020
TL;DR: In this patent, a scene generation plan is generated based on one or more of a digital representation of an N-dimensional space or at least one monitored processing factor, and a global scene generator generates a global scene common to the client devices based on that digital representation.
Abstract: Systems, apparatuses and methods may provide a way to monitor, by a process monitor, one or more processing factors of one or more client devices hosting one or more user sessions. More particularly, the systems, apparatuses and methods may provide a way to generate, responsively, a scene generation plan based on one or more of a digital representation of an N dimensional space or at least one of the one or more processing factors, and generate, by a global scene generator, a global scene common to the one or more client devices based on the digital representation of the space. The systems, apparatuses and methods may further provide for performing, by a local scene generator, at least a portion of the global illumination based on one or more of the scene generation plan, or application parameters.

1 citation


Patent•
23 Jan 2020
TL;DR: In this patent, the authors present an embodiment of an apparatus for compression of untyped data, including a graphics processing unit (GPU) with a data compression pipeline; the pipeline includes a data port coupled with one or more shader cores.
Abstract: Embodiments are generally directed to compression in machine learning and deep learning processing. An embodiment of an apparatus for compression of untyped data includes a graphical processing unit (GPU) including a data compression pipeline, the data compression pipeline including a data port coupled with one or more shader cores, wherein the data port is to allow transfer of untyped data without format conversion, and a 3D compression/decompression unit to provide for compression of untyped data to be stored to a memory subsystem and decompression of untyped data from the memory subsystem.

1 citation


Patent•
05 Nov 2020
TL;DR: In this patent, the render rate is varied across and/or up and down the display screen based on where the user is looking, in order to reduce power consumption and/or increase performance.
Abstract: In accordance with some embodiments, the render rate is varied across and/or up and down the display screen. This may be done based on where the user is looking in order to reduce power consumption and/or increase performance. Specifically the screen display is separated into regions, such as quadrants. Each of these regions is rendered at a rate determined by at least one of what the user is currently looking at, what the user has looked at in the past and/or what it is predicted that the user will look at next. Areas of less focus may be rendered at a lower rate, reducing power consumption in some embodiments.

1 citation
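
A hedged sketch of the per-region rate selection, with quadrants as the regions (as in the abstract); the rate values and the current-gaze-only heuristic are assumptions, since the abstract also allows past and predicted gaze to drive the decision:

```python
def quadrant_rates(gaze_x, gaze_y, width, height,
                   focus_hz=90, periphery_hz=30):
    """Full render rate for the quadrant under the user's gaze,
    a reduced rate for the others."""
    focus_q = (int(gaze_x >= width / 2), int(gaze_y >= height / 2))
    return {(qx, qy): focus_hz if (qx, qy) == focus_q else periphery_hz
            for qx in (0, 1) for qy in (0, 1)}

# Gaze in the upper-left quadrant of a 1920x1080 display.
print(quadrant_rates(300, 200, 1920, 1080))
```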


Patent•
02 Jul 2020
TL;DR: In this patent, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix.
Abstract: Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix, and to execute the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices, broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE further to store an NZ element for use in subsequent multiplications.

1 citation
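
Functionally, the instruction skips products where either factor is zero, using per-row and per-column non-zero bitmasks. A Python sketch of that semantics (the PE grid and two-at-a-time broadcast are elided; names are illustrative):

```python
def nz_bitmask(vec):
    """One bit per element, set where the element is non-zero."""
    mask = 0
    for i, v in enumerate(vec):
        if v != 0:
            mask |= 1 << i
    return mask

def sparse_mma(a, b, c):
    """C += A @ B, multiplying only matching non-zero element pairs."""
    for i, row in enumerate(a):
        row_mask = nz_bitmask(row)
        for j in range(len(b[0])):
            col = [b[k][j] for k in range(len(b))]
            both = row_mask & nz_bitmask(col)   # matching NZ positions
            for k in range(len(col)):
                if both & (1 << k):
                    c[i][j] += row[k] * col[k]
    return c

c = [[0, 0], [0, 0]]
assert sparse_mma([[0, 2], [3, 0]], [[1, 0], [0, 4]], c) == [[0, 8], [3, 0]]
```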


Patent•
24 Mar 2020
TL;DR: In this patent, a processor includes fetch and decode circuitry to fetch and decode an instruction specifying a ternary tile operation and locations of destination and first, second, and third source matrices, each of the matrices having M rows by N columns.
Abstract: Disclosed embodiments relate to systems and methods for performing instructions specifying ternary tile operations. In one example, a processor includes fetch and decode circuitry to fetch and decode an instruction specifying a ternary tile operation, and locations of destination and first, second, and third source matrices, each of the matrices having M rows by N columns; and execution circuitry to respond to the decoded instruction by, for each equal-sized group of K elements of the specified first, second, and third source matrices, generate K results by performing the ternary tile operation in parallel on K corresponding elements of the specified first, second, and third source matrices, and store each of the K results to a corresponding element of the specified destination matrix, wherein corresponding elements of the specified source and destination matrices occupy a same relative position within their associated matrix.
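
A minimal sketch of the elementwise semantics: the same three-input operation is applied at every position of the three equally-shaped sources, with each result landing in the same relative position of the destination. The concrete ternary function (a multiply-add here) is an assumption; the patent covers a class of ternary operations.

```python
def ternary_tile_op(src1, src2, src3, op=lambda a, b, c: a * b + c):
    """Apply `op` to corresponding elements of three same-shape matrices."""
    m, n = len(src1), len(src1[0])
    return [[op(src1[i][j], src2[i][j], src3[i][j]) for j in range(n)]
            for i in range(m)]

assert ternary_tile_op([[1, 2]], [[3, 4]], [[5, 6]]) == [[8, 14]]  # a*b+c
```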

Patent•
17 Sep 2020
TL;DR: In this patent, the authors present an embodiment of an apparatus that includes one or more processors including one or more graphics processing units (GPUs), and a plurality of caches, including at least an L1 cache and an L3 cache, to provide storage for the GPUs.
Abstract: Embodiments are generally directed to data prefetching for graphics data processing. An embodiment of an apparatus includes one or more processors including one or more graphics processing units (GPUs); and a plurality of caches to provide storage for the one or more GPUs, the plurality of caches including at least an L1 cache and an L3 cache, wherein the apparatus to provide intelligent prefetching of data by a prefetcher of a first GPU of the one or more GPUs including measuring a hit rate for the L1 cache; upon determining that the hit rate for the L1 cache is equal to or greater than a threshold value, limiting a prefetch of data to storage in the L3 cache, and upon determining that the hit rate for the L1 cache is less than a threshold value, allowing the prefetch of data to the L1 cache.
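
The gating logic is a simple threshold test on the measured L1 hit rate. A control-flow sketch in Python, with the threshold value and return convention as assumptions:

```python
L1_HIT_THRESHOLD = 0.90  # assumed threshold value

def prefetch_target(l1_hits, l1_accesses):
    """Decide where prefetched data lands, per the abstract."""
    hit_rate = l1_hits / l1_accesses if l1_accesses else 0.0
    if hit_rate >= L1_HIT_THRESHOLD:
        return "L3"   # L1 already effective: limit prefetch to L3
    return "L1"       # L1 struggling: allow prefetch into L1

assert prefetch_target(95, 100) == "L3"
assert prefetch_target(50, 100) == "L1"
```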

Patent•
24 Sep 2020
TL;DR: In this patent, a cache controller is configured to set an initial aging policy using an aging field based on the age of cache lines within the cache memory, and to determine whether a hint or an instruction indicating a level of aging has been received.
Abstract: Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache memory that is coupled to the processing resources. The cache controller is configured to set an initial aging policy using an aging field based on age of cache lines within the cache memory and to determine whether a hint or an instruction to indicate a level of aging has been received.

Patent•
30 Sep 2020
TL;DR: In this patent, a method and apparatus for performing reduction operations on a plurality of floating-point data element values are disclosed: decoding circuitry decodes an instruction, and execution circuitry executes it by converting the values of each operand into a plurality of lower-precision values (with an exponent stored per operand), performing arithmetic operations among the lower-precision values, and generating a floating-point result by converting the resulting value back into the floating-point format.
Abstract: A method and apparatus for performing reduction operations on a plurality of data element values are disclosed. Embodiments detailed herein relate to arithmetic operations on floating-point values. An exemplary processor includes decoding circuitry to decode an instruction, where the instruction specifies locations of a plurality of operands, the values of which are in a floating-point format. The exemplary processor further includes execution circuitry to execute the decoded instruction, where the execution includes to: convert the values for each operand, each value being converted into a plurality of lower precision values, where an exponent is to be stored for each operand; perform arithmetic operations among lower precision values converted from values for the plurality of the operands; and generate a floating-point value by converting a resulting value from the arithmetic operations into the floating-point format and store the floating-point value.
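
One way to read the scheme is as block fixed-point: each float becomes an integer mantissa plus a stored exponent, the reduction runs on aligned integers, and the total converts back to floating point. The sketch below is that reading, an illustration rather than the patented conversion:

```python
import math

def to_scaled_int(x, bits=16):
    """Represent x as (mantissa, exponent) with a `bits`-bit mantissa."""
    if x == 0.0:
        return 0, 0
    exp = math.frexp(x)[1]                  # x = f * 2**exp, 0.5 <= |f| < 1
    return round(x * 2.0 ** (bits - exp)), exp

def reduce_sum(values, bits=16):
    pairs = [to_scaled_int(v, bits) for v in values]
    top = max((e for _, e in pairs), default=0)   # common exponent
    total = 0
    for mant, exp in pairs:
        total += mant >> (top - exp)              # align, then integer add
    return total * 2.0 ** (top - bits)            # back to floating point

print(reduce_sum([1.5, 2.25, -0.75]))   # 3.0
```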

Patent•
01 Oct 2020
TL;DR: In this patent, a vector-matrix comparison is performed by mapping each data element value of the vector to one of consecutive rows of the matrix; each vector element is then compared with the data element values in its respective row to obtain data element match results, from which the output results are stored.
Abstract: Methods and apparatus for vector-matrix comparison are disclosed. In one embodiment, a processor comprises decoding and execution circuitry. The decoding circuitry decodes an instruction, where operands of the instruction specify an output location to store output results, a vector of data element values, and a matrix of data element values. The execution circuitry executes the decoded instruction. The execution includes to map each of the data element values of the vector to one of consecutive rows of the matrix; for each data element value of the vector, to compare that data element value of the vector with data element values in a respective row of the matrix and obtain data element match results. The execution further includes to store the output results based on the data element match results, where each output result maps to a respective data element column position and indicates a vector match result.
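
A minimal sketch of the comparison semantics: vector element i is compared against every element of matrix row i, and each output position corresponds to a matrix column. Reducing each column's matches with a logical OR is an assumption about the match-result encoding:

```python
def vector_matrix_compare(vec, mat):
    """Output[j] indicates whether any row's comparison matched in column j."""
    out = [False] * len(mat[0])
    for i, v in enumerate(vec):            # vector element i <-> matrix row i
        for j, m in enumerate(mat[i]):
            if m == v:                     # elementwise match
                out[j] = True
    return out

assert vector_matrix_compare([7, 9], [[7, 1, 7],
                                      [2, 3, 4]]) == [True, False, True]
```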

Patent•
18 Jun 2020
TL;DR: In this patent, an apparatus and method for accumulating complex numbers are presented, where the real and imaginary components of a first and a second set of complex numbers are stored as packed data elements within first and second source registers, respectively.
Abstract: An apparatus and method for accumulating complex numbers. For example, one embodiment of a processor comprises: a first source register to store a first plurality of real and imaginary components of a first set of complex numbers; a second source register to store a second plurality of real and imaginary components of a second set of complex numbers; wherein the real and imaginary components of the first and second pluralities are to be stored as packed data elements within the first and second source registers; and execution circuitry comprising: multiplier circuitry to multiply selected real and imaginary values from the first source register with selected real and imaginary values from the second source register to generate a first plurality of values, adder circuitry to add and subtract selected combinations of the first plurality of values to generate a second plurality of values, and accumulation circuitry to combine the second plurality of values with a third set of complex numbers stored in a destination register to generate an accumulated result, the accumulated result to be written to the destination register.
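
The datapath in the abstract is an interleaved complex multiply-accumulate: products of selected real/imaginary values, add/subtract of the cross terms, then accumulation into the destination. A Python sketch, with the [re, im, re, im, ...] packing as an assumed layout:

```python
def complex_fma_packed(src1, src2, dst):
    """dst[k] += src1[k] * src2[k] in complex arithmetic, over packed
    [re0, im0, re1, im1, ...] arrays."""
    for k in range(0, len(src1), 2):
        a_re, a_im = src1[k], src1[k + 1]
        b_re, b_im = src2[k], src2[k + 1]
        dst[k]     += a_re * b_re - a_im * b_im   # real part (subtract)
        dst[k + 1] += a_re * b_im + a_im * b_re   # imaginary part (add)
    return dst

# (1+2j)*(3+4j) = -5+10j, accumulated onto 1+1j.
assert complex_fma_packed([1, 2], [3, 4], [1, 1]) == [-4, 11]
```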

Patent•
17 Sep 2020
TL;DR: In this patent, general-purpose graphics processing units having on-chip dense memory for temporal buffering are disclosed: a high-density memory, integrated on chip with the compute engines and cache, temporarily stores computed data and provides it to the cache before the compute engines need it.
Abstract: Apparatuses including general-purpose graphics processing units having on-chip dense memory for temporal buffering are disclosed. In one embodiment, a graphics multiprocessor includes a plurality of compute engines to perform first computations to generate a first set of data, a cache for storing data, and a high-density memory that is integrated on chip with the plurality of compute engines and the cache. The high-density memory is to receive the first set of data, to temporarily store the first set of data, and to provide the first set of data to the cache during a first time period that is prior to a second time period when the plurality of compute engines will use the first set of data for second computations.

Patent•
24 Sep 2020
TL;DR: In this patent, an embodiment of an apparatus includes one or more processors including a graphics processor, a memory for storage of data for processing by the one or more processors, and a cache to cache data from the memory.
Abstract: Embodiments are generally directed to cache structure and utilization. An embodiment of an apparatus includes one or more processors including a graphics processor; a memory for storage of data for processing by the one or more processors; and a cache to cache data from the memory; wherein the apparatus is to provide for dynamic overfetching of cache lines for the cache, including receiving a read request and accessing the cache for the requested data, and upon a miss in the cache, overfetching data from memory or a higher level cache in addition to fetching the requested data, wherein the overfetching of data is based at least in part on a current overfetch boundary, and provides for data to be prefetched extending to the current overfetch boundary.
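
A sketch of the overfetch decision on a miss: the requested line is fetched, plus following lines up to the current overfetch boundary. The 64-byte line size is an assumption, and how the boundary itself adapts is elided:

```python
LINE = 64  # assumed cache line size in bytes

def lines_to_fetch(miss_addr, overfetch_boundary):
    """Addresses fetched for one miss: the requested line and every
    following line up to (and including) the overfetch boundary's line."""
    first = (miss_addr // LINE) * LINE
    return list(range(first, overfetch_boundary + LINE, LINE))

# A miss at 0x100 with the boundary at 0x1C0 pulls in four lines.
assert [hex(a) for a in lines_to_fetch(0x100, 0x1C0)] == \
    ['0x100', '0x140', '0x180', '0x1c0']
```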

Patent•
24 Sep 2020
TL;DR: In this patent, multi-tile memory management techniques are presented for detecting cross-tile access, providing multi-tile inference scaling with multicasting of data via copy operations, and providing page migration.
Abstract: Multi-tile Memory Management for Detecting Cross Tile Access, Providing Multi-Tile Inference Scaling with multicasting of data via copy operation, and Providing Page Migration are disclosed herein. In one embodiment, a graphics processor for a multi-tile architecture includes a first graphics processing unit (GPU) having a memory and a memory controller, a second graphics processing unit (GPU) having a memory and a cross-GPU fabric to communicatively couple the first and second GPUs. The memory controller is configured to determine whether frequent cross tile memory accesses occur from the first GPU to the memory of the second GPU in the multi-GPU configuration and to send a message to initiate a data transfer mechanism when frequent cross tile memory accesses occur from the first GPU to the memory of the second GPU.

Patent•
26 Mar 2020
TL;DR: In this patent, an apparatus and method for processing array-of-structures (AoS) and structure-of-arrays (SoA) data are described: an AoS gather instruction reads data elements from system memory in AoS format and loads them into a destination tile register in SoA format.
Abstract: An apparatus and method for processing array of structures (AoS) and structure of arrays (SoA) data. For example, one embodiment of a processor comprises: a destination tile register to store data elements in a structure of arrays (SoA) format; a first source tile register to store indices associated with the data elements; instruction fetch circuitry to fetch an array of structures (AoS) gather instruction comprising operands identifying the first source tile register and the destination tile register; a decoder to decode the AoS gather instruction; and execution circuitry to determine a plurality of system memory addresses based on the indices from the first source tile register, to read data elements from the system memory addresses in an AoS format, and to load the data elements to the destination tile register in an SoA format.
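
A hedged sketch of the gather's data movement, with structures flattened into a plain list standing in for system memory and a two-field structure as an assumed shape:

```python
def aos_gather_to_soa(memory, indices, fields=2):
    """Gather `fields`-element structures at `indices` from AoS-layout
    memory; return one destination row per field (SoA layout)."""
    dst = [[] for _ in range(fields)]
    for idx in indices:
        base = idx * fields                    # address of structure `idx`
        for f in range(fields):
            dst[f].append(memory[base + f])    # field f -> SoA row f
    return dst

# Structures {x, y}: (10,11), (20,21), (30,31) stored contiguously.
mem = [10, 11, 20, 21, 30, 31]
assert aos_gather_to_soa(mem, [2, 0]) == [[30, 10], [31, 11]]
```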

Patent•
30 Jul 2020
TL;DR: In this patent, a matrix operation of zeroing a matrix in response to a single instruction is described: decode circuitry decodes an instruction having fields for an opcode and a source/destination matrix operand identifier, and execution circuitry zeroes each data element of the identified matrix.
Abstract: Embodiments detailed herein relate to matrix operations. In particular, performing a matrix operation of zeroing a matrix in response to a single instruction. For example, a processor is detailed which includes decode circuitry to decode an instruction having fields for an opcode and a source/destination matrix operand identifier; and execution circuitry to execute the decoded instruction to zero each data element of the identified source/destination matrix.

Patent•
24 Sep 2020
TL;DR: In this article, the authors described a data initialization technique for cache lines in which a processor read one or more metadata codes and then invoke a random number generator to generate random numerical data for the lines.
Abstract: Methods and apparatus relating to data initialization techniques. In an example, an apparatus comprises a processor to read one or more metadata codes which map to one or more cache lines in a cache memory and invoke a random number generator to generate random numerical data for the one or more cache lines in response to a determination that the one or more metadata codes indicate that the cache lines are to contain random numerical data. Other embodiments are also disclosed and claimed.
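
A control-flow sketch of the technique: a per-line metadata code marks lines whose contents come from the RNG instead of a memory read. The code value and 64-byte line size are assumptions:

```python
import random

RANDOM_FILL = 0x1   # assumed metadata code: "line contains random data"
LINE_BYTES = 64     # assumed cache line size

def materialize_line(metadata_code, backing_read):
    """Produce a line's contents: RNG output if the metadata says so,
    otherwise the normal backing read."""
    if metadata_code == RANDOM_FILL:
        return bytes(random.getrandbits(8) for _ in range(LINE_BYTES))
    return backing_read()

line = materialize_line(RANDOM_FILL, lambda: b"\x00" * LINE_BYTES)
assert len(line) == LINE_BYTES
```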

Patent•
24 Sep 2020
TL;DR: In this article, the authors present an embodiment of an apparatus that includes a circuit element to produce a result in processing of an application; a load-store unit to receive the result and generate pre-fetch information for a cache utilizing the result; and a prefetch generator to produce prefetch addresses based at least in part on the pre fetch information.
Abstract: Embodiments are generally directed to graphics processor data access and sharing. An embodiment of an apparatus includes a circuit element to produce a result in processing of an application; a load-store unit to receive the result and generate pre-fetch information for a cache utilizing the result; and a prefetch generator to produce prefetch addresses based at least in part on the pre-fetch information; wherein the load-store unit is to receive software assistance for prefetching, and wherein generation of the pre-fetch information is based at least in part on the software assistance.

Patent•
17 Sep 2020
TL;DR: In this patent, a general-purpose graphics processing unit comprises a set of processing elements to execute one or more thread groups of a second kernel, an on-chip memory coupled to the set of processing elements, and a scheduler coupled with the set of processing elements.
Abstract: One embodiment provides for a general-purpose graphics processing unit comprising a set of processing elements to execute one or more thread groups of a second kernel to be executed by the general-purpose graphics processor, an on-chip memory coupled to the set of processing elements, and a scheduler coupled with the set of processing elements, the scheduler to schedule the thread groups of the kernel to the set of processing elements, wherein the scheduler is to schedule a thread group of the second kernel to execute subsequent to a thread group of a first kernel, the thread group of the second kernel configured to access a region of the on-chip memory that contains data written by the thread group of the first kernel in response to a determination that the second kernel is dependent upon the first kernel.

Patent•
05 Mar 2020
TL;DR: In this patent, a decoded instruction is executed to compute at least a real output value and an imaginary output value based on at least a cosine calculation and a sine calculation, each based on an index value from a packed data source operand.
Abstract: Embodiments of systems, apparatuses, and methods for performing controllable sine and/or cosine operations in a processor are described. For example, execution circuitry executes a decoded instruction to compute at least a real output value and an imaginary output value based on at least a cosine calculation and a sine calculation, the cosine and sine calculations each based on an index value from a packed data source operand, add the index value with an index increment value from the packed data source operand to create an updated index value, and store the real output value, the imaginary output value, and the updated index value to a packed data destination operand.
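
Functionally, one execution produces a cos/sin pair from the current index and advances the index by the packed increment, which is the natural way to step a complex phasor. Treating the index directly as an angle in radians is an assumption made for this sketch:

```python
import math

def sincos_step(index, index_increment):
    """One execution: real/imaginary outputs plus the updated index."""
    real = math.cos(index)                   # real output value
    imag = math.sin(index)                   # imaginary output value
    return real, imag, index + index_increment

re, im, idx = sincos_step(0.0, math.pi / 2)
assert (round(re, 9), round(im, 9)) == (1.0, 0.0) and idx == math.pi / 2
```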

Patent•
17 Sep 2020
TL;DR: In this patent, scalar core integration in a graphics processor is discussed: a processor receives a set of workload instructions for a graphics workload from a host complex, determines a first subset of operations suitable for execution by a scalar processor complex of the graphics processing device and a second subset suitable for execution by a vector processor complex, and assigns each subset accordingly.
Abstract: Methods and apparatus relating to scalar core integration in a graphics processor. In an example, an apparatus comprises a processor to receive a set of workload instructions for a graphics workload from a host complex, determine a first subset of operations in the set of operations that is suitable for execution by a scalar processor complex of the graphics processing device and a second subset of operations in the set of operations that is suitable for execution by a vector processor complex of the graphics processing device, assign the first subset of operations to the scalar processor complex for execution to generate a first set of outputs, assign the second subset of operations to the vector processor complex for execution to generate a second set of outputs. Other embodiments are also disclosed and claimed.

Patent•
14 Mar 2020
TL;DR: In this patent, the authors present techniques to enable the dynamic reconfiguration of memory on a general-purpose graphics processing unit (GPGPU), including cache memory bank reassignment based on hardware statistics and mixed-page-size virtual memory address translation.
Abstract: Embodiments described herein provide techniques to enable the dynamic reconfiguration of memory on a general-purpose graphics processing unit. One embodiment described herein enables dynamic reconfiguration of cache memory bank assignments based on hardware statistics. One embodiment enables virtual memory address translation using mixed four kilobyte and sixty-four kilobyte pages within the same page table hierarchy and under the same page directory. One embodiment provides for a graphics processor and associated heterogeneous processing system having near and far regions of the same level of a cache hierarchy.

Patent•
01 Apr 2020
TL;DR: In this paper, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode, locations of a two-dimensional (2D) matrix and a one-dimensional vector, and a group of elements comprising one of a row, part of a column, multiple rows, multiple columns, a column and a rectangular sub-tile of the specified 2D matrix.
Abstract: Disclosed embodiments relate to systems for performing instructions to quickly convert and use matrices (tiles) as one-dimensional vectors. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode, locations of a two-dimensional (2D) matrix and a one-dimensional (1D) vector, and a group of elements comprising one of a row, part of a row, multiple rows, a column, part of a column, multiple columns, and a rectangular sub-tile of the specified 2D matrix, and wherein the opcode is to indicate a move of the specified group between the 2D matrix and the 1D vector, decode circuitry to decode the fetched instruction; and execution circuitry, responsive to the decoded instruction, when the opcode specifies a move from 1D, to move contents of the specified 1D vector to the specified group of elements.
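
A minimal sketch of the move semantics for the single-row group; the other group shapes (column, sub-tile, and so on) follow the same pattern. Names are illustrative:

```python
def move_row_from_vector(tile, row, vector):
    """Opcode variant 'move from 1D': copy the vector into one tile row."""
    assert len(vector) == len(tile[row])
    tile[row][:] = vector
    return tile

def move_row_to_vector(tile, row):
    """Opposite direction: extract one tile row as a 1D vector."""
    return list(tile[row])

t = [[0, 0, 0], [0, 0, 0]]
move_row_from_vector(t, 1, [4, 5, 6])
assert move_row_to_vector(t, 1) == [4, 5, 6]
```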

Patent•
24 Sep 2020
TL;DR: In this article, the authors present a system and methods for updating remote memory side caches in a multi-GPU configuration, which includes a first graphics processing unit (GPU) (2810) having a first memory (2870-1), a second memory side cache (2880-2), a first communication fabric (2860-1) and a first MMU (2855-1).
Abstract: Systems and methods for updating remote memory side caches in a multi-GPU configuration are disclosed herein. A graphics processor for a multi-tile architecture includes a first graphics processing unit (GPU) (2810) having a first memory (2870-1), a first memory side cache memory (2880-1), a first communication fabric (2860-1), and a first memory management unit (MMU) (2855-1). The graphics processor includes a second GPU (2820) having a second memory (2870-2), a second memory side cache memory (2880-2), a second MMU (2855-2), and a second communication fabric (2860-2) that is communicatively coupled to the first communication fabric. The first MMU is configured to control memory requests for the first memory, to update content in the first memory, to update content in the first memory side cache memory, and to determine whether to update the content in the second memory side cache memory.

Patent•
24 Sep 2020
TL;DR: In this patent, the authors describe software, firmware, and hardware logic that provides techniques to perform arithmetic on sparse data via a systolic processing unit, including techniques to use decompression information when performing sparse compute operations and to exploit block sparsity within the cache hierarchy of a GPGPU.
Abstract: Embodiments described herein include software, firmware, and hardware logic that provides techniques to perform arithmetic on sparse data via a systolic processing unit. One embodiment provides techniques to optimize training and inference on a systolic array when using sparse data. One embodiment provides techniques to use decompression information when performing sparse compute operations. One embodiment enables the disaggregation of special function compute arrays via a shared reg file. One embodiment enables packed data compress and expand operations on a GPGPU. One embodiment provides techniques to exploit block sparsity within the cache hierarchy of a GPGPU.

Patent•
05 Mar 2020
TL;DR: In this patent, the authors describe vector-packed fractional multiplication of signed words with rounding, saturation, and high-result selection in a processor.
Abstract: Embodiments of systems, apparatuses, and methods for vector-packed fractional multiplication of signed words with rounding, saturation, and high-result selection in a processor are described. For example, execution circuitry executes a decoded instruction to perform a fractional multiplication operation for each of a plurality of pairs of packed data elements to yield a plurality of output values, round each of the plurality of output values, detect whether any of the plurality of output values reflect an overflow or underflow, for any of the plurality of output values that reflect an overflow or underflow, saturate the output value, and store the plurality of output values into a corresponding plurality of positions of the packed data destination operand.
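
Rendered in Q15 fixed point, one lane of the operation multiplies two signed words, rounds, selects the high part of the product, and saturates on overflow. Q15 and the rounding constant are conventional choices for illustration, not confirmed details of the patent:

```python
INT16_MIN, INT16_MAX = -0x8000, 0x7FFF

def fractional_mul_q15(a, b):
    """One lane: fractional (a*b) with rounding, high-result selection,
    and saturation."""
    product = a * b                    # product of signed words
    rounded = product + (1 << 14)      # round-to-nearest before the shift
    high = rounded >> 15               # high-result selection
    return max(INT16_MIN, min(INT16_MAX, high))   # saturate on overflow

def vec_fractional_mul(xs, ys):
    return [fractional_mul_q15(a, b) for a, b in zip(xs, ys)]

# -1.0 * -1.0 would be +1.0, which Q15 cannot represent: it saturates.
assert vec_fractional_mul([-0x8000, 0x4000], [-0x8000, 0x4000]) == \
    [32767, 8192]   # saturated; 0.5 * 0.5 = 0.25
```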

Patent•
21 Jul 2020
TL;DR: In this patent, an apparatus is described having instruction execution logic circuitry with input vector element routing circuitry that, for each of three different instructions, routes into each output vector element location an input vector element from one of a plurality of input vector element locations that are available to source that output vector element.
Abstract: An apparatus is described having instruction execution logic circuitry. The instruction execution logic circuitry has input vector element routing circuitry to perform the following for each of three different instructions: for each of a plurality of output vector element locations, route into an output vector element location an input vector element from one of a plurality of input vector element locations that are available to source the output vector element. The output vector element and each of the input vector element locations are one of three available bit widths for the three different instructions. The apparatus further includes masking layer circuitry coupled to the input vector element routing circuitry to mask a data structure created by the input vector element routing circuitry. The masking layer circuitry is designed to mask at three different levels of granularity that correspond to the three available bit widths.
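
A minimal sketch of the two stages as plain Python: a routing stage that selects, for each output element location, one input element from the locations available to it, then a masking layer applied at the same element granularity. Full any-to-any routing and a zeroing mask are assumptions; the three bit widths are elided:

```python
def route_and_mask(src, routes, mask, fill=0):
    """routes[i] names the input location feeding output location i;
    mask[i] selects the routed element or `fill`."""
    routed = [src[routes[i]] for i in range(len(routes))]     # routing stage
    return [v if m else fill for v, m in zip(routed, mask)]   # masking layer

assert route_and_mask(src=[10, 20, 30, 40],
                      routes=[3, 3, 0, 1],
                      mask=[1, 0, 1, 1]) == [40, 0, 10, 20]
```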

Patent•
24 Sep 2020
TL;DR: In this patent, the authors describe techniques for multi-tile memory management in an apparatus comprising a cache memory, a high-bandwidth memory, and a shader core communicatively coupled to the cache memory.
Abstract: Methods and apparatus relating to techniques for multi-tile memory management. In an example, an apparatus comprises a cache memory, a high-bandwidth memory, a shader core communicatively coupled to the cache memory and comprising a processing element to decompress a first data element extracted from an in-memory database in the cache memory and having a first bit length to generate a second data element having a second bit length, greater than the first bit length, and an arithmetic logic unit (ALU) to compare the data element to a target value provided in a query of the in-memory database. Other embodiments are also disclosed and claimed.