
Showing papers by "Elmoustapha Ould-Ahmed-Vall" published in 2011


Patent•
30 Nov 2011
TL;DR: In response to an instruction specifying a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand, values are read from the data fields of the specified size that correspond to the mask and compared for equality.
Abstract: Instructions and logic provide vector horizontal compare functionality. Some embodiments, responsive to an instruction specifying: a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand; read values from data fields of the specified size in the source operand, corresponding to the mask and compare the values for equality. In some embodiments, responsive to a detection of inequality, a trap may be taken. In some alternative embodiments, a flag may be set. In other alternative embodiments, a mask field may be set to a masked state for the corresponding unequal value(s). In some embodiments, responsive to all unmasked data fields of the source operand being equal to a particular value, that value may be broadcast to all data fields of the specified size in the destination operand.

135 citations
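
The horizontal compare behaviour in the abstract above can be sketched in software. This is a minimal model, not the patented encoding: the function name, the list-based vector representation, and the handling of an empty selection are assumptions made for illustration.

```python
def horizontal_compare(src, mask):
    """Compare the mask-selected elements of src for equality.

    Returns (all_equal, dest). When every selected element holds the same
    value, dest is that value broadcast to all element positions;
    otherwise dest is None (standing in for a trap, flag, or mask update).
    """
    selected = [v for v, m in zip(src, mask) if m]
    if not selected:
        # Assumption: with nothing selected there is nothing to compare.
        return True, None
    all_equal = all(v == selected[0] for v in selected)
    dest = [selected[0]] * len(src) if all_equal else None
    return all_equal, dest
```

For example, `horizontal_compare([7, 7, 9, 7], [1, 1, 0, 1])` ignores the masked-off third element and broadcasts 7 to all four positions.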


Patent•
30 Sep 2011
TL;DR: A vector friendly instruction format has a plurality of fields including a base operation field, a modifier field, an augmentation operation field, and a data element width field; the format supports different versions of base operations and different augmentation operations through placement of different values in the base operation field, the modifier field, the alpha field, the beta field, and the data element width field.
Abstract: A vector friendly instruction format and execution thereof. According to one embodiment of the invention, a processor is configured to execute an instruction set. The instruction set includes a vector friendly instruction format. The vector friendly instruction format has a plurality of fields including a base operation field, a modifier field, an augmentation operation field, and a data element width field, wherein the first instruction format supports different versions of base operations and different augmentation operations through placement of different values in the base operation field, the modifier field, the alpha field, the beta field, and the data element width field, and wherein only one of the different values may be placed in each of the base operation field, the modifier field, the alpha field, the beta field, and the data element width field on each occurrence of an instruction in the first instruction format in instruction streams.

63 citations


Patent•
01 Apr 2011
TL;DR: Execution of a blend instruction causes a data element-by-element selection of data elements of first and second source operands, using the corresponding bit positions of a writemask as a selector between the first and second operands, and storage of the selected data elements into the destination at the corresponding position.
Abstract: Embodiments of systems, apparatuses, and methods for performing a blend instruction in a computer processor are described. In some embodiments, the execution of a blend instruction causes a data element-by-element selection of data elements of first and second source operands using the corresponding bit positions of a writemask as a selector between the first and second operands and storage of the selected data elements into the destination at the corresponding position in the destination.

42 citations
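
The element-by-element selection described in the blend abstract above might look like the following sketch. The polarity of the writemask (a set bit picks the second source) is an assumption; the abstract only says the mask bit acts as the selector.

```python
def blend(src1, src2, writemask):
    """Select one element per position from src1 or src2.

    Assumption: a mask bit of 1 selects the element from src2 and a
    mask bit of 0 selects the element from src1.
    """
    return [b if m else a for a, b, m in zip(src1, src2, writemask)]
```

With `writemask = [0, 1, 0, 1]`, the result alternates between the two sources position by position.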


Patent•
23 Dec 2011
TL;DR: A single vector packed instruction that converts a mask register into a list of index values, and that includes a destination vector register operand, a source writemask register operand, and an opcode, is described.
Abstract: Embodiments of systems, apparatuses, and methods for performing in a computer processor conversion of a mask register into a list of index values in response to a single vector packed convert a mask register into a list of index values instruction that includes a destination vector register operand, a source writemask register operand, and an opcode are described.

31 citations
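
One plausible reading of the mask-to-index conversion above is that the destination receives the positions of the set mask bits. The sketch below assumes that reading, an integer mask, and least-significant-bit-first ordering; none of these details are stated in the abstract.

```python
def mask_to_indices(mask, width):
    """Return the positions of the set bits among the low `width` bits.

    Assumption: indices are listed from the least significant bit upward.
    """
    return [i for i in range(width) if (mask >> i) & 1]
```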


Patent•
22 Dec 2011
TL;DR: A packed data operation mask concatenation instruction indicates a first source having a first packed data operation mask, a second source having a second packed data operation mask, and a destination; the result stored in the destination is the first mask concatenated with the second.
Abstract: A method of an aspect includes receiving a packed data operation mask concatenation instruction. The packed data operation mask concatenation instruction indicates a first source having a first packed data operation mask, indicates a second source having a second packed data operation mask, and indicates a destination. A result is stored in the destination in response to the packed data operation mask concatenation instruction. The result includes the first packed data operation mask concatenated with the second packed data operation mask. Other methods, apparatus, systems, and instructions are disclosed.

27 citations
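
The concatenation in the abstract above can be modelled on integer masks. Which mask lands in the low-order bits is an assumption here, as is the explicit `width` parameter giving the size of each source mask.

```python
def concat_masks(first, second, width):
    """Concatenate two `width`-bit masks into one 2*width-bit result.

    Assumption: `first` occupies the low-order bits and `second` is
    placed directly above it.
    """
    low = first & ((1 << width) - 1)
    high = second & ((1 << width) - 1)
    return (high << width) | low
```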


Patent•
23 Dec 2011
TL;DR: A vector packed unary encoding using masks instruction that includes a source vector register operand, a destination writemask register operand, and an opcode is described.
Abstract: Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed unary encoding using masks in response to a single vector packed unary encoding using masks instruction that includes a source vector register operand, a destination writemask register operand, and an opcode are described.

25 citations


Patent•
14 Dec 2011
TL;DR: A loop remainder mask instruction indicates a current iteration count of a loop as a first operand, an iteration limit of the loop as a second operand, and a destination.
Abstract: A loop remainder mask instruction indicates a current iteration count of a loop as a first operand, an iteration limit of a loop as a second operand, and a destination. The loop contains iterations and each iteration includes a data element of the array. A processor receives the loop remainder mask instruction, decodes the instruction for execution, and stores a result of the execution in the destination. The result indicates a number of data elements of the array past an end of a preceding portion of the array that are to be handled separately from the preceding portion, the end of the preceding portion being where the current iteration count is recorded.

22 citations
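
A common use of such a remainder result is to mask the final, partial vector iteration of a loop. The sketch below assumes the result is expressed as a bit mask with one bit per vector lane and that `vlen` (lanes per vector) is known; both are illustrative assumptions rather than details from the abstract.

```python
def loop_remainder_mask(current, limit, vlen):
    """Return a mask with one low-order bit set per remaining element.

    `current` is the iteration count reached so far, `limit` the total
    iteration count, and `vlen` the (assumed) number of vector lanes.
    """
    remaining = max(0, limit - current)
    active = min(remaining, vlen)
    return (1 << active) - 1
```

For a 19-element array processed 8 lanes at a time, the third pass starts at iteration 16 and gets a mask with only three lanes enabled.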


Patent•
15 Dec 2011
TL;DR: A code optimizer is configured to receive first code having a program loop implemented with scalar instructions to store values of a first array to a second array based on values of a third array, and to generate second code representing the program loop using at least one vector instruction.
Abstract: According to one embodiment, a code optimizer is configured to receive first code having a program loop implemented with scalar instructions to store values of a first array to a second array based on values of a third array and to generate second code representing the program loop using at least one vector instruction. The second code include a shuffle instruction to shuffle elements of the first array based on the third array using a shuffle table in a vector manner, a blend instruction to blend the shuffled elements of the first array using a blend table in a vector manner, and a store instruction to store the blended elements of the first array in the second array.

19 citations


Patent•
Bret L. Toll, Robert Valentine, Corbal Jesus, Elmoustapha Ould-Ahmed-Vall, Mark J. Charney •
23 Dec 2011
TL;DR: Mask bit compression performed in response to a single mask bit compression instruction that includes a source writemask register operand, a destination writemask register operand, and an opcode is described.
Abstract: Embodiments of systems, apparatuses, and methods for performing in a computer processor mask bit compression in response to a single mask bit compression instruction that includes a source writemask register operand, a destination writemask register operand, and an opcode are described.

16 citations


Patent•
26 Sep 2011
TL;DR: In this article, the authors provide vector scatter-op and/or gather-op functionality, where the data element is gathered from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value.
Abstract: Instructions and logic provide vector scatter-op and/or gather-op functionality. In some embodiments, responsive to an instruction specifying: a gather and a second operation, a destination register, an operand register, and a memory address; execution units read values in a mask register, wherein fields in the mask register correspond to offset indices in the indices register for data elements in memory. A first mask value indicates the element has not been gathered from memory and a second value indicates that the element does not need to be, or has already been gathered. For each having the first value, the data element is gathered from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. When all mask register fields have the second value, the second operation is performed using corresponding data in the destination and operand registers to generate results.

16 citations
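
The gather-then-operate sequence in the abstract above can be sketched as follows. The list-based memory, the `base` offset, and the use of 1 and 0 for the "first" and "second" mask values are assumptions for illustration.

```python
import operator


def gather_op(memory, base, indices, mask, dest, operand, op):
    """Gather pending elements, then apply `op` lane by lane.

    Assumption: mask value 1 means "not yet gathered" and 0 means
    "already gathered or not needed". Returns (results, final_mask).
    """
    dest = list(dest)
    mask = list(mask)
    for i, pending in enumerate(mask):
        if pending:
            dest[i] = memory[base + indices[i]]  # gather this lane
            mask[i] = 0                          # mark the lane as done
    # Every mask field now holds the "done" value, so the second
    # operation runs on the destination and operand registers.
    results = [op(d, o) for d, o in zip(dest, operand)]
    return results, mask


memory = [10, 20, 30, 40, 50]
results, final_mask = gather_op(
    memory, 1, [0, 2, 3, 1], [1, 1, 0, 1],
    dest=[0, 0, 99, 0], operand=[1, 1, 1, 1], op=operator.add,
)
```

Lane 2 starts with its mask field already in the "done" state, so its existing destination value (99) is used unchanged by the second operation.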


Patent•
23 Dec 2011
TL;DR: A vector packed horizontal add or subtract of packed data elements, performed in response to a single vector packed horizontal add or subtract instruction that includes a destination vector register operand, a source vector register operand, and an opcode, is described.
Abstract: Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed horizontal add or subtract of packed data elements in response to a single vector packed horizontal add or subtract instruction that includes a destination vector register operand, a source vector register operand, and an opcode are described.

Patent•
30 Nov 2011
TL;DR: Instructions and logic provide vector horizontal majority voting: in response to an instruction specifying a destination operand, a size of the vector elements, a source operand, and a mask, values are read from the source data fields selected by the mask, and the majority value is stored to the corresponding data fields of the destination operand.
Abstract: Instructions and logic provide vector horizontal majority voting functionality. Some embodiments, responsive to an instruction specifying: a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand; read a number of values from data fields of the specified size in the source operand, corresponding to the mask specified by the instruction and store a result value to that number of corresponding data fields in the destination operand, the result value computed from the majority of values read from the number of data fields of the source operand.
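
A software model of the majority vote above might look like this sketch. Treating "majority" as the most frequent selected value, and leaving unselected destination fields untouched, are assumptions; the abstract does not say how ties or unselected fields are handled.

```python
from collections import Counter


def horizontal_majority(src, dest, mask):
    """Write the most common mask-selected source value to the
    mask-selected destination fields.

    Assumption: unselected destination fields keep their old values,
    and ties go to the value seen first.
    """
    selected = [v for v, m in zip(src, mask) if m]
    if not selected:
        return list(dest)
    winner = Counter(selected).most_common(1)[0][0]
    return [winner if m else d for d, m in zip(dest, mask)]
```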

Patent•
22 Dec 2011
TL;DR: Systems, apparatuses, and methods for performing a mask broadcast instruction in a computer processor are described; execution of the instruction broadcasts a data element of the source operand to a destination register of the destination operand according to the broadcast size.
Abstract: Embodiments of systems, apparatuses, and methods for performing a mask broadcast instruction in a computer processor are described. In some embodiments, the execution of a mask broadcast instruction causes a broadcast of a data element of the source operand to a destination register of the destination operand according to the broadcast size.

Patent•
14 Jan 2011
TL;DR: In this article, a processing core implemented on a semiconductor chip is described, which includes logic circuitry to identify whether vector instructions and integer scalar instructions are to be executed with two registers or three registers.
Abstract: A processing core implemented on a semiconductor chip is described. The processing core includes logic circuitry to identify whether vector instructions and integer scalar instructions are to be executed with two registers or three registers, where, in the case of two registers input operand information is destroyed in one of two registers, and, in the case of three registers input operand is not destroyed. The processing core also includes steering circuitry coupled to the logic circuitry. The steering circuitry is to control first data paths between scalar integer execution units and a scalar integer register bank such that two registers are accessed from the scalar register bank if two register execution is identified for the scalar integer instructions or three registers are accessed from the scalar integer register bank if three register execution is identified for the scalar integer instructions. The steering circuitry is also to control second data paths between vector execution units and a vector register bank such that two registers are accessed from the vector register bank if two register execution is identified for the vector instructions or three registers are accessed from the vector register bank if three register execution is identified for the vector instructions.

Patent•
23 Dec 2011
TL;DR: Instruction execution logic circuitry has input vector element routing circuitry that, for each of three different instructions, routes into each output vector element location an input vector element from one of a plurality of input vector element locations available to source that output element; masking layer circuitry masks the result at three levels of granularity matching the three available bit widths.
Abstract: An apparatus is described having instruction execution logic circuitry. The instruction execution logic circuitry has input vector element routing circuitry to perform the following for each of three different instructions: for each of a plurality of output vector element locations, route into an output vector element location an input vector element from one of a plurality of input vector element locations that are available to source the output vector element. The output vector element and each of the input vector element locations are one of three available bit widths for the three different instructions. The apparatus further includes masking layer circuitry coupled to the input vector element routing circuitry to mask a data structure created by the input vector routing element circuitry. The masking layer circuitry is designed to mask at three different levels of granularity that correspond to the three available bit widths.

Patent•
23 Dec 2011
TL;DR: A single vector packed instruction that converts a list of index values into a mask value, and that includes a destination writemask register operand, a source vector register operand, and an opcode, is described.
Abstract: Embodiments of systems, apparatuses, and methods for performing in a computer processor conversion of a list of index values into a mask value in response to a single vector packed conversion of a list of index values into a mask value instruction that includes a destination writemask register operand, a source vector register operand, and an opcode are described.
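
One plausible reading of the index-list-to-mask conversion above is that each index in the source sets the mask bit at that position. The sketch below assumes that reading and an integer mask.

```python
def indices_to_mask(indices):
    """Build a mask with bit i set for every index i in `indices`.

    Assumption: indices are non-negative bit positions; repeated
    indices simply set the same bit again.
    """
    mask = 0
    for i in indices:
        mask |= 1 << i
    return mask
```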

Patent•
28 Dec 2011
TL;DR: Systems, apparatuses, and methods are described for performing delta encoding on packed data elements of a source and storing the results in packed data elements of a destination using a single vector packed delta encode instruction.
Abstract: Embodiments of systems, apparatuses, and methods for performing delta encoding on packed data elements of a source and storing the results in packed data elements of a destination using a single vector packed delta encode instruction are described.
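
Delta encoding in its usual form stores each element as its difference from the previous one. The sketch below applies that conventional definition to a list; whether the instruction keeps the first element verbatim is an assumption, since the abstract does not specify it.

```python
def delta_encode(src):
    """Return the element-to-element differences of `src`.

    Assumption: the first element is stored unchanged so the original
    sequence can be rebuilt with a running sum.
    """
    if not src:
        return []
    return [src[0]] + [b - a for a, b in zip(src, src[1:])]
```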

Patent•
23 Dec 2011
TL;DR: Instruction execution logic circuitry has input vector element routing circuitry that, for each of three different instructions, routes into each output vector element location an input vector element from one of a plurality of input vector element locations available to source that output element; masking layer circuitry masks the result at three levels of granularity matching the three available bit widths.
Abstract: An apparatus is described having instruction execution logic circuitry. The instruction execution logic circuitry has input vector element routing circuitry to perform the following for each of three different instructions: for each of a plurality of output vector element locations, route into an output vector element location an input vector element from one of a plurality of input vector element locations that are available to source the output vector element. The output vector element and each of the input vector element locations are one of three available bit widths for the three different instructions. The apparatus further includes masking layer circuitry coupled to the input vector element routing circuitry to mask a data structure created by the input vector routing element circuitry. The masking layer circuitry is designed to mask at three different levels of granularity that correspond to the three available bit widths.

Patent•
Robert Valentine, Elmoustapha Ould-Ahmed-Vall, Jesus Corbal, Tal Uliel, Bret L. Toll •
23 Dec 2011
TL;DR: An apparatus and method are described for shuffling data elements from source registers to a destination register, where a mask bit associated with each destination data element determines whether a shuffle operation or a masking operation is performed on that element.
Abstract: An apparatus and method are described for shuffling data elements from source registers to a destination register. For example, a method according to one embodiment includes the following operations: reading each mask bit stored in a mask data structure, the mask data structure containing mask bits associated with data elements of a destination register, the values usable for determining whether a masking operation or a shuffle operation should be performed on data elements stored within a first source register and a second source register; for each data element of the destination register, if a mask bit associated with the data element indicates that a shuffle operation should be performed, then shuffling data elements from the first source register and the second source register to the specified data element within the destination register; and if the mask bit indicates that a masking operation should be performed, then performing a specified masking operation with respect to the data element of the destination register.
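
The per-element choice between shuffling and masking described above can be sketched as follows. The `control` list of source indices, the mask polarity, and the two masking behaviours (zeroing or keeping the old destination value) are hypothetical details added for illustration.

```python
def masked_shuffle(src1, src2, control, dest, mask, zeroing=False):
    """Shuffle into destination positions whose mask bit is set.

    Assumptions: `control[i]` indexes into src1 followed by src2, a
    mask bit of 1 requests the shuffle, and a mask bit of 0 either
    zeroes the element or leaves the old destination value in place.
    """
    pool = list(src1) + list(src2)
    return [
        pool[c] if m else (0 if zeroing else d)
        for c, d, m in zip(control, dest, mask)
    ]
```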

Patent•
29 Dec 2011
TL;DR: A dot product instruction indicates a first source packed data including at least four data elements, a second source packed data including at least eight data elements, and a destination storage location.
Abstract: A method of an aspect includes receiving a dot product instruction. The dot product instruction indicates a first source packed data including at least four data elements, indicates a second source packed data including at least eight data elements, and indicates a destination storage location. A result packed data is stored in the destination storage location in response to the dot product instruction. The result includes a plurality of data elements that each includes a dot product result. Each of the dot product results includes a sum of products of the at least four data elements of the first source packed data with corresponding data elements in a different subset of at least four data elements of the second source packed data. Other methods, apparatus, systems, and instructions are disclosed.
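
The multiple dot products in the abstract above can be modelled by pairing the first source with successive groups of the second source. Taking the "different subsets" to be consecutive, non-overlapping groups is an assumption of this sketch.

```python
def multi_dot(src1, src2):
    """Dot src1 with each consecutive len(src1)-element group of src2.

    Assumption: the subsets of src2 are consecutive, non-overlapping
    groups, so 4 and 8 input elements yield 2 results.
    """
    n = len(src1)
    return [
        sum(a * b for a, b in zip(src1, src2[i:i + n]))
        for i in range(0, len(src2), n)
    ]
```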

Patent•
30 Dec 2011
TL;DR: In this article, a unique packed data element identification result is stored in the destination storage location in response to the unique packed element identification instruction, which indicates which of the plurality of the packed data elements are unique in the source packed data.
Abstract: A method of an aspect includes receiving a unique packed data element identification instruction. The unique packed data element identification instruction indicates a source packed data having a plurality of packed data elements and indicates a destination storage location. A unique packed data element identification result is stored in the destination storage location in response to the unique packed data element identification instruction. The unique packed data element identification result indicates which of the plurality of the packed data elements are unique in the source packed data. Other methods, apparatus, systems, and instructions are disclosed.
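
One way to express such a result is a bit mask over the source elements. The sketch below assumes "unique" means the value occurs exactly once in the source; another reading (marking the first occurrence of each value) is equally possible from the abstract.

```python
from collections import Counter


def unique_mask(src):
    """Return a mask with bit i set when src[i] occurs exactly once.

    Assumption: "unique" means the value has no duplicate anywhere in
    the source packed data.
    """
    counts = Counter(src)
    mask = 0
    for i, value in enumerate(src):
        if counts[value] == 1:
            mask |= 1 << i
    return mask
```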

Patent•
23 Dec 2011
TL;DR: An apparatus is described having functional unit logic circuitry that calculates an address offset for a particular segment of a multi-dimensional data structure from a first input vector operand giving the size of each dimension and a second input vector operand giving the segment's coordinates.
Abstract: An apparatus is described having functional unit logic circuitry. The functional unit logic circuitry has a first register to store a first input vector operand having an element for each dimension of a multi-dimensional data structure. Each element of the first vector operand specifying the size of its respective dimension. The functional unit has a second register to store a second input vector operand specifying coordinates of a particular segment of the multi-dimensional structure. The functional unit also has logic circuitry to calculate an address offset for the particular segment relative to an address of an origin segment of the multi-dimensional structure.
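
The offset calculation above corresponds to the familiar linearisation of multi-dimensional coordinates. The sketch assumes row-major ordering (first dimension varies slowest) and an optional element size; the abstract does not state either.

```python
def segment_offset(dims, coords, elem_size=1):
    """Offset of the segment at `coords` relative to the origin segment.

    Assumption: row-major layout, so the last dimension is contiguous.
    `dims[i]` is the size of dimension i and `coords[i]` the coordinate.
    """
    offset = 0
    for size, coord in zip(dims, coords):
        offset = offset * size + coord
    return offset * elem_size
```

For a 4 x 5 x 6 structure, coordinates (1, 2, 3) give 1*30 + 2*6 + 3 = 45 elements from the origin.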

Patent•
Suleyman Sair, Elmoustapha Ould-Ahmed-Vall, Charles R. Yount, Kshitij A. Doshi, Bret L. Toll •
30 Dec 2011
TL;DR: A processor core includes a hardware decode unit to decode a vector frequency compress instruction that includes a source operand and a destination operand, and an execution engine unit to execute the decoded instruction.
Abstract: A processor core that includes a hardware decode unit to decode a vector frequency compress instruction that includes a source operand and a destination operand. The source operand specifying a source vector register that includes a plurality of source data elements including one or more runs of identical data elements that are each to be compressed in a destination vector register as a value and run length pair. The destination operand identifies the destination vector register. The processor core also includes an execution engine unit to execute the decoded vector frequency compress instruction which causes, for each source data element, a value to be copied into the destination vector register to indicate that source data element's value. One or more runs of the source data elements equal are encoded in the destination vector register as the predetermined compression value followed by a run length for that run.
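
The compression above resembles run-length encoding restricted to one designated value. The sketch below assumes that only runs of a predetermined value (0 by default) are replaced by a value and run length pair, while all other elements are copied through; this interpretation and the parameter name `compress_value` are assumptions.

```python
def frequency_compress(src, compress_value=0):
    """Replace each run of `compress_value` with [compress_value, length].

    Assumption: elements that differ from `compress_value` are copied
    to the output unchanged.
    """
    out = []
    i = 0
    while i < len(src):
        if src[i] == compress_value:
            run = 1
            while i + run < len(src) and src[i + run] == compress_value:
                run += 1
            out += [compress_value, run]
            i += run
        else:
            out.append(src[i])
            i += 1
    return out
```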

Patent•
Elmoustapha Ould-Ahmed-Vall, Suleyman Sair, Kshitij A. Doshi, Charles R. Yount, Bret L. Toll •
30 Dec 2011
TL;DR: In this paper, a processor core including a hardware decode unit to decode vector instructions for decompressing a run length encoded (RLE) set of source data elements and an execution unit to execute the decoded instructions.
Abstract: A processor core including a hardware decode unit to decode vector instructions for decompressing a run length encoded (RLE) set of source data elements and an execution unit to execute the decoded instructions. The execution unit generates a first mask by comparing the set of source data elements with a set of zeros and then counts the trailing zeros in the mask. A second mask is made based on the count of trailing zeros. The execution unit then copies the set of source data elements to a buffer using the second mask and then reads the number of RLE zeros from the set of source data elements. The buffer is shifted and copied to a result and the set of source data elements is shifted to the right. If more valid data elements are in the set of source data elements, this is repeated until all valid data is processed.

Patent•
22 Dec 2011
TL;DR: An apparatus and method are described for expanding bits from a mask register in a processor with vector registers: each mask register bit is read and replicated N times into a destination register, where N is the number of vector elements stored in each vector register.
Abstract: An apparatus and method are described for expanding bits from a mask register in a processor and computing system with vector registers and vector data elements. For example, a method according to one embodiment includes the following operations: reading each mask register bit stored in a mask register, the mask register containing mask values used for performing operations on vector values stored in a set of vector registers; and replicating each mask register bit N times into a destination register, where N is the number of vector elements stored in each vector register.
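
The bit replication above can be modelled on integers. The `width` parameter (number of mask bits to read) and the low-to-high placement of the replicated groups are assumptions of this sketch.

```python
def expand_mask(mask, width, n):
    """Replicate each of the low `width` bits of `mask` n times.

    Assumption: bit i of the source fills bits i*n .. i*n + n - 1 of
    the destination.
    """
    out = 0
    group = (1 << n) - 1  # n consecutive ones
    for i in range(width):
        if (mask >> i) & 1:
            out |= group << (i * n)
    return out
```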

Patent•
Bret L. Toll, Robert Valentine, Corbal Jesus, Elmoustapha Ould-Ahmed-Vall, Mark J. Charney •
29 Dec 2011
TL;DR: A packed data operation mask comparison instruction compares two masks bit by bit: a first flag is modified according to whether the bitwise AND of the first mask with the second mask is zero, and a second flag according to whether the bitwise AND of the first mask with the bitwise NOT of the second mask is zero.
Abstract: Receive packed data operation mask comparison instruction indicating first packed data operation mask having first packed data operation mask bits and second packed data operation mask having second packed data operation mask bits. Each packed data operation mask bit of first mask corresponds to a packed data operation mask bit of second mask in corresponding position. Modify first flag to first value if bitwise AND of each packed data operation mask bit of first mask with each corresponding packed data operation mask bit of second mask is zero. Otherwise modify first flag to second value. Modify second flag to third value if bitwise AND of each packed data operation mask bit of first mask with bitwise NOT of each corresponding packed data operation mask bit of second mask is zero. Otherwise modify second flag to fourth value.
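
The two flag computations above translate directly into bitwise operations. In this sketch the flag values are assumed to be 1 when the tested expression is zero and 0 otherwise, and `width` bounds the mask so that the NOT stays within the register size.

```python
def mask_compare(first, second, width):
    """Return (first_flag, second_flag) for two `width`-bit masks.

    Assumption: each flag is 1 when its tested expression is zero,
    0 otherwise.
    """
    full = (1 << width) - 1
    first_flag = int((first & second & full) == 0)
    second_flag = int((first & ~second & full) == 0)
    return first_flag, second_flag
```

The first flag reports that the masks share no set bits; the second reports that every set bit of the first mask is also set in the second.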

Patent•
23 Dec 2011
TL;DR: Instruction execution logic circuitry executes first, second, third, and fourth instructions: the first and second instructions select a first group of input vector elements from one of multiple first non-overlapping sections of respective first and second input vectors, while the third and fourth instructions select a second group, whose bit width is larger than that of the first group, from respective third and fourth input vectors.
Abstract: An apparatus is described that includes instruction execution logic circuitry to execute first, second, third and fourth instructions. Both the first instruction and the second instruction select a first group of input vector elements from one of multiple first non overlapping sections of respective first and second input vectors. The first group has a first bit width. Each of the multiple first non overlapping sections have a same bit width as the first group. Both the third instruction and the fourth instruction select a second group of input vector elements from one of multiple second non overlapping sections of respective third and fourth input vectors. The second group has a second bit width that is larger than the first bit width. Each of the multiple second non overlapping sections have a same bit width as the second group. The apparatus includes masking layer circuitry to mask the first and second groups of the first and third instructions at a first granularity, where, respective resultants produced therewith are respective resultants of the first and third instructions. The masking circuitry is also to mask the first and second groups of the second and fourth instructions at a second granularity, where, respective resultants produced therewith are respective resultants of the second and fourth instructions.

Patent•
23 Dec 2011
TL;DR: An apparatus and method are described for broadcasting from a general purpose source register to a destination vector register: the data is broadcast to a data element position when the mask indicator has a first indication; otherwise the position is either zeroed or left unchanged.
Abstract: An apparatus and method are described for broadcasting from a general purpose source register to a destination vector register. For example, a method according to one embodiment includes the following operations: selecting data element position N within the destination vector register to be updated; broadcasting a set of data from the general purpose source register to data element position N within the destination vector register if a mask indicator is set to a first indication; and either copying zeroes to data element position N within the destination vector register or maintaining existing values stored within data element position N within the destination vector register if the mask indicator is set to a second indication.
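
The masked broadcast above can be sketched per element position. The `zeroing` flag, which chooses between the two behaviours the abstract allows for masked-off positions, and the use of 1 for the "first indication" are assumptions.

```python
def masked_broadcast(value, dest, mask, zeroing=False):
    """Broadcast `value` into the positions whose mask bit is 1.

    Assumption: for a mask bit of 0 the position is zeroed when
    `zeroing` is true and keeps its existing value otherwise.
    """
    return [
        value if m else (0 if zeroing else d)
        for d, m in zip(dest, mask)
    ]
```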

Patent•
22 Dec 2011
TL;DR: A packed data operation mask register arithmetic combination instruction stores, in a destination storage location, an arithmetic combination of at least a portion of the bits of a first packed data operation mask register and at least a corresponding portion of the bits of a second packed data operation mask register.
Abstract: A method of an aspect includes receiving a packed data operation mask register arithmetic combination instruction. The packed data operation mask register arithmetic combination instruction indicates a first packed data operation mask register, indicates a second packed data operation mask register, and indicates a destination storage location. An arithmetic combination of at least a portion of bits of the first packed data operation mask register and at least a corresponding portion of bits of the second packed data operation mask register is stored in the destination storage location in response to the packed data operation mask register arithmetic combination instruction. Other methods, apparatus, systems, and instructions are disclosed.
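
As one concrete instance of an arithmetic combination, the sketch below adds two masks as unsigned integers. Choosing addition, and discarding the carry out of the top bit, are assumptions; the abstract does not name a specific arithmetic operation.

```python
def mask_add(first, second, width):
    """Add two `width`-bit masks as unsigned integers.

    Assumption: the carry out of the top bit is discarded, so the
    result wraps around modulo 2**width.
    """
    return (first + second) & ((1 << width) - 1)
```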

Patent•
30 Dec 2011
TL;DR: In this article, a processor core that includes a hardware decode unit and an execution engine unit is used to decode a vector frequency expand instruction, where the source operand specifies a source vector register that includes one or more pairs of a value and run length that are to be expanded into a run of that value based on the run length.
Abstract: A processor core that includes a hardware decode unit and an execution engine unit. The hardware decode unit to decode a vector frequency expand instruction, wherein the vector frequency expand instruction includes a source operand and a destination operand, wherein the source operand specifies a source vector register that includes one or more pairs of a value and run length that are to be expanded into a run of that value based on the run length. The execution engine unit to execute the decoded vector frequency expand instruction which causes a set of one or more source data elements in the source vector register to be expanded into a set of destination data elements comprising more elements than the set of source data elements and including at least one run of identical values which were run length encoded in the source vector register.
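
The expansion above is the inverse of run-length encoding. The sketch below assumes an input format in which a designated value (0 by default) is always followed by its run length, while every other element stands for itself; this format and the name `compress_value` are assumptions made for illustration.

```python
def frequency_expand(src, compress_value=0):
    """Expand each [compress_value, length] pair into a run.

    Assumption: elements other than `compress_value` are literal and
    copied through unchanged.
    """
    out = []
    i = 0
    while i < len(src):
        if src[i] == compress_value and i + 1 < len(src):
            out += [compress_value] * src[i + 1]
            i += 2
        else:
            out.append(src[i])
            i += 1
    return out
```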