Author

Elmoustapha Ould-Ahmed-Vall

Bio: Elmoustapha Ould-Ahmed-Vall is an academic researcher from Intel. The author has contributed to research in topics: Operand & Opcode. The author has an h-index of 19 and has co-authored 299 publications receiving 1656 citations. Previous affiliations of Elmoustapha Ould-Ahmed-Vall include Georgia Institute of Technology & AMIT.


Papers
Patent
30 Nov 2011
TL;DR: Instructions and logic provide vector horizontal compare functionality: responsive to an instruction specifying a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand, values are read from the data fields corresponding to the mask and compared for equality.
Abstract: Instructions and logic provide vector horizontal compare functionality. Some embodiments, responsive to an instruction specifying: a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand; read values from data fields of the specified size in the source operand, corresponding to the mask and compare the values for equality. In some embodiments, responsive to a detection of inequality, a trap may be taken. In some alternative embodiments, a flag may be set. In other alternative embodiments, a mask field may be set to a masked state for the corresponding unequal value(s). In some embodiments, responsive to all unmasked data fields of the source operand being equal to a particular value, that value may be broadcast to all data fields of the specified size in the destination operand.
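The behavior described in the abstract can be illustrated with a minimal software sketch. It models the source operand as a Python list and the mask as a list of booleans, and it combines two of the described embodiments: a flag on inequality and a broadcast on equality. The function name `horizontal_compare`, the return shape, and the handling of an all-masked input are assumptions made for illustration; they are not taken from the patent and do not describe any real instruction encoding.

```python
from typing import List, Optional, Tuple


def horizontal_compare(
    src: List[int], mask: List[bool]
) -> Tuple[bool, Optional[List[int]]]:
    """Compare the unmasked elements of `src` for equality.

    Returns (all_equal, dest). When every unmasked element holds the same
    value, `dest` is that value broadcast to all element positions.
    Otherwise `dest` is None and the False flag signals the inequality.
    """
    if len(src) != len(mask):
        raise ValueError("src and mask must have the same number of elements")

    # Read only the data fields selected by the mask.
    selected = [value for value, keep in zip(src, mask) if keep]

    # Assumption: with no unmasked elements there is nothing to broadcast.
    if not selected:
        return True, None

    first = selected[0]
    if all(value == first for value in selected):
        # Broadcast the common value to every field of the destination.
        return True, [first] * len(src)
    return False, None
```

For example, `horizontal_compare([7, 3, 7, 7], [True, False, True, True])` ignores the masked-out `3` and broadcasts `7` to all four positions.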

135 citations

Patent
30 Sep 2011
TL;DR: A vector friendly instruction format has a plurality of fields including a base operation field, a modifier field, an augmentation operation field, and a data element width field, wherein the first instruction format supports different versions of base operations and different augmentation operations through placement of different values in the base operation field, the modifier field, and the alpha field.
Abstract: A vector friendly instruction format and execution thereof. According to one embodiment of the invention, a processor is configured to execute an instruction set. The instruction set includes a vector friendly instruction format. The vector friendly instruction format has a plurality of fields including a base operation field, a modifier field, an augmentation operation field, and a data element width field, wherein the first instruction format supports different versions of base operations and different augmentation operations through placement of different values in the base operation field, the modifier field, the alpha field, the beta field, and the data element width field, and wherein only one of the different values may be placed in each of the base operation field, the modifier field, the alpha field, the beta field, and the data element width field on each occurrence of an instruction in the first instruction format in instruction streams.

63 citations

Proceedings ArticleDOI
Elmoustapha Ould-Ahmed-Vall, J. Woodlee, Charles R. Yount, K.A. Doshi, S. Abraham
25 Apr 2007
TL;DR: A model-tree based approach based on the M5' algorithm is implemented and validated that accounts for event interactions and workload characteristics, attesting it as a sound approach for performance analysis of modern superscalar machines.
Abstract: The identification of performance issues on specific computer architectures has a variety of important benefits such as tuning software to improve performance, comparing the performance of various platforms and assisting in the design of new platforms. In order to enable this analysis, most modern micro-processors provide access to hardware-based event counters. Unfortunately, features such as out-of-order execution, pre-fetching and speculation complicate the interpretation of the raw data. Thus, the traditional approach of assigning a uniform estimated penalty to each event does not accurately identify and quantify performance limiters. This paper presents a novel method employing a statistical regression-modeling approach to better achieve this goal. Specifically, a model-tree approach based on the M5' algorithm is implemented and validated that accounts for event interactions and workload characteristics. Data from a subset of the SPEC CPU2006 suite is used by the algorithm to automatically build a performance-model tree, identifying the unique performance classes (phases) found in the suite and associating with each class a unique, explanatory linear model of performance events. These models can be used to identify performance problems for a given workload and estimate the potential gain from addressing each problem. This information can help orient the performance optimization efforts to focus available time and resources on techniques most likely to impact performance problems with highest potential gain. The model tree exhibits high correlation (more than 0.98) and low relative absolute error (less than 8%) between predicted and measured performance, attesting to it as a sound approach for performance analysis of modern superscalar machines.
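To make the idea of a performance-model tree concrete, the sketch below shows only the prediction structure: internal nodes split on an event rate, and each leaf holds its own linear model. It does not implement the M5' learning algorithm. The class names `Leaf` and `Split` and any feature names or coefficients passed in are hypothetical and chosen purely for illustration; none of them come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """A performance class with its own linear model of event rates."""
    intercept: float
    coeffs: Dict[str, float]

    def predict(self, events: Dict[str, float]) -> float:
        # Linear model: intercept plus a weighted sum of event rates.
        return self.intercept + sum(
            weight * events[name] for name, weight in self.coeffs.items()
        )


@dataclass
class Split:
    """An internal node that routes a sample on one event rate."""
    feature: str
    threshold: float
    low: Union["Split", Leaf]
    high: Union["Split", Leaf]

    def predict(self, events: Dict[str, float]) -> float:
        branch = self.low if events[self.feature] <= self.threshold else self.high
        return branch.predict(events)
```

A tree such as `Split("l2_miss_rate", 0.01, Leaf(0.5, {"branch_mispredict_rate": 10.0}), Leaf(0.8, {"l2_miss_rate": 100.0}))` (made-up numbers) would apply a different linear model depending on which side of the threshold a workload sample falls, which is how a leaf's coefficients can be read as per-class event penalties.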

53 citations

Journal ArticleDOI
TL;DR: A general fault-tolerant event detection scheme that allows nodes to detect erroneous local decisions by leveraging the local decisions reported by their neighbors and is proven to be optimal under the maximum a posteriori (MAP) criterion.
Abstract: This paper presents a general fault-tolerant event detection scheme that allows nodes to detect erroneous local decisions by leveraging the local decisions reported by their neighbors. This detection scheme can handle cases where nodes have different accuracy levels. The derived fault-tolerant estimator is proven to be optimal under the maximum a posteriori (MAP) criterion. An equivalent weighted voting scheme is also derived. Further, two new error models are derived to take into account the neighbor distance and the geographical distributions of the two decision quorums. These models are particularly suitable for detection applications where the event under consideration is highly localized. The fault-tolerant estimator is simulated using a network of 1,024 nodes deployed randomly in a square region and assigned random probabilities of failure. Several estimation schemes that allow nodes to learn their error rates continuously are developed. These error rates are used in the distributed estimation schemes to assign appropriate weights to the nodes in the voting scheme.
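A weighted voting scheme of the kind mentioned above can be sketched as follows. This is the textbook log-odds weighting, which coincides with the MAP decision when the two hypotheses are equally likely and neighbors err independently and symmetrically; those conditions are assumptions of this sketch, and the paper's own estimator and its distance-based error models may differ. The function name `weighted_vote` and the tie-breaking rule are hypothetical.

```python
import math
from typing import Sequence


def weighted_vote(decisions: Sequence[int], error_rates: Sequence[float]) -> int:
    """Fuse binary neighbor decisions (0 or 1) into one decision.

    Each neighbor i votes with weight log((1 - p_i) / p_i), where p_i is
    its error rate, so more accurate nodes count for more.
    """
    if len(decisions) != len(error_rates):
        raise ValueError("one error rate is required per decision")

    score = 0.0
    for decision, p in zip(decisions, error_rates):
        if not 0.0 < p < 1.0:
            raise ValueError("error rates must lie strictly between 0 and 1")
        weight = math.log((1.0 - p) / p)
        score += weight if decision == 1 else -weight

    # Assumption: ties are resolved in favor of "no event".
    return 1 if score > 0 else 0
```

With error rates `[0.01, 0.4, 0.4]` and decisions `[1, 0, 0]`, the single highly accurate node outweighs the two unreliable ones and the fused decision is 1, whereas a plain majority vote would return 0.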

49 citations

Proceedings ArticleDOI
12 Dec 2005
TL;DR: A distributed algorithm to solve the unique ID assignment problem is presented and it is demonstrated that a high percentage of nodes are assigned globally unique IDs at the termination of the algorithm when the algorithm parameters are set properly.
Abstract: A sensor network consists of a set of battery-powered nodes, which collaborate to perform sensing tasks in a given environment. It may contain one or more base stations to collect sensed data and possibly relay it to a central processing and storage system. These networks are characterized by scarcity of resources, in particular the available energy. We present a distributed algorithm to solve the unique ID assignment problem. The proposed solution starts by assigning long unique IDs and organizing nodes in a tree structure. This tree structure is used to compute the size of the network. Then, unique IDs are assigned using the minimum number of bytes. Globally unique IDs are useful in providing many network functions, e.g. configuration, monitoring of individual nodes, and various security mechanisms. Theoretical and simulation analyses of the proposed solution have been performed. The results demonstrate that a high percentage of nodes (more than 99%) are assigned globally unique IDs at the termination of the algorithm when the algorithm parameters are set properly. Furthermore, the algorithm terminates in a relatively short time that scales well with the network size. For example, the algorithm terminates in about 5 minutes for a network of 1,000 nodes.
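The two quantities the abstract relies on, the network size obtained from the tree and the minimum ID width in bytes, can be sketched in a centralized form. The distributed protocol itself is not described here, so this is not the paper's algorithm. The function names `network_size` and `id_width_bytes` and the dictionary representation of the tree are assumptions made for illustration.

```python
from typing import Dict, List


def network_size(children: Dict[int, List[int]], root: int) -> int:
    """Count the nodes reachable from `root` in a tree of child lists."""
    total = 0
    stack = [root]
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(children.get(node, []))
    return total


def id_width_bytes(n_nodes: int) -> int:
    """Smallest whole number of bytes that can give n_nodes distinct IDs."""
    if n_nodes < 1:
        raise ValueError("the network must contain at least one node")
    # IDs 0 .. n_nodes - 1 need (n_nodes - 1).bit_length() bits,
    # rounded up to whole bytes, with a minimum of one byte.
    bits = (n_nodes - 1).bit_length()
    return max(1, (bits + 7) // 8)
```

Under this sketch, a network of 1,000 nodes needs 10 bits, so `id_width_bytes(1000)` gives a 2-byte ID.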

46 citations


Cited by
Journal ArticleDOI
TL;DR: Existing solutions and open research issues at the application, transport, network, link, and physical layers of the communication protocol stack are investigated, along with possible cross-layer synergies and optimizations.

2,311 citations

Journal Article
TL;DR: This work proposes a tiered system architecture in which data collected at numerous, inexpensive sensor nodes is filtered by local processing on its way through to larger, more capable and more expensive nodes.
Abstract: As new fabrication and integration technologies reduce the cost and size of micro-sensors and wireless interfaces, it becomes feasible to deploy densely distributed wireless networks of sensors and actuators. These systems promise to revolutionize biological, earth, and environmental monitoring applications, providing data at granularities unrealizable by other means. In addition to the challenges of miniaturization, new system architectures and new network algorithms must be developed to transform the vast quantity of raw sensor data into a manageable stream of high-level data. To address this, we propose a tiered system architecture in which data collected at numerous, inexpensive sensor nodes is filtered by local processing on its way through to larger, more capable and more expensive nodes. We briefly describe habitat monitoring as our motivating application and introduce initial system building blocks designed to support this application. The remainder of the paper presents details of our experimental platform.

454 citations

Journal ArticleDOI
22 Jan 2010
TL;DR: This paper explores how design in the moderate inversion region helps to recover some of that lost performance, while staying quite close to the minimum-energy point, and introduces a pass-transistor based logic family that excels in this operational region.
Abstract: Operation in the subthreshold region is most often synonymous with minimum-energy operation. Yet, the penalty in performance is huge. In this paper, we explore how design in the moderate inversion region helps to recover some of that lost performance, while staying quite close to the minimum-energy point. An energy-delay modeling framework that extends over the weak, moderate, and strong inversion regions is developed. The impact of activity and design parameters such as supply voltage and transistor sizing on the energy and performance in this operational region is derived. The quantitative benefits of operating in the near-threshold region are established using some simple examples. The paper shows that a 20% increase in energy from the minimum-energy point yields a tenfold gain in performance. Based on these observations, a pass-transistor based logic family that excels in this operational region is introduced. The logic family operates most of its logic in the above-threshold mode (using low-threshold transistors), yet containing leakage to only those in subthreshold. Operation below minimum-energy point of CMOS is demonstrated. In leakage-dominated ultralow-power designs, time-multiplexing will be shown to yield not only area, but also energy reduction due to lower leakage. Finally, the paper demonstrates the use of ultralow-power design techniques in chip synthesis.

391 citations

Journal ArticleDOI
TL;DR: This article reviews some research activities in WSN and reviews some CPS platforms and systems that have been developed recently, including health care, navigation, rescue, intelligent transportation, social networking, and gaming applications.

323 citations

Journal ArticleDOI
TL;DR: The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7%.
Abstract: A mechanistic model for out-of-order superscalar processors is developed and then applied to the study of microarchitecture resource scaling. The model divides execution time into intervals separated by disruptive miss events such as branch mispredictions and cache misses. Each type of miss event results in characterizable performance behavior for the execution time interval. By considering an interval's type and length (measured in instructions), execution time can be predicted for the interval. Overall execution time is then determined by aggregating the execution time over all intervals. The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7%. The mechanistic model is applied to the general problem of resource scaling in out-of-order superscalar processors. First, we use the model to determine size relationships among microarchitecture structures in a balanced processor design. Second, we use the mechanistic model to study scaling of both pipeline depth and width in balanced processor designs. We corroborate previous results in this area and provide new results. For example, we show that at optimal design points, the pipeline depth times the square root of the processor width is nearly constant. Finally, we consider the behavior of unbalanced, overprovisioned processor designs based on insight gained from the mechanistic model. We show that in certain situations an overprovisioned processor may lead to improved overall performance. Designs where a processor's dispatch width is wider than its issue width are of particular interest.
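The aggregation step described in the abstract, predicting time per interval from its type and length and then summing, can be sketched as below. It uses a deliberately simplified per-interval formula: instructions divided by dispatch width, plus a fixed penalty for the miss event that ends the interval. That formula, the function name `estimate_cycles`, and any penalty values supplied by the caller are assumptions for illustration; the paper's model characterizes each event type in more detail.

```python
import math
from typing import Iterable, Mapping, Tuple


def estimate_cycles(
    intervals: Iterable[Tuple[int, str]],
    width: int,
    penalties: Mapping[str, int],
) -> int:
    """Sum estimated cycles over (length_in_instructions, miss_event) intervals."""
    if width < 1:
        raise ValueError("dispatch width must be at least 1")

    total = 0
    for length, event in intervals:
        # Steady-state part: instructions flow at `width` per cycle.
        base = math.ceil(length / width)
        # Disruption part: a fixed, caller-supplied penalty per event type.
        total += base + penalties[event]
    return total
```

For instance, with a width of 4 and illustrative penalties of 15 cycles for a branch misprediction and 200 for a long cache miss, the intervals `[(400, "branch_mispredict"), (100, "l2_miss")]` come to 100 + 15 + 25 + 200 = 340 cycles.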

168 citations