https://www.cs.montana.edu/courses/csci566/Ref/WMSN-Survey.pdf

A survey on wireless multimedia sensor networks

As new fabrication and integration technologies reduce the cost and size of micro-sensors and wireless interfaces, it becomes feasible to deploy densely distributed wireless networks of sensors and actuators. These systems promise to revolutionize biological, earth, and environmental monitoring applications, providing data at granularities unrealizable by other means. In addition to the challenges of miniaturization, new system architectures and new network algorithms must be developed to transform the vast quantity of raw sensor data into a manageable stream of high-level data. To address this, we propose a tiered system architecture in which data collected at numerous, inexpensive sensor nodes is filtered by local processing on its way through to larger, more capable and more expensive nodes.We briefly describe Habitat monitoring as our motivating application and introduce initial system building blocks designed to support this application. The remainder of the paper presents details of our experimental platform.

Habitat monitoring: Application driver for wireless communications technology

Operation in the subthreshold region most often is synonymous to minimum-energy operation. Yet, the penalty in performance is huge. In this paper, we explore how design in the moderate inversion region helps to recover some of that lost performance, while staying quite close to the minimum-energy point. An energy-delay modeling framework that extends over the weak, moderate, and strong inversion regions is developed. The impact of activity and design parameters such as supply voltage and transistor sizing on the energy and performance in this operational region is derived. The quantitative benefits of operating in near-threshold region are established using some simple examples. The paper shows that a 20% increase in energy from the minimum-energy point gives back ten times in performance. Based on these observations, a pass-transistor based logic family that excels in this operational region is introduced. The logic family operates most of its logic in the above-threshold mode (using low-threshold transistors), yet containing leakage to only those in subthreshold. Operation below minimum-energy point of CMOS is demonstrated. In leakage-dominated ultralow-power designs, time-multiplexing will be shown to yield not only area, but also energy reduction due to lower leakage. Finally, the paper demonstrates the use of ultralow-power design techniques in chip synthesis.

/pdf/ultralow-power-design-in-near-threshold-region-37ua2hp61u.pdf

Ultralow-Power Design in Near-Threshold Region

https://ir.nctu.edu.tw:443/bitstream/11536/15393/1/000299669400001.pdf

Review: From wireless sensor networks towards cyber physical systems

A mechanistic model for out-of-order superscalar processors is developed and then applied to the study of microarchitecture resource scaling. The model divides execution time into intervals separated by disruptive miss events such as branch mispredictions and cache misses. Each type of miss event results in characterizable performance behavior for the execution time interval. By considering an interval's type and length (measured in instructions), execution time can be predicted for the interval. Overall execution time is then determined by aggregating the execution time over all intervals. The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7p.The mechanistic model is applied to the general problem of resource scaling in out-of-order superscalar processors. First, we use the model to determine size relationships among microarchitecture structures in a balanced processor design. Second, we use the mechanistic model to study scaling of both pipeline depth and width in balanced processor designs. We corroborate previous results in this area and provide new results. For example, we show that at optimal design points, the pipeline depth times the square root of the processor width is nearly constant. Finally, we consider the behavior of unbalanced, overprovisioned processor designs based on insight gained from the mechanistic model. We show that in certain situations an overprovisioned processor may lead to improved overall performance. Designs where a processor's dispatch width is wider than its issue width are of particular interest.

https://biblio.ugent.be/publication/2093763/file/2093773.pdf

A mechanistic performance model for superscalar out-of-order processors

Embodiments detailed herein relate to systems and methods to load a tile register pair. In one example, a processor includes: decode circuitry to decode a load matrix pair instruction having fields for an opcode and source and destination identifiers to identify source and destination matrices, respectively, each matrix having a PAIR parameter equal to TRUE; and execution circuitry to execute the decoded load matrix pair instruction to load every element of left and right tiles of the identified destination matrix from corresponding element positions of left and right tiles of the identified source matrix, respectively, wherein the executing operates on one row of the identified destination matrix at a time, starting with the first row.

Systems and methods to load a tile register pair

In one embodiment, a processor includes: a fetch logic to fetch instructions, the instructions including a partial reduction instruction; a decode logic to decode the partial reduction instruction and provide the decoded partial reduction instruction to one or more execution units; and the one or more execution units to, responsive to the decoded partial reduction instruction, perform a plurality of N partial reduction operations to generate an result array including N output data elements, where an input array comprises N lanes, and where each of the N partial reduction operations is to reduce a set of input data elements included in a corresponding lane of the N lanes. Other embodiments are described and claimed.

Instruction and logic for partial reduction operations

An example processor includes a register and a fused multiply-add (FMA) low functional unit. The register stores first, second, and third floating point (FP) values. The FMA low functional unit receives a request to perform an FMA low operation: multiplies the first FP value with the second FP value to obtain a first product value; adds the first product with the third FP value to generate a first result value; rounds the first result to generate a first FMA value; multiplies the first FP value with the second FP value to obtain a second product value; adds the second product value with the third FP value to generate a second result value; and subtracts the FMA value from the second result value to obtain a third result value, which can then be normalized and rounded (FMA low result) and sent the FMA low result to an application.

Fused Multiply-Add (FMA) low functional unit

In an embodiment, a processor includes a plurality of cores, with at least one core including a cancellation monitor unit. The cancellation monitor unit comprises circuitry to: detect an execution of a floating point (FP) instruction in the core, wherein the execution of the FP instruction uses a set of FP inputs and generates an FP output; determine a maximum exponent value associated with the set of FP inputs to the FP instruction; subtract an exponent value of the FP output from the maximum exponent value to obtain an exponent difference; and in response to a determination that the exponent difference meets or exceeds a threshold level, increment a cancellation event count. Other embodiments are described and claimed.

Hardware cancellation monitor for floating point operations

Instructions and logic provide vector horizontal compare functionality. Some embodiments, responsive to an instruction specifying: a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand; read values from data fields of the specified size in the source operand, corresponding to the mask and compare the values for equality. In some embodiments, responsive to a detection of inequality, a trap may be taken. In some alternative embodiments, a flag may be set. In other alternative embodiments, a mask field may be set to a masked state for the corresponding unequal value(s). In some embodiments, responsive to all unmasked data fields of the source operand being equal to a particular value, that value may be broadcast to all data fields of the specified size in the destination operand.

Elmoustapha Ould-Ahmed-Vall

Papers

Systems and methods to load a tile register pair

Instruction and logic for partial reduction operations

Fused Multiply-Add (FMA) low functional unit

Hardware cancellation monitor for floating point operations

Providing vector horizontal compare functionality within a vector register