
Showing papers in "IEEE Transactions on Computers in 2008"


Journal Article•DOI•
TL;DR: Results confirm the unique benefits for future generations of CMPs that can be achieved by bringing optics into the chip in the form of photonic NoCs, as well as a comparative power analysis of a photonic versus an electronic NoC.
Abstract: The design and performance of next-generation chip multiprocessors (CMPs) will be bound by the limited amount of power that can be dissipated on a single die. We present photonic networks-on-chip (NoC) as a solution to reduce the impact of intra-chip and off-chip communication on the overall power budget. A photonic interconnection network can deliver higher bandwidth and lower latencies with significantly lower power dissipation. We explain why on-chip photonic communication has recently become a feasible opportunity and explore the challenges that need to be addressed to realize its implementation. We introduce a novel hybrid micro-architecture for NoCs combining a broadband photonic circuit-switched network with an electronic overlay packet-switched control network. We address the critical design issues including: topology, routing algorithms, deadlock avoidance, and path-setup/tear-down procedures. We present experimental results obtained with POINTS, an event-driven simulator specifically developed to analyze the proposed idea, as well as a comparative power analysis of a photonic versus an electronic NoC. Overall, these results confirm the unique benefits for future generations of CMPs that can be achieved by bringing optics into the chip in the form of photonic NoCs.

873 citations


Journal Article•DOI•
TL;DR: Simulation experiments show that an Ethernet link with ALR can operate at a lower data rate for over 80 percent of the time, yielding significant energy savings with only a very small increase in packet delay.
Abstract: The rapidly increasing energy consumption by computing and communications equipment is a significant economic and environmental problem that needs to be addressed. Ethernet network interface controllers (NICs) in the US alone consume hundreds of millions of US dollars in electricity per year. Most Ethernet links are underutilized and link energy consumption can be reduced by operating at a lower data rate. In this paper, we investigate adaptive link rate (ALR) as a means of reducing the energy consumption of a typical Ethernet link by adaptively varying the link data rate in response to utilization. Policies to determine when to change the link data rate are studied. Simple policies that use output buffer queue length thresholds and fine-grain utilization monitoring are shown to be effective. A Markov model of a state-dependent service rate queue with rate transitions only at service completion is used to evaluate the performance of ALR with respect to the mean packet delay, the time spent in an energy-saving low link data rate, and the oscillation of link data rates. Simulation experiments using actual and synthetic traffic traces show that an Ethernet link with ALR can operate at a lower data rate for over 80 percent of the time, yielding significant energy savings with only a very small increase in packet delay.
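
The queue-threshold policies the paper studies lend themselves to a compact illustration. Below is a minimal sketch of a dual-threshold rate-switching rule in Python; the rate values and the threshold names Q_LOW/Q_HIGH are illustrative assumptions, not the paper's measured parameters.

```python
# Illustrative dual-threshold ALR policy (not the paper's exact algorithm):
# switch the link to a low rate when the output queue drains below Q_LOW,
# and back to the high rate when it grows beyond Q_HIGH.

LOW_RATE, HIGH_RATE = 100e6, 1e9   # bits/s (hypothetical 100 Mb/s / 1 Gb/s link)
Q_LOW, Q_HIGH = 2, 32              # queue-length thresholds in packets (assumed)

def next_rate(current_rate, queue_len):
    """Return the link rate for the next interval given the queue length."""
    if current_rate == HIGH_RATE and queue_len <= Q_LOW:
        return LOW_RATE            # link is underutilized: save energy
    if current_rate == LOW_RATE and queue_len >= Q_HIGH:
        return HIGH_RATE           # queue is building up: limit packet delay
    return current_rate            # hysteresis band: keep the current rate

# Example: a drained queue at the high rate triggers a downshift.
assert next_rate(HIGH_RATE, 1) == LOW_RATE
```

The hysteresis band between the two thresholds is what limits the rate oscillation that the paper's Markov model quantifies.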

423 citations


Journal Article•DOI•
TL;DR: In this paper, a new coding scheme, called the STAR code, is proposed for correcting triple storage node failures (erasures); it is an extension of the double-erasure-correcting EVENODD code.
Abstract: Proper data placement schemes based on erasure-correcting codes are one of the most important components of a highly available data storage system. For such schemes, low decoding complexity for correcting (or recovering) storage node failures is essential in practical systems. In this paper, we describe a new coding scheme, which we call the STAR code, for correcting triple storage node failures (erasures). The STAR code is an extension of the double-erasure-correcting EVENODD code and a modification of the generalized triple-erasure-correcting EVENODD code. The STAR code is a Maximum Distance Separable (MDS) code and is thus optimal in terms of node failure recovery capability for a given data redundancy. We provide detailed STAR code decoding algorithms for correcting various triple node failures. We show that the decoding complexity of the STAR code is much lower than those of existing comparable codes; the STAR code is therefore of real practical value for storage systems that need higher reliability.
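
To make the triple-parity idea concrete, here is a toy Python sketch in the spirit of EVENODD/STAR: a (p-1) x p data array gets horizontal, diagonal, and anti-diagonal XOR parity columns. It deliberately omits the EVENODD-style adjuster terms and all of the paper's decoding algorithms, so it illustrates the layout only, not the actual STAR code.

```python
# Toy triple-XOR-parity layout in the spirit of EVENODD/STAR (adjuster
# terms and real decoding omitted): p is prime, data is (p-1) rows x p
# columns of byte-sized symbols.
from functools import reduce

p = 5
data = [[(3 * r + 7 * c) % 256 for c in range(p)] for r in range(p - 1)]

horiz = [0] * (p - 1)   # row parity
diag  = [0] * p         # (r + c) mod p parity
anti  = [0] * p         # (r - c) mod p parity
for r in range(p - 1):
    for c in range(p):
        horiz[r] ^= data[r][c]
        diag[(r + c) % p] ^= data[r][c]
        anti[(r - c) % p] ^= data[r][c]

# Single-column recovery from horizontal parity alone, as a sanity check:
lost = 2
recovered = [horiz[r] ^ reduce(lambda x, y: x ^ y,
             (data[r][c] for c in range(p) if c != lost))
             for r in range(p - 1)]
assert recovered == [row[lost] for row in data]
```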

264 citations


Journal Article•DOI•
TL;DR: This paper presents an architecture of a state-of-the-art processor for RFID tags with an elliptic curve (EC) processor over GF(2^163) and shows the plausibility of meeting both security and efficiency requirements even in a passive RFID tag.
Abstract: RFID (radio frequency identification) tags need to include security functions, yet at the same time their resources are extremely limited. Moreover, to provide privacy, authentication, and protection against tracking of RFID tags without losing system scalability, a public-key-based approach is inevitable, as shown by M. Burmester et al. In this paper, we present an architecture of a state-of-the-art processor for RFID tags with an elliptic curve (EC) processor over GF(2^163). It shows the plausibility of meeting both security and efficiency requirements even in a passive RFID tag. The proposed processor is able to perform EC scalar multiplications as well as general modular arithmetic (additions and multiplications), which are needed for the cryptographic protocols. As we work with large numbers, the register file is the most critical component in the architecture. By combining several techniques, we are able to reduce the number of registers from 9 to 6, resulting in an EC processor of 10.1 Kgates. To obtain efficient modulo arithmetic, we introduce a redundant modular operation. Moreover, the proposed architecture can support multiple cryptographic protocols. The synthesis results with a 0.13-μm CMOS technology show that the gate area of the most compact version is 12.5 Kgates.

253 citations


Journal Article•DOI•
TL;DR: A new counter-based approach deals with dead lines and cache pollution: lines predicted to be dead are replaced early from the L2 cache, never-reaccessed lines are identified and bypassed, and each L2 line is augmented with an event counter that is incremented when an event of interest, such as certain cache accesses, occurs.
Abstract: Recent studies have shown that, in highly associative caches, the performance gap between the least recently used (LRU) and the theoretical optimal replacement algorithms is large, motivating the design of alternative replacement algorithms to improve cache performance. In LRU replacement, a line, after its last use, remains in the cache for a long time until it becomes the LRU line. Such dead lines unnecessarily reduce the cache capacity available for other lines. In addition, in multilevel caches, temporal reuse patterns are often inverted, showing in the L1 cache but, due to the filtering effect of the L1 cache, not showing in the L2 cache. At the L2, these lines appear to be brought into the cache but are never reaccessed until they are replaced. These lines unnecessarily pollute the L2 cache. This paper proposes a new counter-based approach to deal with the above problems. For the former problem, we predict lines that have become dead and replace them early from the L2 cache. For the latter problem, we identify never-reaccessed lines, bypass the L2 cache, and place them directly in the L1 cache. Both techniques are achieved through a single counter-based mechanism. In our approach, each line in the L2 cache is augmented with an event counter that is incremented when an event of interest, such as certain cache accesses, occurs. When the counter reaches a threshold, the line "expires" and becomes replaceable. Each line's threshold is unique and is dynamically learned. We propose and evaluate two new replacement algorithms: Access interval predictor (AIP) and live-time predictor (LvP). AIP and LvP speed up 10 capacity-constrained SPEC2000 benchmarks by up to 48 percent and 15 percent on average (7 percent on average for the whole 21 SPEC2000 benchmarks). Cache bypassing further reduces L2 cache pollution and improves the average speedups to 17 percent (8 percent for the whole 21 SPEC2000 benchmarks).
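
A minimal Python sketch of the counter-based expiry mechanism described above; the class layout, the fixed threshold, and the victim-selection fallback are illustrative assumptions (the paper learns each line's threshold dynamically and pairs the counters with the AIP/LvP prediction logic).

```python
# Minimal sketch of a counter-based "expiry" predictor for an L2 line,
# in the spirit of the paper's AIP/LvP predictors (thresholds here are
# fixed; the real design learns one per line).

class L2Line:
    def __init__(self, tag, threshold=4):
        self.tag = tag
        self.counter = 0            # event counter (e.g., accesses to the set)
        self.threshold = threshold  # dynamically learned in the real scheme
        self.expired = False

    def on_event(self):
        """Increment on an event of interest, e.g., an access to this set."""
        self.counter += 1
        if self.counter >= self.threshold:
            self.expired = True     # line becomes a preferred victim

    def on_access(self):
        self.counter = 0            # reuse resets the interval count
        self.expired = False

def pick_victim(lines):
    """Prefer an expired (predicted-dead) line; otherwise fall back to LRU,
    represented here by the first element of an LRU-ordered list."""
    for line in lines:
        if line.expired:
            return line
    return lines[0]
```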

230 citations


Journal Article•DOI•
TL;DR: A novel selection strategy based on the concept of Neighbors-on-Path is presented that can be coupled with any adaptive routing algorithm to exploit the situations of indecision occurring when the routing function returns several admissible output channels.
Abstract: Efficient and deadlock-free routing is critical to the performance of networks-on-chip. The effectiveness of any adaptive routing algorithm strongly depends on the underlying selection strategy. A selection function is used to select the output channel on which the packet will be forwarded. In this paper, we present a novel selection strategy that can be coupled with any adaptive routing algorithm. The proposed selection strategy is based on the concept of Neighbors-on-Path, the aim of which is to exploit the situations of indecision occurring when the routing function returns several admissible output channels. The overall objective is to choose the channel that will allow the packet to be routed to its destination along a path that is as free as possible of congested nodes. Performance evaluation is carried out by using a flit-accurate simulator under traffic scenarios generated by both synthetic and real applications. Results show that the proposed selection strategy, applied to the Odd-Even routing algorithm, yields improvements in average delay and saturation point of up to 20% and 30% on average, respectively, with minimal overhead in terms of area occupation. In addition, a positive effect on total energy consumption is also observed under near-congestion packet injection rates.
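
Where a selection strategy plugs in can be shown in a few lines. The sketch below picks, among the admissible outputs returned by the routing function, the neighbor with the most free buffer slots; the real Neighbors-on-Path strategy aggregates congestion information from nodes along the path, which this simplified stand-in does not do.

```python
# Hedged sketch of a congestion-aware selection function for an adaptive
# NoC router: the routing function proposes admissible output channels,
# and the selection function breaks the tie.

def select_output(admissible, free_slots):
    """admissible: list of output port ids from the routing function.
    free_slots: dict mapping port id -> free buffer slots at the neighbor."""
    return max(admissible, key=lambda port: free_slots[port])

# Example: routing returns two admissible ports; the less congested wins.
assert select_output(["north", "east"], {"north": 1, "east": 3}) == "east"
```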

226 citations


Journal Article•DOI•
TL;DR: This paper designs and implements a synchronous and distributed medium access control (SD-MAC) protocol and proposes a QoS-aware SD-MAC to ensure the serviceability of the entire system and to improve its bandwidth utilization.
Abstract: To bridge the widening gap between computation requirements and communication efficiency faced by gigascale heterogeneous SoCs in the upcoming ubiquitous era, a new on-chip communication system, dubbed Wireless Network-on-Chip (WNoC), is introduced by using the recently developed CMOS UWB wireless interconnection technology. In this paper, a synchronous and distributed medium access control (SD-MAC) protocol is designed and implemented. Tailored for WNoC, SD-MAC employs a binary countdown approach to resolve channel contention between RF nodes. The receiver_select_sender mechanism and hidden terminal elimination scheme are proposed to increase the throughput and channel utilization of the system. Our simulation study shows the promising performance of SD-MAC in terms of throughput, latency, and network utilization. We further propose a QoS-aware SD-MAC to ensure the serviceability of the entire system and to improve the bandwidth utilization. As a major component of the simple and compact RF node design, a MAC unit implements the proposed SD-MAC that guarantees correct operation of synchronized frames while keeping overhead low. The synthesis results demonstrate several attractive features such as high speed, low power consumption, good scalability, and low area cost.
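
The binary countdown arbitration that SD-MAC builds on can be modeled in a few lines of Python: contending nodes emit their priority IDs bit by bit, MSB first, onto a wired-OR-like shared medium, and a node that sends 0 while observing 1 withdraws. The node IDs and bit width below are illustrative.

```python
# Toy model of binary-countdown contention resolution: the highest
# priority ID always wins, collision-free.

def binary_countdown(contenders, width=8):
    active = set(contenders)
    for bit in reversed(range(width)):           # MSB first
        sent = {node: (node >> bit) & 1 for node in active}
        bus = max(sent.values())                 # wired-OR of all sent bits
        active = {n for n in active if sent[n] == bus}
    (winner,) = active                           # IDs are distinct, so one remains
    return winner

assert binary_countdown({0x2A, 0x5C, 0x17}) == 0x5C
```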

196 citations


Journal Article•DOI•
TL;DR: A high-performance architecture for elliptic curve scalar multiplication based on the Montgomery ladder method over the finite field GF(2^m) is proposed, and a pseudo-pipelined word-serial finite field multiplier with word size w, suitable for the scalar multiplication, is developed.
Abstract: A high-performance architecture for elliptic curve scalar multiplication based on the Montgomery ladder method over the finite field GF(2^m) is proposed. A pseudo-pipelined word-serial finite field multiplier with word size w, suitable for the scalar multiplication, is also developed. Implemented in hardware, this system performs a scalar multiplication in approximately 6⌈m/w⌉(m−1) clock cycles, and the gate delay in the critical path is equal to T_AND + ⌈log₂(w/k)⌉·T_XOR, where T_AND and T_XOR are the delays due to two-input AND and XOR gates, respectively, and 1 ≤ k ≪ w is used to shorten the critical path.
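
The Montgomery ladder's regular structure, one add and one double per scalar bit regardless of the bit's value, is what such hardware pipelines exploit. Here is a sketch of the ladder over plain integers, with integer addition standing in for the EC point operations over GF(2^m):

```python
# Structure of the Montgomery ladder, runnable anywhere: "add" and
# "double" are stand-ins for the EC group operations. The invariant
# R1 == R0 + P holds throughout, and every bit costs exactly one add
# plus one double.

def montgomery_ladder(k, P, add=lambda a, b: a + b, double=lambda a: 2 * a):
    R0, R1 = 0, P                     # 0 plays the role of the identity
    for bit in bin(k)[2:]:            # scan scalar bits MSB -> LSB
        if bit == '0':
            R1 = add(R0, R1)
            R0 = double(R0)
        else:
            R0 = add(R0, R1)
            R1 = double(R1)
    return R0                         # equals k * P

assert montgomery_ladder(23, 7) == 23 * 7
```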

166 citations


Journal Article•DOI•
TL;DR: A number of properties of AC(P) used to symbolically simplify and handle connectors are provided, including a general component model encompassing methods for incremental model decomposition and efficient implementation by using symbolic techniques.
Abstract: We provide an algebraic formalization of connectors in the BIP component framework. A connector relates a set of typed ports. Types are used to describe different modes of synchronization, in particular, rendezvous and broadcast. Connectors on a set of ports P are modeled as terms of the algebra AC(P), generated from P by using a binary fusion operator and a unary typing operator. Typing associates with terms (ports or connectors) synchronization types - trigger or synchron - that determine modes of synchronization. Broadcast interactions are initiated by triggers. Rendezvous is a maximal interaction of a connector that includes only synchrons. The semantics of AC(P) associates with a connector the set of its interactions. It induces on connectors an equivalence relation which is not a congruence, as it is not stable under fusion. We provide a number of properties of AC(P) used to symbolically simplify and handle connectors. We provide examples illustrating applications of AC(P), including a general component model encompassing methods for incremental model decomposition and efficient implementation by using symbolic techniques.
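
The interaction semantics sketched above can be enumerated directly. The Python sketch below assumes the usual BIP reading: a connector with at least one trigger admits every interaction containing a trigger, while a trigger-free connector admits only the maximal, synchron-only interaction (rendezvous). The port names are hypothetical.

```python
from itertools import chain, combinations

def interactions(ports):
    """ports: dict name -> 'trigger' | 'synchron'. Returns the interaction
    set under the semantics sketched in the abstract (assumed reading)."""
    names = sorted(ports)
    triggers = {p for p in names if ports[p] == 'trigger'}
    if not triggers:
        return {frozenset(names)}          # rendezvous: maximal interaction
    subsets = chain.from_iterable(
        combinations(names, r) for r in range(1, len(names) + 1))
    return {frozenset(s) for s in subsets if triggers & set(s)}

# Broadcast: sender s is a trigger, receivers r1, r2 are synchrons.
print(sorted(map(sorted, interactions(
    {'s': 'trigger', 'r1': 'synchron', 'r2': 'synchron'}))))
# -> [['r1', 'r2', 's'], ['r1', 's'], ['r2', 's'], ['s']]
```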

162 citations


Journal Article•DOI•
Tim Güneysu, Timo Kasper, Martin Novotny, Christof Paar, Andy Rupp
TL;DR: This work describes various exhaustive key search attacks on symmetric ciphers and demonstrates an attack on a security mechanism employed in the electronic passport, and introduces efficient implementations of more complex cryptanalysis on asymmetric cryptosystems, e.g., elliptic curve cryptosystems (ECCs) and number cofactorization for RSA.
Abstract: Cryptanalysis of ciphers usually involves massive computations. The security parameters of cryptographic algorithms are commonly chosen so that attacks are infeasible with available computing resources. Thus, in the absence of mathematical breakthroughs to a cryptanalytical problem, a promising way for tackling the computations involved is to build special-purpose hardware exhibiting a (much) better performance-cost ratio than off-the-shelf computers. This contribution presents a variety of cryptanalytical applications utilizing the cost-optimized parallel code breaker (COPACOBANA) machine, which is a high-performance low-cost cluster consisting of 120 field-programmable gate arrays (FPGAs). COPACOBANA appears to be the only such reconfigurable parallel FPGA machine optimized for code breaking tasks reported in the open literature. Depending on the actual algorithm, the parallel hardware architecture can outperform conventional computers by several orders of magnitude. In this work, we focus on novel implementations of cryptanalytical algorithms, utilizing the impressive computational power of COPACOBANA. We describe various exhaustive key search attacks on symmetric ciphers and demonstrate an attack on a security mechanism employed in the electronic passport (e-passport). Furthermore, we describe time-memory trade-off techniques that can, e.g., be used for attacking the popular A5/1 algorithm used in GSM voice encryption. In addition, we introduce efficient implementations of more complex cryptanalysis on asymmetric cryptosystems, e.g., elliptic curve cryptosystems (ECCs) and number cofactorization for RSA. Even though breaking RSA or elliptic curves with parameter lengths used in most practical applications is out of reach with COPACOBANA, our attacks on algorithms with artificially short bit lengths allow us to extrapolate more reliable security estimates for real-world bit lengths. This is particularly useful for deriving estimates about the longevity of asymmetric key lengths.
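
Exhaustive key search parallelizes exactly the way COPACOBANA exploits: the keyspace is statically partitioned across engines, each testing its slice against a known plaintext/ciphertext pair. The Python skeleton below uses a toy 16-bit cipher as a stand-in for DES or similar; all names and constants are illustrative.

```python
# Skeleton of a partitioned exhaustive key search. Each "engine" (an FPGA
# in COPACOBANA, a loop iteration here) scans a disjoint keyspace slice.

def toy_encrypt(key, pt):
    return ((pt ^ key) * 0x9E37) % (1 << 16)    # illustrative 16-bit cipher

def search_slice(pt, ct, start, end):
    for k in range(start, end):
        if toy_encrypt(k, pt) == ct:
            return k
    return None

PT = 0x1234
CT = toy_encrypt(0xBEEF, PT)                    # secret key to recover
N_ENGINES, SPACE = 8, 1 << 16
slices = [(i * SPACE // N_ENGINES, (i + 1) * SPACE // N_ENGINES)
          for i in range(N_ENGINES)]
hits = [search_slice(PT, CT, a, b) for a, b in slices]
print([hex(k) for k in hits if k is not None])  # -> ['0xbeef']
```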

157 citations


Journal Article•DOI•
TL;DR: This paper addresses performance issues with exact response time analysis (RTA) for fixed-priority preemptive systems; initial values are introduced that improve the efficiency of the standard RTA algorithm both when exact response times are required and when only exact schedulability needs to be determined.
Abstract: Efficient exact schedulability tests are required both for on-line admission of applications to dynamic systems and as an integral part of design tools for complex distributed real-time systems. This paper addresses performance issues with exact response time analysis (RTA) for fixed priority preemptive systems. Initial values are introduced that improve the efficiency of the standard RTA algorithm (i) when exact response times are required, and (ii) when only exact schedulability need be determined. The paper also explores modifications to the standard RTA algorithm, including: the use of a response time upper bound to determine when exact analysis is needed, incremental computation aimed at faster convergence, and checking tasks in reverse priority order to identify unschedulable task sets early. The various initial values and algorithm implementations are compared by means of experiments on a PC recording the number of iterations required, and execution time measurements on a real-time embedded microprocessor. Recommendations are provided for engineers tasked with the problem of implementing exact schedulability tests, as part of on-line acceptance tests and spare capacity allocation algorithms, or as part of off-line system design tools.
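
The standard RTA recurrence the paper accelerates is the fixed-point iteration R_{n+1} = C_i + Σ_{j ∈ hp(i)} ⌈R_n / T_j⌉ · C_j. A minimal Python version follows, with an R0 hook for improved initial values; this sketch assumes deadlines equal periods, which the paper does not require.

```python
import math

def response_time(C, T, i, R0=None):
    """Exact RTA fixed-point iteration for task i (0 = highest priority).
    R0 is the initial value; a good choice of R0 (rather than the usual
    R0 = C[i]) cuts the number of iterations, which is the paper's point."""
    R = C[i] if R0 is None else R0
    while True:
        R_next = C[i] + sum(math.ceil(R / T[j]) * C[j] for j in range(i))
        if R_next == R:
            return R                 # converged: exact response time
        if R_next > T[i]:
            return None              # deadline (= period in this sketch) missed
        R = R_next

# Classic example: three tasks, rate-monotonic priorities.
C, T = [1, 2, 3], [4, 6, 12]
print([response_time(C, T, i) for i in range(3)])   # -> [1, 3, 10]
```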

Journal Article•DOI•
TL;DR: An exact technique is presented to chart the Pareto space of throughput and storage trade-offs, which can be used to determine the minimal buffer space needed to execute a graph under a given throughput constraint.
Abstract: Multimedia applications usually have throughput constraints. An implementation must meet these constraints while minimizing resource usage and energy consumption. The compute-intensive kernels of these applications are often specified as cyclo-static or synchronous dataflow graphs. Communication between nodes in these graphs requires storage space, which influences throughput. We present an exact technique to chart the Pareto space of throughput and storage trade-offs, which can be used to determine the minimal buffer space needed to execute a graph under a given throughput constraint. The feasibility of the exact technique is demonstrated with experiments on a set of realistic DSP and multimedia applications. To increase the scalability of the approach, a fast approximation technique is developed that guarantees both the throughput and a tight bound on the maximal overestimation of buffer requirements. The approximation technique allows worst-case overestimation to be traded off against run-time.

Journal Article•DOI•
TL;DR: Simulations show that the improved FEC and CNHA/CWA schemes outperform the most recent O(log₂|T|) schemes in terms of lookup time, update time, and memory requirement.
Abstract: Dynamic IP router table schemes, which have recently been proposed in the literature, perform an IP lookup or an online prefix update in O(log₂|T|) memory accesses (MAs). In terms of lookup time, they are still slower than the full expansion/compression (FEC) scheme (compressed next-hop array/code word array (CNHA/CWA)), which requires exactly (at most) three MAs, irrespective of the number of prefixes |T| in a routing table T. The prefix updates in both FEC and CNHA/CWA have a drawback: inefficient offline structure reconstruction is arguably the only viable solution. This paper solves the problem. We propose the use of lexicographically ordered prefixes to reduce the offline construction time of both schemes. Simulations on several real routing databases, run on the same platform, show that our approach constructs FEC (CNHA/CWA) tables 2.68 to 7.54 (4.57 to 6) times faster than previous techniques. We also propose an online update scheme that, using an updatable address set and selectively decompressing the FEC and CNHA/CWA structures, modifies only the next hops of the addresses in the set. Recompressing the updated structures, the resulting forwarding tables are identical to those obtained by structure reconstructions, but are obtained at much lower computational cost. Our simulations show that the improved FEC and CNHA/CWA outperform the most recent O(log₂|T|) schemes in terms of lookup time, update time, and memory requirement.
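
The expand-then-compress idea behind FEC/CNHA-CWA fits in a toy example. The Python sketch below works on a 4-bit address space with hypothetical prefixes and next hops; real structures operate on 32-bit addresses with engineered code-word/next-hop arrays, and the paper's contribution is constructing and updating them efficiently.

```python
# Toy full expansion/compression: expand prefixes into a flat next-hop
# array (longest prefix wins), then compress it into change points.

BITS = 4
prefixes = {"0000/1": "A", "0100/3": "B", "0110/4": "C"}  # bits/length -> hop

def expand(prefixes):
    hops = [None] * (1 << BITS)
    for pfx in sorted(prefixes, key=lambda p: int(p.split("/")[1])):
        bits, length = pfx.split("/")
        length = int(length)
        base = int(bits, 2) & ~((1 << (BITS - length)) - 1)
        for a in range(base, base + (1 << (BITS - length))):
            hops[a] = prefixes[pfx]           # longer prefixes overwrite
    return hops

def compress(hops):
    return [(a, h) for a, h in enumerate(hops) if a == 0 or h != hops[a - 1]]

print(compress(expand(prefixes)))
# -> [(0, 'A'), (4, 'B'), (6, 'C'), (7, 'A'), (8, None)]
```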

Journal Article•DOI•
TL;DR: A novel approach based on a Double-Data-Rate (DDR) computation template is proposed, which is compared to other existing architectures and countermeasures, and a thorough dependability analysis is given.
Abstract: Differential Fault Analysis (DFA) is one of the most powerful techniques to attack cryptosystems. Several countermeasures have been proposed, which are based either on information or temporal redundancy. In this work, we propose a novel approach based on a Double-Data-Rate (DDR) computation template. A few sample architectures have been implemented: they are compared to other existing architectures and countermeasures, and a thorough dependability analysis is given.

Journal Article•DOI•
TL;DR: This paper presents a secure NoC architecture composed of a set of data protection units (DPUs) implemented within the network interfaces, and focuses on the dynamic updating of the DPUs to support their utilization in dynamic environments, and on the utilization of authentication techniques to increase the level of security.
Abstract: Security is gaining increasing relevance in the development of embedded devices. Working towards a secure system at each level of design, this paper addresses security aspects related to network-on-chip (NoC) architectures, foreseen as the communication infrastructure of next-generation embedded devices. In the context of NoC-based multiprocessor systems, we focus on the not yet thoroughly addressed topic of data protection. In this paper, we present a secure NoC architecture composed of a set of data protection units (DPUs) implemented within the network interfaces. The run-time configuration of the programmable part of the DPUs is managed by a central unit, the network security manager (NSM). The DPU, similar to a firewall, can check and limit the access rights (none, read, write, or both) of processors accessing data and instructions in a shared memory - in particular, distinguishing between the operating roles (supervisor/user and secure/unsecure) of the processing elements. We explore different alternative implementations for the DPU and demonstrate how this unit does not affect the network latency if the memory request has the appropriate rights. We also focus on the dynamic updating of the DPUs to support their utilization in dynamic environments, and on the utilization of authentication techniques to increase the level of security.
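
A firewall-style rights check of the kind the DPU performs can be sketched as a table lookup; the initiator/role/block names and the table layout below are illustrative, and the real DPU is a hardware block in the network interface whose programmable part the NSM updates at run time.

```python
# Minimal sketch of a DPU-style access check: a table keyed by
# (initiator, role, memory block) holding the allowed access mode.

NONE, READ, WRITE, BOTH = 0, 1, 2, 3             # access-right encoding

rights = {
    ("cpu0", "supervisor", "block0"): BOTH,
    ("cpu0", "user",       "block0"): READ,
    ("cpu1", "user",       "block0"): NONE,
}

def dpu_check(initiator, role, block, is_write):
    allowed = rights.get((initiator, role, block), NONE)  # default deny
    return allowed & (WRITE if is_write else READ) != 0

assert dpu_check("cpu0", "user", "block0", is_write=False)
assert not dpu_check("cpu0", "user", "block0", is_write=True)
assert not dpu_check("cpu1", "user", "block0", is_write=False)
```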

Journal Article•DOI•
TL;DR: The problem of finding an energy-efficient, collision-free polling schedule is shown to be NP-hard, and a fast online approximation algorithm is given; results show that the polling scheme can reduce the active time of sensors by a significant amount while sustaining 100 percent throughput.
Abstract: In this paper, we study two-layered heterogeneous sensor networks where two types of nodes are deployed: the basic sensor nodes and the cluster head nodes. The basic sensor nodes are simple and have limited power supplies, whereas the cluster head nodes are much more powerful and have many more power supplies, which organize sensors around them into clusters. Such two-layered heterogeneous sensor networks have better scalability and lower overall cost than homogeneous sensor networks. We propose using polling to collect data from sensors to the cluster head since polling can prolong network life by avoiding collisions and reducing the idle listening time of sensors. We focus on finding energy-efficient and collision-free polling schedules in a multihop cluster. To reduce energy consumption in idle listening, a schedule is optimal if it uses the minimum time. We show that the problem of finding an optimal schedule is NP-hard and then give a fast online algorithm to solve it approximately. We also consider dividing a cluster into sectors and using multiple nonoverlapping frequency channels to further reduce the idle listening time of sensors. We conducted simulations on the NS-2 simulator and the results show that our polling scheme can reduce the active time of sensors by a significant amount while sustaining 100 percent throughput.

Journal Article•DOI•
TL;DR: The dynamic range encoding scheme (DRES) is proposed to significantly improve the TCAM storage efficiency for range matching; evaluated on real-world databases, DRES reduces the TCAM storage expansion ratio from 6.20 to 1.23.
Abstract: One of the most critical resource management issues in the use of ternary content-addressable memory (TCAM) for packet classification/filtering is how to effectively support filtering rules with ranges, known as range matching. In this paper, the dynamic range encoding scheme (DRES) is proposed to significantly improve the TCAM storage efficiency for range matching. Unlike the existing range encoding schemes requiring additional hardware support, DRES uses the TCAM coprocessor itself to assist range encoding. Hence, DRES can be readily programmed in a network processor using a TCAM coprocessor for packet classification. A salient feature of DRES is its ability to allow a subset of ranges to be encoded and, hence, to have full control over the range code size. This advantage allows DRES to exploit the TCAM structure to maximize the TCAM storage efficiency. DRES is a comprehensive solution, including a dynamic range selection algorithm, a search key encoding scheme, a range encoding scheme, and a dynamic encoded range update algorithm. While the dynamic range selection algorithm, running in software, allows optimal selection of the ranges to be encoded to fully utilize the TCAM storage, the dynamic encoded range update algorithm allows the TCAM database to be updated lock-free, without interrupting the TCAM database lookup process. DRES is evaluated based on real-world databases and the results show that DRES can reduce the TCAM storage expansion ratio from 6.20 to 1.23. The performance analysis of DRES based on a probabilistic model demonstrates that DRES significantly improves the TCAM storage efficiency for a wide spectrum of range distributions.
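
The storage expansion that DRES attacks comes from the naive range-to-prefix mapping: without encoding, a single range rule becomes several TCAM entries. A standard splitting routine in Python shows the blow-up on a familiar port range:

```python
# Why ranges inflate TCAM usage: a naive (non-encoded) mapping splits a
# range into maximal prefix-aligned blocks, each costing one TCAM entry.

def range_to_prefixes(lo, hi, bits=16):
    """Split [lo, hi] into the minimal set of prefix-aligned blocks."""
    out = []
    while lo <= hi:
        size = lo & -lo if lo else 1 << bits      # largest alignment of lo
        while size > hi - lo + 1:                 # shrink to fit the range
            size >>= 1
        plen = bits - size.bit_length() + 1
        out.append(f"{lo:0{bits}b}"[:plen] + "*" * (bits - plen))
        lo += size
    return out

# A classic case: a TCP port range like [1024, 65535].
entries = range_to_prefixes(1024, 65535)
print(len(entries), entries[:3])   # 6 prefix entries for one range rule
```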

Journal Article•DOI•
TL;DR: This paper proposes a probabilistic approach to compute the covered area fraction at critical percolation for both the SCPT and NCPT problems, and proposes a model for percolation in WSNs, called the correlated disk model, which provides a basis for solving the SCPT and NCPT problems together.
Abstract: While sensing coverage reflects the surveillance quality provided by a wireless sensor network (WSN), network connectivity enables data gathered by sensors to reach a central node, called the sink. Given an initially uncovered field, as more and more sensors are continuously added to a WSN, the size of partially covered areas increases. At some point, the situation abruptly changes from small fragmented covered areas to a single large covered area. We call this abrupt change the sensing-coverage phase transition (SCPT). Also, given an originally disconnected WSN, as more and more sensors are added, the number of connected components changes such that the WSN suddenly becomes connected at some point. We call this sudden change the network-connectivity phase transition (NCPT). The nature of such phase transitions is a central topic in the percolation theory of Boolean models. In this paper, we propose a probabilistic approach to compute the covered area fraction at critical percolation for both the SCPT and NCPT problems. Because sensing coverage and network connectivity are not totally orthogonal, we also propose a model for percolation in WSNs, called the correlated disk model, which provides a basis for solving the SCPT and NCPT problems together.

Journal Article•DOI•
TL;DR: An improved block-based compact thermal model (HotSpot 4.0) is presented that automatically achieves good accuracy even under extreme conditions and has been extensively validated with detailed finite-element thermal simulation tools.
Abstract: Protecting silicon chips from harmful, even disastrous, thermal hazards has become increasingly challenging, so thermal effects must be considered early in the design cycle. To achieve this, an accurate yet fast temperature model together with an early-stage, thermally optimized design flow are needed. In this paper, we present an improved block-based compact thermal model (HotSpot 4.0) that automatically achieves good accuracy even under extreme conditions. The model has been extensively validated against detailed finite-element thermal simulation tools. We also show that properly modeling package components and applying the right boundary conditions are crucial to making full-chip thermal models like HotSpot accurately resemble what happens in the real world. Ignoring or over-simplifying package components can lead to inaccurate temperature estimations and potential thermal hazards that are costly to fix in later design stages. Such a full-chip and package thermal model can then be incorporated into a thermally optimized design flow where it acts as an efficient communication medium among computer architects, circuit designers, and package designers in early microprocessor design stages, to achieve early and accurate design decisions and faster design convergence. For example, the temperature-leakage interaction can be readily analyzed within such a design flow to predict potential thermal hazards such as thermal runaway.
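
The basic element that block-based compact thermal models compose is a lumped thermal RC node. Below is a single-node explicit-Euler step in Python, with illustrative parameter values that are assumptions of this sketch, not HotSpot's:

```python
# One lumped-RC thermal node: dT/dt = (P - (T - T_amb)/R) / C.
# Compact models like HotSpot connect many such nodes, one per block
# and package component; this sketch shows only the building block.

def rc_step(T, power, dt, R=0.8, C=35.0, T_amb=45.0):
    """Advance block temperature T (deg C) by one explicit-Euler step.
    R: thermal resistance to ambient (K/W); C: thermal capacitance (J/K)."""
    return T + dt * (power - (T - T_amb) / R) / C

T = 45.0
for _ in range(600):                 # 60 s of a 30 W block at 0.1 s steps
    T = rc_step(T, power=30.0, dt=0.1)
print(round(T, 1))                   # approaches T_amb + P*R = 69 deg C
```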

Journal Article•DOI•
TL;DR: This paper exploits the rich set of flexible features offered at the medium access control (MAC) layer of WiMax for the construction and transmission of MAC protocol data units (MPDUs) for supporting multiple VoIP streams and shows that the feedback-based technique coupled with retransmissions, aggregation, and variable length MPDUs are effective and increase the R-score and mean opinion score.
Abstract: Real-time services such as VoIP are becoming popular and are major revenue earners for network service providers. These services are no longer confined to the wired domain and are being extended over wireless networks. Although some of the existing wireless technologies can support some low-bandwidth applications, the bandwidth demands of many multimedia applications exceed the capacity of these technologies. The IEEE 802.16-based WiMax promises to be one of the wireless access technologies capable of supporting very high bandwidth applications. In this paper, we exploit the rich set of flexible features offered at the medium access control (MAC) layer of WiMax for the construction and transmission of MAC protocol data units (MPDUs) for supporting multiple VoIP streams. We study the quality of VoIP calls, usually given by R-score, with respect to the delay and loss of packets. We observe that loss is more sensitive than delay; hence, we compromise the delay performance within acceptable limits in order to achieve a lower packet loss rate. Through a combination of techniques like forward error correction, automatic repeat request, MPDU aggregation, and minislot allocation, we strike a balance between the desired delay and loss. Simulation experiments are conducted to test the performance of the proposed mechanisms. We assume a three-state Markovian channel model and study the performance with and without retransmissions. We show that the feedback-based technique coupled with retransmissions, aggregation, and variable length MPDUs are effective and increase the R-score and mean opinion score by about 40 percent.
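
The delay/loss trade-off the paper balances is typically scored with an E-model approximation. The sketch below uses the widely cited Cole-Rosenbluth form with G.711-style parameters; these constants are assumptions taken from that literature, not values from this paper.

```python
import math

def r_score(delay_ms, loss):
    """E-model approximation (Cole & Rosenbluth) for a G.711-style call;
    the impairment constants are codec-dependent assumptions."""
    Id = 0.024 * delay_ms + 0.11 * (delay_ms - 177.3) * (delay_ms > 177.3)
    Ie = 30.0 * math.log(1 + 15.0 * loss)        # loss impairment
    return 94.2 - Id - Ie

def mos(R):
    """Map an R-score to a mean opinion score (standard mapping)."""
    R = max(0.0, min(100.0, R))
    return 1 + 0.035 * R + 7e-6 * R * (R - 60) * (100 - R)

# Loss hurts more than delay: 1% loss at 100 ms vs 0% loss at 180 ms.
print(mos(r_score(100, 0.01)), mos(r_score(180, 0.0)))
```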

Journal Article•DOI•
Tao Xie
TL;DR: This paper proposes a novel energy-aware strategy, called striping-based energy-aware (SEA), which can be integrated into data placement in RAID-structured storage systems to noticeably save energy while providing quick responses; extensive experimental results demonstrate that, compared with traditional non-striping data placement algorithms, the SEA-powered algorithms significantly improve performance and save energy.
Abstract: Many real-world applications need to frequently access data stored on large-scale parallel disk storage systems. On one hand, prompt responses to access requests are essential for these applications. On the other hand, with an explosive increase in data volume and the emergence of faster disks with higher power requirements, the energy consumption of disk-based storage systems has become a salient issue. To achieve energy conservation and prompt responses simultaneously, in this paper we propose a novel energy-aware strategy, called striping-based energy-aware (SEA), which can be integrated into data placement in RAID-structured storage systems to noticeably save energy while providing quick responses. To illustrate the effectiveness of SEA, we implement two SEA-powered striping-based data placement algorithms, SEA0 and SEA5, by incorporating the SEA strategy into RAID-0 and RAID-5, respectively. Extensive experimental results demonstrate that, compared with traditional non-striping data placement algorithms, our algorithms significantly improve performance and save energy. Further, compared with an existing striping-based data placement scheme, the two SEA-powered strategies noticeably reduce energy consumption with only a little performance degradation.

Journal Article•DOI•
TL;DR: This work shows how to maintain semantic equivalence between specification and implementation using an intermediate model (similar to a Kahn process network but with finite queues) that helps in defining the transformation.
Abstract: Synchronous systems offer a clean semantics and an easy verification path at the expense of often inefficient implementations. Capturing design specifications as synchronous models and then implementing the specifications in a less restrictive platform allows a much larger design space to be addressed. The key issue in this approach is maintaining semantic equivalence between the synchronous model and its implementation. We address this problem by showing how to map a synchronous model onto a loosely time-triggered architecture that is fairly straightforward to implement, as it does not require global synchronization or blocking communication. We show how to maintain semantic equivalence between specification and implementation using an intermediate model (similar to a Kahn process network but with finite queues) that helps in defining the transformation. Performance of the semantics-preserving implementation is studied for the general case as well as for a few special cases.

Journal Article•DOI•
TL;DR: This paper develops a multiconstraint energy-saving model for the RAID environment by considering both disk characteristics and workload features and proposes an energy saving policy, eRAID (energy-efficient RAID), for conventional disk-based mirrored and parity redundant disk array architectures.
Abstract: Recently, high energy consumption has become a serious concern for both storage servers and data centers. Recent research studies have utilized the short transition times of multispeed disks to decrease energy consumption, but manufacturing challenges and costs have so far prevented commercial deployment of multispeed disks. In this paper, we propose an energy-saving policy, eRAID (energy-efficient RAID), for conventional disk-based mirrored and parity-redundant disk array architectures. eRAID saves energy by spinning down part or all of the mirror disk group, subject to constraints on acceptable performance degradation. We first develop a multiconstraint energy-saving model for the RAID environment by considering both disk characteristics and workload features. Then, we develop a performance (response time and throughput) control scheme for eRAID based on the analytical model. Experimental results show that eRAID can save up to 32 percent energy while satisfying the predefined performance requirement.

Journal Article•DOI•
TL;DR: This work investigates the sensor localization problem from a novel perspective by treating it as a functional dual of target tracking, utilizing a moving location assistant (LA) (with a global positioning system (GPS) or a predefined moving path) to help location-unaware sensors to accurately discover their positions.
Abstract: As one of the fundamental issues in wireless sensor networks (WSNs), the sensor localization problem has recently received extensive attention. In this work, we investigate this problem from a novel perspective by treating it as a functional dual of target tracking. In traditional tracking problems, static location-aware sensors track and predict the position and/or velocity of a moving target. As a dual, we utilize a moving location assistant (LA) (with a global positioning system (GPS) or a predefined moving path) to help location-unaware sensors accurately discover their positions. We call our proposed system Landscape. In Landscape, an LA (an aircraft, for example) periodically broadcasts its current location (we call it a beacon) while it moves around or through a sensor field. Each sensor collects the location beacons, measures the distance between itself and the LA based on the received signal strength (RSS), and individually calculates its location via an Unscented Kalman Filter (UKF)-based algorithm. Landscape has several features that are favorable to WSNs, such as high scalability, no intersensor communication overhead, moderate computation cost, and robustness to range errors and network connectivity. Extensive simulations demonstrate that Landscape is an efficient sensor positioning scheme for outdoor sensor networks.
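
The RSS-to-distance step can be sketched with the standard log-distance path-loss model; the reference power p0_dbm, reference distance d0, and exponent n below are environment-dependent assumptions, and Landscape feeds such noisy range estimates into a UKF rather than trusting them directly.

```python
def rss_to_distance(rss_dbm, p0_dbm=-40.0, d0=1.0, n=2.5):
    """Log-distance path-loss inversion: turn an RSS reading into a range
    estimate. p0_dbm is the received power at reference distance d0 and
    n is the path-loss exponent; all three are calibration assumptions."""
    return d0 * 10 ** ((p0_dbm - rss_dbm) / (10 * n))

# A beacon heard at -65 dBm is roughly 10 m away under these parameters.
print(round(rss_to_distance(-65.0), 1))   # -> 10.0
```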

Journal Article•DOI•
TL;DR: A detailed 3-dimensional computational fluid dynamics based thermal modeling tool, called ThermoStat, is presented for rack-mounted server systems, along with reactive and proactive thermal management techniques and isothermal workload distribution for the rack.
Abstract: Temperature-aware computing is becoming more important in the design of computer systems as power densities are increasing and high operating temperatures result in higher component failure rates and increased demand for cooling capability. Computer architects and system software designers need to understand the thermal consequences of their proposals and develop techniques to lower operating temperatures to reduce both transient and permanent component failures. Recognizing the need for thermal modeling tools to support such research, there has been work on modeling processor temperatures at the micro-architectural level, which can be easily understood and employed by computer architects for processor designs. However, there is a dearth of such tools in the academic/research community for undertaking architectural/systems studies beyond a processor - a server box, rack, or even a machine room. In this paper we present a detailed 3-dimensional computational fluid dynamics based thermal modeling tool, called ThermoStat, for rack-mounted server systems. We conduct several experiments with this tool to show how different load conditions affect the thermal profile, and also illustrate how this tool can help design dynamic thermal management techniques. We propose reactive and proactive thermal management for rack-mounted servers and isothermal workload distribution for the rack.

Journal Article•DOI•
TL;DR: An extensive comparison of the proposed architecture and previous QCA serial memories is pursued in terms of latency, timing, clocking requirements, and hardware complexity.
Abstract: Quantum-dot Cellular Automata (QCA) has been widely advocated as a new device architecture for nanotechnology. QCA systems require extremely low power, together with the potential for high density and regularity. These features make QCA an attractive technology for manufacturing memories in which the paradigm of memory-in-motion can be fully exploited. This paper proposes a novel serial memory architecture for QCA implementation. This architecture is based on utilizing new building blocks (referred to as tiles) in the storage and input/output circuitry of the memory. The QCA paradigm of memory-in-motion is accomplished using a novel arrangement in the storage loop and timing/clocking; a three-zone memory tile is proposed by which information is moved across a concatenation of tiles by utilizing a two-level clocking mechanism. Clocking zones are shared between memory cells and the length of the QCA line of a clocking zone is independent of the word size. QCA circuits for address decoding and input/output for simplification of the Read/Write operations are discussed in detail. An extensive comparison of the proposed architecture and previous QCA serial memories is pursued in terms of latency, timing, clocking requirements, and hardware complexity.

Journal Article•DOI•
TL;DR: This paper proposes FPGA-based designs for several basic linear algebra operations, including dot product, matrix-vector multiplication, matrix multiplication and matrix factorization, and shows that with faster floating-point units and larger devices, the performance of the designs increases accordingly.
Abstract: Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated. With the rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (field programmable gate arrays) has become feasible. In this paper, we propose FPGA-based designs for several basic linear algebra operations, including dot product, matrix-vector multiplication, matrix multiplication and matrix factorization. By identifying the parameters for each operation, we analyze the trade-offs and propose a high-performance design. In the implementations of the designs, the values of the parameters are determined according to the hardware constraints, such as the available chip area, the size of available memory, the memory bandwidth, and the number of I/O pins. The proposed designs are implemented on Xilinx Virtex-II Pro FPGAs. Experimental results show that our designs scale with the available hardware resources. Also, the performance of our designs compares favorably with that of general-purpose processor based designs. We also show that with faster floating-point units and larger devices, the performance of our designs increases accordingly.

Journal Article•DOI•
TL;DR: A novel strategy to detect interconnect faults between distinct channels in networks-on-chip is proposed and a cost-effective test sequence for Mesh NoC topologies based on XY routing is considered.
Abstract: A novel strategy to detect interconnect faults between distinct channels in networks-on-chip is proposed. Short faults between distinct channels in the data, control and communication handshake lines are considered in a cost-effective test sequence for Mesh NoC topologies based on XY routing.

Journal Article•DOI•
TL;DR: The proposed architecture increases the correlation among the patterns generated by LT-LFSR with negligible impact on test length, and is flexible to be used in both BIST and scan-based BIST architectures.
Abstract: A low-transition test pattern generator, called the low-transition linear feedback shift register (LT-LFSR), is proposed to reduce the average and peak power of a circuit during test by reducing the transitions among patterns. Transitions are reduced in two dimensions: 1) between consecutive patterns (fed to a combinational-only circuit) and 2) between consecutive bits (sent to a scan chain in a sequential circuit). LT-LFSR is independent of the circuit under test and flexible enough to be used in both BIST and scan-based BIST architectures. The proposed architecture increases the correlation among the patterns generated by LT-LFSR with negligible impact on test length. The experimental results for the ISCAS'85 and '89 benchmarks confirm up to 77 percent and 49 percent reductions in average and peak power, respectively.
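
What "transitions between consecutive patterns" means is easy to make concrete with a plain Fibonacci LFSR and a Hamming-distance counter, as below; the LT-LFSR's own transition-reducing pattern insertion is not reproduced here, and the seed and tap positions are arbitrary.

```python
# A plain Fibonacci LFSR plus a transition counter. LT-LFSR inserts
# intermediate patterns to lower this count; that logic is not shown.

def lfsr_patterns(seed, taps, width, count):
    state, out = seed, []
    for _ in range(count):
        out.append(state)
        fb = 0
        for t in taps:                       # XOR of the tapped bits
            fb ^= (state >> t) & 1
        state = ((state << 1) | fb) & ((1 << width) - 1)
    return out

def transitions(patterns):
    """Total Hamming distance between consecutive patterns."""
    return sum(bin(a ^ b).count("1")
               for a, b in zip(patterns, patterns[1:]))

pats = lfsr_patterns(seed=0b1001, taps=(3, 0), width=4, count=8)
print([f"{p:04b}" for p in pats], transitions(pats))
```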

Journal Article•DOI•
Patrick Longa, Ali Miri
TL;DR: An innovative methodology for accelerating the elliptic curve point formulae over prime fields using the substitution of multiplication with squaring and other cheaper operations, by exploiting the fact that field squaring is generally less costly than multiplication.
Abstract: We present an innovative methodology for accelerating the elliptic curve point formulae over prime fields. This flexible technique substitutes multiplications with squarings and other cheaper operations, exploiting the fact that field squaring is generally less costly than multiplication. Applying this substitution to the traditional formulae, we obtain faster point operations in unprotected sequential implementations. We show the significant impact our methodology has in protecting against Simple Side-Channel Attacks (SSCA). We modify the ECC point formulae to achieve a faster atomic structure when applying atomicity side-channel protection. In contrast to previous atomic operations that assumed squarings are indistinguishable from multiplications, our new atomic structure offers true SSCA protection because it includes squaring in its formulation. We also extend our implementation to parallel architectures such as SIMD (Single-Instruction Multiple-Data). With the introduction of a new coordinate system and with the flexibility of our methodology, we present, to our knowledge, the fastest formulae for SIMD-based schemes capable of executing 3 and 4 operations simultaneously. Finally, a new parallel SSCA-protected scheme is proposed for multiprocessor/parallel architectures by applying the atomic structure presented in this work. Our parallel and atomic operations are shown to be significantly faster than previous implementations.
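
A classical identity behind such multiplication-to-squaring trades (the paper applies case-specific substitutions to each point formula; this is just the canonical example) turns one field multiplication into one squaring plus additions when X^2 and Y^2 are available anyway:

```latex
% One multiplication traded for a squaring plus cheap additions,
% assuming X^2 and Y^2 already appear elsewhere in the point formula:
\[
  2XY = (X + Y)^2 - X^2 - Y^2 .
\]
```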