
Showing papers in "IEEE Transactions on Computers in 2012"


Journal ArticleDOI
TL;DR: vCUDA is a general-purpose graphics processing unit (GPGPU) computing solution for virtual machines (VMs) that allows applications executing within VMs to leverage hardware acceleration, which can benefit the performance of a class of high-performance computing (HPC) applications.
Abstract: This paper describes vCUDA, a general-purpose graphics processing unit (GPGPU) computing solution for virtual machines (VMs). vCUDA allows applications executing within VMs to leverage hardware acceleration, which can be beneficial to the performance of a class of high-performance computing (HPC) applications. The key insights in our design include API call interception and redirection and a dedicated RPC system for VMs. With API interception and redirection, Compute Unified Device Architecture (CUDA) applications in VMs can access a graphics hardware device and achieve high computing performance in a transparent way. In the current study, vCUDA achieved a near-native performance with the dedicated RPC system. We carried out a detailed analysis of the performance of our framework. Using a number of unmodified official examples from CUDA SDK and third-party applications in the evaluation, we observed that CUDA applications running with vCUDA exhibited a very low performance penalty in comparison with the native environment, thereby demonstrating the viability of vCUDA architecture.
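The API interception and redirection mechanism described above can be sketched in miniature: each call is wrapped so its name and arguments are marshaled to a forwarding channel before execution. This is an illustrative Python sketch, not the vCUDA implementation; `intercept`, `call_log`, and `cuda_malloc` are made-up names standing in for the interception layer, the RPC channel, and a CUDA API call.

```python
import functools

call_log = []  # stands in for the RPC channel to the host

def intercept(fn):
    """Record ('marshal') a call before executing it, mimicking redirection."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_log.append((fn.__name__, args))  # forward name + args
        return fn(*args, **kwargs)            # host-side execution
    return wrapper

@intercept
def cuda_malloc(nbytes):
    # placeholder for the real host-side CUDA allocation
    return {"size": nbytes}

buf = cuda_malloc(1024)
```

Because the wrapper is transparent to the caller, unmodified application code keeps working while every call is observable by the redirection layer.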

263 citations


Journal ArticleDOI
TL;DR: Using measurements of actual systems running scientific, commercial, and productivity workloads, power models for six subsystems on two platforms are developed and validated, showing that system power consumption can be estimated without the need for power-sensing hardware.
Abstract: This paper proposes the use of microprocessor performance counters for online measurement of complete system power consumption. The approach takes advantage of the "trickle-down" effect of performance events in microprocessors. While it has been known that CPU power consumption is correlated to processor performance, the use of well-known performance-related events within a microprocessor such as cache misses and DMA transactions to estimate power consumption in memory and disk and other subsystems outside of the microprocessor is new. Using measurement of actual systems running scientific, commercial and productivity workloads, power models for six subsystems (CPU, memory, chipset, I/O, disk, and GPU) on two platforms (server and desktop) are developed and validated. These models are shown to have an average error of less than nine percent per subsystem across the considered workloads. Through the use of these models and existing on-chip performance event counters, it is possible to estimate system power consumption without the need for power sensing hardware.
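The trickle-down approach reduces to a per-subsystem linear model over performance-event rates. A minimal sketch with made-up coefficients and event names (the paper fits its models to power measured on real server and desktop platforms):

```python
def estimate_power(rates, coeffs, idle_power):
    """Linear trickle-down model: idle power plus coefficient-weighted event rates."""
    return idle_power + sum(coeffs[e] * rates[e] for e in coeffs)

# illustrative coefficients for a memory subsystem; the paper derives
# such values by regression against measured subsystem power
mem_coeffs = {"llc_misses_per_s": 2.1e-7, "dma_bytes_per_s": 3.0e-10}
rates = {"llc_misses_per_s": 5.0e6, "dma_bytes_per_s": 1.0e9}
p_mem = estimate_power(rates, mem_coeffs, idle_power=4.0)  # watts
```

The same form, with different events and coefficients, covers each of the six subsystems; only the counter readings change at runtime.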

206 citations


Journal ArticleDOI
TL;DR: This paper addresses the problem of k-coverage in WSNs such that in each scheduling round, every location in a monitored field is covered by at least k active sensors while all active sensors are being connected.
Abstract: Sensing coverage is an essential functionality of wireless sensor networks (WSNs). However, it is also well known that coverage alone in WSNs is not sufficient, and hence network connectivity should also be considered for the correct operation of WSNs. In this paper, we address the problem of k-coverage in WSNs such that in each scheduling round, every location in a monitored field (or simply field) is covered by at least k active sensors while all active sensors are being connected. Precisely, we study sensor duty-cycling strategies for generating k-coverage configurations in WSNs. First, we model the k-coverage problem in WSNs. Second, we derive a sufficient condition of the sensor spatial density for complete k-coverage of a field. We also provide a relationship between the communication and sensing ranges of sensors to maintain both k-coverage of a field and connectivity among all active sensors. Third, we propose four configuration protocols to solve the problem of k-coverage in WSNs. We prove that our protocols select a minimum number of sensors to achieve full k-coverage of a field while guaranteeing connectivity between them. Then, we relax some widely used assumptions for coverage configuration in WSNs, to promote the use of our proposed protocols in real-world sensing applications. Our simulation results show that our protocols outperform an existing distributed k-coverage configuration protocol.
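The core k-coverage condition (every location covered by at least k active sensors) can be checked directly on a discretized field. This is a toy verification sketch, not one of the paper's four configuration protocols:

```python
import math

def is_k_covered(points, sensors, r_sense, k):
    """True if every field point lies within r_sense of at least k sensors."""
    return all(
        sum(math.dist(p, s) <= r_sense for s in sensors) >= k
        for p in points
    )

# 3x3 grid of field points and three active sensors near the center
field = [(x, y) for x in range(3) for y in range(3)]
active = [(1.0, 1.0), (0.5, 1.0), (1.5, 1.0)]
ok2 = is_k_covered(field, active, r_sense=2.5, k=2)
```

A protocol would use such a predicate (or its analytic density condition) to decide how many sensors per round must stay awake.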

185 citations


Journal ArticleDOI
TL;DR: This paper describes PlaGate, a novel tool that can be integrated with existing plagiarism detection tools to improve plagiarism detection performance, and that implements a new approach for investigating the similarity between source-code files with a view to gathering evidence for proving plagiarism.
Abstract: Plagiarism is a growing problem in academia. Academics often use plagiarism detection tools to detect similar source-code files. Once similar files are detected, the academic proceeds with the investigation process which involves identifying the similar source-code fragments within them that could be used as evidence for proving plagiarism. This paper describes PlaGate, a novel tool that can be integrated with existing plagiarism detection tools to improve plagiarism detection performance. The tool also implements a new approach for investigating the similarity between source-code files with a view to gathering evidence for proving plagiarism. Graphical evidence is presented that allows for the investigation of source-code fragments with regards to their contribution toward evidence for proving plagiarism. The graphical evidence indicates the relative importance of the given source-code fragments across files in a corpus. This is done by using the Latent Semantic Analysis information retrieval technique to detect how important they are within the specific files under investigation in relation to other files in the corpus.

173 citations


Journal ArticleDOI
TL;DR: This paper proposes a polling-based mobile gathering approach and formulates it as an optimization problem, named bounded relay hop mobile data gathering (BRH-MDG), where a subset of sensors is selected as polling points that buffer locally aggregated data and upload the data to the mobile collector when it arrives.
Abstract: Recent studies reveal that great benefit can be achieved for data gathering in wireless sensor networks by employing mobile collectors that gather data via short-range communications. To pursue maximum energy saving at sensor nodes, intuitively, a mobile collector should traverse the transmission range of each sensor in the field such that each data packet can be directly transmitted to the mobile collector without any relay. However, this approach may lead to significantly increased data gathering latency due to the low moving velocity of the mobile collector. Fortunately, it is observed that data gathering latency can be effectively shortened by performing proper local aggregation via multihop transmissions and then uploading the aggregated data to the mobile collector. In such a scheme, the number of local transmission hops should not be arbitrarily large as it may increase the energy consumption on packet relays, which would adversely affect the overall efficiency of mobile data gathering. Based on these observations, in this paper, we study the tradeoff between energy saving and data gathering latency in mobile data gathering by exploring a balance between the relay hop count of local data aggregation and the moving tour length of the mobile collector. We first propose a polling-based mobile gathering approach and formulate it as an optimization problem, named bounded relay hop mobile data gathering (BRH-MDG). Specifically, a subset of sensors will be selected as polling points that buffer locally aggregated data and upload the data to the mobile collector when it arrives. Meanwhile, when sensors are affiliated with these polling points, it is guaranteed that any packet relay is bounded within a given number of hops. We then give two efficient algorithms for selecting polling points among sensors. The effectiveness of our approach is validated through extensive simulations.
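The polling-point idea can be sketched as a covering problem: choose polling points so every sensor can relay its data to one of them within d hops. The greedy rule below is an illustration only; the paper's two selection algorithms are not reproduced here.

```python
from collections import deque

def within_hops(adj, src, d):
    """All sensors reachable from src in at most d relay hops (BFS)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        u, h = frontier.popleft()
        if h == d:
            continue
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                frontier.append((v, h + 1))
    return seen

def pick_polling_points(adj, d):
    """Greedily pick polling points until every sensor is within d hops of one."""
    uncovered, points = set(adj), []
    while uncovered:
        best = max(adj, key=lambda s: len(within_hops(adj, s, d) & uncovered))
        points.append(best)
        uncovered -= within_hops(adj, best, d)
    return points

# 5-node chain topology: with d = 2 the middle sensor alone covers everyone
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
pts = pick_polling_points(chain, d=2)
```

Raising d shrinks the mobile collector's tour (fewer polling points) at the cost of more relay energy, which is exactly the tradeoff the paper studies.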

171 citations


Journal ArticleDOI
TL;DR: The intractability of determining whether a system specified in this model can be scheduled to meet all its certification requirements, even for systems subject to merely two sets of certification requirements is demonstrated.
Abstract: Many safety-critical embedded systems are subject to certification requirements; some systems may be required to meet multiple sets of certification requirements, from different certification authorities. Certification requirements in such "mixed-criticality" systems give rise to interesting scheduling problems that cannot be satisfactorily addressed using techniques from conventional scheduling theory. In this paper, we study a formal model for representing such mixed-criticality workloads. We demonstrate first the intractability of determining whether a system specified in this model can be scheduled to meet all its certification requirements, even for systems subject to merely two sets of certification requirements. Then we quantify, via the metric of processor speedup factor, the effectiveness of two techniques, reservation-based scheduling and priority-based scheduling, that are widely used in scheduling such mixed-criticality systems, showing that the latter of the two is superior to the former. We also show that the speedup factors we obtain are tight for these two techniques.

149 citations


Journal ArticleDOI
TL;DR: Nash equilibria of QoR games give poly-log approximations to hard optimization problems in general networks where each player selfishly selects a path that minimizes the sum of congestion and dilation of the player's path.
Abstract: A classic optimization problem in network routing is to minimize C + D, where C is the maximum edge congestion and D is the maximum path length (also known as dilation). The problem of computing the optimal C* + D* is NP-complete even when either C* or D* is a small constant. We study routing games in general networks where each player i selfishly selects a path that minimizes Ci + Di, the sum of congestion and dilation of the player's path. We first show that there are instances of this game without Nash equilibria. We then turn to the related quality of routing (QoR) games which always have Nash equilibria. QoR games represent networks with a small number of service classes where paths in different classes do not interfere with each other (with frequency or time division multiplexing). QoR games have O(log^4 n) price of anarchy when either C* or D* is a constant. Thus, Nash equilibria of QoR games give poly-log approximations to hard optimization problems.

144 citations


Journal ArticleDOI
TL;DR: This work compares two real-time architectures developed using FPGA and GPU devices for the computation of phase-based optical flow, stereo, and local image features (energy, orientation, and phase) and provides suggestions for selecting the most suitable technology.
Abstract: Low-level computer vision algorithms have extreme computational requirements. In this work, we compare two real-time architectures developed using FPGA and GPU devices for the computation of phase-based optical flow, stereo, and local image features (energy, orientation, and phase). The presented approach requires a massive degree of parallelism to achieve real-time performance and allows us to compare FPGA and GPU design strategies and trade-offs in a much more complex scenario than previous contributions. Based on this analysis, we provide suggestions to real-time system designers for selecting the most suitable technology, and for optimizing system development on this platform, for a number of diverse applications.

138 citations


Journal ArticleDOI
TL;DR: Two fused floating-point operations are described and applied to the implementation of fast Fourier transform (FFT) processors and the numerical results of the fused implementations are slightly more accurate, since they use fewer rounding operations.
Abstract: This paper describes two fused floating-point operations and applies them to the implementation of fast Fourier transform (FFT) processors. The fused operations are a two-term dot product and an add-subtract unit. The FFT processors use "butterfly" operations that consist of multiplications, additions, and subtractions of complex valued data. Both radix-2 and radix-4 butterflies are implemented efficiently with the two fused floating-point operations. When placed and routed using a high performance standard cell technology, the fused FFT butterflies are about 15 percent faster and 30 percent smaller than a conventional implementation. Also, the numerical results of the fused implementations are slightly more accurate, since they use fewer rounding operations.
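A radix-2 butterfly expressed with the paper's two fused primitives, modeled here in plain Python floats. In hardware each fused unit rounds once instead of after every operation, which is the source of the accuracy gain; this sketch only shows the dataflow.

```python
def dot2(a, b, c, d):
    """Fused two-term dot product: a*b + c*d (one rounding in hardware)."""
    return a * b + c * d

def addsub(x, y):
    """Fused add-subtract unit: returns (x + y, x - y)."""
    return x + y, x - y

def butterfly(ar, ai, br, bi, wr, wi):
    """Radix-2 DIT butterfly on complex inputs a, b with twiddle factor w."""
    tr = dot2(br, wr, -bi, wi)   # Re(b * w)
    ti = dot2(br, wi, bi, wr)    # Im(b * w)
    xr, yr = addsub(ar, tr)      # real parts of a + bw, a - bw
    xi, yi = addsub(ai, ti)      # imaginary parts
    return (xr, xi), (yr, yi)

# with w = 1, the butterfly reduces to (a + b, a - b)
top, bot = butterfly(1.0, 0.0, 1.0, 0.0, 1.0, 0.0)
```

Each complex multiply maps onto two dot2 units and each complex add/subtract pair onto two addsub units, which is how the radix-2 and radix-4 datapaths are built from the two fused blocks.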

115 citations


Journal ArticleDOI
TL;DR: This work presents a novel algorithm called Short Range Gray Encoding (SRGE) for the efficient representation of short-range rules, and proves that any database-independent TCAM encoding scheme has a worst-case expansion ratio of at least W.
Abstract: Ternary content-addressable memories (TCAMs) are increasingly used for high-speed packet classification. TCAMs compare packet headers against all rules in a classification database in parallel and thus provide high throughput unparalleled by software-based solutions. TCAMs are not well-suited, however, for representing rules that contain range fields. Such rules typically have to be represented (or encoded) by multiple TCAM entries. The resulting range expansion can dramatically reduce TCAM utilization. A TCAM range-encoding algorithm A is database-independent if, for all ranges r, it encodes r independently of the database in which it appears; otherwise, we say that A is database-dependent. Typically, when storing a classification database in TCAM, a few dozens of so-called extra bits in each TCAM entry remain unused. These extra bits are used by some (both database-dependent and database-independent) prior algorithms to reduce range expansion. The majority of real-life database ranges are short. We present a novel database-independent algorithm called Short Range Gray Encoding (SRGE) for the efficient representation of short range rules. SRGE encodes range endpoints as binary-reflected Gray codes and then represents the resulting range by a minimal set of ternary strings. To the best of our knowledge, SRGE is the first algorithm that achieves a reduction in range expansion in general, and a significant expansion reduction for short ranges in particular, without resorting to the use of extra bits. The “traditional” database-independent technique for representing range entries in TCAM is prefix expansion. As we show, SRGE significantly reduces the expansion of short ranges in comparison with prefix expansion. We also prove that the SRGE algorithm's range expansion is at least as good as that of prefix expansion for any range. Real-world classification databases contain a small number of unique long ranges, some of which appear in numerous rules. 
These long ranges cause high expansion which is not significantly reduced by any database-independent range encoding scheme that we are aware of, including SRGE. We introduce hybrid SRGE, a database-dependent encoding scheme that uses SRGE for reducing the expansion of short ranges and uses extra bits for reducing the expansion caused by long ones. Our comparative analysis establishes that hybrid SRGE utilizes TCAM more efficiently than previously published range-encoding algorithms. This work also makes a more theoretic contribution. Prefix expansion for ranges defined by W-bit endpoints has worst-case expansion ratio of 2W-2. It follows from the work of Schieber et al. [1] that the SRGE algorithm has a slightly better worst-case expansion ratio of 2W-4. We prove that any database-independent TCAM encoding scheme has worst-case expansion ratio of at least W.
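The Gray-code step that SRGE builds on fits in a few lines: under the binary-reflected Gray code, consecutive integers differ in exactly one bit, which is what lets short ranges be covered by few ternary strings. Computing the minimal ternary cover itself is the harder step and is omitted in this sketch.

```python
def gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

codes = [gray(i) for i in range(16)]
# consecutive integers map to codes exactly one bit apart
one_bit_steps = all(hamming(codes[i], codes[i + 1]) == 1 for i in range(15))
```

Because adjacent values share all but one bit, a short run of consecutive values tends to collapse into a small number of ternary (0/1/don't-care) strings, whereas in plain binary a range boundary can flip many bits at once.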

112 citations


Journal ArticleDOI
TL;DR: This paper presents two algorithms, one that synthesizes an optimal circuit for any 4-bit reversible specification and another that synthesizes all optimal implementations, and illustrates that the proposed approach may be extended to accommodate physical constraints via reporting LNN-optimal reversible circuits.
Abstract: Optimal synthesis of reversible functions is a nontrivial problem. One of the major limiting factors in computing such circuits is the sheer number of reversible functions. Even restricting synthesis to 4-bit reversible functions results in a huge search space (16! ≈ 2^44 functions). The output of such a search alone, counting only the space required to list Toffoli gates for every function, would require over 100 terabytes of storage. In this paper, we present two algorithms: one that synthesizes an optimal circuit for any 4-bit reversible specification, and another that synthesizes all optimal implementations. We employ several techniques to make the problem tractable. We report results from several experiments, including synthesis of all optimal 4-bit permutations, synthesis of random 4-bit permutations, optimal synthesis of all 4-bit linear reversible circuits, and synthesis of existing benchmark functions; we compose a list of the hardest permutations to synthesize, and show distribution of optimal circuits. We further illustrate that our proposed approach may be extended to accommodate physical constraints via reporting LNN-optimal reversible circuits. Our results have important implications in the design and optimization of reversible and quantum circuits, testing circuit synthesis heuristics, and performing experiments in the area of quantum information processing.
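The paper's 4-bit search space (16! permutations) is far too large to reproduce here, but the same idea on two wires fits in a few lines: breadth-first search from the identity over a NOT/CNOT gate library assigns a minimum gate count to every 2-bit permutation. This toy sketch illustrates the search principle only, not the paper's algorithms or gate library.

```python
from collections import deque

def apply_gate(perm, gate):
    """Compose: run the circuit `perm`, then one more `gate`."""
    return tuple(gate[v] for v in perm)

# gates as permutations of the four 2-bit states
NOT0 = tuple(x ^ 1 for x in range(4))                  # invert wire 0
NOT1 = tuple(x ^ 2 for x in range(4))                  # invert wire 1
CNOT01 = tuple(x ^ ((x & 1) << 1) for x in range(4))   # wire 0 controls wire 1
CNOT10 = tuple(x ^ ((x >> 1) & 1) for x in range(4))   # wire 1 controls wire 0
GATES = [NOT0, NOT1, CNOT01, CNOT10]

identity = tuple(range(4))
dist = {identity: 0}          # permutation -> minimum gate count
queue = deque([identity])
while queue:
    p = queue.popleft()
    for g in GATES:
        q = apply_gate(p, g)
        if q not in dist:
            dist[q] = dist[p] + 1
            queue.append(q)

reachable = len(dist)  # every 2-bit permutation gets an optimal gate count
```

On two wires this library generates all 24 permutations; at four bits the same breadth-first idea needs the paper's pruning and compression techniques to stay within feasible memory.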

Journal ArticleDOI
TL;DR: Experimental results demonstrate that AMUSE can emulate soft error effects for complex circuits including microprocessors and memories, considering the real delays of an ASIC technology, and support massive fault injection campaigns, in the order of tens of millions of faults within acceptable time.
Abstract: Estimation of soft error sensitivity is crucial in order to devise optimal mitigation solutions that can satisfy reliability requirements with reduced impact on area, performance, and power consumption. In particular, the estimation of Single Event Transient (SET) effects for complex systems that include a microprocessor is challenging, due to the huge potential number of different faults and effects that must be considered, and the delay-dependent nature of SET effects. In this paper, we propose a multilevel FPGA emulation-based fault injection approach for evaluation of SET effects called AMUSE (Autonomous MUltilevel emulation system for Soft Error evaluation). This approach integrates Gate level and Register-Transfer level models of the circuit under test in a FPGA and is able to switch to the appropriate model as needed during emulation. Fault injection is performed at the Gate level, which provides delay accuracy, while fault propagation across clock cycles is performed at the Register-Transfer level for higher performance. Experimental results demonstrate that AMUSE can emulate soft error effects for complex circuits including microprocessors and memories, considering the real delays of an ASIC technology, and support massive fault injection campaigns, in the order of tens of millions of faults within acceptable time.

Journal ArticleDOI
TL;DR: An exact analysis of the energy minimization problem for a real-time embedded application running on a VFS-enabled CPU and using multiple devices is undertaken and a provably optimal and efficient algorithm is proposed to determine the optimal CPU frequency as well as device state transition decisions to minimize the system-level energy.
Abstract: Voltage/Frequency Scaling (VFS) and Device Power Management (DPM) are two popular techniques commonly employed to save energy in real-time embedded systems. VFS policies aim at reducing the CPU energy, while DPM-based solutions involve putting the system components (e.g., memory or I/O devices) to low-power/sleep states at runtime, when sufficiently long idle intervals can be predicted. Despite numerous research papers that tackled the energy minimization problem using VFS or DPM separately, the interactions of these two popular techniques are not yet well understood. In this paper, we undertake an exact analysis of the problem for a real-time embedded application running on a VFS-enabled CPU and using multiple devices. Specifically, by adopting a generalized system-level energy model, we characterize the variations in different components of the system energy as a function of the CPU processing frequency. Then, we propose a provably optimal and efficient algorithm to determine the optimal CPU frequency as well as device state transition decisions to minimize the system-level energy. We also extend our solution to deal with workload variability. The experimental evaluations confirm that substantial energy savings can be obtained through our solution that combines VFS and DPM optimally under the given task and energy models.
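The tradeoff at the heart of this analysis can be illustrated with a toy system-level energy model: lowering the frequency cuts dynamic CPU energy per cycle, but stretches execution time, keeping devices and static power on longer, so the energy-optimal frequency sits between the extremes. All constants below are illustrative, not values from the paper's model.

```python
def system_energy(freq, cycles=1e9, k_dyn=1e-26, p_dev=2.0, p_static=1.0):
    """Toy system energy as a function of CPU frequency (Hz)."""
    t = cycles / freq                      # execution time in seconds
    e_cpu = k_dyn * freq ** 2 * cycles     # dynamic CPU energy (~V^2 ~ f^2 per cycle)
    e_other = (p_dev + p_static) * t       # device + static energy accrued over t
    return e_cpu + e_other

freqs = [0.4e9, 0.6e9, 0.8e9, 1.0e9]       # candidate VFS settings
best = min(freqs, key=system_energy)       # energy-optimal operating point
```

With these constants the minimum falls at an intermediate frequency: running flat-out wastes dynamic energy, while running too slowly keeps the devices powered for so long that DPM savings are lost.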

Journal ArticleDOI
TL;DR: This work utilizes memristors as weights in the realization of low-power Field Programmable Gate Arrays (FPGAs) using threshold logic, which is useful not only for low-power embedded systems but also for realizing biological applications.
Abstract: Researchers have claimed that the memristor, the fourth fundamental circuit element, can be used for computing. In this work, we utilize memristors as weights in the realization of low-power Field Programmable Gate Arrays (FPGAs) using threshold logic, which is necessary not only for low-power embedded systems, but also for realizing biological applications using threshold logic. Boolean functions, which are subsets of threshold functions, can be implemented using the proposed Memristive Threshold Logic (MTL) gate, whose functionality can be configured by changing the weights (memristance). A CAD framework is also developed to map the weights of a threshold gate to corresponding memristance values and synthesize logic circuits using MTL gates. Performance of the MTL gates at the circuit and logic levels is evaluated with this CAD framework on ISCAS-85 combinational benchmark circuits. This work also provides solutions based on device options and refreshing memristance, against drift in memristance, which can be a potential problem during operation. Comparisons with the existing CMOS look-up-table (LUT) and capacitor threshold logic (CTL) gates show that MTL gates exhibit at least 90 percent lower energy-delay product.
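Logically, an MTL gate computes a weighted sum of its inputs (weights set by memristance) and compares it against a threshold. A sketch with illustrative integer weights, not device parameters, realizing AND and OR as threshold functions:

```python
def threshold_gate(inputs, weights, theta):
    """Fire (output 1) when the weighted input sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= theta)

# the same gate structure realizes different Boolean functions purely
# by reconfiguring weights/threshold -- the role memristance plays in MTL
def AND(a, b):
    return threshold_gate((a, b), (1, 1), theta=2)

def OR(a, b):
    return threshold_gate((a, b), (1, 1), theta=1)

truth_and = [AND(a, b) for a in (0, 1) for b in (0, 1)]
truth_or = [OR(a, b) for a in (0, 1) for b in (0, 1)]
```

Reprogramming the memristance (here, the weight and threshold values) switches the gate's function without changing its topology, which is what makes the fabric FPGA-like.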

Journal ArticleDOI
TL;DR: Investigation of energy-efficient scheduling of sequential tasks with precedence constraints on multiprocessor computers with dynamically variable voltage and speed makes an initial contribution to the analytical performance study of heuristic power allocation and scheduling algorithms for precedence-constrained sequential tasks.
Abstract: Energy-efficient scheduling of sequential tasks with precedence constraints on multiprocessor computers with dynamically variable voltage and speed is investigated as combinatorial optimization problems. In particular, the problem of minimizing schedule length with energy consumption constraint and the problem of minimizing energy consumption with schedule length constraint are considered. Our scheduling problems contain three nontrivial subproblems, namely, precedence constraining, task scheduling, and power supplying. Each subproblem should be solved efficiently so that heuristic algorithms with overall good performance can be developed. Such decomposition of our optimization problems into three subproblems makes design and analysis of heuristic algorithms tractable. Three types of heuristic power allocation and scheduling algorithms are proposed for precedence constrained sequential tasks with energy and time constraints, namely, prepower-determination algorithms, postpower-determination algorithms, and hybrid algorithms. The performance of our algorithms is analyzed and compared with optimal schedules analytically. Such analysis has not been conducted in the literature for any algorithm. Therefore, our investigation in this paper makes an initial contribution to analytical performance study of heuristic power allocation and scheduling algorithms for precedence constrained sequential tasks. Our extensive simulation data demonstrate that for wide task graphs, the performance ratios of all our heuristic algorithms approach one as the number of tasks increases.

Journal ArticleDOI
TL;DR: This paper analyzes and extrapolates the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both employing checkpointing for fault tolerance, and generalizes these results into a general reliability speedup/wall framework by considering not only speedup but also costup.
Abstract: Reliability is a key challenge to be understood to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures, by incorporating fault-tolerance mechanisms to improve their reliability and availability. As the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces for the first time the concept of "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance. We quantify the effects of reliability on scalability, by proposing a reliability speedup, defining quantitatively the reliability wall, giving an existence theorem for the reliability wall, and categorizing a given system according to the time overhead incurred by fault tolerance. We also generalize these results into a general reliability speedup/wall framework by considering not only speedup but also costup. We analyze and extrapolate the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both employing checkpointing for fault tolerance, and have also studied the general reliability wall using Intrepid. These case studies provide insights on how to mitigate reliability-wall effects in system design and through hardware/software optimizations in peta/exascale supercomputing.
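The wall effect can be illustrated with a toy model: parallel compute time shrinks as T1/n while fault-tolerance overhead grows with scale, so the speedup peaks and then falls. The overhead term below is made up for the sketch; the paper defines reliability speedup formally in terms of the time cost of mechanisms such as checkpointing.

```python
def reliability_speedup(n, t1=1000.0, overhead_per_node=0.5):
    """Toy speedup on n nodes when fault-tolerance overhead grows with scale."""
    t_parallel = t1 / n + overhead_per_node * n   # compute time + FT overhead
    return t1 / t_parallel

# speedup first rises with node count, then the overhead term dominates
speedups = [reliability_speedup(n) for n in (1, 16, 64, 256)]
```

Past the peak, adding nodes buys no further performance: that saturation point is the wall, and reducing the overhead term (cheaper checkpointing, better failure rates) is what pushes it outward.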

Journal ArticleDOI
TL;DR: This paper presents a parallel array architecture for SVM-based object detection, in an attempt to show the advantages, and performance benefits that stem from a dedicated hardware solution.
Abstract: Object detection applications are often associated with real-time performance constraints that stem from the embedded environment that they are often deployed in. Consequently, researchers have proposed dedicated hardware architectures, utilizing a variety of classification algorithms targeting object detection. Support Vector Machines (SVMs) is among the most popular classification algorithms used in object detection yielding high accuracy rates. However, existing SVM hardware implementations attempting to speed up SVM classification, have either targeted only simple applications, or SVM training. As such, there are limited proposed hardware architectures that are generic enough to be used in a variety of object detection applications. Hence, this paper presents a parallel array architecture for SVM-based object detection, in an attempt to show the advantages, and performance benefits that stem from a dedicated hardware solution. The proposed hardware architecture provides parallel processing, resource sharing among the processing units, and efficient memory management. Furthermore, the size of the array is scalable to the hardware demands, and can also handle a variety of applications such as multiclass classification problems. A prototype of the proposed architecture was implemented on an FPGA platform and evaluated using three popular detection applications, demonstrating real-time performance (40-122 fps for a variety of applications).
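The computation such a parallel array accelerates is the SVM decision function f(x) = Σᵢ αᵢ yᵢ K(sᵢ, x) + b over the support vectors, with the per-support-vector kernel terms being the natural unit of parallelism. A scalar sketch with a linear kernel and made-up support vectors and coefficients:

```python
def svm_decide(x, support, alphas, labels, bias):
    """Kernel SVM decision: sign of sum_i alpha_i * y_i * K(s_i, x) + b."""
    def k(u, v):                       # linear kernel (dot product)
        return sum(a * b for a, b in zip(u, v))
    score = sum(a * y * k(s, x)
                for a, y, s in zip(alphas, labels, support))
    return 1 if score + bias >= 0 else -1

# illustrative trained model separating the two diagonal quadrants
support = [(1.0, 1.0), (-1.0, -1.0)]
alphas, labels, bias = [0.5, 0.5], [1, -1], 0.0
pred = svm_decide((2.0, 0.5), support, alphas, labels, bias)
```

In a hardware array, each processing unit evaluates kernel terms for a slice of the support vectors and the partial scores are summed, which is why the architecture scales with array size.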

Journal ArticleDOI
TL;DR: This paper describes a unique spatio-temporal tradeoff that includes efficient spatial fitting of VMs on servers to achieve high utilization of machine resources, as well as balanced temporal fitting of servers with VMs having similar runtimes to ensure a server runs at a high utilization throughout its uptime.
Abstract: MapReduce is a distributed computing paradigm widely used for building large-scale data processing applications. When used in cloud environments, MapReduce clusters are dynamically created using virtual machines (VMs) and managed by the cloud provider. In this paper, we study the energy efficiency problem for such MapReduce clouds. We describe a unique spatio-temporal tradeoff that includes efficient spatial fitting of VMs on servers to achieve high utilization of machine resources, as well as balanced temporal fitting of servers with VMs having similar runtimes to ensure a server runs at a high utilization throughout its uptime. We propose VM placement algorithms that explicitly incorporate these tradeoffs. Further, we propose techniques that dynamically scale MapReduce clusters to further improve energy consumption while ensuring that jobs meet or improve their expected runtimes. Our algorithms achieve energy savings over existing placement techniques, and an additional optimization technique further achieves savings while simultaneously improving job performance.

Journal ArticleDOI
TL;DR: It is shown that the presented AES-GCM architectures outperform previously reported ones in the utilized 65-nm CMOS technology.
Abstract: Since its acceptance as the adopted symmetric-key algorithm, the Advanced Encryption Standard (AES) and its recently standardized authentication Galois/Counter Mode (GCM) have been utilized in various security-constrained applications. Many of the AES-GCM applications are power and resource constrained and require efficient hardware implementations. In this paper, different application-specific integrated circuit (ASIC) architectures of building blocks of the AES-GCM algorithms are evaluated and optimized to identify the high-performance and low-power architectures for the AES-GCM. For the AES, we evaluate the performance of more than 40 S-boxes utilizing a fixed benchmark platform in 65-nm CMOS technology. To obtain the least complexity S-box, the formulations for the Galois Field (GF) subfield inversions in GF(2^4) are optimized. By conducting exhaustive simulations for the input transitions, we analyze the average and peak power consumptions of the AES S-boxes considering the switching activities, gate-level netlists, and parasitic information. Additionally, we present high-speed, parallel hardware architectures for reaching low-latency and high-throughput structures of the GCM. Finally, by investigating the high-performance GF(2^128) multiplier architectures, we benchmark the proposed AES-GCM architectures using quadratic and subquadratic hardware complexity GF(2^128) multipliers. It is shown that the presented AES-GCM architectures outperform the previously reported ones in the utilized 65-nm CMOS technology.
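The field arithmetic underlying both the S-box subfield inversions and the GCM multipliers can be shown in a small binary field. The sketch below multiplies in GF(2^4) with the reduction polynomial x^4 + x + 1; the paper's subfield representation may differ, and the GF(2^128) GCM multipliers use a different, wider reduction polynomial, but the shift-and-reduce structure is the same.

```python
def gf16_mul(a, b, poly=0b10011):
    """Multiply in GF(2^4), reducing by x^4 + x + 1 (an assumed polynomial)."""
    r = 0
    while b:
        if b & 1:
            r ^= a            # carry-less "add" of the current partial product
        a <<= 1
        if a & 0b10000:       # degree reached 4: reduce modulo the polynomial
            a ^= poly
        b >>= 1
    return r

# sanity check of the field structure: every nonzero element is invertible,
# which is what makes the S-box's subfield inversion well defined
has_inverse = all(
    any(gf16_mul(a, b) == 1 for b in range(1, 16)) for a in range(1, 16)
)
```

Hardware multipliers unroll this loop into XOR trees; the quadratic versus subquadratic distinction in the paper is about how those partial products are combined at 128-bit width.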

Journal ArticleDOI
TL;DR: The methods presented in this paper provide a theoretical basis and analytical relations between speed, voltage, power, and temperature, which offer greater insight into the early-phase design of processors and are also useful for online dynamic thermal management.
Abstract: This paper addresses the problem of determining the feasible speeds and voltages of multicore processors with hard real-time and temperature constraints. This is an important problem, which has applications in time-critical execution of programs like audio and video encoding on application-specific embedded processors. Two problems are solved. The first is the computation of the optimal time-varying voltages and speeds of each core in a heterogeneous multicore processor, that minimize the makespan, the latest completion time of all tasks, while satisfying timing and temperature constraints. The solution to the makespan minimization problem is then extended to the problem of determining the feasible speeds and voltages that satisfy task deadlines. The methods presented in this paper also provide a theoretical basis and analytical relations between speed, voltage, power and temperature, which provide greater insight into the early-phase design of processors and are also useful for online dynamic thermal management.
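The speed/voltage/power/temperature coupling can be illustrated with a toy steady-state model (all constants and the linear voltage-frequency pairing are illustrative assumptions, not the paper's analytical relations): dynamic power grows as C V^2 f, a lumped thermal resistance maps power to temperature, and the feasible speed is the fastest grid point whose steady-state temperature stays under the cap.

```python
def core_power(f_ghz, c_eff=1.0):
    # illustrative voltage/frequency pairing: V must rise with f (toy constants)
    v = 0.6 + 0.2 * f_ghz
    return c_eff * v * v * f_ghz           # dynamic power ~ C * V^2 * f

def steady_temp(p_watts, t_amb=45.0, r_th=10.0):
    # steady-state lumped thermal model: ambient plus thermal resistance * power
    return t_amb + r_th * p_watts

def max_feasible_speed(t_max, f_grid):
    # fastest grid speed whose steady-state temperature respects the cap
    feasible = [f for f in f_grid if steady_temp(core_power(f)) <= t_max]
    return max(feasible) if feasible else None
```

Because V rises with f, power grows superlinearly in f, which is why a small relaxation of the thermal cap buys only a modest speed increase.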

Journal ArticleDOI
TL;DR: The proposed REM architecture achieved up to 11 Gbps concurrent throughput for various regex sets and up to 2.67× the throughput efficiency of other state-of-the-art designs.
Abstract: We present the design, implementation and evaluation of a high-performance architecture for regular expression matching (REM) on field-programmable gate array (FPGA). Each regular expression (regex) is first parsed into a concise token list representation, then compiled to a modular nondeterministic finite automaton (RE-NFA) using a modified version of the McNaughton-Yamada algorithm. The RE-NFA can be mapped directly onto a compact register-transfer level (RTL) circuit. A number of optimizations are applied to improve the circuit performance: 1) spatial stacking is used to construct an REM circuit processing m ≥ 1 input characters per clock cycle; 2) single-character constrained repetitions are matched efficiently by parallel shift-register lookup tables; 3) complex character classes are matched by a BRAM-based classifier shared across regexes; 4) a multipipeline architecture is used to organize a large number of RE-NFAs into priority groups to limit the I/O size of the circuit. We implemented 2,630 unique PCRE regexes from Snort rules (February 2010) in the proposed REM architecture. Based on the place-and-route results from Xilinx ISE 11.1 targeting Virtex5 LX-220 FPGAs, the proposed REM architecture achieved up to 11 Gbps concurrent throughput for various regex sets and up to 2.67× the throughput efficiency of other state-of-the-art designs.
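The compile-then-simulate flow can be sketched in software. The construction below is a Thompson-style regex-to-NFA build (a stand-in for the modified McNaughton-Yamada compilation; the AST node names are hypothetical), and the match loop advances the whole active-state set one input character per step, which is exactly what the RTL circuit evaluates in parallel with one flip-flop per NFA state:

```python
from collections import defaultdict

# Regex AST nodes (illustrative): ('lit', c), ('cat', a, b), ('alt', a, b), ('star', a)

class NFA:
    def __init__(self):
        self.eps = defaultdict(set)      # state -> epsilon successors
        self.delta = defaultdict(set)    # (state, char) -> successors
        self.n = 0
    def new_state(self):
        self.n += 1
        return self.n - 1

def build(node, nfa):
    kind = node[0]
    if kind == 'lit':
        s, t = nfa.new_state(), nfa.new_state()
        nfa.delta[(s, node[1])].add(t)
        return s, t
    if kind == 'cat':
        s1, t1 = build(node[1], nfa)
        s2, t2 = build(node[2], nfa)
        nfa.eps[t1].add(s2)
        return s1, t2
    if kind == 'alt':
        s1, t1 = build(node[1], nfa)
        s2, t2 = build(node[2], nfa)
        s, t = nfa.new_state(), nfa.new_state()
        nfa.eps[s] |= {s1, s2}
        nfa.eps[t1].add(t)
        nfa.eps[t2].add(t)
        return s, t
    if kind == 'star':
        s1, t1 = build(node[1], nfa)
        s, t = nfa.new_state(), nfa.new_state()
        nfa.eps[s] |= {s1, t}
        nfa.eps[t1] |= {s1, t}
        return s, t
    raise ValueError(kind)

def eclose(nfa, states):
    # transitive closure over epsilon edges
    stack, seen = list(states), set(states)
    while stack:
        for r in nfa.eps[stack.pop()]:
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def matches(node, text):
    nfa = NFA()
    s, t = build(node, nfa)
    cur = eclose(nfa, {s})
    for ch in text:                      # one input character per "clock cycle"
        cur = eclose(nfa, {r for q in cur for r in nfa.delta[(q, ch)]})
    return t in cur
```

Spatial stacking (m characters per cycle) corresponds to composing m copies of this one-step update into a single cycle.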

Journal ArticleDOI
TL;DR: The analytical formulation presented in this paper is general and offers the foundation for the quantitative and rapid evaluation of computer architectures under different constraints including that of single die area.
Abstract: Beginning with Amdahl's law, we derive a general objective function that links parallel processing performance gains at the system level, to energy and delay in the subsystem microarchitecture structures. The objective function employs parameterized models of computation and communication to represent the characteristics of processors, memories, and communications networks. The interaction of the latter microarchitectural elements defines global system performance in terms of energy-delay cost. Following the derivation, we demonstrate its utility by applying it to the problem of Chip Multiprocessor (CMP) architecture exploration. Given a set of application and architectural parameters, we solve for the optimal CMP architecture for six different architectural optimization examples. We find the parameters that minimize the total system cost, defined by the objective function under the area constraint of a single die. The analytical formulation presented in this paper is general and offers the foundation for the quantitative and rapid evaluation of computer architectures under different constraints including that of single die area.
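The flavor of the objective function can be shown with a toy instance (the energy model and constants here are illustrative assumptions, not the paper's parameterization): Amdahl's law gives the delay, static energy accrues per core over the run time, and the optimal core count minimizes the energy-delay product subject to a die-area budget.

```python
def speedup(f, n):
    # Amdahl's law: fraction f of the work parallelizes across n cores
    return 1.0 / ((1.0 - f) + f / n)

def energy_delay(f, n, static_per_core=0.2):
    # toy objective: unit dynamic energy plus per-core static energy
    # accumulated over the run time (constants illustrative)
    delay = 1.0 / speedup(f, n)
    energy = 1.0 + n * static_per_core * delay
    return energy * delay

def best_core_count(f, die_area, core_area=1.0):
    # minimize the objective over core counts that fit on the die
    n_max = int(die_area // core_area)
    return min(range(1, n_max + 1), key=lambda n: energy_delay(f, n))
```

Even this toy version reproduces the qualitative result: a highly parallel workload pushes the optimum to the area constraint, while a half-serial one is capped early by static energy.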

Journal ArticleDOI
TL;DR: This paper proposes a programming framework that combines the ease of use of OpenMP with simple, yet powerful, language extensions to trigger array data partitioning and exploits profiled information on array access count to automatically generate data allocation schemes optimized for locality of references.
Abstract: Most of today's state-of-the-art processors for mobile and embedded systems feature on-chip scratchpad memories. To efficiently exploit the advantages of low-latency high-bandwidth memory modules in the hierarchy, there is the need for programming models and/or language features that expose such architectural details. On the other hand, effectively exploiting the limited on-chip memory space requires the programmer to devise an efficient partitioning and distributed placement of shared data at the application level. In this paper, we propose a programming framework that combines the ease of use of OpenMP with simple, yet powerful, language extensions to trigger array data partitioning. Our compiler exploits profiled information on array access count to automatically generate data allocation schemes optimized for locality of references.
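A minimal sketch of profile-driven placement in the spirit of the compiler described above (the greedy policy and names here are hypothetical, not the paper's allocation algorithm): split a shared array into tiles, rank tiles by profiled access count, and keep the hottest tiles in the scratchpad until capacity runs out.

```python
def place_tiles(access_counts, tile_size, spm_capacity):
    # Greedy scratchpad placement driven by profiled per-element access
    # counts: the most frequently touched tiles go on chip.
    tiles = [(sum(access_counts[i:i + tile_size]), i // tile_size)
             for i in range(0, len(access_counts), tile_size)]
    tiles.sort(reverse=True)             # hottest tiles first
    in_spm, used = [], 0
    for count, tid in tiles:
        if used + tile_size <= spm_capacity:
            in_spm.append(tid)
            used += tile_size
    return sorted(in_spm)                # tile ids resident in scratchpad
```

Tiles not selected stay in off-chip memory; the language extensions in the paper let the programmer mark which arrays are candidates for this treatment.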

Journal ArticleDOI
TL;DR: The first architecture is built around a sparse carry computation unit that computes only some of the carries of the modulo 2^n+1 addition and its regularity and area efficiency are further enhanced by the introduction of a new prefix operator.
Abstract: Two architectures for modulo 2^n+1 adders are introduced in this paper. The first one is built around a sparse carry computation unit that computes only some of the carries of the modulo 2^n+1 addition. This sparse approach is enabled by the introduction of the inverted circular idempotency property of the parallel-prefix carry operator and its regularity and area efficiency are further enhanced by the introduction of a new prefix operator. The resulting diminished-1 adders can be implemented in smaller area and consume less power compared to all earlier proposals, while maintaining a high operation speed. The second architecture unifies the design of modulo 2^n ± 1 adders. It is shown that modulo 2^n+1 adders can be easily derived by straightforward modifications of modulo 2^n-1 adders with minor hardware overhead.
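The diminished-1 number system these adders operate on is easy to state: a value v in [1, 2^n] is stored as v - 1, so the sum needs one extra increment, which hardware realizes with an inverted end-around carry. A behavioral sketch (zero operands and results need a separate flag in a full design, omitted here):

```python
def dim1_add(A, B, n):
    # Diminished-1 modulo 2^n + 1 addition: inputs/outputs store v - 1 for
    # v in [1, 2^n]. d(a+b) = d(a) + d(b) + 1 (mod 2^n + 1), computed as
    # S = (A + B + NOT carry_out) mod 2^n, the inverted end-around carry.
    t = A + B
    carry = t >> n                       # carry out of the n-bit addition
    return (t + (1 - carry)) & ((1 << n) - 1)
```

The sparse parallel-prefix unit in the paper computes only a subset of the carries of this addition; the arithmetic result is the same.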

Journal ArticleDOI
TL;DR: This paper establishes a baseline by studying GEneral Matrix-matrix Multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures and argues that these customizations can be generalized to perform other representative linear algebra operations.
Abstract: As technology is reaching physical limits, reducing power consumption is a key issue on our path to sustained performance. In this paper, we study fundamental tradeoffs and limits in efficiency (as measured in energy per operation) that can be achieved for an important class of kernels, namely the level-3 Basic Linear Algebra Subprograms (BLAS). It is well-accepted that specialization is the key to efficiency. This paper establishes a baseline by studying GEneral Matrix-matrix Multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures. Our analysis shows that orders of magnitude improvements in efficiency are possible with relatively simple customizations and fine-tuning of memory hierarchy configurations. We argue that these customizations can be generalized to perform other representative linear algebra operations. In addition to exposing the sources of inefficiencies in current CPUs and GPUs, our results show our prototype Linear Algebra Processor (LAP) implementing Double-precision GEMM (DGEMM) can achieve 600 GFLOPS while consuming less than 25 Watts in standard 45 nm technology, which is up to 50× more energy efficient than cutting-edge CPUs.
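The memory-hierarchy tuning the abstract refers to hinges on GEMM's blocked structure: an nb-by-nb tile bounds the working set so each tile of A and B is reused from fast memory. A plain-Python sketch of that loop structure (illustrative, not a tuned kernel):

```python
def block_gemm(A, B, nb):
    # Blocked C = A * B over lists of lists; the nb x nb tiling is the
    # locality lever that custom memory hierarchies exploit.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, nb):
        for k0 in range(0, k, nb):
            for j0 in range(0, m, nb):
                # multiply one tile pair, accumulating into the C tile
                for i in range(i0, min(i0 + nb, n)):
                    for kk in range(k0, min(k0 + nb, k)):
                        aik = A[i][kk]
                        for j in range(j0, min(j0 + nb, m)):
                            C[i][j] += aik * B[kk][j]
    return C
```

Each tile pair performs O(nb^3) arithmetic on O(nb^2) data, so larger on-chip tiles raise the compute-to-traffic ratio, which is the source of the efficiency headroom the paper quantifies.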

Journal ArticleDOI
TL;DR: Experimental results show that LSBF, compared with a baseline approach and other state-of-the-art work in the literature, takes less time to respond to AMQs and consumes much less storage space.
Abstract: In many network applications, Bloom filters are used to support exact-matching membership query for their randomized space-efficient data structure with a small probability of false answers. In this paper, we extend the standard Bloom filter to Locality-Sensitive Bloom Filter (LSBF) to provide Approximate Membership Query (AMQ) service. We achieve this by replacing uniform and independent hash functions with locality-sensitive hash functions. Such replacement makes the storage in LSBF locality sensitive. Meanwhile, LSBF is space efficient and query responsive by employing the Bloom filter design. In the design of the LSBF structure, we propose a bit vector to reduce False Positives (FP). The bit vector can verify multiple attributes belonging to one member. We also use an active overflowed scheme to significantly decrease False Negatives (FN). Rigorous theoretical analysis (e.g., on FP, FN, and space overhead) shows that the design of LSBF is space compact and can provide accurate responses to approximate membership queries. We have implemented LSBF in a real distributed system to perform extensive experiments using real-world traces. Experimental results show that LSBF, compared with a baseline approach and other state-of-the-art work in the literature (SmartStore and LSB-tree), takes less time to respond to AMQs and consumes much less storage space.
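The core replacement, LSH functions in place of uniform hashes, can be sketched with random-hyperplane signatures (a toy model; the paper's verification bit vector and active overflowed scheme are omitted, and the class layout here is an assumption):

```python
import random

class LSBF:
    # Toy locality-sensitive Bloom filter: the k uniform hash functions of a
    # standard Bloom filter are replaced by k random-hyperplane LSH
    # signatures, so similar vectors probe (mostly) the same bit positions.
    def __init__(self, dim, m=1024, k=4, bits=10, seed=1):
        rnd = random.Random(seed)
        self.m = m
        self.planes = [[[rnd.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(bits)] for _ in range(k)]
        self.array = [0] * m

    def _positions(self, v):
        # each hash: a 'bits'-bit sign signature of v against random planes
        for group in self.planes:
            sig = 0
            for plane in group:
                dot = sum(x * y for x, y in zip(v, plane))
                sig = (sig << 1) | (1 if dot >= 0.0 else 0)
            yield sig % self.m

    def add(self, v):
        for pos in self._positions(v):
            self.array[pos] = 1

    def query(self, v):
        return all(self.array[pos] for pos in self._positions(v))
```

A near-duplicate of an inserted vector flips few signature bits and thus usually probes the same positions; a very dissimilar vector lands elsewhere, which is what makes the membership test approximate rather than exact.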

Journal ArticleDOI
TL;DR: Simulations in a commercial 45 nm, 1.2 V, CMOS process show that soft NMR provides up to 10× improvement in robustness, and 35 percent power savings over conventional NMR.
Abstract: Achieving robustness and energy efficiency in nanoscale CMOS process technologies is made challenging due to the presence of process, temperature, and voltage variations. Traditional fault-tolerance techniques such as N-modular redundancy (NMR) employ deterministic error detection and correction, e.g., majority voter, and tend to be power hungry. This paper proposes soft NMR that nontrivially extends NMR by consciously exploiting error statistics caused by nanoscale artifacts in order to design robust and energy-efficient systems. In contrast to conventional NMR, soft NMR employs Bayesian detection techniques in the voter. Soft voter algorithms are obtained through optimization of appropriate application aware cost functions. Analysis indicates that, on average, soft NMR outperforms conventional NMR. Furthermore, unlike NMR, in many cases, soft NMR is able to generate a correct output even when all N replicas are in error. This increase in robustness is then traded off through voltage scaling to achieve energy efficiency. The design of a discrete cosine transform (DCT) image coder is employed to demonstrate the benefits of the proposed technique. Simulations in a commercial 45 nm, 1.2 V, CMOS process show that soft NMR provides up to 10× improvement in robustness, and 35 percent power savings over conventional NMR.
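The difference between a majority voter and a Bayesian voter is easy to show on a toy error model (the uniform-error assumption and the specific numbers are illustrative, not the paper's cost functions): when per-replica error statistics are known, the maximum-likelihood choice can side with a single reliable replica against two flaky ones.

```python
def majority_vote(obs):
    # conventional NMR: pick the most frequent replica output
    return max(set(obs), key=obs.count)

def soft_vote(obs, err_probs, alphabet):
    # Bayesian (maximum-likelihood) voter: weight each replica by its known
    # error rate. Toy model: replica i is correct with prob 1 - err_probs[i],
    # otherwise uniform over the remaining symbols.
    def likelihood(v):
        L = 1.0
        for y, p in zip(obs, err_probs):
            L *= (1.0 - p) if y == v else p / (len(alphabet) - 1)
        return L
    return max(alphabet, key=likelihood)
```

With error statistics over the whole alphabet, the soft voter can even output a value none of the replicas produced, which is the mechanism behind the abstract's claim of correct outputs when all N replicas err.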

Journal ArticleDOI
TL;DR: This paper proposes two efficient soft error mitigation schemes, namely, Soft Error Mitigation (SEM) and Soft and Timing Error Mitigation (STEM), using the approach of multiple clocking of data for protecting combinational logic blocks from soft errors.
Abstract: The threat of soft error induced system failure in computing systems has become more prominent, as we adopt ultradeep submicron process technologies. In this paper, we propose two efficient soft error mitigation schemes, namely, Soft Error Mitigation (SEM) and Soft and Timing Error Mitigation (STEM), using the approach of multiple clocking of data for protecting combinational logic blocks from soft errors. Our first technique, SEM, based on distributed and temporal voting of three registers, unloads the soft error detection overhead from the critical path of the system. SEM is also capable of ignoring false errors and recovers from soft errors via fast in-situ recovery, avoiding recomputation. Our second technique, STEM, while tolerating soft errors, adds timing error detection capability to guarantee reliable execution in aggressively clocked designs that enhance system performance by operating beyond worst-case clock frequency. We also present a specialized low overhead clock phase management scheme that ably supports our proposed techniques. Timing-annotated gate-level simulations, using 45 nm libraries, of a pipelined adder-multiplier and DLX processor show that both our techniques achieve near 100 percent fault coverage. For DLX processor, even under severe fault injection campaigns, SEM achieves an average performance improvement of 26.58 percent over a conventional triple modular redundancy voter-based soft error mitigation scheme, while STEM outperforms SEM by 27.42 percent.
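The temporal-voting idea behind SEM can be illustrated with a toy waveform model (the segment representation and numbers are illustrative, not the paper's circuit): latch the same combinational output into three registers at skewed instants, then majority-vote, so a transient pulse shorter than twice the skew corrupts at most one sample.

```python
def sample(signal, t):
    # signal: time-sorted (start_time, value) segments of a combinational output
    val = signal[0][1]
    for start, v in signal:
        if start <= t:
            val = v
    return val

def temporal_vote(signal, t, skew):
    # SEM-style temporal redundancy: capture at t, t + skew, t + 2*skew
    # and majority-vote across the three registers.
    samples = [sample(signal, t + i * skew) for i in range(3)]
    return max(set(samples), key=samples.count)
```

STEM builds on the same skewed samples but compares them to also flag late-arriving (timing-error) transitions in over-clocked operation.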

Journal ArticleDOI
TL;DR: The design outperforms previous hardware implementations, as well as tuned software implementations including the ATLAS and MKL libraries on workstations, and has been synthesized for FPGA targets and can be easily retargeted.
Abstract: Decomposition of a matrix into lower and upper triangular matrices (LU decomposition) is a vital part of many scientific and engineering applications, and the block LU decomposition algorithm is an approach well suited to parallel hardware implementation. This paper presents an approach to speed up implementation of the block LU decomposition algorithm using FPGA hardware. Unlike most previous approaches reported in the literature, the approach does not assume the matrix can be stored entirely on chip. The memory accesses are studied for various FPGA configurations, and a schedule of operations that scales well is shown. The design has been synthesized for FPGA targets and can be easily retargeted. The design outperforms previous hardware implementations, as well as tuned software implementations including the ATLAS and MKL libraries on workstations.
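The block LU structure the abstract refers to can be sketched as a right-looking, unpivoted factorization (a behavioral sketch only; the paper's scheduling and off-chip memory traffic are the real contribution and are not modeled here): factor an nb-wide panel, triangular-solve for the corresponding U block row, then apply a rank-nb update to the trailing matrix.

```python
def block_lu(A, nb):
    # Right-looking block LU without pivoting, in place: L is stored as unit
    # lower multipliers below the diagonal, U on and above the diagonal.
    n = len(A)
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        for k in range(k0, k1):              # factor panel columns k0..k1-1
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                for j in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
        for k in range(k0, k1):              # U12 = inv(L11) * A12
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    A[i][j] -= A[i][k] * A[k][j]
        for i in range(k1, n):               # trailing update: A22 -= L21 * U12
            for k in range(k0, k1):
                for j in range(k1, n):
                    A[i][j] -= A[i][k] * A[k][j]
    return A
```

The appeal for hardware is that the trailing update is a matrix-matrix product over block-sized operands, so data can stream through the FPGA in blocks rather than residing entirely on chip.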

Journal ArticleDOI
Jaehong Kim, Sangwon Seo, Dawoon Jung, Jin-Soo Kim, Jaehyuk Huh
TL;DR: This paper proposes a methodology which can extract several essential parameters affecting the performance of SSDs, and apply the extracted parameters to SSD systems for performance improvement, and modify two operating system components to optimize their operations with the SSD parameters.
Abstract: Solid state disks (SSDs) have many advantages over hard disk drives, including better reliability, performance, durability, and power efficiency. However, the characteristics of SSDs are completely different from those of hard disk drives with rotating disks. To achieve the full potential performance improvement with SSDs, operating systems or applications must understand the critical performance parameters of SSDs to fine-tune their accesses. However, the internal hardware and software organizations vary significantly among SSDs and, thus, each SSD exhibits different parameters which influence the overall performance. In this paper, we propose a methodology which can extract several essential parameters affecting the performance of SSDs, and apply the extracted parameters to SSD systems for performance improvement. The target parameters of SSDs considered in this paper are 1) the size of read/write unit, 2) the size of erase unit, 3) the size of read buffer, and 4) the size of write buffer. We modify two operating system components to optimize their operations with the SSD parameters. The experimental results show that such parameter-aware management leads to significant performance improvements for large file accesses by performing SSD-specific optimizations.
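The extraction methodology can be illustrated with a toy microbenchmark against a simulated device (the latency model, constants, and the jump-detection rule are all illustrative assumptions, not the paper's procedure): grow the request size in 1 KiB steps and take the first latency jump as the internal read/write unit boundary.

```python
PAGE = 4096  # hypothetical internal write unit of the toy device model

def device_write_latency(size_bytes, page=PAGE):
    # toy device: latency grows with the number of internal pages a request touches
    pages = -(-size_bytes // page)        # ceiling division
    return 50 + 100 * pages               # microseconds, illustrative

def extract_write_unit(probe, step=1024, max_kib=64):
    # grow the request size until the measured latency first jumps; the
    # largest size served at the base latency marks the unit boundary
    base = probe(step)
    size = step
    while size < max_kib * 1024 and probe(size + step) == base:
        size += step
    return size
```

On real hardware the same idea needs repeated timed I/O and statistical filtering, since measured latencies are noisy rather than exact plateaus.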