
Showing papers in "IEEE Transactions on Computers in 2017"


Journal ArticleDOI
TL;DR: This paper presents a method for minimizing Service Delay in a scenario with two cloudlet servers, which has a dual focus on computation and communication elements, controlling Processing Delay through virtual machine migration and improving Transmission Delay with Transmission Power Control.
Abstract: Due to physical limitations, mobile devices are restricted in memory, battery life, and processing power, among other resources. As a result, many applications cannot run on such devices. Edge Cloud Computing addresses this problem by letting users offload tasks they cannot run to cloudlet servers at the edge of the network. The main requirement of such a system is a low Service Delay, which corresponds to a high Quality of Service. This paper presents a method for minimizing Service Delay in a scenario with two cloudlet servers. The method has a dual focus on computation and communication elements, controlling Processing Delay through virtual machine migration and improving Transmission Delay with Transmission Power Control. The foundation of the proposal is a mathematical model of the scenario, whose analysis is used to compare the proposed approach with two other conventional methods; these methods have a single focus and only make an effort to improve either Transmission Delay or Processing Delay, but not both. As expected, the proposal presents the lowest Service Delay in all study cases, corroborating our conclusion that a dual-focus approach is the best way to tackle the Service Delay problem in Edge Cloud Computing.
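As a rough, self-contained illustration of the dual-focus idea (not the paper's actual model), the Python sketch below treats Service Delay as the sum of a Transmission Delay that falls as transmit power rises (via a Shannon-rate approximation) and a Processing Delay that grows with cloudlet load; all candidate powers, loads, and constants are made-up numbers.

```python
import math

def transmission_delay(bits, bandwidth_hz, tx_power_w, noise_w, gain=1.0):
    """Shannon-rate approximation: higher transmit power -> higher rate -> lower delay."""
    rate = bandwidth_hz * math.log2(1.0 + gain * tx_power_w / noise_w)
    return bits / rate

def processing_delay(cycles, cpu_hz, load):
    """M/M/1-style scaling: delay grows as the cloudlet's utilization approaches 1."""
    assert 0.0 <= load < 1.0
    return cycles / (cpu_hz * (1.0 - load))

def service_delay(bits, cycles, tx_power_w, load,
                  bandwidth_hz=10e6, noise_w=1e-9, cpu_hz=3e9):
    return (transmission_delay(bits, bandwidth_hz, tx_power_w, noise_w)
            + processing_delay(cycles, cpu_hz, load))

# Dual focus: jointly pick the transmit power and the cloudlet (load) minimizing delay.
powers = [0.1, 0.5, 1.0]   # candidate transmit powers (W)
loads = [0.3, 0.7]         # loads of the two cloudlets after possible VM migration
best = min(((p, l, service_delay(8e6, 1e9, p, l)) for p in powers for l in loads),
           key=lambda t: t[2])
print("power=%.1f W, load=%.1f, delay=%.3f s" % best)
```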

335 citations


Journal ArticleDOI
TL;DR: Simulation results demonstrate that the proposal outperforms the benchmark method in terms of delay, throughput, and signaling overhead, and it is demonstrated how the uniquely characterized input and output traffic patterns can enhance the route computation of the deep learning based SDRs.
Abstract: In recent years, Software Defined Routers (SDRs) (programmable routers) have emerged as a viable solution to provide a cost-effective packet processing platform with easy extensibility and programmability. Multi-core platforms significantly promote SDRs’ parallel computing capacities, enabling them to adopt artificial intelligence techniques, i.e., deep learning, to manage routing paths. In this paper, we explore new opportunities in packet processing with deep learning to inexpensively shift the computing needs from rule-based route computation to deep learning based route estimation for high-throughput packet processing. Even though deep learning techniques have been extensively exploited in various computing areas, researchers have, to date, not been able to effectively utilize deep learning based route computation for high-speed core networks. We envision a supervised deep learning system to construct the routing tables and show how the proposed method can be integrated with programmable routers using both Central Processing Units (CPUs) and Graphics Processing Units (GPUs). We demonstrate how our uniquely characterized input and output traffic patterns can enhance the route computation of the deep learning based SDRs through both analysis and extensive computer simulations. In particular, the simulation results demonstrate that our proposal outperforms the benchmark method in terms of delay, throughput, and signaling overhead.

287 citations


Journal ArticleDOI
TL;DR: The results show that the proposed 16-bit approximate radix-4 Booth multipliers with approximation factors of 12 and 14 are more accurate than existing approximate Booth multipliers with moderate power consumption, and the proposed R4ABM2 multiplier with an approximation factor of 14 is the most efficient design.
Abstract: Approximate computing is an attractive design methodology to achieve low power, high performance (low delay) and reduced circuit complexity by relaxing the requirement of accuracy. In this paper, approximate Booth multipliers are designed based on approximate radix-4 modified Booth encoding (MBE) algorithms and a regular partial product array that employs an approximate Wallace tree. Two approximate Booth encoders are proposed and analyzed for error-tolerant computing. The error characteristics are analyzed with respect to the so-called approximation factor that is related to the inexact bit width of the Booth multipliers. Simulation results at 45 nm feature size in CMOS for delay, area and power consumption are also provided. The results show that the proposed 16-bit approximate radix-4 Booth multipliers with approximation factors of 12 and 14 are more accurate than existing approximate Booth multipliers with moderate power consumption. The proposed R4ABM2 multiplier with an approximation factor of 14 is the most efficient design when considering both power-delay product and the error metric NMED. Case studies for image processing show the validity of the proposed approximate radix-4 Booth multipliers.
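To make the radix-4 modified Booth encoding concrete, here is a small Python sketch. The recoding table is the standard exact MBE; as a stand-in for the paper's approximate encoders (whose truth tables are not reproduced here), the hypothetical `approx_factor` parameter simply truncates that many least-significant result columns.

```python
def booth_radix4_digits(y, n_bits):
    """Exact radix-4 modified Booth recoding of an n_bits multiplier into
    digits in {-2, -1, 0, +1, +2}, one per overlapping 3-bit window."""
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    y &= (1 << n_bits) - 1
    padded = y << 1  # implicit y_{-1} = 0 below the LSB
    return [table[(padded >> i) & 0b111] for i in range(0, n_bits, 2)]

def booth_multiply(x, y, n_bits, approx_factor=0):
    """Sum of shifted partial products; as a crude stand-in for an approximate
    encoder, result columns below `approx_factor` are simply dropped."""
    prod = sum(d * x << (2 * i) for i, d in enumerate(booth_radix4_digits(y, n_bits)))
    return prod & ~((1 << approx_factor) - 1) if approx_factor else prod

print(booth_multiply(1234, 5678, 16))                    # exact: 7006652
print(booth_multiply(1234, 5678, 16, approx_factor=14))  # low 14 columns zeroed
```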

205 citations


Journal ArticleDOI
TL;DR: This work discusses the low-level number representation, strategies for precision and error bounds, and the implementation of efficient polynomial arithmetic with interval coefficients in Arb, a C library for arbitrary-precision interval arithmetic.
Abstract: Arb is a C library for arbitrary-precision interval arithmetic using the midpoint-radius representation, also known as ball arithmetic. It supports real and complex numbers, polynomials, power series, matrices, and evaluation of many special functions. The core number types are designed for versatility and speed in a range of scenarios, allowing performance that is competitive with non-interval arbitrary-precision types such as MPFR and MPC floating-point numbers. We discuss the low-level number representation, strategies for precision and error bounds, and the implementation of efficient polynomial arithmetic with interval coefficients.
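A toy Python version of the midpoint-radius ("ball") representation conveys the core idea. Unlike Arb, which tracks rounding rigorously with directed rounding and exact error bookkeeping, this sketch merely inflates the radius by a machine-epsilon term after each operation.

```python
import sys

EPS = sys.float_info.epsilon  # crude stand-in for Arb's rigorous rounding bounds

class Ball:
    """A real number represented as the interval [mid - rad, mid + rad]."""
    def __init__(self, mid, rad=0.0):
        self.mid, self.rad = float(mid), float(rad)

    def __add__(self, other):
        mid = self.mid + other.mid
        # propagate both radii and account for rounding of the midpoint itself
        rad = self.rad + other.rad + abs(mid) * EPS
        return Ball(mid, rad)

    def __mul__(self, other):
        mid = self.mid * other.mid
        rad = (abs(self.mid) * other.rad + abs(other.mid) * self.rad
               + self.rad * other.rad + abs(mid) * EPS)
        return Ball(mid, rad)

    def __repr__(self):
        return f"[{self.mid} +/- {self.rad:.3g}]"

x = Ball(1 / 3, 1e-17)  # 1/3 with a small input uncertainty
y = Ball(2.0)
print(x * y + x)        # a ball enclosing 1.0
```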

182 citations


Journal ArticleDOI
TL;DR: A custom multi-chip machine-learning architecture containing a combination of custom storage and computational units, evaluated with electrical and optical inter-chip interconnects separately, is introduced; it is shown that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 656.63× over a GPU and reduce the energy by 184.05× on average for a 64-chip system.
Abstract: Many companies are deploying services largely based on machine-learning algorithms for sophisticated processing of large amounts of data, either for consumers or industry. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer a high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines, and evaluate performance by integrating electrical and optical inter-chip interconnects separately. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 656.63× over a GPU, and reduce the energy by 184.05× on average for a 64-chip system. We implement the node down to the place and route at 28 nm, containing a combination of custom storage and computational units, with electrical inter-chip interconnects.

169 citations


Journal ArticleDOI
TL;DR: Evaluation results show that the cyber-physical sensing framework can achieve both maximal adaptive data processing and dissemination performance, presenting better results than other commonly used dissemination protocols such as periodic, uniform and neighbor protocols in both single-swarm and multi-swarm cases.
Abstract: We present ADDSEN middleware as a holistic solution for Adaptive Data processing and dissemination for Drone swarms in urban SENsing. To efficiently process sensed data in the middleware, we have proposed a cyber-physical sensing framework using partially ordered knowledge sharing for distributed knowledge management in drone swarms. A reinforcement learning dissemination strategy is implemented in the framework. ADDSEN uses online learning techniques to adaptively balance the broadcast rate and knowledge loss rate periodically. The learned broadcast rate is adapted by executing state transitions during the process of online learning. A strategy function guides state transitions, incorporating a set of variables to reflect changes in link status. In addition, we design a cooperative dissemination method for the task of balancing storage and energy allocation in drone swarms. We implemented ADDSEN in our cyber-physical sensing framework, and evaluation results show that it can achieve both maximal adaptive data processing and dissemination performance, presenting better results than other commonly used dissemination protocols such as periodic, uniform and neighbor protocols in both single-swarm and multi-swarm cases.

90 citations


Journal ArticleDOI
TL;DR: A generic methodology for analytical modeling of probability of occurrence of error and the Probability Mass Function of error value in a selected class of approximate adders is presented, which can serve as performance metrics for the comparative analysis of various adders and their configurations.
Abstract: Approximate adders are widely advocated as a means to achieve performance gain in error-resilient applications. In this paper, a generic methodology for analytical modeling of the probability of occurrence of error and the Probability Mass Function (PMF) of the error value in a selected class of approximate adders is presented, which can serve as performance metrics for the comparative analysis of various adders and their configurations. The proposed model is applicable to approximate adders that comprise sub-adder units of uniform as well as non-uniform lengths. Using a systematic methodology, we derive closed form expressions for the probability of error for a number of state-of-the-art high-performance approximate adders. The probabilistic analysis is carried out for arbitrary input distributions. It can be used to study the dependence of error statistics in an adder’s output on its configuration and input distribution. Moreover, it is shown that by building upon the proposed error model, we can estimate the probability of error in circuits with multiple approximate adders. We also demonstrate that, using the proposed analysis, the comparative performance of different approximate adders can be correctly predicted in practical applications of image processing.
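The paper derives closed-form error expressions; as a hedged companion, the Python sketch below estimates the error PMF by Monte Carlo for one generic member of the class, a block-based adder whose sub-adders drop inter-block carries (the specific adders analyzed in the paper speculate carries differently). Such a simulation is a natural sanity check for an analytical model.

```python
import random
from collections import Counter

def approx_add(a, b, n_bits=16, block=4):
    """Generic block-based approximate adder: n_bits/block independent
    sub-adders; carries between blocks (and the final carry-out) are dropped."""
    out, mask = 0, (1 << block) - 1
    for i in range(0, n_bits, block):
        s = ((a >> i) & mask) + ((b >> i) & mask)
        out |= (s & mask) << i  # carry-out of each block is discarded
    return out

def error_pmf(trials=100_000, n_bits=16, block=4):
    """Empirical PMF of (approximate sum - exact sum) under uniform inputs."""
    pmf = Counter()
    hi = 1 << n_bits
    for _ in range(trials):
        a, b = random.randrange(hi), random.randrange(hi)
        pmf[approx_add(a, b, n_bits, block) - (a + b)] += 1
    return {e: c / trials for e, c in sorted(pmf.items())}

pmf = error_pmf()
print("P(error != 0) =", 1.0 - pmf.get(0, 0.0))
```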

88 citations


Journal ArticleDOI
TL;DR: It is demonstrated that a hardware (HW) implementation of network security algorithms can significantly reduce their energy consumption compared to an equivalent software (SW) version.
Abstract: Nowadays, a significant part of all network accesses comes from embedded and battery-powered devices, which must be energy efficient. This paper demonstrates that a hardware (HW) implementation of network security algorithms can significantly reduce their energy consumption compared to an equivalent software (SW) version. The paper has four main contributions: (i) a new feature extraction algorithm, with low processing demands and suitable for hardware implementation; (ii) a feature selection method with two objectives—accuracy and energy consumption; (iii) detailed energy measurements of the feature extraction engine and three machine learning (ML) classifiers implemented in SW and HW—Decision Tree (DT), Naive-Bayes (NB), and k-Nearest Neighbors (kNN); and (iv) a detailed analysis of the tradeoffs in implementing the feature extractor and ML classifiers in SW and HW. The new feature extractor demands significantly less computational power, memory, and energy. Its SW implementation consumes only 22 percent of the energy used by a commercial product and its HW implementation only 12 percent. The dual-objective feature selection enabled an energy saving of up to 93 percent. Comparing the most energy-efficient SW implementation (new extractor and DT classifier) with an equivalent HW implementation, the HW version consumes only 5.7 percent of the energy used by the SW version.

85 citations


Journal ArticleDOI
TL;DR: Off-the-Hook is a new approach for detecting phishing webpages in real-time as they are visited by a browser that relies on modeling inherent phisher limitations stemming from the constraints they face while building a webpage.
Abstract: Phishing is a major problem on the Web. Despite the significant attention it has received over the years, there has been no definitive solution. While the state-of-the-art solutions have reasonably good performance, they suffer from several drawbacks, including the potential to compromise user privacy, difficulty of detecting phishing websites whose content changes dynamically, and reliance on features that are too dependent on the training data. To address these limitations we present a new approach for detecting phishing webpages in real-time as they are visited by a browser. It relies on modeling inherent phisher limitations stemming from the constraints they face while building a webpage. Consequently, the implementation of our approach, Off-the-Hook, exhibits several notable properties including high accuracy, brand-independence and good language-independence, speed of decision, resilience to dynamic phish and resilience to evolution in phishing techniques. Off-the-Hook is implemented as a fully-client-side browser add-on, which preserves user privacy. In addition, Off-the-Hook identifies the target website that a phishing webpage is attempting to mimic and includes this target in its warning. We evaluated Off-the-Hook in two different user studies. Our results show that users prefer Off-the-Hook warnings to Firefox warnings.

76 citations


Journal ArticleDOI
TL;DR: The design of a seamless hybrid wired and wireless interconnection network for multichip systems with dimensions spanning up to tens of centimeters with on-chip wireless transceivers is proposed and it is demonstrated with cycle accurate simulations that such a design increases the bandwidth and reduces the energy consumption in comparison to state-of-the-art wireline I/O based multichIP communication.
Abstract: Computing modules in typical datacenter nodes or server racks consist of several multicore chips either on a board or in a System-in-Package (SiP) environment. State-of-the-art inter-chip communication over wireline channels requires data signals to travel from internal nets to the peripheral I/O ports and then get routed over the inter-chip channels to the I/O port of the destination chip. Following this, the data is finally routed from the I/O to the internal nets of the destination chip over a wireline interconnect fabric. This multihop communication increases energy consumption while decreasing data bandwidth in a multichip system. Also, traditional I/O does not scale well with technology generations due to limitations of pitch. Moreover, the intra-chip and inter-chip communication protocols within such a multichip system are often decoupled to facilitate design flexibility. However, a seamless interconnection between on-chip and off-chip data transfer can improve the communication efficiency significantly. Here, we propose the design of a seamless hybrid wired and wireless interconnection network for multichip systems with dimensions spanning up to tens of centimeters, using on-chip wireless transceivers. We demonstrate with cycle-accurate simulations that such a design increases the bandwidth and reduces the energy consumption in comparison to state-of-the-art wireline I/O based multichip communication.

75 citations


Journal ArticleDOI
TL;DR: A new power budget concept, called Thermal Safe Power (TSP), is an abstraction that provides safe power and power density constraints as a function of the number of simultaneously active cores, and results in dark silicon estimations that are less pessimistic than estimations using constant power budgets.
Abstract: Chip manufacturers provide the Thermal Design Power (TDP) for a specific chip. The cooling solution is designed to dissipate this power level. But because TDP is not necessarily the maximum power that can be applied, chips are operated with Dynamic Thermal Management (DTM) techniques. To avoid excessive triggers of DTM, system designers usually also use TDP as a power constraint. However, using a single and constant value as a power constraint, e.g., TDP, can result in significant performance losses in homogeneous and heterogeneous manycore systems. Having better power budgeting techniques is a major step towards dealing with the dark silicon problem. This paper presents a new power budget concept, called Thermal Safe Power (TSP), which is an abstraction that provides safe power and power density constraints as a function of the number of simultaneously active cores. Executing cores at any power consumption below TSP ensures that DTM is not triggered. TSP can be computed offline for the worst cases, or online for a particular mapping of cores. TSP can also serve as a fundamental tool for guiding task partitioning and core mapping decisions, especially when core heterogeneity or timing guarantees are involved. Moreover, TSP results in dark silicon estimations which are less pessimistic than estimations using constant power budgets.
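A minimal sketch of the TSP idea, under an assumed steady-state linear thermal model T = B·p + T_amb (the paper derives its bounds from an RC thermal model): for a given set of active cores, find the uniform per-core power at which the hottest core just reaches the critical temperature. All matrix values and temperatures below are invented.

```python
import numpy as np

def tsp_uniform(B, active, t_crit=80.0, t_amb=45.0):
    """Per-core power budget (uniform across active cores) such that the
    hottest core just reaches t_crit under the model T = B @ p + t_amb.
    B[i, j]: temperature rise at core i per watt dissipated at core j."""
    active = np.asarray(active, dtype=float)       # 1.0 = active, 0.0 = dark
    rise_per_watt = B @ active                     # heating of each core if every
    return (t_crit - t_amb) / rise_per_watt.max()  # active core ran at 1 W

# Toy 4-core chip: strong self-heating, weaker heating of neighbours.
B = np.full((4, 4), 0.4) + np.eye(4) * 4.0
for k in range(1, 5):
    budget = tsp_uniform(B, [1] * k + [0] * (4 - k))
    print(f"{k} active core(s): TSP = {budget:.2f} W/core")
```

As in the paper's concept, the per-core budget shrinks as more cores are simultaneously active, instead of being a single constant like TDP.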

Journal ArticleDOI
TL;DR: CABA is described, a novel continuous authentication system that is inspired by and leverages the emergence of sensors for pervasive and continuous health monitoring that authenticates users based on their BioAura, an ensemble of biomedical signal streams that can be collected continuously and non-invasively using wearable medical devices.
Abstract: Most computer systems authenticate users only once at the time of initial login, which can lead to security concerns. Continuous authentication has been explored as an approach for alleviating such concerns. Previous methods for continuous authentication primarily use biometrics, e.g., fingerprint and face recognition, or behaviometrics, e.g., keystroke patterns. We describe CABA, a novel continuous authentication system that is inspired by and leverages the emergence of sensors for pervasive and continuous health monitoring. CABA authenticates users based on their BioAura, an ensemble of biomedical signal streams that can be collected continuously and non-invasively using wearable medical devices. While each such signal may not be highly discriminative by itself, we demonstrate that a collection of such signals, along with robust machine learning, can provide high accuracy levels. We demonstrate the feasibility of CABA through analysis of traces from the MIMIC-II dataset. We propose various applications of CABA, and describe how it can be extended to user identification and adaptive access control authorization. Finally, we discuss possible attacks on the proposed scheme and suggest corresponding countermeasures.

Journal ArticleDOI
TL;DR: This paper proposes four novel strategies for partitioning the DRAM in a system into a number of quality bins based on the frequency, location, and nature of bit errors in each of the physical pages, while also taking into account the property of variable retention time exhibited by DRAM cells.
Abstract: Approximate computing is an emerging design paradigm that leverages the inherent error tolerance present in many applications to improve their power consumption and performance. Due to the forgiving nature of these error-resilient applications, precise input data is not always necessary for them to produce outputs of acceptable quality. This makes the memory subsystem (i.e., the place where data is stored), a suitable component for introducing approximations in return for substantial energy savings. Towards this end, this paper proposes a systematic methodology for constructing a quality configurable approximate DRAM system. Our design is based upon an extensive experimental characterization of memory errors as a function of the DRAM refresh-rate. Leveraging the insights gathered from this characterization, we propose four novel strategies for partitioning the DRAM in a system into a number of quality bins based on the frequency, location, and nature of bit errors in each of the physical pages, while also taking into account the property of variable retention time exhibited by DRAM cells. During data allocation, critical data is placed in the highest quality bin (that contains only accurate pages) and approximate data is allocated to bins sorted in descending order of quality, with the refresh rate serving as the quality control knob. We validate our proposed scheme on several error-resilient applications implemented using an Altera Stratix IV GX FPGA based Terasic TR4-230 development board containing a 1GB DDR3 DRAM module. Experimental results demonstrate a significant improvement in the energy-quality trade-off compared to previous work and show a reduction in DRAM refresh power of up to 73 percent on average with minimal loss in output quality.

Journal ArticleDOI
TL;DR: This work studies the computation of an ECDSA signature verification operation on a twisted Edwards curve with an efficiently computable endomorphism, which allows reducing the number of point doublings by approximately 50 percent compared to a conventional implementation.
Abstract: Verification of an ECDSA signature requires a double scalar multiplication on an elliptic curve. In this work, we study the computation of this operation on a twisted Edwards curve with an efficiently computable endomorphism, which allows reducing the number of point doublings by approximately 50 percent compared to a conventional implementation. In particular, we focus on a curve defined over the 207-bit prime field F_p with p = 2^207 - 5131. We develop several optimizations to the operation and we describe two hardware architectures for computing the operation. The first architecture is a small processor implemented in 0.13 µm CMOS ASIC and is useful in resource-constrained devices for Internet of Things (IoT) applications. The second architecture is designed for fast signature verifications by using FPGA acceleration and can be used in the server side of these applications. Our designs offer various trade-offs and optimizations between performance and resource requirements and they are valuable for IoT applications.
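The shared-doubling trick behind fast double scalar multiplication can be sketched generically. The code below implements the classic Straus-Shamir method over an abstract group, here a toy additive group of integers modulo the paper's prime (standing in for curve points so the sketch is self-contained); the paper's endomorphism additionally splits each scalar in half, cutting the doublings roughly in half again.

```python
def double_scalar_mult(k1, P, k2, Q, dbl, add, zero):
    """Straus-Shamir trick: compute k1*P + k2*Q with one shared doubling per
    scalar bit instead of two independent scalar multiplications."""
    pre = {(0, 1): Q, (1, 0): P, (1, 1): add(P, Q)}  # precompute P + Q once
    R = zero
    for i in range(max(k1.bit_length(), k2.bit_length()) - 1, -1, -1):
        R = dbl(R)                                   # doubling shared by both scalars
        b = ((k1 >> i) & 1, (k2 >> i) & 1)
        if b != (0, 0):
            R = add(R, pre[b])
    return R

# Toy group: integers modulo the paper's prime under addition (not real curve points).
n = 2**207 - 5131
dbl = lambda x: (2 * x) % n
add = lambda x, y: (x + y) % n
P, Q, k1, k2 = 11, 29, 123456789, 987654321
assert double_scalar_mult(k1, P, k2, Q, dbl, add, 0) == (k1 * P + k2 * Q) % n
```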

Journal ArticleDOI
TL;DR: This paper presents a resource management technique that introduces power density as a novel system-level constraint, and provides runtime adaptation of the power density constraint according to the characteristics of the executed applications, reacting to workload changes at runtime.
Abstract: Increasing power densities have led to the dark silicon era, for which heterogeneous multicores with different power and performance characteristics are promising architectures. This paper focuses on maximizing the overall system performance under a critical temperature constraint for heterogeneous tiled multicores, where all cores or accelerators inside a tile share the same voltage and frequency levels. For such architectures, we present a resource management technique that introduces power density as a novel system level constraint, in order to avoid thermal violations. The proposed technique then assigns applications to tiles by choosing their degree of parallelism and the voltage/frequency levels of each tile, such that the power density constraint is satisfied. Moreover, our technique provides runtime adaptation of the power density constraint according to the characteristics of the executed applications, and reacting to workload changes at runtime. Thus, the available thermal headroom is exploited to maximize the overall system performance.

Journal ArticleDOI
TL;DR: This paper proposes a task model that integrates control flow information by considering conditional parallel tasks (cp-tasks) represented by DAGs with both precedence and conditional edges; a set of meaningful parameters is identified and computed by efficient algorithms, and a response-time analysis is presented for different scheduling policies.
Abstract: Several task models have been introduced in the literature to describe the intrinsic parallelism of real-time activities, including fork/join, synchronous parallel, DAG-based, etc. Although schedulability tests and resource augmentation bounds have been derived for these task models in the context of multicore systems, they are still too pessimistic to describe the execution flow of parallel tasks characterized by multiple (and nested) conditional statements, where it is hard to decide which execution path to select for modeling the worst-case scenario. To overcome this problem, this paper proposes a task model that integrates control flow information by considering conditional parallel tasks (cp-tasks) represented by DAGs with both precedence and conditional edges. For this task model, a set of meaningful parameters are identified and computed by efficient algorithms and a response-time analysis is presented for different scheduling policies. Experimental results are finally reported to evaluate the efficiency of the proposed schedulability tests and their performance with respect to classic tests based on both conditional and non-conditional existing approaches.

Journal ArticleDOI
TL;DR: The main idea behind AE is based on the observation that the extreme value in an asymmetric local range is not likely to be replaced by a new extreme value in dealing with the boundaries-shifting problem, and has higher chunking throughput, smaller chunk size variance than the existing CDC algorithms, and is able to find proper chunk boundaries in low-entropy strings.
Abstract: Chunk-level deduplication plays an important role in backup storage systems. Existing Content-Defined Chunking (CDC) algorithms, while robust in finding suitable chunk boundaries, face the key challenges of (1) low chunking throughput that renders the chunking stage a serious deduplication performance bottleneck, (2) large chunk size variance that decreases deduplication efficiency, and (3) being unable to find proper chunk boundaries in low-entropy strings and thus failing to deduplicate these strings. To address these challenges, this paper proposes a new CDC algorithm called the Asymmetric Extremum (AE) algorithm. The main idea behind AE is based on the observation that the extreme value in an asymmetric local range is not likely to be replaced by a new extreme value in dealing with the boundaries-shifting problem. As a result, AE has higher chunking throughput, smaller chunk size variance than the existing CDC algorithms, and is able to find proper chunk boundaries in low-entropy strings. The experimental results based on real-world datasets show that AE improves the throughput performance of the state-of-the-art CDC algorithms by more than 2.3×, which is fast enough to remove the chunking-throughput performance bottleneck of deduplication, and accelerates the system throughput by more than 50 percent, while achieving comparable deduplication efficiency.
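A minimal Python rendering of the AE idea (the published algorithm operates on multi-byte values and handles ties and minimum chunk sizes more carefully): a cut point is declared as soon as the running maximum has survived a fixed-size window of bytes to its right, the "asymmetric" part of the design.

```python
import os

def ae_chunk_boundaries(data: bytes, window: int = 256):
    """Asymmetric Extremum (AE) chunking, minimal form: declare a cut point
    once the running byte-value maximum has not been exceeded for `window`
    consecutive bytes to its right (the fixed right-hand window)."""
    boundaries = []
    max_val, max_pos = -1, 0
    for i, b in enumerate(data):
        if b > max_val:
            max_val, max_pos = b, i       # a new extreme restarts the wait
        elif i - max_pos == window:       # extreme survived the right window
            boundaries.append(i + 1)      # cut just after position i
            max_val, max_pos = -1, i + 1  # start a fresh region
    return boundaries

data = os.urandom(1 << 20)                # 1 MiB of random input
cuts = ae_chunk_boundaries(data)
sizes = [b - a for a, b in zip([0] + cuts, cuts)]
print(len(cuts), "cuts, mean chunk size:", sum(sizes) // max(len(sizes), 1), "bytes")
```

Because each byte is examined once with a single comparison, this scan is inherently fast, which is the source of AE's throughput advantage over hash-based CDC.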

Journal ArticleDOI
TL;DR: A custom hardware accelerator, optimized for a class of reconfigurable logic, is proposed for the somewhat homomorphic encryption based schemes of Lopez-Alt, Tromer and Vaikuntanathan; it works as a co-processor that enables the operating system to offload the most compute-heavy operations to this specialized hardware.
Abstract: After the introduction of the first fully homomorphic encryption scheme in 2009, numerous research works have been published aiming at making fully homomorphic encryption practical for daily use. The first fully functional scheme and a few others introduced since have proven difficult to utilize in practical applications, due to efficiency reasons. Here, we propose a custom hardware accelerator, optimized for a class of reconfigurable logic, for Lopez-Alt, Tromer and Vaikuntanathan's somewhat homomorphic encryption based schemes. Our design works as a co-processor which enables the operating system to offload the most compute-heavy operations to this specialized hardware. The core of our design is an efficient hardware implementation of a polynomial multiplier, as this is the most compute-heavy operation of our target scheme. The presented architecture can compute the product of very large polynomials in under 6.25 ms, which is 102 times faster than its software implementation. For accelerating homomorphic applications, we estimate the per-block homomorphic AES as 442 ms, which is 28.5 and 17 times faster than the CPU and GPU implementations, respectively. For homomorphic evaluation of the Prince block cipher, we estimate the performance as 52 ms, which is 66 times faster than the CPU implementation.
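The accelerator's core operation is multiplication of very large polynomials in a ring such as Z_q[x]/(x^n + 1). The hardware uses an efficient (NTT-style) multiplier; as a reference for what is being computed, here is a hedged schoolbook sketch of the negacyclic product with toy parameters.

```python
def negacyclic_mul(a, b, q):
    """Schoolbook product of polynomials a, b in Z_q[x]/(x^n + 1):
    coefficients that wrap past degree n-1 come back negated."""
    n = len(a)
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % q
            else:                        # x^n = -1 in this ring
                c[k - n] = (c[k - n] - ai * bj) % q
    return c

# Example: x * x^(n-1) = x^n = -1 (mod x^n + 1)
n, q = 8, 97
a = [0, 1] + [0] * (n - 2)               # the polynomial x
b = [0] * (n - 1) + [1]                  # the polynomial x^(n-1)
print(negacyclic_mul(a, b, q))           # [96, 0, ..., 0], i.e., -1 mod 97
```

The schoolbook loop is O(n^2), which is why an NTT-based multiplier (O(n log n)) is the natural choice for the hardware core at the parameter sizes the scheme needs.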

Journal ArticleDOI
TL;DR: This paper proposes iKayak, a cross-platform resource scheduling middleware that aims to improve resource utilization and application performance in multi-tenant Spark-on-YARN clusters, and implements iKayak in YARN.
Abstract: While MapReduce is inherently designed for batch and high-throughput processing workloads, there is an increasing demand for non-batch processes on big data, e.g., interactive jobs, real-time queries, and stream computations. Emerging Apache Spark fills in this gap, as it can run on an established Hadoop cluster and take advantage of the existing HDFS. As a result, the Spark-on-YARN deployment model is widely applied by many industry leaders. However, we identify three key challenges in deploying Spark on YARN: inflexible reservation-based resource management, inter-task dependency-blind scheduling, and locality interference between Spark and MapReduce applications. These three challenges cause inefficient resource utilization and significant performance deterioration. We propose and develop a cross-platform resource scheduling middleware, iKayak, which aims to improve resource utilization and application performance in multi-tenant Spark-on-YARN clusters. iKayak relies on three key mechanisms: reservation-aware executor placement to avoid long waiting for resource reservation, dependency-aware resource adjustment to exploit under-utilized resources occupied by reduce tasks, and cross-platform locality-aware task assignment to coordinate locality competition between Spark and MapReduce applications. We implement iKayak in YARN. Experimental results on a testbed show that iKayak can achieve 50 percent performance improvement for Spark applications and 19 percent performance improvement for MapReduce applications, compared to two popular Spark-on-YARN deployment models, i.e., the YARN-client model and the YARN-cluster model.

Journal ArticleDOI
TL;DR: The proposed analysis is validated by applying it to several state-of-the-art approximate multipliers and comparing with corresponding simulation results, and results show that the proposed analysis serves as an effective tool for predicting, evaluating and comparing the accuracy of various multipliers.
Abstract: Approximate multipliers are gaining importance in energy-efficient computing and require careful error analysis. In this paper, we present the error probability analysis for recursive approximate multipliers with approximate partial products. Since these multipliers are constructed from smaller approximate multiplier building blocks, we propose to derive the error probability in an arbitrary bit-width multiplier from the probabilistic model of the basic building block and the probability distributions of inputs. The analysis is based on common features of recursive multipliers identified by carefully studying the behavioral model of state-of-the-art designs. By building further upon the analysis, the Probability Mass Function (PMF) of error is computed by individually considering all possible error cases and their inter-dependencies. We further discuss the generalizations for approximate adder trees, signed multipliers, squarers and constant multipliers. The proposed analysis is validated by applying it to several state-of-the-art approximate multipliers and comparing with corresponding simulation results. The results show that the proposed analysis serves as an effective tool for predicting, evaluating and comparing the accuracy of various multipliers. For the majority of the recursive multipliers, we get accurate error performance evaluation. We also predict the multipliers’ performance in an image processing application to demonstrate its practical significance.
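To ground the recursive-multiplier setting, the sketch below builds an n-bit multiplier from four (n/2)-bit sub-multipliers, bottoming out at the well-known approximate 2×2 block that computes 3×3 as 7, and estimates the overall error probability by Monte Carlo. The paper's contribution is to predict such statistics analytically from the building block's model instead of by simulation.

```python
import random

def mul2x2_approx(a, b):
    """Approximate 2x2 multiplier: correct except 3*3 -> 7 (saves one output bit)."""
    return 7 if (a, b) == (3, 3) else a * b

def recursive_mul(a, b, n):
    """n-bit multiplier built recursively from four (n/2)-bit sub-multipliers."""
    if n == 2:
        return mul2x2_approx(a, b)
    h = n // 2
    ah, al = a >> h, a & ((1 << h) - 1)
    bh, bl = b >> h, b & ((1 << h) - 1)
    return ((recursive_mul(ah, bh, h) << n)
            + ((recursive_mul(ah, bl, h) + recursive_mul(al, bh, h)) << h)
            + recursive_mul(al, bl, h))

n, trials, errors = 8, 100_000, 0
for _ in range(trials):
    a, b = random.randrange(1 << n), random.randrange(1 << n)
    errors += recursive_mul(a, b, n) != a * b
print(f"estimated P(error) for {n}-bit: {errors / trials:.3f}")
```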

Journal ArticleDOI
TL;DR: A streaming workflow allocation algorithm that takes into consideration the characteristics of streaming workflows and the price diversity among geo-distributed DCs, to further achieve the goal of cost minimization for streaming big data processing.

Abstract: The virtual machine (VM) allocation problem in cloud computing has been widely studied in recent years, and many algorithms have been proposed in the literature. Most of them have been successfully applied to batch processing models such as MapReduce; however, none of them can be applied well to streaming workflows because of the following weaknesses: 1) failure to capture the characteristics of tasks in streaming workflows, given the short life cycle of data streams; and 2) reliance on the assumptions that the prices of VMs and traffic among data centers (DCs) are static and fixed. In this paper, we propose a streaming workflow allocation algorithm that takes into consideration the characteristics of streaming workflows and the price diversity among geo-distributed DCs, to further achieve the goal of cost minimization for streaming big data processing. First, we construct an extended streaming workflow graph (ESWG) based on the task semantics of streaming workflows and the price diversity of geo-distributed DCs, and the streaming workflow allocation problem is formulated into mixed integer linear programming based on the ESWG. Second, we propose two heuristic algorithms to reduce the computational space based on task combination and DC combination in order to meet the strict latency requirement. Finally, our experimental results demonstrate significant performance gains with lower total cost and execution time.

Journal ArticleDOI
TL;DR: This paper designs and analyzes a novel Adaptive Restore Scheme for Write Disturbance (ARS-WD) and Read Disturbance (ARS-RD), which promotes the advantages of MLC to provide a preferable L2 design alternative in terms of the energy, area and latency product compared to SLC STT-RAM alternatives.
Abstract: For the sake of higher cell density while achieving near-zero standby power, recent research progress in Magnetic Tunneling Junction (MTJ) devices has leveraged Multi-Level Cell (MLC) configurations of Spin-Transfer Torque Random Access Memory (STT-RAM). However, in order to mitigate the write disturbance in an MLC strategy, data stored in the soft bit must be restored back immediately after the hard bit switching is completed. Furthermore, as the result of MTJ feature size scaling, the soft bit can be expected to become disturbed by the read sensing current, thus requiring an immediate restore operation to ensure the data reliability. In this paper, we design and analyze a novel Adaptive Restore Scheme for Write Disturbance (ARS-WD) and Read Disturbance (ARS-RD), respectively. ARS-WD alleviates restoration overhead by intentionally overwriting soft bit lines which are less likely to be read. ARS-RD, on the other hand, aggregates the potential writes and restores the soft bit line at the time of its eviction from the higher level cache. Both of these schemes are based on a lightweight forecasting approach for the future read behavior of the cache block. Our experimental results show a substantial reduction in soft bit line restore operations, delivering a 17.9 percent decrease in overall energy consumption and a 9.4 percent increase in IPC, while incurring negligible capacity overhead. Moreover, ARS promotes the advantages of MLC to provide a preferable L2 design alternative in terms of the energy, area and latency product compared to SLC STT-RAM alternatives.

Journal ArticleDOI
TL;DR: A comparison of hardware architectures for large integer multiplication is presented and it is shown that hardware designs of combination multipliers, at a cost of additional hardware resource usage, can offer lower latency compared to individual multiplier designs.
Abstract: Multipliers requiring large bit lengths have a major impact on the performance of many applications, such as cryptography, digital signal processing (DSP) and image processing. Novel, optimised designs of large integer multiplication are needed as previous approaches, such as schoolbook multiplication, may not be as feasible due to the large parameter sizes. Parameter bit lengths of up to millions of bits are required for use in cryptography, such as in lattice-based and fully homomorphic encryption (FHE) schemes. This paper presents a comparison of hardware architectures for large integer multiplication. Several multiplication methods and combinations thereof are analysed for suitability in hardware designs, targeting the FPGA platform. In particular, the first hardware architecture combining Karatsuba and Comba multiplication is proposed. Moreover, a hardware complexity analysis is conducted to give results independent of any particular FPGA platform. It is shown that hardware designs of combination multipliers, at a cost of additional hardware resource usage, can offer lower latency compared to individual multiplier designs. Indeed, the proposed novel combination hardware design of the Karatsuba-Comba multiplier offers lowest latency for integers greater than 512 bits. For large multiplicands, greater than 16,384 bits, the hardware complexity analysis indicates that the NTT-Karatsuba-Schoolbook combination is most suitable.
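Although the paper targets FPGA hardware, the Karatsuba-plus-base-multiplier combination is easy to convey in software: three half-size products replace the four of schoolbook decomposition, and below a cutoff the recursion falls back to a base multiplier (the schoolbook role). This Python sketch uses an arbitrary cutoff and Python's built-in multiply as the base case.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Karatsuba multiplication: three half-size products instead of four.
    Below `cutoff_bits`, fall back to the base multiplier (schoolbook role)."""
    n = max(x.bit_length(), y.bit_length())
    if n <= cutoff_bits:
        return x * y                       # base-case multiplier
    h = n // 2
    xh, xl = x >> h, x & ((1 << h) - 1)
    yh, yl = y >> h, y & ((1 << h) - 1)
    p_hi = karatsuba(xh, yh, cutoff_bits)
    p_lo = karatsuba(xl, yl, cutoff_bits)
    # (xh + xl)(yh + yl) - p_hi - p_lo gives the cross terms with one product
    p_mid = karatsuba(xh + xl, yh + yl, cutoff_bits) - p_hi - p_lo
    return (p_hi << (2 * h)) + (p_mid << h) + p_lo

import random
a, b = random.getrandbits(2048), random.getrandbits(2048)
assert karatsuba(a, b) == a * b
```

In hardware the same trade-off appears as extra adders and control logic (resource usage) in exchange for fewer and smaller partial multiplications (latency), which is the combination effect the paper quantifies.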

Journal ArticleDOI
TL;DR: Compared with the existing schemes, the CABE scheme drastically decreases the storage, communication and computation overheads, and thus is more efficient in dealing with the applications with comparable attributes.
Abstract: Attribute-based encryption (ABE) has opened up a popular research topic in cryptography over the past few years. It can be used in various circumstances, as it provides a flexible way to conduct fine-grained data access control. Despite its great advantages in data access control, current ABE-based access control systems cannot satisfy the requirement well when the system judges the access behavior according to attribute comparison, such as “greater than x” or “less than x”, which are called comparable attributes in this paper. In this paper, based on a set of well-designed sub-attributes representing each comparable attribute, we construct a comparable attribute-based encryption scheme (CABE for short) to address the aforementioned problem. The novelty lies in that we provide a more efficient construction based on the generation and management of the sub-attributes with the notion of 0-encoding and 1-encoding. Extensive analysis shows that, compared with the existing schemes, our scheme drastically decreases the storage, communication and computation overheads, and thus is more efficient in dealing with applications with comparable attributes.
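The 0-encoding/1-encoding idea that the sub-attributes build on (due to Lin and Tzeng) reduces integer comparison to set intersection: x > y exactly when the 1-encoding of x and the 0-encoding of y share a prefix. A small Python sketch of just that combinatorial trick (the CABE construction itself layers encryption on top of these sets):

```python
def bits(v, n):
    return format(v, f"0{n}b")

def one_encoding(v, n):
    """Prefixes of v ending at each bit position where the bit is 1."""
    s = bits(v, n)
    return {s[:i + 1] for i in range(n) if s[i] == "1"}

def zero_encoding(v, n):
    """Prefixes of v with each 0-bit flipped to 1 (and the rest truncated)."""
    s = bits(v, n)
    return {s[:i] + "1" for i in range(n) if s[i] == "0"}

def greater_than(x, y, n):
    # x > y  iff  the two prefix sets intersect (Lin-Tzeng encoding)
    return bool(one_encoding(x, n) & zero_encoding(y, n))

# Exhaustive check over all 4-bit pairs
for x in range(16):
    for y in range(16):
        assert greater_than(x, y, 4) == (x > y)
print("0/1-encoding comparison verified for all 4-bit pairs")
```

Each element of these prefix sets corresponds to one sub-attribute, so a comparable attribute over n-bit values needs only O(n) sub-attributes rather than one per possible value.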

Journal ArticleDOI
TL;DR: This article demonstrates a new shield structure that is cryptographically secure, based on a lightweight block cipher and independent mesh lines, to ensure the security against probing attacks of the hardware located behind the shield.
Abstract: Probing attacks are serious threats on integrated circuits. Security products often include a protective layer called a shield that acts like a digital fence. In this article, we demonstrate a new shield structure that is cryptographically secure. This shield is based on a lightweight block cipher and independent mesh lines to ensure the security against probing attacks of the hardware located behind the shield. Such a structure can be proven secure against state-of-the-art invasive attacks. Then, we evaluate the impact of the active shield on the performance of security IPs such as PUF, TRNG, secure clock and AES using a set of fabricated ASICs in the 65 nm CMOS technology of STMicroelectronics. Also, the impact of the active shield on Side-Channel Attack (SCA) is evaluated.

Journal ArticleDOI
TL;DR: The proposed HyWin topology for CPU/GPU HSA improves application performance by 29 percent and reduces latency by 50 percent, while reducing energy consumption by 64.5 percent and area by 17.39 percent as compared to a baseline mesh.
Abstract: Heterogeneous System Architectures (HSA) that integrate cores of different architectures (CPU, GPU, etc.) on a single chip are gaining significance for many classes of applications to achieve high performance. Networks-on-Chip (NoCs) in HSA are monopolized by high-volume GPU traffic, penalizing CPU application performance. In addition, building efficient interfaces between systems of different specifications while achieving optimal performance is a demanding task. Homogeneous NoCs, widely used for manycore systems, fall short in meeting these communication requirements. To achieve high-performance interconnection in HSA, we propose the HyWin topology using mm-wave wireless links. The proposed topology implements sandboxed heterogeneous sub-networks, each designed to match the needs of a processing subsystem, which are then interconnected at a second level using a wireless network. The sandboxed sub-networks avoid conflicts of network requirements, while providing optimal performance for their respective subsystems. The long-range wireless links provide a low-latency and low-energy inter-subsystem network, giving easy access to memory controllers and lower level caches across the entire system. By implementing the proposed topology for a CPU/GPU HSA, we show that it improves application performance by 29 percent and reduces latency by 50 percent, while reducing energy consumption by 64.5 percent and area by 17.39 percent as compared to a baseline mesh.

Journal ArticleDOI
TL;DR: The data for this study were collected from export-import declaration documents produced by the Customs Office (Kantor Bea Cukai), in accordance with regulations.
Abstract: The export and import of goods encompass commodity coverage, trading systems, valuation, quantity measurement, and partner countries. Export and import activities involve two countries: the destination country and the country of origin. The destination country is the country known at the time of shipment as the final country to which the goods will be delivered, while the country of origin is the country where the goods were produced, as verified by the Customs Office in accordance with regulations. This study discusses the application of data mining to fruit exports by destination country using the K-Means clustering method. The data were collected from export-import declaration documents produced by the Direktorat Jenderal Bea dan Cukai (Directorate General of Customs and Excise). In addition, since 2015 export data have also come from PT. Pos Indonesia, records of other agencies at the border, and surveys of cross-border sea trade. The data used in this study are fruit exports by main destination country from 2002-2015, covering 11 destinations: Hong Kong, China, Singapore, Malaysia, Nepal, Vietnam, India, Pakistan, Bangladesh, Iran, and other countries. The variables used are (1) total export net weight (netto) and (2) Free On Board (FOB) value. The data are processed by clustering fruit exports by main destination country into 3 clusters: high, medium, and low export levels. The clustering method used in this study is K-Means. The centroid for the high-export cluster is 904,276.5; for the medium-export cluster, 265,501; and for the low-export cluster, 34,280.1. This yields an assessment based on the fruit export index with 2 destinations in the high-export cluster (India and Pakistan), 3 in the medium-export cluster (Singapore, Bangladesh, and other countries), and 6 in the low-export cluster (Hong Kong, China, Malaysia, Nepal, Vietnam, and Iran). The results of this study can be used to determine the volume of fruit exports by destination country.
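A hedged sketch of the described clustering step using scikit-learn: two features (net weight and FOB value) per destination, K-Means with 3 clusters ranked into high/medium/low export levels. The feature values below are placeholders, not the paper's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder rows (net export weight, FOB value) per destination, standing in
# for the paper's 2002-2015 fruit-export figures.
countries = ["Hong Kong", "China", "Singapore", "Malaysia", "Nepal",
             "Vietnam", "India", "Pakistan", "Bangladesh", "Iran", "Other"]
X = np.array([[30e3, 20e3], [40e3, 25e3], [260e3, 150e3], [35e3, 22e3],
              [28e3, 18e3], [35e3, 30e3], [900e3, 500e3], [910e3, 480e3],
              [270e3, 140e3], [35e3, 25e3], [265e3, 160e3]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Rank clusters by centroid magnitude so labels read high/medium/low.
order = np.argsort(-km.cluster_centers_.sum(axis=1))
names = {c: lvl for c, lvl in zip(order, ["high", "medium", "low"])}
for country, label in zip(countries, km.labels_):
    print(f"{country}: {names[label]} export cluster")
```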

Journal ArticleDOI
TL;DR: This work conducts in-depth performance evaluations on HA-SMR drives with a special emphasis on the performance implications of the SMR-specific APIs and how these drives can be deployed in large storage systems, and proposes a novel host-controlled buffer that can help to reduce the severity of the decline in HA-SMR performance under the authors' discovered unfavorable I/O access patterns.
Abstract: Shingled Magnetic Recording (SMR) drives can benefit large-scale storage systems by reducing the Total Cost of Ownership (TCO) of dealing with explosive data growth. Among all existing SMR models, Host Aware SMR (HA-SMR) looks the most promising for its backward compatibility with legacy I/O stacks and its ability to use new SMR-specific APIs to support host I/O stack optimization. Building storage systems using HA-SMR drives calls for a deep understanding of the drive’s performance characteristics. To accomplish this, we conduct in-depth performance evaluations on HA-SMR drives with a special emphasis on the performance implications of the SMR-specific APIs and how these drives can be deployed in large storage systems. We discover both favorable and adverse effects of using HA-SMR drives under various workloads. We also investigate the drive’s performance under legacy production environments using real-world enterprise traces. Finally, we propose a novel host-controlled buffer that can help to reduce the severity of the decline in HA-SMR performance under our discovered unfavorable I/O access patterns. Without a detailed comprehensive design, we show the potential of the host-controlled buffer by a case study.

Journal ArticleDOI
TL;DR: An extended analysis model is proposed to estimate delay upper-bounds for all router architectures and buffer sizes by identifying and analyzing the differences between upstream and downstream indirect interferences according to the relative positions of traffic flows and taking the buffer influence into consideration.
Abstract: The delay upper-bound analysis problem is of fundamental importance to real-time applications in Network-on-Chips (NoCs). In the paper, we revisit two state-of-the-art analysis models for real-time communication in wormhole NoCs with priority-based preemptive arbitration and show that the models only support specific router architectures with large buffer sizes. We then propose an extended analysis model to estimate delay upper-bounds for all router architectures and buffer sizes by identifying and analyzing the differences between upstream and downstream indirect interferences according to the relative positions of traffic flows and taking the buffer influence into consideration. Simulated evaluations show that our model supports one more router architecture and applies to small buffer sizes compared to the previous models.

Journal ArticleDOI
TL;DR: This paper proposes CPA and CCA secure KAC constructions that are efficiently implementable using elliptic curves and are suitable for implementation on cloud-based data sharing environments, with special focus on how the standalone KAC scheme can be efficiently combined with broadcast encryption.
Abstract: Online data sharing for increased productivity and efficiency is one of the primary requirements today for any organization. The advent of cloud computing has pushed the limits of sharing across geographical boundaries, and has enabled a multitude of users to contribute and collaborate on shared data. However, protecting online data is critical to the success of the cloud, which leads to the requirement of efficient and secure cryptographic schemes for the same. Data owners would ideally want to store their data/files online in an encrypted manner, and delegate decryption rights for some of these to users, while retaining the power to revoke access at any point of time. An efficient solution in this regard would be one that allows users to decrypt multiple classes of data using a single key of constant size that can be efficiently broadcast to multiple users. Chu et al. proposed a key aggregate cryptosystem (KAC) in 2014 to address this problem, albeit without formal proofs of security. In this paper, we propose CPA and CCA secure KAC constructions that are efficiently implementable using elliptic curves and are suitable for implementation on cloud-based data sharing environments. We lay special focus on how the standalone KAC scheme can be efficiently combined with broadcast encryption to cater to m data users and m′ data owners while reducing the secure channel requirement from O(mm′) in the standalone case to O(m + m′).