
Showing papers on "Overhead (computing) published in 2021"


Proceedings ArticleDOI
20 Jun 2021
TL;DR: BoTNet as mentioned in this paper incorporates self-attention into a ResNet backbone for image classification, object detection, and instance segmentation, and achieves strong performance on the COCO and ImageNet benchmarks.
Abstract: We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework; surpassing the previous best published single model and single scale results of ResNeSt [67] evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 1.64x faster in "compute" time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
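
The core change described above, replacing a spatial convolution with global self-attention over all H x W positions of a feature map, can be illustrated with a minimal numpy sketch. This is a single-head toy version for intuition only; the actual BoTNet block is multi-head and adds relative position encodings, both omitted here, and all shapes and weights below are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_self_attention(feat, Wq, Wk, Wv):
    """Single-head global self-attention over all H*W positions of a feature map.

    feat: (H, W, C) feature map; Wq/Wk/Wv: (C, C) projection matrices.
    Every position attends to every other position, unlike a local convolution.
    """
    H, W, C = feat.shape
    x = feat.reshape(H * W, C)                # flatten spatial positions into tokens
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # linear projections
    attn = softmax(q @ k.T / np.sqrt(C))      # (H*W, H*W) attention weights
    return (attn @ v).reshape(H, W, C)

# toy usage on an 8x8 feature map with 16 channels
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))
Wq, Wk, Wv = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out = global_self_attention(feat, Wq, Wk, Wv)
print(out.shape)  # (8, 8, 16)
```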

675 citations


Journal ArticleDOI
TL;DR: In this article, a novel LIS architecture based on sparse channel sensors is proposed, where all the LIS elements are passive except for a few elements that are connected to the baseband.
Abstract: Employing large intelligent surfaces (LISs) is a promising solution for improving the coverage and rate of future wireless systems. These surfaces comprise massive numbers of nearly-passive elements that interact with the incident signals, for example by reflecting them, in a smart way that improves the wireless system performance. Prior work focused on the design of the LIS reflection matrices assuming full channel knowledge. Estimating these channels at the LIS, however, is a key challenging problem. With the massive number of LIS elements, channel estimation or reflection beam training will be associated with (i) huge training overhead if all the LIS elements are passive (not connected to a baseband) or with (ii) prohibitive hardware complexity and power consumption if all the elements are connected to the baseband through a fully-digital or hybrid analog/digital architecture. This paper proposes efficient solutions for these problems by leveraging tools from compressive sensing and deep learning. First, a novel LIS architecture based on sparse channel sensors is proposed. In this architecture, all the LIS elements are passive except for a few elements that are active (connected to the baseband). We then develop two solutions that design the LIS reflection matrices with negligible training overhead. In the first approach, we leverage compressive sensing tools to construct the channels at all the LIS elements from the channels seen only at the active elements. In the second approach, we develop a deep-learning based solution where the LIS learns how to interact with the incident signal given the channels at the active elements, which represent the state of the environment and transmitter/receiver locations. We show that the achievable rates of the proposed solutions approach the upper bound, which assumes perfect channel knowledge, with negligible training overhead and with only a few active elements, making them promising for future LIS systems.

405 citations


Journal ArticleDOI
TL;DR: A two-timescale channel estimation framework to exploit the property that the BS-RIS channel is high-dimensional but quasi-static, while the RIS-UE channel is mobile but low-dimensional is proposed.
Abstract: Channel estimation is challenging for the reconfigurable intelligent surface (RIS)-aided wireless communications. Since the number of coefficients of the cascaded channel among the base station (BS), the RIS, and the user equipment (UE), is the product of the number of BS antennas, the number of RIS elements, and the number of UEs, the pilot overhead can be prohibitively high. In this paper, we propose a two-timescale channel estimation framework to exploit the property that the BS-RIS channel is high-dimensional but quasi-static, while the RIS-UE channel is mobile but low-dimensional. Specifically, to estimate the quasi-static BS-RIS channel, we propose a dual-link pilot transmission scheme, where the BS transmits downlink pilots and receives uplink pilots reflected by the RIS. Then, we propose a coordinate descent-based algorithm to recover the BS-RIS channel. Since the quasi-static BS-RIS channel is estimated less frequently than the mobile channel is, the average pilot overhead can be reduced from a long-term perspective. Although the mobile RIS-UE channel has to be frequently estimated in a small timescale, the associated pilot overhead is low thanks to its low dimension. Simulation results show that the proposed two-timescale channel estimation framework can achieve accurate channel estimation with low pilot overhead.

236 citations


Book ChapterDOI
27 Sep 2021
TL;DR: In this paper, a self-attention mechanism along with relative position encoding was proposed to reduce the complexity of the self-attention operation significantly from O(n^2) to approximately O(n).
Abstract: The Transformer architecture has emerged to be successful in a number of natural language processing tasks. However, its applications to medical vision remain largely unexplored. In this study, we present UTNet, a simple yet powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation. UTNet applies self-attention modules in both the encoder and decoder for capturing long-range dependency at different scales with minimal overhead. To this end, we propose an efficient self-attention mechanism along with relative position encoding that reduces the complexity of the self-attention operation significantly from \(O(n^2)\) to approximately \(O(n)\). A new self-attention decoder is also proposed to recover fine-grained details from the skip connections in the encoder. Our approach addresses the dilemma that the Transformer requires huge amounts of data to learn vision inductive bias. Our hybrid layer design allows the initialization of the Transformer within convolutional networks without a need for pre-training. We have evaluated UTNet on a multi-label, multi-vendor cardiac magnetic resonance imaging cohort. UTNet demonstrates superior segmentation performance and robustness compared with state-of-the-art approaches, holding the promise to generalize well to other medical image segmentation tasks.
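
The quadratic-to-linear reduction mentioned above can be sketched generically: if keys and values are spatially sub-sampled to m tokens with m much smaller than n, the attention map shrinks from n x n to n x m. The numpy sketch below shows this general idea under that assumption; the pooling factor, shapes, and the omission of relative position encoding are illustrative simplifications, not UTNet's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(x, Wq, Wk, Wv, pool=8):
    """Attention with sub-sampled keys/values: cost O(n*m) with m = n // pool.

    x: (n, c) flattened feature tokens; Wq/Wk/Wv: (c, c) projections.
    """
    n, c = x.shape
    q = x @ Wq                                                       # (n, c) full-resolution queries
    kv = x[: (n // pool) * pool].reshape(-1, pool, c).mean(axis=1)   # (m, c) pooled tokens
    k, v = kv @ Wk, kv @ Wv
    attn = softmax(q @ k.T / np.sqrt(c))                             # (n, m) instead of (n, n)
    return attn @ v                                                  # (n, c)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1024, 32))                  # e.g. a flattened 32x32 feature map
Wq, Wk, Wv = (rng.standard_normal((32, 32)) * 0.1 for _ in range(3))
out = efficient_attention(tokens, Wq, Wk, Wv)             # attention map is 1024 x 128
print(out.shape)
```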

214 citations


Journal ArticleDOI
26 Jan 2021
TL;DR: This article proposes the first secure aggregation framework, named Turbo-Aggregate, which employs a multi-group circular strategy for efficient model aggregation, and leverages additive secret sharing and novel coding techniques for injecting aggregation redundancy in order to handle user dropouts while guaranteeing user privacy.
Abstract: Federated learning is a distributed framework for training machine learning models over the data residing at mobile devices, while protecting the privacy of individual users. A major bottleneck in scaling federated learning to a large number of users is the overhead of secure model aggregation across many users. In particular, the overhead of the state-of-the-art protocols for secure model aggregation grows quadratically with the number of users. In this article, we propose the first secure aggregation framework, named Turbo-Aggregate, that in a network with $N$ users achieves a secure aggregation overhead of $O(N\log N)$, as opposed to $O(N^2)$, while tolerating up to a user dropout rate of 50%. Turbo-Aggregate employs a multi-group circular strategy for efficient model aggregation, and leverages additive secret sharing and novel coding techniques for injecting aggregation redundancy in order to handle user dropouts while guaranteeing user privacy. We experimentally demonstrate that Turbo-Aggregate achieves a total running time that grows almost linearly in the number of users, and provides up to a $40\times$ speedup over the state-of-the-art protocols with up to $N=200$ users. Our experiments also demonstrate the impact of model size and bandwidth on the performance of Turbo-Aggregate.
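
Additive secret sharing, one of the ingredients named above, can be shown in a few lines: each user splits its update into random shares that individually reveal nothing, yet the shares of all users sum to the true aggregate. This is a minimal sketch of that primitive alone, not of Turbo-Aggregate's multi-group circular protocol or its dropout handling.

```python
import numpy as np

rng = np.random.default_rng(1)

def additive_shares(update, n_shares):
    """Split `update` into n random shares whose sum equals `update`."""
    shares = [rng.standard_normal(update.shape) for _ in range(n_shares - 1)]
    shares.append(update - sum(shares))
    return shares

# three users, each holding a 4-dimensional model update
updates = [rng.standard_normal(4) for _ in range(3)]
# each user splits its update among all users; only share sums are ever revealed
all_shares = [additive_shares(u, 3) for u in updates]
aggregate = sum(sum(s[i] for s in all_shares) for i in range(3))
assert np.allclose(aggregate, sum(updates))  # the aggregate is recovered exactly
```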

170 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: In this paper, a cross-stage connection path is proposed to transfer knowledge from the teacher network to the student one, with the goal of greatly improving the performance of the student network.
Abstract: Knowledge distillation transfers knowledge from the teacher network to the student one, with the goal of greatly improving the performance of the student network. Previous methods mostly focus on proposing feature transformations and loss functions between features of the same level to improve the effectiveness. We instead study the factor of connection paths across levels between teacher and student networks, and reveal their great importance. For the first time in knowledge distillation, cross-stage connection paths are proposed. Our new review mechanism is effective and structurally simple. The resulting nested and compact framework requires negligible computation overhead, and outperforms other methods on a variety of tasks. We apply our method to classification, object detection, and instance segmentation tasks. All of them witness significant student network performance improvements.

165 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed an attention mechanism-based convolutional neural network-long short-term memory (AMCNN-LSTM) model to accurately detect anomalies.
Abstract: Since edge device failures (i.e., anomalies) seriously affect the production of industrial products in the Industrial IoT (IIoT), detecting anomalies accurately and in a timely manner is becoming increasingly important. Furthermore, the data collected by edge devices contain massive amounts of users’ private data, which challenges current detection approaches, as user privacy has attracted more and more public concern. With this focus, this article proposes a new communication-efficient on-device federated learning (FL)-based deep anomaly detection framework for sensing time-series data in IIoT. Specifically, we first introduce an FL framework to enable decentralized edge devices to collaboratively train an anomaly detection model, which can improve its generalization ability. Second, we propose an attention mechanism-based convolutional neural network-long short-term memory (AMCNN-LSTM) model to accurately detect anomalies. The AMCNN-LSTM model uses attention mechanism-based convolutional neural network units to capture important fine-grained features, thereby preventing memory loss and gradient dispersion problems. Furthermore, this model retains the advantages of the long short-term memory unit in predicting time-series data. Third, to adapt the proposed framework to the timeliness of industrial anomaly detection, we propose a gradient compression mechanism based on Top-$k$ selection to improve communication efficiency. Extensive experimental studies on four real-world data sets demonstrate that our framework detects anomalies accurately and in a timely manner, and also reduces the communication overhead by 50% compared to the FL framework that does not use the gradient compression scheme.
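
The Top-k gradient compression step works by transmitting only the k largest-magnitude gradient entries together with their indices, which is where the communication saving comes from. A minimal numpy sketch of that mechanism follows (compression and server-side reconstruction only, without the error-feedback or framework details of the paper).

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries; transmit only (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def topk_decompress(idx, vals, size):
    """Rebuild a sparse gradient of the original size on the receiving side."""
    g = np.zeros(size)
    g[idx] = vals
    return g

grad = np.random.default_rng(0).standard_normal(1000)
idx, vals = topk_compress(grad, k=100)          # ~90% fewer values transmitted
restored = topk_decompress(idx, vals, grad.size)
print(np.count_nonzero(restored))               # 100 retained coordinates
```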

159 citations


Journal ArticleDOI
TL;DR: In this paper, a hybrid metaheuristic algorithm named genetic simulated annealing-based particle swarm optimization (GSP) was proposed to minimize the total energy consumed by mobile devices and edge servers by jointly optimizing the offloading ratio of tasks, CPU speeds of mobile devices, allocated bandwidth of available channels, and transmission power of each mobile device in each time slot.
Abstract: Smart mobile devices (SMDs) can meet users’ high expectations by executing computationally intensive applications but they only have limited resources, including CPU, memory, battery power, and wireless medium. To tackle this limitation, partial computation offloading can be used as a promising method to schedule some tasks of applications from resource-limited SMDs to high-performance edge servers. However, it brings communication overhead issues caused by limited bandwidth and inevitably increases the latency of tasks offloaded to edge servers. Therefore, it is highly challenging to achieve a balance between high-resource consumption in SMDs and high communication cost for providing energy-efficient and low-latency services to users. This work proposes a partial computation offloading method to minimize the total energy consumed by SMDs and edge servers by jointly optimizing the offloading ratio of tasks, CPU speeds of SMDs, allocated bandwidth of available channels, and transmission power of each SMD in each time slot. It jointly considers the execution time of tasks performed in SMDs and edge servers, and the transmission time of data. It also jointly considers latency limits, CPU speeds, transmission power limits, available energy of SMDs, and the maximum number of CPU cycles and memories in edge servers. Considering these factors, a nonlinear constrained optimization problem is formulated and solved by a novel hybrid metaheuristic algorithm named genetic simulated annealing-based particle swarm optimization (GSP) to produce a close-to-optimal solution. GSP achieves joint optimization of computation offloading between a cloud data center and the edge, and resource allocation in the data center. Real-life data-based experimental results prove that it achieves lower energy consumption in less convergence time than its three typical peers.

138 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a novel impulse-like timing metric based on length-alterable differential cross-correlation (LDCC), which is immune to carrier frequency offset (CFO) and capable of mitigating the impact of noise on timing estimation.
Abstract: Satellite communication systems are expected to play a vital role in realizing various remote Internet-of-Things (IoT) applications in the sixth-generation vision. Due to the unique characteristics of the satellite environment, one of the main challenges in this system is to accommodate massive random access (RA) requests of IoT devices while minimizing their energy consumption. In this article, we focus on the reliable design and detection of the RA preamble to effectively enhance the access efficiency in high-dynamic low-earth-orbit (LEO) scenarios. To avoid additional signaling overhead and detection process, a long preamble sequence is constructed by concatenating the conjugated and circularly shifted replicas of a single root Zadoff–Chu (ZC) sequence in the RA procedure. Moreover, we propose a novel impulse-like timing metric based on length-alterable differential cross-correlation (LDCC), which is immune to carrier frequency offset (CFO) and capable of mitigating the impact of noise on timing estimation. Statistical analysis of the proposed metric reveals that increasing correlation length can significantly improve the output signal-to-noise power ratio, and the first-path detection threshold is independent of noise statistics. Simulation results in different LEO scenarios validate the robustness of the proposed method to severe channel distortion, and show that our method can achieve significant performance enhancement in terms of timing estimation accuracy, success probability of first access, and mean normalized access energy, compared with the existing RA methods.
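
The two building blocks named above, a root Zadoff-Chu sequence and a differential cross-correlation timing metric, can be sketched as follows. Because a constant CFO contributes only a fixed phase to each differential product r[n] * conj(r[n+lag]), the magnitude of the correlation peak is insensitive to CFO. The sketch shows that generic idea, not the paper's length-alterable construction or its concatenated preamble, and all parameters are illustrative.

```python
import numpy as np

def zadoff_chu(root, length):
    """Root Zadoff-Chu sequence of odd length (constant amplitude, ideal autocorrelation)."""
    n = np.arange(length)
    return np.exp(-1j * np.pi * root * n * (n + 1) / length)

def differential_metric(rx, ref, lag=1):
    """Correlate differential products; a constant CFO only adds a fixed phase,
    so the metric magnitude is insensitive to carrier frequency offset."""
    d_rx = rx[:-lag] * np.conj(rx[lag:])
    d_ref = ref[:-lag] * np.conj(ref[lag:])
    return np.abs(np.correlate(d_rx, d_ref, mode="valid"))  # sliding correlation

N = 139
zc = zadoff_chu(root=25, length=N)
cfo = np.exp(1j * 2 * np.pi * 0.3 * np.arange(3 * N) / N)   # large residual CFO
rx = np.concatenate([np.zeros(N), zc, np.zeros(N)]) * cfo   # preamble starts at index N
metric = differential_metric(rx, zc)
print(int(np.argmax(metric)))  # peak at the true timing offset N despite the CFO
```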

130 citations


Journal ArticleDOI
TL;DR: The double-structured orthogonal matching pursuit (DS-OMP) algorithm is proposed, in which the completely common non-zero rows and the partially common non-zero columns are jointly estimated for all users.
Abstract: Reconfigurable intelligent surface (RIS) can manipulate the wireless communication environment by controlling the coefficients of RIS elements. However, due to the large number of passive RIS elements without signal processing capability, channel estimation in RIS assisted wireless communication system requires high pilot overhead. In the second part of this invited paper, we propose to exploit the double-structured sparsity of the angular cascaded channels among users to reduce the pilot overhead. Specifically, we first reveal the double-structured sparsity, i.e., different angular cascaded channels for different users enjoy the completely common non-zero rows and the partially common non-zero columns. By exploiting this double-structured sparsity, we further propose the double-structured orthogonal matching pursuit (DS-OMP) algorithm, where the completely common non-zero rows and the partially common non-zero columns are jointly estimated for all users. Simulation results show that the pilot overhead required by the proposed scheme is lower than existing schemes.
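
For context, the single-vector orthogonal matching pursuit routine that DS-OMP builds on is sketched below in numpy; the double-structured variant additionally shares the support (completely common rows and partially common columns) across users, which is not shown here, and the measurement matrix and sparsity level are illustrative.

```python
import numpy as np

def omp(A, y, sparsity):
    """Standard orthogonal matching pursuit: recover a sparse x from y = A x."""
    residual, support = y.copy(), []
    for _ in range(sparsity):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.conj().T @ residual)))
        support.append(j)
        # least-squares fit on the selected support, then update the residual
        x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ x_s
    x = np.zeros(A.shape[1], dtype=A.dtype)
    x[support] = x_s
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256)) / np.sqrt(64)     # measurement (pilot) matrix
x_true = np.zeros(256)
x_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
x_hat = omp(A, A @ x_true, sparsity=5)
print(np.linalg.norm(x_hat - x_true))                # small reconstruction error
```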

123 citations


Journal ArticleDOI
TL;DR: In this paper, a two-timescale (TTS) transmission protocol was proposed to maximize the achievable average sum-rate for an IRS-aided multiuser system under the general correlated Rician channel model.
Abstract: Intelligent reflecting surface (IRS) has drawn a lot of attention recently as a promising new solution to achieve high spectral and energy efficiency for future wireless networks. By utilizing massive low-cost passive reflecting elements, the wireless propagation environment becomes controllable and thus can be made favorable for improving the communication performance. Prior works on IRS mainly rely on the instantaneous channel state information (I-CSI), which, however, is practically difficult to obtain for IRS-associated links due to its passive operation and large number of reflecting elements. To overcome this difficulty, we propose in this paper a new two-timescale (TTS) transmission protocol to maximize the achievable average sum-rate for an IRS-aided multiuser system under the general correlated Rician channel model. Specifically, the passive IRS phase shifts are first optimized based on the statistical CSI (S-CSI) of all links, which varies much more slowly than their I-CSI; while the transmit beamforming/precoding vectors at the access point (AP) are then designed to cater to the I-CSI of the users’ effective fading channels with the optimized IRS phase shifts, thus significantly reducing the channel training overhead and passive beamforming design complexity over the existing schemes based on the I-CSI of all channels. Besides, for ease of practical implementation, we consider discrete phase shifts at each reflecting element of the IRS. For the single-user case, an efficient penalty dual decomposition (PDD)-based algorithm is proposed, where the IRS phase shifts are updated in parallel to reduce the computational time. For the multiuser case, we propose a general TTS stochastic successive convex approximation (SSCA) algorithm by constructing a quadratic surrogate of the objective function, which cannot be explicitly expressed in closed form. Simulation results are presented to validate the effectiveness of our proposed algorithms and evaluate the impact of S-CSI and channel correlation on the system performance.

Journal ArticleDOI
TL;DR: This study proposes an offloading model for a multi-user MEC system with multiple tasks, and an equivalent form of reinforcement learning is created, where the state spaces are defined based on all possible solutions and the actions are defined on the basis of movement between the different states.
Abstract: Computation offloading at mobile edge computing (MEC) servers can mitigate the resource limitation and reduce the communication latency for mobile devices. Thereby, in this study, we propose an offloading model for a multi-user MEC system with multiple tasks. In addition, a new caching concept is introduced for the computation tasks, where the application program and related code for the completed tasks are cached at the edge server. Furthermore, an efficient model of task offloading and caching integration is formulated as a nonlinear problem whose goal is to reduce the total overhead of time and energy. However, solving these types of problems is computationally prohibitive, especially for a large number of mobile users. Thus, an equivalent form of reinforcement learning is created, where the state spaces are defined based on all possible solutions and the actions are defined on the basis of movement between the different states. Afterwards, two effective Q-learning and Deep-Q-Network-based algorithms are proposed to derive the near-optimal solution for this problem. Finally, experimental evaluations verify that our proposed model can substantially minimize the mobile devices’ overhead by deploying the computation offloading and task caching strategy reasonably.
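
The tabular Q-learning component reduces to the standard update Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)). Below is a generic, self-contained sketch of that loop; the state/action encoding and the placeholder reward standing in for the time-plus-energy overhead are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 16, 4          # placeholder offloading/caching states and decisions
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(state, action):
    """Placeholder environment: returns (next_state, reward).
    In the paper the reward would reflect the negative time + energy overhead."""
    next_state = int(rng.integers(n_states))
    reward = -rng.random()           # smaller overhead => larger (less negative) reward
    return next_state, reward

state = int(rng.integers(n_states))
for _ in range(10_000):
    # epsilon-greedy action selection
    action = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # standard Q-learning temporal-difference update
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state
```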

Journal ArticleDOI
TL;DR: In this article, an overhead-aware resource allocation framework for wireless networks where reconfigurable intelligent surfaces are used to improve the communication performance is proposed and incorporated in the expressions of the system rate and energy efficiency.
Abstract: Reconfigurable intelligent surfaces have emerged as a promising technology for future wireless networks. Given that a large number of reflecting elements is typically used and that the surface has no signal processing capabilities, a major challenge is to cope with the overhead that is required to estimate the channel state information and to report the optimized phase shifts to the surface. This issue has not been addressed by previous works, which do not explicitly consider the overhead during the resource allocation phase. This work aims at filling this gap, by developing an overhead-aware resource allocation framework for wireless networks where reconfigurable intelligent surfaces are used to improve the communication performance. An overhead model is proposed and incorporated in the expressions of the system rate and energy efficiency, which are then optimized with respect to the phase shifts of the reconfigurable intelligent surface, the transmit and receive filters, the power and bandwidth used for the communication and feedback phases. The bi-objective maximization of the rate and energy efficiency is investigated, too. The proposed framework characterizes the trade-off between optimized radio resource allocation policies and the related overhead in networks with reconfigurable intelligent surfaces.
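
The effect of folding overhead into the rate expression can be illustrated with a simple hedged model: if a fraction of each frame is spent on channel estimation and phase-shift feedback, only the remaining fraction carries data, and the achievable rate is scaled accordingly. This is a generic textbook-style model, not the paper's actual overhead model, and all numbers are illustrative.

```python
import math

def effective_rate(bandwidth_hz, snr_linear, frame_s, overhead_s):
    """Rate after discounting the fraction of the frame spent on CSI estimation
    and phase-shift feedback (generic model; the paper's overhead model is richer)."""
    data_fraction = max(0.0, 1.0 - overhead_s / frame_s)
    return data_fraction * bandwidth_hz * math.log2(1.0 + snr_linear)

# a 10 MHz channel at 20 dB SNR with 0.2 ms of overhead in a 1 ms frame
print(effective_rate(10e6, 10 ** (20 / 10), 1e-3, 0.2e-3) / 1e6, "Mbit/s")
```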

Proceedings ArticleDOI
19 Apr 2021
TL;DR: SIMDRAM as mentioned in this paper is a general-purpose processing-using-DRAM framework that enables the efficient implementation of complex operations and provides a flexible mechanism to support the implementation of arbitrary user-defined operations.
Abstract: Processing-using-DRAM has been proposed for a limited set of basic operations (i.e., logic operations, addition). However, in order to enable full adoption of processing-using-DRAM, it is necessary to provide support for more complex operations. In this paper, we propose SIMDRAM, a flexible general-purpose processing-using-DRAM framework that (1) enables the efficient implementation of complex operations, and (2) provides a flexible mechanism to support the implementation of arbitrary user-defined operations. The SIMDRAM framework comprises three key steps. The first step builds an efficient MAJ/NOT representation of a given desired operation. The second step allocates DRAM rows that are reserved for computation to the operation’s input and output operands, and generates the required sequence of DRAM commands to perform the MAJ/NOT implementation of the desired operation in DRAM. The third step uses the SIMDRAM control unit located inside the memory controller to manage the computation of the operation from start to end, by executing the DRAM commands generated in the second step of the framework. We design the hardware and ISA support for the SIMDRAM framework to (1) address key system integration challenges, and (2) allow programmers to employ new SIMDRAM operations without hardware changes. We evaluate SIMDRAM for reliability, area overhead, throughput, and energy efficiency using a wide range of operations and seven real-world applications to demonstrate SIMDRAM’s generality. Our evaluations using a single DRAM bank show that (1) over 16 operations, SIMDRAM provides 2.0X the throughput and 2.6X the energy efficiency of Ambit, a state-of-the-art processing-using-DRAM mechanism; (2) over seven real-world applications, SIMDRAM provides 2.5X the performance of Ambit. Using 16 DRAM banks, SIMDRAM provides (1) 88X and 5.8X the throughput, and 257X and 31X the energy efficiency, of a CPU and a high-end GPU, respectively, over 16 operations; (2) 21X and 2.1X the performance of the CPU and GPU, over seven real-world applications. SIMDRAM incurs an area overhead of only 0.2% in a high-end CPU.
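
The MAJ/NOT representation used in the first step relies on {majority, NOT} being functionally complete: AND and OR (and hence any Boolean function) can be expressed with 3-input majority gates whose third input is tied to 0 or 1. A tiny Python illustration of that logic identity (not of the DRAM command sequence itself):

```python
def maj(a, b, c):
    """3-input majority, the bulk bitwise primitive that triple-row activation provides."""
    return int(a + b + c >= 2)

def not_(a):
    return 1 - a

def and_(a, b):
    return maj(a, b, 0)   # AND(a, b) = MAJ(a, b, 0)

def or_(a, b):
    return maj(a, b, 1)   # OR(a, b)  = MAJ(a, b, 1)

def xor_(a, b):           # a more complex operation built only from MAJ and NOT
    return and_(or_(a, b), not_(and_(a, b)))

assert [xor_(a, b) for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 0]
```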

Journal ArticleDOI
TL;DR: Simulation results prove that the proposed trust-based authentication method for clustered vehicular ad hoc networks increases the accuracy of detecting malicious nodes and the packet delivery ratio, and decreases the authentication delay and overhead.
Abstract: Vehicular Ad hoc Networks (VANETs) are a subset of mobile ad hoc networks that allow any vehicle to communicate with other adjacent vehicles, road side units and infrastructure. In these networks, the purpose is to enhance security, improve the management of urban and road traffic and provide services to passengers. Due to problems such as reliability and privacy, messages that are exchanged in the network should be confidential and secure. Therefore, we need a secure topology to maintain trust, which enables the cryptographic process. In this paper, a trust based authentication method for clustered vehicular ad hoc networks is proposed. An efficient authentication method should be able to accurately detect malicious nodes and reduce delay and overhead. The main purpose of the proposed method is to create trustworthy and stable clusters that lead to the stability of the entire network. For this purpose, we estimate the trust degree of each vehicle by combining the trust between vehicles and the trust between the vehicle and Road Side Units (RSUs), and Cluster Heads (CHs) are selected based on this estimated trust degree. Cluster Heads along with verifiers are responsible for monitoring each vehicle. On the other hand, the cluster heads provide an optimal and secure route for transmitting messages. Messages are digitally signed by the sender, encrypted using a public/private key distributed by a Trusted Authority (TA), and decrypted by the destination, so that each message contains a certificate from a trusted authority. Through this identification, the sender and receiver of the message are verified and authentication is achieved. Simulation results prove that the proposed method increases the accuracy of detecting malicious nodes and the packet delivery ratio, and decreases the authentication delay and overhead.

Journal ArticleDOI
TL;DR: This article proposes an efficient certificateless aggregate signature scheme with conditional privacy preservation that is suitable for resource-constrained environments, and it is compared with related works from aspects of computation cost, communication efficiency, and security requirements.
Abstract: As an extension of traditional vehicular ad hoc networks, the Internet of Vehicles (IoV) enables information collection and dissemination, which brings a lot of convenience and benefits to the intelligent transportation systems. However, the booming IoV confronts a few challenges in the aspects of vehicle location privacy preservation and the authenticity of the transmitted information. In order to meet these challenges, we propose an efficient certificateless aggregate signature scheme with conditional privacy preservation in this article. Our scheme utilizes the technique of full aggregation to reduce the bandwidth resources and computing overhead. Besides, the conditional privacy preservation in IoV system is realized by using pseudonym mechanism. We demonstrate that the proposed scheme is secure against the Type-I and Type-II adversaries in the random oracle under the computational Diffie–Hellman assumption. In addition, the proposed scheme is compared with related works from aspects of computation cost, communication efficiency, and security requirements. The comparison results show that the proposed scheme is efficient, and it is suitable for resource-constrained environments.

Journal ArticleDOI
TL;DR: This paper utilizes the deep learning technique to conduct the routing computation for the SDCSs and considers an online training manner to reduce the computation overhead of the central controller and improve the adaptation of CNNs to the changing traffic pattern.
Abstract: Software Defined Networking (SDN) is regarded as the next generation paradigm as it simplifies the structure of the data plane and improves the resource utilization. However, in current Software Defined Communication Systems (SDCSs), the maximum or minimum metric value based routing strategies come from traditional networks, which lack the ability of self-adaptation and do not efficiently utilize the computation resource in the controllers. To solve these problems, in this paper, we utilize the deep learning technique to conduct the routing computation for the SDCSs. Specifically, in our proposal, the considered Convolutional Neural Networks (CNNs) are adopted to intelligently compute the paths according to the input real-time traffic traces. To reduce the computation overhead of the central controller and improve the adaptation of CNNs to the changing traffic pattern, we consider an online training manner. Analysis shows that the computation complexity can be significantly reduced through the online training manner. Moreover, the simulation results demonstrate that our proposed CNNs are able to compute the appropriate paths combinations with high accuracy. Furthermore, the adopted periodical retraining enables the deep learning structures to adapt to the traffic changes.

Journal ArticleDOI
TL;DR: A hybrid D2D message authentication (HDMA) scheme is proposed for 5G-enabled VANETs, in which a novel group signature-based algorithm is used for mutual authentication between vehicle to vehicle (V2V) communication.
Abstract: The fifth-generation (5G) mobile communication technology with higher capacity and data rate, ultra-low device to device (D2D) latency, and massive device connectivity will greatly promote the development of vehicular ad hoc networks (VANETs). Meanwhile, new challenges such as security, privacy, and efficiency arise. In this article, a hybrid D2D message authentication (HDMA) scheme is proposed for 5G-enabled VANETs, in which a novel group signature-based algorithm is used for mutual authentication in vehicle to vehicle (V2V) communication. In addition, a pre-computed lookup table is adopted to reduce the computation overhead of the modular exponentiation operation. Security analysis shows that HDMA is robust against various security attacks, and performance analysis also shows that the authentication of HDMA is more efficient than that of some traditional schemes, with the help of the pre-computed lookup table, in V2V and vehicle to infrastructure (V2I) communication.
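
The pre-computed lookup table for modular exponentiation mentioned above is a standard fixed-base technique: the powers g^(2^i) mod p are stored once offline, and every later exponentiation needs only modular multiplications over the stored entries. A generic sketch under illustrative parameters (the scheme's actual group and table layout may differ):

```python
def build_table(g, p, bits):
    """Precompute g^(2^i) mod p for i = 0..bits-1 (done once, offline)."""
    table, cur = [], g % p
    for _ in range(bits):
        table.append(cur)
        cur = cur * cur % p
    return table

def fixed_base_pow(table, exponent, p):
    """Online phase: only multiplications over the stored powers, no squarings."""
    result = 1
    for i, t in enumerate(table):
        if (exponent >> i) & 1:
            result = result * t % p
    return result

p, g = 2**127 - 1, 5                      # illustrative prime modulus and base
table = build_table(g, p, bits=127)
e = 0x1234_5678_9ABC_DEF0
assert fixed_base_pow(table, e, p) == pow(g, e, p)
```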

Proceedings ArticleDOI
01 Jan 2021
TL;DR: A novel system, POSEIDON, is proposed, the first of its kind in the regime of privacy-preserving neural network training, employing multiparty lattice-based cryptography and preserving the confidentiality of the training data, the model, and the evaluation data, under a passive-adversary model and collusions between up to $N-1$ parties.
Abstract: In this paper, we address the problem of privacy-preserving training and evaluation of neural networks in an $N$-party, federated learning setting. We propose a novel system, POSEIDON, the first of its kind in the regime of privacy-preserving neural network training. It employs multiparty lattice-based cryptography to preserve the confidentiality of the training data, the model, and the evaluation data, under a passive-adversary model and collusions between up to $N-1$ parties. To efficiently execute the secure backpropagation algorithm for training neural networks, we provide a generic packing approach that enables Single Instruction, Multiple Data (SIMD) operations on encrypted data. We also introduce arbitrary linear transformations within the cryptographic bootstrapping operation, optimizing the costly cryptographic computations over the parties, and we define a constrained optimization problem for choosing the cryptographic parameters. Our experimental results show that POSEIDON achieves accuracy similar to centralized or decentralized non-private approaches and that its computation and communication overhead scales linearly with the number of parties. POSEIDON trains a 3-layer neural network on the MNIST dataset with 784 features and 60K samples distributed among 10 parties in less than 2 hours.

Proceedings ArticleDOI
19 Apr 2021
TL;DR: In this article, a caching-inspired Greedy-Dual keep-alive policy is proposed to reduce the cold-start overhead of FaaS applications by more than 3× compared to current approaches.
Abstract: Functions as a Service (also called serverless computing) promises to revolutionize how applications use cloud resources. However, functions suffer from cold-start problems due to the overhead of initializing their code and data dependencies before they can start executing. Keeping functions alive and warm after they have finished execution can alleviate the cold-start overhead. Keep-alive policies must keep functions alive based on their resource and usage characteristics, which is challenging due to the diversity in FaaS workloads. Our insight is that keep-alive is analogous to caching. Our caching-inspired Greedy-Dual keep-alive policy can be effective in reducing the cold-start overhead by more than 3× compared to current approaches. Caching concepts such as reuse distances and hit-ratio curves can also be used for auto-scaled server resource provisioning, which can reduce the resource requirement of FaaS providers by 30% for real-world dynamic workloads. We implement caching-based keep-alive and resource provisioning policies in our FaasCache system, which is based on OpenWhisk. We hope that our caching analogy opens the door to more principled and optimized keep-alive and resource provisioning techniques for future FaaS workloads and platforms.
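
The caching analogy can be made concrete with a Greedy-Dual-Size-Frequency style policy: each warm container gets a priority of clock + frequency * cost / size (cost standing in for its cold-start initialization time), and the lowest-priority container is evicted when memory is needed, with the clock advanced to that priority. The sketch below is a hedged illustration of that classic policy; FaasCache's exact cost model and implementation details may differ.

```python
class GreedyDualKeepAlive:
    """Greedy-Dual-Size-Frequency style keep-alive for warm function containers.

    cost ~ cold-start initialization time, size ~ container memory footprint.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = 0.0
        self.entries = {}   # name -> dict(prio, freq, cost, size)
        self.used = 0

    def _priority(self, e):
        return self.clock + e["freq"] * e["cost"] / e["size"]

    def access(self, name, cost, size):
        if name in self.entries:                  # warm hit: bump frequency and priority
            e = self.entries[name]
            e["freq"] += 1
            e["prio"] = self._priority(e)
            return "warm"
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda n: self.entries[n]["prio"])
            self.clock = self.entries[victim]["prio"]   # advance clock to victim's priority
            self.used -= self.entries[victim]["size"]
            del self.entries[victim]
        e = {"freq": 1, "cost": cost, "size": size}
        e["prio"] = self._priority(e)
        self.entries[name] = e
        self.used += size
        return "cold"

cache = GreedyDualKeepAlive(capacity=1024)
print(cache.access("resize_image", cost=800, size=512))   # cold start
print(cache.access("resize_image", cost=800, size=512))   # warm hit
```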

Proceedings ArticleDOI
24 Jun 2021
TL;DR: ClusterFL as mentioned in this paper is a similarity-aware federated learning system that can provide high model accuracy and low communication overhead for human activity recognition (HAR) applications, which can efficiently drop out the nodes that converge slower or have little correlation with other nodes in each cluster.
Abstract: Federated Learning (FL) has recently received significant interest thanks to its capability of protecting data privacy. However, existing FL paradigms yield unsatisfactory performance for a wide class of human activity recognition (HAR) applications since they are oblivious to the intrinsic relationship between data of different users. We propose ClusterFL, a similarity-aware federated learning system that can provide high model accuracy and low communication overhead for HAR applications. ClusterFL features a novel clustered multi-task federated learning framework that maximizes the training accuracy of multiple learned models while automatically capturing the intrinsic clustering relationship among the data of different nodes. Based on the learned cluster relationship, ClusterFL can efficiently drop out the nodes that converge slower or have little correlation with other nodes in each cluster, significantly speeding up the convergence while maintaining the accuracy performance. We evaluate the performance of ClusterFL on an NVIDIA edge testbed using four new HAR datasets collected from a total of 145 users. The results show that ClusterFL outperforms several state-of-the-art FL paradigms in terms of overall accuracy, and saves more than 50% of the communication overhead at the expense of negligible accuracy degradation.

Journal ArticleDOI
TL;DR: In this article, a joint link scheduling and rate adaptation problem for a hierarchical satellite-UAV-terrestrial network on the ocean is addressed to minimize the total energy consumption with quality of service (QoS) guarantees.
Abstract: In the coming smart ocean era, reliable and efficient communications are crucial for promoting a variety of maritime activities. Current maritime communication networks (MCNs) mainly rely on marine satellites and on-shore base stations (BSs). The former generally provides limited transmission rate, while the latter lacks wide-area coverage capability. Due to these facts, the state-of-the-art MCN falls far behind terrestrial fifth-generation (5G) networks. To fill up the gap in the coming sixth-generation (6G) era, we explore the benefit of deployable BSs for maritime coverage enhancement. Both unmanned aerial vehicles (UAVs) and mobile vessels are used to configure deployable BSs. This leads to a hierarchical satellite-UAV-terrestrial network on the ocean. We address the joint link scheduling and rate adaptation problem for this hybrid network, to minimize the total energy consumption with quality of service (QoS) guarantees. Different from previous studies, we use only the large-scale channel state information (CSI), which is location-dependent and thus can be predicted through the position information of each UAV/vessel based on its specific trajectory/shipping lane. The problem is shown to be an NP-hard mixed integer nonlinear programming problem with a group of hidden non-linear equality constraints. We solve it suboptimally by using Min-Max transformation and iterative problem relaxation, leading to a process-oriented joint link scheduling and rate adaptation scheme. As observed by simulations, the scheme can provide agile on-demand coverage for all users with much reduced system overhead and a polynomial computation complexity. Moreover, it can achieve a prominent performance close to the optimal solution.

Journal ArticleDOI
TL;DR: A convergence upper bound is provided characterizing the tradeoff between convergence rate and global rounds, showing that a small number of active UEs per round still guarantees convergence and advocating the proposed FL algorithm for a paradigm shift in bandwidth-constrained learning wireless IoT networks.
Abstract: Federated learning (FL) allows multiple edge computing nodes to jointly build a shared learning model without having to transfer their raw data to a centralized server, thus reducing communication overhead. However, FL still faces a number of challenges such as nonindependent and identically distributed data and heterogeneity of user equipments (UEs). Enabling a large number of UEs to join the training process in every round raises a potential issue of the heavy global communication burden. To address these issues, we generalize the current state-of-the-art federated averaging (FedAvg) by adding a weight-based proximal term to the local loss function. The proposed FL algorithm runs stochastic gradient descent in parallel on a sampled subset of the total UEs with replacement during each global round. We provide a convergence upper bound characterizing the tradeoff between convergence rate and global rounds, showing that a small number of active UEs per round still guarantees convergence. Next, we employ the proposed FL algorithm in wireless Internet-of-Things (IoT) networks to minimize either total energy consumption or completion time of FL, where a simple yet efficient path-following algorithm is developed for its solutions. Finally, numerical results on unbalanced data sets are provided to demonstrate the performance improvement and robustness on the convergence rate of the proposed FL algorithm over FedAvg. They also reveal that the proposed algorithm requires much less training time and energy consumption than the FL algorithm with full user participation. These observations advocate the proposed FL algorithm for a paradigm shift in bandwidth-constrained learning wireless IoT networks.
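
The weight-based proximal term is the key modification over plain FedAvg: each sampled device locally minimizes f_i(w) + (mu/2) * ||w - w_global||^2 before the server averages the returned models. A minimal numpy sketch of one such scheme on a toy least-squares objective follows; the per-user weighting, sampling-with-replacement details, and hyperparameters are simplified assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_users, mu, lr = 5, 20, 0.1, 0.05

# toy local datasets: user i holds (A_i, b_i) for a least-squares loss
data = [(rng.standard_normal((30, dim)), rng.standard_normal(30)) for _ in range(n_users)]

def local_update(w_global, A, b, steps=10):
    """Local gradient descent on f_i(w) + (mu/2) * ||w - w_global||^2 (proximal term)."""
    w = w_global.copy()
    for _ in range(steps):
        grad = A.T @ (A @ w - b) / len(b) + mu * (w - w_global)
        w -= lr * grad
    return w

w = np.zeros(dim)
for _ in range(50):
    sampled = rng.choice(n_users, size=5, replace=True)   # a few active users per round
    updates = [local_update(w, *data[int(i)]) for i in sampled]
    w = np.mean(updates, axis=0)                          # server averages returned models
```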

Journal ArticleDOI
TL;DR: A performance analysis of the proposed protocol shows that the proposed strategy significantly reduces the number of authentication packets and MAC/PHY overhead while the security analysis demonstrates its robustness against various types of attacks.
Abstract: One of the most important and critical requirements for the Internet of Vehicles (IoV) is security under strict latency. Typically, in authentication protocols for vehicular ad hoc networks, vehicles need to authenticate themselves frequently. This results in reduced application traffic and increased overhead. Moreover, the mobile nature of vehicles makes them a prime target for physical, side channel, and cloning attacks. To address these issues, this article presents an efficient protocol for authentication in the IoV. The proposed protocol uses physical unclonable functions to provide the desired security characteristics. To reduce the overhead of authentication and improve the throughput of application layer packets, the proposed protocol uses a three-layered infrastructure architecture for IoVs, i.e., roadside units (RSUs), RSU gateways, and trusted authority. A vehicle needs to authenticate only once when it enters the area of an RSU gateway, which may cover multiple RSUs. A performance analysis of the protocol shows that the proposed strategy significantly reduces the number of authentication packets and MAC/PHY overhead while the security analysis demonstrates its robustness against various types of attacks.

Journal ArticleDOI
Lu Wei, Jie Cui, Yan Xu, Jiujun Cheng, Hong Zhong
TL;DR: An SSK updating algorithm is designed, which is constructed on Shamir’s secret sharing algorithm and a secure pseudo-random function, so that the TPDs of unrevoked vehicles can update the SSK securely.
Abstract: Owing to the development of wireless communication technology and the increasing number of automobiles, vehicular ad hoc networks (VANETs) have become essential tools for ensuring traffic safety and enhancing driving convenience. It is necessary to design a conditional privacy-preserving authentication (CPPA) scheme for VANETs because of their vulnerability and security requirements. Traditional CPPA schemes have two deficiencies. One is that the communication or storage overhead is not sufficiently low, whereas traffic emergency messages require an ultra-low transmission delay. The other is that traditional CPPA schemes do not consider updating the system secret key (SSK), which is stored in an unhackable Tamper Proof Device (TPD), whereas side-channel attack methods and the wide usage of the SSK increase the probability of breaking the SSK. To solve the first issue, we propose a CPPA signature scheme based on elliptic curve cryptography, which can achieve message recovery and whose security can be reduced to the elliptic curve discrete logarithm assumption, so that traffic emergency messages are secured with ultra-low communication overhead. To solve the second issue, we design an SSK updating algorithm, which is constructed on Shamir’s secret sharing algorithm and a secure pseudo-random function, so that the TPDs of unrevoked vehicles can update the SSK securely. Formal security proof and analysis show that our proposed scheme satisfies the security and privacy requirements of VANETs. Performance analysis demonstrates that our proposed scheme requires less storage and has a lower transmission delay compared with related schemes.
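
Shamir's secret sharing, on which the SSK updating algorithm is built, hides a secret as the constant term of a random degree-(t-1) polynomial over a prime field; any t shares recover it by Lagrange interpolation at x = 0. A minimal sketch over an illustrative prime field (toy parameters, not the scheme's actual field or share distribution):

```python
import random

P = 2**61 - 1   # a prime field large enough for a toy secret

def make_shares(secret, t, n):
    """Split `secret` into n shares; any t of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over GF(P)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = make_shares(secret=123456789, t=3, n=5)
assert reconstruct(shares[:3]) == 123456789     # any 3 of the 5 shares suffice
```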

Journal ArticleDOI
TL;DR: In this article, the authors proposed a predictive beamforming scheme in the context of dual-functional radar-communication (DFRC) systems, where the road-side unit estimates and predicts the motion parameters of vehicles based on the echoes of the DFRC signal.
Abstract: The development of dual-functional radar-communication (DFRC) systems, where vehicle localization and tracking can be combined with vehicular communication, will lead to more efficient future vehicular networks. In this paper, we develop a predictive beamforming scheme in the context of DFRC systems. We consider a system model where the road-side unit estimates and predicts the motion parameters of vehicles based on the echoes of the DFRC signal. Compared to the conventional feedback-based beam tracking approaches, the proposed method can reduce the signaling overhead and improve the accuracy of the angle estimation. To accurately estimate the motion parameters of vehicles in real time, we propose a novel message passing algorithm based on a factor graph, which yields performance close to that of maximum a posteriori estimation. The beamformers are then designed based on the predicted angles for establishing the communication links. With the employment of appropriate approximations, all messages on the factor graph can be derived in closed form, thus reducing the complexity. Simulation results show that the proposed DFRC based beamforming scheme is superior to the feedback-based approach in terms of both estimation and communication performance. Moreover, the proposed message passing algorithm achieves performance similar to that of the high-complexity particle filtering-based methods.

Journal ArticleDOI
TL;DR: A model-driven deep learning (MDDL)-based channel estimation and feedback scheme for wideband millimeter-wave (mmWave) massive hybrid multiple-input multiple-output (MIMO) systems, where the angle-delay domain channels’ sparsity is exploited for reducing the overhead.
Abstract: This paper proposes a model-driven deep learning (MDDL)-based channel estimation and feedback scheme for wideband millimeter-wave (mmWave) massive hybrid multiple-input multiple-output (MIMO) systems, where the angle-delay domain channels’ sparsity is exploited for reducing the overhead. First, we consider the uplink channel estimation for time-division duplexing systems. To reduce the uplink pilot overhead for estimating high-dimensional channels from a limited number of radio frequency (RF) chains at the base station (BS), we propose to jointly train the phase shift network and the channel estimator as an auto-encoder. Particularly, by exploiting the channels’ structured sparsity from an a priori model and learning the integrated trainable parameters from the data samples, the proposed multiple-measurement-vectors learned approximate message passing (MMV-LAMP) network with the devised redundant dictionary can jointly recover multiple subcarriers’ channels with significantly enhanced performance. Moreover, we consider the downlink channel estimation and feedback for frequency-division duplexing systems. Similarly, the pilots at the BS and channel estimator at the users can be jointly trained as an encoder and a decoder, respectively. Besides, to further reduce the channel feedback overhead, only the received pilots on part of the subcarriers are fed back to the BS, which can exploit the MMV-LAMP network to reconstruct the spatial-frequency channel matrix. Numerical results show that the proposed MDDL-based channel estimation and feedback scheme outperforms state-of-the-art approaches.

Proceedings ArticleDOI
19 Apr 2021
TL;DR: In this paper, cache coherence is used instead of virtual memory for tracking applications' memory accesses transparently, at cache-line granularity, eliminating page faults from the application critical path when accessing remote data, and decoupling the application memory access tracking from the virtual memory page size.
Abstract: Disaggregated memory can address resource provisioning inefficiencies in current datacenters. Multiple software runtimes for disaggregated memory have been proposed in an attempt to make disaggregated memory practical. These systems rely on the virtual memory subsystem to transparently offer disaggregated memory to applications using a local memory abstraction. Unfortunately, using virtual memory for disaggregation has multiple limitations, including high overhead that comes from the use of page faults to identify what data to fetch and cache locally, and high dirty data amplification that comes from the use of page-granularity for tracking changes to the cached data (4KB or higher). In this paper, we propose a fundamentally new approach to designing software runtimes for disaggregated memory that addresses these limitations. Our main observation is that we can use cache coherence instead of virtual memory for tracking applications' memory accesses transparently, at cache-line granularity. This simple idea (1) eliminates page faults from the application critical path when accessing remote data, and (2) decouples the application memory access tracking from the virtual memory page size, enabling cache-line granularity dirty data tracking and eviction. Using this observation, we implemented a new software runtime for disaggregated memory that improves average memory access time by 1.7-5X and reduces dirty data amplification by 2-10X, compared to state-of-the-art systems.

Journal ArticleDOI
TL;DR: In this article, the authors proposed a deep learning-based CSI compression scheme called DeepCMC, which is composed of convolutional layers followed by quantization and entropy coding blocks.
Abstract: Massive multiple-input multiple-output (MIMO) systems require downlink channel state information (CSI) at the base station (BS) to achieve spatial diversity and multiplexing gains. In a frequency division duplex (FDD) multiuser massive MIMO network, each user needs to compress and feedback its downlink CSI to the BS. The CSI overhead scales with the number of antennas, users and subcarriers, and becomes a major bottleneck for the overall spectral efficiency. In this paper, we propose a deep learning (DL)-based CSI compression scheme, called DeepCMC , composed of convolutional layers followed by quantization and entropy coding blocks. In comparison with previous DL-based CSI reduction structures, DeepCMC proposes a novel fully-convolutional neural network (NN) architecture, with residual layers at the decoder, and incorporates quantization and entropy coding blocks into its design. DeepCMC is trained to minimize a weighted rate-distortion cost, which enables a trade-off between the CSI quality and its feedback overhead. Simulation results demonstrate that DeepCMC outperforms the state of the art CSI compression schemes in terms of the reconstruction quality of CSI for the same compression rate. We also propose a distributed version of DeepCMC for a multi-user MIMO scenario to encode and reconstruct the CSI from multiple users in a distributed manner. Distributed DeepCMC not only utilizes the inherent CSI structures of a single MIMO user for compression, but also benefits from the correlations among the channel matrices of nearby users to further improve the performance in comparison with DeepCMC. We also propose a reduced-complexity training method for distributed DeepCMC, allowing to scale it to multiple users, and suggest a cluster-based distributed DeepCMC approach for practical implementation.

Proceedings ArticleDOI
13 Feb 2021
TL;DR: In this paper, the intrinsic charge sharing operation during a dynamic memory access can be used effectively to perform analog CIM computations: by reconfiguring existing eDRAM columns as charge domain circuits, thus, greatly minimizing peripheral circuit area and power overhead.
Abstract: The unprecedented growth in deep neural networks (DNN) size has led to massive amounts of data movement from off-chip memory to on-chip processing cores in modern machine learning (ML) accelerators. Compute-in-memory (CIM) designs performing analog DNN computations within a memory array, along with peripheral mixed-signal circuits, are being explored to mitigate this memory-wall bottleneck: consisting of memory latency and energy overhead. Embedded-dynamic random-access memory (eDRAM) [1], [2], which integrates the 1T1C (T=Transistor, C=Capacitor) DRAM bitcell monolithically along with high-performance logic transistors and interconnects, can enable custom CIM designs. It offers the densest embedded bitcell, a low pJ/bit access energy, a low soft error rate, high-endurance, high-performance, and high-bandwidth: all desired attributes for ML accelerators. In addition, the intrinsic charge sharing operation during a dynamic memory access can be used effectively to perform analog CIM computations: by reconfiguring existing eDRAM columns as charge domain circuits, thus, greatly minimizing peripheral circuit area and power overhead. Configuring a part of eDRAM as a CIM engine (for data conversion, DNN computations, and weight storage) and retaining the remaining part as a regular memory (for inputs, gradients during training, and non-CIM workload data) can help to meet the layer/kernel dependent variable storage needs during a DNN inference/training step. Thus, the high cost/bit of eDRAM can be amortized by repurposing part of existing large capacity, level-4 eDRAM caches [7] in high-end microprocessors, into large-scale CIM engines.