Showing papers on "Memory management" published in 2021


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Rainbow Memory (RM) is a memory management strategy based on per-sample classification uncertainty and data augmentation that significantly improves accuracy in blurry continual learning setups, outperforming the state of the art by large margins despite its simplicity.
Abstract: Continual learning is a realistic learning scenario for AI models. The prevalent scenario of continual learning, however, assumes disjoint sets of classes as tasks, which is less realistic and rather artificial. Instead, we focus on a ‘blurry’ task boundary, where tasks share classes, which is more realistic and practical. To address such tasks, we argue for the importance of diversity of samples in an episodic memory. To enhance the sample diversity in the memory, we propose a novel memory management strategy based on per-sample classification uncertainty and data augmentation, named Rainbow Memory (RM). With extensive empirical validation on the MNIST, CIFAR10, CIFAR100, and ImageNet datasets, we show that the proposed method significantly improves accuracy in blurry continual learning setups, outperforming the state of the art by large margins despite its simplicity. Code and data splits will be available at https://github.com/clovaai/rainbow-memory.
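The sampling policy described above can be illustrated with a short sketch. This is a minimal illustration, not the authors' released code: it assumes a classifier "model" returning logits and a stochastic "augment" transform, estimates per-sample uncertainty as the variance of predictions across augmented views, and fills the memory evenly across the uncertainty ranking to keep it diverse.

import numpy as np
import torch

def estimate_uncertainty(model, x, augment, n_views=8):
    # Variance of predicted class probabilities over several augmented views.
    model.eval()
    with torch.no_grad():
        probs = torch.stack([model(augment(x)).softmax(dim=-1)
                             for _ in range(n_views)])   # (views, batch, classes)
    return probs.var(dim=0).mean(dim=-1)                 # (batch,) uncertainty scores

def fill_memory(samples, uncertainties, memory_size):
    # Pick samples spread evenly over the sorted-uncertainty ranking.
    order = np.argsort(np.asarray(uncertainties))
    picks = np.linspace(0, len(order) - 1, num=memory_size).astype(int)
    return [samples[order[i]] for i in picks]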

140 citations


Proceedings ArticleDOI
Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, Boqing Gong
21 Mar 2021
TL;DR: MoViNets are built with a three-step approach that improves computational efficiency while substantially reducing the peak memory usage of 3D convolutional neural networks, allowing them to operate on streaming video for online inference.
Abstract: We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to work on mobile devices. We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs. First, we design a video network search space and employ neural architecture search to generate efficient and diverse 3D CNN architectures. Second, we introduce the Stream Buffer technique that decouples memory from video clip duration, allowing 3D CNNs to embed arbitrary-length streaming video sequences for both training and inference with a small constant memory footprint. Third, we propose a simple ensembling technique to improve accuracy further without sacrificing efficiency. These three progressive techniques allow MoViNets to achieve state-of-the-art accuracy and efficiency on the Kinetics, Moments in Time, and Charades video action recognition datasets. For instance, MoViNet-A5-Stream achieves the same accuracy as X3D-XL on Kinetics 600 while requiring 80% fewer FLOPs and 65% less memory. Code is available at https://github.com/google-research/movinet.
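A rough sketch of the stream-buffer idea follows; it is not the MoViNet code. It assumes a causal temporal convolution and a cache holding the last k-1 frames of the previous sub-clip, so the memory footprint stays constant no matter how long the streaming video is.

import torch
import torch.nn as nn

class StreamConv3d(nn.Module):
    def __init__(self, channels, t_kernel=3):
        super().__init__()
        self.t_kernel = t_kernel
        # No temporal padding: the buffer supplies the temporal context instead.
        self.conv = nn.Conv3d(channels, channels, kernel_size=(t_kernel, 3, 3),
                              padding=(0, 1, 1))

    def forward(self, clip, buffer=None):
        # clip: (N, C, T, H, W); buffer: last t_kernel-1 frames of the previous sub-clip.
        if buffer is None:
            buffer = clip[:, :, :1].repeat(1, 1, self.t_kernel - 1, 1, 1)
        x = torch.cat([buffer, clip], dim=2)
        out = self.conv(x)                                   # output length equals T
        new_buffer = clip[:, :, -(self.t_kernel - 1):].detach()
        return out, new_buffer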

123 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the precise regional memory is constructed by memorizing local regions where the target objects appear in the past frames, and the query regions are tracked and predicted based on the optical flow estimated from the previous frame.
Abstract: Recently, several Space-Time Memory based networks have shown that the object cues (e.g. video frames as well as the segmented object masks) from the past frames are useful for segmenting objects in the current frame. However, these methods exploit the information from the memory by global-to-global matching between the current and past frames, which leads to mismatches with similar objects and high computational complexity. To address these problems, we propose a novel local-to-local matching solution for semi-supervised VOS, namely Regional Memory Network (RMNet). In RMNet, the precise regional memory is constructed by memorizing local regions where the target objects appear in the past frames. For the current query frame, the query regions are tracked and predicted based on the optical flow estimated from the previous frame. The proposed local-to-local matching effectively alleviates the ambiguity of similar objects in both memory and query frames, which allows the information to be passed from the regional memory to the query region efficiently and effectively. Experimental results indicate that the proposed RMNet performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.

94 citations


Proceedings ArticleDOI
Bohan Wu, Suraj Nair, Roberto Martín-Martín, Li Fei-Fei, Chelsea Finn
20 Jun 2021
TL;DR: In this article, a greedy hierarchical variational autoencoder (GHVAE) is proposed to solve the underfitting problem of large-scale video prediction by greedy and modular optimization.
Abstract: A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model. However, while existing video prediction models have produced promising results on small datasets, they suffer from severe underfitting when trained on large and diverse datasets. To address this underfitting challenge, we first observe that the ability to train larger video prediction models is often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep hierarchical latent variable models can produce higher quality predictions by capturing the multi-level stochasticity of future observations, but end-to-end optimization of such models is notably difficult. Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction. We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder. In comparison to state-of-the-art models, GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules. Visualization and more details are at https://sites.google.com/view/ghvae.
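A hedged sketch of the greedy, module-by-module training loop is given below; the module interface (an encode method and a forward pass returning a reconstruction and a KL term) is assumed for illustration and is not the GHVAE codebase. Only the current level receives gradients, so activation memory stays bounded by one module.

import torch

def train_greedy(modules, data_loader, recon_loss, epochs_per_level=10, lr=1e-4):
    trained = []
    for module in modules:
        opt = torch.optim.Adam(module.parameters(), lr=lr)
        for _ in range(epochs_per_level):
            for batch in data_loader:
                with torch.no_grad():              # frozen lower levels
                    h = batch
                    for m in trained:
                        h = m.encode(h)
                recon, kl = module(h)              # only this level gets gradients
                loss = recon_loss(recon, h) + kl
                opt.zero_grad()
                loss.backward()
                opt.step()
        module.requires_grad_(False)               # freeze before stacking the next level
        trained.append(module)
    return trained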

87 citations


Proceedings ArticleDOI
23 May 2021
TL;DR: This paper presents a new protocol for constant-round interactive ZK proofs that simultaneously allows for a highly efficient prover and low communication, along with an improved subfield Vector Oblivious Linear Evaluation (sVOLE) protocol with malicious security that is of independent interest.
Abstract: Efficient zero-knowledge (ZK) proofs for arbitrary boolean or arithmetic circuits have recently attracted much attention. Existing solutions suffer from either significant prover overhead (i.e., high memory usage) or relatively high communication complexity (at least κ bits per gate, for computational security parameter κ). In this paper, we propose a new protocol for constant-round interactive ZK proofs that simultaneously allows for an efficient prover with asymptotically optimal memory usage and significantly lower communication compared to protocols with similar memory efficiency. Specifically: • The prover in our ZK protocol has linear running time and, perhaps more importantly, memory usage linear in the memory needed to evaluate the circuit non-cryptographically. This allows our proof system to scale easily to very large circuits. • For statistical security parameter ρ = 40, our ZK protocol communicates roughly 9 bits/gate for boolean circuits and 2-4 field elements/gate for arithmetic circuits over large fields. Using 5 threads, 400 MB of memory, and a 200 Mbps network to evaluate a circuit with hundreds of billions of gates, our implementation (ρ = 40, κ = 128) runs at a rate of 0.45 μs/gate in the boolean case, and 1.6 μs/gate for an arithmetic circuit over a 61-bit field. We also present an improved subfield Vector Oblivious Linear Evaluation (sVOLE) protocol with malicious security that is of independent interest.

85 citations


Proceedings ArticleDOI
09 Feb 2021
TL;DR: SwiftNet compresses spatiotemporal redundancy in matching-based VOS via Pixel-Adaptive Memory (PAM), which adaptively triggers memory updates on frames where objects display noteworthy inter-frame variations.
Abstract: In this work we present SwiftNet for real-time semi-supervised video object segmentation (one-shot VOS), which reports 77.8% J&F and 70 FPS on the DAVIS 2017 validation dataset, leading all present solutions in overall accuracy and speed performance. We achieve this by elaborately compressing spatiotemporal redundancy in matching-based VOS via Pixel-Adaptive Memory (PAM). Temporally, PAM adaptively triggers memory updates on frames where objects display noteworthy inter-frame variations. Spatially, PAM selectively performs memory update and match on dynamic pixels while ignoring the static ones, significantly reducing redundant computations wasted on segmentation-irrelevant pixels. To promote efficient reference encoding, a light-aggregation encoder deploying reversed sub-pixel is also introduced in SwiftNet. We hope SwiftNet could set a strong and efficient baseline for real-time VOS and facilitate its application in mobile vision. The source code of SwiftNet can be found at https://github.com/haochenheheda/SwiftNet.
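The temporal and spatial selection rules can be sketched as below. The thresholds and the plain frame-difference variation measure are illustrative assumptions, not SwiftNet's PAM module.

import torch

def pixel_adaptive_update(memory, prev_frame, cur_frame,
                          frame_thresh=0.05, pixel_thresh=0.1):
    # memory, frames: (C, H, W) tensors in [0, 1].
    variation = (cur_frame - prev_frame).abs().mean(dim=0)   # (H, W) per-pixel change
    if variation.mean() < frame_thresh:
        return memory                                        # temporal gate: skip static frames
    dynamic = variation > pixel_thresh                       # spatial gate: dynamic pixels only
    memory = memory.clone()
    memory[:, dynamic] = cur_frame[:, dynamic]
    return memory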

75 citations


Proceedings ArticleDOI
14 Jun 2021
TL;DR: This paper proposes an innovative yet practical processing-in-memory (PIM) architecture that improves the performance of memory-bound neural network kernels and applications by 11.2× and 3.5×, respectively.
Abstract: Emerging applications such as deep neural network demand high off-chip memory bandwidth. However, under stringent physical constraints of chip packages and system boards, it becomes very expensive to further increase the bandwidth of off-chip memory. Besides, transferring data across the memory hierarchy constitutes a large fraction of total energy consumption of systems, and the fraction has steadily increased with the stagnant technology scaling and poor data reuse characteristics of such emerging applications. To cost-effectively increase the bandwidth and energy efficiency, researchers began to reconsider the past processing-in-memory (PIM) architectures and advance them further, especially exploiting recent integration technologies such as 2.5D/3D stacking. Albeit the recent advances, no major memory manufacturer has developed even a proof-of-concept silicon yet, not to mention a product. This is because the past PIM architectures often require changes in host processors and/or application code which memory manufacturers cannot easily govern. In this paper, elegantly tackling the aforementioned challenges, we propose an innovative yet practical PIM architecture. To demonstrate its practicality and effectiveness at the system level, we implement it with a 20nm DRAM technology, integrate it with an unmodified commercial processor, develop the necessary software stack, and run existing applications without changing their source code. Our evaluation at the system level shows that our PIM improves the performance of memory-bound neural network kernels and applications by 11.2× and 3.5×, respectively. Atop the performance improvement, PIM also reduces the energy per bit transfer by 3.5×, and the overall energy efficiency of the system running the applications by 3.2×.

75 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Wang et al. present PatchmatchNet, a learnable cascade formulation of Patchmatch for high-resolution multi-view stereo that can process higher-resolution imagery and is better suited to resource-limited devices than competitors that employ 3D cost volume regularization.
Abstract: We present PatchmatchNet, a novel and learnable cascade formulation of Patchmatch for high-resolution multi-view stereo. With high computation speed and low memory requirement, PatchmatchNet can process higher resolution imagery and is more suited to run on resource limited devices than competitors that employ 3D cost volume regularization. For the first time we introduce an iterative multi-scale Patchmatch in an end-to-end trainable architecture and improve the Patchmatch core algorithm with a novel and learned adaptive propagation and evaluation scheme for each iteration. Extensive experiments show a very competitive performance and generalization for our method on DTU, Tanks & Temples and ETH3D, but at a significantly higher efficiency than all existing top-performing models: at least two and a half times faster than state-of-the-art methods while using half the memory. Code is available at https://github.com/FangjinhuaWang/PatchmatchNet.

72 citations


Journal ArticleDOI
TL;DR: This work proposes DORY (Deployment Oriented to memoRY) – an automatic tool to deploy DNNs on low cost MCUs with typically less than 1MB of on-chip SRAM memory and releases all the developments – the DORY framework, the optimized backend kernels, and the related heuristics – as open-source software.
Abstract: The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler to support pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads, to reduce area overheads and increase energy efficiency – requiring explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY) – an automatic tool to deploy DNNs on low-cost MCUs with typically less than 1 MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power MCU-class devices on the market. On this device, DORY achieves up to 2.5× better MAC/cycle than the GreenWaves proprietary software solution and 18.1× better than the state-of-the-art result on an STM32-H743 MCU on single layers. Using our tool, GAP8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average @ 4.3 fps – 15.4× better than an STM32-H743. We release all our developments – the DORY framework, the optimized backend kernels, and the related heuristics – as open-source software.
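The core tiling decision can be mimicked with a small search, shown below. This brute-force loop stands in for DORY's Constraint Programming formulation and its performance heuristics; it assumes a stride-1 convolution layer and a double-buffered L1 scratchpad.

def tile_search(h, w, c_in, c_out, k, l1_bytes, elem=1, double_buffer=True):
    # Return the (tile_h, tile_w, tile_c_out) that maximizes L1 utilization.
    budget = l1_bytes // (2 if double_buffer else 1)
    best, best_size = None, 0
    for th in range(1, h + 1):
        for tw in range(1, w + 1):
            for tc in range(1, c_out + 1):
                inp = (th + k - 1) * (tw + k - 1) * c_in * elem   # input tile with halo
                wgt = k * k * c_in * tc * elem                    # weight tile
                out = th * tw * tc * elem                         # output tile
                size = inp + wgt + out
                if size <= budget and size > best_size:
                    best, best_size = (th, tw, tc), size
    return best, best_size

# Example: a 32x32x32 -> 64-channel 3x3 layer against a 64 kB L1 scratchpad.
print(tile_search(32, 32, 32, 64, 3, 64 * 1024))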

69 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: This work proposes a dynamic prototype unit (DPU) to encode normal dynamics as prototypes in real time, free from extra memory cost, and introduces meta-learning into the DPU to form a novel few-shot normalcy learner, the Meta-Prototype Unit (MPU).
Abstract: Frame reconstruction (current or future frame) based on Auto-Encoder (AE) is a popular method for video anomaly detection. With models trained on the normal data, the reconstruction errors of anomalous scenes are usually much larger than those of normal ones. Previous methods introduced the memory bank into AE, for encoding diverse normal patterns across the training videos. However, they are memory-consuming and cannot cope with unseen new scenarios in the testing data. In this work, we propose a dynamic prototype unit (DPU) to encode the normal dynamics as prototypes in real time, free from extra memory cost. In addition, we introduce meta-learning to our DPU to form a novel few-shot normalcy learner, namely Meta-Prototype Unit (MPU). It enables fast adaptation to new scenes with only a few update iterations. Extensive experiments are conducted on various benchmarks. The superior performance over the state-of-the-art demonstrates the effectiveness of our method. Our code is available at https://github.com/ktr-hubrt/MPN/.

69 citations


Proceedings ArticleDOI
13 Feb 2021
TL;DR: FIMDRAM integrates a 16-wide single-instruction multiple-data engine within the memory banks and exploits bank-level parallelism to provide 4× higher processing bandwidth than an off-chip memory solution.
Abstract: In recent years, artificial intelligence (AI) technology has proliferated rapidly and widely into application areas such as speech recognition, health care, and autonomous driving. To increase the capabilities of AI, more powerful systems are needed to process a larger amount of data. This requirement has made domain-specific accelerators, such as GPUs and TPUs, popular, as they can provide orders of magnitude higher performance than state-of-the-art CPUs. However, these accelerators can only operate at their peak performance when they get the necessary data from memory as quickly as it is processed: requiring off-chip memory with a high bandwidth and a large capacity [1]. HBM has thus far met the bandwidth and capacity requirement [2]–[6], but recent AI technologies such as recurrent neural networks require an even higher bandwidth than HBM [7]–[8]. While a further increase in off-chip bandwidth can be accomplished by various techniques, it is often limited by power constraints at the chip or system level [9]. Hence, it is essential to decrease demand for off-chip bandwidth with unconventional architectures: such as processing-in-memory. In this paper, we present Function-In-Memory DRAM (FIMDRAM) that integrates a 16-wide single-instruction multiple-data engine within the memory banks and that exploits bank-level parallelism to provide 4× higher processing bandwidth than an off-chip memory solution. Second, we show techniques that do not require any modification to conventional memory controllers and their command protocols, which make FIMDRAM more practical for quick industry adoption. Finally, we conclude this paper with circuit- and system-level evaluations of our fabricated FIMDRAM.

Proceedings ArticleDOI
01 Jan 2021
TL;DR: In this article, the authors propose an efficient attention mechanism equivalent to dot-product attention but with substantially less memory and computational cost, which allows more widespread and flexible integration of attention modules into a network and leads to better accuracies.
Abstract: Dot-product attention has wide applications in computer vision and natural language processing. However, its memory and computational costs grow quadratically with the input size. Such growth prohibits its application on high-resolution inputs. To remedy this drawback, this paper proposes a novel efficient attention mechanism equivalent to dot-product attention but with substantially less memory and computational costs. Its resource efficiency allows more widespread and flexible integration of attention modules into a network, which leads to better accuracies. Empirical evaluations demonstrate its advantages. Efficient attention modules brought significant performance boosts to object detectors and instance segmenters on MS-COCO 2017. Further, the resource efficiency democratizes attention to complex models, where high costs prohibit the use of dot-product attention. As an exemplar, a model with efficient attention achieved state-of-the-art accuracies for stereo depth estimation on the Scene Flow dataset. Code is available at https://github.com/cmsflash/efficient-attention.
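The abstract's key idea can be written in a few lines: normalize queries and keys separately, then change the association order of the matrix product so the n-by-n attention map is never materialized. The normalization choices below follow the common efficient-attention formulation and may differ in detail from the released code.

import torch

def efficient_attention(q, k, v):
    # q, k: (n, d_k); v: (n, d_v) -> (n, d_v) in O(n * d_k * d_v) time and memory.
    q = q.softmax(dim=-1)            # softmax over the feature dimension of queries
    k = k.softmax(dim=0)             # softmax over the position dimension of keys
    context = k.transpose(0, 1) @ v  # (d_k, d_v): a global context summary
    return q @ context               # (n, d_v)

n, d = 4096, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
out = efficient_attention(q, k, v)   # never builds the (n, n) attention matrix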

Proceedings ArticleDOI
18 Jul 2021
TL;DR: TinyOL (TinyML with Online-Learning) enables incremental on-device training on streaming data; it is based on the concept of online learning and is suitable for constrained IoT devices.
Abstract: Tiny machine learning (TinyML) is a fast-growing research area committed to democratizing deep learning for all-pervasive microcontrollers (MCUs). Challenged by the constraints on power, memory, and computation, TinyML has achieved significant advancement in the last few years. However, the current TinyML solutions are based on a batch/offline setting and support only the neural network's inference on MCUs. The neural network is first trained using a large amount of pre-collected data on a powerful machine and then flashed to MCUs. This results in a static model, hard to adapt to new data, and impossible to adjust for different scenarios, which impedes the flexibility of the Internet of Things (IoT). To address these problems, we propose a novel system called TinyOL (TinyML with Online-Learning), which enables incremental on-device training on streaming data. TinyOL is based on the concept of online learning and is suitable for constrained IoT devices. We evaluate TinyOL under supervised and unsupervised setups using an autoencoder neural network. Finally, we report the performance of the proposed solution and show its effectiveness and feasibility.
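A minimal illustration of the online-learning idea follows: a frozen feature extractor feeds a small trainable output layer that is updated one streamed sample at a time with plain SGD. The layer shape and learning rate are illustrative assumptions, not the authors' MCU implementation.

import numpy as np

class OnlineLayer:
    def __init__(self, n_in, n_out, lr=0.01):
        self.w = np.zeros((n_in, n_out))
        self.b = np.zeros(n_out)
        self.lr = lr

    def predict(self, x):
        return x @ self.w + self.b

    def update(self, x, y):
        # One SGD step on a single streamed sample (squared-error loss).
        err = self.predict(x) - y
        self.w -= self.lr * np.outer(x, err)
        self.b -= self.lr * err
        return float((err ** 2).mean())

# Streaming usage: x would come from the frozen on-device feature extractor.
layer = OnlineLayer(n_in=16, n_out=4)
for _ in range(100):
    x, y = np.random.randn(16), np.random.randn(4)
    layer.update(x, y)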

Journal ArticleDOI
Jawook Gu, Jong Chul Ye
TL;DR: In this paper, a single generator is implemented using adaptive instance normalization (AdaIN) layers so that the baseline generator converting a low-dose CT image to a routine-dose CT image can be switched to a generator converting high-dose to low-dose by simply changing the AdaIN code.
Abstract: Recently, deep learning approaches using CycleGAN have been demonstrated as a powerful unsupervised learning scheme for low-dose CT denoising. Unfortunately, one of the main limitations of the CycleGAN approach is that it requires two deep neural network generators at the training phase, although only one of them is used at the inference phase. The secondary auxiliary generator is needed to enforce the cycle-consistency, but the additional memory requirements and the increase in the number of learnable parameters are major hurdles for successful CycleGAN training. Despite the use of the additional generator, CycleGAN only translates between two domains, so it is not possible to investigate intermediate levels of denoising. To address this issue, here we propose a novel tunable CycleGAN architecture using a single generator. In particular, a single generator is implemented using adaptive instance normalization (AdaIN) layers so that the baseline generator converting a low-dose CT image to a routine-dose CT image can be switched to a generator converting high-dose to low-dose by simply changing the AdaIN code. Thanks to the shared baseline network, the additional memory requirement and weight increases are minimized, and the training can be done more stably even with small training data. Furthermore, by interpolating the AdaIN codes between the two domains, we can investigate various intermediate levels of denoising. Experimental results show that the proposed method outperforms the previous CycleGAN approaches while using only about half the parameters, and provides tunable denoising features that may be potentially useful in clinical environments.
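The switching mechanism rests on adaptive instance normalization, sketched below with illustrative names: the same generator is re-targeted by swapping, or interpolating, the per-channel scale and shift vectors derived from an AdaIN code.

import torch

def adain(feat, scale, shift, eps=1e-5):
    # feat: (N, C, H, W); scale, shift: (C,) vectors produced from the AdaIN code.
    n, c = feat.shape[:2]
    flat = feat.view(n, c, -1)
    mean = flat.mean(-1).view(n, c, 1, 1)
    std = flat.std(-1).view(n, c, 1, 1) + eps
    normalized = (feat - mean) / std
    return normalized * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)

def interpolate_code(code_a, code_b, alpha):
    # Blending the two domain codes yields intermediate denoising strengths.
    return (1 - alpha) * code_a + alpha * code_b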

Proceedings ArticleDOI
06 Jun 2021
TL;DR: LightSpeech leverages neural architecture search (NAS) to automatically design more lightweight and efficient models based on FastSpeech, achieving a 15x model compression ratio and a 6.5x inference speedup on CPU with on-par voice quality.
Abstract: Text to speech (TTS) has been broadly used to synthesize natural and intelligible speech in different scenarios. Deploying TTS in various end devices such as mobile phones or embedded devices requires extremely small memory usage and inference latency. While non-autoregressive TTS models such as FastSpeech have achieved significantly faster inference speed than autoregressive models, their model size and inference latency are still too large for deployment on resource-constrained devices. In this paper, we propose LightSpeech, which leverages neural architecture search (NAS) to automatically design more lightweight and efficient models based on FastSpeech. We first profile the components of the current FastSpeech model and carefully design a novel search space containing various lightweight and potentially effective architectures. Then NAS is utilized to automatically discover well-performing architectures within the search space. Experiments show that the model discovered by our method achieves a 15x model compression ratio and a 6.5x inference speedup on CPU with on-par voice quality. Audio demos are provided at https://speechresearch.github.io/lightspeech.

Proceedings ArticleDOI
01 Feb 2021
TL;DR: SynCron is an end-to-end synchronization solution for Near-Data-Processing (NDP) systems that adds low-cost hardware support near memory for synchronization acceleration and avoids the need for hardware cache coherence support.
Abstract: Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficient synchronization among the NDP cores of a system is necessary. However, supporting synchronization in many NDP systems is challenging because they lack shared caches and hardware cache coherence support, which are commonly used for synchronization in multicore systems, and communication across different NDP units can be expensive. This paper comprehensively examines the synchronization problem in NDP systems, and proposes SynCron, an end-to-end synchronization solution for NDP systems. SynCron adds low-cost hardware support near memory for synchronization acceleration, and avoids the need for hardware cache coherence support. SynCron has three components: 1) a specialized cache memory structure to avoid memory accesses for synchronization and minimize latency overheads, 2) a hierarchical message-passing communication protocol to minimize expensive communication across NDP units of the system, and 3) a hardware-only overflow management scheme to avoid performance degradation when hardware resources for synchronization tracking are exceeded. We evaluate SynCron using a variety of parallel workloads, covering various contention scenarios. SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) under low-contention real applications, compared to state-of-the-art approaches. SynCron reduces system energy consumption by 2.08× on average (up to 4.25×).

Proceedings ArticleDOI
01 Feb 2021
TL;DR: Fafnir is an efficient near-memory intelligent reduction tree whose leaves are all the ranks in a memory system and whose nodes gradually apply reduction operations while data is gathered from any rank.
Abstract: Memory-bound sparse gathering, caused by irregular random memory accesses, has become an obstacle in several on-demand applications such as embedding lookup in recommendation systems. To reduce the amount of data movement, and thereby better utilize memory bandwidth, previous studies have proposed near-data processing (NDP) solutions. The issue of prior work, however, is that they either minimize data movement effectively at the cost of limited memory parallelism or try to improve memory parallelism (up to a certain degree) but cannot successfully decrease data movement, as prior proposals rely on spatial locality (an optimistic assumption) to utilize NDP. More importantly, neither approach proposes a solution for gathering data from random memory addresses; rather they just offload operations to NDP. We propose an effective solution for sparse gathering, an efficient near-memory intelligent reduction (Fafnir) tree, the leaves of which are all the ranks in a memory system, and the nodes gradually apply reduction operations while data is gathered from any rank. By using such an overall tree, Fafnir does not rely on spatial locality; therefore, it minimizes data movement by performing entire operations at NDP and fully benefits from parallel memory accesses in parallel processing at NDP. Further, Fafnir offers other advantages such as using fewer connections (because of the tree topology), eliminating redundant memory accesses without using costly and less effective caching mechanisms, and being applicable to other domains of sparse problems such as scientific computing and graph analytics. To evaluate Fafnir, we implement it on an XCVU9P Xilinx FPGA and in 7 nm ASAP ASIC. Fafnir looks up the embedding tables up to 21.3× more quickly than the state-of-the-art NDP proposal. Furthermore, the generic architecture of Fafnir allows running classic sparse problems using the same 1.2 mm2 hardware up to 4.6× more quickly than the state of the art.
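A software model of the reduction-tree idea is sketched below (illustrative only, not Fafnir's hardware): each leaf gathers and sums the embedding rows held by one memory rank, and internal nodes add partial sums on the way up, so only one reduced vector reaches the host instead of every gathered row.

import numpy as np

def leaf_gather(rank_table, indices):
    # Sum the rows of one rank's embedding table selected by the given indices.
    if len(indices) == 0:
        return np.zeros(rank_table.shape[1])
    return rank_table[indices].sum(axis=0)

def tree_reduce(partials):
    # Pairwise-reduce partial sums level by level, mimicking the node topology.
    while len(partials) > 1:
        partials = [partials[i] + partials[i + 1] if i + 1 < len(partials) else partials[i]
                    for i in range(0, len(partials), 2)]
    return partials[0]

# Example: 4 ranks holding 8-dimensional embeddings; one lookup touches three ranks.
ranks = [np.random.randn(100, 8) for _ in range(4)]
lookup = {0: [3, 17], 1: [], 2: [42], 3: [7, 7, 9]}
result = tree_reduce([leaf_gather(ranks[r], lookup[r]) for r in range(4)])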

Journal ArticleDOI
TL;DR: In this paper, the authors perform a large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory.
Abstract: Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Prior NDP works investigate the root causes of data movement bottlenecks using different profiling methodologies and tools. However, there is still a lack of understanding about the key metrics that can identify different data movement bottlenecks and their relation to traditional and emerging data movement mitigation mechanisms. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques (e.g., caching and prefetching) to more memory-centric techniques (e.g., NDP), thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.

Proceedings ArticleDOI
01 Feb 2021
TL;DR: SpAtten leverages token sparsity, head sparsity, and quantization opportunities to reduce attention computation and memory access in NLP applications, and proposes cascade token pruning to prune away unimportant tokens in the sentence.
Abstract: The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing superior performance compared to convolutional and recurrent architectures. However, general-purpose platforms such as CPUs and GPUs are inefficient when performing attention inference due to complicated data movement and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction. Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0× with no accuracy loss, and achieves 1.6×, 3.0×, 162×, 347× speedup, and 1.4×, 3.2×, 1193×, 4059× energy savings over A3 accelerator, MNNFast accelerator, TITAN Xp GPU, Xeon CPU, respectively.
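Cascade token pruning can be sketched algorithmically as below: tokens are ranked by their cumulative attention probability and only the top-k survive into later layers. This is an illustrative software sketch, not SpAtten's top-k hardware engine, and it omits head pruning and progressive quantization.

import torch

def prune_tokens(hidden, attn_probs, keep_ratio=0.5):
    # hidden: (n, d); attn_probs: (heads, n, n) attention probabilities of the current layer.
    importance = attn_probs.sum(dim=(0, 1))            # (n,) cumulative score per token
    k = max(1, int(keep_ratio * hidden.shape[0]))
    keep = importance.topk(k).indices.sort().values    # keep surviving tokens in order
    return hidden[keep], keep

n, d, heads = 128, 64, 8
hidden = torch.randn(n, d)
attn = torch.softmax(torch.randn(heads, n, n), dim=-1)
pruned, kept = prune_tokens(hidden, attn, keep_ratio=0.5)   # 64 tokens remain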

Journal ArticleDOI
TL;DR: This work proposes a layer-specific design that employs different organizations optimized for the different layers, significantly outperforming previous works in terms of throughput, off-chip access, and on-chip memory requirements.
Abstract: Convolutional neural networks (CNNs) require both intensive computation and frequent memory access, which leads to a low processing speed and large power dissipation. Although the characteristics of the different layers in a CNN are frequently quite different, previous hardware designs have employed common optimization schemes for them. This paper proposes a layer-specific design that employs different organizations that are optimized for the different layers. The proposed design employs two layer-specific optimizations: layer-specific mixed data flow and layer-specific mixed precision. The mixed data flow aims to minimize the off-chip access while demanding a minimal on-chip memory (BRAM) resource of an FPGA device. The mixed precision quantization aims to achieve both a lossless accuracy and an aggressive model compression, thereby further reducing the off-chip access. A Bayesian optimization approach is used to select the best sparsity for each layer, achieving the best trade-off between the accuracy and compression. This mixing scheme allows the entire network model to be stored in BRAMs of the FPGA to aggressively reduce the off-chip access, and thereby achieves a significant performance enhancement. The model size is reduced by 22.66-28.93 times compared to that in a full-precision network with a negligible degradation of accuracy on VOC, COCO, and ImageNet datasets. Furthermore, the combination of mixed data flow and mixed precision significantly outperforms previous works in terms of throughput, off-chip access, and on-chip memory requirements.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: This paper develops a memory-efficient network for large-scale video snapshot compressive imaging (SCI) based on multi-group reversible 3D convolutional neural networks.
Abstract: Video snapshot compressive imaging (SCI) captures a sequence of video frames in a single shot using a 2D detector. The underlying principle is that during one exposure time, different masks are imposed on the high-speed scene to form a compressed measurement. With the knowledge of masks, optimization algorithms or deep learning methods are employed to reconstruct the desired high-speed video frames from this snapshot measurement. Unfortunately, though these methods can achieve decent results, the long running time of optimization algorithms or huge training memory occupation of deep networks still precludes their use in practical applications. In this paper, we develop a memory-efficient network for large-scale video SCI based on multi-group reversible 3D convolutional neural networks. In addition to the basic model for the grayscale SCI system, we take one step further to combine demosaicing and SCI reconstruction to directly recover color video from Bayer measurements. Extensive results on both simulation and real data captured by SCI cameras demonstrate that our proposed model outperforms the previous state-of-the-art with less memory and thus can be used in large-scale problems. The code is at https://github.com/BoChenGroup/RevSCI-net.
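The memory saving comes from reversibility: each block's inputs can be recomputed from its outputs, so intermediate activations need not be stored for backpropagation. The block below is a generic sketch with placeholder 3D convolutions (channel count assumed even), not the RevSCI-net architecture.

import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.f = nn.Conv3d(half, half, 3, padding=1)
        self.g = nn.Conv3d(half, half, 3, padding=1)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return torch.cat([y1, y2], dim=1)

    def inverse(self, y):
        # Recover the inputs from the outputs, so no activations need to be cached.
        y1, y2 = torch.chunk(y, 2, dim=1)
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return torch.cat([x1, x2], dim=1)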

Proceedings ArticleDOI
26 Oct 2021
TL;DR: HeMem is a tiered main memory management system designed from scratch for commercially available NVM and the big data applications that use it; it monitors application memory use by sampling memory accesses via CPU events rather than page tables.
Abstract: High-capacity non-volatile memory (NVM) is a new main memory tier. Tiered DRAM+NVM servers increase total memory capacity by up to 8x, but can diminish memory bandwidth by up to 7x and inflate latency by up to 63% if not managed well. We study existing hardware and software tiered memory management systems on the recently available Intel Optane DC NVM with big data applications and find that no existing system maximizes application performance on real NVM. Based on our findings, we present HeMem, a tiered main memory management system designed from scratch for commercially available NVM and the big data applications that use it. HeMem manages tiered memory asynchronously, batching and amortizing memory access tracking, migration, and associated TLB synchronization overheads. HeMem monitors application memory use by sampling memory access via CPU events, rather than page tables. This allows HeMem to scale to terabytes of memory, keeping small and ephemeral data structures in fast memory, and allocating scarce, asymmetric NVM bandwidth according to access patterns. Finally, HeMem is flexible by placing per-application memory management policy at user-level. On a system with Intel Optane DC NVM, HeMem outperforms hardware, OS, and PL-based tiered memory management, providing up to 50% runtime reduction for the GAP graph processing benchmark, 13% higher throughput for TPC-C on the Silo in-memory database, 16% lower tail-latency under performance isolation for a key-value store, and up to 10x less NVM wear than the next best solution, without application modification.
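A much-simplified sketch of sampling-driven tiered placement is shown below; the policy, batching constant, and names are illustrative assumptions rather than HeMem's implementation. Sampled access events are aggregated per page, hot pages are promoted to DRAM, and cold pages are demoted to NVM in batches to amortize migration cost.

from collections import Counter

def plan_migrations(access_samples, in_dram, dram_capacity, batch=64):
    # access_samples: iterable of page ids taken from sampled CPU memory-access events.
    heat = Counter(access_samples)
    hottest = [p for p, _ in heat.most_common(dram_capacity)]
    promote = [p for p in hottest if p not in in_dram][:batch]
    overflow = len(in_dram) + len(promote) - dram_capacity
    coldest = sorted(in_dram, key=lambda p: heat[p])
    demote = coldest[:max(0, overflow)]
    return promote, demote

# Example: pages 1 and 2 dominate the samples, so they are promoted to DRAM.
promote, demote = plan_migrations([1, 1, 2, 1, 2, 3, 9, 1], in_dram={3, 9}, dram_capacity=2)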

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors propose an incremental approach to few-shot instance segmentation, iMTFA, which learns discriminative embeddings for object instances that are merged into class representatives.
Abstract: Few-shot instance segmentation methods are promising when labeled training data for novel classes is scarce. However, current approaches do not facilitate flexible addition of novel classes. They also require that examples of each class are provided at train and test time, which is memory intensive. In this paper, we address these limitations by presenting the first incremental approach to few-shot instance segmentation: iMTFA. We learn discriminative embeddings for object instances that are merged into class representatives. Storing embedding vectors rather than images effectively solves the memory overhead problem. We match these class embeddings at the RoI-level using cosine similarity. This allows us to add new classes without the need for further training or access to previous training data. In a series of experiments, we consistently outperform the current state-of-the-art. Moreover, the reduced memory requirements allow us to evaluate, for the first time, few-shot instance segmentation performance on all classes in COCO jointly.
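The embedding-based classifier described above can be sketched as follows (interfaces are illustrative, not the iMTFA code): each class is represented by the mean of L2-normalized instance embeddings, novel classes are added by appending a representative without retraining, and RoI embeddings are matched by cosine similarity.

import torch
import torch.nn.functional as F

class CosineClassifier:
    def __init__(self):
        self.prototypes = {}                         # class name -> unit-norm (d,) vector

    def add_class(self, name, instance_embeddings):
        # instance_embeddings: (k, d) few-shot embeddings for the novel class.
        proto = F.normalize(instance_embeddings, dim=1).mean(dim=0)
        self.prototypes[name] = F.normalize(proto, dim=0)

    def classify(self, roi_embedding):
        roi = F.normalize(roi_embedding, dim=0)
        scores = {name: float(roi @ proto) for name, proto in self.prototypes.items()}
        return max(scores, key=scores.get), scores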

Proceedings ArticleDOI
01 Jun 2021
TL;DR: This paper proposes a key-values memory structure that records class-wise style variations and accesses them without requiring an object detector at test time, enabling the transfer of class-aware and accurate style representations across domains.
Abstract: We present a novel unsupervised framework for instance-level image-to-image translation. Although recent advances have been made by incorporating additional object annotations, existing methods often fail to handle images with multiple disparate objects. The main cause is that, during inference, they apply a global style to the whole image and do not consider the large style discrepancy between instance and background, or within instances. To address this problem, we propose a class-aware memory network that explicitly reasons about local style variations. A key-values memory structure, with a set of read/update operations, is introduced to record class-wise style variations and access them without requiring an object detector at the test time. The key stores a domain-agnostic content representation for allocating memory items, while the values encode domain-specific style representations. We also present a feature contrastive loss to boost the discriminative power of memory items. We show that by incorporating our memory, we can transfer class-aware and accurate style representations across domains. Experimental results demonstrate that our model outperforms recent instance-level methods and achieves state-of-the-art performance.

Proceedings ArticleDOI
13 Feb 2021
TL;DR: In this article, a 3-transistor (3T) analog memory cell is used to represent a 4b weight value as an analog voltage to reduce the size of the CIM array.
Abstract: Computing-In-Memory (CIM) techniques which incorporate analog computing inside memory macros have shown significant advantages in computing efficiency for deep learning applications. While earlier CIM macros were limited by lower bit precision, e.g. binary weights in [1], recent works have shown 4-to-8b precision for the weights/inputs and up to 20b for the output values [2], [3]. Sparsity and application features have also been exploited at the system level to further improve the computation efficiency [4], [5]. To enable higher precision, bit-wise operations were commonly utilized [3], [4]. However, there are limitations in existing solutions using the bit-wise operations with SRAM cells. Fig. 15.3.1 shows the summary of challenges and solutions in this work. First, all existing solutions utilize 6T/8T/10T SRAM as a CIM cell, which fundamentally limits the size of the CIM array. In this work, we replace the commonly used SRAM cell with a 3-transistor (3T) analog memory cell, referred to as dynamic-analog-RAM (DARAM), which represents a 4b weight value as an analog voltage. This leads to a ~10× reduction in transistor count and achieves an effective CIM single-bit area smaller than the foundry-supplied 6T SRAM cell. Secondly, as no bit-wise calculation is needed in this work, only single-phase MAC operations are performed, removing the throughput degradation associated with previous multi-phase approaches and digital accumulation in [3], [4]. Furthermore, analog linearity issues are mitigated by highly linear time-based activation, removal of matching requirements for critical multi-bit caps [4], [6], and a special read current compensation technique. Thirdly, to mitigate the power bottleneck of ADC or SA, this work applies analog sparsity-based low-power methods, which include a compute-adaptive ADC skipping operation when the analog MAC value is small (or “sparse”) and a special weight-shifting technique, leading to an additional ~2× reduction in CIM-macro power. We demonstrate the proposed techniques using a 65nm CIM-based CNN accelerator showing state-of-the-art energy efficiency.

Proceedings ArticleDOI
01 Feb 2021
TL;DR: Sentinel uses dynamic profiling and coordinates operating system (OS) and runtime-level profiling to bridge the semantic gap between the OS and applications, which enables tensor-level profiling and co-allocating tensors with similar lifetime and memory access frequency into the same pages.
Abstract: Memory capacity is a major bottleneck for training deep neural networks (DNN). Heterogeneous memory (HM) combining fast and slow memories provides a promising direction to increase memory capacity. However, HM imposes challenges on tensor migration and allocation for high performance DNN training. Prior work heavily relies on DNN domain knowledge, unnecessarily causes tensor migration due to page-level false sharing, and wastes fast memory space. We present Sentinel, a software runtime system that automatically optimizes tensor management on HM. Sentinel uses dynamic profiling, and coordinates operating system (OS) and runtime-level profiling to bridge the semantic gap between OS and applications, which enables tensor-level profiling. This profiling enables co-allocating tensors with similar lifetime and memory access frequency into the same pages. Such fine-grained profiling and tensor collocation avoids unnecessary data movement, improves tensor movement efficiency, and enables larger batch training because of saving in fast memory space. Sentinel reduces fast memory consumption by 80% while retaining comparable performance to fast memory-only system; Sentinel consistently outperforms a state-of-the-art solution on CPU by 37% and two state-of-the-art solutions on GPU by 2x and 21% respectively in training throughput.

Proceedings ArticleDOI
26 Oct 2021
TL;DR: MIND is an in-network memory management unit for rack-scale memory disaggregation that enables transparent resource elasticity while matching the performance of prior memory disaggregation proposals for real-world workloads.
Abstract: Memory disaggregation promises transparent elasticity, high resource utilization and hardware heterogeneity in data centers by physically separating memory and compute into network-attached resource "blades". However, existing designs achieve performance at the cost of resource elasticity, restricting memory sharing to a single compute blade to avoid costly memory coherence traffic over the network. In this work, we show that emerging programmable network switches can enable an efficient shared memory abstraction for disaggregated architectures by placing memory management logic in the network fabric. We find that centralizing memory management in the network permits bandwidth and latency-efficient realization of in-network cache coherence protocols, while programmable switch ASICs support other memory management logic at line-rate. We realize these insights into MIND, an in-network memory management unit for rack-scale disaggregation. MIND enables transparent resource elasticity while matching the performance of prior memory disaggregation proposals for real-world workloads.

Proceedings ArticleDOI
19 Jun 2021
TL;DR: This paper presents and evaluates strategies for binarizing graph neural networks to reduce model size, memory footprint, and energy consumption, including the first dynamic graph neural network in Hamming space, which uses k-NN search on binary vectors to speed up the construction of the dynamic graph.
Abstract: Graph Neural Networks (GNNs) have emerged as a powerful and flexible framework for representation learning on irregular data. As they generalize the operations of classical CNNs on grids to arbitrary topologies, GNNs also bring many of the implementation challenges of their Euclidean counterparts. Model size, memory footprint, and energy consumption are common concerns for many real-world applications. Network binarization allocates a single bit to parameters and activations, thus dramatically reducing the memory requirements (up to 32x compared to single-precision floating-point numbers) and maximizing the benefits of fast SIMD instructions on modern hardware for measurable speedups. However, in spite of the large body of work on binarization for classical CNNs, this area remains largely unexplored in geometric deep learning. In this paper, we present and evaluate different strategies for the binarization of graph neural networks. We show that through careful design of the models, and control of the training process, binary graph neural networks can be trained at only a moderate cost in accuracy on challenging benchmarks. In particular, we present the first dynamic graph neural network in Hamming space, able to leverage efficient k-NN search on binary vectors to speed up the construction of the dynamic graph. We further verify that the binary models offer significant savings on embedded devices. Our code is publicly available on GitHub.
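The Hamming-space k-NN search mentioned above reduces to XOR plus popcount on packed bit vectors. The sketch below is illustrative (dense distance matrix, sign-based binarization) rather than the paper's implementation.

import numpy as np

def pack_bits(features):
    # Binarize by sign and pack bits into uint8 words: (n, d) -> (n, d // 8).
    return np.packbits(features > 0, axis=1)

def hamming_knn(codes, k):
    # codes: (n, words) uint8. Returns the k nearest neighbours of every node.
    xor = codes[:, None, :] ^ codes[None, :, :]        # (n, n, words)
    dist = np.unpackbits(xor, axis=2).sum(axis=2)      # popcount -> (n, n) Hamming distances
    np.fill_diagonal(dist, dist.max() + 1)             # exclude self-matches
    return np.argsort(dist, axis=1)[:, :k]

codes = pack_bits(np.random.randn(64, 128))
neighbours = hamming_knn(codes, k=8)                   # (64, 8) neighbour index matrix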

Proceedings ArticleDOI
01 Feb 2021
TL;DR: Battery-Backed Buffer (BBB) closes the gap between the point of visibility (the cache) and the point of persistency (the memory) by providing a battery-backed persist buffer (bbPB) in each core next to the L1 data cache.
Abstract: Non-volatile memory (NVM) is poised to augment or replace DRAM as main memory. With the right abstraction and support, non-volatile main memory (NVMM) can provide an alternative to the storage system to host long-lasting persistent data. However, keeping persistent data in memory requires programs to be written such that data is crash consistent (i.e. it can be recovered after failure). Critical to supporting crash recovery is the guarantee of ordering of when stores become durable with respect to program order. Strict persistency, which requires persist order to coincide with program order of stores, is simple and intuitive but generally thought to be too slow. More relaxed persistency models are available but demand higher programming complexity, e.g. they require the programmer to insert persist barriers correctly in their program. We identify the source of strict persistency inefficiency as the gap between the point of visibility (PoV), which is the cache, and the point of persistency (PoP), which is the memory. In this paper, we propose a new approach to close the PoV/PoP gap which we refer to as Battery-Backed Buffer (BBB). The key idea of BBB is to provide a battery-backed persist buffer (bbPB) in each core next to the L1 data cache (L1D). A store value is allocated in the bbPB as it is written to cache, becoming part of the persistence domain. If a crash occurs, the battery ensures the bbPB can be fully drained to NVMM. BBB simplifies persistent programming as the programmer does not need to insert persist barriers or flushes. Furthermore, our BBB design achieves nearly identical results to eADR in terms of performance and number of NVMM writes, while requiring two orders of magnitude smaller energy and time to drain.

Proceedings ArticleDOI
TL;DR: MIND is an in-network memory management unit for rack-scale memory disaggregation that places memory management logic in programmable network switches, enabling transparent resource elasticity while matching the performance of prior memory disaggregation proposals for real-world workloads.
Abstract: Memory-compute disaggregation promises transparent elasticity, high utilization and balanced usage for resources in data centers by physically separating memory and compute into network-attached resource "blades". However, existing designs achieve performance at the cost of resource elasticity, restricting memory sharing to a single compute blade to avoid costly memory coherence traffic over the network. In this work, we show that emerging programmable network switches can enable an efficient shared memory abstraction for disaggregated architectures by placing memory management logic in the network fabric. We find that centralizing memory management in the network permits bandwidth and latency-efficient realization of in-network cache coherence protocols, while programmable switch ASICs support other memory management logic at line-rate. We realize these insights into MIND, an in-network memory management unit for rack-scale memory disaggregation. MIND enables transparent resource elasticity while matching the performance of prior memory disaggregation proposals for real-world workloads.