
Showing papers on "FLOPS published in 2022"


Proceedings ArticleDOI
01 Jun 2022
TL;DR: This paper proposes CMT, a transformer-based hybrid network that uses transformers to capture long-range dependencies and CNNs to extract local information.
Abstract: Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only the canonical transformers, but also the high-performance convolutional models. We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to extract local information. Furthermore, we scale it to obtain a family of models, called CMTs, obtaining much better trade-off for accuracy and efficiency than previous CNN-based and transformer-based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S also generalizes well on CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%), and other challenging vision datasets such as COCO (44.3% mAP), with considerably less computational cost.

70 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a convolutional block attention module (CBAM) to select the information critical to the vehicle detection task and suppress uncritical information, thus improving the detection accuracy of the algorithm.

55 citations


Journal ArticleDOI
TL;DR: Summit as mentioned in this paper was the most powerful supercomputer in the world, beating out the previous record holder, China's Sunway TaihuLight, by a comfortable margin, according to the well-known Top500 ranking of supercomputers.
Abstract: In 2018, a new supercomputer called Summit was installed at Oak Ridge National Laboratory, in Tennessee. Its theoretical peak capacity was nearly 200 petaflops, that is, 200 thousand trillion floating-point operations per second. At the time, it was the most powerful supercomputer in the world, beating out the previous record holder, China's Sunway TaihuLight, by a comfortable margin, according to the well-known Top500 ranking of supercomputers. (Summit is currently No. 2, a Japanese supercomputer called Fugaku having since overtaken it.)

25 citations
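The unit arithmetic in the Summit abstract above can be checked with a few lines of Python. This is only a sketch of the conversion; the function name `petaflops_to_flops` is an illustrative choice, and the 200 figure is the rounded peak quoted in the abstract.

```python
# Unit check for the figures quoted in the Summit abstract.
PETA = 10**15      # SI prefix "peta"
TRILLION = 10**12  # short-scale trillion


def petaflops_to_flops(petaflops: int) -> int:
    """Convert a rate in petaFLOPS to floating-point operations per second."""
    return petaflops * PETA


peak = petaflops_to_flops(200)
# "200 thousand trillion" operations per second is the same quantity.
assert peak == 200 * 1000 * TRILLION
print(f"{peak:.1e} floating-point operations per second")
```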


Journal ArticleDOI
TL;DR: A comparative analysis with state-of-the-art (SOTA) DCNN models shows that the novel “DIAT-RadSATNet” architecture is more efficient and more accurate.
Abstract: Due to their smaller size, low cost, and easy operation, small unmanned aerial vehicles (SUAVs) have become more popular for various defense as well as civil applications. They can also pose a threat to national security if intentionally operated by any hostile actor(s). Since all the SUAV targets have a high degree of resemblance in their micro-Doppler (m-D) space, their accurate detection/classification can be highly guaranteed by the appropriate deep convolutional neural network (DCNN) architecture. In this work, a lightweight novel DCNN model (named “DIAT-RadSATNet”) is designed for the accurate detection/classification of SUAV targets (RC plane, three-short-blade rotor, three-long-blade rotor, quadcopter, bionic bird, and mini-helicopter + bionic bird) based on their m-D signatures. A diversified, X-band (10 GHz) continuous-wave (CW) radar-based, open-field-collected m-D signatures dataset (named “DIAT-μSAT”) is used for the design/testing of “DIAT-RadSATNet.” A set of new design principles is proposed through multifactors: layers, #parameters, floating-point operations (FLOPs), number of blocks, filter dimension, memory size, number of parallel paths, and accuracy; optimization is applied via a series of in-depth ablation studies. The novel “DIAT-RadSATNet” module consists of 0.45 M trainable parameters, 40 layers, 2.21-Mb memory size, 0.59G FLOPs, and 0.21-s computation-time complexity. The detection/classification accuracy of “DIAT-RadSATNet,” based on the open-field unknown dataset experiments, falls between 97.1% and 97.3%. A comparative analysis with state-of-the-art (SOTA) DCNN models shows that our novel “DIAT-RadSATNet” architecture is more efficient and more accurate.

24 citations


Proceedings ArticleDOI
01 Jun 2022
TL;DR: This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution, with a focus on the proposed solutions and results. The aim was to design a network for single image SR that improved efficiency, measured by several metrics including runtime, parameters, FLOPs, activations, and memory consumption, while at least maintaining a PSNR of 29.00 dB on the DIV2K validation set.
Abstract: This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining the PSNR of 29.00dB on DIV2K validation set. IMDN is set as the baseline for efficiency measurement. The challenge had 3 tracks including the main track (runtime), sub-track one (model complexity), and sub-track two (overall performance). In the main track, the practical runtime performance of the submissions was evaluated. The ranks of the teams were determined directly by the absolute value of the average runtime on the validation set and test set. In sub-track one, the number of parameters and FLOPs were considered, and the individual rankings of the two metrics were summed up to determine a final ranking in this track. In sub-track two, all of the five metrics mentioned in the description of the challenge including runtime, parameter count, FLOPs, activations, and memory consumption were considered. Similar to sub-track one, the rankings of five metrics were summed up to determine a final ranking. The challenge had 303 registered participants, and 43 teams made valid submissions. They gauge the state-of-the-art in efficient single image super-resolution.

22 citations
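The rank-sum scoring described for the NTIRE 2022 sub-tracks above can be sketched as follows. This is a minimal illustration, not the organizers' scoring code: the tie handling, the function name `rank_sum_order`, and the example teams and values are all assumptions.

```python
from typing import Dict, List


def rank_sum_order(scores: Dict[str, Dict[str, float]], metrics: List[str]) -> List[str]:
    """Order teams by the sum of their per-metric ranks (lower values are better)."""
    totals = {team: 0 for team in scores}
    for metric in metrics:
        for team in scores:
            value = scores[team][metric]
            # Rank = 1 + number of teams with a strictly smaller value,
            # so tied teams share the same rank (an assumed tie rule).
            better = sum(1 for other in scores if scores[other][metric] < value)
            totals[team] += 1 + better
    # Sort by total rank; break remaining ties alphabetically for determinism.
    return sorted(scores, key=lambda team: (totals[team], team))


# Hypothetical submissions, not actual challenge results.
example = {
    "team_a": {"params": 1.0, "flops": 3.0},
    "team_b": {"params": 2.0, "flops": 1.0},
    "team_c": {"params": 3.0, "flops": 2.0},
}
print(rank_sum_order(example, ["params", "flops"]))
```

Summing ranks rather than raw values keeps metrics with very different units (seconds, parameter counts, bytes) from dominating one another.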


Journal ArticleDOI
TL;DR: In this paper, an improved YOLOv5 road damage detection algorithm was proposed, which reduced the number of parameters, the GFLOPs, and the size of the model.
Abstract: Road damage detection is an important task to ensure road safety and realize the timely repair of road damage. The previous manual detection methods are low in efficiency and high in cost. To solve this problem, an improved YOLOv5 road damage detection algorithm, MN-YOLOv5, was proposed. We optimized the YOLOv5s model and chose a new backbone feature extraction network MobileNetV3 to replace the basic network of YOLOv5, which greatly reduced the number of parameters and GFLOPs of the model, and reduced the size of the model. At the same time, the coordinate attention lightweight attention module is introduced to help the network locate the target more accurately and improve the target detection accuracy. The KMeans clustering algorithm is used to filter the prior frame to make it more suitable for the dataset and to improve the detection accuracy. To improve the generalization ability of the model, a label smoothing algorithm is introduced. In addition, the structure reparameterization method is used to accelerate model reasoning. The experimental results show that the improved YOLOv5 model proposed in this paper can effectively identify pavement cracks. Compared with the original model, the mAP increased by 2.5%, the F1 score increased by 2.6%, the model volume was 1.62 times smaller than that of YOLOv5, the number of parameters was reduced by a factor of 1.66, and the GFLOPs were reduced by a factor of 1.69. This method can provide a reference for the automatic detection method of pavement cracks.

19 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a self-supervised contrastive efficient asymmetric dilated network (SC-EADNet) for hyperspectral image (HSI) classification.
Abstract: Unsupervised and semisupervised feature learning has recently emerged as an effective way to reduce the reliance on expensive data collection and annotation for hyperspectral image (HSI) analysis. Existing unsupervised and semisupervised convolutional neural network (CNN)-based HSI classification works still face two challenges: underutilization of pixel-wise multiscale contextual information for feature learning and expensive computational cost, for example, a large number of floating-point operations (FLOPs), due to the lack of lightweight design. To utilize the unlabeled pixels in the HSIs more efficiently, we propose a self-supervised contrastive efficient asymmetric dilated network (SC-EADNet) for HSI classification. There are two novelties in the SC-EADNet. First, a self-supervised multiscale pixel-wise contextual feature learning model is proposed, which generates multiple patches around each hyperspectral pixel and develops a contrastive learning framework to learn from these patches for HSI classification. Second, a lightweight feature extraction network EADNet, composed of multiple plug-and-play efficient asymmetric dilated convolution (EADC) blocks, is designed and inserted into the contrastive learning framework. The EADC block adopts different dilation rates to capture the spatial information of objects with varying shapes and sizes. Compared with other unsupervised, semisupervised, and supervised learning methods, our SC-EADNet provides competitive classification performance on four hyperspectral datasets, including Indian Pines, Pavia University, Salinas, and Houston 2013, with fewer FLOPs and faster computation.

16 citations


Journal ArticleDOI
TL;DR: In this paper, a Taylor-based method was proposed to prune more filters with less performance degradation, inspired by the existing research on centripetal stochastic gradient descent (C-SGD), where the filters are removed only when the ones that need to be pruned have the same value.
Abstract: Filter pruning is a technique that reduces computational complexity, inference time, and memory footprint by removing unnecessary filters in convolutional neural networks (CNNs) with an acceptable drop in accuracy, consequently accelerating the network. Unlike traditional filter pruning methods utilizing zeroing-out filters, we propose two techniques to achieve the effect of pruning more filters with less performance degradation, inspired by the existing research on centripetal stochastic gradient descent (C-SGD), wherein the filters are removed only when the ones that need to be pruned have the same value. First, to minimize the negative effect of centripetal vectors that gradually make filters come closer to each other, we redesign the vectors by considering the effect of each vector on the loss-function using the Taylor-based method. Second, we propose an adaptive gradient learning (AGL) technique that updates weights while adaptively changing the gradients. Through AGL, performance degradation can be mitigated because some gradients maintain their original direction, and AGL also minimizes the accuracy loss by perfectly converging the filters, which require pruning, to a single point. Finally, we demonstrate the superiority of the proposed method on various datasets and networks. In particular, on the ILSVRC-2012 dataset, our method removed 52.09% FLOPs with a negligible 0.15% top-1 accuracy drop on ResNet-50. As a result, we achieve the most outstanding performance compared to those reported in previous studies in terms of the trade-off between accuracy and computational complexity.

15 citations
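To see why removing filters lowers the FLOPs figure quoted in the filter pruning abstract above, the sketch below counts multiply-accumulate operations (MACs) for a small stack of convolutions before and after pruning. It is a generic counting example, not the paper's method; the layer sizes are hypothetical, biases are ignored, and FLOP conventions vary (some works count two FLOPs per MAC).

```python
from typing import List


def conv_macs(h_out: int, w_out: int, kernel: int, c_in: int, c_out: int) -> int:
    """MACs of one standard convolution layer (bias ignored)."""
    return h_out * w_out * kernel * kernel * c_in * c_out


def stack_macs(channels: List[int], h: int, w: int, kernel: int = 3) -> int:
    """Total MACs of consecutive conv layers with the given channel widths.

    Assumes every layer keeps the same h x w output resolution.
    """
    return sum(
        conv_macs(h, w, kernel, c_in, c_out)
        for c_in, c_out in zip(channels, channels[1:])
    )


# Hypothetical two-layer stack: 64 -> 128 -> 128 channels on 56x56 feature maps.
dense = stack_macs([64, 128, 128], h=56, w=56)
# Pruning half of the first layer's filters shrinks its output channels
# and therefore also the input channels of the next layer.
pruned = stack_macs([64, 64, 128], h=56, w=56)
reduction = 1 - pruned / dense
print(f"MAC reduction: {reduction:.1%}")
```

Because a removed filter also removes one input channel of the following layer, the saving in a real network depends on which layers are pruned and by how much.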


Proceedings ArticleDOI
01 Jun 2022
TL;DR: Wang et al. as discussed by the authors proposed a patch slimming approach that discards useless patches in a top-down paradigm, identifying the effective patches in the last layer and then use them to guide the patch selection process of previous layers.
Abstract: This paper studies the efficiency problem for visual transformers by excavating redundant calculation in given networks. The recent transformer architecture has demonstrated its effectiveness for achieving excellent performance on a series of computer vision tasks. However, similar to that of convolutional neural networks, the huge computational cost of vision transformers is still a severe issue. Considering that the attention mechanism aggregates different patches layer-by-layer, we present a novel patch slimming approach that discards useless patches in a top-down paradigm. We first identify the effective patches in the last layer and then use them to guide the patch selection process of previous layers. For each layer, the impact of a patch on the final output feature is approximated and patches with less impact will be removed. Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers without affecting their performances. For example, over 45% FLOPs of the ViT-Ti model can be reduced with only 0.2% top-1 accuracy drop on the ImageNet dataset.

15 citations


Journal ArticleDOI
TL;DR: In this article, a matrix sketch based network pruning method is proposed to encode the second-order information of pretrained weights, which enables the representation capacity of pruned networks to be recovered with a simple fine-tuning procedure.
Abstract: We propose a novel network pruning approach by information preserving of pretrained network weights (filters). Network pruning with the information preserving is formulated as a matrix sketch problem, which is efficiently solved by the off-the-shelf frequent direction method. Our approach, referred to as FilterSketch, encodes the second-order information of pretrained weights, which enables the representation capacity of pruned networks to be recovered with a simple fine-tuning procedure. FilterSketch requires neither training from scratch nor data-driven iterative optimization, leading to a several-orders-of-magnitude reduction of time cost in the optimization of pruning. Experiments on CIFAR-10 show that FilterSketch reduces 63.3% of floating-point operations (FLOPs) and prunes 59.9% of network parameters with negligible accuracy cost for ResNet-110. On ILSVRC-2012, it reduces 45.5% of FLOPs and removes 43.0% of parameters with only 0.69% accuracy drop for ResNet-50. Our code and pruned models can be found at https://github.com/lmbxmu/FilterSketch .

14 citations


Journal ArticleDOI
TL;DR: Experimental results on a complex background of a self-built apple leaf disease dataset show that ConvViT achieves comparable identification results with the current performance of the state-of-the-art Swin-Tiny, which indicates that the proposed model is indeed an effective disease-identification model with practical application value.
Abstract: The complex backgrounds of crop disease images and the small contrast between the disease area and the background can easily cause confusion, which seriously affects the robustness and accuracy of apple disease-identification models. To solve the above problems, this paper proposes a Vision Transformer-based lightweight apple leaf disease-identification model, ConvViT, to extract effective features of crop disease spots to identify crop diseases. Our ConvViT includes convolutional structures and Transformer structures; the convolutional structure is used to extract the global features of the image, and the Transformer structure is used to obtain the local features of the disease region to help the CNN see better. The patch embedding method is improved to retain more edge information of the image and promote the information exchange between patches in the Transformer. The parameters and FLOPs (Floating Point Operations) of the model are significantly reduced by using depthwise separable convolution and linear-complexity multi-head attention operations. Experimental results on a complex background of a self-built apple leaf disease dataset show that ConvViT achieves comparable identification results (96.85%) with the current performance of the state-of-the-art Swin-Tiny. The parameters and FLOPs are only 32.7% and 21.7% of Swin-Tiny, and significantly ahead of MobilenetV3, Efficientnet-b0, and other models, which indicates that the proposed model is indeed an effective disease-identification model with practical application value.
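The ConvViT abstract above credits depthwise separable convolution for part of its parameter and FLOPs savings. The sketch below compares weight counts for a standard convolution and its depthwise separable counterpart using the usual textbook formulas; the layer sizes are hypothetical and are not taken from ConvViT, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 input and 64 output channels.
std = standard_conv_params(3, 64, 64)
sep = depthwise_separable_params(3, 64, 64)
print(std, sep, f"ratio = {sep / std:.3f}")
```

For a fixed output resolution the MAC count scales the same way, so the ratio is roughly 1/c_out + 1/kernel² in both parameters and compute.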


Journal ArticleDOI
TL;DR: This proposal makes pointwise convolutions parameter efficient by grouping filters into parallel branches or groups, where each branch processes a fraction of the input channels, and by interleaving the output of filters from different branches at intermediate layers of consecutive pointwise convolutions.
Abstract: In DCNNs, the number of parameters in pointwise convolutions rapidly grows due to the multiplication of the number of filters by the number of input channels that come from the previous layer. Our proposal makes pointwise convolutions parameter efficient via grouping filters into parallel branches or groups, where each branch processes a fraction of the input channels. However, by doing so, the learning capability of the DCNN is degraded. To avoid this effect, we suggest interleaving the output of filters from different branches at intermediate layers of consecutive pointwise convolutions. We applied our improvement to the EfficientNet, DenseNet-BC L100, MobileNet and MobileNet V3 Large architectures. We trained these architectures with the CIFAR-10, CIFAR-100, Cropped-PlantDoc and The Oxford-IIIT Pet datasets. When training from scratch, we obtained similar test accuracies to the original EfficientNet and MobileNet V3 Large architectures while saving up to 90% of the parameters and 63% of the flops.
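The grouping and interleaving ideas in the abstract above can be illustrated with a short sketch. The parameter formula is the standard one for grouped convolutions; the interleaving shown is a ShuffleNet-style channel shuffle used here as an assumption, since the abstract does not spell out the exact pattern. Function names and sizes are illustrative.

```python
from typing import List


def pointwise_params(c_in: int, c_out: int, groups: int = 1) -> int:
    """Weights of a 1 x 1 convolution split into `groups` branches (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output filter only sees c_in / groups input channels.
    return (c_in // groups) * c_out


def interleave_channels(channels: List[int], groups: int) -> List[int]:
    """Interleave branch outputs so the next grouped layer sees every branch."""
    if len(channels) % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # Take the i-th channel of each branch in turn.
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


print(pointwise_params(128, 128), pointwise_params(128, 128, groups=4))
print(interleave_channels(list(range(6)), groups=2))
```

Without the interleaving step, each branch of the second grouped layer would only ever receive channels from a single branch of the first, which is the loss of learning capability the abstract refers to.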

Journal ArticleDOI
TL;DR: NL-LinkNet as mentioned in this paper proposes an efficient nonlocal LinkNet with nonlocal blocks (NLBs) that can grasp relations between global features and results in more accurate road segmentation.
Abstract: Road extraction from very high resolution (VHR) satellite images is one of the most important topics in the field of remote sensing. In this letter, we propose an efficient nonlocal LinkNet with nonlocal blocks (NLBs) that can grasp relations between global features. This enables each spatial feature point to refer to all other contextual information and results in more accurate road segmentation. In detail, our single model without any postprocessing like conditional random field (CRF) refinement performed better than any other published state-of-the-art ensemble model in the official DeepGlobe Challenge. Moreover, our nonlocal LinkNet (NL-LinkNet) beat the D-LinkNet, the winner of the DeepGlobe challenge (Demir et al., 2018), with 43% fewer parameters, fewer giga floating-point operations (GFLOPs), and shorter training convergence time. We also present empirical analyses on the proper usages of NLBs for the baseline model.

Proceedings ArticleDOI
01 Jun 2022
TL;DR: RLFN as mentioned in this paper uses three convolutional layers for residual local feature learning to simplify feature aggregation, which achieves a good trade-off between model performance and inference time, and it won the first place in the NTIRE 2022 efficient super-resolution challenge.
Abstract: Deep learning based approaches have achieved great performance in single image super-resolution (SISR). However, recent advances in efficient super-resolution focus on reducing the number of parameters and FLOPs, and they aggregate more powerful features by improving feature utilization through complex layer connection strategies. These structures may not be necessary to achieve higher running speed, which makes them difficult to be deployed to resource-constrained devices. In this work, we propose a novel Residual Local Feature Network (RLFN). The main idea is using three convolutional layers for residual local feature learning to simplify feature aggregation, which achieves a good trade-off between model performance and inference time. Moreover, we revisit the popular contrastive loss and observe that the selection of intermediate features of its feature extractor has great influence on the performance. Besides, we propose a novel multi-stage warm-start training strategy. In each stage, the pre-trained weights from previous stages are utilized to improve the model performance. Combined with the improved contrastive loss and training strategy, the proposed RLFN outperforms all the state-of-the-art efficient image SR models in terms of runtime while maintaining both PSNR and SSIM for SR. In addition, we won the first place in the runtime track of the NTIRE 2022 efficient super-resolution challenge. Code will be available at https://github.com/fyan111/RLFN.

Journal ArticleDOI
TL;DR: Li et al. as discussed by the authors proposed a novel lightweight baseline model LKASR based on large kernel attention (LKA), which consists of three parts, shallow feature extraction, deep feature extraction and high-quality image reconstruction.
Abstract: Image super-resolution aims to recover a corresponding high-resolution image from a given low-resolution image. While most state-of-the-art methods only consider using fixed small-size convolution kernels (e.g., 1 × 1, 3 × 3) to extract image features, few works have explored large-size convolution kernels for image super-resolution (SR). In this paper, we propose a novel lightweight baseline model LKASR based on large kernel attention (LKA). LKASR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular, the deep feature extraction module consists of multiple cascaded visual attention modules (VAM), each of which consists of a 1 × 1 convolution, a large kernel attention (acts as Transformer) and a feature refinement module (FRM, acts as CNN). Specifically, VAM applies lightweight architecture like swin transformer to realize iterative extraction of global and local features of images, which greatly improves the effectiveness of SR method (0.049s in Urban100 dataset). For different scales (×2, ×3, ×4), extensive experimental results on benchmark demonstrate that LKASR outperforms most lightweight SR methods by 0.17 to 0.34 dB, while the total of parameters and FLOPs remains lightweight.

Journal ArticleDOI
TL;DR: Fast-FDM as discussed by the authors uses an optimized form of anchor-free Foveabox to accurately and efficiently detect green apples in harvesting environments, achieving a mean average precision (mAP) of 62.3% for green apple detection using fewer parameters and floating-point operations (FLOPs), and a better trade-off between accuracy and detection efficiency.

Journal ArticleDOI
TL;DR: A Prior Gradient Mask Guided Pruning-aware Fine-Tuning (PGMPF) framework to accelerate deep Convolutional Neural Networks (CNNs) by selectively suppressing the gradient of “unimportant” parameters via a prior gradient mask generated by the pruning criterion during fine-tuning.
Abstract: We proposed a Prior Gradient Mask Guided Pruning-aware Fine-Tuning (PGMPF) framework to accelerate deep Convolutional Neural Networks (CNNs). In detail, the proposed PGMPF selectively suppresses the gradient of those “unimportant” parameters via a prior gradient mask generated by the pruning criterion during fine-tuning. PGMPF has three charming characteristics over previous works: (1) Pruning-aware network fine-tuning. A typical pruning pipeline consists of training, pruning and fine-tuning, which are relatively independent, while PGMPF utilizes a variant of the pruning mask as a prior gradient mask to guide fine-tuning, without complicated pruning criteria. (2) An excellent tradeoff between large model capacity during fine-tuning and stable convergence speed to obtain the final compact model. Previous works preserve more training information of pruned parameters during fine-tuning to pursue better performance, which would incur catastrophic non-convergence of the pruned model for relatively large pruning rates, while our PGMPF greatly stabilizes the fine-tuning phase by gradually constraining the learning rate of those “unimportant” parameters. (3) Channel-wise random dropout of the prior gradient mask to impose some gradient noise to fine-tuning to further improve the robustness of final compact model. Experimental results on three image classification benchmarks CIFAR10/100 and ILSVRC-2012 demonstrate the effectiveness of our method for various CNN architectures, datasets and pruning rates. Notably, on ILSVRC-2012, PGMPF reduces 53.5% FLOPs on ResNet-50 with only 0.90% top-1 accuracy drop and 0.52% top-5 accuracy drop, which has advanced the state-of-the-art with negligible extra computational cost.
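The core idea in the abstract above, scaling down the gradient of “unimportant” parameters with a mask derived from a pruning criterion, can be sketched in plain Python. This is not the PGMPF implementation: the magnitude criterion, the per-weight (rather than channel-wise) mask, the fixed suppression factor, and the names `magnitude_mask` and `masked_sgd_step` are all assumptions made for illustration.

```python
from typing import List


def magnitude_mask(weights: List[float], num_suppressed: int, factor: float) -> List[float]:
    """Gradient mask: `factor` for the smallest-magnitude weights, 1.0 elsewhere.

    Magnitude is an assumed stand-in for the pruning criterion.
    """
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    suppressed = set(order[:num_suppressed])
    return [factor if i in suppressed else 1.0 for i in range(len(weights))]


def masked_sgd_step(
    weights: List[float], grads: List[float], mask: List[float], lr: float
) -> List[float]:
    """One SGD step in which each gradient is scaled by its mask entry."""
    return [w - lr * m * g for w, g, m in zip(weights, grads, mask)]


# Hypothetical weights and gradients.
weights = [1.0, -0.1, 0.5]
grads = [0.2, 0.2, 0.2]
mask = magnitude_mask(weights, num_suppressed=1, factor=0.0)
print(mask)
print(masked_sgd_step(weights, grads, mask, lr=0.5))
```

A factor between 0 and 1 that shrinks over training would correspond more closely to the gradual constraint the abstract describes; a hard 0.0 is used here only to make the effect visible.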

Journal ArticleDOI
24 Apr 2022-Sensors
TL;DR: A hybrid loss function with label smoothing to improve the distinguishing power of lightweight convolutional neural networks (CNNs) for cervical cell classification and strengthens confidence in hybrid loss-constrained lightweight CNNs, which can achieve satisfactory accuracy with much lower computational cost for the SIPakMeD dataset.
Abstract: Artificial intelligence (AI) technologies have resulted in remarkable achievements and conferred massive benefits to computer-aided systems in medical imaging. However, the worldwide usage of AI-based automation-assisted cervical cancer screening systems is hindered by computational cost and resource limitations. Thus, a highly economical and efficient model with enhanced classification ability is much more desirable. This paper proposes a hybrid loss function with label smoothing to improve the distinguishing power of lightweight convolutional neural networks (CNNs) for cervical cell classification. The results strengthen our confidence in hybrid loss-constrained lightweight CNNs, which can achieve satisfactory accuracy with much lower computational cost for the SIPakMeD dataset. In particular, ShufflenetV2 obtained a comparable classification result (96.18% in accuracy, 96.30% in precision, 96.23% in recall, and 99.08% in specificity) with only one-seventh of the memory usage, one-sixth of the number of parameters, and one-fiftieth of total flops compared with Densenet-121 (96.79% in accuracy). GhostNet achieved an improved classification result (96.39% accuracy, 96.42% precision, 96.39% recall, and 99.09% specificity) with one-half of the memory usage, one-quarter of the number of parameters, and one-fiftieth of total flops compared with Densenet-121 (96.79% in accuracy). The proposed lightweight CNNs are likely to lead to an easily-applicable and cost-efficient automation-assisted system for cervical cancer diagnosis and prevention.
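Label smoothing, which the abstract above combines with a hybrid loss, replaces the one-hot target with a slightly softened distribution. The sketch below shows the common uniform-smoothing formulation and the resulting cross-entropy; it does not reproduce the paper's hybrid loss, and the class count, epsilon, and predicted probabilities are hypothetical.

```python
import math
from typing import List


def smooth_labels(true_class: int, num_classes: int, epsilon: float) -> List[float]:
    """Uniform label smoothing: (1 - epsilon) on the true class plus epsilon / K everywhere."""
    base = epsilon / num_classes
    return [
        base + (1.0 - epsilon if k == true_class else 0.0)
        for k in range(num_classes)
    ]


def cross_entropy(target: List[float], probs: List[float]) -> float:
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


# Hypothetical 4-class example with the true class at index 1.
target = smooth_labels(true_class=1, num_classes=4, epsilon=0.1)
predicted = [0.05, 0.85, 0.05, 0.05]
print(target)
print(cross_entropy(target, predicted))
```

Because every class keeps a small target mass, the loss penalizes overconfident predictions, which is the usual motivation for applying smoothing to small classifiers.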

Journal ArticleDOI
TL;DR: In this paper, a generalization of Orlov's projectivization formula for the derived category Dcohb(P(E)), where E does not need to be a vector bundle, is presented.

Journal ArticleDOI
TL;DR: In this article, a comprehensive comparison of several successful deep learning-based face detectors is conducted to uncover their efficiency using two metrics: FLOPs and latency. The results can guide the choice of appropriate face detectors for different applications and the development of more efficient and accurate detectors.
Abstract: Face detection is to search all the possible regions for faces in images and locate the faces if there are any. Many applications including face recognition, facial expression recognition, face tracking and head-pose estimation assume that both the location and the size of faces are known in the image. In recent decades, researchers have created many typical and efficient face detectors from the Viola-Jones face detector to current CNN-based ones. However, with the tremendous increase in images and videos with variations in face scale, appearance, expression, occlusion and pose, traditional face detectors are challenged to detect various “in the wild” faces. The emergence of deep learning techniques brought remarkable breakthroughs to face detection along with the price of a considerable increase in computation. This paper introduces representative deep learning-based methods and presents a deep and thorough analysis in terms of accuracy and efficiency. We further compare and discuss the popular and challenging datasets and their evaluation metrics. A comprehensive comparison of several successful deep learning-based face detectors is conducted to uncover their efficiency using two metrics: FLOPs and latency. The paper can guide to choose appropriate face detectors for different applications and also to develop more efficient and accurate detectors.

Journal ArticleDOI
TL;DR: A large fraction of the entries in the LU factors and flops to perform the BLR LU factorization can be safely switched to lower precisions, leading to significant reductions of the storage and expected time costs, of up to a factor three using fp64, fp32, and bfloat16 arithmetics.
Abstract: We introduce a novel approach to exploit mixed precision arithmetic for low-rank approximations. Our approach is based on the observation that singular vectors associated with small singular values can be stored in lower precisions while preserving high accuracy overall. We provide an explicit criterion to determine which level of precision is needed for each singular vector. We apply this approach to block low-rank (BLR) matrices, most of whose off-diagonal blocks have low rank. We propose a new BLR LU factorization algorithm that exploits the mixed precision representation of the blocks. We carry out the rounding error analysis of this algorithm and prove that the use of mixed precision arithmetic does not compromise the numerical stability of the BLR LU factorization. Moreover, our analysis determines which level of precision is needed for each floating-point operation (flop), and therefore guides us toward an implementation that is both robust and efficient. We evaluate the potential of this new algorithm on a range of matrices coming from real-life problems in industrial and academic applications. We show that a large fraction of the entries in the LU factors and flops to perform the BLR LU factorization can be safely switched to lower precisions, leading to significant reductions of the storage and expected time costs, of up to a factor three using fp64, fp32, and bfloat16 arithmetics.

Journal ArticleDOI
Brais Martinez
TL;DR: EdgeViTs as mentioned in this paper proposes a cost-effective local-global-local (LGL) information exchange bottleneck based on the optimal integration of self-attention and convolutions.
Abstract: Self-attention based models such as vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision. Despite increasingly stronger variants with ever higher recognition accuracies, due to the quadratic complexity of self-attention, existing ViTs are typically demanding in computation and model size. Although several successful design choices (e.g., the convolutions and hierarchical multi-stage structure) of prior CNNs have been reintroduced into recent ViTs, they are still not sufficient to meet the limited resource requirements of mobile devices. This motivates a very recent attempt to develop light ViTs based on the state-of-the-art MobileNet-v2, but still leaves a performance gap behind. In this work, pushing further along this under-studied direction we introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention based vision models to compete with the best light-weight CNNs in the tradeoff between accuracy and on-device efficiency. This is realized by introducing a highly cost-effective local-global-local (LGL) information exchange bottleneck based on optimal integration of self-attention and convolutions. For device-dedicated evaluation, rather than relying on inaccurate proxies like the number of FLOPs or parameters, we adopt a practical approach of focusing directly on on-device latency and, for the first time, energy efficiency. Extensive experiments on image classification, object detection and semantic segmentation validate high efficiency of our EdgeViTs when compared to the state-of-the-art efficient CNNs and ViTs in terms of accuracy-efficiency tradeoff on mobile hardware. Specifically, we show that our models are Pareto-optimal when both accuracy-latency and accuracy-energy tradeoffs are considered, achieving strict dominance over other ViTs in almost all cases and competing with the most efficient CNNs. 
Code is available at https://github.com/saic-fi/edgevit.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a sensitivity-based channel pruning method that uses second-order sensitivity as the criterion, pruning insensitive filters while retaining sensitive ones. A filter's sensitivity is quantified as the sum of the sensitivities of all its weights, rather than by the magnitude-based metric frequently applied in the literature.
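The filter-level criterion described here can be illustrated with a small sketch. It is not the authors' implementation: the per-weight sensitivity used below, 0.5 * h_ii * w_i^2 with a diagonal Hessian estimate, is an assumed stand-in for whatever second-order measure the paper defines, and all function names are hypothetical.

```python
from typing import List, Sequence


def filter_sensitivity(weights: Sequence[float], hessian_diag: Sequence[float]) -> float:
    """Sum the per-weight sensitivities of one filter.

    ASSUMPTION: each weight's sensitivity is 0.5 * h_ii * w_i**2, a common
    diagonal second-order approximation, not necessarily the paper's formula.
    """
    return sum(0.5 * h * w * w for w, h in zip(weights, hessian_diag))


def select_filters_to_prune(
    filters: Sequence[Sequence[float]],
    hessians: Sequence[Sequence[float]],
    prune_ratio: float,
) -> List[int]:
    """Return the indices of the least sensitive filters, in ascending index order."""
    scores = [filter_sensitivity(w, h) for w, h in zip(filters, hessians)]
    n_prune = int(len(scores) * prune_ratio)
    # Rank filters from least to most sensitive and keep the first n_prune.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_prune])


if __name__ == "__main__":
    toy_filters = [[1.0, 1.0], [0.1, 0.1], [2.0, 0.0], [0.5, 0.5]]
    toy_hessians = [[1.0, 1.0]] * 4  # hypothetical curvature estimates
    print(select_filters_to_prune(toy_filters, toy_hessians, prune_ratio=0.5))
```

With uniform curvature the ranking reduces to squared weight magnitude; the two criteria diverge only when the Hessian estimates differ across weights.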

Journal ArticleDOI
TL;DR: This work designs EF-Train, an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel that can achieve end-to-end training on resource-limited low-power edge-level FPGAs, and develops a data reshaping approach with intra-tile continuous memory allocation and weight reuse.
Abstract: Conventionally, DNN models are trained once in the cloud and deployed in edge devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time inference. However, there are many cases that require the models to adapt to new environments, domains, or users. In order to realize such domain adaption or personalization, the models on devices need to be continuously trained on the device. In this work, we design EF-Train, an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel that can achieve end-to-end training on resource-limited low-power edge-level FPGAs. It is challenging to implement on-device training on resource-limited FPGAs due to the low efficiency caused by different memory access patterns among forward and backward propagation and weight update. Therefore, we developed a data reshaping approach with intra-tile continuous memory allocation and weight reuse. An analytical model is established to automatically schedule computation and memory resources to achieve high energy efficiency on edge FPGAs. The experimental results show that our design achieves 46.99 GFLOPS and 6.09 GFLOPS/W in terms of throughput and energy efficiency, respectively.
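The two reported figures jointly imply a power draw, since energy efficiency is throughput divided by power. The short calculation below is a back-of-the-envelope sketch, assuming both numbers were measured on the same workload; the abstract does not state the power itself, and the function name is illustrative.

```python
def implied_power_watts(throughput_gflops: float, efficiency_gflops_per_watt: float) -> float:
    """Derive power from throughput and energy efficiency: P = throughput / efficiency."""
    if efficiency_gflops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_gflops / efficiency_gflops_per_watt


if __name__ == "__main__":
    # Figures quoted in the abstract: 46.99 GFLOPS and 6.09 GFLOPS/W.
    power = implied_power_watts(46.99, 6.09)
    print(f"Implied power draw: {power:.2f} W")
```

The result is roughly 7.7 W, a derived estimate rather than a number reported by the authors.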

Journal ArticleDOI
TL;DR: This study presents a method for optimizing the FCM algorithm for high-speed field-programmable gate array (FPGA) technology using a high-level C-like programming language called Open Computing Language (OpenCL).
Abstract: Fuzzy C-Means (FCM) is a widely used clustering algorithm that performs well in various scientific applications. Implementing FCM involves a massive number of computations, and many parallelization techniques based on GPUs and multicore systems have been suggested. In this study, we present a method for optimizing the FCM algorithm for high-speed field-programmable gate array (FPGA) technology using a high-level C-like programming language called Open Computing Language (OpenCL). The method was designed to enable the high-level compiler/synthesis tool to manipulate a task-parallelism model and create an efficient design. Our experimental results (based on several datasets) show that the proposed method makes the FCM execution time more than 186 times faster than the conventional design running on a single-core CPU platform. Its processing power also reached 89 giga floating-point operations per second (GFLOPS).

Journal ArticleDOI
TL;DR: In this article, a resource adaptive convolutional neural network (RACNN) is proposed to reduce the hardware requirements of running CNNs and improve the speed of environmental sound classification (ESC).
Abstract: Recently, with the construction of smart cities, research on environmental sound classification (ESC) has attracted the attention of academia and industry. The development of convolutional neural networks (CNNs) has raised the accuracy of ESC to a higher level, but the accuracy improvement brought by CNNs is often accompanied by deeper networks, which leads to rapid growth in parameters and floating-point operations (FLOPs). It is therefore difficult to port CNN models to embedded devices, and the classification speed is often unacceptable. To reduce the hardware requirements of running a CNN and improve the speed of ESC, this paper proposes a resource adaptive convolutional neural network (RACNN). RACNN uses a novel resource adaptive convolutional (RAC) module, which can generate the same number of feature maps as conventional convolution operations more cheaply and extract the time and frequency features of audio efficiently. The RAC block, based on the RAC module, is designed to build the lightweight RACNN model, and the RAC module can also be used to upgrade existing CNN models. Experiments on public datasets show that RACNN achieves higher performance than state-of-the-art methods with lower computational complexity.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a novel architecture named Serial and Parallel Group Network (SPGNet), which can capture discriminative multi-scale information and at the same time keep the structure compact.
Abstract: Neural-network Processing Units (NPUs), which specialize in accelerating deep neural networks (DNNs), are of great significance to latency-sensitive areas like robotics and edge computing. However, few recent studies focus on network design for NPUs. Most popular lightweight structures (e.g. MobileNet) are designed with depthwise convolution, which requires less computation in theory but is not friendly to existing hardware, and the speed measured on NPUs is not always satisfactory. Even under similar FLOPs (the number of multiply-accumulates), the vanilla convolution operation is always faster than the depthwise one. In this paper, we propose a novel architecture named Serial and Parallel Group Network (SPGNet), which can capture discriminative multi-scale information while keeping the structure compact. Extensive evaluations have been conducted on different computer vision tasks, e.g. image classification (CIFAR and ImageNet), object detection (PASCAL VOC and MS COCO) and person re-identification (Market-1501 and DukeMTMC-ReID). The experimental results show that the proposed SPGNet achieves performance comparable to state-of-the-art networks while being 120% faster than MobileNetV2 under similar FLOPs and over 300% faster than GhostNet with similar accuracy on NPU.
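The claim that depthwise convolution has "less computation in theory" is easier to judge with the operation counts written out. The sketch below is not from the paper: it uses the standard multiply-accumulate (MAC) counts for a vanilla convolution and for a depthwise separable one, assuming stride 1 and padding that preserves the spatial size. The function names and the layer shape are hypothetical.

```python
def vanilla_conv_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs of a standard k x k convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs of a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 feature map, 128 -> 128 channels, 3 x 3 kernel.
    shape = dict(h=56, w=56, c_in=128, c_out=128, k=3)
    dense = vanilla_conv_macs(**shape)
    separable = depthwise_separable_macs(**shape)
    print(f"vanilla: {dense:,} MACs")
    print(f"depthwise separable: {separable:,} MACs")
    print(f"theoretical reduction: {dense / separable:.2f}x")
```

These counts say nothing about memory access or hardware utilization, which is why a lower MAC count need not translate into lower latency on a given accelerator.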

Journal ArticleDOI
TL;DR: EPruner is an automatic and efficient pruning approach that introduces adaptive exemplar filters to simplify the algorithm design. It uses a message-passing algorithm to obtain an adaptive number of exemplars, which then act as the preserved filters.
Abstract: Popular network pruning algorithms reduce redundant information by optimizing hand-crafted models, and may cause suboptimal performance and long time in selecting filters. We innovatively introduce adaptive exemplar filters to simplify the algorithm design, resulting in an automatic and efficient pruning approach called EPruner. Inspired by the face recognition community, we use a message-passing algorithm Affinity Propagation on the weight matrices to obtain an adaptive number of exemplars, which then act as the preserved filters. EPruner breaks the dependence on the training data in determining the "important" filters and allows the CPU implementation in seconds, an order of magnitude faster than GPU-based SOTAs. Moreover, we show that the weights of exemplars provide a better initialization for the fine-tuning. On VGGNet-16, EPruner achieves a 76.34%-FLOPs reduction by removing 88.80% parameters, with 0.06% accuracy improvement on CIFAR-10. In ResNet-152, EPruner achieves a 65.12%-FLOPs reduction by removing 64.18% parameters, with only 0.71% top-5 accuracy loss on ILSVRC-2012. Our code is available at https://github.com/lmbxmu/EPruner.

Journal ArticleDOI
TL;DR: In this article, a region detail preserving network (RDP-Net) is proposed for change detection in earth observation; it achieves state-of-the-art empirical performance with only 1.70M parameters.
Abstract: Change detection (CD) is an essential earth observation technique. It captures the dynamic information of land objects. With the rise of deep learning, convolutional neural networks (CNN) have shown great potential in CD. However, current CNN models introduce backbone architectures that lose detailed information during learning. Moreover, current CNN models are heavy in parameters, which prevents their deployment on edge devices such as UAVs. In this work, we tackle this issue by proposing RDP-Net: a region detail preserving network for CD. We propose an efficient training strategy that constructs the training tasks during the warmup period of CNN training and lets the CNN learn from easy to hard. The training strategy enables CNN to learn more powerful features with fewer FLOPs and achieve better performance. Next, we propose an effective edge loss that increases the penalty for errors on details and improves the network’s attention to details such as boundary regions and small areas. Furthermore, we provide a CNN model with a brand new backbone that achieves the state-of-the-art empirical performance in CD with only 1.70M parameters. We hope our RDP-Net would benefit the practical CD applications on compact devices and could inspire more people to bring change detection to a new level with the efficient training strategy. The code and models are publicly available at https://github.com/Chnja/RDPNet.