
Showing papers by "Peter A. Beerel published in 2022"


Journal ArticleDOI
TL;DR: This work investigates prior definitional efforts in this space, and shows that these notions either incorrectly model the desired security goals or fail to capture a natural “compositional” property that would be desirable in a logic locking system.
Abstract: Logic locking aims to protect the intellectual property of a circuit from a fabricator by modifying the original logic of the circuit into a new “locked” circuit such that an entity without the key should not be able to learn anything about the original circuit. While logic locking provides a promising solution to outsourcing the fabrication of chips, unfortunately, several of the proposed logic locking systems have been broken. The lack of established secure techniques stems in part from the absence of a rigorous treatment toward a notion of security for logic locking, and the disconnection between practice and formalisms. We seek to address this gap by introducing formal definitions to capture the desired security of logic locking schemes. In doing so, we investigate prior definitional efforts in this space, and show that these notions either incorrectly model the desired security goals or fail to capture a natural “compositional” property that would be desirable in a logic locking system. Finally we move to constructions. First, we show that universal circuits satisfy our security notions. Second, we show that, in order to do better than universal circuits, cryptographic assumptions are necessary.

11 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose a Processing-in-Pixel-in-Memory (P2M) paradigm that customizes the pixel array by adding support for analog multi-channel, multi-bit convolution, batch normalization, and Rectified Linear Units (ReLU).
Abstract: The demand to process vast amounts of data generated from state-of-the-art high resolution cameras has motivated novel energy-efficient on-device AI solutions. Visual data in such cameras are usually captured in analog voltages by a sensor pixel array, and then converted to the digital domain for subsequent AI processing using analog-to-digital converters (ADC). Recent research has tried to take advantage of massively parallel low-power analog/digital computing in the form of near- and in-sensor processing, in which the AI computation is performed partly in the periphery of the pixel array and partly in a separate on-board CPU/accelerator. Unfortunately, high-resolution input images still need to be streamed between the camera and the AI processing unit, frame by frame, causing energy, bandwidth, and security bottlenecks. To mitigate this problem, we propose a novel Processing-in-Pixel-in-Memory (P2M) paradigm that customizes the pixel array by adding support for analog multi-channel, multi-bit convolution, batch normalization, and Rectified Linear Units (ReLU). Our solution includes a holistic algorithm-circuit co-design approach and the resulting P2M paradigm can be used as a drop-in replacement for embedding memory-intensive first few layers of convolutional neural network (CNN) models within foundry-manufacturable CMOS image sensor platforms. Our experimental results indicate that P2M reduces data transfer bandwidth from sensors and analog to digital conversions by ∼21×, and the energy-delay product (EDP) incurred in processing a MobileNetV2 model on a TinyML use case for the visual wake words dataset (VWW) by up to ∼11× compared to standard near-processing or in-sensor implementations, without any significant drop in test accuracy.
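
A minimal numpy sketch (not the paper's circuit) of the kind of computation P2M moves into the pixel array: a multi-channel, multi-bit convolution on one patch, followed by folded batch normalization and ReLU, with the result re-quantized as a coarse ADC would. Patch size, channel counts, and bit-widths are illustrative assumptions.

```python
# Illustrative sketch (not the paper's circuit): emulate an in-pixel
# multi-bit convolution + folded batch-norm + ReLU on one image patch.
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits):
    """Uniformly quantize to 2**bits levels over the observed range (stand-in for a coarse ADC)."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo + 1e-9) * levels) / levels * (hi - lo) + lo

# Analog pixel voltages for a 3x3 RGB patch and 4-bit conv weights (assumed sizes).
patch = rng.uniform(0.0, 1.0, size=(3, 3, 3))            # H x W x C_in
weights = quantize(rng.normal(size=(8, 3, 3, 3)), 4)     # C_out x H x W x C_in

# Batch-norm parameters folded into a per-channel scale and offset.
gamma, beta = rng.uniform(0.5, 1.5, 8), rng.uniform(-0.1, 0.1, 8)

pre_act = np.einsum('ohwc,hwc->o', weights, patch)        # in-pixel MACs
post_bn = gamma * pre_act + beta                          # folded batch norm
activated = np.maximum(post_bn, 0.0)                      # ReLU
digital_out = quantize(activated, bits=6)                 # what actually leaves the sensor

print("values transmitted off-sensor:", digital_out)
```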

10 citations


Journal ArticleDOI
TL;DR: This work proposes Spiking Neural Networks (SNNs) that are generated from iso-architecture CNNs and trained with quantization-aware gradient descent to optimize their weights, membrane leak, and firing thresholds, and chooses hyperspectral imaging (HSI) as an application for 3D image recognition.
Abstract: High-quality 3D image recognition is an important component of many vision and robotics systems. However, the accurate processing of these images requires the use of compute-expensive 3D Convolutional Neural Networks (CNNs). To address this challenge, we propose the use of Spiking Neural Networks (SNNs) that are generated from iso-architecture CNNs and trained with quantization-aware gradient descent to optimize their weights, membrane leak, and firing thresholds. During both training and inference, the analog pixel values of a 3D image are directly applied to the input layer of the SNN without the need to convert to a spike-train. This significantly reduces the training and inference latency and results in a high degree of activation sparsity, which yields significant improvements in computational efficiency. However, this introduces energy-hungry digital multiplications in the first layer of our models, which we propose to mitigate using a processing-in-memory (PIM) architecture. To evaluate our proposal, we propose a 3D and a 3D/2D hybrid SNN-compatible convolutional architecture and choose hyperspectral imaging (HSI) as an application for 3D image recognition. We achieve overall test accuracies of 98.68%, 99.50%, and 97.95% with 5 time steps (inference latency) and 6-bit weight quantization on the Indian Pines, Pavia University, and Salinas Scene datasets, respectively. In particular, our models implemented using standard digital hardware achieved accuracies similar to state-of-the-art (SOTA) with ~560.6× and ~44.8× less average energy than an iso-architecture full-precision and 6-bit quantized CNN, respectively. Adopting the PIM architecture in the first layer further improves the average energy, delay, and energy-delay-product (EDP) by 30%, 7%, and 38%, respectively.
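
A small numpy sketch of the integrate-and-fire dynamics the abstract refers to: direct analog input to the layer, a membrane leak, and a firing threshold, run for a handful of time steps (the abstract reports 5-step inference). The parameter values below are placeholders, not the trained ones from the paper.

```python
# Minimal leaky-integrate-and-fire (LIF) sketch with direct (analog) input encoding.
import numpy as np

rng = np.random.default_rng(1)

T = 5                      # time steps (the abstract reports 5-step inference)
leak = 0.9                 # membrane leak factor (trainable in the paper; placeholder here)
threshold = 1.0            # firing threshold (also trainable; placeholder)

analog_input = rng.uniform(0.0, 1.0, size=16)    # pixel values fed directly, no spike encoding
weights = rng.normal(scale=0.5, size=(16, 8))    # one dense layer for illustration

membrane = np.zeros(8)
spike_counts = np.zeros(8)

for t in range(T):
    membrane = leak * membrane + analog_input @ weights    # integrate
    spikes = (membrane >= threshold).astype(float)         # fire
    membrane -= spikes * threshold                         # soft reset
    spike_counts += spikes

print("spikes per output neuron over", T, "steps:", spike_counts)
```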

9 citations


Journal ArticleDOI
TL;DR: A dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are both robust against adversarially generated images and maintain high accuracy on clean images is presented.
Abstract: We present a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are both robust against adversarially generated images and maintain high accuracy on clean images. In particular, the proposed DNR training method is based on a unified constrained optimization formulation using a novel hybrid loss function that merges sparse learning with robust adversarial training. This training strategy dynamically adjusts inter-layer connectivity based on per-layer normalized momentum computed from the hybrid loss function. To further improve the robustness of the pruned models, we propose DNR++, an extension of the DNR method where we introduce the idea of a sparse parametric Gaussian noise tensor that is added to the weight tensors to yield robust regularization. In contrast to existing robust pruning frameworks that require multiple training iterations, the proposed DNR and DNR++ achieve an overall target pruning ratio with only a single training iteration and can be tuned to support both irregular and structured channel pruning. To demonstrate the efficacy of the proposed method under the no-increased-training-time “free” adversarial training scenario, we finally present FDNR++, a simple yet effective training modification that can yield robust yet compressed models requiring training time comparable to that of unpruned non-adversarial training. To evaluate the merits of our proposed training methods, experiments were performed with two widely accepted models, namely VGG16 and ResNet18, on CIFAR-10 and CIFAR-100 as well as with VGG16 on Tiny-ImageNet. Compared to the baseline uncompressed models, our methods provide over 20× compression on all the datasets without any significant drop in either clean or adversarial classification performance. Moreover, extensive experiments show that our methods consistently find compressed models with better clean and adversarial image classification performance than what is achievable through state-of-the-art alternatives. We provide insightful observations to help make various model, parameter density, and prune-type selection choices and have open-sourced our saved models and test codes to ensure reproducibility of our results.
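
A hedged torch sketch of the two ingredients the abstract names: a hybrid loss that mixes clean and adversarial terms, and per-layer connectivity re-derived from normalized optimizer momentum. The pruning rule here is a simplified stand-in, not the exact DNR update, and all sizes are illustrative.

```python
# Simplified sketch of DNR-style training: hybrid clean/adversarial loss and
# per-layer masks regrown from normalized momentum. Not the paper's exact rule.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
density = 0.2                                  # target fraction of weights kept per layer
masks = [torch.ones_like(p) for p in model.parameters() if p.dim() > 1]

def fgsm(x, y, eps=0.03):
    """One-step adversarial example (FGSM), used here as a generic stand-in attack."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).detach()

for step in range(20):
    x = torch.randn(16, 32)
    y = torch.randint(0, 10, (16,))
    x_adv = fgsm(x, y)
    # Hybrid loss: equal mix of clean and adversarial terms (weights are illustrative).
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    opt.zero_grad(); loss.backward(); opt.step()

    # Re-derive each layer's mask from its normalized optimizer momentum.
    weight_params = [p for p in model.parameters() if p.dim() > 1]
    for i, p in enumerate(weight_params):
        buf = opt.state[p].get('momentum_buffer')
        if buf is None:
            continue
        score = buf.abs() / (buf.abs().sum() + 1e-12)      # per-layer normalized momentum
        k = max(1, int(density * score.numel()))
        thresh = score.flatten().kthvalue(score.numel() - k + 1).values
        masks[i] = (score >= thresh).float()
        p.data.mul_(masks[i])                              # apply the (re)wired connectivity

print("kept fraction per layer:", [round(m.mean().item(), 3) for m in masks])
```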

7 citations


Journal ArticleDOI
TL;DR: A form of processing-in-pixel (PIP) that leverages advanced CMOS technologies to enable the pixel array to perform a wide range of complex operations required by the modern convolutional neural networks (CNN) for hyperspectral image (HSI) recognition.
Abstract: Hyperspectral cameras generate a large amount of data due to the presence of hundreds of spectral bands as opposed to only three channels (red, green, and blue) in traditional cameras. This requires a significant amount of data transmission between the hyperspectral image sensor and a processor used to classify/detect/track the images, frame by frame, expending high energy and causing bandwidth and security bottlenecks. To mitigate this problem, we propose a form of processing-in-pixel (PIP) that leverages advanced CMOS technologies to enable the pixel array to perform a wide range of complex operations required by the modern convolutional neural networks (CNN) for hyperspectral image (HSI) recognition. Consequently, our PIP-optimized custom CNN layers effectively compress the input data, significantly reducing the bandwidth required to transmit the data downstream to the HSI processing unit. This reduces the average energy consumption associated with the pixel array of cameras and the CNN processing unit by 25.06× and 3.90×, respectively, compared to existing hardware implementations. Our custom models yield average test accuracies within 0.56% of the baseline models for the standard HSI benchmarks.

5 citations


Proceedings ArticleDOI
28 May 2022
TL;DR: P2M-DeTrack is an algorithm-hardware co-design framework based on a custom Faster R-CNN-based model that is distributed partly inside the pixel array (front-end) and partly in a separate FPGA/ASIC (back-end), which reduces the data bandwidth between sensor and back-end by up to 24×.
Abstract: Today’s high resolution, high frame rate cameras in autonomous vehicles generate a large volume of data that needs to be transferred and processed by a downstream processor or machine learning (ML) accelerator to enable intelligent computing tasks, such as multi-object detection and tracking. The massive amount of data transfer incurs significant energy, latency, and bandwidth bottlenecks, which hinders real-time processing. To mitigate this problem, we propose an algorithm-hardware co-design framework called Processing-in-Pixel-in-Memory-based object Detection and Tracking (P2M-DeTrack). P2M-DeTrack is based on a custom Faster R-CNN-based model that is distributed partly inside the pixel array (front-end) and partly in a separate FPGA/ASIC (back-end). The proposed front-end in-pixel processing down-samples the input feature maps significantly with judiciously optimized strided convolution and pooling. Compared to a conventional baseline design that transfers frames of RGB pixels to the back-end, the resulting P2M-DeTrack designs reduce the data bandwidth between sensor and back-end by up to 24×. The designs also reduce the sensor and total energy (obtained from in-house circuit simulations at the GlobalFoundries 22nm technology node) per frame by 5.7× and 1.14×, respectively. Lastly, they reduce the sensing and total frame latency by an estimated 1.7× and 3×, respectively. We evaluate our approach on the multi-object detection (tracking) task of the large-scale BDD100K dataset and observe only a 0.5% reduction in the mean average precision (0.8% reduction in the identification F1 score) compared to the state-of-the-art.
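
A back-of-the-envelope sketch of where bandwidth reduction of this kind can come from: strided convolution and pooling in the pixel array shrink the tensor that must be streamed to the back-end. The strides, channel counts, and bit-widths below are illustrative assumptions, not the paper's design, so the printed ratio will not match the reported 24×.

```python
# Illustrative bandwidth arithmetic for in-pixel downsampling (assumed numbers).
H, W, C_in, bits_in = 720, 1280, 3, 8             # raw RGB frame streamed in the baseline
stride_conv, pool, C_out, bits_out = 2, 2, 8, 8   # assumed front-end conv stride, pooling, channels

baseline_bits = H * W * C_in * bits_in
H_out = H // (stride_conv * pool)
W_out = W // (stride_conv * pool)
frontend_bits = H_out * W_out * C_out * bits_out

print(f"baseline:  {baseline_bits / 8 / 1024:.0f} KiB per frame")
print(f"front-end: {frontend_bits / 8 / 1024:.0f} KiB per frame")
print(f"bandwidth reduction: {baseline_bits / frontend_bits:.1f}x")
```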

3 citations


Journal ArticleDOI
TL;DR: An optimized spiking long short-term memory (LSTM) training framework that involves a novel ANN-to-SNN conversion framework, followed by SNN training, and a pipelined parallel processing scheme that hides the SNN time steps, significantly improving system latency, especially for long sequences.
Abstract: Spiking Neural Networks (SNNs) have emerged as an attractive spatio-temporal computing paradigm for complex vision tasks. However, most existing works yield models that require many time steps and do not leverage the inherent temporal dynamics of spiking neural networks, even for sequential tasks. Motivated by this observation, we propose an optimized spiking long short-term memory (LSTM) training framework that involves a novel ANN-to-SNN conversion framework, followed by SNN training. In particular, we propose novel activation functions in the source LSTM architecture and judiciously select a subset of them for conversion to integrate-and-fire (IF) activations with optimal bias shifts. Additionally, we derive the leaky-integrate-and-fire (LIF) activation functions converted from their non-spiking LSTM counterparts, which justifies the need to jointly optimize the weights, threshold, and leak parameter. We also propose a pipelined parallel processing scheme which hides the SNN time steps, significantly improving system latency, especially for long sequences. The resulting SNNs have high activation sparsity and require only accumulate operations (AC), in contrast to expensive multiply-and-accumulates (MAC) needed for ANNs, except for the input layer when using direct encoding, yielding significant improvements in energy efficiency. We evaluate our framework on sequential learning tasks including temporal MNIST, Google Speech Commands (GSC), and UCI Smartphone datasets on different LSTM architectures. We obtain test accuracy of 94.75% with only 2 time steps with direct encoding on the GSC dataset with 4.1x lower energy than an iso-architecture standard LSTM.
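
A numpy sketch of the basic ANN-to-SNN conversion idea referenced here: an integrate-and-fire neuron run for T time steps approximates a clipped ReLU of its input, and a small bias shift can reduce the systematic error of that approximation. The time-step count and the specific shift below are illustrative, not the paper's derivation.

```python
# How an integrate-and-fire (IF) neuron approximates a ReLU over T time steps.
import numpy as np

def if_rate(x, T=8, threshold=1.0, bias_shift=0.0):
    """Average spike rate of an IF neuron driven by constant input x (plus an optional bias shift)."""
    membrane, spikes = 0.0, 0
    for _ in range(T):
        membrane += x + bias_shift
        if membrane >= threshold:
            spikes += 1
            membrane -= threshold          # soft reset keeps the residual charge
    return spikes * threshold / T

xs = np.linspace(-0.5, 1.5, 9)
for x in xs:
    print(f"input {x:+.2f}  relu {max(x, 0):.2f}  IF rate {if_rate(x):.2f}  "
          f"IF rate w/ shift {if_rate(x, bias_shift=0.5 / 8):.2f}")
```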

3 citations


Journal ArticleDOI
TL;DR: In this paper, the threshold of each SNN layer is estimated as the Hoyer extremum of a clipped version of its activation map, where the clipping threshold is trained using gradient descent with Hoyer regularizer.
Abstract: Spiking Neural Networks (SNNs) have emerged as an attractive spatio-temporal computing paradigm for a wide range of low-power vision tasks. However, state-of-the-art (SOTA) SNN models either incur multiple time steps, which hinder their deployment in real-time use cases, or increase the training complexity significantly. To mitigate this concern, we present a training framework (from scratch) for one-time-step SNNs that uses a novel variant of the recently proposed Hoyer regularizer. We estimate the threshold of each SNN layer as the Hoyer extremum of a clipped version of its activation map, where the clipping threshold is trained using gradient descent with our Hoyer regularizer. This approach not only downscales the value of the trainable threshold, thereby emitting a large number of spikes for weight update with a limited number of iterations (due to only one time step), but also shifts the membrane potential values away from the threshold, thereby mitigating the effect of noise that can degrade the SNN accuracy. Our approach outperforms existing spiking, binary, and adder neural networks in terms of the accuracy-FLOPs trade-off for complex image recognition tasks. Downstream experiments on object detection also demonstrate the efficacy of our approach.
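
For context, one common form of the Hoyer regularizer is the squared ratio of a tensor's L1 norm to its L2 norm. The torch sketch below computes that quantity on a clipped activation map, with the clip value standing in for the trainable threshold; it is an illustration of the quantities involved under that assumption, not the paper's full training loop or its exact regularizer variant.

```python
# Hoyer regularizer (squared L1/L2 ratio) of a clipped activation map -- illustrative only.
import torch

def hoyer_square(x, eps=1e-12):
    """(||x||_1)^2 / (||x||_2)^2 -- small when x is sparse, larger when energy is spread out."""
    return x.abs().sum() ** 2 / (x.pow(2).sum() + eps)

torch.manual_seed(0)
act = torch.relu(torch.randn(4, 8, 16, 16))           # pre-threshold activation map
clip = torch.tensor(1.0, requires_grad=True)          # trainable clipping threshold (placeholder)

clipped = torch.minimum(act, clip)                    # clipped version of the activation map
loss = hoyer_square(clipped)                          # Hoyer term added to the task loss in training
loss.backward()                                       # gradient w.r.t. the clip threshold

print("Hoyer term:", loss.item(), " d(loss)/d(clip):", clip.grad.item())
```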

2 citations


Proceedings ArticleDOI
21 Dec 2022
TL;DR: In this paper, the authors propose an in-sensor computing hardware-software co-design framework for SNNs targeting image recognition tasks, which reduces the bandwidth between sensing and processing by 12-96x and the resulting total energy by 2.32x compared to traditional CV processing.
Abstract: Due to the high activation sparsity and use of accumulates (AC) instead of expensive multiply-and-accumulates (MAC), neuromorphic spiking neural networks (SNNs) have emerged as a promising low-power alternative to traditional DNNs for several computer vision (CV) applications. However, most existing SNNs require multiple time steps for acceptable inference accuracy, hindering real-time deployment and increasing spiking activity and, consequently, energy consumption. Recent works proposed direct encoding that directly feeds the analog pixel values in the first layer of the SNN in order to significantly reduce the number of time steps. Although the overhead for the first layer MACs with direct encoding is negligible for deep SNNs and the CV processing is efficient using SNNs, the data transfer between the image sensors and the downstream processing costs significant bandwidth and may dominate the total energy. To mitigate this concern, we propose an in-sensor computing hardware-software co-design framework for SNNs targeting image recognition tasks. Our approach reduces the bandwidth between sensing and processing by 12-96x and the resulting total energy by 2.32x compared to traditional CV processing, with a 3.8% reduction in accuracy on ImageNet.

1 citation


Proceedings ArticleDOI
01 Jul 2022
TL;DR: A novel integer linear programming (ILP) algorithm to minimize the number of required registers given the number of available clock phases and a corresponding number of processing threads; a key benefit of the proposed approach is that it requires no fast clock.
Abstract: This paper presents a novel multi-phase clocking methodology targeting multi-threaded gate-level pipelined sequential circuits. Gate-level-pipelined circuits, such as those present in superconductive digital electronics, require many path balancing registers to enable proper multi-threaded computation. The paper introduces a novel integer linear programming (ILP) algorithm to minimize the number of required registers given the number of available clock phases and a corresponding number of processing threads. We evaluated our approach using eight SFQ benchmark circuits through path balancing, clock tree synthesis (CTS), and place-and-route (PnR). Compared with fully-balanced approaches, which require a very large number of threads to achieve peak throughput, the proposed method reduces the number of path-balancing registers by 55.5% with two clock phases and up to 95.5% with ten clock phases. The CTS and PnR results show that the decrease in registers yields a decrease in total gate area by 40.6% and clock tree wire length by 54.9% with two clock phases, and by 69.6% and 69.8% with ten clock phases, respectively, despite the increase in the number of clock phases. We compare our approach to the SOTA SFQ clocking solutions that rely on fully path-balanced circuits or dual slow/fast clocks. In addition to having lower overhead, a key benefit of the proposed approach is that it requires no fast clock. In particular, the clock frequency of the proposed multi-phase clocks is the same as the throughput of the circuit, avoiding the need to synthesize and route a high-speed clock.
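
A toy PuLP formulation, under simplifying assumptions, of the baseline path-balancing problem this paper improves on: assign each gate an integer level and pay one register per unit of slack on every fanin edge. The paper's multi-phase method generalizes this by letting some slack be absorbed by clock-phase assignment; that extension is not modeled here, and the netlist is hypothetical.

```python
# Toy ILP for full path balancing of a gate-level pipeline (baseline illustration,
# not the paper's multi-phase formulation). Requires: pip install pulp
import pulp

# A small DAG: gate -> list of fanin gates (hypothetical netlist).
fanins = {"a": [], "b": [], "c": ["a", "b"], "d": ["c", "a"], "e": ["d", "b"]}

prob = pulp.LpProblem("path_balancing", pulp.LpMinimize)
level = {g: pulp.LpVariable(f"lvl_{g}", lowBound=0, cat="Integer") for g in fanins}
slack = {}

for g, ins in fanins.items():
    for u in ins:
        # Every fanin must arrive exactly one stage earlier, plus any inserted registers.
        s = pulp.LpVariable(f"regs_{u}_{g}", lowBound=0, cat="Integer")
        slack[(u, g)] = s
        prob += level[g] == level[u] + 1 + s

prob += pulp.lpSum(slack.values())      # minimize total path-balancing registers
prob.solve(pulp.PULP_CBC_CMD(msg=False))

print("registers inserted:", {e: int(s.value()) for e, s in slack.items()})
print("total:", int(pulp.value(prob.objective)))
```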

1 citation


Journal ArticleDOI
TL;DR: This paper presents a fast learnable once-for-all adversarial training (FLOAT) algorithm, which instead of the existing FiLM-based conditioning, presents a unique weight conditioned learning that requires no additional layer, thereby incurring no significant increase in parameter count, training time, or network latency compared to standard adversarial training.
Abstract: Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers. These layers require many new parameters and are hyperparameter sensitive. They significantly increase training time, memory cost, and potential latency, which can prove costly for resource-limited or real-time applications. In this paper, we present a fast learnable once-for-all adversarial training (FLOAT) algorithm, which, instead of the existing FiLM-based conditioning, presents a unique weight-conditioned learning approach that requires no additional layer, thereby incurring no significant increase in parameter count, training time, or network latency compared to standard adversarial training. In particular, we add configurable scaled noise to the weight tensors, which enables a trade-off between clean and adversarial performance. Extensive experiments show that FLOAT can yield SOTA performance, improving both clean and perturbed image classification by up to ∼6% and ∼10%, respectively. Moreover, real hardware measurements show that FLOAT can reduce the training time by up to 1.43× with up to 1.47× fewer model parameters in iso-hyperparameter settings compared to the FiLM-based alternatives. Additionally, to further improve memory efficiency, we introduce FLOAT sparse (FLOATS), a form of non-iterative model pruning, and provide a detailed empirical analysis of the three-way accuracy-robustness-complexity trade-off for this new class of pruned, conditionally trained models.
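
A minimal torch sketch of the weight-conditioning idea described here: a scaled noise term added to the weight tensor, with the conditioning knob trading off clean versus adversarial behavior. The layer shape, noise model, and parameter names are assumptions for illustration, not FLOAT's exact formulation.

```python
# Sketch of FLOAT-style weight conditioning: add configurable scaled noise to weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseConditionedLinear(nn.Module):
    """Linear layer whose effective weight is W + alpha * scale * noise (illustrative)."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_f))
        self.noise_scale = nn.Parameter(torch.tensor(0.05))   # would be learned during training

    def forward(self, x, alpha=0.0):
        # alpha = 0.0 -> clean-oriented path, alpha = 1.0 -> robustness-oriented path.
        noise = torch.randn_like(self.weight)
        w = self.weight + alpha * self.noise_scale * noise
        return F.linear(x, w, self.bias)

torch.manual_seed(0)
layer = NoiseConditionedLinear(32, 10)
x = torch.randn(4, 32)
print("clean-mode output:", layer(x, alpha=0.0)[0, :3])
print("robust-mode output:", layer(x, alpha=1.0)[0, :3])
```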

Proceedings ArticleDOI
16 Jan 2022
TL;DR: TriLock is proposed, a sequential logic locking method that can achieve high, tunable functional corruptibility while still guaranteeing exponential queries to the SAT solver in a SAT-based attack and adopts a state re-encoding method to obscure the boundary between the original state registers and those inserted by the locking method, thus making it more difficult to detect and remove the locking-related components.
Abstract: Sequential logic locking has been studied over the last decade as a method to protect sequential circuits from reverse engineering. However, most of the existing sequential logic locking techniques are threatened by increasingly more sophisticated SAT-based attacks, efficiently using input queries to a SAT solver to rule out incorrect keys, as well as removal attacks based on structural analysis. In this paper, we propose TriLock, a sequential logic locking method that simultaneously addresses these vulnerabilities. TriLock can achieve high, tunable functional corruptibility while still guaranteeing exponential queries to the SAT solver in a SAT-based attack. Further, it adopts a state re-encoding method to obscure the boundary between the original state registers and those inserted by the locking method, thus making it more difficult to detect and remove the locking-related components.

Proceedings ArticleDOI
01 Aug 2022
TL;DR: PipeEdge, as mentioned in this paper, is a distributed framework for edge systems that uses pipeline parallelism to both speed up inference and enable running larger, more accurate models that otherwise cannot fit on single edge devices.
Abstract: Deep neural networks with large model sizes achieve state-of-the-art results for tasks in computer vision and natural language processing. However, such models are too compute- or memory-intensive for resource-constrained edge devices. Prior works on parallel and distributed execution primarily focus on training, rather than inference, using homogeneous accelerators in data centers. We propose PipeEdge, a distributed framework for edge systems that uses pipeline parallelism to both speed up inference and enable running larger, more accurate models that otherwise cannot fit on single edge devices. PipeEdge uses an optimal partition strategy that considers heterogeneity in compute, memory, and network bandwidth. Our empirical evaluation demonstrates that PipeEdge achieves 11.88× and 12.78× speedup using 16 edge devices for the ViT-Huge and BERT-Large models, respectively, with no accuracy loss. Similarly, PipeEdge improves throughput for ViT-Huge (which cannot fit in a single device) by 3.93× over a 4-device baseline using 16 edge devices. Finally, we show up to 4.16× throughput improvement over the state-of-the-art PipeDream when using a heterogeneous set of devices.
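
A hedged sketch of one way a heterogeneity-aware pipeline partitioner can be posed (not PipeEdge's actual algorithm): enumerate contiguous layer splits across devices in a fixed order and keep the assignment that minimizes the slowest stage, accounting for per-device compute speed, a memory limit, and outbound link bandwidth. All profiles below are made-up numbers.

```python
# Toy heterogeneity-aware pipeline partitioner (illustrative, not PipeEdge's algorithm).
# Devices are used in a fixed order; we choose contiguous layer ranges per device
# to minimize the bottleneck stage time.
import itertools

layer_flops = [4, 6, 2, 8, 3, 5, 7]              # per-layer work (made-up units)
layer_out_mb = [2, 2, 1, 4, 1, 2, 1]             # activation size sent to the next stage
devices = [                                      # (speed, memory limit, link MB/s) -- assumed
    {"speed": 2.0, "mem": 12, "bw": 5.0},
    {"speed": 1.0, "mem": 10, "bw": 2.0},
    {"speed": 3.0, "mem": 20, "bw": 4.0},
]

def stage_time(lo, hi, dev):
    work = sum(layer_flops[lo:hi])
    if work > dev["mem"]:                        # crude memory check (work as a proxy for footprint)
        return float("inf")
    comm = layer_out_mb[hi - 1] / dev["bw"] if hi < len(layer_flops) else 0.0
    return work / dev["speed"] + comm

best = None
n, d = len(layer_flops), len(devices)
for cuts in itertools.combinations(range(1, n), d - 1):   # split layers into d contiguous stages
    bounds = (0,) + cuts + (n,)
    bottleneck = max(stage_time(bounds[i], bounds[i + 1], devices[i]) for i in range(d))
    if best is None or bottleneck < best[0]:
        best = (bottleneck, bounds)

print("best split boundaries:", best[1], "bottleneck stage time:", round(best[0], 2))
```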

Proceedings ArticleDOI
27 Dec 2022
TL;DR: SMART, as discussed by the authors, is a sparse mixture once-for-all adversarial training method that allows a model to train once and then trade off in situ between accuracy and robustness, at a reduced compute and parameter overhead.
Abstract: Existing deep neural networks (DNNs) that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on either activation or weight conditioned convolution operations. However, such conditional learning costs additional multiply-accumulate (MAC) or addition operations, increasing inference memory and compute costs. To that end, we present sparse mixture once-for-all adversarial training (SMART), which allows a model to train once and then trade off in situ between accuracy and robustness, at a reduced compute and parameter overhead. In particular, SMART develops two expert paths, for clean and adversarial images, respectively, that are then conditionally trained via respective dedicated sets of binary sparsity masks. Extensive evaluations on multiple image classification datasets across different models show SMART to have up to 2.72x fewer non-zero parameters with a proportional reduction in compute overhead, while yielding a SOTA accuracy-robustness trade-off. Additionally, we present insightful observations in designing sparse masks to successfully condition on both clean and perturbed images.
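
A minimal torch sketch of conditioning one shared weight tensor with two binary sparsity masks, one selected for clean inputs and one for adversarial inputs, which is the flavor of "expert path" the abstract describes. Mask construction here is plain magnitude top-k as a stand-in; in SMART the masks themselves are trained.

```python
# Sketch of mask-conditioned inference: one shared weight tensor, two binary
# sparsity masks acting as "expert paths" (illustrative stand-in for SMART).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
weight = nn.Parameter(torch.randn(64, 32) * 0.1)

def topk_mask(scores, density):
    """Binary mask keeping the `density` fraction of largest-scoring entries."""
    k = max(1, int(density * scores.numel()))
    thresh = scores.flatten().topk(k).values.min()
    return (scores >= thresh).float()

# Two different sparse masks derived from (perturbed) weight magnitudes.
mask_clean = topk_mask(weight.abs(), density=0.3)
mask_adv = topk_mask((weight.abs() + 0.05 * torch.randn_like(weight)).abs(), density=0.3)

def forward(x, robust_mode=False):
    m = mask_adv if robust_mode else mask_clean
    return F.linear(x, weight * m)               # only the selected expert path is active

x = torch.randn(4, 32)
print("nonzero weights used (clean path):", int(mask_clean.sum()), "of", weight.numel())
print("clean-path output:", forward(x)[0, :3])
print("robust-path output:", forward(x, robust_mode=True)[0, :3])
```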

Book ChapterDOI
01 Jan 2023
TL;DR: In this article, a form of processing-in-pixel (PIP) is proposed to reduce the data transmission between the hyperspectral image sensor and a processor used to classify/detect/track the images, frame by frame.
Abstract: Hyperspectral cameras generate a large amount of data due to the presence of hundreds of spectral bands as opposed to only three channels (red, green, and blue) in traditional cameras. This requires a significant amount of data transmission between the hyperspectral image sensor and a processor used to classify/detect/track the images, frame by frame, expending high energy and causing bandwidth and security bottlenecks. To mitigate this problem, we propose a form of processing-in-pixel (PIP) that leverages advanced CMOS technologies to enable the pixel array to perform a wide range of complex operations required by the modern convolutional neural networks (CNN) for hyperspectral image (HSI) recognition. Consequently, our PIP-optimized custom CNN layers effectively compress the input data, significantly reducing the bandwidth required to transmit the data downstream to the HSI processing unit. This reduces the average energy consumption associated with the pixel array of cameras and the CNN processing unit by 25.06× and 3.90×, respectively, compared to existing hardware implementations. Our experimental results yield a reduction of data rates after the sensor ADCs by up to ∼10×, significantly reducing the complexity of downstream processing. Our custom models yield average test accuracies within 0.56% of the baseline models for the standard HSI benchmarks.

Proceedings ArticleDOI
08 Nov 2022
TL;DR: In this paper, a communication-efficient distributed edge system is proposed that introduces post-training quantization (PTQ) to compress the communicated tensors, maintaining transformer pipeline performance while incurring limited inference accuracy loss.
Abstract: Pipeline parallelism has achieved great success in deploying large-scale transformer models in cloud environments, but has received less attention in edge environments. Unlike in cloud scenarios with high-speed and stable network interconnects, dynamic bandwidth in edge systems can degrade distributed pipeline performance. We address this issue with QuantPipe, a communication-efficient distributed edge system that introduces post-training quantization (PTQ) to compress the communicated tensors. QuantPipe uses adaptive PTQ to change bitwidths in response to bandwidth dynamics, maintaining transformer pipeline performance while incurring limited inference accuracy loss. We further improve the accuracy with a directed-search analytical clipping for integer quantization method (DS-ACIQ), which bridges the gap between estimated and real data distributions. Experimental results show that QuantPipe adapts to dynamic bandwidth to maintain pipeline performance while achieving a practical model accuracy using a wide range of quantization bitwidths, e.g., improving accuracy under 2-bit quantization by 15.85% on ImageNet compared to naive quantization.
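
A numpy sketch of the general ACIQ-style idea referred to here: search for a clipping value that minimizes the quantization error of a tensor at a given bitwidth, then quantize the communicated activations with that clip. The directed search in DS-ACIQ itself is not reproduced; this is a plain grid search over an assumed activation distribution.

```python
# Clipping-value search for integer quantization of communicated tensors
# (generic ACIQ-style grid search; DS-ACIQ's directed search is not reproduced).
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal(10_000) * 1.3        # stand-in for a communicated tensor

def quantize(x, clip, bits):
    """Symmetric uniform quantization of x clipped to [-clip, clip]."""
    x_c = np.clip(x, -clip, clip)
    step = 2 * clip / (2 ** bits - 1)
    return np.round(x_c / step) * step

def best_clip(x, bits, candidates=None):
    """Pick the clip value minimizing quantization MSE over a simple grid."""
    if candidates is None:
        candidates = np.linspace(0.1, np.abs(x).max(), 100)
    errors = [np.mean((x - quantize(x, c, bits)) ** 2) for c in candidates]
    return candidates[int(np.argmin(errors))]

for bits in (2, 4, 8):
    c = best_clip(activations, bits)
    mse = np.mean((activations - quantize(activations, c, bits)) ** 2)
    print(f"{bits}-bit: best clip {c:.2f}, quantization MSE {mse:.4f}")
```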

Proceedings ArticleDOI
06 Jun 2022
TL;DR: In this paper, the authors present a novel Radiation Hardened by Design (RHBD) mutual exclusion element (mutex) that incorporates multiple RHBD techniques with reduced area overhead.
Abstract: Circuits in advanced CMOS technology are increasingly sensitive to transient pulses caused by radiation particles that strike vulnerable circuit components, especially turned-off transistors, often generating multiple voltage upsets. Towards mitigating these issues, this paper presents a novel Radiation Hardened by Design (RHBD) mutual exclusion element (mutex) that incorporates multiple RHBD techniques with reduced area overhead. We compared our proposed circuit to the baseline and the state-of-the-art designs, in terms of resiliency to Single Event Transients (SET) and Single Event Upsets (SEU), request-to-grant latency, and area overhead. Results show that the proposed circuit mitigates SET and prevents SEU events, incurring 1.42x performance and 5.1x transistor area overhead compared to the baseline (unhardened) design. On the other hand, the proposed mutex circuit improves SEU resiliency at outputs, achieving 0.58x transistor area and 0.62x latency compared to the state-of-the-art RHBD mutex that uses modular redundancy.

Proceedings ArticleDOI
11 Oct 2022
TL;DR: This work proposes to invert the ISP pipeline, which can convert the RGB images of any dataset to their raw counterparts and enable model training on raw images, and releases the raw version of the COCO dataset, a large-scale benchmark for generic high-level vision tasks.
Abstract: Current computer vision (CV) systems use an image signal processing (ISP) unit to convert the high resolution raw images captured by image sensors to visually pleasing RGB images. Typically, CV models are trained on these RGB images and have yielded state-of-the-art (SOTA) performance on a wide range of complex vision tasks, such as object detection. In addition, in order to deploy these models on resource-constrained low-power devices, recent works have proposed in-sensor and in-pixel computing approaches that try to partly/fully bypass the ISP and yield significant bandwidth reduction between the image sensor and the CV processing unit by downsampling the activation maps in the initial convolutional neural network (CNN) layers. However, direct inference on the raw images degrades the test accuracy due to the difference in covariance of the raw images captured by the image sensors compared to the ISP-processed images used for training. Moreover, it is difficult to train deep CV models on raw images, because most (if not all) large-scale open-source datasets consist of RGB images. To mitigate this concern, we propose to invert the ISP pipeline, which can convert the RGB images of any dataset to their raw counterparts, and enable model training on raw images. We release the raw version of the COCO dataset, a large-scale benchmark for generic high-level vision tasks. For ISP-less CV systems, training on these raw images results in a ∼7.1% increase in test accuracy on the visual wake words (VWW) dataset compared to relying on training with traditional ISP-processed RGB datasets. To further improve the accuracy of ISP-less CV models and to increase the energy and bandwidth benefits obtained by in-sensor/in-pixel computing, we propose an energy-efficient form of analog in-pixel demosaicing that may be coupled with in-pixel CNN computations. When evaluated on raw images captured by real sensors from the PASCALRAW dataset, our approach results in an 8.1% increase in mAP. Lastly, we demonstrate a further 20.5% increase in mAP by using a novel application of few-shot learning with thirty shots each for the novel PASCALRAW dataset, which consists of 3 classes. Codes are available at https://github.com/godatta/ISP-less-CV.
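
A simplified numpy sketch of what "inverting the ISP" can look like when generating raw-like training data from an RGB image: undo gamma, undo a global white-balance gain, and re-mosaic to an RGGB Bayer pattern. The gains, gamma, and pattern here are assumptions for illustration; the released pipeline in the paper may differ.

```python
# Simplified inverse-ISP sketch: RGB image -> raw-like Bayer mosaic (illustrative).
import numpy as np

rng = np.random.default_rng(0)
rgb = rng.uniform(0.0, 1.0, size=(4, 4, 3))           # stand-in for an sRGB image in [0, 1]

def inverse_isp(rgb, gamma=2.2, wb_gains=(2.0, 1.0, 1.8)):
    linear = np.power(rgb, gamma)                     # undo display gamma (approximate)
    linear = linear / np.array(wb_gains)              # undo assumed white-balance gains
    h, w, _ = linear.shape
    bayer = np.zeros((h, w))
    bayer[0::2, 0::2] = linear[0::2, 0::2, 0]         # R
    bayer[0::2, 1::2] = linear[0::2, 1::2, 1]         # G
    bayer[1::2, 0::2] = linear[1::2, 0::2, 1]         # G
    bayer[1::2, 1::2] = linear[1::2, 1::2, 2]         # B  (RGGB pattern assumed)
    return bayer

raw_like = inverse_isp(rgb)
print("raw-like Bayer mosaic:\n", np.round(raw_like, 3))
```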