Author

Mohit Vaishnav

Bio: Mohit Vaishnav is an academic researcher from the LNM Institute of Information Technology. The author has contributed to research in the topics of computer science and motion compensation, has an h-index of 1, and has co-authored 4 publications that have received 4 citations.

Papers
Proceedings ArticleDOI
11 Jun 2022
TL;DR: This work introduces a novel visual reasoning benchmark, Compositional Visual Relations (CVR), and describes a novel method for creating compositions of abstract rules and associated image datasets at scale; the benchmark measures sample efficiency, generalization, transfer across task rules, and the ability to leverage compositionality.
Abstract: A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet, a major gap remains in terms of the sample efficiency with which humans and AI systems learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality -- such that they can efficiently take advantage of previously gained knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract rules and associated image datasets at scale. Our proposed benchmark includes measures of sample efficiency, generalization and transfer across task rules, as well as the ability to leverage compositionality. We systematically evaluate modern neural architectures and find that, surprisingly, convolutional architectures surpass transformer-based architectures across all performance measures in most data regimes. However, all computational models remain far less data-efficient than humans, even after learning informative visual representations using self-supervision. Overall, we hope that our challenge will spur interest in the development of neural architectures that can learn to harness compositionality toward more efficient learning.
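
As a rough illustration of the abstract's idea of building tasks by composing elementary rules, here is a minimal Python sketch. The rule names, scene representation, and generator functions are hypothetical placeholders, not the CVR benchmark's actual implementation.

```python
# Hypothetical sketch of composing abstract visual rules, in the spirit of CVR.
# Rule and generator names are illustrative, not the benchmark's actual API.
import random

def rule_same_shape(scene):
    """Constrain all objects in a scene to share one shape."""
    shape = random.choice(["circle", "square", "triangle"])
    for obj in scene:
        obj["shape"] = shape
    return scene

def rule_increasing_size(scene):
    """Constrain object sizes to increase from left to right."""
    for i, obj in enumerate(sorted(scene, key=lambda o: o["x"])):
        obj["size"] = 1.0 + 0.5 * i
    return scene

def compose(*rules):
    """A composition of rules is simply their sequential application."""
    def composed(scene):
        for rule in rules:
            scene = rule(scene)
        return scene
    return composed

def sample_scene(n_objects=4):
    return [{"x": random.random(), "shape": None, "size": 1.0} for _ in range(n_objects)]

# New tasks arise combinatorially from a small set of elementary rules.
task = compose(rule_same_shape, rule_increasing_size)
positives = [task(sample_scene()) for _ in range(3)]  # scenes following the composed rule
outlier = sample_scene()  # an unconstrained scene (not guaranteed to violate the rule in this toy sketch)
```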

6 citations

Proceedings Article
10 Jun 2022
TL;DR: The Guided Attention Model for (visual) Reasoning (GAMR), as discussed by the authors, is a module for visual reasoning that instantiates an active vision theory and solves complex visual reasoning problems dynamically via sequences of attention shifts that select and route task-relevant visual information into memory.
Abstract: Humans continue to outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes. Here, we present a novel module for visual reasoning, the Guided Attention Model for (visual) Reasoning (GAMR), which instantiates an active vision theory -- positing that the brain solves complex visual reasoning problems dynamically -- via sequences of attention shifts to select and route task-relevant visual information into memory. Experiments on an array of visual reasoning tasks and datasets demonstrate GAMR's ability to learn visual routines in a robust and sample-efficient manner. In addition, GAMR is shown to be capable of zero-shot generalization on completely novel reasoning tasks. Overall, our work provides computational support for cognitive theories that postulate the need for a critical interplay between attention and memory to dynamically maintain and manipulate task-relevant visual information to solve complex visual reasoning tasks.
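
To make the attention-shift-and-memory mechanism described above concrete, here is a minimal PyTorch sketch of one guided-attention step feeding a recurrent memory. The layer choices, dimensions, and number of steps are assumptions for illustration, not the published GAMR architecture.

```python
# A minimal, illustrative sketch of a guided-attention + memory loop,
# loosely in the spirit of GAMR; layer names and sizes are assumptions.
import torch
import torch.nn as nn

class GuidedAttentionStep(nn.Module):
    def __init__(self, feat_dim=128, mem_dim=128):
        super().__init__()
        self.query = nn.Linear(mem_dim, feat_dim)   # memory state -> attention query
        self.write = nn.GRUCell(feat_dim, mem_dim)  # routes attended features into memory

    def forward(self, feats, memory):
        # feats: (B, N, feat_dim) spatial features; memory: (B, mem_dim)
        q = self.query(memory)                                           # (B, feat_dim)
        attn = torch.softmax((feats * q.unsqueeze(1)).sum(-1), dim=-1)   # (B, N) attention shift
        attended = (attn.unsqueeze(-1) * feats).sum(1)                   # (B, feat_dim) selected info
        return self.write(attended, memory)                              # updated memory

# Over several steps, attention shifts and task-relevant evidence accumulates in memory.
step = GuidedAttentionStep()
feats = torch.randn(2, 49, 128)
memory = torch.zeros(2, 128)
for _ in range(4):
    memory = step(feats, memory)
```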

4 citations

Proceedings ArticleDOI
29 Mar 2011
TL;DR: The overall performance of the proposed method is significantly better than that of many competing methods, at a much reduced computational complexity.
Abstract: This paper presents a novel lossless compression method for video. In this work, we propose a novel method for finding the motion-compensated frame that is computationally much more efficient than other methods reported in the literature. After finding the motion-compensated frame, we propose a new method for efficiently applying an LS-based predictor on the frames. The predictor structure uses pixels in the current frame as well as in the motion-compensated frame. The overall performance of the proposed method is significantly better than that of many competing methods, at a much reduced computational complexity.
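
As a rough illustration of least-squares (LS) prediction over a causal context that includes the motion-compensated frame, here is a minimal NumPy sketch. The context shape, training window, and variable names are assumptions, not the paper's configuration.

```python
# Illustrative least-squares (LS) pixel prediction using causal neighbours in the
# current frame plus the co-located pixel of a motion-compensated reference frame.
import numpy as np

def ls_predict(cur, mc_ref, y, x, train=8):
    """Predict cur[y, x] from a causal context, with coefficients fitted by LS
    over the `train` previously coded pixels on the same row (a toy window)."""
    def context(frame_c, frame_r, j, i):
        return np.array([frame_c[j, i - 1], frame_c[j - 1, i],
                         frame_c[j - 1, i - 1], frame_r[j, i]], dtype=float)

    # Build a small training set from already-decoded (causal) positions.
    X = np.stack([context(cur, mc_ref, y, x - k) for k in range(1, train + 1)])
    t = np.array([cur[y, x - k] for k in range(1, train + 1)], dtype=float)
    coeffs, *_ = np.linalg.lstsq(X, t, rcond=None)
    return float(context(cur, mc_ref, y, x) @ coeffs)
```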

2 citations

Proceedings ArticleDOI
26 Mar 2014
TL;DR: A novel method for lossless compression of video is proposed as an efficient replacement for the first (motion-compensation) pass; it predicts the current pixel using an estimate of its deviation from the pixel at the same temporal location in the previous frame.
Abstract: In this paper, a novel method for lossless compression of video is proposed. Almost all prediction-based methods reported in the literature are two-pass: in the first pass, a motion-compensated frame is obtained, and in the second, a more sophisticated method is used to predict the pixels of the current frame. The proposed method is an efficient replacement for the first pass; it predicts the current pixel using an estimate of its deviation from the pixel at the same temporal location in the previous frame. In this scheme, causal pixels are divided into bins based on their distance from the current pixel. The novelty of the work lies in finding fixed coefficients for the bins for a particular type of video sequence. The overall performance of the proposed method is the same, with much lower computational complexity.
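
A minimal sketch of the distance-binning idea follows: causal neighbours are grouped by their distance from the current pixel and each bin is weighted by a single fixed coefficient. The bin layout and coefficient values below are placeholders, not the values learned in the paper.

```python
# Illustrative sketch of distance-binned prediction: causal neighbours are grouped
# by their distance to the current pixel and each bin gets one fixed coefficient.
import numpy as np

# Causal offsets (dy, dx) grouped into bins by Euclidean distance.
BINS = [
    [(0, -1), (-1, 0)],    # distance 1
    [(-1, -1), (-1, 1)],   # distance sqrt(2)
    [(0, -2), (-2, 0)],    # distance 2
]
# One fixed coefficient per bin, obtained offline for a class of sequences (placeholders).
COEFFS = [0.45, 0.35, 0.20]

def predict_pixel(frame, y, x):
    pred = 0.0
    for coeff, offsets in zip(COEFFS, BINS):
        vals = [float(frame[y + dy, x + dx]) for dy, dx in offsets]
        pred += coeff * (sum(vals) / len(vals))   # average within the bin, then weight
    return pred
```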

1 citation

Proceedings ArticleDOI
20 Mar 2013
TL;DR: A method of lossless video coding in which not only the decoder but also the encoder is simple, unlike other reported methods, which have computationally complex encoders.
Abstract: In this work, we propose a method of lossless video coding in which not only the decoder but also the encoder is simple, unlike other reported methods, which have computationally complex encoders. The saving in computation comes mainly from not using motion compensation, which is a computationally complex process. The coefficients of the predictors are obtained through an averaging process, and the resulting set of switched predictors is then used for prediction. Because the parameters are obtained through this statistical averaging, a proper relationship can be established between the predicted pixels and their context.
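
The following Python sketch illustrates switched prediction without motion compensation: a simple local-gradient test selects one of a few fixed predictors whose coefficients could be obtained offline by averaging. The switching rule, threshold, and weights are assumptions, not the paper's values.

```python
# Illustrative sketch of switched prediction without motion compensation.
def switched_predict(frame, y, x):
    left = float(frame[y, x - 1])
    up = float(frame[y - 1, x])
    upleft = float(frame[y - 1, x - 1])
    dh, dv = abs(left - upleft), abs(up - upleft)   # horizontal / vertical activity
    if dh > dv + 10:       # strong horizontal edge -> trust the pixel above
        return up
    if dv > dh + 10:       # strong vertical edge -> trust the pixel to the left
        return left
    # Smooth region: averaged-coefficient predictor (placeholder weights).
    return 0.4 * left + 0.4 * up + 0.2 * upleft
```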

1 citation


Cited by
Proceedings ArticleDOI
04 Oct 2022
TL;DR: The Attribution, Relation, and Order (ARO) benchmark is created to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information.
Abstract: Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO&Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using the composition and order information. Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality.
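
To illustrate the "composition-aware hard negative" idea in the last sentence, here is a minimal PyTorch sketch: each caption gets a shuffled-order variant that is appended as an extra negative to a generic CLIP-style contrastive loss. The function names and loss layout are assumptions, not the paper's exact training recipe.

```python
# Minimal sketch of composition-aware hard negatives for contrastive training.
import random
import torch
import torch.nn.functional as F

def shuffle_caption(caption: str) -> str:
    words = caption.split()
    random.shuffle(words)          # destroys order/composition, keeps the bag of words
    return " ".join(words)

def contrastive_loss_with_hard_negs(img_emb, txt_emb, hard_neg_emb, temp=0.07):
    # img_emb, txt_emb: (B, D) image/caption embeddings, already L2-normalised.
    # hard_neg_emb: (B, D) embeddings of shuffle_caption(...) applied to each caption.
    logits_pos = img_emb @ txt_emb.t() / temp                            # (B, B) in-batch negatives
    logits_hard = (img_emb * hard_neg_emb).sum(-1, keepdim=True) / temp  # (B, 1) hard negative per row
    logits = torch.cat([logits_pos, logits_hard], dim=1)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)       # matching caption is positive
    return F.cross_entropy(logits, targets)
```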

36 citations

Journal ArticleDOI
TL;DR: In this paper , the authors revisited the attention mechanism in vision transformers and found that self-attention modules group figures in the stimuli based on similarity of visual features such as color.
Abstract: Recently, a considerable number of studies in computer vision involve deep neural architectures called vision transformers. Visual processing in these models incorporates computational models that are claimed to implement attention mechanisms. Despite an increasing body of work that attempts to understand the role of attention mechanisms in vision transformers, their effect is largely unknown. Here, we asked if the attention mechanisms in vision transformers exhibit similar effects as those known in human visual attention. To answer this question, we revisited the attention formulation in these models and found that despite the name, computationally, these models perform a special class of relaxation labeling with similarity grouping effects. Additionally, whereas modern experimental findings reveal that human visual attention involves both feed-forward and feedback mechanisms, the purely feed-forward architecture of vision transformers suggests that attention in these models cannot have the same effects as those known in humans. To quantify these observations, we evaluated grouping performance in a family of vision transformers. Our results suggest that self-attention modules group figures in the stimuli based on similarity of visual features such as color. Also, in a singleton detection experiment as an instance of salient object detection, we studied if these models exhibit similar effects as those of feed-forward visual salience mechanisms thought to be utilized in human visual attention. We found that generally, the transformer-based attention modules assign more salience either to distractors or the ground, the opposite of both human and computational salience. Together, our study suggests that the mechanisms in vision transformers perform perceptual organization based on feature similarity and not attention.
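
As a toy illustration of the similarity-grouping point above, the sketch below computes a plain self-attention map over hand-made colour features: tokens attend mostly to tokens of the same colour group. The features are synthetic and this is not an actual vision transformer.

```python
# Illustrative sketch: a plain self-attention map over colour features weights
# tokens by feature similarity, i.e. it performs similarity grouping rather than
# task-driven selection. Toy RGB features, not a real vision transformer.
import torch

# Six "patches": three reddish, three bluish (RGB feature vectors).
feats = torch.tensor([[1.0, 0.1, 0.1], [0.9, 0.2, 0.1], [0.95, 0.1, 0.2],
                      [0.1, 0.1, 1.0], [0.2, 0.1, 0.9], [0.1, 0.2, 0.95]])

attn = torch.softmax(feats @ feats.t() / feats.size(-1) ** 0.5, dim=-1)
# Within-group attention weights come out larger than across-group weights:
# the map groups patches by colour similarity.
print(attn)
```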

4 citations

Journal ArticleDOI
TL;DR: In this paper , the authors present a framework that casts this inductive bias in terms of an extension of Transformers, in which specific types of attention mechanisms enforce the relational bottleneck and transform distributed symbols to implement a form of relational reasoning and abstraction.
Abstract: Reasoning in terms of relations, analogies, and abstraction is a hallmark of human intelligence. An active debate is whether this relies on the use of symbolic processing or can be achieved using the same forms of function approximation that have been used for tasks such as image, audio, and, most recently, language processing. We propose an intermediate approach, motivated by principles of cognitive neuroscience, in which abstract symbols can emerge from distributed, neural representations under the influence of an inductive bias for learning that we refer to as a ``relational bottleneck.'' We present a framework that casts this inductive bias in terms of an extension of Transformers, in which specific types of attention mechanisms enforce the relational bottleneck and transform distributed symbols to implement a form of relational reasoning and abstraction. We theoretically analyze the class of relation functions the models can compute and empirically demonstrate superior sample-efficiency on relational tasks compared to standard Transformer architectures.
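
One way to picture the "relational bottleneck" attention described above is sketched below: attention weights are computed from the inputs, but they mix learned, input-independent symbol vectors, so downstream layers see relations rather than the objects' own features. The class name, shapes, and exact formulation are assumptions for illustration.

```python
# Illustrative sketch of a relational-bottleneck style attention head.
import torch
import torch.nn as nn

class RelationalAttention(nn.Module):
    def __init__(self, dim=64, n_symbols=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.symbols = nn.Parameter(torch.randn(n_symbols, dim))  # input-independent symbols

    def forward(self, x):
        # x: (B, N, dim) with N <= n_symbols
        rel = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        # Relations (rel) gate abstract symbols instead of the inputs' own values.
        return rel @ self.symbols[: x.size(1)].unsqueeze(0)

out = RelationalAttention()(torch.randn(2, 8, 64))   # (2, 8, 64)
```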

2 citations

Journal ArticleDOI
TL;DR: This paper showed that models often implement solutions to subroutines via modular subnetworks, which can be ablated while maintaining the functionality of other subroutines.
Abstract: Many tasks can be described as compositions over subroutines. Though modern neural networks have achieved impressive performance on both vision and language tasks, we know little about the functions that they implement. One possibility is that neural networks implicitly break down complex tasks into subroutines, implement modular solutions to these subroutines, and compose them into an overall solution to a task -- a property we term structural compositionality. Or they may simply learn to match new inputs to memorized representations, eliding task decomposition entirely. Here, we leverage model pruning techniques to investigate this question in both vision and language, across a variety of architectures, tasks, and pretraining regimens. Our results demonstrate that models oftentimes implement solutions to subroutines via modular subnetworks, which can be ablated while maintaining the functionality of other subroutines. This suggests that neural networks may be able to learn to exhibit compositionality, obviating the need for specialized symbolic mechanisms.
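
The probing logic described above can be sketched as follows: ablate a candidate subnetwork (a binary mask over weights) and check which subroutine's accuracy drops. The mask here is a random placeholder; in the paper, masks are discovered with model pruning techniques.

```python
# Illustrative sketch of subnetwork ablation for testing structural compositionality.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

# Hypothetical binary masks, one per parameter tensor, marking a candidate subnetwork.
masks = {name: (torch.rand_like(p) > 0.8).float() for name, p in model.named_parameters()}

def ablate_subnetwork(model, masks):
    """Zero out the weights belonging to the candidate subnetwork."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(1.0 - masks[name])

# Structural compositionality would show up as: accuracy on subroutine A collapses
# after ablation while accuracy on subroutine B is largely preserved.
ablate_subnetwork(model, masks)
```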

2 citations

Proceedings Article
TL;DR: This article proposed an application of the previously developed distribution-based compositionality assessment method to assess morphological generalisation in NLP tasks, such as machine translation or paraphrase detection, and demonstrated the use of their method by comparing translation systems with different BPE vocabulary sizes.
Abstract: Compositional generalisation refers to the ability to understand and generate a potentially infinite number of novel meanings using a finite group of known primitives and a set of rules to combine them. The degree to which artificial neural networks can learn this ability is an open question. Recently, some evaluation methods and benchmarks have been proposed to test compositional generalisation, but not many have focused on the morphological level of language. We propose an application of the previously developed distribution-based compositionality assessment method to assess morphological generalisation in NLP tasks, such as machine translation or paraphrase detection. We demonstrate the use of our method by comparing translation systems with different BPE vocabulary sizes. The evaluation method we propose suggests that small vocabularies help with morphological generalisation in NMT.

1 citation