Author

Mohit Vaishnav

Bio: Mohit Vaishnav is an academic researcher from the LNM Institute of Information Technology. The author has contributed to research in the topics of computer science and motion compensation, has an h-index of 1, and has co-authored 4 publications that have received 4 citations.

Papers
Proceedings ArticleDOI
11 Jun 2022
TL;DR: This work introduces a novel visual reasoning benchmark, Compositional Visual Relations (CVR), and describes a novel method for creating compositions of abstract rules and associated image datasets at scale; the benchmark measures sample efficiency, generalization, transfer across task rules, and the ability to leverage compositionality.
Abstract: A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet, a major gap remains in terms of the sample efficiency with which humans and AI systems learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality -- such that they can efficiently take advantage of previously gained knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract rules and associated image datasets at scale. Our proposed benchmark includes measures of sample efficiency, generalization and transfer across task rules, as well as the ability to leverage compositionality. We systematically evaluate modern neural architectures and find that, surprisingly, convolutional architectures surpass transformer-based architectures across all performance measures in most data regimes. However, all computational models remain far less data-efficient than humans, even after learning informative visual representations using self-supervision. Overall, we hope that our challenge will spur interest in the development of neural architectures that can learn to harness compositionality toward more efficient learning.
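
As a rough illustration of the abstract's idea of building tasks by composing elementary rules, here is a minimal Python sketch. The rule names, scene representation, and generator functions are hypothetical placeholders, not the CVR benchmark's actual implementation.

```python
# Hypothetical sketch of composing abstract visual rules, in the spirit of CVR.
# Rule and generator names are illustrative, not the benchmark's actual API.
import random

def rule_same_shape(scene):
    """Constrain all objects in a scene to share one shape."""
    shape = random.choice(["circle", "square", "triangle"])
    for obj in scene:
        obj["shape"] = shape
    return scene

def rule_increasing_size(scene):
    """Constrain object sizes to increase from left to right."""
    for i, obj in enumerate(sorted(scene, key=lambda o: o["x"])):
        obj["size"] = 1.0 + 0.5 * i
    return scene

def compose(*rules):
    """A composition of rules is simply their sequential application."""
    def composed(scene):
        for rule in rules:
            scene = rule(scene)
        return scene
    return composed

def sample_scene(n_objects=4):
    return [{"x": random.random(), "shape": None, "size": 1.0} for _ in range(n_objects)]

# New tasks arise combinatorially from a small set of elementary rules.
task = compose(rule_same_shape, rule_increasing_size)
positives = [task(sample_scene()) for _ in range(3)]  # scenes following the composed rule
outlier = sample_scene()  # an unconstrained scene (not guaranteed to violate the rule in this toy sketch)
```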

6 citations

Proceedings Article
10 Jun 2022
TL;DR: The Guided Attention Model for (visual) Reasoning (GAMR), as discussed by the authors, is a module for visual reasoning that instantiates an active vision theory and solves complex visual reasoning problems dynamically via sequences of attention shifts that select and route task-relevant visual information into memory.
Abstract: Humans continue to outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes. Here, we present a novel module for visual reasoning, the Guided Attention Model for (visual) Reasoning (GAMR), which instantiates an active vision theory -- positing that the brain solves complex visual reasoning problems dynamically -- via sequences of attention shifts to select and route task-relevant visual information into memory. Experiments on an array of visual reasoning tasks and datasets demonstrate GAMR's ability to learn visual routines in a robust and sample-efficient manner. In addition, GAMR is shown to be capable of zero-shot generalization on completely novel reasoning tasks. Overall, our work provides computational support for cognitive theories that postulate the need for a critical interplay between attention and memory to dynamically maintain and manipulate task-relevant visual information to solve complex visual reasoning tasks.
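
To make the attention-shift-and-memory mechanism described above concrete, here is a minimal PyTorch sketch of one guided-attention step feeding a recurrent memory. The layer choices, dimensions, and number of steps are assumptions for illustration, not the published GAMR architecture.

```python
# A minimal, illustrative sketch of a guided-attention + memory loop,
# loosely in the spirit of GAMR; layer names and sizes are assumptions.
import torch
import torch.nn as nn

class GuidedAttentionStep(nn.Module):
    def __init__(self, feat_dim=128, mem_dim=128):
        super().__init__()
        self.query = nn.Linear(mem_dim, feat_dim)   # memory state -> attention query
        self.write = nn.GRUCell(feat_dim, mem_dim)  # routes attended features into memory

    def forward(self, feats, memory):
        # feats: (B, N, feat_dim) spatial features; memory: (B, mem_dim)
        q = self.query(memory)                                           # (B, feat_dim)
        attn = torch.softmax((feats * q.unsqueeze(1)).sum(-1), dim=-1)   # (B, N) attention shift
        attended = (attn.unsqueeze(-1) * feats).sum(1)                   # (B, feat_dim) selected info
        return self.write(attended, memory)                              # updated memory

# Over several steps, attention shifts and task-relevant evidence accumulates in memory.
step = GuidedAttentionStep()
feats = torch.randn(2, 49, 128)
memory = torch.zeros(2, 128)
for _ in range(4):
    memory = step(feats, memory)
```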

4 citations

Proceedings ArticleDOI
29 Mar 2011
TL;DR: The overall performance of the proposed method is significantly better than that of many competing methods, at a much reduced computational complexity.
Abstract: This paper presents a novel lossless compression method for video. In this work, we propose a novel method for finding the motion-compensated frame that is computationally much more efficient than other methods reported in the literature. After finding the motion-compensated frame, we propose a new method for efficiently applying an LS-based predictor on the frames. The predictor structure uses pixels in the current frame as well as in the motion-compensated frame. The overall performance of the proposed method is significantly better than that of many competing methods, at a much reduced computational complexity.
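
As a rough illustration of least-squares (LS) prediction over a causal context that includes the motion-compensated frame, here is a minimal NumPy sketch. The context shape, training window, and variable names are assumptions, not the paper's configuration.

```python
# Illustrative least-squares (LS) pixel prediction using causal neighbours in the
# current frame plus the co-located pixel of a motion-compensated reference frame.
import numpy as np

def ls_predict(cur, mc_ref, y, x, train=8):
    """Predict cur[y, x] from a causal context, with coefficients fitted by LS
    over the `train` previously coded pixels on the same row (a toy window)."""
    def context(frame_c, frame_r, j, i):
        return np.array([frame_c[j, i - 1], frame_c[j - 1, i],
                         frame_c[j - 1, i - 1], frame_r[j, i]], dtype=float)

    # Build a small training set from already-decoded (causal) positions.
    X = np.stack([context(cur, mc_ref, y, x - k) for k in range(1, train + 1)])
    t = np.array([cur[y, x - k] for k in range(1, train + 1)], dtype=float)
    coeffs, *_ = np.linalg.lstsq(X, t, rcond=None)
    return float(context(cur, mc_ref, y, x) @ coeffs)
```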

2 citations

Proceedings ArticleDOI
26 Mar 2014
TL;DR: A novel method for lossless compression of video is proposed as an efficient replacement for the first (motion-compensation) pass; it predicts the current pixel using an estimate of its deviation from the pixel at the same temporal location in the previous frame.
Abstract: In this paper, a novel method for lossless compression of video is proposed. Almost all prediction-based methods reported in the literature are two-pass: in the first pass, a motion-compensated frame is obtained, and in the second, a more sophisticated method is used to predict the pixels of the current frame. The proposed method is an efficient replacement for the first pass; it predicts the current pixel using an estimate of its deviation from the pixel at the same temporal location in the previous frame. In this scheme, causal pixels are divided into bins based on their distance from the current pixel. The novelty of the work lies in finding fixed coefficients for the bins for a particular type of video sequence. The overall performance of the proposed method is the same, with much lower computational complexity.
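
A minimal sketch of the distance-binning idea follows: causal neighbours are grouped by their distance from the current pixel and each bin is weighted by a single fixed coefficient. The bin layout and coefficient values below are placeholders, not the values learned in the paper.

```python
# Illustrative sketch of distance-binned prediction: causal neighbours are grouped
# by their distance to the current pixel and each bin gets one fixed coefficient.
import numpy as np

# Causal offsets (dy, dx) grouped into bins by Euclidean distance.
BINS = [
    [(0, -1), (-1, 0)],    # distance 1
    [(-1, -1), (-1, 1)],   # distance sqrt(2)
    [(0, -2), (-2, 0)],    # distance 2
]
# One fixed coefficient per bin, obtained offline for a class of sequences (placeholders).
COEFFS = [0.45, 0.35, 0.20]

def predict_pixel(frame, y, x):
    pred = 0.0
    for coeff, offsets in zip(COEFFS, BINS):
        vals = [float(frame[y + dy, x + dx]) for dy, dx in offsets]
        pred += coeff * (sum(vals) / len(vals))   # average within the bin, then weight
    return pred
```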

1 citation

Proceedings ArticleDOI
20 Mar 2013
TL;DR: A method of lossless video coding in which not only the decoder but also the encoder is simple, unlike other reported methods, which have computationally complex encoders.
Abstract: In this work, we propose a method of lossless video coding in which not only the decoder but also the encoder is simple, unlike other reported methods, which have computationally complex encoders. The saving in computation comes mainly from not using motion compensation, which is a computationally complex process. The coefficients of the predictors are obtained through an averaging process, and the resulting set of switched predictors is then used for prediction. Because the parameters are obtained through this statistical averaging, a proper relationship can be established between the predicted pixels and their context.
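
The following Python sketch illustrates switched prediction without motion compensation: a simple local-gradient test selects one of a few fixed predictors whose coefficients could be obtained offline by averaging. The switching rule, threshold, and weights are assumptions, not the paper's values.

```python
# Illustrative sketch of switched prediction without motion compensation.
def switched_predict(frame, y, x):
    left = float(frame[y, x - 1])
    up = float(frame[y - 1, x])
    upleft = float(frame[y - 1, x - 1])
    dh, dv = abs(left - upleft), abs(up - upleft)   # horizontal / vertical activity
    if dh > dv + 10:       # strong horizontal edge -> trust the pixel above
        return up
    if dv > dh + 10:       # strong vertical edge -> trust the pixel to the left
        return left
    # Smooth region: averaged-coefficient predictor (placeholder weights).
    return 0.4 * left + 0.4 * up + 0.2 * upleft
```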

1 citation


Cited by
Proceedings ArticleDOI
04 Oct 2022
TL;DR: The Attribution, Relation, and Order (ARO) benchmark is created to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information.
Abstract: Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO&Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using the composition and order information. Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality.
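
To illustrate the "composition-aware hard negative" idea in the last sentence, here is a minimal PyTorch sketch: each caption gets a shuffled-order variant that is appended as an extra negative to a generic CLIP-style contrastive loss. The function names and loss layout are assumptions, not the paper's exact training recipe.

```python
# Minimal sketch of composition-aware hard negatives for contrastive training.
import random
import torch
import torch.nn.functional as F

def shuffle_caption(caption: str) -> str:
    words = caption.split()
    random.shuffle(words)          # destroys order/composition, keeps the bag of words
    return " ".join(words)

def contrastive_loss_with_hard_negs(img_emb, txt_emb, hard_neg_emb, temp=0.07):
    # img_emb, txt_emb: (B, D) image/caption embeddings, already L2-normalised.
    # hard_neg_emb: (B, D) embeddings of shuffle_caption(...) applied to each caption.
    logits_pos = img_emb @ txt_emb.t() / temp                            # (B, B) in-batch negatives
    logits_hard = (img_emb * hard_neg_emb).sum(-1, keepdim=True) / temp  # (B, 1) hard negative per row
    logits = torch.cat([logits_pos, logits_hard], dim=1)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)       # matching caption is positive
    return F.cross_entropy(logits, targets)
```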

36 citations

Journal ArticleDOI
TL;DR: In this paper , the authors revisited the attention mechanism in vision transformers and found that self-attention modules group figures in the stimuli based on similarity of visual features such as color.
Abstract: Recently, a considerable number of studies in computer vision involve deep neural architectures called vision transformers. Visual processing in these models incorporates computational models that are claimed to implement attention mechanisms. Despite an increasing body of work that attempts to understand the role of attention mechanisms in vision transformers, their effect is largely unknown. Here, we asked if the attention mechanisms in vision transformers exhibit similar effects as those known in human visual attention. To answer this question, we revisited the attention formulation in these models and found that despite the name, computationally, these models perform a special class of relaxation labeling with similarity grouping effects. Additionally, whereas modern experimental findings reveal that human visual attention involves both feed-forward and feedback mechanisms, the purely feed-forward architecture of vision transformers suggests that attention in these models cannot have the same effects as those known in humans. To quantify these observations, we evaluated grouping performance in a family of vision transformers. Our results suggest that self-attention modules group figures in the stimuli based on similarity of visual features such as color. Also, in a singleton detection experiment as an instance of salient object detection, we studied if these models exhibit similar effects as those of feed-forward visual salience mechanisms thought to be utilized in human visual attention. We found that generally, the transformer-based attention modules assign more salience either to distractors or the ground, the opposite of both human and computational salience. Together, our study suggests that the mechanisms in vision transformers perform perceptual organization based on feature similarity and not attention.
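
As a toy illustration of the similarity-grouping point above, the sketch below computes a plain self-attention map over hand-made colour features: tokens attend mostly to tokens of the same colour group. The features are synthetic and this is not an actual vision transformer.

```python
# Illustrative sketch: a plain self-attention map over colour features weights
# tokens by feature similarity, i.e. it performs similarity grouping rather than
# task-driven selection. Toy RGB features, not a real vision transformer.
import torch

# Six "patches": three reddish, three bluish (RGB feature vectors).
feats = torch.tensor([[1.0, 0.1, 0.1], [0.9, 0.2, 0.1], [0.95, 0.1, 0.2],
                      [0.1, 0.1, 1.0], [0.2, 0.1, 0.9], [0.1, 0.2, 0.95]])

attn = torch.softmax(feats @ feats.t() / feats.size(-1) ** 0.5, dim=-1)
# Within-group attention weights come out larger than across-group weights:
# the map groups patches by colour similarity.
print(attn)
```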

4 citations

Journal ArticleDOI
TL;DR: In this paper , the authors present a framework that casts this inductive bias in terms of an extension of Transformers, in which specific types of attention mechanisms enforce the relational bottleneck and transform distributed symbols to implement a form of relational reasoning and abstraction.
Abstract: Reasoning in terms of relations, analogies, and abstraction is a hallmark of human intelligence. An active debate is whether this relies on the use of symbolic processing or can be achieved using the same forms of function approximation that have been used for tasks such as image, audio, and, most recently, language processing. We propose an intermediate approach, motivated by principles of cognitive neuroscience, in which abstract symbols can emerge from distributed, neural representations under the influence of an inductive bias for learning that we refer to as a ``relational bottleneck.'' We present a framework that casts this inductive bias in terms of an extension of Transformers, in which specific types of attention mechanisms enforce the relational bottleneck and transform distributed symbols to implement a form of relational reasoning and abstraction. We theoretically analyze the class of relation functions the models can compute and empirically demonstrate superior sample-efficiency on relational tasks compared to standard Transformer architectures.
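
One way to picture the "relational bottleneck" attention described above is sketched below: attention weights are computed from the inputs, but they mix learned, input-independent symbol vectors, so downstream layers see relations rather than the objects' own features. The class name, shapes, and exact formulation are assumptions for illustration.

```python
# Illustrative sketch of a relational-bottleneck style attention head.
import torch
import torch.nn as nn

class RelationalAttention(nn.Module):
    def __init__(self, dim=64, n_symbols=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.symbols = nn.Parameter(torch.randn(n_symbols, dim))  # input-independent symbols

    def forward(self, x):
        # x: (B, N, dim) with N <= n_symbols
        rel = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        # Relations (rel) gate abstract symbols instead of the inputs' own values.
        return rel @ self.symbols[: x.size(1)].unsqueeze(0)

out = RelationalAttention()(torch.randn(2, 8, 64))   # (2, 8, 64)
```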

2 citations

Journal ArticleDOI
TL;DR: This paper showed that models often implement solutions to subroutines via modular subnetworks, which can be ablated while maintaining the functionality of other subroutines.
Abstract: Many tasks can be described as compositions over subroutines. Though modern neural networks have achieved impressive performance on both vision and language tasks, we know little about the functions that they implement. One possibility is that neural networks implicitly break down complex tasks into subroutines, implement modular solutions to these subroutines, and compose them into an overall solution to a task -- a property we term structural compositionality. Or they may simply learn to match new inputs to memorized representations, eliding task decomposition entirely. Here, we leverage model pruning techniques to investigate this question in both vision and language, across a variety of architectures, tasks, and pretraining regimens. Our results demonstrate that models oftentimes implement solutions to subroutines via modular subnetworks, which can be ablated while maintaining the functionality of other subroutines. This suggests that neural networks may be able to learn to exhibit compositionality, obviating the need for specialized symbolic mechanisms.
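
The probing logic described above can be sketched as follows: ablate a candidate subnetwork (a binary mask over weights) and check which subroutine's accuracy drops. The mask here is a random placeholder; in the paper, masks are discovered with model pruning techniques.

```python
# Illustrative sketch of subnetwork ablation for testing structural compositionality.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

# Hypothetical binary masks, one per parameter tensor, marking a candidate subnetwork.
masks = {name: (torch.rand_like(p) > 0.8).float() for name, p in model.named_parameters()}

def ablate_subnetwork(model, masks):
    """Zero out the weights belonging to the candidate subnetwork."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(1.0 - masks[name])

# Structural compositionality would show up as: accuracy on subroutine A collapses
# after ablation while accuracy on subroutine B is largely preserved.
ablate_subnetwork(model, masks)
```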

2 citations

Proceedings Article
TL;DR: This article proposed an application of the previously developed distribution-based compositionality assessment method to assess morphological generalisation in NLP tasks, such as machine translation or paraphrase detection, and demonstrated the use of their method by comparing translation systems with different BPE vocabulary sizes.
Abstract: Compositional generalisation refers to the ability to understand and generate a potentially infinite number of novel meanings using a finite group of known primitives and a set of rules to combine them. The degree to which artificial neural networks can learn this ability is an open question. Recently, some evaluation methods and benchmarks have been proposed to test compositional generalisation, but not many have focused on the morphological level of language. We propose an application of the previously developed distribution-based compositionality assessment method to assess morphological generalisation in NLP tasks, such as machine translation or paraphrase detection. We demonstrate the use of our method by comparing translation systems with different BPE vocabulary sizes. The evaluation method we propose suggests that small vocabularies help with morphological generalisation in NMT.

1 citation