
Showing papers by "Weidi Xie published in 2022"


Book ChapterDOI
30 Mar 2022
TL;DR: In this paper, a two-stage open-vocabulary object detector is proposed, in which class-agnostic object proposals are classified with a text encoder from a pre-trained visual-language model.
Abstract: The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where the class-agnostic object proposals are classified with a text encoder from a pre-trained visual-language model; (ii) to pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to align the textual embedding space with regional visual object features; (iii) to scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resources via a novel self-training framework, which allows training the proposed detector on a large corpus of noisy, uncurated web images. Lastly, (iv) to evaluate our proposed detector, termed PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets. PromptDet shows superior performance over existing approaches with fewer additional training images and zero manual annotations whatsoever. Project page with code: https://fcjian.github.io/promptdet.
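A minimal sketch of the open-vocabulary classification step described above: class-agnostic region features are scored against category text embeddings from a frozen text encoder (represented here by a pre-computed matrix). The tensor shapes and temperature are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def classify_regions(region_feats: torch.Tensor,   # (N, D) RPN box features
                     text_embeds: torch.Tensor,    # (C, D) one embedding per category name
                     temperature: float = 0.01):
    # L2-normalise both spaces so the dot product is a cosine similarity
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = region_feats @ text_embeds.t() / temperature   # (N, C)
    return logits.softmax(dim=-1)                           # per-box category distribution

# toy usage
probs = classify_regions(torch.randn(5, 512), torch.randn(80, 512))
print(probs.shape)  # torch.Size([5, 80])
```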

35 citations


Proceedings ArticleDOI
06 Apr 2022
TL;DR: A temporal alignment network is proposed that ingests long-term video sequences and associated text sentences in order to determine whether a sentence is alignable with the video and, if it is alignable, to determine its alignment.
Abstract: The objective of this paper is a temporal alignment network that ingests long-term video sequences, and associated text sentences, in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment. The challenge is to train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise, and are only weakly aligned when relevant. Apart from proposing the alignment network, we also make four contributions: (i) we describe a novel co-training method that enables denoising and training on raw instructional videos without using manual annotation, despite the considerable noise; (ii) to benchmark the alignment performance, we manually curate a 10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal descriptions. Our proposed model, trained on HowTo100M, outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset by a significant margin; (iii) we apply the trained model in zero-shot settings to multiple downstream video understanding tasks and achieve state-of-the-art results, including text-video retrieval on YouCook2, and weakly supervised video action segmentation on Breakfast-Action; (iv) we use the automatically aligned HowTo100M annotations for end-to-end finetuning of the backbone model, and obtain improved performance on downstream action recognition tasks.
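As a hedged illustration of the alignability decision described above, the sketch below compares one sentence embedding with per-segment video embeddings: a sentence is deemed alignable if its best similarity exceeds a threshold, and its alignment is the best-matching segment. The embeddings, threshold, and segmentation granularity are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def align_sentence(sent_embed: torch.Tensor,     # (D,) text embedding of one sentence
                   video_embeds: torch.Tensor,   # (T, D) one embedding per temporal segment
                   threshold: float = 0.5):
    # cosine similarity between the sentence and every temporal segment
    sims = F.normalize(video_embeds, dim=-1) @ F.normalize(sent_embed, dim=0)  # (T,)
    alignable = sims.max().item() > threshold     # decision (1): is the sentence alignable?
    t_star = int(sims.argmax()) if alignable else None  # decision (2): where does it align?
    return alignable, t_star, sims

print(align_sentence(torch.randn(256), torch.randn(120, 256))[:2])
```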

19 citations


Journal ArticleDOI
23 Mar 2022
TL;DR: This paper proposes a simple but effective winner-takes-all voting mechanism for selecting the salient masks, leveraging object priors based on framing and distinctiveness, and trains a salient object detector, termed SelfMask, which outperforms prior approaches on three unsupervised SOD benchmarks.
Abstract: In this paper, we tackle the challenging task of unsupervised salient object detection (SOD) by leveraging spectral clustering on self-supervised features. We make the following contributions: (i) We revisit spectral clustering and demonstrate its potential to group the pixels of salient objects across various self-supervised features, e.g., MoCo v2, SwAV, and DINO; (ii) Given mask proposals from multiple applications of spectral clustering on image features computed from different self-supervised models, we propose a simple but effective winner-takes-all voting mechanism for selecting the salient masks, leveraging object priors based on framing and distinctiveness; (iii) Using the selected object segmentation as pseudo ground-truth masks, we train a salient object detector, termed SelfMask, which outperforms prior approaches on three unsupervised SOD benchmarks. Code is publicly available at https://github.com/NoelShin/selfmask.
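A minimal sketch of one spectral-clustering pass over self-supervised pixel (or patch) features, as described above: an affinity matrix is built from feature similarity and the sign pattern of the Fiedler vector of the normalised Laplacian yields a foreground/background bipartition. Thresholding at the median is an assumption for illustration; the voting over multiple models is not shown.

```python
import torch
import torch.nn.functional as F

def spectral_bipartition(feats: torch.Tensor):
    # feats: (H*W, D) per-pixel or per-patch self-supervised features
    feats = F.normalize(feats, dim=-1)
    W = (feats @ feats.t()).clamp(min=0)                     # non-negative affinity matrix
    d = W.sum(dim=1)
    D_inv_sqrt = torch.diag(d.clamp(min=1e-8).rsqrt())
    L = torch.eye(W.size(0)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalised graph Laplacian
    _, eigvecs = torch.linalg.eigh(L)                        # eigenvectors in ascending order
    fiedler = eigvecs[:, 1]                                  # second-smallest eigenvector
    return fiedler > fiedler.median()                        # (H*W,) binary mask proposal

mask = spectral_bipartition(torch.randn(28 * 28, 384))
print(mask.shape, mask.float().mean())
```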

18 citations


Journal ArticleDOI
TL;DR: The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories using zero manual annotations; to validate the necessity of the proposed components, extensive experiments are conducted on the challenging LVIS and MS-COCO datasets.
Abstract: The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model; (ii) to pair the visual latent space (from RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to optimise a few learnable prompt vectors, converting the textual embedding space to fit those visually object-centric images; (iii) to scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resources, iteratively updating the prompts and later self-training the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images. The self-trained detector, termed PromptDet, significantly improves the detection performance on categories for which manual annotations are unavailable or hard to obtain, e.g. rare categories. Finally, (iv) to validate the necessity of our proposed components, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets, showing superior performance over existing approaches with fewer additional training images and zero manual annotations whatsoever. Project page with code: https://fcjian.github.io/promptdet.
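A hedged sketch of the regional prompt learning idea described above: a few learnable context vectors are prepended to the token embeddings of each category name before they pass through a frozen text encoder. The `text_encoder` callable and `name_token_embeds` tensor are placeholders standing in for a pre-trained visual-language model; sizes are illustrative.

```python
import torch
import torch.nn as nn

class RegionalPrompt(nn.Module):
    # Learnable prompt vectors shared across categories; only these are optimised,
    # while the text encoder stays frozen (supplied externally as a callable).
    def __init__(self, n_ctx: int = 4, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, name_token_embeds, text_encoder):
        # name_token_embeds: (C, L, D) token embeddings of the C category names
        C = name_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)            # (C, n_ctx, D)
        prompts = torch.cat([ctx, name_token_embeds], dim=1)     # prepend learnable context
        return text_encoder(prompts)                             # (C, D) per-category classifier weights

# toy usage with a stand-in "encoder" that simply averages token embeddings
toy_encoder = lambda prompts: prompts.mean(dim=1)
weights = RegionalPrompt()(torch.randn(80, 8, 512), toy_encoder)
print(weights.shape)  # torch.Size([80, 512])
```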

14 citations


Proceedings ArticleDOI
27 Oct 2022
TL;DR: This paper introduces Fusioner, with a lightweight, transformer-based fusion module, that pairs the frozen visual representation with the language concept through a handful of image segmentation data, and demonstrates superior performance over previous models.
Abstract: When trained at a sufficient scale, self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks. In this paper, we investigate simple, yet effective approaches for adapting the pre-trained foundation models to the downstream task of interest, namely, open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, with a lightweight, transformer-based fusion module, that pairs the frozen visual representation with language concepts through a handful of image segmentation data. As a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment on a broad range of self-supervised models that have been pre-trained with different schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT), and visual-language models (CLIP), and show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data; (iii) we conduct thorough ablation studies to analyze the critical components in our proposed Fusioner; evaluated on standard benchmarks, e.g. PASCAL-5i and COCO-20i, it surpasses existing state-of-the-art models by a large margin, despite only being trained on frozen visual and language features; (iv) to measure the model's robustness on learning visual-language correspondence, we further evaluate on a synthetic dataset, named Mosaic-4, where images are constructed by mosaicking the samples from FSS-1000. Fusioner demonstrates superior performance over previous models.
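A lightweight cross-modal fusion sketch in the spirit of the description above: frozen visual patch tokens and a language token are jointly encoded by a small transformer, and a per-patch mask is read out via dot products with the fused language token. The layer count, dimensions, and read-out are illustrative assumptions, not Fusioner's exact design.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)   # the only trainable part here

    def forward(self, patch_tokens, class_token):
        # patch_tokens: (B, N, D) frozen visual features; class_token: (B, 1, D) language embedding
        fused = self.encoder(torch.cat([class_token, patch_tokens], dim=1))
        cls, patches = fused[:, :1], fused[:, 1:]
        return (patches @ cls.transpose(1, 2)).squeeze(-1)    # (B, N) mask logits per patch

m = FusionSketch()
print(m(torch.randn(2, 196, 256), torch.randn(2, 1, 256)).shape)  # torch.Size([2, 196])
```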

12 citations


Proceedings ArticleDOI
05 Jul 2022
TL;DR: An object-centric segmentation model with a depth-ordered layer representation, implemented using a variant of the transformer architecture that ingests optical flow, which can effectively discover multiple moving objects and handle mutual occlusions.
Abstract: The objective of this paper is a model that is able to discover, track and segment multiple moving objects in a video. We make four contributions: First, we introduce an object-centric segmentation model with a depth-ordered layer representation. This is implemented using a variant of the transformer architecture that ingests optical flow, where each query vector specifies an object and its layer for the entire video. The model can effectively discover multiple moving objects and handle mutual occlusions; Second, we introduce a scalable pipeline for generating multi-object synthetic training data via layer compositions, which is used to train the proposed model, significantly reducing the requirements for labour-intensive annotations, and supporting Sim2Real generalisation; Third, we conduct thorough ablation studies, showing that the model is able to learn object permanence and temporal shape consistency, and is able to predict amodal segmentation masks; Fourth, we evaluate our model, trained only on synthetic data, on standard video segmentation benchmarks, DAVIS, MoCA, SegTrack, FBMS-59, and achieve state-of-the-art performance among existing methods that do not rely on any manual annotations. With test-time adaptation, we observe further performance boosts.

9 citations


Proceedings ArticleDOI
29 Aug 2022
TL;DR: A novel transformer-based architecture for generalised visual object counting, termed Counting TRansformer (CounTR), which explicitly captures the similarity between image patches or with given “exemplars” using the attention mechanism, is introduced.
Abstract: In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed Counting Transformer (CounTR), which explicitly captures the similarity between image patches or with the given "exemplars" using the attention mechanism; (2) We adopt a two-stage training regime that first pre-trains the model with self-supervised learning, followed by supervised fine-tuning; (3) We propose a simple, scalable pipeline for synthesizing training images with a large number of instances or from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) We conduct thorough ablation studies on the large-scale counting benchmark, e.g. FSC-147, and demonstrate state-of-the-art performance in both zero-shot and few-shot settings.
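A minimal sketch of attention-based counting in the spirit of the description above: image patch features attend to exemplar features, and a small head regresses a per-patch density whose sum is the count. Dimensions and head design are illustrative assumptions, not CounTR's actual configuration; the zero-shot case (no exemplars) is not handled here.

```python
import torch
import torch.nn as nn

class CountingSketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.density_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                          nn.Linear(dim, 1), nn.ReLU())  # non-negative density

    def forward(self, patch_feats, exemplar_feats):
        # patch_feats: (B, N, D) image patch features; exemplar_feats: (B, K, D) for K exemplars
        attended, _ = self.cross_attn(patch_feats, exemplar_feats, exemplar_feats)
        density = self.density_head(attended).squeeze(-1)     # (B, N) per-patch density
        return density, density.sum(dim=1)                    # density map and predicted count

density, count = CountingSketch()(torch.randn(2, 196, 256), torch.randn(2, 3, 256))
print(count.shape)  # torch.Size([2])
```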

8 citations


Journal ArticleDOI
TL;DR: It is shown that a standard segmentation architecture trained on these archives with appropriate data augmentation achieves impressive semantic segmentation abilities for both single-object and multi-object images.
Abstract: The goal of this work is to segment and name regions of images without access to pixel-level labels during training. To tackle this task, we construct segmenters by distilling the complementary strengths of two foundation models. The first, CLIP (Radford et al. 2021), exhibits the ability to assign names to image content but lacks an accessible representation of object structure. The second, DINO (Caron et al. 2021), captures the spatial extent of objects but has no knowledge of object names. Our method, termed NamedMask, begins by using CLIP to construct category-specific archives of images. These images are pseudo-labelled with a category-agnostic salient object detector bootstrapped from DINO, then refined by category-specific segmenters using the CLIP archive labels. Thanks to the high quality of the refined masks, we show that a standard segmentation architecture trained on these archives with appropriate data augmentation achieves impressive semantic segmentation abilities for both single-object and multi-object images. As a result, our proposed NamedMask performs favourably against a range of prior work on five benchmarks including the VOC2012, COCO and large-scale ImageNet-S datasets.

6 citations


Proceedings ArticleDOI
26 Jun 2022
TL;DR: The proposed framework learns strong multi-modal representations that are beneficial to sound localisation and generalization to further applications; the effects of data augmentations are systematically investigated to understand what enables learning useful representations.
Abstract: We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos. To understand what enables learning useful representations, we systematically investigate the effects of data augmentations, and reveal that (1) the composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representations to be invariant to various transformations (transformation invariance); (2) enforcing geometric consistency substantially improves the quality of learned representations, i.e. the detected sound source should follow the same transformation applied to the input video frames (transformation equivariance). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely, Flickr-SoundNet and VGG-Sound. Additionally, we also evaluate audio retrieval and cross-modal retrieval tasks. In both cases, our self-supervised models demonstrate superior retrieval performance, even competitive with the supervised approach in audio retrieval. This reveals that the proposed framework learns strong multi-modal representations that are beneficial to sound localisation and generalization to further applications. The project page is https://jinxiang-liu.github.io/SSL-TIE.
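A hedged sketch of the geometric-consistency (transformation equivariance) idea described above: the localisation map predicted for a transformed frame should equal the transformed map of the original frame. A horizontal flip stands in for the richer set of transformations; `model` returning a (B, H, W) audio-visual localisation heatmap is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def equivariance_loss(model, frames, audio):
    # frames: (B, C, H, W) video frames; audio: (B, ...) corresponding audio features
    heat = model(frames, audio)                                    # (B, H, W) localisation map
    heat_of_flipped = model(torch.flip(frames, dims=[-1]), audio)  # map predicted from flipped frames
    # the map of the flipped input should match the flipped map of the original input
    return F.mse_loss(heat_of_flipped, torch.flip(heat, dims=[-1]))

# toy usage with a dummy "model" that ignores the audio
toy_model = lambda f, a: f.mean(dim=1)
print(equivariance_loss(toy_model, torch.randn(2, 3, 14, 14), torch.randn(2, 128)))
```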

5 citations


Journal ArticleDOI
TL;DR: This work proposes a dual-view detection system named DVDET to achieve aerial monocular object detection in both the 2D image space and the 3D physical space and proposes a novel trainable geo-deformable transformation module that can properly warp information from the drone’s perspective to the BEV.
Abstract: Drones equipped with cameras can significantly enhance humans' ability to perceive the world because of their remarkable maneuverability in 3D space. Ironically, object detection for drones has always been conducted in the 2D image space, which fundamentally limits their ability to understand 3D scenes. Furthermore, existing 3D object detection methods developed for autonomous driving cannot be directly applied to drones due to the lack of deformation modeling, which is essential for the distant aerial perspective with sensitive distortion and small objects. To fill the gap, this work proposes a dual-view detection system named DVDET to achieve aerial monocular object detection in both the 2D image space and the 3D physical space. To address the severe view deformation issue, we propose a novel trainable geo-deformable transformation module that can properly warp information from the drone's perspective to the bird's eye view (BEV). Compared to the monocular methods for cars, our transformation includes a learnable deformable network for explicitly revising the severe deviation. To address the dataset challenge, we propose a new large-scale simulation dataset named AM3D-Sim, and a new real-world aerial dataset named AM3D-Real with high-quality annotations for 3D object detection. Extensive experiments show that i) aerial monocular 3D object detection is feasible; ii) the model pre-trained on the simulation dataset helps real-world performance; and iii) DVDET also helps monocular 3D object detection for cars. To encourage more researchers to investigate this area, we release the dataset and related code.

3 citations


Proceedings ArticleDOI
13 Oct 2022
TL;DR: This paper designs a multi-modal transformer model that employs ‘selectors’ to distil the long audio and visual streams into small sequences that are then used to predict the temporal offset between streams, and curates a dataset with only sparse in time and space synchronisation signals.
Abstract: The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space. We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors' to distil the long audio and visual streams into small sequences that are then used to predict the temporal offset between streams. (ii) We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task. (iii) We curate a dataset with only sparse in time and space synchronisation signals; and (iv) the effectiveness of the proposed model is shown on both dense and sparse datasets quantitatively and qualitatively. Project page: v-iashin.github.io/SparseSync
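A hedged sketch of the 'selector' idea described above: a few learnable queries distil each long token stream into a short summary via cross-attention, and the audio and visual summaries feed a classifier over candidate temporal offsets. The shared attention module, number of queries, and offset vocabulary are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelectorSketch(nn.Module):
    def __init__(self, dim: int = 256, n_queries: int = 8, n_offsets: int = 21, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)  # learnable selectors
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities for brevity
        self.offset_head = nn.Linear(2 * n_queries * dim, n_offsets)

    def distil(self, stream):
        # stream: (B, T, D) long token sequence; returns a compact (B, n_queries, D) summary
        q = self.queries.unsqueeze(0).expand(stream.size(0), -1, -1)
        summary, _ = self.attn(q, stream, stream)
        return summary

    def forward(self, visual_tokens, audio_tokens):
        v, a = self.distil(visual_tokens), self.distil(audio_tokens)
        joint = torch.cat([v, a], dim=1).flatten(1)
        return self.offset_head(joint)          # logits over candidate audio-visual offsets

logits = SelectorSketch()(torch.randn(2, 500, 256), torch.randn(2, 800, 256))
print(logits.shape)  # torch.Size([2, 21])
```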

Proceedings ArticleDOI
07 Oct 2022
TL;DR: This paper develops a general plugin that can be inserted into existing super-resolution models, conveniently augmenting their ability towards Arbitrary Resolution Image Scaling, thus termed ARIS, and injects the proposed ARIS plugin module into several existing models, namely, IPT, SwinIR, and HAT.
Abstract: Existing super-resolution models are often specialized for one scale, fundamentally limiting their use in practical scenarios. In this paper, we aim to develop a general plugin that can be inserted into existing super-resolution models, conveniently augmenting their ability towards Arbitrary Resolution Image Scaling, thus termed ARIS. We make the following contributions: (i) we propose a transformer-based plugin module, which uses spatial coordinates as queries, iteratively attends to the low-resolution image features through cross-attention, and outputs the visual feature for the queried spatial location, resembling an implicit representation for images; (ii) we introduce a novel self-supervised training scheme that exploits consistency constraints to effectively augment the model's ability for upsampling images towards unseen scales, i.e. where ground-truth high-resolution images are not available; (iii) without loss of generality, we inject the proposed ARIS plugin module into several existing models, namely, IPT, SwinIR, and HAT, showing that the resulting models can not only maintain their original performance on fixed scale factors but also extrapolate to unseen scales, substantially outperforming existing any-scale super-resolution models on standard benchmarks, e.g. Urban100, DIV2K, etc.
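A minimal sketch of the coordinate-as-query mechanism described above: each continuous (x, y) query is embedded and cross-attends to low-resolution feature tokens, and a small MLP decodes an RGB value for that location; any output resolution is obtained by choosing how densely the coordinates are sampled. Sizes and the single attention layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoordinateQuerySketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.coord_embed = nn.Linear(2, dim)                        # embed continuous (x, y) queries
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_rgb = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))

    def forward(self, coords, lr_tokens):
        # coords: (B, Q, 2) query positions in [0, 1]; lr_tokens: (B, N, D) low-resolution features
        q = self.coord_embed(coords)
        out, _ = self.attn(q, lr_tokens, lr_tokens)                 # attend to the LR feature tokens
        return self.to_rgb(out)                                     # (B, Q, 3) RGB at queried positions

rgb = CoordinateQuerySketch()(torch.rand(1, 1024, 2), torch.randn(1, 256, 256))
print(rgb.shape)  # torch.Size([1, 1024, 3])
```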

Proceedings ArticleDOI
10 Oct 2022
TL;DR: Turbo training enables long-schedule video-language training and end-to-end long-video training, delivering performance that is competitive with or superior to previous works, which were infeasible to train under limited resources.
Abstract: The objective of this paper is an efficient training method for video tasks. We make three contributions: (1) We propose Turbo training, a simple and versatile training paradigm for Transformers on multiple video tasks. (2) We illustrate the advantages of Turbo training on action classification, video-language representation learning, and long-video activity classification, showing that Turbo training can largely maintain competitive performance while achieving almost 4X speed-up and significantly less memory consumption. (3) Turbo training enables long-schedule video-language training and end-to-end long-video training, delivering performance that is competitive with or superior to previous works, which were infeasible to train under limited resources.

Journal ArticleDOI
TL;DR: A model is proposed that directly processes consecutive RGB frames and infers the optical flow between any pair of frames using a layered representation, with the opacity channels treated as the segmentation.
Abstract: In this paper, we consider the task of unsupervised object discovery in videos. Previous works have shown promising results via processing optical flow to segment objects. However, taking flow as input brings about two drawbacks. First, flow cannot capture sufficient cues when objects remain static or partially occluded. Second, it is challenging to establish temporal coherency from flow-only input, due to the missing texture information. To tackle these limitations, we propose a model for directly processing consecutive RGB frames, and infer the optical flow between any pair of frames using a layered representation, with the opacity channels being treated as the segmentation. Additionally, to enforce object permanence, we apply a temporal consistency loss on the masks inferred from randomly-paired frames, which reflect motions at different paces, and encourage the model to segment the objects even if they may not move at the current time point. Experimentally, we demonstrate superior performance over previous state-of-the-art methods on three public video segmentation datasets (DAVIS2016, SegTrackv2, and FBMS-59), while being computationally efficient by avoiding the overhead of computing optical flow as input.

Journal ArticleDOI
TL;DR: A novel Transformer-based framework is proposed for directly processing the sparsely sampled signals in k-space, going beyond the limitation of regular grids as ConvNets do, and striking a balance between computational cost and reconstruction quality.
Abstract: This paper considers the problem of fast MRI reconstruction. We propose a novel Transformer-based framework for directly processing the sparsely sampled signals in k-space, going beyond the limitation of regular grids as ConvNets do. We adopt an implicit representation of the spectrogram, treating spatial coordinates as inputs, and dynamically query the partially observed measurements to complete the spectrogram, i.e. learning the inductive bias in k-space. To strike a balance between computational cost and reconstruction quality, we build a hierarchical structure with low-resolution and high-resolution decoders respectively. To validate the necessity of our proposed modules, we have conducted extensive experiments on two public datasets, and demonstrate superior or comparable performance over state-of-the-art approaches. Project page: https://zhaoziheng.github.io/Website/K-Space-Transformer

Book ChapterDOI
12 Sep 2022
TL;DR: AdLocUI, as discussed by the authors, trains a convolutional neural network with 2D slices sampled from co-aligned 3D ultrasound volumes to predict their locations in the 3D anatomical atlas.
Abstract: Two-dimensional (2D) freehand ultrasound is the mainstay in prenatal care and fetal growth monitoring. The task of matching corresponding cross-sectional planes in the 3D anatomy for a given 2D ultrasound brain scan is essential in freehand scanning, but challenging. We propose AdLocUI, a framework that Adaptively Localizes 2D Ultrasound Images in the 3D anatomical atlas without using any external tracking sensor. We first train a convolutional neural network with 2D slices sampled from co-aligned 3D ultrasound volumes to predict their locations in the 3D anatomical atlas. Next, we fine-tune it with 2D freehand ultrasound images using a novel unsupervised cycle consistency, which utilizes the fact that the overall displacement of a sequence of images in the 3D anatomical atlas is equal to the displacement from the first image to the last in that sequence. We demonstrate that AdLocUI can adapt to three different ultrasound datasets, acquired with different machines and protocols, and achieves significantly better localization accuracy than the baselines. AdLocUI can be used for sensorless 2D freehand ultrasound guidance by the bedside. The source code is available at https://github.com/pakheiyeung/AdLocUI.
Keywords: Freehand ultrasound, Slice-to-volume registration, Domain adaptation
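A minimal sketch of the cycle-consistency constraint described above: the chained displacements predicted between consecutive frames of a sequence should agree with the displacement predicted directly from the first frame to the last. The 3-vector displacement parameterisation and MSE penalty are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(pred_step_displacements: torch.Tensor,  # (N-1, 3) between consecutive frames
                           pred_first_to_last: torch.Tensor):      # (3,) directly from first to last frame
    # the sum of the per-step displacements should equal the direct first-to-last displacement
    return F.mse_loss(pred_step_displacements.sum(dim=0), pred_first_to_last)

print(cycle_consistency_loss(torch.randn(9, 3), torch.randn(3)))
```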

Proceedings ArticleDOI
18 Oct 2022
TL;DR: A simple 'plugin' module for the detection head of two-stage object detectors is proposed to improve the recall of partially occluded objects, together with a scalable pipeline for generating training data for the module by using amodal completion of existing object detection and instance segmentation training datasets to establish occlusion relationships.
Abstract: Detecting occluded objects still remains a challenge for state-of-the-art object detectors. The objective of this work is to improve the detection for such objects, and thereby improve the overall performance of a modern object detector. To this end we make the following four contributions: (1) We propose a simple 'plugin' module for the detection head of two-stage object detectors to improve the recall of partially occluded objects. The module predicts a tri-layer of segmentation masks for the target object, the occluder and the occludee, and by doing so is able to better predict the mask of the target object. (2) We propose a scalable pipeline for generating training data for the module by using amodal completion of existing object detection and instance segmentation training datasets to establish occlusion relationships. (3) We also establish a COCO evaluation dataset to measure the recall performance of partially occluded and separated objects. (4) We show that the plugin module inserted into a two-stage detector can boost the performance significantly, by only fine-tuning the detection head, and with additional improvements if the entire architecture is fine-tuned. COCO results are reported for Mask R-CNN with Swin-T or Swin-S backbones, and Cascade Mask R-CNN with a Swin-B backbone.
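A hedged sketch of the tri-layer prediction described above: from per-RoI features, a mask head predicts three masks (occluder, target, occludee) so the target mask can be reasoned about together with what hides it and what it hides. The convolutional trunk is an illustrative stand-in for a standard two-stage mask head, not the paper's exact module.

```python
import torch
import torch.nn as nn

class TriLayerMaskHead(nn.Module):
    def __init__(self, in_ch: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU())
        self.predict = nn.Conv2d(in_ch, 3, 1)          # 3 channels: occluder / target / occludee

    def forward(self, roi_feats):
        # roi_feats: (B, C, H, W) pooled RoI features from the two-stage detector
        return self.predict(self.trunk(roi_feats))     # (B, 3, H, W) tri-layer mask logits

print(TriLayerMaskHead()(torch.randn(4, 256, 14, 14)).shape)  # torch.Size([4, 3, 14, 14])
```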

Proceedings Article
TL;DR: In this article, a self-supervised approach is proposed to decompose a video into layers, where each layer contains an object of interest along with its associated shadows, reflections, and other visual effects.
Abstract: We explore a feed-forward approach for decomposing a video into layers, where each layer contains an object of interest along with its associated shadows, reflections, and other visual effects. This problem is challenging since associated effects vary widely with the 3D geometry and lighting conditions in the scene, and ground-truth labels for visual effects are difficult (and in some cases impractical) to collect. We take a self-supervised approach and train a neural network to produce a foreground image and alpha matte from a rough object segmentation mask under a reconstruction and sparsity loss. Under reconstruction loss, the layer decomposition problem is underdetermined: many combinations of layers may reconstruct the input video. Inspired by the game theory concept of focal points (or Schelling points), we pose the problem as a coordination game, where each player (network) predicts the effects for a single object without knowledge of the other players' choices. The players learn to converge on the "natural" layer decomposition in order to maximize the likelihood of their choices aligning with the other players'. We train the network to play this game with itself, and show how to design the rules of this game so that the focal point lies at the correct layer decomposition. We demonstrate feed-forward results on a challenging synthetic dataset, then show that pretraining on this dataset significantly reduces optimization time for real videos.
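A hedged reading of the reconstruction and sparsity objectives mentioned above: predicted RGB layers and alpha mattes are composited back to front, the composite is compared to the input frame, and an L1 penalty keeps the alphas sparse. The "over" compositing order, loss weights, and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def layer_losses(pred_rgb_layers: torch.Tensor,   # (L, 3, H, W) predicted colour layers, back to front
                 pred_alphas: torch.Tensor,       # (L, 1, H, W) predicted alpha mattes in [0, 1]
                 frame: torch.Tensor,             # (3, H, W) input video frame
                 sparsity_weight: float = 0.1):
    composite = torch.zeros_like(frame)
    for rgb, alpha in zip(pred_rgb_layers, pred_alphas):
        composite = alpha * rgb + (1 - alpha) * composite   # simple back-to-front "over" compositing
    recon = F.l1_loss(composite, frame)                     # reconstruction term
    sparsity = pred_alphas.abs().mean()                     # L1 sparsity on the alpha mattes
    return recon + sparsity_weight * sparsity

print(layer_losses(torch.rand(3, 3, 64, 64), torch.rand(3, 1, 64, 64), torch.rand(3, 64, 64)))
```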

Proceedings ArticleDOI
Wentao Liu, Chao Ma, Yu-Hao Yang, Weidi Xie, Ya Zhang 
20 Aug 2022
TL;DR: A novel Transformer-based architecture for interactive segmentation (TIS) is proposed that treats the refinement task as a procedure for grouping pixels with features similar to the clicks given by the end users, and allows the user to edit masks for an arbitrary number of categories.
Abstract: The goal of this paper is to interactively refine the automatic segmentation on challenging structures that fall behind human performance, either due to the scarcity of available annotations or the difficult nature of the problem itself, for example, on segmenting cancer or small organs. Specifically, we propose a novel Transformer-based architecture for Interactive Segmentation (TIS), which treats the refinement task as a procedure for grouping pixels with features similar to the clicks given by the end users. Our proposed architecture is composed of Transformer Decoder variants, which naturally fulfil feature comparison with the attention mechanisms. In contrast to existing approaches, our proposed TIS is not limited to binary segmentation, and allows the user to edit masks for an arbitrary number of categories. To validate the proposed approach, we conduct extensive experiments on three challenging datasets and demonstrate superior performance over the existing state-of-the-art methods. The project page is: https://wtliu7.github.io/tis/.
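A simplified, hedged reading of the click-based grouping described above: features at the user-clicked pixels act as category prototypes, every pixel is scored by its similarity to the clicks, and the scores are turned into a per-pixel soft assignment over categories. The similarity aggregation, temperature, and shapes are illustrative assumptions, not the TIS architecture itself.

```python
import torch
import torch.nn.functional as F

def refine_by_clicks(pixel_feats: torch.Tensor,   # (H*W, D) per-pixel features
                     click_feats: torch.Tensor,   # (K, D) features at the user-clicked pixels
                     click_labels: torch.Tensor,  # (K,) category index of each click
                     num_classes: int,
                     temperature: float = 0.1):
    sims = F.normalize(pixel_feats, dim=-1) @ F.normalize(click_feats, dim=-1).t()  # (H*W, K)
    class_scores = torch.full((pixel_feats.size(0), num_classes), float('-inf'))
    for c in range(num_classes):
        mask = click_labels == c
        if mask.any():
            # a pixel's score for a category is its best similarity to that category's clicks
            class_scores[:, c] = sims[:, mask].max(dim=1).values / temperature
    return class_scores.softmax(dim=-1)            # (H*W, num_classes) soft mask assignment

soft = refine_by_clicks(torch.randn(4096, 64), torch.randn(5, 64), torch.tensor([0, 0, 1, 2, 2]), 3)
print(soft.shape)  # torch.Size([4096, 3])
```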

Proceedings Article
14 Jun 2022
TL;DR: In this article, a transformer-based framework for directly processing signals in k-space was proposed, where spatial coordinates were treated as inputs and sparsely sampled points were dynamically queried to reconstruct the spectrogram.
Abstract: This paper considers the problem of undersampled MRI reconstruction. We propose a novel Transformer-based framework for directly processing signals in k-space, going beyond the limitation of regular grids as ConvNets do. We adopt an implicit representation of the k-space spectrogram, treating spatial coordinates as inputs, and dynamically query the sparsely sampled points to reconstruct the spectrogram, i.e. learning the inductive bias in k-space. To strike a balance between computational cost and reconstruction quality, we build the decoder with a hierarchical structure to generate low-resolution and high-resolution outputs respectively. To validate the effectiveness of our proposed method, we have conducted extensive experiments on two public datasets, and demonstrate superior or comparable performance to state-of-the-art approaches.