
Showing papers by "Hang Zhao published in 2023"



Journal ArticleDOI
TL;DR: Tang et al. collected the first RGB-Thermal dataset for human motion analysis, dubbed Thermal-IM, and developed a three-stage neural network model for accurate past human pose estimation.
Abstract: Inferring past human motion from RGB images is challenging due to the inherent uncertainty of the prediction problem. Thermal images, on the other hand, encode traces of past human-object interactions left in the environment via thermal radiation measurement. Based on this observation, we collect the first RGB-Thermal dataset for human motion analysis, dubbed Thermal-IM. Then we develop a three-stage neural network model for accurate past human pose estimation. Comprehensive experiments show that thermal cues significantly reduce the ambiguities of this task, and the proposed model achieves remarkable performance. The dataset is available at https://github.com/ZitianTang/Thermal-IM.

Journal ArticleDOI
TL;DR: A symbolic memory framework is proposed to augment large language models for complex multi-hop reasoning; it is instantiated as an LLM paired with a set of SQL databases, where the LLM generates SQL instructions to manipulate the databases.
Abstract: Large language models (LLMs) with memory are computationally universal. However, mainstream LLMs are not taking full advantage of memory, and the designs are heavily influenced by biological brains. Due to their approximate nature and proneness to the accumulation of errors, conventional neural memory mechanisms cannot support LLMs to simulate complex reasoning. In this paper, we seek inspiration from modern computer architectures to augment LLMs with symbolic memory for complex multi-hop reasoning. Such a symbolic memory framework is instantiated as an LLM and a set of SQL databases, where the LLM generates SQL instructions to manipulate the SQL databases. We validate the effectiveness of the proposed memory framework on a synthetic dataset requiring complex reasoning. The project website is available at https://chatdatabase.github.io/ .
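The read-write loop described above can be sketched minimally: a language model emits SQL instructions, and a database executes them as the model's external symbolic memory. The `fake_llm` function and the `facts` table schema below are illustrative assumptions, not the paper's implementation.

```python
import sqlite3

def fake_llm(instruction):
    # Stand-in for an LLM that translates natural-language commands into
    # SQL (illustrative only; a real system would query a model instead).
    if instruction == "remember Alice is 30":
        return "INSERT INTO facts (name, age) VALUES ('Alice', 30)"
    if instruction == "how old is Alice?":
        return "SELECT age FROM facts WHERE name = 'Alice'"
    raise ValueError("unknown instruction")

def run(instruction, conn):
    # Execute the LLM's SQL output against the symbolic memory.
    cur = conn.execute(fake_llm(instruction))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, age INTEGER)")
run("remember Alice is 30", conn)
print(run("how old is Alice?", conn))  # [(30,)]
```

Because the memory is exact rather than approximate, retrieved facts do not accumulate errors across reasoning hops, which is the motivation the abstract gives for moving away from neural memory.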

Journal ArticleDOI
TL;DR: This paper proposes choosing a targeted late-fusion learning method for a given supervised multi-modal task from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features.
Abstract: We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on the basis of ensuring uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon does hurt the model's generalization ability. To this end, we propose to choose a targeted late-fusion learning method for the given supervised multi-modal task from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
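The UME branch admits a minimal sketch: uni-modal networks are trained independently, and their output logits are simply averaged at inference time. The toy logit arrays below are illustrative, not taken from the paper.

```python
import numpy as np

def uni_modal_ensemble(logits_per_modality):
    # Uni-Modal Ensemble (UME): each modality's network is trained on its
    # own, and their class logits are averaged at test time, so uni-modal
    # feature learning is never disturbed by joint training.
    return np.mean(logits_per_modality, axis=0)

audio_logits = np.array([2.0, 0.5, -1.0])  # toy 3-class outputs
video_logits = np.array([1.0, 1.5, -0.5])
fused = uni_modal_ensemble([audio_logits, video_logits])
print(fused)  # [ 1.5   1.   -0.75]
```

UMT, by contrast, distills uni-modal representations into the fused model during training; the abstract's guiding strategy picks between the two based on how much of the signal lives in paired features.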

Journal ArticleDOI
TL;DR: A dynamic space-time dark level correction approach is proposed to address fluctuation of the dark level; it employs cold-space signals in the space and time dimensions to estimate the dark level for each frame individually and reduce errors due to environmental variations.
Abstract: Lunar radiometric calibration is used to solve the problem of consistent radiometric calibration for multiple satellite platforms and remote sensors. However, the dark level fluctuates when observing the Moon with a short-wave infrared spectrometer, which seriously affects the accuracy of lunar radiation data. In this work, we propose a dynamic space-time dark level correction approach to address the fluctuation of the dark level. This method employs cold space signals in space and time dimensions to estimate the dark level for each frame individually and to reduce errors due to environmental variations. Experiments on lunar observations at multiple phase angles were conducted, and the dark level correction results demonstrate that our proposed method is effective even in the short-wave infrared, and is also superior to currently existing techniques. For a single-band (1700 nm) image of the full Moon, the mean background proportion of the proposed method is 1.00%, which is better than that of the static dark correction method (2.25%) and linear dark correction method (5.93%).
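The space-time idea can be sketched as follows: for each frame, the dark level is estimated from pixels that view cold space (the spatial dimension), and those per-frame estimates are smoothed over neighboring frames (the temporal dimension) before subtraction. The column indices, window size, and moving-average smoother are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def dynamic_dark_correction(frames, cold_cols, window=3):
    # frames: (num_frames, rows, cols) detector readings.
    # cold_cols: column indices that view cold space (no lunar signal).
    # Spatial step: mean cold-space signal per frame.
    dark_per_frame = frames[:, :, cold_cols].mean(axis=(1, 2))
    # Temporal step: moving average over neighboring frames.
    kernel = np.ones(window) / window
    smoothed = np.convolve(dark_per_frame, kernel, mode="same")
    # Subtract the per-frame dark level estimate.
    return frames - smoothed[:, None, None]
```

A static correction would subtract one dark level for the whole sequence; estimating it per frame is what absorbs the environmental drift the abstract describes.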

TL;DR: A self-supervised autoregressive representation learning strategy (RandSAC) is proposed for visual feature learning, in which image tokens within a segment are predicted in parallel, similar to BERT, while predictions across segments are sequential, similar to GPT.
Abstract: Inspired by the success of self-supervised autoregressive representation learning in natural language (GPT and its variants), and advances in recent visual architecture design with Vision Transformers (ViTs), in this paper, we explore the effect various design choices have on the success of applying such training strategies for visual feature learning. Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC). In RandSAC, we group patch representations (image tokens) into hierarchically arranged segments; within each segment, tokens are predicted in parallel, similar to BERT, while across-segment predictions are sequential, similar to GPT. We illustrate that randomized serialization of the segments significantly improves performance and results in a distribution over spatially long (across-segment) and short (within-segment) predictions that is effective for feature learning. We illustrate the pertinence of these design choices and explore alternatives on a number of datasets (e.g., CIFAR10, CIFAR100, ImageNet). While our pre-training strategy works with a vanilla Transformer, we also propose a conceptually simple but highly effective addition to the decoder that allows learnable skip-connections to the encoder's feature layers, which further improves performance.
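The randomized serialization can be sketched in a few lines: token indices are grouped into segments, and the segment order is shuffled, so each segment is predicted autoregressively while its tokens share one parallel step. Contiguous fixed-size segments are an illustrative simplification; the paper arranges segments hierarchically.

```python
import random

def randsac_order(num_tokens, segment_size, seed=None):
    # Group image-token indices into contiguous segments, then randomly
    # permute the order in which segments are predicted. Tokens within a
    # segment share one prediction step (parallel, BERT-like); segments
    # are predicted sequentially (autoregressive, GPT-like).
    rng = random.Random(seed)
    segments = [list(range(i, min(i + segment_size, num_tokens)))
                for i in range(0, num_tokens, segment_size)]
    rng.shuffle(segments)
    return segments

print(randsac_order(8, 2, seed=0))
```

Resampling the serialization each step mixes long-range (across-segment) and short-range (within-segment) prediction targets, which the abstract credits for the improved features.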

Journal ArticleDOI
TL;DR: Diff-Foley adopts contrastive audio-visual pretraining (CAVP) to learn more temporally and semantically aligned features, then trains an LDM with CAVP-aligned visual features on a spectrogram latent space.
Abstract: The Video-to-Audio (V2A) model has recently gained attention for its practical application in generating audio directly from silent videos, particularly in video/film production. However, previous V2A methods have limited generation quality in terms of temporal synchronization and audio-visual relevance. We present Diff-Foley, a synchronized Video-to-Audio synthesis method based on a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. We adopt contrastive audio-visual pretraining (CAVP) to learn more temporally and semantically aligned features, then train an LDM with CAVP-aligned visual features on a spectrogram latent space. The CAVP-aligned features enable the LDM to capture subtler audio-visual correlations via a cross-attention module. We further significantly improve sample quality with 'double guidance'. Diff-Foley achieves state-of-the-art V2A performance on a current large-scale V2A dataset. Furthermore, we demonstrate Diff-Foley's practical applicability and generalization capabilities via downstream finetuning. Project page: https://diff-foley.github.io/
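Combining two guidance signals at sampling time can be sketched generically: classifier-free guidance steers the noise prediction toward the visual condition, and a gradient from an alignment model steers it further. The weights, the gradient sign convention, and the function shape below are illustrative assumptions about how such a combination looks, not Diff-Foley's exact formulation.

```python
import numpy as np

def doubly_guided_eps(eps_uncond, eps_cond, align_grad, w_cfg=4.5, w_cg=1.0):
    # Classifier-free guidance: extrapolate from the unconditional to the
    # conditional noise prediction.
    eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)
    # Additional gradient guidance from an audio-visual alignment model
    # nudges samples toward better synchronization (sign is illustrative).
    return eps - w_cg * align_grad
```

With `w_cfg=1` and a zero alignment gradient this reduces to ordinary conditional sampling, which makes the two guidance terms easy to ablate independently.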

Journal ArticleDOI
TL;DR: SparseViT revisits activation sparsity for recent window-based vision transformers, achieving speedups of 1.5x, 1.4x, and 1.3x over its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation.
Abstract: High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering their use in latency-sensitive applications. As not all pixels are equal, skipping computation for less-important regions offers a simple and effective way to reduce it. This, however, is hard to translate into actual speedup for CNNs, since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., ~50% latency reduction with 60% sparsity. Different layers should be assigned different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
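The layerwise search can be sketched as a small evolutionary loop over per-layer sparsity ratios: keep the best configurations, mutate one layer's ratio at a time, and score candidates with a fitness function (e.g. proxy accuracy under a latency budget). Everything beyond that loop shape, including the ratio choices and the toy fitness, is an illustrative assumption rather than SparseViT's actual search.

```python
import random

def evolve_sparsity(num_layers, fitness, ratios=(0.0, 0.3, 0.5, 0.7),
                    pop=8, generations=20, seed=0):
    # Toy evolutionary search for a per-layer sparsity configuration.
    rng = random.Random(seed)
    population = [tuple(rng.choice(ratios) for _ in range(num_layers))
                  for _ in range(pop)]
    for _ in range(generations):
        # Selection: keep the top half by fitness.
        population.sort(key=fitness, reverse=True)
        parents = population[:pop // 2]
        # Mutation: each parent spawns a child with one layer's ratio changed.
        children = []
        for parent in parents:
            child = list(parent)
            child[rng.randrange(num_layers)] = rng.choice(ratios)
            children.append(tuple(child))
        population = parents + children
    return max(population, key=fitness)

# Toy fitness: reward total sparsity, heavily penalize any layer above 0.5
# (standing in for an accuracy drop at aggressive pruning).
best = evolve_sparsity(4, lambda c: sum(c) - 10 * sum(r > 0.5 for r in c))
print(best)
```

Searching per layer matters because, as the abstract notes, layers differ in both sensitivity to pruning and computational cost, so a single global ratio is rarely optimal.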