Author

Xingxuan Zhang

Other affiliations: Shanghai Jiao Tong University
Bio: Xingxuan Zhang is an academic researcher from Tsinghua University. The author has contributed to research in topics including Computer science and Generalization. The author has an h-index of 4, having co-authored 9 publications receiving 47 citations. Previous affiliations of Xingxuan Zhang include Shanghai Jiao Tong University.

Papers
Proceedings ArticleDOI
Xingxuan Zhang, Peng Cui, Renzhe Xu, Linjun Zhou, Yue He, Zheyan Shen
01 Jun 2021
TL;DR: In this paper, the authors propose to remove the dependencies between features via learning weights for training samples, which helps deep models get rid of spurious correlations and, in turn, concentrate more on the true connection between discriminative features and labels.
Abstract: Approaches based on deep neural networks have achieved striking performance when testing data and training data share similar distribution, but can significantly fail otherwise. Therefore, eliminating the impact of distribution shifts between training and testing data is crucial for building performance-promising deep models. Conventional methods assume either the known heterogeneity of training data (e.g. domain labels) or the approximately equal capacities of different domains. In this paper, we consider a more challenging case where neither of the above assumptions holds. We propose to address this problem by removing the dependencies between features via learning weights for training samples, which helps deep models get rid of spurious correlations and, in turn, concentrate more on the true connection between discriminative features and labels. Through extensive experiments on distribution generalization benchmarks including PACS, VLCS, MNIST-M, and NICO, we demonstrate the effectiveness of our method compared with state-of-the-art counterparts.
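The reweighting idea can be sketched in a few lines. Below is a minimal, hypothetical sketch, not the paper's actual implementation: sample weights are learned to suppress linear dependencies between feature dimensions, and the task loss is then reweighted accordingly (the paper uses a more general independence measure). All names and the two-optimizer loop are illustrative assumptions.

```python
import torch

def weighted_decorrelation_loss(features, weight_logits):
    """Penalize pairwise linear dependence between feature dimensions
    under the learned sample weights (a linear-correlation proxy only)."""
    w = torch.softmax(weight_logits, dim=0)          # (N,) weights sum to 1
    mean = (w[:, None] * features).sum(dim=0)        # weighted feature mean
    centered = features - mean
    cov = (w[:, None] * centered).t() @ centered     # weighted covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum()

def training_step(model, x, y, weight_logits, opt_w, opt_net):
    feats, logits = model(x)                         # assumed to return both
    # Step 1: update the sample weights so features decorrelate.
    opt_w.zero_grad()
    weighted_decorrelation_loss(feats.detach(), weight_logits).backward()
    opt_w.step()
    # Step 2: train the network on the reweighted task loss.
    w = torch.softmax(weight_logits.detach(), dim=0)
    per_sample = torch.nn.functional.cross_entropy(logits, y, reduction="none")
    loss = (w * per_sample).sum()
    opt_net.zero_grad()
    loss.backward()
    opt_net.step()
    return loss.item()
```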

113 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A Temporal Focal block that sufficiently describes short-range dependencies and a Spatio-Temporal Fusion Module (STFM) that maintains local spatial information while reducing feature dimensions are proposed.
Abstract: Current state-of-the-art approaches for lip reading are based on sequence-to-sequence architectures that are designed for natural machine translation and audio speech recognition. Hence, these methods do not fully exploit the characteristics of the lip dynamics, causing two main drawbacks. First, the short-range temporal dependencies, which are critical to the mapping from lip images to visemes, receive no extra attention. Second, local spatial information is discarded in the existing sequence models due to the use of global average pooling (GAP). To address these drawbacks, we propose a Temporal Focal block to sufficiently describe short-range dependencies and a Spatio-Temporal Fusion Module (STFM) to maintain the local spatial information and to reduce the feature dimensions as well. Experimental results demonstrate that our method achieves comparable performance with the state-of-the-art approach using much less training data and a much lighter convolutional feature extractor. The training time is reduced by 12 days due to the convolutional structure and the local self-attention mechanism.
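A hypothetical sketch of the two ideas follows; the paper's exact block designs differ, and all layer shapes here are illustrative assumptions. The first module applies a small-kernel temporal convolution to emphasize short-range dependencies; the second replaces global average pooling with a strided convolution that keeps local spatial structure.

```python
import torch.nn as nn

class TemporalFocalBlock(nn.Module):
    """Sketch: a depthwise 1D convolution over time to model the
    short-range dependencies that map lip frames to visemes."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):            # x: (batch, time, channels)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + y)      # residual keeps the long-range path intact

class SpatioTemporalFusion(nn.Module):
    """Sketch of an STFM-style alternative to GAP: a strided convolution
    reduces spatial size while retaining local spatial structure."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, x):            # x: (batch*time, C, H, W)
        return self.reduce(x).flatten(1)   # keep local layout, don't average it away
```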

66 citations

Journal ArticleDOI
TL;DR: A large-scale benchmark with extensive labeled domains named NICO++ is proposed along with more rational evaluation methods for comprehensively evaluating DG algorithms; the authors prove that limited concept shift and significant covariate shift favor the evaluation capability for generalization.
Abstract: Despite the remarkable performance that modern deep neural networks have achieved on independent and identically distributed (I.I.D.) data, they can crash under distribution shifts. Most current evaluation methods for domain generalization (DG) adopt the leave-one-out strategy as a compromise on the limited number of domains. We propose a large-scale benchmark with extensive labeled domains named NICO++ along with more rational evaluation methods for comprehensively evaluating DG algorithms. To evaluate DG datasets, we propose two metrics to quantify covariate shift and concept shift, respectively. Two novel generalization bounds from the perspective of data construction are proposed to prove that limited concept shift and significant covariate shift favor the evaluation capability for generalization. Through extensive experiments, NICO++ shows its superior evaluation capability compared with current DG datasets and its contribution to alleviating the unfairness caused by the leakage of oracle knowledge in model selection.
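The two dataset metrics can be approximated with simple proxies. The sketch below is an assumption-laden illustration, not the paper's definitions: covariate shift is proxied by a Fréchet-style distance between Gaussian fits of per-domain features, and concept shift by how much a reference classifier's per-class confidence differs across domains.

```python
import numpy as np

def covariate_shift(feats_a, feats_b):
    """Proxy: distance between Gaussian fits of features from two
    domains (the paper defines its own metric; this is illustrative)."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    var_a, var_b = feats_a.var(0), feats_b.var(0)
    return np.sum((mu_a - mu_b) ** 2) + np.sum((np.sqrt(var_a) - np.sqrt(var_b)) ** 2)

def concept_shift(probs_a, labels_a, probs_b, labels_b):
    """Proxy: average gap in a reference classifier's per-class
    confidence across domains, i.e. how much P(y|x) appears to move."""
    classes = np.unique(labels_a)
    gap = 0.0
    for c in classes:
        pa = probs_a[labels_a == c, c].mean()   # mean confidence, domain A
        pb = probs_b[labels_b == c, c].mean()   # mean confidence, domain B
        gap += abs(pa - pb)
    return gap / len(classes)
```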

22 citations

Proceedings ArticleDOI
09 Feb 2022
TL;DR: In this paper, the authors investigate the optimal pricing strategy of a profit-maximizing monopoly under both regulatory constraints, as well as the impact of imposing them on consumer surplus, producer surplus, and social welfare.
Abstract: Personalized pricing is a business strategy to charge different prices to individual consumers based on their characteristics and behaviors. It has become common practice in many industries nowadays due to the availability of a growing amount of high granular consumer data. The discriminatory nature of personalized pricing has triggered heated debates among policymakers and academics on how to design regulation policies to balance market efficiency and equity. In this paper, we propose two sound policy instruments, i.e., capping the range of the personalized prices or their ratios. We investigate the optimal pricing strategy of a profit-maximizing monopoly under both regulatory constraints, as well as the impact of imposing them on consumer surplus, producer surplus, and social welfare. We theoretically prove that both proposed constraints can help balance consumer surplus and producer surplus at the expense of total surplus for common demand distributions, such as uniform, logistic, and exponential distributions. Experiments on both simulation and real-world datasets demonstrate the correctness of these theoretical results. Our findings and insights shed light on regulatory policy design for the increasingly monopolized businesses in the digital era.
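The range-cap result is easy to see in a toy simulation. The parameters below (uniform valuations on [0, 1], zero marginal cost, a hypothetical band [0.3, 0.6]) are assumptions for illustration only: unregulated personalized pricing extracts all surplus for the producer, while the band shifts some surplus to consumers at the cost of lost total surplus.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.uniform(0.0, 1.0, 100_000)     # consumer valuations, zero marginal cost

def surpluses(price):
    """Consumer, producer, and total surplus per consumer when each
    consumer faces the given (possibly personalized) price."""
    buys = v >= price
    cs = np.where(buys, v - price, 0.0).mean()
    ps = np.where(buys, price, 0.0).mean()
    return cs, ps, cs + ps

# Unregulated personalized pricing: charge each consumer their valuation.
print(surpluses(v))                     # ~ (0.0, 0.5, 0.5): producer takes everything

# Range regulation: prices confined to a hypothetical band [0.3, 0.6].
capped = np.clip(v, 0.3, 0.6)           # the profit-maximizing feasible price
print(surpluses(capped))                # ~ (0.08, 0.375, 0.455): consumers gain,
                                        # but total surplus falls (low-value
                                        # consumers are priced out of the market)
```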

8 citations

Proceedings ArticleDOI
01 Oct 2018
TL;DR: A novel end-to-end method based on 3D convolutional neural networks (3DCNN) is proposed to extract discriminative spatiotemporal features from raw lip video streams, achieving better performance and higher robustness against variations caused by different speakers' poses and positions.
Abstract: Research shows that human lips can be used as a new kind of biometrics in personal identification and authentication. In this letter, a novel end-to-end method based on 3D convolutional neural networks (3DCNN) is proposed to extract discriminative spatiotemporal features from raw lip video streams. In our approach, the lip video is first divided into a series of overlapping clips. For each clip, the lip-characteristics network is proposed to characterize the minutiae of the lip region and its movement. Finally, the entire lip video is represented by a set of sub-features corresponding to each clip in it. Experiments have been performed on a dataset with 200 speakers, and the proposed method achieves a high identification accuracy of 99.18% and a very low authentication error (HTER of 0.15%). Compared with several state-of-the-art methods, our approach achieves better performance and higher robustness against variations caused by different speakers' poses and positions.
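The clip-based pipeline can be sketched as follows; layer sizes, the encoder, and the clip parameters are hypothetical, not the paper's architecture. Each overlapping clip passes through a small 3D CNN, and the video is represented by the resulting set of sub-features.

```python
import torch.nn as nn

class ClipEncoder(nn.Module):
    """Sketch: a small 3D CNN that turns one lip clip of shape
    (B, 3, T, H, W) into a fixed-size sub-feature."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, clip):                     # clip: (B, 3, T, H, W)
        return self.proj(self.net(clip).flatten(1))

def overlapping_clips(video, clip_len=16, stride=8):
    """Split a lip video (B, 3, T, H, W) into overlapping clips; the
    whole video is then represented by the per-clip sub-features."""
    T = video.shape[2]
    return [video[:, :, s:s + clip_len]
            for s in range(0, T - clip_len + 1, stride)]
```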

7 citations


Cited by
Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer) is proposed for sentence-level speech recognition.
Abstract: In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw audio waveforms and pixels, respectively, which are then fed to Conformers, after which fusion takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training (instead of using pre-computed visual features, as is common in the literature), the use of a Conformer instead of a recurrent network, and the use of a transformer-based language model significantly improve the performance of our model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
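The hybrid objective is a weighted sum of the two losses. The following is a minimal sketch under assumed tensor shapes; the weight `lam` and the padding handling are illustrative, not the paper's exact configuration.

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(enc_logits, dec_logits, targets,
                              input_lens, target_lens, lam=0.3):
    """Combine a CTC loss over the encoder outputs with a cross-entropy
    loss over the attention decoder's character predictions."""
    log_probs = F.log_softmax(enc_logits, dim=-1)          # (T, B, vocab)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    # dec_logits: (B, L, vocab); cross_entropy expects (B, vocab, L)
    ce = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return lam * ctc + (1.0 - lam) * ce
```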

121 citations

Proceedings Article
05 Jan 2022
TL;DR: This work introduces Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units.
Abstract: Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
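The masked-prediction step can be illustrated as below; `prob`, `span`, and the shapes are assumptions, not AV-HuBERT's actual hyperparameters. Random spans of frame features are overwritten with a learned mask embedding, and the model is trained with cross-entropy at the masked positions to predict the iteratively refined hidden-unit cluster IDs.

```python
import torch

def mask_spans(frames, mask_emb, prob=0.08, span=10):
    """Replace random spans of frame features (B, T, D) with a learned
    mask embedding; the prediction loss applies only where masked."""
    B, T, _ = frames.shape
    masked = frames.clone()
    is_masked = torch.zeros(B, T, dtype=torch.bool)
    starts = (torch.rand(B, T) < prob).nonzero().tolist()
    for b, t in starts:
        masked[b, t:t + span] = mask_emb          # broadcast over the span
        is_masked[b, t:t + span] = True
    return masked, is_masked
```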

85 citations

Posted Content
TL;DR: It is shown that ground truth transcriptions are not necessary to train a lip reading system and how arbitrary amounts of unlabelled video data can be leveraged to improve performance.
Abstract: The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that ground truth transcriptions are not necessary to train a lip reading system; (ii) we show how arbitrary amounts of unlabelled video data can be leveraged to improve performance; (iii) we demonstrate that distillation significantly speeds up training; and, (iv) we obtain state-of-the-art results on the challenging LRS2 and LRS3 datasets for training only on publicly available data.
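The combined distillation objective can be sketched as follows, with `alpha` and the tensor shapes as assumptions for illustration: a CTC loss against the ASR teacher's pseudo-transcripts plus a frame-wise cross-entropy that matches the teacher's per-frame posteriors.

```python
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits, teacher_posteriors,
                                  pseudo_targets, input_lens, target_lens,
                                  alpha=0.5):
    """CTC on ASR pseudo-transcripts plus frame-wise cross-entropy
    against the teacher's posteriors (shapes: student (T, B, vocab),
    teacher (B, T, vocab))."""
    log_probs = F.log_softmax(student_logits, dim=-1)      # (T, B, vocab)
    ctc = F.ctc_loss(log_probs, pseudo_targets, input_lens, target_lens)
    # Frame-wise CE: match the teacher's posterior at every frame.
    frame_ce = -(teacher_posteriors * log_probs.transpose(0, 1)).sum(-1).mean()
    return alpha * ctc + (1.0 - alpha) * frame_ce
```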

60 citations

Journal ArticleDOI
TL;DR: Domain generalization (DG) aims to achieve out-of-distribution (OOD) generalization, a capability natural to humans yet challenging for machines to reproduce, by using only source data for model learning.
Abstract: Generalization to out-of-distribution (OOD) data is a capability natural to humans yet challenging for machines to reproduce. This is because most learning algorithms strongly rely on the i.i.d. assumption on source/target data, which is often violated in practice due to domain shift. Domain generalization (DG) aims to achieve OOD generalization by using only source data for model learning. Over the last ten years, research in DG has made great progress, leading to a broad spectrum of methodologies, e.g., those based on domain alignment, meta-learning, data augmentation, or ensemble learning, to name a few; DG has also been studied in various application areas including computer vision, speech recognition, natural language processing, medical imaging, and reinforcement learning. In this paper, we provide for the first time a comprehensive literature review in DG to summarize the developments over the past decade. Specifically, we first cover the background by formally defining DG and relating it to other relevant fields like domain adaptation and transfer learning. Then, we conduct a thorough review of existing methods and theories. Finally, we conclude this survey with insights and discussions on future research directions.

52 citations

Proceedings ArticleDOI
04 May 2020
TL;DR: In this article, a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy loss was used to distill knowledge from an automatic speech recognition model into a visual speech recognition model.
Abstract: The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that ground truth transcriptions are not necessary to train a lip reading system; (ii) we show how arbitrary amounts of unlabelled video data can be leveraged to improve performance; (iii) we demonstrate that distillation significantly speeds up training; and, (iv) we obtain state-of-the-art results on the challenging LRS2 and LRS3 datasets for training only on publicly available data.

51 citations