Author

Bohan Li

Bio: Bohan Li is an academic researcher from Google. The author has contributed to research on topics including reference frames and encoders, has an h-index of 5, and has co-authored 19 publications receiving 64 citations. Previous affiliations of Bohan Li include Tsinghua University and the University of California, Santa Barbara.

Papers
Journal ArticleDOI
26 Feb 2021
TL;DR: A technical overview of the AV1 codec design that enables the compression performance gains with considerations for hardware feasibility is provided.
Abstract: The AV1 video compression format is developed by the Alliance for Open Media consortium. It achieves more than a 30% reduction in bit rate compared to its predecessor VP9 for the same decoded video quality. This article provides a technical overview of the AV1 codec design that enables the compression performance gains with considerations for hardware feasibility.

95 citations

Proceedings ArticleDOI
07 Apr 2014
TL;DR: Experimental results showed that supervised STM may improve the generalization ability of the classifier.
Abstract: Historical Chinese character recognition has been a challenging topic in the pattern recognition field because of the large character set, various writing styles, and lack of training samples. In this paper, we applied the Style Transfer Mapping (STM) method to historical Chinese character recognition. The optimal selection of parameters was discussed. Two sets of experiments were conducted. The first set was designed to test the performance of STM on different font styles, using available printed traditional Chinese characters. The second set was carried out on samples extracted from practical historical Chinese documents. Experimental results showed that supervised STM may improve the generalization ability of the classifier.
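In outline, supervised STM learns an affine map x → Ax + b that pulls features of the new (historical) style toward the source style the classifier was trained on, with regularization keeping A near the identity. A minimal numpy sketch of the closed-form fit under that formulation; the hyperparameter names and defaults are illustrative, not values from the paper:

```python
import numpy as np

def fit_stm(S, T, beta=1.0, gamma=1.0):
    """Fit an affine Style Transfer Mapping x -> A x + b that maps
    target-style features S toward source-style anchors T.

    Minimizes sum_i ||A s_i + b - t_i||^2 + beta*||A - I||_F^2 + gamma*||b||^2,
    solved jointly for [A | b] as a regularized least-squares problem."""
    n, d = S.shape
    # Augment with a bias column so [A | b] is solved in one system.
    X = np.hstack([S, np.ones((n, 1))])                 # (n, d+1)
    # Ridge-style regularizer pulling A toward identity and b toward 0.
    reg = np.diag([beta] * d + [gamma]).astype(float)   # (d+1, d+1)
    prior = np.vstack([np.eye(d) * beta, np.zeros((1, d))])
    W = np.linalg.solve(X.T @ X + reg, X.T @ T + prior)  # (d+1, d)
    A, b = W[:d].T, W[d]
    return A, b

def apply_stm(A, b, x):
    """Map a target-style feature vector into the source style."""
    return A @ x + b
```

With the mapped features, the original source-style classifier is reused unchanged, which is what allows adaptation with few historical samples.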

16 citations

Proceedings ArticleDOI
01 Apr 2022
TL;DR: AdaSpeech 4 is developed, a zero-shot adaptive TTS system for high-quality speech synthesis that achieves better voice quality and similarity than baselines in multiple datasets without any fine-tuning.
Abstract: Adaptive text to speech (TTS) can synthesize new voices in zero-shot scenarios efficiently, by using a well-trained source TTS model without adapting it on the speech data of new speakers. Considering that seen and unseen speakers have diverse characteristics, zero-shot adaptive TTS requires strong generalization ability on speaker characteristics, which brings modeling challenges. In this paper, we develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis. We model the speaker characteristics systematically to improve generalization to new speakers. Generally, the modeling of speaker characteristics can be categorized into three steps: extracting a speaker representation, taking this speaker representation as a condition, and synthesizing the speech/mel-spectrogram given this speaker representation. Accordingly, we improve the modeling in three steps: 1) To extract a speaker representation with better generalization, we factorize the speaker characteristics into basis vectors and extract the speaker representation by a weighted combination of these basis vectors through attention. 2) We leverage conditional layer normalization to integrate the extracted speaker representation into the TTS model. 3) We propose a novel supervision loss based on the distribution of the basis vectors to maintain the corresponding speaker characteristics in the generated mel-spectrograms. Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
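Steps 1) and 2) above can be sketched in a few lines of numpy: attention over a bank of learned basis vectors yields the speaker embedding, which then predicts the scale and bias of a layer norm. The shapes and weight names (`W_q`, `W_scale`, `W_bias`) are illustrative placeholders, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def speaker_representation(ref_frames, basis, W_q):
    """Attention-weighted combination of speaker basis vectors.
    ref_frames: (T, d) reference frames, averaged into one query;
    basis: (K, d) learned basis vectors; W_q: (d, d) query projection."""
    query = ref_frames.mean(axis=0) @ W_q               # (d,)
    scores = basis @ query / np.sqrt(basis.shape[1])    # (K,)
    weights = softmax(scores)                           # attention over bases
    return weights @ basis, weights                     # (d,) embedding

def conditional_layer_norm(h, spk, W_scale, W_bias, eps=1e-5):
    """Layer norm whose scale/bias are predicted from the speaker embedding."""
    mu = h.mean(-1, keepdims=True)
    var = h.var(-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)
    gamma, beta = spk @ W_scale, spk @ W_bias           # speaker-conditioned
    return gamma * h_norm + beta
```

Because new speakers are expressed as convex combinations of shared bases rather than free embeddings, the representation stays inside the space seen in training, which is the intuition behind the improved zero-shot generalization.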

16 citations

Proceedings ArticleDOI
27 Mar 2018
TL;DR: A per-pixel motion field is built that connects the two-sided reference frames using optical flow estimation and effectively accounts for the true motion trajectories in the video signal including both translational and the more complex non-translational motion models, which are beyond the capability of the conventional block-based motion compensated prediction.
Abstract: The hierarchical coding structure that supports bi-directional motion compensated prediction is commonly used for video compression efficiency. The conventional approach directly seeks the reference pixel block in each individual reference frame and uses it, or a linear combination of such blocks, for prediction; it largely ignores the motion information between these reference frames. To fully utilize all the information from the bi-directional reference frames, this work builds a per-pixel motion field that connects the two-sided reference frames using optical flow estimation. A reference frame is then interpolated at the current frame location. This collocated reference frame effectively accounts for the true motion trajectories in the video signal, including both translational and the more complex non-translational motion models, which are beyond the capability of conventional block-based motion compensated prediction. The scheme is experimentally shown to provide substantial compression performance gains. A number of optimization designs are proposed to keep the codec complexity feasible while largely maintaining the coding performance.
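The core interpolation step can be illustrated with a toy sketch: assuming linear motion along each per-pixel trajectory, a pixel of the frame at time t is predicted by sampling one reference a fraction t backwards and the other a fraction (1 − t) forwards along the same flow vector. Nearest-neighbour sampling and a plain average are simplifications for illustration, not the codec's actual scheme:

```python
import numpy as np

def interpolate_collocated(ref0, ref1, flow, t=0.5):
    """Build a collocated reference at time t between ref0 (t=0) and
    ref1 (t=1). flow[y, x] = (dy, dx) is the per-pixel motion from ref0
    to ref1 along the assumed-linear trajectory through (y, x)."""
    h, w = ref0.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Sample ref0 a fraction t back, and ref1 a fraction (1-t) forward,
    # along the trajectory; clip to stay inside the frame.
    y0 = np.clip(np.rint(ys - t * flow[..., 0]).astype(int), 0, h - 1)
    x0 = np.clip(np.rint(xs - t * flow[..., 1]).astype(int), 0, w - 1)
    y1 = np.clip(np.rint(ys + (1 - t) * flow[..., 0]).astype(int), 0, h - 1)
    x1 = np.clip(np.rint(xs + (1 - t) * flow[..., 1]).astype(int), 0, w - 1)
    # Average the two motion-compensated predictions.
    return 0.5 * (ref0[y0, x0] + ref1[y1, x1])
```

Because the flow is estimated between the two already-decoded references, the decoder can reconstruct the same collocated frame without any extra motion bits, which is what makes the scheme attractive for compression.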

12 citations

Proceedings ArticleDOI
07 Apr 2014
TL;DR: A novel method for historical Chinese character segmentation based on a graph model is presented; it is effective, with a detection rate of 94.6% and an accuracy rate of 96.1% on a test set of practical historical Chinese document samples.
Abstract: Historical Chinese document recognition technology is important for digital libraries. However, historical Chinese character segmentation remains a difficult problem due to the complex structure of Chinese characters and various writing styles. This paper presents a novel method for historical Chinese character segmentation based on a graph model. After a preliminary over-segmentation stage, the system applies a merging process. The candidate segmentation positions are denoted by the nodes of a graph, and the merging process is regarded as selecting an optimal path through the graph. The weight of each edge in the graph is calculated by a cost function that considers geometric features and recognition confidence. Experimental results show that the proposed method is effective, with a detection rate of 94.6% and an accuracy rate of 96.1% on a test set of practical historical Chinese document samples.
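The optimal-path formulation above reduces to a shortest-path/dynamic-programming problem over the candidate cut positions. A minimal sketch; the cost function here is a caller-supplied stand-in for the paper's combination of geometric features and recognizer confidence:

```python
import numpy as np

def segment_by_path(cuts, cost):
    """Choose character boundaries as the min-cost path through candidate cuts.

    cuts: sorted candidate segmentation positions (the graph nodes).
    cost(i, j): price of merging the over-segmented pieces between
    cuts[i] and cuts[j] into one character (lower is better)."""
    n = len(cuts)
    best = [np.inf] * n      # best[j]: cheapest path from node 0 to node j
    prev = [-1] * n
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):   # edge i -> j: treat span cuts[i]..cuts[j] as one char
            c = best[i] + cost(i, j)
            if c < best[j]:
                best[j], prev[j] = c, i
    # Walk back from the last node to recover the chosen boundaries.
    path, j = [n - 1], n - 1
    while prev[j] != -1:
        j = prev[j]
        path.append(j)
    return [cuts[k] for k in reversed(path)]
```

Because the graph is a DAG ordered left to right, this runs in O(n^2) and is guaranteed to find the globally cheapest segmentation under the given cost.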

7 citations


Cited by
Journal Article
TL;DR: In this paper, the authors studied the problem of feedback stabilization over a signal-to-noise ratio (SNR) constrained channel and showed that for either state feedback, or for output feedback delay-free, minimum phase plants, there are limitations on the ability to stabilize an unstable plant over an SNR constrained channel.
Abstract: There has recently been significant interest in feedback stabilization problems over communication channels, including several with bit rate limited feedback. Motivated by considering one source of such bit rate limits, we study the problem of stabilization over a signal-to-noise ratio (SNR) constrained channel. We discuss both continuous and discrete time cases, and show that for either state feedback, or for output feedback delay-free, minimum phase plants, there are limitations on the ability to stabilize an unstable plant over an SNR constrained channel. These limitations in fact match precisely those that might have been inferred by considering the associated ideal Shannon capacity bit rate over the same channel.
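Concretely, the discrete-time version of this limitation is usually stated as follows, with $\phi_i$ the unstable plant poles over an additive white Gaussian noise channel (a paraphrase of the standard result, not a quotation from the paper):

```latex
\text{stabilizable} \iff \mathrm{SNR} > \prod_{i} |\phi_i|^2 - 1,
\qquad\text{equivalently}\qquad
C = \tfrac{1}{2}\log_2\!\bigl(1+\mathrm{SNR}\bigr) > \sum_{i} \log_2 |\phi_i| ,
```

where $C$ is the Shannon capacity of the channel. The right-hand inequality is exactly the minimum bit-rate condition of the data-rate theorem, which is the sense in which the SNR limitation "matches" the ideal Shannon capacity bound.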

379 citations

Journal ArticleDOI
TL;DR: In this article, a new adaptation layer is proposed to reduce the mismatch between training and test data on a particular source layer, and the adaptation process can be efficiently and effectively implemented in an unsupervised manner.

232 citations

Proceedings ArticleDOI
23 Aug 2015
TL;DR: This paper considers page segmentation as a pixel labeling problem, i.e., each pixel is classified as either periphery, background, text block, or decoration, and applies convolutional autoencoders to learn features directly from pixel intensity values.
Abstract: In this paper, we present an unsupervised feature learning method for page segmentation of historical handwritten documents available as color images. We consider page segmentation as a pixel labeling problem, i.e., each pixel is classified as either periphery, background, text block, or decoration. Traditional methods in this area rely on carefully hand-crafted features or large amounts of prior knowledge. In contrast, we apply convolutional autoencoders to learn features directly from pixel intensity values. Then, using these features to train an SVM, we achieve high quality segmentation without any assumption of specific topologies and shapes. Experiments on three public datasets demonstrate the effectiveness and superiority of the proposed approach.
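The pipeline (learn features from raw pixels, then classify each pixel) can be sketched compactly. Two simplifications are made for illustration: the convolutional autoencoder is replaced by its linear analogue, a closed-form principal-subspace encoder (what a tied-weight linear autoencoder converges to), and the SVM is replaced by a nearest-centroid classifier:

```python
import numpy as np

def fit_linear_encoder(patches, k=4):
    """Closed-form principal-subspace encoder on raw pixel patches:
    a linear stand-in for the paper's convolutional autoencoder.
    patches: (n, d) flattened pixel patches. Returns W with encode p -> W @ p."""
    centered = patches - patches.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k]

def classify(features, centroids):
    """Nearest-centroid pixel labeling (the paper trains an SVM instead).
    features: (n, k) encoded patches; centroids: (c, k), one per class,
    e.g. periphery / background / text block / decoration."""
    d2 = ((features[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```

The point of the design carries over: features are learned from pixel intensities alone, so no hand-crafted descriptors or layout assumptions are needed before the supervised classifier is trained.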

110 citations

Journal ArticleDOI
TL;DR: In this article, a generalized discrete cosine transform with three parameters was proposed and its orthogonality proved for some new cases; a new type of discrete cosine transform and a generalized discrete W transform were also proposed.
Abstract: The discrete cosine transform (DCT), introduced by Ahmed, Natarajan and Rao, has been used in many applications of digital signal processing, data compression and information hiding. There are four types of the discrete cosine transform. In simulating the discrete cosine transform, we propose a generalized discrete cosine transform with three parameters, and prove its orthogonality for some new cases. A new type of discrete cosine transform is proposed and its orthogonality is proved. Finally, we propose a generalized discrete W transform with three parameters, and prove its orthogonality for some new cases. Keywords: discrete Fourier transform, discrete sine transform, discrete cosine transform, discrete W transform. Nigerian Journal of Technological Research, vol. 7(1), 2012.
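The orthogonality property being generalized can be checked numerically for the standard case. The sketch below builds the orthonormal DCT-II matrix and verifies that its rows are orthonormal; the paper's three-parameter generalization is not reproduced here:

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II matrix: C[k, m] = s_k * cos(pi * (m + 0.5) * k / n),
    with s_0 = sqrt(1/n) and s_k = sqrt(2/n) for k > 0."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.cos(np.pi * (m + 0.5) * k / n)
    C[0] *= np.sqrt(1.0 / n)      # DC row gets the smaller scale factor
    C[1:] *= np.sqrt(2.0 / n)
    return C
```

Orthogonality (C Cᵀ = I) is what makes the inverse transform simply the transpose, which is why it is the property each proposed generalization must preserve.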

79 citations

Journal ArticleDOI
TL;DR: In this paper, a neural codec language model called Vall-E was proposed to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.
Abstract: We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
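The "TTS as conditional language modeling" framing can be shown with a toy sketch: the enrolled prompt is just a prefix of discrete codec tokens, and synthesis continues the sequence token by token. A bigram table stands in here for the trained transformer, and the real system predicts multi-codebook tokens conditioned on text as well; this only illustrates the decoding loop:

```python
import numpy as np

def sample_codes(prompt, transition, length, rng):
    """Continue a prompt of discrete codec tokens autoregressively.

    prompt: list of token ids from the enrolled recording (the prefix).
    transition: (V, V) toy next-token distribution table, a stand-in for
    the conditional LM; row t is P(next | current = t)."""
    codes = list(prompt)
    for _ in range(length):
        p = transition[codes[-1]]
        codes.append(int(rng.choice(len(p), p=p)))  # sample next codec token
    return codes
```

Speaker identity, emotion, and acoustic environment are carried implicitly by the prompt tokens, which is why a 3-second enrollment suffices: conditioning is in-context rather than via fine-tuned weights.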

57 citations