Author

Pong C. Yuen

Bio: Pong C. Yuen is an academic researcher from Hong Kong Baptist University. The author has contributed to research in topics: Facial recognition system & Feature extraction. The author has an h-index of 52 and has co-authored 257 publications receiving 9001 citations. Previous affiliations of Pong C. Yuen include Southwest Baptist University & Chinese Academy of Sciences.


Papers
Book ChapterDOI
Matej Kristan1, Ales Leonardis2, Jiří Matas3, Michael Felsberg4, Roman Pflugfelder5, Luka Cehovin1, Tomas Vojir3, Gustav Häger4, Alan Lukežič1, Gustavo Fernandez5, Abhinav Gupta6, Alfredo Petrosino7, Alireza Memarmoghadam8, Alvaro Garcia-Martin9, Andres Solis Montero10, Andrea Vedaldi11, Andreas Robinson4, Andy J. Ma12, Anton Varfolomieiev13, A. Aydin Alatan14, Aykut Erdem15, Bernard Ghanem16, Bin Liu, Bohyung Han17, Brais Martinez18, Chang-Ming Chang19, Changsheng Xu20, Chong Sun21, Daijin Kim17, Dapeng Chen22, Dawei Du20, Deepak Mishra23, Dit-Yan Yeung24, Erhan Gundogdu25, Erkut Erdem15, Fahad Shahbaz Khan4, Fatih Porikli26, Fatih Porikli27, Fei Zhao20, Filiz Bunyak28, Francesco Battistone7, Gao Zhu27, Giorgio Roffo29, Gorthi R. K. Sai Subrahmanyam23, Guilherme Sousa Bastos30, Guna Seetharaman31, Henry Medeiros32, Hongdong Li27, Honggang Qi20, Horst Bischof33, Horst Possegger33, Huchuan Lu21, Hyemin Lee17, Hyeonseob Nam34, Hyung Jin Chang35, Isabela Drummond30, Jack Valmadre11, Jae-chan Jeong36, Jaeil Cho36, Jae-Yeong Lee36, Jianke Zhu37, Jiayi Feng20, Jin Gao20, Jin-Young Choi, Jingjing Xiao2, Ji-Wan Kim36, Jiyeoup Jeong, João F. Henriques11, Jochen Lang10, Jongwon Choi, José M. Martínez9, Junliang Xing20, Junyu Gao20, Kannappan Palaniappan28, Karel Lebeda38, Ke Gao28, Krystian Mikolajczyk35, Lei Qin20, Lijun Wang21, Longyin Wen19, Luca Bertinetto11, Madan Kumar Rapuru23, Mahdieh Poostchi28, Mario Edoardo Maresca7, Martin Danelljan4, Matthias Mueller16, Mengdan Zhang20, Michael Arens, Michel Valstar18, Ming Tang20, Mooyeol Baek17, Muhammad Haris Khan18, Naiyan Wang24, Nana Fan39, Noor M. Al-Shakarji28, Ondrej Miksik11, Osman Akin15, Payman Moallem8, Pedro Senna30, Philip H. S. Torr11, Pong C. Yuen12, Qingming Huang39, Qingming Huang20, Rafael Martin-Nieto9, Rengarajan Pelapur28, Richard Bowden38, Robert Laganiere10, Rustam Stolkin2, Ryan Walsh32, Sebastian B. Krah, Shengkun Li19, Shengping Zhang39, Shizeng Yao28, Simon Hadfield38, Simone Melzi29, Siwei Lyu19, Siyi Li24, Stefan Becker, Stuart Golodetz11, Sumithra Kakanuru23, Sunglok Choi36, Tao Hu20, Thomas Mauthner33, Tianzhu Zhang20, Tony P. Pridmore18, Vincenzo Santopietro7, Weiming Hu20, Wenbo Li40, Wolfgang Hübner, Xiangyuan Lan12, Xiaomeng Wang18, Xin Li39, Yang Li37, Yiannis Demiris35, Yifan Wang21, Yuankai Qi39, Zejian Yuan22, Zexiong Cai12, Zhan Xu37, Zhenyu He39, Zhizhen Chi21
08 Oct 2016
TL;DR: The Visual Object Tracking challenge VOT2016 goes beyond its predecessors by introducing a new semi-automatic ground truth bounding box annotation methodology and extending the evaluation system with the no-reset experiment.
Abstract: The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers being published at major computer vision conferences and journals in the recent years. The number of tested state-of-the-art trackers makes the VOT 2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. The VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit as well as the results are publicly available at the challenge website (http://votchallenge.net).

744 citations

Journal ArticleDOI
TL;DR: Experimental results show that the proposed method outperforms existing methods, including face super-resolution methods, in terms of image quality and recognition accuracy.
Abstract: This paper addresses the very low resolution (VLR) problem in face recognition in which the resolution of the face image to be recognized is lower than 16 × 16. With the increasing demand of surveillance camera-based applications, the VLR problem happens in many face application systems. Existing face recognition algorithms are not able to give satisfactory performance on the VLR face image. While face super-resolution (SR) methods can be employed to enhance the resolution of the images, the existing learning-based face SR methods do not perform well on such a VLR face image. To overcome this problem, this paper proposes a novel approach to learn the relationship between the high-resolution image space and the VLR image space for face SR. Based on this new approach, two constraints, namely, new data and discriminative constraints, are designed for good visuality and face recognition applications under the VLR problem, respectively. Experimental results show that the proposed SR algorithm based on relationship learning outperforms the existing algorithms in public face databases.

467 citations

Proceedings ArticleDOI
15 Jun 2019
TL;DR: A novel instance-based softmax embedding method that directly optimizes the `real' instance features on top of the softmax function and achieves significantly faster learning speed and higher accuracy than all existing methods.
Abstract: This paper studies the unsupervised embedding learning problem, which requires an effective similarity measurement between samples in low-dimensional embedding space. Motivated by the positive concentrated and negative separated properties observed from category-wise supervised learning, we propose to utilize the instance-wise supervision to approximate these properties, which aims at learning data augmentation invariant and instance spread-out features. To achieve this goal, we propose a novel instance based softmax embedding method, which directly optimizes the `real' instance features on top of the softmax function. It achieves significantly faster learning speed and higher accuracy than all existing methods. The proposed method performs well for both seen and unseen testing categories with cosine similarity. It also achieves competitive performance even without pre-trained network over samples from fine-grained categories.

341 citations

Proceedings Article
27 Apr 2018
TL;DR: An improved two-stream CNN network is presented to learn multi-modality sharable feature representations, and identity loss and contrastive loss are integrated to enhance discriminability and modality invariance with partially shared layer parameters.
Abstract: Person re-identification is widely studied in the visible spectrum, where all the person images are captured by visible cameras. However, visible cameras may not capture valid appearance information under poor illumination conditions, e.g., at night. In this case, a thermal camera is superior, since it captures the human body using infrared light and is therefore less dependent on lighting. To this end, this paper investigates a cross-modal re-identification problem, namely visible-thermal person re-identification (VT-REID). Existing cross-modal matching methods mainly focus on modeling the cross-modality discrepancy, while VT-REID also suffers from cross-view variations caused by different camera views. Therefore, we propose a hierarchical cross-modality matching model by jointly optimizing the modality-specific and modality-shared metrics. The modality-specific metrics transform two heterogeneous modalities into a consistent space in which a modality-shared metric can subsequently be learnt. Meanwhile, the modality-specific metric compacts features of the same person within each modality to handle the large intra-modality intra-person variations (e.g., viewpoints, pose). Additionally, an improved two-stream CNN network is presented to learn the multi-modality sharable feature representations. Identity loss and contrastive loss are integrated to enhance the discriminability and modality-invariance with partially shared layer parameters. Extensive experiments illustrate the effectiveness and robustness of the proposed method.

281 citations

Proceedings ArticleDOI
01 Jul 2018
TL;DR: A dual-path network with a novel bi-directional dual-constrained top-ranking loss is proposed to learn discriminative feature representations, and identity loss is further incorporated to model identity-specific information and handle large intra-class variations.
Abstract: Cross-modality person re-identification between the thermal and visible domains is extremely important for night-time surveillance applications. Existing works in this field mainly focus on learning sharable feature representations to handle the cross-modality discrepancies. However, besides the cross-modality discrepancy caused by different camera spectrums, visible thermal person re-identification also suffers from large cross-modality and intra-modality variations caused by different camera views and human poses. In this paper, we propose a dual-path network with a novel bi-directional dual-constrained top-ranking loss to learn discriminative feature representations. It is advantageous in two aspects: 1) it learns features end to end directly from the data without extra metric learning steps, and 2) it simultaneously handles the cross-modality and intra-modality variations to ensure the discriminability of the learnt representations. Meanwhile, identity loss is further incorporated to model the identity-specific information to handle large intra-class variations. Extensive experiments on two datasets demonstrate the superior performance compared to the state of the art.

269 citations


Cited by
Posted Content
TL;DR: It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
Abstract: This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.

7,951 citations

Proceedings ArticleDOI
21 Jul 2017
TL;DR: SRGAN, a generative adversarial network for image super-resolution, uses a perceptual loss function consisting of an adversarial loss and a content loss; the adversarial loss pushes the solution to the natural image manifold using a discriminator network trained to differentiate between super-resolved images and original photo-realistic images.
Abstract: Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.

6,884 citations

Journal ArticleDOI
TL;DR: The working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap are discussed, as well as aspects of system engineering: databases, system architecture, and evaluation.
Abstract: Presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

6,447 citations

Book ChapterDOI
TL;DR: A new representation learning approach for domain adaptation is proposed, in which data at training and test time come from similar but different distributions; it promotes the emergence of features that are discriminative for the main learning task on the source domain and indiscriminate with respect to the shift between the training (source) and test (target) domains.
Abstract: We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.

4,862 citations
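
The gradient reversal layer named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes only what the abstract implies, namely that the layer leaves features unchanged on the forward pass and flips the sign of the gradient on the backward pass. The class name `GradientReversal`, its methods, and the `scale` factor are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class GradientReversal:
    """Identity on the forward pass; negated gradient on the backward pass."""

    # Hypothetical weighting factor for the reversed gradient (an assumption,
    # not stated in the abstract).
    scale: float = 1.0

    def forward(self, features):
        # Features pass through unchanged on their way to a domain classifier.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The feature extractor receives the negated (scaled) gradient, so an
        # ordinary gradient-descent step makes the two domains harder to
        # tell apart instead of easier.
        return [-self.scale * g for g in grad_from_domain_classifier]


layer = GradientReversal(scale=0.5)
passed = layer.forward([0.2, -1.0, 3.0])
reversed_grad = layer.backward([1.0, -2.0, 4.0])
```

In an automatic-differentiation framework the same behaviour would normally be expressed as a custom backward rule rather than by hand, which is consistent with the abstract's remark that the architecture trains with standard backpropagation.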

Posted Content
TL;DR: SRGAN, a generative adversarial network (GAN) for image super-resolution (SR), is presented; to the authors' knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors, using a perceptual loss function that consists of an adversarial loss and a content loss.
Abstract: Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.

4,404 citations