Author

Xiyan Liu

Bio: Xiyan Liu is an academic researcher from the Chinese Academy of Sciences. The author has contributed to research in topics: Computer science & Artificial intelligence. The author has an h-index of 3 and has co-authored 6 publications receiving 53 citations.

Papers
Journal ArticleDOI
TL;DR: This survey reviews the methods that appeared in the past 5 years for text detection and recognition in images and videos, including the recent state-of-the-art techniques on the following three related topics: (1) scene text detection, (2) scene text recognition and (3) end-to-end text recognition systems.
Abstract: Scene text detection and recognition has become a very active research topic in recent years. It has many applications in reality, ranging from navigation for vision-impaired people to semantic natural scene understanding. In this survey, we intend to give a thorough and in-depth review of the recent advances on this topic, mainly focusing on the methods that appeared in the past 5 years for text detection and recognition in images and videos, including the recent state-of-the-art techniques on the following three related topics: (1) scene text detection, (2) scene text recognition and (3) end-to-end text recognition systems. Compared with previous surveys, this survey pays more attention to the application of deep learning techniques to scene text detection and recognition. We also give a brief introduction to other related work such as script identification, text/non-text classification and text-to-image retrieval. This survey also reviews and summarizes some benchmark datasets that are widely used in the literature. Based on these datasets, the performance of state-of-the-art approaches is shown and discussed. Finally, we conclude this survey by pointing out several potential directions in scene text detection and recognition that need to be explored further.

65 citations

Journal ArticleDOI
TL;DR: This model can rectify arbitrarily distorted document images with complicated page layouts and cluttered backgrounds and outperforms the state-of-the-art methods in terms of OCR accuracy and several widely used quantitative evaluation metrics.

24 citations

Proceedings ArticleDOI
01 Aug 2018
TL;DR: This paper proposes a novel framework called Conditional Cycle-Generative Adversarial Network (CCGAN), which can generate photo-realistic images conditioned on the given text descriptions, while maintaining the attributes of the original images.
Abstract: Traditional approaches for semantic image synthesis mainly focus on text descriptions while ignoring the related structures and attributes in the original images. Therefore, some critical information, e.g., the style, background, object shapes and pose, is missing in the generated images. In this paper, we propose a novel framework called Conditional Cycle-Generative Adversarial Network (CCGAN) to address this issue. Our model can generate photo-realistic images conditioned on the given text descriptions while maintaining the attributes of the original images. The framework mainly consists of two coupled conditional adversarial networks, which are able to learn a desirable image mapping that can keep the structures and attributes in the images. We introduce a conditional cycle consistency loss to prevent contradiction between the two generators. This loss allows the generated images to retain most of the features of the original image, improving the stability of network training. Moreover, benefiting from the mechanism of circular training, the proposed networks can learn the semantic information of the text much more accurately. Experiments on the Caltech-UCSD Birds dataset and the Oxford-102 Flowers dataset demonstrate that the proposed method significantly outperforms existing methods in terms of image detail reconstruction and semantic information expression.
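
As a rough illustration of the conditional cycle consistency idea described in the abstract, a minimal PyTorch-style sketch is given below; the generator call signatures, the shared text-embedding input and the L1 reconstruction penalty are assumptions made for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def conditional_cycle_loss(g_xy, g_yx, image, text_embedding, weight=10.0):
    # Forward translation conditioned on the text description,
    # then translate back with the second (coupled) generator.
    edited = g_xy(image, text_embedding)
    reconstructed = g_yx(edited, text_embedding)
    # Penalizing the reconstruction error pushes both generators to
    # preserve the structures and attributes of the original image.
    return weight * F.l1_loss(reconstructed, image)
```

In training, a term of this kind would be added to the usual adversarial losses of the two coupled conditional generators.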

9 citations

Journal ArticleDOI
TL;DR: The text image is disentangled into a style representation and a content representation, where the style representation is mapped into a Gaussian distribution and the content representation is embedded using character indices.
Abstract: Automatically generating handwritten text images is a challenging task due to the diverse handwriting styles and the irregular writing in natural scenes. In this paper, we propose an effective generative model called HTG-GAN to synthesize handwritten text images from a latent prior. Unlike single-character synthesis, our method is capable of generating images of character sequences of arbitrary length, paying more attention to the structural relationship between characters. We model the structural relationship as the style representation to avoid explicitly modeling the stroke layout. Specifically, the text image is disentangled into a style representation and a content representation, where the style representation is mapped into a Gaussian distribution and the content representation is embedded using character indices. In this way, our model can generate new handwritten text images with specified contents and various styles to perform data augmentation, thereby boosting handwritten text recognition (HTR). Experimental results show that our method achieves state-of-the-art performance in handwritten text generation.
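
A minimal sketch of the style/content disentanglement described above, assuming a Gaussian style latent obtained via the reparameterization trick and a learned embedding over character indices; the module layout and dimensions below are illustrative, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class StyleContentEncoder(nn.Module):
    """Hypothetical sketch: Gaussian style latent plus character-index content embedding."""

    def __init__(self, num_chars, feat_dim=256, style_dim=32, content_dim=64):
        super().__init__()
        self.content_embed = nn.Embedding(num_chars, content_dim)  # content from character indices
        self.to_mu = nn.Linear(feat_dim, style_dim)                 # style mean
        self.to_logvar = nn.Linear(feat_dim, style_dim)             # style log-variance

    def forward(self, image_features, char_indices):
        # Map image features to a Gaussian style distribution (reparameterization trick).
        mu = self.to_mu(image_features)
        logvar = self.to_logvar(image_features)
        style = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Embed the textual content directly from the character indices.
        content = self.content_embed(char_indices)
        return style, content, mu, logvar
```

At generation time, sampling the style latent from the Gaussian prior while fixing the content embedding would yield the same text in varied handwriting styles, which is what enables the data augmentation for HTR.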

7 citations

Journal ArticleDOI
TL;DR: A novel model named FontGAN is proposed, which integrates character structure stylization, de-stylization and texture transfer into a unified framework and decouples character images into a style representation and a content representation, offering fine-grained control of these two types of variables and thus improving the quality of the generated results.
Abstract: Character glyph synthesis is still an open and challenging problem, which involves two related aspects, i.e., font style transfer and content consistency. In this paper, we propose a novel model named FontGAN, which integrates character structure stylization, de-stylization and texture transfer into a unified framework. Specifically, we decouple character images into a style representation and a content representation, which offers fine-grained control of these two types of variables, thus improving the quality of the generated results. To effectively capture the style information, a style consistency module (SCM) is introduced. Technically, the SCM exploits a category-guided Kullback-Leibler divergence to explicitly model the style representation with different prior distributions. In this way, our model is capable of implementing transformations between multiple domains in one framework. In addition, we propose a content prior module (CPM) to provide a content prior for the model, guiding the content encoding process and alleviating the problem of stroke deficiency during structure de-stylization. Benefiting from the idea of decoupling and regrouping, our FontGAN is able to perform many-to-many translation tasks for glyph structure. Experimental results demonstrate that the proposed FontGAN achieves state-of-the-art performance in character glyph synthesis.
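
The category-guided Kullback-Leibler term can be sketched as a standard closed-form Gaussian KL against a per-category prior mean, which is one plausible reading of the abstract; the exact form of the loss in the paper may differ.

```python
import torch

def category_guided_kl(mu, logvar, prior_mu):
    """KL( N(mu, diag(exp(logvar))) || N(prior_mu, I) ), with one prior mean per font category.

    mu, logvar: encoded style posterior parameters, shape (batch, style_dim).
    prior_mu:   the prior mean assigned to each sample's style category, same shape.
    """
    kl = 0.5 * torch.sum(torch.exp(logvar) + (mu - prior_mu) ** 2 - 1.0 - logvar, dim=-1)
    return kl.mean()
```

Pulling each style posterior toward its own category prior is what lets a single model keep the style domains separated and translate between them.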

5 citations


Cited by
Journal ArticleDOI
15 Aug 2020-Sensors
TL;DR: This work compares Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO) deep neural networks for the outdoor advertisement panel detection problem by handling multiple and combined variabilities in the scenes.
Abstract: This work compares Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO) deep neural networks for the outdoor advertisement panel detection problem by handling multiple and combined variabilities in the scenes. Publicity panel detection in images offers important advantages both in the real world and in the virtual one. For example, applications like Google Street View can be used for Internet publicity, and when these ad panels are detected in images, the publicity appearing inside the panels could be replaced by another from a funding company. In our experiments, both the SSD and YOLO detectors produced acceptable results under variable panel sizes, illumination conditions, viewing perspectives, partial occlusion of panels, complex backgrounds and multiple panels in scenes. Due to the difficulty of finding annotated images for the considered problem, we created our own dataset for conducting the experiments. The major strength of the SSD model was the near elimination of False Positive (FP) cases, a situation that is preferable when the publicity contained inside the panels is analyzed after detection. On the other hand, YOLO produced better panel localization results, detecting a higher number of True Positive (TP) panels with higher accuracy. Finally, a comparison of the two analyzed object detection models with different types of semantic segmentation networks, using the same evaluation metrics, is also included.

47 citations

Journal ArticleDOI
TL;DR: The survey first introduces image synthesis and its challenges, and then reviews key concepts such as generative adversarial networks (GANs) and deep convolutional encoder-decoder neural networks (DCNNs), and proposes a taxonomy to summarize GAN-based text-to-image synthesis into four major categories.
Abstract: Text-to-image synthesis refers to computational methods which translate human-written textual descriptions, in the form of keywords or sentences, into images with semantic meaning similar to the text. In earlier research, image synthesis relied mainly on word-to-image correlation analysis combined with supervised methods to find the best alignment of the visual content with the text. Recent progress in deep learning (DL) has brought a new set of unsupervised deep learning methods, particularly deep generative models, which are able to generate realistic visual images using suitably trained neural network models. In this paper, we review the most recent developments in the text-to-image synthesis research domain. Our survey first introduces image synthesis and its challenges, and then reviews key concepts such as generative adversarial networks (GANs) and deep convolutional encoder-decoder neural networks (DCNNs). After that, we propose a taxonomy to summarize GAN-based text-to-image synthesis into four major categories: Semantic Enhancement GANs, Resolution Enhancement GANs, Diversity Enhancement GANs, and Motion Enhancement GANs. We elaborate the main objective of each group and further review typical GAN architectures in each group. The taxonomy and the review outline the techniques and the evolution of different approaches, and eventually provide a clear roadmap summarizing the contemporaneous solutions that utilize GANs and DCNNs to generate enthralling results in categories such as human faces, birds, flowers, room interiors, object reconstruction from edge maps (games), etc. The survey concludes with a comparison of the proposed solutions, challenges that remain unresolved, and future developments in the text-to-image synthesis domain.
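
The conditioning pattern shared by most GAN-based text-to-image models in this taxonomy can be sketched with a toy generator that concatenates a noise vector with a sentence embedding; this fully connected 64x64 example is purely illustrative and does not correspond to any specific architecture reviewed in the survey.

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """Toy generator: noise and a sentence embedding are concatenated and decoded to pixels."""

    def __init__(self, noise_dim=100, text_dim=256, img_pixels=3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, img_pixels),
            nn.Tanh(),
        )

    def forward(self, noise, text_embedding):
        # Conditioning: the output image depends on both the random noise and the text.
        x = torch.cat([noise, text_embedding], dim=1)
        return self.net(x).view(-1, 3, 64, 64)
```

The four categories in the taxonomy differ mainly in what they add around this basic conditioning: richer semantic losses, multi-stage upsampling, diversity-promoting objectives, or temporal modeling for motion.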

42 citations

Journal ArticleDOI
TL;DR: The proposed clustering and ranking stages lead to using only 11% of the whole database when classifying test images, which reduces computational complexity and improves classification results compared to recent existing systems.
Abstract: Recently, deep learning techniques have demonstrated efficiency in building better-performing machine learning models, which are required in the field of offline Arabic handwriting recognition. Our ancient civilizations left valuable handwritten manuscripts that need to be documented digitally. Compared with Latin character recognition, isolated Arabic character recognition is much more challenging due to the similarity between characters and the variability of writing styles. This paper proposes a multi-stage cascading system for offline Arabic handwriting recognition. The approach starts by applying the Hierarchical Agglomerative Clustering (HAC) technique to split the database into partially inter-related clusters. The inter-relations between the constructed clusters support representing the database as a large search tree model and help attain a reduced complexity in matching each test image with a cluster. Cluster members are then ranked based on our newly proposed ranking algorithm. This ranking algorithm starts by computing Pyramid Histogram of Oriented Gradients (PHoG) features, followed by measuring divergence with the Kullback-Leibler method. Eventually, the classification process is applied only to the highly ranked matching classes. A comparative study assesses the effect of six different deep Convolutional Neural Networks (DCNNs) on the final recognition rates of the proposed system. Experiments are done using the IFN/ENIT Arabic database. The proposed clustering and ranking stages lead to using only 11% of the whole database when classifying test images. Accordingly, reduced computational complexity and enhanced classification results are achieved compared to recent existing systems.
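
The ranking step described above (PHoG features compared with a Kullback-Leibler measure) can be sketched as below, assuming each cluster is represented by a prototype histogram; the helper names and the normalization details are illustrative rather than the paper's exact procedure.

```python
import numpy as np

def kl_divergence(p_hist, q_hist, eps=1e-10):
    # Normalize both PHoG histograms, then compute KL(p || q).
    p = p_hist / (p_hist.sum() + eps)
    q = q_hist / (q_hist.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def rank_clusters(test_hist, cluster_hists):
    # Smaller divergence means the cluster is a better match for the test image,
    # so only the top-ranked clusters need to be passed to the final classifier.
    scores = [kl_divergence(test_hist, h) for h in cluster_hists]
    return np.argsort(scores)
```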

37 citations

Journal ArticleDOI
01 Jan 2021
TL;DR: A weighted naïve Bayes classifier (WNBC)-based deep learning process is used in this framework to effectively detect the text and to recognize the character from the scene images.
Abstract: Text found in natural scenes contains various information; therefore, it is extensively used in many applications to understand image scenarios and to retrieve visual information. The semantic information provided by a scene image is very valuable for humans to understand the whole environment. However, text in such natural images has a highly variable appearance in an unconstrained environment, which makes text identification and character recognition challenging. Therefore, a weighted naive Bayes classifier (WNBC)-based deep learning process is used in this framework to effectively detect text and recognize characters in scene images. Natural scene images may carry noise, and to remove it, a guided image filter is introduced at the pre-processing stage. The features that are useful for the classification process are extracted using the Gabor transform and stroke width transform techniques. Finally, with these extracted features, text detection and character recognition are achieved by the WNBC and a deep neural network-based adaptive galactic swarm optimization. Performance metrics such as accuracy, F1-score, precision, recall, mean absolute error and mean square error are then evaluated to assess the proposed method.
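
A weighted naive Bayes decision rule of the kind the abstract refers to can be sketched as below, assuming Gaussian per-feature likelihoods and a per-feature weight vector applied to the extracted Gabor/stroke-width features; this is an illustrative formulation, not the paper's exact classifier.

```python
import numpy as np

def weighted_nb_log_scores(x, class_means, class_vars, class_priors, feature_weights):
    """Per-class scores for a weighted Gaussian naive Bayes classifier.

    x: (D,) feature vector; class_means/class_vars: (C, D); class_priors: (C,);
    feature_weights: (D,) importance weights applied to each feature's log-likelihood.
    """
    log_lik = -0.5 * (np.log(2 * np.pi * class_vars) + (x - class_means) ** 2 / class_vars)
    return np.log(class_priors) + (feature_weights * log_lik).sum(axis=1)

# Usage sketch: predicted_class = int(np.argmax(weighted_nb_log_scores(x, m, v, p, w)))
```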

36 citations