Author

Alexander Sheshkus

Other affiliations: Russian Academy of Sciences
Bio: Alexander Sheshkus is an academic researcher from the Moscow Institute of Physics and Technology. The author has contributed to research in topics: artificial neural networks & convolutional neural networks. The author has an h-index of 9 and has co-authored 30 publications receiving 190 citations. Previous affiliations of Alexander Sheshkus include the Russian Academy of Sciences.

Papers
Journal ArticleDOI
TL;DR: An “on the device” text line recognition framework designed for mobile or embedded systems, based on two separate artificial neural networks (ANNs) and dynamic programming instead of image processing methods for the segmentation step or an end-to-end ANN.
Abstract: In this paper, we introduce an “on the device” text line recognition framework designed for mobile or embedded systems. We consider per-character segmentation as a language-independent problem and individual character recognition as a language-dependent one. Thus, the proposed solution is based on two separate artificial neural networks (ANNs) and dynamic programming, instead of image processing methods for the segmentation step or an end-to-end ANN. To satisfy the tight memory constraints imposed by embedded systems and to avoid overfitting, we employ ANNs with a small number of trainable parameters. The primary purpose of our framework is the recognition of low-quality images of identity documents with complex backgrounds and a variety of languages and fonts. We demonstrate that our solution achieves high recognition accuracy on natural datasets even when trained on purely synthetic data. We use the MIDV-500 and Census 1961 Project datasets for text line recognition. The proposed method considerably surpasses the algorithmic method implemented in Tesseract 3.05, the LSTM method (Tesseract 4.00), and the unpublished method used in the ABBYY FineReader 15 system. Our framework is also faster than the other compared solutions. We show the language independence of our segmenter in experiments with Cyrillic, Armenian, and Chinese text lines.
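
To make the two-network design concrete, here is a minimal dynamic-programming sketch in the spirit of the framework: a hypothetical segmenter network emits a cut score per pixel column, a hypothetical classifier scores each candidate character crop, and DP selects the best consistent sequence of cuts. The names, score shapes, and width limits below are illustrative assumptions, not the authors' actual models.

```python
import numpy as np

def best_segmentation(cut_score, char_score, min_w=2, max_w=8):
    """Dynamic programming over candidate cut positions.

    cut_score[i]     - segmenter confidence that a character boundary lies
                       at pixel column i (hypothetical network output)
    char_score(a, b) - classifier confidence for the crop [a, b)
                       (hypothetical network output)
    Returns the best total score and the chosen cut positions.
    """
    n = len(cut_score)
    best = np.full(n, -np.inf)
    prev = np.full(n, -1, dtype=int)
    best[0] = 0.0
    for b in range(1, n):
        # Only consider character widths between min_w and max_w.
        for a in range(max(0, b - max_w), b - min_w + 1):
            if best[a] == -np.inf:
                continue
            s = best[a] + cut_score[b] + char_score(a, b)
            if s > best[b]:
                best[b], prev[b] = s, a
    # Backtrack from the last column to recover the cut sequence.
    cuts, i = [], n - 1
    while i >= 0:
        cuts.append(i)
        i = prev[i]
    return best[-1], cuts[::-1]

# Toy usage with random scores standing in for real network outputs.
rng = np.random.default_rng(0)
score, cuts = best_segmentation(rng.random(32), lambda a, b: rng.random())
print(score, cuts)
```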

53 citations

Proceedings ArticleDOI
14 Feb 2015
TL;DR: An algorithm for real-time detection of rectangular document borders in mobile device applications, based on the combinatorial assembly of quadrangle candidates from a set of line segments and on projective document reconstruction using the known focal length.
Abstract: In this paper we propose an algorithm for real-time detection of rectangular document borders in mobile device applications. The proposed algorithm is based on the combinatorial assembly of possible quadrangle candidates from a set of line segments and on projective document reconstruction using the known focal length. The Fast Hough Transform is used for line detection, and a 1D modification of an edge detector is proposed for the algorithm.
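
The combinatorial assembly step can be sketched as follows: given roughly-horizontal and roughly-vertical lines (as a Hough transform would return them), every pair-of-pairs is intersected into a quadrangle candidate. This toy version omits the paper's candidate scoring and the focal-length-based projective validity check.

```python
import numpy as np
from itertools import combinations

def intersect(l1, l2):
    """Intersection of two lines in homogeneous form (a, b, c): ax + by + c = 0."""
    p = np.cross(l1, l2)
    return p[:2] / p[2] if abs(p[2]) > 1e-9 else None

def quad_candidates(h_lines, v_lines):
    """Assemble quadrangle candidates from pairs of roughly-horizontal
    and roughly-vertical detected lines."""
    quads = []
    for top, bottom in combinations(h_lines, 2):
        for left, right in combinations(v_lines, 2):
            corners = [intersect(top, left), intersect(top, right),
                       intersect(bottom, right), intersect(bottom, left)]
            if all(c is not None for c in corners):
                quads.append(np.array(corners))
    return quads

# Toy usage: lines given as (a, b, c) coefficients.
h = [np.array([0., 1., -10.]), np.array([0., 1., -90.])]   # y = 10, y = 90
v = [np.array([1., 0., -5.]),  np.array([1., 0., -120.])]  # x = 5,  x = 120
print(quad_candidates(h, v)[0])
```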

43 citations

Proceedings ArticleDOI
13 Apr 2018
TL;DR: An algorithm for creating artificial training datasets for OCR systems, using the Russian passport as a case study, that reduces the gap between natural and synthetic data distributions.
Abstract: This paper addresses one of the fundamental problems of machine learning: acquiring training data. Obtaining enough natural training data is difficult and expensive. In recent years the use of synthetic images has become more beneficial, as it saves human time and provides a huge number of images that would otherwise be difficult to obtain. However, for successful learning on an artificial dataset, one should try to reduce the gap between the natural and synthetic data distributions. In this paper we describe an algorithm for creating artificial training datasets for OCR systems, using the Russian passport as a case study.
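
A minimal illustration of the synthetic-data idea (not the paper's passport-specific generator): render a random text line over a textured background, then degrade it with blur and noise to shrink the synthetic-to-natural gap. All parameters below are arbitrary assumptions.

```python
import random
import string
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def synth_text_line(width=160, height=32):
    """Generate one labeled synthetic text-line image (a crude stand-in
    for a document-specific generator)."""
    text = "".join(random.choices(string.ascii_uppercase + string.digits,
                                  k=random.randint(5, 12)))
    # Non-uniform background to mimic document texture.
    bg = np.clip(np.random.normal(200, 25, (height, width)), 0, 255)
    img = Image.fromarray(bg.astype(np.uint8), mode="L")
    draw = ImageDraw.Draw(img)
    draw.text((random.randint(2, 10), random.randint(2, 8)), text,
              fill=random.randint(0, 80), font=ImageFont.load_default())
    # Degradations help narrow the synthetic-to-natural distribution gap.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.3, 1.2)))
    noisy = np.array(img, dtype=np.float32)
    noisy += np.random.normal(0, random.uniform(2, 10), noisy.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8)), text

img, label = synth_text_line()
img.save(f"sample_{label}.png")
```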

23 citations

Proceedings ArticleDOI
15 Mar 2019
TL;DR: The most common label-preserving deformations, useful in many practical tasks, are considered, and a custom real-time augmentation system is developed; experiments demonstrate the effectiveness of the suggested approach.
Abstract: In this paper we study real-time augmentation, a method for increasing the variability of the training dataset during the learning process. We consider the most common label-preserving deformations, which can be useful in many practical tasks. Due to the limitations of existing augmentation tools, such as increased learning time or dependence on a specific platform, we developed our own real-time augmentation system. Experiments on the MNIST and SVHN datasets demonstrated the effectiveness of the suggested approach: the quality of the trained models improves, while learning time remains the same as if augmentation were not used.
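
The mechanism can be sketched as a dataset wrapper that applies random label-preserving deformations at fetch time, so each epoch sees fresh variants; with multi-worker data loading the augmentation overlaps training on the GPU, which is how the "no extra learning time" property is usually obtained. The specific transforms below are illustrative, not the paper's system.

```python
from torch.utils.data import Dataset
import torchvision.transforms as T

class AugmentedDataset(Dataset):
    """Wraps a base dataset and applies random label-preserving
    deformations on the fly, so every epoch sees new variants."""
    def __init__(self, base):
        self.base = base
        self.aug = T.Compose([
            T.RandomAffine(degrees=8, translate=(0.05, 0.05),
                           scale=(0.9, 1.1), shear=5),
            T.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
            T.ToTensor(),  # so batches collate into tensors
        ])

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        return self.aug(x), y  # the label y is unchanged by design

# Usage sketch: AugmentedDataset(torchvision.datasets.MNIST("data", download=True)),
# read through a multi-worker DataLoader so augmentation overlaps training.
```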

23 citations

Proceedings ArticleDOI
09 Sep 2019
TL;DR: A novel neural network architecture based on a Fast Hough Transform layer, which allows the network to accumulate features from linear areas across the entire image instead of local areas, applied to vanishing point detection in document images.
Abstract: In this paper we introduce a novel neural network architecture based on a Fast Hough Transform layer. A layer of this type allows our neural network to accumulate features from linear areas across the entire image instead of local areas. We demonstrate its potential by solving the problem of vanishing point detection in document images. This problem occurs when dealing with camera shots of documents in uncontrolled conditions, where the document image can suffer several specific distortions, including projective transform. To train our model, we use the MIDV-500 dataset and provide testing results. The strong generalization ability of the suggested method is demonstrated by applying it to the completely different ICDAR 2011 dewarping contest dataset. In previously published papers on this dataset, authors measured the quality of vanishing point detection by counting correctly recognized words with the open-source OCR engine Tesseract. To compare with them, we reproduce this experiment and show that our method outperforms the state-of-the-art result.
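
A feature-accumulating Hough layer can be approximated in a few lines: rotate the feature map for each angle and sum along one axis, so each output cell integrates evidence along an entire line. This brute-force stand-in is far slower than the Fast Hough Transform used in the paper and is only meant to show the idea.

```python
import torch
import torch.nn.functional as F

class HoughLayer(torch.nn.Module):
    """Brute-force differentiable Hough-style layer: for each angle, rotate
    the feature map and sum along image rows, accumulating responses along
    whole lines. Not the authors' Fast Hough Transform implementation."""
    def __init__(self, n_angles=64):
        super().__init__()
        self.angles = torch.linspace(0, torch.pi, n_angles + 1)[:-1]

    def forward(self, x):                        # x: (B, C, H, W)
        b = x.shape[0]
        zero = torch.tensor(0.)
        outs = []
        for a in self.angles:
            cos, sin = torch.cos(a), torch.sin(a)
            theta = torch.stack([torch.stack([cos, -sin, zero]),
                                 torch.stack([sin, cos, zero])])
            grid = F.affine_grid(theta.repeat(b, 1, 1), list(x.shape),
                                 align_corners=False)
            rot = F.grid_sample(x, grid, align_corners=False)
            outs.append(rot.sum(dim=2))          # integrate along image rows
        return torch.stack(outs, dim=2)          # (B, C, n_angles, W)

x = torch.rand(1, 1, 64, 64)
print(HoughLayer()(x).shape)                     # torch.Size([1, 1, 64, 64])
```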

22 citations


Cited by
Proceedings ArticleDOI
01 Nov 2017
TL;DR: This work is devoted to the design of an identity document recognition system for mobile phones and tablets that uses the computational capabilities of the device itself; experimental results are presented for the implemented commercial system "Smart IDReader".
Abstract: This work is devoted to the design of an identity document recognition system for mobile phones and tablets that uses the computational capabilities of the device itself. Key differences are discussed in relation to conventional cloud recognition systems, which commonly use single images as input by design. A mobile recognition system chart is presented which is constructed with computational limitations in mind and which is implemented in a commercial solution. An original approach is described that improves recognition precision and reliability by integrating post-OCR results over a video stream, as opposed to approaches that rely on frame image integration using "super-resolution" algorithms. Interactive feedback between the system and its operator is discussed, such as the automatic decision to stop video stream recognition. Experimental results are presented for the implemented commercial system "Smart IDReader" designed for identity document recognition.
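
The post-OCR video-stream integration can be illustrated with a toy per-position weighted vote across frame-level results; the commercial system's actual combination method is not published, so treat this purely as a sketch.

```python
from collections import Counter, defaultdict

def integrate_ocr(frames):
    """Combine per-frame OCR readings of the same text field.

    frames: list of (text, confidence) pairs from consecutive video frames.
    Votes on the field length first, then on each character position,
    weighting votes by frame confidence.
    """
    length = Counter(len(t) for t, _ in frames).most_common(1)[0][0]
    votes = [defaultdict(float) for _ in range(length)]
    for text, conf in frames:
        if len(text) != length:
            continue  # discard readings that disagree on length
        for i, ch in enumerate(text):
            votes[i][ch] += conf
    return "".join(max(v, key=v.get) for v in votes)

frames = [("AB0123", 0.9), ("ABO123", 0.4), ("AB0123", 0.8), ("A80123", 0.3)]
print(integrate_ocr(frames))  # AB0123
```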

57 citations

Journal ArticleDOI
TL;DR: A new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion, allowing on-the-fly spectrogram extraction without the need to store any spectrograms on disk.
Abstract: In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion. It allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on disk. Moreover, this approach also allows back-propagation through the waveform-to-spectrogram transformation layer; hence, the transformation can be made trainable, further optimizing it for the specific task the neural network is trained on. All spectrogram implementations scale linearly with respect to the input length. nnAudio, however, leverages the compute unified device architecture (CUDA) through PyTorch's 1D convolutional neural network; its short-time Fourier transform (STFT), Mel spectrogram, and constant-Q transform (CQT) implementations are an order of magnitude faster than other implementations using only the central processing unit (CPU). We tested our framework on three different machines with NVIDIA GPUs, and it significantly reduces the spectrogram extraction time from the order of seconds (using the popular Python library librosa) to the order of milliseconds for audio recordings of the same length. When applying nnAudio to variable-length audio, an average of 11.5 hours is required to extract 34 spectrogram types with different parameters from the MusicNet dataset using librosa; nnAudio requires an average of 2.8 hours, which is still four times faster than librosa. Our proposed framework also outperforms existing GPU processing libraries such as Kapre and torchaudio in terms of processing speed.
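
The core trick, computing an STFT as a 1D convolution with fixed Fourier kernels, can be sketched as below (magnitude only, simplified relative to nnAudio's actual API); leaving the kernel weights trainable is what makes the transformation optimizable end to end.

```python
import torch
import numpy as np

class ConvSTFT(torch.nn.Module):
    """Magnitude STFT as a Conv1d with fixed windowed Fourier kernels,
    in the spirit of nnAudio (simplified sketch, not its API)."""
    def __init__(self, n_fft=512, hop=128):
        super().__init__()
        t = np.arange(n_fft)
        freqs = np.arange(n_fft // 2 + 1)
        window = np.hanning(n_fft)
        cos_k = np.cos(2 * np.pi * np.outer(freqs, t) / n_fft) * window
        sin_k = np.sin(2 * np.pi * np.outer(freqs, t) / n_fft) * window
        kernels = np.concatenate([cos_k, sin_k])[:, None, :]  # (2F, 1, n_fft)
        self.conv = torch.nn.Conv1d(1, kernels.shape[0], n_fft,
                                    stride=hop, bias=False)
        # Fixed Fourier basis; could be left trainable to fine-tune the
        # transformation for a downstream task, as nnAudio allows.
        self.conv.weight.data = torch.tensor(kernels, dtype=torch.float32)
        self.n_bins = len(freqs)

    def forward(self, wav):                  # wav: (B, samples)
        out = self.conv(wav.unsqueeze(1))    # (B, 2F, frames)
        real, imag = out[:, :self.n_bins], out[:, self.n_bins:]
        return torch.sqrt(real ** 2 + imag ** 2 + 1e-12)

spec = ConvSTFT()(torch.randn(2, 16000))
print(spec.shape)  # (2, 257, frames)
```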

55 citations

Journal ArticleDOI
TL;DR: A 3D separable convolutional neural network is proposed for dynamic gesture recognition; the model is made less complex without compromising its high recognition accuracy, so that it can more easily be deployed to augmented reality glasses in the future.
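
For context, a 3D separable convolution typically factorizes a dense 3D convolution into a depthwise and a pointwise step, cutting parameters sharply; the sketch below shows the generic factorization, which may differ from the cited paper's exact design.

```python
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """Depthwise + pointwise factorization of a 3D convolution, a common
    way to reduce parameter count and model complexity."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, k, padding=k // 2,
                                   groups=in_ch)       # per-channel spatio-temporal filter
        self.pointwise = nn.Conv3d(in_ch, out_ch, 1)   # 1x1x1 channel mixing

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a dense 3D convolution.
sep = SeparableConv3d(32, 64)
dense = nn.Conv3d(32, 64, 3, padding=1)
print(sum(p.numel() for p in sep.parameters()),     # 3008
      sum(p.numel() for p in dense.parameters()))   # 55360
```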

54 citations

Book ChapterDOI
23 Aug 2020
TL;DR: This work reduces the dependency on labeled data by building on the classic knowledge-based priors while using deep networks to learn features, and shows that adding prior knowledge improves data efficiency as line priors no longer need to be learned from data.
Abstract: Classical work on line segment detection is knowledge-based; it uses carefully designed geometric priors based on image gradients, pixel groupings, or Hough transform variants. Current deep learning methods, instead, do away with all prior knowledge and replace priors by training deep networks on large manually annotated datasets. Here, we reduce the dependency on labeled data by building on classic knowledge-based priors while using deep networks to learn features. We add line priors through a trainable Hough transform block inserted into a deep network. The Hough transform provides prior knowledge about global line parameterizations, while the convolutional layers learn local gradient-like line features. On the Wireframe (ShanghaiTech) and York Urban datasets we show that adding prior knowledge improves data efficiency, as line priors no longer need to be learned from data.
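
A trainable Hough transform block of this flavor can be sketched as a fixed pixel-to-bin accumulation, a learned filter over Hough bins, and the transposed mapping back to image space; the brute-force matrix construction below is only an illustration, not the paper's block.

```python
import torch

def hough_matrix(size=32, n_angles=16):
    """Dense 0/1 matrix mapping image pixels to (angle, offset) Hough bins.
    Brute force and memory-hungry; real systems use fast variants."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    ys, xs = ys.flatten().float(), xs.flatten().float()
    n_rho = 3 * size  # enough offset bins for all rho = x*cos + y*sin values
    A = torch.zeros(n_angles * n_rho, size * size)
    for i, theta in enumerate(torch.linspace(0, torch.pi, n_angles + 1)[:-1]):
        rho = (xs * torch.cos(theta) + ys * torch.sin(theta)).round().long() + size
        A[i * n_rho + rho.clamp(0, n_rho - 1), torch.arange(size * size)] = 1.0
    return A

class HoughPriorBlock(torch.nn.Module):
    """Hough transform -> learned filtering in Hough space -> transposed
    mapping back to pixels: a rough sketch of a line-prior block in a CNN."""
    def __init__(self, size=32, n_angles=16):
        super().__init__()
        self.register_buffer("A", hough_matrix(size, n_angles))
        self.mix = torch.nn.Conv1d(1, 1, 5, padding=2)  # learned filter over bins
        self.size = size

    def forward(self, x):                  # x: (B, 1, size, size)
        h = x.flatten(1) @ self.A.T        # accumulate evidence along lines
        h = self.mix(h.unsqueeze(1)).squeeze(1)
        y = h @ self.A                     # redistribute back to pixel space
        return y.view(-1, 1, self.size, self.size)

print(HoughPriorBlock()(torch.rand(2, 1, 32, 32)).shape)
```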

53 citations
