
Showing papers by "Jiri Matas published in 2015"


Proceedings ArticleDOI
23 Aug 2015
TL;DR: A new Challenge 4 on Incidental Scene Text has been added to the existing Challenges on Born-Digital Images, Focused Scene Images and Video Text, and tasks assessing End-to-End system performance have been introduced in all Challenges.
Abstract: Results of the ICDAR 2015 Robust Reading Competition are presented. A new Challenge 4 on Incidental Scene Text has been added to the Challenges on Born-Digital Images, Focused Scene Images and Video Text. Challenge 4 is run on a newly acquired dataset of 1,670 images evaluating Text Localisation, Word Recognition and End-to-End pipelines. In addition, the dataset for Challenge 3 on Video Text has been substantially updated with more video sequences and more accurate ground truth data. Finally, tasks assessing End-to-End system performance have been introduced to all Challenges. The competition took place in the first quarter of 2015, and received a total of 44 submissions. Only the tasks newly introduced in 2015 are reported on. The datasets, the ground truth specification and the evaluation protocols are presented together with the results and a brief summary of the participating methods.

1,224 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance; it presents a new VOT2015 dataset twice as large as VOT2014's, with full annotation of targets by rotated bounding boxes and per-frame attributes.
Abstract: The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT2015 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2015 challenge that go beyond its VOT2014 predecessor are: (i) a new VOT2015 dataset twice as large as VOT2014's, with full annotation of targets by rotated bounding boxes and per-frame attributes, (ii) an extension of the VOT2014 evaluation methodology by the introduction of a new performance measure. The dataset, the evaluation kit and the results are publicly available at the challenge website.

667 citations


Posted Content
TL;DR: Layer-sequential unit-variance (LSUV) initialization, as proposed in this paper, is a simple method for weight initialization for deep net learning. It consists of two steps: first, pre-initialize the weights of each convolution or inner-product layer with orthonormal matrices; then proceed from the first to the final layer, normalizing the variance of each layer's output to one.
Abstract: Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of two steps. First, pre-initialize the weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one. Experiments with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization enables the learning of very deep nets, (i) produces networks with test accuracy better than or equal to that of standard methods, and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets, such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)). Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets; the state of the art, or results very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.
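
A minimal NumPy sketch of the two LSUV steps for a toy fully-connected ReLU net; the layer shapes, tolerance and data batch are illustrative assumptions, not values from the paper:

```python
import numpy as np

def orthonormal(shape, rng):
    # Step 1: pre-initialize with an orthonormal matrix.
    a = rng.standard_normal(shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == shape else vt

def lsuv_init(weights, X, tol_var=0.01, max_iter=10):
    # Step 2: walk the layers first to last, rescaling each weight
    # matrix until the variance of the layer's output is ~1 on batch X.
    h = X
    for i, W in enumerate(weights):
        for _ in range(max_iter):
            out = np.maximum(h @ W, 0.0)   # toy ReLU layer
            v = out.var()
            if abs(v - 1.0) < tol_var:
                break
            W = W / np.sqrt(v)             # normalize output variance
        weights[i] = W
        h = np.maximum(h @ W, 0.0)         # forward pass with the fixed layer
    return weights

rng = np.random.default_rng(0)
shapes = [(64, 128), (128, 128), (128, 10)]   # illustrative layer shapes
weights = [orthonormal(s, rng) for s in shapes]
X = rng.standard_normal((256, 64))            # a batch of training data
weights = lsuv_init(weights, X)
```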

224 citations


Journal ArticleDOI
TL;DR: An improved method for tentative correspondence selection, applicable both with and without view synthesis, is introduced; a modification of the standard first-to-second nearest distance ratio rule increases the number of correct matches by 5–20% at no additional computational cost.
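
The modified rule (the "first geometrically inconsistent" rule, described in the authors' WxBS poster further down this page) takes as the "second nearest" descriptor the closest one that is spatially inconsistent with the nearest neighbour, so that multiple detections of the same structure do not veto a correct match. A minimal NumPy sketch under assumed inputs; the ratio and pixel-distance thresholds are illustrative:

```python
import numpy as np

def fginn_matches(d1, d2, xy2, ratio=0.8, min_dist=10.0):
    """First-to-second nearest ratio test where the 'second' neighbour is
    the first geometrically inconsistent one: the closest descriptor in
    image 2 lying further than min_dist pixels from the nearest neighbour.
    d1, d2: (N,128) and (M,128) descriptor arrays; xy2: (M,2) keypoints."""
    matches = []
    for i, d in enumerate(d1):
        dists = np.linalg.norm(d2 - d, axis=1)
        order = np.argsort(dists)
        nn = order[0]
        # walk candidates until one is spatially far from the nearest one
        second = next((j for j in order[1:]
                       if np.linalg.norm(xy2[j] - xy2[nn]) > min_dist), None)
        if second is not None and dists[nn] < ratio * dists[second]:
            matches.append((i, nn))
    return matches
```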

158 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: A novel easy-to-implement stroke detector is proposed, based on efficient pixel intensity comparisons to surrounding pixels; stroke-specific keypoints are efficiently detected, and text fragments are subsequently extracted by local thresholding guided by keypoint properties.
Abstract: We propose a novel easy-to-implement stroke detector based on an efficient pixel intensity comparison to surrounding pixels. Stroke-specific keypoints are efficiently detected and text fragments are subsequently extracted by local thresholding guided by keypoint properties. Classification based on effectively calculated features then eliminates non-text regions. The stroke-specific keypoints produce 2 times fewer region segmentations and still detect 25% more characters than the commonly exploited MSER detector, and the process is 4 times faster. After a novel efficient classification step, the number of regions is reduced to one seventh of the standard method's, while the pipeline remains almost 3 times faster. All stages of the proposed pipeline are scale- and rotation-invariant and support a wide variety of scripts (Latin, Hebrew, Chinese, etc.) and fonts. When the proposed detector is plugged into a scene text localization and recognition pipeline, state-of-the-art text localization accuracy is maintained whilst the processing time is significantly reduced.
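
A loose Python sketch of the kind of FAST-like intensity test the abstract describes, flagging a pixel whose surrounding ring contains a long contiguous arc darker than the centre (a dark stroke on a light background). The ring geometry, margin and arc length are illustrative assumptions, not the paper's exact test:

```python
import numpy as np

# 12-pixel ring of radius 2 around the centre (illustrative sampling ring)
RING = [(0, 2), (1, 2), (2, 1), (2, 0), (2, -1), (1, -2), (0, -2),
        (-1, -2), (-2, -1), (-2, 0), (-2, 1), (-1, 2)]

def is_stroke_keypoint(img, y, x, margin=16, min_arc=9):
    """Flag (y, x) if a long contiguous arc of the surrounding ring is
    darker than the centre by `margin`. A real detector would also handle
    bright-on-dark strokes and run over multiple scales."""
    c = int(img[y, x])
    darker = [int(img[y + dy, x + dx]) < c - margin for dy, dx in RING]
    darker = darker + darker          # duplicate the ring for wrap-around runs
    run = best = 0
    for d in darker:
        run = run + 1 if d else 0
        best = max(best, run)
    return best >= min_arc
```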

151 citations


Proceedings ArticleDOI
17 Dec 2015
TL;DR: Experimental evaluation shows that the proposed method yields a recognition rate comparable to the state of the art, while its complexity is sub-linear in the number of templates.
Abstract: Despite their ubiquitous presence, texture-less objects present significant challenges to contemporary visual object detection and localization algorithms. This paper proposes a practical method for the detection and accurate 3D localization of multiple texture-less and rigid objects depicted in RGB-D images. The detection procedure adopts the sliding window paradigm, with an efficient cascade-style evaluation of each window location. A simple pre-filtering is performed first, rapidly rejecting most locations. For each remaining location, a set of candidate templates (i.e. trained object views) is identified with a voting procedure based on hashing, which makes the method's computational complexity largely unaffected by the total number of known objects. The candidate templates are then verified by matching feature points in different modalities. Finally, the approximate object pose associated with each detected template is used as a starting point for a stochastic optimization procedure that estimates accurate 3D pose. Experimental evaluation shows that the proposed method yields a recognition rate comparable to the state of the art, while its complexity is sub-linear in the number of templates.
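
A minimal Python sketch of the hashing-based voting idea: measurements at a few fixed probe points in the window are quantized into a key, and hash tables map keys to the templates that produced them, so candidates are retrieved in time largely independent of the number of known objects. The probe layout, quantization and vote threshold below are illustrative assumptions, not the paper's exact scheme:

```python
from collections import Counter, defaultdict

def quantize(depths, n_bins=5, step=50.0):
    # Encode depths at the probe points relative to the first probe.
    d0 = depths[0]
    return tuple(min(n_bins - 1, max(0, int((d - d0) / step) + n_bins // 2))
                 for d in depths[1:])

def build_tables(templates, probe_sets):
    # One hash table per probe set; every template registers its key.
    tables = [defaultdict(list) for _ in probe_sets]
    for t_id, depth_at in templates.items():   # depth_at: probe -> depth
        for table, probes in zip(tables, probe_sets):
            table[quantize([depth_at[p] for p in probes])].append(t_id)
    return tables

def candidate_templates(window_depth_at, tables, probe_sets, min_votes=3):
    # Each table whose key matches casts one vote per stored template.
    votes = Counter()
    for table, probes in zip(tables, probe_sets):
        key = quantize([window_depth_at[p] for p in probes])
        votes.update(table.get(key, ()))
    return [t for t, v in votes.items() if v >= min_votes]
```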

112 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: The Thermal Infrared Visual Object Tracking challenge 2015, VOT-TIR2015, aims at comparing short-term single-object visual trackers that work on thermal infrared (TIR) sequences and do not apply pre-learned models of object appearance.
Abstract: The Thermal Infrared Visual Object Tracking challenge 2015, VOT-TIR2015, aims at comparing short-term single-object visual trackers that work on thermal infrared (TIR) sequences and do not apply pre-learned models of object appearance. VOT-TIR2015 is the first benchmark on short-term tracking in TIR sequences. Results of 24 trackers are presented. For each participating tracker, a short description is provided in the appendix. The VOT-TIR2015 challenge is based on the VOT2013 challenge, but introduces the following novelties: (i) the newly collected LTIR (Linköping TIR) dataset is used, (ii) the VOT2013 attributes are adapted to TIR data, (iii) the evaluation is performed using insights gained during VOT2013 and VOT2014 and is similar to VOT2015.

99 citations


Proceedings ArticleDOI
23 Aug 2015
TL;DR: An unconstrained end-to-end text localization and recognition method that detects initial text hypotheses in a single pass by an efficient region-based method and refines them using a more robust local text model, deviating from the common assumption of region-based methods that all characters are detected as connected components.
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method detects initial text hypotheses in a single pass by an efficient region-based method and subsequently refines them using a more robust local text model, which deviates from the common assumption of region-based methods that all characters are detected as connected components.

86 citations


Posted Content
TL;DR: The WxBS-M matcher dominates the state-of-the-art methods on both the new and existing datasets, and simple adaptive thresholding improves the Hessian-Affine, DoG, MSER, and possibly other detectors, allowing their use on infrared and low-contrast images.
Abstract: We present a new problem -- wide multiple baseline stereo (WxBS) -- which considers the matching of images that simultaneously differ in more than one image acquisition factor, such as viewpoint, illumination or sensor type, or where object appearance changes significantly, e.g. over time. A new dataset with ground truth for the evaluation of matching algorithms has been introduced and will be made public. We have extensively tested a large set of popular and recent detectors and descriptors and show that the combination of RootSIFT and HalfRootSIFT as descriptors with the MSER and Hessian-Affine detectors works best for many different nuisance factors. We show that simple adaptive thresholding improves the Hessian-Affine, DoG and MSER (and possibly other) detectors and allows their use on infrared and low-contrast images. A novel matching algorithm for addressing the WxBS problem has been introduced. We show experimentally that the WxBS-M matcher dominates the state-of-the-art methods on both the new and existing datasets.

53 citations


Posted Content
TL;DR: A novel method for the fusion of multiple trackers based on hidden Markov models, which outperforms the state-of-the-art, often significantly, on all datasets in almost all criteria.
Abstract: In this paper, we propose a novel method for visual object tracking called HMMTxD. The method fuses observations from complementary out-of-the-box trackers and a detector by utilizing a hidden Markov model whose latent states correspond to a binary vector expressing the failure of individual trackers. The Markov model is trained in an unsupervised way, relying on an online learned detector to provide a source of tracker-independent information for a modified Baum-Welch algorithm that updates the model w.r.t. the partially annotated data. We show the effectiveness of the proposed method on combinations of two and three tracking algorithms. The performance of HMMTxD is evaluated on two standard benchmarks (CVPR2013 and VOT) and on a rich collection of 77 publicly available sequences. HMMTxD outperforms the state-of-the-art, often significantly, on all datasets in almost all criteria.
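
A minimal sketch of the forward-filtering idea behind such a fusion: the hidden state is a binary vector over N trackers (1 = currently correct), and each step combines a transition model with the likelihood of the current observations under every state. The toy transition matrix and stand-in likelihoods are assumptions; the actual HMMTxD observation model and Baum-Welch updates are not reproduced here:

```python
import itertools
import numpy as np

def hmm_fusion_step(alpha, A, obs_lik):
    # One forward-filtering step: predict with the transition matrix A,
    # weight by the observation likelihood of each state, renormalize.
    alpha = obs_lik * (A.T @ alpha)
    return alpha / alpha.sum()

N = 3                                                # trackers being fused
states = list(itertools.product([0, 1], repeat=N))   # 2^N failure vectors
K = len(states)
A = np.full((K, K), 1.0 / K)                         # toy transition model
alpha = np.full(K, 1.0 / K)                          # uniform state prior
obs_lik = np.random.default_rng(0).random(K)         # stand-in likelihoods
alpha = hmm_fusion_step(alpha, A, obs_lik)
# Marginal posterior that tracker i is currently correct:
p_ok = [sum(a for a, s in zip(alpha, states) if s[i]) for i in range(N)]
```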

35 citations


Proceedings ArticleDOI
01 Sep 2015
TL;DR: WxBS-M, a novel matching algorithm for the generalized wide baseline two-view matching (WxBS) problem, is proposed and shown experimentally to dominate the state-of-the-art matchers on both new and existing datasets.
Abstract (poster summary): Generalization of the wide baseline two-view matching problem to WxBS, where "x" stands for different subsets of "wide baselines" in acquisition conditions.
• A novel dataset of ground-truthed image pairs which include multiple "wide baselines"; state-of-the-art matchers fail on almost all of its image pairs.
• WxBS-M, a novel matching algorithm for the WxBS problem, is introduced and shown experimentally to dominate the state-of-the-art methods on both the new and existing datasets.

Take-aways:
• The SIFT family is still the best local descriptor and outperforms novel CNN approaches [SiamNet15].
• (Adaptive) Hessian-Affine is the best detector with broad applicability.
• Affine view synthesis greatly helps even for non-geometrical problems.
• The datasets and the WxBS-Matcher are available at http://cmp.felk.cvut.cz/wbs/
• More diverse datasets than Yosemite and Liberty are needed for learning local descriptors.

Per-baseline results (each evaluated with and without photometric normalization to mean 0.5, variance 0.2):
• WGBS (Wide Geometry Baseline Stereo): the SIFT family dominates; photometrically L2-normalized pixel intensities are a strong descriptor; the ConvNet [SiamNet15] is worse than SIFT, at least when not trained to handle large transformations; other descriptors are not competitive.
• WLBS (Wide iLlumination Baseline Stereo): the SIFT family dominates; the ConvNet [SiamNet15] is worse than SIFT, at least when not trained to handle illumination transformations; other descriptors are not competitive.
• WABS (Wide Appearance Baseline Stereo): the SIFT family dominates; the ConvNet [SiamNet15] performs poorly, as it is not trained for photometric distortions; other descriptors are not competitive.
• WSBS (Wide Sensor Baseline Stereo): no descriptor performs acceptably; only the gradient folding of HalfSIFT [HalfSIFT10] works, and poorly; the recall range [0, 0.14] indicates high difficulty.
• Map2Photo (a WABS special case; the map2ph dataset is published with this paper): a special, possibly learned, descriptor is needed for map-to-photo matching; the recall range [0, 0.06] indicates extreme difficulty.
Images are drawn from the Extreme View (EVD), Oxford-Affine (OxAff), SymBench, VPRiCE 2015, EdgeFoci (EF), GDBootstrap and MMS datasets.

WxBS-Matcher ingredients:
1. Affine view synthesis.
2. Adaptive thresholding: if the number of Hessian-Affine detections is below θHesAff, lower the detection threshold.
3. HalfRootSIFT: RootSIFT computed on gradient orientations folded into half the SIFT bins.
4. Reprojection of local features to the original images.
5. First geometrically inconsistent rule: for the second-nearest distance ratio, use only patches that are spatially inconsistent with the closest one.
6. Filter duplicates: discard re-detections.

WxBS-Matcher algorithm:
Input: two images I1, I2; Θm, the minimum required number of matches; Smax, the maximum number of iterations.
Output: fundamental or homography matrix F or H; a list of corresponding local features.
while Nmatches < Θm and Iter < Smax do
  for I1 and I2 separately do
    1. Generate synthetic views according to the scale-tilt-rotation-detector setup for Iter.
    2. Detect local features using adaptive thresholding.
    3. Extract rotation-invariant descriptors: (3a) RootSIFT and (3b) HalfRootSIFT.
    4. Reproject local features to I1, I2.
  end for
  5. Generate tentative correspondences with a kD-tree, separately for RootSIFT and HalfRootSIFT, using the first-geometrically-inconsistent rule.
  6. Filter duplicates.
  7. Geometrically verify all tentative correspondences with modified DEGENSAC [DEGENSAC05], estimating F or H.
  8. Check the geometric consistency of the local affine features with the estimated F or H.
end while

Notes: TILDE detector results were obtained after the camera-ready deadline. Best results were achieved by adaptive Hessian-Affine (AdHesAff) among single detectors and by WxBS-M [MODS15] among view-synthesis-based matchers.

References:
• [SiamNet15] S. Zagoruyko, N. Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. CVPR 2015.
• [HalfSIFT10] J. Chen, J. Tian, N. Lee, J. Zheng, R. Smith, A. Laine. A partial intensity invariant feature descriptor for multimodal retinal image registration. IEEE Transactions on Biomedical Engineering, 2010.
• [MODS15] D. Mishkin, J. Matas, M. Perdoch. MODS: Fast and Robust Method for Two-View Matching. Accepted to CVIU, 2015.
• [DEGENSAC05] O. Chum, T. Werner, J. Matas. Two-view Geometry Estimation Unaffected by a Dominant Plane. CVPR 2005.
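
A compact Python/OpenCV sketch of the iterative outer loop above, with plain SIFT, a crude tilt-only view synthesis and OpenCV's RANSAC standing in for the paper's adaptive Hessian-Affine, RootSIFT/HalfRootSIFT and modified DEGENSAC; the tilt schedule and thresholds are illustrative:

```python
import cv2
import numpy as np

def match_wxbs(img1, img2, min_matches=15, max_iters=3, ratio=0.8):
    # Iteratively add synthesized affine views until geometric
    # verification finds enough matches (WxBS-Matcher outer loop).
    sift = cv2.SIFT_create()
    tilts = [1.0, 2.0, 4.0]            # coarse-to-fine synthesis schedule
    for it in range(max_iters):
        kps, descs = [], []
        for img in (img1, img2):
            k_all, d_all = [], []
            for t in tilts[: it + 1]:
                # crude affine view synthesis: horizontal tilt by factor t
                A = np.float32([[1.0 / t, 0, 0], [0, 1, 0]])
                view = cv2.warpAffine(img, A, (img.shape[1], img.shape[0]))
                k, d = sift.detectAndCompute(view, None)
                for kp in k:
                    kp.pt = (kp.pt[0] * t, kp.pt[1])  # reproject to original
                k_all.extend(k)
                if d is not None:
                    d_all.append(d)
            kps.append(k_all)
            descs.append(np.vstack(d_all))
        # tentative correspondences via the ratio test
        good = []
        for pair in cv2.BFMatcher().knnMatch(descs[0], descs[1], k=2):
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                good.append(pair[0])
        if len(good) >= min_matches:
            p1 = np.float32([kps[0][m.queryIdx].pt for m in good])
            p2 = np.float32([kps[1][m.trainIdx].pt for m in good])
            # geometric verification (plain RANSAC instead of DEGENSAC)
            F, inliers = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0)
            if inliers is not None and inliers.sum() >= min_matches:
                return F, good
    return None, []
```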

Proceedings ArticleDOI
29 Sep 2015
TL;DR: Two other areas of improvement over the original method are identified: a Hough-based tracing, bringing a speed-up of more than 5 times, and a search for edgelets in stripes instead of wedges, improving performance especially at lower rates of false positives per image.
Abstract: Real-time scalable detection of texture-less objects in 2D images is a highly relevant task for augmented reality applications such as assembly guidance. The paper presents a purely edge-based method building on the approach of Damen et al. (2012) [5]. The proposed method exploits the recent structured edge detector by Dollár and Zitnick (2013) [8], which uses supervised examples for improved object outline detection; it was experimentally shown to yield consistently better results than the standard Canny edge detector. The work identifies two other areas of improvement over the original method: a Hough-based tracing, bringing a speed-up of more than 5 times, and a search for edgelets in stripes instead of wedges, improving performance especially at lower rates of false positives per image. Experimental evaluation shows the proposed method to be faster and more robust. The method is also demonstrated to be suitable to support an augmented reality application for assembly guidance.

Posted Content
TL;DR: In this article, an unconstrained end-to-end text localization and recognition method is presented, which detects initial text hypotheses in a single pass by an efficient region-based method and subsequently refines them using a more robust local text model.
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method detects initial text hypotheses in a single pass by an efficient region-based method and subsequently refines them using a more robust local text model, which deviates from the common assumption of region-based methods that all characters are detected as connected components. Additionally, a novel feature based on character stroke area estimation is introduced. The feature is efficiently computed from a region distance map, is invariant to scaling and rotations, and allows text regions to be detected efficiently regardless of what portion of text they capture. The method runs in real time and achieves state-of-the-art text localization and recognition results on the ICDAR 2013 Robust Reading dataset.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: A novel efficient method for extraction of object proposals that exploits deep spatial pyramid features, a novel fast-to-compute HoG-based edge statistic and the EdgeBoxes score to achieve state-of-the-art recall performance on Pascal VOC07.
Abstract: A novel efficient method for extraction of object proposals is introduced. Its "objectness" function exploits deep spatial pyramid features, a novel fast-to-compute HoG-based edge statistic and the EdgeBoxes score [42]. The efficiency is achieved by the use of spatial bins in a novel combination with a sparsity-inducing group-normalized SVM. State-of-the-art recall performance is achieved on Pascal VOC07, significantly outperforming methods with comparable speed. Interestingly, when only 100 proposals per image are considered the method attains 78% recall on VOC07. The method improves the mAP of the RCNN class-specific detector, increasing it by 10 points when only 50 proposals are used in each image. The system trained on twenty classes performs well on the two-hundred-class ILSVRC2013 set, confirming its generalization capability.
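
The sparsity-inducing group-normalized SVM plausibly refers to the standard group-lasso-regularized SVM (an assumption about the exact variant; the paper may weight the groups), whose training objective replaces the usual squared-norm regularizer with a sum of per-group Euclidean norms:

\min_{w,\,b}\; \lambda \sum_{g=1}^{G} \lVert w_g \rVert_2 \;+\; \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^{\top} x_i + b)\bigr)

Here w_g denotes the block of weights belonging to spatial bin g. A group with \lVert w_g \rVert_2 = 0 is dropped entirely at test time, which is consistent with the efficiency claim in the abstract.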

Patent
24 Nov 2015
TL;DR: In this article, a stroke-specific keypoint detector is proposed for text detection; it is scale- and rotation-invariant and produces significantly fewer false detections than the detectors commonly used in scene text localization.
Abstract: We propose a stroke (i.e. a general curvilinear structure) detector based on pixel intensity comparisons in a local neighborhood. When applied to text detection, stroke-specific keypoints are detected. Text fragments are subsequently extracted, possibly by local thresholding or a segmentation procedure guided by keypoint properties. Classification based on a "strokeness" feature eliminates non-text segmentations. The proposed stroke-specific keypoint detector is scale- and rotation-invariant. When applied to text detection, it is significantly faster and produces significantly fewer false detections than the detectors commonly used in scene text localization. The proposed detector produces 2 times fewer segmentations and detects 25% more characters than the commonly exploited MSER detector, and the process is 4 times faster. After an efficient classification step, the number of segmentations is reduced to one seventh of the standard method's, while the detector remains almost 3 times faster. The proposed stroke-specific detector handles a wide variety of scripts (Latin, Hebrew, Chinese, etc.) and fonts. The detector is applicable to problems where linear structures are localised in the image, such as bar code localisation and recognition or road network extraction.

DOI
01 Jan 2015
TL;DR: The Dagstuhl Seminar 15081 "Holistic Scene Understanding" was a great success, which is also reflected in the very positive feedback from the evaluation.
Abstract: This report documents the program and the outcomes of Dagstuhl Seminar 15081 "Holistic Scene Understanding". During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Overall, the seminar was a great success, which is also reflected in the very positive feedback we received from the evaluation.

Posted Content
TL;DR: In this paper, a novel method for extraction of object proposals is introduced, which exploits deep spatial pyramid features, a novel fast-to-compute HoG-based edge statistic and the EdgeBoxes score.
Abstract: A novel efficient method for extraction of object proposals is introduced. Its "objectness" function exploits deep spatial pyramid features, a novel fast-to-compute HoG-based edge statistic and the EdgeBoxes score. The efficiency is achieved by the use of spatial bins in a novel combination with a sparsity-inducing group-normalized SVM. State-of-the-art recall performance is achieved on Pascal VOC07, significantly outperforming methods with comparable speed. Interestingly, when only 100 proposals per image are considered the method attains 78% recall on VOC07. The method improves the mAP of the state-of-the-art RCNN class-specific detector, increasing it by 10 points when only 50 proposals are used in each image. The system trained on twenty classes performs well on the two-hundred-class ILSVRC2013 set, confirming its generalization capability.

Proceedings ArticleDOI
23 Aug 2015
TL;DR: A novel representation, textual visual words, is proposed, which describes text by generic visual words that predict, in a geometrically consistent manner, the bottom and top lines of text.
Abstract: We address the problem of text localization and retrieval in real-world images. We are the first to study the retrieval of text images, i.e. the selection of images containing text in large collections at high speed. We propose a novel representation, textual visual words, which describes text by generic visual words that predict, in a geometrically consistent manner, the bottom and top lines of text. The visual words are discretized SIFT descriptors of Hessian features. The features may correspond to various structures present in the text: character fragments, individual characters or their arrangements. The textual visual words representation is invariant to affine transformations of the image and local linear changes of intensity. Experiments demonstrate that the proposed method outperforms the state of the art on the MS dataset. The proposed method detects blurry, small-font, low-contrast and noisy text in real-world images.
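
A rough Python sketch of the voting idea: quantized SIFT descriptors that were associated with text during training cast scale-proportional votes for the image rows of the text bottom and top lines, and geometrically consistent peaks indicate text. The vocabulary, the learned per-word offsets and the accumulator resolution are illustrative assumptions:

```python
import numpy as np

def text_line_votes(descs, pts, scales, vocab, word_offsets, n_rows=480):
    """descs: (N,128) SIFT descriptors; pts: (N,2) keypoint (x, y) positions;
    scales: (N,) feature scales; vocab: (K,128) visual vocabulary;
    word_offsets: word id -> (bottom, top) line offsets in units of scale."""
    bottom, top = np.zeros(n_rows), np.zeros(n_rows)
    for d, (x, y), s in zip(descs, pts, scales):
        w = int(np.argmin(np.linalg.norm(vocab - d, axis=1)))  # quantize
        if w in word_offsets:               # word seen on text in training
            off_b, off_t = word_offsets[w]
            b, t = int(y + off_b * s), int(y + off_t * s)
            if 0 <= b < n_rows:
                bottom[b] += 1
            if 0 <= t < n_rows:
                top[t] += 1
    return bottom, top                      # peaks = candidate text lines
```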