
Showing papers by "Jiri Matas published in 2015"


Proceedings ArticleDOI
23 Aug 2015
TL;DR: A new Challenge 4 on Incidental Scene Text has been added to the existing Challenges on Born-Digital Images, Focused Scene Images and Video Text, and tasks assessing End-to-End system performance have been introduced in all Challenges.
Abstract: Results of the ICDAR 2015 Robust Reading Competition are presented. A new Challenge 4 on Incidental Scene Text has been added to the Challenges on Born-Digital Images, Focused Scene Images and Video Text. Challenge 4 is run on a newly acquired dataset of 1,670 images evaluating Text Localisation, Word Recognition and End-to-End pipelines. In addition, the dataset for Challenge 3 on Video Text has been substantially updated with more video sequences and more accurate ground truth data. Finally, tasks assessing End-to-End system performance have been introduced to all Challenges. The competition took place in the first quarter of 2015, and received a total of 44 submissions. Only the tasks newly introduced in 2015 are reported on. The datasets, the ground truth specification and the evaluation protocols are presented together with the results and a brief summary of the participating methods.

1,224 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance; it presents a new VOT2015 dataset twice as large as VOT2014's, with full annotation of targets by rotated bounding boxes and per-frame attributes.
Abstract: The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT2015 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2015 challenge that go beyond its VOT2014 predecessor are: (i) a new VOT2015 dataset twice as large as VOT2014's, with full annotation of targets by rotated bounding boxes and per-frame attributes, (ii) an extension of the VOT2014 evaluation methodology by the introduction of a new performance measure. The dataset, the evaluation kit and the results are publicly available at the challenge website.

667 citations


Posted Content
TL;DR: Layer-sequential unit-variance (LSUV) initialization, as proposed in this paper, is a simple method for weight initialization for deep net learning. It consists of two steps: first, pre-initialize the weights of each convolution or inner-product layer with orthonormal matrices; then proceed from the first to the final layer, normalizing the variance of each layer's output to one.
Abstract: Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of two steps. First, pre-initialize the weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one. Experiments with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization enables the learning of very deep nets, (i) produces networks with test accuracy better than or equal to that of standard methods, and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets, such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)). Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets; the state of the art, or results very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.
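
A minimal NumPy sketch of the two LSUV steps for a toy fully-connected ReLU net; the layer shapes, tolerance and data batch are illustrative assumptions, not values from the paper:

```python
import numpy as np

def orthonormal(shape, rng):
    # Step 1: pre-initialize with an orthonormal matrix.
    a = rng.standard_normal(shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == shape else vt

def lsuv_init(weights, X, tol_var=0.01, max_iter=10):
    # Step 2: walk the layers first to last, rescaling each weight
    # matrix until the variance of the layer's output is ~1 on batch X.
    h = X
    for i, W in enumerate(weights):
        for _ in range(max_iter):
            out = np.maximum(h @ W, 0.0)   # toy ReLU layer
            v = out.var()
            if abs(v - 1.0) < tol_var:
                break
            W = W / np.sqrt(v)             # normalize output variance
        weights[i] = W
        h = np.maximum(h @ W, 0.0)         # forward pass with the fixed layer
    return weights

rng = np.random.default_rng(0)
shapes = [(64, 128), (128, 128), (128, 10)]   # illustrative layer shapes
weights = [orthonormal(s, rng) for s in shapes]
X = rng.standard_normal((256, 64))            # a batch of training data
weights = lsuv_init(weights, X)
```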

224 citations


Journal ArticleDOI
TL;DR: An improved method for tentative correspondence selection, applicable both with and without view synthesis, is introduced; a modification of the standard first-to-second nearest distance ratio rule increases the number of correct matches by 5–20% at no additional computational cost.
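
The modified rule (the "first geometrically inconsistent" rule, described in the authors' WxBS poster further down this page) takes as the "second nearest" descriptor the closest one that is spatially inconsistent with the nearest neighbour, so that multiple detections of the same structure do not veto a correct match. A minimal NumPy sketch under assumed inputs; the ratio and pixel-distance thresholds are illustrative:

```python
import numpy as np

def fginn_matches(d1, d2, xy2, ratio=0.8, min_dist=10.0):
    """First-to-second nearest ratio test where the 'second' neighbour is
    the first geometrically inconsistent one: the closest descriptor in
    image 2 lying further than min_dist pixels from the nearest neighbour.
    d1, d2: (N,128) and (M,128) descriptor arrays; xy2: (M,2) keypoints."""
    matches = []
    for i, d in enumerate(d1):
        dists = np.linalg.norm(d2 - d, axis=1)
        order = np.argsort(dists)
        nn = order[0]
        # walk candidates until one is spatially far from the nearest one
        second = next((j for j in order[1:]
                       if np.linalg.norm(xy2[j] - xy2[nn]) > min_dist), None)
        if second is not None and dists[nn] < ratio * dists[second]:
            matches.append((i, nn))
    return matches
```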

158 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: A novel easy-to-implement stroke detector is proposed, based on efficient pixel intensity comparisons to surrounding pixels; stroke-specific keypoints are efficiently detected, and text fragments are subsequently extracted by local thresholding guided by keypoint properties.
Abstract: We propose a novel easy-to-implement stroke detector based on an efficient pixel intensity comparison to surrounding pixels. Stroke-specific keypoints are efficiently detected and text fragments are subsequently extracted by local thresholding guided by keypoint properties. Classification based on effectively calculated features then eliminates non-text regions. The stroke-specific keypoints produce 2 times fewer region segmentations and still detect 25% more characters than the commonly exploited MSER detector, and the process is 4 times faster. After a novel efficient classification step, the number of regions is reduced to one seventh of the standard method's, while the pipeline remains almost 3 times faster. All stages of the proposed pipeline are scale- and rotation-invariant and support a wide variety of scripts (Latin, Hebrew, Chinese, etc.) and fonts. When the proposed detector is plugged into a scene text localization and recognition pipeline, state-of-the-art text localization accuracy is maintained whilst the processing time is significantly reduced.
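
A loose Python sketch of the kind of FAST-like intensity test the abstract describes, flagging a pixel whose surrounding ring contains a long contiguous arc darker than the centre (a dark stroke on a light background). The ring geometry, margin and arc length are illustrative assumptions, not the paper's exact test:

```python
import numpy as np

# 12-pixel ring of radius 2 around the centre (illustrative sampling ring)
RING = [(0, 2), (1, 2), (2, 1), (2, 0), (2, -1), (1, -2), (0, -2),
        (-1, -2), (-2, -1), (-2, 0), (-2, 1), (-1, 2)]

def is_stroke_keypoint(img, y, x, margin=16, min_arc=9):
    """Flag (y, x) if a long contiguous arc of the surrounding ring is
    darker than the centre by `margin`. A real detector would also handle
    bright-on-dark strokes and run over multiple scales."""
    c = int(img[y, x])
    darker = [int(img[y + dy, x + dx]) < c - margin for dy, dx in RING]
    darker = darker + darker          # duplicate the ring for wrap-around runs
    run = best = 0
    for d in darker:
        run = run + 1 if d else 0
        best = max(best, run)
    return best >= min_arc
```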

151 citations


Proceedings ArticleDOI
17 Dec 2015
TL;DR: Experimental evaluation shows that the proposed method yields a recognition rate comparable to the state of the art, while its complexity is sub-linear in the number of templates.
Abstract: Despite their ubiquitous presence, texture-less objects present significant challenges to contemporary visual object detection and localization algorithms. This paper proposes a practical method for the detection and accurate 3D localization of multiple texture-less and rigid objects depicted in RGB-D images. The detection procedure adopts the sliding window paradigm, with an efficient cascade-style evaluation of each window location. A simple pre-filtering is performed first, rapidly rejecting most locations. For each remaining location, a set of candidate templates (i.e. trained object views) is identified with a voting procedure based on hashing, which makes the method's computational complexity largely unaffected by the total number of known objects. The candidate templates are then verified by matching feature points in different modalities. Finally, the approximate object pose associated with each detected template is used as a starting point for a stochastic optimization procedure that estimates accurate 3D pose. Experimental evaluation shows that the proposed method yields a recognition rate comparable to the state of the art, while its complexity is sub-linear in the number of templates.
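
A minimal Python sketch of the hashing-based voting idea: measurements at a few fixed probe points in the window are quantized into a key, and hash tables map keys to the templates that produced them, so candidates are retrieved in time largely independent of the number of known objects. The probe layout, quantization and vote threshold below are illustrative assumptions, not the paper's exact scheme:

```python
from collections import Counter, defaultdict

def quantize(depths, n_bins=5, step=50.0):
    # Encode depths at the probe points relative to the first probe.
    d0 = depths[0]
    return tuple(min(n_bins - 1, max(0, int((d - d0) / step) + n_bins // 2))
                 for d in depths[1:])

def build_tables(templates, probe_sets):
    # One hash table per probe set; every template registers its key.
    tables = [defaultdict(list) for _ in probe_sets]
    for t_id, depth_at in templates.items():   # depth_at: probe -> depth
        for table, probes in zip(tables, probe_sets):
            table[quantize([depth_at[p] for p in probes])].append(t_id)
    return tables

def candidate_templates(window_depth_at, tables, probe_sets, min_votes=3):
    # Each table whose key matches casts one vote per stored template.
    votes = Counter()
    for table, probes in zip(tables, probe_sets):
        key = quantize([window_depth_at[p] for p in probes])
        votes.update(table.get(key, ()))
    return [t for t, v in votes.items() if v >= min_votes]
```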

112 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: The Thermal Infrared Visual Object Tracking challenge 2015, VOT-TIR2015, aims at comparing short-term single-object visual trackers that work on thermal infrared (TIR) sequences and do not apply pre-learned models of object appearance.
Abstract: The Thermal Infrared Visual Object Tracking challenge 2015, VOT-TIR2015, aims at comparing short-term single-object visual trackers that work on thermal infrared (TIR) sequences and do not apply pre-learned models of object appearance. VOT-TIR2015 is the first benchmark on short-term tracking in TIR sequences. Results of 24 trackers are presented. For each participating tracker, a short description is provided in the appendix. The VOT-TIR2015 challenge is based on the VOT2013 challenge, but introduces the following novelties: (i) the newly collected LTIR (Linköping TIR) dataset is used, (ii) the VOT2013 attributes are adapted to TIR data, (iii) the evaluation is performed using insights gained during VOT2013 and VOT2014 and is similar to VOT2015.

99 citations


Proceedings ArticleDOI
23 Aug 2015
TL;DR: An unconstrained end-to-end text localization and recognition method that detects initial text hypotheses in a single pass by an efficient region-based method and refines them using a more robust local text model, deviating from the common assumption of region-based methods that all characters are detected as connected components.
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method detects initial text hypotheses in a single pass by an efficient region-based method and subsequently refines them using a more robust local text model, which deviates from the common assumption of region-based methods that all characters are detected as connected components.

86 citations


Posted Content
TL;DR: The WxBS-M matcher dominates the state-of-the-art methods on both the new and existing datasets, and simple adaptive thresholding improves the Hessian-Affine, DoG, MSER, and possibly other detectors, allowing their use on infrared and low-contrast images.
Abstract: We present a new problem -- wide multiple baseline stereo (WxBS) -- which considers the matching of images that simultaneously differ in more than one image acquisition factor, such as viewpoint, illumination or sensor type, or where object appearance changes significantly, e.g. over time. A new dataset with ground truth for the evaluation of matching algorithms has been introduced and will be made public. We have extensively tested a large set of popular and recent detectors and descriptors and show that the combination of RootSIFT and HalfRootSIFT as descriptors with the MSER and Hessian-Affine detectors works best for many different nuisance factors. We show that simple adaptive thresholding improves the Hessian-Affine, DoG and MSER (and possibly other) detectors and allows their use on infrared and low-contrast images. A novel matching algorithm for addressing the WxBS problem has been introduced. We show experimentally that the WxBS-M matcher dominates the state-of-the-art methods on both the new and existing datasets.

53 citations


Posted Content
TL;DR: A novel method for the fusion of multiple trackers based on hidden Markov models, which outperforms the state-of-the-art, often significantly, on all datasets in almost all criteria.
Abstract: In this paper, we propose a novel method for visual object tracking called HMMTxD. The method fuses observations from complementary out-of-the-box trackers and a detector by utilizing a hidden Markov model whose latent states correspond to a binary vector expressing the failure of individual trackers. The Markov model is trained in an unsupervised way, relying on an online learned detector to provide a source of tracker-independent information for a modified Baum-Welch algorithm that updates the model w.r.t. the partially annotated data. We show the effectiveness of the proposed method on combinations of two and three tracking algorithms. The performance of HMMTxD is evaluated on two standard benchmarks (CVPR2013 and VOT) and on a rich collection of 77 publicly available sequences. HMMTxD outperforms the state-of-the-art, often significantly, on all datasets in almost all criteria.
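
A minimal sketch of the forward-filtering idea behind such a fusion: the hidden state is a binary vector over N trackers (1 = currently correct), and each step combines a transition model with the likelihood of the current observations under every state. The toy transition matrix and stand-in likelihoods are assumptions; the actual HMMTxD observation model and Baum-Welch updates are not reproduced here:

```python
import itertools
import numpy as np

def hmm_fusion_step(alpha, A, obs_lik):
    # One forward-filtering step: predict with the transition matrix A,
    # weight by the observation likelihood of each state, renormalize.
    alpha = obs_lik * (A.T @ alpha)
    return alpha / alpha.sum()

N = 3                                                # trackers being fused
states = list(itertools.product([0, 1], repeat=N))   # 2^N failure vectors
K = len(states)
A = np.full((K, K), 1.0 / K)                         # toy transition model
alpha = np.full(K, 1.0 / K)                          # uniform state prior
obs_lik = np.random.default_rng(0).random(K)         # stand-in likelihoods
alpha = hmm_fusion_step(alpha, A, obs_lik)
# Marginal posterior that tracker i is currently correct:
p_ok = [sum(a for a, s in zip(alpha, states) if s[i]) for i in range(N)]
```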

35 citations


Proceedings ArticleDOI
01 Sep 2015
TL;DR: WxBS-M, a novel matching algorithm for the generalized wide baseline two-view matching (WxBS) problem, is proposed and shown experimentally to dominate the state-of-the-art matchers on both new and existing datasets.
Abstract (poster summary): Generalization of the wide baseline two-view matching problem to WxBS, where "x" stands for different subsets of "wide baselines" in acquisition conditions.
• A novel dataset of ground-truthed image pairs which include multiple "wide baselines"; state-of-the-art matchers fail on almost all of its image pairs.
• WxBS-M, a novel matching algorithm for the WxBS problem, is introduced and shown experimentally to dominate the state-of-the-art methods on both the new and existing datasets.

Take-aways:
• The SIFT family is still the best local descriptor and outperforms novel CNN approaches [SiamNet15].
• (Adaptive) Hessian-Affine is the best detector with broad applicability.
• Affine view synthesis greatly helps even for non-geometrical problems.
• The datasets and the WxBS-Matcher are available at http://cmp.felk.cvut.cz/wbs/
• More diverse datasets than Yosemite and Liberty are needed for learning local descriptors.

Per-baseline results (each evaluated with and without photometric normalization to mean 0.5, variance 0.2):
• WGBS (Wide Geometry Baseline Stereo): the SIFT family dominates; photometrically L2-normalized pixel intensities are a strong descriptor; the ConvNet [SiamNet15] is worse than SIFT, at least when not trained to handle large transformations; other descriptors are not competitive.
• WLBS (Wide iLlumination Baseline Stereo): the SIFT family dominates; the ConvNet [SiamNet15] is worse than SIFT, at least when not trained to handle illumination transformations; other descriptors are not competitive.
• WABS (Wide Appearance Baseline Stereo): the SIFT family dominates; the ConvNet [SiamNet15] performs poorly, as it is not trained for photometric distortions; other descriptors are not competitive.
• WSBS (Wide Sensor Baseline Stereo): no descriptor performs acceptably; only the gradient folding of HalfSIFT [HalfSIFT10] works, and poorly; the recall range [0, 0.14] indicates high difficulty.
• Map2Photo (a WABS special case; the map2ph dataset is published with this paper): a special, possibly learned, descriptor is needed for map-to-photo matching; the recall range [0, 0.06] indicates extreme difficulty.
Images are drawn from the Extreme View (EVD), Oxford-Affine (OxAff), SymBench, VPRiCE 2015, EdgeFoci (EF), GDBootstrap and MMS datasets.

WxBS-Matcher ingredients:
1. Affine view synthesis.
2. Adaptive thresholding: if the number of Hessian-Affine detections is below θHesAff, lower the detection threshold.
3. HalfRootSIFT: RootSIFT computed on gradient orientations folded into half the SIFT bins.
4. Reprojection of local features to the original images.
5. First geometrically inconsistent rule: for the second-nearest distance ratio, use only patches that are spatially inconsistent with the closest one.
6. Filter duplicates: discard re-detections.

WxBS-Matcher algorithm:
Input: two images I1, I2; Θm, the minimum required number of matches; Smax, the maximum number of iterations.
Output: fundamental or homography matrix F or H; a list of corresponding local features.
while Nmatches < Θm and Iter < Smax do
  for I1 and I2 separately do
    1. Generate synthetic views according to the scale-tilt-rotation-detector setup for Iter.
    2. Detect local features using adaptive thresholding.
    3. Extract rotation-invariant descriptors: (3a) RootSIFT and (3b) HalfRootSIFT.
    4. Reproject local features to I1, I2.
  end for
  5. Generate tentative correspondences with a kD-tree, separately for RootSIFT and HalfRootSIFT, using the first-geometrically-inconsistent rule.
  6. Filter duplicates.
  7. Geometrically verify all tentative correspondences with modified DEGENSAC [DEGENSAC05], estimating F or H.
  8. Check the geometric consistency of the local affine features with the estimated F or H.
end while

Notes: TILDE detector results were obtained after the camera-ready deadline. Best results were achieved by adaptive Hessian-Affine (AdHesAff) among single detectors and by WxBS-M [MODS15] among view-synthesis-based matchers.

References:
• [SiamNet15] S. Zagoruyko, N. Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. CVPR 2015.
• [HalfSIFT10] J. Chen, J. Tian, N. Lee, J. Zheng, R. Smith, A. Laine. A partial intensity invariant feature descriptor for multimodal retinal image registration. IEEE Transactions on Biomedical Engineering, 2010.
• [MODS15] D. Mishkin, J. Matas, M. Perdoch. MODS: Fast and Robust Method for Two-View Matching. Accepted to CVIU, 2015.
• [DEGENSAC05] O. Chum, T. Werner, J. Matas. Two-view Geometry Estimation Unaffected by a Dominant Plane. CVPR 2005.
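
A compact Python/OpenCV sketch of the iterative outer loop above, with plain SIFT, a crude tilt-only view synthesis and OpenCV's RANSAC standing in for the paper's adaptive Hessian-Affine, RootSIFT/HalfRootSIFT and modified DEGENSAC; the tilt schedule and thresholds are illustrative:

```python
import cv2
import numpy as np

def match_wxbs(img1, img2, min_matches=15, max_iters=3, ratio=0.8):
    # Iteratively add synthesized affine views until geometric
    # verification finds enough matches (WxBS-Matcher outer loop).
    sift = cv2.SIFT_create()
    tilts = [1.0, 2.0, 4.0]            # coarse-to-fine synthesis schedule
    for it in range(max_iters):
        kps, descs = [], []
        for img in (img1, img2):
            k_all, d_all = [], []
            for t in tilts[: it + 1]:
                # crude affine view synthesis: horizontal tilt by factor t
                A = np.float32([[1.0 / t, 0, 0], [0, 1, 0]])
                view = cv2.warpAffine(img, A, (img.shape[1], img.shape[0]))
                k, d = sift.detectAndCompute(view, None)
                for kp in k:
                    kp.pt = (kp.pt[0] * t, kp.pt[1])  # reproject to original
                k_all.extend(k)
                if d is not None:
                    d_all.append(d)
            kps.append(k_all)
            descs.append(np.vstack(d_all))
        # tentative correspondences via the ratio test
        good = []
        for pair in cv2.BFMatcher().knnMatch(descs[0], descs[1], k=2):
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                good.append(pair[0])
        if len(good) >= min_matches:
            p1 = np.float32([kps[0][m.queryIdx].pt for m in good])
            p2 = np.float32([kps[1][m.trainIdx].pt for m in good])
            # geometric verification (plain RANSAC instead of DEGENSAC)
            F, inliers = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0)
            if inliers is not None and inliers.sum() >= min_matches:
                return F, good
    return None, []
```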

Proceedings ArticleDOI
29 Sep 2015
TL;DR: Two other areas of improvement over the original method are identified: a Hough-based tracing, bringing a speed-up of more than 5 times, and a search for edgelets in stripes instead of wedges, improving performance especially at lower rates of false positives per image.
Abstract: Real-time scalable detection of texture-less objects in 2D images is a highly relevant task for augmented reality applications such as assembly guidance. The paper presents a purely edge-based method building on the approach of Damen et al. (2012) [5]. The proposed method exploits the recent structured edge detector by Dollár and Zitnick (2013) [8], which uses supervised examples for improved object outline detection; it was experimentally shown to yield consistently better results than the standard Canny edge detector. The work identifies two other areas of improvement over the original method: a Hough-based tracing, bringing a speed-up of more than 5 times, and a search for edgelets in stripes instead of wedges, improving performance especially at lower rates of false positives per image. Experimental evaluation shows the proposed method to be faster and more robust. The method is also demonstrated to be suitable to support an augmented reality application for assembly guidance.

Posted Content
TL;DR: In this article, an unconstrained end-to-end text localization and recognition method is presented, which detects initial text hypotheses in a single pass by an efficient region-based method and subsequently refines them using a more robust local text model.
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method detects initial text hypotheses in a single pass by an efficient region-based method and subsequently refines them using a more robust local text model, which deviates from the common assumption of region-based methods that all characters are detected as connected components. Additionally, a novel feature based on character stroke area estimation is introduced. The feature is efficiently computed from a region distance map, is invariant to scaling and rotations, and allows text regions to be detected efficiently regardless of what portion of text they capture. The method runs in real time and achieves state-of-the-art text localization and recognition results on the ICDAR 2013 Robust Reading dataset.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: A novel efficient method for extraction of object proposals that exploits deep spatial pyramid features, a novel fast-to-compute HoG-based edge statistic and the EdgeBoxes score to achieve state-of-the-art recall performance on Pascal VOC07.
Abstract: A novel efficient method for extraction of object proposals is introduced. Its "objectness" function exploits deep spatial pyramid features, a novel fast-to-compute HoG-based edge statistic and the EdgeBoxes score [42]. The efficiency is achieved by the use of spatial bins in a novel combination with a sparsity-inducing group-normalized SVM. State-of-the-art recall performance is achieved on Pascal VOC07, significantly outperforming methods with comparable speed. Interestingly, when only 100 proposals per image are considered the method attains 78% recall on VOC07. The method improves the mAP of the RCNN class-specific detector, increasing it by 10 points when only 50 proposals are used in each image. The system trained on twenty classes performs well on the two-hundred-class ILSVRC2013 set, confirming its generalization capability.
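
The sparsity-inducing group-normalized SVM plausibly refers to the standard group-lasso-regularized SVM (an assumption about the exact variant; the paper may weight the groups), whose training objective replaces the usual squared-norm regularizer with a sum of per-group Euclidean norms:

\min_{w,\,b}\; \lambda \sum_{g=1}^{G} \lVert w_g \rVert_2 \;+\; \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^{\top} x_i + b)\bigr)

Here w_g denotes the block of weights belonging to spatial bin g. A group with \lVert w_g \rVert_2 = 0 is dropped entirely at test time, which is consistent with the efficiency claim in the abstract.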

Patent
24 Nov 2015
TL;DR: In this article, a stroke-specific keypoint detector is proposed for text detection; it is scale- and rotation-invariant and produces significantly fewer false detections than the detectors commonly used in scene text localization.
Abstract: We propose a stroke (i.e. a general curvilinear structure) detector based on pixel intensity comparisons in a local neighborhood. When applied to text detection, stroke-specific keypoints are detected. Text fragments are subsequently extracted, possibly by local thresholding or a segmentation procedure guided by keypoint properties. Classification based on a "strokeness" feature eliminates non-text segmentations. The proposed stroke-specific keypoint detector is scale- and rotation-invariant. When applied to text detection, it is significantly faster and produces significantly fewer false detections than the detectors commonly used in scene text localization. The proposed detector produces 2 times fewer segmentations and detects 25% more characters than the commonly exploited MSER detector, and the process is 4 times faster. After an efficient classification step, the number of segmentations is reduced to one seventh of the standard method's, while the detector remains almost 3 times faster. The proposed stroke-specific detector handles a wide variety of scripts (Latin, Hebrew, Chinese, etc.) and fonts. The detector is applicable to problems where linear structures are localised in the image, such as bar code localisation and recognition or road network extraction.

DOI
01 Jan 2015
TL;DR: The Dagstuhl Seminar 15081 "Holistic Scene Understanding" was a great success, which is also reflected in the very positive feedback from the evaluation.
Abstract: This report documents the program and the outcomes of Dagstuhl Seminar 15081 "Holistic Scene Understanding". During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Overall, the seminar was a great success, which is also reflected in the very positive feedback we received from the evaluation.

Posted Content
TL;DR: In this paper, a novel method for extraction of object proposals is introduced, which exploits deep spatial pyramid features, a novel fast-to-compute HoG-based edge statistic and the EdgeBoxes score.
Abstract: A novel efficient method for extraction of object proposals is introduced. Its "objectness" function exploits deep spatial pyramid features, a novel fast-to-compute HoG-based edge statistic and the EdgeBoxes score. The efficiency is achieved by the use of spatial bins in a novel combination with a sparsity-inducing group-normalized SVM. State-of-the-art recall performance is achieved on Pascal VOC07, significantly outperforming methods with comparable speed. Interestingly, when only 100 proposals per image are considered the method attains 78% recall on VOC07. The method improves the mAP of the state-of-the-art RCNN class-specific detector, increasing it by 10 points when only 50 proposals are used in each image. The system trained on twenty classes performs well on the two-hundred-class ILSVRC2013 set, confirming its generalization capability.

Proceedings ArticleDOI
23 Aug 2015
TL;DR: A novel representation, textual visual words, is proposed, which describes text by generic visual words that predict, in a geometrically consistent manner, the bottom and top lines of text.
Abstract: We address the problem of text localization and retrieval in real-world images. We are the first to study the retrieval of text images, i.e. the selection of images containing text in large collections at high speed. We propose a novel representation, textual visual words, which describes text by generic visual words that predict, in a geometrically consistent manner, the bottom and top lines of text. The visual words are discretized SIFT descriptors of Hessian features. The features may correspond to various structures present in the text: character fragments, individual characters or their arrangements. The textual visual words representation is invariant to affine transformations of the image and local linear changes of intensity. Experiments demonstrate that the proposed method outperforms the state of the art on the MS dataset. The proposed method detects blurry, small-font, low-contrast and noisy text in real-world images.
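
A rough Python sketch of the voting idea: quantized SIFT descriptors that were associated with text during training cast scale-proportional votes for the image rows of the text bottom and top lines, and geometrically consistent peaks indicate text. The vocabulary, the learned per-word offsets and the accumulator resolution are illustrative assumptions:

```python
import numpy as np

def text_line_votes(descs, pts, scales, vocab, word_offsets, n_rows=480):
    """descs: (N,128) SIFT descriptors; pts: (N,2) keypoint (x, y) positions;
    scales: (N,) feature scales; vocab: (K,128) visual vocabulary;
    word_offsets: word id -> (bottom, top) line offsets in units of scale."""
    bottom, top = np.zeros(n_rows), np.zeros(n_rows)
    for d, (x, y), s in zip(descs, pts, scales):
        w = int(np.argmin(np.linalg.norm(vocab - d, axis=1)))  # quantize
        if w in word_offsets:               # word seen on text in training
            off_b, off_t = word_offsets[w]
            b, t = int(y + off_b * s), int(y + off_t * s)
            if 0 <= b < n_rows:
                bottom[b] += 1
            if 0 <= t < n_rows:
                top[t] += 1
    return bottom, top                      # peaks = candidate text lines
```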