
Showing papers by "Jiri Matas published in 2019"


Proceedings ArticleDOI
Matej Kristan1, Amanda Berg2, Linyu Zheng3, Litu Rout4  +176 moreInstitutions (43)
01 Oct 2019
TL;DR: The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative; results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years.

Abstract: The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative. Results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis as well as the standard VOT methodology for long-term tracking analysis. The VOT2019 challenge was composed of five challenges focusing on different tracking domains: (i) the VOT-ST2019 challenge focused on short-term tracking in RGB, (ii) the VOT-RT2019 challenge focused on "real-time" short-term tracking in RGB, and (iii) VOT-LT2019 focused on long-term tracking, namely coping with target disappearance and reappearance. Two new challenges were introduced: (iv) the VOT-RGBT2019 challenge focused on short-term tracking in RGB and thermal imagery and (v) the VOT-RGBD2019 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2019, VOT-RT2019 and VOT-LT2019 datasets were refreshed, while new datasets were introduced for VOT-RGBT2019 and VOT-RGBD2019. The VOT toolkit has been updated to support standard short-term tracking, long-term tracking, and tracking with multi-channel imagery. Performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.

393 citations


Proceedings ArticleDOI
15 Jun 2019
TL;DR: In this article, a method called sigma-consensus is proposed to eliminate the need for a user-defined inlier-outlier threshold in RANSAC; instead of estimating the noise sigma, it is marginalized over a range of noise scales.

Abstract: A method called sigma-consensus is proposed to eliminate the need for a user-defined inlier-outlier threshold in RANSAC. Instead of estimating the noise sigma, it is marginalized over a range of noise scales. The optimized model is obtained by weighted least-squares fitting, where the weights come from the marginalization over sigma of the point likelihoods of being inliers. A new quality function is proposed that does not require sigma and, thus, a set of inliers to determine the model quality. Also, a new termination criterion for RANSAC is built on the proposed marginalization approach. Applying sigma-consensus, MAGSAC is proposed, which needs no user-defined sigma and improves the accuracy of robust estimation significantly. It is superior to the state-of-the-art in terms of geometric accuracy on publicly available real-world datasets for epipolar geometry (F and E) and homography estimation. In addition, applying sigma-consensus only once as a post-processing step to the RANSAC output always improved the model quality on a wide range of vision problems without noticeable deterioration in processing time, adding a few milliseconds.
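The marginalization idea can be illustrated in a few lines: instead of a hard inlier threshold, each point receives a weight obtained by averaging its inlier likelihood over a grid of noise scales, and the model is refined by weighted least squares. The sketch below is a toy line-fitting example; the uniform sigma grid, the Gaussian inlier model, and all names are illustrative assumptions, not the paper's exact analytic derivation.

```python
import numpy as np

def sigma_marginalized_weights(residuals, sigma_max=2.0, n_sigmas=10):
    """Per-point weight: inlier likelihood marginalized over noise scales.

    Simplified sketch of the sigma-consensus idea (uniform prior over a
    sigma grid, Gaussian inlier model); the paper performs the
    marginalization analytically.
    """
    sigmas = np.linspace(sigma_max / n_sigmas, sigma_max, n_sigmas)
    r2 = residuals[:, None] ** 2
    # Gaussian likelihood of each residual under each noise scale.
    lik = np.exp(-r2 / (2.0 * sigmas**2)) / sigmas
    return lik.mean(axis=1)  # average over the sigma grid

def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a*x + b."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    W = np.diag(w)
    a, b = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return a, b

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)
y[:20] += rng.uniform(5, 10, 20)  # gross outliers
# Residuals w.r.t. a model hypothesis (here the true line stands in
# for a good RANSAC-proposed model).
w = sigma_marginalized_weights(y - (2.0 * x + 1.0))
a, b = weighted_line_fit(x, y, w)
```

No threshold is ever chosen: outliers simply receive vanishing marginalized likelihood, so the refit recovers the line despite 20% contamination.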

187 citations


Proceedings ArticleDOI
01 Sep 2019
TL;DR: The RRC-MLT-2019 challenge, as discussed by the authors, builds on RRC-MLT-2017 and aims to systematically benchmark and push forward the state-of-the-art in multi-lingual scene text detection and recognition.

Abstract: With the growing cosmopolitan culture of modern cities, the need for robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been more pressing. With the goal of systematically benchmarking and pushing the state-of-the-art forward, the proposed competition builds on top of RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large-scale multi-lingual synthetic dataset to assist the training, and a baseline end-to-end recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the RRC-MLT-2019 challenge.

175 citations


Posted Content
TL;DR: MAGSAC++ and the Progressive NAPSAC sampler are proposed, and it is shown that the progressive spatial sampling in P-NAPSAC can be integrated with PROSAC sampling, which is applied to the first, location-defining, point.
Abstract: A new method for robust estimation, MAGSAC++, is proposed. It introduces a new model quality (scoring) function that does not require the inlier-outlier decision, and a novel marginalization procedure formulated as an iteratively re-weighted least-squares approach. We also propose a new sampler, Progressive NAPSAC, for RANSAC-like robust estimators. Exploiting the fact that nearby points often originate from the same model in real-world data, it finds local structures earlier than global samplers. The progressive transition from local to global sampling does not suffer from the weaknesses of purely localized samplers. On six publicly available real-world datasets for homography and fundamental matrix fitting, MAGSAC++ produces results superior to state-of-the-art robust methods. It is faster, more geometrically accurate and fails less often.

68 citations


Proceedings ArticleDOI
09 Jan 2019
TL;DR: The grayness index, GI in short, is derived using the Dichromatic Reflection Model and is learning-free; it outperforms state-of-the-art statistical methods and many recent deep methods.

Abstract: We propose a novel grayness index for finding gray pixels and demonstrate its effectiveness and efficiency in illumination estimation. The grayness index, GI in short, is derived using the Dichromatic Reflection Model and is learning-free. GI makes it possible to estimate one or multiple illumination sources in color-biased images. On standard single-illumination and multiple-illumination estimation benchmarks, GI outperforms state-of-the-art statistical methods and many recent deep methods. GI is simple and fast, written in a few dozen lines of code, processing a 1080p image in ~0.4 seconds with non-optimized Matlab code.
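The underlying intuition admits a compact sketch: under the dichromatic reflection model, pixels that are gray under the scene illuminant have spatially constant log-channel differences, so small local gradients of log(R)−log(G) and log(B)−log(G) signal grayness, and averaging the most-gray pixels estimates the illuminant. The code below is a simplified stand-in for the paper's exact GI derivation; the score definition, the top-percentage selection, and all names are illustrative assumptions.

```python
import numpy as np

def grayness_score(img, eps=1e-6):
    """Per-pixel grayness: local contrast of log-channel differences.

    Simplified stand-in for the paper's grayness index (GI); lower
    scores indicate more likely gray pixels.
    """
    logI = np.log(img + eps)
    d_rg = logI[..., 0] - logI[..., 1]
    d_bg = logI[..., 2] - logI[..., 1]
    def grad_mag(c):
        gy, gx = np.gradient(c)
        return np.hypot(gy, gx)
    return grad_mag(d_rg) + grad_mag(d_bg)

def estimate_illuminant(img, top_percent=1.0):
    """Unit-norm average color of the most-gray pixels."""
    score = grayness_score(img).ravel()
    n = max(1, int(score.size * top_percent / 100))
    idx = np.argsort(score)[:n]
    ill = img.reshape(-1, 3)[idx].mean(axis=0)
    return ill / np.linalg.norm(ill)

# Synthetic check: gray surfaces under a reddish illuminant, with a
# patch of random colored clutter in one corner.
rng = np.random.default_rng(1)
reflectance = rng.uniform(0.2, 1.0, (64, 64, 1))
ill_true = np.array([0.8, 0.5, 0.33])
img = reflectance * ill_true
img[:16, :16] = rng.uniform(0.2, 1.0, (16, 16, 3))
est = estimate_illuminant(img)
```

On this toy scene the recovered illuminant direction matches the true one, since the gray (textured but achromatic) pixels dominate the low end of the score.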

65 citations


Proceedings ArticleDOI
15 Jun 2019
TL;DR: The proposed long-term RGB-D tracker called OTR – Object Tracking by Reconstruction performs online 3D target reconstruction to facilitate robust learning of a set of view-specific discriminative correlation filters (DCFs).
Abstract: Standard RGB-D trackers treat the target as a 2D structure, which makes modelling appearance changes related even to out-of-plane rotation challenging. This limitation is addressed by the proposed long-term RGB-D tracker called OTR – Object Tracking by Reconstruction. OTR performs online 3D target reconstruction to facilitate robust learning of a set of view-specific discriminative correlation filters (DCFs). The 3D reconstruction supports two performance-enhancing features: (i) generation of an accurate spatial support for constrained DCF learning from its 2D projection and (ii) point-cloud based estimation of 3D pose change for selection and storage of view-specific DCFs which robustly localize the target after out-of-view rotation or heavy occlusion. Extensive evaluation on the Princeton RGB-D tracking and STC Benchmarks shows OTR outperforms the state-of-the-art by a large margin.

63 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: The CDTB dataset is the largest and most diverse dataset in RGB-D tracking, with an order of magnitude more frames than related datasets; experiments reveal a large gap between RGB and RGB-D trackers that had not been detected by prior benchmarks.

Abstract: We propose a new color-and-depth general visual object tracking benchmark (CDTB). CDTB is recorded by several passive and active RGB-D setups and contains indoor as well as outdoor sequences acquired in direct sunlight. The CDTB dataset is the largest and most diverse dataset in RGB-D tracking, with an order of magnitude more frames than related datasets. The sequences have been carefully recorded to contain significant object pose change, clutter, occlusion, and periods of long-term target absence to enable tracker evaluation under realistic conditions. Sequences are per-frame annotated with 13 visual attributes for detailed analysis. Experiments with RGB and RGB-D trackers show that CDTB is more challenging than previous datasets. State-of-the-art RGB trackers outperform the recent RGB-D trackers, indicating a large gap between the two fields which had not been detected by prior benchmarks. Based on the results of the analysis we point out opportunities for future research in RGB-D tracker design.

52 citations


Proceedings ArticleDOI
04 Mar 2019
TL;DR: A deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN) can handle extremely strong and spatially-variant motion blur and is shown to improve the performance of existing feature detectors and descriptors against the motion blur.
Abstract: We propose a deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN). With the help of such measurements, it can handle extremely strong and spatially-variant motion blur. At the same time, the image data is used to overcome the limitations of gyro-based blur estimation. To train our network, we also introduce a novel way of generating realistic training data using the gyroscope. The evaluation shows a clear improvement in visual quality over the state-of-the-art while achieving real-time performance. Furthermore, the method is shown to improve the performance of existing feature detectors and descriptors against the motion blur.
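How gyroscope readings translate into a spatially-variant blur estimate can be sketched under strong simplifying assumptions: pure camera rotation, a pinhole camera, and Euler integration of the angular-velocity samples recorded during the exposure. This is only an illustration of the geometry; the paper's actual blur model and the way the gyro signal is fed to the CNN differ.

```python
import numpy as np

def blur_trajectory(omega_samples, dt, focal_px, cx, cy, x, y):
    """Trace the blur streak of pixel (x, y) from gyroscope readings.

    Sketch under simplifying assumptions (pure rotation, pinhole camera
    with focal length focal_px and principal point (cx, cy), Euler
    integration). omega_samples: (N, 3) angular velocities [rad/s]
    recorded during the exposure. Returns (N + 1, 2) pixel positions.
    """
    # Back-project the pixel to a ray, rotate it, re-project.
    ray = np.array([(x - cx) / focal_px, (y - cy) / focal_px, 1.0])
    R = np.eye(3)
    pts = [(x, y)]
    for w in omega_samples:
        theta = np.linalg.norm(w) * dt
        if theta > 0:
            k = w / np.linalg.norm(w)
            K = np.array([[0.0, -k[2], k[1]],
                          [k[2], 0.0, -k[0]],
                          [-k[1], k[0], 0.0]])
            # Rodrigues' formula for the incremental rotation.
            R = (np.eye(3) + np.sin(theta) * K
                 + (1.0 - np.cos(theta)) * K @ K) @ R
        p = R @ ray
        pts.append((focal_px * p[0] / p[2] + cx,
                    focal_px * p[1] / p[2] + cy))
    return np.array(pts)

# Rotation about the y-axis moves the principal point horizontally by
# roughly focal_px * angle: 10 samples of 0.5 rad/s, 1 ms apart.
traj = blur_trajectory(np.tile([0.0, 0.5, 0.0], (10, 1)), 1e-3,
                       1000.0, 640.0, 360.0, 640.0, 360.0)
```

A trajectory like this varies across the image plane, which is why gyro-informed deblurring can handle spatially-variant blur that uniform kernels cannot.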

32 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: Prog-X as discussed by the authors is a multi-model fitting algorithm that interleaves sampling and consolidation of the current data interpretation via repetitive hypothesis proposal, fast rejection, and integration of the new hypothesis into the kept instance set by labeling energy minimization.
Abstract: The Progressive-X algorithm, Prog-X in short, is proposed for geometric multi-model fitting. The method interleaves sampling and consolidation of the current data interpretation via repetitive hypothesis proposal, fast rejection, and integration of the new hypothesis into the kept instance set by labeling energy minimization. Due to exploring the data progressively, the method has several beneficial properties compared with the state-of-the-art. First, a clear criterion, adopted from RANSAC, controls the termination and stops the algorithm when the probability of finding a new model with a reasonable number of inliers falls below a threshold. Second, Prog-X is an any-time algorithm. Thus, whenever it is interrupted, e.g. due to a time limit, the returned instances cover real and, likely, the most dominant model instances. The method is superior to the state-of-the-art in terms of accuracy in both synthetic experiments and on publicly available real-world datasets for homography, two-view motion, and motion segmentation.
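The termination criterion adopted from RANSAC can be stated concretely: sampling may stop once enough hypotheses have been drawn that, with a chosen confidence, at least one all-inlier minimal sample would have been found for any model with the given inlier ratio. A minimal sketch of that standard formula (variable names are illustrative):

```python
import math

def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Number of minimal samples k so that, with probability
    `confidence`, at least one is all-inlier:
        k = log(1 - confidence) / log(1 - inlier_ratio ** sample_size)
    This is the standard RANSAC criterion that Prog-X adapts for
    termination."""
    good = inlier_ratio ** sample_size
    if good >= 1.0:
        return 1
    return math.ceil(math.log(1 - confidence) / math.log(1 - good))

# Homography fitting (4-point minimal sample) with 50% inliers:
k = ransac_iterations(0.5, 4)
```

When the inlier ratio of any yet-undiscovered model drops, the required iteration count explodes, which is exactly the signal to stop proposing new instances.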

25 citations


Posted Content
TL;DR: The Progressive-X algorithm, Prog-X in short, is proposed for geometric multi-model fitting and is superior to the state-of-the-art in terms of accuracy in both synthetic experiments and on publicly available real-world datasets for homography, two-view motion, and motion segmentation.
Abstract: The Progressive-X algorithm, Prog-X in short, is proposed for geometric multi-model fitting. The method interleaves sampling and consolidation of the current data interpretation via repetitive hypothesis proposal, fast rejection, and integration of the new hypothesis into the kept instance set by labeling energy minimization. Due to exploring the data progressively, the method has several beneficial properties compared with the state-of-the-art. First, a clear criterion, adopted from RANSAC, controls the termination and stops the algorithm when the probability of finding a new model with a reasonable number of inliers falls below a threshold. Second, Prog-X is an any-time algorithm. Thus, whenever it is interrupted, e.g. due to a time limit, the returned instances cover real and, likely, the most dominant model instances. The method is superior to the state-of-the-art in terms of accuracy in both synthetic experiments and on publicly available real-world datasets for homography, two-view motion, and motion segmentation.

21 citations


Proceedings ArticleDOI
01 Jan 2019
TL;DR: Experiments on two real-world benchmarks show that the proposed approach outperforms state-of-the-art methods in the camera-agnostic scenario and in the setting where the camera is known, MSGP outperforms all statistical methods.
Abstract: We present a statistical color constancy method that relies on novel gray pixel detection and mean shift clustering. The method, called Mean Shifted Grey Pixel -- MSGP, is based on the observation that true-gray pixels are aligned towards one single direction. Our solution is compact, easy to compute and requires no training. Experiments on two real-world benchmarks show that the proposed approach outperforms state-of-the-art methods in the camera-agnostic scenario. In the setting where the camera is known, MSGP outperforms all statistical methods.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, the problem of different training and test set class priors is addressed in the context of CNN classifiers, and two approaches to the estimation of the unknown test priors are compared: an existing Maximum Likelihood Estimation (MLE) method and a proposed Maximum a Posteriori (MAP) approach introducing a Dirichlet hyper-prior on the class prior probabilities.
Abstract: The problem of different training and test set class priors is addressed in the context of CNN classifiers. We compare two approaches to the estimation of the unknown test priors: an existing Maximum Likelihood Estimation (MLE) method and a proposed Maximum a Posteriori (MAP) approach introducing a Dirichlet hyper-prior on the class prior probabilities. Experimental results show a significant improvement in the fine-grained classification tasks using known evaluation-time priors, increasing top-1 accuracy by 4.0% on the FGVC iNaturalist 2018 validation set and by 3.9% on the FGVCx Fungi 2018 validation set. Estimation of the unknown test set priors noticeably increases the accuracy on the PlantCLEF dataset, allowing a single CNN model to achieve state-of-the-art results and to outperform the competition-winning ensemble of 12 CNNs. The proposed MAP estimation increases the prediction accuracy by 2.8% on PlantCLEF 2017 and by 1.8% on FGVCx Fungi, where the MLE method decreases accuracy.
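The two ingredients, re-weighting classifier posteriors for a changed prior and EM estimation of unknown test-set priors from unlabeled predictions, can be sketched as follows. This is a toy two-class Gaussian example following the classic EM-style MLE update the paper compares against; the proposed MAP variant with a Dirichlet hyper-prior is not reproduced here, and all names are illustrative.

```python
import numpy as np

def adjust_predictions(probs, train_priors, test_priors):
    """Re-weight posteriors for a changed class prior:
    p'(y|x) proportional to p(y|x) * test_prior(y) / train_prior(y)."""
    adj = probs * (test_priors / train_priors)
    return adj / adj.sum(axis=1, keepdims=True)

def estimate_test_priors(probs, train_priors, n_iters=100):
    """EM (maximum-likelihood) estimate of unknown test priors from
    unlabeled test-set posteriors: alternately adjust the posteriors
    with the current prior estimate and average them."""
    priors = train_priors.copy()
    for _ in range(n_iters):
        post = adjust_predictions(probs, train_priors, priors)
        priors = post.mean(axis=0)
    return priors

# Toy check: calibrated 2-class posteriors (Gaussian classes at -1/+1,
# trained with balanced priors), imbalanced 10%/90% test set.
rng = np.random.default_rng(2)
labels = (rng.uniform(size=2000) < 0.1).astype(int)
x = rng.normal(size=2000) + np.where(labels == 1, 1.0, -1.0)
p1 = 1.0 / (1.0 + np.exp(-2.0 * x))        # exact posterior, prior 0.5
probs = np.stack([1 - p1, p1], axis=1)
est = estimate_test_priors(probs, np.array([0.5, 0.5]))
```

The estimate converges close to the true 0.1 prior of the minority class, and feeding it back through `adjust_predictions` is what yields the accuracy gains the abstract reports.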

Posted Content
TL;DR: A deep depth-aware long-term tracker is proposed that achieves state-of-the-art RGBD tracking performance and is fast to run and reformulate deep discriminative correlation filter (DCF) to embed the depth information into deep features.
Abstract: The best RGBD trackers provide high accuracy but are slow to run. On the other hand, the best RGB trackers are fast but clearly inferior on the RGBD datasets. In this work, we propose a deep depth-aware long-term tracker that achieves state-of-the-art RGBD tracking performance and is fast to run. We reformulate the deep discriminative correlation filter (DCF) to embed the depth information into deep features. Moreover, the same depth-aware correlation filter is used for target re-detection. Comprehensive evaluations show that the proposed tracker achieves state-of-the-art performance on the Princeton RGBD, STC, and the newly-released CDTB benchmarks and runs at 20 fps.

Posted Content
TL;DR: It is shown that the progressive spatial sampling in P-NAPSAC can be integrated with PROSAC sampling, which is applied to the first, location-defining, point.
Abstract: We propose Progressive NAPSAC, P-NAPSAC in short, which merges the advantages of local and global sampling by drawing samples from gradually growing neighborhoods. Exploiting the fact that nearby points are more likely to originate from the same geometric model, P-NAPSAC finds local structures earlier than global samplers. We show that the progressive spatial sampling in P-NAPSAC can be integrated with PROSAC sampling, which is applied to the first, location-defining, point. P-NAPSAC is embedded in USAC, a state-of-the-art robust estimation pipeline, which we further improve by implementing its local optimization as in Graph-Cut RANSAC. We call the resulting estimator USAC*. The method is tested on homography and fundamental matrix fitting on a total of 10,691 models from seven publicly available datasets. USAC* with P-NAPSAC outperforms reference methods in terms of speed on all problems.
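The gradually growing neighborhood can be sketched compactly: a location-defining point is drawn first, the remaining sample points come from its k nearest neighbors, and k grows with the iteration count so sampling moves from local to global. The sketch below draws the first point uniformly at random for simplicity (the paper draws it via PROSAC ordering), and the linear growth schedule and names are illustrative assumptions.

```python
import numpy as np

def p_napsac_sample(points, sample_size, iteration, growth=0.5, rng=None):
    """One Progressive-NAPSAC-style minimal sample (simplified sketch).

    The first, location-defining point is drawn at random; the rest come
    from its k nearest neighbors, with k growing with the iteration
    count so sampling transitions from local to global.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(points)
    k = min(n - 1, sample_size - 1 + int(growth * iteration))
    i = rng.integers(n)
    d = np.linalg.norm(points - points[i], axis=1)
    neighbors = np.argsort(d)[1:k + 1]      # k nearest, excluding i itself
    rest = rng.choice(neighbors, size=sample_size - 1, replace=False)
    return np.concatenate(([i], rest))

rng = np.random.default_rng(3)
pts = rng.uniform(0, 100, (200, 2))
early = p_napsac_sample(pts, 4, iteration=0, rng=rng)    # highly local
late = p_napsac_sample(pts, 4, iteration=300, rng=rng)   # much wider
```

Early samples are tight spatial clusters, which is what lets local structures be found before a purely global sampler would hit them.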

Proceedings ArticleDOI
01 Oct 2019
TL;DR: The Tracking by Deblatting (TbD) tracker as discussed by the authors is based on the observation that motion blur is directly related to the intra-frame trajectory of an object.
Abstract: Objects moving at high speed along complex trajectories often appear in videos, especially videos of sports. Such objects cover a non-negligible distance during the exposure time of a single frame, and therefore their position in the frame is not well defined. They appear as semi-transparent streaks due to motion blur and cannot be reliably tracked by standard trackers. We propose a novel approach called Tracking by Deblatting, based on the observation that motion blur is directly related to the intra-frame trajectory of an object. Blur is estimated by solving two intertwined inverse problems, blind deblurring and image matting, which we call deblatting. The trajectory is then estimated by fitting a piecewise quadratic curve, which models physically justifiable trajectories. As a result, tracked objects are precisely localized with higher temporal resolution than by conventional trackers. The proposed TbD tracker was evaluated on a newly created dataset of videos with ground truth obtained by a high-speed camera, using a novel Trajectory-IoU metric that generalizes the traditional Intersection over Union and measures the accuracy of the intra-frame trajectory. The proposed method outperforms the baseline both in recall and trajectory accuracy.
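The Trajectory-IoU idea generalizes per-frame IoU by evaluating overlap continuously along the intra-frame trajectory: place the object at sampled time instants on both the estimated and ground-truth curves and average the IoUs. The sketch below uses axis-aligned boxes of fixed size and uniform time sampling as illustrative assumptions; the paper's metric is defined w.r.t. the actual object model.

```python
import numpy as np

def iou_boxes(a, b):
    """IoU of two axis-aligned boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def trajectory_iou(traj_est, traj_gt, size, n_samples=50):
    """Average IoU of an object of fixed `size` placed along the
    estimated vs. ground-truth intra-frame trajectory.
    traj_*: callables mapping t in [0, 1] to an (x, y) center."""
    w, h = size
    ious = []
    for t in np.linspace(0.0, 1.0, n_samples):
        xe, ye = traj_est(t)
        xg, yg = traj_gt(t)
        ious.append(iou_boxes((xe - w / 2, ye - h / 2, w, h),
                              (xg - w / 2, yg - h / 2, w, h)))
    return float(np.mean(ious))

# A straight-line estimate vs. a parabolic (physically plausible) arc:
gt = lambda t: (100 * t, 50 * t * (1 - t))
est = lambda t: (100 * t, 0.0)
tiou = trajectory_iou(est, gt, size=(20, 20))
```

A tracker that only matches the endpoints of the frame still loses Trajectory-IoU wherever the intra-frame path deviates, which is the sub-frame accuracy the metric is designed to capture.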

Posted Content
TL;DR: AMOS Patches, a large set of image cut-outs intended primarily for the robustification of trainable local feature descriptors to illumination and appearance changes, achieves state-of-the-art in matching under illumination changes on standard benchmarks.
Abstract: We present AMOS Patches, a large set of image cut-outs, intended primarily for the robustification of trainable local feature descriptors to illumination and appearance changes. Images contributing to AMOS Patches originate from the AMOS dataset of recordings from a large set of outdoor webcams. The semiautomatic method used to generate AMOS Patches is described. It includes camera selection, viewpoint clustering and patch selection. For training, we provide both the registered full source images as well as the patches. A new descriptor, trained on the AMOS Patches and 6Brown datasets, is introduced. It achieves state-of-the-art results in matching under illumination changes on standard benchmarks.

Posted Content
TL;DR: The dataset, tasks and findings of the RRC-MLT-2019 challenge are presented; the challenge has 4 tasks covering various aspects of multi-lingual scene text.

Abstract: With the growing cosmopolitan culture of modern cities, the need for robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been more pressing. With the goal of systematically benchmarking and pushing the state-of-the-art forward, the proposed competition builds on top of RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large-scale multi-lingual synthetic dataset to assist the training, and a baseline end-to-end recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the RRC-MLT-2019 challenge.

Journal ArticleDOI
TL;DR: It is shown how the original CA space can be generalized to multiple outputs by the Cartesian product (CartCA); however, for target spaces with more than two outputs CartCA becomes computationally infeasible, so an approximate solution, multi-view CA (MvCA), which applies CartCA to output pairs, is proposed.

01 Jan 2019
TL;DR: Performance improvements were achieved by adjusting the CNN predictions according to the estimated change of the class prior probabilities, replacing network parameters with their running averages, test-time data augmentation, filtering the provided training set and adding additional training images from GBIF.

Abstract: The paper describes an automatic system for recognition of 10,000 plant species, with focus on species from the Guiana shield and the Amazon rain forest. The proposed system achieves the best results on the PlantCLEF 2019 test set with 31.9% accuracy. Compared against human experts in plant recognition, the system performed better than 3 of the 5 participating human experts and achieved 41.0% accuracy on the subset for expert evaluation. The proposed system is based on the Inception-v4 and Inception-ResNet-v2 Convolutional Neural Network (CNN) architectures. Performance improvements were achieved by: adjusting the CNN predictions according to the estimated change of the class prior probabilities, replacing network parameters with their running averages, test-time data augmentation, filtering the provided training set and adding additional training images from GBIF.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: It is shown that flash photography significantly improves the performance of gray pixel detection without illuminant prior, training data, or calibration of the flash, and a novel flash photography dataset generated from the MIT intrinsic dataset is introduced.
Abstract: In the real world, a scene is usually cast by multiple illuminants and herein we address the problem of spatial illumination estimation. Our solution is based on detecting gray pixels with the help of flash photography. We show that flash photography significantly improves the performance of gray pixel detection without illuminant prior, training data or calibration of the flash. We also introduce a novel flash photography dataset generated from the MIT intrinsic dataset.


Journal ArticleDOI
TL;DR: A hybrid approach is evaluated that combines feature-based and mutual-information-based pose estimation methods to benefit from their complementary properties.

Proceedings ArticleDOI
TL;DR: A novel method that tracks fast moving objects, mainly non-uniform spherical, in full 6 degrees of freedom, estimating simultaneously their 3D motion trajectory, 3D pose and object appearance changes with a time step that is a fraction of the video frame exposure time is proposed.
Abstract: We propose a novel method that tracks fast moving objects, mainly non-uniform spherical, in full 6 degrees of freedom, estimating simultaneously their 3D motion trajectory, 3D pose and object appearance changes with a time step that is a fraction of the video frame exposure time. The sub-frame object localization and appearance estimation allows realistic temporal super-resolution and precise shape estimation. The method, called TbD-3D (Tracking by Deblatting in 3D) relies on a novel reconstruction algorithm which solves a piece-wise deblurring and matting problem. The 3D rotation is estimated by minimizing the reprojection error. As a second contribution, we present a new challenging dataset with fast moving objects that change their appearance and distance to the camera. High speed camera recordings with zero lag between frame exposures were used to generate videos with different frame rates annotated with ground-truth trajectory and pose.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: Experiments conducted on a newly-created dataset of 63 care label images show that even when exploiting problem-specific constraints, a state-of-the-art scene text detection and recognition method achieves precision and recall slightly above 0.6, confirming the challenging nature of the problem.

Abstract: The paper introduces the problem of care label recognition and presents a method addressing it. A care label, also called a care tag, is a small piece of cloth or paper attached to a garment providing instructions for its maintenance and information about e.g. the material and size. The information and instructions are written as symbols or plain text. Care label recognition is a challenging text and pictogram recognition problem - the often sewn text is small, looking as if printed using a non-standard font; the contrast of the text gradually fades, making OCR progressively more difficult. On the other hand, the information provided is typically redundant and thus it facilitates semi-supervised learning. The presented care label recognition method is based on the recently published End-to-End Method for Multi-Language Scene Text, E2E-MLT, Busta et al. 2018, exploiting specific constraints, e.g. a care label vocabulary with multi-language equivalences. Experiments conducted on a newly-created dataset of 63 care label images show that even when exploiting problem-specific constraints, a state-of-the-art scene text detection and recognition method achieves precision and recall slightly above 0.6, confirming the challenging nature of the problem.

Posted Content
TL;DR: This article proposes a grayness index for finding gray pixels and demonstrates its effectiveness and efficiency in illumination estimation; the index is derived using the Dichromatic Reflection Model and is learning-free.

Abstract: We propose a novel grayness index for finding gray pixels and demonstrate its effectiveness and efficiency in illumination estimation. The grayness index, GI in short, is derived using the Dichromatic Reflection Model and is learning-free. GI makes it possible to estimate one or multiple illumination sources in color-biased images. On standard single-illumination and multiple-illumination estimation benchmarks, GI outperforms state-of-the-art statistical methods and many recent deep methods. GI is simple and fast, written in a few dozen lines of code, processing a 1080p image in ~0.4 seconds with non-optimized Matlab code.

Posted Content
TL;DR: It is shown that flash photography significantly improves the performance of gray pixel detection without illuminant prior, training data, or calibration of the flash, and a novel flash photography dataset generated from the MIT intrinsic dataset is introduced.
Abstract: In the real world, a scene is usually cast by multiple illuminants and herein we address the problem of spatial illumination estimation. Our solution is based on detecting gray pixels with the help of flash photography. We show that flash photography significantly improves the performance of gray pixel detection without illuminant prior, training data or calibration of the flash. We also introduce a novel flash photography dataset generated from the MIT intrinsic dataset.

Posted Content
TL;DR: In this paper, a simple method for synchronization of video streams with a precision better than one millisecond is proposed, which is applicable to any number of rolling shutter cameras and when a few photographic flashes or other abrupt lighting changes are present in the video.
Abstract: A simple method for synchronization of video streams with a precision better than one millisecond is proposed. The method is applicable to any number of rolling shutter cameras and when a few photographic flashes or other abrupt lighting changes are present in the video. The approach exploits the rolling shutter sensor property that every sensor row starts its exposure with a small delay after the onset of the previous row. The cameras may have different frame rates and resolutions, and need not have overlapping fields of view. The method was validated on five minutes of four streams from an ice hockey match. The found transformation maps events visible in all cameras to a reference time with a standard deviation of the temporal error in the range of 0.3 to 0.5 milliseconds. The quality of the synchronization is demonstrated on temporally and spatially overlapping images of a fast moving puck observed in two cameras.
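The rolling shutter property the abstract exploits has a simple algebraic core: the capture time of a sensor row is the frame time plus the row index times the per-row delay, so an abrupt lighting change observed at known (frame, row) coordinates in two cameras pins down their relative time offset with sub-millisecond resolution. A sketch with illustrative numbers (not the paper's calibration procedure, which also estimates the per-camera parameters):

```python
def row_time(frame_idx, row, fps, row_delay, offset=0.0):
    """Capture time of one sensor row in a rolling-shutter stream: each
    row starts its exposure `row_delay` seconds after the previous one."""
    return offset + frame_idx / fps + row * row_delay

def offset_from_flash(obs_a, obs_b, cams):
    """Given the (frame, first-exposed-row) at which an abrupt flash
    appears in two cameras, return camera B's time offset relative to
    camera A. `cams` holds per-camera fps and row delay."""
    ta = row_time(*obs_a, **cams["a"])
    tb = row_time(*obs_b, **cams["b"])
    return ta - tb

# Hypothetical cameras with different frame rates and row delays.
cams = {"a": {"fps": 30.0, "row_delay": 30e-6},
        "b": {"fps": 25.0, "row_delay": 28e-6}}
# Flash first visible at frame 10, row 400 in A; frame 8, row 120 in B.
dt = offset_from_flash((10, 400), (8, 120), cams)
```

Because the row term resolves time in tens of microseconds, a handful of flash observations suffices to map all cameras onto one reference clock, matching the 0.3-0.5 ms errors reported in the abstract.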

Proceedings ArticleDOI
TL;DR: ALFA as discussed by the authors is based on agglomerative clustering of object detector predictions taking into consideration both the bounding box locations and the class scores, and achieves state-of-the-art results on PASCAL VOC 2007 and PASCAL VOC 2012, outperforming the individual detectors as well as baseline combination strategies.

Abstract: We propose ALFA - a novel late fusion algorithm for object detection. ALFA is based on agglomerative clustering of object detector predictions taking into consideration both the bounding box locations and the class scores. Each cluster represents a single object hypothesis whose location is a weighted combination of the clustered bounding boxes. ALFA was evaluated using combinations of a pair (SSD and DeNet) and a triplet (SSD, DeNet and Faster R-CNN) of recent object detectors that are close to the state-of-the-art. ALFA achieves state-of-the-art results on PASCAL VOC 2007 and PASCAL VOC 2012, outperforming the individual detectors as well as baseline combination strategies, achieving up to 32% lower error than the best individual detectors and up to 6% lower error than the reference fusion algorithm DBF - Dynamic Belief Fusion.
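The clustering-then-weighted-combination skeleton can be sketched compactly: link detections whose boxes overlap above a threshold, then emit one hypothesis per cluster located at the score-weighted average of its member boxes. ALFA itself also clusters on class-score similarity and tunes the fusion; the single-link IoU clustering, the fixed threshold, and the names below are simplifying assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def fuse_detections(boxes, scores, iou_thr=0.5):
    """Late-fusion sketch: single-link agglomerative clustering of boxes
    by IoU (via union-find), then one hypothesis per cluster whose
    location is the score-weighted average of the member boxes."""
    n = len(boxes)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if iou(boxes[i], boxes[j]) >= iou_thr:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    fused = []
    for members in clusters.values():
        w = np.array([scores[k] for k in members])
        b = np.array([boxes[k] for k in members])
        fused.append(((w[:, None] * b).sum(0) / w.sum(), float(w.max())))
    return fused

# Two detectors fire on the same object; one spurious box elsewhere.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (200, 200, 240, 240)]
fused = fuse_detections(boxes, [0.9, 0.6, 0.3])
```

Unlike non-maximum suppression, which keeps only the highest-scoring box, the weighted combination lets agreeing detectors refine each other's localization, which is the effect the fusion exploits.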