
Tobias Gehrig

Researcher at Karlsruhe Institute of Technology

Publications - 30
Citations - 753

Tobias Gehrig is an academic researcher from Karlsruhe Institute of Technology. The author has contributed to research in topics including the extended Kalman filter and word error rate. The author has an h-index of 15 and has co-authored 30 publications receiving 712 citations.

Papers
Proceedings Article

A joint particle filter for audio-visual speaker tracking

TL;DR: This paper uses features from multiple cameras and microphones and processes them in a joint particle filter framework, which performs sampled projections of 3D location hypotheses and scores them using features from both audio and video.
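The sketch below illustrates the kind of joint scoring the abstract describes: 3D location hypotheses (particles) are projected into each camera view and weighted with both a video likelihood and an audio (TDOA-based) likelihood. It is not the authors' code; the camera model, likelihood functions, motion model, and noise parameters are illustrative assumptions.

```python
# Minimal joint audio-visual particle filter sketch (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)

def project(P, x):
    """Project a 3D point x into a camera with 3x4 projection matrix P."""
    xh = P @ np.append(x, 1.0)
    return xh[:2] / xh[2]

def video_likelihood(pix, detection, sigma=20.0):
    """Gaussian score of a projected hypothesis against a 2D face detection."""
    return np.exp(-np.sum((pix - detection) ** 2) / (2 * sigma ** 2))

def audio_likelihood(x, mic_pairs, observed_tdoas, c=343.0, sigma=1e-4):
    """Gaussian score of the TDOAs predicted for x against the observed ones."""
    score = 1.0
    for (m1, m2), tau in zip(mic_pairs, observed_tdoas):
        pred = (np.linalg.norm(x - m1) - np.linalg.norm(x - m2)) / c
        score *= np.exp(-(pred - tau) ** 2 / (2 * sigma ** 2))
    return score

def joint_pf_step(particles, weights, cameras, detections, mic_pairs, tdoas):
    """One filter step: propagate, score with audio and video, resample."""
    # Random-walk motion model (an assumption for this sketch).
    particles = particles + rng.normal(scale=0.05, size=particles.shape)
    for i, x in enumerate(particles):
        w = 1.0
        for P, det in zip(cameras, detections):      # one term per camera view
            w *= video_likelihood(project(P, x), det)
        w *= audio_likelihood(x, mic_pairs, tdoas)   # one joint audio term
        weights[i] = w
    weights = weights / weights.sum()
    # Resample according to the joint weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```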
Proceedings Article

Kalman filters for time delay of arrival-based source localization

TL;DR: In this paper, a Kalman filter is employed to update the speaker's position estimate directly from the observed time delays of arrival (TDOAs): the TDOA estimates serve as the observations of an extended Kalman filter whose state corresponds to the speaker's position.
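A minimal sketch of that idea follows: an extended Kalman filter whose state is the 3D speaker position and whose observation vector holds one TDOA per microphone pair. The motion model, noise covariances, and microphone geometry are illustrative assumptions, not the paper's configuration.

```python
# EKF with speaker position as state and TDOAs as observations (sketch).
import numpy as np

C = 343.0  # speed of sound in m/s

def tdoa_model(x, mic_pairs):
    """Predicted TDOAs h(x) and Jacobian H = dh/dx for speaker position x."""
    h = np.zeros(len(mic_pairs))
    H = np.zeros((len(mic_pairs), 3))
    for k, (m1, m2) in enumerate(mic_pairs):
        d1, d2 = x - m1, x - m2
        r1, r2 = np.linalg.norm(d1), np.linalg.norm(d2)
        h[k] = (r1 - r2) / C
        H[k] = (d1 / r1 - d2 / r2) / C
    return h, H

def ekf_step(x, P, z, mic_pairs, q=1e-3, r=1e-8):
    """One predict/update cycle: random-walk prediction, TDOA correction."""
    # Predict (random-walk motion model, an assumption for the sketch).
    P = P + q * np.eye(3)
    # Update with the observed TDOA vector z.
    h, H = tdoa_model(x, mic_pairs)
    S = H @ P @ H.T + r * np.eye(len(z))
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - h)
    P = (np.eye(3) - K @ H) @ P
    return x, P
```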
Journal Article

Kalman filters for time delay of arrival-based source localization

TL;DR: The proposed algorithm, although relying on an iterative optimization scheme, proved efficient enough for real-time operation and provides source localization accuracy superior to the standard spherical and linear intersection techniques.
Proceedings Article

Kalman filters for audio-video source localization

TL;DR: This work proposes an algorithm to incorporate detected face positions from different camera views into the Kalman filter without any explicit triangulation, which yields a robust source localizer that functions reliably both for segments in which the speaker is silent, which would be detrimental for an audio-only tracker, and for segments in which many faces appear, which would confuse a video-only tracker.
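The sketch below shows one way to read "without explicit triangulation": the camera projection itself is used as the measurement function, so a 2D face detection in any single view corrects the 3D state directly in the EKF update. The pinhole projection model and noise values are assumptions for illustration, not the paper's implementation.

```python
# Folding a 2D face detection into a 3D EKF update via the projection model (sketch).
import numpy as np

def project_with_jacobian(P, x):
    """Pinhole projection of 3D point x and its 2x3 Jacobian."""
    xh = P @ np.append(x, 1.0)          # homogeneous image point (u, v, w)
    u, v, w = xh
    pix = np.array([u / w, v / w])
    # Quotient rule applied to the linear map P gives the image Jacobian.
    J = np.vstack([(P[0, :3] * w - P[2, :3] * u) / w ** 2,
                   (P[1, :3] * w - P[2, :3] * v) / w ** 2])
    return pix, J

def face_update(x, Pcov, detection, P_cam, r=4.0):
    """EKF correction of the 3D position x from one 2D face detection."""
    pix, H = project_with_jacobian(P_cam, x)
    S = H @ Pcov @ H.T + r * np.eye(2)
    K = Pcov @ H.T @ np.linalg.inv(S)
    x = x + K @ (detection - pix)
    Pcov = (np.eye(3) - K @ H) @ Pcov
    return x, Pcov
```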
Proceedings Article

Multi-view facial expression recognition using local appearance features

TL;DR: This paper presents a multi-view facial expression classification system that utilizes local features extracted around automatically located facial landmarks using pose-dependent active appearance models, and evaluates the influence of AAM fitting errors, F-score feature selection, and expression intensity levels on classification accuracy.
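A rough sketch of the pipeline the abstract outlines is given below: crop local patches around facial landmarks (assumed already located, e.g. by a pose-dependent AAM fit), rank the resulting features with an F-score criterion, and train an expression classifier on the selected features. The patch size, the particular F-score variant, and the SVM classifier are illustrative assumptions rather than the paper's exact setup.

```python
# Local appearance features around landmarks + F-score selection + SVM (sketch).
import numpy as np
from sklearn.svm import SVC

def local_patches(image, landmarks, half=8):
    """Concatenate (2*half)x(2*half) patches around each (row, col) landmark."""
    feats = []
    for r, c in landmarks:
        patch = image[r - half:r + half, c - half:c + half]
        feats.append(patch.ravel())
    return np.concatenate(feats)

def f_scores(X, y):
    """One common F-score: between-class spread over within-class variance."""
    overall = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += (Xc.mean(axis=0) - overall) ** 2
        den += Xc.var(axis=0)
    return num / (den + 1e-12)

def train_expression_classifier(X, y, keep=500):
    """Select the top-'keep' features by F-score and fit an SVM."""
    idx = np.argsort(f_scores(X, y))[::-1][:keep]
    clf = SVC(kernel="rbf").fit(X[:, idx], y)
    return clf, idx
```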