
Showing papers presented at "British Machine Vision Conference in 2006"


Proceedings ArticleDOI
01 Jan 2006
TL;DR: A novel on-line AdaBoost feature selection algorithm for tracking that adapts the classifier while tracking the object and selects the most discriminating features for tracking, resulting in stable tracking results.
Abstract: Very recently, tracking has been approached using classification techniques such as support vector machines. The object to be tracked is discriminated from the background by a classifier. In a similar spirit, we propose a novel on-line AdaBoost feature selection algorithm for tracking. The distinct advantage of our method is its capability of on-line training. This allows the classifier to be adapted while tracking the object. Therefore, appearance changes of the object (e.g. out-of-plane rotations, illumination changes) are handled quite naturally. Moreover, depending on the background, the algorithm selects the most discriminating features for tracking, resulting in stable tracking results. By using fast computable features (e.g. Haar-like wavelets, orientation histograms, local binary patterns) the algorithm runs in real-time. We demonstrate the performance of the algorithm on several (publicly available) video sequences.

1,305 citations
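The on-line boosting idea in the abstract above — maintain a pool of weak classifiers, update their error estimates from each new object/background sample, and keep the most discriminating ones — can be sketched minimally. Everything here (threshold-test weak learners, synthetic feature vectors) is an illustrative stand-in, not the authors' Haar-feature implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pool of toy "weak classifiers": threshold tests on single feature dimensions
# (hypothetical stand-ins for the Haar-like / histogram features in the paper).
def weak_predict(x, dim, thresh, polarity):
    return polarity if x[dim] > thresh else -polarity

pool = [(d, rng.uniform(-1.0, 1.0), int(rng.choice([-1, 1])))
        for d in range(8) for _ in range(4)]
err = np.full(len(pool), 0.5)     # running error estimate per weak learner
count = np.ones(len(pool))

# On-line update: each labelled sample (object patch +1 vs background -1)
# refines the error estimates; the most discriminating learners are then kept.
for _ in range(200):
    y = int(rng.choice([-1, 1]))
    x = rng.normal(loc=0.8 * y, scale=1.0, size=8)   # synthetic feature vector
    for i, (d, t, p) in enumerate(pool):
        wrong = float(weak_predict(x, d, t, p) != y)
        count[i] += 1
        err[i] += (wrong - err[i]) / count[i]         # running mean of errors

selected = np.argsort(err)[:5]    # indices of the 5 most discriminating features
print(np.round(err[selected], 3))
```

Because the error estimates are running means, the selection adapts as the appearance of object and background drifts, which is the property the paper exploits for stable tracking.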


Proceedings ArticleDOI
01 Jan 2006
TL;DR: The approach is not only able to classify different actions, but also to localize them simultaneously in a novel and complex video sequence.
Abstract: We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA). Our approach can handle noisy feature points arising from dynamic backgrounds and moving cameras due to the application of the probabilistic models. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.

927 citations
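The bag-of-spatial-temporal-words representation that the pLSA/LDA models operate on can be sketched with a plain k-means codebook. The descriptors below are synthetic stand-ins; the paper extracts real spatio-temporal cuboid descriptors at space-time interest points:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for space-time interest point descriptors from 3 videos
# (the paper uses real cuboid descriptors; these are Gaussian blobs).
videos = [rng.normal(loc=c, scale=0.3, size=(50, 4)) for c in (0.0, 1.0, 2.0)]

# Build a codebook of "spatial-temporal words" with a few rounds of plain k-means.
all_desc = np.vstack(videos)
K = 6
centers = all_desc[rng.choice(len(all_desc), K, replace=False)]
for _ in range(10):
    labels = ((all_desc[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    for k in range(K):
        if np.any(labels == k):
            centers[k] = all_desc[labels == k].mean(axis=0)

# Each video becomes a normalised bag-of-words histogram, the "document"
# representation that the topic models (pLSA/LDA) then decompose into topics.
def bow(desc):
    lab = ((desc[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    h = np.bincount(lab, minlength=K).astype(float)
    return h / h.sum()

hists = np.array([bow(v) for v in videos])
print(hists.shape)
```

The topic-model step itself (EM for pLSA, or variational inference for LDA) then treats each histogram row as a document over the K-word vocabulary.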


Proceedings ArticleDOI
01 Jan 2006
TL;DR: This work shows that when applied to human faces, the constrained local model (CLM) algorithm is more robust and more accurate than the original AAM search method, which relies on the image reconstruction error to update the model parameters.
Abstract: We present an efficient and robust model matching method which uses a joint shape and texture appearance model to generate a set of region template detectors. The model is fitted to an unseen image in an iterative manner by generating templates using the joint model and the current parameter estimates, correlating the templates with the target image to generate response images and optimising the shape parameters so as to maximise the sum of responses. The appearance model is similar to that used in the Active Appearance Model due to Cootes et al. However in our approach the appearance model is used to generate likely feature templates, instead of trying to approximate the image pixels directly. We show that when applied to human faces, our constrained local model (CLM) algorithm is more robust and more accurate than the original AAM search method, which relies on the image reconstruction error to update the model parameters. We demonstrate improved localisation accuracy on two publicly available face data sets and improved tracking on a challenging set of in-car face sequences.

802 citations


Proceedings ArticleDOI
01 Jan 2006
TL;DR: It is demonstrated that high precision can be achieved by combining multiple sources of information, both visual and textual, including the automatic generation of time-stamped character annotation by aligning subtitles and transcripts.
Abstract: We investigate the problem of automatically labelling appearances of characters in TV or film material. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying when characters are speaking; (iii) using complementary cues of face matching and clothing matching to propose common annotations for face tracks. Results are presented on episodes of the TV series “Buffy the Vampire Slayer”.

683 citations


Proceedings ArticleDOI
01 Jan 2006
TL;DR: This paper presents a belief propagation based global algorithm that generates high quality results while maintaining real-time performance, and is the first BP based global method that runs at real- time speed.
Abstract: In this paper, we present a belief propagation based global algorithm that generates high quality results while maintaining real-time performance. To our knowledge, it is the first BP based global method that runs at real-time speed. Our efficiency gains come mainly from the parallelism of graphics hardware, which leads to a 45 times speedup compared to the CPU implementation. To quantify the accuracy of our approach, the experimental results are evaluated on the Middlebury data sets, showing that our approach is among the best (ranked first in the new evaluation system) of all real-time approaches. In addition, since the running time of general BP is linear in the number of iterations, adopting a large number of iterations is not feasible for practical applications. Hence a novel approach is proposed to adaptively update pixel costs. Unlike general BP methods, the running time of our proposed algorithm converges rapidly.

280 citations
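The global formulation can be illustrated with exact min-sum belief propagation on a single scanline (a 1-D chain), where forward and backward messages yield per-pixel min-marginals; the paper's contribution is full 2-D BP with adaptive cost updates on the GPU. All costs below are synthetic:

```python
import numpy as np

np.random.seed(6)
W, D = 12, 4                          # pixels on the scanline, disparity levels
true_disp = np.array([1] * 6 + [3] * 6)
data = np.abs(np.arange(D)[None, :] - true_disp[:, None]).astype(float)
data += 0.1 * np.random.rand(W, D)    # noisy unary matching costs
lam = 0.5                             # smoothness weight (truncated linear)

def smooth(d1, d2):
    return lam * min(abs(d1 - d2), 2)

# Forward/backward min-sum message passing along the chain (exact on a chain).
fwd = np.zeros((W, D))
bwd = np.zeros((W, D))
for x in range(1, W):
    prev = data[x - 1] + fwd[x - 1]
    fwd[x] = [min(prev[d2] + smooth(d2, d) for d2 in range(D)) for d in range(D)]
for x in range(W - 2, -1, -1):
    nxt = data[x + 1] + bwd[x + 1]
    bwd[x] = [min(nxt[d2] + smooth(d2, d) for d2 in range(D)) for d in range(D)]

belief = data + fwd + bwd             # min-marginal per pixel and disparity
disp = belief.argmin(axis=1)
print(disp.tolist())
```

On a chain this two-pass scheme is exact; on the 2-D grid used for stereo, BP iterates these message updates, which is why the per-iteration cost (and the adaptive pixel-cost update) matters.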


Proceedings ArticleDOI
01 Jan 2006
TL;DR: This work describes how straight lines can be added to a monocular Extended Kalman Filter Simultaneous Mapping and Localisation (EKF SLAM) system in a manner that is both fast and which integrates easily with point features.
Abstract: The use of line features in real-time visual tracking applications is commonplace when a prior map is available, but building the map while tracking in real-time is much more difficult. We describe how straight lines can be added to a monocular Extended Kalman Filter Simultaneous Mapping and Localisation (EKF SLAM) system in a manner that is both fast and which integrates easily with point features. To achieve real-time operation, we present a fast straight-line detector that hypothesises and tests straight lines connecting detected seed points. We demonstrate that the resulting system provides good camera localisation and mapping in real-time on a standard workstation, using either line features alone, or lines and points combined.

260 citations


Proceedings ArticleDOI
01 Jan 2006
TL;DR: A method for object detection that combines AdaBoost learning with local histogram features that outperforms all methods reported in [5] for 7 out of 8 detection tasks and four object classes.
Abstract: We present a method for object detection that combines AdaBoost learning with local histogram features. On the side of learning we improve the performance by designing a weak learner for multi-valued features based on Weighted Fisher Linear Discriminant. Evaluation on the recent benchmark for object detection confirms the superior performance of our method compared to the state-of-the-art. In particular, using a single set of parameters our approach outperforms all methods reported in [5] for 7 out of 8 detection tasks and four object classes.

190 citations


Proceedings ArticleDOI
01 Jan 2006
TL;DR: This paper demonstrates a real-time, full-3D edge tracker based on a particle filter that exploits graphics hardware in a novel manner, allowing it not only to perform hidden line removal for each particle but also to evaluate pose likelihoods directly on the graphics card.
Abstract: This paper demonstrates a real-time, full-3D edge tracker based on a particle filter. In contrast to previous methods this system is capable of tracking complex self-occluding three-dimensional structures. The system exploits graphics hardware in a novel manner, allowing it not only to perform hidden line removal for each particle but also to evaluate pose likelihoods directly on the graphics card. This approach allows video-rate filtering with hundreds of particles on a standard workstation.

124 citations
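The backbone of such a tracker is the generic particle filter loop: predict each pose hypothesis with a motion model, weight it by an observation likelihood, and resample. In this sketch the likelihood is a toy 1-D Gaussian; in the paper the expensive part, the edge-based pose likelihood with hidden line removal, is evaluated per particle on the graphics card:

```python
import numpy as np

rng = np.random.default_rng(7)

N = 500
particles = rng.normal(0.0, 1.0, N)   # initial 1-D pose hypotheses (toy state)
true_pose = 0.0

for t in range(30):
    true_pose += 0.1                                    # object moves
    particles += 0.1 + rng.normal(0.0, 0.05, N)         # predict (motion model)
    obs = true_pose + rng.normal(0.0, 0.1)              # noisy measurement
    w = np.exp(-0.5 * ((particles - obs) / 0.1) ** 2)   # likelihood weights
    w /= w.sum()
    particles = particles[rng.choice(N, size=N, p=w)]   # resample

est = particles.mean()
print(round(est, 3), round(true_pose, 3))
```

Because likelihood evaluation dominates the cost for hundreds of particles, moving that step onto graphics hardware is what makes video-rate 3-D edge tracking feasible.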


Proceedings ArticleDOI
01 Jan 2006
TL;DR: The proposed method is based on both global statistics of geometrical features and local statistics of correlative features of facial surfaces, and the combination of the two is shown to improve recognition performance.
Abstract: In this paper, we present a new method for face recognition using range data. The proposed method is based on both global statistics of geometrical features and local statistics of correlative features of facial surfaces. Firstly, we analyze the performance of common geometrical representations by using global histograms for matching. Secondly, we propose a new method to encode the relationships between points and their neighbors, which are demonstrated to have great power to represent the intrinsic structure of facial surfaces. Finally, the two kinds of features are expected to be complementary to some extent, and their combination is shown to improve the recognition performance. All the experiments are performed on the full 3D face dataset of FRGC 2.0, which is the largest 3D face database so far. Promising results have demonstrated the effectiveness of our proposed method.

123 citations


Proceedings ArticleDOI
01 Sep 2006
TL;DR: This paper proposes an efficient algorithm for hierarchical agglomerative clustering and proposes a method for building data structures for fast matching in high dimensional feature spaces.
Abstract: In this paper we address the problem of building object class representations based on local features and fast matching in a large database. We propose an efficient algorithm for hierarchical agglomerative clustering. We examine different agglomerative and partitional clustering strategies and compare the quality of the obtained clusters. Our combination of partitional-agglomerative clustering gives a significant improvement in efficiency while maintaining the same quality of clusters. We also propose a method for building data structures for fast matching in high dimensional feature spaces. These improvements allow us to deal with the large sets of training data typically used in recognition of multiple object classes.

114 citations
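Plain agglomerative clustering, the building block the paper accelerates, can be sketched naively. This O(n³) average-linkage version on toy 2-D points is for illustration only; the paper's point is precisely that an efficient partitional-agglomerative combination is needed once the descriptor sets get large:

```python
import numpy as np

rng = np.random.default_rng(5)

# Three well-separated toy blobs standing in for local feature descriptors.
pts = np.vstack([rng.normal(c, 0.1, (10, 2)) for c in ((0, 0), (3, 0), (0, 3))])
clusters = [[i] for i in range(len(pts))]

def avg_dist(a, b):
    # Average linkage: mean pairwise distance between two clusters.
    return np.mean([np.linalg.norm(pts[i] - pts[j]) for i in a for j in b])

# Repeatedly merge the closest pair of clusters until 3 remain.
while len(clusters) > 3:
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

sizes = sorted(len(c) for c in clusters)
print(sizes)
```

Each merge rescans all cluster pairs, which is what makes the naive algorithm cubic; partitioning the data first (as the paper does) shrinks the pair sets the agglomerative stage has to scan.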


Proceedings ArticleDOI
01 Jan 2006
TL;DR: A method to detect malaria parasites in images acquired from Giemsa-stained peripheral blood samples using conventional light microscopes is described and achieves 74% sensitivity, 98% specificity, 88% positive prediction, and 95% negative prediction values for the parasite detection.
Abstract: This paper investigates the possibility of computerised diagnosis of malaria and describes a method to detect malaria parasites (Plasmodium spp) in images acquired from Giemsa-stained peripheral blood samples using conventional light microscopes. Prior to processing, the images are transformed to match the colour characteristics of a reference image. The parasite detector utilises a Bayesian pixel classifier to mark stained pixels. The class conditional probability density functions of the stained and the non-stained classes are estimated using the non-parametric histogram method. The stained pixels are further processed to extract features (histogram, Hu moments, relative shape measurements, colour auto-correlogram) for a parasite/non-parasite classifier. A distance weighted K-nearest neighbour classifier is trained with the extracted features and a detailed performance comparison is presented. Our method achieves 74% sensitivity, 98% specificity, 88% positive prediction, and 95% negative prediction values for the parasite detection.
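The final classification stage, a distance-weighted K-nearest-neighbour classifier, can be sketched on synthetic two-class features (the real inputs would be the histogram, Hu-moment, shape, and auto-correlogram features described above):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-ins for non-parasite (class 0) and parasite (class 1) features.
X_train = np.vstack([rng.normal(0.0, 1.0, (40, 3)),
                     rng.normal(2.5, 1.0, (40, 3))])
y_train = np.array([0] * 40 + [1] * 40)

def knn_predict(x, k=5):
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]              # the k nearest training samples
    w = 1.0 / (d[idx] + 1e-9)            # closer neighbours carry more weight
    votes = np.bincount(y_train[idx], weights=w, minlength=2)
    return int(np.argmax(votes))

pred0 = knn_predict(np.array([0.0, 0.0, 0.0]))
pred1 = knn_predict(np.array([2.5, 2.5, 2.5]))
print(pred0, pred1)
```

Distance weighting makes the vote robust when the k neighbours straddle the class boundary, which matters for the borderline stained objects the paper has to classify.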

Proceedings ArticleDOI
01 Jan 2006
TL;DR: The results show that people can be re-detected in images where they do not face the camera, and two extensions improving the pictorial structure detections are described.
Abstract: The goal of this work is to find all occurrences of a particular person in a sequence of photographs taken over a short period of time. For identification, we assume each individual’s hair and clothing stays the same throughout the sequence. Even with these assumptions, the task remains challenging as people can move around, change their pose and scale, and partially occlude each other. We propose a two stage method. First, individuals are identified by clustering frontal face detections using color clothing information. Second, a color based pictorial structure model is used to find occurrences of each person in images where their frontal face detection was missed. Two extensions improving the pictorial structure detections are also described. In the first extension, we obtain a better clothing segmentation to improve the accuracy of the clothing color model. In the second extension, we simultaneously consider multiple detection hypotheses of all people potentially present in the shot. Our results show that people can be re-detected in images where they do not face the camera. Results are presented on several sequences from a personal photo collection.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: This work presents a method for precise eye localization that uses two Support Vector Machines trained on properly selected Haar wavelet coefficients and studies the strong correlation between the eye localization error and the face recognition rate.
Abstract: We present a method for precise eye localization that uses two Support Vector Machines trained on properly selected Haar wavelet coefficients. The evaluation of our technique on many standard databases exhibits very good performance. Furthermore, we study the strong correlation between the eye localization error and the face recognition rate.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: This work defines a well-localised edge landmark, presents an efficient algorithm for selecting such landmarks, and describes how to initialise new landmarks, observe mapped landmarks in subsequent images, and deal with the data association challenges of edges.
Abstract: While many visual simultaneous localization and mapping (SLAM) systems use point features as landmarks, few take advantage of the edge information in images. Those SLAM systems that do observe edge features do not consider edges with all degrees of freedom. Edges are difficult to use in vision SLAM because of selection, observation, initialization and data association challenges. A map that includes edge features, however, contains higher-order geometric information useful both during and after SLAM. We define a well-localized edge landmark and present an efficient algorithm for selecting such landmarks. Further, we describe how to initialize new landmarks, observe mapped landmarks in subsequent images, and address the data association challenges of edges. Our methods, implemented in a particle-filter SLAM system, operate at frame rate on live video sequences.

Proceedings Article
01 Jan 2006
TL;DR: An algorithm to jointly estimate groupwise geometric and photometric transformations while preserving the efficient pre-computation based design of the original inverse compositional algorithm is proposed, which shows clear improvements in computational efficiency and in terms of convergence.
Abstract: Image registration consists in estimating geometric and photometric transformations that align two images as best as possible. The direct approach consists in minimizing the discrepancy in the intensity or color of the pixels. The inverse compositional algorithm has been recently proposed for the direct estimation of groupwise geometric transformations. It is efficient in that it performs several computationally expensive calculations at a pre-computation phase. We propose the dual inverse compositional algorithm which deals with groupwise geometric and photometric transformations, the latter acting on the value of the pixels. Our algorithm preserves the efficient pre-computation based design of the original inverse compositional algorithm. Previous attempts at incorporating photometric transformations to the inverse compositional algorithm spoil this property. We demonstrate our algorithm on simulated and real data and show the improvement in computational efficiency compared to previous algorithms.


Proceedings ArticleDOI
01 Jan 2006
TL;DR: A new method for providing insensitivity to expression variation in range images based on Log-Gabor Templates is presented; by decomposing a single image of a subject into 147 observations it achieves high accuracy even in the presence of occlusions, distortions and facial expressions.
Abstract: The use of Three Dimensional (3D) data allows new facial recognition algorithms to overcome factors such as pose and illumination variations which have plagued traditional 2D face recognition. In this paper a new method for providing insensitivity to expression variation in range images based on Log-Gabor Templates is presented. By decomposing a single image of a subject into 147 observations, the reliance of the algorithm upon any particular part of the face is relaxed, allowing high accuracy even in the presence of occlusions, distortions and facial expressions. Using the 3D database collected by the University of Notre Dame for the Face Recognition Grand Challenge (FRGC), benchmarking results are presented showing superior performance of the proposed method. Comparisons showing the relative strength of the algorithm against two commercial and two academic 3D face recognition algorithms are also presented.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: A novel Bayesian approach to modelling temporal transitions of facial expressions represented in a manifold is proposed, resulting in both superior recognition rates and improved robustness compared with static frame-based recognition methods.
Abstract: In this paper, we propose a novel Bayesian approach to modelling temporal transitions of facial expressions represented in a manifold, with the aim of dynamic facial expression recognition in image sequences. A generalised expression manifold is derived by embedding image data into a low dimensional subspace using Supervised Locality Preserving Projections. A Bayesian temporal model is formulated to capture the dynamic facial expression transitions in the manifold. Our experimental results demonstrate the advantages gained from explicitly exploiting temporal information in expression image sequences, resulting in both superior recognition rates and improved robustness compared with static frame-based recognition methods.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: This work demonstrates that their method outperforms, in terms of sub-pixel accuracy, not only other surface fitting techniques but also the state-of-the-art in motion estimation using phase correlation including the technique that motivated the work in the first place.
Abstract: We propose a method for obtaining high-accuracy sub-pixel motion estimates using phase correlation. Our method is motivated by recently published analysis according to which the Fourier inverse of the normalized cross-power spectrum of pairs of images which have been mutually shifted by a fractional amount can be approximated by a two-dimensional sinc function. We use a modified version of such a function to obtain a sub-pixel estimate of motion by means of variable-separable fitting in the vicinity of the maximum peak of the phase correlation surface. We demonstrate that our method outperforms, in terms of sub-pixel accuracy, not only other surface fitting techniques but also the state-of-the-art in motion estimation using phase correlation including the technique that motivated our work in the first place. Furthermore our method performs particularly well in the presence of artificially induced additive white Gaussian noise and also offers better motion vector coherence in terms of zero-order entropy.
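The phase correlation baseline the paper refines works as follows for integer shifts: normalise the cross-power spectrum of the two images and locate the peak of its inverse Fourier transform. The sub-pixel contribution of the paper, fitting a sinc-like surface around that peak, is omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two images related by a known integer circular shift.
base = rng.random((64, 64))
shift = (5, 9)
moved = np.roll(base, shift, axis=(0, 1))

F1, F2 = np.fft.fft2(base), np.fft.fft2(moved)
cross = F1 * np.conj(F2)
cross /= np.abs(cross) + 1e-12          # normalised cross-power spectrum
surface = np.fft.ifft2(cross).real      # phase correlation surface
peak = np.unravel_index(np.argmax(surface), surface.shape)
# The delta peak appears at minus the shift (modulo the image size).
est = tuple(int((-p) % s) for p, s in zip(peak, surface.shape))
print(est)
```

For a fractional shift the delta peak spreads into the two-dimensional sinc-like surface mentioned in the abstract, and the sub-pixel estimate comes from variable-separable fitting in the neighbourhood of the maximum.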

Proceedings ArticleDOI
01 Jan 2006
TL;DR: A novel technique for human head tracking that uses cascaded classifiers based on AdaBoost and Haar-like features for hypothesis evaluation, with multiple classifiers each trained to detect one direction of a human head.
Abstract: We propose a method for real-time people tracking using multiple cameras. The particle filter framework is known to be effective for tracking people, but most existing methods adopt only simple perceptual cues such as color histogram or contour similarity for hypothesis evaluation. To improve the robustness and accuracy of tracking, more sophisticated hypothesis evaluation is indispensable. We therefore present a novel technique for human head tracking using cascaded classifiers based on AdaBoost and Haar-like features for hypothesis evaluation. In addition, we use multiple classifiers, each trained to detect one direction of a human head. During real-time tracking the most suitable classifier is adaptively selected by considering each hypothesis and the known camera position. Our experimental results demonstrate the effectiveness and robustness of our method.

Proceedings ArticleDOI
Jesse Hoey1
01 Jan 2006
TL;DR: The method for tracking in the presence of distractors, changes in shape, and occlusions is described and is applied to an assistive system that tracks the hands and the towel during a handwashing task.
Abstract: This paper describes a method for tracking in the presence of distractors, changes in shape, and occlusions. An object is modeled as a flock of features describing its approximate shape. The flock's dynamics keep it spatially localised and moving in concert, but also well distributed across the object being tracked. A recursive Bayesian estimation of the density of the object is approximated with a set of samples. The method is demonstrated on two simple examples, and is applied to an assistive system that tracks the hands and the towel during a handwashing task.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: A novel volumetric reconstruction technique that combines shape-from-silhouette with stereo photo-consistency in a global optimisation that enforces feature constraints across multiple views is presented.
Abstract: This paper presents a novel volumetric reconstruction technique that combines shape-from-silhouette with stereo photo-consistency in a global optimisation that enforces feature constraints across multiple views. Human shape reconstruction is considered where extended regions of uniform appearance, complex self-occlusions and sparse feature cues represent a challenging problem for conventional reconstruction techniques. A unified approach is introduced to first reconstruct the occluding contours and left-right consistent edge contours in a scene and then incorporate these contour constraints in a global surface optimisation using graph-cuts. The proposed technique maximises photo-consistency on the surface, while satisfying silhouette constraints to provide shape in the presence of uniform surface appearance and edge feature constraints to align key image features across views.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: This paper presents an approach using local kernel histograms and contour-based features; the local kernel histograms retain the advantages of conventional histograms while avoiding their inherent drawbacks.
Abstract: The constant background hypothesis underlying background subtraction algorithms is often not applicable in real environments because of shadows, reflections, or small moving objects in the background: flickering screens in indoor scenes, or waving vegetation in outdoor ones. In both indoor and outdoor scenes, the use of color cues for background segmentation is limited by illumination variations when lights are switched or the weather changes. This problem can be partially alleviated using robust color coordinates or background update algorithms, but an important part of the color information is lost by the former solution and the latter is often too specialized to cope with most real-environment constraints. This paper presents an approach using local kernel histograms and contour-based features. Local kernel histograms retain the advantages of conventional histograms while avoiding their inherent drawbacks. Contour-based features are more robust than color features with regard to scene illumination variations. The performance of the proposed algorithm is demonstrated in the experimental results using test scenes involving strong illumination variations and non-static backgrounds.

Proceedings ArticleDOI
04 Sep 2006
TL;DR: This work proposes to embed visual vocabulary creation within object model construction, making the vocabulary better suited to object class discrimination, and shows that the proposed model outperforms approaches that do not learn such an adapted visual vocabulary.
Abstract: The visual vocabulary is an intermediate level representation which has been proven to be very powerful for addressing object categorization problems. It is generally built by vector quantizing a set of local image descriptors, independently of the object model used for categorizing images. We propose here to embed the visual vocabulary creation within the object model construction, making it better suited for object class discrimination. We experimentally show that the proposed model outperforms approaches not learning such an adapted visual vocabulary.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: A simple method of updating a representation of the Jacobian as the search progresses is described, which allows the AAM to be tuned to the current example and is particularly powerful when tracking objects through sequences.
Abstract: Active Appearance Models [5] are widely used to match statistical models of shape and appearance to new images rapidly. They work by finding model parameters which minimise the sum of squares of residual differences between model and target image. Their efficiency is achieved by pre-computing the Jacobian describing how the residuals are expected to change as the parameters vary. This leads to a method of predicting the position of the minima based on a single measurement of the residuals (though in practice the algorithm is iterated to refine the estimate). However, the estimate of the Jacobian from the training set will only be an approximation for any given target image, and may be a poor one if the target image is significantly different from the training images. This paper describes a simple method of updating a representation of the Jacobian as the search progresses. This allows us to tune the AAM to the current example. Though useful for matching to a single image, it is particularly powerful when tracking objects through sequences, as it gives a method of tuning the AAM as the search progresses. We demonstrate the power of the technique on a variety of datasets.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: This paper proposes a compact colour descriptor, which it is called Wiccest, requiring only 12 numbers to locally capture colour and texture information, and demonstrates the features to be applicable to highly compressed images while retaining discriminative power.
Abstract: Much emphasis has recently been placed on the detection and recognition of locally (weak) affine invariant region descriptors for object recognition. In this paper, we take recognition one step further by developing features for non-planar objects. We consider the description of objects with locally smoothly varying surface. For this class of objects, colour invariant histogram matching has proven to be very encouraging. However, matching many local colour cubes is computationally demanding. We propose a compact colour descriptor, which we call Wiccest, requiring only 12 numbers to locally capture colour and texture information. The Wiccest features are shown to be fairly insensitive to photometric effects like shadow, shading, and illumination colour. Moreover, we demonstrate the features to be applicable to highly compressed images while retaining discriminative power.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: This paper reviews existing weighting schemes and considers their noise properties, and a minimum-variance solution is introduced which exploits a camera noise model.
Abstract: A method for capturing high intensity dynamic range scenes with a low dynamic range camera consists in taking a series of images with different exposure settings and combining these into a single high dynamic range image. The combined image values are found by weighted averaging of values from the differently exposed images on a per-pixel basis. This paper reviews existing weighting schemes and considers their noise properties. Furthermore, a minimum-variance solution is introduced which exploits a camera noise model. Special emphasis is on the case when the camera is linear. A method is given for estimating the uncertainty of the combined image values. The results are validated experimentally.
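The per-pixel weighted-averaging scheme the paper analyses can be sketched with a simple hat-shaped weight function on a synthetic linear camera. The hat weight is one of the common heuristics the paper reviews; its contribution is replacing such heuristics with minimum-variance weights derived from a camera noise model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic true scene radiance and three exposure settings.
radiance = rng.uniform(0.0, 4.0, size=(16, 16))
exposures = [0.25, 1.0, 4.0]

def capture(exp):
    # Linear camera: pixel value proportional to radiance, clipped at 1.
    return np.clip(radiance * exp, 0.0, 1.0)

def weight(z):
    # Hat function: trust mid-range values, distrust near-black/near-saturated.
    return np.maximum(1e-6, 1.0 - np.abs(2.0 * z - 1.0))

num = np.zeros_like(radiance)
den = np.zeros_like(radiance)
for exp in exposures:
    z = capture(exp)
    w = weight(z)
    num += w * (z / exp)     # back-project each pixel value to radiance
    den += w
hdr = num / den              # per-pixel weighted average

err = np.abs(hdr - radiance).mean()
print(round(err, 4))
```

A minimum-variance weighting would replace `weight` with weights inversely proportional to the variance of each back-projected measurement under the camera noise model, which is the solution the paper derives.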

Proceedings ArticleDOI
01 Jan 2006
TL;DR: It is shown that under the assumption that people walk with a constant velocity, calibration performance can be improved significantly and the incorporation of temporal data helps to take correlations between subsequent detections into consideration, which leads to an up-front reduction of the noise in the measurements and an overall improvement in auto-calibration performance.
Abstract: It has been shown that under a small number of assumptions, observations of people can be used to obtain metric calibration information for a camera, which is particularly useful for surveillance applications. However, previous work had to exclude the common critical configuration of the camera's principal point falling on the horizon line, as well as very long focal lengths, both of which occur commonly in practice. Due to noise, the quality of the calibration quickly degrades at and in the vicinity of these configurations. This paper provides a robust solution to this problem by incorporating information about the motion of people into the estimation process. It is shown that under the assumption that people walk with a constant velocity, calibration performance can be improved significantly. In addition to solving the above problem, the incorporation of temporal data also helps to take correlations between subsequent detections into consideration, which leads to an up-front reduction of the noise in the measurements and an overall improvement in auto-calibration performance.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: This work proposes a novel photoflux functional for multi-view 3D reconstruction that is closely related to properties of photohulls and unifies two groups of multiview stereo techniques: “space carving” and “deformable models”.
Abstract: Our work was inspired by recent advances in image segmentation where flux-based functionals significantly improved alignment of object boundaries. We propose a novel photoflux functional for multi-view 3D reconstruction that is closely related to properties of photohulls. Our photohull prior can be combined with regularization. Thus, this work unifies two major groups of multiview stereo techniques: “space carving” and “deformable models”. Our approach combines the benefits of both groups and allows fine shape details to be recovered without oversmoothing while robustly handling noise. Photoflux provides a data-driven ballooning force that helps to segment thin structures or holes. Photoflux maximizing shapes can also be seen as regularized Laplacian zero-crossings [3]. We discuss several versions of the photoflux functional based on global, local, or non-deterministic visibility models. Some forms of photoflux can be easily added into standard regularization techniques. For other forms we propose new optimization methods.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: The main contributions of this paper lie in the integration of multiple HHMMs for recognising high-level behaviours of multiple people and the construction of the Rao-Blackwellised particle filters (RBPF) for approximate inference.
Abstract: Recognising the behaviours of multiple people, especially high-level behaviours, is an important task in surveillance systems. When the reliable assignment of people to the set of observations is unavailable, this task becomes complicated. To solve this task, we present an approach in which the hierarchical hidden Markov model (HHMM) is used for modelling the behaviour of each person and the joint probabilistic data association filter (JPDAF) is applied for data association. The main contributions of this paper lie in the integration of multiple HHMMs for recognising high-level behaviours of multiple people and the construction of Rao-Blackwellised particle filters (RBPF) for approximate inference. Preliminary experimental results in a real environment show the robustness of our integrated method in behaviour recognition and its advantage over the use of the Kalman filter in tracking people.