Book Chapter

A self-referential perceptual inference framework for video interpretation

01 Apr 2003, pp. 54-67
TL;DR: This paper presents an extensible architectural model for general content-based analysis and indexing of video data which can be customised for a given problem domain using a novel active knowledge representation methodology based on an ontological query language.
Abstract: This paper presents an extensible architectural model for general content-based analysis and indexing of video data which can be customised for a given problem domain. Video interpretation is approached as a joint inference problem which can be solved through the use of modern machine learning and probabilistic inference techniques. An important aspect of the work concerns the use of a novel active knowledge representation methodology based on an ontological query language. This representation allows one to pose the problem of video analysis in terms of queries expressed in a visual language incorporating prior hierarchical knowledge of the syntactic and semantic structure of entities, relationships, and events of interest occurring in a video sequence. Perceptual inference then takes place within an ontological domain defined by the structure of the problem and the current goal set.

Summary (4 min read)

1 Introduction

  • The content-based analysis of digital video footage requires methods which will automatically segment video sequences and key frames into image areas corresponding to salient objects (e.g. people, vehicles, background objects, etc.), track these objects in time, and provide a flexible framework for further analysis of their relative motion and interactions.
  • The resulting framework can be customised to a particular problem (e.g. tracking human beings from CCTV footage) by integrating the most appropriate low-level (e.g. facial feature extraction) and high-level (e.g. models of human motion) sources of domain-specific knowledge.
  • Belief networks are particularly suitable because they model the evolution and integration of stochastic state information over time and can be viewed as generalisations of a broad family of probabilistic models.
  • Such an analysis needs to incorporate a notion of the syntax and semantics which are seen as governing the domain of interest so that the most likely explanation of the observed data can be found.
  • Visual inference tasks can then be carried out by processing sentence structures in an appropriate ontological language.

2.1 Visual Recognition as Perceptual Inference

  • The general idea is that recognising an object or event requires one to relate ill-defined symbolic representations of concepts to concrete instances of the referenced object or behaviour pattern.
  • This idea has a relatively long heritage in syntactic approaches to pattern recognition [39,4] but interest has been revived recently in the video analysis community following the popularity and success of probabilistic methods such as Hidden Markov models (HMM) and related approaches adopted from the speech and language processing community.
  • The role of machine learning in computer vision continues to grow and recently there has been a very strong trend towards using Bayesian techniques for learning and inference, especially factorised graphical probabilistic models [23] such as Dynamic Belief networks (DBN).
  • Their application to multi-modal and data fusion [38] can utilise fusion strategies of e.g. Kalman [10] and particle filtering [20] methods.

2.2 Recognition of Actions and Structured Events

  • Over the last 15 years there has been growing interest within the computer vision and machine learning communities in the problem of analysing human behaviour in video.
  • Higher-level visual analysis of compound events has in recent years been performed on the basis of parsing techniques using a probabilistic grammar formalism.
  • The role of attentional control for video analysis was also pointed out in [6].
  • Selective visual processing on the basis of Bayes nets and decision theory has also been demonstrated in control tasks for active vision systems [28].
  • Bayesian techniques for integrating bottom-up information with top-down feedback have also been applied to challenging tasks involving the recognition of interactions between people in surveillance footage [26].
  • [24] presents an ontology of actions represented as states and state transitions, hierarchically organised from most general to most specific.

3.1 Overview

  • The authors propose a cognitive architectural model for video interpretation.
  • Queries are posed in an ontological language, which provides a probabilistic hierarchical representation incorporating domain-specific syntactic and semantic constraints to enable robust analysis of video sequences, with the visual language specification tailored to a particular application and its set of available component modules.
  • The nature of such queries will be task specific.

3.2 Recognition and Classification

  • The notion of image and video interpretation relative to the goal of satisfying a structured user query (which may be explicit or implicitly derived from a more general specification of system objectives) follows the trend in recent approaches to robust object recognition on the basis of a “union of weak classifiers”.
  • Making such methods robust, scalable, and generally applicable has proven a major problem.
  • The framework takes into account that criteria for what constitutes non-accidental and perceptually significant visual properties necessarily depend on the objectives and prior knowledge of the observer.
  • Such a ranking makes it possible to quickly eliminate highly improbable or irrelevant configurations and narrow down the search window, as the sketch below illustrates.
  • Devising a strategy for recognising objects by applying the most appropriate combination of visual routines such as segmentation and classification modules can also be learned from data [13].
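To make the “union of weak classifiers” idea concrete, the following Python sketch (illustrative only; the cue names, scores, and threshold are invented for this example rather than taken from the paper) combines independent weak cues into a log-odds ranking and prunes improbable hypotheses, mirroring the ranking-and-pruning strategy described in the bullets above.

    import math

    def rank_hypotheses(hypotheses, cues, prune_threshold=-5.0):
        """Combine independent weak classifier outputs (probabilities) as a
        sum of log-odds and discard highly improbable configurations."""
        ranked = []
        for hyp in hypotheses:
            score = 0.0
            for cue in cues:
                p = min(max(cue(hyp), 1e-6), 1 - 1e-6)  # clamp to avoid log(0)
                score += math.log(p / (1 - p))          # log-odds of each weak cue
            if score > prune_threshold:                 # quickly eliminate unlikely hypotheses
                ranked.append((score, hyp["label"]))
        return sorted(ranked, reverse=True)

    # Hypothetical weak cues for labelling an image region:
    cues = [
        lambda h: h["skin_colour"],   # colour-based cue
        lambda h: h["face_score"],    # face detector response
        lambda h: h["shape_score"],   # human shape model fit
    ]
    candidates = [
        {"label": "person",     "skin_colour": 0.9, "face_score": 0.8, "shape_score": 0.7},
        {"label": "background", "skin_colour": 0.2, "face_score": 0.1, "shape_score": 0.3},
    ]
    print(rank_hypotheses(candidates, cues))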

3.3 The Role of Language in Vision

  • As mentioned above, many problems in vision such as object recognition ([14]), video analysis ([18,27,24]), gesture recognition ([3,21,25]), and multimedia retrieval ([22,2,37]) can be viewed as relating symbolic terms to visual information by utilising syntactic and semantic structure in a manner related to approaches in speech and language processing [34].
  • Processing may then be performed selectively in response to queries formulated in terms of the structure of the domain, i.e. relating high-level symbolic representations to extracted features in the signal (image and temporal feature) domain.
  • By basing such a language on an ontology one can capture both concrete and abstract relationships between salient visual properties.
  • Ontologies encode the relational structure of concepts which one can use to describe and reason about aspects of the world.
  • Rather than processing an image exhaustively, only those image aspects which are of value given a particular query are evaluated, and evaluation may stop as soon as the appropriate top-level symbol sequence has been generated.
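A minimal sketch of this goal-directed, selective evaluation (hypothetical detector registry and scores, not the authors' implementation): visual routines are invoked only on demand, and a conjunctive query stops as soon as one of its terms fails.

    def evaluate(term, image, detectors, cache):
        """Run the visual routine for a concept only when the query needs it,
        memoising the result so no image aspect is analysed twice."""
        if term not in cache:
            cache[term] = detectors[term](image)
        return cache[term]

    def conjunctive_query(terms, image, detectors, threshold=0.5):
        """Evaluate terms left to right and stop early on failure, so image
        aspects irrelevant to the query outcome are never processed."""
        cache = {}
        for term in terms:
            if evaluate(term, image, detectors, cache) < threshold:
                return False, cache   # early exit; remaining detectors never ran
        return True, cache

    # Placeholder detectors standing in for real segmentation/classification modules:
    detectors = {
        "person":  lambda img: 0.9,
        "indoor":  lambda img: 0.2,
        "vehicle": lambda img: 0.8,
    }
    ok, ran = conjunctive_query(["person", "indoor", "vehicle"], None, detectors)
    print(ok, sorted(ran))   # False, and the 'vehicle' detector was never invoked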

3.4 Self-Referential Perceptual Inference Framework

  • In spite of the benefits of DBNs and related formalisms outlined above, probabilistic graphical models also have limitations in terms of their ability to represent structured data at a more symbolic level and the requirement for normalisations to enable probabilistic interpretations of information.
  • Devising a probabilistic model is in itself not enough since one requires a framework which determines which inferences are actually made and how probabilistic outputs are to be interpreted.
  • High-level hypotheses can in turn guide the search for evidence to confirm or reject them on the basis of expectations defined over the lower-level features.
  • Such a process is well suited to a generative method where new candidate interpretations are tested and refined over time, as the sketch below illustrates.
  • The authors argue that an ontological content representation and query language can be used as an effective vehicle for hierarchical representation and goal-directed inference in video analysis tasks.
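As a schematic sketch of such a generative hypothesize-and-verify loop (the hypotheses, expected features, and evidence values below are placeholders invented for illustration), each hypothesis defines expectations over lower-level features whose observed support is used to confirm or reject it over successive cycles:

    def refine_hypotheses(expectations, observe, n_cycles=3):
        """Generative loop: each hypothesis predicts lower-level features;
        observed evidence for those features re-weights the hypotheses."""
        beliefs = {h: 1.0 / len(expectations) for h in expectations}
        for cycle in range(n_cycles):
            for h, features in expectations.items():
                support = 1.0
                for f in features:             # expectations guide the search for evidence
                    support *= observe(f, cycle)
                beliefs[h] *= support
            total = sum(beliefs.values())
            beliefs = {h: b / total for h, b in beliefs.items()}   # renormalise
        return beliefs

    # Hypothetical expectations and per-cycle evidence scores:
    expectations = {"person": ["face", "torso"], "vehicle": ["wheels", "windscreen"]}
    evidence = {"face": 0.8, "torso": 0.7, "wheels": 0.3, "windscreen": 0.2}
    print(refine_hypotheses(expectations, observe=lambda f, t: evidence[f]))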

4.1 Image and Video Indexing

  • In [37] the authors proposed an ontological query language called OQUEL as a novel query specification interface and retrieval tool for content based image retrieval and presented results using the ICON system.
  • Images are retrieved by deriving an abstract syntax tree from a textual or forms-based user query and probabilistically evaluating it by analysing the composition and perceptual properties of salient image regions in light of the query.
  • This work employs the region-based motion segmentation method described in [31], which uses a Bayesian framework to determine the most likely labelling of regions according to motion layers and their depth ordering.
  • A face detector and simple human shape model have recently been used to identify and track people.
  • An ontological language is under development which extends the static scene content descriptions with motion verbs (“moves”, “gestures”), spatial and temporal prepositions (“on top of”, “beside”, “before”), and adverbs (“quickly”, “soon”) for indexing and retrieval of video fragments.
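For illustration, two such language terms might be grounded roughly as follows (a sketch with invented helper functions and thresholds; the actual language design is the subject of [37] and ongoing work): the motion verb “moves quickly” is scored from a region's track, and the temporal preposition “before” compares event intervals.

    def moves_quickly(track, speed_threshold=5.0):
        """Score the motion verb 'moves quickly' as mean frame-to-frame
        displacement of a tracked region relative to a threshold."""
        steps = list(zip(track, track[1:]))
        if not steps:
            return 0.0
        mean_speed = sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
                         for (x1, y1), (x2, y2) in steps) / len(steps)
        return min(mean_speed / speed_threshold, 1.0)   # soft score in [0, 1]

    def before(interval_a, interval_b):
        """Score the temporal preposition 'before': A ends before B starts."""
        return 1.0 if interval_a[1] < interval_b[0] else 0.0

    # A person's track (frame positions) and two event intervals (frame ranges):
    person_track = [(10, 20), (18, 22), (27, 25), (38, 30)]
    enters, sits_down = (0, 40), (50, 90)
    # Query "person moves quickly before sitting down", naively conjoined:
    print(moves_quickly(person_track) * before(enters, sits_down))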

4.2 Multi-modal Fusion for Sentient Computing

  • Interesting avenues for refinement, testing and deployment of the proposed cognitive inference framework arise from the “sentient computing” ([17,1]) project developed at AT&T Laboratories Cambridge and the Cambridge University Laboratory for Communications Engineering (LCE).
  • Applications can register with the system to receive notifications of relevant events to provide them with an awareness of the spatial context in which users interact with the system.
  • At a more mundane level, vision technology makes the installation, maintenance and operation of a sentient computing system easier by providing additional means of calibrating sensory infrastructure and adapting a model of the static environment (such as furniture and partition walls).
  • The system thereby remains robust to error rates by integrating information vertically (applying detectors with high false acceptance rates to guide those with potentially high false rejection rates) and horizontally (fusing different kinds of information at the same level to offset different error characteristics for disambiguation).
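A schematic sketch of this two-way integration (detector responses are simulated with fixed scores; names and thresholds are invented): a permissive detector with a high false-acceptance rate gates a stricter one with a high false-rejection rate (vertical), and the stricter detector's score is fused with an independent cue at the same level (horizontal).

    def detect(windows, proposer, verifier, colour_cue,
               propose_threshold=0.3, accept_threshold=0.5):
        """Vertical integration: the cheap, permissive proposer decides where
        the conservative verifier runs.  Horizontal integration: the verifier
        is fused multiplicatively with a complementary colour cue so their
        different error characteristics offset each other."""
        accepted = []
        for w in windows:
            if proposer(w) < propose_threshold:     # high false-acceptance filter, rarely misses
                continue
            fused = verifier(w) * colour_cue(w)     # same-level fusion for disambiguation
            if fused > accept_threshold:
                accepted.append((w, round(fused, 2)))
        return accepted

    # Simulated per-window responses (illustrative values only):
    proposer   = {"w1": 0.9, "w2": 0.8, "w3": 0.1}.get
    verifier   = {"w1": 0.9, "w2": 0.4, "w3": 0.9}.get
    colour_cue = {"w1": 0.8, "w2": 0.9, "w3": 0.9}.get
    print(detect(["w1", "w2", "w3"], proposer, verifier, colour_cue))
    # [('w1', 0.72)]: w3 was filtered cheaply, w2 failed the fused check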

5 Conclusion

  • This paper presents an extensible video analysis framework which can be customised for a given task domain by employing appropriate data sources and application-specific constraints.
  • Recent advances in graph-based probabilistic inference techniques allow the system to propagate a stochastic model over time and combine different types of syntactic and semantic information.
  • The process of generating high-level interpretations subject to system goals is performed by parsing sentence forms in an ontological language for visual content at different levels of analysis.
  • The authors would like to acknowledge directional guidance and support from AT&T Laboratories and the Cambridge University Laboratory for Communications Engineering.
  • The principal author received financial support from the Royal Commission for the Exhibition of 1851.


A Self-Referential Perceptual Inference Framework for Video Interpretation

Christopher Town (1) and David Sinclair (2)

(1) University of Cambridge Computer Laboratory, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK. cpt23@cam.ac.uk
(2) Waimara Ltd, 115 Ditton Walk, Cambridge, UK. das@waiamara.com
Abstract. This paper presents an extensible architectural model for general content-based analysis and indexing of video data which can be customised for a given problem domain. Video interpretation is approached as a joint inference problem which can be solved through the use of modern machine learning and probabilistic inference techniques. An important aspect of the work concerns the use of a novel active knowledge representation methodology based on an ontological query language. This representation allows one to pose the problem of video analysis in terms of queries expressed in a visual language incorporating prior hierarchical knowledge of the syntactic and semantic structure of entities, relationships, and events of interest occurring in a video sequence. Perceptual inference then takes place within an ontological domain defined by the structure of the problem and the current goal set.
1 Introduction

The content-based analysis of digital video footage requires methods which will automatically segment video sequences and key frames into image areas corresponding to salient objects (e.g. people, vehicles, background objects, etc.), track these objects in time, and provide a flexible framework for further analysis of their relative motion and interactions.

We argue that these goals are achievable by following the trend in Computer Vision research to depart from strict “bottom-up” or “top-down” hierarchical paradigms and instead place greater emphasis on the mutual interaction between different levels of representation. Moreover, it is argued that an extensible framework for general robust video object segmentation and tracking is best attained by pursuing an inherently flexible “self-referential” approach. Such a system embodies an explicit representation of its own internal state (different sources of knowledge about a video scene) and goals (finding the object-level interpretation which is most likely given this knowledge and the demands of a particular application). The resulting framework can be customised to a particular problem (e.g. tracking human beings from CCTV footage) by integrating the most appropriate low-level (e.g. facial feature extraction) and high-level (e.g. models of human motion) sources of domain-specific knowledge. The system can then be regarded as combining this information at a meta-level to arrive at the most likely interpretation (e.g. labelling a block of moving image regions as representing a human body) of the video data given the available information, possibly undergoing several cycles of analysis-integration-conclusion in the process.
In order to make meaningful inferences during this iterative fusion of different sources of knowledge and levels of feature extraction/representation, it is necessary to place such a methodology within the sound theoretical framework afforded by modern probabilistic inference techniques such as the adaptive Bayesian graphical methods known as Dynamic Belief networks. Dynamic Belief networks are particularly suitable because they model the evolution and integration of stochastic state information over time and can be viewed as generalisations of a broad family of probabilistic models.
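As a toy illustration of this temporal evolution and integration of state (a two-state forward-filtering recursion; all transition and likelihood numbers are invented for the sketch), two observation modalities are fused at every frame while the belief is propagated through a simple dynamic model:

    # States for a tracked region, and P(state_t | state_{t-1}):
    STATES = ("person", "background")
    TRANSITION = {
        "person":     {"person": 0.9, "background": 0.1},
        "background": {"person": 0.1, "background": 0.9},
    }

    def forward_step(belief, modality_likelihoods):
        """One DBN-style update: propagate the belief through the temporal
        model, then multiply in the likelihood of each observation modality."""
        predicted = {s: sum(belief[p] * TRANSITION[p][s] for p in STATES)
                     for s in STATES}
        for likelihood in modality_likelihoods:   # e.g. colour cue, motion cue
            predicted = {s: predicted[s] * likelihood[s] for s in STATES}
        z = sum(predicted.values())               # renormalise to a distribution
        return {s: v / z for s, v in predicted.items()}

    belief = {"person": 0.5, "background": 0.5}
    colour = {"person": 0.8, "background": 0.3}   # likelihood of colour evidence per state
    motion = {"person": 0.7, "background": 0.4}   # likelihood of motion evidence per state
    for _ in range(3):                            # integrate evidence over three frames
        belief = forward_step(belief, [colour, motion])
    print(belief)                                 # belief in "person" grows with consistent cues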
A key part of the proposed approach concerns the notion that many tasks in computer vision are closely related to, and may be addressed in terms of, operations in language processing. In both cases one ultimately seeks to find symbolic representations which can serve as meaningful interpretations of underlying signal data. Such an analysis needs to incorporate a notion of the syntax and semantics which are seen as governing the domain of interest so that the most likely explanation of the observed data can be found. Whereas speech and language processing techniques are concerned with the analysis of sound patterns, phonemes, words, sentences, and dialogues, video analysis is confronted with pixels, video frames, primitive features, regions, objects, motions, and events. An important difference [32] between the two arises from the fact that visual information is inherently more ambiguous and semantically impoverished. There consequently exists a wide semantic gap between human interpretations of image information and that currently derivable by means of a computer.

We argue that this gap can be narrowed for a particular application domain by means of an ontological language which encompasses a hierarchical representation of task-specific attributes, objects, relations, temporal events, etc., and relates these to the processing modules available for their detection and recognition from the underlying medium. Words in the language therefore carry meaning directly related to the appearance of real world objects. Visual inference tasks can then be carried out by processing sentence structures in an appropriate ontological language. Such sentences are not purely symbolic since they retain a linkage between the symbol and signal levels. They can therefore serve as a computational vehicle for active knowledge representation which permits incremental refinement of alternate hypotheses through the fusion of multiple sources of information and goal-directed feedback to facilitate disambiguation in a context specified by the current set of ontological statements. Particular parts of the ontological language model may be implemented as Dynamic Belief networks, stochastic grammar parsers, or neural networks, but the overall framework need not be tied to a particular formalism such as the propagation of conditional probability densities. Later sections will discuss these issues further in light of related work and ongoing research efforts.
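One minimal way to picture such sentences that “retain a linkage between the symbol and signal levels” (a hypothetical data structure, not the authors' actual representation): every node of a parsed sentence carries both its symbol and pointers to the supporting image evidence, so inference can move between the two levels.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class GroundedTerm:
        """A symbol in the ontological language together with the signal-level
        evidence (region ids and a belief score) that supports it."""
        symbol: str                                         # e.g. "person", "moves"
        probability: float                                  # belief that the symbol applies
        regions: List[int] = field(default_factory=list)    # supporting image regions
        children: List["GroundedTerm"] = field(default_factory=list)

    def sentence_belief(term: GroundedTerm) -> float:
        """Naive conjunctive belief: product over the parse tree, assuming
        independence purely for the sake of the sketch."""
        p = term.probability
        for child in term.children:
            p *= sentence_belief(child)
        return p

    # "person moves quickly", with each node linked to its evidence:
    sentence = GroundedTerm("moves", 0.8, regions=[3, 7], children=[
        GroundedTerm("person", 0.9, regions=[3]),
        GroundedTerm("quickly", 0.7, regions=[3, 7]),
    ])
    print(sentence_belief(sentence))   # 0.8 * 0.9 * 0.7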

2 Related Work

2.1 Visual Recognition as Perceptual Inference

An increasing number of research efforts in medium and high level video analysis can be viewed as following the emerging trend that object recognition and the recognition of temporal events are best approached in terms of generalised language processing which attempts a machine translation [14] from information in the visual domain to symbols and strings composed of predicates, objects, and relations. The general idea is that recognising an object or event requires one to relate ill-defined symbolic representations of concepts to concrete instances of the referenced object or behaviour pattern. This is best approached in a hierarchical manner by associating individual parts at each level of the hierarchy according to rules governing which configurations of the underlying primitives give rise to meaningful patterns at the higher semantic level. Many state-of-the-art recognition systems therefore explicitly or implicitly employ a probabilistic grammar which defines the syntactic rules which can be used to recognise compound objects or events based on the detection of individual components corresponding to detected features in time and space. Recognition then amounts to parsing a stream of basic symbols according to prior probabilities to find the most likely interpretation of the observed data in light of the top-level starting symbols in order to establish correspondence between numerical and symbolic descriptions of information. This idea has a relatively long heritage in syntactic approaches to pattern recognition [39,4] but interest has been revived recently in the video analysis community following the popularity and success of probabilistic methods such as Hidden Markov models (HMM) and related approaches adopted from the speech and language processing community.

While this approach has shown great promise for applications ranging from image retrieval to face detection to visual surveillance, a number of problems remain to be solved. The nature of visual information poses hard challenges which hinder the extent to which mechanisms such as Hidden Markov models and stochastic parsing techniques popular in the speech and language processing community can be applied to information extraction from images and video. Consequently there remains some lack of understanding as to which mechanisms are most suitable for representing and utilising the syntactic and semantic structure of visual information and how such frameworks can best be instantiated. The role of machine learning in computer vision continues to grow and recently there has been a very strong trend towards using Bayesian techniques for learning and inference, especially factorised graphical probabilistic models [23] such as Dynamic Belief networks (DBN). While finding the right structural assumptions and prior probability distributions needed to instantiate such models requires some domain specific insights, Bayesian graphs generally offer greater conceptual transparency than e.g. neural network models since the underlying causal links and prior beliefs are made more explicit. The recent development of various approximation schemes based on iterative parameter variation or stochastic sampling for inference and learning has allowed researchers to construct probabilistic models of sufficient size to integrate multiple sources of information and model complex multi-modal state distributions. Recognition can then be posed as a joint inference problem relying on the integration of multiple (weak) clues to disambiguate and combine evidence in the most suitable context as defined by the top level model structure.
One of the earlier examples of using Dynamic Belief networks (DBN) for visual surveillance appears in [5]. DBNs offer many advantages for tracking tasks such as incorporation of prior knowledge and good modelling ability to represent the dynamic dependencies between parameters involved in a visual interpretation. Their application to multi-modal and data fusion [38] can utilise fusion strategies of e.g. Kalman [10] and particle filtering [20] methods. As illustrated by [11] and [33], concurrent probabilistic integration of multiple complementary and redundant cues can greatly increase the robustness of multi-hypothesis tracking.

In [29] tracking of a person’s head and hands is performed using a Bayesian Belief network which deduces the body part positions by fusing colour, motion and coarse intensity measurements with context dependent semantics. Later work by the same authors [30] again shows how multiple sources of evidence (split into necessary and contingent modalities) for object position and identity can be fused in a continuous Bayesian framework together with an observation exclusion mechanism. An approach to visual tracking based on co-inference of multiple modalities is also presented in [41], which describes a sequential Monte Carlo approach to co-infer target object colour, shape, and position. In [7] a joint probability data association filter (JPDAF) is used to compute the HMM’s transition probabilities by taking into account correlations between temporally and spatially related measurements.
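A stripped-down sequential Monte Carlo step in this co-inference spirit (schematic; the random-walk dynamics and Gaussian cue likelihoods are stand-ins invented for the sketch) fuses two complementary cues multiplicatively when weighting particles:

    import math
    import random

    def pf_step(particles, likelihood_fns, noise=2.0):
        """One particle filter update: predict with random-walk dynamics,
        weight by the product of independent cue likelihoods, resample."""
        moved = [(x + random.gauss(0, noise), y + random.gauss(0, noise))
                 for (x, y) in particles]
        weights = []
        for p in moved:
            w = 1.0
            for like in likelihood_fns:   # fuse complementary modalities
                w *= like(p)
            weights.append(w + 1e-12)     # guard against an all-zero weight vector
        return random.choices(moved, weights=weights, k=len(moved))

    def gaussian_cue(centre, sigma):
        """A toy cue likelihood peaked at the target's true position."""
        return lambda p: math.exp(-((p[0] - centre[0]) ** 2 + (p[1] - centre[1]) ** 2)
                                  / (2.0 * sigma * sigma))

    random.seed(0)
    particles = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(200)]
    colour_cue = gaussian_cue((50, 50), 15.0)   # broad, colour-like cue
    shape_cue  = gaussian_cue((50, 50), 8.0)    # sharper, shape-like cue
    for _ in range(10):
        particles = pf_step(particles, [colour_cue, shape_cue])
    print(sum(x for x, _ in particles) / len(particles),
          sum(y for _, y in particles) / len(particles))   # converges near (50, 50)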
2.2 Recognition of Actions and Structured Events

Over the last 15 years there has been growing interest within the computer vision and machine learning communities in the problem of analysing human behaviour in video. Such systems typically consist of a low or mid level computer vision system to detect and segment a human being or object of interest, and a higher level interpretation module that classifies motions into atomic behaviours such as hand gestures or vehicle manoeuvres. Higher-level visual analysis of compound events has in recent years been performed on the basis of parsing techniques using a probabilistic grammar formalism. Such methods are capable of recognising fairly complicated behavioural patterns although they remain limited to fairly circumscribed scenarios such as sport events [18,19], small area surveillance [36,26], and game playing [25]. Earlier work on video recognition such as [40] and [15] already illustrated the power of using a context dependent semantic hierarchy to guide focus of attention and the combination of plausible hypotheses, but lacked a robust way of integrating multiple sources of information in a probabilistically sound way.

The role of attentional control for video analysis was also pointed out in [6]. The system described there performs selective processing in response to user queries for two cellular imaging applications. This gives the system a goal directed attentional control mechanism since the most appropriate visual analysis routines are performed in order to process the user query. Selective visual processing on the basis of Bayes nets and decision theory has also been demonstrated in control tasks for active vision systems [28]. Knowledge representation using Bayesian networks and sequential decision making on the basis of expected cost and utility allow selective vision systems to take advantage of prior knowledge of a domain’s cognitive and geometrical structure and the expected performance and cost of visual operators. An interesting two-level approach to parsing actions and events in video is described in [21]. HMMs are used to detect candidate low-level temporal features which are then parsed using a SCFG parsing scheme which adds disambiguation and robustness to the stream of detected atomic symbols. A similar approach is taken by [25], which uses the Earley-Stolcke parsing algorithm for stochastic context-free grammars to determine the most likely semantic derivation for recognition of complex multi-tasked activities from a given video scenario. A method for recognising complex multi-agent action is presented in [19]. Belief networks are again used to probabilistically represent and infer the goals of individual agents and integrate these in time from visual evidence. Bayesian techniques for integrating bottom-up information with top-down feedback have also been applied to challenging tasks involving the recognition of interactions between people in surveillance footage [26]. [24] presents an ontology of actions represented as states and state transitions hierarchically organised from most general to most specific (atomic).
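A toy version of this two-level scheme (the grammar, symbols, and probabilities are invented for illustration, in the spirit of the SCFG parsing of [21,25]): low-level detectors are assumed to emit atomic event symbols, and a small probabilistic CYK parser finds the best compound-event derivation.

    # Chomsky-normal-form rules with probabilities: (B, C) -> [(A, P(A -> B C))].
    BINARY_RULES = {
        ("APPROACH", "MEET"):      [("INTERACTION", 0.6)],
        ("INTERACTION", "DEPART"): [("INTERACTION", 0.4)],
    }
    # Atomic symbols, as would be emitted by HMM-based low-level detectors.
    UNARY_RULES = {
        "approach": [("APPROACH", 1.0)],
        "meet":     [("MEET", 1.0)],
        "depart":   [("DEPART", 1.0)],
    }

    def cyk(symbols):
        """Probabilistic CYK: returns, for the whole symbol stream, the best
        derivation probability of each nonterminal that spans it."""
        n = len(symbols)
        chart = [[{} for _ in range(n + 1)] for _ in range(n)]
        for i, s in enumerate(symbols):
            for nt, p in UNARY_RULES.get(s, []):
                chart[i][i + 1][nt] = p
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                for k in range(i + 1, i + span):          # split point
                    for b, pb in chart[i][k].items():
                        for c, pc in chart[k][i + span].items():
                            for nt, pr in BINARY_RULES.get((b, c), []):
                                p = pr * pb * pc
                                if p > chart[i][i + span].get(nt, 0.0):
                                    chart[i][i + span][nt] = p
        return chart[0][n]

    print(cyk(["approach", "meet", "depart"]))   # {'INTERACTION': 0.24}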
3 Proposed Approach and Methodology

3.1 Overview

We propose a cognitive architectural model for video interpretation. It is based on a self-referential (the system maintains an internal representation of its goals and current hypotheses) probabilistic model for multi-modal integration of evidence (e.g. motion estimators, edge trackers, region classifiers, face detectors, shape models, perceptual grouping operators) and context-dependent inference given a set of representational or derivational goals (e.g. recording movements of people in a surveillance application). The system is capable of maintaining multiple hypotheses at different levels of semantic granularity and can generate a consistent interpretation by evaluating a query expressed in an ontological language. This language gives a probabilistic hierarchical representation incorporating domain specific syntactic and semantic constraints to enable robust analysis of video sequences from a visual language specification tailored to a particular application and to the set of available component modules.

From an Artificial Intelligence point of view this might be regarded as an approach to the symbol grounding problem [16] (sentences in the ontological language have an explicit foundation of evidence in the feature domain, so there is a way of bridging the semantic gap between the signal and symbol level) and frame problem [12] (there is no need to exhaustively label everything that is going

Citations
01 Jan 2005
TL;DR: This paper summarizes the 28 video sequences available for result comparison in the PETS04 workshop, which range from about 500 to 1400 frames in length, for a total of about 26500 frames.
Abstract: This paper summarizes the 28 video sequences available for result comparison in the PETS04 workshop. The sequences are from about 500 to 1400 frames in length, for a total of about 26500 frames. The sequences are annotated with both target position and activities by the CAVIAR research team members.

210 citations

Journal Article
TL;DR: This semantic analysis approach can be used in semantic annotation and transcoding systems, which take into consideration the user's environment including preferences, devices used, available network bandwidth and content identity.
Abstract: An approach to knowledge-assisted semantic video object detection based on a multimedia ontology infrastructure is presented. Semantic concepts in the context of the examined domain are defined in an ontology, enriched with qualitative attributes (e.g., color homogeneity), low-level features (e.g., color model components distribution), object spatial relations, and multimedia processing methods (e.g., color clustering). Semantic Web technologies are used for knowledge representation in the RDF(S) metadata standard. Rules in F-logic are defined to describe how tools for multimedia analysis should be applied, depending on concept attributes and low-level features, for the detection of video objects corresponding to the semantic concepts defined in the ontology. This supports flexible and managed execution of various application and domain independent multimedia analysis tasks. Furthermore, this semantic analysis approach can be used in semantic annotation and transcoding systems, which take into consideration the user's environment including preferences, devices used, available network bandwidth and content identity. The proposed approach was tested for the detection of semantic objects on video data of three different domains.

155 citations



Journal Article
TL;DR: Some of the key problems associated with embodied cognitive vision, including the phylogeny/ontogeny trade-off in artificial systems and the developmental limitations imposed by real-time environmental coupling are highlighted.

62 citations


Cites background from "A self-referential perceptual infer..."

  • ...A cognitive framework that combines low-level processing with high-level processing using a language-based ontology and adaptive Bayesian networks is described in [36]....


  • ...see [26,36]), exactly because cognitivism invokes this process of abstraction of isomorphic representations of the world....


Book Chapter
TL;DR: Using fuzzy DLs, the proposed reasoning framework captures the vagueness of the extracted image descriptions and accomplishes their semantic interpretation, while resolving inconsistencies rising from contradictory descriptions.
Abstract: Statistical learning approaches, bounded mainly to knowledge related to perceptual manifestations of semantics, fall short to adequately utilise the meaning and logical connotations pertaining to the extracted image semantics. Instigated by the Semantic Web, ontologies have appealed to a significant share of synergistic approaches towards the combined use of statistical learning and explicit semantics. While the relevant literature tends to disregard the uncertainty involved, and treats the extracted image descriptions as coherent, two valued propositions, this paper explores reasoning under uncertainty towards a more accurate and pragmatic handling of the underlying semantics. Using fuzzy DLs, the proposed reasoning framework captures the vagueness of the extracted image descriptions and accomplishes their semantic interpretation, while resolving inconsistencies rising from contradictory descriptions. To evaluate the proposed reasoning framework, an experimental implementation using the fuzzyDL Description Logic reasoner has been carried out. Experiments in the domain of outdoor images illustrate the added value, while outlining challenges to be further addressed.

30 citations

Journal Article
TL;DR: A fuzzy DLs-based reasoning framework is investigated, which enables the integration of scene and object classifications into a semantically consistent interpretation by capturing and utilising the underlying semantic associations.
Abstract: Recent advances in semantic image analysis have brought forth generic methodologies to support concept learning at large scale. The attained performance however is highly variable, reflecting effects related to similarities and variations in the visual manifestations of semantically distinct concepts, much as to the limitations issuing from considering semantics solely in the form of perceptual representations. Aiming to enhance performance and improve robustness, we investigate a fuzzy DLs-based reasoning framework, which enables the integration of scene and object classifications into a semantically consistent interpretation by capturing and utilising the underlying semantic associations. Evaluation with two sets of input classifiers, configured so as to vary with respect to the wealth of concepts' interrelations, outlines the potential of the proposed approach in the presence of semantically rich associations, while delineating the issues and challenges involved.

29 citations

References
Journal Article
TL;DR: In this paper, the problem of grounding symbolic representations in nonsymbolic representations of two kinds, i.e., "iconic representations" and "categorical representations" is addressed.

3,330 citations

Book
01 Jan 1996
TL;DR: The principal ideas of probabilistic reasoning, known as Bayesian networks, are outlined and their practical implications illustrated; the book is intended for MSc students in knowledge-based systems, artificial intelligence and statistics, and for professionals in decision support systems applications and research.
Abstract: Computational modelling of probability has become a major part of automated decision support systems. In this book, the principal ideas of probabilistic reasoning - known as Bayesian networks - are outlined and their practical implications illustrated. The book is intended for MSc students in knowledge-based systems, artificial intelligence and statistics, and for professionals in decision support systems applications and research.

2,782 citations

Journal Article
TL;DR: A real-time computer vision and machine learning system for modeling and recognizing human behaviors in a visual surveillance task and demonstrates the ability to use these a priori models to accurately classify real human behaviors and interactions with no additional tuning or training.
Abstract: We describe a real-time computer vision and machine learning system for modeling and recognizing human behaviors in a visual surveillance task. The system deals in particularly with detecting when interactions between people occur and classifying the type of interaction. Examples of interesting interaction behaviors include following another person, altering one's path to meet another, and so forth. Our system combines top-down with bottom-up information in a closed feedback loop, with both components employing a statistical Bayesian approach. We propose and compare two different state-based learning architectures, namely, HMMs and CHMMs for modeling behaviors and interactions. Finally, a synthetic "Alife-style" training system is used to develop flexible prior models for recognizing human interactions. We demonstrate the ability to use these a priori models to accurately classify real human behaviors and interactions with no additional tuning or training.

1,831 citations


"A self-referential perceptual infer..." refers background or methods in this paper

  • ...Bayesian techniques for integrating bottom-up information with topdown feedback have also been applied to challenging tasks involving the recognition of interactions between people in surveillance footage [26]....


  • ...Such methods are capable of recognising fairly complicated behavioural patterns although they remain limited to fairly circumscribed scenarios such as sport events [18,19], small area surveillance [36, 26], and game playing [25]....


Book Chapter
28 May 2002
TL;DR: This work shows how to cluster words that individually are difficult to predict into clusters that can be predicted well; for example, the distinction between train and locomotive cannot be predicted using the current set of features, but the underlying concept can.
Abstract: We describe a model of object recognition as machine translation. In this model, recognition is a process of annotating image regions with words. Firstly, images are segmented into regions, which are classified into region types using a variety of features. A mapping between region types and keywords supplied with the images, is then learned, using a method based around EM. This process is analogous with learning a lexicon from an aligned bitext. For the implementation we describe, these words are nouns taken from a large vocabulary. On a large test set, the method can predict numerous words with high accuracy. Simple methods identify words that cannot be predicted well. We show how to cluster words that individually are difficult to predict into clusters that can be predicted well -- for example, we cannot predict the distinction between train and locomotive using the current set of features, but we can predict the underlying concept. The method is trained on a substantial collection of images. Extensive experimental results illustrate the strengths and weaknesses of the approach.

1,765 citations


"A self-referential perceptual infer..." refers background in this paper

  • ...As mentioned above, many problems in vision such as object recognition ([14]), video analysis ([18,27,24]), gesture recognition ([3,21,25]), and multimedia retrieval ([22,2,37]) can be viewed as relating symbolic terms to visual information by utilising syntactic and semantic structure in a manner related to approaches in speech and language processing [34]....


  • ...An increasing number of research efforts in medium and high level video analysis can be viewed as following the emerging trend that object recognition and the recognition of temporal events are best approached in terms of generalised language processing which attempts a machine translation [14] from information in the visual domain to symbols and strings composed of predicates, objects, and relations....


Proceedings Article
01 Aug 1999
TL;DR: A sensor-driven, or sentient, platform for context-aware computing that enables applications to follow mobile users as they move around a building and presents it in a form suitable for application programmers is described.
Abstract: We describe a sensor-driven, or sentient, platform for context-aware computing that enables applications to follow mobile users as they move around a building. The platform is particularly suitable for richly equipped, networked environments. The only item a user is required to carry is a small sensor tag, which identifies them to the system and locates them accurately in three dimensions. The platform builds a dynamic model of the environment using these location sensors and resource information gathered by telemetry software, and presents it in a form suitable for application programmers. Use of the platform is illustrated through a practical example, which allows a user's current working desktop to follow them as they move around the environment.

1,479 citations


"A self-referential perceptual infer..." refers methods in this paper

  • ...Interesting avenues for refinement, testing and deployment of the proposed cognitive inference framework arise from the “sentient computing” ([17,1]) project developed at AT&T Laboratories Cambridge and the Cambridge University Laboratory for Communications Engineering (LCE)....


Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "A self-referential perceptual inference framework for video interpretation" ?

This paper presents an extensible architectural model for general content-based analysis and indexing of video data which can be customised for a given problem domain.