Multimedia Tools and Applications manuscript No.
(will be inserted by the editor)
Everyday Concept Detection in Visual Lifelogs:
Validation, Relationships and Trends
Daragh Byrne · Aiden R. Doherty · Cees
G.M. Snoek · Gareth J.F. Jones · Alan F.
Smeaton
Received: 9-MAR-2009 / Accepted: 3-JUL-2009
Abstract The Microsoft SenseCam is a small lightweight wearable camera used to
passively capture photos and other sensor readings from a user’s day-to-day activities.
It captures on average 3,000 images in a typical day, equating to almost 1 million images
per year. It can be used to aid memory by creating a personal multimedia lifelog, or
visual recording of the wearer’s life. However the sheer volume of image data captured
within a visual lifelog creates a number of challenges, particularly for locating relevant
content. Within this work, we explore the applicability of semantic concept detection, a
method often used within video retrieval, on the domain of visual lifelogs. Our concept
detector models the correspondence between low-level visual features and high-level
semantic concepts (such as indoors, outdoors, people, buildings, etc.) using supervised
machine learning. By doing so it determines the probability of a concept’s presence.
We apply detection of 27 everyday semantic concepts on a lifelog collection composed
of 257,518 SenseCam images from 5 users. The results were evaluated on a subset of
95,907 images, to determine the accuracy for detection of each semantic concept. We
conducted further analysis on the temporal consistency, co-occurrence and relationships
within the detected concepts to more extensively investigate the robustness of the
detectors within this domain.

D. Byrne
CLARITY: Centre for Sensor Web Technologies, Dublin City University, Glasnevin, Dublin 9,
Ireland
Tel.: +353-700-5262
Fax: +353-700-5442
E-mail: daragh.byrne@computing.dcu.ie

A.R. Doherty
CLARITY: Centre for Sensor Web Technologies, Dublin City University, Glasnevin, Dublin 9,
Ireland

C.G.M. Snoek
Intelligent Systems Lab Amsterdam, University of Amsterdam, Science Park 107, 1098 XG
Amsterdam, The Netherlands

G.J.F. Jones
Centre for Digital Video Processing, Dublin City University, Glasnevin, Dublin 9, Ireland

A.F. Smeaton
CLARITY: Centre for Sensor Web Technologies, Dublin City University, Glasnevin, Dublin 9,
Ireland
Keywords Microsoft SenseCam · lifelog · passive photos · concept detection ·
supervised learning
1 Introduction
Recording of personal life experiences through digital technology is a phenomenon we
are increasingly familiar with: music players, such as iTunes, remember the music we
listen to frequently; our web activity is recorded in our web browsers’ history; and we
capture important moments in our lifetime through photos and video [1]. This notion of
digitally capturing our memories is known as lifelogging. While many steps have been
taken towards managing such ever-growing lifelog collections [10,9,24], we are still
far from achieving on-demand, rapid and easy access, mainly because we cannot yet
flexibly locate the content of interest within a collection.
The most obvious form of content retrieval is to offer refinement of the lifelog
collection based on temporal information. Retrieval may also be enabled based on the
low-level visual features of a query image. However, for such a search to be effective
the user must provide a visual example of the content they seek to retrieve, and there
may be times when a user does not possess such an example, or it may be buried deep
within the collection. Augmentation and annotation of the collection with sources of
context metadata is another method by which visual lifelogs may be made searchable.
Using sources of context such as location or weather conditions has been demonstrated
to be effective in this regard [4,12]. There are, however, limitations to these
approaches as well, most importantly that any portion of the collection without
associated context metadata would not be searchable. Moreover, while information
derived from sensors such as Bluetooth and GPS [4] may cover the ‘who’ and the ‘where’
of events in an individual’s lifelog, they do not allow for the retrieval of relevant
content based on the ‘what’ of an event.
An understanding of the ‘what’, or the semantics, of an event would be invaluable
within the search process and would empower a user to rapidly locate relevant content.
Typically, such searching is enabled in image tools like Flickr through manual
user-contributed annotations or ‘tags’, which are then used to retrieve visual content.
Despite being effective for retrieval, such a manual process would not be practical
within the domain of lifelogging, since it would be far too time and resource intensive
given the volume of the collection and the rate at which it grows. Therefore we should
explore methods for the automatic annotation of visual lifelog collections.
One such method is concept detection, an often employed approach in video retrieval
[27,32,35], which aims to describe visual content with confidence values indicating the
presence or absence of object and scene categories. Although it is hard to bridge the
gap between the low-level features that one can extract from visual data and the
high-level conceptual interpretation a user gives to this data, the video retrieval
field has made substantial progress by moving from specific single-concept detection
methods to generic approaches. Such generic concept detection approaches are achieved
by fusion of colour-, texture-, and shape-invariant features [14,15,18,13], combined
with supervised machine learning using support vector machines [5,34]. The emphasis on
generic indexing by learning has opened up the possibility of moving to larger concept
detector sets [20,33,36]. Unfortunately these concept detector sets are optimized for
the (broadcast) video domain only; their applicability to other domains, such as visual
lifelog collections, is unclear, and establishing it is the focus of this work.
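
To make this recipe concrete, the following minimal sketch trains one probabilistic SVM
per concept from labelled exemplar images and scores new images with a confidence
value. It is an illustration under assumptions, not the authors’ implementation: a
plain RGB histogram stands in for their fused colour-, texture- and shape-invariant
features, and the training data below is synthetic.

```python
# Minimal sketch of a generic concept detector: one binary SVM per concept,
# trained on image feature vectors and emitting a probability of presence.
# NOTE: this is not the paper's actual pipeline; a simple RGB histogram
# stands in for the fused invariant features, and all data is synthetic.
import numpy as np
from sklearn.svm import SVC

def rgb_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Flattened, L1-normalised per-channel histogram of an HxWx3 uint8 image."""
    counts = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
              for c in range(3)]
    hist = np.concatenate(counts).astype(float)
    return hist / hist.sum()

def train_concept_detector(pos_images, neg_images) -> SVC:
    """Fit a probabilistic SVM from positive/negative exemplar images."""
    X = np.array([rgb_histogram(im) for im in pos_images + neg_images])
    y = np.array([1] * len(pos_images) + [0] * len(neg_images))
    return SVC(kernel="rbf", probability=True).fit(X, y)

def detect(model: SVC, image: np.ndarray) -> float:
    """Confidence value for the concept's presence in the image."""
    return model.predict_proba([rgb_histogram(image)])[0, 1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins: bright 'outdoor'-like images vs. dark 'indoor' ones.
    pos = [rng.integers(128, 256, (64, 64, 3), dtype=np.uint8) for _ in range(20)]
    neg = [rng.integers(0, 128, (64, 64, 3), dtype=np.uint8) for _ in range(20)]
    model = train_concept_detector(pos, neg)
    test = rng.integers(128, 256, (64, 64, 3), dtype=np.uint8)
    print(f"P(concept present) = {detect(model, test):.2f}")
```

In a full system one such detector would be trained for each concept of interest, with
the per-image probabilities retained for later filtering and retrieval.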
Visual lifelog data, and in particular Microsoft SenseCam data, the source for our
investigation, is markedly different from typical video or photographic data and
presents a significantly more challenging domain for visual analysis. SenseCam images
tend to be of low quality owing to: their lower visual resolution; their use of a
fisheye lens, which distorts the image somewhat but increases the field of view; and
the absence of a lens aperture, resulting in many images being much darker or brighter
than desired for optimal visual analysis. Also, almost half of the images are generally
found to contain undesirable artefacts such as grain, noise, blurring or light
saturation [16]. Thus we have conducted an investigation to determine whether semantic
concept detection methods translate to the novel domain of lifelogs, and to determine
the degree of robustness, precision and reliability that can be achieved with these
approaches on such collections. This investigation, and the results reported here,
build upon work previously reported in Byrne et al. [3]. Further to this prior work,
here we present extended results and analysis on the reliability of concept detection
within the domain of visual lifelogs. Additionally, we explore some aspects of a
lifelog, such as its temporal consistency and spatiotemporal nature, that may lead to
further enhancements of the robustness of concept detection within lifelog archives.
The rest of this paper is organised as follows: first we outline how we applied concept
detection to images captured by the SenseCam lifelogging device (Section 2); then we
quantitatively describe how accurate our models are in detecting concepts (Section 3);
we next examine the temporal consistency (Section 4) and co-occurrences (Section 5) of
the detected concepts; finally we summarise this work and outline potentially
interesting future endeavours for concept detection within the domain of lifelogging
(Sections 6 and 7).
2 Concept Detection Requirements in the Visual Lifelog Domain
The major requirements for semantic concept detection on visual lifelogs are as follows:
a) the identification of everyday concepts; b) the identification of positive and negative
examples; and c) reliable and accurate detection. We now discuss how we addressed
these requirements with respect to lifelog images captured by a SenseCam.
2.1 Use Case: Concept Detection in SenseCam Images
To study the applicability of concept detection in the lifelog domain we make use of a
device known as the SenseCam. Microsoft Research in Cambridge, UK, have developed
the SenseCam as a small wearable device that passively captures a person’s day-to-
day activities, as a series of photographs and readings from in-built sensors [19]. It is
typically hung from a lanyard around the neck, and so it provides a ‘first person view’
of the activities that the wearer is engaged in. Anything in the view of the wearer
can be captured by the SenseCam owing to its fisheye lens. The SenseCam contains
several built-in sensors which are designed to monitor the environment of the wearer.
These are: a three-axis accelerometer, to detect movement of the wearer; a passive
infrared sensor, to detect bodies in front of the wearer; a light sensor, to detect
changes in light level, such as when moving from indoors to outdoors; and an ambient
temperature sensor. At a minimum the SenseCam will automatically take a new image
approximately every 50 seconds, but sudden changes in the environment of the wearer,
detected by the onboard sensors, trigger more frequent photo capture. It can capture
a typical day without interruption, as the battery is sufficient to last for 18 hours
and can be recharged fully overnight. The SenseCam takes an average of 3,000 images in
a typical day and, as a result, a wearer can very quickly build a large and rich photo
collection. Within a year, the lifelog photoset will grow to approximately 1 million
images.

Fig. 1 The Microsoft SenseCam (inset, right: as worn by a user)
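
The capture policy just described, a timed exposure roughly every 50 seconds that is
pre-empted when the onboard sensors register a sudden change, can be sketched as
follows. The thresholds and sensor interface are illustrative assumptions, not the
device’s actual firmware logic.

```python
# Hypothetical sketch of the SenseCam capture policy described above: capture
# at least every ~50 s, and capture early on a sudden environmental change.
# Thresholds and the sensor interface are assumptions, not actual firmware.
from dataclasses import dataclass

@dataclass
class SensorReading:
    accel_delta: float    # change in acceleration magnitude (accelerometer)
    pir_triggered: bool   # passive infrared: body detected in front of wearer
    light_delta: float    # change in ambient light level

CAPTURE_INTERVAL_S = 50.0  # timed capture period (per the text)
ACCEL_THRESHOLD = 0.5      # assumed trigger thresholds
LIGHT_THRESHOLD = 200.0

def should_capture(seconds_since_last: float, reading: SensorReading) -> bool:
    """Capture on timeout, or early on a sudden change detected by a sensor."""
    if seconds_since_last >= CAPTURE_INTERVAL_S:
        return True
    return (reading.accel_delta >= ACCEL_THRESHOLD
            or reading.pir_triggered
            or reading.light_delta >= LIGHT_THRESHOLD)

# A wearer walks outdoors 12 s after the last photo; the jump in light level
# triggers an early capture.
print(should_capture(12.0, SensorReading(0.1, False, 350.0)))  # True
print(should_capture(12.0, SensorReading(0.05, False, 10.0)))  # False
```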
Beyond simple triggering of capture by the onboard sensors, the device does not
currently support more intelligent or efficient capture of images. As such, the device
seeks to capture as much detail as possible about the activities in which a user
engages by sampling them at high frequency. With no external control over the decision
to capture, and no possibility of more selective capture, we must consider means by
which we can intelligently determine which of the large number of images produced by
this capture mechanism will offer utility. In order to achieve this, we explore a
post-processing step, semantic concept detection, through which such an understanding
of the visual frames can be garnered. We expect that semantic concept detection can
ultimately be employed to filter, reduce and retrieve the content contained within a
visual lifelog. However, the motivation of this work is not to immediately offer such
functionality, but rather to first establish that such techniques translate to this
domain and these collections with sufficient success to offer utility.
2.2 Collection Overview
In order to evaluate concept detection, we amassed a large and diverse dataset
comprising 257,518 SenseCam images. These images were gathered by five individual
users during five distinct timeframes, and so there was no overlap between the
periods captured across each user’s dataset. A breakdown of the collection is given
in Table 1. It is worth noting that not all collections featured the same physical
surroundings; collections often contained large changes resulting from shifts in
location, user behaviour, and/or environment.

Table 1 An overview of the image collection used.

User  | Total Images | Positive Concept Examples Provided | Days Covered
1     | 79,595       | 2,180                              | 35
2     | 76,023       | 9,436                              | 48
3     | 40,715       | 28,023                             | 25
4     | 42,700       | 27,223                             | 21
5     | 18,485       | 11,408                             | 8
Total | 257,518      | 78,270                             | 137
2.3 Determining Lifelog Concepts
Current approaches to semantic concept detection require the provision of a set of
positive and a set of negative labelled exemplar images for each concept. These are
then used by a classifier system to train and develop a model for the concept (see
Section 2.4). As part of this investigation, we first had to identify the concepts
present within
the collection for which we wanted to develop detection models, and for which a set
of training examples would be collected. In order to determine the typical concepts
within the collection, a subset of each user’s SenseCam images was visually inspected
by playing the images sequentially at an accelerated speed. A list of concepts
previously used in video retrieval [27,33] and agreed upon as applicable to a SenseCam
collection was used as a starting point. As each new identifiable ‘concept’ was
uncovered within the collection it was added to this list. Each observed repetition of
a concept gave it additional weight and ranked it more highly for inclusion. Over 150
common concepts were identified in this process. Next, it was decided that the most
representative (i.e. everyday) concepts should be selected, and as such the candidates
were narrowed to just 27 core concepts through iterative review and refinement.
Criteria for this refinement included the generalisability of the concept across
collections and users. For example, the concepts ‘mountain’ and ‘snow’ occurred
frequently in User 1’s collection but could not be considered everyday concepts, as
they were not present in the remaining collections. The collection owners were
involved throughout the review process and
were asked for feedback in negotiating the final selections. The 27 concepts represent
a set of everyday core concepts most likely to be collection-independent, which should
consequently be robust with respect to the user and setting. We were not motivated to
select those concepts which would offer most utility in filtering or retrieval, but rather
those which were most likely to occur in all collections and thereby enable robust
evaluation of the applicability of semantic concept detection within the domain of
visual lifelogs, in which such techniques have not previously been explored. These
core concepts are outlined in Figure 2 using visual examples from the collection.
Some concepts
are clearly related (e.g. it is logical to expect that ‘buildings’ and ‘outdoors’ would
co-occur) and as such it is important to note that each image may contain multiple (often
semantically related) concepts. This aspect of the collection and of semantic concepts
is further discussed in Section 5.
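
The frequency-weighted shortlisting described above can also be expressed as a small
sketch: tally each observed occurrence of a candidate concept during the accelerated
review, then keep only candidates seen in every user’s collection, ranked by weight.
The observation log below is invented purely for illustration.

```python
# Illustrative sketch of the concept shortlisting described above: each
# observed repetition of a candidate adds weight, and candidates are kept
# only if they appear in every user's collection. Data is invented.
from collections import Counter

# (user_id, concept) pairs noted while reviewing image subsets at speed.
observations = [
    (1, "indoors"), (1, "indoors"), (1, "people"), (1, "mountain"),
    (2, "indoors"), (2, "people"), (2, "buildings"),
    (3, "indoors"), (3, "people"), (3, "buildings"),
]

weights = Counter(concept for _, concept in observations)
users_seen: dict[str, set[int]] = {}
for user, concept in observations:
    users_seen.setdefault(concept, set()).add(user)

all_users = {user for user, _ in observations}
# Keep only concepts observed in every collection, ranked by total weight.
everyday = sorted((c for c in weights if users_seen[c] == all_users),
                  key=lambda c: -weights[c])
print(everyday)  # ['indoors', 'people']; 'mountain' fails the generality test
```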
A large-scale manual annotation activity was undertaken to provide the required
positive and negative labelled image examples. As annotating the entire collection was
impractical, and given that SenseCam images tend to be temporally consistent, the
collection was skimmed by taking every fifth image. As lifelog images are by their
nature highly personal, it was important for privacy reasons that only the owner of
the lifelog images labelled his or her images. Therefore, collection owners annotated
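
The every-fifth-image skim described above amounts to a simple temporal subsampling.
The sketch below illustrates it; the file names and directory layout are hypothetical.

```python
# Minimal sketch of the annotation skim described above: because consecutive
# SenseCam images are highly temporally consistent, only every fifth image is
# put forward for labelling. File names below are hypothetical.
from pathlib import Path

SKIM_STEP = 5  # keep every fifth image (per the text)

def skim_for_annotation(image_paths: list[Path], step: int = SKIM_STEP) -> list[Path]:
    """Return a temporally ordered subsample of the collection for labelling."""
    ordered = sorted(image_paths)  # assumes file names sort chronologically
    return ordered[::step]

collection = [Path(f"user1/img_{i:06d}.jpg") for i in range(79_595)]
sample = skim_for_annotation(collection)
print(len(sample))  # 15,919 images to annotate instead of 79,595
```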

References

Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data.
Biometrics 33(1), 159-174 (1977)

Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2(3), 27 (2011)

Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)

Kapur, J.N., Sahoo, P.K., Wong, A.K.C.: A new method for gray-level picture
thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image
Processing 29(3), 273-285 (1985)