Multimedia Tools and Applications manuscript No.
(will be inserted by the editor)
Everyday Concept Detection in Visual Lifelogs:
Validation, Relationships and Trends
Daragh Byrne · Aiden R. Doherty · Cees
G.M. Snoek · Gareth J.F. Jones · Alan F.
Smeaton
Received: 9-MAR-2009 / Accepted: 3-JUL-2009
Abstract The Microsoft SenseCam is a small lightweight wearable camera used to
passively capture photos and other sensor readings from a user’s day-to-day activities.
It captures on average 3,000 images in a typical day, equating to almost 1 million images
per year. It can be used to aid memory by creating a personal multimedia lifelog, or
visual recording of the wearer’s life. However the sheer volume of image data captured
within a visual lifelog creates a number of challenges, particularly for locating relevant
content. Within this work, we explore the applicability of semantic concept detection, a
method often used within video retrieval, on the domain of visual lifelogs. Our concept
detector models the correspondence between low-level visual features and high-level
semantic concepts (such as indoors, outdoors, people, buildings, etc.) using supervised
machine learning. By doing so it determines the probability of a concept’s presence.
We apply detection of 27 everyday semantic concepts on a lifelog collection composed
of 257,518 SenseCam images from 5 users. The results were evaluated on a subset of
95,907 images, to determine the accuracy for detection of each semantic concept. We
conducted further analysis on the temporal consistency, co-occurrence and relationships
within the detected concepts to more extensively investigate the robustness of the
detectors within this domain.

D. Byrne
CLARITY: Centre for Sensor Web Technologies, Dublin City University, Glasnevin, Dublin 9,
Ireland
Tel.: +353-700-5262
Fax: +353-700-5442
E-mail: daragh.byrne@computing.dcu.ie

A.R. Doherty
CLARITY: Centre for Sensor Web Technologies, Dublin City University, Glasnevin, Dublin 9,
Ireland

C.G.M. Snoek
Intelligent Systems Lab Amsterdam, University of Amsterdam, Science Park 107, 1098 XG
Amsterdam, The Netherlands

G.J.F. Jones
Centre for Digital Video Processing, Dublin City University, Glasnevin, Dublin 9, Ireland

A.F. Smeaton
CLARITY: Centre for Sensor Web Technologies, Dublin City University, Glasnevin, Dublin 9,
Ireland
Keywords Microsoft SenseCam · lifelog · passive photos · concept detection ·
supervised learning
1 Introduction
Recording of personal life experiences through digital technology is a phenomenon we
are increasingly familiar with: music players, such as iTunes, remember the music we
listen to frequently; our web activity is recorded in our web browsers’ history; and we
capture important moments in our lifetime through photos and video [1]. This notion of
digitally capturing our memories is known as lifelogging. While many steps have been
taken towards managing such ever-growing lifelog collections [10,9,24], we are still
far from achieving on-demand, rapid and easy access, mainly because we cannot yet
flexibly locate the content of interest within a collection.
The most obvious form of content retrieval is to offer refinement of the lifelog
collection based on temporal information. Retrieval may also be enabled based on the
low-level visual features of a query image. However, for such a search to be effective
the user must provide a visual example of the content they seek to retrieve, and there
may be times when a user does not possess such an example, or it may be buried deep
within the collection. Augmentation and annotation of the collection with sources of
context metadata is another method by which visual lifelogs may be made searchable.
Using sources of context such as location or weather conditions has been demonstrated
to be effective in this regard [4,12]. There are, however, limitations to these
approaches as well, most importantly that any portion of the collection without
associated context metadata would not be searchable. Moreover, while information
derived from sensors such as Bluetooth and GPS [4] may cover the ‘who’ and the ‘where’
of events in an individual’s lifelog, they do not allow for the retrieval of relevant
content based on the ‘what’ of an event.
An understanding of the ‘what’, or the semantics, of an event would be invaluable
within the search process and would empower a user to rapidly locate relevant content.
Typically, such searching is enabled in image tools like Flickr through manual
user-contributed annotations or ‘tags’, which are then used to retrieve visual content.
Despite being effective for retrieval, such a manual process would not be practical
within the domain of lifelogging, since it would be far too time and resource intensive
given the volume of the collection and the rate at which it grows. Therefore we should
explore methods for the automatic annotation of visual lifelog collections.
One such method is concept detection, an often employed approach in video retrieval
[27,32,35], which aims to describe visual content with confidence values indicating the
presence or absence of object and scene categories. Although it is hard to bridge the
gap between the low-level features that one can extract from visual data and the
high-level conceptual interpretation a user gives to this data, the video retrieval
field has made substantial progress by moving from specific single-concept detection
methods to generic approaches. Such generic concept detection approaches are achieved
by fusion of colour-, texture-, and shape-invariant features [14,15,18,13], combined
with supervised machine learning using support vector machines [5,34]. The emphasis on
generic indexing by learning has opened up the possibility of moving to larger concept
detector sets [20,33,36]. Unfortunately these concept detector sets are optimized for
the (broadcast) video domain only; their applicability to other domains, such as visual
lifelog collections, is unclear, and establishing it is the focus of this work.
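
To make this recipe concrete, the following minimal sketch trains one probabilistic SVM
per concept from labelled exemplar images and scores new images with a confidence
value. It is an illustration under assumptions, not the authors’ implementation: a
plain RGB histogram stands in for their fused colour-, texture- and shape-invariant
features, and the training data below is synthetic.

```python
# Minimal sketch of a generic concept detector: one binary SVM per concept,
# trained on image feature vectors and emitting a probability of presence.
# NOTE: this is not the paper's actual pipeline; a simple RGB histogram
# stands in for the fused invariant features, and all data is synthetic.
import numpy as np
from sklearn.svm import SVC

def rgb_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Flattened, L1-normalised per-channel histogram of an HxWx3 uint8 image."""
    counts = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
              for c in range(3)]
    hist = np.concatenate(counts).astype(float)
    return hist / hist.sum()

def train_concept_detector(pos_images, neg_images) -> SVC:
    """Fit a probabilistic SVM from positive/negative exemplar images."""
    X = np.array([rgb_histogram(im) for im in pos_images + neg_images])
    y = np.array([1] * len(pos_images) + [0] * len(neg_images))
    return SVC(kernel="rbf", probability=True).fit(X, y)

def detect(model: SVC, image: np.ndarray) -> float:
    """Confidence value for the concept's presence in the image."""
    return model.predict_proba([rgb_histogram(image)])[0, 1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins: bright 'outdoor'-like images vs. dark 'indoor' ones.
    pos = [rng.integers(128, 256, (64, 64, 3), dtype=np.uint8) for _ in range(20)]
    neg = [rng.integers(0, 128, (64, 64, 3), dtype=np.uint8) for _ in range(20)]
    model = train_concept_detector(pos, neg)
    test = rng.integers(128, 256, (64, 64, 3), dtype=np.uint8)
    print(f"P(concept present) = {detect(model, test):.2f}")
```

In a full system one such detector would be trained for each concept of interest, with
the per-image probabilities retained for later filtering and retrieval.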
Visual lifelog data, and in particular Microsoft SenseCam data, the source for our
investigation, is markedly different from typical video or photographic data and
presents a significantly more challenging domain for visual analysis. SenseCam images
tend to be of low quality owing to: their lower visual resolution; their use of a
fisheye lens, which distorts the image somewhat but increases the field of view; and
the absence of a lens aperture, resulting in many images being much darker or brighter
than desired for optimal visual analysis. Also, almost half of the images are generally
found to contain undesirable artefacts such as grain, noise, blurring or light
saturation [16]. Thus we have conducted an investigation to determine whether semantic
concept detection methods translate to the novel domain of lifelogs, and to determine
the degree of robustness, precision and reliability that can be achieved with these
approaches on such collections. This investigation, and the results reported here,
build upon work previously reported in Byrne et al. [3]. Further to this prior work,
here we present extended results and analysis on the reliability of concept detection
within the domain of visual lifelogs. Additionally, we explore some aspects of a
lifelog, such as its temporal consistency and spatiotemporal nature, that may lead to
further enhancements of the robustness of concept detection within lifelog archives.
The rest of this paper is organised as follows: first we outline how we applied concept
detection to images captured by the SenseCam lifelogging device (Section 2); then we
quantitatively describe how accurate our models are in detecting concepts (Section 3);
we next examine the temporal consistency (Section 4) and co-occurrences (Section 5) of
the detected concepts; finally we summarise this work and outline potentially
interesting future endeavours for concept detection within the domain of lifelogging
(Sections 6 and 7).
2 Concept Detection Requirements in the Visual Lifelog Domain
The major requirements for semantic concept detection on visual lifelogs are as follows:
a) the identification of everyday concepts; b) the identification of positive and negative
examples; and c) reliable and accurate detection. We now discuss how we addressed
these requirements with respect to lifelog images captured by a SenseCam.
2.1 Use Case: Concept Detection in SenseCam Images
To study the applicability of concept detection in the lifelog domain we make use of a
device known as the SenseCam. Microsoft Research in Cambridge, UK, have developed
the SenseCam as a small wearable device that passively captures a person’s day-to-
day activities, as a series of photographs and readings from in-built sensors [19]. It is
typically hung from a lanyard around the neck, and so it provides a ‘first person view’
of the activities that the wearer is engaged in. Anything in the view of the wearer
can be captured by the SenseCam owing to its fisheye lens. The SenseCam contains
several built-in sensors which are designed to monitor the environment of the wearer.
These are: a three-axis accelerometer, to detect movement of the wearer; a passive
infrared sensor, to detect bodies in front of the wearer; a light sensor, to detect
changes in light level, such as when moving from indoors to outdoors; and an ambient
temperature sensor. At a minimum the SenseCam will automatically take a new image
approximately every 50 seconds, but sudden changes in the environment of the wearer,
detected by the onboard sensors, trigger more frequent photo capture. It can capture
a typical day without interruption, as the battery is sufficient to last for 18 hours
and can be recharged fully overnight. The SenseCam takes an average of 3,000 images in
a typical day and, as a result, a wearer can very quickly build a large and rich photo
collection. Within a year, the lifelog photoset will grow to approximately 1 million
images.

Fig. 1 The Microsoft SenseCam (inset, right: as worn by a user)
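
The capture policy just described, a timed exposure roughly every 50 seconds that is
pre-empted when the onboard sensors register a sudden change, can be sketched as
follows. The thresholds and sensor interface are illustrative assumptions, not the
device’s actual firmware logic.

```python
# Hypothetical sketch of the SenseCam capture policy described above: capture
# at least every ~50 s, and capture early on a sudden environmental change.
# Thresholds and the sensor interface are assumptions, not actual firmware.
from dataclasses import dataclass

@dataclass
class SensorReading:
    accel_delta: float    # change in acceleration magnitude (accelerometer)
    pir_triggered: bool   # passive infrared: body detected in front of wearer
    light_delta: float    # change in ambient light level

CAPTURE_INTERVAL_S = 50.0  # timed capture period (per the text)
ACCEL_THRESHOLD = 0.5      # assumed trigger thresholds
LIGHT_THRESHOLD = 200.0

def should_capture(seconds_since_last: float, reading: SensorReading) -> bool:
    """Capture on timeout, or early on a sudden change detected by a sensor."""
    if seconds_since_last >= CAPTURE_INTERVAL_S:
        return True
    return (reading.accel_delta >= ACCEL_THRESHOLD
            or reading.pir_triggered
            or reading.light_delta >= LIGHT_THRESHOLD)

# A wearer walks outdoors 12 s after the last photo; the jump in light level
# triggers an early capture.
print(should_capture(12.0, SensorReading(0.1, False, 350.0)))  # True
print(should_capture(12.0, SensorReading(0.05, False, 10.0)))  # False
```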
Beyond simple triggering of capture by the onboard sensors, the device does not
currently support more intelligent or efficient capture of images. As such, the device
seeks to capture as much detail as possible about the activities in which a user
engages by sampling them at high frequency. With no external control over the decision
to capture, and no possibility of more selective capture, we must consider means by
which we can intelligently determine which of the large number of images produced by
this capture mechanism will offer utility. In order to achieve this, we explore a
post-processing step, semantic concept detection, through which such an understanding
of the visual frames can be garnered. We expect that semantic concept detection can
ultimately be employed to filter, reduce and retrieve the content contained within a
visual lifelog. However, the motivation of this work is not to immediately offer such
functionality, but rather to first establish that such techniques translate to this
domain and these collections with sufficient success to offer utility.
2.2 Collection Overview
In order to evaluate concept detection, we amassed a large and diverse dataset
comprising 257,518 SenseCam images. These images were gathered by five individual
users during five distinct timeframes, and so there was no overlap between the
periods captured across each user’s dataset. A breakdown of the collection is given
in Table 1. It is worth noting that not all collections featured the same physical
surroundings; collections often contained large changes resulting from shifts in
location, user behaviour, and/or environment.

Table 1 An overview of the image collection used.

User  | Total Images | Positive Concept Examples Provided | Days Covered
1     | 79,595       | 2,180                              | 35
2     | 76,023       | 9,436                              | 48
3     | 40,715       | 28,023                             | 25
4     | 42,700       | 27,223                             | 21
5     | 18,485       | 11,408                             | 8
Total | 257,518      | 78,270                             | 137
2.3 Determining Lifelog Concepts
Current approaches to semantic concept detection require the provision of a set of
positive and a set of negative labelled exemplar images for each concept. These are
then used by a classifier system to train and develop a model for the concept (see
Section 2.4). As part of this investigation, we first had to identify the concepts
present within
the collection for which we wanted to develop detection models, and for which a set
of training examples would be collected. In order to determine the typical concepts
within the collection, a subset of each user’s SenseCam images was visually inspected
by playing the images sequentially at an accelerated speed. A list of concepts
previously used in video retrieval [27,33] and agreed upon as applicable to a SenseCam
collection was used as a starting point. As each new identifiable ‘concept’ was
uncovered within the collection it was added to this list. Each observed repetition of
a concept gave it additional weight and ranked it more highly for inclusion. Over 150
common concepts were identified in this process. Next, it was decided that the most
representative (i.e. everyday) concepts should be selected, and as such the candidates
were narrowed to just 27 core concepts through iterative review and refinement.
Criteria for this refinement included the generalisability of the concept across
collections and users. For example, the concepts ‘mountain’ and ‘snow’ occurred
frequently in User 1’s collection but could not be considered everyday concepts, as
they were not present in the remaining collections. The collection owners were
involved throughout the review process and
were asked for feedback in negotiating the final selections. The 27 concepts represent
a set of everyday core concepts most likely to be collection-independent, which should
consequently be robust with respect to the user and setting. We were not motivated to
select those concepts which would offer most utility in filtering or retrieval, but rather
those which were most likely to occur in all collections and thereby enable robust
evaluation of the applicability of semantic concept detection within the domain of
visual lifelogs, in which such techniques have not previously been explored. These
core concepts are outlined in Figure 2 using visual examples from the collection.
Some concepts
are clearly related (e.g. it is logical to expect that ‘buildings’ and ‘outdoors’ would
co-occur) and as such it is important to note that each image may contain multiple (often
semantically related) concepts. This aspect of the collection and of semantic concepts
is further discussed in Section 5.
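
The frequency-weighted shortlisting described above can also be expressed as a small
sketch: tally each observed occurrence of a candidate concept during the accelerated
review, then keep only candidates seen in every user’s collection, ranked by weight.
The observation log below is invented purely for illustration.

```python
# Illustrative sketch of the concept shortlisting described above: each
# observed repetition of a candidate adds weight, and candidates are kept
# only if they appear in every user's collection. Data is invented.
from collections import Counter

# (user_id, concept) pairs noted while reviewing image subsets at speed.
observations = [
    (1, "indoors"), (1, "indoors"), (1, "people"), (1, "mountain"),
    (2, "indoors"), (2, "people"), (2, "buildings"),
    (3, "indoors"), (3, "people"), (3, "buildings"),
]

weights = Counter(concept for _, concept in observations)
users_seen: dict[str, set[int]] = {}
for user, concept in observations:
    users_seen.setdefault(concept, set()).add(user)

all_users = {user for user, _ in observations}
# Keep only concepts observed in every collection, ranked by total weight.
everyday = sorted((c for c in weights if users_seen[c] == all_users),
                  key=lambda c: -weights[c])
print(everyday)  # ['indoors', 'people']; 'mountain' fails the generality test
```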
A large-scale manual annotation activity was undertaken to provide the required
positive and negative labelled image examples. As annotating the entire collection was
impractical, and given that SenseCam images tend to be temporally consistent, the
collection was skimmed by taking every fifth image. As lifelog images are by their
nature highly personal, it was important for privacy reasons that only the owner of
the lifelog images labelled his or her images. Therefore, collection owners annotated
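
The every-fifth-image skim described above amounts to a simple temporal subsampling.
The sketch below illustrates it; the file names and directory layout are hypothetical.

```python
# Minimal sketch of the annotation skim described above: because consecutive
# SenseCam images are highly temporally consistent, only every fifth image is
# put forward for labelling. File names below are hypothetical.
from pathlib import Path

SKIM_STEP = 5  # keep every fifth image (per the text)

def skim_for_annotation(image_paths: list[Path], step: int = SKIM_STEP) -> list[Path]:
    """Return a temporally ordered subsample of the collection for labelling."""
    ordered = sorted(image_paths)  # assumes file names sort chronologically
    return ordered[::step]

collection = [Path(f"user1/img_{i:06d}.jpg") for i in range(79_595)]
sample = skim_for_annotation(collection)
print(len(sample))  # 15,919 images to annotate instead of 79,595
```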

References

Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data.
Biometrics 33(1), 159-174 (1977)

Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2(3), 27 (2011)

Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)

Kapur, J.N., Sahoo, P.K., Wong, A.K.C.: A new method for gray-level picture
thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image
Processing 29(3), 273-285 (1985)