Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context

Gautam Singh    Jana Košecká
George Mason University
Fairfax, VA
{gsinghc,kosecka}@cs.gmu.edu
Abstract
This paper presents a nonparametric approach to semantic parsing using small patches and simple gradient, color and location features. We learn the relevance of individual feature channels at test time using a locally adaptive distance metric. To further improve the accuracy of the nonparametric approach, we examine the importance of the retrieval set used to compute the nearest neighbours, using a novel semantic descriptor to retrieve better candidates. The approach is validated by experiments on several datasets used for semantic parsing, demonstrating the superiority of the method compared to state-of-the-art approaches.
1. Introduction
The problem of semantic labelling requires simultaneous segmentation of an image into regions and categorization of all the image pixels. The main ingredients of the problem are the choice of elementary regions (pixels, superpixels), the types of features used to characterize them, the methods for computing local label evidence and the means of integrating spatial information. Semantic segmentation has been particularly active in recent years, due to the development of methods for integrating object detection techniques with various contextual cues and top-down information, as well as advances in the inference algorithms used to compute the optimal labelling.
With the increasing complexity and size of the datasets used for the evaluation of semantic segmentation, nonparametric techniques [15, 26] combined with various context-driven retrieval strategies have demonstrated notable improvements in performance. These methods typically start with an oversegmentation of an image into superpixels, followed by the computation of a rich set of features characterizing both appearance and local geometry at the superpixel level. Due to the large number of diverse features, distance learning techniques have been shown to be effective for the retrieval of the closest neighbours.
In the proposed work, we follow a nonparametric approach and make the following contributions: (i) we forgo the use of large superpixels and complex features and tackle the problem of semantic segmentation using local patches characterized by gradient orientation, color and location features. The appeal of this representation is its simplicity and resemblance to the local patch-based methods used in the context of biologically inspired methods; (ii) we adopt an approach for learning the relevance of the individual feature channels (gradient orientation, color and location) used in k-nearest neighbour (k-NN) retrieval; and (iii) we demonstrate a novel approach for obtaining a retrieval set where a coarse semantic labelling is used to retrieve similar views and refine the likelihood estimates. The proposed approach is validated extensively on several semantic segmentation datasets, consistently showing improved performance over state-of-the-art methods.
2. Related Work
In recent years, a large number of approaches for semantic segmentation have been proposed. Due to the complex nature of the problem, the existing approaches differ in the choice of elementary regions, the choice of features to describe them, the methods for modeling spatial relationships, the means of incorporating context and the choice of optimization techniques for solving the optimal labelling problem. The most successful approaches typically use Conditional Random Field (CRF) models [7, 6, 11, 23, 13, 12]. Traditional CRF models [23] combine local appearance information with a smoothness prior that favours the same labelling for neighbouring regions. Researchers in [11] proposed the use of higher-order potentials in a hierarchical framework which allowed the integration of features at different levels (pixels and superpixels). Other works have explored object co-occurrence statistics [7, 12] and combining results from object detectors [13].
With the increasing sizes of datasets and an increasing number of labels, nonparametric approaches have shown notable progress [15, 26, 4, 31]. They are appealing as they can utilize efficient approximate nearest neighbour search techniques, e.g. k-d trees [19], and contextual cues. Context is often captured by a retrieval set of images similar to the query, together with methods for establishing matches between image regions (at the pixel or superpixel level) for labelling the image. Using the method of SIFT Flow, pixel-wise correspondences are established between images for label transfer in [15]. The authors of [26] work at the superpixel level and retrieve similar images using global image features, followed by superpixel-level matching using local features and a Markov random field (MRF) to incorporate neighbourhood context. The work of [26] was extended by [4] by training per-superpixel, per-feature weights and by incorporating superpixel-level semantic context. A set of partially similar images is used in [31] by searching for matches for each region of the query image and then using the retrieval set for label transfer. A nonparametric method which avoids the construction of a retrieval set is [8], which instead addresses the problem of semantic labelling by building a graph of patch correspondences across image sets and transfers annotations to unlabeled images using the established correspondences. However, the degree of the graph vertices is limited due to memory requirements for large datasets like SiftFlow [15].
Our work is closely related to that of [26, 4] in that we also pursue a nonparametric approach, but it differs in the choice of elementary regions, features, feature relevance learning and the method for computing the retrieval set for k-NN classification. In our case, the retrieval set is obtained in a feedback manner using a novel semantic label descriptor computed from an initial semantic segmentation. Similarly to [4], we follow the observation that a single global distance metric is often not sufficient for handling the large variations within a class, and propose to compute weights for the individual feature channels. The weights in our case are computed at test time to indicate the importance of color and gradient orientation vs. location for individual regions. The computation of feature relevance we adopt falls into the broad class of distance metric learning techniques, which have been shown to be beneficial for many problems like image classification [5], object segmentation [17] and image annotation [9]. For a comprehensive survey on distance functions, we refer the reader to [22].
3. Approach
In this section, we will describe our baseline approach, followed by the method of weight computation in Section 4 and semantic contextual retrieval in Section 5.
3.1. Problem Formulation
We formulate semantic segmentation over an image segmented into small superpixels. The output of the semantic segmentation is a labelling $L = (l_1, l_2, \ldots, l_S)$ with hidden variables assigning each superpixel $s_i$ a unique label $l_i \in \{1, 2, \ldots, n_L\}$, where $n_L$ and $S$ are the total numbers of semantic categories and superpixels respectively. The posterior probability of a labelling $L$ given the observed appearance feature vectors $A = [a_1, a_2, \ldots, a_S]$ computed for each superpixel can be expressed as:

$$P(L|A) = \frac{P(A|L)\,P(L)}{P(A)}. \quad (1)$$

We estimate the labelling $L$ as the Maximum A Posteriori (MAP) estimate,

$$\operatorname*{argmax}_L P(L|A) = \operatorname*{argmax}_L P(A|L)\,P(L). \quad (2)$$

The observation likelihood $P(A|L)$ and the joint prior $P(L)$ are described in later subsections.
3.2. Superpixels and features
For an image, we extract superpixels utilizing a segmentation method [29] where superpixel boundaries are obtained as watersheds on a negative absolute Laplacian image with LoG extrema as seeds. These blob-based superpixels are efficient to compute and naturally consistent with image boundaries. Similarly to [18], for each superpixel we compute a 133-dimensional feature vector $a_i$ comprised of a SIFT descriptor (128 dimensions), the color mean over the pixels of the superpixel in Lab color space (3 dimensions) and the location of the superpixel centroid (2 dimensions). The SIFT descriptor for a superpixel is computed at a fixed scale and orientation using publicly available code [27].
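As a concrete illustration, the per-superpixel feature can be assembled as below. This is a minimal sketch; the ordering of the channels inside the 133-dimensional vector is our assumption, chosen to match the order in which the channels are listed above.

```python
import numpy as np

def superpixel_feature(sift_desc, lab_mean, centroid):
    """Assemble the 133-d feature a_i of Section 3.2.

    sift_desc: (128,) SIFT at fixed scale/orientation,
    lab_mean:  (3,)  mean Lab color over the superpixel's pixels,
    centroid:  (2,)  location of the superpixel centroid."""
    return np.concatenate([sift_desc, lab_mean, centroid]).astype(np.float32)
```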
3.3. Appearance Likelihood
In order to compute the appearance likelihood for the entire image, we invoke the Naive Bayes assumption, yielding

$$P(A|L) \approx \prod_{i=1}^{S} P(a_i|l_i). \quad (3)$$

Such an approximation assumes independence between the appearance features of the superpixels given their labels.
The individual label likelihood $P(a_i|l_j)$ for a superpixel $s_i$ is obtained using a k-NN method. Since a superpixel is uniquely represented by its feature vector, we use the symbols $s_i$ and $a_i$ interchangeably. For each class $l_j$ and every superpixel $s_i$ of the query image, we compute a label likelihood score:

$$L(a_i, l_j) = \frac{n(l_j, N_{ik}) \,/\, n(l_j, G)}{n(\bar{l}_j, N_{ik}) \,/\, n(\bar{l}_j, G)} \quad (4)$$
where $\bar{l}_j = L \setminus l_j$ is the set of all labels excluding $l_j$; $N_{ik}$ is a neighbourhood around $a_i$ with exactly $k$ points in it; $n(l_j, N_{ik})$ is the number of superpixels of class $l_j$ inside $N_{ik}$; and $n(l_j, G)$ is the number of superpixels of class $l_j$ in the set $G$ (described later in Section 3.5).
We compute the normalized label likelihood score using the individual label likelihood:

$$P(a_i|l_j) = \frac{L(a_i, l_j)}{\sum_{l_k=1}^{n_L} L(a_i, l_k)} \quad (5)$$

A straightforward way to compute the neighbourhood $N_{ik}$ is to use the concatenated feature $a_i$ (Section 3.2) and retrieve the $k$ nearest points by computing distances to the superpixels in $G$. Such a retrieval can be performed efficiently with approximate nearest neighbour methods like k-d trees [19].
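The likelihood of Eqs. (4)-(5) reduces to counting class labels among the k nearest retrieval-set superpixels, normalized by the class frequencies in G. A minimal sketch follows; the zero-based integer labels and the small eps guards against empty classes are our additions, not part of the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def label_likelihoods(a_i, G_feats, G_labels, n_labels, k=9, eps=1e-8):
    """Normalized label likelihoods P(a_i | l_j) for one superpixel (Eqs. 4-5).

    G_feats: (N, 133) features of the retrieval-set superpixels G,
    G_labels: (N,) their integer class labels in {0, ..., n_labels-1}."""
    tree = cKDTree(G_feats)                 # in practice, build once per query image
    _, idx = tree.query(a_i, k=k)           # the neighbourhood N_ik
    neigh = G_labels[idx]
    counts_G = np.bincount(G_labels, minlength=n_labels).astype(float)
    scores = np.zeros(n_labels)
    for j in range(n_labels):
        n_in = float(np.sum(neigh == j))    # n(l_j, N_ik)
        n_out = k - n_in                    # n(l̄_j, N_ik)
        g_in = counts_G[j]                  # n(l_j, G)
        g_out = counts_G.sum() - g_in       # n(l̄_j, G)
        scores[j] = (n_in / (g_in + eps)) / (n_out / (g_out + eps) + eps)
    return scores / (scores.sum() + eps)    # Eq. (5) normalization
```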
3.4. Inference
For the joint prior $P(L)$, we adapt the approach of [18], which uses as its smoothness term $E_{smooth}$ a combination of the Potts model (using a constant penalty $\delta$) and a color-difference-based term. The maximization in Eq. (2) can be rewritten in log-space and the optimal labelling $L^*$ obtained as

$$\operatorname*{argmin}_L \sum_{i=1}^{S} E_{app} + \lambda \sum_{(i,j) \in \mathcal{E}} E_{smooth}, \quad (6)$$

where $E_{app} = -\log P(a_i|l_j)$ from Eq. (5) and the set $\mathcal{E}$ contains all neighbouring superpixel pairs. The scalar $\lambda$ is the weight for the smoothness term. We perform the inference in the MRF, i.e. a search for a MAP assignment, using an efficient and fast publicly available MAX-SUM solver [28].
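The paper solves Eq. (6) with the exact MAX-SUM solver of [28]; purely as an illustration of the energy being minimized, the sketch below runs iterated conditional modes (ICM), a simple stand-in optimizer, over the Potts part of the smoothness term only (the color-difference term is omitted).

```python
import numpy as np

def icm_inference(unary, edges, lam=0.4, delta=1.0, iters=10):
    """Approximate minimizer of Eq. (6) via ICM (stand-in for MAX-SUM [28]).

    unary: (S, nL) array of E_app = -log P(a_i | l_j),
    edges: list of (i, j) index pairs of neighbouring superpixels."""
    S, nL = unary.shape
    labels = unary.argmin(axis=1)              # initialize from the appearance term
    nbrs = [[] for _ in range(S)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        for i in range(S):
            cost = unary[i].copy()
            for j in nbrs[i]:                  # Potts penalty when labels disagree
                cost += lam * delta * (np.arange(nL) != labels[j])
            labels[i] = cost.argmin()
    return labels
```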
3.5. Retrieval Set
The computation of the appearance likelihood in Section 3.3 uses images from the training set. Instead of using the entire training set in the k-NN method, it is more useful to utilize a subset of images which are similar to the query image. For example, when trying to label a seaside image, it is more helpful to search for the nearest neighbours in images of beaches and discard views of street scenes. We use overall scene appearance to find a relatively small set of training images instead of using the entire training set. This helps discard images which are dissimilar to the query image and provides a scene-level context which can help improve the labelling performance. The retrieval subset serves as the source of image annotations which will be used to label the query image. We compute three global image features for the dataset, namely: (i) GIST [21], (ii) a spatial pyramid [14] of quantized SIFT [16] and (iii) RGB color histograms with 8 bins per color channel. All the images in the training set $T$ are ranked for each individual global image feature in ascending order of the Euclidean distance from the query image. We then add the individual feature ranks and re-rank the images of the training set based on the aggregate rank. Finally, we select a subset of images $T_g$ from the training set $T$ as the retrieval set. The superpixels of the images in set $T_g$ compose the set of training instances $G$ in Eq. (5).
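This rank aggregation is straightforward to implement. Below is a minimal sketch under our own naming assumptions (the dict-of-features layout and the function name are hypothetical):

```python
import numpy as np

def retrieval_set(query_feats, train_feats, subset_size):
    """Build T_g by summing per-feature ranks (Section 3.5).

    query_feats: dict of global feature vectors for the query, keyed e.g. by
    'gist', 'sift_pyramid', 'rgb_hist'; train_feats: same keys, (N, d) arrays."""
    n_train = next(iter(train_feats.values())).shape[0]
    total_rank = np.zeros(n_train)
    for name, q in query_feats.items():
        d = np.linalg.norm(train_feats[name] - q, axis=1)  # Euclidean distances
        total_rank += d.argsort().argsort()                # rank of each image for this feature
    return total_rank.argsort()[:subset_size]              # indices of T_g
```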
This constitutes our baseline approach and is denoted UKNN-MRF in the experiments, for the uniformly weighted k-NN. Its distinguishing characteristics are the use of small patch-like superpixels, simple features and approximate nearest neighbour methods in the context of k-NN classification. In the next two sections, we describe in detail the two contributions of this work: a method for weighting different feature channels and a strategy for improving the retrieval set.
4. Weighted k-NN
The baseline k-NN approach uses the Euclidean distance to compute the neighbourhood around a point. We propose to use a weighted k-NN method to compute the neighbourhood of a query point. To compute a weighted distance between two superpixels $a_i$ and $a_j$, we split the feature vector into the three feature channels of gradient orientation, color and location and first compute distances in the individual feature spaces:

$$d_f^{ij} = [d_c^{ij}, d_s^{ij}, d_l^{ij}] \quad (7)$$

where $d_c^{ij}, d_s^{ij}, d_l^{ij}$ are the Euclidean distances between the color, SIFT and location channels of the feature vectors $a_i$ and $a_j$ of the two superpixels respectively. We now define a weighted distance between the two superpixels as

$$d_w^{ij} = \mathbf{w} \cdot d_f^{ij} \quad (8)$$

where $\mathbf{w} = [w_1, w_2, w_3] \in \mathbb{R}^3$ defines the weights for the individual feature distances. Using the weighted distance from Eq. (8), we can now obtain the neighbourhood $N_{ik}$ around a superpixel by applying it to the feature distance vector $d_f^{ij}$ between $a_i$ and $a_j \in G$ to compute the label likelihood scores in Eq. (4). We now describe an approach to compute these weights.
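A sketch of Eqs. (7)-(8), reusing the channel layout assumed in the Section 3.2 sketch (the slice boundaries are our assumption; Eq. (7) orders the distances color, SIFT, location):

```python
import numpy as np

# Assumed channel slices of the 133-d feature: SIFT, Lab mean, centroid.
SIFT, COLOR, LOC = slice(0, 128), slice(128, 131), slice(131, 133)

def weighted_distance(a_i, a_j, w):
    """Weighted inter-superpixel distance d_w (Eqs. 7-8), w = [w_1, w_2, w_3]."""
    d_f = np.array([np.linalg.norm(a_i[COLOR] - a_j[COLOR]),   # d_c
                    np.linalg.norm(a_i[SIFT] - a_j[SIFT]),     # d_s
                    np.linalg.norm(a_i[LOC] - a_j[LOC])])      # d_l
    return float(np.dot(w, d_f))
```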
Weight computation. With the varying nature of the retrieval set for individual query images, we use the locally adaptive metric approach of [3] for the weight computation. It is a query-based technique which uses a global metric to select neighbours for a test point, which are then used to refine the feature weights. In our setting, the test points are the individual superpixels of the query image.
The goal is to estimate the relevance of a feature channel $i$ by evaluating its ability to predict class posterior probabilities locally at a query point. This is done by computing the expectation of the posterior $P(l_j|\mathbf{x})$ conditioned at a test point $\mathbf{x}_0$ along feature channel $i$. The ability of feature channel $i$ to predict $P(l_j|\mathbf{z})$ at $x_i = z_i$ is defined as

$$r_i(\mathbf{z}) = \sum_{l_j=1}^{n_L} \frac{\left(P(l_j|\mathbf{z}) - \bar{P}(l_j|x_i = z_i)\right)^2}{\bar{P}(l_j|x_i = z_i)} \quad (9)$$
Intuitively, the smaller the difference between $P(l_j|\mathbf{z})$ and $\bar{P}(l_j|x_i = z_i)$, the more information feature channel $i$ provides for predicting the class posterior probabilities locally at $\mathbf{z}$. For the query point $\mathbf{x}_0$, the relevance of feature $i$ can be computed by averaging the $r_i(\mathbf{z})$ values in its neighbourhood:

$$\bar{r}_i(\mathbf{x}_0) = \frac{1}{|N(\mathbf{x}_0)|} \sum_{\mathbf{z} \in N(\mathbf{x}_0)} r_i(\mathbf{z}) \quad (10)$$
where $N(\mathbf{x}_0)$ denotes a neighbourhood centered at $\mathbf{x}_0$ (using the current feature weights) with $K_0$ points in it. The relative relevance can then be computed as

$$w_i(\mathbf{x}_0) = \frac{\exp(c\,R_i(\mathbf{x}_0))}{\sum_{p=1}^{m} \exp(c\,R_p(\mathbf{x}_0))} \quad (11)$$

where $m$ is the number of individual feature channels (three in our case), $c$ is a parameter which determines the influence of $\bar{r}_i$ (at $c = 0$, all three feature channels have equal weights) and $R_i(\mathbf{x}_0) = \max_{p=1}^{m}\{\bar{r}_p(\mathbf{x}_0)\} - \bar{r}_i(\mathbf{x}_0)$. The quantities $P(l_j|\mathbf{z})$ and $\bar{P}(l_j|x_i = z_i)$ in Eq. (9) are estimated by considering neighbourhoods centered at $\mathbf{z}$, as described in detail in [3]. In the experiments, this method, which evaluates the effect of the weight learning on the final classification, is denoted WKNN-MRF, for the weighted k-NN.
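Eqs. (9)-(11) amount to a chi-square-style deviation per channel followed by a softmax over channels. A minimal sketch (the eps guard against zero posteriors is our addition):

```python
import numpy as np

def feature_relevance(P_joint, P_cond, eps=1e-8):
    """r_i(z) of Eq. (9) for one point z and one channel i.

    P_joint: (nL,) estimates of P(l_j | z),
    P_cond:  (nL,) estimates of P̄(l_j | x_i = z_i)."""
    return float(np.sum((P_joint - P_cond) ** 2 / (P_cond + eps)))

def channel_weights(r_bar, c=1.0):
    """Relative relevance w_i of Eq. (11) from averaged relevances r̄ (Eq. 10)."""
    R = r_bar.max() - r_bar        # R_i: smaller deviation -> more relevant channel
    w = np.exp(c * R)
    return w / w.sum()             # softmax over the m channels
```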
5. Semantic Contextual Retrieval
The semantic labelling of an image, even if inaccurate, provides a strong cue about the presence and absence of different categories in the image. While the idea of using context to improve the labelling has been explored in the past for image superpixels [20, 4], here we examine the effectiveness of this idea at the stage of improving the entire retrieval set. In order to do so, we propose a global descriptor derived from the initial labelling of the image which is used to improve the retrieval set.
To summarize the semantic label information of a labeled image, we introduce the semantic label descriptor. This descriptor captures the basic underlying structure of the image and can help divide images into sets of semantically similar images. For example, streets inside a city have high-rise buildings on the side, while highways generally have trees and plants beside the road. Our proposed descriptor encodes the positional information of each category in the image and can be used for semantic contextual retrieval.
Given an image $I$ which has been labelled using the WKNN-MRF method, we consider a spatial pyramid of $n$ levels over the labelled image. At level $i$ in the pyramid, we divide $I$ into a uniform grid of $d \times d$ cells, where $d = 2^{i-1}$. Within each grid cell, we compute the distribution over the $n_L$ classes using the number of individual pixels in that grid cell which have been assigned each class. This results in an $n_L$-bin histogram for a single grid cell. The class distribution values for each cell are normalized so that they sum to one. The histograms for all the grid cells in the spatial pyramid are concatenated, resulting in an image feature $f_{seman}$ of length $n_L \times C$, where $C = \sum_{i=1}^{n} 4^{i-1}$ is the total number of cells in the spatial pyramid.
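A minimal sketch of this descriptor; zero-based integer labels in the label map are our assumption:

```python
import numpy as np

def semantic_label_descriptor(label_map, n_classes, n_levels=2):
    """Spatial-pyramid semantic label descriptor f_seman (Section 5).

    label_map: (H, W) integer class labels of an image; returns a vector of
    length n_classes * sum(4**(i-1) for i = 1..n_levels)."""
    H, W = label_map.shape
    chunks = []
    for level in range(1, n_levels + 1):
        d = 2 ** (level - 1)                       # d x d grid at this level
        for r in range(d):
            for c in range(d):
                cell = label_map[r * H // d:(r + 1) * H // d,
                                 c * W // d:(c + 1) * W // d]
                hist = np.bincount(cell.ravel(), minlength=n_classes).astype(float)
                chunks.append(hist / max(hist.sum(), 1.0))  # per-cell normalization
    return np.concatenate(chunks)
```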
A higher value of $n$ captures the details of the layout more precisely but is more prone to classification errors, while a lower value of $n$ is less sensitive to errors in the labelling but does not encode the spatial positions of the semantic categories as well. This approach of computing a semantic label-based descriptor is similar to [10]. However, our method differs in that we use a spatial pyramid over the labelled image instead of a single grid to encode the semantic label information, and we do not include additional appearance information in the descriptor, because it has already been captured through the other global image features (Section 3.5). Our method also differs from [4], who compute a superpixel-level semantic context descriptor as a normalized label histogram of neighbouring regions.
5.1. Semantic Retrieval Set
Global image features (GIST, color histograms and spatial pyramid over SIFT) were used to build the retrieval set $T_g$ in Section 3.5. We now use the semantic label descriptor $f_{seman}$ introduced above to refine the quality of the retrieval set by exploiting the semantic context.

For each image $I_k$ in the training set, we perform leave-one-out classification on the image using the WKNN-MRF approach. From the resultant semantic image labelling, we generate its corresponding semantic label descriptor $f_{seman}^k$. Similarly, for the query view $I_q$, we label it using the WKNN-MRF method and compute the corresponding semantic label descriptor. We generate a new ranking of the images in the training set $T$ based on the distance between their semantic label descriptors and that of the query image. The ranking is computed in ascending order of the semantic label descriptor distances. We can now use this ranking in isolation or combine it with the rankings for the other global image feature types, as was done in Section 3.5, to obtain the semantic retrieval set $T_s$. Using the new retrieval set $T_s$, we once again perform semantic labelling of the image by the process described in Sections 3.3-3.4. This method is denoted WLKNN-MRF in our experimental results; WLKNN refers to a weighted k-NN using a retrieval set built from the label descriptor only. We also experiment with using the semantic layout descriptor together with the other three global image features to build the retrieval set, and denote this method WAKNN-MRF.
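Continuing the earlier sketches, building $T_s$ for WAKNN-MRF reduces to adding the label descriptor as a fourth ranked feature. All names here (retrieval_set, semantic_label_descriptor and the surrounding variables) are the hypothetical helpers introduced above, not the authors' code:

```python
import numpy as np

# Hypothetical continuation: rank by the semantic descriptor alongside the
# three global features and reuse the rank-sum retrieval of Section 3.5.
query_feats['seman'] = semantic_label_descriptor(initial_label_map, n_classes)
train_feats['seman'] = np.stack([semantic_label_descriptor(m, n_classes)
                                 for m in train_label_maps])
T_s = retrieval_set(query_feats, train_feats, subset_size=75)  # WAKNN-MRF
```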
6. Experiments
For evaluating the performance of our method, we tested and compared it with several state-of-the-art techniques on four different datasets: SiftFlow [15], SUN09 [1], Google Street View [30] and Stanford Background [6]. The evaluation criteria are the per-pixel accuracy (percentage of pixels correctly labelled) and the per-class accuracy (the average of the semantic category accuracies).
For the Stanford Background and Google Street View datasets, we selected 10% of the training images as the size of our retrieval set. For the other two datasets, we used a retrieval set of 75 images. For all our experiments, we set $k = 9$ in Eq. (4) and $\lambda = 0.4$ in Eq. (6). We obtained these parameters by selecting a small subset of the training images as a validation set. Computation of the feature weights required an average of four minutes for a single query image. To help speed up the computation of the weights, we approximate the neighbourhood construction of [3] through k-d trees [19]. For the query view, we index the individual features from the retrieval set in k-d trees, constructing one k-d tree per feature channel. The neighbourhood computation is then approximated using the set union of the k-NN from the different feature channels. We carry out 5 iterations of the weight computation step in Eq. (11), adaptively changing the nearest neighbours in the weighted neighbourhood space. While this approximates the weight computation, it affected our performance only slightly (a maximum decrease of 0.4% in per-pixel accuracy across the three datasets) and reduced the time for weight computation for an image to 20 seconds. For an image, feature computation, k-NN likelihood computation and MRF inference took 1 second, 13 seconds and 0.5 seconds respectively.
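The per-channel k-d tree approximation can be sketched as below, reusing the channel slices assumed earlier; the union of per-channel neighbours is then re-scored with the weighted distance:

```python
import numpy as np
from scipy.spatial import cKDTree

def approx_neighbourhood(a_q, G_feats, k=9):
    """Approximate the weighted neighbourhood (Section 6): one k-d tree per
    feature channel, candidates = union of the per-channel k-NN."""
    cand = set()
    for sl in (SIFT, COLOR, LOC):        # channel slices from the Section 4 sketch
        tree = cKDTree(G_feats[:, sl])   # in practice, build once per retrieval set
        _, idx = tree.query(a_q[sl], k=k)
        cand.update(np.atleast_1d(idx).tolist())
    return np.fromiter(cand, dtype=int)  # re-score these with weighted_distance
```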
When reporting the performance, we used the following variants of our approach:

- UKNN-MRF: uniform weights for the features, with the retrieval set obtained by global image features
- WKNN-MRF: computed weights for the features, with the retrieval set obtained by global image features
- WLKNN-MRF: computed weights, with the retrieval set built using the semantic layout descriptor only
- WAKNN-MRF: computed weights, with the retrieval set built using a union of the semantic layout descriptor and the three other global image features.
SiftFlow. SiftFlow is a large dataset of 2688 images with 33 semantic categories; [15] split the dataset into 2488 training images and 200 test images. Table 1 reports our performance on this dataset. Our weighted k-NN MRF performs at a comparable level on per-pixel accuracy with the top methods, but it still trails [4] on per-class accuracy. When we incorporate semantic context to obtain a refined retrieval set, our system achieves the best performance on both per-pixel and per-class accuracies. The categories which saw an increase of more than 10% after the use of semantic context include field, car, river, plant, sidewalk, bridge, door and crosswalk. These are categories which do not occur very frequently but achieved improved labelling with the context. For example, identifying roads and highways helps label cars, sidewalks and crosswalks.
System                 Per-Pixel   Per-Class
Liu et al. [15]        76.7        -
Tighe et al. [26]      76.9        29.4
Eigen et al. [4]       77.1        32.5
UKNN-MRF               75.6        27.9
WKNN-MRF               77.2        29.3
WLKNN-MRF              78.5        32.0
WAKNN-MRF              79.2        33.8
WKNN-MRF (with HOG)    76.7        27.4

Table 1. Semantic labelling performance (%) on the SiftFlow dataset.
We also experimented with replacing the SIFT feature for the superpixel with a HOG feature [2]. This feature was computed using a 4 × 4 spatial grid of 4-pixel HOG cells with the grid centered at the superpixel's center. The individual HOG cell descriptors were averaged to compute the superpixel feature. The last row in Table 1 contains the performance of this method. Classes which significantly improved with the use of HOG instead of SIFT include tree, mountain and car, while the accuracy dropped for road, sea, grass and sidewalk.
SUN09. The SUN09 dataset [1] has fully labelled per-pixel ground truth for a set of 107 semantic categories. In the experiments, the dataset was split into 4352 training images and 4310 test images. Table 2 reports the performance of our method on this dataset. Using the semantic context helped obtain an improvement of 3.6% compared to the WKNN-MRF method. In comparison to [25], we perform better on per-pixel accuracy but trail on per-class accuracy. We observed that the per-pixel labelling accuracy on outdoor scenes was more than 11% better than on indoor scenes, highlighting the challenge of labelling indoor views.
Google Street View. The Google Street View dataset contains 320 images selected from a set of 10,000 images

Citations

- Towards unified depth and semantic prediction from a single image. TL;DR: This work proposes a unified framework for joint depth and semantic prediction that effectively leverages the advantages of both tasks and provides state-of-the-art results.
- Learning to segment under various forms of weak supervision. TL;DR: This work proposes a unified approach that incorporates various forms of weak supervision (image-level tags, bounding boxes and partial labels) to produce a pixel-wise labeling on the challenging SiftFlow dataset.
- ReSeg: A Recurrent Neural Network-Based Model for Semantic Segmentation. TL;DR: The authors propose a structured prediction architecture which exploits the local generic features extracted by convolutional neural networks and the capacity of recurrent neural networks (RNNs) to retrieve distant dependencies.
- Context Driven Scene Parsing with Attention to Rare Classes. TL;DR: This paper focuses on rare object classes, which play an important role in achieving richer semantic understanding of visual scenes compared to common background classes, and makes two novel contributions: rare class expansion and semantic context description.
- DAG-Recurrent Neural Networks for Scene Labeling. TL;DR: Directed acyclic graph RNNs are proposed to process DAG-structured images, which enables the network to model long-range semantic dependencies among image units, together with a novel class weighting function that attends to rare classes and boosts the recognition accuracy for non-frequent classes.
References

- Distinctive Image Features from Scale-Invariant Keypoints. TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene, and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
- Histograms of Oriented Gradients for Human Detection. TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
- Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.
- Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. TL;DR: The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization, and that modeling a holistic representation of the scene informs about its probable semantic category.