Learning to Detect Basal Tubules of Nematocysts in SEM Images
Michael Lam, Janardhan Rao Doppa, Xu Hu, Sinisa Todorovic, and Thomas Dietterich
Oregon State University
Department of EECS
{lamm,doppa,huxu,sinisa,tgd}@eecs.oregonstate.edu
Abigail Reft and Marymegan Daly
Ohio State University
Department of Evolution, Ecology and Organismal Biology
{reft.1,daly.66}@osu.edu
Abstract
This paper presents a learning approach for detecting
nematocysts in Scanning Electron Microscope (SEM) im-
ages. The image dataset was collected and made avail-
able to us by biologists for the purposes of morphological
studies of corals, jellyfish, and other species in the phylum
Cnidaria. Challenges for computer vision presented by this
biological domain are rarely seen in general images of nat-
ural scenes. We formulate nematocyst detection as labeling
of a regular grid of image patches. This structured pre-
diction problem is specified within two frameworks: CRF
and HC-Search. The CRF uses graph cuts for inference.
The HC-Search approach is based on search in the space of
outputs. It uses a learned heuristic function (H ) to uncover
high-quality candidate labelings of image patches, and then
uses a learned cost function (C) to select the final prediction
among the candidates. While locally optimal CRF inference
may be sufficient for images of natural scenes, our results
demonstrate that CRF with graph cuts performs poorly on
the nematocyst images, and that HC-Search outperforms
CRF with graph cuts. This suggests biological images of
flexible objects present new challenges requiring further ad-
vances of, or alternatives to existing methods.
1. Introduction
This paper addresses the problem of object detection in
scanning electron microscope (SEM) images for the pur-
poses of morphological characterization of cnidae. This
work focuses on nematocysts, one kind of cnida, illustrated
in Figure 1.
A cnida (plural cnidae) is an explosive sub-cellular capsule that fires toxins when it discharges. It is produced by a special cell called a cnidocyte. Because cnidae manifest both extreme morphological cell-level simplicity and wide biological diversity, cnidae provide a great opportunity to investigate fundamental questions in biology, including constraints and convergence in morphology [1]. Of particular interest is a morphological characterization of the basal tubules of nematocysts, marked yellow in the images shown in Figure 1. This is because surfaces of the basal tubules are characterized by spines whose shapes, lengths, and density of placement along the surface represent important phenomic characters for evolutionary studies [7].

Figure 1: Example images of nematocysts from our dataset. Detecting textured, elongated, highly deformable basal tubules of nematocysts (marked yellow) against background clutter is very challenging.
Biological studies of nematocyst images are currently
conducted by visual inspection and manual annotation, tak-
ing prohibitive amounts of expert time. This, in turn, typi-
cally limits the studies to small image collections of narrow
scope. In this paper, we explore an opportunity for com-
puter vision to help biologists in their analysis of nemato-
cyst images by automatically detecting the basal tubules. As
the image resolution (i.e., pixel size) is calibrated to the real
size of observed specimens, detection of the basal tubules
readily gives information about the size and shape of the
nematocyst useful for morphological studies.
As can be seen in Figure 1, images of nematocysts
present significant challenges to the state of the art in com-
puter vision. The basal tubules are relatively thin, elon-
gated, and highly deformable objects covered with spines.
They are typically imaged against significant background
clutter, consisting of mucus and cellular debris. The clutter
is unavoidable, since it is extremely difficult to isolate indi-
vidual nematocysts during image acquisition. Thin, elon-
gated particles of debris appear very similar to the basal
tubule. The texture of debris appears very similar to the
texture of spines along the surface of the basal tubule. In ad-
dition, some images may not show the entire basal tubule,
because it may be partially occluded by clutter, or extend
beyond the image frame. Nematocysts are often damaged
naturally and sometimes damaged through preparation, so
that large parts of the basal tubules may not be physically
present in the image. Rarely do we see the aforementioned
challenges in general images of natural scenes.
Related work mostly focuses on image classification for
accelerating biological studies [6]. In contrast, this paper
focuses on object detection and localization for accelerat-
ing biological studies. We formulate detection of the basal
tubules as binary labeling of a regular grid of image patches.
Patches that fall on the basal tubule are assigned label “1”,
and patches that fall on background are assigned label “0”.
One solution for this problem is to learn a binary classifier
to predict each patch label independently. However, this
approach is limited, since it does not account for relation-
ships among neighboring patches. An alternative is to spec-
ify object detection as a structured prediction problem. To
this end, we employ two state-of-the-art structured predic-
tion frameworks: CRFs (e.g., [5, 4]), and HC-Search [3, 2].
HC-Search has a number of advantages over CRFs in our
detection problem. For example, HC-Search allows us to
use higher-order features with negligible overhead.
Our evaluation on the nematocyst images demonstrates
that locally-optimal CRF inference produces poor detection
results. This is in contrast to the literature, which usually
reports very good CRF performance on images of outdoor-
and indoor-scenes. Our results demonstrate that HC-Search
outperforms CRFs. Although both CRFs and HC-Search
are considered the most powerful, state-of-the-art frame-
works for structured prediction, their relatively modest per-
formance on the nematocyst images suggests that, in gen-
eral, these kinds of biological images present new chal-
lenges for computer vision.
Our key contributions include: (I) Addressing new vi-
sion challenges in SEM images; and (II) Evaluating the
most powerful structured prediction approaches, namely CRFs and HC-Search, on these images, and identifying
key advantages and weaknesses of each approach.
In the following, we describe approaches that we use for
our detection problem: IID Classifier in Sec. 2.1, CRFs in
Sec. 2.2, and HC-Search in Sec. 2.3. Sec. 3 presents the
dataset of nematocyst images and our results.
2. Technical Approach
In this section, we first state the formal problem setup,
and then describe the different approaches used in this work.
Problem Setup. We are provided with a training set of input-output pairs {(x, y*)}, where input x ∈ X is the regular grid of patches of a nematocyst image and output y* ∈ Y corresponds to the ground-truth binary labeling of the patches. Let L be a non-negative loss function such that L(x, y, y*) is the loss associated with labeling a particular input x by output y when the true output is y* (e.g., Hamming and F1 loss). Our goal is to learn a predictor from inputs to outputs whose predicted outputs have low loss.
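To make the loss functions concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that computes the Hamming and F1 losses for a predicted patch labeling, assuming labelings are stored as 0/1 NumPy arrays:

```python
import numpy as np

def hamming_loss(y_pred, y_true):
    """Fraction of patches whose predicted label disagrees with the ground truth."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred != y_true))

def f1_loss(y_pred, y_true):
    """1 - F1, where F1 is computed over the positive (basal tubule) patches."""
    y_pred, y_true = np.asarray(y_pred).astype(bool), np.asarray(y_true).astype(bool)
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    if tp == 0:
        return 0.0 if (fp == 0 and fn == 0) else 1.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 1.0 - 2 * precision * recall / (precision + recall)
```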
2.1. IID Classifier
A simple baseline approach for our problem is to learn
an IID classifier (e.g., SVM, Logistic Regression) on patch
features, and make independent predictions for every image
patch. This solution is unsatisfactory, as it does not account
for relationships among neighboring patches.
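As a point of reference, the IID baseline could be sketched as follows (a hedged example using scikit-learn's LogisticRegression; the array names and the 0.5 decision threshold are our assumptions, not details from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X_train: (num_patches_total, 128) SIFT descriptors; y_train: 0/1 patch labels.
def train_iid_classifier(X_train, y_train):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf

def predict_patch_grid(clf, patch_descriptors, grid_shape):
    """Independently label every patch (row-major order) and reshape to the image grid."""
    probs = clf.predict_proba(patch_descriptors)[:, 1]   # P(tubule) per patch
    labels = (probs >= 0.5).astype(int)
    return labels.reshape(grid_shape), probs.reshape(grid_shape)
```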
Structured approaches such as Conditional Random
Fields (CRFs) [5, 4] and HC-Search [3] leverage the struc-
ture in the problem by accounting for relationships between
inputs and outputs. In what follows, we formulate the basal
tubule detection problem within the framework of CRFs and
HC-Search.
2.2. Conditional Random Fields (CRFs)
The CRF is one of the most popular models for structured learning and inference in computer vision [5, 4]. A CRF defines a parametric posterior distribution over the outputs (labels), y, given observed image features, x, in a factored form: P(y|x, w) = (1/Z(x, w)) exp(w · φ(x, y)), where w are the parameters, Z(x, w) is the partition function, and the features, φ(x, y), decompose over the cliques in the underlying graphical model.

Inference is typically posed as finding the joint MAP assignment that maximizes the posterior distribution, ŷ = arg max_{y ∈ Y} P(y|x, w), which is generally intractable. Parameter learning is usually formulated as minimizing the negative conditional log-likelihood of the data. It involves repeated calls to the inference procedure, and thus is also generally intractable. Well-known approximate inference algorithms in vision include Loopy Belief Propagation (LBP), Iterated Conditional Modes (ICM), and Graph Cuts.

In our model, the patches are organized in a graph, G = (V, E), where V and E are sets of nodes and edges. The nodes i = 1, 2, ..., |V| correspond to patches in the image, and edges (i, j) ∈ E capture their spatial relations as a regular grid with 4-connected neighbors. Every node i is described by a 128-dimensional SIFT descriptor vector, Ψ_u(x_i, y_i), referred to as the unary feature. Every edge (i, j) ∈ E is described by a pairwise feature, Ψ_pair(x_i, x_j, y_i, y_j), indicating the compatibility between patches i and j with the corresponding labeling y_i and y_j:

\[
\Psi_{\mathrm{pair}}(x_i, x_j, y_i, y_j) =
\begin{cases}
0, & \text{if } y_i = y_j, \\
\exp\!\left(-\beta \, |x_i - x_j|^2\right), & \text{if } y_i \neq y_j,
\end{cases}
\tag{1}
\]

where β is a parameter. Ψ_pair(x_i, x_j, y_i, y_j) encourages neighboring patches to take the same label.

Let the set of all patch descriptors be denoted x = {x_i : i = 1, ..., |V|}, and let the set of all patch labels be denoted y = {y_i : i = 1, ..., |V|}, where y_i ∈ {0, 1}. We investigate two different CRF formulations, referred to as pairwise CRFs and pyramid CRFs, as explained below.
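For concreteness, the pairwise feature of Eq. (1) could be computed as in the following sketch (our own minimal Python rendering; the function name and NumPy representation are assumptions, not the paper's implementation):

```python
import numpy as np

def pairwise_feature(x_i, x_j, y_i, y_j, beta=1.0):
    """Contrast-sensitive pairwise feature of Eq. (1).

    Returns 0 when neighboring patches take the same label, and a value that
    decays with appearance difference when their labels disagree.
    """
    if y_i == y_j:
        return 0.0
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.exp(-beta * np.dot(diff, diff)))
```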
Pairwise CRF. The pairwise CRF, given by (2), corresponds to the formulation that contains the unary and pairwise features of image patches, with the standard 4-connected neighborhood of every patch on the image lattice:

\[
w \cdot \phi(x, y) = \sum_{i \in V} w_u \cdot \Psi_u(x_i, y_i)
+ \sum_{i \in V} \sum_{j \in N_i} w_{\mathrm{pair}} \cdot \Psi_{\mathrm{pair}}(x_i, x_j, y_i, y_j).
\tag{2}
\]
Pyramid CRFs. The pyramid CRF, given by (3), contains additional pyramid features, Ψ_pyr(x_i, x_k, y_i, y_k). The graphical model now contains a grid of patches from an image downsampled by a factor of 2, in order to approximate higher-order features. Each node i from the downsampled layer is connected to its four corresponding child nodes k ∈ C_i in the original image:

\[
w \cdot \phi(x, y) = \sum_{i \in V} w_u \cdot \Psi_u(x_i, y_i)
+ \sum_{i \in V} \sum_{j \in N_i} w_{\mathrm{pair}} \cdot \Psi_{\mathrm{pair}}(x_i, x_j, y_i, y_j)
+ \sum_{i \in V} \sum_{k \in C_i} w_{\mathrm{pyr}} \cdot \Psi_{\mathrm{pyr}}(x_i, x_k, y_i, y_k).
\tag{3}
\]
We investigate these two CRF models combined with the
well-known inference algorithms: ICM, LBP, and Graph-
Cuts.
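To illustrate how one of these inference algorithms operates on the pairwise model, the sketch below implements a basic ICM pass (our own simplified example, not the paper's implementation; `unary_score` and `pairwise_score` are hypothetical callbacks returning the contributions of w_u · Ψ_u and w_pair · Ψ_pair). ICM sweeps the grid and flips each patch label to whichever value locally increases w · φ(x, y), holding all other labels fixed.

```python
import itertools
import numpy as np

def icm_inference(unary_score, pairwise_score, grid_shape, num_sweeps=10):
    """Iterated Conditional Modes on a 4-connected grid of binary patch labels.

    unary_score(i, j, label)               -> contribution of w_u . Psi_u for that patch
    pairwise_score(i, j, i2, j2, l, l2)    -> contribution of w_pair . Psi_pair for that edge
    """
    H, W = grid_shape
    y = np.zeros((H, W), dtype=int)        # start from an all-background labeling

    def neighbors(i, j):
        return [(i + di, j + dj)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= i + di < H and 0 <= j + dj < W]

    for _ in range(num_sweeps):
        changed = False
        for i, j in itertools.product(range(H), range(W)):
            scores = []
            for label in (0, 1):
                s = unary_score(i, j, label)
                s += sum(pairwise_score(i, j, i2, j2, label, y[i2, j2])
                         for i2, j2 in neighbors(i, j))
                scores.append(s)
            best = int(np.argmax(scores))
            if best != y[i, j]:
                y[i, j] = best
                changed = True
        if not changed:                    # converged to a local optimum
            break
    return y
```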
2.3. HC-Search
The key elements of HC-Search [3] include the search space over complete outputs, S_o; the search strategy, A; the heuristic function H : X × Y → ℝ that guides the search towards high-quality outputs; and the cost function C : X × Y → ℝ that scores the candidate outputs generated by the search procedure. A high-level overview of the HC-Search framework is shown in Figure 2. Below we explain all these elements and then describe how to learn the heuristic and cost functions.
Search Space. Every state in S_o consists of an input-output pair, (x, y), representing the possibility of predicting y as the output for input image x (see Figure 2). Such a search space is defined in terms of two functions: 1) Initial state function, I, such that I(x) returns an initial state for input x; and 2) Successor function, S, such that for any state (x, y), S((x, y)) returns a set of next states {(x, y_1), ..., (x, y_k)} that share the same input x.

The specific search space that we investigate leverages the IID classifier. Our I(x) corresponds to the predictions made by a logistic regression classifier. S generates a set of next states by computing a set of image patches where the classifier has low confidence and generating one successor for each patch with the corresponding y value flipped. We use the conditional probability of the logistic regression IID classifier as the confidence measure. This search space is similar to the Flipbit space defined in [2].
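A Flipbit-style search space of this kind could be sketched as follows (our own illustration; `num_successors` and the NumPy layout are assumptions rather than details given in the paper):

```python
import numpy as np

def initial_state(probs):
    """I(x): threshold the IID logistic-regression probabilities."""
    return (probs >= 0.5).astype(int)

def successors(y, probs, num_successors=16):
    """S((x, y)): flip the labels of the patches the IID classifier is least sure about.

    Confidence is measured by |P(tubule) - 0.5|; low values mean low confidence.
    """
    confidence = np.abs(probs.ravel() - 0.5)
    flip_candidates = np.argsort(confidence)[:num_successors]
    next_states = []
    for idx in flip_candidates:
        y_next = y.copy()
        y_next.flat[idx] = 1 - y_next.flat[idx]    # flip one patch label
        next_states.append(y_next)
    return next_states
```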
The effectiveness of HC-Search depends critically on the quality of the search space being used. The quality of a search space can be understood in terms of the expected number of search steps needed to uncover the target output y*. For most search procedures, the time required to find y* will grow as the depth of the target in the search space increases. Thus, one way to quantify the expected amount of search, independently of the specific search strategy, is by considering the expected depth of target outputs y*. In particular, for a given input-output pair (x, y*), the target depth d is defined as the minimum depth at which we can find a state corresponding to the target output y*. By this definition, the expected target depth of our search space is equal to the expected number of errors in the output corresponding to the initial state.
Search Strategy. The role of the search procedure is to uncover high-quality outputs, guided by the heuristic function H. Prior work [2, 3] has shown that greedy search works quite well when used with an effective search space. We investigate HC-Search with greedy search. Given an input x, greedy search traverses a path of length τ through the search space, selecting as the next state the best successor of the current state according to the heuristic. Specifically, if s_i is the state at search step i, greedy search selects s_{i+1} = arg min_{s ∈ S(s_i)} H(s), where s_0 = I(x).
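The greedy procedure is short enough to sketch directly (a hypothetical Python rendering that reuses the `initial_state` and `successors` helpers from the previous sketch; `heuristic` stands in for the learned H and scores a labeling directly):

```python
def greedy_search(probs, heuristic, tau):
    """Traverse a path of length tau, always moving to the successor with lowest H."""
    state = initial_state(probs)
    visited = [state]
    for _ in range(tau):
        candidates = successors(state, probs)
        if not candidates:
            break
        state = min(candidates, key=heuristic)   # s_{i+1} = argmin_{s in S(s_i)} H(s)
        visited.append(state)
    return visited                               # all states seen along the trajectory
```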
Making Predictions. Given an input image x, and a prediction time bound τ, HC-Search traverses the search space starting at I(x), using the search procedure A, guided by the heuristic function H, until the time bound is exceeded. It then scores each visited state s according to C(s) and returns the ŷ of the lowest-cost state as the predicted output.

Let y_H denote the best output that HC-Search could possibly return when using H, and let ŷ denote the output that it actually returns. Also, let Y_H(x) be the set of candidate outputs generated using heuristic H for a given input x. Then, we define

\[
y_H = \operatorname*{arg\,min}_{y \in Y_H(x)} L(x, y, y^*),
\qquad
\hat{y} = \operatorname*{arg\,min}_{y \in Y_H(x)} C(x, y).
\tag{4}
\]
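Putting the pieces together, prediction could then look like the following sketch (again our own hedged illustration; `cost` stands in for the learned C and, for simplicity, scores a labeling directly):

```python
def hc_search_predict(probs, heuristic, cost, tau=100):
    """Run H-guided greedy search for tau steps, then let C pick among visited outputs."""
    candidates = greedy_search(probs, heuristic, tau)   # Y_H(x), uncovered by H
    return min(candidates, key=cost)                    # y_hat = argmin_{y in Y_H(x)} C(x, y)
```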

Figure 2: A high-level overview of HC-Search. Given input x and a search space, S_o, we first instantiate a search space over complete outputs. Each search node in this space consists of an input-output pair (i.e., input image and basal tubule detection). Next, we run a search procedure A guided by the heuristic function H for a time bound τ (number of search steps). The highlighted nodes correspond to the search trajectory traversed by the search procedure, in this case greedy search. We return the least-cost output ŷ (basal tubule detection) that is uncovered during the search as the prediction for input x.
Heuristic and Cost Function Learning. The error of HC-Search, ε_HC, for a given H and C can be decomposed into two parts: 1) Generation error, ε_H, due to H not generating high-quality outputs; and 2) Selection error, ε_{C|H}, the additional error (conditional on H) due to C not selecting the best loss output generated by H. Guided by the error decomposition in (5), the learning approach optimizes the overall error, ε_HC, in a greedy stage-wise manner by first training H to minimize ε_H, and then training C to minimize ε_{C|H} conditioned on H.

\[
\epsilon_{HC} = \underbrace{L(x, y_H, y^*)}_{\epsilon_H}
+ \underbrace{L(x, \hat{y}, y^*) - L(x, y_H, y^*)}_{\epsilon_{C\mid H}}.
\tag{5}
\]
H is trained by imitating the search decisions made by the true loss function (available only for training data). We run the search procedure A for a time bound of τ for input x using a heuristic equal to the true loss function, i.e., H(x, y) = L(x, y, y*), and record a set of ranking constraints that are sufficient to reproduce the search behavior. For greedy search, at every search step i, we include one ranking constraint for every node (x, y) ∈ C_i \ (x, y_best), such that H(x, y_best) < H(x, y), where (x, y_best) is the best node in the candidate set C_i (ties are broken by a random tie breaker). The aggregate set of ranking examples is given to a rank learner (e.g., SVM-Rank) to learn H.
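This constraint-collection step could be sketched as follows (our own illustration reusing the `initial_state` and `successors` helpers from above; representing each ranking constraint as a (better, worse) pair of labelings is an assumption about the data format, not a detail from the paper):

```python
import random

def collect_ranking_constraints(probs, y_true, loss, tau):
    """Run loss-guided greedy search and record (better, worse) state pairs for H.

    Each pair says the best node of a candidate set should receive a lower
    heuristic value than every other node in that set.
    """
    state = initial_state(probs)
    constraints = []
    for _ in range(tau):
        candidates = successors(state, probs)
        if not candidates:
            break
        losses = [loss(y, y_true) for y in candidates]
        best_loss = min(losses)
        best_idx = random.choice([i for i, l in enumerate(losses) if l == best_loss])
        best = candidates[best_idx]
        for i, y in enumerate(candidates):
            if i != best_idx:
                constraints.append((best, y))    # require H(best) < H(y)
        state = best
    return constraints
```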
C is trained to score the outputs Y_H(x) generated by H according to their true losses. Specifically, this training is formulated as a bi-partite ranking problem to rank all the best loss outputs Y_best higher than all the non-best loss outputs Y_H(x) \ Y_best.
Advantages of HC-Search relative to other structured prediction approaches, including CRFs, are as follows. First, it scales gracefully with the complexity of the dependency structure of features. In particular, we are free to increase the complexity of H and C (e.g., by including higher-order features) without considering its impact on the inference complexity. [2, 3] show that the use of higher-order features results in significant improvements. Second, the terms of the error decomposition in (5) can be easily measured for a learned (H, C) pair, which allows for an assessment of which function is more responsible for the overall error. Third, HC-Search makes minimal assumptions about the loss function, requiring only that we have a “blackbox” evaluation of any candidate output. Theoretically, it can even work with non-decomposable loss functions, such as F1 loss.
3. Experiments and Results
We evaluate IID classifiers (Sec. 2.1), CRFs (Sec. 2.2), and HC-Search (Sec. 2.3) on a dataset of SEM images con-
taining nematocysts. The image dataset was prepared by an
expert biologist. Fresh specimens of cnidarian tissue were:
(a) Exposed to 1M sodium citrate for 10 minutes; (b) Rinsed
in water; (c) Preserved in 70% ethanol; (d) Dehydrated in
a graded series; (e) Sputter-coated with gold palladium in a
Cressington sputter coater; and, finally, (f) Imaged using a
FEI NOVA nanoSEM microscope. The dataset consists of
130 images, each with resolution of 1024×864 pixels. The
images often show multiple instances of nematocysts within
cluttered background, as illustrated in Figures 1, 4, 5. The
dataset is very challenging. First, the background clutter
consists of mucus and debris. These appear quite similar to
the target basal tubules. Mucus and debris often latch onto
parts of nematocysts, which may partially occlude the basal
tubules or create foreground-background confusion even to
the human eye. Parts of nematocysts may also be physi-
cally missing, or may simply be out of the field of view.
SEM images suffer from low contrast. The ground truth
for each image is manually annotated by dividing the image
into a regular grid of 32x32 pixel patches, and labeling each
patch as belonging to the basal tubule of a nematocyst or
background.
Evaluation Setup and Metrics. We use 80 images
for training, 20 for hold-out validation, and 30 for testing.
Given a test image, our structured prediction assigns one
of the classes to each image patch on a regular grid. Per-
formance is evaluated by precision, recall, and F1 measure,
where true positives are patches that fall on the ground truth
basal tubules. For HC-Search, we evaluate our sensitivity to
the time bound (τ ), the number of greedy search steps that
are allowed before making the final prediction.
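One way to make this protocol concrete is the sketch below (a hedged example; pooling true and false positives over all patches of all test images is our reading of the setup, not a detail stated in the paper):

```python
import numpy as np

def evaluate_detections(pred_grids, true_grids):
    """Pooled patch-level precision, recall, and F1 over a set of test images."""
    tp = fp = fn = 0
    for y_pred, y_true in zip(pred_grids, true_grids):
        y_pred, y_true = np.asarray(y_pred, bool), np.asarray(y_true, bool)
        tp += np.sum(y_pred & y_true)        # predicted tubule patches on the ground truth
        fp += np.sum(y_pred & ~y_true)
        fn += np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```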
Methods. An image is divided into a regular grid of
patches. Each patch is described by a 128-dimensional
SIFT descriptor. Assigning labels to the patches is per-
formed using the following methods. IID Classifier applies
either SVM or Logistic Regression independently on each
image patch. Pairwise CRF is the standard CRF that mod-
els the image using the unary and pairwise potentials of the
image patches. Pyramid CRF augments the pairwise po-
tential with hierarchical relationships between (larger) par-
ent patches and their (embedded smaller) children patches.
The notations w/ ICM, w/ LBP, and w/ GraphCuts indi-
cate that inference of CRF is conducted using ICM, LBP,
or Graph-Cuts algorithms, respectively. HC-Search uses
the following variants: No Global, Max Global, and Sum
Global, which differ in the feature representation for the
heuristic and cost functions. No Global uses only the
unary and pairwise features of image patches, given by (1). Max Global additionally uses a higher-order feature
describing the largest connected component of positive de-
tections. Sum Global additionally uses a higher-order term
describing all connected components of positive detections.
The higher-order feature is defined as the standard Bag-of-
Words (BoW) of 300 codewords, found by K-means over
SIFTs of all image patches from the entire dataset.
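The BoW construction could be sketched as follows (our own example using scikit-learn's KMeans; the paper does not specify its implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_sift_descriptors, num_codewords=300, seed=0):
    """Cluster SIFT descriptors of all patches in the dataset into a codebook."""
    kmeans = KMeans(n_clusters=num_codewords, random_state=seed, n_init=10)
    kmeans.fit(all_sift_descriptors)          # (N, 128) array of descriptors
    return kmeans

def bow_feature(kmeans, component_descriptors):
    """Normalized 300-bin histogram of codeword assignments for one connected component."""
    words = kmeans.predict(component_descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```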
Table 1 presents the detection results of IID Classifiers, CRF, and HC-Search. The results of Logistic Regression are reported for the detection threshold set at the maximum F1 score. The HC-Search results are obtained for time bound τ = 100 (greedy search steps). Table 1 shows that HC-Search outperforms the two types of IID Classifiers, improving upon the initial prediction of logistic regression. Also, HC-Search yields higher recall and F1 than all variants of CRFs. Interestingly, the CRFs with ICM inference gave better recall and F1 than the CRFs with LBP and Graph-Cuts inference. From Table 1, the inclusion of standard higher-order features (BoW) in HC-Search does not lead to significant performance improvements. This contrasts with common reports in the literature and requires further investigation.
We also test sensitivity to (i) Image patch size, (ii)
Choice of the descriptor used for patches, and (iii) Train-
ing time bound τ for HC-Search.
First, for patch sizes of 16x16 and 64x64 pixels, and ap-
propriately adjusted ground truth, all the approaches under-
perform relative to the results presented in Table 1. For all
the approaches, for 16x16 pixels, F1 decreases by 8%–11%,
and, for 64x64 pixels, F1 decreases by 8%–9%. Thus, our
default patch size of 32x32 pixels empirically works best.
Second, when replacing SIFTs with 496-dimensional HOG descriptors, the F1 of all the approaches decreases by 2%–4%.

(a) IID Classifier Results
                        Precision   Recall    F1
SVM                       .675       .147     .241
Logistic Regression       .605       .129     .213

(b) CRF Results
                        Precision   Recall    F1
Pairwise w/ ICM           .432       .360     .393
Pairwise w/ LBP           .545       .091     .156
Pairwise w/ GraphCuts     .537       .070     .124
Pyramid w/ ICM            .565       .258     .354
Pyramid w/ LBP            .500       .013     .025
Pyramid w/ GraphCuts      .732       .013     .026

(c) HC-Search Results
                        Precision   Recall    F1
No Global                 .472       .545     .506
Max Global                .445       .508     .475
Sum Global                .457       .533     .492

Table 1: Performance on the nematocyst images.
Finally, Figure 3 shows the plots of precision, recall, and F1 of HC-Search No Global for increasing time bounds τ. The plots show four types of curves: LL-Search, HL-Search, LC-Search, and HC-Search. LL-Search uses the loss function as both the heuristic and the cost function, and thus serves as an upper bound on the performance of the selected search architecture. HL-Search uses the learned heuristic function, and the loss function as the cost function, and thus serves to illustrate how well the learned heuristic performs in terms of the quality of generated outputs. LC-Search uses the loss function as an oracle heuristic, and learns a cost function to score the outputs generated by the oracle heuristic. From Figure 3, for HC-Search, we see that as τ increases, precision drops, but recall and F1 improve up to a certain point before decreasing. This is understandable, because as τ increases, the generation error (ε_H) will monotonically decrease, since strictly more outputs will be encountered. Simultaneously, the difficulty of cost function learning can increase as τ grows, since it must learn to distinguish among a larger set of candidate outputs. In addition, we can see that the LC-Search curve is very close to the LL-Search curve, while the HL-Search curve is far below the LL-Search curve. This suggests that the overall error of HC-Search, ε_HC, is dominated by the heuristic error ε_H. A better heuristic is thus likely to lead to better performance overall.
We also report the error decomposition results of HC-
Search in Table 2. Recall that from Equation 5, we can

References

Robust higher order potentials for enforcing label consistency.
Efficiently selecting regions for scene understanding.
Dictionary-free categorization of very similar objects via stacked evidence trees.
Morphology, distribution, and evolution of apical structure of nematocysts in Hexacorallia.