
Non-planar Infrared-Visible Registration for Uncalibrated Stereo Pairs
Dinh-Luan Nguyen
Faculty of Information Technology
University of Science, VNU-HCMC
Ho Chi Minh city, Vietnam
1212223@student.hcmus.edu.vn
Pierre-Luc St-Charles, Guillaume-Alexandre Bilodeau
LITIV lab., Dept. of Computer & Software Eng.
Polytechnique Montreal
Montreal, QC, Canada
{pierre-luc.st-charles,gabilodeau}@polymtl.ca
Abstract
Thermal infrared-visible video registration for non-planar scenes is a new area in visual surveillance. It allows the combination of information from two spectra for better human detection and segmentation. In this paper, we present a novel online framework for visible and thermal infrared registration of non-planar scenes that includes foreground segmentation, feature matching, rectification, and disparity calculation. Our approach is based on sparse correspondences of contour points. The key ideas of the proposed framework are the removal of spurious regions at the beginning of videos and a registration methodology for non-planar scenes. In addition, a new non-planar dataset with an associated evaluation protocol is proposed as a standard assessment. We evaluate our method on both public planar and non-planar datasets. Experimental results reveal that the proposed method not only successfully handles non-planar scenes but also achieves state-of-the-art results on planar ones.
1. Introduction
The problem of thermal infrared and visible spectrum (TIR-Vis) video content registration is a familiar task in computer vision. The fundamental idea of registration is to find correspondences between video frame pairs so that scenes and objects can be represented in a common coordinate system. Some works use dense feature matching for high registration quality [4, 12, 15], while others [17, 16, 5] use sparse correspondences taken from common salient features for fast registration. Although these systems have contributed to the area, they still have drawbacks that need to be solved. We identify their three main disadvantages as follows.
First, dense correspondence methods that rely on area-based measurements to match frame pairs are too slow to be applied to videos [12, 4]; a lightweight method is needed to speed up the registration process. Furthermore, these methods need rectified video frames, which are not always readily available when tackling non-planar scenes (i.e., scenes in which objects appear on different depth planes). Some authors have proposed their own dataset [4] with rectified, calibrated videos as inputs, but such methods cannot adapt to raw input video captured with arbitrary cameras. Moreover, in video applications, some loss of registration quality is acceptable. In this paper, we therefore address the problem of sparse feature correspondence for fast registration.
Second, existing sparse correspondence methods [16, 5] can only deal with planar scenes: their frameworks assume that all captured scenes are approximately planar, which limits their applicability.
Third, since most sparse methods [16, 5] rely on brute force matching strategies, their computational complexity is usually quite high. They are thus unsuited for mobile or distributed video surveillance applications.
The typical structure of existing frameworks for sparse registration comprises three main steps: feature extraction, feature matching, and image warping. In feature extraction and matching, traditional feature descriptors are exploited using sparse correspondences [16] between multimodal imagery [1]. Another technique [13] has been proposed to extract more meaningful features from both modalities for TIR-Vis registration. However, these techniques are not always successful because of the differences in texture and resolution of TIR-Vis image pairs. In the image warping step, under the assumption that all captured scenes are nearly planar, a homography transformation is applied to maximize the overlap area between objects. It should be noted that no existing framework uses unrectified videos as inputs for TIR-Vis non-planar scene registration. In this paper, we address these drawbacks of existing systems for TIR-Vis video registration in both planar and non-planar scenes.
Main contribution. There are four significant contributions presented in this paper.

[Figure 1: framework overview diagram — raw video frame pairs pass through (1) foreground segmentation with a statistical model, (2) noise filtering, (3) video rectification, (4) feature and blob matching, and (5) disparity calculation, with a temporal match buffer, producing the registered video as output.]
Figure 1. Proposed framework overview. First, raw input videos are segmented to extract foreground objects using a statistical model [18]. Second, a noise filtering strategy eliminates spurious blobs in the segmented videos to reduce unnecessary computation. Third, the videos are rectified using fundamental matrix estimation. Fourth, disparities are calculated from corresponding blobs in each pair of video frames. Finally, the videos are derectified to restore frames to the raw input format.
First, a novel method for aligning TIR-Vis blobs using sparse correspondences on raw input videos is proposed to deal with non-planar scenes. Experimental results show that the proposed framework also outperforms the state of the art on planar scenes.
Second, a segmentation noise filtering strategy is presented to eliminate spurious blobs at early processing stages, which reduces unnecessary calculations afterward.
Third, a corresponding blob preservation algorithm is introduced to approximate correspondences between blobs in every frame without using a brute force method. The advantage is that the correspondence list only needs to be updated when the order of objects changes in the video pairs.
Fourth, we created a new public dataset for TIR-Vis registration with raw input videos (https://github.com/luannd/TIRVisReg). Ground truth together with an evaluation protocol are also provided to simplify comparisons between frameworks in the future.
2. Related work
To extract features from TIR-Vis videos, the first works on the topic [6, 8] used edge maps and silhouette information. Shape skeletons have also been exploited as features to estimate homographic registration transformations [3]. Blob tracking [20] was also utilized to find correspondences. These methods work well only in special cases; more specifically, their accuracy depends mainly on the captured video quality. Thus, [6] can be used only for infrared video with large contrast between foreground and background. Furthermore, although skeletons and edge information are handy for rough estimation, they do not give precise corresponding features to match because they roughly represent objects as simple polygons.
The idea of foreground segmentation before processing [10, 21] has been proposed to increase accuracy in finding object features. However, this approach simply exploits shape contours and treats frames separately, so there is nearly no information connecting frames. As a result, noise in the segmentation step strongly affects the accuracy of the registration system. To decide whether a feature match is good, a temporal correspondence buffer can also be constructed [16, 17]. Several kinds of buffer filling strategies can be used, such as first-in, first-out (FIFO) [16], or RANSAC combined with persistence voting [17]. Nonetheless, these methods are applicable only to planar scenes since they assume all input videos are planar. There is still no lightweight method available to solve the non-planar, unrectified video registration problem.
Since all recent sparse correspondence methods [16, 17] are designed for planar videos, only one transformation matrix is applied to register entire frames. This kind of registration cannot be adapted to non-planar scenes, where each object has its own disparity (or lies on its own depth plane). Our framework, described in Section 3.3, addresses this common limitation: we treat each object as a separate blob, so that several transformation matrices may be used in a single frame.
The work of St-Charles et al. [17] is the most closely related to ours. It uses PAWCS segmentation [18] to extract the foreground of TIR-Vis videos. Contour extraction together with shape context matching is used to get correspondences between blobs. They also build a random sampling buffer with a voting scheme to separate inliers from outliers, and transformation smoothing is used to improve resilience to noise. However, their work is designed for planar scene registration, while ours is designed to deal with non-planar scenes, which is more general. We build upon the merits of their work by proposing: (1) a new segmentation noise filtering method for the early processing stage, (2) a fast blob matching strategy, (3) a keypoint matching strategy that accelerates the framework by avoiding exhaustive searches, and (4) a video rectification and disparity calculation method to register non-planar scenes.
As far as we know, our proposed framework is the first to register non-planar TIR-Vis videos with sparse correspondences. There is no public dataset and evaluation protocol suitable for this problem. Although Bilodeau et al. [4] created a public dataset for non-planar video registration, its input video frames are rectified, so it is not general. As a result, we also create a new dataset, an extended version of the dataset of Bilodeau et al. [4], and provide our evaluation protocol as a standard one in addition to the overlap assessment metric.

3. Framework Architecture
Our proposed framework is shown in Figure 1. We consider all input frame pairs as coming from a non-planar scene; thus, each object has its own disparity. To each pair of frames, we apply the PAWCS method [18] for segmentation, which performs background subtraction with a statistical model. The resulting foreground segmentation, however, is still noisy and unfit for the subsequent blob matching step. To filter noise, we propose a new way to remove spurious blobs based on a coarse warping of the images: warped blobs that do not have a correspondence in the other image of the pair are removed, as explained in Section 3.1.
This cleaned version of the foreground segmentation is used for feature matching. Contours are extracted from object blobs, and shape context matching is applied to get correspondences between each pair of frames. The RANSAC algorithm [9] is also applied to filter outliers in order to increase the transformation accuracy between object blobs. Instead of using a brute force method to get the best match for each blob, a preservation matching strategy is proposed to increase the processing speed and eliminate wrong matches during early processing stages. This strategy maintains a correspondence match list to keep track of matched pairs throughout the analyzed video sequences; the match list is updated only when spatial relationships between objects change. The details of feature matching are discussed in Section 3.2.
Then, input video frames are rectified to reduce the disparity search space from 2D to 1D. Section 3.3 describes the method used to register non-planar scenes. The disparity for each object in every frame is calculated using the corresponding blob pairs obtained from the previous stage. Based on these disparities, a transformation is applied to each object, and the video is unrectified so that the output has the same format as the raw input.
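The overall flow of these five stages can be summarized in a short driver loop. The sketch below (here and in the other sketches, Python is used) is illustrative only: the paper provides no code, so the stage functions are passed in as callables and every name is a placeholder of ours, not the authors' implementation.

    # Structural sketch of the five-stage pipeline of Figure 1.
    # All stage callables are hypothetical placeholders.
    def register_stream(frame_pairs, segment, filter_noise, rectify, match, disparity, derectify):
        state = {}                                       # rolling buffers (Sec. 3.2, 3.3.1)
        for tir, vis in frame_pairs:
            fg_tir, fg_vis = segment(tir), segment(vis)      # (1) PAWCS-style segmentation
            fg_tir, fg_vis = filter_noise(fg_tir, fg_vis)    # (2) spurious-blob removal
            H1, H2 = rectify(fg_tir, fg_vis, state)          # (3) fundamental-matrix rectification
            pairs = match(fg_tir, fg_vis, state)             # (4) blob/feature matching
            yield derectify(disparity(pairs, H1, H2))        # (5) per-blob disparity, then derectify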
3.1. Segmentation and noise filtering
Similarly to [17], we use background subtraction based on a statistical model using colors, binary features, and an automatic feedback mechanism to segment object foreground blobs from the scene's background; specifically, we use the PAWCS method [18]. The resulting segmentation contains spurious blobs from the background, and eliminating them makes our framework more robust. As shown in Figure 2, from the raw segmentation returned by PAWCS, we compute a coarse transformation to estimate a homography of the whole scene. This transformation is then used to overlap the frame pair, and we remove the blobs in the frame pair that do not overlap after the transformation.
Algorithm 1 describes our strategy in detail. In the algorithm, B^(F_i) represents all blobs in the other frame of the i-th frame pair, and n and m^(F_i) are the number of frames and the number of blobs in frame F_i, respectively.
[Figure 2: Segmentation and spurious-blob filtering strategy. Connected components, contours, and shape descriptions are extracted from the raw segmentation of each modality; a registration matrix is roughly estimated from blob IDs and positions; each modality is then compared against the other (visible against infrared and vice versa) for noise elimination.]
Algorithm 1 Noise elimination by rough registration
Input: TIR-Vis frame pairs
Output: Cleaned version of the videos without spurious blobs
 1: procedure NOISEELIMINATION
 2:   for F_i ∈ F_1, F_2, ..., F_n do
 3:     M_(F_i) = (1 / m^(F_i)) * sum_{k=1..m^(F_i)} M(B_k^(F_i))
 4:     Let B_new^(F_i) = Ø
 5:     for B_k^(F_i) ∈ B_1^(F_i), B_2^(F_i), ..., B_m^(F_i) do
 6:       B_k^(F_i) = apply matrix M_(F_i) to blob B_k^(F_i)
 7:       B_k^(F_i) = expand B_k^(F_i) by α percent
 8:       if B_k^(F_i) ∩ B^(F_i) = Ø then
 9:         eliminate B_k^(F_i)
10:       else
11:         B_new^(F_i) = B_new^(F_i) ∪ B_k^(F_i)
12:       end if
13:     end for
14:     Save the new cleaned frame B_new^(F_i)
15:   end for
16: end procedure
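As a rough illustration of Algorithm 1, the following OpenCV sketch warps each blob's bounding box with the coarse scene transform, expands it by α, and keeps the blob only if it overlaps foreground in the other modality. Bounding boxes stand in for the blob masks, and all names are ours, not the paper's code.

    import cv2
    import numpy as np

    # Sketch of Algorithm 1: warp each blob with the coarse transform M,
    # expand it by alpha, and keep it only if it overlaps some foreground
    # in the other modality's mask.
    def eliminate_spurious_blobs(mask_src, mask_ref, M, alpha=1.2):
        n, labels, stats, _ = cv2.connectedComponentsWithStats(mask_src)
        cleaned = np.zeros_like(mask_src)
        for k in range(1, n):                      # label 0 is the background
            x, y, w, h = stats[k, :4]
            # corners of the blob's bounding box, warped by the coarse homography M
            corners = np.float32([[x, y], [x + w, y], [x, y + h], [x + w, y + h]])
            warped = cv2.perspectiveTransform(corners.reshape(-1, 1, 2), M)
            xw, yw, ww, hw = cv2.boundingRect(warped)
            # expand the warped box by alpha (alpha = 120% in the paper)
            cx, cy = xw + ww / 2.0, yw + hw / 2.0
            ww, hw = ww * alpha, hw * alpha
            x0, y0 = int(max(cx - ww / 2, 0)), int(max(cy - hw / 2, 0))
            x1 = int(min(cx + ww / 2, mask_ref.shape[1]))
            y1 = int(min(cy + hw / 2, mask_ref.shape[0]))
            # keep the blob only if the expanded box touches foreground in the other frame
            if x1 > x0 and y1 > y0 and cv2.countNonZero(mask_ref[y0:y1, x0:x1]) > 0:
                cleaned[labels == k] = 255
        return cleaned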
There are situations where blobs do not have correspondences in the other frame of a frame pair due to the position of each camera (a homography does not perfectly explain the non-planar scene). We handle this case by applying a voting scheme instead of computing a scene-wide homography. A coarse transformation matrix M(B_k^(F_i)) is computed for each blob, and each matrix votes for the overall scene transformation. M(B_k^(F_i)) is computed by extracting the contour and general shape of each blob B_k^(F_i) in frame F_i. From these shapes, we compute the best match for each blob based on the point matching strategy described in Section 3.2. Because this is only a coarse registration used to eliminate noise at an early stage, we compute a homography transformation instead of calculating the disparity of each blob, to reduce computation costs. Based on the obtained correspondence list, if a blob does not have a correspondence in the other modality, it does not take

[Figure 3: Blob correspondence matching strategy. For a pair of cleaned videos, information from previous frames is exploited through a rank list of corresponding matches (blob ID and position). If blob IDs and spatial relations are unchanged in the current frame, the reserved rank list is reused; if they have changed, an exhaustive search rebuilds the final sorted match list.]
part in the voting scheme. Then, the final coarse transformation M_(F_i) for the current frame pair is the mean of the transformations of all voting blobs.
Finally, we use this scene transformation to verify the overlap between blobs in the frame pair. Blobs are expanded by α = 120% of their original size to decide whether they overlap with any blob in the other frame. Blobs with a corresponding overlapping blob are kept; the others are removed. We filter blobs in the visible video with the infrared video as a reference, and vice versa.
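The voting scheme can be sketched as follows: one homography is estimated per matched blob pair (with RANSAC rejecting outlier contour points), and the coarse scene transform M_(F_i) is taken as the mean of the voters. This is a simplified illustration under our own naming, not the authors' implementation.

    import cv2
    import numpy as np

    # Per-blob voting sketch: each matched blob pair contributes one
    # homography vote; the coarse scene transform is the mean of the votes.
    def coarse_scene_transform(blob_point_pairs):
        votes = []
        for pts_src, pts_dst in blob_point_pairs:    # matched contour points per blob
            if len(pts_src) < 4:
                continue                             # a homography needs >= 4 points
            M, _ = cv2.findHomography(np.float32(pts_src), np.float32(pts_dst),
                                      cv2.RANSAC, 3.0)
            if M is not None:
                votes.append(M / M[2, 2])            # normalize scale before averaging
        return np.mean(votes, axis=0) if votes else None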
3.2. Feature matching
In TIR-Vis registration, keeping track of blobs to find correspondences is one of many challenges. Indeed, corresponding features should be found only on corresponding blobs. St-Charles et al. [17] use a brute force method to find feature correspondences in each frame pair. In their method, a feature is a contour point extracted and described using the shape context descriptor [2]; χ² tests are used to calculate similarity scores and find matches. For each iteration, the Thin Plate Spline (TPS) model [7] is applied to verify the optimal transformation between blob features. We inherit the merits of this strategy to find correspondences. The key difference is that we do not exhaustively consider all possible feature matches, and we do not treat frames separately. Instead, we propose a new method for faster computation of correspondences. Our main idea is to preserve the correspondences from the previous frame pair and apply them to the new one. This gives rise to two situations: the easier case, where the same number of blobs appears in consecutive frame pairs, and the harder one, when this is not the case.
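For concreteness, the χ²-based matching step described above could look like the sketch below, assuming shape context histograms for the contour points have already been computed (descriptor extraction and the TPS verification loop are omitted); the pairing is then solved as an assignment problem.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Chi-squared matching of precomputed shape-context histograms:
    # desc_a has shape (Na, D), desc_b has shape (Nb, D).
    def chi2_point_matches(desc_a, desc_b, eps=1e-8):
        a = desc_a[:, None, :]                                   # (Na, 1, D)
        b = desc_b[None, :, :]                                   # (1, Nb, D)
        cost = 0.5 * np.sum((a - b) ** 2 / (a + b + eps), axis=2)  # chi^2 distances
        rows, cols = linear_sum_assignment(cost)                 # optimal one-to-one pairing
        return list(zip(rows, cols))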
To deal with the first case, we exploit useful information from the previous frame pair. More specifically, each blob has a unique ID and a center position. A buffer temporarily saving the correspondences in each frame pair is constructed. Consecutive frame pairs are captured within a very short time interval; based on this observation, it is clear that spatial relationships between objects are mostly preserved. We exploit this characteristic by accumulating
blobs with their IDs and positions into the buffer, based on an ordering of their positions from left to right and top to bottom. To get the correspondence for a blob, we just look up its ID in the sorted list. We also use a buffer of previous frames to verify that a blob still exists in the current frame by comparing the position associated with its ID. When two blob pairs in consecutive frames are associated, correspondences are searched only within the new blob pairs, instead of everywhere in the image. Figure 3 details our strategy for finding correspondences.
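A minimal sketch of this rank-list buffer follows, with hypothetical blob tuples of the form (id, (cx, cy)); all names are illustrative.

    # Order-preserving correspondence buffer: blobs are ranked left-to-right,
    # then top-to-bottom. While IDs and spatial order are unchanged between
    # frames, matches are read back directly from the cached rank list.
    def rank(blobs):
        return sorted(blobs, key=lambda b: (b[1][0], b[1][1]))

    def lookup_matches(blobs_tir, blobs_vis, cache):
        ids = ([b[0] for b in rank(blobs_tir)], [b[0] for b in rank(blobs_vis)])
        if cache and ids == cache["ids"]:
            return cache["matches"], cache           # order unchanged: reuse the buffer
        # Order changed: re-associate. (The paper falls back to a position-gated
        # exhaustive shape-context search here; rank-order pairing is a stand-in.)
        matches = list(zip(ids[0], ids[1]))
        return matches, {"ids": ids, "matches": matches}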
Although the number of blobs is mostly unchanged across consecutive frame pairs, there are situations in which objects are visible in one video but invisible in the other. To handle these situations, we use a brute force method to get blob correspondences based on shape context [2]. However, our brute force method differs from [17]: we treat blobs by their positions, so that for each blob being processed, we only keep blobs in the other frame of the frame pair whose position is similar to the current one. The search space is thus reduced to a position-based area. The blob order is also updated and used as a reference for later frames.
To sum up, with this new strategy we only need to update the frame buffer if there are missing or new objects in either frame of the video pair. Therefore, the correspondence search speed is significantly increased.
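The position gating can be illustrated as follows; the radius parameter is an assumption of ours, since the paper states only that candidate blobs must have a similar position.

    import numpy as np

    # Position-gated fallback search: each blob is compared only against
    # blobs in the other frame whose centroids lie within a given radius,
    # instead of against every blob in the image.
    def gated_candidates(blobs_a, blobs_b, radius):
        # blobs_*: list of (blob_id, centroid) with centroid = np.array([cx, cy])
        out = {}
        for id_a, c_a in blobs_a:
            out[id_a] = [id_b for id_b, c_b in blobs_b
                         if np.linalg.norm(c_a - c_b) <= radius]
        return out   # shape-context matching then runs only on these candidates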
3.3. Non-planar registration
Our framework for non-planar registration comprises three steps. A schematic diagram of the framework is presented in Figure 4.
[Figure 4: Non-planar scene registration — input frames pass through a per-blob disparity calculation (T) to produce the registered output.]
The general formulation of our framework is described by

D = H_1 T H_2^(-1)    (1)

where D is the registration matrix used to register objects in the non-planar scene, T is the disparity transformation for each blob in the current frame, and H_1 and H_2 are the rectification matrices that transform the raw input and output videos, respectively, into rectified ones.
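In matrix form, Eq. (1) as reconstructed above composes as below; for a rectified pair, T reduces to a pure horizontal translation by the blob's disparity d, so each blob gets its own D. This is a sketch of the composition under that reading, not the authors' code.

    import numpy as np

    # T for a rectified pair: a horizontal translation by the blob's disparity d.
    def disparity_transform(d):
        return np.array([[1.0, 0.0, float(d)],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])

    # Eq. (1): compose rectification, per-blob disparity shift, and derectification.
    def registration_matrix(H1, T, H2):
        return H1 @ T @ np.linalg.inv(H2)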
3.3.1 Frame rectification strategy
First, we address two challenges with video rectification. As can be seen from (1), to get the correct transformation for each object in the video pair, we need to estimate H_1 and H_2 correctly. The main difficulty is calculating the fundamental matrix: if the estimated fundamental matrix is far from the real one, the resulting H_1 and H_2 matrices are incorrect, which leads to wrong disparity calculations. Some existing techniques such as [8, 11, 14] are useful for the image rectification problem, but not for videos. Given this observation, we propose a new technique for robust rectification using spatial and temporal frame information.
The first part of this technique treats each frame as a single image. The fundamental matrix is calculated using the correspondence buffer. To clarify, since our segmentation frames are free of spurious blobs, the features of each blob in the Vis and TIR frames are accumulated into corresponding feature lists, and the fundamental matrix is calculated from these lists. Because a noisy fundamental matrix FM_cur is obtained when using only a single frame, we maintain a global fundamental matrix FM_g as an optimal value by using temporal information. Equation (2) describes the relationship between the current fundamental matrix and the global one:

FM_g = β FM_g + (1 − β) FM_cur    (2)

where β is an adaptation factor. We used a fixed value for the whole dataset in our experiments.
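A direct sketch of the Eq. (2) update follows. Since fundamental matrices are defined only up to scale (and sign), both operands are normalized and sign-aligned before blending; β's value is not given in this excerpt, so it is left as a parameter.

    import numpy as np

    # Temporal smoothing of the fundamental matrix, Eq. (2).
    def update_global_fm(fm_global, fm_current, beta):
        fm_current = fm_current / np.linalg.norm(fm_current)
        if fm_global is None:
            return fm_current                        # first frame: adopt FM_cur directly
        fm_global = fm_global / np.linalg.norm(fm_global)
        if np.sum(fm_global * fm_current) < 0:       # F and -F are equivalent:
            fm_current = -fm_current                 # align signs before blending
        blended = beta * fm_global + (1.0 - beta) * fm_current
        return blended / np.linalg.norm(blended)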
The second part of our technique is an adaptive decision for updating FM_g. Since not all fundamental matrices are good enough to take part in the updating scheme, we apply a coarse registration to validate the quality of the new matrix. Specifically, from FM_cur, we calculate the H_1 and H_2 values. In the disparity calculation step, described in Section 3.3.2, we approximate disparity values by using average blob disparities; this approximation reduces running time and lets us estimate the fundamental matrix without redundant calculations. After the disparities for the whole scene are estimated, we use them for a coarse registration as in Section 3.1. The error threshold φ_cur for computing the registration is then used to decide whether FM_cur qualifies for the update.
Algorithm 2 Video rectification strategy
Input: Pair of segmented videos
Output: Values H_1^best and H_2^best
 1: procedure RECTIFY
 2:   E_min = ∞, FM_best = Ø, H_1^best = Ø, H_2^best = Ø
 3:   for F_cur ∈ F_1, F_2, ..., F_n do
 4:     H_1, H_2 ← FM_cur
 5:     Estimate coarse registration T_cur
 6:     E_cur = register_τ(H_1, T_cur, H_2)
 7:     if E_cur < E_min then
 8:       FM_g = β FM_g + (1 − β) FM_cur
 9:       E_min = E_cur, FM_best = FM_g
10:       H_1^best, H_2^best ← FM_best
11:     end if
12:   end for
13: end procedure
If the current registration error E_cur is lower than the mean of the registration errors over the τ = 30 most recent frames (φ_cur), FM_cur is kept and used for the update; otherwise, we discard it. Algorithm 2 describes our technique for rectifying videos.
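One iteration of Algorithm 2 could be sketched with OpenCV's uncalibrated rectification as below. Here registration_error stands in for the paper's register_τ criterion, update_global_fm is the Eq. (2) helper sketched earlier, and β = 0.9 is an assumed default, since the paper states only that a fixed factor was used.

    import cv2
    import numpy as np

    # One step of Algorithm 2: estimate FM_cur, derive rectifying homographies,
    # and keep the best pair according to the coarse registration error.
    def rectify_step(pts_tir, pts_vis, img_size, state, registration_error, beta=0.9):
        pts1, pts2 = np.float32(pts_tir), np.float32(pts_vis)
        if len(pts1) < 8:
            return state                                 # not enough correspondences
        fm_cur, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
        if fm_cur is None:
            return state
        ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, fm_cur, img_size)
        if not ok:
            return state
        err = registration_error(H1, H2)                 # placeholder for register_tau
        if err < state.get("e_min", np.inf):
            state["fm_g"] = update_global_fm(state.get("fm_g"), fm_cur, beta)
            state["e_min"] = err
            _, state["H1_best"], state["H2_best"] = cv2.stereoRectifyUncalibrated(
                pts1, pts2, state["fm_g"], img_size)
        return state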
3.3.2 Disparity calculation
Finding disparities is one of the most important parts of our framework. At this stage, the two videos are rectified, so we only need to search for the disparity of each object in one dimension in each frame.
As mentioned in Section 3.1, each object is represented by its foreground blob after the segmentation step. Thus, calculating a disparity is equivalent to calculating the translation between two blobs. There are two steps. First, to reduce unnecessary computation, we roughly estimate the translation between two corresponding blobs by subtracting their centroids. After that, the disparity search range is set to 150% of the blob size to find a correct match. For instance, suppose that we have a blob whose position is α and whose rough disparity is η; the actual range for finding the disparity is [α + η − θγ, α + η + θγ], where γ is the blob's width and θ is equal to 0.5. This approach allows the search for an optimal match to be completed more quickly.
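The two-step disparity search can be sketched as follows; mask-overlap maximization stands in for the paper's matching criterion, and the wrap-around of np.roll would be replaced by padding in a real implementation.

    import numpy as np

    # 1-D disparity search on rectified masks: start from the centroid offset
    # eta, then scan [eta - theta*gamma, eta + theta*gamma] (theta = 0.5,
    # gamma = blob width) for the horizontal shift with maximum overlap.
    def blob_disparity(mask_a, mask_b, cx_a, cx_b, width, theta=0.5):
        eta = int(round(cx_b - cx_a))            # rough disparity from centroids
        best_d, best_overlap = eta, -1
        for d in range(eta - int(theta * width), eta + int(theta * width) + 1):
            shifted = np.roll(mask_a.astype(bool), d, axis=1)
            overlap = np.count_nonzero(shifted & mask_b.astype(bool))
            if overlap > best_overlap:
                best_d, best_overlap = d, overlap
        return best_d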
However, there is still one problem to address: the registration evaluation criterion. We propose a new formula to estimate registration quality. The work of Bilodeau et al. [4] already proposed a criterion for planar scene registration, which we adapt to individual blob registration instead of the whole scene. Specifically, let B_(i,k)^(1) and B_(i,k)^(2) be the i-th blob taken from the k-th frame of the first video and second video, respectively; the registration error

References (partial; numbering follows the in-text citations where it could be established)

[2] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002. (The shape context descriptor used for contour-point matching.)
[7] J. Duchon. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, Springer, 1977. (The thin-plate spline interpolation model.)
[9] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981. (RANSAC, used here for outlier filtering.)
F. P. M. Oliveira and J. M. R. S. Tavares. Medical image registration: A review. Computer Methods in Biomechanics and Biomedical Engineering, 2014. (Cited among [8, 11, 14].)
J. P. W. Pluim, J. B. A. Maintz, and M. A. Viergever. Image registration by maximization of combined mutual information and gradient information. IEEE Transactions on Medical Imaging, 2000.