Detecting Flying Objects using a Single Moving
Camera
Artem Rozantsev, Vincent Lepetit, and Pascal Fua, Fellow, IEEE,
Abstract—We propose an approach for detecting flying objects such as Unmanned Aerial Vehicles (UAVs) and aircrafts when they
occupy a small portion of the field of view, possibly moving against complex backgrounds, and are filmed by a camera that itself moves.
We argue that solving such a difficult problem requires combining both appearance and motion cues. To this end we propose a
regression-based approach for object-centric motion stabilization of image patches that allows us to achieve effective classification on
spatio-temporal image cubes and outperform state-of-the-art techniques.
As this problem has not yet been extensively studied, no test datasets are publicly available. We therefore built our own, both for UAVs
and aircrafts, and will make them publicly available so they can be used to benchmark future flying object detection and collision
avoidance algorithms.
Index Terms—Motion compensation, object detection.
1 INTRODUCTION
We are headed for a world in which the skies are occupied not
only by birds and planes but also by unmanned drones ranging
from relatively large Unmanned Aerial Vehicles (UAVs) to much
smaller consumer ones. Some of these will be instrumented and
able to communicate with each other to avoid collisions but not
all. Therefore, the ability to use inexpensive and light sensors
such as cameras for collision-avoidance purposes will become
increasingly important.
This problem has been tackled successfully in the automotive world; for example, there are now commercial products [1], [2]
designed to sense and avoid both pedestrians and other cars.
In the world of flying machines much progress has been made
towards accurate position estimation and navigation from single
or multiple cameras [3], [4], [5], [6], [7], [8], [9], but less in the
field of visual-guided collision avoidance [10]. In particular, it is
not possible to simply extend the algorithms used for pedestrian
and automobile detection to the world of aircrafts and drones, as
flying object detection poses some unique challenges:
• The environment is fully three-dimensional, which makes the motions more complex (e.g., objects may move in any direction in 3D space and may appear in any part of the frame).
• Flying objects have very diverse shapes and can be seen against either the ground or the sky, which produces complex and changing backgrounds.
• Given the speeds involved, potentially dangerous objects must be detected while they are still far away, which means they may be very small in the images.
A. Rozantsev is with the Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
E-mail: artem.rozantsev@epfl.ch
V. Lepetit is with the Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria.
P. Fua is with the Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
Manuscript received September 19, 2015; revised April 22, 2016.
Figure 1: Detecting a small flying object against a complex moving
background. (Left) It is almost invisible to the human eye and hard
to detect from a single image. (Right) Yet, our algorithm can find
it by using appearance and motion cues.
Fig. 1 illustrates some examples, where even for humans it is hard
to find a flying object based just on a single image. By contrast,
when looking at the sequence of frames, these objects suddenly
pop up and are easily spotted, which suggests that motion cues are
crucial for detection.
However, these motion cues are difficult to exploit when the
images are acquired by a moving camera and feature backgrounds
that are challenging to stabilize because they are non-planar and
rapidly changing. Furthermore, since there may be other moving
objects in the scene, such as a person in the top row of Fig. 1,
motion by itself is not enough and appearance must also be taken
into account.
In this paper, we detect whether an object of interest is present
and constitutes danger by classifying 3D descriptors computed
from spatio-temporal image cubes. We will refer to them as

st-cubes.

Figure 2: Motion compensation for four different st-cubes of flying objects (UAVs and Aircrafts) seen against different backgrounds (uniform, very noisy, non-uniform, and noisy; columns (a)-(d)). (Top) For each one, we show four consecutive patches before motion stabilization. In the leftmost plot below the patches, the blue dots denote the location of the true center of the drone and the red cross is the patch center over time. The other two plots depict the x and y deviations of the drone center with respect to the patch center. (Middle) The same four st-cubes and corresponding graphs after motion compensation using an optical flow approach, as suggested by [11]. (Bottom) The same four st-cubes and corresponding graphs after motion compensation using our approach.

They are formed by stacking motion-stabilized image
windows over several consecutive frames, which give more infor-
mation than using a single image. What makes this approach both
practical and effective is a regression-based motion-stabilization
algorithm. Unlike those relying on optical flow, it remains effective
even when the shape of the object to be detected is blurry or
barely visible, as illustrated by Fig. 2. This is because learning-based motion compensation focuses on the object and is more robust to complicated backgrounds than the optical flow method, as Fig. 2 also shows.
St-cubes have been routinely used for action recognition pur-
poses [12], [13], [14] using a monocular camera. By contrast, most
current detection algorithms work either on a single frame, or by
estimating the optical flow from consecutive frames. Our approach
can therefore be seen as a way to combine both the appearance
and motion information to achieve effective detection in a very
challenging context. In our experiments, we show that this method achieves higher accuracy than either appearance-based or motion-based methods alone.
We first proposed using st-cubes for flying object detection
in an earlier conference paper [15]. In this initial version of our
processing pipeline, we performed motion compensation using
boosted trees. In this paper we refine this idea by using deep
learning techniques that yield better stabilization and, thus, better
overall performance.
2 RELATED WORK
Approaches for detecting moving objects can be classified into
three main categories: those that rely on appearance in individual
frames, those that rely primarily on motion information across
frames, and those that combine the two. We briefly review all three
types in this section. In the results section, we will demonstrate
that we can outperform state-of-the-art representatives of each
class.
Appearance-based methods rely on Machine Learning and
have proved to be powerful even in the presence of complex
lighting variations or cluttered background. They are typically
based on Deformable Part Models (DPM) [16], Convolutional
Neural Networks (CNN) [17], or Random Forests [18]. Among
them, the Aggregate Channel Features (ACF) algorithm [19] is considered one of the best.
These approaches work best when the target objects are
sufficiently large and clearly visible in individual images, which is
often not the case in our applications. For example, in the images
of Fig. 1, the object is small and it is almost impossible to make
out from the background without motion cues.
Motion-based approaches can themselves be subdivided into
two subclasses. The first comprises those that rely on background
subtraction [20], [21], [22], [23] and determine objects as groups
of pixels that are different from the background. The second
includes those that depend on optical flow [24], [25], [26].
Background subtraction works best when the camera is static
or its motion is small enough to be easily compensated for, which
is not the case for the on-board camera of a fast moving aircraft.
Flow-based methods are more reliable in such situations but
still critically dependent on the quality of the flow vectors, which
tends to be low when the target objects are small and blurry. Some
methods combine both optical flow and background subtraction
algorithms [27], [28]. However, in our case there may be motion

in different parts of the images, caused for example by people or tree tops. Thus, motion information alone is not enough for reliable flying object detection. Other methods that combine optical flow and background subtraction, such as [29], [30], [31], [32], still critically depend on optical flow, which is often estimated with [26] and may thus suffer from the low quality of the flow vectors. In addition to its dependence on optical flow, [31] assumes that the camera motion is translational, which is violated in aerial videos.

Figure 3: Object detection pipeline with st-cubes and motion compensation. Provided a set of video frames from the camera, we use a multi-scale sliding window approach to extract st-cubes. We then process every patch of the st-cube to compensate for the motion of the aircraft and then run the detector. (best seen in color)
Hybrid approaches combine information about object ap-
pearance and motion patterns and are therefore the closest in spirit
to what we propose. For example, in [33], histograms of flow
vectors are used as features in conjunction with more standard
appearance features and are fed to a statistical learning method.
This approach was refined in [11] by first aligning the patches
to compensate for motion and then using the differences of the
frames, which may or may not be consecutive, as additional
features. The alignment relies on the Lucas-Kanade optical flow
algorithm [25]. The resulting algorithm works very well for pedes-
trian detection and outperforms most of the single-frame methods.
However, when the target objects become smaller and harder to
see, the flow estimates become unreliable and this approach, like
the purely flow-based ones, becomes less effective.
3 DETECTION FRAMEWORK
Our detection pipeline is illustrated by Fig. 3 and comprises the
following steps:
• Divide the video sequence into N-frame overlapping temporal slices. The larger the overlap, the higher the precision, but only up to a point: our experiments show that increasing the overlap beyond 50% increases computation time without improving performance, so we use a 50% overlap.
• Build st-cubes from each slice using a sliding-window approach, independently at each scale (see the sketch after this list).
• Apply our motion compensation algorithm to the patches of each st-cube to create stabilized st-cubes.
• Classify each st-cube as containing an object of interest or not.
• Since each scale has been processed independently, perform non-maximum suppression in scale space: if there are several detections for the same spatial location at different scales, we only retain the highest-scoring one. As an alternative to this simple scheme, we have developed a more sophisticated learning-based one, which we discuss in more detail in Section 6.4.
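To make the slicing and sliding-window step above concrete, here is a minimal Python sketch, assuming grayscale frames stored as a (T, H, W) numpy array; the slice length, patch size, and the helper names classify and motion_compensate are illustrative assumptions, not values or functions from the paper.

```python
import numpy as np

N_FRAMES = 4     # s_t: frames per temporal slice (assumed value)
PATCH = 40       # s_x = s_y: spatial patch size in pixels (assumed value)
OVERLAP = 0.5    # 50% temporal overlap, as used in the paper

def temporal_slices(frames):
    """Yield overlapping N_FRAMES-long slices of a (T, H, W) video array."""
    step = max(1, int(N_FRAMES * (1.0 - OVERLAP)))
    for t0 in range(0, len(frames) - N_FRAMES + 1, step):
        yield frames[t0:t0 + N_FRAMES]

def st_cubes(slice_, stride):
    """Slide a PATCH x PATCH window over one temporal slice (single scale)."""
    _, H, W = slice_.shape
    for y in range(0, H - PATCH + 1, stride):
        for x in range(0, W - PATCH + 1, stride):
            yield (x, y), slice_[:, y:y + PATCH, x:x + PATCH]

# Hypothetical downstream calls (classify, motion_compensate are placeholders):
# for sl in temporal_slices(frames):
#     for (x, y), cube in st_cubes(sl, stride=PATCH):   # non-overlapping, see Sec. 4.3
#         score = classify(motion_compensate(cube))
```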
In this section, we introduce two separate approaches—one
based on boosted trees, the other one on Convolutional Neural
Networks—to deciding whether or not an st-cube contains a target
object and will compare their respective performance in Section 5.
We will discuss motion compensation in Section 4.
More specifically, we want to train a classifier that takes as
input st-cubes such as those depicted by Fig. 4 and returns 1 or
-1, depending on the presence or absence of a flying object. Let
$(s_x, s_y, s_t)$ be the size of our st-cubes. For training purposes, we use a dataset of pairs $(b_i, y_i)$, $i \in [1, N]$, where $b_i \in \mathbb{R}^{s_x \times s_y \times s_t}$ is an st-cube, in other words $s_t$ image patches of resolution $s_x \times s_y$ pixels. Label $y_i \in \{-1, 1\}$ indicates whether or not a target object is present.
3.1 3D HoG with Gradient Boost
The first approach we tested relies on boosted trees [34] to learn
a classifier $\psi(\cdot)$ of the form $\psi(b) = \sum_{j=1}^{H} \alpha_j h_j(b)$, where the $\alpha_{j=1..H}$ are real-valued weights, $b \in \mathbb{R}^{s_x \times s_y \times s_t}$ is the input st-cube, the $h_j : \mathbb{R}^{s_x \times s_y \times s_t} \rightarrow \mathbb{R}$ are weak learners, and $H$ is the number of selected weak learners, which controls the complexity of the classifier. The $\alpha$s and $h$s are learned in a greedy manner, using the Gradient Boost algorithm [34], which can be seen as an extension of the classic AdaBoost to real-valued weak learners and more general loss functions.
In standard Gradient Boost fashion, we take our weak learners to be regression trees $h_j(b) = T(\theta_j, \mathrm{HoG3D}(b))$, where $\theta_j$ denotes the tree parameters and $\mathrm{HoG3D}(b)$ the 3-dimensional Histograms of Gradients (HoG3D) computed for $b$. HoG3D was introduced in [14] and can be seen as an extension of the standard HoG [35] with an additional temporal dimension. It is fast to compute, has proved robust to illumination changes in many applications, and allows us to combine appearance and motion efficiently.
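For illustration only, the following is a crude Python stand-in for such a spatio-temporal gradient descriptor: it bins the spatial gradient orientation per cell, weighted by magnitude, and appends the temporal gradient energy. This is a simplification of the polyhedral orientation binning of the actual HoG3D of [14]; the cell layout and bin count are assumptions.

```python
import numpy as np

def hog3d_like(cube, cells=(2, 4, 4), n_bins=8):
    """cube: (s_t, s_y, s_x) array. Returns an L2-normalized 1D feature vector."""
    gt, gy, gx = np.gradient(cube.astype(np.float32))   # temporal, vertical, horizontal gradients
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.arctan2(gy, gx) % (2 * np.pi)               # spatial orientation in [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)

    ct, cy, cx = cells
    st, sy, sx = cube.shape
    feats = []
    for it in range(ct):
        for iy in range(cy):
            for ix in range(cx):
                sl = (slice(it * st // ct, (it + 1) * st // ct),
                      slice(iy * sy // cy, (iy + 1) * sy // cy),
                      slice(ix * sx // cx, (ix + 1) * sx // cx))
                hist = np.bincount(bins[sl].ravel(),
                                   weights=mag[sl].ravel(),
                                   minlength=n_bins)      # orientation histogram of the cell
                feats.append(np.append(hist, np.abs(gt[sl]).sum()))  # + temporal energy
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-8)
```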
At each iteration $j$, the weak learner $h_j(\cdot)$ with the corresponding weight $\alpha_j$ is taken as the one that minimizes the exponential loss function:
$$(h_j(\cdot), \alpha_j) = \operatorname{argmin}_{h(\cdot),\,\alpha} \sum_{i=1}^{N} e^{-y_i \left( \psi_{j-1}(b_i) + \alpha h(b_i) \right)} . \quad (1)$$
The tests in the nodes of the trees compare one coordinate of
the HoG3D vector with a threshold, both selected during the
optimization.
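As a hedged sketch of this training step, one could substitute an off-the-shelf gradient boosting implementation with the exponential loss for the authors' own boosted trees; hog3d_like is the illustrative descriptor above, and the hyper-parameters are assumptions, not values from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_stcube_classifier(cubes, labels, n_weak=200, depth=2):
    """cubes: list of (s_t, s_y, s_x) arrays; labels: +1 / -1 per cube."""
    X = np.stack([hog3d_like(c) for c in cubes])
    y = np.asarray(labels)
    clf = GradientBoostingClassifier(loss="exponential",   # AdaBoost-like loss, cf. Eq. (1)
                                     n_estimators=n_weak,  # plays the role of H
                                     max_depth=depth)      # complexity of each regression tree
    clf.fit(X, y)
    return clf

# Scoring a new st-cube (sign of the decision value gives the predicted label):
# score = clf.decision_function(hog3d_like(cube)[None])
```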
3.2 Convolutional Neural Networks
Since Convolutional Neural Networks (CNNs) [36] have proved very successful in many detection problems, we have tested one as an alternative classification method. We use the architecture
depicted by Fig. 5, which alternates convolutional layers and pooling layers.

Figure 4: Sample patches of the UAVs and aircrafts. Each row corresponds to a single st-cube and illustrates different possible motions that an aircraft could have.

Convolutional layers use 3D linear filters while
pooling layers apply max-pooling in 2D spatial regions only. The
last layer is fully connected and outputs the probability that the
input st-cube contains an object of interest. We use the hyperbolic
tangent function as the non-linear operator [37].
The input to our CNN is a normalized st-cube
$$\eta = \frac{b - \mu(b)}{\sigma(b)} , \quad (2)$$
where $\mu(b)$ and $\sigma(b)$ are the mean and standard deviation of the pixel intensities in $b$, respectively. Normalization is an important step, because the optimization of the network parameters fails to converge when using raw image intensities.
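The normalization of Eq. (2) amounts to a one-liner; the sketch below assumes the st-cube is a numpy array and adds a small epsilon to guard against flat patches, which is our addition rather than a detail from the paper.

```python
import numpy as np

def normalize_cube(cube, eps=1e-8):
    """Per-cube normalization of Eq. (2)."""
    cube = cube.astype(np.float32)
    return (cube - cube.mean()) / (cube.std() + eps)
```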
During training, we write the probability that an st-cube η
contains an object of interest (y = 1) or is a part of the background
(y = 0) as
$$P(Y = y \mid \eta) = \frac{e^{\mathrm{CNN}(\eta)[y]}}{e^{\mathrm{CNN}(\eta)[0]} + e^{\mathrm{CNN}(\eta)[1]}} , \quad y \in \{0, 1\} , \quad (3)$$
where $\mathrm{CNN}(\eta)[y]$ denotes the classification score that the network predicts for $\eta$ as being part of class $y$ and $e^{(\cdot)}$ denotes the exponential function. We then minimize the negative log-likelihood
$$L(W, \mathrm{bias}) = -\sum_{k=1}^{N} \log P(Y = y_k \mid \eta_k) \quad (4)$$
with respect to the CNN parameters. Here $(\eta_k, y_k)$ are pairs of normalized st-cubes and their corresponding labels from the training dataset, as defined in Section 3. To this end, we use the algorithm of [38] combined with Dropout [39] to improve generalization.
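For clarity, Eqs. (3)-(4) can be written out directly; the sketch below assumes the raw two-class network outputs are stored in an (N, 2) array and is not the authors' training code.

```python
import numpy as np

def nll_loss(scores, labels):
    """scores: (N, 2) raw outputs CNN(eta)[0..1]; labels: (N,) integers in {0, 1}."""
    scores = scores - scores.max(axis=1, keepdims=True)                      # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))   # log of Eq. (3)
    return -log_probs[np.arange(len(labels)), labels].sum()                  # Eq. (4), summed over the set
```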
We tried many different network configurations, in terms of
the number of filters per layer and the size of the filters. However,
they all yield similar performance, which suggests that only
minor improvements could be obtained by further tweaking the
network. We also tried varying the dimensions of the st-cube.
These variations have a more significant influence on performance,
which will be evaluated in Section 5.
4 MOTION COMPENSATION
Figure 5: The structure of the Convolutional Neural Network which we used for flying object detection. CL, PL, and FL correspond to Convolution, Pooling, and Fully-connected layers, respectively.

Figure 6: Structure of the CNNs used for motion compensation. (Top) The first network (coarse alignment) uses extended patches to correct for the large displacements of the aircraft. (Bottom) The second network (refinement) is applied after rectification by the motion predicted by the first network, and is designed to correct for the small motions.

Neither of the two approaches to classifying st-cubes introduced in the previous section accounts for the fact that both the gradient orientations used to build the 3D HoG and the filter responses in the CNN case are biased by the global object motion. This makes the learning task much more difficult and we propose
to use motion compensation to eliminate this problem. Motion
compensation will allow us to accumulate visual evidence from
multiple frames, without adding variation due to the object motion.
We therefore aim at centering the target object, so that when
present in an st-cube, it remains at the center of all its image
patches.
More specifically, let $I_t$ denote the $t$-th frame of the video sequence and $(i, j)$ some pixel position in it. The st-cube $b_{i,j,t}$ is the 3D array of pixel intensities from images $I_z$ with $z \in [t - s_t + 1, t]$ at image locations $(k, l)$ with $k \in [i - s_x + 1, i]$ and $l \in [j - s_y + 1, j]$, as depicted by Fig. 4. Correcting for motion can be formulated as allowing the patches $m_{i,j,z}$, $z \in [t - s_t + 1, t]$, of the st-cube to shift horizontally and vertically in individual images.
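In array terms, this indexing can be transcribed as follows, assuming frames is a (T, H, W) array, that i indexes columns and j rows, and that i, j, t are 1-based as in the text.

```python
def stcube_at(frames, i, j, t, sx, sy, st):
    """Return b_{i,j,t} from a (T, H, W) frame stack (1-based i, j, t as in the text)."""
    return frames[t - st:t,      # z in [t - s_t + 1, t]
                  j - sy:j,      # rows l in [j - s_y + 1, j]
                  i - sx:i]      # columns k in [i - s_x + 1, i]
```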
In [11], these shifts are computed using optical flow infor-
mation, which has been shown to be effective for pedestrians
occupying a large fraction of the patch and moving relatively
slowly from one frame to the next. However, as can be seen in
Fig. 4, these assumptions do not hold in our case and we will
show in Section 6 that this negatively impacts performance. To
overcome this difficulty, we introduce instead a learning-based
approach to compensate for motion and keep the object in the
center of the $m_{i,j,z}$
patches of the st-cube even when the target
object’s appearance changes drastically.
More specifically, we treat the motion compensation problem as a
regression task: given a single image patch, we want to predict the
2D translation that best centers the target object. By rectifying all
the image patches in an st-cube with their predicted translation,
we can then align the images of the object of interest together.

Figure 7: Combining multiple detections in several images of a
video sequence. The red square and dots depict the positions of
the original detection across the 50 frames preceding two different
images. The green square and dots illustrate the position of the
same detections after refinement. They are superposed and form
much smoother trajectories. (best seen in color)
Figure 8: Sample image patches from our (a) UAV and (b) Aircraft datasets.
4.1 Boosted tree-based regressors
One way to predict the translation for an input patch $m$ is to train two different boosted-tree regressors [40], $\varphi_x(m)$ and $\varphi_y(m)$, one for each 2D direction (horizontal and vertical).
As for detection, we use regression trees $h_j(m) = T(\theta_j, \mathrm{HoG}(m))$ as weak learners, where $\mathrm{HoG}(m)$ denotes the Histograms of Oriented Gradients for patch $m$. The difference is that we minimize here a quadratic loss function instead of an exponential one:
$$L(r, \varphi(m)) = (r - \varphi(m))^2 , \quad (5)$$
where $m$ is the input patch, $r$ the corresponding expected 2D vector, and $\varphi(m) = [\varphi_x(m), \varphi_y(m)]^\top$ the 2D vector predicted by the two regression trees.
We then apply these regressors in an iterative way: we obtain a
first estimate of the shift of the target object—if present—from the
center of the patch. We translate it according to this estimate, and
we re-apply the regressors. We iterate until both shift estimates
drop to 0 or the algorithm reaches a preset number of iterations.
In practice, 4 to 5 iterations are enough to achieve good accuracy.
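A minimal sketch of this iterative centering loop is given below; crop and regress_xy stand for a patch-extraction helper and the pair of trained regressors, both hypothetical names, and the sub-pixel stopping threshold is an assumption.

```python
def center_patch(frame, cx, cy, size, regress_xy, max_iter=5):
    """Iteratively shift the patch until the predicted 2D translation vanishes."""
    for _ in range(max_iter):
        patch = crop(frame, cx, cy, size)    # hypothetical patch-extraction helper
        dx, dy = regress_xy(patch)           # predicted shift of the object from the center
        if abs(dx) < 0.5 and abs(dy) < 0.5:  # shift estimate has dropped to (sub-pixel) zero
            break
        cx, cy = cx + dx, cy + dy            # translate the patch accordingly and re-apply
    return cx, cy
```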
4.2 CNN-based regressors
Another possible approach is to use a Convolutional Neural
Network (CNN) to solve the regression task. CNNs are more
flexible, as features are learned directly from the training data,
in contrast to the hand-designed HoG features we need to use with
our boosted tree-based regressors.
We trained two separate CNNs whose structure is depicted
by Fig. 6. Note that there is no pooling layer after the first
convolutional one. This is because pooling layers are typically
used not only to reduce computational complexity but also to
achieve invariance to small motions. In our case, such invariance
would be counter-productive because these motions are precisely
what we are trying to estimate. Furthermore, the computational
complexity remains manageable even without the first pooling
layer. We trained the first CNN using examples involving large
2D translations (coarse-CNN) and the second smaller ones (fine-
CNN). In practice we use the latter to refine the predictions of the
former. As when using boosted-trees, we use CNN-regressors it-
eratively until convergence, as described at the end of Section 4.1.
We first correct for large displacements by applying the coarse-CNN several times, and we then apply the fine-CNN, which is trained to compensate for small shifts of the object, for a couple more iterations.
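As an illustration of the design choice of omitting the first pooling layer, here is a hedged PyTorch sketch of a patch-to-shift regression network in the spirit of Fig. 6; the filter counts, kernel sizes, and 40x40 input size are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ShiftRegressor(nn.Module):
    """Regress the 2D shift (dx, dy) of the object from the patch center."""
    def __init__(self, patch_size=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.Tanh(),   # no pooling right after this layer
            nn.Conv2d(16, 32, kernel_size=5), nn.Tanh(),
            nn.MaxPool2d(2),                              # pooling only later in the network
            nn.Conv2d(32, 64, kernel_size=3), nn.Tanh(),
        )
        with torch.no_grad():                             # infer the flattened feature size
            n = self.features(torch.zeros(1, 1, patch_size, patch_size)).numel()
        self.head = nn.Linear(n, 2)                       # predicted (dx, dy)

    def forward(self, x):                                 # x: (B, 1, patch, patch)
        return self.head(self.features(x).flatten(1))
```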
In fact, we also tried training two boosted-tree regressors, such as those discussed in Section 4.1, in this coarse-to-fine manner. Unlike in the case of the CNN regressors, this produced no significant improvement. This is likely because our boosted-tree motion compensation algorithm is based on HoG, whose histograms are computed over spatial bins of fixed size. This introduces invariance to small object displacements, which makes it hard to achieve high localization precision.
4.3 Motion Compensated st-cubes
Once the regressors have been trained, we use them to compensate
for motion and build the st-cubes that we will use as input for
classification, as depicted by Fig. 3. Fig. 2 illustrates several st-cubes of a drone from the testing dataset, before and after motion compensation using either the optical flow approach of [11] or our approach.
Note that the latter tends to keep the target object much closer to
the center, especially when the background is non-uniform and
noisy or under lighting changes.
Part of the difficulty in detecting fast moving flying objects
is that they can appear anywhere in the 3D environment and that
their apparent size can vary enormously. This makes it necessary
to scan the whole image at different scales using a sliding window
to avoid missing anything, which is computationally expensive.
Fortunately, our motion compensation scheme frees us from
the need to evaluate every image position. When there is a target
object, our algorithm automatically shifts the patch so it is in the
center. As a result, instead of having to test windows centered
at every pixel location, we only have to check non-overlapping
ones because the algorithm will automatically shift their location
to center the target object when one is present. This also makes it
unnecessary to use heuristics such as non-maximum suppression,
as all the detections that arise from a single object will be shifted to
the same position. The duplicates can therefore easily be removed,
leaving us with just a single detection per object, as illustrated by
Fig. 7.
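A hedged sketch of this detection-side consequence follows: the image is scanned with non-overlapping windows, each window is re-centered by motion compensation, and detections that snap to the same centered position are merged. All helper names are illustrative.

```python
def detect_slice(frames, size, center_fn, classify_fn, thresh=0.0):
    """Scan one temporal slice with non-overlapping windows and merge duplicates."""
    best = {}
    _, H, W = frames.shape
    for y0 in range(0, H - size + 1, size):            # non-overlapping grid
        for x0 in range(0, W - size + 1, size):
            cx, cy = center_fn(frames, x0 + size // 2, y0 + size // 2, size)
            score = classify_fn(frames, cx, cy, size)
            if score <= thresh:
                continue
            key = (round(cx), round(cy))               # re-centered duplicates collide here
            if key not in best or score > best[key][0]:
                best[key] = (score, cx, cy)
    return list(best.values())
```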
As discussed in Section 3, we process each scale indepen-
dently. We then perform non-maximum suppression in scale-space
as a final step.
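A minimal sketch of this scale-space non-maximum suppression, assuming detections are (scale, x, y, score) tuples expressed in a common coordinate frame; the grouping radius is an assumption.

```python
def nms_over_scales(detections, radius=10.0):
    """Keep only the highest-scoring detection among those at (nearly) the same location."""
    kept = []
    for d in sorted(detections, key=lambda d: d[3], reverse=True):   # best first
        _, x, y, _ = d
        if all((x - kx) ** 2 + (y - ky) ** 2 > radius ** 2 for _, kx, ky, _ in kept):
            kept.append(d)
    return kept
```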
5 DESIGNING THE OPTIMAL APPROACH
The two key components of our pipeline are motion compen-
sation and classification of the st-cubes, both of which can be
implemented using either CNNs or hand-designed features. In this
section, we test the various possible combinations and justify the
parameter choices we made for the final evaluation of our whole
approach against several baselines, as described in Section 6.
