Detecting Flying Objects using a Single Moving
Camera
Artem Rozantsev, Vincent Lepetit, and Pascal Fua, Fellow, IEEE,
Abstract—We propose an approach for detecting flying objects such as Unmanned Aerial Vehicles (UAVs) and aircrafts when they
occupy a small portion of the field of view, possibly moving against complex backgrounds, and are filmed by a camera that itself moves.
We argue that solving such a difficult problem requires combining both appearance and motion cues. To this end we propose a
regression-based approach for object-centric motion stabilization of image patches that allows us to achieve effective classification on
spatio-temporal image cubes and outperform state-of-the-art techniques.
As this problem has not yet been extensively studied, no test datasets are publicly available. We therefore built our own, both for UAVs
and aircrafts, and will make them publicly available so they can be used to benchmark future flying object detection and collision
avoidance algorithms.
Index Terms—Motion compensation, object detection.
1 INTRODUCTION
We are headed for a world in which the skies are occupied not
only by birds and planes but also by unmanned drones ranging
from relatively large Unmanned Aerial Vehicles (UAVs) to much
smaller consumer ones. Some of these will be instrumented and
able to communicate with each other to avoid collisions but not
all. Therefore, the ability to use inexpensive and light sensors
such as cameras for collision-avoidance purposes will become
increasingly important.
This problem has been tackled successfully in the automotive world; for example, there are now commercial products [1], [2]
designed to sense and avoid both pedestrians and other cars.
In the world of flying machines much progress has been made
towards accurate position estimation and navigation from single
or multiple cameras [3], [4], [5], [6], [7], [8], [9], but less in the
field of visual-guided collision avoidance [10]. In particular, it is
not possible to simply extend the algorithms used for pedestrian
and automobile detection to the world of aircrafts and drones, as
flying object detection poses some unique challenges:
• The environment is fully three-dimensional, which makes the motions more complex (e.g., objects may move in any direction in 3D space and may appear in any part of the frame).
• Flying objects have very diverse shapes and can be seen against either the ground or the sky, which produces complex and changing backgrounds.
• Given the speeds involved, potentially dangerous objects must be detected while they are still far away, which means they may be very small in the images.
A. Rozantsev is with the Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
E-mail: artem.rozantsev@epfl.ch
V. Lepetit is with the Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria.
P. Fua is with the Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
Manuscript received September 19, 2015; revised April 22, 2016.
Figure 1: Detecting a small flying object against a complex moving
background. (Left) It is almost invisible to the human eye and hard
to detect from a single image. (Right) Yet, our algorithm can find
it by using appearance and motion cues.
Fig. 1 illustrates some examples, where even for humans it is hard
to find a flying object based just on a single image. By contrast,
when looking at the sequence of frames, these objects suddenly
pop up and are easily spotted, which suggests that motion cues are
crucial for detection.
However, these motion cues are difficult to exploit when the
images are acquired by a moving camera and feature backgrounds
that are challenging to stabilize because they are non-planar and
rapidly changing. Furthermore, since there may be other moving
objects in the scene, such as a person in the top row of Fig. 1,
motion by itself is not enough and appearance must also be taken
into account.
In this paper, we detect whether an object of interest is present
and constitutes danger by classifying 3D descriptors computed
from spatio-temporal image cubes. We will refer to them as

st-cubes.

Figure 2: Motion compensation for four different st-cubes of flying objects (UAVs and Aircrafts) seen against different backgrounds (uniform, very noisy, non-uniform, and noisy; columns (a)-(d)). (Top) For each one, we show four consecutive patches before motion stabilization. In the leftmost plot below the patches, the blue dots denote the location of the true center of the drone and the red cross is the patch center over time. The other two plots depict the x and y deviations of the drone center with respect to the patch center. (Middle) The same four st-cubes and corresponding graphs after motion compensation using an optical flow approach, as suggested by [11]. (Bottom) The same four st-cubes and corresponding graphs after motion compensation using our approach.

They are formed by stacking motion-stabilized image
windows over several consecutive frames, which give more infor-
mation than using a single image. What makes this approach both
practical and effective is a regression-based motion-stabilization
algorithm. Unlike those relying on optical flow, it remains effective
even when the shape of the object to be detected is blurry or
barely visible, as illustrated by Fig. 2. This is because learning-based motion compensation focuses on the object and is more robust to complicated backgrounds than the optical flow method, as Fig. 2 also shows.
St-cubes have been routinely used for action recognition pur-
poses [12], [13], [14] using a monocular camera. By contrast, most
current detection algorithms work either on a single frame, or by
estimating the optical flow from consecutive frames. Our approach
can therefore be seen as a way to combine both the appearance
and motion information to achieve effective detection in a very
challenging context. In our experiments, we show that this method achieves higher accuracy than either appearance-based or motion-based methods alone.
We first proposed using st-cubes for flying object detection
in an earlier conference paper [15]. In this initial version of our
processing pipeline, we performed motion compensation using
boosted trees. In this paper we refine this idea by using deep
learning techniques that yield better stabilization and, thus, better
overall performance.
2 RELATED WORK
Approaches for detecting moving objects can be classified into
three main categories: those that rely on appearance in individual
frames, those that rely primarily on motion information across
frames, and those that combine the two. We briefly review all three
types in this section. In the results section, we will demonstrate
that we can outperform state-of-the-art representatives of each
class.
Appearance-based methods rely on Machine Learning and
have proved to be powerful even in the presence of complex
lighting variations or cluttered background. They are typically
based on Deformable Part Models (DPM) [16], Convolutional
Neural Networks (CNN) [17], or Random Forests [18]. Among
them, the Aggregate Channel Features (ACF) algorithm [19] is considered one of the best.
These approaches work best when the target objects are
sufficiently large and clearly visible in individual images, which is
often not the case in our applications. For example, in the images
of Fig. 1, the object is small and it is almost impossible to make
out from the background without motion cues.
Motion-based approaches can themselves be subdivided into
two subclasses. The first comprises those that rely on background
subtraction [20], [21], [22], [23] and determine objects as groups
of pixels that are different from the background. The second
includes those that depend on optical flow [24], [25], [26].
Background subtraction works best when the camera is static
or its motion is small enough to be easily compensated for, which
is not the case for the on-board camera of a fast moving aircraft.
Flow-based methods are more reliable in such situations but
still critically dependent on the quality of the flow vectors, which
tends to be low when the target objects are small and blurry. Some
methods combine both optical flow and background subtraction
algorithms [27], [28]. However, in our case there may be motion

in different parts of the images, caused for example by people or tree tops. Thus, motion information alone is not enough for reliable flying object detection. Other methods that combine optical flow and background subtraction, such as [29], [30], [31], [32], still critically depend on optical flow, which is often estimated with [26] and may thus suffer from the low quality of the flow vectors. In addition to its dependence on optical flow, [31] assumes that the camera motion is translational, which is violated in aerial videos.

Figure 3: Object detection pipeline with st-cubes and motion compensation. Provided a set of video frames from the camera, we use a multi-scale sliding window approach to extract st-cubes. We then process every patch of the st-cube to compensate for the motion of the aircraft and then run the detector. (best seen in color)
Hybrid approaches combine information about object ap-
pearance and motion patterns and are therefore the closest in spirit
to what we propose. For example, in [33], histograms of flow
vectors are used as features in conjunction with more standard
appearance features and are fed to a statistical learning method.
This approach was refined in [11] by first aligning the patches
to compensate for motion and then using the differences of the
frames, which may or may not be consecutive, as additional
features. The alignment relies on the Lucas-Kanade optical flow
algorithm [25]. The resulting algorithm works very well for pedes-
trian detection and outperforms most of the single-frame methods.
However, when the target objects become smaller and harder to
see, the flow estimates become unreliable and this approach, like
the purely flow-based ones, becomes less effective.
3 DETECTION FRAMEWORK
Our detection pipeline is illustrated by Fig. 3 and comprises the
following steps:
• Divide the video sequence into N-frame overlapping temporal slices. The larger the overlap, the higher the precision, but only up to a point: our experiments show that increasing the overlap beyond 50% increases computation time without improving performance, so we use a 50% overlap.
• Build st-cubes from each slice using a sliding-window approach, independently at each scale (see the sketch after this list).
• Apply our motion compensation algorithm to the patches of each st-cube to create stabilized st-cubes.
• Classify each st-cube as containing an object of interest or not.
• Since each scale has been processed independently, perform non-maximum suppression in scale space: if there are several detections for the same spatial location at different scales, we only retain the highest-scoring one. As an alternative to this simple scheme, we have developed a more sophisticated learning-based one, which we discuss in more detail in Section 6.4.
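To make the slicing and sliding-window step above concrete, here is a minimal Python sketch, assuming grayscale frames stored as a (T, H, W) numpy array; the slice length, patch size, and the helper names classify and motion_compensate are illustrative assumptions, not values or functions from the paper.

```python
import numpy as np

N_FRAMES = 4     # s_t: frames per temporal slice (assumed value)
PATCH = 40       # s_x = s_y: spatial patch size in pixels (assumed value)
OVERLAP = 0.5    # 50% temporal overlap, as used in the paper

def temporal_slices(frames):
    """Yield overlapping N_FRAMES-long slices of a (T, H, W) video array."""
    step = max(1, int(N_FRAMES * (1.0 - OVERLAP)))
    for t0 in range(0, len(frames) - N_FRAMES + 1, step):
        yield frames[t0:t0 + N_FRAMES]

def st_cubes(slice_, stride):
    """Slide a PATCH x PATCH window over one temporal slice (single scale)."""
    _, H, W = slice_.shape
    for y in range(0, H - PATCH + 1, stride):
        for x in range(0, W - PATCH + 1, stride):
            yield (x, y), slice_[:, y:y + PATCH, x:x + PATCH]

# Hypothetical downstream calls (classify, motion_compensate are placeholders):
# for sl in temporal_slices(frames):
#     for (x, y), cube in st_cubes(sl, stride=PATCH):   # non-overlapping, see Sec. 4.3
#         score = classify(motion_compensate(cube))
```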
In this section, we introduce two separate approaches—one
based on boosted trees, the other one on Convolutional Neural
Networks—to deciding whether or not an st-cube contains a target
object and will compare their respective performance in Section 5.
We will discuss motion compensation in Section 4.
More specifically, we want to train a classifier that takes as
input st-cubes such as those depicted by Fig. 4 and returns 1 or
-1, depending on the presence or absence of a flying object. Let
$(s_x, s_y, s_t)$ be the size of our st-cubes. For training purposes, we use a dataset of pairs $(b_i, y_i)$, $i \in [1, N]$, where $b_i \in \mathbb{R}^{s_x \times s_y \times s_t}$ is an st-cube, in other words $s_t$ image patches of resolution $s_x \times s_y$ pixels. Label $y_i \in \{-1, 1\}$ indicates whether or not a target object is present.
3.1 3D HoG with Gradient Boost
The first approach we tested relies on boosted trees [34] to learn
a classifier $\psi(\cdot)$ of the form $\psi(b) = \sum_{j=1}^{H} \alpha_j h_j(b)$, where the $\alpha_{j=1..H}$ are real-valued weights, $b \in \mathbb{R}^{s_x \times s_y \times s_t}$ is the input st-cube, the $h_j : \mathbb{R}^{s_x \times s_y \times s_t} \rightarrow \mathbb{R}$ are weak learners, and $H$ is the number of selected weak learners, which controls the complexity of the classifier. The $\alpha$s and $h$s are learned in a greedy manner, using the Gradient Boost algorithm [34], which can be seen as an extension of the classic AdaBoost to real-valued weak learners and more general loss functions.
In standard Gradient Boost fashion, we take our weak learners to be regression trees $h_j(b) = T(\theta_j, \mathrm{HoG3D}(b))$, where $\theta_j$ denotes the tree parameters and $\mathrm{HoG3D}(b)$ the 3-dimensional Histograms of Gradients (HoG3D) computed for $b$. HoG3D was introduced in [14] and can be seen as an extension of the standard HoG [35] with an additional temporal dimension. It is fast to compute, has proved robust to illumination changes in many applications, and allows us to combine appearance and motion efficiently.
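For illustration only, the following is a crude Python stand-in for such a spatio-temporal gradient descriptor: it bins the spatial gradient orientation per cell, weighted by magnitude, and appends the temporal gradient energy. This is a simplification of the polyhedral orientation binning of the actual HoG3D of [14]; the cell layout and bin count are assumptions.

```python
import numpy as np

def hog3d_like(cube, cells=(2, 4, 4), n_bins=8):
    """cube: (s_t, s_y, s_x) array. Returns an L2-normalized 1D feature vector."""
    gt, gy, gx = np.gradient(cube.astype(np.float32))   # temporal, vertical, horizontal gradients
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.arctan2(gy, gx) % (2 * np.pi)               # spatial orientation in [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)

    ct, cy, cx = cells
    st, sy, sx = cube.shape
    feats = []
    for it in range(ct):
        for iy in range(cy):
            for ix in range(cx):
                sl = (slice(it * st // ct, (it + 1) * st // ct),
                      slice(iy * sy // cy, (iy + 1) * sy // cy),
                      slice(ix * sx // cx, (ix + 1) * sx // cx))
                hist = np.bincount(bins[sl].ravel(),
                                   weights=mag[sl].ravel(),
                                   minlength=n_bins)      # orientation histogram of the cell
                feats.append(np.append(hist, np.abs(gt[sl]).sum()))  # + temporal energy
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-8)
```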
At each iteration $j$, the weak learner $h_j(\cdot)$ with the corresponding weight $\alpha_j$ is taken as the one that minimizes the exponential loss function:
$$(h_j(\cdot), \alpha_j) = \operatorname{argmin}_{h(\cdot),\,\alpha} \sum_{i=1}^{N} e^{-y_i \left( \psi_{j-1}(b_i) + \alpha h(b_i) \right)} . \quad (1)$$
The tests in the nodes of the trees compare one coordinate of
the HoG3D vector with a threshold, both selected during the
optimization.
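As a hedged sketch of this training step, one could substitute an off-the-shelf gradient boosting implementation with the exponential loss for the authors' own boosted trees; hog3d_like is the illustrative descriptor above, and the hyper-parameters are assumptions, not values from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_stcube_classifier(cubes, labels, n_weak=200, depth=2):
    """cubes: list of (s_t, s_y, s_x) arrays; labels: +1 / -1 per cube."""
    X = np.stack([hog3d_like(c) for c in cubes])
    y = np.asarray(labels)
    clf = GradientBoostingClassifier(loss="exponential",   # AdaBoost-like loss, cf. Eq. (1)
                                     n_estimators=n_weak,  # plays the role of H
                                     max_depth=depth)      # complexity of each regression tree
    clf.fit(X, y)
    return clf

# Scoring a new st-cube (sign of the decision value gives the predicted label):
# score = clf.decision_function(hog3d_like(cube)[None])
```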
3.2 Convolutional Neural Networks
Since Convolutional Neural Networks (CNNs) [36] have proved very successful in many detection problems, we have tested one as an alternative classification method. We use the architecture
depicted by Fig. 5, which alternates convolutional layers and pooling layers.

Figure 4: Sample patches of the UAVs and aircrafts. Each row corresponds to a single st-cube and illustrates different possible motions that an aircraft could have.

Convolutional layers use 3D linear filters while
pooling layers apply max-pooling in 2D spatial regions only. The
last layer is fully connected and outputs the probability that the
input st-cube contains an object of interest. We use the hyperbolic
tangent function as the non-linear operator [37].
The input to our CNN is a normalized st-cube
$$\eta = \frac{b - \mu(b)}{\sigma(b)} , \quad (2)$$
where $\mu(b)$ and $\sigma(b)$ are the mean and standard deviation of the pixel intensities in $b$, respectively. Normalization is an important step, because the optimization of the network parameters fails to converge when using raw image intensities.
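The normalization of Eq. (2) amounts to a one-liner; the sketch below assumes the st-cube is a numpy array and adds a small epsilon to guard against flat patches, which is our addition rather than a detail from the paper.

```python
import numpy as np

def normalize_cube(cube, eps=1e-8):
    """Per-cube normalization of Eq. (2)."""
    cube = cube.astype(np.float32)
    return (cube - cube.mean()) / (cube.std() + eps)
```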
During training, we write the probability that an st-cube η
contains an object of interest (y = 1) or is a part of the background
(y = 0) as
$$P(Y = y \mid \eta) = \frac{e^{\mathrm{CNN}(\eta)[y]}}{e^{\mathrm{CNN}(\eta)[0]} + e^{\mathrm{CNN}(\eta)[1]}} , \quad y \in \{0, 1\} , \quad (3)$$
where $\mathrm{CNN}(\eta)[y]$ denotes the classification score that the network predicts for $\eta$ as being part of class $y$ and $e^{(\cdot)}$ denotes the exponential function. We then minimize the negative log-likelihood
$$L(W, \mathrm{bias}) = -\sum_{k=1}^{N} \log P(Y = y_k \mid \eta_k) \quad (4)$$
with respect to the CNN parameters. Here $(\eta_k, y_k)$ are pairs of normalized st-cubes and their corresponding labels from the training dataset, as defined in Section 3. To this end, we use the algorithm of [38] combined with Dropout [39] to improve generalization.
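For clarity, Eqs. (3)-(4) can be written out directly; the sketch below assumes the raw two-class network outputs are stored in an (N, 2) array and is not the authors' training code.

```python
import numpy as np

def nll_loss(scores, labels):
    """scores: (N, 2) raw outputs CNN(eta)[0..1]; labels: (N,) integers in {0, 1}."""
    scores = scores - scores.max(axis=1, keepdims=True)                      # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))   # log of Eq. (3)
    return -log_probs[np.arange(len(labels)), labels].sum()                  # Eq. (4), summed over the set
```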
We tried many different network configurations, in terms of
the number of filters per layer and the size of the filters. However,
they all yield similar performance, which suggests that only
minor improvements could be obtained by further tweaking the
network. We also tried varying the dimensions of the st-cube.
These variations have a more significant influence on performance,
which will be evaluated in Section 5.
4 MOTION COMPENSATION
Figure 5: The structure of the Convolutional Neural Network which we used for flying object detection. CL, PL, and FL correspond to Convolution, Pooling, and Fully-connected layers, respectively.

Figure 6: Structure of the CNNs used for motion compensation. (Top) The first network (coarse alignment) uses extended patches to correct for the large displacements of the aircraft. (Bottom) The second network (refinement) is applied after rectification by the motion predicted by the first network, and is designed to correct for the small motions.

Neither of the two approaches to classifying st-cubes introduced in the previous section accounts for the fact that both the gradient orientations used to build the 3D HoG and the filter responses in the CNN case are biased by the global object motion. This makes the learning task much more difficult and we propose
to use motion compensation to eliminate this problem. Motion
compensation will allow us to accumulate visual evidence from
multiple frames, without adding variation due to the object motion.
We therefore aim at centering the target object, so that when
present in an st-cube, it remains at the center of all its image
patches.
More specifically, let $I_t$ denote the $t$-th frame of the video sequence and $(i, j)$ some pixel position in it. The st-cube $b_{i,j,t}$ is the 3D array of pixel intensities from images $I_z$ with $z \in [t - s_t + 1, t]$ at image locations $(k, l)$ with $k \in [i - s_x + 1, i]$ and $l \in [j - s_y + 1, j]$, as depicted by Fig. 4. Correcting for motion can be formulated as allowing the patches $m_{i,j,z}$, $z \in [t - s_t + 1, t]$, of the st-cube to shift horizontally and vertically in individual images.
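In array terms, this indexing can be transcribed as follows, assuming frames is a (T, H, W) array, that i indexes columns and j rows, and that i, j, t are 1-based as in the text.

```python
def stcube_at(frames, i, j, t, sx, sy, st):
    """Return b_{i,j,t} from a (T, H, W) frame stack (1-based i, j, t as in the text)."""
    return frames[t - st:t,      # z in [t - s_t + 1, t]
                  j - sy:j,      # rows l in [j - s_y + 1, j]
                  i - sx:i]      # columns k in [i - s_x + 1, i]
```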
In [11], these shifts are computed using optical flow infor-
mation, which has been shown to be effective for pedestrians
occupying a large fraction of the patch and moving relatively
slowly from one frame to the next. However, as can be seen in
Fig. 4, these assumptions do not hold in our case and we will
show in Section 6 that this negatively impacts performance. To
overcome this difficulty, we introduce instead a learning-based
approach to compensate for motion and keep the object in the
center of the $m_{i,j,z}$
patches of the st-cube even when the target
object’s appearance changes drastically.
More specifically, we treat the motion compensation problem as a
regression task: given a single image patch, we want to predict the
2D translation that best centers the target object. By rectifying all
the image patches in an st-cube with their predicted translation,
we can then align the images of the object of interest together.

Figure 7: Combining multiple detections in several images of a
video sequence. The red square and dots depict the positions of
the original detection across the 50 frames preceding two different
images. The green square and dots illustrate the position of the
same detections after refinement. They are superposed and form
much smoother trajectories. (best seen in color)
Figure 8: Sample image patches from our (a) UAV and (b) Aircraft datasets.
4.1 Boosted tree-based regressors
One way to predict the translation for an input patch $m$ is to train two different boosted-tree regressors [40], $\varphi_x(m)$ and $\varphi_y(m)$, one for each 2D direction (horizontal and vertical).
As for detection, we use regression trees $h_j(m) = T(\theta_j, \mathrm{HoG}(m))$ as weak learners, where $\mathrm{HoG}(m)$ denotes the Histograms of Oriented Gradients for patch $m$. The difference is that we minimize here a quadratic loss function instead of an exponential one:
$$L(r, \varphi(m)) = (r - \varphi(m))^2 , \quad (5)$$
where $m$ is the input patch, $r$ the corresponding expected 2D vector, and $\varphi(m) = [\varphi_x(m), \varphi_y(m)]^\top$ the 2D vector predicted by the two regression trees.
We then apply these regressors in an iterative way: we obtain a
first estimate of the shift of the target object—if present—from the
center of the patch. We translate it according to this estimate, and
we re-apply the regressors. We iterate until both shift estimates
drop to 0 or the algorithm reaches a preset number of iterations.
In practice, 4 to 5 iterations are enough to achieve good accuracy.
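A minimal sketch of this iterative centering loop is given below; crop and regress_xy stand for a patch-extraction helper and the pair of trained regressors, both hypothetical names, and the sub-pixel stopping threshold is an assumption.

```python
def center_patch(frame, cx, cy, size, regress_xy, max_iter=5):
    """Iteratively shift the patch until the predicted 2D translation vanishes."""
    for _ in range(max_iter):
        patch = crop(frame, cx, cy, size)    # hypothetical patch-extraction helper
        dx, dy = regress_xy(patch)           # predicted shift of the object from the center
        if abs(dx) < 0.5 and abs(dy) < 0.5:  # shift estimate has dropped to (sub-pixel) zero
            break
        cx, cy = cx + dx, cy + dy            # translate the patch accordingly and re-apply
    return cx, cy
```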
4.2 CNN-based regressors
Another possible approach is to use a Convolutional Neural
Network (CNN) to solve the regression task. CNNs are more
flexible, as features are learned directly from the training data,
in contrast to the hand-designed HoG features we need to use with
our boosted tree-based regressors.
We trained two separate CNNs whose structure is depicted
by Fig. 6. Note that there is no pooling layer after the first
convolutional one. This is because pooling layers are typically
used not only to reduce computational complexity but also to
achieve invariance to small motions. In our case, such invariance
would be counter-productive because these motions are precisely
what we are trying to estimate. Furthermore, the computational
complexity remains manageable even without the first pooling
layer. We trained the first CNN using examples involving large
2D translations (coarse-CNN) and the second smaller ones (fine-
CNN). In practice we use the latter to refine the predictions of the
former. As when using boosted-trees, we use CNN-regressors it-
eratively until convergence, as described at the end of Section 4.1.
We first correct for large displacements by applying the coarse-CNN several times, and we then apply the fine-CNN, which is trained to compensate for small shifts of the object, for a couple more iterations.
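As an illustration of the design choice of omitting the first pooling layer, here is a hedged PyTorch sketch of a patch-to-shift regression network in the spirit of Fig. 6; the filter counts, kernel sizes, and 40x40 input size are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ShiftRegressor(nn.Module):
    """Regress the 2D shift (dx, dy) of the object from the patch center."""
    def __init__(self, patch_size=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.Tanh(),   # no pooling right after this layer
            nn.Conv2d(16, 32, kernel_size=5), nn.Tanh(),
            nn.MaxPool2d(2),                              # pooling only later in the network
            nn.Conv2d(32, 64, kernel_size=3), nn.Tanh(),
        )
        with torch.no_grad():                             # infer the flattened feature size
            n = self.features(torch.zeros(1, 1, patch_size, patch_size)).numel()
        self.head = nn.Linear(n, 2)                       # predicted (dx, dy)

    def forward(self, x):                                 # x: (B, 1, patch, patch)
        return self.head(self.features(x).flatten(1))
```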
In fact, we also tried training two boosted-tree regressors, such as those discussed in Section 4.1, in this coarse-to-fine manner. Unlike in the case of the CNN regressors, this produced no significant improvement. This is likely because our boosted-tree motion compensation algorithm is based on HoG, whose histograms are computed over spatial bins of fixed size. This introduces invariance to small object displacements, which makes it hard to achieve high localization precision.
4.3 Motion Compensated st-cubes
Once the regressors have been trained, we use them to compensate
for motion and build the st-cubes that we will use as input for
classification, as depicted by Fig. 3. Fig. 2 illustrates several st-cubes of a drone from the testing dataset, before and after motion compensation using either the optical flow approach of [11] or our approach.
Note that the latter tends to keep the target object much closer to
the center, especially when the background is non-uniform and
noisy or under lighting changes.
Part of the difficulty in detecting fast moving flying objects
is that they can appear anywhere in the 3D environment and that
their apparent size can vary enormously. This makes it necessary
to scan the whole image at different scales using a sliding window
to avoid missing anything, which is computationally expensive.
Fortunately, our motion compensation scheme frees us from
the need to evaluate every image position. When there is a target
object, our algorithm automatically shifts the patch so it is in the
center. As a result, instead of having to test windows centered
at every pixel location, we only have to check non-overlapping
ones because the algorithm will automatically shift their location
to center the target object when one is present. This also makes it
unnecessary to use heuristics such as non-maximum suppression,
as all the detections that arise from a single object will be shifted to
the same position. The duplicates can therefore easily be removed,
leaving us with just a single detection per object, as illustrated by
Fig. 7.
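A hedged sketch of this detection-side consequence follows: the image is scanned with non-overlapping windows, each window is re-centered by motion compensation, and detections that snap to the same centered position are merged. All helper names are illustrative.

```python
def detect_slice(frames, size, center_fn, classify_fn, thresh=0.0):
    """Scan one temporal slice with non-overlapping windows and merge duplicates."""
    best = {}
    _, H, W = frames.shape
    for y0 in range(0, H - size + 1, size):            # non-overlapping grid
        for x0 in range(0, W - size + 1, size):
            cx, cy = center_fn(frames, x0 + size // 2, y0 + size // 2, size)
            score = classify_fn(frames, cx, cy, size)
            if score <= thresh:
                continue
            key = (round(cx), round(cy))               # re-centered duplicates collide here
            if key not in best or score > best[key][0]:
                best[key] = (score, cx, cy)
    return list(best.values())
```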
As discussed in Section 3, we process each scale indepen-
dently. We then perform non-maximum suppression in scale-space
as a final step.
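A minimal sketch of this scale-space non-maximum suppression, assuming detections are (scale, x, y, score) tuples expressed in a common coordinate frame; the grouping radius is an assumption.

```python
def nms_over_scales(detections, radius=10.0):
    """Keep only the highest-scoring detection among those at (nearly) the same location."""
    kept = []
    for d in sorted(detections, key=lambda d: d[3], reverse=True):   # best first
        _, x, y, _ = d
        if all((x - kx) ** 2 + (y - ky) ** 2 > radius ** 2 for _, kx, ky, _ in kept):
            kept.append(d)
    return kept
```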
5 DESIGNING THE OPTIMAL APPROACH
The two key components of our pipeline are motion compen-
sation and classification of the st-cubes, both of which can be
implemented using either CNNs or hand-designed features. In this
section, we test the various possible combinations and justify the
parameter choices we made for the final evaluation of our whole
approach against several baselines, as described in Section 6.
