TL;DR: This work develops image processing and computer vision techniques for visually tracking a tennis ball, in 3D, on a court instrumented with multiple low-cost IP cameras, and incorporates a physics-based trajectory model into the system.
Abstract: In this work, we develop image processing and computer vision techniques for visually tracking a tennis ball, in 3D, on a court instrumented with multiple low-cost IP cameras. The technique first obtains 2D ball tracking data from each camera view using 2D object tracking methods. Next, an automatic feature-based video synchronization method is applied. This technique uses the extracted 2D ball information from two or more camera views, plus camera calibration information. To find the 3D trajectory, the temporal 3D locations of the ball are estimated by triangulating corresponding 2D locations obtained from the automatically synchronized videos. Furthermore, to improve the continuity of the tracked 3D ball during times when no two cameras have overlapping views of the ball location, we incorporate a physics-based trajectory model into the system. The resultant 3D ball tracks are then visualized in a virtual 3D graphical environment. Finally, we quantify the accuracy of our system in terms of reprojection error.
In professional sports, high-end camera technology is commonly used to enhance the viewer experience above and beyond a traditional broadcast.
Enabling sports video analysis with low-cost camera networks would allow many local amateur clubs and sports institutions to make use of these types of technologies.
This 3D ball track data can be used for analysis purposes such as determining the speed of the ball over the net (a common tennis coaching requirement), classifying the types of shots played by the players, or indexing the video frames and classifying important events for coaching [1].
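As a simple illustration of the first of these applications, the speed over the net can be read directly off consecutive 3D track samples. The sketch below is a minimal Python example under assumed conditions (positions in metres, a fixed and known frame rate); it is not code from the paper.

```python
import numpy as np

def ball_speed(track_3d, fps):
    """Estimate ball speed (m/s) between consecutive 3D samples.

    track_3d: (N, 3) array of ball positions in metres.
    fps: camera frame rate in frames per second.
    """
    deltas = np.diff(track_3d, axis=0)      # displacement per frame
    dists = np.linalg.norm(deltas, axis=1)  # metres travelled per frame
    return dists * fps                      # metres per second

# Hypothetical track: three samples at 25 fps
track = np.array([[0.0, 11.0, 1.20], [1.0, 10.2, 1.25], [2.0, 9.4, 1.28]])
print(ball_speed(track, fps=25))
```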
In addition, the use of less expensive cameras leads to distortion [2] in the acquired videos, hence calibration of both camera intrinsics and extrinsics is essential.
The remainder of this paper is organized as follows: Section II outlines previous work in the area.
II. RELATED WORK
The work of [1] illustrates how a low-cost camera network could be effectively used for performance analysis if ball and player tracks are known.
The authors' work extends that described by Aksay et al. [3], where techniques for 2D ball tracking, feature-based automatic video synchronization and 3D estimation are described.
The authors utilize the above-mentioned techniques and improve the overall quality of the system by developing their own algorithm for predicting missing points in the trajectory.
The dataset from the “3DLife ACM Multimedia Grand Challenge 2010 Dataset” [4] is utilized.
This dataset includes 9 video streams of a competitive singles tennis match scenario from 9 IP cameras placed at different positions around an entire tennis court – see Figure 1.
III. ALGORITHMIC DESIGN
The authors use videos acquired from both the side-view cameras and the overhead camera in the dataset for 2D ball detection and tracking, as explained in Section III-A. Camera calibration data is acquired for each individual camera using the Matlab camera calibration toolbox [5].
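The paper performs this calibration with the Matlab toolbox [5]; purely as an illustrative stand-in, the same intrinsic calibration can be sketched with OpenCV's checkerboard routines. The board geometry and image folder below are assumptions, not details from the paper.

```python
import glob
import cv2
import numpy as np

# Checkerboard geometry: assumed 9x6 inner corners, 25 mm squares
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 0.025

obj_pts, img_pts = [], []
for path in glob.glob("calib/*.png"):  # hypothetical calibration images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# Recover the intrinsic matrix K and the lens distortion coefficients
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
```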
Next, the authors apply morphological operations to enlarge the regions of moving pixels; without this step, it would be very difficult to discriminate between different objects.
The final constraint considered when eliminating false ball candidates is based on the distance between tennis ball positions in two consecutive frames.
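A minimal sketch of such a candidate-filtering stage is shown below, combining frame differencing, dilation and the distance gate; the thresholds and blob-size limits are illustrative assumptions, not the authors' values.

```python
import cv2
import numpy as np

def ball_candidates(prev_gray, gray, last_pos, max_jump=40):
    """Return centroids of small moving blobs near the previous ball position."""
    diff = cv2.absdiff(gray, prev_gray)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    # Dilation enlarges sparse moving pixels into connected blobs
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8), iterations=2)
    n, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
    cands = []
    for i in range(1, n):                          # label 0 is background
        if 5 < stats[i, cv2.CC_STAT_AREA] < 300:   # keep ball-sized blobs
            c = centroids[i]
            # Distance gate: reject blobs far from the previous position
            if last_pos is None or np.hypot(*(c - last_pos)) < max_jump:
                cands.append(c)
    return cands
```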
As such, there is a need to synchronize these videos before the 2D ball tracks from multiple cameras can be used for 3D estimation.
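The synchronization follows the feature-based method of Aksay et al. [3]. One common way to realize the idea, shown here as an assumed stand-in rather than the published algorithm, is to search for the frame lag that maximizes the correlation between a per-frame ball feature (for example, the ball's image y-coordinate) in two views.

```python
import numpy as np

def frame_offset(sig_a, sig_b, max_lag=100):
    """Find the frame lag that best aligns two per-frame ball-feature
    signals, e.g. the ball's image y-coordinate in two camera views."""
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        a = sig_a[max(0, lag):]
        b = sig_b[max(0, -lag):]
        n = min(len(a), len(b))
        if n < 10:                                # too little overlap
            continue
        score = np.corrcoef(a[:n], b[:n])[0, 1]   # normalized correlation
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag  # positive lag: camera A started recording earlier
```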
C. Robust 3D Tracking
A disadvantage of considering only two cameras is that a continuous temporal 3D ball track cannot be obtained, since synchronized 2D data in both views is not available throughout all frames.
To overcome this drawback, the authors employed a robust 3D tracking method using 3D coordinates obtained from different camera pairs at different points in time.
The authors combine the tracking data from these multiple cameras to calculate a more stable, robust and accurate 3D ball trajectory.
The authors calculate 3D points by triangulating the 2D point in each camera, $p_{2D,i}$, with the 2D point in the overhead camera (the 9th camera), $p_{2D,9}$:

$p_{3D,i} = \mathrm{triangulate}(p_{2D,i},\ p_{2D,9})$   (5)

The 3D points calculated at each time instance correspond to one real-world 3D coordinate and ideally all of them should be identical.
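A hedged sketch of this triangulation step follows, assuming 3x4 projection matrices P_i and P_9 are available from calibration; OpenCV's linear triangulation is used here for illustration, as the paper does not name a specific solver.

```python
import cv2
import numpy as np

def triangulate(P_i, P_9, p2d_i, p2d_9):
    """Triangulate one 3D point from camera i and the overhead camera 9.

    P_i, P_9: 3x4 projection matrices; p2d_i, p2d_9: (x, y) pixel pairs.
    """
    X = cv2.triangulatePoints(
        P_i, P_9,
        np.asarray(p2d_i, np.float64).reshape(2, 1),
        np.asarray(p2d_9, np.float64).reshape(2, 1))
    return (X[:3] / X[3]).ravel()  # homogeneous -> Euclidean coordinates
```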
D. Physics Based Trajectory Modeling
Temporal prediction of the ball coordinates through times when no 3D ball information is available is necessary to increase the continuity of the tracked features.
Projectiles are particles which are projected under gravity through air, such as objects thrown by hand or shells fired from a gun.
To simplify the problem, a few assumptions have been made. Parameters such as air resistance and ball spin, which would require modifications to the model, have been neglected.
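Under these assumptions the ball follows a drag-free ballistic parabola between observations. The sketch below, whose interface is hypothetical, fills a tracking gap by solving the projectile model for the initial velocity that carries the ball between the two samples bracketing the gap.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity, z-up world frame (m/s^2)

def fill_gap(p0, t0, p1, t1, t_query):
    """Predict positions inside a tracking gap with the projectile model
    p(t) = p0 + v0*(t - t0) + 0.5*g*(t - t0)^2 (no drag, no spin)."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    dt = t1 - t0
    # Initial velocity that carries the ball from p0 to p1 under gravity
    v0 = (p1 - p0 - 0.5 * G * dt**2) / dt
    t = np.asarray(t_query, float)[:, None] - t0
    return p0 + v0 * t + 0.5 * G * t**2  # (len(t_query), 3) positions
```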
The authors have developed a GUI using OpenGL [7], one of the most widely used and supported 2D and 3D graphics application programming interfaces (APIs).
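The authors' viewer is OpenGL-based; purely for illustration, the minimal stand-in below renders a 3D ball track with matplotlib instead (the input file name is hypothetical).

```python
import numpy as np
import matplotlib.pyplot as plt

track = np.load("ball_track_3d.npy")  # hypothetical (N, 3) array in metres
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot(track[:, 0], track[:, 1], track[:, 2], "g.-", label="ball track")
ax.set_xlabel("x (m)")
ax.set_ylabel("y (m)")
ax.set_zlabel("z (m)")
ax.legend()
plt.show()
```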
IV. EXPERIMENTAL RESULTS
To evaluate their approach, the authors quantify the accuracy of their system in terms of reprojection error, defined as the distance, under the L1 norm, between the actual 2D pixel coordinates and the reprojected pixel coordinates.
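A short sketch of this metric, assuming a 3x4 projection matrix P and per-point correspondences between the triangulated 3D points and the observed pixels:

```python
import numpy as np

def reprojection_error_l1(P, X3d, x2d):
    """Mean L1 reprojection error in pixels.

    P: 3x4 camera projection matrix.
    X3d: (N, 3) triangulated points; x2d: (N, 2) observed pixels.
    """
    Xh = np.hstack([X3d, np.ones((len(X3d), 1))])  # homogeneous coords
    proj = (P @ Xh.T).T
    proj = proj[:, :2] / proj[:, 2:3]              # perspective divide
    return np.abs(proj - x2d).sum(axis=1).mean()   # L1 per point, averaged
```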
As the number of cameras considered for analysis increases, the number of tracked points also increases, but at the cost of a higher reprojection error.
Time is on the horizontal axis; times at which the ball is tracked are highlighted with a horizontal line, and times when the ball track is lost are represented by a gap.
The bottom line represents continuity when trajectory modeling is included.
It can be observed that some of the gaps are filled after incorporating prediction into the system.
TL;DR: An exhaustive, categorized survey of published research on ball tracking is presented, together with discussion of the work to date, the authors' views and opinions, and a modified block diagram of the tracking process.
Abstract: Increase in the number of sport lovers in games like football, cricket, etc. has created a need for digging, analyzing and presenting more and more multidimensional information to them. Different classes of people require different kinds of information and this expands the space and scale of the required information. Tracking of ball movement is of utmost importance for extracting any information from the ball based sports video sequences. Based on the literature survey, we have initially proposed a block diagram depicting different steps and flow of a general tracking process. The paper further follows the same flow throughout. Detection is the first step of tracking. Dynamic and unpredictable nature of ball appearance, movement and continuously changing background make the detection and tracking processes challenging. Due to these challenges, many researchers have been attracted to this problem and have produced good results under specific conditions. However, generalization of the published work and algorithms to different sports is a distant dream. This paper is an effort to present an exhaustive survey of all the published research works on ball tracking in a categorical manner. The work also reviews the used techniques, their performance, advantages, limitations and their suitability for a particular sport. Finally, we present discussions on the published work so far and our views and opinions followed by a modified block diagram of the tracking process. The paper concludes with the final observations and suggestions on scope of future work.
TL;DR: In this article, a semi-supervised generative adversarial network (GAN) was proposed to predict shot location and type in tennis, based on models of players' episodic and semantic memory components.
Abstract: This paper presents a novel framework for predicting shot location and type in tennis. Inspired by recent neuroscience discoveries, we incorporate neural memory modules to model the episodic and semantic memory components of a tennis player. We propose a Semi-Supervised Generative Adversarial Network architecture that couples these memory models with the automatic feature learning power of deep neural networks, and demonstrate methodologies for learning player level behavioral patterns with the proposed framework. We evaluate the effectiveness of the proposed model on tennis tracking data from the 2012 Australian Tennis Open and exhibit applications of the proposed method in discovering how players adapt their style depending on the match context.
TL;DR: It is shown theoretically and empirically that a simple motion trajectory analysis suffices to translate from pixel measurements to the person's metric height, reaching a MAE of up to 3.9 cm on jumping motions, and that this works without camera and ground plane calibration.
Abstract: Estimating the metric height of a person from monocular imagery without additional assumptions is ill-posed. Existing solutions either require manual calibration of ground plane and camera geometry, special cameras, or reference objects of known size. We focus on motion cues and exploit gravity on earth as an omnipresent reference 'object' to translate acceleration, and subsequently height, measured in image-pixels to values in meters. We require videos of motion as input, where gravity is the only external force. This limitation is different to those of existing solutions that recover a person's height and, therefore, our method opens up new application fields. We show theoretically and empirically that a simple motion trajectory analysis suffices to translate from pixel measurements to the person's metric height, reaching a MAE of up to 3.9 cm on jumping motions, and that this works without camera and ground plane calibration.
7 citations
Cites background or methods from "3D Estimation and Visualization of ..."
TL;DR: In this article, the authors investigated how past observations from a stereo system can be used to recreate trajectories when video from only one of the cameras is available; the best method was found to be a nearest-neighbors search optimized by a Kalman filter.
Abstract: Tracking a moving object and reconstructing its trajectory can be done with a stereo camera system, since the two cameras enable depth vision. However, such a system would not work if one of the cameras fails to detect the object. If that happens, it would be beneficial if the system could still use the functioning camera to make an approximate trajectory reconstruction. In this study, I have investigated how past observations from a stereo system can be used to recreate trajectories when video from only one of the cameras is available. Several approaches have been implemented and tested, with varying results. The best method was found to be a nearest-neighbors search optimized by a Kalman filter. On a test set with 10000 golf shots, the algorithm was able to create estimations which on average differed around 3.5 meters from the correct trajectory, with better results for trajectories originating close to the camera.
5 citations
Cites background or methods from "3D Estimation and Visualization of ..."
TL;DR: The experimental results show that the proposed method can estimate a 3D baseball trajectory precisely using a multiple unsynchronized camera system and is robust to variations in capture delay, both in the simulation space and in real-world situations.
Abstract: We developed a method for the precise estimation of the 3D trajectory of a baseball by modeling the movement of the baseball and estimating the capture delay, using multiple unsynchronized cameras. To develop the proposed algorithm, we mimicked the real-world process of capturing a baseball in simulation space, and analyzed the capture process using a multiple unsynchronized camera system. We represented the movement of the baseball using a piece-wise spline model, and predicted the position of the baseball in the subframes in a manner which is robust to position error and change in direction of movement of the baseball. This method accurately predicts the baseball position over time by modeling the movement of the baseball in a real baseball game environment, and improves the accuracy of the reconstructed 3D baseball trajectories. We defined an objective function to estimate the capture delay, and estimate the optimal capture delay parameter using non-linear optimization method. In addition, we evaluated the performance of the proposed method in simulation space and in a real-world situation. The experimental results show that the proposed method can estimate a 3D baseball trajectory precisely using a multiple unsynchronized camera system and is robust to variations in capture delay, both in the simulation space and in real-world situations.
3 citations
Cites background from "3D Estimation and Visualization of ..."
TL;DR: A new multi-view depth-estimation technique is proposed, employing a one-dimensional optimization strategy that reduces the noise level in the estimated depth images and enforces consistent depth images across the views, and is suitable for execution on a standard Graphics Processor Unit (GPU).
Abstract: Three-dimensional (3D) video and imaging technologies are an emerging trend in the development of digital video systems, as we presently witness the appearance of 3D displays, coding systems, and 3D camera setups. Three-dimensional multi-view video is typically obtained from a set of synchronized cameras, which capture the same scene from different viewpoints. This technique especially enables applications such as free-viewpoint video or 3D-TV. Free-viewpoint video applications provide the ability to interactively select and render a virtual viewpoint of the scene. A 3D experience such as, for example, in 3D-TV is obtained if the data representation and display enable the viewer to distinguish the relief of the scene, i.e., the depth within the scene. With 3D-TV, the depth of the scene can be perceived using a multi-view display that simultaneously renders several views of the same scene. To render these multiple views on a remote display, an efficient transmission, and thus compression, of the multi-view video is necessary. However, a major problem when dealing with multi-view video is the intrinsically large amount of data to be compressed, decompressed and rendered. We aim at an efficient and flexible multi-view video system, and explore three different aspects. First, we develop an algorithm for acquiring a depth signal from a multi-view setup. Second, we present efficient 3D rendering algorithms for a multi-view signal. Third, we propose coding techniques for 3D multi-view signals, based on the use of an explicit depth signal. The thesis is accordingly divided into three parts. The first part (Chapter 3) addresses the problem of 3D multi-view video acquisition. Multi-view video acquisition refers to the task of estimating and recording a 3D geometric description of the scene. A 3D description of the scene can be represented by a so-called depth image, which can be estimated by triangulation of the corresponding pixels in the multiple views. Initially, we focus on the problem of depth estimation using two views, and present the basic geometric model that enables the triangulation of corresponding pixels across the views. Next, we review two calculation/optimization strategies for determining corresponding pixels: a local and a one-dimensional optimization strategy. Second, to generalize from the two-view case, we introduce a simple geometric model for estimating the depth using multiple views simultaneously. Based on this geometric model, we propose a new multi-view depth-estimation technique, employing a one-dimensional optimization strategy that (1) reduces the noise level in the estimated depth images and (2) enforces consistent depth images across the views. The second part (Chapter 4) details the problem of multi-view image rendering. Multi-view image rendering refers to the process of generating synthetic images using multiple views. Two different rendering techniques are initially explored: a 3D image warping and a mesh-based rendering technique. Each of these methods has its limitations and suffers from either high computational complexity or low image rendering quality. As a consequence, we present two image-based rendering algorithms that improve the balance on the aforementioned issues. First, we derive an alternative formulation of the relief texture algorithm, which was extended to the geometry of multiple views.
The proposed technique features two advantages: it avoids rendering artifacts ("holes") in the synthetic image and it is suitable for execution on a standard Graphics Processor Unit (GPU). Second, we propose an inverse mapping rendering technique that allows a simple and accurate re-sampling of synthetic pixels. Experimental comparisons with 3D image warping show an improvement of rendering quality of 3.8 dB for the relief texture mapping and 3.0 dB for the inverse mapping rendering technique. The third part concentrates on the compression problem of multi-view texture and depth video (Chapters 5–7). In Chapter 5, we extend the standard H.264/MPEG-4 AVC video compression algorithm for handling the compression of multi-view video. As opposed to the Multi-view Video Coding (MVC) standard that encodes only the multi-view texture data, the proposed encoder performs the compression of both the texture and the depth multi-view sequences. The proposed extension is based on exploiting the correlation between the multiple camera views. To this end, two different approaches for predictive coding of views have been investigated: a block-based disparity-compensated prediction technique and a View Synthesis Prediction (VSP) scheme. Whereas VSP relies on an accurate depth image, the block-based disparity-compensated prediction scheme can be performed without any geometry information. Our encoder adaptively selects the most appropriate prediction scheme using a rate-distortion criterion for an optimal prediction-mode selection. We present experimental results for several texture and depth multi-view sequences, yielding a quality improvement of up to 0.6 dB for the texture and 3.2 dB for the depth, when compared to solely performing H.264/MPEG-4 AVC disparity-compensated prediction. Additionally, we discuss the trade-off between random access to a user-selected view and the coding efficiency. Experimental results illustrating and quantifying this trade-off are provided. In Chapter 6, we focus on the compression of a depth signal. We present a novel depth image coding algorithm which concentrates on the special characteristics of depth images: smooth regions delineated by sharp edges. The algorithm models these smooth regions using parameterized piecewise-linear functions and sharp edges by a straight line, so that it is more efficient than a conventional transform-based encoder. To optimize the quality of the coding system for a given bit rate, a special global rate-distortion optimization balances the rate against the accuracy of the signal representation. For typical bit rates, i.e., between 0.01 and 0.25 bit/pixel, experiments have revealed that the coder outperforms a standard JPEG-2000 encoder by 0.6-3.0 dB. Preliminary results were published in the Proceedings of the 26th Symposium on Information Theory in the Benelux. In Chapter 7, we propose a novel joint depth-texture bit-allocation algorithm for the joint compression of texture and depth images. The described algorithm combines the depth and texture Rate-Distortion (R-D) curves to obtain a single R-D surface that allows the optimization of the joint bit-allocation in relation to the obtained rendering quality. Experimental results show an estimated gain of 1 dB compared to a compression performed without joint bit-allocation optimization. Besides this, our joint R-D model can be readily integrated into a multi-view H.264/MPEG-4 AVC coder because it yields the optimal compression setting with a limited computation effort.
TL;DR: TennisSense, a technology platform for the digital capture, analysis and retrieval of tennis training and matches, is introduced and the algorithms for extracting useful metadata from the overhead court camera are described and evaluated.
Abstract: In this paper, we introduce TennisSense, a technology platform for the digital capture, analysis and retrieval of tennis training and matches. Our algorithms for extracting useful metadata from the overhead court camera are described and evaluated. We track the tennis ball using motion images for ball candidate detection and then link ball candidates into locally linear tracks. From these tracks we can infer when serves and rallies take place. Using background subtraction and hysteresis-type blob tracking, we track the tennis players' positions. The performance of both modules is evaluated using ground-truthed data. The extracted metadata provides valuable information for indexing and efficient browsing of hours of multi-camera tennis footage and we briefly illustrate how this data is used by our tennis-coach playback interface.
38 citations
"3D Estimation and Visualization of ..." refers background in this paper
TL;DR: A novel system for tennis performance analysis that allows coaches to review games and provide detailed audio-visual feedback to tennis athletes and can be generalised to other sports and allow a range of non-professional sports clubs to provide high-quality feedback to their athletes.
Abstract: We describe a novel system for tennis performance analysis that allows coaches to review games and provide detailed audio-visual feedback to tennis athletes. The basis for our system is a network of low-cost IP cameras surrounding the tennis court. Our system exploits the output of several visual analysis modules, including the tracking of players and the tennis ball, and the extraction of player silhouettes for 3D reconstruction. A range of intuitive tools within the interface allow tennis coaches to add 2D and 3D annotations to live video, view play from multiple perspectives, record audio commentary and compute game statistics in real-time. The result is a video file that can be used to provide personalised feedback to the players or for use as a teaching resource for others. While we focus on tennis in this work, we believe our system can be generalised to other sports and allow a range of non-professional sports clubs to provide high-quality feedback to their athletes.
9 citations
"3D Estimation and Visualization of ..." refers background or methods in this paper
Q1. What have the authors contributed in "3d estimation and visualization of motion in a multicamera network for sports" ?
In this work, the authors develop image processing and computer vision techniques for visually tracking a tennis ball, in 3D, on a court instrumented with multiple low-cost IP cameras. Furthermore, the authors also incorporate a physics-based trajectory model into the system to improve the continuity of the tracked 3D ball during times when no two cameras have overlapping views of the ball location. Finally, the authors quantify the accuracy of their system in terms of reprojection error.
Q2. What are the future works in "3d estimation and visualization of motion in a multicamera network for sports" ?
In future work, a more accurate model of the ball trajectory could be developed to ensure the continuity of the tracked features, by accounting for real-world effects such as ball spin and air resistance.