Author

Mikko J. Roininen

Bio: Mikko J. Roininen is an academic researcher from Tampere University of Technology. The author has contributed to research on topics including Modality (human–computer interaction) and Support vector machine. The author has an h-index of 3 and has co-authored 6 publications receiving 44 citations.

Papers
Journal Article
TL;DR: This work extracts domain knowledge about sport events recorded by multiple users by classifying the sport type as soccer, American football, basketball, tennis, ice hockey, or volleyball, using a multi-user, multimodal approach.
Abstract: The recent proliferation of mobile video content has emphasized the need for applications such as automatic organization and automatic editing of videos. These applications could greatly benefit from domain knowledge about the content. However, extracting semantic information from mobile videos is a challenging task due to their unconstrained nature. We extract domain knowledge about sport events recorded by multiple users by classifying the sport type as soccer, American football, basketball, tennis, ice hockey, or volleyball. We adopt a multi-user, multimodal approach in which each user simultaneously captures audio-visual content and auxiliary sensor data (from magnetometers and accelerometers). First, each modality is analyzed separately; then the analysis results are fused to obtain the sport type. The auxiliary sensor data is used for extracting more discriminative spatio-temporal visual features and efficient camera motion features. The contribution of each modality to the fusion process is adapted according to the quality of the input data. We performed extensive experiments on data collected at public sport events, showing the merits of using different combinations of modalities and fusion methods. The results indicate that analyzing multimodal and multi-user data, coupled with adaptive fusion, improves classification accuracies in most tested cases, up to 95.45%.

31 citations
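As a rough illustration of the quality-adaptive fusion idea described above, the sketch below weights per-modality class scores by an estimated input-quality value before summing them. The scores, quality values, and six-way label layout are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

SPORTS = ["soccer", "american_football", "basketball",
          "tennis", "ice_hockey", "volleyball"]

def adaptive_fusion(modality_scores, modality_quality):
    """Fuse per-modality class scores, weighting each modality by an
    estimate of its input quality (hypothetical quality measures)."""
    weights = np.asarray(modality_quality, dtype=float)
    weights /= weights.sum()                    # normalize to sum to 1
    fused = np.zeros(len(SPORTS))
    for w, scores in zip(weights, modality_scores):
        fused += w * np.asarray(scores)         # quality-weighted sum
    return SPORTS[int(np.argmax(fused))]

# Illustrative per-modality class scores (visual, audio, motion sensors)
visual = [0.6, 0.1, 0.1, 0.1, 0.05, 0.05]
audio  = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]
motion = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]

# Hypothetical quality estimates, e.g. blur level, SNR, sensor noise
print(adaptive_fusion([visual, audio, motion], [0.9, 0.4, 0.7]))
```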

Proceedings Article
15 Jul 2013
TL;DR: A robust multimodal approach for classifying the sport genre in videos recorded by mobile phone users at a sport event, building models of visual appearance, camera motion, and audio scene, which are used for classifying the data from each modality.
Abstract: We present a robust multimodal approach for classifying the sport genre in videos recorded by mobile phone users at a sport event. In addition to traditional audio-visual content analysis tools, we propose to analyze auxiliary sensor data (electronic compass data and accelerometer data) captured simultaneously with the video recording. By means of machine learning techniques, we build models of visual appearance, camera motion (from auxiliary sensor data) and audio scene, which are used for classifying the data from each modality. The sport genre is obtained by fusing the information provided by the models. We propose to use the quality of each modality as an indication of its reliability. Extensive experiments were performed on real test data collected at public sport events. We provide comparisons on the use of different modality sets and fusion methods. Finally, we show how the proposed methods achieve robust classification even in the considered unconstrained scenarios.

10 citations
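The camera-motion modality above is derived from electronic compass and accelerometer data rather than from pixels. Below is a minimal sketch of what compass-based motion features could look like, assuming headings sampled at a fixed rate; the specific statistics are guesses, not the paper's feature set.

```python
import numpy as np

def pan_features(headings_deg, fs_hz):
    """Simple camera-pan descriptor from compass headings (degrees).

    Returns mean absolute angular speed, its standard deviation, and
    the fraction of time the camera pans faster than a threshold.
    Illustrative only; the paper's features may differ.
    """
    h = np.unwrap(np.radians(headings_deg))  # handle 359 -> 0 wrap-around
    speed = np.abs(np.diff(h)) * fs_hz       # rad/s between samples
    return np.array([speed.mean(),
                     speed.std(),
                     (speed > np.radians(10)).mean()])  # > 10 deg/s

# Toy recording: camera slowly pans right, sampled at 5 Hz
headings = [350, 352, 355, 0, 4, 9, 15, 20]
print(pan_features(headings, fs_hz=5.0))
```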

Journal Article
01 Jan 2012
TL;DR: Methods are developed that analyze contextual information of multiple user-generated videos in order to obtain semantic information about public happenings recorded in these videos, including a method that automatically identifies the optimal set of cameras to be used in a multicamera video production.
Abstract: User-generated video content has grown tremendously fast, to the point of outpacing professional content creation. In this work we develop methods that analyze contextual information of multiple user-generated videos in order to obtain semantic information about public happenings (e.g., sport and live music events) being recorded in these videos. One of the key contributions of this work is the joint utilization of different data modalities, including those captured by auxiliary sensors during the video recording performed by each user. In particular, we analyze GPS data, magnetometer data, accelerometer data, and video and audio content data. We use these data modalities to infer information about the event being recorded, in terms of layout (e.g., stadium), genre, indoor versus outdoor scene, and the main area of interest of the event. Furthermore, we propose a method that automatically identifies the optimal set of cameras to be used in a multicamera video production. Finally, we detect the camera users who fall within the field of view of other cameras recording at the same public happening. We show that the proposed multimodal analysis methods perform well on various recordings obtained at real sport events and live music performances.

8 citations
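The last step mentioned above, detecting which camera users fall within the field of view of other cameras, can be approximated with GPS positions and compass headings alone. The planar-geometry sketch below makes simplifying assumptions (a flat local frame, a fixed horizontal field of view) and is not the paper's exact formulation.

```python
import math

def in_field_of_view(cam_pos, cam_heading_deg, other_pos, fov_deg=60.0):
    """Return True if `other_pos` lies within the camera's horizontal FOV.

    Positions are (x, y) in metres in a local planar frame (e.g. after
    projecting GPS coordinates); heading 0 = +y axis, clockwise positive.
    A simplified planar model, not the paper's exact method.
    """
    dx = other_pos[0] - cam_pos[0]
    dy = other_pos[1] - cam_pos[1]
    bearing = math.degrees(math.atan2(dx, dy))            # angle to target
    diff = (bearing - cam_heading_deg + 180) % 360 - 180  # signed difference
    return abs(diff) <= fov_deg / 2

# Camera at origin pointing north; other user 20 m away, slightly east
print(in_field_of_view((0, 0), 0.0, (5, 20)))   # True
print(in_field_of_view((0, 0), 0.0, (-30, 5)))  # False
```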

Journal Article
TL;DR: An approach for modeling the shot cut timing of professionally edited concert videos; results show that users prefer the cut timing from the proposed system over the baseline by a clear margin, whereas a much smaller difference is observed in the preference for hand-made videos over the proposed method.
Abstract: An increasing amount of video content is being recorded by people at public events. However, editing such videos can be challenging for the average user. We describe an approach for modeling the shot cut timing of professionally edited concert videos. We analyze the temporal positions of cuts in relation to the music meter grid and form Markov chain models from the discovered switching patterns and their occurrence frequencies. The stochastic Markov chain models are combined with audio change point analysis and cut deviation models to automatically generate temporal editing cues for unedited concert video recordings. Videos edited according to the model are compared in a user study against a baseline automatic editing method as well as against videos edited by hand. The study results show that users prefer the cut timing from the proposed system over the baseline by a clear margin, whereas a much smaller difference is observed in the preference for hand-made videos over the proposed method.

2 citations
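A toy version of the Markov chain modeling described above: cut intervals, measured in beats of the music meter grid, are treated as states, transition probabilities are estimated from edited videos, and new cut times are sampled by walking the chain. The training data and state design are invented for illustration; the paper's audio change point and cut deviation models are omitted.

```python
import random
from collections import defaultdict

def train_markov(cut_intervals):
    """First-order Markov chain over cut intervals measured in beats;
    a toy stand-in for the paper's switching-pattern models."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(cut_intervals, cut_intervals[1:]):
        counts[a][b] += 1
    return {s: {t: c / sum(nxt.values()) for t, c in nxt.items()}
            for s, nxt in counts.items()}

def sample_cut_times(model, start_interval, n_cuts, rng=random.Random(0)):
    """Generate cut positions (in beats) by walking the chain."""
    t, interval, cuts = 0.0, start_interval, []
    for _ in range(n_cuts):
        t += interval
        cuts.append(t)
        nxt = model.get(interval)
        if not nxt:          # unseen state: keep the current interval
            continue
        interval = rng.choices(list(nxt), weights=nxt.values())[0]
    return cuts

# Intervals (beats between cuts) observed in edited concert videos (toy data)
training = [4, 4, 2, 2, 4, 8, 4, 4, 2, 4]
model = train_markov(training)
print(sample_cut_times(model, 4, 6))
```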

Proceedings Article
14 Jul 2014
TL;DR: The core of the system is formed by a text processing module combined with a module performing PCA-assisted perceptron regression with random sub-space selection (P²R²S²), which uses OverFeat features as a starting point and transforms them into more descriptive features via unsupervised training.
Abstract: This paper presents our system designed for the MSR-Bing Image Retrieval Challenge @ ICME 2014. The core of our system is formed by a text processing module combined with a module performing PCA-assisted perceptron regression with random sub-space selection (P²R²S²). P²R²S² uses OverFeat features as a starting point and transforms them into more descriptive features via unsupervised training. The relevance score for each query-image pair is obtained by comparing the transformed features of the query image and the relevant training images. We also use a face bank, duplicate image detection, and optical character recognition to boost our evaluation accuracy. Our system achieves 0.5099 in terms of DCG@25 on the development set and 0.5116 on the test set.

1 citation
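A very loose sketch of the PCA-plus-random-subspace idea behind P²R²S², with plain cosine similarity standing in for the perceptron-regression stage and random vectors standing in for OverFeat descriptors; the dimensions and subspace counts are guessed values.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-ins for OverFeat descriptors of training images and a query
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(200, 256))
query_feat = rng.normal(size=256)

# Step 1: PCA compresses the raw features (64 dims is a guessed value)
pca = PCA(n_components=64).fit(train_feats)
train_z = pca.transform(train_feats)
query_z = pca.transform(query_feat[None, :])[0]

def subspace_score(q, X, n_subspaces=4, dim=32):
    """Average, over random subsets of PCA dimensions, of the best cosine
    similarity between the query and any training image (toy scoring;
    the actual system uses perceptron regression here)."""
    scores = []
    for _ in range(n_subspaces):
        idx = rng.choice(len(q), size=dim, replace=False)
        sims = X[:, idx] @ q[idx] / (
            np.linalg.norm(X[:, idx], axis=1) * np.linalg.norm(q[idx]) + 1e-9)
        scores.append(sims.max())   # best match among training images
    return float(np.mean(scores))

print(subspace_score(query_z, train_z))
```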


Cited by
Journal Article
TL;DR: This paper focuses on the video content analysis techniques applied in sportscasts over the past decade from the perspectives of fundamentals and general review, a content hierarchical model, and trends and challenges.
Abstract: Sports data analysis is becoming increasingly large scale, diversified, and shared, but difficulty persists in rapidly accessing the most crucial information. Previous surveys have focused on the methodologies of sports video analysis from the spatiotemporal viewpoint instead of a content-based viewpoint, and few of these studies have considered semantics. This paper develops a deeper interpretation of content-aware sports video analysis by examining the insight offered by research into the structure of content under different scenarios. On the basis of this insight, we provide an overview of the themes particularly relevant to the research on content-aware systems for broadcast sports. Specifically, we focus on the video content analysis techniques applied in sportscasts over the past decade from the perspectives of fundamentals and general review, a content hierarchical model, and trends and challenges. Content-aware analysis methods are discussed with respect to object-, event-, and context-oriented groups. In each group, the gap between sensation and content excitement must be bridged using proper strategies. In this regard, a content-aware approach is required to determine user demands. Finally, this paper summarizes the future trends and challenges for sports video analysis. We believe that our findings can advance the field of research on content-aware video analysis for broadcast sports.

179 citations

Journal Article
27 Jul 2014
TL;DR: An approach that takes multiple videos captured by social cameras (cameras that are carried or worn by members of the group involved in an activity) and produces a coherent "cut" video of the activity is presented.
Abstract: We present an approach that takes multiple videos captured by social cameras (cameras that are carried or worn by members of the group involved in an activity) and produces a coherent "cut" video of the activity. Footage from social cameras contains an intimate, personalized view that reflects the part of an event that was of importance to the camera operator (or wearer). We leverage the insight that social cameras share the focus of attention of the people carrying them. We use this insight to determine where the important "content" in a scene is taking place, and use it in conjunction with cinematographic guidelines to select which cameras to cut to and to determine the timing of those cuts. A trellis graph representation is used to optimize an objective function that maximizes coverage of the important content in the scene while respecting cinematographic guidelines such as the 180-degree rule and avoiding jump cuts. We demonstrate cuts of the videos in various styles and lengths for a number of scenarios, including sports games, street performances, family activities, and social get-togethers. We evaluate our results through an in-depth analysis of the cuts in the resulting videos and through comparison with videos produced by a professional editor and existing commercial solutions.

144 citations
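The trellis-graph optimization described above can be read as a Viterbi-style dynamic program: each time step has one node per camera, node scores reward coverage of the important content, and edge penalties discourage switching. The sketch below collapses the paper's cinematographic guidelines into a single toy switching cost.

```python
import numpy as np

def select_cameras(coverage, switch_cost=0.5):
    """Viterbi-style camera selection over a trellis.

    coverage[t, c] scores how well camera c covers the important
    content at time t; switch_cost is a toy stand-in for the paper's
    cinematographic penalties (jump cuts, 180-degree rule, etc.).
    Returns the camera index chosen for each time step.
    """
    T, C = coverage.shape
    best = coverage[0].copy()            # best score ending at each camera
    back = np.zeros((T, C), dtype=int)   # backpointers
    for t in range(1, T):
        # trans[p, c]: score of arriving at camera c from camera p
        trans = best[:, None] - switch_cost * (1 - np.eye(C))
        back[t] = trans.argmax(axis=0)
        best = coverage[t] + trans.max(axis=0)
    path = [int(best.argmax())]          # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy coverage scores for 6 time steps and 3 social cameras
cov = np.array([[0.9, 0.2, 0.1], [0.8, 0.3, 0.1], [0.2, 0.9, 0.1],
                [0.1, 0.8, 0.2], [0.1, 0.2, 0.9], [0.2, 0.1, 0.8]])
print(select_cameras(cov))               # -> [0, 0, 1, 1, 2, 2]
```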

Journal Article
23 Sep 2016 - Sensors
TL;DR: The main characteristic of this review is to present the largest quantity of relevant examples of sensor fusion and smart sensors, focusing on their utilization and proposals, without deeply addressing one specific system or technique to the detriment of the others.
Abstract: The following work presents an overview of smart sensors and sensor fusion targeted at biomedical applications and sports areas. In this work, the integration of these areas is demonstrated, promoting a reflection on techniques and applications for collecting, quantifying, and qualifying physical variables associated with the human body. These techniques are presented in various biomedical and sports applications, which cover areas related to diagnostics, rehabilitation, physical monitoring, and the development of performance in athletes, among others. Although some applications are described in only one of the two fields of study (biomedicine and sports), it is very likely that the same application fits both, with small peculiarities or adaptations. To illustrate the contemporaneity of applications, an analysis of specialized papers published in the last six years has been made. In this context, the main characteristic of this review is to present the largest quantity of relevant examples of sensor fusion and smart sensors, focusing on their utilization and proposals, without deeply addressing one specific system or technique to the detriment of the others.

110 citations

Journal Article
TL;DR: A comprehensive review of the literature related to the use of wearable inertial sensors for performance analysis in various games is presented to provide a holistic and systematic categorisation and analysis of wearable sensors in sports.
Abstract: Wearable inertial sensors have revolutionised the way kinematics analysis is performed in sports. This paper aims to present a comprehensive review of the literature related to the use of wearable inertial sensors for performance analysis in various games. Kinematics analysis using wearable sensors can provide real-time feedback to players about the techniques adopted in their respective sports and thus help them to perform efficiently. This article reviews the key technologies (IMU sensors, communication technology, data fusion and data analysis techniques) that enable the implementation of wearable sensors for performance analysis in sports. The review focuses on research papers, commercial sports sensors, and 3D motion tracking products to provide a holistic and systematic categorisation and analysis of wearable sensors in sports. The review identifies the importance of sensor classification, applications, and performance parameters in sports for structured analysis. The survey also reviews the technology concerning sensor architecture, network and communication protocols, and covers various data fusion algorithms and their accuracy, while shedding light on essential performance metrics for an athlete. This review paper will assist both end-users and researchers in gaining a comprehensive glimpse of the wearable technology pertaining to designing sensors and solutions for athletes in different sports.

95 citations

Journal Article
TL;DR: A new high-performance algorithm based on spatio-temporal motion information is proposed to detect global abnormal events in a video stream, as well as local abnormal events.
Abstract: Abnormal event detection is one of the most important objectives in security surveillance for public scenes. In this paper, a new high-performance algorithm based on spatio-temporal motion information is proposed to detect global abnormal events in a video stream, as well as local abnormal events. We first propose a feature descriptor that represents movement by using a covariance matrix to encode the optical flow and the corresponding partial derivatives of multiple consecutive frames, or of patches of the frames. The covariance matrix of multiple regions of interest (RoIs), consisting of frames or patches, can represent the movement with high accuracy. In public surveillance video, normal samples are abundant while abnormal samples are few, so a one-class classification method is inherently suited to this problem. A nonlinear one-class support vector machine, based on a proposed kernel for Lie group elements, is applied to detect abnormal events by training on the normal samples alone. The computational complexity and time performance of the proposed method are analyzed. The PETS, UMN, and UCSD benchmark datasets are employed to verify the advantages of the proposed method for both global and local abnormal event detection. The method outperforms state-of-the-art algorithms and can be adopted to detect abnormal events in surveillance video.

46 citations
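A simplified sketch of the pipeline described above: a covariance matrix over per-pixel flow features serves as the region descriptor, and a one-class SVM trained only on normal samples flags outliers. The matrix logarithm plus an RBF kernel stands in for the paper's Lie-group kernel, the per-pixel feature set is reduced, and the flow fields are synthetic.

```python
import numpy as np
from scipy.linalg import logm
from sklearn.svm import OneClassSVM

def cov_descriptor(flow):
    """Covariance descriptor of a flow field.

    flow: (H, W, 2) optical flow; the per-pixel feature vector here is
    (u, v, |flow|), a reduced version of the paper's feature set. The
    matrix logarithm maps the covariance (a point on a curved manifold)
    to a flat space where standard kernels apply, a common
    simplification of the paper's Lie-group kernel.
    """
    u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
    feats = np.stack([u, v, np.hypot(u, v)], axis=1)
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(3)  # keep SPD
    return logm(cov)[np.triu_indices(3)].real  # upper triangle as vector

rng = np.random.default_rng(0)
# "Normal" motion: small coherent flow; train on normal samples only
normal = [cov_descriptor(rng.normal(0.2, 0.1, (32, 32, 2)))
          for _ in range(50)]
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(normal)

# An abnormal clip with large erratic motion should score as an outlier
abnormal = cov_descriptor(rng.normal(0.0, 3.0, (32, 32, 2)))
print(clf.predict([abnormal]))   # -1 flags an abnormal event
```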