Author

Marc Ritter

Bio: Marc Ritter is an academic researcher from Hochschule Mittweida. The author has contributed to research in topics such as Computer science & TRECVID. The author has an h-index of 7 and has co-authored 63 publications receiving 318 citations. Previous affiliations of Marc Ritter include Chemnitz University of Technology.


Papers
14 Nov 2016
TL;DR: TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking George Awad, Jonathan Fiscus, David Joy, Martial Michel, Alan Smeaton, Wessel Kraaij, Maria Eskevich, Robin Aly, Roeland Ordelman, Marc Ritter, et al.

116 citations

01 Jan 2017
TL;DR: A method for large-scale bird sound classification in the context of the LifeCLEF 2017 bird identification task is summarized, using a variety of convolutional neural networks to generate features from visual representations of field recordings.
Abstract: Identifying bird species in audio recordings is a challenging field of research. In this paper, we summarize a method for large-scale bird sound classification in the context of the LifeCLEF 2017 bird identification task. We used a variety of convolutional neural networks to generate features extracted from visual representations of field recordings. The BirdCLEF 2017 training dataset consists of 36,496 audio recordings containing 1,500 different bird species. Our approach achieved a mean average precision of 0.605 (official score) and 0.687 considering only foreground species.
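
The pipeline summarized above, converting field recordings into spectrogram-style images and feeding them to convolutional networks, can be sketched roughly as follows. This is a minimal illustration assuming librosa and PyTorch; the mel resolution, sample rate, and toy network are placeholders and do not reproduce the submission's actual architectures or preprocessing.

# Minimal sketch: turn a field recording into a log-mel spectrogram "image"
# and score it with a small CNN. All parameters here are illustrative.
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_to_logmel(path, sr=22050, n_mels=128):
    """Load a recording and convert it to a log-scaled mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

class TinyBirdCNN(nn.Module):
    """Toy stand-in for the convolutional feature extractors used on spectrograms."""
    def __init__(self, n_classes=1500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# logmel = audio_to_logmel("recording.wav")          # (n_mels, frames)
# x = torch.from_numpy(logmel)[None, None].float()   # (1, 1, n_mels, frames)
# scores = TinyBirdCNN()(x)                          # (1, 1500) species scores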

57 citations

Proceedings ArticleDOI
01 Dec 2011
TL;DR: A holistic framework is developed that supports most aspects of a media provider's real workflows, such as production, distribution, content description, archiving, and re-use of video items, and addresses issues such as a lack of human resources, the necessity of parallel media distribution, and the retrieval of previously archived content by editors or consumers.
Abstract: Supporting most aspects of a media provider's real workflows, such as production, distribution, content description, archiving, and re-use of video items, we developed a holistic framework to address issues such as a lack of human resources, the necessity of parallel media distribution, and the retrieval of previously archived content by editors or consumers.

15 citations

Proceedings ArticleDOI
06 Dec 2011
TL;DR: Three different fusion techniques are proposed to combine the advantages of two vision sensors, a far-infrared (FIR) and a visible-light camera, and are compared with respect to the results of the pedestrian classification.
Abstract: Pedestrian detection is an important field in computer vision with applications in surveillance, robotics, and driver assistance systems. The quality of such systems can be improved by the simultaneous use of different sensors. This paper proposes three different fusion techniques to combine the advantages of two vision sensors: a far-infrared (FIR) and a visible-light camera. Fusion methods drawn from different levels of information representation are briefly described and finally compared with respect to the results of the pedestrian classification.
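
To illustrate the idea of fusing the two sensors at different levels of information representation, the sketch below contrasts a pixel-level (early) fusion with a decision-level (late) fusion. It is a simplified, assumption-laden example: the classifiers, weighting, and image registration are placeholders and do not reproduce the paper's three techniques.

# Simplified sketch of two fusion levels for FIR + visible-light pedestrian
# classification. Classifiers and weights are placeholders.
import numpy as np

def early_fusion(fir_patch: np.ndarray, vis_patch: np.ndarray) -> np.ndarray:
    """Pixel-level fusion: stack both modalities into one multi-channel patch."""
    return np.concatenate([fir_patch[..., None], vis_patch[..., None]], axis=-1)

def late_fusion(score_fir: float, score_vis: float, w_fir: float = 0.5) -> float:
    """Decision-level fusion: weighted combination of the per-sensor scores."""
    return w_fir * score_fir + (1.0 - w_fir) * score_vis

# fused_patch = early_fusion(fir, vis)          # feed to a single classifier
# p = late_fusion(clf_fir(fir), clf_vis(vis))   # combine two classifier outputs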

15 citations

Journal ArticleDOI
13 Apr 2018-PLOS ONE
TL;DR: A Matlab-based software package that allows for the simulation of camera-based smFRET videos, yielding standardized data sets suitable for benchmarking video processing algorithms, as well as for pre-optimizing and evaluating spot detection algorithms using the authors' simulated video test sets.
Abstract: Single-molecule microscopy has become a widely used technique in (bio)physics and (bio)chemistry. A popular implementation is single-molecule Förster Resonance Energy Transfer (smFRET), for which total internal reflection fluorescence microscopy is frequently combined with camera-based detection of surface-immobilized molecules. Camera-based smFRET experiments generate large and complex datasets, and several methods for video processing and analysis have been reported. As these algorithms often address similar aspects in video analysis, there is a growing need for standardized comparison. Here, we present a Matlab-based software package (MASH-FRET) that allows for the simulation of camera-based smFRET videos, yielding standardized data sets suitable for benchmarking video processing algorithms. The software permits varying parameters that are relevant in camera-based smFRET, such as video quality and the properties of the system under study. Experimental noise is modeled taking into account photon statistics and camera noise. Finally, we survey how video test sets should be designed to evaluate currently available data analysis strategies in camera-based single-molecule fluorescence experiments. We complement our study by pre-optimizing and evaluating spot detection algorithms using our simulated video test sets.
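
As a rough illustration of modeling experimental noise from photon statistics and camera noise, the sketch below adds Poisson shot noise, a camera gain, and Gaussian read noise to an ideal photon-rate image. The gain and noise values are assumptions, and the model is far simpler than MASH-FRET's actual camera noise models.

# Minimal sketch: simulate a noisy camera frame from an ideal photon-rate
# image using Poisson shot noise plus Gaussian read noise. Values are made up.
import numpy as np

def simulate_frame(photon_rate, gain=2.0, read_noise_sd=1.5, offset=100.0, rng=None):
    """Convert expected photon counts per pixel into a simulated camera frame."""
    rng = np.random.default_rng() if rng is None else rng
    photons = rng.poisson(photon_rate)        # photon (shot) noise
    counts = gain * photons + offset          # camera gain and baseline offset
    return counts + rng.normal(0.0, read_noise_sd, photon_rate.shape)  # read noise

# yy, xx = np.indices((64, 64))
# ideal = 50.0 * np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / 20.0)  # one fake spot
# frame = simulate_frame(ideal)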

13 citations


Cited by

Proceedings ArticleDOI
05 Mar 2017
TL;DR: In this paper, the authors used various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels.
Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
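
The last point, that embeddings from these classifiers outperform raw features on the downstream AED task, follows a common transfer-learning pattern sketched below. The small embedding network and downstream classifier are placeholders, not the paper's actual embedding model.

# Sketch of the "embeddings beat raw features" pattern: a frozen, pre-trained
# audio CNN provides clip embeddings that feed a simple downstream classifier.
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

class AudioEmbedder(nn.Module):
    """Placeholder conv net whose pooled activations serve as clip embeddings."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, embed_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, log_mel):                  # (N, 1, mels, frames)
        return self.conv(log_mel).flatten(1)     # (N, embed_dim)

# with torch.no_grad():
#     X = AudioEmbedder()(log_mel_clips).numpy()                # clip embeddings
# clf = LogisticRegression(max_iter=1000).fit(X, event_labels)  # downstream AED model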

1,470 citations

Journal ArticleDOI
TL;DR: A comprehensive survey of instance retrieval over the last decade that presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT-based and CNN-based methods.
Abstract: In the early days, content-based image retrieval (CBIR) was studied with global features. Since 2003, image retrieval based on local descriptors ( de facto SIFT) has been extensively studied for over a decade due to the advantage of SIFT in dealing with image transformations. Recently, image representations based on the convolutional neural network (CNN) have attracted increasing interest in the community and demonstrated impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of instance retrieval over the last decade. Two broad categories, SIFT-based and CNN-based methods, are presented. For the former, according to the codebook size, we organize the literature into using large/medium-sized/small codebooks. For the latter, we discuss three lines of methods, i.e., using pre-trained or fine-tuned CNN models, and hybrid methods. The first two perform a single-pass of an image to the network, while the last category employs a patch-based feature extraction scheme. This survey presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT and CNN-based methods. After analyzing and comparing retrieval performance of different categories on several datasets, we discuss promising directions towards generic and specialized instance retrieval.

554 citations

Proceedings ArticleDOI
Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, Vincent Vanhoucke
01 Jul 2017
TL;DR: A new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB), which consists of approximately 380,000 video segments automatically selected to feature objects in natural settings without editing or post-processing.
Abstract: We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments, each about 19 seconds long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the COCO [32] label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second. The use of a cascade of increasingly precise human annotations ensures a label accuracy above 95% for every class and tight bounding boxes. Finally, we train and evaluate well-known deep network architectures and report baseline figures for per-frame classification and localization. We also demonstrate how the temporal contiguity of video can potentially be used to improve such inferences. The data set can be found at https://research.google.com/youtube-bb. We hope the availability of such a large curated corpus will spur new advances in video object detection and tracking.

501 citations

Proceedings ArticleDOI
21 Jul 2017
TL;DR: A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
Abstract: In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) as well as outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
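
The two findings above, pooling jointly across space and time while keeping appearance and motion in separate aggregated representations, can be pictured with the simplified sketch below. Plain average pooling stands in for the paper's learnable spatio-temporal aggregation, and the shapes are illustrative.

# Simplified sketch: aggregate local conv features jointly over space AND time,
# but keep the appearance (RGB) and motion (flow) streams separate until the end.
import numpy as np

def pool_space_time(features: np.ndarray) -> np.ndarray:
    """features: (T, H, W, C) local conv features -> (C,) video-level descriptor."""
    return features.mean(axis=(0, 1, 2))

# rgb_desc  = pool_space_time(rgb_conv_features)      # appearance stream
# flow_desc = pool_space_time(flow_conv_features)     # motion stream
# video_repr = np.concatenate([rgb_desc, flow_desc])  # streams aggregated separately, then combined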

410 citations