
Showing papers by "Ioannis Pitas published in 2017"


Journal ArticleDOI
TL;DR: The proposed Approximate Kernel Extreme Learning Machine algorithm for Single-hidden Layer Feedforward Neural network training can be applied to large-scale classification problems: it scales well in both computational cost and memory, while achieving good generalization performance.
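The TL;DR only names the idea; as a rough illustration of how a kernel ELM can be approximated so that it scales to large training sets, the toy sketch below restricts the kernel matrix to a random prototype subset. The RBF kernel, the random prototype choice, and the regularization are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel values between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def approx_kernel_elm_train(X, T, n_prototypes=50, C=1.0, rng=None):
    # Use a random subset of training samples as kernel "prototypes",
    # so the kernel matrix is n x L instead of n x n (L << n).
    g = np.random.default_rng(rng)
    idx = g.choice(len(X), size=min(n_prototypes, len(X)), replace=False)
    P = X[idx]
    H = rbf_kernel(X, P)                    # n x L hidden representation
    # Regularized least squares for the output weights.
    beta = np.linalg.solve(H.T @ H + np.eye(len(P)) / C, H.T @ T)
    return P, beta

def approx_kernel_elm_predict(X, P, beta):
    return rbf_kernel(X, P) @ beta
```

With one-hot (here ±1) targets, the predicted class is the argmax over output columns; memory grows with the prototype count L rather than the training set size n.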

41 citations


Journal ArticleDOI
TL;DR: This paper formulates the proposed method to exploit data representations in the feature space determined by the network hidden layer outputs, as well as in ELM spaces of arbitrary dimensions, and shows that the exploitation of geometric class information enhances performance.
Abstract: In this paper, we propose an extreme learning machine (ELM)-based one-class classification method that exploits geometric class information. We formulate the proposed method to exploit data representations in the feature space determined by the network hidden layer outputs, as well as in ELM spaces of arbitrary dimensions. We show that the exploitation of geometric class information enhances performance. We evaluate the proposed approach in publicly available datasets and compare its performance with the recently proposed one-class extreme learning machine algorithm, as well as with standard and recently proposed one-class classifiers. Experimental results show that the proposed method consistently outperforms the remaining approaches.
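The abstract does not spell out the method itself; the sketch below only illustrates the general shape of one-class classification in an ELM feature space with a geometric (distance-to-centroid) acceptance criterion. The sigmoid activations, the centroid statistic, and the 95% quantile threshold are all assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def elm_features(X, W, b):
    # Random single-hidden-layer mapping with sigmoid activations.
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def train_one_class(X, n_hidden=30, quantile=0.95, rng=0):
    g = np.random.default_rng(rng)
    W = g.normal(size=(X.shape[1], n_hidden))
    b = g.normal(size=n_hidden)
    H = elm_features(X, W, b)
    center = H.mean(axis=0)                   # class centroid in ELM space
    d = np.linalg.norm(H - center, axis=1)    # geometric class information
    thr = np.quantile(d, quantile)            # accept the closest 95% of samples
    return W, b, center, thr

def is_target(X, W, b, center, thr):
    H = elm_features(X, W, b)
    return np.linalg.norm(H - center, axis=1) <= thr
```

Training uses only target-class samples; at test time, anything whose ELM-space distance to the class centroid exceeds the learned threshold is treated as an outlier.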

35 citations


Journal ArticleDOI
TL;DR: A new video-based feature, called actor presence, is introduced to enhance audio-based speaker clustering; the results showed that visual information can improve speaker clustering accuracy and hence the diarization process.
Abstract: Multimodal clustering/diarization tries to answer the question "who spoke when" by using audio and visual information. Diarization consists of two steps: first, segmentation of the audio information and detection of the speech segments, and then clustering of the speech segments to group the speakers. This task has been mainly studied on audiovisual data from meetings, news broadcasts or talk shows. In this paper, we use visual information to aid speaker clustering and we introduce a new video-based feature, called actor presence, that can be used to enhance audio-based speaker clustering. We tested the proposed method on three full-length stereoscopic movies, i.e., a scenario much more difficult than the ones used so far, where there is no certainty that speech segments and video appearances of actors will always overlap. The results showed that the visual information can improve the speaker clustering accuracy and hence the diarization process.
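As a rough sketch of what an "actor presence" feature could look like, the code below marks, for each speech segment, which actors' faces are visible at some point during the segment. The interval representation and the overlap rule are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def actor_presence(segments, face_tracks, n_actors):
    # segments: list of (start, end) times of speech segments.
    # face_tracks: list of (actor_id, (start, end)) face appearance intervals.
    # Returns one binary presence vector per speech segment.
    feats = np.zeros((len(segments), n_actors))
    for i, (seg_start, seg_end) in enumerate(segments):
        for actor, (t_start, t_end) in face_tracks:
            if t_start < seg_end and seg_start < t_end:   # intervals overlap
                feats[i, actor] = 1.0
    return feats
```

Such vectors can then be concatenated with (or used to re-weight) audio-based segment distances during clustering; crucially, a zero row is possible, matching the paper's observation that speech and actor appearances need not overlap in movies.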

22 citations


Proceedings ArticleDOI
01 Nov 2017
TL;DR: Overall, it was found that state-of-the-art 2D visual trackers are dependable and fast enough to be used in drone cinematography, particularly when combined with periodic target re-detection.
Abstract: In this paper, we provide a preliminary study of basic requirements for autonomous UAV cinematography via 2D target tracking. Our contribution is two-fold. First, we develop a mathematical framework so as to determine hardware camera requirements (specifically, focal length), on a representative case study, i.e., orbiting a still or moving target. Second, we examine the on-board software requirements in order to successfully achieve autonomous target following. To this end, we evaluate the performance of state-of-the-art real-time 2D visual trackers in videos captured by commercial drones. Overall, it was found that state-of-the-art 2D visual trackers are dependable and fast enough to be used in drone cinematography, particularly when combined with periodic target re-detection. A proposed variant of the Staple tracker achieved the best balance between real-time performance and tracking accuracy, on a dataset composed of 31 sports videos recorded by commercial drones.
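The focal-length framework itself is developed in the paper; as a back-of-the-envelope illustration of the kind of pinhole-model requirement involved, one can solve for the focal length that makes a target of known size fill a desired fraction of the frame at a given orbit radius. All concrete numbers and the framing fraction below are hypothetical.

```python
def required_focal_length(target_height_m, distance_m, sensor_height_mm, image_fraction):
    # Pinhole camera model: a target of height H at distance d projects to
    # an image of height f * H / d on the sensor. Solve for the focal length f
    # that makes the target cover the desired fraction of the sensor height.
    return image_fraction * sensor_height_mm * distance_m / target_height_m
```

For example, framing a 1.8 m person at half the frame height from a 20 m orbit, on a hypothetical 8 mm sensor, asks for roughly a 44 mm focal length; shrinking the orbit radius relaxes the requirement proportionally.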

15 citations


Proceedings ArticleDOI
13 Sep 2017
TL;DR: The proposed algorithm results in a final key-frame set which acts as a salient dictionary for the input video and outperforms both a baseline clustering-based approach and a state-of-the-art sparse dictionary learning-based algorithm.
Abstract: Video summarization has become more prominent during the last decade, due to the massive amount of available digital video content. A video summarization algorithm is typically fed an input video and expected to extract a set of important key-frames which represent the entire content, convey semantic meaning and are significantly more concise than the original input. The most widespread approach relies on video frame clustering and extraction of the frames closest to the cluster centroids as key-frames. Such a process, although efficient, offloads the burden of semantic scene content modelling exclusively to the employed video frame description/representation scheme, while summarization itself is approached simply as a distance-based data partitioning problem. This work focuses on videos depicting human activities (e.g., from surveillance feeds) which display an attractive property, i.e., each video frame can be seen as a linear combination of elementary visual words (i.e., basic activity components). This is exploited so as to identify the video frames containing only the elementary visual building blocks, which ideally form a set of independent basis vectors that can linearly reconstruct the entire video. In this manner, the semantic content of the scene is considered by the video summarization process itself. The above process is modulated by a traditional distance-based video frame saliency estimation, biasing towards more spread content coverage and outlier inclusion, under a joint optimization framework derived from the Column Subset Selection Problem (CSSP). The proposed algorithm results in a final key-frame set which acts as a salient dictionary for the input video. Empirical evaluation conducted on a publicly available dataset suggests that the presented method outperforms both a baseline clustering-based approach and a state-of-the-art sparse dictionary learning-based algorithm.
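The paper couples this reconstruction objective with a saliency term under a joint optimization; as a stripped-down illustration of the CSSP core alone (not the paper's optimizer), a greedy column-selection sketch might look like this, where columns of the frame matrix are frame descriptors and selected columns act as the reconstruction basis:

```python
import numpy as np

def greedy_cssp(F, k):
    # F: d x n matrix whose columns are video frame descriptors.
    # Greedily select k columns (key-frames) so that projecting all
    # frames onto the selected subset minimizes reconstruction error.
    selected = []
    for _ in range(k):
        best, best_err = None, np.inf
        for j in range(F.shape[1]):
            if j in selected:
                continue
            S = F[:, selected + [j]]
            # Least-squares reconstruction of every frame from the subset.
            coeffs, *_ = np.linalg.lstsq(S, F, rcond=None)
            err = np.linalg.norm(F - S @ coeffs)
            if err < best_err:
                best, best_err = j, err
        selected.append(best)
    return selected
```

Greedy selection is one standard heuristic for CSSP; the point of the illustration is only that the chosen key-frames are the ones that linearly reconstruct the remaining frames well.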

13 citations


Proceedings ArticleDOI
01 Mar 2017
TL;DR: This work presents a method based on selecting, as key-frames, the video frames able to optimally reconstruct the entire video, and on modelling the reconstruction algebraically as a Column Subset Selection Problem (CSSP), resulting in the extraction of key-frames that correspond to elementary visual building blocks.
Abstract: Summarization of videos depicting human activities is a timely problem with important applications, e.g., in the domains of surveillance or film/TV production, that steadily becomes more relevant. Research on video summarization has mainly relied on global clustering or local (frame-by-frame) saliency methods to provide automated algorithmic solutions for key-frame extraction. This work presents a method based on selecting, as key-frames, the video frames able to optimally reconstruct the entire video. The novelty lies in modelling the reconstruction algebraically as a Column Subset Selection Problem (CSSP), resulting in extracting key-frames that correspond to elementary visual building blocks. The problem is formulated under an optimization framework and approximately solved via a genetic algorithm. The proposed video summarization method is evaluated using a publicly available annotated dataset and an objective evaluation metric. According to the quantitative results, it clearly outperforms the typical clustering approach.
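As a toy stand-in for the genetic-algorithm search mentioned above, the sketch below evolves subsets of frame indices with a mutation-only loop; the paper's actual operators, fitness function and parameters are not specified here, so everything below is an illustrative assumption.

```python
import numpy as np

def reconstruction_error(F, idx):
    # How well the selected columns reconstruct the whole frame matrix.
    S = F[:, list(idx)]
    coeffs, *_ = np.linalg.lstsq(S, F, rcond=None)
    return np.linalg.norm(F - S @ coeffs)

def ga_keyframes(F, k, pop_size=20, generations=30, rng=0):
    g = np.random.default_rng(rng)
    n = F.shape[1]
    # Each individual is a set of k candidate key-frame indices.
    pop = [tuple(sorted(g.choice(n, k, replace=False))) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: reconstruction_error(F, ind))
        survivors = pop[: pop_size // 2]       # keep the fittest half
        children = []
        for parent in survivors:
            child = list(parent)
            # Mutation: swap one selected frame for a random unselected one.
            child[g.integers(k)] = g.choice([j for j in range(n) if j not in parent])
            children.append(tuple(sorted(child)))
        pop = survivors + children
    return min(pop, key=lambda ind: reconstruction_error(F, ind))
```

A full genetic algorithm would also include crossover between parents; the simplified loop is enough to show how subset selection can be posed as evolutionary search over a combinatorial space.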

12 citations


Proceedings ArticleDOI
01 Nov 2017
TL;DR: A face detection hindering method is developed as a means of preventing the privacy threats that automatic video analysis may pose, thereby rendering automatic face recognition improbable.
Abstract: In this paper, we develop a face detection hindering method, as a means of preventing the threats to people's privacy that automatic video analysis may pose. Face detection in images or videos is the first step in human-centered video analysis, to be followed, e.g., by automatic face recognition. Therefore, by hindering face detection, we also render automatic face recognition improbable. To this end, we examine the application of two methods. First, we consider a naive approach, i.e., we simply add additive or impulsive noise to the input image, until the point where the face cannot be automatically detected anymore. Second, we examine the application of the SVD-DID face de-identification method. Our experimental results show that both methods attain high face detection failure rates.
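The "naive approach" can be sketched as an escalation loop: keep adding noise until the detector no longer fires. The detector below is deliberately a placeholder callable (in practice one would invoke a real face detector, e.g. a cascade classifier), and the Gaussian noise schedule is an assumption; the paper also considers impulsive noise.

```python
import numpy as np

def hinder_detection(image, detector, noise_step=10.0, max_iters=50, rng=0):
    # Add zero-mean Gaussian noise of accumulating strength until the
    # detector no longer finds a face (or the iteration budget runs out).
    g = np.random.default_rng(rng)
    noisy = image.astype(float)
    for _ in range(max_iters):
        if not detector(noisy):
            break
        noisy = noisy + g.normal(0.0, noise_step, size=image.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)
```

Because the loop stops at the first failed detection, it adds the least noise needed to defeat that particular detector, which matches the goal of degrading the image as little as possible.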

8 citations


Journal ArticleDOI
TL;DR: This paper proposes four algorithms that exploit available stereo disparity information, in order to detect disturbing stereoscopic effects, namely, stereoscopic window violations, bent window effects, uncomfortable fusion objects, and depth jump cuts on stereo videos.
Abstract: The 3D video quality issues that may disturb the human visual system and negatively impact the 3D viewing experience are well known and become more relevant as the availability of 3D video content increases, primarily through 3D cinema, but also through 3D television. In this paper, we propose four algorithms that exploit available stereo disparity information, in order to detect disturbing stereoscopic effects, namely, stereoscopic window violations, bent window effects, uncomfortable fusion objects, and depth jump cuts on stereo videos. After detecting such issues, the proposed algorithms characterize them, based on the stress they cause to the viewer’s visual system. Qualitative representative examples, quantitative experimental results on a custom-made video data set, a parameter sensitivity study, and comments on the computational complexity of the algorithms are provided, in order to assess the accuracy and the performance of stereoscopic quality defect detection.
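As an illustration of the simplest of these defects, a window-violation check on a per-pixel disparity map might look like the sketch below. The border width, thresholds, and the sign convention (negative disparity taken to mean content in front of the screen plane) are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def window_violation(disparity, border=10, threshold=-1.0, min_fraction=0.05):
    # A stereoscopic window violation occurs when content in front of the
    # screen plane (negative disparity, under this convention) touches the
    # left or right image border, so it is cut off by the "stereo window".
    left = disparity[:, :border]
    right = disparity[:, -border:]
    frac_left = (left < threshold).mean()
    frac_right = (right < threshold).mean()
    return frac_left > min_fraction or frac_right > min_fraction
```

A practical detector would additionally characterize the severity of the violation, e.g. by how far in front of the screen the offending pixels lie and for how many frames the condition persists.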

7 citations


Journal ArticleDOI
TL;DR: Two methods that manipulate images to hinder automatic face identification are presented; they partly degrade image quality, so that humans can identify the persons in a scene while face identification algorithms fail to do so.
Abstract: In this paper, two methods are presented that manipulate images to hinder automatic face identification. They partly degrade image quality, so that humans can identify the persons in a scene, while face identification algorithms fail to do so. The approaches used involve: a) singular value decomposition (SVD) and b) image projections on hyperspheres. Simulation experiments verify that these methods reduce the correct face identification rate by over 90%. Additionally, the final image is not degraded beyond recognition by humans, in contrast with the majority of other de-identification methods.
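The SVD-based idea can be sketched as low-rank truncation of the face region: keep only the largest singular values, which preserve the coarse structure humans recognize, while discarding the finer detail identification algorithms rely on. The rank kept below, and the equation of this sketch with the paper's exact procedure, are assumptions; the paper's second method (hypersphere projections) is not shown.

```python
import numpy as np

def svd_deidentify(face, keep=5):
    # Reconstruct the face region from only its `keep` largest singular
    # values, suppressing the fine detail used by identification algorithms.
    U, s, Vt = np.linalg.svd(face.astype(float), full_matrices=False)
    s[keep:] = 0.0
    out = U @ np.diag(s) @ Vt
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)
```

In a full pipeline the truncation would be applied only inside detected face bounding boxes, leaving the rest of the frame untouched.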

7 citations


Journal ArticleDOI
TL;DR: The scope of this Editorial is to briefly present methodologies, tasks and applications of big media data analysis and to introduce the papers of the special issue on Big Media Data Analysis.
Abstract: In this editorial a short introduction to the special issue on Big Media Data Analysis is given. The scope of this Editorial is to briefly present methodologies, tasks and applications of big media data analysis and to introduce the papers of the special issue. The special issue includes six papers that span various media analysis application areas like generic image description, medical image and video analysis, distance calculation acceleration and data collection.

6 citations