
Showing papers in "International Journal of Multimedia Information Retrieval in 2015"


Journal ArticleDOI
TL;DR: This paper proposes several speed-ups for densely sampled HOG, HOF and MBH descriptors and investigates the trade-off between accuracy and computational efficiency of descriptors in terms of frame sampling rate and type of Optical Flow method.
Abstract: The current state-of-the-art in video classification is based on Bag-of-Words using local visual descriptors. Most commonly these are histogram of oriented gradients (HOG), histogram of optical flow (HOF) and motion boundary histograms (MBH) descriptors. While such an approach is very powerful for classification, it is also computationally expensive. This paper addresses the problem of computational efficiency. Specifically: (1) We propose several speed-ups for densely sampled HOG, HOF and MBH descriptors and release Matlab code; (2) We investigate the trade-off between accuracy and computational efficiency of descriptors in terms of frame sampling rate and type of Optical Flow method; (3) We investigate the trade-off between accuracy and computational efficiency for computing the feature vocabulary, using and comparing most of the commonly adopted vector quantization techniques: \(k\)-means, hierarchical \(k\)-means, Random Forests, Fisher Vectors and VLAD.
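
To make the vocabulary step concrete, here is a minimal sketch of the k-means Bag-of-Words pipeline the paper benchmarks, written in Python with scikit-learn rather than the authors' released Matlab code; the descriptor dimensionality, vocabulary size and random data are illustrative stand-ins for dense HOG/HOF/MBH descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, k=64, seed=0):
    """Cluster a sample of local descriptors into k visual words."""
    return KMeans(n_clusters=k, n_init=4, random_state=seed).fit(descriptors)

def bow_histogram(vocabulary, descriptors):
    """Assign each descriptor to its nearest word and count occurrences."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalise the histogram

# Toy usage: random 96-dim vectors stand in for densely sampled descriptors.
rng = np.random.default_rng(0)
vocab = build_vocabulary(rng.standard_normal((10_000, 96)))
video_feature = bow_histogram(vocab, rng.standard_normal((500, 96)))
```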

93 citations


Journal ArticleDOI
TL;DR: The objective of this work is to visually search large-scale video datasets for semantic entities specified by a text query by constructing visual models for such semantic entities on-the-fly, by using an image search engine to source visual training data for the text query.
Abstract: The objective of this work is to visually search large-scale video datasets for semantic entities specified by a text query. The paradigm we explore is constructing visual models for such semantic entities on-the-fly, i.e. at run time, by using an image search engine to source visual training data for the text query. The approach combines fast and accurate learning and retrieval, and enables videos to be returned within seconds of specifying a query. We describe three classes of queries, each with its associated visual search method: object instances (using a bag of visual words approach for matching); object categories (using a discriminative classifier for ranking key frames); and faces (using a discriminative classifier for ranking face tracks). We discuss the features suitable for each class of query, for example Fisher vectors or features derived from convolutional neural networks (CNNs), and how these choices impact on the trade-off between three important performance measures for a real-time system of this kind, namely: (1) accuracy, (2) memory footprint, and (3) speed. We also discuss and compare a number of important implementation issues, such as how to remove ‘outliers’ in the downloaded images efficiently, and how to best obtain a single descriptor for a face track. We also sketch the architecture of the real-time on-the-fly system. Quantitative results are given on a number of large-scale image and video benchmarks (e.g. TRECVID INS, MIRFLICKR-1M), and we further demonstrate the performance and real-world applicability of our methods over a dataset sourced from 10,000 h of unedited footage from BBC News, comprising 5M+ key frames.
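
The learn-then-rank core of the on-the-fly paradigm can be sketched in a few lines; this is an illustrative reconstruction, not the authors' system: feature extraction, the image search engine call and the outlier removal step are stubbed out, and logistic regression stands in for whatever linear classifier the system trains.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_on_the_fly(positive_feats, negative_pool):
    """Fit a fast linear model: downloaded query images vs. a fixed negative pool."""
    X = np.vstack([positive_feats, negative_pool])
    y = np.concatenate([np.ones(len(positive_feats)), np.zeros(len(negative_pool))])
    return LogisticRegression(max_iter=1000).fit(X, y)

def rank_keyframes(model, keyframe_feats):
    """Score every key frame and return indices, best match first."""
    return np.argsort(-model.decision_function(keyframe_feats))

rng = np.random.default_rng(1)
pos = rng.standard_normal((50, 128)) + 1.0   # images sourced for the text query
neg = rng.standard_normal((500, 128))        # generic negative pool
order = rank_keyframes(train_on_the_fly(pos, neg), rng.standard_normal((1000, 128)))
```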

54 citations


Journal ArticleDOI
Hichem Sahbi
TL;DR: This paper will show that the underlying kernel solution converges to a positive semi-definite fixed-point, which can also be expressed as a dot product involving “explicit” kernel maps.
Abstract: The general recipe of kernel methods, such as support vector machines (SVMs), includes a preliminary step of hand-crafting or designing similarity kernels. This process, which has been extensively studied during the last decade, has proven to be relatively successful in solving many pattern recognition problems including image annotation. However, many proposed solutions for kernel design consider similarity between data by taking into account only their content, without context. In this paper, we propose an alternative that upgrades and further enhances usual kernels by making them context-aware. The method is based on the optimization of an objective function mixing content, regularization and also context. We will show that the underlying kernel solution converges to a positive semi-definite fixed-point, which can also be expressed as a dot product involving “explicit” kernel maps. When plugging these explicit context-aware kernel maps into support vector machines, performances substantially improve and outperform competitors for the hard task of image annotation using a recent ImageCLEF annotation benchmark.
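
As a schematic illustration of such a fixed-point construction (not the paper's exact objective function), the following iteration mixes a content kernel with similarity propagated through a context matrix; for a small enough mixing weight it converges, and the fixed point is positive semi-definite whenever the content kernel is.

```python
import numpy as np

def context_aware_kernel(K_content, P, gamma=0.1, iters=50):
    """Iterate K <- K_content + gamma * P K P^T until (numerical) convergence."""
    K = K_content.copy()
    for _ in range(iters):
        K = K_content + gamma * P @ K @ P.T
    return K

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 5))
K0 = X @ X.T                                   # a PSD content kernel (linear)
P = rng.random((20, 20))
P /= P.sum(axis=1, keepdims=True)              # row-stochastic context matrix
K_ctx = context_aware_kernel(K0, P)            # PSD: a sum of terms P^n K0 (P^n)^T
```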

40 citations


Journal ArticleDOI
TL;DR: The semantic story-based video retrieval problem is transformed into a much simpler text-based search through the storyline of TV series episodes by using human written, crowdsourced descriptions—plot synopses—of the story conveyed in the video.
Abstract: We propose a method to facilitate search through the storyline of TV series episodes. To this end, we use human written, crowdsourced descriptions—plot synopses—of the story conveyed in the video. We obtain such synopses from websites such as Wikipedia and propose various methods to align each sentence of the plot to shots in the video. Thus, the semantic story-based video retrieval problem is transformed into a much simpler text-based search. Finally, we return the set of shots aligned to the sentences as the video snippet corresponding to the query. The alignment is performed by first computing a similarity score between every shot and sentence through cues such as character identities and keyword matches between plot synopses and subtitles. We then formulate the alignment as an optimization problem and solve it efficiently using dynamic programming. We evaluate our methods on the fifth season of the TV series Buffy the Vampire Slayer and show encouraging results for both the alignment and the retrieval of story events.
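
The dynamic-programming step can be illustrated with a monotone alignment sketch: assuming a precomputed similarity matrix between sentences and shots (the character-identity and keyword cues are outside this sketch), each sentence is assigned a contiguous, order-preserving run of shots so that total similarity is maximal. This is a plausible reading of the optimization, not the paper's exact formulation.

```python
import numpy as np

def align(S):
    """S[i, j]: similarity of sentence i and shot j. Returns per-sentence
    shot spans (start, end) covering all shots in order."""
    n_sent, n_shot = S.shape
    dp = np.full((n_sent + 1, n_shot + 1), -np.inf)
    dp[0, 0] = 0.0
    back = np.zeros((n_sent + 1, n_shot + 1), dtype=int)
    for i in range(1, n_sent + 1):
        for j in range(i, n_shot + 1):
            # sentence i-1 takes shots k..j-1 for some split point k
            cand = dp[i - 1, i - 1:j] + np.array(
                [S[i - 1, k:j].sum() for k in range(i - 1, j)])
            best = int(np.argmax(cand))
            dp[i, j] = cand[best]
            back[i, j] = best + (i - 1)
    spans, j = [], n_shot
    for i in range(n_sent, 0, -1):      # trace the chosen split points back
        k = back[i, j]
        spans.append((k, j))            # sentence i-1 owns shots k..j-1
        j = k
    return spans[::-1]

print(align(np.random.default_rng(3).random((4, 12))))
```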

29 citations


Journal ArticleDOI
TL;DR: This paper evaluates and proposes a set of strategies for automatically building effective concept detectors from clickthrough data, focusing on automatic training set generation, assignment of label confidence weights to the training samples, and use of these weights at the classifier level to improve concept detector effectiveness.
Abstract: Clickthrough data is a source of information that can be used for automatically building concept detectors for image retrieval. Previous studies, however, have shown that in many cases the resulting training sets suffer from severe label noise that has a significant impact on the SVM concept detector performance. This paper evaluates and proposes a set of strategies for automatically building effective concept detectors from clickthrough data. These strategies focus on: (1) automatic training set generation; (2) assignment of label confidence weights to the training samples; and (3) using these weights at the classifier level to improve concept detector effectiveness. For training set selection, and in order to assign weights to individual training samples, three Information Retrieval (IR) models are examined: vector space models, BM25 and language models. Three SVM variants that take into account importance at the classifier level are evaluated and compared to the standard SVM: the Fuzzy SVM, the Power SVM, and the Bilateral-weighted Fuzzy SVM. Experiments conducted on the MM Grand Challenge dataset (consisting of 1M images and 82.3M unique clicks) for 40 concepts demonstrate that: (1) on average, all weighted SVM variants are more effective than the standard SVM; (2) the vector space model produces the best training sets and best weights; (3) the Bilateral-weighted Fuzzy SVM produces the best results but is very sensitive to weight assignment; and (4) the Fuzzy SVM is the most robust training approach for varying levels of label noise.
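
At the classifier level, the simplest version of the idea, using click-derived confidences as per-sample weights, looks as follows; scikit-learn's sample_weight is a stand-in for the Fuzzy/Power/Bilateral-weighted SVM formulations the paper actually compares, and the confidence values here are random placeholders for IR-model scores.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 64))          # image features
y = (X[:, 0] > 0).astype(int)                # noisy click-derived labels
confidence = rng.random(1000)                # placeholder IR-model relevance scores

# Samples with low label confidence contribute less to the margin violations.
clf = LinearSVC().fit(X, y, sample_weight=confidence)
```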

15 citations


Journal ArticleDOI
TL;DR: This work proposes an adaptive multi-modal multi-view reranking model that is able to jointly regularize the relatedness among modalities, the effects of feature views extracted from different modalities, as well as the complex relations among multi-modal documents.
Abstract: Information reranking aims to recover the true order of the initial search results. Traditional reranking approaches have achieved great success in uni-modal document retrieval. They, however, suffer from the following limitations when reranking multi-modal documents: (1) they are unable to capture and model the relations among multiple modalities within the same document; (2) they usually concatenate diverse features extracted from different modalities into one single vector, rather than adaptively fusing them by considering their discriminative capabilities with respect to the given query; and (3) most of them consider the pairwise relations among documents but discard their higher-order grouping relations, which leads to information loss. Towards this end, we propose an adaptive multi-modal multi-view (aMM) reranking model. This model is able to jointly regularize the relatedness among modalities, the effects of feature views extracted from different modalities, as well as the complex relations among multi-modal documents. Extensive experiments on three datasets validate the effectiveness and robustness of our proposed model.

11 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed approach is able to identify main objects and reduce the influence of background in the image, and thus improve the performance of image retrieval in comparison with a conventional CBIR based on DCT.
Abstract: Content-based image retrieval (CBIR) is the process of searching digital images in a large database based on features, such as color, texture and shape, of a given query image. As many images are compressed by transforms, constructing the feature vector directly in the transform domain is a very popular topic. Therefore, features can be extracted directly from images in compressed format by using, for example, the discrete cosine transform (DCT) for JPEG compressed images. Also, region-based image retrieval (RBIR) has attracted great interest in recent years. This paper proposes a new RBIR approach using the shape-adaptive discrete cosine transform (SA-DCT). In this retrieval system, an image has a prior segmentation alpha plane, which is defined exactly as in MPEG-4. Therefore, an image is represented by segmented regions, each of which is associated with a feature vector derived from DCT and SA-DCT coefficients. Users can select any region as the main theme of the query image. The similarity between a query image and any database image is ranked according to the same similarity measure, computed from the selected regions of the two images. For those images without distinctive objects and scenes, users can still select the whole image as the query condition. The experimental results show that the proposed approach is able to identify main objects and reduce the influence of background in the image, and thus improve the performance of image retrieval in comparison with a conventional CBIR based on DCT.
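
For reference, plain block-DCT feature extraction (the conventional CBIR baseline the paper compares against) can be sketched as below; the shape-adaptive SA-DCT over an MPEG-4 alpha plane is considerably more involved and is not reproduced here, and the zig-zag coefficient scan is simplified to a row-major prefix.

```python
import numpy as np
from scipy.fftpack import dct

def block_dct_features(gray, block=8, n_coeffs=9):
    """Average the first low-frequency DCT coefficients over all 8x8 blocks."""
    h, w = (d - d % block for d in gray.shape)
    feats = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            b = gray[y:y + block, x:x + block].astype(float)
            c = dct(dct(b, axis=0, norm='ortho'), axis=1, norm='ortho')
            feats.append(c.flatten()[:n_coeffs])  # row-major prefix, not zig-zag
    return np.mean(feats, axis=0)

signature = block_dct_features(np.random.default_rng(5).random((64, 64)))
```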

9 citations


Journal ArticleDOI
TL;DR: Experimental results on various benchmark datasets show that the proposed system can infer enhanced visual dictionaries and the derived image feature vector can achieve better retrieval results as compared to state-of-the-art techniques.
Abstract: Characterizing images by high-level concepts from a learned visual dictionary is extensively used in image classification and retrieval. This paper deals with inferring discriminative visual dictionaries for effective image retrieval and examines a non-negative visual dictionary learning scheme towards this direction. More specifically, a non-negative matrix factorization framework with an \(\ell_0\)-sparseness constraint on the coefficient matrix for optimizing the dictionary is proposed. It is a two-step iterative process composed of sparse encoding and dictionary enhancement stages. An initial estimate of the visual dictionary is updated in each iteration with the proposed \(\ell_0\)-constraint gradient projection algorithm. A desirable attribute of this formulation is an adaptive sequential dictionary initialization procedure. This leads to a sharp drop in the approximation error and a faster convergence. Finally, the proposed dictionary optimization scheme is used to derive a compact image representation for the retrieval task. A new image signature is obtained by projecting local descriptors onto the basis elements of the optimized visual dictionary and then aggregating the resulting sparse encodings into a single feature vector. Experimental results on various benchmark datasets show that the proposed system can infer enhanced visual dictionaries and the derived image feature vector can achieve better retrieval results as compared to state-of-the-art techniques.
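
A hedged sketch of the two-step loop follows: an \(\ell_0\)-constrained sparse encoding of each descriptor against a non-negative dictionary (here a simple greedy selection, not the paper's gradient projection algorithm) alternating with a standard multiplicative NMF dictionary update; the sequential initialization procedure is omitted.

```python
import numpy as np

def sparse_encode(X, D, k):
    """Greedy l0 encoding: pick the k best-correlated atoms, least-squares fit,
    clip coefficients to keep the code non-negative."""
    H = np.zeros((D.shape[1], X.shape[1]))
    for j in range(X.shape[1]):
        top = np.argsort(-(D.T @ X[:, j]))[:k]
        coef, *_ = np.linalg.lstsq(D[:, top], X[:, j], rcond=None)
        H[top, j] = np.clip(coef, 0, None)
    return H

def dictionary_update(X, D, H, eps=1e-9):
    """One multiplicative NMF update keeping D non-negative."""
    return D * (X @ H.T) / (D @ H @ H.T + eps)

rng = np.random.default_rng(6)
X = rng.random((32, 200))              # local descriptors as columns
D = rng.random((32, 48))               # initial dictionary, 48 atoms
for _ in range(10):
    H = sparse_encode(X, D, k=4)
    D = dictionary_update(X, D, H)
```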

7 citations


Journal ArticleDOI
TL;DR: This paper proposes a method using a Hidden Conditional Random Field (HCRF), a probabilistic discriminative classifier with a set of hidden states, to distinguish training videos containing the event from the others.
Abstract: Multimedia Event Detection (MED) is the task to identify videos in which a certain event occurs. This paper addresses two problems in MED: weakly supervised setting and unclear event structure. The first indicates that since associations of shots with the event are laborious and incur annotator’s subjectivity, training videos are loosely annotated as to whether the event is contained or not. It is unknown which shots are relevant or irrelevant to the event. The second problem is the difficulty of assuming the event structure in advance, due to arbitrary camera and editing techniques. To tackle these problems, we propose a method using a Hidden Conditional Random Field (HCRF) which is a probabilistic discriminative classifier with a set of hidden states. We consider that the weakly supervised setting can be handled using hidden states as the intermediate layer to discriminate between relevant and irrelevant shots to the event. In addition, an unclear structure of the event can be exposed by features of each hidden state and its relation to the other states. Based on the above idea, we optimise hidden states and their relation so as to distinguish training videos containing the event from the others. Also, to exploit the full potential of HCRFs, we establish approaches for training video preparation, parameter initialisation and fusion of multiple HCRFs. Experimental results on TRECVID video data validate the effectiveness of our method.

6 citations


Journal ArticleDOI
TL;DR: This paper presents a methodology which approaches social event detection as a streaming multi-modal clustering task that takes advantage of the temporal nature of social events and as a side benefit, allows for scaling to real-world datasets.
Abstract: Combining items from social media streams, such as Flickr photos and Twitter tweets, into meaningful groups can help users contextualise and consume more effectively the torrents of information continuously being made available on the social web. This task is made challenging by the scale of the streams and the inherently multimodal nature of the information being contextualised. The problem of grouping social media items into meaningful groups can be seen as an ill-posed and application-specific unsupervised clustering problem. A fundamental question in multimodal contexts is determining which features best signify that two items should belong to the same grouping. This paper presents a methodology which approaches social event detection as a streaming multi-modal clustering task. The methodology takes advantage of the temporal nature of social events and, as a side benefit, allows for scaling to real-world datasets. Specific challenges of the social event detection task are addressed: the engineering and selection of the features used to compare items to one another; a feature fusion strategy that incorporates the relative importance of features; the construction of a single sparse affinity matrix; and clustering techniques which produce meaningful item groups whilst scaling to cluster very large numbers of items. The state-of-the-art approach presented here is evaluated using the ReSEED dataset with standardised evaluation measures. With automatically learned feature weights, we achieve an \(F_1\) score of 0.94, showing that a good compromise between precision and recall of clusters can be achieved. In a comparison with other state-of-the-art algorithms our approach is shown to give the best results.
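
The fusion step can be pictured as a weighted sum of per-feature affinity matrices, thresholded into a single sparse affinity matrix; the weights below are placeholders for the automatically learned ones, and the clustering consumer is omitted.

```python
import numpy as np
from scipy.sparse import csr_matrix

def fuse_affinities(affinities, weights, threshold=0.5):
    """Weighted sum of per-feature affinities, sparsified to confident edges."""
    fused = sum(w * A for w, A in zip(weights, affinities))
    fused[fused < threshold] = 0.0
    return csr_matrix(fused)

rng = np.random.default_rng(7)
A_time, A_text = rng.random((100, 100)), rng.random((100, 100))
W = fuse_affinities([A_time, A_text], weights=[0.6, 0.4])  # input to clustering
```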

6 citations


Journal ArticleDOI
TL;DR: Experimental evaluations over well-known benchmark datasets show that the proposed approach substantially improves the retrieval precision of state-of-the-art binary hashing methods.
Abstract: The challenge of large-scale content-based image retrieval (CBIR) has recently been addressed by many promising approaches. In this work, a new approach that jointly optimizes the search precision and time for large-scale CBIR is presented. This is achieved using binary local image descriptors, such as BRIEF or BRISK, along with binary hashing methods, such as Locality-Sensitive Hashing and Spherical Hashing (SH). The proposed approach, named Multi-Bin Search, improves the retrieval precision of binary hashing methods by computing, storing and indexing the nearest neighbor bins for each bin generated by a binary hashing method. The search process then probes not only the targeted bin but also its nearest neighbor bins. To efficiently search inside targeted bins, a fast exhaustive-search equivalent algorithm, inspired by Norm Ordered Matching, has been used. Also, a result reranking step that increases the retrieval precision is introduced, at the cost of a slight increase in search time. Experimental evaluations over well-known benchmark datasets (such as the University of Kentucky Benchmarking, the INRIA Holidays, and the MIRFLICKR-1M) show that the proposed approach substantially improves the retrieval precision of state-of-the-art binary hashing methods.
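
The multi-bin idea can be sketched on a toy hash table over binary descriptors: the query probes its own bucket plus the buckets whose codes differ in one bit, then runs an exhaustive Hamming check inside the candidate set. Random hyperplane hashing stands in for the LSH/SH methods the paper builds on, and the reranking step is omitted.

```python
import numpy as np
from collections import defaultdict

def hash_code(bits, planes):
    """Project a binary descriptor onto random hyperplanes -> short hash code."""
    return tuple((bits @ planes > 0).astype(int))

def neighbor_bins(code):
    """All codes at Hamming distance 1 from the query's code."""
    for i in range(len(code)):
        flipped = list(code)
        flipped[i] ^= 1
        yield tuple(flipped)

rng = np.random.default_rng(8)
planes = rng.standard_normal((256, 12))              # 12-bit toy hash
db = rng.integers(0, 2, (5000, 256))                 # BRIEF-like binary descriptors
table = defaultdict(list)
for idx, d in enumerate(db):
    table[hash_code(d, planes)].append(idx)

q = rng.integers(0, 2, 256)
qc = hash_code(q, planes)
candidates = list(table[qc])
for nb in neighbor_bins(qc):                         # probe adjacent bins too
    candidates.extend(table[nb])
# exhaustive Hamming search inside the candidate set (full scan as a fallback)
best = min(candidates or range(len(db)),
           key=lambda i: np.count_nonzero(db[i] != q))
```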

Journal ArticleDOI
TL;DR: It is demonstrated that Approximate Laplacian Eigenmaps, which constitute a latent representation of the manifold underlying a set of images, offer a compact yet effective feature representation for the problem of concept detection.
Abstract: We present a versatile and effective manifold learning approach to tackle the concept detection problem in large-scale and online settings. We demonstrate that Approximate Laplacian Eigenmaps, which constitute a latent representation of the manifold underlying a set of images, offer a compact yet effective feature representation for the problem of concept detection. We expose the theoretical principles of the approach and present an extension that renders the approach applicable in online settings. We evaluate the approach on a number of well-known and two new datasets, coming from the social media domain, and demonstrate that it achieves equal or slightly better detection accuracy compared to supervised methods, while at the same time offering substantial speedup, enabling for instance the training of ten concept detectors using 1.5M images in just 3 min on a commodity server. We also explore a number of factors that affect the detection accuracy of the proposed approach, including the size of training set, the role of unlabelled samples in semi-supervised learning settings, and the performance of the approach across different concepts.
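
Batch Laplacian Eigenmaps, of which the paper's approximate online variant is a scalable extension, is available off the shelf; the sketch below only illustrates the latent representation, with random features standing in for real image descriptors.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

X = np.random.default_rng(9).standard_normal((500, 128))   # image features
Z = SpectralEmbedding(n_components=32, n_neighbors=10).fit_transform(X)
# Z is the compact manifold representation a cheap concept classifier consumes.
```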

Journal ArticleDOI
TL;DR: This paper presents a new method for modeling magnitudes of dual-tree complex wavelet coefficients in the context of color texture classification: based on the characterization of dependency between RGB color components, a Gaussian copula associated with generalized Gamma marginal functions is proposed to design the multivariate generalized Gamma density model.
Abstract: This paper presents a new method for modeling magnitudes of dual-tree complex wavelet coefficients, in the context of color texture classification. Based on the characterization of dependency between RGB color components, a Gaussian copula associated with generalized Gamma marginal functions is proposed to design the multivariate generalized Gamma density (MG\(\Gamma\)D) model. MG\(\Gamma\)D has the advantage of genericity in terms of fitting over a variety of existing joint models. On the one hand, the generalized Gamma density function offers free shape parameters to characterize a wide range of heavy-tailed densities, i.e., the genericity. On the other hand, the inter-component, inter-band dependency is captured by the Gaussian copula, which offers adapted flexibility. Moreover, this model leads to a closed form for the probabilistic similarity measure in terms of parameters, i.e., the Kullback–Leibler divergence. By exploiting the separability between the copula and the marginal spaces, the closed form enables us to minimize the computational time needed to measure the discrepancy between two multivariate generalized Gamma densities, in comparison to other models which require a Monte Carlo method with expensive computation time. For evaluating the performance of our proposal, a K-nearest neighbor (KNN) classifier is then used to test the classification accuracy. Experiments on different benchmark color texture databases are conducted to highlight the effectiveness of the proposed model associated with the Kullback–Leibler divergence.

Journal ArticleDOI
TL;DR: This empirical study evaluates the impact of the per-dimension value cardinality (DVC) of image descriptors on the performance of large-scale similarity search, and experiments demonstrate the influence of DVCs in both sequential search and several state-of-the-art similarity search methods.
Abstract: In this empirical study, we evaluate the impact of the dimensions’ value cardinality (DVC) of image descriptors on the performance of large-scale similarity search. DVCs are inherent characteristics of image descriptors, defined for each dimension as the number of distinct values that dimension takes, thus expressing the dimension’s discriminative power. In our experiments, with six publicly available datasets of image descriptors of different dimensionality (64–5,000 dim) and size (240 K–1 M), (a) we show that DVC varies due to the existence of several extraction methods using different quantization and normalization techniques; (b) we also show that image descriptor extraction strategies tend to follow the same DVC distribution function family; therefore, similarity search strategies can exploit image descriptors’ DVCs, irrespective of the sizes of the datasets; (c) based on a canonical correlation analysis, we demonstrate that there is a significant impact of image descriptors’ DVCs on the performance of the baseline LSH method [8] and three state-of-the-art hashing methods: SKLSH [28], PCA-ITQ [10], SPH [12], as well as on the performance of the MSIDX method [34], which exploits the DVC information; (d) we experimentally demonstrate the influence of DVCs in both sequential search and the aforementioned similarity search methods and discuss the advantages of our findings. We hope that our work will motivate researchers to consider DVC analysis as a tool for the design of similarity search strategies in image databases.
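
DVC itself is straightforward to compute, as the sketch below shows: for each dimension, count the distinct values observed across the dataset.

```python
import numpy as np

def dimension_value_cardinality(X):
    """X: (n_items, n_dims) descriptor matrix -> distinct-value count per dimension."""
    return np.array([np.unique(X[:, d]).size for d in range(X.shape[1])])

X = np.random.default_rng(10).integers(0, 16, (1000, 64)).astype(float)
dvc = dimension_value_cardinality(X)   # higher DVC = more discriminative dimension
```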

Journal ArticleDOI
TL;DR: A semi-automatic labeling strategy that reduces the human annotator effort for medical image modality detection is proposed and evaluated on the ImageCLEFmed2012 data set containing approximately 300,000 images, showing that annotating <1 % of the data is sufficient to correctly label 49.95 % of the images.
Abstract: Medical image modality detection is a key step for indexing images from biomedical articles. Traditionally, complex supervised classification methods have been used for this. However, they rely on proportionally sized labeled training samples. With the increase in availability of image data it has become increasingly challenging to obtain reasonably accurate manual labels to train classifiers. Toward meeting this shortcoming, we propose a semi-automatic labeling strategy that reduces the human annotator effort. Each image is projected into several feature spaces, and each entry in these spaces is clustered in an unsupervised manner. The cluster centers for each feature representation are then labeled by a human annotator, and the labels propagated through each cluster. To find the optimal cluster numbers for each feature space, a so-called “jump” method is used. The final label of an image is decided by a voting scheme that summarizes the different opinions on the same image provided by the different feature representations. The proposed method is evaluated on the ImageCLEFmed2012 data set containing approximately 300,000 images, and shows that annotating <1 % of the data is sufficient to correctly label 49.95 % of the images. The method spared approximately 700 h of human annotation labor and associated costs.
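
For one feature space, the cluster-label-propagate loop and the cross-space vote can be sketched as follows; the "jump" method for choosing the cluster count and the human annotator are stubbed out with placeholders.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def propagate_labels(X, k, annotate_center):
    """Cluster one feature space, label only the centers, spread to members."""
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(X)
    center_labels = [annotate_center(c) for c in km.cluster_centers_]
    return np.array([center_labels[a] for a in km.labels_])

def vote(per_space_labels):
    """Majority vote over the labels each feature space assigned to an image."""
    return [Counter(lbls).most_common(1)[0][0] for lbls in zip(*per_space_labels)]

rng = np.random.default_rng(11)
X1, X2 = rng.random((300, 32)), rng.random((300, 16))  # two feature spaces
fake_annotator = lambda center: int(center[0] > 0.5)   # stands in for a human
final = vote([propagate_labels(X, 8, fake_annotator) for X in (X1, X2)])
```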

Journal ArticleDOI
TL;DR: A new framework is proposed, which jointly employs color-based visual features and audio fingerprints for detecting duplicate videos and shows improved efficiency compared to the reference methods against a wide range of video transformations.
Abstract: Most studies in content-based video copy detection (CBCD) concentrate on visual signatures, while only very few efforts are made to exploit audio features. The audio data, if present, is an essential source of a video; hence, the integration of visual-acoustic fingerprints significantly improves the copy detection performance. Based on this aspect, we propose a new framework, which jointly employs color-based visual features and audio fingerprints for detecting duplicate videos. The proposed framework incorporates three stages: first, a novel visual fingerprint based on spatio-temporal dominant color features is generated; second, mel-frequency cepstral coefficients are extracted and compactly represented as acoustic signatures; third, the resultant multimodal signatures are jointly used for the CBCD task, by employing combination rules and weighting strategies. The results of experiments on TRECVID 2008 and 2009 datasets demonstrate the improved efficiency of the proposed framework compared to the reference methods against a wide range of video transformations.
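
The late-fusion stage reduces to a weighted combination of per-modality scores; the weights and threshold below are illustrative placeholders, not the paper's tuned values.

```python
def fused_score(visual_dist, audio_dist, w_visual=0.5, w_audio=0.5):
    """Lower is better; a copy is declared when the fused score clears a threshold."""
    return w_visual * visual_dist + w_audio * audio_dist

is_copy = fused_score(0.12, 0.30) < 0.25   # illustrative threshold
```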

Journal ArticleDOI
TL;DR: A semi-supervised learning method, named Multiple Binary Subspace Regression, is proposed for cross-media data concept detection; it projects the original cross-media data onto the same subspace-level representation simultaneously by mapping to the corresponding subspaces for dimensionality reduction.
Abstract: Due to the ubiquitous existence of large-scale data in today’s real-world applications, including learning on cross-media data, we propose a semi-supervised learning method, named Multiple Binary Subspace Regression, for cross-media data concept detection. In order to mine the common features among the data with multiple modalities, we project the original cross-media data onto the same subspace-level representation simultaneously by mapping to the corresponding subspaces for dimensionality reduction. All the subspaces are set to be binary, which only involves addition operations and omits multiplication operations in the subsequent computation, owing to the good properties of binary values. The dimensionality reduction to a binary subspace and the concept detection on this subspace are also optimized simultaneously, leading to a semi-supervised model. For dealing with large-scale data, our learning method is easily implemented to run in a MapReduce-based Hadoop system. Empirical studies demonstrate its competitive performance on convergence, efficiency, and scalability in comparison with the state-of-the-art literature.

Journal ArticleDOI
TL;DR: Testing on four datasets belonging to three different classification tasks showed that the proposed method outperforms previous local pooling methods in the feature space while using lower feature dimensionality.
Abstract: In this paper, we propose a novel feature-space local pooling method for the commonly adopted image classification architecture. While existing methods partition the feature space based on visual appearance to obtain pooling bins, learning a more accurate space partitioning that takes semantics into account boosts performance even for a smaller number of bins. To this end, we propose partitioning the feature space over clusters of visual prototypes common to semantically similar images (i.e., images belonging to the same category). The clusters are obtained by Bregman co-clustering applied offline on a subset of training data. Therefore, being aware of the semantic context of the input image, our features have higher discriminative power than those pooled from appearance-based partitioning. Testing on four datasets (Caltech-101, Caltech-256, 15 Scenes, and 17 Flowers) belonging to three different classification tasks showed that the proposed method outperforms previous local pooling methods in the feature space while using lower feature dimensionality. Moreover, when implemented within a spatial pyramid, our method achieves comparable results on three of the datasets used.

Journal ArticleDOI
TL;DR: VIDCAR, an unsupervised framework for Content-Based Video Retrieval (CBVR) that represents the dynamics of a spatio-temporal model extracted from video shots, achieved higher precision and recall than competing methods on five datasets.
Abstract: This paper presents VIDeo Content Analysis and Retrieval (VIDCAR), an unsupervised framework for Content-Based Video Retrieval (CBVR) that represents the dynamics of a spatio-temporal model extracted from video shots. We propose Dynamic Multi Spectro Temporal-Curvature Scale Space (DMST-CSS), an improved feature descriptor for enhancing the performance of the CBVR task. Our primary contribution is the representation of the dynamics of the evolution of the MST-CSS surface. Unlike the earlier MST-CSS descriptor [22], which extracts geometric features after the evolving MST-CSS surface converges to a final formation, DMST-CSS captures the dynamics of the evolution (formation) of the surface and is thus more robust. We represent the dynamics of the MST-CSS surface as a multivariate time series to obtain a DMST-CSS descriptor. A global kernel alignment technique has been adapted to compute a match cost between query and model DMST-CSS descriptors. In our experiments, VIDCAR achieved higher precision and recall than competing methods on five datasets.

Journal ArticleDOI
Michael S. Lew
TL;DR: This special issue presents new ideas and developments in the field of video retrieval, comprising peer-reviewed papers recommended and carefully reviewed by the editorial board and prominent experts.
Abstract: Welcome to the special issue on video retrieval. Recently there has been an explosion in digital video due to the prevalence of digital television, movie streaming, digital surveillance, and massive video collections such as NetFlix, iTunes, and Amazon. With the immense amounts of video come the critical tasks of analyzing, browsing, and searching for video. In this special issue we present new ideas and developments in the field of video retrieval, in peer-reviewed papers recommended and carefully reviewed by the editorial board and prominent experts (special thanks to Mohan Kankanhalli, Stefan Rueger, and R. Manmatha). One approach toward bridging the semantic gap (this refers to the gap between machine low-level features and high-level human language) is by searching through storylines.

Journal ArticleDOI
TL;DR: A novel clustering framework inspired by a bioinformatics technique, namely DNA multiple sequence alignment (MSA), which is shown to be significantly faster than non-clustering-based n-gram and edit distance NDV retrieval techniques and yields better mean average precision retrieval accuracy.
Abstract: In this paper, we propose studying the impact of clustering on near-duplicate video (NDV) retrieval. The aim is to reduce the search space at retrieval time through a pre-processing clustering step performed on the dataset off-line, retrieving NDVs based on the formed clusters. Our contribution is a novel clustering framework inspired by a bioinformatics technique, namely DNA multiple sequence alignment (MSA). A series of video keyframes in chronological order is represented as an alphabetical genome, analogous to a DNA sequence, and MSA is employed to automatically partition the NDVs in a video collection into clusters. After discussing the advantages and shortcomings of the main state-of-the-art clustering approaches for video clustering in the theoretical part of the paper, we empirically evaluate the performance of the proposed MSA-based framework against five clustering algorithms representative of these mainstream approaches: Birch, Cure, Dbscan, Expectation-Maximization and Proclus. Also, we show that our clustering-based approach, while being significantly faster than non-clustering-based n-gram and edit distance NDV retrieval techniques, yields better mean average precision retrieval accuracy.
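
The genome encoding that makes MSA applicable can be sketched as follows: keyframe descriptors are quantized against a small visual codebook and each video becomes a string over a fixed alphabet, ready for sequence alignment; the MSA step itself and the clustering built on it are not reproduced, and the codebook size and alphabet are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # any fixed symbol set works; 20 symbols here

def video_to_genome(keyframe_feats, codebook):
    """Quantize each keyframe to its nearest codeword and emit one symbol per frame."""
    symbols = codebook.predict(keyframe_feats)
    return "".join(ALPHABET[s] for s in symbols)

rng = np.random.default_rng(12)
codebook = KMeans(n_clusters=20, n_init=4, random_state=0).fit(rng.random((2000, 64)))
genome = video_to_genome(rng.random((40, 64)), codebook)   # one video -> "GKD..."
```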