
Showing papers in "International Journal of Multimedia Information Retrieval in 2012"


Journal ArticleDOI
TL;DR: An overview of the literature concerning the automatic analysis of images of printed and handwritten musical scores is presented, along with a reference scheme for any researcher wanting to compare new OMR algorithms against well-known ones.
Abstract: For centuries, music has been shared and remembered by two traditions: aural transmission and in the form of written documents normally called musical scores. Many of these scores exist in the form of unpublished manuscripts and hence they are in danger of being lost through the normal ravages of time. To preserve the music some form of typesetting or, ideally, a computer system that can automatically decode the symbolic images and create new scores is required. Programs analogous to optical character recognition systems called optical music recognition (OMR) systems have been under intensive development for many years. However, the results to date are far from ideal. Each of the proposed methods emphasizes different properties and therefore makes it difficult to effectively evaluate its competitive advantages. This article provides an overview of the literature concerning the automatic analysis of images of printed and handwritten musical scores. For self-containment and for the benefit of the reader, an introduction to OMR processing systems precedes the literature overview. The following study presents a reference scheme for any researcher wanting to compare new OMR algorithms against well-known ones.

246 citations


Journal ArticleDOI
TL;DR: A new algorithm using directional local extrema patterns for content-based image retrieval is proposed; it shows a significant improvement in evaluation measures compared with other existing methods on the respective databases.
Abstract: In this paper, a new algorithm using directional local extrema patterns meant for content-based image retrieval application is proposed. The standard local binary pattern (LBP) encodes the relationship between a reference pixel and its surrounding neighbors by comparing gray-level values. The proposed method differs from the existing LBP in that it extracts directional edge information based on local extrema in the 0°, 45°, 90°, and 135° directions in an image. Performance is compared with LBP, block-based LBP (BLK_LBP), center-symmetric local binary pattern (CS-LBP), local edge patterns for segmentation (LEPSEG), local edge patterns for image retrieval (LEPINV), and other existing transform-domain methods by conducting four experiments on benchmark databases, viz. the Corel (DB1) and Brodatz (DB2) databases. The results show a significant improvement in the evaluation measures compared with other existing methods on the respective databases.

171 citations
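As a point of reference, the standard LBP baseline that the directional extrema descriptor above extends can be sketched in a few lines. This is the generic textbook operator, not the authors' directional variant; the sample patch values are illustrative:

```python
import numpy as np

def lbp_code(patch):
    """Standard LBP code for a 3x3 patch: each of the 8 neighbors
    (clockwise from the top-left corner) contributes a bit that is 1
    when the neighbor is at least as bright as the center pixel."""
    center = patch[1, 1]
    coords = [(0, 0), (0, 1), (0, 2), (1, 2),
              (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[r, c] >= center else 0 for r, c in coords]
    # Pack the 8 comparison bits into a single byte (0..255).
    return sum(b << i for i, b in enumerate(bits))

patch = np.array([[90, 120, 60],
                  [70, 100, 130],
                  [40, 110, 80]])
print(lbp_code(patch))  # → 42
```

Applying the operator at every pixel and histogramming the resulting codes yields the image signature that the compared retrieval methods build on.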


Journal ArticleDOI
TL;DR: This article focuses on the topic of content-based image retrieval using interactive search techniques, i.e., how does one interactively find any kind of imagery from any source, regardless of whether it is photographic, MRI or X-ray?
Abstract: We are living in an Age of Information where the amount of accessible data from science and culture is almost limitless. However, this also means that finding an item of interest is increasingly difficult, a digital needle in the proverbial haystack. In this article, we focus on the topic of content-based image retrieval using interactive search techniques, i.e., how does one interactively find any kind of imagery from any source, regardless of whether it is photographic, MRI or X-ray? We highlight trends and ideas from over 170 recent research papers aiming to capture the wide spectrum of paradigms and methods in interactive search, including its subarea relevance feedback. Furthermore, we identify promising research directions and several grand challenges for the future.

106 citations
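For the relevance-feedback subarea mentioned above, the classic Rocchio update is the textbook starting point: the query vector moves toward items the user marked relevant and away from those marked non-relevant. The sketch below is a generic illustration (vectors and weights are hypothetical), not an algorithm from the surveyed papers:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio update: move the query vector toward the mean of
    user-marked relevant items and away from the non-relevant ones."""
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q

q = np.array([1.0, 0.0])                     # initial query feature vector
rel = np.array([[0.0, 1.0], [0.0, 3.0]])     # images marked relevant
nonrel = np.array([[2.0, 0.0]])              # images marked non-relevant
print(rocchio(q, rel, nonrel))  # → [0.7 1.5], drifting toward the relevant cluster
```

Iterating this update over several feedback rounds is the simplest form of the interactive loop the article surveys.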


Journal ArticleDOI
TL;DR: The objective of video data mining, one of the core problem areas of the data-mining research community, is to discover and describe interesting patterns from the huge amount of video data.
Abstract: Data mining is a process of extracting previously unknown knowledge and detecting the interesting patterns from a massive set of data. Thanks to the extensive use of information technology and the recent developments in multimedia systems, the amount of multimedia data available to users has increased exponentially. Video is an example of multimedia data as it contains several kinds of data such as text, image, meta-data, visual and audio. It is widely used in many major potential applications like security and surveillance, entertainment, medicine, education programs and sports. The objective of video data mining is to discover and describe interesting patterns from the huge amount of video data as it is one of the core problem areas of the data-mining research community. Compared to the mining of other types of data, video data mining is still in its infancy. There are many challenging research problems existing with video mining. Beginning with an overview of the video data-mining literature, this paper concludes with the applications of video mining.

51 citations


Journal ArticleDOI
TL;DR: This paper investigates the selection of semantic concepts for lifelogging which includes reasoning on semantic networks using a density-based approach and compares different semantic reasoning approaches to show the efficacy of this approach.
Abstract: Concept-based indexing, based on identifying various semantic concepts appearing in multimedia, is an attractive option for multimedia retrieval and much research tries to bridge the semantic gap between the media’s low-level features and high-level semantics. Research into concept-based multimedia retrieval has generally focussed on detecting concepts from high-quality media such as broadcast TV or movies, but it is not well addressed in other domains like lifelogging where the original data is captured with poorer quality. We argue that in noisy domains such as lifelogging, the management of data needs to include semantic reasoning in order to deduce a set of concepts to represent lifelog content for applications like searching, browsing or summarization. Using semantic concepts to manage lifelog data relies on the fusion of automatically detected concepts to provide a better understanding of the lifelog data. In this paper, we investigate the selection of semantic concepts for lifelogging which includes reasoning on semantic networks using a density-based approach. In a series of experiments we compare different semantic reasoning approaches and the experimental evaluations we report on lifelog data show the efficacy of our approach.

35 citations


Journal ArticleDOI
TL;DR: An ontology learning scheme is proposed in this paper which combines standard multimedia analysis techniques with knowledge drawn from conceptual meta-data to learn a domain-specific multimedia ontology from a set of annotated examples.
Abstract: A domain-specific ontology models a specific domain or part of the world. In fact, ontologies have proven to be an excellent medium for capturing the knowledge of a domain. We propose an ontology learning scheme in this paper which combines standard multimedia analysis techniques with knowledge drawn from conceptual meta-data to learn a domain-specific multimedia ontology from a set of annotated examples. A standard machine-learning algorithm that learns structure and parameters of a Bayesian network is extended to include media observables in the learning. An expert group provides domain knowledge to construct a basic ontology of the domain as well as to annotate a set of training videos. These annotations help derive the associations between high-level semantic concepts of the domain and low-level media features. We construct a more robust and refined version of the basic ontology by learning from this set of conceptually annotated data. We show an application of our ontology-based framework for exploration of multimedia content, in the field of cultural heritage preservation. By constructing an ontology for the cultural heritage domain of Indian classical dance, and by offering an application for semantic annotation of the heritage collection of Indian dance videos, we demonstrate the efficacy of our approach.

24 citations


Journal ArticleDOI
TL;DR: The basic idea of the proposed indexing framework is to maintain a large pool of over-complete hashing functions, which are randomly generated and shared when indexing diverse multimedia semantics, and proposes a sequential bit-selection algorithm based on local consistency and global regularization.
Abstract: In the past decade, locality-sensitive hashing (LSH) has gained a large amount of attention from both the multimedia and computer vision communities owing to its empirical success and theoretical guarantees in large-scale multimedia indexing and retrieval. Original LSH algorithms are designed for generic metrics such as cosine similarity, the ℓ2-norm and the Jaccard index, and were later extended to support metrics learned from user-supplied supervision information. One of the common drawbacks of existing algorithms lies in their incapability to flexibly adapt to metric changes, along with their inefficacy when handling diverse semantics (e.g., the large number of semantic object categories in the ImageNet database), which motivates our proposed framework for reconfigurable hashing. The basic idea of the proposed indexing framework is to maintain a large pool of over-complete hashing functions, which are randomly generated and shared when indexing diverse multimedia semantics. For a specific semantic category, the algorithm adaptively selects the most relevant hashing bits by maximizing the consistency between semantic distance and hashing-based Hamming distance, thereby achieving reusability of the pre-computed hashing bits. Such a scheme especially benefits the indexing and retrieval of large-scale databases, since it facilitates one-off indexing rather than continuous computation-intensive maintenance toward metric adaptation. In practice, we propose a sequential bit-selection algorithm based on local consistency and global regularization. Extensive studies are conducted on large-scale image benchmarks to comparatively investigate the performance of different strategies for reconfigurable hashing. Despite the vast literature on hashing, to the best of our knowledge few efforts have addressed the reusability of hashing structures in large-scale data sets.

22 citations
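The pool-and-select idea can be illustrated with random-hyperplane (cosine) LSH: generate an over-complete pool of sign hash functions once, then compare items on a per-category subset of bits. This is a simplified sketch in which the bit subset is hardcoded; the paper's contribution is the algorithm that actually selects those bits:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bit_pool(dim, pool_size):
    """Over-complete pool of random-hyperplane (sign) hash functions,
    generated once and shared across all semantic categories."""
    return rng.standard_normal((pool_size, dim))

def hash_bits(x, pool):
    """One bit per hyperplane: which side of the hyperplane x falls on."""
    return (pool @ x > 0).astype(np.uint8)

def hamming(a, b, selected):
    """Hamming distance restricted to the bits chosen for one category."""
    return int(np.sum(a[selected] != b[selected]))

pool = make_bit_pool(dim=8, pool_size=64)
x, y = rng.standard_normal(8), rng.standard_normal(8)
bx, by = hash_bits(x, pool), hash_bits(y, pool)
selected = [3, 10, 17, 42]  # a bit subset a selector might pick for one category
print(hamming(bx, by, selected))
```

Because the 64-bit codes are computed once, switching categories only changes `selected`, which is the one-off-indexing property the abstract highlights.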


Journal ArticleDOI
TL;DR: This paper presents a novel re-ranking algorithm aiming to exploit contextual information for improving the effectiveness of rankings computed by CBIR systems, and shows that the method can be applied to other tasks, such as combining ranked lists obtained using different image descriptors and combining post-processing methods.
Abstract: Content-based image retrieval (CBIR) systems aim to retrieve the most similar images in a collection, given a query image. Since users are interested in the returned images placed at the first positions of ranked lists (which usually are the most relevant ones), the effectiveness of these systems is very dependent on the accuracy of ranking approaches. This paper presents a novel re-ranking algorithm aiming to exploit contextual information for improving the effectiveness of rankings computed by CBIR systems. In our approach, ranked lists and distance scores are used to create context images, later used for retrieving contextual information. We also show that our re-ranking method can be applied to other tasks, such as (a) combining ranked lists obtained using different image descriptors (rank aggregation) and (b) combining post-processing methods. Conducted experiments involving shape, color, and texture descriptors and comparisons with other post-processing methods demonstrate the effectiveness of our method.

21 citations
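As a minimal illustration of the rank-aggregation task mentioned above, a Borda count combines ranked lists by summing positional scores. This is a classic baseline, not the paper's context-based re-ranking method, and the item names are hypothetical:

```python
def borda_aggregate(ranked_lists):
    """Borda-count rank aggregation: each list awards an item
    (list_length - rank) points; items are re-ranked by total points."""
    scores = {}
    for ranking in ranked_lists:
        n = len(ranking)
        for rank, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - rank)
    return sorted(scores, key=scores.get, reverse=True)

# Ranked lists from two hypothetical descriptors (shape and color).
by_shape = ["img3", "img1", "img2"]
by_color = ["img3", "img2", "img1"]
print(borda_aggregate([by_shape, by_color]))  # img3 is ranked first by both lists
```

An image placed highly by several descriptors accumulates a large score, which is the intuition behind combining complementary shape, color, and texture rankings.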


Journal ArticleDOI
TL;DR: This work derives the training and inference rules for the smallest possible non-degenerated mm-pLSA model: a model with two leaf-pLSAs and a single top-level pLSA node merging the two leaf-pLSAs.
Abstract: In this work, we extend the standard single-layer probabilistic Latent Semantic Analysis (pLSA) (Hofmann in Mach Learn 42(1–2):177–196, 2001) to multiple layers. As multiple layers should naturally handle multiple modalities and a hierarchy of abstractions, we denote this new approach multilayer multimodal probabilistic Latent Semantic Analysis (mm-pLSA). We derive the training and inference rules for the smallest possible non-degenerated mm-pLSA model: a model with two leaf-pLSAs and a single top-level pLSA node merging the two leaf-pLSAs. We evaluate this approach on two pairs of different modalities: SIFT features and image annotations (tags) as well as the combination of SIFT and HOG features. We also propose a fast and strictly stepwise forward procedure to initialize the bottom–up mm-pLSA model, which in turn can then be post-optimized by the general mm-pLSA learning algorithm. The proposed approach is evaluated in a query-by-example retrieval task where various variants of our mm-pLSA system are compared to systems relying on a single modality and other ad-hoc combinations of feature histograms. We further describe possible pitfalls of the mm-pLSA training and analyze the resulting model yielding an intuitive explanation of its behaviour.

21 citations
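The single-layer pLSA building block that mm-pLSA stacks into a hierarchy can be sketched as standard EM over a document-word count matrix. The toy corpus below is illustrative; the paper's multilayer extension and its initialization procedure are not reproduced here:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """EM for single-layer pLSA on a document-word count matrix.
    Returns P(z|d) and P(w|z). The mm-pLSA of the paper stacks such
    models into a hierarchy; this is only the single-layer block."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) proportional to P(z|d) * P(w|z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
        post /= post.sum(1, keepdims=True) + 1e-12
        weighted = counts[:, None, :] * post                 # n(d,w) * P(z|d,w)
        # M-step: re-estimate both distributions from the responsibilities.
        p_w_z = weighted.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = weighted.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

# Toy corpus: 4 "documents" over a 4-word vocabulary with two clear topics.
counts = np.array([[5, 3, 0, 0], [4, 4, 0, 1], [0, 0, 6, 4], [1, 0, 5, 5]], float)
p_z_d, p_w_z = plsa(counts, n_topics=2)
print(p_z_d.round(2))
```

In the multimodal setting, one such model per modality (e.g. SIFT words, tag words) feeds its topic activations into a shared top-level pLSA node.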


Journal ArticleDOI
Fei Wu1, Yahong Han1, Xiang Liu1, Jian Shao1, Yueting Zhuang1, Zhongfei Zhang1 
TL;DR: This paper introduces many of the recent efforts in sparsity-based heterogeneous feature selection, the representation of the intrinsic latent structure embedded in multimedia, and the related hashing index techniques.
Abstract: There is a rapid growth of the amount of multimedia data from real-world multimedia sharing web sites, such as Flickr and YouTube. These data are usually of high dimensionality, high order, and large scale. Moreover, different types of media data are interrelated everywhere in a complicated and extensive way by contextual priors. It is well known that we can obtain many features from multimedia such as images and videos; those high-dimensional features often describe various aspects of the characteristics of multimedia. However, the obtained features are often over-complete for describing certain semantics. Therefore, the selection of limited discriminative features for certain semantics is crucial to make the understanding of multimedia more interpretable. Furthermore, the effective utilization of intrinsic embedding structures in various features can boost the performance of multimedia retrieval. As a result, the appropriate representation of the latent information hidden in the related features is crucial during multimedia understanding. This paper introduces many of the recent efforts in sparsity-based heterogeneous feature selection, the representation of the intrinsic latent structure embedded in multimedia, and the related hashing index techniques.

21 citations
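As one concrete instance of sparsity-based feature selection, a lasso regression solved by ISTA drives most weights exactly to zero, keeping only a few discriminative features. This is a generic sketch with synthetic data, not one of the specific models the survey covers:

```python
import numpy as np

def lasso_ista(X, y, lam=0.1, lr=0.1, n_iter=500):
    """ISTA for the lasso: gradient step on the squared loss followed by
    soft-thresholding. Zero weights mark features the semantic ignores."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
        # Soft-thresholding induces exact sparsity in the weight vector.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))          # 100 samples, 10 candidate features
true_w = np.zeros(10); true_w[[2, 7]] = [1.5, -2.0]
y = X @ true_w + 0.01 * rng.standard_normal(100)
w = lasso_ista(X, y)
print(np.flatnonzero(np.abs(w) > 0.1))      # indices of the selected features
```

Only the two informative features survive the threshold, which is the interpretability argument the abstract makes for sparse selection.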


Journal ArticleDOI
TL;DR: By further extending this descriptor, the à trous gradient structure descriptor (AGSD) is proposed for content-based image retrieval, and it improves the retrieval performance significantly.
Abstract: This paper first introduces an à trous wavelet correlogram feature descriptor for image representation. By further extending this descriptor, an à trous gradient structure descriptor (AGSD) is proposed for content-based image retrieval. AGSD computes features with the help of the à trous wavelet's orientation information in a local manner. The local information of the image is extracted through the microstructure descriptor (MSD), which finds the relations between neighborhood pixels. Finally, the relation between the à trous quantized image and the MSD image is used for final feature extraction. The experiments are performed on the Corel 1000, Corel 2450, and MIRFLICKR 25000 databases. The average precision, weighted precision, standard deviation of weighted precision, average recall, standard deviation of recall, rank, etc., of the proposed methods are compared with the optimal quantized wavelet correlogram, the Gabor wavelet correlogram, and the combination of standard and rotated wavelet filter correlograms. It is concluded that the proposed methods improve the retrieval performance significantly.

Journal ArticleDOI
TL;DR: A novel approach that utilizes noisy shot-level visual concept detection to improve text-based video retrieval and considers entire videos as the retrieval units and focuses on queries that address a general subject matter (semantic theme) of a video.
Abstract: In this paper, we present a novel approach that utilizes noisy shot-level visual concept detection to improve text-based video retrieval. As opposed to most of the related work in the field, we consider entire videos as the retrieval units and focus on queries that address a general subject matter (semantic theme) of a video. Retrieval is performed using a coherence-based query performance prediction framework. In this framework, we make use of video representations derived from the visual concepts detected in videos to select the best possible search result given the query, video collection, available search mechanisms and the resources for query modification. In addition to investigating the potential of this approach to outperform typical text-based video retrieval baselines, we also explore the possibility to achieve further improvement in retrieval performance through combining our concept-based query performance indicators with the indicators utilizing the spoken content of the videos. The proposed retrieval approach is data driven, requires no prior training and relies exclusively on the analyses of the video collection and different results lists returned for the given query text. The experiments are performed on the MediaEval 2010 datasets and demonstrate the effectiveness of our approach.

Journal ArticleDOI
TL;DR: Experiments demonstrate that the AVG-based audio-visual representation can achieve consistent and significant performance improvements compared with other state-of-the-art approaches.
Abstract: We investigate general concept classification in unconstrained videos by joint audio-visual analysis. An audio-visual grouplet (AVG) representation is proposed based on analyzing the statistical temporal audio-visual interactions. Each AVG contains a set of audio and visual codewords that are grouped together according to their strong temporal correlations in videos, and the AVG carries unique audio-visual cues to represent the video content. By using the entire AVGs as building elements, video concepts can be more robustly classified than using traditional vocabularies with discrete audio or visual codewords. Specifically, we conduct coarse-level foreground/background separation in both audio and visual channels, and discover four types of AVGs by exploring mixed-and-matched temporal audio-visual correlations among the following factors: visual foreground, visual background, audio foreground, and audio background. All of these types of AVGs provide discriminative audio-visual patterns for classifying various semantic concepts. To effectively use the AVGs for improved concept classification, a distance metric learning algorithm is further developed. Based on the AVG structure, the algorithm uses an iterative quadratic programming formulation to learn the optimal distances between data points according to the large-margin nearest-neighbor setting. Various types of grouplet-based distances can be computed using individual AVGs, and through our distance metric learning algorithm these grouplet-based distances can be aggregated for final classification. We extensively evaluate our method over the large-scale Columbia consumer video set. Experiments demonstrate that the AVG-based audio-visual representation can achieve consistent and significant performance improvements compared with other state-of-the-art approaches.

Journal ArticleDOI
Michael S. Lew1
TL;DR: The first journal focused on the field, The International Journal of Multimedia Information Retrieval (IJMIR), combines the vibrant areas related to retrieval of images, audio, video, and medical and scientific imagery with understanding of the human side of the system including the social aspects which are crucial to interactive search.
Abstract: Welcome, friends and colleagues. Just one decade in the past would show our field as being composed of a few scattered workshops. Since then there has been tremendous growth in both research and applications. In the past 4 years, as a community we have published over 1,000 peer-reviewed research papers in areas ranging from image and audio retrieval to scientific and medical image search. We have developed our own test sets, software libraries, evaluation benchmarks, workshops and major conferences (notably ICMR, CIVR, and MIR). Only one important piece was missing to make the picture complete: a research journal. Now, it is my pleasure to introduce the first journal focused on our field, The International Journal of Multimedia Information Retrieval (IJMIR). Our new journal encompasses all topics related to retrieval, exploration, and mining of media databases and collections. It combines the vibrant areas related to retrieval of images, audio, video, and medical and scientific imagery with understanding of the human side of the system, including the social aspects which are crucial to interactive search. An unarguable strength of our field is that it plays an important role in diverse scientific areas because many significant recent discoveries have been made through analyzing and searching local and international imaging databases. In our journal, there are essentially two paper categories: regular papers (original research) and surveys. Regular papers are typically triple peer-reviewed and are meant to represent the novel research from the community. Surveys undergo a peer review from a senior community member and are meant to capture trends, illuminate critical problems, and give insight into the state-of-the-art.

Journal ArticleDOI
TL;DR: A statistical framework for large-scale near-duplicate image retrieval which unifies the two steps by introducing a kernel density function, and which is not only more effective but also more efficient than the bag-of-words model.
Abstract: The bag-of-words model is one of the most widely used methods in recent studies of multimedia data retrieval. The key idea of the bag-of-words model is to quantize the bag of local features, for example SIFT, into a histogram of visual words, after which standard information retrieval technologies developed for text retrieval can be applied directly. Despite its success, one problem of the bag-of-words model is that the two key steps, i.e., feature quantization and retrieval, are separated. In other words, the step of generating the bag-of-words representation is not optimized for the step of retrieval, which often leads to sub-optimal performance. In this paper we propose a statistical framework for large-scale near-duplicate image retrieval which unifies the two steps by introducing a kernel density function. The central idea of the proposed method is to represent each image by a kernel density function; the similarity between the query image and a database image is then estimated as the query likelihood. In order to make the proposed method applicable to large-scale data sets, we have developed efficient algorithms for both estimating the density function of each image and computing the query likelihood. Our empirical studies confirm that the proposed method is not only more effective but also more efficient than the bag-of-words model.
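The feature-quantization step that the paper's kernel-density view replaces can be sketched as hard nearest-word assignment followed by histogramming. Vocabulary size and the random data below are illustrative:

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize each local descriptor (e.g. a SIFT vector) to its nearest
    visual word and accumulate a normalized word histogram."""
    # Squared Euclidean distance of every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
vocab = rng.standard_normal((16, 128))   # 16 visual words, SIFT-sized (128-D)
desc = rng.standard_normal((40, 128))    # 40 local descriptors from one image
h = bow_histogram(desc, vocab)
print(h.shape, round(h.sum(), 6))
```

The hard `argmin` assignment is exactly the lossy step the abstract criticizes: replacing it with a kernel density over the descriptors keeps the information that quantization discards.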

Journal ArticleDOI
TL;DR: The experimental results demonstrate the effectiveness of the proposed framework, which incorporates a set of discriminative image features to select representative images for fast location-based scene matching, reducing the computational time by an order of magnitude.
Abstract: SIFT-based methods have been widely used for scene matching of photos taken at particular locations or places of interest. These methods are typically very time consuming due to the large number and high dimensionality of features used, making them unfeasible for use in consumer image collections containing a large number of images where computational power is limited and a fast response is desired. Considerable computational savings can be realized if images containing signature elements of particular locations can be automatically identified from the large number of images and only these representative images used for scene matching. We propose an efficient framework incorporating a set of discriminative image features that effectively enables us to select representative images for fast location-based scene matching. These image features are used for classifying images into good or bad candidates for scene matching, using different classification approaches. Furthermore, the image features created from our framework can facilitate the process of using sub-images for location-based scene matching with SIFT features. The experimental results demonstrate the effectiveness of our approach compared with the traditional SIFT-, PCA-SIFT-, and SURF-based approaches by reducing the computational time by an order of magnitude.

Journal ArticleDOI
TL;DR: The problem of cost-sensitive unsupervised learning of visual concepts from social images is reviewed, new ideas are presented, and a comparative evaluation of representative approaches from the research literature is given.
Abstract: Visual concept learning typically requires a set of expert labeled, manual training images. However, acquiring a sufficient number of reliable annotations can be time-consuming or impractical. Therefore, in many situations it is preferable to perform unsupervised learning on user contributed tags from abundant sources such as social Internet communities and websites. Cost-sensitive learning is a natural approach toward unsupervised visual concept learning because it fundamentally optimizes the learning system accuracy regarding the cost of an error. This paper reviews the problem of cost-sensitive unsupervised learning of visual concepts from social images, presents the new ideas, and gives a comparative evaluation of representative approaches from the research literature.
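The core of a cost-sensitive decision rule is to compare expected costs rather than raw probabilities. The sketch below uses illustrative cost values, not figures from the paper:

```python
def cost_sensitive_label(p_positive, cost_fp, cost_fn):
    """Cost-sensitive decision rule: predict positive only when the
    expected cost of a false positive is lower than the expected cost
    of a false negative, given the estimated concept probability."""
    expected_cost_pos = (1 - p_positive) * cost_fp  # risk of asserting wrongly
    expected_cost_neg = p_positive * cost_fn        # risk of missing the concept
    return expected_cost_pos < expected_cost_neg

# With noisy social tags, asserting a wrong concept may be costlier than
# missing one, so false positives are priced higher (values illustrative).
print(cost_sensitive_label(0.6, cost_fp=3.0, cost_fn=1.0))  # → False
print(cost_sensitive_label(0.9, cost_fp=3.0, cost_fn=1.0))  # → True
```

Equivalently, the rule accepts a concept only when p exceeds cost_fp / (cost_fp + cost_fn), here 0.75, which is how the cost ratio shifts the decision threshold away from 0.5.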

Journal ArticleDOI
TL;DR: It was discovered that the behaviour of novice users begins to emulate that of expert users as the novices gain more expertise; however, it was also found that the perceptions of novice users, even with additional background knowledge, of the tools, collection and performance do not always match those of the expert users.
Abstract: Contemporary video retrieval systems are wanting in terms of helping users find appropriate videos. There are a number of reasons for this, including a lack of appropriate representations for video and the semantic gap. These problems are amplified by the fact that the importance of expertise for effective video search is not well understood. In an attempt to garner a greater understanding of the impact of expertise on video search, a user evaluation that is designed to investigate the role of expertise in video search was conducted. In our evaluation participants were given a number of video search tasks and were asked to find relevant videos using two different interfaces: the first interface required users to use background knowledge to find relevant videos and the second interface allowed users to use video search tools to complete the task. Three groups of users with varying search expertise carried out these video search tasks, with the objective that the behaviour and success of the different user groups could be examined. It was discovered that the behaviour of novice users begins to emulate that of the expert users as the novice users gain more expertise. However, it was also found that the perceptions of novice users, even with additional background knowledge, of the tools, collection and performance do not always match that of the expert users.

Journal ArticleDOI
TL;DR: This paper addresses the shape retrieval problem by casting it into the task of identifying “authority” nodes in an inferred similarity graph and by re-ranking the shapes: the average similarity between a node and its neighboring nodes takes the local distribution into account and thus guides the re-ranking.
Abstract: A critical issue in shape retrieval systems is that when a user submits a query shape, some shapes in the database are returned relatively often, while some are returned only when submitting specific queries. Intuitively, this phenomenon yields suboptimal retrieval accuracy. In this paper, we address the shape retrieval problem by casting it into the task of identifying “authority” nodes in an inferred similarity graph and also by re-ranking the shapes. The main idea is that the average similarity between a node and its neighboring nodes takes into account the local distribution, and therefore, helps modify the neighborhood edge weight, which guides the re-ranking. The proposed approach is evaluated on both 2D and 3D shape datasets, and the experimental results show that the proposed neighborhood induced similarity measure significantly improves the shape retrieval performance. Moreover, the computational speed of the proposed method is extremely fast.
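A simplified sketch of the neighborhood-induced idea: blend each item's direct similarity to the query with the query's average similarity to that item's nearest neighbors, so items sitting in dense, query-consistent neighborhoods move up the list. The 0.5/0.5 blend and the toy similarity matrix are hypothetical, not the paper's exact formulation:

```python
import numpy as np

def rerank_by_neighborhood(sim, query_idx, k=3):
    """Re-rank database items for one query by mixing direct similarity
    with the neighborhood-induced similarity of each item."""
    n = sim.shape[0]
    s = sim.copy().astype(float)
    np.fill_diagonal(s, -np.inf)   # never count an item as its own neighbor
    refined = np.zeros(n)
    for i in range(n):
        nbrs = np.argsort(-s[i])[:k]   # k nearest neighbors of item i
        # Blend direct similarity with similarity to i's neighborhood.
        refined[i] = 0.5 * sim[query_idx, i] + 0.5 * sim[query_idx, nbrs].mean()
    order = np.argsort(-refined)
    return order[order != query_idx]

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 4))
sim = x @ x.T                      # toy symmetric similarity matrix
print(rerank_by_neighborhood(sim, query_idx=0))
```

Because the refined score depends on whole neighborhoods rather than single pairwise values, shapes that were rarely returned for any query can still surface when their local context matches the query well.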