
Showing papers in "International Journal of Multimedia Information Retrieval in 2017"


Journal ArticleDOI
TL;DR: An overview of recent and efficient methods for traffic sign detection and classification that divides detection methods into three main categories: color-based (classified according to the color space), shape-based, and learning-based methods (including deep learning).
Abstract: Over the last few years, different traffic sign recognition (TSR) systems have been proposed. The present paper introduces an overview of recent and efficient methods for traffic sign detection and classification. The main goal of detection methods is to localize regions of interest containing traffic signs, and we divide detection methods into three main categories: color-based (classified according to the color space), shape-based, and learning-based methods (including deep learning). In addition, we divide classification methods into two categories: learning methods based on hand-crafted features (HOG, LBP, SIFT, SURF, BRISK) and deep learning methods. For easy reference, the different detection and classification methods are summarized in tables along with the different datasets. Furthermore, future research directions and recommendations are given in order to boost TSR performance.

64 citations
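
As an illustration of the hand-crafted-feature route this survey covers, here is a minimal sketch of a HOG-plus-linear-SVM sign classifier. The data below is random placeholder input (a real system would train on cropped sign patches from a benchmark such as GTSRB), and the HOG parameters are common defaults, not values taken from the survey.

```python
# Hedged sketch: HOG features + linear SVM for traffic sign classification.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Placeholder data standing in for cropped, labelled 32x32 sign patches.
rng = np.random.default_rng(0)
images = rng.random((200, 32, 32))
labels = rng.integers(0, 4, size=200)

def extract_hog(image):
    # 9 orientation bins over 8x8-pixel cells is a common HOG configuration.
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

features = np.array([extract_hog(img) for img in images])
X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)

clf = LinearSVC().fit(X_tr, y_tr)   # linear SVM over hand-crafted features
print("accuracy:", clf.score(X_te, y_te))  # near chance on random data
```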


Journal ArticleDOI
TL;DR: An extensive statistical analysis of the LFM-1b dataset is provided, and the dataset's usage is illustrated in a simple artist recommendation task whose results are intended to serve as a baseline against which more elaborate techniques can be assessed.
Abstract: Recently, the LFM-1b dataset has been proposed to foster research and evaluation in music retrieval and music recommender systems (Schedl, Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, 2016). It contains more than one billion music listening events created by more than 120,000 users of Last.fm. Each listening event is characterized by artist, album, and track name, and further includes a timestamp. Basic demographic information and a selection of more elaborate listener-specific descriptors are included as well, for anonymized users. In this article, we reveal information about LFM-1b's acquisition and content, and we compare it to existing datasets. We furthermore provide an extensive statistical analysis of the dataset, including basic properties of the item sets, demographic coverage, distribution of listening events (e.g., over artists and users), and aspects related to music preference and consumption behavior (e.g., temporal features and mainstreaminess of listeners). Exploiting country information of users and genre tags of artists, we also create taste profiles for populations and determine similar and dissimilar countries in terms of their populations' music preferences. Finally, we illustrate the dataset's usage in a simple artist recommendation task, whose results are intended to serve as a baseline against which more elaborate techniques can be assessed.

38 citations
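
A baseline artist recommender of the kind the abstract mentions could be as simple as user-based collaborative filtering over a play-count matrix. This is a hedged sketch on random placeholder data, not the paper's exact baseline and not LFM-1b itself.

```python
# Minimal user-based collaborative filtering over a users x artists matrix.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(100, 500)).astype(float)  # users x artists

def recommend(counts, user, top_k=10):
    # Cosine similarity between the target user and all other users.
    norms = np.linalg.norm(counts, axis=1) + 1e-12
    sims = counts @ counts[user] / (norms * norms[user])
    sims[user] = 0.0                       # ignore self-similarity
    # Score artists by similarity-weighted play counts; mask heard artists.
    scores = sims @ counts
    scores[counts[user] > 0] = -np.inf
    return np.argsort(scores)[::-1][:top_k]

print(recommend(counts, user=0))
```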


Journal ArticleDOI
TL;DR: This paper gives an overview of published research efforts in the area of handwritten document image word spotting and of the technologies used in the field, and describes a general model for document word spotting.
Abstract: Along with the explosive growth in the amount of handwritten documents that are preserved, processed and accessed in digital form, word spotting in handwritten document images has attracted many researchers from various research communities, such as pattern recognition, computer vision and information retrieval. Work on the problem of handwritten document word spotting has been an active research area, and significant progress has been made in the last few years. However, in spite of the great progress achieved, handwritten document word spotting can still hardly achieve acceptable performance on real-world handwritten document images that vary widely in writing style and quality. This paper gives an overview of published research efforts in the area of handwritten document image word spotting and of the technologies used in the field. We start by describing a general model for document word spotting, followed by a discussion of present challenges in handwritten document word spotting. Then the databases used for handwritten document word spotting and other handwritten text tasks are discussed. After that, research works on handwritten document word spotting are presented. Finally, several summary tables of published research work are provided, covering the handwritten document databases used and the performance results reported for handwritten document word spotting. These tables summarize different aspects and the reported accuracy for each technique.

36 citations
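
Among the classical techniques surveyed in this literature, query-by-example word spotting with dynamic time warping (DTW) over column profile features is a standard baseline. A minimal sketch, with random arrays standing in for segmented word images:

```python
import numpy as np

def column_profile(word_img, thresh=0.5):
    # Per-column ink density: a simple 1-D profile feature sequence.
    return (word_img < thresh).mean(axis=0)

def dtw(a, b):
    # Standard DTW cost between two 1-D feature sequences.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)   # length-normalized distance

# Rank candidate word images by DTW distance to the query.
# query_img / candidates are placeholders standing in for segmented words.
rng = np.random.default_rng(1)
query_img = rng.random((40, 120))
candidates = [rng.random((40, rng.integers(80, 160))) for _ in range(5)]
q = column_profile(query_img)
ranking = sorted(range(len(candidates)),
                 key=lambda i: dtw(q, column_profile(candidates[i])))
print(ranking)
```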


Journal ArticleDOI
TL;DR: The instance search benchmark worked with a variety of large collections of data, including Sound & Vision, Flickr and BBC Rushes, and with the small world of the BBC EastEnders series for the last three years.
Abstract: This paper presents an overview of the video instance search benchmark which was run over a period of six years (2010–2015) as part of the TREC Video Retrieval workshop series. The main contributions of the paper include (i) an examination of the evolving design of the evaluation framework and its components (system tasks, data, measures); (ii) an analysis of the influence of topic characteristics (such as rigid/non-rigid, planar/non-planar, stationary/mobile) on performance; (iii) a high-level overview of results and best-performing approaches. The instance search benchmark worked with a variety of large collections of data, including Sound & Vision, Flickr and BBC Rushes for the first three pilot years, and with the small world of the BBC EastEnders series for the last three years.

36 citations
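
TRECVID-style retrieval benchmarks are typically scored with (mean) average precision. A minimal sketch of per-topic average precision; for brevity it treats the relevant items found in the ranked list as the full relevant set, whereas the official measure divides by the number of relevant items in the whole collection.

```python
import numpy as np

def average_precision(ranked_relevance):
    # ranked_relevance: 1/0 relevance judgments in ranked result order.
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

# One hypothetical topic: relevance of the top-10 returned shots.
print(average_precision([1, 0, 1, 1, 0, 0, 1, 0, 0, 0]))  # ~0.75
```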


Journal ArticleDOI
TL;DR: A literature review of various methods for biomedical image indexing and retrieval, concentrating on methods based on the visual representation of medical images, since content-based medical image retrieval (CBMIR) approaches retrieve similar medical images more efficiently than text-based approaches.
Abstract: Medical imaging performs a vital role in the medical field, as it provides important information on the internal body parts for clinical analysis and medical intervention, which enables physicians to diagnose and treat a variety of diseases. Nowadays medical diagnosis is increasing at a very high rate, resulting in the formation of huge medical image databases, and retrieving similar medical images from such databases is a very difficult task. A literature review of various methods for biomedical image indexing and retrieval is presented here. Over 140 contributions from the literature are included in this survey. It mainly concentrates on methods based on the visual representation of the medical images, as content-based medical image retrieval (CBMIR) approaches retrieve similar medical images more efficiently than text-based biomedical image retrieval approaches. It also delineates how various ideas were adopted from different computer science methodologies for developing CBMIR systems.

31 citations


Journal ArticleDOI
TL;DR: This work represents the hierarchical structure of video with multiple granularities including, from short to long, single frame, consecutive frames (motion), short clip, and the entire video and proposes a novel deep learning framework to model each granularity individually.
Abstract: Video analysis is an important branch of computer vision due to its wide applications, ranging from video surveillance, video indexing, and retrieval to human-computer interaction. All of these applications depend on a good video representation, which encodes video content into a feature vector of fixed length. Most existing methods treat video as a flat image sequence, but from our observations we argue that video is an information-intensive medium with an intrinsic hierarchical structure, which is largely ignored by previous approaches. Therefore, in this work, we represent the hierarchical structure of video with multiple granularities including, from short to long, single frame, consecutive frames (motion), short clip, and the entire video. Furthermore, we propose a novel deep learning framework to model each granularity individually. Specifically, we model the frame and motion granularities with 2D convolutional neural networks and the clip and video granularities with 3D convolutional neural networks. Long Short-Term Memory networks are applied to the frame, motion, and clip streams to further exploit long-term temporal clues. Consequently, the whole framework utilizes multi-stream CNNs to learn a hierarchical representation that captures the spatial and temporal information of video. To validate its effectiveness in video analysis, we apply this video representation to the action recognition task. We adopt a distribution-based fusion strategy to combine the decision scores from all the granularities, which are obtained by using a softmax layer on top of each stream. We conduct extensive experiments on three action benchmarks (UCF101, HMDB51, and CCV) and achieve competitive performance against several state-of-the-art methods.

22 citations
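
The fusion of per-granularity decision scores described in the abstract can be illustrated with simple late fusion of softmax outputs. This sketch uses hypothetical logits and assumed fusion weights; it is not the paper's exact fusion strategy.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical per-stream logits for one video over 101 action classes
# (frame, motion, clip, and video granularities).
rng = np.random.default_rng(2)
streams = [rng.normal(size=101) for _ in range(4)]
weights = np.array([0.2, 0.3, 0.3, 0.2])   # assumed fusion weights

# Weighted average of the per-stream class score distributions.
fused = sum(w * softmax(s) for w, s in zip(weights, streams))
print("predicted class:", int(np.argmax(fused)))
```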


Journal ArticleDOI
TL;DR: In this paper, an attempt is made to analyze and classify various script identification schemes for document images; the schemes are compared, and their merits and demerits are discussed on a common platform.
Abstract: Script identification is a widely accepted technique for selecting the appropriate script-specific OCR (optical character recognition) engine for multilingual document images. Extensive research has been done in this field, but it still suffers from low identification accuracy. This is due to faded document images and to variations in illumination and position while scanning. Noise is also a major obstacle in the script identification process; however, it can only be reduced to a certain level and cannot be removed completely. In this paper, an attempt is made to analyze and classify various script identification schemes for document images. A comparison is also made between these schemes, and their merits and demerits are discussed on a common platform. This will help researchers understand the complexity of the issue and identify possible directions for research in this field.

21 citations


Journal ArticleDOI
TL;DR: Fast discrete curvelet transform-based anisotropic feature extraction for biomedical image indexing and retrieval, with experimental results showing the superiority of the proposed approach over well-known existing methods.
Abstract: This paper presents fast discrete curvelet transform-based anisotropic feature extraction for biomedical image indexing and retrieval. In this work, the curvelet transform is applied to the image and a feature vector is calculated using the directional energies of the curvelet coefficients. The effectiveness of the proposed approach has been tested on three well-known databases: Open Access Series of Imaging Studies (OASIS) MRI, Emphysema-CT and NEMA-CT. The performance of the proposed system is evaluated using average retrieval precision and average retrieval rate. The experimental results show the superiority of the proposed approach over well-known existing methods.

18 citations
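
The directional-energy feature described here is straightforward to sketch. The subbands below are random placeholders standing in for curvelet coefficients at several scales and orientations; a real system would obtain them from a fast discrete curvelet transform implementation, which is not assumed here.

```python
import numpy as np

def directional_energies(subbands):
    # One normalized energy per directional subband: E_s = sum|c|^2 / N_s.
    feats = np.array([np.sum(np.abs(c) ** 2) / c.size for c in subbands])
    return feats / (np.linalg.norm(feats) + 1e-12)   # L2-normalized vector

# Placeholder subbands standing in for curvelet coefficients at several
# scales/orientations.
rng = np.random.default_rng(3)
subbands = [rng.normal(size=(s, s)) for s in (64, 32, 32, 16, 16, 16, 16)]
query_vec = directional_energies(subbands)
print(query_vec)
```

Retrieval then reduces to ranking database images by the distance (e.g., Euclidean) between their feature vectors and the query vector.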


Journal ArticleDOI
TL;DR: A full video surveillance ontology is developed, following a formal naming syntax convention and semantics that addresses queries of both academic research and industrial applications, together with an ontology-based video surveillance indexing and retrieval system (OVIS) that uses a set of Semantic Web Rule Language (SWRL) rules to bridge the semantic gap.
Abstract: Nowadays, the diversity and large deployment of video recorders result in a large volume of video data, whose effective use requires a video indexing process. However, this process faces a major problem: the semantic gap between the extracted low-level features and the ground truth. The ontology paradigm provides a promising solution to overcome this problem. However, no naming syntax convention has been followed in the concept creation step, which constitutes another problem. In this paper, we have considered these two issues and have developed a full video surveillance ontology following a formal naming syntax convention and semantics that addresses queries of both academic research and industrial applications. In addition, we propose an ontology-based video surveillance indexing and retrieval system (OVIS) using a set of Semantic Web Rule Language (SWRL) rules that bridges the semantic gap. Existing indexing systems are currently based essentially on low-level features, with the ontology paradigm used only to support this process by representing the surveillance domain. In this paper, we developed the OVIS system based on SWRL rules, and the experiments show that our approach leads to promising results on the top video evaluation benchmarks and also points to new directions for future development.

17 citations


Journal ArticleDOI
TL;DR: The proposed system performs activity-based shot boundary detection in which only the possible transition regions of a video are considered for shot detection, reducing computational complexity by processing the transition regions only.
Abstract: The paper proposes a shot boundary detection system using Gist and local descriptors. Gist provides the perceptual and conceptual information of a scene. The proposed system can be roughly divided into three steps. The first step consists of forming groups of frames by calculating the correlation of the Gist features between consecutive frames of the video. Second, abrupt transitions are detected using the groups (G), MSER features, and a separate abrupt-transition threshold ($th_{cut}$). Lastly, gradual transitions are found using triangular pattern matching. We have performed experiments on the TRECVid 2001 and 2007 datasets. The novel contribution of this paper is that the proposed system performs activity-based shot boundary detection in which only the possible transition regions of a video are considered for shot detection. This approach reduces the computational complexity by processing the transition regions only. We have achieved better results in terms of F1, precision and recall when compared to previously published approaches.

14 citations
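
A minimal sketch of the first step: correlate descriptors of consecutive frames and flag low-correlation pairs as abrupt-transition candidates. A block-average descriptor stands in for Gist, the frames are random placeholders (so most boundaries fire here), and the threshold value is assumed rather than taken from the paper.

```python
import numpy as np

def descriptor(frame, grid=8):
    # Cheap stand-in for a Gist descriptor: block-averaged intensities.
    h, w = frame.shape
    f = frame[:h - h % grid, :w - w % grid]
    return f.reshape(grid, f.shape[0] // grid,
                     grid, f.shape[1] // grid).mean(axis=(1, 3)).ravel()

def correlation(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Random placeholder frames; a real system would decode an actual video.
rng = np.random.default_rng(4)
frames = [rng.random((120, 160)) for _ in range(50)]
th_cut = 0.5   # assumed abrupt-transition threshold
descs = [descriptor(f) for f in frames]
cuts = [i for i in range(1, len(descs))
        if correlation(descs[i - 1], descs[i]) < th_cut]
print("candidate cuts at frames:", cuts)
```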


Journal ArticleDOI
TL;DR: Five criteria (used models, tagging purpose, tagging right, object type, and used dataset) are introduced for evaluating tag-based information retrieval methods, forming a new categorical framework that engages the graphical models as well as the two-way classical methods.
Abstract: This paper aims to provide a comprehensive survey of tag-based information retrieval that covers three areas: tag-based document retrieval, tag-based image retrieval, and tag-based music information retrieval. First, seven representative graphical models associated with tag contents are reviewed and evaluated in terms of effectiveness in achieving their goals. The models are explored in depth based on appropriate plate notations for tag-based document retrieval. Second, well-established review criteria for two-way classical methods, tag refinement and tag recommendation, are utilized for tag-based image retrieval. In particular, tag refinement methods are analyzed by means of the experimental results measured on different datasets. Last, popular tagging methods in the area of music information retrieval are reviewed for tag-based music information retrieval. We introduce five criteria (used models, tagging purpose, tagging right, object type, and used dataset) for evaluating tag-based information retrieval methods, as a new categorical framework engaging the graphical models as well as the two-way classical methods.

Journal ArticleDOI
TL;DR: A new database for Arabic handwritten characters and ligatures is introduced, designed to cover all forms of Arabic characters, including ligatures; it contains 9900 ligatures and 5500 characters written by 50 writers.
Abstract: Unlike Latin script, the recognition of Arabic handwritten characters remains at the level of research and experimentation. Yet it has an undeniable interest for carrying out tasks that may be tedious in certain areas, namely the automatic processing of Arabic administrative records, the digitization and safeguarding of the written Arab cultural heritage, and so on. The availability of a database makes it possible to compare objectively the results of the different systems developed in this field. Indeed, the recognition of Arabic handwritten characters still suffers from the absence of a reference database covering all forms of Arabic characters and all possible ligatures. This article introduces a new database for Arabic handwritten characters and ligatures. This database is designed to cover all forms of Arabic characters, including ligatures. It contains 9900 ligatures and 5500 characters written by 50 writers.

Journal ArticleDOI
TL;DR: A rigorous and effective computational framework for human affect recognition and classification through the arousal, valence and dominance dimensions, confirming the sufficiency of the R-ELM when applied to the estimation and recognition of emotional responses.
Abstract: With the advancement of human-computer interaction and affective computing, emotion estimation has become a very interesting area of research. In the literature, the majority of emotion recognition systems suffer from the complexity of processing a huge amount of physiological data and analyzing various kinds of emotions in one framework. The aim of this paper is to present a rigorous and effective computational framework for human affect recognition and classification through the arousal, valence and dominance dimensions. In the proposed algorithm, physiological instances from the multimodal emotion DEAP dataset have been used for the analysis and characterization of emotional patterns. Physiological features were employed to predict VAD levels via an Extreme Learning Machine (ELM). We adopted feature-level fusion to exploit the complementary information of several physiological sensors in order to improve the classification performance. The proposed framework was also evaluated in a V–A quadrant by predicting four emotional classes. The obtained results prove the robustness and correctness of our proposed framework compared to other recent studies. We can also confirm the sufficiency of the R-ELM when applied to the estimation and recognition of emotional responses.
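
An ELM itself is compact enough to sketch: random hidden-layer weights followed by a closed-form, regularized least-squares output layer. Everything below (data, hidden size, regularization constant) is placeholder rather than the paper's configuration.

```python
import numpy as np

def elm_train(X, y_onehot, hidden=200, reg=1e-2, seed=0):
    # Extreme Learning Machine: random input weights, analytic output layer.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))
    b = rng.normal(size=hidden)
    H = np.tanh(X @ W + b)
    # Regularized least squares (the "R" in R-ELM):
    # beta = (H'H + reg*I)^-1 H' y
    beta = np.linalg.solve(H.T @ H + reg * np.eye(hidden), H.T @ y_onehot)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

# Placeholder fused physiological features (feature-level fusion taken as
# simple concatenation of per-sensor vectors) and 4 V-A quadrant labels.
rng = np.random.default_rng(5)
X = rng.normal(size=(320, 64))
y = rng.integers(0, 4, size=320)
W, b, beta = elm_train(X, np.eye(4)[y])
print("train accuracy:", (elm_predict(X, W, b, beta) == y).mean())
```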

Journal ArticleDOI
TL;DR: A CBIR approach in which retrieved images are more likely to satisfy user expectations is proposed, with a new feature fusion strategy based on Dempster–Shafer evidence theory that handles the lack of prior probabilities and the uncertain information provided by low-level features.
Abstract: The gap between human semantic perception of an image and its abstraction by low-level features is one of the main shortcomings of current content-based image retrieval (CBIR) systems. This paper presents an effort to overcome this drawback and proposes a CBIR approach in which retrieved images are more likely to satisfy the user's expectations. Multi-label classification, in which an instance is assigned to different classes, provides a general framework to establish semantic correspondence between images in a database and a query image. Accordingly, in the framework of multi-label classification, a new feature fusion strategy is developed based on Dempster–Shafer evidence theory to handle the lack of prior probabilities and the uncertain information provided by low-level features. In this study, texture features are extracted through the wavelet correlogram, and color features are obtained using the correlogram of vector-quantized image colors. These features are subsequently fused via a possibility-based approach for use in multi-label classification to retrieve images relevant to a query image. Experimental results on three well-known public and international image datasets demonstrate the superiority of the proposed algorithm over its close counterparts in terms of average precision and F1 measure.
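
Dempster's rule of combination, the core operation behind the Dempster–Shafer fusion mentioned above, is easy to illustrate. The mass values below are hypothetical evidence from a texture feature and a color feature over two candidate labels; they are not taken from the paper.

```python
from itertools import product

def dempster_combine(m1, m2):
    # Dempster's rule: m(A) is proportional to the sum over B intersect C = A
    # of m1(B)*m2(C), for non-empty A, renormalized by the conflict mass.
    combined, conflict = {}, 0.0
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            combined[A] = combined.get(A, 0.0) + v1 * v2
        else:
            conflict += v1 * v2
    return {A: v / (1.0 - conflict) for A, v in combined.items()}

# Hypothetical masses over two labels plus the "unsure" set {sky, sea}.
sky, sea = frozenset({'sky'}), frozenset({'sea'})
both = sky | sea
m_texture = {sky: 0.6, both: 0.4}
m_color = {sea: 0.5, sky: 0.3, both: 0.2}
print(dempster_combine(m_texture, m_color))  # sky 0.60, sea ~0.29, both ~0.11
```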

Journal ArticleDOI
TL;DR: This survey provides a fundamental study and review of the latest developments in the area of camera-captured scene text extraction and offers researchers an overview of the work done so far, along with certain future directions.
Abstract: Owing to technological advancements and new inventions, the world is transforming into a digital world where cameras have replaced scanners for converting hard-copy documents into their soft versions. Nowadays, these gadgets are used to capture images of any text present in real-life scenes and to process it, making camera-captured scene text extraction a key area of interest for researchers. This paper discusses certain background concepts, provides a categorization of the major challenges faced while extracting camera-captured scene text, and outlines the scope and applications of scene text extraction. Moreover, it presents a comprehensive literature survey of the different techniques and methods adopted by researchers to extract text from scene images captured via cameras, with special emphasis on the Gurmukhi script. For better understanding, this paper classifies the various techniques by three different parameters and discusses the available datasets along with performance metrics. To summarize, this survey provides a fundamental study and review of the latest developments in the area of camera-captured scene text extraction and offers researchers an overview of the work done so far, along with certain future directions.

Journal ArticleDOI
TL;DR: An ANOVA Cosine Similarity Image Recommendation (ACSIR) framework for vertical image search, in which text and visual features are integrated to fill the semantic gap.
Abstract: In today's world, online shopping is very attractive and has grown exponentially due to the revolution in digitization. Providing recommendations that identify users' needs is a crucial requirement for any search engine. In this paper, we propose an ANOVA Cosine Similarity Image Recommendation (ACSIR) framework for vertical image search in which text and visual features are integrated to fill the semantic gap. Visual synonyms of each term are computed using the ANOVA p value, considering the visual features of images returned by text-based search. Expanded queries are generated for the user input query, and a text-based search is performed to obtain the initial result set. Pair-wise image cosine similarity is then computed for the recommendation of images. Experiments were conducted on product images crawled from a domain-specific site. The results show that ACSIR outperforms the iLike method by providing more relevant products for the user input query.

Journal ArticleDOI
TL;DR: This paper proposes a QBE-based MIR system and investigates the impact of automatic music genre prediction on its performance, specifically from the perspective of the accuracy-time trade-off, using a score-based genre prediction method as well as similarity measures.
Abstract: A topic in the music information retrieval (MIR) field is query-by-example (QBE), which searches a popular-music dataset using a user-provided query and aims to find the target song. Since this type of MIR is generally used in online systems, retrieval time is as important as accuracy. In this paper, we propose a QBE-based MIR system and investigate the impact of automatic music genre prediction on its performance, specifically from the perspective of the accuracy-time trade-off, using a score-based genre prediction method as well as similarity measures. The proposed system is evaluated on a dataset containing 6000 music pieces from six musical genres, and we show how much improvement can be achieved in terms of accuracy and retrieval time compared with a typical QBE-based MIR system that uses only similarity measures to find the user-desired song.
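
The accuracy-time trade-off can be illustrated by restricting a nearest-neighbor scan to the predicted genre: fewer comparisons, at the risk of pruning the target song when the genre prediction is wrong. This sketch uses placeholder feature vectors and Euclidean distance; the paper's actual similarity measures and score-based genre predictor are not reproduced here.

```python
import numpy as np

def qbe_search(query_vec, db_vecs, db_genres, predicted_genre=None, top_k=5):
    # Optionally restrict the scan to tracks of the predicted genre.
    idx = np.arange(len(db_vecs))
    if predicted_genre is not None:
        idx = idx[db_genres == predicted_genre]
    d = np.linalg.norm(db_vecs[idx] - query_vec, axis=1)
    return idx[np.argsort(d)[:top_k]]

# Placeholder audio feature vectors and genre labels for 6000 tracks.
rng = np.random.default_rng(6)
db = rng.normal(size=(6000, 40))
genres = rng.integers(0, 6, size=6000)
q = db[123] + 0.05 * rng.normal(size=40)   # noisy query of track 123
print(qbe_search(q, db, genres, predicted_genre=genres[123]))
```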

Journal ArticleDOI
TL;DR: An unsupervised group feature selection algorithm based on canonical correlation analysis (CCA) is proposed and experiments with different audio and video classification scenarios demonstrate the outstanding performance of the proposed approach and its robustness across different datasets.
Abstract: The selection of an appropriate feature set is crucial for the efficient analysis of any media collection. In general, feature selection strongly depends on the data and commonly requires expert knowledge and previous experiments in related application scenarios. Current unsupervised feature selection methods usually ignore existing relationships among components of multi-dimensional features (group features) and operate on single feature components. In most applications, features carry little semantics. Thus, it is less relevant if a feature set consists of complete features or a selection of single feature components. However, in some domains, such as content-based audio retrieval, features are designed in a way that they, as a whole, have considerable semantic meaning. The disruption of a group feature in such application scenarios impedes the interpretability of the results. In this paper, we propose an unsupervised group feature selection algorithm based on canonical correlation analysis (CCA). Experiments with different audio and video classification scenarios demonstrate the outstanding performance of the proposed approach and its robustness across different datasets.
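
One plausible way to use CCA for group-level selection (not necessarily the paper's algorithm) is to score each feature group by its first canonical correlation against the concatenation of the remaining groups, then keep or drop groups as whole units so that their internal structure is never disrupted. A hedged sketch using scikit-learn:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def group_redundancy(groups):
    # Score each group by its first canonical correlation with the
    # concatenation of all remaining groups (higher = more redundant).
    scores = []
    for i, G in enumerate(groups):
        rest = np.hstack([g for j, g in enumerate(groups) if j != i])
        Xc, Yc = CCA(n_components=1).fit_transform(G, rest)
        scores.append(abs(np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1]))
    return scores

# Placeholder multi-dimensional feature groups (e.g., MFCC-like blocks).
rng = np.random.default_rng(7)
groups = [rng.normal(size=(200, d)) for d in (13, 12, 6)]
groups[1][:, 0] = groups[0][:, 0]   # make group 1 partly redundant
print(group_redundancy(groups))     # redundant groups score higher
```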

Journal ArticleDOI
TL;DR: This study proposes a computational framework for the clustering and analysis of multilingual visual affective concepts used in different languages, which enables us to pinpoint alignable differences and nonalignable differences across cultures.
Abstract: Visual content is a rich medium that can be used to communicate not only facts and events, but also emotions and opinions. In some cases, visual content may carry a universal affective bias (e.g., natural disasters or beautiful scenes). Often, however, achieving parity between the affect a visual medium invokes in its recipient and the one its author intended requires a deep understanding, and even a sharing, of cultural backgrounds. In this study, we propose a computational framework for the clustering and analysis of multilingual visual affective concepts used in different languages, which enables us to pinpoint alignable differences (via similar concepts) and nonalignable differences (via unique concepts) across cultures. To do so, we crowdsource sentiment labels for the MVSO dataset, which contains 16K multilingual visual sentiment concepts and 7.3M images tagged with these concepts. We then represent these concepts in a distribution-based word vector space via (1) pivotal translation or (2) cross-lingual semantic alignment. We evaluate these representations on three tasks, all across languages: affective concept retrieval, concept clustering, and sentiment prediction. The proposed clustering framework enables the analysis of the large multilingual dataset both quantitatively and qualitatively. We also show a novel use case on a facial image data subset and explore cultural insights about visual sentiment concepts in such portrait-focused images.

Journal ArticleDOI
TL;DR: This work builds a spatial-temporal descriptor that takes advantage of twin paths to extract features of the person image, and the descriptor is found to surpass the performance of existing methods.
Abstract: Automatic re-identification of people entering a camera network is an important and challenging task. Multiple frames of the same person are readily available in surveillance videos for re-identification. Dealing with pose variations of the person in the image and with partial occlusion is a major challenge in single-frame re-identification. Using more frames from the surveillance videos can produce a descriptor robust to pose variations and occlusion. In this paper, we emphasize using multiple frames from the same video to generate a multi-frame twin-channel descriptor. The work builds a spatial-temporal descriptor that takes advantage of twin paths to extract features of the person image. Mahalanobis distance metric learning algorithms are used for matching and evaluation. Our descriptor is evaluated on two benchmark datasets and found to surpass the performance of existing methods.
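
Once a metric learning algorithm has produced a positive semi-definite matrix M, matching reduces to ranking gallery descriptors by Mahalanobis distance to the probe. A minimal sketch with a placeholder M; a real M would come from a metric learning method such as KISSME or XQDA, and the descriptors below are random stand-ins.

```python
import numpy as np

def mahalanobis(x, y, M):
    # d(x, y) = sqrt((x - y)' M (x - y)) for positive semi-definite M.
    d = x - y
    return float(np.sqrt(d @ M @ d))

# Placeholder learned metric: A @ A.T guarantees positive semi-definiteness.
rng = np.random.default_rng(8)
A = rng.normal(size=(64, 64))
M = A @ A.T / 64

query = rng.normal(size=64)            # probe descriptor
gallery = rng.normal(size=(100, 64))   # gallery descriptors
ranking = np.argsort([mahalanobis(query, g, M) for g in gallery])
print("best match:", ranking[0])
```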

Journal ArticleDOI
TL;DR: A frame-wise video navigation system is implemented as an application of the indexing and search system, using a 2.14 TB movie dataset, and the effectiveness of the proposed pruning search method when dealing with dynamic contexts, as well as its comparatively high search performance, is shown.
Abstract: Many multimedia retrieval tasks are faced with increasingly large-scale datasets and preferences of users that vary with each query. At least three distinctive contextual aspects combine to form a user's set of preferences at each query time: content, intention, and response time. A content preference refers to the low-level or semantic representations of the data that a user is interested in. An intention preference refers to how the content should be regarded as relevant. A response time preference refers to the ability to control a reasonable wait time. This paper features the dynamic adaptability of a multimedia search system to the contexts of its users and proposes a multicontext-adaptive indexing and search system for video data. The main contribution is the integration of context-based query creation functions with high-performance search algorithms into a unified search system. The indexing method modifies the inverted-list data structure in order to construct disk-resident databases for large-scale data and efficiently enables a dynamic pruning search mechanism on those indices. We implement a frame-wise video navigation system as an application of the indexing and search system, using a 2.14 TB movie dataset. Experimental studies on this system show the effectiveness of the proposed pruning search method when dealing with dynamic contexts and its comparatively high search performance.
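
The idea of dynamic pruning over inverted lists can be sketched with a toy index whose postings are sorted by descending weight, so a scan can stop early once contributions fall below a response-time-controlled threshold. All data here is illustrative; the paper's disk-resident index structure is not reproduced.

```python
from collections import defaultdict

# Toy inverted index: visual word -> postings sorted by descending weight.
index = defaultdict(list)
def add(doc, word, w):
    index[word].append((w, doc))

add("f1", "w1", 0.9); add("f2", "w1", 0.1); add("f3", "w1", 0.05)
add("f2", "w2", 0.8); add("f3", "w2", 0.7)
for word in index:
    index[word].sort(reverse=True)

def search(query_words, prune_below=0.2):
    # Dynamic pruning: stop scanning a postings list once weights drop
    # below the threshold (a knob trading recall for response time).
    scores = defaultdict(float)
    for word in query_words:
        for w, doc in index[word]:
            if w < prune_below:
                break   # postings are sorted, so we can stop early
            scores[doc] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search(["w1", "w2"]))  # low-weight postings of w1 are never scanned
```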

Journal ArticleDOI
Michael S. Lew
TL;DR: The ACM International Conference on Multimedia Retrieval (ICMR) was started in 2011 as the flagship meeting covering the dynamic and ubiquitous domain of multimedia search and retrieval and is currently held annually each summer.
Abstract: The ACM International Conference on Multimedia Retrieval (ICMR, www.acmicmr.org) was started in 2011 as the flagship meeting covering the dynamic and ubiquitous domain of multimedia search and retrieval. It combined the notable ACM CIVR and MIR conferences and is currently held annually each summer (e.g., www.icmr2018.org is from June 11 to 14 and has CFP due on January 20, 2018). The growth and pervasiveness of multimedia has been enormous. Already, we have seen the major changes in the way that media, whether it is news or music or video, is being communicated and retrieved. Home multimedia retrieval technology such as Amazon Echo and Google Home has breached the final barrier between multimedia collections and everyday users. Services such as Apple iTunes, Netflix and Amazon Prime supplemented by digital streaming from all major channels have created an environment where finding multimedia information is so prevalent that many