Proceedings ArticleDOI

Image caption automatic generation method based on weighted feature

01 Oct 2013, pp. 548-551
TL;DR: A novel system that generates sentential annotations for general images by applying a weighted feature clustering algorithm to image regions to form semantic concept clusters and establishing a relationship between the clustered regions and semantic concepts according to the labeled images in the training set.
Abstract: For people to use the numerous images on the web effectively, technologies must be able to explain image contents and must be capable of searching for the data that users need. Moreover, images must be described with natural sentences based not only on the names of the objects contained in an image but also on their mutual relations. We propose a novel system which generates sentential annotations for general images. First, a weighted feature clustering algorithm is employed to group the image regions into semantic concept clusters. For a given cluster, we determine relevant features based on their statistical distribution and assign greater weights to relevant features than to less relevant ones. In this way, the clustering computation avoids being dominated by trivial or irrelevant features. Then, the relationship between the clustered regions and semantic concepts is established according to the labeled images in the training set. Given the regions of a new unlabeled image, we calculate the conditional probability of each semantic keyword and annotate the image with the keywords of maximal conditional probability. Experiments on the Corel image set show the effectiveness of the new algorithm.
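A minimal sketch of the pipeline described above: a weighted-feature k-means groups region features into clusters, P(keyword | cluster) is estimated from the labeled training regions, and a new image is annotated with the keywords of maximal conditional probability. The variance-based weighting rule and the co-occurrence estimate are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def weighted_kmeans(X, k, n_iter=20, seed=0):
    """Toy weighted-feature k-means: features whose within-cluster variance
    is low receive larger weights, so the distance computation is not
    dominated by trivial or irrelevant features (illustrative rule only)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    weights = np.ones(X.shape[1])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = (((X[:, None, :] - centers[None]) ** 2) * weights).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
        within_var = np.mean(
            [X[labels == j].var(0) for j in range(k) if (labels == j).any()],
            axis=0) + 1e-6
        weights = (1.0 / within_var) / (1.0 / within_var).sum() * X.shape[1]
    return centers, weights, labels

def keyword_given_cluster(labels, region_keywords, k, vocab):
    """Estimate P(keyword | cluster) from the labeled training regions."""
    counts = np.zeros((k, len(vocab)))
    for lab, kws in zip(labels, region_keywords):
        for kw in kws:
            counts[lab, vocab.index(kw)] += 1
    return counts / counts.sum(1, keepdims=True).clip(min=1)

def annotate(new_regions, centers, weights, p_kw, vocab, top_n=3):
    """Assign each region of an unlabeled image to its nearest cluster and
    return the keywords with maximal average conditional probability."""
    dist = (((new_regions[:, None, :] - centers[None]) ** 2) * weights).sum(-1)
    probs = p_kw[dist.argmin(1)].mean(0)
    return [vocab[i] for i in np.argsort(probs)[::-1][:top_n]]
```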
Citations
Proceedings ArticleDOI
27 Jun 2016
TL;DR: Presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding.
Abstract: Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate a word locally from the given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) may not be true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and visual content, while the latter creates a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and the visual content. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best published performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. Superior performance is also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.

563 citations

Posted Content
TL;DR: A novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
Abstract: Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate a word locally from the given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) may not be true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and visual content, while the latter creates a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and the visual content. Our proposed LSTM-E consists of three components: a 2-D and/or 3-D deep convolutional neural network for learning a powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best reported performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also demonstrate that LSTM-E is superior to several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
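A minimal sketch of the joint objective that both LSTM-E abstracts describe: a coherence term (next-word prediction with an LSTM conditioned on the visual feature) plus a relevance term that pulls the video and sentence representations together in a shared embedding space. PyTorch is used for convenience; the dimensions, the mean-pooled sentence embedding, and the 0.5 trade-off weight are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMESketch(nn.Module):
    """Sketch of the LSTM-E idea: coherence loss (next-word prediction)
    plus relevance loss (video and sentence kept close in a joint
    embedding space)."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, embed_dim)   # video -> embedding space
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.sent_proj = nn.Linear(embed_dim, embed_dim)   # sentence -> embedding space
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feat, captions):
        v = self.video_proj(video_feat)                    # (B, E)
        w = self.word_embed(captions[:, :-1])              # teacher-forcing inputs
        # condition the LSTM on the video by prepending its embedding
        h, _ = self.lstm(torch.cat([v.unsqueeze(1), w], dim=1))
        logits = self.out(h)                               # one prediction per caption word
        coherence = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
        s = self.sent_proj(self.word_embed(captions).mean(dim=1))  # crude sentence embedding
        relevance = F.mse_loss(v, s)                       # keep video and sentence close
        return coherence + 0.5 * relevance                 # arbitrary trade-off weight
```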

419 citations

Patent
13 Jan 2016
TL;DR: In this article, instead of outputting results of caption analysis directly, the framework is adapted to output points in a semantic word vector space, which are not tied to particular words or a single dictionary.
Abstract: Techniques for image captioning with word vector representations are described. In implementations, instead of outputting the results of caption analysis directly, the framework is adapted to output points in a semantic word vector space. These word vector representations reflect distance values in the context of the semantic word vector space. In this approach, words are mapped into a vector space and the results of caption analysis are expressed as points in the vector space that capture semantics between words. In the vector space, similar concepts will have small distance values. The word vectors are not tied to particular words or a single dictionary. A post-processing step is employed to map the points to words and convert the word vector representations to captions. Accordingly, conversion is delayed to a later stage in the process.
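A sketch of the post-processing step this abstract describes: the model's outputs are points in a word-vector space, and a nearest-neighbor lookup against a word-embedding table converts those points back into caption words. The cosine-similarity choice and the toy vocabulary are assumptions for illustration.

```python
import numpy as np

def vectors_to_caption(predicted_vecs, embedding_matrix, vocab):
    """Map each predicted point in the semantic word-vector space to its
    nearest vocabulary word (cosine similarity assumed here)."""
    emb = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    pred = predicted_vecs / np.linalg.norm(predicted_vecs, axis=1, keepdims=True)
    nearest = (pred @ emb.T).argmax(axis=1)   # most similar word per predicted point
    return " ".join(vocab[i] for i in nearest)

# hypothetical usage: 3 predicted points in a 4-d word-vector space
vocab = ["a", "dog", "runs", "grass"]
E = np.random.randn(4, 4)
print(vectors_to_caption(np.stack([E[0], E[1], E[2]]), E, vocab))  # -> "a dog runs"
```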

65 citations

Patent
13 Jan 2016
TL;DR: In this article, weak supervision data for a target image is obtained and utilized to provide detail information that supplements global image concepts derived for image captioning, where weak supervision refers to noisy data that is not closely curated and may include errors.
Abstract: Techniques for image captioning with weak supervision are described herein. In implementations, weak supervision data regarding a target image is obtained and utilized to provide detail information that supplements global image concepts derived for image captioning. Weak supervision data refers to noisy data that is not closely curated and may include errors. Given a target image, weak supervision data for visually similar images may be collected from sources of weakly annotated images, such as online social networks. Generally, images posted online include “weak” annotations in the form of tags, titles, labels, and short descriptions added by users. Weak supervision data for the target image is generated by extracting keywords for visually similar images discovered in the different sources. The keywords included in the weak supervision data are then employed to modulate weights applied for probabilistic classifications during image captioning analysis.
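The modulation idea can be illustrated with a small sketch: keyword frequencies harvested from the weak annotations of visually similar images re-weight the classifier's concept probabilities for the target image. The multiplicative boost and the normalization below are illustrative assumptions, not the patent's actual rule.

```python
from collections import Counter

def modulate_with_weak_supervision(concept_probs, similar_image_tags, boost=0.5):
    """Re-weight probabilistic classification scores using keywords from
    weakly annotated, visually similar images (e.g., social-media tags)."""
    tag_counts = Counter(tag for tags in similar_image_tags for tag in tags)
    total = sum(tag_counts.values()) or 1
    reweighted = {
        concept: p * (1.0 + boost * tag_counts.get(concept, 0) / total)
        for concept, p in concept_probs.items()
    }
    norm = sum(reweighted.values())
    return {c: p / norm for c, p in reweighted.items()}

# hypothetical usage
probs = {"dog": 0.4, "cat": 0.35, "frisbee": 0.25}
tags = [["dog", "park"], ["dog", "frisbee"], ["beach"]]
print(modulate_with_weak_supervision(probs, tags))
```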

29 citations

Book ChapterDOI
09 Jan 2018
TL;DR: The survey presents various techniques used by researchers for scene analysis performed on different image datasets, which helps to generate better image captions.
Abstract: Automatic image captioning is the process of providing natural language captions for images automatically. Considering the huge number of images available today, automatic image captioning is very beneficial in managing large image datasets by providing appropriate captions. It also finds application in content-based image retrieval. This field draws on other image processing areas such as segmentation, feature extraction, template matching and image classification, as well as on natural language processing. Scene analysis is a prominent step in automatic image captioning which is garnering the attention of many researchers. The better the scene analysis, the better the image understanding, which in turn leads to better image captions. The survey presents various techniques used by researchers for scene analysis performed on different image datasets.

13 citations

References
Journal ArticleDOI
TL;DR: This paper attempts to provide a comprehensive survey of the recent technical achievements in high-level semantic-based image retrieval, identifying five major categories of the state-of-the-art techniques in narrowing down the 'semantic gap'.

1,713 citations


"Image caption automatic generation ..." refers background in this paper

  • ...this technology is becoming more and more the focus of research [1,2]....

Proceedings ArticleDOI
28 Jul 2003
TL;DR: The approach shows the usefulness of using formal information retrieval models for the task of image annotation and retrieval by assuming that regions in an image can be described using a small vocabulary of blobs.
Abstract: Libraries have traditionally used manual image annotation for indexing and then later retrieving their image collections. However, manual image annotation is an expensive and labor-intensive procedure, and hence there has been great interest in coming up with automatic ways to retrieve images based on content. Here, we propose an automatic approach to annotating and retrieving images based on a training set of images. We assume that regions in an image can be described using a small vocabulary of blobs. Blobs are generated from image features using clustering. Given a training set of images with annotations, we show that probabilistic models allow us to predict the probability of generating a word given the blobs in an image. This may be used to automatically annotate and retrieve images given a word as a query. We show that relevance models allow us to derive these probabilities in a natural way. Experiments show that the annotation performance of this cross-media relevance model is almost six times as good (in terms of mean precision) as a model based on word-blob co-occurrence and twice as good as a state-of-the-art model derived from machine translation. Our approach shows the usefulness of using formal information retrieval models for the task of image annotation and retrieval.
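A sketch of the kind of estimate this abstract describes, assuming discrete blob identifiers and keyword lists per training image: the probability of a word given a new image's blobs is a sum over training images of the (smoothed) word and blob probabilities. The Jelinek-Mercer-style smoothing constants are illustrative, not the paper's tuned values.

```python
import math

def cmrm_word_probability(query_blobs, train_images, vocab, alpha=0.1, beta=0.1):
    """Relevance-model style estimate: P(w | query blobs) is a sum over
    training images J of P(w | J) * prod_i P(b_i | J), smoothed against
    the whole collection."""
    all_words = [w for J in train_images for w in J["words"]]
    all_blobs = [b for J in train_images for b in J["blobs"]]

    def p_word_given(J, w):
        return ((1 - alpha) * J["words"].count(w) / len(J["words"])
                + alpha * all_words.count(w) / len(all_words))

    def p_blob_given(J, b):
        return ((1 - beta) * J["blobs"].count(b) / len(J["blobs"])
                + beta * all_blobs.count(b) / len(all_blobs))

    scores = {}
    for w in vocab:
        scores[w] = sum(  # uniform prior over training images
            p_word_given(J, w) * math.prod(p_blob_given(J, b) for b in query_blobs)
            for J in train_images)
    total = sum(scores.values()) or 1.0
    return {w: s / total for w, s in scores.items()}

# hypothetical usage: blobs are discrete cluster ids, words are keywords
train = [{"blobs": [1, 2, 2], "words": ["tiger", "grass"]},
         {"blobs": [3, 3], "words": ["sky", "plane"]}]
print(cmrm_word_probability([1, 2], train, ["tiger", "grass", "sky", "plane"]))
```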

1,275 citations


"Image caption automatic generation ..." refers methods in this paper

  • ...These are typical image annotation methods based on generative models, such as the cross-media relevance model (CMRM) [6], multiple Bernoulli relevance models [7], and the continuous-space relevance model [8]....

  • ...[Fig. 2: annotation results (precision and recall) comparison of the CMRM algorithm and the WFC algorithm] The CMRM algorithm uses the k-means clustering algorithm to generate blobs; due to the shortcomings of k-means clustering, the accuracy of its results is affected....

  • ...The proposed method is compared with CMRM [6], and the experimental results are shown in Figure 2....

Proceedings ArticleDOI
27 Jun 2004
TL;DR: Shows how both automatic image annotation and retrieval (using one-word queries) from images and videos can be done using a multiple Bernoulli relevance model, which significantly outperforms previously reported results on the task of image and video annotation.
Abstract: Retrieving images in response to textual queries requires some knowledge of the semantics of the picture. Here, we show how we can do both automatic image annotation and retrieval (using one-word queries) from images and videos using a multiple Bernoulli relevance model. The model assumes that a training set of images or videos along with keyword annotations is provided. Multiple keywords are provided for an image, and the specific correspondence between a keyword and an image is not provided. Each image is partitioned into a set of rectangular regions and a real-valued feature vector is computed over these regions. The relevance model is a joint probability distribution of the word annotations and the image feature vectors and is computed using the training set. The word probabilities are estimated using a multiple Bernoulli model and the image feature probabilities using a non-parametric kernel density estimate. The model is then used to annotate images in a test set. We show experiments on both images from a standard Corel data set and a set of video key frames from NIST's Video Trec. Comparative experiments show that the model performs better than a model based on estimating word probabilities using the popular multinomial distribution. The results also show that our model significantly outperforms previously reported results on the task of image and video annotation.
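A sketch in the spirit of this abstract: per-image word probabilities from a smoothed Bernoulli estimate and per-image feature probabilities from a Gaussian kernel density over that image's region vectors, combined over the training set. The smoothing parameter and bandwidth are placeholders.

```python
import numpy as np

def mbrm_scores(query_regions, train_images, vocab, mu=1.0, bandwidth=1.0):
    """Multiple-Bernoulli style scoring: smoothed Bernoulli word estimates
    combined with a kernel density estimate over region features."""
    n_train = len(train_images)
    # document frequency of each word across the training annotations
    df = {w: sum(w in J["words"] for J in train_images) for w in vocab}

    def p_word(J, w):
        # Bayes-smoothed Bernoulli: image-level indicator mixed with the
        # word's frequency over the whole collection
        return (mu * (w in J["words"]) + df[w]) / (mu + n_train)

    def p_regions(J, regions):
        feats = np.asarray(J["regions"], dtype=float)        # (m, d)
        p = 1.0
        for r in regions:
            d2 = ((feats - np.asarray(r, dtype=float)) ** 2).sum(axis=1)
            p *= np.exp(-d2 / (2 * bandwidth ** 2)).mean()   # kernel density, up to a constant
        return p

    scores = {w: sum(p_word(J, w) * p_regions(J, query_regions)
                     for J in train_images) for w in vocab}
    total = sum(scores.values()) or 1.0
    return {w: s / total for w, s in scores.items()}
```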

815 citations

Proceedings Article
09 Dec 2003
TL;DR: An approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries using a formalism that models the generation of annotated images.
Abstract: We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vector. Given a training set of images with annotations, we compute a joint probabilistic model of image features and words which allows us to predict the probability of generating a word given the image regions. This may be used to automatically annotate and retrieve images given a word as a query. Experiments show that our model significantly outperforms the best of the previously reported results on the tasks of automatic image annotation and retrieval.
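In relevance-model notation, the formalism this abstract describes amounts to the factorization below (the symbols are introduced here for illustration: $\mathcal{T}$ is the training set, $r_1,\dots,r_n$ are the continuous region feature vectors of an image, and $w$ is a candidate annotation word):

$$P(w, r_1, \dots, r_n) = \sum_{J \in \mathcal{T}} P(J)\, P(w \mid J) \prod_{i=1}^{n} P(r_i \mid J),$$

where $P(r_i \mid J)$ is a non-parametric (kernel) density estimate over image $J$'s region features and $P(w \mid J)$ is a smoothed estimate of the word probability for $J$; a new image is annotated with the words $w$ that maximize this joint probability.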

762 citations


"Image caption automatic generation ..." refers methods in this paper

  • ...These are typical image annotation methods based on generative models, such as the cross-media relevance model (CMRM) [6], multiple Bernoulli relevance models [7], and the continuous-space relevance model [8]....

Journal ArticleDOI
TL;DR: This paper presents a new learning technique, which extends Multiple-Instance Learning (MIL), and its application to the problem of region-based image categorization, and provides experimental results on an image categorization problem and a drug activity prediction problem.
Abstract: Designing computer programs to automatically categorize images using low-level features is a challenging research topic in computer vision. In this paper, we present a new learning technique, which extends Multiple-Instance Learning (MIL), and its application to the problem of region-based image categorization. Images are viewed as bags, each of which contains a number of instances corresponding to regions obtained from image segmentation. The standard MIL problem assumes that a bag is labeled positive if at least one of its instances is positive; otherwise, the bag is negative. In the proposed MIL framework, DD-SVM, a bag label is determined by some number of instances satisfying various properties. DD-SVM first learns a collection of instance prototypes according to a Diverse Density (DD) function. Each instance prototype represents a class of instances that is more likely to appear in bags with the specific label than in the other bags. A nonlinear mapping is then defined using the instance prototypes and maps every bag to a point in a new feature space, named the bag feature space. Finally, standard support vector machines are trained in the bag feature space. We provide experimental results on an image categorization problem and a drug activity prediction problem.
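A sketch of the bag-feature-space mapping this abstract describes: each bag of region instances is mapped to a vector of distances to a set of instance prototypes, and a standard SVM is trained on those vectors. Learning the prototypes with the Diverse Density function is omitted; the prototypes, toy bags, and labels below are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def bag_features(bags, prototypes):
    """DD-SVM-style mapping: the k-th coordinate of a bag's feature vector
    is the distance from the bag's closest instance to the k-th prototype."""
    feats = []
    for bag in bags:
        bag = np.asarray(bag, dtype=float)                              # (m, d)
        d = np.linalg.norm(bag[:, None, :] - prototypes[None], axis=2)  # (m, k)
        feats.append(d.min(axis=0))                                     # closest instance per prototype
    return np.asarray(feats)

# hypothetical usage: 2-d instances, 2 prototypes, toy bag labels
prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])
bags = [[[0.1, 0.2], [4.0, 4.0]], [[5.1, 5.2]], [[0.0, 0.3], [0.2, 0.1]]]
labels = [1, 1, 0]
clf = SVC(kernel="rbf").fit(bag_features(bags, prototypes), labels)
print(clf.predict(bag_features([[[4.9, 5.0]]], prototypes)))
```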

698 citations


Additional excerpts

  • ...The low-level feature of each image region was a 9-dimensional vector, consisting of 3-dimensional wavelet texture features, 3-dimensional LUV color features and 3-dimensional shape features [9]....
