Author

Su Mei Xi

Bio: Su Mei Xi is an academic researcher from the University of Suwon. The author has contributed to research in the topics of feature detection (computer vision) and feature extraction. The author has an h-index of 1 and has co-authored 8 publications receiving 10 citations.

Papers
Proceedings ArticleDOI
01 Oct 2013
TL;DR: A novel system which generates sentential annotations for general images by employing a weighted feature clustering algorithm on the semantic concept clusters of the image regions and establishing a relationship between clustering regions and semantic concepts according to the labeled images in the training set.
Abstract: For people to use the numerous images on the web effectively, technologies must be able to explain image contents and must be capable of searching for the data users need. Moreover, images must be described with natural sentences based not only on the names of objects contained in an image but also on their mutual relations. We propose a novel system which generates sentential annotations for general images. Firstly, a weighted feature clustering algorithm is employed on the semantic concept clusters of the image regions. For a given cluster, we determine relevant features based on their statistical distribution and assign greater weights to relevant features than to less relevant ones. In this way the clustering computation avoids being dominated by trivially relevant or irrelevant features. Then, the relationship between clustered regions and semantic concepts is established according to the labeled images in the training set. For the regions of a new unlabeled image, we calculate the conditional probability of each semantic keyword and annotate the image with the keyword of maximal conditional probability. Experiments on the Corel image set show the effectiveness of the new algorithm.

11 citations
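
The abstract above describes a weighted feature clustering step followed by annotation with the keyword of maximal conditional probability. A minimal Python sketch of that general idea is given below; the weighting rule (inverse within-cluster variance), the function names and the parameters are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def weighted_kmeans(X, k, n_iter=20):
    """Toy weighted feature clustering: per-cluster feature weights are set
    inversely to the feature's within-cluster variance, so low-variance
    (relevant) features dominate the distance computation."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]
    weights = np.ones((k, X.shape[1]))
    for _ in range(n_iter):
        # weighted squared Euclidean distance of every point to every center
        d = ((X[:, None, :] - centers[None]) ** 2 * weights[None]).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts) == 0:
                continue
            centers[j] = pts.mean(0)
            w = 1.0 / (pts.var(0) + 1e-6)
            weights[j] = w / w.sum() * X.shape[1]   # normalise cluster weights
    return centers, weights, labels

def keyword_probabilities(labels, keywords, k):
    """Estimate P(keyword | cluster) from labeled training regions."""
    probs = {}
    for j in range(k):
        kw = [keywords[i] for i in range(len(labels)) if labels[i] == j]
        total = max(len(kw), 1)
        probs[j] = {w: kw.count(w) / total for w in set(kw)}
    return probs

def annotate(region, centers, weights, probs):
    """Assign a new region to its nearest weighted cluster and return the
    keyword with maximal conditional probability."""
    d = ((region - centers) ** 2 * weights).sum(1)
    j = int(d.argmin())
    return max(probs[j], key=probs[j].get) if probs[j] else None
```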

Journal ArticleDOI
TL;DR: Experimental results show that compared with other bimodal speech recognition approaches, this approach obtains better speech recognition performance.
Abstract: Recent years have seen higher demands for automatic speech recognition (ASR) systems that are able to operate robustly in an acoustically noisy environment. This paper proposes an improved product hidden Markov model (HMM) for bimodal speech recognition. A two-dimensional training model is built based on the dependently trained audio-HMM and visual-HMM, reflecting the asynchronous characteristics of the audio and video streams. A weight coefficient is introduced to adjust the weights of the video and audio streams automatically according to differences in the noise environment. Experimental results show that, compared with other bimodal speech recognition approaches, this approach obtains better speech recognition performance.

1 citation
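
The paper above combines audio and visual HMM streams with a noise-dependent weight coefficient. The sketch below illustrates the general multi-stream weighting idea, assuming per-word log-likelihoods from separately trained audio and visual models and a simple SNR-to-weight mapping; the product-HMM details and the mapping itself are assumptions, not the paper's formulation.

```python
import numpy as np

def stream_weight(snr_db, low=0.0, high=30.0):
    """Map the estimated acoustic SNR (dB) to an audio-stream weight in [0, 1]:
    clean audio -> rely mostly on audio, noisy audio -> rely on video."""
    return float(np.clip((snr_db - low) / (high - low), 0.0, 1.0))

def combined_log_likelihood(audio_loglik, video_loglik, snr_db):
    """Weighted multi-stream score: lambda * audio + (1 - lambda) * video."""
    lam = stream_weight(snr_db)
    return lam * audio_loglik + (1.0 - lam) * video_loglik

def recognize(word_scores, snr_db):
    """word_scores: {word: (audio_loglik, video_loglik)} from the two HMMs.
    Returns the word with the highest combined score."""
    return max(word_scores,
               key=lambda w: combined_log_likelihood(*word_scores[w], snr_db))

# usage: scores produced by separately trained audio and visual word models
scores = {"yes": (-42.0, -55.0), "no": (-47.0, -50.0)}
print(recognize(scores, snr_db=5.0))   # noisy: the video stream gets more weight
```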

Journal ArticleDOI
TL;DR: The experimental results show that basic natural language processing techniques, with small computational cost and simple implementation, help information retrieval only a little, so the role of natural language understanding may be larger in question answering systems, automatic summarization and information extraction.
Abstract: In this paper, some applications of natural language processing techniques to information retrieval are introduced, but the results are known not to be satisfactory. In order to find the roles of some classical natural language processing techniques in information retrieval, and to find which ones work better, we compared the effects of various natural language techniques on information retrieval precision. The experimental results show that basic natural language processing techniques, with small computational cost and simple implementation, help information retrieval only a little. Highly complex natural language processing techniques, with high computational cost and low precision, do not improve information retrieval precision and can even harm it, so the role of natural language understanding may be larger in question answering systems, automatic summarization and information extraction.

1 citation

Proceedings ArticleDOI
12 Nov 2012
TL;DR: This article proposed a new approach for automatically correcting queries over multi-XML, called MXDR (Multi-XML Distributed Retrieval), which first classifies multi-XML documents by a clustering method and elicits the common structure information of the XML datasets.
Abstract: This article proposed a new approach for automatically correcting queries over multi-XML, called MXDR (Multi-XML Distributed Retrieval). We first classified multi-XML documents by a clustering method and elicited the common structure information. We then generated certifiable structured queries by analyzing the given keyword query and the common structure information of the XML datasets. The generated structured queries can be evaluated over the XML data sources with any existing structured search engine.
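
The MXDR abstract above describes eliciting a common structure from clustered XML documents and turning a keyword query into structured queries. The toy sketch below illustrates only that last step, under an assumed tag-to-path representation of the common structure; the query-generation rule and the XPath-like syntax are illustrative, not the paper's algorithm.

```python
# Hypothetical common structure elicited from a cluster of XML documents:
# a mapping from element tag to its label path from the root.
common_structure = {
    "title":  "book/title",
    "author": "book/author/name",
    "year":   "book/year",
}

def generate_structured_queries(keywords, structure):
    """For each keyword that matches a known tag, emit a candidate path query;
    unmatched keywords become value predicates on every path, so any existing
    structured search engine can evaluate the resulting queries."""
    queries = []
    for kw in keywords:
        matched = [path for tag, path in structure.items() if kw.lower() in tag]
        if matched:
            queries.extend(f"//{path}" for path in matched)
        else:
            queries.extend(f"//{path}[contains(., '{kw}')]"
                           for path in structure.values())
    return queries

print(generate_structured_queries(["author", "2012"], common_structure))
```
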
Journal ArticleDOI
TL;DR: A new method for cross-media retrieval which uses an ontology to organize different media objects; the experimental results show that the proposed method is effective in cross-media retrieval.
Abstract: With the recent advances in information retrieval, cross-media retrieval has been attracting a lot of attention, but several issues remain, such as constructing effective correlations and calculating similarity between different kinds of media objects. To gain better cross-media retrieval performance, it is crucial to mine the semantic correlations among heterogeneous multimedia data. This paper introduces a new method for cross-media retrieval which uses an ontology to organize different media objects. The experimental results show that the proposed method is effective in cross-media retrieval.

Cited by
Proceedings ArticleDOI
27 Jun 2016
TL;DR: This paper presents a unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding.
Abstract: Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate a word locally from the given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and visual content, while the latter creates a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and the visual content. The experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best published performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. Superior performances are also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.

563 citations
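
LSTM-E, as described above, couples a coherence objective (next-word prediction with an LSTM) with a relevance objective (distance between video and sentence points in a shared embedding space). A minimal PyTorch sketch of that two-part loss is shown below; the dimensions, the pooled CNN video feature, and the trade-off weight are assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

class LSTME(nn.Module):
    """Minimal sketch of the LSTM-E idea: a coherence loss (next-word
    prediction) plus a relevance loss (distance between the video and the
    sentence points in a shared embedding space)."""
    def __init__(self, vocab_size, video_dim=2048, embed_dim=512, hidden=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)   # video -> joint space
        self.sent_proj = nn.Linear(hidden, embed_dim)       # sentence -> joint space
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, video_feat, captions):
        # video_feat: (B, video_dim) pooled CNN feature; captions: (B, L) word ids
        v = self.video_proj(video_feat)                     # (B, embed_dim)
        w = self.word_embed(captions[:, :-1])               # teacher-forcing inputs
        inp = torch.cat([v.unsqueeze(1), w], dim=1)         # video embedding starts the sequence
        h, _ = self.lstm(inp)                               # (B, L, hidden)
        logits = self.out(h)                                # position t predicts caption word t
        coherence = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
        s = self.sent_proj(h[:, -1, :])                     # sentence point in the joint space
        relevance = ((v - s) ** 2).sum(dim=1).mean()        # squared distance to the video point
        return coherence + 0.3 * relevance                  # trade-off weight is an assumption
```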

Posted Content
TL;DR: A novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
Abstract: Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate a word locally from the given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and visual content, while the latter creates a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and the visual content. Our proposed LSTM-E consists of three components: a 2-D and/or 3-D deep convolutional neural network for learning a powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics. The experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best reported performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also demonstrate that LSTM-E is superior to several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.

419 citations

Patent
13 Jan 2016
TL;DR: In this article, instead of outputting results of caption analysis directly, the framework is adapted to output points in a semantic word vector space, which are not tied to particular words or a single dictionary.
Abstract: Techniques for image captioning with word vector representations are described. In implementations, instead of outputting results of caption analysis directly, the framework is adapted to output points in a semantic word vector space. These word vector representations reflect distance values in the context of the semantic word vector space. In this approach, words are mapped into a vector space and the results of caption analysis are expressed as points in the vector space that capture semantics between words. In the vector space, similar concepts will have small distance values. The word vectors are not tied to particular words or a single dictionary. A post-processing step is employed to map the points to words and convert the word vector representations to captions. Accordingly, conversion is delayed to a later stage in the process.

65 citations
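
The patent abstract above defers word selection by predicting points in a semantic word vector space and mapping them back to words in a post-processing step. The sketch below shows one straightforward way to do that mapping with nearest-neighbour search under cosine similarity; the tiny embedding table and the helper names are purely illustrative.

```python
import numpy as np

# Hypothetical embedding table: word -> vector in the semantic word space.
vocab = {"dog": np.array([0.9, 0.1, 0.0]),
         "cat": np.array([0.8, 0.2, 0.1]),
         "car": np.array([0.0, 0.1, 0.9])}

def nearest_word(point, vocab):
    """Post-processing step: map a predicted point in the word vector space
    to the closest vocabulary word by cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(vocab, key=lambda w: cos(point, vocab[w]))

def points_to_caption(points, vocab):
    """Convert a sequence of predicted points into a caption string."""
    return " ".join(nearest_word(p, vocab) for p in points)

predicted = [np.array([0.85, 0.15, 0.05]), np.array([0.05, 0.05, 0.95])]
print(points_to_caption(predicted, vocab))   # e.g. "dog car"
```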

Patent
13 Jan 2016
TL;DR: In this article, weak supervision data for a target image is obtained and utilized to provide detail information that supplements global image concepts derived for image captioning, where weak supervision refers to noisy data that is not closely curated and may include errors.
Abstract: Techniques for image captioning with weak supervision are described herein. In implementations, weak supervision data regarding a target image is obtained and utilized to provide detail information that supplements global image concepts derived for image captioning. Weak supervision data refers to noisy data that is not closely curated and may include errors. Given a target image, weak supervision data for visually similar images may be collected from sources of weakly annotated images, such as online social networks. Generally, images posted online include “weak” annotations in the form of tags, titles, labels, and short descriptions added by users. Weak supervision data for the target image is generated by extracting keywords for visually similar images discovered in the different sources. The keywords included in the weak supervision data are then employed to modulate weights applied for probabilistic classifications during image captioning analysis.

29 citations
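
The abstract above says that keywords extracted as weak supervision are used to modulate the weights applied to probabilistic classifications. A toy sketch of one such modulation (boosting concepts that also appear among the weak keywords, then renormalising) is given below; the boost factor and the function name are assumptions, not the patent's method.

```python
def modulate_with_weak_supervision(class_probs, weak_keywords, boost=2.0):
    """Toy re-weighting: concepts that also appear among the weak-supervision
    keywords (tags/titles of visually similar images) get their probability
    boosted, then the distribution is renormalised."""
    weighted = {c: p * (boost if c in weak_keywords else 1.0)
                for c, p in class_probs.items()}
    total = sum(weighted.values())
    return {c: p / total for c, p in weighted.items()}

# usage: global concept classifier output for a target image
probs = {"beach": 0.30, "sunset": 0.25, "mountain": 0.45}
weak = {"sunset", "ocean", "waves"}           # keywords from visually similar images
print(modulate_with_weak_supervision(probs, weak))
```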

Book ChapterDOI
09 Jan 2018
TL;DR: The survey presents various techniques used by researchers for scene analysis performed on different image datasets, which helps to generate better image captions.
Abstract: Automatic image captioning is the process of providing natural language captions for images automatically. Considering the huge number of images available today, automatic image captioning is very beneficial for managing huge image datasets by providing appropriate captions. It also finds application in content-based image retrieval. This field includes other image processing areas such as segmentation, feature extraction, template matching and image classification. It also includes the field of natural language processing. Scene analysis is a prominent step in automatic image captioning which is garnering the attention of many researchers. The better the scene analysis, the better the image understanding, which in turn leads to better image captions. The survey presents various techniques used by researchers for scene analysis performed on different image datasets.

13 citations