
Showing papers on "Semantic similarity" published in 2012


Proceedings Article
03 Dec 2012
TL;DR: A new loss-augmented inference algorithm that is quadratic in the code length is developed within a training framework inspired by latent structural SVMs, showing strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.
Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.

562 citations
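
To make the training objective concrete, here is a minimal, hypothetical Python sketch of a triplet ranking loss over a linear hash function, with a smooth surrogate standing in for the discrete codes; it illustrates the general idea only, not the paper's loss-augmented inference or structural-SVM bound.

```python
# Sketch of similarity-preserving binary hashing with a triplet ranking
# loss. All names, sizes, and the squared-distance hinge are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Map a real vector to a binary code via linear projection + sign."""
    return (x @ W > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

def triplet_hinge(W, anchor, pos, neg, margin=1.0):
    """Hinge on a smooth surrogate: real projections stand in for the
    discrete codes, as in piecewise-smooth upper-bound training."""
    fa, fp, fn = anchor @ W, pos @ W, neg @ W
    d_pos = np.sum((fa - fp) ** 2)
    d_neg = np.sum((fa - fn) ** 2)
    return max(0.0, margin + d_pos - d_neg)

W = rng.standard_normal((8, 16))            # 8-dim inputs -> 16-bit codes
anchor, pos, neg = rng.standard_normal((3, 8))
print("loss:", triplet_hinge(W, anchor, pos, neg))
print("Hamming(anchor, pos):", hamming(encode(anchor, W), encode(pos, W)))
print("Hamming(anchor, neg):", hamming(encode(anchor, W), encode(neg, W)))
```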


Journal ArticleDOI
TL;DR: It is argued that N400 amplitude might reflect the retrieval of lexical information from memory and, on this view, the absence of an N400-effect in semantic illusion sentences can be explained in terms of priming.

415 citations


Proceedings Article
08 Jul 2012
TL;DR: While visual models with state-of-the-art computer vision techniques perform worse than textual models in general tasks, they are as good as or better models of the meaning of words with visual correlates such as color terms, even in a nontrivial task that involves nonliteral uses of such words.
Abstract: Our research aims at building computational models of word meaning that are perceptually grounded. Using computer vision techniques, we build visual and multimodal distributional models and compare them to standard textual models. Our results show that, while visual models with state-of-the-art computer vision techniques perform worse than textual models in general tasks (accounting for semantic relatedness), they are as good as or better models of the meaning of words with visual correlates such as color terms, even in a nontrivial task that involves nonliteral uses of such words. Moreover, we show that visual and textual information tap into different aspects of meaning, and indeed combining them in multimodal models often improves performance.

397 citations
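
As an illustration of the multimodal combination step described above, the following sketch concatenates L2-normalized text and visual vectors and compares the results with cosine similarity. The vectors, dimensionalities, and weighting scheme are hypothetical stand-ins for real distributional and computer vision features.

```python
# Hypothetical multimodal fusion by weighted vector concatenation.
import numpy as np

def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def multimodal(text_vec, visual_vec, alpha=0.5):
    """Weighted concatenation of text and visual representations."""
    return np.concatenate([alpha * normalize(text_vec),
                           (1 - alpha) * normalize(visual_vec)])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for distributional and bag-of-visual-words features.
rng = np.random.default_rng(1)
t1, t2 = rng.random(50), rng.random(50)
v1, v2 = rng.random(30), rng.random(30)
print(cosine(multimodal(t1, v1), multimodal(t2, v2)))
```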


Journal ArticleDOI
TL;DR: This paper surveys and classifies most of the ontology-based approaches developed in order to evaluate their advantages and limitations and compare their expected performance both from theoretical and practical points of view, and presents a new ontology-based measure relying on the exploitation of taxonomical features.
Abstract: Estimation of the semantic likeness between words is of great importance in many applications dealing with textual data such as natural language processing, knowledge acquisition and information retrieval. Semantic similarity measures exploit knowledge sources as the base to perform the estimations. In recent years, ontologies have grown in interest thanks to global initiatives such as the Semantic Web, offering a structured knowledge representation. Thanks to the possibilities that ontologies offer for the semantic interpretation of terms, many ontology-based similarity measures have been developed. According to the principle on which those measures base the similarity assessment, and the way in which ontologies are exploited or complemented with other sources, several families of measures can be identified. In this paper, we survey and classify most of the ontology-based approaches developed, evaluate their advantages and limitations, and compare their expected performance from both theoretical and practical points of view. We also present a new ontology-based measure relying on the exploitation of taxonomical features. The evaluation and comparison of our approach's results against those reported by related works under a common framework suggest that our measure provides high accuracy without some of the limitations observed in other works.

361 citations
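
The taxonomy-based family of measures that this survey covers can be tried directly with off-the-shelf tools. The sketch below uses NLTK's WordNet interface to compute two classic taxonomical similarities; it illustrates the general family, not the paper's new measure.

```python
# Two classic taxonomy-based similarity measures over WordNet, computed
# with NLTK (assumes `pip install nltk`; the corpus is fetched below).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")

# Path similarity: inverse of the shortest is-a path between senses.
print("path:", dog.path_similarity(cat))
# Wu-Palmer: depth of the least common subsumer relative to both depths.
print("wup: ", dog.wup_similarity(cat))
```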


Journal ArticleDOI
TL;DR: This survey looks at the use of vector space models to describe the meaning of words and phrases: the phenomena that vector space models address, and the techniques that they use to do so.
Abstract: Distributional models represent a word through the contexts in which it has been observed. They can be used to predict similarity in meaning, based on the distributional hypothesis, which states that two words that occur in similar contexts tend to have similar meanings. Distributional approaches are often implemented in vector space models. They represent a word as a point in high-dimensional space, where each dimension stands for a context item, and a word's coordinates represent its context counts. Occurrence in similar contexts then means proximity in space. In this survey we look at the use of vector space models to describe the meaning of words and phrases: the phenomena that vector space models address, and the techniques that they use to do so. Many word meaning phenomena can be described in terms of semantic similarity: synonymy, priming, categorization, and the typicality of a predicate's arguments. But vector space models can do more than just predict semantic similarity. They are a very flexible tool, because they can make use of all of linear algebra, with all its data structures and operations. The dimensions of a vector space can stand for many things: context words, or non-linguistic context like images, or properties of a concept. And vector space models can use matrices or higher-order arrays instead of vectors for representing more complex relationships. Polysemy is a tough problem for distributional approaches, as a representation that is learned from all of a word's contexts will conflate the different senses of the word. It can be addressed, using either clustering or vector combination techniques. Finally, we look at vector space models for phrases, which are usually constructed by combining word vectors. Vector space models for phrases can predict phrase similarity, and some argue that they can form the basis for a general-purpose representation framework for natural language semantics.

284 citations
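
The core construction described in this survey fits in a few lines. The following sketch builds word vectors from co-occurrence counts in a small context window over a toy corpus and measures similarity as cosine; the corpus, window size, and raw-count weighting are illustrative choices.

```python
# Minimal count-based vector space model: a word is represented by its
# co-occurrence counts within a +/-2 word window; similarity in meaning
# is proximity (cosine) in that space.
from collections import Counter, defaultdict
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

window = 2
vectors = defaultdict(Counter)
for sentence in corpus:
    toks = sentence.split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if i != j:
                vectors[w][toks[j]] += 1

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

print(cosine(vectors["cat"], vectors["dog"]))  # similar contexts -> high cosine
```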


Journal ArticleDOI
TL;DR: This article investigates the use of three further factors—namely, the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD)—that have been used to provide improved performance elsewhere and introduces an additional semantic task and explores the advantages of using a much larger corpus.
Abstract: In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors—namely, the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD)—that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.

283 citations
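
The pipeline the article studies can be sketched as follows, assuming a precomputed co-occurrence matrix with toy counts: weight it with positive pointwise mutual information, then reduce it with a truncated SVD.

```python
# PMI + SVD sketch over a toy word-word co-occurrence matrix.
import numpy as np

counts = np.array([[10., 2., 0.],
                   [ 2., 8., 1.],
                   [ 0., 1., 6.]])          # toy co-occurrence counts

total = counts.sum()
p_xy = counts / total
p_x = counts.sum(axis=1, keepdims=True) / total
p_y = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_xy / (p_x * p_y))
ppmi = np.maximum(pmi, 0.0)                  # positive PMI (log 0 -> clipped)

U, S, Vt = np.linalg.svd(ppmi)
k = 2                                        # reduced dimensionality
embeddings = U[:, :k] * S[:k]                # rows are word vectors
print(embeddings)
```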


Journal ArticleDOI
01 Oct 2012
TL;DR: This work presents a histogram-based representation for time series data, similar to the “bag of words” approach that is widely accepted by the text mining and information retrieval communities, and shows that it outperforms the leading existing methods in clustering, classification, and anomaly detection on dozens of real datasets.
Abstract: For more than a decade, time series similarity search has been given a great deal of attention by data mining researchers. As a result, many time series representations and distance measures have been proposed. However, most existing work on time series similarity search relies on shape-based similarity matching. While some of the existing approaches work well for short time series data, they typically fail to produce satisfactory results when the sequence is long. For long sequences, it is more appropriate to consider the similarity based on the higher-level structures. In this work, we present a histogram-based representation for time series data, similar to the "bag of words" approach that is widely accepted by the text mining and information retrieval communities. We performed extensive experiments and show that our approach outperforms the leading existing methods in clustering, classification, and anomaly detection on dozens of real datasets. We further demonstrate that the representation allows rotation-invariant matching in shape datasets.

272 citations
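
A rough sketch of the histogram idea follows, assuming a crude SAX-like discretization; the breakpoints, window, and alphabet sizes are arbitrary toy choices, not the paper's parameters.

```python
# Bag-of-patterns sketch: slide a window over the series, discretize each
# subsequence into a short symbolic "word", and represent the series by
# its word histogram (a structural summary robust to alignment).
from collections import Counter
import numpy as np

def to_word(segment, n_letters=4):
    """Z-normalize, split into chunks, map chunk means to letters a/b/c."""
    seg = (segment - segment.mean()) / (segment.std() + 1e-8)
    chunks = np.array_split(seg, n_letters)
    return "".join("abc"[int(np.digitize(float(c.mean()), [-0.43, 0.43]))]
                   for c in chunks)

def bag_of_patterns(series, window=16):
    return Counter(to_word(series[i:i + window])
                   for i in range(len(series) - window + 1))

rng = np.random.default_rng(2)
ts = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.1 * rng.standard_normal(200)
print(bag_of_patterns(ts).most_common(5))
```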


Proceedings ArticleDOI
12 Aug 2012
TL;DR: A large-scale data mining approach to learning word-word relatedness, in which known pairs of related words impose constraints on the learning process; the method learns for each word a low-dimensional representation that strives to maximize the likelihood of a word given the contexts in which it appears.
Abstract: Prior work on computing semantic relatedness of words focused on representing their meaning in isolation, effectively disregarding inter-word affinities. We propose a large-scale data mining approach to learning word-word relatedness, where known pairs of related words impose constraints on the learning process. We learn for each word a low-dimensional representation, which strives to maximize the likelihood of a word given the contexts in which it appears. Our method, called CLEAR, is shown to significantly outperform previously published approaches. The proposed method is based on first principles, and is generic enough to exploit diverse types of text corpora, while having the flexibility to impose constraints on the derived word similarities. We also make publicly available a new labeled dataset for evaluating word relatedness algorithms, which we believe to be the largest such dataset to date.

270 citations


Proceedings Article
07 Jun 2012
TL;DR: This work uses a simple log-linear regression model, trained on the training data, to combine multiple text similarity measures of varying complexity, which range from simple character and word n-grams and common subsequences to complex features such as Explicit Semantic Analysis vector comparisons and aggregation of word similarity based on lexical-semantic resources.
Abstract: We present the UKP system which performed best in the Semantic Textual Similarity (STS) task at SemEval-2012 in two out of three metrics. It uses a simple log-linear regression model, trained on the training data, to combine multiple text similarity measures of varying complexity. These range from simple character and word n-grams and common subsequences to complex features such as Explicit Semantic Analysis vector comparisons and aggregation of word similarity based on lexical-semantic resources. Further, we employ a lexical substitution system and statistical machine translation to add additional lexemes, which alleviates lexical gaps. Our final models, one per dataset, consist of a log-linear combination of about 20 features, out of the possible 300+ features implemented.

226 citations
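
The general recipe, stripped of UKP's actual 300+ features, looks like this: compute a few cheap similarity features per sentence pair and fit a regression model on gold similarity scores. The features, training pairs, and scores below are hypothetical.

```python
# Sketch of combining text-similarity features with a trained regressor.
from sklearn.linear_model import LinearRegression

def char_ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def features(s1, s2):
    return [jaccard(char_ngrams(s1), char_ngrams(s2)),    # char 3-gram overlap
            jaccard(set(s1.split()), set(s2.split()))]    # word overlap

pairs = [("a man plays guitar", "a man is playing a guitar"),
         ("the cat sleeps", "stock markets fell sharply")]
gold = [4.6, 0.2]                     # toy STS-style scores in [0, 5]

model = LinearRegression().fit([features(a, b) for a, b in pairs], gold)
print(model.predict([features("a boy plays piano", "a boy is playing piano")]))
```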


Proceedings Article
07 Jun 2012
TL;DR: The two systems submitted to SemEval 2012 Task 6 for determining the semantic similarity of short texts ranked in the top 5 on all three overall evaluation metrics used.
Abstract: This paper describes the two systems for determining the semantic similarity of short texts submitted to the SemEval 2012 Task 6. Most of the research on semantic similarity of textual content focuses on large documents. However, a fair amount of information is condensed into short text snippets such as social media posts, image captions, and scientific abstracts. We predict the human ratings of sentence similarity using a support vector regression model with multiple features measuring word-overlap similarity and syntax similarity. Out of 89 systems submitted, our two systems ranked in the top 5, for the three overall evaluation metrics used (overall Pearson -- 2nd and 3rd, normalized Pearson -- 1st and 3rd, weighted mean -- 2nd and 5th).

225 citations


Proceedings ArticleDOI
29 Oct 2012
TL;DR: A novel notion of semantic relatedness between two entities, represented as sets of weighted (multi-word) keyphrases with consideration of partially overlapping phrases, is developed; it improves the quality of prior link-based models and also eliminates the need for explicit interlinkage between entities.
Abstract: Measuring the semantic relatedness between two entities is the basis for numerous tasks in IR, NLP, and Web-based knowledge extraction. This paper focuses on disambiguating names in a Web or text document by jointly mapping all names onto semantically related entities registered in a knowledge base. To this end, we have developed a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models, and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities. Thus, our method is more versatile and can cope with long-tail and newly emerging entities that have few or no links associated with them. For efficiency, we have developed approximation techniques based on min-hash sketches and locality-sensitive hashing. Our experiments on semantic relatedness and on named entity disambiguation demonstrate the superiority of our method compared to state-of-the-art baselines.
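
The min-hash approximation mentioned above can be sketched as follows; this estimates plain Jaccard overlap between keyphrase sets, whereas the paper's measure additionally weights phrases and scores partial phrase overlaps.

```python
# Min-hash signatures for fast approximate set overlap between entities
# represented as keyphrase sets (approximates Jaccard similarity).
import hashlib

def minhash(phrases, num_hashes=64):
    sig = []
    for i in range(num_hashes):
        # Salt a cheap hash per row to simulate independent hash functions.
        sig.append(min(int(hashlib.md5(f"{i}:{p}".encode()).hexdigest(), 16)
                       for p in phrases))
    return sig

def estimate_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

e1 = {"theoretical physicist", "nobel prize", "relativity theory"}
e2 = {"nobel prize", "relativity theory", "violin player"}
print(estimate_jaccard(minhash(e1), minhash(e2)))  # true Jaccard here is 0.5
```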

Journal ArticleDOI
TL;DR: This work presents a systematic discussion and comparison of the main approaches for annotating existing protein data with biological information, to enable the use of algorithms that use biological ontologies as a framework to mine annotated data.
Abstract: The integration of proteomics data with biological knowledge is a recent trend in bioinformatics. A lot of biological information is available and is spread across different sources and encoded in different ontologies (e.g. Gene Ontology). Annotating existing protein data with biological information may enable the use (and the development) of algorithms that use biological ontologies as a framework to mine annotated data. Recently, many methodologies and algorithms that use ontologies to extract knowledge from data, as well as to analyse ontologies themselves, have been proposed and applied to other fields. Conversely, the use of such annotations for the analysis of protein data is a relatively novel research area that is currently becoming more and more central in research. Existing approaches span from the definition of the similarity among genes and proteins on the basis of the annotating terms, to the definition of novel algorithms that use such similarities for mining protein data on a proteome-wide scale. This work, after defining the main concepts of such analysis, presents a systematic discussion and comparison of the main approaches. Finally, remaining challenges, as well as possible future directions of research, are presented.

BookDOI
01 Jan 2012
TL;DR: Table of contents of the proceedings, covering the Research, In-Use, Evaluations and Experiments, and Doctoral Consortium tracks.
Abstract: Research Track- MORe: Modular Combination of OWL Reasoners for Ontology Classification- A Formal Semantics for Weighted Ontology- Personalised Graph-Based Selection of Web APIs- Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing- Automatic Typing of DBpedia Entities- Performance Heterogeneity and Approximate Reasoning in Description Logic Ontologies- Concept-Based Semantic Difference in Expressive Description Logics- SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data- RDFS Reasoning on Massively Parallel Hardware- An Efficient Bit Vector Approach to Semantics-Based Machine Perception in Resource-Constrained Devices- Semantic Enrichment by Non-experts: Usability of Manual Annotation Tools- Ontology-Based Access to Probabilistic Data with OWL QL- Predicting Reasoning Performance Using Ontology Metrics- Formal Verification of Data Provenance Records- Cost Based Query Ordering over OWL Ontologies- Robust Runtime Optimization and Skew-Resistant Execution of Analytical SPARQL Queries on Pig- Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web- The Not-So-Easy Task of Computing Class Subsumptions in OWL RL- Strabon: A Semantic Geospatial DBMS- DeFacto - Deep Fact Validation- Feature LDA: A Supervised Topic Model for Automatic Detection of Web API Documentations from the Web- Efficient Execution of Top-K SPARQL Queries- Collaborative Filtering by Analyzing Dynamic User Interests Modeled by Taxonomy- Link Discovery with Guaranteed Reduction Ratio in Affine Spaces with Minkowski Measures- Hitting the Sweetspot: Economic Rewriting of Knowledge Bases- Mining Semantic Relations between Research Areas- Discovering Concept Coverings in Ontologies of Linked Data Sources- Ontology Constraints in Incomplete and Complete Data- A Machine Learning Approach for Instance Matching Based on Similarity Metrics- Who Will Follow Whom? Exploiting Semantics for Link Prediction in Attention-Information Networks- On the Diversity and Availability of Temporal Information in Linked Open Data- Semantic Sentiment Analysis of Twitter- CrowdMap: Crowdsourcing Ontology Alignment with Microtasks- Domain-Aware Ontology Matching- Rapidly Integrating Services into the Linked Data Cloud- An Evidence-Based Verification Approach to Extract Entities and Relations for Knowledge Base Population- Blank Node Matching and RDF/S Comparison Functions- Hybrid SPARQL Queries: Fresh vs Fast Results- Provenance for SPARQL Queries- SRBench: A Streaming RDF/SPARQL Benchmark- Scalable Geo-thematic Query Answering- In-Use Track- Managing the Life-Cycle of Linked Data with the LOD2 Stack- Achieving Interoperability through Semantics-Based Technologies: The Instant Messaging Case- Linking Smart Cities Datasets with Human Computation - The Case of UrbanMatch- ourSpaces - Design and Deployment of a Semantic Virtual Research Environment- Embedded EL+ Reasoning on Programmable Logic Controllers- Experiences with Modeling Composite Phenotypes in the SKELETOME Project- Toward an Ecosystem of LOD in the Field: LOD Content Generation and Its Consuming Service- Applying Semantic Web Technologies for Diagnosing Road Traffic Congestions- deqa: Deep Web Extraction for Question Answering- QuerioCity: A Linked Data Platform for Urban Information Management- Semantic Similarity-Driven Decision Support in the Skeletal Dysplasia Domain- Using SPARQL to Query BioPortal Ontologies and Metadata- Trentino Government Linked Open Geo-data: A Case Study- Semantic Reasoning in Context-Aware Assistive Environments to Support Ageing with Dementia- Query Driven Hypothesis Generation for Answering Queries over NLP Graphs- A Comparison of Hard Filters and Soft Evidence for Answer Typing in Watson- Incorporating Semantic Knowledge into Dynamic Data Processing for Smart Power Grids- Evaluations and Experiments Track- Evaluating Semantic Search Query Approaches with Expert and Casual Users- Extracting Justifications from BioPortal Ontologies- Linked Stream Data Processing Engines: Facts and Figures- Benchmarking Federated SPARQL Query Engines: Are Existing Testbeds Enough?- Tag Recommendation for Large-Scale Ontology-Based Information Systems- Evaluation of Techniques for Inconsistency Handling in OWL 2 QL Ontologies- Evaluating Entity Summarization Using a Game-Based Ground Truth- Evaluation of a Layered Approach to Question Answering over Linked Data- Doctoral Consortium - Long Papers- Cross Lingual Semantic Search by Improving Semantic Similarity and Relatedness Measures- Quality Reasoning in the Semantic Web- Burst the Filter Bubble: Using Semantic Web to Enable Serendipity- Reconstructing Provenance- Very Large Scale OWL Reasoning through Distributed Computation- Replication for Linked Data- Scalable and Domain-Independent Entity Coreference: Establishing High Quality Data Linkages across Heterogeneous Data Sources- Doctoral Consortium - Short Papers- Distributed Reasoning on Semantic Data Streams- Reusing XML Schemas' Information as a Foundation for Designing Domain Ontologies - A Multi-domain Framework for Community Building Based on Data Tagging- Towards a Theoretical Foundation for the Harmonization of Linked Data- Knowledge Pattern Extraction and Their Usage in Exploratory Search- SPARQL Update for Complex Event Processing- Online Unsupervised Coreference Resolution for Semi-structured Heterogeneous Data- Composition of Linked Data-Based RESTful Services

Proceedings Article
12 Jul 2012
TL;DR: This work presents a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences, and demonstrates recovery of this richer structure by extracting logical forms from natural language queries against Freebase.
Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

Proceedings Article
07 Jun 2012
TL;DR: A novel, optimal semantic similarity approach based on word-to-word similarity metrics is presented to solve the important task of assessing natural language student input in dialogue-based intelligent tutoring systems.
Abstract: We present in this paper a novel, optimal semantic similarity approach based on word-to-word similarity metrics to solve the important task of assessing natural language student input in dialogue-based intelligent tutoring systems. The optimal matching is guaranteed using the sailor assignment problem, also known as the job assignment problem, a well-known combinatorial optimization problem. We compare the optimal matching method with a greedy method as well as with a baseline method on data sets from two intelligent tutoring systems, AutoTutor and iSTART.
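
The optimal matching step maps directly onto a standard solver. Below, scipy's Hungarian-algorithm implementation maximizes the total word-to-word similarity between a student answer and an expert answer; the similarity matrix is a toy stand-in for a real word-to-word metric such as a WordNet-based similarity.

```python
# Optimal word-to-word matching via the assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

# sim[i][j] = similarity between word i of the student answer and
# word j of the expert answer (toy values).
sim = np.array([[0.9, 0.1, 0.2],
                [0.2, 0.8, 0.1],
                [0.3, 0.2, 0.7]])

rows, cols = linear_sum_assignment(-sim)     # negate to maximize similarity
score = sim[rows, cols].sum() / len(rows)    # normalized optimal match score
print(list(zip(rows.tolist(), cols.tolist())), score)  # [(0,0),(1,1),(2,2)] 0.8
```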

Journal ArticleDOI
TL;DR: The entire engine has been completely rewritten to improve both accuracy and computational efficiency, thus allowing for the annotation of complete genomes.
Abstract: Predicting protein function has become increasingly demanding in the era of next generation sequencing technology. The task to assign a curator-reviewed function to every single sequence is impracticable. Bioinformatics tools, easy to use and able to provide automatic and reliable annotations at a genomic scale, are necessary and urgent. In this scenario, the Gene Ontology has provided the means to standardize the annotation classification with a structured vocabulary which can be easily exploited by computational methods. Argot2 is a web-based function prediction tool able to annotate nucleic or protein sequences from small datasets up to entire genomes. It accepts as input a list of sequences in FASTA format, which are processed using BLAST and HMMER searches vs. the UniProtKB and Pfam databases respectively; these sequences are then annotated with GO terms retrieved from the UniProtKB-GOA database, and the terms are weighted using the e-values from BLAST and HMMER. The weighted GO terms are processed according to both their semantic similarity relations described by the Gene Ontology and their associated score. The algorithm is based on the original idea developed in a previous tool called Argot. The entire engine has been completely rewritten to improve both accuracy and computational efficiency, thus allowing for the annotation of complete genomes. The revised algorithm has already been employed and successfully tested during in-house genome projects of grape and apple, and has proven to have high precision and recall in all our benchmark conditions. It has also been successfully compared with Blast2GO, one of the methods most commonly employed for sequence annotation. The server is freely accessible at http://www.medcomp.medicina.unipd.it/Argot2

Journal ArticleDOI
TL;DR: An ontology-based information extraction and retrieval system and its application to the soccer domain are presented, and a keyword-based semantic retrieval approach is proposed, which is improved considerably using domain-specific information extraction, inference, and rules.

Journal ArticleDOI
TL;DR: It is argued that the comprehension of visual narrative is guided by an interaction between structure and meaning, and that a combination of narrative structure and semantic relatedness can facilitate semantic processing of upcoming panels.

Journal ArticleDOI
TL;DR: In this paper, a similarity measure based on both co-occurrence and information content is presented, defined between terms in an ontology and between entities annotated with terms drawn from the ontology.
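
As a sketch of the information-content family this measure belongs to, the following computes Lin similarity over a toy term hierarchy, where IC(t) = -log p(t) and p(t) counts annotations to t and its descendants. The hierarchy, counts, and term names are hypothetical.

```python
# Information-content (Lin) similarity over a toy ontology fragment.
import math

# term -> annotation frequency including descendants (toy numbers)
freq = {"root": 100, "binding": 40, "protein_binding": 10, "dna_binding": 5}
parents = {"protein_binding": "binding", "dna_binding": "binding",
           "binding": "root"}

def ancestors(t):
    out = {t}
    while t in parents:
        t = parents[t]
        out.add(t)
    return out

def ic(t):
    return -math.log(freq[t] / freq["root"])

def lin(a, b):
    common = ancestors(a) & ancestors(b)
    mica = max(common, key=ic)            # most informative common ancestor
    return 2 * ic(mica) / (ic(a) + ic(b))

print(lin("protein_binding", "dna_binding"))  # ~0.35 on these toy counts
```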

Journal ArticleDOI
TL;DR: Functional magnetic resonance imaging is used to assess brain activity during an analogy generation task in which the semantic distance of analogical mapping was varied, indicating increased recruitment of frontopolar cortex as a mechanism for integrating semantically distant information to generate solutions in creative analogical reasoning.
Abstract: Brain-based evidence has implicated the frontal pole of the brain as important for analogical mapping. Separately, cognitive research has identified semantic distance as a key determinant of the creativity of analogical mapping (i.e., more distant analogies are generally more creative). Here, we used functional magnetic resonance imaging to assess brain activity during an analogy generation task in which we varied the semantic distance of analogical mapping (as derived quantitatively from a latent semantic analysis). Data indicated that activity within an a priori region of interest in left frontopolar cortex covaried parametrically with increasing semantic distance, even after removing effects of task difficulty. Results implicate increased recruitment of frontopolar cortex as a mechanism for integrating semantically distant information to generate solutions in creative analogical reasoning.

Patent
18 Jan 2012
TL;DR: In this article, a method of evaluating the semantic relatedness of terms is proposed, which comprises providing a plurality of text segments, calculating, using a processor, a plurality of weights, one for each of the text segments, calculating the prevalence of the co-appearance of each pair of terms in the text segments, and evaluating the relatedness between the members of each pair according to a combination of the respective prevalence and the weight of each text segment in which a co-appearance occurs.
Abstract: A method of evaluating the semantic relatedness of terms. The method comprises providing a plurality of text segments, calculating, using a processor, a plurality of weights, each for another of the plurality of text segments, calculating a prevalence of a co-appearance of each of a plurality of pairs of terms in the plurality of text segments, and evaluating the semantic relatedness between the members of each pair according to a combination of the respective prevalence and a weight of each of the plurality of text segments wherein a co-appearance of the pair occurs.
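
One plausible reading of the claimed method, as a sketch: assign each text segment a weight, then accumulate, for every pair of terms, the weights of the segments in which the pair co-appears. The length-based weight below is a placeholder; the patent leaves the weighting function open.

```python
# Weighted co-occurrence relatedness over text segments (illustrative).
from itertools import combinations
from collections import defaultdict

segments = ["the stock market rallied",
            "market prices and stock values rose",
            "the cat sat on the mat"]

weights = [1.0 / len(s.split()) for s in segments]   # hypothetical weighting

relatedness = defaultdict(float)
for seg, w in zip(segments, weights):
    for a, b in combinations(sorted(set(seg.split())), 2):
        relatedness[(a, b)] += w          # prevalence x segment weight

print(relatedness[("market", "stock")])   # co-appears in two segments
```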

Journal ArticleDOI
TL;DR: This work considered the impact of number of features, number of senses, semantic neighborhood density, imageability, and body–object interaction across five visual word recognition tasks: standard lexical decision, go/no-go lexical decision, speeded pronunciation, progressive demasking, and semantic classification.
Abstract: There is considerable evidence (e.g., Pexman, Hargreaves, Siakaluk, Bodner, & Pope, 2008) that semantically rich words, which are associated with relatively more semantic information, are recognized faster across different lexical processing tasks. The present study extends this earlier work by providing the most comprehensive evaluation to date of semantic richness effects on visual word recognition performance. Specifically, using regression analyses to control for the influence of correlated lexical variables, we considered the impact of contextual dispersion, number of features, number of senses, semantic neighborhood density, imageability, and body-object interaction across five visual word recognition tasks: standard lexical decision, go/no-go lexical decision, speeded pronunciation, semantic classification, and progressive demasking. Semantic richness effects could be reliably detected in all tasks of lexical processing, indicating that semantic representations, particularly their imaginal and featural aspects, play a fundamental role in visual word recognition. However, there was also evidence that the strength of certain richness effects could be flexibly and adaptively modulated by task demands, consistent with an intriguing interplay between task-specific mechanisms and differentiated semantic processing.

Journal ArticleDOI
TL;DR: This paper proposes an approach to mitigating this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text; it is the first effort to explore the use of distributional semantics for this purpose.

Journal ArticleDOI
TL;DR: This paper found that semantic relatedness between an incoming word and its preceding context can override expectations based on two types of stored knowledge: real-world knowledge about the specific events and states conveyed by a verb, and the verb's broader selectional restrictions on the animacy of its argument.

Journal ArticleDOI
12 Jun 2012-PLOS ONE
TL;DR: AlignNemo is a new algorithm that, given the networks of two organisms, uncovers subnetworks of proteins related in biological function and topology of interactions; the discovered subnetworks more closely fit the models of functional complexes proposed in the literature.
Abstract: Local network alignment is an important component of the analysis of protein-protein interaction networks that may lead to the identification of evolutionarily related complexes. We present AlignNemo, a new algorithm that, given the networks of two organisms, uncovers subnetworks of proteins that relate in biological function and topology of interactions. The discovered conserved subnetworks have a general topology and need not correspond to specific interaction patterns, so that they more closely fit the models of functional complexes proposed in the literature. The algorithm is able to handle sparse interaction data with an expansion process that at each step explores the local topology of the networks beyond the proteins directly interacting with the current solution. To assess the performance of AlignNemo, we ran a series of benchmarks using statistical measures as well as biological knowledge. Based on reference datasets of protein complexes, AlignNemo shows better performance than other methods in terms of both precision and recall. We show our solutions to be biologically sound using the concept of semantic similarity applied to Gene Ontology vocabularies. The binaries of AlignNemo and supplementary details about the algorithms and the experiments are available at: sourceforge.net/p/alignnemo.

Journal ArticleDOI
TL;DR: This paper proposes a semantic-gap-oriented active learning method, which incorporates the semantic gap measure into the information-minimization-based sample selection strategy and uses an extended multilabel version of the sparse-graph-based semisupervised learning method that incorporates the semantic correlation.
Abstract: User interaction is an effective way to handle the semantic gap problem in image annotation. To minimize user effort in the interactions, many active learning methods have been proposed. These methods treat the semantic concepts individually or correlatively. However, they still neglect the key motivation of user feedback: to tackle the semantic gap. The size of the semantic gap of each concept is an important factor that affects the performance of user feedback. Users should pay more effort to concepts with large semantic gaps, and vice versa. In this paper, we propose a semantic-gap-oriented active learning method, which incorporates the semantic gap measure into the information-minimization-based sample selection strategy. The basic learning model used in the active learning framework is an extended multilabel version of the sparse-graph-based semisupervised learning method that incorporates the semantic correlation. Extensive experiments conducted on two benchmark image data sets demonstrated the importance of bringing the semantic gap measure into the active learning process.
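
A hypothetical sketch of gap-weighted sample selection follows, with plain uncertainty sampling standing in for the paper's information-minimization strategy: per-concept gap estimates scale each sample's informativeness before the top candidates are queried.

```python
# Gap-weighted active learning selection (all numbers are toy values).
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_concepts = 100, 5
probs = rng.random((n_samples, n_concepts))   # per-concept classifier confidences
gap = np.array([0.9, 0.2, 0.5, 0.7, 0.1])     # estimated semantic gap per concept

uncertainty = 1.0 - np.abs(2.0 * probs - 1.0) # peaks where p is near 0.5
score = (uncertainty * gap).sum(axis=1)       # gap-weighted informativeness
query = np.argsort(score)[-5:]                # samples to show the user next
print(query)
```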

Proceedings ArticleDOI
04 Dec 2012
TL;DR: This paper proposes a novel unsupervised context-based approach to detecting emotion from text at the sentence level that is flexible enough to classify sentences beyond Ekman's model of six basic emotions.
Abstract: Emotion detection from text is a relatively new classification task. This paper proposes a novel unsupervised context-based approach to detecting emotion from text at the sentence level. The proposed methodology does not depend on any existing manually crafted affect lexicons such as WordNet-Affect, thereby rendering our model flexible enough to classify sentences beyond Ekman's model of six basic emotions. Our method computes an emotion vector for each potential affect-bearing word based on the semantic relatedness between words and various emotion concepts. The scores are then fine-tuned using the syntactic dependencies within the sentence structure. Extensive evaluation on various data sets shows that our framework is a more generic and practical solution to the emotion classification problem and yields significantly more accurate results than recent unsupervised approaches.
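
The emotion-vector idea can be sketched with an off-the-shelf relatedness measure. Below, NLTK's WordNet path similarity stands in for the paper's semantic relatedness computation; the emotion inventory is illustrative and deliberately not tied to a fixed affect lexicon.

```python
# Emotion vector: relatedness of a word to each emotion concept.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

EMOTIONS = ["joy", "sadness", "anger", "fear"]

def relatedness(w1, w2):
    """Max path similarity over all sense pairs (a crude relatedness proxy)."""
    sims = [s for a in wn.synsets(w1) for b in wn.synsets(w2)
            if (s := a.path_similarity(b)) is not None]
    return max(sims, default=0.0)

def emotion_vector(word):
    return {e: round(relatedness(word, e), 3) for e in EMOTIONS}

print(emotion_vector("grief"))   # expected to lean toward sadness
```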

Journal ArticleDOI
TL;DR: This work demonstrates the importance of contextual redundancy in lexical access, suggesting that contextual repetitions in language only increase a word's memory strength if the repetitions are accompanied by a modulation in semantic context.
Abstract: Recent research has challenged the notion that word frequency is the organizing principle underlying lexical access, pointing instead to the number of contexts that a word occurs in (Adelman, Brown, & Quesada, 2006). Counting contexts gives a better quantitative fit to human lexical decision and naming data than counting raw occurrences of words. However, this approach ignores the information redundancy of the contexts in which the word occurs, a factor we refer to as semantic diversity. Using both a corpus-based study and a controlled artificial language experiment, we demonstrate the importance of contextual redundancy in lexical access, suggesting that contextual repetitions in language only increase a word’s memory strength if the repetitions are accompanied by a modulation in semantic context. We introduce a cognitive process mechanism to explain the pattern of behaviour by encoding the word’s context relative to the information redundancy between the current context and the word’s current memory representation. The model gives a better account of identification latency data than models based on either raw frequency or document count, and also produces a better-organized space to simulate semantic similarity.

Journal ArticleDOI
TL;DR: This work improves on existing methods of semantic relatedness and reduces the collaboration cost of collaborative vocabulary editing; SRCT-based methods are implemented and compared with some existing semantic relatedness methods.
Abstract: For the multilingual semantic interoperations in cross-organizational enterprise systems and e-commerce systems, semantic consistency is a research issue that has not been well resolved. This paper contributes to improving multilingual semantic interoperation by proposing a concept-connected near synonym (NSG) framework for concept disambiguation. The NSG framework provides a vocabulary preprocessing process for collaborative vocabulary editing, which further ensures a semantically consistent vocabulary for building semantically consistent business processes and documents between context-different information systems. The vocabulary preprocessing offered by NSG automates the process of finding potential near synonym sets and identifying collaboratively editable near synonym sets. The realization of the NSG framework includes a probability model that computes concept values between concepts based on a newly introduced semantic relatedness method, SRCT. In this paper, SRCT-based methods are implemented and compared with some existing semantic relatedness methods. Experiments have shown that SRCT-based methods outperform the existing methods. This paper improves on the existing methods of semantic relatedness and reduces the collaboration cost of collaborative vocabulary editing.

Journal ArticleDOI
01 May 2012
TL;DR: A weighting for each argument generated by SRL is introduced to study its behaviour, and the method significantly outperforms modern methods for plagiarism detection in terms of recall, precision, and F-measure.
Abstract: Plagiarism occurs when content is copied without permission or citation. One of the contributing factors is that many text documents on the internet are easily copied and accessed. This paper introduces a plagiarism detection technique based on Semantic Role Labeling (SRL). The technique analyses and compares text based on the semantic role allocated to each term inside the sentence. SRL is effective at generating semantic arguments for each sentence. A weighting for each argument generated by SRL is also introduced in this paper to study its behaviour. It was found that not all arguments affect the plagiarism detection process. In addition, experimental results on the PAN-PC-09 data sets showed that our method significantly outperforms modern methods for plagiarism detection in terms of recall, precision, and F-measure.
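
A sketch of the weighted comparison step, assuming SRL output is already available as role-labeled arguments per sentence; the role labels and weights are hypothetical knobs reflecting the paper's finding that not all arguments matter equally.

```python
# Weighted comparison of SRL argument sets between two sentences.
def arg_similarity(args1, args2, role_weights):
    score = total = 0.0
    for role, w in role_weights.items():
        total += w
        a, b = set(args1.get(role, [])), set(args2.get(role, []))
        if a | b:
            score += w * len(a & b) / len(a | b)   # weighted Jaccard per role
    return score / total if total else 0.0

src = {"A0": ["the student"], "V": ["copied"], "A1": ["the article"]}
susp = {"A0": ["the student"], "V": ["duplicated"], "A1": ["the article"]}
weights = {"A0": 1.0, "V": 0.5, "A1": 1.0}         # hypothetical role weights

print(arg_similarity(src, susp, weights))          # 0.8 on this toy pair
```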