
Showing papers presented at the International Joint Conference on Natural Language Processing in 2011


Proceedings Article
01 Nov 2011
TL;DR: The authors extracted a large-scale Japanese learners' corpus from the revision log of a language learning SNS and used it as training data for learners' error correction with an SMT approach.
Abstract: We present an attempt to extract a large-scale Japanese learners' corpus from the revision log of a language learning SNS. This corpus is easy to obtain at large scale, covers a wide variety of topics and styles, and can be a great source of knowledge for both language learners and instructors. We also demonstrate that the extracted learners' corpus of Japanese as a second language can be used as training data for learners' error correction using an SMT approach. We evaluate different granularities of tokenization to alleviate the problem of word segmentation errors caused by erroneous input from language learners. Experimental results show that the character-wise model outperforms the word-wise model.
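A minimal sketch of the two tokenization granularities compared above, preparing learner/correction pairs for SMT training; the `segment_words` helper is a hypothetical placeholder for a real Japanese tokenizer such as MeCab, and the example pair is invented:

```python
# Sketch: preparing parallel (learner -> corrected) data at two granularities.

def segment_words(sentence):
    """Hypothetical word segmenter; replace with a real tokenizer (e.g. MeCab)."""
    return sentence.split()  # placeholder for illustration only

def word_wise(sentence):
    # Tokens are words; segmentation errors on noisy learner text propagate here.
    return " ".join(segment_words(sentence))

def char_wise(sentence):
    # Tokens are single characters; no word segmentation is needed at all,
    # which side-steps segmentation errors caused by learner mistakes.
    return " ".join(sentence.replace(" ", ""))

pair = ("私は学校にいきました", "私は学校に行きました")  # (learner, correction)
print(char_wise(pair[0]))
print(char_wise(pair[1]))
# Each line is then fed to a standard phrase-based SMT pipeline as one "sentence".
```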

164 citations


Proceedings Article
01 Nov 2011
TL;DR: It is shown that transductive (cross-domain) learning is an important consideration in building a general-purpose language identification system, and a feature selection method is developed that generalizes across domains.
Abstract: We show that transductive (cross-domain) learning is an important consideration in building a general-purpose language identification system, and develop a feature selection method that generalizes across domains. Our results demonstrate that our method provides improvements in transductive transfer learning for language identification. We provide an implementation of the method and show that our system is faster than popular standalone language identification systems, while maintaining competitive accuracy.

161 citations


Proceedings Article
01 Nov 2011
TL;DR: This paper collects and analyses compositionality judgments for a range of compound nouns using Mechanical Turk, and evaluates two different types of distributional models for compositionality detection – constituent based models and composition function based models.
Abstract: A multiword is compositional if its meaning can be expressed in terms of the meaning of its constituents. In this paper, we collect and analyse compositionality judgments for a range of compound nouns using Mechanical Turk. Unlike existing compositionality datasets, our dataset has judgments on the contribution of constituent words as well as judgments for the phrase as a whole. We use this dataset to study the relation between judgments at the constituent level and those for the whole phrase. We then evaluate two different types of distributional models for compositionality detection – constituent based models and composition function based models. Both types of model show competitive performance, though the composition function based models perform slightly better. In both types, additive models perform better than their multiplicative counterparts.
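The two composition functions contrasted above (additive vs. multiplicative) reduce to a few lines over toy co-occurrence vectors; the vectors and the compound below are illustrative, not the paper's data:

```python
import numpy as np

# Toy distributional vectors for the constituents of a compound, e.g. "swimming pool".
v_mod  = np.array([0.2, 0.7, 0.1, 0.0])   # "swimming"
v_head = np.array([0.3, 0.5, 0.0, 0.2])   # "pool"

additive       = v_mod + v_head            # vector addition
multiplicative = v_mod * v_head            # element-wise (Hadamard) product

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Compositionality can then be scored as the similarity between the composed
# vector and the observed vector of the whole compound (here a made-up vector).
v_compound = np.array([0.25, 0.6, 0.05, 0.1])
print("additive score:      ", cosine(additive, v_compound))
print("multiplicative score:", cosine(multiplicative, v_compound))
```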

145 citations


Proceedings Article
01 Nov 2011
TL;DR: A fully automatic framework for fine-grained sentiment analysis on the subsentence level combining multiple sentiment lexicons and neighborhood as well as discourse relations to overcome the problem of uncertainty in polarity predictions is presented.
Abstract: Sentiment analysis is the problem of determining the polarity of a text with respect to a particular topic. For most applications, however, it is not only necessary to derive the polarity of a text as a whole but also to extract negative and positive utterances on a more fine-grained level. Sentiment analysis systems working on the (sub-)sentence level, however, are difficult to develop since shorter textual segments rarely carry enough information to determine their polarity out of context. In this paper, therefore, we present a fully automatic framework for fine-grained sentiment analysis on the subsentence level combining multiple sentiment lexicons and neighborhood as well as discourse relations to overcome this problem. We use Markov logic to integrate polarity scores from different sentiment lexicons with information about relations between neighboring segments, and evaluate the approach on product reviews. The experiments show that the use of structural features improves the accuracy of polarity predictions, achieving accuracy scores of up to 69%.

139 citations


Proceedings Article
01 Nov 2011
TL;DR: A method for characterizing a research work in terms of its focus, domain of application, and techniques used is presented and it is shown how tracing these aspects over time provides a novel measure of the influence of research communities on each other.
Abstract: We present a method for characterizing a research work in terms of its focus, domain of application, and techniques used. We show how tracing these aspects over time provides a novel measure of the influence of research communities on each other. We extract these characteristics by matching semantic extraction patterns, learned using bootstrapping, to the dependency trees of sentences in an article’s abstract.

111 citations


Proceedings Article
01 Nov 2011
TL;DR: The system illustrates that the classic supervised keyphrase extraction approach – previously used mostly for the scientific genre – can be adapted for opinion-related keyphrases, and the paper provides a comparison of the effectiveness of standard keyphrase extraction features and those of the system designed for the special task of opinion expression mining.
Abstract: In this paper, we introduce a system for extracting keyphrases that express the reasons behind authors' opinions in product reviews. The datasets for two fairly different product review domains, movies and mobile phones, were constructed semi-automatically based on the pros and cons entered by the authors. The system illustrates that the classic supervised keyphrase extraction approach – previously used mostly for the scientific genre – can be adapted for opinion-related keyphrases. Besides adapting the original framework to this special task by defining novel, task-specific features, we also demonstrate an efficient way of representing keyphrase candidates. The paper also provides a comparison of the effectiveness of the standard keyphrase extraction features and those of the system designed for the special task of opinion expression mining.

96 citations


Proceedings Article
01 Nov 2011
TL;DR: This work builds an ensemble-style self-training classification model and achieves better classification performance using only a small amount of training data, which greatly reduces the manual annotation work for this task.
Abstract: Classification of citations into categories such as use, refutation, comparison etc. may have several relevant applications for digital libraries such as paper browsing aids, reading recommendations, qualified citation indexing, or fine-grained impact factor calculation. Most citation classification approaches described so far heavily rely on rule systems and patterns tailored to specific science domains. We focus on a less manual approach by learning domain-insensitive features from textual, physical, and syntactic aspects. Our experiments show the effectiveness of this feature set with various machine learning algorithms on datasets of different sizes. Furthermore, we build an ensemble-style self-training classification model and achieve better classification performance using only a small amount of training data, which greatly reduces the manual annotation work for this task.
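A rough sketch of the ensemble-style self-training loop described above, using off-the-shelf scikit-learn classifiers; the tiny citation snippets, labels, and confidence threshold are invented placeholders, not the authors' setup:

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative data; real experiments use annotated citation contexts.
labeled_texts = ["we use the parser of smith 2005",
                 "our results contradict the claim of jones 2003",
                 "similar to lee 2007 we apply crfs"]
labels        = np.array(["use", "refutation", "comparison"])
unlabeled     = ["we adopt the toolkit described by brown 2009",
                 "unlike kim 2004 our findings differ substantially"]

vec = CountVectorizer().fit(labeled_texts + unlabeled)
X_lab, y_lab = vec.transform(labeled_texts), labels
X_unl = vec.transform(unlabeled)

models = [LogisticRegression(max_iter=1000), MultinomialNB()]
for _ in range(3):                      # a few self-training rounds
    for m in models:
        m.fit(X_lab, y_lab)
    if X_unl.shape[0] == 0:
        break
    preds = np.array([m.predict(X_unl) for m in models])
    conf  = np.mean([m.predict_proba(X_unl).max(axis=1) for m in models], axis=0)
    agree = (preds[0] == preds[1]) & (conf > 0.6)
    if not agree.any():
        break
    # Move confidently agreed-upon examples into the labelled pool.
    X_lab = vstack([X_lab, X_unl[agree]])
    y_lab = np.concatenate([y_lab, preds[0][agree]])
    X_unl = X_unl[~agree]
```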

91 citations


Proceedings Article
01 Nov 2011
TL;DR: A simple yet effective semi-supervised method to improve Chinese word segmentation and POS tagging by introducing novel features derived from large auto-analyzed data to enhance a simple pipelined system.
Abstract: This paper presents a simple yet effective semi-supervised method to improve Chinese word segmentation and POS tagging. We introduce novel features derived from large auto-analyzed data to enhance a simple pipelined system. The auto-analyzed data are generated from unlabeled data by using a baseline system. We evaluate the usefulness of our approach in a series of experiments on the Penn Chinese Treebanks and show that the new features provide substantial performance gains in all experiments. Furthermore, the results of our proposed method are superior to the best reported results in the literature.

91 citations


Proceedings Article
01 Nov 2011
TL;DR: A model that represents word meaning in context by vectors which are modified according to the words in the target’s syntactic context; it outperforms all previous models on a paraphrase ranking task and achieves state-of-the-art performance on word sense disambiguation.
Abstract: We present a model that represents word meaning in context by vectors which are modified according to the words in the target’s syntactic context. Contextualization of a vector is realized by reweighting its components, based on distributional information about the context words. Evaluation on a paraphrase ranking task derived from the SemEval 2007 Lexical Substitution Task shows that our model outperforms all previous models on this task. We show that our model supports a wider range of applications by evaluating it on a word sense disambiguation task. Results show that our model achieves state-of-the-art performance.
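A schematic rendering of the contextualization-by-reweighting idea with plain numpy; the vectors and the normalisation are illustrative assumptions rather than the exact model:

```python
import numpy as np

# A toy 4-dimensional distributional space shared by all vectors.
target  = np.array([0.4, 0.1, 0.3, 0.2])   # e.g. vector for the target word "ball"
context = np.array([0.0, 0.6, 0.1, 0.3])   # e.g. vector for the syntactic context word "dance"

# Reweight each component of the target vector by the (normalised) weight that
# the context word assigns to the same dimension.
weights = context / (context.sum() + 1e-12)
contextualized = target * weights

print(contextualized / (np.linalg.norm(contextualized) + 1e-12))
```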

88 citations


Proceedings Article
08 Nov 2011
TL;DR: It is found that the Wall-Street-Journal-trained statistical parsers have a particular problem with tweets and that a substantial part of this problem is related to POS tagging accuracy.
Abstract: We investigate the problem of parsing the noisy language of social media. We evaluate four Wall-Street-Journal-trained statistical parsers (Berkeley, Brown, Malt and MST) on a new dataset containing 1,000 phrase structure trees for sentences from microblogs (tweets) and discussion forum posts. We compare the four parsers on their ability to produce Stanford dependencies for these Web 2.0 sentences. We find that the parsers have a particular problem with tweets and that a substantial part of this problem is related to POS tagging accuracy. We attempt three retraining experiments involving Malt, Brown and an in-house Berkeley-style parser and obtain a statistically significant improvement for all three parsers.

86 citations


Proceedings Article
01 Nov 2011
TL;DR: This paper proposes a topic model that incorporates category information into the process of discovering latent topics in question content, and combines the topic-based semantic similarity with the translation-based language model in a unified framework for question retrieval.
Abstract: Community-based Question Answering (cQA) is a popular online service where users can ask and answer questions on any topic. This paper is concerned with the problem of question retrieval. Question retrieval in cQA aims to find historical questions that are semantically equivalent or relevant to the queried questions. Although the translation-based language model (Xue et al., 2008) has achieved state-of-the-art performance for question retrieval, it ignores the latent topic information when calculating the semantic similarity between questions. In this paper, we propose a topic model that incorporates category information into the process of discovering the latent topics in the content of questions. We then combine the latent-topic-based semantic similarity with the translation-based language model in a unified framework for question retrieval. Experiments are carried out on a real-world cQA data set from Yahoo! Answers. The results show that our proposed method can significantly improve the question retrieval performance of the translation-based language model.
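At its simplest, combining the two components amounts to a weighted interpolation of their scores when ranking candidate questions; the candidate scores and mixing weight below are invented for illustration:

```python
# Hypothetical per-candidate scores for a queried question.
candidates = {
    "how do i reset my router":      {"translation_lm": 0.42, "topic_sim": 0.70},
    "best router for a small house": {"translation_lm": 0.45, "topic_sim": 0.20},
}

lam = 0.6   # interpolation weight between the two components
ranked = sorted(candidates,
                key=lambda q: lam * candidates[q]["translation_lm"]
                              + (1 - lam) * candidates[q]["topic_sim"],
                reverse=True)
print(ranked[0])   # the candidate whose combined score is highest
```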

Proceedings Article
01 Nov 2011
TL;DR: The first incremental approach to the task of joint POS tagging and dependency parsing, which is built upon a shift-reduce parsing framework with dynamic programming is proposed, achieving the new state-of-the-art performance on this joint task.
Abstract: We address the problem of joint part-of-speech (POS) tagging and dependency parsing in Chinese. In Chinese, some POS tags are often hard to disambiguate without considering longrange syntactic information. Also, the traditional pipeline approach to POS tagging and dependency parsing may suffer from the problem of error propagation. In this paper, we propose the first incremental approach to the task of joint POS tagging and dependency parsing, which is built upon a shift-reduce parsing framework with dynamic programming. Although the incremental approach encounters difficulties with underspecified POS tags of look-ahead words, we overcome this issue by introducing so-called delayed features. Our joint approach achieved substantial improvements over the pipeline and baseline systems in both the POS tagging and dependency parsing tasks, achieving new state-of-the-art performance on this joint task.

Proceedings Article
01 Nov 2011
TL;DR: This paper describes a two-phase method for expanding abbreviations found in informal text using a machine translation system trained at the character level during the first phase, which is much more robust to new abbreviations than a word-level system.
Abstract: This paper describes a two-phase method for expanding abbreviations found in informal text (e.g., email, text messages, chat room conversations) using a machine translation system trained at the character level during the first phase. In this way, the system learns mappings between character-level “phrases” and is much more robust to new abbreviations than a word-level system. We generate translation models that are independent of the way in which the abbreviations are formed and show little degradation compared to when type-dependent models are trained. Our experiments on a large data set show that our proposed system performs well when tested both on isolated abbreviations and, with the incorporation of a second phase utilizing an in-domain language model, in the context of neighboring words.
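A toy approximation of the second phase, choosing among candidate expansions with an in-domain bigram language model over neighbouring words; the candidate lists and counts are invented:

```python
from math import log

# Hypothetical candidate expansions produced by the character-level first phase.
candidates = {"u": ["you", "university"], "2nite": ["tonight"]}

# Toy in-domain counts standing in for a real language model.
bigram  = {("see", "you"): 50, ("see", "university"): 1, ("you", "tonight"): 30}
unigram = {"see": 60, "you": 80, "university": 5, "tonight": 35}

def bigram_logprob(prev, word, alpha=1.0, vocab=len(unigram)):
    # Add-alpha smoothed bigram probability.
    return log((bigram.get((prev, word), 0) + alpha) /
               (unigram.get(prev, 0) + alpha * vocab))

def expand(tokens):
    out = []
    for tok in tokens:
        opts = candidates.get(tok, [tok])
        prev = out[-1] if out else "<s>"
        out.append(max(opts, key=lambda w: bigram_logprob(prev, w)))
    return out

print(expand(["see", "u", "2nite"]))   # -> ['see', 'you', 'tonight']
```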

Proceedings Article
01 Nov 2011
TL;DR: A system to mine information regarding the safety of people in the disaster-stricken area from Twitter, a massive yet highly unorganized information source, was created.
Abstract: This paper describes efforts of NLP researchers to create a system to aid the relief efforts during the 2011 East Japan Earthquake. Specifically, we created a system to mine information regarding the safety of people in the disaster-stricken area from Twitter, a massive yet highly unorganized information source. We describe the large scale collaborative effort to rapidly create robust and effective systems for word segmentation, named entity recognition, and tweet classification. As a result of our efforts, we were able to effectively deliver new information about the safety of over 100 people in the disaster-stricken area to a central repository for safety information.

Proceedings Article
01 Nov 2011
TL;DR: A discriminative model is presented that simultaneously determines an appropriate case frame for a given predicate and its predicate-argument structure and the results of zero anaphora resolution on Web text and the effectiveness of the approach are reported.
Abstract: We present a discriminative model for Japanese zero anaphora resolution that simultaneously determines an appropriate case frame for a given predicate and its predicate-argument structure. Our model is based on a log linear framework, and exploits lexical features obtained from a large raw corpus, as well as non-lexical features obtained from a relatively small annotated corpus. We report the results of zero anaphora resolution on Web text and demonstrate the effectiveness of our approach. In addition, we also investigate the relative importance of each feature for resolving zero anaphora in Web text.

Proceedings Article
01 Nov 2011
TL;DR: A POS-based ensemble model is proposed to efficiently integrate features with different types of POS tags to improve the classification performance of cross-domain sentiment classification.
Abstract: In this paper, we focus on the task of cross-domain sentiment classification. We find that, across different domains, features with some types of part-of-speech (POS) tags are domain-dependent, while others are domain-free. Based on this finding, we propose a POS-based ensemble model to efficiently integrate features with different types of POS tags to improve the classification performance. Weights are trained by stochastic gradient descent (SGD) to optimize the perceptron and minimum classification error (MCE) criteria. Experimental results show that the proposed ensemble model is quite effective for the task of cross-domain sentiment classification.

Proceedings Article
01 Nov 2011
TL;DR: A new automatically aligned resource of Wiktionary and WordNet is proposed that has a very high domain coverage of word senses and an enriched sense representation, including pronunciations, etymologies, translations, etc.
Abstract: To date, no lexical resource can claim to be fully comprehensive or to perform best for every NLP task. This has caused a steep increase in resource alignment research. An important challenge in this area is the alignment of differently represented word senses, which we address in this paper. In particular, we propose a new automatically aligned resource of Wiktionary and WordNet that has (i) very high domain coverage of word senses and (ii) an enriched sense representation, including pronunciations, etymologies, translations, etc. We evaluate our alignment both quantitatively and qualitatively, and explore how it can contribute to practical tasks.

Proceedings Article
01 Nov 2011
TL;DR: This paper takes a data-driven approach to identifying arguments of explicit discourse connectives and designs the argument segmentation task as a cascade of decisions based on conditional random fields (CRFs).
Abstract: Parsing discourse is a challenging natural language processing task. In this paper we take a data-driven approach to identify arguments of explicit discourse connectives. In contrast to previous work we do not make any assumptions on the span of arguments and consider parsing as a token-level sequence labeling task. We design the argument segmentation task as a cascade of decisions based on conditional random fields (CRFs). We train the CRFs on lexical, syntactic and semantic features extracted from the Penn Discourse Treebank and evaluate feature combinations on the commonly used test split. We show that the best combination of features includes syntactic and semantic features. The comparative error analysis investigates the performance variability over connective types and argument positions.
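Casting argument identification as token-level sequence labelling maps naturally onto a standard CRF toolkit; the sketch below uses sklearn-crfsuite with a reduced, made-up feature set and BIO-style Arg1/Arg2 labels, which are assumptions rather than the paper's exact features:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, pos_tags, i):
    # A few lexical and syntactic features; real systems add many more
    # (connective distance, constituent path, semantic class, ...).
    return {
        "word": tokens[i].lower(),
        "pos": pos_tags[i],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "is_connective": tokens[i].lower() in {"because", "but", "although"},
    }

tokens = ["I", "stayed", "home", "because", "it", "rained"]
pos    = ["PRP", "VBD", "NN", "IN", "PRP", "VBD"]
labels = ["B-Arg1", "I-Arg1", "I-Arg1", "O", "B-Arg2", "I-Arg2"]

X = [[token_features(tokens, pos, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```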

Proceedings Article
01 Nov 2011
TL;DR: A new test collection is created to evaluate cross-language entity linking performance in twenty-one languages, with experiments that examine issues such as the importance of transliteration, the utility of cross-language information retrieval, and the potential benefit of multilingual named entity recognition.
Abstract: There has been substantial recent interest in aligning mentions of named entities in unstructured texts to knowledge base descriptors, a task commonly called entity linking. This technology is crucial for applications in knowledge discovery and text data mining. This paper presents experiments in the new problem of cross-language entity linking, where documents and named entities are in a different language than that used for the content of the reference knowledge base. We have created a new test collection to evaluate cross-language entity linking performance in twenty-one languages. We present experiments that examine issues such as the importance of transliteration, the utility of cross-language information retrieval, and the potential benefit of multilingual named entity recognition. Our best model achieves performance which is 94% of a strong monolingual baseline.

Proceedings Article
01 Nov 2011
TL;DR: A hierarchical Bayesian model based on latent Dirichlet allocation (LDA) is presented, called subjLDA, for sentence-level subjectivity detection, which automatically identifies whether a given sentence expresses opinion or states facts.
Abstract: This paper presents a hierarchical Bayesian model based on latent Dirichlet allocation (LDA), called subjLDA, for sentence-level subjectivity detection, which automatically identifies whether a given sentence expresses opinion or states facts. In contrast to most of the existing methods relying on either labelled corpora for classifier training or linguistic pattern extraction for subjectivity classification, we view the problem as weakly-supervised generative model learning, where the only input to the model is a small set of domain independent subjectivity lexical clues. A mechanism is introduced to incorporate the prior information about the subjectivity lexical clues into model learning by modifying the Dirichlet priors of topic-word distributions. The subjLDA model has been evaluated on the Multi-Perspective Question Answering (MPQA) dataset and promising results have been observed in the preliminary experiments. We have also explored adding neutral words as prior information for model learning. It was found that while incorporating subjectivity clues bearing positive or negative polarity can achieve a significant performance gain, the prior lexical information from neutral words is less effective.
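The prior-modification mechanism can be illustrated with a small asymmetric prior matrix over a toy vocabulary; the values, clue list, and two-class setup are illustrative only:

```python
import numpy as np

vocab   = ["excellent", "awful", "today", "announced", "love"]
classes = ["subjective", "objective"]          # two "topics" at the top level
clues   = {"excellent": "subjective", "awful": "subjective", "love": "subjective"}

base_beta = 0.01                               # symmetric base prior
boost     = 1.0                                # extra mass for clue words

# beta[k][w]: Dirichlet prior of word w under class k, raised for clue words
# in the subjective class and left symmetric elsewhere.
beta = np.full((len(classes), len(vocab)), base_beta)
for w, cls in clues.items():
    k, j = classes.index(cls), vocab.index(w)
    beta[k, j] += boost

print(beta)
# This prior matrix then replaces the symmetric beta in the sampler,
# nudging clue words toward the subjective word distribution.
```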

Proceedings Article
01 Nov 2011
TL;DR: The results show that selecting relevant senses of the constituent words leads to a better semantic composition of the compound, and dynamic prototypes perform better than static prototypes.
Abstract: Compositional Distributional Semantic methods model the distributional behavior of a compound word by exploiting the distributional behavior of its constituent words. In this setting, a constituent word is typically represented by a feature vector conflating all the senses of that word. However, not all the senses of a constituent word are relevant when composing the semantics of the compound. In this paper, we present two different methods for selecting the relevant senses of constituent words. The first one is based on Word Sense Induction and creates static multi-prototype vectors representing the senses of a constituent word. The second creates a single dynamic prototype vector for each constituent word based on the distributional properties of the other constituents in the compound. We use these prototype vectors for composing the semantics of noun-noun compounds and evaluate on a compositionality-based similarity task. Our results show that: (1) selecting relevant senses of the constituent words leads to a better semantic composition of the compound, and (2) dynamic prototypes perform better than static prototypes.

Proceedings Article
01 Nov 2011
TL;DR: This work presents a machine-learning approach that automatically clusters words in multilingual word lists into cognate sets using a number of diverse word similarity measures and features that encode the degree of affinity between pairs of languages.
Abstract: Word lists have become available for most of the world’s languages, but only a small fraction of such lists contain cognate information. We present a machine-learning approach that automatically clusters words in multilingual word lists into cognate sets. Our method incorporates a number of diverse word similarity measures and features that encode the degree of affinity between pairs of languages. The output of the classification algorithm is then used to generate cognate groups. The results of the experiments on word lists representing several language families demonstrate the utility of the proposed approach.
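A reduced version of the pairwise step feeds string-similarity features (normalised edit distance, common-prefix ratio) and a language-pair affinity score into a binary classifier; the word pairs, affinities, and feature set below are invented stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def edit_distance(a, b):
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(len(a) + 1), np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1,
                          d[i-1, j-1] + (a[i-1] != b[j-1]))
    return d[len(a), len(b)]

def common_prefix(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pair_features(w1, w2, lang_affinity):
    longest = max(len(w1), len(w2))
    return [edit_distance(w1, w2) / longest,
            common_prefix(w1, w2) / longest,
            lang_affinity]

# Toy training pairs: (word1, word2, language-pair affinity, is_cognate)
data = [("night", "nacht", 0.8, 1), ("hand", "hand", 0.8, 1),
        ("dog", "hund", 0.8, 0), ("water", "mizu", 0.1, 0)]
X = [pair_features(a, b, aff) for a, b, aff, _ in data]
y = [lab for *_, lab in data]

clf = LogisticRegression().fit(X, y)
print(clf.predict([pair_features("milk", "milch", 0.8)]))
# Pairwise decisions are then grouped (e.g. by clustering) into cognate sets.
```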

Proceedings Article
01 Nov 2011
TL;DR: This paper formalizes the task of finding a knowledge base entry that a given named entity mention refers to, namely entity linking, by identifying the most “important” node among the graph nodes representing the candidate entries by introducing three degree-based measures of graph connectivity.
Abstract: In this paper, we formalize the task of finding a knowledge base entry that a given named entity mention refers to, namely entity linking, by identifying the most “important” node among the graph nodes representing the candidate entries. With the aim of ranking these entities by their “importance”, we introduce three degree-based measures of graph connectivity. Experimental results on the TAC-KBP benchmark data sets show that our graph-based method performs comparably with the state-of-the-art methods. We also show that using the name phrase feature outperforms the commonly used bag-of-words feature for entity linking.
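The ranking-by-connectivity idea can be sketched with networkx: build a graph over candidate entries and context entities, then pick the candidate with the highest degree-based score; the graph, edges, and the single measure used here are illustrative, not the paper's three measures:

```python
import networkx as nx

# Nodes: candidate KB entries for the mention "Jordan" plus entities from the
# surrounding document; edges: illustrative relatedness links.
G = nx.Graph()
G.add_edges_from([
    ("Michael Jordan (basketball)", "Chicago Bulls"),
    ("Michael Jordan (basketball)", "NBA"),
    ("Michael I. Jordan (scientist)", "machine learning"),
    ("Chicago Bulls", "NBA"),
])

candidates = ["Michael Jordan (basketball)", "Michael I. Jordan (scientist)"]
degree_centrality = nx.degree_centrality(G)

best = max(candidates, key=lambda c: degree_centrality.get(c, 0.0))
print(best)   # the more connected candidate wins
```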

Proceedings Article
01 Nov 2011
TL;DR: This work proposes a new subjectivity classification at the segment level that is more appropriate for discourse-based sentiment analysis, automatically distinguishing between subjective non-evaluative and objective segments by using local and global context features.
Abstract: We propose a new subjectivity classification at the segment level that is more appropriate for discourse-based sentiment analysis. Our approach automatically distinguishes between subjective non-evaluative and objective segments, and between implicit and explicit opinions, by using local and global context features.

Proceedings Article
01 Nov 2011
TL;DR: The task of identifying general and specific sentences in news articles is introduced, and it is shown that the specificity levels predicted by the classifier correlate with the intuitive judgement of specificity employed by people when creating summaries.
Abstract: In this paper, we introduce the task of identifying general and specific sentences in news articles. Given the novelty of the task, we explore the feasibility of using existing annotations of discourse relations as training data for a general/specific classifier. The classifier relies on several classes of features that capture lexical and syntactic information, as well as word specificity and polarity. We also validate our results on sentences that were directly judged by multiple annotators to be general or specific. We analyze the annotator agreement on specificity judgements and study the strengths and robustness of features. We also provide a task-based evaluation of our classifier on general and specific summaries written by people. Here we show that the specificity levels predicted by our classifier correlate with the intuitive judgement of specificity employed by people when creating these summaries.

Proceedings Article
01 Nov 2011
TL;DR: This paper describes the structured named entity definition used in this challenge and presents a method to transfer reference annotations to ASR output, used in the Quaero 2010 evaluation of extended named entity annotation on speech transcripts, whose results are given.
Abstract: The evaluation of named entity recognition (NER) methods is an active field of research. This includes the recognition of named entities in speech transcripts. Evaluating NER systems on automatic speech recognition (ASR) output when the human reference annotation was prepared on clean manual transcripts raises difficult alignment issues. These issues are emphasized when named entities are structured, as is the case in the Quaero NER challenge organized in 2010. This paper describes the structured named entity definition used in this challenge and presents a method to transfer reference annotations to ASR output. This method was used in the Quaero 2010 evaluation of extended named entity annotation on speech transcripts, whose results are given in the paper.

Proceedings Article
Shiqi Zhao1, Haifeng Wang1, Chao Li, Ting Liu2, Yi Guan2 
01 Nov 2011
TL;DR: Experimental results show that the precision of the 1-best and 5-best generated questions is 67% and 61%, respectively, outperforming a baseline method that directly retrieves questions for queries with a cQA site's search engine.
Abstract: This paper proposes a method that automatically generates questions from queries for community-based question answering (cQA) services. Our query-to-question generation model is built upon templates induced from search engine query logs. In detail, we first extract pairs of queries and user-clicked questions from query logs, with which we induce question generation templates. Then, when a new query is submitted, we select proper templates for the query and generate questions through template instantiation. We evaluated the method with a set of short queries randomly selected from query logs, and the generated questions were judged by human annotators. Experimental results show that the precision of the 1-best and 5-best generated questions is 67% and 61%, respectively, which outperforms a baseline method that directly retrieves questions for queries with a cQA site's search engine. In addition, the results also suggest that the proposed method can improve the search of cQA archives.
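Template induction and instantiation can be shown in a few lines: a logged (query, clicked-question) pair yields a template by replacing the shared content term with a slot, and a new query fills that slot; the pair and the slot-detection shortcut below are invented:

```python
# Induce a template from a (query, clicked question) pair found in the log.
query, question = "iphone battery life", "How long does the iphone battery last?"
shared = "iphone"                         # shared content term (assumed detected)
template = (query.replace(shared, "[X]"), question.replace(shared, "[X]"))

# Instantiate the template for a new query matching the query-side pattern.
new_query = "kindle battery life"
slot_value = new_query.replace(" battery life", "")   # fills [X]
generated = template[1].replace("[X]", slot_value)
print(generated)   # -> How long does the kindle battery last?
```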

Proceedings Article
01 Nov 2011
TL;DR: It is shown that a hybrid image-text approach can lead to improvements in word relatedness, confirming the applicability of visual cues as a possible orthogonal information source.
Abstract: Traditional approaches to semantic relatedness are often restricted to text-based methods, which typically disregard other multimodal knowledge sources. In this paper, we propose a novel image-based metric to estimate the relatedness of words, and demonstrate the promise of this method through comparative evaluations on three standard datasets. We also show that a hybrid image-text approach can lead to improvements in word relatedness, confirming the applicability of visual cues as a possible orthogonal information source.
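The hybrid combination reduces to interpolating a text-based similarity with an image-based one; the toy vectors and the mixing weight are illustrative assumptions:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy representations for the word pair ("car", "automobile").
text_car,  text_auto  = np.array([0.6, 0.1, 0.3]), np.array([0.5, 0.2, 0.3])
image_car, image_auto = np.array([0.4, 0.4, 0.2]), np.array([0.35, 0.45, 0.2])

lam = 0.5   # mixing weight between the two modalities
relatedness = lam * cosine(text_car, text_auto) + (1 - lam) * cosine(image_car, image_auto)
print(round(relatedness, 3))
```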

Proceedings Article
01 Nov 2011
TL;DR: This paper describes a new set of named entities with a multilevel tree structure, in which base entities are combined to define more complex ones, making the NER task more complex than previous tasks, all the more so because noisy data are used for annotation.
Abstract: Named Entity Recognition (NER) is a well-known Natural Language Processing (NLP) task, used as a preliminary processing step to provide a semantic level for more complex tasks. In this paper we describe a new set of named entities with a multilevel tree structure, where base entities are combined to define more complex ones. This definition makes the NER task more complex than previous tasks, all the more so because of the use of noisy data for annotation: transcriptions of French broadcast data. We propose an original and effective system to tackle this new task, combining the strengths of sequence labeling approaches and syntactic parsing through a cascade of different models. Our system was evaluated in the 2011 Quaero named entity detection evaluation campaign and ranked first, with results far better than those of the other participating systems.

Proceedings Article
01 Nov 2011
TL;DR: It is shown that using LDA for word class induction scales better with the number of classes than the Brown algorithm, and that the resulting classes outperform Brown on all three tasks.
Abstract: Word classes automatically induced from distributional evidence have proved useful in many NLP tasks, including Named Entity Recognition, parsing and sentence retrieval. The Brown hard clustering algorithm is commonly used in this scenario. Here we propose to use Latent Dirichlet Allocation in order to induce soft, probabilistic word classes. We compare our approach against Brown in terms of efficiency. We also compare the usefulness of the induced Brown and LDA word classes for the semi-supervised learning of three NLP tasks: fine-grained Named Entity Recognition, Morphological Analysis and semantic Relation Classification. We show that using LDA for word class induction scales better with the number of classes than the Brown algorithm, and that the resulting classes outperform Brown on all three tasks.
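Using LDA for word class induction can be sketched with gensim by treating the bag of context words around each target word type as a pseudo-document and reading topics as soft classes; the pseudo-documents below are a toy stand-in, and gensim is an assumed toolkit choice:

```python
from gensim import corpora, models

# Pseudo-documents: for each target word type, the bag of words observed in its
# contexts (invented here for illustration).
contexts = {
    "paris":  ["capital", "france", "city", "eiffel"],
    "london": ["capital", "england", "city", "thames"],
    "apple":  ["fruit", "eat", "tree", "juice"],
    "pear":   ["fruit", "eat", "tree", "sweet"],
}

dictionary = corpora.Dictionary(contexts.values())
bows = {w: dictionary.doc2bow(ctx) for w, ctx in contexts.items()}

lda = models.LdaModel(list(bows.values()), id2word=dictionary,
                      num_topics=2, passes=20, random_state=0)

# Soft class membership of each target word = topic distribution of its pseudo-document.
for word, bow in bows.items():
    print(word, lda.get_document_topics(bow, minimum_probability=0.0))
```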