Showing papers by Pushpak Bhattacharyya published in 2013


Proceedings Article
01 Aug 2013
TL;DR: The primary use of this work is a way of “binning” sentences (to be translated) into “easy”, “medium” and “hard” categories as per their predicted TDI, which can decide the pricing of any translation task.
Abstract: In this paper we introduce the Translation Difficulty Index (TDI), a measure of difficulty in text translation. We first define and quantify translation difficulty in terms of TDI. We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. We rely, rather, on cognitive evidence from eye tracking. TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. We then establish that TDI is correlated with three properties of the input sentence, viz. length (L), degree of polysemy (DP) and structural complexity (SC). We train a Support Vector Regression (SVR) system to predict TDIs for new sentences using these features as input. The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of sentence–TDI pairs for a set of sentences. The primary use of our work is a way of “binning” sentences (to be translated) into “easy”, “medium” and “hard” categories as per their predicted TDI. This can decide the pricing of any translation task, which is especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. It can also provide a way of monitoring the progress of second language learners.
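
A minimal sketch of the prediction-and-binning step described above, assuming scikit-learn's SVR; the feature values, gold TDIs and bin thresholds are illustrative, not the paper's data:

```python
# Sketch: predict Translation Difficulty Index (TDI) from sentence features
# (length, degree of polysemy, structural complexity) and bin sentences into
# "easy" / "medium" / "hard". All numbers below are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

# Each row: [length, degree_of_polysemy, structural_complexity];
# gold TDI = fixation time + saccade time from eye tracking (hypothetical values).
X_train = np.array([[8, 1.2, 0.3], [15, 2.5, 0.7], [25, 3.8, 1.4], [12, 1.9, 0.5]])
y_train = np.array([2.1, 4.6, 8.9, 3.4])

model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_train, y_train)

def bin_sentence(features, easy_max=3.0, medium_max=6.0):
    """Bin a sentence by its predicted TDI (thresholds are assumed cut-offs)."""
    tdi = model.predict(np.array([features]))[0]
    if tdi <= easy_max:
        return "easy", tdi
    if tdi <= medium_max:
        return "medium", tdi
    return "hard", tdi

print(bin_sentence([18, 2.8, 0.9]))
```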

42 citations


Proceedings Article
01 Aug 2013
TL;DR: The problem of data sparsity in sentiment analysis, both monolingual and cross-lingual, is addressed through clustering, and experiments show that cluster-based data sparsity reduction leads to better performance than sense-based classification for sentiment analysis at the document level.
Abstract: Expensive feature engineering based on WordNet senses has been shown to be useful for document level sentiment classification. A plausible reason for such a performance improvement is the reduction in data sparsity. However, such a reduction could be achieved with less effort through syntagma-based word clustering. In this paper, the problem of data sparsity in sentiment analysis, both monolingual and cross-lingual, is addressed through clustering. Experiments show that cluster-based data sparsity reduction leads to better performance than sense-based classification for sentiment analysis at the document level. A similar idea is applied to Cross Lingual Sentiment Analysis (CLSA), and it is shown that reduction in data sparsity (after translation or bilingual mapping) produces accuracy higher than Machine Translation based CLSA and sense based CLSA.
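
A minimal sketch of the sparsity-reduction idea: map surface words to cluster identifiers and train the document-level classifier on cluster features. KMeans over toy co-occurrence vectors stands in for the paper's syntagma-based clustering:

```python
# Sketch: cluster words, replace them with cluster IDs, then train a
# document-level sentiment classifier on the cluster features.
# KMeans over word-by-document counts is a stand-in for syntagma-based clustering.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["the movie was great and moving", "a great touching story",
        "the film was dull and boring", "a boring lifeless plot"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
doc_term = vec.fit_transform(docs)            # documents x words
word_vectors = doc_term.T.toarray()           # words x documents (toy context vectors)
words = vec.get_feature_names_out()

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(word_vectors)
word2cluster = dict(zip(words, clusters))

def to_cluster_doc(doc):
    """Replace each known word with its cluster-ID token to reduce sparsity."""
    return " ".join(f"c{word2cluster[w]}" for w in doc.split() if w in word2cluster)

cluster_features = CountVectorizer().fit_transform([to_cluster_doc(d) for d in docs])
clf = LogisticRegression().fit(cluster_features, labels)
```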

39 citations


Posted Content
TL;DR: This report takes a look at the various challenges and applications of Sentiment Analysis, and discusses in detail various approaches to perform a computational treatment of sentiments and opinions.
Abstract: Our day-to-day life has always been influenced by what people think. Ideas and opinions of others have always affected our own opinions. The explosion of Web 2.0 has led to increased activity in Podcasting, Blogging, Tagging, Contributing to RSS, Social Bookmarking, and Social Networking. As a result there has been an eruption of interest in mining these vast resources of data for opinions. Sentiment Analysis or Opinion Mining is the computational treatment of opinions, sentiments and subjectivity of text. In this report, we take a look at the various challenges and applications of Sentiment Analysis. We will discuss in detail various approaches to perform a computational treatment of sentiments and opinions. Various supervised or data-driven techniques to SA like Naïve Bayes, Maximum Entropy, SVM, and Voted Perceptrons will be discussed and their strengths and drawbacks will be touched upon. We will also see a new dimension of analyzing sentiments by Cognitive Psychology, mainly through the work of Janyce Wiebe, where we will see ways to detect subjectivity, perspective in narrative and understanding of discourse structure. We will also study some specific topics in Sentiment Analysis and the contemporary works in those areas.

32 citations


Proceedings Article
01 Oct 2013
TL;DR: This paper extracts 274 words that are polar in the movie domain, but are not present in the universal sentiment lexicon, and uses the Chi-Square test to detect if the difference is significant between the counts of the word in positive and negative polarity documents.
Abstract: There are many examples in which a word changes its polarity from domain to domain. For example, unpredictable is positive in the movie domain, but negative in the product domain. Such words cannot be entered in a “universal sentiment lexicon”, which is supposed to be a repository of words with polarity invariant across domains. Rather, we need to maintain separate domain-specific sentiment lexicons. The main contribution of this paper is to present an effective method of generating a domain-specific sentiment lexicon. For a word whose domain-specific polarity needs to be determined, the approach uses the Chi-Square test to detect whether the difference between the counts of the word in positive and negative polarity documents is significant. We extract 274 words that are polar in the movie domain, but are not present in the universal sentiment lexicon. Our overall accuracy is around 60% in detecting movie-domain-specific polar words.
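
A minimal sketch of the chi-square step, assuming the counts of a candidate word in positive and negative reviews are already available; the counts and the 0.05 significance threshold are illustrative:

```python
# Sketch: decide whether a word is domain-polar by testing whether its counts
# in positive vs. negative documents differ significantly (chi-square test).
# The counts and the alpha threshold below are illustrative assumptions.
from scipy.stats import chi2_contingency

def domain_polarity(word_pos, word_neg, other_pos, other_neg, alpha=0.05):
    """word_pos/word_neg: counts of the word in positive/negative documents;
    other_pos/other_neg: counts of all other words, used as the baseline row."""
    table = [[word_pos, word_neg], [other_pos, other_neg]]
    chi2, p_value, _, _ = chi2_contingency(table)
    if p_value >= alpha:
        return "not significantly polar in this domain"
    return "positive" if word_pos > word_neg else "negative"

# e.g. "unpredictable" in movie reviews (hypothetical counts)
print(domain_polarity(word_pos=120, word_neg=35, other_pos=50000, other_neg=52000))
```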

19 citations


Proceedings Article
01 Jun 2013
TL;DR: This work focuses on unstructured and noisy text like tweets on which linguistic tools like parsers and POS-taggers don’t work properly and shows how conjunctions, connectives, modals and conditionals affect the sentiments in tweets.
Abstract: We propose a method for using discourse relations for polarity detection of tweets. We have focused on unstructured and noisy text like tweets, on which linguistic tools like parsers and POS-taggers do not work properly. We have shown how conjunctions, connectives, modals and conditionals affect the sentiments in tweets. We have also handled the abbreviations, slang and collocations commonly used in short text messages like tweets. This work focuses on a Web based application which produces results in real time. This approach is an extension of previous work (Mukherjee et al. 2012).
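
A minimal rule-based sketch of how connectives, modals and slang normalization can shape a tweet's polarity score; the lexicon, word lists and weights are toy assumptions, not the paper's actual rules:

```python
# Sketch: discourse-aware polarity scoring for a tweet. Material after "but" is
# emphasized, material under modals/conditionals is attenuated, and a small
# slang map normalizes tokens first. All lists and weights are toy assumptions.
LEXICON = {"good": 1, "great": 2, "bad": -1, "boring": -2, "love": 2}
SLANG = {"gr8": "great", "luv": "love"}
CONTRAST = {"but", "however"}                 # emphasize the clause that follows
ATTENUATE = {"if", "could", "might", "may", "would"}

def tweet_polarity(tweet):
    score, weight = 0.0, 1.0
    for tok in tweet.lower().split():
        tok = SLANG.get(tok, tok)             # normalize slang/abbreviations
        if tok in CONTRAST:
            score *= 0.5                      # discount what came before "but"
            weight = 2.0                      # emphasize what follows it
        elif tok in ATTENUATE:
            weight = 0.5                      # hedged / conditional context
        else:
            score += weight * LEXICON.get(tok, 0)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(tweet_polarity("the plot was boring but the acting was gr8"))
```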

17 citations


Proceedings Article
01 Jun 2013
TL;DR: This paper presents a hypothesis regarding the cognitive sub-processes involved in the task of WSD, and strives to find the levels of difficulty in annotating various classes of words with senses.
Abstract: Word Sense Disambiguation (WSD) approaches have reported good accuracies in recent years. However, these approaches can be classified as weak AI systems. According to the classical definition, a strong AI based WSD system should perform the task of sense disambiguation in the same manner and with similar accuracy as human beings. In order to accomplish this, a detailed understanding of the human techniques employed for sense disambiguation is necessary. Instead of building yet another WSD system that uses contextual evidence for sense disambiguation, as has been done before, we have taken a step back: we have endeavored to discover the cognitive faculties that lie at the very core of the human sense disambiguation technique. In this paper, we present a hypothesis regarding the cognitive sub-processes involved in the task of WSD. We support our hypothesis using experiments conducted by means of an eye-tracking device. We also strive to find the levels of difficulty in annotating various classes of words with senses. We believe that once such an in-depth analysis is performed, numerous insights can be gained to develop a robust WSD system that conforms to the principle of strong AI.

17 citations


Book ChapterDOI
24 Mar 2013
TL;DR: This study takes the stand that languages which have a genuine need for a Sentiment Analysis engine should focus on collecting a few polarity annotated documents in their language instead of relying on CLSA.
Abstract: Recently there has been a lot of interest in Cross Language Sentiment Analysis (CLSA) using Machine Translation (MT) to facilitate Sentiment Analysis in resource deprived languages. The idea is to use the annotated resources of one language (say, L1) for performing Sentiment Analysis in another language (say, L2) which does not have annotated resources. The success of such a scheme crucially depends on the availability of a MT system between L1 and L2. We argue that such a strategy ignores the fact that a Machine Translation system is much more demanding in terms of resources than a Sentiment Analysis engine. Moreover, these approaches fail to take into account the divergence in the expression of sentiments across languages. We provide strong experimental evidence to prove that even the best of such systems do not outperform a system trained using only a few polarity annotated documents in the target language. Having a very large number of documents in L1 also does not help because most Machine Learning approaches converge (or reach a plateau) after a certain training size (as demonstrated by our results). Based on our study, we take the stand that languages which have a genuine need for a Sentiment Analysis engine should focus on collecting a few polarity annotated documents in their language instead of relying on CLSA.

16 citations


Proceedings Article
01 Aug 2013
TL;DR: This paper proposes a working definition of thwarting amenable to machine learning and creates a system that detects if a document is thwarted or not, and shows that machine learning with annotated corpora (thwarted/non-thwarted) is more effective than the rule-based system.
Abstract: Thwarting and sarcasm are two uncharted territories in sentiment analysis, the former because of the lack of training corpora and the latter because of the enormous amount of world knowledge it demands. In this paper, we propose a working definition of thwarting amenable to machine learning and create a system that detects if a document is thwarted or not. We focus on identifying thwarting in product reviews, especially in the camera domain. An ontology of the camera domain is created. Thwarting is looked upon as the phenomenon of polarity reversal at a higher level of the ontology compared to the polarity expressed at the lower level. This notion of thwarting defined with respect to an ontology is novel, to the best of our knowledge. A rule-based implementation building upon this idea forms our baseline. We show that machine learning with annotated corpora (thwarted/non-thwarted) is more effective than the rule-based system. Because of the skewed distribution of thwarting, we adopt the Area-under-the-Curve measure of performance. To the best of our knowledge, this is the first attempt at the difficult problem of thwarting detection, which we hope will at least provide a baseline system to compare against.
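
A minimal sketch of the rule-based baseline idea: compare the sentiment aggregated over the lower levels of a camera ontology with the sentiment expressed on the root; opposite signs suggest thwarting. The ontology and the aspect polarities are illustrative assumptions:

```python
# Sketch: detect thwarting as polarity reversal between the lower ontology
# levels and the root (overall product). Ontology and polarities are toy values.
CAMERA_ONTOLOGY = {
    "camera": ["lens", "battery", "body"],    # root -> children
    "lens": [], "battery": [], "body": [],
}

def subtree_polarity(node, aspect_polarity):
    """Sum the polarity of a node and all of its descendants."""
    total = aspect_polarity.get(node, 0)
    for child in CAMERA_ONTOLOGY.get(node, []):
        total += subtree_polarity(child, aspect_polarity)
    return total

def is_thwarted(aspect_polarity, root="camera"):
    lower = sum(subtree_polarity(c, aspect_polarity) for c in CAMERA_ONTOLOGY[root])
    overall = aspect_polarity.get(root, 0)     # sentiment on the root itself
    return overall != 0 and lower != 0 and (overall > 0) != (lower > 0)

# "The lens is weak and the battery drains fast, but I still love this camera."
print(is_thwarted({"lens": -1, "battery": -1, "camera": +2}))   # -> True
```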

15 citations


Proceedings Article
01 Aug 2013
TL;DR: An improvement of 17%-35% in the accuracy of verb WSD is obtained compared to the existing EM based approach, and the work can be looked upon as contributing to the framework of unsupervised WSD through context aware expectation maximization.
Abstract: Word Sense Disambiguation (WSD) is one of the toughest problems in NLP, and in WSD, verb disambiguation has proved to be extremely difficult, because of the high degree of polysemy, too fine grained senses, absence of a deep verb hierarchy and low inter-annotator agreement in verb sense annotation. Unsupervised WSD has received widespread attention, but has performed poorly, especially on verbs. Recently an unsupervised bilingual EM based algorithm has been proposed, which makes use only of the raw counts of the translations in comparable corpora (Marathi and Hindi). But the performance of this approach is poor on verbs, with accuracy levels at 25-38%. We suggest a modification to this formulation, using context and semantic relatedness of neighboring words. An improvement of 17%-35% in the accuracy of verb WSD is obtained compared to the existing EM based approach. On a general note, the work can be looked upon as contributing to the framework of unsupervised WSD through context aware expectation maximization.
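
A minimal sketch of the context-aware re-weighting idea: a sense probability coming from the bilingual EM step is multiplied by the candidate sense's relatedness to the neighboring words. WordNet path similarity is used here only as a stand-in relatedness measure, and the prior probabilities are illustrative:

```python
# Sketch: re-weight an EM-style sense probability by the candidate sense's
# relatedness to the context words. WordNet path similarity is a stand-in for
# the relatedness measure; the translation priors are illustrative numbers.
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def context_score(sense, context_words):
    """Mean best path-similarity between the sense and senses of context words."""
    scores = []
    for word in context_words:
        sims = [sense.path_similarity(s) or 0.0 for s in wn.synsets(word)]
        if sims:
            scores.append(max(sims))
    return sum(scores) / len(scores) if scores else 0.0

def disambiguate_verb(verb, context_words, translation_prior):
    """translation_prior: {sense name: probability from the bilingual EM step}."""
    best, best_score = None, -1.0
    for sense in wn.synsets(verb, pos=wn.VERB):
        score = translation_prior.get(sense.name(), 1e-3) * context_score(sense, context_words)
        if score > best_score:
            best, best_score = sense, score
    return best

print(disambiguate_verb("draw", ["water", "well", "bucket"], {"draw.v.01": 0.4}))
```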

13 citations


Proceedings Article
01 Oct 2013
TL;DR: A novel technique that uses hierarchical phrase-based statistical machine translation (SMT) for grammar correction, which is similar in spirit to human correction, because the system extracts grammar rules from the corpus and later uses these rules to translate incorrect sentences to correct sentences.
Abstract: We introduce a novel technique that uses hierarchical phrase-based statistical machine translation (SMT) for grammar correction. SMT systems provide a uniform platform for any sequence transformation task. Thus grammar correction can be considered a translation problem from incorrect text to correct text. Over the years, grammar correction data in electronic form (i.e., parallel corpora of incorrect and correct sentences) has increased manifold in quality and quantity, making SMT systems feasible for grammar correction. Firstly, sophisticated translation models like hierarchical phrase-based SMT can handle errors as complicated as reordering or insertion, which were previously difficult to deal with through rule-based systems. Secondly, this SMT based correction technique is similar in spirit to human correction, because the system extracts grammar rules from the corpus and later uses these rules to translate incorrect sentences to correct sentences. We describe how to use Joshua, a hierarchical phrase-based SMT system, for grammar correction. An accuracy of 0.77 (BLEU score) establishes the efficacy of our approach.
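
Since the approach treats correction as translation, the training data is just a parallel corpus of incorrect and correct sentences; a minimal sketch of that data layout and a BLEU-style evaluation follows (NLTK's BLEU is used purely for illustration, not the Joshua toolchain):

```python
# Sketch: grammar correction as translation over a parallel corpus of
# (incorrect, correct) sentence pairs, with system output scored by BLEU.
# NLTK's corpus_bleu stands in for the evaluation; sentences are toy examples.
from nltk.translate.bleu_score import corpus_bleu

parallel = [  # (incorrect "source", correct "target")
    ("he go to school every days", "he goes to school every day"),
    ("she have finish the work yesterday", "she finished the work yesterday"),
]

# Hypothetical corrected output for the source sentences above.
system_output = ["he goes to school every day", "she finished the work yesterday"]

references = [[target.split()] for _, target in parallel]   # one reference each
hypotheses = [hyp.split() for hyp in system_output]
print("BLEU:", corpus_bleu(references, hypotheses))
```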

8 citations


Proceedings Article
01 Aug 2013
TL;DR: This work proposes a new rule-based system which utilizes dependency parse information and a set of conditional rules to ensure agreement of the verb group with its subject, to correct noun-number, determiner and subject-verb agreement errors.
Abstract: We describe our grammar correction system for the CoNLL-2013 shared task. Our system corrects three of the five error types specified for the shared task: noun-number, determiner and subject-verb agreement errors. For noun-number and determiner correction, we apply a classification approach using rich lexical and syntactic features. For subject-verb agreement correction, we propose a new rule-based system which utilizes dependency parse information and a set of conditional rules to ensure agreement of the verb group with its subject. Our system obtained an F-score of 11.03 on the official test set using the M2 evaluation method (the official evaluation method).
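
A minimal sketch of one conditional rule for subject-verb agreement over a dependency parse; spaCy stands in for the parser, and the single "singular subject + base-form verb" rule with a naive "+s" inflection is illustrative only, not the shared-task system's rule set:

```python
# Sketch: one conditional subject-verb agreement rule over a dependency parse.
# spaCy is used as a stand-in parser; the rule and the naive "+s" inflection
# are illustrative only, not the actual rule set of the shared-task system.
import spacy

nlp = spacy.load("en_core_web_sm")

def fix_sv_agreement(sentence):
    doc = nlp(sentence)
    tokens = [t.text for t in doc]
    for tok in doc:
        if tok.dep_ == "nsubj" and tok.head.pos_ == "VERB":
            verb = tok.head
            # singular noun subject governing a non-3rd-person-singular present verb
            if tok.tag_ in ("NN", "NNP") and verb.tag_ == "VBP":
                tokens[verb.i] = verb.lemma_ + "s"   # naive inflection
    return " ".join(tokens)

print(fix_sv_agreement("The committee meet every month."))
```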

Proceedings Article
01 Aug 2013
TL;DR: IndoNet is a linked structure of wordnets of 18 different Indian languages, a Universal Word dictionary and the Suggested Upper Merged Ontology, encoded in the Lexical Markup Framework (LMF); modifications to LMF are proposed to accommodate the Universal Word dictionary and SUMO.
Abstract: We present IndoNet, a multilingual lexical knowledge base for Indian languages. It is a linked structure of wordnets of 18 different Indian languages, Universal Word dictionary and the Suggested Upper Merged Ontology (SUMO). We discuss various benefits of the network and challenges involved in the development. The system is encoded in Lexical Markup Framework (LMF) and we propose modifications in LMF to accommodate Universal Word Dictionary and SUMO. This standardized version of lexical knowledge base of Indian Languages can now easily be linked to similar global resources.

Proceedings Article
01 Oct 2013
TL;DR: The evaluation shows that this engine handles some linguistic phenomena observed in Hindi news headlines, and is able to achieve automatic translation of English news headlines to Hindi while retaining the Hindi news headline styles.
Abstract: News headlines exhibit stylistic peculiarities. The goal of our translation engine ‘Making Headlines in Hindi’ is to achieve automatic translation of English news headlines to Hindi while retaining the Hindi news headline styles. There are two central modules of our engine: the modified translation unit based on Moses and a co-occurrence-based post-processing unit. The modified translation unit provides two machine translation (MT) models: phrase-based and factor-based (both using in-domain data). In addition, a co-occurrence-based post-processing option may be turned on by a user. Our evaluation shows that this engine handles some linguistic phenomena observed in Hindi news headlines.

Proceedings Article
01 Jun 2013
TL;DR: A Universal Networking Language (UNL) based semantic extraction system measures semantic similarity by combining syntactic and word-level similarity measures with UNL-based semantic similarity measures to find similarity scores between sentences.
Abstract: This paper describes the system that was submitted in the *SEM 2013 Semantic Textual Similarity shared task. The task aims to find the similarity score between a pair of sentences. We describe a Universal Networking Language (UNL) based semantic extraction system for measuring semantic similarity. Our approach combines syntactic and word-level similarity measures with UNL based semantic similarity measures for finding similarity scores between sentences.
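
A minimal sketch of combining the three families of scores into a single similarity value; the component scorers and weights here are simplified placeholders for the paper's syntactic, word-level and UNL-based measures:

```python
# Sketch: combine word-level, syntactic and "semantic" similarity scores into
# one sentence-similarity value. The component measures and the weights are
# simplified placeholders, not the paper's actual UNL-based scorers.
def word_overlap(s1, s2):
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def syntactic_similarity(s1, s2):
    n1, n2 = len(s1.split()), len(s2.split())
    return 1.0 - abs(n1 - n2) / max(n1, n2)      # crude length-based proxy

def unl_semantic_similarity(s1, s2):
    return word_overlap(s1, s2)                  # placeholder for UNL graph matching

def combined_similarity(s1, s2, weights=(0.3, 0.2, 0.5)):
    scores = (word_overlap(s1, s2),
              syntactic_similarity(s1, s2),
              unl_semantic_similarity(s1, s2))
    return sum(w * s for w, s in zip(weights, scores))

print(combined_similarity("A man is playing a guitar", "A person plays the guitar"))
```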

Proceedings Article
01 Aug 2013
TL;DR: The TransDoop system gathers translations to create parallel corpora from an online crowd workforce who have familiarity with multiple languages but are not expert translators, using a Map-Reduce-like approach to translation crowdsourcing.
Abstract: A large amount of parallel corpora is required for building Statistical Machine Translation (SMT) systems. We describe the TransDoop system for gathering translations to create parallel corpora from an online crowd workforce who have familiarity with multiple languages but are not expert translators. Our system uses a Map-Reduce-like approach to translation crowdsourcing where sentence translation is decomposed into the following smaller tasks: (a) translation of constituent phrases of the sentence; (b) validation of the quality of the phrase translations; and (c) composition of complete sentence translations from phrase translations. TransDoop incorporates quality control mechanisms and easy-to-use worker user interfaces designed to address issues with translation crowdsourcing. We have evaluated the crowd’s output using the METEOR metric. For a complex domain like judicial proceedings, the higher scores obtained by the map-reduce based approach compared to complete sentence translation establish the efficacy of our work.
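
A minimal sketch of the map/reduce decomposition described above; the phrase splitter, crowd calls and validation step are placeholders for TransDoop's actual worker interfaces and quality-control mechanisms:

```python
# Sketch: Map-Reduce-like translation crowdsourcing -- translate constituent
# phrases (map), validate them, then compose the sentence translation (reduce).
# The crowd and validation functions are placeholders for real worker tasks.
def split_into_phrases(sentence, size=3):
    words = sentence.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def crowd_translate(phrase):
    return f"<hi:{phrase}>"          # placeholder for a crowd worker's translation

def crowd_validate(phrase, translation):
    return True                      # placeholder for the quality-control vote

def translate_sentence(sentence):
    phrases = split_into_phrases(sentence)                        # map
    translations = [crowd_translate(p) for p in phrases]
    validated = [t for p, t in zip(phrases, translations) if crowd_validate(p, t)]
    return " ".join(validated)                                    # reduce/compose

print(translate_sentence("the court adjourned the hearing until next monday"))
```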

Proceedings Article
01 Oct 2013
TL;DR: A structure cognizant framework for pseudo relevance feedback (PRF) is proposed; this structure-aware PRF yields a 2% to 30% improvement in performance (MAP) over vanilla PRF.
Abstract: We propose a structure cognizant framework for pseudo relevance feedback (PRF). This has an application, for example, in selecting expansion terms for general search from subsets such as Wikipedia, wherein documents typically have a minimally fixed set of fields, viz., Title, Body, Infobox and Categories. In existing approaches to PRF based expansion, the weights of expansion terms do not depend on their field(s) of origin. This, we feel, is a weakness of current PRF approaches. We propose a per-field EM formulation for finding the importance of the expansion terms, in line with traditional PRF. However, the final weight of an expansion term is found by weighting these importances based on whether the term belongs to the title, the body, the infobox or the category field(s). In our experiments with four languages, viz., English, Spanish, Finnish and Hindi, we find that this structure-aware PRF yields a 2% to 30% improvement in performance (MAP) over the vanilla PRF. We conduct ablation tests to evaluate the importance of the various fields. As expected, results from these tests emphasize the importance of fields in the order of title, body, categories and infobox.
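
A minimal sketch of the final weighting step: per-field importances of an expansion term (as the per-field EM would estimate them) are combined using field weights. Both the importances and the field weights below are illustrative numbers:

```python
# Sketch: structure-aware PRF -- final weight of an expansion term is the sum
# over fields of (field weight) x (per-field importance). Values are toy numbers.
FIELD_WEIGHTS = {"title": 0.4, "body": 0.3, "categories": 0.2, "infobox": 0.1}

# Per-field importances of candidate expansion terms (as a per-field EM might output).
per_field_importance = {
    "title":      {"cricket": 0.30, "stadium": 0.05},
    "body":       {"cricket": 0.20, "stadium": 0.15},
    "categories": {"cricket": 0.25, "stadium": 0.02},
    "infobox":    {"cricket": 0.10, "stadium": 0.01},
}

def final_weight(term):
    return sum(FIELD_WEIGHTS[field] * per_field_importance[field].get(term, 0.0)
               for field in FIELD_WEIGHTS)

ranked = sorted(["cricket", "stadium"], key=final_weight, reverse=True)
print([(term, round(final_weight(term), 3)) for term in ranked])
```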

Proceedings Article
01 Oct 2013
TL;DR: The motivation of the work comes from the need to create a root or stem identifier for a language that has electronic corpora and some elementary linguistic work in the form of, say, a suffix list.
Abstract: In this paper we take an important step towards completely unsupervised stemming by giving a scheme for semi-supervised stemming. The input to the system is a list of word forms and suffixes. The motivation of the work comes from the need to create a root or stem identifier for a language that has electronic corpora and some elementary linguistic work in the form of, say, a suffix list. The scope of our work is suffix-based morphology (i.e., no prefix or infix morphology). We give two greedy algorithms for stemming. We have performed extensive experimentation with four languages: English, Hindi, Malayalam and Marathi. Accuracy figures ranging from 80% to 88% are reported for all languages.
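
A minimal sketch of greedy suffix-stripping stemming from a word list and a suffix list; the acceptance criterion (keep the stripped stem only if it is itself an attested word form) is one plausible reading of the scheme, not the paper's exact algorithm:

```python
# Sketch: greedy semi-supervised stemming from a word list and a suffix list.
# Strip the longest matching suffix whose remaining stem is an attested word;
# this acceptance criterion is an assumption, not the paper's exact algorithm.
WORDS = {"play", "plays", "played", "playing", "walk", "walks", "walked"}
SUFFIXES = sorted(["s", "es", "ed", "ing"], key=len, reverse=True)   # longest first

def stem(word, lexicon=WORDS, suffixes=SUFFIXES):
    for suffix in suffixes:                        # greedy: longest suffix first
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            candidate = word[: -len(suffix)]
            if candidate in lexicon:               # accept only attested stems
                return candidate
    return word

print([stem(w) for w in ["playing", "walked", "plays", "walk"]])
# -> ['play', 'walk', 'play', 'walk']
```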