Showing papers by "Soumen Chakrabarti published in 2019"

PDF

Open Access

Journal Article•DOI•

Complex Program Induction for Querying Knowledge Bases in the Absence of Gold Programs

[...]

Amrita Saha¹, Ghulam Ahmed Ansari¹, Abhishek Laddha¹, Karthik Sankaranarayanan¹, Soumen Chakrabarti² - Show less +1 more•Institutions (2)

IBM¹, Indian Institute of Technology Bombay²

29 Apr 2019-Transactions of the Association for Computational Linguistics

TL;DR: Complex Imperative Program Induction from Terminal Rewards (CIPITR), an advanced neural programmer that mitigates reward sparsity with auxiliary rewards, and restricts the program space to semantically correct programs using high-level constraints, KB schema, and inferred answer type is presented.

...read moreread less

Abstract: Recent years have seen increasingly complex question-answering on knowledge bases (KBQA) involving logical, quantitative, and comparative reasoning over KB subgraphs. Neural Program Induction (NPI)...

...read moreread less

39 citations

Journal Article•DOI•

Neural architecture for question answering using a knowledge graph and web corpus

[...]

Uma Sawant¹, Saurabh Garg¹, Soumen Chakrabarti¹, Ganesh Ramakrishnan¹•Institutions (1)

Indian Institute of Technology Bombay¹

07 Jan 2019

TL;DR: This work presents AQQUCN, a QA system that gracefully combines KG and corpus evidence, and aggregates signals from KGs and large corpora to directly rank KG entities, rather than commit to one semantic interpretation of the query.

...read moreread less

Abstract: In Web search, entity-seeking queries often trigger a special question answering (QA) system. It may use a parser to interpret the question to a structured query, execute that on a knowledge graph (KG), and return direct entity responses. QA systems based on precise parsing tend to be brittle: minor syntax variations may dramatically change the response. Moreover, KG coverage is patchy. At the other extreme, a large corpus may provide broader coverage, but in an unstructured, unreliable form. We present AQQUCN, a QA system that gracefully combines KG and corpus evidence. AQQUCN accepts a broad spectrum of query syntax, between well-formed questions to short “telegraphic” keyword sequences. In the face of inherent query ambiguities, AQQUCN aggregates signals from KGs and large corpora to directly rank KG entities, rather than commit to one semantic interpretation of the query. AQQUCN models the ideal interpretation as an unobservable or latent variable. Interpretations and candidate entity responses are scored as pairs, by combining signals from multiple convolutional networks that operate collectively on the query, KG and corpus. On four public query workloads, amounting to over 8000 queries with diverse query syntax, we see 5–16% absolute improvement in mean average precision (MAP), compared to the entity ranking performance of recent systems. Our system is also competitive at entity set retrieval, almost doubling F1 scores for challenging short queries.

...read moreread less

39 citations

Proceedings Article•DOI•

A Deep Generative Model for Code Switched Text

[...]

Bidisha Samanta¹, Sharmila Reddy¹, Hussain Jagirdar¹, Niloy Ganguly¹, Soumen Chakrabarti² - Show less +1 more•Institutions (2)

Indian Institute of Technology Kharagpur¹, Indian Institute of Technology Bombay²

01 Jun 2019

TL;DR: VACS is introduced, a novel variational autoencoder architecture specifically tailored to code-switching phenomena, which encodes to and decodes from a two-level hierarchical representation, which models syntactic contextual signals in the lower level, and language switching signals inThe upper layer.

...read moreread less

Abstract: Code-switching, the interleaving of two or more languages within a sentence or discourse is pervasive in multilingual societies. Accurate language models for code-switched text are critical for NLP tasks. State-of-the-art data-intensive neural language models are difficult to train well from scarce language-labeled code-switched text. A potential solution is to use deep generative models to synthesize large volumes of realistic code-switched text. Although generative adversarial networks and variational autoencoders can synthesize plausible monolingual text from continuous latent space, they cannot adequately address code-switched text, owing to their informal style and complex interplay between the constituent languages. We introduce VACS, a novel variational autoencoder architecture specifically tailored to code-switching phenomena. VACS encodes to and decodes from a two-level hierarchical representation, which models syntactic contextual signals in the lower level, and language switching signals in the upper layer. Sampling representations from the prior and decoding them produced well-formed, diverse code-switched sentences. Extensive experiments show that using synthetic code-switched text with natural monolingual data results in significant (33.06%) drop in perplexity.

...read moreread less

26 citations

Proceedings Article•DOI•

Neural Program Induction for KBQA Without Gold Programs or Query Annotations

[...]

Ghulam Ahmed Ansari¹, Amrita Saha¹, Vishwajeet Kumar², Mohan Bhambhani¹, Karthik Sankaranarayanan¹, Soumen Chakrabarti² - Show less +2 more•Institutions (2)

IBM¹, Indian Institute of Technology Bombay²

01 Aug 2019

TL;DR: A noise-resilient NPI model, Stable Sparse Reward based Programmer (SSRP) that evades noise-induced instability through continual retrospection and its comparison with current learning behavior is proposed.

...read moreread less

Abstract: Neural Program Induction (NPI) is a paradigm for decomposing high-level tasks such as complex question-answering over knowledge bases (KBQA) into executable programs by employing neural models. Typically, this involves two key phases: i) inferring input program variables from the high-level task description, and ii) generating the correct program sequence involving these variables. Here we focus on NPI for Complex KBQA with only the final answer as supervision, and not gold programs. This raises major challenges; namely, i) noisy query annotation in the absence of any supervision can lead to catastrophic forgetting while learning, ii) reward becomes extremely sparse owing to the noise. To deal with these, we propose a noise-resilient NPI model, Stable Sparse Reward based Programmer (SSRP) that evades noise-induced instability through continual retrospection and its comparison with current learning behavior. On complex KBQA datasets, SSRP performs at par with hand-crafted rule-based models when provided with gold program input, and in the noisy settings outperforms state-of-the-art models by a significant margin even with a noisier query annotator.

...read moreread less

23 citations

Proceedings Article•DOI•

Improved Sentiment Detection via Label Transfer from Monolingual to Synthetic Code-Switched Text

[...]

Bidisha Samanta¹, Niloy Ganguly¹, Soumen Chakrabarti²•Institutions (2)

Indian Institute of Technology Kharagpur¹, Indian Institute of Technology Bombay²

01 Jun 2019

TL;DR: This work presents an effective technique for synthesizing labeled code-switched text from labeled monolingual text, which is relatively readily available and achieves significant improvements in sentiment labeling accuracy and hatespeech detection.

...read moreread less

Abstract: Multilingual writers and speakers often alternate between two languages in a single discourse. This practice is called “code-switching”. Existing sentiment detection methods are usually trained on sentiment-labeled monolingual text. Manually labeled code-switched text, especially involving minority languages, is extremely rare. Consequently, the best monolingual methods perform relatively poorly on code-switched text. We present an effective technique for synthesizing labeled code-switched text from labeled monolingual text, which is relatively readily available. The idea is to replace carefully selected subtrees of constituency parses of sentences in the resource-rich language with suitable token spans selected from automatic translations to the resource-poor language. By augmenting the scarce labeled code-switched text with plentiful synthetic labeled code-switched text, we achieve significant improvements in sentiment labeling accuracy (1.5%, 5.11% 7.20%) for three different language pairs (English-Hindi, English-Spanish and English-Bengali). The improvement is even significant in hatespeech detection whereby we achieve a 4% improvement using only synthetic code-switched data (6% with data augmentation).

...read moreread less

13 citations

Journal Article•DOI•

Learning Linear Influence Models in Social Networks from Transient Opinion Dynamics

[...]

Abir De¹, Sourangshu Bhattacharya², Parantapa Bhattacharya³, Niloy Ganguly², Soumen Chakrabarti⁴ - Show less +1 more•Institutions (4)

Max Planck Society¹, Indian Institute of Technology Kharagpur², University of Virginia³, Indian Institute of Technology Bombay⁴

11 Nov 2019-ACM Transactions on The Web

TL;DR: This article begins an investigation into a family of novel data-driven influence models that accurately learn and fit realistic observations that are robust to missing observations for several timesteps after an actor has changed its opinion.

...read moreread less

Abstract: Social networks, forums, and social media have emerged as global platforms for forming and shaping opinions on a broad spectrum of topics like politics, sports, and entertainment. Users (also called actors) often update their evolving opinions, influenced through discussions with other users. Theoretical models and their analysis on understanding opinion dynamics in social networks abound in the literature. However, these models are often based on concepts from statistical physics. Their goal is to establish specific phenomena like steady state consensus or bifurcation. Analysis of transient effects is largely avoided. Moreover, many of these studies assume that actors’ opinions are observed globally and synchronously, which is rarely realistic. In this article, we initiate an investigation into a family of novel data-driven influence models that accurately learn and fit realistic observations. We estimate and do not presume edge strengths from observed opinions at nodes. Our influence models are linear but not necessarily positive or row stochastic in nature. As a consequence, unlike the previous studies, they do not depend on system stability or convergence during the observation period. Furthermore, our models take into account a wide variety of data collection scenarios. In particular, they are robust to missing observations for several timesteps after an actor has changed its opinion. In addition, we consider scenarios where opinion observations may be available only for aggregated clusters of nodes—a practical restriction often imposed to ensure privacy. Finally, to provide a conceptually interpretable design of edge influence, we offer a relatively frugal variant of our influence model, where the strength of influence between two connecting nodes depends on the node attributes (demography, personality, expertise, etc.). Such an approach reduces the number of model parameters, reduces overfitting, and offers a tractable and explicable sketch of edge influences in the context of opinion dynamics. With six real-life datasets crawled from Twitter and Reddit, as well as three more datasets collected from in-house experiments (with 102 volunteers), our proposed system gives a significant accuracy boost over four state-of-the-art baselines.

...read moreread less

12 citations

Posted Content•

Scene Graph based Image Retrieval -- A case study on the CLEVR Dataset

[...]

Sahana Ramnath, Amrita Saha, Soumen Chakrabarti, Mitesh M. Khapra

03 Nov 2019-arXiv: Artificial Intelligence

TL;DR: This work proposes a neural-symbolic approach for a one-shot retrieval of images from a large scale catalog, given the caption description, and describes an extension of this pipeline to an iterative retrieval framework, based on interactive questioning and answering.

...read moreread less

Abstract: With the prolification of multimodal interaction in various domains, recently there has been much interest in text based image retrieval in the computer vision community. However most of the state of the art techniques model this problem in a purely neural way, which makes it difficult to incorporate pragmatic strategies in searching a large scale catalog especially when the search requirements are insufficient and the model needs to resort to an interactive retrieval process through multiple iterations of question-answering. Motivated by this, we propose a neural-symbolic approach for a one-shot retrieval of images from a large scale catalog, given the caption description. To facilitate this, we represent the catalog and caption as scene-graphs and model the retrieval task as a learnable graph matching problem, trained end-to-end with a REINFORCE algorithm. Further, we briefly describe an extension of this pipeline to an iterative retrieval framework, based on interactive questioning and answering.

...read moreread less

10 citations

Journal Article•DOI•

GIRNet: Interleaved Multi-Task Recurrent State Sequence Models

[...]

Divam Gupta¹, Tanmoy Chakraborty², Soumen Chakrabarti³•Institutions (3)

Indian Institute of Technology Delhi¹, Indraprastha Institute of Information Technology², Indian Institute of Technology Bombay³

17 Jul 2019

TL;DR: GITNet, a unified position-sensitive multi-task recurrent neural network (RNN) architecture for such applications, is proposed and the superiority of GIRNet is demonstrated using three applications: sentiment classification of code-switched passages, part-of-speech tagging of code -switched text, and target position- sensitive annotation of sentiment in monolingual passages.

...read moreread less

Abstract: In several natural language tasks, labeled sequences are available in separate domains (say, languages), but the goal is to label sequences with mixed domain (such as code-switched text). Or, we may have available models for labeling whole passages (say, with sentiments), which we would like to exploit toward better position-specific label inference (say, target-dependent sentiment annotation). A key characteristic shared across such tasks is that different positions in a primary instance can benefit from different ‘experts’ trained from auxiliary data, but labeled primary instances are scarce, and labeling the best expert for each position entails unacceptable cognitive burden. We propose GIRNet, a unified position-sensitive multi-task recurrent neural network (RNN) architecture for such applications. Auxiliary and primary tasks need not share training instances. Auxiliary RNNs are trained over auxiliary instances. A primary instance is also submitted to each auxiliary RNN, but their state sequences are gated and merged into a novel composite state sequence tailored to the primary inference task. Our approach is in sharp contrast to recent multi-task networks like the crossstitch and sluice networks, which do not control state transfer at such fine granularity. We demonstrate the superiority of GIRNet using three applications: sentiment classification of code-switched passages, part-of-speech tagging of codeswitched text, and target position-sensitive annotation of sentiment in monolingual passages. In all cases, we establish new state-of-the-art performance beyond recent competitive baselines.

...read moreread less

10 citations

Book Chapter•DOI•

Automated early leaderboard generation from comparative tables

[...]

Mayank Singh¹, Rajdeep Sarkar², Atharva Vyas², Pawan Goyal², Animesh Mukherjee², Soumen Chakrabarti³ - Show less +2 more•Institutions (3)

Indian Institute of Technology Gandhinagar¹, Indian Institute of Technology Kharagpur², Indian Institute of Technology Bombay³

14 Apr 2019

TL;DR: In this paper, the authors propose a leaderboard discovery method based on partial orders between papers, where each individual performance edge is extracted from a table with citations to other papers, similar to match outcomes in an incomplete tournament.

...read moreread less

Abstract: A leaderboard is a tabular presentation of performance scores of the best competing techniques that address a specific scientific problem. Manually maintained leaderboards take time to emerge, which induces a latency in performance discovery and meaningful comparison. This can delay dissemination of best practices to non-experts and practitioners. Regarding papers as proxies for techniques, we present a new system to automatically discover and maintain leaderboards in the form of partial orders between papers, based on performance reported therein. In principle, a leaderboard depends on the task, data set, other experimental settings, and the choice of performance metrics. Often there are also tradeoffs between different metrics. Thus, leaderboard discovery is not just a matter of accurately extracting performance numbers and comparing them. In fact, the levels of noise and uncertainty around performance comparisons are so large that reliable traditional extraction is infeasible. We mitigate these challenges by using relatively cleaner, structured parts of the papers, e.g., performance tables. We propose a novel performance improvement graph with papers as nodes, where edges encode noisy performance comparison information extracted from tables. Every individual performance edge is extracted from a table with citations to other papers. These extractions resemble (noisy) outcomes of ‘matches’ in an incomplete tournament. We propose several approaches to rank papers from these noisy ‘match’ outcomes. We show that our ranking scheme can reproduce various manually curated leaderboards very well. Using widely-used lists of state-of-the-art papers in 27 areas of Computer Science, we demonstrate that our system produces very reliable rankings. We also show that commercial scholarly search systems cannot be used for leaderboard discovery, because of their emphasis on citations, which favors classic papers over recent performance breakthroughs. Our code and data sets will be placed in the public domain.

...read moreread less

8 citations

Posted Content•

Improved Sentiment Detection via Label Transfer from Monolingual to Synthetic Code-Switched Text

[...]

Bidisha Samanta¹, Niloy Ganguly¹, Soumen Chakrabarti²•Institutions (2)

Indian Institute of Technology Kharagpur¹, Indian Institute of Technology Bombay²

13 Jun 2019-arXiv: Computation and Language

TL;DR: The authors synthesize labeled code-switched text from labeled monolingual text, which is more readily available, by replacing carefully selected subtrees of constituency parses of sentences in the resource-rich language with suitable token spans selected from automatic translations to the resourcepoor language.

...read moreread less

Abstract: Multilingual writers and speakers often alternate between two languages in a single discourse, a practice called "code-switching". Existing sentiment detection methods are usually trained on sentiment-labeled monolingual text. Manually labeled code-switched text, especially involving minority languages, is extremely rare. Consequently, the best monolingual methods perform relatively poorly on code-switched text. We present an effective technique for synthesizing labeled code-switched text from labeled monolingual text, which is more readily available. The idea is to replace carefully selected subtrees of constituency parses of sentences in the resource-rich language with suitable token spans selected from automatic translations to the resource-poor language. By augmenting scarce human-labeled code-switched text with plentiful synthetic code-switched text, we achieve significant improvements in sentiment labeling accuracy (1.5%, 5.11%, 7.20%) for three different language pairs (English-Hindi, English-Spanish and English-Bengali). We also get significant gains for hate speech detection: 4% improvement using only synthetic text and 6% if augmented with real text.

...read moreread less

6 citations

Book Chapter•DOI•

Multi-task Learning for Target-Dependent Sentiment Classification

[...]

Divam Gupta¹, Kushagra Singh¹, Soumen Chakrabarti², Tanmoy Chakraborty¹•Institutions (2)

Indraprastha Institute of Information Technology¹, Indian Institute of Technology Bombay²

14 Apr 2019

TL;DR: The authors proposed a multi-task target-dependent sentiment classification system that is informed by feature representation learned for the related auxiliary task of passage-level sentiment classification, which outperforms state-of-the-art baselines.

...read moreread less

Abstract: Detecting and aggregating sentiments toward people, organizations, and events expressed in unstructured social media have become critical text mining operations. Early systems detected sentiments over whole passages, whereas more recently, target-specific sentiments have been of greater interest. In this paper, we present MTTDSC, a multi-task target-dependent sentiment classification system that is informed by feature representation learnt for the related auxiliary task of passage-level sentiment classification. The auxiliary task uses a gated recurrent unit (GRU) and pools GRU states, followed by an auxiliary fully-connected layer that outputs passage-level predictions. In the main task, these GRUs contribute auxiliary per-token representations over and above word embeddings. The main task has its own, separate GRUs. The auxiliary and main GRUs send their states to a different fully connected layer, trained for the main task. Extensive experiments using two auxiliary datasets and three benchmark datasets (of which one is new, introduced by us) for the main task demonstrate that MTTDSC outperforms state-of-the-art baselines. Using word-level sensitivity analysis, we present anecdotal evidence that prior systems can make incorrect target-specific predictions because they miss sentiments expressed by words independent of target.

...read moreread less

Posted Content•

Multi-task Learning for Target-dependent Sentiment Classification

[...]

Divam Gupta¹, Kushagra Singh¹, Soumen Chakrabarti², Tanmoy Chakraborty¹•Institutions (2)

Indraprastha Institute of Information Technology¹, Indian Institute of Technology Bombay²

08 Feb 2019-arXiv: Learning

TL;DR: MTTDSC is presented, a multi-task target-dependent sentiment classification system that is informed by feature representation learnt for the related auxiliary task of passage-level sentiment classification, and anecdotal evidence that prior systems can make incorrect target-specific predictions because they miss sentiments expressed by words independent of target is presented.

...read moreread less

Posted Content•

A Deep Generative Model for Code-Switched Text.

[...]

Bidisha Samanta¹, Sharmila Reddy¹, Hussain Jagirdar¹, Niloy Ganguly¹, Soumen Chakrabarti² - Show less +1 more•Institutions (2)

Indian Institute of Technology Kharagpur¹, Indian Institute of Technology Bombay²

21 Jun 2019-arXiv: Computation and Language

TL;DR: This article proposed a variational autoencoder architecture for code-switching, which encodes to and decodes from a two-level hierarchical representation, which models syntactic contextual signals in lower level and language switching signals in the upper layer.

...read moreread less

Posted Content•

Differentially Private Link Prediction With Protected Connections

[...]

Abir De¹, Soumen Chakrabarti¹•Institutions (1)

Indian Institute of Technology Bombay¹

20 Jul 2019-arXiv: Social and Information Networks

TL;DR: A form of differential privacy on graphs is proposed, which models the privacy loss only of those node-pairs which are marked as protected, and DPLP, a learning to rank algorithm, which applies a monotone transform to base scores from a non-private LP system, and then adds noise.

...read moreread less

Abstract: Link prediction (LP) algorithms propose to each node a ranked list of nodes that are currently non-neighbors, as the most likely candidates for future linkage. Owing to increasing concerns about privacy, users (nodes) may prefer to keep some of their connections protected or private. Motivated by this observation, our goal is to design a differentially private LP algorithm, which trades off between privacy of the protected node-pairs and the link prediction accuracy. More specifically, we first propose a form of differential privacy on graphs, which models the privacy loss only of those node-pairs which are marked as protected. Next, we develop DPLP , a learning to rank algorithm, which applies a monotone transform to base scores from a non-private LP system, and then adds noise. DPLP is trained with a privacy induced ranking loss, which optimizes the ranking utility for a given maximum allowed level of privacy leakage of the protected node-pairs. Under a recently-introduced latent node embedding model, we present a formal trade-off between privacy and LP utility. Extensive experiments with several real-life graphs and several LP heuristics show that DPLP can trade off between privacy and predictive performance more effectively than several alternatives.

...read moreread less

Posted Content•

Privacy Preserving Link Prediction with Latent Geometric Network Models

[...]

Abir De, Soumen Chakrabarti

20 Jul 2019-arXiv: Social and Information Networks

TL;DR: D P L P is presented, a generic framework to protect differential privacy for these popular link prediction heuristics under the ranking objective, under a recently-introduced latent node embedding model and the trade-off between privacy and link prediction utility is analyzed.

...read moreread less

Abstract: Link prediction is an important task in social network analysis, with a wide variety of applications ranging from graph search to recommendation. The usual paradigm is to propose to each node a ranked list of nodes that are currently non-neighbors, as the most likely candidates for future linkage. Owing to increasing concerns about privacy, users (nodes) may prefer to keep some or all their connections private. Most link prediction heuristics, such as common neighbor, Jaccard coefficient, and Adamic-Adar, can leak private link information in making predictions. We present D P L P , a generic framework to protect differential privacy for these popular heuristics under the ranking objective. Under a recently-introduced latent node embedding model, we also analyze the trade-off between privacy and link prediction utility. Extensive experiments with eight diverse real-life graphs and several link prediction heuristics show that D P L P can trade off between privacy and predictive performance more effectively than several alternatives.

...read moreread less

Posted Content•

Analysis of Reference and Citation Copying in Evolving Bibliographic Networks

[...]

Pradumn Kumar Pandey¹, Mayank Singh², Pawan Goyal³, Animesh Mukherjee³, Soumen Chakrabarti⁴ - Show less +1 more•Institutions (4)

Indian Institute of Technology Roorkee¹, Indian Institute of Technology Gandhinagar², Indian Institute of Technology Kharagpur³, Indian Institute of Technology Bombay⁴

26 Dec 2019-arXiv: Social and Information Networks

TL;DR: RefOrCite as discussed by the authors is a new model that allows copying of both the references from (i.e., out-neighbors of) as well as the citations to an existing node.

...read moreread less

Abstract: Extensive literature demonstrates how the copying of references (links) can lead to the emergence of various structural properties (e.g., power-law degree distribution and bipartite cores) in bibliographic and other similar directed networks. However, it is also well known that the copying process is incapable of mimicking the number of directed triangles in such networks; neither does it have the power to explain the obsolescence of older papers. In this paper, we propose RefOrCite, a new model that allows for copying of both the references from (i.e., out-neighbors of) as well as the citations to (i.e., in-neighbors of) an existing node. In contrast, the standard copying model (CP) only copies references. While retaining its spirit, RefOrCite differs from the Forest Fire (FF) model in ways that makes RefOrCite amenable to mean-field analysis for degree distribution, triangle count, and densification. Empirically, RefOrCite gives the best overall agreement with observed degree distribution, triangle count, diameter, h-index, and the growth of citations to newer papers.

...read moreread less

Posted Content•

Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings

[...]

Vihari Piratla¹, Sunita Sarawagi¹, Soumen Chakrabarti¹•Institutions (1)

Indian Institute of Technology Bombay¹

05 Jun 2019-arXiv: Computation and Language

TL;DR: It is reached the surprising conclusion that even limited corpus augmentation is more useful than adapting embeddings, which suggests that non-dominant sense information may be irrevocably obliterated from pretrained embeds and cannot be salvaged by adaptation.

...read moreread less

Abstract: Given a small corpus $\mathcal D_T$ pertaining to a limited set of focused topics, our goal is to train embeddings that accurately capture the sense of words in the topic in spite of the limited size of $\mathcal D_T$. These embeddings may be used in various tasks involving $\mathcal D_T$. A popular strategy in limited data settings is to adapt pre-trained embeddings $\mathcal E$ trained on a large corpus. To correct for sense drift, fine-tuning, regularization, projection, and pivoting have been proposed recently. Among these, regularization informed by a word's corpus frequency performed well, but we improve upon it using a new regularizer based on the stability of its cooccurrence with other words. However, a thorough comparison across ten topics, spanning three tasks, with standardized settings of hyper-parameters, reveals that even the best embedding adaptation strategies provide small gains beyond well-tuned baselines, which many earlier comparisons ignored. In a bold departure from adapting pretrained embeddings, we propose using $\mathcal D_T$ to probe, attend to, and borrow fragments from any large, topic-rich source corpus (such as Wikipedia), which need not be the corpus used to pretrain embeddings. This step is made scalable and practical by suitable indexing. We reach the surprising conclusion that even limited corpus augmentation is more useful than adapting embeddings, which suggests that non-dominant sense information may be irrevocably obliterated from pretrained embeddings and cannot be salvaged by adaptation.

...read moreread less

Proceedings Article•DOI•

Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings

[...]

Vihari Piratla¹, Sunita Sarawagi¹, Soumen Chakrabarti¹•Institutions (1)

Indian Institute of Technology Bombay¹

01 Jun 2019

TL;DR: The authors showed that even the best embedding adaptation strategies provide small gains beyond well-tuned baselines, which suggests that non-dominant sense information may be irrevocably obliterated from pretrained embedding and cannot be salvaged by adaptation.

...read moreread less

Abstract: Given a small corpus D_T pertaining to a limited set of focused topics, our goal is to train embeddings that accurately capture the sense of words in the topic in spite of the limited size of D_T. These embeddings may be used in various tasks involving D_T. A popular strategy in limited data settings is to adapt pretrained embeddings E trained on a large corpus. To correct for sense drift, fine-tuning, regularization, projection, and pivoting have been proposed recently. Among these, regularization informed by a word’s corpus frequency performed well, but we improve upon it using a new regularizer based on the stability of its cooccurrence with other words. However, a thorough comparison across ten topics, spanning three tasks, with standardized settings of hyper-parameters, reveals that even the best embedding adaptation strategies provide small gains beyond well-tuned baselines, which many earlier comparisons ignored. In a bold departure from adapting pretrained embeddings, we propose using D_T to probe, attend to, and borrow fragments from any large, topic-rich source corpus (such as Wikipedia), which need not be the corpus used to pretrain embeddings. This step is made scalable and practical by suitable indexing. We reach the surprising conclusion that even limited corpus augmentation is more useful than adapting embeddings, which suggests that non-dominant sense information may be irrevocably obliterated from pretrained embeddings and cannot be salvaged by adaptation.

...read moreread less