
Showing papers on "Phrase published in 2020"


Posted Content
TL;DR: This paper presents a novel Graph Structured Matching Network (GSMN) to learn fine-grained correspondence of structured phrases, and shows that GSMN outperforms state-of-the-art methods on benchmarks.
Abstract: Image-text matching has received growing interest since it bridges vision and language. The key challenge lies in how to learn correspondence between image and text. Existing works learn coarse correspondence based on object co-occurrence statistics, while failing to learn fine-grained phrase correspondence. In this paper, we present a novel Graph Structured Matching Network (GSMN) to learn fine-grained correspondence. The GSMN explicitly models object, relation and attribute as a structured phrase, which not only allows learning the correspondence of object, relation and attribute separately, but also benefits learning the fine-grained correspondence of the structured phrase. This is achieved by node-level matching and structure-level matching. The node-level matching associates each node with its relevant nodes from another modality, where the node can be an object, relation or attribute. The associated nodes then jointly infer fine-grained correspondence by fusing neighborhood associations at structure-level matching. Comprehensive experiments show that GSMN outperforms state-of-the-art methods on benchmarks, with relative Recall@1 improvements of nearly 7% and 2% on Flickr30K and MSCOCO, respectively. Code will be released at: https://github.com/CrossmodalGroup/GSMN.
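As a rough illustration of the two matching stages the abstract describes, the sketch below matches each textual graph node to visual nodes via attention (node-level matching) and then fuses the matches over the textual graph's neighborhoods (structure-level matching). This is not the authors' released code; tensor names and shapes are assumptions.

```python
# Illustrative sketch, assuming precomputed node features for the textual graph
# (t_nodes: [n_t, d]), the visual graph (v_nodes: [n_v, d]), and a textual
# adjacency matrix (t_adj: [n_t, n_t]).
import torch
import torch.nn.functional as F

def node_level_matching(t_nodes, v_nodes):
    """Associate each textual node with its relevant visual nodes."""
    sim = F.normalize(t_nodes, dim=-1) @ F.normalize(v_nodes, dim=-1).T  # [n_t, n_v]
    attn = F.softmax(sim, dim=-1)            # relevance of visual nodes per textual node
    return attn @ v_nodes                    # attended visual representation, [n_t, d]

def structure_level_matching(matched, t_adj):
    """Fuse neighborhood associations along the textual graph structure."""
    deg = t_adj.sum(-1, keepdim=True).clamp(min=1)
    return (t_adj @ matched) / deg           # mean over graph neighbors, [n_t, d]
```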

87 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: GSMN, as mentioned in this paper, explicitly models object, relation and attribute as a structured phrase, which not only allows learning the correspondence of object, relation and attribute separately, but also benefits learning the fine-grained correspondence of the structured phrase.
Abstract: Image-text matching has received growing interest since it bridges vision and language. The key challenge lies in how to learn correspondence between image and text. Existing works learn coarse correspondence based on object co-occurrence statistics, while failing to learn fine-grained phrase correspondence. In this paper, we present a novel Graph Structured Matching Network (GSMN) to learn fine-grained correspondence. The GSMN explicitly models object, relation and attribute as a structured phrase, which not only allows learning the correspondence of object, relation and attribute separately, but also benefits learning the fine-grained correspondence of the structured phrase. This is achieved by node-level matching and structure-level matching. The node-level matching associates each node with its relevant nodes from another modality, where the node can be an object, relation or attribute. The associated nodes then jointly infer fine-grained correspondence by fusing neighborhood associations at structure-level matching. Comprehensive experiments show that GSMN outperforms state-of-the-art methods on benchmarks, with relative Recall@1 improvements of nearly 7% and 2% on Flickr30K and MSCOCO, respectively. Code will be released at: https://github.com/CrossmodalGroup/GSMN.

83 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: A novel method of language attention-based VQA that learns decomposed linguistic representations of questions and utilizes the representations to infer answers for overcoming language priors is presented.
Abstract: Most existing Visual Question Answering (VQA) models overly rely on language priors between questions and answers. In this paper, we present a novel method of language attention-based VQA that learns decomposed linguistic representations of questions and utilizes the representations to infer answers for overcoming language priors. We introduce a modular language attention mechanism to parse a question into three phrase representations: type representation, object representation, and concept representation. We use the type representation to identify the question type and the possible answer set (yes/no or specific concepts such as colors or numbers), and the object representation to focus on the relevant region of an image. The concept representation is verified with the attended region to infer the final answer. The proposed method decouples the language-based concept discovery and vision-based concept verification in the process of answer inference to prevent language priors from dominating the answering process. Experiments on the VQA-CP dataset demonstrate the effectiveness of our method.

65 citations


Posted Content
TL;DR: Quantitative and qualitative results show that, using this framework, a GPT-2 based model trained on a conversation-like Reddit dataset outperforms strong generation baselines.
Abstract: Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process, often resulting in uninteresting responses. Attempts to boost informativeness alone come at the expense of factual accuracy, as attested by pretrained language models' propensity to "hallucinate" facts. While this may be mitigated by access to background knowledge, there is scant guarantee of relevance and informativeness in generated responses. We propose a framework that we call controllable grounded response generation (CGRG), in which lexical control phrases are either provided by a user or automatically extracted by a control phrase predictor from dialogue context and grounding knowledge. Quantitative and qualitative results show that, using this framework, a transformer based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.

63 citations


Journal ArticleDOI
TL;DR: This paper proposes a multi-phrase ranked search over encrypted cloud data, which also supports dynamic update operations, such as adding or deleting files, and uses an inverted index to record the locations of keywords and to judge whether a phrase appears.
Abstract: As cloud computing becomes prevalent, more and more data owners are likely to outsource their data to a cloud server. However, to ensure privacy, the data should be encrypted before outsourcing. Symmetric searchable encryption allows users to retrieve keywords over encrypted data without decrypting the data. Many existing schemes that are based on symmetric searchable encryption only support single keyword search, conjunctive keywords search, multiple keywords search, or single phrase search. Moreover, some schemes, i.e., static schemes, only search one phrase per query request. In this paper, we propose a multi-phrase ranked search over encrypted cloud data, which also supports dynamic update operations, such as adding or deleting files. We use an inverted index to record the locations of keywords and to judge whether a phrase appears. This index can search for keywords efficiently. In order to rank the results and protect the privacy of relevance scores, a relevance score evaluation model is used in the search process on the client side. Also, the special construction of the index makes the scheme dynamic. The data owner can update the cloud data at very little cost. Security analyses and extensive experiments were conducted to demonstrate the safety and efficiency of the proposed scheme.
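The inverted-index idea above can be illustrated with a small plaintext sketch: record, per keyword, the positions at which it occurs in each document, and declare a phrase match when the keywords appear at consecutive positions. Encryption, ranking, and dynamic updates are omitted; all identifiers below are illustrative, not taken from the paper.

```python
# Minimal plaintext sketch of positional inverted-index phrase search.
from collections import defaultdict

def build_index(docs):
    """Map each keyword to the positions where it occurs in each document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """A phrase matches when its keywords occur at consecutive positions."""
    words = phrase.lower().split()
    if not words or words[0] not in index:
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            if all(start + i in index[w].get(doc_id, []) for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

docs = {1: "ranked search over encrypted cloud data", 2: "cloud data search"}
idx = build_index(docs)
print(phrase_search(idx, "cloud data"))  # {1, 2}
```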

60 citations


Posted Content
TL;DR: It is shown that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words.
Abstract: Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a ~10% absolute gain in accuracy over randomly-sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7% to achieve 76.7% accuracy on the Flickr30K Entities benchmark.

59 citations


Proceedings ArticleDOI
27 Jun 2020
TL;DR: This article proposes structure-invariant testing (SIT), a novel metamorphic testing approach for validating machine translation software that generates similar source sentences by substituting one word in a given sentence with semantically similar, syntactically equivalent words.
Abstract: In recent years, machine translation software has increasingly been integrated into our daily lives. People routinely use machine translation for various applications, such as describing symptoms to a foreign doctor and reading political news in a foreign language. However, the complexity and intractability of neural machine translation (NMT) models that power modern machine translation make the robustness of these systems difficult to even assess, much less guarantee. Machine translation systems can return inferior results that lead to misunderstanding, medical misdiagnoses, threats to personal safety, or political conflicts. Despite its apparent importance, validating the robustness of machine translation systems is very difficult and has, therefore, been much under-explored. To tackle this challenge, we introduce structure-invariant testing (SIT), a novel metamorphic testing approach for validating machine translation software. Our key insight is that the translation results of "similar" source sentences should typically exhibit similar sentence structures. Specifically, SIT (1) generates similar source sentences by substituting one word in a given sentence with semantically similar, syntactically equivalent words; (2) represents sentence structure by syntax parse trees (obtained via constituency or dependency parsing); (3) reports sentence pairs whose structures differ quantitatively by more than some threshold. To evaluate SIT, we use it to test Google Translate and Bing Microsoft Translator with 200 source sentences as input, which led to 64 and 70 buggy issues with 69.5% and 70% top-1 accuracy, respectively. The translation errors are diverse, including under-translation, over-translation, incorrect modification, word/phrase mistranslation, and unclear logic.
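A schematic of the SIT workflow described above, under stated assumptions: `translate` and `parse` are user-supplied callables (e.g., a translation API client and a parser returning a sequence of constituent tags), and the distance measure here is a simple sequence-similarity stand-in for the paper's tree comparison.

```python
# Hedged sketch of structure-invariant testing; `translate` and `parse` are
# placeholders to be supplied by the user, not real APIs.
import difflib

def structural_distance(tags_a, tags_b):
    """Simple edit-style distance between two constituent-tag sequences."""
    return 1.0 - difflib.SequenceMatcher(a=tags_a, b=tags_b).ratio()

def sit_check(source, variants, translate, parse, threshold=0.2):
    """Report variant sentences whose translations differ structurally."""
    base = parse(translate(source))
    suspicious = []
    for v in variants:                      # one-word substitutions of `source`
        cand = parse(translate(v))
        if structural_distance(base, cand) > threshold:
            suspicious.append((v, translate(v)))
    return suspicious
```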

51 citations


Book ChapterDOI
23 Aug 2020
TL;DR: Gupta et al., as discussed by the authors, proposed a phrase grounding model that optimizes word-region attention to maximize a lower bound on mutual information between images and caption words, achieving good performance on the Flickr30K Entities benchmark.
Abstract: Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a \(\sim 10\%\) absolute gain in accuracy over randomly-sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of \(5.7\%\) to achieve \(76.7\%\) accuracy on Flickr30K Entities benchmark. Our code and project material will be available at http://tanmaygupta.info/info-ground.
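A minimal sketch of the contrastive objective this abstract describes, assuming precomputed word embeddings for captions and region features for the image (this is not the authors' released implementation at the URL above): attention pools regions per word, and an InfoNCE-style loss prefers the true caption over language-model-generated negative captions.

```python
# Illustrative sketch of word-region attention with an InfoNCE-style lower bound.
import torch
import torch.nn.functional as F

def grounding_score(word_emb, region_feats):
    """Compatibility of a caption with attention-weighted image regions."""
    attn = F.softmax(word_emb @ region_feats.T, dim=-1)   # [n_words, n_regions]
    attended = attn @ region_feats                         # [n_words, d]
    return F.cosine_similarity(word_emb, attended, dim=-1).sum()

def info_nce_loss(pos_caption, neg_captions, region_feats):
    """True caption should outscore the constructed negative captions."""
    scores = torch.stack([grounding_score(pos_caption, region_feats)] +
                         [grounding_score(c, region_feats) for c in neg_captions])
    return F.cross_entropy(scores.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```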

49 citations


Journal ArticleDOI
TL;DR: This work proposes a new context-specific heterogeneous graph convolutional network (CsHGCN) framework that combines all context representations into a complete context, reflecting the information in documents more comprehensively.
Abstract: Sentiment analysis has attracted considerable attention in recent years. In particular, implicit sentiment analysis is a more challenging problem due to the lack of sentiment words. It requires us to combine contextual information and precisely understand the emotion changing process. Graph convolutional network (GCN) techniques have been widely applied for sentiment analysis since they are capable of learning from complex structures and preserving global information. However, these models either focus only on extracting features from a single sentence and ignore the contextual semantic background, or consider only the textual information and overlook phrase dependency when constructing the graph. To address these problems, we propose a new context-specific heterogeneous graph convolutional network (CsHGCN) framework that combines all context representations. It builds a complete context that reflects the information in documents more comprehensively, and a dependency structure that captures token-to-token semantics more accurately. The experimental results on a Chinese implicit sentiment dataset show that our proposed model can effectively identify the target sentiment of sentences, and visualization of the attention layers further demonstrates that the model selects qualitatively informative tokens and sentences.

43 citations


Posted Content
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge
TL;DR: This work creates PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator, and decouples the language and action data by annotating action phrase spans in How-To instructions and synthesizing grounded descriptions of actions for mobile user interfaces.
Abstract: We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in How-To instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PixelHelp.

39 citations


Journal ArticleDOI
TL;DR: A new second-level Mel frequency cepstral coefficient-based feature named MFCC-2, which handles the large and uneven dimensionality of MFCC, has been used to characterize languages including English, Bangla and Hindi.
Abstract: Developing an automatic speech recognition system for multilingual countries like India is a challenging task due to the fact that people are inured to using multiple languages while talking. This makes language identification from speech an important and essential task prior to recognition of the same. In this paper a system is proposed for language identification from multilingual speech signals. A new second-level Mel frequency cepstral coefficient-based feature named MFCC-2, which handles the large and uneven dimensionality of MFCC, has been used to characterize languages including English, Bangla and Hindi. The system has been tested with recordings of as many as 12,000 utterances of numerals and 41,884 clips extracted from YouTube videos, considering background music, data from multiple environments, avoidance of noise suppression and use of keywords from different languages in a single phrase. The highest and average accuracies (for the Top-3 classifiers from a pool of nine classifiers) of 98.09% and 95.54%, respectively, were achieved for YouTube data.

Proceedings ArticleDOI
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge
07 May 2020
TL;DR: PixelHelp, as mentioned in this paper, is a corpus that pairs English instructions with actions performed by people on a mobile UI emulator; training data is scaled by annotating action phrase spans in How-To instructions and synthesizing grounded descriptions of actions for mobile user interfaces.
Abstract: We present a new problem: grounding natural language instructions to mobile user interface actions, and contribute three new datasets for it. For full task evaluation, we create PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in How-To instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PixelHelp.
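To make the content-plus-position grounding step concrete, here is a hedged sketch (not the paper's architecture; dimensions and the simple dot-product scorer are assumptions) of scoring UI objects against an extracted action phrase.

```python
# Illustrative sketch of grounding an action phrase to a UI object by combining
# object content features with screen-position features.
import torch
import torch.nn as nn

class UIObjectGrounder(nn.Module):
    def __init__(self, d_text=128, d_pos=4, d_model=128):
        super().__init__()
        self.obj_proj = nn.Linear(d_text + d_pos, d_model)   # content + (x, y, w, h)
        self.phrase_proj = nn.Linear(d_text, d_model)

    def forward(self, obj_text_feats, obj_positions, phrase_feat):
        objs = self.obj_proj(torch.cat([obj_text_feats, obj_positions], dim=-1))
        phrase = self.phrase_proj(phrase_feat)                # [d_model]
        scores = objs @ phrase                                # one score per UI object
        return scores.softmax(dim=-1)                         # distribution over objects
```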

Journal ArticleDOI
TL;DR: This article tracked the development of phrasal vocabulary in essays produced at two different points in time and found that higher proficiency and greater exposure to the L2 did not result in more idiomatic and target-like output, and may, in fact, result in greater reliance on low-frequency combinations whose constituent words are non-associated or mutually attracted.
Abstract: In the present study, we sought to advance the field of learner corpus research by tracking the development of phrasal vocabulary in essays produced at two different points in time. To this aim, we employed a large pool of second language (L2) learners (N = 175) from three proficiency levels (beginner, elementary, and intermediate) and focused on an underrepresented L2 (Italian). Employing mixed-effects models, a flexible and powerful tool for corpus data analysis, we analyzed learner combinations in terms of five different measures: phrase frequency, mutual information, lexical gravity, delta P forward, and delta P backward. Our findings suggest a complex picture, in which higher proficiency and greater exposure to the L2 do not result in more idiomatic and target-like output, and may, in fact, result in greater reliance on low-frequency combinations whose constituent words are non-associated or mutually attracted.
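For reference, two of the association measures named above have standard corpus-linguistic definitions; the formulas below follow the usual formulations for a word pair (w1, w2) with joint frequency O11 in a corpus of N bigrams, and the paper's exact operationalization may differ.

```latex
% Standard definitions of mutual information and delta P (directional association).
\[
\mathrm{MI}(w_1, w_2) = \log_2 \frac{O_{11}\cdot N}{f(w_1)\, f(w_2)},
\qquad
\Delta P_{\text{forward}} = P(w_2 \mid w_1) - P(w_2 \mid \neg w_1),
\qquad
\Delta P_{\text{backward}} = P(w_1 \mid w_2) - P(w_1 \mid \neg w_2).
\]
```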

Proceedings ArticleDOI
01 Jul 2020
TL;DR: This study shows that adversarial examples also exist in dependency parsing; it proposes two approaches to study where and how parsers make mistakes by searching over perturbations to existing texts at the sentence and phrase levels, and designs algorithms to construct such examples in both black-box and white-box settings.
Abstract: Despite achieving prominent performance on many important tasks, it has been reported that neural networks are vulnerable to adversarial examples. Previous studies along this line mainly focused on semantic tasks such as sentiment analysis, question answering and reading comprehension. In this study, we show that adversarial examples also exist in dependency parsing: we propose two approaches to study where and how parsers make mistakes by searching over perturbations to existing texts at sentence and phrase levels, and design algorithms to construct such examples in both the black-box and white-box settings. Our experiments with one of the state-of-the-art parsers on the English Penn Treebank (PTB) show that up to 77% of input examples admit adversarial perturbations, and we also show that the robustness of parsing models can be improved by crafting high-quality adversaries and including them in the training stage, while suffering little to no performance drop on the clean input data.

Book ChapterDOI
23 Aug 2020
TL;DR: The proposed multimodal phrase+click approach achieves new state-of-the-art performance on interactive segmentation by employing phrase expressions as another interaction input to infer the attributes of the target object.
Abstract: Existing interactive object segmentation methods mainly take spatial interactions such as bounding boxes or clicks as input. However, these interactions do not contain information about explicit attributes of the target-of-interest and thus cannot quickly specify what the selected object exactly is, especially when there are diverse scales of candidate objects or the target-of-interest contains multiple objects. Therefore, excessive user interactions are often required to reach desirable results. On the other hand, in existing approaches, attribute information of objects is often not well utilized in interactive segmentation. We propose to employ phrase expressions as another interaction input to infer the attributes of the target object. In this way, we can 1) leverage spatial clicks to locate the target object and 2) utilize semantic phrases to qualify the attributes of the target object. Specifically, the phrase expressions focus on “what” the target object is and the spatial clicks are in charge of “where” the target object is, which together help to accurately segment the target-of-interest with a smaller number of interactions. Moreover, the proposed approach is flexible in terms of interaction modes and can efficiently handle complex scenarios by leveraging the strengths of each type of input. Our multi-modal phrase+click approach achieves new state-of-the-art performance on interactive segmentation. To the best of our knowledge, this is the first work to leverage both clicks and phrases for interactive segmentation.

Journal ArticleDOI
TL;DR: Experiments show that the introduced Phrase2Vec outperforms state-of-the-art phrase embedding models in the similarity task and the analogical reasoning task on the Enwiki, DBLP, and Yelp datasets.

Posted Content
TL;DR: This work leverages a generic object detector at training time, and proposes a contrastive learning framework that accounts for both region-phrase and image-sentence matching, which achieves state-of-the-art results on visual phrase grounding, surpassing previous methods that require expensive object detectors at test time.
Abstract: Weakly supervised phrase grounding aims at learning region-phrase correspondences using only image-sentence pairs. A major challenge thus lies in the missing links between image regions and sentence phrases during training. To address this challenge, we leverage a generic object detector at training time, and propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching. Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed. Importantly, our region-phrase score function is learned by distilling from soft matching scores between the detected object class names and candidate phrases within an image-sentence pair, while the image-sentence score function is supervised by ground-truth image-sentence pairs. The design of such score functions removes the need of object detection at test time, thereby significantly reducing the inference cost. Without bells and whistles, our approach achieves state-of-the-art results on the task of visual phrase grounding, surpassing previous methods that require expensive object detectors at test time.

Proceedings ArticleDOI
01 Nov 2020
TL;DR: This paper proposes a joint model of syntactic and semantic parsing on both span and dependency representations, which incorporates syntactic information effectively in the encoder of the neural network and benefits from the two representation formalisms in a uniform way.
Abstract: Both syntactic and semantic structures are key linguistic contextual clues, and parsing the latter has been well shown to benefit from parsing the former. However, few works have ever made an attempt to let semantic parsing help syntactic parsing. As linguistic representation formalisms, both syntax and semantics may be represented in either span (constituent/phrase) or dependency form, and joint learning on both has also seldom been explored. In this paper, we propose a novel joint model of syntactic and semantic parsing on both span and dependency representations, which incorporates syntactic information effectively in the encoder of the neural network and benefits from the two representation formalisms in a uniform way. The experiments show that semantics and syntax can benefit each other by optimizing joint objectives. Our single model achieves new state-of-the-art or competitive results on both span and dependency semantic parsing on PropBank benchmarks and both dependency and constituent syntactic parsing on the Penn Treebank.

Proceedings ArticleDOI
TL;DR: The approach is based on the idea that summarization is important for retrieval; it adopts a summarization-based model called encoded summarization, which encodes a given document into a continuous vector space that embeds the summary properties of the document.
Abstract: We present our method for tackling the legal case retrieval task of the Competition on Legal Information Extraction/Entailment 2019. Our approach is based on the idea that summarization is important for retrieval. On one hand, we adopt a summarization-based model called encoded summarization, which encodes a given document into a continuous vector space that embeds the summary properties of the document. We utilize the COLIEE 2018 resources, on which we train the document representation model. On the other hand, we extract lexical features on different parts of a given query and its candidates. We observe that by comparing different parts of the query and its candidates, we can achieve better performance. Furthermore, combining the lexical features with the latent features from the summarization-based method achieves even better performance. We have achieved the state-of-the-art result for the task on the benchmark of the competition.

Posted Content
TL;DR: It is found that phrase representation in state-of-the-art pre-trained transformers relies heavily on word content, with little evidence of nuanced composition.
Abstract: Deep transformer models have pushed performance on NLP tasks to new limits, suggesting sophisticated treatment of complex linguistic inputs, such as phrases. However, we have limited understanding of how these models handle representation of phrases, and whether this reflects sophisticated composition of phrase meaning like that done by humans. In this paper, we present systematic analysis of phrasal representations in state-of-the-art pre-trained transformers. We use tests leveraging human judgments of phrase similarity and meaning shift, and compare results before and after control of word overlap, to tease apart lexical effects versus composition effects. We find that phrase representation in these models relies heavily on word content, with little evidence of nuanced composition. We also identify variations in phrase representation quality across models, layers, and representation types, and make corresponding recommendations for usage of representations from these models.

Posted Content
TL;DR: This work considers the problem of segmenting image regions given a natural language phrase, and studies it on a novel dataset of 77,262 images and 345,486 phrase-region pairs, collected on top of the Visual Genome dataset.
Abstract: We consider the problem of segmenting image regions given a natural language phrase, and study it on a novel dataset of 77,262 images and 345,486 phrase-region pairs. Our dataset is collected on top of the Visual Genome dataset and uses the existing annotations to generate a challenging set of referring phrases for which the corresponding regions are manually annotated. Phrases in our dataset correspond to multiple regions and describe a large number of object and stuff categories as well as their attributes such as color, shape, parts, and relationships with other entities in the image. Our experiments show that the scale and diversity of concepts in our dataset poses significant challenges to the existing state-of-the-art. We systematically handle the long-tail nature of these concepts and present a modular approach to combine category, attribute, and relationship cues that outperforms existing approaches.

Proceedings ArticleDOI
08 Oct 2020
TL;DR: The authors found that phrase representation in pre-trained transformers relies heavily on word content, with little evidence of nuanced composition, and identified variations in phrase representation quality across models, layers, and representation types, and made corresponding recommendations for usage of representations from these models.
Abstract: Deep transformer models have pushed performance on NLP tasks to new limits, suggesting sophisticated treatment of complex linguistic inputs, such as phrases. However, we have limited understanding of how these models handle representation of phrases, and whether this reflects sophisticated composition of phrase meaning like that done by humans. In this paper, we present systematic analysis of phrasal representations in state-of-the-art pre-trained transformers. We use tests leveraging human judgments of phrase similarity and meaning shift, and compare results before and after control of word overlap, to tease apart lexical effects versus composition effects. We find that phrase representation in these models relies heavily on word content, with little evidence of nuanced composition. We also identify variations in phrase representation quality across models, layers, and representation types, and make corresponding recommendations for usage of representations from these models.
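An illustrative probe in the spirit of the analysis described above (model choice, layer, and mean pooling are assumptions, not the paper's exact setup): pool a phrase's token representations from a pre-trained transformer and compare phrase pairs by cosine similarity, e.g. a paraphrase pair versus a high-word-overlap pair.

```python
# Hedged sketch of a phrase-representation probe using Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def phrase_embedding(phrase, layer=-1):
    """Mean-pool the hidden states of a phrase at a chosen layer."""
    inputs = tok(phrase, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    return hidden[0, 1:-1].mean(dim=0)        # drop [CLS]/[SEP], average tokens

def phrase_similarity(a, b, layer=-1):
    ea, eb = phrase_embedding(a, layer), phrase_embedding(b, layer)
    return torch.cosine_similarity(ea, eb, dim=0).item()

print(phrase_similarity("heavy rain", "torrential downpour"))
print(phrase_similarity("heavy rain", "heavy metal"))  # high word overlap, different meaning
```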

Posted Content
TL;DR: The pre-trained deep bidirectional network BERT is used to build a model for named entity recognition in Persian, which achieved second place in the NSURL-2019 Task 7 competition on NER for the Persian language.
Abstract: Named entity recognition is a natural language processing task to recognize and extract spans of text associated with named entities and classify them into semantic categories. Google BERT is a deep bidirectional language model, pre-trained on large corpora, that can be fine-tuned to solve many NLP tasks such as question answering, named entity recognition, and part-of-speech tagging. In this paper, we use the pre-trained deep bidirectional network, BERT, to build a model for named entity recognition in Persian. We also compare the results of our model with the previous state-of-the-art results achieved on Persian NER. Our evaluation metric is the CoNLL 2003 score at the word and phrase levels. This model achieved second place in the NSURL-2019 Task 7 competition, which addressed NER for the Persian language. Our results in this competition are 83.5 and 88.4 F1 (CoNLL score) in phrase-level and word-level evaluation, respectively.
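A generic sketch of the fine-tuning setup this abstract describes: a pre-trained BERT encoder with a token-classification head. The checkpoint name and label set below are placeholders (the paper uses a Persian-capable BERT and its own tag set), shown only to illustrate the architecture.

```python
# Hedged sketch of BERT-based NER fine-tuning; names are placeholders.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # example tag set
model_name = "bert-base-multilingual-cased"                            # placeholder checkpoint

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# Each sub-word token receives a label logit from the classification head;
# training minimizes cross-entropy against the gold BIO tags (training loop not shown).
```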

Proceedings ArticleDOI
12 Oct 2020
TL;DR: This paper formulates visual grounding as a graph matching problem to find node correspondences between a visual scene graph and a language scene graph, and learns unified contextual node representations of the two graphs by using a cross-modal graph convolutional network to reduce their discrepancy.
Abstract: Visual Grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem to find node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing structure layouts of the sentence and image, respectively. We learn unified contextual node representations of the two graphs by using a cross-modal graph convolutional network to reduce their discrepancy. The graph matching is thus relaxed as a linear assignment problem because the learned node representations characterize both node information and structure information. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
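The relaxation to a linear assignment problem mentioned above can be illustrated in a few lines, assuming the cross-modal GCN has already produced node representations for the language and visual graphs (the array names below are hypothetical).

```python
# Minimal sketch: build a similarity matrix from learned node representations
# and solve for the optimal one-to-one correspondence.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_nodes(lang_nodes, vis_nodes):
    """Return (language node, visual node) index pairs maximizing total similarity."""
    lang = lang_nodes / np.linalg.norm(lang_nodes, axis=1, keepdims=True)
    vis = vis_nodes / np.linalg.norm(vis_nodes, axis=1, keepdims=True)
    sim = lang @ vis.T                            # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)      # negate to maximize similarity
    return list(zip(rows.tolist(), cols.tolist()))
```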

Proceedings ArticleDOI
01 Jul 2020
TL;DR: This paper aims to improve the quality of each phrase embedding by augmenting it with a contextualized sparse representation (Sparc) and shows 4%+ improvement in CuratedTREC and SQuAD-Open.
Abstract: Open-domain question answering can be formulated as a phrase retrieval problem, in which we can expect huge scalability and speed benefit but often suffer from low accuracy due to the limitation of existing phrase representation models. In this paper, we aim to improve the quality of each phrase embedding by augmenting it with a contextualized sparse representation (Sparc). Unlike previous sparse vectors that are term-frequency-based (e.g., tf-idf) or directly learned (only few thousand dimensions), we leverage rectified self-attention to indirectly learn sparse vectors in n-gram vocabulary space. By augmenting the previous phrase retrieval model (Seo et al., 2019) with Sparc, we show 4%+ improvement in CuratedTREC and SQuAD-Open. Our CuratedTREC score is even better than the best known retrieve & read model with at least 45x faster inference speed.

Proceedings ArticleDOI
08 Nov 2020
TL;DR: PreMA, an API method recommendation approach based on explicit matching of functionality verb phrases in functionality descriptions and user queries, is proposed; it can accurately recognize the functionality categories and phrase patterns of functionality description sentences and helps participants complete their tasks more accurately and with fewer retries.
Abstract: Due to the lexical gap between functionality descriptions and user queries, documentation-based API retrieval often produces poor results. Verb phrases and their phrase patterns are essential in both describing API functionalities and interpreting user queries. Thus we hypothesize that API retrieval can be facilitated by explicitly recognizing and matching between the fine-grained structures of functionality descriptions and user queries. To verify this hypothesis, we conducted a large-scale empirical study on the functionality descriptions of 14,733 JDK and Android API methods. We identified 356 different functionality verbs from the descriptions, which were grouped into 87 functionality categories, and we extracted 523 phrase patterns from the verb phrases of the descriptions. Building on these findings, we propose an API method recommendation approach based on explicit matching of functionality verb phrases in functionality descriptions and user queries, called PreMA. Our evaluation shows that PreMA can accurately recognize the functionality categories (92.8%) and phrase patterns (90.4%) of functionality description sentences; and when used for API retrieval tasks, PreMA can help participants complete their tasks more accurately and with fewer retries compared to a baseline approach.

Posted Content
TL;DR: The Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two visual search with modifying phrase datasets, Fashion IQ and CSS, and performs competitively on a dataset with only single-word modifications, Fashion200k.
Abstract: Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications such as e-commerce. Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two visual search with modifying phrase datasets, Fashion IQ and CSS, and performs competitively on a dataset with only single-word modifications, Fashion200k. We also introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs, and we show that our approach without modification outperforms strong baselines. To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of the surprising phenomenon of words avoiding "attending" to the image region they refer to.
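A rough sketch of the modality-agnostic fusion idea described above, under stated assumptions: token dimensions, a shared off-the-shelf transformer encoder, and mean pooling are illustrative choices, not the paper's exact design.

```python
# Illustrative sketch: image region tokens and text tokens are concatenated and
# processed by a shared transformer encoder, then pooled into a query embedding.
import torch
import torch.nn as nn

class ModalityAgnosticFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, image_tokens, text_tokens):
        tokens = torch.cat([image_tokens, text_tokens], dim=1)  # [B, n_img + n_txt, d]
        fused = self.encoder(tokens)
        return fused.mean(dim=1)                                # pooled query embedding
```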

Journal ArticleDOI
20 Nov 2020
TL;DR: It is argued that a core part of what is traditionally referred to as ‘information structure’ can be deconstructed into genuine morphosyntactic features that are visible to syntactic operations, contribute to discourse-related expressive meanings, and just happen to be spelled out prosodically in Standard American and British English.
Abstract: The paper argues that a core part of what is traditionally referred to as ‘information structure’ can be deconstructed into genuine morphosyntactic features that are visible to syntactic operations, contribute to discourse-related expressive meanings, and just happen to be spelled out prosodically in Standard American and British English. We motivate two features, [FoC] and [G], and we track the fate of those features at and beyond the syntax-semantics and the syntax-phonology interfaces. [FoC] and [G] are responsible for two distinct obligatory strategies for establishing discourse coherence. A [G]-marked constituent signals a match with a discourse referent, whereas a [FoC]-marked constituent invokes alternatives and thereby signals a contrast. In Standard American and British English [FoC] aims for highest prosodic prominence in the intonational phrase, whereas [G] lacks phrase-level prosodic properties. There is no grammatical marking of newness: The apparent prosodic effects of newness are the result of default prosody.

Journal ArticleDOI
TL;DR: As argued in this paper, in the complex operations of the Indian media economy the phrase "media markets" requires careful consideration as an analytical concept; as a noun, it is typically used to refer to a...
Abstract: In the complex operations of the Indian media economy, the phrase ‘media markets’ requires careful consideration as an analytical concept. As a noun, ‘media markets’ is typically used to refer to a...

Book ChapterDOI
23 Aug 2020
TL;DR: A linguistic structure guided propagation network for one-stage phrase grounding that explicitly explores the linguistic structure of the sentence and performs relational propagation among noun phrases under the guidance of the linguistic relations between them.
Abstract: Phrase level visual grounding aims to locate in an image the corresponding visual regions referred to by multiple noun phrases in a given sentence. Its challenge comes not only from large variations in visual contents and unrestricted phrase descriptions but also from unambiguous referrals derived from phrase relational reasoning. In this paper, we propose a linguistic structure guided propagation network for one-stage phrase grounding. It explicitly explores the linguistic structure of the sentence and performs relational propagation among noun phrases under the guidance of the linguistic relations between them. Specifically, we first construct a linguistic graph parsed from the sentence and then capture multimodal feature maps for all the phrasal nodes independently. The node features are then propagated over the edges with a tailor-designed relational propagation module and ultimately integrated for final prediction. Experiments on Flickr30K Entities dataset show that our model outperforms state-of-the-art methods and demonstrate the effectiveness of propagating among phrases with linguistic relations (Source code will be available at https://github.com/sibeiyang/lspn.).