
Showing papers on "Perplexity published in 2021"


Journal ArticleDOI
TL;DR: The Routing Transformer as discussed by the authors proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest by combining self-attention with a sparse routing module based on online k-means.
Abstract: Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to O(n^{1.5}d) from O(n^2d) for sequence length n and hidden dimension d. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3 perplexity) as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention layers. Additionally, we set a new state-of-the-art on the newly released PG-19 data-set, obtaining a test perplexity of 33.2 with a 22 layer Routing Transformer model trained on sequences of length 8192.
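
A minimal NumPy sketch of the routing idea described above: cluster queries and keys with k-means and let each query attend only to keys in the same cluster. This is an illustration, not the authors' implementation; it uses a single head, hard cluster assignments, and no causal masking.

import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain k-means over token vectors; returns a cluster id per token."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = x[assign == c].mean(axis=0)
    return assign

def routing_attention(q, k, v, n_clusters):
    """Each query attends only to keys routed to the same cluster."""
    assign = kmeans(np.concatenate([q, k]), n_clusters)
    qa, ka = assign[:len(q)], assign[len(q):]
    out = np.zeros_like(v)
    for c in range(n_clusters):
        qi, ki = np.where(qa == c)[0], np.where(ka == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        scores = q[qi] @ k[ki].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[qi] = w @ v[ki]
    return out

# With n_clusters near sqrt(n), each cluster holds roughly sqrt(n) tokens,
# which is where the O(n^1.5 d) cost quoted in the abstract comes from.
n, d = 64, 16
q, k, v = (np.random.randn(n, d) for _ in range(3))
print(routing_attention(q, k, v, n_clusters=8).shape)  # (64, 16)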

176 citations


Journal ArticleDOI
TL;DR: This paper presents the modeling and performance of a deep-learning-based conversational agent (chatbot) built with the TensorFlow software library, in particular its Neural Machine Translation (NMT) model.

55 citations


Proceedings ArticleDOI
17 Mar 2021
TL;DR: A new way of utilizing the powerful transfer learning ability of a language model via a perplexity score is proposed and can already outperform the Major Class baseline by more than an absolute 10% on the F1-Macro metric across multiple datasets.
Abstract: Few-shot learning has drawn researchers’ attention to overcome the problem of data scarcity. Recently, large pre-trained language models have shown great performance in few-shot learning for various downstream tasks, such as question answering and machine translation. Nevertheless, little exploration has been made to achieve few-shot learning for the fact-checking task. However, fact-checking is an important problem, especially when the amount of information online is growing exponentially every day. In this paper, we propose a new way of utilizing the powerful transfer learning ability of a language model via a perplexity score. The most notable strength of our methodology lies in its capability in few-shot learning. With only two training samples, our methodology can already outperform the Major Class baseline by more than an absolute 10% on the F1-Macro metric across multiple datasets. Through experiments, we empirically verify the plausibility of the rather surprising usage of the perplexity score in the context of fact-checking and highlight the strength of our few-shot methodology by comparing it to strong fine-tuning-based baseline models. Moreover, we construct and publicly release two new fact-checking datasets related to COVID-19.
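
As a sketch of the core mechanism, the snippet below scores text with GPT-2 perplexity via the Hugging Face transformers library. The paper's exact scoring formulation, evidence handling, and threshold selection are not detailed in the abstract, so the classification rule suggested in the comment is a hypothetical illustration.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Hypothetical few-shot rule: choose a perplexity threshold that separates
# the handful of labeled training claims, then classify unseen claims by
# whether their (evidence + claim) perplexity falls below it.
evidence = "The COVID-19 vaccine was authorized for emergency use in 2020."
claim = "COVID-19 vaccines were first authorized in 2020."
print(perplexity(evidence + " " + claim))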

40 citations


Journal ArticleDOI
TL;DR: This paper first learns embedding representations of each word and each entity by employing a large-scale external corpus and a large, manually edited knowledge graph, respectively, and proposes an unsupervised model named VAETM to infer the latent representation of topic distributions.
Abstract: Traditional topic models are widely used for semantic discovery from long texts. However, they usually fail to mine high-quality topics from short texts (e.g. tweets) due to the sparsity of features and the lack of word co-occurrence patterns. In this paper, we propose a Variational Auto-Encoder Topic Model (VAETM for short) that combines word vector representation and entity vector representation to address the above limitations. Specifically, we first learn embedding representations of each word and each entity by employing a large-scale external corpus and a large, manually edited knowledge graph, respectively. We then integrate the embedding representations into the variational auto-encoder framework and propose an unsupervised model named VAETM to infer the latent representation of topic distributions. To further boost VAETM, we propose an improved supervised variant (SVAETM for short) that considers label information in the training set to supervise the inference of latent representations of topic distributions and the generation of topics. Last, we propose KL-divergence-based inference algorithms to infer approximate posterior distributions for the two models. Extensive experiments on three common short text datasets demonstrate that our proposed VAETM and SVAETM outperform various state-of-the-art models in terms of perplexity, NPMI, and accuracy.
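
For intuition, here is a minimal bag-of-words VAE topic model in PyTorch. It omits the word and entity embedding integration that distinguishes VAETM, so treat it as a generic sketch of the variational auto-encoder framework the paper builds on.

import torch, torch.nn as nn, torch.nn.functional as F

class MiniVAETopicModel(nn.Module):
    """Bag-of-words VAE: encoder -> Gaussian latent -> softmax decoder over vocab."""
    def __init__(self, vocab_size, n_topics=20, hidden=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_topics)
        self.logvar = nn.Linear(hidden, n_topics)
        self.dec = nn.Linear(n_topics, vocab_size)   # rows ~ topic-word weights

    def forward(self, bow):
        h = self.enc(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
        theta = F.softmax(z, dim=-1)                            # topic proportions
        log_probs = F.log_softmax(self.dec(theta), dim=-1)
        recon = -(bow * log_probs).sum(-1).mean()               # multinomial NLL
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon + kl                                       # ELBO (negated)

model = MiniVAETopicModel(vocab_size=5000)
loss = model(torch.rand(8, 5000))   # fake batch of BoW vectors
loss.backward()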

34 citations


Proceedings ArticleDOI
06 Jun 2021
TL;DR: This article proposed a variational inference approach to estimate the latent parameter posterior distributions associated with different parts of the Transformer model architecture including multi-head self-attention, feed forward and embedding layers.
Abstract: State-of-the-art neural language models (LMs) represented by Transformers are highly complex. Their use of fixed, deterministic parameter estimates fails to account for model uncertainty and leads to over-fitting and poor generalization when given limited training data. In order to address these issues, this paper proposes a full Bayesian learning framework for Transformer LM estimation. Efficient variational inference based approaches are used to estimate the latent parameter posterior distributions associated with different parts of the Transformer model architecture, including multi-head self-attention, feed forward and embedding layers. Statistically significant word error rate (WER) reductions of up to 0.5% absolute (3.18% relative) and consistent perplexity gains were obtained over the baseline Transformer LMs on state-of-the-art LF-MMI factored TDNN systems with i-Vector speaker adaptation trained on the Switchboard corpus. Performance improvements were also obtained on a cross-domain LM adaptation task requiring porting a Transformer LM trained on the Switchboard and Fisher data to a low-resource DementiaBank elderly speech corpus.
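
A compact sketch of the variational idea for a single linear layer (Bayes-by-backprop style): learn a Gaussian posterior over the weights and sample it with the reparameterisation trick. The paper applies such inference across attention, feed-forward, and embedding layers of a Transformer; this toy layer (bias omitted) only illustrates the principle.

import torch, torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Variational linear layer: Gaussian posterior over the weight matrix."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -5.0))  # sigma = softplus(rho)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)   # reparameterised weight sample
        return x @ w.t()

    def kl(self):
        # Closed-form KL(q(w) || N(0, I)), added to the training loss
        sigma = F.softplus(self.rho)
        return 0.5 * (sigma**2 + self.mu**2 - 1 - 2 * torch.log(sigma)).sum()

layer = BayesLinear(16, 8)
out = layer(torch.randn(4, 16))
loss = out.pow(2).mean() + 1e-3 * layer.kl()   # task loss + weighted KL term
loss.backward()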

21 citations


Book ChapterDOI
01 Jan 2021
TL;DR: The authors proposed a deep AM-FM framework to measure the quality of dialogue generation along two dimensions with the help of gold references: (1) the semantic closeness of the generated response to the corresponding gold references; (2) the syntactic quality of the sentence construction.
Abstract: There have been many studies on human-machine dialogue systems. To evaluate them accurately and fairly, many resort to human grading of system outputs. Unfortunately, this is time-consuming and expensive. The study of AM-FM (Adequacy Metric - Fluency Metric) suggests an automatic evaluation metric that achieves good performance in terms of correlation with human judgements. The AM-FM framework intends to measure the quality of dialogue generation along two dimensions with the help of gold references: (1) the semantic closeness of the generated response to the corresponding gold references; (2) the syntactic quality of the sentence construction. However, the original formulation of both the adequacy and fluency metrics faces some technical limitations. The latent semantic indexing (LSI) approach to AM modeling is not scalable to large amounts of data. The bag-of-words representation of sentences fails to capture contextual information. As for FM modeling, the n-gram language model implementation is not able to capture long-term dependency. Many deep learning approaches, such as the long short-term memory network (LSTM) or transformer-based architectures, address these issues well by providing more context-aware sentence representations than the LSI approach and achieving much lower perplexity on benchmark datasets than the n-gram language model. In this paper, we propose deep AM-FM, a DNN-based implementation of the framework, and demonstrate that it achieves promising improvements in both Pearson and Spearman correlation w.r.t. human evaluation on the benchmark DSTC6 End-to-end Conversation Modeling task compared to its original implementation and other popular automatic metrics.
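
A rough sketch of a deep AM-FM scorer under plausible assumptions: adequacy as cosine similarity of mean-pooled BERT sentence embeddings, and fluency as the geometric-mean per-token probability GPT-2 assigns to the response. The model choices here ("bert-base-uncased", "gpt2") are illustrative, not the paper's exact setup.

import torch
from transformers import AutoTokenizer, AutoModel, GPT2LMHeadModel, GPT2TokenizerFast

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
gpt_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def embed(text):
    """Mean-pooled BERT hidden states as a crude sentence vector."""
    out = bert(**bert_tok(text, return_tensors="pt")).last_hidden_state
    return out.mean(dim=1).squeeze(0)

def adequacy(response, reference):
    """AM: semantic closeness of response and gold reference."""
    return torch.cosine_similarity(embed(response), embed(reference), dim=0).item()

@torch.no_grad()
def fluency(response):
    """FM: geometric-mean token probability under a neural LM."""
    ids = gpt_tok(response, return_tensors="pt").input_ids
    return (-gpt(ids, labels=ids).loss).exp().item()

print(adequacy("Sounds good, see you then!", "Great, see you there!"))
print(fluency("Sounds good, see you then!"))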

19 citations


Proceedings ArticleDOI
19 Jan 2021
TL;DR: In this article, the authors explore three potential problems in unsupervised contrastive learning of speech representation: presence of non-speech data, noisy or low quality speech data, and imbalance in speaker distribution and show that these problems combined can already have a performance cost of up to 30% relative for the ABX score.
Abstract: Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments and the perplexity of a model trained on clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.

19 citations


Book ChapterDOI
01 Jan 2021
TL;DR: In this paper, Normalized Absolute Coherence (NAC) and Normalized Absolute Perplexity (NAP) were proposed for predicting the optimal number of topics in Latent Dirichlet Allocation.
Abstract: Feature extraction is one of the challenging tasks in the Machine Learning (ML) arena. The more features one is able to extract correctly, the more accurate knowledge one can exploit from data. Latent Dirichlet Allocation (LDA) is a form of topic modeling used to extract features from text data. But finding the optimal number of topics (on which the success of LDA depends) is tremendously challenging, especially if there is no prior knowledge about the data. Some studies suggest perplexity, some the Rate of Perplexity Change (RPC), and some coherence as the method for finding the number of topics that achieves both accuracy and low processing time for LDA. In this study, the authors propose two new methods named Normalized Absolute Coherence (NAC) and Normalized Absolute Perplexity (NAP) for predicting the optimal number of topics. The authors run standard ML experiments to measure and compare the reliability of the existing methods (perplexity, coherence, RPC) and the proposed NAC and NAP in searching for an optimal number of topics in LDA. The study shows that NAC and NAP work better than the existing methods. The investigation also suggests that perplexity, coherence, and RPC are sometimes distracting and confusing when estimating the optimal number of topics.
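
The abstract does not give the exact NAC/NAP formulas, so the sketch below simply sweeps candidate topic counts with gensim and min-max normalizes the absolute coherence scores as one plausible reading of the idea (u_mass coherence is used because it works on tiny toy corpora).

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["topic", "model", "text", "data"],
         ["latent", "dirichlet", "allocation", "topics"],
         ["coherence", "perplexity", "topics", "model"]]   # toy corpus
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

coh, ppl = {}, {}
for k in range(2, 6):                                      # candidate topic counts
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    coh[k] = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                            coherence="u_mass").get_coherence()
    ppl[k] = 2 ** (-lda.log_perplexity(corpus))            # per-word bound -> perplexity

# Illustrative min-max normalization of the absolute scores to [0, 1];
# the paper's exact NAC/NAP definitions may differ.
lo, hi = min(coh.values()), max(coh.values())
nac = {k: (c - lo) / (hi - lo + 1e-9) for k, c in coh.items()}
print(max(nac, key=nac.get), "topics by normalized coherence")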

18 citations


Journal ArticleDOI
TL;DR: It is suggested that there is value in perceptual information, particularly in thinking about predicting the impacts of future change and that the authors still have much to learn about moving from observational and perceptual complexity to parsimonious predictability.
Abstract: This article reconsiders the concept of a perceptual model of hydrological processes as the first stage to be considered in developing a procedural model for a particular catchment area. While various perceptual models for experimental catchments have been developed, the concept is not widely used in defining or evaluating catchment models. This is, at least in part, because of the evident complexity possible in a perceptual model and the approximate nature of procedural model structures and parameterizations, particularly where there is a requirement for parameter parsimony. A perceptual model for catchments in Cumbria, North-West England, is developed as an exemplar and illustrated in terms of time varying distribution functions. Two critical questions are addressed: how can perceptual model hypotheses be tested at scales of interest, and how can constraints then be imposed on the basis of qualitative perceptual knowledge in conditioning predictive models? It is suggested that there is value in perceptual information, particularly in thinking about predicting the impacts of future change, and that we still have much to learn about moving from observational and perceptual complexity to parsimonious predictability.

18 citations


Journal ArticleDOI
TL;DR: Examining the extent to which deep-learning-based natural language generation (NLG) models can offer responses similar to human-generated responses in MOOC forums suggested that the GPT-2 model can provide emotional and community support to learners with contextual replies comparably to humans.
Abstract: Among all the learning resources within MOOCs, such as video lectures and homework, the discussion forum stands out as a valuable platform for students' learning through knowledge exchange. However, peer interactions on MOOC discussion forums are scarce. The lack of interactions among MOOC learners can yield negative effects on students' learning, causing low participation and a high dropout rate. This research aims to examine the extent to which deep-learning-based natural language generation (NLG) models can offer responses similar to human-generated responses to the learners in MOOC forums. Specifically, under the framework of social support theory, this study examines the use of two state-of-the-art deep learning models, the recurrent neural network (RNN) and the generative pretrained transformer 2 (GPT-2), to provide students with informational, emotional, and community support with NLG on discussion forums. We first trained an RNN and a GPT-2 model with 13,850 entries of post-reply pairs. Quantitative evaluation of model performance was then conducted with word perplexity, readability, and coherence. The results showed that GPT-2 outperformed the RNN on all measures. We then qualitatively compared the dimensions of support provided by humans and GPT-2, and the results suggested that the GPT-2 model can provide emotional and community support to human learners with contextual replies comparable to humans'. We further surveyed participants to find out if the collected data would align with our findings. The results showed that the GPT-2 model could provide supportive and contextual replies to a similar extent as humans.
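
A minimal generation sketch with off-the-shelf GPT-2 and nucleus sampling; the study fine-tuned GPT-2 on 13,850 post-reply pairs first, which this snippet omits, so treat it only as an illustration of the reply-generation step.

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

post = "I'm stuck on assignment 2 and feel like giving up on this course."
ids = tok(post, return_tensors="pt").input_ids
# Nucleus (top-p) sampling for a contextual, non-deterministic reply
reply = model.generate(ids, max_length=60, do_sample=True, top_p=0.9,
                       pad_token_id=tok.eos_token_id)
print(tok.decode(reply[0][ids.shape[1]:], skip_special_tokens=True))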

17 citations


Book ChapterDOI
01 Jan 2021
TL;DR: In this article, N-gram models are discussed and evaluated using Good Turing Estimation, perplexity measure and type-to-token ratio to predict the next word when the user provides input.
Abstract: The prediction of next word, letter or phrase for the user, while she is typing, is a really valuable tool for improving user experience. The users are communicating, writing reviews and expressing their opinion on such platforms frequently and many times while moving. It has become necessary to provide the user with an application that can reduce typing effort and spelling errors when they have limited time. The text data is getting larger in size due to the extensive use of all kinds of social media platforms and so implementation of text prediction application is difficult considering the size of text data to be processed for language modeling. This research paper’s primary objective is processing large text corpus and implementing a probabilistic model like N-grams to predict the next word when the user provides input. In this exploratory research, n-gram models are discussed and evaluated using Good Turing Estimation, perplexity measure and type-to-token ratio.
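
A toy next-word predictor along these lines: trigram counts plus the perplexity measure. Add-one smoothing is used here for brevity; the paper uses Good-Turing estimation, which is omitted.

from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat the cat ran to the mat".split()

# Trigram counts indexed by the two-word context
trigrams = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    trigrams[(a, b)][c] += 1

def predict_next(w1, w2):
    """Most likely next word given the previous two words."""
    candidates = trigrams[(w1, w2)]
    return candidates.most_common(1)[0][0] if candidates else None

def perplexity(words):
    """Perplexity under the trigram model with add-one smoothing."""
    vocab = len(set(corpus))
    logp = 0.0
    for a, b, c in zip(words, words[1:], words[2:]):
        ctx = trigrams[(a, b)]
        logp += math.log((ctx[c] + 1) / (sum(ctx.values()) + vocab))
    return math.exp(-logp / max(len(words) - 2, 1))

print(predict_next("the", "cat"))                    # -> 'sat'
print(perplexity("the cat sat on the mat".split()))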

Proceedings ArticleDOI
01 Aug 2021
TL;DR: This article showed that adding absolute position embeddings to queries and keys, instead of to word embeddings, improves perplexity and speeds up the training of a language model with short input lengths.
Abstract: Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time and, surprisingly, substantially improves perplexity. Second, we show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens when generating sequences that exceed the maximal length the transformer can handle at once. Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative of adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. We show that these recurrent models also benefit from short input lengths. Combining these techniques speeds up training by a factor of 1.65, reduces memory usage, and substantially improves perplexity on WikiText-103, without adding any parameters.
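
A single-head NumPy sketch of the position-embedding change described above: absolute position embeddings are added to the inputs of the query and key projections only, so values (and therefore cached states) stay position-free. Scaling details, multiple heads, masking, and the recurrence caching itself are omitted.

import numpy as np

def attention_pos_on_qk(x, wq, wk, wv, pos_emb):
    """Self-attention with absolute position embeddings added to the
    query/key inputs, not the word embeddings; values stay position-free,
    which is what lets cached states be reused across overlapping windows."""
    n, d = x.shape
    p = pos_emb[:n]                      # (n, d) absolute position table
    q = (x + p) @ wq
    k = (x + p) @ wk
    v = x @ wv                           # note: no positions added here
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d = 10, 8
x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
pos = rng.normal(size=(512, d))
print(attention_pos_on_qk(x, wq, wk, wv, pos).shape)   # (10, 8)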

Journal ArticleDOI
TL;DR: A corpus-based methodology is applied, based on the measure of perplexity, to automatically calculate the cross-lingual language distance between historical periods of three different countries.
Abstract: The aim of this paper is to apply a corpus-based methodology, based on the measure of perplexity, to automatically calculate the cross-lingual language distance between historical periods of three different countries.
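
To illustrate the methodology, the toy snippet below trains a character-bigram model on text from one period and measures its perplexity on text from another; higher perplexity indicates greater language distance. The paper's actual models and corpora are of course far larger, and the example strings are invented.

import math
from collections import Counter

def char_bigram_model(text):
    """MLE character-bigram model with add-one smoothing."""
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    vocab = len(set(text)) + 1
    def logprob(a, b):
        return math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return logprob

def perplexity(model, text):
    lp = sum(model(a, b) for a, b in zip(text, text[1:]))
    return math.exp(-lp / max(len(text) - 1, 1))

# Perplexity of a model trained on period A, evaluated on period B text:
# the higher it is, the more "distant" the two language varieties.
period_a = "thou art come to answer a stony adversary"
period_b = "you have come to face a ruthless opponent"
print(perplexity(char_bigram_model(period_a), period_b))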

Posted Content
TL;DR: SRU++ as discussed by the authors combines fast recurrence and attention for sequence modeling and achieves state-of-the-art results on Enwik8, Wiki-103, and Billion Word datasets.
Abstract: Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling tasks such as Enwik8, Wiki-103 and Billion Word datasets, our model obtains better bits-per-character and perplexity while using 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the Enwik8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.

Journal ArticleDOI
TL;DR: In this article, an algorithmic method is proposed for preparing sub-dataset splits for machine learning models, which aims to achieve evenly representative splits of the dataset in a standard, algorithmic way that reduces the perplexity of random splitting.
Abstract: The datasets that appear in publications are curated and have been split into training, testing and validation sub-datasets by domain experts. Consequently, machine learning models typically perform well on such hand-prepared splits, whereas splitting real-world datasets into curated training, testing and validation sub-datasets requires extensive effort. Usually, repeated random splits are carried out, trained, and evaluated on until a good score on the evaluation metrics is reached. In this paper, an algorithmic method is proposed for preparing sub-dataset splits for machine learning models. The objective of the proposed method is to achieve evenly representative splits of the dataset in a standard, algorithmic way that reduces the perplexity of random splitting.
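
The abstract does not spell out the algorithm, but a standard way to obtain evenly representative splits is stratification; a scikit-learn sketch:

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)          # imbalanced labels

# Stratify so each split preserves the 80/20 label ratio,
# instead of hoping a random split lands evenly.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

print(y_train.mean(), y_val.mean(), y_test.mean())   # ~0.2 in every split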

Journal ArticleDOI
TL;DR: An approach to building an end-to-end Tunisian dialect speech recognition system, based on deep learning, that automatically recognizes spontaneous human speech and transcribes it into text is described.

Proceedings ArticleDOI
01 Aug 2021
TL;DR: This article investigated whether the established results in computational psycholinguistics can be generalized across languages and found that the established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like.
Abstract: In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universals within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine an established generalization — the lower perplexity a language model has, the more human-like the language model is — in Japanese, a language with typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models.

Proceedings ArticleDOI
01 Aug 2021
TL;DR: This paper analyzed whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained, and provided a framework to evaluate the fit of language models to these trends.
Abstract: We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework, paired with significance tests, for evaluating the fit of language models to these trends. We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions (when present). Further, the fit to different distributions is highly dependent on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the type–token relationship of natural language than text produced using standard ancestral sampling; text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.

Journal ArticleDOI
Jianmin Guo, Quan Zhang, Yue Zhao, Heyuan Shi, Yu Jiang, Jiaguang Sun
TL;DR: The authors proposed an adversarial testing framework RNN-Test for RNN systems, focusing on sequence-to-sequence (seq2seq) tasks of widespread deployments, not only classification domains.
Abstract: While massive effort has been invested in adversarial testing of convolutional neural networks (CNN), testing for recurrent neural networks (RNN) is still limited, leaving threats open across vast sequential application domains. In this paper, we propose an adversarial testing framework RNN-Test for RNN systems, focusing on sequence-to-sequence (seq2seq) tasks of widespread deployment, not only classification domains. First, we design a novel search methodology customized for RNN models by maximizing the inconsistency of RNN states against their inner dependencies to produce adversarial inputs. Next, we introduce two state-based coverage metrics according to the distinctive structure of RNNs to exercise more system behaviors. Finally, RNN-Test solves the joint optimization problem to maximize state inconsistency and state coverage, and crafts adversarial inputs for various tasks with different kinds of inputs. For evaluation, we apply RNN-Test to four RNN models of common structures. On the tested models, the RNN-Test approach is demonstrated to be competitive in generating adversarial inputs, outperforming FGSM-based and DLFuzz-based methods by reducing model performance more sharply, with a 2.78% to 37.94% higher success (or generation) rate. RNN-Test also achieves a 52.65% to 66.45% higher adversary rate than testRNN on the MNIST LSTM model, as well as 53.76% to 58.02% more perplexity with a 16% higher generation rate than DeepStellar on the PTB language model. Compared with traditional neuron coverage, the proposed state coverage metrics as guidance excel with a 4.17% to 97.22% higher success (or generation) rate.

Proceedings ArticleDOI
01 Aug 2021
TL;DR: The authors measured usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia and found that several extremely destructive context manipulations, such as shuffling word order within sentences and deleting all words other than nouns, remove less than 15% of the usable information.
Abstract: Transformer-based language models benefit from conditioning on contexts of hundreds to thousands of previous tokens. What aspects of these contexts contribute to accurate model prediction? We describe a series of experiments that measure usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia. In both mid- and long-range contexts, we find that several extremely destructive context manipulations—including shuffling word order within sentences and deleting all words other than nouns—remove less than 15% of the usable information. Our results suggest that long contexts, but not their detailed syntactic and propositional content, are important for the low perplexity of current transformer language models.
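
One of the destructive manipulations is easy to reproduce; the sketch below shuffles word order within sentences, after which the perplexity of any language model on a held-out continuation can be compared against the intact context to estimate how much usable information the manipulation destroyed.

import random
import re

def shuffle_within_sentences(text, seed=0):
    """Destructive context manipulation from the paper: permute word order
    inside each sentence while keeping sentence boundaries intact."""
    rng = random.Random(seed)
    out = []
    for sent in re.split(r"(?<=[.!?])\s+", text):
        words = sent.split()
        rng.shuffle(words)
        out.append(" ".join(words))
    return " ".join(out)

ctx = "The cat sat on the mat. It was a sunny day."
print(shuffle_within_sentences(ctx))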

Journal ArticleDOI
13 Aug 2021
TL;DR: Latent Dirichlet Allocation is found to be a better topic modelling algorithm than Latent Semantic Analysis and the Hierarchical Dirichlet Process for the aspect extraction process in aspect-based opinion mining, and an unsupervised aspect extraction algorithm based on topic models for aspect-based opinion mining is proposed.
Abstract: With the massive use of electronic gadgets and the growing popularity of web-based media, a great deal of text data is being produced at a rate never observed before. It is not feasible for people to peruse all the information produced and discover what is being investigated in their area of interest. Topic modelling is used to determine the topics in large textual documents. Topic Modelling Algorithms (TMA) are unsupervised machine learning approaches which are widely used and have proven to be successful in the area of aspect-based opinion mining to extract 'latent' topics, which are aspects of interest. In this paper, the approaches that are widely used for topic modelling are examined and compared on their ability to detect topics, based on metrics such as perplexity and coherence. As a result, Latent Dirichlet Allocation proves to be a better topic modelling algorithm than Latent Semantic Analysis and the Hierarchical Dirichlet Process for the aspect extraction process in aspect-based opinion mining. We also propose an unsupervised aspect extraction algorithm based on topic models for aspect-based opinion mining.

Proceedings ArticleDOI
01 Aug 2021
TL;DR: This article explored whether the uniform information density hypothesis can be operationalized as an inductive bias for statistical language modeling and found that using UID regularization consistently improves perplexity in language models, having a larger effect when training data is limited.
Abstract: The uniform information density (UID) hypothesis, which posits that speakers behaving optimally tend to distribute information uniformly across a linguistic signal, has gained traction in psycholinguistics as an explanation for certain syntactic, morphological, and prosodic choices. In this work, we explore whether the UID hypothesis can be operationalized as an inductive bias for statistical language modeling. Specifically, we augment the canonical MLE objective for training language models with a regularizer that encodes UID. In experiments on ten languages spanning five language families, we find that using UID regularization consistently improves perplexity in language models, having a larger effect when training data is limited. Moreover, via an analysis of generated sequences, we find that UID-regularized language models have other desirable properties, e.g., they generate text that is more lexically diverse. Our results not only suggest that UID is a reasonable inductive bias for language modeling, but also provide an alternative validation of the UID hypothesis using modern-day NLP tools.
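
One plausible operationalization of the regularizer in PyTorch: penalize the variance of per-token surprisals on top of the usual cross-entropy objective, so information is spread more uniformly across the sequence. The paper's exact regularizer and weighting may differ.

import torch
import torch.nn.functional as F

def uid_regularized_loss(logits, targets, beta=0.01):
    """MLE objective plus a UID-style penalty on the variance of
    per-token surprisals (negative log-likelihoods)."""
    token_nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                targets.view(-1), reduction="none")
    return token_nll.mean() + beta * token_nll.var()

logits = torch.randn(2, 7, 100, requires_grad=True)   # (batch, time, vocab)
targets = torch.randint(0, 100, (2, 7))
uid_regularized_loss(logits, targets).backward()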

Journal ArticleDOI
TL;DR: How the publication network is formed in this particular field of research is demonstrated, and how the content of abstracts can be automatically analyzed to provide a set of research topics for quick understanding and application in future projects is demonstrated.
Abstract: This research aims to illustrate the potential use of concepts, techniques, and mining process tools to improve the systematic review process. Thus, a review was performed on two online databases (Scopus and ISI Web of Science) from 2012 to 2019. A total of 9649 studies were identified, which were analyzed using probabilistic topic modeling procedures within a machine learning approach. The Latent Dirichlet Allocation method, chosen for modeling, required the following stages: 1) data cleansing, and 2) data modeling into topics for coherence and perplexity analysis. All research was conducted according to the standards of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses in a fully computerized way. The computational literature review is an integral part of a broader literature review process. The results presented met three criteria: (1) literature review for a research area, (2) analysis and classification of journals, and (3) analysis and classification of academic and individual research teams. The contribution of the article is to demonstrate how the publication network is formed in this particular field of research, and how the content of abstracts can be automatically analyzed to provide a set of research topics for quick understanding and application in future projects.

Journal ArticleDOI
TL;DR: A theoretical analysis of Deep CockTail Network (DCTN), a universal and flexibly deployed framework for transferable deep representations in visual domain adaptation, reveals that DCTN significantly boosts classification accuracies in MSDA and performs extraordinarily well in resisting negative transfer across different MSDA scenarios.
Abstract: Transferable deep representations for visual domain adaptation (DA) provide a route to learn from labeled source images to recognize target images without the aid of target-domain supervision. Related research has attracted a great amount of interest due to its industrial potential for non-laborious annotation and remarkable generalization. However, DA presumes source images are identically sampled from a single source, while Multi-Source DA (MSDA) is ubiquitous in the real world. In MSDA, domain shifts exist not only between source and target domains but also among the sources; in particular, the multi-source and target domains may disagree on their semantics (e.g., category shifts). This issue challenges the existing solutions for MSDA. In this paper, we propose Deep CockTail Network (DCTN), a universal and flexibly deployed framework to address these problems. DCTN uses a multi-way adversarial learning pipeline to minimize the domain discrepancy between the target and each of the multiple source domains in order to learn domain-invariant features. The derived source-specific perplexity scores measure how similar each target feature appears to a feature from one of the source domains. The multi-source category classifiers are integrated with the perplexity scores to categorize target images. We accordingly derive a theoretical analysis of DCTN, including an interpretation of why DCTN can succeed without precisely crafting the source-specific hyper-parameters, and target expected loss upper bounds in terms of domain and category shifts. In our experiments, DCTN has been evaluated on four benchmarks, whose empirical studies involve vanilla and three challenging category-shift transfer problems in MSDA, i.e., source-shift, target-shift and source-target-shift scenarios. The results thoroughly reveal that DCTN significantly boosts classification accuracies in MSDA and performs extraordinarily well in resisting negative transfer across different MSDA scenarios.

DOI
24 Apr 2021
TL;DR: In this paper, the authors provide indications for articulating a dialogue in the face of the new educational scenario that the teaching of mathematics is going through with the pandemic caused by the SARS-CoV-2 virus, unleashed in 2019.
Abstract: This text provides indications for articulating a dialogue in the face of the new educational scenario that the teaching of mathematics is going through with the pandemic caused by the SARS-CoV-2 virus, which broke out in 2019. Understandings about digital technologies for the teaching of mathematics are discussed, prompted by the perplexity experienced in inhabiting the technological world in the midst of a declared pandemic. To this end, through attentive listening to the lived moment, we tried to glimpse the possibilities that open up for the teaching of mathematics with technologies, taking Heidegger's notion of inhabiting as the ground of understanding. It concludes that being-with-the-other is central and essential in education, requiring different ways of thinking, of forming people and of becoming a teacher. Keywords: Mathematics Education; Digital Technologies; Pandemic; Dwelling.

Proceedings ArticleDOI
01 Aug 2021
TL;DR: In this paper, a low-resource setting of summarizing long legal briefs with an average source document length of 4268 words and only 120 available (document, summary) pairs is studied.
Abstract: Abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern abstractive summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. In this paper, we study a challenging low-resource setting of summarizing long legal briefs with an average source document length of 4268 words and only 120 available (document, summary) pairs. To account for data scarcity, we used a modern pre-trained abstractive summarizer BART, which only achieves 17.9 ROUGE-L as it struggles with long documents. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary, using a novel algorithm based on GPT-2 language model perplexity scores, that operates within the low resource regime. On feeding the compressed documents to BART, we observe a 6.0 ROUGE-L improvement. Our method also beats several competitive salience detection baselines. Furthermore, the identified salient sentences tend to agree with independent human labeling by domain experts.
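
As a hedged sketch of the salience idea: score each source sentence by the GPT-2 perplexity of the summary when that sentence is prepended, and keep the sentences under which the summary is least perplexing. The paper's actual algorithm operates on training (document, summary) pairs and likely differs in detail; the example sentences are invented.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def ppl(text):
    ids = tok(text, return_tensors="pt").input_ids
    return torch.exp(model(ids, labels=ids).loss).item()

def rank_salient(sentences, summary, top_k=2):
    """Keep the source sentences that most lower the summary's perplexity
    when prepended, as a proxy for how well they 'ground' the summary."""
    return sorted(sentences, key=lambda s: ppl(s + " " + summary))[:top_k]

src = ["The court denied the motion.", "Lunch was served at noon.",
       "The appellant argued the contract was void."]
print(rank_salient(src, "The motion was denied and the contract challenged."))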

Journal ArticleDOI
TL;DR: In this paper, NMT is applied to translating English into Indian languages, especially Telugu, and parameters like accuracy, perplexity, cross-entropy and BLEU scores show better translation quality for NMT with effective preprocessing.
Abstract: In cross-language information retrieval (CLIR), neural machine translation (NMT) plays a vital role. CLIR retrieves information written in a language which is different from the user's query language, so the main concern is to translate the user query from the source language to the target language. NMT is useful for translating data from one language to another and achieves good accuracy for language pairs like English to German. In this paper, NMT is applied to translating English into Indian languages, especially Telugu. Besides NMT, an effort is also made to improve accuracy through an effective preprocessing mechanism; the role of effective preprocessing in improving accuracy is modest but measurable. Machine translation (MT) is a data-driven approach where a parallel corpus acts as the input. NMT requires a massive parallel corpus for performing the translation, and building an English-Telugu parallel corpus is costly because these are resource-poor languages. Different mechanisms are available for preparing the parallel corpus. A major issue in preparing a parallel corpus is data replication, which is handled during preprocessing. Another issue in machine translation is the out-of-vocabulary (OOV) problem, which was earlier handled with dictionaries. To overcome this problem, rare words are segmented into sequences of subwords during preprocessing. Parameters like accuracy, perplexity, cross-entropy and BLEU scores show better translation quality for NMT with effective preprocessing.
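
The subword segmentation step can be sketched with the sentencepiece library; the corpus path and vocabulary size below are placeholders, not the paper's settings.

import sentencepiece as spm

# Train a small BPE model so rare words are split into known subwords,
# which is how the OOV problem is handled during preprocessing.
spm.SentencePieceTrainer.train(input="parallel_corpus.en", model_prefix="en_bpe",
                               vocab_size=8000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="en_bpe.model")
print(sp.encode("untranslatable", out_type=str))
# e.g. ['▁un', 'trans', 'lat', 'able'] -- no token is out-of-vocabulary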

Proceedings ArticleDOI
01 Aug 2021
TL;DR: The authors proposed a method that adapts existing pre-trained language models to new languages by retraining lexical embeddings without tuning the Transformer layers and induced a bilingual lexicon from this alignment.
Abstract: Large generative language models have been very successful for English, but other languages lag behind due to data and computational limitations. We propose a method that may overcome these problems by adapting existing pre-trained language models to new languages. Specifically, we describe the adaptation of English GPT-2 to Italian and Dutch by retraining lexical embeddings without tuning the Transformer layers. As a result, we obtain lexical embeddings for Italian and Dutch that are aligned with the original English lexical embeddings and induce a bilingual lexicon from this alignment. Additionally, we show how to scale up complexity by transforming relearned lexical embeddings of GPT-2 small to the GPT-2 medium embedding space. This method minimises the amount of training and prevents losing information during adaptation that was learned by GPT-2. English GPT-2 models with relearned lexical embeddings can generate realistic sentences in Italian and Dutch, but on average these sentences are still identifiable as artificial by humans. Based on perplexity scores and human judgements, we find that generated sentences become more realistic with some additional full model finetuning, especially for Dutch. For Italian, we see that they are evaluated on par with sentences generated by a GPT-2 model fully trained from scratch. Our work can be conceived as a blueprint for training GPT-2s for other languages, and we provide a 'recipe' to do so.
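
The freezing step described above is simple to express with Hugging Face transformers; this sketch leaves out the target-language tokenizer swap and the actual retraining loop on Italian or Dutch text.

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze everything, then unfreeze only the lexical (token) embeddings.
for p in model.parameters():
    p.requires_grad = False
model.transformer.wte.weight.requires_grad = True
# Note: GPT-2 ties the output head to wte, so the LM head is retrained too.

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")   # only the embedding matrix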

Proceedings ArticleDOI
Ding Siyu, Junyuan Shang, Wang Shuohuan, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
01 Aug 2021
TL;DR: This paper proposed ERNIE-Doc, a document-level language pretraining model based on Recurrence Transformers, which has a much longer effective context length to capture the contextual information of a complete document.
Abstract: Transformers are not suited for processing long documents, due to their quadratically increasing memory and time consumption. Simply truncating a long document or applying the sparse attention mechanism will incur the context fragmentation problem or lead to an inferior modeling capability against comparable model sizes. In this paper, we propose ERNIE-Doc, a document-level language pretraining model based on Recurrence Transformers. Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, enable ERNIE-Doc, which has a much longer effective context length, to capture the contextual information of a complete document. We pretrain ERNIE-Doc to explicitly learn the relationships among segments with an additional document-aware segment-reordering objective. Various experiments were conducted on both English and Chinese document-level tasks. ERNIE-Doc improved the state-of-the-art language modeling result of perplexity to 16.8 on WikiText-103. Moreover, it outperformed competitive pretraining models by a large margin on most language understanding tasks, such as text classification and question answering.