Other affiliations: Microsoft, École Normale Supérieure, French Institute for Research in Computer Science and Automation
Bio: Armand Joulin is an academic researcher from Facebook. The author has contributed to research in the topics Word (computer architecture) and Language model. The author has an h-index of 55 and has co-authored 125 publications receiving 25,130 citations. Previous affiliations of Armand Joulin include Microsoft and École Normale Supérieure.
Topics: Word (computer architecture), Language model, Convolutional neural network, Artificial neural network, Question answering
06 Dec 2021
Abstract: The goal of this work is to efficiently identify visually similar patterns from a pair of images, e.g. identifying an artwork detail copied between an engraving and an oil painting, or matching a night-time photograph with its daytime counterpart. Lack of training data is a key challenge for this co-segmentation task. We present a simple yet surprisingly effective approach to overcome this difficulty: we generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image. We then learn to predict the repeated object masks. We find that it is crucial to predict the correspondences as an auxiliary task and to use Poisson blending and style transfer on the training pairs to generalize on real data. We analyse results with two deep architectures relevant to our joint image analysis task: a transformer-based architecture and Sparse Nc-Net, a recent network designed to predict coarse correspondences using 4D convolutions. We show our approach provides clear improvements for artwork details retrieval on the Brueghel dataset and achieves competitive performance on two place recognition benchmarks, Tokyo247 and Pitts30K. We then demonstrate the potential of our approach by performing object discovery on the Internet object discovery dataset and the Brueghel dataset. Our code and data are available at http://imagine.enpc.fr/~shenx/SegSwap/.
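The copy-paste step described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `make_copy_paste_pair`, its arguments, and the use of NumPy arrays are assumptions made for illustration, and the Poisson blending and style transfer that the abstract reports as crucial are left out.

```python
import numpy as np

def make_copy_paste_pair(source, source_mask, target, top_left):
    """Paste the masked segment of `source` into a copy of `target`.

    Hypothetical helper, not from the paper. `source` and `target` are
    H x W x C arrays, `source_mask` is an H x W boolean array, and
    `top_left` is the (row, col) where the segment's bounding box lands.
    The caller must make sure the bounding box fits inside `target`.
    Returns the composite image and the mask of the pasted region.
    """
    # Crop the segment to its bounding box.
    ys, xs = np.nonzero(source_mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    patch = source[y0:y1, x0:x1]
    patch_mask = source_mask[y0:y1, x0:x1].astype(bool)

    ty, tx = top_left
    h, w = patch_mask.shape

    # Overwrite only the pixels that belong to the segment.
    composite = target.copy()
    region = composite[ty:ty + h, tx:tx + w]  # view into `composite`
    region[patch_mask] = patch[patch_mask]

    # The mask of the repeated object in the composite is the training label.
    target_mask = np.zeros(target.shape[:2], dtype=bool)
    target_mask[ty:ty + h, tx:tx + w] = patch_mask
    return composite, target_mask
```

The pair (`source`, `composite`) with masks (`source_mask`, `target_mask`) would then serve as one synthetic training example for predicting repeated-object masks.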
Abstract: We propose 3DETR, an end-to-end Transformer based object detection model for 3D point clouds. Compared to existing detection methods that employ a number of 3D-specific inductive biases, 3DETR requires minimal modifications to the vanilla Transformer block. Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialized architectures that employ libraries of 3D-specific operators with hand-tuned hyperparameters. Nevertheless, 3DETR is conceptually simple and easy to implement, enabling further improvements by incorporating 3D domain knowledge. Through extensive experiments, we show 3DETR outperforms the well-established and highly optimized VoteNet baselines on the challenging ScanNetV2 dataset by 9.5%. Furthermore, we show 3DETR is applicable to 3D tasks beyond detection, and can serve as a building block for future research.
01 Aug 2021
Abstract: We show that margin-based bitext mining in a multilingual sentence space can be successfully scaled to operate on monolingual corpora of billions of sentences. We use 32 snapshots of a curated common crawl corpus (Wenzek et al., 2019) totaling 71 billion unique sentences. Using one unified approach for 90 languages, we were able to mine 10.8 billion parallel sentences, out of which only 2.9 billion are aligned with English. We illustrate the capability of our scalable mining system to create high quality training sets from one language to any other by training hundreds of different machine translation models and evaluating them on the many-to-many TED benchmark. Further, we evaluate on competitive translation benchmarks such as WMT and WAT. Using only mined bitext, we set a new state of the art for a single system on the WMT’19 test set for English-German/Russian/Chinese. In particular, our English/German and English/Russian systems outperform the best single ones by over 4 BLEU points and are on par with best WMT’19 systems, which train on the WMT training data and augment it with backtranslation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2020 WAT workshop. All of the mined bitext will be freely available.
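A small sketch may help make margin-based scoring concrete. The abstract does not state which margin function is used; the code below assumes the commonly used ratio margin, where the cosine similarity of a candidate pair is divided by the average similarity of each sentence to its k nearest neighbours in the other language. The function name `margin_scores` and the brute-force dense similarity matrix are illustrative assumptions; mining at the scale described above would need approximate nearest-neighbour search instead.

```python
import numpy as np

def margin_scores(src, tgt, k=4):
    """Ratio-margin scores between all source and target embeddings.

    Hypothetical helper. `src` is (n_src, d), `tgt` is (n_tgt, d), and
    k must not exceed min(n_src, n_tgt). Assumes the neighbourhood
    averages are positive, which holds for typical sentence embeddings.
    """
    # Cosine similarity via L2-normalised dot products.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T  # (n_src, n_tgt)

    # Mean similarity of each sentence to its k nearest neighbours
    # on the other side.
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # (n_src,)
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # (n_tgt,)

    # Averaging the two neighbourhood means gives the 1/(2k) weighting.
    denom = (src_knn[:, None] + tgt_knn[None, :]) / 2.0
    return sim / denom
```

Candidate pairs would then be taken from the highest-scoring entries, for example the row-wise argmax filtered by a threshold. Normalising by the neighbourhood average penalises sentences that are similar to everything, which raw cosine similarity does not.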
Abstract: Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
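The "transposed" attention described in this abstract can be sketched for a single head in NumPy. This is an illustrative simplification, not the XCiT implementation: the L2 normalisation of queries and keys along the token axis and the fixed temperature `tau` are assumptions not stated in the abstract, and the linear projections, multiple heads, and surrounding blocks are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_covariance_attention(q, k, v, tau=1.0):
    """Single-head channel attention sketch; q, k, v have shape (N, d).

    The attention map is d x d (channels by channels), so the cost is
    O(N * d^2): linear in the number of tokens N.
    """
    # Assumption: normalise each channel over the token axis so the
    # d x d products stay bounded regardless of N.
    qn = q / np.linalg.norm(q, axis=0, keepdims=True)
    kn = k / np.linalg.norm(k, axis=0, keepdims=True)

    # Cross-covariance between query and key channels.
    attn = softmax((qn.T @ kn) / tau, axis=-1)  # (d, d)

    # Mix the channels of the values; tokens are never compared pairwise.
    return (attn @ v.T).T  # (N, d)
```

By contrast, standard self-attention builds an N x N map, which is where the quadratic cost in the number of tokens comes from.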
Abstract: In this article, we study the task of user profiling in question answering communities (QACs). Previous user profiling algorithms suffer from a number of defects: they regard users and words as ato...
Abstract: Modern application stores enable developers to classify their apps by choosing from a set of generic categories, or genres, such as health, games, and music. These categories are typically static—n...
Abstract: Word-embedding acts as one of the backbones of modern natural language processing (NLP). Recently, with the need for deploying NLP models to low-resource devices, there has been a surge of interest to compress word embeddings into hash codes or binary vectors so as to save the storage and memory consumption. Typically, existing work learns to encode an embedding into a compressed representation from which the original embedding can be reconstructed. Although these methods aim to preserve most information of every individual word, they often fail to retain the relation between words, thus can yield large loss on certain tasks. To this end, this paper presents Relation Reconstructive Binarization (R2B) to transform word embeddings into binary codes that can preserve the relation between words. At its heart, R2B trains an auto-encoder to generate binary codes that allow reconstructing the word-by-word relations in the original embedding space. Experiments showed that our method achieved significant improvements over previous methods on a number of tasks along with a space-saving of up to 98.4%. Specifically, our method reached even better results on word similarity evaluation than the uncompressed pre-trained embeddings, and was significantly better than previous compression methods that do not consider word relations.
Abstract: A key desideratum for inclusive and accessible speech recognition technology is ensuring its robust performance on children’s speech. Notably, this includes the rapidly advancing neural network based end-to-end speech recognition systems. Children’s speech recognition is more challenging due to the larger intra- and inter-speaker variability in terms of acoustic and linguistic characteristics compared to adult speech. Furthermore, the lack of adequate and appropriate children’s speech resources adds to the challenge of designing robust end-to-end neural architectures. This study provides a critical assessment of automatic children’s speech recognition through an empirical study of contemporary state-of-the-art end-to-end speech recognition systems. Insights are provided on training data requirements, adaptation on children’s data, the effect of children’s age, utterance lengths, different architectures and loss functions for end-to-end systems, and the role of language models in speech recognition performance.
Abstract: Development and improvement of region proposal algorithms have rapidly become one of the most critical research areas over recent years. The high accuracy of region-based recognition techniques has led to the use of proposal algorithms as an imperative core in various recognition problems. The main purpose of these algorithms is to extract an appropriate number of effective regions from an image, which reduces the search space and increases detection accuracy. The early development of these algorithms was based on a set of hand-crafted features. Recently, with advances in deep learning techniques, deep models have been widely and successfully applied to region proposals. This paper reviews region proposal algorithms, theory, and evaluation metrics and also addresses the existing challenges. In addition, we present a classification for generating proposals, including classical and advanced methods based on hand-crafted features and deep learning, respectively. Both categories are described in detail, and an extensive review of recent works is presented. The proposal improvement methods, including ranking algorithms, are also described. In total, more than 60 different algorithms have been studied and classified, and we also point out several applications based on region proposals.
Author's H-index: 55