Xiang Ren
Researcher at Facebook
Publications - 5
Citations - 1
Xiang Ren is an academic researcher at Facebook who has co-authored 5 publications on topics including modality (human–computer interaction) and computer science. Previous affiliations include Amazon.com and the Indian Institute of Technology Kanpur.
Papers
Posted Content
MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding
TL;DR: In this paper, a modality-specific knowledge distillation (MSD) framework is proposed to transfer knowledge from a teacher on multimodal tasks by learning the teacher's behavior within each modality.
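As a rough illustration of the within-modality distillation idea in the TL;DR above, here is a minimal PyTorch-style sketch. The modality-keyed logit dictionaries, the per-example saliency weights, and the temperature value are assumptions made for the sketch, not details taken from the paper.

import torch.nn.functional as F

def modality_specific_kd_loss(student_logits, teacher_logits, saliency_weights, temperature=2.0):
    # student_logits / teacher_logits: dicts mapping a modality name
    # (e.g., "image", "text") to a [batch, num_classes] tensor.
    # saliency_weights: dict mapping a modality name to a [batch] tensor
    # that up-weights examples for which that modality is salient.
    total = 0.0
    for mod in student_logits:
        log_p_student = F.log_softmax(student_logits[mod] / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits[mod] / temperature, dim=-1)
        # Per-example KL between teacher and student within this modality.
        kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
        total = total + (saliency_weights[mod] * kl).mean() * temperature ** 2
    return total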
Posted Content
Sparse Distillation: Speeding Up Text Classification by Using Bigger Models
TL;DR: The authors distill a transformer-based text classifier into a billion-parameter, sparsely-activated student model with an embedding-averaging architecture, achieving up to 600x speedup on both GPUs and CPUs compared to the teacher model.
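The student architecture mentioned in the TL;DR can be sketched concretely: look up n-gram embeddings, average them, and apply a linear classifier, which is why inference is far cheaper than a transformer forward pass. The sketch below is an assumed PyTorch rendering, not the authors' code; the distillation objective is noted in a comment rather than shown.

import torch.nn as nn

class EmbeddingAveragingStudent(nn.Module):
    # "Sparsely activated" in the sense that each input touches only the
    # embedding rows for its own n-grams, out of a very large table.
    def __init__(self, num_ngrams, dim, num_classes):
        super().__init__()
        # mode="mean" averages the looked-up n-gram embeddings per example.
        self.embed = nn.EmbeddingBag(num_ngrams, dim, mode="mean")
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, ngram_ids, offsets):
        # Training would minimize cross-entropy / KL against the teacher's
        # soft predictions, as in standard distillation.
        return self.classifier(self.embed(ngram_ids, offsets))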
Posted Content
On the Robustness of Reading Comprehension Models to Entity Renaming
TL;DR: The authors propose a general and scalable method for replacing person names with names drawn from a variety of sources, ranging from common English names to names from other languages to arbitrary strings, and find that training with such substitutions can further improve the robustness of machine reading comprehension (MRC) models.
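A toy version of the renaming perturbation is easy to state; the sketch below shows only whole-word name substitution and glosses over the paper's actual name sources and matching rules, so the function and example strings are purely illustrative.

import re

def rename_person(text, original_name, new_name):
    # Replace whole-word occurrences of a name; a real pipeline also needs
    # to handle surname-only mentions, possessives, and answer spans.
    return re.sub(r"\b" + re.escape(original_name) + r"\b", new_name, text)

passage = "Ada Lovelace wrote the first program. Lovelace worked with Babbage."
perturbed = rename_person(passage, "Ada Lovelace", "Zorna Qell")
perturbed = rename_person(perturbed, "Lovelace", "Qell")
# A robust reading-comprehension model should answer consistently on both.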
Posted Content
Think Before You Speak: Using Self-talk to Generate Implicit Commonsense Knowledge for Response Generation
Pei Zhou, Karthik Gopalakrishnan, Behnam Hedayatnia, Seokhwan Kim, Jay Pujara, Xiang Ren, Yang Liu, Dilek Hakkani-Tur, and 7 more
TL;DR: This article presents a self-talk approach in which a single generative model first generates implicit commonsense knowledge and then generates a response conditioned on that externalized knowledge; the approach is evaluated on three aspects: knowledge quality, knowledge-response connection, and response quality.
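The two-stage decoding in the TL;DR can be sketched with any off-the-shelf generative model. The snippet below uses Hugging Face transformers with gpt2 purely as a placeholder; the prompts, model choice, and decoding settings are assumptions for illustration, not the paper's setup.

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate(prompt, max_new_tokens=40):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True,
                         top_p=0.9, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

history = "A: I just adopted a puppy and it chews everything."
# Stage 1: the model "talks to itself" to externalize implicit commonsense.
knowledge = generate(history + "\nRelevant commonsense:")
# Stage 2: the same model conditions on that knowledge to produce the reply.
response = generate(history + "\nRelevant commonsense: " + knowledge + "\nB:")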
Proceedings Article
MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding
TL;DR: In this article, a modality-specific knowledge distillation (MSD) framework is proposed to transfer knowledge from a teacher on multimodal tasks by learning the teacher's behavior within each modality.