
Showing papers on "Question answering published in 2023"


Journal ArticleDOI
TL;DR: The authors provide an overview of the various formats and domains of current resources, highlighting lacunae for future work; they further discuss the current classifications of skills that question answering/reading comprehension systems are supposed to acquire and propose a new taxonomy.
Abstract: Alongside huge volumes of research on deep learning models in NLP in recent years, there has been much work on the benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with more than 80 new datasets appearing in the past 2 years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of “skills” that question answering/reading comprehension systems are supposed to acquire and propose a new taxonomy. The supplementary materials survey the current multilingual resources and monolingual resources for languages other than English, and we discuss the implications of overfocusing on English. The study is aimed both at practitioners looking for pointers to the wealth of existing data and at researchers working on new resources.

9 citations


Posted ContentDOI
31 Jan 2023
TL;DR: ChatGPT as mentioned in this paper is a language model developed by OpenAI that can be used for a variety of natural language processing tasks such as language translation, text summarization, and question answering.
Abstract: ChatGPT is a language model developed by OpenAI. It is a machine learning model that has been trained on a large dataset of human language, allowing it to generate human-like text. It can be used for a variety of natural language processing tasks such as language translation, text summarization, and question answering. In the current work we have discussed the application of ChatGPT in drug discovery.

9 citations


Journal ArticleDOI
TL;DR: In this article, a model based on Reward Integration and Policy Evaluation (RIPE) is proposed to evaluate the reasoning process by leveraging both terminal and instant rewards; the intermediate supervision for each reasoning hop is constructed with regard to both the fitness of the taken action and the evaluation of the unreasoned information remaining in the updated question embeddings.
Abstract: Among existing knowledge graph based question answering (KGQA) methods, relation supervision methods require labeled intermediate relations for stepwise reasoning. To avoid this enormous cost of labeling on large-scale knowledge graphs, weak supervision methods, which use only the answer entity to evaluate rewards as supervision, have been introduced. However, the lack of intermediate supervision raises the issue of sparse rewards, which may result in two types of incorrect reasoning paths: (1) incorrectly reasoned relations, even when the final answer entity may be correct; (2) correctly reasoned relations in a wrong order, which leads to an incorrect answer entity. To address these issues, this paper considers the multi-hop KGQA task as a Markov decision process and proposes a model based on Reward Integration and Policy Evaluation (RIPE). In this model, an integrated reward function is designed to evaluate the reasoning process by leveraging both terminal and instant rewards. The intermediate supervision for each single reasoning hop is constructed with regard to both the fitness of the taken action and the evaluation of the unreasoned information remaining in the updated question embeddings. In addition, to lead the agent to the answer entity along the correct reasoning path, an evaluation network is designed to evaluate the taken action in each hop. Extensive ablation studies and comparative experiments are conducted on four KGQA benchmark datasets. The results demonstrate that the proposed model outperforms the state-of-the-art approaches in terms of answering accuracy.
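
To make the reward design above concrete, the following minimal sketch combines per-hop instant rewards with a terminal reward in the spirit described; the discount factor, weighting scheme, and all function and field names are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of combining per-hop instant rewards with a terminal reward in
# a multi-hop KGQA agent. The weighting scheme, discount factor, and names below
# are illustrative assumptions, not the RIPE paper's exact formulation.
from dataclasses import dataclass
from typing import List

@dataclass
class HopOutcome:
    action_fitness: float      # how well the chosen relation fits the question (0..1)
    residual_question: float   # how much unanswered information remains (0..1)

def instant_reward(hop: HopOutcome) -> float:
    # Reward a hop for picking a fitting relation and for reducing the
    # unreasoned information left in the question embedding.
    return hop.action_fitness - hop.residual_question

def integrated_return(hops: List[HopOutcome], answer_correct: bool,
                      gamma: float = 0.9, terminal_weight: float = 1.0) -> float:
    # Discounted sum of instant rewards plus a terminal reward for reaching
    # the correct answer entity.
    ret = sum((gamma ** t) * instant_reward(h) for t, h in enumerate(hops))
    ret += terminal_weight * (1.0 if answer_correct else -1.0)
    return ret

# Example: two reasoning hops that end on the correct entity.
print(integrated_return([HopOutcome(0.8, 0.3), HopOutcome(0.9, 0.1)], True))
```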

8 citations


Journal ArticleDOI
30 May 2023-PeerJ
TL;DR: Zhang et al. as mentioned in this paper conducted a bibliometric analysis through CiteSpace and speculated that the attention mechanism has great development potential in cross-modal retrieval.
Abstract: Visual Question Answering (VQA) is a significant cross-disciplinary issue in the fields of computer vision and natural language processing that requires a computer to output a natural language answer given an image and a question posed about that image. This requires simultaneous processing of the multimodal fusion of text features and visual features, and the key component that can ensure its success is the attention mechanism. Introducing attention mechanisms makes it possible to better integrate text features and image features into a compact multi-modal representation. Therefore, it is necessary to clarify the development status of the attention mechanism, understand the most advanced attention mechanism methods, and look forward to its future development direction. In this article, we first conduct a bibliometric analysis of the field through CiteSpace, from which we find and reasonably speculate that the attention mechanism has great development potential in cross-modal retrieval. Secondly, we discuss the classification and application of existing attention mechanisms in VQA tasks, analyze their shortcomings, and summarize current improvement methods. Finally, through the continuous exploration of attention mechanisms, we believe that VQA will evolve in a smarter and more human direction.

8 citations


Proceedings ArticleDOI
27 Feb 2023
TL;DR: This tutorial introduces the key steps in integrating knowledge into NLP, including knowledge grounding from text, knowledge representation and fusing, and introduces recent state-of-the-art applications in fusing knowledge into language understanding, language generation and commonsense reasoning.
Abstract: Knowledge in natural language processing (NLP) has been a rising trend, especially after the advent of large-scale pre-trained models. NLP models with attention to knowledge can i) access an unlimited amount of external information; ii) delegate the task of storing knowledge from their parameter space to knowledge sources; iii) obtain up-to-date information; iv) make prediction results more explainable via selected knowledge. In this tutorial, we will introduce the key steps in integrating knowledge into NLP, including knowledge grounding from text, knowledge representation and fusing. In addition, we will introduce recent state-of-the-art applications in fusing knowledge into language understanding, language generation and commonsense reasoning.

7 citations


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper revisited bilinear attention networks (BANs) in the visual question answering task from a graph perspective and proposed two kinds of graphs, an image-graph and a question-graph, to model the context of the joint embeddings of words and objects.
Abstract: This article revisits the bilinear attention networks (BANs) in the visual question answering task from a graph perspective. The classical BANs build a bilinear attention map to extract the joint representation of words in the question and objects in the image but lack fully exploring the relationship between words for complex reasoning. In contrast, we develop bilinear graph networks to model the context of the joint embeddings of words and objects. Two kinds of graphs are investigated, namely, image-graph and question-graph. The image-graph transfers features of the detected objects to their related query words, enabling the output nodes to have both semantic and factual information. The question-graph exchanges information between these output nodes from image-graph to amplify the implicit yet important relationship between objects. These two kinds of graphs cooperate with each other, and thus, our resulting model can build the relationship and dependency between objects, which leads to the realization of multistep reasoning. Experimental results on the VQA v2.0 validation dataset demonstrate the ability of our method to handle complex questions. On the test-std set, our best single model achieves state-of-the-art performance, boosting the overall accuracy to 72.56%, and we are one of the top-two entries in the VQA Challenge 2020.
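
As a refresher on the bilinear attention map that these graph networks build upon, here is a minimal low-rank bilinear attention sketch in PyTorch; the feature dimensions, projection rank, and random inputs are illustrative assumptions, and the image-graph and question-graph modules themselves are omitted.

```python
# Minimal sketch of a classical bilinear attention map between question words
# and detected objects, the building block that bilinear graph networks extend.
# Shapes and the projection rank are illustrative assumptions.
import torch
import torch.nn.functional as F

n_words, n_objs, d_q, d_v, rank = 14, 36, 768, 2048, 512

words = torch.randn(1, n_words, d_q)    # question word features
objects = torch.randn(1, n_objs, d_v)   # detected object features

U = torch.nn.Linear(d_q, rank, bias=False)
V = torch.nn.Linear(d_v, rank, bias=False)

# Low-rank bilinear interaction: (B, n_words, rank) x (B, rank, n_objs)
logits = U(words) @ V(objects).transpose(1, 2)
attention_map = F.softmax(logits.view(1, -1), dim=-1).view(1, n_words, n_objs)

# Joint representation: attention-weighted sum of object features per word.
joint = attention_map @ objects          # (1, n_words, d_v)
print(joint.shape)
```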

6 citations


Journal ArticleDOI
TL;DR: In this article, a transformer encoder-decoder architecture is proposed that extracts image features using the vision transformer (ViT) model, embeds the question using a textual encoder transformer, concatenates the resulting visual and textual representations, and feeds them into a multi-modal decoder to generate the answer in an autoregressive way.
Abstract: In the clinical and healthcare domains, medical images play a critical role. A mature medical visual question answering system (VQA) can improve diagnosis by answering clinical questions presented with a medical image. Despite its enormous potential in the healthcare industry and services, this technology is still in its infancy and is far from practical use. This paper introduces an approach based on a transformer encoder–decoder architecture. Specifically, we extract image features using the vision transformer (ViT) model, and we embed the question using a textual encoder transformer. Then, we concatenate the resulting visual and textual representations and feed them into a multi-modal decoder for generating the answer in an autoregressive way. In the experiments, we validate the proposed model on two VQA datasets for radiology images termed VQA-RAD and PathVQA. The model shows promising results compared to existing solutions. It yields closed and open accuracies of 84.99% and 72.97%, respectively, for VQA-RAD, and 83.86% and 62.37%, respectively, for PathVQA. Other metrics such as the BLEU score, showing the alignment between the predicted and true answer sentences, are also reported.
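
A minimal PyTorch sketch of the fusion scheme described above, in which a decoder cross-attends to the concatenated visual and textual representations and predicts answer tokens autoregressively; the dimensions, vocabulary size, and random tensors standing in for the ViT and question-encoder outputs are assumptions, not the paper's configuration.

```python
# Sketch of the described fusion: concatenate visual tokens (e.g. from a ViT)
# with encoded question tokens and let an autoregressive transformer decoder
# cross-attend to them to generate the answer. Dimensions, vocabulary size, and
# the random stand-in tensors are assumptions.
import torch
import torch.nn as nn

d_model, vocab = 512, 30000

class VQADecoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.answer_embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, visual_tokens, text_tokens, answer_ids):
        # Multimodal memory = concatenated visual and textual representations.
        memory = torch.cat([visual_tokens, text_tokens], dim=1)
        tgt = self.answer_embed(answer_ids)
        L = answer_ids.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)          # next-token logits for the answer

# Stand-ins for ViT patch embeddings and question-encoder outputs.
visual = torch.randn(2, 197, d_model)
text = torch.randn(2, 20, d_model)
answer_prefix = torch.randint(0, vocab, (2, 5))
print(VQADecoderSketch()(visual, text, answer_prefix).shape)   # (2, 5, 30000)
```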

4 citations


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed an interpretable transformer-based Path-VQA, where they embed transformers' encoder layers with vision (image) features extracted using a CNN and language (question) features extracted using CNNs and a domain-specific language model (LM).
Abstract: Pathology visual question answering (PathVQA) attempts to correctly answer medical questions presented with pathology images. Despite its great prospects in healthcare, the technology is still in its early stages with low overall accuracy. This is because it requires both high- and low-level interactions on both the image (vision) and question (language) to generate an answer. Existing methods focused on treating vision and language features independently, which cannot capture these high- and low-level interactions. Further, these methods failed to interpret retrieved answers, which are obscure to humans. Model interpretability to justify the retrieved answers has remained largely unexplored and has become important to engender users' trust in the retrieved answer by providing insight into the model prediction. Motivated by these gaps, we introduce an interpretable transformer-based Path-VQA (TraP-VQA), where we embed transformers' encoder layers with vision (image) features extracted using a CNN and language (question) features extracted using CNNs and a domain-specific language model (LM). A decoder layer of the transformer is then embedded to upsample the encoded features for the final prediction for PathVQA. Our experiments showed that TraP-VQA outperformed state-of-the-art comparative methods on the public PathVQA dataset. Further, our ablation study presents the capability of each component of our transformer-based vision-language model. Finally, we demonstrate the interpretability of TraP-VQA by presenting visualization results of both the text and images used to explain the reason for a retrieved answer in PathVQA.

4 citations


Journal ArticleDOI
TL;DR: The authors proposed a graph-enhanced multihop query-focused summarizer (GMQS) for nonfactoid question answering, which leverages graph-enhanced reasoning techniques to elaborate the multihop inference process in nonfactoid QA.
Abstract: Nonfactoid question answering (QA) is one of the most extensive yet challenging applications and research areas in natural language processing (NLP). Existing methods fall short of handling the long-distance and complex semantic relations between the question and the document sentences. In this work, we propose a novel query-focused summarization method, namely a graph-enhanced multihop query-focused summarizer (GMQS), to tackle the nonfactoid QA problem. Specifically, we leverage graph-enhanced reasoning techniques to elaborate the multihop inference process in nonfactoid QA. Three types of graphs with different semantic relations, namely semantic relevance, topical coherence, and coreference linking, are constructed for explicitly capturing the question-document and sentence-sentence interrelationships. Relational graph attention network (RGAT) is then developed to aggregate the multirelational information accordingly. In addition, the proposed method can be adapted to both extractive and abstractive applications as well as be mutually enhanced by joint learning. Experimental results show that the proposed method consistently outperforms both existing extractive and abstractive methods on two nonfactoid QA datasets, WikiHow and PubMedQA, and possesses the capability of performing explainable multihop reasoning.

4 citations


Journal ArticleDOI
01 Feb 2023
TL;DR: This article proposed to use a text-only training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models for visual question answering (VQA).
Abstract: Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. Our results on a visual question answering task which requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters. In contrast, our model is less effective on a standard VQA task (VQA 2.0), confirming that our text-only method is especially effective for tasks requiring external knowledge. In addition, we show that increasing the language model's size notably improves its performance, yielding results comparable to the state-of-the-art with our largest model and significantly outperforming current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
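
The pipeline below sketches the caption-then-ask idea with off-the-shelf Hugging Face pipelines; the specific checkpoints, the prompt format, and the decoding settings are assumptions for illustration and are not the captioner or language models evaluated in the paper.

```python
# Sketch of the unimodal idea: caption the image with an off-the-shelf model,
# then ask a text-only language model the question about the caption. The
# pipeline tasks and checkpoint names are illustrative assumptions.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
qa_lm = pipeline("text-generation", model="gpt2")

def text_only_vqa(image_path: str, question: str) -> str:
    caption = captioner(image_path)[0]["generated_text"]
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    out = qa_lm(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()

# Example (hypothetical image file):
# print(text_only_vqa("kitchen.jpg", "What appliance is on the counter?"))
```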

4 citations


Proceedings ArticleDOI
30 Apr 2023
TL;DR: In this paper, a web-based question answering system with multimodal fusion of unstructured and structured information is proposed to fill in missing information for knowledge bases, which can achieve good quality with very few questions.
Abstract: Over recent years, large knowledge bases have been constructed to store massive knowledge graphs. However, these knowledge graphs are highly incomplete. To solve this problem, we propose a web-based question answering system with multimodal fusion of unstructured and structured information, to fill in missing information for knowledge bases. To utilize unstructured information from the Web for knowledge graph construction, we design multimodal features and question templates to extract missing facts, which can achieve good quality with very few questions. The question answering system also employs structured information from knowledge bases, such as entity types and entity-to-entity relatedness, to help improve extraction quality. To improve system efficiency, we utilize a few query-driven techniques for web-based question answering to reduce the runtime and provide fast responses to user queries. Extensive experiments have been conducted to demonstrate the effectiveness and efficiency of our system.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a heterogeneous community detection approach based on a graph neural network, called HCDBG, to detect heterogeneous communities in community question answering platforms.


Journal ArticleDOI
TL;DR: This article proposed a relation-aware fine-grained reasoning (RAFR) network that performs fine-grained reasoning over the nodes of relation-based diagram graphs and applies graph attention networks to learn diagram representations.
Abstract: Textbook question answering (TQA) is a task in which one should answer non-diagram and diagram questions accurately, given a large context consisting of abundant diagrams and essays. Although many studies have made significant progress in natural image question answering (QA), they are not applicable to comprehending diagrams and reasoning over the long multimodal context. To address the above issues, we propose a relation-aware fine-grained reasoning (RAFR) network that performs fine-grained reasoning over the nodes of relation-based diagram graphs. Our method uses semantic dependencies and relative positions between nodes in the diagram to construct relation graphs and applies graph attention networks to learn diagram representations. To extract and reason over the multimodal knowledge, we first extract the text that is most relevant to the questions and options, and the instructional diagram that is most relevant to the question diagrams, at the word-sentence level and the node-diagram level, respectively. Then, we apply instructional-diagram-guided attention and question-guided attention to reason over the nodes of question diagrams. The experimental results show that our proposed method achieves the best performance on the TQA dataset compared with baselines. We also conduct extensive ablation studies to comprehensively analyze the proposed method.

Journal ArticleDOI
TL;DR: In this article, a new Patient-oriented Visual Question Answering (P-VQA) dataset is introduced, which builds a VQA system for patients by covering an entire treatment process including medical consultation, imaging diagnosis, clinical diagnosis, treatment advice, review, etc.
Abstract: Visual Question Answering (VQA) systems have achieved great success in general scenarios. In the medical domain, VQA systems are still in their infancy as the datasets are limited by scale and application scenarios. Current medical VQA datasets are designed to conduct basic analyses of medical imaging such as modalities, planes, organ systems, abnormalities, etc., aiming to provide constructive medical suggestions for doctors; they contain a large number of professional terms and are of limited help to patients. In this paper, we introduce a new Patient-oriented Visual Question Answering (P-VQA) dataset, which builds a VQA system for patients by covering an entire treatment process including medical consultation, imaging diagnosis, clinical diagnosis, treatment advice, review, etc. P-VQA covers 20 common diseases with 2,169 medical images, 24,800 question-answering pairs, and a medical knowledge graph containing 419 entities. In terms of methodology, we propose a Medical Knowledge-based VQA Network (MKBN) to answer questions according to the images and a medical knowledge graph in our P-VQA. MKBN learns two cluster embeddings (disease-related and relation-related embeddings) according to structural characteristics of the medical knowledge graph and learns three different interactive features (image-question, image-disease, and question-relation) according to characteristics of diagnosis. For comparison, we evaluate several state-of-the-art baselines on the P-VQA dataset as benchmarks. Experimental results on P-VQA demonstrate that MKBN achieves the state-of-the-art performance compared with baseline methods. The dataset is available at https://github.com/cs-jerhuang/P-VQA.


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed an end-to-end text reading and reasoning network in which the downstream VQA signal contributes to the optimization of text reading by predicting semantic features from the visual information of texts, so that texts can be reasonably understood even without accurate recognition.


Journal ArticleDOI
TL;DR: In this paper, the classification rate of various state-of-the-art transformer-based models (e.g., BERT and FNET) on the task of gender identification across community question-answering fellows was assessed.
Abstract: Promoting engagement and participation is vital for online social networks such as community Question-Answering (cQA) sites. One way of increasing the contribution of their members is by connecting their content with the right target audience. To achieve this goal, demographic analysis is pivotal in deciphering the interest of each community fellow. Indeed, demographic factors such as gender are fundamental in reducing the gender disparity across distinct topics. This work assesses the classification rate of assorted state-of-the-art transformer-based models (e.g., BERT and FNET) on the task of gender identification across cQA fellows. For this purpose, it benefited from a massive text-oriented corpus encompassing 548,375 member profiles including their respective full-questions, answers and self-descriptions. This assisted in conducting large-scale experiments considering distinct combinations of encoders and sources. Contrary to our initial intuition, in average terms, self-descriptions were detrimental due to their sparseness. In effect, the best transformer models achieved an AUC of 0.92 by taking full-questions and answers into account (i.e., DeBERTa and MobileBERT). Our qualitative results reveal that fine-tuning on user-generated content is affected by pre-training on clean corpora, and that this adverse effect can be mitigated by correcting the case of words.

Book ChapterDOI
TL;DR: The authors investigated how a closed-book question answering system can be improved by fine-tuning the LLM for the downstream task, prompt engineering, and data preprocessing, using publicly available autoregressive language models such as GPT-Neo, CodeGen, and PanGu-Coder.
Abstract: We study the ability of pretrained large language models (LLM) to answer questions from online question answering fora such as Stack Overflow. We consider question-answer pairs where the main part of the answer consists of source code. On two benchmark datasets—CoNaLa and a newly collected dataset based on Stack Overflow—we investigate how a closed-book question answering system can be improved by fine-tuning the LLM for the downstream task, prompt engineering, and data preprocessing. We use publicly available autoregressive language models such as GPT-Neo, CodeGen, and PanGu-Coder, and after the proposed fine-tuning achieve a BLEU score of 0.4432 on the CoNaLa test set, significantly exceeding previous state of the art for this task.
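
For context on the reported metric, the snippet below shows one common way to compute corpus-level BLEU over generated code answers with sacrebleu; the toy hypothesis/reference pair is invented, and the exact BLEU variant used in the paper may differ.

```python
# One common way to score generated code answers with corpus-level BLEU, the
# metric reported for CoNaLa. The toy hypothesis/reference pair is invented;
# in practice these would be model outputs and gold Stack Overflow answers.
from sacrebleu.metrics import BLEU

hypotheses = ["df.sort_values(by='col', ascending=False)"]
references = [["df.sort_values('col', ascending=False)"]]  # one reference stream, aligned with the hypotheses

print(BLEU().corpus_score(hypotheses, references).score)
```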

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors presented a structured methodology for Chinese medical knowledge-based question answering (cMed-KBQA) based on the cognitive science dual-systems theory, synchronizing an observation stage and an expressive reasoning stage.
Abstract: Chinese medical knowledge-based question answering (cMed-KBQA) is a vital component of the intelligence question-answering assignment. Its purpose is to enable the model to comprehend questions and then deduce the proper answer from the knowledge base. Previous methods solely considered how questions and knowledge base paths were represented, disregarding their significance. Due to entity and path sparsity, the performance of question and answer cannot be effectively enhanced. To address this challenge, this paper presents a structured methodology for the cMed-KBQA based on the cognitive science dual systems theory by synchronizing an observation stage (System 1) and an expressive reasoning stage (System 2). System 1 learns the question's representation and queries the associated simple path. Then System 2 retrieves complicated paths for the question from the knowledge base by using the simple path provided by System 1. Specifically, System 1 is implemented by the entity extraction module, entity linking module, simple path retrieval module, and simple path-matching model. Meanwhile, System 2 is performed by using the complex path retrieval module and complex path-matching model. The public CKBQA2019 and CKBQA2020 datasets were extensively studied to evaluate the suggested technique. Using the metric average F1-score, our model achieved 78.12% on CKBQA2019 and 86.60% on CKBQA2020.


Book ChapterDOI
01 Jan 2023
TL;DR: Visconde as discussed by the authors decomposes the question into simpler questions using a few-shot large language model (LLM) and then uses a search engine to retrieve candidate passages from a large collection for each decomposed question.
Abstract: This paper proposes a question-answering system that can answer questions whose supporting evidence is spread over multiple (potentially long) documents. The system, called Visconde, uses a three-step pipeline to perform the task: decompose, retrieve, and aggregate. The first step decomposes the question into simpler questions using a few-shot large language model (LLM). Then, a state-of-the-art search engine is used to retrieve candidate passages from a large collection for each decomposed question. In the final step, we use the LLM in a few-shot setting to aggregate the contents of the passages into the final answer. The system is evaluated on three datasets: IIRC, Qasper, and StrategyQA. Results suggest that current retrievers are the main bottleneck and that readers are already performing at the human level as long as relevant passages are provided. The system is also shown to be more effective when the model is induced to give explanations before answering a question. Code is available at https://github.com/neuralmind-ai/visconde .
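
The three-step structure can be summarized in a short sketch; the llm() and search() functions below are placeholders for the few-shot language model and the search engine Visconde relies on, and the prompt wording is an assumption.

```python
# Sketch of the decompose-retrieve-aggregate pipeline. llm() and search() are
# placeholders standing in for the few-shot LLM and the search engine used by
# Visconde; their prompts and behavior here are assumptions.
from typing import List

def llm(prompt: str) -> str:
    # Placeholder for a few-shot large language model call.
    raise NotImplementedError

def search(query: str, k: int = 3) -> List[str]:
    # Placeholder for a passage retriever over the document collection.
    raise NotImplementedError

def answer(question: str) -> str:
    # 1) Decompose the question into simpler sub-questions.
    subs = llm(f"Decompose into simpler questions:\n{question}").splitlines()
    # 2) Retrieve candidate passages for each sub-question.
    passages = [p for sq in subs for p in search(sq)]
    # 3) Aggregate: ask the LLM to reason over the passages, explaining first.
    context = "\n".join(passages)
    return llm(f"Passages:\n{context}\n\nQuestion: {question}\n"
               "Explain your reasoning, then give the final answer.")
```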

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a model named Sentiment-enhanced Answer Generation and product Descriptions Fusing (SAGDF) for the product-related question answering task.

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper analyzed location-related questions in MS MARCO based on semantic similarity, grouped similar questions into clusters, and utilized the results to discover the users' interests in the geographic domain.
Abstract: Recently, open-domain question-answering systems have achieved tremendous progress because of developments in large language models (LLMs), and have successfully been applied to question-answering (QA) systems, or chatbots. However, there has been little progress in open-domain question answering in the geographic domain. Existing open-domain question-answering research in the geographic domain relies heavily on rule-based semantic parsing approaches with little data. To develop intelligent GeoQA agents, it is crucial to build QA systems upon datasets that reflect real users’ needs regarding the geographic domain. Existing studies have analyzed geographic questions from the Microsoft MAchine Reading COmprehension (MS MARCO) corpus, which comprises real-world user queries from Bing, in terms of structural similarity, which does not reveal the users’ interests. Therefore, we aimed to analyze location-related questions in MS MARCO based on semantic similarity, group similar questions into clusters, and utilize the results to discover the users’ interests in the geographic domain. Using a sentence-embedding-based topic modeling approach to cluster semantically similar questions, we successfully obtained topic models that could gather semantically similar documents into a single cluster. Furthermore, we successfully discovered latent topics within a large collection of questions to guide practical GeoQA systems on relevant questions.
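
A minimal sketch of grouping semantically similar questions with sentence embeddings is shown below; the embedding checkpoint, the cluster count, and the use of plain k-means instead of the paper's sentence-embedding-based topic modeling are assumptions.

```python
# Sketch of grouping semantically similar questions via sentence embeddings and
# clustering. The checkpoint name, cluster count, and use of plain k-means
# (rather than the paper's topic-modeling approach) are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

questions = [
    "what is the population of seattle",
    "how many people live in seattle",
    "distance from paris to london",
    "how far is london from paris",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(questions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for question, cluster in zip(questions, labels):
    print(cluster, question)
```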

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a depth and segmentation-based visual attention mechanism for embodied question answering (EQA), which can extract local semantic features by introducing a novel high-speed video segmentation framework.
Abstract: Embodied Question Answering (EQA) is a newly defined research area where an agent is required to answer the user's questions by exploring the real-world environment. It has attracted increasing research interest due to its broad applications in personal assistants and in-home robots. Most of the existing methods perform poorly in terms of answering and navigation accuracy due to the absence of fine-level semantic information, stability against ambiguity, and 3D spatial information of the virtual environment. To tackle these problems, we propose a depth and segmentation based visual attention mechanism for Embodied Question Answering. First, we extract local semantic features by introducing a novel high-speed video segmentation framework. Then, guided by the extracted semantic features, a depth and segmentation based visual attention mechanism is proposed for the Visual Question Answering (VQA) sub-task. Further, a feature fusion strategy is designed to guide the navigator's training process without much additional computational cost. The ablation experiments show that our method effectively boosts the performance of the VQA module and navigation module, leading to 4.9% and 5.6% overall improvements in EQA accuracy on the House3D and Matterport3D datasets, respectively.

Journal ArticleDOI
TL;DR: In this paper, a robust end-to-end approach that can improve the efficiency and effectiveness of retrieving queries related to mineral exploration terms is proposed, where the Bidirectional Encoder Representations from Transformers (BERT) model is trained to test the answers generated from the user input question.

Book ChapterDOI
01 Jan 2023
TL;DR: In this paper, the authors address the task of open-domain health question answering (QA), where they use PubMed and Wikipedia as trustworthy document collections to retrieve evidence and pass the questions and retrieved passages to off-the-shelf question answering models, whose predictions are then aggregated into a final score.
Abstract: In this paper, we address the task of open-domain health question answering (QA). The quality of existing QA systems heavily depends on annotated data that is often difficult to obtain, especially in the medical domain. To tackle this issue, we opt for PubMed and Wikipedia as trustworthy document collections from which to retrieve evidence. The questions and retrieved passages are passed to off-the-shelf question answering models, whose predictions are then aggregated into a final score. Thus, our proposed approach is highly data-efficient. Evaluation on 113 health-related yes/no question and answer pairs demonstrates good performance, achieving an AUC of 0.82.
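
A minimal sketch of this data-efficient recipe, running an off-the-shelf extractive QA model over retrieved passages and averaging its confidence scores into one value; the checkpoint, the toy passages, and the mean-score aggregation rule are assumptions rather than the paper's exact setup.

```python
# Sketch: run an off-the-shelf QA model over passages retrieved for the question
# and aggregate the per-passage confidence scores. The checkpoint, toy passages,
# and mean-score aggregation are assumptions, not the paper's exact setup.
from statistics import mean
from typing import List
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def score_question(question: str, passages: List[str]) -> float:
    preds = [qa(question=question, context=p) for p in passages]
    return mean(p["score"] for p in preds)   # higher = more supporting evidence

passages = [
    "Regular physical activity reduces the risk of cardiovascular disease.",
    "Exercise is associated with lower blood pressure in adults.",
]
print(score_question("Does exercise reduce the risk of heart disease?", passages))
```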

Journal ArticleDOI
TL;DR: In this article, the authors proposed the first VAQA dataset, which consists of almost 138k Image-Question-Answer (IQA) triplets and is specialized in yes/no questions about real-world images.
Abstract: Visual Question Answering (VQA) is the problem of automatically answering a natural language question about a given image or video. Standard Arabic is the sixth most spoken language around the world. However, to the best of our knowledge, there are neither research attempts nor datasets for VQA in Arabic. In this paper, we generate the first Visual Arabic Question Answering (VAQA) dataset, which is fully automatically generated. The dataset consists of almost 138k Image-Question-Answer (IQA) triplets and is specialized in yes/no questions about real-world images. A novel database schema and an IQA ground-truth generation algorithm are specially designed to facilitate automatic VAQA dataset creation. We propose the first Arabic-VQA system, where the VQA task is formulated as a binary classification problem. The proposed system consists of five modules, namely visual feature extraction, question pre-processing, textual feature extraction, feature fusion, and answer prediction. Since this is the first research on VQA in Arabic, we investigate several approaches in the question channel to identify the most effective approaches for Arabic question pre-processing and representation. For this purpose, 24 Arabic-VQA models are developed, in which two question-tokenization approaches, three word-embedding algorithms, and four LSTM networks with different architectures are investigated. A comprehensive performance comparison is conducted between all these Arabic-VQA models on the VAQA dataset. Experiments indicate that the performance of all Arabic-VQA models ranges from 80.8 to 84.9%, and that Arabic-specific question pre-processing approaches, namely treating the question tool "Image missing" as a special case to be separated and embedding the question words using fine-tuned Word2Vec models from AraVec2.0, significantly improve the performance. The best-performing model is the one that treats the question tool "Image missing" as a separate token, embeds the question words using the AraVec2.0 Skip-Gram model, and extracts the textual feature using a one-layer unidirectional LSTM. Further, our best Arabic-VQA model is compared with related VQA models developed on other popular VQA datasets in a different natural language, considering their performance only on yes/no questions according to the scope of this paper, and shows a very comparable performance.
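
The question channel of the best-performing configuration can be sketched as follows; the randomly initialized embedding matrix stands in for pretrained AraVec2.0 Skip-Gram vectors, and the vocabulary size and dimensions are illustrative assumptions.

```python
# Sketch of the question channel in the best-performing configuration: tokens
# are embedded and a one-layer unidirectional LSTM produces the textual feature.
# The random embedding matrix stands in for pretrained AraVec2.0 Skip-Gram
# vectors, and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 5000, 300, 512

embedding = nn.Embedding(vocab_size, embed_dim)       # would be loaded from AraVec2.0
lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)

question_ids = torch.randint(0, vocab_size, (1, 12))  # a tokenized yes/no question
_, (h_n, _) = lstm(embedding(question_ids))
question_feature = h_n[-1]                             # (1, hidden_dim) textual feature
print(question_feature.shape)
```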

Journal ArticleDOI
TL;DR: The authors proposed an internal knowledge-based end-to-end model, enhanced by an attentive memory network, for both answer selection and answer generation tasks, taking full advantage of the semantics and multifacts (i.e., timescales, topics, and context).
Abstract: Open-domain question answering systems enable a machine to automatically select and generate answers for questions posed by humans in natural language form on the web. Previous approaches seek effective ways of extracting the semantic features between question and answer, but the effects of contextual information in semantic matching are still limited by short-term memory. As an alternative, we propose an internal knowledge-based end-to-end model, enhanced by an attentive memory network, for both answer selection and answer generation tasks, considering the full advantages of the semantics and multifacts (i.e., timescales, topics, and context). In detail, we design a long-term memory to learn the top-k fine-grained similarity representations, where two memory-aware mechanisms aggregate the series of semantic word-level and sentence-level similarities to support coarse contextual information. Furthermore, we propose a novel memory refinement mechanism with two dimensions of writing heads that offers an efficient approach to multiview selection of the salient word pairs. In the training stage, we adopt a transformer-based transfer learning technique to effectively pretrain the model. Experimentally, we compare against state-of-the-art approaches on four public datasets; the results show that the proposed model achieves competitive performance.