
[Paper Recommendations] 7 Recent Visual Question Answering (VQA) Papers: Explanations, Read-Write Memory Networks, Inverse VQA, Visual Reasoning, Interpretability, Attention Mechanisms, and Counting

[Introduction] The Zhuanzhi content team has compiled seven recent papers on Visual Question Answering (VQA) and introduces them below. Enjoy!

1. VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

Authors: Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, Jiebo Luo

Affiliations: University of Science and Technology of China, Nanyang Technological University, University of Rochester

Abstract: Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question and answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the computational models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We have conducted a user study to validate the quality of explanations synthesized by our method. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.

Source: arXiv, March 20, 2018

URL:

http://www.zhuanzhi.ai/document/f39b8adecd703b04ad2dd62e94427325
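
To make the multi-task formulation concrete, here is a minimal PyTorch sketch in which a shared image-question embedding feeds both an answer classifier and an explanation decoder. All names, layer sizes, the GRU decoder, and the loss weight alpha are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQAESketch(nn.Module):
    # A minimal multi-task sketch, NOT the paper's exact model: a shared
    # image-question embedding feeds an answer head and an explanation decoder.
    def __init__(self, img_dim=2048, q_dim=1024, hid=512,
                 n_answers=3000, vocab=10000):
        super().__init__()
        self.fuse = nn.Linear(img_dim + q_dim, hid)        # joint embedding
        self.answer_head = nn.Linear(hid, n_answers)       # answer classifier
        self.word_emb = nn.Embedding(vocab, hid)
        self.decoder = nn.GRU(hid, hid, batch_first=True)  # explanation decoder
        self.word_out = nn.Linear(hid, vocab)

    def forward(self, img_feat, q_feat, expl_tokens):
        h = torch.tanh(self.fuse(torch.cat([img_feat, q_feat], dim=-1)))
        answer_logits = self.answer_head(h)
        # condition the decoder on the joint embedding via its initial state
        out, _ = self.decoder(self.word_emb(expl_tokens), h.unsqueeze(0))
        return answer_logits, self.word_out(out)

def vqa_e_loss(answer_logits, expl_logits, answer_tgt, expl_tgt, alpha=1.0):
    # joint objective: answer classification + explanation generation;
    # alpha (assumed) trades the two tasks off against each other
    return F.cross_entropy(answer_logits, answer_tgt) + alpha * F.cross_entropy(
        expl_logits.flatten(0, 1), expl_tgt.flatten())
```

In a setup like this, the explanation loss acts as extra supervision on the shared embedding, which is one way the abstract's reported gain in answer accuracy can arise.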

2. A Read-Write Memory Network for Movie Story Understanding

Authors: Seil Na, Sangho Lee, Jisung Kim, Gunhee Kim

Affiliation: Seoul National University

Abstract: We propose a novel memory network model named Read-Write Memory Network (RWMN) to perform question and answering tasks for large-scale, multimodal movie story understanding. The key focus of our RWMN model is to design the read network and the write network that consist of multiple convolutional layers, which enable memory read and write operations to have high capacity and flexibility. While existing memory-augmented network models treat each memory slot as an independent block, our use of multi-layered CNNs allows the model to read and write sequential memory cells as chunks, which is more reasonable to represent a sequential story because adjacent memory blocks often have strong correlations. For evaluation, we apply our model to all the six tasks of the MovieQA benchmark, and achieve the best accuracies on several tasks, especially on the visual QA task. Our model shows a potential to better understand not only the content in the story, but also more abstract information, such as relationships between characters and the reasons for their actions.

Source: arXiv, March 16, 2018

URL:

http://www.zhuanzhi.ai/document/bce2b92d8c8684b5308fbf6b7b39f25f
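
The key idea in the abstract, reading and writing adjacent memory slots as chunks through multi-layer CNNs, can be sketched as below. The convolution depths, strides, and the query-fusion step are assumptions for illustration, not the paper's exact read/write networks.

```python
import torch
import torch.nn as nn

class RWMNSketch(nn.Module):
    # Chunk-wise memory access via 1-D convolutions (dims/strides assumed):
    # unlike slot-independent memory networks, each convolution mixes
    # neighbouring story slots, which tend to be strongly correlated.
    def __init__(self, slot_dim=512):
        super().__init__()
        # write network: abstract adjacent scene embeddings into chunks
        self.write = nn.Sequential(
            nn.Conv1d(slot_dim, slot_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(slot_dim, slot_dim, kernel_size=3, stride=2, padding=1),
        )
        # read network: another convolution over the written memory
        self.read = nn.Conv1d(slot_dim, slot_dim, kernel_size=3, padding=1)

    def forward(self, story_slots, query):
        # story_slots: (B, T, D) scene embeddings; query: (B, D) question
        memory = self.read(self.write(story_slots.transpose(1, 2)))  # (B, D, T')
        att = torch.softmax((memory * query.unsqueeze(-1)).sum(1), dim=-1)
        return (memory * att.unsqueeze(1)).sum(-1)  # (B, D) answer feature
```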

3. iVQA: Inverse Visual Question Answering

Authors: Feng Liu, Tao Xiang, Timothy M. Hospedales, Wankou Yang, Changyin Sun

Affiliations: Southeast University, Queen Mary University of London, University of Edinburgh

Abstract: We propose the inverse problem of Visual question answering (iVQA), and explore its suitability as a benchmark for visuo-linguistic understanding. The iVQA task is to generate a question that corresponds to a given image and answer pair. Since the answers are less informative than the questions, and the questions have less learnable bias, an iVQA model needs to better understand the image than a VQA model in order to be successful. We pose question generation as a multi-modal dynamic inference process and propose an iVQA model that can gradually adjust its focus of attention guided by both a partially generated question and the answer. For evaluation, apart from existing linguistic metrics, we propose a new ranking metric. This metric compares the ground truth question's rank among a list of distractors, which allows the drawbacks of different algorithms and sources of error to be studied. Experimental results show that our model can generate diverse, grammatically correct and content correlated questions that match the given answer.

Source: arXiv, March 16, 2018

URL:

http://www.zhuanzhi.ai/document/92ea4a9253cf26e085bbee1374040be6
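
A single step of the described dynamic inference might look like the sketch below, where attention over image regions is guided jointly by the answer embedding and the partially generated question (the decoder state). All dimensions and module choices are assumed for illustration.

```python
import torch
import torch.nn as nn

class IVQAStep(nn.Module):
    # One decoding step of an iVQA-style question generator (a sketch with
    # assumed dimensions): the answer embedding and the current decoder state
    # jointly steer attention over image regions.
    def __init__(self, region_dim=2048, hid=512):
        super().__init__()
        self.att = nn.Linear(region_dim + hid + hid, 1)
        self.cell = nn.GRUCell(region_dim, hid)

    def forward(self, regions, answer_emb, h):
        # regions: (B, K, region_dim); answer_emb, h: (B, hid)
        B, K, _ = regions.shape
        guide = torch.cat([answer_emb, h], dim=-1)
        scores = self.att(torch.cat(
            [regions, guide.unsqueeze(1).expand(B, K, -1)], dim=-1))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)        # (B, K)
        context = (alpha.unsqueeze(-1) * regions).sum(1)         # attended image
        return self.cell(context, h), alpha  # next state + inspectable attention
```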

4. A dataset and architecture for visual reasoning with a working memory

Authors: Guangyu Robert Yang, Igor Ganichev, Xiao-Jing Wang, Jonathon Shlens, David Sussillo

Affiliations: Google Brain, New York University, Columbia University

Abstract: A vexing problem in artificial intelligence is reasoning about events that occur in complex, changing visual stimuli such as in video analysis or game play. Inspired by a rich tradition of visual reasoning and memory in cognitive psychology and neuroscience, we developed an artificial, configurable visual question and answer dataset (COG) to parallel experiments in humans and animals. COG is much simpler than the general problem of video analysis, yet it addresses many of the problems relating to visual and logical reasoning and memory -- problems that remain challenging for modern deep learning architectures. We additionally propose a deep learning architecture that performs competitively on other diagnostic VQA datasets (i.e. CLEVR) as well as easy settings of the COG dataset. However, several settings of COG result in datasets that are progressively more challenging to learn. After training, the network can zero-shot generalize to many new tasks. Preliminary analyses of the network architectures trained on COG demonstrate that the network accomplishes the task in a manner interpretable to humans.

Source: arXiv, March 16, 2018

URL:

http://www.zhuanzhi.ai/document/581fcd7b86474896a801300e4f46ef78
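
As a rough, heavily simplified illustration of instruction-conditioned working memory (the paper's actual network is considerably more elaborate and is not specified in the abstract), a recurrent controller can carry information across frames:

```python
import torch
import torch.nn as nn

class WorkingMemorySketch(nn.Module):
    # Illustration only, not the COG paper's architecture: a recurrent
    # controller maintains state (a working memory) across video frames
    # while conditioned on the embedded task instruction.
    def __init__(self, feat_dim=512, instr_dim=128, hid=256, n_out=64):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim + instr_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, n_out)

    def forward(self, frame_feats, instr):
        # frame_feats: (B, T, feat_dim) per-frame CNN features
        # instr: (B, instr_dim) embedded task instruction
        T = frame_feats.size(1)
        x = torch.cat([frame_feats,
                       instr.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        memory, _ = self.rnn(x)          # state persists across frames
        return self.out(memory[:, -1])   # decision after the final frame
```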

5. Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

Authors: David Mascharka, Philip Tran, Ryan Soklaski, Arjun Majumdar

Affiliations: MIT Lincoln Laboratory, Planck Aerosystems

Abstract: Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives' outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, we show that these primitives are highly performant, achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show that our model is able to effectively learn generalized representations when provided a small amount of data containing novel object attributes. Using the CoGenT generalization task, we show more than a 20 percentage point improvement over the current state of the art.

Source: arXiv, March 14, 2018

URL:

http://www.zhuanzhi.ai/document/687e4e6f8dff600d060ff4c188eb566e
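
The idea of composable, explicitly inspectable visual-reasoning primitives can be sketched as attention modules that pass spatial masks to one another. The layer sizes and the single generic primitive below are assumptions; TbD-net itself defines several distinct module types.

```python
import torch
import torch.nn as nn

class AttentionPrimitive(nn.Module):
    # Sketch of a TbD-style primitive (assumed layers): it refines a spatial
    # attention mask over image features, and the masks emitted by each
    # primitive in the chain can be visualized to trace the reasoning.
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 1))  # one-channel attention mask

    def forward(self, feats, prev_mask):
        # feats: (B, C, H, W); prev_mask: (B, 1, H, W) with values in [0, 1]
        return torch.sigmoid(self.conv(feats * prev_mask))

# composing primitives, e.g. attend("red") then attend("sphere");
# the intermediate mask is itself a human-readable diagnostic
attend_red, attend_sphere = AttentionPrimitive(), AttentionPrimitive()
feats = torch.randn(1, 128, 14, 14)
mask = torch.ones(1, 1, 14, 14)
mask = attend_sphere(feats, attend_red(feats, mask))
```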

6. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Authors: Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

Affiliations: Australian National University, Microsoft Research, University of Adelaide, Macquarie University

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

Source: arXiv, March 14, 2018

URL:

http://www.zhuanzhi.ai/document/ccf862349b06541be2dc5312a84fc2db
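
The division of labor is easy to sketch: a bottom-up detector (Faster R-CNN in the paper) supplies region features, and a top-down, question-conditioned attention weights them. The dimensions and the scoring network below are assumed, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    # Sketch of the top-down half: precomputed bottom-up region features
    # are weighted by a question-conditioned score and pooled into a
    # single attended image feature.
    def __init__(self, region_dim=2048, q_dim=512, hid=512):
        super().__init__()
        self.proj = nn.Linear(region_dim + q_dim, hid)
        self.score = nn.Linear(hid, 1)

    def forward(self, regions, q):
        # regions: (B, K, region_dim) from a bottom-up detector; q: (B, q_dim)
        K = regions.size(1)
        joint = torch.tanh(self.proj(torch.cat(
            [regions, q.unsqueeze(1).expand(-1, K, -1)], dim=-1)))
        alpha = torch.softmax(self.score(joint).squeeze(-1), dim=-1)  # (B, K)
        return (alpha.unsqueeze(-1) * regions).sum(1)  # attended image feature
```

Because the attention operates over detected objects rather than a uniform grid, the weights align with "the natural basis" of objects and salient regions the abstract describes.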

7. Interpretable Counting for Visual Question Answering

Authors: Alexander Trott, Caiming Xiong, Richard Socher

Affiliation: Salesforce Research

Abstract: Questions that require counting a variety of objects in images remain a major challenge in visual question answering (VQA). The most common approaches to VQA involve either classifying answers based on fixed length representations of both the image and question or summing fractional counts estimated from each section of the image. In contrast, we treat counting as a sequential decision process and force our model to make discrete choices of what to count. Specifically, the model sequentially selects from detected objects and learns interactions between objects that influence subsequent selections. A distinction of our approach is its intuitive and interpretable output, as discrete counts are automatically grounded in the image. Furthermore, our method outperforms the state of the art architecture for VQA on multiple metrics that evaluate counting.

Source: arXiv, March 2, 2018

URL:

http://www.zhuanzhi.ai/document/2efb2987a89a520de3af9a7ae2c01aea
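
The sequential decision process can be sketched as a greedy selection loop over detected objects. The score and interaction inputs, the stopping threshold, and the function name below are all hypothetical placeholders for learned components.

```python
import torch

def count_by_selection(object_scores, interaction, max_steps=10, thresh=0.5):
    # Sketch of counting as a sequential decision process (inputs and the
    # stopping rule are assumptions): greedily pick the highest-scoring
    # detected object, let pairwise interactions re-score the rest, and
    # stop when no score clears the threshold.
    # object_scores: (K,) relevance of each detected object to the question
    # interaction: (K, K) learned score adjustments between object pairs
    scores = object_scores.clone()
    selected = []
    for _ in range(max_steps):
        best = torch.argmax(scores)
        if scores[best] < thresh:
            break                       # discrete choice: stop counting
        selected.append(best.item())    # grounded: this object was counted
        scores = scores + interaction[best]
        scores[best] = float('-inf')    # never count the same object twice
    return len(selected), selected
```

Returning the list of selected objects alongside the count is what makes the output interpretable: each increment is grounded in a specific detection.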

This article is shared from the WeChat public account Zhuanzhi (Quan_Zhuanzhi); author: Zhuanzhi content team.


Originally published: 2018-03-23
